comcrawl-common-crawl/drop_duplicates.py at c39f4284ba8f8e5a1f6d2d37f98a5d03d02fe1b3 - comcrawl-common-crawl - gitea-ailhan-registry

alihan/comcrawl-common-crawl

mirror of https://github.com/michaelharms/comcrawl.git synced 2021-09-27 00:43:48 +03:00


Michael Harms c39f4284ba implementing basic functionality and basic tests

2020-01-10 18:54:08 +01:00

10 lines

250 B

Python


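import comcrawl as cc
# Search the Common Crawl index for captures matching the URL pattern.
results = cc.search("https://index.commoncrawl.org/*")
# Sort by capture timestamp so the oldest capture of each URL comes first.
results = results.sort_values(by="timestamp")
# Keep only the first (oldest) capture per URL.
results = results.drop_duplicates("url", keep="first")
# Download the HTML for each remaining result and store it in a new column.
results["html"] = cc.download(results)
# Write the results, including the downloaded HTML, to a CSV file.
results.to_csv("results.csv")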
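The deduplication step in the script above can be tried offline on a small hand-made table. This is only a sketch: the rows are invented, the helper name `keep_earliest` is hypothetical, and it assumes that `timestamp` holds fixed-width 14-digit strings, so that sorting them as text matches chronological order.

```python
import pandas as pd

# Made-up rows standing in for search results. Only the column names
# "url" and "timestamp" are taken from the script; the values are invented.
results = pd.DataFrame(
    {
        "url": ["a.com/x", "a.com/y", "a.com/x", "a.com/y"],
        "timestamp": [
            "20190301000000",
            "20190105000000",
            "20190102000000",
            "20190410000000",
        ],
    }
)


def keep_earliest(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per URL, keeping the row with the smallest timestamp.

    Hypothetical helper: it applies the same sort_values/drop_duplicates
    pair as the script, just wrapped in a function.
    """
    ordered = df.sort_values(by="timestamp")
    return ordered.drop_duplicates("url", keep="first")


deduped = keep_earliest(results)
print(deduped)
```

Passing `keep="last"` instead would retain the most recent capture of each URL, which may be preferable when the newest version of a page is wanted.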