mirror of https://github.com/fhamborg/news-please.git synced 2021-09-19 22:26:00 +03:00

Add documentation of extractor_cls argument

Frankie Robertson
2021-02-06 16:19:08 +02:00
parent 563de2b7d7
commit 41beb6d80f
2 changed files with 9 additions and 3 deletions

@@ -232,7 +232,8 @@ def __start_commoncrawl_extractor(warc_download_url, callback_on_article_extract
     :param continue_after_error:
     :param show_download_progress:
     :param log_level:
-    :param extractor_cls:
+    :param extractor_cls: A subclass of CommonCrawlExtractor, which can be used
+        to add custom filtering by overriding .filter_record(...)
     :return:
     """
     commoncrawl_extractor = extractor_cls()
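
Below is a minimal sketch (not part of the commit) of the kind of subclass the new docstring describes. The exact signature and return value of filter_record(...) are assumptions based on CommonCrawlExtractor's default implementation, and the URL lookup uses the warcio record API; check newsplease/crawler/commoncrawl_extractor.py for the authoritative definitions.

```python
# Hypothetical example: a CommonCrawlExtractor subclass that adds custom
# filtering, as the documented extractor_cls parameter allows.
from newsplease.crawler.commoncrawl_extractor import CommonCrawlExtractor


class ExampleComOnlyExtractor(CommonCrawlExtractor):
    """Keeps only WARC records whose target URI points at example.com."""

    def filter_record(self, warc_record, article=None):
        # Assumed signature: returns (passes_filter, article).
        url = warc_record.rec_headers.get_header('WARC-Target-URI') or ''
        if 'example.com' not in url:
            return False, article
        # Defer to the base class for its standard filtering.
        return super().filter_record(warc_record, article)
```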

@@ -6,8 +6,13 @@ script stores the extracted articles in JSON files, but this behaviour can be ad
 on_valid_article_extracted. To speed up the crawling and extraction process, the script supports multiprocessing. You can
 control the number of processes with the parameter my_number_of_extraction_processes.
-You can also crawl and extract articles programmatically, i.e., from within your own code, by using the class
-CommonCrawlCrawler provided in newsplease.crawler.commoncrawl_crawler.py
+You can also crawl and extract articles programmatically, i.e., from within
+your own code, by using the class CommonCrawlCrawler or the function
+commoncrawl_crawler.crawl_from_commoncrawl(...) provided in
+newsplease.crawler.commoncrawl_crawler.py. In this case there is also the
+possibility of passing in your own subclass of CommonCrawlExtractor as
+extractor_cls=... . One use case here is that your subclass can customise
+filtering by overriding `.filter_record(...)`.
 In case the script crashes and contains a log message in the beginning that states that only 1 file on AWS storage
 was found, make sure that awscli was correctly installed. You can check that by running aws --version from a terminal.
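
As a hedged usage sketch, the subclass from the earlier example could be passed to crawl_from_commoncrawl(...) via the documented extractor_cls argument. The callback name and keyword arguments other than extractor_cls are assumptions based on the bundled example script and may differ between versions; other arguments the crawler accepts are omitted here.

```python
# Hypothetical wiring-up of the custom extractor; parameter names besides
# extractor_cls follow the example script and are not guaranteed.
from newsplease.crawler import commoncrawl_crawler


def on_valid_article_extracted(article):
    # The example script stores articles as JSON files; print the URL instead.
    print(article.url)


commoncrawl_crawler.crawl_from_commoncrawl(
    callback_on_article_extracted=on_valid_article_extracted,
    extractor_cls=ExampleComOnlyExtractor,  # custom subclass from the sketch above
)
```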