mirror of https://github.com/fhamborg/news-please.git synced 2021-09-19 22:26:00 +03:00

204 Commits

Author SHA1 Message Date
Felix Hamborg
cc0be8e5e2 Update commoncrawl.py 2021-06-27 21:29:05 +02:00
Felix Hamborg
e6cdd0f22b add warc file filtering 2021-04-22 11:03:45 +02:00
Shang Wang
1bea565ca1 Support warc_files_end_date for common crawl crawler. 2021-04-19 00:17:33 -04:00
Felix Hamborg
6232b7d60f fix #206 2021-03-16 12:04:59 +01:00
Felix Hamborg
145cb0b641 add fetch_images option to ccnc script (default=false) 2021-02-24 08:53:24 +01:00
Felix Hamborg
14b9cef89c Merge pull request #203 from mood-mapping-muppets/no-fetch-images-newspaper
Add option of not fetching images using newspaper library and make default for commoncrawl
2021-02-24 08:43:52 +01:00
Felix Hamborg
af1e03531b Merge pull request #201 from mood-mapping-muppets/empty-warc-newspaper
Filter empty responses from WARC to avoid spurious exceptions from `newspaper`
2021-02-16 11:17:23 +01:00
Frankie Robertson
11e6d7748c Add option of not fetching images using newspaper library and make default for commoncrawl 2021-02-10 15:17:32 +02:00
Frankie Robertson
f9da1b7424 Handle publish_date == "" in newspaper_extractor 2021-02-10 15:15:50 +02:00
Frankie Robertson
8456ec4f89 Filter empty responses from WARC to avoid spurious exceptions from newspaper 2021-02-10 15:14:01 +02:00
Frankie Robertson
5ed9b804bf Fallback to utf-8 when document gives unknown encoding 2021-02-08 07:18:05 +02:00
Frankie Robertson
2ade640525 Add option of replacing unicode decode errors in WARC/common crawl extraction 2021-02-08 07:12:18 +02:00
Felix Hamborg
f3e478cde3 Merge pull request #200 from mood-mapping-muppets/allow-none-fully-extracted-log
Check for when log_pathname_fully_extracted_warcs is None and don't log in this case
2021-02-07 14:59:59 +01:00
Frankie Robertson
584facb897 Check for when log_pathname_fully_extracted_warcs is None and don't log in this case 2021-02-06 16:55:10 +02:00
Felix Hamborg
9f7dfef939 Merge pull request #199 from lgov/temp_file_fix
* newsplease/crawler/commoncrawl_crawler.py: use a system temp file
2021-02-06 15:46:21 +01:00
Felix Hamborg
7f986642bc Merge pull request #197 from mood-mapping-muppets/continue-article-lang-detect-exception
Continue detection when LangDetectException in <article>
2021-02-06 15:44:43 +01:00
Felix Hamborg
336c74bc95 Merge pull request #193 from mood-mapping-muppets/programmatic-filter
Make filter_record public to enable subclasses to override
2021-02-06 15:43:48 +01:00
Frankie Robertson
41beb6d80f Add documentation of extractor_cls argument 2021-02-06 16:19:08 +02:00
Lieven Govaerts
ae3172d652 * newsplease/crawler/commoncrawl_crawler.py: use a system temp file instead of one in the current directory, which might not be writable. 2021-02-06 13:21:57 +01:00
Frankie Robertson
292133f9f1 Continue detection when LangDetectException in <article> 2021-02-06 14:20:58 +02:00
Felix Hamborg
f69eb3463e Merge pull request #184 from lgov/filter_warc_files_by_exact_date
Filter commoncrawl warc files after exact timestamp, not only per year+month
2021-02-06 12:14:44 +01:00
Felix Hamborg
c6163f0032 Merge pull request #195 from mood-mapping-muppets/makedir-race-condition
Avoid race condition by using exist_ok=True for makedirs rather than checking for exists first
2021-02-05 15:58:29 +01:00
Felix Hamborg
acdd1377dd Merge pull request #196 from mood-mapping-muppets/longest-article-max-fix
Fix longest article detection langdetect
2021-02-05 15:57:45 +01:00
Frankie Robertson
f4b1efe06b Fix longest article detection langdetect 2021-02-05 15:55:34 +02:00
Frankie Robertson
45bd8e5e41 Avoid race condition by using exist_ok=True for makedirs rather than checking for exists first 2021-02-04 19:04:35 +02:00
Frankie Robertson
1116f1ca7a Fallback to bytes fromstring when lxml unicode fromstring fails 2021-02-02 14:42:47 +02:00
Frankie Robertson
563de2b7d7 Add extractor_cls parameter to crawl_from_commoncrawl 2021-02-02 10:30:50 +02:00
Frankie Robertson
89f51e3b6b Make filter_record public to enable subclasses to override 2021-02-01 19:15:42 +02:00
Lieven Govaerts
1fdaae738f * commoncrawl_crawler.py: After filtering the warc files by year-month, also remove the ones from before warc_files_start_date. 2020-12-04 19:22:02 +01:00
Felix Hamborg
1a40d7bd7f Update sample.json 2020-10-22 21:51:10 +02:00
Felix Hamborg
eee9318860 Update sample.json 2020-10-22 21:50:43 +02:00
Felix Hamborg
76f978dfa7 Update sample.json 2020-10-22 21:49:28 +02:00
Archit Sharma
14854405d0 fixes #170: custom headers for requests 2020-08-03 10:58:04 +05:30
Sebastian Nagel
8a974649bf Verbose logging of exceptions if continue_after_error
- log ignored exceptions more verbosely including exception value and
  stack trace
- try to address #163 (find out reason for the permission error)
2020-07-13 11:28:34 +02:00
Ya-Liang Chang (Allen)
4cc3fb5bab Revert logger to LOGGER 2020-06-30 14:02:26 -07:00
Ya-Liang Chang (Allen)
025f4708f7 Move decode_response() to response_decoder.py 2020-06-29 11:53:10 -07:00
Ya-Liang Chang (Allen)
9dd44fe121 Fix config warning 2020-06-23 12:05:45 -07:00
Ya-Liang Chang (Allen)
b7f4d3b7c4 Change SimpleCrawler._fetch_url to return string instead of bytes to speed up, based on https://github.com/adbar/trafilatura/blob/master/trafilatura/utils.py 2020-06-23 12:05:20 -07:00
Felix Hamborg
44a9f32eee Merge pull request #158 from tbrknt/unused-delete-configuration-in-example
my_delete_warc_after_extraction
2020-05-23 17:55:46 +02:00
Felix Hamborg
06bf917667 Merge pull request #157 from tbrknt/add-windows-compatibility-for-subprocesses
Less OS Dependency
2020-05-23 17:54:59 +02:00
Berkant Kepez
00eaa97ea7 Make use of my_delete_warc_after_extraction configuration in the example. 2020-05-22 00:50:01 +02:00
Berkant Kepez
c6edf464c7 Decrease the dependency on OS regarding Subprocesses in __get_remote_index methods. 2020-05-22 00:48:53 +02:00
Berkant Kepez
60a1ef1059 Add exitcode check for commands executed in __get_remote_input. 2020-05-22 00:03:44 +02:00
Felix Hamborg
07174cc593 Merge pull request #145 from thihara/item_class
Changes to make Scrapy Item class customizable via configuration
2020-05-13 11:41:08 +02:00
Thihara Neranjya
20a90dfea3 Renamed modle_util.py into class_loader.py. More specific error handling for loading custom news item module and class. Fixed a typo. 2020-05-12 08:07:24 +05:30
Felix Hamborg
bc00fac124 Update sample.json 2020-05-05 09:09:25 +02:00
Felix Hamborg
c862976173 Merge pull request #148 from thihara/date_extraction
Fixed broken date extraction due to beautiful soup's tag.text.
2020-04-30 10:20:17 +02:00
Thihara Neranjya
86570b1ad2 Merge branch 'date_extraction' of https://github.com/thihara/news-please into item_class 2020-04-29 20:00:57 +05:30
Thihara Neranjya
de7012cd2f Fixed broken date extraction due to beautiful soup's tag.text. Replaced with tag.string 2020-04-29 19:52:18 +05:30
Thihara Neranjya
411f46698d Merge branch 'master' of https://github.com/thihara/news-please into item_class 2020-04-29 12:54:10 +05:30
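
Commit 45bd8e5e41 in the log above switches `makedirs` to `exist_ok=True` instead of checking for existence first. The following is a minimal sketch of that pattern, not the project's actual code; the helper name `ensure_dir` and the directory names are hypothetical and chosen only for illustration.

```python
import os
import tempfile


def ensure_dir(path):
    """Create path (and any missing parents); do nothing if it already exists."""
    # A check such as `if not os.path.exists(path): os.makedirs(path)` can fail
    # when another process creates the directory between the check and the call.
    # exist_ok=True makes the call itself tolerate an existing directory.
    os.makedirs(path, exist_ok=True)
    return path


if __name__ == "__main__":
    base = tempfile.mkdtemp()  # scratch location for the demo
    target = os.path.join(base, "warc", "downloads")  # hypothetical layout
    ensure_dir(target)
    ensure_dir(target)  # second call succeeds instead of raising FileExistsError
    print(os.path.isdir(target))
```

With `exist_ok=True`, an already existing directory no longer raises `FileExistsError`, so concurrent workers can all call the helper safely. An error is still raised if the path exists but is not a directory.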