mirror of https://github.com/fhamborg/news-please.git synced 2021-09-19 22:26:00 +03:00
Commit Graph

634 Commits

Author SHA1 Message Date
Frankie Robertson
2ade640525 Add option of replacing unicode decode errors in WARC/common crawl extraction 2021-02-08 07:12:18 +02:00
Felix Hamborg
1bbef4a189 Update README.md 2021-02-07 18:18:01 +01:00
Felix Hamborg
f3e478cde3 Merge pull request #200 from mood-mapping-muppets/allow-none-fully-extracted-log
Check for when log_pathname_fully_extracted_warcs is None and don't log in this case
2021-02-07 14:59:59 +01:00
Felix Hamborg
09754829c7 Update README.md 2021-02-06 19:39:23 +01:00
Frankie Robertson
584facb897 Check for when log_pathname_fully_extracted_warcs is None and don't log in this case 2021-02-06 16:55:10 +02:00
Felix Hamborg
9f7dfef939 Merge pull request #199 from lgov/temp_file_fix
* newsplease/crawler/commoncrawl_crawler.py: use a system temp file
2021-02-06 15:46:21 +01:00
Felix Hamborg
7f986642bc Merge pull request #197 from mood-mapping-muppets/continue-article-lang-detect-exception
Continue detection when LangDetectException in <article>
2021-02-06 15:44:43 +01:00
Felix Hamborg
336c74bc95 Merge pull request #193 from mood-mapping-muppets/programmatic-filter
Make filter_record public to enable subclasses to override
2021-02-06 15:43:48 +01:00
Frankie Robertson
41beb6d80f Add documentation of extractor_cls argument 2021-02-06 16:19:08 +02:00
Lieven Govaerts
ae3172d652 * newsplease/crawler/commoncrawl_crawler.py: use a system temp file instead of one in the current directory, which might not be writable. 2021-02-06 13:21:57 +01:00
Frankie Robertson
292133f9f1 Continue detection when LangDetectException in <article> 2021-02-06 14:20:58 +02:00
Felix Hamborg
a5f2fb4bd1 Update setup.py 2021-02-06 12:20:43 +01:00
Felix Hamborg
f69eb3463e Merge pull request #184 from lgov/filter_warc_files_by_exact_date
Filter commoncrawl warc files after exact timestamp, not only per year+month
2021-02-06 12:14:44 +01:00
Felix Hamborg
6d29e8367a Update setup.py 2021-02-05 16:06:54 +01:00
Felix Hamborg
c6163f0032 Merge pull request #195 from mood-mapping-muppets/makedir-race-condition
Avoid race condition by using exist_ok=True for makedirs rather than checking for exists first
2021-02-05 15:58:29 +01:00
Felix Hamborg
acdd1377dd Merge pull request #196 from mood-mapping-muppets/longest-article-max-fix
Fix longest article detection langdetect
2021-02-05 15:57:45 +01:00
Frankie Robertson
f4b1efe06b Fix longest article detection langdetect 2021-02-05 15:55:34 +02:00
Frankie Robertson
45bd8e5e41 Avoid race condition by using exist_ok=True for makedirs rather than checking for exists first 2021-02-04 19:04:35 +02:00
Felix Hamborg
48ab3300ac Update setup.py 2021-02-02 14:02:25 +01:00
Felix Hamborg
6beef8bcf9 Merge pull request #194 from mood-mapping-muppets/robust-fromstring
Fallback to bytes fromstring when lxml unicode fromstring fails
2021-02-02 14:02:04 +01:00
Frankie Robertson
1116f1ca7a Fallback to bytes fromstring when lxml unicode fromstring fails 2021-02-02 14:42:47 +02:00
Frankie Robertson
563de2b7d7 Add extractor_cls parameter to crawl_from_commoncrawl 2021-02-02 10:30:50 +02:00
Frankie Robertson
89f51e3b6b Make filter_record public to enable subclasses to override 2021-02-01 19:15:42 +02:00
Felix Hamborg
afc7b411de Update README.md 2021-01-20 09:30:47 +01:00
Felix Hamborg
b7bc13659b Update README.md 2021-01-14 15:04:32 +01:00
Felix Hamborg
506034941c Update README.md 2021-01-14 15:03:47 +01:00
Felix Hamborg
0b26d71633 Merge pull request #191 from shradhasehgal/req-fix
Fixes #189: cchardet installation
2021-01-03 23:43:53 +01:00
shradhasehgal
87b23e7a4f Requirements cchardet fix 2020-12-29 11:37:28 +05:30
Lieven Govaerts
1fdaae738f * commoncrawl_crawler.py: After filtering the warc files by year-month, also remove the ones from before warc_files_start_date. 2020-12-04 19:22:02 +01:00
Felix Hamborg
6a5d12daac Update README.md 2020-10-23 10:14:51 +02:00
Felix Hamborg
f9eb825584 Update README.md 2020-10-23 10:12:51 +02:00
Felix Hamborg
1a40d7bd7f Update sample.json 2020-10-22 21:51:10 +02:00
Felix Hamborg
eee9318860 Update sample.json 2020-10-22 21:50:43 +02:00
Felix Hamborg
76f978dfa7 Update sample.json 2020-10-22 21:49:28 +02:00
Felix Hamborg
de733476e9 Update README.md 2020-10-16 10:01:35 +02:00
Felix Hamborg
c8253bd852 Update README.md 2020-10-16 10:01:00 +02:00
Felix Hamborg
328b2b9591 Update README.md 2020-09-18 09:20:29 +02:00
Felix Hamborg
534deea5d1 Update README.md 2020-09-18 09:19:33 +02:00
Felix Hamborg
916fd8eeff Update README.md 2020-09-18 09:17:05 +02:00
Felix Hamborg
5d30d8c824 Merge pull request #171 from arcolife/fix_headers
fixes #170: custom headers for requests
2020-08-03 14:11:48 +02:00
Archit Sharma
14854405d0 fixes #170: custom headers for requests 2020-08-03 10:58:04 +05:30
Felix Hamborg
c79e8c3595 Merge pull request #167 from sebastian-nagel/verbose-exception-logging-if-continue
Verbose logging of exceptions if continue_after_error
2020-07-15 19:52:08 +02:00
Sebastian Nagel
8a974649bf Verbose logging of exceptions if continue_after_error
- log ignored exceptions more verbosely including exception value and
  stack trace
- try to address #163 (find out reason for the permission error)
2020-07-13 11:28:34 +02:00
Felix Hamborg
7e16862290 Merge pull request #165 from amjltc295/amjltc295/use-str-html-instead-of-bytes-html-to-speed-up
Amjltc295/use str html instead of bytes html to speed up
2020-07-01 13:50:10 +02:00
Ya-Liang Chang (Allen)
4cc3fb5bab Revert logger to LOGGER 2020-06-30 14:02:26 -07:00
Ya-Liang Chang (Allen)
025f4708f7 Move decode_response() to response_decoder.py 2020-06-29 11:53:10 -07:00
Ya-Liang Chang (Allen)
9dd44fe121 Fix config warning 2020-06-23 12:05:45 -07:00
Ya-Liang Chang (Allen)
14d9bbc11b Update requirements 2020-06-23 12:05:30 -07:00
Ya-Liang Chang (Allen)
b7f4d3b7c4 Change SimpleCrawler._fetch_url to return string instead of bytes to speed up, based on https://github.com/adbar/trafilatura/blob/master/trafilatura/utils.py 2020-06-23 12:05:20 -07:00
Felix Hamborg
1639c9f3bd Update bug_report.md 2020-05-31 17:44:28 +02:00
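
Note on commit 45bd8e5e41 ("Avoid race condition by using exist_ok=True for makedirs rather than checking for exists first"): the sketch below illustrates the general pattern that message describes. It is not the project's actual code; the function names `ensure_dir_racy` and `ensure_dir` and the directory names are hypothetical and used only for illustration.

```python
import os
import tempfile


def ensure_dir_racy(path):
    """Check-then-create: another process can create `path` between the
    exists() check and makedirs(), which then raises FileExistsError."""
    if not os.path.exists(path):
        os.makedirs(path)


def ensure_dir(path):
    """Single call: makedirs() does not raise if the directory already exists."""
    os.makedirs(path, exist_ok=True)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as base:
        # Hypothetical target directory, for illustration only.
        target = os.path.join(base, "warc", "downloads")
        ensure_dir(target)
        ensure_dir(target)  # a second call is harmless
        print(os.path.isdir(target))
```

With `exist_ok=True` there is no separate existence check for a concurrent worker to slip in behind. `os.makedirs` still raises `FileExistsError` if the final path exists but is not a directory.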