Frankie Robertson
|
2ade640525
|
Add option of replacing unicode decode errors in WARC/common crawl extraction
|
2021-02-08 07:12:18 +02:00 |
|
Felix Hamborg
|
1bbef4a189
|
Update README.md
|
2021-02-07 18:18:01 +01:00 |
|
Felix Hamborg
|
f3e478cde3
|
Merge pull request #200 from mood-mapping-muppets/allow-none-fully-extracted-log
Check for when log_pathname_fully_extracted_warcs is None and don't log in this case
|
2021-02-07 14:59:59 +01:00 |
|
Felix Hamborg
|
09754829c7
|
Update README.md
|
2021-02-06 19:39:23 +01:00 |
|
Frankie Robertson
|
584facb897
|
Check for when log_pathname_fully_extracted_warcs is None and don't log in this case
|
2021-02-06 16:55:10 +02:00 |
|
Felix Hamborg
|
9f7dfef939
|
Merge pull request #199 from lgov/temp_file_fix
* newsplease/crawler/commoncrawl_crawler.py: use a system temp file
|
2021-02-06 15:46:21 +01:00 |
|
Felix Hamborg
|
7f986642bc
|
Merge pull request #197 from mood-mapping-muppets/continue-article-lang-detect-exception
Continue detection when LangDetectException in <article>
|
2021-02-06 15:44:43 +01:00 |
|
Felix Hamborg
|
336c74bc95
|
Merge pull request #193 from mood-mapping-muppets/programmatic-filter
Make filter_record public to enable subclasses to override
|
2021-02-06 15:43:48 +01:00 |
|
Frankie Robertson
|
41beb6d80f
|
Add documentation of extractor_cls argument
|
2021-02-06 16:19:08 +02:00 |
|
Lieven Govaerts
|
ae3172d652
|
* newsplease/crawler/commoncrawl_crawler.py: use a system temp file instead of one in the current directory, which might not be writable.
|
2021-02-06 13:21:57 +01:00 |
|
Frankie Robertson
|
292133f9f1
|
Continue detection when LangDetectException in <article>
|
2021-02-06 14:20:58 +02:00 |
|
Felix Hamborg
|
a5f2fb4bd1
|
Update setup.py
|
2021-02-06 12:20:43 +01:00 |
|
Felix Hamborg
|
f69eb3463e
|
Merge pull request #184 from lgov/filter_warc_files_by_exact_date
Filter commoncrawl warc files after exact timestamp, not only per year+month
|
2021-02-06 12:14:44 +01:00 |
|
Felix Hamborg
|
6d29e8367a
|
Update setup.py
|
2021-02-05 16:06:54 +01:00 |
|
Felix Hamborg
|
c6163f0032
|
Merge pull request #195 from mood-mapping-muppets/makedir-race-condition
Avoid race condition by using exist_ok=True for makedirs rather than checking for exists first
|
2021-02-05 15:58:29 +01:00 |
|
Felix Hamborg
|
acdd1377dd
|
Merge pull request #196 from mood-mapping-muppets/longest-article-max-fix
Fix longest article detection langdetect
|
2021-02-05 15:57:45 +01:00 |
|
Frankie Robertson
|
f4b1efe06b
|
Fix longest article detection langdetect
|
2021-02-05 15:55:34 +02:00 |
|
Frankie Robertson
|
45bd8e5e41
|
Avoid race condition by using exist_ok=True for makedirs rather than checking for exists first
|
2021-02-04 19:04:35 +02:00 |
|
Felix Hamborg
|
48ab3300ac
|
Update setup.py
|
2021-02-02 14:02:25 +01:00 |
|
Felix Hamborg
|
6beef8bcf9
|
Merge pull request #194 from mood-mapping-muppets/robust-fromstring
Fallback to bytes fromstring when lxml unicode fromstring fails
|
2021-02-02 14:02:04 +01:00 |
|
Frankie Robertson
|
1116f1ca7a
|
Fallback to bytes fromstring when lxml unicode fromstring fails
|
2021-02-02 14:42:47 +02:00 |
|
Frankie Robertson
|
563de2b7d7
|
Add extractor_cls parameter to crawl_from_commoncrawl
|
2021-02-02 10:30:50 +02:00 |
|
Frankie Robertson
|
89f51e3b6b
|
Make filter_record public to enable subclasses to override
|
2021-02-01 19:15:42 +02:00 |
|
Felix Hamborg
|
afc7b411de
|
Update README.md
|
2021-01-20 09:30:47 +01:00 |
|
Felix Hamborg
|
b7bc13659b
|
Update README.md
|
2021-01-14 15:04:32 +01:00 |
|
Felix Hamborg
|
506034941c
|
Update README.md
|
2021-01-14 15:03:47 +01:00 |
|
Felix Hamborg
|
0b26d71633
|
Merge pull request #191 from shradhasehgal/req-fix
Fixes #189: cchardet installation
|
2021-01-03 23:43:53 +01:00 |
|
shradhasehgal
|
87b23e7a4f
|
Requirements cchardet fix
|
2020-12-29 11:37:28 +05:30 |
|
Lieven Govaerts
|
1fdaae738f
|
* commoncrawl_crawler.py: After filtering the warc files by year-month, also remove the ones from before warc_files_start_date.
|
2020-12-04 19:22:02 +01:00 |
|
Felix Hamborg
|
6a5d12daac
|
Update README.md
|
2020-10-23 10:14:51 +02:00 |
|
Felix Hamborg
|
f9eb825584
|
Update README.md
|
2020-10-23 10:12:51 +02:00 |
|
Felix Hamborg
|
1a40d7bd7f
|
Update sample.json
|
2020-10-22 21:51:10 +02:00 |
|
Felix Hamborg
|
eee9318860
|
Update sample.json
|
2020-10-22 21:50:43 +02:00 |
|
Felix Hamborg
|
76f978dfa7
|
Update sample.json
|
2020-10-22 21:49:28 +02:00 |
|
Felix Hamborg
|
de733476e9
|
Update README.md
|
2020-10-16 10:01:35 +02:00 |
|
Felix Hamborg
|
c8253bd852
|
Update README.md
|
2020-10-16 10:01:00 +02:00 |
|
Felix Hamborg
|
328b2b9591
|
Update README.md
|
2020-09-18 09:20:29 +02:00 |
|
Felix Hamborg
|
534deea5d1
|
Update README.md
|
2020-09-18 09:19:33 +02:00 |
|
Felix Hamborg
|
916fd8eeff
|
Update README.md
|
2020-09-18 09:17:05 +02:00 |
|
Felix Hamborg
|
5d30d8c824
|
Merge pull request #171 from arcolife/fix_headers
fixes #170: custom headers for requests
|
2020-08-03 14:11:48 +02:00 |
|
Archit Sharma
|
14854405d0
|
fixes #170: custom headers for requests
|
2020-08-03 10:58:04 +05:30 |
|
Felix Hamborg
|
c79e8c3595
|
Merge pull request #167 from sebastian-nagel/verbose-exception-logging-if-continue
Verbose logging of exceptions if continue_after_error
|
2020-07-15 19:52:08 +02:00 |
|
Sebastian Nagel
|
8a974649bf
|
Verbose logging of exceptions if continue_after_error
- log ignored exceptions more verbosely including exception value and stack trace
- try to address #163 (find out reason for the permission error)
|
2020-07-13 11:28:34 +02:00 |
|
Felix Hamborg
|
7e16862290
|
Merge pull request #165 from amjltc295/amjltc295/use-str-html-instead-of-bytes-html-to-speed-up
Amjltc295/use str html instead of bytes html to speed up
|
2020-07-01 13:50:10 +02:00 |
|
Ya-Liang Chang (Allen)
|
4cc3fb5bab
|
Revert logger to LOGGER
|
2020-06-30 14:02:26 -07:00 |
|
Ya-Liang Chang (Allen)
|
025f4708f7
|
Move decode_response() to response_decoder.py
|
2020-06-29 11:53:10 -07:00 |
|
Ya-Liang Chang (Allen)
|
9dd44fe121
|
Fix config warning
|
2020-06-23 12:05:45 -07:00 |
|
Ya-Liang Chang (Allen)
|
14d9bbc11b
|
Update requirements
|
2020-06-23 12:05:30 -07:00 |
|
Ya-Liang Chang (Allen)
|
b7f4d3b7c4
|
Change SimpleCrawler._fetch_url to return string instead of bytes to speed up, based on https://github.com/adbar/trafilatura/blob/master/trafilatura/utils.py
|
2020-06-23 12:05:20 -07:00 |
|
Felix Hamborg
|
1639c9f3bd
|
Update bug_report.md
|
2020-05-31 17:44:28 +02:00 |
|