mirror of https://github.com/fhamborg/news-please.git synced 2021-09-19 22:26:00 +03:00

Commit Graph

  • 3c25624706 Update README.md (branch: master) Felix Hamborg 2021-08-14 10:17:52 +02:00
  • 8490b6dd14 Update README.md Felix Hamborg 2021-08-14 10:15:42 +02:00
  • 7e11b1c915 Update bug_report.md Felix Hamborg 2021-08-02 15:06:12 +02:00
  • cc0be8e5e2 Update commoncrawl.py Felix Hamborg 2021-06-27 21:29:05 +02:00
  • 1e94327a3b Update README.md Felix Hamborg 2021-06-19 21:16:50 +02:00
  • 520c975926 add cchardet Felix Hamborg 2021-05-06 10:01:52 +02:00
  • e6cdd0f22b add warc file filtering Felix Hamborg 2021-04-22 11:03:45 +02:00
  • cf7d78ef15 Merge pull request #210 from shangw-nvidia/shangw-nvidia/warc_files_end_date Felix Hamborg 2021-04-22 10:55:31 +02:00
  • 1bea565ca1 Support warc_files_end_date for common crawl crawler. Shang Wang 2021-04-19 00:17:33 -04:00
  • 44a196f1eb Update README.md Felix Hamborg 2021-04-08 13:53:32 +02:00
  • 84ec12a089 Update README.md Felix Hamborg 2021-03-26 14:23:59 +01:00
  • 6ef48710d3 Update README.md Felix Hamborg 2021-03-26 14:23:22 +01:00
  • 0adfb8913b Update README.md Felix Hamborg 2021-03-23 13:39:41 +01:00
  • 6232b7d60f fix #206 Felix Hamborg 2021-03-16 12:04:59 +01:00
  • 961948477d inc version Felix Hamborg 2021-02-24 08:53:58 +01:00
  • 145cb0b641 add fetch_images option to ccnc script (default=false) Felix Hamborg 2021-02-24 08:53:24 +01:00
  • 14b9cef89c Merge pull request #203 from mood-mapping-muppets/no-fetch-images-newspaper Felix Hamborg 2021-02-24 08:43:52 +01:00
  • af1e03531b Merge pull request #201 from mood-mapping-muppets/empty-warc-newspaper Felix Hamborg 2021-02-16 11:17:23 +01:00
  • 0b53affe20 Merge pull request #202 from mood-mapping-muppets/empty-publish-date-newspaper-extractor Felix Hamborg 2021-02-11 16:32:54 +01:00
  • 11e6d7748c Add option of not fetching images using newspaper library and make default for commoncrawl Frankie Robertson 2021-02-08 10:41:28 +02:00
  • f9da1b7424 Handle publish_date == "" in newspaper_extractor Frankie Robertson 2021-02-08 10:04:10 +02:00
  • 8456ec4f89 Filter empty responses from WARC to avoid spurious exceptions from newspaper Frankie Robertson 2021-02-08 09:42:57 +02:00
  • d3c3eed26b Merge pull request #198 from mood-mapping-muppets/robust-unicode-warc Felix Hamborg 2021-02-08 16:20:04 +01:00
  • 5ed9b804bf Fallback to utf-8 when document gives unknown encoding Frankie Robertson 2021-02-08 07:18:05 +02:00
  • 2ade640525 Add option of replacing unicode decode errors in WARC/common crawl extraction Frankie Robertson 2021-02-06 12:58:44 +02:00
  • 1bbef4a189 Update README.md Felix Hamborg 2021-02-07 18:18:01 +01:00
  • f3e478cde3 Merge pull request #200 from mood-mapping-muppets/allow-none-fully-extracted-log Felix Hamborg 2021-02-07 14:59:59 +01:00
  • 09754829c7 Update README.md Felix Hamborg 2021-02-06 19:39:23 +01:00
  • 584facb897 Check for when log_pathname_fully_extracted_warcs is None and don't log in this case Frankie Robertson 2021-02-06 16:55:10 +02:00
  • 9f7dfef939 Merge pull request #199 from lgov/temp_file_fix Felix Hamborg 2021-02-06 15:46:21 +01:00
  • 7f986642bc Merge pull request #197 from mood-mapping-muppets/continue-article-lang-detect-exception Felix Hamborg 2021-02-06 15:44:43 +01:00
  • 336c74bc95 Merge pull request #193 from mood-mapping-muppets/programmatic-filter Felix Hamborg 2021-02-06 15:43:48 +01:00
  • 41beb6d80f Add documentation of extractor_cls argument Frankie Robertson 2021-02-06 16:19:08 +02:00
  • ae3172d652 * newsplease/crawler/commoncrawl_crawler.py: use a system temp file instead of one in the current directory, which might not be writable. Lieven Govaerts 2021-02-06 13:21:57 +01:00
  • 292133f9f1 Continue detection when LangDetectException in <article> Frankie Robertson 2021-02-06 12:58:03 +02:00
  • a5f2fb4bd1 Update setup.py Felix Hamborg 2021-02-06 12:20:43 +01:00
  • f69eb3463e Merge pull request #184 from lgov/filter_warc_files_by_exact_date Felix Hamborg 2021-02-06 12:14:44 +01:00
  • 6d29e8367a Update setup.py Felix Hamborg 2021-02-05 16:06:54 +01:00
  • c6163f0032 Merge pull request #195 from mood-mapping-muppets/makedir-race-condition Felix Hamborg 2021-02-05 15:58:29 +01:00
  • acdd1377dd Merge pull request #196 from mood-mapping-muppets/longest-article-max-fix Felix Hamborg 2021-02-05 15:57:45 +01:00
  • f4b1efe06b Fix longest article detection langdetect Frankie Robertson 2021-02-05 15:54:15 +02:00
  • 45bd8e5e41 Avoid race condition by using exist_ok=True for makedirs rather than checking for exists first Frankie Robertson 2021-02-04 19:04:35 +02:00
  • 48ab3300ac Update setup.py Felix Hamborg 2021-02-02 14:02:25 +01:00
  • 6beef8bcf9 Merge pull request #194 from mood-mapping-muppets/robust-fromstring Felix Hamborg 2021-02-02 14:02:04 +01:00
  • 1116f1ca7a Fallback to bytes fromstring when lxml unicode fromstring fails Frankie Robertson 2021-02-02 14:42:47 +02:00
  • 563de2b7d7 Add extractor_cls parameter to crawl_from_commoncrawl Frankie Robertson 2021-02-02 10:30:50 +02:00
  • 89f51e3b6b Make filter_record public to enable subclasses to override Frankie Robertson 2021-02-01 19:15:42 +02:00
  • afc7b411de Update README.md Felix Hamborg 2021-01-20 09:30:47 +01:00
  • b7bc13659b Update README.md Felix Hamborg 2021-01-14 15:04:32 +01:00
  • 506034941c Update README.md Felix Hamborg 2021-01-14 15:03:47 +01:00
  • 0b26d71633 Merge pull request #191 from shradhasehgal/req-fix Felix Hamborg 2021-01-03 23:43:53 +01:00
  • 87b23e7a4f Requirements cchardet fix shradhasehgal 2020-12-29 11:37:28 +05:30
  • 1fdaae738f * commoncrawl_crawler.py: After filtering the warc files by year-month, also remove the ones from before warc_files_start_date. Lieven Govaerts 2020-12-04 19:22:02 +01:00
  • 6a5d12daac Update README.md Felix Hamborg 2020-10-23 10:14:51 +02:00
  • f9eb825584 Update README.md Felix Hamborg 2020-10-23 10:12:51 +02:00
  • 1a40d7bd7f Update sample.json Felix Hamborg 2020-10-22 21:51:10 +02:00
  • eee9318860 Update sample.json Felix Hamborg 2020-10-22 21:50:43 +02:00
  • 76f978dfa7 Update sample.json Felix Hamborg 2020-10-22 21:49:28 +02:00
  • de733476e9 Update README.md Felix Hamborg 2020-10-16 10:01:35 +02:00
  • c8253bd852 Update README.md Felix Hamborg 2020-10-16 10:01:00 +02:00
  • 328b2b9591 Update README.md Felix Hamborg 2020-09-18 09:20:29 +02:00
  • 534deea5d1 Update README.md Felix Hamborg 2020-09-18 09:19:33 +02:00
  • 916fd8eeff Update README.md Felix Hamborg 2020-09-18 09:17:05 +02:00
  • 5d30d8c824 Merge pull request #171 from arcolife/fix_headers Felix Hamborg 2020-08-03 14:11:48 +02:00
  • 14854405d0 fixes #170: custom headers for requests Archit Sharma 2020-08-03 10:38:44 +05:30
  • c79e8c3595 Merge pull request #167 from sebastian-nagel/verbose-exception-logging-if-continue Felix Hamborg 2020-07-15 19:52:08 +02:00
  • 8a974649bf Verbose logging of exceptions if continue_after_error - log ignored exceptions more verbosely including exception value and stack trace - try to address #163 (find out reason for the permission error) Sebastian Nagel 2020-07-13 11:28:34 +02:00
  • 7e16862290 Merge pull request #165 from amjltc295/amjltc295/use-str-html-instead-of-bytes-html-to-speed-up Felix Hamborg 2020-07-01 13:50:10 +02:00
  • 4cc3fb5bab Revert logger to LOGGER Ya-Liang Chang (Allen) 2020-06-30 14:01:19 -07:00
  • 025f4708f7 Move decode_response() to response_decoder.py Ya-Liang Chang (Allen) 2020-06-29 11:53:10 -07:00
  • 9dd44fe121 Fix config warning Ya-Liang Chang (Allen) 2020-06-23 12:05:45 -07:00
  • 14d9bbc11b Update requirements Ya-Liang Chang (Allen) 2020-06-23 12:05:30 -07:00
  • b7f4d3b7c4 Change SimpleCrawler._fetch_url to return string instead of bytes to speed up, based on https://github.com/adbar/trafilatura/blob/master/trafilatura/utils.py Ya-Liang Chang (Allen) 2020-06-23 12:05:20 -07:00
  • 1639c9f3bd Update bug_report.md Felix Hamborg 2020-05-31 17:44:28 +02:00
  • 3a2c9cad1a Update bug_report.md Felix Hamborg 2020-05-31 17:43:31 +02:00
  • 3e31f106a1 Merge pull request #160 from petlack/fix/159/missing-hurry-filesize Felix Hamborg 2020-05-30 14:20:10 +02:00
  • ef7649917e #159 Add missing hurry.filesize to requirements.txt peter@output.sk 2020-05-28 01:19:01 +03:00
  • 44a9f32eee Merge pull request #158 from tbrknt/unused-delete-configuration-in-example Felix Hamborg 2020-05-23 17:55:46 +02:00
  • 06bf917667 Merge pull request #157 from tbrknt/add-windows-compatibility-for-subprocesses Felix Hamborg 2020-05-23 17:54:59 +02:00
  • e6707c3985 Merge pull request #156 from tbrknt/subprocess-exitcode-check Felix Hamborg 2020-05-23 17:52:17 +02:00
  • 00eaa97ea7 Make use of my_delete_warc_after_extraction configuration in the example. Berkant Kepez 2020-05-22 00:46:55 +02:00
  • c6edf464c7 Decrease the dependency on OS regarding Subprocesses in __get_remote_index methods. Berkant Kepez 2020-05-22 00:23:42 +02:00
  • 60a1ef1059 Add exitcode check for commands executed in __get_remote_input. Berkant Kepez 2020-05-22 00:03:44 +02:00
  • c93bf81a3e make scrapy pipeline item configurable Felix Hamborg 2020-05-13 11:43:05 +02:00
  • 07174cc593 Merge pull request #145 from thihara/item_class Felix Hamborg 2020-05-13 11:41:08 +02:00
  • 20a90dfea3 Renamed modle_util.py into class_loader.py. More specific error handling for loading custom news item module and class. Fixed a typo. Thihara Neranjya 2020-05-12 08:07:24 +05:30
  • bc00fac124 Update sample.json Felix Hamborg 2020-05-05 09:09:25 +02:00
  • d24b3bd82f Merge branch 'master' of github.com:fhamborg/news-please Felix Hamborg 2020-04-30 10:21:24 +02:00
  • beef89c21a incr Felix Hamborg 2020-04-30 10:21:13 +02:00
  • c862976173 Merge pull request #148 from thihara/date_extraction Felix Hamborg 2020-04-30 10:20:17 +02:00
  • 86570b1ad2 Merge branch 'date_extraction' of https://github.com/thihara/news-please into item_class Thihara Neranjya 2020-04-29 20:00:57 +05:30
  • de7012cd2f Fixed broken date extraction due to beautiful soup's tag.text. Replaced with tag.string Thihara Neranjya 2020-04-29 19:52:18 +05:30
  • 411f46698d Merge branch 'master' of https://github.com/thihara/news-please into item_class Thihara Neranjya 2020-04-29 12:54:10 +05:30
  • 858758aec9 Fixed incorrect variable reference Thihara Neranjya 2020-04-29 12:39:25 +05:30
  • 2d23efb493 Removed f-strings and used format method. Added the new config parameter into config_lib.cfg file Thihara Neranjya 2020-04-28 14:04:31 +05:30
  • 57fbd8508a Update setup.py Felix Hamborg 2020-04-28 10:26:31 +02:00
  • 61ebcd9ae4 Merge pull request #146 from moyid/add-warc-date-filter-to-commoncrawl Felix Hamborg 2020-04-27 22:40:06 +02:00
  • 3eb23e775b revisions to adding commoncrawl warc date filter 默奕 2020-04-27 13:26:47 -07:00
  • 6e1016e92a revisions to adding commoncrawl warc date filter 默奕 2020-04-27 13:13:50 -07:00
  • ae56026328 add date filter for commoncrawl warc files 默奕 2020-04-25 13:13:21 -07:00
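
Two of the fixes listed above describe general Python techniques: commit 45bd8e5e41 (makedirs with exist_ok=True instead of an exists check) and commits 5ed9b804bf / 2ade640525 (UTF-8 fallback for unknown encodings, optional replacement of decode errors). The following is a minimal sketch of those techniques only, not the code from the repository; the function names ensure_dir and decode_with_fallback and their parameters are hypothetical and chosen for illustration.

```python
import codecs
import os


def ensure_dir(path):
    """Create a directory tree if it does not exist yet (hypothetical helper)."""
    # A check like `if not os.path.exists(path): os.makedirs(path)` can fail
    # when another process creates the directory between the check and the
    # call. exist_ok=True folds both steps into one call that tolerates an
    # already existing directory.
    os.makedirs(path, exist_ok=True)
    return path


def decode_with_fallback(raw, declared_encoding, errors="strict"):
    """Decode bytes, using UTF-8 when the declared encoding is unusable.

    Hypothetical helper: `declared_encoding` stands for whatever charset
    the document claims; `errors` is passed through to bytes.decode().
    """
    encoding = "utf-8"
    if declared_encoding:
        try:
            # codecs.lookup raises LookupError for names Python does not know.
            codecs.lookup(declared_encoding)
            encoding = declared_encoding
        except LookupError:
            pass  # keep the UTF-8 default
    # errors="replace" substitutes U+FFFD for undecodable bytes instead of
    # raising UnicodeDecodeError.
    return raw.decode(encoding, errors=errors)
```

Calling ensure_dir twice on the same path is harmless, and decode_with_fallback(data, "x-unknown", errors="replace") decodes as UTF-8 without raising on malformed bytes. Whether the repository applies these ideas in exactly this way is not shown in the commit list.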