mirror of https://github.com/fhamborg/news-please.git synced 2021-09-19 22:26:00 +03:00

658 Commits

Author SHA1 Message Date
Felix Hamborg
0b26d71633 Merge pull request #191 from shradhasehgal/req-fix
Fixes #189: cchardet installation
2021-01-03 23:43:53 +01:00
shradhasehgal
87b23e7a4f Requirements cchardet fix 2020-12-29 11:37:28 +05:30
Lieven Govaerts
1fdaae738f * commoncrawl_crawler.py: After filtering the warc files by year-month, also remove the ones from before warc_files_start_date. 2020-12-04 19:22:02 +01:00
Felix Hamborg
6a5d12daac Update README.md 2020-10-23 10:14:51 +02:00
Felix Hamborg
f9eb825584 Update README.md 2020-10-23 10:12:51 +02:00
Felix Hamborg
1a40d7bd7f Update sample.json 2020-10-22 21:51:10 +02:00
Felix Hamborg
eee9318860 Update sample.json 2020-10-22 21:50:43 +02:00
Felix Hamborg
76f978dfa7 Update sample.json 2020-10-22 21:49:28 +02:00
Felix Hamborg
de733476e9 Update README.md 2020-10-16 10:01:35 +02:00
Felix Hamborg
c8253bd852 Update README.md 2020-10-16 10:01:00 +02:00
Felix Hamborg
328b2b9591 Update README.md 2020-09-18 09:20:29 +02:00
Felix Hamborg
534deea5d1 Update README.md 2020-09-18 09:19:33 +02:00
Felix Hamborg
916fd8eeff Update README.md 2020-09-18 09:17:05 +02:00
Felix Hamborg
5d30d8c824 Merge pull request #171 from arcolife/fix_headers
fixes #170: custom headers for requests
2020-08-03 14:11:48 +02:00
Archit Sharma
14854405d0 fixes #170: custom headers for requests 2020-08-03 10:58:04 +05:30
Felix Hamborg
c79e8c3595 Merge pull request #167 from sebastian-nagel/verbose-exception-logging-if-continue
Verbose logging of exceptions if continue_after_error
2020-07-15 19:52:08 +02:00
Sebastian Nagel
8a974649bf Verbose logging of exceptions if continue_after_error
- log ignored exceptions more verbosely including exception value and
  stack trace
- try to address #163 (find out reason for the permission error)
2020-07-13 11:28:34 +02:00
Felix Hamborg
7e16862290 Merge pull request #165 from amjltc295/amjltc295/use-str-html-instead-of-bytes-html-to-speed-up
Amjltc295/use str html instead of bytes html to speed up
2020-07-01 13:50:10 +02:00
Ya-Liang Chang (Allen)
4cc3fb5bab Revert logger to LOGGER 2020-06-30 14:02:26 -07:00
Ya-Liang Chang (Allen)
025f4708f7 Move decode_response() to response_decoder.py 2020-06-29 11:53:10 -07:00
Ya-Liang Chang (Allen)
9dd44fe121 Fix config warning 2020-06-23 12:05:45 -07:00
Ya-Liang Chang (Allen)
14d9bbc11b Update requirements 2020-06-23 12:05:30 -07:00
Ya-Liang Chang (Allen)
b7f4d3b7c4 Change SimpleCrawler._fetch_url to return string instead of bytes to speed up, based on https://github.com/adbar/trafilatura/blob/master/trafilatura/utils.py 2020-06-23 12:05:20 -07:00
Felix Hamborg
1639c9f3bd Update bug_report.md 2020-05-31 17:44:28 +02:00
Felix Hamborg
3a2c9cad1a Update bug_report.md 2020-05-31 17:43:31 +02:00
Felix Hamborg
3e31f106a1 Merge pull request #160 from petlack/fix/159/missing-hurry-filesize
#159 Add missing hurry.filesize to requirements.txt
2020-05-30 14:20:10 +02:00
peter@output.sk
ef7649917e #159 Add missing hurry.filesize to requirements.txt 2020-05-28 01:19:01 +03:00
Felix Hamborg
44a9f32eee Merge pull request #158 from tbrknt/unused-delete-configuration-in-example
my_delete_warc_after_extraction
2020-05-23 17:55:46 +02:00
Felix Hamborg
06bf917667 Merge pull request #157 from tbrknt/add-windows-compatibility-for-subprocesses
Less OS Dependency
2020-05-23 17:54:59 +02:00
Felix Hamborg
e6707c3985 Merge pull request #156 from tbrknt/subprocess-exitcode-check
Add exitcode check for Subprocesses
2020-05-23 17:52:17 +02:00
Berkant Kepez
00eaa97ea7 Make use of my_delete_warc_after_extraction configuration in the example. 2020-05-22 00:50:01 +02:00
Berkant Kepez
c6edf464c7 Decrease the dependency on OS regarding Subprocesses in __get_remote_index methods. 2020-05-22 00:48:53 +02:00
Berkant Kepez
60a1ef1059 Add exitcode check for commands executed in __get_remote_input. 2020-05-22 00:03:44 +02:00
Felix Hamborg
c93bf81a3e make scrapy pipeline item configurable 2020-05-13 11:43:05 +02:00
Felix Hamborg
07174cc593 Merge pull request #145 from thihara/item_class
Changes to make Scrapy Item class customizable via configuration
2020-05-13 11:41:08 +02:00
Thihara Neranjya
20a90dfea3 Renamed modle_util.py into class_loader.py. More specific error handling for loading custom news item module and class. Fixed a typo. 2020-05-12 08:07:24 +05:30
Felix Hamborg
bc00fac124 Update sample.json 2020-05-05 09:09:25 +02:00
Felix Hamborg
d24b3bd82f Merge branch 'master' of github.com:fhamborg/news-please 2020-04-30 10:21:24 +02:00
Felix Hamborg
beef89c21a incr 2020-04-30 10:21:13 +02:00
Felix Hamborg
c862976173 Merge pull request #148 from thihara/date_extraction
Fixed broken date extraction due to beautiful soup's tag.text.
2020-04-30 10:20:17 +02:00
Thihara Neranjya
86570b1ad2 Merge branch 'date_extraction' of https://github.com/thihara/news-please into item_class 2020-04-29 20:00:57 +05:30
Thihara Neranjya
de7012cd2f Fixed broken date extraction due to beautiful soup's tag.text. Replaced with tag.string 2020-04-29 19:52:18 +05:30
Thihara Neranjya
411f46698d Merge branch 'master' of https://github.com/thihara/news-please into item_class 2020-04-29 12:54:10 +05:30
Thihara Neranjya
858758aec9 Fixed incorrect variable reference 2020-04-29 12:39:25 +05:30
Thihara Neranjya
2d23efb493 Removed f-strings and used format method. Added the new config parameter into config_lib.cfg file 2020-04-28 14:04:31 +05:30
Felix Hamborg
57fbd8508a Update setup.py 2020-04-28 10:26:31 +02:00
Felix Hamborg
61ebcd9ae4 Merge pull request #146 from moyid/add-warc-date-filter-to-commoncrawl
add date filter for commoncrawl warc files
2020-04-27 22:40:06 +02:00
默奕
3eb23e775b revisions to adding commoncrawl warc date filter 2020-04-27 13:26:47 -07:00
默奕
6e1016e92a revisions to adding commoncrawl warc date filter 2020-04-27 13:13:50 -07:00
默奕
ae56026328 add date filter for commoncrawl warc files 2020-04-25 13:13:21 -07:00