| Felix Hamborg | 0b26d71633 | Merge pull request #191 from shradhasehgal/req-fix: Fixes #189: cchardet installation | 2021-01-03 23:43:53 +01:00 |
| shradhasehgal | 87b23e7a4f | Requirements cchardet fix | 2020-12-29 11:37:28 +05:30 |
| Lieven Govaerts | 1fdaae738f | * commoncrawl_crawler.py: After filtering the warc files by year-month, also remove the ones from before warc_files_start_date. | 2020-12-04 19:22:02 +01:00 |
| Felix Hamborg | 6a5d12daac | Update README.md | 2020-10-23 10:14:51 +02:00 |
| Felix Hamborg | f9eb825584 | Update README.md | 2020-10-23 10:12:51 +02:00 |
| Felix Hamborg | 1a40d7bd7f | Update sample.json | 2020-10-22 21:51:10 +02:00 |
| Felix Hamborg | eee9318860 | Update sample.json | 2020-10-22 21:50:43 +02:00 |
| Felix Hamborg | 76f978dfa7 | Update sample.json | 2020-10-22 21:49:28 +02:00 |
| Felix Hamborg | de733476e9 | Update README.md | 2020-10-16 10:01:35 +02:00 |
| Felix Hamborg | c8253bd852 | Update README.md | 2020-10-16 10:01:00 +02:00 |
| Felix Hamborg | 328b2b9591 | Update README.md | 2020-09-18 09:20:29 +02:00 |
| Felix Hamborg | 534deea5d1 | Update README.md | 2020-09-18 09:19:33 +02:00 |
| Felix Hamborg | 916fd8eeff | Update README.md | 2020-09-18 09:17:05 +02:00 |
| Felix Hamborg | 5d30d8c824 | Merge pull request #171 from arcolife/fix_headers: fixes #170: custom headers for requests | 2020-08-03 14:11:48 +02:00 |
| Archit Sharma | 14854405d0 | fixes #170: custom headers for requests | 2020-08-03 10:58:04 +05:30 |
| Felix Hamborg | c79e8c3595 | Merge pull request #167 from sebastian-nagel/verbose-exception-logging-if-continue: Verbose logging of exceptions if continue_after_error | 2020-07-15 19:52:08 +02:00 |
| Sebastian Nagel | 8a974649bf | Verbose logging of exceptions if continue_after_error: log ignored exceptions more verbosely including exception value and stack trace; try to address #163 (find out reason for the permission error) | 2020-07-13 11:28:34 +02:00 |
| Felix Hamborg | 7e16862290 | Merge pull request #165 from amjltc295/amjltc295/use-str-html-instead-of-bytes-html-to-speed-up: Amjltc295/use str html instead of bytes html to speed up | 2020-07-01 13:50:10 +02:00 |
| Ya-Liang Chang (Allen) | 4cc3fb5bab | Revert logger to LOGGER | 2020-06-30 14:02:26 -07:00 |
| Ya-Liang Chang (Allen) | 025f4708f7 | Move decode_response() to response_decoder.py | 2020-06-29 11:53:10 -07:00 |
| Ya-Liang Chang (Allen) | 9dd44fe121 | Fix config warning | 2020-06-23 12:05:45 -07:00 |
| Ya-Liang Chang (Allen) | 14d9bbc11b | Update requirements | 2020-06-23 12:05:30 -07:00 |
| Ya-Liang Chang (Allen) | b7f4d3b7c4 | Change SimpleCrawler._fetch_url to return string instead of bytes to speed up, based on https://github.com/adbar/trafilatura/blob/master/trafilatura/utils.py | 2020-06-23 12:05:20 -07:00 |
| Felix Hamborg | 1639c9f3bd | Update bug_report.md | 2020-05-31 17:44:28 +02:00 |
| Felix Hamborg | 3a2c9cad1a | Update bug_report.md | 2020-05-31 17:43:31 +02:00 |
| Felix Hamborg | 3e31f106a1 | Merge pull request #160 from petlack/fix/159/missing-hurry-filesize: #159 Add missing hurry.filesize to requirements.txt | 2020-05-30 14:20:10 +02:00 |
| peter@output.sk | ef7649917e | #159 Add missing hurry.filesize to requirements.txt | 2020-05-28 01:19:01 +03:00 |
| Felix Hamborg | 44a9f32eee | Merge pull request #158 from tbrknt/unused-delete-configuration-in-example: my_delete_warc_after_extraction | 2020-05-23 17:55:46 +02:00 |
| Felix Hamborg | 06bf917667 | Merge pull request #157 from tbrknt/add-windows-compatibility-for-subprocesses: Less OS Dependency | 2020-05-23 17:54:59 +02:00 |
| Felix Hamborg | e6707c3985 | Merge pull request #156 from tbrknt/subprocess-exitcode-check: Add exitcode check for Subprocesses | 2020-05-23 17:52:17 +02:00 |
| Berkant Kepez | 00eaa97ea7 | Make use of my_delete_warc_after_extraction configuration in the example. | 2020-05-22 00:50:01 +02:00 |
| Berkant Kepez | c6edf464c7 | Decrease the dependency on OS regarding Subprocesses in __get_remote_index methods. | 2020-05-22 00:48:53 +02:00 |
| Berkant Kepez | 60a1ef1059 | Add exitcode check for commands executed in __get_remote_input. | 2020-05-22 00:03:44 +02:00 |
| Felix Hamborg | c93bf81a3e | make scrapy pipeline item configurable | 2020-05-13 11:43:05 +02:00 |
| Felix Hamborg | 07174cc593 | Merge pull request #145 from thihara/item_class: Changes to make Scrapy Item class customizable via configuration | 2020-05-13 11:41:08 +02:00 |
| Thihara Neranjya | 20a90dfea3 | Renamed modle_util.py into class_loader.py. More specific error handling for loading custom news item module and class. Fixed a typo. | 2020-05-12 08:07:24 +05:30 |
| Felix Hamborg | bc00fac124 | Update sample.json | 2020-05-05 09:09:25 +02:00 |
| Felix Hamborg | d24b3bd82f | Merge branch 'master' of github.com:fhamborg/news-please | 2020-04-30 10:21:24 +02:00 |
| Felix Hamborg | beef89c21a | incr | 2020-04-30 10:21:13 +02:00 |
| Felix Hamborg | c862976173 | Merge pull request #148 from thihara/date_extraction: Fixed broken date extraction due to beautiful soup's tag.text. | 2020-04-30 10:20:17 +02:00 |
| Thihara Neranjya | 86570b1ad2 | Merge branch 'date_extraction' of https://github.com/thihara/news-please into item_class | 2020-04-29 20:00:57 +05:30 |
| Thihara Neranjya | de7012cd2f | Fixed broken date extraction due to beautiful soup's tag.text. Replaced with tag.string | 2020-04-29 19:52:18 +05:30 |
| Thihara Neranjya | 411f46698d | Merge branch 'master' of https://github.com/thihara/news-please into item_class | 2020-04-29 12:54:10 +05:30 |
| Thihara Neranjya | 858758aec9 | Fixed incorrect variable reference | 2020-04-29 12:39:25 +05:30 |
| Thihara Neranjya | 2d23efb493 | Removed f-strings and used format method. Added the new config parameter into config_lib.cfg file | 2020-04-28 14:04:31 +05:30 |
| Felix Hamborg | 57fbd8508a | Update setup.py | 2020-04-28 10:26:31 +02:00 |
| Felix Hamborg | 61ebcd9ae4 | Merge pull request #146 from moyid/add-warc-date-filter-to-commoncrawl: add date filter for commoncrawl warc files | 2020-04-27 22:40:06 +02:00 |
| 默奕 | 3eb23e775b | revisions to adding commoncrawl warc date filter | 2020-04-27 13:26:47 -07:00 |
| 默奕 | 6e1016e92a | revisions to adding commoncrawl warc date filter | 2020-04-27 13:13:50 -07:00 |
| 默奕 | ae56026328 | add date filter for commoncrawl warc files | 2020-04-25 13:13:21 -07:00 |