mirror of https://github.com/fhamborg/news-please.git synced 2021-09-19 22:26:00 +03:00

39 Commits

Author SHA1 Message Date
Felix Hamborg
14b9cef89c Merge pull request #203 from mood-mapping-muppets/no-fetch-images-newspaper
Add option of not fetching images using newspaper library and make default for commoncrawl
2021-02-24 08:43:52 +01:00
Frankie Robertson
11e6d7748c Add option of not fetching images using newspaper library and make default for commoncrawl 2021-02-10 15:17:32 +02:00
Frankie Robertson
8456ec4f89 Filter empty responses from WARC to avoid spurious exceptions from newspaper 2021-02-10 15:14:01 +02:00
Frankie Robertson
5ed9b804bf Fallback to utf-8 when document gives unknown encoding 2021-02-08 07:18:05 +02:00
Frankie Robertson
2ade640525 Add option of replacing unicode decode errors in WARC/common crawl extraction 2021-02-08 07:12:18 +02:00
Fischer Jemison
09a810ca72 updates from_url return type docs 2020-04-19 14:11:31 -07:00
Felix Hamborg
c90a8a0a0f from_warc properly handles encoding
increase version
2019-07-11 16:23:51 +02:00
Felix Hamborg
4c103c5806 from_warc properly handles encoding
increase version
2019-07-11 16:23:33 +02:00
Felix Hamborg
05a3a80994 from_warc properly handles encoding
increase version
2019-07-11 15:41:56 +02:00
Dimitris Z
429d06b31a SonarQube Code Smell analysis and small refactoring 2018-07-01 01:38:16 +03:00
Felix Hamborg
a2bc51210a add timeout to crawling in lib mode
new pypi version including docker
2018-05-29 12:50:50 +02:00
Felix Hamborg
6f7a75dd98 add download_date to API download of single and multiple URLs 2017-11-27 17:38:01 +01:00
Felix Hamborg
50aa40b898 fix NewsArticle object conversion to dict 2017-10-05 14:39:24 +02:00
Felix Hamborg
4f0e754408 change to relative imports in commoncrawl scripts so that a git clone is sufficient to run the example and no installation is required.
increase version
2017-10-05 12:06:15 +02:00
Felix Hamborg
a0a9f276e5 fix bugs 2017-08-16 18:01:31 +09:00
Felix Hamborg
e2c8611e24 refactor commoncrawl into a crawler class that can be invoked programmatically and a convenient example script 2017-08-16 15:26:54 +09:00
Felix Hamborg
be12fcff22 version increase 2017-08-04 20:15:59 +02:00
Felix Hamborg
79c65dd72a fix #34 2017-08-04 20:13:51 +02:00
Felix Hamborg
25457b7476 fix author field export 2017-06-29 13:07:32 +02:00
Felix Hamborg
120592a56c fix author field export 2017-06-29 12:00:00 +02:00
Felix Hamborg
1b28268106 library mode does not use scrapy any longer but crawls the articles using urllib 2017-06-28 16:47:27 +02:00
Felix Hamborg
b8b82aff43 fix #28 2017-06-15 19:51:51 +02:00
Felix Hamborg
696667d2ba add warc support 2017-05-31 17:07:08 +02:00
Felix Hamborg
e122900319 reformat code 2017-05-31 12:56:52 +02:00
Felix Hamborg
d77eb6122b add warc support 2017-05-29 12:12:04 +02:00
Felix Hamborg
9a0321a578 add warc support 2017-05-29 12:03:09 +02:00
Felix Hamborg
c55ecd4641 add warc support 2017-05-28 16:56:41 +02:00
Felix Hamborg
c5b47cbab2 add warc support 2017-05-28 16:51:45 +02:00
Felix Hamborg
23daedc2b9 warc support 2017-05-28 14:31:55 +02:00
Felix Hamborg
d276282cb8 crawl from html string 2017-05-27 21:09:56 +02:00
Felix Hamborg
0ce22f1297 crawl from html string 2017-05-27 20:45:55 +02:00
Felix Hamborg
b033fb1c6d robots.txt update in configs 2017-05-27 18:38:00 +02:00
Felix Hamborg
3966306e49 rename methods 2017-05-21 12:42:35 +02:00
Felix Hamborg
ddc6ec6435 add library mode for files containing urls
fix minor bugs
update doc
2017-05-17 16:23:37 +02:00
felix
9f8d2bb2f4 fix bug that occurred when crawling multiple articles 2017-02-25 11:24:34 +01:00
felix
aca3f48d21 fix bug that occurred when crawling multiple articles 2017-02-25 11:10:38 +01:00
felix
0384f47760 more convenient name 2017-02-24 17:59:33 +01:00
felix
b78bb39ddc fix bug 2017-02-24 17:53:48 +01:00
felix
4d8199ff42 reorga 2016-11-09 18:33:45 +01:00