mirror of https://github.com/fhamborg/news-please.git synced 2021-09-19 22:26:00 +03:00

658 Commits

Author SHA1 Message Date
Felix Hamborg
db9d6e0a07 Update commoncrawl.py 2019-10-27 13:24:21 +01:00
Felix Hamborg
de99f87a95 deprecate use of attribute text. maintext is preferred from now on 2019-10-19 14:53:04 +02:00
Felix Hamborg
6cc3ab8043 Update commoncrawl.py 2019-10-02 15:59:02 +02:00
Felix Hamborg
1959114eb8 Update README.md 2019-09-30 17:41:03 +02:00
Felix Hamborg
64956af6b2 remove ccnc announcement since the dataset is still in preparation 2019-09-27 08:33:22 +02:00
Felix Hamborg
a68407e493 Update README.md 2019-09-25 12:21:11 +02:00
Felix Hamborg
8d4679d2f2 Create no-response.yml 2019-09-24 16:33:01 +02:00
Felix Hamborg
5156574323 Update README.md 2019-09-19 22:28:38 +02:00
Felix Hamborg
e73c3897bc Update README.md 2019-09-19 22:26:57 +02:00
Felix Hamborg
58d6d40c8a Update README.md 2019-09-19 17:01:14 +02:00
Felix Hamborg
4f9420e93d Update README.md 2019-09-17 11:14:25 +02:00
Felix Hamborg
2d5dbb4e1d Update README.md 2019-09-17 11:13:28 +02:00
Felix Hamborg
9681733a37 Update README.md 2019-09-13 09:34:41 +02:00
Felix Hamborg
8d8f97b1be Update README.md 2019-09-13 09:32:28 +02:00
Felix Hamborg
afa51e43e2 Update README.md 2019-09-13 09:28:59 +02:00
Felix Hamborg
947e38151f Update setup.py 2019-08-06 10:32:06 +02:00
Felix Hamborg
cd40b772a8 Merge pull request #117 from JeyB88/bugfix/WrongResponseAccessInElasticsearchStorage
fixed wrong response access to check if previous version of processin…
2019-08-06 10:31:50 +02:00
Felix Hamborg
3c8af46c14 Update setup.py 2019-08-06 10:29:38 +02:00
Felix Hamborg
d7dc03527d Merge pull request #118 from donglixp/patch-1
solve issue #107 (TypeError: not iterable)
2019-08-06 10:28:41 +02:00
Li Dong
c0d9726941 Update commoncrawl_extractor.py 2019-08-05 20:25:30 +08:00
Li Dong
ab5c855272 incorrect attr name
should be "date_publish"
2019-08-05 20:06:15 +08:00
Li Dong
2dc52b7c83 solve issue #107 (TypeError: not iterable)
solve issue #107 (TypeError: argument of type 'NewsArticle' is not iterable)
https://github.com/fhamborg/news-please/issues/107

```text
Traceback (most recent call last):
  File "/data/anaconda/envs/giga/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/data/anaconda/envs/giga/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/mnt/data/news-please/newsplease/examples/commoncrawl.py", line 169, in <module>
    main()
  File "/mnt/data/news-please/newsplease/examples/commoncrawl.py", line 165, in main
    continue_process=True)
  File "/mnt/data/news-please/newsplease/crawler/commoncrawl_crawler.py", line 297, in crawl_from_commoncrawl
    log_pathname_fully_extracted_warcs=__log_pathname_fully_extracted_warcs)
  File "/mnt/data/news-please/newsplease/crawler/commoncrawl_crawler.py", line 208, in __start_commoncrawl_extractor
    log_pathname_fully_extracted_warcs=__log_pathname_fully_extracted_warcs)
  File "/mnt/data/news-please/newsplease/crawler/commoncrawl_extractor.py", line 338, in extract_from_commoncrawl
    self.__run()
  File "/mnt/data/news-please/newsplease/crawler/commoncrawl_extractor.py", line 292, in __run
    self.__process_warc_gz_file(local_path_name)
  File "/mnt/data/news-please/newsplease/crawler/commoncrawl_extractor.py", line 237, in __process_warc_gz_file
    filter_pass, article = self.__filter_record(record)
  File "/mnt/data/news-please/newsplease/crawler/commoncrawl_extractor.py", line 121, in __filter_record
    publishing_date = self.__get_publishing_date(warc_record, article)
  File "/mnt/data/news-please/newsplease/crawler/commoncrawl_extractor.py", line 140, in __get_publishing_date
    if 'publish_date' in article:
TypeError: argument of type 'NewsArticle' is not iterable
```
2019-08-05 19:35:00 +08:00
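
The traceback above fails because `in` is applied to an object that is neither a container nor iterable. A minimal sketch of the failure and of one way around it follows. The `NewsArticle` class here is a hypothetical stand-in, not the real news-please class, and `get_publishing_date` is an illustrative helper; only the attribute name `date_publish` comes from the follow-up commit.

```python
class NewsArticle:
    """Hypothetical stand-in for the real class; only the attribute matters here."""

    def __init__(self, date_publish=None):
        self.date_publish = date_publish


def get_publishing_date(article):
    """Return the article's publishing date, or None if it is not set."""
    # `'date_publish' in article` would raise TypeError because this class
    # defines neither __contains__ nor __iter__; read the attribute instead.
    return getattr(article, "date_publish", None)


article = NewsArticle(date_publish="2019-08-05 19:35:00")
print(get_publishing_date(article))
```

Using `getattr` with a default avoids both the `TypeError` and an `AttributeError` if the attribute is missing. Whether the actual fix takes this exact form is not shown in the log.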
JeyB
42544c23e2 fixed wrong response access to check if previous version of processing item already exist 2019-08-03 18:09:37 +02:00
Felix Hamborg
c0e7b56d0b Update docker.sh 2019-07-15 18:38:38 +02:00
Felix Hamborg
8913886e56 add docker script for cc 2019-07-15 17:18:44 +02:00
Felix Hamborg
663c229de4 add docker script for cc 2019-07-15 15:39:32 +02:00
Felix Hamborg
1231e68d76 add docker script for cc 2019-07-15 15:20:06 +02:00
Felix Hamborg
f1e3f87ad8 add entrypoint news-please-cc for commoncrawl script 2019-07-15 15:06:27 +02:00
Felix Hamborg
288a9d04c0 add entrypoint news-please-cc for commoncrawl script 2019-07-15 15:05:42 +02:00
Felix Hamborg
27caa85928 Merge remote-tracking branch 'origin/master' 2019-07-15 15:05:32 +02:00
Felix Hamborg
e2b82f8087 add entrypoint news-please-cc for commoncrawl script 2019-07-15 15:05:23 +02:00
Felix Hamborg
77faccc6c6 Update README.md 2019-07-14 09:29:00 +02:00
Felix Hamborg
15b27a9ed5 cc:fix path problems
cc:add main params for download locations
2019-07-12 18:13:07 +02:00
Felix Hamborg
14bae0ef60 cc: more comprehensive log 2019-07-12 17:04:22 +02:00
Felix Hamborg
cdb1dab688 cc: more comprehensive logging output
cc: callback for warc completion
fix #105
2019-07-12 16:52:50 +02:00
Felix Hamborg
c90a8a0a0f from_warc properly handles encoding
increase version
2019-07-11 16:23:51 +02:00
Felix Hamborg
4c103c5806 from_warc properly handles encoding
increase version
2019-07-11 16:23:33 +02:00
Felix Hamborg
05a3a80994 from_warc properly handles encoding
increase version
2019-07-11 15:41:56 +02:00
Felix Hamborg
5a1058ba96 Update README.md 2019-06-12 21:49:22 +02:00
Felix Hamborg
4fc37fe536 Update README.md 2019-06-12 21:45:31 +02:00
Felix Hamborg
0a3c43014e Update README.md 2019-06-12 21:44:57 +02:00
Felix Hamborg
e98032ae76 Update bug_report.md 2019-05-23 10:28:22 +02:00
Felix Hamborg
e8d18dcd6e Update README.md 2019-05-20 17:36:44 +02:00
Felix Hamborg
8c21ff3f14 various minor fixes and improvements 2019-05-13 16:50:38 +02:00
Felix Hamborg
26606865eb Merge pull request #108 from fshafalir/patch-2
Update pipelines.py
2019-05-10 16:43:02 +02:00
fshafalir
7bd02d932b Update pipelines.py
To support Python < 3.6
2019-05-10 11:28:45 +02:00
Felix Hamborg
731837c2a6 Update README.md 2019-04-08 18:40:11 +02:00
Felix Hamborg
78529dd0e4 increase version 2019-04-02 09:02:08 -04:00
Felix Hamborg
7e51507f29 Merge pull request #97 from tsoernes/re-comp
Avoid compiling regexes each iteration
2019-04-02 08:49:00 -04:00
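
The change merged above is described only as avoiding regex compilation on each iteration. As a rough sketch of that general technique (the pattern, the constant name `DATE_IN_URL`, and the function `extract_dates` are all assumptions for illustration, not code from the pull request), a pattern can be compiled once at module level and reused inside the loop:

```python
import re

# Compiled once at import time instead of inside the loop.
# The pattern itself is illustrative only, not taken from news-please.
DATE_IN_URL = re.compile(r"/(\d{4})/(\d{2})/(\d{2})/")


def extract_dates(urls):
    """Return (year, month, day) tuples for URLs that contain a date path."""
    dates = []
    for url in urls:
        match = DATE_IN_URL.search(url)
        if match:
            dates.append(tuple(int(part) for part in match.groups()))
    return dates


print(extract_dates(["https://example.com/2019/04/01/story"]))
```

Python's `re` module keeps an internal cache of recently compiled patterns, so the gain from hoisting is often modest. It mainly removes the repeated cache lookup and gives the pattern a readable name.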
torstein
1e5489198e Improve regex naming 2019-04-01 11:29:40 +02:00