news-please-content-crawler

mirror of https://github.com/fhamborg/news-please.git synced 2021-09-19 22:26:00 +03:00

Author	SHA1	Message	Date
Felix Hamborg	db9d6e0a07	Update commoncrawl.py	2019-10-27 13:24:21 +01:00
Felix Hamborg	de99f87a95	deprecate use of attribute text. maintext is preferred from now on	2019-10-19 14:53:04 +02:00
Felix Hamborg	6cc3ab8043	Update commoncrawl.py	2019-10-02 15:59:02 +02:00
Felix Hamborg	1959114eb8	Update README.md	2019-09-30 17:41:03 +02:00
Felix Hamborg	64956af6b2	remove ccnc announcement since the dataset is still in preparation	2019-09-27 08:33:22 +02:00
Felix Hamborg	a68407e493	Update README.md	2019-09-25 12:21:11 +02:00
Felix Hamborg	8d4679d2f2	Create no-response.yml	2019-09-24 16:33:01 +02:00
Felix Hamborg	5156574323	Update README.md	2019-09-19 22:28:38 +02:00
Felix Hamborg	e73c3897bc	Update README.md	2019-09-19 22:26:57 +02:00
Felix Hamborg	58d6d40c8a	Update README.md	2019-09-19 17:01:14 +02:00
Felix Hamborg	4f9420e93d	Update README.md	2019-09-17 11:14:25 +02:00
Felix Hamborg	2d5dbb4e1d	Update README.md	2019-09-17 11:13:28 +02:00
Felix Hamborg	9681733a37	Update README.md	2019-09-13 09:34:41 +02:00
Felix Hamborg	8d8f97b1be	Update README.md	2019-09-13 09:32:28 +02:00
Felix Hamborg	afa51e43e2	Update README.md	2019-09-13 09:28:59 +02:00
Felix Hamborg	947e38151f	Update setup.py	2019-08-06 10:32:06 +02:00
Felix Hamborg	cd40b772a8	Merge pull request #117 from JeyB88/bugfix/WrongResponseAccessInElasticsearchStorage fixed wrong response access to check if previous version of processin…	2019-08-06 10:31:50 +02:00
Felix Hamborg	3c8af46c14	Update setup.py	2019-08-06 10:29:38 +02:00
Felix Hamborg	d7dc03527d	Merge pull request #118 from donglixp/patch-1 solve issue #107 (TypeError: not iterable)	2019-08-06 10:28:41 +02:00
Li Dong	c0d9726941	Update commoncrawl_extractor.py	2019-08-05 20:25:30 +08:00
Li Dong	ab5c855272	incorrect attr name should be "date_publish"	2019-08-05 20:06:15 +08:00
Li Dong	2dc52b7c83	solve issue #107 (TypeError: not iterable)	2019-08-05 19:35:00 +08:00
solve issue #107 (TypeError: argument of type 'NewsArticle' is not iterable)
https://github.com/fhamborg/news-please/issues/107
```bash
Traceback (most recent call last):
  File "/data/anaconda/envs/giga/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/data/anaconda/envs/giga/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/mnt/data/news-please/newsplease/examples/commoncrawl.py", line 169, in <module>
    main()
  File "/mnt/data/news-please/newsplease/examples/commoncrawl.py", line 165, in main
    continue_process=True)
  File "/mnt/data/news-please/newsplease/crawler/commoncrawl_crawler.py", line 297, in crawl_from_commoncrawl
    log_pathname_fully_extracted_warcs=__log_pathname_fully_extracted_warcs)
  File "/mnt/data/news-please/newsplease/crawler/commoncrawl_crawler.py", line 208, in __start_commoncrawl_extractor
    log_pathname_fully_extracted_warcs=__log_pathname_fully_extracted_warcs)
  File "/mnt/data/news-please/newsplease/crawler/commoncrawl_extractor.py", line 338, in extract_from_commoncrawl
    self.__run()
  File "/mnt/data/news-please/newsplease/crawler/commoncrawl_extractor.py", line 292, in __run
    self.__process_warc_gz_file(local_path_name)
  File "/mnt/data/news-please/newsplease/crawler/commoncrawl_extractor.py", line 237, in __process_warc_gz_file
    filter_pass, article = self.__filter_record(record)
  File "/mnt/data/news-please/newsplease/crawler/commoncrawl_extractor.py", line 121, in __filter_record
    publishing_date = self.__get_publishing_date(warc_record, article)
  File "/mnt/data/news-please/newsplease/crawler/commoncrawl_extractor.py", line 140, in __get_publishing_date
    if 'publish_date' in article:
TypeError: argument of type 'NewsArticle' is not iterable
```
JeyB	42544c23e2	fixed wrong response access to check if previous version of processing item already exist	2019-08-03 18:09:37 +02:00
Felix Hamborg	c0e7b56d0b	Update docker.sh	2019-07-15 18:38:38 +02:00
Felix Hamborg	8913886e56	add docker script for cc	2019-07-15 17:18:44 +02:00
Felix Hamborg	663c229de4	add docker script for cc	2019-07-15 15:39:32 +02:00
Felix Hamborg	1231e68d76	add docker script for cc	2019-07-15 15:20:06 +02:00
Felix Hamborg	f1e3f87ad8	add entrypoint news-please-cc for commoncrawl script	2019-07-15 15:06:27 +02:00
Felix Hamborg	288a9d04c0	add entrypoint news-please-cc for commoncrawl script	2019-07-15 15:05:42 +02:00
Felix Hamborg	27caa85928	Merge remote-tracking branch 'origin/master'	2019-07-15 15:05:32 +02:00
Felix Hamborg	e2b82f8087	add entrypoint news-please-cc for commoncrawl script	2019-07-15 15:05:23 +02:00
Felix Hamborg	77faccc6c6	Update README.md	2019-07-14 09:29:00 +02:00
Felix Hamborg	15b27a9ed5	cc:fix path problems cc:add main params for download locations	2019-07-12 18:13:07 +02:00
Felix Hamborg	14bae0ef60	cc: more comprehensive log	2019-07-12 17:04:22 +02:00
Felix Hamborg	cdb1dab688	cc: more comprehensive logging output cc: callback for warc completion fix #105	2019-07-12 16:52:50 +02:00
Felix Hamborg	c90a8a0a0f	from_warc properly handles encoding increase version	2019-07-11 16:23:51 +02:00
Felix Hamborg	4c103c5806	from_warc properly handles encoding increase version	2019-07-11 16:23:33 +02:00
Felix Hamborg	05a3a80994	from_warc properly handles encoding increase version	2019-07-11 15:41:56 +02:00
Felix Hamborg	5a1058ba96	Update README.md	2019-06-12 21:49:22 +02:00
Felix Hamborg	4fc37fe536	Update README.md	2019-06-12 21:45:31 +02:00
Felix Hamborg	0a3c43014e	Update README.md	2019-06-12 21:44:57 +02:00
Felix Hamborg	e98032ae76	Update bug_report.md	2019-05-23 10:28:22 +02:00
Felix Hamborg	e8d18dcd6e	Update README.md	2019-05-20 17:36:44 +02:00
Felix Hamborg	8c21ff3f14	various minor fixes and improvements	2019-05-13 16:50:38 +02:00
Felix Hamborg	26606865eb	Merge pull request #108 from fshafalir/patch-2 Update pipelines.py	2019-05-10 16:43:02 +02:00
fshafalir	7bd02d932b	Update pipelines.py To support Python < 3.6	2019-05-10 11:28:45 +02:00
Felix Hamborg	731837c2a6	Update README.md	2019-04-08 18:40:11 +02:00
Felix Hamborg	78529dd0e4	increase version	2019-04-02 09:02:08 -04:00
Felix Hamborg	7e51507f29	Merge pull request #97 from tsoernes/re-comp Avoid compiling regexes each iteration	2019-04-02 08:49:00 -04:00
torstein	1e5489198e	Improve regex naming	2019-04-01 11:29:40 +02:00


658 Commits
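
Note on commits 2dc52b7c83 and ab5c855272: the traceback in issue #107 comes from applying the `in` operator to an article object rather than to a dict. The following is a minimal sketch of that failure mode and of an attribute-based lookup, not the code from the repository. The `NewsArticle` class here is a hypothetical stand-in with a single `date_publish` attribute (the name given in ab5c855272), and `get_publishing_date` is an illustrative helper, not the library's function.

```python
class NewsArticle:
    """Hypothetical stand-in for the article object (assumption, not the library class)."""

    def __init__(self, date_publish=None):
        self.date_publish = date_publish


def get_publishing_date(article):
    """Illustrative helper: read the attribute instead of testing membership.

    `'date_publish' in article` would raise TypeError for a plain object,
    because it defines neither __contains__ nor __iter__.
    """
    return getattr(article, "date_publish", None)


article = NewsArticle(date_publish="2019-08-05 19:35:00")

# Reproduce the failure mode reported in the issue.
membership_raised = False
try:
    "publish_date" in article
except TypeError:
    membership_raised = True

# Attribute access works, and falls back to None when the attribute is absent.
publishing_date = get_publishing_date(article)
missing_date = get_publishing_date(object())
```

Using `getattr` with a default keeps the lookup safe for objects that lack the attribute. Whether the actual fix uses `getattr`, `hasattr`, or direct attribute access is not stated in the log above.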