Felix Hamborg
2020-04-30 10:21:13 +02:00
parent 2b55c29798
commit beef89c21a
49 changed files with 121 additions and 0 deletions

BIN dist/news-please-1.2.25.tar.gz (binary file not shown)
BIN dist/news-please-1.2.26.tar.gz (binary file not shown)
BIN dist/news-please-1.2.27.tar.gz (binary file not shown)
BIN dist/news-please-1.2.28.tar.gz (binary file not shown)
BIN dist/news-please-1.2.31.tar.gz (binary file not shown)
BIN dist/news-please-1.2.32.tar.gz (binary file not shown)
BIN dist/news-please-1.2.33.tar.gz (binary file not shown)
BIN dist/news-please-1.2.35.tar.gz (binary file not shown)
BIN dist/news-please-1.2.36.tar.gz (binary file not shown)
BIN dist/news-please-1.2.39.tar.gz (binary file not shown)
BIN dist/news-please-1.2.40.tar.gz (binary file not shown)
BIN dist/news-please-1.2.41.tar.gz (binary file not shown)
BIN dist/news-please-1.2.42.tar.gz (binary file not shown)
BIN dist/news-please-1.2.43.tar.gz (binary file not shown)
BIN dist/news-please-1.2.44.tar.gz (binary file not shown)
BIN dist/news-please-1.2.50.tar.gz (binary file not shown)
BIN dist/news-please-1.2.51.tar.gz (binary file not shown)
BIN dist/news-please-1.2.52.tar.gz (binary file not shown)
BIN dist/news-please-1.2.53.tar.gz (binary file not shown)
BIN dist/news-please-1.3.10.tar.gz (binary file not shown)
BIN dist/news-please-1.3.11.tar.gz (binary file not shown)
BIN dist/news-please-1.3.13.tar.gz (binary file not shown)
BIN dist/news-please-1.3.14.tar.gz (binary file not shown)
BIN dist/news-please-1.4.10.tar.gz (binary file not shown)
BIN dist/news-please-1.4.11.tar.gz (binary file not shown)
BIN dist/news-please-1.4.12.tar.gz (binary file not shown)
BIN dist/news-please-1.4.13.tar.gz (binary file not shown)
BIN dist/news-please-1.4.14.tar.gz (binary file not shown)
BIN dist/news-please-1.4.15.tar.gz (binary file not shown)
BIN dist/news-please-1.4.16.tar.gz (binary file not shown)
BIN dist/news-please-1.4.17.tar.gz (binary file not shown)
BIN dist/news-please-1.4.18.tar.gz (binary file not shown)
BIN dist/news-please-1.4.19.tar.gz (binary file not shown)
BIN dist/news-please-1.4.20.tar.gz (binary file not shown)
BIN dist/news-please-1.4.21.tar.gz (binary file not shown)
BIN dist/news-please-1.4.22.tar.gz (binary file not shown)
BIN dist/news-please-1.4.23.tar.gz (binary file not shown)
BIN dist/news-please-1.4.24.tar.gz (binary file not shown)
BIN dist/news-please-1.4.25.tar.gz (binary file not shown)
BIN dist/news-please-1.4.26.tar.gz (binary file not shown)
BIN dist/news-please-1.5.1.tar.gz (binary file not shown)
BIN dist/news-please-1.5.2.tar.gz (binary file not shown)

@@ -0,0 +1,25 @@
Metadata-Version: 1.1
Name: news-please
Version: 1.5.2
Summary: news-please is an open source easy-to-use news extractor that just works.
Home-page: https://github.com/fhamborg/news-please
Author: Felix Hamborg
Author-email: felix.hamborg@uni-konstanz.de
License: Apache License 2.0
Download-URL: https://github.com/fhamborg/news-please
Description: news-please is an open source, easy-to-use news crawler that extracts structured information from almost any news website. It can recursively follow internal hyperlinks and read RSS feeds to fetch both the most recent and older, archived articles. You only need to provide the root URL of the news website. Furthermore, its API allows developers to access the extraction functionality within their software. news-please also implements a workflow optimized for the news archive provided by commoncrawl.org, allowing users to efficiently crawl and extract news articles, including various filter options.
Keywords: news crawler news scraper news extractor crawler extractor scraper information retrieval
Platform: UNKNOWN
Classifier: Development Status :: 4 - Beta
Classifier: Environment :: Console
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Operating System :: MacOS
Classifier: Operating System :: Microsoft
Classifier: Operating System :: POSIX :: Linux
Classifier: Programming Language :: Python :: 3.4
Classifier: Programming Language :: Python :: 3.5
Classifier: Programming Language :: Python :: 3.6
Classifier: Topic :: Internet
Classifier: Topic :: Scientific/Engineering :: Information Analysis
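
The metadata block above uses RFC 822-style `Key: value` headers, so it can be read with the standard library's email parser. The following is a minimal sketch, assuming a trimmed copy of the fields shown above; `PKG_INFO` and `read_metadata` are hypothetical names used only for illustration.

```python
from email.parser import Parser

# Trimmed copy of the metadata fields shown above (not the full file).
PKG_INFO = """\
Metadata-Version: 1.1
Name: news-please
Version: 1.5.2
License: Apache License 2.0
Classifier: Development Status :: 4 - Beta
Classifier: Programming Language :: Python :: 3.5
Classifier: Programming Language :: Python :: 3.6
"""


def read_metadata(text):
    """Parse RFC 822-style package metadata into a message object."""
    return Parser().parsestr(text)


metadata = read_metadata(PKG_INFO)
name = metadata["Name"]
version = metadata["Version"]

# Repeated fields such as Classifier are collected with get_all().
classifiers = metadata.get_all("Classifier")
prefix = "Programming Language :: Python :: "
python_versions = [c[len(prefix):] for c in classifiers if c.startswith(prefix)]

print(name, version, python_versions)
```

Looking up a single key returns only the first matching header, which is why the repeated `Classifier` field is read with `get_all()`.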

@@ -0,0 +1,65 @@
LICENSE.txt
MANIFEST.in
README.md
requirements.txt
setup.py
news_please.egg-info/PKG-INFO
news_please.egg-info/SOURCES.txt
news_please.egg-info/dependency_links.txt
news_please.egg-info/entry_points.txt
news_please.egg-info/not-zip-safe
news_please.egg-info/requires.txt
news_please.egg-info/top_level.txt
newsplease/NewsArticle.py
newsplease/__init__.py
newsplease/__main__.py
newsplease/config.py
newsplease/helper.py
newsplease/single_crawler.py
newsplease/config/config.cfg
newsplease/config/config_lib.cfg
newsplease/config/sitelist.hjson
newsplease/crawler/__init__.py
newsplease/crawler/commoncrawl_crawler.py
newsplease/crawler/commoncrawl_extractor.py
newsplease/crawler/items.py
newsplease/crawler/simple_crawler.py
newsplease/crawler/spiders/__init__.py
newsplease/crawler/spiders/download_crawler.py
newsplease/crawler/spiders/gdelt_crawler.py
newsplease/crawler/spiders/recursive_crawler.py
newsplease/crawler/spiders/recursive_sitemap_crawler.py
newsplease/crawler/spiders/rss_crawler.py
newsplease/crawler/spiders/sitemap_crawler.py
newsplease/examples/__init__.py
newsplease/examples/commoncrawl.py
newsplease/examples/downloadfromfile.py
newsplease/examples/downloadfromurl.py
newsplease/helper_classes/__init__.py
newsplease/helper_classes/heuristics.py
newsplease/helper_classes/parse_crawler.py
newsplease/helper_classes/savepath_parser.py
newsplease/helper_classes/url_extractor.py
newsplease/helper_classes/sub_classes/__init__.py
newsplease/helper_classes/sub_classes/heuristics_manager.py
newsplease/pipeline/__init__.py
newsplease/pipeline/pipelines.py
newsplease/pipeline/extractor/__init__.py
newsplease/pipeline/extractor/article_candidate.py
newsplease/pipeline/extractor/article_extractor.py
newsplease/pipeline/extractor/cleaner.py
newsplease/pipeline/extractor/comparer/__init__.py
newsplease/pipeline/extractor/comparer/comparer.py
newsplease/pipeline/extractor/comparer/comparer_Language.py
newsplease/pipeline/extractor/comparer/comparer_author.py
newsplease/pipeline/extractor/comparer/comparer_date.py
newsplease/pipeline/extractor/comparer/comparer_description.py
newsplease/pipeline/extractor/comparer/comparer_text.py
newsplease/pipeline/extractor/comparer/comparer_title.py
newsplease/pipeline/extractor/comparer/comparer_topimage.py
newsplease/pipeline/extractor/extractors/__init__.py
newsplease/pipeline/extractor/extractors/abstract_extractor.py
newsplease/pipeline/extractor/extractors/date_extractor.py
newsplease/pipeline/extractor/extractors/lang_detect_extractor.py
newsplease/pipeline/extractor/extractors/newspaper_extractor.py
newsplease/pipeline/extractor/extractors/readability_extractor.py

@@ -0,0 +1 @@

@@ -0,0 +1,4 @@
[console_scripts]
news-please = newsplease.__main__:main
news-please-cc = newsplease.examples.commoncrawl:main
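
Each entry in the `[console_scripts]` section above maps a command name to a `module:function` target. The following is a rough sketch of how such a file could be read with the standard library's `configparser`; `ENTRY_POINTS` and `parse_console_scripts` are hypothetical names, and this is not how the packaging tools themselves implement it.

```python
import configparser

# Copy of the entry point declarations shown above.
ENTRY_POINTS = """\
[console_scripts]
news-please = newsplease.__main__:main
news-please-cc = newsplease.examples.commoncrawl:main
"""


def parse_console_scripts(text):
    """Return {command: (module, function)} for the console_scripts section."""
    parser = configparser.ConfigParser()
    parser.optionxform = str  # keep command names exactly as written
    parser.read_string(text)
    scripts = {}
    for command, target in parser["console_scripts"].items():
        # A target has the form "package.module:callable".
        module, _, function = target.partition(":")
        scripts[command] = (module.strip(), function.strip())
    return scripts


scripts = parse_console_scripts(ENTRY_POINTS)
for command, (module, function) in scripts.items():
    print(f"{command} -> {module}.{function}()")
```

A launcher would then import the module (for example with `importlib.import_module`) and call the named function. That step is left out here because it requires the package to be installed.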

@@ -0,0 +1 @@

@@ -0,0 +1,24 @@
Scrapy>=1.1.0
PyMySQL>=0.7.9
psycopg2-binary>=2.8.4
hjson>=1.5.8
elasticsearch>=2.4
beautifulsoup4>=4.3.2
readability-lxml>=0.6.2
newspaper3k>=0.2.8
langdetect>=1.0.7
python-dateutil>=2.4.0
plac>=0.9.6
dotmap>=1.2.17
readability-lxml>=0.6.2
PyDispatcher>=2.0.5
warcio>=1.3.3
ago>=0.0.9
six>=1.10.0
lxml>=3.3.5
awscli>=1.11.117
hurry.filesize>=0.9
bs4
[:sys_platform == "win32"]
pywin32>=220
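
The requirement list above has an unconditional part followed by a bracketed section whose header carries an environment marker (`sys_platform == "win32"`), and it lists `readability-lxml>=0.6.2` twice. The following is a minimal sketch that groups the lines by section and flags repeated project names. It assumes a shortened excerpt of the list; `REQUIRES`, `parse_requires`, and `project_name` are hypothetical helpers, not part of any packaging library.

```python
import re
from collections import Counter

# Shortened excerpt of the requirement list shown above.
REQUIRES = """\
Scrapy>=1.1.0
readability-lxml>=0.6.2
newspaper3k>=0.2.8
readability-lxml>=0.6.2
bs4
[:sys_platform == "win32"]
pywin32>=220
"""


def parse_requires(text):
    """Group requirement lines by the section header they appear under."""
    sections = {"": []}  # "" holds the unconditional requirements
    current = ""
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("[") and line.endswith("]"):
            current = line[1:-1]
            sections.setdefault(current, [])
        else:
            sections[current].append(line)
    return sections


def project_name(requirement):
    """Return the text before the first version operator or marker."""
    return re.split(r"[<>=!~ ;\[]", requirement, maxsplit=1)[0]


sections = parse_requires(REQUIRES)
counts = Counter(project_name(r).lower() for r in sections[""])
duplicates = [name for name, count in counts.items() if count > 1]

print(sections)
print("listed more than once:", duplicates)
```

A check like this only reports the repetition. Whether a duplicate line should be dropped from the original requirements file is a separate decision.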

@@ -0,0 +1 @@
newsplease