Modify download_corpora.py and README.rst to show python3 updates

2021-06-07 22:52:17 +03:00 · 2014-12-17 12:12:24 -08:00
parent e26f1def50
commit 76956d28dc
2 changed files with 44 additions and 29 deletions
--- a/README.rst
+++ b/README.rst
@@ -1,5 +1,5 @@
-Newspaper: Article scraping & curation
-=======================================
+Newspaper3k: Article scraping & curation
+========================================

 .. image:: https://badge.fury.io/py/newspaper.png
    :target: http://badge.fury.io/py/newspaper
@@ -15,7 +15,8 @@ Inspired by `requests`_ for its simplicity and powered by `lxml`_ for its speed:
 .. _`tweeted by`: https://twitter.com/kennethreitz/status/419520678862548992
 .. _`The Changelog`: http://thechangelog.com/newspaper-delivers-instapaper-style-article-extraction/

-Basic Demo: http://newspaper-demo.herokuapp.com
+
+Newspaper is a **Python3** library! Alternatively, view the `Python2 branch`_.

 **We support 10+ languages and everything is in unicode!**

@@ -57,16 +58,16 @@ A Glance:

    >>> for article in cnn_paper.articles:
    >>>     print(article.url)
-    u'http://www.cnn.com/2013/11/27/justice/tucson-arizona-captive-girls/'
-    u'http://www.cnn.com/2013/12/11/us/texas-teen-dwi-wreck/index.html'
+    'http://www.cnn.com/2013/11/27/justice/tucson-arizona-captive-girls/'
+    'http://www.cnn.com/2013/12/11/us/texas-teen-dwi-wreck/index.html'
    ...

    >>> for category in cnn_paper.category_urls():
    >>>     print(category)

-    u'http://lifestyle.cnn.com'
-    u'http://cnn.com/world'
-    u'http://tech.cnn.com'
+    'http://lifestyle.cnn.com'
+    'http://cnn.com/world'
+    'http://tech.cnn.com'
    ...

 .. code-block:: pycon
@@ -78,23 +79,23 @@ A Glance:
    >>> article.download()

    >>> article.html
-    u'<!DOCTYPE HTML><html itemscope itemtype="http://...'
+    '<!DOCTYPE HTML><html itemscope itemtype="http://...'

 .. code-block:: pycon

    >>> article.parse()

    >>> article.authors
-    [u'Leigh Ann Caldwell', 'John Honway']
+    ['Leigh Ann Caldwell', 'John Honway']

    >>> article.text
-    u'Washington (CNN) -- Not everyone subscribes to a New Year's resolution...'
+    "Washington (CNN) -- Not everyone subscribes to a New Year's resolution..."

    >>> article.top_image
-    u'http://someCDN.com/blah/blah/blah/file.png'
+    'http://someCDN.com/blah/blah/blah/file.png'

    >>> article.movies
-    [u'http://youtube.com/path/to/link.com', ...]
+    ['http://youtube.com/path/to/link.com', ...]

 .. code-block:: pycon

@@ -104,7 +105,7 @@ A Glance:
    ['New Years', 'resolution', ...]

    >>> article.summary
-    u'The study shows that 93% of people ...'
+    'The study shows that 93% of people ...'


 Newspaper has *seamless* language extraction and detection.
@@ -141,9 +142,9 @@ If you are certain that an *entire* news source is in one language, **go ahead a

    >>> for category in sina_paper.category_urls():
    >>>     print(category)
-    u'http://health.sina.com.cn'
-    u'http://eladies.sina.com.cn'
-    u'http://english.sina.com'
+    'http://health.sina.com.cn'
+    'http://eladies.sina.com.cn'
+    'http://english.sina.com'
    ...

    >>> article = sina_paper.articles[0]
@@ -172,6 +173,7 @@ Interested in adding a new language for us? Refer to: `Docs - Adding new languag
 Features
 --------

+- Full Python3 and Python2 support
 - Works in 10+ languages (English, Chinese, German, Arabic, ...)
 - Multi-threaded article download framework
 - News url identification
@@ -189,6 +191,9 @@ Get it now
 Installing newspaper is simple with `pip <http://www.pip-installer.org/>`_.
 However, you will run into fixable issues if you are trying to install on ubuntu.

+Note that our Python3 package name is ``newspaper3k`` while our Python2
+package name is ``newspaper``.
+
 **If you are on Debian / Ubuntu**, install using the following:

 - Python development version, needed for Python.h::
@@ -205,11 +210,11 @@ However, you will run into fixable issues if you are trying to install on ubuntu

 - Install the distribution via pip::

-    $ pip install newspaper
+    $ pip3 install newspaper3k

 - Download NLP related corpora::

-    $ curl https://raw.githubusercontent.com/codelucas/newspaper/master/download_corpora.py | python2.7
+    $ curl https://raw.githubusercontent.com/codelucas/newspaper/master/download_corpora.py | python3


 **If you are on OSX**, install using the following, you may use both homebrew or macports:
@@ -220,9 +225,9 @@ However, you will run into fixable issues if you are trying to install on ubuntu

    $ brew install libtiff libjpeg webp little-cms2

-    $ pip install newspaper
+    $ pip3 install newspaper3k

-    $ curl https://raw.githubusercontent.com/codelucas/newspaper/master/download_corpora.py | python2.7
+    $ curl https://raw.githubusercontent.com/codelucas/newspaper/master/download_corpora.py | python3


 **Otherwise**, install with the following:
@@ -235,13 +240,16 @@ NOTE: You will still most likely need to install the following libraries via you

 ::

-    $ pip install newspaper
+    $ pip3 install newspaper3k

-    $ curl https://raw.githubusercontent.com/codelucas/newspaper/master/download_corpora.py | python
+    $ curl https://raw.githubusercontent.com/codelucas/newspaper/master/download_corpora.py | python3

 Development
 -----------

+Newspaper has two branches under development. *This* branch, ``master``, is our Python3
+codebase, while our Python2 codebase lives on the *python-2-head* branch.
+
 If you'd like to contribute and hack on the newspaper project, feel free to clone
 a development version of this repository locally::

@@ -250,12 +258,18 @@ a development version of this repository locally::
 Once you have a copy of the source, you can embed it in your Python package,
 or install it into your site-packages easily::

-    $ pip install -r requirements.txt
-    $ python setup.py install
+    $ pip3 install -r requirements.txt
+    $ python3 setup.py install

-Feel free to give our testing suite a shot::
+Feel free to give our testing suite a shot; everything is mocked!::

-    $ python tests/unit_tests.py
+    $ python3 tests/unit_tests.py
+
+
+Demo
+----
+
+View a working online demo here: http://newspaper-demo.herokuapp.com

 LICENSE
 -------
@@ -271,3 +285,5 @@ to talk about the future of this library and news extraction in general!
 .. _`email & contact me`: mailto:lucasyangpersonal@gmail.com
 .. _`python-goose's`: https://github.com/grangier/python-goose
 .. _`here`: https://github.com/codelucas/newspaper/blob/master/GOOSE-LICENSE.txt
+
+.. _`Python2 branch`: https://github.com/codelucas/newspaper/tree/python-2-head
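
Most of the README.rst hunks above only drop the ``u`` prefix from the sample output. A minimal sketch of why that matches Python 3 behaviour, assuming a Python 3.3+ interpreter: ``str`` is already unicode, so ``repr()`` prints no prefix, while the old ``u'...'`` literal is still accepted and compares equal. The URL is copied from the README sample output; nothing here calls newspaper itself.

```python
# Python 3 only: str is unicode, so repr() shows no u'' prefix.
url = 'http://cnn.com/world'        # sample URL taken from the README output
legacy = u'http://cnn.com/world'    # the u'' literal is still legal on Python 3.3+

print(repr(url))            # the form the updated README now shows
print(url == legacy)        # both literals produce the same str value
print(type(url).__name__)   # the type name is plain "str"
```

On Python 2 the same ``repr()`` call on a unicode object would include the ``u`` prefix, which is what the removed lines showed.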
--- a/download_corpora.py
+++ b/download_corpora.py
@@ -1,4 +1,3 @@
-#!/usr/bin/env python2.7
 # -*- coding: utf-8 -*-
 """
 Downloads the necessary NLTK models and corpora required to support
@@ -11,7 +10,7 @@ REQUIRED_CORPORA = [
    'punkt',  # Required for WordTokenizer
    'maxent_treebank_pos_tagger',  # Required for NLTKTagger
    'movie_reviews',  # Required for NaiveBayesAnalyzer
-    'wordnet', # Required for lemmatization and Wordnet
+    'wordnet',  # Required for lemmatization and Wordnet
    'stopwords'
 ]
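
The diff shows only the ``REQUIRED_CORPORA`` list, not the code that consumes it. As a rough sketch of how such a list could be fed to NLTK, the helper below loops over the names and calls a downloader for each one. ``download_all`` and its ``downloader`` parameter are hypothetical names introduced for illustration, not part of the script; the default assumes ``nltk`` is installed and relies on ``nltk.download(name)`` returning a truthy value on success.

```python
# Sketch only: the real body of download_corpora.py is not shown in this diff.
REQUIRED_CORPORA = [
    'punkt',
    'maxent_treebank_pos_tagger',
    'movie_reviews',
    'wordnet',
    'stopwords',
]


def download_all(corpora, downloader=None):
    """Call downloader(name) for each corpus and return the names that failed.

    `download_all` and `downloader` are hypothetical, for illustration only.
    """
    if downloader is None:
        import nltk  # imported lazily so a dry run works without NLTK installed
        downloader = nltk.download
    failed = []
    for name in corpora:
        if not downloader(name):
            failed.append(name)
    return failed


# Dry run with a stand-in downloader, so nothing is fetched here.
seen = []
failed = download_all(REQUIRED_CORPORA,
                      downloader=lambda name: seen.append(name) or True)
print(seen)
print(failed)
```

Calling ``download_all(REQUIRED_CORPORA)`` without the stand-in would perform the real downloads through NLTK, which needs network access.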