
Add Postgresql Storage option with init and reset of data

This commit is contained in:
Andrei Erdoss
2020-04-05 15:12:48 +03:00
parent 39aff553be
commit 054d30a154
8 changed files with 312 additions and 20 deletions

View File

@@ -1,11 +1,11 @@
# **news-please** #
[![PyPI version](https://img.shields.io/pypi/v/news-please.svg)](https://pypi.org/project/news-please/)
[![Donate](https://img.shields.io/badge/Donate-PayPal-green.svg)](https://www.paypal.com/cgi-bin/webscr?cmd=_s-xclick&hosted_button_id=XX272QZV9A2FN&source=url)
<img align="right" height="128px" width="128px" src="https://raw.githubusercontent.com/fhamborg/news-please/master/misc/logo/logo-256.png" />
news-please is an open source, easy-to-use news crawler that extracts structured information from almost any news website. It can recursively follow internal hyperlinks and read RSS feeds to fetch both the most recent and older, archived articles. You only need to provide the root URL of the news website to crawl it completely. news-please combines the power of multiple state-of-the-art libraries and tools, such as [scrapy](https://scrapy.org/), [Newspaper](https://github.com/codelucas/newspaper), and [readability](https://github.com/buriy/python-readability). news-please also features a library mode, which allows Python developers to use the crawling and extraction functionality within their own programs. Moreover, news-please lets you conveniently [crawl and extract articles](/newsplease/examples/commoncrawl.py) from commoncrawl.org.
If you like news-please and would like to [contribute](#Contribution-and-custom-features) to it, please have a look at our list of [issues that need help](https://github.com/fhamborg/news-please/issues?q=is%3Aissue+is%3Aopen+label%3A%22help+wanted%22). Of course, we are always looking forward to [pull requests](#contribution-and-custom-features) containing bug fixes, improvements, or your own ideas.
@@ -50,9 +50,9 @@ news-please supports three use cases, which are explained in more detail in the
It's super easy, we promise!
### Installation
news-please runs on Python 3.5+.
```
$ pip3 install news-please
```
Some folks from the great conda-forge community are working on [including news-please in conda-forge](https://github.com/conda-forge/staged-recipes/issues/3994); we'll update here once news-please can be installed using conda.
@@ -75,7 +75,7 @@ NewsPlease.from_file(path)
```
or if you have raw HTML data (you can also provide the original URL to increase the accuracy of extracting the publishing date)
```python
NewsPlease.from_html(html, url=None)
```
or if you have a [WARC file](https://github.com/webrecorder/warcio) (also check out our [commoncrawl workflow](https://github.com/fhamborg/news-please/blob/master/newsplease/examples/commoncrawl.py), which provides convenient methods to filter commoncrawl's archive for specific news outlets and dates)
```
@@ -102,7 +102,7 @@ Most likely, you will not want to crawl from the websites provided in our exampl
news-please also supports export to Elasticsearch. Using Elasticsearch will also enable the versioning feature. First, enable it in the [`config.cfg`](https://github.com/fhamborg/news-please/wiki/configuration) in the config directory, which is `~/news-please/config` by default but can be changed to a custom location with the `-c` parameter. If the directory does not exist, a default directory will be created at the specified location.
[Scrapy]
ITEM_PIPELINES = {
'newsplease.pipeline.pipelines.ArticleMasterExtractor':100,
'newsplease.pipeline.pipelines.ElasticsearchStorage':350
@@ -118,25 +118,46 @@ That's it! Except, if your Elasticsearch database is not located at `http://loca
...
# Credentials used for authentication (supports CA-certificates):
use_ca_certificates = False # True if authentication needs to be performed
ca_cert_path = '/path/to/cacert.pem'
client_cert_path = '/path/to/client_cert.pem'
client_key_path = '/path/to/client_key.pem'
username = 'root'
secret = 'password'
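If you want to sanity-check these settings before starting a crawl, here is a minimal, hedged sketch using the same `elasticsearch` client library that news-please itself depends on; host and port are the defaults mentioned above and should be adjusted to your setup:
```python
# Quick reachability check for the configured Elasticsearch instance.
# Host and port are assumptions matching the defaults mentioned above.
from elasticsearch import Elasticsearch

es = Elasticsearch(hosts=[{"host": "localhost", "port": 9200}])
print("Elasticsearch reachable:", es.ping())  # True if the cluster answers
```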
### Postgresql
news-please can also store articles in a PostgreSQL database, including the versioning feature. In the [`config.cfg`](https://github.com/fhamborg/news-please/wiki/configuration) file, add the PostgresqlStorage pipeline and adjust the database credentials:
[Scrapy]
ITEM_PIPELINES = {
'newsplease.pipeline.pipelines.ArticleMasterExtractor':100,
'newsplease.pipeline.pipelines.PostgresqlStorage':350
}
[Postgresql]
# Postgresql connection, required for saving meta-information
host = localhost
port = 5432
database = 'news-please'
user = 'user'
password = 'password'
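The PostgresqlStorage pipeline expects the `CurrentVersions` and `ArchiveVersions` tables to exist; this commit also adds an SQL script that creates them (see further down). Below is a minimal, hedged sketch for loading that schema and checking the connection with the credentials above; the script's file name and location are assumptions:
```python
# Create the news-please tables and verify the [Postgresql] settings above.
# The schema file name is an assumption; point it at wherever you saved the
# CREATE TABLE script included in this commit.
import psycopg2

conn = psycopg2.connect(host="localhost", port=5432,
                        database="news-please", user="user", password="password")
with conn.cursor() as cur:
    cur.execute(open("init-db.sql").read())   # file name assumed
    conn.commit()
    cur.execute("SELECT count(*) FROM CurrentVersions")
    print("rows in CurrentVersions:", cur.fetchone()[0])
conn.close()
```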
### What's next?
We have collected a bunch of useful information for both [users](https://github.com/fhamborg/news-please/wiki/user-guide) and [developers](https://github.com/fhamborg/news-please/wiki/developer-guide). As a user, you will most likely only deal with two files: [`sitelist.hjson`](https://github.com/fhamborg/news-please/wiki/user-guide#sitelisthjson) (to define sites to be crawled) and [`config.cfg`](https://github.com/fhamborg/news-please/wiki/configuration) (probably only rarely, in case you want to tweak the configuration).
## Wiki and support (also, how to open an issue)
You can find more information on usage and development in our [wiki](https://github.com/fhamborg/news-please/wiki)! Before contacting us, please check out the wiki. If you still have questions on how to use news-please, please create a new [issue](https://github.com/fhamborg/news-please/issues) on GitHub. Please understand that we are not able to provide individual support via email. We think that help is more valuable if it is shared publicly so that more people can benefit from it.
### Issues
For bug reports, we ask you to use the Bug report template. Make sure you're using the latest version of news-please, since we cannot give support for older versions. Unfortunately, we cannot give support for issues or questions sent by email.
### Donation
Your donations are greatly appreciated! They will free me up to work on this project more, to take on tasks such as adding new features, providing bug-fix support, and addressing further concerns with the library.
* [GitHub Sponsors](https://github.com/sponsors/fhamborg)
* [PayPal](https://www.paypal.com/cgi-bin/webscr?cmd=_s-xclick&hosted_button_id=XX272QZV9A2FN&source=url)
@@ -179,11 +200,11 @@ Do you want to contribute? Great, we are always happy for any support on this pr
Please note that we usually do not have enough resources to implement features requested by users; instead, we recommend implementing them yourself and sending a pull request.
By contributing to this project, you agree that your contributions will be licensed under the project's license (see below).
## License
Licensed under the Apache License, Version 2.0 (the "License"); you may not use news-please except in compliance with the License. A copy of the License is included in the project, see the file [LICENSE.txt](LICENSE.txt).
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. The news-please logo is courtesy of [Mario Hamborg](https://mario.hamborg.eu/).
Copyright 2016-2020 The news-please team

View File

@@ -10,6 +10,7 @@ from subprocess import Popen
import plac
import pymysql
import psycopg2
from elasticsearch import Elasticsearch
from scrapy.utils.log import configure_logging
@@ -50,6 +51,7 @@ class NewsPleaseLauncher(object):
shutdown = False
thread_event = None
mysql = None
postgresql = None
elasticsearch = None
number_of_active_crawlers = 0
config_directory_default_path = "~/news-please-repo/config/"
@@ -58,8 +60,8 @@ class NewsPleaseLauncher(object):
__single_crawler = False
def __init__(self, cfg_directory_path, is_resume, is_reset_elasticsearch, is_reset_json, is_reset_mysql,
is_no_confirm, library_mode=False):
def __init__(self, cfg_directory_path, is_resume, is_reset_elasticsearch,
is_reset_json, is_reset_mysql, is_reset_postgresql, is_no_confirm, library_mode=False):
"""
The constructor of the main class, thus the real entry point to the tool.
:param cfg_file_path:
@@ -67,6 +69,7 @@ class NewsPleaseLauncher(object):
:param is_reset_elasticsearch:
:param is_reset_json:
:param is_reset_mysql:
:param is_reset_postgresql:
:param is_no_confirm:
"""
configure_logging({"LOG_LEVEL": "ERROR"})
@@ -104,17 +107,20 @@ class NewsPleaseLauncher(object):
self.cfg = CrawlerConfig.get_instance()
self.cfg.setup(self.cfg_file_path)
self.mysql = self.cfg.section("MySQL")
self.postgresql = self.cfg.section("Postgresql")
self.elasticsearch = self.cfg.section("Elasticsearch")
# perform reset if given as parameter
if is_reset_mysql:
self.reset_mysql()
if is_reset_postgresql:
self.reset_postgresql()
if is_reset_json:
self.reset_files()
if is_reset_elasticsearch:
self.reset_elasticsearch()
# close the process
if is_reset_elasticsearch or is_reset_json or is_reset_mysql:
if is_reset_elasticsearch or is_reset_json or is_reset_mysql or is_reset_postgresql:
sys.exit(0)
self.json_file_path = self.cfg_directory_path + self.cfg.section('Files')['url_input_file_name']
@@ -398,6 +404,50 @@ Cleanup MySQL database:
pymysql.IntegrityError, TypeError) as error:
self.log.error("Database reset error: %s", error)
def reset_postgresql(self):
"""
Resets the Postgresql database.
"""
confirm = self.no_confirm
print("""
Cleanup Postgresql database:
This will truncate all tables and reset the whole database.
""")
if not confirm:
confirm = 'yes' in builtins.input(
"""
Do you really want to do this? Write 'yes' to confirm: {yes}"""
.format(yes='yes' if confirm else ''))
if not confirm:
print("Did not type yes. Thus aborting.")
return
print("Resetting database...")
try:
# initialize DB connection
self.conn = psycopg2.connect(host=self.postgresql["host"],
port=self.postgresql["port"],
database=self.postgresql["database"],
user=self.postgresql["user"],
password=self.postgresql["password"])
self.cursor = self.conn.cursor()
self.cursor.execute("TRUNCATE TABLE CurrentVersions RESTART IDENTITY")
self.cursor.execute("TRUNCATE TABLE ArchiveVersions RESTART IDENTITY")
self.conn.commit()
self.cursor.close()
except psycopg2.DatabaseError as error:
self.log.error("Database reset error: %s", error)
finally:
if self.conn is not None:
self.conn.close()
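The same reset can be triggered from Python by passing the new `is_reset_postgresql` flag to the launcher constructor shown above. A hedged sketch follows; the import path of the launcher module is an assumption, and note that the launcher exits the process once any reset flag has been handled:
```python
# One-off maintenance script: truncate the Postgresql tables without prompting.
# Keyword names follow the constructor signature above; the module path is assumed.
from newsplease.__main__ import NewsPleaseLauncher

NewsPleaseLauncher(cfg_directory_path="~/news-please-repo/config/",  # default path from above
                   is_resume=False,
                   is_reset_elasticsearch=False,
                   is_reset_json=False,
                   is_reset_mysql=False,
                   is_reset_postgresql=True,
                   is_no_confirm=True)  # the launcher calls sys.exit(0) after resetting
```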
def reset_elasticsearch(self):
"""
Resets the Elasticsearch Database.
@@ -628,21 +678,23 @@ Cleanup files:
reset_elasticsearch=plac.Annotation('reset Elasticsearch indexes', 'flag'),
reset_json=plac.Annotation('reset JSON files', 'flag'),
reset_mysql=plac.Annotation('reset MySQL database', 'flag'),
reset_postgresql=plac.Annotation('reset Postgresql database', 'flag'),
reset_all=plac.Annotation('combines all reset options', 'flag'),
no_confirm=plac.Annotation('skip confirm dialogs', 'flag')
)
def cli(cfg_file_path, resume, reset_elasticsearch, reset_mysql, reset_json, reset_all, no_confirm):
def cli(cfg_file_path, resume, reset_elasticsearch, reset_mysql, reset_postgresql, reset_json, reset_all, no_confirm):
"A generic news crawler and extractor."
if reset_all:
reset_elasticsearch = True
reset_json = True
reset_mysql = True
reset_postgresql = True
if cfg_file_path and not cfg_file_path.endswith(os.path.sep):
cfg_file_path += os.path.sep
NewsPleaseLauncher(cfg_file_path, resume, reset_elasticsearch, reset_json, reset_mysql, no_confirm)
NewsPleaseLauncher(cfg_file_path, resume, reset_elasticsearch, reset_json, reset_mysql, reset_postgresql, no_confirm)
def main():

View File

@@ -190,6 +190,15 @@ username = 'root'
password = 'password'
[Postgresql]
# Postgresql connection, required for saving meta-information
host = localhost
port = 5432
database = 'news-please'
user = 'root'
password = 'password'
[Elasticsearch]

View File

@@ -196,6 +196,15 @@ username = 'root'
password = 'password'
[Postgresql]
# Postgresql connection, required for saving meta-information
host = localhost
port = 5432
database = 'news-please'
user = 'root'
password = 'password'
[Elasticsearch]

View File

@@ -0,0 +1,53 @@
--
-- Table structure for table ArchiveVersions
--
DROP TABLE IF EXISTS ArchiveVersions;
CREATE TABLE ArchiveVersions (
id SERIAL PRIMARY KEY,
date_modify timestamp(0) NOT NULL,
date_download timestamp(0) NOT NULL,
localpath varchar(255) NOT NULL,
filename varchar(2000) NOT NULL,
source_domain varchar(255) NOT NULL,
url varchar(2000) NOT NULL,
image_url varchar(2000),
title varchar(255) NOT NULL,
title_page varchar(255) NOT NULL,
title_rss varchar(255),
maintext text NOT NULL,
description text,
date_publish timestamp(0),
authors varchar(255) ARRAY,
language varchar(255),
ancestor int NOT NULL DEFAULT 0,
descendant int NOT NULL,
version int NOT NULL DEFAULT 2
);
--
-- Table structure for table CurrentVersions
--
DROP TABLE IF EXISTS CurrentVersions;
CREATE TABLE CurrentVersions (
id SERIAL PRIMARY KEY,
date_modify timestamp(0) NOT NULL,
date_download timestamp(0) NOT NULL,
localpath varchar(255) NOT NULL,
filename varchar(2000) NOT NULL,
source_domain varchar(255) NOT NULL,
url varchar(2000) NOT NULL,
image_url varchar(2000),
title varchar(255) NOT NULL,
title_page varchar(255) NOT NULL,
title_rss varchar(255),
maintext text NOT NULL,
description text,
date_publish timestamp(0),
authors varchar(255) ARRAY,
language varchar(255),
ancestor int NOT NULL DEFAULT 0,
descendant int NOT NULL DEFAULT 0,
version int NOT NULL DEFAULT 1
);
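With this schema, every re-crawl of a URL bumps `version`: the previous row is copied to `ArchiveVersions` with `descendant` pointing at the new row's id, while the new row in `CurrentVersions` keeps `ancestor` pointing back at the archived one (see the pipeline further down). A hedged sketch of listing all stored revisions of one article; the connection values mirror the `[Postgresql]` config section and the URL is a placeholder:
```python
# List every stored revision of a single article URL, oldest first.
# Credentials mirror the [Postgresql] config section; the URL is hypothetical.
import psycopg2

conn = psycopg2.connect(host="localhost", port=5432,
                        database="news-please", user="user", password="password")
url = "https://example.com/some-article"
with conn.cursor() as cur:
    cur.execute("SELECT id, version, date_download, title FROM ArchiveVersions"
                " WHERE url = %s ORDER BY version", (url,))
    rows = cur.fetchall()
    cur.execute("SELECT id, version, date_download, title FROM CurrentVersions"
                " WHERE url = %s", (url,))
    rows += cur.fetchall()
for row in rows:
    print(row)
conn.close()
```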

View File

@@ -9,6 +9,7 @@ import os.path
import sys
import pymysql
import psycopg2
from dateutil import parser as dateparser
from elasticsearch import Elasticsearch
from scrapy.exceptions import DropItem
@@ -263,7 +264,6 @@ class MySQLStorage(object):
# Close DB connection - garbage collection
self.conn.close()
class ExtractedInformationStorage(object):
"""
Provides basic functionality for Storages
@@ -344,6 +344,152 @@ class ExtractedInformationStorage(object):
news_article.url = item['url']
return news_article
class PostgresqlStorage(ExtractedInformationStorage):
"""
Handles remote storage of the meta data in the DB
"""
log = None
cfg = None
database = None
conn = None
cursor = None
# initialize necessary DB queries for this pipe
compare_versions = ("SELECT * FROM CurrentVersions WHERE url=%s")
insert_current = ("INSERT INTO CurrentVersions(date_modify,date_download, \
localpath,filename,source_domain, \
url,image_url,title,title_page, \
title_rss,maintext,description, \
date_publish,authors,language, \
ancestor,descendant,version) \
VALUES (%(date_modify)s,%(date_download)s, \
%(localpath)s,%(filename)s,%(source_domain)s, \
%(url)s,%(image_url)s,%(title)s,%(title_page)s, \
%(title_rss)s,%(maintext)s,%(description)s, \
%(date_publish)s,%(authors)s,%(language)s, \
%(ancestor)s,%(descendant)s,%(version)s) \
RETURNING id")
insert_archive = ("INSERT INTO ArchiveVersions(id,date_modify,date_download,\
localpath,filename,source_domain, \
url,image_url,title,title_page, \
title_rss,maintext,description, \
date_publish,authors,language, \
ancestor,descendant,version) \
VALUES (%(db_id)s,%(date_modify)s,%(date_download)s, \
%(localpath)s,%(filename)s,%(source_domain)s, \
%(url)s,%(image_url)s,%(title)s,%(title_page)s, \
%(title_rss)s,%(maintext)s,%(description)s, \
%(date_publish)s,%(authors)s,%(language)s, \
%(ancestor)s,%(descendant)s,%(version)s)")
delete_from_current = ("DELETE FROM CurrentVersions WHERE id = %s")
# init database connection
def __init__(self):
# import logging
self.log = logging.getLogger(__name__)
self.cfg = CrawlerConfig.get_instance()
self.database = self.cfg.section("Postgresql")
# Establish DB connection
# Closing of the connection is handled once the spider closes
self.conn = psycopg2.connect(host=self.database["host"],
port=self.database["port"],
database=self.database["database"],
user=self.database["user"],
password=self.database["password"])
self.cursor = self.conn.cursor()
def process_item(self, item, spider):
"""
Store the item in the DB.
First, determine whether a version of the article already exists;
if so, 'migrate' the older version to the archive table.
Then store the new article in the current-versions table.
"""
# Set defaults
version = 1
ancestor = 0
# Search the CurrentVersion table for an old version of the article
try:
self.cursor.execute(self.compare_versions, (item['url'],))
except psycopg2.DatabaseError as error:
self.log.error("Something went wrong in query: %s", error)
# Save the result of the query. Must be done before the add,
# otherwise the result will be overwritten in the buffer
old_version = self.cursor.fetchone()
if old_version is not None:
old_version_list = {
'db_id': old_version[0],
'date_modify': old_version[1],
'date_download': old_version[2],
'localpath': old_version[3],
'filename': old_version[4],
'source_domain': old_version[5],
'url': old_version[6],
'image_url': old_version[7],
'title': old_version[8],
'title_page': old_version[9],
'title_rss': old_version[10],
'maintext': old_version[11],
'description': old_version[12],
'date_publish': old_version[13],
'authors': old_version[14],
'language': old_version[15],
'ancestor': old_version[16],
'descendant': old_version[17],
'version': old_version[18] }
# Update the version number and the ancestor variable for later references
version = (old_version[18] + 1)
ancestor = old_version[0]
# Add the new version of the article to the CurrentVersion table
current_version_list = ExtractedInformationStorage.extract_relevant_info(item)
current_version_list['ancestor'] = ancestor
current_version_list['descendant'] = 0
current_version_list['version'] = version
try:
self.cursor.execute(self.insert_current, current_version_list)
self.conn.commit()
self.log.info("Article inserted into the database.")
except psycopg2.DatabaseError as error:
self.log.error("Something went wrong in commit: %s", error)
# Move the old version from the CurrentVersion table to the ArchiveVersions table
if old_version is not None:
# Set descendant attribute
try:
old_version_list['descendant'] = self.cursor.fetchone()[0]
except psycopg2.DatabaseError as error:
self.log.error("Something went wrong in id query: %s", error)
# Delete the old version of the article from the CurrentVersion table
try:
self.cursor.execute(self.delete_from_current, (old_version_list['db_id'],))
self.conn.commit()
except psycopg2.DatabaseError as error:
self.log.error("Something went wrong in delete: %s", error)
# Add the old version to the ArchiveVersion table
try:
self.cursor.execute(self.insert_archive, old_version_list)
self.conn.commit()
self.log.info("Moved old version of an article to the archive.")
except psycopg2.DatabaseError as error:
self.log.error("Something went wrong in archive: %s", error)
return item
def close_spider(self, spider):
# Close DB connection - garbage collection
self.conn.close()
class InMemoryStorage(ExtractedInformationStorage):
"""

View File

@@ -2,6 +2,7 @@ pywin32>=220 ; sys_platform == 'win32'
lxml>=3.3.5
Scrapy>=1.1.0
PyMySQL>=0.7.9
psycopg2>=2.8.4
hjson>=1.5.8
elasticsearch>=2.4
beautifulsoup4>=4.3.2

View File

@@ -32,6 +32,7 @@ news-please is an open source, easy-to-use news crawler that extracts structured
install_requires=[
'Scrapy>=1.1.0',
'PyMySQL>=0.7.9',
'psycopg2>=2.8.4',
'hjson>=1.5.8',
'elasticsearch>=2.4',
'beautifulsoup4>=4.3.2',