
Preprocess NLP Text

Framework Description

A simple framework for preprocessing or cleaning text in parallel by leveraging Python's multiprocessing library. Written entirely in Python, this repo offers an easy way to preprocess text through a set of defined stages implemented with standard Natural Language Processing (NLP) techniques. It supports both sequential and parallel preprocessing, with a user-defined number of processes. It also includes a module that finds the top words in the corpus and lets you choose a threshold for which words stay in the corpus; the remaining words are replaced, which is a simple way to reduce vocabulary size and shrink the input vectors fed to deep learning models. This module also supports both sequential and parallel execution.
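
The vocabulary-reduction idea can be illustrated in a few lines of plain Python. The sketch below is a minimal, hypothetical example of the technique (count word frequencies, keep the top-k words, replace the rest with a placeholder token); it is not the actual vocab_elimination_nlp.py interface, whose function names, thresholding scheme, and defaults may differ.

    from collections import Counter

    def reduce_vocab(docs, top_k=10000, unk_token="<UNK>"):
        # Count word frequencies across the whole corpus
        counts = Counter(word for doc in docs for word in doc.split())
        # Keep only the top_k most frequent words
        keep = {word for word, _ in counts.most_common(top_k)}
        # Replace every other word with a placeholder token
        return [
            " ".join(word if word in keep else unk_token for word in doc.split())
            for doc in docs
        ]

    docs = ["the cat sat on the mat", "the dog sat on the log"]
    print(reduce_vocab(docs, top_k=4))  # rare words become <UNK>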

Various stages of preprocessing include:

Stage                  Description
remove_tags_nonascii   Removes HTML tags, emails, URLs, and non-ASCII characters, and converts accented characters
lower_case             Converts the text to lower case
expand_contractions    Expands word contractions
remove_punctuation     Removes punctuation from text; sentences remain separated by ' . '
remove_esacape_chars   Removes escape characters like \n, \t, etc.
remove_stopwords       Removes stopwords using NLTK
remove_numbers         Removes all digits in the text
lemmatize              Uses WordNetLemmatizer to lemmatize text
stemming               Uses SnowballStemmer to stem text
min_word_len           Minimum word length for a word to be kept in the text
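
As an illustration of what these stages typically do, the following sketch applies a few of them (lower-casing, number and punctuation removal, stopword removal, and lemmatization) using standard library and NLTK calls. It is a generic example of the techniques, not the repo's internal implementation.

    import re
    from nltk.corpus import stopwords
    from nltk.stem import WordNetLemmatizer

    # One-time downloads: nltk.download('stopwords'); nltk.download('wordnet')
    text = "The 2 Striped BATS were hanging on their FEET!"

    text = text.lower()                    # lower_case
    text = re.sub(r"\d+", " ", text)       # remove_numbers
    text = re.sub(r"[^\w\s.]", " ", text)  # remove punctuation, keeping '.'

    stop_words = set(stopwords.words("english"))
    lemmatizer = WordNetLemmatizer()
    tokens = [lemmatizer.lemmatize(w) for w in text.split() if w not in stop_words]
    print(" ".join(tokens))  # -> "striped bat hanging foot"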

Code - Components

Various Python files and their purposes are mentioned here:

How to run

  1. pip install -r requirements.txt
  2. Import preprocess_nlp.py and use the functions preprocess_nlp (for sequential) and asyn_call_preprocess (for parallel) as defined in the notebook; a hedged usage sketch follows this list
  3. Import vocab_elimination_nlp.py and use the functions as defined in the Python file.
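
The exact signatures of preprocess_nlp and asyn_call_preprocess (how stages are passed, parameter names, defaults) are assumptions in the sketch below, made for illustration; refer to the notebook and the docstrings for the real interface.

    from preprocess_nlp import preprocess_nlp, asyn_call_preprocess

    sentences = [
        "Contact me at <b>john@example.com</b>!!",
        "Visit https://example.com NOW, it's great...",
    ]
    # Stage names taken from the table above; passing them as a list is an assumption
    stages = ['remove_tags_nonascii', 'lower_case', 'expand_contractions',
              'remove_punctuation', 'remove_stopwords']

    cleaned = preprocess_nlp(sentences, stages)                    # sequential (assumed signature)
    cleaned_parallel = asyn_call_preprocess(sentences, stages, 4)  # parallel, 4 processes (assumed signature)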

Sequential & Parallel Processing

  1. Sequential - Processes records in sequential order; does not consume much CPU memory, but is slower than parallel processing
  2. Parallel - Creates multiple processes (customizable/user-defined) to preprocess text in parallel; memory-intensive but faster. A sketch of the general pattern follows this list.
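
To make the trade-off concrete, the sketch below shows the general pattern such parallel preprocessing follows: records are split across worker processes with multiprocessing.Pool, each worker cleans its share, and the results are gathered back. This is an illustration of the approach, not the repo's exact implementation.

    from multiprocessing import Pool

    def clean_record(text):
        # Stand-in for the real preprocessing stages
        return text.lower().strip()

    def preprocess_sequential(records):
        # Single process: low memory footprint, slower on large corpora
        return [clean_record(r) for r in records]

    def preprocess_parallel(records, n_processes=4):
        # Multiple processes: each cleans a slice of the records; faster but memory-intensive
        with Pool(processes=n_processes) as pool:
            return pool.map(clean_record, records)

    if __name__ == "__main__":
        print(preprocess_parallel(["  Hello WORLD  ", "  Foo BAR  "], n_processes=2))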

Future Updates

  • Feature extraction like Nouns, Verbs, Adjectives, Numbers, Noun Phrases, NERs, Keywords
  • Vectorization tools like TF-IDF, GloVe, Word2Vec, Bag of Words

Refer to the code for docstrings and other function-related documentation.
Cheers :)
