alihan/clean-text-nlp-preprocessing

mirror of https://github.com/jfilter/clean-text.git synced 2021-09-19 22:32:58 +03:00

Johannes Filter 4fd108bb63 0.1.0

2019-04-24 18:40:20 +02:00

README.md

clean-text

Clean your text with clean-text to create normalized text representations. For instance, turn this corrupted input:

A bunch of \\u2018new\\u2019 references, including [Moana](https://en.wikipedia.org/wiki/Moana_%282016_film%29).


»Yóù àré     rïght &lt;3!«

into this clean output:

A bunch of 'new' references, including [moana](<URL>).

"you are right <3!"

clean-text uses ftfy, unidecode and numerous hand-crafted rules, i.e., regular expressions.

Installation

pip install clean-text[gpl]

This will install the GPL-licensed package unidecode. If it is not available, clean-text falls back to Python's unicodedata.normalize for transliteration. Transliteration maps Unicode symbols to their closest ASCII equivalent, so ê becomes e. You may also disable this feature altogether by passing to_ascii=False.

To install clean-text without unidecode:

pip install clean-text
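
The fallback described above can be sketched roughly as follows. This is a minimal illustration, not clean-text's actual code: the function name to_ascii_sketch is hypothetical, and the choice of the NFKD normalization form is an assumption about how unicodedata.normalize would typically be used for this purpose.

```python
import unicodedata

try:
    # Optional GPL-licensed dependency, only present with the [gpl] extra.
    from unidecode import unidecode
except ImportError:
    unidecode = None


def to_ascii_sketch(text: str) -> str:
    """Hypothetical helper: transliterate to ASCII, preferring unidecode."""
    if unidecode is not None:
        return unidecode(text)
    # Fallback (assumed approach): decompose accented characters with NFKD,
    # then drop every code point that has no ASCII representation.
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


print(to_ascii_sketch("ê"))  # e
```

The two paths agree on simple accented letters such as ê, but the fallback silently drops symbols that unidecode would spell out, so results can differ on other input.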

Usage

from cleantext import clean

clean("some input",
    fix_unicode=True, # fix various unicode errors
    to_ascii=True, # transliterate to closest ASCII representation
    lower=True, # lowercase text
    no_line_breaks=False, # fully strip linebreaks
    no_urls=False, # replace all URLs with a special token
    no_emails=False, # replace all email addresses with a special token
    no_phone_numbers=False, # replace all phone numbers with a special token
    no_numbers=False, # replace all numbers with a special token
    no_digits=False, # replace all digits with a special token
    no_currency_symbols=False, # replace all currency symbols with a special token
    no_punct=False, # fully remove punctuation
    replace_with_url="<URL>",
    replace_with_email="<EMAIL>",
    replace_with_phone_number="<PHONE>",
    replace_with_number="<NUMBER>",
    replace_with_digit="0",
    replace_with_currency_symbol="<CUR>",
    lang="en" # change to 'de' for German special handling
)

Carefully choose the arguments that fit your task. The default parameters are listed above. Whitespace is always normalized.
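
To give a feel for what the no_urls and no_emails options and the whitespace normalization do, here is a simplified, self-contained sketch. It does not use clean-text itself: the function replace_tokens_sketch and its regular expressions are illustrative assumptions and are far cruder than the library's hand-crafted rules. Only the parameter names replace_with_url and replace_with_email are taken from the signature above.

```python
import re

# Deliberately simple patterns, for illustration only.
URL_RE = re.compile(r"https?://\S+")
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
SPACES_RE = re.compile(r"[ \t]+")


def replace_tokens_sketch(text, replace_with_url="<URL>", replace_with_email="<EMAIL>"):
    """Hypothetical helper: swap URLs and emails for tokens, tidy spacing."""
    text = URL_RE.sub(replace_with_url, text)
    text = EMAIL_RE.sub(replace_with_email, text)
    # Collapse runs of spaces and tabs; line breaks are left untouched.
    return SPACES_RE.sub(" ", text).strip()


print(replace_tokens_sketch("Mail  me@example.com or visit https://example.com/docs  now"))
# Mail <EMAIL> or visit <URL> now
```

Replacing such items with a fixed token keeps the information that a URL or address was present while removing the variation between individual values.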

You may also only use specific functions for cleaning. For this, take a look at the source code.

Development

install Pipenv
get the package: git clone https://github.com/jfilter/clean-text && cd clean-text && pipenv install
run tests: pipenv run pytest

Contributing

If you have a question, found a bug or want to propose a new feature, have a look at the issues page.

Pull requests are especially welcome when they fix bugs or improve the code quality.

If you don't like the output of clean-text, consider adding a test with your specific input and desired output.

Acknowledgements

Built upon Burton DeWilde's work for Textacy.

License

Apache