mirror of https://github.com/jfilter/clean-text.git
synced 2021-09-19 22:32:58 +03:00

improve README

README.md
@@ -1,4 +1,4 @@
-# clean-text
+# `clean-text` [](https://travis-ci.com/jfilter/clean-text) [](https://pypi.org/project/clean-text/) [](https://pypi.org/project/clean-text/)
 
 Clean your text with `clean-text` to create normalized text representations. For instance, turn this corrupted input:
@@ -9,7 +9,7 @@ A bunch of \\u2018new\\u2019 references, including [Moana](https://en.wikipedia.
 »Yóù àré rïght <3!«
 ```
 
-into this
+into this clean output:
 
 ```txt
 A bunch of 'new' references, including [moana](<URL>).
@@ -21,40 +21,46 @@ A bunch of 'new' references, including [moana](<URL>).
 
 ## Installation
 
+To install the GPL-licensed package [unidecode](https://github.com/takluyver/Unidecode) alongside:
+
+```bash
+pip install clean-text[gpl]
+```
+
-This will install the GPL-licensed package [unidecode](https://github.com/takluyver/Unidecode). If it is not available, `clean-text` will resort to Python's [unicodedata.normalize](https://docs.python.org/3.7/library/unicodedata.html#unicodedata.normalize) for [transliteration](https://en.wikipedia.org/wiki/Transliteration). Unicode symbols are encoded to their closest ASCII equivalent. So `ê` gets turned into `e`. However, you may also disable this feature altogether.
+You may want to abstain from GPL:
 
 ```bash
 pip install clean-text
 ```
 
+If [unidecode](https://github.com/takluyver/Unidecode) is not available, `clean-text` will resort to Python's [unicodedata.normalize](https://docs.python.org/3.7/library/unicodedata.html#unicodedata.normalize) for [transliteration](https://en.wikipedia.org/wiki/Transliteration).
+Transliteration to the closest ASCII symbols involves manual mappings, i.e., `ê` to `e`. Unidecode's hand-crafted mappings are superior, but unicodedata's are sufficient.
+However, you may want to disable this feature altogether, depending on your data and use case.
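The standard-library fallback described above can be sketched in a few lines. This is a minimal illustration of the `unicodedata.normalize` approach, not `clean-text`'s actual implementation (the function name `fallback_to_ascii` is made up for this sketch):

```python
import unicodedata


def fallback_to_ascii(text: str) -> str:
    # NFKD decomposes accented characters ("ê" becomes "e" plus a
    # combining circumflex); encoding to ASCII with errors="ignore"
    # then drops the combining marks and any other non-ASCII symbols.
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


print(fallback_to_ascii("Yóù àré rïght"))  # You are right
print(fallback_to_ascii("ê"))             # e
```

Note the difference from unidecode: characters with no canonical decomposition (e.g. `»` or `ß`) are silently dropped here, whereas unidecode's hand-crafted tables map many of them to ASCII lookalikes.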
 
 ## Usage
 
 ```python
 from cleantext import clean
 
 clean("some input",
     fix_unicode=True,               # fix various unicode errors
     to_ascii=True,                  # transliterate to closest ASCII representation
     lower=True,                     # lowercase text
-    no_line_breaks=False,           # fully strip linebreaks
+    no_line_breaks=False,           # fully strip line breaks as opposed to only normalizing them
     no_urls=False,                  # replace all URLs with a special token
     no_emails=False,                # replace all email addresses with a special token
     no_phone_numbers=False,         # replace all phone numbers with a special token
     no_numbers=False,               # replace all numbers with a special token
     no_digits=False,                # replace all digits with a special token
     no_currency_symbols=False,      # replace all currency symbols with a special token
     no_punct=False,                 # fully remove punctuation
     replace_with_url="<URL>",
     replace_with_email="<EMAIL>",
     replace_with_phone_number="<PHONE>",
     replace_with_number="<NUMBER>",
     replace_with_digit="0",
     replace_with_currency_symbol="<CUR>",
-    lang="en"                       # change to 'de' for German special handling
+    lang="en"                       # set to 'de' for German special handling
 )
 ```
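The `no_urls`/`replace_with_url` pair above (and its siblings for emails, phone numbers, etc.) boils down to substituting every match of a pattern with a special token. A minimal regex sketch of that idea — the pattern and function name here are hypothetical, and `clean-text`'s real URL pattern is far more thorough:

```python
import re

# Deliberately simplistic pattern for illustration only.
URL_PATTERN = re.compile(r"https?://\S+")


def replace_urls(text: str, replace_with: str = "<URL>") -> str:
    # Every match of the pattern is collapsed into the special token,
    # so downstream models see one vocabulary item instead of raw URLs.
    return URL_PATTERN.sub(replace_with, text)


print(replace_urls("docs at https://example.com/page today"))  # docs at <URL> today
```

Replacing rather than deleting keeps the sentence structure intact, which is why the defaults expose a configurable token per category.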
@@ -62,6 +68,8 @@ Carefully choose the arguments that fit your task. The default parameters are li
 
 You may also only use specific functions for cleaning. For this, take a look at the [source code](https://github.com/jfilter/clean-text/blob/master/cleantext/clean.py).
 
+So far, only English and German are fully supported. It should work for the majority of Western languages. If you need some special handling for your language, feel free to contribute. 🙃
 
 ## Development
 
 - install [Pipenv](https://pipenv.readthedocs.io/en/latest/)
@@ -76,6 +84,11 @@ If you have a **question**, found a **bug** or want to propose a new **feature**
 
+If you don't like the output of `clean-text`, consider adding a [test](https://github.com/jfilter/clean-text/tree/master/tests) with your specific input and desired output.
+
+## Related Work
+
+- https://github.com/pudo/normality
+- https://github.com/davidmogar/cucco
 
 ## Acknowledgements
 
 Built upon the work of [Burton DeWilde](https://github.com/bdewilde) for [Textacy](https://github.com/chartbeat-labs/textacy).