mirror of https://github.com/jfilter/clean-text.git synced 2021-09-19 22:32:58 +03:00

further improve docs

Johannes Filter
2019-04-24 18:35:36 +02:00
parent 220b63fe87
commit 5f7a32f1f6
2 changed files with 11 additions and 8 deletions


@@ -1,9 +1,9 @@
# clean-text
Clean your text with `clean-text` to create normalized text represenations. For instance, turn this corrupted input:
Clean your text with `clean-text` to create normalized text representations. For instance, turn this corrupted input:
```txt
There's a bunch of \\u2018new\\u2019 references, including [Moana](https://en.wikipedia.org/wiki/Moana_%282016_film%29).
A bunch of \\u2018new\\u2019 references, including [Moana](https://en.wikipedia.org/wiki/Moana_%282016_film%29).
»Yóù àré rïght <3!«
@@ -12,12 +12,12 @@ There's a bunch of \\u2018new\\u2019 references, including [Moana](https://en.wi
into this
```txt
there's a bunch of 'new' references, including [moana](<URL>).
A bunch of 'new' references, including [moana](<URL>).
"you are right <3!"
```
`clean-text` uses [ftfy](https://github.com/LuminosoInsight/python-ftfy), [unidecode](https://github.com/takluyver/Unidecode) and numerous hand-crafted rules such as RegEx.
`clean-text` uses [ftfy](https://github.com/LuminosoInsight/python-ftfy), [unidecode](https://github.com/takluyver/Unidecode) and numerous hand-crafted rules, i.e., RegEx.
## Installation
@@ -25,7 +25,7 @@ there's a bunch of 'new' references, including [moana](<URL>).
pip install clean-text[gpl]
```
This will install the GPL-licensed package [unidecode](https://github.com/takluyver/Unidecode). If it is not available, `clean-text` will resort to Python's [unicodedata.normalize](https://docs.python.org/3.7/library/unicodedata.html#unicodedata.normalize). This is used for transliteration. So `ê` gets turned into `e`. So you can also install it without unidecode.
This will install the GPL-licensed package [unidecode](https://github.com/takluyver/Unidecode). If it is not available, `clean-text` will resort to Python's [unicodedata.normalize](https://docs.python.org/3.7/library/unicodedata.html#unicodedata.normalize) for [transliteration](https://en.wikipedia.org/wiki/Transliteration). Unicode symbols are encoded to their closest ASCII equivalent. So `ê` gets turned into `e`. However, you may also disable this feature altogether.
```bash
pip install clean-text
@@ -60,6 +60,8 @@ clean("some input",
Carefully choose the arguments that fit your task. The default parameters are listed above. Whitespace is always normalized.
You may also only use specific functions for cleaning. For this, take a look at the [source code](https://github.com/jfilter/clean-text/blob/master/cleantext/clean.py).
## Development
- install [Pipenv](https://pipenv.readthedocs.io/en/latest/)
@@ -76,7 +78,7 @@ If you don't like the output of `clean-text`, consider adding a [test](https://g
## Acknowledgements
Built upon the work by [Burton DeWilde](https://github.com/bdewilde)'s [Textacy](https://github.com/chartbeat-labs/textacy).
Built upon the work by [Burton DeWilde](https://github.com/bdewilde) for [Textacy](https://github.com/chartbeat-labs/textacy).
## License
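
Note on the transliteration fallback described in the README hunk above: a minimal sketch of how `unicodedata.normalize` can map accented characters to ASCII using only the standard library. The helper name `to_ascii_fallback` and the choice of the `NFKD` form are assumptions for illustration; this is not necessarily how `clean-text` implements its fallback.

```python
import unicodedata


def to_ascii_fallback(text: str) -> str:
    """Hypothetical helper: approximate transliteration without unidecode."""
    # NFKD splits a character such as 'ê' into the base letter 'e'
    # followed by a combining circumflex accent.
    decomposed = unicodedata.normalize("NFKD", text)
    # Encoding to ASCII with errors="ignore" drops the combining marks
    # (and any other character that has no ASCII form).
    return decomposed.encode("ascii", "ignore").decode("ascii")


print(to_ascii_fallback("ê"))
print(to_ascii_fallback("Yóù àré rïght"))
```

Characters without a decomposition to ASCII (for example `»` and `«`) are simply dropped by this approach, which is one reason a dedicated transliteration package such as unidecode can give better results.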


@@ -7,6 +7,7 @@ with open("README.md", "r") as fh:
classifiers = [
'Programming Language :: Python :: 3.5',
'Programming Language :: Python :: 3.6',
'Programming Language :: Python :: 3.7',
'License :: OSI Approved :: MIT License',
]
@@ -14,11 +15,11 @@ version = '0.0.0'
setup(name='cleantext',
version=version,
description='Clean your dirty text',
description='Clean Your Text to Create Normalized Text Representations',
long_description=long_description,
long_description_content_type="text/markdown",
author='Johannes Filter',
author_email='ragha@outlook.com, hi@jfilter.de',
author_email='hi@jfilter.de',
url='https://github.com/jfilter/clean-text',
license='MIT',
install_requires=['ftfy'],