* Guided KeyBERT
* Update default SBERT model
Maarten Grootendorst
2021-09-28 15:29:27 +02:00
committed by GitHub
parent c8c6993b30
commit 6ab9af1cfe
12 changed files with 113 additions and 29 deletions


@@ -26,6 +26,6 @@ jobs:
- name: Install dependencies
run: |
python -m pip install --upgrade pip
pip install -e ".[test]"
pip install -e ".[dev]"
- name: Run Checking Mechanisms
run: make check


@@ -75,12 +75,6 @@ pip install keybert[spacy]
pip install keybert[use]
```
To install all backends:
```
pip install keybert[all]
```
<a name="usage"/></a>
### 2.2. Usage
@@ -136,7 +130,7 @@ keywords = kw_model.extract_keywords(doc, highlight=True)
**NOTE**: For a full overview of all possible transformer models, see [sentence-transformers](https://www.sbert.net/docs/pretrained_models.html).
I would advise either `"paraphrase-MiniLM-L6-v2"` for English documents or `"paraphrase-multilingual-MiniLM-L12-v2"`
I would advise either `"all-MiniLM-L6-v2"` for English documents or `"paraphrase-multilingual-MiniLM-L12-v2"`
for multi-lingual documents or any other language.
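As a minimal sketch of the multi-lingual case (reusing the `doc` defined above):
```python
from keybert import KeyBERT

# Multilingual model advised above; covers 100+ languages
kw_model = KeyBERT(model="paraphrase-multilingual-MiniLM-L12-v2")
keywords = kw_model.extract_keywords(doc)
```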
<a name="maxsum"/></a>
@@ -205,7 +199,7 @@ and pass it through KeyBERT with `model`:
```python
from keybert import KeyBERT
kw_model = KeyBERT(model='paraphrase-MiniLM-L6-v2')
kw_model = KeyBERT(model='all-MiniLM-L6-v2')
```
Or select a SentenceTransformer model with your own parameters:
@@ -214,7 +208,7 @@ Or select a SentenceTransformer model with your own parameters:
from keybert import KeyBERT
from sentence_transformers import SentenceTransformer
sentence_model = SentenceTransformer("paraphrase-MiniLM-L6-v2")
sentence_model = SentenceTransformer("all-MiniLM-L6-v2")
kw_model = KeyBERT(model=sentence_model)
```


@@ -1,3 +1,17 @@
## **Version 0.5.0**
*Release date: 28 September, 2021*
**Highlights**:
* Added Guided KeyBERT
* `kw_model.extract_keywords(doc, seed_keywords=seed_keywords)`
* Thanks to [@zolekode](https://github.com/zolekode) for the inspiration!
* Use the newest all-* models from SBERT
**Miscellaneous**:
* Added instructions in the FAQ to extract keywords from Chinese documents
## **Version 0.4.0**
*Release date: 23 June, 2021*


@@ -1,7 +1,7 @@
## **Which embedding model works best for which language?**
Unfortunately, there is no definitive list of the best models for each language; this depends heavily
on your data, the model, and your specific use-case. However, the default model in KeyBERT
(`"paraphrase-MiniLM-L6-v2"`) works great for **English** documents. In contrast, for **multi-lingual**
(`"all-MiniLM-L6-v2"`) works great for **English** documents. In contrast, for **multi-lingual**
documents or any other language, `"paraphrase-multilingual-MiniLM-L12-v2"` has shown great performance.
If you want a model that provides higher quality but takes more compute time, then I would advise using `paraphrase-mpnet-base-v2` and `paraphrase-multilingual-mpnet-base-v2` instead.
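As a minimal sketch of that trade-off for English documents:
```python
from keybert import KeyBERT

# Higher quality than the default all-MiniLM-L6-v2, at the cost of slower inference
kw_model = KeyBERT(model="paraphrase-mpnet-base-v2")
```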
@@ -17,4 +17,27 @@ topic modeling to HTML-code to extract topics of code, then it becomes important
## **Can I use the GPU to speed up the model?**
Yes! Since KeyBERT uses embeddings as its backend, a GPU is actually preferred when using this package.
Although it is possible to use it without a dedicated GPU, the inference speed will be significantly slower.
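As a minimal sketch (the `device` argument belongs to sentence-transformers, not to KeyBERT itself, and assumes a CUDA-capable GPU):
```python
from keybert import KeyBERT
from sentence_transformers import SentenceTransformer

# Load the default model onto the GPU; embeddings are then computed there
sentence_model = SentenceTransformer("all-MiniLM-L6-v2", device="cuda")
kw_model = KeyBERT(model=sentence_model)
```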
## **How can I use KeyBERT with Chinese documents?**
You need to make sure you use a tokenizer in KeyBERT that supports Chinese tokenization. I suggest installing [`jieba`](https://github.com/fxsjy/jieba) for this:
```python
from sklearn.feature_extraction.text import CountVectorizer
import jieba
def tokenize_zh(text):
words = jieba.lcut(text)
return words
vectorizer = CountVectorizer(tokenizer=tokenize_zh)
```
Then, simply pass the vectorizer to your KeyBERT instance:
```python
from keybert import KeyBERT
kw_model = KeyBERT()
keywords = kw_model.extract_keywords(doc, vectorizer=vectorizer)
```
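Since the default embedding model is geared towards English, pairing this vectorizer with the multilingual model mentioned above, e.g. `KeyBERT(model="paraphrase-multilingual-MiniLM-L12-v2")`, is likely to work better for Chinese documents.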


@@ -8,7 +8,7 @@ and pass it through KeyBERT with `model`:
```python
from keybert import KeyBERT
kw_model = KeyBERT(model="paraphrase-MiniLM-L6-v2")
kw_model = KeyBERT(model="all-MiniLM-L6-v2")
```
Or select a SentenceTransformer model with your own parameters:
@@ -16,7 +16,7 @@ Or select a SentenceTransformer model with your own parameters:
```python
from sentence_transformers import SentenceTransformer
sentence_model = SentenceTransformer("paraphrase-MiniLM-L6-v2")
sentence_model = SentenceTransformer("all-MiniLM-L6-v2")
kw_model = KeyBERT(model=sentence_model)
```


@@ -72,7 +72,7 @@ keywords = kw_model.extract_keywords(doc, highlight=True)
```
**NOTE**: For a full overview of all possible transformer models, see [sentence-transformers](https://www.sbert.net/docs/pretrained_models.html).
I would advise either `"paraphrase-MiniLM-L6-v2"` for English documents or `"paraphrase-multilingual-MiniLM-L12-v2"`
I would advise either `"all-MiniLM-L6-v2"` for English documents or `"paraphrase-multilingual-MiniLM-L12-v2"`
for multi-lingual documents or any other language.
### Max Sum Similarity
@@ -147,4 +147,30 @@ candidates = [candidate[0] for candidate in candidates]
# KeyBERT init
kw_model = KeyBERT()
keywords = kw_model.extract_keywords(doc, candidates)
```
### Guided KeyBERT
Guided KeyBERT is similar to Guided Topic Modeling in that it tries to steer the extraction towards a set of seeded terms. By default, KeyBERT automatically extracts the keywords most related to a specific document. However, there are times when stakeholders and users are looking for specific types of keywords. For example, when publishing an article on your website through Contentful, you typically already know the global keywords related to the article, but there might be a specific topic in the article that you would like to see reflected in the extracted keywords. To achieve this, we simply give KeyBERT a set of related seeded keywords (it can also be a single one!) and search for keywords that are similar to both the document and the seeded keywords.
Using this feature is as simple as defining a list of seeded keywords and passing them to KeyBERT:
```python
doc = """
Supervised learning is the machine learning task of learning a function that
maps an input to an output based on example input-output pairs.[1] It infers a
function from labeled training data consisting of a set of training examples.[2]
In supervised learning, each example is a pair consisting of an input object
(typically a vector) and a desired output value (also called the supervisory signal).
A supervised learning algorithm analyzes the training data and produces an inferred function,
which can be used for mapping new examples. An optimal scenario will allow for the
algorithm to correctly determine the class labels for unseen instances. This requires
the learning algorithm to generalize from the training data to unseen situations in a
'reasonable' way (see inductive bias).
"""
kw_model = KeyBERT()
seed_keywords = ["information"]
keywords = kw_model.extract_keywords(doc, use_mmr=True, diversity=0.1, seed_keywords=seed_keywords)
```
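Under the hood, the seed keywords are joined into a single string, embedded, and averaged with the document embedding at a 3:1 weighting in favor of the document, which nudges the similarities towards the seeded terms without letting them dominate.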


@@ -1,3 +1,3 @@
from keybert._model import KeyBERT
__version__ = "0.4.0"
__version__ = "0.5.0"


@@ -31,7 +31,7 @@ class KeyBERT:
"""
def __init__(self,
model="paraphrase-MiniLM-L6-v2"):
model="all-MiniLM-L6-v2"):
""" KeyBERT initialization
Arguments:
@@ -60,8 +60,9 @@ class KeyBERT:
diversity: float = 0.5,
nr_candidates: int = 20,
vectorizer: CountVectorizer = None,
highlight: bool = False) -> Union[List[Tuple[str, float]],
List[List[Tuple[str, float]]]]:
highlight: bool = False,
seed_keywords: List[str] = None) -> Union[List[Tuple[str, float]],
List[List[Tuple[str, float]]]]:
""" Extract keywords/keyphrases
NOTE:
@@ -99,6 +100,8 @@ class KeyBERT:
highlight: Whether to print the document and highlight
its keywords/keyphrases. NOTE: This does not work if
multiple documents are passed.
seed_keywords: Seed keywords that may guide the extraction of keywords by
steering the similarities towards the seeded keywords
Returns:
keywords: the top n keywords for a document with their respective distances
@@ -116,7 +119,8 @@ class KeyBERT:
use_mmr=use_mmr,
diversity=diversity,
nr_candidates=nr_candidates,
vectorizer=vectorizer)
vectorizer=vectorizer,
seed_keywords=seed_keywords)
if highlight:
highlight_document(docs, keywords)
@@ -143,7 +147,8 @@ class KeyBERT:
use_mmr: bool = False,
diversity: float = 0.5,
nr_candidates: int = 20,
vectorizer: CountVectorizer = None) -> List[Tuple[str, float]]:
vectorizer: CountVectorizer = None,
seed_keywords: List[str] = None) -> List[Tuple[str, float]]:
""" Extract keywords/keyphrases for a single document
Arguments:
@@ -157,6 +162,8 @@ class KeyBERT:
diversity: The diversity of results between 0 and 1 if use_mmr is True
nr_candidates: The number of candidates to consider if use_maxsum is set to True
vectorizer: Pass in your own CountVectorizer from scikit-learn
seed_keywords: Seed keywords that may guide the extraction of keywords by
steering the similarities towards the seeded keywords
Returns:
keywords: the top n keywords for a document with their respective distances
@@ -175,6 +182,11 @@ class KeyBERT:
doc_embedding = self.model.embed([doc])
candidate_embeddings = self.model.embed(candidates)
# Guided KeyBERT with seed keywords
if seed_keywords is not None:
seed_embeddings = self.model.embed([" ".join(seed_keywords)])
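# Weighted average: the document embedding counts three times as much as the seed embedding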
doc_embedding = np.average([doc_embedding, seed_embeddings], axis=0, weights=[3, 1])
# Calculate distances and extract keywords
if use_mmr:
keywords = mmr(doc_embedding, candidate_embeddings, candidates, top_n, diversity)


@@ -16,13 +16,13 @@ class SentenceTransformerBackend(BaseEmbedder):
sentence-transformers model:
```python
from keybert.backend import SentenceTransformerBackend
sentence_model = SentenceTransformerBackend("paraphrase-MiniLM-L6-v2")
sentence_model = SentenceTransformerBackend("all-MiniLM-L6-v2")
```
or you can instantiate a model yourself:
```python
from keybert.backend import SentenceTransformerBackend
from sentence_transformers import SentenceTransformer
embedding_model = SentenceTransformer("paraphrase-MiniLM-L6-v2")
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")
sentence_model = SentenceTransformerBackend(embedding_model)
```
"""
@@ -36,7 +36,7 @@ class SentenceTransformerBackend(BaseEmbedder):
else:
raise ValueError("Please select a correct SentenceTransformers model: \n"
"`from sentence_transformers import SentenceTransformer` \n"
"`model = SentenceTransformer('paraphrase-MiniLM-L6-v2')`")
"`model = SentenceTransformer('all-MiniLM-L6-v2')`")
def embed(self,
documents: List[str],


@@ -4,7 +4,7 @@ from ._sentencetransformers import SentenceTransformerBackend
def select_backend(embedding_model) -> BaseEmbedder:
""" Select an embedding model based on language or a specific sentence transformer models.
When selecting a language, we choose `paraphrase-MiniLM-L6-v2` for English and
When selecting a language, we choose `all-MiniLM-L6-v2` for English and
`paraphrase-multilingual-MiniLM-L12-v2` for all other languages as it supports 100+ languages.
Returns:


@@ -48,7 +48,7 @@ with open("README.md", "r", encoding='utf-8') as fh:
setup(
name="keybert",
packages=find_packages(exclude=["notebooks", "docs"]),
version="0.4.0",
version="0.5.0",
author="Maarten Grootendorst",
author_email="maartengrootendorst@gmail.com",
description="KeyBERT performs keyword extraction with state-of-the-art transformer models.",
@@ -76,8 +76,7 @@ setup(
"test": test_packages,
"docs": docs_packages,
"dev": dev_packages,
"flair": flair_packages,
"all": extra_packages
"flair": flair_packages
},
python_requires='>=3.6',
)


@@ -4,7 +4,7 @@ from sklearn.feature_extraction.text import CountVectorizer
from keybert import KeyBERT
doc_one, doc_two = get_test_data()
model = KeyBERT(model='paraphrase-MiniLM-L6-v2')
model = KeyBERT(model='all-MiniLM-L6-v2')
@pytest.mark.parametrize("keyphrase_length", [(1, i+1) for i in range(5)])
@@ -68,6 +68,22 @@ def test_extract_keywords_multiple_docs(keyphrase_length):
assert len(keyword[0].split(" ")) <= keyphrase_length[1]
def test_guided():
""" Test whether the keywords are correctly extracted """
top_n = 5
seed_keywords = ["time", "night", "day", "moment"]
keywords = model.extract_keywords(doc_one,
min_df=1,
top_n=top_n,
seed_keywords=seed_keywords)
assert isinstance(keywords, list)
assert isinstance(keywords[0], tuple)
assert isinstance(keywords[0][0], str)
assert isinstance(keywords[0][1], float)
assert len(keywords) == top_n
def test_error():
""" Empty doc should raise a ValueError """
with pytest.raises(AttributeError):