{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## sklearn and TextAttack\n",
"\n",
"This following code trains two different text classification models using sklearn. Both use logistic regression models: the difference is in the features. \n",
"\n",
"We will load data using `nlp`, train the models, and attack them using TextAttack."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1cBRUj2l0m8o81vJGGFgO-o_zDLj24M5Y?usp=sharing)\n",
"\n",
"[![View Source on GitHub](https://img.shields.io/badge/github-view%20source-black.svg)](https://github.com/QData/TextAttack/blob/master/docs/examples/1_Introduction_and_Transformations.ipynb)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Training\n",
"\n",
"This code trains two models: one on bag-of-words statistics (`bow_unstemmed`) and one on tfidf statistics (`tfidf_unstemmed`). The dataset is the IMDB movie review dataset."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"...successfully loaded training data\n",
"Total length of training data: 25000\n",
"...augmented data with len_tokens and average_words\n",
"...successfully loaded testing data\n",
"Total length of testing data: 25000\n",
"...augmented data with len_tokens and average_words\n",
"...successfully created the unstemmed BOW data\n",
"...successfully created the unstemmed TFIDF data\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"/Users/jxm/opt/anaconda3/lib/python3.7/site-packages/sklearn/linear_model/_logistic.py:940: ConvergenceWarning: lbfgs failed to converge (status=1):\n",
"STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.\n",
"\n",
"Increase the number of iterations (max_iter) or scale the data as shown in:\n",
" https://scikit-learn.org/stable/modules/preprocessing.html\n",
"Please also refer to the documentation for alternative solver options:\n",
" https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression\n",
" extra_warning_msg=_LOGISTIC_SOLVER_CONVERGENCE_MSG)\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Training accuracy of BOW Unstemmed: 1.0\n",
"Testing accuracy of BOW Unstemmed: 0.83864\n",
" precision recall f1-score support\n",
"\n",
" 0 0.83 0.85 0.84 12500\n",
" 1 0.85 0.83 0.84 12500\n",
"\n",
" accuracy 0.84 25000\n",
" macro avg 0.84 0.84 0.84 25000\n",
"weighted avg 0.84 0.84 0.84 25000\n",
"\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"/Users/jxm/opt/anaconda3/lib/python3.7/site-packages/sklearn/linear_model/_logistic.py:940: ConvergenceWarning: lbfgs failed to converge (status=1):\n",
"STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.\n",
"\n",
"Increase the number of iterations (max_iter) or scale the data as shown in:\n",
" https://scikit-learn.org/stable/modules/preprocessing.html\n",
"Please also refer to the documentation for alternative solver options:\n",
" https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression\n",
" extra_warning_msg=_LOGISTIC_SOLVER_CONVERGENCE_MSG)\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Training accuracy of TFIDF Unstemmed: 0.98864\n",
"Testing accuracy of TFIDF Unstemmed: 0.85672\n",
" precision recall f1-score support\n",
"\n",
" 0 0.85 0.87 0.86 12500\n",
" 1 0.86 0.85 0.86 12500\n",
"\n",
" accuracy 0.86 25000\n",
" macro avg 0.86 0.86 0.86 25000\n",
"weighted avg 0.86 0.86 0.86 25000\n",
"\n"
]
}
],
"source": [
"import nlp\n",
"import os\n",
"import pandas as pd\n",
"import re\n",
"from nltk import word_tokenize\n",
"from nltk.stem import PorterStemmer\n",
"from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS\n",
"from sklearn import preprocessing\n",
"from sklearn.feature_extraction.text import TfidfVectorizer\n",
"from sklearn.linear_model import LogisticRegression\n",
"\n",
"# Nice to see additional metrics\n",
"from sklearn.metrics import classification_report\n",
"\n",
"def load_data(dataset_split='train'):\n",
" dataset = nlp.load_dataset('imdb')[dataset_split]\n",
" # Open and import positve data\n",
" df = pd.DataFrame()\n",
" df['Review'] = [review['text'] for review in dataset]\n",
" df['Sentiment'] = [review['label'] for review in dataset]\n",
" # Remove non-alphanumeric characters\n",
" df['Review'] = df['Review'].apply(lambda x: re.sub(\"[^a-zA-Z]\", ' ', str(x)))\n",
" # Tokenize the training and testing data\n",
" df_tokenized = tokenize_review(df)\n",
" return df_tokenized\n",
"\n",
"def tokenize_review(df):\n",
" # Tokenize Reviews in training\n",
" tokened_reviews = [word_tokenize(rev) for rev in df['Review']]\n",
" # Create word stems\n",
" stemmed_tokens = []\n",
" porter = PorterStemmer()\n",
" for i in range(len(tokened_reviews)):\n",
" stems = [porter.stem(token) for token in tokened_reviews[i]]\n",
" stems = ' '.join(stems)\n",
" stemmed_tokens.append(stems)\n",
" df.insert(1, column='Stemmed', value=stemmed_tokens)\n",
" return df\n",
"\n",
"def transform_BOW(training, testing, column_name):\n",
" vect = CountVectorizer(max_features=10000, ngram_range=(1,3), stop_words=ENGLISH_STOP_WORDS)\n",
" vectFit = vect.fit(training[column_name])\n",
" BOW_training = vectFit.transform(training[column_name])\n",
" BOW_training_df = pd.DataFrame(BOW_training.toarray(), columns=vect.get_feature_names())\n",
" BOW_testing = vectFit.transform(testing[column_name])\n",
" BOW_testing_Df = pd.DataFrame(BOW_testing.toarray(), columns=vect.get_feature_names())\n",
" return vectFit, BOW_training_df, BOW_testing_Df\n",
"\n",
"def transform_tfidf(training, testing, column_name):\n",
" Tfidf = TfidfVectorizer(ngram_range=(1,3), max_features=10000, stop_words=ENGLISH_STOP_WORDS)\n",
" Tfidf_fit = Tfidf.fit(training[column_name])\n",
" Tfidf_training = Tfidf_fit.transform(training[column_name])\n",
" Tfidf_training_df = pd.DataFrame(Tfidf_training.toarray(), columns=Tfidf.get_feature_names())\n",
" Tfidf_testing = Tfidf_fit.transform(testing[column_name])\n",
" Tfidf_testing_df = pd.DataFrame(Tfidf_testing.toarray(), columns=Tfidf.get_feature_names())\n",
" return Tfidf_fit, Tfidf_training_df, Tfidf_testing_df\n",
"\n",
"def add_augmenting_features(df):\n",
" tokened_reviews = [word_tokenize(rev) for rev in df['Review']]\n",
" # Create feature that measures length of reviews\n",
" len_tokens = []\n",
" for i in range(len(tokened_reviews)):\n",
" len_tokens.append(len(tokened_reviews[i]))\n",
" len_tokens = preprocessing.scale(len_tokens)\n",
" df.insert(0, column='Lengths', value=len_tokens)\n",
"\n",
" # Create average word length (training)\n",
" Average_Words = [len(x)/(len(x.split())) for x in df['Review'].tolist()]\n",
" Average_Words = preprocessing.scale(Average_Words)\n",
" df['averageWords'] = Average_Words\n",
" return df\n",
"\n",
"def build_model(X_train, y_train, X_test, y_test, name_of_test):\n",
" log_reg = LogisticRegression(C=30, max_iter=200).fit(X_train, y_train)\n",
" y_pred = log_reg.predict(X_test)\n",
" print('Training accuracy of '+name_of_test+': ', log_reg.score(X_train, y_train))\n",
" print('Testing accuracy of '+name_of_test+': ', log_reg.score(X_test, y_test))\n",
" print(classification_report(y_test, y_pred)) # Evaluating prediction ability\n",
" return log_reg\n",
"\n",
"# Load training and test sets\n",
"# Loading reviews into DF\n",
"df_train = load_data('train')\n",
"\n",
"print('...successfully loaded training data')\n",
"print('Total length of training data: ', len(df_train))\n",
"# Add augmenting features\n",
"df_train = add_augmenting_features(df_train)\n",
"print('...augmented data with len_tokens and average_words')\n",
"\n",
"# Load test DF\n",
"df_test = load_data('test')\n",
"\n",
"print('...successfully loaded testing data')\n",
"print('Total length of testing data: ', len(df_test))\n",
"df_test = add_augmenting_features(df_test)\n",
"print('...augmented data with len_tokens and average_words')\n",
"\n",
"# Create unstemmed BOW features for training set\n",
"unstemmed_BOW_vect_fit, df_train_bow_unstem, df_test_bow_unstem = transform_BOW(df_train, df_test, 'Review')\n",
"print('...successfully created the unstemmed BOW data')\n",
"\n",
"# Create TfIdf features for training set\n",
"unstemmed_tfidf_vect_fit, df_train_tfidf_unstem, df_test_tfidf_unstem = transform_tfidf(df_train, df_test, 'Review')\n",
"print('...successfully created the unstemmed TFIDF data')\n",
"\n",
"# Running logistic regression on dataframes\n",
"bow_unstemmed = build_model(df_train_bow_unstem, df_train['Sentiment'], df_test_bow_unstem, df_test['Sentiment'], 'BOW Unstemmed')\n",
"\n",
"tfidf_unstemmed = build_model(df_train_tfidf_unstem, df_train['Sentiment'], df_test_tfidf_unstem, df_test['Sentiment'], 'TFIDF Unstemmed')"
]
},
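{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you want to reuse these models later without retraining, you can persist the fitted vectorizers and classifiers. Below is a minimal sketch using `joblib` (not part of the original pipeline; the file names are arbitrary)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import joblib\n",
"\n",
"# Save the fitted vectorizer alongside the classifier: both are needed\n",
"# to score new raw text later.\n",
"joblib.dump(unstemmed_BOW_vect_fit, 'bow_vectorizer.joblib')\n",
"joblib.dump(bow_unstemmed, 'bow_logreg.joblib')\n",
"\n",
"# Later, reload them to rebuild the same model wrapper without retraining:\n",
"# unstemmed_BOW_vect_fit = joblib.load('bow_vectorizer.joblib')\n",
"# bow_unstemmed = joblib.load('bow_logreg.joblib')"
]
},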
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Attacking\n",
"\n",
"TextAttack includes a build-in `SklearnModelWrapper` that can run attacks on most sklearn models. (If your tokenization strategy is different than above, you may need to subclass `SklearnModelWrapper` to make sure the model inputs & outputs come in the correct format.)\n",
"\n",
"Once we initializes the model wrapper, we load a few samples from the IMDB dataset and run the `TextFoolerJin2019` attack on our model."
]
},
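{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here is a hypothetical sketch of that subclassing approach, for a model whose features come from a custom function rather than a fitted sklearn vectorizer (none of these names are part of TextAttack; only the `ModelWrapper` base class is). The contract to satisfy: map a list of raw strings to one row of class scores per string."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Hypothetical example -- not needed for this tutorial, where the built-in\n",
"# SklearnModelWrapper suffices.\n",
"from textattack.models.wrappers import ModelWrapper\n",
"\n",
"class CustomFeaturesModelWrapper(ModelWrapper):\n",
"    def __init__(self, model, featurize_fn):\n",
"        self.model = model                # any fitted sklearn classifier\n",
"        self.featurize_fn = featurize_fn  # maps a list of strings to a feature matrix\n",
"\n",
"    def __call__(self, text_input_list):\n",
"        # TextAttack passes a list of strings and expects one row of\n",
"        # class scores per input string.\n",
"        X = self.featurize_fn(text_input_list)\n",
"        return self.model.predict_proba(X)"
]
},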
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"from textattack.models.wrappers import SklearnModelWrapper\n",
"\n",
"model_wrapper = SklearnModelWrapper(bow_unstemmed, unstemmed_BOW_vect_fit)"
]
},
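{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before attacking, it is worth a quick sanity check that the wrapper maps raw strings to class scores the way TextAttack expects (a hypothetical check with made-up example sentences):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Each input string should yield one row of class scores:\n",
"# [P(negative), P(positive)] under the IMDB label ordering.\n",
"sample_scores = model_wrapper(['This movie was wonderful!', 'Truly awful acting.'])\n",
"print(sample_scores)"
]
},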
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from textattack.datasets import HuggingFaceNlpDataset\n",
"from textattack.attack_recipes import TextFoolerJin2019\n",
"\n",
"dataset = HuggingFaceNlpDataset(\"imdb\", None, \"train\")\n",
"attack = TextFoolerJin2019.build(model_wrapper)\n",
"\n",
"results = attack.attack_dataset(dataset, indices=range(20))\n",
"for idx, result in enumerate(results):\n",
" print(f'Result {idx}:')\n",
" print(result.__str__(color_method='ansi'))\n",
" print('\\n\\n')\n",
"print()"
]
},
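{
"cell_type": "markdown",
"metadata": {},
"source": [
"Because the results were collected into a list above, we can also summarize them. A minimal sketch (TextAttack's loggers and its `textattack attack` CLI provide richer summaries):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from textattack.attack_results import SkippedAttackResult, SuccessfulAttackResult\n",
"\n",
"num_successes = sum(isinstance(r, SuccessfulAttackResult) for r in results)\n",
"num_skipped = sum(isinstance(r, SkippedAttackResult) for r in results)\n",
"num_attempted = len(results) - num_skipped\n",
"print(f'{num_successes} / {num_attempted} attacks succeeded '\n",
"      f'({num_skipped} skipped because the model was already wrong).')"
]
},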
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Conclusion\n",
"We were able to train a model on the IMDB dataset using `sklearn` and use it in TextAttack by initializing with the `SklearnModelWrapper`. It's that simple!"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.7"
}
},
"nbformat": 4,
"nbformat_minor": 4
}