## sklearn and TextAttack

The following code trains two text classification models using sklearn. Both are logistic regression models; they differ only in their input features.

We will load the data using the Hugging Face `nlp` library (since renamed `datasets`), train the models, and then attack them using TextAttack.

### Training

This code trains two models: one on bag-of-words statistics (`bow_unstemmed`) and one on tf–idf statistics (`tfidf_unstemmed`). The dataset is the IMDB movie review dataset.
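To make the difference between the two feature sets concrete, here is a minimal pure-Python sketch; it is not the notebook's code. It computes raw term counts (bag-of-words) and tf-idf weights for a toy corpus. The sketch assumes the smoothed formula `idf(t) = ln((1 + n) / (1 + df(t))) + 1` followed by L2 normalization of each row, which is the documented default behavior of sklearn's `TfidfVectorizer`. The toy corpus and the helper names `bag_of_words` and `tfidf_rows` are made up for illustration.

```python
import math
from collections import Counter

# Toy corpus, invented for illustration only.
docs = ["good movie good cast", "bad movie", "good plot"]


def bag_of_words(doc, vocab):
    """Return the raw count of each vocabulary term in one document."""
    counts = Counter(doc.split())
    return [counts[term] for term in vocab]


def tfidf_rows(docs):
    """Return (vocab, count rows, L2-normalized tf-idf rows) for a corpus."""
    vocab = sorted({term for doc in docs for term in doc.split()})
    counts = [bag_of_words(doc, vocab) for doc in docs]
    n = len(docs)
    # Document frequency: number of documents that contain each term.
    df = [sum(1 for row in counts if row[j] > 0) for j in range(len(vocab))]
    # Smoothed inverse document frequency (assumed sklearn default).
    idf = [math.log((1 + n) / (1 + d)) + 1 for d in df]
    rows = []
    for row in counts:
        weighted = [c * w for c, w in zip(row, idf)]
        norm = math.sqrt(sum(v * v for v in weighted)) or 1.0
        rows.append([v / norm for v in weighted])
    return vocab, counts, rows


vocab, bow, tfidf = tfidf_rows(docs)
print(vocab)                              # vocabulary in sorted order
print(bow[0])                             # raw counts for the first document
print([round(v, 3) for v in tfidf[0]])    # tf-idf weights for the first document
```

In the first toy document, "cast" and "movie" each occur once, so their bag-of-words values are equal. Under tf-idf, "cast" receives the larger weight because it appears in fewer documents.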


In [1]:
import nlp
import os
import pandas as pd
import re
from nltk import word_tokenize
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS
from sklearn import preprocessing
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Nice to see additional metrics
from sklearn.metrics import classification_report

def load_data(dataset_split='train'):
    dataset = nlp.load_dataset('imdb')[dataset_split]
    # Build a DataFrame of review texts and sentiment labels
    df = pd.DataFrame()
    df['Review'] = [review['text'] for review in dataset]
    df['Sentiment'] = [review['label'] for review in dataset]
    # Replace every non-alphabetic character (digits and punctuation included) with a space
    df['Review'] = df['Review'].apply(lambda x: re.sub("[^a-zA-Z]", ' ', str(x)))
    # Tokenize and stem the reviews
    df_tokenized = tokenize_review(df)
    return df_tokenized

def tokenize_review(df):
    # Tokenize each review
    tokened_reviews = [word_tokenize(rev) for rev in df['Review']]
    # Create word stems
    stemmed_tokens = []
    porter = PorterStemmer()
    for i in range(len(tokened_reviews)):
        stems = [porter.stem(token) for token in tokened_reviews[i]]
        stems = ' '.join(stems)
        stemmed_tokens.append(stems)
    df.insert(1, column='Stemmed', value=stemmed_tokens)
    return df

def transform_BOW(training, testing, column_name):
    vect = CountVectorizer(max_features=10000, ngram_range=(1,3), stop_words=ENGLISH_STOP_WORDS)
    vectFit = vect.fit(training[column_name])
    BOW_training = vectFit.transform(training[column_name])
    BOW_training_df = pd.DataFrame(BOW_training.toarray(), columns=vect.get_feature_names())
    BOW_testing = vectFit.transform(testing[column_name])
    BOW_testing_Df = pd.DataFrame(BOW_testing.toarray(), columns=vect.get_feature_names())
    return vectFit, BOW_training_df, BOW_testing_Df

def transform_tfidf(training, testing, column_name):
    Tfidf = TfidfVectorizer(ngram_range=(1,3), max_features=10000, stop_words=ENGLISH_STOP_WORDS)
    Tfidf_fit = Tfidf.fit(training[column_name])
    Tfidf_training = Tfidf_fit.transform(training[column_name])
    Tfidf_training_df = pd.DataFrame(Tfidf_training.toarray(), columns=Tfidf.get_feature_names())
    Tfidf_testing = Tfidf_fit.transform(testing[column_name])
    Tfidf_testing_df = pd.DataFrame(Tfidf_testing.toarray(), columns=Tfidf.get_feature_names())
    return Tfidf_fit, Tfidf_training_df, Tfidf_testing_df

def add_augmenting_features(df):
    tokened_reviews = [word_tokenize(rev) for rev in df['Review']]
    # Create feature that measures length of reviews
    len_tokens = []
    for i in range(len(tokened_reviews)):
        len_tokens.append(len(tokened_reviews[i]))
    len_tokens = preprocessing.scale(len_tokens)
    df.insert(0, column='Lengths', value=len_tokens)

    # Approximate average word length: total characters (spaces included) per whitespace-separated token
    Average_Words = [len(x)/(len(x.split())) for x in df['Review'].tolist()]
    Average_Words = preprocessing.scale(Average_Words)
    df['averageWords'] = Average_Words
    return df

def build_model(X_train, y_train, X_test, y_test, name_of_test):
    log_reg = LogisticRegression(C=30, max_iter=200).fit(X_train, y_train)
    y_pred = log_reg.predict(X_test)
    print('Training accuracy of '+name_of_test+': ', log_reg.score(X_train, y_train))
    print('Testing accuracy of '+name_of_test+': ', log_reg.score(X_test, y_test))
    print(classification_report(y_test, y_pred))  # Evaluating prediction ability
    return log_reg

# Load training and test sets
# Loading reviews into DF
df_train = load_data('train')

print('...successfully loaded training data')
print('Total length of training data: ', len(df_train))
# Add augmenting features
df_train = add_augmenting_features(df_train)
print('...augmented data with len_tokens and average_words')

# Load test DF
df_test = load_data('test')

print('...successfully loaded testing data')
print('Total length of testing data: ', len(df_test))
df_test = add_augmenting_features(df_test)
print('...augmented data with len_tokens and average_words')

# Create unstemmed BOW features for the training and test sets
unstemmed_BOW_vect_fit, df_train_bow_unstem, df_test_bow_unstem = transform_BOW(df_train, df_test, 'Review')
print('...successfully created the unstemmed BOW data')

# Create unstemmed tf-idf features for the training and test sets
unstemmed_tfidf_vect_fit, df_train_tfidf_unstem, df_test_tfidf_unstem = transform_tfidf(df_train, df_test, 'Review')
print('...successfully created the unstemmed TFIDF data')

# Running logistic regression on dataframes
bow_unstemmed = build_model(df_train_bow_unstem, df_train['Sentiment'], df_test_bow_unstem, df_test['Sentiment'], 'BOW Unstemmed')

tfidf_unstemmed = build_model(df_train_tfidf_unstem, df_train['Sentiment'], df_test_tfidf_unstem, df_test['Sentiment'], 'TFIDF Unstemmed')

...successfully loaded training data
Total length of training data:  25000
...augmented data with len_tokens and average_words
...successfully loaded testing data
Total length of testing data:  25000
...augmented data with len_tokens and average_words
...successfully created the unstemmed BOW data
...successfully created the unstemmed TFIDF data


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


Training accuracy of BOW Unstemmed:  1.0
Testing accuracy of BOW Unstemmed:  0.83864
              precision    recall  f1-score   support

           0       0.83      0.85      0.84     12500
           1       0.85      0.83      0.84     12500

    accuracy                           0.84     25000
   macro avg       0.84      0.84      0.84     25000
weighted avg       0.84      0.84      0.84     25000



STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


Training accuracy of TFIDF Unstemmed:  0.98864
Testing accuracy of TFIDF Unstemmed:  0.85672
              precision    recall  f1-score   support

           0       0.85      0.87      0.86     12500
           1       0.86      0.85      0.86     12500

    accuracy                           0.86     25000
   macro avg       0.86      0.86      0.86     25000
weighted avg       0.86      0.86      0.86     25000



### Attacking

TextAttack includes a built-in `SklearnModelWrapper` that can run attacks on most sklearn models. (If your tokenization strategy differs from the one above, you may need to subclass `SklearnModelWrapper` so that the model inputs and outputs come in the correct format.)
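As a rough illustration of what such a wrapper has to do, the sketch below maps a list of raw strings to features with a fitted vectorizer and returns the model's class probabilities. It is a hypothetical stand-in, not TextAttack's actual implementation, and the real `SklearnModelWrapper` may differ in its details. The `KeywordVectorizer` and `KeywordModel` classes are invented here only so that the example runs without sklearn; they mimic the `transform` and `predict_proba` methods that sklearn vectorizers and classifiers expose.

```python
class SimpleSklearnWrapper:
    """Hypothetical wrapper: raw strings in, class probabilities out."""

    def __init__(self, model, vectorizer):
        self.model = model            # anything with predict_proba(features)
        self.vectorizer = vectorizer  # anything with transform(list_of_strings)

    def __call__(self, text_input_list):
        features = self.vectorizer.transform(text_input_list)
        return self.model.predict_proba(features)


class KeywordVectorizer:
    """Invented stand-in for a fitted vectorizer: counts fixed keywords."""

    def __init__(self, vocab):
        self.vocab = vocab

    def transform(self, texts):
        return [[text.lower().split().count(word) for word in self.vocab]
                for text in texts]


class KeywordModel:
    """Invented stand-in for a classifier over [good_count, bad_count] rows."""

    def predict_proba(self, rows):
        probs = []
        for good, bad in rows:
            p_pos = (good + 1) / (good + bad + 2)  # add-one smoothing
            probs.append([1 - p_pos, p_pos])       # [negative, positive]
        return probs


wrapper = SimpleSklearnWrapper(KeywordModel(), KeywordVectorizer(["good", "bad"]))
probs = wrapper(["a good good film", "a bad film"])
print(probs)
```

With the notebook's objects, `bow_unstemmed` and `unstemmed_BOW_vect_fit` would play the roles of the stand-in model and vectorizer. A subclass that handles a different tokenization strategy would mainly change how `features` is built.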

Once we initialize the model wrapper, we load the first 20 samples from the IMDB training split and run the `TextFoolerJin2019` attack on our model.

In [2]:
from textattack.models.wrappers import SklearnModelWrapper

model_wrapper = SklearnModelWrapper(bow_unstemmed, unstemmed_BOW_vect_fit)

In [3]:
from textattack.datasets import HuggingFaceNlpDataset
from textattack.attack_recipes import TextFoolerJin2019

dataset = HuggingFaceNlpDataset("imdb", None, "train")
attack = TextFoolerJin2019(model_wrapper)

results = attack.attack_dataset(dataset, indices=range(20))
for idx, result in enumerate(results):
    print(f'Result {idx}:')
    print(result.__str__(color_method='ansi'))
    print('\n' + ('*' * 40) + '\n')

textattack: Loading nlp dataset imdb, split train.
Using /var/folders/_q/cf258j890896hmrr2q8b52200000gn/T/tfhub_modules to cache modules.
textattack: Unknown if model of class <class 'textattack.models.wrappers.sklearn_model_wrapper.SklearnModelWrapper'> compatible with goal function <class 'textattack.goal_functions.classification.untargeted_classification.UntargetedClassification'>.


Result 0:
[[Positive (100%)]] --> [[Negative (93%)]]

Bromwell High is a cartoon comedy. It ran at the same time as some other programs about school life, such as "Teachers". My 35 years in the teaching profession lead me to believe that Bromwell High's [[satire]] is much closer to reality than is "Teachers". The scramble to [[survive]] financially, the insightful students who can see right through their pathetic teachers' pomp, the pettiness of the whole situation, all remind me of the schools I knew and their students. When I saw the episode in which a student repeatedly tried to burn down the school, I immediately recalled ......... at .......... High. A classic line: INSPECTOR: I'm here to sack one of your teachers. STUDENT: Welcome to Bromwell High. I expect that many adults of my age think that Bromwell High is far fetched. What a [[pity]] that it isn't!

Bromwell High is a cartoon comedy. It ran at the same time as some other programs about school life, 

Result 3:
[[Positive (100%)]] --> [[Negative (65%)]]

This is easily the most [[underrated]] film inn the Brooks cannon. Sure, its [[flawed]]. It does not give a realistic view of homelessness ([[unlike]], say, how Citizen Kane gave a [[realistic]] view of lounge singers, or Titanic gave a [[realistic]] view of Italians YOU IDIOTS). Many of the jokes fall flat. But still, this film is very lovable in a way many comedies are not, and to pull that off in a story about some of the most traditionally reviled members of society is truly impressive. Its not The Fisher King, but its not crap, either. My only [[complaint]] is that Brooks should have cast someone else in the lead (I love Mel as a Director and Writer, not so much as a lead).

This is easily the most [[overrated]] film inn the Brooks cannon. Sure, its [[rotten]]. It does not give a realistic view of homelessness ([[although]], say, how Citizen Kane gave a [[actual]] view

Result 8:
[[Positive (100%)]] --> [[[FAILED]]]

THE NIGHT LISTENER (2006) **1/2 Robin Williams, Toni Collette, Bobby Cannavale, Rory Culkin, Joe Morton, Sandra Oh, John Cullum, Lisa Emery, Becky Ann Baker. (Dir: Patrick Stettner) <br /><br />Hitchcockian suspenser gives Williams a stand-out low-key performance.<br /><br />What is it about celebrities and fans? What is the near paranoia one associates with the other and why is it almost the norm? <br /><br />In the latest derange fan scenario, based on true events no less, Williams stars as a talk-radio personality named Gabriel No one, who reads stories he's penned over the airwaves and has accumulated an interesting fan in the form of a young boy named Pete Logand (Culkin) who has submitted a manuscript about the travails of his troubled youth to No one's editor Ashe (Morton) who gives it to No one to read for himself. <br /><br />No one is naturally disturbed but ultimately intrigued about the nightmarish existence of Pete 

Result 11:
[[Positive (100%)]] --> [[Negative (50%)]]

I liked the film. Some of the action scenes were very interesting, tense and well done. I especially [[liked]] the opening scene which had a semi truck in it. A very tense action [[scene]] that seemed well done.<br /><br />Some of the transitional scenes were filmed in interesting ways such as time lapse photography, unusual colors, or interesting angles. Also the film is funny is several parts. I also [[liked]] how the evil guy was portrayed too. I'd give the film an 8 out of 10.

I liked the film. Some of the action scenes were very interesting, tense and well done. I especially [[prefer]] the opening scene which had a semi truck in it. A very tense action [[filmmaking]] that seemed well done.<br /><br />Some of the transitional scenes were filmed in interesting ways such as time lapse photography, unusual colors, or interesting angles. Also the film is funny is several parts. I also [[belove

Result 15:
[[Positive (95%)]] --> [[Negative (67%)]]

Like one of the previous commenters said, this had the foundations of a great movie but something happened on the way to delivery. Such a waste because Collette's performance was [[eerie]] and Williams was believable. I just kept waiting for it to get better. I don't think it was bad editing or needed another director, it could have just been the film. It came across as a Canadian movie, something like the first few seasons of X-Files. Not cheap, just hokey. Also, it needed a little more suspense. Something that makes you jump off your seat. The movie reached that moment then faded away; kind of like a false climax. I can see how being too suspenseful would have taken away from the "reality" of the story but I thought that part was reached when Gabriel was in the hospital looking for the boy. This movie needs to have a Director's cut that tries to fix these problems.

Like one of the previous commenters said, this had

Result 18:
[[Positive (97%)]] --> [[Negative (54%)]]

If there is one thing to recommend about this film is that it is intriguing. The premise certainly [[draws]] the audience in because it is a mystery, and throughout the film there are hints that there is something dark lurking about. However, there is not much tension, and Williams' mild mannered portrayal doesn't do much to makes us relate to his obsession with the boy.<br /><br />Collete fares much better as the woman whose true nature and intentions are not very clear. The production felt rushed and holes are apparent. It certainly feels like a preview for a much more complete and better effort. The book is probably better.<br /><br />One thing is certain: Taupin must have written something truly good to have inspired at least one commendable effort.

If there is one thing to recommend about this film is that it is intriguing. The premise certainly [[attract]] the audience in because it is a mystery, and throu

### Conclusion

We trained a model on the IMDB dataset using `sklearn` and attacked it with TextAttack by wrapping it in `SklearnModelWrapper`. It's that simple!