Feature Extraction

Feature extraction is the transformation of original data to a data set with a reduced number of variables, which contains the most discriminatory information. In a text or image classification problem, you have to extract features in order to train a model. These algorithms can interpret only numbers and you have to intelligently convert the text into the numerical feature vectors.

Alt text that describes the graphic

  • Text classification or text categorization is an activity of labelling natural language texts with relevant predefined categories. The idea is to automatically organize text in different classes.

If you think about it, text is just a series of ordered words that usually carry some meaning. If we take each unique word from all the available texts, we can create our own vocabulary. And every word in the vocabulary can be a feature. For each text a feature vector will be an array where feature values are simply the numbers of the unique words belonging to a specific text, i.e. just the count of each word in one text. And if some word is not in the text, its feature value is zero. Therefore, the word order in a text is not important, just the number of occurrences. This method is called a Bag of words BOW and it’s quite common and simple to use.

  • To solve the problem of word order we use another approach called Word Embedding.
  • Word Embedding: It is a representation of text where words that have the same meaning have a similar representation. In other words it represents words in a coordinate system where related words, based on a corpus of relationships, are placed closer together.
  • Word2vec is a model of word embedding that takes as input a large corpus of text and produces a vector space with each unique word being assigned a corresponding vector in the space. Word vectors are positioned in the vector space such that words that share common contexts in the corpus are located in close proximity to one another in the space. Word2Vec is very famous at capturing meaning and demonstrating it on tasks like calculating analogy questions of the form a is to b as c is to ?. For example, man is to woman as uncle is to ? (aunt) using a simple vector offset method based on cosine distance. For example, here are vector offsets for three word pairs illustrating the gender relation:

Alt text that describes the graphic

Alt text that describes the graphic

Alt text that describes the graphic

A tf-idf word-frequency array

  • tf–idf or TFIDF, short for term frequency–inverse document frequency, is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus. It is often used as a weighting factor in searches of information retrieval, text mining, and user modeling. The tf–idf value increases proportionally to the number of times a word appears in the document and is offset by the number of documents in the corpus that contain the word, which helps to adjust for the fact that some words appear more frequently in general. tf–idf is one of the most popular term-weighting schemes today; 83% of text-based recommender systems in digital libraries use tf–idf.

Alt text that describes the graphic

In this exercise, we create a tf-idf word frequency array for a toy collection of documents. For this, use the TfidfVectorizer from sklearn. It transforms a list of documents into a word frequency array, which it outputs as a csr_matrix. It has fit() and transform() methods like other sklearn objects.

TruncatedSVD and csr_matrix

TruncatedSVD singular value decomposition (SVD) is able to perform PCA on sparse arrays in csr_matrix format, such as word-frequency arrays.

  • scikit-learn PCA doesn't support csr_matrix
  • Use scikit-learn TruncatedSVD instead
  • Performs same transformation
In [1]:
import warnings
warnings.filterwarnings('ignore')
In [2]:
from sklearn.decomposition import TruncatedSVD 
from sklearn.feature_extraction.text import TfidfVectorizer

documents = ['cats say meow', 'dogs say woof', 'dogs chase cats', 'never say ever',
            'faith for one', 'faith to one', 'i hate dogs', 'i hate cats']

model = TruncatedSVD(n_components=8)

# Create a TfidfVectorizer: tfidf
tfidf = TfidfVectorizer() 

# Apply fit_transform to document: csr_mat
csr_mat = tfidf.fit_transform(documents)
print(csr_mat.toarray())

model.fit(csr_mat)
transformed = model.transform(csr_mat)

# Get the words: words
words = tfidf.get_feature_names()

# Print words
print('\n',words)
[[0.50559126 0.         0.         0.         0.         0.
  0.         0.69911012 0.         0.         0.50559126 0.
  0.        ]
 [0.         0.         0.50559126 0.         0.         0.
  0.         0.         0.         0.         0.50559126 0.
  0.69911012]
 [0.50559126 0.69911012 0.50559126 0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.        ]
 [0.         0.         0.         0.62956522 0.         0.
  0.         0.         0.62956522 0.         0.4552969  0.
  0.        ]
 [0.         0.         0.         0.         0.54044255 0.64485945
  0.         0.         0.         0.54044255 0.         0.
  0.        ]
 [0.         0.         0.         0.         0.54044255 0.
  0.         0.         0.         0.54044255 0.         0.64485945
  0.        ]
 [0.         0.         0.65330828 0.         0.         0.
  0.757092   0.         0.         0.         0.         0.
  0.        ]
 [0.65330828 0.         0.         0.         0.         0.
  0.757092   0.         0.         0.         0.         0.
  0.        ]]

 ['cats', 'chase', 'dogs', 'ever', 'faith', 'for', 'hate', 'meow', 'never', 'one', 'say', 'to', 'woof']
In [3]:
import matplotlib.pyplot as plt 
import seaborn as sns 
import numpy as np

var_ratio = model.explained_variance_ratio_
cum_var_ratio = np.cumsum(var_ratio)
col_num = model.n_components
feat_names = ['C'+str(num) for num in list(range(1,col_num+1,1))]

sns.barplot(y=var_ratio, x=feat_names)
sns.pointplot(y=var_ratio, x=feat_names, color='black')
plt.grid(False)
plt.title("Explained variance Ratio by each component", fontsize=14)
plt.ylabel("Explained variance ratio (%)")
plt.show()

DictVectorizer

  • The class DictVectorizer can be used to convert feature arrays represented as lists of standard Python dict objects to the NumPy/SciPy representation used by scikit-learn estimators.
  • DictVectorizer implements what is called one-of-K or “one-hot” coding for categorical (aka nominal, discrete) features.
In [4]:
from sklearn.feature_extraction import DictVectorizer

v = DictVectorizer(sparse=False)

d = [{'height': 1, 'length': 0, 'width': 1},
     {'height': 2, 'length': 1, 'width': 0},
     {'height': 1, 'length': 3, 'width': 2}]

v.fit_transform(d)
Out[4]:
array([[1., 0., 1.],
       [2., 1., 0.],
       [1., 3., 2.]])
In [5]:
v.get_feature_names()
Out[5]:
['height', 'length', 'width']

DictVectorizer is also a useful representation transformation for training sequence classifiers in Natural Language Processing models that typically work by extracting feature windows around a particular word of interest.

  • For example, suppose that we have a first algorithm that extracts Part of Speech (PoS) tags that we want to use as complementary tags for training a sequence classifier (e.g. a chunker). The following dict could be such a window of features extracted around the word ‘sat’ in the sentence ‘The cat sat on the mat.’:
In [6]:
pos_window = [
        {
         'word-2': 'the',
         'pos-2': 'DT',
         'word-1': 'cat',
         'pos-1': 'NN',
         'word+1': 'on',
         'pos+1': 'PP',
     },
    ]

This description can be vectorized into a sparse two-dimensional matrix suitable for feeding into a classifier

In [7]:
vec = DictVectorizer()
pos_vectorized = vec.fit_transform(pos_window)
pos_vectorized.toarray()
Out[7]:
array([[1., 1., 1., 1., 1., 1.]])
In [8]:
vec.get_feature_names()
Out[8]:
['pos+1=PP', 'pos-1=NN', 'pos-2=DT', 'word+1=on', 'word-1=cat', 'word-2=the']

Note that this transformer will only do a binary one-hot encoding when feature values are of type string.

20 newsgroups dataset

The 20 newsgroups dataset comprises around 18000 newsgroups posts on 20 topics split in two subsets: one for training (or development) and the other one for testing (or for performance evaluation). The split between the train and test set is based upon a messages posted before and after a specific date.

In [9]:
from sklearn import datasets

news = datasets.fetch_20newsgroups(subset='all')

X = news.data
y = news.target
In [10]:
print("Number of articles: " + str(len(X)))
print("Number of diffrent categories: " + str(len(news.target_names)))
news.target_names
Number of articles: 18846
Number of diffrent categories: 20
Out[10]:
['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']
In [50]:
import pandas as pd 

df = pd.DataFrame({'class': y,
                   'text':  X })
df
Out[50]:
class text
0 10 From: Mamatha Devineni Ratnam <mr47+@andrew.cm...
1 3 From: mblawson@midway.ecn.uoknor.edu (Matthew ...
2 17 From: hilmi-er@dsv.su.se (Hilmi Eren)\nSubject...
3 3 From: guyd@austin.ibm.com (Guy Dawson)\nSubjec...
4 4 From: Alexander Samuel McDiarmid <am2o+@andrew...
... ... ...
18841 13 From: jim.zisfein@factory.com (Jim Zisfein) \n...
18842 12 From: rdell@cbnewsf.cb.att.com (richard.b.dell...
18843 3 From: westes@netcom.com (Will Estes)\nSubject:...
18844 1 From: steve@hcrlgw (Steven Collins)\nSubject: ...
18845 7 From: chriss@netcom.com (Chris Silvester)\nSub...

18846 rows × 2 columns

In [51]:
df.text[0]
Out[51]:
"From: Mamatha Devineni Ratnam <mr47+@andrew.cmu.edu>\nSubject: Pens fans reactions\nOrganization: Post Office, Carnegie Mellon, Pittsburgh, PA\nLines: 12\nNNTP-Posting-Host: po4.andrew.cmu.edu\n\n\n\nI am sure some bashers of Pens fans are pretty confused about the lack\nof any kind of posts about the recent Pens massacre of the Devils. Actually,\nI am  bit puzzled too and a bit relieved. However, I am going to put an end\nto non-PIttsburghers' relief with a bit of praise for the Pens. Man, they\nare killing those Devils worse than I thought. Jagr just showed you why\nhe is much better than his regular season stats. He is also a lot\nfo fun to watch in the playoffs. Bowman should let JAgr have a lot of\nfun in the next couple of games since the Pens are going to beat the pulp out of Jersey anyway. I was very disappointed not to see the Islanders lose the final\nregular season game.          PENS RULE!!!\n\n"

To get the frequency distribution of the words in the text, we can utilize the nltk.FreqDist() function, which lists the top words used in the text, providing a rough idea of the main topic in the text data:

In [52]:
reviews = df.text.str.cat(sep=' ')

#function to split text into word
tokens = word_tokenize(reviews)

vocabulary = set(tokens)

print(len(vocabulary))

frequency_dist = nltk.FreqDist(tokens)

print(sorted(frequency_dist,key=frequency_dist.__getitem__, reverse=True)[0:50])
283303
['>', ',', '.', 'the', ':', '--', 'to', ')', 'of', '(', 'a', 'and', '@', 'I', 'is', 'in', 'that', "'AX", '?', "''", 'it', 'for', 'you', '!', '<', '``', 'on', 'be', 'have', '|', 'are', 'not', 'with', '$', '#', "'s", 'The', ';', '%', 'this', '-', "n't", 'as', 'was', ']', 'or', '[', 'do', 'From', '&']

This gives the top 50 words used in the text, though it is obvious that some of the stop words, such as the, frequently occur in the English.

Let us remove punctuation to further cleanup the text corpus.

In [53]:
import string

punc = list(string.punctuation)
tokens = [w for w in tokens if not w in punc]
print(tokens[0:50])
['From', 'Mamatha', 'Devineni', 'Ratnam', 'mr47+', 'andrew.cmu.edu', 'Subject', 'Pens', 'fans', 'reactions', 'Organization', 'Post', 'Office', 'Carnegie', 'Mellon', 'Pittsburgh', 'PA', 'Lines', '12', 'NNTP-Posting-Host', 'po4.andrew.cmu.edu', 'I', 'am', 'sure', 'some', 'bashers', 'of', 'Pens', 'fans', 'are', 'pretty', 'confused', 'about', 'the', 'lack', 'of', 'any', 'kind', 'of', 'posts', 'about', 'the', 'recent', 'Pens', 'massacre', 'of', 'the', 'Devils', 'Actually', 'I']

stop words

Stop words are words that don’t bring any useful information in the processing text and they are usually filtered out before or after processing. They refer to the most common words like I, her, by, about, here, etc. Removing such words in the context of sentiment analysis can easily upgrade your accuracy significantly.

In [54]:
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))
tokens = [w for w in tokens if not w in stop_words]
print(tokens[0:50])
['From', 'Mamatha', 'Devineni', 'Ratnam', 'mr47+', 'andrew.cmu.edu', 'Subject', 'Pens', 'fans', 'reactions', 'Organization', 'Post', 'Office', 'Carnegie', 'Mellon', 'Pittsburgh', 'PA', 'Lines', '12', 'NNTP-Posting-Host', 'po4.andrew.cmu.edu', 'I', 'sure', 'bashers', 'Pens', 'fans', 'pretty', 'confused', 'lack', 'kind', 'posts', 'recent', 'Pens', 'massacre', 'Devils', 'Actually', 'I', 'bit', 'puzzled', 'bit', 'relieved', 'However', 'I', 'going', 'put', 'end', 'non-PIttsburghers', 'relief', 'bit', 'praise']

A helpful visualization tool wordcloud helps to create word clouds by placing words on a canvas randomly, with sizes proportional to their frequency in the text.

In [55]:
from wordcloud import WordCloud
import matplotlib.pyplot as plt

frequency_dist = nltk.FreqDist(tokens)
wordcloud = WordCloud().generate_from_frequencies(frequency_dist)

plt.figure(figsize=(12,8))
plt.imshow(wordcloud)
plt.axis("off")
plt.show()

Plot count of individual class occurrences

In [13]:
f, axe = plt.subplots(1, 1,figsize=(14,5))
axe.grid(False)
sns.despine(left=True, bottom=True)
sns.countplot(df['class'], ax=axe)
for p in axe.patches:
        axe.annotate('{}'.format(p.get_height()), (p.get_x()+0.15, p.get_height()+30), 
                     color='k', fontweight='medium', fontsize=12)
        axe.yaxis.tick_left()
        axe.set_xlabel('Target', fontsize=14)
        axe.set_ylabel('Number of Occurrences', fontsize=14)

there are 18846 newsgroup documents, distributed almost evenly across 20 different newsgroups. Our goal is to create a classifier that will classify each document based on its content.

Define training function

In [14]:
from sklearn.model_selection import train_test_split
import time

def train(classifier, X, y):
    start = time.time()
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=11)

    classifier.fit(X_train, y_train)
    end = time.time()

    print("Accuracy: " + str(classifier.score(X_test, y_test)) + ", Training Duration: " + str(end - start))
  • let’s build a classifier! We’ll start with a multinomial Naive Bayes classifier which is suitable for discrete classification.
  • The multinomial Naive Bayes classifier is suitable for classification with discrete features (e.g., word counts for text classification). The multinomial distribution normally requires integer feature counts. However, in practice, fractional counts such as tf-idf may also work.
  • The multinomial distribution is a generalization of the binomial distribution. For example, it models the probability of counts for each side of a k-sided die rolled n times. For n independent trials each of which leads to a success for exactly one of k categories, with each category having a given fixed success probability, the multinomial distribution gives the probability of any particular combination of numbers of successes for each categories.
In [15]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer

trial1 = Pipeline([ ('vectorizer', TfidfVectorizer()), ('classifier', MultinomialNB())])

train(trial1, X, y)
Accuracy: 0.8538461538461538, Training Duration: 2.955958127975464

Let us remove the stop words to further cleanup the text corpus.

In [16]:
from nltk.corpus import stopwords

trial2 = Pipeline([ ('vectorizer', TfidfVectorizer(stop_words=stopwords.words('english'))),
                   ('classifier', MultinomialNB())])

train(trial2, X, y)
Accuracy: 0.8806366047745358, Training Duration: 2.9737298488616943

Parameter tuning

Accuracy is better and the training is faster, but the alpha parameter of the Naive-Bayes classifier is the default, so let’s do some hyperparameter tuning.

  • We will use GridSearchCV an exhaustive search over specified parameter values for an estimator.
In [17]:
# Find correct parameter name for the estimator
trial2.get_params().keys()
Out[17]:
dict_keys(['memory', 'steps', 'verbose', 'vectorizer', 'classifier', 'vectorizer__analyzer', 'vectorizer__binary', 'vectorizer__decode_error', 'vectorizer__dtype', 'vectorizer__encoding', 'vectorizer__input', 'vectorizer__lowercase', 'vectorizer__max_df', 'vectorizer__max_features', 'vectorizer__min_df', 'vectorizer__ngram_range', 'vectorizer__norm', 'vectorizer__preprocessor', 'vectorizer__smooth_idf', 'vectorizer__stop_words', 'vectorizer__strip_accents', 'vectorizer__sublinear_tf', 'vectorizer__token_pattern', 'vectorizer__tokenizer', 'vectorizer__use_idf', 'vectorizer__vocabulary', 'classifier__alpha', 'classifier__class_prior', 'classifier__fit_prior'])
In [19]:
from sklearn.model_selection import GridSearchCV

# Number of cross validation folds
n_folds = 5

# Returns a range between two values evenly sampled in log space.
alpha = np.geomspace(0.05, 0.00005, 30) 

param_grid1 = {'classifier__alpha': alpha}

estimator1 = GridSearchCV(estimator=trial2,
                         param_grid=param_grid1,
                         n_jobs=4,
                         cv=n_folds,
                         return_train_score=True,
                         scoring='accuracy'
                        )

estimator1.fit(X=X, y=y)
Out[19]:
GridSearchCV(cv=None, error_score=nan,
             estimator=Pipeline(memory=None,
                                steps=[('vectorizer',
                                        TfidfVectorizer(analyzer='word',
                                                        binary=False,
                                                        decode_error='strict',
                                                        dtype=<class 'numpy.float64'>,
                                                        encoding='utf-8',
                                                        input='content',
                                                        lowercase=True,
                                                        max_df=1.0,
                                                        max_features=None,
                                                        min_df=1,
                                                        ngram_range=(1, 1),
                                                        norm='l2',
                                                        preprocessor=None,
                                                        smooth_idf=True,
                                                        stop_words=['i', '...
       2.86807626e-03, 2.26017683e-03, 1.78112395e-03, 1.40360810e-03,
       1.10610815e-03, 8.71664411e-04, 6.86911898e-04, 5.41318367e-04,
       4.26583926e-04, 3.36167877e-04, 2.64915845e-04, 2.08765947e-04,
       1.64517228e-04, 1.29647190e-04, 1.02167986e-04, 8.05131014e-05,
       6.34480502e-05, 5.00000000e-05])},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
             scoring='accuracy', verbose=0)

Create df of search results to plot

In [26]:
results1 = estimator1.cv_results_

test_scores1 = pd.DataFrame({fold: results1[f'split{fold}_test_score'] for fold in range(n_folds)}, 
                           index=alpha).stack().reset_index()

test_scores1.columns = ['alpha', 'fold', 'accuracy']
In [31]:
mean1 = test_scores1.groupby('alpha').accuracy.mean()
best_alpha1, best_score1 = mean1.idxmin(), mean1.min()
In [28]:
test_scores1.head()
Out[28]:
alpha fold accuracy
0 0.05 0 0.912732
1 0.05 1 0.924383
2 0.05 2 0.919873
3 0.05 3 0.910321
4 0.05 4 0.910056

Function to plot grid search

In [41]:
def plot_results(model, name='Num Trees', param_name = 'param_regressor__min_samples_split'):
    # param_name = param_name
    # Extract information from the cross validation model
    train_scores = model.cv_results_['mean_train_score']
    test_scores = model.cv_results_['mean_test_score']
    train_time = model.cv_results_['mean_fit_time']
    param_values = list(model.cv_results_[param_name])
    
    # Plot the scores over the parameter
    plt.subplots(1, 2, figsize=(14, 5))
    plt.subplot(121)
    plt.errorbar(param_values, train_scores, yerr=train_scores.std(), label = 'train', color='b', fmt='-o')
    plt.errorbar(param_values, test_scores, yerr=test_scores.std(), label = 'test', color='red', fmt='-o')
    plt.legend(bbox_to_anchor=(1, 1.05), prop={'size': 14})
    plt.grid(False)
    plt.xlabel(name, fontsize=14)
    plt.ylabel('Accuracy', fontsize=14)
    plt.title(f'Score vs {name}', fontsize=14)
    
    plt.subplot(122)
    plt.plot(param_values, train_time, 'ro-', color='y')
    plt.xlabel(name, fontsize=14)
    plt.ylabel('Train Time (sec)', fontsize=14)
    plt.title('Training Time vs %s' % name, fontsize=14)
    plt.grid(False)
    plt.tight_layout(pad = 1)
    plt.show();
In [37]:
estimator1.cv_results_.keys()
Out[37]:
dict_keys(['mean_fit_time', 'std_fit_time', 'mean_score_time', 'std_score_time', 'param_classifier__alpha', 'params', 'split0_test_score', 'split1_test_score', 'split2_test_score', 'split3_test_score', 'split4_test_score', 'mean_test_score', 'std_test_score', 'rank_test_score', 'split0_train_score', 'split1_train_score', 'split2_train_score', 'split3_train_score', 'split4_train_score', 'mean_train_score', 'std_train_score'])
In [42]:
print(f'GridSearchCV Results Best Alpha: {best_alpha1} | Best Accuracy: {best_score1:.3f}')

plot_results(estimator1, name = 'Alpha', 
             param_name='param_classifier__alpha')
GridSearchCV Results Best Alpha: 4.9999999999999996e-05 | Best Accuracy: 0.908

We can see that the best accuracy for alpha found by GridSearchCV is an improvement now lets set the parameter min_dif to ignore the words that appear fewer than x times in all documents:

In [47]:
for min_df in [0, 2, 3, 4, 5]:
    trial3 = Pipeline([('vectorizer', TfidfVectorizer(stop_words=stopwords.words('english'), 
                                                      min_df=min_df)),
                       ('classifier', MultinomialNB(alpha=best_alpha1))])
    
    print(f'\nTraining min_df = {min_df}')
    train(trial3, X, y)
Training min_df = 0
Accuracy: 0.9114058355437665, Training Duration: 2.438108444213867

Training min_df = 2
Accuracy: 0.9050397877984084, Training Duration: 2.3294408321380615

Training min_df = 3
Accuracy: 0.9023872679045093, Training Duration: 2.471414804458618

Training min_df = 4
Accuracy: 0.9007957559681697, Training Duration: 2.250080108642578

Training min_df = 5
Accuracy: 0.89973474801061, Training Duration: 2.2035810947418213

Resulting accuracy slightly decreased. We can try and stem the data with nltk (i.e. reduce inflected words to their word root, with he use of tokenizer within TfidfVectorizer, it usually helps performance, and we can add punctuations to a list of stop words nltks string.

Alt text that describes the graphic

  • Stemming is a process of reducing inflected words to their word stem, i.e root. It doesn’t have to be morphological; you can just chop off the words ends. For example, the word “solv” is the stem of words “solve” and “solved”.
  • Lemmatization is another approach to remove inflection by determining the part of speech and utilizing detailed database of the language.
    • The lemmatized form of studies is: study
    • The lemmatized form of studying is: study
  • Tokenization converts sentences to words.
In [20]:
# using NLTK library, we can do lot of text preprocesing
import nltk
from nltk.tokenize import word_tokenize

#function to split text into word
tokens = word_tokenize("The quick brown fox jumps over the lazy dog")
print(tokens)
['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']
In [19]:
# NLTK provides several stemmer interfaces like Porter stemmer, #Lancaster Stemmer, Snowball Stemmer
from nltk.stem import PorterStemmer

porter = PorterStemmer()
stems = []
for t in tokens:    
    stems.append(porter.stem(t))
print(stems)
['the', 'quick', 'brown', 'fox', 'jump', 'over', 'the', 'lazi', 'dog']
In [51]:
import string

def stemming_tokenizer(text):
    stemmer = PorterStemmer()
    return [stemmer.stem(w) for w in word_tokenize(text)]

trial5 = Pipeline([ ('vectorizer', TfidfVectorizer(tokenizer=stemming_tokenizer, 
                     stop_words=stopwords.words('english') + list(string.punctuation))), 
                    ('classifier', MultinomialNB(alpha=0.005))])

train(trial5, X, y)
/home/aj/.local/lib/python3.6/site-packages/sklearn/feature_extraction/text.py:385: UserWarning: Your stop_words may be inconsistent with your preprocessing. Tokenizing the stop words generated tokens ["'d", "'ll", "'re", "'s", "'ve", '``', 'abov', 'ani', 'becaus', 'befor', 'could', 'doe', 'dure', 'ha', 'hi', 'might', 'must', "n't", 'need', 'onc', 'onli', 'ourselv', 'sha', 'themselv', 'thi', 'veri', 'wa', 'whi', 'wo', 'would', 'yourselv'] not in stop_words.
  'stop_words.' % sorted(inconsistent))
Accuracy: 0.9228116710875331, Training Duration: 71.27371549606323

Accuracy is slightly better, but the training was much longer. This is a great example of time consumption created by stemming. Sometimes accuracy can cost you computation speed and you should find a balance between the two. Don’t stem if the accuracy doesn’t improve significantly.

Test different classifiers

Now, let’s try some other usual text classifiers like Support Vector Classification with stochastic gradient descent SGDClassifier and linear SVC. They are initially slower but may get better accuracy without the use of a stemmer:

In [56]:
from sklearn.linear_model import SGDClassifier
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier

for classifier in [SGDClassifier(), LinearSVC()]:
    
    trial6 = Pipeline([('vectorizer', TfidfVectorizer(stop_words=stopwords.words('english') + list(string.punctuation))), 
                       ('classifier', classifier)])
    
    print(f'\nTraining with {classifier.__class__.__name__}')
    train(trial6, X, y)
Training with SGDClassifier
Accuracy: 0.9259946949602123, Training Duration: 3.730632781982422

Training with LinearSVC
Accuracy: 0.9320954907161804, Training Duration: 3.7987682819366455

An accuracy of 93.2% for linear SVC is very acceptable. Acceptable accuracy depends on the specific problem: type/length of the text, number of categories and differences between them, etc. Here, an accuracy of 93% is good because we have 20 categories and some of them are similar, like comp.sys.ibm.pc.hardware and comp.sys.mac.hardware.

Model evaluation

We could diver deeper but we will stop here and check the characteristics of the best model. We will use confusion_matrix() from sckit-learn to compare real and predicted categories:

In [60]:
from sklearn.metrics import confusion_matrix
import matplotlib.pyplot as plt

start = time.time()
classifier = Pipeline([('vectorizer', TfidfVectorizer(stop_words=stopwords.words('english') + list(string.punctuation))),
                       ('classifier', LinearSVC(C=10))])

X_train, X_test, y_train, y_test = train_test_split(news.data, news.target, 
                                                    test_size=0.2, random_state=11)
classifier.fit(X_train, y_train)
end = time.time()

print("Accuracy: " + str(classifier.score(X_test, y_test)) + ", Time duration: " + str(end - start))
Accuracy: 0.9352785145888595, Time duration: 7.838569402694702
In [63]:
y_pred = classifier.predict(X_test)
conf_mat = confusion_matrix(y_test, y_pred)
labels = news.target_names

# Plot confusion_matrix
fig, ax = plt.subplots(figsize=(13, 10))
sns.heatmap(conf_mat, annot=True, fmt ="d", cmap = "Set3",
xticklabels=labels, yticklabels=labels)
plt.ylabel('Actual', fontsize=14)
plt.xlabel('Predicted', fontsize=14)
b, t = plt.ylim() 
b += 0.5 
t -= 0.5 
plt.ylim(b, t) 
plt.show()

The confusion matrix is a great way to see which categories the model is mixing. For example, there are 17 articles from category comp.os.ms-windows.mics that are wrongly classified as comp.sys.ibm.pc.hardware. Also, let’s check the accuracy of each category separately with classification_report():

In [64]:
from sklearn import metrics
print(metrics.classification_report(y_test, y_pred, target_names=labels))
                          precision    recall  f1-score   support

             alt.atheism       0.96      0.95      0.96       172
           comp.graphics       0.86      0.91      0.88       184
 comp.os.ms-windows.misc       0.92      0.85      0.89       204
comp.sys.ibm.pc.hardware       0.83      0.87      0.85       195
   comp.sys.mac.hardware       0.93      0.92      0.93       195
          comp.windows.x       0.93      0.87      0.90       204
            misc.forsale       0.85      0.88      0.86       164
               rec.autos       0.93      0.94      0.93       180
         rec.motorcycles       0.97      0.97      0.97       173
      rec.sport.baseball       0.97      0.97      0.97       217
        rec.sport.hockey       0.97      0.99      0.98       178
               sci.crypt       0.95      0.98      0.96       197
         sci.electronics       0.93      0.93      0.93       199
                 sci.med       0.94      0.99      0.97       183
               sci.space       0.96      0.99      0.97       207
  soc.religion.christian       0.94      0.96      0.95       211
      talk.politics.guns       0.97      0.96      0.96       208
   talk.politics.mideast       0.99      0.99      0.99       200
      talk.politics.misc       0.96      0.93      0.94       175
      talk.religion.misc       0.94      0.83      0.88       124

                accuracy                           0.94      3770
               macro avg       0.94      0.93      0.93      3770
            weighted avg       0.94      0.94      0.94      3770

Category misc.forsale has the lowest accuracy, but the overall accuracy is good.

Finally lets take a look at the PrecisionRecallCurve and not the most ideal in this situation but for kicks the ROC curve.

In [74]:
from yellowbrick.classifier import PrecisionRecallCurve
from yellowbrick.classifier import ROCAUC

fig, ax = plt.subplots(figsize=(10, 5))
roc = ROCAUC(classifier, classes=labels, ax=ax)
roc.fit(X_train, y_train)        
roc.score(X_test, y_test)
plt.title('ROC AUC', fontweight='bold', fontsize=14)
plt.xlabel('False Positive Rate', fontsize=14)
plt.ylabel('True Positive Rate', fontsize=14)
plt.grid(False)
plt.legend(bbox_to_anchor=(1.1, 1.05))
plt.show();

fig, ax = plt.subplots(figsize=(10, 5))
prc = PrecisionRecallCurve(classifier, ax=ax)
prc.fit(X_train, y_train)        
prc.score(X_test, y_test)
plt.title('Precision Recall Curve', fontweight='bold', fontsize=14)
plt.xlabel('Recall', fontsize=14)
plt.ylabel('Precision', fontsize=14)
plt.grid(False)
plt.legend()
plt.show();
In [ ]: