# Dimensionality Reduction

• Dimensionality reduction produces new data that captures the most important information contained in the source data. Rather than grouping existing data into clusters, these algorithms transform existing data into a new dataset that uses significantly fewer features or observations to represent the original information.

• As the number of features increases, the model becomes more complex and more likely to overfit. A machine learning model trained on a large number of features becomes increasingly dependent on the training data and can overfit, resulting in poor performance on new data.
• Reducing the number of features can help to:
• Improve accuracy.
• Reduce overfitting.
• Speed up training.
• Improve data visualization.
• Increase model explainability.
• Another commonly used technique to reduce the number of features in a dataset is Feature Selection. Dimensionality reduction as described above is a form of Feature Extraction: it builds new features from the existing ones. Feature Selection instead ranks the importance of the existing features in the dataset and discards the less important ones.
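The difference between the two approaches can be sketched on synthetic data. The dataset from `make_classification`, the choice of `SelectKBest` with `f_classif` as the selector, and `k=3` are all assumptions made for illustration; they are not part of the mushroom walk-through.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic data (an assumption for illustration, not the mushroom dataset).
X_demo, y_demo = make_classification(
    n_samples=200, n_features=10, n_informative=3, n_redundant=2, random_state=0
)

# Feature selection: keep 3 of the original columns, values unchanged.
selector = SelectKBest(score_func=f_classif, k=3)
X_selected = selector.fit_transform(X_demo, y_demo)
kept = selector.get_support(indices=True)

# Feature extraction: build 3 new columns, each a combination of all 10 inputs.
extractor = PCA(n_components=3)
X_extracted = extractor.fit_transform(X_demo)

print("columns kept by selection:", kept)
print("selected data is a subset of the original:",
      np.allclose(X_selected, X_demo[:, kept]))
print("PCA loadings shape (components x original features):",
      extractor.components_.shape)
```

Both results have three columns, but the selected columns are still original features, while each extracted column mixes all ten.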

## Unsupervised dimensionality reduction

• If your number of features is high, it may be useful to reduce it with an unsupervised algorithm prior to fitting a supervised algorithm. Many of the Unsupervised learning methods implement a transform method that can reduce the dimensionality.
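One way to chain an unsupervised reducer with a supervised model is a scikit-learn pipeline, so the reducer is fitted on the training split only. This is a minimal sketch on synthetic data; the dataset, `n_components=5` and `n_estimators=100` are assumptions chosen to keep it fast.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (an assumption for illustration).
X_demo, y_demo = make_classification(
    n_samples=500, n_features=30, n_informative=5, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)

# Scale -> reduce to 5 components -> classify.
pipe = make_pipeline(
    StandardScaler(),
    PCA(n_components=5),
    RandomForestClassifier(n_estimators=100, random_state=0),
)
pipe.fit(X_train, y_train)

# The first two steps expose `transform`, so the reduced features can be inspected.
reduced = pipe[:-1].transform(X_test)
accuracy = pipe.score(X_test, y_test)

print("reduced test shape:", reduced.shape)
print("test accuracy:", round(accuracy, 3))
```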

### Walk through

#### Imports

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
import matplotlib.patches as mpatches
from pylab import rcParams
import seaborn as sns
from matplotlib.pyplot import figure
import plotly.graph_objs as go
from plotly.offline import init_notebook_mode, iplot
from sklearn.utils import shuffle
from sklearn import preprocessing
from sklearn.preprocessing import LabelEncoder
from cycler import cycler
import warnings

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report,confusion_matrix
from sklearn.ensemble import RandomForestClassifier
from yellowbrick.classifier import ConfusionMatrix
from yellowbrick.classifier import ROCAUC
from yellowbrick.classifier import PrecisionRecallCurve
from yellowbrick.classifier import ClassificationReport
warnings.filterwarnings('ignore')
plt.rcParams['axes.prop_cycle'] = cycler(color='brgy')


In [2]:
df = pd.read_csv('mushrooms.csv')
pd.options.display.max_columns = None
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8124 entries, 0 to 8123
Data columns (total 23 columns):
class                       8124 non-null object
cap-shape                   8124 non-null object
cap-surface                 8124 non-null object
cap-color                   8124 non-null object
bruises                     8124 non-null object
odor                        8124 non-null object
gill-attachment             8124 non-null object
gill-spacing                8124 non-null object
gill-size                   8124 non-null object
gill-color                  8124 non-null object
stalk-shape                 8124 non-null object
stalk-root                  8124 non-null object
stalk-surface-above-ring    8124 non-null object
stalk-surface-below-ring    8124 non-null object
stalk-color-above-ring      8124 non-null object
stalk-color-below-ring      8124 non-null object
veil-type                   8124 non-null object
veil-color                  8124 non-null object
ring-number                 8124 non-null object
ring-type                   8124 non-null object
spore-print-color           8124 non-null object
population                  8124 non-null object
habitat                     8124 non-null object
dtypes: object(23)
memory usage: 1.4+ MB


View data

In [3]:
df.head()

Out[3]:
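| | class | cap-shape | cap-surface | cap-color | bruises | odor | gill-attachment | gill-spacing | gill-size | gill-color | stalk-shape | stalk-root | stalk-surface-above-ring | stalk-surface-below-ring | stalk-color-above-ring | stalk-color-below-ring | veil-type | veil-color | ring-number | ring-type | spore-print-color | population | habitat |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | p | x | s | n | t | p | f | c | n | k | e | e | s | s | w | w | p | w | o | p | k | s | u |
| 1 | e | x | s | y | t | a | f | c | b | k | e | c | s | s | w | w | p | w | o | p | n | n | g |
| 2 | e | b | s | w | t | l | f | c | b | n | e | c | s | s | w | w | p | w | o | p | n | n | m |
| 3 | p | x | y | w | t | p | f | c | n | n | e | e | s | s | w | w | p | w | o | p | k | s | u |
| 4 | e | x | s | g | f | n | f | w | b | k | t | e | s | s | w | w | p | w | o | e | n | a | g |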

Plot target classes

In [4]:
total = len(df)

plt.figure(figsize=(13,5))
plt.subplot(121)
g = sns.countplot(x='class', data=df)
g.set_title("Mushroom class Count \np: Poisonous | e: Edible", fontsize=14)
g.set_ylabel('Count', fontsize=14)
# Annotate each bar with its share of the dataset
for p in g.patches:
    height = p.get_height()
    g.text(p.get_x()+p.get_width()/2.,
           height + 5,
           '{:1.2f}%'.format(height/total*100),
           ha="center", fontsize=14, fontweight='bold')
plt.margins(y=0.1)
plt.show()

In [5]:
df['class'].value_counts()

Out[5]:
e    4208
p    3916
Name: class, dtype: int64

Split features and target

In [6]:
X = df.drop(['class'], axis = 1)
Y = df['class']


One-hot encode categorical variables

In [7]:
X = pd.get_dummies(X, prefix_sep='_')
X.head()

Out[7]:
cap-shape_b cap-shape_c cap-shape_f cap-shape_k cap-shape_s cap-shape_x cap-surface_f cap-surface_g cap-surface_s cap-surface_y cap-color_b cap-color_c cap-color_e cap-color_g cap-color_n cap-color_p cap-color_r cap-color_u cap-color_w cap-color_y bruises_f bruises_t odor_a odor_c odor_f odor_l odor_m odor_n odor_p odor_s odor_y gill-attachment_a gill-attachment_f gill-spacing_c gill-spacing_w gill-size_b gill-size_n gill-color_b gill-color_e gill-color_g gill-color_h gill-color_k gill-color_n gill-color_o gill-color_p gill-color_r gill-color_u gill-color_w gill-color_y stalk-shape_e stalk-shape_t stalk-root_? stalk-root_b stalk-root_c stalk-root_e stalk-root_r stalk-surface-above-ring_f stalk-surface-above-ring_k stalk-surface-above-ring_s stalk-surface-above-ring_y stalk-surface-below-ring_f stalk-surface-below-ring_k stalk-surface-below-ring_s stalk-surface-below-ring_y stalk-color-above-ring_b stalk-color-above-ring_c stalk-color-above-ring_e stalk-color-above-ring_g stalk-color-above-ring_n stalk-color-above-ring_o stalk-color-above-ring_p stalk-color-above-ring_w stalk-color-above-ring_y stalk-color-below-ring_b stalk-color-below-ring_c stalk-color-below-ring_e stalk-color-below-ring_g stalk-color-below-ring_n stalk-color-below-ring_o stalk-color-below-ring_p stalk-color-below-ring_w stalk-color-below-ring_y veil-type_p veil-color_n veil-color_o veil-color_w veil-color_y ring-number_n ring-number_o ring-number_t ring-type_e ring-type_f ring-type_l ring-type_n ring-type_p spore-print-color_b spore-print-color_h spore-print-color_k spore-print-color_n spore-print-color_o spore-print-color_r spore-print-color_u spore-print-color_w spore-print-color_y population_a population_c population_n population_s population_v population_y habitat_d habitat_g habitat_l habitat_m habitat_p habitat_u habitat_w
0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 1 1 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 1 0 0 1 0 0 1 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0
1 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 1 1 0 0 0 0 0 0 0 0 0 1 1 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 1 0 0 1 0 0 1 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0
2 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 1 0 0 0 0 0 0 1 1 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 1 0 0 1 0 0 1 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0
3 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 1 0 0 0 1 1 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 1 0 0 1 0 0 1 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0
4 0 0 0 0 0 1 0 0 1 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 1 0 1 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 1 0 0 1 0 0 1 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0
In [8]:
print(f'Original column count = {len(df.columns)}, One-hot encoded column count = {len(X.columns)}')

Original column count = 23, One-hot encoded column count = 117
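The growth in column count follows from how one-hot encoding works: each categorical column becomes one indicator column per distinct value. A small sketch on a toy frame (the three columns and their values are made up for illustration, loosely mirroring the mushroom column names):

```python
import pandas as pd

# Toy frame (hypothetical values, for illustration only).
toy = pd.DataFrame({
    "cap-shape": ["x", "x", "b", "x"],
    "odor": ["p", "a", "l", "p"],
    "veil-type": ["p", "p", "p", "p"],
})

encoded = pd.get_dummies(toy, prefix_sep="_")

# One indicator column per distinct value in each original column.
expected_columns = int(toy.nunique().sum())

print(list(encoded.columns))
print("encoded columns:", encoded.shape[1], "expected:", expected_columns)
```

A column with a single category, such as `veil-type` in the output above, yields one constant indicator column that carries no information.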


Label encode target variable

In [9]:
Y = LabelEncoder().fit_transform(Y)
Y

Out[9]:
array([1, 0, 0, ..., 0, 1, 0])
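It is worth checking which integer each class received, because `LabelEncoder` assigns codes in sorted label order. A minimal sketch using the five class values shown by `df.head()` above:

```python
from sklearn.preprocessing import LabelEncoder

# First five class values from df.head(): p = poisonous, e = edible.
labels = ["p", "e", "e", "p", "e"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)

# classes_ holds the labels in sorted order; the position is the integer code.
mapping = {str(label): int(code) for code, label in enumerate(encoder.classes_)}

print(mapping)        # {'e': 0, 'p': 1}
print(list(encoded))  # [1, 0, 0, 1, 0]
```

So in the encoded target, 0 stands for edible and 1 for poisonous.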

Scale features

In [10]:
X = StandardScaler().fit_transform(X)


Function to train, test and score a RandomForestClassifier

In [11]:
# Classifier
trainedforest = RandomForestClassifier(n_estimators=700)

In [12]:
def forest_test(X, Y):
    # Hold out 30% of the rows for testing
    X_Train, X_Test, Y_Train, Y_Test = train_test_split(X, Y, test_size = 0.30, random_state=0)
    # Classifier
    trainedforest = RandomForestClassifier(n_estimators=700).fit(X_Train,Y_Train)

    f, axes = plt.subplots(1,3 ,figsize=(15,5))

    preds = trainedforest.predict(X_Test)
    # LabelEncoder sorts labels alphabetically: 'e' (edible) -> 0, 'p' (poisonous) -> 1
    classes = ['Edible', 'Poisonous']
    cm = ConfusionMatrix(
        trainedforest, classes=classes, ax = axes[0],
        label_encoder={0: 'Edible', 1: 'Poisonous'})
    cm.fit(X_Train, Y_Train)
    cm.score(X_Test, Y_Test)
    axes[0].set_title('Confusion Matrix')
    axes[0].set_xlabel('Predicted Class')
    axes[0].set_ylabel('True Class')

    roc = ROCAUC(trainedforest, classes=classes, ax = axes[1])
    roc.fit(X_Train, Y_Train)
    roc.score(X_Test, Y_Test)
    axes[1].set_title('ROC AUC')
    axes[1].grid(False)
    axes[1].legend()

    prc = PrecisionRecallCurve(trainedforest, ax = axes[2])
    prc.fit(X_Train, Y_Train)
    prc.score(X_Test, Y_Test)
    axes[2].set_title('Precision Recall Curve')
    axes[2].grid(False)
    axes[2].legend()

    plt.tight_layout()
    plt.show();

    print('\n',classification_report(Y_Test,preds))


Functions to test and plot 2d and 3d representations of the features vs the target

In [13]:
def complete_test_2D(X, Y, plot_name = ''):

    # Attach the encoded class to the two reduced components
    Small_df = pd.DataFrame(data = X, columns = ['C1', 'C2'])
    Small_df = pd.concat([Small_df, df['class']], axis = 1)
    Small_df['class'] = LabelEncoder().fit_transform(Small_df['class'])
    forest_test(X, Y)

    plt.figure(figsize=(10,8))

    # 1 = poisonous (red), 0 = edible (blue)
    classes = [1, 0]
    colors = ['r', 'b']

    for clas, color in zip(classes, colors):

        plt.scatter(Small_df.loc[Small_df['class'] == clas, 'C1'],
                    Small_df.loc[Small_df['class'] == clas, 'C2'],
                    c = color, alpha=0.5)

    plt.xlabel('Component 1', fontsize = 12)
    plt.ylabel('Component 2', fontsize = 12)
    plt.title(f'{plot_name}', fontsize = 15)
    plt.legend(['Poisonous', 'Edible'])
    plt.grid(False)

    plt.show()

In [14]:
def complete_test_3D(X, Y, plot_name = ''):

    # Attach the encoded class to the three reduced components
    Small_df = pd.DataFrame(data = X, columns = ['C1', 'C2', 'C3'])
    Small_df = pd.concat([Small_df, df['class']], axis = 1)
    Small_df['class'] = LabelEncoder().fit_transform(Small_df['class'])
    forest_test(X, Y)

    fig=plt.figure(figsize=(8,6))
    # Create 3D axes for the scatter plot
    ax = fig.add_subplot(111, projection='3d')

    pnt3d = ax.scatter(Small_df['C1'],Small_df['C2'],Small_df['C3'],
                       c=Small_df['class'],alpha=.5, s=75,cmap='coolwarm',
                       label=list(Small_df.columns))

    # Legend patches match the coolwarm colormap: 0 -> blue, 1 -> red
    one = mpatches.Patch(facecolor='b', label='0', linewidth = 0.5, edgecolor = 'black')
    two = mpatches.Patch(facecolor='r', label = '1', linewidth = 0.5, edgecolor = 'black')
    ax.set_title(f'{plot_name}', fontsize = 15)
    ax.set(xlabel=f'\n{Small_df.columns[0]}',ylabel=f'\n{Small_df.columns[1]}',zlabel=f'\n{Small_df.columns[2]}')
    ax.legend(handles=[one, two], title="class", fontsize='medium', fancybox=True)
    plt.show()


Initial test

In [15]:
forest_test(X, Y)

              precision    recall  f1-score   support

           0       1.00      1.00      1.00      1272
           1       1.00      1.00      1.00      1166

    accuracy                           1.00      2438
   macro avg       1.00      1.00      1.00      2438
weighted avg       1.00      1.00      1.00      2438



## Principal Component Analysis (PCA)

• When using PCA, we take our original data as input and look for combinations of the input features that best summarize the original data distribution, so that its dimensionality can be reduced. PCA does this by finding the directions of maximum variance, which is equivalent to minimizing the reconstruction error of the projected data and to preserving pairwise distances as far as a linear projection allows. The original data is projected onto a set of orthogonal axes (the principal components), and the axes are ranked in order of importance by the variance they explain.
• PCA is an unsupervised learning algorithm, so it ignores the data labels and considers only variation. The directions of greatest variance are not necessarily the ones that best separate the classes, so a classifier trained on the reduced data can in some cases misclassify more samples.
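The projection described above can be reproduced by hand with a singular value decomposition of the centered data. This is a sketch on random data (an assumption for illustration); it compares the result with scikit-learn's `PCA` up to the sign of each component, which is arbitrary.

```python
import numpy as np
from sklearn.decomposition import PCA

# Random correlated data (an assumption for illustration).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))

# 1. Center the data, 2. take the SVD, 3. keep the top-2 right singular vectors.
X_centered = X_demo - X_demo.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
components = Vt[:2]                      # orthogonal axes, ranked by variance
scores = X_centered @ components.T       # data projected onto those axes
explained_variance = S[:2] ** 2 / (len(X_demo) - 1)

# Reconstruction error of the 2-component projection equals the
# variance carried by the discarded directions.
reconstruction_error = np.sum((X_centered - scores @ components) ** 2)
discarded = np.sum(S[2:] ** 2)

pca = PCA(n_components=2, svd_solver="full").fit(X_demo)
print("axes orthonormal:", np.allclose(components @ components.T, np.eye(2)))
print("scores match sklearn up to sign:",
      np.allclose(np.abs(scores), np.abs(pca.transform(X_demo))))
print("reconstruction error == discarded variance:",
      np.isclose(reconstruction_error, discarded))
```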

Testing first 2 principal components

In [16]:
from sklearn.decomposition import PCA

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)
PCA_df = pd.DataFrame(data = X_pca, columns = ['PC1', 'PC2'])
PCA_df = pd.concat([PCA_df, df['class']], axis = 1)
PCA_df['class'] = LabelEncoder().fit_transform(PCA_df['class'])
PCA_df.head()

Out[16]:
| | PC1 | PC2 | class |
|---|---|---|---|
| 0 | -3.284739 | 1.020096 | 1 |
| 1 | -3.969489 | -0.856895 | 0 |
| 2 | -4.958560 | -0.211109 | 0 |
| 3 | -3.469972 | 0.337926 | 1 |
| 4 | -2.726585 | 0.889647 | 0 |

In [17]:
complete_test_2D(X_pca, Y, 'PCA')

              precision    recall  f1-score   support

           0       0.95      0.98      0.96      1272
           1       0.98      0.94      0.96      1166

    accuracy                           0.96      2438
   macro avg       0.96      0.96      0.96      2438
weighted avg       0.96      0.96      0.96      2438



First 2 principal components Explained Variance

In [18]:
var_ratio = pca.explained_variance_ratio_
cum_var_ratio = np.cumsum(var_ratio)

# create column names
col_num = X_pca.shape[1]
feat_names = ['PC'+str(num) for num in list(range(1,col_num+1,1))]

sns.barplot(y=var_ratio, x=feat_names)
sns.pointplot(y=cum_var_ratio, x=feat_names, color='black', label='cumulative')
plt.grid(False)
plt.title("Explained variance ratio of each principal component", fontsize=14)
plt.ylabel("Explained variance ratio")
plt.legend(['cumulative'])
plt.show()


Testing first 3 principal components

In [19]:
pca = PCA(n_components=3)
X_pca = pca.fit_transform(X)
complete_test_3D(X_pca, Y, 'PCA')

              precision    recall  f1-score   support

           0       0.98      0.99      0.98      1272
           1       0.99      0.97      0.98      1166

    accuracy                           0.98      2438
   macro avg       0.98      0.98      0.98      2438
weighted avg       0.98      0.98      0.98      2438


In [20]:
var_ratio = pca.explained_variance_ratio_
cum_var_ratio = np.cumsum(var_ratio)

# create column names
col_num = X_pca.shape[1]
feat_names = ['PC'+str(num) for num in list(range(1,col_num+1,1))]

sns.barplot(y=var_ratio, x=feat_names)
sns.pointplot(y=cum_var_ratio, x=feat_names, color='black', label='cumulative')
plt.grid(False)
plt.title("Explained variance ratio of each principal component", fontsize=14)
plt.ylabel("Explained variance ratio")
plt.legend(['cumulative'])
plt.show()
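The cumulative explained variance can also be used to choose the number of components instead of fixing it at 2 or 3. A sketch on synthetic data follows; the data and the 95% threshold are assumptions for illustration. Passing a float between 0 and 1 as `n_components` asks scikit-learn's `PCA` to keep just enough components to exceed that share of the variance.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic data: 12 noisy features driven by 4 latent factors (an assumption).
rng = np.random.default_rng(0)
latent = rng.normal(size=(300, 4))
X_demo = latent @ rng.normal(size=(4, 12)) + 0.3 * rng.normal(size=(300, 12))
X_scaled = StandardScaler().fit_transform(X_demo)

# Fit all components and find how many are needed to reach 95% of the variance.
full_pca = PCA(svd_solver="full").fit(X_scaled)
cumulative = np.cumsum(full_pca.explained_variance_ratio_)
n_needed = int(np.searchsorted(cumulative, 0.95) + 1)

# Equivalent shortcut: let PCA pick the count from the threshold.
auto_pca = PCA(n_components=0.95, svd_solver="full").fit(X_scaled)

print("components needed for 95% variance:", n_needed)
print("components chosen by PCA(n_components=0.95):", auto_pca.n_components_)
```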


## t-Distributed Stochastic Neighbor Embedding (t-SNE)

• t-SNE is a non-linear dimensionality reduction technique that is typically used to visualize high-dimensional datasets. Some of its main applications are in Natural Language Processing (NLP), speech processing, etc.
• t-SNE works by minimizing the divergence between two distributions: one built from the pairwise similarities of the data points in the original high-dimensional space, and its equivalent in the reduced low-dimensional space. It uses the Kullback-Leibler (KL) divergence to measure the dissimilarity between the two distributions, and the KL divergence is then minimized using gradient descent.
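For two discrete distributions P and Q, the KL divergence is KL(P‖Q) = Σ p_i · log(p_i / q_i). The sketch below computes it for small hand-made distributions (the numbers are made up for illustration) to show that it is zero for identical distributions, grows as they differ, and is not symmetric.

```python
import numpy as np

def kl_divergence(p, q):
    """KL(P || Q) for two discrete distributions given as arrays of probabilities."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0  # terms with p_i == 0 contribute nothing
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

# Hypothetical distributions for illustration.
p = [0.5, 0.3, 0.2]
q_close = [0.45, 0.35, 0.2]
q_far = [0.1, 0.1, 0.8]

print("KL(p || p)       =", kl_divergence(p, p))
print("KL(p || q_close) =", kl_divergence(p, q_close))
print("KL(p || q_far)   =", kl_divergence(p, q_far))
print("KL(q_far || p)   =", kl_divergence(q_far, p))
```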

Gradient descent is an optimization algorithm that minimizes a function by iteratively moving in the direction of steepest descent, which is given by the negative of the gradient.

• X = parameter to optimize
• y = loss or cost function
• A cost function (or loss function) evaluates the performance of a machine learning algorithm. The loss function computes the error for a single training example, while the cost function is the average of the losses over all the training examples.
• The goal is to minimise the function, that is, to find the value of X that produces the lowest value of y.
• The derivative of the cost function, obtained with rules such as the power rule or the chain rule, is its slope at a particular point, i.e. the slope of the tangent line to the curve at that point. The sign and size of this slope indicate the direction in which to move to approach the minimum. When there are two or more parameters, a partial derivative is calculated with respect to each of them.

• The vector of all partial derivatives of the same function is the gradient.
• The size of the steps taken to reach the minimum is called the learning rate.
• When using t-SNE, the higher-dimensional space is modelled using a Gaussian distribution, while the lower-dimensional space is modelled using a Student's t-distribution. The heavier tails of the t-distribution allow moderately distant points to sit further apart in the embedding, which avoids an imbalance in the distribution of neighbouring-point distances caused by the translation into a lower-dimensional space.
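The gradient descent loop described in the bullets above can be sketched on a one-parameter cost function. The parabola `(x - 3)^2 + 1`, the starting point, the learning rate and the number of steps are all assumptions chosen for illustration; t-SNE applies the same idea to the KL divergence with many parameters.

```python
def gradient_descent(grad, x0, learning_rate=0.1, n_steps=100):
    """Repeatedly step against the gradient, scaled by the learning rate."""
    x = x0
    for _ in range(n_steps):
        x = x - learning_rate * grad(x)
    return x

def cost(x):
    # Hypothetical cost function: a parabola with its minimum at x = 3.
    return (x - 3.0) ** 2 + 1.0

def cost_grad(x):
    # Derivative of the cost, obtained with the power and chain rules.
    return 2.0 * (x - 3.0)

x_min = gradient_descent(cost_grad, x0=0.0)
print("x at minimum:", round(x_min, 6))
print("cost at minimum:", round(cost(x_min), 6))
```

A learning rate that is too large makes the steps overshoot the minimum, while one that is too small makes convergence slow.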

Testing a 2-component t-SNE embedding

In [21]:
from sklearn.manifold import TSNE

tsne = TSNE(n_components=2, verbose=1, perplexity=40, n_iter=300)
X_tsne = tsne.fit_transform(X)

[t-SNE] Computing 121 nearest neighbors...
[t-SNE] Indexed 8124 samples in 0.131s...
[t-SNE] Computed neighbors for 8124 samples in 8.856s...
[t-SNE] Computed conditional probabilities for sample 1000 / 8124
[t-SNE] Computed conditional probabilities for sample 2000 / 8124
[t-SNE] Computed conditional probabilities for sample 3000 / 8124
[t-SNE] Computed conditional probabilities for sample 4000 / 8124
[t-SNE] Computed conditional probabilities for sample 5000 / 8124
[t-SNE] Computed conditional probabilities for sample 6000 / 8124
[t-SNE] Computed conditional probabilities for sample 7000 / 8124
[t-SNE] Computed conditional probabilities for sample 8000 / 8124
[t-SNE] Computed conditional probabilities for sample 8124 / 8124
[t-SNE] Mean sigma: 2.658530
[t-SNE] KL divergence after 250 iterations with early exaggeration: 66.852493
[t-SNE] KL divergence after 300 iterations: 2.136555

In [22]:
complete_test_2D(X_tsne, Y, 't-SNE')

              precision    recall  f1-score   support

           0       1.00      1.00      1.00      1272
           1       1.00      1.00      1.00      1166

    accuracy                           1.00      2438
   macro avg       1.00      1.00      1.00      2438
weighted avg       1.00      1.00      1.00      2438



Testing a 3-component t-SNE embedding

In [23]:
tsne = TSNE(n_components=3, verbose=1, perplexity=40, n_iter=300)
X_tsne = tsne.fit_transform(X)
complete_test_3D(X_tsne, Y, 't-SNE')

[t-SNE] Computing 121 nearest neighbors...
[t-SNE] Indexed 8124 samples in 0.131s...
[t-SNE] Computed neighbors for 8124 samples in 8.848s...
[t-SNE] Computed conditional probabilities for sample 1000 / 8124
[t-SNE] Computed conditional probabilities for sample 2000 / 8124
[t-SNE] Computed conditional probabilities for sample 3000 / 8124
[t-SNE] Computed conditional probabilities for sample 4000 / 8124
[t-SNE] Computed conditional probabilities for sample 5000 / 8124
[t-SNE] Computed conditional probabilities for sample 6000 / 8124
[t-SNE] Computed conditional probabilities for sample 7000 / 8124
[t-SNE] Computed conditional probabilities for sample 8000 / 8124
[t-SNE] Computed conditional probabilities for sample 8124 / 8124
[t-SNE] Mean sigma: 2.658530
[t-SNE] KL divergence after 250 iterations with early exaggeration: 65.627213
[t-SNE] KL divergence after 300 iterations: 1.901021

              precision    recall  f1-score   support

           0       1.00      1.00      1.00      1272
           1       1.00      1.00      1.00      1166

    accuracy                           1.00      2438
   macro avg       1.00      1.00      1.00      2438
weighted avg       1.00      1.00      1.00      2438



## Independent Component Analysis (ICA)

• ICA is a linear dimensionality reduction method that takes as input a mixture of independent components and aims to identify each of them correctly (discarding unnecessary noise). Two input features can be considered independent if both their linear and non-linear dependence are equal to zero.
• As a simple example of an ICA application, suppose we are given an audio recording in which two different people are talking. Using ICA we could, for example, try to identify the two independent components in the recording (the two speakers). In this way, our unsupervised learning algorithm could distinguish between the different speakers in the conversation.
• Using ICA, we can again reduce our dataset to just three features, test its accuracy using a Random Forest Classifier and plot the results.
In [24]:
from sklearn.decomposition import FastICA

ica = FastICA(n_components=3)
X_ica = ica.fit_transform(X)

complete_test_3D(X_ica, Y, 'ICA')

              precision    recall  f1-score   support

           0       0.98      1.00      0.99      1272
           1       1.00      0.98      0.99      1166

    accuracy                           0.99      2438
   macro avg       0.99      0.99      0.99      2438
weighted avg       0.99      0.99      0.99      2438
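
The two-speaker example mentioned above can be sketched with synthetic signals standing in for the voices. The sine and square waves, the noise level and the mixing matrix `A` are assumptions made for illustration; real audio would need loading and preprocessing that is not shown here.

```python
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)
t = np.linspace(0, 8, 2000)

# Two independent "speakers" (hypothetical signals) with a little noise.
s1 = np.sin(2 * t)
s2 = np.sign(np.sin(3 * t))
S = np.c_[s1, s2] + 0.02 * rng.normal(size=(2000, 2))

# Each "microphone" records a different linear mixture of the two sources.
A = np.array([[1.0, 0.5],
              [0.5, 2.0]])  # hypothetical mixing matrix
X_mixed = S @ A.T

# ICA recovers the sources up to order, sign and scale.
ica = FastICA(n_components=2, random_state=0)
S_est = ica.fit_transform(X_mixed)

# Absolute correlation between each true source and each recovered component.
corr = np.abs(np.corrcoef(S.T, S_est.T)[:2, 2:])
print(np.round(corr, 3))
```

Each row of the correlation matrix should contain one value close to 1, showing which recovered component corresponds to which original source.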