Data distributions and transformations

  • Why scale your data?
    • Many machine learning algorithms perform better or converge faster when features are on a relatively similar scale and/or close to normally distributed. Scaling and standardizing can help put features into a more digestible form for an algorithm.

Visualizing and describing

Get a snapshot of the composition of the data

In [1]:
from sklearn import datasets
import pandas as pd
import numpy as np

boston = datasets.load_boston()
X, y = boston.data, boston.target
df = pd.DataFrame(data=boston.data, columns=boston.feature_names)

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 13 columns):
CRIM       506 non-null float64
ZN         506 non-null float64
INDUS      506 non-null float64
CHAS       506 non-null float64
NOX        506 non-null float64
RM         506 non-null float64
AGE        506 non-null float64
DIS        506 non-null float64
RAD        506 non-null float64
TAX        506 non-null float64
PTRATIO    506 non-null float64
B          506 non-null float64
LSTAT      506 non-null float64
dtypes: float64(13)
memory usage: 51.5 KB
In [2]:
# display first 5 rows of df 
df.head()
Out[2]:
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT
0 0.00632 18.0 2.31 0.0 0.538 6.575 65.2 4.0900 1.0 296.0 15.3 396.90 4.98
1 0.02731 0.0 7.07 0.0 0.469 6.421 78.9 4.9671 2.0 242.0 17.8 396.90 9.14
2 0.02729 0.0 7.07 0.0 0.469 7.185 61.1 4.9671 2.0 242.0 17.8 392.83 4.03
3 0.03237 0.0 2.18 0.0 0.458 6.998 45.8 6.0622 3.0 222.0 18.7 394.63 2.94
4 0.06905 0.0 2.18 0.0 0.458 7.147 54.2 6.0622 3.0 222.0 18.7 396.90 5.33

Pandas describe function to produce some quick descriptive statistics.

In [3]:
# percentile list 
perc =[0.05, 0.10, 0.25, 0.50, 0.75, 0.90, 0.95] 

df.describe(percentiles = perc, include = [np.number])
Out[3]:
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT
count 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000
mean 3.613524 11.363636 11.136779 0.069170 0.554695 6.284634 68.574901 3.795043 9.549407 408.237154 18.455534 356.674032 12.653063
std 8.601545 23.322453 6.860353 0.253994 0.115878 0.702617 28.148861 2.105710 8.707259 168.537116 2.164946 91.294864 7.141062
min 0.006320 0.000000 0.460000 0.000000 0.385000 3.561000 2.900000 1.129600 1.000000 187.000000 12.600000 0.320000 1.730000
5% 0.027910 0.000000 2.180000 0.000000 0.409250 5.314000 17.725000 1.461975 2.000000 222.000000 14.700000 84.590000 3.707500
10% 0.038195 0.000000 2.910000 0.000000 0.427000 5.593500 26.950000 1.628300 3.000000 233.000000 14.750000 290.270000 4.680000
25% 0.082045 0.000000 5.190000 0.000000 0.449000 5.885500 45.025000 2.100175 4.000000 279.000000 17.400000 375.377500 6.950000
50% 0.256510 0.000000 9.690000 0.000000 0.538000 6.208500 77.500000 3.207450 5.000000 330.000000 19.050000 391.440000 11.360000
75% 3.677083 12.500000 18.100000 0.000000 0.624000 6.623500 94.075000 5.188425 24.000000 666.000000 20.200000 396.225000 16.955000
90% 10.753000 42.500000 19.580000 0.000000 0.713000 7.151500 98.800000 6.816600 24.000000 666.000000 20.900000 396.900000 23.035000
95% 15.789150 80.000000 21.890000 1.000000 0.740000 7.587500 100.000000 7.827800 24.000000 666.000000 21.000000 396.900000 26.807500
max 88.976200 100.000000 27.740000 1.000000 0.871000 8.780000 100.000000 12.126500 24.000000 711.000000 22.000000 396.900000 37.970000

Skew: the degree of distortion (asymmetry) relative to a normal distribution.

  • For example, if the response variable in a house-price regression is skewed, the model will be trained on a much larger number of moderately priced homes and will be less likely to successfully predict the price of the most expensive houses. The concept is the same as training a model on imbalanced categorical classes. If the values of a certain independent variable (feature) are skewed, then depending on the model, the skewness may violate model assumptions (e.g. logistic regression) or impair the interpretation of feature importance.
In [4]:
# find skew using pandas 
df.skew().sort_values(ascending=False)
Out[4]:
CRIM       5.223149
CHAS       3.405904
ZN         2.225666
DIS        1.011781
RAD        1.004815
LSTAT      0.906460
NOX        0.729308
TAX        0.669956
RM         0.403612
INDUS      0.295022
AGE       -0.598963
PTRATIO   -0.802325
B         -2.890374
dtype: float64

Shapiro-Wilk test:

  • We can objectively assess whether a variable departs from normality using the Shapiro-Wilk test, which tests the null hypothesis that a sample x1, ..., xn came from a normally distributed population. A p-value below 0.05 therefore indicates a significant departure from normality (a quick check with scipy is sketched below).
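
As a quick check outside of Yellowbrick, scipy exposes the test directly. A minimal sketch, assuming the df loaded above; the two columns are chosen just for illustration:

from scipy.stats import shapiro

# W close to 1 suggests normality; a small p-value rejects the null of normality
for col in ['RM', 'CRIM']:
    W, p = shapiro(df[col])
    print(f'{col}: W = {W:.4f}, p-value = {p:.2e}')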

Rank 1D from Yellowbrick

  • A one-dimensional ranking of features utilizes a ranking algorithm that takes into account only a single feature at a time (e.g. histogram analysis). By default we utilize the Shapiro-Wilk algorithm to assess the normality of the distribution of instances with respect to the feature. A barplot is then drawn showing the relative ranks of each feature.
In [5]:
from yellowbrick.features import Rank1D
import matplotlib.pyplot as plt
import seaborn as sns 
sns.set(style="darkgrid", color_codes=True)

rnk1 = Rank1D(algorithm='shapiro')
rnk1.fit(X, y)           # Fit the data to the visualizer
rnk1.transform(X)
plt.close()

rnk1_df = pd.DataFrame(rnk1.ranks_, index=df.columns, columns=['Rank'])
rnk1_df = rnk1_df.sort_values('Rank', ascending=False)

f, axes = plt.subplots(1,2 ,figsize=(12,5))
sns.barplot(x='Rank', y=rnk1_df.index, data=rnk1_df, ax=axes[0])
axes[0].set_title('Features')
axes[0].set_xlabel('Shapiro Rank')
axes[0].grid(False)

sns.barplot(x=df.skew().abs().sort_values(), y=df.skew().abs().sort_values().index, ax=axes[1])
axes[1].set_title('Features')
axes[1].set_xlabel('Absolute Skew')
axes[1].grid(False)

plt.tight_layout()
plt.show();

Function to produce more descriptive statistics.

In [6]:
from scipy import stats

def resumetable(df):
    print(f"Dataset Shape: {df.shape}")
    summary = pd.DataFrame(df.dtypes,columns=['dtypes'])
    summary = summary.reset_index()
    summary['Name'] = summary['index']
    summary = summary[['Name','dtypes']]
    summary['Missing'] = df.isnull().sum().values    
    summary['Uniques'] = df.nunique().values
    summary['First Value'] = df.loc[0].values
    summary['Second Value'] = df.loc[1].values
    summary['Third Value'] = df.loc[2].values

    # Shannon entropy (in bits) of each feature's value distribution
    for name in summary['Name'].value_counts().index:
        summary.loc[summary['Name'] == name, 'Entropy'] = round(stats.entropy(df[name].value_counts(normalize=True), base=2),2) 

    return summary
In [7]:
resumetable(df)
Dataset Shape: (506, 13)
Out[7]:
Name dtypes Missing Uniques First Value Second Value Third Value Entropy
0 CRIM float64 0 504 0.00632 0.02731 0.02729 8.98
1 ZN float64 0 26 18.00000 0.00000 0.00000 1.95
2 INDUS float64 0 76 2.31000 7.07000 7.07000 5.03
3 CHAS float64 0 2 0.00000 0.00000 0.00000 0.36
4 NOX float64 0 81 0.53800 0.46900 0.46900 6.00
5 RM float64 0 446 6.57500 6.42100 7.18500 8.74
6 AGE float64 0 356 65.20000 78.90000 61.10000 8.05
7 DIS float64 0 412 4.09000 4.96710 4.96710 8.57
8 RAD float64 0 9 1.00000 2.00000 2.00000 2.74
9 TAX float64 0 66 296.00000 242.00000 242.00000 4.83
10 PTRATIO float64 0 46 15.30000 17.80000 17.80000 4.43
11 B float64 0 357 396.90000 396.90000 392.83000 7.21
12 LSTAT float64 0 455 4.98000 9.14000 4.03000 8.77

View individual feature distributions

In [8]:
df.hist(figsize=(11,11), grid=False);

Empirical Cumulative Distribution

In [9]:
# visualising ECDF
from mlxtend.plotting import ecdf

fig = plt.figure(figsize=(13, 40))

for i, feature in enumerate(df.columns, 1):
    plt.subplot(len(df.columns), 3, i)
    ecdf(df[feature])
    plt.title(f'{feature}', fontsize=12, fontweight='medium')
    plt.grid(False)
    plt.tick_params(axis='both', labelsize=8)

plt.tight_layout()       
plt.show()
In [10]:
# visualising scaled ECDF and KDE
from sklearn.preprocessing import StandardScaler

num_lines = len(df.columns)
colors = [plt.cm.jet(i) for i in np.linspace(0, 1, num_lines)]

from pylab import rcParams 
from cycler import cycler
rcParams['axes.prop_cycle'] = cycler('color', colors)
rcParams['axes.grid'] = False

z0 = df.values
z1 = StandardScaler().fit_transform(z0)
df_scld = pd.DataFrame(z1, columns=df.columns)

f, axes = plt.subplots(1,2 ,figsize=(16,5))

for i in df_scld.columns:
    ecdf(df_scld[i], ax=axes[0], ecdf_marker='.')
    sns.kdeplot(df_scld[i], ax=axes[1])
    
axes[0].set_title('Standardized Features ECDF')
axes[1].set_title('Standardized Features KDE')
axes[0].legend(list(df_scld.columns))
axes[1].legend(list(df_scld.columns))
plt.show()

View feature boxplot

In [11]:
f, ax = plt.subplots(figsize=(8, 5))
ax.set_xscale("log")
sns.boxplot(data=df , orient="h", palette='Set1', ax=ax)
plt.xlabel('Log')
plt.show();

Standardization, or mean removal and variance scaling:

  • Standardization of datasets is a common requirement for many machine learning estimators. Models might behave badly if the individual features do not more or less look like standard normally distributed data: Gaussian with zero mean and standard deviation of 1.
In [12]:
from sklearn import preprocessing

X_train = np.array([[ 1., -1.,  2.],
                     [ 2.,  0.,  0.],
                    [ 0.,  1., -1.]])

scaler = preprocessing.StandardScaler().fit(X_train)
print(scaler.transform(X_train))
[[ 0.         -1.22474487  1.33630621]
 [ 1.22474487  0.         -0.26726124]
 [-1.22474487  1.22474487 -1.06904497]]

Scaling features to a range:

  • An alternative standardization is scaling features to lie between a given minimum and maximum value, often between zero and one, or so that the maximum absolute value of each feature is scaled to unit size. This can be achieved using MinMaxScaler or MaxAbsScaler, respectively.

Transform features by scaling each feature to a given range.

In [13]:
from sklearn.preprocessing import MinMaxScaler

data = [[-1, 2], 
        [-0.5, 6], 
        [0, 10], 
        [1, 18]]

scaler = MinMaxScaler()
scaler.fit(data)
print(scaler.transform(data))
[[0.   0.  ]
 [0.25 0.25]
 [0.5  0.5 ]
 [1.   1.  ]]

Scale each feature by its maximum absolute value

In [14]:
from sklearn.preprocessing import MaxAbsScaler

X = [[ 1., -1.,  2.],
     [ 2.,  0.,  0.],
     [ 0.,  1., -1.]]

transformer = MaxAbsScaler().fit(X)
print(transformer.transform(X))
[[ 0.5 -1.   1. ]
 [ 1.   0.   0. ]
 [ 0.   1.  -0.5]]

Scaling data with outliers

  • If your data contains many outliers, scaling using the mean and variance of the data is likely to not work very well. In these cases, you can use robust_scale and RobustScaler as drop-in replacements instead. They use more robust estimates for the center and range of your data.

  • RobustScaler removes the median and scales the data according to the quantile range (defaults to IQR: Interquartile Range). The IQR is the range between the 1st quartile (25th quantile) and the 3rd quartile (75th quantile). Centering and scaling happen independently on each feature by computing the relevant statistics on the samples in the training set. Median and interquartile range are then stored to be used on later data using the transform method.

In [15]:
from sklearn.preprocessing import RobustScaler

X = [[ 1., -2.,  2.],
     [ -2.,  1.,  3.],
     [ 4.,  1., -2.]]

transformer = RobustScaler(with_centering=True, with_scaling=True, quantile_range=(25.0, 75.0)).fit(X)
print(transformer.transform(X))
[[ 0.  -2.   0. ]
 [-1.   0.   0.4]
 [ 1.   0.  -1.6]]
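
The fitted scaler keeps the per-feature median and IQR in its center_ and scale_ attributes, so later data is scaled with the statistics learned from X rather than its own. A minimal sketch continuing from the transformer above; X_new is just a hypothetical new sample:

# statistics learned from X and stored on the fitted transformer
print(transformer.center_)   # per-feature medians
print(transformer.scale_)    # per-feature interquartile ranges

# a new sample is centred and scaled with those stored statistics
X_new = [[0., 0., 0.]]
print(transformer.transform(X_new))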

Function to identify outliers (values beyond 3 standard deviations from the mean)

In [16]:
def CalcOutliers(df_num, limit = 3): 

    # calculating mean and std of the array
    data_mean, data_std = np.mean(df_num), np.std(df_num)

    # setting the cut line for both higher and lower values
    # You can change this value
    cut = data_std * limit

    # calculating the higher and lower cut values
    lower, upper = data_mean - cut, data_mean + cut

    # creating arrays of lower, higher and total outlier values 
    outliers_lower = [x for x in df_num if x < lower]
    outliers_higher = [x for x in df_num if x > upper]
    outliers_total = [x for x in df_num if x < lower or x > upper]

    # array without outlier values
    outliers_removed = [x for x in df_num if x > lower and x < upper]

    print('Total outlier observations: %d' % len(outliers_total)) # total number of outliers on both sides
    print("Total percentage of Outliers: ", round((len(outliers_total) / len(outliers_removed) )*100, 4)) # outliers as a percentage of the non-outlier points
    print('Identified lowest outliers: %d' % len(outliers_lower)) # number of outliers below the lower cut
    print('Identified upper outliers: %d' % len(outliers_higher)) # number of outliers above the upper cut
    
    if len(outliers_higher) > 0:
        drp_upper = np.amin(np.array(outliers_higher), axis=0)
        print(f'Drop upper outliers >= {drp_upper}')
        
    if len(outliers_lower) > 0:        
        drp_lower = np.amax(np.array(outliers_lower), axis=0)
        print(f'Drop lower outliers <= {drp_lower}')

    return 
In [17]:
for i in df.columns:
    print(f'\nCalculating outliers for {i}...')
    CalcOutliers(df[i], limit = 3)
Calculating outliers for CRIM...
Total outlier observations: 8
Total percentage of Outliers:  1.6064
Identified lowest outliers: 0
Identified upper outliers: 8
Drop upper outliers >= 37.6619

Calculating outliers for ZN...
Total outlier observations: 14
Total percentage of Outliers:  2.8455
Identified lowest outliers: 0
Identified upper outliers: 14
Drop upper outliers >= 82.5

Calculating outliers for INDUS...
Total outlier observations: 0
Total percentage of Outliers:  0.0
Identified lowest outliers: 0
Identified upper outliers: 0

Calculating outliers for CHAS...
Total outlier observations: 35
Total percentage of Outliers:  7.431
Identified lowest outliers: 0
Identified upper outliers: 35
Drop upper outliers >= 1.0

Calculating outliers for NOX...
Total outlier observations: 0
Total percentage of Outliers:  0.0
Identified lowest outliers: 0
Identified upper outliers: 0

Calculating outliers for RM...
Total outlier observations: 8
Total percentage of Outliers:  1.6064
Identified lowest outliers: 4
Identified upper outliers: 4
Drop upper outliers >= 8.398
Drop lower outliers <= 4.138

Calculating outliers for AGE...
Total outlier observations: 0
Total percentage of Outliers:  0.0
Identified lowest outliers: 0
Identified upper outliers: 0

Calculating outliers for DIS...
Total outlier observations: 5
Total percentage of Outliers:  0.998
Identified lowest outliers: 0
Identified upper outliers: 5
Drop upper outliers >= 10.5857

Calculating outliers for RAD...
Total outlier observations: 0
Total percentage of Outliers:  0.0
Identified lowest outliers: 0
Identified upper outliers: 0

Calculating outliers for TAX...
Total outlier observations: 0
Total percentage of Outliers:  0.0
Identified lowest outliers: 0
Identified upper outliers: 0

Calculating outliers for PTRATIO...
Total outlier observations: 0
Total percentage of Outliers:  0.0
Identified lowest outliers: 0
Identified upper outliers: 0

Calculating outliers for B...
Total outlier observations: 25
Total percentage of Outliers:  5.1975
Identified lowest outliers: 25
Identified upper outliers: 0
Drop lower outliers <= 81.33

Calculating outliers for LSTAT...
Total outlier observations: 5
Total percentage of Outliers:  0.998
Identified lowest outliers: 0
Identified upper outliers: 5
Drop upper outliers >= 34.37

Non-linear transformations

  • Two types of transformations are available: quantile transforms and power transforms. Both quantile and power transforms are based on monotonic transformations of the features and thus preserve the rank of the values along each feature.
  • Quantile transforms map the features to follow a uniform or a normal distribution. For a given feature, this transformation tends to spread out the most frequent values and reduces the impact of (marginal) outliers; it is therefore a robust preprocessing scheme (a minimal sketch follows this list).
  • Power transforms are a family of parametric transformations that aim to map data from any distribution to as close to a Gaussian distribution as possible.
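
A minimal sketch of the quantile transform on one skewed feature of the df loaded above (n_quantiles is set to the number of rows to avoid a warning); a fuller comparison with power transforms appears further below:

from sklearn.preprocessing import QuantileTransformer

qt_uniform = QuantileTransformer(n_quantiles=len(df), output_distribution='uniform', random_state=0)
crim_uniform = qt_uniform.fit_transform(df[['CRIM']])

# ranks are preserved, but the transformed values spread evenly over [0, 1]
print(pd.Series(crim_uniform.ravel()).describe())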

Normalization:

  • Normalize samples individually to unit norm.

  • Each sample (i.e. each row of the data matrix) with at least one non-zero component is rescaled independently of the other samples so that its norm (l1 or l2) equals one.

  • This transformer works with both dense numpy arrays and scipy.sparse matrices (use the CSR format if you want to avoid the burden of a copy / conversion). An l1 variant is sketched after the l2 example below.

In [18]:
from sklearn.preprocessing import Normalizer

X = [[4, 1, 2, 2],
     [1, 3, 9, 3],
     [5, 7, 5, 1]]

transformer = Normalizer(norm='l2').fit(X)  
print(transformer.transform(X))
[[0.8 0.2 0.4 0.4]
 [0.1 0.3 0.9 0.3]
 [0.5 0.7 0.5 0.1]]
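
For comparison, the same rows under the l1 norm, where each row is divided by the sum of its absolute values; a minimal sketch reusing X from above:

transformer_l1 = Normalizer(norm='l1').fit(X)
print(transformer_l1.transform(X))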

Box-Cox Transformations

  • Real-world data often contains heavily skewed features. Transformation techniques are useful to stabilize variance, make the data more normally distributed and improve the validity of measures of association.
  • The difficulty with the Box-Cox transformation is estimating lambda. This value depends on the data at hand, so when cross validating on out-of-sample data make sure to estimate lambda from the training dataset only (a minimal scipy sketch follows this list).
  • Other common transformations include the log and square-root transformations.
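
A minimal scipy sketch of that last point about lambda, assuming a simple (non-random) split of one strictly positive feature from the df loaded above:

from scipy import stats

# hypothetical train/test split of a strictly positive feature
train, test = df['LSTAT'].iloc[:400], df['LSTAT'].iloc[400:]

# estimate lambda on the training portion only
_, lmbda_train = stats.boxcox(train)
print('lambda estimated on the training data:', round(lmbda_train, 2))

# reuse that lambda when transforming the held-out portion
test_bc = stats.boxcox(test, lmbda=lmbda_train)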

In [19]:
from sklearn.preprocessing import PowerTransformer
from sklearn.preprocessing import QuantileTransformer
from sklearn.model_selection import train_test_split

N_SAMPLES = 1000
FONT_SIZE = 14
BINS = 30

rng = np.random.RandomState(304)
bc = PowerTransformer(method='box-cox')
yj = PowerTransformer(method='yeo-johnson')
# n_quantiles is set to the training set size rather than the default value
# to avoid a warning being raised by this example
qt = QuantileTransformer(n_quantiles=500, output_distribution='normal',
                         random_state=rng)
size = (N_SAMPLES, 1)

# lognormal distribution
X_lognormal = rng.lognormal(size=size)

# chi-squared distribution
dfx = 3
X_chisq = rng.chisquare(df=dfx, size=size)

# weibull distribution
a = 50
X_weibull = rng.weibull(a=a, size=size)

# gaussian distribution
loc = 100
X_gaussian = rng.normal(loc=loc, size=size)

# uniform distribution
X_uniform = rng.uniform(low=0, high=1, size=size)

# bimodal distribution
loc_a, loc_b = 100, 105
X_a, X_b = rng.normal(loc=loc_a, size=size), rng.normal(loc=loc_b, size=size)
X_bimodal = np.concatenate([X_a, X_b], axis=0)

# create plots
distributions = [
    ('Lognormal', X_lognormal),
    ('Chi-squared', X_chisq),
    ('Weibull', X_weibull),
    ('Gaussian', X_gaussian),
    ('Uniform', X_uniform),
    ('Bimodal', X_bimodal)
]

colors = ['#D81B60', '#0188FF', '#FFC107',
          '#B7A2FF', '#000000', '#2EC5AC']

fig, axes = plt.subplots(nrows=8, ncols=3, figsize=(12, 18))
axes = axes.flatten()

axes_idxs = [(0, 3, 6, 9), (1, 4, 7, 10), (2, 5, 8, 11), (12, 15, 18, 21),
             (13, 16, 19, 22), (14, 17, 20, 23)]

axes_list = [(axes[i], axes[j], axes[k], axes[l])
             for (i, j, k, l) in axes_idxs]

for distribution, color, axes in zip(distributions, colors, axes_list):
    name, X = distribution
    X_train, X_test = train_test_split(X, test_size=.5)

    # perform power transforms and quantile transform
    X_trans_bc = bc.fit(X_train).transform(X_test)
    lmbda_bc = round(bc.lambdas_[0], 2)
    X_trans_yj = yj.fit(X_train).transform(X_test)
    lmbda_yj = round(yj.lambdas_[0], 2)
    X_trans_qt = qt.fit(X_train).transform(X_test)

    ax_original, ax_bc, ax_yj, ax_qt = axes

    ax_original.hist(X_train, color=color, bins=BINS)
    ax_original.set_title(name, fontsize=FONT_SIZE)
    ax_original.tick_params(axis='both', which='major', labelsize=FONT_SIZE)

    for ax, X_trans, meth_name, lmbda in zip(
            (ax_bc, ax_yj, ax_qt),
            (X_trans_bc, X_trans_yj, X_trans_qt),
            ('Box-Cox', 'Yeo-Johnson', 'Quantile transform'),
            (lmbda_bc, lmbda_yj, None)):
        ax.hist(X_trans, color=color, bins=BINS, )
        title = 'After {}'.format(meth_name)
        
        if lmbda is not None:
            title += '\n' + r'$\lambda$ = {}'.format(lmbda)
        ax.set_title(title, fontsize=FONT_SIZE)
        ax.tick_params(axis='both', which='major', labelsize=FONT_SIZE)
        ax.set_xlim([-3.5, 3.5])

plt.tight_layout()
plt.show()

View Transformations with boxplot and theoretical quantiles

In [20]:
def plotting_3_chart(df, title = 'plot'):
        
    ## Create a customized chart with the given figsize. 
    fig = plt.figure(constrained_layout=True, figsize=(8,5))
    ## creating a grid of 3 cols and 3 rows. 
    grid = gridspec.GridSpec(ncols=3, nrows=3, figure=fig)

    ## Customizing the histogram grid. 
    ax1 = fig.add_subplot(grid[0, :2])
    ## Set the title. 
    ax1.set_title(f'{title} distribution')
    ## Plot the histogram with a fitted normal curve. 
    sns.distplot(df, norm_hist=True, ax = ax1, fit=stats.norm, bins=30)
    ax1.legend(('normal', f'{title}'))

    # customizing the QQ_plot. 
    ax2 = fig.add_subplot(grid[1, :2])
    ## Set the title. 
    ax2.set_title('QQ_plot')
    ## Plotting the QQ_Plot. 
    stats.probplot(df, plot = ax2)

    ## Customizing the Box Plot. 
    ax3 = fig.add_subplot(grid[:, 2])
    ## Set title. 
    ax3.set_title('Box Plot')
    ## Plotting the box plot. 
    sns.boxplot(df, orient='v', ax = ax3)
    
    plt.show();
In [21]:
import pylab 
import matplotlib.gridspec as gridspec
sns.set(style="darkgrid", color_codes=True)

# Create a dummy array, skewed to the left
x = stats.loggamma.rvs(3, size=700) + 3

# How is the distribution for x?
plotting_3_chart(x, title = 'no transformation')

# What happens with a log transformation?
plotting_3_chart(pd.Series(np.log(x)), title = 'log')

# What happens with a sqrt transformation?
plotting_3_chart(pd.Series(np.sqrt(x)), title = 'sqrt')

# And what about a Box-Cox transformation?
x_bc, lmda = stats.boxcox(x)
plotting_3_chart(pd.Series(x_bc), title = 'box-cox')

print("lambda parameter for Box-Cox Transformation is {}".format(lmda))
lambda parameter for Box-Cox Transformation is 1.979697721508571

View non-transformed features vs. features after a Box-Cox transformation and robust scaling

In [22]:
from scipy.special import boxcox1p
from scipy.stats import boxcox_normmax
import warnings
warnings.filterwarnings('ignore')

boston = datasets.load_boston()
X, y = boston.data, boston.target
df = pd.DataFrame(data=boston.data, columns=boston.feature_names)

df_bc = df.copy()

# Reduce skew in each feature with a Box-Cox (boxcox1p) transformation
for i in df_bc.columns:
    df_bc[i] = boxcox1p(df_bc[i], boxcox_normmax(df_bc[i] + 1))
In [23]:
# Scale robustly using the median and IQR to reduce the influence of outliers
transformer = RobustScaler(with_centering=True, with_scaling=True, quantile_range=(25.0, 75.0))
z1 = transformer.fit_transform(df_bc.values)
df_bc = pd.DataFrame(z1, columns=df_bc.columns)
In [24]:
rnk1 = Rank1D(algorithm='shapiro')
rnk1.fit(X, y)           
rnk1.transform(X)
plt.close()

rnk1_df = pd.DataFrame(rnk1.ranks_, index=df.columns, columns=['Rank'])
rnk1_df = rnk1_df.sort_values('Rank', ascending=False)

X0 = df_bc

rnk10 = Rank1D(algorithm='shapiro')
rnk10.fit(X0, y)           
rnk10.transform(X0)
plt.close()

rnk1_df0 = pd.DataFrame(rnk10.ranks_, index=df_bc.columns, columns=['Rank'])
rnk1_df0 = rnk1_df0.sort_values('Rank', ascending=False)

f, axes = plt.subplots(1,2 ,figsize=(12,5))
sns.barplot(x='Rank', y=rnk1_df.index, data=rnk1_df, ax=axes[0])
axes[0].set_title('Features')
axes[0].set_xlabel('Shapiro Rank')
axes[0].grid(False)

sns.barplot(x='Rank', y=rnk1_df0.index, data=rnk1_df0, ax=axes[1])
axes[1].set_title('BoxCox + RobustScaler Features')
axes[1].set_xlabel('Shapiro Rank')
axes[1].grid(False)

plt.tight_layout()
plt.show();

Discretization

  • Discretization (otherwise known as quantization or binning) provides a way to partition continuous features into discrete values. Certain datasets with continuous features may benefit from discretization, because it can transform a dataset of continuous attributes into one with only nominal attributes. One-hot encoded discretized features can make a model more expressive while maintaining interpretability (a one-hot variant is sketched after the example below).

KBinsDiscretizer discretizes features into k bins

In [25]:
X = np.array([[ -3., 5., 15 ],
              [  0., 6., 14 ],
              [ -1., 4., 18 ],
              [  6., 3., 11 ]])

est = preprocessing.KBinsDiscretizer(n_bins=[4, 3, 2], encode='ordinal').fit(X)
print(est.transform(X))
[[0. 2. 1.]
 [2. 2. 0.]
 [1. 1. 1.]
 [3. 0. 0.]]
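
The same bins with one-hot output, as mentioned above: each original column expands into one indicator column per bin, so the transformed matrix has 4 + 3 + 2 = 9 columns. A minimal sketch reusing X:

est_onehot = preprocessing.KBinsDiscretizer(n_bins=[4, 3, 2], encode='onehot-dense').fit(X)
print(est_onehot.transform(X))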

Feature binarization: the process of thresholding numerical features to get boolean values. This can be useful for downstream probabilistic estimators that assume the input data is distributed according to a multivariate Bernoulli distribution.

In [26]:
X = [[ 1., -1.,  2.],
     [ 2.,  0.,  0.],
     [ 0.,  1., -1.]]

binarizer = preprocessing.Binarizer().fit(X)  # fit does nothing

print(binarizer.transform(X))
[[1. 0. 1.]
 [1. 0. 0.]
 [0. 1. 0.]]

Generating polynomial features

  • It is often useful to add complexity to a model by considering nonlinear features of the input data. A simple and common method is polynomial features, which capture the features’ higher-order and interaction terms (an interaction-only variant is sketched after the example below).

Generate a new feature matrix consisting of all polynomial combinations of the features with degree less than or equal to the specified degree.

In [27]:
from sklearn.preprocessing import PolynomialFeatures

X = np.arange(20).reshape(5, 4)

print('Normal:')
print(pd.DataFrame(X))

poly = PolynomialFeatures(degree=2)

print('\nPoly:')         
print(pd.DataFrame(poly.fit_transform(X)))
Normal:
    0   1   2   3
0   0   1   2   3
1   4   5   6   7
2   8   9  10  11
3  12  13  14  15
4  16  17  18  19

Poly:
     0     1     2     3     4      5      6      7      8      9     10  \
0  1.0   0.0   1.0   2.0   3.0    0.0    0.0    0.0    0.0    1.0    2.0   
1  1.0   4.0   5.0   6.0   7.0   16.0   20.0   24.0   28.0   25.0   30.0   
2  1.0   8.0   9.0  10.0  11.0   64.0   72.0   80.0   88.0   81.0   90.0   
3  1.0  12.0  13.0  14.0  15.0  144.0  156.0  168.0  180.0  169.0  182.0   
4  1.0  16.0  17.0  18.0  19.0  256.0  272.0  288.0  304.0  289.0  306.0   

      11     12     13     14  
0    3.0    4.0    6.0    9.0  
1   35.0   36.0   42.0   49.0  
2   99.0  100.0  110.0  121.0  
3  195.0  196.0  210.0  225.0  
4  323.0  324.0  342.0  361.0  
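
Since the note above mentions interaction terms specifically, interaction_only=True keeps the bias column, the original features and the cross terms while dropping the pure powers; a minimal sketch reusing X:

poly_int = PolynomialFeatures(degree=2, interaction_only=True)
print(pd.DataFrame(poly_int.fit_transform(X)))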