Missing data

  • Values in a dataset may be missing. This prevents us from fitting a model or applying other useful techniques to the data. Missing data tends to introduce bias that leads to misleading results, so it cannot simply be ignored. A common guideline is to choose the imputation strategy that distorts the variance and distribution of the variable the least.
In [1]:
import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

Data

  • Titanic Dataset: build a predictive model that answers the question “what sorts of people were more likely to survive?” using passenger data (i.e., name, age, gender, socio-economic class, etc.). This data was collected after the accident.
In [2]:
url = 'https://raw.githubusercontent.com/krishnaik06/Feature-Engineering-Live-sessions/master/titanic.csv'
df = pd.read_csv(url)
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
In [3]:
df.head()
Out[3]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S

What are the different types of Missing Data?

  1. Missing Not At Random (MNAR): systematic missing values. There is some relationship between a value being missing and other values, observed or missing, within the dataset.
In [4]:
df.isnull().sum()
Out[4]:
PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
  • Is there a relationship between Age and Cabin being missing? Yes. The data was collected after the crash, so for passengers who died neither an Age nor a Cabin could be obtained from them directly. The missingness of Age and Cabin is therefore related to survival, which makes these variables MNAR.

Missingno heatmap

  • The missingno correlation heatmap measures nullity correlation: how strongly the presence or absence of one variable affects the presence of another. Nullity correlation ranges from -1 (if one variable appears the other definitely does not) through 0 (variables appearing or not appearing have no effect on one another) to 1 (if one variable appears the other definitely also does). Entries marked $<1$ or $>-1$ have a correlation that is close to, but not quite, exactly positive or negative. This points to a small number of records in the dataset which are erroneous.
In [5]:
import missingno as msno

msno.heatmap(df, cmap='viridis')
b, t = plt.ylim() 
b += 0.5 
t -= 0.5 
plt.ylim(b, t) 
plt.show();

Missingno dendrogram

  • The dendrogram allows you to more fully correlate variable completion, revealing trends deeper than the pairwise ones visible in the correlation heatmap. The dendrogram uses a hierarchical clustering algorithm (courtesy of scipy) to bin variables against one another by their nullity correlation (measured in terms of binary distance). At each step of the tree the variables are split up based on which combination minimizes the distance of the remaining clusters. The more monotone the set of variables, the closer their total distance is to zero, and the closer their average distance (the y-axis) is to zero. To interpret this graph, read it from a top-down perspective. Cluster leaves which linked together at a distance of zero fully predict one another's presence—one variable might always be empty when another is filled, or they might always both be filled or both empty, and so on. In this specific example the dendrogram glues together the variables which are required and therefore present in every record.
In [6]:
msno.dendrogram(df);
  2. Missing Completely At Random (MCAR): a variable is missing completely at random if the probability of being missing is the same for all observations. When data is MCAR, there is no relationship between a value being missing and any other values, observed or missing, within the dataset. In other words, the missing data points are a random subset of the data; nothing systematic makes some data more likely to be missing than others.
In [7]:
df[df['Embarked'].isnull()]
Out[7]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
61 62 1 1 Icard, Miss. Amelie female 38.0 0 0 113572 80.0 B28 NaN
829 830 1 1 Stone, Mrs. George Nelson (Martha Evelyn) female 62.0 0 0 113572 80.0 B28 NaN
  • The Embarked variable records the port where the passenger boarded. These two missing values look MCAR: both Age and Cabin are present for the affected rows.

Missingno matrix

  • A nullity matrix is a data-dense display which lets you quickly visually pick out patterns in data completion. The sparkline at right summarizes the general shape of the data completeness and points out the rows with the maximum and minimum nullity in the dataset. This visualization will comfortably accommodate up to 50 labelled variables. Past that range labels begin to overlap or become unreadable, and by default large displays omit them.
In [8]:
import missingno as msno
msno.matrix(df);
  • Missing Not At Random (MNAR), revisited: there is a systematic relationship between a value being missing and other values in the dataset. For Cabin, we can check this by comparing the proportion of missing values between survivors and non-survivors.
In [9]:
# Percentage of missing values
df.isnull().mean()
Out[9]:
PassengerId    0.000000
Survived       0.000000
Pclass         0.000000
Name           0.000000
Sex            0.000000
Age            0.198653
SibSp          0.000000
Parch          0.000000
Ticket         0.000000
Fare           0.000000
Cabin          0.771044
Embarked       0.002245
dtype: float64
In [10]:
# Binary indicator: 1 if Cabin is missing, 0 if present
df['cabin_null'] = np.where(df['Cabin'].isnull(),1,0)

# Find the percentage of missing cabin values for survivors and non-survivors
df.groupby(['Survived'])['cabin_null'].mean()
Out[10]:
Survived
0    0.876138
1    0.602339
Name: cabin_null, dtype: float64
  3. Missing At Random (MAR): the propensity for a data point to be missing is not related to the missing value itself, but it is related to some of the observed data.
  • Whether or not someone answered #13 on your survey has nothing to do with the missing values themselves, but it may have to do with the values of some other variable.
  • The idea is that if we can control for this conditioning variable, we can get a random subset. Good techniques for data that is missing at random incorporate the variables that are related to the missingness; a minimal sketch of this idea follows below.
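
As an illustration only (not in the original notebook), here is a minimal sketch of MAR-style conditional imputation on the Titanic data, assuming Pclass is an observed variable related to the missingness of Age: each missing Age is filled with the median of its own passenger class rather than the global median.

# Minimal sketch: group-wise (conditional) median imputation, assuming
# Pclass is related to the missingness of Age.
df['Age_by_class'] = df.groupby('Pclass')['Age'].transform(
    lambda s: s.fillna(s.median())
)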

All the techniques of handling missing values

  1. Mean, Median, Mode replacement
  2. Random Sample Imputation
  3. Capturing NAN values with a new feature
  4. End of distribution imputation
  5. Arbitrary imputation
  6. Frequent categories imputation
  7. Forward and backward filling
  8. Estimator imputation

How To Handle Continuous and Discrete Numerical Missing Values

Mean, Median, Mode imputation

  • When should we apply it? Mean, median, or mode imputation assumes that the data are missing completely at random (MCAR). We replace the NaNs with the mean, the median, or the most frequent value (mode) of the variable.
In [11]:
df = df[['Age','Fare','Survived']]
df.head()
Out[11]:
Age Fare Survived
0 22.0 7.2500 0
1 38.0 71.2833 1
2 26.0 7.9250 1
3 35.0 53.1000 1
4 35.0 8.0500 0
In [12]:
# Percentage of missing values
df.isnull().mean()
Out[12]:
Age         0.198653
Fare        0.000000
Survived    0.000000
dtype: float64
In [13]:
# Function to impute NaNs with a given fill value, storing the result in a new column
def impute_nan(df=df, variable='Age', strat=None, method=None):
    df[variable + "_" + method] = df[variable].fillna(strat)
In [14]:
# mode() returns a Series, so take its first element
methods = [df.Age.mean(), df.Age.median(), df.Age.mode()[0]]
names = ['mean', 'median', 'mode']

for m, name in zip(methods, names):
    # truncate to a whole age before filling
    type_ = int(m)
    impute_nan(df, 'Age', type_, method=name)
In [15]:
df.head()
Out[15]:
Age Fare Survived Age_mean Age_median Age_mode
0 22.0 7.2500 0 22.0 22.0 22.0
1 38.0 71.2833 1 38.0 38.0 38.0
2 26.0 7.9250 1 26.0 26.0 26.0
3 35.0 53.1000 1 35.0 35.0 35.0
4 35.0 8.0500 0 35.0 35.0 35.0

The median is more robust to outliers; if the dataset has outliers, using the mean could skew the analysis.

In [16]:
age_cols = [x for x in list(df.columns) if 'Age' in x]

for col in age_cols:
    df[col].plot(kind='kde', label=f'{col}, std = {round(df[col].std(),3)}')
plt.legend()
plt.show()

Advantages And Disadvantages of Mean, Median, Mode Imputation

Advantages

  1. Easy to implement (and the median variant is robust to outliers)
  2. Faster way to obtain the complete dataset

Disadvantages

  1. Change or Distortion in the original variance
  2. Impacts Correlation

Random Sample Imputation

  • Random sample imputation consists of taking random observations from the dataset and using them to replace the NaN values. This method assumes that the data are missing completely at random (MCAR).
In [17]:
# Generate a random sample from Age without missing values of the same length as the missing values
sample = df['Age'].dropna().sample(df['Age'].isnull().sum(), random_state=0)
print('# of samples', len(sample))
sample.head()
# of samples 177
Out[17]:
423    28.00
177    50.00
305     0.92
292    36.00
889    26.00
Name: Age, dtype: float64
In [18]:
def impute_random(df, variable):
    df[variable  + "_random"] = df[variable]
    # Create the random sample to fill the NaNs
    random_sample = df[variable].dropna().sample(df[variable].isnull().sum(), random_state=0)
    # Pandas needs the same index in order to find and fill the NaN locations
    random_sample.index = df[df[variable].isnull()].index
    # New column without NaNs filled by random sample 
    df.loc[df[variable].isnull(),variable+'_random']=random_sample
In [19]:
impute_random(df,"Age")
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 7 columns):
Age           714 non-null float64
Fare          891 non-null float64
Survived      891 non-null int64
Age_mean      891 non-null float64
Age_median    891 non-null float64
Age_mode      891 non-null float64
Age_random    891 non-null float64
dtypes: float64(6), int64(1)
memory usage: 48.9 KB
In [20]:
df.head()
Out[20]:
Age Fare Survived Age_mean Age_median Age_mode Age_random
0 22.0 7.2500 0 22.0 22.0 22.0 22.0
1 38.0 71.2833 1 38.0 38.0 38.0 38.0
2 26.0 7.9250 1 26.0 26.0 26.0 26.0
3 35.0 53.1000 1 35.0 35.0 35.0 35.0
4 35.0 8.0500 0 35.0 35.0 35.0 35.0
In [21]:
age_cols = ['Age', 'Age_mode', 'Age_random']

for col in age_cols:
    df[col].plot(kind='kde', label=f'{col}, std = {round(df[col].std(),3)}')
plt.legend()
plt.show()

Random sample imputation matches the original distribution the best, with the least distortion of the variance.

Random Sample Imputation Advantages and Disadvantages

Advantages

  1. Easy To implement
  2. There is less distortion in variance

Disadvantages

  1. Randomness won't work in every situation

Capturing NaN values with a new feature

This works well if the data are not missing completely at random but instead show characteristics of Missing Not At Random (MNAR). It also works for categorical data.

In [22]:
# Create a binary feature indicating whether Age is missing
df['Age_NAN'] = np.where(df['Age'].isnull(), 1, 0)
df.head()
Out[22]:
Age Fare Survived Age_mean Age_median Age_mode Age_random Age_NAN
0 22.0 7.2500 0 22.0 22.0 22.0 22.0 0
1 38.0 71.2833 1 38.0 38.0 38.0 38.0 0
2 26.0 7.9250 1 26.0 26.0 26.0 26.0 0
3 35.0 53.1000 1 35.0 35.0 35.0 35.0 0
4 35.0 8.0500 0 35.0 35.0 35.0 35.0 0

Capturing NaN values with a new feature Advantages and Disadvantages

Advantages

  1. Easy to implement
  2. Captures the importance of missing values

Disadvantages

  1. Creating Additional Features (curse of dimensionality)

End of Distribution imputation

If there is reason to suspect the values are not MAR or MCAR, then capturing that information is important. In this case we replace missing data with values at the tail of the variable's distribution.

  • If the variable is normally distributed, use the mean $\pm 3$ standard deviations.
  • If the distribution is skewed, use the IQR proximity rule (a minimal sketch follows this list).
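
As an illustration (not executed in the original notebook), a minimal sketch of the IQR proximity rule: the fill value is placed beyond the upper fence, commonly Q3 + 1.5 × IQR (or Q3 + 3 × IQR for extreme values).

# Minimal sketch: the IQR proximity rule for a skewed variable.
# The resulting upper fence could be used as the fill value instead of mean + 3*std.
q1, q3 = df['Age'].quantile([0.25, 0.75])
iqr = q3 - q1
upper_fence = q3 + 3 * iqr   # use q3 + 1.5 * iqr for a less extreme fence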
In [23]:
df.Age.hist(bins=50);
In [24]:
sns.boxplot(x='Age', data=df);

Outliers exist only at the upper extreme of the Age distribution.

In [25]:
# Get the extreme value of the distribution (mean + 3 standard deviations)
extreme = df.Age.mean() + (3 * df.Age.std())
print(f'Extreme = {extreme}')
df.loc[df['Age'] >= extreme]
Extreme = 73.27860964406095
Out[25]:
Age Fare Survived Age_mean Age_median Age_mode Age_random Age_NAN
630 80.0 30.000 1 80.0 80.0 80.0 80.0 0
851 74.0 7.775 0 74.0 74.0 74.0 74.0 0
In [26]:
def impute_extreme(df, variable, extreme):
    df[variable + "_end_dist"]=df[variable].fillna(extreme)
In [27]:
impute_extreme(df,'Age', extreme)
df.head()
Out[27]:
Age Fare Survived Age_mean Age_median Age_mode Age_random Age_NAN Age_end_dist
0 22.0 7.2500 0 22.0 22.0 22.0 22.0 0 22.0
1 38.0 71.2833 1 38.0 38.0 38.0 38.0 0 38.0
2 26.0 7.9250 1 26.0 26.0 26.0 26.0 0 26.0
3 35.0 53.1000 1 35.0 35.0 35.0 35.0 0 35.0
4 35.0 8.0500 0 35.0 35.0 35.0 35.0 0 35.0
In [28]:
age_cols = ['Age', 'Age_mode', 'Age_random', 'Age_end_dist']

for col in age_cols:
    df[col].plot(kind='kde', label=f'{col}, std = {round(df[col].std(),3)}')
plt.legend()
plt.show()
In [29]:
plt.figure(figsize=(12,5))
plt.subplot(121)
sns.boxplot(x='Age', data=df)
plt.title('Age')
plt.subplot(122)
sns.boxplot(x='Age_end_dist', data=df)
plt.title('Age_end_dist')
plt.show()

End of Distribution imputation Advantages and Disadvantages

Advantages

  1. Easy to implement
  2. Can bring out the importance of missing values.

Disadvantages

  1. Can distort variance
  2. If there are many NaNs it will mask true outliers in the distribution
  3. If there are few NaNs the replaced NaNs may be considered an outlier and be pre-processed in a subsequent task of feature engineering.

Dropping all columns or rows

This works best when there are very few missing values.

  • Dropping columns removes every variable (column) that contains any missing data
  • Dropping rows removes every observation (row) that contains any missing value
In [30]:
# create missing values df
data = [   ('D',1,10,6,np.NaN),
           ('D',2,12,10,12),
           ('X',1,28,15,np.NaN),
           ('D',3,np.NaN,4,np.NaN),
           ('X',2,np.NaN,20,25),
           ('X',3,32,31,25),
           ('T',1,220,250,np.NaN),
           ('X',4,30,22,np.NaN),
           ('T',2,240,170,np.NaN),
           ('X',2,38,27,np.NaN),
           ('T',3,np.NaN,44,np.NaN),
           ('D',1,20,18,80),
           ('D',4,200,120,150)]

labels = ['item1', 'month','normal_price','item2','final_price']

df = pd.DataFrame.from_records(data, columns=labels)

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13 entries, 0 to 12
Data columns (total 5 columns):
item1           13 non-null object
month           13 non-null int64
normal_price    10 non-null float64
item2           13 non-null int64
final_price     5 non-null float64
dtypes: float64(2), int64(2), object(1)
memory usage: 648.0+ bytes
In [31]:
df
Out[31]:
item1 month normal_price item2 final_price
0 D 1 10.0 6 NaN
1 D 2 12.0 10 12.0
2 X 1 28.0 15 NaN
3 D 3 NaN 4 NaN
4 X 2 NaN 20 25.0
5 X 3 32.0 31 25.0
6 T 1 220.0 250 NaN
7 X 4 30.0 22 NaN
8 T 2 240.0 170 NaN
9 X 2 38.0 27 NaN
10 T 3 NaN 44 NaN
11 D 1 20.0 18 80.0
12 D 4 200.0 120 150.0
In [32]:
# Remove every row that contains any missing value
df.dropna(axis=0)
Out[32]:
item1 month normal_price item2 final_price
1 D 2 12.0 10 12.0
5 X 3 32.0 31 25.0
11 D 1 20.0 18 80.0
12 D 4 200.0 120 150.0
In [33]:
# Remove every column that contains any missing value
df.dropna(axis=1)
Out[33]:
item1 month item2
0 D 1 6
1 D 2 10
2 X 1 15
3 D 3 4
4 X 2 20
5 X 3 31
6 T 1 250
7 X 4 22
8 T 2 170
9 X 2 27
10 T 3 44
11 D 1 18
12 D 4 120

Dropping all columns or rows Advantages and Disadvantages

Advantages

  1. Easy to implement
  2. Cleans the entire dataset

Disadvantages

  1. Only works when there are very few missing values
  2. Can lose a lot of useful information by removing data completely

Forward and Backward filling

Fill missing data with the next or previous value. This works best when a missing entry means the value is unchanged since the last recorded observation (for example, in ordered or time-series data).

  • Problem: fill the NaNs in the normal_price and final_price columns for each item with the values of its preceding month (or, if not available, of its succeeding month). The chained fill below operates over the whole frame; a per-item sketch follows after its output.
In [34]:
# Chain together ffill and bfill to fill any remaining NaN values (whole-frame fill):
df.ffill().bfill()
Out[34]:
item1 month normal_price item2 final_price
0 D 1 10.0 6 12.0
1 D 2 12.0 10 12.0
2 X 1 28.0 15 12.0
3 D 3 28.0 4 12.0
4 X 2 28.0 20 25.0
5 X 3 32.0 31 25.0
6 T 1 220.0 250 25.0
7 X 4 30.0 22 25.0
8 T 2 240.0 170 25.0
9 X 2 38.0 27 25.0
10 T 3 38.0 44 25.0
11 D 1 20.0 18 80.0
12 D 4 200.0 120 150.0
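
As an illustration only (not part of the original notebook), a minimal sketch of the per-item fill posed in the problem, assuming item1 identifies the item and month gives the ordering:

# Minimal sketch: forward/backward fill within each item, ordered by month.
price_cols = ['normal_price', 'final_price']
df_per_item = df.sort_values(['item1', 'month']).copy()
df_per_item[price_cols] = (
    df_per_item.groupby('item1')[price_cols]
               .transform(lambda s: s.ffill().bfill())
)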

Forward and Backward filling Advantages and Disadvantages

Advantages

  1. Easy to implement
  2. Cleans entire dataset

Disadvantages

  1. May propagate stale or incorrect values, since the fill is just a copy of a neighboring observation
  2. Only works well when there are few missing values
  3. Only applies to a limited set of use cases (e.g., ordered or time-series data)

How To Handle Categorical Missing Values

Frequent Category Imputation

With this method you impute missing data with the most frequently occurring value (the mode). It is best suited for categorical data, under the assumption that a missing value is most likely to take the most frequently occurring value.

  • Assumptions: data is missing at random (MAR); missing values look like the majority.

Dataset

In [35]:
url_train = 'https://raw.githubusercontent.com/liyenhsu/Kaggle-House-Prices/master/data/train.csv'
df = pd.read_csv(url_train, usecols=['BsmtQual','FireplaceQu','GarageType','SalePrice'])
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 4 columns):
BsmtQual       1423 non-null object
FireplaceQu    770 non-null object
GarageType     1379 non-null object
SalePrice      1460 non-null int64
dtypes: int64(1), object(3)
memory usage: 45.8+ KB
In [36]:
df.isnull().sum()
Out[36]:
BsmtQual        37
FireplaceQu    690
GarageType      81
SalePrice        0
dtype: int64
In [37]:
df.isnull().mean().sort_values(ascending=True)
Out[37]:
SalePrice      0.000000
BsmtQual       0.025342
GarageType     0.055479
FireplaceQu    0.472603
dtype: float64

Compute the frequency of each category for every feature

In [38]:
fig, ax = plt.subplots(nrows=1, ncols=3, figsize=(12,5))
df['BsmtQual'].value_counts().plot.bar(ax=ax[0])
df['GarageType'].value_counts().plot.bar(ax=ax[1])
df['FireplaceQu'].value_counts().plot.bar(ax=ax[2])
ax[0].set_title('BsmtQual')
ax[1].set_title('GarageType')
ax[2].set_title('FireplaceQu')
plt.tight_layout()
plt.show()
In [39]:
def impute_mode_cat(df, variable):
    # Most frequent category among the non-missing values
    most_frequent_category = df[variable].dropna().mode()[0]
    df[variable + '_mode'] = df[variable].fillna(most_frequent_category)
In [40]:
for feature in ['BsmtQual','FireplaceQu','GarageType']:
    impute_mode_cat(df, feature)

df.head()
Out[40]:
BsmtQual FireplaceQu GarageType SalePrice BsmtQual_mode FireplaceQu_mode GarageType_mode
0 Gd NaN Attchd 208500 Gd Gd Attchd
1 Gd TA Attchd 181500 Gd TA Attchd
2 Gd TA Attchd 223500 Gd TA Attchd
3 TA Gd Detchd 140000 TA Gd Detchd
4 Gd TA Attchd 250000 Gd TA Attchd
In [41]:
df.isnull().mean()
Out[41]:
BsmtQual            0.025342
FireplaceQu         0.472603
GarageType          0.055479
SalePrice           0.000000
BsmtQual_mode       0.000000
FireplaceQu_mode    0.000000
GarageType_mode     0.000000
dtype: float64

Frequent Category Imputation Advantages and Disadvantages

Advantages

  1. Easy to implement
  2. Suitable for categorical data

Disadvantages

  1. May create a biased dataset that favors the most frequent value if there are many NaNs
  2. Distorts the relationship of the most frequent label with the other variables

Arbitrary Value Imputation

  • Here, the purpose is to flag missing values in the dataset by imputing them with a fixed, arbitrary value.

  • It is mostly used for categorical variables, but it can also be used for numeric variables with arbitrary sentinel values such as $0$, $999$, or similar (a numeric sketch follows the categorical example below).

In [42]:
def impute_new_cat(df, variable):
    df[variable + "_newvar"] = np.where(df[variable].isnull(), "Missing", df[variable])
In [43]:
for feature in ['BsmtQual','FireplaceQu','GarageType']:
    impute_new_cat(df, feature)

df.head()
Out[43]:
BsmtQual FireplaceQu GarageType SalePrice BsmtQual_mode FireplaceQu_mode GarageType_mode BsmtQual_newvar FireplaceQu_newvar GarageType_newvar
0 Gd NaN Attchd 208500 Gd Gd Attchd Gd Missing Attchd
1 Gd TA Attchd 181500 Gd TA Attchd Gd TA Attchd
2 Gd TA Attchd 223500 Gd TA Attchd Gd TA Attchd
3 TA Gd Detchd 140000 TA Gd Detchd TA Gd Detchd
4 Gd TA Attchd 250000 Gd TA Attchd Gd TA Attchd
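
For a numeric variable, the same idea could be sketched with a sentinel value, assuming a value such as 999 (or -1) cannot occur naturally in that column. The helper below is illustrative and not part of the original notebook:

# Minimal sketch: arbitrary-value imputation for a numeric variable, using a
# sentinel that cannot occur naturally in the data.
def impute_arbitrary_num(df, variable, sentinel=999):
    df[variable + "_arbitrary"] = df[variable].fillna(sentinel)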

Arbitrary Value Imputation Advantages and Disadvantages

Advantages

  1. Easy to implement
  2. Captures the importance of missingness, when it is informative

Disadvantages

  1. Distorts the original distribution of the variable
  2. If missingness is not important, it may mask the predictive power of the original variable by distorting its distribution
  3. Hard to decide which value to use

Filling Missing Values With a Model

For this we can use scikit-learn's IterativeImputer. It models each feature with missing values as a function of the other features and uses that estimate for imputation. It does so in an iterated round-robin fashion: at each step, one feature column is designated as the output $y$ and the other feature columns are treated as the inputs $X$. A regressor (e.g., BayesianRidge or ExtraTreesRegressor) is fit on ($X$, $y$) for the known $y$ and then used to predict the missing values of $y$. This is done for each feature in turn and repeated for max_iter imputation rounds.
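
A minimal, self-contained sketch of the basic IterativeImputer API on a toy frame (for orientation only; the notebook's own multi-estimator example follows):

# Minimal sketch: IterativeImputer (experimental API) with its default
# BayesianRidge estimator on a small toy DataFrame.
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

toy = pd.DataFrame({'a': [1.0, 2.0, np.nan, 4.0],
                    'b': [2.0, np.nan, 6.0, 8.0]})
imp = IterativeImputer(max_iter=10, random_state=0)
filled = pd.DataFrame(imp.fit_transform(toy), columns=toy.columns)
print(filled)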

In [44]:
# create missing values df
data = [   (np.NaN, 'A', 1,10,6,np.NaN),
           ('D', 'D', 2,12,10,12),
           ('X', np.NaN, 1,28,15,np.NaN),
           ('D', 'D', 3,np.NaN,4,np.NaN),
           ('X', 'B', 2,np.NaN,20,25),
           (np.NaN, 'B', 3,32,31,25),
           ('T', 'F', 1,220,250,np.NaN),
           ('X', np.NaN, 4,30,22,np.NaN),
           ('T', np.NaN, 2,240,170,np.NaN),
           ('X', 'C', 2,38,27,np.NaN),
           ('T', 'C', 3,np.NaN,44,np.NaN),
           (np.NaN, 'A', 4,20,18,80),
           ('D', np.NaN, 4,200,120,150)]

labels = ['item1', 'grade', 'month','normal_price','item2','final_price']

df = pd.DataFrame.from_records(data, columns=labels)

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13 entries, 0 to 12
Data columns (total 6 columns):
item1           10 non-null object
grade           9 non-null object
month           13 non-null int64
normal_price    10 non-null float64
item2           13 non-null int64
final_price     5 non-null float64
dtypes: float64(2), int64(2), object(2)
memory usage: 752.0+ bytes
In [45]:
df_ = df.copy()
df
Out[45]:
item1 grade month normal_price item2 final_price
0 NaN A 1 10.0 6 NaN
1 D D 2 12.0 10 12.0
2 X NaN 1 28.0 15 NaN
3 D D 3 NaN 4 NaN
4 X B 2 NaN 20 25.0
5 NaN B 3 32.0 31 25.0
6 T F 1 220.0 250 NaN
7 X NaN 4 30.0 22 NaN
8 T NaN 2 240.0 170 NaN
9 X C 2 38.0 27 NaN
10 T C 3 NaN 44 NaN
11 NaN A 4 20.0 18 80.0
12 D NaN 4 200.0 120 150.0
In [46]:
from sklearn import preprocessing
import warnings
warnings.filterwarnings('ignore')

encoder = preprocessing.OrdinalEncoder()

cat_cols = ['item1','grade']

# Function to encode non-null data and replace it in the original data
def encode(data):
    # Retains only non-null values
    nonulls = np.array(data.dropna())
    # Reshapes the data for encoding
    impute_reshape = nonulls.reshape(-1,1)
    # Encode data
    impute_ordinal = encoder.fit_transform(impute_reshape)
    # Assign back encoded values to non-null values
    data.loc[data.notnull()] = np.squeeze(impute_ordinal)
    return data

for columns in cat_cols:
    encode(df[columns])
In [47]:
print('\nOriginal:', '\n', df_)
print('\nEncoded:', '\n',df)
Original: 
    item1 grade  month  normal_price  item2  final_price
0    NaN     A      1          10.0      6          NaN
1      D     D      2          12.0     10         12.0
2      X   NaN      1          28.0     15          NaN
3      D     D      3           NaN      4          NaN
4      X     B      2           NaN     20         25.0
5    NaN     B      3          32.0     31         25.0
6      T     F      1         220.0    250          NaN
7      X   NaN      4          30.0     22          NaN
8      T   NaN      2         240.0    170          NaN
9      X     C      2          38.0     27          NaN
10     T     C      3           NaN     44          NaN
11   NaN     A      4          20.0     18         80.0
12     D   NaN      4         200.0    120        150.0

Encoded: 
    item1 grade  month  normal_price  item2  final_price
0    NaN     0      1          10.0      6          NaN
1      0     3      2          12.0     10         12.0
2      2   NaN      1          28.0     15          NaN
3      0     3      3           NaN      4          NaN
4      2     1      2           NaN     20         25.0
5    NaN     1      3          32.0     31         25.0
6      1     4      1         220.0    250          NaN
7      2   NaN      4          30.0     22          NaN
8      1   NaN      2         240.0    170          NaN
9      2     2      2          38.0     27          NaN
10     1     2      3           NaN     44          NaN
11   NaN     0      4          20.0     18         80.0
12     0   NaN      4         200.0    120        150.0

Imputation Using Deep Learning:

This method works well with categorical and non-numerical features. Here we use the Keras scikit-learn wrapper so that a deep learning model can be used as a scikit-learn regressor or classifier (and plugged into IterativeImputer as its estimator).

Pros:

  • Quite accurate compared to other methods.
  • It supports continuous and categorical data.

Cons:

  • Can be quite slow with large datasets.
In [48]:
import tensorflow as tf 
from keras.models import Sequential
from keras.layers import Dense
from keras.wrappers.scikit_learn import KerasRegressor

def reg_model():
    # create model
    model = Sequential()
    # Add an input layer and a dense hidden layer with 12 neurons
    model.add(Dense(12, input_dim=df.shape[1] - 1, activation='relu'))
    # Add output layer
    model.add(Dense(1, activation='relu'))
    # Compile model
    model.compile(loss=tf.keras.losses.mape, optimizer='adam')
    return model 

def classif_model():
    # create model
    model = Sequential()
    # Add an input layer and a dense hidden layer with 12 neurons
    model.add(Dense(12, input_dim=df.shape[1] - 1, activation='sigmoid'))
    # Add output layer
    model.add(Dense(1, activation='sigmoid'))
    # Compile model
    model.compile(loss=tf.keras.losses.kld, optimizer='adam')
    return model  
Using TensorFlow backend.

Imputation Using k-NN:

The $k$-nearest neighbors algorithm is a simple method used for classification (and regression). It uses ‘feature similarity’ to predict the value of a new data point: the new point is assigned a value based on how closely it resembles the points in the training set. This is useful for imputation: find the $k$ closest neighbors of the observation with missing data and impute from the non-missing values in that neighborhood.

  • How does it work?

It creates a basic mean impute, then uses the resulting complete dataset to construct a KDTree. It then queries the KDTree for the nearest neighbors. After finding the $k$ nearest neighbors, it takes their weighted average.
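
scikit-learn also provides a dedicated KNNImputer, a related but distinct implementation that averages the values of the k nearest rows measured on the observed features. A minimal sketch on the ordinally encoded frame from above (an alternative to the KNeighborsClassifier-inside-IterativeImputer approach used below, not part of the original notebook):

# Minimal sketch: KNNImputer applied to the ordinally encoded frame.
from sklearn.impute import KNNImputer

knn_imp = KNNImputer(n_neighbors=2)
df_knn = pd.DataFrame(knn_imp.fit_transform(df.astype(float)), columns=df.columns)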


Pros:

  • Can be much more accurate than the mean, median or most frequent imputation methods (It depends on the dataset).

Cons:

  • Computationally expensive. KNN works by storing the whole training dataset in memory.
  • K-NN is quite sensitive to outliers in the data (unlike SVM)
In [49]:
from sklearn.experimental import enable_iterative_imputer  
from sklearn.impute import IterativeImputer
from sklearn.linear_model import BayesianRidge
from sklearn.neighbors import KNeighborsClassifier

reg_nn = KerasRegressor(build_fn=reg_model, epochs=20, batch_size=5, verbose=0)
classif_nn = KerasRegressor(build_fn=classif_model, epochs=20, batch_size=5, verbose=0)
reg = BayesianRidge()
knn = KNeighborsClassifier(n_neighbors=2)

z0 = np.round(IterativeImputer(estimator=reg,random_state=0).fit_transform(df.values))
z1 = IterativeImputer(estimator=knn, random_state=0).fit_transform(df.values)
z2 = np.round(IterativeImputer(estimator=reg_nn,random_state=0).fit_transform(df.values))
z3 = np.round(IterativeImputer(estimator=classif_nn,random_state=0).fit_transform(df.values))

df_imp0 = pd.DataFrame(z0, columns=df.columns)
df_imp1 = pd.DataFrame(z1, columns=df.columns)
df_imp2 = pd.DataFrame(z2, columns=df.columns)
df_imp3 = pd.DataFrame(z3, columns=df.columns)

print('\nMissing:')
print('\n',df)
print('\nBayesianRidgeRegressor:')
print('\n',df_imp0)
print('\nKNeighborsClassifier:')
print('\n',df_imp1)
print('\nNNRegressor:')
print('\n',df_imp2)
print('\nNNClassifier:')
print('\n',df_imp3)
Missing:

    item1 grade  month  normal_price  item2  final_price
0    NaN     0      1          10.0      6          NaN
1      0     3      2          12.0     10         12.0
2      2   NaN      1          28.0     15          NaN
3      0     3      3           NaN      4          NaN
4      2     1      2           NaN     20         25.0
5    NaN     1      3          32.0     31         25.0
6      1     4      1         220.0    250          NaN
7      2   NaN      4          30.0     22          NaN
8      1   NaN      2         240.0    170          NaN
9      2     2      2          38.0     27          NaN
10     1     2      3           NaN     44          NaN
11   NaN     0      4          20.0     18         80.0
12     0   NaN      4         200.0    120        150.0

BayesianRidgeRegressor:

     item1  grade  month  normal_price  item2  final_price
0     1.0    0.0    1.0          10.0    6.0         29.0
1     0.0    3.0    2.0          12.0   10.0         12.0
2     2.0    2.0    1.0          28.0   15.0         40.0
3     0.0    3.0    3.0          16.0    4.0         32.0
4     2.0    1.0    2.0          17.0   20.0         25.0
5     1.0    1.0    3.0          32.0   31.0         25.0
6     1.0    4.0    1.0         220.0  250.0        182.0
7     2.0    2.0    4.0          30.0   22.0         42.0
8     1.0    3.0    2.0         240.0  170.0        173.0
9     2.0    2.0    2.0          38.0   27.0         47.0
10    1.0    2.0    3.0          58.0   44.0         60.0
11    1.0    0.0    4.0          20.0   18.0         80.0
12    0.0    3.0    4.0         200.0  120.0        150.0

KNeighborsClassifier:

     item1  grade  month  normal_price  item2  final_price
0     0.0    0.0    1.0          10.0    6.0         12.0
1     0.0    3.0    2.0          12.0   10.0         12.0
2     2.0    1.0    1.0          28.0   15.0         25.0
3     0.0    3.0    3.0          10.0    4.0         12.0
4     2.0    1.0    2.0          28.0   20.0         25.0
5     2.0    1.0    3.0          32.0   31.0         25.0
6     1.0    4.0    1.0         220.0  250.0         25.0
7     2.0    1.0    4.0          30.0   22.0         25.0
8     1.0    2.0    2.0         240.0  170.0         25.0
9     2.0    2.0    2.0          38.0   27.0         25.0
10    1.0    2.0    3.0          32.0   44.0         25.0
11    2.0    0.0    4.0          20.0   18.0         80.0
12    0.0    0.0    4.0         200.0  120.0        150.0

NNRegressor:

     item1  grade  month  normal_price  item2  final_price
0     0.0    0.0    1.0          10.0    6.0          3.0
1     0.0    3.0    2.0          12.0   10.0         12.0
2     2.0    0.0    1.0          28.0   15.0          8.0
3     0.0    3.0    3.0           0.0    4.0          1.0
4     2.0    1.0    2.0           4.0   20.0         25.0
5     0.0    1.0    3.0          32.0   31.0         25.0
6     1.0    4.0    1.0         220.0  250.0         91.0
7     2.0    0.0    4.0          30.0   22.0         11.0
8     1.0    0.0    2.0         240.0  170.0         80.0
9     2.0    2.0    2.0          38.0   27.0         13.0
10    1.0    2.0    3.0           0.0   44.0          8.0
11    0.0    0.0    4.0          20.0   18.0         80.0
12    0.0    0.0    4.0         200.0  120.0        150.0

NNClassifier:

     item1  grade  month  normal_price  item2  final_price
0     1.0    0.0    1.0          10.0    6.0          1.0
1     0.0    3.0    2.0          12.0   10.0         12.0
2     2.0    1.0    1.0          28.0   15.0          1.0
3     0.0    3.0    3.0           0.0    4.0          1.0
4     2.0    1.0    2.0           1.0   20.0         25.0
5     1.0    1.0    3.0          32.0   31.0         25.0
6     1.0    4.0    1.0         220.0  250.0          1.0
7     2.0    1.0    4.0          30.0   22.0          1.0
8     1.0    1.0    2.0         240.0  170.0          1.0
9     2.0    2.0    2.0          38.0   27.0          1.0
10    1.0    2.0    3.0           0.0   44.0          1.0
11    1.0    0.0    4.0          20.0   18.0         80.0
12    0.0    1.0    4.0         200.0  120.0        150.0

Filling Missing Values With Model Advantages and Disadvantages

Advantages

  1. Could yield great results
  2. Should preserve variance
  3. Works with MAR and MNAR

Disadvantages

  1. Can be computationally expensive
  2. Can take some time

Imputation Using Multivariate Imputation by Chained Equation (MICE)


This type of imputation works by filling the missing data multiple times. Multiple imputations (MI) are much better than a single imputation because they capture the uncertainty of the missing values. The chained-equations approach is also very flexible and can handle variables of different data types (i.e., continuous or binary) as well as complexities such as bounds or survey skip patterns.
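
As a side note (one possible way to realize multiple imputation in scikit-learn, not part of the original notebook), IterativeImputer can be run several times with sample_posterior=True and different seeds, producing multiple completed datasets whose spread reflects imputation uncertainty:

# Minimal sketch: multiple imputation via repeated stochastic runs of
# IterativeImputer (sample_posterior=True draws imputations from the
# predictive posterior instead of using a point estimate).
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

completed = [
    IterativeImputer(sample_posterior=True, random_state=seed)
        .fit_transform(df.astype(float).values)
    for seed in range(5)
]
# Each completed dataset would be analysed separately and the results pooled.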

In [50]:
from impyute.imputation.cs import mice

imputed_training = np.round(mice(df.astype(float).values))
In [52]:
print('\nMissing:')
print('\n',df)
print('\nNNRegressor:')
print('\n',df_imp2)
print('\nNNClassifier:')
print('\n',df_imp3)
print('\nMICE:')
print('\n',pd.DataFrame(imputed_training, columns=df.columns))
Missing:

    item1 grade  month  normal_price  item2  final_price
0    NaN     0      1          10.0      6          NaN
1      0     3      2          12.0     10         12.0
2      2   NaN      1          28.0     15          NaN
3      0     3      3           NaN      4          NaN
4      2     1      2           NaN     20         25.0
5    NaN     1      3          32.0     31         25.0
6      1     4      1         220.0    250          NaN
7      2   NaN      4          30.0     22          NaN
8      1   NaN      2         240.0    170          NaN
9      2     2      2          38.0     27          NaN
10     1     2      3           NaN     44          NaN
11   NaN     0      4          20.0     18         80.0
12     0   NaN      4         200.0    120        150.0

NNRegressor:

     item1  grade  month  normal_price  item2  final_price
0     0.0    0.0    1.0          10.0    6.0          3.0
1     0.0    3.0    2.0          12.0   10.0         12.0
2     2.0    0.0    1.0          28.0   15.0          8.0
3     0.0    3.0    3.0           0.0    4.0          1.0
4     2.0    1.0    2.0           4.0   20.0         25.0
5     0.0    1.0    3.0          32.0   31.0         25.0
6     1.0    4.0    1.0         220.0  250.0         91.0
7     2.0    0.0    4.0          30.0   22.0         11.0
8     1.0    0.0    2.0         240.0  170.0         80.0
9     2.0    2.0    2.0          38.0   27.0         13.0
10    1.0    2.0    3.0           0.0   44.0          8.0
11    0.0    0.0    4.0          20.0   18.0         80.0
12    0.0    0.0    4.0         200.0  120.0        150.0

NNClassifier:

     item1  grade  month  normal_price  item2  final_price
0     1.0    0.0    1.0          10.0    6.0          1.0
1     0.0    3.0    2.0          12.0   10.0         12.0
2     2.0    1.0    1.0          28.0   15.0          1.0
3     0.0    3.0    3.0           0.0    4.0          1.0
4     2.0    1.0    2.0           1.0   20.0         25.0
5     1.0    1.0    3.0          32.0   31.0         25.0
6     1.0    4.0    1.0         220.0  250.0          1.0
7     2.0    1.0    4.0          30.0   22.0          1.0
8     1.0    1.0    2.0         240.0  170.0          1.0
9     2.0    2.0    2.0          38.0   27.0          1.0
10    1.0    2.0    3.0           0.0   44.0          1.0
11    1.0    0.0    4.0          20.0   18.0         80.0
12    0.0    1.0    4.0         200.0  120.0        150.0

MICE:

     item1  grade  month  normal_price  item2  final_price
0     5.0    0.0    1.0          10.0    6.0        -72.0
1     0.0    3.0    2.0          12.0   10.0         12.0
2     2.0    2.0    1.0          28.0   15.0        -44.0
3     0.0    3.0    3.0          22.0    4.0         97.0
4     2.0    1.0    2.0          49.0   20.0         25.0
5     3.0    1.0    3.0          32.0   31.0         25.0
6     1.0    4.0    1.0         220.0  250.0       -275.0
7     2.0    2.0    4.0          30.0   22.0        106.0
8     1.0    1.0    2.0         240.0  170.0        -24.0
9     2.0    2.0    2.0          38.0   27.0         -0.0
10    1.0    2.0    3.0          71.0   44.0         68.0
11    5.0    0.0    4.0          20.0   18.0         80.0
12    0.0    2.0    4.0         200.0  120.0        150.0