PolarSPARC

Machine Learning - Data Preparation - Part 2


Bhaskar S 05/22/2022


Overview

In Part 1 of this series, we used the the City of Ames, Iowa Housing Prices data set to perform the various tasks related to the Exploratory Data Analysis (EDA) to gather some insights about the loaded data set. In this part, we will first explain the approach to handling missing values that is related to Feature Engineering, followed by hands-on demo in Python.

Feature Engineering - Missing Values

For crafting a good machine learning model, the model algorithm requires that all feature values (from the training data set as well as the test data set) be numerical values and that there are no missing data values.

If the number of samples (rows) with missing data values (from the data set) is very small, say less than 1%, then one can drop the rows (with the missing data) from the data set without impacting the model creation and evaluation. However, if the number of samples (rows) with missing data values is large (say > 5%), then one cannot drop the rows and instead perform data Imputation.

Imputation is the process of identifying and replacing missing values for each of the features from a data set before it is used for training and testing a machine learning model.

The following are some of the commonly used strategies for imputing missing values:

Hands-On Data Imputation

The first step is to import all the necessary Python modules such as, matplotlib, pandas, and seaborn as shown below:


import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

The next step is to load the housing price dataset into a pandas dataframe as shown below:


url = './data/ames.csv'
ames_df = pd.read_csv(url)

The next step is to drop some of the features that add no value, or have large number (above 80%) of missing values, or have low correlation to the target feature (SalePrice) from the housing price dataset as shown below:


home_price_df = ames_df.drop(['Order', 'PID', 'Pool.QC', 'Misc.Feature', 'Alley', 'Fence', 'MS.SubClass', 'Lot.Frontage', 'Lot.Area', 'Overall.Cond', 'BsmtFin.SF.1', 'BsmtFin.SF.2', 'Bsmt.Unf.SF', 'X2nd.Flr.SF', 'Low.Qual.Fin.SF', 'Bsmt.Full.Bath', 'Bsmt.Half.Bath', 'Half.Bath', 'Bedroom.AbvGr', 'Kitchen.AbvGr', 'Fireplaces', 'Wood.Deck.SF', 'Open.Porch.SF', 'Enclosed.Porch', 'X3Ssn.Porch', 'Screen.Porch', 'Pool.Area', 'Misc.Val', 'Mo.Sold', 'Yr.Sold'], axis=1)

The next step is to display the shape (rows and columns) of the housing price dataset as shown below:


home_price_df.shape

The following illustration shows the shape of the housing price dataset:

Shape
Figure.1

The next step is to display information about the housing price dataframe, such as index and column types, missing (null) values, memory usage, etc., as shown below:


home_price_df.info()

The following illustration displays the information about the housing price dataframe:

Dataframe Information
Figure.2

The next step is to display the feature names from the housing price dataframe that have missing (nan) values as shown below:


features_na = [feature for feature in ames_df.columns if ames_df[feature].isnull().sum() > 0]
features_na

The following illustration displays the list of all the feature names from the housing price dataframe with missing values:

Features NaN
Figure.3

The next step is to display the features and the count of their missing values from the housing price dataframe as shown below:


home_price_df[features_na].isnull().sum().sort_values()

The following illustration displays the list of the feature names along with the count of their missing value from the housing price dataframe:

Missing Count
Figure.4

The next step is to display the rows corresponding to the features Electrical, Garage.Area, Garage.Cars, and Total.Bsmt.SF with one missing value from the housing price dataframe as shown below:


home_price_df[home_price_df['Garage.Cars'].isnull() | home_price_df['Garage.Area'].isnull() | home_price_df['Total.Bsmt.SF'].isnull() | home_price_df['Electrical'].isnull()]

The following illustration displays the rows with one missing value from the housing price dataframe:

Rows Missing Value
Figure.5

Given that it is only 3 of the 2930 rows from the housing price data set with one missing value, it is okay to drop these rows.

The next step is drop the 3 rows with a missing value from the housing price dataframe as shown below:


one_na_index = home_price_df[home_price_df['Garage.Cars'].isnull() | home_price_df['Garage.Area'].isnull() | home_price_df['Total.Bsmt.SF'].isnull() | home_price_df['Electrical'].isnull()].index
home_price_df = home_price_df.drop(one_na_index, axis=0)

The next step is to re-evaluate the features with missing values and display the count of their missing values from the housing price dataframe as shown below:


features_na = [feature for feature in home_price_df.columns if home_price_df[feature].isnull().sum() > 0]
home_price_df[features_na].isnull().sum().sort_values()

The following illustration displays the list of the feature names along with the count of their missing value from the housing price dataframe:

Missing Count
Figure.6

The next step is to display the rows corresponding to the features Mas.Vnr.Type and Mas.Vnr.Area with 23 missing values from the housing price dataframe as shown below:


home_price_df[home_price_df['Mas.Vnr.Type'].isnull() | home_price_df['Mas.Vnr.Area'].isnull()]

The following illustration displays the rows with 23 missing values from the housing price dataframe:

Rows Missing Values
Figure.7

Looking at the data dictionary of the Ames Housing Price data set, the acceptable value for Mas.Vnr.Type is 'None' and the for Mas.Vnr.Area would be 0.

The next step is to replace the missing values for Mas.Vnr.Type and Mas.Vnr.Area from the housing price dataframe as shown below:


home_price_df['Mas.Vnr.Type'].fillna('None', inplace=True)
home_price_df['Mas.Vnr.Area'].fillna(0, inplace=True)

The next step is to once again to re-evaluate the features with missing values and display the count of their missing values from the housing price dataframe as shown below:


features_na = [feature for feature in home_price_df.columns if home_price_df[feature].isnull().sum() > 0]
home_price_df[features_na].isnull().sum().sort_values()

The following illustration displays the list of the feature names along with the count of their missing value from the housing price dataframe:

Missing Count
Figure.8

The next step is to display the rows corresponding to the features Bsmt.Qual, Bsmt.Cond, BsmtFin.Type.1, BsmtFin.Type.2, and Bsmt.Exposure with about 80+ missing values from the housing price dataframe as shown below:


home_price_df[home_price_df['Bsmt.Qual'].isnull() | home_price_df['Bsmt.Cond'].isnull() | home_price_df['BsmtFin.Type.1'].isnull() | home_price_df['BsmtFin.Type.2'].isnull() | home_price_df['Bsmt.Exposure'].isnull()]

The following illustration displays the rows with about 80+ missing values from the housing price dataframe:

Rows Missing Values
Figure.9

The row at index 444 has a value for BsmtFin.Type.1 but not for BsmtFin.Type.2. Looking at the data dictionary for the Ames Housing Price data set, we can infer we can use of 'GLQ' (for good living quarters).

The next step is to replace the missing value for BsmtFin.Type.2 at index 444 from the housing price dataframe as shown below:


home_price_df.loc[444, 'BsmtFin.Type.2'] = 'GLQ'

The next step is to replace the missing values for Bsmt.Qual, Bsmt.Cond, BsmtFin.Type.1, BsmtFin.Type.2, and Bsmt.Exposure from the housing price dataframe as shown below:


home_price_df['Bsmt.Qual'].fillna('NA', inplace=True)
home_price_df['Bsmt.Cond'].fillna('NA', inplace=True)
home_price_df['BsmtFin.Type.1'].fillna('NA', inplace=True)
home_price_df['BsmtFin.Type.2'].fillna('NA', inplace=True)
home_price_df['Bsmt.Exposure'].fillna('NA', inplace=True)

The next step is to once again to re-evaluate the features with missing values and display the count of their missing values from the housing price dataframe as shown below:


features_na = [feature for feature in home_price_df.columns if home_price_df[feature].isnull().sum() > 0]
home_price_df[features_na].isnull().sum().sort_values()

The following illustration displays the list of the feature names along with the count of their missing value from the housing price dataframe:

Missing Count
Figure.10

The next step is to display the rows corresponding to the features Garage.Type, Garage.Yr.Blt, Garage.Finish, Garage.Qual, and Garage.Cond with about 158 missing values from the housing price dataframe as shown below:


home_price_df[home_price_df['Garage.Type'].isnull() | home_price_df['Garage.Yr.Blt'].isnull() | home_price_df['Garage.Finish'].isnull() | home_price_df['Garage.Qual'].isnull() | home_price_df['Garage.Cond'].isnull()]

The following illustration displays the rows with about 158 missing values from the housing price dataframe:

Rows Missing Values
Figure.11

The row at index 1356 has a value for Garage.Type but not for Garage.Yr.Blt, Garage.Finish, Garage.Qual, and Garage.Cond.

The missing value for Garage.Yr.Blt can be determined by looking at the values for other features such as Year.Built and Year.Remod.Add.

The next step is to display the values corresponding to the features Year.Built, Year.Remod.Add, and Garage.Yr.Blt for the row at index 1356 from the housing price dataframe as shown below:


home_price_df.loc[[1356], ['Year.Built', 'Year.Remod.Add', 'Garage.Yr.Blt']]

The following illustration displays the values corresponding to the features Year.Built, Year.Remod.Add, and Garage.Yr.Blt at index 1356 from the housing price dataframe:

Feature Values 1356
Figure.12

The next step is to display a count plot for Garage.Finish to determine the most frequent value using the housing price data set as shown below:


sns.countplot(x='Garage.Finish', data=home_price_df)
plt.show()

The following illustration shows the count plot for Garage.Finish using the housing price data set:

Count Plot 1
Figure.13

Notice that the value of 'Unf' is the most frequent value for Garage.Finish.

The next step is to display a count plot for Garage.Qual to determine the most frequent value using the housing price data set as shown below:


sns.countplot(x='Garage.Qual', data=home_price_df)
plt.show()

The following illustration shows the count plot for Garage.Qual using the housing price data set:

Count Plot 2
Figure.14

Notice that the value of 'TA' is the most frequent value for Garage.Qual.

The next step is to display a count plot for Garage.Cond to determine the most frequent value using the housing price data set as shown below:


sns.countplot(x='Garage.Cond', data=home_price_df)
plt.show()

The following illustration shows the count plot for Garage.Cond using the housing price data set:

Count Plot 3
Figure.15

Notice that the value of 'TA' is the most frequent value for Garage.Cond.

The next step is to replace the missing values for Garage.Yr.Blt, Garage.Type, Garage.Finish, Garage.Qual, and Garage.Cond at index 1356 (from the information gathered above) in the housing price dataframe as shown below:


home_price_df.loc[1356, 'Garage.Yr.Blt'] = home_price_df.loc[1356, 'Year.Built']
home_price_df.loc[1356, 'Garage.Finish'] = 'Unf'
home_price_df.loc[1356, 'Garage.Qual'] = 'TA'
home_price_df.loc[1356, 'Garage.Cond'] = 'TA'

The next step is to replace the missing values for Garage.Yr.Blt, Garage.Type, Garage.Finish, Garage.Qual, and Garage.Cond from the housing price dataframe as shown below:


home_price_df['Garage.Yr.Blt'].fillna(0, inplace=True)
home_price_df['Garage.Type'].fillna('NA', inplace=True)
home_price_df['Garage.Finish'].fillna('NA', inplace=True)
home_price_df['Garage.Qual'].fillna('NA', inplace=True)
home_price_df['Garage.Cond'].fillna('NA', inplace=True)

The next step is to replace the missing values for Fireplace.Qu from the housing price dataframe as shown below:


home_price_df['Fireplace.Qu'].fillna('NA', inplace=True)

The next step is to display the count for all the missing values from the housing price dataframe as shown below:


home_price_df.isnull().sum()

The following illustration shows the count for all the missing values from the housing price data set:

Count Missing
Figure.16

At this point we have no missing values in the housing price data set.


Hands-on Demo

The following is the link to the Jupyter Notebooks that provides an hands-on demo for this article:


References

Machine Learning - Data Preparation - Part 1


© PolarSPARC