Feature Engineering – Introduction – Study Machine Learning

Feature engineering is one of the most important parts of machine learning. It is the process of creating and selecting the best features, using domain knowledge of the data, so that machine learning models work better. Doing feature engineering well is an art. Effective feature engineering can substantially improve model performance, and visualizing the data helps in analyzing the features.

Things to consider in feature engineering:

Handle imbalanced data
Handle missing data
Handle outliers
Feature extraction
Feature selection

This tutorial explains various feature engineering techniques with examples. Kaggle’s House Prices: Advanced Regression Techniques competition dataset is used for demonstration. Let’s load the data and analyze it.

In [1]: 
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

In [2]: 
df = pd.read_csv('train.csv')
df.shape
Out[2]: (1460, 81)

Let’s first find the categorical features and numeric features.

In [3]: 
target = df['SalePrice']
df.drop(['Id','SalePrice'],axis=1,inplace=True)
cat_feat = [col for col in df.columns if df[col].dtypes == 'object']
num_feat = [col for col in df.columns if df[col].dtypes != 'object']
print("Number of categorical features :",len(cat_feat))
print("Number of Numerical features   :",len(num_feat))

Out[3]: 
Number of categorical features : 43
Number of Numerical features   : 36
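The same split can also be sketched with pandas' `select_dtypes`. The small frame below is a made-up stand-in for the housing data (its values are assumptions, not rows from train.csv), so treat this as an illustration rather than the tutorial's own method.

```python
import pandas as pd

# Toy frame standing in for the housing data (values are made up).
toy = pd.DataFrame({
    "LotArea": [8450, 9600, 11250],
    "Street": ["Pave", "Pave", "Grvl"],
    "YearBuilt": [2003, 1976, 2001],
    "LotShape": ["Reg", "Reg", "IR1"],
})

# Columns with a numeric dtype.
num_cols = toy.select_dtypes(include="number").columns.tolist()
# Every remaining column is treated as categorical.
cat_cols = toy.select_dtypes(exclude="number").columns.tolist()

print("Numerical features  :", num_cols)
print("Categorical features:", cat_cols)
```

Selecting with `exclude="number"` does not depend on how string columns happen to be stored, which can make the split a little more robust than comparing against a specific dtype name.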

Let’s split the categorical features into nominal and ordinal categorical features.

In [4]: 
ord_cat_feature = ['LotShape','LandSlope','ExterQual','ExterCond','BsmtQual','BsmtCond','BsmtExposure','BsmtFinType1','BsmtFinType2','HeatingQC','KitchenQual','FireplaceQu','GarageFinish','GarageQual','GarageCond','PoolQC']
nom_cat_feat =  [e for e in cat_feat if e not in ord_cat_feature]
print("Number of nominal categorical features :",len(nom_cat_feat))
print("Number of ordinal categorical features :",len(ord_cat_feature))

Out[4]: 
Number of nominal categorical features : 27
Number of ordinal categorical features : 16
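The distinction matters because the two kinds of feature are usually encoded differently. As a minimal sketch on made-up rows: an ordinal column such as ExterQual can be mapped to integers that preserve its order, while a nominal column such as Street can be one-hot encoded. The integer scores in `quality_order` are an assumption chosen for illustration.

```python
import pandas as pd

# Toy rows (made up). The quality codes are ordered Ex > Gd > TA > Fa > Po;
# the integer scores assigned to them here are an assumption.
quality_order = {"Po": 1, "Fa": 2, "TA": 3, "Gd": 4, "Ex": 5}

toy = pd.DataFrame({
    "ExterQual": ["Gd", "TA", "Ex", "Fa"],   # ordinal: order carries meaning
    "Street": ["Pave", "Grvl", "Pave", "Pave"],  # nominal: no natural order
})

# Ordinal feature: map each category to its rank.
toy["ExterQual_ord"] = toy["ExterQual"].map(quality_order)

# Nominal feature: one indicator column per category.
street_dummies = pd.get_dummies(toy["Street"], prefix="Street")

print(toy)
print(street_dummies)
```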

Let’s count the missing values in each feature of the data. The Percentage column holds the fraction of rows that are missing.

In [5]: 
# Count and fraction of missing values per column
missing_df = pd.DataFrame(index=df.columns,columns=['Count','Percentage'])
for e in df.columns:
    # Use a single .loc[row, col] call; chained indexing may assign to a copy
    missing_df.loc[e,'Count'] = df[e].isna().sum()
    missing_df.loc[e,'Percentage'] = df[e].isna().sum()/df.shape[0]
missing_df.sort_values(by='Count',ascending=False,inplace=True)
missing_df.head()

Out[5]: 
            Count Percentage
PoolQC       1453   0.995205
MiscFeature  1406   0.963014
Alley        1369   0.937671
Fence        1179   0.807534
FireplaceQu   690   0.472603
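A table of this shape can also be built without an explicit loop. The sketch below uses a small made-up frame (an assumption, not the real data) and relies on `isna().sum()` for the counts and `isna().mean()` for the fraction of missing rows.

```python
import numpy as np
import pandas as pd

# Toy frame with some missing entries (values are made up).
toy = pd.DataFrame({
    "PoolQC": [np.nan, np.nan, np.nan, "Gd"],
    "Fence": [np.nan, "MnPrv", np.nan, "GdWo"],
    "LotArea": [8450, 9600, 11250, 9550],
})

missing = pd.DataFrame({
    "Count": toy.isna().sum(),        # number of missing values per column
    "Percentage": toy.isna().mean(),  # fraction of rows that are missing
}).sort_values(by="Count", ascending=False)

print(missing)
```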

Let’s plot the distribution of the target variable.

In [6]: 
target.plot.box()
Out[6]:

The box plot shows two houses with a SalePrice greater than 700000.
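Such points can also be listed numerically instead of being read off the plot. The sketch below uses a made-up price series (an assumption, not the real SalePrice values) and shows two checks: a direct threshold, and the 1.5 × IQR rule that matplotlib box plots use by default to place the upper whisker.

```python
import pandas as pd

# Toy prices (made up) standing in for the SalePrice series.
prices = pd.Series(
    [118000, 129000, 140000, 163000, 181500,
     200000, 214000, 250000, 710000, 720000],
    name="SalePrice",
)

# Direct threshold, as in the observation above.
above_threshold = prices[prices > 700000]

# IQR rule: values above Q3 + 1.5 * IQR are flagged as high outliers.
q1 = prices.quantile(0.25)
q3 = prices.quantile(0.75)
upper_fence = q3 + 1.5 * (q3 - q1)
flagged = prices[prices > upper_fence]

print(above_threshold)
print("Upper fence:", upper_fence)
print(flagged)
```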

. . .
