Feature Engineering – Introduction

Feature engineering is one of the most important parts of a machine learning workflow. It is the process of creating and selecting the most useful features, using domain knowledge of the data, so that machine learning models perform better. Doing it well is as much an art as a science. Effective feature engineering can substantially improve model performance, and visualizing the data helps in analyzing the features.

Things to consider in feature engineering:

  • Handle imbalanced data
  • Handle missing data
  • Handle outliers
  • Feature extraction
  • Feature selection

 

This tutorial explains various feature engineering techniques with examples. The dataset from the Kaggle competition House Prices: Advanced Regression Techniques is used for demonstration. Let’s load the data and take a first look.

In [1]: 
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

In [2]: 
df = pd.read_csv('train.csv')
df.shape
Out[2]: (1460, 81)

Let’s first separate the categorical features from the numeric features.

In [3]: 
target = df['SalePrice']
df.drop(['Id','SalePrice'],axis=1,inplace=True)
cat_feat = [col for col in df.columns if df[col].dtypes == 'object']
num_feat = [col for col in df.columns if df[col].dtypes != 'object']
print("Number of categorical features :",len(cat_feat))
print("Number of Numerical features   :",len(num_feat))

Out[3]: 
Number of categorical features : 43
Number of Numerical features   : 36
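The same split can also be sketched with pandas' `select_dtypes`, which filters columns by dtype in one call. The frame `toy` and the names `toy_cat_feat` and `toy_num_feat` below are hypothetical and the values are made up; this is only an illustration of the idea, not the tutorial's own code.

```python
import pandas as pd

# Toy frame standing in for the housing data; the values are made up.
toy = pd.DataFrame({
    "MSZoning": ["RL", "RM", "RL"],
    "LotArea": [8450, 9600, 11250],
    "Street": ["Pave", "Pave", "Grvl"],
    "YearBuilt": [2003, 1976, 2001],
})

# Numeric columns: every integer or float dtype.
toy_num_feat = toy.select_dtypes(include="number").columns.tolist()

# Treat everything that is not numeric as categorical. This assumes the
# frame has no datetime or boolean columns that need separate handling.
toy_cat_feat = toy.select_dtypes(exclude="number").columns.tolist()

print("Categorical:", toy_cat_feat)
print("Numeric    :", toy_num_feat)
```

Selecting by "not numeric" rather than by `object` avoids depending on how a given pandas version stores strings.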

Let’s split the categorical features into nominal features (no natural order) and ordinal features (ordered levels).

In [4]: 
ord_cat_feature = ['LotShape','LandSlope','ExterQual','ExterCond','BsmtQual','BsmtCond','BsmtExposure','BsmtFinType1','BsmtFinType2','HeatingQC','KitchenQual','FireplaceQu','GarageFinish','GarageQual','GarageCond','PoolQC']
nom_cat_feat =  [e for e in cat_feat if e not in ord_cat_feature]
print("Number of nominal categorical features :",len(nom_cat_feat))
print("Number of ordinal categorical features :",len(ord_cat_feature))

Out[4]: 
Number of nominal categorical features : 27
Number of ordinal categorical features : 16
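The split matters because the two groups are usually encoded differently. A minimal sketch follows, assuming the quality levels run Po < Fa < TA < Gd < Ex (check the dataset's data description before relying on this ordering). The frame `sample` and the dictionary `quality_order` are hypothetical names, and the values are illustrative.

```python
import pandas as pd

# Hypothetical sample; values are illustrative, not taken from train.csv.
sample = pd.DataFrame({
    "ExterQual": ["Gd", "TA", "Ex", "Fa"],
    "Street": ["Pave", "Grvl", "Pave", "Pave"],
})

# Assumed quality ordering, worst to best: Po < Fa < TA < Gd < Ex.
quality_order = {"Po": 1, "Fa": 2, "TA": 3, "Gd": 4, "Ex": 5}

# Ordinal feature: map each level to its rank so the order is preserved.
sample["ExterQual_enc"] = sample["ExterQual"].map(quality_order)

# Nominal feature: one indicator column per category, no order implied.
street_dummies = pd.get_dummies(sample["Street"], prefix="Street")

encoded = pd.concat([sample[["ExterQual_enc"]], street_dummies], axis=1)
print(encoded)
```

Mapping an ordinal feature to integers keeps the ranking visible to the model, while one-hot encoding a nominal feature avoids inventing an order that does not exist.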

Let’s count the missing values in each feature. The Percentage column below holds the fraction of rows that are missing, on a scale from 0 to 1.

In [5]: 
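# Count missing values per feature and their share of all rows.
missing_df = pd.DataFrame(index=df.columns,columns=['Count','Percentage'])
for e in df.columns:
    # Use a single .loc[row, column] call; chained indexing may assign to a copy.
    missing_df.loc[e,'Count'] = df[e].isna().sum()
    missing_df.loc[e,'Percentage'] = df[e].isna().sum()/df.shape[0]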
missing_df.sort_values(by='Count',ascending=False,inplace=True)
missing_df.head()

Out[5]: 
            Count Percentage
PoolQC       1453   0.995205
MiscFeature  1406   0.963014
Alley        1369   0.937671
Fence        1179   0.807534
FireplaceQu   690   0.472603
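The same kind of summary can be built without a loop. The sketch below uses a hypothetical helper `missing_summary` and a made-up frame `demo`; it relies on the fact that the mean of a boolean `isna()` mask equals the fraction of missing rows.

```python
import numpy as np
import pandas as pd

def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return per-column missing counts and fractions, sorted by count."""
    summary = pd.DataFrame({
        "Count": frame.isna().sum(),
        "Percentage": frame.isna().mean(),  # fraction in [0, 1]
    })
    return summary.sort_values(by="Count", ascending=False)

# Made-up frame for illustration only.
demo = pd.DataFrame({
    "a": [1.0, np.nan, np.nan, 4.0],
    "b": ["x", None, "y", "z"],
    "c": [1, 2, 3, 4],
})

result = missing_summary(demo)
print(result)
```

Applied to the tutorial's data, this would be called as `missing_summary(df).head()`.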

Let’s plot the distribution of the target variable.

In [6]: 
target.plot.box()
Out[6]: 
[Box plot of SalePrice]

There are two houses which have SalePrice greater than 700000.
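Points like these can also be flagged numerically. A box plot with default settings draws its whiskers at 1.5 times the interquartile range (IQR), and the sketch below applies that same rule. The helper `iqr_outliers` and the series `prices` are hypothetical, and the prices are made up rather than taken from train.csv.

```python
import pandas as pd

def iqr_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return the entries lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    mask = (values < q1 - k * iqr) | (values > q3 + k * iqr)
    return values[mask]

# Made-up prices for illustration; not taken from train.csv.
prices = pd.Series([120000, 135000, 150000, 160000,
                    175000, 190000, 210000, 750000])

flagged = iqr_outliers(prices)
print(flagged)

# A fixed threshold, as used in the text, is a simpler alternative.
n_above = int((prices > 700000).sum())
print("Prices above 700000:", n_above)
```

With the tutorial's data, `(target > 700000).sum()` would be the direct way to check the count stated above.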

.     .     .
