Feature engineering is one of the most important parts of machine learning. It is the process of creating and selecting the best features, using domain knowledge of the data, so that machine learning models work better. Doing it well is something of an art. Effective feature engineering can substantially improve a model’s performance, and visualizing the data helps in analyzing the features.
Things to consider in feature engineering:
- Handle imbalanced data
- Handle missing data
- Handle outliers
- Feature extraction
- Feature selection
This tutorial explains various feature engineering techniques with examples. The dataset from the Kaggle competition House Prices: Advanced Regression Techniques is used for demonstration. Let’s load the data and perform some analysis.
In [1]:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

In [2]:
df = pd.read_csv('train.csv')
df.shape

Out[2]:
(1460, 81)
Let’s first find the categorical features and numeric features.
In [3]:
target = df['SalePrice']
df.drop(['Id', 'SalePrice'], axis=1, inplace=True)
cat_feat = [col for col in df.columns if df[col].dtypes == 'object']
num_feat = [col for col in df.columns if df[col].dtypes != 'object']
print("Number of categorical features :", len(cat_feat))
print("Number of Numerical features :", len(num_feat))

Out[3]:
Number of categorical features : 43
Number of Numerical features : 36
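The same split can also be written with pandas’ `select_dtypes`, which avoids looping over the columns by hand. The sketch below is only an illustration on a small made-up frame (`toy` and its values are assumptions, not rows from the competition data); it treats every non-numeric column as categorical.

```python
import pandas as pd

# Toy frame standing in for the housing data (values are made up for illustration).
toy = pd.DataFrame({
    "MSZoning": ["RL", "RM", "RL"],
    "LotArea": [8450, 9600, 11250],
    "Street": ["Pave", "Pave", "Grvl"],
    "YearBuilt": [2003, 1976, 2001],
})

# Numeric columns are selected by dtype; everything else is treated as categorical.
toy_num_feat = toy.select_dtypes(include="number").columns.tolist()
toy_cat_feat = toy.select_dtypes(exclude="number").columns.tolist()

print("Categorical:", toy_cat_feat)
print("Numeric:", toy_num_feat)
```

On the real data, the two lists should match `cat_feat` and `num_feat` as long as every non-numeric column is a string column.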
Let’s split the categorical features into nominal and ordinal categorical features.
In [4]:
ord_cat_feature = ['LotShape', 'LandSlope', 'ExterQual', 'ExterCond',
                   'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1',
                   'BsmtFinType2', 'HeatingQC', 'KitchenQual', 'FireplaceQu',
                   'GarageFinish', 'GarageQual', 'GarageCond', 'PoolQC']
nom_cat_feat = [e for e in cat_feat if e not in ord_cat_feature]
print("Number of nominal categorical features :", len(nom_cat_feat))
print("Number of ordinal categorical features :", len(ord_cat_feature))

Out[4]:
Number of nominal categorical features : 27
Number of ordinal categorical features : 16
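The distinction matters because ordinal features carry a ranking that an encoding can preserve, while nominal features do not. As a rough sketch of what that could look like, the snippet below maps one ordinal column to integer codes with an ordered `pd.Categorical`. The quality scale in `quality_order` and the toy values are assumptions made for illustration; check the actual levels against the dataset’s data description before relying on them.

```python
import pandas as pd

# Assumed quality scale, worst to best (verify against the data description).
quality_order = ["Po", "Fa", "TA", "Gd", "Ex"]

# Toy column standing in for an ordinal feature such as ExterQual.
toy = pd.DataFrame({"ExterQual": ["Gd", "TA", "Ex", "Fa"]})

# An ordered categorical assigns each level its position in quality_order.
ordered = pd.Categorical(toy["ExterQual"], categories=quality_order, ordered=True)
toy["ExterQual_code"] = ordered.codes

print(toy)
```

Values that are missing or not listed in `quality_order` receive the code -1, so missing data should be handled before or after this step.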
Let’s find the missing values in each feature of the data.
In [5]:
missing_df = pd.DataFrame(index=df.columns, columns=['Count', 'Percentage'])
for e in df.columns:
    # Use a single .loc[row, column] call; chained indexing may not write to missing_df
    missing_df.loc[e, 'Count'] = df[e].isna().sum()
    # Stored as a fraction of rows (0 to 1), not multiplied by 100
    missing_df.loc[e, 'Percentage'] = df[e].isna().sum() / df.shape[0]
missing_df.sort_values(by='Count', ascending=False, inplace=True)
missing_df.head()

Out[5]:
             Count  Percentage
PoolQC        1453    0.995205
MiscFeature   1406    0.963014
Alley         1369    0.937671
Fence         1179    0.807534
FireplaceQu    690    0.472603
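The same summary can be computed without an explicit loop, since `isna().sum()` and `isna().mean()` work column by column. The sketch below uses a small made-up frame (`toy` is an assumption for illustration) so that it runs on its own.

```python
import numpy as np
import pandas as pd

# Toy frame with some missing values (made up for illustration).
toy = pd.DataFrame({
    "PoolQC": [np.nan, np.nan, np.nan, "Gd"],
    "Alley": [np.nan, "Grvl", np.nan, "Pave"],
    "LotArea": [8450, 9600, 11250, 9550],
})

# isna() gives a boolean frame; sum() counts the missing values per column,
# and mean() gives the fraction of rows that are missing.
missing_toy = pd.DataFrame({
    "Count": toy.isna().sum(),
    "Percentage": toy.isna().mean(),
}).sort_values(by="Count", ascending=False)

print(missing_toy)
```

Replacing `toy` with `df` should produce the same counts and fractions as the loop-based version.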
Let’s plot the distribution of the target variable.
In [6]:
target.plot.box()

Out[6]:
(box plot of SalePrice)
Two houses have a SalePrice greater than 700,000.
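Such points can also be listed numerically rather than read off the plot. The sketch below applies the 1.5 × IQR rule, which is the default whisker length of a matplotlib box plot, alongside a plain threshold filter. The prices in `toy_target` are made up for illustration; on the real data, `target` would take their place.

```python
import pandas as pd

# Toy prices standing in for SalePrice (made up for illustration).
toy_target = pd.Series(
    [120000, 135000, 150000, 165000, 180000, 200000, 720000],
    name="SalePrice",
)

# Upper whisker limit of a box plot with the default whisker length of 1.5 * IQR.
q1 = toy_target.quantile(0.25)
q3 = toy_target.quantile(0.75)
iqr = q3 - q1
upper_whisker_limit = q3 + 1.5 * iqr

# Points above the limit are the ones a box plot draws as individual markers.
above_limit = toy_target[toy_target > upper_whisker_limit]

# A fixed threshold, as in the observation about prices above 700,000.
above_700k = toy_target[toy_target > 700000]

print(above_limit)
print(above_700k)
```

Whether points flagged this way should be removed, capped, or kept is a modelling decision; the rule only identifies candidates.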
. . .