Target Encoding for Categorical Features

Target Encoding is an encoding technique for categorical features that makes use of the target feature. Target encoding methods encode a categorical feature in a meaningful way by using the target, whereas the Label Encoder encodes a categorical feature simply by assigning a unique number to each label, with no specific meaning.
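
As a quick illustration of the difference, here is a minimal sketch on a made-up toy column (not taken from any of the datasets used below):

import pandas as pd

df = pd.DataFrame({'color': ['Red', 'Green', 'Red', 'Blue'],
                   'target': [1, 0, 0, 1]})

# Label Encoding: arbitrary unique integers, unrelated to the target
df['color_label'] = df['color'].astype('category').cat.codes

# Target Encoding: each label is replaced by the mean target of its group
df['color_target'] = df['color'].map(df.groupby('color')['target'].mean())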

Here, I have explained the different methods for Target Encoding with examples in Python.

.     .     .

Regression: 

We can use statistical features for target encoding in regression problems. Here, I have used the Kaggle dataset House Prices: Advanced Regression Techniques for demonstration. The task is to predict a house's sale price from the features of the house. Four categorical features are considered and encoded using the target encoding approach.

Example:

for e_col in ['Neighborhood', 'LotShape', 'MSSubClass', 'MSZoning']:
    # Mean SalePrice per category, learned from the training data
    means = train.groupby(e_col).SalePrice.mean()
    train[e_col + '_target_encoding'] = train[e_col].map(means)
    test[e_col + '_target_encoding'] = test[e_col].map(means)

The statistical feature used in the above example is the mean, but other statistics, such as the median or standard deviation, can be used as well; a sketch follows below.
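
A minimal sketch (my own variation on the snippet above, assuming the same train and test DataFrames) using the median and standard deviation of SalePrice per category:

for e_col in ['Neighborhood', 'LotShape', 'MSSubClass', 'MSZoning']:
    medians = train.groupby(e_col).SalePrice.median()
    stds = train.groupby(e_col).SalePrice.std()
    train[e_col + '_target_median'] = train[e_col].map(medians)
    train[e_col + '_target_std'] = train[e_col].map(stds)
    test[e_col + '_target_median'] = test[e_col].map(medians)
    test[e_col + '_target_std'] = test[e_col].map(stds)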

.     .     .

Classification: 

There are many methods to encode a categorical feature using the target feature. A specific method will not necessarily work for every categorical feature; it is up to you to select an appropriate encoding method for a particular categorical feature. Here, I have listed four different encoding methods. You can also create your own encoding method using the target, according to your problem definition.

Here, I have used the Kaggle dataset Categorical Feature Encoding Challenge for demonstration.

Target values: 0 and 1

import numpy as np

nom_0_mean, nom_0_weight, nom_0_count, nom_0_diff = {}, {}, {}, {}

for e in train['nom_0'].unique():

    # Number of rows of this label with target 1 and with target 0
    no_of_1 = len(train.loc[(train['nom_0'] == e) & (train['target'] == 1)])
    no_of_0 = len(train.loc[(train['nom_0'] == e) & (train['target'] == 0)])

    # 1. Mean encoding: fraction of target 1 within the label
    mean = no_of_1 / (no_of_1 + no_of_0)
    nom_0_mean[e] = round(mean, 3)

    # 2. Weight encoding: log of the ratio of 1s to 0s, scaled by 100
    weight = np.log(no_of_1 / no_of_0) * 100
    nom_0_weight[e] = round(weight, 3)

    # 3. Count encoding: number of rows with target 1
    nom_0_count[e] = no_of_1

    # 4. Difference encoding: count of 1s minus count of 0s
    nom_0_diff[e] = no_of_1 - no_of_0

train['nom_0_mean'] = train['nom_0'].map(nom_0_mean)
train['nom_0_weight'] = train['nom_0'].map(nom_0_weight)
train['nom_0_count'] = train['nom_0'].map(nom_0_count)
train['nom_0_diff'] = train['nom_0'].map(nom_0_diff)
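
If a held-out test set is available, the same dictionaries learned from the training data can be reused. A minimal sketch, assuming a test DataFrame with the same nom_0 column (not shown in the original snippet):

test['nom_0_mean'] = test['nom_0'].map(nom_0_mean)
test['nom_0_weight'] = test['nom_0'].map(nom_0_weight)
test['nom_0_count'] = test['nom_0'].map(nom_0_count)
test['nom_0_diff'] = test['nom_0'].map(nom_0_diff)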

 

Here, the count method will not work well in this particular case, because the count is the same for the Green and Red labels, so both labels are mapped to the same encoded value.
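
A quick way to inspect such collisions on the training data (a minimal sketch):

# Labels that share the same count become indistinguishable after encoding
print(nom_0_count)
print('distinct encoded values:', len(set(nom_0_count.values())),
      'vs number of labels:', train['nom_0'].nunique())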

.     .     .
