Different Label Encoding Methods for Categorical Features

In machine learning projects, it is very common to have categorical features in the data. However, machine learning algorithms work with numbers, so categorical features must be encoded as numeric features before they are passed to an algorithm. Many encoding methods exist, and the right one depends on the type of categorical feature.

Categorical Data:  Nominal, Ordinal and Cyclical

Nominal: A categorical feature whose values are labels with no inherent order is called a nominal feature.

Example: color (Red, Black, Yellow, Green)

Ordinal: A categorical feature whose values have a natural order is called an ordinal feature.

Example: class 10 grade (first class, second class and third class)

Cyclical: A categorical feature whose values repeat in a fixed cycle is called a cyclical feature.

Example: day, hour, month and season

Encoding Techniques:

1) Label Encoder: Encodes the categorical feature as integers assigned in alphabetical order of the labels, starting from 0.

Encoding : [Red,Black,Yellow,Green]  → [2 , 0 , 3 , 1]

from sklearn.preprocessing import LabelEncoder
lbl = LabelEncoder()
df['Color'] = lbl.fit_transform(df['Color'])
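
To make the alphabetical ordering concrete, here is a minimal sketch on a toy DataFrame (the four-row data is an assumption made for illustration). It reads the learned mapping from the encoder's `classes_` attribute and reverses the encoding with `inverse_transform`.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy data, assumed for illustration; the column name 'Color' follows the snippet above.
df = pd.DataFrame({'Color': ['Red', 'Black', 'Yellow', 'Green']})

lbl = LabelEncoder()
codes = lbl.fit_transform(df['Color'])

# classes_ holds the sorted unique labels; a label's position in it is its code.
mapping = {str(label): int(code) for code, label in enumerate(lbl.classes_)}
encoded = codes.tolist()

# inverse_transform recovers the original labels from the integer codes.
decoded = lbl.inverse_transform(codes).tolist()

print(mapping)
print(encoded)
print(decoded)
```

Because the integer codes follow alphabetical order rather than any meaning in the data, a model may read them as an ordering that does not exist, which is worth keeping in mind for nominal features.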

2) Pandas' factorize: Encodes the categorical feature as integers in order of first appearance.

Encoding : [Red,Black,Yellow,Green]  → [0 , 1 , 2 , 3]

import pandas as pd
labels, uniques = pd.factorize(df['Color'])  # labels: integer codes, uniques: the distinct values
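
The order-of-appearance behaviour is easier to see when values repeat. The sketch below uses a small assumed DataFrame and also shows the `sort=True` option of `pd.factorize`, which assigns codes by sorted order instead.

```python
import pandas as pd

# Toy data with repeated values, assumed for illustration.
df = pd.DataFrame({'Color': ['Red', 'Black', 'Red', 'Yellow', 'Green', 'Black']})

# Codes are assigned in order of first appearance: Red=0, Black=1, Yellow=2, Green=3.
labels, uniques = pd.factorize(df['Color'])
labels_list = labels.tolist()
uniques_list = uniques.tolist()

# With sort=True the unique values are sorted first, so codes follow alphabetical order.
sorted_labels, sorted_uniques = pd.factorize(df['Color'], sort=True)
sorted_labels_list = sorted_labels.tolist()
sorted_uniques_list = sorted_uniques.tolist()

print(labels_list, uniques_list)
print(sorted_labels_list, sorted_uniques_list)
```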

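3) Frequency Encoding: Encodes the categorical feature by mapping each value to its relative frequency in the data. This preserves information about the distribution of the values.

encoding = df.groupby('Color').size()
encoding = encoding / len(df)
df['Color_freq'] = df['Color'].map(encoding)

Encoding : [Red,Green,Yellow]   →  [0.50,0.33,0.17]   (for a column with 3 Red, 2 Green and 1 Yellow values)

Frequency Encoding Using Rank: Ranking the frequency-encoded column replaces each frequency with its rank. Tied rows share the average of their ranks.

from scipy.stats import rankdata
df['Color_rank'] = rankdata(df['Color_freq'])

Encoding : [Red,Green,Yellow]   →  [ 5, 2.5 , 1]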
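
As a worked example, the sketch below assumes a six-row column with 3 Red, 2 Green and 1 Yellow values, which matches the frequencies and ranks quoted above. The column names `Color_freq` and `Color_rank` are hypothetical names chosen for this illustration.

```python
import pandas as pd
from scipy.stats import rankdata

# Toy data, assumed for illustration: 3 Red, 2 Green, 1 Yellow.
df = pd.DataFrame({'Color': ['Red', 'Red', 'Red', 'Green', 'Green', 'Yellow']})

# Relative frequency of each category.
encoding = df.groupby('Color').size() / len(df)

# Map every row to the frequency of its category.
df['Color_freq'] = df['Color'].map(encoding)

# Rank the per-row frequencies; rankdata's default method gives ties their average rank.
df['Color_rank'] = rankdata(df['Color_freq'])

print(df)
```

The three Red rows occupy ranks 4, 5 and 6 and so share the average 5, the two Green rows share 2.5, and the single Yellow row gets rank 1.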

4) One-hot Encoding: Encodes each unique value of a categorical feature as a new column that holds 1 where the row has that value and 0 otherwise.

from sklearn.preprocessing import OneHotEncoder
ohe = OneHotEncoder()
# OneHotEncoder expects 2-D input, hence the double brackets; the result is a sparse matrix
ohe_data = ohe.fit_transform(df[['Color']])
# using pandas' get_dummies
oht_df = pd.get_dummies(df['Color'])
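
The following sketch, on an assumed four-row DataFrame, shows that both approaches produce the same indicator columns. `OneHotEncoder` returns a sparse matrix by default, so it is converted with `.toarray()` for inspection.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy data, assumed for illustration.
df = pd.DataFrame({'Color': ['Red', 'Black', 'Yellow', 'Green']})

# scikit-learn: pass a 2-D selection (double brackets) and densify the sparse result.
ohe = OneHotEncoder()
ohe_data = ohe.fit_transform(df[['Color']]).toarray()
categories = ohe.categories_[0].tolist()  # column order of the encoded matrix

# pandas: one indicator column per unique value, named after the value.
dummies = pd.get_dummies(df['Color'])

print(categories)
print(ohe_data)
print(dummies)
```

`get_dummies` is convenient for quick exploration, while a fitted `OneHotEncoder` can be reused to transform new data with the same column layout.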

5) Cyclical feature encoding: Features such as day, month and season are cyclical by nature: their values repeat in a fixed cycle, so the last value is adjacent to the first. Sine and cosine transformations are used to encode a cyclical feature so that this adjacency is preserved.

import numpy as np
df['day_sin'] = np.sin(2 * np.pi * df['day_of_week']/7.0)
df['day_cos'] = np.cos(2 * np.pi * df['day_of_week']/7.0)
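
To illustrate why two columns are used, the sketch below assumes `day_of_week` takes integer values 0 to 6 and measures distances between days in the (sin, cos) plane. The helper `distance` is a hypothetical function written for this example.

```python
import numpy as np
import pandas as pd

# Toy data, assuming days are numbered 0..6.
df = pd.DataFrame({'day_of_week': [0, 1, 2, 3, 4, 5, 6]})

angle = 2 * np.pi * df['day_of_week'] / 7.0
df['day_sin'] = np.sin(angle)
df['day_cos'] = np.cos(angle)

# Each day becomes a point on the unit circle.
points = df[['day_sin', 'day_cos']].to_numpy()

def distance(a, b):
    """Euclidean distance between days a and b in (sin, cos) space."""
    return float(np.linalg.norm(points[a] - points[b]))

print(distance(6, 0))  # last day to first day
print(distance(0, 1))  # first day to second day
print(distance(0, 3))  # days further apart in the week
```

Day 6 ends up exactly as close to day 0 as day 1 is, whereas a plain integer encoding would place days 0 and 6 at opposite ends of the scale.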
