In machine learning projects, it is very common to have categorical features in the data. However, machine learning algorithms generally work only with numbers, so categorical features must be encoded as numeric features before they are passed to an algorithm. Many encoding methods exist for this purpose.
Categorical Data: Nominal, Ordinal and Cyclical
Nominal: A categorical feature whose values are labels with no inherent order is called a nominal feature.
Example.
Ordinal: A categorical feature whose values have a meaningful order is called an ordinal feature.
Example: grade in class 10 (first class, second class and third class)
Cyclical: A categorical feature whose values repeat in a fixed cycle is called a cyclical feature.
Example: day, hour, month and season
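To make the nominal/ordinal distinction concrete, here is a minimal sketch using pandas' ordered categoricals. The grade values come from the example above; the ranking (third class lowest, first class highest) is an assumption made for illustration.

```python
import pandas as pd

grades = pd.Series(["second class", "first class", "third class"])

# Assumed ranking, lowest to highest (not specified in the text above)
order = ["third class", "second class", "first class"]

# An ordered categorical keeps the order, so the integer codes are meaningful
ordinal = pd.Categorical(grades, categories=order, ordered=True)
print(ordinal.codes.tolist())  # [1, 2, 0]
```

For a nominal feature such codes would carry no meaning, which is why the techniques below treat order differently.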
Encoding Techniques:
1) Label Encoder: Encodes a categorical feature as integers assigned in alphabetical (sorted) order of its values, starting from 0.
Encoding : [Red, Black, Yellow, Green] → [2, 0, 3, 1]
from sklearn.preprocessing import LabelEncoder
lbl = LabelEncoder()
df['Color'] = lbl.fit_transform(df['Color'])
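As a quick check of the alphabetical ordering, the sketch below fits a LabelEncoder on a toy color list (made up for illustration) and prints the learned mapping. Note that the scikit-learn documentation describes LabelEncoder as intended for target labels; for input features, OrdinalEncoder is the documented counterpart.

```python
from sklearn.preprocessing import LabelEncoder

colors = ["Red", "Black", "Yellow", "Green"]  # toy data, assumed for illustration

lbl = LabelEncoder()
codes = lbl.fit_transform(colors)

# classes_ holds the sorted unique values; each value's index is its code
mapping = {str(c): i for i, c in enumerate(lbl.classes_)}
print(mapping)         # {'Black': 0, 'Green': 1, 'Red': 2, 'Yellow': 3}
print(codes.tolist())  # [2, 0, 3, 1]
```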
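2) Pandas' factorize: Encodes the categorical feature in order of first appearance.
Encoding : [Red, Black, Yellow, Green] → [0, 1, 2, 3]
import pandas as pd
# labels holds the integer codes, unique holds the distinct values in order of appearance
labels, unique = pd.factorize(df['Color'])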
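The order-of-appearance behaviour is easiest to see with a repeated value. This is a minimal sketch on a toy DataFrame (the data is assumed for illustration):

```python
import pandas as pd

df = pd.DataFrame({"Color": ["Red", "Black", "Yellow", "Green", "Red"]})

codes, uniques = pd.factorize(df["Color"])
print(codes.tolist())  # [0, 1, 2, 3, 0]
print(list(uniques))   # ['Red', 'Black', 'Yellow', 'Green']
```

The second "Red" reuses code 0 because codes are assigned the first time a value is seen. Passing `sort=True` to `pd.factorize` would instead assign codes in sorted order.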
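3) Frequency Encoding: Encodes the categorical feature by mapping each value to its relative frequency in the data. This preserves information about the distribution of the values.
# relative frequency of each category
encoding = df.groupby('Color').size()
encoding = encoding / len(df)
# replace each value with its frequency
df['Color_freq'] = df['Color'].map(encoding)
Encoding : [Red, Green, Yellow] → [0.50, 0.33, 0.17]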
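The frequencies above correspond to a column with three Red, two Green and one Yellow value. The sketch below reproduces them on such a toy DataFrame; the data and the `Color_freq` column name are assumptions for illustration.

```python
import pandas as pd

df = pd.DataFrame({"Color": ["Red", "Red", "Red", "Green", "Green", "Yellow"]})

# Relative frequency of each category
freq = df.groupby("Color").size() / len(df)

# Map every row's category to its frequency
df["Color_freq"] = df["Color"].map(freq)
print(df["Color_freq"].round(2).tolist())
```

One caveat of this encoding: two categories that occur equally often receive the same value and can no longer be told apart.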
Frequency Encoding Using Rank:
from scipy.stats import rankdata
# rank the per-row frequencies (not the raw strings); ties get the average rank
rankdata(df['Color'].map(encoding))
Encoding : [Red, Green, Yellow] → [5, 2.5, 1]
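The ranks 5, 2.5 and 1 arise when the per-row frequencies are ranked with the default average-rank method of `rankdata`. Here is a minimal sketch, assuming the same toy data of three Red, two Green and one Yellow row (the `Color_rank` column name is hypothetical):

```python
import pandas as pd
from scipy.stats import rankdata

df = pd.DataFrame({"Color": ["Red", "Red", "Red", "Green", "Green", "Yellow"]})

# Per-row relative frequency of each category
freq = df["Color"].map(df["Color"].value_counts(normalize=True))

# Tied rows share the average of the ranks they span
df["Color_rank"] = rankdata(freq)
print(df["Color_rank"].tolist())  # [5.0, 5.0, 5.0, 2.5, 2.5, 1.0]
```

The three Red rows occupy ranks 4, 5 and 6 and therefore share rank 5; the two Green rows share (2 + 3) / 2 = 2.5.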
4) One-hot Encoding: Encodes each unique value of a categorical feature as a new column whose value is 1 for rows with that value and 0 otherwise.
from sklearn.preprocessing import OneHotEncoder
ohe = OneHotEncoder()
# OneHotEncoder expects 2D input, so select the column as a DataFrame
ohe_data = ohe.fit_transform(df[['Color']])  # sparse matrix by default
# using Pandas' get_dummies
oht_df = pd.get_dummies(df['Color'])
5) Cyclical feature encoding: Features such as day, month and season are cyclical by nature: their values repeat in a fixed cycle, so the last value is adjacent to the first. Sine and cosine transformations are used to encode such features.
import numpy as np
# 7.0 is the cycle length (days per week)
df['day_sin'] = np.sin(2 * np.pi * df['day_of_week'] / 7.0)
df['day_cos'] = np.cos(2 * np.pi * df['day_of_week'] / 7.0)
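To illustrate why the sine/cosine pair helps, the sketch below compares distances between encoded days. It assumes days are numbered 0 (Monday) to 6 (Sunday), a convention chosen here for illustration.

```python
import numpy as np
import pandas as pd

# Assumed convention: 0 = Monday ... 6 = Sunday
df = pd.DataFrame({"day_of_week": np.arange(7)})

angle = 2 * np.pi * df["day_of_week"] / 7.0
df["day_sin"] = np.sin(angle)
df["day_cos"] = np.cos(angle)

points = df[["day_sin", "day_cos"]].to_numpy()

# Sunday (6) and Monday (0) are neighbours on the circle
dist_sun_mon = np.linalg.norm(points[6] - points[0])
# Monday (0) and Thursday (3) are far apart
dist_mon_thu = np.linalg.norm(points[3] - points[0])

print(dist_sun_mon < dist_mon_thu)  # True
```

With the raw integers, Sunday and Monday would be 6 apart, the largest possible gap. On the circle they are as close as any other pair of consecutive days.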