Let’s create a Pandas DataFrame that contains features with distinct values.
import pandas as pd
import numpy as np

data = {'Student_Id': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
        'Name': ['Mark', 'Juli', 'Alexa', 'Kevin', 'John', 'Devid', 'Mark', 'Michael', 'Johnson', 'Kevin'],
        'Age': [27, 31, 45, np.nan, 34, 48, np.nan, 31, np.nan, 27],
        'Location': ['USA', 'UK', np.nan, 'France', 'Germany', 'USA', 'Germany', np.nan, 'USA', 'Italy']}

df = pd.DataFrame(data)
df.head(10)
Output:
   Student_Id     Name   Age  Location
0           1     Mark  27.0       USA
1           2     Juli  31.0        UK
2           3    Alexa  45.0       NaN
3           4    Kevin   NaN    France
4           5     John  34.0   Germany
5           6    Devid  48.0       USA
6           7     Mark   NaN   Germany
7           8  Michael  31.0       NaN
8           9  Johnson   NaN       USA
9          10    Kevin  27.0     Italy
Here, the Student_Id column contains only distinct values: every row has a different value. A feature like this is usually not useful for predicting the target variable, because it merely identifies each row and carries no pattern that generalizes across rows. It is therefore better to remove features of this kind.
# Function to return the distinct-value columns of a given DataFrame
def remove_distinct_value_features(df):
    return [e for e in df.columns if df[e].nunique() == df.shape[0]]
drop_col = remove_distinct_value_features(df)
drop_col
Output:
['Student_Id']
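The same check can also be written without an explicit loop. The following is a minimal sketch on a small made-up DataFrame (the data and the variable names `distinct_counts` and `all_distinct_cols` are assumptions for illustration, not part of the example above). It also shows that `nunique()` ignores NaN by default, so a column with missing values is not counted as all-distinct unless you pass `dropna=False`.

```python
import numpy as np
import pandas as pd

# Small illustrative DataFrame (hypothetical data).
df = pd.DataFrame({
    'Student_Id': [1, 2, 3, 4],
    'Name': ['Mark', 'Juli', 'Mark', 'Kevin'],
    'Age': [27, np.nan, 45, 31],
})

# Number of distinct values per column; NaN is excluded by default (dropna=True).
distinct_counts = df.nunique()

# Columns whose distinct-value count equals the number of rows.
all_distinct_cols = distinct_counts[distinct_counts == len(df)].index.tolist()

# With dropna=False, NaN counts as a value of its own,
# so 'Age' would then also look like an all-distinct column.
age_count_with_nan = df.nunique(dropna=False)['Age']
```

Whether NaN should count as a distinct value depends on your data, so it is worth checking which behaviour you want before dropping anything.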
Let’s remove the distinct-value columns and create a new DataFrame.
# Create a new DataFrame without the distinct-value columns
new_df_columns = [e for e in df.columns if e not in drop_col]
new_df = df[new_df_columns]
new_df
Output:
      Name   Age  Location
0     Mark  27.0       USA
1     Juli  31.0        UK
2    Alexa  45.0       NaN
3    Kevin   NaN    France
4     John  34.0   Germany
5    Devid  48.0       USA
6     Mark   NaN   Germany
7  Michael  31.0       NaN
8  Johnson   NaN       USA
9    Kevin  27.0     Italy
You can also remove columns using Pandas’ df.drop().
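# Option 1: drop the columns in place (this modifies df itself).
df.drop(drop_col, axis=1, inplace=True)

# Option 2: create a new DataFrame and leave the original unchanged
# (inplace=False is the default). Use this instead of Option 1, not after it:
# once the columns are dropped in place, dropping them again raises a KeyError.
new_df = df.drop(drop_col, axis=1)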
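As a small variation, `drop()` also accepts a `columns=` keyword in place of `axis=1`, and `errors='ignore'` to skip labels that are no longer present. Here is a minimal sketch on a made-up DataFrame (the data and the name `same_df` are assumptions for illustration):

```python
import pandas as pd

# Small illustrative DataFrame (hypothetical data).
df = pd.DataFrame({'Student_Id': [1, 2, 3], 'Name': ['Mark', 'Juli', 'Alexa']})
drop_col = ['Student_Id']

# columns= is equivalent to passing the labels with axis=1.
new_df = df.drop(columns=drop_col)

# 'Student_Id' is already gone from new_df; errors='ignore' skips the
# missing label instead of raising a KeyError.
same_df = new_df.drop(columns=drop_col, errors='ignore')
```

This can be handy when the same clean-up cell may run more than once in a notebook.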
. . .