In this tutorial, you will learn how to count and list the unique values in a Pandas DataFrame. Real-life datasets often contain duplicate values.
Let’s create a Pandas DataFrame that contains duplicate values.
import pandas as pd
import numpy as np

data = {'Id': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
        'Name': ['Mark', 'Juli', 'Alexa', 'Kevin', 'John', 'Devid', 'Mark', 'Michael', 'Johnson', 'Kevin'],
        'Age': [27, 31, 45, np.nan, 34, 48, np.nan, 31, np.nan, 27],
        'Location': ['USA', 'UK', np.nan, 'France', 'Germany', 'USA', 'Germany', np.nan, 'USA', 'Italy']}

df = pd.DataFrame(data)
df.head(10)
Output:
   Id     Name   Age Location
0   1     Mark  27.0      USA
1   2     Juli  31.0       UK
2   3    Alexa  45.0      NaN
3   4    Kevin   NaN   France
4   5     John  34.0  Germany
5   6    Devid  48.0      USA
6   7     Mark   NaN  Germany
7   8  Michael  31.0      NaN
8   9  Johnson   NaN      USA
9  10    Kevin  27.0    Italy
Count Unique Values
Pandas provides the DataFrame.nunique() method, which counts the distinct values along the requested axis.
DataFrame.nunique(self, axis=0, dropna=True)
Parameters
axis : {0 or ‘index’, 1 or ‘columns’}, default 0
dropna : bool, default True (Don’t include NaN in the counts.)
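The effect of the two parameters is easiest to see on a tiny frame. The snippet below is a minimal sketch; the `sample` DataFrame and its values are made up for illustration and are not part of the tutorial's dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical three-row frame, used only to illustrate the parameters
sample = pd.DataFrame({
    'a': [1, 1, np.nan],
    'b': [1, 2, np.nan],
})

# axis=0 (default): one count per column, NaN ignored
per_column = sample.nunique()

# axis=1: one count per row, NaN ignored
per_row = sample.nunique(axis=1)

# dropna=False: NaN is counted as a distinct value
with_nan = sample.nunique(dropna=False)

print(per_column)
print(per_row)
print(with_nan)
```

With the default settings, column `a` has one distinct value and column `b` has two; the last row contains only NaN, so its row-wise count is 0 unless `dropna=False` is passed.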
Let’s define a function that reports, for each column in a DataFrame, the number of non-null values and the number of unique values.
# Function to count the unique values for each column in a DataFrame
def count_unique_values(data):
    # Count total number of non-null values
    total = data.count()
    temp = pd.DataFrame(total)
    temp.columns = ['Total']

    # Get the number of unique values for each column
    uniques = []
    for col in data.columns:
        unique = data[col].nunique()
        uniques.append(unique)
    temp['Uniques'] = uniques

    # Transpose so that each original column becomes a column of the summary
    return np.transpose(temp)
count_unique_values(df)
Output:
         Id  Name  Age  Location
Total    10    10    7         8
Uniques  10     8    5         5
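The same summary can also be built without an explicit loop, because `DataFrame.count()` and `DataFrame.nunique()` both return one value per column. The following is a sketch of that alternative, not a replacement for the function above; the name `summary` is chosen here for illustration.

```python
import numpy as np
import pandas as pd

# Same data as the tutorial's DataFrame
df = pd.DataFrame({
    'Id': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'Name': ['Mark', 'Juli', 'Alexa', 'Kevin', 'John',
             'Devid', 'Mark', 'Michael', 'Johnson', 'Kevin'],
    'Age': [27, 31, 45, np.nan, 34, 48, np.nan, 31, np.nan, 27],
    'Location': ['USA', 'UK', np.nan, 'France', 'Germany',
                 'USA', 'Germany', np.nan, 'USA', 'Italy'],
})

# Combine the per-column counts into one frame, then transpose
summary = pd.DataFrame({
    'Total': df.count(),      # non-null values per column
    'Uniques': df.nunique(),  # distinct non-null values per column
}).T

print(summary)
```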
Unique Values
Pandas also provides the pd.unique() function, along with the equivalent Series.unique() method, which returns the unique values of the input column/Series as an array, in order of first appearance.
Example:
>>> df = pd.DataFrame({'name':['Huli', 'bee', 'Mark'], 'age':[11, 30, 30]})
>>> print(df)
   name  age
0  Huli   11
1   bee   30
2  Mark   30
>>> df['name'].unique()
array(['Huli', 'bee', 'Mark'], dtype=object)
>>> df['age'].unique()
array([11, 30], dtype=int64)
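One detail worth keeping in mind is how missing values are treated: unique() keeps NaN as one of the returned values, while nunique() leaves it out by default. The sketch below illustrates this with the Age values from the first DataFrame; the variable names are chosen here for illustration only.

```python
import numpy as np
import pandas as pd

# Age values from the first DataFrame in this tutorial
age = pd.Series([27, 31, 45, np.nan, 34, 48, np.nan, 31, np.nan, 27], name='Age')

values = age.unique()                    # NaN appears once in the result
same_values = pd.unique(age)             # function form of the same operation
n_default = age.nunique()                # NaN excluded from the count
n_with_nan = age.nunique(dropna=False)   # NaN included in the count
values_no_nan = age.dropna().unique()    # drop NaN before listing values

print(values)
print(n_default, n_with_nan)
print(values_no_nan)
```

Calling dropna() before unique() is one simple way to get a list of values that matches the default nunique() count.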
. . .