Pandas provides the df.duplicated() method to check whether a DataFrame contains duplicate rows. It returns a boolean Series in which each duplicate row is marked True.
Syntax – DataFrame.duplicated(subset=None, keep='first')
Parameters –
subset - (optional) list of columns. Only consider the specified columns when identifying duplicates; by default, all columns are used.
keep - {'first', 'last', False}, default 'first'
    first: Mark duplicates as True except for the first occurrence.
    last: Mark duplicates as True except for the last occurrence.
    False: Mark all duplicates as True.
# Let's define the DataFrame
In [1]: import pandas as pd
        df = pd.DataFrame({'student_name': ['Tom', 'Mark', 'Mark', 'Tom'],
                           'Grade': ['A', 'C', 'B', 'A']})
        df
Out[1]:
  student_name Grade
0          Tom     A
1         Mark     C
2         Mark     B
3          Tom     A
Let’s find the duplicate data in DataFrame df.
In [2]: df.duplicated()  # With no arguments, all columns are compared.
Out[2]:
0    False
1    False
2    False
3     True    # duplicate found: the entire row repeats row 0
dtype: bool
To check for duplicates in particular columns only, pass the column names in the subset parameter.
In [3]: df.duplicated(subset=['student_name'])  # Check for duplicates in 'student_name' only
Out[3]:
0    False
1    False
2     True    # two duplicate student names found
3     True
dtype: bool
You can also control which occurrence is treated as the original, and which ones are flagged as duplicates, by specifying the keep parameter.
# This will mark duplicates as True except for the last occurrence.
In [4]: df.duplicated(subset=['student_name'], keep='last')
Out[4]:
0     True
1     True
2    False
3    False
dtype: bool
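The third option, keep=False, is described in the parameter list but not demonstrated. A minimal sketch, reusing the same example DataFrame, shows how it flags every member of a duplicate group and how the resulting boolean Series can be used as a mask to inspect those rows. The variable names mask and dupes are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({'student_name': ['Tom', 'Mark', 'Mark', 'Tom'],
                   'Grade': ['A', 'C', 'B', 'A']})

# keep=False marks every row that belongs to a duplicate group,
# including the first and last occurrences.
mask = df.duplicated(keep=False)

# Boolean indexing with the mask selects the duplicated rows themselves.
dupes = df[mask]
print(dupes)
```

Here rows 0 and 3 are selected, because they are the only rows that are identical across all columns.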
Drop Duplicate Data
Pandas' drop_duplicates() method is used to remove duplicate entries from a DataFrame.
Syntax – DataFrame.drop_duplicates(subset=None, keep='first', inplace=False)
Parameters –
subset - (optional) list of columns. Only consider the specified columns when identifying duplicates; by default, all columns are used.
keep - {'first', 'last', False}, default 'first'
    first: Drop duplicates except for the first occurrence.
    last: Drop duplicates except for the last occurrence.
    False: Drop all duplicates.
inplace - (default False) Whether to drop duplicates in place or to return a copy.
In [5]: # Let's load the DataFrame
        import pandas as pd
        df = pd.DataFrame({'student_name': ['Tom', 'Mark', 'Mark', 'Tom'],
                           'Grade': ['A', 'C', 'B', 'A']})
        df
Out[5]:
  student_name Grade
0          Tom     A
1         Mark     C
2         Mark     B
3          Tom     A
Let's call the drop_duplicates() method with its default parameters.
# This will remove rows that are duplicated across all columns
In [6]: df.drop_duplicates()
Out[6]:
  student_name Grade
0          Tom     A
1         Mark     C
2         Mark     B
You can also restrict the duplicate check to particular columns by listing them in the subset parameter.
# Remove duplicate student_name data, keeping the last occurrence
In [7]: df.drop_duplicates(subset=['student_name'], keep='last')
Out[7]:
  student_name Grade
2         Mark     B
3          Tom     A
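For completeness, here is a small sketch of keep=False with drop_duplicates(), which the parameter list mentions but the examples do not cover. It assumes the same example DataFrame; the name no_repeats is illustrative.

```python
import pandas as pd

df = pd.DataFrame({'student_name': ['Tom', 'Mark', 'Mark', 'Tom'],
                   'Grade': ['A', 'C', 'B', 'A']})

# keep=False drops every row that has a duplicate, so no copy of a
# repeated row survives. All columns are compared because subset is omitted.
no_repeats = df.drop_duplicates(keep=False)
print(no_repeats)
```

Only rows 1 and 2 remain, since the two ('Tom', 'A') rows are duplicates of each other and both are removed.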
If you want to make the change in place, use the parameter inplace=True. It won't return a new DataFrame; it modifies the existing DataFrame instead.
In [8]: df.drop_duplicates(subset=['student_name'], keep='first', inplace=True)
        df
Out[8]:
  student_name Grade
0          Tom     A
1         Mark     C
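One detail worth seeing in code is what each calling style returns. The sketch below, which assumes the same example DataFrame and uses the illustrative names deduped and result, contrasts the default call (returns a new DataFrame and leaves the original untouched) with inplace=True (modifies the original and returns None).

```python
import pandas as pd

df = pd.DataFrame({'student_name': ['Tom', 'Mark', 'Mark', 'Tom'],
                   'Grade': ['A', 'C', 'B', 'A']})

# Default behaviour: a new DataFrame is returned, df still has 4 rows.
deduped = df.drop_duplicates(subset=['student_name'])
rows_before = len(df)

# inplace=True: df itself is modified and the call returns None,
# so the result should not be assigned back to df.
result = df.drop_duplicates(subset=['student_name'], inplace=True)
rows_after = len(df)

print(rows_before, rows_after, result)
```

A common mistake is writing df = df.drop_duplicates(..., inplace=True), which replaces df with None.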
Finding Unique Data
You can count or list the unique values in a DataFrame with pandas' built-in nunique() and unique() functions.
1. df.nunique() – It counts distinct observations along the requested axis. NaN values are ignored by default; pass dropna=False to count them as well.
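Syntax – DataFrame.nunique(axis=0, dropna=True)
Parameters –
axis - {0 or 'index', 1 or 'columns'}, default 0. Use 0 to count unique values in each column and 1 to count them in each row.
dropna - (default True) Don't include NaN in the counts.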
In [9]: # Let's load the DataFrame
        import pandas as pd
        df = pd.DataFrame({'student_name': ['Tom', 'Mark', 'Mark', 'Tom', 'XY'],
                           'Grade': ['A', 'C', 'B', 'A', 'XY']})
        df
Out[9]:
  student_name Grade
0          Tom     A
1         Mark     C
2         Mark     B
3          Tom     A
4           XY    XY
In [10]: df.nunique()  # get the number of distinct observations in each column
Out[10]:
student_name    3
Grade           4
dtype: int64
In [11]: df.nunique(axis=1)  # axis=1: get the number of unique values in each row
Out[11]:
0    2
1    2
2    2
3    2
4    1
dtype: int64
# get the number of unique observations in the 'student_name' column
In [12]: df['student_name'].nunique()
Out[12]: 3
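The example DataFrame above contains no missing values, so the effect of dropna is not visible there. The following sketch uses a small hypothetical DataFrame with one NaN per column (not part of the original example) to show the difference between the default and dropna=False.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value in each column.
df_nan = pd.DataFrame({'student_name': ['Tom', 'Mark', np.nan, 'Tom'],
                       'Grade': ['A', np.nan, 'B', 'A']})

# Default (dropna=True): NaN is not counted as a distinct value.
counts_default = df_nan.nunique()

# dropna=False: NaN is counted as one additional distinct value.
counts_with_nan = df_nan.nunique(dropna=False)

print(counts_default)
print(counts_with_nan)
```

Each column has two distinct non-missing values, so the first call reports 2 per column and the second reports 3.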
2. pd.unique() – It returns the unique values of a column of a DataFrame, in order of appearance. NA values are included in the result.
Syntax – pd.unique(values)
Parameters –
values - 1d array-like, such as a Series
In [13]: pd.unique(df['student_name'])  # get the unique values of column student_name
Out[13]: array(['Tom', 'Mark', 'XY'], dtype=object)
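To illustrate the point that pd.unique() keeps NA values, here is a minimal sketch on a hypothetical Series (the data and the name names are assumptions made for illustration). It also contrasts the result with nunique(), which skips NaN by default.

```python
import numpy as np
import pandas as pd

# Hypothetical Series containing a missing value.
names = pd.Series(['Tom', 'Mark', np.nan, 'Mark', 'Tom', 'XY'])

# pd.unique keeps values in order of first appearance and includes NaN.
unique_names = pd.unique(names)

# nunique() ignores NaN by default, so it reports one fewer value.
n_distinct = names.nunique()

print(unique_names)
print(len(unique_names), n_distinct)
```

The same result is available through the method form, names.unique(); which one to use is largely a matter of style.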