pandas: Handling Duplicate Data

pandas provides the df.duplicated() method to check whether a DataFrame contains duplicate rows. It returns a boolean Series with one entry per row.

Syntax:

DataFrame.duplicated(subset=None, keep='first')

Parameters:

subset - (optional) column label or list of column labels.
         Only consider the specified columns when identifying duplicates;
         by default, all columns are used.
keep - {‘first’, ‘last’, False}, default ‘first’
        first: Mark duplicates as True except for the first occurrence.
        last: Mark duplicates as True except for the last occurrence.
        False: Mark all duplicates as True.

Example

# Let's define the DataFrame
In [1]: 
import pandas as pd 
df = pd.DataFrame({'student_name' : ['Tom','Mark','Mark','Tom'], 
                   'Grade' : ['A','C','B','A']})
df
Out[1]:
  student_name Grade
0          Tom     A
1         Mark     C
2         Mark     B
3          Tom     A

Let’s find the duplicate data in DataFrame df.

In [2]: df.duplicated()   # This will check the duplicate data for all columns.
Out[2]:
0    False
1    False
2    False
3     True                # duplicate found: the entire row is duplicated
dtype: bool

To check for duplicates in particular columns only, pass the column names in the subset parameter.

In [3]: df.duplicated(subset=['student_name']) # Check duplicate data in 'student_name'
Out[3]:
0    False
1    False
2     True                # rows 2 and 3 repeat an earlier student_name
3     True
dtype: bool

You can also control which occurrence in each group of duplicates is left unmarked by specifying the keep parameter.

# This will mark duplicates as True except for the last occurrence.
In [4]: df.duplicated(subset=['student_name'],keep='last') 
Out[4]: 
0     True
1     True
2    False
3    False
dtype: bool
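
The examples above cover keep='first' and keep='last' but not keep=False. The following is a minimal sketch of that case on the same DataFrame; the variable names mask, dup_rows and dup_count are illustrative and not part of pandas.

```python
import pandas as pd

df = pd.DataFrame({'student_name': ['Tom', 'Mark', 'Mark', 'Tom'],
                   'Grade': ['A', 'C', 'B', 'A']})

# keep=False marks every row that belongs to a group of duplicates,
# including the first and the last occurrence
mask = df.duplicated(keep=False)

# Boolean indexing with the mask selects only the marked rows
dup_rows = df[mask]
print(dup_rows)

# True counts as 1, so summing the mask gives the number of marked rows
dup_count = int(mask.sum())
print(dup_count)
```

Here only rows 0 and 3 are marked, because ('Tom', 'A') is the only combination that appears in full more than once.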

Drop Duplicate Data

The drop_duplicates() method removes duplicate rows from a DataFrame.

Syntax:

DataFrame.drop_duplicates(subset=None, keep='first', inplace=False)

Parameters:

subset - (optional) column label or list of column labels.
         Only consider the specified columns when identifying duplicates;
         by default, all columns are used.

keep - {‘first’, ‘last’, False}, default ‘first’
        first: Drop duplicates except for the first occurrence.
        last: Drop duplicates except for the last occurrence.
        False: Drop all duplicates.

inplace - (default False) Whether to drop duplicates in place or to return a copy.

Examples

In [5]:
# Let's load the DataFrame
import pandas as pd 
df = pd.DataFrame({'student_name' : ['Tom','Mark','Mark','Tom'], 
                   'Grade' : ['A','C','B','A']})
df
Out[5]:
  student_name Grade
0          Tom     A
1         Mark     C
2         Mark     B
3          Tom     A

Let’s call the drop_duplicates() method with its default parameters.

# This removes rows that duplicate an earlier row in every column

In [6]: df.drop_duplicates()   
Out[6]:
  student_name Grade
0          Tom     A
1         Mark     C
2         Mark     B

You can also drop rows based on duplicates in particular columns by listing those columns in the subset parameter.

# Remove duplicate student_name data
In [7]: df.drop_duplicates(subset=['student_name'],keep='last')
Out[7]:
  student_name Grade
2         Mark     B
3          Tom     A

If you want to modify the DataFrame in place, pass inplace=True. The method then changes the existing DataFrame and returns None instead of a new DataFrame.

In [8]: 
df.drop_duplicates(subset=['student_name'],keep='first',inplace=True)
df
Out[8]:
  student_name Grade
0          Tom     A
1         Mark     C

Finding Unique Data

You can count the unique values in a DataFrame with the built-in nunique() method and list them with unique().

1. df.nunique() – Counts the distinct values along the requested axis. NaN values are excluded by default; pass dropna=False to count them as well.

Syntax:

DataFrame.nunique(axis=0, dropna=True)

Parameters:

axis - {0 or ‘index’, 1 or ‘columns’}, default 0. With 0, distinct values are counted in each column; with 1, in each row.
dropna - (default True) Don’t include NaN in the counts.

In [9]:
# Let's load the DataFrame
import pandas as pd
df = pd.DataFrame({'student_name' : ['Tom','Mark','Mark','Tom','XY'],
                   'Grade' : ['A','C','B','A','XY']})
df
Out[9]:
  student_name Grade
0          Tom     A
1         Mark     C
2         Mark     B
3          Tom     A
4           XY    XY

In [10]: df.nunique()      # get the number of distinct observations in each column
Out[10]:
student_name    3
Grade           4
dtype: int64

In [11]: df.nunique(axis=1)    # axis=1: get the number of unique values in each row
Out[11]:
0    2
1    2
2    2
3    2
4    1
dtype: int64

# get the number of unique observations in the 'student_name' column
In [12]: df['student_name'].nunique()
Out[12]: 3
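
The effect of the dropna parameter only shows up when the data contains missing values, which the DataFrame above does not. The sketch below uses a small made-up DataFrame (grades, with np.nan entries added purely for illustration) to compare the two settings.

```python
import numpy as np
import pandas as pd

# Illustrative data with one missing value in each column
grades = pd.DataFrame({'student_name': ['Tom', 'Mark', np.nan, 'Tom'],
                       'Grade': ['A', np.nan, 'B', 'A']})

# Default dropna=True: NaN is not counted as a distinct value
counts_default = grades.nunique()
print(counts_default)

# dropna=False: NaN is counted as one additional distinct value
counts_with_na = grades.nunique(dropna=False)
print(counts_with_na)
```

Each column holds two distinct non-missing values, so the default count is 2 per column and the count with dropna=False is 3.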

2. pd.unique() – Returns the unique values of a single column (a 1-D array-like or Series) in order of first appearance. NA values are included in the result.

Syntax:

pd.unique(values)

Parameter

values : 1d array-like, Series

In [13]: pd.unique(df['student_name'])     # get the unique values of column student_name
Out[13]: array(['Tom', 'Mark', 'XY'], dtype=object)
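
To illustrate the note that NA values are included, here is a minimal sketch on a made-up Series (names, with one np.nan added for illustration). It also uses Series.unique(), the method form of the same operation, and dropna() for the case where missing values should be left out.

```python
import numpy as np
import pandas as pd

# Illustrative Series with one missing value
names = pd.Series(['Tom', 'Mark', np.nan, 'Mark', 'Tom'])

# The missing value appears in the result, in order of first appearance
uniques = pd.unique(names)
print(uniques)

# Series.unique() is the method form and gives the same values
same = names.unique()
print(same)

# Drop missing values first if they should not appear in the result
without_na = names.dropna().unique()
print(without_na)
```

Unlike nunique(), which skips NaN by default, unique() has no dropna parameter, so the explicit dropna() call is one way to exclude missing values.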
