In this tutorial, you will get to know about missing values or NaN values in a DataFrame. The real-life dataset often contains missing values. For Data analysis, it is a necessary task to know about the data that what percentage of data is missing?
Let’s create a Pandas DataFrame that contains missing values.
import pandas as pd import numpy as np data = {'Id':[1, 2, 3, 4, 5, 6, 7, 8, 9, 10], 'Name':['Mark', 'Juli', 'Alexa', 'Kevin', 'John', 'Devid', 'Mary', 'Michael', 'Johnson', 'Mick'], 'Age':[27, 31, 45, np.nan, 34, 48, np.nan, 25, np.nan, 40], 'Location':['USA', 'UK', np.nan, 'France', np.nan, 'USA', 'germany', np.nan, np.nan, 'Italy']} df = pd.DataFrame(data) df.head(10)
Output:
Id Name Age Location 0 1 Mark 27.0 USA 1 2 Juli 31.0 UK 2 3 Alexa 45.0 NaN 3 4 Kevin NaN France 4 5 John 34.0 NaN 5 6 Devid 48.0 USA 6 7 Mary NaN germany 7 8 Michael 25.0 NaN 8 9 Johnson NaN NaN 9 10 Mick 40.0 Italy
Missing Data
Pandas provides pd.isnull() method that detects the missing values. It returns the same-sized DataFrame with True and False values that indicates whether an element is NA value or not.
NA values – None, numpy.nan gets mapped to True values. Everything else gets mapped to False values.
Example:
>>> pd.isnull(123) False >>> pd.isnull(np.nan) True >>> pd.isnull(None) True >>> df = pd.DataFrame([['abc', 'bee', np.nan], [1, None, 3]]) >>> df 0 1 2 0 abc bee NaN 1 1 None 3.0 >>> pd.isnull(df) 0 1 2 0 False False True 1 False True False
Let’s defined the function that calculates the missing value for each column in a DataFrame.
# Function to count missing values for each columns in a DataFrame def missing_data(data): # Count number of missing value in a column total = data.isnull().sum() # Get Percentage of missing values percent = (data.isnull().sum()/data.isnull().count()*100) temp = pd.concat([total, percent], axis=1, keys=['Total', 'Percent(%)']) # Create a Type column, that indicates the data-type of the column. types = [] for col in data.columns: dtype = str(data[col].dtype) types.append(dtype) temp['Types'] = types return(np.transpose(temp))
missing_data(df)
Output:
Id Name Age Location Total 0 0 3 4 Percent(%) 0 0 30 40 Types int64 object float64 object
. . .