Missing Data refers to no information available for one or more items. The ‘NaN’ (an acronym for Not a Number) or ‘NA’ value is the default marker to represent the missing data. In this tutorial, you will learn various approaches to work with missing data.
Detecting Missing Data
Pandas provide isna() and notna() functions to detect missing data in DataFrame and Series.
In [1]: # let's define the DataFrame import pandas as pd df = pd.DataFrame([[None, np.nan, 8, np.inf], ['X','Y','','Z']], columns=['A','B','C','D']) df Out[1]: A B C D 0 None NaN 8 inf 1 X Y Z
DataFrame.isna() – Detect missing values..Return a boolean same-sized object (DataFrame/Series) indicating if the values are NA.
- True – NA values, such as None or
numpy.NaN
, gets mapped to True values. - False – Everything else gets mapped to False values.
Note: Characters such as empty strings ''
or numpy.inf
are not considered NA values.
In [2]: df.isna() Out[2]: A B C D 0 True True False False 1 False False False False
DataFrame.notna() – Detect non-missing values. Return a boolean same-sized object indicating if the values are not NA.
- True – Non-missing values get mapped to True.
- False – NA values, such as None or
numpy.NaN
, get mapped to False values.
Note – Characters such as empty strings ''
or numpy.inf
are not considered NA values.
In [3]: df.notna() Out[3]: A B C D 0 False False True True 1 True True True True In [4]: df['B'].notna() Out[4]: 0 False 1 True Name: B, dtype: bool
Inserting Missing Data
Because NaN is a float, a column of integers with even one missing values forced to become floating-point dtype.
In [5]: pd.Series([1, 2, np.nan, 4]) Out[5]: 0 1.0 1 2.0 2 NaN 3 4.0 dtype: float64
You can insert missing values by simply assigning to containers. The actual missing value used will be chosen based on the dtype.
In [6]: s = pd.Series([10,20,30]) In [7]: s.loc[1] = None In [8]: s Out[8]: 0 10.0 1 NaN 2 30.0 dtype: float64
Filling Missing Data
DataFrame.fillna() – used to fill the Na/NaN values using specified method.
DataFrame.
fillna
(self, value=None, method=None, axis=None, inplace=False, limit=None, downcast=None)
Parameters:
value : Series, or DataFrame method : {‘backfill’, ‘bfill’, ‘pad’, ‘ffill’, None}, default None Method to use the fill NaN values. pad/ffill - Fill values forward bfill/backfill - Fill values backward axis : {0 or ‘index’, 1 or ‘columns’} Axis along which to fill missing values. inplace : (default False) If True, fill in-place. limit : (defailt None) If method is specified, this is the maximum number of consecutive NaN values to forward/backward fill. In other words, if there is a gap with more than this number of consecutive NaNs, it will only be partially filled. If method is not specified, this is the maximum number of entries along the entire axis where NaNs will be filled. Must be greater than 0 if not None. downcast : (defailt None) A dict of item->dtype of what to downcast if possible.
Example:
In [9]: df = pd.DataFrame([[None, np.nan, 8, np.inf], ['X','Y','Q','Z']], columns=['A','B','C','D']) df Out[9]: A B C D 0 None NaN 8 inf 1 X Y Q Z
Fill with a scalar value.
In [10]: df.fillna(0) # Replace NaN values to 0 (zero) Out[10]: A B C D 0 0 0 8 inf 1 X Y Q Z In [11]: df['B'].fillna('Missing') # Replace NaN values to 'Missing' Out[11]: 0 Missing 1 Y Name: B, dtype: object
Fill gaps forward or backward – Propagate non-null values forward or backward.
In [12]: df = pd.DataFrame([[np.nan, 2, np.nan, 0],[3, 4, np.nan, 1],[np.nan,np.nan,np.nan,5], [np.nan, 3, np.nan, 4]], columns=list('ABCD')) df Out[12]: A B C D 0 NaN 2.0 NaN 0 1 3.0 4.0 NaN 1 2 NaN NaN NaN 5 3 NaN 3.0 NaN 4 In [13]: df.fillna(method='ffill') # ffill - Fill values forward Out[13]: A B C D 0 NaN 2.0 NaN 0 1 3.0 4.0 NaN 1 2 3.0 4.0 NaN 5 3 3.0 3.0 NaN 4 In [14]: df.fillna(method='bfill') # bfill - Fill values backward Out[14]: A B C D 0 3.0 2.0 NaN 0 1 3.0 4.0 NaN 1 2 NaN 3.0 NaN 5 3 NaN 3.0 NaN 4
Limit the amount of filling
If we only want consecutive gaps filled up to a certain number of data points, we can use the limit parameter.
In [15]: df Out[15]: A B C D 0 NaN 2.0 NaN 0 1 3.0 4.0 NaN 1 2 NaN NaN NaN 5 3 NaN 3.0 NaN 4 In [16]: df.fillna(method='ffill',limit=1) # limit=1 (fill NaN up to a 1 data point) Out[16]: A B C D 0 NaN 2.0 NaN 0 1 3.0 4.0 NaN 1 2 3.0 4.0 NaN 5 3 NaN 3.0 NaN 4
Note :
- DataFrame.ffill() is equivalent to fillna(method=’ffill’).
- DataFrame.bfill() is equivalent to fillna(method=’bfill’)
Drop Missing Data
Pandas dropna() method used to remove the missing data.
DataFrame.
dropna
(self, axis=0, how='any', thresh=None, subset=None, inplace=False)
Parameters:
axis : {0 or ‘index’, 1 or ‘columns’}, default 0 0 - Drop rows which contain missing values. 1 - Drop columns which contain missing values. how : {‘any’, ‘all’}, default ‘any’ any - If any NA values are present, drop that row or column. all - If all values are NA, drop that row or column. thresh : (Optional) Require that many non-NA values. subset : (Optional) Labels along other axis to consider, e.g. if you are dropping rows these would be a list of columns to include. inplace : (default False) If True, operation will perform inplace and return None.
Example:
In [17]: df = pd.DataFrame([[np.nan, 2, np.nan, 0], [3, 4, 6, 1], [np.nan, np.nan, np.nan, 5],[np.nan, 3, np.nan, 4]], columns=list('ABCD')) df Out[17]: A B C D 0 NaN 2.0 NaN 0 1 3.0 4.0 6.0 1 2 NaN NaN NaN 5 3 NaN 3.0 NaN 4
In [18]: df.dropna() # Default axis=0 (drop rows with missing values) Out[18]: A B C D 1 3.0 4.0 6.0 1 # drop rows that don't have atleast 2 non-missing values In [19]: df.dropna(thresh=2) Out[19]: A B C D 0 NaN 2.0 NaN 0 1 3.0 4.0 6.0 1 3 NaN 3.0 NaN 4 In [20]: df.dropna(axis=1) # axis=1 (drop columns with missing values) Out[20]: D 0 0 1 1 2 5 3 4
. . .