Pandas – Missing Data – Study Machine Learning

Missing Data refers to no information available for one or more items. The ‘NaN’ (an acronym for Not a Number) or ‘NA’ value is the default marker to represent the missing data. In this tutorial, you will learn various approaches to work with missing data.

Detecting Missing Data

Pandas provide isna() and notna() functions to detect missing data in DataFrame and Series.

In [1]:
# let's define the DataFrame
import pandas as pd
df = pd.DataFrame([[None, np.nan, 8, np.inf], ['X','Y','','Z']], 
                  columns=['A','B','C','D'])
df
Out[1]:
      A    B  C    D
0  None  NaN  8  inf
1     X    Y       Z

DataFrame.isna() – Detect missing values..Return a boolean same-sized object (DataFrame/Series) indicating if the values are NA.

True – NA values, such as None or numpy.NaN, gets mapped to True values.
False – Everything else gets mapped to False values.

Note: Characters such as empty strings '' or numpy.inf are not considered NA values.

In [2]: df.isna()
Out[2]:
       A      B      C      D
0   True   True  False  False
1  False  False  False  False

DataFrame.notna() – Detect non-missing values. Return a boolean same-sized object indicating if the values are not NA.

True – Non-missing values get mapped to True.
False – NA values, such as None or numpy.NaN, get mapped to False values.

Note – Characters such as empty strings '' or numpy.inf are not considered NA values.

In [3]: df.notna()
Out[3]:
       A      B     C     D
0  False  False  True  True
1   True   True  True  True

In [4]: df['B'].notna()
Out[4]:
0    False
1     True
Name: B, dtype: bool

Inserting Missing Data

Because NaN is a float, a column of integers with even one missing values forced to become floating-point dtype.

In [5]: pd.Series([1, 2, np.nan, 4])
Out[5]:
0    1.0
1    2.0
2    NaN
3    4.0
dtype: float64

You can insert missing values by simply assigning to containers. The actual missing value used will be chosen based on the dtype.

In [6]: s = pd.Series([10,20,30])

In [7]: s.loc[1] = None

In [8]: s
Out[8]:
0    10.0
1     NaN
2    30.0
dtype: float64

Filling Missing Data

DataFrame.fillna() – used to fill the Na/NaN values using specified method.

DataFrame.fillna(self, value=None, method=None, axis=None, inplace=False, limit=None,   downcast=None)

Parameters:

value : Series, or DataFrame
method : {‘backfill’, ‘bfill’, ‘pad’, ‘ffill’, None}, default None
         Method to use the fill NaN values.
         pad/ffill - Fill values forward
         bfill/backfill - Fill values backward

axis : {0 or ‘index’, 1 or ‘columns’} Axis along which to fill missing values.
inplace : (default False) If True, fill in-place. 
limit : (defailt None)
        If method is specified, this is the maximum number of consecutive NaN values to forward/backward fill. In other words, if there is a gap with more than this number of consecutive NaNs, it will only be partially filled. If method is not specified, this is the maximum number of entries along the entire axis where NaNs will be filled. Must be greater than 0 if not None.

downcast : (defailt None) A dict of item->dtype of what to downcast if possible.

Example:

In [9]: 
df = pd.DataFrame([[None, np.nan, 8, np.inf], ['X','Y','Q','Z']],
                  columns=['A','B','C','D'])
df
Out[9]:
      A    B  C    D
0  None  NaN  8  inf
1     X    Y  Q    Z

Fill with a scalar value.

In [10]: df.fillna(0)                 # Replace NaN values to 0 (zero)
Out[10]:
   A  B  C    D
0  0  0  8  inf
1  X  Y  Q    Z

In [11]: df['B'].fillna('Missing')    # Replace NaN values to 'Missing'
Out[11]:
0    Missing
1          Y
Name: B, dtype: object

Fill gaps forward or backward – Propagate non-null values forward or backward.

In [12]: 
df = pd.DataFrame([[np.nan, 2, np.nan, 0],[3, 4, np.nan, 1],[np.nan,np.nan,np.nan,5],
                   [np.nan, 3, np.nan, 4]], columns=list('ABCD'))
df
Out[12]:
     A    B   C  D
0  NaN  2.0 NaN  0
1  3.0  4.0 NaN  1
2  NaN  NaN NaN  5
3  NaN  3.0 NaN  4

In [13]: df.fillna(method='ffill')   # ffill - Fill values forward
Out[13]:
     A    B   C  D
0  NaN  2.0 NaN  0
1  3.0  4.0 NaN  1
2  3.0  4.0 NaN  5
3  3.0  3.0 NaN  4

In [14]: df.fillna(method='bfill')   # bfill - Fill values backward
Out[14]:
     A    B   C  D
0  3.0  2.0 NaN  0
1  3.0  4.0 NaN  1
2  NaN  3.0 NaN  5
3  NaN  3.0 NaN  4

Limit the amount of filling

If we only want consecutive gaps filled up to a certain number of data points, we can use the limit parameter.

In [15]: df
Out[15]:
     A    B   C  D
0  NaN  2.0 NaN  0
1  3.0  4.0 NaN  1
2  NaN  NaN NaN  5
3  NaN  3.0 NaN  4

In [16]: df.fillna(method='ffill',limit=1)   # limit=1 (fill NaN up to a 1 data point)
Out[16]:
     A    B   C  D
0  NaN  2.0 NaN  0
1  3.0  4.0 NaN  1
2  3.0  4.0 NaN  5
3  NaN  3.0 NaN  4

Note :

DataFrame.ffill() is equivalent to fillna(method=’ffill’).
DataFrame.bfill() is equivalent to fillna(method=’bfill’)

Drop Missing Data

Pandas dropna() method used to remove the missing data.

DataFrame.dropna(self, axis=0, how='any', thresh=None, subset=None, inplace=False)

Parameters:

axis : {0 or ‘index’, 1 or ‘columns’}, default 0
       0 - Drop rows which contain missing values.
       1 - Drop columns which contain missing values.

how : {‘any’, ‘all’}, default ‘any’
       any -  If any NA values are present, drop that row or column.
       all - If all values are NA, drop that row or column.

thresh : (Optional) Require that many non-NA values. 
subset : (Optional) Labels along other axis to consider, 
          e.g. if you are dropping rows these would be a list of columns to include.

inplace : (default False) If True, operation will perform inplace and return None.

Example:

In [17]:
df = pd.DataFrame([[np.nan, 2, np.nan, 0], [3, 4, 6, 1], [np.nan, np.nan, np.nan, 5],[np.nan, 3, np.nan, 4]],
columns=list('ABCD'))
df
Out[17]:
     A    B    C  D
0  NaN  2.0  NaN  0
1  3.0  4.0  6.0  1
2  NaN  NaN  NaN  5
3  NaN  3.0  NaN  4

In [18]: df.dropna()           # Default axis=0 (drop rows with missing values)
Out[18]:
     A    B    C  D
1  3.0  4.0  6.0  1

# drop rows that don't have atleast 2 non-missing values

In [19]: df.dropna(thresh=2)   
Out[19]:
     A    B    C  D
0  NaN  2.0  NaN  0
1  3.0  4.0  6.0  1
3  NaN  3.0  NaN  4

In [20]: df.dropna(axis=1)   # axis=1 (drop columns with missing values)
Out[20]: 
   D
0  0
1  1
2  5
3  4

. . .

Pandas – Missing Data

Detecting Missing Data

Inserting Missing Data

Filling Missing Data

Drop Missing Data

Leave a Reply Cancel reply

Python Pandas Tutorials