Python Pandas – Descriptive Statistics

Pandas provides a variety of functions for calculating descriptive statistics, such as sum(), mean(), std(), and mode(). This tutorial describes these functions and illustrates them with examples.

 

Function      Description
sum()         Return the sum of values
count()       Return the number of non-null observations
mean()        Return the mean of values
median()      Return the median of values
mode()        Return the mode(s) of values
std()         Return the sample standard deviation of values
min()         Return the minimum
max()         Return the maximum
abs()         Return the absolute value of each element
prod()        Return the product of values
cumsum()      Return the cumulative sum
cumprod()     Return the cumulative product
mad()         Return the mean absolute deviation (removed in pandas 2.0)
var()         Return the unbiased variance
skew()        Return the sample skewness
kurt()        Return the sample kurtosis
quantile()    Return the sample quantile
cummax()      Return the cumulative maximum
cummin()      Return the cumulative minimum
describe()    Return a summary of descriptive statistics
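
As a quick illustration of some functions from the table that the examples in this tutorial do not cover, the sketch below applies them to a small numeric Series. The variable name ages and its values are assumptions chosen for illustration only.

```python
import pandas as pd

# Hypothetical sample data, chosen for illustration only
ages = pd.Series([28, 34, 29, 5])

median_age = ages.median()            # middle of the sorted data: 28.5
min_age, max_age = ages.min(), ages.max()
product = ages.prod()                 # 28 * 34 * 29 * 5 = 138040
variance = ages.var()                 # unbiased variance (divides by n - 1)
q25 = ages.quantile(0.25)             # first quartile, linear interpolation
running_max = ages.cummax().tolist()  # largest value seen so far at each position
running_min = ages.cummin().tolist()  # smallest value seen so far at each position
modes = ages.mode().tolist()          # every value occurs once, so all are modes
```

Each call returns either a single value (a reduction such as median()) or a Series of the same length as the input (a cumulative function such as cummax()).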

Note:

Generally, these functions work only on numeric data. Depending on the pandas version, applying them to a DataFrame that contains non-numeric columns either raises a TypeError or silently skips those columns; many reductions accept numeric_only=True to restrict the calculation to numeric columns explicitly. However, sum() and cumsum() also work on string data without raising an error: they concatenate the strings.
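
One way to avoid errors on mixed data is to restrict the calculation to numeric columns. The sketch below shows two possible approaches, assuming the same sample data used in the examples of this tutorial; the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Tom', 'Jack', 'Steve', 'Ricky'],
                   'Age': [28, 34, 29, 5],
                   'grade': ['a', 'b', 'a', 'c']})

# Option 1: ask the reduction itself to ignore non-numeric columns
mean_numeric = df.mean(numeric_only=True)

# Option 2: select the numeric columns first, then apply any function
numeric_df = df.select_dtypes(include='number')
std_numeric = numeric_df.std()
```

The second option is convenient when several statistics are needed, because the column selection is done only once.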

Example:

In [1]:
# Let's load the DataFrame
import pandas as pd
data = {'Name':['Tom', 'Jack', 'Steve', 'Ricky'],'Age':[28,34,29,5],
        'grade':['a','b','a','c']}
df = pd.DataFrame(data)
In [2]: df
Out[2]:
    Name  Age grade
0    Tom   28     a
1   Jack   34     b
2  Steve   29     a
3  Ricky    5     c

# Calling sum() on string data concatenates the strings
In [3]: df.sum()
Out[3]:
Name     TomJackSteveRicky    
Age                     96
grade                 abac
dtype: object
 
In [4]: df['Age'].sum()
Out[4]: 96

DataFrame.cumsum() – Return the cumulative sum over a DataFrame axis. The result is a DataFrame of the same size containing the cumulative sums.

Parameters:

  • axis (default 0) – 0 or 'index' accumulates down each column; 1 or 'columns' accumulates across each row.
  • skipna (default True) – Exclude NA/null values. If an entire row/column is NA, the result will be NA.

In [5]: df.cumsum()
Out[5]: 
                Name Age grade
0                Tom  28     a
1            TomJack  62    ab
2       TomJackSteve  91   aba
3  TomJackSteveRicky  96  abac
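
The effect of the skipna parameter is easiest to see on data with a missing value. The following is a minimal sketch; the Series s is hypothetical and not part of the example above.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value
s = pd.Series([1.0, np.nan, 2.0, 3.0])

# skipna=True (default): the NaN position stays NaN, the running sum continues after it
skipped = s.cumsum()

# skipna=False: once a NaN is encountered, every later result is NaN as well
propagated = s.cumsum(skipna=False)
```

Here skipped holds 1.0, NaN, 3.0, 6.0, while propagated holds 1.0 followed by NaN in every remaining position.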

DataFrame.mean() – Return the mean of the values for the requested axis.

In [6]: df.mean(numeric_only=True)           # mean of the numeric columns only
Out[6]:
Age    24.0
dtype: float64

In [7]: df.mean(axis=1, numeric_only=True)   # axis=1: one mean per row, computed across the numeric columns
Out[7]: 
0    28.0
1    34.0
2    29.0
3     5.0
dtype: float64

DataFrame.count() – Count the non-NA cells in each column or row, depending on the axis parameter.

  • axis=0 or 'index' (default) – counts are generated for each column.
  • axis=1 or 'columns' – counts are generated for each row.

In [8]: df.count()
Out[8]:
Name     4
Age      4
grade    4
dtype: int64

In [9]: df.count(axis=1) 
Out[9]: 
0    3
1    3
2    3
3    3
dtype: int64
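
Because the sample DataFrame has no missing values, every count above is the same. The sketch below uses a hypothetical DataFrame named scores, which is not part of the original example, to show how count() skips NaN cells.

```python
import numpy as np
import pandas as pd

# Hypothetical data containing missing values
scores = pd.DataFrame({'math': [90.0, np.nan, 75.0],
                       'art': [np.nan, np.nan, 88.0]})

per_column = scores.count()      # non-NA cells in each column: math 2, art 1
per_row = scores.count(axis=1)   # non-NA cells in each row: 1, 0, 2
```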

DataFrame.max() – Return the maximum of the values for the requested axis.

In [10]: df.max()
Out[10]:
Name     Tom
Age       34
grade      c
dtype: object

In [11]: df.max(axis=1, numeric_only=True)   # axis=1: one maximum per row, across the numeric columns
Out[11]:
0    28
1    34
2    29
3     5
dtype: int64

# This raises an error because the DataFrame also contains string data.
In [12]: df.abs()
Out[12]: ... TypeError: bad operand type for abs(): 'str'

In [13]: df['Age'].abs()
Out[13]:
0    28
1    34
2    29
3     5
Name: Age, dtype: int64

DataFrame.describe() – Generate descriptive statistics that summarize the central tendency, dispersion and shape of a dataset’s distribution, excluding NaN values.

In [14]: df.describe()
Out[14]: 
             Age
count   4.000000
mean   24.000000
std    12.935739
min     5.000000
25%    22.250000
50%    28.500000
75%    30.250000
max    34.000000

Here, Name and grade are non-numeric columns, so describe() excludes them from the output above. The grade column holds categorical data and can be summarized on its own:

In [15]: df['grade'].describe()
Out[15]:
count     4
unique    3
top       a
freq      2
Name: grade, dtype: object

Categorical data is summarized by the number of observations (count), the number of unique values (unique), the most frequent value (top), and the frequency of that value (freq).
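
Numeric and non-numeric columns can also be summarized in a single call by passing include='all' to describe(). The sketch below assumes the same sample data as the examples above; statistics that do not apply to a column are reported as NaN.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Tom', 'Jack', 'Steve', 'Ricky'],
                   'Age': [28, 34, 29, 5],
                   'grade': ['a', 'b', 'a', 'c']})

# One summary table covering every column, whatever its type
summary = df.describe(include='all')

grade_unique = summary.loc['unique', 'grade']   # number of distinct grades
age_mean = summary.loc['mean', 'Age']           # mean of the Age column
```

In this summary the mean row is NaN for Name and grade, and the unique, top, and freq rows are NaN for Age.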
