Pandas provides a variety of functions for calculating descriptive statistics, such as sum(), mean(), std(), and mode(). This tutorial describes these functions and illustrates them with examples.
Function | Description |
sum() | Return sum of values |
count() | Return number of non-null observations |
mean() | Return mean of values |
median() | Return median of values |
mode() | Return mode of values |
std() | Return standard deviation of values |
min() | Return minimum |
max() | Return maximum |
abs() | Return absolute value |
prod() | Return product of values |
cumsum() | Return cumulative sum |
cumprod() | Return cumulative product |
mad() | Mean absolute deviation (deprecated in pandas 1.5 and removed in pandas 2.0) |
var() | Unbiased variance |
skew() | Sample skewness |
kurt() | Sample kurtosis |
quantile() | Sample quantile |
cummax() | Cumulative maximum |
cummin() | Cumulative minimum |
describe() | Return summary of descriptive statistics |
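As a quick illustration of several functions from the table, the sketch below applies them to a small numeric Series. The variable names (`ages`, `summary`, `mad_by_hand`) are illustrative and not part of pandas; the values mirror the Age column used in the examples of this tutorial. The hand-computed mean absolute deviation is one possible stand-in for mad() where that method is unavailable.

```python
import pandas as pd

# Illustrative data: the same four ages used in the tutorial's DataFrame.
ages = pd.Series([28, 34, 29, 5], name="Age")

summary = {
    "sum": ages.sum(),             # 96
    "count": ages.count(),         # 4 non-null observations
    "mean": ages.mean(),           # 24.0
    "median": ages.median(),       # 28.5
    "min": ages.min(),             # 5
    "max": ages.max(),             # 34
    "prod": ages.prod(),           # 138040
    "std": ages.std(),             # sample standard deviation (ddof=1)
    "var": ages.var(),             # unbiased variance (ddof=1)
    "q25": ages.quantile(0.25),    # 22.25
}

# Cumulative variants return a Series of the same length as the input.
running_total = ages.cumsum()      # 28, 62, 91, 96
running_max = ages.cummax()        # 28, 34, 34, 34

# Mean absolute deviation computed by hand: mean of |x - mean(x)|.
mad_by_hand = (ages - ages.mean()).abs().mean()   # 9.5

for name, value in summary.items():
    print(f"{name}: {value}")
```

Note that std() and var() use the sample (n - 1) denominator by default, which is why the table calls var() "unbiased".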
Note:
In general, the descriptive statistics functions above work only on numeric values, and pandas raises an error if you apply them to non-numeric data. However, sum() and cumsum() can also be applied to string data without raising an error; in that case they concatenate the strings.
Example:
In [1]: # Let's load the DataFrame
   ...: import pandas as pd
   ...: data = {'Name': ['Tom', 'Jack', 'Steve', 'Ricky'],
   ...:         'Age': [28, 34, 29, 5],
   ...:         'grade': ['a', 'b', 'a', 'c']}
   ...: df = pd.DataFrame(data)

In [2]: df
Out[2]:
    Name  Age grade
0    Tom   28     a
1   Jack   34     b
2  Steve   29     a
3  Ricky    5     c

# The sum() function concatenates the values of string columns.
In [3]: df.sum()
Out[3]:
Name     TomJackSteveRicky
Age                     96
grade                 abac
dtype: object

In [4]: df['Age'].sum()
Out[4]: 96
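The session above shows sum() concatenating strings. The other half of the note, that most reductions reject non-numeric data, can be sketched as follows. The variable names are illustrative, and the Series is created with an explicit object dtype as an assumption so that the behavior does not depend on how a given pandas version infers string dtypes.

```python
import pandas as pd

# Illustrative string data, stored explicitly as Python objects.
names = pd.Series(["Tom", "Jack", "Steve", "Ricky"], dtype=object)

# sum() falls back to the + operator, which concatenates strings.
joined = names.sum()

# mean() needs numeric values, so it raises a TypeError here.
try:
    names.mean()
    mean_failed = False
except TypeError as exc:
    mean_failed = True
    print(f"mean() failed: {exc}")

print(joined)
```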
DataFrame.cumsum() – Return the cumulative sum over a DataFrame axis. The result is a DataFrame of the same size as the original, containing the running totals.
Parameters:
- axis (default 0) – 0: index, 1: columns
- skipna (default True) – Exclude NA/null values. If an entire row/column is NA, the result will be NA.
In [5]: df.cumsum()
Out[5]:
                Name  Age grade
0                Tom   28     a
1            TomJack   62    ab
2       TomJackSteve   91   aba
3  TomJackSteveRicky   96  abac
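To make the axis and skipna parameters concrete, here is a minimal sketch on a small numeric DataFrame with one missing value. The frame `scores` and its column names are hypothetical and exist only for this illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical numeric data with one missing value in column q1.
scores = pd.DataFrame({
    "q1": [1.0, np.nan, 3.0],
    "q2": [10.0, 20.0, 30.0],
})

# axis=0 (default): running total down each column.
# The NaN stays NaN in the result, but the total continues past it.
down = scores.cumsum()

# axis=1: running total across the columns of each row.
across = scores.cumsum(axis=1)

# skipna=False: once a NaN is met, every later value in that column is NaN.
strict = scores.cumsum(skipna=False)

print(down)
print(across)
print(strict)
```

With the default skipna=True, column q1 of `down` is 1.0, NaN, 4.0; with skipna=False it becomes 1.0, NaN, NaN.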
DataFrame.mean() – Return the mean of the values for the requested axis.
In [6]: df.mean(numeric_only=True)  # mean of the numeric columns
Out[6]:
Age    24.0
dtype: float64

In [7]: df.mean(axis=1, numeric_only=True)  # axis=1: one mean per row, taken across the columns
Out[7]:
0    28.0
1    34.0
2    29.0
3     5.0
dtype: float64
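Depending on the pandas version, calling mean() on a frame that mixes strings and numbers either silently skips the non-numeric columns (older releases) or raises a TypeError (pandas 2.0 and later). A minimal sketch that behaves the same either way restricts the calculation to numeric columns explicitly. It rebuilds the tutorial's DataFrame so that it is self-contained; the names `numeric`, `col_means`, `row_means`, and `same` are illustrative.

```python
import pandas as pd

# Same data as the tutorial's DataFrame.
df = pd.DataFrame({
    "Name": ["Tom", "Jack", "Steve", "Ricky"],
    "Age": [28, 34, 29, 5],
    "grade": ["a", "b", "a", "c"],
})

# Option 1: select the numeric columns first, then reduce.
numeric = df.select_dtypes(include="number")
col_means = numeric.mean()          # one mean per column
row_means = numeric.mean(axis=1)    # one mean per row

# Option 2: ask the reduction itself to ignore non-numeric columns.
same = df.mean(numeric_only=True)

print(col_means)
print(row_means)
```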
DataFrame.count() – Count the non-NA cells in each column or row, depending on the axis parameter.
- axis = 0 or ‘index’ (default): counts are generated for each column.
- axis = 1 or ‘columns’: counts are generated for each row.
In [8]: df.count()
Out[8]:
Name     4
Age      4
grade    4
dtype: int64

In [9]: df.count(axis=1)
Out[9]:
0    3
1    3
2    3
3    3
dtype: int64
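Because the tutorial's DataFrame has no missing values, every count above equals the full row or column length. The sketch below uses a hypothetical frame, `df_missing`, with a few gaps to show that count() only tallies non-NA cells.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing name and one missing age.
df_missing = pd.DataFrame({
    "Name": ["Tom", "Jack", None, "Ricky"],
    "Age": [28, np.nan, 29, 5],
})

per_column = df_missing.count()        # non-NA cells in each column
per_row = df_missing.count(axis=1)     # non-NA cells in each row
total_rows = len(df_missing)           # row count, including rows with gaps

print(per_column)
print(per_row)
print(total_rows)
```

Comparing `per_column` with `total_rows` is a simple way to see how many values are missing in each column.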
DataFrame.max() – Return the maximum of the values for the requested axis.
In [10]: df.max()
Out[10]:
Name     Tom
Age       34
grade      c
dtype: object

In [11]: df.max(axis=1, numeric_only=True)  # axis=1: the maximum value is generated for each row
Out[11]:
0    28
1    34
2    29
3     5
dtype: int64
# abs() raises an error here because the DataFrame also contains string data.
In [12]: df.abs()
Out[12]:
...
TypeError: bad operand type for abs(): 'str'

In [13]: df['Age'].abs()
Out[13]:
0    28
1    34
2    29
3     5
Name: Age, dtype: int64
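One way to avoid the TypeError without picking columns by hand is to select the numeric columns before calling abs(). This is a minimal sketch on a hypothetical frame, `changes`, that includes negative numbers so that the effect of abs() is visible.

```python
import pandas as pd

# Hypothetical data: a string column plus a signed numeric column.
changes = pd.DataFrame({
    "Name": ["Tom", "Jack", "Steve"],
    "delta": [-3, 4, -1],
})

# Keep only the numeric columns, then take absolute values element-wise.
numeric_abs = changes.select_dtypes(include="number").abs()

# Equivalent for a single, known numeric column.
delta_abs = changes["delta"].abs()

print(numeric_abs)
```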
DataFrame.describe() – Generate descriptive statistics that summarize the central tendency, dispersion, and shape of a dataset’s distribution, excluding NaN values.
In [14]: df.describe()
Out[14]:
             Age
count   4.000000
mean   24.000000
std    12.935739
min     5.000000
25%    22.250000
50%    28.500000
75%    30.250000
max    34.000000
Here, the grade column is categorical rather than numeric, so the output above excludes it.
In [15]: df['grade'].describe()
Out[15]:
count     4
unique    3
top       a
freq      2
Name: grade, dtype: object
Categorical data are summarized by the number of observations, the number of unique values, the most frequent value, and the frequency of that value.
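Both kinds of summary can also be requested in a single call by passing include="all" to describe(). The sketch below rebuilds the tutorial's DataFrame so that it runs on its own; the variable name `everything` is illustrative. Statistics that do not apply to a column (for example, mean for grade) appear as NaN.

```python
import pandas as pd

# Same data as the tutorial's DataFrame.
df = pd.DataFrame({
    "Name": ["Tom", "Jack", "Steve", "Ricky"],
    "Age": [28, 34, 29, 5],
    "grade": ["a", "b", "a", "c"],
})

# include="all" summarizes numeric and non-numeric columns together.
everything = df.describe(include="all")

print(everything)

# Individual statistics can be read with .loc[row_label, column_label].
print(everything.loc["top", "grade"])   # most frequent grade
print(everything.loc["mean", "Age"])    # mean age
```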