Pandas provides a variety of functions for calculating descriptive statistics, such as sum(), mean(), std(), and mode(). This tutorial describes these functions with examples.
| Function | Description |
| sum() | Return sum of values |
| count() | Return number of non-null observations |
| mean() | Return mean of values |
| median() | Return median of values |
| mode() | Return mode of values |
| std() | Return standard deviation of values |
| min() | Return minimum |
| max() | Return maximum |
| abs() | Return absolute value |
| prod() | Return product of values |
| cumsum() | Return cumulative sum |
| cumprod() | Return cumulative product |
| mad() | Mean absolute deviation (deprecated in pandas 1.5 and removed in 2.0) |
| var() | Unbiased variance |
| skew() | Sample skewness |
| kurt() | Sample kurtosis |
| quantile() | Sample quantile |
| cummax() | Cumulative maximum |
| cummin() | Cumulative minimum |
| describe() | Return summary of descriptive statistics |
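As a quick illustration of several functions from the table, here is a minimal sketch on a small numeric Series. The Series `s` and its values are made up for this example and are not part of the tutorial's dataset.

```python
import pandas as pd

# Hypothetical sample values, chosen only for illustration
s = pd.Series([2, 4, 4, 7, 9])

total = s.sum()              # sum of values
average = s.mean()           # arithmetic mean
middle = s.median()          # median
most_common = s.mode()       # Series holding every modal value
variance = s.var()           # sample variance (ddof=1 by default)
spread = s.std()             # sample standard deviation (ddof=1 by default)
q75 = s.quantile(0.75)       # 75th percentile, linear interpolation
running_max = s.cummax()     # cumulative maximum
running_prod = s.cumprod()   # cumulative product

print(total, average, middle, most_common.tolist())
print(running_max.tolist(), running_prod.tolist())
```

Note that mode() returns a Series rather than a single number, because a dataset can have more than one mode.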
Note:
Most of the descriptive statistics functions above work only on numeric data, and pandas raises an error (typically a TypeError) if you apply them to non-numeric data. However, sum() and cumsum() also work on string data without raising an error; they concatenate the strings instead of adding numbers.
Example:
In [1]:
# Let's create the DataFrame
import pandas as pd
data = {'Name':['Tom', 'Jack', 'Steve', 'Ricky'],'Age':[28,34,29,5],
'grade':['a','b','a','c']}
df = pd.DataFrame(data)
In [2]: df
Out[2]:
Name Age grade
0 Tom 28 a
1 Jack 34 b
2 Steve 29 a
3 Ricky 5 c
# sum() on string data concatenates the strings
In [3]: df.sum()
Out[3]:
Name TomJackSteveRicky
Age 96
grade abac
dtype: object
In [4]: df['Age'].sum()
Out[4]: 96
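If concatenated strings are not what you want, one option is to restrict the aggregation to the numeric columns first. The sketch below rebuilds the same DataFrame and filters it with `select_dtypes`; the variable names `numeric_part` and `numeric_sum` are only illustrative.

```python
import pandas as pd

data = {'Name': ['Tom', 'Jack', 'Steve', 'Ricky'],
        'Age': [28, 34, 29, 5],
        'grade': ['a', 'b', 'a', 'c']}
df = pd.DataFrame(data)

# Keep only the numeric columns (here just 'Age')
numeric_part = df.select_dtypes(include='number')

# Summing the filtered frame never touches the string columns
numeric_sum = numeric_part.sum()

print(numeric_sum)
```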
DataFrame.cumsum() – Return the cumulative sum over a DataFrame axis. The result is a DataFrame of the same size containing the running totals.
Parameters:
- axis (default 0) – 0 (index) accumulates down each column; 1 (columns) accumulates across each row.
- skipna (default True) – Exclude NA/null values. If an entire row/column is NA, the result will be NA.
In [5]: df.cumsum()
Out[5]:
Name Age grade
0 Tom 28 a
1 TomJack 62 ab
2 TomJackSteve 91 aba
3 TomJackSteveRicky 96 abac
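The effect of the axis and skipna parameters is easier to see on numeric data with a missing value. The following is a small sketch; the Series `s` and the DataFrame `df_num` are hypothetical and exist only for this illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical numeric data with one missing value
s = pd.Series([1.0, np.nan, 3.0, 4.0])

# skipna=True (default): the NaN position stays NaN, the running sum continues
skipped = s.cumsum()

# skipna=False: once a NaN is met, every later value is NaN as well
propagated = s.cumsum(skipna=False)

# axis=1 accumulates across the columns of each row
df_num = pd.DataFrame({'a': [1, 2], 'b': [10, 20]})
row_wise = df_num.cumsum(axis=1)

print(skipped.tolist())
print(propagated.tolist())
print(row_wise)
```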
DataFrame.mean() – Return the mean of the values for the requested axis.
# Find the mean of the numeric columns
In [6]: df.mean(numeric_only=True)
Out[6]:
Age    24.0
dtype: float64
# axis=1: a mean is generated for each row, across the numeric columns
In [7]: df.mean(axis=1, numeric_only=True)
Out[7]:
0    28.0
1    34.0
2    29.0
3     5.0
dtype: float64
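How mean() treats string columns depends on the pandas version: older releases silently skipped them, while pandas 2.0 and later are expected to raise a TypeError instead. A hedged sketch that should behave the same on either side is to pass `numeric_only=True` explicitly; the names `col_means` and `row_means` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Tom', 'Jack', 'Steve', 'Ricky'],
                   'Age': [28, 34, 29, 5],
                   'grade': ['a', 'b', 'a', 'c']})

# Column means, computed over the numeric columns only
col_means = df.mean(numeric_only=True)

# Row means over the numeric columns only; with a single numeric
# column, each row's mean is simply its own Age
row_means = df.mean(axis=1, numeric_only=True)

print(col_means)
print(row_means)
```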
DataFrame.count() – Count the non-NA cells in each column or row, depending on the axis parameter.
- axis: 0 or ‘index’ (default) – counts are generated for each column.
- axis: 1 or ‘columns’ – counts are generated for each row.
In [8]: df.count()
Out[8]:
Name     4
Age      4
grade    4
dtype: int64
In [9]: df.count(axis=1)
Out[9]:
0    3
1    3
2    3
3    3
dtype: int64
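Since the tutorial's DataFrame has no missing values, every count above equals the full length. The sketch below uses a hypothetical frame, `df_na`, with a few gaps to show that count() really does skip NA cells.

```python
import numpy as np
import pandas as pd

# Hypothetical data with missing entries; np.nan and None both count as NA
df_na = pd.DataFrame({'Age': [28, np.nan, 29, 5],
                      'grade': ['a', 'b', None, 'c']})

per_column = df_na.count()        # non-NA cells in each column
per_row = df_na.count(axis=1)     # non-NA cells in each row

print(per_column)
print(per_row)
```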
DataFrame.max() – Return the maximum of the values for the requested axis.
In [10]: df.max()
Out[10]:
Name     Tom
Age       34
grade      c
dtype: object
# axis=1: the maximum is generated for each row, across the numeric columns
In [11]: df.max(axis=1, numeric_only=True)
Out[11]:
0    28
1    34
2    29
3     5
dtype: int64
DataFrame.abs() – Return the absolute value of each element.
# This raises an error because the DataFrame also contains string data.
In [12]: df.abs()
...
TypeError: bad operand type for abs(): 'str'
In [13]: df['Age'].abs()
Out[13]:
0    28
1    34
2    29
3     5
Name: Age, dtype: int64
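To take absolute values in a mixed DataFrame without selecting columns by hand, one approach is to apply abs() only to the numeric columns. This is a sketch rather than the only way to do it; the frame `df_signed` and its `change` column are hypothetical.

```python
import pandas as pd

# Hypothetical frame with a signed numeric column and a string column
df_signed = pd.DataFrame({'Name': ['Tom', 'Jack', 'Steve'],
                          'change': [-3, 2, -7]})

# Find the numeric columns, then apply abs() to those columns only
numeric_cols = df_signed.select_dtypes(include='number').columns
df_abs = df_signed.copy()
df_abs[numeric_cols] = df_signed[numeric_cols].abs()

print(df_abs)
```

Working on a copy leaves the original frame unchanged, which makes it easy to compare the two.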
DataFrame.describe() – Generate descriptive statistics that summarize the central tendency, dispersion and shape of a dataset’s distribution, excluding NaN values.
In [14]: df.describe()
Out[14]:
             Age
count   4.000000
mean   24.000000
std    12.935739
min     5.000000
25%    22.250000
50%    28.500000
75%    30.250000
max    34.000000
The Name and grade columns hold strings rather than numbers, so describe() excludes them from the output above by default.
In [15]: df['grade'].describe()
Out[15]:
count     4
unique    3
top       a
freq      2
Name: grade, dtype: object
Categorical data is summarized by the number of observations (count), the number of unique values (unique), the most frequent value (top), and how often that value occurs (freq).
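describe() can also summarize several non-numeric columns at once, or all columns together, through its `include` and `exclude` parameters. The sketch below rebuilds the tutorial's DataFrame; the variable names `text_summary` and `full_summary` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Tom', 'Jack', 'Steve', 'Ricky'],
                   'Age': [28, 34, 29, 5],
                   'grade': ['a', 'b', 'a', 'c']})

# Summarize only the non-numeric columns (count, unique, top, freq)
text_summary = df.describe(exclude='number')

# Summarize every column; statistics that do not apply to a column are NaN
full_summary = df.describe(include='all')

print(text_summary)
print(full_summary)
```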