A groupby operation involves some combination of splitting the object, applying a function, and combining the results. This can be used to group large amounts of data and compute operations on these groups. This tutorial has explained to perform the various operation on DataFrame using groupby with example.
In [1]: # Let's define the DataFrame import pandas as pd data = {'item' :['tea','coffee','sugar','tea','coffee','sugar','tea','coffee','sugar'], 'year':[2010,2010,2010,2011,2011,2011,2012,2012,2012], 'sales':[1200,1050,500,1500,1200,1000,1230,1300,1420]} df = pd.DataFrame(data,columns=['item','year','sales']) df Out[1]: item year sales 0 tea 2010 1200 1 coffee 2010 1050 2 sugar 2010 500 3 tea 2011 1500 4 coffee 2011 1200 5 sugar 2011 1000 6 tea 2012 1230 7 coffee 2012 1300 8 sugar 2012 1420
Split Data into Groups
Pandas object can be split into a group in many ways. A groups method is used to list group data.
Groupby using single column – It makes the group by using a single column.
In [2]: item_group = df.groupby('item') item_group.groups Out[2]: {'tea': Int64Index([0, 3, 6], dtype='int64'), 'sugar': Int64Index([2, 5, 8], dtype='int64'), 'coffee': Int64Index([1, 4, 7], dtype='int64')}
Groupby using multiple columns – It forms the group by using multiple columns.
In [3]: item_group_multi = df.groupby(['item','year']) item_group_multi.groups Out[3]: {('coffee', 2012): Int64Index([7], dtype='int64'), ('sugar', 2012): Int64Index([8], dtype='int64'), ('coffee', 2011): Int64Index([4], dtype='int64'), ('tea', 2012): Int64Index([6], dtype='int64'), ('sugar', 2010): Int64Index([2], dtype='int64'), ('tea', 2011): Int64Index([3], dtype='int64'), ('coffee', 2010): Int64Index([1], dtype='int64'), ('tea', 2010): Int64Index([0], dtype='int64'), ('sugar', 2011): Int64Index([5], dtype='int64')}
Iterating through Groups
You can also print the group elements by iterating through groups using for loop.
In [4]: for name,group in item_group: print('{}:'.format(name)) print(group) Out[4]: coffee: item year sales 1 coffee 2010 1050 4 coffee 2011 1200 7 coffee 2012 1300 sugar: item year sales 2 sugar 2010 500 5 sugar 2011 1000 8 sugar 2012 1420 tea: item year sales 0 tea 2010 1200 3 tea 2011 1500 6 tea 2012 1230
Selecting a Group
The get_group() method is used to select a particular group.
In [5]: item_group.get_group('coffee') Out[5]: item year sales 1 coffee 2010 1050 4 coffee 2011 1200 7 coffee 2012 1300
Groupby – Aggregations
You can use aggregation function such as mean, sum, etc to get the aggregate value of each group. Aggregation functions are used once the group by object is created.
Example: Let’s calculate the average sales of each item.
In [6]: agg_group = df.groupby('item')['sales'].mean() agg_group Out[6]: item coffee 1183.333333 sugar 973.333333 tea 1310.000000 Name: sales, dtype: float64
Aggregation group for Multiple columns – You can make groups for aggregation value by using multiple columns
Example: Let’s calculate the average and total sales of each item.
In [7]: import numpy as np agg_group = df.groupby('item')['sales'].agg([np.mean,np.sum]) print(agg_group) Out[7]: mean sum item coffee 1183.333333 3550 sugar 973.333333 2920 tea 1310.000000 3930 In [8]: import numpy as np agg_group = df.groupby('item')['sales'].agg(COUNT='count',Standard_deviation=np.std) agg_group Out[8]: COUNT Standard_deviation item coffee 3 125.830574 sugar 3 460.579345 tea 3 165.227116
Filtration
The filter
method returns a subset of the original object. Suppose we want to find the item, which is sale for less than three years.
In [9]: # Let's define a DataFrame import pandas as pd data = {'item' :['tea','coffee','sugar','tea','coffee','sugar','tea','coffee'], 'year':[2010,2010,2010,2011,2011,2011,2012,2012], 'sales':[1200,1050,500,1500,1200,1000,1230,1300]} df = pd.DataFrame(data,columns=['item','year','sales']) df Out[9]: item year sales 0 tea 2010 1200 1 coffee 2010 1050 2 sugar 2010 500 3 tea 2011 1500 4 coffee 2011 1200 5 sugar 2011 1000 6 tea 2012 1230 7 coffee 2012 1300 In [10]: df.groupby('item').filter(lambda x: len(x) < 3 ) Out[10]: item year sales 2 sugar 2010 500 5 sugar 2011 1000
. . .