Python Pandas – GroupBy

A groupby operation involves some combination of splitting the object, applying a function, and combining the results. This can be used to group large amounts of data and compute operations on these groups. This tutorial has explained to perform the various operation on DataFrame using groupby with example.

 

 

In [1]:
# Let's define the DataFrame
import pandas as pd
data = {'item' :['tea','coffee','sugar','tea','coffee','sugar','tea','coffee','sugar'],
        'year':[2010,2010,2010,2011,2011,2011,2012,2012,2012],
        'sales':[1200,1050,500,1500,1200,1000,1230,1300,1420]}
df = pd.DataFrame(data,columns=['item','year','sales'])
df
Out[1]:
     item  year  sales
0     tea  2010   1200
1  coffee  2010   1050
2   sugar  2010    500
3     tea  2011   1500
4  coffee  2011   1200
5   sugar  2011   1000
6     tea  2012   1230
7  coffee  2012   1300
8   sugar  2012   1420

Split Data into Groups

Pandas object can be split into a group in many ways. A groups method is used to list group data.

Groupby using single column – It makes the group by using a single column.

In [2]: 
item_group = df.groupby('item')
item_group.groups

Out[2]: 
{'tea': Int64Index([0, 3, 6], dtype='int64'), 
 'sugar': Int64Index([2, 5, 8], dtype='int64'), 
 'coffee': Int64Index([1, 4, 7], dtype='int64')}

Groupby using multiple columns – It forms the group by using multiple columns.

In [3]: 
item_group_multi = df.groupby(['item','year'])
item_group_multi.groups

Out[3]: 
{('coffee', 2012): Int64Index([7], dtype='int64'), 
 ('sugar', 2012): Int64Index([8], dtype='int64'), 
 ('coffee', 2011): Int64Index([4], dtype='int64'), 
 ('tea', 2012): Int64Index([6], dtype='int64'),
 ('sugar', 2010): Int64Index([2], dtype='int64'),
 ('tea', 2011): Int64Index([3], dtype='int64'),
 ('coffee', 2010): Int64Index([1], dtype='int64'),
 ('tea', 2010): Int64Index([0], dtype='int64'),
 ('sugar', 2011): Int64Index([5], dtype='int64')}

Iterating through Groups

You can also print the group elements by iterating through groups using for loop.

In [4]: 
for name,group in item_group:
    print('{}:'.format(name))
    print(group)

Out[4]:
coffee:
     item  year  sales
1  coffee  2010   1050
4  coffee  2011   1200
7  coffee  2012   1300
sugar:
    item  year  sales
2  sugar  2010    500
5  sugar  2011   1000
8  sugar  2012   1420
tea:
  item  year  sales
0  tea  2010   1200
3  tea  2011   1500
6  tea  2012   1230

Selecting a Group

The get_group() method is used to select a particular group.

In [5]: item_group.get_group('coffee')
Out[5]:
     item  year  sales
1  coffee  2010   1050
4  coffee  2011   1200
7  coffee  2012   1300

Groupby – Aggregations

You can use aggregation function such as mean, sum, etc to get the aggregate value of each group. Aggregation functions are used once the group by object is created.

Example: Let’s calculate the average sales of each item.

In [6]: 
agg_group = df.groupby('item')['sales'].mean()
agg_group

Out[6]: 
item
coffee    1183.333333
sugar      973.333333
tea       1310.000000
Name: sales, dtype: float64

Aggregation group for Multiple columns – You can make groups for aggregation value by using multiple columns

Example: Let’s calculate the average and total sales of each item.

In [7]: 
import numpy as np
agg_group = df.groupby('item')['sales'].agg([np.mean,np.sum])
print(agg_group)

Out[7]: 
               mean   sum
item                     
coffee  1183.333333  3550
sugar    973.333333  2920
tea     1310.000000  3930

In [8]: 
import numpy as np
agg_group = df.groupby('item')['sales'].agg(COUNT='count',Standard_deviation=np.std)
agg_group

Out[8]: 
        COUNT  Standard_deviation
item                             
coffee      3          125.830574
sugar       3          460.579345
tea         3          165.227116

Filtration

The filter method returns a subset of the original object.  Suppose we want to find the item, which is sale for less than three years.

In [9]: 
# Let's define a DataFrame
import pandas as pd
data = {'item' :['tea','coffee','sugar','tea','coffee','sugar','tea','coffee'],
        'year':[2010,2010,2010,2011,2011,2011,2012,2012],
        'sales':[1200,1050,500,1500,1200,1000,1230,1300]}
df = pd.DataFrame(data,columns=['item','year','sales'])
df

Out[9]:
     item  year  sales
0     tea  2010   1200
1  coffee  2010   1050
2   sugar  2010    500
3     tea  2011   1500
4  coffee  2011   1200
5   sugar  2011   1000
6     tea  2012   1230
7  coffee  2012   1300

In [10]: df.groupby('item').filter(lambda x: len(x) < 3 )
Out[10]:  
    item  year  sales
2  sugar  2010    500
5  sugar  2011   1000

.     .     .

Leave a Reply

Your email address will not be published. Required fields are marked *

Python Pandas Tutorials