Python Pandas – Merge – Study Machine Learning

pandas has full-featured, high performance in-memory join operations idiomatically very similar to relational databases like SQL. pandas provides a single function, merge(), as the entry point for all standard database join operations between DataFrame or named Series objects.

The join is done on columns or indexes. If joining columns on columns, the DataFrame indexes will be ignored. Otherwise if joining indexes on indexes or indexes on a column or columns, the index will be passed on.

pd.merge(left, right, how='inner', on=None, left_on=None, right_on=None,
         left_index=False, right_index=False, sort=True,
         suffixes=('_x', '_y'), copy=True, indicator=False,
         validate=None)

Parameters:
left :  A DataFrame or named Series object.
right :  Another DataFrame or named Series object.
on : Column or index level names to join on. Must be found in both the left and right DataFrame and/or Series objects.
left_on : Columns from the left DataFrame to use as keys. Can either be column names or arrays with length equal to the length of the DataFrame.
right_on : Columns from the right DataFrame to use as keys. Can either be column names or arrays with length equal to the length of the DataFrame.
left_index : If True, use the index (row labels) from the left DataFrame as its join key(s). In case of a DataFrame with a MultiIndex (hierarchical), the number of levels must match the number of join keys from the right DataFrame.
right_index : Same usage as left_index for the right DataFrame.
how : One of ‘left’, ‘right’, ‘outer’, ‘inner’. Defaults to inner. Each method has been described below.
sort : Sort the result DataFrame by the join keys in lexicographical order. Defaults to True, setting to False will improve the performance substantially in many cases.
suffixes : A tuple of string suffixes to apply to overlapping columns. Defaults to ('_x', '_y').

Example

import pandas as pd
data1 = {'key1':['k0','k1'],'key2':['k1','k1'],
         'name' :['mark','juli'],'city':['New York','Paris']}

data2 = {'key1':['k1','k2'],'name' :['john','alex'],'city':['London','Tokyo']}

data3 = {'key1':['k0','k1'],'key2':['k0','k1'],
         'name' :['Saty','Jonathan'],'city':['germany','Moscow']}

df1 = pd.DataFrame(data1,index = [0,1],columns=['key1','key2','city','name'])
df2 = pd.DataFrame(data2,index=[2,3],columns=['key1','city','name'])
df3 = pd.DataFrame(data3,index=[1,2],columns=['key1','key2','city','name'])

Merge Two DataFrames on Single Key

In [1]: pd.merge(df1,df3,on='key1')
Out[1]: 
  key1 key2_x    city_x name_x key2_y   city_y    name_y
0   k0     k1  New York   mark     k0  germany      Saty
1   k1     k1     Paris   juli     k1   Moscow  Jonathan

Merge Two DataFrames on Multiple Keys

In [2]: pd.merge(df1,df3,on=['key1','key2'])
Out[2]: 
  key1 key2 city_x name_x  city_y    name_y
0   k1   k1  Paris   juli  Moscow  Jonathan

Merge Two DataFrame with different keys

Merge DataFrame df1 and df3 by considering ‘key2’ as left key for df1 and ‘key1’ as of right key for df3.

In [3]: pd.merge(df1,df3,left_on='key2',right_on='key1')
Out[3]: 
  key1_x key2_x    city_x name_x key1_y key2_y  city_y    name_y
0     k0     k1  New York   mark     k1     k1  Moscow  Jonathan
1     k1     k1     Paris   juli     k1     k1  Moscow  Jonathan

# how ='right' --> right outer join(use only keys from right frame)
In [4]: pd.merge(df1, df3, left_on='key2', right_on='key1' , how = 'right')
Out[4]:
  key1_x key2_x    city_x name_x key1_y key2_y   city_y    name_y
0     k0     k1  New York   mark     k1     k1   Moscow  Jonathan
1     k1     k1     Paris   juli     k1     k1   Moscow  Jonathan
2    NaN    NaN       NaN    NaN     k0     k0  germany      Saty

Merge using row index instead of keys

You can merge the DataFrames using the row index by defining the parameters letf_index and right_index while merging.

In [5]: pd.merge(df1 , df3 , left_index=True , right_index=True)
Out[5]: 

  key1_x key2_x city_x name_x key1_y key2_y   city_y name_y
1     k1     k1  Paris   juli     k0     k0  germany   Saty

Merging on Left DataFrame column & Right DataFrame Index

You can merge the DataFrames using the row index by defining the parameters letf_index and right_index while merging.

In [6]: 
# Let's define another DataFrame df4
data4 = {'name' :['john','alex'],'city':['London','Tokyo']}
df4 = pd.DataFrame(data4 , index=['k1','k2'] , columns=['city','name'])
df4
Out[6]: 
      city  name
k1  London  john
k2   Tokyo  alex

In [7]: pd.merge(df1,df4, left_on="key1", right_index=True)
Out[7]:
  key1 key2 city_x name_x  city_y name_y
1   k1   k1  Paris   juli  London   john

Merge Using ‘how’ Argument

The how argument to merge specifies how to determine which keys are to be included in the resulting table. If a key combination does not appear in either the left or the right tables, the values in the joined table will be NA.

Here is a summary of the how options and their SQL equivalent names

Merge method	SQL Join Name	Description
`left`	`LEFT OUTER JOIN`	Use keys from left frame only
`right`	`RIGHT OUTER JOIN`	Use keys from right frame only
`outer`	`FULL OUTER JOIN`	Use union of keys from both frames
`inner`	`INNER JOIN`	Use intersection of keys from both frames

Let’s see the examples of left join, right join, outer join and inner join.

# Left Join
In [8]: pd.merge(df1,df2,how='left',on='key1')
Out[8]: 
  key1 key2    city_x name_x  city_y name_y
0   k0   k1  New York   mark     NaN    NaN
1   k1   k1     Paris   juli  London   john

# Right Join
In [9]: pd.merge(df1,df2,how='right',on='key1')
Out[9]: 
  key1 key2 city_x name_x  city_y name_y
0   k1   k1  Paris   juli  London   john
1   k2  NaN    NaN    NaN   Tokyo   alex

# Outer Join
In [10]: pd.merge(df1,df2,how='outer',on='key1')
Out[10]: 
  key1 key2    city_x name_x  city_y name_y
0   k0   k1  New York   mark     NaN    NaN
1   k1   k1     Paris   juli  London   john
2   k2  NaN       NaN    NaN   Tokyo   alex

# Inner Join
In [11]: pd.merge(df1,df2,how='inner',on='key1')
Out[11]: 
  key1 key2 city_x name_x  city_y name_y
0   k1   k1  Paris   juli  London   john

Handling Overlapping Columns

You can set the parameter Suffix to apply to overlapping column names in the left and right side, respectively. The default value of suffix is (‘_x’, ‘_y’).

In [12]: pd.merge(df1, df3, on='key1', suffixes=('_df1','_df3'))
Out[12]: 
  key1 key2_df1  city_df1 name_df1 key2_df3 city_df3  name_df3
0   k0       k1  New York     mark       k0  germany      Saty
1   k1       k1     Paris     juli       k1   Moscow  Jonathan

You set the value of suffix=(False, False), to raise an exception on overlapping columns.

In [13]: pd.merge(df1, df3, on='key1', suffixes=(False,False))
Out[13]: 
ValueError: columns overlap but no suffix specified: Index(['key2', 'city', 'name'], dtype='object')

. . .

Python Pandas – Merge

Merge using row index instead of keys

Merging on Left DataFrame column & Right DataFrame Index

Merge Using ‘how’ Argument

Handling Overlapping Columns

Leave a Reply Cancel reply

Python Pandas Tutorials