pandas has full-featured, high performance in-memory join operations idiomatically very similar to relational databases like SQL. pandas provides a single function, merge(), as the entry point for all standard database join operations between DataFrame or named Series objects.
The join is done on columns or indexes. If joining columns on columns, the DataFrame indexes will be ignored. Otherwise if joining indexes on indexes or indexes on a column or columns, the index will be passed on.
pd.merge(left, right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=True, suffixes=('_x', '_y'), copy=True, indicator=False, validate=None)
- left : A DataFrame or named Series object.
- right : Another DataFrame or named Series object.
- on : Column or index level names to join on. Must be found in both the left and right DataFrame and/or Series objects.
- left_on : Columns from the left DataFrame to use as keys. Can either be column names or arrays with length equal to the length of the DataFrame.
- right_on : Columns from the right DataFrame to use as keys. Can either be column names or arrays with length equal to the length of the DataFrame.
- left_index : If True, use the index (row labels) from the left DataFrame as its join key(s). In case of a DataFrame with a MultiIndex (hierarchical), the number of levels must match the number of join keys from the right DataFrame.
- right_index : Same usage as left_index for the right DataFrame.
- how : One of ‘left’, ‘right’, ‘outer’, ‘inner’. Defaults to inner. Each method has been described below.
- sort : Sort the result DataFrame by the join keys in lexicographical order. Defaults to True, setting to False will improve the performance substantially in many cases.
- suffixes : A tuple of string suffixes to apply to overlapping columns. Defaults to
('_x', '_y')
.
Example
import pandas as pd data1 = {'key1':['k0','k1'],'key2':['k1','k1'], 'name' :['mark','juli'],'city':['New York','Paris']} data2 = {'key1':['k1','k2'],'name' :['john','alex'],'city':['London','Tokyo']} data3 = {'key1':['k0','k1'],'key2':['k0','k1'], 'name' :['Saty','Jonathan'],'city':['germany','Moscow']} df1 = pd.DataFrame(data1,index = [0,1],columns=['key1','key2','city','name']) df2 = pd.DataFrame(data2,index=[2,3],columns=['key1','city','name']) df3 = pd.DataFrame(data3,index=[1,2],columns=['key1','key2','city','name'])
Merge Two DataFrames on Single Key
In [1]: pd.merge(df1,df3,on='key1') Out[1]: key1 key2_x city_x name_x key2_y city_y name_y 0 k0 k1 New York mark k0 germany Saty 1 k1 k1 Paris juli k1 Moscow Jonathan
Merge Two DataFrames on Multiple Keys
In [2]: pd.merge(df1,df3,on=['key1','key2']) Out[2]: key1 key2 city_x name_x city_y name_y 0 k1 k1 Paris juli Moscow Jonathan
Merge Two DataFrame with different keys
Merge DataFrame df1 and df3 by considering ‘key2’ as left key for df1 and ‘key1’ as of right key for df3.
In [3]: pd.merge(df1,df3,left_on='key2',right_on='key1') Out[3]: key1_x key2_x city_x name_x key1_y key2_y city_y name_y 0 k0 k1 New York mark k1 k1 Moscow Jonathan 1 k1 k1 Paris juli k1 k1 Moscow Jonathan # how ='right' --> right outer join(use only keys from right frame) In [4]: pd.merge(df1, df3, left_on='key2', right_on='key1' , how = 'right') Out[4]: key1_x key2_x city_x name_x key1_y key2_y city_y name_y 0 k0 k1 New York mark k1 k1 Moscow Jonathan 1 k1 k1 Paris juli k1 k1 Moscow Jonathan 2 NaN NaN NaN NaN k0 k0 germany Saty
Merge using row index instead of keys
You can merge the DataFrames using the row index by defining the parameters letf_index and right_index while merging.
In [5]: pd.merge(df1 , df3 , left_index=True , right_index=True) Out[5]: key1_x key2_x city_x name_x key1_y key2_y city_y name_y 1 k1 k1 Paris juli k0 k0 germany Saty
Merging on Left DataFrame column & Right DataFrame Index
You can merge the DataFrames using the row index by defining the parameters letf_index and right_index while merging.
In [6]: # Let's define another DataFrame df4 data4 = {'name' :['john','alex'],'city':['London','Tokyo']} df4 = pd.DataFrame(data4 , index=['k1','k2'] , columns=['city','name']) df4 Out[6]: city name k1 London john k2 Tokyo alex In [7]: pd.merge(df1,df4, left_on="key1", right_index=True) Out[7]: key1 key2 city_x name_x city_y name_y 1 k1 k1 Paris juli London john
Merge Using ‘how’ Argument
The how argument to merge specifies how to determine which keys are to be included in the resulting table. If a key combination does not appear in either the left or the right tables, the values in the joined table will be NA.
Here is a summary of the how options and their SQL equivalent names
Merge method | SQL Join Name | Description |
---|---|---|
left |
LEFT OUTER JOIN |
Use keys from left frame only |
right |
RIGHT OUTER JOIN |
Use keys from right frame only |
outer |
FULL OUTER JOIN |
Use union of keys from both frames |
inner |
INNER JOIN |
Use intersection of keys from both frames |
Let’s see the examples of left join, right join, outer join and inner join.
# Left Join
In [8]: pd.merge(df1,df2,how='left',on='key1')
Out[8]:
key1 key2 city_x name_x city_y name_y
0 k0 k1 New York mark NaN NaN
1 k1 k1 Paris juli London john
# Right Join
In [9]: pd.merge(df1,df2,how='right',on='key1')
Out[9]:
key1 key2 city_x name_x city_y name_y
0 k1 k1 Paris juli London john
1 k2 NaN NaN NaN Tokyo alex
# Outer Join
In [10]: pd.merge(df1,df2,how='outer',on='key1')
Out[10]:
key1 key2 city_x name_x city_y name_y
0 k0 k1 New York mark NaN NaN
1 k1 k1 Paris juli London john
2 k2 NaN NaN NaN Tokyo alex
# Inner Join
In [11]: pd.merge(df1,df2,how='inner',on='key1')
Out[11]:
key1 key2 city_x name_x city_y name_y
0 k1 k1 Paris juli London john
Handling Overlapping Columns
You can set the parameter Suffix to apply to overlapping column names in the left and right side, respectively. The default value of suffix is (‘_x’, ‘_y’).
In [12]: pd.merge(df1, df3, on='key1', suffixes=('_df1','_df3')) Out[12]: key1 key2_df1 city_df1 name_df1 key2_df3 city_df3 name_df3 0 k0 k1 New York mark k0 germany Saty 1 k1 k1 Paris juli k1 Moscow Jonathan
You set the value of suffix=(False, False), to raise an exception on overlapping columns.
In [13]: pd.merge(df1, df3, on='key1', suffixes=(False,False))
Out[13]:
ValueError: columns overlap but no suffix specified: Index(['key2', 'city', 'name'], dtype='object')
. . .