Two Dataframe Connection (pandas)

pandas

  • concat
    • simple connection between 2 dataframe
  • merge
    • connection standard : common data on specific column → combine those records (df1 has n records & df2 has m records that have common data on specific column compared with counterpart df → Number of total intersection records is n x m)
  • join
    • connection standard : common row index
  • Option used for 3 methods
    • inner : intersection set
    • outer : union set (contains intersection set)
    • left : left set (contains intersection set)
    • right : right set (contains intersection set)

Concat

  • pd.concat()
  • simple connection between 2 dataframe
  • axis=0
    • If same name exists in both df’s column list, combine them (Result column list consists of left df column, intersection column, right df column)
    • Each record’s name(index) remains (just connection)
    • join=’outer’ : return result dataframe (left + intersection + right) from Concat method
    • join=’inner’ : return result dataframe which contains only intersection columns
    • ignore_index=True : original index removed + set again as integer index
    import pandas as pd
    
    df1 = pd.DataFrame({'c1' : [30, 21, 30, 40], 'c2' : [40, 50, 50, 20], 'c3' : [70, 50, 60, 17]}, index=[0, 1, 2, 3])
    
    df2 = pd.DataFrame({'c1' : [15, 46, 78, 100], 'c4' : [54, 15, 26, 49], 'c3' : [55, 79, 44, 10], 'c9' : [11, 56, 74, 8]}, index=[2, 3, 4, 5])
    
    df3 = pd.concat([df1, df2], axis=0, join='outer')
    df4 = pd.concat([df2, df1], axis=0, join='outer')
    df5 = pd.concat([df1, df2], axis=0, ignore_index=True)
    df6 = pd.concat([df1, df2], axis=0, join='inner')
    
    print(df1)
    print()
    print(df2)
    print()
    print(df3)
    print()
    print(df4)
    print()
    print(df5)
    print()
    print(df6)
    


       c1  c2  c3
    0  30  40  70
    1  21  50  50
    2  30  50  60
    3  40  20  17
    
        c1  c4  c3  c9
    2   15  54  55  11
    3   46  15  79  56
    4   78  26  44  74
    5  100  49  10   8
    
        c1    c2  c3    c4    c9
    0   30  40.0  70   NaN   NaN
    1   21  50.0  50   NaN   NaN
    2   30  50.0  60   NaN   NaN
    3   40  20.0  17   NaN   NaN
    2   15   NaN  55  54.0  11.0
    3   46   NaN  79  15.0  56.0
    4   78   NaN  44  26.0  74.0
    5  100   NaN  10  49.0   8.0
    
        c1    c4  c3    c9    c2
    2   15  54.0  55  11.0   NaN
    3   46  15.0  79  56.0   NaN
    4   78  26.0  44  74.0   NaN
    5  100  49.0  10   8.0   NaN
    0   30   NaN  70   NaN  40.0
    1   21   NaN  50   NaN  50.0
    2   30   NaN  60   NaN  50.0
    3   40   NaN  17   NaN  20.0
    
        c1    c2  c3    c4    c9
    0   30  40.0  70   NaN   NaN
    1   21  50.0  50   NaN   NaN
    2   30  50.0  60   NaN   NaN
    3   40  20.0  17   NaN   NaN
    4   15   NaN  55  54.0  11.0
    5   46   NaN  79  15.0  56.0
    6   78   NaN  44  26.0  74.0
    7  100   NaN  10  49.0   8.0
    
        c1  c3
    0   30  70
    1   21  50
    2   30  60
    3   40  17
    2   15  55
    3   46  79
    4   78  44
    5  100  10
    


  • axis=1
    • If same name exists in both df’s index list, combine them (Result index list consists of left df index, intersection index, right df index)
    • Each record’s name(index) remains (just connection)
    • join=’outer’ : return result dataframe (left + intersection + right) from Concat method
    • join=’inner’ : return result dataframe which contains only intersection index
    • ignore_index=True : original index removed + set again as integer index


    import pandas as pd
    
    df1 = pd.DataFrame({'c1' : [30, 21, 30, 40], 'c2' : [40, 50, 50, 20], 'c3' : [70, 50, 60, 17]}, index=[0, 1, 2, 3])
    
    df2 = pd.DataFrame({'c1' : [15, 46, 78, 100], 'c4' : [54, 15, 26, 49], 'c3' : [55, 79, 44, 10], 'c9' : [11, 56, 74, 8]}, index=[2, 3, 4, 5])
    
    df7 = pd.concat([df1, df2], axis=1, join='outer')
    df8 = pd.concat([df2, df1], axis=1, join='outer')
    df9 = pd.concat([df1, df2], axis=1, join='inner')
    
    print(df7)
    print()
    print(df8)
    print()
    print(df9)
    


         c1    c2    c3     c1    c4    c3    c9
    0  30.0  40.0  70.0    NaN   NaN   NaN   NaN
    1  21.0  50.0  50.0    NaN   NaN   NaN   NaN
    2  30.0  50.0  60.0   15.0  54.0  55.0  11.0
    3  40.0  20.0  17.0   46.0  15.0  79.0  56.0
    4   NaN   NaN   NaN   78.0  26.0  44.0  74.0
    5   NaN   NaN   NaN  100.0  49.0  10.0   8.0
    
          c1    c4    c3    c9    c1    c2    c3
    0    NaN   NaN   NaN   NaN  30.0  40.0  70.0
    1    NaN   NaN   NaN   NaN  21.0  50.0  50.0
    2   15.0  54.0  55.0  11.0  30.0  50.0  60.0
    3   46.0  15.0  79.0  56.0  40.0  20.0  17.0
    4   78.0  26.0  44.0  74.0   NaN   NaN   NaN
    5  100.0  49.0  10.0   8.0   NaN   NaN   NaN
    
       c1  c2  c3  c1  c4  c3  c9
    2  30  50  60  15  54  55  11
    3  40  20  17  46  15  79  56
    


  • We can make connection between dataframe and series or between series and series


import pandas as pd

df1 = pd.DataFrame({'c1' : [30, 21, 30, 40], 'c2' : [40, 50, 50, 20], 'c3' : [70, 50, 60, 17]}, index=[0, 1, 2, 3])

s1 = pd.Series([45, 100], name='c10', index=[2, 3])
s2 = pd.Series([50, 11], name='c11', index=[2, 4])

df10 = pd.concat([df1, s1], axis=1)
df11 = pd.concat([s1, s2], axis=1)
s3 = pd.concat([s1, s2], axis=0)

print(df10)
print()
print(df11)
print()
print(s3)
   c1  c2  c3    c10
0  30  40  70    NaN
1  21  50  50    NaN
2  30  50  60   45.0
3  40  20  17  100.0

     c10   c11
2   45.0  50.0
3  100.0   NaN
4    NaN  11.0

2     45
3    100
2     50
4     11
dtype: int64

Merge

  • pd.merge()
  • connection standard : common data on specific column → combine those records (df1 has n records & df2 has m records that have common data on specific column compared with counterpart df → Number of total intersection records is n x m)


import pandas as pd


df1 = pd.DataFrame({'math' : [30, 21, 30, 40], 'english' : [40, 50, 50, 20], 'biology2' : [70, 50, 60, 17]}, index=[0, 1, 2, 3])


df2 = pd.DataFrame({'math' : [15, 30, 78, 30], 'biology1' : [50, 17, 26, 49], 'physics' : [55, 79, 44, 10], 'english' : [11, 50, 74, 8]}, index=[2, 3, 4, 5])


df3 = pd.merge(df1, df2, how='outer', on='math')
df4 = pd.merge(df1, df2, how='inner', on='math')
df5 = pd.merge(df2, df1, how='outer', on='math')

print(df1)
print()
print(df2)
print()
print(df3)
print()
print(df4)
print()
print(df5)


   math  english  biology2
0    30       40        70
1    21       50        50
2    30       50        60
3    40       20        17

   math  biology1  physics  english
2    15        50       55       11
3    30        17       79       50
4    78        26       44       74
5    30        49       10        8

   math  english_x  biology2  biology1  physics  english_y
0    30       40.0      70.0      17.0     79.0       50.0
1    30       40.0      70.0      49.0     10.0        8.0
2    30       50.0      60.0      17.0     79.0       50.0
3    30       50.0      60.0      49.0     10.0        8.0
4    21       50.0      50.0       NaN      NaN        NaN
5    40       20.0      17.0       NaN      NaN        NaN
6    15        NaN       NaN      50.0     55.0       11.0
7    78        NaN       NaN      26.0     44.0       74.0

   math  english_x  biology2  biology1  physics  english_y
0    30         40        70        17       79         50
1    30         40        70        49       10          8
2    30         50        60        17       79         50
3    30         50        60        49       10          8

   math  biology1  physics  english_x  english_y  biology2
0    15      50.0     55.0       11.0        NaN       NaN
1    30      17.0     79.0       50.0       40.0      70.0
2    30      17.0     79.0       50.0       50.0      60.0
3    30      49.0     10.0        8.0       40.0      70.0
4    30      49.0     10.0        8.0       50.0      60.0
5    78      26.0     44.0       74.0        NaN       NaN
6    21       NaN      NaN        NaN       50.0      50.0
7    40       NaN      NaN        NaN       20.0      17.0


  • df3 = pd.merge(df1, df2, how=’outer’, on=’math’)
    • on=’math’ : connect the records that have same data on ‘math’ column (df1 has 2 records and df2 has 2 records that have 30 as data on ‘math’ → Consequently, the number of intersection records is 4)
    • how=’outer’ : return result dataframe (left + intersection + right)
  • df4 = pd.merge(df1, df2, how=’inner’, on=’math’)
    • how=’inner’ : return result dataframe (only intersection records)


import pandas as pd


df1 = pd.DataFrame({'math' : [30, 21, 30, 40], 'english' : [40, 50, 50, 20], 'biology2' : [70, 50, 60, 17]}, index=[0, 1, 2, 3])


df2 = pd.DataFrame({'math' : [15, 30, 78, 30], 'biology1' : [50, 17, 26, 49], 'physics' : [55, 79, 44, 10], 'english' : [11, 50, 74, 8]}, index=[2, 3, 4, 5])


df6 = pd.merge(df2, df1, how='outer', on=['math', 'english'])

print(df6)


   math  biology1  physics  english  biology2
0    15      50.0     55.0       11       NaN
1    30      17.0     79.0       50      60.0
2    78      26.0     44.0       74       NaN
3    30      49.0     10.0        8       NaN
4    30       NaN      NaN       40      70.0
5    21       NaN      NaN       50      50.0
6    40       NaN      NaN       20      17.0


  • df6 = pd.merge(df2, df1, how=’outer’, on=[‘math’, ‘english’])
    • on=[‘math’, ‘english’] : connect the records that have same data on both ‘math’ column and ‘english’ column (df1 has 1 record and df2 has 1 record that have (30, 50) as data on ‘math’ and ‘english’ → Consequently, the number of intersection records is 1)


import pandas as pd


df1 = pd.DataFrame({'math' : [30, 21, 30, 40], 'english' : [40, 50, 50, 20], 'biology2' : [70, 50, 60, 17]}, index=[0, 1, 2, 3])


df2 = pd.DataFrame({'math' : [15, 30, 78, 30], 'biology1' : [50, 17, 26, 49], 'physics' : [55, 79, 44, 10], 'english' : [11, 50, 74, 8]}, index=[2, 3, 4, 5])

df7 = pd.merge(df1, df2, how='left', on='math')
df8 = pd.merge(df1, df2, how='right', on='math')

print(df7)
print()
print(df8)


   math  english_x  biology2  biology1  physics  english_y
0    30         40        70      17.0     79.0       50.0
1    30         40        70      49.0     10.0        8.0
2    21         50        50       NaN      NaN        NaN
3    30         50        60      17.0     79.0       50.0
4    30         50        60      49.0     10.0        8.0
5    40         20        17       NaN      NaN        NaN

   math  english_x  biology2  biology1  physics  english_y
0    15        NaN       NaN        50       55         11
1    30       40.0      70.0        17       79         50
2    30       50.0      60.0        17       79         50
3    78        NaN       NaN        26       44         74
4    30       40.0      70.0        49       10          8
5    30       50.0      60.0        49       10          8


  • df7 = pd.merge(df1, df2, how=’left’, on=’math’)
    • how=’left’ : return result dataframe (left + intersection)
  • df8 = pd.merge(df1, df2, how=’right’, on=’math’)
    • how=’right’ : return result dataframe (intersection + right)


import pandas as pd


df1 = pd.DataFrame({'math' : [30, 21, 30, 40], 'english' : [40, 50, 50, 20], 'biology2' : [70, 50, 60, 17]}, index=[0, 1, 2, 3])


df2 = pd.DataFrame({'math' : [15, 30, 78, 30], 'biology1' : [50, 17, 26, 49], 'physics' : [55, 79, 44, 10], 'english' : [11, 50, 74, 8]}, index=[2, 3, 4, 5])

df9 = pd.merge(df1, df2, how='outer', left_on='biology2', right_on='biology1')

print(df9)


   math_x  english_x  biology2  math_y  biology1  physics  english_y
0    30.0       40.0      70.0     NaN       NaN      NaN        NaN
1    21.0       50.0      50.0    15.0      50.0     55.0       11.0
2    30.0       50.0      60.0     NaN       NaN      NaN        NaN
3    40.0       20.0      17.0    30.0      17.0     79.0       50.0
4     NaN        NaN       NaN    78.0      26.0     44.0       74.0
5     NaN        NaN       NaN    30.0      49.0     10.0        8.0


  • df9 = pd.merge(df1, df2, how=’outer’, left_on=’biology2’, right_on=’biology1’)
    • left_on=’biology2’, right_on=’biology1’ : connect the records that df1 value from biology2 is equal to df2 value from biology1 (df1 has 1 record and df2 has 1 record that have 50 as data on ‘biology2’ and ‘biology1’ → Consequently, the number of intersection records is 1) + (df1 has 1 record and df2 has 1 record that have 17 as data on ‘biology2’ and ‘biology1’ → Consequently, the number of intersection records is 1 )

Join

  • df1.join(df2)
  • Similar with merge, but the connection standard is common row index (There should not be common name on 2 column lists)


import pandas as pd

df1 = pd.DataFrame({'math1' : [30, 21, 30, 40], 'english1' : [40, 50, 50, 20], 'biology2' : [70, 50, 60, 17]}, index=["Cho", "Kim", "Rivera", "Samoh"])


df2 = pd.DataFrame({'math2' : [15, 30, 78, 30], 'biology1' : [50, 17, 26, 49], 'physics' : [55, 79, 44, 10], 'english2' : [11, 50, 74, 8]}, index=["Sanford", "Cho", "Samoh", "Dongmo"])

df3 = df1.join(df2, how='outer')
df4 = df1.join(df2, how='inner')

print(df1)
print()
print(df2)
print()
print(df3)
print()
print(df4)


        math1  english1  biology2
Cho        30        40        70
Kim        21        50        50
Rivera     30        50        60
Samoh      40        20        17

         math2  biology1  physics  english2
Sanford     15        50       55        11
Cho         30        17       79        50
Samoh       78        26       44        74
Dongmo      30        49       10         8

         math1  english1  biology2  math2  biology1  physics  english2
Cho       30.0      40.0      70.0   30.0      17.0     79.0      50.0
Dongmo     NaN       NaN       NaN   30.0      49.0     10.0       8.0
Kim       21.0      50.0      50.0    NaN       NaN      NaN       NaN
Rivera    30.0      50.0      60.0    NaN       NaN      NaN       NaN
Samoh     40.0      20.0      17.0   78.0      26.0     44.0      74.0
Sanford    NaN       NaN       NaN   15.0      50.0     55.0      11.0

       math1  english1  biology2  math2  biology1  physics  english2
Cho       30        40        70     30        17       79        50
Samoh     40        20        17     78        26       44        74


  • df3 = df1.join(df2, how=’outer’)
    • Common indexes are ‘Cho’ and ‘Samoh’, so it connects between ‘Cho’ record from df1 and ‘Cho’ record from df2 + between ‘Samoh’ record from df1 and ‘Samoh’ record from df2
    • how=’outer’ : return result dataframe (left + intersection + right)
  • df4 = df1.join(df2, how=’inner’)
    • Common indexes are ‘Cho’ and ‘Samoh’, so it connects between ‘Cho’ record from df1 and ‘Cho’ record from df2 + between ‘Samoh’ record from df1 and ‘Samoh’ record from df2
    • how=’inner’ : return result dataframe (only intersection records)

© 2017. All rights reserved.

Powered by Hydejack v조현진