Two Dataframe Connection (pandas)

28 Jan 2024 in Study / Computer science on Pandas

pandas

concat
- simple connection between 2 dataframe
merge
- connection standard : common data on specific column → combine those records (df1 has n records & df2 has m records that have common data on specific column compared with counterpart df → Number of total intersection records is n x m)
join
- connection standard : common row index
Option used for 3 methods
- inner : intersection set
- outer : union set (contains intersection set)
- left : left set (contains intersection set)
- right : right set (contains intersection set)

Concat

pd.concat()
simple connection between 2 dataframe

axis=0

If same name exists in both df’s column list, combine them (Result column list consists of left df column, intersection column, right df column)
Each record’s name(index) remains (just connection)
join=’outer’ : return result dataframe (left + intersection + right) from Concat method
join=’inner’ : return result dataframe which contains only intersection columns
ignore_index=True : original index removed + set again as integer index

import pandas as pd

df1 = pd.DataFrame({'c1' : [30, 21, 30, 40], 'c2' : [40, 50, 50, 20], 'c3' : [70, 50, 60, 17]}, index=[0, 1, 2, 3])

df2 = pd.DataFrame({'c1' : [15, 46, 78, 100], 'c4' : [54, 15, 26, 49], 'c3' : [55, 79, 44, 10], 'c9' : [11, 56, 74, 8]}, index=[2, 3, 4, 5])

df3 = pd.concat([df1, df2], axis=0, join='outer')
df4 = pd.concat([df2, df1], axis=0, join='outer')
df5 = pd.concat([df1, df2], axis=0, ignore_index=True)
df6 = pd.concat([df1, df2], axis=0, join='inner')

print(df1)
print()
print(df2)
print()
print(df3)
print()
print(df4)
print()
print(df5)
print()
print(df6)

   c1  c2  c3
30  40  70
21  50  50
30  50  60
40  20  17

    c1  c4  c3  c9
 15  54  55  11
 46  15  79  56
 78  26  44  74
100  49  10   8

    c1    c2  c3    c4    c9
 30  40.0  70   NaN   NaN
 21  50.0  50   NaN   NaN
 30  50.0  60   NaN   NaN
 40  20.0  17   NaN   NaN
 15   NaN  55  54.0  11.0
 46   NaN  79  15.0  56.0
 78   NaN  44  26.0  74.0
100   NaN  10  49.0   8.0

    c1    c4  c3    c9    c2
 15  54.0  55  11.0   NaN
 46  15.0  79  56.0   NaN
 78  26.0  44  74.0   NaN
100  49.0  10   8.0   NaN
 30   NaN  70   NaN  40.0
 21   NaN  50   NaN  50.0
 30   NaN  60   NaN  50.0
 40   NaN  17   NaN  20.0

    c1    c2  c3    c4    c9
 30  40.0  70   NaN   NaN
 21  50.0  50   NaN   NaN
 30  50.0  60   NaN   NaN
 40  20.0  17   NaN   NaN
 15   NaN  55  54.0  11.0
 46   NaN  79  15.0  56.0
 78   NaN  44  26.0  74.0
100   NaN  10  49.0   8.0

    c1  c3
 30  70
 21  50
 30  60
 40  17
 15  55
 46  79
 78  44
100  10

axis=1

If same name exists in both df’s index list, combine them (Result index list consists of left df index, intersection index, right df index)
Each record’s name(index) remains (just connection)
join=’outer’ : return result dataframe (left + intersection + right) from Concat method
join=’inner’ : return result dataframe which contains only intersection index
ignore_index=True : original index removed + set again as integer index

import pandas as pd

df1 = pd.DataFrame({'c1' : [30, 21, 30, 40], 'c2' : [40, 50, 50, 20], 'c3' : [70, 50, 60, 17]}, index=[0, 1, 2, 3])

df2 = pd.DataFrame({'c1' : [15, 46, 78, 100], 'c4' : [54, 15, 26, 49], 'c3' : [55, 79, 44, 10], 'c9' : [11, 56, 74, 8]}, index=[2, 3, 4, 5])

df7 = pd.concat([df1, df2], axis=1, join='outer')
df8 = pd.concat([df2, df1], axis=1, join='outer')
df9 = pd.concat([df1, df2], axis=1, join='inner')

print(df7)
print()
print(df8)
print()
print(df9)

     c1    c2    c3     c1    c4    c3    c9
30.0  40.0  70.0    NaN   NaN   NaN   NaN
21.0  50.0  50.0    NaN   NaN   NaN   NaN
30.0  50.0  60.0   15.0  54.0  55.0  11.0
40.0  20.0  17.0   46.0  15.0  79.0  56.0
 NaN   NaN   NaN   78.0  26.0  44.0  74.0
 NaN   NaN   NaN  100.0  49.0  10.0   8.0

      c1    c4    c3    c9    c1    c2    c3
  NaN   NaN   NaN   NaN  30.0  40.0  70.0
  NaN   NaN   NaN   NaN  21.0  50.0  50.0
 15.0  54.0  55.0  11.0  30.0  50.0  60.0
 46.0  15.0  79.0  56.0  40.0  20.0  17.0
 78.0  26.0  44.0  74.0   NaN   NaN   NaN
100.0  49.0  10.0   8.0   NaN   NaN   NaN

   c1  c2  c3  c1  c4  c3  c9
30  50  60  15  54  55  11
40  20  17  46  15  79  56

We can make connection between dataframe and series or between series and series

import pandas as pd

df1 = pd.DataFrame({'c1' : [30, 21, 30, 40], 'c2' : [40, 50, 50, 20], 'c3' : [70, 50, 60, 17]}, index=[0, 1, 2, 3])

s1 = pd.Series([45, 100], name='c10', index=[2, 3])
s2 = pd.Series([50, 11], name='c11', index=[2, 4])

df10 = pd.concat([df1, s1], axis=1)
df11 = pd.concat([s1, s2], axis=1)
s3 = pd.concat([s1, s2], axis=0)

print(df10)
print()
print(df11)
print()
print(s3)

   c1  c2  c3    c10
30  40  70    NaN
21  50  50    NaN
30  50  60   45.0
40  20  17  100.0

     c10   c11
 45.0  50.0
100.0   NaN
  NaN  11.0

   45
  100
   50
   11
dtype: int64

Merge

pd.merge()
connection standard : common data on specific column → combine those records (df1 has n records & df2 has m records that have common data on specific column compared with counterpart df → Number of total intersection records is n x m)

import pandas as pd


df1 = pd.DataFrame({'math' : [30, 21, 30, 40], 'english' : [40, 50, 50, 20], 'biology2' : [70, 50, 60, 17]}, index=[0, 1, 2, 3])


df2 = pd.DataFrame({'math' : [15, 30, 78, 30], 'biology1' : [50, 17, 26, 49], 'physics' : [55, 79, 44, 10], 'english' : [11, 50, 74, 8]}, index=[2, 3, 4, 5])


df3 = pd.merge(df1, df2, how='outer', on='math')
df4 = pd.merge(df1, df2, how='inner', on='math')
df5 = pd.merge(df2, df1, how='outer', on='math')

print(df1)
print()
print(df2)
print()
print(df3)
print()
print(df4)
print()
print(df5)

   math  english  biology2
  30       40        70
  21       50        50
  30       50        60
  40       20        17

   math  biology1  physics  english
  15        50       55       11
  30        17       79       50
  78        26       44       74
  30        49       10        8

   math  english_x  biology2  biology1  physics  english_y
  30       40.0      70.0      17.0     79.0       50.0
  30       40.0      70.0      49.0     10.0        8.0
  30       50.0      60.0      17.0     79.0       50.0
  30       50.0      60.0      49.0     10.0        8.0
  21       50.0      50.0       NaN      NaN        NaN
  40       20.0      17.0       NaN      NaN        NaN
  15        NaN       NaN      50.0     55.0       11.0
  78        NaN       NaN      26.0     44.0       74.0

   math  english_x  biology2  biology1  physics  english_y
  30         40        70        17       79         50
  30         40        70        49       10          8
  30         50        60        17       79         50
  30         50        60        49       10          8

   math  biology1  physics  english_x  english_y  biology2
  15      50.0     55.0       11.0        NaN       NaN
  30      17.0     79.0       50.0       40.0      70.0
  30      17.0     79.0       50.0       50.0      60.0
  30      49.0     10.0        8.0       40.0      70.0
  30      49.0     10.0        8.0       50.0      60.0
  78      26.0     44.0       74.0        NaN       NaN
  21       NaN      NaN        NaN       50.0      50.0
  40       NaN      NaN        NaN       20.0      17.0

df3 = pd.merge(df1, df2, how=’outer’, on=’math’)
- on=’math’ : connect the records that have same data on ‘math’ column (df1 has 2 records and df2 has 2 records that have 30 as data on ‘math’ → Consequently, the number of intersection records is 4)
- how=’outer’ : return result dataframe (left + intersection + right)
df4 = pd.merge(df1, df2, how=’inner’, on=’math’)
- how=’inner’ : return result dataframe (only intersection records)

import pandas as pd

df1 = pd.DataFrame({'math' : [30, 21, 30, 40], 'english' : [40, 50, 50, 20], 'biology2' : [70, 50, 60, 17]}, index=[0, 1, 2, 3])

df2 = pd.DataFrame({'math' : [15, 30, 78, 30], 'biology1' : [50, 17, 26, 49], 'physics' : [55, 79, 44, 10], 'english' : [11, 50, 74, 8]}, index=[2, 3, 4, 5])

df6 = pd.merge(df2, df1, how='outer', on=['math', 'english'])

print(df6)

   math  biology1  physics  english  biology2
  15      50.0     55.0       11       NaN
  30      17.0     79.0       50      60.0
  78      26.0     44.0       74       NaN
  30      49.0     10.0        8       NaN
  30       NaN      NaN       40      70.0
  21       NaN      NaN       50      50.0
  40       NaN      NaN       20      17.0

df6 = pd.merge(df2, df1, how=’outer’, on=[‘math’, ‘english’])
- on=[‘math’, ‘english’] : connect the records that have same data on both ‘math’ column and ‘english’ column (df1 has 1 record and df2 has 1 record that have (30, 50) as data on ‘math’ and ‘english’ → Consequently, the number of intersection records is 1)

import pandas as pd

df1 = pd.DataFrame({'math' : [30, 21, 30, 40], 'english' : [40, 50, 50, 20], 'biology2' : [70, 50, 60, 17]}, index=[0, 1, 2, 3])

df2 = pd.DataFrame({'math' : [15, 30, 78, 30], 'biology1' : [50, 17, 26, 49], 'physics' : [55, 79, 44, 10], 'english' : [11, 50, 74, 8]}, index=[2, 3, 4, 5])

df7 = pd.merge(df1, df2, how='left', on='math')
df8 = pd.merge(df1, df2, how='right', on='math')

print(df7)
print()
print(df8)

   math  english_x  biology2  biology1  physics  english_y
  30         40        70      17.0     79.0       50.0
  30         40        70      49.0     10.0        8.0
  21         50        50       NaN      NaN        NaN
  30         50        60      17.0     79.0       50.0
  30         50        60      49.0     10.0        8.0
  40         20        17       NaN      NaN        NaN

   math  english_x  biology2  biology1  physics  english_y
  15        NaN       NaN        50       55         11
  30       40.0      70.0        17       79         50
  30       50.0      60.0        17       79         50
  78        NaN       NaN        26       44         74
  30       40.0      70.0        49       10          8
  30       50.0      60.0        49       10          8

df7 = pd.merge(df1, df2, how=’left’, on=’math’)
- how=’left’ : return result dataframe (left + intersection)
df8 = pd.merge(df1, df2, how=’right’, on=’math’)
- how=’right’ : return result dataframe (intersection + right)

import pandas as pd

df1 = pd.DataFrame({'math' : [30, 21, 30, 40], 'english' : [40, 50, 50, 20], 'biology2' : [70, 50, 60, 17]}, index=[0, 1, 2, 3])

df2 = pd.DataFrame({'math' : [15, 30, 78, 30], 'biology1' : [50, 17, 26, 49], 'physics' : [55, 79, 44, 10], 'english' : [11, 50, 74, 8]}, index=[2, 3, 4, 5])

df9 = pd.merge(df1, df2, how='outer', left_on='biology2', right_on='biology1')

print(df9)

   math_x  english_x  biology2  math_y  biology1  physics  english_y
  30.0       40.0      70.0     NaN       NaN      NaN        NaN
  21.0       50.0      50.0    15.0      50.0     55.0       11.0
  30.0       50.0      60.0     NaN       NaN      NaN        NaN
  40.0       20.0      17.0    30.0      17.0     79.0       50.0
   NaN        NaN       NaN    78.0      26.0     44.0       74.0
   NaN        NaN       NaN    30.0      49.0     10.0        8.0

df9 = pd.merge(df1, df2, how=’outer’, left_on=’biology2’, right_on=’biology1’)
- left_on=’biology2’, right_on=’biology1’ : connect the records that df1 value from biology2 is equal to df2 value from biology1 (df1 has 1 record and df2 has 1 record that have 50 as data on ‘biology2’ and ‘biology1’ → Consequently, the number of intersection records is 1) + (df1 has 1 record and df2 has 1 record that have 17 as data on ‘biology2’ and ‘biology1’ → Consequently, the number of intersection records is 1 )

Join

df1.join(df2)
Similar with merge, but the connection standard is common row index (There should not be common name on 2 column lists)

import pandas as pd

df1 = pd.DataFrame({'math1' : [30, 21, 30, 40], 'english1' : [40, 50, 50, 20], 'biology2' : [70, 50, 60, 17]}, index=["Cho", "Kim", "Rivera", "Samoh"])


df2 = pd.DataFrame({'math2' : [15, 30, 78, 30], 'biology1' : [50, 17, 26, 49], 'physics' : [55, 79, 44, 10], 'english2' : [11, 50, 74, 8]}, index=["Sanford", "Cho", "Samoh", "Dongmo"])

df3 = df1.join(df2, how='outer')
df4 = df1.join(df2, how='inner')

print(df1)
print()
print(df2)
print()
print(df3)
print()
print(df4)

        math1  english1  biology2
Cho        30        40        70
Kim        21        50        50
Rivera     30        50        60
Samoh      40        20        17

         math2  biology1  physics  english2
Sanford     15        50       55        11
Cho         30        17       79        50
Samoh       78        26       44        74
Dongmo      30        49       10         8

         math1  english1  biology2  math2  biology1  physics  english2
Cho       30.0      40.0      70.0   30.0      17.0     79.0      50.0
Dongmo     NaN       NaN       NaN   30.0      49.0     10.0       8.0
Kim       21.0      50.0      50.0    NaN       NaN      NaN       NaN
Rivera    30.0      50.0      60.0    NaN       NaN      NaN       NaN
Samoh     40.0      20.0      17.0   78.0      26.0     44.0      74.0
Sanford    NaN       NaN       NaN   15.0      50.0     55.0      11.0

       math1  english1  biology2  math2  biology1  physics  english2
Cho       30        40        70     30        17       79        50
Samoh     40        20        17     78        26       44        74

df3 = df1.join(df2, how=’outer’)
- Common indexes are ‘Cho’ and ‘Samoh’, so it connects between ‘Cho’ record from df1 and ‘Cho’ record from df2 + between ‘Samoh’ record from df1 and ‘Samoh’ record from df2
- how=’outer’ : return result dataframe (left + intersection + right)
df4 = df1.join(df2, how=’inner’)
- Common indexes are ‘Cho’ and ‘Samoh’, so it connects between ‘Cho’ record from df1 and ‘Cho’ record from df2 + between ‘Samoh’ record from df1 and ‘Samoh’ record from df2
- how=’inner’ : return result dataframe (only intersection records)

Two Dataframe Connection (pandas)

Concat

Merge

Join

Bioinformatics

Error

Concat

Merge

Join

Templates (for web app):

Error