Group (pandas)

pandas

  • 1st : split data by the specific standard (grouping)
  • 2nd : The methods for data aggregation or data change are applied to each group (Mapping applied to group : agg, transform, apply, filter)
  • 3rd : combine (Two Dataframe Connection posting)

Grouping

  • split data by the specific standard
  • df.groupby([“standard column1”, “sc2”, “sc3”, …])
    • It returns a object that has key and each group dataframe
    • key : (A value from standard column1, A value from sc2, ….)
    • group dataframe : Dataframe corresponding to a specific key
    • The index on each record is not changed


import pandas as pd
import seaborn as sns

titanic = sns.load_dataset('titanic')

df = titanic.loc[:, ['pclass', 'age', 'sex', 'fare', 'survived', 'embark_town']]

groups = df.groupby(['pclass'])

for key, group in groups:
    print("pclass :", key)
    print('\n')
    print(group.head())
    print('\n')


pclass : 1


    pclass   age     sex     fare  survived  embark_town
1        1  38.0  female  71.2833         1    Cherbourg
3        1  35.0  female  53.1000         1  Southampton
6        1  54.0    male  51.8625         0  Southampton
11       1  58.0  female  26.5500         1  Southampton
23       1  28.0    male  35.5000         1  Southampton


pclass : 2


    pclass   age     sex     fare  survived  embark_town
9        2  14.0  female  30.0708         1    Cherbourg
15       2  55.0  female  16.0000         1  Southampton
17       2   NaN    male  13.0000         1  Southampton
20       2  35.0    male  26.0000         0  Southampton
21       2  34.0    male  13.0000         1  Southampton


pclass : 3


   pclass   age     sex     fare  survived  embark_town
0       3  22.0    male   7.2500         0  Southampton
2       3  26.0  female   7.9250         1  Southampton
4       3  35.0    male   8.0500         0  Southampton
5       3   NaN    male   8.4583         0   Queenstown
7       3   2.0    male  21.0750         0  Southampton


import pandas as pd
import seaborn as sns

titanic = sns.load_dataset('titanic')

df = titanic.loc[:, ['pclass', 'age', 'sex', 'fare', 'survived', 'embark_town']]

groups = df.groupby(['pclass', 'sex'])

for key, group in groups:
    print("(pclass, sex) :", key)
    print('\n')
    print(group.head())
    print('\n')


(pclass, sex) : (1, 'female')


    pclass   age     sex      fare  survived  embark_town
1        1  38.0  female   71.2833         1    Cherbourg
3        1  35.0  female   53.1000         1  Southampton
11       1  58.0  female   26.5500         1  Southampton
31       1   NaN  female  146.5208         1    Cherbourg
52       1  49.0  female   76.7292         1    Cherbourg


(pclass, sex) : (1, 'male')


    pclass   age   sex      fare  survived  embark_town
6        1  54.0  male   51.8625         0  Southampton
23       1  28.0  male   35.5000         1  Southampton
27       1  19.0  male  263.0000         0  Southampton
30       1  40.0  male   27.7208         0    Cherbourg
34       1  28.0  male   82.1708         0    Cherbourg


(pclass, sex) : (2, 'female')


    pclass   age     sex     fare  survived  embark_town
9        2  14.0  female  30.0708         1    Cherbourg
15       2  55.0  female  16.0000         1  Southampton
41       2  27.0  female  21.0000         0  Southampton
43       2   3.0  female  41.5792         1    Cherbourg
53       2  29.0  female  26.0000         1  Southampton


(pclass, sex) : (2, 'male')


    pclass   age   sex  fare  survived  embark_town
17       2   NaN  male  13.0         1  Southampton
20       2  35.0  male  26.0         0  Southampton
21       2  34.0  male  13.0         1  Southampton
33       2  66.0  male  10.5         0  Southampton
70       2  32.0  male  10.5         0  Southampton


(pclass, sex) : (3, 'female')


    pclass   age     sex     fare  survived  embark_town
2        3  26.0  female   7.9250         1  Southampton
8        3  27.0  female  11.1333         1  Southampton
10       3   4.0  female  16.7000         1  Southampton
14       3  14.0  female   7.8542         0  Southampton
18       3  31.0  female  18.0000         0  Southampton


(pclass, sex) : (3, 'male')


    pclass   age   sex     fare  survived  embark_town
0        3  22.0  male   7.2500         0  Southampton
4        3  35.0  male   8.0500         0  Southampton
5        3   NaN  male   8.4583         0   Queenstown
7        3   2.0  male  21.0750         0  Southampton
12       3  20.0  male   8.0500         0  Southampton


  • Accessing each group dataframe
    • groups.get_group(“key”) (if key is numerical type, remove “”)


import pandas as pd
import seaborn as sns

titanic = sns.load_dataset('titanic')

df = titanic.loc[:, ['pclass', 'age', 'sex', 'fare', 'survived', 'embark_town']]

groups = df.groupby(['pclass'])

group = groups.get_group(2)

print(group.head())


    pclass   age     sex     fare  survived  embark_town
9        2  14.0  female  30.0708         1    Cherbourg
15       2  55.0  female  16.0000         1  Southampton
17       2   NaN    male  13.0000         1  Southampton
20       2  35.0    male  26.0000         0  Southampton
21       2  34.0    male  13.0000         1  Southampton


import pandas as pd
import seaborn as sns

titanic = sns.load_dataset('titanic')

df = titanic.loc[:, ['pclass', 'age', 'sex', 'fare', 'survived', 'embark_town']]

groups = df.groupby(['pclass', 'sex'])
group = groups.get_group((2, 'male'))

print(group.head())


    pclass   age   sex  fare  survived  embark_town
17       2   NaN  male  13.0         1  Southampton
20       2  35.0  male  26.0         0  Southampton
21       2  34.0  male  13.0         1  Southampton
33       2  66.0  male  10.5         0  Southampton
70       2  32.0  male  10.5         0  Southampton


agg

  • The methods for data aggregation or data change are applied to each group (agg, transform, apply, filter)
  • agg method can be used only for aggregation method (집계 메소드, ex. mean(), max(), sum(), count(), size(), var(), std(), describe(), info(), first() (extract the first record from each group and connect them), last() (extract the last record from each group and connect them))
    • groups.agg(mainly user-defined aggregation method)
    • The result dataframe’s indexes are keys
    • The parameter of a function used for mapping is a group dataframe
    • If there are some column series that cannot be applied by aggregation method (ex. type is different), it ignores them and automatically calculate only desired columns
    • groups.specific column.agg(aggregation method)
      • The result is Series representing the result from mapping applied to only specific column


    import pandas as pd
    import seaborn as sns
    
    def func(x):                # x → each group's dataframe
        return x.mean() + x.min()
    
    titanic = sns.load_dataset('titanic')
    
    df = titanic.loc[:, ['pclass', 'age', 'sex', 'fare', 'survived', 'embark_town']]
    
    groups = df.groupby(['pclass'])
    
    resDF1 = groups.agg(func)
    resDF2 = groups.age.agg(func)
    
    print(resDF1)
    print()
    print(resDF2)
    
    


    age       fare  survived
    pclass
    1       39.153441  84.154687  0.629630
    2       30.547630  20.662183  0.472826
    3       25.560620  13.675550  0.242363
    
    pclass
    1    39.153441
    2    30.547630
    3    25.560620
    Name: age, dtype: float64
    


    • We can use multiple aggregation methods with agg method
      • groups.agg([method1, method2, method3, ….]) : Multiple methods are applied to all columns
      • groups.agg({‘column1’ : method1, ‘c2’ : [method1, method2], ….}) : Method1 is applied to column1 and Both method1 and method2 are applied to column2, ….


      import pandas as pd
      import seaborn as sns
      
      def func(x):                
          return x.mean() + x.min()
      
      titanic = sns.load_dataset('titanic')
      
      df = titanic.loc[:, ['pclass', 'age', 'sex', 'fare', 'survived', 'embark_town']]
      
      groups = df.groupby(['pclass'])
      
      
      resDF1 = groups.agg([func, max])
      resDF2 = groups.agg({'age' : [func, max], 'fare' : min, 'survived' : max})
      
      
      print(resDF1)
      print()
      print(resDF2)
      
      


                    age             fare            survived    
                   func   max       func       max      func max
      pclass
      1       39.153441  80.0  84.154687  512.3292  0.629630   1
      2       30.547630  70.0  20.662183   73.5000  0.472826   1
      3       25.560620  74.0  13.675550   69.5500  0.242363   1
      
                      age       fare survived
                     func   max  min      max
        pclass
        1       39.153441  80.0  0.0        1
        2       30.547630  70.0  0.0        1
        3       25.560620  74.0  0.0        1
      


    • If we wanna apply just one well-defined aggregation method to all columns or specific columns, we can directly use that method instead of agg (groups.well-defined method())


    import pandas as pd
    import seaborn as sns
    
    titanic = sns.load_dataset('titanic')
    
    df = titanic.loc[:, ['pclass', 'age', 'sex', 'fare', 'survived', 'embark_town']]
    
    groups = df.groupby(['pclass'])
    
    mean_group = groups.mean()
    std_group = groups.std()
    mean_age_group = groups.age.mean()
    
    print(mean_group.head())
    print()
    print(std_group.head())
    print()
    print(mean_age_group)
    


                  age       fare  survived
    pclass
    1       38.233441  84.154687  0.629630
    2       29.877630  20.662183  0.472826
    3       25.140620  13.675550  0.242363
    
                  age       fare  survived
    pclass
    1       14.802856  78.380373  0.484026
    2       14.001077  13.417399  0.500623
    3       12.495398  11.778142  0.428949
    
    pclass
    1    38.233441
    2    29.877630
    3    25.140620
    Name: age, dtype: float64
    


transform

  • The methods for data aggregation or data change are applied to each group (agg, transform, apply, filter)
  • transform method can be used only for method which should be applied to the individual elements of dataframe (not aggregation method)
    • groups.transform(method)
    • The result dataframe’s index : orginal dataframe index (break the group)
    • If there are some column series that cannot be applied by method (ex. type is different), it ignores them and automatically calculate only desired columns
    • groups.specific column.transform(method)
      • The result is Series representing the result from mapping applied to only specific column


    import pandas as pd
    import seaborn as sns
    
    def func(x):                
        return (x - x.mean()) / x.std()         # Even if df is calculated with series, it is automatically calculated with common columns
                                                # z-score
    titanic = sns.load_dataset('titanic')
    
    df = titanic.loc[:, ['pclass', 'age', 'sex', 'fare', 'survived', 'embark_town']]
    
    groups = df.groupby(['pclass'])
    
    
    resDF1 = groups.transform(func)
    resDF2 = groups.age.transform(func)
    
    
    print(resDF1)
    print()
    print(resDF2)
    


              age      fare  survived
    0   -0.251342 -0.545549 -0.565014
    1   -0.015770 -0.164217  0.765188
    2    0.068776 -0.488239  1.766263
    3   -0.218434 -0.396205  0.765188
    4    0.789041 -0.477626 -0.565014
    ..        ...       ...       ...
    886 -0.205529 -0.571063 -0.944475
    887 -1.299306 -0.690922  0.765188
    888       NaN  0.829880 -0.565014
    889 -0.826424 -0.690922  0.765188
    890  0.548953 -0.503097 -0.565014
    
    [891 rows x 3 columns]
    
    0     -0.251342
    1     -0.015770
    2      0.068776
    3     -0.218434
    4      0.789041
    ...
    886   -0.205529
    887   -1.299306
    888         NaN
    889   -0.826424
    890    0.548953
    Name: age, Length: 891, dtype: float64
    


apply

  • The methods for data aggregation or data change are applied to each group (agg, transform, apply, filter)
  • apply method can be used for both aggregation method and method which should be applied to the individual elements of dataframe
  • agg VS apply (with aggregation method)
    • groups.agg(method) == groups.apply(method) (except NaN, NaN == NaN : False)
      • If there are some column series that cannot be applied by method (ex. type is different), apply method cannot ignore them and there gonna be error


    • groups.specific_column.agg(method) == groups.specific_column.apply(method)


    • groups.agg(lambda g : command using g.specific_column) != groups.apply(lambda g : command using g.specific_column)
      • groups.agg(lambda g : command using g.specific_column) → represent all columns but all of column series are same (It is the result from applying method to specific column) → undesired
      • groups.apply(lambda g : command using g.specific_column) → represent only specific column
      import pandas as pd
      import seaborn as sns
      
      titanic = sns.load_dataset('titanic')
      
      df = titanic.loc[:, ['pclass', 'age', 'sex', 'fare', 'survived', 'embark_town']]
      
      groups = df.groupby(['pclass'])
      
      
      resDF1 = groups.agg(lambda g : g.age.max())
      resDF2 = groups.apply(lambda g : g.age.max())
      resDF3 = groups.age.apply(max)
      
      
      print(resDF1)
      print()
      print(resDF2)
      print()
      print(resDF3)
      


               age   sex  fare  survived  embark_town
      pclass
      1       80.0  80.0  80.0      80.0         80.0
      2       70.0  70.0  70.0      70.0         70.0
      3       74.0  74.0  74.0      74.0         74.0
      
      pclass
      1    80.0
      2    70.0
      3    74.0
      dtype: float64
      
      pclass
      1    80.0
      2    70.0
      3    74.0
      Name: age, dtype: float64
      
      


  • transform VS apply (with non-aggregation method)
    • groups.specific_column.transform(method) == groups.specific_column.apply(method)
      • Both yield the mapping result with broken group


    • groups.transform(lambda g : command using g.specific_column) != groups.apply(lambda g : command using g.specific_column)
      • groups.transform(lambda g : command using g.specific_column) → error
      • groups.apply(lambda g : command using g.specific_column) == groups.specific_column.apply(method)


    • groups.transform(method) == groups.apply(method)
      • If there are some column series that cannot be applied by method (ex. type is different), apply method cannot ignore them and there gonna be error
      • If there are multiple columns that are applied by method, apply method returns the dataframe with multi-index (first index : key, second index : original record index)

Filter

  • The methods for data aggregation or data change are applied to each group (agg, transform, apply, filter)
  • Use boolean method and extract the groups that are True from the method → Make dataframe with original dataframe’s order (break the groups)
  • groups.filter(boolean method)


import pandas as pd
import seaborn as sns

titanic = sns.load_dataset('titanic')

df = titanic.loc[:, ['pclass', 'age', 'sex', 'fare', 'survived', 'embark_town']]

groups = df.groupby(['pclass'])


resDF = groups.filter(lambda g : g.survived.sum() > 100)

print(resDF)


     pclass   age     sex     fare  survived  embark_town
0         3  22.0    male   7.2500         0  Southampton
1         1  38.0  female  71.2833         1    Cherbourg
2         3  26.0  female   7.9250         1  Southampton
3         1  35.0  female  53.1000         1  Southampton
4         3  35.0    male   8.0500         0  Southampton
..      ...   ...     ...      ...       ...          ...
885       3  39.0  female  29.1250         0   Queenstown
887       1  19.0  female  30.0000         1  Southampton
888       3   NaN  female  23.4500         0  Southampton
889       1  26.0    male  30.0000         1    Cherbourg
890       3  32.0    male   7.7500         0   Queenstown

[707 rows x 6 columns]


  • Group2 (pclass == 2) is removed

© 2017. All rights reserved.

Powered by Hydejack v조현진