Supervised Learning Basic Flow with sklearn (Machine Learning)

Machine Learning

  • Basic flow for ML with sklearn
  • The examples are all “Supervised Learning”
  • sklearn ships with example datasets that we can load directly


from sklearn.datasets import #
  • load_iris()
  • load_breast_cancer()
  • feature
    • == characteristic (attribute) of each record
    • ex) In breast cancer, ‘mean radius’ or ‘mean texture’ can be the features
  • The datasets are returned as a dictionary-like object (a Bunch), so each part is read by key (Supervised Learning)
    • data (key): values of each feature per record (ndarray)
    • feature_names (key): names of the features (list or ndarray, depending on the dataset)
    • target (key): label value (for “Classification”) or numeric result value (for “Regression”) per record (ndarray)
    • target_names (key): label names of the target (ndarray)
    from sklearn.datasets import load_breast_cancer
    import pandas as pd
    
    bc_dataset = load_breast_cancer()
    
    print(bc_dataset['data'])
    print('------------')
    print(bc_dataset['feature_names'])
    print('------------')
    print(bc_dataset['target'])
    print('------------')
    print(bc_dataset['target_names'])
    print('------------')
    
    df = pd.DataFrame(data=bc_dataset['data'], columns=bc_dataset['feature_names'])
    df['clf_label'] = bc_dataset['target']
    
    print(df)
    
    
    [[1.799e+01 1.038e+01 1.228e+02 ... 2.654e-01 4.601e-01 1.189e-01]
    [2.057e+01 1.777e+01 1.329e+02 ... 1.860e-01 2.750e-01 8.902e-02]
    [1.969e+01 2.125e+01 1.300e+02 ... 2.430e-01 3.613e-01 8.758e-02]
    ...
    [1.660e+01 2.808e+01 1.083e+02 ... 1.418e-01 2.218e-01 7.820e-02]
    [2.060e+01 2.933e+01 1.401e+02 ... 2.650e-01 4.087e-01 1.240e-01]
    [7.760e+00 2.454e+01 4.792e+01 ... 0.000e+00 2.871e-01 7.039e-02]]
    ------------
    ['mean radius' 'mean texture' 'mean perimeter' 'mean area'
    'mean smoothness' 'mean compactness' 'mean concavity'
    'mean concave points' 'mean symmetry' 'mean fractal dimension'
    'radius error' 'texture error' 'perimeter error' 'area error'
    'smoothness error' 'compactness error' 'concavity error'
    'concave points error' 'symmetry error' 'fractal dimension error'
    'worst radius' 'worst texture' 'worst perimeter' 'worst area'
    'worst smoothness' 'worst compactness' 'worst concavity'
    'worst concave points' 'worst symmetry' 'worst fractal dimension']
    ------------
    [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
    1 0 0 0 0 0 0 0 0 1 0 1 1 1 1 1 0 0 1 0 0 1 1 1 1 0 1 0 0 1 1 1 1 0 1 0 0
    1 0 1 0 0 1 1 1 0 0 1 0 0 0 1 1 1 0 1 1 0 0 1 1 1 0 0 1 1 1 1 0 1 1 0 1 1
    1 1 1 1 1 1 0 0 0 1 0 0 1 1 1 0 0 1 0 1 0 0 1 0 0 1 1 0 1 1 0 1 1 1 1 0 1
    1 1 1 1 1 1 1 1 0 1 1 1 1 0 0 1 0 1 1 0 0 1 1 0 0 1 1 1 1 0 1 1 0 0 0 1 0
    1 0 1 1 1 0 1 1 0 0 1 0 0 0 0 1 0 0 0 1 0 1 0 1 1 0 1 0 0 0 0 1 1 0 0 1 1
    1 0 1 1 1 1 1 0 0 1 1 0 1 1 0 0 1 0 1 1 1 1 0 1 1 1 1 1 0 1 0 0 0 0 0 0 0
    0 0 0 0 0 0 0 1 1 1 1 1 1 0 1 0 1 1 0 1 1 0 1 0 0 1 1 1 1 1 1 1 1 1 1 1 1
    1 0 1 1 0 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 0 1 0 1 1 1 1 0 0 0 1 1
    1 1 0 1 0 1 0 1 1 1 0 1 1 1 1 1 1 1 0 0 0 1 1 1 1 1 1 1 1 1 1 1 0 0 1 0 0
    0 1 0 0 1 1 1 1 1 0 1 1 1 1 1 0 1 1 1 0 1 1 0 0 1 1 1 1 1 1 0 1 1 1 1 1 1
    1 0 1 1 1 1 1 0 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 0 1 0 0 1 0 1 1 1 1 1 0 1 1
    0 1 0 1 1 0 1 0 1 1 1 1 1 1 1 1 0 0 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 0 1
    1 1 1 1 1 1 0 1 0 1 1 0 1 1 1 1 1 0 0 1 0 1 0 1 1 1 1 1 0 1 1 0 1 0 1 0 0
    1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 0 1 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
    1 1 1 1 1 1 1 0 0 0 0 0 0 1]
    ------------
    ['malignant' 'benign']
    ------------
            mean radius  mean texture  mean perimeter  mean area  ...  worst concave points  worst symmetry  worst fractal dimension  clf_label
       0          17.99         10.38          122.80     1001.0  ...                0.2654          0.4601                  0.11890          0
       1          20.57         17.77          132.90     1326.0  ...                0.1860          0.2750                  0.08902          0
       2          19.69         21.25          130.00     1203.0  ...                0.2430          0.3613                  0.08758          0
       3          11.42         20.38           77.58      386.1  ...                0.2575          0.6638                  0.17300          0
       4          20.29         14.34          135.10     1297.0  ...                0.1625          0.2364                  0.07678          0
       ..           ...           ...             ...        ...  ...                   ...             ...                      ...        ...
       564        21.56         22.39          142.00     1479.0  ...                0.2216          0.2060                  0.07115          0
       565        20.13         28.25          131.20     1261.0  ...                0.1628          0.2572                  0.06637          0
       566        16.60         28.08          108.30      858.1  ...                0.1418          0.2218                  0.07820          0
       567        20.60         29.33          140.10     1265.0  ...                0.2650          0.4087                  0.12400          0
       568         7.76         24.54           47.92      181.0  ...                0.0000          0.2871                  0.07039          1
    
       [569 rows x 31 columns]
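As a complement to the listing above, here is a minimal sketch of two other ways to reach the same data: attribute access on the returned object, and a ready-made DataFrame. It assumes a scikit-learn version that supports `as_frame=True` (0.23 or later) and an installed pandas. The variable names are only for illustration.

```python
from sklearn.datasets import load_breast_cancer

bc_dataset = load_breast_cancer()

# The returned object is dictionary-like: each key is also available as an attribute.
same_array = bc_dataset['data'] is bc_dataset.data

# Shape of the feature matrix: (records, features).
n_records, n_features = bc_dataset.data.shape

# as_frame=True asks sklearn to build the pandas objects for us
# (assumption: scikit-learn 0.23 or later, with pandas installed).
bc_frame = load_breast_cancer(as_frame=True)
df = bc_frame.frame                      # features plus a 'target' column
label_counts = df['target'].value_counts()

print(same_array, n_records, n_features)
print(label_counts)
```

With `as_frame=True` the label column is named `target`, not `clf_label` as in the hand-built DataFrame.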
    

Select Train & Test Data from Dataset

  • A dataset (like a DataFrame) usually contains many records
  • We have to decide which records serve as train data and which records serve as test data
  • train_test_split(x, y, test_size=, random_state=, stratify=y)

    from sklearn.model_selection import train_test_split
    
    • For supervised learning, the data has to be split into 2 variables (x: independent variables, y: dependent variable)
    • test_size = α -> test data fraction = α, train data fraction = 1 - α
    • random_state = n (the specific value is not important): with the same n, repeated calls return the same split
    • stratify=(dependent variable): class ratio of each split ≒ class ratio of the whole dataset (useful for “Classification”)
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
import pandas as pd

bc_dataset = load_breast_cancer()

df = pd.DataFrame(data=bc_dataset['data'], columns=bc_dataset['feature_names'])
df['clf_label'] = bc_dataset['target']

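# x: every column except the last one (the features)
df_x = df.iloc[:, :-1]
# y: the label column
df_y = df['clf_label']

# Hold out 20% of the records as test data; stratify keeps the class ratio in both splits
X_train, X_test, Y_train, Y_test = train_test_split(df_x, df_y, test_size=0.2, random_state=1, stratify=df_y)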
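To see what `test_size`, `random_state` and `stratify` actually do, the split can be inspected directly. This is a minimal sketch, assuming `as_frame=True` is available (scikit-learn 0.23 or later). Names such as `same_split` and `ratio_all` are made up for this illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

# X is a DataFrame of features, y is a Series of labels (0 or 1).
X, y = load_breast_cancer(return_X_y=True, as_frame=True)

X_train, X_test, Y_train, Y_test = train_test_split(
    X, y, test_size=0.2, random_state=1, stratify=y
)

# Same random_state -> the same records are selected on a second call.
X_train_again, _, _, _ = train_test_split(
    X, y, test_size=0.2, random_state=1, stratify=y
)
same_split = X_train.index.equals(X_train_again.index)

# Fraction of label 1 in the whole data and in each split.
ratio_all = y.mean()
ratio_train = Y_train.mean()
ratio_test = Y_test.mean()

print(len(X_train), len(X_test))
print(same_split)
print(round(ratio_all, 3), round(ratio_train, 3), round(ratio_test, 3))
```

The three ratios should come out close to each other, which is the effect of `stratify`.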

Supervised Learning Algorithm Overview

  • Estimator
    • Common base for the learning algorithms in sklearn (BaseEstimator)
    • Supervised estimators come in 2 kinds: Classifier (for “Classification”) & Regressor (for “Regression”)
    • A supervised estimator has 2 major methods: fit() (training), predict() (prediction of the result)
      • Transformers (preprocessing and many unsupervised algorithms) use a different pair
        • fit(): learn the parameters from the data
        • transform(): apply the learned transformation to the data
  • Classifier
    • class for “Classification”
    • “Classification”
      • We have to build a model that predicts y (dependent variable) from x (independent variable)
      • If y is categorical (label) data, the task is called “Classification”
    • Type
      • DecisionTreeClassifier
      • GradientBoostingClassifier
      • GaussianNB
      • SVC
      • RandomForestClassifier
  • Regressor
    • class for “Regression”
    • “Regression”
      • We have to build a model that predicts y (dependent variable) from x (independent variable)
      • If y is continuous numeric data rather than labels, the task is called “Regression”
    • Type
      • LinearRegression
      • Lasso
      • GradientBoostingRegressor
      • Ridge
      • RandomForestRegressor
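The shared interface can be sketched with one classifier, one regressor and one transformer. The choice of `load_diabetes` for the regression data and `StandardScaler` for the transformer is an assumption made for this example. The models are fitted on the full data only to show the method calls, not to evaluate them.

```python
from sklearn.base import BaseEstimator, is_classifier, is_regressor
from sklearn.datasets import load_breast_cancer, load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X_clf, y_clf = load_breast_cancer(return_X_y=True)   # y: labels 0 / 1
X_reg, y_reg = load_diabetes(return_X_y=True)        # y: continuous numbers

# Classifier and regressor: same fit() / predict() pair.
clf = DecisionTreeClassifier(random_state=1).fit(X_clf, y_clf)
reg = LinearRegression().fit(X_reg, y_reg)

clf_pred = clf.predict(X_clf[:5])   # predicted class labels
reg_pred = reg.predict(X_reg[:5])   # predicted continuous values

# Transformer: fit() learns the mean and std of each feature, transform() applies them.
scaler = StandardScaler().fit(X_clf)
X_scaled = scaler.transform(X_clf)

print(is_classifier(clf), is_regressor(reg), isinstance(scaler, BaseEstimator))
print(clf_pred, reg_pred)
```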

Training and Prediction with an ML Class Object

  • Create an object of the ML class → fit (training with X_train, Y_train) → predict (with X_test)
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
import pandas as pd

bc_dataset = load_breast_cancer()

df = pd.DataFrame(data=bc_dataset['data'], columns=bc_dataset['feature_names'])
df['clf_label'] = bc_dataset['target']

df_x = df.iloc[:, :-1]
df_y = df['clf_label']

X_train, X_test, Y_train, Y_test = train_test_split(df_x, df_y, test_size=0.2, random_state=1, stratify=df_y)

# Create the model object, train it on the train data, then predict labels for the test data
clf_model = DecisionTreeClassifier(random_state=1)
clf_model.fit(X_train, Y_train)
res = clf_model.predict(X_test)
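What `predict` returns can be inspected as follows. This sketch feeds the raw arrays to `train_test_split` instead of the DataFrame, which is only a shortcut for the example. `res_names` is a name invented here.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

bc_dataset = load_breast_cancer()
X_train, X_test, Y_train, Y_test = train_test_split(
    bc_dataset['data'], bc_dataset['target'],
    test_size=0.2, random_state=1, stratify=bc_dataset['target'],
)

clf_model = DecisionTreeClassifier(random_state=1)
clf_model.fit(X_train, Y_train)

res = clf_model.predict(X_test)           # one label (0 or 1) per test record
proba = clf_model.predict_proba(X_test)   # one row per record, one column per class

# Translate the numeric labels back to their names via target_names.
res_names = bc_dataset['target_names'][res]

print(res[:5], res_names[:5])
print(proba[:5])
```

`predict_proba` is available on classifiers such as `DecisionTreeClassifier`. Not every estimator provides it.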

Evaluation with Prediction and Y_test

  • Compare the prediction from the model with Y_test
  • There are several evaluation metrics
    • example: accuracy_score (fraction of predictions that match Y_test)
    from sklearn.metrics import accuracy_score
    
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
import numpy as np
import pandas as pd

bc_dataset = load_breast_cancer()

df = pd.DataFrame(data=bc_dataset['data'], columns=bc_dataset['feature_names'])
df['clf_label'] = bc_dataset['target']

df_x = df.iloc[:, :-1]
df_y = df['clf_label']

X_train, X_test, Y_train, Y_test = train_test_split(df_x, df_y, test_size=0.2, random_state=1, stratify=df_y)

clf_model = DecisionTreeClassifier(random_state=1)
clf_model.fit(X_train, Y_train)
res = clf_model.predict(X_test)

print(np.round(accuracy_score(Y_test, res), 2))
0.96
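Accuracy is only one of the available metrics. As a sketch of a few others from `sklearn.metrics`, the same prediction can be scored with a confusion matrix, precision and recall. Picking these three is an assumption of this example, not a recommendation from the flow above.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score)
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, Y_train, Y_test = train_test_split(
    X, y, test_size=0.2, random_state=1, stratify=y
)

clf_model = DecisionTreeClassifier(random_state=1)
clf_model.fit(X_train, Y_train)
res = clf_model.predict(X_test)

acc = accuracy_score(Y_test, res)     # share of correct predictions
cm = confusion_matrix(Y_test, res)    # rows: true label, columns: predicted label
prec = precision_score(Y_test, res)   # computed for label 1 by default
rec = recall_score(Y_test, res)       # computed for label 1 by default

print(acc)
print(cm)
print(prec, rec)
```

The diagonal of the confusion matrix counts the correct predictions, so accuracy equals the diagonal sum divided by the number of test records.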

Basic Flow for Supervised Learning

  • 1st: Get the dataset
  • 2nd: Preprocess the dataset → split it into x and y
  • 3rd: Split into X_train, X_test, Y_train, Y_test
  • 4th: Create an object of the ML class → fit (training with X_train, Y_train) → predict (with X_test)
  • 5th: Evaluation (compare the prediction from the model with Y_test)
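The five steps can be wrapped into one small function so that different classifiers run through the same flow. `run_basic_flow` is a hypothetical helper written for this sketch, not part of sklearn. The breast cancer dataset and `accuracy_score` are reused from the earlier examples.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def run_basic_flow(model, test_size=0.2, random_state=1):
    """Hypothetical helper: run the five steps for one classifier."""
    # 1st + 2nd: get the dataset, already split into x and y
    x, y = load_breast_cancer(return_X_y=True)
    # 3rd: split into train and test data
    X_train, X_test, Y_train, Y_test = train_test_split(
        x, y, test_size=test_size, random_state=random_state, stratify=y
    )
    # 4th: fit on the train data, predict on the test data
    model.fit(X_train, Y_train)
    res = model.predict(X_test)
    # 5th: compare the prediction with Y_test
    return accuracy_score(Y_test, res)


scores = {
    'DecisionTreeClassifier': run_basic_flow(DecisionTreeClassifier(random_state=1)),
    'RandomForestClassifier': run_basic_flow(RandomForestClassifier(random_state=1)),
}
for name, score in scores.items():
    print(name, round(score, 2))
```

Only the model object changes between the two runs. The surrounding steps stay identical because every classifier offers the same fit() and predict() methods.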

© 2017. All rights reserved.
