Sklearn-Preprocessing (Machine Learning)
- Before training a machine learning model, we need to preprocess the data
- Null
- Data Encoding
- Feature Scaling
Null
- If a feature has many Null values, we should drop it
- If a feature has only a few Null values and it is an important feature, we should replace them with a specific value (like the mean)
- If a feature has many Null values but it is an important feature, we need to discuss and carefully choose the value that will replace the Nulls
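As a minimal sketch of these rules (the DataFrame and column names below are made up for illustration), pandas can be used to drop or fill Null values:
```python
import numpy as np
import pandas as pd

# Hypothetical DataFrame: "age" has a few Nulls, "cabin" is mostly Null
df = pd.DataFrame({
    "age":   [22.0, np.nan, 35.0, 41.0, np.nan, 29.0],
    "cabin": [np.nan, "C85", np.nan, np.nan, np.nan, np.nan],
    "fare":  [7.25, 71.28, 8.05, 13.00, 9.50, 30.00],
})

# Mostly Null -> drop the column
df = df.drop(columns=["cabin"])

# A few Nulls in an important feature -> fill with a representative value (here: the mean)
df["age"] = df["age"].fillna(df["age"].mean())

print(df.isnull().sum())  # all Null counts should now be 0
```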
Data Encoding
- Machine learning algorithms cannot compute with object (string) features
- Object feature == Category feature + Text feature
- We need to convert them to numerical data
- It is called data encoding
- There are 2 ways for data encoding
- Label Encoding
- One-Hot Encoding
- Label Encoding
- If a single feature has “A”, “B”, “C”, “D” as data
- “A” → 0, “B” → 1, “C” → 2, “D” → 3
- It is easy to understand, but label encoding has problems
- For ML algorithms where the magnitude of a number matters (e.g., linear regression), the encoded values imply an artificial order and can yield low performance
- Tree-based ML algorithms are fine with label encoding
```python
from sklearn.preprocessing import LabelEncoder

alphabet_features = ["A", "B", "C", "D"]

label_encoder = LabelEncoder()
label_encoder.fit(alphabet_features)
labels = label_encoder.transform(alphabet_features)

print(labels)
print("--------------")
print(label_encoder.classes_)
print("--------------")
print(label_encoder.inverse_transform([2, 3, 0, 1]))
```
```
[0 1 2 3]
--------------
['A' 'B' 'C' 'D']
--------------
['C' 'D' 'A' 'B']
```
- One-Hot Encoding
- If there are 4 records and a single feature has “A”, “B”, “C”, “D” as data
Original dataframe:
```
          feature
record 1  "A"
record 2  "D"
record 3  "A"
record 4  "B"
```
↓ one-hot encoded
```
          feature_A  feature_B  feature_C  feature_D
record 1  1          0          0          0
record 2  0          0          0          1
record 3  1          0          0          0
record 4  0          1          0          0
```
- When we fit the encoder, the data needs to be a 2D array
- The result of transforming is a sparse matrix, so we need to convert it to a dense matrix (an ordinary matrix)
```python
from sklearn.preprocessing import OneHotEncoder
import numpy as np

alphabet_features = ["B", "D", "A", "C"]
# OneHotEncoder expects a 2D array: reshape to (n_samples, 1)
alphabet_features = np.array(alphabet_features).reshape(-1, 1)

onehot_encoder = OneHotEncoder()
onehot_encoder.fit(alphabet_features)
labels = onehot_encoder.transform(alphabet_features)

print(labels)            # sparse matrix
print("--------------")
print(labels.toarray())  # dense matrix
```
```
(0, 1)	1.0
(1, 3)	1.0
(2, 0)	1.0
(3, 2)	1.0
--------------
[[0. 1. 0. 0.]
 [0. 0. 0. 1.]
 [1. 0. 0. 0.]
 [0. 0. 1. 0.]]
```
- We can do one-hot encoding with pandas easily
- pd.get_dummies(df)
```python
import pandas as pd

alphabet_features = ["B", "D", "A", "C"]
labels = pd.get_dummies(alphabet_features)
print(labels)
```
```
   A  B  C  D
0  0  1  0  0
1  0  0  0  1
2  1  0  0  0
3  0  0  1  0
```
Feature Scaling
- We bring the value ranges of all features to a similar scale
- Feature Scaling = Standardization + Normalization
- Standardization
- It is usually used when the original data follows a Gaussian (normal) distribution
- It works well with SVM, Linear Regression, and Logistic Regression because these algorithms assume the data follows a Gaussian distribution
- StandardScaler
- scaled_value = (value - mean)/std → mean: 0, std: 1
- Because the mean and std are sensitive to outliers, it is not effective when the data contains many outliers
```python
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
import pandas as pd

bc_dataset = load_breast_cancer()
df = pd.DataFrame(data=bc_dataset['data'], columns=bc_dataset['feature_names'])

scaler = StandardScaler()
scaler.fit(df)
scaled_values = scaler.transform(df)

scaled_df = pd.DataFrame(data=scaled_values, columns=bc_dataset['feature_names'])
print(scaled_df)
print('-------------')
print(scaled_df.mean().head())
print('-------------')
print(scaled_df.var().head())
```
```
     mean radius  mean texture  ...  worst symmetry  worst fractal dimension
0       1.097064     -2.073335  ...        2.750622                 1.937015
1       1.829821     -0.353632  ...       -0.243890                 0.281190
2       1.579888      0.456187  ...        1.152255                 0.201391
3      -0.768909      0.253732  ...        6.046041                 4.935010
4       1.750297     -1.151816  ...       -0.868353                -0.397100
..           ...           ...  ...             ...                      ...
564     2.110995      0.721473  ...       -1.360158                -0.709091
565     1.704854      2.085134  ...       -0.531855                -0.973978
566     0.702284      2.045574  ...       -1.104549                -0.318409
567     1.838341      2.336457  ...        1.919083                 2.219635
568    -1.808401      1.221792  ...       -0.048138                -0.751207

[569 rows x 30 columns]
-------------
mean radius       -3.162867e-15
mean texture      -6.530609e-15
mean perimeter    -7.078891e-16
mean area         -8.799835e-16
mean smoothness    6.132177e-15
dtype: float64
-------------
mean radius        1.001761
mean texture       1.001761
mean perimeter     1.001761
mean area          1.001761
mean smoothness    1.001761
dtype: float64
```
- RobustScaler
- scaled_value = (value - median)/IQR → median: 0, IQR: 1
- Because the median and IQR are not sensitive to outliers, we can use it when the data contains many outliers
```python
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import RobustScaler
import pandas as pd

bc_dataset = load_breast_cancer()
df = pd.DataFrame(data=bc_dataset['data'], columns=bc_dataset['feature_names'])

scaler = RobustScaler()
scaler.fit(df)
scaled_values = scaler.transform(df)

scaled_df = pd.DataFrame(data=scaled_values, columns=bc_dataset['feature_names'])
print(scaled_df)
print('-------------')
print(scaled_df.median().head())
print('-------------')
print((scaled_df.quantile(0.75) - scaled_df.quantile(0.25)).head())
```
```
     mean radius  mean texture  ...  worst symmetry  worst fractal dimension
0       1.132353     -1.502664  ...        2.635556                 1.884578
1       1.764706     -0.190053  ...       -0.106667                 0.435500
2       1.549020      0.428064  ...        1.171852                 0.365664
3      -0.477941      0.273535  ...        5.653333                 4.508244
4       1.696078     -0.799290  ...       -0.678519                -0.158099
..           ...           ...  ...             ...                      ...
564     2.007353      0.630551  ...       -1.128889                -0.431135
565     1.656863      1.671403  ...       -0.370370                -0.662949
566     0.791667      1.641208  ...       -0.894815                -0.089234
567     1.772059      1.863233  ...        1.874074                 2.131911
568    -1.375000      1.012433  ...        0.072593                -0.467992

[569 rows x 30 columns]
-------------
mean radius        0.0
mean texture       0.0
mean perimeter     0.0
mean area          0.0
mean smoothness    0.0
dtype: float64
-------------
mean radius        1.0
mean texture       1.0
mean perimeter     1.0
mean area          1.0
mean smoothness    1.0
dtype: float64
```
- Normalization
- It focuses on the range of the data values rather than on the data distribution
- It can be used when the distribution is not Gaussian
- If we apply a "Log Transformation", we can make the distribution closer to a Gaussian distribution (see the sketch below)
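A minimal sketch of the log-transformation idea, using randomly generated log-normal data rather than a real dataset; np.log1p pulls the skewness toward 0, i.e., closer to a Gaussian shape:
```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(0)

# Log-normal data: heavily right-skewed, not Gaussian
skewed = rng.lognormal(mean=0.0, sigma=1.0, size=1000)

# log1p = log(1 + x); the transformed values have a much more Gaussian-like shape
log_transformed = np.log1p(skewed)

print("skewness before log transform:", skew(skewed))
print("skewness after  log transform:", skew(log_transformed))
```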
- MinMaxScaler
- scaled_value = (value - min)/(max - min) → data range: [0, 1] by default (another range such as [-1, 1] can be set via the feature_range parameter)
- Because the min and max are sensitive to outliers, it is not effective when the data contains many outliers → RobustScaler can be more effective than min-max normalization in that case
```python
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import MinMaxScaler
import pandas as pd

bc_dataset = load_breast_cancer()
df = pd.DataFrame(data=bc_dataset['data'], columns=bc_dataset['feature_names'])

scaler = MinMaxScaler()
scaler.fit(df)
scaled_values = scaler.transform(df)

scaled_df = pd.DataFrame(data=scaled_values, columns=bc_dataset['feature_names'])
print(scaled_df)
print('-------------')
print(scaled_df.min().head())
print('-------------')
print(scaled_df.max().head())
```
```
     mean radius  mean texture  ...  worst symmetry  worst fractal dimension
0       0.521037      0.022658  ...        0.598462                 0.418864
1       0.643144      0.272574  ...        0.233590                 0.222878
2       0.601496      0.390260  ...        0.403706                 0.213433
3       0.210090      0.360839  ...        1.000000                 0.773711
4       0.629893      0.156578  ...        0.157500                 0.142595
..           ...           ...  ...             ...                      ...
564     0.690000      0.428813  ...        0.097575                 0.105667
565     0.622320      0.626987  ...        0.198502                 0.074315
566     0.455251      0.621238  ...        0.128721                 0.151909
567     0.644564      0.663510  ...        0.497142                 0.452315
568     0.036869      0.501522  ...        0.257441                 0.100682

[569 rows x 30 columns]
-------------
mean radius        0.0
mean texture       0.0
mean perimeter     0.0
mean area          0.0
mean smoothness    0.0
dtype: float64
-------------
mean radius        1.0
mean texture       1.0
mean perimeter     1.0
mean area          1.0
mean smoothness    1.0
dtype: float64
```
- Be careful when we transform the test_data into scaled_test_data after training the model with scaled_train_data
- The scaler object should call "fit" only on the train_data
- For the test_data, we must reuse the scaler object that was already fitted on the train_data to transform it into scaled_test_data (see the sketch below)
- If possible, it is even better to scale the whole data first (i.e., call "fit" of the scaler object on the whole data) and then split it into train & test data
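A minimal sketch of this rule, with made-up 1D data just to show which object calls which method (fit only on the train data, transform both):
```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up data just for illustration
train_data = np.arange(0, 11).reshape(-1, 1)  # 0 .. 10
test_data = np.arange(0, 6).reshape(-1, 1)    # 0 .. 5

scaler = MinMaxScaler()

# fit only on the train data: the scaler learns min=0, max=10
scaler.fit(train_data)
scaled_train_data = scaler.transform(train_data)

# reuse the already-fitted scaler for the test data (do NOT call fit again)
scaled_test_data = scaler.transform(test_data)

print(scaled_train_data.ravel())  # [0.  0.1 0.2 ... 1. ]
print(scaled_test_data.ravel())   # [0.  0.1 0.2 0.3 0.4 0.5] -> consistent with the train scale
```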
- Tree-based ML algorithms don't need scaling
- Which scaler should we use?
- It depends on the situation (we need to test each scaler, e.g., with a quick comparison like the sketch below)
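As one possible way to compare scalers (the pipeline + LogisticRegression setup below is an assumption for illustration, not something fixed in these notes), using the same breast cancer dataset as above:
```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

bc_dataset = load_breast_cancer()
X, y = bc_dataset['data'], bc_dataset['target']

# Put each scaler inside a pipeline so "fit" only ever sees the training folds
for scaler in [StandardScaler(), RobustScaler(), MinMaxScaler()]:
    pipeline = make_pipeline(scaler, LogisticRegression(max_iter=5000))
    scores = cross_val_score(pipeline, X, y, cv=5)
    print(f"{scaler.__class__.__name__:>14}: mean accuracy = {scores.mean():.4f}")
```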