Category (pandas)
in Study / Computer science on Pandas
- Conversion from the continuous data
- One-hot-encoding
Conversion From Continuous Data
- For conversion from the continuous data to category data type
- 1st : divide the section with same width
- data_count, section = np.histogram(df[column’s name], bins=n)
- Given a specific column’s data, it calculates n sections and each section contains the data whose value is involved in that section’s range
- data_count : array that involves the number of data that each section contains
- section : array that involves the boundaries that each section has
- The number of its elements = n + 1 because they are boundaries
- 2nd : assign the name which gonna be the value of category data to each section
- section_names = [“name1”, “name2”, …]
- 3rd : Which section does a specific record (row) involve?
- df[new column’s name] = pd.cut(x=df[column’s name], bins=section, labels=section_names, include_lowest=True)
- If one datum’s value from one record is involved in “name1” section’s range (minimum <= x < maximum), the record’s datum in new column is “name1”
<example> new column 0 name2 1 name1 2 name1 3 name3 4 name2 5 name1 6 name3 . . .
- 1st : divide the section with same width
One-Hot-Encoding
- When we do machine learning, category data should be converted to dummy variable data (one-hot-encoding)
- Dummy variable : only consists of 0 and 1
- 1 : It is involved in a specific category
- 0 : It is not involved in a specific category
- Method 1 : pandas
- Dummy variable form
- Dataframe (index : each record index, column : category data’ names)
<example> name1 name2 name3 0 1 0 0 # 0 record is involved in name1 category 1 0 0 1 2 0 0 1 3 0 1 0 4 1 0 0 5 1 0 0 6 0 0 1 . . . - Dataframe (index : each record index, column : category data’ names)
- dummies = pd.get_dummies(df[column’s name])
- It yields the dummies from a specific column whose dtypes is category
- Dummy variable form
- Method 2 : sklearn
- from sklearn import preprocessing
- Dummy variable form : sparse matrix (It has lots of 0 as data)
- Sparse matrix is represented like this
- (row number, column number) 1.0 : The value on this location of sparse matrix is 1.0
<example from previous dummies> (0, 0) 1.0 (1, 2) 1.0 (2, 2) 1.0 (3, 1) 1.0 (4, 0) 1.0 (5, 0) 1.0 (6, 2) 1.0 . . . <class 'scipy.sparse.csr.csr_matrix'> # The value on left location is 0 - (row number, column number) 1.0 : The value on this location of sparse matrix is 1.0
- Sparse matrix is represented like this
- Code
from sklearn import preprocessing labelEncoder = preprocessing.LabelEncoder() oneHotEncoder = preprocessing.OneHotEncoder() labeledOneHot = LabelEncoder.fit_transform(df['c1']) # c1 column is category data # This code's result is to return numpy.ndarray / Example : [0 2 2 1 0 0 2] (1 record is involved in "name3" category) # This array is 1D vector and it should be converted to n x 1 matrix newLabeledOneHot = labeledOneHot.reshape(len(labeledOneHot, 1)) # len(labeledOneHot) x 1 matrix oneHotResult = oneHotEncoder.fit_transform(newLabeledOneHot) # Sparse matrix