One Hot Encoding practice
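- In the data preprocessing step, besides handling things like missing values, categorical data also has to be converted into a form a computer can work with. One Hot Encoding, this week's exercise, is one way to do that.
- For example, when working with country population data, we may want to group countries by continent into five categories: Asia, Europe, the Americas, Africa, and Oceania. Computers work with numbers, so we could map the five continents to the integers 1 to 5, but a model may then treat the continents as if they were ordered. To avoid that, we go one step further and turn each category into its own column of 0s and 1s. For each observation (row), exactly one column holds a 1: a single "hot" point (One Hot) in the encoding (Encoding).
- In Python, One Hot Encoding can be done with the sklearn package.
- With older versions of sklearn (before 0.20), Label Encoding has to be applied before One Hot Encoding, because OneHotEncoder only accepted integer input. From 0.20 on, it also accepts string categories directly.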
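The continent example above can be sketched by hand before reaching for a library. This is only an illustration: the `samples` list is made-up data, and the column order simply follows the order in which the five continents are listed.

```python
import numpy as np

# The five categories from the example, in listing order (one column each)
continents = ["Asia", "Europe", "Americas", "Africa", "Oceania"]

# Hypothetical observations, invented for illustration
samples = ["Europe", "Asia", "Oceania", "Asia"]

# Map each category name to its column index
column_of = {name: i for i, name in enumerate(continents)}

# One row per observation, one column per category, all zeros to start
one_hot = np.zeros((len(samples), len(continents)), dtype=int)
for row, name in enumerate(samples):
    one_hot[row, column_of[name]] = 1  # the single "hot" cell of this row

print(one_hot)
```

Every row sums to 1, and no column is "larger" than another, which is exactly what removes the artificial ordering that the 1 to 5 coding would introduce.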
References:
https://medium.com/jameslearningnote/%E8%B3%87%E6%96%99%E5%88%86%E6%9E%90-%E6%A9%9F%E5%99%A8%E5%AD%B8%E7%BF%92-%E7%AC%AC2-4%E8%AC%9B-%E8%B3%87%E6%96%99%E5%89%8D%E8%99%95%E7%90%86-missing-data-one-hot-encoding-feature-scaling-3b70a7839b4a
https://www.cnblogs.com/zhoukui/p/9159909.html
import os
import numpy as np
import pandas as pd
# Directory that holds the data files
dir_data = './data/'
f_app_train = os.path.join(dir_data, 'application_train.csv')
app_train = pd.read_csv(f_app_train)
# Keep only the categorical column used in this exercise
sub_train = pd.DataFrame(app_train['WEEKDAY_APPR_PROCESS_START'])
print(sub_train.shape)
sub_train.head()
(307511, 1)
|   | WEEKDAY_APPR_PROCESS_START |
|---|---|
| 0 | WEDNESDAY |
| 1 | MONDAY |
| 2 | MONDAY |
| 3 | WEDNESDAY |
| 4 | THURSDAY |
from sklearn.preprocessing import LabelEncoder
labelencoder = LabelEncoder()
sub_train["WEEKDAY_APPR_PROCESS_START"] = labelencoder.fit_transform(sub_train["WEEKDAY_APPR_PROCESS_START"])
sub_train.head()
|   | WEEKDAY_APPR_PROCESS_START |
|---|---|
| 0 | 6 |
| 1 | 1 |
| 2 | 1 |
| 3 | 6 |
| 4 | 4 |
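The integer codes in the output come from LabelEncoder sorting the class labels, so the weekdays are numbered alphabetically rather than in calendar order. A minimal sketch of how to inspect that mapping, using a short made-up `days` list in place of the real column:

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical stand-in for the WEEKDAY_APPR_PROCESS_START column
days = ["WEDNESDAY", "MONDAY", "MONDAY", "WEDNESDAY", "THURSDAY",
        "FRIDAY", "SATURDAY", "SUNDAY", "TUESDAY"]

encoder = LabelEncoder()
codes = encoder.fit_transform(days)

# classes_ holds the sorted labels; a label's position is its integer code
mapping = {label: code for code, label in enumerate(encoder.classes_)}
print(mapping)

# inverse_transform turns the codes back into the original strings
restored = encoder.inverse_transform(codes)
print(list(restored[:5]))
```

Here WEDNESDAY maps to 6 and MONDAY to 1, which matches the table above. The codes carry no real order, which is the reason for one-hot encoding them afterwards.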
from sklearn.preprocessing import OneHotEncoder
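# The categorical_features argument was deprecated in scikit-learn 0.20 and removed in 0.22.
# sub_train has a single column, so the default encoder already encodes exactly that column.
onehotencoder = OneHotEncoder()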
sub_train = onehotencoder.fit_transform(sub_train)
sub_train
<307511x7 sparse matrix of type '<class 'numpy.float64'>'
with 307511 stored elements in Compressed Sparse Row format>
sub_train.toarray()
array([[ 0., 0., 0., ..., 0., 0., 1.],
[ 0., 1., 0., ..., 0., 0., 0.],
[ 0., 1., 0., ..., 0., 0., 0.],
...,
[ 0., 0., 0., ..., 1., 0., 0.],
[ 0., 0., 0., ..., 0., 0., 1.],
[ 0., 0., 0., ..., 1., 0., 0.]])
sub_train.shape
(307511, 7)
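With scikit-learn 0.20 or later, OneHotEncoder accepts string columns directly, so the LabelEncoder step can be skipped and the column order can be read from `categories_`. The sketch below assumes such a version and uses a small made-up `demo` frame instead of `application_train.csv`:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical stand-in for app_train[['WEEKDAY_APPR_PROCESS_START']]
demo = pd.DataFrame({
    "WEEKDAY_APPR_PROCESS_START": ["WEDNESDAY", "MONDAY", "MONDAY",
                                   "WEDNESDAY", "THURSDAY"]
})

encoder = OneHotEncoder()              # returns a sparse matrix by default
encoded = encoder.fit_transform(demo)  # strings go in without label encoding

# categories_ has one array per input column, in output column order
categories = encoder.categories_[0]
dense = encoded.toarray()

print(list(categories))
print(dense)
```

Because `demo` contains only three distinct weekdays, the result has three columns. On the full data the same call would produce the seven columns shown above.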
sub_train_2 = pd.DataFrame(app_train['WEEKDAY_APPR_PROCESS_START'])
pd.get_dummies(sub_train_2)
|   | WEEKDAY_APPR_PROCESS_START_FRIDAY | WEEKDAY_APPR_PROCESS_START_MONDAY | WEEKDAY_APPR_PROCESS_START_SATURDAY | WEEKDAY_APPR_PROCESS_START_SUNDAY | WEEKDAY_APPR_PROCESS_START_THURSDAY | WEEKDAY_APPR_PROCESS_START_TUESDAY | WEEKDAY_APPR_PROCESS_START_WEDNESDAY |
|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 4 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 5 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 6 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 7 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 8 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 9 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 10 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 11 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 12 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 13 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 14 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 15 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 16 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 17 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 18 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 19 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 20 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 21 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 22 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 23 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 24 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 25 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 26 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 27 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 28 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 29 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 307481 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 307482 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 307483 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 307484 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 307485 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 307486 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 307487 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 307488 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 307489 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 307490 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 307491 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 307492 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 307493 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 307494 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 307495 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 307496 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 307497 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 307498 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 307499 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 307500 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 307501 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 307502 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 307503 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 307504 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 307505 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 307506 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 307507 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 307508 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 307509 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 307510 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
307511 rows × 7 columns
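One detail worth knowing when rerunning the `pd.get_dummies` call: pandas 2.0 and later return boolean columns by default, so the output would show True/False instead of the 0/1 values above. A minimal sketch that asks for integers explicitly, again on a made-up `demo` frame rather than the real data:

```python
import pandas as pd

# Hypothetical stand-in for app_train[['WEEKDAY_APPR_PROCESS_START']]
demo = pd.DataFrame({
    "WEEKDAY_APPR_PROCESS_START": ["WEDNESDAY", "MONDAY", "THURSDAY"]
})

# dtype=int keeps the 0/1 representation regardless of the pandas version
dummies = pd.get_dummies(demo, dtype=int)

print(dummies.columns.tolist())
print(dummies)
```

The new column names are the original column name, an underscore, and the category value, which is the same naming seen in the table above.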