One Hot Encoding practice
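- In the data preprocessing step, besides handling missing values and similar issues, categorical data also has to be converted into a form the computer can work with. One Hot Encoding, the subject of this week's exercise, is one way to do that.
- For example, when working with national population data, we may want to group countries by continent into five categories: Asia, Europe, the Americas, Africa, and Oceania. Computers read numbers, so we could map the five continents to the integers 1 to 5, but a model may then treat the continents as if they had an order. To avoid that, we go one step further and turn the five continents into categories made of 0s and 1s. Each category becomes its own column, so for every observation (row) exactly one column holds a 1. That single "hot" position is where the name One Hot Encoding comes from.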
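- In Python, One Hot Encoding can be done with the sklearn package.
- In older scikit-learn versions (before 0.20), a column has to go through Label Encoding before One Hot Encoding; newer versions of OneHotEncoder also accept string categories directly. This exercise keeps the Label Encoding step.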
References:
https://medium.com/jameslearningnote/%E8%B3%87%E6%96%99%E5%88%86%E6%9E%90-%E6%A9%9F%E5%99%A8%E5%AD%B8%E7%BF%92-%E7%AC%AC2-4%E8%AC%9B-%E8%B3%87%E6%96%99%E5%89%8D%E8%99%95%E7%90%86-missing-data-one-hot-encoding-feature-scaling-3b70a7839b4a
https://www.cnblogs.com/zhoukui/p/9159909.html
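The continent example above can be sketched in a few lines. This is a minimal illustration on made-up data, not part of the original exercise: the `continents` series and its values are hypothetical, and the integer codes here run from 0 to 4 (pandas' convention) rather than 1 to 5.

```python
import pandas as pd

# Hypothetical toy data: one entry per country, labelled with its continent
continents = pd.Series(
    ["Asia", "Europe", "Americas", "Africa", "Oceania", "Asia"],
    name="continent",
)

# Integer codes: compact, but they impose an arbitrary (alphabetical) order
codes = continents.astype("category").cat.codes

# One-hot encoding: one 0/1 column per continent, with no implied order
one_hot = pd.get_dummies(continents, dtype=int)

print(codes.tolist())
print(one_hot)
```

Every row of `one_hot` contains exactly one 1, which is the property the bullet above describes.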
import os
import numpy as np
import pandas as pd
# Set the data directory and read application_train.csv into app_train
dir_data = './data/'
f_app_train = os.path.join(dir_data, 'application_train.csv')
app_train = pd.read_csv(f_app_train)
sub_train = pd.DataFrame(app_train['WEEKDAY_APPR_PROCESS_START'])
print(sub_train.shape)
sub_train.head()
(307511, 1)
|   | WEEKDAY_APPR_PROCESS_START |
|---|---|
| 0 | WEDNESDAY |
| 1 | MONDAY |
| 2 | MONDAY |
| 3 | WEDNESDAY |
| 4 | THURSDAY |
from sklearn.preprocessing import LabelEncoder
labelencoder = LabelEncoder()
sub_train["WEEKDAY_APPR_PROCESS_START"] = labelencoder.fit_transform(sub_train["WEEKDAY_APPR_PROCESS_START"]) # fit the encoder and replace each label with its integer code
sub_train.head()
|   | WEEKDAY_APPR_PROCESS_START |
|---|---|
| 0 | 6 |
| 1 | 1 |
| 2 | 1 |
| 3 | 6 |
| 4 | 4 |
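The codes in the table follow from how LabelEncoder assigns integers: it sorts the distinct labels and numbers them from 0, so WEDNESDAY, the last weekday alphabetically, becomes 6. The sketch below reproduces this on a small hand-written list; `days` is a hypothetical sample whose first five entries mirror the rows shown above, not the real dataset.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample; the first five values mirror the rows shown above
days = ["WEDNESDAY", "MONDAY", "MONDAY", "WEDNESDAY", "THURSDAY",
        "FRIDAY", "SATURDAY", "SUNDAY", "TUESDAY"]

encoder = LabelEncoder()
codes = encoder.fit_transform(days)

# classes_ holds the sorted labels; a label's position is its integer code
mapping = {label: code for code, label in enumerate(encoder.classes_)}
print(mapping)

# inverse_transform recovers the original labels from the codes
original = encoder.inverse_transform(codes)
print(list(original[:5]))
```

Because the numbering is purely alphabetical, the codes carry no meaning about the order of days in a week, which is why the one-hot step that follows is still needed.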
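from sklearn.preprocessing import OneHotEncoder
# The categorical_features argument was removed in scikit-learn 0.22.
# sub_train has a single column, so the default (encode every column) is equivalent.
onehotencoder = OneHotEncoder()
sub_train = onehotencoder.fit_transform(sub_train) # fit and encode; the result is a sparse matrix
sub_train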
<307511x7 sparse matrix of type '<class 'numpy.float64'>'
with 307511 stored elements in Compressed Sparse Row format>
sub_train.toarray()
array([[ 0., 0., 0., ..., 0., 0., 1.],
[ 0., 1., 0., ..., 0., 0., 0.],
[ 0., 1., 0., ..., 0., 0., 0.],
...,
[ 0., 0., 0., ..., 1., 0., 0.],
[ 0., 0., 0., ..., 0., 0., 1.],
[ 0., 0., 0., ..., 1., 0., 0.]])
sub_train.shape
(307511, 7)
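For comparison, here is a minimal sketch of the same step without a separate Label Encoding pass, assuming scikit-learn 0.20 or later, where OneHotEncoder accepts string columns. The `toy` DataFrame is a hypothetical four-row stand-in for the real column.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical four-row stand-in for the real column
toy = pd.DataFrame(
    {"WEEKDAY_APPR_PROCESS_START": ["WEDNESDAY", "MONDAY", "MONDAY", "THURSDAY"]}
)

encoder = OneHotEncoder()
encoded = encoder.fit_transform(toy)   # sparse matrix by default
dense = encoded.toarray()              # dense array, for inspection only

# categories_ lists the sorted labels that define the column order
categories = list(encoder.categories_[0])
print(categories)
print(dense)
```

Only three distinct days appear in the toy data, so the result has three columns; on the full column it would have seven, as in the output above.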
sub_train_2 = pd.DataFrame(app_train['WEEKDAY_APPR_PROCESS_START'])
pd.get_dummies(sub_train_2)
|   | WEEKDAY_APPR_PROCESS_START_FRIDAY | WEEKDAY_APPR_PROCESS_START_MONDAY | WEEKDAY_APPR_PROCESS_START_SATURDAY | WEEKDAY_APPR_PROCESS_START_SUNDAY | WEEKDAY_APPR_PROCESS_START_THURSDAY | WEEKDAY_APPR_PROCESS_START_TUESDAY | WEEKDAY_APPR_PROCESS_START_WEDNESDAY |
|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 4 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 5 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 6 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 7 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 8 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 9 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 10 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 11 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 12 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 13 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 14 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 15 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 16 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 17 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 18 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 19 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 20 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 21 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 22 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 23 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 24 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 25 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 26 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 27 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 28 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 29 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 307481 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 307482 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 307483 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 307484 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 307485 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 307486 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 307487 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 307488 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 307489 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 307490 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 307491 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 307492 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 307493 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 307494 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 307495 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 307496 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 307497 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 307498 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 307499 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 307500 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 307501 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 307502 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 307503 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 307504 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 307505 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 307506 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 307507 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 307508 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 307509 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 307510 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |

307511 rows × 7 columns
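The two routes shown in this notebook, OneHotEncoder and pd.get_dummies, should produce the same 0/1 matrix, since both order the columns by sorted category. The sketch below checks this on a hypothetical four-row `toy` DataFrame, assuming scikit-learn 0.20 or later. Note that pandas 2.0 and later return boolean columns from get_dummies by default, so `dtype=int` is passed here to get 0/1 values like those in the table above.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical four-row stand-in for the real column
toy = pd.DataFrame(
    {"WEEKDAY_APPR_PROCESS_START": ["WEDNESDAY", "MONDAY", "SUNDAY", "MONDAY"]}
)

# pandas route: returns a DataFrame with readable, prefixed column names
dummies = pd.get_dummies(toy, dtype=int)

# scikit-learn route: returns a sparse matrix, densified here for comparison
encoded = OneHotEncoder().fit_transform(toy).toarray()

# Both routes sort the categories, so the matrices line up column by column
same = np.array_equal(dummies.to_numpy(), encoded)
print(list(dummies.columns))
print(same)
```

get_dummies is convenient for exploration because it keeps the column names; a fitted OneHotEncoder is usually the better fit when the same encoding must be reapplied to new data.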