協作閣

Open-source collaboration blog

One Hot Encoding practice

Jessy Chen / 2019-05-10 /


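  1. In data preprocessing, besides handling issues such as missing values, categorical data must be converted into a form the computer can work with. One Hot Encoding, which I practiced this week, is one way to do that.
  2. For example, when working with country population data, we may want to group countries by continent into five categories: Asia, Europe, the Americas, Africa, and Oceania. A model works on numbers, but if we map the five continents to the integers 1 to 5, it may treat the continents as having an order. Instead, we can turn each category into its own column of 0s and 1s. For every observation (row), exactly one column contains a 1: a single "hot" position (One Hot) in the encoding (Encoding).
  3. In Python, One Hot Encoding can be done with the sklearn package.
  4. In older versions of sklearn (before 0.20), OneHotEncoder only accepts integer input, so the column has to go through Label Encoding first, as in the code below. From version 0.20 on, OneHotEncoder accepts string categories directly and this step is optional.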
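
The continent example above can be sketched in a few lines. This is only an illustration: the toy series `continents` and its values are assumptions, not data from the post, and it uses pandas rather than sklearn to keep the sketch short.

```python
import pandas as pd

# Toy data (assumed for illustration): one continent label per country.
continents = pd.Series(
    ["Asia", "Europe", "Americas", "Africa", "Oceania", "Asia"],
    name="continent",
)

# Integer codes impose an arbitrary (here alphabetical) order:
# Africa=0, Americas=1, Asia=2, Europe=3, Oceania=4.
codes = continents.astype("category").cat.codes

# One-hot encoding: one 0/1 column per category, exactly one 1 per row.
one_hot = pd.get_dummies(continents, dtype=int)

print(codes.tolist())
print(one_hot)
```

The integer codes suggest that Oceania (4) is somehow "greater" than Africa (0), while the one-hot columns carry no such ordering.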

References: https://medium.com/jameslearningnote/%E8%B3%87%E6%96%99%E5%88%86%E6%9E%90-%E6%A9%9F%E5%99%A8%E5%AD%B8%E7%BF%92-%E7%AC%AC2-4%E8%AC%9B-%E8%B3%87%E6%96%99%E5%89%8D%E8%99%95%E7%90%86-missing-data-one-hot-encoding-feature-scaling-3b70a7839b4a https://www.cnblogs.com/zhoukui/p/9159909.html

import os
import numpy as np
import pandas as pd
# Set the data directory and load app_train
dir_data = './data/'
f_app_train = os.path.join(dir_data, 'application_train.csv')
app_train = pd.read_csv(f_app_train)
sub_train = pd.DataFrame(app_train['WEEKDAY_APPR_PROCESS_START'])
print(sub_train.shape)
sub_train.head()
(307511, 1)
WEEKDAY_APPR_PROCESS_START
0 WEDNESDAY
1 MONDAY
2 MONDAY
3 WEDNESDAY
4 THURSDAY
from sklearn.preprocessing import LabelEncoder
labelencoder = LabelEncoder()
sub_train["WEEKDAY_APPR_PROCESS_START"] = labelencoder.fit_transform(sub_train["WEEKDAY_APPR_PROCESS_START"]) # fit the encoder and replace each label with its integer code
sub_train.head()
WEEKDAY_APPR_PROCESS_START
0 6
1 1
2 1
3 6
4 4
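
To see which integer stands for which weekday, the fitted encoder can be inspected. The sketch below is a minimal example on a small stand-in list (the values in `weekdays` are assumed for illustration, not read from application_train.csv); `classes_` and `inverse_transform` are part of sklearn's LabelEncoder.

```python
from sklearn.preprocessing import LabelEncoder

# Small stand-in for the WEEKDAY_APPR_PROCESS_START column (assumed values).
weekdays = ["WEDNESDAY", "MONDAY", "MONDAY", "WEDNESDAY", "THURSDAY",
            "FRIDAY", "SATURDAY", "SUNDAY", "TUESDAY"]

encoder = LabelEncoder()
codes = encoder.fit_transform(weekdays)

# classes_ holds the sorted unique labels; each label's position is its code.
mapping = {label: code for code, label in enumerate(encoder.classes_)}
print(mapping)

# inverse_transform recovers the original labels from the integer codes.
print(encoder.inverse_transform(codes[:5]))
```

Because the labels are sorted alphabetically, WEDNESDAY becomes 6 and MONDAY becomes 1, which matches the output above. The codes say nothing about the actual order of the days of the week.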
from sklearn.preprocessing import OneHotEncoder
onehotencoder = OneHotEncoder() # encodes every column by default; the categorical_features argument was removed in scikit-learn 0.22
sub_train = onehotencoder.fit_transform(sub_train) # fit the encoder and return the one-hot matrix (sparse)
sub_train
<307511x7 sparse matrix of type '<class 'numpy.float64'>'
    with 307511 stored elements in Compressed Sparse Row format>
sub_train.toarray()
array([[ 0.,  0.,  0., ...,  0.,  0.,  1.],
       [ 0.,  1.,  0., ...,  0.,  0.,  0.],
       [ 0.,  1.,  0., ...,  0.,  0.,  0.],
       ..., 
       [ 0.,  0.,  0., ...,  1.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  1.],
       [ 0.,  0.,  0., ...,  1.,  0.,  0.]])
sub_train.shape
(307511, 7)
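
On sklearn 0.20 or later, the Label Encoding step can be skipped, because OneHotEncoder accepts string columns directly. The following is a minimal sketch on a five-row stand-in DataFrame (`sample` and its values are assumed for illustration), not a rerun of the full dataset.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Five-row stand-in for the real column (assumed values).
sample = pd.DataFrame({
    "WEEKDAY_APPR_PROCESS_START": ["WEDNESDAY", "MONDAY", "MONDAY", "WEDNESDAY", "THURSDAY"]
})

encoder = OneHotEncoder()
encoded = encoder.fit_transform(sample)  # sparse matrix, one column per category
dense = encoded.toarray()

# categories_ lists the labels in the same order as the output columns.
print(encoder.categories_[0])
print(dense)
```

With only three distinct weekdays in the stand-in data, the result has three columns; on the full column it would have seven, as above.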
sub_train_2 = pd.DataFrame(app_train['WEEKDAY_APPR_PROCESS_START'])
pd.get_dummies(sub_train_2)
WEEKDAY_APPR_PROCESS_START_FRIDAY WEEKDAY_APPR_PROCESS_START_MONDAY WEEKDAY_APPR_PROCESS_START_SATURDAY WEEKDAY_APPR_PROCESS_START_SUNDAY WEEKDAY_APPR_PROCESS_START_THURSDAY WEEKDAY_APPR_PROCESS_START_TUESDAY WEEKDAY_APPR_PROCESS_START_WEDNESDAY
0 0 0 0 0 0 0 1
1 0 1 0 0 0 0 0
2 0 1 0 0 0 0 0
3 0 0 0 0 0 0 1
4 0 0 0 0 1 0 0
5 0 0 0 0 0 0 1
6 0 0 0 1 0 0 0
7 0 1 0 0 0 0 0
8 0 0 0 0 0 0 1
9 0 0 0 0 1 0 0
10 0 0 1 0 0 0 0
11 1 0 0 0 0 0 0
12 1 0 0 0 0 0 0
13 0 0 0 0 1 0 0
14 0 1 0 0 0 0 0
15 0 0 1 0 0 0 0
16 0 0 0 0 1 0 0
17 0 1 0 0 0 0 0
18 1 0 0 0 0 0 0
19 0 1 0 0 0 0 0
20 1 0 0 0 0 0 0
21 0 1 0 0 0 0 0
22 0 0 0 0 1 0 0
23 1 0 0 0 0 0 0
24 0 0 0 0 1 0 0
25 0 0 1 0 0 0 0
26 0 1 0 0 0 0 0
27 0 0 1 0 0 0 0
28 0 0 0 0 0 0 1
29 0 0 0 0 0 1 0
... ... ... ... ... ... ... ...
307481 1 0 0 0 0 0 0
307482 0 0 0 0 1 0 0
307483 0 1 0 0 0 0 0
307484 0 1 0 0 0 0 0
307485 0 0 1 0 0 0 0
307486 0 0 1 0 0 0 0
307487 0 0 0 0 0 0 1
307488 0 0 0 0 0 0 1
307489 1 0 0 0 0 0 0
307490 0 0 1 0 0 0 0
307491 0 0 1 0 0 0 0
307492 1 0 0 0 0 0 0
307493 0 1 0 0 0 0 0
307494 0 1 0 0 0 0 0
307495 0 0 0 0 0 0 1
307496 0 0 0 0 1 0 0
307497 0 0 0 0 1 0 0
307498 0 1 0 0 0 0 0
307499 0 0 0 0 0 1 0
307500 1 0 0 0 0 0 0
307501 0 0 0 0 0 0 1
307502 0 1 0 0 0 0 0
307503 0 0 1 0 0 0 0
307504 0 0 0 0 0 0 1
307505 0 1 0 0 0 0 0
307506 0 0 0 0 1 0 0
307507 0 1 0 0 0 0 0
307508 0 0 0 0 1 0 0
307509 0 0 0 0 0 0 1
307510 0 0 0 0 1 0 0

307511 rows × 7 columns
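
As a sanity check, the two approaches can be compared directly. The sketch below assumes a small stand-in DataFrame (`sample`, values made up for illustration) and sklearn 0.20 or later; it checks that pd.get_dummies and OneHotEncoder produce the same 0/1 values in the same column order.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Five-row stand-in for the real column (assumed values).
sample = pd.DataFrame({
    "WEEKDAY_APPR_PROCESS_START": ["WEDNESDAY", "MONDAY", "MONDAY", "WEDNESDAY", "THURSDAY"]
})

# pandas: returns a labeled DataFrame with <column>_<category> names.
dummies = pd.get_dummies(sample, dtype=int)

# sklearn: returns a sparse matrix; the labels live in encoder.categories_.
encoder = OneHotEncoder()
onehot = encoder.fit_transform(sample).toarray()

same_values = np.array_equal(dummies.to_numpy(), onehot)
expected_columns = ["WEEKDAY_APPR_PROCESS_START_" + c for c in encoder.categories_[0]]
same_columns = list(dummies.columns) == expected_columns
print(same_values, same_columns)
```

The practical difference is that get_dummies keeps readable column names, while a fitted OneHotEncoder can be reused to transform new data with the same column layout.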