EX 2: Normal and Shrinkage Linear Discriminant Analysis for classification


This example demonstrates how to use scikit-learn's Linear Discriminant Analysis (LDA) to classify data:

1. Use `sklearn.datasets.make_blobs` to generate test data (a minimal sketch of this call follows the list).
2. Use the custom function `generate_data` to build a dataset with several features, only one of which is informative for classification.
3. Use `LinearDiscriminantAnalysis` to classify the data.
4. Compare the behavior of the LDA algorithm with shrinkage turned on and off.
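
As a quick orientation, here is a minimal sketch of the `make_blobs` call used throughout this example; the sample count is chosen arbitrarily for illustration:

```python
from sklearn.datasets import make_blobs

# Draw points from two 1-D Gaussian clusters centered at -2 and +2.
X, y = make_blobs(n_samples=6, n_features=1, centers=[[-2], [2]])
print(X.shape)  # (6, 1): a single informative feature per sample
print(y)        # class labels (0 or 1), one per cluster center
```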

(1) Generating the test data

Looking at the code, it starts with the custom function `generate_data(n_samples, n_features)`, which produces a test set with `n_samples` rows and `n_features` features per row. Only the first feature is useful for deciding the class; the remaining features carry no information. `make_blobs` generates the single informative feature, `np.random.randn` generates the other `n_features - 1` random features, and `np.hstack` joins X and the random features "horizontally" (column-wise).
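
To make the `np.hstack` step concrete, here is a small standalone sketch (shapes picked arbitrarily for illustration):

```python
import numpy as np

informative = np.random.randn(4, 1)  # stands in for the make_blobs output
noise = np.random.randn(4, 2)        # two non-discriminative features
combined = np.hstack([informative, noise])
print(combined.shape)  # (4, 3): the columns are joined side by side
```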
```python
%matplotlib inline
from __future__ import division

import numpy as np
import matplotlib.pyplot as plt

from sklearn.datasets import make_blobs
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

n_train = 20         # samples for training
n_test = 200         # samples for testing
n_averages = 50      # how often to repeat classification
n_features_max = 75  # maximum number of features
step = 4             # step size for the calculation


def generate_data(n_samples, n_features):
    X, y = make_blobs(n_samples=n_samples, n_features=1, centers=[[-2], [2]])
    # add non-discriminative features
    if n_features > 1:
        X = np.hstack([X, np.random.randn(n_samples, n_features - 1)])
    return X, y
```
We can test the custom function with the code below. It returns X (a 10x5 matrix) and y (a vector of 10 elements); a pandas DataFrame makes the data easy to inspect.
```python
X, y = generate_data(10, 5)

import pandas as pd
pd.set_option('display.precision', 2)
df = pd.DataFrame(np.hstack([y.reshape(10, 1), X]))
df.columns = ['y', 'X0', 'X1', 'X2', 'X3', 'X4']
print(df)
```
The output is shown below. Only the first feature column of X (X0) has a clear relationship with the target y: when y is 1, X0 tends to be large.
```
   y    X0    X1    X2    X3    X4
0  1  0.38  0.35  0.80 -0.97 -0.68
1  1  2.41  0.31 -1.47  0.10 -1.39
2  1  1.65 -0.99 -0.12 -0.38  0.18
3  0 -4.86  0.14 -0.80  1.13 -1.31
4  1 -0.06 -1.99 -0.70 -1.26 -1.64
5  0 -1.51 -1.74 -0.83  0.74 -2.07
6  0 -2.50  0.44 -0.45 -0.55 -0.42
7  1  1.55  1.38  0.93 -1.44  0.27
8  0 -1.95  0.32 -0.28  0.02  0.07
9  0 -0.58 -0.07 -1.01  0.15 -1.84
```
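
To check this claim numerically, we can compare per-class feature means on the `df` built above (the exact values vary from run to run):

```python
# X0 should differ strongly between y=0 and y=1, while the noise
# features X1..X4 hover near zero for both classes.
print(df.groupby('y').mean())
```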

(2) Varying the number of features and testing the effect of shrinkage

The code that follows contains two nested loops: the outer loop varies the number of features, while the inner loop repeats the LDA fit many times to average out noise in the accuracy estimate. `LinearDiscriminantAnalysis` trains two classifiers, with `shrinkage='auto'` and `shrinkage=None` switching shrinkage on and off; they are stored as clf1 and clf2. Fresh test data is then generated, each classifier's accuracy is added to score_clf1 and score_clf2, and after the inner loop finishes each sum is divided by the number of repetitions to obtain the average. A single round of this comparison is sketched below, before the full sweep.
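Here is a minimal sketch of one such round, reusing `generate_data` and the constants defined above (the feature count of 50 is chosen arbitrarily for illustration):

```python
# Fit both LDA variants on a small training set, then score them on a
# larger, independently generated test set.
X_train, y_train = generate_data(n_train, 50)
clf_shrink = LinearDiscriminantAnalysis(solver='lsqr', shrinkage='auto').fit(X_train, y_train)
clf_plain = LinearDiscriminantAnalysis(solver='lsqr', shrinkage=None).fit(X_train, y_train)

X_test, y_test = generate_data(n_test, 50)
print('with shrinkage:   ', clf_shrink.score(X_test, y_test))
print('without shrinkage:', clf_plain.score(X_test, y_test))
```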
```python
acc_clf1, acc_clf2 = [], []
n_features_range = range(1, n_features_max + 1, step)
for n_features in n_features_range:
    score_clf1, score_clf2 = 0, 0
    for _ in range(n_averages):
        X, y = generate_data(n_train, n_features)

        clf1 = LinearDiscriminantAnalysis(solver='lsqr', shrinkage='auto').fit(X, y)
        clf2 = LinearDiscriminantAnalysis(solver='lsqr', shrinkage=None).fit(X, y)

        X, y = generate_data(n_test, n_features)
        score_clf1 += clf1.score(X, y)
        score_clf2 += clf2.score(X, y)

    acc_clf1.append(score_clf1 / n_averages)
    acc_clf2.append(score_clf2 / n_averages)
```

(3) Plotting the LDA results

The goal of this example is to see what shrinkage contributes, so two classification-accuracy curves are drawn. The vertical axis is the mean classification accuracy, and the horizontal axis is features_samples_ratio, the ratio of the number of features to the number of training samples in the simulated data. With 75 features and only 20 training rows, features_samples_ratio = 3.75; with so few samples the covariance estimate becomes unreliable and accuracy drops, and it is in this regime that the shrinkage estimator keeps LDA accurate. The shrinkage idea itself is sketched below.
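For reference, `shrinkage='auto'` with the `lsqr` solver uses the Ledoit-Wolf estimate, which blends the empirical covariance with a scaled identity matrix. The same estimator is available directly as `sklearn.covariance.ledoit_wolf`; a minimal sketch, reusing `generate_data` from above:

```python
from sklearn.covariance import ledoit_wolf

X, _ = generate_data(n_train, 50)
shrunk_cov, alpha = ledoit_wolf(X)
# The shrunk estimate is (1 - alpha) * empirical_cov + alpha * mu * I,
# where mu = trace(empirical_cov) / n_features.
print('estimated shrinkage intensity:', alpha)
```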
```python
features_samples_ratio = np.array(n_features_range) / n_train

fig = plt.figure(figsize=(10, 6), dpi=300)
plt.plot(features_samples_ratio, acc_clf1, linewidth=2,
         label="Linear Discriminant Analysis with shrinkage", color='r')
plt.plot(features_samples_ratio, acc_clf2, linewidth=2,
         label="Linear Discriminant Analysis", color='g')
plt.xlabel('n_features / n_samples')
plt.ylabel('Classification accuracy')

plt.legend(loc=1, prop={'size': 10})
plt.show()
```
[Figure: classification accuracy of LDA with shrinkage (red) and without (green) as n_features / n_samples grows]

(4) Complete source code

Python source code: plot_lda.py
```python
from __future__ import division

import numpy as np
import matplotlib.pyplot as plt

from sklearn.datasets import make_blobs
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis


n_train = 20         # samples for training
n_test = 200         # samples for testing
n_averages = 50      # how often to repeat classification
n_features_max = 75  # maximum number of features
step = 4             # step size for the calculation


def generate_data(n_samples, n_features):
    """Generate random blob-ish data with noisy features.

    This returns an array of input data with shape `(n_samples, n_features)`
    and an array of `n_samples` target labels.

    Only one feature contains discriminative information, the other features
    contain only noise.
    """
    X, y = make_blobs(n_samples=n_samples, n_features=1, centers=[[-2], [2]])

    # add non-discriminative features
    if n_features > 1:
        X = np.hstack([X, np.random.randn(n_samples, n_features - 1)])
    return X, y

acc_clf1, acc_clf2 = [], []
n_features_range = range(1, n_features_max + 1, step)
for n_features in n_features_range:
    score_clf1, score_clf2 = 0, 0
    for _ in range(n_averages):
        X, y = generate_data(n_train, n_features)

        clf1 = LinearDiscriminantAnalysis(solver='lsqr', shrinkage='auto').fit(X, y)
        clf2 = LinearDiscriminantAnalysis(solver='lsqr', shrinkage=None).fit(X, y)

        X, y = generate_data(n_test, n_features)
        score_clf1 += clf1.score(X, y)
        score_clf2 += clf2.score(X, y)

    acc_clf1.append(score_clf1 / n_averages)
    acc_clf2.append(score_clf2 / n_averages)

features_samples_ratio = np.array(n_features_range) / n_train

plt.plot(features_samples_ratio, acc_clf1, linewidth=2,
         label="Linear Discriminant Analysis with shrinkage", color='r')
plt.plot(features_samples_ratio, acc_clf2, linewidth=2,
         label="Linear Discriminant Analysis", color='g')

plt.xlabel('n_features / n_samples')
plt.ylabel('Classification accuracy')

plt.legend(loc=1, prop={'size': 12})
plt.suptitle('Linear Discriminant Analysis vs. \
shrinkage Linear Discriminant Analysis (1 discriminative feature)')
plt.show()
```