Ex 4: Label Propagation digits active learning

Semi-supervised Classification / Example 4: Label Propagation digits active learning

Goal of this example:
    Demonstrate active learning: use label propagation to learn to recognize handwritten digits, repeatedly querying labels for the samples the model is least certain about

1. Active Learning

In practice, a large share of the data we collect is unlabeled. To use it with ordinary classifiers, the most direct approach is to label everything, but hand-labeling every sample is expensive and time-consuming. When unlabeled data far outnumbers labeled data, active learning lets the model itself pick a small set of informative samples to be labeled. Active learning has two parts:
    Train an initial classifier on a small, randomly drawn set of labeled samples.
    Iterate: use the classifier's predictions to select the most uncertain samples, obtain their true labels, and retrain.
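The two steps above can be sketched as a generic pool-based loop. This is a minimal illustration, not the example's actual code; `active_learning_loop` and its parameters are hypothetical names, and the entropy-based query rule mirrors what the example does later:

```python
import numpy as np
from sklearn.semi_supervised import LabelSpreading

def active_learning_loop(X, y_true, initial_labeled, n_rounds=3, batch=5):
    """Pool-based active learning: fit, rank unlabeled samples by the
    entropy of their predicted label distribution, query the top `batch`."""
    labeled = set(initial_labeled)
    for _ in range(n_rounds):
        y_train = np.full(len(y_true), -1)          # -1 marks "unlabeled"
        idx = np.fromiter(labeled, dtype=int)
        y_train[idx] = y_true[idx]
        model = LabelSpreading(gamma=0.25, max_iter=20).fit(X, y_train)
        # Entropy of each transduced label distribution = model uncertainty.
        p = np.clip(model.label_distributions_, 1e-12, 1.0)
        entropy = -(p * np.log(p)).sum(axis=1)
        ranked = [i for i in np.argsort(entropy)[::-1] if i not in labeled]
        labeled.update(ranked[:batch])              # "ask the oracle" for labels
    return model, labeled
```

Each round grows the labeled set by `batch` samples chosen where the model is least confident, which is exactly the loop the rest of this example unrolls.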

2. Importing functions and models

    stats is used for statistics (here, computing entropies)
    LabelSpreading is the semi-supervised learning model
    confusion_matrix computes the confusion matrix
    classification_report compares predictions against the true values, reporting precision, recall, f1-score, and support
```python
import numpy as np
import matplotlib.pyplot as plt

from scipy import stats
from sklearn import datasets
from sklearn.semi_supervised import LabelSpreading
from sklearn.metrics import classification_report, confusion_matrix
```
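A quick look at the two metrics helpers on toy labels (illustration only, not part of the example) shows the shapes of their output:

```python
from sklearn.metrics import classification_report, confusion_matrix

y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred, labels=[0, 1, 2])
print(cm)

# Per-class precision / recall / f1-score and support counts.
print(classification_report(y_true, y_pred))
```

The diagonal of the confusion matrix counts correct predictions; off-diagonal entries show which classes get confused with which.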

3. Building the dataset

    The dataset comes from sklearn.datasets.load_digits: handwritten digits 0~9, 1797 samples in total
    330 samples are used for training (y_train); 40 of them are labeled and the remaining 290 are unlabeled (marked -1)
    The number of iterations is set to 5
    The description on the scikit-learn site says 10 labeled samples, but the source code uses 40, so we follow the source code here
```python
digits = datasets.load_digits()
rng = np.random.RandomState(0)
indices = np.arange(len(digits.data))
rng.shuffle(indices)

X = digits.data[indices[:330]]
y = digits.target[indices[:330]]
images = digits.images[indices[:330]]

n_total_samples = len(y)
n_labeled_points = 40
max_iterations = 5

unlabeled_indices = np.arange(n_total_samples)[n_labeled_points:]
f = plt.figure()
```
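The counts quoted above can be checked directly (a quick sanity check, not part of the example):

```python
import numpy as np
from sklearn import datasets

digits = datasets.load_digits()
print(digits.data.shape)          # (1797, 64): 1797 8x8 images, flattened
print(np.unique(digits.target))   # the ten classes 0..9

# Reproduce the labeling scheme: first 40 labeled, the rest marked -1.
y_train = np.copy(digits.target[:330])
y_train[40:] = -1
print((y_train == -1).sum())      # 290 unlabeled samples
```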

4. Training and prediction with active learning

    The code below is what each iteration does (the body of the for loop)
    Each iteration fits the model and predicts the unlabeled points to get predicted_labels, then compares them with true_labels via the confusion matrix and the classification report
```python
if len(unlabeled_indices) == 0:
    print("No unlabeled items left to label.")
    break
y_train = np.copy(y)
y_train[unlabeled_indices] = -1

lp_model = LabelSpreading(gamma=0.25, max_iter=20)
lp_model.fit(X, y_train)

predicted_labels = lp_model.transduction_[unlabeled_indices]
true_labels = y[unlabeled_indices]

cm = confusion_matrix(true_labels, predicted_labels,
                      labels=lp_model.classes_)

print("Iteration %i %s" % (i, 70 * "_"))
print("Label Spreading model: %d labeled & %d unlabeled (%d total)"
      % (n_labeled_points, n_total_samples - n_labeled_points,
         n_total_samples))

print(classification_report(true_labels, predicted_labels))

print("Confusion matrix")
print(cm)
```
    Use stats to compute the entropy of each predicted label distribution, find the 5 samples with the worst (most uncertain) predictions, and display their predicted labels, true labels, and images
    At the end of each iteration those 5 samples receive their true labels in y_train for the next iteration, while the remaining samples (beyond the labeled set) keep the label -1, i.e. stay unlabeled
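The uncertainty ranking is easy to see on a toy distribution matrix (illustration only; in the example, `lp_model.label_distributions_` plays this role):

```python
import numpy as np
from scipy import stats

# Each row: a predicted class distribution for one sample.
label_distributions = np.array([[0.98, 0.01, 0.01],   # confident
                                [0.34, 0.33, 0.33],   # very uncertain
                                [0.70, 0.20, 0.10]])  # in between

# stats.entropy works column-wise, so transpose as the example does.
pred_entropies = stats.entropy(label_distributions.T)

# Highest-entropy (most uncertain) samples come first.
uncertainty_index = np.argsort(pred_entropies)[::-1]
print(uncertainty_index)  # -> [1 2 0]
```

A near-uniform distribution (row 1) has the highest entropy, so that sample is queried first; a peaked distribution (row 0) is queried last.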
```python
# compute the entropies of transduced label distributions
pred_entropies = stats.distributions.entropy(
    lp_model.label_distributions_.T)

# select up to 5 digit examples that the classifier is most uncertain about
uncertainty_index = np.argsort(pred_entropies)[::-1]
uncertainty_index = uncertainty_index[
    np.in1d(uncertainty_index, unlabeled_indices)][:5]

# keep track of indices that we get labels for
delete_indices = np.array([], dtype=int)

# for more than 5 iterations, visualize the gain only on the first 5
if i < 5:
    f.text(.05, (1 - (i + 1) * .183),
           "model %d\n\nfit with\n%d labels" %
           ((i + 1), i * 5 + 40), size=10)
for index, image_index in enumerate(uncertainty_index):
    image = images[image_index]

    # for more than 5 iterations, visualize the gain only on the first 5
    if i < 5:
        sub = f.add_subplot(5, 5, index + 1 + (5 * i))
        sub.imshow(image, cmap=plt.cm.gray_r, interpolation='none')
        sub.set_title("predict: %i\ntrue: %i" % (
            lp_model.transduction_[image_index], y[image_index]), size=10)
        sub.axis('off')

    # labeling 5 points, remote from labeled set
    delete_index, = np.where(unlabeled_indices == image_index)
    delete_indices = np.concatenate((delete_indices, delete_index))

unlabeled_indices = np.delete(unlabeled_indices, delete_indices)
n_labeled_points += len(uncertainty_index)
```
    The code below sits outside the for loop
```python
f.suptitle("Active learning with Label Propagation.\nRows show 5 most "
           "uncertain labels to learn with the next model.", y=1.15)
plt.subplots_adjust(left=0.2, bottom=0.03, right=0.9, top=0.9, wspace=0.2,
                    hspace=0.85)
plt.show()
```
    Below is the output of each iteration; note that the micro avg rises after every iteration
Out:
```
Iteration 0 ______________________________________________________________________
Label Spreading model: 40 labeled & 290 unlabeled (330 total)
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        22
           1       0.78      0.69      0.73        26
           2       0.93      0.93      0.93        29
           3       1.00      0.89      0.94        27
           4       0.92      0.96      0.94        23
           5       0.96      0.70      0.81        33
           6       0.97      0.97      0.97        35
           7       0.94      0.91      0.92        33
           8       0.62      0.89      0.74        28
           9       0.73      0.79      0.76        34

   micro avg       0.87      0.87      0.87       290
   macro avg       0.89      0.87      0.87       290
weighted avg       0.88      0.87      0.87       290

Confusion matrix
[[22  0  0  0  0  0  0  0  0  0]
 [ 0 18  2  0  0  0  1  0  5  0]
 [ 0  0 27  0  0  0  0  0  2  0]
 [ 0  0  0 24  0  0  0  0  3  0]
 [ 0  1  0  0 22  0  0  0  0  0]
 [ 0  0  0  0  0 23  0  0  0 10]
 [ 0  1  0  0  0  0 34  0  0  0]
 [ 0  0  0  0  0  0  0 30  3  0]
 [ 0  3  0  0  0  0  0  0 25  0]
 [ 0  0  0  0  2  1  0  2  2 27]]
Iteration 1 ______________________________________________________________________
Label Spreading model: 45 labeled & 285 unlabeled (330 total)
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        22
           1       0.79      1.00      0.88        22
           2       1.00      0.93      0.96        29
           3       1.00      1.00      1.00        26
           4       0.92      0.96      0.94        23
           5       0.96      0.70      0.81        33
           6       1.00      0.97      0.99        35
           7       0.94      0.91      0.92        33
           8       0.77      0.86      0.81        28
           9       0.73      0.79      0.76        34

   micro avg       0.90      0.90      0.90       285
   macro avg       0.91      0.91      0.91       285
weighted avg       0.91      0.90      0.90       285

Confusion matrix
[[22  0  0  0  0  0  0  0  0  0]
 [ 0 22  0  0  0  0  0  0  0  0]
 [ 0  0 27  0  0  0  0  0  2  0]
 [ 0  0  0 26  0  0  0  0  0  0]
 [ 0  1  0  0 22  0  0  0  0  0]
 [ 0  0  0  0  0 23  0  0  0 10]
 [ 0  1  0  0  0  0 34  0  0  0]
 [ 0  0  0  0  0  0  0 30  3  0]
 [ 0  4  0  0  0  0  0  0 24  0]
 [ 0  0  0  0  2  1  0  2  2 27]]
Iteration 2 ______________________________________________________________________
Label Spreading model: 50 labeled & 280 unlabeled (330 total)
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        22
           1       0.85      1.00      0.92        22
           2       1.00      1.00      1.00        28
           3       1.00      1.00      1.00        26
           4       0.87      1.00      0.93        20
           5       0.96      0.70      0.81        33
           6       1.00      0.97      0.99        35
           7       0.94      1.00      0.97        32
           8       0.92      0.86      0.89        28
           9       0.73      0.79      0.76        34

   micro avg       0.92      0.92      0.92       280
   macro avg       0.93      0.93      0.93       280
weighted avg       0.93      0.92      0.92       280

Confusion matrix
[[22  0  0  0  0  0  0  0  0  0]
 [ 0 22  0  0  0  0  0  0  0  0]
 [ 0  0 28  0  0  0  0  0  0  0]
 [ 0  0  0 26  0  0  0  0  0  0]
 [ 0  0  0  0 20  0  0  0  0  0]
 [ 0  0  0  0  0 23  0  0  0 10]
 [ 0  1  0  0  0  0 34  0  0  0]
 [ 0  0  0  0  0  0  0 32  0  0]
 [ 0  3  0  0  1  0  0  0 24  0]
 [ 0  0  0  0  2  1  0  2  2 27]]
Iteration 3 ______________________________________________________________________
Label Spreading model: 55 labeled & 275 unlabeled (330 total)
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        22
           1       0.85      1.00      0.92        22
           2       1.00      1.00      1.00        27
           3       1.00      1.00      1.00        26
           4       0.87      1.00      0.93        20
           5       0.96      0.87      0.92        31
           6       1.00      0.97      0.99        35
           7       1.00      1.00      1.00        31
           8       0.92      0.86      0.89        28
           9       0.88      0.85      0.86        33

   micro avg       0.95      0.95      0.95       275
   macro avg       0.95      0.95      0.95       275
weighted avg       0.95      0.95      0.95       275

Confusion matrix
[[22  0  0  0  0  0  0  0  0  0]
 [ 0 22  0  0  0  0  0  0  0  0]
 [ 0  0 27  0  0  0  0  0  0  0]
 [ 0  0  0 26  0  0  0  0  0  0]
 [ 0  0  0  0 20  0  0  0  0  0]
 [ 0  0  0  0  0 27  0  0  0  4]
 [ 0  1  0  0  0  0 34  0  0  0]
 [ 0  0  0  0  0  0  0 31  0  0]
 [ 0  3  0  0  1  0  0  0 24  0]
 [ 0  0  0  0  2  1  0  0  2 28]]
Iteration 4 ______________________________________________________________________
Label Spreading model: 60 labeled & 270 unlabeled (330 total)
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        22
           1       0.96      1.00      0.98        22
           2       1.00      0.96      0.98        27
           3       0.96      1.00      0.98        25
           4       0.86      1.00      0.93        19
           5       0.96      0.87      0.92        31
           6       1.00      0.97      0.99        35
           7       1.00      1.00      1.00        31
           8       0.92      0.96      0.94        25
           9       0.88      0.85      0.86        33

   micro avg       0.96      0.96      0.96       270
   macro avg       0.95      0.96      0.96       270
weighted avg       0.96      0.96      0.96       270

Confusion matrix
[[22  0  0  0  0  0  0  0  0  0]
 [ 0 22  0  0  0  0  0  0  0  0]
 [ 0  0 26  1  0  0  0  0  0  0]
 [ 0  0  0 25  0  0  0  0  0  0]
 [ 0  0  0  0 19  0  0  0  0  0]
 [ 0  0  0  0  0 27  0  0  0  4]
 [ 0  1  0  0  0  0 34  0  0  0]
 [ 0  0  0  0  0  0  0 31  0  0]
 [ 0  0  0  0  1  0  0  0 24  0]
 [ 0  0  0  0  2  1  0  0  2 28]]
```
(figure: a 5×5 grid of digit images; each row shows one iteration's 5 most uncertain samples, titled with the predicted and true labels)

The figure above shows the active-learning training process. The first iteration trains on all 330 samples, of which 40 are labeled and 290 unlabeled, and then predicts the unlabeled samples; the 5 worst (most uncertain) predictions are displayed as the first row of images. Those 5 samples are then given their true labels, so the next iteration has 45 labeled and 285 unlabeled samples, again 330 in total, for the second round of training, and so on. As a result, the labeled set grows by 5 samples with every round of training.

5. Source code listing

Python source code: plot_label_propagation_digits_active_learning.py
```python
print(__doc__)

# Authors: Clay Woolam <[email protected]>
# License: BSD

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

from sklearn import datasets
from sklearn.semi_supervised import LabelSpreading
from sklearn.metrics import classification_report, confusion_matrix

digits = datasets.load_digits()
rng = np.random.RandomState(0)
indices = np.arange(len(digits.data))
rng.shuffle(indices)

X = digits.data[indices[:330]]
y = digits.target[indices[:330]]
images = digits.images[indices[:330]]

n_total_samples = len(y)
n_labeled_points = 40
max_iterations = 5

unlabeled_indices = np.arange(n_total_samples)[n_labeled_points:]
f = plt.figure()

for i in range(max_iterations):
    if len(unlabeled_indices) == 0:
        print("No unlabeled items left to label.")
        break
    y_train = np.copy(y)
    y_train[unlabeled_indices] = -1

    lp_model = LabelSpreading(gamma=0.25, max_iter=20)
    lp_model.fit(X, y_train)

    predicted_labels = lp_model.transduction_[unlabeled_indices]
    true_labels = y[unlabeled_indices]

    cm = confusion_matrix(true_labels, predicted_labels,
                          labels=lp_model.classes_)

    print("Iteration %i %s" % (i, 70 * "_"))
    print("Label Spreading model: %d labeled & %d unlabeled (%d total)"
          % (n_labeled_points, n_total_samples - n_labeled_points,
             n_total_samples))

    print(classification_report(true_labels, predicted_labels))

    print("Confusion matrix")
    print(cm)

    # compute the entropies of transduced label distributions
    pred_entropies = stats.distributions.entropy(
        lp_model.label_distributions_.T)

    # select up to 5 digit examples that the classifier is most uncertain about
    uncertainty_index = np.argsort(pred_entropies)[::-1]
    uncertainty_index = uncertainty_index[
        np.in1d(uncertainty_index, unlabeled_indices)][:5]

    # keep track of indices that we get labels for
    delete_indices = np.array([], dtype=int)

    # for more than 5 iterations, visualize the gain only on the first 5
    if i < 5:
        f.text(.05, (1 - (i + 1) * .183),
               "model %d\n\nfit with\n%d labels" %
               ((i + 1), i * 5 + 10), size=10)
    for index, image_index in enumerate(uncertainty_index):
        image = images[image_index]

        # for more than 5 iterations, visualize the gain only on the first 5
        if i < 5:
            sub = f.add_subplot(5, 5, index + 1 + (5 * i))
            sub.imshow(image, cmap=plt.cm.gray_r, interpolation='none')
            sub.set_title("predict: %i\ntrue: %i" % (
                lp_model.transduction_[image_index], y[image_index]), size=10)
            sub.axis('off')

        # labeling 5 points, remote from labeled set
        delete_index, = np.where(unlabeled_indices == image_index)
        delete_indices = np.concatenate((delete_indices, delete_index))

    unlabeled_indices = np.delete(unlabeled_indices, delete_indices)
    n_labeled_points += len(uncertainty_index)

f.suptitle("Active learning with Label Propagation.\nRows show 5 most "
           "uncertain labels to learn with the next model.", y=1.15)
plt.subplots_adjust(left=0.2, bottom=0.03, right=0.9, top=0.9, wspace=0.2,
                    hspace=0.85)
plt.show()
```