Ex 3: Label Propagation digits: Demonstrating performance


Purpose of this example:
    Train a model on the handwritten-digits dataset using only a small number of labeled samples, demonstrating what semi-supervised learning can do.

1. Semi-supervised learning

In practical applications, most data are unlabeled, and they far outnumber the labeled data. Labeling every sample by hand is very time-consuming, whereas collecting unlabeled data is comparatively easy. Semi-supervised learning (半監督式學習) therefore labels only a small portion of the data, learns from those labeled samples, and then propagates labels to the remaining, unlabeled data.
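
As a quick illustration of this idea (not part of the original example), the sketch below uses scikit-learn's convention of marking unlabeled samples with -1 and lets LabelSpreading infer their labels from the few labeled points. The toy arrays X_toy and y_toy and the parameter values kernel='rbf', gamma=1.0 are invented here purely for demonstration.

import numpy as np
from sklearn.semi_supervised import LabelSpreading

# Six 1-D points forming two well-separated clusters; only one point per
# cluster carries a label, the rest are marked -1 ("unlabeled").
X_toy = np.array([[0.0], [0.1], [0.2], [5.0], [5.1], [5.2]])
y_toy = np.array([0, -1, -1, 1, -1, -1])

toy_model = LabelSpreading(kernel='rbf', gamma=1.0)
toy_model.fit(X_toy, y_toy)

# transduction_ holds the label inferred for every training point,
# including those that started out unlabeled.
print(toy_model.transduction_)   # expected: [0 0 0 1 1 1]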

2. Importing functions and models

    stats is used for statistics (here, to compute entropy)
    LabelSpreading is the semi-supervised learning model
    confusion_matrix computes the confusion matrix
    classification_report compares the predictions with the true values, reporting precision, recall, f1-score, and support

import numpy as np
import matplotlib.pyplot as plt

from scipy import stats
from sklearn import datasets
from sklearn.semi_supervised import LabelSpreading
from sklearn.metrics import confusion_matrix, classification_report

3. Building the dataset

    The dataset comes from sklearn.datasets.load_digits; it contains the handwritten digits 0 to 9, 1797 samples in total
    340 of those samples are used for training: 40 are labeled and the rest are unlabeled
    A copy of the 340 targets (y_train) serves as the training labels, with every label after the first 40 set to -1

digits = datasets.load_digits()
rng = np.random.RandomState(2)
indices = np.arange(len(digits.data))
rng.shuffle(indices)

X = digits.data[indices[:340]]
y = digits.target[indices[:340]]
images = digits.images[indices[:340]]

n_total_samples = len(y)
n_labeled_points = 40

indices = np.arange(n_total_samples)

unlabeled_set = indices[n_labeled_points:]

y_train = np.copy(y)
y_train[unlabeled_set] = -1
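
A quick sanity check (added here for illustration, not part of the original code) confirms the split created above: 40 labeled and 300 unlabeled points out of 340.

# Count how many training labels are real digits vs. the -1 "unlabeled" marker.
print("labeled:", np.sum(y_train != -1))     # expected: 40
print("unlabeled:", np.sum(y_train == -1))   # expected: 300
print("total:", len(y_train))                # expected: 340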

4. Model training and prediction

    Use the trained model to make predictions (predicted_labels) and compute a confusion matrix against the true labels (true_labels)
    Print the classification report (a worked cross-check of these metrics against the confusion matrix is sketched after the output below)
    support is the number of occurrences of each label
    precision is true positives / (true positives + false positives)
    recall is true positives / (true positives + false negatives)
    f1-score is the harmonic mean of precision and recall: 2 × precision × recall / (precision + recall)
    micro avg is the proportion of correct predictions over all samples
    macro avg is the unweighted mean of each metric over the classes
    weighted avg is the support-weighted mean of each metric over the classes

lp_model = LabelSpreading(gamma=.25, max_iter=20)
lp_model.fit(X, y_train)
predicted_labels = lp_model.transduction_[unlabeled_set]
true_labels = y[unlabeled_set]

cm = confusion_matrix(true_labels, predicted_labels, labels=lp_model.classes_)

print("Label Spreading model: %d labeled & %d unlabeled points (%d total)" %
      (n_labeled_points, n_total_samples - n_labeled_points, n_total_samples))

print(classification_report(true_labels, predicted_labels))

print("Confusion matrix")
print(cm)

Out:

Label Spreading model: 40 labeled & 300 unlabeled points (340 total)
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        27
           1       0.82      1.00      0.90        37
           2       1.00      0.86      0.92        28
           3       1.00      0.80      0.89        35
           4       0.92      1.00      0.96        24
           5       0.74      0.94      0.83        34
           6       0.89      0.96      0.92        25
           7       0.94      0.89      0.91        35
           8       1.00      0.68      0.81        31
           9       0.81      0.88      0.84        24

   micro avg       0.90      0.90      0.90       300
   macro avg       0.91      0.90      0.90       300
weighted avg       0.91      0.90      0.90       300

Confusion matrix
[[27  0  0  0  0  0  0  0  0  0]
 [ 0 37  0  0  0  0  0  0  0  0]
 [ 0  1 24  0  0  0  2  1  0  0]
 [ 0  0  0 28  0  5  0  1  0  1]
 [ 0  0  0  0 24  0  0  0  0  0]
 [ 0  0  0  0  0 32  0  0  0  2]
 [ 0  0  0  0  0  1 24  0  0  0]
 [ 0  0  0  0  1  3  0 31  0  0]
 [ 0  7  0  0  0  0  1  0 21  2]
 [ 0  0  0  0  1  2  0  0  0 21]]
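
The per-class numbers in the report can be reproduced directly from the confusion matrix: for class i, precision is the diagonal entry divided by the column sum (everything predicted as i), and recall is the diagonal entry divided by the row sum (everything that truly is i). The sketch below is added for illustration and assumes the cm and np variables from the code above; for example, class 8 gives precision = 21/21 = 1.00, recall = 21/31 ≈ 0.68, f1 ≈ 0.81, matching the report.

# Re-derive per-class precision, recall and f1 from the confusion matrix.
diag = np.diag(cm).astype(float)
precision = diag / cm.sum(axis=0)   # column sums = counts predicted as each class
recall = diag / cm.sum(axis=1)      # row sums = true counts (the "support")
f1 = 2 * precision * recall / (precision + recall)

print(np.round(precision, 2))
print(np.round(recall, 2))
print(np.round(f1, 2))

# macro avg: unweighted mean over classes; weighted avg: mean weighted by support.
print("macro f1:", round(f1.mean(), 2))
print("weighted f1:", round(np.average(f1, weights=cm.sum(axis=1)), 2))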

5. Observing and analyzing the results

    Use stats to compute the entropy of each predicted label distribution and pick out the 10 most uncertain predictions (a tiny illustration of entropy as an uncertainty measure follows the figure below)

# Calculate uncertainty values for each transduced distribution
pred_entropies = stats.distributions.entropy(lp_model.label_distributions_.T)

# Pick the top 10 most uncertain labels
uncertainty_index = np.argsort(pred_entropies)[-10:]

# Plot
f = plt.figure(figsize=(7, 5))
for index, image_index in enumerate(uncertainty_index):
    image = images[image_index]

    sub = f.add_subplot(2, 5, index + 1)
    sub.imshow(image, cmap=plt.cm.gray_r)
    plt.xticks([])
    plt.yticks([])
    sub.set_title('predict: %i\ntrue: %i' % (
        lp_model.transduction_[image_index], y[image_index]))

f.suptitle('Learning with small amount of labeled data')
plt.show()

(Figure: the ten most uncertain predictions, titled "Learning with small amount of labeled data")
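
Entropy is the uncertainty measure used above: a label distribution concentrated on a single class has low entropy, while one spread evenly across the classes has high entropy, so sorting by entropy surfaces the predictions the model is least sure about. A tiny illustration (added here, not part of the original example):

from scipy import stats

# A confident prediction (nearly all probability on one class) has low entropy,
# a maximally uncertain one (uniform over four classes) has entropy ln(4).
print(stats.entropy([0.97, 0.01, 0.01, 0.01]))   # ~0.17
print(stats.entropy([0.25, 0.25, 0.25, 0.25]))   # ~1.39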

6. Full source code listing

Python source code: plot_label_propagation_digits.py

print(__doc__)

# Authors: Clay Woolam <[email protected]>
# License: BSD

import numpy as np
import matplotlib.pyplot as plt

from scipy import stats

from sklearn import datasets
from sklearn.semi_supervised import LabelSpreading

from sklearn.metrics import confusion_matrix, classification_report

digits = datasets.load_digits()
rng = np.random.RandomState(2)
indices = np.arange(len(digits.data))
rng.shuffle(indices)

X = digits.data[indices[:340]]
y = digits.target[indices[:340]]
images = digits.images[indices[:340]]

n_total_samples = len(y)
n_labeled_points = 40

indices = np.arange(n_total_samples)

unlabeled_set = indices[n_labeled_points:]

# #############################################################################
# Shuffle everything around
y_train = np.copy(y)
y_train[unlabeled_set] = -1

# #############################################################################
# Learn with LabelSpreading
lp_model = LabelSpreading(gamma=.25, max_iter=20)
lp_model.fit(X, y_train)
predicted_labels = lp_model.transduction_[unlabeled_set]
true_labels = y[unlabeled_set]

cm = confusion_matrix(true_labels, predicted_labels, labels=lp_model.classes_)

print("Label Spreading model: %d labeled & %d unlabeled points (%d total)" %
      (n_labeled_points, n_total_samples - n_labeled_points, n_total_samples))

print(classification_report(true_labels, predicted_labels))

print("Confusion matrix")
print(cm)

# #############################################################################
# Calculate uncertainty values for each transduced distribution
pred_entropies = stats.distributions.entropy(lp_model.label_distributions_.T)

# #############################################################################
# Pick the top 10 most uncertain labels
uncertainty_index = np.argsort(pred_entropies)[-10:]

# #############################################################################
# Plot
f = plt.figure(figsize=(7, 5))
for index, image_index in enumerate(uncertainty_index):
    image = images[image_index]

    sub = f.add_subplot(2, 5, index + 1)
    sub.imshow(image, cmap=plt.cm.gray_r)
    plt.xticks([])
    plt.yticks([])
    sub.set_title('predict: %i\ntrue: %i' % (
        lp_model.transduction_[image_index], y[image_index]))

f.suptitle('Learning with small amount of labeled data')
plt.show()