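EX 10: K-means Clustering
This example shows the results the K-means algorithm produces with different numbers of clusters and with different initialization settings.
    1. Use datasets.load_iris() to load the built-in iris dataset
    2. Use KMeans to cluster the data
    3. Use Axes3D to display the results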
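To make the algorithm itself concrete, here is a minimal NumPy sketch of the classic K-means iteration: assign each sample to its nearest centroid, then move each centroid to the mean of its samples. The function name kmeans_sketch, its parameters, and the toy data are assumptions made for illustration; this is not how scikit-learn implements KMeans.

```python
import numpy as np


def kmeans_sketch(X, n_clusters, n_iter=100, seed=0):
    """Plain K-means iterations (hypothetical helper, not scikit-learn's code)."""
    rng = np.random.default_rng(seed)
    # Start from n_clusters distinct samples chosen at random.
    centers = X[rng.choice(len(X), size=n_clusters, replace=False)]
    for _ in range(n_iter):
        # Assignment step: index of the nearest center for every sample.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each center to the mean of its samples;
        # an empty cluster keeps its previous center.
        new_centers = np.array([
            X[labels == k].mean(axis=0) if np.any(labels == k) else centers[k]
            for k in range(n_clusters)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers


# Two well-separated toy blobs (made-up data for illustration).
toy = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                [5.0, 5.0], [5.1, 5.2], [5.2, 5.1]])
toy_labels, toy_centers = kmeans_sketch(toy, n_clusters=2)
print(toy_labels)
print(toy_centers)
```

Because the starting centroids are random, which blob receives label 0 depends on the seed; only the grouping is meaningful.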

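(1) Import libraries

The following modules are imported:
    1. numpy : creates and manipulates arrays
    2. matplotlib.pyplot : draws the figures
    3. mpl_toolkits.mplot3d import Axes3D : draws 3D plots
    4. sklearn.cluster import KMeans : partitions the data into clusters
    5. sklearn import datasets : loads the built-in datasets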
    import numpy as np
    import matplotlib.pyplot as plt
    from mpl_toolkits.mplot3d import Axes3D
    from sklearn.cluster import KMeans
    from sklearn import datasets
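    np.random.seed(5)
Fixes the seed of NumPy's global random number generator. KMeans draws its initial centroids from this generator when random_state is not set, so every run, including the repeated initializations controlled by n_init, is reproducible.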
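As a small check of the reproducibility point, the sketch below fits the same estimator twice. It passes random_state directly to KMeans rather than seeding the global generator; that choice, the seed value 5, and the names run_a and run_b are assumptions for illustration.

```python
import numpy as np
from sklearn import datasets
from sklearn.cluster import KMeans

X = datasets.load_iris().data

# Same random_state -> same random initialization -> same clustering.
run_a = KMeans(n_clusters=3, init='random', n_init=1, random_state=5).fit(X)
run_b = KMeans(n_clusters=3, init='random', n_init=1, random_state=5).fit(X)

print(np.array_equal(run_a.labels_, run_b.labels_))
```

Passing random_state keeps the seeding local to one estimator, whereas np.random.seed affects every consumer of the global generator.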
    iris = datasets.load_iris()
    X = iris.data
    y = iris.target
iris = datasets.load_iris() : stores a dict-like object (a scikit-learn Bunch) in iris; X holds the feature matrix and y holds the species label of each sample
    for key, value in iris.items():
        try:
            print(key, value.shape)
        except AttributeError:
            # Entries that are not arrays have no shape; print only the key
            print(key)
    print(iris['feature_names'])
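| Output | Description |
| --- | --- |
| target_names (3,) | Three iris species: setosa, versicolor, virginica |
| data (150, 4) | 150 samples with four features each |
| target (150,) | The species each of the 150 samples belongs to |
| DESCR | Description of the dataset |
| feature_names | Meaning of the four features: sepal length, sepal width, petal length, petal width |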
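The fields in the table can also be read one by one. The following sketch accesses them by key and by attribute and counts the samples per species; the variable name counts is introduced only for this illustration.

```python
import numpy as np
from sklearn import datasets

iris = datasets.load_iris()

# A Bunch supports both key access and attribute access.
print(iris['data'].shape)    # (150, 4)
print(iris.target.shape)     # (150,)
print(iris.feature_names)

# Number of samples per species.
counts = np.bincount(iris.target)
print({str(name): int(c) for name, c in zip(iris.target_names, counts)})
# {'setosa': 50, 'versicolor': 50, 'virginica': 50}
```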

(2) Clustering

    estimators = [('k_means_iris_8', KMeans(n_clusters=8)),
                  ('k_means_iris_3', KMeans(n_clusters=3)),
                  ('k_means_iris_bad_init', KMeans(n_clusters=3, n_init=1,
                                                   init='random'))]
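Sets up the KMeans estimators. The parameters used are:
    n_clusters : number of clusters to form
    init : how the initial centroids are chosen ('random' picks them at random from the data)
    n_init : number of times the k-means algorithm is run with different centroid seeds; the best run is kept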
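To illustrate what n_init does, the sketch below performs the repetition by hand: five single-initialization runs, of which the one with the lowest inertia_ (sum of squared distances to the nearest centroid) is kept. The seeds 0 to 4 and the names single_runs, inertias, and best are arbitrary choices for this example.

```python
from sklearn import datasets
from sklearn.cluster import KMeans

X = datasets.load_iris().data

# Five independent runs, each with a single random initialization.
single_runs = [
    KMeans(n_clusters=3, init='random', n_init=1, random_state=seed).fit(X)
    for seed in range(5)
]
inertias = [run.inertia_ for run in single_runs]

# Keep the run with the lowest inertia, as n_init > 1 would do internally.
best = min(single_runs, key=lambda run: run.inertia_)

print(inertias)
print(best.inertia_)
```

With n_init=1 there is no such selection, which is why the 'bad initialization' estimator can end in a poorer local optimum.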
    fignum = 1
    titles = ['8 clusters', '3 clusters', '3 clusters, bad initialization']
    for name, est in estimators:
        fig = plt.figure(fignum, figsize=(4, 3))
        ax = Axes3D(fig, rect=[0, 0, .95, 1], elev=48, azim=134)
        # Newer Matplotlib releases do not attach the axes automatically
        fig.add_axes(ax)
        est.fit(X)
        labels = est.labels_

        # np.float was removed from NumPy; the builtin float is equivalent
        ax.scatter(X[:, 3], X[:, 0], X[:, 2],
                   c=labels.astype(float), edgecolor='k')

        # Hide the tick labels on all three axes
        ax.xaxis.set_ticklabels([])
        ax.yaxis.set_ticklabels([])
        ax.zaxis.set_ticklabels([])
        ax.set_xlabel('Petal width')
        ax.set_ylabel('Sepal length')
        ax.set_zlabel('Petal length')
        ax.set_title(titles[fignum - 1])
        ax.dist = 12
        fignum = fignum + 1
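    Axes3D : creates a 3D axes on the figure
    est.fit : fits each estimator defined in estimators to the data X
    ax.scatter : draws a scatter plot; the c argument colors each point by its cluster label
    ax.dist : sets the distance between the camera and the plotted object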
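Besides plotting, a fitted estimator can be inspected numerically. This sketch, which assumes a single 3-cluster fit with an arbitrary random_state of 0, prints the size of each cluster and the centroid coordinates for the three features used in the plot (columns 3, 0 and 2).

```python
import numpy as np
from sklearn import datasets
from sklearn.cluster import KMeans

X = datasets.load_iris().data
est = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Number of samples assigned to each cluster.
sizes = np.bincount(est.labels_, minlength=3)
print(sizes)

# One centroid per cluster, one coordinate per feature.
print(est.cluster_centers_.shape)   # (3, 4)

# Centroids restricted to petal width, sepal length, petal length.
print(est.cluster_centers_[:, [3, 0, 2]])
```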
    fig = plt.figure(fignum, figsize=(4, 3))
    ax = Axes3D(fig, rect=[0, 0, .95, 1], elev=48, azim=134)
    # Newer Matplotlib releases do not attach the axes automatically
    fig.add_axes(ax)

    for name, label in [('Setosa', 0),
                        ('Versicolour', 1),
                        ('Virginica', 2)]:
        ax.text3D(X[y == label, 3].mean(),
                  X[y == label, 0].mean(),
                  X[y == label, 2].mean() + 2, name,
                  horizontalalignment='center',
                  bbox=dict(alpha=.2, edgecolor='w', facecolor='w'))
    # Reorder the labels to have colors matching the cluster results
    y = np.choose(y, [1, 2, 0]).astype(float)
    ax.scatter(X[:, 3], X[:, 0], X[:, 2], c=y, edgecolor='k')

    ax.xaxis.set_ticklabels([])
    ax.yaxis.set_ticklabels([])
    ax.zaxis.set_ticklabels([])
    ax.set_xlabel('Petal width')
    ax.set_ylabel('Sepal length')
    ax.set_zlabel('Petal length')
    ax.set_title('Ground Truth')
    ax.dist = 12

    # Display all figures and keep the windows open
    plt.show()
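    np.choose : remaps the original label order (0 1 2) to (1 2 0)
    ax.text3D : labels each group of samples with its species name; X[y == label, 3].mean() and the two similar expressions place the text at the mean X, Y and Z position of that species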
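The relabeling done by np.choose can be seen on a short array. The array y_demo below is made-up data for illustration, and the lookup-array variant is shown only as an equivalent formulation.

```python
import numpy as np

y_demo = np.array([0, 0, 1, 2, 2])        # made-up labels

# Each value v is replaced by the v-th entry of the list: 0 -> 1, 1 -> 2, 2 -> 0.
remapped = np.choose(y_demo, [1, 2, 0])
print(remapped)                           # [1 1 2 0 0]

# The same remapping written as indexing into a lookup array.
lookup = np.array([1, 2, 0])
print(lookup[y_demo])                     # [1 1 2 0 0]
```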

(3) Complete code

Python source code: plot_cluster_iris.py
    """
    =========================================================
    K-means Clustering
    =========================================================

    The plots display firstly what a K-means algorithm would yield
    using three clusters. It is then shown what the effect of a bad
    initialization is on the classification process:
    By setting n_init to only 1 (default is 10), the amount of
    times that the algorithm will be run with different centroid
    seeds is reduced.
    The next plot displays what using eight clusters would deliver
    and finally the ground truth.

    """
    print(__doc__)


    # Code source: Gaël Varoquaux
    # Modified for documentation by Jaques Grobler
    # License: BSD 3 clause

    import numpy as np
    import matplotlib.pyplot as plt
    # Axes3D creates the 3D axes used for every plot below
    from mpl_toolkits.mplot3d import Axes3D

    from sklearn.cluster import KMeans
    from sklearn import datasets

    np.random.seed(5)

    iris = datasets.load_iris()
    X = iris.data
    y = iris.target

    estimators = [('k_means_iris_8', KMeans(n_clusters=8)),
                  ('k_means_iris_3', KMeans(n_clusters=3)),
                  ('k_means_iris_bad_init', KMeans(n_clusters=3, n_init=1,
                                                   init='random'))]

    fignum = 1
    titles = ['8 clusters', '3 clusters', '3 clusters, bad initialization']
    for name, est in estimators:
        fig = plt.figure(fignum, figsize=(4, 3))
        ax = Axes3D(fig, rect=[0, 0, .95, 1], elev=48, azim=134)
        # Newer Matplotlib releases do not attach the axes automatically
        fig.add_axes(ax)
        est.fit(X)
        labels = est.labels_

        # np.float was removed from NumPy; the builtin float is equivalent
        ax.scatter(X[:, 3], X[:, 0], X[:, 2],
                   c=labels.astype(float), edgecolor='k')

        ax.xaxis.set_ticklabels([])
        ax.yaxis.set_ticklabels([])
        ax.zaxis.set_ticklabels([])
        ax.set_xlabel('Petal width')
        ax.set_ylabel('Sepal length')
        ax.set_zlabel('Petal length')
        ax.set_title(titles[fignum - 1])
        ax.dist = 12
        fignum = fignum + 1

    # Plot the ground truth
    fig = plt.figure(fignum, figsize=(4, 3))
    ax = Axes3D(fig, rect=[0, 0, .95, 1], elev=48, azim=134)
    fig.add_axes(ax)

    for name, label in [('Setosa', 0),
                        ('Versicolour', 1),
                        ('Virginica', 2)]:
        ax.text3D(X[y == label, 3].mean(),
                  X[y == label, 0].mean(),
                  X[y == label, 2].mean() + 2, name,
                  horizontalalignment='center',
                  bbox=dict(alpha=.2, edgecolor='w', facecolor='w'))
    # Reorder the labels to have colors matching the cluster results
    y = np.choose(y, [1, 2, 0]).astype(float)
    ax.scatter(X[:, 3], X[:, 0], X[:, 2], c=y, edgecolor='k')

    ax.xaxis.set_ticklabels([])
    ax.yaxis.set_ticklabels([])
    ax.zaxis.set_ticklabels([])
    ax.set_xlabel('Petal width')
    ax.set_ylabel('Sepal length')
    ax.set_zlabel('Petal length')
    ax.set_title('Ground Truth')
    ax.dist = 12

    # Display all figures and keep the windows open
    plt.show()