# Plot Hierarchical Clustering Dendrogram

This example plots a Hierarchical Clustering Dendrogram using `AgglomerativeClustering` and scipy's dendrogram function.

Hierarchical clustering repeatedly merges or splits the data in a layered fashion to produce a final tree structure. There are two common approaches:

- Agglomerative (bottom-up): start from the bottom of the tree and repeatedly merge the closest samples or clusters.
- Divisive (top-down): start from the top of the tree and repeatedly split clusters.
## (1) Import libraries

- `numpy`: array manipulation
- `matplotlib.pyplot`: plotting
- `scipy.cluster.hierarchy.dendrogram`: draws the dendrogram
- `sklearn.datasets.load_iris`: loads the dataset
- `sklearn.cluster.AgglomerativeClustering`: the agglomerative clustering algorithm
```python
import numpy as np

from matplotlib import pyplot as plt
from scipy.cluster.hierarchy import dendrogram
from sklearn.datasets import load_iris
from sklearn.cluster import AgglomerativeClustering
```

## (2) Define the plotting function

- `numpy.zeros()`: creates an array filled with zeros
- `numpy.column_stack()`: stacks 1-D arrays as columns of a 2-D array
- `scipy.cluster.hierarchy.dendrogram(Z, p=30, truncate_mode=None, color_threshold=None, get_leaves=True, orientation='top', labels=None, count_sort=False, distance_sort=False, show_leaf_counts=True, no_plot=False, no_labels=False, leaf_font_size=None, leaf_rotation=None, leaf_label_func=None, show_contracted=False, link_color_func=None, ax=None, above_threshold_color='b')`
`Z` is the linkage matrix that encodes the hierarchical clustering. How the distance between clusters is defined depends on the linkage criterion used to build it; the common choices are:

1. Single-linkage agglomerative algorithm: the distance between two clusters is the distance between their closest pair of points.
2. Complete-linkage agglomerative algorithm: the distance between two clusters is the distance between their farthest pair of points.
3. Average-linkage agglomerative algorithm: the distance between two clusters is the average of all pairwise distances between points of the two clusters.
4. Ward's method: the distance between two clusters is the increase in the sum of squared distances from each point to the cluster centroid after the two clusters are merged.
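To see how these criteria behave in practice, the four methods can be sketched with scipy's `linkage` function; the tiny 2-D dataset below is made up purely for illustration:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Four illustrative points: two tight pairs far apart from each other.
X = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 0.0], [5.0, 1.0]])

for method in ("single", "complete", "average", "ward"):
    Z = linkage(X, method=method)
    # Each row of Z records one merge: [cluster_i, cluster_j, distance, sample count],
    # so Z always has shape (n_samples - 1, 4).
    print(method, Z.shape)
```

The resulting `Z` matrices have the same shape for every method; only the merge distances in the third column differ according to the criterion.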
```python
def plot_dendrogram(model, **kwargs):
    # Create linkage matrix and then plot the dendrogram

    # create the counts of samples under each node
    counts = np.zeros(model.children_.shape[0])  # zero array to store the counts
    n_samples = len(model.labels_)
    for i, merge in enumerate(model.children_):
        current_count = 0
        for child_idx in merge:
            if child_idx < n_samples:
                current_count += 1  # leaf node
            else:
                current_count += counts[child_idx - n_samples]
        counts[i] = current_count

    linkage_matrix = np.column_stack([model.children_, model.distances_,
                                      counts]).astype(float)

    # Plot the corresponding dendrogram
    dendrogram(linkage_matrix, **kwargs)
```
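As a sanity check of the counting logic above: on any dataset, the count stored for the final merge should equal the total number of samples. A minimal sketch on made-up 1-D data (the points are arbitrary):

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

X = np.array([[0.0], [1.0], [5.0], [6.0]])
model = AgglomerativeClustering(distance_threshold=0, n_clusters=None).fit(X)

# Same bookkeeping as in plot_dendrogram: count the samples under each node.
counts = np.zeros(model.children_.shape[0])
n_samples = len(model.labels_)
for i, merge in enumerate(model.children_):
    current_count = 0
    for child_idx in merge:
        if child_idx < n_samples:
            current_count += 1  # leaf node
        else:
            current_count += counts[child_idx - n_samples]
    counts[i] = current_count

print(counts[-1])  # 4.0 — the root merge contains all 4 samples
```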
## (3) Plot the figure

`sklearn.cluster.AgglomerativeClustering(n_clusters=2, affinity='euclidean', memory=None, connectivity=None, compute_full_tree='auto', linkage='ward', distance_threshold=None)`

- `n_clusters`: the number of clusters to find. Must be None if `distance_threshold` is not None.
- `affinity`: the metric used to compute the linkage.
- `memory`: used to cache the output of the tree computation.
- `connectivity`: the connectivity matrix.
- `compute_full_tree`: stop early in the construction of the tree at `n_clusters`. This is useful to reduce computation time if the number of clusters is not small compared to the number of samples. This option only takes effect when a connectivity matrix is specified. Note that when varying the number of clusters and using caching, it may be advantageous to compute the full tree.
- `linkage`: the linkage criterion, as described for `Z` above.
- `distance_threshold`: the linkage distance threshold at or above which clusters will not be merged.
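A minimal sketch of how `n_clusters` and `distance_threshold` trade off, on made-up 1-D data (the points and the threshold value 3.0 are illustrative only): fixing `n_clusters` pins the number of clusters in advance, while a threshold lets the data decide.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Two tight pairs, far apart.
X = np.array([[0.0], [0.5], [10.0], [10.5]])

# Fixed number of clusters, no threshold:
fixed = AgglomerativeClustering(n_clusters=2).fit(X)

# Threshold instead: merges with linkage distance >= 3.0 are refused,
# and n_clusters must then be None.
thresh = AgglomerativeClustering(n_clusters=None, distance_threshold=3.0).fit(X)

print(fixed.n_clusters_, thresh.n_clusters_)  # 2 2
```

Here both fits find the same two clusters, but only because 3.0 happens to separate the within-pair merge distances from the final Ward merge; a different threshold would yield a different number of clusters.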
```python
iris = load_iris()  # load the dataset
X = iris.data

# setting distance_threshold=0 ensures we compute the full tree.
model = AgglomerativeClustering(distance_threshold=0, n_clusters=None)

model = model.fit(X)
plt.title('Hierarchical Clustering Dendrogram')
# plot the top three levels of the dendrogram
plot_dendrogram(model, truncate_mode='level', p=3)
plt.xlabel("Number of points in node (or index of point if no parenthesis).")
plt.show()
```
![](https://github.com/sdgary56249128/machine-learning-python/blob/master/Clustering/sphx_glr_plot_agglomerative_dendrogram_001.png)
## (4) Complete code

https://scikit-learn.org/stable/_downloads/6c3126e55d97d68efdd8da229311ac00/plot_agglomerative_dendrogram.py
```python
import numpy as np

from matplotlib import pyplot as plt
from scipy.cluster.hierarchy import dendrogram
from sklearn.datasets import load_iris
from sklearn.cluster import AgglomerativeClustering


def plot_dendrogram(model, **kwargs):
    # Create linkage matrix and then plot the dendrogram

    # create the counts of samples under each node
    counts = np.zeros(model.children_.shape[0])
    n_samples = len(model.labels_)
    for i, merge in enumerate(model.children_):
        current_count = 0
        for child_idx in merge:
            if child_idx < n_samples:
                current_count += 1  # leaf node
            else:
                current_count += counts[child_idx - n_samples]
        counts[i] = current_count

    linkage_matrix = np.column_stack([model.children_, model.distances_,
                                      counts]).astype(float)

    # Plot the corresponding dendrogram
    dendrogram(linkage_matrix, **kwargs)


iris = load_iris()
X = iris.data

# setting distance_threshold=0 ensures we compute the full tree.
model = AgglomerativeClustering(distance_threshold=0, n_clusters=None)

model = model.fit(X)
plt.title('Hierarchical Clustering Dendrogram')
# plot the top three levels of the dendrogram
plot_dendrogram(model, truncate_mode='level', p=3)
plt.xlabel("Number of points in node (or index of point if no parenthesis).")
plt.show()
```