Ex 4: Understanding the decision tree structure

# (1) Importing libraries and test data

## Importing libraries

• `load_iris` loads the iris dataset.
```python
import numpy as np

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
```

## Building the training/test sets and the decision tree classifier

• X is the feature data and y is the target data.
• `train_test_split(X, y, random_state)` randomly splits the data into a training set and a test set. X is the feature dataset, y the target dataset, and `random_state` seeds the random number generator.
• `DecisionTreeClassifier(max_leaf_nodes, random_state)` builds a decision tree classifier. `max_leaf_nodes` is the maximum number of leaf nodes; if `random_state` is given it seeds the random number generator, otherwise `np.random` is used.
• `fit(X, y)` trains the model, with X as the training features and y as the targets.
```python
iris = load_iris()
X = iris.data
y = iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
estimator = DecisionTreeClassifier(max_leaf_nodes=3, random_state=0)
estimator.fit(X_train, y_train)
```
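As a quick sanity check on the split: with `train_test_split`'s default `test_size=0.25`, the 150 iris samples end up as 112 training and 38 test samples. A minimal standalone sketch:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, random_state=0)

# Default test_size is 0.25, so 150 samples split into 112 train / 38 test.
print(len(X_train), len(X_test))  # 112 38
```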

# (2) Exploring the decision tree structure

`DecisionTreeClassifier` exposes a `tree_` attribute that stores the entire tree structure. The binary tree is represented as a set of parallel arrays: the i-th element of each array holds the information about node i, and node 0 is the tree's root. Note that some arrays apply only to split nodes; for nodes of the other type the values in those arrays are arbitrary.

```python
n_nodes = estimator.tree_.node_count
children_left = estimator.tree_.children_left
children_right = estimator.tree_.children_right
feature = estimator.tree_.feature
threshold = estimator.tree_.threshold
```

```
n_nodes = 5
children_left [ 1 -1 3 -1 -1]
children_right [ 2 -1 4 -1 -1]
feature [ 3 -2 2 -2 -2]
threshold [ 0.80000001 -2. 4.94999981 -2. -2. ]
```
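A detail worth noting in this output: sklearn marks a leaf with the sentinel -1 in both `children_left` and `children_right` (and with -2 in `feature` and `threshold`), so the leaves can be read off directly without any traversal. A small check re-using the array values printed above:

```python
import numpy as np

# Values copied from the arrays printed above.
children_left = np.array([1, -1, 3, -1, -1])
children_right = np.array([2, -1, 4, -1, -1])

# A node is a leaf exactly when both child slots hold the -1 sentinel.
leaves = np.where((children_left == -1) & (children_right == -1))[0]
print(leaves)  # [1 3 4]
```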

• `node_depth`: the depth (level) of each node in the decision tree.
• `is_leaves`: whether each node is at the bottom of the tree (a leaf).
• `stack`: holds the nodes that have not yet been checked for reaching the bottom of the tree.

```python
node_depth = np.zeros(shape=n_nodes)
is_leaves = np.zeros(shape=n_nodes, dtype=bool)

stack = [(0, -1)]  # initial: the root node id and its parent depth

while len(stack) > 0:
    node_id, parent_depth = stack.pop()
    node_depth[node_id] = parent_depth + 1

    # If we have a test node
    if (children_left[node_id] != children_right[node_id]):
        stack.append((children_left[node_id], parent_depth + 1))
        stack.append((children_right[node_id], parent_depth + 1))
    else:
        is_leaves[node_id] = True
```
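As a cross-check on this traversal, the deepest value in `node_depth` should agree with the tree's own `max_depth` attribute. The sketch below repeats the DFS in condensed, self-contained form (using an integer depth array instead of the float array above):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, random_state=0)
estimator = DecisionTreeClassifier(max_leaf_nodes=3, random_state=0)
estimator.fit(X_train, y_train)

n_nodes = estimator.tree_.node_count
children_left = estimator.tree_.children_left
children_right = estimator.tree_.children_right

# Same DFS as above, condensed: record the depth of each node.
node_depth = np.zeros(n_nodes, dtype=int)
stack = [(0, 0)]
while stack:
    node_id, depth = stack.pop()
    node_depth[node_id] = depth
    if children_left[node_id] != children_right[node_id]:
        stack.append((children_left[node_id], depth + 1))
        stack.append((children_right[node_id], depth + 1))

# The deepest node agrees with the tree's own max_depth attribute.
print(node_depth.max(), estimator.tree_.max_depth)  # 2 2
```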

```
stack len 1
node_id 0 parent_depth -1
node_depth [ 0. 0. 0. 0. 0.]
stack [(1, 0), (2, 0)]

stack len 2
node_id 2 parent_depth 0
node_depth [ 0. 0. 1. 0. 0.]
stack [(1, 0), (3, 1), (4, 1)]

stack len 3
node_id 4 parent_depth 1
node_depth [ 0. 0. 1. 0. 2.]
stack [(1, 0), (3, 1)]

stack len 2
node_id 3 parent_depth 1
node_depth [ 0. 0. 1. 2. 2.]
stack [(1, 0)]

stack len 1
node_id 1 parent_depth 0
node_depth [ 0. 1. 1. 2. 2.]
stack []
```
```python
print("The binary tree structure has %s nodes and has "
      "the following tree structure:"
      % n_nodes)
for i in range(n_nodes):
    if is_leaves[i]:
        # "\t" indents each node by its depth
        print("%snode=%s leaf node." % (int(node_depth[i]) * "\t", i))
    else:
        print("%snode=%s test node: go to node %s if X[:, %s] <= %s else to "
              "node %s."
              % (int(node_depth[i]) * "\t",
                 i,
                 children_left[i],
                 feature[i],
                 threshold[i],
                 children_right[i],
                 ))
```

```
The binary tree structure has 5 nodes and has the following tree structure:
node=0 test node: go to node 1 if X[:, 3] <= 0.800000011921 else to node 2.
	node=1 leaf node.
	node=2 test node: go to node 3 if X[:, 2] <= 4.94999980927 else to node 4.
		node=3 leaf node.
		node=4 leaf node.
```
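For comparison, recent versions of scikit-learn also ship `sklearn.tree.export_text`, which renders essentially the same structure (one indented line per split or leaf) without any manual traversal:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, random_state=0)
estimator = DecisionTreeClassifier(max_leaf_nodes=3, random_state=0)
estimator.fit(X_train, y_train)

# One line per node; the root split on feature 3 at 0.80 matches the
# manual printout above.
print(export_text(estimator))
```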

```python
node_indicator = estimator.decision_path(X_test)

# Similarly, we can also have the leaves ids reached by each sample.
leave_id = estimator.apply(X_test)

# Now, it's possible to get the tests that were used to predict a sample or
# a group of samples. First, let's make it for the sample.
sample_id = 0
node_index = node_indicator.indices[node_indicator.indptr[sample_id]:
                                    node_indicator.indptr[sample_id + 1]]
print('node_index', node_index)
print('Rules used to predict sample %s: ' % sample_id)
for node_id in node_index:
    if leave_id[sample_id] != node_id:
        continue

    if (X_test[sample_id, feature[node_id]] <= threshold[node_id]):
        threshold_sign = "<="
    else:
        threshold_sign = ">"

    print("decision id node %s : (X[%s, %s] (= %s) %s %s)"
          % (node_id,
             sample_id,
             feature[node_id],
             X_test[sample_id, feature[node_id]],
             threshold_sign,
             threshold[node_id]))
```

```
node_index [0 2 4]
Rules used to predict sample 0: 
decision id node 4 : (X[0, -2] (= 1.5) > -2.0)
```
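`decision_path` returns a SciPy CSR sparse matrix of shape `(n_samples, n_nodes)`; row *i* has a 1 at every node that sample *i* visits, which is why the `indices`/`indptr` slicing above works. A standalone check:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, random_state=0)
estimator = DecisionTreeClassifier(max_leaf_nodes=3, random_state=0)
estimator.fit(X_train, y_train)

node_indicator = estimator.decision_path(X_test)
print(node_indicator.shape)         # (38, 5): 38 test samples, 5 nodes
print(node_indicator.toarray()[0])  # sample 0 visits nodes 0, 2 and 4
```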

```python
# For a group of samples, we have the following common node.
sample_ids = [0, 1]
common_nodes = (node_indicator.toarray()[sample_ids].sum(axis=0) ==
                len(sample_ids))

print('node_indicator', node_indicator.toarray()[sample_ids])
print('common_nodes', common_nodes)

common_node_id = np.arange(n_nodes)[common_nodes]
print('common_node_id', common_node_id)

print("\nThe following samples %s share the node %s in the tree"
      % (sample_ids, common_node_id))
print("It is %s %% of all nodes." % (100 * len(common_node_id) / n_nodes,))
```

```
node_indicator [[1 0 1 0 1]
 [1 0 1 1 0]]
common_nodes [ True False True False False]
common_node_id [0 2]

The following samples [0, 1] share the node [0 2] in the tree
It is 40.0 % of all nodes.
```
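The dense `toarray` comparison is fine for a small tree; an equivalent approach that stays sparse is to intersect the CSR index slices of each sample's path (the same slicing used for `node_index` earlier):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, random_state=0)
estimator = DecisionTreeClassifier(max_leaf_nodes=3, random_state=0)
estimator.fit(X_train, y_train)

node_indicator = estimator.decision_path(X_test)

# Each sample's path is the slice of `indices` between indptr[s] and
# indptr[s + 1]; the shared nodes are the intersection of those sets.
paths = [set(node_indicator.indices[node_indicator.indptr[s]:
                                    node_indicator.indptr[s + 1]])
         for s in [0, 1]]
print(sorted(set.intersection(*paths)))  # [0, 2]
```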

# (3) Complete code

```python
import numpy as np

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X = iris.data
y = iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

estimator = DecisionTreeClassifier(max_leaf_nodes=3, random_state=0)
estimator.fit(X_train, y_train)

# The decision estimator has an attribute called tree_ which stores the entire
# tree structure and allows access to low level attributes. The binary tree
# tree_ is represented as a number of parallel arrays. The i-th element of each
# array holds information about the node `i`. Node 0 is the tree's root. NOTE:
# Some of the arrays only apply to either leaves or split nodes, resp. In this
# case the values of nodes of the other type are arbitrary!
#
# Among those arrays, we have:
#   - left_child, id of the left child of the node
#   - right_child, id of the right child of the node
#   - feature, feature used for splitting the node
#   - threshold, threshold value at the node

# Using those arrays, we can parse the tree structure:
n_nodes = estimator.tree_.node_count
children_left = estimator.tree_.children_left
children_right = estimator.tree_.children_right
feature = estimator.tree_.feature
threshold = estimator.tree_.threshold

# The tree structure can be traversed to compute various properties such
# as the depth of each node and whether or not it is a leaf.
node_depth = np.zeros(shape=n_nodes)
is_leaves = np.zeros(shape=n_nodes, dtype=bool)
stack = [(0, -1)]  # seed is the root node id and its parent depth
while len(stack) > 0:
    node_id, parent_depth = stack.pop()
    node_depth[node_id] = parent_depth + 1

    # If we have a test node
    if (children_left[node_id] != children_right[node_id]):
        stack.append((children_left[node_id], parent_depth + 1))
        stack.append((children_right[node_id], parent_depth + 1))
    else:
        is_leaves[node_id] = True

print("The binary tree structure has %s nodes and has "
      "the following tree structure:"
      % n_nodes)
for i in range(n_nodes):
    if is_leaves[i]:
        print("%snode=%s leaf node." % (int(node_depth[i]) * "\t", i))
    else:
        print("%snode=%s test node: go to node %s if X[:, %s] <= %s else to "
              "node %s."
              % (int(node_depth[i]) * "\t",
                 i,
                 children_left[i],
                 feature[i],
                 threshold[i],
                 children_right[i],
                 ))
print()

# First let's retrieve the decision path of each sample. The decision_path
# method allows to retrieve the node indicator functions. A non zero element of
# indicator matrix at the position (i, j) indicates that the sample i goes
# through the node j.
node_indicator = estimator.decision_path(X_test)

# Similarly, we can also have the leaves ids reached by each sample.
leave_id = estimator.apply(X_test)

# Now, it's possible to get the tests that were used to predict a sample or
# a group of samples. First, let's make it for the sample.
sample_id = 0
node_index = node_indicator.indices[node_indicator.indptr[sample_id]:
                                    node_indicator.indptr[sample_id + 1]]

print('Rules used to predict sample %s: ' % sample_id)
for node_id in node_index:
    if leave_id[sample_id] != node_id:
        continue

    if (X_test[sample_id, feature[node_id]] <= threshold[node_id]):
        threshold_sign = "<="
    else:
        threshold_sign = ">"

    print("decision id node %s : (X[%s, %s] (= %s) %s %s)"
          % (node_id,
             sample_id,
             feature[node_id],
             X_test[sample_id, feature[node_id]],
             threshold_sign,
             threshold[node_id]))

# For a group of samples, we have the following common node.
sample_ids = [0, 1]
common_nodes = (node_indicator.toarray()[sample_ids].sum(axis=0) ==
                len(sample_ids))

common_node_id = np.arange(n_nodes)[common_nodes]

print("\nThe following samples %s share the node %s in the tree"
      % (sample_ids, common_node_id))
print("It is %s %% of all nodes." % (100 * len(common_node_id) / n_nodes,))
```