Ex 3: The iris dataset

Machine Learning Datasets / Example 3: The iris dataset

This example introduces the iris dataset, one of the machine learning example datasets.

(1) Import the libraries and the built-in iris dataset

# This line is specific to the IPython notebook interface; remove it in other environments
%matplotlib inline

import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from sklearn import datasets
from sklearn.decomposition import PCA

# import some data to play with
iris = datasets.load_iris()
X = iris.data[:, :2]  # we only take the first two features.
Y = iris.target

x_min, x_max = X[:, 0].min() - .5, X[:, 0].max() + .5
y_min, y_max = X[:, 1].min() - .5, X[:, 1].max() + .5

plt.figure(2, figsize=(8, 6))
plt.clf()
# Plot the training points
plt.scatter(X[:, 0], X[:, 1], c=Y, cmap=plt.cm.Paired)
plt.xlabel('Sepal length')
plt.ylabel('Sepal width')

plt.xlim(x_min, x_max)
plt.ylim(y_min, y_max)
plt.xticks(())
plt.yticks(())
[Figure: scatter plot of the iris samples, sepal length (x-axis) vs. sepal width (y-axis), colored by species]

(2) Introduction to the dataset

iris = datasets.load_iris() stores a dict-like object in iris. We can inspect its contents with the following code:
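for key, value in iris.items():
    try:
        # Arrays have a shape attribute; print it next to the key
        print(key, value.shape)
    except AttributeError:
        # Non-array entries (e.g. strings, lists) have no shape
        print(key)
print(iris['feature_names'])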
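| Output | Description |
| --- | --- |
| ('target_names', (3L,)) | There are three iris species: setosa, versicolor, virginica |
| ('data', (150L, 4L)) | 150 samples, each with four features |
| ('target', (150L,)) | The species of each of the 150 samples |
| DESCR | Description of the dataset |
| feature_names | Meaning of the four features: the length and width of the sepal, and the length and width of the petal |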
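The structure summarized above can also be checked programmatically. The following is a minimal sketch, assuming a recent scikit-learn and NumPy; the names `counts` and `species_counts` are illustrative and do not appear in the original example.

```python
import numpy as np
from sklearn import datasets

iris = datasets.load_iris()

# 150 samples with 4 features, and one integer label (0, 1, 2) per sample
print(iris.data.shape, iris.target.shape)

# Count how many samples carry each class label
counts = np.bincount(iris.target)

# Map each species name to its sample count (illustrative helper variable)
species_counts = {str(name): int(n)
                  for name, n in zip(iris.target_names, counts)}
print(species_counts)
```

Each integer in iris.target is an index into iris.target_names, which is why the two arrays can be paired this way.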
To visualize this dataset, the code below first uses the PCA algorithm to reduce the data from four dimensions to three:
X_reduced = PCA(n_components=3).fit_transform(iris.data)
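To get a feel for how much information the reduction keeps, one can look at the variance explained by each component. This is a hedged sketch rather than part of the original example: it keeps the fitted PCA object in a variable named `pca` (an assumption, since the original calls fit_transform on a temporary object) and reads its explained_variance_ratio_ attribute.

```python
from sklearn import datasets
from sklearn.decomposition import PCA

iris = datasets.load_iris()

# Fit PCA with 3 components and project the 4-dimensional data
pca = PCA(n_components=3)
X_reduced = pca.fit_transform(iris.data)

# One row per sample, one column per principal component
print(X_reduced.shape)

# Fraction of the total variance captured by each component, and their sum
ratios = pca.explained_variance_ratio_
print(ratios, ratios.sum())
```

If the ratios sum to a value close to 1, very little of the variance is lost by dropping the fourth dimension.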
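Next, we use mpl_toolkits.mplot3d.Axes3D to create a 3D plotting space and call scatter to plot each sample, using its three reduced feature values as coordinates and the class labels Y to set the point colors. The plot shows that one of the three iris species can be clearly distinguished from the other two, while the other two cannot be clearly distinguished from each other.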
# To get a better understanding of the interaction of the dimensions,
# plot the first three PCA dimensions
fig = plt.figure(1, figsize=(8, 6))
# Axes3D(fig, ...) is no longer attached to the figure automatically in recent
# matplotlib releases, so the 3D axes are created through add_subplot instead
ax = fig.add_subplot(111, projection='3d', elev=-150, azim=110)
ax.scatter(X_reduced[:, 0], X_reduced[:, 1], X_reduced[:, 2], c=Y,
           cmap=plt.cm.Paired)
ax.set_title("First three PCA directions")
ax.set_xlabel("1st eigenvector")
# ax.xaxis replaces the w_xaxis alias, which newer matplotlib releases removed
ax.xaxis.set_ticklabels([])
ax.set_ylabel("2nd eigenvector")
ax.yaxis.set_ticklabels([])
ax.set_zlabel("3rd eigenvector")
ax.zaxis.set_ticklabels([])

plt.show()
[Figure: 3D scatter plot of the first three PCA directions, colored by species]
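The observation that one species stands apart from the other two can also be checked on the raw measurements. The sketch below is one possible check, not the method used by the original example: it assumes that petal length alone is enough to isolate setosa (label 0) and picks the midpoint between the two groups as an illustrative threshold.

```python
from sklearn import datasets

iris = datasets.load_iris()
petal_length = iris.data[:, 2]   # third feature: petal length in cm
is_setosa = iris.target == 0     # label 0 corresponds to setosa

# Largest setosa petal length vs. smallest petal length of the other species
setosa_max = petal_length[is_setosa].max()
others_min = petal_length[~is_setosa].min()
print(setosa_max, others_min)

# Any threshold strictly between the two values splits setosa from the rest
threshold = (setosa_max + others_min) / 2
predicted_setosa = petal_length < threshold
print((predicted_setosa == is_setosa).all())
```

A single threshold on one feature is the simplest kind of linear boundary, so a perfect split here is consistent with setosa being linearly separable from the other two species.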
# Next, we print the description of this machine learning dataset
print(iris['DESCR'])
Iris Plants Database

Notes
-----
Data Set Characteristics:
    :Number of Instances: 150 (50 in each of three classes)
    :Number of Attributes: 4 numeric, predictive attributes and the class
    :Attribute Information:
        - sepal length in cm
        - sepal width in cm
        - petal length in cm
        - petal width in cm
        - class:
                - Iris-Setosa
                - Iris-Versicolour
                - Iris-Virginica
    :Summary Statistics:

    ============== ==== ==== ======= ===== ====================
                    Min  Max   Mean    SD   Class Correlation
    ============== ==== ==== ======= ===== ====================
    sepal length:   4.3  7.9   5.84   0.83    0.7826
    sepal width:    2.0  4.4   3.05   0.43   -0.4194
    petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
    petal width:    0.1  2.5   1.20   0.76    0.9565  (high!)
    ============== ==== ==== ======= ===== ====================

    :Missing Attribute Values: None
    :Class Distribution: 33.3% for each of 3 classes.
    :Creator: R.A. Fisher
    :Donor: Michael Marshall (MARSHALL%[email protected])
    :Date: July, 1988

This is a copy of UCI ML iris datasets.
http://archive.ics.uci.edu/ml/datasets/Iris

The famous Iris database, first used by Sir R.A Fisher

This is perhaps the best known database to be found in the
pattern recognition literature.  Fisher's paper is a classic in the field and
is referenced frequently to this day.  (See Duda & Hart, for example.)  The
data set contains 3 classes of 50 instances each, where each class refers to a
type of iris plant.  One class is linearly separable from the other 2; the
latter are NOT linearly separable from each other.

References
----------
   - Fisher,R.A. "The use of multiple measurements in taxonomic problems"
     Annual Eugenics, 7, Part II, 179-188 (1936); also in "Contributions to
     Mathematical Statistics" (John Wiley, NY, 1950).
   - Duda,R.O., & Hart,P.E. (1973) Pattern Classification and Scene Analysis.
     (Q327.D83) John Wiley & Sons.  ISBN 0-471-22361-1.  See page 218.
   - Dasarathy, B.V. (1980) "Nosing Around the Neighborhood: A New System
     Structure and Classification Rule for Recognition in Partially Exposed
     Environments".  IEEE Transactions on Pattern Analysis and Machine
     Intelligence, Vol. PAMI-2, No. 1, 67-71.
   - Gates, G.W. (1972) "The Reduced Nearest Neighbor Rule".  IEEE Transactions
     on Information Theory, May 1972, 431-433.
   - See also: 1988 MLC Proceedings, 54-64.  Cheeseman et al"s AUTOCLASS II
     conceptual clustering system finds 3 classes in the data.
   - Many, many more ...
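The description explains that this dataset dates back to Fisher's 1936 paper and is an important classic example in the field of pattern recognition. It uses four features to classify three species of iris.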
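The summary statistics quoted in the description can be recomputed from iris.data. The following is a minimal sketch under stated assumptions: the variable name `stats` is illustrative, the sample standard deviation (ddof=1) is an assumed choice, and the means and standard deviations may differ slightly from the printed table depending on the scikit-learn version that provides the data.

```python
import numpy as np
from sklearn import datasets

iris = datasets.load_iris()
X = iris.data

# Per-feature minimum, maximum, mean, and sample standard deviation,
# stacked as one row per feature and one column per statistic
stats = np.column_stack([X.min(axis=0), X.max(axis=0),
                         X.mean(axis=0), X.std(axis=0, ddof=1)])

# Print one line per feature, similar in layout to the DESCR table
for name, (lo, hi, mean, sd) in zip(iris.feature_names, stats):
    print(f"{name:<20} {lo:4.1f} {hi:4.1f} {mean:5.2f} {sd:5.2f}")
```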

(3) Application examples

Among the scikit-learn application examples, the following ones make use of this iris dataset.