Ex 3: Recursive Feature Elimination with Cross-Validation

Feature selection / Example 3: Recursive feature elimination with cross-validation

RFECV differs from RFE in that it also reports a cross-validation score (`grid_scores_`) for each number of selected features, i.e. the accuracy obtained after keeping that many features. Unlike RFE, RFECV does not require you to specify the number of features to select in advance; it automatically chooses the number of training features based on the cross-validation scores.

1. Fit the model iteratively, eliminating features recursively
2. Use cross-validation to identify the influential features
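The practical difference between the two steps can be sketched with a minimal example on synthetic data (the dataset sizes here are illustrative choices, not from the original example; RFECV's chosen feature count depends on the cross-validation scores):

```python
from sklearn.svm import SVC
from sklearn.feature_selection import RFE, RFECV
from sklearn.datasets import make_classification

# Small synthetic task with 3 informative features
X, y = make_classification(n_samples=200, n_features=10, n_informative=3,
                           n_redundant=2, random_state=0)
svc = SVC(kernel="linear")

# RFE: the number of features to keep must be given up front
rfe = RFE(estimator=svc, n_features_to_select=3).fit(X, y)

# RFECV: the number of features is picked automatically from CV scores
rfecv = RFECV(estimator=svc, step=1, cv=3, scoring='accuracy').fit(X, y)

print(rfe.n_features_)    # always 3, as requested
print(rfecv.n_features_)  # chosen by cross-validation
```

Both estimators eliminate features recursively; only RFECV decides where to stop.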

(1) Create the simulated dataset

```python
# Build a classification task using 3 informative features
X, y = make_classification(n_samples=1000, n_features=25, n_informative=3,
                           n_redundant=2, n_repeated=0, n_classes=8,
                           n_clusters_per_class=1, random_state=0)
```

(2) Rank feature influence iteratively and use cross-validation to select the features with real influence

The scoring parameter takes an evaluation metric matching the type of classification data. Since this example is a multi-class problem (more than two classes), 'accuracy' is used as the scoring method for multi-class classification. See the scoring documentation for details.
```python
# Create the RFE object and compute a cross-validated score.
svc = SVC(kernel="linear")
# The "accuracy" scoring is proportional to the number of correct
# classifications
rfecv = RFECV(estimator=svc, step=1, cv=StratifiedKFold(y, 2),
              scoring='accuracy')
rfecv.fit(X, y)

print("Optimal number of features : %d" % rfecv.n_features_)
```

(3) Plot the accuracy obtained for each number of influential features

(4) Source of the original code

Python source code: plot_rfe_with_cross_validation.py
```python
print(__doc__)

import matplotlib.pyplot as plt
from sklearn.svm import SVC
from sklearn.cross_validation import StratifiedKFold
from sklearn.feature_selection import RFECV
from sklearn.datasets import make_classification

# Build a classification task using 3 informative features
X, y = make_classification(n_samples=1000, n_features=25, n_informative=3,
                           n_redundant=2, n_repeated=0, n_classes=8,
                           n_clusters_per_class=1, random_state=0)

# Create the RFE object and compute a cross-validated score.
svc = SVC(kernel="linear")
# The "accuracy" scoring is proportional to the number of correct
# classifications
rfecv = RFECV(estimator=svc, step=1, cv=StratifiedKFold(y, 2),
              scoring='accuracy')
rfecv.fit(X, y)

print("Optimal number of features : %d" % rfecv.n_features_)

# Plot number of features VS. cross-validation scores
plt.figure()
plt.xlabel("Number of features selected")
plt.ylabel("Cross validation score (nb of correct classifications)")
plt.plot(range(1, len(rfecv.grid_scores_) + 1), rfecv.grid_scores_)
plt.show()
```

Functions introduced in this section

Parameters of RFECV()

```python
class sklearn.feature_selection.RFECV(estimator, step=1, cv=None, scoring=None, estimator_params=None, verbose=0)
```

• estimator: the supervised model to fit; it must expose feature weights (e.g. `coef_` or `feature_importances_`)
• step: the number of features removed at each iteration (or, if less than 1, the fraction of features)
• cv: if not given, defaults to 3-fold cross-validation. An integer i performs i-fold cross-validation; if an object is given, it is used as the cross-validation generator.
• scoring: the scoring method (a string or callable) used to evaluate each feature subset
• estimator_params: parameters for the external estimator
• verbose: controls the verbosity of the output
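The two forms of the cv parameter can be illustrated as follows. Note this sketch uses newer scikit-learn, where StratifiedKFold lives in `sklearn.model_selection` and takes `n_splits`, rather than the `StratifiedKFold(y, 2)` form shown in the example code above:

```python
from sklearn.svm import SVC
from sklearn.feature_selection import RFECV
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold

X, y = make_classification(n_samples=150, n_features=8, n_informative=3,
                           random_state=0)
svc = SVC(kernel="linear")

# cv as an integer: i-fold (stratified, for classifiers) cross-validation
rfecv_int = RFECV(estimator=svc, cv=3, scoring='accuracy').fit(X, y)

# cv as a cross-validation object used as the fold generator
rfecv_obj = RFECV(estimator=svc, cv=StratifiedKFold(n_splits=3),
                  scoring='accuracy').fit(X, y)

# Both forms produce the same folds here, hence the same feature count
print(rfecv_int.n_features_, rfecv_obj.n_features_)
```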

• n_features_: the total number of features estimated to be influential (i.e. selected)
• support_: a boolean mask of the influential features, usable to pick out the selected columns
• ranking_: the influence ranking of each feature (selected features are ranked 1)
• grid_scores_: the cross-validation accuracy obtained for each number of features, adding features starting from the most influential
• estimator_: the estimator fitted on the selected features
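A short sketch of how these attributes are typically used after fitting (the dataset and variable names here are illustrative, not from the original example):

```python
from sklearn.svm import SVC
from sklearn.feature_selection import RFECV
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=200, n_features=10, n_informative=3,
                           random_state=0)
rfecv = RFECV(estimator=SVC(kernel="linear"), step=1, cv=3,
              scoring='accuracy').fit(X, y)

# support_ is a boolean mask over the original columns: keep the selected ones
X_selected = X[:, rfecv.support_]
print(X_selected.shape)

# ranking_ assigns rank 1 to every selected feature; higher ranks were
# eliminated in earlier iterations
print(rfecv.ranking_)

# n_features_ matches the number of True entries in support_
print(rfecv.n_features_ == X_selected.shape[1])
```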