Ex 4: Underfitting vs. Overfitting

Goals of this example:
    Observe the problems of underfitting and overfitting
    Show how linear regression with polynomial features can be used to approximate a nonlinear function

1. Importing Functions and Models

    Pipeline encapsulates the processing steps into a single managed workflow, so that the fitted parameters can be reused on a new test set
    PolynomialFeatures constructs polynomial features from the input data (see the short sketch after the import block below)
    LinearRegression fits a linear model to the data
    cross_val_score estimates the model's error using k-fold cross-validation
```python
import numpy as np
import matplotlib.pyplot as plt

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
```
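To make the feature construction concrete, here is a minimal sketch (not part of the original example) of what PolynomialFeatures produces for a single input column. With include_bias=False the constant column is dropped, matching the setting used later in this example:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# A toy column vector: three samples, one feature.
x = np.array([[1.0], [2.0], [3.0]])

# Expand to degree 3: the output columns are x, x^2, x^3 (no bias column).
poly = PolynomialFeatures(degree=3, include_bias=False)
print(poly.fit_transform(x))
# [[ 1.  1.  1.]
#  [ 2.  4.  8.]
#  [ 3.  9. 27.]]
```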

2. Building the Function to Approximate

```python
def true_fun(X):
    return np.cos(1.5 * np.pi * X)

np.random.seed(0)
n_samples = 30
X = np.sort(np.random.rand(n_samples))

plt.plot(X, true_fun(X))
plt.show()
```
(Figure: plot of the target function)

The figure shows the function we want to approximate: a segment of a cosine curve.
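One detail worth noting before the next section: scikit-learn estimators expect a 2-D feature matrix of shape (n_samples, n_features), which is why the code below always passes X[:, np.newaxis] rather than the 1-D array X. A minimal sketch of that reshape:

```python
import numpy as np

X = np.sort(np.random.rand(5))  # shape (5,): a 1-D array of samples
X_2d = X[:, np.newaxis]         # shape (5, 1): 5 samples, 1 feature
print(X.shape, X_2d.shape)      # (5,) (5, 1)
```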

3. Fitting the Models, Computing the Errors, and Plotting

    Fit polynomials of three different degrees: 1, 4, and 15
    y is the data set to fit: the true function plus Gaussian noise
    Generate the polynomial feature construction with PolynomialFeatures
    Use Pipeline to chain the polynomial features with the linear regression fit, producing models with polynomial features of different degrees
    Use cross_val_score to split the data into 10 folds for k-fold cross-validation and compute the error
    scikit-learn's scoring convention treats a larger score as better, but an error should be as small as possible, so the scorer returns the negated mean squared error, neg_mean_squared_error; the sign is flipped back when reporting (see the sketch after the code below)
    Generate a new test set X_test and pass it to the fitted model for prediction
```python
degrees = [1, 4, 15]
y = true_fun(X) + np.random.randn(n_samples) * 0.1

plt.figure(figsize=(14, 5))
for i in range(len(degrees)):
    # Fit the model and compute the cross-validation error
    ax = plt.subplot(1, len(degrees), i + 1)
    plt.setp(ax, xticks=(), yticks=())

    polynomial_features = PolynomialFeatures(degree=degrees[i],
                                             include_bias=False)
    linear_regression = LinearRegression()
    pipeline = Pipeline([("polynomial_features", polynomial_features),
                         ("linear_regression", linear_regression)])
    pipeline.fit(X[:, np.newaxis], y)

    # Evaluate the model using cross-validation
    scores = cross_val_score(pipeline, X[:, np.newaxis], y,
                             scoring="neg_mean_squared_error", cv=10)

    # Plot the model prediction, the true function, and the noisy samples
    X_test = np.linspace(0, 1, 100)
    plt.plot(X_test, pipeline.predict(X_test[:, np.newaxis]), label="Model")
    plt.plot(X_test, true_fun(X_test), label="True function")
    plt.scatter(X, y, edgecolor='b', s=20, label="Samples")
    plt.xlabel("x")
    plt.ylabel("y")
    plt.xlim((0, 1))
    plt.ylim((-2, 2))
    plt.legend(loc="best")
    plt.title("Degree {}\nMSE = {:.2e}(+/- {:.2e})".format(
        degrees[i], -scores.mean(), scores.std()))
plt.show()
```
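Because cross_val_score follows scikit-learn's "greater is better" convention, every entry in scores above is a negated MSE. A minimal sketch of the sign handling, reusing the pipeline, X, and y left over from the last loop iteration (the degree-15 model):

```python
scores = cross_val_score(pipeline, X[:, np.newaxis], y,
                         scoring="neg_mean_squared_error", cv=10)
print(scores)          # ten negative values, one per fold
print(-scores.mean())  # negate to report a conventional (positive) MSE
```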
(Figure: three panels showing the degree-1, degree-4, and degree-15 fits, each titled with its cross-validated MSE)

The figure shows several samples drawn from the real function along with the approximations of the different models:
    The degree-1 linear function is not flexible enough to fit the training samples; this is underfitting
    The degree-4 polynomial approximates the true function almost perfectly
    The degree-15 polynomial, because of its much higher degree, fits the noise in the training samples; this is overfitting
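A common way to diagnose these regimes numerically is to compare the training error with the cross-validated error for each degree: an underfit model does poorly on both, while an overfit model does well on the training data but poorly under cross-validation. The following sketch is an addition to the original example, reusing X, y, true_fun, and the imports from above:

```python
from sklearn.metrics import mean_squared_error

for degree in [1, 4, 15]:
    pipeline = Pipeline([
        ("polynomial_features", PolynomialFeatures(degree=degree,
                                                   include_bias=False)),
        ("linear_regression", LinearRegression()),
    ])
    pipeline.fit(X[:, np.newaxis], y)

    # Training error: small even for the overfit degree-15 model.
    train_mse = mean_squared_error(y, pipeline.predict(X[:, np.newaxis]))

    # Cross-validated error: grows dramatically for the overfit model.
    scores = cross_val_score(pipeline, X[:, np.newaxis], y,
                             scoring="neg_mean_squared_error", cv=10)
    print("degree {:2d}: train MSE = {:.2e}, CV MSE = {:.2e}".format(
        degree, train_mse, -scores.mean()))
```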

4. Source Code Listing

Python source code: plot_underfitting_overfitting.py
```python
print(__doc__)

import numpy as np
import matplotlib.pyplot as plt
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score


def true_fun(X):
    return np.cos(1.5 * np.pi * X)

np.random.seed(0)

n_samples = 30
degrees = [1, 4, 15]

X = np.sort(np.random.rand(n_samples))
y = true_fun(X) + np.random.randn(n_samples) * 0.1

plt.figure(figsize=(14, 5))
for i in range(len(degrees)):
    ax = plt.subplot(1, len(degrees), i + 1)
    plt.setp(ax, xticks=(), yticks=())

    polynomial_features = PolynomialFeatures(degree=degrees[i],
                                             include_bias=False)
    linear_regression = LinearRegression()
    pipeline = Pipeline([("polynomial_features", polynomial_features),
                         ("linear_regression", linear_regression)])
    pipeline.fit(X[:, np.newaxis], y)

    # Evaluate the models using cross-validation
    scores = cross_val_score(pipeline, X[:, np.newaxis], y,
                             scoring="neg_mean_squared_error", cv=10)

    X_test = np.linspace(0, 1, 100)
    plt.plot(X_test, pipeline.predict(X_test[:, np.newaxis]), label="Model")
    plt.plot(X_test, true_fun(X_test), label="True function")
    plt.scatter(X, y, edgecolor='b', s=20, label="Samples")
    plt.xlabel("x")
    plt.ylabel("y")
    plt.xlim((0, 1))
    plt.ylim((-2, 2))
    plt.legend(loc="best")
    plt.title("Degree {}\nMSE = {:.2e}(+/- {:.2e})".format(
        degrees[i], -scores.mean(), scores.std()))
plt.show()
```