# basic imports
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
# model imports
from sklearn.neighbors import KNeighborsRegressor
from sklearn.neighbors import KNeighborsClassifier
# metric imports
from sklearn.metrics import root_mean_squared_error
from sklearn.metrics import accuracy_score
# model selection imports
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
# preprocessing imports
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline
# data imports
from sklearn.datasets import make_friedman1
Generalization
Model Flexibility, Overfitting, and the Bias-Variance Tradeoff
Setup and Objectives
Model Flexibility
A model’s flexibility determines how well the model can learn the training data.
What does it mean to “learn the training data” in this context? We’ll use this phrase to mean that the model can accurately predict the target variable for the training data after being fit to the training data. When that happens, the model has learned both the underlying patterns in the training data and the noise in the training data. A highly flexible model, like \(k\)-nearest neighbors with \(k=1\), will generally be able to “learn the training data” completely. We’ll often reference degrees of how well a model has learned the training data, which can be measured by the model’s training error: the lower the training error, the better the model has learned the training data.
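To see this concretely, consider a quick check on simulated data using make_friedman1, imported above. (The choice of dataset here is just for illustration.) With \(k=1\), each training point is its own nearest neighbor, so the model reproduces the training targets exactly, assuming no duplicated feature rows.
# with k=1, KNN memorizes the training data, so train RMSE is (essentially) zero
X_demo, y_demo = make_friedman1(n_samples=200, noise=1.0, random_state=1)
knn_demo = KNeighborsRegressor(n_neighbors=1)
knn_demo.fit(X_demo, y_demo)
print(root_mean_squared_error(y_demo, knn_demo.predict(X_demo)))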
- A “flexible” model can learn “complex” patterns in the data, but is also more likely to overfit.
- An “inflexible” model is less likely to overfit, but may not be able to learn the true underlying patterns in the data.
When fitting a model to a particular dataset, you are essentially searching for a model that is flexible enough to learn the underlying patterns in the data, but not so flexible that it also learns the noise in the data.
How do we control a model’s flexibility? Our main tool is the model’s hyperparameters (tuning parameters).
Let’s investigate with \(k\) for \(k\)-nearest neighbors.
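The data setup for this example is not shown above, so here is a minimal, assumed setup: simulated regression data from make_friedman1 (imported above), a simple train-test split, and storage for the errors we will compute.
# simulate regression data (an assumed setup for this example)
X, y = make_friedman1(n_samples=500, noise=1.0, random_state=42)
# create a train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)
# candidate values of k, with storage for train and test RMSE
k_values = list(range(1, 101))
train_rmse = []
test_rmse = []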
# fit models and calculate train and test RMSE for each value of k
for k in k_values:
    # initialize model with the current k
    knn = KNeighborsRegressor(n_neighbors=k)
    # fit the model to the train data
    knn.fit(X_train, y_train)
    # get train predictions
    y_train_pred = knn.predict(X_train)
    # calculate (and store) train RMSE
    train_rmse.append(root_mean_squared_error(y_train, y_train_pred))
    # get test predictions
    y_test_pred = knn.predict(X_test)
    # calculate (and store) test RMSE
    test_rmse.append(root_mean_squared_error(y_test, y_test_pred))
# plot RMSE against k
fig, ax = plt.subplots(figsize=(10, 6))
ax.plot(k_values, train_rmse, marker="o", markersize=5, label="Train RMSE")
ax.plot(k_values, test_rmse, marker="o", markersize=5, label="Test RMSE")
ax.set_xlabel("k")
ax.set_ylabel("RMSE")
ax.set_title("k versus Error")
ax.legend()
plt.show()
How does \(k\) relate to model flexibility?
- A small \(k\) yields a low train error, and is thus a more flexible model.
- A large \(k\) yields a high train error, and is thus a less flexible model.
That is, as \(k\) increases, the model becomes less flexible.
How do we know this? Note that as \(k\) increases, the train RMSE increases. Said in reverse, as \(k\) decreases, the train RMSE decreases, and thus the model better learns the training data.
Overfitting
What is overfitting?
As the name subtly suggests, overfitting occurs when a model has learned too much. In particular, it has learned the training data so well that, beyond learning the underlying patterns in the data, it has also learned the noise in the data. The opposite problem, underfitting, occurs when a model is not flexible enough to learn even the underlying patterns in the data.
How do we know when overfitting has occurred?
- The model has a low train error, relative to the train error of other models.
- The model has a high test error, relative to the test error of other models.
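Put differently, an overfit model has a large generalization gap: the difference between its test error and its train error. As a quick, illustrative comparison using the (assumed) data and models from above, where the particular values of \(k\) are arbitrary:
# compare a very flexible model (small k) to a much less flexible one
for k in [1, 25]:
    knn = KNeighborsRegressor(n_neighbors=k)
    knn.fit(X_train, y_train)
    rmse_train = root_mean_squared_error(y_train, knn.predict(X_train))
    rmse_test = root_mean_squared_error(y_test, knn.predict(X_test))
    print(f"k={k}: train RMSE = {rmse_train:.3f}, test RMSE = {rmse_test:.3f}")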
Wait! I thought we shouldn’t use the test data to tune hyperparameters and select a model? You are correct! All of the above applies to data used to fit the model (train) and any holdout data (validation or test). So if we were to plot train and validation error, we expect to see the same U-shaped curve for validation error as we did with test error.
Bias-Variance Tradeoff
There are three sources of error in a model for supervised learning:
- Bias
- Variance
- Noise
The bias and variance make up the reducible error in a model. The noise is also called the irreducible error.
The bias of a model is the error due to the model’s assumptions or lack of flexibility. It is systematic error due to the model’s inability to (fully) learn the true underlying patterns in the data.
The variance of a model is the error due to the model’s sensitivity to the training data. It is the error due to the model (partially) learning some noise in the training data, and thus the model would change if the training data were changed.
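For squared error loss, these three sources combine additively. If the data are generated as \(y = f(x) + \varepsilon\) with \(\text{Var}(\varepsilon) = \sigma^2\), then for a model \(\hat{f}\) fit to a random training set, the expected error at a point \(x_0\) decomposes as
\[
\mathbb{E}\big[(y_0 - \hat{f}(x_0))^2\big]
= \underbrace{\big(\mathbb{E}[\hat{f}(x_0)] - f(x_0)\big)^2}_{\text{bias}^2}
+ \underbrace{\text{Var}\big(\hat{f}(x_0)\big)}_{\text{variance}}
+ \underbrace{\sigma^2}_{\text{noise}}
\]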
Both bias and variance are related to the model’s flexibility.
- As flexibility increases, bias decreases.
- As flexibility increases, variance increases.
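Let’s see this visually, using simulated sine wave data. The plotting functions below rely on a helper, simulate_sin_data, whose definition is not shown here; the following is a minimal sketch consistent with how it is called, generating noisy observations of \(\sin(x)\) on \([-2\pi, 2\pi]\).
def simulate_sin_data(n, sd, seed):
    # simulate n noisy observations of the sine function on [-2*pi, 2*pi]
    rng = np.random.default_rng(seed)
    X = rng.uniform(low=-2 * np.pi, high=2 * np.pi, size=(n, 1))
    y = np.sin(X).ravel() + rng.normal(loc=0, scale=sd, size=n)
    return X, y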
def plot_four_simulations_and_predictions(n_neighbors):
    # plot four independently simulated datasets, each showing the true signal
    # and the predictions of a KNN model fit to that dataset
    fig, ax = plt.subplots(2, 2, figsize=(12, 10))
    fig.suptitle("Simulated Sine Wave Data")
    x_plot = np.linspace(-2 * np.pi, 2 * np.pi, 1000).reshape((-1, 1))
    for i in range(2):
        for j in range(2):
            # plot the true signal
            ax[i, j].plot(x_plot, np.sin(x_plot), label="True Signal")
            # simulate a new dataset, with a different seed for each panel
            X, y = simulate_sin_data(n=100, sd=0.25, seed=i * 2 + j)
            ax[i, j].scatter(X, y, c="gray")
            # fit KNN to this dataset and plot its predictions
            knn = KNeighborsRegressor(n_neighbors=n_neighbors)
            knn.fit(X, y)
            ax[i, j].plot(
                x_plot,
                knn.predict(x_plot),
                label="Predicted Signal",
                linestyle="--",
            )
            ax[i, j].legend()
    plt.show()
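Calling this function for a very flexible and a very inflexible model makes the tradeoff visible. (The specific values of \(k\) here are illustrative.)
# a flexible model (k=1): the fitted curve changes a lot from dataset to dataset
plot_four_simulations_and_predictions(n_neighbors=1)
# an inflexible model (k=100): the fitted curve barely changes, but misses the signal
plot_four_simulations_and_predictions(n_neighbors=100)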
def plot_many_simulations_and_predictions_with_average(n_neighbors):
    # fit KNN to 25 independently simulated datasets, plot each fit in gray,
    # then overlay the average of the 25 predictions
    fig, ax = plt.subplots(figsize=(10, 6))
    fig.suptitle("Simulated Sine Wave Data")
    x_plot = np.linspace(-2 * np.pi, 2 * np.pi, 1000).reshape((-1, 1))
    ax.plot(x_plot, np.sin(x_plot), color="tab:blue", label="True Signal")
    predictions = []
    for seed in range(25):
        # simulate a new dataset and fit KNN to it
        X, y = simulate_sin_data(n=100, sd=0.25, seed=seed)
        knn = KNeighborsRegressor(n_neighbors=n_neighbors)
        knn.fit(X, y)
        # plot (and store) this fit's predictions
        ax.plot(x_plot, knn.predict(x_plot), color="gray", alpha=0.25)
        predictions.append(knn.predict(x_plot))
    # average the predictions across the 25 fits
    avg_prediction = np.mean(predictions, axis=0)
    ax.plot(
        x_plot,
        avg_prediction,
        color="tab:orange",
        linestyle="--",
        label="Average Prediction",
    )
    ax.legend()
    plt.show()
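Again comparing the two extremes:
# k=1: individual fits are noisy (high variance), but their average tracks the true signal (low bias)
plot_many_simulations_and_predictions_with_average(n_neighbors=1)
# k=100: individual fits are stable (low variance), but the average misses the true signal (high bias)
plot_many_simulations_and_predictions_with_average(n_neighbors=100)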