
5  Data Splitting

Splitting, Evaluation, and Tuning

5.1 Setup and Objectives

In this note, we will discuss:

  • data splitting,
  • and the process of evaluating the performance of predictions from a model.

Along the way, we will outline a general procedure that will be used for almost all supervised learning tasks.

# basics
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# machine learning
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import root_mean_squared_error

def simulate_sin_data(n, sd, seed):

    # control randomness
    np.random.seed(seed)

    # simulate X data
    X = np.random.uniform(low=-2 * np.pi, high=2 * np.pi, size=(n, 1))

    # generate signal portion of y data
    signal = np.sin(X).ravel()

    # generate noise
    noise = np.random.normal(loc=0, scale=sd, size=n)

    # combine signal and noise
    y = signal + noise

    # return simulated data
    return X, y

def rmse(y_true, y_pred):
    # root mean squared error between true and predicted values
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

5.2 Generalization

A model generalizes well if it is able to make good predictions on unseen data. Predicting on already observed data is easy!
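To make this concrete, here is a minimal sketch (not part of the original example) that fits a 1-nearest-neighbor model and compares its error on the data it has already seen with its error on a fresh draw from the same process. The helper `draw_sin_data` and the sample sizes are assumptions made for illustration; it mirrors the sine-plus-noise setup of `simulate_sin_data` but uses its own random generator.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(42)


def draw_sin_data(n, sd=0.25):
    # hypothetical helper: sine signal plus normal noise
    X = rng.uniform(-2 * np.pi, 2 * np.pi, size=(n, 1))
    y = np.sin(X).ravel() + rng.normal(0, sd, size=n)
    return X, y


def rmse(y_true, y_pred):
    return np.sqrt(np.mean((y_true - y_pred) ** 2))


# data the model gets to see, and new data it does not
X_seen, y_seen = draw_sin_data(200)
X_new, y_new = draw_sin_data(200)

# with k = 1, each seen point is its own nearest neighbor
knn = KNeighborsRegressor(n_neighbors=1).fit(X_seen, y_seen)

rmse_seen = rmse(y_seen, knn.predict(X_seen))
rmse_new = rmse(y_new, knn.predict(X_new))

print(f"RMSE on seen data: {rmse_seen:.3f}")
print(f"RMSE on new data:  {rmse_new:.3f}")
```

The model reproduces the seen targets exactly, so the first number says nothing about how well it generalizes. Only the second number does.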

5.3 Data Splitting

5.3.1 Train-Test Split

In practice, we cannot simply simulate more data!

However, the “easy” solution here is to randomly divide the available data into two parts. The training data is used to fit a model. The testing data is held back and used only to calculate metrics.
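As a rough sketch of what a random split does conceptually, we can shuffle the row indices and cut them into two groups. This is an illustration with made-up toy arrays, not a description of how `train_test_split` is implemented internally.

```python
import numpy as np

rng = np.random.default_rng(42)

# toy data: 10 observations, one feature (values are placeholders)
n = 10
X = np.arange(n).reshape(-1, 1)
y = np.arange(n) * 10.0

# shuffle the row indices, then reserve the first 20% for testing
idx = rng.permutation(n)
n_test = int(0.20 * n)
test_idx, train_idx = idx[:n_test], idx[n_test:]

X_train, y_train = X[train_idx], y[train_idx]
X_test, y_test = X[test_idx], y[test_idx]

print(X_train.shape, X_test.shape)
```

Each row lands in exactly one of the two sets, which is the property that matters: no test observation is ever used for fitting.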

But, we actually need to go one step further…

5.3.2 Train-Validation-Test Split

  • Train: The data used to train and select models.
    • Validation-Train: The data used to fit models during training.
    • Validation: The data used to evaluate models during training.
  • Test: The data used only for a final evaluation of an already chosen model.

5.3.3 Train-Validation-Test Split Flowchart

flowchart TB
  A("Full Data")
  A -->|"80%"| B("Train Data")
  B -->|"80%"| D("(Validation) <br> Train Data")
  B -->|"20%"| E("Validation Data")
  A -->|"20%"| C("Test Data")

An 80-20 split is common, but not required. The choice of how much data to put into each set is called data budgeting.
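To see what two nested 80-20 splits imply for the overall budget, here is a small arithmetic sketch. The total of 500 observations is an assumed example size.

```python
n = 500            # assumed total number of observations
test_frac = 0.20   # share of the full data reserved for testing
val_frac = 0.20    # share of the train data reserved for validation

n_test = round(n * test_frac)
n_train = n - n_test
n_val = round(n_train * val_frac)
n_vtrain = n_train - n_val

for name, size in [
    ("(validation) train", n_vtrain),
    ("validation", n_val),
    ("test", n_test),
]:
    print(f"{name:>18}: {size:4d} rows, {size / n:.0%} of the full data")
```

So an 80-20 split applied twice leaves 64% of the full data for fitting candidate models, 16% for validation, and 20% for testing.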

5.3.4 Validation and Test Metrics

Because we now fit a model to one dataset but calculate metrics on another, we need to update our metric definitions.

Root Mean Squared Error (\(\text{RMSE}\), rmse)

\[ \text{RMSE}(f, \mathcal{D}) = \sqrt{ \frac{1}{n_\mathcal{D}} \sum_{i \in \mathcal{D}} \left( y_i - f(x_i) \right) ^ 2} \]

Importantly, now we need to consider both a dataset and function (learned from data) when calculating metrics. We’re still comparing “true” values to “predicted” values, but we need to pay attention to where they come from.

Here:

  • \(f\) is a function that outputs predictions. In sklearn, this is something like some_model.predict().
  • \(\mathcal{D}\) is a dataset, usually either the validation or test data.
  • \(n_\mathcal{D}\) is the number of observations in the dataset \(\mathcal{D}\).
  • \(i\) is the index of an observation (row) of the dataset \(\mathcal{D}\). \(x_i\) is the feature value(s) for this observation and \(y_i\) is the target value.
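The definition above can be sketched as a function of both a prediction function and a dataset. The name `rmse_on` and the toy values are assumptions for illustration; the toy function `f` stands in for something like some_model.predict.

```python
import numpy as np


def rmse_on(f, X, y):
    # RMSE(f, D) for a prediction function f and a dataset D = (X, y)
    residuals = y - f(X)
    return np.sqrt(np.mean(residuals ** 2))


# toy dataset D with three observations
X_d = np.array([[0.0], [1.0], [2.0]])
y_d = np.array([1.0, 2.0, 5.0])


def f(X):
    # toy prediction function: predicts 1, 2, 3 for the rows above
    return 1.0 + X.ravel()


# residuals are 0, 0, 2, so the RMSE is sqrt(4 / 3)
value = rmse_on(f, X_d, y_d)
print(round(value, 4))
```

Passing a different dataset to the same `f` gives a different metric, which is why we need to keep track of where both the model and the data come from.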

Validation metrics use:

  • models fit to the (validation) train data,
  • predictions and targets from the validation data.

Test metrics use:

  • models fit to the (full) train data,
  • predictions and targets from the test data.

5.4 Example

5.4.1 Data Setup

We’ll return to our simulated data from earlier, this time starting with a larger number of samples.

X, y = simulate_sin_data(
    n=500,
    sd=0.25,
    seed=42,
)
X.shape, y.shape # verify shapes of the data
((500, 1), (500,))

We first split the full data into a train and test set.

X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.20,
    random_state=42,
)
X_train.shape, X_test.shape # verify shapes of the data
((400, 1), (100, 1))

We then repeat the process, splitting the train set into a (validation) train and validation set.

X_vtrain, X_validation, y_vtrain, y_validation = train_test_split(
    X_train,
    y_train,
    test_size=0.20,
    random_state=42,
)
X_vtrain.shape, X_validation.shape  # verify shapes of the data
((320, 1), (80, 1))
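As a sanity check on the two splits above, a sketch like the following applies the same calls to an array of row indices and confirms that the three resulting sets do not overlap and together cover every row. Splitting indices instead of `X` and `y` is a choice made here for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# row indices standing in for the 500 observations
idx = np.arange(500)

# first split: train vs. test
idx_train, idx_test = train_test_split(
    idx, test_size=0.20, random_state=42
)

# second split: (validation) train vs. validation
idx_vtrain, idx_validation = train_test_split(
    idx_train, test_size=0.20, random_state=42
)

# the three sets should be pairwise disjoint and cover all rows
pieces = [set(idx_vtrain), set(idx_validation), set(idx_test)]
no_overlap = sum(len(p) for p in pieces) == len(set().union(*pieces))
covers_all = set().union(*pieces) == set(idx)

print(len(idx_vtrain), len(idx_validation), len(idx_test))
print(no_overlap, covers_all)
```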

5.4.2 Create Models

knn001 = KNeighborsRegressor(n_neighbors=1)
knn010 = KNeighborsRegressor(n_neighbors=10)
knn100 = KNeighborsRegressor(n_neighbors=100)

5.4.3 Fit Models using .fit()

_ = knn001.fit(X_vtrain, y_vtrain)
_ = knn010.fit(X_vtrain, y_vtrain)
_ = knn100.fit(X_vtrain, y_vtrain)

5.4.4 Predictions using .predict()

knn010.predict(X_validation)
array([-0.72  ,  0.0718, -0.7686,  1.0194, -0.7585,  0.957 , -0.7495,
        0.8941,  0.7542,  0.0718,  0.9662,  0.3956,  0.9154, -0.8119,
        0.835 ,  0.2383, -0.7807, -0.5975, -0.8037,  1.0089,  0.9854,
       -0.7812, -0.2986,  0.6657, -0.7855, -0.7686,  0.5979,  0.9467,
        0.1706,  0.4683, -0.8143, -0.0535, -0.8776,  0.8266, -0.72  ,
        0.0185,  0.8388, -0.8998,  0.0746, -0.0836,  0.1706,  0.8759,
        0.7893, -1.0632, -0.7253,  0.7876, -0.7718, -1.0151,  0.5466,
       -0.3634, -1.0151, -0.2986,  0.4454,  0.971 , -1.0194,  0.0369,
        0.9854, -0.5975,  0.847 ,  0.2239, -0.407 , -0.6493,  0.2946,
       -0.4577,  0.7624, -0.72  ,  0.5466,  1.0079,  0.5922, -0.8119,
       -0.7177, -0.3039,  0.8738,  0.7542,  0.6022, -0.3634, -0.9926,
       -1.0151, -0.4577,  0.9046])
print(knn010.predict(X_validation))
[-0.72    0.0718 -0.7686  1.0194 -0.7585  0.957  -0.7495  0.8941  0.7542
  0.0718  0.9662  0.3956  0.9154 -0.8119  0.835   0.2383 -0.7807 -0.5975
 -0.8037  1.0089  0.9854 -0.7812 -0.2986  0.6657 -0.7855 -0.7686  0.5979
  0.9467  0.1706  0.4683 -0.8143 -0.0535 -0.8776  0.8266 -0.72    0.0185
  0.8388 -0.8998  0.0746 -0.0836  0.1706  0.8759  0.7893 -1.0632 -0.7253
  0.7876 -0.7718 -1.0151  0.5466 -0.3634 -1.0151 -0.2986  0.4454  0.971
 -1.0194  0.0369  0.9854 -0.5975  0.847   0.2239 -0.407  -0.6493  0.2946
 -0.4577  0.7624 -0.72    0.5466  1.0079  0.5922 -0.8119 -0.7177 -0.3039
  0.8738  0.7542  0.6022 -0.3634 -0.9926 -1.0151 -0.4577  0.9046]
knn010.predict(X_validation).shape
(80,)

5.4.5 Validation Metrics

# models fit to (validation) train data, metrics calculated on validation data
rmse_val_001 = rmse(y_validation, knn001.predict(X_validation))
rmse_val_010 = rmse(y_validation, knn010.predict(X_validation))
rmse_val_100 = rmse(y_validation, knn100.predict(X_validation))

print(f"Validation RMSE with k = 1:   {rmse_val_001:.3f}")
print(f"Validation RMSE with k = 10:  {rmse_val_010:.3f}")
print(f"Validation RMSE with k = 100: {rmse_val_100:.3f}")
Validation RMSE with k = 1:   0.353
Validation RMSE with k = 10:  0.261
Validation RMSE with k = 100: 0.449

Based on these results, we select the model with \(k = 10\), since it has the lowest validation RMSE. This will be our chosen model.
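With more than a handful of candidates, we would not want to pick the winner by eye. One possible way to automate the choice is to store the validation metrics in a dictionary keyed by \(k\) and take the key with the smallest value. The numbers below are the rounded validation RMSE values printed above; the dictionary name is an assumption.

```python
# validation RMSE for each candidate value of k (rounded values from above)
val_rmse = {1: 0.353, 10: 0.261, 100: 0.449}

# choose the k with the lowest validation RMSE
best_k = min(val_rmse, key=val_rmse.get)

print(best_k)  # 10
```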

5.4.6 Refit and Calculate Test Metric

# refit to (full) train data
knn010.fit(X_train, y_train)

# calculate test RMSE
rmse_test_010 = rmse(y_test, knn010.predict(X_test))

# print
print("Test RMSE with k = 10:", rmse_test_010)
Test RMSE with k = 10: 0.2720220479053038

5.4.7 Visualizing Test Results

# calculate residuals
residuals = y_test - knn010.predict(X_test)

# plot histogram of residuals
fig, ax = plt.subplots(figsize=(10, 5))
ax.hist(residuals, bins=15, edgecolor="black", alpha=0.75)
ax.set_title("Histogram of Residuals")
ax.set_xlabel("Residual")
ax.set_ylabel("Frequency")
plt.show()

5.5 A Process for Supervised Learning

flowchart TB
    Pipeline{ML Pipeline} --> Model{Tuned Model}
    Model -- Predictions --> Metric[Final Evaluation]
    Estimator["Estimator"] --> Pipeline
    Data[(Data)] --> TrainData[(Train Data)]
    Data --> TestData[(Test Data)]
    TrainData -- Train Features --> Pipeline
    TrainData -- Train Target --> Pipeline
    TestData -- Test Features --> Model
    TestData -- Test Target --> Metric

The tuning procedure for supervised learning is:

  1. Train-test split the available data.
  2. Further split the train data into (validation) train and validation datasets.
  3. Fit all candidate models (in this case, three KNN models) to the (validation) train dataset.
  4. Calculate validation RMSE. With the validation data, calculate the RMSE for predictions from these models.
  5. Choose the model with the lowest validation RMSE. Call this the tuned model.
  6. Fit the tuned model to the train dataset.
  7. Calculate test RMSE for the tuned model. With the test data, calculate the RMSE for predictions from the model fit to the train data.
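The steps above can be sketched as a single function. This is one possible arrangement under stated assumptions, not a definitive implementation: the function name `tune_and_test_knn`, its arguments, and the freshly simulated data are all made up for illustration, so the numbers it prints will differ from those in the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor


def rmse(y_true, y_pred):
    return np.sqrt(np.mean((y_true - y_pred) ** 2))


def tune_and_test_knn(X, y, k_values, seed=42):
    # 1. train-test split the available data
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.20, random_state=seed
    )
    # 2. split train into (validation) train and validation
    X_vtrain, X_val, y_vtrain, y_val = train_test_split(
        X_train, y_train, test_size=0.20, random_state=seed
    )
    # 3. and 4. fit each candidate, then calculate validation RMSE
    val_rmse = {}
    for k in k_values:
        model = KNeighborsRegressor(n_neighbors=k).fit(X_vtrain, y_vtrain)
        val_rmse[k] = rmse(y_val, model.predict(X_val))
    # 5. choose the candidate with the lowest validation RMSE
    best_k = min(val_rmse, key=val_rmse.get)
    # 6. refit the tuned model to the (full) train data
    tuned = KNeighborsRegressor(n_neighbors=best_k).fit(X_train, y_train)
    # 7. calculate test RMSE for the tuned model
    test_rmse = rmse(y_test, tuned.predict(X_test))
    return best_k, val_rmse, test_rmse


# simulated sine data, generated here so the sketch is self-contained
rng = np.random.default_rng(0)
X = rng.uniform(-2 * np.pi, 2 * np.pi, size=(500, 1))
y = np.sin(X).ravel() + rng.normal(0, 0.25, size=500)

best_k, val_rmse, test_rmse = tune_and_test_knn(X, y, k_values=[1, 10, 100])
print("chosen k:", best_k)
print("test RMSE:", round(float(test_rmse), 3))
```

Note that the test data is touched exactly once, after the model has been chosen and refit.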

In general,

  • Validation RMSE is for tuning and selecting models, often via their tuning parameters.
  • Test RMSE is for reporting the performance of a selected model.

Next, we’ll add one more layer of complexity to this procedure, one that we will use for the rest of the semester: cross-validation. While technically more complicated, with the help of sklearn it will actually reduce the boilerplate code that we need to write.