Draft

4  Regression Introduction

Definitions, KNN, and Metrics

4.1 Setup and Objectives

In this note, we will discuss:

  • the supervised learning regression task,
  • \(k\)-nearest neighbors,
  • and metrics to evaluate regression models.
# basics
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# machine learning
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import root_mean_squared_error

4.2 The Goal of Regression

In the supervised learning regression task, we generally assume a data generating process that relates a response \(Y\) to features \(X\) through a functional relationship, \(f(X)\), plus some random noise, \(\epsilon\).

\[ \Large Y = f(X) + \epsilon \]

  • We want to learn the signal, \(f\).
  • We do not want to learn the noise, \(\epsilon\).

Stating this goal in terms of a data generating process is technically correct, but more statistical and theoretical than we will need. More practically, the goal of regression is to make good predictions of a numeric target (output) given unseen feature data (input).

Like the name suggests, a data generating process describes how to generate data, which in the case of supervised learning, is thought of as an \((X, Y)\) pair.

Consider the following example of a data generating process.

\[ \begin{aligned} X &\sim \text{Uniform}(-2\pi, 2\pi) \\ \epsilon &\sim N(\mu=0, \sigma=0.25) \\ Y &= \sin(X) + \epsilon \end{aligned} \]

Here, \(X\) is a feature and \(Y\) is the target. The relationship between \(X\) and \(Y\) is the sine function, \(\sin(X)\), plus some noise, \(\epsilon\). The noise is normally distributed with a mean of 0 and a standard deviation of 0.25.

You can think of the data generation as a step-by-step process. Repeat the following \(n\) times to create a dataset of size \(n\):

  • First, randomly generate \(x\), in this case from a uniform distribution.
  • Second, randomly generate the noise, \(\epsilon\), in this case from a normal distribution.
  • Lastly, create \(y\) by applying the sine function to \(x\) and then adding the noise.
    • Here, the sine function is the “signal” we want to learn. That is, the functional relationship, \(f\), between \(x\) and \(y\).

We can translate this to Python code that can simulate data according to this data generating process, which we write as a function for repeated usage.

def simulate_sin_data(n, sd, seed):

    # control randomness
    np.random.seed(seed)

    # simulate X data
    X = np.random.uniform(low=-2 * np.pi, high=2 * np.pi, size=(n, 1))

    # generate signal portion of y data
    signal = np.sin(X).ravel()

    # generate noise
    noise = np.random.normal(loc=0, scale=sd, size=n)

    # combine signal and noise
    y = signal + noise

    # return simulated data
    return X, y
X, y = simulate_sin_data(
    n=100,
    sd=0.25,
    seed=42,
)
print(X[:10])
[[-1.5766]
 [ 5.6638]
 [ 2.9153]
 [ 1.2398]
 [-4.3226]
 [-4.3229]
 [-5.5533]
 [ 4.6015]
 [ 1.2706]
 [ 2.6147]]
print(y)
[-0.9782 -0.6553  0.2473  0.4488  0.8701  1.0144  1.0363 -1.1234  0.7532
  0.3774  0.4846 -0.287  -0.9928  0.5841  0.78    0.9847 -0.8056  0.2242
 -0.8527 -0.8611  1.0604  1.0487 -0.5039 -1.0528 -0.8783 -0.5329  0.5054
 -0.0226  0.877   0.6521  1.4476  0.8844  0.7938 -0.6177 -0.8983 -0.6764
 -0.6186  1.5574  0.6874 -0.6078  0.9906 -0.3527  0.7045 -0.7204  0.0876
  0.6635 -0.3494 -0.1009  0.7006  1.2778 -0.6206 -0.4522 -0.6642 -1.095
  0.5549 -0.8143  0.631   0.7463  0.3084 -0.424  -1.1811 -0.3456 -0.6323
 -1.2815 -0.3222  0.8379  0.5782 -0.5637  0.8707  0.0314 -0.5851  0.2707
  0.1998 -0.6587  0.5786  0.3474 -0.4341  0.86   -0.9052  0.8148 -0.5224
  1.1182 -1.1481  0.8806 -0.9372 -0.6136  0.5431  0.7824 -0.7474 -0.2389
  1.2032  0.9198 -0.1965  0.5077 -0.4828 -0.2818  0.2625 -0.7046  0.3832
  1.1838]
# verify shapes of the data
X.shape, y.shape
((100, 1), (100,))

In general, when working with ML data in Python, especially with sklearn:

  • X will contain the features, usually as a two-dimensional numpy array
    • Here, we have a single feature so X has shape (100, 1).
  • y will contain the target, usually as a one-dimensional numpy array
    • Here, y has shape (100,).

The following table displays this data in a more human readable format.

x y
0 -1.58 -0.98
1 5.66 -0.66
2 2.92 0.25
3 1.24 0.45
4 -4.32 0.87
... ... ...
95 -0.08 -0.28
96 0.29 0.26
97 -0.91 -0.70
98 -5.96 0.38
99 -4.93 1.18

100 rows × 2 columns

Table 4.1: A data frame of some simulated data with a single feature \(x\) and a target \(y\).

This will be the starting point of regression tasks. Given the feature data, X, and the target data, y, we want to learn the relationship between the two well enough to make good predictions of \(y\) given new \(x\) values.

We can also visualize this data.

Show Code for Plot
# setup figure
fig, axs = plt.subplots(1, 2, figsize=(12, 5))

# add overall title
fig.suptitle("Simulated Sine Wave Data")

# x values to make predictions at for plotting purposes
x_plot = np.linspace(-2 * np.pi, 2 * np.pi, 1000).reshape((-1, 1))

# create subplot for "simulation study"
axs[0].set_title("Simulation Study")
axs[0].plot(x_plot, np.sin(x_plot), color="tab:gray", zorder=1)
axs[0].scatter(X, y, s=50, alpha=0.75)

# create subplot for "reality"
axs[1].set_title("Reality")
axs[1].scatter(X, y, s=50, alpha=0.75)

# show plot
plt.show()

In some sense, the left panel visualizes the data generating process. The sine wave is the signal, and the points are randomly placed around the sine wave according to the distribution of the noise. The right panel shows the reality that we will face when dealing with the regression task. We only have access to the data, and instead need to infer how it was generated. So, another way of thinking about the regression task is inferring the left panel, specifically the unknown signal, from the right panel.

So, if \(f\) is a function that describes the true signal within the data generating process, our goal is to use the data to estimate \(f\) with \(\hat{f}\), a function that approximates \(f\) well enough to make good predictions of \(Y\) given new \(X\) values.

Thinking about the right panel, without seeing the left…

  • When \(x = -\pi\), what should we predict for \(y\)?
  • When \(x = 0\), what should we predict for \(y\)?
  • When \(x = \pi\), what should we predict for \(y\)?

You can probably make some reasonable estimates, but how can we formalize this process? How can a machine learn to make these predictions? How can a machine learn to estimate \(f\)?

4.3 \(k\)-Nearest Neighbors

Before specifically introducing \(k\)-Nearest Neighbors (KNN), let’s discuss a simple but important general concept in machine learning, that you probably used when thinking about the predictions above.

To predict \(y\) at \(x = 0\), find some “nearby” samples in the data, with respect to \(x\). Predict the average of the \(y\) values of these samples.

Here, we’ve highlighted five points that are “nearby” to \(x = 0\). We then average their \(y\) values to predict \(y\) at \(x = 0\).

# get the "distance" of each point from x = 0
distances = np.abs(X).ravel()

# get the indexes for the five nearest neighbors
nearest_indices = np.argsort(distances)[:5]
# X values of the five nearest neighbors
print(X[nearest_indices])
[[-0.0606]
 [-0.078 ]
 [ 0.1789]
 [ 0.2522]
 [ 0.2857]]
# y values of the five nearest neighbors
print(y[nearest_indices])
[-0.3527 -0.2818 -0.0226 -0.1009  0.2625]
# average of the y values, use as prediction
print(np.mean(y[nearest_indices]))
-0.09912979149490428

Because we know the data generating process, we know that at \(x = 0\), \(y = \sin(0) = 0\), so our prediction here seems reasonably close. Of course in practice, we’ll need some other way to check our predictions.

Two questions arise:

  1. How is “nearby” defined? Distance, most often the Euclidean distance.
  2. How many “nearby” samples should we use? Good question!

If we formalize this process by picking a distance metric and a number of “nearby” samples, which we will call neighbors, we are describing \(k\)-nearest neighbors (KNN).

For some \(k\), which determines the number of neighbors to use, predict \(y\) at \(x\) using

\[ \hat{f}(x) = \frac{1}{k} \sum_{ \{i \ : \ x_i \in \mathcal{N}_k(x, \mathcal{D}) \} } y_i. \]

Here, we have:

  • \(\mathcal{D} = \{ (x_i, y_i) \in \mathbb{R}^p \times \mathbb{R}, \ i = 1, 2, \ldots n \}\), the data used to fit the model, where \(p\) is the number of features.
  • \(\{i \ : \ x_i \in \mathcal{N}_k(x, \mathcal{D}) \}\), the indexes of the samples that contain a value that is among the nearest neighbors to the point.

Don’t let this notation scare you! This is just a mathematical way of saying: find the \(k\) closest points to \(x\) and average their \(y\) values.

While it is simple enough to implement KNN in Python, we will instead use the KNeighborsRegressor class from sklearn.

Let’s use sklearn to fit a KNN model to the simulated data.

knn = KNeighborsRegressor(
    n_neighbors=5, # this is k
    metric="euclidean",
)
knn.get_params()
{'algorithm': 'auto',
 'leaf_size': 30,
 'metric': 'euclidean',
 'metric_params': None,
 'n_jobs': None,
 'n_neighbors': 5,
 'p': 2,
 'weights': 'uniform'}
_ = knn.fit(X, y)
knn.predict(X)
array([-1.0798, -0.6912,  0.487 ,  0.8163,  0.8792,  0.8792,  0.7342,
       -0.8962,  0.8163,  0.5533,  0.4161, -0.4859, -0.7048,  0.6769,
        0.9346,  0.9346, -0.6429,  0.2402, -0.6564, -0.6223,  0.9656,
        0.9004, -0.6223, -1.0798, -0.6564, -0.4974,  0.6769,  0.0021,
        0.8163,  0.7164,  0.853 ,  0.9594,  0.8883, -0.6912, -0.6158,
       -0.6128, -0.6429,  1.0781,  0.6454, -0.6564,  1.0482, -0.1994,
        0.5066, -0.8083, -0.389 ,  0.766 , -0.6649,  0.2402,  0.5066,
        0.8119, -0.4859, -0.4974, -0.7299, -0.9001,  0.8163, -0.7823,
        1.0206,  0.757 ,  0.7164, -0.6944, -1.0798, -0.389 , -0.7048,
       -1.0732, -0.389 ,  0.5066,  0.9004, -0.6128,  0.8072, -0.4859,
       -0.4301,  0.6769,  0.4161, -0.7048,  0.5821,  0.487 , -0.4301,
        0.8072, -1.0732,  1.15  , -0.8962,  1.0324, -0.6944,  0.8883,
       -0.6649, -0.6944,  0.487 ,  1.0144, -0.8417, -0.4719,  1.0482,
        0.5533, -0.4301,  0.6295, -0.4301, -0.1994,  0.2402, -0.8449,
        0.4161,  1.15  ])
Show Code for Plot
# setup figure
fig, ax = plt.subplots(1, 1, figsize=(8, 5))

# add overall title
fig.suptitle("Simulated Sine Wave Data")

# x values to make predictions at for plotting purposes
x_plot = np.linspace(-2 * np.pi, 2 * np.pi, 1000).reshape((-1, 1))

# create scatterplot of data
ax.plot(
    x_plot,
    knn.predict(x_plot),
    color="tab:gray",
    label="k = 5",
    zorder=1,
)
ax.scatter(X, y, s=50, alpha=0.75)
ax.legend()

# show plot
plt.show()

4.3.1 Distance Metrics

Given two vectors \(a\) and \(b\):

\[ \begin{aligned} a &= (a_{1},a_{2},\ldots ,a_{n}) \in \mathbb {R}^{n} \\ b &= (b_{1},b_{2},\ldots ,b_{n}) \in \mathbb {R}^{n} \end{aligned} \]

The Euclidean distance between \(a\) and \(b\) is defined to be

\[ d(a, b) = \left(\sum _{i=1}^{n}|a_{i}-b_{i}|^{2}\right)^{\frac{1}{2}} = \sqrt{\sum _{i=1}^{n}|a_{i}-b_{i}|^{2}}. \]

The Manhattan distance between \(a\) and \(b\) is defined to be

\[ d(a, b) = \sum _{i=1}^{n}|a_{i}-b_{i}|. \]

The Minkowski distance between \(a\) and \(b\) is defined to be

\[ d(a, b) = \left(\sum _{i=1}^{n}|a_{i}-b_{i}|^{p}\right)^{\frac {1}{p}} \]

For example, with two points in three dimensions:

\[ \begin{aligned} a &= (a_1, a_2, a_3) = (1, 2, 3)\\ b &= (b_1, b_2, b_3) = (4, 5, 6) \end{aligned} \]

The Euclidean distance between \(a\) and \(b\) is calculated as:

\[ \begin{aligned} d(a, b) &= \sqrt{\sum _{i=1}^{3}|a_{i}-b_{i}|^{2}} \\ &= \sqrt{(1 - 4)^2 + (2 - 5)^2 + (3 - 6)^2} \\ &= \sqrt{9 + 9 + 9} \\ &= \sqrt{27} \\ &\approx 5.19615 \end{aligned} \]

The following plot shows the distance between \((1, 2)\) and \((4, 6)\). The Euclidean distance is shown in solid gray, while the Manhattan distance is shown in dashed gray. The Euclidean distance is the “shortest” path between two points, while the Manhattan distance is the sum of the horizontal and vertical distances. The Manhattan distance (also known as the taxicab distance) gets its name from the grid-like layout of Manhattan streets.

4.4 Evaluating Regression

How do we quantify the performance of a model? Generally, we compare predicted values to actual values, but there are many ways to do so.

4.4.1 Metrics

  • \(y\) is a vector of the true (actual) values of length \(n\)
    • \(y_i\) is a particular elements of this vector
  • \(\hat{y}\) is a vector of the predicted values of length \(n\)
    • \(\hat{y}_i\) is a particular elements of this vector
  • \(n\) is the number of samples, which is the same for the actual and predicted values
  • \(\bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i\)

Root Mean Squared Error

\[ \text{RMSE}(y, \hat{y}) = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2} \]

def rmse(y_true, y_pred):
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

Mean Absolute Error

\[ \text{MAE}(y, \hat{y}) = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i| \]

Mean Absolute Percentage Error

\[ \text{MAPE}(y, \hat{y}) = \frac{1}{n} \sum_{i=1}^{n} \left| \frac{y_i - \hat{y}_i}{y_i} \right| \]

Coefficient of Determination

\[ R^2(y, \hat{y}) = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2} \]

Maximum Error

\[ \text{MaxError}(y, \hat{y}) = \max |y_i - \hat{y}_i| \]

4.4.2 Comparing Models

RMSE with k = 1:   0.00
RMSE with k = 10:  0.26
RMSE with k = 100: 0.75

RMSE with k = 1:   0.49
RMSE with k = 10:  0.43
RMSE with k = 100: 0.83

In the two preceding plots, the first plots the points that were used to fit the models. In the second, the plot displays the same learned models, but plots “new” unseen data. For illustrative purposes, we simply simulated new data. However, in practice, we will not be able to do so. Notice how the model with the best (lowest) RMSE changes!