# basic imports
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
# machine learning imports
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_blobs
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
In these notes, we will discuss:
- the supervised learning classification task,
- the Bayes classifier,
- \(k\)-nearest neighbors,
- classification metrics,
- and estimating conditional probabilities with a learned model.
Along the way, you should notice that except for the conditional probabilities, the process followed mirrors that of regression.
The Goal of Classification
Like regression, classification is a supervised learning task. However, while regression is concerned with predicting a numeric target variable, classification seeks to predict a categorical target variable.
X, y = make_blobs(
    n_samples=800,
    n_features=2,
    centers=3,
    cluster_std=4.2,
    random_state=42,
)

X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.20,
    random_state=42,
)
fig, ax = plt.subplots()
fig.set_size_inches(8, 6)
scatter = sns.scatterplot(
    x=X_train[:, 0],
    y=X_train[:, 1],
    hue=y_train,
    palette="tab10",
    ax=ax,
    s=50,
    edgecolor="k",
    alpha=0.75,
)
ax.set_xlabel("$x_1$")
ax.set_ylabel("$x_2$")
ax.set_title("Simulated Training Data for Classification")
plt.show()
Consider new data at \(x = (x_1, x_2)\). The goal of classification is to predict the class label \(y\) for this new data point. Consider the following example when \(x = (-5, 5)\).
fig, ax = plt.subplots()
fig.set_size_inches(8, 6)
scatter = sns.scatterplot(
    x=X_train[:, 0],
    y=X_train[:, 1],
    hue=y_train,
    palette="tab10",
    ax=ax,
    s=50,
    edgecolor="k",
    alpha=0.75,
)
ax.set_xlabel("$x_1$")
ax.set_ylabel("$x_2$")
ax.set_title("Simulated Training Data for Classification")
ax.plot(
    -5,
    5,
    marker="o",
    markersize=10,
    color="red",
)
plt.show()
The fundamental question that classification seeks to answer is: what is the probability that \(Y = g\) given \(X = x\)?
So, in this case, what is the probability that:
- \(Y = 0\) (blue) when \(x = (-5, 5)\)?
- \(Y = 1\) (orange) when \(x = (-5, 5)\)?
- \(Y = 2\) (green) when \(x = (-5, 5)\)?
With these questions answered, we can then make a prediction for the class label of \(x = (-5, 5)\). Simply predict the class label with the highest probability!
Unfortunately, we do not know the true conditional probabilities. So instead, we will fit a model that can be used to estimate these probabilities.
Bayes Classifier
The Bayes Classifier, \(C^B(x)\), is the classifier that minimizes the probability of misclassification, and thus is considered the optimal classifier. However, the Bayes Classifier cannot be used in practice as it requires knowledge of the true conditional probabilities. It is simply a useful concept for theoretical understanding of classification.
\[ p_g(x) = P\left[ Y = g \mid X = x \right] \]
\[ C^B(x) = \underset{g \in \{1, 2, \ldots G\}}{\text{argmax}} P\left[ Y = g \mid X = x \right] \]
The Bayes Classifier simply says “predict the class label with the highest conditional probability”.
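To make this rule concrete, here is a minimal sketch with made-up conditional probabilities for a single point \(x\); the probability values are hypothetical, since the true conditional probabilities are never known in practice.
# hypothetical conditional probabilities p_g(x) for one point x
# (values made up for illustration; in practice these are unknown)
p = np.array([0.25, 0.60, 0.15])  # classes 0, 1, 2

# the Bayes Classifier predicts the class with the highest conditional probability
print(np.argmax(p))  # prints 1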
Building a Classifier
Given that we cannot use the Bayes Classifier in practice, we will build a model that can be used to estimate the relevant conditional probabilities. With those estimated probabilities, we then make predictions using the same rule as the Bayes Classifier, but with the estimated probabilities rather than the known conditional probabilities.
\[ \hat{p}_g(x) = \hat{P}\left[ Y = g \mid X = x \right] \]
\[ \hat{C}(x) = \underset{g \in \{1, 2, \ldots G\}}{\text{argmax}} \hat{p}_g(x) \]
\(k\)-Nearest Neighbors
Our first model for classification will be the \(k\)-nearest neighbors classifier. Using \(k\)-nearest neighbors for classification is quite similar to \(k\)-nearest neighbors for regression.
After finding the \(k\)-nearest neighbors of \(x\), we will predict the class label of \(x\) as the class label that is most common among the \(k\)-nearest neighbors. More specifically, we can utilize the \(k\)-nearest neighbors to estimate the conditional probabilities. We will simply count the number of neighbors of each class label and divide by \(k\) to get the estimated probabilities!
\[ \hat{p}_g(x) = \hat{P}\left[ Y = g \mid X = x \right] = \frac{1}{k} \sum_{i \in \mathcal{N}_k(x)} I(y_i = g) \]
After estimating the conditional probabilities, we then predict the class label of \(x\) to be the class with the highest estimated conditional probability.
\[ \hat{C}(x) = \underset{g \in \{1, 2, \ldots G\}}{\text{argmax}} \hat{p}_g(x) \]
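As a sanity check on these formulas, the estimate \(\hat{p}_g(x)\) can be computed by hand. The following sketch does so for the point \(x = (-5, 5)\) using the training data simulated above; the choice \(k = 5\) is arbitrary and only for illustration. `KNeighborsClassifier` produces the same kind of estimates through its `predict_proba` method.
# by-hand k-nearest neighbors estimate of the conditional probabilities at x = (-5, 5)
# (k = 5 is an arbitrary choice for illustration)
x_new = np.array([-5, 5])
k = 5

# Euclidean distance from x_new to every training point
distances = np.linalg.norm(X_train - x_new, axis=1)

# indices of the k closest training points
nearest = np.argsort(distances)[:k]

# estimated probabilities: count of each class among the neighbors, divided by k
classes = np.unique(y_train)
p_hat = np.array([np.mean(y_train[nearest] == g) for g in classes])
print(p_hat)

# predicted class: the label with the highest estimated probability
print(classes[np.argmax(p_hat)])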
Like \(k\)-nearest neighbors regression, \(k\)-nearest neighbors for classification requires selecting a value for \(k\).
Metrics
There are many metrics to evaluate the performance of a classifier. We will detail a long list of them in the future. For introductory purposes, we will focus on two metrics: accuracy and misclassification.
\[ \text{Accuracy}(y, \hat{y}) = \frac{1}{n} \sum_{i=1}^{n} I(y_i = \hat{y}_i) \]
The accuracy is simply the proportion of correct predictions made by the classifier. We will see that accuracy is in many ways the default metric for classification, especially within sklearn.
Note that unlike RMSE for regression, which we want to minimize, accuracy is a metric that we want to maximize. The classification error, or misclassification rate, is simply the proportion of incorrect predictions made by the classifier. So while these two metrics are related, and essentially measure the same thing, it can sometimes be useful to consider errors instead of correct predictions.
\[ \text{Misclassification}(y, \hat{y}) = \frac{1}{n} \sum_{i=1}^{n} I(y_i \neq \hat{y}_i) \]
\[ I(y_i = \hat{y}_i) = \begin{cases} 1 & \text{if } y_i = \hat{y}_i \\ 0 & \text{otherwise} \end{cases} \]
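As a small illustration of these two metrics, consider the made-up labels and predictions below (not taken from the simulated data); the by-hand accuracy matches what `accuracy_score` reports.
# made-up labels and predictions, for illustration only
y_true = np.array([0, 1, 2, 1, 0])
y_hat = np.array([0, 1, 1, 1, 0])

# accuracy: proportion of correct predictions
print(np.mean(y_true == y_hat))  # 0.8

# misclassification: proportion of incorrect predictions
print(np.mean(y_true != y_hat))  # 0.2

# sklearn's accuracy_score computes the same accuracy
print(accuracy_score(y_true, y_hat))  # 0.8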
Example: Tuning a \(k\)-Nearest Neighbors Classifier
X_vtrain, X_validation, y_vtrain, y_validation = train_test_split(
    X_train,
    y_train,
    test_size=0.20,
    random_state=42,
)
k_values = range(1, 501, 10)
train_accuracies = []
validation_accuracies = []

for k in k_values:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_vtrain, y_vtrain)
    y_pred_train = knn.predict(X_vtrain)
    y_pred_validation = knn.predict(X_validation)
    train_accuracy = accuracy_score(y_vtrain, y_pred_train)
    validation_accuracy = accuracy_score(y_validation, y_pred_validation)
    train_accuracies.append(train_accuracy)
    validation_accuracies.append(validation_accuracy)

best_k = k_values[np.argmax(validation_accuracies)]
print(f"Best k: {best_k}")
print(f"Validation Accuracy: {max(validation_accuracies):.2f}")
Best k: 201
Validation Accuracy: 0.87
validation_error = 1 - np.array(validation_accuracies)
train_error = 1 - np.array(train_accuracies)
fig, ax = plt.subplots()
fig.set_size_inches(8, 6)
ax.plot(
    k_values,
    train_error,
    label="Train Error",
    color="black",
    marker="o",
    markeredgecolor="black",
    markerfacecolor="tab:blue",
)
ax.plot(
    k_values,
    validation_error,
    label="Validation Error",
    color="black",
    marker="o",
    markeredgecolor="black",
    markerfacecolor="tab:orange",
)
ax.set_xlabel("k (Number of Neighbors)")
ax.set_ylabel("Error")
ax.set_title("Train and Validation Error versus k")
ax.invert_xaxis()
ax.legend()
plt.show()
proportions = pd.Series(y_validation).value_counts(normalize=True)
print(proportions)
1 0.37
0 0.32
2 0.31
Name: proportion, dtype: float64
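One way to read these proportions is as a naive baseline: a classifier that always predicted the most common class in the validation data would be correct only about 37% of the time, so the validation accuracy of 0.87 is a substantial improvement. A minimal sketch, assuming the proportions computed above:
# accuracy of always predicting the most common validation class
# (a naive baseline for comparison)
baseline_accuracy = proportions.max()
print(f"Baseline Accuracy: {baseline_accuracy:.2f}")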
knn = KNeighborsClassifier(n_neighbors=best_k)
knn.fit(X_train, y_train)
y_pred_test = knn.predict(X_test)
test_accuracy = accuracy_score(y_test, y_pred_test)
print(f"Test Accuracy: {test_accuracy:.2f}")
Test Accuracy: 0.90
print(knn.predict_proba(X_test)[:10])
[[0.6119403 0.37313433 0.01492537]
[0.10945274 0.80099502 0.08955224]
[0.09950249 0.8358209 0.06467662]
[0.0199005 0.04975124 0.93034826]
[0.74626866 0.21890547 0.03482587]
[0.72636816 0.27363184 0. ]
[0.74626866 0.08955224 0.1641791 ]
[0.06467662 0.03482587 0.90049751]
[0.06467662 0.53233831 0.40298507]
[0.85074627 0.13930348 0.00995025]]
print(y_test[:10])
[1 1 1 2 0 0 0 2 1 0]
print(y_pred_test[:10])
[0 1 1 2 0 0 0 2 1 0]
results_df = pd.DataFrame(
    {
        "Actual": y_test[:10],
        "Predicted": y_pred_test[:10],
    }
)
prob_df = pd.DataFrame(knn.predict_proba(X_test)[:10])
prob_df.columns = [f"prob_{i}" for i in knn.classes_]
results_df = pd.concat([results_df, prob_df], axis=1)
print(results_df)
Actual Predicted prob_0 prob_1 prob_2
0 1 0 0.61 0.37 0.01
1 1 1 0.11 0.80 0.09
2 1 1 0.10 0.84 0.06
3 2 2 0.02 0.05 0.93
4 0 0 0.75 0.22 0.03
5 0 0 0.73 0.27 0.00
6 0 0 0.75 0.09 0.16
7 2 2 0.06 0.03 0.90
8 1 1 0.06 0.53 0.40
9 0 0 0.85 0.14 0.01