In these notes, we will cover various preprocessing techniques and their applications, including:
Numeric Scaling: Standardizing numeric features to have a mean of 0 and a standard deviation of 1.
Categorical Encoding: Converting categorical features into (multiple) numeric features using one-hot encoding.
Imputation: Handling missing data by replacing missing values.
Pipelines: Combining preprocessing steps and machine learning models for seamless application.
To summarize these techniques, we will demonstrate their applications using the Palmer Penguins dataset.
# basic imports
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# machine learning imports
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# preprocessing imports
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline

# data imports
from palmerpenguins import load_penguins
Motivation
# define sample size
n_samples = 10

# set random seed for reproducibility
np.random.seed(42)

# create feature variables of different types
num_feat_big = np.random.normal(loc=1000, scale=100, size=n_samples)
num_feat_small = np.random.normal(loc=0, scale=1, size=n_samples)
cat_feat_letters = np.random.choice(["A", "B", "C"], size=n_samples)
cat_feat_binary = np.random.choice([0, 1], size=n_samples)
target_binary = np.random.choice([0, 1], size=n_samples)
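Collecting these arrays into a data frame makes them easier to inspect. The assembly below is a minimal sketch; the column names are illustrative choices.

# assemble the simulated features and target into a data frame
# (the column names here are illustrative assumptions)
df = pd.DataFrame({
    "num_feat_big": num_feat_big,
    "num_feat_small": num_feat_small,
    "cat_feat_letters": cat_feat_letters,
    "cat_feat_binary": cat_feat_binary,
    "target_binary": target_binary,
})
df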
Preprocessing serves two purposes: making modeling possible in the first place, and improving model performance.
Consider the data frame above. We have numeric features with different scales. We also have two categorical features, one of which is currently encoded as letters, that is, as strings. There are no missing values here, but in general we could have missing values that need to be imputed.
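Although this simulated data has no missing values, a minimal imputation sketch is easy to show: below, one value of num_feat_small is artificially set to NaN and then replaced with the mean of the observed values using SimpleImputer.

# a minimal imputation sketch: the missing value is introduced artificially
num_feat_missing = num_feat_small.copy()
num_feat_missing[3] = np.nan  # pretend one observation was not recorded

# replace the missing value with the mean of the observed values
imputer = SimpleImputer(strategy="mean")
num_feat_imputed = imputer.fit_transform(num_feat_missing.reshape(-1, 1))
num_feat_imputed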
Now consider modeling this data with \(k\)-nearest neighbors.
Until we have dealt with any missing data, we cannot fit the model. Additionally, something will need to be done to the cat_feat_letters variable, as we cannot include values like “B” in distance calculations. These transformations would make modeling possible.
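As a sketch of how the string-valued feature could be made usable for distance calculations, OneHotEncoder expands cat_feat_letters into one indicator column per category. The options shown (such as sparse_output=False, which simply returns a dense array) are choices made here for readability.

# one-hot encode the letter-valued categorical feature
encoder = OneHotEncoder(sparse_output=False)
cat_feat_encoded = encoder.fit_transform(cat_feat_letters.reshape(-1, 1))
print(encoder.categories_)
print(cat_feat_encoded[:5])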
Notice that the two numeric features are on different scales. We could consider scaling them, thus putting them on the same scale. This is an example of a transformation that could improve model performance. (It isn’t a guarantee, but it could help.)
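A minimal sketch of that scaling with StandardScaler, applied to the two numeric features stacked side by side, might look like the following; after fitting, each column has mean 0 and standard deviation 1.

# standardize both numeric features to mean 0 and standard deviation 1
scaler = StandardScaler()
num_feats = np.column_stack([num_feat_big, num_feat_small])
num_feats_scaled = scaler.fit_transform(num_feats)
print(num_feats_scaled.mean(axis=0))
print(num_feats_scaled.std(axis=0))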
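The final model below assumes a preprocessor has already been defined. One way such a column transformer could be constructed for the Palmer Penguins features (the specific columns and imputation strategies here are assumptions) is sketched below.

# a sketch of a possible preprocessor for the penguins features
# (the specific columns and imputation strategies are assumptions)
numeric_columns = ["bill_length_mm", "bill_depth_mm", "flipper_length_mm", "body_mass_g"]
categorical_columns = ["island", "sex"]
preprocessor = make_column_transformer(
    (make_pipeline(SimpleImputer(strategy="mean"), StandardScaler()), numeric_columns),
    (make_pipeline(SimpleImputer(strategy="most_frequent"), OneHotEncoder(handle_unknown="ignore")), categorical_columns),
)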
# fit the model with the best k on the full training data
final_model = make_pipeline(
    preprocessor,
    KNeighborsClassifier(n_neighbors=k_best),
)
final_model.fit(X_train, y_train)

# predict on the test data
y_test_pred = final_model.predict(X_test)

# calculate and print the test accuracy
test_accuracy = accuracy_score(y_test, y_test_pred)
print(f"Test Accuracy: {test_accuracy}")
Test Accuracy: 0.9834710743801653
As an exercise, consider: what is the distance between (0.2, 1000, “A”) and (1.2, 10001, “C”)? Without encoding the categorical values, this distance cannot be computed at all, and even ignoring that feature, the difference in the large-scale numeric feature would completely dominate the result. (A similar question applies to the Auto MPG data.)
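To make the point concrete, the rough sketch below computes that distance after preprocessing. The fitted transformers reuse the simulated features from earlier purely for illustration, and the numeric columns are taken in the order (small, big) to match the points in the question.

# a rough sketch of the distance computation after preprocessing
# (the fitted transformers reuse the simulated features above, purely for illustration)
point_scaler = StandardScaler().fit(np.column_stack([num_feat_small, num_feat_big]))
point_encoder = OneHotEncoder(categories=[["A", "B", "C"]], sparse_output=False).fit(
    cat_feat_letters.reshape(-1, 1)
)

# the two points from the question, split into numeric and categorical parts
points_num = np.array([[0.2, 1000], [1.2, 10001]])
points_cat = np.array([["A"], ["C"]])

# transform, concatenate, and compute the Euclidean distance
points = np.hstack([point_scaler.transform(points_num), point_encoder.transform(points_cat)])
distance = np.linalg.norm(points[0] - points[1])
print(distance)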