Preprocessing

Modeling Heterogeneous and Missing Data

Modified: March 3, 2025

In these notes, we will cover various preprocessing techniques and their applications, including:

  • Numeric scaling
  • Categorical encoding
  • Imputation of missing values
  • Pipelines that combine these steps

To tie these techniques together, we will demonstrate their application using the Palmer Penguins dataset.

# basic imports
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# machine learning imports
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# preprocessing imports
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline

# data imports
from palmerpenguins import load_penguins

Motivation

# define sample size
n_samples = 10

# set random seed for reproducibility
np.random.seed(42)

# create feature variables of different types
num_feat_big = np.random.normal(loc=1000, scale=100, size=n_samples)
num_feat_small = np.random.normal(loc=0, scale=1, size=n_samples)
cat_feat_letters = np.random.choice(["A", "B", "C"], size=n_samples)
cat_feat_binary = np.random.choice([0, 1], size=n_samples)
target_binary = np.random.choice([0, 1], size=n_samples)
pd.DataFrame(
    {
        "num_feat_big": num_feat_big,
        "num_feat_small": num_feat_small,
        "cat_feat_letters": cat_feat_letters,
        "cat_feat_binary": cat_feat_binary,
        "target": target_binary,
    }
)
num_feat_big num_feat_small cat_feat_letters cat_feat_binary target
0 1049.671415 -0.463418 B 0 1
1 986.173570 -0.465730 B 0 0
2 1064.768854 0.241962 C 0 1
3 1152.302986 -1.913280 B 0 0
4 976.584663 -1.724918 C 1 1
5 976.586304 -0.562288 C 1 1
6 1157.921282 -1.012831 A 0 1
7 1076.743473 0.314247 C 1 0
8 953.052561 -0.908024 A 1 1
9 1054.256004 -1.412304 C 1 0

Why do we need to preprocess data? Two reasons:

  • Preprocessing is done to make modeling possible.
  • Preprocessing is done to improve model performance.

Consider the data frame above. We have numeric features on very different scales. We also have two categorical features, one of which is currently encoded as letters, that is, as strings. There happen to be no missing values here, but in general we could also have missing values that need to be imputed.

Now consider modeling this data with \(k\)-nearest neighbors.

Until we have dealt with any missing data, we cannot fit the model. Additionally, something will need to be done to the cat_feat_letters variable, as we cannot include values like “B” in distance calculations. These transformations would make modeling possible.

Notice that the two numeric features are on different scales. We could consider scaling them, thus putting them on the same scale. This is an example of a transformation that could improve model performance. (It isn’t a guarantee, but it could help.)

Numeric Scaling

df_numeric = pd.DataFrame(
    {
        "num_feat_big": num_feat_big,
        "num_feat_small": num_feat_small,
    }
)
df_numeric
num_feat_big num_feat_small
0 1049.671415 -0.463418
1 986.173570 -0.465730
2 1064.768854 0.241962
3 1152.302986 -1.913280
4 976.584663 -1.724918
5 976.586304 -0.562288
6 1157.921282 -1.012831
7 1076.743473 0.314247
8 953.052561 -0.908024
9 1054.256004 -1.412304
standard_scaler = StandardScaler()
_ = standard_scaler.fit(df_numeric)
print(standard_scaler.transform(df_numeric))
[[ 0.07093253  0.45668015]
 [-0.85481899  0.45345355]
 [ 0.29104199  1.44107232]
 [ 1.56722474 -1.56667381]
 [-0.99461815 -1.30380487]
 [-0.99459421  0.31870247]
 [ 1.64913533 -0.31005311]
 [ 0.46562306  1.54194965]
 [-1.33769874 -0.16378976]
 [ 0.13777244 -0.86753659]]
print(standard_scaler.mean_)
print(standard_scaler.scale_)
[ 1.04480611e+03 -7.90658235e-01]
[68.59059303  0.71656397]
arr = standard_scaler.transform(df_numeric)
col_mean = np.mean(arr, axis=0)
col_sd = np.std(arr, axis=0)

print(f"Column means: {col_mean}")
print(f"Column standard deviations: {col_sd}")
Column means: [ 1.49880108e-15 -4.44089210e-17]
Column standard deviations: [1. 1.]
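
The mean_ and scale_ attributes are exactly what the transform applies: each column is shifted by its mean and divided by its standard deviation, \(z = (x - \bar{x}) / s\). As a quick sanity check, the same result can be computed by hand:

# verify that the transform is (x - mean_) / scale_, column by column
manual_scaled = (df_numeric - standard_scaler.mean_) / standard_scaler.scale_
print(np.allclose(manual_scaled, standard_scaler.transform(df_numeric)))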

Categorical Encoding

df_categorical = pd.DataFrame(
    {
        "cat_feat_letters": cat_feat_letters,
        "cat_feat_binary": cat_feat_binary,
    }
)
df_categorical
cat_feat_letters cat_feat_binary
0 B 0
1 B 0
2 C 0
3 B 0
4 C 1
5 C 1
6 A 0
7 C 1
8 A 1
9 C 1
one_hot_encoder = OneHotEncoder(
    handle_unknown="infrequent_if_exist",
)
_ = one_hot_encoder.fit(df_categorical)
print(one_hot_encoder.transform(df_categorical).todense())
[[0. 1. 0. 1. 0.]
 [0. 1. 0. 1. 0.]
 [0. 0. 1. 1. 0.]
 [0. 1. 0. 1. 0.]
 [0. 0. 1. 0. 1.]
 [0. 0. 1. 0. 1.]
 [1. 0. 0. 1. 0.]
 [0. 0. 1. 0. 1.]
 [1. 0. 0. 0. 1.]
 [0. 0. 1. 0. 1.]]
dummy_encoder = OneHotEncoder(
    handle_unknown="infrequent_if_exist",
    drop="first",
)
_ = dummy_encoder.fit(df_categorical)
print(dummy_encoder.transform(df_categorical).todense())
[[1. 0. 0.]
 [1. 0. 0.]
 [0. 1. 0.]
 [1. 0. 0.]
 [0. 1. 1.]
 [0. 1. 1.]
 [0. 0. 0.]
 [0. 1. 1.]
 [0. 0. 1.]
 [0. 1. 1.]]
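
It can be hard to remember which encoded column corresponds to which category. Assuming a reasonably recent scikit-learn (1.0 or later), the generated column names can be recovered from the fitted encoders:

# show which category each encoded column represents
print(one_hot_encoder.get_feature_names_out())
print(dummy_encoder.get_feature_names_out())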

Imputation

Numeric Features

df_numeric.loc[[1, 2], "num_feat_big"] = np.nan
df_numeric
num_feat_big num_feat_small
0 1049.671415 -0.463418
1 NaN -0.465730
2 NaN 0.241962
3 1152.302986 -1.913280
4 976.584663 -1.724918
5 976.586304 -0.562288
6 1157.921282 -1.012831
7 1076.743473 0.314247
8 953.052561 -0.908024
9 1054.256004 -1.412304
simple_imputer = SimpleImputer(strategy="median")
_ = simple_imputer.fit(df_numeric)
print(simple_imputer.transform(df_numeric))
[[ 1.04967142e+03 -4.63417693e-01]
 [ 1.05196371e+03 -4.65729754e-01]
 [ 1.05196371e+03  2.41962272e-01]
 [ 1.15230299e+03 -1.91328024e+00]
 [ 9.76584663e+02 -1.72491783e+00]
 [ 9.76586304e+02 -5.62287529e-01]
 [ 1.15792128e+03 -1.01283112e+00]
 [ 1.07674347e+03  3.14247333e-01]
 [ 9.53052561e+02 -9.08024076e-01]
 [ 1.05425600e+03 -1.41230370e+00]]

Categorical Features

df_categorical.loc[[3, 4], "cat_feat_letters"] = np.nan
df_categorical
cat_feat_letters cat_feat_binary
0 B 0
1 B 0
2 C 0
3 NaN 0
4 NaN 1
5 C 1
6 A 0
7 C 1
8 A 1
9 C 1
simple_imputer = SimpleImputer(strategy="most_frequent")
_ = simple_imputer.fit(df_categorical)
print(simple_imputer.transform(df_categorical))
[['B' 0]
 ['B' 0]
 ['C' 0]
 ['C' 0]
 ['C' 1]
 ['C' 1]
 ['A' 0]
 ['C' 1]
 ['A' 1]
 ['C' 1]]

Pipelines

df_all = pd.concat([df_numeric, df_categorical], axis=1)
df_all["target"] = target_binary
df_all
num_feat_big num_feat_small cat_feat_letters cat_feat_binary target
0 1049.671415 -0.463418 B 0 1
1 NaN -0.465730 B 0 0
2 NaN 0.241962 C 0 1
3 1152.302986 -1.913280 NaN 0 0
4 976.584663 -1.724918 NaN 1 1
5 976.586304 -0.562288 C 1 1
6 1157.921282 -1.012831 A 0 1
7 1076.743473 0.314247 C 1 0
8 953.052561 -0.908024 A 1 1
9 1054.256004 -1.412304 C 1 0
numeric_features = ["num_feat_big", "num_feat_small"]
categorical_features = ["cat_feat_letters", "cat_feat_binary"]
features = numeric_features + categorical_features
target = "target"
X = df_all[features]
y = df_all[target]
numeric_transformer = make_pipeline(
    SimpleImputer(strategy="median"),
    StandardScaler(),
)

categorical_transformer = make_pipeline(
    SimpleImputer(strategy="most_frequent"),
    OneHotEncoder(handle_unknown="infrequent_if_exist"),
)

preprocessor = make_column_transformer(
    (numeric_transformer, numeric_features),
    (categorical_transformer, categorical_features),
    remainder="drop",
)

preprocessor.fit(X)
preprocessor.transform(X)
array([[-0.0066037 ,  0.45668015,  0.        ,  1.        ,  0.        ,
         1.        ,  0.        ],
       [ 0.02834041,  0.45345355,  0.        ,  1.        ,  0.        ,
         1.        ,  0.        ],
       [ 0.02834041,  1.44107232,  0.        ,  0.        ,  1.        ,
         1.        ,  0.        ],
       [ 1.55792867, -1.56667381,  0.        ,  0.        ,  1.        ,
         1.        ,  0.        ],
       [-1.12075006, -1.30380487,  0.        ,  0.        ,  1.        ,
         0.        ,  1.        ],
       [-1.12072503,  0.31870247,  0.        ,  0.        ,  1.        ,
         0.        ,  1.        ],
       [ 1.64357488, -0.31005311,  1.        ,  0.        ,  0.        ,
         1.        ,  0.        ],
       [ 0.40608715,  1.54194965,  0.        ,  0.        ,  1.        ,
         0.        ,  1.        ],
       [-1.47947724, -0.16378976,  1.        ,  0.        ,  0.        ,
         0.        ,  1.        ],
       [ 0.06328452, -0.86753659,  0.        ,  0.        ,  1.        ,
         0.        ,  1.        ]])
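
The transformed array no longer carries column names. Assuming scikit-learn 1.1 or later, where SimpleImputer supports get_feature_names_out, the generated names can be recovered from the fitted column transformer and used to label the result:

# recover the generated column names and attach them to the transformed array
feature_names = preprocessor.get_feature_names_out()
pd.DataFrame(preprocessor.transform(X), columns=feature_names)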
mod = make_pipeline(
    preprocessor,
    KNeighborsClassifier(n_neighbors=5),
)
mod.fit(X, y)
mod.predict(X)
array([1, 1, 1, 0, 1, 1, 0, 1, 1, 1])
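
Because the imputer, scaler, and encoder all live inside the pipeline, new observations can be passed in raw form and the same preprocessing is applied automatically at prediction time. As an illustration, here is a single hypothetical new row with a missing numeric value:

# a hypothetical new observation with a missing value
X_new = pd.DataFrame(
    {
        "num_feat_big": [np.nan],
        "num_feat_small": [0.5],
        "cat_feat_letters": ["B"],
        "cat_feat_binary": [1],
    }
)
print(mod.predict(X_new))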

Example: Palmer Penguins

penguins = load_penguins()
penguins_train, penguins_test = train_test_split(
    penguins,
    test_size=0.35,
    random_state=42,
)
penguins_train
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex year
292 Chinstrap Dream 50.3 20.0 197.0 3300.0 male 2007
302 Chinstrap Dream 50.5 18.4 200.0 3400.0 female 2008
56 Adelie Biscoe 39.0 17.5 186.0 3550.0 female 2008
271 Gentoo Biscoe NaN NaN NaN NaN NaN 2009
10 Adelie Torgersen 37.8 17.1 186.0 3300.0 NaN 2007
... ... ... ... ... ... ... ... ...
188 Gentoo Biscoe 42.6 13.7 213.0 4950.0 female 2008
71 Adelie Torgersen 39.7 18.4 190.0 3900.0 male 2008
106 Adelie Biscoe 38.6 17.2 199.0 3750.0 female 2009
270 Gentoo Biscoe 47.2 13.7 214.0 4925.0 female 2009
102 Adelie Biscoe 37.7 16.0 183.0 3075.0 female 2009

223 rows × 8 columns
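
Before setting up any preprocessing, it is worth checking how much missing data we are dealing with. A quick way to do this is to count the missing values per column in the training data:

# count missing values per column in the training data
print(penguins_train.isna().sum())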

plot = sns.jointplot(
    data=penguins,
    x="bill_length_mm",
    y="bill_depth_mm",
    hue="species",
    space=0,
    zorder=2,
)
plot.set_axis_labels(
    xlabel="Bill Length (mm)",
    ylabel="Bill Depth (mm)",
)
plot.figure.suptitle(
    t="Palmer Penguins",
    y=1.02,
)
plot.ax_joint.legend(
    title="Species",
    loc="lower left",
)
plot.ax_joint.grid(
    color="lightgrey",
    linestyle="--",
    linewidth=0.75,
    zorder=1,
)
plot.figure.set_size_inches(
    w=8,
    h=8,
)

penguins_vtrain, penguins_validation = train_test_split(
    penguins_train,
    test_size=0.35,
    random_state=42,
)
numeric_features = [
    "bill_length_mm",
    "bill_depth_mm",
    "body_mass_g",
]
categorical_features = ["sex"]
features = numeric_features + categorical_features
target = "species"
X_train = penguins_train[features]
y_train = penguins_train["species"]

X_test = penguins_test[features]
y_test = penguins_test["species"]

X_vtrain = penguins_vtrain[features]
y_vtrain = penguins_vtrain["species"]

X_validation = penguins_validation[features]
y_validation = penguins_validation["species"]
# define preprocessing for numeric features
numeric_transformer = make_pipeline(
    SimpleImputer(strategy="median"),
    StandardScaler(),
)

# define preprocessing for categorical features
categorical_transformer = make_pipeline(
    SimpleImputer(strategy="most_frequent"),
    OneHotEncoder(handle_unknown="infrequent_if_exist"),
)

# create general preprocessor
preprocessor = make_column_transformer(
    (numeric_transformer, numeric_features),
    (categorical_transformer, categorical_features),
    remainder="drop",
)

# try many values of k for knn
validation_accuracy_scores = []
param_grid = [1, 5, 10, 15, 20, 25, 50, 100]
for k in param_grid:
    mod = make_pipeline(
        preprocessor,
        KNeighborsClassifier(n_neighbors=k),
    )
    mod.fit(X_vtrain, y_vtrain)
    y_pred = mod.predict(X_validation)
    validation_accuracy_score = accuracy_score(y_validation, y_pred)
    validation_accuracy_scores.append(validation_accuracy_score)

# get best k from validation process
k_best = param_grid[np.argmax(validation_accuracy_scores)]

# arrange and print validation results
validation_results = pd.DataFrame(
    {
        "k": param_grid,
        "Accuracy": validation_accuracy_scores,
    }
)
print(validation_results)
print(f"")
     k  Accuracy
0    1  0.974684
1    5  0.974684
2   10  0.962025
3   15  0.962025
4   20  0.962025
5   25  0.962025
6   50  0.911392
7  100  0.746835
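
The manual loop above makes the validation logic explicit. The same search could instead be done with scikit-learn's GridSearchCV, which uses cross-validation on the training data rather than a single validation split. A sketch, assuming the step name that make_pipeline generates for the classifier ("kneighborsclassifier"):

from sklearn.model_selection import GridSearchCV

# equivalent search over k using 5-fold cross-validation
grid_search = GridSearchCV(
    make_pipeline(preprocessor, KNeighborsClassifier()),
    param_grid={"kneighborsclassifier__n_neighbors": param_grid},
    cv=5,
    scoring="accuracy",
)
grid_search.fit(X_train, y_train)
print(grid_search.best_params_)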
# fit the model with the best k on the full training data
final_model = make_pipeline(
    preprocessor,
    KNeighborsClassifier(n_neighbors=k_best),
)
final_model.fit(X_train, y_train)

# predict on the test data
y_test_pred = final_model.predict(X_test)

# calculate and print the test accuracy
test_accuracy = accuracy_score(y_test, y_test_pred)
print(f"Test Accuracy: {test_accuracy}")
Test Accuracy: 0.9834710743801653
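
Accuracy alone does not show which species are confused with which. A short follow-up, using standard functions from sklearn.metrics, is to inspect the confusion matrix and per-class metrics on the test set:

from sklearn.metrics import classification_report, confusion_matrix

# confusion matrix and per-class precision/recall on the test set
print(confusion_matrix(y_test, y_test_pred))
print(classification_report(y_test, y_test_pred))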