# basics
import numpy as np
import pandas as pd
# plotting
import matplotlib.pyplot as plt
import seaborn as sns
# machine learning
from sklearn.datasets import make_friedman1, make_regression
from sklearn.tree import DecisionTreeRegressor, plot_tree, export_text
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import root_mean_squared_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
Decision Trees for Regression
Neighborhoods via Recursive Partitioning
Objectives
By the end of this chapter, you will be able to:
- Understand the fundamental concepts of decision trees for regression problems
- Explain how recursive partitioning works to minimize prediction error
- Calculate and interpret Sum of Squares Total (SST) and Sum of Squares Error (SSE)
- Describe the tree-building algorithm and splitting criteria
- Implement decision tree regression using scikit-learn
- Understand when to stop growing trees and key hyperparameters
- Compare decision trees with other regression methods
Python Setup
Notebook
The following Jupyter notebook contains some starter code that may be useful for following along with this note.
Introduction to Decision Trees for Regression
Decision trees are a fundamental machine learning algorithm that can be used for both classification and regression tasks. In regression problems, decision trees predict continuous values by partitioning the feature space into regions and predicting the mean of the target values within each region.
The key idea behind decision trees for regression is to create neighborhoods or regions in the feature space where we can predict the mean of the target variable. This is achieved through a process called recursive partitioning.
Understanding Error Measurement
Before diving into how decision trees work, we need to understand how we measure the quality of our predictions.
Sum of Squares Total (SST)
The Sum of Squares Total (SST) represents the total variance in our target variable. It measures how much the individual observations deviate from the overall mean:
\[ SST = \sum_{i=1}^{n} (y_i - \bar{y})^2 \]
where:
- \(y_i\) is the actual value for observation \(i\)
- \(\bar{y}\) is the overall mean of all target values
- \(n\) is the total number of observations
This represents the “overall error” or variance in our data.
Sum of Squares Error (SSE)
The Sum of Squares Error (SSE) measures how much error remains after making our predictions. For a decision tree, this is the sum of squared errors within each terminal node (leaf):
\[ SSE = \sum_{j=1}^{J} \sum_{i \in N_j} (y_i - \bar{y}_j)^2 \]
where:
- \(J\) is the number of terminal nodes (leaves)
- \(N_j\) is the set of observations in node \(j\)
- \(\bar{y}_j\) is the mean of target values in node \(j\)
The goal of decision tree regression is to minimize SSE by finding the best splits.
Show Code for SST/SSE Demonstration
# create simple dataset for demonstration
np.random.seed(42)
x = np.linspace(0, 10, 20)
y = 2 + 0.5 * x + np.random.normal(0, 1, 20)
# overall mean
y_mean = np.mean(y)
# create simple split at x = 5 (cutoff)
left_mask = x <= 5
right_mask = x > 5
y_left_mean = np.mean(y[left_mask])
y_right_mean = np.mean(y[right_mask])
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))
# plot 1: overall error (SST)
ax1.scatter(x, y, alpha=0.7, color='blue')
ax1.axhline(y=y_mean, color='red', linestyle='-', linewidth=2,
            label=f'overall mean = {y_mean:.2f}')
for i in range(len(x)):
    ax1.plot([x[i], x[i]], [y[i], y_mean], 'r--', alpha=0.5)
ax1.set_xlabel('x')
ax1.set_ylabel('y')
ax1.set_title('sum of squares total (SST)\nerrors from overall mean')
ax1.legend()
ax1.grid(True, alpha=0.3)
# plot 2: partitioned error (SSE)
ax2.scatter(x[left_mask], y[left_mask], alpha=0.7, color='blue',
            label='left neighborhood')
ax2.scatter(x[right_mask], y[right_mask], alpha=0.7, color='green',
            label='right neighborhood')
# show means for each region
ax2.axhline(y=y_left_mean, xmin=0, xmax=0.5, color='blue', linestyle='-', linewidth=2)
ax2.axhline(y=y_right_mean, xmin=0.5, xmax=1, color='green', linestyle='-', linewidth=2)
ax2.axvline(x=5, color='black', linestyle=':', alpha=0.7, label='cutoff at x=5')
# show errors within regions
for i in range(len(x)):
    if left_mask[i]:
        ax2.plot([x[i], x[i]], [y[i], y_left_mean], 'b--', alpha=0.5)
    else:
        ax2.plot([x[i], x[i]], [y[i], y_right_mean], 'g--', alpha=0.5)
ax2.set_xlabel('x')
ax2.set_ylabel('y')
ax2.set_title('sum of squares error (SSE)\nerrors within neighborhoods')
ax2.legend()
ax2.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
# calculate actual values
sst = np.sum((y - y_mean)**2)
sse = np.sum((y[left_mask] - y_left_mean)**2) + np.sum((y[right_mask] - y_right_mean)**2)
print(f"SST = {sst:.2f}")
print(f"SSE = {sse:.2f}")
print(f"reduction in error = {sst - sse:.2f}")
print(f"R² = {1 - sse/sst:.3f}")

SST = 28.72
SSE = 19.02
reduction in error = 9.70
R² = 0.338
The Tree Building Process
Creating Neighborhoods with Cutoffs
The fundamental idea of decision trees is to create neighborhoods by finding optimal cutoffs in our features. For a single feature \(x\), we choose a cutoff value \(c\) that divides our data into two groups:
- Left neighborhood: \(N_L = \{i : x_i \leq c\}\)
- Right neighborhood: \(N_R = \{i : x_i > c\}\)
Each neighborhood predicts the mean of the target values within that region.
Finding the Best Split
To find the optimal split, we consider all possible features and all possible cutoff values, then choose the combination that minimizes SSE.
For a given feature \(j\) and cutoff \(c\), the SSE after splitting is:
\[ SSE = \sum_{i \in N_L} (y_i - \bar{y}_L)^2 + \sum_{i \in N_R} (y_i - \bar{y}_R)^2 \]
where \(\bar{y}_L\) and \(\bar{y}_R\) are the means of the left and right neighborhoods, respectively.
The algorithm considers all features and typically uses midpoints between adjacent sorted values as potential cutoffs.
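To make the split search concrete, the short sketch below exhaustively evaluates every midpoint cutoff for a single feature and reports the one with the lowest SSE, reusing the small x and y arrays from the SST/SSE demonstration above. The helper name best_split_sse is illustrative; it is not part of scikit-learn.
# brute-force search for the best cutoff of a single feature (illustrative sketch)
def best_split_sse(x, y):
    # candidate cutoffs: midpoints between adjacent sorted unique values
    x_sorted = np.unique(x)
    cutoffs = (x_sorted[:-1] + x_sorted[1:]) / 2
    best_c, best_sse = None, np.inf
    for c in cutoffs:
        left, right = y[x <= c], y[x > c]
        sse = np.sum((left - left.mean()) ** 2) + np.sum((right - right.mean()) ** 2)
        if sse < best_sse:
            best_c, best_sse = c, sse
    return best_c, best_sse

# example usage with the toy data from the SST/SSE demonstration above
best_c, best_sse = best_split_sse(x, y)
print(f"best cutoff = {best_c:.2f}, SSE = {best_sse:.2f}")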
Simulated Sine Wave Data
# setup figure
fig, (ax1, ax2) = plt.subplots(1, 2)
fig.set_size_inches(12, 5)
fig.set_dpi(100)
# add overall title
fig.suptitle("Simulated Sine Wave Data")
# x values to make predictions at for plotting purposes
x_plot = np.linspace(-2 * np.pi, 2 * np.pi, 1000).reshape((1000, 1))
# create subplot for "simulation study"
ax1.set_title("Simulation Study")
ax1.scatter(X, y, color="dodgerblue")
ax1.set_xlabel("x")
ax1.set_ylabel("y")
ax1.grid(True, linestyle="--", color="lightgrey")
# add true regression function, the "signal" that we want to learn
ax1.plot(x_plot, np.sin(x_plot), color="black")
# create subplot for "reality"
ax2.set_title("Reality")
ax2.scatter(X, y, color="dodgerblue")
ax2.set_xlabel("x")
ax2.set_ylabel("y")
ax2.grid(True, linestyle="--", color="lightgrey")
# show plot
plt.show()
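The parameter dictionary shown next lists a decision tree's hyperparameters and their default values. It is presumably the output of get_params() on a DecisionTreeRegressor with default settings; the exact entries depend on your scikit-learn version.
# list the available hyperparameters and their default values
DecisionTreeRegressor().get_params()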
{'ccp_alpha': 0.0,
'criterion': 'squared_error',
'max_depth': None,
'max_features': None,
'max_leaf_nodes': None,
'min_impurity_decrease': 0.0,
'min_samples_leaf': 1,
'min_samples_split': 2,
'min_weight_fraction_leaf': 0.0,
'monotonic_cst': None,
'random_state': None,
'splitter': 'best'}
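The next figure compares a fitted k-nearest neighbors model (knn010) with a fitted decision tree (dt040). The fitting code is not included in this extract; a minimal sketch, assuming the simulated sine data X and y from above and settings matching the plot titles, is:
# fit the two models compared below (sketch; names and settings follow the plot titles)
knn010 = KNeighborsRegressor(n_neighbors=10)
knn010.fit(X, y)
dt040 = DecisionTreeRegressor(min_samples_split=40)
dt040.fit(X, y)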
# setup figure
fig, (ax1, ax2) = plt.subplots(1, 2)
fig.set_size_inches(12, 5)
fig.set_dpi(100)
# add overall title
fig.suptitle("Simulated Sine Wave Data")
# x values to make predictions at for plotting purposes
x_plot = np.linspace(-2 * np.pi, 2 * np.pi, 1000).reshape((1000, 1))
# create subplot for KNN
ax1.set_title("k-Nearest Neighbors, k=10")
ax1.scatter(X, y, color="dodgerblue")
ax1.set_xlabel("x")
ax1.set_ylabel("y")
ax1.grid(True, linestyle="--", color="lightgrey")
ax1.plot(x_plot, knn010.predict(x_plot), color="black")
# create subplot for decision tree
ax2.set_title("Decision Tree, min_samples_split=40")
ax2.scatter(X, y, color="dodgerblue")
ax2.set_xlabel("x")
ax2.set_ylabel("y")
ax2.grid(True, linestyle="--", color="lightgrey")
ax2.plot(x_plot, dt040.predict(x_plot), color="black")
# show plot
plt.show()
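A fitted tree can also be printed as a text listing of its splitting rules using export_text. The listing below was presumably produced this way for the dt040 model from the previous figure:
# print the fitted tree's splitting rules as text
print(export_text(dt040))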
|--- feature_0 <= -3.55
| |--- feature_0 <= -5.52
| | |--- value: [0.43]
| |--- feature_0 > -5.52
| | |--- value: [0.95]
|--- feature_0 > -3.55
| |--- feature_0 <= 3.69
| | |--- feature_0 <= -0.05
| | | |--- feature_0 <= -2.83
| | | | |--- value: [0.01]
| | | |--- feature_0 > -2.83
| | | | |--- feature_0 <= -0.80
| | | | | |--- value: [-0.83]
| | | | |--- feature_0 > -0.80
| | | | | |--- value: [-0.35]
| | |--- feature_0 > -0.05
| | | |--- feature_0 <= 2.87
| | | | |--- feature_0 <= 0.56
| | | | | |--- value: [0.25]
| | | | |--- feature_0 > 0.56
| | | | | |--- value: [0.86]
| | | |--- feature_0 > 2.87
| | | | |--- value: [-0.10]
| |--- feature_0 > 3.69
| | |--- feature_0 <= 5.60
| | | |--- value: [-0.84]
| | |--- feature_0 > 5.60
| | | |--- value: [-0.39]
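The next comparison varies min_samples_split. The three fitted trees referenced in the plotting code (dt250, dt100, dt002) are not created in this extract; a sketch based on the plot titles, again assuming the sine data X and y, is:
# trees with different min_samples_split values (sketch based on the plot titles)
dt250 = DecisionTreeRegressor(min_samples_split=250).fit(X, y)
dt100 = DecisionTreeRegressor(min_samples_split=100).fit(X, y)
dt002 = DecisionTreeRegressor(min_samples_split=2).fit(X, y)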
# setup figure
fig, (ax1, ax2, ax3) = plt.subplots(1, 3)
fig.set_size_inches(18, 5)
fig.set_dpi(100)
# add overall title
fig.suptitle("Simulated Sine Wave Data")
# x values to make predictions at for plotting purposes
x_plot = np.linspace(-2 * np.pi, 2 * np.pi, 1000).reshape((1000, 1))
# create subplot for decision tree with min_samples_split=250
ax1.set_title("Tree, min_samples_split=250")
ax1.scatter(X, y, color="dodgerblue")
ax1.set_xlabel("x")
ax1.set_ylabel("y")
ax1.grid(True, linestyle="--", color="lightgrey")
ax1.plot(x_plot, dt250.predict(x_plot), color="black")
# create subplot for decision tree with min_samples_split=100
ax2.set_title("Tree, min_samples_split=100")
ax2.scatter(X, y, color="dodgerblue")
ax2.set_xlabel("x")
ax2.set_ylabel("y")
ax2.grid(True, linestyle="--", color="lightgrey")
ax2.plot(x_plot, dt100.predict(x_plot), color="black")
# create subplot for decision tree with min_samples_split=2
ax3.set_title("Tree, min_samples_split=2")
ax3.scatter(X, y, color="dodgerblue")
ax3.set_xlabel("x")
ax3.set_ylabel("y")
ax3.grid(True, linestyle="--", color="lightgrey")
ax3.plot(x_plot, dt002.predict(x_plot), color="black")
# show plot
plt.show()
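A similar comparison can be made by varying max_depth. The fitted trees used below (dt_d01, dt_d05, dt_d10) are likewise not created in this extract; a sketch based on the plot titles is:
# trees with different max_depth values (sketch based on the plot titles)
dt_d01 = DecisionTreeRegressor(max_depth=1).fit(X, y)
dt_d05 = DecisionTreeRegressor(max_depth=5).fit(X, y)
dt_d10 = DecisionTreeRegressor(max_depth=10).fit(X, y)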
# setup figure
fig, (ax1, ax2, ax3) = plt.subplots(1, 3)
fig.set_size_inches(18, 5)
fig.set_dpi(100)
# add overall title
fig.suptitle("Simulated Sine Wave Data")
# x values to make predictions at for plotting purposes
x_plot = np.linspace(-2 * np.pi, 2 * np.pi, 1000).reshape((1000, 1))
# create subplot for decision tree with max_depth=1
ax1.set_title("Tree, max_depth=1")
ax1.scatter(X, y, color="dodgerblue")
ax1.set_xlabel("x")
ax1.set_ylabel("y")
ax1.grid(True, linestyle="--", color="lightgrey")
ax1.plot(x_plot, dt_d01.predict(x_plot), color="black")
# create subplot for decision tree with max_depth=5
ax2.set_title("Tree, max_depth=5")
ax2.scatter(X, y, color="dodgerblue")
ax2.set_xlabel("x")
ax2.set_ylabel("y")
ax2.grid(True, linestyle="--", color="lightgrey")
ax2.plot(x_plot, dt_d05.predict(x_plot), color="black")
# create subplot for decision tree with max_depth=10
ax3.set_title("Tree, max_depth=10")
ax3.scatter(X, y, color="dodgerblue")
ax3.set_xlabel("x")
ax3.set_ylabel("y")
ax3.grid(True, linestyle="--", color="lightgrey")
ax3.plot(x_plot, dt_d10.predict(x_plot), color="black")
# show plot
plt.show()
Simulated Data with Multiple Features
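The data for this section is not generated in the code shown above. Based on the imports and the outputs that follow, it presumably comes from make_friedman1. A minimal sketch, with the sample size and noise level as assumptions, is:
# simulate regression data with five features (sketch; n_samples and noise are assumptions)
X, y = make_friedman1(n_samples=200, n_features=5, noise=1.0, random_state=42)
# display values with four decimals and peek at the first ten rows of the feature matrix
np.set_printoptions(precision=4)
X[:10]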
array([[0.3745, 0.9507, 0.732 , 0.5987, 0.156 ],
[0.156 , 0.0581, 0.8662, 0.6011, 0.7081],
[0.0206, 0.9699, 0.8324, 0.2123, 0.1818],
[0.1834, 0.3042, 0.5248, 0.4319, 0.2912],
[0.6119, 0.1395, 0.2921, 0.3664, 0.4561],
[0.7852, 0.1997, 0.5142, 0.5924, 0.0465],
[0.6075, 0.1705, 0.0651, 0.9489, 0.9656],
[0.8084, 0.3046, 0.0977, 0.6842, 0.4402],
[0.122 , 0.4952, 0.0344, 0.9093, 0.2588],
[0.6625, 0.3117, 0.5201, 0.5467, 0.1849]])
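The single number that follows appears to be a test-set RMSE for a tree fit to this data. The fitting code is not shown, so here is a sketch with assumed split and tree settings:
# fit a tree on a train/test split and compute test RMSE (sketch with assumed settings)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
dt_multi = DecisionTreeRegressor(min_samples_split=50, random_state=42)
dt_multi.fit(X_train, y_train)
root_mean_squared_error(y_test, dt_multi.predict(X_test))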
3.015648008168984
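The rule listing below can be produced with export_text, presumably applied to this fitted tree (dt_multi is the hypothetical name used in the sketch above):
# print the fitted tree's splitting rules
print(export_text(dt_multi))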
|--- feature_3 <= 0.43
| |--- feature_0 <= 0.28
| | |--- value: [7.48]
| |--- feature_0 > 0.28
| | |--- feature_1 <= 0.22
| | | |--- value: [6.74]
| | |--- feature_1 > 0.22
| | | |--- value: [13.35]
|--- feature_3 > 0.43
| |--- feature_0 <= 0.24
| | |--- value: [12.57]
| |--- feature_0 > 0.24
| | |--- feature_1 <= 0.33
| | | |--- value: [15.82]
| | |--- feature_1 > 0.33
| | | |--- feature_3 <= 0.78
| | | | |--- value: [17.72]
| | | |--- feature_3 > 0.78
| | | | |--- value: [21.46]
Summary
Decision trees for regression work by:
- Partitioning the feature space into regions using recursive binary splits
- Predicting the mean of target values within each region
- Minimizing the Sum of Squares Error (SSE) to find optimal splits
- Stopping based on predefined criteria to prevent overfitting
Key Concepts
- SST (Sum of Squares Total): \(\sum_{i=1}^{n} (y_i - \bar{y})^2\) - total variance in target
- SSE (Sum of Squares Error): \(\sum_{j=1}^{J} \sum_{i \in N_j} (y_i - \bar{y}_j)^2\) - error after partitioning
- R²: \(1 - \frac{SSE}{SST}\) - proportion of variance explained
- Recursive Partitioning: Algorithm that systematically finds best splits by considering all features and cutoffs
Advantages of Decision Trees
- Easy to interpret and visualize
- Handle non-linear relationships naturally
- Require minimal data preprocessing
- Can capture interactions between features
- No assumptions about data distribution
Disadvantages of Decision Trees
- Prone to overfitting without proper controls
- Can be unstable (small changes in data → very different trees)
- May not perform as well as other algorithms on some datasets
- Bias toward features with more levels
Best Practices
- Always use stopping criteria (max_depth, min_samples_split, etc.)
- Validate performance on holdout data
- Consider ensemble methods (Random Forest, Gradient Boosting) for better performance
- Visualize trees to understand model decisions
Decision trees provide an excellent foundation for understanding more advanced tree-based methods and serve as a powerful yet interpretable machine learning tool for regression problems.
Tree Structure and Terminology
A decision tree consists of several types of nodes:
- Root Node: The top node containing all data
- Internal Nodes: Nodes that split the data based on feature conditions
- Terminal Nodes (Leaves): Final nodes that make predictions
Each internal node contains a splitting rule of the form “if \(x_j \leq c\) then go left, else go right”.
Implementation Note: The tree shown in our examples follows scikit-learn’s convention where we go left for “yes” (condition is true) and right for “no” (condition is false). Some other implementations use the opposite convention.
Show Code for Split Comparison
# demonstrate progression from one split to two splits
X_demo, y_demo = simulate_sin_data(n=50, sd=0.3, seed=123)
x_demo = X_demo.ravel()
# fit trees with different depths
tree_1split = DecisionTreeRegressor(max_depth=1, random_state=42)
tree_2split = DecisionTreeRegressor(max_depth=2, random_state=42)
tree_1split.fit(X_demo, y_demo)
tree_2split.fit(X_demo, y_demo)
# create prediction plots
x_plot = np.linspace(-2 * np.pi, 2 * np.pi, 300).reshape(-1, 1)
y_pred_1 = tree_1split.predict(x_plot)
y_pred_2 = tree_2split.predict(x_plot)
fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(18, 5))
# plot 1: one split
ax1.scatter(x_demo, y_demo, alpha=0.7, color='blue', label='data')
ax1.plot(x_plot, y_pred_1, color='red', linewidth=3, label='tree prediction')
ax1.set_xlabel('x')
ax1.set_ylabel('y')
ax1.set_title(f'one split (R² = {tree_1split.score(X_demo, y_demo):.3f})')
ax1.legend()
ax1.grid(True, alpha=0.3)
# plot 2: two splits
ax2.scatter(x_demo, y_demo, alpha=0.7, color='blue', label='data')
ax2.plot(x_plot, y_pred_2, color='red', linewidth=3, label='tree prediction')
ax2.set_xlabel('x')
ax2.set_ylabel('y')
ax2.set_title(f'two splits (R² = {tree_2split.score(X_demo, y_demo):.3f})')
ax2.legend()
ax2.grid(True, alpha=0.3)
# plot 3: tree structure for 2-split tree
plot_tree(tree_2split, ax=ax3, feature_names=['x'], filled=True, rounded=True, fontsize=10)
ax3.set_title('tree structure (max_depth=2)')
plt.tight_layout()
plt.show()
# calculate SSE values
overall_mean = np.mean(y_demo)
sst = np.sum((y_demo - overall_mean)**2)
y_pred_1_train = tree_1split.predict(X_demo)
y_pred_2_train = tree_2split.predict(X_demo)
sse_1 = np.sum((y_demo - y_pred_1_train)**2)
sse_2 = np.sum((y_demo - y_pred_2_train)**2)
print("error reduction from additional splits:")
print(f"SST = {sst:.2f}")
print(f"SSE (one split) = {sse_1:.2f}")
print(f"SSE (two splits) = {sse_2:.2f}")
print(f"improvement from second split = {sse_1 - sse_2:.2f}")

error reduction from additional splits:
SST = 30.81
SSE (one split) = 25.28
SSE (two splits) = 18.35
improvement from second split = 6.93
Recursive Partitioning Algorithm
The decision tree algorithm works through recursive partitioning. Here’s a sketch of the algorithm:
Algorithm Steps
Initialize: Start with a root node containing all data and set the stopping conditions.
For each current terminal node:
- If the stopping conditions are met for this node, skip it
- If no nodes can be split, stop the algorithm and return the tree
- For each feature \(j\):
  - For each potential cutoff \(c\) (midpoints between adjacent sorted values), calculate and store the SSE after splitting on \(x_j \leq c\)
- Find the feature \(j\) and cutoff \(c\) combination with the lowest SSE
- Split the current node using \(x_j \leq c\)
Repeat until the stopping conditions are met for all terminal nodes.
The algorithm considers all features and finds potential split points by examining the data. For continuous features, typical split points are the midpoints between adjacent unique values when the data is sorted.
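For intuition, here is a minimal recursive implementation of this idea for a single feature, reusing the illustrative best_split_sse helper sketched earlier in this chapter. It is deliberately simplified and is not how scikit-learn implements trees.
# minimal recursive partitioning for one feature (illustrative sketch only)
def grow_tree(x, y, min_samples_split=10):
    # stop if the node is too small to split or has no usable cutoffs
    if len(y) < min_samples_split or np.unique(x).size < 2:
        return {"prediction": y.mean()}
    c, _ = best_split_sse(x, y)
    left = x <= c
    right = ~left
    return {
        "cutoff": c,
        "left": grow_tree(x[left], y[left], min_samples_split),
        "right": grow_tree(x[right], y[right], min_samples_split),
    }

def predict_one(tree, x_new):
    # follow cutoffs until reaching a leaf, then return that leaf's mean
    while "cutoff" in tree:
        tree = tree["left"] if x_new <= tree["cutoff"] else tree["right"]
    return tree["prediction"]

# tiny usage example with made-up one-dimensional data
x_tiny = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y_tiny = np.array([1.1, 0.9, 1.0, 3.2, 2.9, 3.1])
tiny_tree = grow_tree(x_tiny, y_tiny, min_samples_split=4)
print(predict_one(tiny_tree, 1.5), predict_one(tiny_tree, 5.5))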
Show Code for Recursive Partitioning Demo
# create 2D dataset to show spatial partitioning
np.random.seed(307)
n = 100
x1 = np.random.uniform(0, 10, n)
x2 = np.random.uniform(0, 10, n)
y = 5 + 0.3 * x1 + 0.5 * x2 + 0.1 * x1 * x2 + np.random.normal(0, 2, n)
X = np.column_stack([x1, x2])
# fit trees with increasing depth
fig, axes = plt.subplots(2, 2, figsize=(12, 10))
axes = axes.ravel()
depths = [1, 2, 3, 4]
for i, depth in enumerate(depths):
    tree = DecisionTreeRegressor(max_depth=depth, random_state=42)
    tree.fit(X, y)
    # create prediction surface
    xx, yy = np.meshgrid(np.linspace(0, 10, 500), np.linspace(0, 10, 500))
    X_grid = np.column_stack([xx.ravel(), yy.ravel()])
    Z = tree.predict(X_grid).reshape(xx.shape)
    # plot the regions as discrete blocks (no interpolation)
    im = axes[i].imshow(
        Z,
        extent=[0, 10, 0, 10],
        origin="lower",
        cmap="viridis",
        aspect="equal",
        vmin=np.min(y),
        vmax=np.max(y),
    )
    axes[i].scatter(
        x1,
        x2,
        c=y,
        cmap="viridis",
    )
    axes[i].set_title(f"max_depth = {depth} ({tree.get_n_leaves()} leaves)")
    axes[i].set_xlabel("$x_1$")
    axes[i].set_ylabel("$x_2$")
    # add colorbar
    plt.colorbar(im, ax=axes[i], shrink=0.8)
plt.suptitle("recursive partitioning: feature space divided into regions", fontsize=14)
plt.show()
# show how R² changes with tree complexity
r2_scores = []
for depth in depths:
    tree = DecisionTreeRegressor(max_depth=depth, random_state=42)
    tree.fit(X, y)
    r2_scores.append(tree.score(X, y))
print("R² by tree depth:")
for depth, r2 in zip(depths, r2_scores):
    print(f"depth {depth}: R² = {r2:.3f}")

R² by tree depth:
depth 1: R² = 0.436
depth 2: R² = 0.682
depth 3: R² = 0.808
depth 4: R² = 0.865
Stopping Criteria and Hyperparameters
A crucial aspect of the decision tree algorithm is knowing when to stop growing the tree. Growing a tree until it perfectly fits the training data leads to overfitting. Key stopping criteria include:
Important Hyperparameters
- max_depth: Maximum depth of the tree
- min_samples_split: Minimum number of samples required to split an internal node
- min_samples_leaf: Minimum number of samples required to be at a leaf node
- max_features: Number of features to consider when looking for the best split
These are tuning parameters that control model complexity and help prevent overfitting.
Avoiding Overfitting: Don’t grow trees until SSE = 0! This creates overly complex models that memorize training data rather than learning generalizable patterns.
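For example, these stopping criteria are passed directly to the constructor. The specific values below are illustrative, not recommendations:
# a tree with explicit stopping criteria to limit complexity (values are illustrative)
dt_limited = DecisionTreeRegressor(
    max_depth=5,
    min_samples_split=20,
    min_samples_leaf=10,
)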
Model Evaluation: R-squared
The coefficient of determination, \(R^2\), provides a measure of how well our tree explains the variance in the target variable:
\[ R^2 = 1 - \frac{SSE}{SST} \]
- \(R^2 = 1\): Perfect fit (SSE = 0)
- \(R^2 = 0\): Model is no better than predicting the mean
- \(R^2 < 0\): Model performs worse than predicting the mean
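To close the loop on evaluation, here is a brief sketch of computing \(R^2\) on held-out data with r2_score and train_test_split (imported above). It assumes the X and y arrays defined earlier in the chapter; the model and split settings are illustrative.
# evaluate a tree on a holdout set using R² (illustrative settings)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=1)
dt_eval = DecisionTreeRegressor(max_depth=5).fit(X_tr, y_tr)
print(f"train R² = {r2_score(y_tr, dt_eval.predict(X_tr)):.3f}")
print(f"test R² = {r2_score(y_te, dt_eval.predict(X_te)):.3f}")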