# basics
import numpy as np
import pandas as pd
# plotting
import matplotlib.pyplot as plt
import seaborn as sns
# machine learning
from sklearn.datasets import make_friedman1, make_regression
from sklearn.tree import DecisionTreeRegressor, plot_tree, export_text
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import root_mean_squared_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
Decision Trees for Regression
Neighborhoods via Recursive Partitioning
Objectives
By the end of this chapter, you will be able to:
- Understand the fundamental concepts of decision trees for regression problems
- Explain how recursive partitioning works to minimize prediction error
- Calculate and interpret Sum of Squares Total (SST) and Sum of Squares Error (SSE)
- Describe the tree-building algorithm and splitting criteria
- Implement decision tree regression using scikit-learn
- Understand when to stop growing trees and key hyperparameters
- Compare decision trees with other regression methods
Python Setup
Notebook
The following Jupyter notebook contains some starter code that may be useful for following along with this note.
Introduction to Decision Trees for Regression
Decision trees are a fundamental machine learning algorithm that can be used for both classification and regression tasks. In regression problems, decision trees predict continuous values by partitioning the feature space into regions and predicting the mean of the target values within each region.
The key idea behind decision trees for regression is to create neighborhoods or regions in the feature space where we can predict the mean of the target variable. This is achieved through a process called recursive partitioning.
Understanding Error Measurement
Before diving into how decision trees work, we need to understand how we measure the quality of our predictions.
Sum of Squares Total (SST)
The Sum of Squares Total (SST) represents the total variance in our target variable. It measures how much the individual observations deviate from the overall mean:
\[ SST = \sum_{i=1}^{n} (y_i - \bar{y})^2 \]
where:
- \(y_i\) is the actual value for observation \(i\)
- \(\bar{y}\) is the overall mean of all target values
- \(n\) is the total number of observations
This represents the “overall error” or variance in our data.
Sum of Squares Error (SSE)
The Sum of Squares Error (SSE) measures how much error remains after making our predictions. For a decision tree, this is the sum of squared errors within each terminal node (leaf):
\[ SSE = \sum_{j=1}^{J} \sum_{i \in N_j} (y_i - \bar{y}_j)^2 \]
where:
- \(J\) is the number of terminal nodes (leaves)
- \(N_j\) is the set of observations in node \(j\)
- \(\bar{y}_j\) is the mean of target values in node \(j\)
The goal of decision tree regression is to minimize SSE by finding the best splits.
Show Code for SST/SSE Demonstration
# create simple dataset for demonstration
np.random.seed(42)
x = np.linspace(0, 10, 20)
y = 2 + 0.5 * x + np.random.normal(0, 1, 20)
# overall mean
y_mean = np.mean(y)
# create simple split at x = 5 (cutoff)
left_mask = x <= 5
right_mask = x > 5
y_left_mean = np.mean(y[left_mask])
y_right_mean = np.mean(y[right_mask])
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))
# plot 1: overall error (SST)
ax1.scatter(x, y, alpha=0.7, color='blue')
ax1.axhline(y=y_mean, color='red', linestyle='-', linewidth=2,
            label=f'overall mean = {y_mean:.2f}')
for i in range(len(x)):
    ax1.plot([x[i], x[i]], [y[i], y_mean], 'r--', alpha=0.5)
ax1.set_xlabel('x')
ax1.set_ylabel('y')
ax1.set_title('sum of squares total (SST)\nerrors from overall mean')
ax1.legend()
ax1.grid(True, alpha=0.3)
# plot 2: partitioned error (SSE)
ax2.scatter(x[left_mask], y[left_mask], alpha=0.7, color='blue',
            label='left neighborhood')
ax2.scatter(x[right_mask], y[right_mask], alpha=0.7, color='green',
            label='right neighborhood')
# show means for each region
ax2.axhline(y=y_left_mean, xmin=0, xmax=0.5, color='blue', linestyle='-', linewidth=2)
ax2.axhline(y=y_right_mean, xmin=0.5, xmax=1, color='green', linestyle='-', linewidth=2)
ax2.axvline(x=5, color='black', linestyle=':', alpha=0.7, label='cutoff at x=5')
# show errors within regions
for i in range(len(x)):
    if left_mask[i]:
        ax2.plot([x[i], x[i]], [y[i], y_left_mean], 'b--', alpha=0.5)
    else:
        ax2.plot([x[i], x[i]], [y[i], y_right_mean], 'g--', alpha=0.5)
ax2.set_xlabel('x')
ax2.set_ylabel('y')
ax2.set_title('sum of squares error (SSE)\nerrors within neighborhoods')
ax2.legend()
ax2.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
# calculate actual values
sst = np.sum((y - y_mean)**2)
sse = np.sum((y[left_mask] - y_left_mean)**2) + np.sum((y[right_mask] - y_right_mean)**2)
print(f"SST = {sst:.2f}")
print(f"SSE = {sse:.2f}")
print(f"reduction in error = {sst - sse:.2f}")
print(f"R² = {1 - sse/sst:.3f}")

SST = 28.72
SSE = 19.02
reduction in error = 9.70
R² = 0.338
The Tree Building Process
Creating Neighborhoods with Cutoffs
The fundamental idea of decision trees is to create neighborhoods by finding optimal cutoffs in our features. For a single feature \(x\), we choose a cutoff value \(c\) that divides our data into two groups:
- Left neighborhood: \(N_L = \{i : x_i \leq c\}\)
- Right neighborhood: \(N_R = \{i : x_i > c\}\)
Each neighborhood predicts the mean of the target values within that region.
Finding the Best Split
To find the optimal split, we consider all possible features and all possible cutoff values, then choose the combination that minimizes SSE.
For a given feature \(j\) and cutoff \(c\), the SSE after splitting is:
\[ SSE = \sum_{i \in N_L} (y_i - \bar{y}_L)^2 + \sum_{i \in N_R} (y_i - \bar{y}_R)^2 \]
where \(\bar{y}_L\) and \(\bar{y}_R\) are the means of the left and right neighborhoods, respectively.
The algorithm considers all features and typically uses midpoints between adjacent sorted values as potential cutoffs.
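To make the split search concrete, the short sketch below exhaustively evaluates every midpoint cutoff for a single feature and reports the one with the lowest SSE, reusing the small x and y arrays from the SST/SSE demonstration above. The helper name best_split_sse is illustrative; it is not part of scikit-learn.
# brute-force search for the best cutoff of a single feature (illustrative sketch)
def best_split_sse(x, y):
    # candidate cutoffs: midpoints between adjacent sorted unique values
    x_sorted = np.unique(x)
    cutoffs = (x_sorted[:-1] + x_sorted[1:]) / 2
    best_c, best_sse = None, np.inf
    for c in cutoffs:
        left, right = y[x <= c], y[x > c]
        sse = np.sum((left - left.mean()) ** 2) + np.sum((right - right.mean()) ** 2)
        if sse < best_sse:
            best_c, best_sse = c, sse
    return best_c, best_sse

# example usage with the toy data from the SST/SSE demonstration above
best_c, best_sse = best_split_sse(x, y)
print(f"best cutoff = {best_c:.2f}, SSE = {best_sse:.2f}")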
Simulated Sine Wave Data
# setup figure
fig, (ax1, ax2) = plt.subplots(1, 2)
fig.set_size_inches(12, 5)
fig.set_dpi(100)
# add overall title
fig.suptitle("Simulated Sine Wave Data")
# x values to make predictions at for plotting purposes
x_plot = np.linspace(-2 * np.pi, 2 * np.pi, 1000).reshape((1000, 1))
# create subplot for "simulation study"
ax1.set_title("Simulation Study")
ax1.scatter(X, y, color="dodgerblue")
ax1.set_xlabel("x")
ax1.set_ylabel("y")
ax1.grid(True, linestyle="--", color="lightgrey")
# add true regression function, the "signal" that we want to learn
ax1.plot(x_plot, np.sin(x_plot), color="black")
# create subplot for "reality"
ax2.set_title("Reality")
ax2.scatter(X, y, color="dodgerblue")
ax2.set_xlabel("x")
ax2.set_ylabel("y")
ax2.grid(True, linestyle="--", color="lightgrey")
# show plot
plt.show()
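The parameter dictionary shown next lists a decision tree's hyperparameters and their default values. It is presumably the output of get_params() on a DecisionTreeRegressor with default settings; the exact entries depend on your scikit-learn version.
# list the available hyperparameters and their default values
DecisionTreeRegressor().get_params()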
{'ccp_alpha': 0.0,
'criterion': 'squared_error',
'max_depth': None,
'max_features': None,
'max_leaf_nodes': None,
'min_impurity_decrease': 0.0,
'min_samples_leaf': 1,
'min_samples_split': 2,
'min_weight_fraction_leaf': 0.0,
'monotonic_cst': None,
'random_state': None,
'splitter': 'best'}
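The next figure compares a fitted k-nearest neighbors model (knn010) with a fitted decision tree (dt040). The fitting code is not included in this extract; a minimal sketch, assuming the simulated sine data X and y from above and settings matching the plot titles, is:
# fit the two models compared below (sketch; names and settings follow the plot titles)
knn010 = KNeighborsRegressor(n_neighbors=10)
knn010.fit(X, y)
dt040 = DecisionTreeRegressor(min_samples_split=40)
dt040.fit(X, y)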
# setup figure
fig, (ax1, ax2) = plt.subplots(1, 2)
fig.set_size_inches(12, 5)
fig.set_dpi(100)
# add overall title
fig.suptitle("Simulated Sine Wave Data")
# x values to make predictions at for plotting purposes
x_plot = np.linspace(-2 * np.pi, 2 * np.pi, 1000).reshape((1000, 1))
# create subplot for KNN
ax1.set_title("k-Nearest Neighbors, k=10")
ax1.scatter(X, y, color="dodgerblue")
ax1.set_xlabel("x")
ax1.set_ylabel("y")
ax1.grid(True, linestyle="--", color="lightgrey")
ax1.plot(x_plot, knn010.predict(x_plot), color="black")
# create subplot for decision tree
ax2.set_title("Decision Tree, min_samples_split=40")
ax2.scatter(X, y, color="dodgerblue")
ax2.set_xlabel("x")
ax2.set_ylabel("y")
ax2.grid(True, linestyle="--", color="lightgrey")
ax2.plot(x_plot, dt040.predict(x_plot), color="black")
# show plot
plt.show()
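A fitted tree can also be printed as a text listing of its splitting rules using export_text. The listing below was presumably produced this way for the dt040 model from the previous figure:
# print the fitted tree's splitting rules as text
print(export_text(dt040))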
|--- feature_0 <= -3.55
| |--- feature_0 <= -5.52
| | |--- value: [0.43]
| |--- feature_0 > -5.52
| | |--- value: [0.95]
|--- feature_0 > -3.55
| |--- feature_0 <= 3.69
| | |--- feature_0 <= -0.05
| | | |--- feature_0 <= -2.83
| | | | |--- value: [0.01]
| | | |--- feature_0 > -2.83
| | | | |--- feature_0 <= -0.80
| | | | | |--- value: [-0.83]
| | | | |--- feature_0 > -0.80
| | | | | |--- value: [-0.35]
| | |--- feature_0 > -0.05
| | | |--- feature_0 <= 2.87
| | | | |--- feature_0 <= 0.56
| | | | | |--- value: [0.25]
| | | | |--- feature_0 > 0.56
| | | | | |--- value: [0.86]
| | | |--- feature_0 > 2.87
| | | | |--- value: [-0.10]
| |--- feature_0 > 3.69
| | |--- feature_0 <= 5.60
| | | |--- value: [-0.84]
| | |--- feature_0 > 5.60
| | | |--- value: [-0.39]
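The next comparison varies min_samples_split. The three fitted trees referenced in the plotting code (dt250, dt100, dt002) are not created in this extract; a sketch based on the plot titles, again assuming the sine data X and y, is:
# trees with different min_samples_split values (sketch based on the plot titles)
dt250 = DecisionTreeRegressor(min_samples_split=250).fit(X, y)
dt100 = DecisionTreeRegressor(min_samples_split=100).fit(X, y)
dt002 = DecisionTreeRegressor(min_samples_split=2).fit(X, y)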
# setup figure
fig, (ax1, ax2, ax3) = plt.subplots(1, 3)
fig.set_size_inches(18, 5)
fig.set_dpi(100)
# add overall title
fig.suptitle("Simulated Sine Wave Data")
# x values to make predictions at for plotting purposes
x_plot = np.linspace(-2 * np.pi, 2 * np.pi, 1000).reshape((1000, 1))
# create subplot for decision tree with min_samples_split=250
ax1.set_title("Tree, min_samples_split=250")
ax1.scatter(X, y, color="dodgerblue")
ax1.set_xlabel("x")
ax1.set_ylabel("y")
ax1.grid(True, linestyle="--", color="lightgrey")
ax1.plot(x_plot, dt250.predict(x_plot), color="black")
# create subplot for decision tree with min_samples_split=100
ax2.set_title("Tree, min_samples_split=100")
ax2.scatter(X, y, color="dodgerblue")
ax2.set_xlabel("x")
ax2.set_ylabel("y")
ax2.grid(True, linestyle="--", color="lightgrey")
ax2.plot(x_plot, dt100.predict(x_plot), color="black")
# create subplot for decision tree with min_samples_split=2
ax3.set_title("Tree, min_samples_split=2")
ax3.scatter(X, y, color="dodgerblue")
ax3.set_xlabel("x")
ax3.set_ylabel("y")
ax3.grid(True, linestyle="--", color="lightgrey")
ax3.plot(x_plot, dt002.predict(x_plot), color="black")
# show plot
plt.show()
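A similar comparison can be made by varying max_depth. The fitted trees used below (dt_d01, dt_d05, dt_d10) are likewise not created in this extract; a sketch based on the plot titles is:
# trees with different max_depth values (sketch based on the plot titles)
dt_d01 = DecisionTreeRegressor(max_depth=1).fit(X, y)
dt_d05 = DecisionTreeRegressor(max_depth=5).fit(X, y)
dt_d10 = DecisionTreeRegressor(max_depth=10).fit(X, y)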
# setup figure
fig, (ax1, ax2, ax3) = plt.subplots(1, 3)
fig.set_size_inches(18, 5)
fig.set_dpi(100)
# add overall title
fig.suptitle("Simulated Sine Wave Data")
# x values to make predictions at for plotting purposes
x_plot = np.linspace(-2 * np.pi, 2 * np.pi, 1000).reshape((1000, 1))
# create subplot for decision tree with max_depth=1
ax1.set_title("Tree, max_depth=1")
ax1.scatter(X, y, color="dodgerblue")
ax1.set_xlabel("x")
ax1.set_ylabel("y")
ax1.grid(True, linestyle="--", color="lightgrey")
ax1.plot(x_plot, dt_d01.predict(x_plot), color="black")
# create subplot for decision tree with max_depth=5
ax2.set_title("Tree, max_depth=5")
ax2.scatter(X, y, color="dodgerblue")
ax2.set_xlabel("x")
ax2.set_ylabel("y")
ax2.grid(True, linestyle="--", color="lightgrey")
ax2.plot(x_plot, dt_d05.predict(x_plot), color="black")
# create subplot for decision tree with max_depth=10
ax3.set_title("Tree, max_depth=10")
ax3.scatter(X, y, color="dodgerblue")
ax3.set_xlabel("x")
ax3.set_ylabel("y")
ax3.grid(True, linestyle="--", color="lightgrey")
ax3.plot(x_plot, dt_d10.predict(x_plot), color="black")
# show plot
plt.show()
Simulated Data with Multiple Features
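The data for this section is not generated in the code shown above. Based on the imports and the outputs that follow, it presumably comes from make_friedman1. A minimal sketch, with the sample size and noise level as assumptions, is:
# simulate regression data with five features (sketch; n_samples and noise are assumptions)
X, y = make_friedman1(n_samples=200, n_features=5, noise=1.0, random_state=42)
# display values with four decimals and peek at the first ten rows of the feature matrix
np.set_printoptions(precision=4)
X[:10]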
array([[0.3745, 0.9507, 0.732 , 0.5987, 0.156 ],
[0.156 , 0.0581, 0.8662, 0.6011, 0.7081],
[0.0206, 0.9699, 0.8324, 0.2123, 0.1818],
[0.1834, 0.3042, 0.5248, 0.4319, 0.2912],
[0.6119, 0.1395, 0.2921, 0.3664, 0.4561],
[0.7852, 0.1997, 0.5142, 0.5924, 0.0465],
[0.6075, 0.1705, 0.0651, 0.9489, 0.9656],
[0.8084, 0.3046, 0.0977, 0.6842, 0.4402],
[0.122 , 0.4952, 0.0344, 0.9093, 0.2588],
[0.6625, 0.3117, 0.5201, 0.5467, 0.1849]])
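The single number that follows appears to be a test-set RMSE for a tree fit to this data. The fitting code is not shown, so here is a sketch with assumed split and tree settings:
# fit a tree on a train/test split and compute test RMSE (sketch with assumed settings)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
dt_multi = DecisionTreeRegressor(min_samples_split=50, random_state=42)
dt_multi.fit(X_train, y_train)
root_mean_squared_error(y_test, dt_multi.predict(X_test))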
3.015648008168984
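The rule listing below can be produced with export_text, presumably applied to this fitted tree (dt_multi is the hypothetical name used in the sketch above):
# print the fitted tree's splitting rules
print(export_text(dt_multi))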
|--- feature_3 <= 0.43
| |--- feature_0 <= 0.28
| | |--- value: [7.48]
| |--- feature_0 > 0.28
| | |--- feature_1 <= 0.22
| | | |--- value: [6.74]
| | |--- feature_1 > 0.22
| | | |--- value: [13.35]
|--- feature_3 > 0.43
| |--- feature_0 <= 0.24
| | |--- value: [12.57]
| |--- feature_0 > 0.24
| | |--- feature_1 <= 0.33
| | | |--- value: [15.82]
| | |--- feature_1 > 0.33
| | | |--- feature_3 <= 0.78
| | | | |--- value: [17.72]
| | | |--- feature_3 > 0.78
| | | | |--- value: [21.46]
Summary
Decision trees for regression work by:
- Partitioning the feature space into regions using recursive binary splits
- Predicting the mean of target values within each region
- Minimizing the Sum of Squares Error (SSE) to find optimal splits
- Stopping based on predefined criteria to prevent overfitting
Key Concepts
- SST (Sum of Squares Total): \(\sum_{i=1}^{n} (y_i - \bar{y})^2\) - total variance in target
- SSE (Sum of Squares Error): \(\sum_{j=1}^{J} \sum_{i \in N_j} (y_i - \bar{y}_j)^2\) - error after partitioning
- R²: \(1 - \frac{SSE}{SST}\) - proportion of variance explained
- Recursive Partitioning: Algorithm that systematically finds best splits by considering all features and cutoffs
Advantages of Decision Trees
- Easy to interpret and visualize
- Handle non-linear relationships naturally
- Require minimal data preprocessing
- Can capture interactions between features
- No assumptions about data distribution
Disadvantages of Decision Trees
- Prone to overfitting without proper controls
- Can be unstable (small changes in data → very different trees)
- May not perform as well as other algorithms on some datasets
- Bias toward features with more levels
Best Practices
- Always use stopping criteria (max_depth, min_samples_split, etc.)
- Validate performance on holdout data
- Consider ensemble methods (Random Forest, Gradient Boosting) for better performance
- Visualize trees to understand model decisions
Decision trees provide an excellent foundation for understanding more advanced tree-based methods and serve as a powerful yet interpretable machine learning tool for regression problems.
Tree Structure and Terminology
A decision tree consists of several types of nodes:
- Root Node: The top node containing all data
- Internal Nodes: Nodes that split the data based on feature conditions
- Terminal Nodes (Leaves): Final nodes that make predictions
Each internal node contains a splitting rule of the form “if \(x_j \leq c\) then go left, else go right”.
Implementation Note: The tree shown in our examples follows scikit-learn’s convention where we go left for “yes” (condition is true) and right for “no” (condition is false). Some other implementations use the opposite convention.
Show Code for Split Comparison
# demonstrate progression from one split to two splits
X_demo, y_demo = simulate_sin_data(n=50, sd=0.3, seed=123)
x_demo = X_demo.ravel()
# fit trees with different depths
tree_1split = DecisionTreeRegressor(max_depth=1, random_state=42)
tree_2split = DecisionTreeRegressor(max_depth=2, random_state=42)
tree_1split.fit(X_demo, y_demo)
tree_2split.fit(X_demo, y_demo)
# create prediction plots
x_plot = np.linspace(-2 * np.pi, 2 * np.pi, 300).reshape(-1, 1)
y_pred_1 = tree_1split.predict(x_plot)
y_pred_2 = tree_2split.predict(x_plot)
fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(18, 5))
# plot 1: one split
ax1.scatter(x_demo, y_demo, alpha=0.7, color='blue', label='data')
ax1.plot(x_plot, y_pred_1, color='red', linewidth=3, label='tree prediction')
ax1.set_xlabel('x')
ax1.set_ylabel('y')
ax1.set_title(f'one split (R² = {tree_1split.score(X_demo, y_demo):.3f})')
ax1.legend()
ax1.grid(True, alpha=0.3)
# plot 2: two splits
ax2.scatter(x_demo, y_demo, alpha=0.7, color='blue', label='data')
ax2.plot(x_plot, y_pred_2, color='red', linewidth=3, label='tree prediction')
ax2.set_xlabel('x')
ax2.set_ylabel('y')
ax2.set_title(f'two splits (R² = {tree_2split.score(X_demo, y_demo):.3f})')
ax2.legend()
ax2.grid(True, alpha=0.3)
# plot 3: tree structure for 2-split tree
plot_tree(tree_2split, ax=ax3, feature_names=['x'], filled=True, rounded=True, fontsize=10)
ax3.set_title('tree structure (max_depth=2)')
plt.tight_layout()
plt.show()
# calculate SSE values
overall_mean = np.mean(y_demo)
sst = np.sum((y_demo - overall_mean)**2)
y_pred_1_train = tree_1split.predict(X_demo)
y_pred_2_train = tree_2split.predict(X_demo)
sse_1 = np.sum((y_demo - y_pred_1_train)**2)
sse_2 = np.sum((y_demo - y_pred_2_train)**2)
print("error reduction from additional splits:")
print(f"SST = {sst:.2f}")
print(f"SSE (one split) = {sse_1:.2f}")
print(f"SSE (two splits) = {sse_2:.2f}")
print(f"improvement from second split = {sse_1 - sse_2:.2f}")

error reduction from additional splits:
SST = 30.81
SSE (one split) = 25.28
SSE (two splits) = 18.35
improvement from second split = 6.93
Recursive Partitioning Algorithm
The decision tree algorithm works through recursive partitioning. Here’s a sketch of the algorithm:
Algorithm Steps
Initialize: Start with a root node containing all data and set the stopping conditions.
For each current terminal node:
- If the stopping conditions are met for this node, skip it
- If no nodes can be split, stop the algorithm and return the tree
- For each feature \(j\):
  - For each potential cutoff \(c\) (midpoints between adjacent sorted values), calculate and store the SSE after splitting on \(x_j \leq c\)
- Find the feature \(j\) and cutoff \(c\) combination with the lowest SSE
- Split the current node using \(x_j \leq c\)
Repeat until the stopping conditions are met for all terminal nodes.
The algorithm considers all features and finds potential split points by examining the data. For continuous features, typical split points are the midpoints between adjacent unique values when the data is sorted.
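For intuition, here is a minimal recursive implementation of this idea for a single feature, reusing the illustrative best_split_sse helper sketched earlier in this chapter. It is deliberately simplified and is not how scikit-learn implements trees.
# minimal recursive partitioning for one feature (illustrative sketch only)
def grow_tree(x, y, min_samples_split=10):
    # stop if the node is too small to split or has no usable cutoffs
    if len(y) < min_samples_split or np.unique(x).size < 2:
        return {"prediction": y.mean()}
    c, _ = best_split_sse(x, y)
    left = x <= c
    right = ~left
    return {
        "cutoff": c,
        "left": grow_tree(x[left], y[left], min_samples_split),
        "right": grow_tree(x[right], y[right], min_samples_split),
    }

def predict_one(tree, x_new):
    # follow cutoffs until reaching a leaf, then return that leaf's mean
    while "cutoff" in tree:
        tree = tree["left"] if x_new <= tree["cutoff"] else tree["right"]
    return tree["prediction"]

# tiny usage example with made-up one-dimensional data
x_tiny = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y_tiny = np.array([1.1, 0.9, 1.0, 3.2, 2.9, 3.1])
tiny_tree = grow_tree(x_tiny, y_tiny, min_samples_split=4)
print(predict_one(tiny_tree, 1.5), predict_one(tiny_tree, 5.5))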
Show Code for Recursive Partitioning Demo
# create 2D dataset to show spatial partitioning
np.random.seed(307)
n = 100
x1 = np.random.uniform(0, 10, n)
x2 = np.random.uniform(0, 10, n)
y = 5 + 0.3 * x1 + 0.5 * x2 + 0.1 * x1 * x2 + np.random.normal(0, 2, n)
X = np.column_stack([x1, x2])
# fit trees with increasing depth
fig, axes = plt.subplots(2, 2, figsize=(12, 10))
axes = axes.ravel()
depths = [1, 2, 3, 4]
for i, depth in enumerate(depths):
    tree = DecisionTreeRegressor(max_depth=depth, random_state=42)
    tree.fit(X, y)
    # create prediction surface
    xx, yy = np.meshgrid(np.linspace(0, 10, 500), np.linspace(0, 10, 500))
    X_grid = np.column_stack([xx.ravel(), yy.ravel()])
    Z = tree.predict(X_grid).reshape(xx.shape)
    # plot the regions as discrete blocks (no interpolation)
    im = axes[i].imshow(
        Z,
        extent=[0, 10, 0, 10],
        origin="lower",
        cmap="viridis",
        aspect="equal",
        vmin=np.min(y),
        vmax=np.max(y),
    )
    axes[i].scatter(
        x1,
        x2,
        c=y,
        cmap="viridis",
    )
    axes[i].set_title(f"max_depth = {depth} ({tree.get_n_leaves()} leaves)")
    axes[i].set_xlabel("$x_1$")
    axes[i].set_ylabel("$x_2$")
    # add colorbar
    plt.colorbar(im, ax=axes[i], shrink=0.8)
plt.suptitle("recursive partitioning: feature space divided into regions", fontsize=14)
plt.show()
# show how R² changes with tree complexity
r2_scores = []
for depth in depths:
    tree = DecisionTreeRegressor(max_depth=depth, random_state=42)
    tree.fit(X, y)
    r2_scores.append(tree.score(X, y))
print("R² by tree depth:")
for depth, r2 in zip(depths, r2_scores):
    print(f"depth {depth}: R² = {r2:.3f}")

R² by tree depth:
depth 1: R² = 0.436
depth 2: R² = 0.682
depth 3: R² = 0.808
depth 4: R² = 0.865
Stopping Criteria and Hyperparameters
A crucial aspect of the decision tree algorithm is knowing when to stop growing the tree. Growing a tree until it perfectly fits the training data leads to overfitting. Key stopping criteria include:
Important Hyperparameters
- max_depth: Maximum depth of the tree
- min_samples_split: Minimum number of samples required to split an internal node
- min_samples_leaf: Minimum number of samples required to be at a leaf node
- max_features: Number of features to consider when looking for the best split
These are tuning parameters that control model complexity and help prevent overfitting.
Avoiding Overfitting: Don’t grow trees until SSE = 0! This creates overly complex models that memorize training data rather than learning generalizable patterns.
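For example, these stopping criteria are passed directly to the constructor. The specific values below are illustrative, not recommendations:
# a tree with explicit stopping criteria to limit complexity (values are illustrative)
dt_limited = DecisionTreeRegressor(
    max_depth=5,
    min_samples_split=20,
    min_samples_leaf=10,
)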
Model Evaluation: R-squared
The coefficient of determination, \(R^2\), provides a measure of how well our tree explains the variance in the target variable:
\[ R^2 = 1 - \frac{SSE}{SST} \]
- \(R^2 = 1\): Perfect fit (SSE = 0)
- \(R^2 = 0\): Model is no better than predicting the mean
- \(R^2 < 0\): Model performs worse than predicting the mean
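To close the loop on evaluation, here is a brief sketch of computing \(R^2\) on held-out data with r2_score and train_test_split (imported above). It assumes the X and y arrays defined earlier in the chapter; the model and split settings are illustrative.
# evaluate a tree on a holdout set using R² (illustrative settings)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=1)
dt_eval = DecisionTreeRegressor(max_depth=5).fit(X_tr, y_tr)
print(f"train R² = {r2_score(y_tr, dt_eval.predict(X_tr)):.3f}")
print(f"test R² = {r2_score(y_te, dt_eval.predict(X_te)):.3f}")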