Applied Statistics with Python – Chapter 16
Variable selection and model building
The previous chapter showed how collinearity between predictors can make regression models unstable. One common remedy is to fit a smaller model: drop some predictors so the remaining ones are less redundant.
That raises a new question:
Given many possible models, how do we choose a “good” one?
In the original R notes, this chapter uses tools like AIC(), BIC(),
step(), and regsubsets() from the faraway and leaps packages.
Here we will:
translate those ideas into Python-first code using
statsmodelsandscikit-learn,separate quality criteria (AIC, BIC, adjusted \(R^2\), cross-validated RMSE) from search procedures (backward, forward, stepwise, exhaustive),
use a couple of running examples (seat position and Auto MPG) to see model selection in action, and
revisit the distinction between models for explanation and models for prediction.
Throughout, you can imagine that we already have two cleaned CSV files:
data/seatpos.csvwith a responsehipcenterand several body measurements as predictors.data/autompg.csvwithmpgand car attributes (cylinders, displacement, horsepower, weight, acceleration, model year, domestic / non-domestic).
16.1 Quality criteria: balancing fit and complexity
So far we have used residual plots, \(R^2\), and RMSE to judge a single model’s fit. If we try to compare models using plain \(R^2\) or RMSE, we run into a problem:
Adding predictors can never make :math:`R^2` worse, and almost never makes RMSE worse. A huge, overfitted model will always win.
We need criteria that reward good fit but penalize complexity. In this chapter we will use four:
AIC (Akaike Information Criterion)
BIC (Bayesian Information Criterion)
Adjusted :math:`R^2`
Cross-validated RMSE (for prediction)
In all cases, “smaller is better” except that larger adjusted \(R^2\) is better.
16.1.1 AIC and BIC in Python
For a linear regression with \(n\) observations, \(p\) parameters (including the intercept), and residual sum of squares \(\mathrm{RSS}\), the R notes write AIC and BIC as
The first term measures fit; the second term penalizes model size.
In statsmodels, both are built in:
import pandas as pd
import statsmodels.api as sm
df = pd.read_csv("data/seatpos.csv")
y = df["hipcenter"]
X = df[["Age", "Weight", "HtShoes", "Ht", "Seated", "Arm", "Thigh", "Leg"]]
X = sm.add_constant(X)
model = sm.OLS(y, X).fit()
model.aic # Akaike Information Criterion
model.bic # Bayesian Information Criterion
Smaller AIC / BIC means “better” under that criterion. BIC usually prefers smaller models than AIC because its penalty \((\log n)p\) is larger than \(2p\) once \(\log n > 2\).
If you have RSS, \(n\), and \(p\) directly, you can compute them by hand:
import numpy as np
def aic_from_rss(rss: float, n: int, p: int) -> float:
return n * np.log(rss / n) + 2 * p
def bic_from_rss(rss: float, n: int, p: int) -> float:
return n * np.log(rss / n) + np.log(n) * p
16.1.2 Adjusted \(R^2\)
Plain \(R^2\) is
Adjusted \(R^2\) adds a penalty for the number of parameters:
Unlike plain \(R^2\), adjusted \(R^2\) can decrease when you add a useless predictor.
In statsmodels you get it for free:
model.rsquared # R^2
model.rsquared_adj # adjusted R^2
16.1.3 Cross-validated RMSE
AIC, BIC, and adjusted \(R^2\) all use \(p\) explicitly. Cross-validation instead asks a predictive question:
How well would this model perform on new data?
The R notes focus on leave-one-out cross-validation (LOOCV):
For each observation \(i\), fit the model without that point.
Predict \(\hat y_{[i]}\) for the left-out point.
Compute the LOOCV residual \(e_{[i]} = y_i - \hat y_{[i]}\).
Define
\[\mathrm{RMSE}_{\mathrm{LOOCV}} = \sqrt{\frac{1}{n} \sum_{i=1}^n e_{[i]}^2 }.\]
In Python we usually use scikit-learn for cross-validation:
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, LeaveOneOut
def loocv_rmse(X, y):
model = LinearRegression()
loo = LeaveOneOut()
# By convention, cross_val_score *maximizes* a score, so we use
# negative MSE and flip the sign.
neg_mse_scores = cross_val_score(
model, X, y,
cv=loo,
scoring="neg_mean_squared_error",
)
mse = -neg_mse_scores.mean()
return np.sqrt(mse)
X = df[["Age", "Weight", "HtShoes", "Ht", "Seated", "Arm", "Thigh", "Leg"]]
y = df["hipcenter"]
loocv_rmse(X, y)
For larger data sets, 5-fold or 10-fold cross-validation is more common:
from sklearn.model_selection import KFold
def kfold_rmse(X, y, k=5, random_state=0):
model = LinearRegression()
kf = KFold(n_splits=k, shuffle=True, random_state=random_state)
neg_mse_scores = cross_val_score(
model, X, y,
cv=kf,
scoring="neg_mean_squared_error",
)
mse = -neg_mse_scores.mean()
return np.sqrt(mse)
16.1.4 Which criterion should I use?
A rule of thumb:
For explanation and interpretable models, prefer criteria that strongly penalize complexity:
BIC,
adjusted \(R^2\).
For prediction, emphasize cross-validated RMSE (or another predictive metric). AIC can also be a compromise.
In the rest of the chapter we will often compare several criteria side-by-side.
16.1.5 R vs Python: quality criteria
Concept |
R notes |
Python (statsmodels / scikit-learn) |
|---|---|---|
AIC / BIC |
|
|
Adjusted \(R^2\) |
|
|
LOOCV RMSE |
manual function with |
|
16.2 Search procedures: which models to consider?
A model selection procedure has two ingredients:
A quality criterion (AIC, BIC, adjusted \(R^2\), cross-validated RMSE, …).
A search strategy to explore the huge space of possible models.
With \(p\) candidate predictors, there are \(2^p\) possible subsets. For \(p = 8\), that is already 256 models; for \(p = 15\) it is over 32,000.
The R chapter introduces four strategies:
backward selection,
forward selection,
stepwise selection,
exhaustive (all-subsets) search.
We will sketch Python versions using statsmodels.
Helper: fit a model on a given subset
We’ll write a tiny helper that:
takes a list of predictor names,
fits a linear regression with an intercept,
returns a chosen metric (AIC, BIC, adjusted \(R^2\), or CV-RMSE).
import itertools
from typing import List, Literal
Metric = Literal["aic", "bic", "adj_r2", "loocv_rmse"]
def evaluate_subset(df, response: str, predictors: List[str],
metric: Metric = "aic") -> float:
"""Fit OLS on a subset of predictors and return a scalar score."""
y = df[response]
X = sm.add_constant(df[predictors])
fit = sm.OLS(y, X).fit()
if metric == "aic":
return fit.aic
if metric == "bic":
return fit.bic
if metric == "adj_r2":
# For consistency with AIC/BIC ("smaller is better"), we return
# negative adjusted R^2 here.
return -fit.rsquared_adj
if metric == "loocv_rmse":
# Use scikit-learn with the *same* predictors
X_np = df[predictors].to_numpy()
y_np = y.to_numpy()
return loocv_rmse(X_np, y_np)
raise ValueError(f"Unknown metric: {metric}")
16.2.1 Backward selection
Backward selection starts from the full model and repeatedly drops one predictor at a time if it improves the criterion.
Skeleton implementation:
def backward_selection(df, response: str, candidates: List[str],
metric: Metric = "aic"):
remaining = list(candidates)
best_score = evaluate_subset(df, response, remaining, metric)
improved = True
while improved and len(remaining) > 1:
improved = False
scores = []
for predictor in remaining:
trial = [p for p in remaining if p != predictor]
score = evaluate_subset(df, response, trial, metric)
scores.append((score, predictor, trial))
# Pick the *best* (smallest score) among the candidates
score, predictor_to_drop, trial = min(scores, key=lambda t: t[0])
if score + 1e-8 < best_score: # small tolerance
best_score = score
remaining = trial
improved = True
return remaining, best_score
Example (using the seat position data):
seat_df = pd.read_csv("data/seatpos.csv")
predictors = ["Age", "Weight", "HtShoes", "Ht", "Seated",
"Arm", "Thigh", "Leg"]
selected_aic, score_aic = backward_selection(
seat_df, "hipcenter", predictors, metric="aic"
)
selected_bic, score_bic = backward_selection(
seat_df, "hipcenter", predictors, metric="bic"
)
print("Backward AIC:", selected_aic)
print("Backward BIC:", selected_bic)
Typically:
AIC keeps more predictors.
BIC keeps a smaller set with nearly as good predictive performance.
16.2.2 Forward selection
Forward selection starts from the intercept-only model and repeatedly adds the predictor that most improves the criterion.
def forward_selection(df, response: str, candidates: List[str],
metric: Metric = "aic"):
remaining = list(candidates)
selected: List[str] = []
best_score = evaluate_subset(df, response, selected or [remaining[0]], metric)
improved = True
while improved and remaining:
improved = False
scores = []
for predictor in remaining:
trial = selected + [predictor]
score = evaluate_subset(df, response, trial, metric)
scores.append((score, predictor, trial))
score, predictor_to_add, trial = min(scores, key=lambda t: t[0])
if score + 1e-8 < best_score:
best_score = score
selected = trial
remaining.remove(predictor_to_add)
improved = True
return selected, best_score
Forward and backward often find similar but not identical models.
16.2.3 Stepwise selection
Stepwise selection looks both directions at each step:
try adding each unused predictor;
also try dropping each currently-included predictor;
commit the change (add or drop) that best improves the criterion.
Conceptually it is:
def stepwise_selection(df, response: str, candidates: List[str],
metric: Metric = "aic"):
selected: List[str] = []
best_score = float("inf")
while True:
changed = False
# 1. Try adding
add_scores = []
for p in candidates:
if p in selected:
continue
trial = selected + [p]
score = evaluate_subset(df, response, trial, metric)
add_scores.append((score, ("add", p), trial))
# 2. Try dropping
drop_scores = []
for p in selected:
trial = [q for q in selected if q != p]
if not trial:
continue
score = evaluate_subset(df, response, trial, metric)
drop_scores.append((score, ("drop", p), trial))
all_scores = add_scores + drop_scores
if not all_scores:
break
score, (action, p), trial = min(all_scores, key=lambda t: t[0])
if score + 1e-8 < best_score:
best_score = score
selected = trial
changed = True
if not changed:
break
return selected, best_score
Stepwise tends to behave like a hybrid between forward and backward.
16.2.4 Exhaustive search (all subsets)
For small numbers of predictors (\(p \le 10\) or so), you can simply check all subsets and pick the best according to a given criterion.
def exhaustive_search(df, response: str, candidates: List[str],
metric: Metric = "aic"):
best_score = float("inf")
best_subset: List[str] | None = None
for k in range(1, len(candidates) + 1):
for subset in itertools.combinations(candidates, k):
subset = list(subset)
score = evaluate_subset(df, response, subset, metric)
if score < best_score:
best_score = score
best_subset = subset
return best_subset, best_score
This mirrors the R chapter’s use of regsubsets() from the leaps
package.
16.3 Example: seat position (AIC vs BIC vs LOOCV)
In the R version, the seatpos data from the faraway package is used
to compare models for the response hipcenter. Assume we have
data/seatpos.csv with the following columns:
hipcenter– driver hip center (response),Age,Weight,HtShoes,Ht,Seated,Arm,Thigh,Leg– body measurements.
A typical workflow in Python:
Fit the full model with all eight predictors.
Run backward selection with AIC and BIC.
Compare adjusted \(R^2\) and LOOCV RMSE across the full and selected models.
Sketch:
full_predictors = ["Age", "Weight", "HtShoes", "Ht",
"Seated", "Arm", "Thigh", "Leg"]
# Full model
y = seat_df["hipcenter"]
X_full = sm.add_constant(seat_df[full_predictors])
full_fit = sm.OLS(y, X_full).fit()
# Backward AIC / BIC
back_aic, _ = backward_selection(seat_df, "hipcenter", full_predictors, "aic")
back_bic, _ = backward_selection(seat_df, "hipcenter", full_predictors, "bic")
# Fit selected models
X_aic = sm.add_constant(seat_df[back_aic])
X_bic = sm.add_constant(seat_df[back_bic])
fit_aic = sm.OLS(y, X_aic).fit()
fit_bic = sm.OLS(y, X_bic).fit()
# Compare criteria
print("Full adj R^2:", full_fit.rsquared_adj)
print("AIC adj R^2:", fit_aic.rsquared_adj)
print("BIC adj R^2:", fit_bic.rsquared_adj)
print("Full LOOCV RMSE:", loocv_rmse(X_full.iloc[:, 1:].to_numpy(), y.to_numpy()))
print("AIC LOOCV RMSE:", loocv_rmse(X_aic.iloc[:, 1:].to_numpy(), y.to_numpy()))
print("BIC LOOCV RMSE:", loocv_rmse(X_bic.iloc[:, 1:].to_numpy(), y.to_numpy()))
Typically you will see:
both AIC and BIC improve adjusted :math:`R^2` and LOOCV RMSE compared to the full model;
BIC selects a smaller subset with only a small loss (or even a gain) in predictive performance.
16.4 Higher-order terms: Auto MPG example
The final part of the R chapter returns to the Auto MPG data and considers:
quadratic terms such as \(\mathrm{disp}^2\), \(\mathrm{hp}^2\), \(\mathrm{wt}^2\), \(\mathrm{acc}^2\),
all two-way interactions between first-order terms.
The idea is to show that:
we can start with a very rich model (many terms),
then use backward selection (often with BIC) to find a smaller model that still predicts well,
and compare LOOCV RMSE between the full and selected models.
In Python we can use formula syntax with statsmodels:
import numpy as np
import statsmodels.formula.api as smf
auto_df = pd.read_csv("data/autompg.csv")
# Log-transform the response as in the R notes
auto_df["log_mpg"] = np.log(auto_df["mpg"])
big_formula = (
"log_mpg ~ (cyl + disp + hp + wt + acc + year + domestic) ** 2"
" + I(disp ** 2) + I(hp ** 2) + I(wt ** 2) + I(acc ** 2)"
)
big_fit = smf.ols(big_formula, data=auto_df).fit()
big_fit.summary()
Now apply backward selection on this rich model, using BIC or AIC as the criterion. The search logic is identical; the only difference is that the candidate “predictors” are now terms from the formula (main effects, quadratics, interactions) rather than raw columns.
Python does not automatically enforce hierarchy (keeping main effects whenever an interaction is included), so in practice you may want to:
define groups of terms that must stay together,
or rely on domain knowledge to simplify the final model.
The main lesson:
Sometimes a slightly richer model (interactions, quadratic terms) combined with a strong penalty (BIC or cross-validation) gives the best predictive performance.
16.5 Explanation versus prediction
A central theme of the R chapter is the distinction between models used for:
Explanation – understanding how predictors relate to the response.
Prediction – making accurate forecasts for new observations.
Explanation
For explanation we care about:
Interpretability: How does each predictor affect the response, holding others fixed?
Parsimony: Smaller models are easier to explain and reason about.
Causality vs association: Does a predictor cause changes in the response, or is it just correlated?
We normally:
favor BIC or adjusted \(R^2\) to penalize complexity,
inspect diagnostic plots (from previous chapters),
remember that with observational data we detect associations, not necessarily causal effects.
Prediction
For pure prediction we care about:
out-of-sample error (cross-validated RMSE),
robustness to new data.
We are less worried about:
exact p-values,
whether effects are causal,
whether the model is small.
A model can predict well even if it is not a realistic description of the data-generating process.
A classic analogy: shoe size and reading level in children are highly correlated (both driven by age). Shoe size does not cause reading ability, but knowing shoe size still helps predict reading level. For prediction that is fine; for explanation it is misleading.
16.6 How this connects to PyStatsV1
In PyStatsV1 we will continually revisit model selection:
Choosing which predictors to include in a regression or GLM.
Comparing different model families (linear vs logistic vs Poisson).
Balancing interpretability against predictive performance.
This chapter gives you a toolkit that we can plug into many later examples:
computing AIC / BIC / adjusted \(R^2\),
using cross-validation to estimate predictive error,
exploring backward / forward / stepwise / exhaustive search.
In applied case studies we will often start from a rich model suggested by subject-matter knowledge, then use these tools to trim it down to something:
scientifically interpretable,
numerically stable,
and reasonably accurate for prediction.
16.7 What you should take away
By the end of this chapter (and its R + Python versions), you should be able to:
Explain why plain :math:`R^2` and RMSE are not enough for model comparison.
Compute and interpret:
AIC and BIC,
adjusted :math:`R^2`,
cross-validated RMSE (LOOCV or K-fold).
Implement or adapt search procedures in Python:
backward selection,
forward selection,
stepwise selection,
exhaustive search for small \(p\).
Use these procedures on real data sets (like seat position and Auto MPG) to find reasonable models.
Distinguish between models aimed at explanation and those aimed at prediction, and choose criteria that match your goal.
Recognize that:
strong penalties (BIC, adjusted \(R^2\)) often yield compact, interpretable models,
cross-validation is the gold standard for estimating predictive error,
observational data supports association, not automatic causal claims.
In later PyStatsV1 chapters, these model-selection tools will support both:
observational regression analyses, and
upcoming material on experimental design, where we can combine good designs with good models to make stronger causal statements.