Applied Statistics with Python – Chapter 16

Variable selection and model building

The previous chapter showed how collinearity between predictors can make regression models unstable. One common remedy is to fit a smaller model: drop some predictors so the remaining ones are less redundant.

That raises a new question:

Given many possible models, how do we choose a “good” one?

In the original R notes, this chapter uses tools like AIC(), BIC(), step(), and regsubsets() from the faraway and leaps packages. Here we will:

translate those ideas into Python-first code using statsmodels and scikit-learn,
separate quality criteria (AIC, BIC, adjusted $R^2$, cross-validated RMSE) from search procedures (backward, forward, stepwise, exhaustive),
use a couple of running examples (seat position and Auto MPG) to see model selection in action, and
revisit the distinction between models for explanation and models for prediction.

Throughout, you can imagine that we already have two cleaned CSV files:

data/seatpos.csv with a response hipcenter and several body measurements as predictors.
data/autompg.csv with mpg and car attributes (cylinders, displacement, horsepower, weight, acceleration, model year, domestic / non-domestic).

16.1 Quality criteria: balancing fit and complexity

So far we have used residual plots, $R^2$, and RMSE to judge a single model’s fit. If we try to compare models using plain $R^2$ or RMSE, we run into a problem:

Adding predictors can never make :math:`R^2` worse, and almost never makes RMSE worse. A huge, overfitted model will always win.

We need criteria that reward good fit but penalize complexity. In this chapter we will use four:

AIC (Akaike Information Criterion)
BIC (Bayesian Information Criterion)
Adjusted :math:`R^2`
Cross-validated RMSE (for prediction)

In all cases, “smaller is better” except that larger adjusted $R^2$ is better.

16.1.1 AIC and BIC in Python

For a linear regression with $n$ observations, $p$ parameters (including the intercept), and residual sum of squares $\mathrm{RSS}$, the R notes write AIC and BIC as

\[ \begin{align}\begin{aligned}\mathrm{AIC} = n \log\!\left(\frac{\mathrm{RSS}}{n}\right) + 2p\\\mathrm{BIC} = n \log\!\left(\frac{\mathrm{RSS}}{n}\right) + (\log n)\, p.\end{aligned}\end{align} \]

The first term measures fit; the second term penalizes model size.

In statsmodels, both are built in:

import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("data/seatpos.csv")
y = df["hipcenter"]
X = df[["Age", "Weight", "HtShoes", "Ht", "Seated", "Arm", "Thigh", "Leg"]]
X = sm.add_constant(X)

model = sm.OLS(y, X).fit()

model.aic      # Akaike Information Criterion
model.bic      # Bayesian Information Criterion

Smaller AIC / BIC means “better” under that criterion. BIC usually prefers smaller models than AIC because its penalty $(\log n)p$ is larger than $2p$ once $\log n > 2$.

If you have RSS, $n$, and $p$ directly, you can compute them by hand:

import numpy as np

def aic_from_rss(rss: float, n: int, p: int) -> float:
    return n * np.log(rss / n) + 2 * p

def bic_from_rss(rss: float, n: int, p: int) -> float:
    return n * np.log(rss / n) + np.log(n) * p

16.1.2 Adjusted $R^2$

Plain $R^2$ is

\[R^2 = 1 - \frac{\mathrm{SSE}}{\mathrm{SST}} = 1 - \frac{\sum (y_i - \hat y_i)^2}{\sum(y_i - \bar y)^2}.\]

Adjusted $R^2$ adds a penalty for the number of parameters:

\[R^2_\mathrm{adj} = 1 - \frac{\mathrm{SSE} / (n - p)}{\mathrm{SST} / (n - 1)} = 1 - \frac{n - 1}{n - p} (1 - R^2).\]

Unlike plain $R^2$, adjusted $R^2$ can decrease when you add a useless predictor.

In statsmodels you get it for free:

model.rsquared        # R^2
model.rsquared_adj    # adjusted R^2

16.1.3 Cross-validated RMSE

AIC, BIC, and adjusted $R^2$ all use $p$ explicitly. Cross-validation instead asks a predictive question:

How well would this model perform on new data?

The R notes focus on leave-one-out cross-validation (LOOCV):

For each observation $i$, fit the model without that point.
Predict $\hat y_{[i]}$ for the left-out point.
Compute the LOOCV residual $e_{[i]} = y_i - \hat y_{[i]}$.
Define

\[\mathrm{RMSE}_{\mathrm{LOOCV}} = \sqrt{\frac{1}{n} \sum_{i=1}^n e_{[i]}^2 }.\]

In Python we usually use scikit-learn for cross-validation:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, LeaveOneOut

def loocv_rmse(X, y):
    model = LinearRegression()
    loo = LeaveOneOut()
    # By convention, cross_val_score *maximizes* a score, so we use
    # negative MSE and flip the sign.
    neg_mse_scores = cross_val_score(
        model, X, y,
        cv=loo,
        scoring="neg_mean_squared_error",
    )
    mse = -neg_mse_scores.mean()
    return np.sqrt(mse)

X = df[["Age", "Weight", "HtShoes", "Ht", "Seated", "Arm", "Thigh", "Leg"]]
y = df["hipcenter"]

loocv_rmse(X, y)

For larger data sets, 5-fold or 10-fold cross-validation is more common:

from sklearn.model_selection import KFold

def kfold_rmse(X, y, k=5, random_state=0):
    model = LinearRegression()
    kf = KFold(n_splits=k, shuffle=True, random_state=random_state)
    neg_mse_scores = cross_val_score(
        model, X, y,
        cv=kf,
        scoring="neg_mean_squared_error",
    )
    mse = -neg_mse_scores.mean()
    return np.sqrt(mse)

16.1.4 Which criterion should I use?

A rule of thumb:

For explanation and interpretable models, prefer criteria that strongly penalize complexity:
- BIC,
- adjusted $R^2$.
For prediction, emphasize cross-validated RMSE (or another predictive metric). AIC can also be a compromise.

In the rest of the chapter we will often compare several criteria side-by-side.

16.1.5 R vs Python: quality criteria

Concept	R notes	Python (statsmodels / scikit-learn)
AIC / BIC	`AIC(fit)`, `BIC(fit)`	`fit.aic`, `fit.bic`
Adjusted $R^2$	`summary(fit)$adj.r.squared`	`fit.rsquared_adj`
LOOCV RMSE	manual function with `hatvalues(fit)`	`LeaveOneOut` + `cross_val_score` (or manual loop)

16.2 Search procedures: which models to consider?

A model selection procedure has two ingredients:

A quality criterion (AIC, BIC, adjusted $R^2$, cross-validated RMSE, …).
A search strategy to explore the huge space of possible models.

With $p$ candidate predictors, there are $2^p$ possible subsets. For $p = 8$, that is already 256 models; for $p = 15$ it is over 32,000.

The R chapter introduces four strategies:

backward selection,
forward selection,
stepwise selection,
exhaustive (all-subsets) search.

We will sketch Python versions using statsmodels.

Helper: fit a model on a given subset

We’ll write a tiny helper that:

takes a list of predictor names,
fits a linear regression with an intercept,
returns a chosen metric (AIC, BIC, adjusted $R^2$, or CV-RMSE).

import itertools
from typing import List, Literal

Metric = Literal["aic", "bic", "adj_r2", "loocv_rmse"]

def evaluate_subset(df, response: str, predictors: List[str],
                    metric: Metric = "aic") -> float:
    """Fit OLS on a subset of predictors and return a scalar score."""
    y = df[response]
    X = sm.add_constant(df[predictors])

    fit = sm.OLS(y, X).fit()

    if metric == "aic":
        return fit.aic
    if metric == "bic":
        return fit.bic
    if metric == "adj_r2":
        # For consistency with AIC/BIC ("smaller is better"), we return
        # negative adjusted R^2 here.
        return -fit.rsquared_adj
    if metric == "loocv_rmse":
        # Use scikit-learn with the *same* predictors
        X_np = df[predictors].to_numpy()
        y_np = y.to_numpy()
        return loocv_rmse(X_np, y_np)

    raise ValueError(f"Unknown metric: {metric}")

16.2.1 Backward selection

Backward selection starts from the full model and repeatedly drops one predictor at a time if it improves the criterion.

Skeleton implementation:

def backward_selection(df, response: str, candidates: List[str],
                       metric: Metric = "aic"):
    remaining = list(candidates)
    best_score = evaluate_subset(df, response, remaining, metric)

    improved = True
    while improved and len(remaining) > 1:
        improved = False
        scores = []

        for predictor in remaining:
            trial = [p for p in remaining if p != predictor]
            score = evaluate_subset(df, response, trial, metric)
            scores.append((score, predictor, trial))

        # Pick the *best* (smallest score) among the candidates
        score, predictor_to_drop, trial = min(scores, key=lambda t: t[0])

        if score + 1e-8 < best_score:  # small tolerance
            best_score = score
            remaining = trial
            improved = True

    return remaining, best_score

Example (using the seat position data):

seat_df = pd.read_csv("data/seatpos.csv")
predictors = ["Age", "Weight", "HtShoes", "Ht", "Seated",
              "Arm", "Thigh", "Leg"]

selected_aic, score_aic = backward_selection(
    seat_df, "hipcenter", predictors, metric="aic"
)
selected_bic, score_bic = backward_selection(
    seat_df, "hipcenter", predictors, metric="bic"
)

print("Backward AIC:", selected_aic)
print("Backward BIC:", selected_bic)

Typically:

AIC keeps more predictors.
BIC keeps a smaller set with nearly as good predictive performance.

16.2.2 Forward selection

Forward selection starts from the intercept-only model and repeatedly adds the predictor that most improves the criterion.

def forward_selection(df, response: str, candidates: List[str],
                      metric: Metric = "aic"):
    remaining = list(candidates)
    selected: List[str] = []
    best_score = evaluate_subset(df, response, selected or [remaining[0]], metric)

    improved = True
    while improved and remaining:
        improved = False
        scores = []

        for predictor in remaining:
            trial = selected + [predictor]
            score = evaluate_subset(df, response, trial, metric)
            scores.append((score, predictor, trial))

        score, predictor_to_add, trial = min(scores, key=lambda t: t[0])

        if score + 1e-8 < best_score:
            best_score = score
            selected = trial
            remaining.remove(predictor_to_add)
            improved = True

    return selected, best_score

Forward and backward often find similar but not identical models.

16.2.3 Stepwise selection

Stepwise selection looks both directions at each step:

try adding each unused predictor;
also try dropping each currently-included predictor;
commit the change (add or drop) that best improves the criterion.

Conceptually it is:

def stepwise_selection(df, response: str, candidates: List[str],
                       metric: Metric = "aic"):
    selected: List[str] = []
    best_score = float("inf")

    while True:
        changed = False

        # 1. Try adding
        add_scores = []
        for p in candidates:
            if p in selected:
                continue
            trial = selected + [p]
            score = evaluate_subset(df, response, trial, metric)
            add_scores.append((score, ("add", p), trial))

        # 2. Try dropping
        drop_scores = []
        for p in selected:
            trial = [q for q in selected if q != p]
            if not trial:
                continue
            score = evaluate_subset(df, response, trial, metric)
            drop_scores.append((score, ("drop", p), trial))

        all_scores = add_scores + drop_scores
        if not all_scores:
            break

        score, (action, p), trial = min(all_scores, key=lambda t: t[0])

        if score + 1e-8 < best_score:
            best_score = score
            selected = trial
            changed = True

        if not changed:
            break

    return selected, best_score

Stepwise tends to behave like a hybrid between forward and backward.

16.2.4 Exhaustive search (all subsets)

For small numbers of predictors ($p \le 10$ or so), you can simply check all subsets and pick the best according to a given criterion.

def exhaustive_search(df, response: str, candidates: List[str],
                      metric: Metric = "aic"):
    best_score = float("inf")
    best_subset: List[str] | None = None

    for k in range(1, len(candidates) + 1):
        for subset in itertools.combinations(candidates, k):
            subset = list(subset)
            score = evaluate_subset(df, response, subset, metric)
            if score < best_score:
                best_score = score
                best_subset = subset

    return best_subset, best_score

This mirrors the R chapter’s use of regsubsets() from the leaps package.

16.3 Example: seat position (AIC vs BIC vs LOOCV)

In the R version, the seatpos data from the faraway package is used to compare models for the response hipcenter. Assume we have data/seatpos.csv with the following columns:

hipcenter – driver hip center (response),
Age, Weight, HtShoes, Ht, Seated, Arm, Thigh, Leg – body measurements.

A typical workflow in Python:

Fit the full model with all eight predictors.
Run backward selection with AIC and BIC.
Compare adjusted $R^2$ and LOOCV RMSE across the full and selected models.

Sketch:

full_predictors = ["Age", "Weight", "HtShoes", "Ht",
                   "Seated", "Arm", "Thigh", "Leg"]

# Full model
y = seat_df["hipcenter"]
X_full = sm.add_constant(seat_df[full_predictors])
full_fit = sm.OLS(y, X_full).fit()

# Backward AIC / BIC
back_aic, _ = backward_selection(seat_df, "hipcenter", full_predictors, "aic")
back_bic, _ = backward_selection(seat_df, "hipcenter", full_predictors, "bic")

# Fit selected models
X_aic = sm.add_constant(seat_df[back_aic])
X_bic = sm.add_constant(seat_df[back_bic])

fit_aic = sm.OLS(y, X_aic).fit()
fit_bic = sm.OLS(y, X_bic).fit()

# Compare criteria
print("Full   adj R^2:", full_fit.rsquared_adj)
print("AIC    adj R^2:", fit_aic.rsquared_adj)
print("BIC    adj R^2:", fit_bic.rsquared_adj)

print("Full   LOOCV RMSE:", loocv_rmse(X_full.iloc[:, 1:].to_numpy(), y.to_numpy()))
print("AIC    LOOCV RMSE:", loocv_rmse(X_aic.iloc[:, 1:].to_numpy(), y.to_numpy()))
print("BIC    LOOCV RMSE:", loocv_rmse(X_bic.iloc[:, 1:].to_numpy(), y.to_numpy()))

Typically you will see:

both AIC and BIC improve adjusted :math:`R^2` and LOOCV RMSE compared to the full model;
BIC selects a smaller subset with only a small loss (or even a gain) in predictive performance.

16.4 Higher-order terms: Auto MPG example

The final part of the R chapter returns to the Auto MPG data and considers:

quadratic terms such as $\mathrm{disp}^2$, $\mathrm{hp}^2$, $\mathrm{wt}^2$, $\mathrm{acc}^2$,
all two-way interactions between first-order terms.

The idea is to show that:

we can start with a very rich model (many terms),
then use backward selection (often with BIC) to find a smaller model that still predicts well,
and compare LOOCV RMSE between the full and selected models.

In Python we can use formula syntax with statsmodels:

import numpy as np
import statsmodels.formula.api as smf

auto_df = pd.read_csv("data/autompg.csv")

# Log-transform the response as in the R notes
auto_df["log_mpg"] = np.log(auto_df["mpg"])

big_formula = (
    "log_mpg ~ (cyl + disp + hp + wt + acc + year + domestic) ** 2"
    " + I(disp ** 2) + I(hp ** 2) + I(wt ** 2) + I(acc ** 2)"
)

big_fit = smf.ols(big_formula, data=auto_df).fit()
big_fit.summary()

Now apply backward selection on this rich model, using BIC or AIC as the criterion. The search logic is identical; the only difference is that the candidate “predictors” are now terms from the formula (main effects, quadratics, interactions) rather than raw columns.

Python does not automatically enforce hierarchy (keeping main effects whenever an interaction is included), so in practice you may want to:

define groups of terms that must stay together,
or rely on domain knowledge to simplify the final model.

The main lesson:

Sometimes a slightly richer model (interactions, quadratic terms) combined with a strong penalty (BIC or cross-validation) gives the best predictive performance.

16.5 Explanation versus prediction

A central theme of the R chapter is the distinction between models used for:

Explanation – understanding how predictors relate to the response.
Prediction – making accurate forecasts for new observations.

Explanation

For explanation we care about:

Interpretability: How does each predictor affect the response, holding others fixed?
Parsimony: Smaller models are easier to explain and reason about.
Causality vs association: Does a predictor cause changes in the response, or is it just correlated?

We normally:

favor BIC or adjusted $R^2$ to penalize complexity,
inspect diagnostic plots (from previous chapters),
remember that with observational data we detect associations, not necessarily causal effects.

Prediction

For pure prediction we care about:

out-of-sample error (cross-validated RMSE),
robustness to new data.

We are less worried about:

exact p-values,
whether effects are causal,
whether the model is small.

A model can predict well even if it is not a realistic description of the data-generating process.

A classic analogy: shoe size and reading level in children are highly correlated (both driven by age). Shoe size does not cause reading ability, but knowing shoe size still helps predict reading level. For prediction that is fine; for explanation it is misleading.

16.6 How this connects to PyStatsV1

In PyStatsV1 we will continually revisit model selection:

Choosing which predictors to include in a regression or GLM.
Comparing different model families (linear vs logistic vs Poisson).
Balancing interpretability against predictive performance.

This chapter gives you a toolkit that we can plug into many later examples:

computing AIC / BIC / adjusted $R^2$,
using cross-validation to estimate predictive error,
exploring backward / forward / stepwise / exhaustive search.

In applied case studies we will often start from a rich model suggested by subject-matter knowledge, then use these tools to trim it down to something:

scientifically interpretable,
numerically stable,
and reasonably accurate for prediction.

16.7 What you should take away

By the end of this chapter (and its R + Python versions), you should be able to:

Explain why plain :math:`R^2` and RMSE are not enough for model comparison.
Compute and interpret:
- AIC and BIC,
- adjusted :math:`R^2`,
- cross-validated RMSE (LOOCV or K-fold).
Implement or adapt search procedures in Python:
- backward selection,
- forward selection,
- stepwise selection,
- exhaustive search for small $p$.
Use these procedures on real data sets (like seat position and Auto MPG) to find reasonable models.
Distinguish between models aimed at explanation and those aimed at prediction, and choose criteria that match your goal.
Recognize that:
- strong penalties (BIC, adjusted $R^2$) often yield compact, interpretable models,
- cross-validation is the gold standard for estimating predictive error,
- observational data supports association, not automatic causal claims.

In later PyStatsV1 chapters, these model-selection tools will support both:

observational regression analyses, and
upcoming material on experimental design, where we can combine good designs with good models to make stronger causal statements.