.. _applied_stats_with_python_ch16_variable_selection_and_model_building: Applied Statistics with Python – Chapter 16 =========================================== Variable selection and model building ------------------------------------- The previous chapter showed how **collinearity** between predictors can make regression models unstable. One common remedy is to **fit a smaller model**: drop some predictors so the remaining ones are less redundant. That raises a new question: *Given many possible models, how do we choose a “good” one?* In the original R notes, this chapter uses tools like ``AIC()``, ``BIC()``, ``step()``, and ``regsubsets()`` from the **faraway** and **leaps** packages. Here we will: * translate those ideas into **Python-first** code using :mod:`statsmodels` and :mod:`scikit-learn`, * separate **quality criteria** (AIC, BIC, adjusted :math:`R^2`, cross-validated RMSE) from **search procedures** (backward, forward, stepwise, exhaustive), * use a couple of running examples (seat position and Auto MPG) to see model selection in action, and * revisit the distinction between models for **explanation** and models for **prediction**. Throughout, you can imagine that we already have two cleaned CSV files: * ``data/seatpos.csv`` with a response ``hipcenter`` and several body measurements as predictors. * ``data/autompg.csv`` with ``mpg`` and car attributes (cylinders, displacement, horsepower, weight, acceleration, model year, domestic / non-domestic). 16.1 Quality criteria: balancing fit and complexity --------------------------------------------------- So far we have used **residual plots**, :math:`R^2`, and RMSE to judge a single model’s fit. If we try to *compare* models using plain :math:`R^2` or RMSE, we run into a problem: *Adding predictors can never make :math:`R^2` worse, and almost never makes RMSE worse.* A huge, overfitted model will always win. We need criteria that reward **good fit** but **penalize complexity**. In this chapter we will use four: * **AIC** (Akaike Information Criterion) * **BIC** (Bayesian Information Criterion) * **Adjusted :math:`R^2`** * **Cross-validated RMSE** (for prediction) In all cases, “smaller is better” except that **larger** adjusted :math:`R^2` is better. 16.1.1 AIC and BIC in Python ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ For a linear regression with :math:`n` observations, :math:`p` parameters (including the intercept), and residual sum of squares :math:`\mathrm{RSS}`, the R notes write AIC and BIC as .. math:: \mathrm{AIC} = n \log\!\left(\frac{\mathrm{RSS}}{n}\right) + 2p \mathrm{BIC} = n \log\!\left(\frac{\mathrm{RSS}}{n}\right) + (\log n)\, p. *The first term measures fit; the second term penalizes model size.* In **statsmodels**, both are built in: .. code-block:: python import pandas as pd import statsmodels.api as sm df = pd.read_csv("data/seatpos.csv") y = df["hipcenter"] X = df[["Age", "Weight", "HtShoes", "Ht", "Seated", "Arm", "Thigh", "Leg"]] X = sm.add_constant(X) model = sm.OLS(y, X).fit() model.aic # Akaike Information Criterion model.bic # Bayesian Information Criterion Smaller AIC / BIC means “better” under that criterion. BIC usually prefers **smaller** models than AIC because its penalty :math:`(\log n)p` is larger than :math:`2p` once :math:`\log n > 2`. If you have RSS, :math:`n`, and :math:`p` directly, you can compute them by hand: .. code-block:: python import numpy as np def aic_from_rss(rss: float, n: int, p: int) -> float: return n * np.log(rss / n) + 2 * p def bic_from_rss(rss: float, n: int, p: int) -> float: return n * np.log(rss / n) + np.log(n) * p 16.1.2 Adjusted :math:`R^2` ~~~~~~~~~~~~~~~~~~~~~~~~~~~ Plain :math:`R^2` is .. math:: R^2 = 1 - \frac{\mathrm{SSE}}{\mathrm{SST}} = 1 - \frac{\sum (y_i - \hat y_i)^2}{\sum(y_i - \bar y)^2}. Adjusted :math:`R^2` adds a penalty for the number of parameters: .. math:: R^2_\mathrm{adj} = 1 - \frac{\mathrm{SSE} / (n - p)}{\mathrm{SST} / (n - 1)} = 1 - \frac{n - 1}{n - p} (1 - R^2). Unlike plain :math:`R^2`, adjusted :math:`R^2` can **decrease** when you add a useless predictor. In statsmodels you get it for free: .. code-block:: python model.rsquared # R^2 model.rsquared_adj # adjusted R^2 16.1.3 Cross-validated RMSE ~~~~~~~~~~~~~~~~~~~~~~~~~~~ AIC, BIC, and adjusted :math:`R^2` all use :math:`p` explicitly. **Cross-validation** instead asks a predictive question: *How well would this model perform on new data?* The R notes focus on **leave-one-out cross-validation** (LOOCV): 1. For each observation :math:`i`, fit the model **without** that point. 2. Predict :math:`\hat y_{[i]}` for the left-out point. 3. Compute the LOOCV residual :math:`e_{[i]} = y_i - \hat y_{[i]}`. 4. Define .. math:: \mathrm{RMSE}_{\mathrm{LOOCV}} = \sqrt{\frac{1}{n} \sum_{i=1}^n e_{[i]}^2 }. In Python we usually use :mod:`scikit-learn` for cross-validation: .. code-block:: python import numpy as np from sklearn.linear_model import LinearRegression from sklearn.model_selection import cross_val_score, LeaveOneOut def loocv_rmse(X, y): model = LinearRegression() loo = LeaveOneOut() # By convention, cross_val_score *maximizes* a score, so we use # negative MSE and flip the sign. neg_mse_scores = cross_val_score( model, X, y, cv=loo, scoring="neg_mean_squared_error", ) mse = -neg_mse_scores.mean() return np.sqrt(mse) X = df[["Age", "Weight", "HtShoes", "Ht", "Seated", "Arm", "Thigh", "Leg"]] y = df["hipcenter"] loocv_rmse(X, y) For larger data sets, **5-fold or 10-fold** cross-validation is more common: .. code-block:: python from sklearn.model_selection import KFold def kfold_rmse(X, y, k=5, random_state=0): model = LinearRegression() kf = KFold(n_splits=k, shuffle=True, random_state=random_state) neg_mse_scores = cross_val_score( model, X, y, cv=kf, scoring="neg_mean_squared_error", ) mse = -neg_mse_scores.mean() return np.sqrt(mse) 16.1.4 Which criterion should I use? ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ A rule of thumb: * For **explanation** and **interpretable models**, prefer criteria that strongly penalize complexity: * BIC, * adjusted :math:`R^2`. * For **prediction**, emphasize **cross-validated RMSE** (or another predictive metric). AIC can also be a compromise. In the rest of the chapter we will often compare several criteria side-by-side. 16.1.5 R vs Python: quality criteria ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ .. list-table:: :header-rows: 1 * - Concept - R notes - Python (statsmodels / scikit-learn) * - AIC / BIC - ``AIC(fit)``, ``BIC(fit)`` - ``fit.aic``, ``fit.bic`` * - Adjusted :math:`R^2` - ``summary(fit)$adj.r.squared`` - ``fit.rsquared_adj`` * - LOOCV RMSE - manual function with ``hatvalues(fit)`` - ``LeaveOneOut`` + ``cross_val_score`` (or manual loop) 16.2 Search procedures: which models to consider? ------------------------------------------------- A **model selection procedure** has two ingredients: 1. A **quality criterion** (AIC, BIC, adjusted :math:`R^2`, cross-validated RMSE, …). 2. A **search strategy** to explore the huge space of possible models. With :math:`p` candidate predictors, there are :math:`2^p` possible subsets. For :math:`p = 8`, that is already 256 models; for :math:`p = 15` it is over 32,000. The R chapter introduces four strategies: * backward selection, * forward selection, * stepwise selection, * exhaustive (all-subsets) search. We will sketch Python versions using statsmodels. Helper: fit a model on a given subset ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ We’ll write a tiny helper that: * takes a list of predictor names, * fits a linear regression with an intercept, * returns a chosen metric (AIC, BIC, adjusted :math:`R^2`, or CV-RMSE). .. code-block:: python import itertools from typing import List, Literal Metric = Literal["aic", "bic", "adj_r2", "loocv_rmse"] def evaluate_subset(df, response: str, predictors: List[str], metric: Metric = "aic") -> float: """Fit OLS on a subset of predictors and return a scalar score.""" y = df[response] X = sm.add_constant(df[predictors]) fit = sm.OLS(y, X).fit() if metric == "aic": return fit.aic if metric == "bic": return fit.bic if metric == "adj_r2": # For consistency with AIC/BIC ("smaller is better"), we return # negative adjusted R^2 here. return -fit.rsquared_adj if metric == "loocv_rmse": # Use scikit-learn with the *same* predictors X_np = df[predictors].to_numpy() y_np = y.to_numpy() return loocv_rmse(X_np, y_np) raise ValueError(f"Unknown metric: {metric}") 16.2.1 Backward selection ~~~~~~~~~~~~~~~~~~~~~~~~~~ **Backward selection** starts from the **full model** and repeatedly **drops** one predictor at a time if it improves the criterion. Skeleton implementation: .. code-block:: python def backward_selection(df, response: str, candidates: List[str], metric: Metric = "aic"): remaining = list(candidates) best_score = evaluate_subset(df, response, remaining, metric) improved = True while improved and len(remaining) > 1: improved = False scores = [] for predictor in remaining: trial = [p for p in remaining if p != predictor] score = evaluate_subset(df, response, trial, metric) scores.append((score, predictor, trial)) # Pick the *best* (smallest score) among the candidates score, predictor_to_drop, trial = min(scores, key=lambda t: t[0]) if score + 1e-8 < best_score: # small tolerance best_score = score remaining = trial improved = True return remaining, best_score Example (using the seat position data): .. code-block:: python seat_df = pd.read_csv("data/seatpos.csv") predictors = ["Age", "Weight", "HtShoes", "Ht", "Seated", "Arm", "Thigh", "Leg"] selected_aic, score_aic = backward_selection( seat_df, "hipcenter", predictors, metric="aic" ) selected_bic, score_bic = backward_selection( seat_df, "hipcenter", predictors, metric="bic" ) print("Backward AIC:", selected_aic) print("Backward BIC:", selected_bic) Typically: * AIC keeps **more** predictors. * BIC keeps a **smaller** set with nearly as good predictive performance. 16.2.2 Forward selection ~~~~~~~~~~~~~~~~~~~~~~~~~ **Forward selection** starts from the **intercept-only** model and repeatedly **adds** the predictor that most improves the criterion. .. code-block:: python def forward_selection(df, response: str, candidates: List[str], metric: Metric = "aic"): remaining = list(candidates) selected: List[str] = [] best_score = evaluate_subset(df, response, selected or [remaining[0]], metric) improved = True while improved and remaining: improved = False scores = [] for predictor in remaining: trial = selected + [predictor] score = evaluate_subset(df, response, trial, metric) scores.append((score, predictor, trial)) score, predictor_to_add, trial = min(scores, key=lambda t: t[0]) if score + 1e-8 < best_score: best_score = score selected = trial remaining.remove(predictor_to_add) improved = True return selected, best_score Forward and backward often find *similar* but not identical models. 16.2.3 Stepwise selection ~~~~~~~~~~~~~~~~~~~~~~~~~~ **Stepwise** selection looks **both directions** at each step: * try adding each unused predictor; * also try dropping each currently-included predictor; * commit the change (add or drop) that best improves the criterion. Conceptually it is: .. code-block:: python def stepwise_selection(df, response: str, candidates: List[str], metric: Metric = "aic"): selected: List[str] = [] best_score = float("inf") while True: changed = False # 1. Try adding add_scores = [] for p in candidates: if p in selected: continue trial = selected + [p] score = evaluate_subset(df, response, trial, metric) add_scores.append((score, ("add", p), trial)) # 2. Try dropping drop_scores = [] for p in selected: trial = [q for q in selected if q != p] if not trial: continue score = evaluate_subset(df, response, trial, metric) drop_scores.append((score, ("drop", p), trial)) all_scores = add_scores + drop_scores if not all_scores: break score, (action, p), trial = min(all_scores, key=lambda t: t[0]) if score + 1e-8 < best_score: best_score = score selected = trial changed = True if not changed: break return selected, best_score Stepwise tends to behave like a hybrid between forward and backward. 16.2.4 Exhaustive search (all subsets) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ For **small** numbers of predictors (:math:`p \le 10` or so), you can simply check **all** subsets and pick the best according to a given criterion. .. code-block:: python def exhaustive_search(df, response: str, candidates: List[str], metric: Metric = "aic"): best_score = float("inf") best_subset: List[str] | None = None for k in range(1, len(candidates) + 1): for subset in itertools.combinations(candidates, k): subset = list(subset) score = evaluate_subset(df, response, subset, metric) if score < best_score: best_score = score best_subset = subset return best_subset, best_score This mirrors the R chapter’s use of ``regsubsets()`` from the **leaps** package. 16.3 Example: seat position (AIC vs BIC vs LOOCV) -------------------------------------------------- In the R version, the **seatpos** data from the ``faraway`` package is used to compare models for the response ``hipcenter``. Assume we have ``data/seatpos.csv`` with the following columns: * ``hipcenter`` – driver hip center (response), * ``Age``, ``Weight``, ``HtShoes``, ``Ht``, ``Seated``, ``Arm``, ``Thigh``, ``Leg`` – body measurements. A typical workflow in Python: #. Fit the **full** model with all eight predictors. #. Run backward selection with **AIC** and **BIC**. #. Compare adjusted :math:`R^2` and LOOCV RMSE across the full and selected models. Sketch: .. code-block:: python full_predictors = ["Age", "Weight", "HtShoes", "Ht", "Seated", "Arm", "Thigh", "Leg"] # Full model y = seat_df["hipcenter"] X_full = sm.add_constant(seat_df[full_predictors]) full_fit = sm.OLS(y, X_full).fit() # Backward AIC / BIC back_aic, _ = backward_selection(seat_df, "hipcenter", full_predictors, "aic") back_bic, _ = backward_selection(seat_df, "hipcenter", full_predictors, "bic") # Fit selected models X_aic = sm.add_constant(seat_df[back_aic]) X_bic = sm.add_constant(seat_df[back_bic]) fit_aic = sm.OLS(y, X_aic).fit() fit_bic = sm.OLS(y, X_bic).fit() # Compare criteria print("Full adj R^2:", full_fit.rsquared_adj) print("AIC adj R^2:", fit_aic.rsquared_adj) print("BIC adj R^2:", fit_bic.rsquared_adj) print("Full LOOCV RMSE:", loocv_rmse(X_full.iloc[:, 1:].to_numpy(), y.to_numpy())) print("AIC LOOCV RMSE:", loocv_rmse(X_aic.iloc[:, 1:].to_numpy(), y.to_numpy())) print("BIC LOOCV RMSE:", loocv_rmse(X_bic.iloc[:, 1:].to_numpy(), y.to_numpy())) Typically you will see: * both AIC and BIC improve **adjusted :math:`R^2`** and **LOOCV RMSE** compared to the full model; * BIC selects a **smaller** subset with only a small loss (or even a gain) in predictive performance. 16.4 Higher-order terms: Auto MPG example ----------------------------------------- The final part of the R chapter returns to the **Auto MPG** data and considers: * quadratic terms such as :math:`\mathrm{disp}^2`, :math:`\mathrm{hp}^2`, :math:`\mathrm{wt}^2`, :math:`\mathrm{acc}^2`, * all two-way interactions between first-order terms. The idea is to show that: * we can start with a **very rich model** (many terms), * then use backward selection (often with BIC) to find a smaller model that still predicts well, * and compare LOOCV RMSE between the full and selected models. In Python we can use **formula syntax** with statsmodels: .. code-block:: python import numpy as np import statsmodels.formula.api as smf auto_df = pd.read_csv("data/autompg.csv") # Log-transform the response as in the R notes auto_df["log_mpg"] = np.log(auto_df["mpg"]) big_formula = ( "log_mpg ~ (cyl + disp + hp + wt + acc + year + domestic) ** 2" " + I(disp ** 2) + I(hp ** 2) + I(wt ** 2) + I(acc ** 2)" ) big_fit = smf.ols(big_formula, data=auto_df).fit() big_fit.summary() Now apply backward selection on this rich model, using BIC or AIC as the criterion. The search logic is identical; the only difference is that the candidate “predictors” are now terms from the formula (main effects, quadratics, interactions) rather than raw columns. Python does not automatically enforce **hierarchy** (keeping main effects whenever an interaction is included), so in practice you may want to: * define groups of terms that must stay together, * or rely on domain knowledge to simplify the final model. The main lesson: *Sometimes a slightly richer model (interactions, quadratic terms) combined with a strong penalty (BIC or cross-validation) gives the best predictive performance.* 16.5 Explanation versus prediction ---------------------------------- A central theme of the R chapter is the distinction between models used for: * **Explanation** – understanding how predictors relate to the response. * **Prediction** – making accurate forecasts for new observations. Explanation ~~~~~~~~~~~ For explanation we care about: * **Interpretability**: How does each predictor affect the response, holding others fixed? * **Parsimony**: Smaller models are easier to explain and reason about. * **Causality vs association**: Does a predictor *cause* changes in the response, or is it just correlated? We normally: * favor **BIC** or adjusted :math:`R^2` to penalize complexity, * inspect diagnostic plots (from previous chapters), * remember that with **observational data** we detect **associations**, not necessarily **causal effects**. Prediction ~~~~~~~~~~ For pure prediction we care about: * **out-of-sample error** (cross-validated RMSE), * robustness to new data. We are less worried about: * exact p-values, * whether effects are causal, * whether the model is small. A model can predict well even if it is not a realistic description of the data-generating process. A classic analogy: **shoe size** and **reading level** in children are highly correlated (both driven by age). Shoe size does not *cause* reading ability, but knowing shoe size still helps predict reading level. For prediction that is fine; for explanation it is misleading. 16.6 How this connects to PyStatsV1 ----------------------------------- In PyStatsV1 we will continually revisit **model selection**: * Choosing **which predictors to include** in a regression or GLM. * Comparing different model families (linear vs logistic vs Poisson). * Balancing **interpretability** against **predictive performance**. This chapter gives you a toolkit that we can plug into many later examples: * computing AIC / BIC / adjusted :math:`R^2`, * using cross-validation to estimate predictive error, * exploring backward / forward / stepwise / exhaustive search. In applied case studies we will often start from a **rich model** suggested by subject-matter knowledge, then use these tools to trim it down to something: * scientifically interpretable, * numerically stable, * and reasonably accurate for prediction. 16.7 What you should take away ------------------------------ By the end of this chapter (and its R + Python versions), you should be able to: * Explain why **plain :math:`R^2` and RMSE** are not enough for model comparison. * Compute and interpret: * **AIC** and **BIC**, * **adjusted :math:`R^2`**, * **cross-validated RMSE** (LOOCV or K-fold). * Implement or adapt **search procedures** in Python: * backward selection, * forward selection, * stepwise selection, * exhaustive search for small :math:`p`. * Use these procedures on real data sets (like **seat position** and **Auto MPG**) to find reasonable models. * Distinguish between models aimed at **explanation** and those aimed at **prediction**, and choose criteria that match your goal. * Recognize that: * strong penalties (BIC, adjusted :math:`R^2`) often yield compact, interpretable models, * cross-validation is the gold standard for estimating predictive error, * observational data supports **association**, not automatic causal claims. In later PyStatsV1 chapters, these model-selection tools will support both: * **observational** regression analyses, and * upcoming material on **experimental design**, where we can combine good designs with good models to make stronger causal statements.