Applied Statistics with Python – Chapter 10
===========================================

Model building: explanation and prediction
------------------------------------------

In earlier chapters we focused on **fitting a single model**:

* simple linear regression (Chapters 7–8),
* multiple linear regression (Chapter 9).

Here we step back and ask a bigger question:

.. rubric:: How do we choose *which* model to use?

We will:

* separate the ideas of **family**, **form**, and **fit**,
* distinguish between models aimed at **explanation** vs **prediction**,
* see how **overfitting** and **train–test splits** enter the picture.

Throughout, you can imagine the familiar Auto MPG example:

* response :math:`y` = miles per gallon (``mpg``),
* predictors :math:`x_1, x_2, \dots` = car attributes (weight, horsepower, …).

10.1 Family, form, and fit
--------------------------

When we say "build a model", there are really *three* choices hiding inside:

1. **Family** – the broad class of models we are willing to consider.
2. **Form** – the specific predictors and transformations included.
3. **Fit** – the numerical values of the parameters, estimated from data.

We will mostly stay inside one family:

.. rubric:: Family: linear models

.. math::

   y = \beta_0 + \beta_1 x_1 + \cdots + \beta_{p-1} x_{p-1} + \varepsilon,

with :math:`\varepsilon` capturing noise or unexplained variation.

Other families exist (trees, smoothers, neural nets), but linear models are:

* the **standard starting point**,
* easy to fit and interpret,
* an excellent gateway to more advanced methods.

10.1.1 Fit
~~~~~~~~~~

Suppose we choose a simple form with one predictor:

.. math::

   y = \beta_0 + \beta_1 x_1 + \varepsilon.

To **fit** this model in Python we choose a loss function and minimize it.
In this course we almost always use **least squares**:

.. math::

   \min_{\beta_0, \beta_1} \sum_{i=1}^n \left(y_i - (\beta_0 + \beta_1 x_{1i})\right)^2.

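
As a minimal sketch of what this minimization does, the snippet below solves the one-predictor least-squares problem in closed form with NumPy and cross-checks the result against ``np.linalg.lstsq``. The data are synthetic, and the names ``x1``, ``y``, ``b0_hat``, ``b1_hat``, and ``sse`` are illustrative assumptions rather than part of the Auto MPG example.

```python
import numpy as np

# Synthetic data (an assumption for illustration): y = 2 + 3 * x1 + noise
rng = np.random.default_rng(0)
n = 200
x1 = rng.uniform(0, 10, size=n)
y = 2.0 + 3.0 * x1 + rng.normal(scale=1.0, size=n)

# Closed-form least-squares estimates for a single predictor
x_bar, y_bar = x1.mean(), y.mean()
b1_hat = np.sum((x1 - x_bar) * (y - y_bar)) / np.sum((x1 - x_bar) ** 2)
b0_hat = y_bar - b1_hat * x_bar

# Cross-check with NumPy's general least-squares solver
X = np.column_stack([np.ones(n), x1])  # column of ones for the intercept
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)


def sse(b0, b1):
    """Sum of squared errors: the quantity that least squares minimizes."""
    return np.sum((y - (b0 + b1 * x1)) ** 2)


print("closed form:", b0_hat, b1_hat)
print("lstsq:      ", beta_hat)
```

Moving either estimate away from its fitted value, for example ``sse(b0_hat + 0.1, b1_hat)``, can only increase the sum of squared errors.
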

In practice:

* with :mod:`statsmodels`, this is done by ``smf.ols(...).fit()``;
* with :mod:`sklearn`, by ``LinearRegression().fit(X, y)``.

The result is a **fitted model**:

.. math::

   \hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1,

which we can use for **interpretation** or **prediction**.

10.1.2 Form
~~~~~~~~~~~

The **form** of a linear model is determined by:

* which predictors are included,
* which transformations and interactions we use.

Examples, using ``mpg`` as the response:

* Simple linear regression:

  .. math::

     \text{mpg} = \beta_0 + \beta_1 \,\text{weight} + \varepsilon.

* Multiple linear regression:

  .. math::

     \text{mpg} = \beta_0 + \beta_1 \,\text{weight} + \beta_2 \,\text{horsepower}
                  + \beta_3 \,\text{year} + \varepsilon.

* Model with a transformation and an interaction:

  .. math::

     \text{mpg} = \beta_0 + \beta_1 \,\text{weight} + \beta_2 \,\text{weight}^2
                  + \beta_3 \,\text{year} + \beta_4 \,\text{weight} \times \text{year}
                  + \varepsilon.

All of these are still **linear models**: linear in the parameters :math:`\beta_j`.

The form controls *flexibility*:

* more predictors and terms → more flexibility,
* but also more risk of **overfitting** and harder interpretation.

10.1.3 Family
~~~~~~~~~~~~~

The **family** is the broad modeling approach. Some examples:

* linear regression,
* generalized linear models (logistic, Poisson, …),
* non-parametric smoothers,
* trees and ensembles (random forests, boosting).

In this mini-textbook we focus on the **linear regression family** because:

* it is the standard tool for many applied problems,
* it has a rich theory of **inference** (standard errors, t-tests, F-tests),
* many ideas (design matrices, loss functions, regularization) carry directly
  into more advanced models.

You should keep a mental picture:

* **family** = which toolbox?
* **form** = which tools from that box?
* **fit** = how we use data to tune the tools (estimate parameters).

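
To make the three-way distinction concrete, the following sketch keeps the family fixed (linear regression fitted by least squares) and varies only the form by changing the columns of the design matrix. The data are synthetic stand-ins for weight, year, and mpg, not the real Auto MPG data, and the helper ``fit_ols`` is a hypothetical name introduced here.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 300

# Synthetic stand-ins (assumptions, not the real Auto MPG data)
weight = rng.uniform(1.5, 5.0, size=n)             # thousands of pounds
year = rng.integers(70, 83, size=n).astype(float)  # model year
mpg = (10 - 12 * weight + 1.2 * weight**2 + 0.4 * year
       + rng.normal(scale=2.0, size=n))


def fit_ols(X, y):
    """Fit: least-squares estimates and training SSE for design matrix X."""
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    sse = float(np.sum((y - X @ beta_hat) ** 2))
    return beta_hat, sse


ones = np.ones(n)

# Form: each design matrix encodes a different choice of terms.
# Every model is still linear in its parameters.
forms = {
    "weight": np.column_stack([ones, weight]),
    "weight + year": np.column_stack([ones, weight, year]),
    "weight + weight^2 + year + weight*year": np.column_stack(
        [ones, weight, weight**2, year, weight * year]
    ),
}

results = {name: fit_ols(X, mpg) for name, X in forms.items()}
for name, (beta_hat, sse) in results.items():
    print(f"{name}: {len(beta_hat)} parameters, training SSE = {sse:.1f}")
```

Because each form here nests the previous one, the training SSE cannot increase as terms are added. That is a property of least squares, not evidence that the larger form is the better model.
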

10.1.4 Assumed model vs fitted model
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

When we write a formula like

.. math::

   \text{mpg} = \beta_0 + \beta_1 \,\text{weight} + \beta_2 \,\text{horsepower} + \varepsilon,

we are specifying the **assumed model**:

* linear family,
* particular form (which variables and interactions),
* often with additional assumptions about :math:`\varepsilon`
  (e.g. Normal errors with constant variance).

After fitting, we obtain a **fitted model** such as:

.. math::

   \widehat{\text{mpg}} = 46.2 - 3.1 \,\text{weight} - 0.02 \,\text{horsepower}.

Important:

* Fitting only gives the **best model within the chosen form**.
* If the family or form is poorly chosen, even a perfectly fitted model can be misleading.

10.2 Explanation versus prediction
----------------------------------

Why are we building a model?

* To **explain** how predictors relate to the response?
* Or to **predict** future responses as accurately as possible?

The distinction matters. The modeling steps can look similar, but:

* For **explanation**, we prioritize *interpretability* and valid inference.
* For **prediction**, we prioritize *accuracy on new data* and resistance to overfitting.

10.2.1 Explanation
~~~~~~~~~~~~~~~~~~

For explanation we want models that are:

* **small** – using as few predictors as reasonably possible,
* **interpretable** – each coefficient has a clear story,
* **well-behaved** – assumptions are at least approximately satisfied.

In linear regression, we often:

* start from a **full model** with many predictors,
* use:

  * t-tests for individual coefficients,
  * F-tests / ANOVA for comparing nested models,
  * residual plots to check model assumptions,

* gradually simplify to a **parsimonious model** that still fits well.

Example goals for the Auto MPG data:

* quantify how **weight** and **year** relate to fuel efficiency,
* understand which car attributes matter *most*,
* communicate results to non-statisticians.


Here, even if a larger model slightly improves prediction, we may prefer a
**simpler model** that tells a clearer story.

10.2.1.1 Correlation and causation
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

A crucial warning for explanatory models:

.. rubric:: Correlation does not imply causation.

Linear models detect **associations** between variables. They do *not* prove
that one variable *causes* another.

* Observational data (like Auto MPG) can show that higher horsepower is
  associated with lower fuel efficiency.
* But this does not prove that "increasing horsepower by 10 automatically
  reduces mpg by 3" in a causal sense.

To argue for causation we usually need:

* a carefully designed **experiment**,
* or strong subject-matter reasoning and supporting evidence.

In PyStatsV1, we will often treat our models as tools for **description** and
**exploration**, with appropriate caution about causal claims.

10.2.2 Prediction
~~~~~~~~~~~~~~~~~

For prediction, the priority is different:

* We care about how well the model predicts **new, unseen data**.
* We are less concerned with:

  * whether each coefficient is statistically significant,
  * whether the model is easy to explain in words.

We need a **numerical measure of prediction error**. A common choice is
root mean squared error (RMSE):

.. math::

   \text{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^n \left(y_i - \hat{y}_i\right)^2}.

In Python, given arrays ``y_true`` and ``y_pred``:

.. code-block:: python

   import numpy as np

   def rmse(y_true, y_pred):
       """Root mean squared error between observed and predicted values."""
       return np.sqrt(np.mean((y_true - y_pred) ** 2))

Lower RMSE means better predictive performance on the data we are evaluating.

10.2.2.1 Train–test split and overfitting
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

A key problem in predictive modeling is **overfitting**:

* A very flexible model can track the noise in the training data.
* It will have **low error on the data it saw**, but **high error on new data**.

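
A small simulation can make this concrete. The sketch below fits polynomials of increasing degree to 30 synthetic points and then evaluates each fit on a large fresh sample from the same process. The data-generating function ``simulate``, the degrees tried, and the noise level are all assumptions chosen for illustration; the ``rmse`` helper is repeated so the block runs on its own.

```python
import numpy as np
from numpy.polynomial import Polynomial


def rmse(y_true, y_pred):
    """Root mean squared error between observed and predicted values."""
    return np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2))


rng = np.random.default_rng(42)


def simulate(n):
    """Draw n points from a hypothetical process: smooth signal plus noise."""
    x = rng.uniform(-1, 1, size=n)
    y = np.sin(3 * x) + rng.normal(scale=0.3, size=n)
    return x, y


x_seen, y_seen = simulate(30)    # data the model is fitted to
x_new, y_new = simulate(1000)    # fresh data from the same process

errors = {}
for degree in (1, 3, 15):
    # Least-squares polynomial fit; a higher degree means a more flexible form
    poly = Polynomial.fit(x_seen, y_seen, deg=degree)
    errors[degree] = (rmse(y_seen, poly(x_seen)), rmse(y_new, poly(x_new)))
    print(f"degree {degree:2d}: RMSE on seen data = {errors[degree][0]:.3f}, "
          f"RMSE on new data = {errors[degree][1]:.3f}")
```

The error on the seen data can only go down as the degree grows, because each polynomial family contains the previous one. The error on the new data has no such guarantee, and for the most flexible fit it is typically much worse than the error on the data the model saw.
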

To detect overfitting, we would ideally evaluate the model on fresh data.
Since extra data rarely appears by magic, we mimic the "magic extra data"
thought experiment by splitting the data we have:

* **training set** – used to fit the model,
* **test set** – held out and only used to evaluate predictions.

In code, using scikit-learn:

.. code-block:: python

   from sklearn.model_selection import train_test_split
   from sklearn.linear_model import LinearRegression
   import numpy as np

   # Assumes mpg_df is a cleaned DataFrame (no missing values) and that
   # rmse() is the helper defined in Section 10.2.2.
   X = mpg_df[["weight", "horsepower", "year"]].to_numpy()
   y = mpg_df["mpg"].to_numpy()

   # Hold out 30% of the rows; random_state makes the split reproducible.
   X_train, X_test, y_train, y_test = train_test_split(
       X, y, test_size=0.3, random_state=42
   )

   # Fit on the training rows only.
   model = LinearRegression().fit(X_train, y_train)

   y_train_pred = model.predict(X_train)
   y_test_pred = model.predict(X_test)

   train_rmse = rmse(y_train, y_train_pred)
   test_rmse = rmse(y_test, y_test_pred)

   print(train_rmse, test_rmse)

Typical pattern:

* As we add predictors and complexity:

  * **train RMSE** almost always decreases,
  * **test RMSE** may first decrease, then increase once we overfit.

* The best **predictive** model is often the one with the **lowest test RMSE**,
  even if it is not the largest or most complex.

10.3 What you should take away
------------------------------

By the end of this chapter (and its R + Python versions), you should be able to:

* distinguish clearly between:

  * **family** of models,
  * **form** of a model,
  * **fit** of a model;

* explain the difference between models aimed at:

  * **explanation** – small, interpretable, inference-friendly,
  * **prediction** – chosen to minimize error on new data;

* understand why:

  * linear models are often the **first choice**,
  * more complex models can **overfit**;

* compute and interpret:

  * **RMSE**, and
  * **train vs test** prediction error;

* describe why a train–test split is essential for honest assessment;
* articulate the warning:

  * "Correlation does not imply causation" in the context of regression.


10.4 How this connects to PyStatsV1
-----------------------------------

In PyStatsV1 you will see these ideas used repeatedly:

* **Explanatory models**

  * ``statsmodels`` regressions with detailed summaries,
  * ANOVA tables and F-tests for comparing nested models,
  * clean, compact models that are easy to discuss in class.

* **Predictive checks**

  * simple train–test splits for case studies,
  * side-by-side train vs test RMSE,
  * examples where a smaller model outperforms a more complex one on
    held-out data.

As you work through the code in later chapters, keep asking:

* "Am I trying to **explain** or **predict** here?"
* "Have I thought about **family**, **form**, and **fit** separately?"

That habit will pay off in any future modeling you do, whether with linear
models, machine learning methods, or more advanced tools.