.. _applied_stats_with_python_ch9_multiple_linear_regression:

Applied Statistics with Python – Chapter 9
==========================================

Multiple linear regression
--------------------------

This chapter mirrors the **“Multiple Linear Regression”** chapter from the
original R notes. The statistical ideas are the same, but we’ll express them in
Python-first language and keep R alongside as a translation layer.

By the end of this chapter you should be comfortable with:

* Building and interpreting linear regression models with **more than one
  predictor**.
* Understanding the **matrix formulation** of regression.
* Using summary output to:

  * obtain **standard errors** and **t-tests** for each coefficient,
  * construct **confidence intervals** and **prediction intervals**,
  * run **global F-tests** for the significance of regression,
  * and compare **nested models** with ANOVA F-tests.

* Connecting these ideas back to simulation: checking a theoretical sampling
  distribution with code.

Throughout we will:

* Treat **R**’s ``lm()`` as the reference implementation, and
* Show Python equivalents using ``pandas`` and ``statsmodels``.

9.1 From simple to multiple regression
--------------------------------------

In simple linear regression (SLR) we had one predictor:

.. math::

   Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \qquad
   \varepsilon_i \sim N(0, \sigma^2).

Graphically, this is a **line** through a cloud of points.

In multiple linear regression (MLR) we allow several predictors:

.. math::

   Y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots
         + \beta_{p-1} x_{i,p-1} + \varepsilon_i.

For two predictors you can picture a **plane** in 3-D space; for more predictors,
geometry is harder to visualize but the algebra scales cleanly.

Key idea: each :math:`\beta_j` is a **partial slope**:

* :math:`\beta_j` measures the change in the mean response for a one-unit change
  in :math:`x_j`, **holding all other predictors fixed**.

That “holding others fixed” interpretation is new in MLR and central to reading
regression output correctly.

9.2 Auto MPG example
--------------------

We’ll again use a car-fuel-efficiency dataset. In the R notes it’s read directly
from the UCI Machine Learning Repository and cleaned into a data frame
``autompg`` with columns:

* ``mpg`` – miles per gallon (response),
* ``cyl`` – number of cylinders,
* ``disp`` – engine displacement,
* ``hp`` – horsepower,
* ``wt`` – weight,
* ``acc`` – acceleration,
* ``year`` – model year.

In PyStatsV1, you can either:

* read the data from a local CSV we include, or
* follow the R code closely by reading from the URL and cleaning in Python.

A minimal Python version might look like:

.. code-block:: python

   import pandas as pd

   url = (
       "http://archive.ics.uci.edu/ml/machine-learning-databases/"
       "auto-mpg/auto-mpg.data"
   )

   cols = ["mpg", "cyl", "disp", "hp", "wt", "acc", "year", "origin", "name"]
   autompg = pd.read_csv(
       url,
       delim_whitespace=True,
       names=cols,
       na_values="?",
       comment="#",
   )

   # Drop rows with missing horsepower and the Plymouth Reliant row,
   # then drop unused columns to match the R notes
   autompg = autompg.dropna(subset=["hp"])
   autompg = autompg[autompg["name"] != "plymouth reliant"]
   autompg = autompg[["mpg", "cyl", "disp", "hp", "wt", "acc", "year"]]
   autompg["hp"] = autompg["hp"].astype(float)

   autompg.info()

This gets us to the same cleaned structure as in R.

9.3 Fitting a multiple regression model
---------------------------------------

We’ll start with a two-predictor model: mpg as a function of weight and year.

Model:

.. math::

   Y_i = \beta_0 + \beta_1 \text{wt}_i + \beta_2 \text{year}_i + \varepsilon_i.

R version
^^^^^^^^^

.. code-block:: r

   mpg_model <- lm(mpg ~ wt + year, data = autompg)
   summary(mpg_model)

Python version (statsmodels)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

We’ll use the formula API, which understands ``"mpg ~ wt + year"`` like ``lm()``.

.. code-block:: python

   import statsmodels.formula.api as smf

   mpg_model = smf.ols("mpg ~ wt + year", data=autompg).fit()
   print(mpg_model.summary())

You should see:

* estimates for ``Intercept``, ``wt``, and ``year``,
* standard errors and t-values,
* residual standard error (``sigma``),
* :math:`R^2` and adjusted :math:`R^2`,
* and an F-statistic testing whether the regression as a whole is useful.

Interpreting the coefficients
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Using the R output notation, suppose we get:

* :math:`\hat\beta_0 \approx -14.6`
* :math:`\hat\beta_1 \approx -0.0066` for weight
* :math:`\hat\beta_2 \approx 0.76` for year

Interpretations:

* **Intercept**: expected mpg for a car of weight 0 made in year 0. This is
  physically meaningless, but needed algebraically. Intercepts often have poor
  real-world meaning but are still part of the model.

* **Weight slope**: for a one-unit increase in weight (one pound in these data),
  expected mpg **decreases** by about 0.0066, **holding model year fixed**.

* **Year slope**: for a one-unit increase in model year (one year newer),
  expected mpg **increases** by about 0.76, **holding weight fixed**.

Notice how every slope interpretation now includes “for a fixed value of the
other predictor”.

9.4 Matrix formulation of regression
------------------------------------

Matrix notation lets us handle any number of predictors cleanly.

We stack the response values into a vector :math:`Y`, the predictors into a
design matrix :math:`X`, the coefficients into a vector :math:`\beta`, and the
errors into a vector :math:`\varepsilon`:

.. math::

   Y = X \beta + \varepsilon.

* :math:`Y` is :math:`n \times 1`,
* :math:`X` is :math:`n \times p` (first column all ones for the intercept),
* :math:`\beta` is :math:`p \times 1`,
* :math:`\varepsilon` is :math:`n \times 1`.

The least squares estimate of :math:`\beta` is:

.. math::

   \hat\beta = (X^\top X)^{-1} X^\top y.

The fitted values and residuals are:

.. math::

   \hat y = X \hat\beta, \qquad e = y - \hat y.

The residual variance estimate (mean squared error) is:

.. math::

   s_e^2 = \frac{e^\top e}{n - p},

where :math:`n` is the sample size and :math:`p` is the number of :math:`\beta`
parameters (including the intercept). Compare this to SLR:

* SLR had :math:`p = 2` and denominator :math:`n - 2`,
* here we allow general :math:`p`, so the denominator is :math:`n - p`.

Python check
^^^^^^^^^^^^

We can verify that ``statsmodels`` is doing exactly this:

.. code-block:: python

   import numpy as np

   y = autompg["mpg"].to_numpy()
   X = np.column_stack([
       np.ones(len(autompg)),   # intercept
       autompg["wt"].to_numpy(),
       autompg["year"].to_numpy(),
   ])

   XtX_inv = np.linalg.inv(X.T @ X)
   beta_hat = XtX_inv @ X.T @ y

   print(beta_hat)                 # matches mpg_model.params

   y_hat = X @ beta_hat
   e = y - y_hat
   n, p = X.shape
   s_e = np.sqrt((e @ e) / (n - p))
   print(s_e, mpg_model.mse_resid**0.5)

9.5 Sampling distribution of :math:`\hat\beta`
----------------------------------------------

Under the normal-error assumptions of the linear model:

* :math:`\varepsilon \sim N(0, \sigma^2 I)`,

we can show (using multivariate normal theory) that:

.. math::

   \hat\beta \sim N\!\bigl(\beta,\, \sigma^2 (X^\top X)^{-1}\bigr).

Consequences:

* :math:`E[\hat\beta] = \beta` – each coefficient estimate is **unbiased**.
* The covariance matrix of :math:`\hat\beta` is :math:`\sigma^2 (X^\top X)^{-1}`.
  The diagonal elements give the variances of individual coefficients.

If we call :math:`C = (X^\top X)^{-1}`, then

.. math::

   \operatorname{Var}(\hat\beta_j) = \sigma^2 C_{jj},
   \qquad
   SE(\hat\beta_j) = s_e \sqrt{C_{jj}}.

Each coefficient has a t-distribution after standardization:

.. math::

   \frac{\hat\beta_j - \beta_j}{s_e \sqrt{C_{jj}}}
   \sim t_{n-p}.

In practice we rarely compute :math:`C` by hand; we read standard errors from
software.

* R: ``summary(mpg_model)$coef``
* Python: ``mpg_model.params`` and ``mpg_model.bse``

9.6 Testing individual coefficients
-----------------------------------

To test whether a specific predictor is useful in the presence of the others, we
typically test

.. math::

   H_0: \beta_j = 0 \quad \text{vs} \quad H_1: \beta_j \neq 0.

Test statistic:

.. math::

   t = \frac{\hat\beta_j - 0}{SE(\hat\beta_j)}
     = \frac{\hat\beta_j}{s_e \sqrt{C_{jj}}}
   \sim t_{n-p} \quad \text{under } H_0.

R version
^^^^^^^^^

.. code-block:: r

   summary(mpg_model)$coef

The ``Estimate``, ``Std. Error``, ``t value``, and ``Pr(>|t|)`` columns give
exactly this test.

Python version
^^^^^^^^^^^^^^

.. code-block:: python

   mpg_model.params      # estimates
   mpg_model.bse         # standard errors
   mpg_model.tvalues     # t statistics
   mpg_model.pvalues     # two-sided p-values

If the p-value for ``wt`` is tiny, we reject :math:`H_0: \beta_{\text{wt}} = 0` and
conclude weight is a useful predictor, *given that year is already in the model*.

This “given that …” interpretation is crucial: the t-test evaluates a coefficient
**conditional on the other predictors in the model**, not in isolation.

9.7 Confidence intervals for coefficients and mean response
-----------------------------------------------------------

Intervals for coefficients
^^^^^^^^^^^^^^^^^^^^^^^^^^

From the sampling distribution we get a :math:`(1-\alpha)` confidence interval:

.. math::

   \hat\beta_j \pm t_{\alpha/2,\, n-p} \cdot SE(\hat\beta_j).

R:

.. code-block:: r

   confint(mpg_model, level = 0.99)

Python:

.. code-block:: python

   mpg_model.conf_int(alpha=0.01)

Intervals for mean response
^^^^^^^^^^^^^^^^^^^^^^^^^^^

Suppose we want the mean mpg for cars with specified predictors
:math:`x_0 = (1, x_{01}, \dots, x_{0,p-1})^\top`. The fitted mean is

.. math::

   \hat y(x_0) = x_0^\top \hat\beta.

Its standard error is

.. math::

   SE\bigl(\hat y(x_0)\bigr)
   = s_e \sqrt{x_0^\top (X^\top X)^{-1} x_0}.

A :math:`(1-\alpha)` confidence interval is

.. math::

   \hat y(x_0) \pm t_{\alpha/2,\, n-p} \cdot
   s_e \sqrt{x_0^\top (X^\top X)^{-1} x_0}.

R:

.. code-block:: r

   new_cars <- data.frame(wt = c(3500, 5000),
                          year = c(76, 81))

   predict(mpg_model, newdata = new_cars,
           interval = "confidence", level = 0.99)

Python:

.. code-block:: python

   new_cars = pd.DataFrame({"wt": [3500, 5000],
                            "year": [76, 81]})

   pred = mpg_model.get_prediction(new_cars)
   pred.summary_frame(alpha=0.01)[["mean", "mean_ci_lower", "mean_ci_upper"]]

9.8 Prediction intervals
------------------------

Prediction intervals account for both:

* uncertainty in the mean, and
* variability of individual observations around that mean.

Standard error:

.. math::

   SE_{\text{pred}}\bigl(\hat y(x_0)\bigr)
   = s_e \sqrt{1 + x_0^\top (X^\top X)^{-1} x_0}.

Prediction interval:

.. math::

   \hat y(x_0) \pm t_{\alpha/2,\, n-p} \cdot
   s_e \sqrt{1 + x_0^\top (X^\top X)^{-1} x_0}.

R:

.. code-block:: r

   predict(mpg_model, newdata = new_cars,
           interval = "prediction", level = 0.99)

Python:

.. code-block:: python

   pred.summary_frame(alpha=0.01)[
       ["obs_ci_lower", "obs_ci_upper"]
   ]

Warning about extrapolation
^^^^^^^^^^^^^^^^^^^^^^^^^^^

With multiple predictors, extrapolation can be subtle. Each new point must be
reasonable **in the joint predictor space**, not just in each coordinate
separately. A car with an unusual combination of weight and year may be far
from the existing data cloud even if its weight and year separately look
“within range”.

Plotting predictors against each other and marking new points is a good
sanity check.

9.9 Significance of regression: global F-test
---------------------------------------------

In SLR, the t-test for the slope and the F-test for the model were equivalent.
In MLR they separate:

* Individual t-tests: “Is this particular predictor useful, given the others?”
* Global F-test: “Is **any** linear relationship present at all?”

Null hypothesis for the global F-test:

.. math::

   H_0: \beta_1 = \beta_2 = \cdots = \beta_{p-1} = 0.

Under :math:`H_0` the model reduces to

.. math::

   Y_i = \beta_0 + \varepsilon_i,

an intercept-only model. The F-statistic is based on the variance decomposition

.. math::

   \text{SST} = \text{SSReg} + \text{SSE},

and compares mean squares:

.. math::

   F =
   \frac{\text{SSReg} / (p-1)}{\text{SSE} / (n-p)}
   \sim F_{p-1,\, n-p} \quad \text{under } H_0.

Software gives this automatically.

R:

.. code-block:: r

   summary(mpg_model)$fstatistic  # F and df
   # or
   null_mpg_model <- lm(mpg ~ 1, data = autompg)
   anova(null_mpg_model, mpg_model)

Python:

.. code-block:: python

   mpg_model.fvalue, mpg_model.f_pvalue, mpg_model.df_model, mpg_model.df_resid

A tiny p-value says: “At least one predictor has a non-zero slope; the regression
as a whole explains a meaningful amount of variation.”

9.10 Nested model comparisons
-----------------------------

Often we want to compare two models where one is a **subset** of the other.

Example:

* **Reduced (null) model**: ``mpg ~ wt + year``
* **Full model**: ``mpg ~ wt + year + cyl + disp + hp + acc``

Here the reduced model is nested inside the full model. The null hypothesis is:

.. math::

   H_0: \beta_{\text{cyl}} = \beta_{\text{disp}} =
        \beta_{\text{hp}} = \beta_{\text{acc}} = 0.

We use an F-test based on the extra sum of squares explained by the full model.

R:

.. code-block:: r

   null_mpg_model <- lm(mpg ~ wt + year, data = autompg)
   full_mpg_model <- lm(mpg ~ wt + year + cyl + disp + hp + acc,
                        data = autompg)

   anova(null_mpg_model, full_mpg_model)

Python (statsmodels):

.. code-block:: python

   import statsmodels.api as sm

   null_mpg_model = smf.ols("mpg ~ wt + year", data=autompg).fit()
   full_mpg_model = smf.ols("mpg ~ wt + year + cyl + disp + hp + acc",
                            data=autompg).fit()

   sm.stats.anova_lm(null_mpg_model, full_mpg_model)

If the p-value is large, the extra predictors don’t improve the model
significantly given ``wt`` and ``year``.

The global significance-of-regression test from the previous section is a special
case where the reduced model is intercept-only.

9.11 Simulation: checking the sampling distribution
---------------------------------------------------

The R notes close with a simulation study that:

* fixes a design matrix :math:`X`,
* simulates many response vectors from a known linear model,
* refits the regression each time,
* and looks at the empirical distribution of one coefficient.

You can implement the same idea in Python. Sketch:

.. code-block:: python

   import numpy as np
   import pandas as pd
   import statsmodels.api as sm

   rng = np.random.default_rng(1337)

   n = 100
   x1 = np.linspace(1, 10, n)
   x2 = np.linspace(1, 10, n)[::-1]

   beta_true = np.array([5.0, -2.0, 6.0])
   sigma = 4.0

   X = np.column_stack([np.ones(n), x1, x2])

   num_sims = 10_000
   beta2_vals = np.empty(num_sims)

   for i in range(num_sims):
       eps = rng.normal(0, sigma, size=n)
       y = X @ beta_true + eps

       df = pd.DataFrame({"x1": x1, "x2": x2, "y": y})
       fit = smf.ols("y ~ x1 + x2", data=df).fit()
       beta2_vals[i] = fit.params["x2"]

   beta2_vals.mean(), beta2_vals.var()

You should find:

* the mean of ``beta2_vals`` is close to the true :math:`\beta_2`,
* the variance matches :math:`\sigma^2 C_{22}` where :math:`C = (X^\top X)^{-1}`,
* a histogram of ``beta2_vals`` overlaid with the corresponding Normal density
  looks very similar.

This is a concrete demonstration of the theoretical sampling distribution
results from Section 9.5.

9.12 What you should take away
------------------------------

By the end of this chapter (and its R + Python versions), you should be able to:

* Write down and interpret a **multiple linear regression model** with several
  predictors.
* Understand and use the **matrix formulation**:

  * :math:`Y = X \beta + \varepsilon`,
  * :math:`\hat\beta = (X^\top X)^{-1} X^\top y`,
  * residual variance :math:`s_e^2 = e^\top e / (n - p)`.

* Read software output to obtain:

  * coefficient estimates, standard errors, t-tests, and p-values,
  * confidence intervals for coefficients and for mean responses,
  * prediction intervals for new observations,
  * global F-tests for significance of regression,
  * and F-tests comparing **nested models**.

* Recognize that:

  * coefficient interpretations are **conditional** (“holding others fixed”),
  * extrapolation in multiple dimensions can be subtle,
  * and simulation is a powerful way to check theoretical results.

In later PyStatsV1 chapters, these tools will underpin more advanced models
(logistic regression, models with interactions, etc.) and many of the case
studies.