Applied Statistics with Python – Chapter 9

Multiple linear regression

This chapter mirrors the “Multiple Linear Regression” chapter from the original R notes. The statistical ideas are the same, but we’ll express them in Python-first language and keep R alongside as a translation layer.

By the end of this chapter you should be comfortable with:

  • Building and interpreting linear regression models with more than one predictor.

  • Understanding the matrix formulation of regression.

  • Using summary output to:

    • obtain standard errors and t-tests for each coefficient,

    • construct confidence intervals and prediction intervals,

    • run global F-tests for the significance of regression,

    • and compare nested models with ANOVA F-tests.

  • Connecting these ideas back to simulation: checking a theoretical sampling distribution with code.

Throughout we will:

  • Treat R’s lm() as the reference implementation, and

  • Show Python equivalents using pandas and statsmodels.

9.1 From simple to multiple regression

In simple linear regression (SLR) we had one predictor:

\[Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \qquad \varepsilon_i \sim N(0, \sigma^2).\]

Graphically, this is a line through a cloud of points.

In multiple linear regression (MLR) we allow several predictors:

\[Y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_{p-1} x_{i,p-1} + \varepsilon_i.\]

For two predictors you can picture a plane in 3-D space; for more predictors, geometry is harder to visualize but the algebra scales cleanly.

Key idea: each \(\beta_j\) is a partial slope:

  • \(\beta_j\) measures the change in the mean response for a one-unit change in \(x_j\), holding all other predictors fixed.

That “holding others fixed” interpretation is new in MLR and central to reading regression output correctly.

9.2 Auto MPG example

We’ll again use a car-fuel-efficiency dataset. In the R notes it’s read directly from the UCI Machine Learning Repository and cleaned into a data frame autompg with columns:

  • mpg – miles per gallon (response),

  • cyl – number of cylinders,

  • disp – engine displacement,

  • hp – horsepower,

  • wt – weight,

  • acc – acceleration,

  • year – model year.

In PyStatsV1, you can either:

  • read the data from a local CSV we include, or

  • follow the R code closely by reading from the URL and cleaning in Python.

A minimal Python version might look like:

import pandas as pd

url = (
    "http://archive.ics.uci.edu/ml/machine-learning-databases/"
    "auto-mpg/auto-mpg.data"
)

cols = ["mpg", "cyl", "disp", "hp", "wt", "acc", "year", "origin", "name"]
autompg = pd.read_csv(
    url,
    delim_whitespace=True,
    names=cols,
    na_values="?",
    comment="#",
)

# Drop rows with missing horsepower and the Plymouth Reliant row,
# then drop unused columns to match the R notes
autompg = autompg.dropna(subset=["hp"])
autompg = autompg[autompg["name"] != "plymouth reliant"]
autompg = autompg[["mpg", "cyl", "disp", "hp", "wt", "acc", "year"]]
autompg["hp"] = autompg["hp"].astype(float)

autompg.info()

This gets us to the same cleaned structure as in R.

9.3 Fitting a multiple regression model

We’ll start with a two-predictor model: mpg as a function of weight and year.

Model:

\[Y_i = \beta_0 + \beta_1 \text{wt}_i + \beta_2 \text{year}_i + \varepsilon_i.\]

R version

mpg_model <- lm(mpg ~ wt + year, data = autompg)
summary(mpg_model)

Python version (statsmodels)

We’ll use the formula API, which understands "mpg ~ wt + year" like lm().

import statsmodels.formula.api as smf

mpg_model = smf.ols("mpg ~ wt + year", data=autompg).fit()
print(mpg_model.summary())

You should see:

  • estimates for Intercept, wt, and year,

  • standard errors and t-values,

  • residual standard error (sigma),

  • \(R^2\) and adjusted \(R^2\),

  • and an F-statistic testing whether the regression as a whole is useful.

Interpreting the coefficients

Using the R output notation, suppose we get:

  • \(\hat\beta_0 \approx -14.6\)

  • \(\hat\beta_1 \approx -0.0066\) for weight

  • \(\hat\beta_2 \approx 0.76\) for year

Interpretations:

  • Intercept: expected mpg for a car of weight 0 made in year 0. This is physically meaningless, but needed algebraically. Intercepts often have poor real-world meaning but are still part of the model.

  • Weight slope: for a one-unit increase in weight (one pound in these data), expected mpg decreases by about 0.0066, holding model year fixed.

  • Year slope: for a one-unit increase in model year (one year newer), expected mpg increases by about 0.76, holding weight fixed.

Notice how every slope interpretation now includes “for a fixed value of the other predictor”.

9.4 Matrix formulation of regression

Matrix notation lets us handle any number of predictors cleanly.

We stack the response values into a vector \(Y\), the predictors into a design matrix \(X\), the coefficients into a vector \(\beta\), and the errors into a vector \(\varepsilon\):

\[Y = X \beta + \varepsilon.\]
  • \(Y\) is \(n \times 1\),

  • \(X\) is \(n \times p\) (first column all ones for the intercept),

  • \(\beta\) is \(p \times 1\),

  • \(\varepsilon\) is \(n \times 1\).

The least squares estimate of \(\beta\) is:

\[\hat\beta = (X^\top X)^{-1} X^\top y.\]

The fitted values and residuals are:

\[\hat y = X \hat\beta, \qquad e = y - \hat y.\]

The residual variance estimate (mean squared error) is:

\[s_e^2 = \frac{e^\top e}{n - p},\]

where \(n\) is the sample size and \(p\) is the number of \(\beta\) parameters (including the intercept). Compare this to SLR:

  • SLR had \(p = 2\) and denominator \(n - 2\),

  • here we allow general \(p\), so the denominator is \(n - p\).

Python check

We can verify that statsmodels is doing exactly this:

import numpy as np

y = autompg["mpg"].to_numpy()
X = np.column_stack([
    np.ones(len(autompg)),   # intercept
    autompg["wt"].to_numpy(),
    autompg["year"].to_numpy(),
])

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y

print(beta_hat)                 # matches mpg_model.params

y_hat = X @ beta_hat
e = y - y_hat
n, p = X.shape
s_e = np.sqrt((e @ e) / (n - p))
print(s_e, mpg_model.mse_resid**0.5)

9.5 Sampling distribution of \(\hat\beta\)

Under the normal-error assumptions of the linear model:

  • \(\varepsilon \sim N(0, \sigma^2 I)\),

we can show (using multivariate normal theory) that:

\[\hat\beta \sim N\!\bigl(\beta,\, \sigma^2 (X^\top X)^{-1}\bigr).\]

Consequences:

  • \(E[\hat\beta] = \beta\) – each coefficient estimate is unbiased.

  • The covariance matrix of \(\hat\beta\) is \(\sigma^2 (X^\top X)^{-1}\). The diagonal elements give the variances of individual coefficients.

If we call \(C = (X^\top X)^{-1}\), then

\[\operatorname{Var}(\hat\beta_j) = \sigma^2 C_{jj}, \qquad SE(\hat\beta_j) = s_e \sqrt{C_{jj}}.\]

Each coefficient has a t-distribution after standardization:

\[\frac{\hat\beta_j - \beta_j}{s_e \sqrt{C_{jj}}} \sim t_{n-p}.\]

In practice we rarely compute \(C\) by hand; we read standard errors from software.

  • R: summary(mpg_model)$coef

  • Python: mpg_model.params and mpg_model.bse

9.6 Testing individual coefficients

To test whether a specific predictor is useful in the presence of the others, we typically test

\[H_0: \beta_j = 0 \quad \text{vs} \quad H_1: \beta_j \neq 0.\]

Test statistic:

\[t = \frac{\hat\beta_j - 0}{SE(\hat\beta_j)} = \frac{\hat\beta_j}{s_e \sqrt{C_{jj}}} \sim t_{n-p} \quad \text{under } H_0.\]

R version

summary(mpg_model)$coef

The Estimate, Std. Error, t value, and Pr(>|t|) columns give exactly this test.

Python version

mpg_model.params      # estimates
mpg_model.bse         # standard errors
mpg_model.tvalues     # t statistics
mpg_model.pvalues     # two-sided p-values

If the p-value for wt is tiny, we reject \(H_0: \beta_{\text{wt}} = 0\) and conclude weight is a useful predictor, given that year is already in the model.

This “given that …” interpretation is crucial: the t-test evaluates a coefficient conditional on the other predictors in the model, not in isolation.

9.7 Confidence intervals for coefficients and mean response

Intervals for coefficients

From the sampling distribution we get a \((1-\alpha)\) confidence interval:

\[\hat\beta_j \pm t_{\alpha/2,\, n-p} \cdot SE(\hat\beta_j).\]

R:

confint(mpg_model, level = 0.99)

Python:

mpg_model.conf_int(alpha=0.01)

Intervals for mean response

Suppose we want the mean mpg for cars with specified predictors \(x_0 = (1, x_{01}, \dots, x_{0,p-1})^\top\). The fitted mean is

\[\hat y(x_0) = x_0^\top \hat\beta.\]

Its standard error is

\[SE\bigl(\hat y(x_0)\bigr) = s_e \sqrt{x_0^\top (X^\top X)^{-1} x_0}.\]

A \((1-\alpha)\) confidence interval is

\[\hat y(x_0) \pm t_{\alpha/2,\, n-p} \cdot s_e \sqrt{x_0^\top (X^\top X)^{-1} x_0}.\]

R:

new_cars <- data.frame(wt = c(3500, 5000),
                       year = c(76, 81))

predict(mpg_model, newdata = new_cars,
        interval = "confidence", level = 0.99)

Python:

new_cars = pd.DataFrame({"wt": [3500, 5000],
                         "year": [76, 81]})

pred = mpg_model.get_prediction(new_cars)
pred.summary_frame(alpha=0.01)[["mean", "mean_ci_lower", "mean_ci_upper"]]

9.8 Prediction intervals

Prediction intervals account for both:

  • uncertainty in the mean, and

  • variability of individual observations around that mean.

Standard error:

\[SE_{\text{pred}}\bigl(\hat y(x_0)\bigr) = s_e \sqrt{1 + x_0^\top (X^\top X)^{-1} x_0}.\]

Prediction interval:

\[\hat y(x_0) \pm t_{\alpha/2,\, n-p} \cdot s_e \sqrt{1 + x_0^\top (X^\top X)^{-1} x_0}.\]

R:

predict(mpg_model, newdata = new_cars,
        interval = "prediction", level = 0.99)

Python:

pred.summary_frame(alpha=0.01)[
    ["obs_ci_lower", "obs_ci_upper"]
]

Warning about extrapolation

With multiple predictors, extrapolation can be subtle. Each new point must be reasonable in the joint predictor space, not just in each coordinate separately. A car with an unusual combination of weight and year may be far from the existing data cloud even if its weight and year separately look “within range”.

Plotting predictors against each other and marking new points is a good sanity check.

9.9 Significance of regression: global F-test

In SLR, the t-test for the slope and the F-test for the model were equivalent. In MLR they separate:

  • Individual t-tests: “Is this particular predictor useful, given the others?”

  • Global F-test: “Is any linear relationship present at all?”

Null hypothesis for the global F-test:

\[H_0: \beta_1 = \beta_2 = \cdots = \beta_{p-1} = 0.\]

Under \(H_0\) the model reduces to

\[Y_i = \beta_0 + \varepsilon_i,\]

an intercept-only model. The F-statistic is based on the variance decomposition

\[\text{SST} = \text{SSReg} + \text{SSE},\]

and compares mean squares:

\[F = \frac{\text{SSReg} / (p-1)}{\text{SSE} / (n-p)} \sim F_{p-1,\, n-p} \quad \text{under } H_0.\]

Software gives this automatically.

R:

summary(mpg_model)$fstatistic  # F and df
# or
null_mpg_model <- lm(mpg ~ 1, data = autompg)
anova(null_mpg_model, mpg_model)

Python:

mpg_model.fvalue, mpg_model.f_pvalue, mpg_model.df_model, mpg_model.df_resid

A tiny p-value says: “At least one predictor has a non-zero slope; the regression as a whole explains a meaningful amount of variation.”

9.10 Nested model comparisons

Often we want to compare two models where one is a subset of the other.

Example:

  • Reduced (null) model: mpg ~ wt + year

  • Full model: mpg ~ wt + year + cyl + disp + hp + acc

Here the reduced model is nested inside the full model. The null hypothesis is:

\[H_0: \beta_{\text{cyl}} = \beta_{\text{disp}} = \beta_{\text{hp}} = \beta_{\text{acc}} = 0.\]

We use an F-test based on the extra sum of squares explained by the full model.

R:

null_mpg_model <- lm(mpg ~ wt + year, data = autompg)
full_mpg_model <- lm(mpg ~ wt + year + cyl + disp + hp + acc,
                     data = autompg)

anova(null_mpg_model, full_mpg_model)

Python (statsmodels):

import statsmodels.api as sm

null_mpg_model = smf.ols("mpg ~ wt + year", data=autompg).fit()
full_mpg_model = smf.ols("mpg ~ wt + year + cyl + disp + hp + acc",
                         data=autompg).fit()

sm.stats.anova_lm(null_mpg_model, full_mpg_model)

If the p-value is large, the extra predictors don’t improve the model significantly given wt and year.

The global significance-of-regression test from the previous section is a special case where the reduced model is intercept-only.

9.11 Simulation: checking the sampling distribution

The R notes close with a simulation study that:

  • fixes a design matrix \(X\),

  • simulates many response vectors from a known linear model,

  • refits the regression each time,

  • and looks at the empirical distribution of one coefficient.

You can implement the same idea in Python. Sketch:

import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(1337)

n = 100
x1 = np.linspace(1, 10, n)
x2 = np.linspace(1, 10, n)[::-1]

beta_true = np.array([5.0, -2.0, 6.0])
sigma = 4.0

X = np.column_stack([np.ones(n), x1, x2])

num_sims = 10_000
beta2_vals = np.empty(num_sims)

for i in range(num_sims):
    eps = rng.normal(0, sigma, size=n)
    y = X @ beta_true + eps

    df = pd.DataFrame({"x1": x1, "x2": x2, "y": y})
    fit = smf.ols("y ~ x1 + x2", data=df).fit()
    beta2_vals[i] = fit.params["x2"]

beta2_vals.mean(), beta2_vals.var()

You should find:

  • the mean of beta2_vals is close to the true \(\beta_2\),

  • the variance matches \(\sigma^2 C_{22}\) where \(C = (X^\top X)^{-1}\),

  • a histogram of beta2_vals overlaid with the corresponding Normal density looks very similar.

This is a concrete demonstration of the theoretical sampling distribution results from Section 9.5.

9.12 What you should take away

By the end of this chapter (and its R + Python versions), you should be able to:

  • Write down and interpret a multiple linear regression model with several predictors.

  • Understand and use the matrix formulation:

    • \(Y = X \beta + \varepsilon\),

    • \(\hat\beta = (X^\top X)^{-1} X^\top y\),

    • residual variance \(s_e^2 = e^\top e / (n - p)\).

  • Read software output to obtain:

    • coefficient estimates, standard errors, t-tests, and p-values,

    • confidence intervals for coefficients and for mean responses,

    • prediction intervals for new observations,

    • global F-tests for significance of regression,

    • and F-tests comparing nested models.

  • Recognize that:

    • coefficient interpretations are conditional (“holding others fixed”),

    • extrapolation in multiple dimensions can be subtle,

    • and simulation is a powerful way to check theoretical results.

In later PyStatsV1 chapters, these tools will underpin more advanced models (logistic regression, models with interactions, etc.) and many of the case studies.