Applied Statistics with Python – Chapter 9
Multiple linear regression
This chapter mirrors the “Multiple Linear Regression” chapter from the original R notes. The statistical ideas are the same, but we’ll express them in Python-first language and keep R alongside as a translation layer.
By the end of this chapter you should be comfortable with:
Building and interpreting linear regression models with more than one predictor.
Understanding the matrix formulation of regression.
Using summary output to:
obtain standard errors and t-tests for each coefficient,
construct confidence intervals and prediction intervals,
run global F-tests for the significance of regression,
and compare nested models with ANOVA F-tests.
Connecting these ideas back to simulation: checking a theoretical sampling distribution with code.
Throughout we will:
Treat R’s
lm()as the reference implementation, andShow Python equivalents using
pandasandstatsmodels.
9.1 From simple to multiple regression
In simple linear regression (SLR) we had one predictor:
Graphically, this is a line through a cloud of points.
In multiple linear regression (MLR) we allow several predictors:
For two predictors you can picture a plane in 3-D space; for more predictors, geometry is harder to visualize but the algebra scales cleanly.
Key idea: each \(\beta_j\) is a partial slope:
\(\beta_j\) measures the change in the mean response for a one-unit change in \(x_j\), holding all other predictors fixed.
That “holding others fixed” interpretation is new in MLR and central to reading regression output correctly.
9.2 Auto MPG example
We’ll again use a car-fuel-efficiency dataset. In the R notes it’s read directly
from the UCI Machine Learning Repository and cleaned into a data frame
autompg with columns:
mpg– miles per gallon (response),cyl– number of cylinders,disp– engine displacement,hp– horsepower,wt– weight,acc– acceleration,year– model year.
In PyStatsV1, you can either:
read the data from a local CSV we include, or
follow the R code closely by reading from the URL and cleaning in Python.
A minimal Python version might look like:
import pandas as pd
url = (
"http://archive.ics.uci.edu/ml/machine-learning-databases/"
"auto-mpg/auto-mpg.data"
)
cols = ["mpg", "cyl", "disp", "hp", "wt", "acc", "year", "origin", "name"]
autompg = pd.read_csv(
url,
delim_whitespace=True,
names=cols,
na_values="?",
comment="#",
)
# Drop rows with missing horsepower and the Plymouth Reliant row,
# then drop unused columns to match the R notes
autompg = autompg.dropna(subset=["hp"])
autompg = autompg[autompg["name"] != "plymouth reliant"]
autompg = autompg[["mpg", "cyl", "disp", "hp", "wt", "acc", "year"]]
autompg["hp"] = autompg["hp"].astype(float)
autompg.info()
This gets us to the same cleaned structure as in R.
9.3 Fitting a multiple regression model
We’ll start with a two-predictor model: mpg as a function of weight and year.
Model:
R version
mpg_model <- lm(mpg ~ wt + year, data = autompg)
summary(mpg_model)
Python version (statsmodels)
We’ll use the formula API, which understands "mpg ~ wt + year" like lm().
import statsmodels.formula.api as smf
mpg_model = smf.ols("mpg ~ wt + year", data=autompg).fit()
print(mpg_model.summary())
You should see:
estimates for
Intercept,wt, andyear,standard errors and t-values,
residual standard error (
sigma),\(R^2\) and adjusted \(R^2\),
and an F-statistic testing whether the regression as a whole is useful.
Interpreting the coefficients
Using the R output notation, suppose we get:
\(\hat\beta_0 \approx -14.6\)
\(\hat\beta_1 \approx -0.0066\) for weight
\(\hat\beta_2 \approx 0.76\) for year
Interpretations:
Intercept: expected mpg for a car of weight 0 made in year 0. This is physically meaningless, but needed algebraically. Intercepts often have poor real-world meaning but are still part of the model.
Weight slope: for a one-unit increase in weight (one pound in these data), expected mpg decreases by about 0.0066, holding model year fixed.
Year slope: for a one-unit increase in model year (one year newer), expected mpg increases by about 0.76, holding weight fixed.
Notice how every slope interpretation now includes “for a fixed value of the other predictor”.
9.4 Matrix formulation of regression
Matrix notation lets us handle any number of predictors cleanly.
We stack the response values into a vector \(Y\), the predictors into a design matrix \(X\), the coefficients into a vector \(\beta\), and the errors into a vector \(\varepsilon\):
\(Y\) is \(n \times 1\),
\(X\) is \(n \times p\) (first column all ones for the intercept),
\(\beta\) is \(p \times 1\),
\(\varepsilon\) is \(n \times 1\).
The least squares estimate of \(\beta\) is:
The fitted values and residuals are:
The residual variance estimate (mean squared error) is:
where \(n\) is the sample size and \(p\) is the number of \(\beta\) parameters (including the intercept). Compare this to SLR:
SLR had \(p = 2\) and denominator \(n - 2\),
here we allow general \(p\), so the denominator is \(n - p\).
Python check
We can verify that statsmodels is doing exactly this:
import numpy as np
y = autompg["mpg"].to_numpy()
X = np.column_stack([
np.ones(len(autompg)), # intercept
autompg["wt"].to_numpy(),
autompg["year"].to_numpy(),
])
XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
print(beta_hat) # matches mpg_model.params
y_hat = X @ beta_hat
e = y - y_hat
n, p = X.shape
s_e = np.sqrt((e @ e) / (n - p))
print(s_e, mpg_model.mse_resid**0.5)
9.5 Sampling distribution of \(\hat\beta\)
Under the normal-error assumptions of the linear model:
\(\varepsilon \sim N(0, \sigma^2 I)\),
we can show (using multivariate normal theory) that:
Consequences:
\(E[\hat\beta] = \beta\) – each coefficient estimate is unbiased.
The covariance matrix of \(\hat\beta\) is \(\sigma^2 (X^\top X)^{-1}\). The diagonal elements give the variances of individual coefficients.
If we call \(C = (X^\top X)^{-1}\), then
Each coefficient has a t-distribution after standardization:
In practice we rarely compute \(C\) by hand; we read standard errors from software.
R:
summary(mpg_model)$coefPython:
mpg_model.paramsandmpg_model.bse
9.6 Testing individual coefficients
To test whether a specific predictor is useful in the presence of the others, we typically test
Test statistic:
R version
summary(mpg_model)$coef
The Estimate, Std. Error, t value, and Pr(>|t|) columns give
exactly this test.
Python version
mpg_model.params # estimates
mpg_model.bse # standard errors
mpg_model.tvalues # t statistics
mpg_model.pvalues # two-sided p-values
If the p-value for wt is tiny, we reject \(H_0: \beta_{\text{wt}} = 0\) and
conclude weight is a useful predictor, given that year is already in the model.
This “given that …” interpretation is crucial: the t-test evaluates a coefficient conditional on the other predictors in the model, not in isolation.
9.7 Confidence intervals for coefficients and mean response
Intervals for coefficients
From the sampling distribution we get a \((1-\alpha)\) confidence interval:
R:
confint(mpg_model, level = 0.99)
Python:
mpg_model.conf_int(alpha=0.01)
Intervals for mean response
Suppose we want the mean mpg for cars with specified predictors \(x_0 = (1, x_{01}, \dots, x_{0,p-1})^\top\). The fitted mean is
Its standard error is
A \((1-\alpha)\) confidence interval is
R:
new_cars <- data.frame(wt = c(3500, 5000),
year = c(76, 81))
predict(mpg_model, newdata = new_cars,
interval = "confidence", level = 0.99)
Python:
new_cars = pd.DataFrame({"wt": [3500, 5000],
"year": [76, 81]})
pred = mpg_model.get_prediction(new_cars)
pred.summary_frame(alpha=0.01)[["mean", "mean_ci_lower", "mean_ci_upper"]]
9.8 Prediction intervals
Prediction intervals account for both:
uncertainty in the mean, and
variability of individual observations around that mean.
Standard error:
Prediction interval:
R:
predict(mpg_model, newdata = new_cars,
interval = "prediction", level = 0.99)
Python:
pred.summary_frame(alpha=0.01)[
["obs_ci_lower", "obs_ci_upper"]
]
Warning about extrapolation
With multiple predictors, extrapolation can be subtle. Each new point must be reasonable in the joint predictor space, not just in each coordinate separately. A car with an unusual combination of weight and year may be far from the existing data cloud even if its weight and year separately look “within range”.
Plotting predictors against each other and marking new points is a good sanity check.
9.9 Significance of regression: global F-test
In SLR, the t-test for the slope and the F-test for the model were equivalent. In MLR they separate:
Individual t-tests: “Is this particular predictor useful, given the others?”
Global F-test: “Is any linear relationship present at all?”
Null hypothesis for the global F-test:
Under \(H_0\) the model reduces to
an intercept-only model. The F-statistic is based on the variance decomposition
and compares mean squares:
Software gives this automatically.
R:
summary(mpg_model)$fstatistic # F and df
# or
null_mpg_model <- lm(mpg ~ 1, data = autompg)
anova(null_mpg_model, mpg_model)
Python:
mpg_model.fvalue, mpg_model.f_pvalue, mpg_model.df_model, mpg_model.df_resid
A tiny p-value says: “At least one predictor has a non-zero slope; the regression as a whole explains a meaningful amount of variation.”
9.10 Nested model comparisons
Often we want to compare two models where one is a subset of the other.
Example:
Reduced (null) model:
mpg ~ wt + yearFull model:
mpg ~ wt + year + cyl + disp + hp + acc
Here the reduced model is nested inside the full model. The null hypothesis is:
We use an F-test based on the extra sum of squares explained by the full model.
R:
null_mpg_model <- lm(mpg ~ wt + year, data = autompg)
full_mpg_model <- lm(mpg ~ wt + year + cyl + disp + hp + acc,
data = autompg)
anova(null_mpg_model, full_mpg_model)
Python (statsmodels):
import statsmodels.api as sm
null_mpg_model = smf.ols("mpg ~ wt + year", data=autompg).fit()
full_mpg_model = smf.ols("mpg ~ wt + year + cyl + disp + hp + acc",
data=autompg).fit()
sm.stats.anova_lm(null_mpg_model, full_mpg_model)
If the p-value is large, the extra predictors don’t improve the model
significantly given wt and year.
The global significance-of-regression test from the previous section is a special case where the reduced model is intercept-only.
9.11 Simulation: checking the sampling distribution
The R notes close with a simulation study that:
fixes a design matrix \(X\),
simulates many response vectors from a known linear model,
refits the regression each time,
and looks at the empirical distribution of one coefficient.
You can implement the same idea in Python. Sketch:
import numpy as np
import pandas as pd
import statsmodels.api as sm
rng = np.random.default_rng(1337)
n = 100
x1 = np.linspace(1, 10, n)
x2 = np.linspace(1, 10, n)[::-1]
beta_true = np.array([5.0, -2.0, 6.0])
sigma = 4.0
X = np.column_stack([np.ones(n), x1, x2])
num_sims = 10_000
beta2_vals = np.empty(num_sims)
for i in range(num_sims):
eps = rng.normal(0, sigma, size=n)
y = X @ beta_true + eps
df = pd.DataFrame({"x1": x1, "x2": x2, "y": y})
fit = smf.ols("y ~ x1 + x2", data=df).fit()
beta2_vals[i] = fit.params["x2"]
beta2_vals.mean(), beta2_vals.var()
You should find:
the mean of
beta2_valsis close to the true \(\beta_2\),the variance matches \(\sigma^2 C_{22}\) where \(C = (X^\top X)^{-1}\),
a histogram of
beta2_valsoverlaid with the corresponding Normal density looks very similar.
This is a concrete demonstration of the theoretical sampling distribution results from Section 9.5.
9.12 What you should take away
By the end of this chapter (and its R + Python versions), you should be able to:
Write down and interpret a multiple linear regression model with several predictors.
Understand and use the matrix formulation:
\(Y = X \beta + \varepsilon\),
\(\hat\beta = (X^\top X)^{-1} X^\top y\),
residual variance \(s_e^2 = e^\top e / (n - p)\).
Read software output to obtain:
coefficient estimates, standard errors, t-tests, and p-values,
confidence intervals for coefficients and for mean responses,
prediction intervals for new observations,
global F-tests for significance of regression,
and F-tests comparing nested models.
Recognize that:
coefficient interpretations are conditional (“holding others fixed”),
extrapolation in multiple dimensions can be subtle,
and simulation is a powerful way to check theoretical results.
In later PyStatsV1 chapters, these tools will underpin more advanced models (logistic regression, models with interactions, etc.) and many of the case studies.