Applied Statistics with Python – Chapter 15
Collinearity
This chapter parallels the Collinearity chapter from the R notes. Here we keep the statistical ideas the same, but express them in Python-first terms.
By the end of this chapter you should be able to:
Recognize exact collinearity (perfect linear dependence) in a design matrix.
Diagnose high collinearity between predictors using correlations and variance inflation factors (VIFs).
Explain how collinearity affects regression coefficients, standard errors, and interpretation.
Distinguish between the impact of collinearity on explanation versus prediction.
Use partial correlation and added–variable plots to decide whether a new predictor is worth adding to an existing model.
15.1 Exact collinearity: when the design matrix is singular
We start with an extreme case: one predictor is an exact linear combination of the others.
In R, the book defines a function that generates three predictors where
while the response \(y\) only depends on \(x_1\) and \(x_2\).
In Python, we can do the same:
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
rng = np.random.default_rng(42)
def gen_exact_collin_data(num_samples: int = 100) -> pd.DataFrame:
x1 = rng.normal(loc=80, scale=10, size=num_samples)
x2 = rng.normal(loc=70, scale=5, size=num_samples)
x3 = 2 * x1 + 4 * x2 + 3
y = 3 + x1 + x2 + rng.normal(loc=0, scale=1, size=num_samples)
return pd.DataFrame({"y": y, "x1": x1, "x2": x2, "x3": x3})
exact_collin_data = gen_exact_collin_data()
exact_collin_data.head()
Here \(y\) really only “needs” \(x_1\) and \(x_2\).
If we try to fit a model with all three predictors using the usual normal equations, the matrix \(X^\top X\) is not invertible:
import numpy.linalg as la
X = np.column_stack(
[
np.ones(len(exact_collin_data)), # intercept
exact_collin_data[["x1", "x2", "x3"]].to_numpy(),
]
)
la.inv(X.T @ X) # raises LinAlgError: singular matrix
Statsmodels hides this detail for us and simply drops one column:
fit = smf.ols("y ~ x1 + x2 + x3", data=exact_collin_data).fit()
print(fit.summary())
You will see that one coefficient is reported as not defined because the columns are linearly dependent. Internally, statsmodels has fitted an equivalent model that only uses two of the three predictors.
Key points:
With exact collinearity, there are infinitely many regression coefficient vectors \(\hat\beta\) that produce the same fitted values \(\hat y\).
The model still predicts well, but the individual coefficients are not uniquely defined, so they cannot be interpreted.
To see this, fit three smaller models:
fit1 = smf.ols("y ~ x1 + x2", data=exact_collin_data).fit()
fit2 = smf.ols("y ~ x1 + x3", data=exact_collin_data).fit()
fit3 = smf.ols("y ~ x2 + x3", data=exact_collin_data).fit()
np.allclose(fit1.fittedvalues, fit2.fittedvalues)
np.allclose(fit2.fittedvalues, fit3.fittedvalues)
The fitted values are identical, but
fit1.params
fit2.params
fit3.params
give quite different coefficients. This is the hallmark of exact collinearity: same predictions, wildly different coefficient stories.
15.3 Partial correlation and added–variable plots
Suppose we are considering adding another predictor, say HtShoes, to the
smaller model with Age, Arm, and Ht.
A useful diagnostic is the partial correlation between the candidate predictor and the response after removing the effect of the existing predictors.
Procedure:
Fit the current model and keep the residuals:
resid_y = hip_model_small.resid
Regress the candidate predictor on the existing predictors and keep its residuals:
ht_shoes_model_small = smf.ols( "HtShoes ~ Age + Arm + Ht", data=seatpos ).fit() resid_ht_shoes = ht_shoes_model_small.resid
Compute the correlation:
np.corrcoef(resid_ht_shoes, resid_y)[0, 1]
If this partial correlation is close to 0, then once we know Age, Arm,
and Ht, the remaining variation in HtShoes tells us almost nothing
about the remaining variation in hipcenter. Adding HtShoes is unlikely
to be helpful.
An added–variable plot visualizes the same idea:
plt.scatter(resid_ht_shoes, resid_y, alpha=0.8)
plt.axhline(0, linestyle="--", color="grey")
plt.axvline(0, linestyle="--", color="grey")
# regression line of residuals on residuals
line = smf.ols(
"resid_y ~ resid_ht_shoes",
data=pd.DataFrame({"resid_y": resid_y,
"resid_ht_shoes": resid_ht_shoes}),
).fit()
x_grid = np.linspace(resid_ht_shoes.min(), resid_ht_shoes.max(), 100)
y_grid = line.params["Intercept"] + line.params["resid_ht_shoes"] * x_grid
plt.plot(x_grid, y_grid)
plt.xlabel("Residuals of HtShoes (given Age, Arm, Ht)")
plt.ylabel("Residuals of hipcenter (given Age, Arm, Ht)")
plt.tight_layout()
A nearly horizontal cloud with a flat regression line indicates that the new predictor adds little explanatory power once the others are in the model.
15.4 How this connects to PyStatsV1
Collinearity is one of the main reasons why “more predictors” does not always mean “better model.”
In later PyStatsV1 chapters and case studies you will see:
Scripts that compute and report VIFs alongside regression summaries.
Examples where we deliberately drop or combine highly correlated predictors to stabilize coefficients.
Model–comparison workflows that balance good fit, interpretability, and low collinearity.
When you adapt or extend PyStatsV1 code for your own data, you should:
check correlations and VIFs early,
be suspicious of models where small changes in the data flip coefficient signs or dramatically change magnitudes,
prefer smaller, well-behaved models when the goal is explanation.
15.5 What you should take away
Exact collinearity means some predictor is an exact linear combination of others. The design matrix is singular, the coefficients are not uniquely defined, but predictions can still be fine.
High collinearity (multicollinearity) occurs when predictors are highly correlated. It inflates standard errors and makes coefficients unstable and hard to interpret.
The variance inflation factor (VIF) quantifies how much collinearity inflates the variance of \(\hat\beta_j\). VIFs above 5–10 are often used as warning flags.
Collinearity can severely damage our ability to explain relationships, while having relatively little impact on prediction error.
Tools such as partial correlation and added–variable plots help decide whether a new predictor is worth adding to a model that already has several correlated variables.
When in doubt, prefer simpler models with reasonably low collinearity, especially when the goal is to communicate and interpret effects.