Applied Statistics with Python – Chapter 15

Collinearity

This chapter parallels the Collinearity chapter from the R notes. Here we keep the statistical ideas the same, but express them in Python-first terms.

By the end of this chapter you should be able to:

Recognize exact collinearity (perfect linear dependence) in a design matrix.
Diagnose high collinearity between predictors using correlations and variance inflation factors (VIFs).
Explain how collinearity affects regression coefficients, standard errors, and interpretation.
Distinguish between the impact of collinearity on explanation versus prediction.
Use partial correlation and added–variable plots to decide whether a new predictor is worth adding to an existing model.

15.1 Exact collinearity: when the design matrix is singular

We start with an extreme case: one predictor is an exact linear combination of the others.

In R, the book defines a function that generates three predictors where

\[x_3 = 2 x_1 + 4 x_2 + 3,\]

while the response \(y\) only depends on \(x_1\) and \(x_2\).

In Python, we can do the same:

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(42)

def gen_exact_collin_data(num_samples: int = 100) -> pd.DataFrame:
    x1 = rng.normal(loc=80, scale=10, size=num_samples)
    x2 = rng.normal(loc=70, scale=5, size=num_samples)
    x3 = 2 * x1 + 4 * x2 + 3
    y = 3 + x1 + x2 + rng.normal(loc=0, scale=1, size=num_samples)
    return pd.DataFrame({"y": y, "x1": x1, "x2": x2, "x3": x3})

exact_collin_data = gen_exact_collin_data()
exact_collin_data.head()

Here \(y\) really only “needs” \(x_1\) and \(x_2\).

If we try to fit a model with all three predictors using the usual normal equations, the matrix \(X^\top X\) is not invertible:

import numpy.linalg as la

X = np.column_stack(
    [
        np.ones(len(exact_collin_data)),          # intercept
        exact_collin_data[["x1", "x2", "x3"]].to_numpy(),
    ]
)

la.inv(X.T @ X)   # raises LinAlgError: singular matrix

Statsmodels hides this detail for us and simply drops one column:

fit = smf.ols("y ~ x1 + x2 + x3", data=exact_collin_data).fit()
print(fit.summary())

You will see that one coefficient is reported as not defined because the columns are linearly dependent. Internally, statsmodels has fitted an equivalent model that only uses two of the three predictors.

Key points:

With exact collinearity, there are infinitely many regression coefficient vectors \(\hat\beta\) that produce the same fitted values \(\hat y\).
The model still predicts well, but the individual coefficients are not uniquely defined, so they cannot be interpreted.

To see this, fit three smaller models:

fit1 = smf.ols("y ~ x1 + x2", data=exact_collin_data).fit()
fit2 = smf.ols("y ~ x1 + x3", data=exact_collin_data).fit()
fit3 = smf.ols("y ~ x2 + x3", data=exact_collin_data).fit()

np.allclose(fit1.fittedvalues, fit2.fittedvalues)
np.allclose(fit2.fittedvalues, fit3.fittedvalues)

The fitted values are identical, but

fit1.params
fit2.params
fit3.params

give quite different coefficients. This is the hallmark of exact collinearity: same predictions, wildly different coefficient stories.

15.2 Collinearity in practice: highly correlated predictors

Exact collinearity is rare in real data. More commonly, we see strong but not perfect correlation between predictors. This is usually called multicollinearity.

The R notes use the seatpos dataset from the faraway package. In PyStatsV1 we will ship a CSV version of this dataset as data/seatpos.csv:

import pandas as pd

seatpos = pd.read_csv("data/seatpos.csv")
seatpos.head()

The response is hipcenter (seat position). Predictors include:

Age – age in years,
Weight – weight in pounds,
Ht – standing height,
HtShoes – height while wearing shoes,
Seated, Arm, Thigh, Leg – various body measurements.

15.2.1 Pairwise correlation checks

A first check is to look at pairwise scatterplots and correlations:

import matplotlib.pyplot as plt

cols = ["Age", "Weight", "HtShoes", "Ht", "Seated", "Arm", "Thigh", "Leg"]
pd.plotting.scatter_matrix(seatpos[cols], figsize=(8, 8))
plt.tight_layout()

seatpos[cols].corr().round(2)

You should see extremely high correlations between many of the size-related variables, especially HtShoes and Ht (essentially the same measurement).

15.2.2 A “kitchen-sink” regression

Now fit a multiple regression with all predictors:

import statsmodels.formula.api as smf

hip_model = smf.ols("hipcenter ~ Age + Weight + HtShoes + Ht + "
                    "Seated + Arm + Thigh + Leg",
                    data=seatpos).fit()
print(hip_model.summary())

Typical pattern:

The overall F-test (at the bottom of the summary) says the model is highly significant.
Yet almost all individual t-tests for coefficients have large p-values.
Some coefficients for very similar predictors can even have opposite signs (for example, Ht and HtShoes).

This is a classic symptom of collinearity:

Together the predictors explain a lot of variation in hipcenter.
But because they are highly correlated with each other, the model struggles to attribute variation to any single predictor.

15.2.3 Variance inflation factors (VIFs)

To understand the impact on standard errors, we use the variance inflation factor.

For a given predictor \(x_j\), let \(R_j^2\) be the \(R^2\) from regressing \(x_j\) on all the other predictors. Then the variance of its coefficient \(\hat\beta_j\) can be written as

\[\mathrm{Var}(\hat\beta_j) \;=\; \sigma^2 \, \underbrace{\frac{1}{1 - R_j^2}}_{\text{variance inflation factor}} \cdot \frac{1}{S_{x_j x_j}},\]

where \(S_{x_j x_j}\) is the sum of squares of \(x_j\) around its mean.

The variance inflation factor (VIF) is

\[\text{VIF}_j = \frac{1}{1 - R_j^2}.\]

If \(R_j^2\) is close to 1 (x_j is well explained by other predictors), then \(\text{VIF}_j\) is large and the standard error of \(\hat\beta_j\) is inflated.

In Python we can compute VIFs using statsmodels:

import numpy as np
from statsmodels.stats.outliers_influence import variance_inflation_factor

X = hip_model.model.exog
vif = pd.Series(
    [variance_inflation_factor(X, i) for i in range(X.shape[1])],
    index=hip_model.model.exog_names,
)
vif

You should see very large VIFs for several predictors (well above common rules of thumb such as 5 or 10). That tells us:

The model’s coefficients are unstable and hard to interpret.
Small changes in the data can lead to large swings in estimated effects.

15.2.4 Collinearity and small perturbations

To see this instability, add a bit of random noise to the response and refit:

rng = np.random.default_rng(1337)
noise = rng.normal(loc=0, scale=5, size=len(seatpos))

hip_model_noise = smf.ols(
    "hipcenter_plus_noise ~ Age + Weight + HtShoes + Ht + "
    "Seated + Arm + Thigh + Leg",
    data=seatpos.assign(hipcenter_plus_noise=seatpos["hipcenter"] + noise),
).fit()

hip_model.params
hip_model_noise.params

You will often see:

Coefficients change a lot, sometimes even flipping sign.
But the fitted values are quite similar:

plt.scatter(hip_model.fittedvalues,
            hip_model_noise.fittedvalues,
            alpha=0.7)
plt.xlabel("Predicted hipcenter (original)")
plt.ylabel("Predicted hipcenter (with noise)")
plt.axline((0, 0), slope=1, color="k", linestyle="--")
plt.tight_layout()

Collinearity therefore:

Hurts our ability to explain the relationship (unstable coefficients),
But may have much smaller effect on prediction error.

15.2.5 A smaller, more stable model

Suppose we fit a simpler model using just a few predictors, for example Age, Arm, and Ht:

hip_model_small = smf.ols("hipcenter ~ Age + Arm + Ht",
                          data=seatpos).fit()
print(hip_model_small.summary())

X_small = hip_model_small.model.exog
vif_small = pd.Series(
    [variance_inflation_factor(X_small, i)
     for i in range(X_small.shape[1])],
    index=hip_model_small.model.exog_names,
)
vif_small

Now the VIFs are reasonable and the coefficient for Ht has a stable sign.

We can compare the large and small models with an F-test:

import statsmodels.api as sm

sm.stats.anova_lm(hip_model_small, hip_model)

Often you will find:

The more complicated model does not provide a statistically significant improvement in fit.
The simpler model is therefore preferred: easier to interpret and less sensitive to noise.

15.3 Partial correlation and added–variable plots

Suppose we are considering adding another predictor, say HtShoes, to the smaller model with Age, Arm, and Ht.

A useful diagnostic is the partial correlation between the candidate predictor and the response after removing the effect of the existing predictors.

Procedure:

Fit the current model and keep the residuals:
```
resid_y = hip_model_small.resid
```

Regress the candidate predictor on the existing predictors and keep its residuals:

ht_shoes_model_small = smf.ols(
    "HtShoes ~ Age + Arm + Ht",
    data=seatpos
).fit()
resid_ht_shoes = ht_shoes_model_small.resid

Compute the correlation:

np.corrcoef(resid_ht_shoes, resid_y)[0, 1]

If this partial correlation is close to 0, then once we know Age, Arm, and Ht, the remaining variation in HtShoes tells us almost nothing about the remaining variation in hipcenter. Adding HtShoes is unlikely to be helpful.

An added–variable plot visualizes the same idea:

plt.scatter(resid_ht_shoes, resid_y, alpha=0.8)
plt.axhline(0, linestyle="--", color="grey")
plt.axvline(0, linestyle="--", color="grey")

# regression line of residuals on residuals
line = smf.ols(
    "resid_y ~ resid_ht_shoes",
    data=pd.DataFrame({"resid_y": resid_y,
                       "resid_ht_shoes": resid_ht_shoes}),
).fit()
x_grid = np.linspace(resid_ht_shoes.min(), resid_ht_shoes.max(), 100)
y_grid = line.params["Intercept"] + line.params["resid_ht_shoes"] * x_grid
plt.plot(x_grid, y_grid)
plt.xlabel("Residuals of HtShoes (given Age, Arm, Ht)")
plt.ylabel("Residuals of hipcenter (given Age, Arm, Ht)")
plt.tight_layout()

A nearly horizontal cloud with a flat regression line indicates that the new predictor adds little explanatory power once the others are in the model.

15.4 How this connects to PyStatsV1

Collinearity is one of the main reasons why “more predictors” does not always mean “better model.”

In later PyStatsV1 chapters and case studies you will see:

Scripts that compute and report VIFs alongside regression summaries.
Examples where we deliberately drop or combine highly correlated predictors to stabilize coefficients.
Model–comparison workflows that balance good fit, interpretability, and low collinearity.

When you adapt or extend PyStatsV1 code for your own data, you should:

check correlations and VIFs early,
be suspicious of models where small changes in the data flip coefficient signs or dramatically change magnitudes,
prefer smaller, well-behaved models when the goal is explanation.

15.5 What you should take away

Exact collinearity means some predictor is an exact linear combination of others. The design matrix is singular, the coefficients are not uniquely defined, but predictions can still be fine.
High collinearity (multicollinearity) occurs when predictors are highly correlated. It inflates standard errors and makes coefficients unstable and hard to interpret.
The variance inflation factor (VIF) quantifies how much collinearity inflates the variance of \(\hat\beta_j\). VIFs above 5–10 are often used as warning flags.
Collinearity can severely damage our ability to explain relationships, while having relatively little impact on prediction error.
Tools such as partial correlation and added–variable plots help decide whether a new predictor is worth adding to a model that already has several correlated variables.
When in doubt, prefer simpler models with reasonably low collinearity, especially when the goal is to communicate and interpret effects.