.. _applied_stats_with_python_ch15_collinearity: Applied Statistics with Python – Chapter 15 =========================================== Collinearity ------------ This chapter parallels the *Collinearity* chapter from the R notes. Here we keep the statistical ideas the same, but express them in Python-first terms. By the end of this chapter you should be able to: * Recognize **exact collinearity** (perfect linear dependence) in a design matrix. * Diagnose **high collinearity** between predictors using correlations and variance inflation factors (VIFs). * Explain how collinearity affects regression **coefficients, standard errors, and interpretation**. * Distinguish between the impact of collinearity on **explanation** versus **prediction**. * Use **partial correlation** and **added–variable plots** to decide whether a new predictor is worth adding to an existing model. 15.1 Exact collinearity: when the design matrix is singular ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ We start with an extreme case: one predictor is an exact linear combination of the others. In R, the book defines a function that generates three predictors where .. math:: x_3 = 2 x_1 + 4 x_2 + 3, while the response :math:`y` only depends on :math:`x_1` and :math:`x_2`. In Python, we can do the same: .. code-block:: python import numpy as np import pandas as pd import statsmodels.formula.api as smf rng = np.random.default_rng(42) def gen_exact_collin_data(num_samples: int = 100) -> pd.DataFrame: x1 = rng.normal(loc=80, scale=10, size=num_samples) x2 = rng.normal(loc=70, scale=5, size=num_samples) x3 = 2 * x1 + 4 * x2 + 3 y = 3 + x1 + x2 + rng.normal(loc=0, scale=1, size=num_samples) return pd.DataFrame({"y": y, "x1": x1, "x2": x2, "x3": x3}) exact_collin_data = gen_exact_collin_data() exact_collin_data.head() Here :math:`y` really only “needs” :math:`x_1` and :math:`x_2`. If we try to fit a model with all three predictors using the usual normal equations, the matrix :math:`X^\top X` is not invertible: .. code-block:: python import numpy.linalg as la X = np.column_stack( [ np.ones(len(exact_collin_data)), # intercept exact_collin_data[["x1", "x2", "x3"]].to_numpy(), ] ) la.inv(X.T @ X) # raises LinAlgError: singular matrix Statsmodels hides this detail for us and simply drops one column: .. code-block:: python fit = smf.ols("y ~ x1 + x2 + x3", data=exact_collin_data).fit() print(fit.summary()) You will see that one coefficient is reported as *not defined* because the columns are linearly dependent. Internally, statsmodels has fitted an equivalent model that only uses two of the three predictors. Key points: * With **exact collinearity**, there are infinitely many regression coefficient vectors :math:`\hat\beta` that produce the *same fitted values* :math:`\hat y`. * The model still predicts well, but the **individual coefficients are not uniquely defined**, so they cannot be interpreted. To see this, fit three smaller models: .. code-block:: python fit1 = smf.ols("y ~ x1 + x2", data=exact_collin_data).fit() fit2 = smf.ols("y ~ x1 + x3", data=exact_collin_data).fit() fit3 = smf.ols("y ~ x2 + x3", data=exact_collin_data).fit() np.allclose(fit1.fittedvalues, fit2.fittedvalues) np.allclose(fit2.fittedvalues, fit3.fittedvalues) The fitted values are identical, but .. code-block:: python fit1.params fit2.params fit3.params give quite different coefficients. This is the hallmark of exact collinearity: **same predictions, wildly different coefficient stories.** 15.2 Collinearity in practice: highly correlated predictors ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Exact collinearity is rare in real data. More commonly, we see **strong but not perfect correlation** between predictors. This is usually called *multicollinearity*. The R notes use the ``seatpos`` dataset from the ``faraway`` package. In PyStatsV1 we will ship a CSV version of this dataset as ``data/seatpos.csv``: .. code-block:: python import pandas as pd seatpos = pd.read_csv("data/seatpos.csv") seatpos.head() The response is ``hipcenter`` (seat position). Predictors include: * ``Age`` – age in years, * ``Weight`` – weight in pounds, * ``Ht`` – standing height, * ``HtShoes`` – height while wearing shoes, * ``Seated``, ``Arm``, ``Thigh``, ``Leg`` – various body measurements. 15.2.1 Pairwise correlation checks ********************************** A first check is to look at pairwise scatterplots and correlations: .. code-block:: python import matplotlib.pyplot as plt cols = ["Age", "Weight", "HtShoes", "Ht", "Seated", "Arm", "Thigh", "Leg"] pd.plotting.scatter_matrix(seatpos[cols], figsize=(8, 8)) plt.tight_layout() seatpos[cols].corr().round(2) You should see extremely high correlations between many of the size-related variables, especially ``HtShoes`` and ``Ht`` (essentially the same measurement). 15.2.2 A “kitchen-sink” regression ********************************** Now fit a multiple regression with all predictors: .. code-block:: python import statsmodels.formula.api as smf hip_model = smf.ols("hipcenter ~ Age + Weight + HtShoes + Ht + " "Seated + Arm + Thigh + Leg", data=seatpos).fit() print(hip_model.summary()) Typical pattern: * The **overall F-test** (at the bottom of the summary) says the model is highly significant. * Yet almost all **individual t-tests** for coefficients have large p-values. * Some coefficients for very similar predictors can even have **opposite signs** (for example, ``Ht`` and ``HtShoes``). This is a classic symptom of collinearity: * Together the predictors explain a lot of variation in ``hipcenter``. * But because they are highly correlated with each other, the model struggles to attribute variation to any single predictor. 15.2.3 Variance inflation factors (VIFs) *************************************** To understand the impact on standard errors, we use the **variance inflation factor**. For a given predictor :math:`x_j`, let :math:`R_j^2` be the :math:`R^2` from regressing :math:`x_j` on all the *other* predictors. Then the variance of its coefficient :math:`\hat\beta_j` can be written as .. math:: \mathrm{Var}(\hat\beta_j) \;=\; \sigma^2 \, \underbrace{\frac{1}{1 - R_j^2}}_{\text{variance inflation factor}} \cdot \frac{1}{S_{x_j x_j}}, where :math:`S_{x_j x_j}` is the sum of squares of :math:`x_j` around its mean. The **variance inflation factor (VIF)** is .. math:: \text{VIF}_j = \frac{1}{1 - R_j^2}. If :math:`R_j^2` is close to 1 (``x_j`` is well explained by other predictors), then :math:`\text{VIF}_j` is large and the standard error of :math:`\hat\beta_j` is inflated. In Python we can compute VIFs using statsmodels: .. code-block:: python import numpy as np from statsmodels.stats.outliers_influence import variance_inflation_factor X = hip_model.model.exog vif = pd.Series( [variance_inflation_factor(X, i) for i in range(X.shape[1])], index=hip_model.model.exog_names, ) vif You should see very large VIFs for several predictors (well above common rules of thumb such as 5 or 10). That tells us: * The model’s **coefficients are unstable and hard to interpret**. * Small changes in the data can lead to large swings in estimated effects. 15.2.4 Collinearity and small perturbations ******************************************* To see this instability, add a bit of random noise to the response and refit: .. code-block:: python rng = np.random.default_rng(1337) noise = rng.normal(loc=0, scale=5, size=len(seatpos)) hip_model_noise = smf.ols( "hipcenter_plus_noise ~ Age + Weight + HtShoes + Ht + " "Seated + Arm + Thigh + Leg", data=seatpos.assign(hipcenter_plus_noise=seatpos["hipcenter"] + noise), ).fit() hip_model.params hip_model_noise.params You will often see: * Coefficients change a lot, sometimes even **flipping sign**. * But the **fitted values** are quite similar: .. code-block:: python plt.scatter(hip_model.fittedvalues, hip_model_noise.fittedvalues, alpha=0.7) plt.xlabel("Predicted hipcenter (original)") plt.ylabel("Predicted hipcenter (with noise)") plt.axline((0, 0), slope=1, color="k", linestyle="--") plt.tight_layout() Collinearity therefore: * Hurts our ability to **explain** the relationship (unstable coefficients), * But may have **much smaller effect on prediction error**. 15.2.5 A smaller, more stable model *********************************** Suppose we fit a simpler model using just a few predictors, for example ``Age``, ``Arm``, and ``Ht``: .. code-block:: python hip_model_small = smf.ols("hipcenter ~ Age + Arm + Ht", data=seatpos).fit() print(hip_model_small.summary()) X_small = hip_model_small.model.exog vif_small = pd.Series( [variance_inflation_factor(X_small, i) for i in range(X_small.shape[1])], index=hip_model_small.model.exog_names, ) vif_small Now the VIFs are reasonable and the coefficient for ``Ht`` has a stable sign. We can compare the large and small models with an F-test: .. code-block:: python import statsmodels.api as sm sm.stats.anova_lm(hip_model_small, hip_model) Often you will find: * The more complicated model does **not** provide a statistically significant improvement in fit. * The **simpler model** is therefore preferred: easier to interpret and less sensitive to noise. 15.3 Partial correlation and added–variable plots ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Suppose we are considering adding another predictor, say ``HtShoes``, to the smaller model with ``Age``, ``Arm``, and ``Ht``. A useful diagnostic is the **partial correlation** between the candidate predictor and the response *after* removing the effect of the existing predictors. Procedure: 1. Fit the *current* model and keep the residuals: .. code-block:: python resid_y = hip_model_small.resid 2. Regress the *candidate* predictor on the existing predictors and keep its residuals: .. code-block:: python ht_shoes_model_small = smf.ols( "HtShoes ~ Age + Arm + Ht", data=seatpos ).fit() resid_ht_shoes = ht_shoes_model_small.resid 3. Compute the correlation: .. code-block:: python np.corrcoef(resid_ht_shoes, resid_y)[0, 1] If this partial correlation is close to 0, then once we know ``Age``, ``Arm``, and ``Ht``, the remaining variation in ``HtShoes`` tells us almost nothing about the remaining variation in ``hipcenter``. Adding ``HtShoes`` is unlikely to be helpful. An **added–variable plot** visualizes the same idea: .. code-block:: python plt.scatter(resid_ht_shoes, resid_y, alpha=0.8) plt.axhline(0, linestyle="--", color="grey") plt.axvline(0, linestyle="--", color="grey") # regression line of residuals on residuals line = smf.ols( "resid_y ~ resid_ht_shoes", data=pd.DataFrame({"resid_y": resid_y, "resid_ht_shoes": resid_ht_shoes}), ).fit() x_grid = np.linspace(resid_ht_shoes.min(), resid_ht_shoes.max(), 100) y_grid = line.params["Intercept"] + line.params["resid_ht_shoes"] * x_grid plt.plot(x_grid, y_grid) plt.xlabel("Residuals of HtShoes (given Age, Arm, Ht)") plt.ylabel("Residuals of hipcenter (given Age, Arm, Ht)") plt.tight_layout() A nearly horizontal cloud with a flat regression line indicates that the new predictor adds little explanatory power once the others are in the model. 15.4 How this connects to PyStatsV1 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Collinearity is one of the main reasons why “more predictors” does not always mean “better model.” In later PyStatsV1 chapters and case studies you will see: * Scripts that compute and **report VIFs** alongside regression summaries. * Examples where we deliberately drop or combine highly correlated predictors to stabilize coefficients. * Model–comparison workflows that balance **good fit**, **interpretability**, and **low collinearity**. When you adapt or extend PyStatsV1 code for your own data, you should: * check correlations and VIFs early, * be suspicious of models where small changes in the data flip coefficient signs or dramatically change magnitudes, * prefer smaller, well-behaved models when the goal is explanation. 15.5 What you should take away ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ * **Exact collinearity** means some predictor is an exact linear combination of others. The design matrix is singular, the coefficients are not uniquely defined, but predictions can still be fine. * **High collinearity** (multicollinearity) occurs when predictors are highly correlated. It inflates standard errors and makes coefficients unstable and hard to interpret. * The **variance inflation factor (VIF)** quantifies how much collinearity inflates the variance of :math:`\hat\beta_j`. VIFs above 5–10 are often used as warning flags. * Collinearity can severely damage our ability to **explain** relationships, while having relatively little impact on **prediction error**. * Tools such as **partial correlation** and **added–variable plots** help decide whether a new predictor is worth adding to a model that already has several correlated variables. * When in doubt, prefer **simpler models** with reasonably low collinearity, especially when the goal is to communicate and interpret effects.