Applied Statistics with Python – Chapter 14
Transformations
In Chapter 13 we focused on diagnosing regression models: checking assumptions, looking for non-constant variance, and finding unusual observations.
This chapter asks a natural follow-up question:
What can we do when the diagnostics say our model is not OK?
A major tool is to transform variables:
transform the response to stabilize variance or make residuals more nearly Normal;
transform predictors to capture non-linear relationships while keeping a linear model;
use polynomial terms to fit smooth curves.
By the end of this chapter you will be able to:
explain what a variance-stabilizing transformation is and why it matters;
fit and interpret regression models with a log-transformed response;
use the Box–Cox family to choose a transformation automatically;
transform predictors (for example, log–log relationships);
fit polynomial regression models using
statsmodels(and understand how Patsy’sI()works);recognize over-fitting and the dangers of extrapolating high-degree polynomials.
14.1 Response transformations
We start from a simple but common problem: the residual variance grows with the mean.
Example: salaries at Initech
Suppose we have data on salary versus years of experience at a fictional
company Initech, stored in data/initech.csv:
import pandas as pd
import statsmodels.formula.api as smf
initech = pd.read_csv("data/initech.csv")
initech.head()
We first fit a simple linear regression,
where \(Y_i\) is salary and \(x_i\) is years of experience.
lin_mod = smf.ols("salary ~ years", data=initech).fit()
print(lin_mod.summary())
If you reuse the residual plots from Chapter 13 you will typically see:
the mean relationship is roughly linear;
the variance of the residuals increases with the fitted value.
This violates the constant variance assumption (\(\mathsf{Var}[\varepsilon_i] = \sigma^2\) for all \(i\)).
Variance-stabilizing transformations
In symbols, our assumption is
But in the Initech plot the variance looks like it depends on the mean:
for some increasing function \(h\).
A variance-stabilizing transformation is a function \(g\) of the response such that
a constant that no longer changes with the mean.
In practice we often try simple monotone transforms:
log: \(g(y) = \log y\);
square root: \(g(y) = \sqrt{y}\);
reciprocal: \(g(y) = 1 / y\).
A good rule of thumb:
If the response is strictly positive and spans multiple orders of magnitude, trying a log transform is almost always reasonable.
Log-transforming salary
For Initech we try
In Python we can express this directly in the model formula:
import numpy as np
log_mod = smf.ols("np.log(salary) ~ years", data=initech).fit()
print(log_mod.summary())
Note a few things:
The response is now
np.log(salary). We did not create a new column; Patsy (the formula library used bystatsmodels) evaluates the NumPy call on the fly.The model is still linear in the parameters \(\beta_0\) and \(\beta_1\).
To visualize:
scatterplot of
yearsversusnp.log(salary)with the fitted straight line;residual plots for
log_mod.
You should see:
the residual spread is much more constant;
the Normal Q–Q plot looks closer to a straight line.
Interpreting coefficients on the original scale
On the transformed scale we have
Exponentiating both sides,
Each additional year of experience multiplies the median salary by
For example, if \(\hat{\beta}_1 = 0.079\), then
meaning “about an 8% increase in salary per year of experience”.
Comparing fit on the original scale
“Smaller residual standard error” is not comparable across models that use different scales. Instead, compare the root mean squared error on the original salary scale:
# original model
rmse_lin = np.sqrt(np.mean((initech["salary"] - lin_mod.fittedvalues) ** 2))
# log model, transformed back to dollars
rmse_log = np.sqrt(
np.mean((initech["salary"] - np.exp(log_mod.fittedvalues)) ** 2)
)
rmse_lin, rmse_log
Typically rmse_log is smaller, supporting the transformed model.
14.1.1 The Box–Cox family
The log transform works well here, but we can also let the data suggest a transformation.
The Box–Cox family of transforms for a strictly positive response \(y\) is
The idea:
For each candidate \(\lambda\), transform the response with \(g_\lambda\), fit a linear model, and compute the log-likelihood.
Choose the \(\lambda\) that maximizes this likelihood (or a nearby “nice” value, like 0, 0.5, 1, -0.5, etc.).
Optionally, build a confidence interval for \(\lambda\) to decide whether a simple transformation such as log is adequate.
In Python you can use scipy.stats.boxcox to obtain the MLE of
\(\lambda\):
from scipy import stats
y = initech["salary"].to_numpy()
# returns transformed y and the MLE lambda_
y_bc, lambda_ = stats.boxcox(y)
lambda_
You can then fit a model to y_bc instead of salary and compare
diagnostics as before.
For the Initech data the Box–Cox profile (not shown here) typically puts \(\lambda=0\) (log) very near the maximum and inside its confidence interval, justifying our simpler choice.
14.2 Transforming predictors and using polynomials
So far we transformed the response. We can also transform predictors:
to make non-linear relationships look linear;
to build more flexible models while staying within the linear regression framework.
Log–log relationships
Recall the Auto MPG data used in earlier chapters. Suppose we want to
model fuel economy mpg as a function of horsepower hp:
auto = pd.read_csv("data/autompg.csv")
auto.head()
A simple linear model often shows curvature and non-constant variance:
mpg_hp_lin = smf.ols("mpg ~ hp", data=auto).fit()
mpg_hp_lin.summary()
Diagnostic plots usually reveal:
mpg decreases as hp increases;
the relationship is not quite linear;
residual variance grows for large hp.
A common fix is to log-transform one or both variables.
mpg_hp_logy = smf.ols("np.log(mpg) ~ hp", data=auto).fit()
mpg_hp_loglog = smf.ols("np.log(mpg) ~ np.log(hp)", data=auto).fit()
The log–log model
has a convenient interpretation:
the percent change in mpg for a 1% change in horsepower (holding other variables fixed).
In practice, residual plots for the log–log model are often much cleaner than for the raw variables.
14.2.1 Polynomial regression
Another powerful tool is to include polynomial terms of a predictor.
Example: marketing and diminishing returns
Suppose data/marketing.csv contains monthly sales of a product
(sales) and the advertising budget (advert), both measured in
tens of thousands of dollars.
Plotting the data suggests that
sales increase with advertising,
but with diminishing returns.
We first fit a straight line:
marketing = pd.read_csv("data/marketing.csv")
mod_lin = smf.ols("sales ~ advert", data=marketing).fit()
mod_lin.summary()
A linear model ignores curvature. To capture diminishing returns we add a quadratic term:
In Patsy / statsmodels we can write
mod_quad = smf.ols("sales ~ advert + I(advert**2)", data=marketing).fit()
print(mod_quad.summary())
Key points:
I(advert**2)tells Patsy to treatadvert**2as a new predictor, not as part of formula syntax—I()stands for “inhibit”.With the linear term already in the model, the
I(advert**2)term is highly significant.Residual plots are much cleaner; the fitted curve bends the right way.
You can continue this pattern with higher-order terms, but beware of over-fitting and strange behavior outside the data range.
Over-fitting and extrapolation
In principle we can fit a polynomial of any degree:
With degree \(p-1 = n-1\) (one less than the number of points) a polynomial can perfectly interpolate the data: residuals are exactly zero.
That is rarely useful:
the fit becomes extremely wiggly;
predictions outside the observed range (extrapolation) can be absurd;
standard errors explode because the design matrix is nearly singular.
The moral:
Use low-degree polynomials (quadratic, maybe cubic, occasionally quartic).
Always check residuals and, importantly, plots of the fitted curve versus data, not just \(R^2\).
Example: fuel economy versus speed
Consider experimental data on a car’s fuel efficiency at different
speeds, stored in data/fuel_econ.csv with columns mph and
mpg.
We expect:
mpg increases with speed up to some optimal point,
then decreases again—roughly a smooth hump-shaped curve.
We can start with
econ = pd.read_csv("data/fuel_econ.csv")
fit1 = smf.ols("mpg ~ mph", data=econ).fit()
fit2 = smf.ols("mpg ~ mph + I(mph**2)", data=econ).fit()
fit4 = smf.ols("mpg ~ mph + I(mph**2) + I(mph**3) + I(mph**4)", data=econ).fit()
fit6 = smf.ols(
"mpg ~ mph + I(mph**2) + I(mph**3) + I(mph**4) + I(mph**5) + I(mph**6)",
data=econ,
).fit()
Use residual plots and F-tests for nested models to decide how far to go:
from statsmodels.stats.anova import anova_lm
anova_lm(fit4, fit6)
If the \(p\)-value is moderate but the residuals look clearly better
for fit6, you might keep the degree-6 model but still be cautious
about extrapolation beyond the range of observed speeds.
Orthogonal polynomials (optional)
High-degree raw polynomials (x, x**2, x**3, …) are often
highly correlated, which can lead to numerical instability and large
standard errors.
In R, poly(x, degree) constructs orthogonal polynomials. The
Patsy library used by statsmodels has similar tools but they are
less commonly used in basic workflows.
If you need high-degree polynomials in Python, a pragmatic strategy is:
use
numpy.vanderorsklearn.preprocessing.PolynomialFeaturesto generate columns;standardize predictors (center and scale) before forming high powers;
keep degrees fairly small unless you have a strong reason and plenty of data.
14.3 Examples in Python
This section sketches complete Python workflows that combine the ideas above. (Feel free to open a notebook and run them step by step.)
Example 1: fixing non-constant variance via log
Load Initech salary data and fit the straight-line model
salary ~ years.Reuse the diagnostic helper from Chapter 13 to make residual plots.
Fit the log-response model
np.log(salary) ~ years.Compare residual plots and RMSE on the original dollar scale.
Translate \(\hat{\beta}_1\) into a “percent change per year” interpretation.
Example 2: log–log relationship between mpg and horsepower
Load Auto MPG data from
data/autompg.csv.Fit three models:
mpg ~ hp,np.log(mpg) ~ hp, andnp.log(mpg) ~ np.log(hp).Compare:
\(R^2\) and residual standard error;
residual plots;
interpret the slope in the log–log model.
Example 3: polynomial fuel-economy curve
Load
data/fuel_econ.csv.Fit models with degrees 1, 2, 4, and 6 in
mph.For each model:
draw the fitted curve overlaid on the scatterplot;
inspect residual plots.
Use
anova_lmor an information criterion (AIC) to compare nested models.Choose a final degree (for example 4 or 6) and interpret its implications for “best speed for fuel economy”.
14.4 How this connects to PyStatsV1
Transformations and polynomial terms are building blocks that reappear throughout PyStatsV1:
In later material on model selection we will compare many models that differ only in which transformed variables they include.
In logistic regression and generalized linear models, link functions play a role very similar to response transformations here.
In chapters on experimental design, we often build models that include polynomial terms in quantitative factors (for example, quadratic response-surface models) and interactions between factors.
In applied work, much of the “art” of modeling is about choosing sensible transformations that make diagnostics (Chapter 13) look good without sacrificing interpretability.
PyStatsV1 examples and exercises will encourage you to:
try simple transformations when diagnostics suggest problems;
check that you can still explain the model in context (e.g., percent changes instead of additive changes);
avoid blindly adding very high-degree polynomials just because they increase \(R^2\).
14.5 What you should take away
By the end of this chapter you should be comfortable with:
When and why to transform the response:
to stabilize variance;
to make residuals more nearly Normal;
to convert additive effects into multiplicative (percentage) effects.
Using the log transform for strictly positive variables and interpreting coefficients back on the original scale.
The Box–Cox family as a way to choose a transformation:
the special role of \(\lambda = 0\) (log);
reading the profile of log-likelihood vs \(\lambda\).
Transforming predictors:
log and other monotone transforms;
building polynomial regression models with
I(x**2),I(x**3), etc.;using nested model comparisons (or AIC) to decide how much complexity is warranted.
Recognizing over-fitting and the dangers of extrapolating high-degree polynomials.
The distinction between:
the statistical model (what assumptions we make about \(Y\mid X\));
the formula language we use to tell Python which transformed variables to include.
Most importantly:
Diagnostics come first; transformations are tools you apply in response to what the diagnostics show.