.. _applied_stats_with_python_ch14_transformations:

Applied Statistics with Python – Chapter 14
===========================================

Transformations
---------------

In Chapter 13 we focused on *diagnosing* regression models: checking
assumptions, looking for non-constant variance, and finding unusual
observations.

This chapter asks a natural follow-up question:

.. rst-class:: spaced

*What can we do when the diagnostics say our model is not OK?*

A major tool is to transform variables:

* transform the **response** to stabilize variance or make residuals
  more nearly Normal;
* transform **predictors** to capture non-linear relationships while
  keeping a linear model;
* use **polynomial terms** to fit smooth curves.

By the end of this chapter you will be able to:

* explain what a variance-stabilizing transformation is and why it
  matters;
* fit and interpret regression models with a log-transformed response;
* use the Box–Cox family to choose a transformation automatically;
* transform predictors (for example, log–log relationships);
* fit polynomial regression models using ``statsmodels`` (and understand
  how Patsy’s ``I()`` works);
* recognize over-fitting and the dangers of extrapolating high-degree
  polynomials.


14.1 Response transformations
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

We start from a simple but common problem: the residual variance grows
with the mean.

Example: salaries at Initech
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Suppose we have data on salary versus years of experience at a fictional
company Initech, stored in ``data/initech.csv``:

.. code-block:: python

   import pandas as pd
   import statsmodels.formula.api as smf

   initech = pd.read_csv("data/initech.csv")
   initech.head()

We first fit a simple linear regression,

.. math::

   Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i,

where :math:`Y_i` is salary and :math:`x_i` is years of experience.

.. code-block:: python

   lin_mod = smf.ols("salary ~ years", data=initech).fit()
   print(lin_mod.summary())

If you reuse the residual plots from Chapter 13 you will typically see:

* the *mean* relationship is roughly linear;
* the *variance* of the residuals increases with the fitted value.

This violates the constant variance assumption
(:math:`\mathsf{Var}[\varepsilon_i] = \sigma^2` for all :math:`i`).

Variance-stabilizing transformations
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

In symbols, our assumption is

.. math::

   \varepsilon \sim N(0, \sigma^2), \qquad
   \text{so } \operatorname{Var}[Y\mid X=x] = \sigma^2.

But in the Initech plot the variance looks like it depends on the mean:

.. math::

   \operatorname{Var}[Y\mid X=x] = h\big(\mathbb{E}[Y\mid X=x]\big),

for some increasing function :math:`h`.

A **variance-stabilizing transformation** is a function :math:`g` of the
response such that

.. math::

   \operatorname{Var}[g(Y)\mid X=x] \approx c,

a constant that no longer changes with the mean.

In practice we often try simple monotone transforms:

* log: :math:`g(y) = \log y`;
* square root: :math:`g(y) = \sqrt{y}`;
* reciprocal: :math:`g(y) = 1 / y`.

A good rule of thumb:

* If the response is strictly positive and spans multiple orders of
  magnitude, trying a log transform is almost always reasonable.

Log-transforming salary
^^^^^^^^^^^^^^^^^^^^^^^

For Initech we try

.. math::

   \log Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i.

In Python we can express this directly in the model formula:

.. code-block:: python

   import numpy as np

   log_mod = smf.ols("np.log(salary) ~ years", data=initech).fit()
   print(log_mod.summary())

Note a few things:

* The *response* is now ``np.log(salary)``. We did **not** create a new
  column; Patsy (the formula library used by ``statsmodels``) evaluates
  the NumPy call on the fly.
* The model is still *linear* in the parameters :math:`\beta_0` and
  :math:`\beta_1`.

To visualize:

* scatterplot of ``years`` versus ``np.log(salary)`` with the fitted
  straight line;
* residual plots for ``log_mod``.

You should see:

* the residual spread is much more constant;
* the Normal Q–Q plot looks closer to a straight line.

Interpreting coefficients on the original scale
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

On the transformed scale we have

.. math::

   \log \hat{y}(x) = \hat{\beta}_0 + \hat{\beta}_1 x.

Exponentiating both sides,

.. math::

   \hat{y}(x)
   = \exp(\hat{\beta}_0)\,\exp(\hat{\beta}_1 x).

Each **additional year of experience** multiplies the *median* salary by

.. math::

   \exp(\hat{\beta}_1).

For example, if :math:`\hat{\beta}_1 = 0.079`, then

.. math::

   \exp(0.079) \approx 1.08,

meaning “about an 8% increase in salary per year of experience”.

Comparing fit on the original scale
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

“Smaller residual standard error” is not comparable across models that
use different scales. Instead, compare the root mean squared error on
the original salary scale:

.. code-block:: python

   # original model
   rmse_lin = np.sqrt(np.mean((initech["salary"] - lin_mod.fittedvalues) ** 2))

   # log model, transformed back to dollars
   rmse_log = np.sqrt(
       np.mean((initech["salary"] - np.exp(log_mod.fittedvalues)) ** 2)
   )

   rmse_lin, rmse_log

Typically ``rmse_log`` is smaller, supporting the transformed model.


14.1.1 The Box–Cox family
^^^^^^^^^^^^^^^^^^^^^^^^^^

The log transform works well here, but we can also *let the data* suggest
a transformation.

The **Box–Cox family** of transforms for a strictly positive response
:math:`y` is

.. math::

   g_\lambda(y) =
   \begin{cases}
     \dfrac{y^\lambda - 1}{\lambda}, & \lambda \neq 0, \\
     \log y, & \lambda = 0.
   \end{cases}

The idea:

* For each candidate :math:`\lambda`, transform the response with
  :math:`g_\lambda`, fit a linear model, and compute the log-likelihood.
* Choose the :math:`\lambda` that maximizes this likelihood (or a nearby
  “nice” value, like 0, 0.5, 1, -0.5, etc.).
* Optionally, build a confidence interval for :math:`\lambda` to decide
  whether a simple transformation such as log is adequate.

In Python you can use ``scipy.stats.boxcox`` to obtain the MLE of
:math:`\lambda`:

.. code-block:: python

   from scipy import stats

   y = initech["salary"].to_numpy()
   # returns transformed y and the MLE lambda_
   y_bc, lambda_ = stats.boxcox(y)
   lambda_

You can then fit a model to ``y_bc`` instead of ``salary`` and compare
diagnostics as before.

For the Initech data the Box–Cox profile (not shown here) typically
puts :math:`\lambda=0` (log) very near the maximum and inside its
confidence interval, justifying our simpler choice.


14.2 Transforming predictors and using polynomials
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

So far we transformed the *response*. We can also transform
**predictors**:

* to make non-linear relationships look linear;
* to build more flexible models while staying within the linear
  regression framework.

Log–log relationships
^^^^^^^^^^^^^^^^^^^^^

Recall the Auto MPG data used in earlier chapters. Suppose we want to
model fuel economy ``mpg`` as a function of horsepower ``hp``:

.. code-block:: python

   auto = pd.read_csv("data/autompg.csv")
   auto.head()

A simple linear model often shows curvature and non-constant variance:

.. code-block:: python

   mpg_hp_lin = smf.ols("mpg ~ hp", data=auto).fit()
   mpg_hp_lin.summary()

Diagnostic plots usually reveal:

* mpg decreases as hp increases;
* the relationship is not quite linear;
* residual variance grows for large hp.

A common fix is to log-transform one or both variables.

.. code-block:: python

   mpg_hp_logy = smf.ols("np.log(mpg) ~ hp", data=auto).fit()
   mpg_hp_loglog = smf.ols("np.log(mpg) ~ np.log(hp)", data=auto).fit()

The log–log model

.. math::

   \log(\text{mpg}) = \beta_0 + \beta_1 \log(\text{hp}) + \varepsilon

has a convenient interpretation:

.. math::

   \beta_1 \approx \text{elasticity of mpg with respect to hp},

the percent change in mpg for a 1% change in horsepower (holding other
variables fixed).

In practice, residual plots for the log–log model are often much cleaner
than for the raw variables.


14.2.1 Polynomial regression
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Another powerful tool is to include **polynomial terms** of a predictor.

Example: marketing and diminishing returns
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Suppose ``data/marketing.csv`` contains monthly sales of a product
(``sales``) and the advertising budget (``advert``), both measured in
tens of thousands of dollars.

Plotting the data suggests that

* sales increase with advertising,
* but with *diminishing returns*.

We first fit a straight line:

.. code-block:: python

   marketing = pd.read_csv("data/marketing.csv")

   mod_lin = smf.ols("sales ~ advert", data=marketing).fit()
   mod_lin.summary()

A linear model ignores curvature. To capture diminishing returns we add a
quadratic term:

.. math::

   Y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \varepsilon_i.

In Patsy / ``statsmodels`` we can write

.. code-block:: python

   mod_quad = smf.ols("sales ~ advert + I(advert**2)", data=marketing).fit()
   print(mod_quad.summary())

Key points:

* ``I(advert**2)`` tells Patsy to treat ``advert**2`` as a new predictor,
  not as part of formula syntax—``I()`` stands for “inhibit”.
* With the linear term already in the model, the ``I(advert**2)`` term
  is highly significant.
* Residual plots are much cleaner; the fitted curve bends the right way.

You can continue this pattern with higher-order terms, but beware of
over-fitting and strange behavior outside the data range.

Over-fitting and extrapolation
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

In principle we can fit a polynomial of any degree:

.. math::

   Y_i = \beta_0 + \beta_1 x_i + \dots + \beta_{p-1} x_i^{p-1}
         + \varepsilon_i.

With degree :math:`p-1 = n-1` (one less than the number of points) a
polynomial can *perfectly interpolate* the data: residuals are exactly
zero.

That is rarely useful:

* the fit becomes extremely wiggly;
* predictions outside the observed range (extrapolation) can be absurd;
* standard errors explode because the design matrix is nearly singular.

The moral:

* Use low-degree polynomials (quadratic, maybe cubic, occasionally
  quartic).
* Always check residuals and, importantly, **plots of the fitted curve
  versus data**, not just :math:`R^2`.

Example: fuel economy versus speed
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Consider experimental data on a car’s fuel efficiency at different
speeds, stored in ``data/fuel_econ.csv`` with columns ``mph`` and
``mpg``.

We expect:

* mpg increases with speed up to some optimal point,
* then decreases again—roughly a smooth hump-shaped curve.

We can start with

.. code-block:: python

   econ = pd.read_csv("data/fuel_econ.csv")

   fit1 = smf.ols("mpg ~ mph", data=econ).fit()
   fit2 = smf.ols("mpg ~ mph + I(mph**2)", data=econ).fit()
   fit4 = smf.ols("mpg ~ mph + I(mph**2) + I(mph**3) + I(mph**4)", data=econ).fit()
   fit6 = smf.ols(
       "mpg ~ mph + I(mph**2) + I(mph**3) + I(mph**4) + I(mph**5) + I(mph**6)",
       data=econ,
   ).fit()

Use residual plots and F-tests for nested models to decide how far to
go:

.. code-block:: python

   from statsmodels.stats.anova import anova_lm

   anova_lm(fit4, fit6)

If the :math:`p`-value is moderate but the residuals look clearly better
for ``fit6``, you might keep the degree-6 model *but* still be cautious
about extrapolation beyond the range of observed speeds.

Orthogonal polynomials (optional)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

High-degree raw polynomials (``x``, ``x**2``, ``x**3``, …) are often
highly correlated, which can lead to numerical instability and large
standard errors.

In R, ``poly(x, degree)`` constructs **orthogonal polynomials**. The
Patsy library used by ``statsmodels`` has similar tools but they are
less commonly used in basic workflows.

If you need high-degree polynomials in Python, a pragmatic strategy is:

* use ``numpy.vander`` or ``sklearn.preprocessing.PolynomialFeatures``
  to generate columns;
* standardize predictors (center and scale) before forming high powers;
* keep degrees fairly small unless you have a strong reason and plenty
  of data.


14.3 Examples in Python
~~~~~~~~~~~~~~~~~~~~~~~

This section sketches complete Python workflows that combine the ideas
above. (Feel free to open a notebook and run them step by step.)

Example 1: fixing non-constant variance via log
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

1. Load Initech salary data and fit the straight-line model
   ``salary ~ years``.
2. Reuse the diagnostic helper from Chapter 13 to make residual plots.
3. Fit the log-response model ``np.log(salary) ~ years``.
4. Compare residual plots and RMSE on the original dollar scale.
5. Translate :math:`\hat{\beta}_1` into a “percent change per year”
   interpretation.

Example 2: log–log relationship between mpg and horsepower
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

1. Load Auto MPG data from ``data/autompg.csv``.
2. Fit three models:
   ``mpg ~ hp``, ``np.log(mpg) ~ hp``, and
   ``np.log(mpg) ~ np.log(hp)``.
3. Compare:

   * :math:`R^2` and residual standard error;
   * residual plots;
   * interpret the slope in the log–log model.

Example 3: polynomial fuel-economy curve
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

1. Load ``data/fuel_econ.csv``.
2. Fit models with degrees 1, 2, 4, and 6 in ``mph``.
3. For each model:

   * draw the fitted curve overlaid on the scatterplot;
   * inspect residual plots.

4. Use ``anova_lm`` or an information criterion (AIC) to compare nested
   models.
5. Choose a final degree (for example 4 or 6) and interpret its
   implications for “best speed for fuel economy”.


14.4 How this connects to PyStatsV1
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Transformations and polynomial terms are building blocks that reappear
throughout PyStatsV1:

* In later material on **model selection** we will compare many models
  that differ only in which transformed variables they include.
* In **logistic regression** and generalized linear models, link
  functions play a role very similar to response transformations here.
* In chapters on **experimental design**, we often build models that
  include polynomial terms in quantitative factors (for example,
  quadratic response-surface models) and interactions between factors.
* In applied work, much of the “art” of modeling is about choosing
  sensible transformations that make diagnostics (Chapter 13) look good
  without sacrificing interpretability.

PyStatsV1 examples and exercises will encourage you to:

* try simple transformations when diagnostics suggest problems;
* check that you can still explain the model in context (e.g., percent
  changes instead of additive changes);
* avoid blindly adding very high-degree polynomials just because they
  increase :math:`R^2`.


14.5 What you should take away
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

By the end of this chapter you should be comfortable with:

* **When and why** to transform the response:

  * to stabilize variance;
  * to make residuals more nearly Normal;
  * to convert additive effects into multiplicative (percentage) effects.

* Using the **log transform** for strictly positive variables and
  interpreting coefficients back on the original scale.

* The **Box–Cox family** as a way to choose a transformation:

  * the special role of :math:`\lambda = 0` (log);
  * reading the profile of log-likelihood vs :math:`\lambda`.

* Transforming **predictors**:

  * log and other monotone transforms;
  * building **polynomial regression** models with ``I(x**2)``,
    ``I(x**3)``, etc.;
  * using nested model comparisons (or AIC) to decide how much
    complexity is warranted.

* Recognizing **over-fitting** and the dangers of extrapolating high-degree
  polynomials.

* The distinction between:

  * the *statistical model* (what assumptions we make about
    :math:`Y\mid X`);
  * the *formula* language we use to tell Python which transformed
    variables to include.

Most importantly:

.. rst-class:: spaced

*Diagnostics come first; transformations are tools you apply in response
to what the diagnostics show.*