Applied Statistics with Python – Chapter 12
===========================================

Analysis of variance (ANOVA) and experiments
--------------------------------------------

This chapter parallels the “Analysis of Variance” material from the R notes,
but is written for a Python-first audience.

So far in the mini-book we have mostly:

* worked with **numeric predictors** (chapters on simple and multiple regression),
* treated real data sets as if they were “just given,” without asking how they
  were produced.

Here we:

* draw a sharp line between **observational** and **experimental** data,
* show how experiments naturally lead to **ANOVA models**,
* connect classical ANOVA to the regression tools you have already seen,
* and map the R functions to Python:

  * R: ``t.test()``, ``aov()``, ``pairwise.t.test()``, ``TukeyHSD()``
  * Python: :mod:`scipy.stats`, :mod:`statsmodels` formula API, and
    :mod:`statsmodels.stats.multicomp`.

The key message:

*Regression tools are not only for observational data; in experimental settings
we use the same modelling ideas, but we are allowed to make much stronger
causal statements because the predictors are under our control.*

12.1 Experiments: observational vs experimental data
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The most important question about a data set is **how it was generated**.

Observational study
  *Both* predictors and response are observed.
  No one controlled who received which level of a predictor.

  Examples: hospital records, government surveys, user-behaviour logs.

Experiment
  The analyst **chooses** the levels of one or more predictors, applies them to
  subjects, then observes the response.

  Examples: drug vs placebo trials, A/B tests on a website, training
  interventions for athletes.

Terminology for experiments
---------------------------

In experimental settings we rename things slightly:

* **Factors** – predictors that are controlled by the experimenter
  (treatment group, diet, machine setting, etc.).
* **Levels** – possible values of a factor (control vs treatment, A/B/C, low /
  medium / high).
* **Subjects** – experimental units (people, animals, plots of land, machines…).
* **Randomization** – subjects are randomly assigned to factor levels.

Randomization is crucial:

* it balances **unobserved** variables across groups on average;
* it justifies treating observations as independent draws from a model.

Throughout this chapter we will:

* write models in symbols,
* show the equivalent R formula (for reference),
* and give the Python version with :mod:`statsmodels`.

12.2 Two-sample t-test as a tiny experiment
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The simplest experimental design is a **two-group experiment**:

* one factor (for example “group”) with two levels: control and treatment;
* subjects randomly assigned to a level;
* one numeric response measured after treatment.

Mathematical model
------------------

We assume

.. math::

   Y_{1j} \sim N(\mu_1, \sigma^2), \quad
   Y_{2j} \sim N(\mu_2, \sigma^2),

for subjects in groups 1 and 2 respectively.

We test

.. math::

   H_0: \mu_1 = \mu_2
   \quad \text{vs} \quad
   H_1: \mu_1 \ne \mu_2.

In Chapter 5 we met the **two-sample t-test** for this situation.

R vs Python
-----------

R (formula interface):

.. code-block:: r

   t.test(sleep ~ group, data = melatonin, var.equal = TRUE)

Python (using :mod:`scipy.stats`):

.. code-block:: python

   import numpy as np
   from scipy import stats

   control = melatonin.loc[melatonin["group"] == "control", "sleep"]
   treatment = melatonin.loc[melatonin["group"] == "treatment", "sleep"]

   t_stat, p_value = stats.ttest_ind(control, treatment, equal_var=True)

The important conceptual link:

*The two-sample t-test is a special case of one-way ANOVA with two groups.*
The next section generalizes to any number of groups.

12.3 One-way ANOVA
~~~~~~~~~~~~~~~~~~

12.3.1 Model and intuition
--------------------------

Suppose we have one factor with :math:`g` levels (groups), such as four diets
A, B, C, and D. With :math:`n_i` observations in group :math:`i`, we write

.. math::

   Y_{ij} = \mu_i + \varepsilon_{ij}, \qquad
   \varepsilon_{ij} \sim N(0, \sigma^2), \quad i = 1,\dots,g.

Equivalently,

.. math::

   Y_{ij} = \mu + \alpha_i + \varepsilon_{ij}, \qquad
   \sum_{i=1}^g \alpha_i = 0.

Here:

* :math:`\mu` is the overall mean.
* :math:`\alpha_i` is the **effect** of group :math:`i` (difference from
  the overall mean).
* :math:`\sigma^2` is the common within-group variance.

We test whether the group means are all equal:

.. math::

   H_0: \mu_1 = \mu_2 = \cdots = \mu_g
   \quad \text{vs} \quad
   H_1: \text{at least one } \mu_i \text{ is different}.

ANOVA works by decomposing variability:

* **Between-group** variability – how far the group means are from the overall
  mean.
* **Within-group** variability – how much observations vary around their own
  group mean.

If between-group variation is large relative to within-group variation, the
group means are unlikely to be all the same.

Sums of squares (conceptually)
------------------------------

We can write three key quantities:

* Total variation:

  .. math::

     SS_T = \sum_{i=1}^g \sum_{j=1}^{n_i} (y_{ij} - \bar{y})^2.

* Between-group variation:

  .. math::

     SS_B = \sum_{i=1}^g n_i(\bar{y}_i - \bar{y})^2.

* Within-group variation:

  .. math::

     SS_W = \sum_{i=1}^g \sum_{j=1}^{n_i} (y_{ij} - \bar{y}_i)^2.

These satisfy :math:`SS_T = SS_B + SS_W`.

The ANOVA :math:`F` statistic compares mean squares:

.. math::

   MS_B = \frac{SS_B}{g-1}, \qquad
   MS_W = \frac{SS_W}{N-g}, \qquad
   F = \frac{MS_B}{MS_W},

where :math:`N = \sum_i n_i` is the total sample size.

Under :math:`H_0`, :math:`F` has an :math:`F_{g-1, N-g}` distribution.

12.3.2 One-way ANOVA in Python
------------------------------

Suppose we have a DataFrame with columns:

* ``response`` – numeric outcome,
* ``group`` – categorical factor (diet, treatment, etc.).

Using :mod:`statsmodels`:

.. code-block:: python

   import statsmodels.api as sm
   import statsmodels.formula.api as smf

   model = smf.ols("response ~ C(group)", data=df).fit()
   anova_table = sm.stats.anova_lm(model, typ=2)
   print(anova_table)

Key points:

* ``C(group)`` tells :mod:`statsmodels` to treat ``group`` as categorical.
* The ANOVA table includes sums of squares, degrees of freedom, mean squares,
  :math:`F` statistic and p-value.

R uses

.. code-block:: r

   aov(response ~ group, data = df)

The underlying ideas are the same; both are just linear models with dummy
variables for the groups.

12.3.3 Factor variables and categorical dtype
---------------------------------------------

In R you must ensure the grouping variable is a **factor**; otherwise ANOVA
silently becomes a regression with a numeric predictor.

In Python / pandas:

* Either wrap the variable in ``C()`` in the formula, **or**
* set ``df["group"] = df["group"].astype("category")``.

If you forget and leave it numeric, the model becomes “line through the codes”
(1, 2, 3, …) rather than separate means per group.

12.3.4 Simulating the F distribution in Python
----------------------------------------------

A good sanity check is to simulate from a null model (equal means) and verify
that the empirical distribution of :math:`F` matches the theoretical
:math:`F` distribution.

Skeleton code:

.. code-block:: python

   import numpy as np
   import pandas as pd
   import statsmodels.api as sm
   import statsmodels.formula.api as smf

   rng = np.random.default_rng(seed=123)

   def sim_oneway_F(n_per_group=10, g=4, sigma=1.0):
       # Null model: all means = 0
       groups = np.repeat(np.arange(g), n_per_group)
       y = rng.normal(loc=0.0, scale=sigma, size=g * n_per_group)
       df = pd.DataFrame({"y": y, "group": groups})
       model = smf.ols("y ~ C(group)", data=df).fit()
       table = sm.stats.anova_lm(model, typ=2)
       return table.loc["C(group)", "F"]

   f_stats = np.array([sim_oneway_F() for _ in range(5000)])

You can then plot a histogram of ``f_stats`` and overlay an
``scipy.stats.f`` density to see the agreement.

12.3.5 Power via simulation
---------------------------

Because experiments cost time and money, we care about **power**:

.. math::

   \text{Power} = P(\text{reject } H_0 \mid H_0 \text{ false}).

For one-way ANOVA this depends on:

* effect sizes (how far the group means are apart),
* noise level :math:`\sigma`,
* sample size and balance (equal :math:`n_i` helps),
* significance level :math:`\alpha`.

We can use almost the same simulation function as above, but now draw from
unequal means and record how often the ANOVA p-value is below :math:`\alpha`.

Sketch:

.. code-block:: python

   from scipy import stats

   def sim_oneway_p(mu, n_per_group=10, sigma=1.0):
       mu = np.asarray(mu)
       g = len(mu)
       groups = np.repeat(np.arange(g), n_per_group)
       means = np.repeat(mu, n_per_group)
       y = rng.normal(loc=means, scale=sigma, size=g * n_per_group)
       df = pd.DataFrame({"y": y, "group": groups})
       model = smf.ols("y ~ C(group)", data=df).fit()
       table = sm.stats.anova_lm(model, typ=2)
       return table.loc["C(group)", "PR(>F)"]

   p_vals = np.array([sim_oneway_p(mu=[-1, 0, 0, 1], sigma=1.5)
                      for _ in range(1000)])

   power_05 = np.mean(p_vals < 0.05)
   power_01 = np.mean(p_vals < 0.01)

By varying ``mu``, ``sigma``, and sample size you can explore how design
choices affect power.

12.4 Post-hoc comparisons and multiple testing
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

ANOVA answers a global question:

*“Are all group means equal?”*

If the F-test is significant, we often want to know **which means differ**.

Naive approach
--------------

Run all pairwise two-sample t-tests and look at their p-values. But if there
are many groups, this inflates the **family-wise error rate (FWER)**: the
probability of at least one false positive among the whole family of tests.

Example: with 10 independent tests at :math:`\alpha = 0.05`, the probability
of at least one false positive is much larger than 0.05.

Bonferroni adjustment
---------------------

One simple fix:

*Either* use a stricter per-test level :math:`\alpha / m` for :math:`m` tests,  
*or* multiply each p-value by :math:`m` (capped at 1).

Python:

.. code-block:: python

   from statsmodels.stats.multitest import multipletests

   # pvals: array of unadjusted p-values from pairwise t-tests
   reject, pvals_adj, _, _ = multipletests(pvals, alpha=0.05,
                                           method="bonferroni")

Tukey’s HSD
-----------

For “all pairwise comparisons of group means after one-way ANOVA” there is a
classical procedure: **Tukey’s Honest Significant Difference**.

In Python, :mod:`statsmodels` provides:

.. code-block:: python

   from statsmodels.stats.multicomp import pairwise_tukeyhsd

   tukey = pairwise_tukeyhsd(endog=df["response"],
                             groups=df["group"],
                             alpha=0.05)
   print(tukey.summary())

This reports:

* which pairs of means differ significantly,
* adjusted p-values,
* simultaneous confidence intervals for mean differences.

12.5 Two-way ANOVA and interactions
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

One-way ANOVA handles one factor. Real experiments often manipulate **two or
more factors at once**.

Example design:

* factor A: three types of antibiotic (I, II, III),
* factor B: four treatments (A, B, C, D),
* response: survival time of a bacteria cluster.

Model with interaction
----------------------

With factors A and B, levels :math:`i = 1,\dots,I`, :math:`j=1,\dots,J`, and
replicates :math:`k = 1,\dots,K` per cell, we write

.. math::

   Y_{ijk} = \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij} + \varepsilon_{ijk},

where :math:`\varepsilon_{ijk} \sim N(0, \sigma^2)`.

* :math:`\alpha_i` – main effect of factor A,
* :math:`\beta_j` – main effect of factor B,
* :math:`(\alpha\beta)_{ij}` – **interaction** between A and B.

Interpretation of interaction:

*The effect of A depends on the level of B (or vice versa).*

Additive model (no interaction)
-------------------------------

If all interaction terms are zero we have

.. math::

   Y_{ijk} = \mu + \alpha_i + \beta_j + \varepsilon_{ijk}.

The difference between two levels of factor A is the same at every level of
factor B.

Model hierarchy and testing strategy
------------------------------------

Typically we proceed in this order:

1. Fit the **full interaction model**.
2. Test whether the interaction is significant.
3. If interaction *is* significant, stop there and interpret it.
4. If interaction is *not* significant, drop it and fit the **additive model**
   with main effects only.
5. Within the additive model you can test main effects and use Tukey-type
   post-hoc comparisons on each factor.

Two-way ANOVA in Python
-----------------------

Assume a DataFrame ``bacteria`` with columns ``time`` (response),
``antibiotic`` (factor A), and ``treat`` (factor B).

Full interaction model:

.. code-block:: python

   import statsmodels.api as sm
   import statsmodels.formula.api as smf

   model_int = smf.ols("time ~ C(antibiotic) * C(treat)", data=bacteria).fit()
   anova_int = sm.stats.anova_lm(model_int, typ=2)
   print(anova_int)

If the ``antibiotic:treat`` row has a small p-value, keep this model and examine
estimated means for each combination.

To get cell means:

.. code-block:: python

   import itertools
   import pandas as pd

   levels_antibiotic = bacteria["antibiotic"].unique()
   levels_treat = bacteria["treat"].unique()

   grid = pd.DataFrame(
       list(itertools.product(levels_antibiotic, levels_treat)),
       columns=["antibiotic", "treat"],
   )
   cell_means = model_int.predict(grid)
   grid["mean_time"] = cell_means
   print(grid)

If the interaction is not significant, fit the additive model:

.. code-block:: python

   model_add = smf.ols("time ~ C(antibiotic) + C(treat)", data=bacteria).fit()
   anova_add = sm.stats.anova_lm(model_add, typ=2)

   print(anova_add)

You can then run Tukey’s HSD separately on ``antibiotic`` and ``treat``.

Interaction plots
-----------------

Before fitting any models, it is helpful to **plot group means**.

In Python you can approximate R’s ``interaction.plot`` using a line plot of
mean response vs one factor, with a separate line for each level of the other
factor (for example with :mod:`seaborn`’s ``pointplot``).

Parallel lines suggest an additive model; clearly crossing lines suggest an
interaction.

12.6 What you should take away
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

By the end of this chapter you should be comfortable with:

* clearly distinguishing **observational** vs **experimental** data,
* explaining why experiments (with randomization) are needed for causal
  statements,
* writing down the models for

  * two-sample t-tests,
  * one-way ANOVA,
  * two-way ANOVA with and without interaction,

* reading an ANOVA table: sums of squares, degrees of freedom, mean squares,
  :math:`F` statistics, and p-values,
* using Python tools to perform

  * two-sample t-tests (:mod:`scipy.stats`),
  * one-way ANOVA (:mod:`statsmodels` with ``C(group)``),
  * post-hoc comparisons with multiple-testing corrections,
  * two-way ANOVA and interaction tests,

* understanding the idea of **power** and how simulation can help plan sample
  sizes,
* appreciating that “statistically significant” is not automatically
  “scientifically important” – you still need to think about effect sizes.

12.7 How this connects to PyStatsV1
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

In later PyStatsV1 chapters and case studies you will see these ideas appear in
several ways:

* using ANOVA as a special case of the linear models you already know,
* connecting experimental designs to regression models with dummy variables,
* simulating power curves before you “run an experiment in code,”
* contrasting what you can say from experimental data versus observational
  data using the same modelling tools.

If you work in psychology, sports science, education, or any field with
interventions, this chapter provides the conceptual bridge from PyStatsV1’s
regression techniques to the world of **experimental design** and **evidence-based
practice**.