.. _psych_ch16b_pingouin_regression:

======================================================
Chapter 16b – Regression Diagnostics with Pingouin
======================================================

*Track B: Psychological Science & Statistics – Appendix to Chapter 16*

Overview
========

In Chapter 16, we introduced linear regression as a tool for prediction
and interpretation. We focused on

* the **line of best fit** (:math:`Y' = bX + a`),
* the **least squares** criterion,
* the **standard error of the estimate**, and
* **multiple regression** (predicting behavior from multiple variables).

However, a good PyStatsV1 workflow does **not** stop after fitting a model.
We must *engineer* our results by checking whether the model and the data
behave as the assumptions require.

This appendix shows how to:

* use :mod:`pingouin` to fit a multiple regression model,
* compute standard regression diagnostics (residuals, leverage,
  Cook's distance),
* identify potentially influential observations, and
* illustrate the dangers of relying only on summary statistics using
  **Anscombe's Quartet**.

The goal is to give students a reproducible, testable set of tools they can
reuse in their own projects.

Learning goals
==============

After working through this appendix, you should be able to:

1. Explain the difference between **good fit** (e.g., high :math:`R^2`)
   and **good model** (reasonable assumptions).
2. Interpret standard regression diagnostics:

   * residuals and standardized residuals,
   * leverage (hat values),
   * Cook's distance.

3. Use :mod:`pingouin`'s regression tools together with NumPy and
   pandas to compute these diagnostics.
4. Explain how **Anscombe's Quartet** shows that

   * identical means, variances, and correlations can hide very different
     data patterns,
   * visualization and diagnostics are crucial in a PyStatsV1 workflow.

Files for this appendix
=======================

This appendix uses the following PyStatsV1 files:

* **Script**: ``scripts/psych_ch16b_pingouin_regression_diagnostics.py``

  - simulates a psychology regression dataset (reusing the Chapter 16
    data generator),
  - fits a multiple regression model with :func:`pingouin.linear_regression`,
  - computes regression diagnostics (residuals, leverage, Cook's distance),
  - identifies the most influential observations,
  - generates diagnostic plots,
  - constructs and analyzes **Anscombe's Quartet** to demonstrate why
    diagnostics and visualization matter.

* **Tests**:
  ``tests/test_psych_ch16b_pingouin_regression_diagnostics.py``

  - verify that diagnostics have the expected shape and properties,
  - check that leverage behaves as theory predicts,
  - ensure that model :math:`R^2` is in a reasonable range,
  - run the full end-to-end pipeline in a temporary directory,
  - verify that CSV and PNG outputs are written correctly,
  - check that the Anscombe datasets have nearly identical summary
    statistics while having different shapes.

* **Makefile targets** (added in a separate CI branch):

  - ``make psych-ch16b`` – run the diagnostics demo (including
    Anscombe's Quartet),
  - ``make test-psych-ch16b`` – run tests for this appendix only.

.. note::

   As with previous chapters, the script is written in a way that makes
   it easy to import its functions into other projects or Jupyter
   notebooks. The tests treat regression diagnostics as *software*
   objects that can be checked, versioned, and reused.

Section 1 – Regression diagnostics in practice
==============================================

Recall that a linear regression model makes several assumptions:

* **Linearity** – the relationship between predictors and outcome is
  approximately linear.
* **Homoscedasticity** – the spread (variance) of residuals is roughly
  constant across the range of fitted values.
* **Independence** – residuals are not systematically related to each
  other (e.g., no strong time trends).
* **Normality of residuals** – residuals are approximately normally
  distributed.

The :mod:`pingouin` function :func:`pingouin.linear_regression` focuses
primarily on **estimation** and **inference**:

* regression coefficients and standard errors,
* :math:`t`-tests and :math:`p`-values,
* :math:`R^2` and adjusted :math:`R^2`.

To check assumptions, we need additional diagnostics. In
``psych_ch16b_pingouin_regression_diagnostics.py`` we therefore:

1. Simulate a dataset that extends the Chapter 16 example, with
   variables such as:

   * ``stress``
   * ``sleep_hours``
   * ``study_hours``
   * ``motivation``
   * ``exam_score`` (outcome)

2. Fit a multiple regression model predicting ``exam_score`` from
   several predictors.
3. Compute diagnostics using NumPy and pandas:

   * **fitted values** – model predictions :math:`\\hat{y}`,
   * **residuals** – observed minus fitted (:math:`y - \\hat{y}`),
   * **standardized residuals** – residuals scaled by their estimated
     standard deviation,
   * **leverage** – hat values on the diagonal of the hat matrix,
     :math:`H = X (X'X)^{-1} X'`,
   * **Cook's distance** – a measure of how much the regression
     coefficients would change if we removed a given observation.

4. Save the diagnostics to CSV and plot simple diagnostics:

   * **Residuals vs Fitted** plot – to check linearity and
     homoscedasticity.
   * **Leverage vs Cook's distance** plot – to identify high-leverage,
     influential observations.

Interpreting diagnostics (high level)
-------------------------------------

* Residuals should be roughly centered around zero. A clear curve or
  pattern in residuals vs fitted values suggests non-linearity.
* Leverage values near 0 indicate little influence on the model fit;
  values closer to 1 indicate observations that are far from the center
  of the predictor space.
* Cook's distance combines residual size and leverage. Points with
  unusually large Cook's distance are candidates for closer inspection.
  They are not automatically "bad" data points, but they may be
  influential.

Section 2 – Anscombe's Quartet
==============================

To see why diagnostics and visualization are essential, this appendix
includes a second dataset: **Anscombe's Quartet**.

Anscombe (1973) constructed four small datasets (I–IV) with the
following surprising property:

* Each dataset has nearly identical:

  - mean of :math:`x`,
  - mean of :math:`y`,
  - variance of :math:`x`,
  - variance of :math:`y`,
  - correlation :math:`r` between :math:`x` and :math:`y`,
  - regression line :math:`Y' = bX + a`.

* But when you plot them, the **shapes are completely different**:

  - One looks like a typical linear relationship.
  - One is clearly non-linear.
  - One is linear except for a single outlier.
  - One has a nearly perfect vertical line with one extreme point.

In other words, **summary statistics alone can mislead us**. Two
datasets can share the same correlation and regression line but tell
completely different stories once we visualize them.

How we use Anscombe's Quartet in PyStatsV1
------------------------------------------

The script ``psych_ch16b_pingouin_regression_diagnostics.py`` includes:

* a helper that constructs a tidy version of **Anscombe's Quartet**
  with columns

  - ``x``
  - ``y``
  - ``dataset`` (I, II, III, IV)

* a function that computes, for each dataset:

  - :math:`\\bar{x}`, :math:`\\bar{y}`,
  - :math:`s_x^2`, :math:`s_y^2`,
  - correlation :math:`r`,
  - simple regression line (:math:`a` and :math:`b`).

* a **2x2 grid of scatterplots** with

  - the same axis limits,
  - the fitted regression line overlaid,
  - one panel per dataset (I–IV).

The corresponding tests check that:

* all four datasets have nearly identical summary statistics, and
* the code produces the expected summary table and plot file.

Worked example (conceptual)
---------------------------

1. Run the diagnostics script (once your Makefile targets are wired):

   .. code-block:: bash

      make psych-ch16b

2. The script first runs the psychology regression diagnostics example
   (as described in Section 1).

3. Then the script constructs Anscombe's Quartet, computes summary
   statistics by dataset, and prints something like:

   .. code-block:: text

      Anscombe summary (per dataset):
        dataset   mean_x   mean_y   var_x   var_y     r       slope   intercept
      0       I    9.00    7.50   11.00    4.13  0.82      0.50      3.00
      1      II    9.00    7.50   11.00    4.13  0.82      0.50      3.00
      2     III    9.00    7.50   11.00    4.13  0.82      0.50      3.00
      3      IV    9.00    7.50   11.00    4.13  0.82      0.50      3.00

   The exact numbers may differ slightly due to floating point rounding,
   but the key idea is that the four datasets have almost identical
   summary statistics.

4. Finally, the script creates a 2x2 scatterplot figure and writes it to:

   * ``outputs/track_b/ch16b_anscombe_quartet.png``

   When you inspect this image, you will see four very different
   patterns, despite having "the same" regression summary.

Takeaway for students and instructors
-------------------------------------

Anscombe's Quartet makes two core points that align with the PyStatsV1
philosophy:

1. **Do not stop at statistics.**

   * A single number like :math:`r` or :math:`R^2` can hide very
     different data stories.
   * Always pair numerical output with plots and diagnostics.

2. **Treat models as software artifacts.**

   * In PyStatsV1, every substantial analysis step is backed by
     functions, tests, and CI checks.
   * Adding a new diagnostic (e.g., Cook's distance, Anscombe analysis)
     means adding new code *and* new tests.

Section 3 – The code: overview of key functions
===============================================

You do not need to memorize the exact implementation details, but it is
useful to know what the main functions do.

In ``scripts/psych_ch16b_pingouin_regression_diagnostics.py``:

* ``compute_regression_diagnostics(df, predictors, outcome)``

  - Fits a multiple regression model,

    .. math::
       exam\\_score \\sim study\\_hours + sleep\\_hours + stress + motivation,

  - returns a diagnostics DataFrame with

    * ``fitted``,
    * ``residual``,
    * ``std_residual``,
    * ``leverage``,
    * ``cooks_distance``,

  - and a :mod:`pingouin` regression summary table for cross-checking.

* ``run_ch16b_demo(n, random_state)``

  - Simulates the psychology regression dataset,
  - calls :func:`compute_regression_diagnostics`,
  - saves diagnostics and **top influential points** to CSV,
  - generates residuals vs fitted and leverage vs Cook's distance plots,
  - constructs and analyzes Anscombe's Quartet,
  - saves Anscombe summary statistics and plots,
  - prints a concise narrative summary to the console.

* Anscombe helpers (internal names may differ slightly):

  - a function to construct the tidy Anscombe dataset,
  - a function to compute summary statistics by dataset,
  - a plotting function to generate the 2x2 Anscombe scatterplot
    figure with regression lines.

In ``tests/test_psych_ch16b_pingouin_regression_diagnostics.py``:

* One test verifies that diagnostics have the expected columns and that
  leverage behaves as theory predicts (e.g., the average leverage is
  approximately :math:`p / n`, where :math:`p` is the number of
  parameters including the intercept).
* Another test runs :func:`run_ch16b_demo` in a temporary directory and
  verifies that all expected CSV and PNG files exist and are non-empty.
* A third test checks that **Anscombe's Quartet** is implemented
  correctly:

  - there are four datasets with the expected number of rows,
  - group-level summary statistics are nearly identical across datasets,
  - the code produces an Anscombe summary CSV and plot image.

How this Appendix fits into the Track B narrative
=================================================

* Chapter 15 and 15a introduced **correlation** and **partial
  correlation**, using :mod:`pingouin` as a high-level toolbox.
* Chapter 16 developed the core ideas of **linear regression**:
  prediction, least squares, standard error of the estimate, and
  multiple regression.
* Appendix 16a expanded regression with additional estimation examples.
* Appendix 16b (this chapter) emphasizes that

  * even a beautifully written model can be misleading if we ignore
    diagnostics,
  * the *shape* of the data always matters,
  * simple, testable diagnostics can be integrated into every
    analysis pipeline.

By the time students reach Chapter 17 (Mixed-Model Designs), they will
have seen that a PyStatsV1-style analysis is not just about "getting
significant results." It is about building **robust, transparent, and
reproducible** statistical workflows that can be trusted.

Next steps
==========

After completing this appendix, you are ready to move into

* **Chapter 17 – Mixed-Model Designs**, where we combine
  between-subjects and within-subjects factors, and
* later, **Chapter 18 – ANCOVA**, where we explicitly control for
  covariates in more complex models.

In both chapters, the habits you practiced here—**checking assumptions,
visualizing patterns, and treating models as software artifacts**—will
remain central.