Chapter 16a Appendix: Linear Regression with Pingouin
=====================================================

Motivation
----------

In :doc:`psych_ch16_regression`, you learned how to:

- Simulate a psychology dataset with several predictors
- Fit a simple linear regression by hand (using NumPy)
- Fit a multiple regression model
- Interpret the slope, intercept, :math:`R^2`, and standard error of the estimate

In this appendix, we lean more heavily on the :mod:`pingouin` library to
engineer regression analyses as *re-usable, testable components*.

Why Pingouin for regression?
----------------------------

Pingouin is a Python 3 statistics library built on top of NumPy and pandas.
For regression, :func:`pingouin.linear_regression` gives you, in a single
DataFrame:

- Unstandardized coefficients (``coef``)
- Standard errors (``se``)
- t-statistics (``T``) and p-values (``pval``)
- Model-level :math:`R^2` and adjusted :math:`R^2`
- Confidence intervals for each coefficient

This pairs naturally with the PyStatsV1 philosophy:

    *Don't just calculate your results — engineer them.*

Instead of copying numbers from an output window into a homework sheet, we
write small, well-tested scripts that can be re-run, inspected and adapted
to new research projects.

Overview of the 16a lab
-----------------------

The 16a appendix is powered by the script:

.. code-block:: bash

   python -m scripts.psych_ch16a_pingouin_regression_demo

It reuses the same simulated dataset generator introduced in Chapter 16:

.. code-block:: python

   from scripts.psych_ch16_regression import simulate_psych_regression_dataset

and then uses :mod:`pingouin` to:

1. Fit a *multiple regression* model

   .. math::
      \text{exam\_score} = b_0 + b_1 \times \text{study\_hours}
                               + b_2 \times \text{sleep\_hours}
                               + b_3 \times \text{stress}
                               + b_4 \times \text{motivation} + e

2. Compute *standardized* regression coefficients (betas) by running the same
   model on z-scored variables.

3. Extract *partial effects* using :func:`pingouin.partial_corr`, so students
   can see how a predictor relates to the outcome after controlling for other
   variables.

The goal is not to introduce a brand-new design, but to show how the same
simulated dataset and regression model from Chapter 16 look when we analyze
them with a professional regression toolbox.

Section 16a.1 – Recap: Why multiple regression?
-----------------------------------------------

In Chapter 16, we motivated multiple regression as an extension of correlation:

- Correlation: how two variables move together
- Regression: how we *predict* one variable from one (simple) or many
  (multiple) predictors

Multiple regression helps us answer questions like:

- "How many points of exam score do we gain for each extra hour of study,
  **holding sleep constant**?"
- "Is the effect of sleep on exam performance still present after controlling
  for stress and motivation?"

This language ("holding constant") maps directly onto partial regression and
partial correlation. Pingouin makes those quantities easy to compute.
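
As a minimal sketch of what "holding sleep constant" means numerically, the
example below uses plain NumPy on a small made-up dataset (a hypothetical
stand-in, not the Chapter 16 generator). It assumes only two predictors and
checks that the study-hours slope from the multiple regression equals the
slope obtained after removing from both study hours and exam score the part
that sleep can predict. The helper names ``ols`` and ``residualize`` are
illustrative and are not part of the PyStatsV1 scripts.

.. code-block:: python

   import numpy as np

   # Hypothetical stand-in data (NOT the Chapter 16 simulation).
   rng = np.random.default_rng(0)
   n = 300
   sleep_hours = rng.normal(7.0, 1.0, n)
   study_hours = 10.0 + 0.8 * sleep_hours + rng.normal(0.0, 2.0, n)
   exam_score = (40.0 + 2.0 * study_hours + 1.5 * sleep_hours
                 + rng.normal(0.0, 5.0, n))


   def ols(X, y):
       """Least-squares coefficients for y ~ 1 + X (intercept first)."""
       design = np.column_stack([np.ones(len(y)), X])
       coefs, *_ = np.linalg.lstsq(design, y, rcond=None)
       return coefs


   def residualize(v, controls):
       """Part of v that remains after regressing it on the controls."""
       design = np.column_stack([np.ones(len(v)), controls])
       coefs, *_ = np.linalg.lstsq(design, v, rcond=None)
       return v - design @ coefs


   # Slope for study_hours in the two-predictor regression.
   b_multiple = ols(np.column_stack([study_hours, sleep_hours]), exam_score)[1]

   # Same slope via the "holding sleep constant" route.
   study_resid = residualize(study_hours, sleep_hours)
   exam_resid = residualize(exam_score, sleep_hours)
   b_partial = ols(study_resid, exam_resid)[1]

   print(round(b_multiple, 4), round(b_partial, 4))

The two printed slopes agree up to floating-point error, which is the
numerical content of the phrase "holding sleep constant".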

Section 16a.2 – Pingouin's linear_regression
--------------------------------------------

The workhorse of this appendix is
:func:`pingouin.linear_regression`. A minimal call looks like:

.. code-block:: python

   import pingouin as pg

   X = df[["study_hours", "sleep_hours", "stress", "motivation"]]
   y = df["exam_score"]

   reg_table = pg.linear_regression(X=X, y=y)

The returned ``reg_table`` is a pandas DataFrame with one row per term
(intercept and predictors). Key columns include:

- ``names`` – the name of the predictor (or ``Intercept``)
- ``coef`` – the unstandardized regression coefficient
- ``se`` – standard error of the coefficient
- ``T`` – t-statistic (coefficient divided by its standard error)
- ``pval`` – p-value for a two-sided test of :math:`H_0: b_i = 0`
- ``r2`` – model :math:`R^2` (repeated on each row)
- ``adj_r2`` – adjusted :math:`R^2`
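
To make these columns concrete, here is a hedged NumPy sketch of the textbook
ordinary-least-squares formulas behind ``coef``, ``se``, ``T``, ``r2`` and
``adj_r2``. It is an illustration on made-up data (a hypothetical stand-in
with two predictors), not Pingouin's source code. P-values are left out
because they require a t-distribution, which NumPy alone does not provide.

.. code-block:: python

   import numpy as np

   # Hypothetical stand-in data (NOT the Chapter 16 simulation).
   rng = np.random.default_rng(1)
   n = 200
   study_hours = rng.normal(10.0, 3.0, n)
   stress = rng.normal(5.0, 2.0, n)
   exam_score = (50.0 + 2.0 * study_hours - 1.0 * stress
                 + rng.normal(0.0, 5.0, n))

   # Design matrix with an explicit intercept column.
   X = np.column_stack([np.ones(n), study_hours, stress])
   k = X.shape[1] - 1            # number of predictors (intercept excluded)
   df_resid = n - k - 1          # residual degrees of freedom

   # coef: least-squares estimates.
   coef, *_ = np.linalg.lstsq(X, exam_score, rcond=None)
   resid = exam_score - X @ coef

   # se: square roots of the diagonal of sigma^2 * (X'X)^-1.
   ss_res = resid @ resid
   sigma2 = ss_res / df_resid
   se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))

   # T: each coefficient divided by its standard error.
   T = coef / se

   # r2 and adj_r2: model-level fit, identical for every row of the table.
   ss_tot = np.sum((exam_score - exam_score.mean()) ** 2)
   r2 = 1.0 - ss_res / ss_tot
   adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / df_resid

   for name, b, s, t in zip(["Intercept", "study_hours", "stress"], coef, se, T):
       print(f"{name:12s} coef={b:8.3f} se={s:6.3f} T={t:7.2f}")
   print(f"r2={r2:.3f} adj_r2={adj_r2:.3f}")

Comparing these hand-computed values with the Pingouin table for the same
data is a useful classroom check.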

In :mod:`scripts.psych_ch16a_pingouin_regression_demo`, we wrap this logic
into a helper function so that it can be imported and tested:

.. code-block:: python

   from scripts.psych_ch16_regression import simulate_psych_regression_dataset
   import pingouin as pg

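   def build_pingouin_regression_tables(df):
       """Build the Pingouin regression tables for the exam-score model."""
       X = df[["study_hours", "sleep_hours", "stress", "motivation"]]
       y = df["exam_score"]
       # One row per term: Intercept plus the four predictors
       raw_table = pg.linear_regression(X=X, y=y)
       # (plus a standardized version, see next section)
       ...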

This also means that future chapters (e.g., ANCOVA or mixed models) could
reuse the same simulation code and regression helpers for more advanced demos.

Section 16a.3 – Standardized coefficients (betas)
-------------------------------------------------

Unstandardized coefficients (e.g., "+4 points per extra study hour") are
often the most intuitive to report. Standardized coefficients ("betas") help
when predictors are measured on very different scales, because they express
every effect in standard-deviation units.

To obtain standardized coefficients, we simply:

1. Z-score the predictors and the outcome.
2. Run :func:`pingouin.linear_regression` on the standardized variables.
3. Interpret each resulting ``coef`` as the *change in the outcome, in standard
   deviations, per one-standard-deviation change in the predictor*, holding
   the other predictors constant.

In the 16a script we perform this transformation with:

.. code-block:: python

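   def zscore_columns(df, columns):
       """Return a copy of ``df`` with a ``<name>_z`` column per input column."""
       zdf = df.copy()
       for col in columns:
           col_mean = zdf[col].mean()
           # Population SD (ddof=0). pandas defaults to ddof=1; either choice
           # yields the same betas as long as every column uses the same one.
           col_std = zdf[col].std(ddof=0)
           zdf[col + "_z"] = (zdf[col] - col_mean) / col_std
       return zdf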

   zdf = zscore_columns(
       df,
       ["exam_score", "study_hours", "sleep_hours", "stress", "motivation"],
   )

We then fit a second regression model:

.. code-block:: python

   X_z = zdf[["study_hours_z", "sleep_hours_z", "stress_z", "motivation_z"]]
   y_z = zdf["exam_score_z"]

   standardized_table = pg.linear_regression(X=X_z, y=y_z)

The resulting ``standardized_table`` is saved to
``outputs/track_b/ch16a_regression_standardized.csv`` and printed to the
console so students can compare unstandardized and standardized effect sizes.
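
The link between the two tables can be sketched in NumPy: each beta equals
the unstandardized coefficient multiplied by the ratio of the predictor's
standard deviation to the outcome's standard deviation. The example below is
a hedged illustration on made-up data (a hypothetical stand-in with two
predictors; ``fit`` and ``zscore`` are illustrative helper names). It also
checks that the choice of ``ddof`` does not change the betas, provided the
same choice is used for every column.

.. code-block:: python

   import numpy as np

   # Hypothetical stand-in data (NOT the Chapter 16 simulation).
   rng = np.random.default_rng(5)
   n = 250
   study_hours = rng.normal(10.0, 3.0, n)
   stress = rng.normal(5.0, 2.0, n)
   exam_score = (50.0 + 2.0 * study_hours - 1.0 * stress
                 + rng.normal(0.0, 5.0, n))


   def fit(predictors, y):
       """Least-squares coefficients for y ~ 1 + predictors (intercept first)."""
       design = np.column_stack([np.ones(len(y)), *predictors])
       coefs, *_ = np.linalg.lstsq(design, y, rcond=None)
       return coefs


   def zscore(v, ddof):
       """Center v and scale it by its standard deviation."""
       return (v - v.mean()) / v.std(ddof=ddof)


   # Unstandardized fit on the raw variables.
   raw = fit([study_hours, stress], exam_score)

   # Standardized fits, once with ddof=0 and once with ddof=1.
   betas = {}
   intercepts = {}
   for ddof in (0, 1):
       z_fit = fit(
           [zscore(study_hours, ddof), zscore(stress, ddof)],
           zscore(exam_score, ddof),
       )
       intercepts[ddof] = z_fit[0]   # numerically zero after centering
       betas[ddof] = z_fit[1:]

   # Betas recovered directly from the raw coefficients.
   sd_ratio = np.array([study_hours.std(), stress.std()]) / exam_score.std()
   betas_from_raw = raw[1:] * sd_ratio

   print(betas[0], betas[1], betas_from_raw)

All three printed vectors agree up to floating-point error, and the intercept
of the standardized model is numerically zero because every variable has
been centered.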

Section 16a.4 – Partial effects and partial correlation
-------------------------------------------------------

In a multiple regression, each coefficient is a *partial effect*: it describes
the association between that predictor and the outcome *after controlling for*
the other predictors in the model, that is, with those predictors held
constant.

Pingouin also exposes these partial relationships directly via
:func:`pingouin.partial_corr`, which computes partial correlation
coefficients.

For example, to examine the relationship between exam score and study hours
while controlling for stress and motivation, the 16a script uses:

.. code-block:: python

   partial = pg.partial_corr(
       data=df,
       x="study_hours",
       y="exam_score",
       covar=["stress", "motivation"],
       method="pearson",
   )

The resulting DataFrame contains:

- ``r`` – the partial correlation coefficient
- ``CI95%`` – a confidence interval for the partial correlation
- ``p-val`` – a p-value testing :math:`H_0: \rho_{xy \cdot \text{covar}} = 0`

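The key conceptual links for students are:

- When the covariates are exactly the other predictors in the regression
  model, the partial correlation has the same sign as that predictor's
  regression coefficient, and the two significance tests are equivalent.
  The call above omits ``sleep_hours`` from ``covar``, so its result need
  not match the four-predictor model exactly.
- The partial correlation can be interpreted in the same "holding other
  variables constant" language used to explain multiple regression.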
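
The correspondence between a partial correlation and a regression coefficient
can be checked by hand. The sketch below is a hedged NumPy illustration on
made-up data (a hypothetical stand-in, with ``residualize`` as an
illustrative helper). It assumes a three-predictor model in which the
covariates, stress and motivation, are exactly the other predictors. Under
that assumption the partial correlation equals
:math:`t / \sqrt{t^2 + df}`, where :math:`t` is the coefficient's t-statistic
and :math:`df` is the residual degrees of freedom.

.. code-block:: python

   import numpy as np

   # Hypothetical stand-in data (NOT the Chapter 16 simulation).
   rng = np.random.default_rng(2)
   n = 300
   stress = rng.normal(5.0, 2.0, n)
   motivation = rng.normal(0.0, 1.0, n)
   study_hours = 10.0 + 1.5 * motivation + rng.normal(0.0, 2.0, n)
   exam_score = (50.0 + 2.0 * study_hours - 1.0 * stress
                 + 3.0 * motivation + rng.normal(0.0, 5.0, n))


   def residualize(v, design):
       """Part of v that remains after regressing it on the design matrix."""
       coefs, *_ = np.linalg.lstsq(design, v, rcond=None)
       return v - design @ coefs


   # Route 1: correlate the residuals after removing the covariates.
   covars = np.column_stack([np.ones(n), stress, motivation])
   r_partial = np.corrcoef(
       residualize(study_hours, covars),
       residualize(exam_score, covars),
   )[0, 1]

   # Route 2: t-statistic for study_hours in the full regression.
   X = np.column_stack([np.ones(n), study_hours, stress, motivation])
   coef, *_ = np.linalg.lstsq(X, exam_score, rcond=None)
   resid = exam_score - X @ coef
   df_resid = n - X.shape[1]
   se = np.sqrt(np.diag(resid @ resid / df_resid * np.linalg.inv(X.T @ X)))
   t_study = coef[1] / se[1]
   r_from_t = t_study / np.sqrt(t_study ** 2 + df_resid)

   print(round(r_partial, 4), round(r_from_t, 4))

Both routes give the same value. If a predictor is left out of the covariate
list, as ``sleep_hours`` is in the call above, the identity no longer holds
exactly.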

Section 16a.5 – Running the 16a lab
-----------------------------------

The appendix demo is designed to run from the command line and to save its
artifacts into the same folder structure as the other Track B labs.

From the root of the repository, with the virtual environment activated:

.. code-block:: bash

   # Run the demo script (regression + partial correlations)
   make psych-ch16a

   # Or directly via Python
   python -m scripts.psych_ch16a_pingouin_regression_demo

   # Run the tests for this chapter's appendix
   make test-psych-ch16a

   # Inspect all outputs under:
   # - data/synthetic/psych_ch16_regression.csv
   # - outputs/track_b/ch16a_regression_raw.csv
   # - outputs/track_b/ch16a_regression_standardized.csv
   # - outputs/track_b/ch16a_partial_corr_exam_study.csv

The associated test module,
:mod:`tests.test_psych_ch16a_pingouin_regression_demo`, checks that:

- The regression tables contain the expected columns.
- Study hours and sleep hours have positive regression coefficients.
- Stress has a negative regression coefficient.
- The partial correlation between exam score and study hours (controlling
  for stress and motivation) is positive and statistically significant.

These tests turn the 16a appendix into *executable documentation* for
both students and instructors.
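
The repository's test module is not reproduced here. As a rough sketch of
what a sign check of this kind could look like, the example below defines a
hypothetical helper, ``check_coefficient_signs``, that works on any table
with ``names`` and ``coef`` columns. The table is built with NumPy and pandas
on made-up data; in the real lab it would come from
:func:`pingouin.linear_regression`.

.. code-block:: python

   import numpy as np
   import pandas as pd


   def check_coefficient_signs(table, positive=(), negative=()):
       """Assert sign expectations on a table with names and coef columns."""
       missing = {"names", "coef"} - set(table.columns)
       assert not missing, f"missing columns: {sorted(missing)}"
       coefs = table.set_index("names")["coef"]
       for name in positive:
           assert coefs[name] > 0, f"{name} should be positive"
       for name in negative:
           assert coefs[name] < 0, f"{name} should be negative"
       return True


   # Hypothetical stand-in data (NOT the Chapter 16 simulation).
   rng = np.random.default_rng(3)
   n = 400
   study_hours = rng.normal(10.0, 3.0, n)
   stress = rng.normal(5.0, 2.0, n)
   exam_score = (50.0 + 2.0 * study_hours - 1.5 * stress
                 + rng.normal(0.0, 5.0, n))

   # Stand-in for the regression table: least-squares fit via NumPy.
   design = np.column_stack([np.ones(n), study_hours, stress])
   coef, *_ = np.linalg.lstsq(design, exam_score, rcond=None)
   table = pd.DataFrame(
       {"names": ["Intercept", "study_hours", "stress"], "coef": coef}
   )

   ok = check_coefficient_signs(
       table, positive=["study_hours"], negative=["stress"]
   )
   print(ok)

Keeping the check in a small function makes it easy to reuse when students
change the simulation parameters.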

Section 16a.6 – For instructors
-------------------------------

Some suggestions for using this appendix in teaching:

- **Compare models in class.** Run the Chapter 16 lab and the 16a Pingouin
  appendix side-by-side. Ask students to reconcile the manual calculations
  with the Pingouin output.

- **Highlight effect sizes.** Use the standardized regression table to
  discuss which predictors have the strongest relative influence on exam
  performance.

- **Discuss collinearity.** Because predictors like study hours, sleep,
  stress, and motivation are correlated with each other, multiple regression
  is a natural context to introduce collinearity and its consequences.

- **Encourage replication.** Invite students to fork the PyStatsV1 repo,
  modify the simulation parameters (e.g., make sleep more important), and
  observe how the Pingouin tables change.
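
For the collinearity discussion suggested above, one common diagnostic is the
variance inflation factor (VIF): :math:`1 / (1 - R_j^2)`, where
:math:`R_j^2` comes from regressing predictor :math:`j` on the remaining
predictors. The sketch below is a hedged NumPy illustration on made-up data
(a hypothetical stand-in in which study hours depend on motivation). The
``vif`` helper is illustrative and is not part of the PyStatsV1 scripts.

.. code-block:: python

   import numpy as np

   # Hypothetical stand-in data (NOT the Chapter 16 simulation).
   rng = np.random.default_rng(4)
   n = 300
   motivation = rng.normal(0.0, 1.0, n)
   study_hours = 10.0 + 2.0 * motivation + rng.normal(0.0, 1.5, n)
   sleep_hours = rng.normal(7.0, 1.0, n)


   def vif(X):
       """Variance inflation factor for each column of an (n, p) array."""
       n_obs, p = X.shape
       out = np.empty(p)
       for j in range(p):
           target = X[:, j]
           others = np.delete(X, j, axis=1)
           design = np.column_stack([np.ones(n_obs), others])
           coefs, *_ = np.linalg.lstsq(design, target, rcond=None)
           resid = target - design @ coefs
           ss_tot = np.sum((target - target.mean()) ** 2)
           r2_j = 1.0 - resid @ resid / ss_tot
           out[j] = 1.0 / (1.0 - r2_j)
       return out


   predictors = np.column_stack([study_hours, sleep_hours, motivation])
   vifs = vif(predictors)

   for name, value in zip(["study_hours", "sleep_hours", "motivation"], vifs):
       print(f"{name:12s} VIF={value:.2f}")

A VIF close to 1 means a predictor is nearly uncorrelated with the others;
larger values mean its coefficient is estimated less precisely because of
that overlap.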

In later chapters (e.g., ANCOVA, mixed-model designs), we can revisit this
dataset and the Pingouin regression helpers as a familiar sandbox for more
advanced modeling.