Chapter 16a Appendix: Linear Regression with Pingouin

Motivation

In Chapter 16 – Linear Regression, you learned how to:

Simulate a psychology dataset with several predictors
Fit a simple linear regression by hand (using NumPy)
Fit a multiple regression model
Interpret the slope, intercept, \(R^2\), and standard error of the estimate

In this appendix, we lean more heavily on the pingouin library to engineer regression analyses as re-usable, testable components.

Why Pingouin for regression?

Pingouin is a Python 3 statistics library built on top of NumPy and pandas. For regression, pingouin.linear_regression() gives you, in a single DataFrame:

Unstandardized coefficients (coef)
Standard errors (se)
t-statistics (T) and p-values (pval)
Model-level \(R^2\) and adjusted \(R^2\)
Confidence intervals for each coefficient

This pairs naturally with the PyStatsV1 philosophy:

Don’t just calculate your results — engineer them.

Instead of copying numbers from an output window into a homework sheet, we write small, well-tested scripts that can be re-run, inspected and adapted to new research projects.

Overview of the 16a lab

The 16a appendix is powered by the script:

python -m scripts.psych_ch16a_pingouin_regression_demo

It reuses the same simulated dataset generator introduced in Chapter 16:

from scripts.psych_ch16_regression import simulate_psych_regression_dataset

and then uses pingouin to:

Fit a multiple regression model

\[\text{exam\_score} = b_0 + b_1 \times \text{study\_hours} + b_2 \times \text{sleep\_hours} + b_3 \times \text{stress} + b_4 \times \text{motivation} + e\]
Compute standardized regression coefficients (betas) by running the same model on z-scored variables.
Extract partial effects using pingouin.partial_corr(), so students can see how a predictor relates to the outcome after controlling for other variables.

The goal is not to introduce a brand-new design, but to show how the regression model from Chapter 16 behaves when we add a professional regression toolbox on top.

Section 16a.1 – Recap: Why multiple regression?

In Chapter 16, we motivated multiple regression as an extension of correlation:

Correlation: how two variables move together
Regression: how we predict one variable from one (simple) or many (multiple) predictors

Multiple regression helps us answer questions like:

“How many points of exam score do we gain for each extra hour of study, holding sleep constant?”
“Is the effect of sleep on exam performance still present after controlling for stress and motivation?”

This language (“holding constant”) maps directly onto partial regression and partial correlation. Pingouin makes those quantities easy to compute.

Section 16a.2 – Pingouin’s linear_regression

The core workhorse in this appendix is pingouin.linear_regression(). Its minimal usage pattern looks like:

import pingouin as pg

X = df[["study_hours", "sleep_hours", "stress", "motivation"]]
y = df["exam_score"]

reg_table = pg.linear_regression(X=X, y=y)

The returned reg_table is a pandas DataFrame with one row per term (intercept and predictors). Key columns include:

names – the name of the predictor (or Intercept)
coef – the unstandardized regression coefficient
se – standard error of the coefficient
T – t-statistic (coefficient divided by its standard error)
pval – p-value for a two-sided test of \(H_0: b_i = 0\)
r2 – model \(R^2\) (repeated on each row)
adj_r2 – adjusted \(R^2\)

In scripts.psych_ch16a_pingouin_regression_demo, we wrap this logic into a helper function so that it can be imported and tested:

from scripts.psych_ch16_regression import simulate_psych_regression_dataset
import pingouin as pg

def build_pingouin_regression_tables(df):
    X = df[["study_hours", "sleep_hours", "stress", "motivation"]]
    y = df["exam_score"]
    raw_table = pg.linear_regression(X=X, y=y)
    # (plus a standardized version, see next section)
    ...

This also means that future chapters (e.g., ANCOVA or mixed models) could reuse the same simulation code and regression helpers for more advanced demos.

Section 16a.3 – Standardized coefficients (betas)

Unstandardized coefficients (e.g., +4 points per extra study hour) are often the most intuitive for reporting. However, standardized coefficients (“betas”) can be helpful when predictors are on very different scales.

To obtain standardized coefficients, we simply:

Z-score the predictors and the outcome.
Run pingouin.linear_regression() on the standardized variables.
Interpret the resulting coef values as the change in the outcome, in standard deviations, per one-standard-deviation change in the predictor, holding the other predictors constant.

In the 16a script we perform this transformation with:

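def zscore_columns(df, columns):
    """Return a copy of df with a z-scored "<col>_z" column added for each column."""
    zdf = df.copy()  # leave the caller's DataFrame untouched
    for col in columns:
        col_mean = zdf[col].mean()
        # Population SD (ddof=0); the same convention is applied to every column
        col_std = zdf[col].std(ddof=0)
        zdf[col + "_z"] = (zdf[col] - col_mean) / col_std
    return zdf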

zdf = zscore_columns(
    df,
    ["exam_score", "study_hours", "sleep_hours", "stress", "motivation"],
)

We then fit a second regression model:

X_z = zdf[["study_hours_z", "sleep_hours_z", "stress_z", "motivation_z"]]
y_z = zdf["exam_score_z"]

standardized_table = pg.linear_regression(X=X_z, y=y_z)

The resulting standardized_table is saved to outputs/track_b/ch16a_regression_standardized.csv and printed to the console so students can compare unstandardized and standardized effect sizes.
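
The two tables are linked by a simple rescaling: each standardized coefficient equals the unstandardized one multiplied by sd(x) / sd(y). The sketch below checks this on synthetic data. It uses NumPy's least-squares solver rather than Pingouin so that it runs without the chapter's scripts, and the variable values are invented for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 300
# Synthetic predictors on deliberately different scales (illustrative values).
study = rng.normal(10, 3, n)
stress = rng.normal(50, 12, n)
exam = 40 + 2.0 * study - 0.3 * stress + rng.normal(0, 6, n)


def ols(X, y):
    """Least-squares coefficients, with the intercept in position 0."""
    design = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef


def zscore(a):
    # ddof=0 mirrors zscore_columns(); any ddof works if used consistently.
    return (a - a.mean()) / a.std(ddof=0)


X = np.column_stack([study, stress])
b = ols(X, exam)  # unstandardized: [b0, b_study, b_stress]

# Route 1: refit the model on z-scored variables.
Xz = np.column_stack([zscore(study), zscore(stress)])
beta = ols(Xz, zscore(exam))  # intercept is ~0 after z-scoring

# Route 2: rescale the raw slopes, beta_j = b_j * sd(x_j) / sd(y).
beta_rescaled = b[1:] * X.std(axis=0, ddof=0) / exam.std(ddof=0)

print("refit on z-scores:", beta[1:].round(4))
print("rescaled from raw:", beta_rescaled.round(4))
```

Both routes give the same betas, so the standardized table adds a change of units rather than a different model.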

Section 16a.4 – Partial effects and partial correlation

In a multiple regression, each coefficient is a partial effect: it describes the association between that predictor and the outcome after controlling for the other predictors in the model, that is, with all else held equal.
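
One way to make "controlling for" concrete is to residualize: remove the part of both the predictor and the outcome that the other predictors explain, then regress residual on residual. The slope from that step matches the multiple-regression coefficient. The sketch below illustrates this with NumPy on synthetic data; the variable names and effect sizes are assumptions chosen for the example, not the Chapter 16 simulation.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
# Synthetic, correlated predictors (illustrative values only).
stress = rng.normal(0, 1, n)
study = 0.5 * stress + rng.normal(0, 1, n)
exam = 3.0 * study - 2.0 * stress + rng.normal(0, 1, n)


def fit(X, y):
    """Return (coefficients with intercept first, residuals) for an OLS fit."""
    design = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef, y - design @ coef


# Full model: exam ~ study + stress
coef_full, _ = fit(np.column_stack([study, stress]), exam)

# Partial route: strip stress out of both study and exam, then regress
# the exam residuals on the study residuals.
_, study_resid = fit(stress, study)
_, exam_resid = fit(stress, exam)
coef_partial, _ = fit(study_resid, exam_resid)

# For contrast: the simple slope that ignores stress entirely.
coef_simple, _ = fit(study, exam)

print("study slope, full model:       ", round(coef_full[1], 4))
print("study slope, residual route:   ", round(coef_partial[1], 4))
print("study slope, ignoring stress:  ", round(coef_simple[1], 4))
```

The first two slopes agree, while the simple slope differs because it absorbs part of the stress effect.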

Pingouin also exposes these partial relationships directly via pingouin.partial_corr(), which computes partial correlation coefficients.

For example, to examine the relationship between exam score and study hours while controlling for stress and motivation, the 16a script uses:

partial = pg.partial_corr(
    data=df,
    x="study_hours",
    y="exam_score",
    covar=["stress", "motivation"],
    method="pearson",
)

The resulting DataFrame contains:

r – the partial correlation coefficient
CI95% – a confidence interval for the partial correlation
p-val – a p-value testing \(H_0: \rho_{xy \cdot \text{covar}} = 0\)

The key conceptual link for students is:

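When the covariates match the other predictors in the regression model, the partial correlation has the same sign as that predictor's regression coefficient, and its significance test is equivalent to the coefficient's t-test. In the example above, sleep_hours is not among the covariates, so the partial correlation controls for a smaller set of variables than the four-predictor model does.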
The partial correlation can be interpreted in the same “holding other variables constant” language used to explain multiple regression.
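
The link between the two quantities can be checked numerically. When the covariates are the same as the other predictors in the model, the partial correlation equals t / sqrt(t^2 + df), where t is the predictor's t-statistic and df is the residual degrees of freedom. The sketch below uses NumPy only and synthetic data with made-up effect sizes; it is an illustration of the identity, not the 16a script's implementation.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 400
# Synthetic data (illustrative values only).
stress = rng.normal(0, 1, n)
motivation = rng.normal(0, 1, n)
study = 0.4 * motivation - 0.3 * stress + rng.normal(0, 1, n)
exam = 2.0 * study - 1.0 * stress + 0.5 * motivation + rng.normal(0, 2, n)


def residuals(y, covars):
    """Residuals of y after an OLS fit on the covariates (with intercept)."""
    design = np.column_stack([np.ones(len(y)), covars])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return y - design @ coef


covars = np.column_stack([stress, motivation])

# Partial correlation: correlate what is left of study and exam
# once stress and motivation have been regressed out of both.
r_partial = np.corrcoef(residuals(study, covars), residuals(exam, covars))[0, 1]

# Regression of exam on study plus the same covariates.
design = np.column_stack([np.ones(n), study, covars])
coef, *_ = np.linalg.lstsq(design, exam, rcond=None)
resid = exam - design @ coef
dof = n - design.shape[1]                      # residual degrees of freedom
sigma2 = resid @ resid / dof                   # residual variance estimate
cov_b = sigma2 * np.linalg.inv(design.T @ design)
t_study = coef[1] / np.sqrt(cov_b[1, 1])       # t-statistic for study

# Identity linking the two: r = t / sqrt(t^2 + df)
r_from_t = t_study / np.sqrt(t_study**2 + dof)

print("partial r (residual route):", round(r_partial, 4))
print("partial r (from t):        ", round(r_from_t, 4))
```

Because the two values coincide, testing the partial correlation against zero is the same test as the coefficient's t-test in the matching regression.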

Section 16a.5 – Running the 16a lab

The appendix demo is designed to run from the command line and to save its artifacts into the same folder structure as the other Track B labs.

From the root of the repository, with the virtual environment activated:

# Run the demo script (regression + partial correlations)
make psych-ch16a

# Or directly via Python
python -m scripts.psych_ch16a_pingouin_regression_demo

# Run the tests for this chapter's appendix
make test-psych-ch16a

# Inspect all outputs under:
# - data/synthetic/psych_ch16_regression.csv
# - outputs/track_b/ch16a_regression_raw.csv
# - outputs/track_b/ch16a_regression_standardized.csv
# - outputs/track_b/ch16a_partial_corr_exam_study.csv

The associated test module, tests.test_psych_ch16a_pingouin_regression_demo, checks that:

The regression tables contain the expected columns.
Study hours and sleep hours have positive regression coefficients.
Stress has a negative regression coefficient.
The partial correlation between exam score and study hours (controlling for stress and motivation) is positive and statistically significant.

These tests turn the 16a appendix into executable documentation for both students and instructors.
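
To give a flavor of what such checks can look like, here is a hedged pytest-style sketch. The helper make_toy_table and both test functions are hypothetical names invented for this illustration; they are not the contents of tests.test_psych_ch16a_pingouin_regression_demo. The table is built with NumPy on synthetic data so the example runs on its own.

```python
import numpy as np
import pandas as pd


def make_toy_table(seed=0, n=300):
    """Hypothetical stand-in for a regression table with names/coef columns."""
    rng = np.random.default_rng(seed)
    study = rng.normal(10, 3, n)
    stress = rng.normal(5, 2, n)
    # Illustrative data-generating values, not the Chapter 16 simulation.
    exam = 50 + 2.0 * study - 1.5 * stress + rng.normal(0, 4, n)
    design = np.column_stack([np.ones(n), study, stress])
    coef, *_ = np.linalg.lstsq(design, exam, rcond=None)
    return pd.DataFrame(
        {"names": ["Intercept", "study_hours", "stress"], "coef": coef}
    )


def test_table_has_expected_columns():
    table = make_toy_table()
    assert {"names", "coef"}.issubset(table.columns)


def test_coefficient_signs():
    coef = make_toy_table().set_index("names")["coef"]
    assert coef["study_hours"] > 0   # more study, higher score
    assert coef["stress"] < 0        # more stress, lower score
```

Fixing the random seed keeps sign checks like these deterministic from run to run.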

Section 16a.6 – For instructors

Some suggestions for using this appendix in teaching:

Compare models in class. Run the Chapter 16 lab and the 16a Pingouin appendix side-by-side. Ask students to reconcile the manual calculations with the Pingouin output.
Highlight effect sizes. Use the standardized regression table to discuss which predictors have the strongest relative influence on exam performance.
Discuss collinearity. Because predictors like study hours, sleep, stress, and motivation are correlated with each other, multiple regression is a natural context to introduce collinearity and its consequences.
Encourage replication. Invite students to fork the PyStatsV1 repo, modify the simulation parameters (e.g., make sleep more important), and observe how the Pingouin tables change.

In later chapters (e.g., ANCOVA, mixed-model designs), we can revisit this dataset and the Pingouin regression helpers as a familiar sandbox for more advanced modeling.