Chapter 16 – Linear Regression

In Chapter 15 you learned how to quantify relationships between two variables using correlation. Correlation answers the question:

“How strongly are two variables associated?”

Linear regression goes one step further. It answers a different question:

“How well can we predict one variable from one or more other variables?”

In this chapter you will:

  • build the idea of a line of best fit for predicting an outcome,

  • understand how the least-squares method chooses that line,

  • interpret the standard error of the estimate as typical prediction error,

  • extend to multiple regression, where several predictors work together,

  • and use pystatsv1 and pingouin to fit and interpret regression models on a synthetic psychology dataset.

16.1 Prediction: The Line of Best Fit

Imagine we want to predict a student’s exam score from their number of study hours. Each student gives us one data point: (study_hours, exam_score). If we scatter those points, a clear trend often appears: students who study more tend to score higher.

Linear regression summarizes that trend with a straight line:

\[\widehat{Y} = bX + a\]

where

  • \(\widehat{Y}\) is the predicted value of the outcome,

  • \(X\) is the predictor,

  • \(b\) is the slope (how much \(Y\) changes for a one-unit change in \(X\)), and

  • \(a\) is the intercept (the predicted value of \(Y\) when \(X = 0\)).

In psychology, we often use regression to predict:

  • exam performance from study time,

  • depressive symptoms from life stress,

  • therapy outcomes from baseline severity and treatment type,

  • or attention scores from sleep quality and caffeine intake.

The important idea is that regression is a model of prediction. We care about both how strong the relationship is and how well we can forecast new data.

Interpretation of the slope

Suppose our fitted line is

\[\widehat{\text{Exam Score}} = 5.0 \times \text{Study Hours} + 60.\]

The slope \(b = 5.0\) means:

For every extra hour of study, exam score is predicted to increase by about 5 points, on average.

The intercept \(a = 60\) means:

A student who studied 0 hours is predicted to score 60 (although we should be cautious about interpreting predictions far outside the observed range).
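
As a quick illustration, the example line can be turned into a small prediction function. This is only a minimal sketch: the name predict_exam_score is hypothetical (it is not part of the lab script), and the slope and intercept are simply the illustrative values from the equation above.

```python
def predict_exam_score(study_hours, slope=5.0, intercept=60.0):
    """Predicted exam score from the example line: Y-hat = 5.0 * X + 60."""
    return slope * study_hours + intercept


# Each extra hour of study raises the prediction by the slope (5 points).
for hours in (0, 2, 4):
    print(hours, predict_exam_score(hours))
# 0 60.0
# 2 70.0
# 4 80.0
```

Note that the prediction at 0 hours is just the intercept, which is why the intercept is only meaningful if 0 is a plausible value of the predictor.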

16.2 Least Squares: Choosing the Best Line

There are infinitely many lines we could draw through a cloud of points. Linear regression chooses the one that minimizes the sum of squared residuals.

A residual is the difference between the observed outcome and the predicted outcome from the line:

\[e_i = Y_i - \widehat{Y}_i.\]

The least-squares solution chooses \(a\) and \(b\) to minimize

\[\sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (Y_i - \widehat{Y}_i)^2.\]

This has several nice properties:

  • it gives more weight to large errors (because of squaring),

  • it has a closed-form solution (no iterative search is required),

  • and it links naturally to the Pearson correlation \(r\) and the ANOVA framework you saw in Chapters 12–14.

In the Chapter 16 lab script we compute the least-squares solution using both basic NumPy functions and the higher-level pingouin.linear_regression() helper for cross-checking.
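
For simple regression, the closed-form least-squares solution is

\[b = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n} (X_i - \bar{X})^2}, \qquad a = \bar{Y} - b\bar{X}.\]

The sketch below applies these formulas with NumPy to a tiny made-up dataset. The five data points are invented for illustration, and this is not the lab script's own code. It also checks two properties mentioned above: the slope equals \(r \cdot s_Y / s_X\), and nudging the slope away from the least-squares value increases the sum of squared residuals.

```python
import numpy as np

# Hypothetical toy data: study hours and exam scores for five students.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([64.0, 71.0, 74.0, 81.0, 85.0])

# Closed-form least-squares slope and intercept.
x_mean, y_mean = x.mean(), y.mean()
b = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
a = y_mean - b * x_mean

# Residuals and their sum of squares for the fitted line.
residuals = y - (b * x + a)
sse = np.sum(residuals ** 2)
print(f"slope b = {b:.2f}, intercept a = {a:.2f}, SSE = {sse:.2f}")
# slope b = 5.20, intercept a = 59.40, SSE = 3.60

# Link to Pearson r: the slope equals r * (s_Y / s_X).
r = np.corrcoef(x, y)[0, 1]
slope_from_r = r * y.std(ddof=1) / x.std(ddof=1)

# Any other line does worse: nudge the slope and recompute the SSE.
sse_nudged = np.sum((y - ((b + 0.1) * x + a)) ** 2)
print(sse_nudged > sse)
# True
```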

16.3 Standard Error of the Estimate

No regression line predicts perfectly. Different students with the same study hours will still have different exam scores. The standard error of the estimate summarizes typical prediction error in the original units of the outcome variable.

After we fit the line, we compute residuals \(e_i\) for each case and then:

\[S_\text{est} = \sqrt{\frac{\sum e_i^2}{n - 2}}.\]

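This is essentially the standard deviation of the residuals, with \(n - 2\) in the denominator because two parameters (\(a\) and \(b\)) were estimated from the data. A smaller \(S_\text{est}\) means: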

  • predictions are typically closer to the observed scores,

  • the regression line fits the data more tightly,

  • and we have more precise forecasts for new cases.
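
The formula for \(S_\text{est}\) can be sketched in a few lines of NumPy. The function name standard_error_of_estimate and the toy data are assumptions made for this illustration, and np.polyfit is used here only as a convenient way to obtain the least-squares line.

```python
import numpy as np


def standard_error_of_estimate(x, y):
    """Return S_est = sqrt(sum(e_i^2) / (n - 2)) for a simple regression."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # np.polyfit with deg=1 returns [slope, intercept] of the least-squares line.
    b, a = np.polyfit(x, y, deg=1)
    residuals = y - (b * x + a)
    n = len(y)
    return np.sqrt(np.sum(residuals ** 2) / (n - 2))


# Hypothetical toy data: study hours and exam scores for five students.
study_hours = [1, 2, 3, 4, 5]
exam_score = [64, 71, 74, 81, 85]

s_est = standard_error_of_estimate(study_hours, exam_score)
print(f"S_est = {s_est:.2f}")
# S_est = 1.10
```

For these made-up numbers, predictions are typically off by about one exam point, expressed in the same units as the outcome.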

In the lab we will:

  • print \(S_\text{est}\) for our simple regression model,

  • compare it across different models,

  • and relate it back to the residual plots.

16.4 Multiple Regression and R²

Real psychological outcomes usually depend on many factors at once. Multiple regression extends the linear model to several predictors:

\[\widehat{Y} = b_0 + b_1 X_1 + b_2 X_2 + \dots + b_k X_k.\]

For example, our synthetic dataset in Chapter 16 includes:

  • study_hours – weekly study time,

  • sleep_hours – average nightly sleep,

  • stress – perceived stress score,

  • and exam_score – exam performance.

A multiple regression model might be:

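\[\widehat{\text{Exam Score}} = b_0 + b_1 \times \text{Study Hours} + b_2 \times \text{Sleep Hours} + b_3 \times \text{Stress}.\]

Every term enters with a plus sign, as in the general form. The direction of each effect is carried by the coefficient itself, so if higher stress predicts lower exam scores, \(b_3\) will simply be negative.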

Key ideas:

  • Each slope \(b_j\) is a partial effect: it tells us how much \(Y\) is expected to change when \(X_j\) increases by one unit, holding the other predictors constant.

  • We can assess overall model fit with \(R^2\), the proportion of variance in \(Y\) explained by the predictors.

  • Adjusted \(R^2\) penalizes adding predictors that do not really help.
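
To make these ideas concrete, the sketch below fits a multiple regression with np.linalg.lstsq and computes \(R^2 = 1 - SS_\text{res} / SS_\text{tot}\) and adjusted \(R^2 = 1 - (1 - R^2)(n - 1)/(n - k - 1)\), where \(k\) is the number of predictors. The helper name fit_ols and the eight rows of data are assumptions made for illustration; the lab script itself relies on pingouin.linear_regression() for this step.

```python
import numpy as np


def fit_ols(X, y):
    """Least-squares fit with an intercept; returns (coefs, r2, adj_r2)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, k = X.shape  # n cases, k predictors

    # Prepend a column of ones so the first coefficient is the intercept b_0.
    design = np.column_stack([np.ones(n), X])
    coefs, *_ = np.linalg.lstsq(design, y, rcond=None)

    residuals = y - design @ coefs
    ss_res = np.sum(residuals ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)

    r2 = 1 - ss_res / ss_tot
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)
    return coefs, r2, adj_r2


# Hypothetical data. Columns: study_hours, sleep_hours, stress.
X = [
    [2, 6, 30], [4, 7, 22], [6, 5, 35], [8, 8, 24],
    [10, 6, 28], [3, 7, 31], [7, 8, 18], [9, 5, 27],
]
exam_score = [62, 72, 70, 88, 86, 69, 87, 78]

coefs, r2, adj_r2 = fit_ols(X, exam_score)
print("b_0..b_3:", np.round(coefs, 2))
print(f"R^2 = {r2:.3f}, adjusted R^2 = {adj_r2:.3f}")
```

Adjusted \(R^2\) is never larger than \(R^2\), and the gap widens as predictors are added relative to the sample size.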

In the lab script we use pingouin.linear_regression() to fit a multiple regression model with exam_score as the outcome and multiple predictors. We then interpret:

  • the coefficient signs (which predictors help or hurt),

  • their statistical significance (p-values),

  • and the overall \(R^2\) / adjusted \(R^2\).

16.5 PyStatsV1 Lab: Building a Predictive Model

The Chapter 16 lab script is scripts.psych_ch16_regression. It demonstrates:

  • simulating a psychology dataset with exam performance,

  • fitting and interpreting a simple linear regression,

  • fitting a multiple regression with several predictors,

  • saving results for replication,

  • and visualizing the line of best fit.

Overview of the lab script

The script is structured into a few main helper functions:

  • simulate_psych_regression_dataset()

    Creates a synthetic dataset with columns such as stress, sleep_hours, study_hours, and exam_score, using a known underlying regression model. Because we know the “true” slopes, we can check that the estimated values behave as expected.

  • fit_simple_regression()

    Fits a simple regression predicting exam_score from study_hours. Returns the slope, intercept, correlation, \(R^2\), and standard error of the estimate.

  • fit_multiple_regression()

    Uses pingouin.linear_regression() to fit a multiple regression model with several predictors. Returns the regression table along with \(R^2\) and adjusted \(R^2\) for quick inspection.

  • plot_regression_line()

    Generates a scatterplot of study_hours versus exam_score along with the fitted line. The figure is saved to the outputs/track_b folder for use in slides or assignments.
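
The idea behind simulating data from a known model can be sketched as follows. This is not the lab script's implementation: the function name simulate_toy_dataset, the "true" coefficients, and the predictor distributions are all hypothetical choices made for this illustration. The point is only that, when the generating model is known, the estimated coefficients should land close to the true ones.

```python
import numpy as np

# Hypothetical "true" coefficients, chosen only for this illustration.
TRUE_COEFS = {
    "intercept": 50.0,
    "study_hours": 3.0,
    "sleep_hours": 2.0,
    "stress": -0.5,
}


def simulate_toy_dataset(n=2000, noise_sd=5.0, seed=123):
    """Simulate predictors and an exam score from a known linear model."""
    rng = np.random.default_rng(seed)
    study = rng.uniform(0, 10, size=n)
    sleep = rng.normal(7, 1, size=n)
    stress = rng.normal(20, 5, size=n)
    noise = rng.normal(0, noise_sd, size=n)
    exam = (
        TRUE_COEFS["intercept"]
        + TRUE_COEFS["study_hours"] * study
        + TRUE_COEFS["sleep_hours"] * sleep
        + TRUE_COEFS["stress"] * stress
        + noise
    )
    return {
        "study_hours": study,
        "sleep_hours": sleep,
        "stress": stress,
        "exam_score": exam,
    }


data = simulate_toy_dataset()

# Design matrix columns follow the same order as TRUE_COEFS.
design = np.column_stack([
    np.ones(len(data["exam_score"])),
    data["study_hours"],
    data["sleep_hours"],
    data["stress"],
])
estimates, *_ = np.linalg.lstsq(design, data["exam_score"], rcond=None)

for name, est in zip(TRUE_COEFS, estimates):
    print(f"{name:12s} true = {TRUE_COEFS[name]:6.2f}  estimated = {est:6.2f}")
```

Fixing the seed makes the simulated dataset reproducible, so the same numbers appear every time the sketch is run.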

When you run the script via:

make psych-ch16

you will see printed output that includes:

  • a preview of the simulated dataset,

  • the simple regression slope, intercept, \(R^2\), and standard error of the estimate,

  • the multiple regression summary from pingouin,

  • and file paths where the data, table, and figure were saved.

Files written by the lab

The script saves three main artifacts:

  • data/synthetic/psych_ch16_regression.csv

    The simulated psychology dataset used for all analyses.

  • outputs/track_b/ch16_regression_summary.csv

    A CSV file containing the multiple regression summary table produced by pingouin.linear_regression().

  • outputs/track_b/ch16_regression_fit.png

    A scatterplot of study hours and exam scores with the regression line superimposed.

These files make it easy to reproduce the main figures and tables for homework, lecture slides, or exam preparation.

PyStatsV1 and Pingouin together

As in Chapters 14 and 15, we treat pingouin as a trusted reference implementation. Our own helper functions use NumPy and pandas to compute regression quantities “from scratch”, and then we cross-check key numbers against pingouin.linear_regression().

This dual approach reinforces the core philosophy of PyStatsV1:

Don’t just calculate your results — engineer them.

By writing small, well-tested functions and validating them against trusted libraries, students learn both the statistical ideas and the software engineering mindset needed for reproducible science.
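
The cross-checking pattern described in this section can be sketched as a small function plus an assertion. In the lab the reference implementation is pingouin.linear_regression(); here np.polyfit stands in for it so that the sketch depends only on NumPy. The function names and the simulated data are hypothetical.

```python
import numpy as np


def slope_intercept_from_scratch(x, y):
    """Closed-form least-squares slope and intercept for simple regression."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    a = y.mean() - b * x.mean()
    return b, a


def check_against_reference(x, y, tol=1e-8):
    """Fail loudly if the from-scratch values disagree with the reference."""
    b, a = slope_intercept_from_scratch(x, y)
    # Stand-in reference: np.polyfit returns [slope, intercept] for deg=1.
    ref_b, ref_a = np.polyfit(x, y, deg=1)
    assert np.isclose(b, ref_b, atol=tol), f"slope mismatch: {b} vs {ref_b}"
    assert np.isclose(a, ref_a, atol=tol), f"intercept mismatch: {a} vs {ref_a}"
    return b, a


# Hypothetical data generated from a line with slope 5 and intercept 60.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 60 + 5 * x + rng.normal(0, 4, size=50)

b, a = check_against_reference(x, y)
print(f"slope = {b:.2f}, intercept = {a:.2f} (matches the reference)")
```

If the two implementations ever disagree, the assertion stops the script instead of letting a silent error flow into the saved tables and figures.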

Checklist: What You Should Be Able to Do

By the end of Chapter 16, you should be able to:

  • explain what the regression line \(\widehat{Y} = bX + a\) means in words,

  • interpret the slope and intercept in a psychology example,

  • describe how the least-squares method chooses the “best” line,

  • compute and interpret the standard error of the estimate,

  • explain the difference between simple and multiple regression,

  • interpret \(R^2\) and adjusted \(R^2\),

  • and run the Chapter 16 PyStatsV1 lab to build and evaluate a predictive model.