Chapter 16 – Linear Regression
In Chapter 15 you learned how to quantify relationships between two variables using correlation. Correlation answers the question:
“How strongly are two variables associated?”
Linear regression goes one step further. It answers a different question:
“How well can we predict one variable from one or more other variables?”
In this chapter you will:
build the idea of a line of best fit for predicting an outcome,
understand how the least-squares method chooses that line,
interpret the standard error of the estimate as typical prediction error,
extend to multiple regression, where several predictors work together,
and use pystatsv1 and pingouin to fit and interpret regression models on a synthetic psychology dataset.
16.1 Prediction: The Line of Best Fit
Imagine we want to predict a student’s exam score from their number of
study hours. Each student gives us one data point:
(study_hours, exam_score). If we scatter those points, a clear
trend often appears: students who study more tend to score higher.
Linear regression summarizes that trend with a straight line:
\[\widehat{Y} = bX + a\]
where
\(\widehat{Y}\) is the predicted value of the outcome,
\(X\) is the predictor,
\(b\) is the slope (how much \(Y\) changes for a one-unit change in \(X\)), and
\(a\) is the intercept (the predicted value of \(Y\) when \(X = 0\)).
In psychology, we often use regression to predict:
exam performance from study time,
depressive symptoms from life stress,
therapy outcomes from baseline severity and treatment type,
or attention scores from sleep quality and caffeine intake.
The important idea is that regression is a model of prediction. We care both about how strong the relationship is and how well we can forecast new data.
Interpretation of the slope
Suppose our fitted line is
\[\widehat{Y} = 5.0X + 60\]
The slope \(b = 5.0\) means:
For every extra hour of study, exam score is predicted to increase by about 5 points, on average.
The intercept \(a = 60\) means:
A student who studied 0 hours is predicted to score 60 (although we should be cautious about interpreting predictions far outside the observed range).
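To make the interpretation concrete, here is a minimal sketch that turns the example line (slope 5.0, intercept 60) into a prediction function. The function name `predict_exam_score` is hypothetical and is not part of the lab script.

```python
# Hypothetical fitted line from the running example: slope 5.0, intercept 60.
SLOPE = 5.0       # b: predicted change in exam score per extra study hour
INTERCEPT = 60.0  # a: predicted exam score at 0 study hours


def predict_exam_score(study_hours: float) -> float:
    """Return the predicted exam score, Y-hat = b * X + a."""
    return SLOPE * study_hours + INTERCEPT


for hours in (0, 2, 4):
    print(f"{hours} h -> predicted score {predict_exam_score(hours):.1f}")

# One extra hour of study changes the prediction by exactly the slope.
difference = predict_exam_score(3) - predict_exam_score(2)
print(f"Change per extra hour: {difference:.1f}")
```

The prediction at 0 hours is just the intercept, which is why it should be read with caution when no student in the data studied that little.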
16.2 Least Squares: Choosing the Best Line
There are infinitely many lines we could draw through a cloud of points. Linear regression chooses the one that minimizes the sum of squared residuals.
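A residual is the difference between the observed outcome and the predicted outcome from the line:
\[e_i = Y_i - \widehat{Y}_i\]
The least-squares solution chooses \(a\) and \(b\) to minimize
\[\sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} \left(Y_i - (bX_i + a)\right)^2\]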
This has several nice properties:
it gives more weight to large errors (because of squaring),
it has a closed-form solution (no iterative search is required),
and it links naturally to the Pearson correlation \(r\) and the ANOVA framework you saw in Chapters 12–14.
In the Chapter 16 lab script we compute the least-squares solution
using both basic NumPy functions and the higher-level
pingouin.linear_regression() helper for cross-checking.
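The closed-form solution can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the lab script's implementation: the five data points are invented, and the helper names (`least_squares_line`, `sum_squared_residuals`, `pearson_r`, `sample_sd`) are hypothetical. The sketch also checks the link to correlation, namely that the slope equals \(r\) times the ratio of the standard deviations of \(Y\) and \(X\).

```python
import math

# Hypothetical (study_hours, exam_score) pairs, invented for illustration.
study_hours = [1.0, 2.0, 3.0, 4.0, 5.0]
exam_score = [62.0, 71.0, 74.0, 83.0, 85.0]


def least_squares_line(x, y):
    """Return (slope, intercept) that minimize the sum of squared residuals."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def sum_squared_residuals(x, y, slope, intercept):
    """Sum of squared differences between observed and predicted Y."""
    return sum((yi - (slope * xi + intercept)) ** 2 for xi, yi in zip(x, y))


def pearson_r(x, y):
    """Pearson correlation between x and y."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)


def sample_sd(values):
    """Sample standard deviation (n - 1 denominator)."""
    mean = sum(values) / len(values)
    return math.sqrt(sum((v - mean) ** 2 for v in values) / (len(values) - 1))


slope, intercept = least_squares_line(study_hours, exam_score)
best_sse = sum_squared_residuals(study_hours, exam_score, slope, intercept)

# Any other line has a larger sum of squared residuals, for example one
# with a slightly steeper slope.
nudged_sse = sum_squared_residuals(study_hours, exam_score, slope + 0.5, intercept)

# Link to correlation: b = r * (SD of Y / SD of X).
r = pearson_r(study_hours, exam_score)
slope_from_r = r * sample_sd(exam_score) / sample_sd(study_hours)

print(f"slope = {slope:.2f}, intercept = {intercept:.2f}")
print(f"SSE at least-squares line = {best_sse:.2f}, nudged line = {nudged_sse:.2f}")
print(f"slope recovered from r = {slope_from_r:.2f}")
```

Because the solution is closed-form, no starting values or iteration are needed: the slope and intercept come straight from sums over the data.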
16.3 Standard Error of the Estimate
No regression line predicts perfectly. Different students with the same study hours will still have different exam scores. The standard error of the estimate summarizes typical prediction error in the original units of the outcome variable.
After we fit the line, we compute residuals \(e_i\) for each case and then:
\[S_\text{est} = \sqrt{\frac{\sum_{i=1}^{n} e_i^2}{n - 2}}\]
This is similar to a standard deviation of the residuals. A smaller \(S_\text{est}\) means:
predictions are typically closer to the observed scores,
the regression line fits the data more tightly,
and we have more precise forecasts for new cases.
In the lab we will:
print \(S_\text{est}\) for our simple regression model,
compare it across different models,
and relate it back to the residual plots.
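As a rough sketch of that computation, the snippet below calculates the standard error of the estimate for a small invented dataset. It assumes the common \(n - 2\) denominator for simple regression (one degree of freedom each for the slope and the intercept). Some textbooks divide by \(n\) instead, so check which convention your course uses. The function name `standard_error_of_estimate` is hypothetical.

```python
import math

# Hypothetical (study_hours, exam_score) pairs, invented for illustration.
study_hours = [1.0, 2.0, 3.0, 4.0, 5.0]
exam_score = [62.0, 71.0, 74.0, 83.0, 85.0]

# Least-squares line for these five points (slope 5.8, intercept 57.6).
slope, intercept = 5.8, 57.6


def standard_error_of_estimate(x, y, slope, intercept):
    """Typical prediction error, assuming an n - 2 denominator."""
    residuals = [yi - (slope * xi + intercept) for xi, yi in zip(x, y)]
    sse = sum(e ** 2 for e in residuals)
    return math.sqrt(sse / (len(x) - 2))


s_est = standard_error_of_estimate(study_hours, exam_score, slope, intercept)
print(f"S_est = {s_est:.2f} exam points")
```

The result is in exam points, the units of the outcome, which is what makes it easy to read as a typical prediction error.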
16.4 Multiple Regression and R²
Real psychological outcomes usually depend on many factors at once. Multiple regression extends the linear model to several predictors:
\[\widehat{Y} = b_1 X_1 + b_2 X_2 + \dots + b_k X_k + a\]
For example, our synthetic dataset in Chapter 16 includes:
study_hours – weekly study time,
sleep_hours – average nightly sleep,
stress – perceived stress score,
and exam_score – exam performance.
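A multiple regression model might be:
\[\widehat{Y} = b_1 X_1 + b_2 X_2 + b_3 X_3 + a\]
where \(X_1\) is study_hours, \(X_2\) is sleep_hours, \(X_3\) is stress, and \(\widehat{Y}\) is the predicted exam_score.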
Key ideas:
Each slope \(b_j\) is a partial effect: it tells us how much \(Y\) is expected to change when \(X_j\) increases by one unit, holding the other predictors constant.
We can assess overall model fit with \(R^2\), the proportion of variance in \(Y\) explained by the predictors.
Adjusted \(R^2\) penalizes adding predictors that do not really help.
In the lab script we use pingouin.linear_regression() to fit a
multiple regression model with exam_score as the outcome and
multiple predictors. We then interpret:
the coefficient signs (which predictors help or hurt),
their statistical significance (p-values),
and the overall \(R^2\) / adjusted \(R^2\).
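The sketch below shows how slopes, \(R^2\), and adjusted \(R^2\) can be computed with NumPy alone. It is not the lab script: the simulated ranges, the "true" coefficients, and the random seed are all made up for illustration, and `np.linalg.lstsq` stands in for `pingouin.linear_regression()`.

```python
import numpy as np

rng = np.random.default_rng(16)
n = 200

# Hypothetical predictors; the ranges are invented for illustration only.
study_hours = rng.uniform(0, 20, n)
sleep_hours = rng.normal(7, 1, n)
stress = rng.normal(20, 5, n)

# Hypothetical "true" model used to simulate the outcome.
true_intercept = 50.0
true_slopes = np.array([1.5, 2.0, -0.5])
X = np.column_stack([study_hours, sleep_hours, stress])
exam_score = true_intercept + X @ true_slopes + rng.normal(0, 5, n)

# Add a column of ones so the intercept is estimated along with the slopes.
design = np.column_stack([np.ones(n), X])
coef, *_ = np.linalg.lstsq(design, exam_score, rcond=None)
intercept, slopes = coef[0], coef[1:]

# R^2: proportion of variance in the outcome explained by the predictors.
predicted = design @ coef
ss_res = np.sum((exam_score - predicted) ** 2)
ss_tot = np.sum((exam_score - exam_score.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

# Adjusted R^2 penalizes the number of predictors k.
k = X.shape[1]
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)

for name, b in zip(["study_hours", "sleep_hours", "stress"], slopes):
    print(f"{name}: b = {b:.2f}")
print(f"intercept = {intercept:.2f}, R^2 = {r2:.3f}, adjusted R^2 = {adj_r2:.3f}")
```

Because the data come from a known model, the estimated slopes should land near the simulated values, with a positive sign for study and sleep and a negative sign for stress. Adjusted \(R^2\) is always slightly below \(R^2\) here because the penalty term is greater than one.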
16.5 PyStatsV1 Lab: Building a Predictive Model
The Chapter 16 lab script is
scripts.psych_ch16_regression. It demonstrates:
simulating a psychology dataset with exam performance,
fitting and interpreting a simple linear regression,
fitting a multiple regression with several predictors,
saving results for replication,
and visualizing the line of best fit.
Overview of the lab script
The script is structured into a few main helper functions:
simulate_psych_regression_dataset()
Creates a synthetic dataset with columns such as stress, sleep_hours, study_hours, and exam_score, using a known underlying regression model. Because we know the “true” slopes, we can check that the estimated values behave as expected.
fit_simple_regression()
Fits a simple regression predicting exam_score from study_hours. Returns the slope, intercept, correlation, \(R^2\), and standard error of the estimate.
fit_multiple_regression()
Uses pingouin.linear_regression() to fit a multiple regression model with several predictors. Returns the regression table along with \(R^2\) and adjusted \(R^2\) for quick inspection.
plot_regression_line()
Generates a scatterplot of study_hours versus exam_score along with the fitted line. The figure is saved to the outputs/track_b folder for use in slides or assignments.
When you run the script via:
make psych-ch16
you will see printed output that includes:
a preview of the simulated dataset,
the simple regression slope, intercept, \(R^2\), and standard error of the estimate,
the multiple regression summary from pingouin,
and file paths where the data, table, and figure were saved.
Files written by the lab
The script saves three main artifacts:
data/synthetic/psych_ch16_regression.csv
The simulated psychology dataset used for all analyses.
outputs/track_b/ch16_regression_summary.csv
A CSV file containing the multiple regression summary table produced by pingouin.linear_regression().
outputs/track_b/ch16_regression_fit.png
A scatterplot of study hours and exam scores with the regression line superimposed.
These files make it easy to reproduce the main figures and tables for homework, lecture slides, or exam preparation.
PyStatsV1 and Pingouin together
As in Chapters 14 and 15, we treat pingouin as a
trusted reference implementation. Our own helper functions
use NumPy and pandas to compute regression quantities “from scratch”,
and then we cross-check key numbers against
pingouin.linear_regression().
This dual approach reinforces the core philosophy of PyStatsV1:
Don’t just calculate your results — engineer them.
By writing small, well-tested functions and validating them against trusted libraries, students learn both the statistical ideas and the software engineering mindset needed for reproducible science.
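The cross-checking idea can be sketched as follows. To keep the example dependent on NumPy only, `np.polyfit` plays the role of the trusted reference here, where the lab uses `pingouin.linear_regression()`. The data, the seed, and the helper name `fit_from_scratch` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: exam scores that rise about 5 points per study hour.
x = rng.uniform(0, 10, 50)
y = 60 + 5 * x + rng.normal(0, 4, 50)


def fit_from_scratch(x, y):
    """Closed-form least-squares slope and intercept."""
    x_dev = x - x.mean()
    slope = np.sum(x_dev * (y - y.mean())) / np.sum(x_dev ** 2)
    intercept = y.mean() - slope * x.mean()
    return slope, intercept


slope, intercept = fit_from_scratch(x, y)

# Reference implementation: a degree-1 polynomial fit returns
# the coefficients in the order [slope, intercept].
ref_slope, ref_intercept = np.polyfit(x, y, deg=1)

# The check fails loudly if the two implementations ever disagree.
assert np.isclose(slope, ref_slope)
assert np.isclose(intercept, ref_intercept)
print(f"slope = {slope:.3f}, intercept = {intercept:.3f} (matches reference)")
```

With pingouin installed, the same comparison could be made against the coefficient column of the table returned by `pingouin.linear_regression(X, y)`. Treat that as a pointer to verify against the pingouin documentation, not as a description of what the lab script does.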
Checklist: What You Should Be Able to Do
By the end of Chapter 16, you should be able to:
explain what the regression line \(\widehat{Y} = bX + a\) means in words,
interpret the slope and intercept in a psychology example,
describe how the least-squares method chooses the “best” line,
compute and interpret the standard error of the estimate,
explain the difference between simple and multiple regression,
interpret \(R^2\) and adjusted \(R^2\),
and run the Chapter 16 PyStatsV1 lab to build and evaluate a predictive model.