Applied Statistics with Python – Chapter 7
Simple linear regression in Python and R
This chapter parallels the Simple Linear Regression chapter from the R notes. The statistical ideas are the same:
We have a numeric predictor (
x) and a numeric response (y).We want to describe how the mean of
ychanges asxchanges.We quantify that relationship with a line plus random noise.
In the R notes, the running example is the classic cars dataset:
speed– car speed in miles per hour,dist– stopping distance in feet.
In PyStatsV1 we reuse the same data, but we work in Python-first terms:
NumPy, pandas, Matplotlib, and the statsmodels regression API.
By the end of this chapter you should be able to:
write down the simple linear regression model and its assumptions,
compute least–squares estimates by hand (with NumPy),
fit the same model using a high-level tool (
statsmodels.ols),interpret the slope, intercept, residuals, and \(R^2\),
use the fitted model to make predictions (and know when not to trust them),
understand how the R function
lm()corresponds to the Python tools we use.
Note
Throughout this chapter we assume you have the PyStatsV1 repository checked out locally and that you can run the chapter script
scripts/ch07_simple_linear_regression.py
which contains a full, executable version of the examples.
7.1 From scatterplots to models
We start with a scatterplot: speed vs. stopping distance.
In Python, with a DataFrame cars that has speed and dist columns:
import pandas as pd
import matplotlib.pyplot as plt
cars = pd.read_csv("data/cars.csv") # see PyStatsV1 data folder
fig, ax = plt.subplots()
ax.scatter(cars["speed"], cars["dist"], alpha=0.7)
ax.set_xlabel("Speed (mph)")
ax.set_ylabel("Stopping distance (ft)")
ax.set_title("Stopping distance vs. speed")
plt.show()
The plot tells us:
Faster cars tend to have longer stopping distances.
The points do not fall exactly on a line—there is random variation.
We can express this informally as
response = pattern + noise.
In math notation we write
where
\(X\) – predictor (speed),
\(Y\) – response (stopping distance),
\(f(\cdot)\) – unknown systematic pattern,
\(\varepsilon\) – random error.
In this chapter we restrict \(f(\cdot)\) to be a line.
7.2 The simple linear regression model
The simple linear regression (SLR) model uses a straight line to describe
how the mean of Y changes with X:
Here
\(x_i\) – observed predictor value (fixed, not random),
\(Y_i\) – random response,
\(\beta_0\) – intercept,
\(\beta_1\) – slope,
\(\varepsilon_i\) – random error term.
We assume the errors satisfy the usual LINE conditions:
L – Linear: the mean of \(Y\) is a straight line in \(x\),
\[\mathbb{E}[Y_i \mid X_i = x_i] = \beta_0 + \beta_1 x_i.\]I – Independent: errors \(\varepsilon_i\) are independent.
N – Normal: errors are normally distributed,
\[\varepsilon_i \sim \mathcal{N}(0, \sigma^2).\]E – Equal variance: all errors share the same variance \(\sigma^2\).
Under these assumptions, the conditional distribution of \(Y_i\) is
The three unknown parameters are \(\beta_0\), \(\beta_1\), and \(\sigma^2\). Our task is to estimate them from data.
7.3 Least squares: estimating the line
Given observed pairs \((x_i, y_i)\), the least–squares idea is:
Choose the line that makes the squared vertical errors as small as possible.
Vertical errors (residuals) are
We want \(\beta_0\) and \(\beta_1\) that minimize
Solving the resulting equations gives the familiar closed forms
where \(\bar{x}\) and \(\bar{y}\) are the sample means.
7.3.1 Computing \(\hat\beta_0\) and \(\hat\beta_1\) in Python
Using NumPy with the cars dataset:
import numpy as np
import pandas as pd
cars = pd.read_csv("data/cars.csv")
x = cars["speed"].to_numpy()
y = cars["dist"].to_numpy()
x_bar = x.mean()
y_bar = y.mean()
Sxx = np.sum((x - x_bar) ** 2)
Sxy = np.sum((x - x_bar) * (y - y_bar))
beta1_hat = Sxy / Sxx
beta0_hat = y_bar - beta1_hat * x_bar
print(beta0_hat, beta1_hat)
Interpretation (for this dataset):
\(\hat\beta_1 \approx 3.93\) – for each +1 mph of speed, the mean stopping distance increases by about 3.9 feet.
\(\hat\beta_0 \approx -17.6\) – the predicted mean distance at 0 mph (an extrapolation; not physically meaningful, but needed for the line).
7.3.2 Predictions and interpolation
Once we have \(\hat\beta_0\) and \(\hat\beta_1\), the fitted line is
Predictions in Python are straightforward:
def predict_stopping_distance(speed_mph: float) -> float:
return beta0_hat + beta1_hat * speed_mph
predict_stopping_distance(8) # within data range (interpolation)
predict_stopping_distance(21) # within range but not observed
predict_stopping_distance(50) # outside range (extrapolation – be careful!)
Key idea:
Interpolation inside the observed
xrange is usually reasonable.Extrapolation far beyond the data range is risky, even with a good line.
7.4 Residuals, variance, and \(R^2\)
Once the line is fitted, we inspect how far each point is from that line.
Residuals
Residuals are observed minus fitted values:
In NumPy:
y_hat = beta0_hat + beta1_hat * x
residuals = y - y_hat
residuals[:5]
A good first diagnostic is a residual vs. fitted plot; you should see:
no strong curve (supports linearity),
roughly constant spread (supports equal variance),
no obvious pattern or clustering (supports independence).
Residual variance and residual standard error
The sum of squared residuals is
We estimate \(\sigma^2\) by
where n − 2 reflects the two parameters we estimated.
In Python:
n = len(y)
sse = np.sum(residuals ** 2)
s2_e = sse / (n - 2)
s_e = np.sqrt(s2_e) # residual standard error
The residual standard error \(s_e\) is in the same units as y
(feet here). You can read it as:
“Our fitted mean stopping distances are typically off by about \(s_e\) feet.”
Decomposition of variation and \(R^2\)
We can decompose total variation in y into explained and unexplained parts:
Then the coefficient of determination is
Interpretation:
\(R^2\) is the proportion of variation in ``y`` explained by the regression on
x.Values near 1 mean the line explains most of the variability; values near 0 mean the line explains little.
In Python:
sst = np.sum((y - y_bar) ** 2)
ss_reg = np.sum((y_hat - y_bar) ** 2)
r2 = ss_reg / sst # or 1 - sse / sst
For the cars example, \(R^2\) is around 0.65 – about 65% of the
variation in stopping distance is explained by speed alone.
7.5 Using statsmodels: Python’s version of lm()
In R, we would write
cars <- datasets::cars
fit <- lm(dist ~ speed, data = cars)
summary(fit)
predict(fit, newdata = data.frame(speed = 8))
In Python, the closest equivalent is statsmodels’ formula API:
import pandas as pd
import statsmodels.formula.api as smf
cars = pd.read_csv("data/cars.csv")
model = smf.ols("dist ~ speed", data=cars).fit()
print(model.params) # beta_0_hat and beta_1_hat
print(model.rsquared) # R^2
print(model.summary()) # detailed regression table
model.predict({"speed": [8, 21, 50]})
You should recognize many familiar quantities in the summary:
coefficient estimates (\(\hat\beta_0\), \(\hat\beta_1\)),
their standard errors and t-statistics,
\(R^2\) and adjusted \(R^2\),
residual standard error (called
scaleorsigma).
Mapping R → Python
lm(dist ~ speed, data=cars)→smf.ols("dist ~ speed", data=cars).fit()coef(fit)→model.paramsresid(fit)→model.residfitted(fit)→model.fittedvaluessummary(fit)→model.summary()predict(fit, newdata=...)→model.predict(new_dataframe)
7.6 Simulation: seeing SLR in action
Simulation is a powerful way to see what a model means.
Suppose the true relationship is
We can simulate a dataset, fit a line, and compare estimates to truth:
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
rng = np.random.default_rng(seed=1)
n = 21
beta0_true = 5.0
beta1_true = -2.0
sigma_true = 3.0
x = np.linspace(0, 10, n)
eps = rng.normal(loc=0.0, scale=sigma_true, size=n)
y = beta0_true + beta1_true * x + eps
sim = pd.DataFrame({"x": x, "y": y})
fit = smf.ols("y ~ x", data=sim).fit()
print(fit.params)
If you plot the data and overlay both lines (true and fitted), you will see that the estimated line is close, but not perfect—that’s sampling variability.
This kind of simulation becomes even more useful in later chapters (e.g., to study confidence intervals, hypothesis tests, and power).
7.7 What you should take away
By the end of this chapter you should be comfortable with:
thinking in terms of models:
\[Y = \beta_0 + \beta_1 X + \varepsilon,\]and “response = prediction + error”,
interpreting the slope and intercept in context,
computing least–squares estimates \(\hat\beta_0\) and \(\hat\beta_1\),
computing and interpreting residuals, SSE, residual standard error,
interpreting \(R^2\) as “fraction of variance explained,”
using
statsmodelsin Python as the counterpart of R’slm():model = smf.ols("dist ~ speed", data=cars).fit() model.summary() model.predict({"speed": [10, 20]})
In later PyStatsV1 chapters, we will:
extend SLR to multiple regression,
build models with transformed predictors (e.g., quadratic terms),
and connect regression more formally to inference (tests and confidence intervals).
If any of the algebra feels fuzzy, focus first on the pictures and the Python code; the formulas will slowly become familiar as you keep applying them to real datasets.