Applied Statistics with Python – Chapter 7

Simple linear regression in Python and R

This chapter parallels the Simple Linear Regression chapter from the R notes. The statistical ideas are the same:

We have a numeric predictor (x) and a numeric response (y).
We want to describe how the mean of y changes as x changes.
We quantify that relationship with a line plus random noise.

In the R notes, the running example is the classic cars dataset:

speed – car speed in miles per hour,
dist – stopping distance in feet.

In PyStatsV1 we reuse the same data, but we work in Python-first terms: NumPy, pandas, Matplotlib, and the statsmodels regression API.

By the end of this chapter you should be able to:

write down the simple linear regression model and its assumptions,
compute least–squares estimates by hand (with NumPy),
fit the same model using a high-level tool (statsmodels.ols),
interpret the slope, intercept, residuals, and \(R^2\),
use the fitted model to make predictions (and know when not to trust them),
understand how the R function lm() corresponds to the Python tools we use.

Note

Throughout this chapter we assume you have the PyStatsV1 repository checked out locally and that you can run the chapter script

scripts/ch07_simple_linear_regression.py

which contains a full, executable version of the examples.

7.1 From scatterplots to models

We start with a scatterplot: speed vs. stopping distance.

In Python, with a DataFrame cars that has speed and dist columns:

import pandas as pd
import matplotlib.pyplot as plt

cars = pd.read_csv("data/cars.csv")  # see PyStatsV1 data folder

fig, ax = plt.subplots()
ax.scatter(cars["speed"], cars["dist"], alpha=0.7)
ax.set_xlabel("Speed (mph)")
ax.set_ylabel("Stopping distance (ft)")
ax.set_title("Stopping distance vs. speed")
plt.show()

The plot tells us:

Faster cars tend to have longer stopping distances.
The points do not fall exactly on a line—there is random variation.

We can express this informally as

response = pattern + noise.

In math notation we write

\[Y = f(X) + \varepsilon,\]

where

\(X\) – predictor (speed),
\(Y\) – response (stopping distance),
\(f(\cdot)\) – unknown systematic pattern,
\(\varepsilon\) – random error.

In this chapter we restrict \(f(\cdot)\) to be a line.

7.2 The simple linear regression model

The simple linear regression (SLR) model uses a straight line to describe how the mean of Y changes with X:

\[Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \qquad i = 1,\dots,n.\]

Here

\(x_i\) – observed predictor value (fixed, not random),
\(Y_i\) – random response,
\(\beta_0\) – intercept,
\(\beta_1\) – slope,
\(\varepsilon_i\) – random error term.

We assume the errors satisfy the usual LINE conditions:

L – Linear: the mean of \(Y\) is a straight line in \(x\),

\[\mathbb{E}[Y_i \mid X_i = x_i] = \beta_0 + \beta_1 x_i.\]
I – Independent: errors \(\varepsilon_i\) are independent.
N – Normal: errors are normally distributed,

\[\varepsilon_i \sim \mathcal{N}(0, \sigma^2).\]
E – Equal variance: all errors share the same variance \(\sigma^2\).

Under these assumptions, the conditional distribution of \(Y_i\) is

\[Y_i \mid X_i = x_i \sim \mathcal{N}(\beta_0 + \beta_1 x_i,\; \sigma^2).\]

The three unknown parameters are \(\beta_0\), \(\beta_1\), and \(\sigma^2\). Our task is to estimate them from data.

7.3 Least squares: estimating the line

Given observed pairs \((x_i, y_i)\), the least–squares idea is:

Choose the line that makes the squared vertical errors as small as possible.

Vertical errors (residuals) are

\[e_i = y_i - (\beta_0 + \beta_1 x_i).\]

We want \(\beta_0\) and \(\beta_1\) that minimize

\[\text{SSE}(\beta_0,\beta_1) = \sum_{i=1}^n \big(y_i - (\beta_0 + \beta_1 x_i)\big)^2.\]

Solving the resulting equations gives the familiar closed forms

\[\hat\beta_1 = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})} {\sum_{i=1}^n (x_i - \bar{x})^2}, \qquad \hat\beta_0 = \bar{y} - \hat\beta_1 \bar{x},\]

where \(\bar{x}\) and \(\bar{y}\) are the sample means.

7.3.1 Computing \(\hat\beta_0\) and \(\hat\beta_1\) in Python

Using NumPy with the cars dataset:

import numpy as np
import pandas as pd

cars = pd.read_csv("data/cars.csv")

x = cars["speed"].to_numpy()
y = cars["dist"].to_numpy()

x_bar = x.mean()
y_bar = y.mean()

Sxx = np.sum((x - x_bar) ** 2)
Sxy = np.sum((x - x_bar) * (y - y_bar))

beta1_hat = Sxy / Sxx
beta0_hat = y_bar - beta1_hat * x_bar

print(beta0_hat, beta1_hat)

Interpretation (for this dataset):

\(\hat\beta_1 \approx 3.93\) – for each +1 mph of speed, the mean stopping distance increases by about 3.9 feet.
\(\hat\beta_0 \approx -17.6\) – the predicted mean distance at 0 mph (an extrapolation; not physically meaningful, but needed for the line).

7.3.2 Predictions and interpolation

Once we have \(\hat\beta_0\) and \(\hat\beta_1\), the fitted line is

\[\hat{y} = \hat\beta_0 + \hat\beta_1 x.\]

Predictions in Python are straightforward:

def predict_stopping_distance(speed_mph: float) -> float:
    return beta0_hat + beta1_hat * speed_mph

predict_stopping_distance(8)   # within data range (interpolation)
predict_stopping_distance(21)  # within range but not observed
predict_stopping_distance(50)  # outside range (extrapolation – be careful!)

Key idea:

Interpolation inside the observed x range is usually reasonable.
Extrapolation far beyond the data range is risky, even with a good line.

7.4 Residuals, variance, and \(R^2\)

Once the line is fitted, we inspect how far each point is from that line.

Residuals

Residuals are observed minus fitted values:

\[e_i = y_i - \hat{y}_i = y_i - (\hat\beta_0 + \hat\beta_1 x_i).\]

In NumPy:

y_hat = beta0_hat + beta1_hat * x
residuals = y - y_hat

residuals[:5]

A good first diagnostic is a residual vs. fitted plot; you should see:

no strong curve (supports linearity),
roughly constant spread (supports equal variance),
no obvious pattern or clustering (supports independence).

Residual variance and residual standard error

The sum of squared residuals is

\[\text{SSE} = \sum_{i=1}^n e_i^2.\]

We estimate \(\sigma^2\) by

\[s_e^2 = \frac{\text{SSE}}{n - 2},\]

where n − 2 reflects the two parameters we estimated.

In Python:

n = len(y)
sse = np.sum(residuals ** 2)
s2_e = sse / (n - 2)
s_e = np.sqrt(s2_e)   # residual standard error

The residual standard error \(s_e\) is in the same units as y (feet here). You can read it as:

“Our fitted mean stopping distances are typically off by about \(s_e\) feet.”

Decomposition of variation and \(R^2\)

We can decompose total variation in y into explained and unexplained parts:

\[\underbrace{\sum (y_i - \bar{y})^2}_{\text{SST}} = \underbrace{\sum (\hat{y}_i - \bar{y})^2}_{\text{SSReg}} + \underbrace{\sum (y_i - \hat{y}_i)^2}_{\text{SSE}}.\]

Then the coefficient of determination is

\[R^2 = \frac{\text{SSReg}}{\text{SST}} = 1 - \frac{\text{SSE}}{\text{SST}}.\]

Interpretation:

\(R^2\) is the proportion of variation in ``y`` explained by the regression on x.
Values near 1 mean the line explains most of the variability; values near 0 mean the line explains little.

In Python:

sst = np.sum((y - y_bar) ** 2)
ss_reg = np.sum((y_hat - y_bar) ** 2)
r2 = ss_reg / sst   # or 1 - sse / sst

For the cars example, \(R^2\) is around 0.65 – about 65% of the variation in stopping distance is explained by speed alone.

7.5 Using `statsmodels`: Python’s version of `lm()`

In R, we would write

cars <- datasets::cars
fit  <- lm(dist ~ speed, data = cars)

summary(fit)
predict(fit, newdata = data.frame(speed = 8))

In Python, the closest equivalent is statsmodels’ formula API:

import pandas as pd
import statsmodels.formula.api as smf

cars = pd.read_csv("data/cars.csv")

model = smf.ols("dist ~ speed", data=cars).fit()

print(model.params)      # beta_0_hat and beta_1_hat
print(model.rsquared)    # R^2
print(model.summary())   # detailed regression table

model.predict({"speed": [8, 21, 50]})

You should recognize many familiar quantities in the summary:

coefficient estimates (\(\hat\beta_0\), \(\hat\beta_1\)),
their standard errors and t-statistics,
\(R^2\) and adjusted \(R^2\),
residual standard error (called scale or sigma).

Mapping R → Python

lm(dist ~ speed, data=cars) → smf.ols("dist ~ speed", data=cars).fit()
coef(fit) → model.params
resid(fit) → model.resid
fitted(fit) → model.fittedvalues
summary(fit) → model.summary()
predict(fit, newdata=...) → model.predict(new_dataframe)

7.6 Simulation: seeing SLR in action

Simulation is a powerful way to see what a model means.

Suppose the true relationship is

\[Y = 5 - 2x + \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, 3^2).\]

We can simulate a dataset, fit a line, and compare estimates to truth:

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(seed=1)

n = 21
beta0_true = 5.0
beta1_true = -2.0
sigma_true = 3.0

x = np.linspace(0, 10, n)
eps = rng.normal(loc=0.0, scale=sigma_true, size=n)
y = beta0_true + beta1_true * x + eps

sim = pd.DataFrame({"x": x, "y": y})

fit = smf.ols("y ~ x", data=sim).fit()
print(fit.params)

If you plot the data and overlay both lines (true and fitted), you will see that the estimated line is close, but not perfect—that’s sampling variability.

This kind of simulation becomes even more useful in later chapters (e.g., to study confidence intervals, hypothesis tests, and power).

7.7 What you should take away

By the end of this chapter you should be comfortable with:

thinking in terms of models:

\[Y = \beta_0 + \beta_1 X + \varepsilon,\]

and “response = prediction + error”,
interpreting the slope and intercept in context,
computing least–squares estimates \(\hat\beta_0\) and \(\hat\beta_1\),
computing and interpreting residuals, SSE, residual standard error,
interpreting \(R^2\) as “fraction of variance explained,”

using statsmodels in Python as the counterpart of R’s lm():

model = smf.ols("dist ~ speed", data=cars).fit()
model.summary()
model.predict({"speed": [10, 20]})

In later PyStatsV1 chapters, we will:

extend SLR to multiple regression,
build models with transformed predictors (e.g., quadratic terms),
and connect regression more formally to inference (tests and confidence intervals).

If any of the algebra feels fuzzy, focus first on the pictures and the Python code; the formulas will slowly become familiar as you keep applying them to real datasets.