Applied Statistics with Python – Chapter 7 ========================================== Simple linear regression in Python and R ---------------------------------------- This chapter parallels the *Simple Linear Regression* chapter from the R notes. The statistical ideas are the same: * We have a **numeric predictor** (``x``) and a **numeric response** (``y``). * We want to describe how the *mean* of ``y`` changes as ``x`` changes. * We quantify that relationship with a **line** plus **random noise**. In the R notes, the running example is the classic ``cars`` dataset: * ``speed`` – car speed in miles per hour, * ``dist`` – stopping distance in feet. In PyStatsV1 we reuse the same data, but we work in **Python-first** terms: NumPy, pandas, Matplotlib, and the ``statsmodels`` regression API. By the end of this chapter you should be able to: * write down the **simple linear regression model** and its assumptions, * compute least–squares estimates by hand (with NumPy), * fit the same model using a high-level tool (``statsmodels.ols``), * interpret the slope, intercept, residuals, and :math:`R^2`, * use the fitted model to make **predictions** (and know when not to trust them), * understand how the R function ``lm()`` corresponds to the Python tools we use. .. note:: Throughout this chapter we assume you have the PyStatsV1 repository checked out locally and that you can run the chapter script ``scripts/ch07_simple_linear_regression.py`` which contains a full, executable version of the examples. 7.1 From scatterplots to models ------------------------------- We start with a scatterplot: *speed vs. stopping distance*. In Python, with a DataFrame ``cars`` that has ``speed`` and ``dist`` columns: .. code-block:: python import pandas as pd import matplotlib.pyplot as plt cars = pd.read_csv("data/cars.csv") # see PyStatsV1 data folder fig, ax = plt.subplots() ax.scatter(cars["speed"], cars["dist"], alpha=0.7) ax.set_xlabel("Speed (mph)") ax.set_ylabel("Stopping distance (ft)") ax.set_title("Stopping distance vs. speed") plt.show() The plot tells us: * Faster cars tend to have **longer** stopping distances. * The points do not fall exactly on a line—there is **random variation**. We can express this informally as *response = pattern + noise*. In math notation we write .. math:: Y = f(X) + \varepsilon, where * :math:`X` – predictor (speed), * :math:`Y` – response (stopping distance), * :math:`f(\cdot)` – unknown systematic pattern, * :math:`\varepsilon` – random error. In this chapter we **restrict** :math:`f(\cdot)` to be a *line*. 7.2 The simple linear regression model -------------------------------------- The **simple linear regression (SLR)** model uses a straight line to describe how the mean of ``Y`` changes with ``X``: .. math:: Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \qquad i = 1,\dots,n. Here * :math:`x_i` – observed predictor value (fixed, not random), * :math:`Y_i` – random response, * :math:`\beta_0` – intercept, * :math:`\beta_1` – slope, * :math:`\varepsilon_i` – random error term. We assume the errors satisfy the usual **LINE** conditions: * **L – Linear**: the mean of :math:`Y` is a straight line in :math:`x`, .. math:: \mathbb{E}[Y_i \mid X_i = x_i] = \beta_0 + \beta_1 x_i. * **I – Independent**: errors :math:`\varepsilon_i` are independent. * **N – Normal**: errors are normally distributed, .. math:: \varepsilon_i \sim \mathcal{N}(0, \sigma^2). * **E – Equal variance**: all errors share the same variance :math:`\sigma^2`. Under these assumptions, the conditional distribution of :math:`Y_i` is .. math:: Y_i \mid X_i = x_i \sim \mathcal{N}(\beta_0 + \beta_1 x_i,\; \sigma^2). The three unknown parameters are :math:`\beta_0`, :math:`\beta_1`, and :math:`\sigma^2`. Our task is to estimate them from data. 7.3 Least squares: estimating the line -------------------------------------- Given observed pairs :math:`(x_i, y_i)`, the **least–squares** idea is: *Choose the line that makes the squared vertical errors as small as possible.* Vertical errors (residuals) are .. math:: e_i = y_i - (\beta_0 + \beta_1 x_i). We want :math:`\beta_0` and :math:`\beta_1` that minimize .. math:: \text{SSE}(\beta_0,\beta_1) = \sum_{i=1}^n \big(y_i - (\beta_0 + \beta_1 x_i)\big)^2. Solving the resulting equations gives the familiar closed forms .. math:: \hat\beta_1 = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})} {\sum_{i=1}^n (x_i - \bar{x})^2}, \qquad \hat\beta_0 = \bar{y} - \hat\beta_1 \bar{x}, where :math:`\bar{x}` and :math:`\bar{y}` are the sample means. 7.3.1 Computing :math:`\hat\beta_0` and :math:`\hat\beta_1` in Python ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Using NumPy with the ``cars`` dataset: .. code-block:: python import numpy as np import pandas as pd cars = pd.read_csv("data/cars.csv") x = cars["speed"].to_numpy() y = cars["dist"].to_numpy() x_bar = x.mean() y_bar = y.mean() Sxx = np.sum((x - x_bar) ** 2) Sxy = np.sum((x - x_bar) * (y - y_bar)) beta1_hat = Sxy / Sxx beta0_hat = y_bar - beta1_hat * x_bar print(beta0_hat, beta1_hat) Interpretation (for this dataset): * :math:`\hat\beta_1 \approx 3.93` – for each +1 mph of speed, the **mean** stopping distance increases by about **3.9 feet**. * :math:`\hat\beta_0 \approx -17.6` – the predicted mean distance at 0 mph (an extrapolation; not physically meaningful, but needed for the line). 7.3.2 Predictions and interpolation ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Once we have :math:`\hat\beta_0` and :math:`\hat\beta_1`, the fitted line is .. math:: \hat{y} = \hat\beta_0 + \hat\beta_1 x. Predictions in Python are straightforward: .. code-block:: python def predict_stopping_distance(speed_mph: float) -> float: return beta0_hat + beta1_hat * speed_mph predict_stopping_distance(8) # within data range (interpolation) predict_stopping_distance(21) # within range but not observed predict_stopping_distance(50) # outside range (extrapolation – be careful!) Key idea: * **Interpolation** inside the observed ``x`` range is usually reasonable. * **Extrapolation** far beyond the data range is risky, even with a good line. 7.4 Residuals, variance, and :math:`R^2` ---------------------------------------- Once the line is fitted, we inspect how far each point is from that line. Residuals ^^^^^^^^^ Residuals are observed minus fitted values: .. math:: e_i = y_i - \hat{y}_i = y_i - (\hat\beta_0 + \hat\beta_1 x_i). In NumPy: .. code-block:: python y_hat = beta0_hat + beta1_hat * x residuals = y - y_hat residuals[:5] A good first diagnostic is a **residual vs. fitted** plot; you should see: * no strong curve (supports linearity), * roughly constant spread (supports equal variance), * no obvious pattern or clustering (supports independence). Residual variance and residual standard error ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ The sum of squared residuals is .. math:: \text{SSE} = \sum_{i=1}^n e_i^2. We estimate :math:`\sigma^2` by .. math:: s_e^2 = \frac{\text{SSE}}{n - 2}, where ``n − 2`` reflects the two parameters we estimated. In Python: .. code-block:: python n = len(y) sse = np.sum(residuals ** 2) s2_e = sse / (n - 2) s_e = np.sqrt(s2_e) # residual standard error The residual standard error :math:`s_e` is in the same units as ``y`` (feet here). You can read it as: “Our fitted mean stopping distances are typically off by about :math:`s_e` feet.” Decomposition of variation and :math:`R^2` ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ We can decompose total variation in ``y`` into explained and unexplained parts: .. math:: \underbrace{\sum (y_i - \bar{y})^2}_{\text{SST}} = \underbrace{\sum (\hat{y}_i - \bar{y})^2}_{\text{SSReg}} + \underbrace{\sum (y_i - \hat{y}_i)^2}_{\text{SSE}}. Then the **coefficient of determination** is .. math:: R^2 = \frac{\text{SSReg}}{\text{SST}} = 1 - \frac{\text{SSE}}{\text{SST}}. Interpretation: * :math:`R^2` is the **proportion of variation in ``y``** explained by the regression on ``x``. * Values near 1 mean the line explains most of the variability; values near 0 mean the line explains little. In Python: .. code-block:: python sst = np.sum((y - y_bar) ** 2) ss_reg = np.sum((y_hat - y_bar) ** 2) r2 = ss_reg / sst # or 1 - sse / sst For the ``cars`` example, :math:`R^2` is around 0.65 – about 65% of the variation in stopping distance is explained by speed alone. 7.5 Using ``statsmodels``: Python’s version of ``lm()`` ------------------------------------------------------- In R, we would write .. code-block:: r cars <- datasets::cars fit <- lm(dist ~ speed, data = cars) summary(fit) predict(fit, newdata = data.frame(speed = 8)) In Python, the closest equivalent is ``statsmodels``’ formula API: .. code-block:: python import pandas as pd import statsmodels.formula.api as smf cars = pd.read_csv("data/cars.csv") model = smf.ols("dist ~ speed", data=cars).fit() print(model.params) # beta_0_hat and beta_1_hat print(model.rsquared) # R^2 print(model.summary()) # detailed regression table model.predict({"speed": [8, 21, 50]}) You should recognize many familiar quantities in the summary: * coefficient estimates (:math:`\hat\beta_0`, :math:`\hat\beta_1`), * their standard errors and t-statistics, * :math:`R^2` and adjusted :math:`R^2`, * residual standard error (called ``scale`` or ``sigma``). Mapping R → Python ^^^^^^^^^^^^^^^^^^^ * ``lm(dist ~ speed, data=cars)`` → ``smf.ols("dist ~ speed", data=cars).fit()`` * ``coef(fit)`` → ``model.params`` * ``resid(fit)`` → ``model.resid`` * ``fitted(fit)`` → ``model.fittedvalues`` * ``summary(fit)`` → ``model.summary()`` * ``predict(fit, newdata=...)`` → ``model.predict(new_dataframe)`` 7.6 Simulation: seeing SLR in action ------------------------------------ Simulation is a powerful way to *see* what a model means. Suppose the **true** relationship is .. math:: Y = 5 - 2x + \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, 3^2). We can simulate a dataset, fit a line, and compare estimates to truth: .. code-block:: python import numpy as np import pandas as pd import statsmodels.formula.api as smf rng = np.random.default_rng(seed=1) n = 21 beta0_true = 5.0 beta1_true = -2.0 sigma_true = 3.0 x = np.linspace(0, 10, n) eps = rng.normal(loc=0.0, scale=sigma_true, size=n) y = beta0_true + beta1_true * x + eps sim = pd.DataFrame({"x": x, "y": y}) fit = smf.ols("y ~ x", data=sim).fit() print(fit.params) If you plot the data and overlay both lines (true and fitted), you will see that the estimated line is close, but not perfect—that’s sampling variability. This kind of simulation becomes even more useful in later chapters (e.g., to study confidence intervals, hypothesis tests, and power). 7.7 What you should take away ------------------------------ By the end of this chapter you should be comfortable with: * thinking in terms of **models**: .. math:: Y = \beta_0 + \beta_1 X + \varepsilon, and “response = prediction + error”, * interpreting the **slope** and **intercept** in context, * computing least–squares estimates :math:`\hat\beta_0` and :math:`\hat\beta_1`, * computing and interpreting **residuals**, **SSE**, **residual standard error**, * interpreting :math:`R^2` as “fraction of variance explained,” * using ``statsmodels`` in Python as the counterpart of R’s ``lm()``: .. code-block:: python model = smf.ols("dist ~ speed", data=cars).fit() model.summary() model.predict({"speed": [10, 20]}) In later PyStatsV1 chapters, we will: * extend SLR to **multiple** regression, * build models with **transformed** predictors (e.g., quadratic terms), * and connect regression more formally to **inference** (tests and confidence intervals). If any of the algebra feels fuzzy, focus first on the **pictures** and the Python code; the formulas will slowly become familiar as you keep applying them to real datasets.