.. _applied_stats_with_python_ch17_logistic_regression: Applied Statistics with Python – Chapter 17 =========================================== Logistic regression and classification -------------------------------------- So far, our regression chapters have focused on *numeric* response variables and ordinary least squares (OLS). In many applications, however, the response is binary: * disease vs no disease, * spam vs non-spam, * success vs failure, and so on. In this chapter we: * introduce **generalized linear models (GLMs)** as an extension of OLS, * develop **logistic regression** for binary responses, * show how to **fit, interpret, and test** logistic models in Python, and * use logistic regression as a **classifier**, including evaluation metrics such as misclassification rate, sensitivity, and specificity. Throughout, we will mirror the R notes but work in Python using: * :mod:`numpy` and :mod:`pandas` for data handling, * :mod:`statsmodels` for logistic regression and GLMs, and * :mod:`scikit-learn` for classification workflows and cross-validation. .. contents:: :local: :depth: 2 17.1 Generalized linear models ------------------------------ Ordinary linear regression assumes * a *Normal* distribution for the response given the predictors, and * a mean that is a **linear combination** of the predictors: .. math:: Y \mid X = x \sim N(\mu(x), \sigma^2), \qquad \mu(x) = \beta_0 + \beta_1 x_1 + \cdots + \beta_{p-1} x_{p-1}. A **generalized linear model (GLM)** keeps the linear predictor .. math:: \eta(x) = \beta_0 + \beta_1 x_1 + \cdots + \beta_{p-1} x_{p-1} but allows: * different choices of distribution for :math:`Y \mid X = x`, and * a **link function** :math:`g(\cdot)` connecting the linear predictor to the mean: .. math:: \eta(x) = g\big( E[Y \mid X = x] \big). For example: * **Linear regression** * Distribution: Normal * Link: identity :math:`g(\mu) = \mu` * **Poisson regression** * Distribution: Poisson (counts) * Link: log :math:`g(\lambda) = \log \lambda` * **Logistic regression** * Distribution: Bernoulli (binary) * Link: logit :math:`g(p) = \log \{p / (1 - p)\}` 17.2 Binary responses and the logistic model ------------------------------------------- Suppose we have a binary response coded as .. math:: Y = \begin{cases} 1, & \text{event occurs (``yes'', ``spam'', ``disease'')} \\ 0, & \text{otherwise (``no'', ``not spam'', ``no disease'')}. \end{cases} Define .. math:: p(x) = P(Y = 1 \mid X = x). With a **logistic regression model** we assume .. math:: \log \left( \frac{p(x)}{1 - p(x)} \right) = \beta_0 + \beta_1 x_1 + \cdots + \beta_{p-1} x_{p-1}. The left-hand side is the **log-odds** (logit). Applying the inverse logit gives .. math:: p(x) = \frac{\exp\{\eta(x)\}}{1 + \exp\{\eta(x)\}} = \frac{1}{1 + \exp\{-\eta(x)\}}. This guarantees :math:`0 < p(x) < 1`, which is exactly what we want from a probability. Where is the error term? ~~~~~~~~~~~~~~~~~~~~~~~~ In the OLS model we write .. math:: Y = \beta_0 + \beta_1 x_1 + \cdots + \beta_q x_q + \varepsilon, \qquad \varepsilon \sim N(0, \sigma^2), or equivalently .. math:: Y \mid X = x \sim N(\mu(x), \sigma^2). The Normal distribution has two parameters, :math:`\mu(x)` and :math:`\sigma^2`, so we estimate both. In logistic regression the conditional distribution is **Bernoulli** with mean :math:`p(x)`. The distribution has a *single* parameter, so we only need to estimate :math:`p(x)` (through the :math:`\beta` coefficients). There is no separate :math:`\sigma^2` parameter. 17.2.1 Why not OLS on a 0/1 outcome? ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ If :math:`Y \in \{0, 1\}`, the conditional mean is .. math:: E[Y \mid X = x] = P(Y = 1 \mid X = x) = p(x). You might think we can simply fit an ordinary linear regression of :math:`Y` on :math:`X` and interpret the fitted values as estimated probabilities. The problem: * the OLS fitted line is linear in :math:`x`, so fitted values can be **less than 0** or **greater than 1**, which makes no sense as probabilities; * the Normal error assumption is a poor match to a Bernoulli variable. Logistic regression solves both issues: probabilities are constrained to :math:`(0, 1)` and the Bernoulli model correctly reflects the 0/1 nature of the data. 17.2.2 Simulating logistic data in Python ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Here is a direct translation of the R simulation used in the original notes. .. code-block:: python import numpy as np import pandas as pd rng = np.random.default_rng(42) def sim_logistic_data(sample_size=25, beta_0=-2.0, beta_1=3.0) -> pd.DataFrame: x = rng.normal(size=sample_size) eta = beta_0 + beta_1 * x p = 1 / (1 + np.exp(-eta)) # inverse logit y = rng.binomial(n=1, p=p, size=sample_size) return pd.DataFrame({"y": y, "x": x}) df = sim_logistic_data() df.head() We can now compare an OLS fit to a logistic fit: .. code-block:: python import statsmodels.api as sm import statsmodels.formula.api as smf import matplotlib.pyplot as plt # ordinary least squares ols_mod = smf.ols("y ~ x", data=df).fit() # logistic regression (GLM with binomial family + logit link) logit_mod = smf.glm( "y ~ x", data=df, family=sm.families.Binomial() ).fit() x_grid = np.linspace(df["x"].min(), df["x"].max(), 200) grid = pd.DataFrame({"x": x_grid}) plt.scatter(df["x"], df["y"], s=20) plt.plot(x_grid, ols_mod.predict(grid), label="OLS") plt.plot(x_grid, logit_mod.predict(grid), linestyle="--", label="Logistic") plt.xlabel("x") plt.ylabel("Estimated probability") plt.legend() plt.grid(True) The logistic curve stays between 0 and 1; the OLS line does not. 17.3 Fitting logistic regression in Python ------------------------------------------ We use :mod:`statsmodels` to estimate logistic regression models. The syntax parallels :func:`glm` in R. Basic fit ~~~~~~~~~ .. code-block:: python import statsmodels.formula.api as smf import statsmodels.api as sm # binary response y, predictors x1, x2, ... logit_mod = smf.glm( "y ~ x1 + x2", data=df, family=sm.families.Binomial() ).fit() print(logit_mod.summary()) Key outputs: * coefficient estimates :math:`\hat\beta_j`, * standard errors and Wald :math:`z` statistics, * approximate p-values for tests :math:`H_0: \beta_j = 0`, * deviance and information criteria (AIC / BIC). Wald tests ~~~~~~~~~~ A single-parameter hypothesis .. math:: H_0 : \beta_j = 0 \quad \text{vs} \quad H_1 : \beta_j \neq 0 is tested with a **Wald statistic** .. math:: z = \frac{\hat\beta_j}{\operatorname{SE}(\hat\beta_j)} \approx N(0, 1), reported in the summary as ``z`` and ``P>|z|``. Likelihood-ratio tests ~~~~~~~~~~~~~~~~~~~~~~ To compare a **reduced** model to a **full** model, we can use a likelihood-ratio test (LRT), the GLM analogue of the regression F-test. .. code-block:: python reduced = smf.glm("y ~ x1 + x2", data=df, family=sm.families.Binomial()).fit() full = smf.glm("y ~ x1 + x2 + x3 + x4", data=df, family=sm.families.Binomial()).fit() lr_stat, lr_pvalue, df_diff = full.compare_lr_test(reduced) print(lr_stat, df_diff, lr_pvalue) Here: * :code:`lr_stat` is the deviance difference :math:`D`, * :code:`df_diff` is the difference in degrees of freedom (number of parameters), * :code:`lr_pvalue` is the p-value under a :math:`\chi^2_{df\_diff}` approximation. 17.3.1 SAheart example: coronary heart disease ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ The R notes use the :data:`SAheart` dataset from :mod:`ElemStatLearn`. In PyStatsV1 we will assume a CSV version is available, for example: * :file:`data/sa_heart.csv` with columns: * ``chd`` – coronary heart disease indicator (1 = present, 0 = absent), * ``sbp`` – systolic blood pressure, * ``tobacco`` – lifetime tobacco (kg), * ``ldl`` – low density lipoprotein cholesterol, * ``adiposity``, ``famhist``, ``typea``, ``obesity``, ``alcohol``, ``age``. Reading and preparing the data: .. code-block:: python sa = pd.read_csv("data/sa_heart.csv") # make sure chd is 0/1 integers sa["chd"] = sa["chd"].astype(int) Single-predictor model (LDL only) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ .. code-block:: python chd_ldl = smf.glm( "chd ~ ldl", data=sa, family=sm.families.Binomial() ).fit() print(chd_ldl.summary()) A positive coefficient for ``ldl`` indicates that higher LDL is associated with higher probability of CHD. Additive model with all predictors ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ .. code-block:: python chd_add = smf.glm( "chd ~ sbp + tobacco + ldl + adiposity + C(famhist)" " + typea + obesity + alcohol + age", data=sa, family=sm.families.Binomial() ).fit() chd_add.summary() *Note*: ``C(famhist)`` treats ``famhist`` as a categorical predictor. Comparing models with an LRT: .. code-block:: python lr_stat, lr_pvalue, df_diff = chd_add.compare_lr_test(chd_ldl) print(f"LRT stat = {lr_stat:.3f}, df = {df_diff}, p = {lr_pvalue:.3g}") A very small p-value suggests that the larger additive model explains CHD substantially better than LDL alone. Variable selection (briefly) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^ For a quick backward stepwise fit using AIC we can loop over predictors or use a small helper; for now we simply note that: * AIC/BIC from :attr:`result.aic` / :attr:`result.bic` can be compared across models. * We prefer models that balance **good fit** (low deviance) with **parsimony** (few predictors). Confidence intervals for coefficients ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ .. code-block:: python chd_sel = chd_add # for now, suppose this is our selected model ci_99 = chd_sel.conf_int(alpha=0.01) ci_99.columns = ["lower", "upper"] ci_99 These are profile-likelihood intervals, analogous to the R output from ``confint``. Confidence intervals for mean response ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ For a new patient we often want an interval for the *predicted probability* of CHD. .. code-block:: python new_obs = pd.DataFrame( { "sbp": [148.0], "tobacco": [5.0], "ldl": [12.0], "adiposity": [31.23], "famhist": ["Present"], "typea": [47], "obesity": [28.50], "alcohol": [23.89], "age": [60], } ) # get eta and its standard error pred = chd_sel.get_prediction(new_obs) summary = pred.summary_frame() # includes mean, mean_ci_lower, mean_ci_upper summary[["mean_ci_lower", "mean", "mean_ci_upper"]] By default, :mod:`statsmodels` returns intervals on the **link scale** (log-odds). Using ``transform=True`` we can instead request intervals already back-transformed to probabilities: .. code-block:: python summary_prob = chd_sel.get_prediction(new_obs).summary_frame(transform=True) summary_prob[["mean_ci_lower", "mean", "mean_ci_upper"]] This provides a confidence interval for :math:`p(x)` between 0 and 1. 17.4 Logistic regression as a classifier --------------------------------------- Up to now, we have focused on modelling the probability :math:`p(x)`. To turn logistic regression into a **classifier**, we map each observation to a class label. If we knew the true probabilities, the **Bayes classifier** .. math:: C_B(x) = \arg\max_k P(Y = k \mid X = x) would minimize the probability of misclassification. For a binary response this reduces to: .. math:: C_B(x) = \begin{cases} 1, & p(x) > 0.5,\\ 0, & \text{otherwise}. \end{cases} In practice we replace :math:`p(x)` by its estimate :math:`\hat p(x)` from the logistic model and use the same rule. Misclassification rate ~~~~~~~~~~~~~~~~~~~~~~ Given predictions :math:`\hat C(x_i)` on a dataset, the **misclassification rate** is .. math:: \text{Misclass} = \frac{1}{n} \sum_{i=1}^n I\{ y_i \neq \hat C(x_i) \}. Equivalently, accuracy is :math:`1 - \text{Misclass}`. Sensitivity and specificity ~~~~~~~~~~~~~~~~~~~~~~~~~~~ A confusion matrix for a binary classifier records: * true positives (TP), * false positives (FP), * true negatives (TN), * false negatives (FN). Two important metrics are: * **Sensitivity** (recall, true positive rate) .. math:: \text{Sens} = \frac{\text{TP}}{\text{TP} + \text{FN}}. * **Specificity** (true negative rate) .. math:: \text{Spec} = \frac{\text{TN}}{\text{TN} + \text{FP}}. Changing the probability cutoff :math:`c` in .. math:: \hat C(x) = \begin{cases} 1, & \hat p(x) > c,\\ 0, & \hat p(x) \le c, \end{cases} trades off sensitivity and specificity: * lowering :math:`c` increases sensitivity but decreases specificity, * raising :math:`c` does the opposite. In practice, the “best” cutoff depends on the costs of false positives vs false negatives. 17.4.1 Spam example: email classification ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ The R notes use the classic ``spam`` dataset from the UCI machine learning repository. In PyStatsV1 we can assume a CSV file * :file:`data/spam.csv` with a binary column ``type`` (``"spam"`` vs ``"nonspam"``) and many engineered text features. Train / test split ^^^^^^^^^^^^^^^^^^ Using scikit-learn: .. code-block:: python from sklearn.model_selection import train_test_split from sklearn.linear_model import LogisticRegression from sklearn.metrics import ( confusion_matrix, accuracy_score, recall_score ) spam = pd.read_csv("data/spam.csv") X = spam.drop(columns=["type"]) y = (spam["type"] == "spam").astype(int) # 1 = spam, 0 = non-spam X_trn, X_tst, y_trn, y_tst = train_test_split( X, y, test_size=0.3, random_state=42, stratify=y ) Fit several models of increasing complexity: .. code-block:: python # simple model using only capitalTotal caps_mod = LogisticRegression(max_iter=1000) caps_mod.fit(X_trn[["capitalTotal"]], y_trn) # a small hand-picked model few_features = ["edu", "money", "capitalTotal", "charDollar"] sel_mod = LogisticRegression(max_iter=1000) sel_mod.fit(X_trn[few_features], y_trn) # additive model using all predictors full_mod = LogisticRegression( max_iter=1000, n_jobs=-1 ) full_mod.fit(X_trn, y_trn) # (an “over-parametrized” model with interactions would usually be handled # via feature engineering; we omit it here for brevity.) Cross-validated misclassification rate ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Instead of relying on training error, use cross-validation: .. code-block:: python from sklearn.model_selection import cross_val_score def cv_misclass(model, X, y, cv=5): acc = cross_val_score(model, X, y, cv=cv, scoring="accuracy") return 1 - acc.mean() cv_caps = cv_misclass(caps_mod, X_trn[["capitalTotal"]], y_trn) cv_sel = cv_misclass(sel_mod, X_trn[few_features], y_trn) cv_full = cv_misclass(full_mod, X_trn, y_trn) print(cv_caps, cv_sel, cv_full) Typically you will see: * the very simple model **underfits** (higher misclassification), * a moderate-size model performs better, * an extremely complex model risks **overfitting**. Confusion matrix on the test set ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Pick a preferred model (say ``full_mod``) and evaluate it on the held-out test set: .. code-block:: python # predicted probabilities for class 1 (spam) p_hat = full_mod.predict_proba(X_tst)[:, 1] # default cutoff 0.5 y_pred_50 = (p_hat > 0.5).astype(int) cm_50 = confusion_matrix(y_tst, y_pred_50) acc_50 = accuracy_score(y_tst, y_pred_50) sens_50 = recall_score(y_tst, y_pred_50) # sensitivity = recall for positive class # specificity needs a small helper tn, fp, fn, tp = cm_50.ravel() spec_50 = tn / (tn + fp) print("Confusion matrix (c = 0.5):") print(cm_50) print(f"Accuracy = {acc_50:.3f}, Sensitivity = {sens_50:.3f}, Specificity = {spec_50:.3f}") Changing the cutoff to 0.1 or 0.9 illustrates the trade-off: .. code-block:: python for c in [0.1, 0.5, 0.9]: y_pred = (p_hat > c).astype(int) tn, fp, fn, tp = confusion_matrix(y_tst, y_pred).ravel() acc = accuracy_score(y_tst, y_pred) sens = recall_score(y_tst, y_pred) spec = tn / (tn + fp) print(f"Cutoff {c:.1f}: acc={acc:.3f}, sens={sens:.3f}, spec={spec:.3f}") This matches the behaviour described in the R notes: lower cutoffs favour sensitivity (catching more spam) at the cost of more false positives; higher cutoffs do the opposite. 17.5 How this connects to PyStatsV1 ----------------------------------- Logistic regression and GLMs will reappear throughout PyStatsV1: * Many modern models used in science, health, and industry are **generalized linear models**, not just ordinary linear regression. * In later chapters on **experimental data and causal questions**, logistic regression provides a natural way to model binary outcomes (e.g., success vs failure) while controlling for covariates. * For applied work, logistic regression often serves as a **baseline classifier** before moving to more complex machine-learning methods. PyStatsV1 will show how to compare logistic models with tree-based and other algorithms. * The ideas of **link functions**, **likelihood-based inference**, and **classification metrics** generalize to many other models (Poisson, multinomial, etc.). Exercises and examples built on this chapter will encourage you to: * fit and interpret logistic models for real binary outcomes, * compare OLS and logistic regression on the same dataset, * practice using deviance, AIC, and cross-validated misclassification to choose between models, and * think carefully about which metrics (accuracy, sensitivity, specificity) are appropriate for a given applied problem. 17.6 What you should take away ------------------------------ By the end of this chapter (and its R + Python versions), you should be able to: * Explain how **generalized linear models** extend ordinary linear regression. * Define and work with the **logistic regression model**: * log-odds, odds, probabilities, * logit and inverse-logit (sigmoid) transformations. * Fit logistic regression models in Python using :mod:`statsmodels` and :mod:`scikit-learn`. * interpret coefficients in terms of log-odds and probabilities; * compute Wald tests and likelihood-ratio tests; * construct confidence intervals for coefficients and mean response. * Use logistic regression as a **classifier**: * obtain class probabilities :math:`\hat p(x)`, * convert them to labels using a cutoff, * compute misclassification rate, sensitivity, and specificity, * understand how changing the cutoff trades off false positives vs false negatives. * Recognize that many ideas from linear regression (model comparison, variable selection, diagnostics) carry over almost unchanged to the logistic setting. In short: logistic regression is your first step into the broader world of GLMs and classification methods, and a core tool for applied statistics with PyStatsV1.