Applied Statistics with Python – Chapter 17

Logistic regression and classification

So far, our regression chapters have focused on numeric response variables and ordinary least squares (OLS). In many applications, however, the response is binary:

disease vs no disease,
spam vs non-spam,
success vs failure, and so on.

In this chapter we:

introduce generalized linear models (GLMs) as an extension of OLS,
develop logistic regression for binary responses,
show how to fit, interpret, and test logistic models in Python, and
use logistic regression as a classifier, including evaluation metrics such as misclassification rate, sensitivity, and specificity.

Throughout, we will mirror the R notes but work in Python using:

numpy and pandas for data handling,
statsmodels for logistic regression and GLMs, and
scikit-learn for classification workflows and cross-validation.

17.1 Generalized linear models

Ordinary linear regression assumes

a Normal distribution for the response given the predictors, and
a mean that is a linear combination of the predictors:

\[Y \mid X = x \sim N(\mu(x), \sigma^2), \qquad \mu(x) = \beta_0 + \beta_1 x_1 + \cdots + \beta_{p-1} x_{p-1}.\]

A generalized linear model (GLM) keeps the linear predictor

\[\eta(x) = \beta_0 + \beta_1 x_1 + \cdots + \beta_{p-1} x_{p-1}\]

but allows:

different choices of distribution for \(Y \mid X = x\), and
a link function \(g(\cdot)\) connecting the linear predictor to the mean:

\[\eta(x) = g\big( E[Y \mid X = x] \big).\]

For example:

Linear regression
- Distribution: Normal
- Link: identity \(g(\mu) = \mu\)
Poisson regression
- Distribution: Poisson (counts)
- Link: log \(g(\lambda) = \log \lambda\)
Logistic regression
- Distribution: Bernoulli (binary)
- Link: logit \(g(p) = \log \{p / (1 - p)\}\)

17.2 Binary responses and the logistic model

Suppose we have a binary response coded as

\[\begin{split}Y = \begin{cases} 1, & \text{event occurs (``yes'', ``spam'', ``disease'')} \\ 0, & \text{otherwise (``no'', ``not spam'', ``no disease'')}. \end{cases}\end{split}\]

Define

\[p(x) = P(Y = 1 \mid X = x).\]

With a logistic regression model we assume

\[\log \left( \frac{p(x)}{1 - p(x)} \right) = \beta_0 + \beta_1 x_1 + \cdots + \beta_{p-1} x_{p-1}.\]

The left-hand side is the log-odds (logit). Applying the inverse logit gives

\[p(x) = \frac{\exp\{\eta(x)\}}{1 + \exp\{\eta(x)\}} = \frac{1}{1 + \exp\{-\eta(x)\}}.\]

This guarantees \(0 < p(x) < 1\), which is exactly what we want from a probability.

Where is the error term?

In the OLS model we write

\[Y = \beta_0 + \beta_1 x_1 + \cdots + \beta_q x_q + \varepsilon, \qquad \varepsilon \sim N(0, \sigma^2),\]

or equivalently

\[Y \mid X = x \sim N(\mu(x), \sigma^2).\]

The Normal distribution has two parameters, \(\mu(x)\) and \(\sigma^2\), so we estimate both.

In logistic regression the conditional distribution is Bernoulli with mean \(p(x)\). The distribution has a single parameter, so we only need to estimate \(p(x)\) (through the \(\beta\) coefficients). There is no separate \(\sigma^2\) parameter.

17.2.1 Why not OLS on a 0/1 outcome?

If \(Y \in \{0, 1\}\), the conditional mean is

\[E[Y \mid X = x] = P(Y = 1 \mid X = x) = p(x).\]

You might think we can simply fit an ordinary linear regression of \(Y\) on \(X\) and interpret the fitted values as estimated probabilities. The problem:

the OLS fitted line is linear in \(x\), so fitted values can be less than 0 or greater than 1, which makes no sense as probabilities;
the Normal error assumption is a poor match to a Bernoulli variable.

Logistic regression solves both issues: probabilities are constrained to \((0, 1)\) and the Bernoulli model correctly reflects the 0/1 nature of the data.

17.2.2 Simulating logistic data in Python

Here is a direct translation of the R simulation used in the original notes.

import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

def sim_logistic_data(sample_size=25, beta_0=-2.0, beta_1=3.0) -> pd.DataFrame:
    x = rng.normal(size=sample_size)
    eta = beta_0 + beta_1 * x
    p = 1 / (1 + np.exp(-eta))          # inverse logit
    y = rng.binomial(n=1, p=p, size=sample_size)
    return pd.DataFrame({"y": y, "x": x})

df = sim_logistic_data()
df.head()

We can now compare an OLS fit to a logistic fit:

import statsmodels.api as sm
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt

# ordinary least squares
ols_mod = smf.ols("y ~ x", data=df).fit()

# logistic regression (GLM with binomial family + logit link)
logit_mod = smf.glm(
    "y ~ x", data=df,
    family=sm.families.Binomial()
).fit()

x_grid = np.linspace(df["x"].min(), df["x"].max(), 200)
grid = pd.DataFrame({"x": x_grid})

plt.scatter(df["x"], df["y"], s=20)
plt.plot(x_grid, ols_mod.predict(grid), label="OLS")
plt.plot(x_grid, logit_mod.predict(grid), linestyle="--", label="Logistic")
plt.xlabel("x")
plt.ylabel("Estimated probability")
plt.legend()
plt.grid(True)

The logistic curve stays between 0 and 1; the OLS line does not.

17.3 Fitting logistic regression in Python

We use statsmodels to estimate logistic regression models. The syntax parallels glm() in R.

Basic fit

import statsmodels.formula.api as smf
import statsmodels.api as sm

# binary response y, predictors x1, x2, ...
logit_mod = smf.glm(
    "y ~ x1 + x2",
    data=df,
    family=sm.families.Binomial()
).fit()

print(logit_mod.summary())

Key outputs:

coefficient estimates \(\hat\beta_j\),
standard errors and Wald \(z\) statistics,
approximate p-values for tests \(H_0: \beta_j = 0\),
deviance and information criteria (AIC / BIC).

Wald tests

A single-parameter hypothesis

\[H_0 : \beta_j = 0 \quad \text{vs} \quad H_1 : \beta_j \neq 0\]

is tested with a Wald statistic

\[z = \frac{\hat\beta_j}{\operatorname{SE}(\hat\beta_j)} \approx N(0, 1),\]

reported in the summary as z and P>|z|.

Likelihood-ratio tests

To compare a reduced model to a full model, we can use a likelihood-ratio test (LRT), the GLM analogue of the regression F-test.

reduced = smf.glm("y ~ x1 + x2", data=df,
                  family=sm.families.Binomial()).fit()

full = smf.glm("y ~ x1 + x2 + x3 + x4", data=df,
               family=sm.families.Binomial()).fit()

lr_stat, lr_pvalue, df_diff = full.compare_lr_test(reduced)
print(lr_stat, df_diff, lr_pvalue)

Here:

lr_stat is the deviance difference \(D\),
df_diff is the difference in degrees of freedom (number of parameters),
lr_pvalue is the p-value under a \(\chi^2_{df\_diff}\) approximation.

17.3.1 SAheart example: coronary heart disease

The R notes use the SAheart dataset from ElemStatLearn. In PyStatsV1 we will assume a CSV version is available, for example:

data/sa_heart.csv

with columns:

chd – coronary heart disease indicator (1 = present, 0 = absent),
sbp – systolic blood pressure,
tobacco – lifetime tobacco (kg),
ldl – low density lipoprotein cholesterol,
adiposity, famhist, typea, obesity, alcohol, age.

Reading and preparing the data:

sa = pd.read_csv("data/sa_heart.csv")

# make sure chd is 0/1 integers
sa["chd"] = sa["chd"].astype(int)

Single-predictor model (LDL only)

chd_ldl = smf.glm(
    "chd ~ ldl", data=sa,
    family=sm.families.Binomial()
).fit()

print(chd_ldl.summary())

A positive coefficient for ldl indicates that higher LDL is associated with higher probability of CHD.

Additive model with all predictors

chd_add = smf.glm(
    "chd ~ sbp + tobacco + ldl + adiposity + C(famhist)"
    " + typea + obesity + alcohol + age",
    data=sa,
    family=sm.families.Binomial()
).fit()

chd_add.summary()

Note: C(famhist) treats famhist as a categorical predictor.

Comparing models with an LRT:

lr_stat, lr_pvalue, df_diff = chd_add.compare_lr_test(chd_ldl)
print(f"LRT stat = {lr_stat:.3f}, df = {df_diff}, p = {lr_pvalue:.3g}")

A very small p-value suggests that the larger additive model explains CHD substantially better than LDL alone.

Variable selection (briefly)

For a quick backward stepwise fit using AIC we can loop over predictors or use a small helper; for now we simply note that:

AIC/BIC from result.aic / result.bic can be compared across models.
We prefer models that balance good fit (low deviance) with parsimony (few predictors).

Confidence intervals for coefficients

chd_sel = chd_add  # for now, suppose this is our selected model
ci_99 = chd_sel.conf_int(alpha=0.01)
ci_99.columns = ["lower", "upper"]
ci_99

These are profile-likelihood intervals, analogous to the R output from confint.

Confidence intervals for mean response

For a new patient we often want an interval for the predicted probability of CHD.

new_obs = pd.DataFrame(
    {
        "sbp": [148.0],
        "tobacco": [5.0],
        "ldl": [12.0],
        "adiposity": [31.23],
        "famhist": ["Present"],
        "typea": [47],
        "obesity": [28.50],
        "alcohol": [23.89],
        "age": [60],
    }
)

# get eta and its standard error
pred = chd_sel.get_prediction(new_obs)
summary = pred.summary_frame()   # includes mean, mean_ci_lower, mean_ci_upper
summary[["mean_ci_lower", "mean", "mean_ci_upper"]]

By default, statsmodels returns intervals on the link scale (log-odds). Using transform=True we can instead request intervals already back-transformed to probabilities:

summary_prob = chd_sel.get_prediction(new_obs).summary_frame(transform=True)
summary_prob[["mean_ci_lower", "mean", "mean_ci_upper"]]

This provides a confidence interval for \(p(x)\) between 0 and 1.

17.4 Logistic regression as a classifier

Up to now, we have focused on modelling the probability \(p(x)\). To turn logistic regression into a classifier, we map each observation to a class label.

If we knew the true probabilities, the Bayes classifier

\[C_B(x) = \arg\max_k P(Y = k \mid X = x)\]

would minimize the probability of misclassification. For a binary response this reduces to:

\[\begin{split}C_B(x) = \begin{cases} 1, & p(x) > 0.5,\\ 0, & \text{otherwise}. \end{cases}\end{split}\]

In practice we replace \(p(x)\) by its estimate \(\hat p(x)\) from the logistic model and use the same rule.

Misclassification rate

Given predictions \(\hat C(x_i)\) on a dataset, the misclassification rate is

\[\text{Misclass} = \frac{1}{n} \sum_{i=1}^n I\{ y_i \neq \hat C(x_i) \}.\]

Equivalently, accuracy is \(1 - \text{Misclass}\).

Sensitivity and specificity

A confusion matrix for a binary classifier records:

true positives (TP),
false positives (FP),
true negatives (TN),
false negatives (FN).

Two important metrics are:

Sensitivity (recall, true positive rate)

\[\text{Sens} = \frac{\text{TP}}{\text{TP} + \text{FN}}.\]
Specificity (true negative rate)

\[\text{Spec} = \frac{\text{TN}}{\text{TN} + \text{FP}}.\]

Changing the probability cutoff \(c\) in

\[\begin{split}\hat C(x) = \begin{cases} 1, & \hat p(x) > c,\\ 0, & \hat p(x) \le c, \end{cases}\end{split}\]

trades off sensitivity and specificity:

lowering \(c\) increases sensitivity but decreases specificity,
raising \(c\) does the opposite.

In practice, the “best” cutoff depends on the costs of false positives vs false negatives.

17.4.1 Spam example: email classification

The R notes use the classic spam dataset from the UCI machine learning repository. In PyStatsV1 we can assume a CSV file

data/spam.csv

with a binary column type ("spam" vs "nonspam") and many engineered text features.

Train / test split

Using scikit-learn:

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    confusion_matrix, accuracy_score, recall_score
)

spam = pd.read_csv("data/spam.csv")

X = spam.drop(columns=["type"])
y = (spam["type"] == "spam").astype(int)  # 1 = spam, 0 = non-spam

X_trn, X_tst, y_trn, y_tst = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

Fit several models of increasing complexity:

# simple model using only capitalTotal
caps_mod = LogisticRegression(max_iter=1000)
caps_mod.fit(X_trn[["capitalTotal"]], y_trn)

# a small hand-picked model
few_features = ["edu", "money", "capitalTotal", "charDollar"]
sel_mod = LogisticRegression(max_iter=1000)
sel_mod.fit(X_trn[few_features], y_trn)

# additive model using all predictors
full_mod = LogisticRegression(
    max_iter=1000, n_jobs=-1
)
full_mod.fit(X_trn, y_trn)

# (an “over-parametrized” model with interactions would usually be handled
#  via feature engineering; we omit it here for brevity.)

Cross-validated misclassification rate

Instead of relying on training error, use cross-validation:

from sklearn.model_selection import cross_val_score

def cv_misclass(model, X, y, cv=5):
    acc = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
    return 1 - acc.mean()

cv_caps = cv_misclass(caps_mod, X_trn[["capitalTotal"]], y_trn)
cv_sel  = cv_misclass(sel_mod, X_trn[few_features], y_trn)
cv_full = cv_misclass(full_mod, X_trn, y_trn)

print(cv_caps, cv_sel, cv_full)

Typically you will see:

the very simple model underfits (higher misclassification),
a moderate-size model performs better,
an extremely complex model risks overfitting.

Confusion matrix on the test set

Pick a preferred model (say full_mod) and evaluate it on the held-out test set:

# predicted probabilities for class 1 (spam)
p_hat = full_mod.predict_proba(X_tst)[:, 1]

# default cutoff 0.5
y_pred_50 = (p_hat > 0.5).astype(int)

cm_50 = confusion_matrix(y_tst, y_pred_50)
acc_50 = accuracy_score(y_tst, y_pred_50)
sens_50 = recall_score(y_tst, y_pred_50)   # sensitivity = recall for positive class

# specificity needs a small helper
tn, fp, fn, tp = cm_50.ravel()
spec_50 = tn / (tn + fp)

print("Confusion matrix (c = 0.5):")
print(cm_50)
print(f"Accuracy = {acc_50:.3f}, Sensitivity = {sens_50:.3f}, Specificity = {spec_50:.3f}")

Changing the cutoff to 0.1 or 0.9 illustrates the trade-off:

for c in [0.1, 0.5, 0.9]:
    y_pred = (p_hat > c).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_tst, y_pred).ravel()
    acc = accuracy_score(y_tst, y_pred)
    sens = recall_score(y_tst, y_pred)
    spec = tn / (tn + fp)
    print(f"Cutoff {c:.1f}: acc={acc:.3f}, sens={sens:.3f}, spec={spec:.3f}")

This matches the behaviour described in the R notes: lower cutoffs favour sensitivity (catching more spam) at the cost of more false positives; higher cutoffs do the opposite.

17.5 How this connects to PyStatsV1

Logistic regression and GLMs will reappear throughout PyStatsV1:

Many modern models used in science, health, and industry are generalized linear models, not just ordinary linear regression.
In later chapters on experimental data and causal questions, logistic regression provides a natural way to model binary outcomes (e.g., success vs failure) while controlling for covariates.
For applied work, logistic regression often serves as a baseline classifier before moving to more complex machine-learning methods. PyStatsV1 will show how to compare logistic models with tree-based and other algorithms.
The ideas of link functions, likelihood-based inference, and classification metrics generalize to many other models (Poisson, multinomial, etc.).

Exercises and examples built on this chapter will encourage you to:

fit and interpret logistic models for real binary outcomes,
compare OLS and logistic regression on the same dataset,
practice using deviance, AIC, and cross-validated misclassification to choose between models, and
think carefully about which metrics (accuracy, sensitivity, specificity) are appropriate for a given applied problem.

17.6 What you should take away

By the end of this chapter (and its R + Python versions), you should be able to:

Explain how generalized linear models extend ordinary linear regression.
Define and work with the logistic regression model:
- log-odds, odds, probabilities,
- logit and inverse-logit (sigmoid) transformations.
Fit logistic regression models in Python using statsmodels and scikit-learn.
- interpret coefficients in terms of log-odds and probabilities;
- compute Wald tests and likelihood-ratio tests;
- construct confidence intervals for coefficients and mean response.
Use logistic regression as a classifier:
- obtain class probabilities \(\hat p(x)\),
- convert them to labels using a cutoff,
- compute misclassification rate, sensitivity, and specificity,
- understand how changing the cutoff trades off false positives vs false negatives.
Recognize that many ideas from linear regression (model comparison, variable selection, diagnostics) carry over almost unchanged to the logistic setting.

In short: logistic regression is your first step into the broader world of GLMs and classification methods, and a core tool for applied statistics with PyStatsV1.