Applied Statistics with Python – Chapter 17
Logistic regression and classification
So far, our regression chapters have focused on numeric response variables and ordinary least squares (OLS). In many applications, however, the response is binary:
disease vs no disease,
spam vs non-spam,
success vs failure, and so on.
In this chapter we:
introduce generalized linear models (GLMs) as an extension of OLS,
develop logistic regression for binary responses,
show how to fit, interpret, and test logistic models in Python, and
use logistic regression as a classifier, including evaluation metrics such as misclassification rate, sensitivity, and specificity.
Throughout, we will mirror the R notes but work in Python using:
numpyandpandasfor data handling,statsmodelsfor logistic regression and GLMs, andscikit-learnfor classification workflows and cross-validation.
17.1 Generalized linear models
Ordinary linear regression assumes
a Normal distribution for the response given the predictors, and
a mean that is a linear combination of the predictors:
\[Y \mid X = x \sim N(\mu(x), \sigma^2), \qquad \mu(x) = \beta_0 + \beta_1 x_1 + \cdots + \beta_{p-1} x_{p-1}.\]
A generalized linear model (GLM) keeps the linear predictor
but allows:
different choices of distribution for \(Y \mid X = x\), and
a link function \(g(\cdot)\) connecting the linear predictor to the mean:
\[\eta(x) = g\big( E[Y \mid X = x] \big).\]
For example:
Linear regression
Distribution: Normal
Link: identity \(g(\mu) = \mu\)
Poisson regression
Distribution: Poisson (counts)
Link: log \(g(\lambda) = \log \lambda\)
Logistic regression
Distribution: Bernoulli (binary)
Link: logit \(g(p) = \log \{p / (1 - p)\}\)
17.2 Binary responses and the logistic model
Suppose we have a binary response coded as
Define
With a logistic regression model we assume
The left-hand side is the log-odds (logit). Applying the inverse logit gives
This guarantees \(0 < p(x) < 1\), which is exactly what we want from a probability.
Where is the error term?
In the OLS model we write
or equivalently
The Normal distribution has two parameters, \(\mu(x)\) and \(\sigma^2\), so we estimate both.
In logistic regression the conditional distribution is Bernoulli with mean \(p(x)\). The distribution has a single parameter, so we only need to estimate \(p(x)\) (through the \(\beta\) coefficients). There is no separate \(\sigma^2\) parameter.
17.2.1 Why not OLS on a 0/1 outcome?
If \(Y \in \{0, 1\}\), the conditional mean is
You might think we can simply fit an ordinary linear regression of \(Y\) on \(X\) and interpret the fitted values as estimated probabilities. The problem:
the OLS fitted line is linear in \(x\), so fitted values can be less than 0 or greater than 1, which makes no sense as probabilities;
the Normal error assumption is a poor match to a Bernoulli variable.
Logistic regression solves both issues: probabilities are constrained to \((0, 1)\) and the Bernoulli model correctly reflects the 0/1 nature of the data.
17.2.2 Simulating logistic data in Python
Here is a direct translation of the R simulation used in the original notes.
import numpy as np
import pandas as pd
rng = np.random.default_rng(42)
def sim_logistic_data(sample_size=25, beta_0=-2.0, beta_1=3.0) -> pd.DataFrame:
x = rng.normal(size=sample_size)
eta = beta_0 + beta_1 * x
p = 1 / (1 + np.exp(-eta)) # inverse logit
y = rng.binomial(n=1, p=p, size=sample_size)
return pd.DataFrame({"y": y, "x": x})
df = sim_logistic_data()
df.head()
We can now compare an OLS fit to a logistic fit:
import statsmodels.api as sm
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt
# ordinary least squares
ols_mod = smf.ols("y ~ x", data=df).fit()
# logistic regression (GLM with binomial family + logit link)
logit_mod = smf.glm(
"y ~ x", data=df,
family=sm.families.Binomial()
).fit()
x_grid = np.linspace(df["x"].min(), df["x"].max(), 200)
grid = pd.DataFrame({"x": x_grid})
plt.scatter(df["x"], df["y"], s=20)
plt.plot(x_grid, ols_mod.predict(grid), label="OLS")
plt.plot(x_grid, logit_mod.predict(grid), linestyle="--", label="Logistic")
plt.xlabel("x")
plt.ylabel("Estimated probability")
plt.legend()
plt.grid(True)
The logistic curve stays between 0 and 1; the OLS line does not.
17.3 Fitting logistic regression in Python
We use statsmodels to estimate logistic regression models. The syntax
parallels glm() in R.
Basic fit
import statsmodels.formula.api as smf
import statsmodels.api as sm
# binary response y, predictors x1, x2, ...
logit_mod = smf.glm(
"y ~ x1 + x2",
data=df,
family=sm.families.Binomial()
).fit()
print(logit_mod.summary())
Key outputs:
coefficient estimates \(\hat\beta_j\),
standard errors and Wald \(z\) statistics,
approximate p-values for tests \(H_0: \beta_j = 0\),
deviance and information criteria (AIC / BIC).
Wald tests
A single-parameter hypothesis
is tested with a Wald statistic
reported in the summary as z and P>|z|.
Likelihood-ratio tests
To compare a reduced model to a full model, we can use a likelihood-ratio test (LRT), the GLM analogue of the regression F-test.
reduced = smf.glm("y ~ x1 + x2", data=df,
family=sm.families.Binomial()).fit()
full = smf.glm("y ~ x1 + x2 + x3 + x4", data=df,
family=sm.families.Binomial()).fit()
lr_stat, lr_pvalue, df_diff = full.compare_lr_test(reduced)
print(lr_stat, df_diff, lr_pvalue)
Here:
lr_statis the deviance difference \(D\),df_diffis the difference in degrees of freedom (number of parameters),lr_pvalueis the p-value under a \(\chi^2_{df\_diff}\) approximation.
17.3.1 SAheart example: coronary heart disease
The R notes use the SAheart dataset from ElemStatLearn. In
PyStatsV1 we will assume a CSV version is available, for example:
data/sa_heart.csv
with columns:
chd– coronary heart disease indicator (1 = present, 0 = absent),sbp– systolic blood pressure,tobacco– lifetime tobacco (kg),ldl– low density lipoprotein cholesterol,adiposity,famhist,typea,obesity,alcohol,age.
Reading and preparing the data:
sa = pd.read_csv("data/sa_heart.csv")
# make sure chd is 0/1 integers
sa["chd"] = sa["chd"].astype(int)
Single-predictor model (LDL only)
chd_ldl = smf.glm(
"chd ~ ldl", data=sa,
family=sm.families.Binomial()
).fit()
print(chd_ldl.summary())
A positive coefficient for ldl indicates that higher LDL is associated with
higher probability of CHD.
Additive model with all predictors
chd_add = smf.glm(
"chd ~ sbp + tobacco + ldl + adiposity + C(famhist)"
" + typea + obesity + alcohol + age",
data=sa,
family=sm.families.Binomial()
).fit()
chd_add.summary()
Note: C(famhist) treats famhist as a categorical predictor.
Comparing models with an LRT:
lr_stat, lr_pvalue, df_diff = chd_add.compare_lr_test(chd_ldl)
print(f"LRT stat = {lr_stat:.3f}, df = {df_diff}, p = {lr_pvalue:.3g}")
A very small p-value suggests that the larger additive model explains CHD substantially better than LDL alone.
Variable selection (briefly)
For a quick backward stepwise fit using AIC we can loop over predictors or use a small helper; for now we simply note that:
AIC/BIC from
result.aic/result.biccan be compared across models.We prefer models that balance good fit (low deviance) with parsimony (few predictors).
Confidence intervals for coefficients
chd_sel = chd_add # for now, suppose this is our selected model
ci_99 = chd_sel.conf_int(alpha=0.01)
ci_99.columns = ["lower", "upper"]
ci_99
These are profile-likelihood intervals, analogous to the R output from
confint.
Confidence intervals for mean response
For a new patient we often want an interval for the predicted probability of CHD.
new_obs = pd.DataFrame(
{
"sbp": [148.0],
"tobacco": [5.0],
"ldl": [12.0],
"adiposity": [31.23],
"famhist": ["Present"],
"typea": [47],
"obesity": [28.50],
"alcohol": [23.89],
"age": [60],
}
)
# get eta and its standard error
pred = chd_sel.get_prediction(new_obs)
summary = pred.summary_frame() # includes mean, mean_ci_lower, mean_ci_upper
summary[["mean_ci_lower", "mean", "mean_ci_upper"]]
By default, statsmodels returns intervals on the link scale
(log-odds). Using transform=True we can instead request intervals already
back-transformed to probabilities:
summary_prob = chd_sel.get_prediction(new_obs).summary_frame(transform=True)
summary_prob[["mean_ci_lower", "mean", "mean_ci_upper"]]
This provides a confidence interval for \(p(x)\) between 0 and 1.
17.4 Logistic regression as a classifier
Up to now, we have focused on modelling the probability \(p(x)\). To turn logistic regression into a classifier, we map each observation to a class label.
If we knew the true probabilities, the Bayes classifier
would minimize the probability of misclassification. For a binary response this reduces to:
In practice we replace \(p(x)\) by its estimate \(\hat p(x)\) from the logistic model and use the same rule.
Misclassification rate
Given predictions \(\hat C(x_i)\) on a dataset, the misclassification rate is
Equivalently, accuracy is \(1 - \text{Misclass}\).
Sensitivity and specificity
A confusion matrix for a binary classifier records:
true positives (TP),
false positives (FP),
true negatives (TN),
false negatives (FN).
Two important metrics are:
Sensitivity (recall, true positive rate)
\[\text{Sens} = \frac{\text{TP}}{\text{TP} + \text{FN}}.\]Specificity (true negative rate)
\[\text{Spec} = \frac{\text{TN}}{\text{TN} + \text{FP}}.\]
Changing the probability cutoff \(c\) in
trades off sensitivity and specificity:
lowering \(c\) increases sensitivity but decreases specificity,
raising \(c\) does the opposite.
In practice, the “best” cutoff depends on the costs of false positives vs false negatives.
17.4.1 Spam example: email classification
The R notes use the classic spam dataset from the UCI machine learning
repository. In PyStatsV1 we can assume a CSV file
data/spam.csv
with a binary column type ("spam" vs "nonspam") and many engineered
text features.
Train / test split
Using scikit-learn:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
confusion_matrix, accuracy_score, recall_score
)
spam = pd.read_csv("data/spam.csv")
X = spam.drop(columns=["type"])
y = (spam["type"] == "spam").astype(int) # 1 = spam, 0 = non-spam
X_trn, X_tst, y_trn, y_tst = train_test_split(
X, y, test_size=0.3, random_state=42, stratify=y
)
Fit several models of increasing complexity:
# simple model using only capitalTotal
caps_mod = LogisticRegression(max_iter=1000)
caps_mod.fit(X_trn[["capitalTotal"]], y_trn)
# a small hand-picked model
few_features = ["edu", "money", "capitalTotal", "charDollar"]
sel_mod = LogisticRegression(max_iter=1000)
sel_mod.fit(X_trn[few_features], y_trn)
# additive model using all predictors
full_mod = LogisticRegression(
max_iter=1000, n_jobs=-1
)
full_mod.fit(X_trn, y_trn)
# (an “over-parametrized” model with interactions would usually be handled
# via feature engineering; we omit it here for brevity.)
Cross-validated misclassification rate
Instead of relying on training error, use cross-validation:
from sklearn.model_selection import cross_val_score
def cv_misclass(model, X, y, cv=5):
acc = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
return 1 - acc.mean()
cv_caps = cv_misclass(caps_mod, X_trn[["capitalTotal"]], y_trn)
cv_sel = cv_misclass(sel_mod, X_trn[few_features], y_trn)
cv_full = cv_misclass(full_mod, X_trn, y_trn)
print(cv_caps, cv_sel, cv_full)
Typically you will see:
the very simple model underfits (higher misclassification),
a moderate-size model performs better,
an extremely complex model risks overfitting.
Confusion matrix on the test set
Pick a preferred model (say full_mod) and evaluate it on the held-out test
set:
# predicted probabilities for class 1 (spam)
p_hat = full_mod.predict_proba(X_tst)[:, 1]
# default cutoff 0.5
y_pred_50 = (p_hat > 0.5).astype(int)
cm_50 = confusion_matrix(y_tst, y_pred_50)
acc_50 = accuracy_score(y_tst, y_pred_50)
sens_50 = recall_score(y_tst, y_pred_50) # sensitivity = recall for positive class
# specificity needs a small helper
tn, fp, fn, tp = cm_50.ravel()
spec_50 = tn / (tn + fp)
print("Confusion matrix (c = 0.5):")
print(cm_50)
print(f"Accuracy = {acc_50:.3f}, Sensitivity = {sens_50:.3f}, Specificity = {spec_50:.3f}")
Changing the cutoff to 0.1 or 0.9 illustrates the trade-off:
for c in [0.1, 0.5, 0.9]:
y_pred = (p_hat > c).astype(int)
tn, fp, fn, tp = confusion_matrix(y_tst, y_pred).ravel()
acc = accuracy_score(y_tst, y_pred)
sens = recall_score(y_tst, y_pred)
spec = tn / (tn + fp)
print(f"Cutoff {c:.1f}: acc={acc:.3f}, sens={sens:.3f}, spec={spec:.3f}")
This matches the behaviour described in the R notes: lower cutoffs favour sensitivity (catching more spam) at the cost of more false positives; higher cutoffs do the opposite.
17.5 How this connects to PyStatsV1
Logistic regression and GLMs will reappear throughout PyStatsV1:
Many modern models used in science, health, and industry are generalized linear models, not just ordinary linear regression.
In later chapters on experimental data and causal questions, logistic regression provides a natural way to model binary outcomes (e.g., success vs failure) while controlling for covariates.
For applied work, logistic regression often serves as a baseline classifier before moving to more complex machine-learning methods. PyStatsV1 will show how to compare logistic models with tree-based and other algorithms.
The ideas of link functions, likelihood-based inference, and classification metrics generalize to many other models (Poisson, multinomial, etc.).
Exercises and examples built on this chapter will encourage you to:
fit and interpret logistic models for real binary outcomes,
compare OLS and logistic regression on the same dataset,
practice using deviance, AIC, and cross-validated misclassification to choose between models, and
think carefully about which metrics (accuracy, sensitivity, specificity) are appropriate for a given applied problem.
17.6 What you should take away
By the end of this chapter (and its R + Python versions), you should be able to:
Explain how generalized linear models extend ordinary linear regression.
Define and work with the logistic regression model:
log-odds, odds, probabilities,
logit and inverse-logit (sigmoid) transformations.
Fit logistic regression models in Python using
statsmodelsandscikit-learn.interpret coefficients in terms of log-odds and probabilities;
compute Wald tests and likelihood-ratio tests;
construct confidence intervals for coefficients and mean response.
Use logistic regression as a classifier:
obtain class probabilities \(\hat p(x)\),
convert them to labels using a cutoff,
compute misclassification rate, sensitivity, and specificity,
understand how changing the cutoff trades off false positives vs false negatives.
Recognize that many ideas from linear regression (model comparison, variable selection, diagnostics) carry over almost unchanged to the logistic setting.
In short: logistic regression is your first step into the broader world of GLMs and classification methods, and a core tool for applied statistics with PyStatsV1.