Applied Statistics with Python – Chapter 5
Probability and statistics in Python and R
This chapter mirrors the “Probability and Statistics in R” chapter from the R notes. We’ll keep the statistical ideas the same, but express them in Python-first terms.
By the end of this chapter you should be comfortable with:
using library functions to work with common distributions,
translating between pdf / cdf / quantile / random draws,
performing basic t-tests in Python (one-sample and two-sample),
and using simulation to understand probability results.
Where the R notes show functions like dnorm(), qbinom(), or
t.test(), we’ll highlight the Python equivalents from:
scipy.statsnumpy.randomand a few plain-Python helper functions.
5.1 Probability in Python (and R)
5.1.1 Distribution helpers: pdf, cdf, quantiles, random draws
In the R book you see the pattern:
dname– density or pmf (pdf),pname– cumulative distribution function (cdf),qname– quantile function,rname– random draws.
For example, dnorm(), pnorm(), qnorm(), rnorm() for the Normal
distribution, or dbinom(), pbinom() and so on.
In Python, the roles are similar but the interface is a little different.
The most convenient toolkit is scipy.stats. Each distribution is an
object with methods and “frozen” versions.
For a Normal distribution with mean 2 and standard deviation 5:
import numpy as np
from scipy import stats
dist = stats.norm(loc=2, scale=5)
# pdf (density) at x = 3
pdf_at_3 = dist.pdf(3)
# cdf at x = 3, P(X <= 3)
cdf_at_3 = dist.cdf(3)
# 97.5th percentile (quantile)
q_975 = dist.ppf(0.975)
# random sample of size n = 10
sample = dist.rvs(size=10, random_state=42)
The mapping to the R notation is:
R
dnorm(x, mean, sd)→ Pythonstats.norm(mean, sd).pdf(x)R
pnorm(q, mean, sd)→ Pythonstats.norm(mean, sd).cdf(q)R
qnorm(p, mean, sd)→ Pythonstats.norm(mean, sd).ppf(p)R
rnorm(n, mean, sd)→ Pythonstats.norm(mean, sd).rvs(size=n)
5.1.2 Other families: binomial, t, Poisson, chi-square, F
The R chapter lists a family of distributions with the same d/p/q/r pattern.
In Python you will typically reach for:
scipy.stats.binomfor binomial,scipy.stats.tfor Student’s t,scipy.stats.poissonfor Poisson,scipy.stats.chi2for chi-square,scipy.stats.ffor F.
Example – Binomial probability
Compute \(P(Y = 6)\) when \(Y \sim \text{Binomial}(n=10, p=0.75)\):
from scipy import stats
# pmf (probability mass function) for a Binomial
prob_y_eq_6 = stats.binom(n=10, p=0.75).pmf(6)
In R the same calculation would use dbinom(x = 6, size = 10, prob = 0.75).
Notice again:
R’s “d” for discrete distributions is really the pmf,
Python exposes this as
pmffor discrete andpdffor continuous.
You can explore other distributions in scipy.stats in exactly the same
way: create the frozen distribution object, then call pdf, cdf,
ppf, and rvs.
5.2 Hypothesis tests in Python
The R notes briefly review hypothesis testing and then demonstrate t-tests. We’ll match that structure here, but show both:
the formula (to demystify the test statistic), and
the convenience function from
scipy.stats.
5.2.1 One-sample t-test
Set-up: we observe \(x_1, \dots, x_n\) and want to test
assuming the data are approximately Normal with unknown variance.
The test statistic is
where \(\bar{x}\) is the sample mean and \(s\) the sample standard deviation.
Python implementation “by hand”
We revisit the cereal-box example from the R notes: boxes labelled “16 oz”, sample of 9 observed weights. (Here we just hard-code the data.)
import numpy as np
from scipy import stats
weights = np.array([15.5, 16.2, 16.1, 15.8, 15.6, 16.0, 15.8, 15.9, 16.2])
mu_0 = 16
n = len(weights)
x_bar = weights.mean()
s = weights.std(ddof=1)
t_stat = (x_bar - mu_0) / (s / np.sqrt(n))
df = n - 1
# two-sided p-value
p_value_two_sided = 2 * stats.t.sf(np.abs(t_stat), df=df)
For a one-sided alternative (for example \(H_1: \mu < \mu_0\)):
p_value_less = stats.t.cdf(t_stat, df=df) # P(T <= observed)
p_value_greater = stats.t.sf(t_stat, df=df) # P(T >= observed)
This mirrors what the R code is doing with t.test() and pt().
Using SciPy’s convenience function
# two-sided one-sample t-test against mu_0
t_stat2, p_value2 = stats.ttest_1samp(weights, popmean=mu_0)
# SciPy always reports a two-sided p-value; you can halve it
# for a one-sided test if the sign of t matches your alternative.
p_value_less = p_value2 / 2 if t_stat2 < 0 else 1 - p_value2 / 2
The important idea: whether you calculate \(t\) by hand or use a helper function, the model assumptions and test structure are the same as in the R notes.
5.2.2 Two-sample t-test
Now suppose we have two independent samples:
and we want to test
under an equal-variance assumption.
The pooled-variance test statistic is
Python implementation “by hand”
import numpy as np
from scipy import stats
x = np.array([70, 82, 78, 74, 94, 82])
y = np.array([64, 72, 60, 76, 72, 80, 84, 68])
n, m = len(x), len(y)
x_bar, y_bar = x.mean(), y.mean()
s_x = x.std(ddof=1)
s_y = y.std(ddof=1)
s_p = np.sqrt(((n - 1) * s_x**2 + (m - 1) * s_y**2) / (n + m - 2))
t_stat = (x_bar - y_bar) / (s_p * np.sqrt(1 / n + 1 / m))
df = n + m - 2
# one-sided alternative H1: mu_x > mu_y
p_value = stats.t.sf(t_stat, df=df)
Using SciPy’s two-sample test
t_stat2, p_value_two_sided = stats.ttest_ind(
x, y, equal_var=True
)
# convert to one-sided p-value if desired
p_value_greater = p_value_two_sided / 2 if t_stat2 > 0 else 1 - p_value_two_sided / 2
SciPy’s ttest_ind() mirrors R’s t.test(x, y, var.equal = TRUE) call.
For practical work you will usually:
check assumptions (rough normality, similar spreads),
call the SciPy function,
interpret \(t\), degrees of freedom, p-value, and confidence interval.
5.3 Simulation in Python
The R chapter ends with simulations that illustrate how probability results behave in practice. We’ll take the same examples and rewrite them in NumPy.
The pattern you’ll see throughout PyStatsV1 is:
write small simulation functions,
loop or vectorize to build many replicates,
examine histograms and summary statistics.
5.3.1 Paired differences
We consider two Normal populations with means \(\mu_1 = 6\), \(\mu_2 = 5\), common variance \(\sigma^2 = 4\), and sample size \(n = 25\) in each group.
From theory, if \(\bar{X}_1\) and \(\bar{X}_2\) are the sample means and \(D = \bar{X}_1 - \bar{X}_2\), then
We can compute \(P(0 < D < 2)\) analytically using the Normal cdf:
from scipy import stats
import numpy as np
mu_D = 1.0
var_D = 0.32
sd_D = np.sqrt(var_D)
prob_0_to_2 = stats.norm(mu_D, sd_D).cdf(2) - stats.norm(mu_D, sd_D).cdf(0)
Alternatively we can simulate many realizations of \(D\) and approximate this probability empirically.
rng = np.random.default_rng(seed=42)
n = 25
num_samples = 10_000
x1 = rng.normal(loc=6, scale=2, size=(num_samples, n))
x2 = rng.normal(loc=5, scale=2, size=(num_samples, n))
diffs = x1.mean(axis=1) - x2.mean(axis=1)
prob_sim = np.mean((diffs > 0) & (diffs < 2))
diffs_mean = diffs.mean()
diffs_var = diffs.var(ddof=1)
Here:
prob_simshould be close toprob_0_to_2,diffs_meanshould be near 1,diffs_varshould be near 0.32.
In later PyStatsV1 chapters we will use this pattern repeatedly to check the behaviour of estimators and test statistics.
5.3.2 Distribution of a sample mean
The second example in the R notes studies the distribution of the sample mean from a Poisson distribution and connects it to the Central Limit Theorem.
Let \(X \sim \text{Poisson}(\mu)\) with \(\mu = 10\). We draw samples of size \(n = 50\) and compute the sample mean \(\bar{X}\). The CLT suggests
for moderately large \(n\).
In Python:
from scipy import stats
import numpy as np
rng = np.random.default_rng(seed=1337)
mu = 10
sample_size = 50
num_samples = 100_000
# each row: one sample of size n
samples = rng.poisson(lam=mu, size=(num_samples, sample_size))
x_bars = samples.mean(axis=1)
# empirical mean and variance
mean_emp = x_bars.mean()
var_emp = x_bars.var(ddof=1)
mean_theory = mu
var_theory = mu / sample_size
You can then:
plot a histogram of
x_bars,overlay a Normal density with mean
mean_theoryand variancevar_theory,and check proportions within 1, 2, or 3 standard deviations of the mean.
(We keep plotting code to a minimum here; later chapters will show more matplotlib examples.)
5.4 What you should take away
By the end of this chapter (R + Python versions), you should be comfortable with:
Using distribution helpers to get pdf/pmf, cdf, quantiles, and random samples from common distributions in Python and R.
Translating between R and Python:
d* / p* / q* / r*functions in R correspond topdf/pmf,cdf,ppf, andrvsinscipy.stats.Implementing t-tests both “by hand” from the formulas and using convenience functions like
scipy.stats.ttest_1samp()andscipy.stats.ttest_ind().Using simulation to: - approximate probabilities, - understand sampling distributions, - and verify theoretical results.
If any of the code in this chapter feels new, try it interactively:
change parameters (means, variances, sample sizes),
re-run the simulations,
see how the distribution of statistics (like \(\bar{X}\) or the t-statistic) responds.
That experimentation will pay off quickly in the PyStatsV1 applied chapters, where we’ll connect these probability tools directly to real case studies.