Applied Statistics with Python – Chapter 5 ========================================== Probability and statistics in Python and R ------------------------------------------ This chapter mirrors the **“Probability and Statistics in R”** chapter from the R notes. We’ll keep the *statistical* ideas the same, but express them in Python-first terms. By the end of this chapter you should be comfortable with: - using library functions to work with common distributions, - translating between **pdf / cdf / quantile / random draws**, - performing basic **t-tests** in Python (one-sample and two-sample), - and using **simulation** to understand probability results. Where the R notes show functions like ``dnorm()``, ``qbinom()``, or ``t.test()``, we’ll highlight the Python equivalents from: - :mod:`scipy.stats` - :mod:`numpy.random` - and a few plain-Python helper functions. 5.1 Probability in Python (and R) ================================= 5.1.1 Distribution helpers: pdf, cdf, quantiles, random draws ------------------------------------------------------------- In the R book you see the pattern: - ``dname`` – density or pmf (pdf), - ``pname`` – cumulative distribution function (cdf), - ``qname`` – quantile function, - ``rname`` – random draws. For example, ``dnorm()``, ``pnorm()``, ``qnorm()``, ``rnorm()`` for the Normal distribution, or ``dbinom()``, ``pbinom()`` and so on. In Python, the roles are similar but the interface is a little different. The most convenient toolkit is :mod:`scipy.stats`. Each distribution is an object with methods and “frozen” versions. For a Normal distribution with mean 2 and standard deviation 5: .. code-block:: python import numpy as np from scipy import stats dist = stats.norm(loc=2, scale=5) # pdf (density) at x = 3 pdf_at_3 = dist.pdf(3) # cdf at x = 3, P(X <= 3) cdf_at_3 = dist.cdf(3) # 97.5th percentile (quantile) q_975 = dist.ppf(0.975) # random sample of size n = 10 sample = dist.rvs(size=10, random_state=42) The mapping to the R notation is: - R ``dnorm(x, mean, sd)`` → Python ``stats.norm(mean, sd).pdf(x)`` - R ``pnorm(q, mean, sd)`` → Python ``stats.norm(mean, sd).cdf(q)`` - R ``qnorm(p, mean, sd)`` → Python ``stats.norm(mean, sd).ppf(p)`` - R ``rnorm(n, mean, sd)`` → Python ``stats.norm(mean, sd).rvs(size=n)`` 5.1.2 Other families: binomial, t, Poisson, chi-square, F --------------------------------------------------------- The R chapter lists a family of distributions with the same ``d/p/q/r`` pattern. In Python you will typically reach for: - :class:`scipy.stats.binom` for binomial, - :class:`scipy.stats.t` for Student’s t, - :class:`scipy.stats.poisson` for Poisson, - :class:`scipy.stats.chi2` for chi-square, - :class:`scipy.stats.f` for F. **Example – Binomial probability** Compute :math:`P(Y = 6)` when :math:`Y \sim \text{Binomial}(n=10, p=0.75)`: .. code-block:: python from scipy import stats # pmf (probability mass function) for a Binomial prob_y_eq_6 = stats.binom(n=10, p=0.75).pmf(6) In R the same calculation would use ``dbinom(x = 6, size = 10, prob = 0.75)``. Notice again: - R’s “d” for discrete distributions is really the **pmf**, - Python exposes this as ``pmf`` for discrete and ``pdf`` for continuous. You can explore other distributions in :mod:`scipy.stats` in exactly the same way: create the frozen distribution object, then call ``pdf``, ``cdf``, ``ppf``, and ``rvs``. 5.2 Hypothesis tests in Python ============================== The R notes briefly review hypothesis testing and then demonstrate t-tests. We’ll match that structure here, but show both: - the **formula** (to demystify the test statistic), and - the **convenience function** from :mod:`scipy.stats`. 5.2.1 One-sample t-test ------------------------ Set-up: we observe :math:`x_1, \dots, x_n` and want to test .. math:: H_0: \mu = \mu_0 \quad \text{vs} \quad H_1: \mu \neq \mu_0, assuming the data are approximately Normal with unknown variance. The test statistic is .. math:: t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}} \sim t_{n-1} \quad \text{under } H_0, where :math:`\bar{x}` is the sample mean and :math:`s` the sample standard deviation. **Python implementation “by hand”** We revisit the cereal-box example from the R notes: boxes labelled “16 oz”, sample of 9 observed weights. (Here we just hard-code the data.) .. code-block:: python import numpy as np from scipy import stats weights = np.array([15.5, 16.2, 16.1, 15.8, 15.6, 16.0, 15.8, 15.9, 16.2]) mu_0 = 16 n = len(weights) x_bar = weights.mean() s = weights.std(ddof=1) t_stat = (x_bar - mu_0) / (s / np.sqrt(n)) df = n - 1 # two-sided p-value p_value_two_sided = 2 * stats.t.sf(np.abs(t_stat), df=df) For a **one-sided** alternative (for example :math:`H_1: \mu < \mu_0`): .. code-block:: python p_value_less = stats.t.cdf(t_stat, df=df) # P(T <= observed) p_value_greater = stats.t.sf(t_stat, df=df) # P(T >= observed) This mirrors what the R code is doing with ``t.test()`` and ``pt()``. **Using SciPy’s convenience function** .. code-block:: python # two-sided one-sample t-test against mu_0 t_stat2, p_value2 = stats.ttest_1samp(weights, popmean=mu_0) # SciPy always reports a two-sided p-value; you can halve it # for a one-sided test if the sign of t matches your alternative. p_value_less = p_value2 / 2 if t_stat2 < 0 else 1 - p_value2 / 2 The important idea: whether you calculate :math:`t` by hand or use a helper function, the **model assumptions** and **test structure** are the same as in the R notes. 5.2.2 Two-sample t-test ----------------------- Now suppose we have two independent samples: .. math:: X_1, \dots, X_n \sim N(\mu_x, \sigma^2), \quad Y_1, \dots, Y_m \sim N(\mu_y, \sigma^2), and we want to test .. math:: H_0: \mu_x - \mu_y = 0 \quad \text{vs} \quad H_1: \mu_x - \mu_y > 0, under an equal-variance assumption. The pooled-variance test statistic is .. math:: t = \frac{\bar{x} - \bar{y}} {s_p \sqrt{1/n + 1/m}}, \qquad s_p^2 = \frac{(n-1)s_x^2 + (m-1)s_y^2}{n + m - 2}. **Python implementation “by hand”** .. code-block:: python import numpy as np from scipy import stats x = np.array([70, 82, 78, 74, 94, 82]) y = np.array([64, 72, 60, 76, 72, 80, 84, 68]) n, m = len(x), len(y) x_bar, y_bar = x.mean(), y.mean() s_x = x.std(ddof=1) s_y = y.std(ddof=1) s_p = np.sqrt(((n - 1) * s_x**2 + (m - 1) * s_y**2) / (n + m - 2)) t_stat = (x_bar - y_bar) / (s_p * np.sqrt(1 / n + 1 / m)) df = n + m - 2 # one-sided alternative H1: mu_x > mu_y p_value = stats.t.sf(t_stat, df=df) **Using SciPy’s two-sample test** .. code-block:: python t_stat2, p_value_two_sided = stats.ttest_ind( x, y, equal_var=True ) # convert to one-sided p-value if desired p_value_greater = p_value_two_sided / 2 if t_stat2 > 0 else 1 - p_value_two_sided / 2 SciPy’s :func:`ttest_ind` mirrors R’s ``t.test(x, y, var.equal = TRUE)`` call. For practical work you will usually: 1. check assumptions (rough normality, similar spreads), 2. call the SciPy function, 3. interpret :math:`t`, degrees of freedom, p-value, and confidence interval. 5.3 Simulation in Python ======================== The R chapter ends with simulations that illustrate how probability results behave in practice. We’ll take the same examples and rewrite them in NumPy. The pattern you’ll see throughout PyStatsV1 is: - write small simulation functions, - loop or vectorize to build many replicates, - examine histograms and summary statistics. 5.3.1 Paired differences ------------------------ We consider two Normal populations with means :math:`\mu_1 = 6`, :math:`\mu_2 = 5`, common variance :math:`\sigma^2 = 4`, and sample size :math:`n = 25` in each group. From theory, if :math:`\bar{X}_1` and :math:`\bar{X}_2` are the sample means and :math:`D = \bar{X}_1 - \bar{X}_2`, then .. math:: D \sim N(\mu_1 - \mu_2, \sigma^2/n + \sigma^2/n) = N(1, 0.32). We can compute :math:`P(0 < D < 2)` analytically using the Normal cdf: .. code-block:: python from scipy import stats import numpy as np mu_D = 1.0 var_D = 0.32 sd_D = np.sqrt(var_D) prob_0_to_2 = stats.norm(mu_D, sd_D).cdf(2) - stats.norm(mu_D, sd_D).cdf(0) Alternatively we can *simulate* many realizations of :math:`D` and approximate this probability empirically. .. code-block:: python rng = np.random.default_rng(seed=42) n = 25 num_samples = 10_000 x1 = rng.normal(loc=6, scale=2, size=(num_samples, n)) x2 = rng.normal(loc=5, scale=2, size=(num_samples, n)) diffs = x1.mean(axis=1) - x2.mean(axis=1) prob_sim = np.mean((diffs > 0) & (diffs < 2)) diffs_mean = diffs.mean() diffs_var = diffs.var(ddof=1) Here: - ``prob_sim`` should be close to ``prob_0_to_2``, - ``diffs_mean`` should be near 1, - ``diffs_var`` should be near 0.32. In later PyStatsV1 chapters we will use this pattern repeatedly to check the behaviour of estimators and test statistics. 5.3.2 Distribution of a sample mean ----------------------------------- The second example in the R notes studies the distribution of the sample mean from a Poisson distribution and connects it to the Central Limit Theorem. Let :math:`X \sim \text{Poisson}(\mu)` with :math:`\mu = 10`. We draw samples of size :math:`n = 50` and compute the sample mean :math:`\bar{X}`. The CLT suggests .. math:: \bar{X} \approx N\left(\mu, \frac{\mu}{n}\right) for moderately large :math:`n`. In Python: .. code-block:: python from scipy import stats import numpy as np rng = np.random.default_rng(seed=1337) mu = 10 sample_size = 50 num_samples = 100_000 # each row: one sample of size n samples = rng.poisson(lam=mu, size=(num_samples, sample_size)) x_bars = samples.mean(axis=1) # empirical mean and variance mean_emp = x_bars.mean() var_emp = x_bars.var(ddof=1) mean_theory = mu var_theory = mu / sample_size You can then: - plot a histogram of ``x_bars``, - overlay a Normal density with mean ``mean_theory`` and variance ``var_theory``, - and check proportions within 1, 2, or 3 standard deviations of the mean. (We keep plotting code to a minimum here; later chapters will show more matplotlib examples.) 5.4 What you should take away ============================= By the end of this chapter (R + Python versions), you should be comfortable with: - **Using distribution helpers** to get pdf/pmf, cdf, quantiles, and random samples from common distributions in Python and R. - **Translating between R and Python**: ``d* / p* / q* / r*`` functions in R correspond to ``pdf/pmf``, ``cdf``, ``ppf``, and ``rvs`` in :mod:`scipy.stats`. - **Implementing t-tests** both “by hand” from the formulas and using convenience functions like :func:`scipy.stats.ttest_1samp` and :func:`scipy.stats.ttest_ind`. - **Using simulation** to: - approximate probabilities, - understand sampling distributions, - and verify theoretical results. If any of the code in this chapter feels new, try it interactively: - change parameters (means, variances, sample sizes), - re-run the simulations, - see how the distribution of statistics (like :math:`\bar{X}` or the t-statistic) responds. That experimentation will pay off quickly in the PyStatsV1 applied chapters, where we’ll connect these probability tools directly to real case studies.