Intro Stats 5 - Hypothesis testing by simulation and effect size ================================================================ This is Part 5 of the **Intro Stats case study pack**. By now you have: * described the data (Part 1), * visualized distributions and checked for outliers (Part 3), * estimated uncertainty with simulation (Part 2), * computed confidence intervals (Part 4). Now you will run a **hypothesis test** using a simulation method called a **permutation test**, and you will compute an **effect size**. You are answering two different questions: 1. **Is the observed difference “rare” under a no-difference world?** (hypothesis test) 2. **How big is the difference in practical terms?** (effect size) Dataset and research question ----------------------------- * **Dataset:** ``data/intro_stats_scores.csv`` * **Columns:** ``id``, ``group`` (control/treatment), ``score`` * **Research question:** Do students in the **treatment** group score higher than students in the **control** group? Run --- From inside your workbook folder: .. code-block:: bash pystatsv1 workbook run intro_stats_05_hypothesis_testing Or run the script directly: .. code-block:: bash python scripts/intro_stats_05_hypothesis_testing.py What gets created ----------------- Outputs go to: * ``outputs/case_studies/intro_stats/`` You should see: * ``permutation_dist.png`` - histogram of shuffled mean differences * ``permutation_test_summary.csv`` - observed diff + simulated p-value * ``effect_size.csv`` - Cohen’s d and Hedges’ g Inspect (what to look for) -------------------------- 1) **The permutation plot** Open ``permutation_dist.png`` and locate: * **0** on the x-axis (no mean difference), * your **observed difference** (a vertical marker or labeled value, depending on the script), * whether the observed difference is in the “tail” (rare) or near the center (common). Questions to answer: * Is the null distribution centered near 0? * Is your observed difference far out in a tail? * Does it look rare (only a few shuffled results get that extreme)? 2) **The p-value table** Open ``permutation_test_summary.csv`` and record: * the observed mean difference (treatment − control), * the p-value (simulation-based), * the number of permutations used. 3) **Effect size** Open ``effect_size.csv`` and record: * Cohen’s d * Hedges’ g Then answer: * Is the effect size “small”, “medium”, or “large” (rough guideline)? * Does the effect size match what you saw visually in Part 1 and Part 3? Concepts (plain language) ------------------------- Null hypothesis (H0) ^^^^^^^^^^^^^^^^^^^^ **H0 (null):** group does not matter; any observed difference in means is due to chance. In other words: the scores are the same “on average” in the population, and if you repeated the study you would sometimes see a difference just by randomness. Permutation test (simulation hypothesis test) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ A **permutation test** builds a “no group effect” world like this: * keep the **scores** exactly as observed, * repeatedly **shuffle the group labels** (control/treatment), * each shuffle produces a **mean difference** you would expect under H0. Then: * compare your real observed mean difference to the shuffled distribution, * count how often the shuffled differences are **as extreme** (or more extreme) than observed, * that fraction is the **p-value**. This is powerful for beginners because it does not require heavy formulas. It directly simulates what “chance differences” look like. p-value (what it is and is not) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ **p-value:** *If H0 were true, how often would we see a mean difference at least this large (or larger) just by chance?* Common misunderstandings (avoid these): * ❌ “The p-value is the probability the null is true.” (No.) * ❌ “A p-value of 0.03 means there’s a 97% chance treatment works.” (No.) * ✅ “If there were truly no group effect, a difference this large would be uncommon.” Effect size (how big is the difference?) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Hypothesis tests answer “is it rare under H0?”. Effect sizes answer “how big is it?”. **Cohen’s d** scales the mean difference by the pooled standard deviation: * d ≈ (mean_treatment − mean_control) / SD_pooled **Hedges’ g** is a small-sample corrected version of d: * g is usually slightly smaller than d when sample sizes are small. Very rough guidelines (context matters!): * 0.2 ≈ small * 0.5 ≈ medium * 0.8 ≈ large In real work, you interpret effect sizes with domain context (grading scale, stakes, costs, etc.). Worked problem 1 — compute the observed difference --------------------------------------------------- Before you look at the script output, do this once by hand: 1) Open the dataset: * ``data/intro_stats_scores.csv`` 2) Take a small subset (first 5 control + first 5 treatment rows) and compute: * mean_control * mean_treatment * mean_diff = mean_treatment − mean_control Write your calculations down. Then run: .. code-block:: bash pystatsv1 workbook run intro_stats_05_hypothesis_testing Compare your hand-computed difference to the program’s observed mean difference. They should be consistent (your subset uses fewer rows, so it won’t match exactly, but the sign/direction should). Worked problem 2 — “tiny permutation test” intuition ----------------------------------------------------- Here is a tiny dataset with 6 scores. Imagine 3 students were labeled control and 3 were labeled treatment: * Scores: 60, 62, 65, 70, 72, 75 * Observed labels (example): * control: 60, 62, 65 * treatment: 70, 72, 75 Observed mean difference: * mean_treat = (70+72+75)/3 = 72.33 * mean_ctrl = (60+62+65)/3 = 62.33 * diff = 10.00 If H0 is true, the labels are exchangeable. So we would reassign **which 3 scores** are “treatment” in all possible ways (or in many random shuffles) and compute the difference each time. Key idea: * Under H0, a difference of 10 points should be rare if groups don’t matter. Takeaway: * If you can explain this tiny example in words, you understand the core logic of permutation tests. One-sided vs two-sided (what “as extreme” means) ------------------------------------------------ The pack’s story is directional: * “Does treatment score higher than control?” That is a **one-sided** question. Sometimes you instead ask: * “Are the groups different in either direction?” That is **two-sided**. Your script will implement one of these choices. When reading the output, make sure you know which it is. Rule of thumb: * one-sided when you truly care about a specific direction **and** that direction was chosen before seeing the data. * two-sided when you are open to either direction (most common default in many courses). Reproducibility checkpoint (simulation is still reproducible) ------------------------------------------------------------- Even though this is simulation-based, you should still get stable outputs. PyStatsV1 scripts typically set a random seed so that: * the permutation distribution plot, * the p-value, * and the output CSVs are reproducible from run to run. Try: .. code-block:: bash pystatsv1 workbook run intro_stats_05_hypothesis_testing pystatsv1 workbook run intro_stats_05_hypothesis_testing You should see consistent results and deterministic files written to the same paths. (If a course version changes the number of permutations, results may shift slightly, but the story should remain the same.) Using your own data (same workflow) ----------------------------------- You can reuse this exact workflow on your own dataset if you can provide: * a ``group`` column with exactly two groups, and * a numeric ``score`` column. Quickest path: replace the example CSV with your own data **using the same column names**. **Important:** make a backup first. Step 1 — backup the dataset ^^^^^^^^^^^^^^^^^^^^^^^^^^^^ .. code-block:: bash cp data/intro_stats_scores.csv data/intro_stats_scores_backup.csv Step 2 — edit the dataset in a text editor ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Open the CSV with Notepad: .. code-block:: bash notepad data/intro_stats_scores.csv Replace some (or all) rows with your own values. Keep the header exactly: :: id,group,score Rules of thumb: * ``id`` is an integer (1, 2, 3, ...) * ``group`` is ``control`` or ``treatment`` (spelling matters) * ``score`` is a number (avoid commas inside numbers) Step 3 — rerun the analysis ^^^^^^^^^^^^^^^^^^^^^^^^^^^^ .. code-block:: bash pystatsv1 workbook run intro_stats_01_descriptives pystatsv1 workbook run intro_stats_05_hypothesis_testing Step 4 — sanity check the outputs ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Inspect: * ``outputs/case_studies/intro_stats/group_summary.csv`` * ``outputs/case_studies/intro_stats/permutation_test_summary.csv`` * ``outputs/case_studies/intro_stats/effect_size.csv`` If the direction or magnitude changed, that’s expected — you changed the data! Step 5 — restore the original dataset (optional) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ If you want to go back to the shipped example data: .. code-block:: bash cp data/intro_stats_scores_backup.csv data/intro_stats_scores.csv Worked “own data” example — reproduce a previously worked result ----------------------------------------------------------------- Sometimes instructors want you to confirm you can reproduce a known result. Here is a safe way to do that: 1) Make a backup (if you haven't already): .. code-block:: bash cp data/intro_stats_scores.csv data/intro_stats_scores_backup.csv 2) Edit the CSV in Notepad and make a small, controlled change: * choose **one** treatment row and increase its score by **+1** * leave everything else unchanged Example: if you see a treatment score of ``78``, change it to ``79``. 3) Rerun and confirm: .. code-block:: bash pystatsv1 workbook run intro_stats_01_descriptives pystatsv1 workbook run intro_stats_05_hypothesis_testing What should change? * The treatment mean should increase slightly. * The observed mean difference should increase slightly. * The p-value may change a little (simulation), but the overall story should remain similar. 4) Restore the original file: .. code-block:: bash cp data/intro_stats_scores_backup.csv data/intro_stats_scores.csv This demonstrates you can: * edit data safely (backup first), * rerun reproducibly, * interpret how outputs respond to small data changes. Check (tests) ------------- When you want a quick “did I break anything?” smoke test: .. code-block:: bash pystatsv1 workbook check intro_stats This confirms the case study pack still matches the lesson expectations. Next ---- Go to :doc:`intro_stats_06_writeup` for a tiny interpretation template you can fill in.