Intro Stats 2 - Simulation and uncertainty (bootstrap)
This is Part 2 of the Intro Stats case study pack.
You still have the same dataset and the same research question:
Dataset: data/intro_stats_scores.csv
Question: Do students in the treatment group score higher than students in the control group?
In Part 1 you computed a point estimate (a single number): the difference between the treatment and control means.
In Part 2 you will answer the natural follow-up question:
If we repeated this study with “different students”, how much could the mean difference change?
Learning goals
By the end of this chapter, you should be able to:
explain (in plain language) why a single mean difference is often not enough,
generate a simulation-based uncertainty summary using a bootstrap, and
interpret a bootstrap confidence interval as a “plausible range” for the mean difference.
Concepts (plain language)
- Sampling variability
If you sample a different set of students, you will not get the exact same mean difference every time. Small changes in the sample can cause small (and sometimes not-so-small) changes in the result.
- Bootstrap simulation
The bootstrap is a simple simulation trick:
treat your dataset as your best snapshot of reality,
repeatedly resample rows with replacement (so some students may appear twice and some not at all), and
recompute the statistic each time (here: mean(treatment) - mean(control)).
The collection of simulated statistics is an approximate sampling distribution.
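The three steps above can be sketched in a few lines of Python. This is a minimal illustration, not the workbook script. The scores are a small made-up example, the function name `bootstrap_mean_diffs`, the number of draws, and the seed are all hypothetical, and it assumes each group is resampled separately (the actual script may resample rows differently).

```python
import random
from statistics import fmean

# Made-up scores for illustration only (not the workbook dataset).
control = [73, 69, 75, 71]
treatment = [82, 79, 85, 81]


def bootstrap_mean_diffs(control, treatment, n_boot=2000, seed=0):
    """Return n_boot bootstrap draws of mean(treatment) - mean(control).

    Assumption: each group is resampled on its own, with replacement,
    keeping the original group sizes.
    """
    rng = random.Random(seed)  # fixed seed so the draws are repeatable
    diffs = []
    for _ in range(n_boot):
        # Sampling with replacement: a score may appear twice or not at all.
        c = rng.choices(control, k=len(control))
        t = rng.choices(treatment, k=len(treatment))
        diffs.append(fmean(t) - fmean(c))
    return diffs


observed = fmean(treatment) - fmean(control)
diffs = bootstrap_mean_diffs(control, treatment)

print("observed difference:", observed)
print("number of bootstrap draws:", len(diffs))
print("mean of bootstrap draws:", round(fmean(diffs), 2))
```

The list `diffs` plays the role of the approximate sampling distribution: its spread shows how much the mean difference moves from one resample to the next.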
- Deterministic outputs (important for the Workbook)
This script sets a fixed random seed so your outputs are reproducible. That is why your CSV and PNG artifacts should match the reference results when your input dataset matches the workbook dataset.
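To see why a fixed seed gives reproducible outputs, here is a small sketch using Python's standard `random` module. The seed value `123` and the helper `draw_indices` are hypothetical. The script's actual seed and random number generator are not shown in this chapter.

```python
import random


def draw_indices(seed, n_rows=8, n_draws=5):
    """Return n_draws row indices sampled with replacement from range(n_rows)."""
    rng = random.Random(seed)  # the seed fully determines the sequence of draws
    return [rng.randrange(n_rows) for _ in range(n_draws)]


first = draw_indices(seed=123)
second = draw_indices(seed=123)

# Same seed, same code, same inputs: the resampled rows are identical.
print(first == second)  # True
```

If the input data change, the same seed still picks the same row positions, but those positions now hold different scores, so the outputs change.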
Run
From inside your workbook folder:
pystatsv1 workbook run intro_stats_02_simulation
If you want to run the script directly:
python scripts/intro_stats_02_simulation.py
What gets created
The script writes outputs to:
outputs/case_studies/intro_stats/
You should see:
bootstrap_mean_diffs.csv - one row per bootstrap draw
bootstrap_summary.csv - a tiny one-row summary table
bootstrap_mean_diff_hist.png - a histogram of the bootstrap distribution
Inspect
Open bootstrap_summary.csv and answer:
What is the observed mean difference?
What is the 95% bootstrap interval (low and high)?
Open bootstrap_mean_diff_hist.png and check:
Is the distribution centered near the observed difference?
Is most of the distribution above 0 (meaning treatment > control)?
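If you want to check these questions with numbers rather than by eye, the sketch below summarizes a list of bootstrap draws. It assumes a percentile interval (the 2.5th and 97.5th percentiles of the draws), and the script may compute its interval differently. The helpers `percentile` and `summarize` and the values in `toy_diffs` are made up for illustration.

```python
from statistics import fmean


def percentile(sorted_values, p):
    """Percentile (0 <= p <= 100) of a sorted list, using linear interpolation."""
    if not sorted_values:
        raise ValueError("need at least one value")
    pos = (len(sorted_values) - 1) * p / 100
    lo = int(pos)
    hi = min(lo + 1, len(sorted_values) - 1)
    frac = pos - lo
    return sorted_values[lo] * (1 - frac) + sorted_values[hi] * frac


def summarize(diffs):
    """Center, 95% percentile interval, and share of draws above zero."""
    ordered = sorted(diffs)
    return {
        "center": fmean(ordered),
        "low": percentile(ordered, 2.5),
        "high": percentile(ordered, 97.5),
        "share_above_zero": sum(d > 0 for d in ordered) / len(ordered),
    }


# Made-up draws: the values 1.0, 2.0, ..., 101.0.
toy_diffs = [float(x) for x in range(1, 102)]
summary = summarize(toy_diffs)
print(summary)
```

With real draws, you would pass in the `boot_mean_diff` column from bootstrap_mean_diffs.csv instead of `toy_diffs`.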
Reference outputs (what you should see)
If your data/intro_stats_scores.csv matches the workbook dataset, you should
see results close to:
Observed mean difference: about 11.20 points
95% bootstrap interval: about [9.50, 12.85]
The exact values are saved in bootstrap_summary.csv.
Worked problems (with solutions)
Problem 1: Compute the mean difference by hand
From Part 1, you should have a table like this (values may vary slightly if you rounded when you copied them):
control mean: about 69.0
treatment mean: about 80.2
- Question:
What is the mean difference (treatment - control)?
- Solution:
Subtract:
80.2 - 69.0 = 11.2 points.
That is your point estimate.
Problem 2: Interpret the bootstrap interval
Open bootstrap_summary.csv.
- Question:
Suppose the interval is [9.50, 12.85]. What does that mean in plain language?
- Solution:
A good plain-language interpretation is:
“Given this dataset, a reasonable (simulation-based) range for the true mean advantage of the treatment group is about 9.5 to 12.9 points.”
It does not mean “95% chance the treatment works”. It is about the uncertainty in the estimated mean difference.
Problem 3: How often is the mean difference <= 0?
This is a quick sanity check.
From inside your workbook folder:
python -c "import pandas as pd; d=pd.read_csv('outputs/case_studies/intro_stats/bootstrap_mean_diffs.csv'); print('P(diff<=0)=', (d.boot_mean_diff<=0).mean())"
Interpretation:
If P(diff<=0) is near 0, your bootstrap draws almost always show treatment > control.
If it is large (for example 0.30), a sizable share of the draws show the treatment doing no better than the control, so your data do not clearly favor the treatment.
Using your own data (or your own mini-example)
The Intro Stats case study expects a very simple CSV format:
one row per student
columns: id, group, score
group should be control or treatment
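Before running the chapter on your own data, you can check the file against the format described above. The sketch below is one possible check, not part of the workbook. The function name `find_problems` and the exact rules (all three columns present, group spelled exactly `control` or `treatment`, score readable as a number) are assumptions based on that description.

```python
import csv
import io

EXPECTED_COLUMNS = ["id", "group", "score"]
ALLOWED_GROUPS = {"control", "treatment"}


def find_problems(csv_text):
    """Return a list of human-readable problems found in the CSV text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    header = reader.fieldnames or []
    missing = [col for col in EXPECTED_COLUMNS if col not in header]
    if missing:
        return [f"missing columns: {missing}"]

    problems = []
    # Line 1 is the header, so data rows start at line 2.
    for line_no, row in enumerate(reader, start=2):
        if row["group"] not in ALLOWED_GROUPS:
            problems.append(f"line {line_no}: unknown group {row['group']!r}")
        try:
            float(row["score"])
        except (TypeError, ValueError):
            problems.append(f"line {line_no}: score is not a number: {row['score']!r}")
    return problems


good = "id,group,score\n1,control,73\n2,treatment,82\n"
bad = "id,group,score\n1,control,73\n2,placebo,eighty\n"

print(find_problems(good))  # []
print(find_problems(bad))   # two problems, both on line 3
```

To check a real file, read it first (for example with `open(path, newline="").read()`) and pass the text to `find_problems`.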
Warning
Editing data/intro_stats_scores.csv changes the inputs for all Intro
Stats chapters. Always make a backup first.
Step A: Make a backup
cp data/intro_stats_scores.csv data/intro_stats_scores_backup.csv
Step B: Edit the CSV in a text editor
Open the file in a plain-text editor, for example Notepad on Windows:
notepad data/intro_stats_scores.csv
Replace the contents with this small worked example:
id,group,score
1,control,73
2,control,69
3,control,75
4,control,71
5,treatment,82
6,treatment,79
7,treatment,85
8,treatment,81
Save the file and close Notepad.
Step C: Run the script and compare to the expected pattern
pystatsv1 workbook run intro_stats_02_simulation
For this mini-example, you should see:
Observed mean difference: 9.75 points
95% bootstrap interval: roughly [4.88, 15.12]
(Your exact values will be written to bootstrap_summary.csv.)
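You can confirm the observed mean difference for the mini-example without running the script. The sketch below recomputes the two group means from the eight rows above using only the Python standard library. The variable names are illustrative.

```python
import csv
import io
from statistics import fmean

MINI_CSV = """id,group,score
1,control,73
2,control,69
3,control,75
4,control,71
5,treatment,82
6,treatment,79
7,treatment,85
8,treatment,81
"""

# Collect the scores for each group.
scores = {"control": [], "treatment": []}
for row in csv.DictReader(io.StringIO(MINI_CSV)):
    scores[row["group"]].append(float(row["score"]))

control_mean = fmean(scores["control"])      # 288 / 4 = 72.0
treatment_mean = fmean(scores["treatment"])  # 327 / 4 = 81.75
mean_diff = treatment_mean - control_mean

print(control_mean, treatment_mean, mean_diff)  # 72.0 81.75 9.75
```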
Step D: Restore the workbook dataset
mv data/intro_stats_scores_backup.csv data/intro_stats_scores.csv
Reproducibility checkpoint
Run the chapter twice:
pystatsv1 workbook run intro_stats_02_simulation
pystatsv1 workbook run intro_stats_02_simulation
Because the script uses a fixed seed, you should get the same outputs each time.
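One way to confirm that two runs produced the same file is to compare checksums. The sketch below is a generic approach, not a workbook feature. The helper `file_digest` is hypothetical, and the two temporary files stand in for copies of an output saved after each run.

```python
import hashlib
import tempfile
from pathlib import Path


def file_digest(path):
    """Return the SHA-256 hex digest of a file's bytes."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


# Stand-ins for the same output file saved after run 1 and run 2.
with tempfile.TemporaryDirectory() as tmp:
    run1 = Path(tmp) / "run1.csv"
    run2 = Path(tmp) / "run2.csv"
    run1.write_text("boot_mean_diff\n9.5\n10.25\n")
    run2.write_text("boot_mean_diff\n9.5\n10.25\n")
    same = file_digest(run1) == file_digest(run2)

print("identical outputs:", same)  # identical outputs: True
```

In practice you would copy bootstrap_mean_diffs.csv aside after the first run, run the chapter again, and compare the copy with the new file. A byte-for-byte comparison is strict, so it is best suited to the CSV outputs.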
Check
This case study pack includes a small “check your work” test.
From inside your workbook folder:
pystatsv1 workbook check intro_stats
If you edited the dataset for the mini-example, restore the original dataset first (see the restore step above) so the check matches the workbook reference.
Next
Go to Intro Stats 3 - Distributions and outliers to look at distributions, outliers, and why plots matter before you run formal tests.