Intro Stats 3 - Distributions and outliers

This is Part 3 of the Intro Stats case study pack.

Before doing “formal” inference, it helps to look at the shape of the data.

Two common questions:

  • Are the scores roughly bell-shaped, or skewed?

  • Are there extreme values that might be mistakes or unusual cases?

In this part you will:

  1. visualize distributions by group (histograms + boxplots),

  2. learn what “skew” and “outlier” mean in plain language,

  3. use a simple, transparent outlier rule (IQR), and

  4. practice with a couple of worked mini-examples.

Why this matters

Many statistical tools (like t-tests and confidence intervals) work best when:

  • the data are roughly symmetric (not extremely skewed), and

  • a few extreme values are not dominating the story.

You do not need “perfect” bell curves to do inference. But you do want to know what the data look like, so you can:

  • spot mistakes (data-entry errors),

  • understand unusual cases (real outliers),

  • choose robust alternatives when needed, and

  • communicate results honestly.

Run

From inside your workbook folder:

pystatsv1 workbook run intro_stats_03_distributions_outliers

Or directly:

python scripts/intro_stats_03_distributions_outliers.py

What gets created

Outputs go to:

  • outputs/case_studies/intro_stats/

You should see:

  • score_distributions.png - per-group histogram + boxplot

  • outliers_iqr.csv - rows flagged as outliers using the IQR rule

Inspect

Step 1: read the picture first

Open score_distributions.png and look for:

Histograms

  • Is each group’s distribution roughly symmetric, or skewed?

  • Is one group “wider” (more variable) than the other?

Boxplots

  • Do the medians (middle lines) differ?

  • Are the boxes (middle 50%) similar sizes?

  • Do the whiskers differ a lot in length?

Extreme points

  • Any dots far from the rest of the data?

  • If yes, are they in one group or both?

Step 2: read the outlier table

Open outliers_iqr.csv.

If it is empty, no values were flagged, and that is fine. If it has rows, note:

  • which group(s) the outliers are in,

  • whether the values look plausible,

  • whether the outliers would change the story.

Concepts (plain language)

Distribution

A distribution describes how the values are spread out: which values occur, and how often.

Three quick “shape words” you will see a lot:

  • symmetric: left and right sides look similar

  • right-skewed: a few unusually high values stretch the right tail

  • left-skewed: a few unusually low values stretch the left tail
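One rough way to put a shape word on a list of numbers is to compare the mean with the median: a long right tail tends to pull the mean above the median, and a long left tail tends to pull it below. The sketch below is only a heuristic, not a definition of skew, and it does not replace looking at the histogram. The function name `describe_shape` and the `tolerance` threshold are hypothetical choices made for this illustration.

```python
from statistics import mean, median


def describe_shape(values, tolerance=0.05):
    """Return a rough shape word by comparing the mean and the median.

    `tolerance` is an arbitrary illustrative threshold: the mean and median
    count as "close" if they differ by at most 5% of the data's range.
    """
    m = mean(values)
    med = median(values)
    spread = max(values) - min(values)
    if spread == 0 or abs(m - med) <= tolerance * spread:
        return "roughly symmetric"
    # A mean above the median suggests a stretched right tail, and vice versa.
    return "right-skewed" if m > med else "left-skewed"


# Made-up example data, for illustration only.
print(describe_shape([70, 72, 74, 76, 78, 80]))
print(describe_shape([10, 11, 12, 12, 13, 14, 15, 30]))
print(describe_shape([30, 85, 86, 87, 88, 90]))
```

The three calls should report a symmetric, a right-skewed, and a left-skewed list respectively.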

Outliers

An outlier is a value far from most of the data.

Outliers can happen for many reasons:

  • typo (e.g., 800 instead of 80)

  • unit mix-up (e.g., percent vs points)

  • real extreme case (someone truly unusual)

Outlier detection is not “auto-delete.” It is “worth a closer look.”

The IQR rule

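A quick way to flag potential outliers is the IQR rule. Here \(Q1\) is the 25th percentile (first quartile) and \(Q3\) is the 75th percentile (third quartile):

  • \(IQR = Q3 - Q1\) (the spread of the middle 50% of the data)

  • a value is flagged as a potential outlier if:

    • \(value < Q1 - 1.5\times IQR\) or

    • \(value > Q3 + 1.5\times IQR\)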

This is transparent and easy to explain. It is not perfect, but it is a good first pass.
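The rule can be sketched in a few lines of standard-library Python. This is a minimal illustration and not necessarily how the chapter script computes its flags. The helper names `iqr_fences` and `flag_outliers` are hypothetical, and the choice of the "inclusive" quartile method is an assumption; other quartile conventions give slightly different cutoffs.

```python
from statistics import quantiles


def iqr_fences(values, k=1.5):
    """Return the (lower, upper) cutoffs of the IQR rule.

    Assumption: quartiles use the "inclusive" method of statistics.quantiles
    (linear interpolation between data points).
    """
    q1, _median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def flag_outliers(values, k=1.5):
    """Return the values that fall outside the IQR cutoffs."""
    lower, upper = iqr_fences(values, k)
    return [v for v in values if v < lower or v > upper]


# Made-up scores, for illustration only.
scores = [62, 65, 68, 70, 71, 73, 75, 98]
print(iqr_fences(scores))     # (57.875, 82.875)
print(flag_outliers(scores))  # [98]
```

Returning the cutoffs separately from the flagged values keeps the rule easy to explain: you can always say exactly where the line was drawn.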

Worked mini-examples

Worked example A: what the IQR rule is doing

Suppose a small set of scores:

10, 11, 12, 12, 13, 14, 15, 30

Most scores are around 10–15, but one score is 30.

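  • The middle of the data is near 12–14: \(Q1\) is about 11–12 and \(Q3\) is about 14–15 (the exact values depend on the quartile convention), so the IQR is about 3.

  • The upper cutoff \(Q3 + 1.5\times IQR\) therefore lands somewhere around 18–20. The value 30 is well above it, so it is flagged as an outlier.

Key lesson: the IQR rule measures how far a value sits beyond the quartiles, in units of the “middle 50%” spread.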
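To see the arithmetic for these eight scores, here is a short step-by-step sketch. It assumes the "inclusive" quartile method from Python's standard library; a different quartile convention shifts the cutoffs a little but does not change which value is flagged here.

```python
from statistics import quantiles

scores = [10, 11, 12, 12, 13, 14, 15, 30]

# Assumption: "inclusive" quartiles (linear interpolation between data points).
q1, q2, q3 = quantiles(scores, n=4, method="inclusive")
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
flagged = [s for s in scores if s < lower_fence or s > upper_fence]

print(f"Q1 = {q1}, Q3 = {q3}, IQR = {iqr}")         # Q1 = 11.75, Q3 = 14.25, IQR = 2.5
print(f"cutoffs: {lower_fence} to {upper_fence}")   # cutoffs: 8.0 to 18.0
print(f"flagged: {flagged}")                        # flagged: [30]
```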

Worked example B: outlier vs “valid extreme”

Imagine two situations:

  1. A score of 30 on a test that is out of 100.

  2. A score of 300 on that same test.

Which one is more likely to be a typo?

Usually:

  • 30 might be a valid low score.

  • 300 is impossible and almost certainly a data-entry error.

So your first question should often be: “Is this value even possible?”
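That “is it even possible?” question can be checked before any outlier rule is applied. The sketch below assumes a test scored from 0 to 100, as in the example above; the function name `impossible_scores` and its parameters are hypothetical.

```python
def impossible_scores(scores, min_score=0, max_score=100):
    """Return the scores that fall outside the allowed range.

    The default range of 0 to 100 is an assumption taken from the example;
    set it to whatever your own test or scale allows.
    """
    return [s for s in scores if not (min_score <= s <= max_score)]


# Made-up scores: 30 is low but possible, 300 cannot occur on a 100-point test.
print(impossible_scores([30, 71, 300, 88]))  # [300]
```

Note that 30 passes this check even though it may still be flagged by the IQR rule: "possible but unusual" and "impossible" are different problems.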

Good practice (what to do if you find an outlier)

If you find an outlier, don’t delete it just because it is inconvenient. Instead, ask:

  1. Is it a typo or data-entry error? If yes, fix it (and document the fix).

  2. Is it a valid extreme case? If yes, keep it, but acknowledge it.

  3. Does it change conclusions? Try a simple sensitivity check:

    • run the analysis with the outlier included

    • run again with it excluded (only if exclusion is justified)

    • compare the results

In real reporting, you would write something like:

  • “Results were similar with and without the extreme value” (robust)

  • or “Results depended heavily on one extreme value” (fragile)
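The include/exclude comparison can be sketched as follows, reusing the eight scores from Worked example A. This is only an illustration of the idea; the helper name `summarize` is hypothetical, and in a real analysis you would rerun your actual test or interval rather than only the mean and median.

```python
from statistics import mean, median

scores = [10, 11, 12, 12, 13, 14, 15, 30]
extreme_value = 30

# Exclusion here is for the sensitivity check only, not a permanent deletion.
without_extreme = [s for s in scores if s != extreme_value]


def summarize(values):
    """Return a small summary so the two versions can be compared side by side."""
    return {"n": len(values), "mean": round(mean(values), 3), "median": median(values)}


print("with extreme value:   ", summarize(scores))
print("without extreme value:", summarize(without_extreme))
```

Here the mean drops from about 14.6 to about 12.4 when the 30 is removed, while the median only moves from 12.5 to 12. That gap is one reason the median is often described as a robust summary.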

Using your own data (student workflow)

To use this chapter’s script on your own dataset, your CSV needs these columns:

  • id (unique identifier)

  • group (e.g., control / treatment)

  • score (numeric)
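Before running the chapter on your own file, it can help to check those three requirements yourself. The sketch below is a hypothetical pre-check and is not part of the pystatsv1 tooling; the function name `check_scores_csv` is made up for this example.

```python
import csv
import io

REQUIRED_COLUMNS = ("id", "group", "score")


def check_scores_csv(lines):
    """Return a list of problems found in CSV text (an open file or line iterable)."""
    reader = csv.DictReader(lines)
    header = reader.fieldnames or []
    missing = [c for c in REQUIRED_COLUMNS if c not in header]
    if missing:
        return [f"missing column(s): {', '.join(missing)}"]

    problems = []
    seen_ids = set()
    # Data rows start on line 2 because line 1 is the header.
    for line_no, row in enumerate(reader, start=2):
        if row["id"] in seen_ids:
            problems.append(f"line {line_no}: duplicate id {row['id']!r}")
        seen_ids.add(row["id"])
        try:
            float(row["score"])
        except (TypeError, ValueError):
            problems.append(f"line {line_no}: score {row['score']!r} is not numeric")
    return problems


# Made-up sample with one bad score, for illustration only.
sample = io.StringIO("id,group,score\n1,control,71\n2,control,69\n3,treatment,eighty\n")
print(check_scores_csv(sample))  # ["line 4: score 'eighty' is not numeric"]
```

To check a real file, pass an open file instead of the sample, for example `open("data/intro_stats_scores.csv", newline="")`. An empty list means no problems were found by these particular checks.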

Option 1: Replace the rows in data/intro_stats_scores.csv

This is the simplest way to reuse the exact same scripts.

  1. Back up the original file:

cp data/intro_stats_scores.csv data/intro_stats_scores_backup.csv

  2. Edit the file in any text editor (on Windows, Notepad works fine):

notepad data/intro_stats_scores.csv

  3. Paste your own rows (keeping the same header):

id,group,score
1,control,71
2,control,69
3,treatment,80
4,treatment,78

  4. Run the distributions/outliers script:

pystatsv1 workbook run intro_stats_03_distributions_outliers

  5. Inspect the outputs:

  • outputs/case_studies/intro_stats/score_distributions.png

  • outputs/case_studies/intro_stats/outliers_iqr.csv

When you are done, restore the original dataset:

mv data/intro_stats_scores_backup.csv data/intro_stats_scores.csv

Then rerun to regenerate the canonical outputs:

pystatsv1 workbook run intro_stats_03_distributions_outliers

Option 2: Use the general “My Own Data” scaffold

If you do not want to match the Intro Stats column names, use:

pystatsv1 workbook run my_data_01_explore
pystatsv1 workbook check my_data

This is a good first pass for spotting missing values, unexpected column types, and odd entries.

Reproducibility checkpoint

Run the chapter twice and confirm that both runs produce the same outputs:

pystatsv1 workbook run intro_stats_03_distributions_outliers
pystatsv1 workbook run intro_stats_03_distributions_outliers
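One way to make “the same outputs” concrete is to compare file checksums between the two runs. The sketch below is a hypothetical helper and is not part of the pystatsv1 tooling; the function names `file_sha256` and `snapshot` are made up. It assumes you run it from the workbook folder. Byte-for-byte comparison is most reliable for the CSV; an image file may differ in its bytes across library versions even when the plot looks identical.

```python
import hashlib
from pathlib import Path


def file_sha256(path):
    """Return the SHA-256 checksum of a file's contents as a hex string."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def snapshot(paths):
    """Map each existing path to its checksum; paths that do not exist are skipped."""
    return {str(p): file_sha256(p) for p in paths if Path(p).exists()}


outputs = [
    "outputs/case_studies/intro_stats/score_distributions.png",
    "outputs/case_studies/intro_stats/outliers_iqr.csv",
]

first = snapshot(outputs)
# Rerun the chapter here (for example from another terminal), then take a second snapshot.
second = snapshot(outputs)

if not first:
    print("no output files found; run the chapter first")
elif first == second:
    print("outputs are identical between snapshots")
else:
    print("outputs changed between snapshots")
```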

If you changed the dataset for practice, remember:

  • pack tests may fail until you restore the original data.

Next

Go to Intro Stats 4 - Confidence intervals to compute a 95% confidence interval for the mean difference.