Intro Stats 3 - Distributions and outliers
==========================================

This is Part 3 of the **Intro Stats case study pack**.

Before doing “formal” inference, it helps to look at the **shape** of the data.

Two common questions:

* Are the scores roughly bell-shaped, or skewed?
* Are there extreme values that might be mistakes or unusual cases?

In this part you will:

1. visualize distributions by group (histograms + boxplots),
2. learn what “skew” and “outlier” mean in plain language,
3. use a simple, transparent outlier rule (IQR), and
4. practice with a couple of worked mini-examples.

Why this matters
----------------

Many statistical tools (like t-tests and confidence intervals) *work best* when:

* the data are roughly symmetric (not extremely skewed), and
* a few extreme values are not dominating the story.

You do not need “perfect” bell curves to do inference.
But you *do* want to know what the data look like, so you can:

* spot mistakes (data-entry errors),
* understand unusual cases (real outliers),
* choose robust alternatives when needed, and
* communicate results honestly.

Run
---

From inside your workbook folder:

.. code-block:: bash

   pystatsv1 workbook run intro_stats_03_distributions_outliers

Or directly:

.. code-block:: bash

   python scripts/intro_stats_03_distributions_outliers.py

What gets created
-----------------

Outputs go to:

* ``outputs/case_studies/intro_stats/``

You should see:

* ``score_distributions.png`` - per-group histogram + boxplot
* ``outliers_iqr.csv`` - rows flagged as outliers using the IQR rule

Inspect
-------

Step 1: read the picture first
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Open ``score_distributions.png`` and look for:

**Histograms**

* Is each group’s distribution roughly symmetric, or skewed?
* Is one group “wider” (more variable) than the other?

**Boxplots**

* Do the medians (middle lines) differ?
* Are the boxes (middle 50%) similar sizes?
* Do whiskers differ a lot?

**Extreme points**

* Any dots far from the rest of the data?
* If yes, are they in one group or both?
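
The features a boxplot draws can also be read off as numbers. The following is a minimal sketch, not the pack's own script: the scores are made up for illustration, and the ``inclusive`` quartile method is an assumption (the pack's script may compute quartiles differently).

.. code-block:: python

   from statistics import median, quantiles

   # Hypothetical scores, made up for illustration only.
   scores_by_group = {
       "control": [62, 65, 68, 70, 71, 73, 75, 90],
       "treatment": [70, 74, 76, 78, 80, 81, 83, 85],
   }


   def five_number_summary(values):
       """Return (min, Q1, median, Q3, max), the numbers behind a boxplot."""
       q1, _, q3 = quantiles(values, n=4, method="inclusive")
       return min(values), q1, median(values), q3, max(values)


   for group, values in scores_by_group.items():
       lo, q1, med, q3, hi = five_number_summary(values)
       # The box spans Q1 to Q3, so its height is the IQR.
       print(f"{group}: min={lo} Q1={q1} median={med} Q3={q3} max={hi} IQR={q3 - q1}")

Comparing the medians and IQRs across groups answers the first two boxplot questions numerically. Plotted whiskers usually stop at the most extreme point inside the IQR fences, so they may be shorter than the min-to-max range printed here.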

Step 2: read the outlier table
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Open ``outliers_iqr.csv``.

If it has no data rows, that is fine: the IQR rule flagged nothing.
If it has rows, note:

* which group(s) the outliers are in,
* whether the values look plausible,
* whether the outliers would change the story.
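
For a longer table, counting flagged rows per group can be quicker than reading by eye. This is a small sketch that assumes the outlier table has a ``group`` column like the input data; the actual columns of ``outliers_iqr.csv`` are not documented here, so check the header first.

.. code-block:: python

   import csv
   from collections import Counter
   from io import StringIO


   def count_outliers_by_group(csv_file):
       """Count flagged rows per group from an open CSV file object."""
       reader = csv.DictReader(csv_file)
       return Counter(row["group"] for row in reader)


   # In-memory stand-in for outliers_iqr.csv (hypothetical rows and columns).
   sample = StringIO("id,group,score\n17,control,12\n42,treatment,99\n58,treatment,3\n")
   counts = count_outliers_by_group(sample)
   print(dict(counts))

To use it on the real file, pass a file object opened with ``open(path, newline="")`` instead of the ``StringIO`` stand-in.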

Concepts (plain language)
-------------------------

Distribution
~~~~~~~~~~~~

A **distribution** is how values are spread out.

Three quick “shape words” you will see a lot:

* **symmetric**: left and right sides look similar
* **right-skewed**: a few unusually *high* values stretch the right tail
* **left-skewed**: a few unusually *low* values stretch the left tail
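
A rough numeric companion to these shape words is to compare the mean with the median: a long right tail tends to pull the mean above the median, and a long left tail tends to pull it below. The sketch below is only a heuristic (the function name ``skew_hint`` is made up for this example) and it can mislead on small or lumpy samples, so treat it as a prompt to look at the histogram, not a replacement for it.

.. code-block:: python

   from statistics import mean, median


   def skew_hint(values):
       """Rough direction-of-skew hint from comparing the mean with the median."""
       m, med = mean(values), median(values)
       if m > med:
           return "mean > median: possible right skew"
       if m < med:
           return "mean < median: possible left skew"
       return "mean == median: no skew signal"


   print(skew_hint([10, 11, 12, 12, 13, 14, 15, 30]))  # one unusually high value
   print(skew_hint([2, 20, 21, 22, 22, 23, 24]))       # one unusually low value
   print(skew_hint([1, 2, 3, 4, 5]))                   # evenly spread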

Outliers
~~~~~~~~

An **outlier** is a value far from most of the data.

Outliers can happen for many reasons:

* **typo** (e.g., 800 instead of 80)
* **unit mix-up** (e.g., percent vs points)
* **real extreme case** (someone truly unusual)

Flagging a value as an outlier does not mean deleting it automatically.
It means the value is worth a closer look.

The IQR rule
~~~~~~~~~~~~

A quick way to flag potential outliers is the **IQR rule**:

* :math:`IQR = Q3 - Q1` (the middle 50% spread)
* a value is flagged as a potential outlier if:

  * :math:`value < Q1 - 1.5\times IQR` or
  * :math:`value > Q3 + 1.5\times IQR`

This is transparent and easy to explain.
It is not perfect, but it is a good first pass.
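
The rule is short enough to write out by hand. The sketch below uses only the Python standard library; the function names are made up for this example, and the ``inclusive`` quartile method is an assumption, since different tools compute Q1 and Q3 slightly differently and the pack's script may not match this choice exactly.

.. code-block:: python

   from statistics import quantiles


   def iqr_fences(values, k=1.5):
       """Return (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
       q1, _, q3 = quantiles(values, n=4, method="inclusive")
       iqr = q3 - q1
       return q1 - k * iqr, q3 + k * iqr


   def flag_outliers(values, k=1.5):
       """Return the values that fall outside the fences."""
       lower, upper = iqr_fences(values, k)
       return [v for v in values if v < lower or v > upper]


   scores = [55, 60, 62, 64, 65, 66, 68, 70, 95]  # hypothetical scores
   print(iqr_fences(scores))     # (53.0, 77.0)
   print(flag_outliers(scores))  # [95]

Note that 55 is the lowest score but is not flagged, because it still sits inside the lower fence.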

Worked mini-examples
--------------------

Worked example A: what the IQR rule is doing
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Suppose a small set of scores:

.. code-block:: text

   10, 11, 12, 12, 13, 14, 15, 30

Most scores are around 10–15, but one score is 30.

* The **middle** of the data is near 12–14.
* The value 30 is far away, so it may be flagged as an outlier.

Key lesson: the IQR rule compares a value to the “middle 50%” range.
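
To make the comparison concrete, here is the arithmetic for these eight scores. It assumes quartiles are computed by linear interpolation between the sorted values, which is one common convention:

.. math::

   Q1 = 11.75, \quad Q3 = 14.25, \quad IQR = Q3 - Q1 = 2.5

   Q1 - 1.5\times IQR = 11.75 - 3.75 = 8.0

   Q3 + 1.5\times IQR = 14.25 + 3.75 = 18.0

Since 30 is greater than 18.0, it is flagged. Other quartile conventions shift the numbers slightly (taking the median of each half gives Q1 = 11.5 and Q3 = 14.5, so fences of 7 and 19), but 30 is flagged either way.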

Worked example B: outlier vs “valid extreme”
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Imagine two situations:

1) A score of 30 on a test that is out of 100.
2) A score of 300 on that same test.

Which one is more likely to be a typo?

Usually:

* 30 might be a valid low score.
* 300 is impossible and almost certainly a data-entry error.

So your first question should often be:
“Is this value even possible?”
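
That question can be turned into a simple range check. This sketch assumes a test scored from 0 to 100, as in the example above; the function name and the default limits are illustrative, so adjust them to your own scale.

.. code-block:: python

   def classify_score(score, min_possible=0, max_possible=100):
       """Label a score by whether it is possible on the test's scale."""
       if score < min_possible or score > max_possible:
           return "impossible: check for a data-entry error"
       return "possible: may be a valid extreme"


   for s in (30, 300):
       print(s, "->", classify_score(s))

A range check like this catches impossible values that the IQR rule would also flag, but it gives a clearer reason for the flag.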

Good practice (what to do if you find an outlier)
-------------------------------------------------

If you find an outlier, don’t delete it just because it is inconvenient.
Instead, ask:

1) **Is it a typo or data-entry error?**
   If yes, fix it (and document the fix).

2) **Is it a valid extreme case?**
   If yes, keep it, but acknowledge it.

3) **Does it change conclusions?**
   Try a simple sensitivity check:

   * run the analysis with the outlier included
   * run again with it excluded (only if exclusion is justified)
   * compare the results

In real reporting, you would write something like:

* “Results were similar with and without the extreme value” (robust)
* or “Results depended heavily on one extreme value” (fragile)
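
A sensitivity check can be as small as computing the same summary twice. The sketch below uses made-up scores in which one treatment value (30) is extreme, and compares the difference in group means with and without it. It illustrates the idea only; it is not the pack's analysis.

.. code-block:: python

   from statistics import mean


   def mean_difference(treatment, control):
       """Difference in group means: treatment minus control."""
       return mean(treatment) - mean(control)


   # Hypothetical scores; the 30 in the treatment group is the extreme value.
   control = [68, 70, 71, 69, 72]
   treatment = [78, 80, 79, 81, 30]

   with_outlier = mean_difference(treatment, control)
   without_outlier = mean_difference([x for x in treatment if x != 30], control)

   print(f"with outlier:    {with_outlier:.1f}")
   print(f"without outlier: {without_outlier:.1f}")

Here the difference changes sign when the single extreme value is removed, which is the "fragile" situation: the conclusion depends on one data point and should be reported that way.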

Using your own data (student workflow)
--------------------------------------

If you want to use this chapter’s script on your own dataset:

Your CSV needs columns:

* ``id`` (unique identifier)
* ``group`` (e.g., ``control`` / ``treatment``)
* ``score`` (numeric)
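
Before running the chapter on your own file, it can help to check these requirements up front. The following is a minimal sketch using the standard ``csv`` module; the function name and the messages are made up for this example, and the pack's own checks may differ.

.. code-block:: python

   import csv
   from io import StringIO

   REQUIRED_COLUMNS = {"id", "group", "score"}


   def check_scores_csv(csv_file):
       """Return a list of problems in an open CSV file object (empty = OK)."""
       reader = csv.DictReader(csv_file)
       missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
       if missing:
           return [f"missing columns: {sorted(missing)}"]
       problems = []
       seen_ids = set()
       for line_no, row in enumerate(reader, start=2):  # line 1 is the header
           if row["id"] in seen_ids:
               problems.append(f"line {line_no}: duplicate id {row['id']!r}")
           seen_ids.add(row["id"])
           try:
               float(row["score"])
           except (TypeError, ValueError):
               problems.append(f"line {line_no}: non-numeric score {row['score']!r}")
       return problems


   # In-memory example; for a real file use open(path, newline="").
   example = StringIO("id,group,score\n1,control,71\n1,treatment,abc\n")
   print(check_scores_csv(example))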

Option 1: Replace the rows in ``data/intro_stats_scores.csv``
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

This is the simplest way to reuse the exact same scripts.

1) Back up the original file:

.. code-block:: bash

   cp data/intro_stats_scores.csv data/intro_stats_scores_backup.csv

2) Edit the file in any text editor (on Windows, Notepad works fine):

.. code-block:: bash

   notepad data/intro_stats_scores.csv

3) Paste your own rows (keeping the same header):

.. code-block:: text

   id,group,score
   1,control,71
   2,control,69
   3,treatment,80
   4,treatment,78

4) Run the distributions/outliers script:

.. code-block:: bash

   pystatsv1 workbook run intro_stats_03_distributions_outliers

5) Inspect:

* ``outputs/case_studies/intro_stats/score_distributions.png``
* ``outputs/case_studies/intro_stats/outliers_iqr.csv``

When you are done, restore the original dataset:

.. code-block:: bash

   mv data/intro_stats_scores_backup.csv data/intro_stats_scores.csv

Then rerun to regenerate the canonical outputs:

.. code-block:: bash

   pystatsv1 workbook run intro_stats_03_distributions_outliers

Option 2: Use the general “My Own Data” scaffold
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

If you do not want to match the Intro Stats column names, use:

.. code-block:: bash

   pystatsv1 workbook run my_data_01_explore
   pystatsv1 workbook check my_data

This is a good first pass for finding missing values, unexpected column types, and odd entries.

Reproducibility checkpoint
--------------------------

Run the chapter script twice and confirm that both runs produce the same outputs:

.. code-block:: bash

   pystatsv1 workbook run intro_stats_03_distributions_outliers
   pystatsv1 workbook run intro_stats_03_distributions_outliers

If you changed the dataset for practice, remember:

* pack tests may fail until you restore the original data.
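
One way to confirm that two runs produced the same outputs is to compare file hashes rather than eyeballing the files. This is a sketch under assumptions: the helper names are made up, and byte-for-byte equality is a strict test. The CSV should match exactly, but a PNG can differ in its bytes (for example across plotting-library versions) while looking identical.

.. code-block:: python

   import hashlib
   from pathlib import Path


   def file_digest(path):
       """Return the SHA-256 hex digest of a file's bytes."""
       return hashlib.sha256(Path(path).read_bytes()).hexdigest()


   def snapshot(folder):
       """Map each file name in a folder to its digest."""
       return {
           p.name: file_digest(p)
           for p in sorted(Path(folder).iterdir())
           if p.is_file()
       }

Call ``snapshot("outputs/case_studies/intro_stats")`` after each run and compare the two dictionaries with ``==``; any file whose digest changed is worth opening.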

Next
----

Go to :doc:`intro_stats_04_confidence_intervals` to compute a 95% confidence interval for the mean difference.