Chapter 12 — Hypothesis Testing for Decisions 
=======================================================================

.. |trackd_run| replace:: d12
.. include:: _includes/track_d_run_strip.rst


Why this chapter exists
-----------------------
Most teams *think* they are making data-driven decisions, but often they are really
reacting to random noise:

- “Sales were higher this week — the promo worked!”
- “Cycle time dropped — the new process is better!”

This chapter reframes hypothesis testing as **business experimentation**: a disciplined
way to decide whether a change is likely *real* and whether it is **worth acting on**.

What “good” looks like in this book
-----------------------------------
Clean books → reliable dataset → defensible analysis → usable forecast → decision memo

Chapter 12 focuses on the **decision memo** step:

- Did we measure the right thing?
- Did we pre-commit to the metric and the test window?
- Is the effect big enough to matter (not just “statistically significant”)?

Key accounting & operations terms (quick definitions)
-----------------------------------------------------
Accounts Receivable (AR)
   Money customers owe you. In the NSO case, AR invoices and collections are recorded as events.

Accounts Payable (AP)
   Money you owe suppliers. In NSO, AP invoices and payments are recorded as events.

Invoice
   A request for payment tied to a sale (AR) or a supplier bill (AP). In this project, invoices are identified by ``invoice_id``.

Average Order Value (AOV)
   Average invoice amount (here, for AR invoices). Used as the “promotion” outcome metric.

Cycle Time (AP processing)
   Time between an AP invoice date and payment date. Here we approximate **days-to-pay**.

Control vs Treatment
   Two groups: “business as usual” (control) vs “new thing” (treatment).

Null Hypothesis (H0)
   “No difference” (promo has no effect; new AP process saves no time).

Alternative Hypothesis (H1)
   “There is a difference” (promo changes AOV; process reduces cycle time).

P-value
   Under H0 (“no effect”), how surprising is what we observed? Small p-values mean the result is
   unlikely *if there truly were no effect*.

Alpha (decision threshold)
   The rule you set **before** looking at results (commonly 0.05). In this repo we tie alpha to the confidence level.

Effect Size
   “How big is the impact?” Example: dollars per invoice, or days saved per invoice.

Practical (business) significance
   “Even if real, is it worth doing?” A statistically real effect can still be too small to matter.

P-hacking
   Accidental or intentional “cheating” (testing many metrics, peeking early, changing rules after seeing results)
   until something looks significant.

What the software does (two deterministic teaching cases)
---------------------------------------------------------
NSO v1 doesn’t contain explicit region/promo labels, so we create two **deterministic teaching cases**
from NSO’s event tables:

1) Promotion test (A/B style)
   - Customers are deterministically assigned to **treatment** or **control**
   - We choose a promo month with many invoices (stable sample size)
   - We apply a small, configurable uplift to treatment invoices (teaching knob)
   - We estimate a p-value using a **permutation test** on mean invoice amount
   - We report effect size + bootstrap confidence interval (CI)

2) Cycle time test (before/after)
   - We compute AP days-to-pay using invoice + payment dates
   - We split invoices into **before** vs **after** around a changeover date
   - We apply a small, configurable cycle-time reduction to the after group (teaching knob)
   - We test (before − after) mean difference with the same permutation discipline

Running Chapter 12
------------------
From the repo root:

.. code-block:: bash

   make business-ch12

Or directly:

.. code-block:: bash

   python -m scripts.business_ch12_hypothesis_testing_decisions \
     --datadir data/synthetic/nso_v1 \
     --outdir outputs/track_d \
     --seed 123

Outputs (what to read first)
----------------------------
The script writes artifacts into ``outputs/track_d``:

- ``ch12_experiment_design.json``
  Pre-commitment file (anti p-hacking): metric, alpha, and decision rule.

- ``ch12_hypothesis_testing_summary.json``
  Machine-readable summary: p-values, effect sizes, CIs.

- ``ch12_experiment_memo.md``
  Executive-style memo: “what happened, what it means, what to do.”

- ``ch12_figures_manifest.csv`` + charts in ``outputs/track_d/figures/``
  Charts follow the repo’s style contract and include guardrails to avoid misleading visuals.

How to interpret the results (accountant-friendly)
--------------------------------------------------
Promotion test (AR invoices)

Ask two questions:

1. Is it likely real?

   - *p* < α suggests the observed lift is unlikely under "no effect".

2. Is it worth it?

   - Compare the estimated lift (dollars per invoice and % lift) to the cost
     of the promo and to margin impact (e.g., a 2% revenue lift can be negative
     value if discounts overwhelm margin).

Cycle time test (AP days-to-pay)

Translate days saved into dollars:

- Labor time saved (processing effort)
- Early-pay discounts captured
- Fewer late fees / fewer supplier escalations
- Stronger cash planning (predictable AP timing)

Guardrails (what we do *not* allow)
-----------------------------------
- Do not “peek” and stop the test early.
- Do not change the metric after seeing results.
- Do not test 20 metrics and celebrate the one that passes.
- Always pair p-values with effect size + CI.

End-of-chapter problems (use our outputs)
-----------------------------------------
1) Promotion test decision memo
   Using ``ch12_hypothesis_testing_summary.json`` and the memo template style:
   - State H0 and H1 in business language
   - Report p-value, effect size, and CI
   - Define a “minimum practical lift” and decide rollout / no-rollout

2) Process change evaluation
   Using the cycle-time results:
   - Convert days saved into approximate annual dollars
   - Decide whether benefits exceed implementation costs

3) P-hacking prevention
   List five ways teams accidentally p-hack, and write a pre-commitment rule set that would prevent each one.