Chapter 12 — Hypothesis Testing for Decisions

PyPI workbook run (Track D)

From inside your Track D workbook folder (created by pystatsv1 workbook init --track d --dest ...), run:

pystatsv1 workbook run |trackd_run|

Outputs are written under outputs/track_d/ by default. If you’re unsure what a file is for, start with Track D Outputs Guide.

To see the full chapter-by-chapter run map (D00–D23), see Track D chapter index (PyPI).

Optional: write to a custom output folder:

pystatsv1 workbook run |trackd_run| --outdir outputs/track_d_custom

Interpretation prompts (quick self-check):

What is the accounting or business measurement goal in this chapter?
Which invariant/check would catch a “numbers look fine but are wrong” mistake here?

Why this chapter exists

Most teams think they are making data-driven decisions, but often they are really reacting to random noise:

“Sales were higher this week — the promo worked!”
“Cycle time dropped — the new process is better!”

This chapter reframes hypothesis testing as business experimentation: a disciplined way to decide whether a change is likely real and whether it is worth acting on.

What “good” looks like in this book

Clean books → reliable dataset → defensible analysis → usable forecast → decision memo

Chapter 12 focuses on the decision memo step:

Did we measure the right thing?
Did we pre-commit to the metric and the test window?
Is the effect big enough to matter (not just “statistically significant”)?

Key accounting & operations terms (quick definitions)

Accounts Receivable (AR): Money customers owe you. In the NSO case, AR invoices and collections are recorded as events.
Accounts Payable (AP): Money you owe suppliers. In NSO, AP invoices and payments are recorded as events.
Invoice: A request for payment tied to a sale (AR) or a supplier bill (AP). In this project, invoices are identified by invoice_id.
Average Order Value (AOV): Average invoice amount (here, for AR invoices). Used as the “promotion” outcome metric.
Cycle Time (AP processing): Time between an AP invoice date and payment date. Here we approximate days-to-pay.
Control vs Treatment: Two groups: “business as usual” (control) vs “new thing” (treatment).
Null Hypothesis (H0): “No difference” (promo has no effect; new AP process saves no time).
Alternative Hypothesis (H1): “There is a difference” (promo changes AOV; process reduces cycle time).
P-value: Under H0 (“no effect”), how surprising is what we observed? Small p-values mean the result is unlikely if there truly were no effect.
Alpha (decision threshold): The rule you set before looking at results (commonly 0.05). In this repo we tie alpha to the confidence level.
Effect Size: “How big is the impact?” Example: dollars per invoice, or days saved per invoice.
Practical (business) significance: “Even if real, is it worth doing?” A statistically real effect can still be too small to matter.
P-hacking: Accidental or intentional “cheating” (testing many metrics, peeking early, changing rules after seeing results) until something looks significant.

What the software does (two deterministic teaching cases)

NSO v1 doesn’t contain explicit region/promo labels, so we create two deterministic teaching cases from NSO’s event tables:

Promotion test (A/B style) - Customers are deterministically assigned to treatment or control - We choose a promo month with many invoices (stable sample size) - We apply a small, configurable uplift to treatment invoices (teaching knob) - We estimate a p-value using a permutation test on mean invoice amount - We report effect size + bootstrap confidence interval (CI)
Cycle time test (before/after) - We compute AP days-to-pay using invoice + payment dates - We split invoices into before vs after around a changeover date - We apply a small, configurable cycle-time reduction to the after group (teaching knob) - We test (before − after) mean difference with the same permutation discipline

Running Chapter 12

From the repo root:

make business-ch12

Or directly:

python -m scripts.business_ch12_hypothesis_testing_decisions \
  --datadir data/synthetic/nso_v1 \
  --outdir outputs/track_d \
  --seed 123

Outputs (what to read first)

The script writes artifacts into outputs/track_d:

ch12_experiment_design.json Pre-commitment file (anti p-hacking): metric, alpha, and decision rule.
ch12_hypothesis_testing_summary.json Machine-readable summary: p-values, effect sizes, CIs.
ch12_experiment_memo.md Executive-style memo: “what happened, what it means, what to do.”
ch12_figures_manifest.csv + charts in outputs/track_d/figures/ Charts follow the repo’s style contract and include guardrails to avoid misleading visuals.

How to interpret the results (accountant-friendly)

Promotion test (AR invoices)

Ask two questions:

Is it likely real?
- p < α suggests the observed lift is unlikely under “no effect”.
Is it worth it?
- Compare the estimated lift (dollars per invoice and % lift) to the cost of the promo and to margin impact (e.g., a 2% revenue lift can be negative value if discounts overwhelm margin).

Cycle time test (AP days-to-pay)

Translate days saved into dollars:

Labor time saved (processing effort)
Early-pay discounts captured
Fewer late fees / fewer supplier escalations
Stronger cash planning (predictable AP timing)

Guardrails (what we do not allow)

Do not “peek” and stop the test early.
Do not change the metric after seeing results.
Do not test 20 metrics and celebrate the one that passes.
Always pair p-values with effect size + CI.

End-of-chapter problems (use our outputs)

Promotion test decision memo Using ch12_hypothesis_testing_summary.json and the memo template style: - State H0 and H1 in business language - Report p-value, effect size, and CI - Define a “minimum practical lift” and decide rollout / no-rollout
Process change evaluation Using the cycle-time results: - Convert days saved into approximate annual dollars - Decide whether benefits exceed implementation costs
P-hacking prevention List five ways teams accidentally p-hack, and write a pre-commitment rule set that would prevent each one.