Track D — Chapter 14
PyPI workbook run (Track D)
From inside your Track D workbook folder (created by pystatsv1 workbook init --track d --dest ...), run:
pystatsv1 workbook run |trackd_run|
Outputs are written under outputs/track_d/ by default.
If you’re unsure what a file is for, start with Track D Outputs Guide.
To see the full chapter-by-chapter run map (D00–D23), see Track D chapter index (PyPI).
Optional: write to a custom output folder:
pystatsv1 workbook run |trackd_run| --outdir outputs/track_d_custom
Interpretation prompts (quick self-check):
What is the accounting or business measurement goal in this chapter?
Which invariant/check would catch a “numbers look fine but are wrong” mistake here?
Regression Driver Analysis (NSO running case)
This chapter turns “operational activity” into an explainable planning model.
In accounting work you often start from outcomes:
Revenue was up (or down).
COGS moved.
Gross margin shifted.
That outcome view is essential, but it’s not always sufficient for planning and control. When leadership asks:
“What happens to COGS if we sell 15% more units next month?”
“Is margin pressure mostly price, volume, or some fixed baseline cost?”
“Are we seeing higher revenue because we sold more… or because invoices changed (mix/activity)?”
…you need a driver lens.
Regression driver analysis is a practical way to estimate simple relationships like:
a baseline level (intercept): “what tends to happen even if activity is low”
a rate (slope): “how much outcome changes per unit of activity”
You’ll build this lens for the North Shore Outfitters (NSO) running case using monthly data.
Where this fits in Track D
This chapter is intentionally downstream of earlier Track D ideas:
Chapter 12 (Hypothesis Testing): disciplined interpretation; avoid overconfidence from noisy data.
Chapter 13 (Correlation, Causation, Controlled Comparisons): correlation is not causation and “third factors” often drive two lines together.
Regression is powerful, but it can also create false confidence if used carelessly. So we carry forward the Chapter 13 discipline here:
Use regression as a driver lens
Prefer simple, explainable models
Check residuals and sanity-check assumptions
Treat results as planning inputs, not proof
What you will build
You will produce two things:
A monthly driver table that lines up activity measures with financial outcomes.
Three small regression models:
m1:
COGS ~ units_sold(fixed + variable cost per unit lens)m2:
Revenue ~ units_sold(baseline + average price-per-unit lens)m3:
Revenue ~ units_sold + invoice_count(two-driver “mix/activity check”)
The goal is not “best possible prediction.” The goal is simple, auditable explanations that help accounting planning + control.
The driver table (what it is and where it comes from)
The Chapter 14 script builds a monthly table with these columns:
Column |
Meaning |
Source in NSO outputs |
|---|---|---|
|
Month key (YYYY-MM) |
Derived from date fields in each input file |
|
Month as a real date (YYYY-MM-01) for sorting/plotting |
Derived |
|
Units sold in that month (positive count) |
|
|
Count of invoices issued in that month |
|
|
Monthly sales revenue |
|
|
Monthly cost of goods sold |
|
Why these drivers?
Accounting planning usually needs a bridge between operations and financial outcomes.
Units sold is a natural driver for both revenue and COGS in many product businesses.
Invoice count is not “better than units,” but it can be a useful activity proxy (and a warning sign for mix changes, bundling, partial shipments, pricing patterns, etc.).
This chapter keeps the driver set intentionally small. In real work you might add:
labor hours, headcount, shipments, store traffic, returns
discount rate, product mix %, channel mix %, seasonality indicators
capacity constraints (overtime, stockouts)
Regression models (m1, m2, m3)
All three models use ordinary least squares (OLS). Conceptually:
y = intercept + slope * x + residualIntercept is the baseline component.
Slope is the rate per unit of driver.
Residuals are what the driver did not explain.
The script uses statsmodels and includes an intercept term (a constant).
Model 1: COGS as a function of units sold
m1: COGS ~ units_sold
Interpretation in planning terms:
Intercept (baseline COGS): costs that tend to appear even at low activity (minimum staffing, spoilage, fixed handling, etc.). In some businesses baseline COGS should be near zero; in others it may not be.
Slope (variable cost per unit): the estimated cost-per-unit implied by the data (a “rate”).
How you use it:
Build a cost forecast from a unit forecast.
Explain COGS variance as “rate vs baseline vs unexplained residual.”
Model 2: Revenue as a function of units sold
m2: Revenue ~ units_sold
Interpretation:
Intercept (baseline revenue): in many settings this should be near zero. If it is not, that’s a signal to investigate: timing, returns, non-unit revenue streams, seasonality, or a model mismatch.
Slope (average price per unit lens): an implied average revenue per unit. This is not a SKU-level price; it’s a blended rate across product mix for the period.
How you use it:
Translate a unit plan into a revenue plan.
Detect periods where price/mix changes break the “stable rate” assumption.
Model 3: Revenue as a function of units sold + invoice count
m3: Revenue ~ units_sold + invoice_count
This is a simple two-driver extension:
If invoice count adds meaningful explanatory power beyond units sold, you may be seeing changes in ordering patterns or mix (e.g., more small invoices, more split shipments, different channel behavior).
If invoice count does not add anything (small coefficient, noisy, low incremental fit), that’s also useful: units alone may be sufficient for the current planning lens.
This is not “the final truth.” It’s a lightweight check that encourages better questions.
Interpreting outputs like an accountant
Intercepts and slopes: baseline vs rate
A helpful mental model is:
Intercept = baseline component (what happens at “zero-ish” activity)
Slope = marginal rate (how much outcome changes per unit of activity)
This matches common accounting narratives:
“There is a fixed component plus a variable component.”
“Most of the change is volume-driven; rate is stable.”
“The rate is drifting — pricing/mix or cost structure is changing.”
R²: “how much of the story is volume”
R² is the fraction of variance explained by the model in-sample.
High R² can mean the driver captures the main movement (often volume/seasonality).
Low R² can mean the driver is incomplete or the process changed.
Accounting interpretation:
R² is not “goodness” by itself; it’s a signpost.
Even a moderate R² can be useful if the slope is stable and interpretable.
Residuals: what the drivers didn’t explain
Residuals are where accounting insight often lives:
promotions and discounting
unusual returns
supply shocks and stockouts
one-time events or timing effects
A healthy workflow is:
Fit the simple model.
Inspect residual patterns.
Decide whether you need more drivers or segmentation.
Important
If pricing or product mix changes materially, re-fit the model and re-check residuals. Stable slopes are an assumption, not a guarantee.
How to run the chapter
Prerequisite: NSO dataset
Chapter 14 expects the NSO v1 synthetic dataset to exist (it is generated by the Track D simulator). If you already ran earlier Track D chapters locally, you likely have it.
To (re)generate NSO v1:
make business-nso-sim
This writes the NSO v1 dataset under:
data/synthetic/nso_v1/
Run Chapter 14
Run the Chapter 14 target:
make business-ch14
Note on output location:
The Makefile passes
--outdir outputs/track_d.The Chapter 14 script writes inside a nested folder and prints the final path.
So you should expect artifacts under something like:
outputs/track_d/track_d/
Outputs and how to inspect them
The chapter writes a small “analysis pack” you can use for inspection, debugging, or reporting.
Core artifacts
ch14_driver_table.csvThe monthly driver table (month, units_sold, invoice_count, revenue, cogs). Start here—open it and sanity-check the numbers.ch14_regression_design.jsonA lightweight “design contract” describing expected inputs, driver definitions, and model formulas. Use this for reproducibility and review.ch14_regression_summary.jsonMachine-readable regression results (parameters, R², and a small forecast example). Useful for downstream automation (dashboards, reports, tests).ch14_regression_memo.mdA short human-readable memo summarizing the key results. This is intentionally brief: it’s a starter “executive narrative.”
Figures
ch14_figures_manifest.csvA manifest listing each figure saved by the script (path + chart metadata).figures/PNG charts (scatter + fit line, plus a residual-style view for the multi-driver check).
A quick inspection workflow:
Open
ch14_driver_table.csvand verify the month alignment.Read
ch14_regression_memo.mdto see the story.Open the figures to check whether relationships look linear and whether outliers dominate.
Use
ch14_regression_summary.jsonif you want to extract slopes/intercepts programmatically.
Guardrails (read this before you sell the story)
Regression is a driver lens, not a causality machine.
Common failure modes in accounting settings:
Seasonality / timing drives both the driver and the outcome (spurious fit).
Mix shifts break slope stability (price per unit or cost per unit changes).
Capacity constraints create nonlinearity (overtime, stockouts, shipping changes).
Definition drift (what counts as an invoice, what counts as a unit) changes over time.
Use regression to support planning conversations like:
“Given our recent pattern, the implied cost-per-unit is X.”
“If we hit Y units next month, the model implies COGS around Z, plus/minus residual noise.”
“This month looks unusual relative to the driver story; let’s investigate why.”
Troubleshooting
Missing input files (FileNotFoundError)
If you see an error like “Expected inventory_movements.csv … but not found”:
Confirm you generated NSO v1:
make business-nso-simConfirm the files exist under
data/synthetic/nso_v1/:inventory_movements.csv,ar_events.csv,statements_is_monthly.csv.
Wrong output directory / permission issues
If artifacts don’t appear under outputs/track_d/track_d/:
Ensure your working tree is the repo root and you’re running targets from there.
Confirm you have permission to create the
outputs/directory on your machine.If needed, clear old outputs and re-run:
make clean make business-ch14
Plots not rendering / backend issues
This chapter saves plots to PNG files; it does not require an interactive GUI. If you hit a matplotlib backend error on Windows, ensure you’re using the project’s recommended environment (venv) and that your matplotlib install is healthy.
What’s next (Chapter 15+)
Chapter 14 gives you the core regression driver workflow:
build a driver table
fit simple explainable models
produce artifacts that are reviewable (CSV/JSON/MD/figures)
The natural next steps for Chapter 15+ are to expand from “single driver lens” to “planning-grade forecasting,” for example:
add seasonality (month indicators) and compare fit/residuals
segment by product line or channel (separate slopes for different mixes)
introduce richer drivers (labor hours, shipments, discounts)
build a repeatable rolling forecast workflow and variance decomposition narrative
If Chapter 14 is your first regression chapter, you’re already in the right place: the goal is not sophistication—it’s usable, auditable models that improve decisions.
Appendix 14B: NSO v1 data dictionary cheat sheet
For a compact “what table is what” reference (grain, keys, joins, checks), see:
Appendix 14C: Chapter 14 artifact dictionary
For a compact reference that explains every Chapter 14 output artifact (what it is, what it’s for, and what to look at first), see:
Appendix 14D: Artifact QA checklist (what to verify before sharing results)
Before you share the Chapter 14 memo or coefficients, use the “big picture” QA checklist:
Appendix 14E: applying this chapter to your own data
To adapt the Chapter 14 workflow (driver table + explainable regression + artifacts) to your own real-world business data, see: