Track D — Chapter 14

PyPI workbook run (Track D)

From inside your Track D workbook folder (created by pystatsv1 workbook init --track d --dest ...), run:

pystatsv1 workbook run |trackd_run|

Outputs are written under outputs/track_d/ by default. If you’re unsure what a file is for, start with the Track D Outputs Guide.

For the full chapter-by-chapter run map (D00–D23), see the Track D chapter index (PyPI).

Optional: write to a custom output folder:

pystatsv1 workbook run |trackd_run| --outdir outputs/track_d_custom

Interpretation prompts (quick self-check):

What is the accounting or business measurement goal in this chapter?
Which invariant/check would catch a “numbers look fine but are wrong” mistake here?

Regression Driver Analysis (NSO running case)

This chapter turns “operational activity” into an explainable planning model.

In accounting work you often start from outcomes:

Revenue was up (or down).
COGS moved.
Gross margin shifted.

That outcome view is essential, but it’s not always sufficient for planning and control. When leadership asks:

“What happens to COGS if we sell 15% more units next month?”
“Is margin pressure mostly price, volume, or some fixed baseline cost?”
“Are we seeing higher revenue because we sold more… or because invoices changed (mix/activity)?”

…you need a driver lens.

Regression driver analysis is a practical way to estimate simple relationships like:

a baseline level (intercept): “what tends to happen even if activity is low”
a rate (slope): “how much outcome changes per unit of activity”

You’ll build this lens for the North Shore Outfitters (NSO) running case using monthly data.

Where this fits in Track D

This chapter is intentionally downstream of earlier Track D ideas:

Chapter 12 (Hypothesis Testing): disciplined interpretation; avoid overconfidence from noisy data.
Chapter 13 (Correlation, Causation, Controlled Comparisons): correlation is not causation and “third factors” often drive two lines together.

Regression is powerful, but it can also create false confidence if used carelessly. So we carry forward the Chapter 13 discipline here:

Use regression as a driver lens
Prefer simple, explainable models
Check residuals and sanity-check assumptions
Treat results as planning inputs, not proof

What you will build

You will produce two things:

A monthly driver table that lines up activity measures with financial outcomes.
Three small regression models:

m1: COGS ~ units_sold (fixed + variable cost per unit lens)
m2: Revenue ~ units_sold (baseline + average price-per-unit lens)
m3: Revenue ~ units_sold + invoice_count (two-driver “mix/activity check”)

The goal is not “best possible prediction.” The goal is simple, auditable explanations that support accounting planning and control.

The driver table (what it is and where it comes from)

The Chapter 14 script builds a monthly table with these columns:

Chapter 14 driver table fields
| Column | Meaning | Source in NSO outputs |
| --- | --- | --- |
| `month` | Month key (YYYY-MM) | Derived from date fields in each input file |
| `month_dt` | Month as a real date (YYYY-MM-01) for sorting/plotting | Derived |
| `units_sold` | Units sold in that month (positive count) | `inventory_movements.csv` where `movement_type == sale_issue`. (In the simulator, sale issues are negative inventory movements, so the script flips the sign.) |
| `invoice_count` | Count of invoices issued in that month | `ar_events.csv` where `event_type == invoice` (grouped and counted by month) |
| `sales_revenue` | Monthly sales revenue | `statements_is_monthly.csv` where `line == Sales Revenue` |
| `cogs` | Monthly cost of goods sold | `statements_is_monthly.csv` where `line == Cost of Goods Sold` |
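
To make the table definitions concrete, here is a minimal pandas sketch that assembles the same columns from toy in-memory data. It is not the Chapter 14 script. The raw column names (`date`, `qty`, `month`, `amount`) and the non-sale values (`purchase_receipt`, `payment`) are assumptions for illustration; only `movement_type`, `event_type`, `line`, and their filter values come from the table above.

```python
import pandas as pd

# Toy stand-ins for the three NSO inputs. Raw column names such as
# "date", "qty", "month", and "amount" are ASSUMPTIONS, not the real schema.
inventory = pd.DataFrame({
    "date": ["2025-01-05", "2025-01-20", "2025-02-03", "2025-02-10"],
    "movement_type": ["sale_issue", "purchase_receipt", "sale_issue", "sale_issue"],
    "qty": [-40, 100, -25, -35],
})
ar_events = pd.DataFrame({
    "date": ["2025-01-06", "2025-01-21", "2025-02-04"],
    "event_type": ["invoice", "payment", "invoice"],
})
income_stmt = pd.DataFrame({
    "month": ["2025-01", "2025-01", "2025-02", "2025-02"],
    "line": ["Sales Revenue", "Cost of Goods Sold"] * 2,
    "amount": [4000.0, 2400.0, 6000.0, 3600.0],
})


def month_key(dates: pd.Series) -> pd.Series:
    """Convert date strings to a YYYY-MM month key."""
    return pd.to_datetime(dates).dt.strftime("%Y-%m")


# units_sold: sale issues are negative movements, so flip the sign.
sales = inventory.loc[inventory["movement_type"] == "sale_issue"].copy()
sales["month"] = month_key(sales["date"])
units_sold = (-sales.groupby("month")["qty"].sum()).rename("units_sold")

# invoice_count: count invoice events per month.
invoices = ar_events.loc[ar_events["event_type"] == "invoice"].copy()
invoices["month"] = month_key(invoices["date"])
invoice_count = invoices.groupby("month").size().rename("invoice_count")

# sales_revenue and cogs: pick the two income statement lines.
pivot = income_stmt.pivot(index="month", columns="line", values="amount")
financials = pivot.rename(
    columns={"Sales Revenue": "sales_revenue", "Cost of Goods Sold": "cogs"}
)[["sales_revenue", "cogs"]]

# Align everything on the month key, then add a real date for sorting.
driver_table = (
    pd.concat([units_sold, invoice_count, financials], axis=1)
    .rename_axis("month")
    .reset_index()
)
driver_table["month_dt"] = pd.to_datetime(driver_table["month"] + "-01")
driver_table = driver_table.sort_values("month_dt").reset_index(drop=True)

print(driver_table)
```

Aligning on a single month key is the step most worth reviewing: a month that is missing from one input shows up as a blank cell rather than silently shifting rows.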

Why these drivers?

Accounting planning usually needs a bridge between operations and financial outcomes.

Units sold is a natural driver for both revenue and COGS in many product businesses.
Invoice count is not “better than units,” but it can be a useful activity proxy (and a warning sign for mix changes, bundling, partial shipments, pricing patterns, etc.).

This chapter keeps the driver set intentionally small. In real work you might add:

labor hours, headcount, shipments, store traffic, returns
discount rate, product mix %, channel mix %, seasonality indicators
capacity constraints (overtime, stockouts)

Regression models (m1, m2, m3)

All three models use ordinary least squares (OLS). Conceptually:

y = intercept + slope * x + residual
Intercept is the baseline component.
Slope is the rate per unit of driver.
Residuals are what the driver did not explain.

The script uses statsmodels and includes an intercept term (a constant).
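
The arithmetic behind a single-driver fit can be sketched in a few lines. This example swaps in NumPy's least-squares solver instead of statsmodels, purely to show what "intercept + slope * x + residual" means in practice. The data values are hypothetical and are not NSO output.

```python
import numpy as np

# Hypothetical monthly data, for illustration only (not NSO output).
units_sold = np.array([100.0, 120.0, 90.0, 150.0, 130.0, 110.0])
cogs = np.array([2600.0, 3010.0, 2380.0, 3650.0, 3190.0, 2820.0])

# Design matrix: a column of ones (the intercept term) plus the driver.
X = np.column_stack([np.ones_like(units_sold), units_sold])

# Ordinary least squares: coefficients that minimize the sum of squared residuals.
coef, *_ = np.linalg.lstsq(X, cogs, rcond=None)
intercept, slope = coef

fitted = X @ coef
residuals = cogs - fitted  # what the driver did not explain

print(f"intercept (baseline): {intercept:.2f}")
print(f"slope (rate per unit): {slope:.2f}")
print("residuals:", np.round(residuals, 2))
```

Because the model includes an intercept, the residuals sum to zero (up to rounding). That is a quick check that a constant term was actually included.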

Model 1: COGS as a function of units sold

m1: COGS ~ units_sold

Interpretation in planning terms:

Intercept (baseline COGS): costs that tend to appear even at low activity (minimum staffing, spoilage, fixed handling, etc.). In some businesses baseline COGS should be near zero; in others it may not be.
Slope (variable cost per unit): the estimated cost-per-unit implied by the data (a “rate”).

How you use it:

Build a cost forecast from a unit forecast.
Explain COGS variance as “rate vs baseline vs unexplained residual.”
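
As a worked example of turning a unit forecast into a cost forecast, the sketch below answers the "15% more units" question with made-up coefficients. `INTERCEPT` and `SLOPE` are hypothetical placeholders; real values would come from the fitted m1 model.

```python
# Hypothetical m1 coefficients, for illustration only. Real values would come
# from the fitted model, not from this sketch.
INTERCEPT = 500.0  # baseline COGS per month
SLOPE = 21.0       # variable cost per unit


def forecast_cogs(units: float) -> float:
    """Point forecast implied by COGS = intercept + slope * units."""
    return INTERCEPT + SLOPE * units


current_units = 100.0
planned_units = current_units * 1.15  # "15% more units next month"

current = forecast_cogs(current_units)
planned = forecast_cogs(planned_units)
change = planned - current
pct_change = change / current

print(f"current COGS forecast: {current:,.2f}")
print(f"planned COGS forecast: {planned:,.2f}")
print(f"change: {change:,.2f} ({pct_change:.1%})")
```

With a positive intercept, a 15% increase in units implies a smaller percentage increase in COGS, because only the variable part scales with volume. That is the "rate vs baseline" split in numeric form.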

Model 2: Revenue as a function of units sold

m2: Revenue ~ units_sold

Interpretation:

Intercept (baseline revenue): in many settings this should be near zero. If it is not, that’s a signal to investigate: timing, returns, non-unit revenue streams, seasonality, or a model mismatch.
Slope (average price per unit lens): an implied average revenue per unit. This is not a SKU-level price; it’s a blended rate across product mix for the period.

How you use it:

Translate a unit plan into a revenue plan.
Detect periods where price/mix changes break the “stable rate” assumption.

Model 3: Revenue as a function of units sold + invoice count

m3: Revenue ~ units_sold + invoice_count

This is a simple two-driver extension:

If invoice count adds meaningful explanatory power beyond units sold, you may be seeing changes in ordering patterns or mix (e.g., more small invoices, more split shipments, different channel behavior).
If invoice count does not add anything (small coefficient, noisy, low incremental fit), that’s also useful: units alone may be sufficient for the current planning lens.

This is not “the final truth.” It’s a lightweight check that encourages better questions.
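
One way to quantify "adds meaningful explanatory power" is to compare in-sample fit with and without the second driver. The sketch below does this with NumPy on hypothetical data. The helper `fit_ols` is illustrative and is not part of the Chapter 14 script.

```python
import numpy as np


def fit_ols(y, *drivers):
    """Fit y on an intercept plus the given drivers; return (coef, r_squared)."""
    X = np.column_stack([np.ones(len(y)), *drivers])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    ss_res = float(resid @ resid)                 # unexplained variation
    ss_tot = float(((y - y.mean()) ** 2).sum())   # total variation around the mean
    return coef, 1.0 - ss_res / ss_tot


# Hypothetical monthly data, for illustration only (not NSO output).
units_sold = np.array([100.0, 120.0, 90.0, 150.0, 130.0, 110.0, 140.0, 95.0])
invoice_count = np.array([20.0, 26.0, 17.0, 28.0, 30.0, 21.0, 25.0, 22.0])
revenue = np.array([4100.0, 4950.0, 3650.0, 6050.0, 5400.0, 4480.0, 5650.0, 3980.0])

coef_m2, r2_m2 = fit_ols(revenue, units_sold)                 # m2: one driver
coef_m3, r2_m3 = fit_ols(revenue, units_sold, invoice_count)  # m3: two drivers

print(f"m2 R^2: {r2_m2:.4f}")
print(f"m3 R^2: {r2_m3:.4f}")
print(f"incremental R^2 from invoice_count: {r2_m3 - r2_m2:.4f}")
print(f"m3 invoice_count coefficient: {coef_m3[2]:.2f}")
```

In-sample R² never decreases when a driver is added, so the question is whether the gain is large enough to matter for planning, not whether it is positive.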

Interpreting outputs like an accountant

Intercepts and slopes: baseline vs rate

A helpful mental model is:

Intercept = baseline component (what happens at “zero-ish” activity)
Slope = marginal rate (how much outcome changes per unit of activity)

This matches common accounting narratives:

“There is a fixed component plus a variable component.”
“Most of the change is volume-driven; rate is stable.”
“The rate is drifting — pricing/mix or cost structure is changing.”

R²: “how much of the story is volume”

R² is the fraction of the outcome’s in-sample variance that the model explains.

High R² can mean the driver captures the main movement (often volume/seasonality).
Low R² can mean the driver is incomplete or the process changed.

Accounting interpretation:

R² is not “goodness” by itself; it’s a signpost.
Even a moderate R² can be useful if the slope is stable and interpretable.

Residuals: what the drivers didn’t explain

Residuals are where accounting insight often lives:

promotions and discounting
unusual returns
supply shocks and stockouts
one-time events or timing effects

A healthy workflow is:

Fit the simple model.
Inspect residual patterns.
Decide whether you need more drivers or segmentation.
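
The "inspect residual patterns" step can start with something as simple as ranking months by how far they sit from the fitted line. The sketch below uses hypothetical data with one planted unusual month. The threshold of 2.0 is a rule-of-thumb assumption, not a value taken from the chapter script.

```python
import numpy as np

# Hypothetical data, for illustration only; 2025-05 is a planted unusual month.
months = ["2025-01", "2025-02", "2025-03", "2025-04",
          "2025-05", "2025-06", "2025-07", "2025-08"]
units_sold = np.array([100.0, 120.0, 90.0, 150.0, 130.0, 110.0, 140.0, 95.0])
cogs = np.array([2610.0, 3005.0, 2395.0, 3640.0, 3630.0, 2822.0, 3432.0, 2501.0])

# Fit the simple one-driver line (degree-1 polyfit returns [slope, intercept]).
slope, intercept = np.polyfit(units_sold, cogs, 1)
residuals = cogs - (intercept + slope * units_sold)

# Scale by the residual standard deviation (ddof=2: two fitted parameters).
resid_sd = residuals.std(ddof=2)
scaled = residuals / resid_sd

THRESHOLD = 2.0  # assumption: a rough rule of thumb, tune for your data
flagged = [m for m, s in zip(months, scaled) if abs(s) > THRESHOLD]

# Rank months from most to least unusual relative to the driver story.
ranking = sorted(zip(months, residuals), key=lambda pair: abs(pair[1]), reverse=True)
for month, resid in ranking[:3]:
    print(f"{month}: residual {resid:+.1f}")
print("flagged:", flagged)
```

With only a handful of months, scaled residuals cannot get very large, so the ranking is often more informative than any fixed cutoff.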

Important

If pricing or product mix changes materially, re-fit the model and re-check residuals. Stable slopes are an assumption, not a guarantee.
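
A lightweight way to test the stable-slope assumption is to fit the same model on an earlier and a later window and compare the rates. This is a sketch on hypothetical data, not something the chapter script is known to do.

```python
import numpy as np

# Hypothetical 12 months, ordered oldest to newest (illustration only).
units_sold = np.array([100, 120, 90, 150, 130, 110,
                       140, 95, 125, 160, 105, 135], dtype=float)
revenue = np.array([4100, 4950, 3650, 6050, 5400, 4480,
                    6300, 4300, 5600, 7150, 4750, 6050], dtype=float)


def slope_and_intercept(x, y):
    """Least-squares line via np.polyfit (degree 1 returns [slope, intercept])."""
    slope, intercept = np.polyfit(x, y, 1)
    return float(slope), float(intercept)


half = len(units_sold) // 2
early_slope, early_intercept = slope_and_intercept(units_sold[:half], revenue[:half])
late_slope, late_intercept = slope_and_intercept(units_sold[half:], revenue[half:])
drift = late_slope - early_slope

print(f"early window slope: {early_slope:.2f}")
print(f"late window slope:  {late_slope:.2f}")
print(f"drift in rate:      {drift:+.2f}")
```

Each window here has only six points, so small differences are expected from noise alone. A large or persistent drift is the signal to re-fit and revisit pricing or mix.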

How to run the chapter

Prerequisite: NSO dataset

Chapter 14 expects the NSO v1 synthetic dataset to exist (it is generated by the Track D simulator). If you already ran earlier Track D chapters locally, you likely have it.

To (re)generate NSO v1:

make business-nso-sim

This writes the NSO v1 dataset under:

data/synthetic/nso_v1/

Run Chapter 14

Run the Chapter 14 target:

make business-ch14

Note on output location:

The Makefile passes --outdir outputs/track_d.
The Chapter 14 script writes inside a nested folder and prints the final path.

So you should expect artifacts under something like:

outputs/track_d/track_d/

Outputs and how to inspect them

The chapter writes a small “analysis pack” you can use for inspection, debugging, or reporting.

Core artifacts

ch14_driver_table.csv The monthly driver table (month, units_sold, invoice_count, sales_revenue, cogs). Start here: open it and sanity-check the numbers.
ch14_regression_design.json A lightweight “design contract” describing expected inputs, driver definitions, and model formulas. Use this for reproducibility and review.
ch14_regression_summary.json Machine-readable regression results (parameters, R², and a small forecast example). Useful for downstream automation (dashboards, reports, tests).
ch14_regression_memo.md A short human-readable memo summarizing the key results. This is intentionally brief: it’s a starter “executive narrative.”
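
Sanity-checking the driver table can be partly automated. The sketch below runs three basic checks (duplicate months, gaps between months, non-positive units) on a hypothetical CSV excerpt. The sample rows are made up, and the real file may contain additional columns.

```python
import csv
import io

# Hypothetical excerpt standing in for ch14_driver_table.csv.
# Note the deliberate gap: 2025-03 is missing.
SAMPLE = """month,units_sold,invoice_count,sales_revenue,cogs
2025-01,40,12,4000.0,2400.0
2025-02,60,15,6000.0,3600.0
2025-04,55,14,5500.0,3300.0
"""


def month_index(month: str) -> int:
    """Map 'YYYY-MM' to a running month number so gaps are easy to detect."""
    year, mon = month.split("-")
    return int(year) * 12 + int(mon) - 1


def check_driver_rows(rows):
    """Return a list of human-readable problems found in the driver table rows."""
    problems = []
    months = [row["month"] for row in rows]
    if len(set(months)) != len(months):
        problems.append("duplicate month keys")
    ordered = sorted(set(months), key=month_index)
    for prev, cur in zip(ordered, ordered[1:]):
        if month_index(cur) - month_index(prev) != 1:
            problems.append(f"gap between {prev} and {cur}")
    for row in rows:
        if float(row["units_sold"]) <= 0:
            problems.append(f"non-positive units_sold in {row['month']}")
    return problems


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
problems = check_driver_rows(rows)
print(problems if problems else "driver table looks consistent")
```

To run the same checks on a real file, replace `io.StringIO(SAMPLE)` with an open file handle for the driver table CSV.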

Figures

ch14_figures_manifest.csv A manifest listing each figure saved by the script (path + chart metadata).
figures/ PNG charts (scatter + fit line, plus a residual-style view for the multi-driver check).

A quick inspection workflow:

Open ch14_driver_table.csv and verify the month alignment.
Read ch14_regression_memo.md to see the story.
Open the figures to check whether relationships look linear and whether outliers dominate.
Use ch14_regression_summary.json if you want to extract slopes/intercepts programmatically.
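
For the programmatic route, the exact layout of ch14_regression_summary.json is not documented here, so the sketch below avoids assuming one. It walks any nested JSON structure and lists every numeric value with its path. The sample payload and its key names are hypothetical.

```python
import json


def numeric_leaves(node, prefix=""):
    """Yield (path, value) for every numeric leaf in a nested JSON structure."""
    if isinstance(node, dict):
        for key, value in node.items():
            yield from numeric_leaves(value, f"{prefix}.{key}" if prefix else str(key))
    elif isinstance(node, list):
        for i, value in enumerate(node):
            yield from numeric_leaves(value, f"{prefix}[{i}]")
    elif isinstance(node, (int, float)) and not isinstance(node, bool):
        yield prefix, node


# Hypothetical payload: the real summary file's keys and nesting may differ.
sample = json.loads(
    '{"models": {"m1": {"params": {"const": 500.0, "units_sold": 21.0}, "r2": 0.97}}}'
)

found = dict(numeric_leaves(sample))
for path, value in found.items():
    print(f"{path} = {value}")
```

Running this against the real file (loaded with `json.load`) shows which paths hold the slopes, intercepts, and R² values before you hard-code any key names downstream.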

Guardrails (read this before you sell the story)

Regression is a driver lens, not a causality machine.

Common failure modes in accounting settings:

Seasonality / timing drives both the driver and the outcome (spurious fit).
Mix shifts break slope stability (price per unit or cost per unit changes).
Capacity constraints create nonlinearity (overtime, stockouts, shipping changes).
Definition drift (what counts as an invoice, what counts as a unit) changes over time.

Use regression to support planning conversations like:

“Given our recent pattern, the implied cost-per-unit is X.”
“If we hit Y units next month, the model implies COGS around Z, plus/minus residual noise.”
“This month looks unusual relative to the driver story; let’s investigate why.”

Troubleshooting

Missing input files (FileNotFoundError)

If you see an error like “Expected inventory_movements.csv … but not found”:

Confirm you generated NSO v1:
```
make business-nso-sim
```
Confirm the files exist under data/synthetic/nso_v1/: inventory_movements.csv, ar_events.csv, statements_is_monthly.csv.
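
The file check above can also be scripted. This is a small standalone sketch, separate from the chapter script, that reports which of the three expected inputs are missing from a given folder.

```python
from pathlib import Path

# The three inputs Chapter 14 reads, as listed above.
REQUIRED_FILES = ["inventory_movements.csv", "ar_events.csv", "statements_is_monthly.csv"]


def missing_inputs(data_dir):
    """Return the required file names that are not present in data_dir."""
    root = Path(data_dir)
    return [name for name in REQUIRED_FILES if not (root / name).is_file()]


# Run from the repo root so the relative path resolves as expected.
missing = missing_inputs("data/synthetic/nso_v1")
if missing:
    print("Missing:", ", ".join(missing))
else:
    print("All Chapter 14 inputs found.")
```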

Wrong output directory / permission issues

If artifacts don’t appear under outputs/track_d/track_d/:

Ensure your current working directory is the repo root and that you run the make targets from there.
Confirm you have permission to create the outputs/ directory on your machine.
If needed, clear old outputs and re-run:
```
make clean
make business-ch14
```

Plots not rendering / backend issues

This chapter saves plots to PNG files; it does not require an interactive GUI. If you hit a matplotlib backend error on Windows, ensure you’re using the project’s recommended environment (venv) and that your matplotlib install is healthy.

What’s next (Chapter 15+)

Chapter 14 gives you the core regression driver workflow:

build a driver table
fit simple explainable models
produce artifacts that are reviewable (CSV/JSON/MD/figures)

The natural next steps for Chapter 15+ are to expand from “single driver lens” to “planning-grade forecasting,” for example:

add seasonality (month indicators) and compare fit/residuals
segment by product line or channel (separate slopes for different mixes)
introduce richer drivers (labor hours, shipments, discounts)
build a repeatable rolling forecast workflow and variance decomposition narrative

If Chapter 14 is your first regression chapter, you’re already in the right place: the goal is not sophistication—it’s usable, auditable models that improve decisions.

Appendix 14B: NSO v1 data dictionary cheat sheet

For a compact “what table is what” reference (grain, keys, joins, checks), see:

Appendix 14B: NSO v1 data dictionary cheat sheet (table → grain → keys → joins → checks)

Appendix 14C: Chapter 14 artifact dictionary

For a compact reference that explains every Chapter 14 output artifact (what it is, what it’s for, and what to look at first), see:

Appendix 14C: Chapter 14 artifact dictionary (what each output is for)

Appendix 14E: applying this chapter to your own data

To adapt the Chapter 14 workflow (driver table + explainable regression + artifacts) to your own real-world business data, see:

Appendix 14E: Applying Track D through Chapter 14 to your own real-world data