Appendix 14A: Chapter 14 milestone — Track D, the NSO system, and our synthetic datasets

Chapter 14 is a milestone in Track D.

Up through Chapter 13, we focused on measurement and inference: clean accounting structure, analysis-ready tables, descriptive statistics, probability, sampling, hypothesis testing, and controlled comparisons.

In Chapter 14, we shift into explanation for planning. The tool is driver analysis: a simple, auditable way to translate operational activity into a quantitative story that accounting and operations can actually use.

This appendix is the “big picture + under the hood” companion to Chapter 14:

  • What Track D is building (and why it is designed for accountants).

  • Where our data comes from (synthetic datasets generated locally).

  • What’s in the NSO running-case dataset and how it ties to chapters.

  • How to regenerate / modify datasets safely and reproducibly.

  • What comes next after Chapter 14 (forecasting and planning).

If you want a shorter milestone / philosophy view earlier in Track D, see: Appendix 8A: Chapter 8 milestone and the big picture (Ch01–Ch08).

Why Track D looks the way it does

Track D is designed around a practical accounting workflow:

  1. Close: post events and summarize them correctly.

  2. Clean: reconcile and validate so the numbers are trustworthy.

  3. Explain: use statistics to understand variance and identify drivers.

  4. Forecast: predict future months with explicit assumptions and error tracking.

  5. Decide: turn outputs into a memo, a plan, and control checks.

Chapter 14 lives in Step 3 (“Explain”). It answers questions like:

  • “Are COGS rising because we sold more units, because unit costs rose, or both?”

  • “What’s the expected revenue level for a given mix of units sold and invoice count?”

  • “If we change operational levers, how should the P&L respond — and by how much?”
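The first question above (volume vs. unit cost) can be made concrete with plain arithmetic before any regression is fitted. The sketch below is only an illustration, not the Chapter 14 method: the function name and the numbers are made up for this example and are not taken from the NSO dataset.

```python
def decompose_cogs_change(units_prev, unit_cost_prev, units_curr, unit_cost_curr):
    """Split a change in COGS into volume, rate, and joint effects.

    Hypothetical helper for illustration; it assumes COGS = units * unit cost.
    """
    d_units = units_curr - units_prev
    d_cost = unit_cost_curr - unit_cost_prev
    return {
        "volume_effect": d_units * unit_cost_prev,  # more/fewer units at the old cost
        "rate_effect": d_cost * units_prev,         # old volume at the new cost
        "joint_effect": d_units * d_cost,           # both changing at once
        "total_change": units_curr * unit_cost_curr - units_prev * unit_cost_prev,
    }


# Illustrative numbers only: 1,000 units at 12.00, then 1,100 units at 12.50.
result = decompose_cogs_change(1000, 12.0, 1100, 12.5)
for name, value in result.items():
    print(f"{name}: {value:,.2f}")
```

With these inputs the three effects (1,200 + 500 + 50) add up to the total change of 1,750, so the answer to "units, unit costs, or both?" is "both, but mostly units".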

A quick “what we’ve built so far” (Ch01–Ch14)

Track D is intentionally cumulative — later chapters assume the reader trusts the data.

Ch01–Ch03: accounting as measurement and summaries

  • Accounting equation and classification as a measurement system.

  • Double-entry and the GL as a database.

  • Statements as “summary statistics” (income statement, balance sheet, cash flow).

Ch04–Ch06: real accounting structure + quality control

  • Assets, including inventory and fixed assets (how operational events become numbers).

  • Liabilities/payroll/taxes/equity (what “owed” means, and how it shows up).

  • Reconciliations and QC: bank tie-outs, subledger checks, and “trust gates”.

Ch07–Ch09: analysis workflow and reporting contract

  • Turn accounting exports into analysis-ready tables.

  • Descriptive statistics for performance monitoring.

  • Reporting style: what a good analysis deliverable looks like.

Ch10–Ch13: risk + inference

  • Probability and risk framing.

  • Sampling and estimation (audit/control mindset).

  • Hypothesis testing for decisions.

  • Correlation vs causation; controlled comparisons and guardrails.

Ch14: regression driver analysis (this milestone)

  • A monthly driver table built from operational + accounting records.

  • Explainable OLS models that connect operational levers to financial outcomes.

  • Outputs designed for planning conversations (not “math for math’s sake”).
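To show what "explainable OLS" means at its smallest, here is a minimal sketch of a one-driver fit (cogs on units_sold) using only the closed-form formulas. It is an assumption-laden toy: the data is invented, the helper name is hypothetical, and the actual Chapter 14 code may use a different library and more than one driver.

```python
def fit_simple_ols(x, y):
    """Ordinary least squares for y = intercept + slope * x (one driver)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Invented monthly data: COGS built as 500 fixed + 12 per unit, with no noise,
# so the fit should recover those two numbers up to floating-point rounding.
units_sold = [100, 120, 90, 150, 130, 110]
cogs = [500.0 + 12.0 * u for u in units_sold]

intercept, slope = fit_simple_ols(units_sold, cogs)
print(f"fixed component ~ {intercept:.2f}, cost per unit ~ {slope:.2f}")
```

The point of the planning conversation is that both numbers have a business reading: the intercept behaves like a fixed component and the slope like a cost per unit.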

Where the data comes from (and why it’s synthetic)

Track D uses generated (synthetic) datasets for a simple reason:

  • They’re safe to share (no confidential client data).

  • They’re reproducible (same seed → same dataset).

  • They’re “tie-out friendly” (subledgers link cleanly into the GL).

  • They’re intentionally structured to support teaching: reconciliation checks, clean joins, realistic accounting relationships, and predictable artifacts that can be tested.

Important: our synthetic data is generated locally and is gitignored.

  • Output folder: data/synthetic/

  • This folder is excluded in .gitignore so the repo stays small and clean.

  • You generate the data with Make targets or direct CLI runs (examples below).

The two key ideas are:

  • The repo contains the generators (simulators) and validators.

  • Your machine produces the dataset you analyze, deterministically.
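The "same seed, same dataset" idea can be sketched in a few lines. This is not the NSO simulator; the function below is a hypothetical stand-in that only demonstrates why a seeded generator makes local output reproducible.

```python
import hashlib
import random


def simulate_monthly_units(seed, n_months):
    """Toy stand-in for a simulator: seeded, so output depends only on inputs."""
    rng = random.Random(seed)  # private generator; no shared global state
    return [rng.randint(80, 160) for _ in range(n_months)]


def fingerprint(rows):
    """Stable hash of the generated values, handy for comparing two runs."""
    return hashlib.sha256(",".join(map(str, rows)).encode("utf-8")).hexdigest()


run_a = simulate_monthly_units(seed=123, n_months=24)
run_b = simulate_monthly_units(seed=123, n_months=24)
run_c = simulate_monthly_units(seed=124, n_months=24)

print("same seed, same data:", fingerprint(run_a) == fingerprint(run_b))
print("different seed, same data:", fingerprint(run_a) == fingerprint(run_c))
```

Comparing fingerprints rather than eyeballing files is a cheap way to confirm that two machines generated the same dataset.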

The NSO running case dataset (v1)

The NSO dataset (“North Shore Outfitters”) is the Track D running case. It’s designed to feel like a small business with realistic accounting subsystems:

  • GL detail (journal-level “database” of financial events)

  • Bank activity (reconciliation)

  • A/R and A/P subledgers (invoice and bill flows)

  • Inventory movements (units and COGS logic)

  • Payroll events (wages and liabilities)

  • Fixed assets + depreciation schedule

  • Monthly financial statements for trend work

The simulator that produces this dataset is:

  • scripts/sim_business_nso_v1.py

The default Make target writes to:

  • data/synthetic/nso_v1/

NSO v1 file map (what each table represents)

The simulator writes a set of CSVs that you can think of as “mini sub-systems”. Below is a practical map of what they mean.

NSO v1 dataset outputs (generated locally)

| File | What it is / why it exists |
| --- | --- |
| chart_of_accounts.csv | The schema of the GL: account IDs, names, normal balance. Used throughout Track D. |
| gl_journal.csv | Transaction-level double-entry journal lines (the “database”). Source of truth for tie-outs. |
| trial_balance_monthly.csv | Monthly account balances derived from GL; supports statement builds and control checks. |
| statements_is_monthly.csv | Monthly income statement (revenue, COGS, expenses). Feeds trend work + driver analysis. |
| statements_bs_monthly.csv | Monthly balance sheet summary. Supports ratios, solvency, and “does this reconcile?” checks. |
| statements_cf_monthly.csv | Simple cash flow bridge (CFO/CFI/CFF style). Supports cash reasoning and planning. |
| inventory_movements.csv | Operational inventory log (in/out). Key for units sold and COGS reasoning. |
| fixed_assets.csv | Asset register: acquisitions and metadata for depreciation logic. |
| depreciation_schedule.csv | Depreciation by asset / month; supports fixed cost behavior and statement logic. |
| payroll_events.csv | Payroll activity events: wages and related liabilities. |
| sales_tax_events.csv | Sales tax collected/remitted events; supports liability reasoning and cash flow realism. |
| ap_events.csv | Accounts payable events (bills, payments). Supports payables workflow + controls. |
| ar_events.csv | Accounts receivable events (invoices, receipts). Key for invoice counts and revenue logic. |
| debt_schedule.csv | Debt timeline: principal/interest structure for liabilities and planning. |
| equity_events.csv | Owner contributions/draws (equity flows). |
| bank_statement.csv | Bank-like activity log for reconciliation exercises and cash controls. |
| nso_v1_meta.json | Metadata for reproducibility (seed, months, and scenario notes). |

Tip: if you ever ask “where did this number come from?”, the answer should be traceable back to either a subledger event table or to gl_journal.csv.
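As a sketch of what "traceable back to gl_journal.csv" can look like in practice, the snippet below nets debits and credits by account from a tiny in-memory journal. The column names (txn_id, date, account_id, debit, credit) and the sample rows are assumptions for illustration; check the real file's header before adapting it.

```python
import csv
import io
from collections import defaultdict
from decimal import Decimal

# Hypothetical sample in the spirit of gl_journal.csv (column names assumed).
SAMPLE_GL = """txn_id,date,account_id,debit,credit
1,2025-01-05,1000,500.00,0.00
1,2025-01-05,4000,0.00,500.00
2,2025-01-09,5000,300.00,0.00
2,2025-01-09,1200,0.00,300.00
"""


def account_activity(gl_csv_text):
    """Net activity (debits minus credits) per account."""
    totals = defaultdict(Decimal)  # Decimal avoids float rounding in tie-outs
    for row in csv.DictReader(io.StringIO(gl_csv_text)):
        totals[row["account_id"]] += Decimal(row["debit"]) - Decimal(row["credit"])
    return dict(totals)


activity = account_activity(SAMPLE_GL)
for account_id, net in sorted(activity.items()):
    print(account_id, net)

# Double-entry sanity check: net activity across all accounts should be zero.
print("journal balances:", sum(activity.values()) == 0)
```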

How Chapter 14 uses NSO v1 (driver table lineage)

Chapter 14 intentionally uses a small, explainable driver set — enough to teach the method without turning it into a data engineering chapter.

The driver table is monthly and includes:

  • units_sold: derived from inventory_movements.csv using the sales outflow rows (the operational quantity driver for COGS and revenue).

  • invoice_count: derived from ar_events.csv using invoice rows (a proxy for “transaction volume” / customer activity).

  • sales_revenue and cogs: pulled from statements_is_monthly.csv (financial outcomes already summarized through the accounting system).

This lineage is deliberate:

  • The “driver” fields come from operational subledgers.

  • The “outcome” fields come from financial statements.

  • That separation mirrors real accounting analytics work: operations → accounting → analysis → planning conversation.
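The lineage above can be sketched as a small monthly join. Everything about the input layout here is an assumption made for illustration: the month, movement_type, qty, and event_type columns, the "sale" and "invoice" labels, and the wide layout of the income statement may all differ from the real NSO v1 files. Only the four output field names come from this appendix.

```python
import csv
import io
from collections import Counter

# Tiny invented samples; column names and labels are hypothetical.
INVENTORY = """month,movement_type,qty
2025-01,purchase,200
2025-01,sale,-120
2025-01,sale,-30
2025-02,sale,-90
"""
AR = """month,event_type
2025-01,invoice
2025-01,invoice
2025-01,receipt
2025-02,invoice
"""
INCOME_STATEMENT = """month,sales_revenue,cogs
2025-01,4500.00,1800.00
2025-02,2700.00,1080.00
"""


def read_rows(text):
    return list(csv.DictReader(io.StringIO(text)))


def build_driver_table(inventory_text, ar_text, is_text):
    """One row per month: operational drivers joined to financial outcomes."""
    units_sold = Counter()
    for row in read_rows(inventory_text):
        if row["movement_type"] == "sale":  # sales outflow rows only
            units_sold[row["month"]] += abs(int(row["qty"]))
    invoice_count = Counter(
        row["month"] for row in read_rows(ar_text) if row["event_type"] == "invoice"
    )
    return [
        {
            "month": row["month"],
            "units_sold": units_sold[row["month"]],
            "invoice_count": invoice_count[row["month"]],
            "sales_revenue": float(row["sales_revenue"]),
            "cogs": float(row["cogs"]),
        }
        for row in read_rows(is_text)
    ]


driver_table = build_driver_table(INVENTORY, AR, INCOME_STATEMENT)
for record in driver_table:
    print(record)
```

Note that the drivers are counted from event tables while the outcomes are read from the statement, which keeps the two sides of the regression independently sourced.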

Regenerating the dataset (the standard workflow)

From the repo root, the normal Track D flow is:

# Generate NSO v1 synthetic dataset locally (gitignored)
make business-nso-sim

# Run dataset validation checks (schema + basic consistency checks)
make business-validate

# Run Chapter 14 analysis (build driver table, fit models, write artifacts)
make business-ch14

You can also run the simulator directly:

python -m scripts.sim_business_nso_v1 --outdir data/synthetic/nso_v1 --seed 123 --start-month 2025-01 --n-months 24

And you can validate directly:

python -m scripts.business_validate_dataset --datadir data/synthetic/nso_v1

How to modify the synthetic datasets (what “modification” means here)

There are two “levels” of modification, depending on your goal.

Level 1: change generation knobs (fast, safe)

These are ideal for teaching, experimentation, and reproducibility:

  • --seed: changes the random realization of events while keeping structure consistent

  • --start-month: shifts the calendar window

  • --n-months: creates shorter/longer histories (useful for forecasting chapters later)

This level preserves the same schema and tends to keep downstream chapters stable.
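If you script Level 1 variations, it helps to build the command from the knobs rather than hand-editing strings. The sketch below only assembles and prints the command line using the flags shown earlier in this appendix; the helper name and the example outdir are hypothetical, and nothing is executed.

```python
import shlex


def simulator_command(outdir, seed=123, start_month="2025-01", n_months=24):
    """Assemble the simulator CLI call as a shell-safe string (does not run it)."""
    args = [
        "python", "-m", "scripts.sim_business_nso_v1",
        "--outdir", outdir,
        "--seed", str(seed),
        "--start-month", start_month,
        "--n-months", str(n_months),
    ]
    return shlex.join(args)


# Hypothetical variant: a different seed, written to its own folder.
command = simulator_command("data/synthetic/nso_v1_seed456", seed=456)
print(command)
```

Writing each variant to its own folder keeps the baseline in data/synthetic/nso_v1 untouched.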

Level 2: change business story assumptions (powerful, but do it deliberately)

This is where you alter the simulated business behavior itself — for example:

  • Different product mix or unit economics

  • More/less volatility in demand

  • Different payment terms (cash vs credit)

  • More payroll headcount growth (step costs)

  • Higher frequency of one-off shocks (supplier issues, returns, etc.)

These changes are educational gold — but they can also break assumptions in later chapters if you change the “shape” of the data too aggressively.

When you do Level 2 changes, treat it like a controlled experiment:

  1. Generate into a new outdir (don’t overwrite baseline)

  2. Validate the dataset

  3. Re-run the chapter(s) you care about

  4. Compare artifacts and write down what changed (assumptions log)
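Step 4 is the one most often skipped, so here is a minimal sketch of it: compare two sets of summary metrics and record what moved. The metric names and values are invented for illustration and are not the actual Chapter 14 artifact fields.

```python
def compare_metrics(baseline, variant, tolerance=1e-9):
    """Return {metric: (before, after)} for metrics that differ or are missing."""
    changes = {}
    for key in sorted(set(baseline) | set(variant)):
        before = baseline.get(key)
        after = variant.get(key)
        if before is None or after is None or abs(after - before) > tolerance:
            changes[key] = (before, after)
    return changes


# Hypothetical metrics from a baseline run and a Level 2 variant.
baseline = {"cogs_per_unit": 12.0, "r_squared": 0.94}
variant = {"cogs_per_unit": 13.1, "r_squared": 0.94}

log_entry = {
    "assumption_changed": "higher unit cost (hypothetical scenario)",
    "changes": compare_metrics(baseline, variant),
}
print(log_entry)
```

An entry like this, saved alongside the variant's outdir, is enough to serve as an assumptions log.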

Why this matters: regression needs stable measurement

Regression driver analysis is only as credible as the measurement pipeline behind it.

That’s why Track D spends so much time on:

  • classification and mapping (COA discipline),

  • reconciliations (bank tie-outs),

  • subledger consistency checks, and

  • repeatable reporting artifacts.

If the dataset is “messy”, regression becomes a confidence trick: coefficients look precise but reflect inconsistent measurement, not business reality.
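One way to see this is to fit the same relationship twice, once with a cleanly measured driver and once with the driver recorded noisily. The simulation below is a toy under stated assumptions (invented numbers, a true cost of 12 per unit, random recording error added to units); it illustrates one kind of measurement problem, not every kind.

```python
import random


def ols_slope(x, y):
    """Slope of y on x, computed as covariance over variance."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    var = sum((xi - mean_x) ** 2 for xi in x)
    return cov / var


rng = random.Random(0)  # seeded so the illustration is repeatable

# True relationship (assumed for this toy): COGS = 12 * units.
true_units = [rng.uniform(80, 160) for _ in range(500)]
cogs = [12.0 * u for u in true_units]

# Same months, but units are recorded with random error (messy measurement).
noisy_units = [u + rng.gauss(0, 25) for u in true_units]

clean_slope = ols_slope(true_units, cogs)
noisy_slope = ols_slope(noisy_units, cogs)
print(f"clean driver slope: {clean_slope:.2f}")
print(f"noisy driver slope: {noisy_slope:.2f}")
```

The noisy fit still returns a tidy-looking coefficient, but it understates the true cost per unit because the driver itself was measured inconsistently.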

What’s next after Chapter 14 (how this tees up Chapter 15+)

Chapter 14 is “explain the variance.”

Chapters 15+ move into “predict and plan”:

  • Forecast hygiene: horizon, granularity, backtesting, and error metrics

  • Forecast versioning: assumptions logs and change control

  • Revenue forecasting: segmentation + drivers (not just a single trend line)

  • Expense forecasting: fixed/variable/step cost behavior (payroll as a model)

  • Communicating uncertainty: forecast ranges, risk flags, and decision memos

In other words:

  • Ch14 turns drivers into explainable models.

  • Ch15+ turns explainable models into forecast workflows that stand up to scrutiny.

Troubleshooting

Problem: `make business-ch14` fails because files are missing

  • Cause: data/synthetic/nso_v1 hasn’t been generated locally yet.

  • Fix:

make business-nso-sim
make business-validate
make business-ch14
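To confirm the "files are missing" cause before re-running anything, a quick presence check is enough. The sketch below is not the repo's validator; it is a hypothetical helper that only checks that the files named in the NSO v1 file map exist in the default output folder.

```python
from pathlib import Path

# File names taken from the NSO v1 file map in this appendix.
EXPECTED_FILES = (
    "chart_of_accounts.csv",
    "gl_journal.csv",
    "trial_balance_monthly.csv",
    "statements_is_monthly.csv",
    "statements_bs_monthly.csv",
    "statements_cf_monthly.csv",
    "inventory_movements.csv",
    "fixed_assets.csv",
    "depreciation_schedule.csv",
    "payroll_events.csv",
    "sales_tax_events.csv",
    "ap_events.csv",
    "ar_events.csv",
    "debt_schedule.csv",
    "equity_events.csv",
    "bank_statement.csv",
    "nso_v1_meta.json",
)


def missing_files(datadir):
    """Return the expected file names that are not present in datadir."""
    root = Path(datadir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


missing = missing_files("data/synthetic/nso_v1")
if missing:
    print("Missing files; run `make business-nso-sim` first:", ", ".join(missing))
else:
    print("All expected NSO v1 files are present.")
```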

Problem: Outputs go into `outputs/track_d/track_d`

  • Cause: Chapter 14’s CLI writes into --outdir plus a Track D subfolder for organization.

  • Fix: This is expected. Treat outputs/track_d as your “project outputs root”.

Problem: You edited the simulator and validation fails

  • Cause: You changed the business story or schema.

  • Fix:

  1. regenerate into a new outdir,

  2. re-run validation,

  3. keep schema stable unless you intend to update multiple chapters.

Closing note

This project is intentionally “controls-aware”:

  • It teaches analytics that can be explained and audited.

  • It respects accounting structure (not just data science convenience).

  • It treats reproducibility and validation as non-negotiable.

Chapter 14 is the point where that philosophy becomes visible: drivers → model → explanation → planning conversation.