Appendix 14A: Chapter 14 milestone - Track D, the NSO system, and our synthetic datasets
==========================================================================================

Chapter 14 is a milestone in Track D.

Up through Chapter 13, we focused on *measurement and inference*:
clean accounting structure, analysis-ready tables, descriptive statistics,
probability, sampling, hypothesis testing, and controlled comparisons.

In Chapter 14, we shift into *explanation for planning*:
**driver analysis**, a simple, auditable way to translate operational activity
into a quantitative story that accounting and operations can actually use.

This appendix is the “big picture + under the hood” companion to Chapter 14:

- What Track D is building (and why it is designed for accountants).
- Where our data comes from (synthetic datasets generated locally).
- What’s in the NSO running-case dataset and how it ties to chapters.
- How to regenerate / modify datasets safely and reproducibly.
- What comes next after Chapter 14 (forecasting and planning).

If you want a shorter milestone / philosophy view earlier in Track D, see:
:doc:`business_appendix_ch08_milestone_big_picture`.

Why Track D looks the way it does
---------------------------------

Track D is designed around a practical accounting workflow:

1. **Close**: post events and summarize them correctly.
2. **Clean**: reconcile and validate so the numbers are trustworthy.
3. **Explain**: use statistics to understand variance and identify drivers.
4. **Forecast**: predict future months with explicit assumptions and error tracking.
5. **Decide**: turn outputs into a memo, a plan, and control checks.

Chapter 14 lives in Step 3 (“Explain”).
It answers questions like:

- “Are COGS rising because we sold more units, because unit costs rose, or both?”
- “What’s the *expected* revenue level for a given units-sold and invoice count mix?”
- “If we change operational levers, how should the P&L respond, and by how much?”

A quick “what we’ve built so far” (Ch01–Ch14)
---------------------------------------------

Track D is intentionally cumulative: later chapters assume the reader trusts the data.

**Ch01–Ch03: accounting as measurement and summaries**

- Accounting equation and classification as a measurement system.
- Double-entry and the GL as a database.
- Statements as “summary statistics” (income statement, balance sheet, cash flow).

**Ch04–Ch06: real accounting structure + quality control**

- Assets and inventory/fixed assets (how operational events become numbers).
- Liabilities/payroll/taxes/equity (what “owed” means, and how it shows up).
- Reconciliations and QC: bank tie-outs, subledger checks, and “trust gates”.

**Ch07–Ch09: analysis workflow and reporting contract**

- Turn accounting exports into analysis-ready tables.
- Descriptive statistics for performance monitoring.
- Reporting style: what a good analysis deliverable looks like.

**Ch10–Ch13: risk + inference**

- Probability and risk framing.
- Sampling and estimation (audit/control mindset).
- Hypothesis testing for decisions.
- Correlation vs causation; controlled comparisons and guardrails.

**Ch14: regression driver analysis (this milestone)**

- A monthly driver table built from operational + accounting records.
- Explainable OLS models that connect operational levers to financial outcomes.
- Outputs designed for planning conversations (not “math for math’s sake”).

Where the data comes from (and why it’s synthetic)
--------------------------------------------------

Track D uses **generated (synthetic) datasets** for several reasons:

- They’re safe to share (no confidential client data).
- They’re reproducible (same seed → same dataset).
- They’re “tie-out friendly” (subledgers link cleanly into the GL).
- They’re intentionally structured to support teaching:
  reconciliation checks, clean joins, realistic accounting relationships,
  and predictable artifacts that can be tested.

Important: our synthetic data is generated **locally** and is **gitignored**.

- Output folder: ``data/synthetic/``
- This folder is excluded in ``.gitignore`` so the repo stays small and clean.
- You generate the data with Make targets or direct CLI runs (examples below).

The two key ideas are:

- **The repo contains the generators** (simulators) and validators.
- **Your machine produces the dataset** you analyze, deterministically.

The NSO running case dataset (v1)
---------------------------------

The NSO dataset (“North Shore Outfitters”) is the Track D running case.
It’s designed to feel like a small business with realistic accounting subsystems:

- GL detail (journal-level “database” of financial events)
- Bank activity (reconciliation)
- A/R and A/P subledgers (invoice and bill flows)
- Inventory movements (units and COGS logic)
- Payroll events (wages and liabilities)
- Fixed assets + depreciation schedule
- Monthly financial statements for trend work

The simulator that produces this dataset is:

- ``scripts/sim_business_nso_v1.py``

The default Make target writes to:

- ``data/synthetic/nso_v1/``

NSO v1 file map (what each table represents)
--------------------------------------------

The simulator writes a set of CSVs that you can think of as “mini sub-systems”.
Below is a practical map of what they mean.

.. list-table:: NSO v1 dataset outputs (generated locally)
   :header-rows: 1
   :widths: 30 70

   * - File
     - What it is / why it exists
   * - ``chart_of_accounts.csv``
     - The schema of the GL: account IDs, names, normal balance. Used throughout Track D.
   * - ``gl_journal.csv``
     - Transaction-level double-entry journal lines (the “database”). Source of truth for tie-outs.
   * - ``trial_balance_monthly.csv``
     - Monthly account balances derived from GL; supports statement builds and control checks.
   * - ``statements_is_monthly.csv``
     - Monthly income statement (revenue, COGS, expenses). Feeds trend work + driver analysis.
   * - ``statements_bs_monthly.csv``
     - Monthly balance sheet summary. Supports ratios, solvency, and “does this reconcile?” checks.
   * - ``statements_cf_monthly.csv``
     - Simple cash flow bridge (CFO/CFI/CFF style). Supports cash reasoning and planning.
   * - ``inventory_movements.csv``
     - Operational inventory log (in/out). Key for units sold and COGS reasoning.
   * - ``fixed_assets.csv``
     - Asset register: acquisitions and metadata for depreciation logic.
   * - ``depreciation_schedule.csv``
     - Depreciation by asset / month; supports fixed cost behavior and statement logic.
   * - ``payroll_events.csv``
     - Payroll activity events: wages and related liabilities.
   * - ``sales_tax_events.csv``
     - Sales tax collected/remitted events; supports liability reasoning and cash flow realism.
   * - ``ap_events.csv``
     - Accounts payable events (bills, payments). Supports payables workflow + controls.
   * - ``ar_events.csv``
     - Accounts receivable events (invoices, receipts). Key for invoice counts and revenue logic.
   * - ``debt_schedule.csv``
     - Debt timeline: principal/interest structure for liabilities and planning.
   * - ``equity_events.csv``
     - Owner contributions/draws (equity flows).
   * - ``bank_statement.csv``
     - Bank-like activity log for reconciliation exercises and cash controls.
   * - ``nso_v1_meta.json``
     - Metadata for reproducibility (seed, months, and scenario notes).

Tip: if you ever ask “where did this number come from?”, the answer should be traceable
back to either a subledger event table or to ``gl_journal.csv``.

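
To make the traceability tip concrete, here is a minimal sketch of a tie-out using only
the Python standard library. It checks that each transaction balances and that a
statement figure can be rebuilt from journal lines. The column names (``txn_id``,
``month``, ``account_id``, ``debit``, ``credit``), the revenue account ID, and every
sample value are assumptions made for illustration; they are not the actual NSO v1
schema, so consult the data dictionary before adapting this.

.. code-block:: python

   import csv
   import io
   from collections import defaultdict

   # Hypothetical extract shaped like a GL journal. Column names and values
   # are illustrative assumptions, not the real gl_journal.csv schema.
   GL_SAMPLE = """txn_id,month,account_id,debit,credit
   T1,2025-01,1100,500.00,0.00
   T1,2025-01,4000,0.00,500.00
   T2,2025-01,5000,300.00,0.00
   T2,2025-01,1200,0.00,300.00
   T3,2025-02,1100,750.00,0.00
   T3,2025-02,4000,0.00,750.00
   """

   REVENUE_ACCOUNT = "4000"  # assumed account ID for sales revenue
   TOLERANCE = 0.005         # half a cent, to absorb float rounding


   def load_rows(text):
       """Parse CSV text into a list of dict rows (whitespace-trimmed)."""
       lines = [line.strip() for line in text.strip().splitlines()]
       return list(csv.DictReader(io.StringIO("\n".join(lines))))


   def unbalanced_transactions(rows):
       """Return IDs of transactions whose debits and credits differ."""
       net = defaultdict(float)
       for row in rows:
           net[row["txn_id"]] += float(row["debit"]) - float(row["credit"])
       return sorted(t for t, v in net.items() if abs(v) > TOLERANCE)


   def monthly_revenue(rows, account_id=REVENUE_ACCOUNT):
       """Sum credits minus debits on the revenue account, by month."""
       totals = defaultdict(float)
       for row in rows:
           if row["account_id"] == account_id:
               totals[row["month"]] += float(row["credit"]) - float(row["debit"])
       return dict(totals)


   rows = load_rows(GL_SAMPLE)
   bad = unbalanced_transactions(rows)
   revenue = monthly_revenue(rows)

   # Hypothetical statement figures to tie against the GL-derived totals.
   statement_revenue = {"2025-01": 500.00, "2025-02": 750.00}
   mismatches = {
       month: (revenue.get(month, 0.0), value)
       for month, value in statement_revenue.items()
       if abs(revenue.get(month, 0.0) - value) > TOLERANCE
   }

   print("unbalanced transactions:", bad)
   print("statement mismatches:", mismatches)

Floats keep the sketch short; for real tie-outs, ``decimal.Decimal`` avoids rounding
surprises on currency amounts.
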
How Chapter 14 uses NSO v1 (driver table lineage)
-------------------------------------------------

Chapter 14 intentionally uses a *small, explainable* driver set: enough to teach the method
without turning it into a data engineering chapter.

The driver table is monthly and includes:

- ``units_sold``: derived from ``inventory_movements.csv`` using the sales outflow rows
  (the operational quantity driver for COGS and revenue).
- ``invoice_count``: derived from ``ar_events.csv`` using invoice rows
  (a proxy for “transaction volume” / customer activity).
- ``sales_revenue`` and ``cogs``: pulled from ``statements_is_monthly.csv``
  (financial outcomes already summarized through the accounting system).

This lineage is deliberate:

- The “driver” fields come from operational subledgers.
- The “outcome” fields come from financial statements.
- That separation mirrors real accounting analytics work:
  operations → accounting → analysis → planning conversation.

Regenerating the dataset (the standard workflow)
------------------------------------------------

From the repo root, the normal Track D flow is:

.. code-block:: bash

   # Generate NSO v1 synthetic dataset locally (gitignored)
   make business-nso-sim

   # Run dataset validation checks (schema + basic consistency checks)
   make business-validate

   # Run Chapter 14 analysis (build driver table, fit models, write artifacts)
   make business-ch14

You can also run the simulator directly:

.. code-block:: bash

   python -m scripts.sim_business_nso_v1 --outdir data/synthetic/nso_v1 --seed 123 --start-month 2025-01 --n-months 24

And you can validate directly:

.. code-block:: bash

   python -m scripts.business_validate_dataset --datadir data/synthetic/nso_v1

Regenerating without overwriting (recommended for experimentation)
------------------------------------------------------------------

When experimenting, don’t overwrite your “baseline” dataset.
Generate a new dataset folder and point chapter scripts at it.

Example:

.. code-block:: bash

   # Create a new scenario dataset
   python -m scripts.sim_business_nso_v1 --outdir data/synthetic/nso_v1_experiment --seed 999 --start-month 2025-01 --n-months 24

   # Validate the new dataset
   python -m scripts.business_validate_dataset --datadir data/synthetic/nso_v1_experiment

   # Run Chapter 14 on the new dataset (custom datadir)
   python -m scripts.business_ch14_regression_driver_analysis --datadir data/synthetic/nso_v1_experiment --outdir outputs/track_d --seed 123

If you prefer Make, you can override variables at runtime:

.. code-block:: bash

   make business-ch14 OUT_NSO_V1=data/synthetic/nso_v1_experiment

(That works because Make lets command-line variables override Makefile defaults.)

How to modify the synthetic datasets (what “modification” means here)
---------------------------------------------------------------------

There are two “levels” of modification, depending on your goal.

Level 1: change *generation knobs* (fast, safe)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

These are ideal for teaching, experimentation, and reproducibility:

- ``--seed``: changes the random realization of events while keeping structure consistent
- ``--start-month``: shifts the calendar window
- ``--n-months``: creates shorter/longer histories (useful for forecasting chapters later)

This level preserves the same schema and tends to keep downstream chapters stable.

Level 2: change *business story assumptions* (powerful, but do it deliberately)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

This is where you alter the simulated business behavior itself. For example:

- Different product mix or unit economics
- More/less volatility in demand
- Different payment terms (cash vs credit)
- More payroll headcount growth (step costs)
- Higher frequency of one-off shocks (supplier issues, returns, etc.)


These changes are educational gold, but they can also break assumptions in later chapters
if you change the “shape” of the data too aggressively.

When you do Level 2 changes, treat it like a controlled experiment:

1. Generate into a new outdir (don’t overwrite baseline)
2. Validate the dataset
3. Re-run the chapter(s) you care about
4. Compare artifacts and write down what changed (assumptions log)

Why this matters: regression needs stable measurement
-----------------------------------------------------

Regression driver analysis is only as credible as the measurement pipeline behind it.

That’s why Track D spends so much time on:

- classification and mapping (COA discipline),
- reconciliations (bank tie-outs),
- subledger consistency checks, and
- repeatable reporting artifacts.

If the dataset is “messy”, regression becomes a confidence trick:
coefficients look precise but reflect inconsistent measurement, not business reality.

What’s next after Chapter 14 (how this tees up Chapter 15+)
-----------------------------------------------------------

Chapter 14 is “explain the variance.”
Chapters 15+ move into “predict and plan”:

- Forecast hygiene: horizon, granularity, backtesting, and error metrics
- Forecast versioning: assumptions logs and change control
- Revenue forecasting: segmentation + drivers (not just a single trend line)
- Expense forecasting: fixed/variable/step cost behavior (payroll as a model)
- Communicating uncertainty: forecast ranges, risk flags, and decision memos

In other words:

- Ch14 turns drivers into explainable models.
- Ch15+ turns explainable models into **forecast workflows** that stand up to scrutiny.

Troubleshooting
---------------

**Problem: `make business-ch14` fails because files are missing**

- Cause: ``data/synthetic/nso_v1`` hasn’t been generated locally yet.
- Fix:

  .. code-block:: bash

     make business-nso-sim
     make business-validate
     make business-ch14

**Problem: Outputs go into `outputs/track_d/track_d`**

- Cause: Chapter 14’s CLI writes into ``--outdir`` plus a Track D subfolder for organization.
- Fix: This is expected. Treat ``outputs/track_d`` as your “project outputs root”.

**Problem: You edited the simulator and validation fails**

- Cause: You changed the business story or schema.
- Fix: 1) regenerate into a new outdir, 2) re-run validation, 3) keep schema stable
  unless you intend to update multiple chapters.

Closing note
------------

This project is intentionally “controls-aware”:

- It teaches analytics that can be explained and audited.
- It respects accounting structure (not just data science convenience).
- It treats reproducibility and validation as non-negotiable.

Chapter 14 is the point where that philosophy becomes visible:
**drivers → model → explanation → planning conversation**.

- :doc:`business_appendix_ch14b_nso_v1_data_dictionary`
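
As a closing illustration of the drivers → model → explanation step, the sketch below
fits a one-driver ordinary least squares line, ``cogs ≈ intercept + slope × units_sold``,
in plain Python. It is a simplification under stated assumptions: the monthly numbers
are made up for the example (they are not NSO v1 output), the helper name
``fit_simple_ols`` is hypothetical, and the models in Chapter 14 itself may use more
drivers and a statistics library rather than hand-rolled arithmetic.

.. code-block:: python

   # Illustrative monthly driver table. These numbers are made up for the
   # sketch; they are not NSO v1 output.
   units_sold = [100, 120, 90, 150, 130, 110]
   cogs = [1730.0, 1920.0, 1600.0, 2290.0, 2080.0, 1810.0]


   def fit_simple_ols(x, y):
       """Fit y = intercept + slope * x by ordinary least squares."""
       n = len(x)
       mean_x = sum(x) / n
       mean_y = sum(y) / n
       # Sum of squared deviations of x, and cross-deviations of x and y.
       sxx = sum((xi - mean_x) ** 2 for xi in x)
       sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
       slope = sxy / sxx
       intercept = mean_y - slope * mean_x
       return intercept, slope


   intercept, slope = fit_simple_ols(units_sold, cogs)
   fitted = [intercept + slope * u for u in units_sold]
   residuals = [actual - fit for actual, fit in zip(cogs, fitted)]

   print(f"fixed component (intercept): {intercept:,.2f}")
   print(f"cost per extra unit (slope): {slope:,.2f}")
   print("largest residual:", round(max(residuals, key=abs), 2))

Read in planning terms, the slope is the estimated cost of one more unit sold and the
intercept is the cost level that does not move with volume. The residuals mark the
months that the driver alone does not explain.
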