Track D — Chapter 14 ==================== .. |trackd_run| replace:: d14 .. include:: _includes/track_d_run_strip.rst Regression Driver Analysis (NSO running case) --------------------------------------------- This chapter turns “operational activity” into an **explainable planning model**. In accounting work you often start from outcomes: - Revenue was up (or down). - COGS moved. - Gross margin shifted. That outcome view is essential, but it’s not always sufficient for planning and control. When leadership asks: - “What happens to COGS if we sell 15% more units next month?” - “Is margin pressure mostly price, volume, or some fixed baseline cost?” - “Are we seeing higher revenue because we sold more… or because invoices changed (mix/activity)?” …you need a **driver lens**. **Regression driver analysis** is a practical way to estimate simple relationships like: - a **baseline** level (intercept): “what tends to happen even if activity is low” - a **rate** (slope): “how much outcome changes per unit of activity” You’ll build this lens for the North Shore Outfitters (NSO) running case using monthly data. Where this fits in Track D -------------------------- This chapter is intentionally downstream of earlier Track D ideas: - Chapter 12 (Hypothesis Testing): disciplined interpretation; avoid overconfidence from noisy data. - Chapter 13 (Correlation, Causation, Controlled Comparisons): *correlation is not causation* and “third factors” often drive two lines together. Regression is powerful, but it can also create false confidence if used carelessly. So we carry forward the Chapter 13 discipline here: - Use regression as a **driver lens** - Prefer **simple, explainable models** - Check residuals and sanity-check assumptions - Treat results as **planning inputs**, not proof What you will build ------------------- You will produce two things: 1) A monthly **driver table** that lines up activity measures with financial outcomes. 2) Three small regression models: - **m1:** ``COGS ~ units_sold`` (fixed + variable cost per unit lens) - **m2:** ``Revenue ~ units_sold`` (baseline + average price-per-unit lens) - **m3:** ``Revenue ~ units_sold + invoice_count`` (two-driver “mix/activity check”) The goal is not “best possible prediction.” The goal is **simple, auditable explanations** that help accounting planning + control. The driver table (what it is and where it comes from) ----------------------------------------------------- The Chapter 14 script builds a monthly table with these columns: .. list-table:: Chapter 14 driver table fields :header-rows: 1 :widths: 18 22 60 * - Column - Meaning - Source in NSO outputs * - ``month`` - Month key (YYYY-MM) - Derived from date fields in each input file * - ``month_dt`` - Month as a real date (YYYY-MM-01) for sorting/plotting - Derived * - ``units_sold`` - Units sold in that month (positive count) - ``inventory_movements.csv`` where ``movement_type == sale_issue``. (In the simulator, sale issues are negative inventory movements, so the script flips the sign.) * - ``invoice_count`` - Count of invoices issued in that month - ``ar_events.csv`` where ``event_type == invoice`` (grouped and counted by month) * - ``sales_revenue`` - Monthly sales revenue - ``statements_is_monthly.csv`` where ``line == Sales Revenue`` * - ``cogs`` - Monthly cost of goods sold - ``statements_is_monthly.csv`` where ``line == Cost of Goods Sold`` Why these drivers? ^^^^^^^^^^^^^^^^^^ Accounting planning usually needs a bridge between **operations** and **financial outcomes**. - Units sold is a natural driver for both revenue and COGS in many product businesses. - Invoice count is not “better than units,” but it can be a useful **activity proxy** (and a warning sign for mix changes, bundling, partial shipments, pricing patterns, etc.). This chapter keeps the driver set intentionally small. In real work you might add: - labor hours, headcount, shipments, store traffic, returns - discount rate, product mix %, channel mix %, seasonality indicators - capacity constraints (overtime, stockouts) Regression models (m1, m2, m3) ------------------------------ All three models use ordinary least squares (OLS). Conceptually: - ``y = intercept + slope * x + residual`` - **Intercept** is the baseline component. - **Slope** is the rate per unit of driver. - **Residuals** are what the driver did not explain. The script uses ``statsmodels`` and includes an intercept term (a constant). Model 1: COGS as a function of units sold ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ **m1:** ``COGS ~ units_sold`` Interpretation in planning terms: - **Intercept (baseline COGS):** costs that tend to appear even at low activity (minimum staffing, spoilage, fixed handling, etc.). In some businesses baseline COGS should be near zero; in others it may not be. - **Slope (variable cost per unit):** the estimated cost-per-unit implied by the data (a “rate”). How you use it: - Build a cost forecast from a unit forecast. - Explain COGS variance as “rate vs baseline vs unexplained residual.” Model 2: Revenue as a function of units sold ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ **m2:** ``Revenue ~ units_sold`` Interpretation: - **Intercept (baseline revenue):** in many settings this should be near zero. If it is not, that’s a signal to investigate: timing, returns, non-unit revenue streams, seasonality, or a model mismatch. - **Slope (average price per unit lens):** an implied average revenue per unit. This is *not* a SKU-level price; it’s a blended rate across product mix for the period. How you use it: - Translate a unit plan into a revenue plan. - Detect periods where price/mix changes break the “stable rate” assumption. Model 3: Revenue as a function of units sold + invoice count ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ **m3:** ``Revenue ~ units_sold + invoice_count`` This is a simple two-driver extension: - If invoice count adds meaningful explanatory power beyond units sold, you may be seeing changes in ordering patterns or mix (e.g., more small invoices, more split shipments, different channel behavior). - If invoice count does *not* add anything (small coefficient, noisy, low incremental fit), that’s also useful: units alone may be sufficient for the current planning lens. This is not “the final truth.” It’s a lightweight check that encourages better questions. Interpreting outputs like an accountant --------------------------------------- Intercepts and slopes: baseline vs rate ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ A helpful mental model is: - **Intercept** = baseline component (what happens at “zero-ish” activity) - **Slope** = marginal rate (how much outcome changes per unit of activity) This matches common accounting narratives: - “There is a fixed component plus a variable component.” - “Most of the change is volume-driven; rate is stable.” - “The rate is drifting — pricing/mix or cost structure is changing.” R²: “how much of the story is volume” ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ R² is the fraction of variance explained by the model *in-sample*. - High R² can mean the driver captures the main movement (often volume/seasonality). - Low R² can mean the driver is incomplete or the process changed. Accounting interpretation: - R² is not “goodness” by itself; it’s a signpost. - Even a moderate R² can be useful if the slope is stable and interpretable. Residuals: what the drivers didn’t explain ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Residuals are where accounting insight often lives: - promotions and discounting - unusual returns - supply shocks and stockouts - one-time events or timing effects A healthy workflow is: 1) Fit the simple model. 2) Inspect residual patterns. 3) Decide whether you need more drivers or segmentation. .. important:: If pricing or product mix changes materially, re-fit the model and re-check residuals. Stable slopes are an assumption, not a guarantee. How to run the chapter ---------------------- Prerequisite: NSO dataset ^^^^^^^^^^^^^^^^^^^^^^^^^ Chapter 14 expects the NSO v1 synthetic dataset to exist (it is generated by the Track D simulator). If you already ran earlier Track D chapters locally, you likely have it. To (re)generate NSO v1: .. code-block:: bash make business-nso-sim This writes the NSO v1 dataset under: - ``data/synthetic/nso_v1/`` Run Chapter 14 ^^^^^^^^^^^^^^ Run the Chapter 14 target: .. code-block:: bash make business-ch14 Note on output location: - The Makefile passes ``--outdir outputs/track_d``. - The Chapter 14 script writes inside a nested folder and prints the final path. So you should expect artifacts under something like: - ``outputs/track_d/track_d/`` Outputs and how to inspect them ------------------------------- The chapter writes a small “analysis pack” you can use for inspection, debugging, or reporting. Core artifacts ^^^^^^^^^^^^^^ - ``ch14_driver_table.csv`` The monthly driver table (month, units_sold, invoice_count, revenue, cogs). Start here—open it and sanity-check the numbers. - ``ch14_regression_design.json`` A lightweight “design contract” describing expected inputs, driver definitions, and model formulas. Use this for reproducibility and review. - ``ch14_regression_summary.json`` Machine-readable regression results (parameters, R², and a small forecast example). Useful for downstream automation (dashboards, reports, tests). - ``ch14_regression_memo.md`` A short human-readable memo summarizing the key results. This is intentionally brief: it’s a starter “executive narrative.” Figures ^^^^^^^ - ``ch14_figures_manifest.csv`` A manifest listing each figure saved by the script (path + chart metadata). - ``figures/`` PNG charts (scatter + fit line, plus a residual-style view for the multi-driver check). A quick inspection workflow: 1) Open ``ch14_driver_table.csv`` and verify the month alignment. 2) Read ``ch14_regression_memo.md`` to see the story. 3) Open the figures to check whether relationships look linear and whether outliers dominate. 4) Use ``ch14_regression_summary.json`` if you want to extract slopes/intercepts programmatically. Guardrails (read this before you sell the story) ------------------------------------------------ Regression is a **driver lens**, not a causality machine. Common failure modes in accounting settings: - **Seasonality / timing** drives both the driver and the outcome (spurious fit). - **Mix shifts** break slope stability (price per unit or cost per unit changes). - **Capacity constraints** create nonlinearity (overtime, stockouts, shipping changes). - **Definition drift** (what counts as an invoice, what counts as a unit) changes over time. Use regression to support planning conversations like: - “Given our recent pattern, the implied cost-per-unit is X.” - “If we hit Y units next month, the model implies COGS around Z, plus/minus residual noise.” - “This month looks unusual relative to the driver story; let’s investigate why.” Troubleshooting --------------- Missing input files (FileNotFoundError) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ If you see an error like “Expected inventory_movements.csv … but not found”: - Confirm you generated NSO v1: .. code-block:: bash make business-nso-sim - Confirm the files exist under ``data/synthetic/nso_v1/``: ``inventory_movements.csv``, ``ar_events.csv``, ``statements_is_monthly.csv``. Wrong output directory / permission issues ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ If artifacts don’t appear under ``outputs/track_d/track_d/``: - Ensure your working tree is the repo root and you’re running targets from there. - Confirm you have permission to create the ``outputs/`` directory on your machine. - If needed, clear old outputs and re-run: .. code-block:: bash make clean make business-ch14 Plots not rendering / backend issues ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ This chapter saves plots to PNG files; it does not require an interactive GUI. If you hit a matplotlib backend error on Windows, ensure you’re using the project’s recommended environment (venv) and that your matplotlib install is healthy. What’s next (Chapter 15+) ------------------------- Chapter 14 gives you the core regression driver workflow: - build a driver table - fit simple explainable models - produce artifacts that are reviewable (CSV/JSON/MD/figures) The natural next steps for Chapter 15+ are to expand from “single driver lens” to “planning-grade forecasting,” for example: - add seasonality (month indicators) and compare fit/residuals - segment by product line or channel (separate slopes for different mixes) - introduce richer drivers (labor hours, shipments, discounts) - build a repeatable rolling forecast workflow and variance decomposition narrative If Chapter 14 is your first regression chapter, you’re already in the right place: the goal is not sophistication—it’s **usable, auditable models that improve decisions**. Appendix 14B: NSO v1 data dictionary cheat sheet ------------------------------------------------ For a compact “what table is what” reference (grain, keys, joins, checks), see: - :doc:`business_appendix_ch14b_nso_v1_data_dictionary` Appendix 14C: Chapter 14 artifact dictionary -------------------------------------------- For a compact reference that explains every Chapter 14 output artifact (what it is, what it’s for, and what to look at first), see: - :doc:`business_appendix_ch14c_ch14_artifact_dictionary` Appendix 14D: Artifact QA checklist (what to verify before sharing results) --------------------------------------------------------------------------- Before you share the Chapter 14 memo or coefficients, use the “big picture” QA checklist: - :doc:`business_appendix_ch14d_artifact_qa_checklist_big_picture` Appendix 14E: applying this chapter to your own data ---------------------------------------------------- To adapt the Chapter 14 workflow (driver table + explainable regression + artifacts) to your own real-world business data, see: - :doc:`business_appendix_ch14e_apply_to_real_world`