Business Chapter 7: Preparing accounting data for analysis
PyPI workbook run (Track D)
From inside your Track D workbook folder (created by pystatsv1 workbook init --track d --dest ...), run:
pystatsv1 workbook run |trackd_run|
Outputs are written under outputs/track_d/ by default.
If you’re unsure what a file is for, start with Track D Outputs Guide.
To see the full chapter-by-chapter run map (D00–D23), see Track D chapter index (PyPI).
Optional: write to a custom output folder:
pystatsv1 workbook run |trackd_run| --outdir outputs/track_d_custom
Interpretation prompts (quick self-check):
What is the accounting or business measurement goal in this chapter?
Which invariant/check would catch a “numbers look fine but are wrong” mistake here?
This chapter is the “bridge” between bookkeeping outputs (CSV exports, trial balances, etc.) and the kinds of tables you want for statistics, forecasting, and dashboards.
In real organizations, data prep is often the highest-leverage work:
Raw general ledger exports are not model-friendly.
The chart of accounts (COA) contains the semantic labels you need.
Most analytics workflows want one amount column, not separate debit/credit.
Chapter 7 produces two small, reusable datasets:
gl_tidy.csv (line-level): one row per GL line, with COA labels and a single signed amount.
gl_monthly_summary.csv (rollup): a monthly activity table by account.
What you should be able to do after this chapter
Explain why accountants track debits/credits, but analysts often prefer one signed amount column.
Build a tidy, analysis-ready GL table that preserves auditability.
Create a monthly roll-up suitable for time-series work and forecasting.
Run basic data-quality checks before modeling.
Inputs and outputs
Inputs (from the NSO v1 dataset folder)
gl_journal.csv: the line-level journal exportchart_of_accounts.csv: the COA lookup table
Outputs (written to --outdir)
gl_tidy.csvgl_monthly_summary.csvch07_summary.json(counts + light QC signals)
Running the chapter
If you have not generated the NSO v1 dataset yet:
make business-nso-sim
Then run Chapter 7:
make business-ch07
Or run the module directly:
python -m scripts.business_ch07_preparing_accounting_data_for_analysis \
--datadir data/synthetic/nso_v1 \
--outdir outputs/track_d \
--seed 123
The two output tables
1) gl_tidy.csv (line-level)
The raw GL has debit and credit columns. That is perfect for
accounting integrity, but awkward for analysis.
In gl_tidy.csv we keep the audit fields (txn_id, doc_id, date,
description) and add COA labels:
account_nameaccount_typenormal_side(debit or credit)
We also add two derived amount fields:
raw_amount=debit - credit(debit positive, credit negative)amount= signed change in the account’s normal-side direction
Why two amount columns?
raw_amountis a faithful numeric encoding of the journal line.amountis usually what you want for analytics (“did the account go up or down, in the normal direction?”).
To make each row easy to reference, we add:
line_no: the line number within a transactiongl_line_id:<txn_id>-<line_no>
2) gl_monthly_summary.csv (monthly roll-up)
This is a grouped summary of the tidy GL:
group by
month+ accountsum
debit,credit, andamount
This produces a table that is immediately useful for:
trend charts and dashboards
anomaly detection (spikes in monthly activity)
forecasting models (per-account time series)
Reading ch07_summary.json
The summary JSON is a small “run report” that answers:
How many rows were produced?
Did we successfully join to the COA?
Did we have any unparseable dates?
It is not meant to be exhaustive; it is meant to provide quick confidence before you begin deeper analysis.
Exercises and extensions
Wide-format features: pivot
gl_monthly_summary.csvso each account is a column and each month is a row. (This is a common forecasting feature matrix.)Custom groupings: map accounts into “management categories” (e.g., marketing spend, shipping expense) and create a second monthly summary.
More QC checks: add rules to flag months where an expense account has negative activity, or where an account has no COA mapping.
Notes for maintainers
Keep ETL helpers small and testable. Shared code lives in
scripts/_business_etl.py.Prefer deterministic outputs (stable sorting, fixed seed). It keeps unit tests robust and makes docs examples repeatable.