Business Chapter 7: Preparing accounting data for analysis

PyPI workbook run (Track D)

From inside your Track D workbook folder (created by pystatsv1 workbook init --track d --dest ...), run:

pystatsv1 workbook run |trackd_run|

Outputs are written under outputs/track_d/ by default. If you’re unsure what a file is for, start with Track D Outputs Guide.

To see the full chapter-by-chapter run map (D00–D23), see Track D chapter index (PyPI).

Optional: write to a custom output folder:

pystatsv1 workbook run |trackd_run| --outdir outputs/track_d_custom

Interpretation prompts (quick self-check):

What is the accounting or business measurement goal in this chapter?
Which invariant/check would catch a “numbers look fine but are wrong” mistake here?

This chapter is the “bridge” between bookkeeping outputs (CSV exports, trial balances, etc.) and the kinds of tables you want for statistics, forecasting, and dashboards.

In real organizations, data prep is often the highest-leverage work:

Raw general ledger exports are not model-friendly.
The chart of accounts (COA) contains the semantic labels you need.
Most analytics workflows want one amount column, not separate debit/credit.

Chapter 7 produces two small, reusable datasets:

gl_tidy.csv (line-level): one row per GL line, with COA labels and a single signed amount.
gl_monthly_summary.csv (rollup): a monthly activity table by account.

What you should be able to do after this chapter

Explain why accountants track debits/credits, but analysts often prefer one signed amount column.
Build a tidy, analysis-ready GL table that preserves auditability.
Create a monthly roll-up suitable for time-series work and forecasting.
Run basic data-quality checks before modeling.

Inputs and outputs

Inputs (from the NSO v1 dataset folder)

gl_journal.csv: the line-level journal export
chart_of_accounts.csv: the COA lookup table

Outputs (written to `--outdir`)

gl_tidy.csv
gl_monthly_summary.csv
ch07_summary.json (counts + light QC signals)

Running the chapter

If you have not generated the NSO v1 dataset yet:

make business-nso-sim

Then run Chapter 7:

make business-ch07

Or run the module directly:

python -m scripts.business_ch07_preparing_accounting_data_for_analysis \
  --datadir data/synthetic/nso_v1 \
  --outdir outputs/track_d \
  --seed 123

The two output tables

1) `gl_tidy.csv` (line-level)

The raw GL has debit and credit columns. That is perfect for accounting integrity, but awkward for analysis.

In gl_tidy.csv we keep the audit fields (txn_id, doc_id, date, description) and add COA labels:

account_name
account_type
normal_side (debit or credit)

We also add two derived amount fields:

raw_amount = debit - credit (debit positive, credit negative)
amount = signed change in the account’s normal-side direction

Why two amount columns?

raw_amount is a faithful numeric encoding of the journal line.
amount is usually what you want for analytics (“did the account go up or down, in the normal direction?”).

To make each row easy to reference, we add:

line_no: the line number within a transaction
gl_line_id: <txn_id>-<line_no>

2) `gl_monthly_summary.csv` (monthly roll-up)

This is a grouped summary of the tidy GL:

group by month + account
sum debit, credit, and amount

This produces a table that is immediately useful for:

trend charts and dashboards
anomaly detection (spikes in monthly activity)
forecasting models (per-account time series)

Reading `ch07_summary.json`

The summary JSON is a small “run report” that answers:

How many rows were produced?
Did we successfully join to the COA?
Did we have any unparseable dates?

It is not meant to be exhaustive; it is meant to provide quick confidence before you begin deeper analysis.

Exercises and extensions

Wide-format features: pivot gl_monthly_summary.csv so each account is a column and each month is a row. (This is a common forecasting feature matrix.)
Custom groupings: map accounts into “management categories” (e.g., marketing spend, shipping expense) and create a second monthly summary.
More QC checks: add rules to flag months where an expense account has negative activity, or where an account has no COA mapping.

Notes for maintainers

Keep ETL helpers small and testable. Shared code lives in scripts/_business_etl.py.
Prefer deterministic outputs (stable sorting, fixed seed). It keeps unit tests robust and makes docs examples repeatable.