Business Chapter 7: Preparing accounting data for analysis ========================================================== .. |trackd_run| replace:: d07 .. include:: _includes/track_d_run_strip.rst This chapter is the “bridge” between bookkeeping outputs (CSV exports, trial balances, etc.) and the kinds of tables you want for statistics, forecasting, and dashboards. In real organizations, *data prep* is often the highest-leverage work: - Raw general ledger exports are not model-friendly. - The chart of accounts (COA) contains the semantic labels you need. - Most analytics workflows want *one amount column*, not separate debit/credit. Chapter 7 produces two small, reusable datasets: - **gl_tidy.csv** (line-level): one row per GL line, with COA labels and a single signed amount. - **gl_monthly_summary.csv** (rollup): a monthly activity table by account. What you should be able to do after this chapter ------------------------------------------------ - Explain why accountants track debits/credits, but analysts often prefer one signed amount column. - Build a tidy, analysis-ready GL table that preserves auditability. - Create a monthly roll-up suitable for time-series work and forecasting. - Run basic data-quality checks before modeling. Inputs and outputs ------------------ Inputs (from the NSO v1 dataset folder) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - ``gl_journal.csv``: the line-level journal export - ``chart_of_accounts.csv``: the COA lookup table Outputs (written to ``--outdir``) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - ``gl_tidy.csv`` - ``gl_monthly_summary.csv`` - ``ch07_summary.json`` (counts + light QC signals) Running the chapter ------------------- If you have not generated the NSO v1 dataset yet: .. code-block:: bash make business-nso-sim Then run Chapter 7: .. code-block:: bash make business-ch07 Or run the module directly: .. code-block:: bash python -m scripts.business_ch07_preparing_accounting_data_for_analysis \ --datadir data/synthetic/nso_v1 \ --outdir outputs/track_d \ --seed 123 The two output tables --------------------- 1) ``gl_tidy.csv`` (line-level) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ The *raw* GL has ``debit`` and ``credit`` columns. That is perfect for accounting integrity, but awkward for analysis. In ``gl_tidy.csv`` we keep the audit fields (``txn_id``, ``doc_id``, ``date``, ``description``) and add COA labels: - ``account_name`` - ``account_type`` - ``normal_side`` (debit or credit) We also add two derived amount fields: - ``raw_amount`` = ``debit - credit`` (debit positive, credit negative) - ``amount`` = signed change in the account’s **normal-side** direction Why two amount columns? - ``raw_amount`` is a faithful numeric encoding of the journal line. - ``amount`` is usually what you want for analytics (“did the account go up or down, in the normal direction?”). To make each row easy to reference, we add: - ``line_no``: the line number within a transaction - ``gl_line_id``: ``-`` 2) ``gl_monthly_summary.csv`` (monthly roll-up) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ This is a grouped summary of the tidy GL: - group by ``month`` + account - sum ``debit``, ``credit``, and ``amount`` This produces a table that is immediately useful for: - trend charts and dashboards - anomaly detection (spikes in monthly activity) - forecasting models (per-account time series) Reading ``ch07_summary.json`` ----------------------------- The summary JSON is a small “run report” that answers: - How many rows were produced? - Did we successfully join to the COA? - Did we have any unparseable dates? It is not meant to be exhaustive; it is meant to provide quick confidence before you begin deeper analysis. Exercises and extensions ------------------------ 1. **Wide-format features:** pivot ``gl_monthly_summary.csv`` so each account is a column and each month is a row. (This is a common forecasting feature matrix.) 2. **Custom groupings:** map accounts into “management categories” (e.g., *marketing spend*, *shipping expense*) and create a second monthly summary. 3. **More QC checks:** add rules to flag months where an expense account has negative activity, or where an account has no COA mapping. Notes for maintainers --------------------- - Keep ETL helpers small and testable. Shared code lives in ``scripts/_business_etl.py``. - Prefer deterministic outputs (stable sorting, fixed seed). It keeps unit tests robust and makes docs examples repeatable.