Track D: Apply what you learned to your data

This page is the bridge between the Track D running case (LedgerLab + NSO v1) and your own accounting / bookkeeping / finance data.

The promise of Track D: you leave with analyst habits that transfer:

  • think in tables (grain + keys + joins)

  • write contracts before analysis

  • run checks (duplicates, missingness, reconciliations)

  • produce reproducible outputs (tables + charts + short memo)

Tip

If your starting point is a real export (GnuCash / QuickBooks / bank / invoices), begin with Track D BYOD: Bring Your Own Data to normalize it into Track D’s canonical tables. Then come back to this page for the analysis checklist.

If you haven’t yet, skim these pages first:

Two paths

Most student projects fall into one of these paths. Pick the one that matches what you actually have today.

Choose your path

Path

When it’s a good fit

Path A: exports only

You have bank transactions, invoices, bills, payroll exports, but no clean GL detail export. You can still build a monthly “TB-style” table (a consistent monthly rollup) and do real analysis.

Path B: GL export

You have a General Ledger detail (or journal export) with debits/credits by account. You can validate fast and get to trial-balance and statement style outputs quickly.

The “first 30 minutes” recipe below is intentionally minimal. You can make it fancier later.

Path A: exports only (bank + invoices/bills)

You can do meaningful analysis without a full GL. Your first goal is a monthly rollup table you can trust. Think of it as “trial-balance style”: consistent columns, consistent sign conventions, reproducible from your exports.

Minimum required columns

Bank transactions (CSV)

  • posted_date (or date): transaction date (parseable)

  • description: text description

  • amount: numeric amount (see sign conventions below)

  • (strongly recommended) bank_txn_id: unique id from the bank export

Invoices / sales (CSV) (if you have them)

  • invoice_id: unique id

  • invoice_date (or date): invoice date

  • amount: invoice total (or subtotal + tax)

  • (optional but very useful) customer_id or customer

  • (optional) paid_date or status

Sign conventions (decide early)

Pick one convention and stick to it:

  • cash inflows are positive

  • cash outflows are negative

If your bank export uses the opposite (or splits debit/credit into separate columns), normalize it during the “working” step.

First 30 minutes recipe

  1. Copy exports into raw/.

  2. Create cleaned versions in working/ with consistent headers.

  3. Produce one table + one chart + one sanity check.

Example script (save as working/path_a_30min.py and run with python working/path_a_30min.py):

from pathlib import Path
import pandas as pd
import matplotlib.pyplot as plt

ROOT = Path(__file__).resolve().parents[1]
raw_bank = ROOT / "raw" / "bank.csv"
out_dir = ROOT / "outputs"
out_dir.mkdir(exist_ok=True)

bank = pd.read_csv(raw_bank)

# --- Required columns (rename if needed)
required = {"posted_date", "description", "amount"}
missing = required - set(bank.columns)
if missing:
    raise ValueError(f"Bank export missing columns: {sorted(missing)}")

# --- Dates
bank["posted_date"] = pd.to_datetime(bank["posted_date"], errors="coerce")
if bank["posted_date"].isna().any():
    bad = bank.loc[bank["posted_date"].isna(), :].head(5)
    raise ValueError(f"Some bank dates failed to parse. Examples:\n{bad}")

# --- Amounts
bank["amount"] = pd.to_numeric(bank["amount"], errors="coerce")
if bank["amount"].isna().any():
    bad = bank.loc[bank["amount"].isna(), :].head(5)
    raise ValueError(f"Some bank amounts failed to parse. Examples:\n{bad}")

# --- Optional QA: duplicate IDs
if "bank_txn_id" in bank.columns:
    dup_ids = bank.loc[bank["bank_txn_id"].duplicated(), "bank_txn_id"].unique()
    if len(dup_ids) > 0:
        print(f"WARNING: duplicate bank_txn_id (showing up to 10): {dup_ids[:10]}")

# --- Month rollup
bank["month"] = bank["posted_date"].dt.to_period("M").astype(str)

g = bank.groupby("month")["amount"]
monthly = pd.DataFrame(
    {
        "cash_in": g.apply(lambda s: s[s > 0].sum()),
        "cash_out": g.apply(lambda s: -s[s < 0].sum()),
        "net_cash": g.sum(),
        "n_txns": g.size(),
    }
).reset_index()

# Sanity check: net_cash = cash_in - cash_out
max_err = (monthly["cash_in"] - monthly["cash_out"] - monthly["net_cash"]).abs().max()
if max_err > 1e-6:
    raise ValueError(f"Sanity check failed (cash_in - cash_out != net_cash). max_err={max_err}")

# Save the table
monthly.to_csv(out_dir / "monthly_cash_summary.csv", index=False)

# One chart: net cash by month
plt.figure()
plt.plot(monthly["month"], monthly["net_cash"], marker="o")
plt.xticks(rotation=45, ha="right")
plt.title("Net cash flow by month")
plt.tight_layout()
plt.savefig(out_dir / "net_cash_by_month.png", dpi=150)

print("Wrote:")
print(" -", out_dir / "monthly_cash_summary.csv")
print(" -", out_dir / "net_cash_by_month.png")

What you just did is the Track D pattern:

  • define a contract (required columns)

  • normalize types (dates, numeric)

  • run one QA check (duplicates)

  • produce a reproducible table + chart

From there, you can grow into richer questions (cash forecasting, seasonality, outliers, customer payment behavior if you have invoices).

Path B: GL export (journal / general ledger detail)

If you have GL detail, you can validate and build trial-balance style tables fast.

Minimum required columns

At minimum:

  • date: transaction date

  • account_id (or account): account identifier

  • debit: numeric debit amount (0 if none)

  • credit: numeric credit amount (0 if none)

Strongly recommended:

  • txn_id (or journal_entry_id): lets you check balance per transaction

  • description

Even better (optional):

  • a Chart of Accounts export you can join on account_id with account_name and account_type

First 30 minutes recipe

Example script (save as working/path_b_30min.py and run with python working/path_b_30min.py):

from pathlib import Path
import pandas as pd
import matplotlib.pyplot as plt

ROOT = Path(__file__).resolve().parents[1]
raw_gl = ROOT / "raw" / "gl_detail.csv"
out_dir = ROOT / "outputs"
out_dir.mkdir(exist_ok=True)

gl = pd.read_csv(raw_gl)

required = {"date", "account_id", "debit", "credit"}
missing = required - set(gl.columns)
if missing:
    raise ValueError(f"GL export missing columns: {sorted(missing)}")

gl["date"] = pd.to_datetime(gl["date"], errors="coerce")
if gl["date"].isna().any():
    bad = gl.loc[gl["date"].isna(), :].head(5)
    raise ValueError(f"Some GL dates failed to parse. Examples:\n{bad}")

for col in ["debit", "credit"]:
    gl[col] = pd.to_numeric(gl[col], errors="coerce").fillna(0.0)

gl["month"] = gl["date"].dt.to_period("M").astype(str)

# Check 1: debits = credits per month (basic reconciliation)
monthly_dc = gl.groupby("month")[["debit", "credit"]].sum().reset_index()
monthly_dc["diff"] = monthly_dc["debit"] - monthly_dc["credit"]

max_diff = monthly_dc["diff"].abs().max()
if max_diff > 1e-6:
    print(monthly_dc)
    raise ValueError(
        "Debits != credits by month. This often means the export was filtered, "
        "or some lines are missing. Fix this before you trust any analysis."
    )

# Optional Check 2: debits = credits per txn_id (stronger)
if "txn_id" in gl.columns:
    per_txn = gl.groupby("txn_id")[["debit", "credit"]].sum()
    per_txn["diff"] = per_txn["debit"] - per_txn["credit"]
    bad = per_txn[per_txn["diff"].abs() > 1e-6]
    if len(bad) > 0:
        raise ValueError(
            f"Found {len(bad)} unbalanced txn_id rows. "
            "Your export may be incomplete or mis-keyed."
        )

# Build a monthly TB-style table: debit, credit, net (debit-positive convention)
tb = gl.groupby(["month", "account_id"], as_index=False)[["debit", "credit"]].sum()
tb["net"] = tb["debit"] - tb["credit"]

tb.to_csv(out_dir / "tb_monthly.csv", index=False)

# One chart: total debits and credits by month (a quick integrity picture)
plt.figure()
plt.plot(monthly_dc["month"], monthly_dc["debit"], marker="o", label="debit")
plt.plot(monthly_dc["month"], monthly_dc["credit"], marker="o", label="credit")
plt.xticks(rotation=45, ha="right")
plt.title("GL totals by month (should match)")
plt.legend()
plt.tight_layout()
plt.savefig(out_dir / "gl_debits_credits_by_month.png", dpi=150)

print("Wrote:")
print(" -", out_dir / "tb_monthly.csv")
print(" -", out_dir / "gl_debits_credits_by_month.png")

You now have:

  • a reconciliation check (debits == credits)

  • a monthly TB-style table you can analyze

  • a chart you can put in a memo

From here, you can add one join (Chart of Accounts) and start doing Track D-style statement rollups or account-level diagnostics.

Common pitfalls (and what to do)

Date parsing

  • Mixed formats (MM/DD/YYYY vs YYYY-MM-DD) cause silent errors.

  • Always use pd.to_datetime(..., errors="coerce") and fail fast if you get nulls.

Signs and “negative” exports

  • Some systems export credits as negative numbers instead of a separate credit column.

  • Decide your convention, normalize once, and document it in notes/assumptions.txt.

Duplicate IDs

  • Bank feeds can duplicate rows (Track D includes this on purpose).

  • If you have IDs, check duplicates and decide: remove exact duplicates, or keep and flag.

Missing months / incomplete exports

  • If debits != credits by month, the export is likely filtered or incomplete.

  • Fix the export first. Don’t “patch” it with analysis code.

Month boundaries

  • Accruals, backdated entries, and late postings create confusing month swings.

  • Track D teaches the right response: describe it, measure it, and document the limitation.

Template header pack (copy/paste)

If you want a starting point without hunting for column names, these tiny CSV templates are safe to download and copy into your own project.

Next steps

Once you can produce one table + one chart + one check, you are ready for:

  • a short “executive summary” memo (what you found + what limits confidence)

  • deeper Track D-style joins (drivers, segmentation)

  • classroom labs and rubrics (the next Track D docs pages)