.. _track_d_my_own_data:

============================================
Track D: Apply what you learned to your data
============================================

This page is the *bridge* between the Track D running case (LedgerLab + NSO v1)
and your own accounting / bookkeeping / finance data.

**The promise of Track D:** you leave with analyst habits that transfer:

- think in **tables** (grain + keys + joins)
- write **contracts** before analysis
- run **checks** (duplicates, missingness, reconciliations)
- produce **reproducible outputs** (tables + charts + short memo)

.. tip::

   If your starting point is a **real export** (GnuCash / QuickBooks / bank / invoices),
   begin with :doc:`track_d_byod` to normalize it into Track D’s canonical tables.
   Then come back to this page for the analysis checklist.


If you haven’t yet, skim these pages first:

- :doc:`track_d_student_edition` (entry point)
- :doc:`track_d_dataset_map` (what the Track D tables are)
- :doc:`track_d_outputs_guide` (what the Track D scripts write)

Two paths
=========

Most student projects fall into one of these paths.
Pick the one that matches what you *actually* have today.

.. list-table:: Choose your path
   :header-rows: 1
   :widths: 28 72

   * - Path
     - When it’s a good fit
   * - **Path A: exports only**
     - You have bank transactions, invoices, bills, payroll exports, but **no** clean GL detail export.
       You can still build a monthly "TB-style" table (a consistent monthly rollup) and do real analysis.
   * - **Path B: GL export**
     - You have a **General Ledger detail** (or journal export) with debits/credits by account.
       You can validate fast and get to trial-balance and statement style outputs quickly.

The "first 30 minutes" recipe below is intentionally minimal.
You can make it fancier later.

Setup once (recommended)
========================

Create a small project folder structure and keep raw exports untouched.

.. code-block:: text

   my_trackd_project/
     raw/        # untouched exports
     working/    # renamed columns, cleaned values
     outputs/    # your tables + charts + short memo
     notes/      # assumptions, mapping notes, QA notes

**Privacy note:** remove names / emails / account numbers before sharing.

Path A: exports only (bank + invoices/bills)
============================================

You can do meaningful analysis without a full GL.
Your first goal is a **monthly rollup table** you can trust.
Think of it as "trial-balance style": consistent columns, consistent sign conventions,
reproducible from your exports.

Minimum required columns
------------------------

**Bank transactions (CSV)**

- ``posted_date`` (or ``date``): transaction date (parseable)
- ``description``: text description
- ``amount``: numeric amount (see sign conventions below)
- *(strongly recommended)* ``bank_txn_id``: unique id from the bank export

**Invoices / sales (CSV)** (if you have them)

- ``invoice_id``: unique id
- ``invoice_date`` (or ``date``): invoice date
- ``amount``: invoice total (or subtotal + tax)
- *(optional but very useful)* ``customer_id`` or ``customer``
- *(optional)* ``paid_date`` or ``status``

Sign conventions (decide early)
-------------------------------

Pick one convention and stick to it:

- **cash inflows are positive**
- **cash outflows are negative**

If your bank export uses the opposite (or splits debit/credit into separate columns),
normalize it during the "working" step.

First 30 minutes recipe
-----------------------

1) Copy exports into ``raw/``.
2) Create cleaned versions in ``working/`` with consistent headers.
3) Produce **one table + one chart + one sanity check**.

Example script (save as ``working/path_a_30min.py`` and run with ``python working/path_a_30min.py``):

.. code-block:: python

   from pathlib import Path
   import pandas as pd
   import matplotlib.pyplot as plt

   ROOT = Path(__file__).resolve().parents[1]
   raw_bank = ROOT / "raw" / "bank.csv"
   out_dir = ROOT / "outputs"
   out_dir.mkdir(exist_ok=True)

   bank = pd.read_csv(raw_bank)

   # --- Required columns (rename if needed)
   required = {"posted_date", "description", "amount"}
   missing = required - set(bank.columns)
   if missing:
       raise ValueError(f"Bank export missing columns: {sorted(missing)}")

   # --- Dates
   bank["posted_date"] = pd.to_datetime(bank["posted_date"], errors="coerce")
   if bank["posted_date"].isna().any():
       bad = bank.loc[bank["posted_date"].isna(), :].head(5)
       raise ValueError(f"Some bank dates failed to parse. Examples:\n{bad}")

   # --- Amounts
   bank["amount"] = pd.to_numeric(bank["amount"], errors="coerce")
   if bank["amount"].isna().any():
       bad = bank.loc[bank["amount"].isna(), :].head(5)
       raise ValueError(f"Some bank amounts failed to parse. Examples:\n{bad}")

   # --- Optional QA: duplicate IDs
   if "bank_txn_id" in bank.columns:
       dup_ids = bank.loc[bank["bank_txn_id"].duplicated(), "bank_txn_id"].unique()
       if len(dup_ids) > 0:
           print(f"WARNING: duplicate bank_txn_id (showing up to 10): {dup_ids[:10]}")

   # --- Month rollup
   bank["month"] = bank["posted_date"].dt.to_period("M").astype(str)

   g = bank.groupby("month")["amount"]
   monthly = pd.DataFrame(
       {
           "cash_in": g.apply(lambda s: s[s > 0].sum()),
           "cash_out": g.apply(lambda s: -s[s < 0].sum()),
           "net_cash": g.sum(),
           "n_txns": g.size(),
       }
   ).reset_index()

   # Sanity check: net_cash = cash_in - cash_out
   max_err = (monthly["cash_in"] - monthly["cash_out"] - monthly["net_cash"]).abs().max()
   if max_err > 1e-6:
       raise ValueError(f"Sanity check failed (cash_in - cash_out != net_cash). max_err={max_err}")

   # Save the table
   monthly.to_csv(out_dir / "monthly_cash_summary.csv", index=False)

   # One chart: net cash by month
   plt.figure()
   plt.plot(monthly["month"], monthly["net_cash"], marker="o")
   plt.xticks(rotation=45, ha="right")
   plt.title("Net cash flow by month")
   plt.tight_layout()
   plt.savefig(out_dir / "net_cash_by_month.png", dpi=150)

   print("Wrote:")
   print(" -", out_dir / "monthly_cash_summary.csv")
   print(" -", out_dir / "net_cash_by_month.png")

What you just did is the Track D pattern:

- define a contract (required columns)
- normalize types (dates, numeric)
- run one QA check (duplicates)
- produce a reproducible table + chart

From there, you can grow into richer questions (cash forecasting, seasonality, outliers,
customer payment behavior if you have invoices).

Path B: GL export (journal / general ledger detail)
===================================================

If you have GL detail, you can validate and build trial-balance style tables fast.

Minimum required columns
------------------------

At minimum:

- ``date``: transaction date
- ``account_id`` (or ``account``): account identifier
- ``debit``: numeric debit amount (0 if none)
- ``credit``: numeric credit amount (0 if none)

Strongly recommended:

- ``txn_id`` (or ``journal_entry_id``): lets you check balance per transaction
- ``description``

Even better (optional):

- a Chart of Accounts export you can join on ``account_id`` with ``account_name`` and ``account_type``

First 30 minutes recipe
-----------------------

Example script (save as ``working/path_b_30min.py`` and run with ``python working/path_b_30min.py``):

.. code-block:: python

   from pathlib import Path
   import pandas as pd
   import matplotlib.pyplot as plt

   ROOT = Path(__file__).resolve().parents[1]
   raw_gl = ROOT / "raw" / "gl_detail.csv"
   out_dir = ROOT / "outputs"
   out_dir.mkdir(exist_ok=True)

   gl = pd.read_csv(raw_gl)

   required = {"date", "account_id", "debit", "credit"}
   missing = required - set(gl.columns)
   if missing:
       raise ValueError(f"GL export missing columns: {sorted(missing)}")

   gl["date"] = pd.to_datetime(gl["date"], errors="coerce")
   if gl["date"].isna().any():
       bad = gl.loc[gl["date"].isna(), :].head(5)
       raise ValueError(f"Some GL dates failed to parse. Examples:\n{bad}")

   for col in ["debit", "credit"]:
       gl[col] = pd.to_numeric(gl[col], errors="coerce").fillna(0.0)

   gl["month"] = gl["date"].dt.to_period("M").astype(str)

   # Check 1: debits = credits per month (basic reconciliation)
   monthly_dc = gl.groupby("month")[["debit", "credit"]].sum().reset_index()
   monthly_dc["diff"] = monthly_dc["debit"] - monthly_dc["credit"]

   max_diff = monthly_dc["diff"].abs().max()
   if max_diff > 1e-6:
       print(monthly_dc)
       raise ValueError(
           "Debits != credits by month. This often means the export was filtered, "
           "or some lines are missing. Fix this before you trust any analysis."
       )

   # Optional Check 2: debits = credits per txn_id (stronger)
   if "txn_id" in gl.columns:
       per_txn = gl.groupby("txn_id")[["debit", "credit"]].sum()
       per_txn["diff"] = per_txn["debit"] - per_txn["credit"]
       bad = per_txn[per_txn["diff"].abs() > 1e-6]
       if len(bad) > 0:
           raise ValueError(
               f"Found {len(bad)} unbalanced txn_id rows. "
               "Your export may be incomplete or mis-keyed."
           )

   # Build a monthly TB-style table: debit, credit, net (debit-positive convention)
   tb = gl.groupby(["month", "account_id"], as_index=False)[["debit", "credit"]].sum()
   tb["net"] = tb["debit"] - tb["credit"]

   tb.to_csv(out_dir / "tb_monthly.csv", index=False)

   # One chart: total debits and credits by month (a quick integrity picture)
   plt.figure()
   plt.plot(monthly_dc["month"], monthly_dc["debit"], marker="o", label="debit")
   plt.plot(monthly_dc["month"], monthly_dc["credit"], marker="o", label="credit")
   plt.xticks(rotation=45, ha="right")
   plt.title("GL totals by month (should match)")
   plt.legend()
   plt.tight_layout()
   plt.savefig(out_dir / "gl_debits_credits_by_month.png", dpi=150)

   print("Wrote:")
   print(" -", out_dir / "tb_monthly.csv")
   print(" -", out_dir / "gl_debits_credits_by_month.png")

You now have:

- a **reconciliation check** (debits == credits)
- a **monthly TB-style table** you can analyze
- a **chart** you can put in a memo

From here, you can add one join (Chart of Accounts) and start doing Track D-style
statement rollups or account-level diagnostics.

Common pitfalls (and what to do)
================================

Date parsing
------------

- Mixed formats (``MM/DD/YYYY`` vs ``YYYY-MM-DD``) cause silent errors.
- Always use ``pd.to_datetime(..., errors="coerce")`` and *fail fast* if you get nulls.

Signs and "negative" exports
----------------------------

- Some systems export credits as negative numbers instead of a separate credit column.
- Decide your convention, normalize once, and document it in ``notes/assumptions.txt``.

Duplicate IDs
-------------

- Bank feeds can duplicate rows (Track D includes this on purpose).
- If you have IDs, check duplicates and decide: remove exact duplicates, or keep and flag.

Missing months / incomplete exports
-----------------------------------

- If debits != credits by month, the export is likely filtered or incomplete.
- Fix the export first. Don’t "patch" it with analysis code.

Month boundaries
----------------

- Accruals, backdated entries, and late postings create confusing month swings.
- Track D teaches the right response: *describe it, measure it, and document the limitation.*

Template header pack (copy/paste)
=================================

If you want a starting point without hunting for column names, these tiny CSV templates
are safe to download and copy into your own project.

- :download:`gl_detail_minimal.csv <_downloads/track_d_headers/gl_detail_minimal.csv>`
- :download:`chart_of_accounts_minimal.csv <_downloads/track_d_headers/chart_of_accounts_minimal.csv>`
- :download:`bank_transactions_minimal.csv <_downloads/track_d_headers/bank_transactions_minimal.csv>`
- :download:`invoices_minimal.csv <_downloads/track_d_headers/invoices_minimal.csv>`

Next steps
==========

Once you can produce one table + one chart + one check, you are ready for:

- a short "executive summary" memo (what you found + what limits confidence)
- deeper Track D-style joins (drivers, segmentation)
- classroom labs and rubrics (the next Track D docs pages)