I'm self-employed. I didn't track a single expense in 2025. Not one. But here's the thing — I knew I didn't have to. Every business expense generates at least one email: a receipt, an invoice, a confirmation. I just let them pile up. 24,000 emails worth of “I'll deal with it later.”
Then Google launched their Workspace CLI — a proper command-line interface for Gmail, Drive, and Sheets. I'd been wanting to build something real with it, and tax season gave me the perfect excuse.
So I did what any reasonable person would do: I built an AI-powered pipeline to scan every single email, classify the expenses, download the receipts, and produce a color-coded Google Sheet. It took a weekend, Claude Code as my copilot, and more crashes than I'd like to admit.
Here's everything that broke — and how it eventually didn't.
## The Stack
- **gws** — Google's new-ish CLI for Gmail, Drive, and Sheets APIs. Think `aws` CLI but for Google. Installed with `npm install -g @googleworkspace/cli`. It handles OAuth, pagination, and outputs JSON. It's rough around the edges but surprisingly powerful.
- **claude -p** (print mode) — to pipe email data in and get structured JSON out.
- **Anthropic SDK** — for the one thing `claude -p` couldn't handle: reading PDF receipts directly. The SDK's document understanding sends the raw PDF to Claude and extracts amounts.

## Design
The key design decision was separating what AI should decide from what code should execute. Claude doesn't call APIs. Claude doesn't download files. Claude doesn't upload to Drive. Claude does exactly one thing: look at email metadata and decide “is this a business expense?”
Everything else is deterministic. The agent decides, the code executes.
```
run.sh (orchestrator)
|-- Phase 1: fetch.sh    (deterministic - Gmail API)
|-- Phase 2: classify.py (agentic - Claude AI)
|-- Phase 3: save.sh     (deterministic - Drive/Sheets API)
```
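The contract between the two halves is tiny. Here's a minimal Python sketch of it — the names and shapes are illustrative, not the actual `run.sh`:

```python
from typing import Callable, Iterable

def run_pipeline(
    emails: Iterable[dict],
    ask: Callable[[dict], dict],            # agentic: wraps a `claude -p` call
    execute: Callable[[dict, dict], None],  # deterministic: APIs, files, sheets
) -> None:
    for meta in emails:
        decision = ask(meta)          # AI decides: is this a business expense?
        if decision.get("expense"):
            execute(decision, meta)   # code executes: download, upload, append

```

Because the AI half never touches an API, you can swap `ask` for a stub and test the whole pipeline offline.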
## Phase 1: Fetch
The first version fetched the full body of every email. At ~50-100KB each, that's potentially 2GB of data and hours of API calls. It broke.
- **Pagination.** The gws CLI paginates results — 100 messages per page. My first attempt only read the first page: "only found 100 messages" when I had 24,000. Fix: `--page-all --page-limit 99999` and merging pages with `jq -s`.
- **Associative arrays.** My bash didn't support associative arrays (`declare -A`). Every script that used them crashed with "invalid option". Fix: track seen IDs in a plain file with `grep -Fxq` instead.
- **Payload size.** Then I found `format=metadata` — a Gmail API parameter that returns only headers (From, Subject, Date) and a snippet. Each message dropped from ~50KB to ~1.5KB. The full fetch went from "overnight" to about an hour.
- **Malformed NDJSON.** The gws CLI outputs pretty-printed JSON (multi-line), but I was appending to an NDJSON file (one JSON object per line). The metadata extraction script choked on the malformed lines. Fix: compact with `jq -c '.'` before appending.

The whole phase is resumable. It appends to an NDJSON file and tracks which message IDs are already fetched. Kill it, restart it, it picks up where it left off.
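The resumable append loop is simple to sketch in Python (field names are illustrative; the real pipeline does this in bash with `jq -c`):

```python
import json
import os

def fetched_ids(ndjson_path: str) -> set:
    """IDs already on disk; a restarted run skips these."""
    ids = set()
    if os.path.exists(ndjson_path):
        with open(ndjson_path) as f:
            for line in f:
                if line.strip():
                    ids.add(json.loads(line)["id"])
    return ids

def append_compact(ndjson_path: str, record: dict) -> None:
    """One JSON object per line: the Python equivalent of `jq -c '.'`."""
    with open(ndjson_path, "a") as f:
        f.write(json.dumps(record, separators=(",", ":")) + "\n")
```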
## Phase 2: Classify
Sending 24,000 emails to Claude for classification would burn through tokens and context windows. The solution: a two-pass filter.
Pass 1 — Triage (cheap): Send only the sender and subject line, in batches of 100. Claude returns a list of IDs that could be expenses. This filters 24,000 emails down to ~2,000 candidates in about 240 Claude calls. Most emails are obviously not expenses — newsletters, GitHub notifications, marketing — and the subject line is enough to tell.
Pass 2 — Classify (accurate): For only the ~2,000 candidates, send the full metadata (from, subject, snippet, date) in batches of 20. Claude returns structured JSON: vendor name, amount, category, confidence level, duplicate detection, reasoning.
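The batching itself is plain code; only the classification is AI. A sketch of the chunking and the pass-1 payload shaping, with assumed field names:

```python
def chunks(items: list, size: int):
    """Split work into fixed-size batches (100 for triage, 20 for classify)."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

def triage_payload(batch: list) -> list:
    """Pass 1 sends only sender + subject: the cheapest signal that works."""
    return [{"id": m["id"], "from": m["from"], "subject": m["subject"]}
            for m in batch]
```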
- **Claude doesn't always return bare JSON.** Sometimes the response has extra text after the JSON value, and `json.loads()` chokes on that. Fix: `json.JSONDecoder().raw_decode()`, which parses the first valid JSON value and ignores everything after it.
- **Failed batches.** The loop just does `continue` on failure — failed batches stay unprocessed and automatically retry on the next run.

Both passes save progress after every batch. You can kill the process at any point and not lose work.
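The `raw_decode` trick is small but worth showing; the leading-character skip here is my addition, for responses that open with prose:

```python
import json

def first_json(text: str):
    """Parse the first JSON value in `text`, ignoring any trailing prose."""
    starts = [i for i in (text.find("{"), text.find("[")) if i != -1]
    value, _end = json.JSONDecoder().raw_decode(text, min(starts, default=0))
    return value
```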
## Phase 3: Save
This phase takes each classified expense, fetches the full email (now we need the body — but only for ~200 emails, not 24,000), downloads receipt attachments, uploads them to Google Drive, and appends a row to a Google Sheet.
- **Shell escaping.** My `jq` expression to find attachments used `!=` operators, which got mangled by bash's shell escaping. Every email showed "no attachment found" even when attachments existed. Fix: `has()` and `length > 0` instead of `!=`.
- **Base64url attachments.** The gws CLI has a `-o` flag for output, but it doesn't decode the response. Gmail's API returns attachments as base64url-encoded JSON, not raw files. The `-o` flag just wrote the JSON blob to disk. Fix: extract the `.data` field with `jq`, convert base64url to base64 (`tr '_-' '/+'`), then `base64 -d` to decode.
- **Flag ordering.** The `--parent` flag wasn't being applied correctly on Drive uploads. Reordering the arguments (`--parent` before the file path) fixed it.

A few of the resulting rows:

| Date | Vendor | Amount | Category | Conf. |
|---|---|---|---|---|
| Jan 3 | AWS | $127.43 | SaaS | 98% |
| Jan 5 | Uber | $23.50 | Travel | 95% |
| Jan 8 | GitHub | $44.00 | SaaS | 99% |
| Feb 1 | Delta | $342.00 | Travel | 94% |
| Feb 14 | OpenAI | $48.23 | SaaS | 97% |
| Mar 1 | WeWork | $299.00 | Office | 92% |
| Apr 3 | Anthropic | $86.40 | SaaS | 96% |
| May 7 | DigitalOcean | $24.00 | SaaS | 97% |
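The base64url dance from the attachment fix, in Python instead of `tr`/`base64`. Gmail's `.data` field uses the `-`/`_` alphabet and often arrives without padding:

```python
import base64

def decode_attachment(data: str) -> bytes:
    """Equivalent of `tr '_-' '/+' | base64 -d`: restore padding, then
    decode Gmail's base64url alphabet back into raw file bytes."""
    padded = data + "=" * (-len(data) % 4)
    return base64.urlsafe_b64decode(padded)
```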
## Phase 3.5: Missing Amounts
Since we already fetch the full message in Phase 3 (to get attachments), I added an amount extraction step that piggybacks on that fetch. For any expense with a missing amount:
1. First, try `claude -p` on the email body (uses my Claude subscription, no extra cost). Works great for Stripe receipts, invoice emails, subscription confirmations.
2. If that comes up empty and there's a PDF receipt, fall back to the Anthropic SDK's document understanding, which sends the raw PDF to the paid API.

This two-tier approach means most amounts are extracted for free via the CLI, and only the tricky PDF-only cases hit the paid API.
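The fallback order as a sketch, with the two extractors injected. In the real pipeline tier 1 shells out to `claude -p` and tier 2 calls the SDK; here they're stand-in callables:

```python
from typing import Callable, Optional

def extract_amount(
    body: str,
    pdf_path: Optional[str],
    cli_extract: Callable[[str], Optional[float]],  # free: claude -p on the body
    sdk_extract: Callable[[str], Optional[float]],  # paid: SDK on the PDF
) -> Optional[float]:
    amount = cli_extract(body)
    if amount is None and pdf_path is not None:
        amount = sdk_extract(pdf_path)  # only PDF-only cases hit the paid API
    return amount
```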
## Safety
Giving an AI pipeline access to your full Gmail account is terrifying. The gws CLI can do everything — send emails, delete messages, modify labels, trash conversations. I did not want any of that.
The solution: gws-safe, a bash wrapper that intercepts every gws command and checks it against an allowlist of exactly 8 permitted operations. Everything else is blocked and logged. If the pipeline ever tries to send an email, delete a message, or do anything unexpected, it gets stopped cold and the attempt is recorded.
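The real gws-safe is a bash wrapper, but the core check fits in a few lines of Python. The operation names below are made up for illustration; the actual allowlist names exactly 8 operations:

```python
ALLOWED = {
    # illustrative entries: the real allowlist permits 8 specific operations
    ("gmail", "messages", "list"),
    ("gmail", "messages", "get"),
    ("drive", "files", "upload"),
    ("sheets", "values", "append"),
}

def check(argv: list) -> bool:
    """Allow a command only if its first three words match the allowlist;
    everything else (send, delete, trash, ...) is blocked and logged."""
    if tuple(argv[:3]) not in ALLOWED:
        print(f"BLOCKED: {' '.join(argv)}")  # stand-in for the audit log
        raise PermissionError(" ".join(argv))
    return True
```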
## Resilience
Every phase saves progress to disk after each batch:
```
data/message_ids.json        # which messages exist
data/messages_raw.ndjson     # which are fetched
data/triage_progress.json    # which batches triaged
data/classify_progress.json  # which batches classified
data/save_progress.json      # which expenses saved
```
Kill the process at any point. Re-run ./run.sh. It checks what's done, skips it, and picks up exactly where it left off. This saved me dozens of times during development — OAuth token expired mid-fetch, Claude rate limit hit during classification, laptop went to sleep during save. Every time: just re-run.
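The pattern behind each progress file is the same. The write-then-rename detail is my addition; without it, a crash mid-write can corrupt the progress file itself:

```python
import json
import os

def load_progress(path: str) -> set:
    """Which batches are already done; empty on the very first run."""
    if os.path.exists(path):
        with open(path) as f:
            return set(json.load(f))
    return set()

def mark_done(path: str, done: set, batch_id: int) -> None:
    """Record a finished batch; write to a temp file and rename so the
    progress file stays valid even if the process dies mid-save."""
    done.add(batch_id)
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(sorted(done), f)
    os.replace(tmp, path)
```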
## Result
A Google Sheet with every business expense from 2025, sorted by confidence. Each row has date, vendor, amount, currency, category, a receipt link (in Drive), an email link (to the original Gmail message), and notes.
A Google Drive folder with every receipt PDF, renamed to `DATE_VENDOR_AMOUNT.pdf`.
A local CSV backup of everything.
## Takeaways
Bash was the wrong choice for Phase 3. It started simple — pipe `gws` to `jq`, done. But by the end, `save.sh` was doing base64 decoding, nested JSON traversal, calling Python scripts, and fighting shell quoting at every turn. Python from the start would have been cleaner.
Gmail snippets are unreliable. For plain text emails, snippets are great. For HTML-heavy emails (which is every receipt and invoice), the snippet is often garbage — invisible characters, repeated subject lines, or just empty. Don't rely on them for data extraction.
The two-pass classification was worth it. Sending 24,000 subjects to Claude costs maybe $2 in tokens. Sending 24,000 full emails would have cost 50x more and taken 10x longer. The triage pass eliminated 90% of emails at 1% of the cost.
Crash safety isn't optional for pipelines. I didn't add resume logic because I'm careful. I added it because the first run crashed 47 times. OAuth expired. Rate limits hit. Laptop slept. Process killed. Internet dropped. Every single crash would have meant starting over without it.
The gws CLI is powerful but young. Pagination quirks, base64 attachment encoding, flag ordering issues — it works, but you'll hit edges. The documentation is minimal. Expect to experiment.
AI for classification, code for execution. The temptation is to let the AI do everything — "here's my Gmail, find my expenses." That's slow, expensive, and unreliable. The right split: AI makes decisions (is this an expense?), deterministic code executes (download, upload, format). The pipeline is 80% boring bash/Python and 20% Claude. That's the right ratio.
Built with Claude Code, gws CLI, and the Anthropic SDK. A weekend of debugging. Total cost in API tokens: ~$3.