I Built a Tax Expense Scanner with AI and It Broke 47 Times Before It Worked

automation · 12 min read · gmail api · claude · gws cli

I'm self-employed. I didn't track a single expense in 2025. Not one. But here's the thing — I knew I didn't have to. Every business expense generates at least one email: a receipt, an invoice, a confirmation. I just let them pile up. 24,000 emails worth of “I'll deal with it later.”

Then Google launched their Workspace CLI — a proper command-line interface for Gmail, Drive, and Sheets. I'd been wanting to build something real with it, and tax season gave me the perfect excuse.

So I did what any reasonable person would do: I built an AI-powered pipeline to scan every single email, classify the expenses, download the receipts, and produce a color-coded Google Sheet. It took a weekend, Claude Code as my copilot, and more crashes than I'd like to admit.

Here's everything that broke — and how it eventually didn't.

The Stack

The Tools

  • Google Workspace CLI (gws) — Google's new-ish CLI for Gmail, Drive, and Sheets APIs. Think aws cli but for Google. Installed with npm install -g @googleworkspace/cli. It handles OAuth, pagination, and outputs JSON. It's rough around the edges but surprisingly powerful.
  • Claude Code — Anthropic's CLI for Claude. I used it both as my coding assistant (to build the whole thing) and as the AI backend for classification, running claude -p (print mode) to pipe email data in and get structured JSON out.
  • Anthropic Python SDK — For the one task claude -p couldn't handle: reading PDF receipts directly. The SDK's document understanding sends the raw PDF to Claude and extracts amounts.
  • jq — The Swiss Army knife holding all the bash scripts together. JSON parsing, filtering, transforming — it does everything.
  • Bash + Python — Bash for the deterministic plumbing (fetch, download, upload), Python for the parts that needed real logic (classification batching, amount extraction).

Design

The Architecture
80% Deterministic, 20% Agentic

The key design decision was separating what AI should decide from what code should execute. Claude doesn't call APIs. Claude doesn't download files. Claude doesn't upload to Drive. Claude does exactly one thing: look at email metadata and decide “is this a business expense?”

Everything else is deterministic. The agent decides, the code executes.

run.sh (orchestrator)
  |-- Phase 1: fetch.sh    (deterministic - Gmail API)
  |-- Phase 2: classify.py (agentic - Claude AI)
  |-- Phase 3: save.sh     (deterministic - Drive/Sheets API)

Phase 1

Fetching 24,000 Emails
(Without Downloading 24,000 Emails)

The first version fetched the full body of every email. At ~50-100KB each, that's potentially 2GB of data and hours of API calls. It broke.

Broke: The gws CLI paginates results — 100 messages per page. My first attempt only read the first page: "only found 100 messages" when I had 24,000.
Fixed: --page-all --page-limit 99999, merging pages with jq -s.
Broke: macOS ships with Bash 3.2 from 2007 — no associative arrays (declare -A). Every script that used them crashed with "invalid option".
Fixed: Rewrote all lookups to use file-based grep -Fxq instead.
Broke: Even with pagination fixed, fetching full message bodies for 24K emails was insanely slow.
Fixed: format=metadata — a Gmail API parameter that returns only headers (From, Subject, Date) and a snippet. Each message dropped from ~50KB to ~1.5KB, and the full fetch went from "overnight" to about an hour.
Broke: The gws CLI outputs pretty-printed, multi-line JSON, but I was appending to an NDJSON file (one JSON object per line). The metadata extraction script choked on the malformed lines.
Fixed: Pipe every response through jq -c '.' to compact it before appending.
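For reference, the compaction step looks like this in Python (a sketch; the pipeline itself does it with jq -c '.', and append_ndjson is an illustrative name):

```python
import json

def append_ndjson(path, pretty_json_text):
    """Compact one (possibly multi-line) JSON response onto a single NDJSON line."""
    obj = json.loads(pretty_json_text)             # parse the pretty-printed response
    line = json.dumps(obj, separators=(",", ":"))  # re-serialize with no whitespace
    with open(path, "a") as f:
        f.write(line + "\n")
```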

The whole phase is resumable. It appends to an NDJSON file and tracks which message IDs are already fetched. Kill it, restart it, it picks up where it left off.

[Animated demo: the scanner scrolls a mixed inbox where LinkedIn and Slack notifications, a note from Mom, lunch plans, and a Q1 planning thread sit between invoices and receipts from AWS, Uber, GitHub, Figma, Vercel, and Delta.]

Phase 2

Two-Pass Classification
(Because Tokens Aren't Free)

Sending 24,000 emails to Claude for classification would burn through tokens and context windows. The solution: a two-pass filter.

Pass 1 — Triage (cheap): Send only the sender and subject line, in batches of 100. Claude returns a list of IDs that could be expenses. This filters 24,000 emails down to ~2,000 candidates in about 240 Claude calls. Most emails are obviously not expenses — newsletters, GitHub notifications, marketing — and the subject line is enough to tell.

Pass 2 — Classify (accurate): For only the ~2,000 candidates, send the full metadata (from, subject, snippet, date) in batches of 20. Claude returns structured JSON: vendor name, amount, category, confidence level, duplicate detection, reasoning.
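The batch splitting behind both passes is a one-liner; here's a sketch (batches is a hypothetical helper, and the real scripts feed these chunks to claude -p):

```python
def batches(items, size):
    """Split a list into consecutive batches of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]

# Pass 1: 24,000 (sender, subject) pairs -> 240 triage batches of 100
# Pass 2: ~2,000 surviving candidates   -> ~100 classify batches of 20
```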

Broke: Claude sometimes returns JSON with extra text after it — an explanation, a note, a trailing sentence. json.loads() chokes on that.
Fixed: json.JSONDecoder().raw_decode(), which parses the first valid JSON value and ignores everything after it.
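The raw_decode trick is pure stdlib, shown here in isolation (parse_leading_json is an illustrative wrapper):

```python
import json

def parse_leading_json(text):
    """Parse the first JSON value in `text`, ignoring any trailing commentary."""
    obj, _end = json.JSONDecoder().raw_decode(text.lstrip())
    return obj
```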
Broke: Some batches timed out (120 seconds wasn't enough for 20 complex emails), and failed batches were being marked as "processed," so they never retried.
Fixed: Bumped the timeout to 180 seconds and added a continue on failure — failed batches stay unprocessed and automatically retry on the next run.
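A minimal sketch of that retry behavior, assuming batches are identified by index and classify stands in for the claude -p call:

```python
import json
import os

def run_batches(all_batches, classify, progress_path):
    """Classify batches with resumable progress; failures stay unmarked and retry next run."""
    done = set()
    if os.path.exists(progress_path):
        with open(progress_path) as f:
            done = set(json.load(f))
    for i, batch in enumerate(all_batches):
        if i in done:
            continue                 # finished on a previous run: skip
        try:
            classify(batch)          # stand-in for the claude -p call
        except Exception:
            continue                 # leave unmarked so the next run retries it
        done.add(i)
        with open(progress_path, "w") as f:
            json.dump(sorted(done), f)   # persist after every successful batch
    return done
```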

Both passes save progress after every batch. You can kill the process at any point and not lose work.

[Animated demo: the same inbox after classification, with the AWS, Uber, GitHub, Figma, Vercel, and Delta emails now tagged "receipt" and everything else left alone. "6 receipts found, scan complete."]

Phase 3

Save to Drive and Sheets
(Where Everything Else Broke)

This phase takes each classified expense, fetches the full email (now we need the body — but only for ~200 emails, not 24,000), downloads receipt attachments, uploads them to Google Drive, and appends a row to a Google Sheet.

Broke: The jq expression to find attachments used != operators, which got mangled by bash's shell escaping. Every email showed "no attachment found" even when attachments existed.
Fixed: Rewrote the jq filter to use has() and length > 0 instead of !=.
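The equivalent check in Python, for reference: walk the Gmail payload tree and keep parts that have both a filename and an attachmentId (parts nest for multipart messages; find_attachments is my name, not the pipeline's):

```python
def find_attachments(payload):
    """Collect (filename, attachmentId) pairs from a Gmail message payload tree."""
    found = []
    for part in payload.get("parts", []):
        if part.get("filename") and "attachmentId" in part.get("body", {}):
            found.append((part["filename"], part["body"]["attachmentId"]))
        found.extend(find_attachments(part))   # multipart parts can nest
    return found
```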
Broke: The gws CLI has a -o flag for output, but it doesn't decode the response. Gmail's API returns attachments as base64url-encoded JSON, not raw files, so the -o flag just wrote the JSON blob to disk.
Fixed: Extract the .data field with jq, convert base64url to base64 (tr '_-' '/+'), then base64 -d to decode.
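In Python the same decode is one stdlib call, once the padding Gmail omits is restored (a sketch of what the tr + base64 -d pipeline does):

```python
import base64

def decode_attachment(data):
    """Decode Gmail's base64url attachment data (the .data field) to raw bytes."""
    data += "=" * (-len(data) % 4)        # Gmail omits padding; base64 requires it
    return base64.urlsafe_b64decode(data)
```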
Broke: Uploaded receipts ended up in the Drive root instead of the target folder — the --parent flag wasn't being applied correctly.
Fixed: Reordering the flags (putting --parent before the file path) fixed it.
Broke: The classification step couldn't extract dollar amounts, because the email snippet was empty for most HTML receipt emails — the Gmail snippet for an Anthropic receipt was literally just the subject line repeated with invisible Unicode spacer characters. The fix grew into its own phase: Phase 3.5, below.
receipt_data.csv

Date    Vendor        Amount    Category  Conf.
Jan 3   AWS           $127.43   SaaS      98%
Jan 5   Uber          $23.50    Travel    95%
Jan 8   GitHub        $44.00    SaaS      99%
Feb 1   Delta         $342.00   Travel    94%
Feb 14  OpenAI        $48.23    SaaS      97%
Mar 1   WeWork        $299.00   Office    92%
Apr 3   Anthropic     $86.40    SaaS      96%
May 7   DigitalOcean  $24.00    SaaS      97%
(showing 8 of 247 receipts)

Phase 3.5

The Amount Extraction Layer
(A Pipeline Within the Pipeline)

Since we already fetch the full message in Phase 3 (to get attachments), I added an amount extraction step that piggybacks on that fetch. For any expense with a missing amount:

  1. Try the email body — Extract the plain text from the full message, send it to claude -p (uses my Claude subscription, no extra cost). Works great for Stripe receipts, invoice emails, subscription confirmations.
  2. Try the PDF receipt — If the amount still isn't found and we downloaded a PDF, send the raw PDF to Claude via the Anthropic Python SDK's document understanding. Claude reads the invoice layout and pulls the total. This uses the API (~$0.005 per receipt) but it's the most accurate option.

This two-tier approach means most amounts are extracted for free via the CLI, and only the tricky PDF-only cases hit the paid API.
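The fallback logic, sketched with the two Claude calls injected as stand-in callables (ask_claude_cli for the claude -p pass, ask_claude_api for the SDK's PDF pass; both names are mine):

```python
def extract_amount(body_text, pdf_path, ask_claude_cli, ask_claude_api):
    """Two-tier extraction: free CLI pass on the body first, paid API pass only if needed."""
    amount = ask_claude_cli(body_text) if body_text else None  # tier 1: subscription, no extra cost
    if amount is None and pdf_path:
        amount = ask_claude_api(pdf_path)                      # tier 2: ~$0.005 per receipt
    return amount
```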

Safety

Security: The Allowlist Wrapper

Giving an AI pipeline access to your full Gmail account is terrifying. The gws CLI can do everything — send emails, delete messages, modify labels, trash conversations. I did not want any of that.

The solution: gws-safe, a bash wrapper that intercepts every gws command and checks it against an allowlist of exactly 8 permitted operations. Everything else is blocked and logged. If the pipeline ever tries to send an email, delete a message, or do anything unexpected, it gets stopped cold and the attempt is recorded.

gws-safe allowlist
# Allowed operations
gmail messages list # search
gmail messages get # read
gmail attachments get # download
drive files create # folder
drive +upload # upload
sheets create # new sheet
sheets batchUpdate # format
sheets +append # add row
# Everything else → blocked + logged
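The wrapper's core check fits in a few lines. Here's a Python sketch of the logic (the real gws-safe is bash, and the operation strings below only approximate the allowlist above):

```python
ALLOWED = (
    "gmail messages list",
    "gmail messages get",
    "gmail attachments get",
    "drive files create",
    "drive files upload",
    "sheets create",
    "sheets batchUpdate",
    "sheets append",
)

def allowed(argv):
    """Permit a gws invocation only if it starts with an allowlisted operation."""
    cmd = " ".join(argv)
    return any(cmd == op or cmd.startswith(op + " ") for op in ALLOWED)
```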

Resilience

Crash Safety: Resume From Anywhere

Every phase saves progress to disk after each batch:

data/message_ids.json      # which messages exist
data/messages_raw.ndjson   # which are fetched
data/triage_progress.json  # which batches triaged
data/classify_progress.json # which batches classified
data/save_progress.json    # which expenses saved

Kill the process at any point. Re-run ./run.sh. It checks what's done, skips it, and picks up exactly where it left off. This saved me dozens of times during development — OAuth token expired mid-fetch, Claude rate limit hit during classification, laptop went to sleep during save. Every time: just re-run.
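A sketch of the resume check for the fetch phase, assuming each NDJSON line carries the Gmail message id (pending_ids is an illustrative name):

```python
import json
import os

def pending_ids(all_ids, ndjson_path):
    """Return message IDs not yet in the NDJSON file, so a re-run fetches only those."""
    fetched = set()
    if os.path.exists(ndjson_path):
        with open(ndjson_path) as f:
            fetched = {json.loads(line)["id"] for line in f if line.strip()}
    return [m for m in all_ids if m not in fetched]
```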

Result

The Output

A Google Sheet with every business expense from 2025, sorted by confidence:

  • Green rows (high confidence) — clearly business expenses, auto-classified
  • Yellow rows (medium) — probably expenses, worth a quick check
  • Red rows (low) — might be expenses, needs human review

Each row has: date, vendor, amount, currency, category, receipt link (in Drive), email link (to the original Gmail message), and notes.

A Google Drive folder with every receipt PDF, renamed to DATE_VENDOR_AMOUNT.pdf.

A local CSV backup of everything.

2025_expenses.gsheet (247 receipts, sorted by confidence; green = high, yellow = medium, red = review):

Vendor        Amount    Conf.
AWS           $127.43   98%
Uber          $23.50    95%
WeWork        $299.00   92%
Delta         $342.00   94%
GitHub        $44.00    99%
Anthropic     $86.40    96%
Amazon.com    $189.99   88%
DigitalOcean  $24.00    97%

Takeaways

What I Learned

Bash was the wrong choice for Phase 3. It started simple — pipe gws to jq, done. But by the end, save.sh was doing base64 decoding, nested JSON traversal, calling Python scripts, and fighting shell quoting at every turn. Python from the start would have been cleaner.

Gmail snippets are unreliable. For plain text emails, snippets are great. For HTML-heavy emails (which is every receipt and invoice), the snippet is often garbage — invisible characters, repeated subject lines, or just empty. Don't rely on them for data extraction.

The two-pass classification was worth it. Sending 24,000 subjects to Claude costs maybe $2 in tokens. Sending 24,000 full emails would have cost 50x more and taken 10x longer. The triage pass eliminated 90% of emails at 1% of the cost.

Crash safety isn't optional for pipelines. I didn't add resume logic because I'm careful. I added it because the first run crashed 47 times. OAuth expired. Rate limits hit. Laptop slept. Process killed. Internet dropped. Every single crash would have meant starting over without it.

The gws CLI is powerful but young. Pagination quirks, base64 attachment encoding, flag ordering issues — it works, but you'll hit edges. The documentation is minimal. Expect to experiment.

AI for classification, code for execution.

The temptation is to let the AI do everything — “here's my Gmail, find my expenses.” That's slow, expensive, and unreliable. The right split: AI makes decisions (is this an expense?), deterministic code executes (download, upload, format). The pipeline is 80% boring bash/python and 20% Claude. That's the right ratio.

Built with Claude Code, gws CLI, and the Anthropic SDK. A weekend of debugging. Total cost in API tokens: ~$3.
