97.78% Confidence: Building a Receipt Scanner with AI Agents

My side project is a web app where users photograph receipts and get cashback rewards for qualifying purchases. The feature I’m writing about: dual photo upload. One photo of the receipt, one photo of the purchased product. Both required. Both processed. Cashback triggered when both are verified.

Three-phase build: Database → Backend → Frontend, each verified before moving on

Three phases, one session, five critical bugs, one very satisfying number at the end. (The session that funded this side project’s backend — the million-token cashback campaign build — shows the scale that becomes possible.)

The Three-Phase Build

I’ve learned to structure AI-built features in phases rather than “build the whole thing.” Each phase has a clear output, and you can verify it before moving to the next.

Phase 1: Database schema. New tables and columns to store two images per submission instead of one. Relationships, constraints, migration scripts. The AI handled the schema design; I reviewed the migration.

Phase 2: Backend API. Endpoints updated to accept two files instead of one. Storage logic for both images. OCR pipeline wired to the receipt image specifically (not both images — more on this later). Confidence scoring persisted to the database.

Phase 3: Frontend UI. Two upload zones. Preview states for both images. Progress feedback during upload. Error handling when one image fails validation. The whole user flow from “take two photos” to “cashback submitted.”

Each phase worked before the next began. Standard practice, but worth stating: it’s much easier to debug Phase 2 when you know Phase 1 is correct.

Then Came the Bugs

Five critical bugs surfaced in one session. Each one was non-obvious. None of them would have been caught by unit tests with synthetic data.

Bug 1: Image decompression (appeared three times). The Vision API used for OCR requires raw image bytes. The upload pipeline compressed images with gzip before storing them. The OCR call was receiving compressed bytes, producing garbage output.

This sounds simple to fix. It appeared three times because there were three code paths that processed images: the initial upload handler, the reprocessing endpoint, and the background retry job. Each path had the same bug independently. The fix landed in the upload handler first. Then the reprocessing endpoint failed. Then the retry job failed. Three separate fixes for what was logically one problem.

Test images didn’t catch this because small test images don’t compress significantly — the compressed bytes happened to be valid enough for the Vision API to attempt parsing. Real camera photos at full resolution, properly compressed, produced complete failures.

Bug 2: OCR aggregation scope. Product photos were being fed to the OCR pipeline alongside receipt photos. The text extraction model was trying to read text from a photo of a shampoo bottle, finding some, and mixing it into the receipt data. Confidence scores were degraded, and in some cases the product name on the packaging was appearing as a “line item” on the receipt.

The fix was a single scope filter — OCR only runs on receipt images — but finding it required tracing through the pipeline to understand where the “process all images” assumption had been made.

Bug 3: FormData field ordering. The backend expected receipt photo first, product photo second. The frontend was constructing the FormData object in a way that didn’t guarantee order. On most browsers with most images, the order happened to be correct. On mobile browsers with specific image sources, the order flipped. The cashback submission would succeed but process the product photo as the receipt and vice versa — high confidence OCR reading, zero matching line items.

This is exactly the kind of bug that survives QA: it works in testing because you’re testing from a desktop with consistent behavior, and breaks for the mobile users you care about most.

Bugs 4 and 5 were validation edge cases — specific image dimensions and file size combinations that triggered an error path the frontend wasn’t handling, causing silent failures with no user feedback.

97.78%

After all five fixes were applied, I tested with a real receipt photograph — the kind of wrinkled, slightly overexposed, taken-at-an-angle photo that actual users submit.

OCR confidence: 97.78%.

Not a test image. Not a PDF. An actual receipt, photographed with a phone, under kitchen lighting. The text extraction was accurate. The line items matched. The total was correct.

That number felt earned.

The Other Work in That Session

Two other things happened in the same session, worth noting because they represent the breadth of what a single agentic session can cover.

Strategic feedback questions. The app needed a post-cashback survey to collect retail data. I wanted five questions that would produce statistically useful results. We designed them together: consistent radio button format for all five (easier to aggregate), clear language, no leading questions. The AI helped with question design the same way it helped with code — draft, review, revise.

GDPR legal pages. The app needed a privacy policy and terms of service. Two things came up that I wouldn’t have thought of without prompting:

First, the distinction between “data processor” and “data controller” under GDPR. The app is a data controller — it decides what to collect and why. A payment processor might be a data processor. Using the wrong term isn’t just imprecise; it has legal implications about liability and consent requirements. We used the right term.

Second, the privacy checkbox wording. The phrasing has to be specific enough to satisfy GDPR consent requirements without being so legalistic that users skip reading it. We iterated on the wording until it was clear, accurate, and genuinely informed.

97.78% OCR confidence on a real receipt photo

The Lesson

Real-world data breaks things that test data doesn’t.

Small synthetic images don’t exercise compression behavior. Desktop browsers don’t expose FormData ordering bugs. Clean, flat receipts don’t reveal OCR pipeline scoping issues.

The gap between “it works” and “it works for real users” is where most of the interesting bugs live. AI builds fast — but fast means you reach that gap faster too. The bugs don’t disappear; they just arrive sooner.

Test with real data as early as possible. Real camera photos, real device behaviors, real user inputs. Every shortcut you take there shows up eventually as a production incident. (What happened next — bolting on rate limiting, CORS hardening, graceful shutdown, and everything else the agents didn’t think to add — is in Production Hardening. And the same cashback app backend was the one built during my first million-token session.)

The Three-Phase Build

Then Came the Bugs

97.78%

The Other Work in That Session

The Lesson

What the models think