
The Problem
Week 1: 3 agent PRs, 800 lines. Manageable. Week 4: 12 PRs, 4,200 lines. Exhausting but doable. Week 8: 47 PRs, 23,000 lines. Impossible. The math doesn’t work. If agents work in parallel (Agent A on checkout, Agent B on payments, Agent C on emails), I can’t review each PR in isolation.
Traditional code review assumes:
- One human writes code
- Another human reviews it
- Both humans have full context
With agents, none of that holds:
- 8 agents write code in parallel
- They share partial context
- A human can’t hold all 8 contexts simultaneously
The Key Shift: Verify Proof, Not Code
We moved from “review the code” to “verify the proof.”
Old way (code review):
- Agent writes 300 lines of checkout flow
- I read 300 lines
- I guess whether it’s correct
- Takes 45 minutes
New way (proof verification):
- Agent writes 300 lines + generates receipt
- Receipt says: Tests passed (98%), API matches contract, DB migration succeeded, 0 security issues
- I verify the proof, not the code
- Takes 3 minutes
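A minimal sketch of what verifying a receipt could look like. The four fields mirror the claims in the example receipt above, but the field names, the `verify_receipt` function, and the 95% pass-rate threshold are hypothetical, chosen only for illustration; this is not the actual receipt schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Receipt:
    """Hypothetical receipt an agent attaches to a PR (field names are assumptions)."""
    test_pass_rate: float        # fraction of tests that passed, 0.0 to 1.0
    api_matches_contract: bool   # API surface matches the agreed contract
    migration_succeeded: bool    # DB migration applied cleanly
    security_issues: int         # number of findings from a security scan


def verify_receipt(receipt: Receipt, min_pass_rate: float = 0.95) -> list[str]:
    """Return the list of failed checks; an empty list means the proof holds.

    The 0.95 default is an illustrative assumption, not a recommended value.
    """
    failures = []
    if receipt.test_pass_rate < min_pass_rate:
        failures.append(
            f"test pass rate {receipt.test_pass_rate:.0%} is below {min_pass_rate:.0%}"
        )
    if not receipt.api_matches_contract:
        failures.append("API does not match contract")
    if not receipt.migration_succeeded:
        failures.append("DB migration failed")
    if receipt.security_issues > 0:
        failures.append(f"{receipt.security_issues} security issue(s) found")
    return failures


# The receipt from the example above: 98% of tests passed, everything else clean.
checkout = Receipt(0.98, True, True, 0)
print(verify_receipt(checkout))  # []
```

Returning the list of failures rather than a single boolean keeps the 3-minute check honest: when a receipt fails, the reviewer sees which claim broke without opening the diff.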
The Tools We Use
Our review stack:
- Receipt System - Every agent action produces verifiable proof
- Acceptance Gates - Frozen contracts prevent chaos before execution
- Director Agent - Meta-reviewer that checks inter-agent coordination
- LLM-as-Judge (Claude Opus) - Reviews subjective quality
- GitHub + Linear Integration - Automated PR linking with receipt metadata
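To make the Director Agent's coordination check more concrete, here is one hypothetical rule such a meta-reviewer could apply: flag every pair of open agent PRs that modify the same file. This post does not describe how the Director Agent works internally, so the function name, the input shape, and the rule itself are assumptions.

```python
from itertools import combinations


def find_overlaps(pr_files: dict[str, set[str]]) -> list[tuple[str, str, list[str]]]:
    """Return (pr_a, pr_b, shared_files) for every pair of PRs touching the same file.

    `pr_files` maps a PR identifier to the set of file paths it modifies.
    """
    overlaps = []
    # Sort the PR identifiers so the output order is deterministic.
    for pr_a, pr_b in combinations(sorted(pr_files), 2):
        shared = sorted(pr_files[pr_a] & pr_files[pr_b])
        if shared:
            overlaps.append((pr_a, pr_b, shared))
    return overlaps


# Hypothetical PRs from three agents working in parallel.
open_prs = {
    "checkout": {"api/orders.py", "api/cart.py"},
    "payments": {"api/orders.py", "api/billing.py"},
    "emails": {"mail/templates.py"},
}
print(find_overlaps(open_prs))
# [('checkout', 'payments', ['api/orders.py'])]
```

File overlap is a deliberately crude signal. Two PRs can conflict through a shared API or database table without touching the same file, so a real coordination check would likely need more than this.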
Actual Stats from Last 30 Days
- 187 agent PRs generated
- 163 auto-merged (87%)
- 24 manual reviews (13%)
- Average review time: 4.2 minutes (down from 47 minutes)
- Rollbacks: 2 (both caught in staging, 0 in production)
Human PRs, for comparison:
- 187 human PRs
- 187 manual reviews (100%)
- Average review time: 47 minutes
- Rollbacks: 7 (3 reached production)
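The percentages above can be rechecked from the raw counts, and the same arithmetic gives a rollback rate per PR for each group. This is a small sketch; the `rate` helper and the dictionary keys are made up for illustration, and the only inputs are the counts listed above.

```python
def rate(part: int, total: int) -> float:
    """Percentage of `total` represented by `part`, rounded to one decimal."""
    return round(100 * part / total, 1)


# Counts copied from the lists above.
agent = {"prs": 187, "auto_merged": 163, "manual": 24, "rollbacks": 2}
human = {"prs": 187, "rollbacks": 7}

print(rate(agent["auto_merged"], agent["prs"]))  # 87.2
print(rate(agent["manual"], agent["prs"]))       # 12.8
print(rate(agent["rollbacks"], agent["prs"]))    # 1.1
print(rate(human["rollbacks"], human["prs"]))    # 3.7
```

With equal PR counts in both groups, the rollback rates compare directly: roughly 1.1% for agent PRs against 3.7% for human PRs.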
This is a scaffold post. Full content will include:
- Step-by-step workflow with code examples
- Frozen contracts YAML format
- Director Agent pre-review output
- LLM-as-Judge prompts and responses
- Auto-merge decision tree
- Real failure modes and how we fixed them
Try It: The Receipt Challenge
On your next AI-generated PR, don’t review the code first. Instead, check:
- ✅ Did tests pass?
- ✅ Did it match the spec?
- ✅ Did anything unexpected happen?
Related Reading: