How We Review 50+ Agent-Generated PRs Per Week Without Losing Our Minds


Author: Danial Hasan, CTO @ Squad

The Problem

Week 1: 3 agent PRs, 800 lines. Manageable.
Week 4: 12 PRs, 4,200 lines. Exhausting but doable.
Week 8: 47 PRs, 23,000 lines. Impossible.

The math doesn’t work. If agents work in parallel (Agent A on checkout, Agent B on payments, Agent C on emails), I can’t review each PR in isolation.

Traditional code review assumes:

One human writes code
Another human reviews it
Both humans have full context

Multi-agent reality:

8 agents write code in parallel
They share partial context
A human reviewer can’t hold all 8 contexts simultaneously

Here’s what we built to actually ship agent code at scale.

The Key Shift: Verify Proof, Not Code

We moved from “review the code” to “verify the proof.”

Old way (code review):

Agent writes 300 lines of checkout flow
I read 300 lines
I guess whether it’s correct
Takes 45 minutes

New way (receipt verification):

Agent writes 300 lines + generates receipt
Receipt says: Tests passed (98%), API matches contract, DB migration succeeded, 0 security issues
I verify the proof, not the code
Takes 3 minutes
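
To make the receipt idea concrete, here is a minimal sketch of what verifying one could look like. The four fields mirror the receipt line above; the class name, field names, and the 98% default threshold are assumptions for illustration, not Squad's actual schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Receipt:
    """Hypothetical receipt shape; names are illustrative only."""
    test_pass_rate: float        # 0.98 stands for "Tests passed (98%)"
    api_matches_contract: bool   # "API matches contract"
    migration_succeeded: bool    # "DB migration succeeded"
    security_issues: int         # "0 security issues"


def verify_receipt(receipt: Receipt, min_test_pass_rate: float = 0.98) -> list[str]:
    """Return the names of failed checks; an empty list means the proof holds.

    The default threshold is an assumption, chosen so the example
    receipt from the text passes.
    """
    failures = []
    if receipt.test_pass_rate < min_test_pass_rate:
        failures.append("tests")
    if not receipt.api_matches_contract:
        failures.append("api_contract")
    if not receipt.migration_succeeded:
        failures.append("db_migration")
    if receipt.security_issues > 0:
        failures.append("security")
    return failures


checkout = Receipt(test_pass_rate=0.98, api_matches_contract=True,
                   migration_succeeded=True, security_issues=0)
print(verify_receipt(checkout))  # → []
```

Returning the list of failed checks, rather than a single boolean, tells the reviewer where to start reading when the proof does not hold.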

The Tools We Use

Our review stack:

Receipt System - Every agent action produces verifiable proof
Acceptance Gates - Frozen contracts prevent chaos before execution
Director Agent - Meta-reviewer that checks inter-agent coordination
LLM-as-Judge (Claude Opus) - Reviews subjective quality
GitHub + Linear Integration - Automated PR linking with receipt metadata
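
The post does not yet show what a frozen contract looks like, so the following is only one possible reading of an acceptance gate: fingerprint the agreed contract before agents start, then refuse to proceed if the contract differs at review time. The function names and the contract fields are hypothetical.

```python
import hashlib
import json


def contract_fingerprint(contract: dict) -> str:
    """Hash a contract in a canonical form so key order does not matter."""
    canonical = json.dumps(contract, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def gate_passes(frozen_fingerprint: str, contract_at_review: dict) -> bool:
    """True only if the contract is unchanged since it was frozen."""
    return contract_fingerprint(contract_at_review) == frozen_fingerprint


# Hypothetical contract for the checkout agent, frozen before execution.
frozen = contract_fingerprint({"endpoint": "/checkout", "method": "POST"})

print(gate_passes(frozen, {"method": "POST", "endpoint": "/checkout"}))  # → True
print(gate_passes(frozen, {"method": "PUT", "endpoint": "/checkout"}))   # → False
```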

Actual Stats from Last 30 Days

187 agent PRs generated
163 auto-merged (87%)
24 manual reviews (13%)
Average review time: 4.2 minutes (down from 47 minutes)
Rollbacks: 2 (both caught in staging, 0 in production)

Compare to our pre-agent baseline:

187 human PRs
187 manual reviews (100%)
Average review time: 47 minutes
Rollbacks: 7 (3 reached production)
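
The rates behind these two lists can be checked with a few lines of arithmetic. The figures are copied from the lists above; the dictionary keys and the `pct` helper are just illustrative names.

```python
# Figures copied from the two lists above.
agent = {"prs": 187, "auto_merged": 163, "manual": 24,
         "review_min": 4.2, "rollbacks": 2, "prod_rollbacks": 0}
human = {"prs": 187, "manual": 187,
         "review_min": 47.0, "rollbacks": 7, "prod_rollbacks": 3}


def pct(part: float, whole: float) -> float:
    """Percentage of `whole` represented by `part`, to one decimal place."""
    return round(100 * part / whole, 1)


auto_merge_rate = pct(agent["auto_merged"], agent["prs"])       # 87.2
manual_rate = pct(agent["manual"], agent["prs"])                # 12.8
agent_rollback_rate = pct(agent["rollbacks"], agent["prs"])     # 1.1
human_rollback_rate = pct(human["rollbacks"], human["prs"])     # 3.7
review_time_cut = pct(human["review_min"] - agent["review_min"],
                      human["review_min"])                      # 91.1

print(auto_merge_rate, manual_rate)
print(agent_rollback_rate, human_rollback_rate)
print(review_time_cut)
```

The 87% and 13% in the list are these values rounded to whole percentages.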

Over these 30 days, agents plus receipts produced fewer rollbacks than our human-only baseline: 2 versus 7, with none reaching production versus 3.

This is a scaffold post. Full content will include:

Step-by-step workflow with code examples
Frozen contracts YAML format
Director Agent pre-review output
LLM-as-Judge prompts and responses
Auto-merge decision tree
Real failure modes and how we fixed them

Try It: The Receipt Challenge

On your next AI-generated PR, don’t review the code first. Instead, check:

✅ Did tests pass?
✅ Did it match the spec?
✅ Did it run without anything unexpected happening?

If all three are ✅, merge it without reading the code. Scary at first. Liberating after the 5th time.
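
If it helps to make the challenge mechanical, the three checks can be written as a tiny decision function. This is a sketch with hypothetical names: how you obtain each answer (CI status, spec diff, logs) depends on your own setup.

```python
def receipt_challenge(tests_passed: bool, matches_spec: bool,
                      unexpected_events: list[str]) -> str:
    """Apply the three checks; any miss means falling back to reading the code."""
    if tests_passed and matches_spec and not unexpected_events:
        return "merge"
    return "read the code"


print(receipt_challenge(True, True, []))                      # → merge
print(receipt_challenge(True, True, ["touched billing.py"]))  # → read the code
```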
