
The Setup: Who Is This Person?
His name is Kunal Malkan. Senior Software Engineer at Meta Superintelligence Labs (MSL). Before that: 6 years at Amazon, 3 years at Apple. But here’s what matters: He builds the infrastructure that evaluates frontier AI models. Not the models themselves. The systems that test whether AI models work correctly.

He co-authored research on:
- ARE Platform - How Meta creates environments for testing agents
- Gaia2 Benchmark - Testing AI across thousands of realistic scenarios
- AssetOpsBench - Real datacenter operations for AI agents
The Call (What Actually Happened)
10 minutes in, I’m walking through our receipt system. Every time an agent does something (writes code, runs tests, deploys), it generates a receipt. Proof that it did what it claimed.

Kunal stops me: “Wait, you’re doing step-by-step verification against constraints?”

Me: “Yeah, we call them receipts. Every action has to prove it met requirements.”

Kunal: “That’s… that’s exactly what our ARE Verifier does.”

Silence.

Me: “What’s the ARE Verifier?”

Kunal: “It’s the tool we use at Meta to evaluate agent actions. Step-by-step checking against oracles. You check if the agent did what it was supposed to. We call it verification. You call it receipts. Same concept.”

I had to pause the recording.

We’d been building Squad for 4 months. We created the receipt system because users didn’t trust agent outputs. We needed proof. Turns out, Meta’s frontier AI team arrived at the exact same solution. We didn’t know their research existed. They didn’t know we existed.

Independent convergence on the same architecture.
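Neither the ARE Verifier’s internals nor Squad’s production code is in this post, so here’s a minimal sketch of the shared idea, with made-up names and fields: every agent action gets checked step by step against explicit constraints, and the result of each check is recorded as a receipt.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Callable

# Hypothetical shapes -- not Squad's or ARE's actual types.
@dataclass
class AgentAction:
    agent: str    # which agent acted, e.g. "engineer-1"
    kind: str     # e.g. "write_code", "run_tests", "deploy"
    claim: str    # what the agent says it accomplished
    output: dict  # raw artifacts: diff, test report, deploy log

@dataclass
class Receipt:
    action: AgentAction
    checks: dict[str, bool] = field(default_factory=dict)
    verified_at: str = ""

    @property
    def passed(self) -> bool:
        return all(self.checks.values())

# A "constraint" is just a named predicate over the action's output.
Constraint = Callable[[AgentAction], bool]

def verify(action: AgentAction, constraints: dict[str, Constraint]) -> Receipt:
    """Step-by-step verification: run every constraint and record the result."""
    receipt = Receipt(action=action)
    for name, check in constraints.items():
        receipt.checks[name] = bool(check(action))
    receipt.verified_at = datetime.now(timezone.utc).isoformat()
    return receipt

# Example: an agent claims it ran the test suite.
action = AgentAction(
    agent="engineer-1",
    kind="run_tests",
    claim="all tests pass",
    output={"tests_run": 42, "tests_failed": 0},
)
receipt = verify(action, {
    "tests_actually_ran": lambda a: a.output.get("tests_run", 0) > 0,
    "no_failures": lambda a: a.output.get("tests_failed", 1) == 0,
})
assert receipt.passed  # proof, not trust
```

The details differ between our system and Meta’s; the shape (action in, named checks run, proof out) is the convergence.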
The Validation (Four Pillars)

Kunal laid out four requirements for production multi-agent systems:
1. Correct Context Provision

“Agents need the right information to make correct decisions.”

Squad: Scout agents gather context from codebase, tickets, docs before Engineers execute.

His take: ✅ “This aligns with ARE’s emphasis on realistic environments.”
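I won’t reproduce our Scout agent here, but the shape of “correct context provision” is roughly this (hypothetical names and data): assemble everything the executor will need before it runs, and hand it over as one explicit bundle instead of letting the agent guess.

```python
from dataclasses import dataclass

# Hypothetical context bundle -- the point is that it's assembled
# *before* any Engineer agent starts executing.
@dataclass(frozen=True)
class ContextBundle:
    ticket: str                # the task being worked on
    relevant_files: list[str]  # code the change will touch
    docs: list[str]            # specs the agent must respect

def scout(ticket_id: str) -> ContextBundle:
    """Stand-in for a Scout agent: pull context from the ticket tracker,
    the codebase, and the docs, then freeze it into one bundle."""
    return ContextBundle(
        ticket=f"{ticket_id}: add receipt export endpoint",
        relevant_files=["api/routes.py", "models/receipt.py"],
        docs=["docs/receipts.md"],
    )

bundle = scout("SQ-214")  # Engineers execute against this, not against vibes
```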
2. Code Standard Adherence

“You need consistent patterns across agents.”

Squad: Frozen contracts define API shapes, DB schemas, code patterns before execution starts.

His take: ✅ “Direct parallel to ARE’s multi-tool environment abstractions.”
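Frozen contracts are only described at that level in this post, so treat this as a guess at the minimal version: pin the shape of an interface before any agent writes code, then reject outputs that drift from it.

```python
from dataclasses import dataclass

# Hypothetical frozen contract: agreed before execution, immutable after.
@dataclass(frozen=True)
class EndpointContract:
    method: str
    path: str
    response_fields: frozenset[str]

CONTRACT = EndpointContract(
    method="GET",
    path="/receipts/{id}",
    response_fields=frozenset({"id", "agent", "checks", "verified_at"}),
)

def conforms(response: dict) -> bool:
    """Check an agent-produced response against the frozen contract."""
    return set(response) == CONTRACT.response_fields

# An agent that adds or drops fields fails the check instead of merging quietly.
assert conforms({"id": "r1", "agent": "engineer-1", "checks": {}, "verified_at": "..."})
assert not conforms({"id": "r1", "status": "ok"})
```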
3. Full Observability

“You have to see what agents are doing, in real-time.”

Squad: Receipt dashboard shows every agent action with verification proof.

His take: ✅ “This is the most important one. Our ARE Verifier does step-by-step checking. Your receipts do the same thing.”
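The dashboard itself isn’t shown here; underneath it is just an append-only stream of receipts that anything (a dashboard, a CLI, a human) can tail in real time. A rough sketch, with hypothetical field names:

```python
import json
from datetime import datetime, timezone

# Hypothetical receipt stream: append-only JSONL that a dashboard can tail.
RECEIPT_LOG = "receipts.jsonl"

def emit_receipt(agent: str, kind: str, checks: dict[str, bool]) -> None:
    """Append one receipt event; the dashboard just follows this file."""
    event = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "agent": agent,
        "kind": kind,
        "checks": checks,
        "passed": all(checks.values()),
    }
    with open(RECEIPT_LOG, "a") as f:
        f.write(json.dumps(event) + "\n")

emit_receipt("engineer-1", "run_tests", {"no_failures": True})
```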
4. Active Governance

“Prevent agents from making unauthorized changes.”

Squad: Acceptance Gates (Frozen → Inspect → Receipt → Evidence) enforce constraints before merge.

His take: ✅ “This reflects the safety-first approach we need for autonomous systems.”

Four for four. Every architectural decision we’d made, validated by someone who builds production AI infrastructure at Meta.
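The Acceptance Gates are named in this post but not specced, so here’s a sketch of the general pattern rather than Squad’s actual implementation: a change merges only if every gate passes, in order, and the first failure blocks it.

```python
from typing import Callable

# Hypothetical gate functions -- each returns True only if its check holds.
def frozen_gate(change: dict) -> bool:
    return change.get("contract_hash") == change.get("approved_contract_hash")

def inspect_gate(change: dict) -> bool:
    return not change.get("unreviewed_files")

def receipt_gate(change: dict) -> bool:
    receipts = change.get("receipts", [])
    return bool(receipts) and all(r.get("passed") for r in receipts)

def evidence_gate(change: dict) -> bool:
    return bool(change.get("test_report")) and bool(change.get("diff"))

GATES: list[tuple[str, Callable[[dict], bool]]] = [
    ("frozen", frozen_gate),
    ("inspect", inspect_gate),
    ("receipt", receipt_gate),
    ("evidence", evidence_gate),
]

def may_merge(change: dict) -> tuple[bool, str]:
    """Run Frozen -> Inspect -> Receipt -> Evidence; stop at the first failure."""
    for name, gate in GATES:
        if not gate(change):
            return False, f"blocked at {name} gate"
    return True, "all gates passed"

ok, reason = may_merge({
    "contract_hash": "abc", "approved_contract_hash": "abc",
    "unreviewed_files": [], "receipts": [{"passed": True}],
    "test_report": "42 passed", "diff": "+42 -0",
})
assert ok, reason
```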
This is a scaffold post. Full content will include:
- The complete conversation breakdown
- Gaia2 research validation of multi-LLM routing
- The temporal coupling problem (Meta is researching it too)
- Practical recommendations from the call
- How this changes Squad’s roadmap
What Changed in How I Think
Before this call:
- Receipts = nice-to-have UX feature
- Multi-LLM routing = engineering optimization
- Calendar bugs = our implementation issue
After this call:
- Receipts = fundamental requirement for trustworthy AI (Meta uses the same pattern)
- Multi-LLM routing = empirically necessary (Gaia2 proves no single model dominates; see the routing sketch below)
- Dynamic system failures = frontier research problem (Meta MSL actively working on it)
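The routing sketch, since “multi-LLM routing” sounds more exotic than it is (the task-to-model table below is made up, not a Gaia2 result): pick the model per task type instead of betting everything on one.

```python
# Hypothetical routing table -- the mapping is illustrative, not a claim
# about which real model is best at what.
ROUTES = {
    "code_generation": "model-a",
    "long_context_review": "model-b",
    "cheap_classification": "model-c",
}
DEFAULT_MODEL = "model-a"

def route(task_type: str) -> str:
    """Pick a model per task type instead of sending everything to one model."""
    return ROUTES.get(task_type, DEFAULT_MODEL)

assert route("long_context_review") == "model-b"
```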
The questions I’d been quietly asking for months:
- Are we overthinking this?
- Is anyone else doing this?
- Are we solving a real problem or an invented one?
That call answered all three.
Try This: The Receipt Challenge
Next time you review an AI-generated PR, don’t read the code first. Ask for proof (there’s a sketch of automating this check after the list):
- ✅ Did tests pass?
- ✅ Does it match the spec?
- ✅ Did anything unexpected happen?
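If you want to make the challenge stick, those three questions fit in a tiny pre-review check. The field names are hypothetical; the point is refusing to read the diff until the proof exists.

```python
# Hypothetical PR metadata -- whatever your CI attaches to the pull request.
def receipt_check(pr: dict) -> list[str]:
    """Return the list of missing proofs; review the code only if it's empty."""
    problems = []
    if not pr.get("tests_passed"):
        problems.append("no passing test run attached")
    if not pr.get("spec_reference"):
        problems.append("no link to the spec it claims to satisfy")
    if pr.get("unexpected_changes"):
        problems.append(f"unexpected changes: {pr['unexpected_changes']}")
    return problems

pr = {"tests_passed": True, "spec_reference": "SPEC-42", "unexpected_changes": []}
print(receipt_check(pr) or "proof looks good, now read the code")
```

If anything comes back missing, send the PR back before you open the diff.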