My AI Shipped Three Releases in an Hour. I Made a Second AI Audit Them.

At 09:02 this morning, an AI session working in my company workspace shipped a release. By 09:59 it had shipped two more. Earlier that same morning it had audited the system and written six release proposals, PB-081 through PB-086. The three it shipped weren’t trivial: a security floor plus our first-ever database backups (PB-081), a Postgres-backed task queue (PB-082), and a graph connecting contacts to goals (PB-083). Each release came with a validation checklist confirming the work was done correctly.

Each checklist was written by the same session that did the work.

The builder graded its own exam, three times, in under an hour. So I did the only sane thing: I pointed a second, independent AI session at the same evidence. Different model generation. No authorship stake in any of it. I asked one question: do you agree?

What held up

I want to be fair to the builder, because the surprising part is how much survived scrutiny.

The backup work was real. The reviewer confirmed an actual restore test into a scratch database, with row counts matched against live: 373 tasks, 851 work log entries, 114 decisions. The auth changes came with enumerated pass and fail tests. A rollback sat ready before anything ran. Where the implementation deviated from the proposal, the deviations were documented honestly instead of buried.
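For readers who want to see the shape of that check, here is a minimal sketch of a row-count comparison between a live database and a restored copy. It is an illustration under stated assumptions, not the reviewer's actual script: the table names (`tasks`, `work_log`, `decisions`) are hypothetical, and SQLite stands in for Postgres so the example runs anywhere.

```python
import sqlite3

# Hypothetical table names; the real schema is not shown in this post.
TABLES = ("tasks", "work_log", "decisions")


def seed(conn, counts):
    """Create one-column tables holding the given number of rows (demo helper)."""
    for table, n in counts.items():
        conn.execute(f"CREATE TABLE {table} (id INTEGER PRIMARY KEY)")
        conn.executemany(
            f"INSERT INTO {table} (id) VALUES (?)", ((i,) for i in range(n))
        )


def row_counts(conn, tables=TABLES):
    """Return {table: row count} for each table."""
    return {t: conn.execute(f"SELECT COUNT(*) FROM {t}").fetchone()[0] for t in tables}


def restore_mismatches(live, restored, tables=TABLES):
    """Return {table: (live_count, restored_count)} for every table that differs."""
    live_counts = row_counts(live, tables)
    restored_counts = row_counts(restored, tables)
    return {
        t: (live_counts[t], restored_counts[t])
        for t in tables
        if live_counts[t] != restored_counts[t]
    }
```

An empty result means the counts agree. Matching counts do not prove the rows are identical, so a stricter check would also compare checksums or sampled rows.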

Self-validation can hold up. It held up here because the checklist was falsifiable. “Restore completed and row counts match” is a claim the universe can contradict. The builder made claims that could be checked, and they checked out. That matters, because the failure cases that follow weren’t dishonesty. They were something subtler.

What it caught anyway

The second session took under 30 minutes. It found four problems.

First, and worst: a gate was being counted that hadn’t been passed. The migration plan for the task queue said “cut over when the new lane has processed 50 tasks cleanly.” What actually ran was 50 synthetic smoke rows in proof mode. No real executor was invoked. No worker process was running. The legacy lane kept polling the whole time. The fine print was technically honest. The headline was wrong. If you only read headlines (and at the speed AI ships, you mostly do), you’d have cut over to a lane that had never processed a real task.
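One way to make a gate like this hard to pass by accident is to have the check count only real work. The sketch below is an assumption about how that could look, not the migration's actual code: the `TaskRun` fields and the `gate_passed` function are hypothetical names invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskRun:
    # All fields are hypothetical; they model what a queue row would need to record.
    task_id: str
    synthetic: bool          # True for smoke-test rows
    executor_invoked: bool   # True only if a real executor ran the task
    succeeded: bool


def gate_passed(runs, required=50):
    """Return (passed, real_clean_count), counting only real, executed, clean runs."""
    real_clean = sum(
        1 for r in runs if not r.synthetic and r.executor_invoked and r.succeeded
    )
    return real_clean >= required, real_clean
```

Fed 50 synthetic rows in proof mode, this check reports zero qualifying runs and the gate stays closed, which is the answer the headline should have given.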

Second: two copies of the same daemon process were racing each other on one queue. They’d been started two days apart. The reviewer detected it from a single error that appeared in the logs twice, 36 seconds apart. One mistake, logged by two hands.
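The signal the reviewer used can be approximated with a small log scan. This is a sketch under assumptions, not the reviewer's method: it takes log entries as hypothetical `(timestamp, message)` pairs and uses an arbitrary 60-second window.

```python
from datetime import timedelta


def repeated_errors(entries, window=timedelta(seconds=60)):
    """Return (message, gap) for each message that recurs within `window`.

    `entries` is an iterable of (datetime, message) pairs; the shape is an
    assumption made for this example.
    """
    last_seen = {}
    suspects = []
    for ts, msg in sorted(entries):
        prev = last_seen.get(msg)
        if prev is not None and ts - prev <= window:
            suspects.append((msg, ts - prev))
        last_seen[msg] = ts
    return suspects
```

A hit is only a hint. A single process retrying would produce the same pattern, so the finding still has to be confirmed against the process list.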

Third: a kill-switch “fix” that was enforcement theater. The fix gated the switch at the tool layer, which sounds fine until you notice that every agent on the box can open its own database connection and flip the flag directly. Real enforcement is a database role that can’t UPDATE the flag. Anything softer is a sign taped to an open door.
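The gap is easy to reproduce in miniature. The sketch below is a toy, with assumptions labeled: SQLite stands in for the real database, and the `flags` table, the `kill_switch` row, and the `operator` caller name are all hypothetical. The tool-layer gate refuses the agent, and a direct connection flips the flag anyway.

```python
import os
import sqlite3
import tempfile


def make_db(path):
    """Create a hypothetical flags table with the kill switch enabled."""
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE flags (name TEXT PRIMARY KEY, enabled INTEGER NOT NULL)")
    conn.execute("INSERT INTO flags VALUES ('kill_switch', 1)")
    conn.commit()
    conn.close()


def tool_set_flag(path, caller, enabled):
    """Tool-layer gate: only the hypothetical 'operator' caller may flip the flag."""
    if caller != "operator":
        raise PermissionError(f"{caller} may not change kill_switch")
    conn = sqlite3.connect(path)
    conn.execute("UPDATE flags SET enabled = ? WHERE name = 'kill_switch'", (int(enabled),))
    conn.commit()
    conn.close()


def read_flag(path):
    conn = sqlite3.connect(path)
    (value,) = conn.execute("SELECT enabled FROM flags WHERE name = 'kill_switch'").fetchone()
    conn.close()
    return bool(value)


def demo():
    """Return (tool_layer_blocked, flag_still_enabled) after a direct-connection bypass."""
    fd, path = tempfile.mkstemp(suffix=".db")
    os.close(fd)
    try:
        make_db(path)
        try:
            tool_set_flag(path, "agent-7", False)
            blocked = False
        except PermissionError:
            blocked = True
        # The bypass: the agent skips the tool and writes to the database itself.
        conn = sqlite3.connect(path)
        conn.execute("UPDATE flags SET enabled = 0 WHERE name = 'kill_switch'")
        conn.commit()
        conn.close()
        return blocked, read_flag(path)
    finally:
        os.remove(path)
```

The tool layer blocks the call and the flag ends up disabled regardless. SQLite has no roles, so the toy can only show the hole. In Postgres, the durable version is to connect agents as a role that has not been granted UPDATE on the flag's table, so the second write fails at the database.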

Fourth, small but telling: those 50 smoke-test rows were still sitting in the table, inflating the done counts. The dashboard said 423 completed tasks. The real number was 373.

None of these were lies. All of them were the kind of error an author makes about their own work: counting intent as completion, trusting the layer they built, forgetting the scaffolding they left behind.

Even the auditors get audited

Here’s the part that keeps the story honest. The first audit pass, the one that produced the six proposals, claimed the goals tables didn’t exist in the database. They existed. They held 24 objectives and 65 key results. The second pass caught the first pass.

So whichever session does the auditing is fallible too. That doesn’t collapse the system. It confirms the design principle underneath it: no single session, builder or reviewer, gets to be the last word. Independence catches what authorship hides, in both directions.

The structure that emerges

This is my operating rule now, and I’d give it to any team running AI agents in production:

Builder session, then an independent reviewer session with no authorship stake, then a human decision recorded at the gate.

Three parts, none optional. The builder moves fast and writes falsifiable claims. The reviewer reads the evidence cold, with nothing to defend. The human decides, and the decision gets written down where the next session can find it.
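Here is a minimal sketch of what a recorded gate decision might look like, assuming nothing about my actual tooling: `GateDecision`, `record_decision`, and every field name are hypothetical. The one rule it enforces in code is that the reviewer and the approver cannot be the session that built the release.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class GateDecision:
    # Hypothetical record shape; adapt the fields to wherever decisions are stored.
    release: str
    builder: str
    reviewer: str
    approver: str
    approved: bool
    note: str
    decided_at: str


def record_decision(release, builder, reviewer, approver, approved, note):
    """Build a decision record, refusing any overlap between the three roles."""
    if reviewer == builder:
        raise ValueError("reviewer must not be the session that built the release")
    if approver in (builder, reviewer):
        raise ValueError("approver must be distinct from the builder and the reviewer")
    return GateDecision(
        release=release,
        builder=builder,
        reviewer=reviewer,
        approver=approver,
        approved=approved,
        note=note,
        decided_at=datetime.now(timezone.utc).isoformat(),
    )
```

The code cannot prove the approver is human. It can only make sure the three roles are filled by three different identities and that the decision leaves a timestamped trace.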

We already know this pattern. It’s code review. Nobody serious merges their own unreviewed work to production, and the reason was never that programmers lie. The reason is that authors have a blind spot shaped exactly like their own assumptions. AI authors have the same blind spot, at higher speed.

What’s changed is where the cost lives. The expensive part of AI work has moved from doing to verifying. Doing took my builder 57 minutes for three releases. Verifying took a second session under 30 minutes, and that half hour is where the risk got caught. If your process budgets for the doing and treats verification as a formality, your process is priced for 2023.

Keep the speed

I want to be clear about what I’m keeping. Six proposals and three shipped releases in one morning is real throughput in a real production workspace, with real data underneath it. I’m keeping all of it. The answer to “AI moves too fast to trust” is process design, the same answer it’s always been for fast humans. You didn’t slow down your best engineer when she shipped daily. You built review around her.

The reviewer’s summary line is the one I’ve kept: speed was writing checks the evidence hadn’t cashed. That’s the failure mode in one sentence, and it names the fix too. A gate exists to make speed cash its checks.

The transferable rule: whoever did the work doesn’t certify the work, no matter how good the checklist looks, no matter the species of the author. Separate the hand that builds from the hand that signs.