For about seven weeks, my AI-operated company ran its entire memory off one SSD with zero backups. Postgres is the canonical truth of the workspace: every task, every decision, every work log row the system has ever recorded lives in it. The repo even admitted the gap in writing. docs/backup-strategy.md said “Status: Open question - not implemented,” raised 2026-04-21, still open on 2026-06-10.
The machine carrying all of this was a Mac mini with 28 days of uptime, 15 of 16 GB of memory in use, and 567 processes running. A busy, warm, heavily loaded box, and the only copy of everything. One drive failure and the company wakes up with amnesia.
How I Found It
I ran a full-workspace audit: three AI deep audits in parallel over the same infrastructure, then merged their findings. The missing backups came back as one of two stop-ship items.
The other one was worse to read. pg_hba.conf had a trust rule covering the entire private VPN address range, with listen_addresses='*' in postgresql.conf. Every device on the VPN had passwordless access to every database. Any of them could connect as superuser and flip the AI agents’ kill switch with one SQL statement. No exploit required, just a connection string.
I’ve written about this near-miss genre before in the tunnel incident. Same shape: a configuration chosen for convenience during setup, then forgotten, quietly carrying the whole risk surface.
The audit summarized the backup gap in one line I haven’t been able to shake: “A backup that has never been restored is a hope, not a backup.”
The Fix
Both findings shipped the same day.
Backups first. A script runs pg_dump -Fc against the workspace database, producing a 24 MB compressed dump, plus a SQLite .backup of the secondary store at 1.0 MB. A LaunchAgent fires it at 02:30 nightly, the same scheduling layer the rest of the task system runs on. Each run copies the dumps off-box to a second machine over the VPN, because a backup that lives on the disk it’s protecting protects nothing. Retention is 14 days local, 30 days off-box.
That part took an hour. It’s also the part everyone stops at, and it’s the part the audit line was aimed at.
The Restore Test
So I restored. pg_restore into a scratch database called ace_restore_test, exit 0. Then I compared row counts in the scratch database against live: tasks_inbox 373, work_log 851, decisions 114, actors 52, product_pipeline_items 69. Every count matched. I dropped the scratch database and only then marked the finding closed.
A backup pipeline counts as working on the day you restore from it and compare the data, and not one day before. Exit codes lie by omission. pg_dump can succeed against the wrong database, dump a stale schema, or write to a path nothing ever syncs. The five-minute restore drill is the only step that tests the thing you actually care about, which is whether the data comes back.
Closing the Trust Hole
The auth fix followed a sequence worth keeping as a pattern, because the order is what makes it safe:
- Create least-privilege roles while
trustauth is still active. - Update every consumer’s DSN to use the new roles, also while trust is still active, so nothing breaks mid-flight.
- Flip
pg_hba.conftoscram-sha-256and runpg_reload_conf(). - Test that passwordless and superuser connections now fail. The negative test is the point: if the old path still works, you’ve changed nothing.
- Keep a one-line rollback ready before you start.
One honest deviation got documented at ship time: the app role kept DELETE, because staging cleanup paths need it. DDL denied is the real boundary. A role that can delete rows is recoverable from last night’s dump; a role that can drop tables is a different category of bad day. While in there, I also found a config file sitting at 644 and chmod’d it to 600. Audits pay for themselves in the margins.
The Broader Lesson
Every safety mechanism you’ve never exercised is a hypothesis, and the failure it guards against is the worst possible time to test a hypothesis. The same rule covers more than backups: the rollback you’ve never run, the failover you’ve never triggered, the kill switch you’ve never flipped. Exercise the recovery path under calm conditions, on a schedule, with numbers compared at the end, or accept that what you have is a belief about recovery rather than a capability.
Hope is not a recovery plan.