This is the story of the worst exit code 0 I’ve produced. I ran a daily dashboard generator, the same script the scheduler runs. Its last step: rsync -a --delete from the fresh build to the live directory.
The output looked normal. The sync printed its success line.
Thirty seconds later, a page that had worked an hour earlier came back 404. So did 32 others.
Two copies, two months of drift
The generator existed in two places: a canonical copy, and a second copy that the scheduler actually runs. They’d started identical.
By the day of the incident, the second copy’s parser was two months stale. It predated a whole category of detail pages. It didn’t know how to render them.
I ran the stale copy. It parsed what it recognized, produced a build missing 33 pages, and handed that build to rsync.
--delete did exactly what it’s for: it made the live directory look like the source. The 33 pages the stale parser never produced were, from rsync’s point of view, files that shouldn’t exist.
Gone in one call.
Recovery took seconds, only because the live directory happened to be git-tracked: a one-line restore, an additive sync without --delete, a reload. If those pages had been untracked build output, they’d just be gone.
While hardening the script that afternoon, I found the worse version waiting. Each copy resolves its home folder from where it sits on disk. Run the wrong copy and:
- It looks for its input in a place that has none.
- It finds nothing, and warns instead of raising.
- It renders a single near-empty page.
rsync --delete makes the live directory match: one page. Total wipe, exit 0.
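The second bullet is the cheapest place to break that chain: a generator that fails when its input is missing never reaches the sync. A minimal sketch, assuming a bash entry point; the require_input name, the *.json pattern, and exit status 3 are illustrative, not taken from the real generator:

```bash
#!/usr/bin/env bash
# Sketch: turn "no input found" from a warning into a hard failure.
# require_input, the *.json default and exit status 3 are hypothetical.

require_input() {
  local dir="$1" pattern="${2:-*.json}" first
  if [ ! -d "$dir" ]; then
    echo "FATAL: input directory $dir does not exist" >&2
    return 3
  fi
  # Only the first match matters: one file is enough to prove there is input.
  first=$(find "$dir" -type f -name "$pattern" | head -n 1)
  if [ -z "$first" ]; then
    echo "FATAL: no $pattern files under $dir" >&2
    return 3
  fi
}

# Usage, before any rendering starts (INPUT_DIR is a placeholder):
#   require_input "$INPUT_DIR" || exit $?
```

With a check like this, the wrong-copy run would stop before rendering anything, so there would be no near-empty build to hand to rsync.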
rsync treats both cases as legitimate states to replicate. Its entire contract: make target look like source. An empty source is a valid source.
| Source looked like | Sync believed | What would vanish |
|---|---|---|
| Healthy build | Every file in the target still belongs there | Nothing |
| Stale parser build | Thirty-three old pages should no longer exist | The detail pages the parser forgot |
| Empty or wrong build | The target should become nearly empty | Almost everything |
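
One way to see what the sync believes before it acts is a dry run. The sketch below is not part of the fix described in this post. It assumes rsync's --dry-run and --itemize-changes options, whose output marks each planned removal with a line starting *deleting. The function names and the limit of 5 are illustrative:

```bash
#!/usr/bin/env bash
# Sketch: count the deletions rsync is planning before letting it run.
# Function names and the limit used in the usage line are hypothetical.

# Reads `rsync --dry-run --itemize-changes` output on stdin and prints
# how many lines announce a deletion.
count_planned_deletes() {
  # grep -c exits 1 when the count is 0; that is not an error here.
  grep -c '^\*deleting' || true
}

# Reads the same output on stdin. Returns 0 if the planned deletions are
# within the limit, 2 otherwise.
deletes_within_limit() {
  local limit="$1" planned
  planned=$(count_planned_deletes)
  if [ "$planned" -gt "$limit" ]; then
    echo "REFUSING: sync would delete $planned files (limit $limit)" >&2
    return 2
  fi
}

# Usage (SRC and DEST are placeholders):
#   rsync -a --delete --dry-run --itemize-changes "$SRC/" "$DEST/" \
#     | deletes_within_limit 5 && rsync -a --delete "$SRC/" "$DEST/"
```

rsync also has a --max-delete=NUM option that caps deletions during a real run. The dry-run version has the advantage of refusing before anything is touched.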
Ten pages or no sync
The fix that shipped the same day is a gate between build and sync: count the detail pages, refuse to sync below a threshold. The numbers here:
- A healthy build produces 35 pages.
- A broken one produces 0.
- The threshold is 10.
The negative test confirmed it. The wrong-copy run now prints REFUSING TO SYNC, exits 2, and the live directory stays untouched.
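A negative test of that shape fits in a few lines of bash. This is an illustration, not the test that actually ran: the gate_build function, the detail-*.html naming pattern, and the temporary directories are all assumptions, and the destructive sync is replaced by a stand-in so the test needs no rsync.

```bash
#!/usr/bin/env bash
# Sketch: a page-count gate plus a negative test for it.
# gate_build and the detail-*.html pattern are hypothetical names.

MIN_PAGES=10

# Returns 0 if the build holds at least MIN_PAGES detail pages, 2 otherwise.
gate_build() {
  local build="$1" count
  count=$(find "$build" -type f -name 'detail-*.html' | wc -l | tr -d ' ')
  if [ "$count" -lt "$MIN_PAGES" ]; then
    echo "REFUSING TO SYNC: $count detail pages (threshold $MIN_PAGES)" >&2
    return 2
  fi
}

# Negative test: an empty build must be refused with status 2, and the
# live directory must still hold its pages afterwards.
negative_test() {
  local build live rc=0 remaining
  build=$(mktemp -d)
  live=$(mktemp -d)
  touch "$live/detail-1.html" "$live/detail-2.html"
  if gate_build "$build" 2>/dev/null; then
    rm -f "$live"/*   # stand-in for the destructive sync; must not run
  else
    rc=$?
  fi
  remaining=$(find "$live" -type f | wc -l | tr -d ' ')
  rm -rf "$build" "$live"
  [ "$rc" -eq 2 ] && [ "$remaining" -eq 2 ]
}

negative_test && echo "negative test passed"
```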
When the builder is code you can’t edit, the same idea works as a wrapper around the sync itself:
#!/usr/bin/env bash
# safe-rsync: refuse to --delete from a source that looks wrong
set -euo pipefail
SRC="${1:?source dir}" ; DEST="${2:?dest dir}"
MIN_FILES=10 # well below any healthy build
SENTINEL="$SRC/index.html" # a file every real build produces
[ -e "$SENTINEL" ] || { echo "REFUSING: sentinel $SENTINEL missing" >&2; exit 2; }
count=$(find "$SRC" -type f | wc -l | tr -d ' ')
[ "$count" -ge "$MIN_FILES" ] || { echo "REFUSING: $count files in $SRC (threshold $MIN_FILES)" >&2; exit 2; }
exec rsync -a --delete "$SRC/" "$DEST/"
The sentinel catches the empty-source case, such as a directory that was never mounted or never built. The count catches the broken-build case. Both cost milliseconds, against a delete that costs whatever the target was worth.
What the checksums keep saying
The duplication that started this is still alive. I made the two copies byte-identical the day of the incident. They drifted again within three days: a patch landed on one and missed the other.
An independent reviewer caught my draft still claiming they matched. I checksummed them again while finishing this piece: different again. Nothing enforces parity. Consolidating the copies is still on the backlog.
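Until the copies are consolidated, a parity check could at least make drift fail loudly instead of waiting for someone to checksum by hand. A minimal sketch using cmp and cksum; the check_parity name and both paths in the usage line are placeholders, not the real locations:

```bash
#!/usr/bin/env bash
# Sketch: fail when two copies of a script are not byte-identical.
# check_parity is a hypothetical helper name.

check_parity() {
  local canonical="$1" deployed="$2"
  # cmp -s is silent and exits 0 only when the files are identical.
  if cmp -s "$canonical" "$deployed"; then
    return 0
  fi
  echo "DRIFT: $deployed differs from $canonical" >&2
  # Print both checksums so the mismatch is visible in logs.
  cksum "$canonical" "$deployed" >&2
  return 1
}

# Usage (paths are placeholders), for example as the first step of the
# scheduled run:
#   check_parity /path/to/canonical/generator /path/to/scheduled/generator || exit 1
```

This does not remove the duplication. It only turns silent drift into a failed run.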
That re-drift is the gate’s whole argument. Hand-syncing repaired one incident and held for three days. The threshold check is the seatbelt: the next time a stale or empty build heads for rsync, the run dies loudly, and the 33 pages stay where they are.