I was running an independent review of our queue-migration work when the same failure message appeared twice in the work log. Identical actor, identical text: a Gmail sync job reporting NOT_AUTHENTICATED at 10:05:21, then again at 10:05:57. One scheduled job, one schedule, two rows, 36 seconds apart.

A scheduled job logs once per run. When it logs twice, either the job ran twice or two things ran it. I checked the process table:

$ ps aux | grep 'alcanah-daemon/main.py' | grep -v grep
ace  79651  0.2  0.9  ...  Sun10PM  2:41.33  python3 /Users/ace/alcanah-daemon/main.py
ace  69917  0.3  0.9  ...  Tue09PM  1:17.08  python3 /Users/ace/alcanah-daemon/main.py

Two copies of the same daemon. Pid 79651 started Sunday at 22:48. Pid 69917 started Tuesday at 21:12. Two days apart, both alive, both polling the same GitHub task queue every 2 seconds, both reading and writing the same task-state.json.

The Echo

The part that stung: this happened mid-migration away from exactly this class of failure. The audit driving the whole project had diagnosed split-brain state as the architecture’s core defect: two writers, one state file, no arbiter. And while we wrote that diagnosis, debated it, and built the replacement, a live instance of the defect sat in the process table the entire time. We’d described the disease accurately while carrying it.

It stayed invisible because an old idempotency patch was covering for it. The legacy queue runner kept a completed-guard dict, added months earlier to paper over a different bug, and that guard absorbed most of the double-processing: the second daemon would pick up a task, find it already marked complete, and move on. The duplicate only leaked through a side effect. The Gmail sync job failed before it ever touched the guard, so its failure logged from both processes, and the doubled rows were the one visible symptom of a two-headed system. The patch that hid the damage also hid the evidence.
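To make the masking concrete, here is a toy model of that behavior. It is not the legacy runner's code: the names (`completed`, `work_log`, `run_task`, the task ids) are invented for illustration, and the two daemons are simulated one after the other instead of racing.

```python
# Toy model only: every name here is hypothetical, not the legacy runner's code.
completed = {}   # shared completed-guard: task id -> True once processed
work_log = []    # shared work log; rows carry no daemon identity


def run_task(task_id, fails_before_guard=False):
    if fails_before_guard:
        # The failure path returns before the guard is consulted,
        # so every daemon that runs the job logs the failure.
        work_log.append((task_id, "NOT_AUTHENTICATED"))
        return
    if completed.get(task_id):
        # The second daemon finds the task already done and moves on silently.
        return
    completed[task_id] = True
    work_log.append((task_id, "ok"))


# Two daemons each make one pass over the same queue.
for _daemon in ("first copy", "second copy"):
    run_task("ordinary-task")
    run_task("gmail-sync", fails_before_guard=True)
```

In this model the ordinary task produces one log row and the failing job produces two, which is the same shape as the work log that started the investigation.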

How Two of Them Happened

The machine runs 35 LaunchAgents. There was no supervised-services manifest, no document anywhere stating which processes should exist. Somewhere along the way, someone (me) started the daemon manually, probably to test something, and the start path had no check for an existing instance. The first copy kept running. The second copy joined it. Nothing complained, because nothing knew what the correct count was.

The box’s memory pressure (15 of 16 GB used) turned out to be partly this. A redundant Python process polling every 2 seconds is a tax you pay continuously without ever seeing the invoice.

The Fix

The immediate fix took one kill. The real fixes came in two layers.

First, a single-instance guard: a pidfile plus flock, so a second copy refuses to start while the lock is held. The stricter version is to make launchd the only legal starter and have the daemon exit immediately if it didn’t come up through it. Either way, the property you want is that running two copies requires deliberate effort instead of being the silent default.
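A minimal sketch of the pidfile-plus-flock guard, assuming a Unix host where Python's `fcntl` module is available. The function name and the pidfile location are assumptions for illustration, not the daemon's actual code.

```python
import fcntl
import os


def acquire_single_instance_lock(pidfile_path):
    """Return an open file holding an exclusive lock, or None if it is already held."""
    # Open without truncating, so a process that loses the race
    # does not wipe the pid written by the current lock holder.
    fd = os.open(pidfile_path, os.O_RDWR | os.O_CREAT, 0o644)
    handle = os.fdopen(fd, "r+")
    try:
        # LOCK_NB makes the call fail immediately instead of waiting.
        fcntl.flock(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except OSError:
        handle.close()
        return None
    # We own the lock: record our pid for humans and tooling.
    handle.seek(0)
    handle.truncate()
    handle.write(f"{os.getpid()}\n")
    handle.flush()
    # The caller must keep this object referenced for the life of the process;
    # closing it (or exiting) releases the lock.
    return handle
```

At startup the daemon would call this once and exit if it returns None. The kernel drops the lock when the process dies, so a stale pidfile left by a crash does not block the next start, which is the main reason to pair the pidfile with flock instead of trusting the file's existence alone.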

Second, the structural fix: a supervised-services manifest, emitted by our workspace-mapping agent, listing every process that should be running on the machine. “What should be running” has to be a checkable artifact, not tribal knowledge. Once the manifest exists, a cron job can diff it against ps and flag both directions of drift: processes that should exist and don’t, and processes that exist and shouldn’t. The second category is the one that bites you, because nothing else ever reports it.
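The diff itself can be small. The sketch below assumes a manifest shaped as a mapping from a command-line substring to the expected instance count; that shape, and the function names, are assumptions, since the real manifest format is whatever the mapping agent emits. It only counts processes the manifest names, so it catches duplicates of known services but not entirely unknown processes.

```python
import subprocess


def list_running_commands():
    """Return the full command line of every process, one string per process."""
    out = subprocess.run(
        ["ps", "-A", "-o", "args="], capture_output=True, text=True, check=True
    ).stdout
    return [line.strip() for line in out.splitlines() if line.strip()]


def census_drift(manifest, commands):
    """Compare expected instance counts against observed command lines.

    manifest maps a command-line substring to the number of instances expected.
    Returns (missing, surplus): pattern -> how many short, pattern -> how many over.
    """
    observed = {
        pattern: sum(1 for cmd in commands if pattern in cmd)
        for pattern in manifest
    }
    missing = {p: n - observed[p] for p, n in manifest.items() if observed[p] < n}
    surplus = {p: observed[p] - n for p, n in manifest.items() if observed[p] > n}
    return missing, surplus
```

Calling `census_drift({"alcanah-daemon/main.py": 1}, list_running_commands())` against the process table shown earlier would put the daemon in `surplus` with a count of 1. A scheduled job would alert whenever either dict is non-empty.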

The Broader Lesson

Our migration’s acceptance metric was “zero stranded rows for a week.” That number is meaningless while a second consumer races the queue: the duplicate could be eating the very failures whose absence you’re measuring. Before you trust any verification metric, take the process census, because every measurement assumes you know who’s doing the writing.

Count the processes before you count the rows.