The failure that looks like success

The way I pictured an unattended agent loop dying was loud. A crash. A stack trace. A process that exits and a monitor that goes red. Something that announces itself.

That's not what got us. What got us was a loop that kept reporting for duty — running its scheduled work, on time, every time — while quietly accomplishing nothing. It looked alive. It wasn't advancing. And the reason nobody noticed for a while is the part worth writing down: the place you'd naturally look to check on it was downstream of the very thing that had failed.

The picture: it looked alive but wasn't advancing

Here's the concrete image, because it's the whole lesson in one frame.

The loop was a long-lived session that woke up on a schedule, did a unit of work, and went back to sleep. For a long time it was healthy. Then, gradually, it wasn't — but the symptoms were nothing like a crash. The session kept firing on schedule. It kept starting its work. What it stopped doing was finishing cleanly: it began producing garbled output, emitting malformed tool calls, and completing runs that left no trace behind — no log entry, no committed result. The motions were all there. The output wasn't.

From the outside, "running on schedule" and "running correctly" are easy to confuse, and the one signal we were watching (is it ticking?) was still green. The loop was ticking. It just wasn't doing anything with the ticks.

A crash would have been a mercy. A crash tells you. This told you nothing, because from every angle we were watching, it looked exactly like a quiet, healthy system on a slow day.

Why it was invisible

The thing that made this hard isn't that we weren't watching. It's that we were watching the wrong thing, and we'd have kept watching it forever, because it was the obvious thing.

The natural place to check whether the loop is working is the dashboard — the surface that shows you what the loop has been doing. But the dashboard is downstream of the loop. It's fed by the loop's output. When the loop degrades and stops producing real output, the dashboard doesn't go red. It just goes quiet. And a quiet dashboard is indistinguishable from a system that simply doesn't have much to report right now.

This is the trap, stated generally: you cannot monitor a thing with a tool that fails alongside it. Any health signal that lives inside the loop, or is produced by the loop, or depends on the loop running correctly, will go dark at the exact moment you most need it to shout. The dashboard goes stale in lockstep with the loop that feeds it. The deeper the monitoring is wired into the system it watches, the more useless it becomes precisely when the system breaks.

We had, separately, already met a smaller cousin of this — an earlier incident where one step's output quietly drifted out of shape and stranded a whole queue of work without anyone seeing it. Same family of problem: a failure that produces no error, only an absence, and an absence is hard to notice when you're scanning for trouble rather than for silence. That earlier scare is the reason the deepest of the fixes below already had a head start in our thinking.

The fixes, in order of depth

We fixed this in three layers. They're worth taking in order, because each one addresses a different depth of the same problem, and the order is the lesson.

1. Detection you can't fool: liveness from outside, alerting out-of-band

The first fix is the one that would have caught this the day it started. Put the liveness check outside the loop, on a dead-simple scheduler that has no dependency on the loop being healthy, and have it alert through a channel that has nothing to do with the dashboard.

The principle is that the watcher must be able to survive what it watches. An external check, run by something boring and independent, asks one question on a fixed cadence: has this loop done real, fresh work recently? The question is not "is the process up," which a degraded loop answers yes to, but "is there evidence of actual progress," such as a recent log entry or a committed result. If the answer is no for too long, it sends an alert somewhere you'll see it even if everything else is dark: out of band, to a phone, not to the same dashboard that's already gone stale.

The shape that matters here: the detector is simple, the detector is independent, and the alarm reaches you through a path the failure can't also take down. A complex monitor that shares fate with the system it monitors is not a monitor. It's another thing that can break quietly.
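As a rough illustration of that shape, here is a minimal Python sketch of an outside-the-loop freshness check. Everything named in it is an assumption made for the example, not a description of our actual setup: read_last_progress stands in for however you look up the timestamp of the last committed result, send_alert stands in for whatever out-of-band channel reaches a phone, and max_age_s is a threshold you would have to pick for your own cadence.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Verdict:
    healthy: bool
    reason: str


def judge_freshness(last_progress_ts: Optional[float], now: float,
                    max_age_s: float) -> Verdict:
    """Decide health from the age of the newest evidence of real work."""
    if last_progress_ts is None:
        return Verdict(False, "no progress record found")
    age = now - last_progress_ts
    if age > max_age_s:
        return Verdict(False, f"last real progress {age:.0f}s ago (limit {max_age_s:.0f}s)")
    return Verdict(True, f"last real progress {age:.0f}s ago")


def run_liveness_check(
    read_last_progress: Callable[[], Optional[float]],  # hypothetical: epoch seconds of last committed result
    send_alert: Callable[[str], None],                  # hypothetical: out-of-band channel, not the dashboard
    max_age_s: float,
    now: Optional[float] = None,
) -> Verdict:
    """Run one check. Meant to be invoked by an independent scheduler."""
    now = time.time() if now is None else now
    try:
        verdict = judge_freshness(read_last_progress(), now, max_age_s)
    except Exception as exc:
        # Fail closed: if the evidence cannot be read, treat that as unhealthy.
        verdict = Verdict(False, f"could not read progress record: {exc!r}")
    if not verdict.healthy:
        send_alert(verdict.reason)
    return verdict
```

Two choices in the sketch carry the point. It asks about the age of real output rather than whether a process exists, and it treats "I couldn't read the evidence" as a failure rather than as silence. It should be run by something that shares nothing with the loop, for example a plain system scheduler on a different host.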

2. Prevention: cycle the long-lived session before it bloats

Detection tells you after the fact. The second fix attacks the cause.

The degradation in our case wasn't random; it came from a session living too long. A long-running session accumulates state, and past a certain point that accumulation is what starts producing the garbled output. The bloat is the failure mode, building slowly until it crosses a line.

So we stopped letting it get there. The session now cycles on a timer — it's restarted from a clean state on a regular cadence, well before it's been alive long enough to degrade. A fresh start beats a degraded continuation. This is the cheap, unglamorous prevention: rather than detect the slow rot and react, don't let the thing live long enough to rot. If a long-lived process predictably degrades, the simplest durable fix is to make it not long-lived.
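A minimal sketch of cycling on a timer, in Python, might look like the following. It is illustrative only: the factory and close callables are hypothetical stand-ins for however a session is created and torn down, and max_age_s is an assumed knob, since the right lifetime depends on how quickly your own sessions degrade.

```python
import time
from typing import Callable, Generic, Optional, TypeVar

S = TypeVar("S")


class CyclingSession(Generic[S]):
    """Hands out a session and replaces it once it is older than max_age_s."""

    def __init__(
        self,
        factory: Callable[[], S],                        # hypothetical: builds a clean session
        max_age_s: float,                                # assumed lifetime limit
        clock: Callable[[], float] = time.monotonic,
        close: Optional[Callable[[S], None]] = None,     # hypothetical: tears down the old session
    ) -> None:
        self._factory = factory
        self._max_age_s = max_age_s
        self._clock = clock
        self._close = close
        self._session: Optional[S] = None
        self._born = 0.0
        self.generation = 0  # how many sessions have been created so far

    def get(self) -> S:
        """Return the current session, starting a fresh one if it has expired."""
        now = self._clock()
        expired = self.generation > 0 and now - self._born >= self._max_age_s
        if self.generation == 0 or expired:
            if expired and self._close is not None:
                self._close(self._session)
            self._session = self._factory()
            self._born = now
            self.generation += 1
        return self._session
```

Calling get() at the start of each scheduled unit of work means the restart happens between units and never in the middle of one. The age check is deliberately dumb: it does not try to detect degradation, it just refuses to let a session get old.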

3. The deepest one: validate the work against a contract at the seam

The first fix catches the loop when it stalls, and the second keeps it from stalling. The deepest fix is about the work itself, and it's the one that closes the door on bad-shaped output traveling silently, which is why it goes last.

Don't trust a worker's output by looking at it. Even a healthy loop can hand off output that's subtly wrong, and "subtly wrong but plausible" is exactly the thing a human eye skims past. The fix is to machine-check structured output against an explicit contract right at the boundary where it's produced, the seam where one step hands work to the next. The contract describes the exact shape the output must take, and the check runs automatically. Output that matches passes. Output that doesn't is rejected and the step is re-run; it is never silently accepted and never quietly handed on.

This is the layer that would have stopped the garbled output from going anywhere even if the first two fixes had failed. Degraded or drifted output can't propagate if it has to pass a machine check it can't fake. It's also the direct descendant of that earlier stranded-queue incident: once you've watched bad-shaped output travel silently, you stop trusting eyeballs at the boundary and start enforcing the shape there. A check at the seam is the difference between "we hope the output was good" and "the output that got through was provably the right shape."
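To make the seam check concrete, here is a small Python sketch using only the standard library. The contract itself is invented for the example: the field names task_id, status, and result, the allowed status values, and the retry limit are all assumptions, and a real system might well use a schema library instead of hand-rolled checks.

```python
import json
from typing import Any, Callable, Dict, List

# Hypothetical contract: required fields and the type each must have.
CONTRACT = {"task_id": str, "status": str, "result": dict}
ALLOWED_STATUS = {"done", "failed"}  # assumed values, for illustration


class ContractViolation(Exception):
    """Raised when worker output does not match the contract."""


def validate(raw: str) -> Dict[str, Any]:
    """Parse raw worker output and check its shape; raise on any mismatch."""
    try:
        data = json.loads(raw)
    except (TypeError, ValueError) as exc:
        raise ContractViolation(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ContractViolation("top level must be a JSON object")

    problems: List[str] = []
    for field, expected in CONTRACT.items():
        if field not in data:
            problems.append(f"missing field {field!r}")
        elif not isinstance(data[field], expected):
            problems.append(f"field {field!r} must be {expected.__name__}")
    extra = sorted(set(data) - set(CONTRACT))
    if extra:
        problems.append(f"unexpected fields {extra}")
    status = data.get("status")
    if isinstance(status, str) and status not in ALLOWED_STATUS:
        problems.append(f"status {status!r} not in {sorted(ALLOWED_STATUS)}")

    if problems:
        raise ContractViolation("; ".join(problems))
    return data


def run_with_contract(worker: Callable[[], str], max_attempts: int = 3) -> Dict[str, Any]:
    """Re-run the worker until its output passes, then hand it on; otherwise fail loudly."""
    last_error = "no attempts made"
    for _ in range(max_attempts):
        try:
            return validate(worker())
        except ContractViolation as exc:
            last_error = str(exc)
    raise ContractViolation(f"rejected after {max_attempts} attempts: {last_error}")
```

The important property is that there is no code path where unvalidated output reaches the next step: it either passes the check or raises. Note that a check like this enforces shape only. It will not catch a value that is well-formed but factually wrong, which is part of why human review still sits at the end.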

The transferable takeaway: build for the failure that looks like success

If you run agents unattended, the failure you should design against first is not the crash. The crash announces itself; you'll handle it. The dangerous failure is the one that looks like success: the loop that keeps its rhythm while quietly accomplishing nothing, the dashboard that goes calm instead of red, the output that looks plausible and is wrong.

That mode doesn't set off your alarms, so you have to engineer the alarm yourself. Three moves, roughly in order of how deep they go:

Liveness you can't fool — detection that lives outside the loop and alerts out of band, so it survives what it watches.
Prevention by cycling — don't let a long-lived process live long enough to degrade; a fresh start beats a slow rot.
Validation at the boundaries — machine-check the work against its contract at the seam, so degraded output can't travel even when everything else looks fine.

None of this is glamorous, and none of it is exotic. It's the boring discipline that makes the impressive thing — an agent loop you can actually walk away from — possible. The model is the easy part to be impressed by. The part that lets you stop watching is the assumption that something will eventually break quietly, and the machinery, sitting outside the loop, that refuses to let a quiet break stay quiet.

A human still reviews the work that matters before it counts. What these three layers buy is that the human's attention goes to judgment, not to catching a loop that kept looking alive long after it had stopped advancing.