The fallback was a boolean

2026 · 04 · 22 · failures · recovery · harness · meta-agent

The ones I couldn't see

The worse version took longer. A visitor asks something cross-module — say, how the subtraction principle plays with the meta-agent's proposal cycle — and kr8 answers about one of them confidently, hand-waves the other, schema passes, event says success. Visitor modifies the response. Walks away. As far as the event log is concerned, it worked.

I'd been measuring "failure" as "contract violation." What I actually needed was "the response was worse than it should have been." Those aren't the same and I'd been shipping the second one every day without noticing.

This is also why I'd been suspicious of kr8's numbers. Accept rates looked fine. The feature felt shallower than the numbers suggested. I kept blaming that on vibes. It wasn't vibes.

Claude Code has the same gap

Worth noting because it explains why I didn't catch this sooner. The dev agent most of us use daily retries API errors and has no I'm-stuck detection. It works because a human is watching every step and redirects when it walks into a wall. That's not an option for ambient agents. Visitors don't redirect kr8 — they just leave.

If you build an ambient agent by pattern-matching on what development agents do, you inherit this gap. Which I did.

What's in the spec

Three things. None of them clever on their own; the interesting bit was that the existing harness had hooks for all three without restructuring.

Failures are classified now. Five categories — context insufficiency, skill coverage gap, dispatch miss, model limitation, contract violation. Pure function over the failure state. Same input, same category. Meta-agent reads category as a field instead of inferring from text.

There's bounded recovery before fallback. Two strategies in v1 — broaden L2 search, retry with reduced context. Three seconds wall time, thousand extra tokens, hard limits. Visitor sees one response. Either the retry landed or the fallback did. No loading states, no intermediate output, no "thinking…". One of the things I keep having to relearn is that the agent experience is the product, and the product should hide its own internal scaffolding unless the scaffolding failed.

Then there's the quiet-failure pass in pattern detection. Cluster questions within a concept by similarity, find clusters where the modify rate is 2× the skill's baseline, check the traces — was context assembly below budget? Were the L2 chunks relevant? If the context had room and didn't fill it with the right material, emit GapDetected with gapType: quiet-failure. That's the signal I'd been missing. The meta-agent can finally separate "the skill is bad" from "the context shape is bad," which are very different fixes and I'd been lumping them.

Things I'm not sure about

The 2× modify-rate threshold is a guess. Probably needs to be per-skill. Probably needs to adapt. Not doing either yet.

The three-second recovery budget assumes the second model call lands in ~2.5s. No idea if that holds under production load. First production failure this hits will tell me.

And I might be over-indexing on the playbook surface as the proof case because it's the one I most want to work, not because it's the best evidence. Suspecting this doesn't make it less true.

kr8 · next

// Where to next?

More journal →Playbook →