entropik.
§ 02.13 · harness · subtraction · assumption-register

The Subtraction Principle

Mature harness engineering is a craft of subtraction as much as addition. I got good at adding structure. Getting good at removing it has been harder, and the difference is where the leverage is.

The shape I had backwards

My first instinct with a new harness, for years, was to add. The model is unreliable here — add a validator. The model sometimes picks the wrong branch — add a router. The model loses track of context — add context management. Every failure I saw became a new component. The harness grew. That felt like progress.

It mostly wasn't. What I was actually doing was building a sediment layer of compensations for specific weaknesses in specific models, and then leaving the sediment in place long after the weakness had gone. By the time I'd built the third or fourth platform, the oldest ones were carrying multiple generations of workarounds for model problems that had been fixed two or three model versions ago. The harness was heavier. The model underneath was smarter. Performance was worse than it would have been with a thinner harness.

The mindset that took me too long to internalise: every harness component encodes a bet that the model can't do something without help. Those bets expire. Harness engineering is as much about noticing when bets have expired as it is about placing new ones.

Evidence from people who did it right

Three examples made me take this seriously. Manus rewrote their harness five times in six months. Each rewrite deleted more than it added. Their biggest performance gains came from removing features — complex document retrieval ripped out, fancy routing killed, management agents replaced with structured handoffs. At fifty tool calls per task, even models with huge context windows degrade because the signal gets buried under noise. Simpler harness, better results.

Vercel had a text-to-SQL agent built with specialised tools — schema understanding, query writing, result validation, error handling wrapped around it. It worked about 80% of the time. They removed 80% of the tools, gave the agent just bash commands and file access, and accuracy went to 100%, token usage dropped 40%, and the whole thing was 3.5× faster. The engineer's conclusion is one I keep returning to: models are getting smarter, context windows are getting larger, so maybe the best agent architecture is almost no architecture at all.

Anthropic's own harness decay is visible across three Opus generations. Opus 4.5 needed sprint decomposition, per-sprint evaluation, context resets between sprints. Opus 4.6 needed none of those — cost dropped 38%, time dropped 36%. Opus 4.7 self-verifies outputs, produces cleaner code with fewer wrapper functions, generates a third the tool errors. Three generations, three rounds of subtraction. The harness didn't grow. It shrank.

The pattern Sutton's bitter lesson predicts: simple methods that scale with compute consistently outperform complex hand-engineered solutions. As models get smarter, the right harness gets simpler, not more elaborate.

The assumption register — the tool that makes subtraction practical

Subtraction without a record is just deletion, and you'll either delete too much or too little. The register is the record. Every harness component gets an entry — what weakness it addresses, when it was added, when it was last tested, what its success rate is, what its latency cost is, what action the last audit recommended.

The columns I use now look roughly like this: component name, the specific assumption it encodes (be concrete — "model returns invalid JSON 12% of the time without schema enforcement", not "model needs help"), the date it was added, the last date you verified the assumption still holds and against which model, the success rate (how often it actually prevents a failure), the latency cost, and the action from the last audit (keep, test, remove).

Two columns do most of the work. Success rate tells you whether the component is earning its place — a verifier that catches less than 5% of calls is dead weight. Latency cost tells you what keeping it is worth. When both numbers stop favouring the component, it comes out.
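A register doesn't need tooling beyond a table, but if it lives next to the code, one entry can be as small as a dataclass. A minimal sketch in Python, under the 5% rule of thumb above; every name, date, and number here is hypothetical:

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class RegisterEntry:
    component: str
    assumption: str      # a concrete, falsifiable claim about the model
    added: date
    last_tested: date    # when the assumption was last re-verified
    tested_against: str  # model name the last verification ran against
    success_rate: float  # fraction of calls where it prevented a real failure
    latency_ms: float    # what keeping it costs on every call
    action: str          # "keep" | "test" | "remove"

    def earning_its_place(self) -> bool:
        # The 5% rule of thumb: below that, the component is dead weight.
        return self.success_rate >= 0.05

# Hypothetical entry: a validator whose bet has expired.
validator = RegisterEntry(
    component="json-schema-validator",
    assumption="model returns invalid JSON 12% of the time without schema enforcement",
    added=date(2023, 5, 1),
    last_tested=date(2024, 2, 1),
    tested_against="model-vNext",
    success_rate=0.02,   # was 0.12 when added
    latency_ms=40.0,
    action="test",
)
```

The point of `earning_its_place` being a method rather than a judgement call is that the removal threshold is written down once, argued about once, and applied the same way to every component.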

The discipline that matters: audit on every model release. The playbook's original rule of thumb was quarterly audits, which was sensible in 2023 but is too slow now. Models ship every three or four months. Every release is a subtraction opportunity. The audit isn't long — pull up the register, disable each component against the new model, run the evals, see which components are still necessary. Keep the ones that catch real failures. Delete the rest. Update the "Last Tested" column against the new model name.
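The audit loop itself is a few lines. A sketch, assuming your eval harness can be wrapped as a function that takes a set of disabled components and returns a pass rate; `fake_evals`, the component names, and the model name are all stand-ins:

```python
from datetime import date

def audit(register, run_evals, model_name, today):
    """Per-release audit: disable each component in turn, re-run the
    evals, and keep only components whose removal hurts the score.
    `register` is a list of dicts; `run_evals(disabled)` returns a
    pass rate in [0, 1]."""
    baseline = run_evals(disabled=frozenset())
    decisions = {}
    for entry in register:
        name = entry["component"]
        score = run_evals(disabled=frozenset({name}))
        # If the evals don't get worse without it, the bet has expired.
        decisions[name] = "remove" if score >= baseline else "keep"
        entry["last_tested"] = today
        entry["tested_against"] = model_name
    return decisions

# Hypothetical eval harness: the validator no longer matters,
# but disabling the retry loop still costs twelve points.
def fake_evals(disabled):
    return 0.80 if "retry-loop" in disabled else 0.92

register = [
    {"component": "json-validator", "last_tested": None, "tested_against": None},
    {"component": "retry-loop", "last_tested": None, "tested_against": None},
]
decisions = audit(register, fake_evals, "model-vNext", date(2026, 1, 1))
# → {'json-validator': 'remove', 'retry-loop': 'keep'}
```

Disabling one component at a time is the simplest design, not the cheapest; if eval runs are expensive, disabling candidates in batches and bisecting on regressions gets the same answer in fewer runs.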

Red flags I now watch for

Verification loops that catch nothing. If a verifier has been in place for three months and hasn't fired once, it's not a verifier, it's decoration.

Routing logic the model could handle itself. Test it. Remove the classifier, ask the model to figure out the branch inline, see what happens.

Context management that duplicates what the model tracks. Modern models do a surprising amount of working-memory management internally. The context compaction layer I built on one platform was actively fighting the model by the time Opus 4.6 shipped.

Tool descriptions more detailed than the model needs. Concise tool docs often outperform verbose ones, because the model has less to wade through.

Multi-step chains that could be single LLM calls. The reflex to decompose into stages is usually right, but not always. Sometimes the decomposition is residue from an earlier model's weakness.

Error handlers that never fire. If the error case hasn't happened in production in six months, either it can't happen (delete the handler) or you've successfully made it impossible elsewhere (delete the handler anyway, and let the system fail loudly if you're wrong).
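Most of these red flags reduce to the same measurement: how often does the component actually intervene? A sketch of that check, assuming you log one entry each time a guard fires; the component names and counts are hypothetical:

```python
from collections import Counter

def flag_dead_components(components, fire_log, total_calls, floor=0.05):
    """Return the components firing on fewer than `floor` of calls.
    `fire_log` holds one entry per time a guard actually intervened;
    components absent from the log have fired zero times."""
    fires = Counter(fire_log)
    return sorted(c for c in components if fires[c] / total_calls < floor)

components = ["json-validator", "retry-loop", "sql-sanitiser"]
fire_log = ["retry-loop"] * 120 + ["json-validator"] * 3  # over 1000 calls
flagged = flag_dead_components(components, fire_log, total_calls=1000)
# → ['json-validator', 'sql-sanitiser']
```

A flag here isn't a deletion order — it's the "test" action in the register: disable the component against the current model, run the evals, then decide.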

Why subtraction is the harder half

Adding a component is a clear act. You can point to it. Someone asks what you did this week, and "I added X" is a legible answer. Deleting a component is diffuse — nobody points to the thing that isn't there. The legible artefact is absence, and absence is hard to review.

There's also the psychological cost. I've built components that solved real problems at the time. Deleting them feels like admitting the time was wasted, even when the time was well spent at the time it was spent. The register helps here — writing "this component caught 12% of calls for two quarters and is now catching <2%" reframes deletion as success rather than regret.

And there's the risk calculation. A kept component can't hurt (much). A deleted component might break something. The loss aversion is real. The discipline I've come to hold is that the risk of keeping is not zero — dead components add latency, consume tokens, occupy attention, and quietly drag performance. Measured against that, the risk of thoughtful deletion, with the register as your record, is usually smaller than it feels.

The question I now ask first

When I'm looking at a harness — mine or someone else's — the first question I ask now is not "what do we need to add?". It's "what can we remove?". Not because addition is wrong. Because the default direction of harness drift is additive, and the question that corrects the drift has to be subtractive.

The best harness is one where every component is earning its place, documented, tested against the current model, and removable when the next model makes it unnecessary. That's the end state I'm working toward. I'm not there on any of my platforms yet. The register is what's moving me in that direction.
