entropik.
§ 02.06 · theory · harness · foundations

The 6× Gap

I spent too long chasing model upgrades before noticing the harness was the thing doing the work. The research caught up in 2026 — and looking back at my older platforms, the scars line up with the findings.

The thing I wasn't seeing

For a long time I optimised the wrong variable. When a platform underperformed, my first move was to try a newer model. Usually it helped a bit, sometimes it helped a lot, and I took that as confirmation that model choice was where the gains lived. What I missed — and I think a lot of teams miss — is that the gains would have been larger, faster, and more durable if I'd put the same hours into the harness around the model instead.

I didn't have a name for this until the research caught up in early 2026. Stanford reported that harness design, not model choice, was the primary determinant of agent performance. LangChain moved their coding agent from outside the top thirty to rank five on Terminal Bench 2 by touching only harness infrastructure. The 2.0 numbers got sharper: same model, old harness versus new — 52.8% to 66.5%. Nearly fourteen points, a twenty-six percent relative gain, with zero model changes.

By April, three independent camps — OpenAI's Codex team, Anthropic, and ThoughtWorks — had converged on the same five principles from different starting problems: context beats instructions, planning and execution must be separated, feedback loops are non-negotiable, one thing at a time, the codebase is the documentation. I remember reading the first of those writeups and thinking — with the slightly uncomfortable clarity you get when the thing you half-knew gets named — that this was what had actually been happening on every platform I'd shipped.

What the papers formalised

Two papers put shape around it. A Tsinghua group showed that migrating a code-style harness into a natural-language representation improved scores by sixteen points, cut LLM calls by ninety-seven percent, and dropped runtime by sixty-one percent — same model, same strategy, different representation. Stanford's meta-harness paper took an automatically optimised harness to rank one on Terminal Bench 2 using Haiku. A smaller model outranking larger ones on the strength of its harness alone.

The detail that stayed with me: a harness tuned on one model transferred to five others and improved all of them. The reusable asset is the harness. The model is not.

If I'd understood this in 2023, I'd have built very differently. The earliest of my platforms treated the model as the product and the harness as glue. Every time a new model landed I'd swap it in, re-test a few flows, and ship. I was burning cycles on something I didn't control, and ignoring the thing I did.

Agent equals model plus harness

The framing I now use is mechanical. The context window is RAM — fast, expensive, limited. External databases are disk. Tool integrations are device drivers. Orchestration code is the operating system. Everything the model doesn't carry in its weights lives in the harness. Anthropic's canonical patterns — prompt chaining, routing, parallelisation, orchestrator-workers, evaluator-optimiser loops — are the primitives. Which ones you compose, and how tightly, is the engineering surface.
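To make the "primitives" claim concrete, here is a minimal sketch of two of those patterns — prompt chaining and an evaluator-optimiser loop — written so that the model is just an interchangeable callable. All names are illustrative, not from any specific framework; `call_model` stands in for whatever model API you rent.

```python
# Sketch of two harness primitives composed around an interchangeable model.
# `call_model` is a stand-in: any callable mapping a prompt string to text.

def chain(steps, call_model, task):
    """Prompt chaining: each step's output becomes the next step's input."""
    result = task
    for step in steps:
        result = call_model(f"{step}\n\nInput:\n{result}")
    return result

def evaluate_optimise(call_model, draft, accept, max_rounds=3):
    """Evaluator-optimiser loop: revise until the evaluator accepts."""
    for _ in range(max_rounds):
        if accept(draft):
            return draft
        draft = call_model(f"Revise to fix issues:\n{draft}")
    return draft

# Toy "model" that uppercases the last line of its prompt -- the point is
# that the chain and the loop (the harness) don't care which model sits inside.
toy_model = lambda prompt: prompt.splitlines()[-1].upper()
```

Swapping `toy_model` for a real API client changes nothing structural, which is exactly the sense in which the harness, not the model, is the engineered surface.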

I find the question "which model did you use?" almost always comes up first in these conversations, and almost always it's the wrong first question. The one I try to ask now — and to answer for my own platforms — is: what does your harness assume the model cannot do, and when did you last test those assumptions? The second half of that question is the one that catches people out. It caught me out for years.

Subtraction, which I'm still bad at

Harnesses aren't only built by addition. As models improve, the components that existed to compensate for specific model weaknesses quietly become dead weight. The compensations persist because removing them feels risky — and, let's be honest, because nobody gets promoted for deleting code. Then the platform ossifies around workarounds for problems that are no longer problems.

The teams that got the steepest curves subtracted aggressively. Manus rewrote their harness five times in six months. Vercel removed eighty percent of one agent's tools and got better results. Anthropic dropped context resets when Opus stopped needing them. When I read those examples my first reaction was recognition — I had multiple features sitting in production that existed only to route around a model behaviour that had been fixed two versions ago. I hadn't retested. Mature harness work is as much about scheduled subtraction as it is about addition, and I still find subtraction the harder half.
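One way I've found to make subtraction a scheduled habit rather than an act of courage: keep a registry of every harness component that exists only to route around a known model weakness, with the date that weakness was last confirmed to still exist. This is a sketch under my own assumptions, not anyone's published practice; the structure and names are illustrative.

```python
# Minimal sketch of "scheduled subtraction": register each compensating
# component with the date its justifying model weakness was last retested.
# Anything past the retest window becomes a candidate for deletion.
from dataclasses import dataclass
from datetime import date, timedelta

@dataclass
class Compensation:
    name: str            # e.g. "context reset every 20 turns"
    weakness: str        # the model behaviour it routes around
    last_retested: date  # when we last confirmed the weakness still exists

def stale(compensations, today, max_age_days=90):
    """Return names of compensations whose weakness hasn't been retested recently."""
    cutoff = today - timedelta(days=max_age_days)
    return [c.name for c in compensations if c.last_retested < cutoff]
```

Run the check on every model upgrade: anything `stale` returns gets retested against the new model, and deleted if the weakness is gone.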

The same finding from the other direction

The Epics benchmark, published in April 2026, tested agents on the one-to-two-hour professional tasks consultants, lawyers, and analysts do daily. The best frontier model completed these on first attempt twenty-four percent of the time, climbing to only forty percent after eight retries. The same models score above ninety on standard benchmarks.

The failure analysis is what made this land for me. It wasn't reasoning failure. The models had the knowledge; they could think through the problems. What they couldn't do was execute and orchestrate across the full task. They got lost after too many steps. They looped back to approaches that had already failed. They lost track of what they were supposed to be doing.
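Those failure modes — repeating failed approaches, losing the goal — are exactly the kind of state a harness can hold on the model's behalf. A minimal sketch of that idea, with illustrative names of my own rather than any benchmark's actual remediation:

```python
# Sketch of one orchestration guard the failure analysis suggests: the
# harness, not the model, remembers which approaches already failed and
# re-injects the goal at every step so it cannot drift out of context.

class AttemptLedger:
    def __init__(self, goal):
        self.goal = goal
        self.failed = set()

    def should_try(self, approach):
        """Block approaches the agent has already tried and failed."""
        return approach not in self.failed

    def record_failure(self, approach):
        self.failed.add(approach)

    def context_header(self):
        """Prepended to every step's prompt: the goal plus known dead ends."""
        avoid = ", ".join(sorted(self.failed)) or "none"
        return f"Goal: {self.goal}\nAlready failed (do not repeat): {avoid}"
```

Nothing here requires a smarter model; it converts "the model loops" into "the harness forgot to keep score".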

Which is the 6× Gap from the opposite direction. Even when the model is knowledge-sufficient, the harness determines whether that knowledge becomes completed work. And looking back at the platforms where I'd told myself "the model just isn't ready" — honestly, I think in at least two cases the model was fine. The harness hadn't caught up.

What this changed for me

If the harness is the product, the investment logic inverts. Harness improvements compound, transfer across model upgrades, and encode your domain's specific feedback. Model upgrades are external events you don't control and every competitor receives on the same day you do. The model is a commodity you rent. The harness — the orchestration, the context assembly, the skill recipes, the optimisation loops tuned to a domain — is the only thing that stays yours when the next model ships.

I don't think I'd have written this paragraph three years ago. I was too busy chasing the next model release. I wrote it now because the pattern has held across every platform I've built, and the research has finally caught up with the scar tissue.

// continue the thought

Want to think through how this lands in your project? Tell kr8 what you’re working with.


// Keep reading the playbook?

TOPOLOGY