entropik.
§ 02.11 · harness · architecture · patterns

Harness Anatomy

What a harness actually contains, once you stop thinking of it as glue around a model and start thinking of it as the operating system the model runs on. The parts I now reach for deliberately, after the third or fourth time I rebuilt one from scratch.

The analogy that finally made it click

For a long time "the harness" was a word I used for anything around the model call — prompt engineering, output parsing, a bit of retry logic, some tool wiring. It took me too long to realise that the useful shape was much more like an operating system. The model is the CPU. The context window is RAM: fast, expensive, ephemeral. External databases and files are disk. Tool integrations are device drivers. Orchestration code is the kernel, the scheduler, the process table. The harness is all of it together.

Once I saw it this way, a bunch of architectural decisions I'd been making by feel started to have names. Five patterns, five topologies, five execution-contract bindings, and one discipline around state. The rest of this module is what I now carry as the working vocabulary for what a harness actually is.

Five patterns, mix and match

Anthropic's canonical list is the one I've internalised.

Prompt chaining: sequential reasoning where each step builds on the last. Good for linear workflows; dangerous because errors compound.

Routing: let a model classify the input and branch to a specialist. Good for polymorphic inputs; only as good as the classifier.

Parallelisation: scatter independent tasks, gather results. Good for scaling across documents; needs careful error handling so one failure doesn't crash the pipeline.

Orchestrator-workers: one agent dispatches to many specialists. Good for dynamic task counts; the orchestrator becomes the bottleneck.

Evaluator-optimiser loops: generate, score, refine, repeat until a threshold is met. Good for quality-critical tasks; always set a max iteration count or you will regret it.
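
Routing is the easiest of the five to sketch. This is a minimal illustration, not a production router: classify() uses crude keyword rules as a stand-in for a model call, and the handler names are hypothetical.

```python
# Routing sketch. classify() stands in for a model call that labels
# the input; each handler stands in for a specialist agent.
def classify(text: str) -> str:
    # A real router would ask a model; here, crude keyword rules.
    if "refund" in text.lower():
        return "billing"
    if "error" in text.lower():
        return "support"
    return "general"

HANDLERS = {
    "billing": lambda t: f"billing agent handling: {t}",
    "support": lambda t: f"support agent handling: {t}",
    "general": lambda t: f"general agent handling: {t}",
}

def route(text: str) -> str:
    label = classify(text)
    # Fall back to the general handler if the classifier misfires:
    # "only as good as the classifier" cuts both ways.
    return HANDLERS.get(label, HANDLERS["general"])(text)
```

The fallback line is the part worth copying: a router that raises on an unknown label inherits every weakness of its classifier.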

Every production harness I've built is some combination of these five. The mistake I made on early platforms was reaching for the most complex combination first. These days I start with the simplest pattern that could plausibly work and only add complexity when I have evidence the simpler version is failing.

Topologies — shapes I now draw deliberately

Single agent is the right starting place more often than people think. Input goes in, model call, output comes out. If the problem fits, don't invent a pipeline.

Pipeline (linear) is where most of my multi-agent systems ended up living. Discovery → articulation → validation → timeline → quality on one platform. Each stage testable, replaceable, with a clear contract to the next. The weakness is that errors at stage one propagate downstream and you can't parallelise across stages.
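The stage-with-a-contract shape can be sketched in a few lines. The stage names mirror the platform above, but the bodies are hypothetical stubs; the point is that each stage takes the previous stage's output dict and returns an extended one.

```python
# Linear pipeline sketch: each stage is a function from state to state.
# Stage bodies are stand-ins for real agent calls.
def discovery(state: dict) -> dict:
    return {**state, "facts": ["fact-1", "fact-2"]}

def articulation(state: dict) -> dict:
    return {**state, "draft": " / ".join(state["facts"])}

def validation(state: dict) -> dict:
    return {**state, "valid": bool(state["draft"])}

STAGES = [discovery, articulation, validation]

def run_pipeline(initial: dict) -> dict:
    state = dict(initial)
    for stage in STAGES:
        # Each stage's output is the next stage's input contract.
        state = stage(state)
    return state
```

Because every stage is a plain function over a dict, each one is testable and replaceable in isolation, which is the whole appeal of the topology.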

Orchestrator-workers is for the shape where you have one coordinator and N independent specialists. Research queries across different sources, for example. Parallelism cuts latency. The orchestrator has to resist the temptation to do work itself.
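The fan-out shape, including the error handling that keeps one failed worker from sinking the rest, fits in a short sketch. query_source() is a hypothetical stand-in for a per-source model call.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

# Hypothetical worker: a real one would query one source via a model call.
def query_source(source: str) -> str:
    if source == "flaky":
        raise RuntimeError("source unavailable")
    return f"results from {source}"

def orchestrate(sources: list[str]) -> tuple[dict, dict]:
    results: dict[str, str] = {}
    errors: dict[str, str] = {}
    with ThreadPoolExecutor(max_workers=4) as pool:
        futures = {pool.submit(query_source, s): s for s in sources}
        for fut in as_completed(futures):
            src = futures[fut]
            try:
                results[src] = fut.result()
            except Exception as exc:
                # One failed worker must not crash the whole gather.
                errors[src] = str(exc)
    return results, errors
```

Note that the orchestrator here does no querying of its own; it only dispatches and collects, which is exactly the temptation the paragraph above warns about.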

GAN-like — planner, generator, evaluator — is what you reach for when quality is paramount and you can afford multiple generations. Route optimisation with an evaluator that can reject bad solutions. Expensive.

Mesh — many-to-many peer communication, no clear hierarchy — models real-world complexity and is almost always a bad idea in production. I've tried. It's non-deterministic, expensive, hard to debug. It looks like a good idea in an architecture diagram and becomes a maintenance burden within a quarter. I now treat mesh as a last resort, not a first try.

Execution contracts — five bindings that make agents governable

The part of the harness I care most about now, and which I used to skip, is the set of five bindings that travel with every agent call.

Required outputs — a schema. The agent must return data matching a defined structure or it has failed. Pydantic, Zod, whatever. Without this, downstream agents can't rely on contracts, validation is ad-hoc, and persistence becomes stringly-typed.
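A stdlib-only sketch of the idea, with a dataclass standing in for what a Pydantic or Zod model would give you for free. The Finding schema and its fields are illustrative, not from any real platform.

```python
import json
from dataclasses import dataclass

# Stand-in for a Pydantic model: the contract the agent must meet.
@dataclass
class Finding:
    claim: str
    confidence: float

def parse_output(raw: str) -> Finding:
    """Reject the call outright if the model's output breaks the schema."""
    data = json.loads(raw)        # malformed JSON raises (a ValueError)
    finding = Finding(**data)     # unexpected keys raise TypeError
    if not isinstance(finding.claim, str) or not 0 <= finding.confidence <= 1:
        raise ValueError(f"schema violation: {data!r}")
    return finding
```

The detail that matters is that failure is an exception, not a best-effort string: downstream agents either get a Finding or the call is treated as failed.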

Budgets — hard limits on tokens, wall-clock time, and cost. I've shipped platforms without these. I don't do it any more. An agent without a budget is a bill waiting to happen, and the bills I've paid for runaway agents have been some of my least fun financial experiences.
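A minimal sketch of what a budget binding can look like: one object charged after every model call, raising the moment any of the three caps is crossed. The class and its field names are hypothetical.

```python
import time

class BudgetExceeded(Exception):
    pass

# Hard caps on tokens, wall-clock time, and spend, enforced per charge.
class Budget:
    def __init__(self, max_tokens: int, max_seconds: float, max_cost_usd: float):
        self.max_tokens = max_tokens
        self.max_seconds = max_seconds
        self.max_cost = max_cost_usd
        self.tokens = 0
        self.cost = 0.0
        self.started = time.monotonic()

    def charge(self, tokens: int, cost_usd: float) -> None:
        """Call after every model invocation; raises when any cap is blown."""
        self.tokens += tokens
        self.cost += cost_usd
        if (self.tokens > self.max_tokens
                or self.cost > self.max_cost
                or time.monotonic() - self.started > self.max_seconds):
            raise BudgetExceeded(f"tokens={self.tokens} cost=${self.cost:.2f}")
```

The exception is deliberately unrecoverable inside the agent: the harness, not the agent, decides what happens after a blown budget.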

Permissions — what tools and data each agent can access. A discovery agent can search documents but cannot modify case state. A validation agent can read and annotate. Treating this as a first-class concern prevents a class of bug where an agent does more than its specification said it could.

Completion conditions — explicit exit criteria, plus a max-iteration backstop. Without this, evaluator-optimiser loops become infinite loops. I've written the infinite loop. More than once.
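The loop shape with its backstop, sketched with stand-ins: generate() and score() are hypothetical placeholders for a generator call and an evaluator call, and the specific scoring is meaningless outside this illustration.

```python
# Evaluator-optimiser sketch. generate() and score() stand in for
# model calls; the loop structure is the point.
def generate(draft: str) -> str:
    return draft + "+"              # pretend each pass improves the draft

def score(draft: str) -> float:
    return min(len(draft) / 10, 1.0)

def refine(seed: str, threshold: float = 0.8, max_iters: int = 20) -> str:
    draft = seed
    for _ in range(max_iters):      # the backstop: never loop unbounded
        if score(draft) >= threshold:
            return draft
        draft = generate(draft)
    return draft                    # best effort once the budget is spent
```

Two exits, both explicit: the threshold is the completion condition, and max_iters is the backstop that keeps a threshold the model can never reach from becoming the infinite loop.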

Output paths — where results go, in what format, with what retention. Decoupling the agent from the storage decision makes results queryable and cacheable, and stops business logic from creeping into the agent's internals.

All five together are what I now mean when I say "the agent is governable". If any one is missing, the agent is effectively operating outside the harness, and the harness is lying to me about what it contains.

State externalisation — the rule I keep relearning

Context windows are volatile RAM. The moment you treat them as state, you're one crash away from losing everything. This is the mistake I keep making in slightly different guises. An agent accumulates facts across documents, all in the conversation history; a crash halfway through loses all of it, and the only way to resume is to start over.

The discipline I now hold: state that matters goes to disk every step. Three tiers — hot (Redis, <10ms), warm (Postgres, <100ms), cold (S3 or archive, <1s) — each with a specific role. Every stage of a pipeline checkpoints its output before the next stage starts. The pipeline can crash and resume from the last checkpoint. Intermediate results are queryable by a debugger who doesn't have to re-run the whole thing.
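The checkpoint-and-resume discipline reduces to a small sketch. Here an in-memory dict stands in for the warm tier (Postgres in the text); a real harness would write a row per stage instead.

```python
import json

# Stand-in for the warm tier: stage name -> serialised state.
CHECKPOINTS: dict[str, str] = {}

def checkpoint(stage: str, state: dict) -> None:
    """Persist a stage's output before the next stage starts."""
    CHECKPOINTS[stage] = json.dumps(state)

def resume_from_last(stages: list[str]) -> tuple[int, dict]:
    """Find the last completed stage; return the next index and its state."""
    for i in reversed(range(len(stages))):
        if stages[i] in CHECKPOINTS:
            return i + 1, json.loads(CHECKPOINTS[stages[i]])
    return 0, {}
```

Because the state is serialised at every boundary, it is also queryable: a debugger can inspect what stage two actually produced without re-running stages one and two.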

I know this sounds like basic software engineering. It is. What surprises me is how consistently I've watched AI-heavy projects (including my own) let state pile up in the context window under the quiet rationale that "we'll externalise it later". Later never arrives. Then a production incident hits and you realise there's no state to inspect.

The assumption register

Every harness component is a bet that the model can't do something on its own. Schema validation is a bet that the model sometimes returns malformed JSON. A routing classifier is a bet that the model can't pick the right branch by itself. A context budget enforcer is a bet that the model doesn't self-manage tokens.

Those bets expire. The component you added for a weakness in Sonnet 4.5 may be dead weight in Sonnet 4.7. The register is the list — every component, the assumption it encodes, when you last tested it, and whether it's still earning its place. I covered the subtraction discipline in its own Subtraction Principle module — the register is the practical tool that makes subtraction possible. Without it, you'll never know what to delete.

The lens I now use

Ask, of any agent harness: what's the topology, how does state persist, how are tools defined, what verification exists, does it self-improve, and what assumptions does it encode? I wrote these questions up more than once in notebooks before I trusted them. They're the audit I now run when I pick up somebody else's AI platform and try to understand what's actually going on.

Most of the harnesses I've seen — and most of the ones I've written — fail at least one of those questions. The ones that don't fail any of them are the platforms that kept getting better.
