The Entropik Thesis
Three principles I ended up writing down because they were the shape of the realisations I kept having, platform after platform. I didn't start with them — I arrived at them, and this is the short version of why.
Where this started
I didn't sit down to write a thesis. I sat down, several times, to work out why the AI platforms I was building kept running into the same walls in different-looking forms. Each wall looked specific to its domain — a legal system that produced confident nonsense, a routing tool that quietly drifted, a research assistant that never got any better even after a thousand interactions. It took me longer than I'd like to admit to notice that the walls were the same wall, wearing three different costumes.
What follows are the three principles I ended up writing down because I kept needing them to explain the pattern. They aren't original — the parts have been floating around in physics, in software engineering, in the learning-systems literature, for a while. What's mine is the claim that if you take them seriously, the rest of the architecture becomes less a matter of taste and more a matter of consequence.
Every AI output reduces disorder, and the human pays the cost
An AI output is a reduction of disorder. Unstructured text becomes structured facts. Chaotic trip data becomes an optimised route. A pile of research becomes a validated hypothesis. Useful, obviously. But entropy is not the kind of thing that vanishes. It has to go somewhere.
For a long time I assumed the cost was paid by the compute — the GPU cycles, the token count, the inference bill. That's part of it. But compute is just the sorting. The thing that keeps the sorted output honest is the human decision that follows it. Accept, modify, reject. Without that decision, the system has sorted against a metric of its own choosing, and the metric will drift from the real objective in ways that are quietly disastrous.
I call this the Demon Principle, after Maxwell's sorting demon, because Landauer and Bennett showed that the demon cannot work for free. Neither can an AI platform. Every output needs a human decision somewhere, or it's borrowing from a ledger that will eventually be called in. I used to read "human in the loop" as a concession to trust — a thing we'd need until the model was good enough. I've come to read it as a thermodynamic requirement.
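The ledger metaphor can be made concrete. This is a minimal sketch, assuming the simplest possible shape for a human decision; every name here (`Verdict`, `HumanDecision`, `is_settled`) is illustrative, not from any real system.

```python
from dataclasses import dataclass
from enum import Enum

class Verdict(Enum):
    ACCEPT = "accept"
    MODIFY = "modify"
    REJECT = "reject"

@dataclass(frozen=True)
class HumanDecision:
    output_id: str     # which AI output this verdict applies to
    verdict: Verdict
    final_value: str   # what the human actually kept (equals the output on ACCEPT)

def is_settled(ledger: dict[str, HumanDecision], output_id: str) -> bool:
    """An output is only 'paid for' once a human decision exists for it."""
    return output_id in ledger

ledger: dict[str, HumanDecision] = {}
d = HumanDecision("out-1", Verdict.MODIFY, "edited route")
ledger[d.output_id] = d
assert is_settled(ledger, "out-1") and not is_settled(ledger, "out-2")
```

The point of the sketch is the predicate, not the record: any output for which `is_settled` is false is the debt the section describes, and a platform can enumerate its unpaid debt at any time.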
The model isn't the product — the harness is
A raw language model has enormous variety. It can say anything. That's exactly why it can't reliably do anything without help. The gap between "can generate text" and "can reliably perform a task" is what I now think of as the embodiment gap, and for a long time I was investing almost entirely on the wrong side of it.
My instinct, for years, was that gains came from the model. New model ships, swap it in, things get a little better. The Stanford measurement made the alternative impossible to ignore: same model, same benchmark, 6× performance spread depending on harness design. Five researchers and three independent teams arrived at related findings by the end of April 2026. The harness — the skills, the context assembly, the orchestration, the verification loops, the memory — is where the lever actually is.
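The harness components named above can be sketched in a few lines, assuming a model is just a text-in/text-out callable. All names here are illustrative, and the verifier and memory are toys; the shape, not the implementation, is the claim.

```python
from typing import Callable, Optional

Model = Callable[[str], str]

def assemble_context(task: str, memory: list[str]) -> str:
    """Context assembly: fold prior knowledge into the prompt."""
    notes = "\n".join(f"- {m}" for m in memory)
    return f"Known so far:\n{notes}\nTask: {task}"

def harness(model: Model, task: str, memory: list[str],
            verify: Callable[[str], bool], max_tries: int = 3) -> Optional[str]:
    """Verification loop: retry until the output passes, or give up loudly."""
    for _ in range(max_tries):
        candidate = model(assemble_context(task, memory))
        if verify(candidate):
            return candidate
    return None  # surface failure instead of shipping unverified output

# A toy model and verifier make the lever visible: same model,
# different harness behaviour.
toy_model = lambda prompt: "42" if "Task:" in prompt else ""
result = harness(toy_model, "answer", memory=["context matters"],
                 verify=lambda s: s.isdigit())
assert result == "42"
```

Swapping `toy_model` for a stronger one changes nothing about this structure, which is the point: the memory, the context assembly, and the verification loop are the parts you own.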
The implication is unpleasant if you've been optimising the other variable, because it means a lot of the engineering effort went into a commodity, and the thing that's actually yours is the thing you've been under-investing in. I'm writing this partly to myself. The harness is the product. The model is rented.
Value compounds only through closed loops
Most AI pipelines are open chains. Input goes in, output goes out, the system learns nothing from what happened downstream. Every session starts cold. Every improvement has to be negotiated by hand, in a meeting, by an engineer writing a prompt tweak.
I shipped several of these before I understood what was missing. The thing that changes the shape is treating every interaction as producing three durable artefacts at once — the decision the human made, the delta between what the AI suggested and what they chose, and the audit trail of reasoning that produced the proposal. Those artefacts feed back into the system. Into evals. Into skill revisions. Into the institutional memory that shapes the next proposal. When the loops are real, the platform gets measurably better without anyone manually retraining it, because the loops are doing the learning.
When the loops aren't real — when you only keep the decision, or only the audit trail, or only the output — the platform feels fine for a quarter and then you realise it hasn't improved in any direction you can measure. This is, I think, the difference between a platform that compounds and a pipeline that just runs. The word I end up using for it is closed, and the rest of the playbook is mostly the architecture that closes the loops deliberately rather than by accident.
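The three artefacts can be sketched as a single record written per interaction; every name below is hypothetical, and the metric at the end is just one example of what a closed loop makes measurable.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Interaction:
    decision: str     # what the human chose
    delta: str        # gap between the AI's suggestion and the choice
    audit_trail: str  # the reasoning that produced the proposal

@dataclass
class Platform:
    memory: list[Interaction] = field(default_factory=list)

    def close_loop(self, suggested: str, chosen: str, reasoning: str) -> None:
        """Record all three artefacts at once; keeping only one leaves the loop open."""
        delta = "" if suggested == chosen else f"{suggested} -> {chosen}"
        self.memory.append(Interaction(chosen, delta, reasoning))

    def correction_rate(self) -> float:
        """A direction you can measure: how often humans override the proposal."""
        if not self.memory:
            return 0.0
        return sum(1 for i in self.memory if i.delta) / len(self.memory)

p = Platform()
p.close_loop("route A", "route A", "shortest path")
p.close_loop("route A", "route B", "avoid toll roads")
assert p.correction_rate() == 0.5
```

Dropping any one field reproduces the failure mode above: without `delta` you can't compute the correction rate, without `audit_trail` you can't revise the skill that produced the bad proposal, and without `decision` nothing ever settles.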
How the rest of this plays out
The playbook below is not a framework you adopt. It's the architectural consequences I've found myself living with once these three principles are taken seriously. Event sourcing, skills over controllers, three-tier context, projections over features, boundary skills, triple-output feedback — each of them is what falls out when you commit to the demon paying its energy cost, the harness being the product, and the loops being closed.
I wrote it down partly because I want to stop re-deriving it on every new platform. Partly because I think the people I'd want to work with are the ones who read this kind of thing and push back with their own scars.
Want to think through how this lands in your project? Tell kr8 what you’re working with.