Optimisation Loops
The difference between a tool and a platform is whether feedback closes a loop. I spent years building tools — I'm still learning what it takes to close the loop properly.
The distinction that changed the shape
A tool does a thing. You give it an input, it gives you an output, you go on with your life. If the output is bad, you type something different next time. The tool doesn't change.
A platform does the same thing, but every interaction feeds back into how the next interaction goes. If the output is bad, the system notices, proposes an edit to itself, tests the edit against held-out cases, keeps the edit if it helps, discards it if it doesn't. The platform changes. Slowly at first, then compounding.
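That keep-or-discard cycle is small enough to sketch. Everything named here (`propose_edit`, `score`, the held-out cases) is a hypothetical stand-in for whatever your system actually has, not a real API:

```python
def optimisation_step(current, propose_edit, score, heldout):
    """One closed-loop iteration: propose an edit, test it on held-out
    cases, keep it if it helps, discard it if it doesn't."""
    candidate = propose_edit(current)  # e.g. an edited prompt or config
    baseline = sum(score(current, case) for case in heldout)
    improved = sum(score(candidate, case) for case in heldout)
    # Keep the edit only if it strictly beats the current version.
    return candidate if improved > baseline else current
```

The gate is strict improvement on held-out cases, so a no-op edit gets discarded rather than accumulated.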
The difference is whether feedback closes a loop. I built tools for years and called them platforms because they had lots of features. They weren't platforms. They were features. A platform is the shape where today's corrections become tomorrow's defaults without anyone sitting down to make it happen.
The three components that make it a loop
Karpathy sketched the shape I now use. Every optimisation loop has three ingredients, and if any one is missing the loop doesn't close.
A constrained search space. One file, one prompt, one config — something specific enough that a proposer can actually edit it. Not "the system." Not "the agent." A single, well-defined surface where the next edit could happen. The platforms I've seen fail here let the search space be everything, and nothing gets better because every proposal is a partial rewrite.
A scorable metric. Automatic, measurable, derived from real feedback. Edit distance between what the AI proposed and what the human shipped. First-pass classification accuracy. Task completion rate. Whatever it is, it has to be computable without a human in the loop for every run, because the whole point is to let the proposer iterate faster than a human can.
Full traces with reasoning. Not summaries. Raw input context, model output, the reasoning chain, the human's correction verbatim. Stanford's measurement on this is the finding that stayed with me: removing raw traces drops meta-agent accuracy from 50% to 34.6%. Replacing them with summaries gets you 34.9%. Summaries are nearly as useless as nothing. The signal lives in the raw details, and the moment you compress the details for storage reasons you've killed the loop.
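The second and third ingredients can be made concrete. As a sketch (the field and function names are mine, not from the cited work), a full-fidelity trace is four raw fields stored verbatim, and one possible human-free metric is a similarity ratio between what the model proposed and what the human shipped:

```python
import difflib
from dataclasses import dataclass

@dataclass
class Trace:
    """One interaction at full fidelity -- no summarisation, no compression."""
    input_context: str      # raw input, exactly as the model saw it
    model_output: str       # raw output, exactly as produced
    reasoning_chain: str    # the full chain, not a digest
    human_correction: str   # what the human actually shipped, verbatim

def edit_similarity(proposed: str, shipped: str) -> float:
    """Scorable metric: 1.0 means shipped unchanged; lower means more correction."""
    return difflib.SequenceMatcher(None, proposed, shipped).ratio()
```

Anything computable like this can run on every trace without a human in the loop, which is what lets the proposer iterate faster than a human can review.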
Meta-agent and task-agent are different jobs
The pattern I had to unlearn was expecting the task agent to optimise itself. Being good at a domain and being good at improving at that domain are different capabilities. The meta-agent is the harness engineer. The task agent stays a domain specialist. Separate concerns, separate skills, separate context windows.
The model-empathy detail is one I didn't believe until I tested it. Same-model pairings outperform cross-model pairings: if the task agent is Sonnet, the meta-agent should also be Sonnet. The meta-agent has an implicit understanding of how the inner model reasons, which matters more than you'd expect. I'd been using a cheaper model for the meta-agent to save cost, and the meta-agent kept proposing changes that didn't land. Moving both to the same model fixed a lot of my "why isn't this working?" moments.
What only helps, and what actively hurts
The Tsinghua ablation study is worth internalising, because it inverts some things I'd been assuming. Of the modules they tested, the self-evolution module was the only consistently positive one (+4.8, +2.7). Verifiers hurt (-0.8, -8.4). Multi-candidate search hurt more (-2.4, -5.6). Every gate you add is a ceiling you can't exceed. Every extra candidate is extra noise the model has to resolve.
The shape that actually works is an acceptance-gated attempt loop that stays narrow until a failure signal justifies broadening. Discipline beats expensive broadening. I had been adding verifiers and multi-candidate generation on the intuition that "more checks means safer" — and the measurements said I was making things worse.
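An acceptance-gated attempt loop, as I understand the shape, looks roughly like this (the `attempt`/`accept` interface and the retry budget are my assumptions, not anything from the study):

```python
def attempt_loop(task, attempt, accept, narrow_budget=3):
    """Stay narrow until repeated failure justifies a broader, costlier attempt."""
    for _ in range(narrow_budget):
        result = attempt(task, broad=False)  # narrow, cheap attempt
        if accept(result):                   # the single acceptance gate
            return result
    # Only an accumulated failure signal earns the expensive broadening.
    return attempt(task, broad=True)
```

Note there is exactly one gate and one candidate per attempt: no verifier stack, no multi-candidate search.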
The solo dev version
When I read the Karpathy writeup I nearly dismissed it as "but that requires production feedback, which I don't have yet." I was wrong. The loop works before production feedback. You just start it at a different point.
Write a skill markdown file. Write three to five eval fixtures by hand from your domain knowledge — input context, expected output. Run the skill against a FakeModelProvider (scripted responses) or the real model with a tight budget cap. Score the output against the fixtures. Edit the skill based on what failed. Go again.
Same three components. Constrained search space — the skill file. Scorable metric — fixture pass rate. Full traces — the model's output on each fixture. The only difference is that the dataset is hand-authored rather than production-derived, and the meta-agent is me with a text editor.
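A minimal sketch of that manual loop. The FakeModelProvider with scripted responses is named above; its `complete` method, the fixture shape, and `run_evals` are my inventions for illustration:

```python
class FakeModelProvider:
    """Scripted stand-in for the real model: returns canned responses in order."""
    def __init__(self, responses):
        self._responses = iter(responses)

    def complete(self, skill_text, input_context):
        return next(self._responses)

def run_evals(provider, skill_text, fixtures):
    """Score a skill file against hand-authored fixtures; returns pass rate."""
    passed = sum(
        provider.complete(skill_text, fx["input"]) == fx["expected"]
        for fx in fixtures
    )
    return passed / len(fixtures)
```

Each iteration: run the evals, read the raw output of the failing fixtures, edit the skill file, go again.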
The thing that makes this work is that the fixtures become the golden dataset. When production feedback arrives, it doesn't replace the hand-authored fixtures — it supplements them. The fixtures catch the failure modes I knew to look for. The production feedback catches the edge cases I didn't anticipate. Both are valuable, and the skill that has been through twenty iterations of the manual loop is already tenth-draft quality before the first user touches it.
Harnesses transfer; models don't
The Stanford finding I keep coming back to: a harness optimised on one model transferred to five others and improved all of them. The reusable asset is the harness, not the model. Which means the investment you put into optimising your harness survives the next model release. The optimisation loop is not a sunk cost; it's a compounding asset that keeps paying back as models improve.
This inverts the mental model I carried for too long. I used to treat model upgrades as events that reset my harness work — "oh, we'll need to retune everything for the new model." Sometimes that's true. More often, the harness gets better on the new model too, because the optimisations were mostly about orchestration, context, and verification shape, none of which change when you swap the inference engine underneath.
What I had to give up to make the loop real
Three things, all harder than they sound.
Traces at full fidelity. Storing raw reasoning chains is more expensive than storing summaries. I had to accept the storage cost. Every time I was tempted to compress "for efficiency", I reminded myself of that 50% → 34.6% gap.
Patience with narrow search spaces. Editing one file at a time, measuring one metric, over and over, feels slower than "let me just rewrite the whole thing." The Karpathy Loop ran 700 experiments in two days because each experiment was small. Narrow beats broad, but narrow feels slow in the moment.
Willingness to let the meta-agent propose things I wouldn't have. The whole point is that the proposer sees patterns in traces that I don't. If every edit it proposes is one I could have written, the loop has become theatre and I'm not listening.
A closed loop is a bigger commitment than I first appreciated. It's also the only thing I've found that makes a platform get better over time without a human grinding on it. That's what the compounding principle actually looks like in practice.
Want to think through how this lands in your project? Tell kr8 what you’re working with.