Harness Your Development Agent
The harness I thought mattered was the one around the production agent. The one that actually compounds is the one around the agent that builds the production agent. This is the realisation I'm living through right now, writing this module.
The leverage I kept missing
For a long time I thought of "the harness" as something I was building around whatever AI capability I was shipping to users. The production agent had a harness. The development agent — the one I was using to write the production agent — had whatever default behaviour showed up when I opened the terminal. I treated those as two different problems, and most of my attention went to the shipped product.
The realisation is that I had the order of leverage backwards. The most impactful harness a solo dev builds isn't the one around the production agent. It's the one around the agent that builds the production agent. Every principle I've written about in this playbook — skills over controllers, context assembly, execution contracts, optimisation loops, subtraction — applies just as much to my development workflow as to anything I'm shipping.
If you're using Claude Code (or any agentic coding tool), you're already inside a harness. The question is whether you've engineered that harness on purpose or left it at defaults.
Same anatomy, different implementation
The mental move that unstuck this for me was noticing that the development harness has the same anatomy as a production harness. The pieces just go by different names.
Production skills become Claude Code skills in .claude/skills/. Execution contracts become the speckit workflow: specify, plan, tasks, implement. Context assembly becomes CLAUDE.md plus project docs plus memory files. Model provider abstraction becomes model selection per task (Haiku for boilerplate, Sonnet for implementation, Opus for architecture). The eval harness becomes pnpm verify running typecheck, lint, test, and audit gates. The meta-agent becomes the skill-quality system: run skills multiple times, score against criteria, iterate. The feedback loop is every edit I make to what the agent produced. The assumption register is the audit of which dev skills are still needed against the current model.
Once I'd drawn the parallel, it stopped making sense to invest in one harness and ignore the other. They compound each other.
Skills as development harness
A dev skill is a markdown recipe that teaches my coding agent how to handle a specific kind of task. Not "implement this feature" — that's too open. More like "read the spec, then the plan, then implement the tasks in order, running tests after each one, running pnpm verify before claiming completion." A good dev skill constrains the search space the same way a good production skill does. It encodes the institutional knowledge the agent would otherwise have to guess at. It includes verification — how to check its own work, which audit scripts to pass. And it's domain-specific. A generic "write code" skill is useless. A skill for "implement a new Entropik backend skill file with YAML frontmatter, output schema, context shape, and eval fixtures" is leverage.
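A sketch of what such a skill file can look like. The file name, frontmatter fields, and steps here are illustrative, not a spec of any particular tool's format:

```markdown
---
name: implement-backend-skill
description: Implement a new backend skill file from an approved spec
---

1. Read the spec and the plan before touching code.
2. Check existing skill files for the YAML frontmatter, output schema,
   and context shape conventions; match them, do not invent new ones.
3. Implement the tasks in order, running the tests after each one.
4. Add eval fixtures alongside the new skill file.
5. Run pnpm verify. Do not claim completion until it passes.
```

The verification step at the end is the part that earns its keep: it gives the agent a mechanical definition of "done" instead of letting it declare victory.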
The speckit workflow is a development harness in disguise. Four phases: specify, plan, tasks, implement. Each phase is a separate skill. Each constrains the agent's search space to one concern. The agent can't jump to implementation without a plan, can't plan without a spec, can't spec without requirements. This is prompt chaining — the first of the five canonical patterns from Harness Anatomy — applied to development work.
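The gating is mechanical, which is the point. A toy sketch of it, with hypothetical artifact file names standing in for whatever each phase actually writes:

```typescript
// Sketch: enforce speckit-style phase ordering by refusing to start a
// phase until the artifact of the previous phase exists on disk.
// Phase names from the text; artifact file names are illustrative.
import { existsSync } from "node:fs";
import { join } from "node:path";

const phases = ["specify", "plan", "tasks", "implement"] as const;
type Phase = (typeof phases)[number];

// What each phase must have produced before the next phase may begin.
const artifactOf: Record<Phase, string> = {
  specify: "spec.md",
  plan: "plan.md",
  tasks: "tasks.md",
  implement: "done.md",
};

export function canStart(phase: Phase, dir: string): boolean {
  const i = phases.indexOf(phase);
  // The first phase is always allowed; every later phase needs the
  // previous phase's artifact to exist.
  return i === 0 || existsSync(join(dir, artifactOf[phases[i - 1]]));
}
```

The agent can't reach implement until tasks.md exists, can't reach tasks until plan.md exists, and so on: prompt chaining enforced by the filesystem rather than by hoping the model remembers the order.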
The difference this has made on the platform I'm currently building is hard to overstate. Without speckit, asking an agent to "build the meta-agent feature" gets a monolithic attempt that misses edge cases, ignores conventions, and requires extensive rework. With speckit, the same request produces a spec, a plan, a 100-task list, and incremental implementation with tests. The total calendar time is longer. The rework approaches zero.
CLAUDE.md as L1 hot context
The project-level CLAUDE.md is the most consequential file in my development harness. It's the system prompt that loads into every interaction. I engineer it like a production skill now, not like a README.
What belongs in it: the active technologies and their versions (so the agent doesn't guess), the project structure (where things live), the commands to run (test, lint, verify), the non-negotiable conventions the agent must always follow, and the current focus (which spec, which phase). What doesn't belong: implementation details (the agent can read code for that), history (git log is authoritative), temporary state (that goes in a handoff file).
The L1 budget rule I now apply: CLAUDE.md should be scannable in under thirty seconds. If it's longer than about 200 lines, the agent is spending tokens processing context that isn't relevant to most tasks. Same context-budgeting discipline I'd apply to a production skill. The file I keep trimming is the file that's working.
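A CLAUDE.md that fits that budget can be skeletal. The contents below are illustrative placeholders, not a real project's file:

```markdown
# Project: <name>

## Stack
- TypeScript 5.x, Node, pnpm workspaces

## Structure
- packages/shared: schemas (all new schemas go here)
- packages/backend: skill files

## Commands
- pnpm verify: typecheck + lint + test + audit gates

## Conventions
- Never write to node_modules/
- Read the spec before implementing

## Current focus
- <active spec>, <phase>
```

Everything in it answers a question the agent would otherwise guess at; anything that doesn't, gets cut.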
Memory files sit one tier warmer — L2. They're loaded when relevant, not always. Who the user is. Feedback corrections from prior sessions. Project state that isn't in git. The development equivalent of semantic memory on the production platform. The agent doesn't re-learn my conventions in every session; it looks them up when it needs them.
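An L2 memory file can be as small as this. The entries are invented for illustration:

```markdown
# memory/feedback.md

- Corrected (twice): new schemas go in the shared package, never inline.
- Prefers incremental handoffs over batching many tasks into one pass.
```

Small, append-only, and loaded only when the task calls for it.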
Hooks as computational feedforward
Development hooks are the cheapest harness component I can add. A PostToolUse hook that auto-formats every file the agent writes. A pre-commit hook that runs typecheck before letting me make a mess. A hook that blocks writes to node_modules/ so the agent can't accidentally corrupt a container-mounted volume.
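The auto-format hook, as I'd wire it in .claude/settings.json, looks roughly like this. The schema follows Claude Code's hooks configuration as I understand it at the time of writing (matcher on tool name, hook command receiving the tool call as JSON on stdin); check the current docs before copying:

```json
{
  "hooks": {
    "PostToolUse": [
      {
        "matcher": "Write|Edit",
        "hooks": [
          {
            "type": "command",
            "command": "jq -r '.tool_input.file_path // empty' | xargs -r npx prettier --write"
          }
        ]
      }
    ]
  }
}
```

The hook fires after every Write or Edit, so formatting drift never reaches a diff I have to review.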
Zero latency cost in practice. Near-zero build cost. Prevents an entire class of rework. I'd have added these years earlier if I'd been thinking about my dev environment as a harness.
The hook audit question — same as any harness component — is whether each hook is compensating for a genuine constraint (humans want consistent style, CI fails on lint errors) or a model weakness that may have expired. Auto-formatting is genuine. An import-ordering hook is probably a model-weakness compensation that I should test against the current model before renewing.
The development optimisation loop
I now run the solo dev version of the Karpathy Loop on my own dev skills. Write a skill. Run it on a real task. Observe what it got right, what I had to fix. Edit the skill. Track the change in a skill-history file. Go again.
Binary eval criteria for dev skills, not composite scoring. Did it read the spec before starting? Did it run tests after each task? Did it check existing code before creating new files? Did it use the shared package for new schemas? Did it run pnpm verify before claiming completion? Run the skill three to five times on different tasks, score against the criteria, edit the skill if something fails consistently.
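Scoring that way needs almost no machinery. A sketch of the pass-rate bookkeeping, with criterion names taken from the list above:

```typescript
// Sketch: binary eval criteria for a dev skill. Each run either passes
// a criterion or it doesn't; there is no partial credit and no
// composite score. Criterion names are from the checklist in the text.
type Criterion = { name: string; passed: boolean };

// For each criterion, the fraction of runs that passed it.
export function score(runs: Criterion[][]): Map<string, number> {
  const totals = new Map<string, number>();
  for (const run of runs) {
    for (const { name, passed } of run) {
      totals.set(name, (totals.get(name) ?? 0) + (passed ? 1 : 0));
    }
  }
  for (const [name, passes] of totals) {
    totals.set(name, passes / runs.length);
  }
  return totals;
}

// Criteria that fail consistently are the ones the next skill edit
// should target. The 0.5 threshold is an arbitrary illustrative cutoff.
export function failing(scores: Map<string, number>, threshold = 0.5): string[] {
  return [...scores].filter(([, rate]) => rate < threshold).map(([name]) => name);
}
```

Three to five runs, per-criterion pass rates, edit whatever fails consistently: that's the whole loop.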
Skill history — .claude/skill-history/speckit-implement.md with versions and what each version changed and what it scored — is the trace infrastructure for this loop. It's not a changelog. It's the data the next optimisation pass uses to decide what to edit.
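The shape I have in mind for such a file, with made-up version entries and scores:

```markdown
# skill-history: speckit-implement

## v3
- Changed: made "run tests after each task" an explicit numbered step.
- Scored: 5/5 runs passed "ran tests after each task" (2/5 on v2).

## v2
- Changed: added the pnpm verify gate before completion claims.
- Scored: 3/5 runs passed "ran pnpm verify before claiming completion".
```

Each entry records what changed and what it did to the pass rate, so the next edit is informed rather than guessed.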
Why the dev harness is first
The reason the dev harness is the first domino is that it multiplies everything downstream. A dev agent that reliably produces good code is a dev agent I don't have to review every line from. That's the difference between using AI and being multiplied by AI. A solo dev with a well-harnessed development agent ships faster than a team of three with unharnessed agents.
The leverage compounds into the production harness too. The patterns transfer. The eval format transfers. The subtraction discipline transfers. By the time I'm designing a production meta-agent, I've already been running a manual meta-agent on my dev skills for months. I'm not learning the Karpathy Loop for the first time. I'm applying something I already do.
That's the part that surprised me. The harness around my dev agent turned out to be the training ground for every other harness I've built since. If I were starting a new platform tomorrow, I'd invest in the dev harness first and the production harness second, not the other way around. I wish I'd done that on the earlier ones.