Skills Architecture
What a skill file actually looks like once you stop treating markdown as notes and start treating it as the capability itself. The shape I've converged on after more rewrites than I'd like to admit.
From "why" to "what goes in the file"
I wrote the case for why capabilities belong in markdown in the Skills over Controllers module. That's the philosophical side. This module is the practical one — what a skill file actually contains once you've made the commitment to treat markdown as the substrate.
Most of the early skills I wrote were under-specified. A paragraph of instructions, a rough output format, and the model did its best. That works for low-stakes capabilities and falls apart the moment the skill needs to be composable, testable, or optimisable. The shape I've converged on is more structured than "notes" but less structured than code. It's a specific anatomy.
Three fidelity levels, not binary
The trap I fell into was treating skills as either production or not. In reality they scale with investment, and most skills don't need to be high-fidelity on day one.
Low fidelity — a paragraph of instructions written by a domain expert in thirty minutes. "Assess risk by checking breach count, timeline completeness, documentation quality. Return HIGH/MEDIUM/LOW with a reason." That's it. Ship it. Measure whether it works. This is where most of my new skills start now.
Medium fidelity — a structured recipe with steps, context requirements, and output format. Two or three hours. Written collaboratively with the AI (ironically, skill writing is itself a good use of the AI). A proper skill file, YAML frontmatter, instruction body, output schema.
High fidelity — detailed recipe with examples, edge cases, quality criteria, feedback capture, eval fixtures. Four to six hours. Reserved for skills where quality matters enough to justify the polish. Start low, iterate to medium based on feedback, ship to high only when the ROI is obvious.
The anatomy I now use
Every skill file has the same shape, which matters because the loader is a single primitive shared by every skill — one parser, one registry.
YAML frontmatter on top. Name, version, category, author, last-updated, a short description, and the skill's "intent" (one sentence on the problem it solves, in the author's voice). This metadata is what the loader reads to register the skill in the system.
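A minimal sketch of that loader primitive, assuming `---`-delimited frontmatter. The field names and the flat `key: value` parsing are illustrative — a real loader would use a YAML library — but the shape is the point: one function splits metadata from instructions and registers the skill.

```python
import re

REGISTRY = {}

def load_skill(text: str) -> dict:
    """Split a skill file into frontmatter metadata and instruction body.

    Assumes the file starts with a ----delimited block; only flat
    key: value pairs are parsed here to keep the sketch dependency-free.
    """
    match = re.match(r"---\n(.*?)\n---\n(.*)", text, re.DOTALL)
    if not match:
        raise ValueError("skill file must start with YAML frontmatter")
    meta = {}
    for line in match.group(1).splitlines():
        key, _, value = line.partition(":")
        meta[key.strip()] = value.strip()
    return {"meta": meta, "body": match.group(2).strip()}

def register(text: str) -> None:
    skill = load_skill(text)
    REGISTRY[skill["meta"]["name"]] = skill

example = """---
name: assess-risk
version: 1.2.0
category: analysis
intent: Score case risk so a human can triage quickly.
---
1. Extract breach elements.
2. Score liability 0-10.
3. Return HIGH/MEDIUM/LOW with a reason."""

register(example)
```

The same `register` call handles every skill, low fidelity or high — the fidelity lives in the file, not the loader.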
Required context — what must be in the context window before this skill runs, and in what shape. A skill that assesses case risk needs facts (minimum count), documents, jurisdiction, case type. Declaring this upfront means the orchestration layer knows what to assemble before it dispatches the call. Skills that don't declare their context requirements get called with whatever's convenient, which means they work inconsistently.
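One way the orchestration layer can enforce those declarations before dispatch — a hedged sketch, with the requirement names and `min_count` rule shape invented for illustration:

```python
# Illustrative: a skill declares the context it needs; the orchestration
# layer refuses to dispatch until the assembled context satisfies it.
REQUIRED_CONTEXT = {
    "facts": {"min_count": 3},
    "documents": {"min_count": 1},
    "jurisdiction": {},
    "case_type": {},
}

def missing_context(context: dict, required: dict) -> list[str]:
    """Return the names of requirements the context fails to meet."""
    missing = []
    for name, rule in required.items():
        value = context.get(name)
        if value is None:
            missing.append(name)
        elif "min_count" in rule and len(value) < rule["min_count"]:
            missing.append(name)
    return missing

context = {"facts": ["f1", "f2"], "jurisdiction": "UK"}
gaps = missing_context(context, REQUIRED_CONTEXT)
```

If `gaps` is non-empty, the orchestrator assembles more context instead of calling the skill with whatever's convenient.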
Instructions — the actual natural-language recipe. Numbered steps. The place to be concrete about process: "extract breach elements first, then score liability 0-10, then assess quantum probability." Models perform dramatically better with ordered steps than with blobs of prose, and the Tsinghua finding on natural language representation (+16.8 points, -97% LLM calls) is not an exaggeration.
Output schema — a strict contract. Pydantic-style or JSON Schema. What fields, what types, what constraints. The harness validates against this schema and rejects outputs that don't conform. If the skill can't produce valid output, the orchestration knows how to fall back.
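Pydantic or JSON Schema are the natural fit here; to keep the sketch dependency-free, the same contract hand-rolled in stdlib Python, with illustrative field names:

```python
# A minimal stdlib stand-in for a Pydantic/JSON Schema contract.
SCHEMA = {
    "risk": (str, lambda v: v in {"HIGH", "MEDIUM", "LOW"}),
    "confidence": (float, lambda v: 0.0 <= v <= 1.0),
    "reasoning": (str, lambda v: len(v) > 0),
}

def validate(output: dict) -> list[str]:
    """Return schema violations; an empty list means the output conforms."""
    errors = []
    for field, (ftype, check) in SCHEMA.items():
        if field not in output:
            errors.append(f"{field}: missing")
        elif not isinstance(output[field], ftype) or not check(output[field]):
            errors.append(f"{field}: invalid")
    return errors

good = {"risk": "HIGH", "confidence": 0.82, "reasoning": "Two breaches documented."}
bad = {"risk": "SEVERE", "confidence": 0.82}
```

The harness calls `validate` on every output; a non-empty error list is the signal that triggers the fallback path.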
Quality criteria — explicit acceptance thresholds. "Reasoning must cite at least one fact." "Confidence >0.75 or mark as uncertain." These are the rules the skill's output has to satisfy beyond the schema. Not every skill needs these, but the skills that matter most usually do.
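The two example criteria above, expressed as a check over a schema-valid output plus the facts it was given — a sketch, with the substring-match notion of "cites a fact" as a deliberate simplification:

```python
def meets_quality(output: dict, facts: list[str]) -> bool:
    # "Reasoning must cite at least one fact" — naive substring check.
    cites_a_fact = any(fact in output["reasoning"] for fact in facts)
    # "Confidence >0.75 or mark as uncertain."
    confident = output["confidence"] > 0.75 or output.get("uncertain", False)
    return cites_a_fact and confident

facts = ["late disclosure", "missing contract"]
output = {
    "reasoning": "Risk driven by late disclosure of key documents.",
    "confidence": 0.81,
}
```

Quality criteria run after schema validation: the schema asks "is this well-formed?", the criteria ask "is this acceptable?".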
Tools used — the external capabilities the skill calls. Search, fetch, analyse. Declared so that the execution contract can enforce which ones are permitted and the meta-agent can reason about dependencies.
Feedback capture — what human decisions on this skill's output should be recorded, and how. Accept/modify/reject. Binary confirmation. Score of 1-5. The feedback schema is part of the skill definition because different skills need different feedback shapes.
Examples — two or three worked examples for high-fidelity skills. Input, expected output, notes on why this is the right output. These serve as both eval fixtures and as few-shot context for the model at runtime if needed.
Eval fixtures — the golden cases this skill is validated against. Lives in a sibling .eval.md file. Three to five hand-written cases on day one, supplemented with production feedback as it arrives. These are what the Karpathy Loop optimises against.
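A sketch of running a skill against its golden cases. `run_skill` here is a stub standing in for the real dispatch (which would load the markdown skill and call the model); the fixture fields are illustrative:

```python
FIXTURES = [
    {"input": {"breaches": 3, "timeline_complete": False}, "expected": "HIGH"},
    {"input": {"breaches": 0, "timeline_complete": True}, "expected": "LOW"},
]

def run_skill(case_input: dict) -> str:
    # Stub: the real runner executes the skill's instructions via an LLM call.
    if case_input["breaches"] >= 2 or not case_input["timeline_complete"]:
        return "HIGH"
    return "LOW"

def run_evals(fixtures) -> tuple[int, int]:
    """Return (passed, total) over the golden cases."""
    results = [(run_skill(f["input"]), f["expected"]) for f in fixtures]
    passed = sum(got == want for got, want in results)
    return passed, len(fixtures)

passed, total = run_evals(FIXTURES)
```

Because the fixtures live next to the skill file, any edit to the instructions can be scored immediately — that is the loop the optimisation runs on.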
Tools vs skills
I confused these two for a while, probably because "tool" is overloaded. The distinction I now hold firmly: a tool is an API integration — search the database, send an email, retrieve a document, call an LLM. A skill is a recipe that composes tools with context and instructions to produce a specific outcome.
A tool never calls a skill. A skill orchestrates tools. Tools are stable, narrow, utility-shaped. Skills are domain-specific, opinionated, and evolving. Tools live in the harness primitives layer; skills live in markdown. Keeping this distinction firm prevents the class of accident where a "skill" slowly grows helper code and mutates back into a controller.
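The layering can be sketched in a few lines — tool bodies and names are hypothetical, but the enforcement point is the one described above: the executor only lets a skill touch the tools its file declares.

```python
def search(query: str) -> list[str]:        # tool: stable, narrow, utility-shaped
    return [f"doc matching {query}"]

def fetch(doc_id: str) -> str:              # tool: stable, narrow, utility-shaped
    return f"contents of {doc_id}"

TOOLS = {"search": search, "fetch": fetch}

def call_tool(skill_meta: dict, name: str, *args):
    """Dispatch a tool call, enforcing the skill's declared tool list."""
    if name not in skill_meta["tools"]:
        raise PermissionError(f"{skill_meta['name']} may not call {name}")
    return TOOLS[name](*args)

skill_meta = {"name": "gather-key-facts", "tools": ["search"]}
hits = call_tool(skill_meta, "search", "breach of contract")
```

Note the direction of the arrows: tools know nothing about skills, and the only code here is harness-layer plumbing — the skill itself stays in markdown.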
Composability is the unlock
A well-built skill can be composed into compound workflows without knowing it's part of one. /generate-case-brief is a compound skill: it calls /gather-key-facts, then /assess-risk on the results, then /summarize-for-client on both. Each sub-skill is independently testable, independently optimisable, independently versionable.
Compound skills reuse atomic skills without coupling. When I optimise /assess-risk with the meta-agent, every compound skill that uses it inherits the improvement. This is the compound interest of the skills architecture — the gains propagate without explicit propagation code.
I've broken this pattern a few times by writing compound skills that did their own mini-assessment inline rather than calling the atomic skill. Each time I regretted it within a quarter, when the skill drifted from the atomic version and the two started producing different answers to the same question. The discipline I hold now: if a compound skill needs a capability, call the atomic skill. Don't reimplement.
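The compound pattern can be sketched as dispatch calls rather than inline reimplementation. `dispatch` is a stub for the real skill runner; the returned dicts are placeholders for model output:

```python
def dispatch(skill_name: str, context: dict) -> dict:
    # Stub: the real runner would load the markdown skill, check its
    # required context, and call the model.
    return {"skill": skill_name, "context_keys": sorted(context)}

def generate_case_brief(case: dict) -> dict:
    """Compound skill: composes atomic skills, reimplements nothing."""
    facts = dispatch("gather-key-facts", {"case": case})
    risk = dispatch("assess-risk", {"case": case, "facts": facts})
    summary = dispatch("summarize-for-client", {"facts": facts, "risk": risk})
    return {"facts": facts, "risk": risk, "summary": summary}

brief = generate_case_brief({"id": "C-142"})
```

When /assess-risk improves, this function picks up the improvement on the next call — no propagation code, which is the compound-interest point above.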
Harnessability — the substrate matters
The insight from the ThoughtWorks work I keep returning to: how legible your codebase is to agents before any harness is applied matters enormously. Strong typing, clear module boundaries, well-structured frameworks, embedded documentation, immutable data patterns — these are features that make agents produce better output with less harness investment. Convention-over-configuration, dynamic typing with implicit contracts, god objects, tribal knowledge not in the repo — these fight the substrate, and no amount of harness work compensates.
Skills-as-markdown is maximally harnessable by construction. Natural language is the agent's native representation. This is partly why the skills architecture outperforms code endpoints — it matches the substrate to the agent's strengths, instead of asking the agent to operate against its grain.
Where I'd start if I were starting over
On a new platform today: begin with three or four low-fidelity skills covering the most common interactions. Add the frontmatter, declare the context requirements, keep the instructions short. Ship them. Measure accept/modify/reject rates. Iterate to medium fidelity based on where the modifications cluster. Promote to high fidelity only when ROI is obvious.
The skills I've regretted most are the ones I over-engineered before any user had touched them. The skills I'm proudest of are the ones that went through many cheap iterations with real feedback. Treat the skill file as the search space; let the feedback show you where to push.