The Eval Harness
If you can't measure it, you can't optimise it. If you can't optimise it, you don't have a platform — you have a demo. I shipped demos for a while before this clicked.
The thing tests cannot do
Traditional software has deterministic tests: same input, same output, assertEqual. AI systems are non-deterministic — the same prompt can produce different but equally valid outputs, and sometimes twenty different answers are all correct. assertEqual(output, expected) doesn't apply.
I reached for tests anyway, for a while, because that's the habit. The tests I wrote were flaky — they passed on Monday, failed on Wednesday, with no change to the code. The flakiness was signalling something real: I was measuring the wrong thing. What I actually needed was a way to score outputs against human judgment at scale, not pass/fail them against a fixed expected string.
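The shift from pass/fail to scoring can be sketched in a few lines. This is a hypothetical scorer, not from any real harness; the field names and thresholds are illustrative:

```python
# Hypothetical sketch: score an output against criteria instead of
# asserting equality on a fixed expected string.
def score_extraction(output: dict) -> float:
    """Return a 0.0-1.0 score rather than pass/fail."""
    score = 0.0
    if all(k in output for k in ("fact", "source", "date")):
        score += 0.5                      # schema criterion
    if output.get("source"):
        score += 0.25                     # cites a source
    if len(output.get("fact", "")) > 20:
        score += 0.25                     # non-trivial content

    return score

# Two differently-worded but equally valid outputs both clear the bar,
# which assertEqual could never express:
a = {"fact": "The shipment left Rotterdam on 3 May.",
     "source": "doc1", "date": "2024-05-03"}
b = {"fact": "Departure from Rotterdam occurred 3 May 2024.",
     "source": "doc1", "date": "2024-05-03"}
assert score_extraction(a) >= 0.8 and score_extraction(b) >= 0.8
```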
Evals are that mechanism. They bridge the gap between non-deterministic systems and rigorous measurement. They're the reason a platform can ship AI changes with confidence rather than gambling on every deploy.
What you lose without them
Without evals, the list is longer than it first appears. No way to know if a harness change helped or hurt. No way to compare models objectively — the choice between Sonnet and Haiku becomes a vibe. No regression detection until users experience the regression. No way to run the meta-agent, because meta-agents need a score to optimise against. No CI/CD for AI — every deployment is an act of faith.
I've lived with all of those gaps. The meta-agent one hurt most — I had the optimiser, I had the skill files, I had the feedback. I didn't have the eval harness, so the optimiser had nothing to score against, and the optimisation loop was theoretical. Building the eval harness is what turned "we have feedback" into "the platform gets measurably better."
The eval taxonomy
Five types, each sitting at a different point on the speed-cost-signal tradeoff.
Unit evals. Single skill, single input, scored output. "Does /extract-fact return a valid FactExtraction?" Fast (under 5 seconds), cheap, narrow signal. This is where you catch schema failures and obvious breaks before they get further.
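A unit eval of this kind can be a few lines. The sketch below uses plain dicts for the schema check; a real harness might use Pydantic, and the field names are assumptions:

```python
# Minimal unit-eval sketch for a skill like /extract-fact.
# Binary schema validity: the cheapest, fastest signal.
REQUIRED = {"fact": str, "source": str, "confidence": float}

def unit_eval(output: dict) -> bool:
    return all(
        key in output and isinstance(output[key], typ)
        for key, typ in REQUIRED.items()
    )

good = {"fact": "Invoice 42 was paid.", "source": "email-7", "confidence": 0.9}
bad = {"fact": "Invoice 42 was paid."}      # missing fields -> fails fast
assert unit_eval(good) and not unit_eval(bad)
```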
Integration evals. Multi-skill pipeline, end-to-end quality. "Does the five-agent timeline pipeline produce a coherent timeline from these three documents?" Medium speed (30-120 seconds), moderate cost, broad signal. Catches pipeline bugs that unit evals can't see.
Regression evals. "Is the new version at least as good as the old version on the benchmark suite?" Slow (minutes), higher cost, critical signal. Run before every production deploy. If a change regresses quality, the deploy blocks.
Adversarial evals. "Can we break it?" Edge cases, malformed inputs, prompt injection attempts, out-of-domain queries. Variable speed, variable cost, safety signal. Run weekly, not on every deploy.
Comparative evals. "Model A vs Model B on the same benchmark. Harness v1 vs v2 on the same cases." Double the baseline cost, decisive signal when you're evaluating a change.
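The comparative case reduces to running both variants over an identical benchmark and comparing aggregates. A minimal sketch, where `score_a` and `score_b` stand in for whatever scorer the harness uses:

```python
# Comparative-eval sketch: same cases, two variants, decide which wins.
from statistics import mean

def compare(cases, score_a, score_b):
    """Return per-variant mean scores over an identical benchmark."""
    return mean(score_a(c) for c in cases), mean(score_b(c) for c in cases)

cases = ["case-1", "case-2", "case-3"]
a, b = compare(cases,
               score_a=lambda c: 0.80,   # stand-in: variant A's per-case score
               score_b=lambda c: 0.85)   # stand-in: variant B's per-case score
assert b > a   # decisive signal: variant B wins on the same benchmark
```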
Where the dataset comes from
The hardest part of evals isn't running them — it's building the dataset. For AI-first platforms the answer is that production feedback IS the dataset. Every accept/modify/reject is a potential eval case.
Three dataset types emerge naturally from the feedback loop. Golden — accepted and modified outputs (the human-approved final version) — used for regression testing. Improvement — modified outputs with before/after pairs — used for training the prompt optimiser. Negative — rejected outputs — used for adversarial testing and failure mode analysis.
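The partitioning falls directly out of the feedback actions. A sketch with illustrative field names:

```python
# Three datasets from accept/modify/reject feedback.
feedback = [
    {"action": "accept", "output": "A", "final": "A"},
    {"action": "modify", "output": "B", "final": "B'"},
    {"action": "reject", "output": "C", "final": None},
]

# Golden: the human-approved final versions, for regression testing.
golden = [f["final"] for f in feedback if f["action"] in ("accept", "modify")]
# Improvement: before/after pairs, for training the prompt optimiser.
improvement = [(f["output"], f["final"]) for f in feedback if f["action"] == "modify"]
# Negative: rejected outputs, for adversarial and failure-mode analysis.
negative = [f["output"] for f in feedback if f["action"] == "reject"]

assert golden == ["A", "B'"]
assert improvement == [("B", "B'")]
assert negative == ["C"]
```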
The thing I had to learn the hard way is that datasets aren't automatic. Feedback arrives; it has to be curated into eval cases. Not all feedback is equal. Low-confidence feedback, disputed cases, edge cases — these need different handling from the main golden set. I now tag every dataset with its curation metadata (when it was collected, what confidence threshold, what coverage across verticals and case types) because three months later I'm going to need to know.
Cold-start without users
The trap I fell into when I first read about eval harnesses was thinking they required production traffic. They don't. Every platform starts with zero users and zero feedback. The cold-start pattern bridges the gap.
Hand-author three to five fixtures per skill from your domain knowledge. Input context, expected output, notes on what the case tests. These are not synthetic data — synthetic data is generated by another model and reflects that model's biases. Hand-authored fixtures reflect your domain expertise, which is the thing the model doesn't have.
Run the skill against a deterministic model provider (scripted responses) or the real model with a tight budget cap. Score outputs against the fixtures. Iterate. The skill that goes through twenty iterations of this loop before any user touches it is already tenth-draft quality on day one.
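The cold-start loop above can be sketched end to end. Everything here is illustrative: the fixture fields, the canned responses, and the scoring rule are assumptions, not a real provider API:

```python
# Cold-start sketch: hand-authored fixtures plus a scripted "model"
# provider, so the skill can be iterated before any user exists.
from dataclasses import dataclass

@dataclass
class Fixture:
    input_context: str
    expected: str
    notes: str          # what this case tests, from domain knowledge

FIXTURES = [
    Fixture("invoice text ...", "Invoice 42 paid 3 May", "date extraction"),
    Fixture("email thread ...", "Shipment delayed",      "multi-doc conflict"),
]

def scripted_model(prompt: str) -> str:
    # Deterministic provider: canned responses keyed on the prompt.
    canned = {"invoice text ...": "Invoice 42 paid 3 May"}
    return canned.get(prompt, "")

def run_cold_start() -> float:
    # Score = fraction of fixtures the skill currently satisfies.
    hits = sum(scripted_model(f.input_context) == f.expected for f in FIXTURES)
    return hits / len(FIXTURES)

assert run_cold_start() == 0.5   # one of two fixtures passes; iterate on the rest
```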
The fixtures don't get replaced when production feedback arrives — they graduate to being the "expert knowledge" partition of the eval set, alongside the "real-world" partition from production feedback. Both partitions stay in the suite permanently. The hand-authored cases catch failure modes users might never trigger. The production cases catch the patterns I didn't anticipate.
Scoring, not matching
The scoring hierarchy runs from most measurable to least. Schema validity — cheapest, binary, automated via Pydantic or equivalent. Factual accuracy — automated against ground truth with rule-based scoring (date accuracy, entity matching, source citation overlap). Completeness — semi-automated, F1 between expected and actual items. Narrative quality — LLM-as-judge, with a separate model assessing prose quality. Human preference — the gold standard, for high-stakes decisions.
Composite scoring combines these. Weights vary by domain — legal weights accuracy highly and narrative moderately; logistics weights accuracy and preference over narrative. The skill-specific score is a weighted sum, and the weights themselves are part of the configuration you iterate on.
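A composite score of this shape is just a weighted sum over the hierarchy, with the weights living in configuration. The weights below are invented for illustration, not calibrated values:

```python
# Composite-score sketch: per-domain weights over the scoring hierarchy.
WEIGHTS = {
    "legal":     {"schema": 0.1, "accuracy": 0.5, "completeness": 0.2, "narrative": 0.2},
    "logistics": {"schema": 0.1, "accuracy": 0.5, "completeness": 0.3, "narrative": 0.1},
}

def composite(scores: dict, domain: str) -> float:
    w = WEIGHTS[domain]
    return sum(w[k] * scores[k] for k in w)

scores = {"schema": 1.0, "accuracy": 0.9, "completeness": 0.8, "narrative": 0.6}
legal = composite(scores, "legal")
# 0.1*1.0 + 0.5*0.9 + 0.2*0.8 + 0.2*0.6 = 0.83
assert abs(legal - 0.83) < 1e-9
```

Because the weights are configuration, comparing two weightings on the same golden set is itself a comparative eval.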
LLM-as-judge is useful but not sufficient. Rules that earn their keep: never use the same model to judge itself (self-evaluation bias is real and measurable), always provide the human reference alongside the AI output (judging in the abstract is noisy), define criteria explicitly (not "rate the quality" — "does the output cite a source? is the date correct? is the reasoning 2-3 sentences?"), and calibrate the judge against human ratings before trusting it (does the judge agree with humans? if not, fix the judge or stop using it).
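The calibration step is mechanical once both rating sets exist. A sketch, where the tolerance and the 0.8 bar are illustrative assumptions rather than standards:

```python
# Judge-calibration sketch: before trusting an LLM judge, check how
# often it agrees with human ratings on the same outputs.
def agreement(judge_scores, human_scores, tolerance=0.1):
    agree = sum(abs(j - h) <= tolerance
                for j, h in zip(judge_scores, human_scores))
    return agree / len(judge_scores)

judge = [0.9, 0.7, 0.4, 0.8]
human = [0.85, 0.75, 0.9, 0.8]   # the judge badly misses the third case
rate = agreement(judge, human)
assert rate == 0.75              # below an 0.8 bar: fix the judge or drop it
```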
Gates in CI
Every skill change goes through eval gates before production. Unit evals run in under thirty seconds — cheap, fast, catches obvious breaks. If they pass, integration evals run in under five minutes. If those pass, regression evals run against the full golden set in under thirty minutes. Each gate has explicit thresholds. Any gate failure blocks the deploy.
The discipline that takes getting used to: "no regression tolerance" is a real setting. A change that scores more than 2% below the previous version on the composite score blocks the deploy. The 2% tolerance accounts for natural variance; beyond that, the change is actually worse, and shipping it is a mistake even if the change looks harmless.
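The regression gate itself is a one-liner; the discipline is in not overriding it. A minimal sketch with the 2% tolerance from above:

```python
# Regression-gate sketch: block the deploy if the new composite score
# falls more than the variance tolerance below the old one.
TOLERANCE = 0.02   # the 2% allowance for natural variance

def gate(old_score: float, new_score: float) -> bool:
    """True = deploy may proceed; False = blocked."""
    return new_score >= old_score - TOLERANCE

assert gate(0.85, 0.84)          # within tolerance: passes
assert not gate(0.85, 0.82)      # a real regression: blocked
```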
I've tried to argue around gate failures ("this is a minor refactor, surely it's fine"). Every time I was wrong. The gate exists to catch the cases I'm confident about and turn out not to be. Trusting the gate has been cheaper than overriding it.
The loop closes here
Evals are not a standalone activity. They're the scoring mechanism that powers the Optimisation Loops. The meta-agent reads traces, proposes an edit to a skill, runs the eval against the held-out dataset. Score improved? Commit. Score regressed? Revert. Repeat.
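The commit-or-revert cycle can be sketched in a few lines. Here `propose_edit` and `eval_score` are stand-ins for the real meta-agent machinery, not actual APIs:

```python
# Optimisation-loop sketch: propose an edit, score it on the held-out
# evals, commit on improvement, revert on regression.
def optimisation_step(skill, baseline, propose_edit, eval_score):
    candidate = propose_edit(skill)
    score = eval_score(candidate)
    if score > baseline:
        return candidate, score      # commit: the edit helped
    return skill, baseline           # revert: keep the old version

skill, baseline = "v1", 0.80
skill, baseline = optimisation_step(
    skill, baseline,
    propose_edit=lambda s: s + "+edit",   # stand-in for the meta-agent's edit
    eval_score=lambda s: 0.83,            # stand-in for the held-out eval run
)
assert skill == "v1+edit" and baseline == 0.83
```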
Without evals the meta-agent is blind. It can't tell if its edits helped. Every other part of the platform can exist without the eval harness and function in some diminished form. The meta-agent literally cannot function without it. Which is why if I'm building toward compounding improvement, the eval harness is the thing I build early, not the thing I add later.