entropik.
§ 02.22 · validation · audit · maturity

The Platform Audit

A 12-section checklist I now run on every AI platform I look at, including my own. Designed to surface architectural gaps honestly, not feature gaps.

Why I wrote this down

I'd been doing versions of this audit informally for years, across five or six platforms — mine and other people's. Every time I picked up a new platform I'd find myself running through roughly the same questions in my head, and every time I'd forget a couple until I ran into the gap they would have flagged. Writing the checklist down made the audit repeatable. More importantly, it made the gaps visible — before, I was scoring platforms on a vibe; now, I can point at a specific row and say "this is a zero, and here's what that means."

The audit has 12 sections and 67 items, each scored 0–2 (0 = not present, 1 = partial or planned, 2 = fully implemented), for a maximum score of 134. I run it quarterly on every platform I'm responsible for. It's not a report for someone else — it's a mirror.
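
If it helps to see the mechanics, here's a minimal sketch of that scoring model in Python. The names are illustrative, not from any real tooling; the point is that a score is one of exactly three values and a platform's total is a plain sum.

```python
from dataclasses import dataclass
from enum import IntEnum


class Score(IntEnum):
    NOT_PRESENT = 0          # nothing exists
    PARTIAL_OR_PLANNED = 1   # something exists, but I wouldn't call it done
    IMPLEMENTED = 2          # in production and working


@dataclass
class Item:
    question: str
    score: Score = Score.NOT_PRESENT


def total(items: list[Item]) -> int:
    """Platform total: the sum of per-item scores, out of 2 * len(items)."""
    return sum(int(item.score) for item in items)
```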

What each section looks at

The sections map to the playbook. Each one is a small audit of one layer of the thesis.

Section 1 — the Demon Principle. AI outputs are proposals, never auto-applied. Every AI output has an explicit accept/modify/reject interaction. Human decisions are recorded with who, when, what changed. AI-generated memory is transparent and editable. The feedback loop is the core feature, not a safety add-on. Five items, ten points. This is the first thing I check because it's the most commonly skipped.

Section 2 — the 6× gap. More engineering time goes into the harness than into model selection. Harness components are version-controlled. There's a process for auditing and pruning assumptions. Agent topology is intentional. Harness changes are tested via eval before production. Five items, ten points.

Section 3 — the six pillars. Events not CRUD. Feedback as primary interaction. Skills not controllers. Context engineering as a discipline. Projections over features. Boundary skills for intelligence IN and OUT. Six items, twelve points.

Section 4 — the three-layer stack. Infrastructure (event store, context engine, embeddings). Domain skills (composable, with declared contracts). Orchestration (session state, dispatch, Gateway-Runtime separation). Eight items, sixteen points.

Section 5 — autonomous patterns. Sessions as processes. Heartbeat for proactive monitoring. Configuration by observation. Soul/personality configuration. Dumb-pipe connectors. Five items, ten points.

Section 6 — harness engineering. Execution contracts on every agent call. State externalised from context windows. Natural language harnesses where appropriate. Compute delegated to child agents (parent orchestrates). Discipline narrowing in agent topology. Five items, ten points.

Section 7 — optimisation loops. Structured traces on every AI call. Human overrides logged with before/after. Raw traces stored (not summarised). An eval harness that measures quality. A defined optimisation loop with search space plus metric plus traces. Meta-agent and task-agent separated. Meta-agent uses the same model family as the task agent. Seven items, fourteen points. This is the section most platforms (including most of mine) score lowest on.

Section 8 — competitive moat. Human feedback accumulates as a compounding asset. Session memory creates switching cost. Optimised skills are domain-specific and hard to replicate. Cross-entity intelligence improves with scale. The platform gets measurably better with each unit of usage. Five items, ten points.

Section 9 — memory architecture. The six-type taxonomy is defined. Session state survives crashes. Events stored immutably. Dual-store semantic memory (relational + vector). Isolation at tenant, case, session, agent boundaries. Retention and forgetting policies defined and enforced. Six items, twelve points.

Section 10 — eval harness. 50+ golden cases from production feedback per primary skill. Scoring functions defined. Eval gates block deploys that regress quality. Feedback-to-eval pipeline curates datasets automatically. Quarterly model comparison benchmarks. Five items, ten points.

Section 11 — observability and operations. Correlation IDs on every LLM call. Cost tracked per agent, per skill, per case, per tenant. Quality metrics (accept/modify/reject) dashboarded. Alerts on cost spikes, quality drops, latency regressions. Agent execution replayable from traces for debugging. Five items, ten points.

Section 12 — security and trust. Agent permissions enforce least-privilege. Data isolation at tenant, case, session boundaries. PII detection and masking in the AI pipeline. Prompt injection defences implemented. Immutable audit trails attributing every AI output to a human decision. Five items, ten points.
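
The twelve sections above, written out as data. The shape is an assumption of mine and the names are shortened from the prose, but the item counts come straight from the list: 67 items at two points each is where the 134 maximum comes from.

```python
# Section number -> (short name, item count), as described above.
SECTIONS = {
    1:  ("Demon Principle",            5),
    2:  ("6x gap",                     5),
    3:  ("Six pillars",                6),
    4:  ("Three-layer stack",          8),
    5:  ("Autonomous patterns",        5),
    6:  ("Harness engineering",        5),
    7:  ("Optimisation loops",         7),
    8:  ("Competitive moat",           5),
    9:  ("Memory architecture",        6),
    10: ("Eval harness",               5),
    11: ("Observability & operations", 5),
    12: ("Security & trust",           5),
}

MAX_SCORE = 2 * sum(count for _, count in SECTIONS.values())
assert MAX_SCORE == 134  # 67 items, 2 points each
```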

The scoring scale that holds up

I've tried a few different scoring scales and the one that's survived is 0–2, with 0 meaning "not present," 1 meaning "partial or planned," and 2 meaning "fully implemented." The three-level scale forces a specific honesty. There's no 0.5 for "we thought about it." There's no 1.5 for "it's mostly working." The ambiguity is where I used to let myself off; removing the ambiguity is what made the audit useful.

I score each row honestly. 0 means zero, not "we plan to." 1 means something exists that partially addresses the concern, but if pressed I wouldn't say it's done. 2 means it's in production and working. The audit is only worth running if I'm willing to write 0 for things that are 0.

The maturity tiers

Totals land in five tiers. I've put every platform I've audited into one of these honestly, and the tiers correlate with how much trouble the platform is going to be in a year.

0–28 — Pre-AI. Traditional application with some AI features bolted on. Most platforms in the world, including most "AI platforms" I've seen pitch decks for.

29–56 — AI-Augmented. Real AI features, but the harness is ad hoc, no optimisation loops, no evals. Ships fine. Stops getting better. Models improve around it and it doesn't catch up.

57–84 — AI-Native. Strong harness, deliberate architecture. Usually gaps in autonomy (no heartbeat) or optimisation (no meta-agent). This is where most of my better platforms have lived.

85–112 — AI-First. Comprehensive implementation. Optimisation loops running. Compounding value showing up in metrics. Rare.

113–134 — AI-Systemic. Full playbook. Platform improves autonomously. Strong moat. I haven't got a platform to this score yet. That's the honest answer.
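
The tier boundaries as a lookup, exactly as listed above; the function name and shape are mine.

```python
# Score ranges are inclusive and match the five tiers described above.
TIERS = [
    (0,   28,  "Pre-AI"),
    (29,  56,  "AI-Augmented"),
    (57,  84,  "AI-Native"),
    (85,  112, "AI-First"),
    (113, 134, "AI-Systemic"),
]


def tier(total_score: int) -> str:
    """Map an audit total to its maturity tier."""
    for low, high, name in TIERS:
        if low <= total_score <= high:
            return name
    raise ValueError(f"score out of range: {total_score}")
```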

How I use it

Three rules make the audit pay back. First, score honestly — zeros are important signal; inflating them loses the information. Second, look for clusters of zeros, not individual gaps — a single zero is a feature to add, but three zeros in the same section indicate a structural hole. Third, prioritise Section 7 (optimisation loops) items 1–3 — trace infrastructure is Phase 0 and unblocks everything else. You can't run the meta-agent without traces, can't build datasets without traces, can't debug production without traces. If you're deciding where to invest, that's where.
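
A sketch of the clusters-of-zeros rule. The input shape, a section number mapped to that section's per-item scores, is assumed rather than any real export format; the rule itself is just counting zeros per section and flagging any section with three or more.

```python
def zero_clusters(scores: dict[int, list[int]], threshold: int = 3) -> dict[int, int]:
    """Return {section: zero_count} for sections with at least `threshold` zeros."""
    zeros = {section: items.count(0) for section, items in scores.items()}
    return {section: n for section, n in zeros.items() if n >= threshold}
```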

Re-audit quarterly. The scores should go up over time. If they plateau — or worse, if they drop as the codebase grows — something's wrong with the architecture's ability to hold the principles as it scales.
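
And a sketch of that quarterly check, assuming the history is just a list of totals, oldest first: flag a drop since the last audit, or a score that has sat flat for several quarters running.

```python
from typing import Optional


def trend_warning(history: list[int], plateau_quarters: int = 3) -> Optional[str]:
    """Return a warning string if quarterly totals drop or plateau, else None."""
    if len(history) >= 2 and history[-1] < history[-2]:
        return "score dropped since the last audit"
    recent = history[-plateau_quarters:]
    if len(recent) == plateau_quarters and len(set(recent)) == 1:
        return f"score flat for {plateau_quarters} consecutive audits"
    return None
```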

The part I find most useful

The audit is more useful to me than any dashboard: a dashboard tells me what's happening in production, while the audit tells me what the platform is structurally capable of becoming. A platform scoring 45 might be running fine today and still be a platform that can't compound. A platform scoring 85 might have ugly metrics today and still be on a trajectory that bends upward.

I'd rather own the 85 than the 45, even if the 45 is currently more polished. The audit lets me see that. It's the closest thing I have to an x-ray for AI platforms.
