entropik.
§ 02.19 · patterns · contracts · governance

Execution Contracts

Five bindings that turn an agent call from a black box into composable infrastructure. Getting this shape right is what made the difference between "AI feature" and something I could actually operate.

The moment I started taking this seriously

For a while I treated agent calls the way I'd treat any other function call. Inputs went in, outputs came back, and if something went wrong I'd look at logs. Then I shipped a loop that didn't terminate, spent a few hours of compute on what should have been a ten-second task, and got a bill that I had trouble explaining to myself.

The fix wasn't in the prompt. It was in the binding around the call. An agent call without explicit bounds is a permissionless subprocess with a credit card — and that's roughly how it behaved. The pattern I now wrap every agent call in is a five-part execution contract that specifies what the call must produce, how much it can spend, what it can touch, when it should stop, and where the results go. Any call that isn't bound by all five is an unbounded subprocess, and I treat it as a production risk.

Required outputs — the schema

The first binding. A structured schema the agent MUST return. Pydantic, Zod, JSON Schema — the specific tool doesn't matter, but the discipline does. The agent either produces data that matches the schema or the call has failed.

This sounds obvious. The reason to spell it out is that the default — "the LLM returns whatever it returns and we parse it as best we can" — is where most production agent bugs I've chased ultimately lived. The output was "roughly correct" but didn't match the expected shape. A downstream consumer made assumptions. The assumptions were violated. Things broke in specific, hard-to-debug ways two hops later.

Schema validation closes this. The harness validates the output before anyone consumes it. If the schema fails, you know immediately that this call failed, not that something downstream broke for inscrutable reasons. Enforced schemas also let downstream agents have contracts of their own — they can rely on the shape rather than hedging against every possible representation. The schemas become the glue that lets the skill layer compose.
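A minimal stdlib sketch of that discipline — in practice Pydantic, Zod, or JSON Schema would carry the weight, and `ExtractionResult` is a hypothetical schema, not a real one:

```python
from dataclasses import dataclass


# Hypothetical output schema for a fact-extraction call. The point is the
# discipline, not the tool: output either matches or the call has failed.
@dataclass(frozen=True)
class ExtractionResult:
    facts: list
    confidence: float

    def __post_init__(self):
        if not isinstance(self.facts, list) or not all(
            isinstance(f, str) for f in self.facts
        ):
            raise ValueError("facts must be a list of strings")
        if not isinstance(self.confidence, float) or not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be a float in [0, 1]")


def validate_output(raw: dict) -> ExtractionResult:
    """Harness-side check: fail loudly at the boundary, not two hops downstream."""
    try:
        return ExtractionResult(**raw)
    except (TypeError, ValueError) as e:
        raise RuntimeError(f"agent call failed schema validation: {e}") from e
```

The harness calls `validate_output` before any consumer sees the result, so a shape mismatch surfaces as "this call failed" rather than as a mystery downstream.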

Budgets — hard limits

Tokens, wall-clock time, cost in dollars, retry count, iteration count. All five have to be explicit. All five have to be enforced by the harness, not negotiated by the model.

Without a token budget, a verbose retry loop can consume thousands of dollars on a single task. Without a time budget, an evaluator-optimiser loop that oscillates between two near-threshold scores will run indefinitely. Without a cost ceiling, a model-swap from Haiku to Opus can change your cost structure overnight without anyone noticing until the bill arrives. I've had at least one of these happen. More than once, if I'm honest.

The shape that works is a budget object that travels with the call and is checked at every point where the call could stop. When the call exceeds any dimension, the harness terminates it — gracefully if possible, abruptly if not, but it terminates. The cost of an honest abort is less than the cost of letting a runaway continue.

Permissions — least privilege

What tools and data this agent can access. Which databases it can read, which tables it can write, which external APIs it can call, which data types it's allowed to see (PII? trade secrets? internal metrics?). Declared explicitly, enforced at dispatch time.

Most of the agent security problems I've seen — mine and others' — reduce to over-permissioned agents. An agent that "should only read facts" ends up with access to billing, because access was granted at the service level rather than the call level. An agent that extracts from documents ends up with write access to the fact store, because nobody drew the line before the first ship.

Default should be deny. Each agent gets only the tools listed in its contract. New tools require an explicit permission update, which means you have to think about it, which means over-permissioning becomes an active choice rather than the path of least resistance.

Scope inside permissions matters too — "can read facts" is weaker than "can read facts scoped to this tenant and this case." Scoping at the binding level is what keeps a compromised or confused agent from becoming a lateral-movement vector.
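A deny-by-default check at dispatch time might be as small as this — the tool names, tenant, and case scoping are invented for illustration:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Permissions:
    allowed_tools: frozenset   # only what the contract lists; nothing implicit
    tenant_id: str             # scope: this tenant only
    case_id: str               # scope: this case only


def authorize(perms: Permissions, tool: str, tenant_id: str, case_id: str) -> None:
    """Dispatch-time check: anything not explicitly granted is denied."""
    if tool not in perms.allowed_tools:
        raise PermissionError(f"tool {tool!r} not in contract")
    if (tenant_id, case_id) != (perms.tenant_id, perms.case_id):
        raise PermissionError("request outside the contract's scope")
```

Because `allowed_tools` starts empty and grows only by editing the contract, over-permissioning becomes a change someone has to write down and review.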

Completion conditions — when to stop

Explicit exit criteria, with a maximum-iteration backstop. Not "the model decides when it's done" — defined conditions, checkable, with an upper bound that fires no matter what.

The shape I use: a success predicate (what does "done right" look like? a confidence score above a threshold? a specific field populated? a schema validation pass?), a stop predicate (what conditions should abort? repeated identical outputs indicating a stuck loop? confidence below threshold after max iterations?), and a hard iteration limit. The iteration limit is the safety net. It fires even when the predicates fail to trigger, because predicates have bugs.

Without this, evaluator-optimiser loops become infinite. I've written infinite loops. More than I'd like. The max-iteration backstop has saved me from the worst versions of them.
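The three parts together, as a sketch — `step` is a hypothetical callable standing in for one evaluator-optimiser round, returning an output and a confidence score:

```python
def run_bounded(step, success_threshold: float, max_iterations: int):
    """Bounded loop: success predicate, stop predicate, hard iteration backstop."""
    last_output = None
    for i in range(max_iterations):          # hard backstop: fires no matter what
        output, confidence = step(i)
        if confidence >= success_threshold:  # success predicate: done right
            return output, "success"
        if output == last_output:            # stop predicate: stuck loop detected
            return output, "stuck"
        last_output = output
    return last_output, "max_iterations"     # predicates have bugs; this doesn't
```

Returning the exit reason alongside the output matters: "success", "stuck", and "max_iterations" are three different stories for the trace, and the meta layer needs to tell them apart.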

Output paths — where results go

The last binding. Where does the result go when the call succeeds? Database table? Event bus? S3 bucket? Cache? What format? What retention? What notifications get fired?

This is the one I forgot most often in early platforms. The agent produces output; someone downstream reads it; the path between is ad-hoc code. Which means every agent has its own custom persistence, every error path is bespoke, and swapping the storage backend becomes a multi-week refactor.

Declaring the output path in the contract decouples the agent from storage. The agent produces a result; the harness writes it. Switching from Postgres to S3 is a contract change, not an agent rewrite. Adding a notification to a downstream service is a contract change, not a new code path. The agent stays clean and focused on producing the result.
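In code, the decoupling might look like this — backend names, targets, and the notification list are all illustrative, and `sinks` stands in for whatever writer registry the harness keeps:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OutputPath:
    kind: str           # e.g. "postgres", "s3", "event_bus"
    target: str         # table name, bucket key, or topic
    notify: tuple = ()  # downstream services to ping after the write


def deliver(path: OutputPath, result: dict, sinks: dict) -> list:
    """Harness-side delivery: the agent never touches storage directly."""
    sinks[path.kind](path.target, result)  # look up the writer for this backend
    return [f"notified:{svc}" for svc in path.notify]
```

Swapping the backend means registering a different writer under the same `kind` and updating the contract; no agent code changes.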

All five, or none

The point I want to hold firmly is that it's not useful to have three out of five. Missing the schema and the whole thing is a guessing game for downstream consumers. Missing the budget and you have a billing incident waiting to happen. Missing permissions and you've built a lateral-movement vector. Missing completion conditions and you have a potential infinite loop. Missing output paths and you have ad-hoc storage code scattered through the system.

I've tried to operate agents with partial contracts. It always ends with the missing binding becoming the source of the next incident. The five together are what make agent calls "governable infrastructure" instead of "unbounded subprocess with a credit card."
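Pulling the five together into one object — one illustrative way to spell it, with every field name an assumption rather than a standard:

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class ExecutionContract:
    schema: Callable[[dict], Any]  # 1. required outputs: validator for the result
    budget: dict                   # 2. hard limits: tokens, time, cost, retries, iterations
    permissions: frozenset         # 3. least privilege: the only tools this call may use
    completion: dict               # 4. success/stop predicates plus max iterations
    output_path: str               # 5. where the harness writes the result

    def is_complete(self) -> bool:
        """A call missing any binding is an unbounded subprocess."""
        return all([self.schema, self.budget, self.permissions,
                    self.completion, self.output_path])
```

A harness that refuses to dispatch unless `is_complete()` holds turns "all five, or none" from a slogan into a precondition.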

The contract as a unit of optimisation

The meta-agent reading traces can propose changes to the contract itself — not just the skill's prompt. If timeouts are frequent, extend the time budget. If confidence is consistently low, lower the threshold or extend max iterations. If cost is drifting, reduce max tokens or switch the model.

This makes the contract a first-class optimisation surface. The search space isn't just "what words should the prompt contain" — it's "what bounds should the call operate within." Some of the most impactful optimisations I've seen came from adjusting budgets and completion conditions rather than rewriting prompts.
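A toy sketch of that optimisation surface: trace statistics in, proposed contract changes out. The thresholds, field names, and adjustment factors are all invented for illustration:

```python
def propose_contract_changes(stats: dict, contract: dict) -> dict:
    """Map observed trace statistics to proposed contract adjustments."""
    changes = {}
    if stats.get("timeout_rate", 0.0) > 0.2:        # frequent timeouts: more time
        changes["max_seconds"] = contract["max_seconds"] * 1.5
    if stats.get("mean_confidence", 1.0) < 0.6:     # chronically low confidence
        changes["max_iterations"] = contract["max_iterations"] + 2
    if stats.get("cost_drift", 0.0) > 0.1:          # cost drifting up: tighten tokens
        changes["max_tokens"] = int(contract["max_tokens"] * 0.8)
    return changes
```

The proposals stay proposals: they feed the same review path as any other contract change rather than being applied silently.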

The shape holds up against programmatic tool calling too

The shift toward programmatic tool calling — models writing code in sandboxes to compose tools efficiently rather than reasoning through one call at a time — makes execution contracts more important, not less. Unbounded code execution against unbounded tool surfaces is exactly what you don't want. The contract says "you may call these tools, within these budgets, to produce this output, and the environment will enforce the bounds." The model writes code that fits. The sandbox enforces. The contract is the governance layer.

What I now ask before writing any agent call

Five questions, in order. What must come back (schema)? How much can this spend (budget)? What can this touch (permissions)? When does it stop (completion)? Where does the result go (output path)? If I can't answer all five in advance, I'm not ready to write the call. If I try anyway, I'm deferring to future-me to debug the thing I'm about to unleash. Future-me has had enough of that.
