entropik.
§ 02.23 · operations · observability · traces · cost

Observability

Traditional APM measures response time and error rates. Agents need traces, cost attribution, and quality proxies — a lesson from my first month running AI in production without them.

The reason APM isn't enough

The first time I ran AI in production with only traditional APM, I learned something uncomfortable. Response times looked fine. Error rates looked fine. Status codes were 200. The system was producing wrong answers, the users were quietly correcting them, and none of that was visible on any dashboard I had.

Traditional APM measures latency (p50, p95, p99), error rate (4xx, 5xx), resource utilisation. For deterministic systems these are enough. For agents they measure the wrong thing. A fast response can still be wrong. A 200 status code can have terrible quality. An "error" might be the harness handling a model failure gracefully. You need a different set of dimensions.

Four pillars replace the old ones: traces (what actually happened in a single execution), metrics (aggregate signals across many executions), logs (the narrative for debugging), and alerts (anomalies that need intervention). The shape of each differs from traditional observability enough that I had to rebuild my intuition.

Traces — the execution chronicle

A trace is a complete record of an agent's execution from input to output. Every session gets a root span. Every agent, skill, LLM call, database query is a child span. Every span carries a correlation ID that links it to siblings, parents, and the request that started the whole thing.

The shape that earns its keep in practice: the root span represents the session or the scheduled task. Each agent invocation is a span under it. Each LLM call is a span under the agent. Each DB query or tool call is a span under whatever invoked it. Every span carries cost attributes (input tokens, output tokens, model name, USD cost), timing (start, end, duration), and content attributes (the agent's output shape, the decision it made, the feedback it received).
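The span shape above can be sketched as a small data structure. This is a hand-rolled illustration, not any particular tracing library's API; the field names and the example costs are assumptions for demonstration.

```python
from dataclasses import dataclass, field

@dataclass
class Span:
    name: str                 # e.g. "session", "agent:timeline", "llm:call"
    correlation_id: str       # links this span to siblings, parents, and the request
    input_tokens: int = 0
    output_tokens: int = 0
    cost_usd: float = 0.0
    duration_ms: float = 0.0
    children: list = field(default_factory=list)

    def total_cost(self) -> float:
        # Rolling up child costs gives per-session cost with no separate bookkeeping
        return self.cost_usd + sum(c.total_cost() for c in self.children)

# Root span is the session; agent, LLM call, and DB query nest under it.
root = Span("session", "req-123")
agent = Span("agent:timeline", "req-123")
agent.children.append(Span("llm:call", "req-123", 1200, 300, 0.021, 900.0))
agent.children.append(Span("db:query", "req-123", duration_ms=12.0))
root.children.append(agent)

print(root.total_cost())  # the session's cost is the sum of its LLM-call spans
```

The rollup in `total_cost` is the point: cost attribution falls out of the tree for free, which is the first of the four reasons this shape earns its keep.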

Four reasons this shape is worth the effort. Cost attribution — rolling up span costs gives you per-case and per-session cost without any separate bookkeeping. Agent accountability — you can rank agents by cost, latency, and quality separately, which is what lets you decide which ones to optimise. Model comparison — the same agent span using Sonnet vs Haiku is directly comparable. Drill-down — "the timeline took 5 seconds, why?" answers itself by looking at the span tree.

When a user says "the output was wrong," you pull the trace by case ID and session ID, walk the span tree, and see exactly what happened. What was in the context. What the prompt looked like. What the model returned. How the harness transformed it. Where the human's correction landed. Without the trace, that incident is un-debuggable; with it, it's a fifteen-minute investigation.

Metrics — the vital signs

Metrics are aggregated signals computed from traces. They answer "how is the system performing at scale?"

The ones that matter for agents differ from the ones that matter for traditional services. Agent latency (p50, p95, p99 per agent, with trend) — watch for increasing trends; they indicate context bloat or model regression. Agent cost (per call, per case, per month) — detect drift before the monthly bill. Agent quality (accept rate, modify rate, reject rate per agent) — the single most important thing to dashboard, because everything else can be fine and the platform can still be producing bad outputs.
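Computing the quality metrics from feedback events is straightforward; a minimal sketch, assuming feedback arrives as (agent, outcome) pairs where outcome is one of accept, modify, or reject. The event shape and agent names are illustrative.

```python
from collections import Counter

def quality_rates(events):
    """events: iterable of (agent, outcome) pairs; outcome in {accept, modify, reject}."""
    by_agent = {}
    for agent, outcome in events:
        by_agent.setdefault(agent, Counter())[outcome] += 1
    rates = {}
    for agent, counts in by_agent.items():
        total = sum(counts.values())
        rates[agent] = {o: counts[o] / total for o in ("accept", "modify", "reject")}
    return rates

# 8 accepts, 1 modify, 1 reject for a hypothetical "timeline" agent
events = ([("timeline", "accept")] * 8
          + [("timeline", "modify")]
          + [("timeline", "reject")])
print(quality_rates(events)["timeline"])  # accept 0.8, modify 0.1, reject 0.1
```

Dashboarding these three rates per agent, with trend, is the single view that catches the "everything else is fine but outputs are bad" failure mode.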

Model comparison metrics are something I now build deliberately. Same agent, split across two models in production. Which one gets accepted more often? Which one gets modified more often? Which one costs less? These are the metrics that drive model-swap decisions. Without them, the swap is a guess. With them, it's a call I can defend.

Quality proxies when you don't have ground truth — which is most of the time in complex domains — earn their own attention. Feedback rates. Confidence calibration (does the model's 90% actually correspond to 90% acceptance?). Inter-rater reliability (when two humans review the same case, how often do they agree?). Drift signals (is this week's quality different from last week's without any change I know about?).

Logs — the narrative thread

Structured logs with correlation IDs at every level. Every log entry carries the correlation ID of the request it's part of, so filtering by ID reconstructs the full story of one execution.

The log levels I actually use, in order of volume. DEBUG is the full-prompt, full-output logs that I only turn on during investigation — the volume is too high to leave on permanently. INFO is agent transitions (which agents executed, in what order, with what cost) and human feedback events (what was decided, when, with what rationale). WARN is budget thresholds (cost or latency approaching limits) and quality anomalies (accept rate dropped significantly). ERROR is exceptions (LLM failure, API timeout, database error), usually with recovery outcomes (did the fallback work?).

Log aggregation with correlation ID as the primary index is what makes this useful. When a problem surfaces, the workflow is: get the case ID, query logs by correlation ID, walk through the INFO-level agent transitions, drill into DEBUG-level prompts and outputs if needed. Fifteen minutes to reconstruct the full execution chain. An hour and a half if the correlation IDs aren't there.
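A minimal sketch of the structured-log emit, assuming plain JSON lines to stdout rather than any specific logging framework; field names are illustrative.

```python
import json
import sys
import time

def log(level, correlation_id, event, **fields):
    """Emit one JSON log line. Every entry carries the correlation ID,
    so filtering by that ID reconstructs the full story of one execution."""
    entry = {"ts": time.time(), "level": level,
             "correlation_id": correlation_id, "event": event, **fields}
    line = json.dumps(entry)
    sys.stdout.write(line + "\n")
    return line

# INFO: agent transition with its cost; WARN: budget threshold crossed
log("INFO", "req-123", "agent_transition", agent="timeline", cost_usd=0.021)
log("WARN", "req-123", "budget_threshold", spent_pct=82)
```

Because every line is a self-describing JSON object keyed by `correlation_id`, the fifteen-minute workflow is a single filtered query rather than a grep across free-text logs.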

Alerts — detecting anomalies before impact

Alerts watch metrics and logs for abnormal patterns. The trap I fell into early was alerting on symptoms (response time went up) rather than causes (token usage per call went up, likely context bloat). Symptom-level alerts wake you up without telling you why. Cause-level alerts tell you what to fix.

The alerts I now run. Daily cost spike — total platform cost more than 2σ above the rolling mean. Quality regression — agent accept rate drops more than 10 percentage points week over week. Latency regression — agent p95 increases more than 30% week over week. Budget approaching limit — a specific case's cost exceeds 80% of its allocated budget (gives the owner time to pause non-critical work). Confidence miscalibration — model confidence high, actual accept rate low. Timeout spike — error rate from timeouts above 2% in an hour. Heartbeat failure — no agent executions in 15 minutes (the system might be down).
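The daily cost spike rule from the list above can be sketched in a few lines; a toy example with made-up history, using a plain 2-sigma threshold over the rolling window.

```python
from statistics import mean, stdev

def cost_spike(daily_costs, today):
    """True if today's total platform cost is more than 2 sigma
    above the rolling mean of recent daily costs."""
    mu = mean(daily_costs)
    sigma = stdev(daily_costs)
    return today > mu + 2 * sigma

history = [40.0, 42.0, 38.0, 41.0, 39.0, 40.0, 42.0]  # last week's daily costs, USD
print(cost_spike(history, 55.0))  # True: well above the 2-sigma band
print(cost_spike(history, 42.0))  # False: within normal variation
```

The same pattern (rolling baseline, threshold, alert) covers the quality-regression and latency-regression rules with the metric swapped out.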

Each alert has a suggested cause in the message, not just the symptom. "Cost spiked to $X; check for model changes, usage increase, or regression in budget controls." That message is often enough to resolve the alert without further investigation.

Cost tracking as a first-class concern

Agents are economically different from traditional services. Cost scales with usage in a direct way — more calls, more tokens, more dollars — and without attribution you can't optimise.

The hierarchy: tenant → case → skill → agent → LLM call. At each layer, you want rollup totals and trends. Per-tenant monthly cost is a billing view. Per-case cost is a project view. Per-skill cost is an optimisation view (which skills are expensive relative to their value?). Per-agent cost tells you which individual agents are carrying the weight.
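With cost attributes on every LLM-call record, each layer of the hierarchy is just a different group-by. A sketch, assuming flat call records carry all four keys; the tenant, case, skill, and agent names are invented for the example.

```python
from collections import defaultdict

def rollup(calls, level):
    """calls: dicts with tenant, case, skill, agent, and cost_usd keys.
    level: which key to aggregate by. Returns total cost per group."""
    totals = defaultdict(float)
    for call in calls:
        totals[call[level]] += call["cost_usd"]
    return dict(totals)

calls = [
    {"tenant": "acme", "case": "c1", "skill": "timeline", "agent": "extractor", "cost_usd": 0.02},
    {"tenant": "acme", "case": "c1", "skill": "timeline", "agent": "writer", "cost_usd": 0.05},
    {"tenant": "acme", "case": "c2", "skill": "summary", "agent": "writer", "cost_usd": 0.03},
]
print(rollup(calls, "case"))    # per-case: the project view
print(rollup(calls, "tenant"))  # per-tenant: the billing view
```

The billing, project, and optimisation views are all the same data; only the grouping key changes.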

The Anthropic-published numbers on harness cost-quality tradeoffs are the clearest baseline I've seen. Solo agent with no harness: $9, 20 minutes, produces a UI that renders but whose functionality is broken. Full multi-agent harness on Opus 4.5: $200, 6 hours, produces working software. Same harness on Opus 4.6: $124.70, 3 hours 50 minutes — equivalent quality, 38% cheaper, 36% faster. The 22× cost increase from solo to harnessed bought a functioning product. The 38% reduction from 4.5 to 4.6 came from the model alone.

Track the ratio: harness_overhead = (harnessed_cost - solo_cost) / solo_cost. As models improve, this ratio should decrease. If it doesn't, your harness is carrying dead components — which is the Subtraction Principle viewed through the cost lens.
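Plugging the published numbers from above into the ratio shows the trend it should exhibit:

```python
def harness_overhead(harnessed_cost, solo_cost):
    """Overhead the harness adds on top of a solo-agent run,
    as a multiple of the solo cost. Should fall as models improve."""
    return (harnessed_cost - solo_cost) / solo_cost

print(harness_overhead(200.0, 9.0))   # ~21.2 on Opus 4.5
print(harness_overhead(124.70, 9.0))  # ~12.9 on Opus 4.6, lower as expected
```

A ratio that holds steady or grows across model generations is the cost-side symptom of dead harness components.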

The time-travel debugger

When something goes wrong in production, the observability stack should let you reconstruct exactly what happened. Pull the trace by case ID. Look at the agent spans. See the input context that was assembled. See the prompt that was sent. See the model output. See how the harness transformed the output. See what the human did with it.

This is a specific workflow I've built for every platform since the first one. The difference between "the timeline was wrong" being a half-hour debug and a two-day fishing expedition is whether this workflow exists.

What I'd build first on a new platform

Before the first user: traces with correlation IDs, cost attribution per span, structured logs with levels. That's the minimum. Dashboards come next — cost, quality (accept/modify/reject), latency per agent. Alerts last, because you need the metrics to alert against.

The platforms I've regretted most are the ones where I skipped any of this for the first few weeks. By the time I went back to add it, I had blind spots in production I couldn't close without migrating data. Day one is cheaper than month three.
