entropik.
§ 02.10 · architecture · tenancy · verticals · isolation

Multi-Tenancy and Verticals

Tenants and verticals are independent axes. I kept collapsing them into one until the configuration sprawl made it obvious why I shouldn't. This is the shape that held up.

The distinction I kept missing

For a long time I used "tenant" and "vertical" more or less interchangeably. A new law firm on the platform was a tenant. A new practice area was also a tenant. A new logistics company was a tenant. A new fleet type was too. It worked for about three customers. By the fifth, the configuration had branched into a shape I couldn't reason about, and I was paying the cost of every new thing being a special case.

The shift that unstuck me was noticing that tenant and vertical are orthogonal. A tenant is an organisational boundary — one firm, one company, one research group. A vertical is a domain specialisation — family law, last-mile logistics, social network research. A tenant can use one vertical or several. A vertical can be used by one tenant or many. Treat them as two axes and the configuration problem decomposes. Treat them as one axis and you get the sprawl I spent a year paying for.
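The two-axes idea can be sketched in a few lines. This is a hypothetical model, assuming tenants hold entitlements to verticals; the names (`Tenant`, `Vertical`, the example tenants) are illustrative, not from the platform itself:

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Vertical:
    name: str  # domain specialisation, e.g. "family-law"

@dataclass
class Tenant:
    name: str                                    # organisational boundary
    verticals: set = field(default_factory=set)  # which verticals this tenant is entitled to

family_law = Vertical("family-law")
logistics = Vertical("last-mile-logistics")

acme = Tenant("acme-legal", {family_law})
# Many-to-many: one tenant can hold several verticals,
# and one vertical can be held by many tenants.
globex = Tenant("globex", {family_law, logistics})
```

The point of the shape is that adding a tenant and adding a vertical are independent operations: neither forces a change to the other axis.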

Configuration cascades, five levels deep

Once I had two axes, the configuration shape that made sense was a cascade. Five levels, each inheriting from the one above, each allowed to override specific things:

Platform sits at the top — the baseline for all tenants. Default model, core skill definitions, infrastructure config, universal thresholds, default prompts. Tenant is next — one organisation's overrides. Branding, voice, which verticals they're entitled to use, data isolation scope, resource quotas, firm-specific terminology. Vertical is below tenant — the domain specialisation's overrides. Skill parameters, evidence weights, domain vocabulary, vertical-specific agents, jurisdiction defaults. User is below vertical — individual preferences and role-based access. Session is at the bottom — transient overrides for one conversation or one case.

Skill parameter resolution walks from bottom to top: session override first, then user preference, then vertical config, then tenant config, then platform default. The first match wins. The discipline that makes this pattern hold is that each level is only allowed to override, not invent — you cannot introduce a brand-new skill at the user level or a brand-new resource category at the vertical level. New things always start at the platform level and cascade down.
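The resolution walk above can be sketched as a first-match lookup over the five levels, with the override-not-invent rule enforced by rejecting anything absent at the platform level. A minimal sketch, assuming each level is a plain dict of parameters; the function name and example values are mine, not the platform's:

```python
# Levels ordered bottom to top: the first level that defines a parameter wins.
LEVELS = ["session", "user", "vertical", "tenant", "platform"]

def resolve(param: str, configs: dict):
    """Walk session -> user -> vertical -> tenant -> platform.
    Levels may override, never invent: a parameter that does not
    exist at platform level is rejected outright."""
    if param not in configs.get("platform", {}):
        raise KeyError(f"{param} is not defined at platform level")
    for level in LEVELS:
        if param in configs.get(level, {}):
            return configs[level][param]

configs = {
    "platform": {"confidence_floor": 0.6, "max_evidence_items": 20},
    "vertical": {"confidence_floor": 0.75},   # the vertical tightens the floor
    "session":  {"max_evidence_items": 5},    # transient override for one conversation
}

resolve("confidence_floor", configs)    # vertical override wins over platform default
resolve("max_evidence_items", configs)  # session override wins over everything above it
```

A lookup for a parameter no level below platform has touched falls through to the platform default, which is what makes platform-level improvements propagate everywhere automatically.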

I've learned to hold this discipline firmly because the platforms where I let tenant-level customisation drift into inventing new things became the ones I couldn't maintain. Fifty tenants each with their own bespoke skill definitions is not a configuration hierarchy; it's fifty forks.

Shared infrastructure, isolated data

The thing the platforms I got right have in common is shared infrastructure and isolated data. One LLM subscription. One embedding service. One set of skill definitions. One Kubernetes cluster. One auth layer. But per-tenant vector namespaces, per-tenant event partitions, per-tenant session stores, per-tenant data lakes, per-tenant secrets.

What's shared is the stuff where a single well-tuned implementation benefits everyone. What's isolated is the stuff where a single bug could leak one tenant's data to another. The rule I've learned to hold is that isolation is physical, not logical. A tenant_id filter in a query is not isolation — it's a tenant_id filter, and I've shipped one that had a bug at least once. Isolation means separate collections in the vector store, separate topics in the event bus, row-level security enforced at the database level, session keys that start with the tenant_id and cannot be reassigned.
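One concrete version of "session keys that start with the tenant_id" is to make every physical resource name tenant-scoped at construction time, so a handle created for one tenant cannot address another tenant's data. A sketch under that assumption; the helper name and example tenant are hypothetical:

```python
def tenant_scoped(tenant_id: str, resource: str) -> str:
    """Build a physical resource name that begins with the tenant id.
    The scoping happens once, at handle creation -- not as a filter
    re-applied (and forgettable) on every query."""
    if not tenant_id or "/" in tenant_id:
        raise ValueError("invalid tenant id")
    return f"{tenant_id}/{resource}"

# One collection, one topic, one session prefix per tenant --
# not one shared namespace with a tenant_id filter bolted onto every query.
vector_collection = tenant_scoped("acme-legal", "embeddings")
event_topic      = tenant_scoped("acme-legal", "events")
session_prefix   = tenant_scoped("acme-legal", "sessions")
```

The design choice is that forgetting the filter becomes impossible rather than merely unlikely: there is no unscoped handle to misuse.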

The anti-pattern I've personally fallen for is a shared vector namespace with tenant_id filtering at query time. It feels cheaper. It introduces a class of bug that is both severe (cross-tenant data leak) and silent (no error — just wrong results). I don't do this any more, and I wouldn't let a future me do it either.

Per-vertical skill configuration

The same skill behaves differently in different verticals. /assess-risk in family law weights emotional-stability evidence differently than in employment law, where documentation-quality is the dominant signal. Same code. Different parameters. Different vocabulary. Different confidence floors.

This is where the vertical axis pays for itself. Without it, you either have one skill that tries to serve every domain (and serves none of them well), or one skill per domain (and now every code change has to be made N times). With per-vertical configuration, the skill is shared and the parameters cascade. When the platform-level skill improves, all verticals inherit the improvement. When a vertical needs specialisation, it lives in the vertical's config file — not in the skill code.

The small discipline that makes this work: vertical configs are parameter overrides, not code. If a vertical needs behaviour that parameters cannot express, that's a signal the platform-level skill needs a new parameter — not that the vertical needs a custom skill. I've broken this rule a couple of times and each time regretted it within a quarter.
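The shared-code, cascading-parameters shape might look like this. A sketch only: the weights, keys, and the scoring itself are invented for illustration, and a real /assess-risk would be far richer:

```python
# Platform-level defaults; every vertical inherits these.
PLATFORM_DEFAULTS = {
    "evidence_weights": {"documentation_quality": 1.0},
    "confidence_floor": 0.6,
}

# Vertical configs are parameter overrides, not code.
VERTICAL_OVERRIDES = {
    "family-law":     {"evidence_weights": {"emotional_stability": 2.0,
                                            "documentation_quality": 0.5}},
    "employment-law": {"evidence_weights": {"documentation_quality": 2.0}},
}

def skill_params(vertical: str) -> dict:
    params = dict(PLATFORM_DEFAULTS)
    params.update(VERTICAL_OVERRIDES.get(vertical, {}))
    return params

def assess_risk(evidence: dict, vertical: str) -> float:
    """Same code for every vertical; only the weights differ."""
    weights = skill_params(vertical)["evidence_weights"]
    total = sum(weights.get(k, 0.0) * v for k, v in evidence.items())
    return total / sum(weights.values())
```

When the platform-level `assess_risk` improves, both verticals inherit the change; when family law needs different weights, only its entry in `VERTICAL_OVERRIDES` moves.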

Vertical discovery

Three patterns for deciding which vertical applies. Explicit selection at onboarding — the user picks. Document classification — the platform looks at an uploaded file or a submitted query and classifies it against known verticals. Automatic routing — if the session doesn't have a vertical set yet, run a classifier on the current context and use the result (above some confidence threshold), or ask the user to clarify.

The one I now default to is a blend: explicit selection for new sessions where context is thin, classification for sessions where a document or clear signal is available, and fallback to ambiguity-resolution (asking the user) when the classifier isn't confident. The platforms where I leaned entirely on automatic routing tended to route confidently wrong a nontrivial fraction of the time. The platforms where I leaned entirely on explicit selection tended to annoy users who'd set up their vertical three months ago and now had to re-select it.
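The blend above reduces to a short routing function. This sketch assumes a classifier that returns a `(vertical, confidence)` pair; the threshold value and all names are placeholders:

```python
CONFIDENCE_THRESHOLD = 0.8  # assumed value, tune per platform

def route_vertical(session: dict, classify, context):
    """Explicit choice wins; otherwise classify the current context,
    and below the threshold fall back to asking the user rather than
    guessing confidently wrong."""
    if session.get("vertical"):               # explicit selection, set at onboarding
        return session["vertical"]
    vertical, confidence = classify(context)  # document / query classification
    if confidence >= CONFIDENCE_THRESHOLD:
        session["vertical"] = vertical        # pin it for the rest of the session
        return vertical
    return None  # signal the caller to ask the user to clarify
```

A confident classification is pinned on the session so the classifier runs once, not on every turn; a low-confidence result returns `None` instead of a guess, which is the ambiguity-resolution branch.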

What adding a new tenant or vertical actually costs

When the architecture is healthy, adding a new tenant is hours of work — most of it waiting for provisioning to complete. Create the namespace. Import the platform config. Set resource quotas. Onboard users. Smoke test. Done in an afternoon.

Adding a new vertical is days, not weeks. Define the config. Override 3–5 critical skills with vertical-specific parameters. Compile domain vocabulary into embeddings. Seed the vector store with vertical-specific examples. Deploy, A/B test against the fallback behaviour, monitor skill performance. Usually three to five days.

Adding a new platform from scratch is weeks — but "from scratch" here means from the Entropik baseline, inheriting event sourcing, context tiers, skill framework, feedback loops. A new domain with a new economic model, starting from the same infrastructure. If I had to build the infrastructure too, the number would be months.

The reality check I'd apply to anyone claiming multi-tenancy in an AI platform: how long does adding a new tenant take? If it's measured in weeks, you have a single-tenant platform that sometimes runs for multiple customers. If it's measured in hours, you have multi-tenancy. The architecture above is what makes the difference.

The anti-patterns I've lived through

A few shortcuts that feel reasonable at the time and aren't. Letting tenants customise everything — the two things tenants should be allowed to override are branding and vertical selection. Everything else belongs in the platform or the vertical. Using logical instead of physical data isolation — filter-by-tenant-id is not enough. Defining verticals organisationally rather than by domain — "Q1 projects" is not a vertical; "family law" is. Training the vertical classifier on one tenant's historical data — a new tenant arrives with no classifier; the classifier has to be platform-level. Sharing session state across tenants — sessions should be keyed by tenant_id and physically isolated in whatever store holds them.

Each of those I've tried at least once. Each one produces the same class of bug: a subtle cross-tenant leak that surfaces months later and has to be chased through half the system. The discipline of getting the architecture right at the start is what prevents that class of bug from existing.
