Designing AI-agent workflows that actually compound: an operator's pattern library

The phrase "AI agent" has been laundered through enough conference decks that it has lost most of its operational meaning. What I actually run is more boring and more useful: a workflow engine orchestrating a few hundred deterministic pipelines, of which a meaningful fraction call into language models for the steps that need judgement. The agents are the small, focused, well-bounded language-model invocations inside those pipelines. The workflows are the scaffolding that makes them composable.

This piece is the pattern library I've built over the last year of running these things in production. It is not framework-specific. The patterns survive whichever orchestration layer you pick. The mistake to avoid is conflating "AI agent" with "single, autonomous, long-running thing" — that category has its place, but it is not where most useful work in 2026 actually happens.

The first principle: small agents, strong scaffolding

The most important architectural decision is to keep individual agents small, scoped and stateless. A good agent has a single job, a defined input shape, a defined output shape, and a tight context window. It calls a model, validates the response against a schema, and returns. It does not loop. It does not call other agents. It does not maintain state.

The orchestration that strings them together — sequencing, retries, branching, error handling, state management, persistence — lives in the workflow engine, not in the agent. This is the inversion of the "big autonomous agent" pattern. It trades cleverness for reliability. In production, reliability wins every time.
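To make the shape concrete, here is a minimal sketch of one small agent. Everything in it is illustrative: `classify_sentiment`, `SentimentResult` and the `call_model` callable are hypothetical names, and the fake model stands in for whatever model router you actually call.

```python
import json
from dataclasses import dataclass
from typing import Callable

ALLOWED_LABELS = {"positive", "neutral", "negative"}


@dataclass(frozen=True)
class SentimentResult:
    """Defined output shape for the agent."""
    label: str
    confidence: float


def classify_sentiment(text: str, call_model: Callable[[str], str]) -> SentimentResult:
    """One job, one model call, one validated result. No loops, no state."""
    prompt = (
        "Classify the sentiment of the text. Reply with JSON only, shaped as "
        '{"label": "positive|neutral|negative", "confidence": <0.0 to 1.0>}.\n\n'
        f"Text: {text}"
    )
    raw = call_model(prompt)
    data = json.loads(raw)  # non-JSON output raises a ValueError subclass
    if not isinstance(data, dict):
        raise ValueError("model output must be a JSON object")
    label = data.get("label")
    confidence = data.get("confidence")
    if label not in ALLOWED_LABELS:
        raise ValueError(f"unexpected label: {label!r}")
    if not isinstance(confidence, (int, float)) or not 0.0 <= confidence <= 1.0:
        raise ValueError(f"confidence out of range: {confidence!r}")
    return SentimentResult(label=label, confidence=float(confidence))


def fake_model(prompt: str) -> str:
    """Stand-in for a real model call (assumption for the example)."""
    return '{"label": "positive", "confidence": 0.92}'


result = classify_sentiment("Great service, thank you!", fake_model)
```

The agent raises on anything that does not match the schema and leaves the decision about retrying to the workflow engine.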

The compounding effect is that small agents are reusable. A sentiment classifier, a JSON-extraction step, a summarisation step, a translation step — these become library citizens. Each one is independently tested, independently versioned, independently swapped. You build a few dozen and discover you can compose entire new workflows from them in an afternoon.

The sub-workflow as a first-class abstraction

Every reusable capability in our stack is a sub-workflow with a webhook entry point. Need to enrich a company by domain? There's a sub-workflow for that. Need to log a metric to the time-series database? Sub-workflow. Need to send a chat message with templated formatting? Sub-workflow. Need to call an LLM through the model router? Sub-workflow.

Each sub-workflow has:

A defined input contract — an explicit list of fields it expects, with types.
A defined output contract — an explicit shape it returns.
Idempotency wherever the underlying operation supports it.
Retry policy embedded — typically 2 retries with a 1-second backoff for transient failures.
Observability hooks — start and end markers logged with timings, AI cost and outcome.

The result is that the parent workflow that composes these sub-workflows is short, readable and almost free of error-handling boilerplate. The boilerplate lives once, in the sub-workflows. New parent workflows ship in hours rather than days because they are mostly just composition.
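A rough sketch of that wrapper, assuming the sub-workflow body is a plain function. `run_subworkflow`, `TransientError` and the contract dictionaries are hypothetical names for illustration; idempotency depends on the underlying operation and is left out.

```python
import time
from typing import Any, Callable


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (timeouts, 503s)."""


def _check(name: str, data: dict, contract: dict[str, type], side: str) -> None:
    # Every field in the contract must be present with the expected type.
    for field_name, expected in contract.items():
        if not isinstance(data.get(field_name), expected):
            raise TypeError(f"{name}: {side} field {field_name!r} must be {expected.__name__}")


def run_subworkflow(
    name: str,
    handler: Callable[[dict], dict],
    payload: dict,
    input_contract: dict[str, type],
    output_contract: dict[str, type],
    retries: int = 2,
    backoff_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
    log: Callable[[dict], Any] = print,
) -> dict:
    _check(name, payload, input_contract, "input")
    started = time.monotonic()
    for attempt in range(1, retries + 2):  # first try plus `retries` retries
        try:
            output = handler(payload)
        except TransientError as exc:
            if attempt > retries:
                log({"workflow": name, "outcome": "failed", "attempts": attempt,
                     "error": str(exc), "duration_s": time.monotonic() - started})
                raise
            sleep(backoff_s)
            continue
        _check(name, output, output_contract, "output")
        log({"workflow": name, "outcome": "ok", "attempts": attempt,
             "duration_s": time.monotonic() - started})
        return output
    raise AssertionError("unreachable")


# Usage: a handler that fails twice with a transient error, then succeeds.
calls = {"n": 0}


def flaky_enrich(payload: dict) -> dict:
    calls["n"] += 1
    if calls["n"] < 3:
        raise TransientError("503 from upstream")
    return {"domain": payload["domain"], "employees": 42}


events: list[dict] = []
out = run_subworkflow(
    "enrich_company", flaky_enrich, {"domain": "example.com"},
    input_contract={"domain": str}, output_contract={"employees": int},
    sleep=lambda s: None, log=events.append,
)
```

The parent workflow only ever sees `out` or an exception; the contract checks, retries and timing logs stay inside the wrapper.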

Pattern: critic-refiner for content quality

One of the highest-leverage patterns we run is a two-stage critic-refiner loop for any content the workflow generates. It works like this:

Generator: a fast, capable model produces a first draft against a structured prompt.
Critic: a separate model (often deliberately from a different model family) scores the draft against an explicit rubric covering accuracy, brand voice adherence, completeness and schema validity. The critic returns structured feedback, not prose.
Refiner: the original generator, given the draft and the critic's feedback, produces a revised version.
Loop or exit: if the critic's score crosses a threshold, exit. If not, loop back to the refiner, up to a maximum iteration cap.

This pattern is wildly more effective than single-pass generation. Quality lifts measurably, and — counter-intuitively — total cost often drops because the critic catches enough first-draft issues to avoid downstream rework. The architectural trick is keeping the rubric explicit and structured. Vague rubrics produce vague refinements.
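The control flow of the loop can be sketched as below. This is an assumption-laden outline rather than our production code: `critic_refiner`, `Critique`, the threshold of 0.8 and the cap of 3 iterations are all illustrative, and the three callables stand in for real model calls.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Critique:
    """Structured feedback from the critic, not prose."""
    score: float                      # overall rubric score, 0.0 to 1.0
    issues: list[str] = field(default_factory=list)


def critic_refiner(
    brief: str,
    generate: Callable[[str], str],
    critique: Callable[[str], Critique],
    refine: Callable[[str, Critique], str],
    threshold: float = 0.8,
    max_iterations: int = 3,
) -> tuple[str, Critique]:
    draft = generate(brief)
    for _ in range(max_iterations):
        review = critique(draft)
        if review.score >= threshold:
            return draft, review          # good enough, exit early
        draft = refine(draft, review)     # feed the structured feedback back in
    # Cap reached: score the final draft so the caller can decide what to do.
    return draft, critique(draft)


# Stub agents for the example (hypothetical behaviour).
def generate(brief: str) -> str:
    return f"draft about {brief}"


def critique(draft: str) -> Critique:
    issues = [] if "pricing" in draft else ["missing pricing section"]
    return Critique(score=0.5 if issues else 1.0, issues=issues)


def refine(draft: str, review: Critique) -> str:
    return draft + " with pricing"


final, review = critic_refiner("the new plan", generate, critique, refine)
```

Note that the loop lives in the orchestration layer. The generator, critic and refiner each remain single-shot agents.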

Pattern: the entity-context loader

Most of our workflows are multi-tenant: the same workflow serves multiple clients with different brand voices, ideal customer profiles (ICPs), locales, and credentials. The pattern that makes this scalable is an entity-context loader: a sub-workflow you call at the start of a parent workflow with a single company_id, which returns the full operational context for that client.

That context includes the entity name, brand colours, brand voice descriptors, ICP definition, locale settings, the IDs of any sheets or external resources scoped to them, the credential references they should use, and the routing destinations for notifications. Every downstream step in the workflow consumes from that context object. Nothing is hardcoded.

The result is that one workflow serves N clients. When a new client onboards, you populate their row in the context store and they inherit the whole stack. When a client's brand voice changes, you update one field and every workflow they touch reflects the change.
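As a sketch, the loader is little more than a typed lookup. The field names below are a reduced, hypothetical subset of the context described above, and the in-memory dictionary stands in for whatever context store you use.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EntityContext:
    company_id: str
    name: str
    tone: str
    icp: str
    locale: str
    credential_ref: str   # a reference to a secret, never the secret itself
    notify_channel: str


# Hypothetical context store: one row per client.
CONTEXT_STORE = {
    "acme": {
        "name": "Acme Ltd",
        "tone": "dry, technical",
        "icp": "platform engineers",
        "locale": "en-GB",
        "credential_ref": "vault:acme/crm",
        "notify_channel": "#acme-ops",
    },
}


def load_entity_context(company_id: str, store: dict = CONTEXT_STORE) -> EntityContext:
    """Return the full operational context for one client, or fail loudly."""
    if company_id not in store:
        raise LookupError(f"no context row for company_id={company_id!r}")
    # A row with missing or extra fields raises TypeError here, at the entry point.
    return EntityContext(company_id=company_id, **store[company_id])


ctx = load_entity_context("acme")
```

Failing at the start of the parent workflow, before any model call, keeps a half-configured client from producing half-branded output.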

Pattern: the dead-letter queue and the self-healing monitor

Things fail. APIs return 503s. Models time out. Rate limits bite. The pattern that turns failures from incidents into noise is a two-tier recovery system:

Every external call carries a per-call retry policy: typically 2 retries with exponential backoff starting at 1 second.
If the retries exhaust, the failed payload is written to a dead-letter queue (a database table, in our case) with the original input and the failure reason.
A self-healing monitor runs on a schedule — every 15 minutes — picks up DLQ entries and retries them with a wider backoff window. After a configurable number of attempts it escalates.
An error pattern analyser runs weekly, looks for chronic failures (the same workflow failing the same way repeatedly) and surfaces them for engineering attention.

This pattern means transient failures self-resolve. Persistent failures get noticed. The signal-to-noise ratio on alerts goes from "useless" to "actionable".
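The second tier can be sketched with a single table. This is a simplified illustration using SQLite: the table layout, `dead_letter` and `heal_once` are hypothetical, the per-call retries are assumed to have already been exhausted, and the wider backoff window is assumed to come from the scheduler that invokes `heal_once`.

```python
import json
import sqlite3


def open_dlq(path: str = ":memory:") -> sqlite3.Connection:
    db = sqlite3.connect(path)
    db.execute(
        """CREATE TABLE IF NOT EXISTS dlq (
               id INTEGER PRIMARY KEY,
               workflow TEXT NOT NULL,
               payload TEXT NOT NULL,
               reason TEXT,
               attempts INTEGER NOT NULL DEFAULT 0,
               status TEXT NOT NULL DEFAULT 'pending')"""
    )
    return db


def dead_letter(db: sqlite3.Connection, workflow: str, payload: dict, reason: str) -> None:
    """Called once per-call retries are exhausted: keep the input and the reason."""
    db.execute(
        "INSERT INTO dlq (workflow, payload, reason) VALUES (?, ?, ?)",
        (workflow, json.dumps(payload), reason),
    )
    db.commit()


def heal_once(db: sqlite3.Connection, handlers: dict, max_attempts: int = 5) -> None:
    """One scheduled pass of the monitor: retry pending entries, escalate stale ones."""
    rows = db.execute(
        "SELECT id, workflow, payload, attempts FROM dlq WHERE status = 'pending'"
    ).fetchall()
    for row_id, workflow, payload, attempts in rows:
        try:
            handlers[workflow](json.loads(payload))
            status, reason = "resolved", None
        except Exception as exc:  # any failure counts as another attempt
            status = "escalated" if attempts + 1 >= max_attempts else "pending"
            reason = str(exc)
        db.execute(
            "UPDATE dlq SET attempts = attempts + 1, status = ?, "
            "reason = COALESCE(?, reason) WHERE id = ?",
            (status, reason, row_id),
        )
    db.commit()


# Usage: one entry recovers, one keeps failing and gets escalated.
def always_fails(payload: dict) -> None:
    raise RuntimeError("still down")


db = open_dlq()
dead_letter(db, "send_invoice", {"invoice_id": 7}, "503 Service Unavailable")
dead_letter(db, "sync_crm", {"contact": "a@example.com"}, "timeout")
handlers = {"send_invoice": lambda p: None, "sync_crm": always_fails}
heal_once(db, handlers, max_attempts=2)
heal_once(db, handlers, max_attempts=2)
statuses = dict(db.execute("SELECT workflow, status FROM dlq").fetchall())
```

A weekly analyser then only needs to group rows by workflow and reason to find the chronic failures.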

Observability is the price of entry

Without observability, none of the above is real. The minimum viable observability for an AI-agent workflow:

Per-execution metrics — workflow name, start time, end time, duration, success or failure, error category if relevant.
Per-AI-call metrics — model used, prompt tokens, completion tokens, calculated cost, latency.
Per-sub-workflow metrics — every sub-workflow logs its own start/end so you can trace bottlenecks.
Time-saved telemetry — for any workflow that replaces human work, log the estimated minutes saved per run.

All of this lands in a time-series database. We render dashboards that answer the questions that matter: which workflows cost the most this week; which workflows save the most operator time; which workflows are degrading in throughput; which AI providers are getting the lion's share of spend; what the error rate looks like.

Without these answers, you are flying blind. With them, you can iterate the system rationally.
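For the per-AI-call metrics, the record and the cost arithmetic might look like the sketch below. The model name and the prices are made up for the example; substitute your providers' real rates. `record_ai_call` and the `sink` callable are hypothetical, with the sink standing in for a write to the time-series database.

```python
from dataclasses import asdict, dataclass
from typing import Callable

# Hypothetical prices in USD per million tokens (assumption, not real rates).
PRICE_PER_MTOK = {
    "small-model": {"prompt": 0.50, "completion": 1.50},
}


@dataclass(frozen=True)
class AICallMetric:
    workflow: str
    model: str
    prompt_tokens: int
    completion_tokens: int
    latency_ms: float
    cost_usd: float


def record_ai_call(
    workflow: str,
    model: str,
    prompt_tokens: int,
    completion_tokens: int,
    latency_ms: float,
    sink: Callable[[dict], None],
) -> AICallMetric:
    price = PRICE_PER_MTOK[model]
    cost = (
        prompt_tokens * price["prompt"] + completion_tokens * price["completion"]
    ) / 1_000_000
    metric = AICallMetric(
        workflow, model, prompt_tokens, completion_tokens, latency_ms, round(cost, 6)
    )
    sink(asdict(metric))  # one flat row per call, ready for a dashboard query
    return metric


rows: list[dict] = []
metric = record_ai_call("blog_writer", "small-model", 2000, 500, 840.0, rows.append)
# 2000 * 0.50 + 500 * 1.50 = 1750, divided by 1,000,000 gives 0.00175 USD
```

Summing `cost_usd` grouped by `workflow` or `model` answers the spend questions directly.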

Pattern: the human-in-the-loop approval gate

Some workflows generate things that should not ship without a human looking at them — outbound client communication, public content, financial actions, anything irreversible. The pattern is:

The workflow produces the artifact and persists it.
It sends a notification — to a chat channel, an email or a messaging bot — containing a preview and inline action buttons (approve, reject, edit).
The workflow pauses. State is held externally — not in workflow memory — so the pause can survive engine restarts.
A callback handler workflow receives the human's decision and resumes the original flow with a status flag.
The original flow either ships, abandons, or routes to an edit loop.

The approval gate is what makes high-stakes AI workflows safe to deploy. It also turns out to be the pattern most teams skip and most teams later regret skipping.
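A compact sketch of the gate, with the paused state held in a database row rather than in workflow memory. The table, `request_approval` and `handle_decision` are hypothetical names; `notify` stands in for the chat, email or bot integration.

```python
import json
import sqlite3
import uuid
from typing import Callable

ROUTES = {"approve": "ship", "reject": "abandon", "edit": "edit_loop"}


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS approvals "
        "(id TEXT PRIMARY KEY, artifact TEXT NOT NULL, status TEXT NOT NULL)"
    )
    return db


def request_approval(db: sqlite3.Connection, artifact: dict,
                     notify: Callable[[dict], None]) -> str:
    """Persist the artifact, notify a human, and return the ID the callback will use."""
    approval_id = uuid.uuid4().hex
    db.execute(
        "INSERT INTO approvals (id, artifact, status) VALUES (?, ?, 'pending')",
        (approval_id, json.dumps(artifact)),
    )
    db.commit()
    notify({"approval_id": approval_id, "preview": artifact["body"][:80],
            "actions": sorted(ROUTES)})
    return approval_id  # the workflow now pauses; nothing is held in memory


def handle_decision(db: sqlite3.Connection, approval_id: str,
                    decision: str) -> tuple[str, dict]:
    """Callback handler: record the decision once and say where to route next."""
    if decision not in ROUTES:
        raise ValueError(f"unknown decision: {decision!r}")
    cur = db.execute(
        "UPDATE approvals SET status = ? WHERE id = ? AND status = 'pending'",
        (decision, approval_id),
    )
    db.commit()
    if cur.rowcount != 1:
        raise LookupError("approval not found or already decided")
    row = db.execute(
        "SELECT artifact FROM approvals WHERE id = ?", (approval_id,)
    ).fetchone()
    return ROUTES[decision], json.loads(row[0])


db = open_store()
sent: list[dict] = []
approval_id = request_approval(db, {"body": "Hello client, your report is ready."}, sent.append)
route, artifact = handle_decision(db, approval_id, "approve")
```

The `status = 'pending'` guard in the update means a double-clicked button cannot ship the same artifact twice.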

What does not work, despite the conference talks

Long-running autonomous agents for general tasks. Drift compounds. Costs spiral. Outcomes are non-deterministic. The exceptions are rare and bounded.
Open-ended tool-use loops with no exit criteria. The agent will happily loop until you cap the iteration count or run out of money.
Letting the LLM choose its own tools with no schema. Hallucinated tool calls are real. Validate the output shape against an allow-list before dispatching anything.
Putting business logic inside the prompt. Move the logic into deterministic code; let the LLM do judgement, classification or generation. Logic in prompts becomes untestable, unversioned, unauditable.
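The allow-list check for tool calls is small enough to show in full. A sketch, assuming the model emits a JSON object with `tool` and `args` keys; the two tools and their argument schemas are hypothetical.

```python
import json

# Allow-list: tool name mapped to the exact argument names and types it accepts.
TOOL_SCHEMAS: dict[str, dict[str, type]] = {
    "lookup_company": {"domain": str},
    "send_message": {"channel": str, "text": str},
}


def validate_tool_call(raw: str) -> tuple[str, dict]:
    """Parse a model-proposed tool call and reject anything off the allow-list."""
    call = json.loads(raw)  # non-JSON output raises a ValueError subclass
    if not isinstance(call, dict):
        raise ValueError("tool call must be a JSON object")
    name = call.get("tool")
    if not isinstance(name, str) or name not in TOOL_SCHEMAS:
        raise ValueError(f"tool not on allow-list: {name!r}")
    schema = TOOL_SCHEMAS[name]
    args = call.get("args")
    if not isinstance(args, dict) or set(args) != set(schema):
        raise ValueError(f"{name}: arguments must be exactly {sorted(schema)}")
    for key, expected in schema.items():
        if not isinstance(args[key], expected):
            raise ValueError(f"{name}: argument {key!r} must be {expected.__name__}")
    return name, args


name, args = validate_tool_call('{"tool": "lookup_company", "args": {"domain": "example.com"}}')
```

Only a call that passes this function reaches the dispatcher; a hallucinated tool name fails here instead of at an external API.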

The agent-flavoured patterns that survived contact with production all share a shape: small scope, strong scaffolding, explicit contracts, human-checked at the edges.

Pattern: the entity-injection contract for multi-tenant agents

Most production AI work in our setup serves multiple clients. Hardcoding any client-specific value — brand voice, ICP, locale, sheet IDs, credentials — is a trap; the workflow becomes a copy-paste exercise per client and the divergence between copies is the nightmare you'll inherit. The pattern that scales: every workflow takes a single company_id at the entry point, calls the entity-context loader, and consumes the returned context object for everything downstream.

The discipline this enforces is that prompts become templates. A blog-writing prompt doesn't say "write in WDM's voice"; it says "write in {{ entity.tone }} targeting {{ entity.icp }} for {{ entity.locale }}". The same workflow ships content for ten different clients. When a client's voice changes, you update one row in the context store and the next run reflects it. When you onboard a new client, you populate a row and they inherit the entire stack.

The non-obvious benefit is testability. A multi-tenant workflow with explicit context becomes a function: same input context, same output. You can run it against synthetic test contexts in CI to catch regressions. Hardcoded workflows can't be tested this way — every test case requires recreating a full client environment.
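The kind of CI check this enables can be sketched as follows. The tiny renderer only understands `{{ entity.field }}` placeholders and is an assumption for the example, as are the synthetic contexts; a real stack would use its engine's own templating.

```python
import re

TEMPLATE = (
    "Write in a {{ entity.tone }} voice targeting {{ entity.icp }} "
    "for {{ entity.locale }}."
)
_PLACEHOLDER = re.compile(r"\{\{\s*entity\.(\w+)\s*\}\}")


def render_prompt(template: str, entity: dict) -> str:
    """Fill entity placeholders; a missing field is an error, not an empty string."""
    def substitute(match: re.Match) -> str:
        key = match.group(1)
        if key not in entity:
            raise KeyError(f"context is missing field: {key}")
        return str(entity[key])

    return _PLACEHOLDER.sub(substitute, template)


# Synthetic test contexts: no real client environment required.
SYNTHETIC_CONTEXTS = [
    {"tone": "dry, technical", "icp": "platform engineers", "locale": "en-GB"},
    {"tone": "warm", "icp": "clinic owners", "locale": "de-DE"},
]

rendered = [render_prompt(TEMPLATE, ctx) for ctx in SYNTHETIC_CONTEXTS]

# Regression checks a CI job could run on every prompt change.
for prompt in rendered:
    assert "{{" not in prompt and "}}" not in prompt
```

The same loop extends naturally to running the whole workflow against each synthetic context with a stubbed model.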

How prompts evolve over time

The most under-appreciated part of running AI workflows in production is that prompts drift. The model improves, the use case shifts, the brand voice evolves, the edge cases pile up. A prompt that was correct six months ago will quietly produce slightly worse output today, and nobody will notice until a user complains.

The pattern that catches this is a prompt registry: every production prompt is versioned, hashed, and logged with each call. Every output is sampled into an evaluation set. On a regular cadence — monthly is comfortable — you compare the current prompt's outputs against an anchor set of expected behaviours, and you compare across model versions when models change. Anomalies surface as flagged rows.
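The versioning and hashing half of the registry can be sketched in a few lines. `PromptRegistry`, `PromptVersion` and `log_call` are hypothetical names, and the in-memory storage is an assumption; the evaluation cadence itself is not shown.

```python
import hashlib
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class PromptVersion:
    name: str
    version: int
    text: str
    sha256: str


class PromptRegistry:
    """In-memory registry: every distinct prompt text gets a new version number."""

    def __init__(self) -> None:
        self._versions: dict[str, list[PromptVersion]] = {}

    def register(self, name: str, text: str) -> PromptVersion:
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        history = self._versions.setdefault(name, [])
        if history and history[-1].sha256 == digest:
            return history[-1]  # unchanged text, no new version
        entry = PromptVersion(name, len(history) + 1, text, digest)
        history.append(entry)
        return entry

    def current(self, name: str) -> PromptVersion:
        return self._versions[name][-1]


def log_call(entry: PromptVersion, model: str, output: str,
             sink: Callable[[dict], None]) -> None:
    """Attach the prompt version and hash to every logged model call."""
    sink({"prompt": entry.name, "version": entry.version,
          "sha256": entry.sha256, "model": model, "output": output})


registry = PromptRegistry()
v1 = registry.register("blog_intro", "Write a two-sentence intro about {topic}.")
same = registry.register("blog_intro", "Write a two-sentence intro about {topic}.")
v2 = registry.register("blog_intro", "Write a three-sentence intro about {topic}.")

calls: list[dict] = []
log_call(registry.current("blog_intro"), "small-model", "An intro...", calls.append)
```

With the hash on every logged row, an evaluation run can tell exactly which prompt text produced which output, even after the prompt has changed.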

This is unglamorous work. It is also the difference between an AI workflow stack that improves over time and one that quietly degrades while looking healthy. The teams that maintain prompt evals are the ones whose AI quality keeps climbing. The teams that skip them get to relive their year-one mistakes in year three.

The headline framing of "AI agents" makes it sound like the architecture is the hard part. It isn't. The architecture is mostly the same patterns you'd use for any reliable distributed system: small composable units, strong contracts, explicit retries, observable everything. What's new is the call to a model in the middle, and the work to make that call reliable, cheap and auditable.

Build the small agents. Build the strong scaffolding. Build the observability. Don't build the autonomous everything-agent. The compounding comes from composition, not autonomy.
