The cost of agent failure modes

The popular framing of agent systems treats failure as a rare edge case — something the retry budget catches, something the validator rejects, something the human-in-the-loop notices before it ships. The operating reality is the opposite. In production agent traffic, failure is the median event, not the tail event. The interesting question is not whether agents fail; it is what each failure costs, and whether the recovery budget is honest.

I run agent infrastructure that processes tens of thousands of tasks a month across both routine and reasoning-heavy lanes. I have been forced to be honest about failure economics because the failures show up in the cost rollups, the latency distributions, and — most expensively — the trust budget with the people on the receiving end of bad output. This piece is the failure-mode taxonomy I wish I had been handed two years ago, with numbers attached and the parts that don't survive contact with reality flagged as such.

Why most teams underestimate failure cost by 5-10x

The standard accounting model for agent failure is the call-level retry: a model call returns malformed output, the wrapper retries, eventually succeeds, the second call's cost is added to the first, and the team congratulates itself on resilient design. This accounting captures a fifth of the actual cost at best.

The rest sits in places the call-level wrapper cannot see. Recovery work that happens out of band (a human picking up a botched draft, a downstream system rolling back a state change, a customer noticing a hallucinated number in a report) is real cost, but it does not appear on the inference bill. Latency degradation across the loop, where each retry adds round-trip time and the user-facing experience suffers, is real cost without a line item. Reputational drag, where bad output reduces trust in subsequent good output, is real cost over a long horizon. And there is a silent compounding cost: each failure that ships uncaught becomes a reason for the team to second-guess the system, which slows every future deployment.

When a team finally instruments end-to-end failure cost (counting human recovery time, downstream rollbacks, latency degradation, and trust erosion), the multiplier on the visible inference cost is consistently between 5x and 10x. Anyone budgeting agent operations off the inference bill alone is undershooting the real cost by up to an order of magnitude.

The five failure modes that dominate production traffic

After classifying failures across our agent loops for the better part of two years, the long tail collapses into five categories that account for the overwhelming majority of incidents. The shape of the distribution matters because the recovery posture that works for one mode is wrong for another.

Hallucinated tool arguments. The controller invokes a tool with arguments that look plausible but are wrong — a non-existent identifier, a misformatted date, a field name the model invented. The tool returns an error or, worse, succeeds with the wrong target. This is the single most common failure mode in production traffic.
Infinite recursion or oscillation. The controller enters a loop where each step undoes the last — retrieving the same chunk, calling the same tool with the same wrong arguments, generating the same plan. Without a hard step budget, this consumes tokens until something else breaks.
Retrieval misalignment. The retrieval layer returns documents that look topical to the query but answer a different question than the one being asked. The controller, trusting the retrieval, generates an answer that is internally consistent and externally wrong.
Context-window collapse. The conversation or task accumulates state until the context window is overwhelmed. The model silently drops earlier instructions, ignores critical constraints, or produces output that contradicts something stated three turns ago.
Schema drift on structured output. The model produces output that almost matches the expected schema but breaks on a single field — a missing required key, a stringified number, a malformed date. Downstream consumers reject it, the wrapper retries, and the cost compounds.

The chart below shows the rough share of total failure incidents attributable to each mode in our traffic over the last calendar quarter. The numbers will vary by workload, but the shape is consistent across teams I have compared notes with.

The recovery economics — replay versus human versus error budget

Each failure mode has a different right answer for recovery, and the cost of getting the matching wrong is significant. The three primary recovery postures are replay (re-run the failed segment with corrected inputs or a different lane), human fall-back (escalate to a person), and error-budget burn (accept the failure and move on, counting it against the SLO). The cost-per-incident for each posture varies by an order of magnitude depending on which mode triggered it.

Failure mode	Replay cost	Human fall-back cost	Error-budget cost	Right posture
Hallucinated tool args	$0.04	$8.00	$45.00	Replay with stricter schema
Schema drift	$0.02	$6.00	$28.00	Replay with grammar constraint
Retrieval misalignment	$0.18	$12.00	$80.00	Human review on low confidence
Infinite recursion	$0.00 (kill)	$15.00	$60.00	Hard step budget, then human
Context collapse	$0.30	$10.00	$50.00	Restart with summarised state

Two observations. First, the replay column is almost always the cheapest option when the failure is detectable, which means the engineering investment that pays back fastest is detection — getting the system to know it failed before the output ships. Second, the error-budget column is brutally expensive on retrieval misalignment, because the failure ships an internally-consistent but wrong answer, and the cost of that landing in front of a customer or auditor is not bounded by the inference cost. This mode is the one that justifies real spend on confidence calibration and human review gates.
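
To make the first observation concrete, here is a minimal sketch of how detection rate drives expected cost per incident. The per-incident figures are copied from the table above; the model itself is a simplifying assumption (every detected failure is replayed, every undetected failure ships and incurs the error-budget cost), and the names `Mode`, `COSTS`, and `expected_cost` are illustrative rather than part of any real stack.

```python
from enum import Enum


class Mode(Enum):
    HALLUCINATED_TOOL_ARGS = "hallucinated_tool_args"
    SCHEMA_DRIFT = "schema_drift"
    RETRIEVAL_MISALIGNMENT = "retrieval_misalignment"
    INFINITE_RECURSION = "infinite_recursion"
    CONTEXT_COLLAPSE = "context_collapse"


# Per-incident cost in USD, from the table above:
# (replay, human fall-back, error-budget burn).
COSTS = {
    Mode.HALLUCINATED_TOOL_ARGS: (0.04, 8.00, 45.00),
    Mode.SCHEMA_DRIFT: (0.02, 6.00, 28.00),
    Mode.RETRIEVAL_MISALIGNMENT: (0.18, 12.00, 80.00),
    Mode.INFINITE_RECURSION: (0.00, 15.00, 60.00),
    Mode.CONTEXT_COLLAPSE: (0.30, 10.00, 50.00),
}


def expected_cost(mode, detection_rate):
    """Expected cost per incident under the simplified model.

    Assumption: detected failures are replayed; undetected failures ship
    and are charged at the error-budget cost.
    """
    replay, _human, shipped = COSTS[mode]
    return detection_rate * replay + (1.0 - detection_rate) * shipped
```

Under this simplified model, raising the detection rate on retrieval misalignment from 0.5 to 0.9 moves the expected cost per incident from about $40.09 to about $8.16, which is the sense in which detection pays back first.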

The detection problem

Replay is only cheap when the system knows the call failed. The hard cases are the ones where the output looks fine. A hallucinated tool argument that the tool happened to accept; a retrieval result that returned topical-looking but wrong documents; a structured output that conformed to the schema but contained fabricated values. None of these trip an exception. They produce output that ships.

The detection disciplines that compound: schema validation with semantic constraints (not just type checks); confidence scoring on retrieval, with thresholds below which results trigger human review; cross-checking generated values against an authoritative source whenever one exists; and, most underrated, a second model pass over the output asking specifically: "was this generated from the retrieved documents, or was any of it invented?" The second-pass discipline catches a meaningful share of hallucinations at a cost that is small relative to the cost of letting them ship.
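
As a rough illustration of the difference between a type check and a semantic constraint, the sketch below validates arguments for a hypothetical tool call. The field names `account_id` and `as_of`, the `known_ids` set standing in for an authoritative source, and the function name are all assumptions made for the example.

```python
from datetime import date


def validate_tool_args(args, known_ids, today):
    """Return a list of problems; an empty list means the arguments passed."""
    problems = []

    # Type-level check: what a plain schema validator would already catch.
    account_id = args.get("account_id")
    if not isinstance(account_id, str):
        problems.append("account_id must be a string")
    # Semantic check: the identifier must exist in the authoritative source.
    elif account_id not in known_ids:
        problems.append(f"unknown account_id: {account_id}")

    # Semantic check: the date must parse and must not lie in the future.
    try:
        as_of = date.fromisoformat(args.get("as_of", ""))
    except (TypeError, ValueError):
        problems.append("as_of must be an ISO date (YYYY-MM-DD)")
    else:
        if as_of > today:
            problems.append("as_of is in the future")

    return problems
```

A plausible-looking but invented identifier passes the type check and fails the membership check, which is the case a type-only validator lets through.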

None of this is exotic. All of it is engineering work that earns its return only when the underlying failure mode is correctly understood. Teams that try to solve the detection problem with a single guardrail layer end up with detection that catches the easy cases and misses the expensive ones. Teams that build mode-specific detection — different validators for different failure shapes — catch the failures that actually cost money.

The hard step budget — the cheapest piece of insurance

Of all the failure modes, infinite recursion is the most embarrassing to discover in a cost report and the cheapest to prevent. A controller that has no hard step budget will, given a confusing enough task, keep stepping until the orchestration layer kills it for some other reason — usually a timeout, occasionally a token quota, sometimes a memory exhaustion event. By that point the cost is already on the bill.

The fix is a step budget per task — explicit, low, and enforced. Most production agent tasks should complete in fewer than ten reasoning steps. A task that has taken twenty steps without converging is almost certainly stuck; another twenty will not unstick it. The step budget should escalate to human review when exceeded, not silently retry. The engineering effort to add this is small. The cost it prevents on the failure tail is large. I have not encountered an agent stack that did not benefit from a more aggressive step budget than the one it shipped with.
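
A hard step budget can be as small as the sketch below. It assumes a hypothetical `step_fn` callable that takes the task state and returns `(new_state, done)`; the default of ten steps echoes the rule of thumb above and is not a recommendation for any particular workload.

```python
class StepBudgetExceeded(Exception):
    """Raised when a task fails to converge within its step budget."""


def run_with_step_budget(step_fn, state, max_steps=10):
    """Call step_fn until it reports done, or raise after max_steps.

    step_fn is a hypothetical callable: state -> (new_state, done).
    """
    for _ in range(max_steps):
        state, done = step_fn(state)
        if done:
            return state
    # No silent retry: the caller is expected to route this to human review.
    raise StepBudgetExceeded(f"no convergence after {max_steps} steps")
```

Raising an exception rather than returning a partial result forces the caller to decide what escalation means, so a stuck task cannot quietly fall back into a retry loop.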

Why retrieval misalignment is the most expensive mode

If hallucinated tool arguments are the most common failure and infinite recursion is the most embarrassing, retrieval misalignment is the most expensive in cost-per-incident terms. The mechanism: the retrieval layer returns documents that match the query lexically or topically but do not actually answer the question being asked. The controller, having no way to know the retrieval was off-target, generates an answer that is well-structured and confidently wrong. The output ships. The downstream consumer trusts it because the system has been right before. The error is discovered later, often by a human who notices something does not match a primary source.

The cost compounds because misaligned retrieval is hard to attribute. The output looks like a hallucination — the model said something that is not true — but the model was operating correctly given the documents it received. The fault is at the retrieval layer, which means the fix is at the retrieval layer, which is where teams typically have the least observability and the least mature evaluation harness.

The disciplines that move the needle: hybrid retrieval (dense plus sparse, so semantic and lexical signals both contribute); chunk-level metadata that lets the retrieval layer filter before it ranks; explicit relevance scores surfaced to the controller so it can refuse to answer when confidence is low; and a citation pipeline that ties every claim to the chunk that supports it, with an automated check that flags claims unsupported by any retrieved chunk. None of these are silver bullets. Together they reduce the misalignment incidents materially.
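
Two of these disciplines, surfacing relevance scores so the controller can refuse and flagging claims that no retrieved chunk supports, can be sketched in a few lines. This is an illustration only: the function names and thresholds are assumptions, and plain word overlap is a crude stand-in for whatever entailment or citation check a real pipeline would use.

```python
def gate_on_relevance(chunks, min_score):
    """Keep chunks whose relevance score clears the threshold.

    chunks is a list of (text, score) pairs. An empty result signals that
    the controller should refuse or escalate rather than generate.
    """
    return [text for text, score in chunks if score >= min_score]


def unsupported_claims(claims, cited_chunks, min_overlap=0.5):
    """Flag claims whose words are mostly absent from every cited chunk."""
    flagged = []
    for claim in claims:
        words = set(claim.lower().split())
        if not words:
            continue  # nothing to check in an empty claim
        # Best fraction of the claim's words found in any single chunk.
        best = max(
            (len(words & set(c.lower().split())) / len(words) for c in cited_chunks),
            default=0.0,
        )
        if best < min_overlap:
            flagged.append(claim)
    return flagged
```

When the gate returns nothing, every claim is flagged by construction, which is the desired behaviour: an answer generated without supporting chunks should not ship unreviewed.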

The architecture that absorbs failure gracefully

The agent architectures I see surviving in production for two years or more share a few invariants in their failure handling.

Failure is a first-class signal. Failure events emit telemetry on the same observability bus as success events, with mode classification baked in.
Step budgets are hard, not advisory. A task that exceeds the budget escalates to human review, never silently retries indefinitely.
Validators are mode-specific. Different validators for hallucinated args, schema drift, retrieval misalignment, and context collapse. A single guardrail does not catch all four.
Recovery posture is matched to mode. Replay for cheap-to-detect failures, human fall-back for high-stakes failures, error-budget burn only for low-stakes failures with good rollback.
The cost rollup includes recovery. Per-task cost accounting includes retries, validators, human review time, and downstream rollback. The single-call inference cost is one input to the cost model, not the whole bill.
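
As a sketch of the first invariant, the record below carries the same fields for successes and failures, with the failure mode and recovery posture attached when relevant. The schema, the field names, and the JSON-lines encoding are assumptions for illustration, not a description of any particular observability stack.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class TaskEvent:
    task_id: str
    outcome: str                 # "success" or "failure"
    failure_mode: Optional[str]  # e.g. "schema_drift"; None on success
    recovery: Optional[str]      # "replay", "human", or "error_budget"
    steps_used: int
    cost_usd: float


def emit(event, sink):
    """Write the event as one JSON line; sink is any object with write()."""
    sink.write(json.dumps(asdict(event)) + "\n")
```

Because success and failure share one record type, failure rates and per-mode costs fall out of the same query as throughput, instead of living in a separate error log.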

None of this is exotic. All of it is invisible to end users and load-bearing for the team running the system. The architectures that get this right end up with bills that are predictable and failure rates that decline over time as the detection layer improves. The architectures that do not get this right end up explaining surprise bills to finance teams and surprise outputs to legal teams, both of which are conversations that destroy trust faster than any model improvement can rebuild it.

The bottom line on failure budgeting

The honest budgeting model for an agent stack is not inference cost plus retries. It is inference cost plus retries plus validator overhead plus human review time plus downstream rollback cost plus the trust impact of failures that ship. Most teams budget the first two and absorb the rest as background noise. The result is the consistent 5-10x undershoot in operational cost that I see across teams I compare notes with.
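
The budgeting model above translates into a small per-task rollup. The sketch below is a hypothetical structure, with illustrative field names; trust impact is left out because it has no obvious per-task price, so the multiplier it reports is a lower bound on the real one.

```python
from dataclasses import dataclass


@dataclass
class TaskCost:
    """Per-task cost components in USD."""
    inference: float = 0.0
    retries: float = 0.0
    validators: float = 0.0
    human_review: float = 0.0
    rollback: float = 0.0

    def visible(self):
        """What the inference bill shows: first calls plus retries."""
        return self.inference + self.retries

    def total(self):
        """End-to-end cost, excluding unpriced trust impact."""
        return self.visible() + self.validators + self.human_review + self.rollback

    def multiplier(self):
        """Ratio of end-to-end cost to visible cost; None if nothing was billed."""
        if self.visible() == 0:
            return None
        return self.total() / self.visible()
```

Summing `total()` and `visible()` across tasks, grouped by failure mode, gives the comparison a finance function can act on.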

The discipline that fixes this is end-to-end accounting at the task level, with failure modes classified and recovery costs attributed. Once the data is in front of a finance function, the engineering priorities reorder themselves automatically — detection investment pays back fastest, followed by mode-specific validators, followed by step-budget enforcement. Prompt engineering, the activity teams default to when something is broken, is rarely the right intervention.

Failure is not a tail event in production agent traffic; it is the median event, and the unit economics of recovery dominate the unit economics of the happy path. Teams that have not classified their failure modes and matched recovery postures to them are running on luck and a budget that is structurally short. Teams that have done the work end up with cost predictability, declining failure rates, and the ability to scale the stack without scaling the firefighting.

The right first investment in any agent stack is the failure-mode telemetry. Without it, prompt engineering is theatre and retry budgets are wishful thinking. With it, the architecture improves continuously because every failure becomes information, and information compounds where guesses do not. The teams winning at agent operations in 2026 are not the ones with the best models; they are the ones with the best instrumentation around the moments their models are wrong.
