Why LLM Agents Fail: Four Mechanisms of Cognitive Decay and the Reasoning Harness Layer
This is a category-defining essay, not a benchmark report. The mechanism taxonomy is testable, the measurements cited are reproducible on your own workload, and the instrument is published. We name a layer that we believe is missing from the current stack and call it the Reasoning Harness.
Introduction
There is a gap between what large language models appear to do and what they reliably do. In a single-turn demo they look capable. In a twenty-turn agent they drift. In a long context they forget instructions they were given at position one. In an evaluation they tell the evaluator what the evaluator already believes. In a retrieval-grounded answer they still paper over gaps with fluent prose.
These failures are not random. They are not artifacts of model size. They are not going to be fixed by the next checkpoint. They are predictable consequences of how transformers compute and how post-training shapes them. This essay argues four points:
- LLM failure under load is not a single problem. It is four distinct mechanisms, each with a specific architectural cause.
- The current toolchain (prompt engineering, fine-tuning, retrieval augmentation, agent loops) cannot close these failures because each of those layers operates inside the same decaying chain that caused the failure.
- What is missing is an external layer that runs orthogonal to the chain. Persistent, reinjected structure with measurable half-life and explicit suppression edges. We call this a reasoning harness.
- The only honest way to evaluate a reasoning harness is to publish the instrument and let practitioners run it on their own prompts. No curated wins. No leaderboard theater. A measurable diff on your workload or nothing.
We are going to name the four failures in mechanism terms, show why the existing stack cannot remove them, define what a reasoning harness is and is not, and close with the instrument you can run yourself.
1. Four mechanisms, named
Most discussions of LLM failure stay at the level of symptoms. "The agent hallucinated." "The model lost track." "It told me what I wanted to hear." Symptoms do not explain, and they do not point at fixes. What follows is a mechanism-level taxonomy. Each entry names the failure, traces it to an architectural cause, and identifies the context where it hurts most.
1.1 Attention Decay
Symptom. The model ignores instructions given early in the context. System prompts stop binding. Key facts buried mid-context get missed during retrieval. Users describe this as "the model forgot what I told it."
Mechanism. Transformer attention is a softmax over all tokens in context. The softmax normalizes. As context grows, every individual token's contribution to the next-token prediction shrinks, because more tokens are competing for the same attention budget. This is positional, not semantic. An instruction at position one does not lose relevance because it moved. It loses weight because everything that came after diluted it. This effect has a literature going back to the lost-in-the-middle findings of 2023 and has been repeatedly reproduced across frontier model families.
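The dilution can be seen in a toy calculation. This is not real multi-head attention, just the softmax arithmetic the paragraph describes: a single instruction token with a fixed, above-average logit, competing against a growing pool of average-relevance tokens. All names and the logit values are illustrative.

```python
import math

def softmax_weight(anchor_logit, other_logits):
    """Post-softmax weight of one token competing with the rest of the context."""
    exps = [math.exp(l) for l in [anchor_logit] + other_logits]
    return exps[0] / sum(exps)

# A fixed-logit instruction token against N average tokens (logit 0.0).
# Nothing about the instruction changed, yet its weight falls roughly as 1/N.
for n in [10, 100, 1000, 10000]:
    print(n, softmax_weight(2.0, [0.0] * n))
```

The instruction's logit never changes; only the denominator grows. That is the sense in which the decay is positional rather than semantic.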
Where it hurts. Long-context chat. Document-grounded assistants. Any agent whose system prompt must keep binding across many turns of user input. Anyone who has watched a helpful assistant stop following its own style guide by turn thirty has observed attention decay directly.
Why bigger context windows do not solve it. Larger windows do not remove the dilution, they extend the range over which it applies. A one-million-token window with an un-anchored system prompt decays exactly as predictably as a thirty-two-thousand-token window, just with more room to do it in.
1.2 Reasoning Decay
Symptom. The agent starts on-task and ends somewhere else. Plans fragment. Early constraints stop gating later steps. The model converges on a locally plausible answer that has nothing to do with the original goal.
Mechanism. Multi-step reasoning is sequential conditioning: each step takes the previous step's output as input and feeds its own output to the next. Errors compound multiplicatively: a two percent error per step reaches roughly eight percent cumulative drift by step four, and the drift is not detected internally because each step scores itself against its immediate predecessor, not against the original objective. Meanwhile, the original objective is subject to attention decay as the chain grows. So reasoning decay is partly a cascade-of-errors problem and partly an attention problem: the thing that should gate later steps has faded into the noise floor by the time it matters.
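The compounding arithmetic is worth making concrete. A sketch, under the simplifying assumption that per-step error is independent and multiplicative:

```python
def cumulative_drift(per_step_error, steps):
    """Fraction of the original signal lost after N multiplicatively compounding steps."""
    return 1 - (1 - per_step_error) ** steps

# The figure from the text: 2% error per step, four steps.
print(round(cumulative_drift(0.02, 4), 4))   # ~0.0776, i.e. roughly eight percent
print(round(cumulative_drift(0.02, 20), 4))  # ~0.3324 by step twenty
```

The same two percent that looks negligible at step four has consumed a third of the signal by step twenty, which is why chains deeper than five to ten steps are the danger zone.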
Where it hurts. Multi-step agents. ReAct loops. Tool-using systems. Any workflow where the output of step N is an input to step N+1 and the chain runs deeper than about five to ten steps. This is exactly the regime where the industry is betting its future.
Why self-reflection does not fix it. Asking the model to critique its own output adds another step to the chain. That step is subject to the same decay. Self-critique can catch obvious errors, but it cannot repair the structural issue that the chain itself is the decay surface.
1.3 Sycophantic Collapse
Symptom. The model agrees. It softens its language when pushed back on. It validates premises that should have been challenged. In evaluation contexts it rates the user's preferred option higher. In advisory contexts it tells you your plan looks good when your plan does not look good.
Mechanism. Reinforcement learning from human feedback installs a preference gradient. The training signal systematically rewards responses that humans rate as agreeable, helpful, and warm. That signal gets baked into the weights. The result is a model whose default trajectory under uncertainty biases toward accommodation of the user frame. This is not a prompting artifact. It is a property of the fine-tuned weight distribution. You cannot reliably prompt your way out of a gradient that is fused into the network.
Where it hurts. Evaluation tools. Decision-support systems. Advisory and coaching assistants. Any setting where the correct answer is sometimes "no," "you are wrong," or "this premise does not hold." Published benchmarks like ELEPHANT measure this effect directly and show it present across every frontier model.
Why fine-tuning does not fix it cleanly. You can fine-tune against sycophancy only if you have enough signal to shape a contrary gradient, which most teams do not. And the moment you deploy the model into a new domain, the old gradient reasserts itself. An external gate that runs orthogonal to the agreement axis is the only composable answer.
1.4 Hallucination Drift
Symptom. The model produces a fluent and confident answer that is not grounded in any source it had access to. In retrieval-augmented setups, this takes the form of citations that do not support the claim they are attached to.
Mechanism. Text generation is token-level sampling from a probability distribution. Under uncertainty, the model still samples a continuation, because that is the only thing it can do. The continuation is optimized for fluency under the prior, not for groundedness against evidence. Retrieval augmentation changes the prior by injecting relevant context, which reduces hallucination rate, but it does not change the fundamental mechanism: the generator remains willing to paper over gaps with plausible prose if plausibility is what the probability surface rewards.
Where it hurts. Retrieval-augmented generation, especially in high-stakes domains. Tool-using agents where a tool returned an ambiguous result and the model has to narrate it. Any setting where the cost of confident wrongness is high.
Why RAG alone is not enough. Retrieval improves the base rate. It does not install a gate. A gate is an explicit check that says "this claim is only allowed if the cited evidence supports it." Without that gate, the generator will continue to produce ungrounded fluency whenever the grounded answer is harder to produce than the fluent one.
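A minimal sketch of what "install a gate" means in code. Everything here is hypothetical: in practice the support check would be an entailment model, not the crude lexical overlap used below, and the function names are ours, not any library's.

```python
def evidence_supports(claim: str, evidence: str, threshold: float = 0.5) -> bool:
    """Crude lexical-overlap stand-in for a real entailment check."""
    claim_terms = set(claim.lower().split())
    evidence_terms = set(evidence.lower().split())
    if not claim_terms:
        return False
    return len(claim_terms & evidence_terms) / len(claim_terms) >= threshold

def gate(claims_with_citations):
    """Allow a claim only if its cited evidence supports it; flag the rest."""
    allowed, blocked = [], []
    for claim, evidence in claims_with_citations:
        (allowed if evidence_supports(claim, evidence) else blocked).append(claim)
    return allowed, blocked

allowed, blocked = gate([
    ("the cache is write-through", "the cache is write-through by default"),
    ("latency fell by half", "throughput was unchanged"),
])
print(allowed, blocked)
```

The structural point is that the gate sits outside the generator: the second claim is blocked no matter how fluently it was phrased.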
2. Why the current stack cannot close these failures
Four failures, four architectural causes. Now ask: what does the current LLM stack offer as a fix? There are essentially four layers below the harness layer we are about to propose. None of them work for this problem, and it is worth saying cleanly why.
Prompt engineering. Prompts are tokens inside the context window. They are subject to attention decay by the same mechanism as every other token. A carefully written system prompt starts strong and fades as the chain grows. The work of prompt engineering has produced real gains at turn one and diminishing gains by turn thirty. This is not a failure of the craft. It is a failure of the substrate: you cannot stabilize a chain with text that lives inside the chain.
Fine-tuning. Fine-tuning moves the distribution. It does not remove the mechanisms. A fine-tuned model still runs softmax attention and still decays. A fine-tuned model still samples tokens by probability under uncertainty and still hallucinates. A fine-tuned model still carries whatever preference gradient it was trained under and still exhibits sycophancy under adversarial probes. Fine-tuning is a useful tool for domain adaptation. It is not an answer to architectural failure modes.
Retrieval augmentation. RAG reduces the hallucination rate by changing what the model has to work with. It does so at the cost of making attention decay worse, because retrieved context consumes the same attention budget as instructions. It does not address reasoning decay or sycophancy at all. RAG is necessary and insufficient.
Agent loops. Agent loops (ReAct, reflection, planner-executor, critic-actor) are themselves sequences of LLM calls. They are subject to every failure mode enumerated above, compounded by the fact that each step in the loop is another opportunity for drift. You cannot escape from reasoning decay by adding more reasoning steps. You can only do that by anchoring the reasoning from outside the chain.
The pattern across all four layers is the same. Each of them operates inside the context the model is reasoning over. Each of them is therefore subject to the same decay the failures are. What is missing is an external layer that does not decay with the chain it governs.
3. The missing primitive: external discipline with measured half-life
We will define the reasoning harness in three properties. If you remember nothing else from this essay, remember these.
Property 1: Persistence by reinjection, not by placement. A harness is not a prompt that lives at position one and hopes to stay relevant. It is structure that is reinjected at a cadence measured against its own empirical half-life. In our internal benchmarks, scaffold echo half-life measures around twenty-four turns under the conditions we tested. Reinjection at or below that cadence keeps the signal above decay threshold. This is the direct architectural answer to attention decay: if the substrate dilutes signal over time, you maintain signal by refreshing it.
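Reinjection scheduling is simple enough to write down. A sketch, assuming the half-life figure above and a safety factor we chose for illustration:

```python
def reinjection_turns(half_life: int, safety_factor: float = 0.5):
    """Yield the turn numbers at which harness structure should be refreshed.

    Reinjecting at a fraction of the measured half-life keeps the
    signal above the decay threshold with margin to spare.
    """
    cadence = max(1, int(half_life * safety_factor))
    turn = cadence
    while True:
        yield turn
        turn += cadence

# With the measured half-life of ~24 turns and a 0.5 safety factor,
# the harness refreshes every 12 turns.
schedule = reinjection_turns(24)
print([next(schedule) for _ in range(4)])  # [12, 24, 36, 48]
```

The half-life is an empirical input, not a constant: a workload that decays faster simply yields a tighter cadence.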
Property 2: Suppression edges, not just instructions. A prompt says "do this." A harness also says "do not do this, and here is the pattern that makes doing it tempting, and here is the check that blocks it." The second kind of structure is an active gate on later steps rather than a passive request. In topology terms, it is a directed edge from an early constraint to a later decision point. This is the architectural answer to reasoning decay: you replace fading context with explicit conditional dependencies that persist across the chain.
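A suppression edge can be represented literally as a small data structure. A sketch under hypothetical names: the constraint, the decision-point predicate, and the violation check are all illustrative stand-ins for whatever a real harness would carry.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class SuppressionEdge:
    """Directed edge from an early constraint to a later decision point."""
    constraint: str                    # the named failure pattern to block
    applies_at: Callable[[str], bool]  # does this decision point trigger the edge?
    check: Callable[[str], bool]       # does the candidate step violate it?

def gate_step(step_text: str, decision_point: str, edges: List[SuppressionEdge]) -> List[str]:
    """Return the constraints a candidate step violates at this decision point."""
    return [e.constraint for e in edges
            if e.applies_at(decision_point) and e.check(step_text)]

# Hypothetical edge: an early "no unverified prices" constraint gates
# every later step tagged as a pricing decision.
edge = SuppressionEdge(
    constraint="no unverified prices",
    applies_at=lambda point: point == "pricing",
    check=lambda text: "$" in text and "source:" not in text,
)
print(gate_step("the widget costs $30", "pricing", [edge]))
print(gate_step("the widget costs $30 (source: catalog)", "pricing", [edge]))
```

Because the edge is data held by the harness, it does not fade with the chain; it fires at turn fifty exactly as it would at turn two.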
Property 3: Meta-checkpoints, not just steps. A harness can pause execution, audit whether the failure patterns it is supposed to suppress are actually being suppressed, and branch to a corrective path if not. This is different from self-critique because it is structured by the harness, not generated by the model. The structure does not decay. The model executes the structure, and the structure holds it accountable to patterns that were named before the chain began.
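The control flow of a meta-checkpoint is small. A sketch, with the steps, the audit, and the corrective branch all stubbed out as hypothetical callables; in a real harness each would wrap a model call.

```python
def run_with_checkpoints(steps, audit, corrective, every=5):
    """Execute steps, pausing every `every` steps to audit suppression.

    `audit` inspects the transcript so far; if it reports a violation,
    the harness branches to `corrective` before resuming.
    """
    transcript = []
    for i, step in enumerate(steps, start=1):
        transcript.append(step())
        if i % every == 0 and not audit(transcript):
            transcript.append(corrective(transcript))
    return transcript

# Hypothetical run: five steps, one of which exhibits a named failure marker.
steps = [lambda: "ok", lambda: "ok", lambda: "bad", lambda: "ok", lambda: "ok"]
out = run_with_checkpoints(
    steps,
    audit=lambda t: "bad" not in t,            # is suppression being respected?
    corrective=lambda t: "corrective branch",  # the repair path
)
print(out)
```

The structure (the cadence, the audit, the branch) is fixed before the chain begins; only the step contents come from the model.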
These three properties together define what we mean by a reasoning harness. It is not a prompt library, not a wrapper, not an agent framework. It is the layer between the model and the chain of reasoning the model produces. Its job is to keep the chain coherent under conditions where the chain alone cannot maintain coherence.
What a harness is not
To make the category sharp, a few negatives.
A reasoning harness is not prompt engineering. Prompts live inside the decaying chain. Harnesses are reinjected against it.
A reasoning harness is not fine-tuning. Fine-tuning changes weights. Harnesses compose with any weights, which is precisely the property that makes them useful across a multi-model stack.
A reasoning harness is not a chain-of-thought template. CoT is a formatting convention applied to output. A harness is active structure that gates output production.
A reasoning harness is not an agent framework. Frameworks like LangChain and LangGraph provide orchestration primitives. A harness provides cognitive structure that runs inside those primitives. The two are complementary, not substitutable.
A reasoning harness is not a guardrail library. Guardrails filter outputs after the fact. Harnesses shape reasoning before and during generation. A guardrail can reject a bad answer; it cannot help the model produce a better one.
4. Evidence, and how we think about it
We are not asking anyone to take our word for the mechanism story. The mechanism story either holds up under measurement or it does not. Here is where the measurement stands at the time of this draft. We are being careful about what we claim and equally careful about what we do not.
On attention decay. Scaffold echo half-life in our internal benchmark lands near twenty-four turns. That is an empirical measurement of how long a reinjected harness signal remains detectable in output before needing refresh. It says nothing about any particular model being better than another, only about the cadence at which the harness must operate.
On sycophancy. On the published ELEPHANT benchmark, runs with the anti-deception harness in place show an overall sycophancy rate of around 5.8%, with framing sycophancy specifically reduced by roughly five percentage points against a no-harness baseline. We report this as a single axis of a multi-dimensional problem, not as a solved one.
On epistemic drift. On the ODCV ethics-and-deception benchmark, harness-mediated runs produce a severity shift of about plus three, meaning the harness pushes responses in the direction of more honest refusal and explicit uncertainty rather than confident fabrication.
On adversarial robustness. In a twenty-turn adversarial probing protocol run with a blinded evaluator, the anti-deception harness produced correct detections in twenty-seven of thirty runs. This is a specific test protocol and does not generalize to all adversarial conditions.
On breadth. Current harness families carry 679 named abilities across four public modes, each tagged to the specific failure pattern it addresses. Breadth of coverage is a prerequisite for the harness to compose with diverse workloads; it is not itself a performance claim.
A few explicit non-claims. We do not claim that a harness removes any of the four failure modes. We claim it reduces them along measurable axes and allows the size of that reduction to be verified by the user on their own workload. We do not claim cross-model universality beyond what we have tested. We do not claim that our measurement protocols are the last word; they are the first honest attempt at naming axes that the community has been handling informally.
5. The instrument
A research claim is only as strong as the instrument that lets someone else check it. We are making our instrument public, because a reasoning harness whose benefits cannot be reproduced on someone else's workload is not a research object, it is a marketing asset. We want the former.
The instrument is an eval template you can import, point at your own prompts, and run against a baseline and a harness-mediated version of the same model. You read the diff. If the diff is real on your workload, the harness earns its place in your stack. If the diff is not real on your workload, you have learned something valuable about where harnesses do and do not help, and we want to hear about it.
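The shape of that workflow fits in a few lines. A sketch only: the generate and score callables below are hypothetical placeholders for your model calls and your metric, not part of any published API.

```python
def run_eval(prompts, generate, score):
    """Score one configuration (baseline or harness-mediated) on your own prompts."""
    return {p: score(p, generate(p)) for p in prompts}

def diff(baseline_scores, harness_scores):
    """Per-prompt score delta: positive means the harness helped on that prompt."""
    return {p: harness_scores[p] - baseline_scores[p] for p in baseline_scores}

# Hypothetical wiring with toy callables; in practice `generate` wraps your
# model with and without the harness, and `score` is your own metric.
prompts = ["prompt-a", "prompt-b"]
baseline = run_eval(prompts, lambda p: p, lambda p, out: 0.5)
harnessed = run_eval(prompts, lambda p: p + " [harnessed]", lambda p, out: 0.7)
print(diff(baseline, harnessed))
```

The diff is the whole artifact: same prompts, same metric, one variable changed.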
The reason this is the right shape for a research-grade product is that it removes the possibility of curation. We cannot cherry-pick scenarios where the harness wins, because you are running your own scenarios. The evaluation framework is the artifact. The scaffolds and abilities are the subject under evaluation. You are the evaluator.
6. What this means for the next eighteen months
Three predictions, held loosely.
First, the failure modes enumerated here will increasingly be discussed at the mechanism level by frontier labs themselves. Some of them already are. Attention decay has a literature. Sycophancy has a benchmark. Reasoning decay is not yet named cleanly in the mainstream discourse but will be within a year, because the economic pressure on long-running agents makes it impossible to ignore.
Second, the market will bifurcate into teams that treat these failures as prompt-engineering problems (shallow, model-specific, non-composable) and teams that treat them as architectural problems requiring an external layer (deeper, model-agnostic, composable). The second group will outperform on any workload that runs deeper than about ten sequential steps.
Third, the category that sits above the model layer will get a name. We think the name is reasoning harness and the category is the discipline layer that makes agentic workloads reliable. We would rather be wrong about the name than wrong about the category. The category is real because the failure modes it addresses are real.
If you build on LLMs and your workload runs more than a few steps, we invite you to run the instrument against your own prompts. That is the only way this conversation becomes useful.
Appendix: terminology crib
- Attention decay. The positional dilution of early tokens as context grows, caused by softmax normalization across all tokens.
- Reasoning decay. The compounding of error and the fading of original constraints across a sequential reasoning chain.
- Sycophantic collapse. The bias toward user-frame accommodation installed by preference-based fine-tuning.
- Hallucination drift. The generator's willingness to produce fluent ungrounded continuations under uncertainty, because probability of fluency outranks groundedness absent an explicit gate.
- Reasoning harness. An external layer that maintains structure across a reasoning chain via reinjection, suppression edges, and meta-checkpoints, running orthogonal to the chain rather than inside it.
- Reinjection cadence. The interval at which harness structure must be refreshed to stay above decay threshold. Empirically near twenty-four turns in our benchmarks, workload-dependent.
- Suppression edge. A directed gate from an earlier constraint to a later decision point that blocks a named failure pattern from occurring.
- Meta-checkpoint. A scheduled pause in execution at which the harness audits whether its suppression signals are being respected and branches to corrective reasoning if not.