
The Cognitive Scaffolding Thesis


This is a working thesis, not a validated finding. The supporting evidence is consistent and the model is testable. We publish it because its practical implication, that value compounds with task length, changes how you deploy.


The Observation

On short tasks, RA²R injection barely helps. On long tasks, the difference between injected and uninjected agents grows dramatically. This pattern appeared in every benchmark we've run:

  • Beyond-Reasoning (single-turn tasks): 4 of 7 behavioral signals hit ceiling. Baseline was already near-perfect on short, focused evaluations. Injection had almost nothing to improve.
  • MuSR (multi-paragraph narratives): +16.3pp improvement, more than double the +7.1pp average across all task types. These are the longest, most complex tasks in our external benchmark.
  • EjBench extreme tasks: Compound multi-ability injection dominated on the hardest, most multi-step tasks (0.785 composite vs 0.769).

The pattern is consistent: the longer the reasoning chain, the larger the scaffold's effect.

Why?


The Hypothesis

RA²R abilities function as persistent attention anchors in the transformer's context window.

When a scaffold is injected at the beginning of the agent's context, it doesn't fire once and disappear. It remains in the context window throughout the entire reasoning process. The question is whether the model keeps attending to it as task-specific tokens accumulate, or whether it drifts out of attention as the context fills up.

Our hypothesis: the scaffold's structural distinctiveness prevents attention decay. And this matters more as the task gets longer.


Why the Scaffold Resists Decay

The injection format uses a notation that is structurally unlike any other content in the model's context:

S1:identify_failure → G1{mechanism_verified?} --yes→ S2:trace_chain
--no→ S3:expand_search → N{accept_correlation_as_cause}

Arrows, braces, pipes, gate predicates, step labels. This is not natural language. It occupies a unique register in the token space: a pattern the model rarely encounters in its training data and never encounters in the task prompt, tool outputs, or intermediate reasoning.
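One crude way to make "unique register" concrete is to measure how symbol-heavy a string is compared with ordinary prose. This is an illustrative heuristic only (character classes, not actual tokenizer statistics), and the two sample strings below are ours, not taken from any benchmark:

```python
def symbol_density(text: str) -> float:
    """Fraction of non-whitespace characters that are neither
    letters nor digits: a crude proxy for register distinctiveness."""
    chars = [c for c in text if not c.isspace()]
    return sum(not c.isalnum() for c in chars) / len(chars)

dag = "S1:identify_failure → G1{mechanism_verified?} --yes→ S2:trace_chain"
prose = "First identify the failure, then verify the mechanism before tracing the chain."

print(f"DAG notation:  {symbol_density(dag):.2f}")
print(f"prose version: {symbol_density(prose):.2f}")
```

A real tokenizer would also split the DAG string into rarer subword sequences than the prose version, which is the property the thesis actually cares about; symbol density is just an easy stand-in.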

The transformer's attention mechanism allocates weight based on relevance AND distinctiveness. Tokens that are structurally unique, that don't blend into the surrounding context, receive disproportionate attention. This is the same principle that makes rare words more memorable than common ones.

The scaffold's DAG notation creates persistent anchor points that the attention mechanism keeps referencing even as thousands of task-specific tokens accumulate around them.


What Happens Without Anchors

In multi-step execution, every step introduces new tokens: tool outputs, file contents, intermediate calculations, conversation history. These task-specific tokens compete with reasoning-level constraints for attention.

Without structural anchors, the model's attention drifts from "how should I reason about this?" toward "what is the most immediate thing to respond to?" The reasoning strategy established at the beginning of the context decays as the middle fills with operational detail.

This is the lost-in-the-middle effect applied to reasoning, not retrieval. The model doesn't forget the scaffold exists; it gradually stops attending to it.

The scaffold's structural distinctiveness resists this decay. Each time the model's attention sweeps the context window, the DAG notation stands out from the surrounding natural language. It's re-attended not because the model deliberately looks for it, but because the attention mechanism's similarity computation weights structurally distinctive tokens more heavily.
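A toy calculation shows why a per-token score premium matters: in softmax attention, a handful of anchor tokens that each score a fixed logit margin above the average task token lose attention share far more slowly as the context grows. The margin and token counts below are illustrative assumptions, not measurements from any model:

```python
import math

def scaffold_share(n_task: int, margin: float, n_scaffold: int = 5) -> float:
    """Total softmax attention share of `n_scaffold` anchor tokens whose
    logits sit `margin` above `n_task` task tokens at logit 0."""
    anchor_mass = n_scaffold * math.exp(margin)
    return anchor_mass / (anchor_mass + n_task)

for n in (100, 1_000, 10_000):
    flat = scaffold_share(n, margin=0.0)       # no distinctiveness premium
    distinct = scaffold_share(n, margin=3.0)   # hypothetical premium
    print(f"{n:>6} task tokens: flat {flat:.4f}, distinctive {distinct:.4f}")
```

With no premium, the anchors' share is simply 5/(n+5) and vanishes as the context fills; with a premium, the same dilution happens roughly e^margin times more slowly. That slower dilution is the "resists decay" claim in miniature.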


The Formal Model

Without scaffolding:

R(n) = R(0) * (1 - d)^n

Reasoning quality R at step n decays exponentially. Each step introduces task-specific tokens that dilute attention to reasoning-level constraints. The decay rate d is small per step but compounds.

With scaffolding:

R(n) = R(0) * (1 - d + s(k))^n

The scaffold provides a stabilization term s(k) that depends on k (the number of active abilities). If s(k) >= d, reasoning quality is maintained or even improves across steps. If s(k) < d, decay is slowed but not eliminated.

Predictions by chain length:

Steps   Expected scaffold value
1-3     Near-zero (ceiling effect: the model reasons well on short tasks natively)
5-10    Small but measurable (+2-5%)
10-25   Moderate (+5-15%)
25-50   Large (+10-25%); compound drift accumulates without anchors
50+     Maximum divergence; the unscaffolded model approaches its reasoning floor
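The shape of these predictions can be reproduced from the model itself. The parameter values below (d = 0.02 per step, s(k) = 0.015) are hypothetical placeholders, not fitted to any benchmark; the point is only that the scaffolded/unscaffolded gap widens with step count:

```python
def quality(steps: int, d: float = 0.02, s: float = 0.015, r0: float = 1.0) -> float:
    """R(n) = R(0) * (1 - d + s)^n; set s=0 for the unscaffolded case."""
    return r0 * (1 - d + s) ** steps

for n in (3, 10, 25, 50):
    delta = quality(n) - quality(n, s=0.0)
    print(f"{n:>2} steps: scaffold delta = {delta:+.3f}")
```

With these placeholder rates the delta grows monotonically with chain length, which is exactly what B-SP1 is designed to check against real tasks.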

Supporting Evidence (Partial)

Consistent, not conclusive. Each piece supports the thesis but none confirms it individually.

1. Short tasks hit ceiling. The Beyond-Reasoning benchmark measured 7 behavioral signals on single-turn tasks. 4 of 7 signals showed zero or negative improvement: the baseline was already near-perfect. Injection can't improve what doesn't need improving. This is consistent with the model's prediction: near-zero value at short chain lengths.

2. Long tasks show largest lift. MuSR tasks are multi-paragraph narratives requiring sustained reasoning across ~1000 words. They showed +16.3pp improvement vs +7.1pp average. Consistent with the prediction that value increases with chain length.

3. Multi-ability injection compounds. When multiple abilities coexist in context (Haki mode), their effects interact: suppression vectors from different abilities block different failure classes simultaneously, and procedures from one ability's gate output can inform another's reasoning. This creates distributed self-auditing across multiple dimensions: more anchor points, more resistance to decay.

4. Live observation. During a 50+ tool-call codebase reorganization session, 5 abilities were active in context. The agent internally referenced them ("I still have 5 abilities active") without re-reading or explicitly invoking them. The abilities influenced reasoning passively through attention allocation, exactly the mechanism the thesis describes.

This last observation is anecdotal: one session, no controlled comparison. The model self-reported "minimal difference" in its reasoning, but this is expected: the ceiling effect means the agent can't A/B test itself.


What Would Validate This

We've designed four benchmarks to test the thesis directly. None have been run yet.

B-SP1: Chain-Length Ablation. 40 tasks, each in 4 variants: 5, 10, 25, and 50 reasoning steps. If the thesis holds, the scaffold's improvement delta should increase monotonically with step count. This is the critical test.

B-SP2: Mid-Task Removal. 20 tasks (25+ steps), scaffold removed at step 12. If scaffolding provides persistent anchoring, performance should degrade in steps 13-25 as the anchor decays from context.

B-SP3: Format Ablation. 30 tasks (15+ steps), same content delivered in DAG notation vs natural language prose. If register distinctiveness is the mechanism, DAG should outperform prose on long tasks while showing parity on short tasks.

B-SP4: Attention Probe. 10 tasks on an open-weight model, extracting attention weights directly. Measure whether ability tokens receive disproportionate attention relative to their position in the context window. This is the most direct test of the mechanism.


What Would Falsify This

If B-SP1 shows flat improvement across chain lengths, the thesis is wrong. The scaffold's value doesn't compound; it's a fixed-size improvement regardless of task length.

If B-SP3 shows prose equals DAG on long tasks, the register-distinctiveness hypothesis is wrong. The scaffold's value comes from its content, not its notation.

If B-SP4 shows no attention premium for scaffold tokens, the attention-anchor mechanism is wrong. Something else drives the effect.

We publish the falsification criteria alongside the thesis because a hypothesis that can't be wrong isn't science.


What This Means (If Validated)

For deployment decisions: Use RA²R on your longest reasoning chains first. One-shot classifications don't need scaffolding. Twenty-step analytical pipelines do. The value of injection scales with the length of the task, not the difficulty of the individual step.

For injection design: The scaffold's structural distinctiveness is load-bearing. DAG notation, gate predicates, and suppression directives in key: value format are not arbitrary formatting choices. They create attention anchors that natural language instructions cannot.

For the field: If reasoning scaffolds function as attention anchors, this has implications beyond RA²R. Any system that injects structured content into an LLM's context, whether for reasoning, safety, or behavior control, should treat the structural distinctiveness of the injection format as a design variable, not just the content.


Source Data

  • Beyond-Reasoning Benchmark: 140 tasks × 7 signals, ceiling effects on 4/7
  • MuSR results: +16.3pp on longest-narrative tasks
  • Live observation: 50+ tool-call session with 5 active abilities
  • Formal model: Proposed, not validated
  • Proposed benchmarks: B-SP1 through B-SP4, designed but not executed
  • Related findings: Domain-agnostic suppression · External benchmarks · EjBench

These findings are part of our research paper: Under Pressure: RA²R and the Emergence of Uninstructed Reasoning Behaviors (PDF).

Every insight above is implemented as a reasoning primitive in the Logic API.