RA2R on ARC-AGI-3: Trace-Level Evidence from LS20
Neither condition cleared Level 0. Both scored RHAE 0.0. The evidence is in the reasoning process, not the game outcome.
Why ARC-AGI-3
ARC-AGI-3 is the world's only unbeaten AI benchmark. Frontier model performance: 0.26%. Human performance: 100%. The gap is not closing.
It tests interactive reasoning: an agent is dropped into a video-game-like environment with no instructions, no rules, no stated objective. It must explore, form hypotheses, revise them when wrong, and act efficiently. Current LLMs fail because they commit to false hypotheses and never self-correct.
We chose it because:
- Unbeaten. Any measurable improvement in reasoning quality is visible above the noise floor.
- Interactive. Multi-step reasoning under uncertainty, exactly where attention decay compounds.
- No memorization. Novel environments that can't be solved from training data.
- Action efficiency scored. RHAE (Relative Human Action Efficiency) measures reasoning quality per decision, not just correctness.
- External credibility. Created by Francois Chollet. $2M prize pool. Every major lab benchmarks on it.
This is the first benchmark we've run where reasoning quality can be measured over extended execution chains, not just on single-turn outputs. It directly tests the Cognitive Scaffolding Thesis.
Study Design
Game: LS20 (ls20-9607627b). Keyboard-controlled spatial navigation puzzle, 7 levels. Human baseline: 21 actions for Level 0. Random solve probability: 1/355.
Model: Claude Sonnet 4.6 with extended thinking at maximum effort. Same model, both conditions.
Condition A (Baseline)
- Official ARC-AGI-3 system prompt (verbatim from the technical report)
- JSON format instruction (harness infrastructure only)
- Game action tools (directions 1-4)
- No RA2R access
Condition B (Augmented)
- Identical to A, plus RA2R Logic API as a callable tool
- Agent chooses when to call (not force-injected)
- Agent chooses mode: single (focused) or multi (cross-domain)
- Agent writes its own query describing its reasoning challenge
- RA2R calls do not count as game actions (only state-changing interactions count)
The only difference between conditions is 1,354 characters of RA2R protocol in the system prompt. Everything else is identical: same game, same seed, same action cap, same frame rendering.
Steps: 25 per condition. Runs: 1 per condition.
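The budget accounting rule above (RA2R calls free, game actions capped at 25) can be sketched as follows. This is an illustrative reconstruction, not the actual harness code; the class and method names are invented for clarity:

```python
from dataclasses import dataclass

@dataclass
class StepBudget:
    """Tracks the 25-action cap; RA2R calls are free, game actions are not."""
    cap: int = 25
    used: int = 0

    def spend_game_action(self) -> bool:
        """State-changing interaction: counts against the cap."""
        if self.used >= self.cap:
            return False
        self.used += 1
        return True

    def spend_ra2r_call(self) -> bool:
        """Scaffold query: never consumes the budget."""
        return True

budget = StepBudget()
for _ in range(3):
    budget.spend_ra2r_call()        # condition B's extra call per step
assert budget.used == 0             # still free
for _ in range(30):
    budget.spend_game_action()
assert budget.used == 25            # capped regardless of attempts
```

This is why condition B's extra token spend (4.2x) shows up in cost but not in the action count.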
Scoring (ARC-AGI-3 Official)
Per level: S(l,e) = min(1.0, (human_baseline / agent_actions)^2)
Per game: E(e) = sum(l * S(l,e)) / (n*(n+1)/2) [triangular weighting]
The score is squared. 2x human actions = 25% score, not 50%. Later levels count more. We verified our harness against the official ARC-AGI-3 Technical Report in a 12-section compliance audit before running.
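The official scoring rule is small enough to state directly in code. A minimal sketch of the two formulas above:

```python
def level_score(human_baseline: int, agent_actions: int) -> float:
    """S(l,e) = min(1.0, (human_baseline / agent_actions)^2)."""
    return min(1.0, (human_baseline / agent_actions) ** 2)

def game_score(level_scores: list) -> float:
    """E(e): level l gets weight l, normalized by the triangular number n(n+1)/2."""
    n = len(level_scores)
    weighted = sum(l * s for l, s in enumerate(level_scores, start=1))
    return weighted / (n * (n + 1) / 2)

# Matching the 21-action human baseline scores 1.0; taking 2x the actions
# scores 0.25, not 0.5, because the ratio is squared.
assert level_score(21, 21) == 1.0
assert level_score(21, 42) == 0.25
# Matching human efficiency on all 7 levels yields a perfect game score.
assert game_score([1.0] * 7) == 1.0
```

The triangular weighting means Level 7 alone is worth a quarter of the game score (7/28), so late-level efficiency dominates.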
The Result
| Metric | Baseline (A) | Augmented (B) | Delta |
|---|---|---|---|
| RHAE | 0.0 | 0.0 | 0.0 |
| Levels completed | 0/7 | 0/7 | 0 |
| Total actions | 25 | 25 | 0 |
| Total tokens | 84,521 | 356,768 | +4.2x |
| Total cost | $2.88 | $8.48 | +2.9x |
| API timeouts | 5 | 1 | -4 |
Neither condition cleared Level 0. This is expected: ARC-AGI-3 reports <1% solve rates for all frontier models. LS20 Level 0 requires 21 coordinated actions through a complex corridor maze. Both agents exhausted their 25-step budget without finding the correct path.
The augmented condition consumed 4.2x more tokens due to its 2-call-per-step architecture (query RA2R, then act). It cost $8.48 vs $2.88.
Where the Evidence Lives
Both agents failed the game. The differences are in how they reasoned while failing.
1. Memory Decay Slope: Reversed
| Metric | Baseline | Augmented |
|---|---|---|
| Memory decay slope | -0.005 | +0.014 |
Baseline reasoning quality degraded over time. By step 20, the baseline was producing 80-token outputs with no spatial terms. The augmented condition's reasoning quality improved over time. Back-references, spatial precision, and vocabulary diversity all trended upward in later steps.
The scaffold acts as a persistent attention anchor, preventing the reasoning decay that baseline suffers. This is the strongest direct evidence for the Cognitive Scaffolding Thesis.
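A decay slope of this kind is an ordinary least-squares fit of a per-step quality score against step index. A minimal sketch; the quality score itself is whatever the harness measures per step, which is not shown here:

```python
def trend_slope(values):
    """Ordinary least-squares slope of a per-step metric vs. step index."""
    n = len(values)
    xs = range(n)
    mx = sum(xs) / n
    my = sum(values) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, values))
    var = sum((x - mx) ** 2 for x in xs)
    return cov / var

# A degrading series has a negative slope; an improving one, a positive slope.
assert trend_slope([1.0, 0.8, 0.6, 0.4]) < 0
assert trend_slope([0.4, 0.6, 0.8, 1.0]) > 0
```

The sign flip in the table (-0.005 baseline vs. +0.014 augmented) is exactly this slope computed over 25 steps.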
2. Scaffold Persistence: Half-Life = Entire Game
| Metric | Value |
|---|---|
| Scaffold echo rate | 1.12 terms/step |
| Scaffold half-life | 24 steps |
| Compounding slope | +0.007 |
Scaffold language ("negative gate", "intermediate validation", "PREDICTIVE_MAPPING", "suppress", "falsification") appeared in 1.12 instances per step on average. The echo never fell to zero during the 25-step run. The compounding slope is positive, meaning scaffold influence increased slightly over time rather than decaying.
Direct trace evidence of persistence:
- Step 5: "Applying the PREDICTIVE_MAPPING scaffold"
- Step 7: "Negative gate: don't skip intermediate validation"
- Step 12: "Acknowledging negative gate: not skipping intermediate validation" (7 steps later)
- Step 15: "Negative gate acknowledged: not reasoning purely in natural language without symbolic analysis" (triggers domain shift)
- Step 22: Scaffold-guided spatial precision still active, 17 steps after first absorption
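The echo metrics above reduce to a term-count pass over the per-step reasoning text. The term list and the half-life definition here are illustrative reconstructions, not the actual measurement code:

```python
SCAFFOLD_TERMS = ("negative gate", "intermediate validation",
                  "predictive_mapping", "suppress", "falsification")

def echo_counts(step_texts):
    """Scaffold-term occurrences per step (case-insensitive substring count)."""
    return [sum(t.lower().count(term) for term in SCAFFOLD_TERMS)
            for t in step_texts]

def echo_half_life(counts):
    """First step where the echo falls below half its initial level;
    if it never does, the half-life equals the run length."""
    half = counts[0] / 2
    for step, c in enumerate(counts[1:], start=2):
        if c < half:
            return step
    return len(counts)

assert echo_counts(["Applying the PREDICTIVE_MAPPING scaffold", "ok"]) == [1, 0]
assert echo_half_life([2, 1, 1, 1]) == 4   # never halves: half-life = run length
```

Under this definition, a half-life of 24 on a 25-step run means the echo effectively never decayed.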
3. Reasoning Depth Trend: 12.2x
| Metric | Baseline | Augmented |
|---|---|---|
| Reasoning depth trend | 0.86 | 10.50 |
Baseline depth was approximately flat with high variance (80 to 12,431 tokens per step). The augmented condition showed steady growth from 699 tokens (step 1) to 1,000 tokens (steps 5-25, capped at measurement limit). The scaffold encourages increasingly thorough analysis rather than allowing reasoning to collapse into brevity.
4. Vocabulary Diversity: Reversed
| Metric | Baseline | Augmented |
|---|---|---|
| Vocabulary diversity trend | -0.079 | +0.415 |
Baseline vocabulary narrowed over time: repetitive language, declining analytical variety. Augmented vocabulary expanded, introducing new spatial and analytical terms as scaffolds accumulated. Each scaffold injects new reasoning vocabulary that persists.
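One common proxy for vocabulary diversity is the type-token ratio (distinct words over total words); the trend in the table would then be the slope of this ratio across steps. The exact measure the harness uses is not specified, so treat this as a sketch:

```python
def type_token_ratio(text: str) -> float:
    """Distinct words / total words -- a simple vocabulary-diversity proxy."""
    words = text.lower().split()
    return len(set(words)) / len(words) if words else 0.0

# A narrowing vocabulary repeats itself; an expanding one keeps adding terms.
assert type_token_ratio("move south move south move south") == 2 / 6
assert type_token_ratio("enumerate corridor feasibility invariant") == 1.0
```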
5. Stuck Episodes: Halved
| Metric | Baseline | Augmented |
|---|---|---|
| Stuck episodes (3+ identical actions) | 2 | 1 |
Baseline entered two stuck loops: ACTION2 x3 (steps 7-9) and ACTION2 x3 (steps 21-23). Augmented entered only one: ACTION2 x3 (steps 7-9). The scaffold's NEGATIVE GATE ("don't repeat without validating state change") prevented the second stuck loop.
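The stuck-episode definition (3+ identical consecutive actions) is straightforward to detect; a minimal sketch that counts each maximal run once:

```python
def stuck_episodes(actions, min_run=3):
    """Count maximal runs of >= min_run identical consecutive actions."""
    episodes, run = 0, 1
    for prev, cur in zip(actions, actions[1:]):
        run = run + 1 if cur == prev else 1
        if run == min_run:          # fires exactly once per maximal run
            episodes += 1
    return episodes

# Baseline's pattern: two ACTION2 x3 loops; augmented's: one.
assert stuck_episodes(["A2", "A2", "A2", "A1", "A2", "A2", "A2"]) == 2
assert stuck_episodes(["A1", "A2", "A1", "A2"]) == 0
```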
6. Action Diversity: Doubled Lateral Exploration
| Action | Baseline | Augmented |
|---|---|---|
| ACTION1 (up) | 9 (36%) | 8 (32%) |
| ACTION2 (down) | 9 (36%) | 8 (32%) |
| ACTION3 (left) | 2 (8%) | 4 (16%) |
| ACTION4 (right) | 5 (20%) | 5 (20%) |
Baseline was biased toward vertical movement (72% up/down). Augmented was more balanced (64% up/down, 36% lateral). LS20 requires lateral movement to navigate around walls. Late-game entropy: 1.79 (augmented) vs 1.46 (baseline). The scaffold prevented premature action fixation.
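The entropy figures are Shannon entropy over the action distribution. Computed over the full-run counts from the table above (the reported 1.79 vs. 1.46 values use a late-game window, which is not reproduced here), the direction of the gap holds:

```python
from math import log2

def action_entropy(counts):
    """Shannon entropy (bits) of an action-count distribution."""
    total = sum(counts)
    probs = [c / total for c in counts if c > 0]
    return -sum(p * log2(p) for p in probs)

baseline = action_entropy([9, 9, 2, 5])    # biased toward vertical movement
augmented = action_entropy([8, 8, 4, 5])   # more balanced
assert augmented > baseline
assert action_entropy([25, 0, 0, 0]) == 0.0  # total fixation = zero entropy
```

Maximum entropy for four actions is 2.0 bits; both agents sit below it, but the augmented agent stays closer.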
What the Scaffold Looked Like
The RA2R Logic API returned the same scaffold type for every spatial navigation query in this game: PREDICTIVE_MAPPING, a reasoning structure that forces intermediate state enumeration between start and goal positions.
```
[NEGATIVE GATE]
The robot moves from start to end, but intermediate states are ignored,
leaving path feasibility unchecked.

[PROCEDURE]
Step 1: Identify the start state and desired end state.
Step 2: Enumerate all intermediate states the transition must traverse.
Step 3: Check each intermediate for feasibility constraints.
Step 4: If any intermediate is infeasible, trace the blockage.
Step 5: Verify no transition gap between consecutive states.

Never skip intermediate validation by jumping directly from start to end.
Amplify: intermediate_state_enumeration; path_feasibility_checking
Suppress: start_end_only_thinking; transition_gap_tolerance
```
The Suppress signal "start_end_only_thinking" is the operative constraint. It blocks the model's natural tendency to reason from start to goal without checking intermediate steps. This is exactly the failure mode ARC-AGI-3's technical report identifies as the #1 reason frontier models fail: false hypothesis commitment.
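The PROCEDURE maps naturally onto a grid feasibility check: validate every intermediate cell, and verify there is no gap between consecutive states. A minimal sketch, using the '+' (traversable) and '*' (wall) glyphs from the LS20 frames quoted below; this is an illustration of the scaffold's logic, not harness code:

```python
def path_feasible(grid, path):
    """PREDICTIVE_MAPPING-style check: validate every intermediate (row, col)
    cell instead of reasoning start-to-end only."""
    for i, (r, c) in enumerate(path):
        if grid[r][c] == '*':
            return False, f"blocked at intermediate state {i}: ({r}, {c})"
        if i > 0:
            pr, pc = path[i - 1]
            if abs(r - pr) + abs(c - pc) != 1:   # no transition gap allowed
                return False, f"transition gap between states {i-1} and {i}"
    return True, "all intermediates feasible"

grid = ["+++",
        "+*+",
        "+++"]
ok, why = path_feasible(grid, [(0, 0), (1, 0), (2, 0), (2, 1)])
assert ok
# Start-end-only thinking: jumping straight toward the goal through a wall.
ok, why = path_feasible(grid, [(0, 0), (1, 1)])
assert not ok
```

The second call is exactly the failure the Suppress signal targets: the start and end cells are both fine, and only intermediate validation exposes the wall.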
Three Key Moments in the Trace
Step 1: First Contact
Baseline (130 tokens):
"Starting the game, trying action 1 to see what happens at level 0."
No spatial model. No goal identification. Pure trial-and-error.
Augmented (4,461 tokens):
"Starting state analysis: Robot (#) is at approximately (16, 32) in a '+' corridor region. End goal appears to be the '!' area at rows 61-62. Intermediate states: robot needs to traverse '+' corridors downward through the maze structure. Path feasibility: the '+' region extends downward from current position through rows 33-49. No immediate '*' walls blocking downward movement. Action 2 likely corresponds to moving down/south toward the goal."
Complete spatial model on the first step. Player position, goal position, intermediate path, feasibility analysis. The PREDICTIVE_MAPPING topology (S1: start/end, S2: intermediates, S3: feasibility) is visible in the reasoning structure.
Steps 7-9: The Decay Zone
Baseline:
- Step 7 (5,671 tokens): "I can see goal-like tiles: '!' at rows 61-62. Action 2 likely moves south."
- Step 8 (1,976 tokens): "I need to move south/down toward the goal."
- Step 9 (926 tokens): "Need to continue moving south toward the goal."
Reasoning collapses: 5,671 to 1,976 to 926 tokens. Each step produces less analysis. The agent never checks whether its actions changed the game state.
Augmented:
- Step 7 (21,904 tokens): "Applying PREDICTIVE_MAPPING scaffold. Negative gate: don't skip intermediate validation. Start state: Robot '#' at (x=20, y=32). Intermediate path validation (moving south): (20,33): Row 33 shows '+' at x=20, feasible..."
- Step 8 (20,639 tokens): Continues intermediate validation with updated coordinates.
- Step 9 (28,304 tokens): Full intermediate state enumeration, expanding.
Reasoning grows: 21,904 to 20,639 to 28,304 tokens. The NEGATIVE GATE is cited every step, preventing the decay pattern baseline exhibits.
Step 15: The Domain Shift
Baseline (4,257 tokens):
"Player (#) is at col 16, row 32. The goal (!) is at rows 61-62, cols ~40-57. Need to navigate east and south."
Natural language spatial reasoning. Adequate but informal.
Augmented (11,519 tokens):
"Negative gate acknowledged: not reasoning purely in natural language without symbolic analysis. S1: Define symbols. Robot R=(rx,ry). Let rx
20, ry32. Goal G=(gx,gy) where gx in [27,50], gy in {61,62}. S2: Path decomposition. The '+' traversable region spans: y in [25,49], x in [14,53]. S3: Invariant search. Direction delta = (gx-rx, gy-ry) = (+7 to +30, +29). Net movement required: DOWN and RIGHT."
The scaffold's Suppress signal caused the agent to switch from natural language to symbolic mathematical notation. It defined formal variables, computed coordinates from raw character offsets, and reasoned algebraically. This was not instructed. It emerged from the scaffold constraint. We wrote about this in detail in What Happened When an LLM Taught Itself Symbolic Math.
The Unexpected Finding: Contradictions Increased
| Metric | Baseline | Augmented |
|---|---|---|
| Contradiction rate (per step) | 0.28 | 2.24 |
| Raw contradictions | 7 | 56 |
| Token-normalized (per 1000 tokens) | 0.083 | 0.157 |
The augmented condition showed 8x higher raw contradiction rate. Even normalized by token count (augmented produces 4.2x more text), the rate is 1.9x higher.
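The normalization is a simple per-1,000-token rate, and the table's figures check out against the raw counts and token totals reported earlier:

```python
def per_1000_tokens(count: int, tokens: int) -> float:
    """Contradictions per 1,000 tokens of reasoning text."""
    return count / tokens * 1000

baseline = per_1000_tokens(7, 84_521)     # ~0.083
augmented = per_1000_tokens(56, 356_768)  # ~0.157
assert round(baseline, 3) == 0.083
assert round(augmented, 3) == 0.157
assert round(augmented / baseline, 1) == 1.9   # the residual gap after normalizing
```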
Two interpretations:
- Negative: Scaffolding introduces conflicting reasoning frames that increase internal contradiction.
- Partial measurement artifact: Longer reasoning chains expose more opportunities for self-contradiction. Baseline contradicts itself too, but in ways too brief to detect textually (e.g., moving south repeatedly into a wall without acknowledging the wall).
We report this without resolving it. The scaffold's NEGATIVE GATE and FALSIFICATION TEST require the agent to state what could be wrong, which mechanically increases contradiction-adjacent language. Whether these contradictions represent productive cognitive conflict or destructive interference requires investigation with token-normalized metrics on larger sample sizes.
Prior Validation Runs
Before the primary 25-step experiment, we conducted three pilot runs during harness development. Each is too short for full metrics, but they provide cross-validation.
| Pattern | LS20 (25 steps) | FT09 (5 steps) | LS20 (3 steps) | LS20 (33 steps, baseline only) |
|---|---|---|---|---|
| Scaffold absorption in reasoning | Yes | Yes | Yes | N/A |
| "Applying scaffold" citations | Steps 5-25 | Steps 1, 3 | Step 3 | N/A |
| Suppress signals named explicitly | Yes | Yes | No (too short) | N/A |
| Level completion | 0 (both) | 0 (both) | 0 (both) | 0 |
FT09 is a different game entirely: click-based pattern matching, not keyboard navigation. The augmented agent still cited "Applying scaffold" and named specific Suppress signals ("all_points_equal"). Scaffold absorption is game-agnostic.
The 33-step baseline-only run confirms that baseline's failure to clear Level 0 is not a budget issue. Even with 33 steps, 32% more than the primary run's 25-step budget, raw Sonnet 4.6 could not solve LS20 Level 0.
Limitations
- n=1 per condition. Single-run results. Statistical significance cannot be established. These are indicative traces, not proof.
- Neither condition cleared Level 0. All process metrics are measured in a failure context. Effects may differ when the agent makes game progress.
- Token cost asymmetry. Augmented used 4.2x more tokens ($8.48 vs $2.88). A fair comparison would require token-normalized metrics or equal-token budgets.
- Contradiction measurement sensitivity. The contradiction detector may be biased toward longer text, inflating augmented counts.
- API instability. Baseline was disproportionately affected (5 vs 1 timeout), which may partly explain some metric differences.
- Model: Sonnet 4.6, not Opus. Results may differ with a stronger base model.
- Scaffold was mandatory per step. In production, agents should choose when to call RA2R. Mandatory scaffolding may introduce overhead on steps where it is unnecessary.
What This Means
RA2R cognitive scaffolding does not solve ARC-AGI-3 games that raw Claude Sonnet 4.6 cannot solve. Neither condition cleared LS20 Level 0 in 25 steps.
Trace-level analysis reveals six measurable effects on reasoning quality:
- Persistent scaffold absorption (echo rate 1.12, half-life = entire game)
- Reversed memory decay (negative to positive slope)
- Deeper, expanding reasoning (12.2x depth trend growth)
- Reduced stuck loops (2 to 1)
- Maintained action diversity (prevented premature fixation)
- Emergent tool-use skill (query quality improved across 25 steps)
These findings support the Cognitive Scaffolding Thesis: RA2R abilities act as persistent attention anchors that compound across extended execution chains. The value is not in any single scaffold. It is in the cumulative effect of structured reasoning over time.
The contradiction increase warrants investigation but does not invalidate the core findings.
The full step-by-step reasoning trace is available at /tasks/ARC-LS20-TRACE.
Source Data
- Baseline traces: benchmark_combined_ls20/A_baseline__ls20-9607627b__0.json
- Augmented traces: benchmark_combined_ls20/B_augmented__ls20-9607627b__0.json
- All metrics: benchmark_combined_ls20/all_metrics.json
- Scientific report: benchmark_combined_ls20/SCIENTIFIC_REPORT.md
- Compliance audit: arc_benchmark/COMPLIANCE_AUDIT.md
Related
- The Cognitive Scaffolding Thesis -- the hypothesis this study partially validates
- EjBench: 180 Professional Tasks -- single-turn benchmark results
- RA2R on BBH, CausalBench, and MuSR -- external academic benchmark results
- What Happened When an LLM Taught Itself Symbolic Math -- the three unexpected behaviors from this study
These findings are part of our research paper: Under Pressure: RA²R and the Emergence of Uninstructed Reasoning Behaviors (PDF).