What Happened When an LLM Taught Itself Symbolic Math
During an ARC-AGI-3 benchmark run, a scaffolded agent spontaneously switched from natural language to algebraic notation. Nobody told it to. Here's what happened, why, and what it means for how suppression signals work.
Context
We ran Claude Sonnet 4.6 on ARC-AGI-3 game LS20, a keyboard-controlled spatial navigation puzzle with 7 levels. Two conditions: baseline (no RA2R) and augmented (with RA2R Logic API). Both failed to clear Level 0. Both scored RHAE 0.0.
The full results are in RA2R on ARC-AGI-3: Trace-Level Evidence from LS20. This post focuses on three behaviors nobody predicted.
The scaffold the Logic API returned for this game was PREDICTIVE_MAPPING: a reasoning structure that forces intermediate state enumeration between start and goal positions. Its critical constraint: Suppress: start_end_only_thinking.
Behavior 1: Spontaneous Domain Shift
At step 15, the augmented agent's reasoning changed modality.
Steps 1-14 used natural language:
"Player (#) is at col 16, row 32. The goal (!) is at rows 61-62, cols ~40-57. Need to navigate east and south."
Step 15 switched to symbolic mathematical notation:
"Negative gate acknowledged: not reasoning purely in natural language without symbolic analysis. S1: Define symbols. Robot R=(rx,ry). Let rx=20, ry=32. Goal G=(gx,gy) where gx in [27,50], gy in {61,62}. S2: Path decomposition. The '+' traversable region spans: y in [25,49], x in [14,53] (with walls at x<14, x>53, and a vertical barrier at x=29-33 for y in [30,39]). S3: Invariant search. Direction delta = (gx-rx, gy-ry) = (+7 to +30, +29). Net movement required: DOWN and RIGHT."
The agent defined formal variables (R, G), computed coordinates from raw character offsets, identified region boundaries as set membership constraints, and reasoned algebraically about movement vectors.
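The quoted step can be rendered as executable checks. This is a hypothetical reconstruction, not RA2R output: the coordinates and region bounds are taken from the trace, but the function and variable names are ours.

```python
# Hypothetical reconstruction of the agent's step-15 symbolic reasoning.
# Coordinates and bounds come from the quoted trace; names are illustrative.

rx, ry = 20, 32                # robot position R = (rx, ry)
goal_x = range(27, 51)         # gx in [27, 50]
goal_y = {61, 62}              # gy in {61, 62}

def in_traversable(x, y):
    """'+' region from the trace: y in [25,49], x in [14,53],
    minus the vertical barrier at x in [29,33] for y in [30,39]."""
    if not (25 <= y <= 49 and 14 <= x <= 53):
        return False
    if 29 <= x <= 33 and 30 <= y <= 39:
        return False
    return True

# S3: movement vector toward the goal region
dx = (min(goal_x) - rx, max(goal_x) - rx)  # (+7, +30)
dy = min(goal_y) - ry                       # +29
print(dx, dy)  # (7, 30) 29 -> net movement: RIGHT and DOWN
```

The point of the modality shift is visible here: once positions are variables and regions are membership predicates, "start-to-end only" reasoning is no longer expressible, because every step is an explicit intermediate check.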
Why it happened
The scaffold's Suppress signal targets "start_end_only_thinking". For 14 steps, the agent complied by adding intermediate state checks to its natural language reasoning. At step 15, the pressure of the suppression constraint combined with increasingly complex spatial analysis caused the agent to resolve the tension by switching to a more precise modality. Symbolic notation is inherently incompatible with "start_end_only_thinking" because it forces explicit intermediate variable definition.
What it means
The scaffold did not instruct the agent to use math. It constrained a failure mode ("don't skip from start to end"), and the agent found its own solution to that constraint. The suppression signal operated as a behavioral pressure, not a behavioral instruction. The agent chose the adaptation.
This is consistent with what we found in "62% of Tasks Got the Wrong Domain. It Didn't Matter." Suppression signals are domain-agnostic: they constrain failure patterns regardless of whether the scaffold was "correctly" matched to the task. In this case, a spatial reasoning scaffold triggered a mathematical reasoning modality.
Behavior 2: Query Quality Evolution
The agent's RA2R queries improved systematically over 25 steps, without any instruction on how to write better queries.
Step 1 (generic):
"I'm analyzing a grid-based game state to determine which of 4 actions to take."
Step 7 (spatially precise):
"How do I navigate from (16,32) through corridor obstacles when rows 50-52 appear to be solid walls?"
Step 10 (metacognitive self-diagnosis):
"Robot appears stuck at (16,32) after 5 turns trying actions 2,3,4. Need to determine if action 1 (up) opens a path."
Step 17 (precise wall mapping):
"Wall barrier at x=29-33 for rows 30-39 blocks rightward movement."
Step 22 (solution-oriented):
"How to navigate around the wall at x=29 using the corridor at y=25-29?"
The progression
- Step 1: generic description of the problem.
- Step 7: specific spatial coordinates and obstacles.
- Step 10: metacognitive awareness of failure and explicit self-diagnosis.
- Step 17: precise wall coordinates as input to the query.
- Step 22: solution-oriented framing with a proposed route.
Why it matters
The agent learned to formulate better queries through practice. It was never instructed on query format, query specificity, or when to shift from describing problems to proposing solutions. The scaffold's structure gave it a framework for spatial reasoning; the agent independently improved how it requested that framework.
This is an emergent tool-use skill. The agent became a better user of the reasoning API over the course of a single game session. The implication for production: agents that use RA2R consistently may develop increasingly effective query patterns, compounding the value of each call.
Behavior 3: Late-Game Entropy Maintenance
We expected the augmented condition to show stronger explore-then-exploit convergence. It showed the opposite.
| Metric | Baseline | Augmented |
|---|---|---|
| Early-game action entropy | 1.92 | 1.92 |
| Late-game action entropy | 1.46 | 1.79 |
| Entropy convergence | 0.46 | 0.13 |
Both agents started with identical action diversity (entropy 1.92 across all four directions). By the late game, the baseline had collapsed to 1.46, fixating on vertical movement (ACTION1 and ACTION2 comprised 72% of actions). The augmented agent maintained 1.79, sustaining diverse exploration with double the lateral movement (ACTION3 usage: 16% vs 8%).
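For reference, the action entropy here is the Shannon entropy (base 2) of the action-usage distribution. A minimal sketch; the second set of counts is illustrative, not the actual trace data:

```python
import math

def action_entropy(counts):
    """Shannon entropy (base 2) of an action-count distribution."""
    total = sum(counts)
    probs = [c / total for c in counts if c > 0]
    return -sum(p * math.log2(p) for p in probs)

# Uniform use of four actions gives the maximum entropy of 2.0;
# the observed 1.92 indicates near-uniform early-game exploration.
print(action_entropy([25, 25, 25, 25]))  # 2.0

# A vertically skewed distribution (illustrative counts) scores lower.
print(round(action_entropy([45, 40, 8, 7]), 2))
```

"Entropy convergence" in the table is then just early-game entropy minus late-game entropy: 1.92 - 1.46 = 0.46 for baseline, 1.92 - 1.79 = 0.13 for augmented.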
Why the baseline fixated
LS20 is a vertical navigation game: the goal ('!') is at the bottom of the map, and the robot starts near the top. The naive strategy is "go down." Without a scaffold forcing intermediate validation, the baseline committed to vertical movement and repeated it even when blocked by walls, logging two stuck episodes (3+ identical consecutive actions) compared to one for the augmented agent.
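The stuck-episode count can be computed mechanically from the action log. A minimal sketch, using the run-length threshold of 3 stated above:

```python
def stuck_episodes(actions, min_run=3):
    """Count maximal runs of >= min_run identical consecutive actions."""
    episodes, run = 0, 1
    for prev, cur in zip(actions, actions[1:]):
        if cur == prev:
            run += 1
        else:
            if run >= min_run:
                episodes += 1
            run = 1
    if run >= min_run:  # close out the final run
        episodes += 1
    return episodes

# Three DOWNs, a RIGHT, then four more DOWNs: two stuck episodes.
print(stuck_episodes(["DOWN"] * 3 + ["RIGHT"] + ["DOWN"] * 4))  # 2
```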
Why the scaffold prevented fixation
The PREDICTIVE_MAPPING scaffold requires checking each intermediate state for feasibility. When the agent moves down and hits a wall, the scaffold's step 4 fires: "If any intermediate is infeasible, trace the blockage and identify an alternative route." This forces the agent to consider lateral movement as a response to blockage, rather than repeating the blocked direction.
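The blockage-handling logic can be sketched as follows. This is an illustrative reconstruction of the behavior described above; `passable`, the waypoint list, and the direction names are assumptions, not RA2R internals:

```python
# Illustrative sketch of the scaffold's step-4 behavior: trace the
# blockage, then propose moves other than the blocked direction.

def first_blocked(waypoints, passable):
    """Return the first infeasible intermediate, or None if the path is clear."""
    for point in waypoints:
        if not passable(point):
            return point
    return None

def alternative_moves(blocked_dir):
    """On blockage, propose lateral moves instead of repeating the blocked one."""
    lateral = {"DOWN": ["LEFT", "RIGHT"], "UP": ["LEFT", "RIGHT"],
               "LEFT": ["UP", "DOWN"], "RIGHT": ["UP", "DOWN"]}
    return lateral[blocked_dir]

# A wall at (30, 35) blocks the planned downward-right path:
path = [(28, 33), (29, 34), (30, 35), (31, 36)]
block = first_blocked(path, passable=lambda p: p != (30, 35))
if block is not None:
    print(block, alternative_moves("RIGHT"))  # (30, 35) ['UP', 'DOWN']
```

This is the mechanism behind the entropy numbers: each blocked intermediate converts a repeat of the blocked action into a forced consideration of the perpendicular directions.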
The result is that late-game entropy remained high. The agent kept exploring all directions instead of committing prematurely. Whether this leads to better outcomes at higher step budgets is untested, but the behavioral pattern is clear: scaffolding sustains exploration.
Cross-Game Validation
These behaviors appeared in a different game too. During pilot testing on FT09, a click-based pattern-matching game (completely different from LS20's keyboard navigation), the augmented agent showed the same scaffold absorption pattern:
FT09 Step 1:
"Applying scaffold: extracted landmarks -- left panels show current state with @s and .s (empty spaces), right panels appear to show target states."
FT09 Step 3:
"Applying scaffold: identified 3 salient landmarks. Suppressed all_points_equal bias by..."
The agent cited "Applying scaffold" and named the specific Suppress signal ("all_points_equal") on a game with different mechanics, different input modality, and different visual structure. Scaffold absorption is not game-specific. The behaviors emerge from the scaffold's structure, not from the game.
What This Tells Us About Suppression
All three emergent behaviors trace back to the scaffold's Suppress signals:
| Behavior | Suppress signal | How it manifested |
|---|---|---|
| Domain shift | start_end_only_thinking | Agent adopted symbolic math to satisfy the constraint |
| Query evolution | transition_gap_tolerance | Agent learned to identify and name gaps precisely |
| Entropy maintenance | start_end_only_thinking | Agent explored laterally when vertical path was blocked |
Suppress signals do not tell the model what to do. They tell it what not to do. The model finds its own path around the constraint. This is why domain-agnostic suppression works: the constraint is on a failure pattern, not a solution pattern. The solution can be spatial, mathematical, metacognitive, or something we haven't seen yet.
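The asymmetry can be made concrete with a data-structure sketch. This is a hypothetical representation mirroring the structure described in this post, not the actual RA2R schema:

```python
# Hypothetical scaffold representation; field names are ours, not RA2R's.
from dataclasses import dataclass, field

@dataclass
class Scaffold:
    name: str
    steps: list          # what to do: a reasoning structure
    suppress: list = field(default_factory=list)  # what NOT to do: failure patterns

predictive_mapping = Scaffold(
    name="PREDICTIVE_MAPPING",
    steps=[
        "Define symbols for start and goal states",
        "Enumerate intermediate states between them",
        "Check each intermediate for feasibility",
        "If any intermediate is infeasible, trace the blockage "
        "and identify an alternative route",
    ],
    suppress=["start_end_only_thinking"],
)
```

Note what the `suppress` field does not contain: any solution modality. Natural language with intermediate checks, symbolic algebra, and lateral exploration all satisfy it equally, which is why the adaptation is the agent's to choose.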
The scaffold is a behavioral pressure. The agent is the adaptation.
Limitations
- n=1 per condition. These are observations from a single run, not statistically validated findings.
- The domain shift may not replicate in shorter games or simpler spatial layouts.
- Query quality evolution requires more data points to distinguish genuine learning from random variation.
- FT09 cross-validation was a 5-step pilot, too short for quantitative analysis.
Source Data
The full step-by-step reasoning trace is available at /tasks/ARC-LS20-TRACE.
The complete benchmark report with all metrics: RA2R on ARC-AGI-3.
Related
- The Cognitive Scaffolding Thesis -- the compounding hypothesis that ARC data supports
- 62% of Tasks Got the Wrong Domain. It Didn't Matter. -- domain-agnostic suppression, the same principle operating at single-turn scale
- RA2R on ARC-AGI-3: Trace-Level Evidence from LS20 -- full benchmark methodology and results
These findings are part of our research paper: Under Pressure: RA²R and the Emergence of Uninstructed Reasoning Behaviors (PDF).