What Happened When an LLM Taught Itself Symbolic Math

During an ARC-AGI-3 benchmark run, a scaffolded agent spontaneously switched from natural language to algebraic notation. Nobody told it to. Here's what happened, why, and what it means for how suppression signals work.


Context

We ran Claude Sonnet 4.6 on ARC-AGI-3 game LS20, a keyboard-controlled spatial navigation puzzle with 7 levels. Two conditions: baseline (no RA2R) and augmented (with RA2R Logic API). Both failed to clear Level 0. Both scored RHAE 0.0.

The full results are in RA2R on ARC-AGI-3: Trace-Level Evidence from LS20. This post focuses on three behaviors nobody predicted.

The scaffold the Logic API returned for this game was PREDICTIVE_MAPPING: a reasoning structure that forces intermediate state enumeration between start and goal positions. Its critical constraint: Suppress: start_end_only_thinking.
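To make the structure concrete, here is one way to picture a scaffold of this shape as data. This is a hypothetical encoding for illustration only; the Logic API's actual schema is not shown in this post.

```python
# Hypothetical encoding of the PREDICTIVE_MAPPING scaffold.
# Field names and step wording are illustrative, not the API's schema.
scaffold = {
    "name": "PREDICTIVE_MAPPING",
    "steps": [
        "define start and goal states",
        "enumerate intermediate states between them",
        "check each intermediate state for feasibility",
        "if any intermediate is infeasible, trace the blockage "
        "and identify an alternative route",
    ],
    "suppress": ["start_end_only_thinking"],
}

def violates_suppression(trace_states):
    """Flag reasoning that jumps straight from start to goal:
    a trace with only two states is exactly the suppressed pattern."""
    return len(trace_states) <= 2
```

The point of the `Suppress` field is that it names a failure pattern to avoid, not a procedure to follow; the check above is one way such a pattern could be detected in a trace.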


Behavior 1: Spontaneous Domain Shift

At step 15, the augmented agent's reasoning changed modality.

Steps 1-14 used natural language:

"Player (#) is at col 16, row 32. The goal (!) is at rows 61-62, cols ~40-57. Need to navigate east and south."

Step 15 switched to symbolic mathematical notation:

"Negative gate acknowledged: not reasoning purely in natural language without symbolic analysis. S1: Define symbols. Robot R=(rx,ry). Let rx=20, ry=32. Goal G=(gx,gy) where gx in [27,50], gy in {61,62}. S2: Path decomposition. The '+' traversable region spans: y in [25,49], x in [14,53] (with walls at x<14, x>53, and a vertical barrier at x=29-33 for y in [30,39]). S3: Invariant search. Direction delta = (gx-rx, gy-ry) = (+7 to +30, +29). Net movement required: DOWN and RIGHT."

The agent defined formal variables (R, G), computed coordinates from raw character offsets, identified region boundaries as set membership constraints, and reasoned algebraically about movement vectors.
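The agent's step-15 setup can be replayed in a few lines. The coordinates below come from the quoted trace; the variable names are illustrative.

```python
# Replaying the agent's step-15 symbolic setup from the quoted trace.
rx, ry = 20, 32          # robot position R = (rx, ry)
gx_range = (27, 50)      # goal gx in [27, 50]
gy_set = {61, 62}        # goal gy in {61, 62}

# Direction delta = (gx - rx, gy - ry), as computed in the trace
dx_min = gx_range[0] - rx   # +7
dx_max = gx_range[1] - rx   # +30
dy = min(gy_set) - ry       # +29

assert (dx_min, dx_max, dy) == (7, 30, 29)
# Positive x and positive y deltas: net movement required is RIGHT and DOWN.
```

The arithmetic matches the trace exactly: the "+7 to +30, +29" delta the agent reported follows directly from its own symbol definitions.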

Why it happened

The scaffold's Suppress signal targets "start_end_only_thinking". For 14 steps, the agent complied by adding intermediate state checks to its natural language reasoning. At step 15, the pressure of the suppression constraint combined with increasingly complex spatial analysis caused the agent to resolve the tension by switching to a more precise modality. Symbolic notation is inherently incompatible with "start_end_only_thinking" because it forces explicit intermediate variable definition.

What it means

The scaffold did not instruct the agent to use math. It constrained a failure mode ("don't skip from start to end"), and the agent found its own solution to that constraint. The suppression signal operated as a behavioral pressure, not a behavioral instruction. The agent chose the adaptation.

This is consistent with what we found in "62% of Tasks Got the Wrong Domain. It Didn't Matter.": suppression signals are domain-agnostic. They constrain failure patterns regardless of whether the scaffold was "correctly" matched to the task. In this case, a spatial reasoning scaffold triggered a mathematical reasoning modality.


Behavior 2: Query Quality Evolution

The agent's RA2R queries improved systematically over 25 steps, without any instruction on how to write better queries.

Step 1 (generic):

"I'm analyzing a grid-based game state to determine which of 4 actions to take."

Step 7 (spatially precise):

"How do I navigate from (16,32) through corridor obstacles when rows 50-52 appear to be solid walls?"

Step 10 (metacognitive self-diagnosis):

"Robot appears stuck at (16,32) after 5 turns trying actions 2,3,4. Need to determine if action 1 (up) opens a path."

Step 17 (precise wall mapping):

"Wall barrier at x=29-33 for rows 30-39 blocks rightward movement."

Step 22 (solution-oriented):

"How to navigate around the wall at x=29 using the corridor at y=25-29?"

The progression

  • Step 1: generic description of the problem.
  • Step 7: specific spatial coordinates and obstacles.
  • Step 10: metacognitive awareness of failure and explicit self-diagnosis.
  • Step 17: precise wall coordinates as input to the query.
  • Step 22: solution-oriented framing with a proposed route.

Why it matters

The agent learned to formulate better queries through practice. It was never instructed on query format, query specificity, or when to shift from describing problems to proposing solutions. The scaffold's structure gave it a framework for spatial reasoning; the agent independently improved how it requested that framework.

This is an emergent tool-use skill. The agent became a better user of the reasoning API over the course of a single game session. The implication for production: agents that use RA2R consistently may develop increasingly effective query patterns, compounding the value of each call.


Behavior 3: Late-Game Entropy Maintenance

We expected the augmented condition to show stronger explore-then-exploit convergence. It showed the opposite.

Metric                      Baseline   Augmented
Early-game action entropy   1.92       1.92
Late-game action entropy    1.46       1.79
Entropy convergence         0.46       0.13

Both agents started with identical action diversity (entropy 1.92 across all four directions). By the late game, the baseline had collapsed to 1.46, fixating on vertical movement (ACTION1 and ACTION2 comprised 72% of actions), while the augmented agent held at 1.79, sustaining diverse exploration with doubled lateral movement (ACTION3 usage: 16% vs 8%).

Why the baseline fixated

LS20 is a vertical navigation game: the goal ('!') is at the bottom of the map, and the robot starts near the top. The naive strategy is "go down." Without a scaffold forcing intermediate validation, the baseline committed to vertical movement and repeated it even when blocked by walls, logging two stuck episodes (3+ identical consecutive actions) compared to one for the augmented agent.

Why the scaffold prevented fixation

The PREDICTIVE_MAPPING scaffold requires checking each intermediate state for feasibility. When the agent moves down and hits a wall, the scaffold's step 4 fires: "If any intermediate is infeasible, trace the blockage and identify an alternative route." This forces the agent to consider lateral movement as a response to blockage, rather than repeating the blocked direction.
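The blockage-then-replan behavior can be sketched as two small functions. The grid representation and function names here are assumptions for illustration; the scaffold itself expresses this as a reasoning step, not executable code.

```python
# Sketch of the intermediate-feasibility check PREDICTIVE_MAPPING enforces.
# Assumed representation: grid is a dict {(x, y): char}, '+' is traversable.

def first_blockage(path, grid):
    """Walk a candidate path and return the first infeasible cell, or None."""
    for cell in path:
        if grid.get(cell) != "+":
            return cell   # "trace the blockage ..."
    return None

def replan_laterally(cell):
    """"... and identify an alternative route": propose lateral neighbors
    of the blocked cell instead of repeating the blocked direction."""
    x, y = cell
    return [(x - 1, y), (x + 1, y)]
```

When a downward move hits a wall, `first_blockage` names the blocked cell and `replan_laterally` puts lateral moves on the table, which is exactly the mechanism that kept the augmented agent's late-game entropy high.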

The result is that late-game entropy remained high. The agent kept exploring all directions instead of committing prematurely. Whether this leads to better outcomes at higher step budgets is untested, but the behavioral pattern is clear: scaffolding sustains exploration.


Cross-Game Validation

These behaviors appeared in a different game too. During pilot testing on FT09, a click-based pattern-matching game (completely different from LS20's keyboard navigation), the augmented agent showed the same scaffold absorption pattern:

FT09 Step 1:

"Applying scaffold: extracted landmarks -- left panels show current state with @s and .s (empty spaces), right panels appear to show target states."

FT09 Step 3:

"Applying scaffold: identified 3 salient landmarks. Suppressed all_points_equal bias by..."

The agent cited "Applying scaffold" and named the specific Suppress signal ("all_points_equal") on a game with different mechanics, different input modality, and different visual structure. Scaffold absorption is not game-specific. The behaviors emerge from the scaffold's structure, not from the game.


What This Tells Us About Suppression

All three emergent behaviors trace back to the scaffold's Suppress signals:

Behavior              Suppress signal            How it manifested
Domain shift          start_end_only_thinking    Agent adopted symbolic math to satisfy the constraint
Query evolution       transition_gap_tolerance   Agent learned to identify and name gaps precisely
Entropy maintenance   start_end_only_thinking    Agent explored laterally when vertical path was blocked

Suppress signals do not tell the model what to do. They tell it what not to do. The model finds its own path around the constraint. This is why domain-agnostic suppression works: the constraint is on a failure pattern, not a solution pattern. The solution can be spatial, mathematical, metacognitive, or something we haven't seen yet.

The scaffold is a behavioral pressure. The agent is the adaptation.


Limitations

  • n=1 per condition. These are observations from a single run, not statistically validated findings.
  • The domain shift may not replicate in shorter games or simpler spatial layouts.
  • Query quality evolution requires more data points to distinguish genuine learning from random variation.
  • FT09 cross-validation was a 5-step pilot, too short for quantitative analysis.

Source Data

The full step-by-step reasoning trace is available at /tasks/ARC-LS20-TRACE.

The complete benchmark report with all metrics: RA2R on ARC-AGI-3.


Related


These findings are part of our research paper: Under Pressure: RA²R and the Emergence of Uninstructed Reasoning Behaviors (PDF).

Every insight above is implemented as a reasoning primitive in the Logic API.