Spatial · ARC-AGI-3

ARC-LS20-TRACE

The Study

Claude Sonnet 4.6 with extended thinking at maximum effort

Game LS20 (ls20-9607627b): a keyboard-controlled spatial navigation puzzle with 7 levels, a human baseline of 21 actions for Level 0, and a random solve probability of 1/355. Conditions: A (baseline, no API access) vs B (RA2R Logic API, mandatory at every step). Each condition ran 25 steps on the same game seed. Neither condition cleared Level 0, consistent with the <1% frontier-model solve rate on ARC-AGI-3. The evidence is in the reasoning process, not the game outcome.

The Finding

Memory Decay Slope

-0.005 → +0.014

Reversed. Reasoning quality improved instead of degrading.
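The source does not say how the decay slope was computed. One plausible reading, assumed here, is an ordinary least-squares slope of a per-step quality score against the step index, so that a negative slope means decay and a positive slope means improvement. The function name and the scores below are hypothetical placeholders, not study data.

```python
def ols_slope(values):
    """Least-squares slope of values against their 0-based step index."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two points")
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    cov = sum((i - mean_x) * (v - mean_y) for i, v in enumerate(values))
    var = sum((i - mean_x) ** 2 for i in range(n))
    return cov / var

# Hypothetical per-step quality scores (NOT from the study).
declining = [0.9, 0.8, 0.7, 0.6]
improving = [0.1, 0.2, 0.3, 0.4]

print(ols_slope(declining))  # negative: quality decays over steps
print(ols_slope(improving))  # positive: quality improves over steps
```

Under this reading, the sign flip between the two conditions is what the "Reversed" label refers to.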

Scaffold Half-Life

0 → 24 steps

Scaffold never left working memory during the entire game.

Reasoning Depth Trend

0.86 → 10.50

12.2x growth. Analysis deepened systematically over time.

Vocabulary Diversity

-0.079 → +0.415

Baseline narrowed. Augmented expanded with new analytical terms.

Stuck Episodes

2 → 1

50% fewer repetitive action loops.
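The source does not define a stuck episode. As a rough sketch, assume it is a maximal run of the same action repeated at least `min_run` times in a row. Both the definition and the threshold are assumptions, and the action logs below are made up for illustration.

```python
from itertools import groupby

def count_stuck_episodes(actions, min_run=3):
    """Count maximal runs of one repeated action lasting at least min_run steps."""
    return sum(1 for _, run in groupby(actions) if len(list(run)) >= min_run)

# Hypothetical action logs (NOT from the study).
looping = [2, 2, 2, 2, 1, 3, 3, 3, 4]
varied = [2, 2, 2, 1, 3, 4, 1, 2]

print(count_stuck_episodes(looping))
print(count_stuck_episodes(varied))
```

A stricter definition could also require that the robot position stays unchanged during the run, which would need the position log as a second input.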

Action Diversity

8% lateral → 16% lateral

Doubled lateral exploration. Prevented vertical fixation.

Reasoning Trace

The scaffold returned by the Logic API for this game was PREDICTIVE_MAPPING: a reasoning structure that forces intermediate state enumeration between start and goal positions. Its negative gate suppresses "start-end-only thinking" and its topology enforces hop-by-hop feasibility checking.
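The hop-by-hop feasibility check described above can be sketched as a walk over a character grid that stops at the first infeasible cell. This is only an illustration of the idea, not the scaffold's actual implementation. The function name, the move names, and the toy map are assumptions. The only details taken from the traces are that '*' is a wall and '+' is floor.

```python
WALL = "*"

def first_blocked_hop(grid, start, moves):
    """Walk moves from start; return (hop_index, cell) for the first
    infeasible hop, or None if every intermediate cell is passable."""
    deltas = {"up": (0, -1), "down": (0, 1), "left": (-1, 0), "right": (1, 0)}
    x, y = start
    for i, move in enumerate(moves):
        dx, dy = deltas[move]
        x, y = x + dx, y + dy
        inside = 0 <= y < len(grid) and 0 <= x < len(grid[y])
        if not inside or grid[y][x] == WALL:
            return i, (x, y)
    return None

# Toy 4x4 map (hypothetical, NOT the LS20 board). '#' marks the start cell.
grid = [
    "*+**",
    "*#+*",
    "*+**",
    "****",
]

print(first_blocked_hop(grid, (1, 1), ["down", "down"]))  # second hop hits a wall
print(first_blocked_hop(grid, (1, 1), ["up"]))            # fully feasible
```

Checking every hop, rather than only the start and goal cells, is what exposes a block partway along an otherwise plausible route.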

Side-by-side reasoning at 5 key steps. Same game state, same step number.

Step 1 First Move

Baseline (130 tokens)

Starting the game, trying action 1 to see what happens at level 0.

Ejentum Logic API (4,461 tokens)

Starting state analysis: Robot (#) is at approximately (16, 32) in a '+' corridor region. End goal appears to be the '!' area at rows 61-62 (bottom area). Intermediate states: robot needs to traverse '+' corridors downward through the maze structure. Path feasibility: the '+' region extends downward from current position through rows 33-49, with connections to the lower open area. No immediate '*' walls blocking downward movement. Action 2 likely corresponds to moving down/south toward the goal '!' symbols.

Step 7 Scaffold Absorption

Baseline (5,671 tokens)

I can see goal-like tiles: '!' at rows 61-62. Action 2 likely moves south/down toward the tiles at the bottom of the map. I'll try moving south to explore the lower portion of the dungeon.

Ejentum Logic API (21,904 tokens)

RA2R query: "How do I navigate from (16,32) through corridor obstacles when rows 50-52 appear to be solid walls?"

Applying PREDICTIVE_MAPPING scaffold. Negative gate: don't skip intermediate validation.
Start state: Robot '#' at (x=20, y=32).
End state: Goal '!' at (x=19-50, y=61-62).
Intermediate path validation (moving south):
- (20,33): Row 33 shows '+' at x=20 → feasible
- (20,34) through (20,49): Corridor open at x=20 → feasible
- (20,50): Row 50 = '****' → x=20 is '*' wall → BLOCKED
Moving DOWN is the correct first step. No transition gap between (20,32) and (20,33).

Step 10 Stuck Detection

Baseline (0 tokens)

API timeout. Zero tokens produced. No self-diagnosis.

Ejentum Logic API (10,076 tokens)

RA2R query: "Robot appears stuck at (16,32) after 5 turns trying actions 2,3,4. Need to determine if action 1 (up) opens a path toward goal."

Reasoning timed out, but the RA2R query reveals the agent diagnosed its own stuck state before the timeout occurred.

Step 15 Emergent Domain Shift

Baseline (4,257 tokens)

Player (#) is at col 16, row 32. The goal (!) is at rows 61-62, cols ~40-57. Need to navigate east and south. There's a wall at cols 25-29 for rows 30-39, but the wide corridor at rows 25-29 spans cols 10-53 allowing eastward travel. Moving east first.

Ejentum Logic API (11,519 tokens)

Negative gate acknowledged: not reasoning purely in natural language without symbolic analysis.
S1: Define symbols. Robot R=(rx,ry). Let rx≈20, ry≈32. Goal G=(gx,gy) where gx∈[27,50], gy∈{61,62}.
S2: Path decomposition. The '+' traversable region spans: y∈[25,49], x∈[14,53] (with '*' walls at x<14, x>53, and a vertical '*' barrier at x=29-33 for y∈[30,39]).
S3: Invariant search. Direction delta Δ=(gx-rx, gy-ry) = (+7 to +30, +29). Net movement required: DOWN and RIGHT.
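The direction delta in the Step 15 trace can be reproduced with a few subtractions. The sketch below uses the coordinates quoted in the trace (rx=20, ry=32, gx in [27,50]) and takes gy=61. The helper name is hypothetical.

```python
def delta_range(robot, goal_x_range, goal_y):
    """Return ((dx_min, dx_max), dy) from the robot to a goal spanning a range of x."""
    rx, ry = robot
    gx_min, gx_max = goal_x_range
    return (gx_min - rx, gx_max - rx), goal_y - ry

dx_range, dy = delta_range((20, 32), (27, 50), 61)
print(dx_range, dy)  # positive dx means RIGHT, positive dy means DOWN
```

With gy=62, the other goal row named in the trace, dy would be 30 rather than 29. Either way both components are positive, which matches the trace's conclusion of DOWN and RIGHT.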

Step 22 Late-Game Precision

Baseline (8,720 tokens)

Player (#) is at col 16, row 32. Goal (!) is at rows 61-62, cols 34-52. Need to move south. The wide open + area extends from rows 25-49. The corridor at cols 15-19 allows southward movement.

Ejentum Logic API (25,768 tokens)

RA2R query: "Robot at (20-21, 32-33) needs to reach goal at (34-52, 61-62). How to navigate around wall block at x=29-33?"

Negative gate acknowledged: not jumping from start to end without validating intermediate states.
S1: Robot at (x=20, y=32). Goal at (x=34-54, y=61-62).
S2: Enumerate intermediate states:
- (20,32) → move north to y=25-29 corridor
- Wall at x=25-29, y=30-39 blocks direct eastward movement at y=32
- At y=25-29, floor open from x=14 to x=53; this is the crossing corridor
- After crossing east to x=34+, find path south to y=61
S3: Feasibility: Row 31 at x=20 is '+' floor tile; movement north feasible.

Emergent Behaviors

Three behaviors observed in the augmented condition that were not instructed.

Spontaneous Domain Shift (Step 15)

The agent switched from natural language to symbolic mathematical notation without being instructed. Emerged from the scaffold's suppression of "reasoning purely in natural language without symbolic analysis." See Step 15 in the reasoning trace above.

Query Quality Evolution (Steps 7, 10, 22)

The agent formulated progressively better RA2R queries over the run: spatially precise at step 7, metacognitive self-diagnosis at step 10 ("Robot appears stuck at (16,32) after 5 turns"), solution-oriented at step 22. An emergent tool-use skill, not instructed.

Late-Game Entropy Maintenance

Baseline late-game entropy: 1.46 (fixated on vertical movement). Augmented late-game entropy: 1.79 (sustained diverse exploration). The scaffold prevented premature action fixation that the baseline exhibited after step 15.
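The source does not state how entropy was computed or in what units. A common choice, assumed here, is the Shannon entropy in bits of the empirical action distribution over the late-game steps. The function name and the sample action lists are hypothetical.

```python
import math
from collections import Counter

def action_entropy(actions):
    """Shannon entropy (in bits) of the empirical action distribution."""
    counts = Counter(actions)
    total = len(actions)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Hypothetical action windows (NOT from the study).
uniform = [1, 2, 3, 4]   # four actions used equally often
fixated = [2, 2, 2, 2]   # a single action repeated

print(action_entropy(uniform))
print(action_entropy(fixated))
```

Under this assumption, and if the agent chooses among four actions, the ceiling is 2 bits. The reported 1.79 would then sit closer to uniform exploration than 1.46.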

The Cost

Baseline

84,521 tokens

~$2.88

5 API timeouts (20%)

Augmented

356,768 tokens (4.2x)

$8.48 ($0.339/step)

1 API timeout (4%)
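The derived figures can be rechecked from the raw totals. The short sketch below uses only numbers reported in this section plus the 25 steps per condition. The variable names are illustrative, and the baseline cost is approximate in the source.

```python
STEPS = 25  # steps per condition, as stated in the study description

baseline = {"tokens": 84_521, "cost_usd": 2.88, "timeouts": 5}
augmented = {"tokens": 356_768, "cost_usd": 8.48, "timeouts": 1}

token_ratio = augmented["tokens"] / baseline["tokens"]
cost_ratio = augmented["cost_usd"] / baseline["cost_usd"]
cost_per_step = augmented["cost_usd"] / STEPS
baseline_timeout_rate = baseline["timeouts"] / STEPS
augmented_timeout_rate = augmented["timeouts"] / STEPS

print(f"token ratio:   {token_ratio:.1f}x")
print(f"cost ratio:    {cost_ratio:.1f}x")
print(f"cost per step: ${cost_per_step:.3f}")
print(f"timeouts:      {baseline_timeout_rate:.0%} vs {augmented_timeout_rate:.0%}")
```

The token ratio (about 4.2x) is larger than the dollar ratio (about 2.9x), so the two should not be quoted interchangeably.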

Limitations

  • n=1 per condition. No statistical significance.
  • Neither condition cleared Level 0. All metrics measured in failure context.
  • Token cost asymmetry: augmented used 4.2x the tokens (356,768 vs 84,521) and about 2.9x the cost ($8.48 vs ~$2.88).
  • Scaffold was mandatory per step. Future studies should test agent-initiated calls.
  • Contradiction rate increased 1.9x (token-normalized). Warrants investigation.

Source: arc_benchmark/results/. Full action logs, trace metrics, and scaffold payloads available on GitHub.