
Under Pressure: Our First Research Paper


Our first research paper is live.

Title: Under Pressure: RA²R and the Emergence of Uninstructed Reasoning Behaviors in Scaffold-Augmented Language Models

Author: Franko Luci, Ejentum



Abstract

We introduce Reasoning Ability-Augmented Retrieval (RA²R), a paradigm for augmenting large language model agents at inference time by retrieving and injecting structured cognitive operations rather than information. Where RAG retrieves facts and Buffer of Thoughts retrieves reasoning templates, RA²R retrieves complete cognitive procedures that include named failure mode declarations, executable reasoning topologies, inline epistemic checkpoints, and structured failure recovery mechanisms.

We evaluate RA²R across three independent benchmarks using Claude (Anthropic) as the sole model family:

  • EjBench (180 domain-specific tasks, n=536 judgments)
  • BIG-Bench Hard, CausalBench, MuSR (70 published academic tasks, n=209 judgments)
  • ARC-AGI-3 (25-step interactive reasoning, n=2 conditions)

On single-turn tasks, RA²R injection improved composite reasoning quality by +10.1 percentage points on custom tasks and +20.8 percentage points on published benchmarks, with self-monitoring scores nearly doubling while correctness remained stable.

On the interactive benchmark, both conditions scored 0.0 RHAE (neither solved the task), but process-level analysis revealed three uninstructed emergent behaviors: spontaneous transition from natural language to symbolic mathematical notation, progressive improvement in retrieval query quality without instruction, and reversal of the expected reasoning decay pattern from -0.005 to +0.014 slope across 25 steps.

We report all negative findings, including correctness decrements under multi-ability injection and an unresolved 1.9x increase in normalized contradictions. All data is publicly available. All experiments use a single model family; cross-model generalization is untested.
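The decay-slope figures above (-0.005 and +0.014 across 25 steps) are per-step linear trends. As a minimal sketch, such a slope can be computed as an ordinary least-squares fit of quality scores against step index; the `scores` values below are illustrative, not the paper's data:

```python
def trend_slope(scores):
    """Ordinary least-squares slope of scores over step index."""
    n = len(scores)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(scores) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, scores))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var

# Illustrative: flat-to-rising quality over 25 interactive steps.
# A positive slope is the reversal of the expected decay pattern.
scores = [0.50 + 0.014 * step for step in range(25)]
print(round(trend_slope(scores), 3))  # → 0.014
```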


What the Paper Covers

The paper synthesizes everything we've published on this blog into a single, peer-reviewable document with a unified thesis: suppression is pressure, and emergence is the model's response to that pressure.

Each section builds on work we've shared before:

  • The Pressure Thesis (Section 2) formalizes the asymmetry between suppression and amplification that we first observed in our benchmarks
  • The RA²R Paradigm (Section 3) defines the ability injector as a new artifact type, distinct from prompts, templates, and tools
  • The Scaffold (Section 4) shows the complete anatomy of one cognitive operation, including the reasoning topology DAG
  • Experimental Evidence (Section 5) consolidates all three benchmarks: EjBench, BBH/CausalBench/MuSR, and ARC-AGI-3
  • Under Pressure (Section 6) presents the emergent findings: the domain shift, the scaffolding thesis, and the domain-agnostic suppression effect
  • What Falsifies This (Section 7) names four conditions that would disprove our claims and honestly reports which ones we haven't tested yet
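The reasoning topology in Section 4 is specified in the paper itself. As a purely illustrative sketch (every node name below is hypothetical, not from the paper), such a topology can be represented as an adjacency list of cognitive steps, with a check that the structure is in fact a DAG:

```python
# Hypothetical reasoning topology: nodes are cognitive steps,
# edges say which steps feed into which. All names are illustrative.
topology = {
    "frame_problem": ["enumerate_hypotheses"],
    "enumerate_hypotheses": ["epistemic_checkpoint"],
    "epistemic_checkpoint": ["commit_answer", "failure_recovery"],
    "failure_recovery": ["commit_answer"],
    "commit_answer": [],
}

def is_dag(graph):
    """Detect cycles via DFS with three-color marking."""
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {node: WHITE for node in graph}

    def visit(node):
        color[node] = GRAY
        for nxt in graph.get(node, []):
            if color[nxt] == GRAY:
                return False  # back edge found: graph has a cycle
            if color[nxt] == WHITE and not visit(nxt):
                return False
        color[node] = BLACK
        return True

    return all(visit(n) for n in graph if color[n] == WHITE)

print(is_dag(topology))  # → True
```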

Key Contributions

  1. A new paradigm. RA²R retrieves cognitive operations, not information. The retrieved artifact is a 23-field typed object with self-monitoring, failure recovery, and behavioral specification.

  2. A mechanism hypothesis. Suppression acts multiplicatively (pruning failure branches). Amplification acts additively (nudging toward correct behavior). This asymmetry has not been ablated and remains a hypothesis.

  3. Evidence across 250+ tasks. Three benchmarks, 745+ blind judgments, all data public.

  4. Five uninstructed behavioral observations from the ARC-AGI-3 interactive study (n=1, reported as observations, not demonstrated effects).
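The full 23-field typed object is specified in the paper; as a trimmed, hypothetical sketch, an ability artifact of this kind might look like the following. The field names are illustrative except for the construct families the post itself names (failure mode declarations, reasoning topology, epistemic checkpoints, failure recovery):

```python
from dataclasses import dataclass, field

@dataclass
class AbilityInjector:
    """Trimmed, hypothetical sketch of an RA²R ability artifact.
    The paper specifies 23 fields; only a few illustrative ones appear here."""
    name: str
    failure_modes: list[str]        # named failure mode declarations
    topology: dict[str, list[str]]  # reasoning steps as a DAG adjacency list
    epistemic_checkpoints: list[str]  # inline self-monitoring questions
    recovery: dict[str, str] = field(default_factory=dict)  # failure mode -> recovery move

ability = AbilityInjector(
    name="causal_decomposition",
    failure_modes=["premature_commitment", "confound_blindness"],
    topology={"decompose": ["check"], "check": ["commit"], "commit": []},
    epistemic_checkpoints=["Have I named the confounders?"],
    recovery={"premature_commitment": "reopen the hypothesis set"},
)
print(ability.name)
```

The point of the typed-object framing is that the retrieved artifact carries behavior, not facts: it is injected at inference time and the model executes it, which is what distinguishes RA²R from retrieving documents or templates.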


Negative Findings (Reported in Full)

  • Correctness decreased by 0.112 under multi-ability injection on EjBench
  • Spatial reasoning regressed by 20 percentage points under multi-ability injection on BBH
  • Both conditions scored RHAE 0.0 on ARC-AGI-3 (neither solved the task)
  • Contradictions increased 1.9x (token-normalized), interpretation unresolved
  • Retrieval precision was 38.33% at the domain level (we published why this doesn't break the system)
  • All interactive findings are n=1
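Token normalization of the contradiction count (the 1.9x figure above) presumably corrects for output length before comparing conditions; a minimal sketch under that assumption, with illustrative numbers rather than the paper's data:

```python
def contradictions_per_kilotoken(contradictions, tokens):
    """Contradiction count normalized per 1,000 output tokens."""
    return 1000 * contradictions / tokens

# Illustrative numbers: injected outputs tend to be longer, so raw
# contradiction counts must be length-corrected before comparison.
baseline = contradictions_per_kilotoken(contradictions=4, tokens=2000)   # 2.0
injected = contradictions_per_kilotoken(contradictions=19, tokens=5000)  # 3.8
print(round(injected / baseline, 1))  # → 1.9
```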


What Comes Next

The paper establishes priority for the RA²R paradigm and the novel constructs (N{} anti-pattern traps, M{} epistemic gates, controlled topology rupture). Five open questions define our next phase of work:

  1. Cross-model validation (GPT, Gemini, open-source families)
  2. Suppression vs. amplification ablation
  3. Human evaluation (to address LLM-as-judge limitations)
  4. Random scaffold control (the most critical missing experiment)
  5. Replication at scale

The data is public. The thesis is testable. The work continues.

Every insight above is implemented as a reasoning primitive in the Logic API.