
62% of Tasks Got the Wrong Domain. It Didn't Matter.

We expected domain-matched retrieval to drive the improvement. If your agent asks a causal question and gets a causal ability, the scaffold should help. If it gets a spatial ability instead, it shouldn't. That's the obvious assumption.

The data broke it.


Retrieval Precision

When an agent calls the Logic API, the system matches the query against 311 abilities and returns the best match. We measured how often the returned ability matched the task's actual reasoning domain.
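Concretely, precision here is just the fraction of tasks whose returned ability came from the same domain as the task. A minimal sketch of the measurement, assuming a hypothetical log of (task domain, returned ability domain) pairs rather than the Logic API's actual schema:

```python
from collections import defaultdict

def retrieval_precision(records):
    """Per-domain and overall retrieval precision.

    `records` is a list of (task_domain, returned_domain) pairs, one per
    task -- an assumed log format for illustration, not the real schema.
    """
    totals = defaultdict(int)
    matches = defaultdict(int)
    for task_domain, returned_domain in records:
        totals[task_domain] += 1
        if returned_domain == task_domain:
            matches[task_domain] += 1
    per_domain = {d: matches[d] / totals[d] for d in totals}
    overall = sum(matches.values()) / sum(totals.values())
    return per_domain, overall

# A Causal task that retrieved a Spatial ability counts as a mismatch:
per_domain, overall = retrieval_precision([
    ("Causal", "Causal"),
    ("Causal", "Spatial"),
    ("Metacognition", "Causal"),
])
print(per_domain)  # {'Causal': 0.5, 'Metacognition': 0.0}
print(overall)     # 0.333...
```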

EjBench (180 custom tasks):

Domain          Tasks   Matched   Precision
Causal          30      23        76.7%
Spatial         30      16        53.3%
Temporal        30      14        46.7%
Simulation      30      11        36.7%
Abstraction     30      5         16.7%
Metacognition   30      0         0.0%
Overall         180     69        38.3%

Nearly two-thirds of tasks received abilities from the wrong cognitive domain. Metacognitive tasks (contradiction detection, bias identification, epistemic evaluation) received zero correctly matched abilities. Every scaffold they got was meant for a different type of reasoning.

BBH/CausalBench/MuSR (70 external tasks):

Overall precision: 50%. Same pattern: strong on Causal (92.5%), weak everywhere else.


The Paradox

Despite 62% domain mismatch, composite quality improved +10.1pp (Haki) and +9.0pp (Ki) on EjBench. On external benchmarks with 50% mismatch, Ki achieved +20.8pp.

The improvement was not concentrated in correctly-matched tasks. It was everywhere.

Metacognitive tasks received 0% matched abilities, yet improved +8.5pp (Haki). Every scaffold came from a different domain. The Causal, Spatial, or Simulation ability that landed on a Metacognitive task still improved the agent's reasoning.

Abstraction tasks received 17% matched abilities, yet showed the strongest Haki lift of any domain: +19.3pp. Five out of six scaffolds were "wrong," and the improvement was still larger than in domains with much higher matching rates.


Why It Works Anyway

The improvement doesn't come from the domain-specific content of the scaffold. It comes from the suppression signals.

Every ability, regardless of domain, carries suppression directives that block universal LLM failure modes. These aren't domain-specific errors; they're architectural shortcuts that transformers take regardless of what they're reasoning about (a sketch of how such an ability might be structured follows the list):

  • Premature stopping: accepting the first plausible answer without testing alternatives
  • Forward momentum bias: locking onto an early hypothesis and interpreting all subsequent evidence to confirm it
  • Surface-level analysis: producing a formatted answer that addresses the question superficially without tracing the mechanism
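
To make that concrete, here is a minimal sketch of an ability that carries both a domain-specific procedure and domain-agnostic suppression directives, plus the injection step. The field names and prompt format are assumptions for illustration, not the Logic API's actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class Ability:
    """Hypothetical ability shape: one domain-specific procedure plus
    domain-agnostic suppression directives (field names are assumed)."""
    domain: str
    procedure: str
    suppressions: list[str] = field(default_factory=list)

CAUSAL_ABILITY = Ability(
    domain="Causal",
    procedure="Trace each candidate cause through to its effect before concluding.",
    suppressions=[
        "Do not accept the first plausible answer; test alternatives.",
        "Do not lock onto an early hypothesis; re-weigh it against new evidence.",
        "Do not stop at a surface-level answer; trace the underlying mechanism.",
    ],
)

def inject(task_prompt: str, ability: Ability) -> str:
    """Prepend the scaffold to the task. Even when ability.domain does not
    match the task's domain, the suppression block still applies."""
    constraints = "\n".join(f"- {s}" for s in ability.suppressions)
    return f"{ability.procedure}\n\nConstraints:\n{constraints}\n\nTask: {task_prompt}"
```

On a mismatched task the procedure line may be irrelevant, but the constraints block targets shortcuts the model takes regardless of domain.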

A Causal ability that suppresses "forward momentum bias" helps a Metacognitive task (a murder mystery) just as much as a Causal task (root cause analysis), because the failure it blocks is the same failure. The agent anchored on the first suspicious evidence and stopped investigating; the suppression signal forced it to keep going.


Case Study: The Murder Mystery

On EjBench, murder mystery tasks are classified as Metacognitive (they require tracking multiple suspects, weighing evidence, and detecting contradictions). Retrieval precision for Metacognition was 0%: every mystery received a non-Metacognitive ability.

One task received a Causal ability. The Causal ability's suppression of "forward momentum bias" forced the agent to evaluate all suspects systematically instead of anchoring on the first suspicious one.

Baseline: Accused the first suspect with suspicious behavior. Stopped investigating after finding one piece of evidence.

With injection (mismatched domain): Systematically weighed motive, means, and opportunity for each suspect. Identified the correct perpetrator by eliminating alternatives.

The scaffold was "wrong": it was about causal chains, not murder mysteries. But the suppression signal was right: it blocked the exact shortcut the agent was taking.


What This Means

For users: Don't worry about whether the API returned the "right" domain for your task; the suppression signals help regardless. The quality of your query still matters for retrieval: a specific task description routes better than a vague one. But even imperfect routing produces measurable improvement.
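As a quick illustration of query specificity (the matcher stub below is hypothetical and stands in for the real retrieval step; only the contrast between the two queries matters):

```python
# Hypothetical stand-in for the ability matcher -- not the Logic API's SDK.
def retrieve_domain(query: str) -> str:
    if "contradiction" in query or "suspects" in query:
        return "Metacognition"
    return "Causal"  # fallback guess when the query carries no domain signal

vague = retrieve_domain("help me solve this puzzle")
specific = retrieve_domain(
    "murder mystery: track three suspects, weigh motive/means/opportunity, "
    "and flag contradictions between witness statements"
)
print(vague, specific)  # Causal Metacognition -- the specific query routes correctly
```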

For the product: The current results are a lower bound. If retrieval precision improved from 38% to 80%, the suppression signals would still fire, but they'd be paired with domain-matched procedural content that adds on top. Every percentage point of retrieval improvement compounds the existing lift.

For the field: Suppression may be more important than retrieval precision for reasoning augmentation systems. The ability to tell a model what NOT to do appears to transcend the specific reasoning context. This suggests that the highest-value component of a cognitive scaffold is not the procedure; it's the constraint.


Source Data

  • EjBench: 180 tasks, 38.3% retrieval precision, +10.1pp composite lift
  • BBH/CausalBench/MuSR: 70 tasks, 50% retrieval precision, +20.8pp composite lift
  • Full results: EjBench report · External benchmark report

These findings are part of our research paper: Under Pressure: RA²R and the Emergence of Uninstructed Reasoning Behaviors (PDF).

Every insight above is implemented as a reasoning primitive in the Logic API.