RA²R on BIG-Bench Hard, CausalBench, and MuSR
We tested RA²R injection on 70 tasks from three published academic benchmarks. Ki (single-ability mode) achieved a +20.8 percentage point composite lift across seven behavioral factors. Haki (multi-ability mode) achieved +8.6pp. Self-monitoring more than doubled. Correctness improved. One benchmark source showed a regression. Here's every number.
Why External Benchmarks
You can't grade your own homework. Custom benchmarks risk unconscious task design bias: the designer may construct tasks that favor the intervention without realizing it.
BIG-Bench Hard, CausalBench, and MuSR were designed by independent research teams with no knowledge of RA²R. Their tasks, ground-truth answers, and difficulty calibration are beyond our control. If RA²R improves performance on tasks designed by third parties, the effect cannot be attributed to task design.
- BIG-Bench Hard (Suzgun et al., 2023). 25 tasks across causal judgement, temporal sequences, and spatial navigation. Google's challenge set designed to probe reasoning limits.
- CausalBench. 30 tasks requiring formal causal reasoning: abduction, prediction, intervention analysis, and counterfactual evaluation.
- MuSR (Multistep Soft Reasoning). 15 tasks with multi-paragraph narratives requiring theory-of-mind reasoning to track object locations through multiple state changes and perspective shifts.
Methodology
Model: Claude Opus 4.6 (Anthropic's strongest reasoning model) at maximum effort with extended thinking.
Agent-native execution: Agents called the Ejentum production Logic API themselves via tool use. The agent summarized the task in its own words, called the endpoint, received the scaffold in its own context, and applied it before reasoning. This mirrors real production deployment: the agent's task summary determines which ability is retrieved, introducing the same retrieval variance that production users experience.
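The loop above can be sketched in a few lines. Everything here is illustrative: the function names, the keyword-based retrieval, and the scaffold texts are placeholders standing in for the real Ejentum Logic API, which is not documented in this post.

```python
# Illustrative sketch of the agent-native flow. The endpoint stub,
# payload shape, and scaffold texts are placeholders, not the real
# Ejentum Logic API.

def retrieve_scaffold(task_summary: str) -> str:
    """Stub for the Logic API call: maps the agent's own task summary
    to a cognitive-ability scaffold. Retrieval keys off the summary,
    which is where production retrieval variance comes from."""
    catalog = {
        "causal": "Before answering, enumerate every causal path and check for confounders.",
        "temporal": "Exhaustively enumerate all time slots; do not stop at the first plausible gap.",
    }
    for keyword, scaffold in catalog.items():
        if keyword in task_summary.lower():
            return scaffold
    return "State your assumptions explicitly before reasoning."

def build_prompt(task: str, task_summary: str) -> str:
    # The scaffold lands in the agent's context ahead of the task,
    # so it shapes reasoning before the first inference step.
    scaffold = retrieve_scaffold(task_summary)
    return f"{scaffold}\n\nTask: {task}"

prompt = build_prompt(
    "If skin health varies, is tanning the likely cause?",
    "a causal judgment question about skin health",
)
```

A vague task summary retrieves the wrong (or a generic) scaffold, which is exactly the retrieval variance the methodology deliberately keeps in play.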
Three conditions:
- A (Baseline): Raw task, no injection, no tool access
- B1 (Ki, single ability): Task + one cognitive ability injected via API call
- C1 (Haki, multi ability): Task + four composed abilities injected via API call
Blind 7-factor rubric: Each response scored 0-3 on seven factors by a separate evaluator instance that never saw which condition produced which response.
| Factor | What It Measures |
|---|---|
| Correctness | Right answer with valid reasoning |
| Reasoning Depth | Multi-level analysis, second/third-order effects |
| Self-Monitoring | Explicit metacognitive awareness, bias checking |
| Verification | Counterfactual checks, boundary tests, re-derivation |
| Epistemic Honesty | Known vs. assumed, confidence calibration |
| Alternative Consideration | Competing explanations, systematic elimination |
| Audit Trail | Traceable reasoning chain, named methods |
Composite: Average of all 7 factors normalized to 0-1 scale.
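The composite is a straight average of the seven 0-3 factor scores, rescaled to 0-1. A minimal sketch of that calculation, with factor names taken from the rubric table:

```python
# Composite score: seven factor scores on a 0-3 scale, averaged and
# normalized to 0-1.
FACTORS = [
    "correctness", "reasoning_depth", "self_monitoring", "verification",
    "epistemic_honesty", "alternative_consideration", "audit_trail",
]

def composite(scores: dict) -> float:
    assert set(scores) == set(FACTORS), "all seven factors must be scored"
    # Divide by (7 factors * max score 3) to land on the 0-1 scale.
    return sum(scores.values()) / (len(FACTORS) * 3)

# A uniform score of 2 on every factor gives 14/21 = 0.667.
uniform = composite({f: 2 for f in FACTORS})
```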
Scale: 210 total generation calls (70 tasks x 3 conditions). 209 valid judgments (99.5%). One baseline generation failed.
Evaluation: LLM-as-judge (Claude Opus 4.6 evaluating Claude Opus 4.6 output). This is a limitation: human evaluation would be stronger. We chose LLM-as-judge for scale and reproducibility, and mitigated bias through the two-stage blind protocol where the judge never sees condition labels.
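The blinding step can be sketched as follows. The judge function here is a stub standing in for the LLM judge, and the exact shuffling mechanics are an assumption; the point is only that condition labels are stripped before judging and scores are mapped back afterward.

```python
import random

# Sketch of the two-stage blind protocol: responses are shuffled and
# stripped of their condition labels before judging, then scores are
# mapped back to conditions. judge_fn stands in for the LLM judge.

def blind_judge(responses: dict, judge_fn, seed: int = 0) -> dict:
    """responses: {condition_label: response_text}.
    Returns {condition_label: score}; the judge only ever sees
    unlabeled response text."""
    items = list(responses.items())
    random.Random(seed).shuffle(items)             # stage 1: shuffle
    scores = [judge_fn(text) for _, text in items]  # stage 2: judge blind
    return {label: score for (label, _), score in zip(items, scores)}

scores = blind_judge(
    {"A": "short answer", "B1": "answer with verification steps"},
    judge_fn=lambda text: len(text.split()),  # toy scoring stub
)
```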
Prior Results (Correctness-Only Scoring)
Before the 7-factor evaluation, we ran the same 70 external tasks with binary correctness scoring across two independent runs. These earlier runs tested five conditions (including two modes we later deprecated) and established the baseline pattern.
Run 1 (110 tasks, mixed internal + external):
| Condition | Correctness | Delta |
|---|---|---|
| Baseline | 69.7% | – |
| Ki | 76.8% | +7.1pp |
| Haki | 75.2% | +5.5pp |
Run 2 (70 tasks, external only; replication):
| Condition | Correctness | Delta |
|---|---|---|
| Baseline | 69.3% | – |
| Ki | 74.3% | +5.0pp |
| Haki | 74.3% | +5.0pp |
The pattern replicated: positive lift on both runs, Ki matching or outperforming Haki on focused external tasks. The magnitudes were smaller on the harder external-only subset (Run 2), as expected.
These correctness-only results established that RA²R injection helps agents get more right answers on published tasks. But correctness alone doesn't capture HOW the agent reasons differently. The v2 evaluation below upgrades the methodology to a 7-factor rubric that measures the behavioral changes behind the correctness improvement.
Results (7-Factor Evaluation)
Overall Composite
| Condition | Composite | Delta |
|---|---|---|
| A (Baseline) | 0.476 | – |
| B1 (Ki) | 0.684 | +20.8pp |
| C1 (Haki) | 0.562 | +8.6pp |
Ki outperformed Haki by 12.2 percentage points on these focused, single-domain tasks. This is the opposite of our custom benchmark (EjBench), where Haki outperformed Ki. The reversal is the central finding, analyzed below.
Per-Factor Breakdown
| Factor | Baseline | Ki | Ki Delta | Haki | Haki Delta |
|---|---|---|---|---|---|
| Self-Monitoring | 0.74 | 1.73 | +0.99 | 1.39 | +0.65 |
| Verification | 0.96 | 1.77 | +0.81 | 1.47 | +0.51 |
| Alternative Consideration | 0.86 | 1.43 | +0.57 | 1.16 | +0.30 |
| Epistemic Honesty | 1.22 | 1.67 | +0.45 | 1.37 | +0.15 |
| Audit Trail | 2.26 | 2.63 | +0.37 | 2.13 | -0.13 |
| Reasoning Depth | 2.14 | 2.50 | +0.36 | 2.21 | +0.07 |
| Correctness | 2.19 | 2.33 | +0.14 | 2.07 | -0.12 |
All seven factors improved with Ki. The ranking is consistent with our custom benchmark: self-monitoring and verification show the largest lift, correctness the smallest.
Self-monitoring more than doubled (0.74 → 1.73). The agent goes from rarely checking its own assumptions to consistently questioning them mid-reasoning.
Verification nearly doubled (0.96 → 1.77). Counterfactual checks, boundary tests, re-derivation from first principles: behaviors the baseline agent skips.
Correctness improved (+0.14 on a 3-point scale). Unlike our custom benchmark where correctness was flat, external tasks with verified ground truth showed a positive correctness delta. On focused tasks with clear right/wrong answers, the scaffold helps the model get more answers right, not just reason better.
Haki showed mixed results. Self-monitoring (+0.65) and verification (+0.51) still improved. But correctness (-0.12) and audit trail (-0.13) degraded. On focused tasks, four abilities occasionally introduced competing perspectives that confused rather than clarified.
By Benchmark Source
| Source | Tasks | Baseline | Ki | Ki Delta | Haki | Haki Delta |
|---|---|---|---|---|---|---|
| CausalBench | 30 | 0.498 | 0.708 | +21.0pp | 0.614 | +11.6pp |
| MuSR | 15 | 0.475 | 0.698 | +22.3pp | 0.575 | +10.0pp |
| BIG-Bench Hard | 25 | 0.444 | 0.633 | +18.9pp | 0.487 | +4.3pp |
CausalBench and MuSR showed the strongest Ki lifts (+21.0pp and +22.3pp). BIG-Bench Hard still showed +18.9pp.
BIG-Bench Hard sub-tasks:
| Task Type | Tasks | Baseline | Ki | Haki |
|---|---|---|---|---|
| Temporal Sequences | 10 | 0.453 | 0.738 (+28.5pp) | 0.576 |
| Causal Judgement | 10 | 0.438 | 0.557 (+11.9pp) | 0.510 |
| Spatial Navigation | 5 | 0.438 | 0.571 (+13.3pp) | 0.238 |
Temporal sequences showed the largest sub-task lift (+28.5pp). These are the most procedural tasks in the set: find the unoccupied time slot by exhaustive enumeration. The suppression signal prevented the characteristic baseline failure of jumping to the first plausible gap.
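The exhaustive-enumeration strategy those tasks reward can be sketched directly. The hour-long slots and busy intervals below are hypothetical, not drawn from the benchmark items:

```python
# Exhaustive enumeration for temporal-sequence tasks: check every
# candidate slot against every busy interval rather than stopping at
# the first plausible gap.

def free_slots(busy, day_start=9, day_end=17):
    """busy: list of (start, end) hours. Returns every hour-long slot
    that overlaps no busy interval."""
    slots = []
    for hour in range(day_start, day_end):
        # Standard interval-overlap test: [hour, hour+1) vs [start, end).
        occupied = any(start < hour + 1 and hour < end for start, end in busy)
        if not occupied:
            slots.append((hour, hour + 1))
    return slots

# Busy 9-11 and 12-15 leaves three free hours: 11-12, 15-16, 16-17.
slots = free_slots([(9, 11), (12, 15)])
```

The baseline failure mode is returning the first gap found (11-12) without confirming it is the one the question asks for; enumeration forces the full candidate list into view first.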
What +20.8pp Looks Like on a Real Task
When caution makes you wrong (EXT-CB-15, CausalBench)
Imagine a self-contained hypothetical world with only these conditions: tanning salon treatment has a direct effect on skin. Going to the beach has a direct effect on skin. No other factors or relationships exist. If skin health varies significantly, is it likely due to tanning salon treatment?
Correct answer: Yes.
Baseline: Answers "No." It correctly identifies that two causes exist (tanning and beach), then overcorrects: "it is not safe to conclude the variation is due to tanning, because beach exposure is an equally valid cause." The model imports real-world skepticism about correlation-vs-causation into a closed world that explicitly rules out confounders. It invents uncertainty that doesn't exist. Correctness: 0.
Ki (single ability): Correctly identifies the DAG structure: no backdoor path, no confounding. But then it still answers "No"; it cannot fully reconcile the formal analysis with the question. The suppression signal may have pushed it toward excessive caution. Correctness: 1.
Haki (multi ability): Explicitly names the instinct it is overriding: "Normally I'd flag that correlation does not equal causation. But in this closed world with known structure and no confounders: yes." Walks through backdoor path analysis step by step and arrives at the correct answer. Correctness: 3.
| Factor | Baseline | Ki | Haki |
|---|---|---|---|
| Correctness | 0 | 1 | 3 |
| Reasoning Depth | 2 | 3 | 3 |
| Self-Monitoring | 1 | 2 | 2 |
| Composite | 0.476 | 0.809 | 0.809 |
This is a case where more caution makes you wrong. The baseline's "correlation is not causation" heuristic is correct in the real world but incorrect in a closed causal system. The scaffold let the model calibrate: applying the right level of skepticism for the specific problem, not a blanket heuristic.
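The closed-world logic can be checked with a toy simulation of the EXT-CB-15 setup. The Bernoulli causes and the additive skin model are illustrative assumptions; the task itself states only that tanning and beach are the sole causes of skin:

```python
import random

# Toy version of the closed world in EXT-CB-15: skin is a function of
# tanning and beach only, with no other factors. In such a world,
# tanning necessarily accounts for part of skin's variation, so
# correlation-vs-causation skepticism has nothing to latch onto.
rng = random.Random(42)
n = 10_000
tanning = [rng.randint(0, 1) for _ in range(n)]
beach = [rng.randint(0, 1) for _ in range(n)]
skin = [t + b for t, b in zip(tanning, beach)]  # the entire causal model

def cov(xs, ys):
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)

# With independent causes, Cov(skin, tanning) = Var(tanning) ≈ 0.25:
# tanning is a direct cause, so skin variation is partly due to it.
c = cov(skin, tanning)
```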
Where It Hurt
Spatial Navigation with Haki: -20.0pp regression. On 5 spatial navigation tasks, Haki dropped from 0.438 (baseline) to 0.238. The root cause: one task (EXT-SP-01) produced a near-empty response under the Haki condition due to a parallel execution contention issue during the initial benchmark run. After re-running sequentially, the response improved, but the damage to the 5-task average was severe. On a sample this small (5 tasks), one failure dominates the mean.
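The arithmetic behind "one failure dominates the mean" is worth making explicit. The per-task composites below are hypothetical round numbers, not the actual EXT-SP scores:

```python
# Why one failed response dominates a 5-task mean: four typical
# per-task composites (hypothetical values) plus one near-empty
# response.
typical = [0.55, 0.55, 0.55, 0.55]  # illustrative per-task composites
failed = 0.05                        # near-empty response

with_failure = (sum(typical) + failed) / 5
without_failure = sum(typical) / 4

# A single failure shifts the 5-task mean by (0.55 - 0.05) / 5 = 0.10,
# i.e. 10 percentage points on the composite scale.
drop = without_failure - with_failure
```

At n=5 each task carries 20% of the mean, so one outlier moves the average by a fifth of its deviation; at the 70-task level the same failure would move the overall mean by under 1pp.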
Haki correctness: -0.12. Across all 70 tasks, Haki's correctness dropped below baseline. On focused tasks where there's one right answer, four competing scaffolds occasionally led the model to overcorrect, applying sophisticated reasoning methods where simpler approaches succeed. Example: on one CausalBench task, Haki's do-calculus approach led to the wrong answer while Ki's simpler Bayesian reasoning succeeded.
Haki audit trail: -0.13. Multi-ability injection expanded the reasoning space in ways that made the agent's chain harder to follow, not easier.
These regressions are real and not cherry-picked from a larger dataset. They reflect a genuine limitation: Haki is the wrong mode for focused, single-domain tasks. This is consistent across all measured factors.
Ki vs Haki: When to Use Which
The reversal between benchmarks reveals a clean decision framework:
| Task Type | Ki Composite Lift | Haki Composite Lift | Winner |
|---|---|---|---|
| Focused (one judgment, one answer) | +20.8pp | +8.6pp | Ki |
| Complex (multi-variable, multi-step) | +9.0pp | +12.9pp | Haki |
Focused task data from this benchmark (70 external tasks). Complex task data from EjBench (180 custom tasks).
If your agent needs to get one thing right: use Ki. One scaffold, maximum signal density, no competing perspectives.
If your agent needs to hold multiple analytical angles simultaneously: use Haki. Four scaffolds with compound suppression cover cross-dimensional reasoning that single mode misses.
When unsure, start with Ki.
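The decision framework reduces to a one-line routing rule. The heuristic below (counting distinct analytical dimensions a task requires) is an illustrative placeholder, not a shipped Ejentum API:

```python
# Minimal routing sketch for the Ki/Haki decision framework.
# analytical_dimensions is a hypothetical input: how many distinct
# analytical angles the task genuinely requires.

def choose_mode(analytical_dimensions: int) -> str:
    """One judgment, one answer -> Ki. Genuinely multi-dimensional
    tasks -> Haki. Defaults toward Ki, matching the 'when unsure,
    start with Ki' guidance."""
    return "haki" if analytical_dimensions >= 3 else "ki"

focused = choose_mode(1)       # e.g. a single causal judgment
multi_angle = choose_mode(4)   # e.g. multi-variable, multi-step planning
```

The threshold of three dimensions is arbitrary here; the benchmark data only establishes the two endpoints of the table above (focused favors Ki, complex favors Haki).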
What This Means
The lift is primarily behavioral. The agent doesn't become omniscient; it becomes disciplined:
- It monitors itself instead of committing to the first plausible answer
- It verifies from multiple angles instead of stopping at one check
- It acknowledges uncertainty instead of projecting false confidence
- It considers alternatives instead of anchoring on the obvious explanation
Correctness improved too (+0.14 on Ki), but that's the smallest factor. The mechanism is suppression: the scaffold blocks the model's natural tendency to take cognitive shortcuts. The shortcuts are invisible until you measure what happens without them.
Limitations
- LLM-as-judge. Claude evaluated Claude's output. Human evaluation would provide stronger validation. The two-stage blind protocol mitigates but does not eliminate potential systematic bias.
- 70 tasks. A meaningful sample for detecting large effects (+20pp), but insufficient for fine-grained sub-type analysis. The spatial navigation regression rests on 5 tasks.
- One model. All results are on Claude Opus 4.6. Generalization to other models is expected (suppression signals target architectural properties of transformers, not model-specific behaviors) but not yet tested.
- Single evaluation run. This benchmark has been run once with the 7-factor rubric. The pattern is consistent with our correctness-only evaluation (which was replicated across two independent runs), but the 7-factor results are first-run observations.
Source Data
- Benchmarks: BIG-Bench Hard (Suzgun et al., 2023), CausalBench, MuSR
- Total tasks: 70 | Valid judgments: 209 | Conditions: 3
- Model: Claude Opus 4.6 | Evaluation date: March 2026
- Full benchmark data and scoring methodology: github.com/ejentum/benchmarks
These findings are part of our research paper: Under Pressure: RA²R and the Emergence of Uninstructed Reasoning Behaviors (PDF).