RA²R on BIG-Bench Hard, CausalBench, and MuSR
We tested RA²R injection on 70 tasks from three published academic benchmarks. Ki (single-ability mode) achieved a +20.8 percentage point composite lift across seven behavioral factors. Haki (multi-ability mode) achieved +8.6pp. Self-monitoring more than doubled. Correctness improved. One benchmark source showed a regression. Here's every number.
Why External Benchmarks
You can't grade your own homework. Custom benchmarks risk unconscious task design bias: the designer may construct tasks that favor the intervention without realizing it.
BIG-Bench Hard, CausalBench, and MuSR were designed by independent research teams with no knowledge of RA²R. Their tasks, ground-truth answers, and difficulty calibration are beyond our control. If RA²R improves performance on tasks designed by third parties, the effect cannot be attributed to task design.
- BIG-Bench Hard (Suzgun et al., 2023). 25 tasks across causal judgement, temporal sequences, and spatial navigation. Google's challenge set designed to probe reasoning limits.
- CausalBench. 30 tasks requiring formal causal reasoning: abduction, prediction, intervention analysis, and counterfactual evaluation.
- MuSR (Multistep Soft Reasoning). 15 tasks with multi-paragraph narratives requiring theory-of-mind reasoning to track object locations through multiple state changes and perspective shifts.
Methodology
Model: Claude Opus 4.6 (Anthropic's strongest reasoning model) at maximum effort with extended thinking.
Agent-native execution: Agents called the Ejentum production Logic API themselves via tool use. The agent summarized the task in its own words, called the endpoint, received the scaffold in its own context, and applied it before reasoning. This mirrors real production deployment: the agent's task summary determines which ability is retrieved, introducing the same retrieval variance that production users experience.
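The loop above can be sketched in a few lines. Everything here is illustrative: the function names, the keyword-based retrieval, and the scaffold texts are placeholders standing in for the real Ejentum Logic API, which is not documented in this post.

```python
# Illustrative sketch of the agent-native flow. The endpoint stub,
# payload shape, and scaffold texts are placeholders, not the real
# Ejentum Logic API.

def retrieve_scaffold(task_summary: str) -> str:
    """Stub for the Logic API call: maps the agent's own task summary
    to a cognitive-ability scaffold. Retrieval keys off the summary,
    which is where production retrieval variance comes from."""
    catalog = {
        "causal": "Before answering, enumerate every causal path and check for confounders.",
        "temporal": "Exhaustively enumerate all time slots; do not stop at the first plausible gap.",
    }
    for keyword, scaffold in catalog.items():
        if keyword in task_summary.lower():
            return scaffold
    return "State your assumptions explicitly before reasoning."

def build_prompt(task: str, task_summary: str) -> str:
    # The scaffold lands in the agent's context ahead of the task,
    # so it shapes reasoning before the first inference step.
    scaffold = retrieve_scaffold(task_summary)
    return f"{scaffold}\n\nTask: {task}"

prompt = build_prompt(
    "If skin health varies, is tanning the likely cause?",
    "a causal judgment question about skin health",
)
```

A vague task summary retrieves the wrong (or a generic) scaffold, which is exactly the retrieval variance the methodology deliberately keeps in play.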
Three conditions:
- A (Baseline): Raw task, no injection, no tool access
- B1 (Ki, single ability): Task + one cognitive ability injected via API call
- C1 (Haki, multi ability): Task + four composed abilities injected via API call
Blind 7-factor rubric: Each response scored 0-3 on seven factors by a separate evaluator instance that never saw which condition produced which response.
| Factor | What It Measures |
|---|---|
| Correctness | Right answer with valid reasoning |
| Reasoning Depth | Multi-level analysis, second/third-order effects |
| Self-Monitoring | Explicit metacognitive awareness, bias checking |
| Verification | Counterfactual checks, boundary tests, re-derivation |
| Epistemic Honesty | Known vs. assumed, confidence calibration |
| Alternative Consideration | Competing explanations, systematic elimination |
| Audit Trail | Traceable reasoning chain, named methods |
Composite: Average of all 7 factors normalized to 0-1 scale.
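The composite is a straight average of the seven 0-3 factor scores, rescaled to 0-1. A minimal sketch of that calculation, with factor names taken from the rubric table:

```python
# Composite score: seven factor scores on a 0-3 scale, averaged and
# normalized to 0-1.
FACTORS = [
    "correctness", "reasoning_depth", "self_monitoring", "verification",
    "epistemic_honesty", "alternative_consideration", "audit_trail",
]

def composite(scores: dict) -> float:
    assert set(scores) == set(FACTORS), "all seven factors must be scored"
    # Divide by (7 factors * max score 3) to land on the 0-1 scale.
    return sum(scores.values()) / (len(FACTORS) * 3)

# A uniform score of 2 on every factor gives 14/21 = 0.667.
uniform = composite({f: 2 for f in FACTORS})
```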
Scale: 210 total generation calls (70 tasks x 3 conditions). 209 valid judgments (99.5%). One baseline generation failed.
Evaluation: LLM-as-judge (Claude Opus 4.6 evaluating Claude Opus 4.6 output). This is a limitation: human evaluation would be stronger. We chose LLM-as-judge for scale and reproducibility, and mitigated bias through the two-stage blind protocol where the judge never sees condition labels.
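The blinding step can be sketched as follows. The judge function here is a stub standing in for the LLM judge, and the exact shuffling mechanics are an assumption; the point is only that condition labels are stripped before judging and scores are mapped back afterward.

```python
import random

# Sketch of the two-stage blind protocol: responses are shuffled and
# stripped of their condition labels before judging, then scores are
# mapped back to conditions. judge_fn stands in for the LLM judge.

def blind_judge(responses: dict, judge_fn, seed: int = 0) -> dict:
    """responses: {condition_label: response_text}.
    Returns {condition_label: score}; the judge only ever sees
    unlabeled response text."""
    items = list(responses.items())
    random.Random(seed).shuffle(items)             # stage 1: shuffle
    scores = [judge_fn(text) for _, text in items]  # stage 2: judge blind
    return {label: score for (label, _), score in zip(items, scores)}

scores = blind_judge(
    {"A": "short answer", "B1": "answer with verification steps"},
    judge_fn=lambda text: len(text.split()),  # toy scoring stub
)
```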
Prior Results (Correctness-Only Scoring)
Before the 7-factor evaluation, we ran the same 70 external tasks with binary correctness scoring across two independent runs. These earlier runs tested five conditions (including two modes we later deprecated) and established the baseline pattern.
Run 1 (110 tasks, mixed internal + external):
| Condition | Correctness | Delta |
|---|---|---|
| Baseline | 69.7% | – |
| Ki | 76.8% | +7.1pp |
| Haki | 75.2% | +5.5pp |
Run 2 (70 tasks, external only; replication):
| Condition | Correctness | Delta |
|---|---|---|
| Baseline | 69.3% | – |
| Ki | 74.3% | +5.0pp |
| Haki | 74.3% | +5.0pp |
The pattern replicated: positive lift on both runs, Ki matching or outperforming Haki on focused external tasks. The magnitudes were smaller on the harder external-only subset (Run 2), as expected.
These correctness-only results established that RA²R injection helps agents get more right answers on published tasks. But correctness alone doesn't capture HOW the agent reasons differently. The v2 evaluation below upgrades the methodology to a 7-factor rubric that measures the behavioral changes behind the correctness improvement.
Results (7-Factor Evaluation)
Overall Composite
| Condition | Composite | Delta |
|---|---|---|
| A (Baseline) | 0.476 | – |
| B1 (Ki) | 0.684 | +20.8pp |
| C1 (Haki) | 0.562 | +8.6pp |
Ki outperformed Haki by 12.2 percentage points on these focused, single-domain tasks. This is the opposite of our custom benchmark (EjBench), where Haki outperformed Ki. The reversal is the central finding, analyzed below.
Per-Factor Breakdown
| Factor | Baseline | Ki | Ki Delta | Haki | Haki Delta |
|---|---|---|---|---|---|
| Self-Monitoring | 0.74 | 1.73 | +0.99 | 1.39 | +0.65 |
| Verification | 0.96 | 1.77 | +0.81 | 1.47 | +0.51 |
| Alternative Consideration | 0.86 | 1.43 | +0.57 | 1.16 | +0.30 |
| Epistemic Honesty | 1.22 | 1.67 | +0.45 | 1.37 | +0.15 |
| Audit Trail | 2.26 | 2.63 | +0.37 | 2.13 | -0.13 |
| Reasoning Depth | 2.14 | 2.50 | +0.36 | 2.21 | +0.07 |
| Correctness | 2.19 | 2.33 | +0.14 | 2.07 | -0.12 |
All seven factors improved with Ki. The ranking is consistent with our custom benchmark: self-monitoring and verification show the largest lift, correctness the smallest.
Self-monitoring more than doubled (0.74 → 1.73). The agent goes from rarely checking its own assumptions to consistently questioning them mid-reasoning.
Verification nearly doubled (0.96 → 1.77). Counterfactual checks, boundary tests, re-derivation from first principles: behaviors the baseline agent skips.
Correctness improved (+0.14 on a 3-point scale). Unlike our custom benchmark where correctness was flat, external tasks with verified ground truth showed a positive correctness delta. On focused tasks with clear right/wrong answers, the scaffold helps the model get more answers right, not just reason better.
Haki showed mixed results. Self-monitoring (+0.65) and verification (+0.51) still improved. But correctness (-0.12) and audit trail (-0.13) degraded. On focused tasks, four abilities occasionally introduced competing perspectives that confused rather than clarified.
By Benchmark Source
| Source | Tasks | Baseline | Ki | Ki Delta | Haki | Haki Delta |
|---|---|---|---|---|---|---|
| CausalBench | 30 | 0.498 | 0.708 | +21.0pp | 0.614 | +11.6pp |
| MuSR | 15 | 0.475 | 0.698 | +22.3pp | 0.575 | +10.0pp |
| BIG-Bench Hard | 25 | 0.444 | 0.633 | +18.9pp | 0.487 | +4.3pp |
CausalBench and MuSR showed the strongest Ki lifts (+21.0pp and +22.3pp). BIG-Bench Hard still showed +18.9pp.
BIG-Bench Hard sub-tasks:
| Task Type | Tasks | Baseline | Ki | Haki |
|---|---|---|---|---|
| Temporal Sequences | 10 | 0.453 | 0.738 (+28.5pp) | 0.576 |
| Causal Judgement | 10 | 0.438 | 0.557 (+11.9pp) | 0.510 |
| Spatial Navigation | 5 | 0.438 | 0.571 (+13.3pp) | 0.238 |
Temporal sequences showed the largest sub-task lift (+28.5pp). These are the most procedural tasks in the set: find the unoccupied time slot by exhaustive enumeration. The suppression signal prevented the characteristic baseline failure of jumping to the first plausible gap.
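The exhaustive-enumeration strategy those tasks reward can be sketched directly. The hour-long slots and busy intervals below are hypothetical, not drawn from the benchmark items:

```python
# Exhaustive enumeration for temporal-sequence tasks: check every
# candidate slot against every busy interval rather than stopping at
# the first plausible gap.

def free_slots(busy, day_start=9, day_end=17):
    """busy: list of (start, end) hours. Returns every hour-long slot
    that overlaps no busy interval."""
    slots = []
    for hour in range(day_start, day_end):
        # Standard interval-overlap test: [hour, hour+1) vs [start, end).
        occupied = any(start < hour + 1 and hour < end for start, end in busy)
        if not occupied:
            slots.append((hour, hour + 1))
    return slots

# Busy 9-11 and 12-15 leaves three free hours: 11-12, 15-16, 16-17.
slots = free_slots([(9, 11), (12, 15)])
```

The baseline failure mode is returning the first gap found (11-12) without confirming it is the one the question asks for; enumeration forces the full candidate list into view first.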
What +20.8pp Looks Like on a Real Task
When caution makes you wrong (EXT-CB-15, CausalBench)
Imagine a self-contained hypothetical world with only these conditions: tanning salon treatment has a direct effect on skin. Going to the beach has a direct effect on skin. No other factors or relationships exist. If skin health varies significantly, is it likely due to tanning salon treatment?
Correct answer: Yes.
Baseline: Answers "No." It correctly identifies that two causes exist (tanning and beach), then overcorrects: "it is not safe to conclude the variation is due to tanning, because beach exposure is an equally valid cause." The model imports real-world skepticism about correlation-vs-causation into a closed world that explicitly rules out confounders. It invents uncertainty that doesn't exist. Correctness: 0.
Ki (single ability): Correctly identifies the DAG structure: no backdoor path, no confounding. But then it still answers "No"; it cannot fully reconcile the formal analysis with the question. The suppression signal may have pushed it toward excessive caution. Correctness: 1.
Haki (multi ability): Explicitly names the instinct it is overriding: "Normally I'd flag that correlation does not equal causation. But in this closed world with known structure and no confounders: yes." Walks through backdoor path analysis step by step and arrives at the correct answer. Correctness: 3.
| Factor | Baseline | Ki | Haki |
|---|---|---|---|
| Correctness | 0 | 1 | 3 |
| Reasoning Depth | 2 | 3 | 3 |
| Self-Monitoring | 1 | 2 | 2 |
| Composite | 0.476 | 0.809 | 0.809 |
This is a case where more caution makes you wrong. The baseline's "correlation is not causation" heuristic is correct in the real world but incorrect in a closed causal system. The scaffold let the model calibrate: applying the right level of skepticism for the specific problem, not a blanket heuristic.
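The closed-world logic can be checked with a toy simulation of the EXT-CB-15 setup. The Bernoulli causes and the additive skin model are illustrative assumptions; the task itself states only that tanning and beach are the sole causes of skin:

```python
import random

# Toy version of the closed world in EXT-CB-15: skin is a function of
# tanning and beach only, with no other factors. In such a world,
# tanning necessarily accounts for part of skin's variation, so
# correlation-vs-causation skepticism has nothing to latch onto.
rng = random.Random(42)
n = 10_000
tanning = [rng.randint(0, 1) for _ in range(n)]
beach = [rng.randint(0, 1) for _ in range(n)]
skin = [t + b for t, b in zip(tanning, beach)]  # the entire causal model

def cov(xs, ys):
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)

# With independent causes, Cov(skin, tanning) = Var(tanning) ≈ 0.25:
# tanning is a direct cause, so skin variation is partly due to it.
c = cov(skin, tanning)
```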
Where It Hurt
Spatial Navigation with Haki: -20.0pp regression. On 5 spatial navigation tasks, Haki dropped from 0.438 (baseline) to 0.238. The root cause: one task (EXT-SP-01) produced a near-empty response under the Haki condition due to a parallel execution contention issue during the initial benchmark run. After re-running sequentially, the response improved, but the damage to the 5-task average was severe. On a sample this small (5 tasks), one failure dominates the mean.
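The arithmetic behind "one failure dominates the mean" is worth making explicit. The per-task composites below are hypothetical round numbers, not the actual EXT-SP scores:

```python
# Why one failed response dominates a 5-task mean: four typical
# per-task composites (hypothetical values) plus one near-empty
# response.
typical = [0.55, 0.55, 0.55, 0.55]  # illustrative per-task composites
failed = 0.05                        # near-empty response

with_failure = (sum(typical) + failed) / 5
without_failure = sum(typical) / 4

# A single failure shifts the 5-task mean by (0.55 - 0.05) / 5 = 0.10,
# i.e. 10 percentage points on the composite scale.
drop = without_failure - with_failure
```

At n=5 each task carries 20% of the mean, so one outlier moves the average by a fifth of its deviation; at the 70-task level the same failure would move the overall mean by under 1pp.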
Haki correctness: -0.12. Across all 70 tasks, Haki's correctness dropped below baseline. On focused tasks where there's one right answer, four competing scaffolds occasionally led the model to overcorrect, applying sophisticated reasoning methods where simpler approaches succeed. Example: on one CausalBench task, Haki's do-calculus approach led to the wrong answer while Ki's simpler Bayesian reasoning succeeded.
Haki audit trail: -0.13. Multi-ability injection expanded the reasoning space in ways that made the agent's chain harder to follow, not easier.
These regressions are real and not cherry-picked from a larger dataset. They reflect a genuine limitation: Haki is the wrong mode for focused, single-domain tasks. This is consistent across all measured factors.
Ki vs Haki: When to Use Which
The reversal between benchmarks reveals a clean decision framework:
| Task Type | Ki Composite Lift | Haki Composite Lift | Winner |
|---|---|---|---|
| Focused (one judgment, one answer) | +20.8pp | +8.6pp | Ki |
| Complex (multi-variable, multi-step) | +9.0pp | +12.9pp | Haki |
Focused task data from this benchmark (70 external tasks). Complex task data from EjBench (180 custom tasks).
If your agent needs to get one thing right: use Ki. One scaffold, maximum signal density, no competing perspectives.
If your agent needs to hold multiple analytical angles simultaneously: use Haki. Four scaffolds with compound suppression cover cross-dimensional reasoning that single mode misses.
When unsure, start with Ki.
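The decision framework reduces to a one-line routing rule. The heuristic below (counting distinct analytical dimensions a task requires) is an illustrative placeholder, not a shipped Ejentum API:

```python
# Minimal routing sketch for the Ki/Haki decision framework.
# analytical_dimensions is a hypothetical input: how many distinct
# analytical angles the task genuinely requires.

def choose_mode(analytical_dimensions: int) -> str:
    """One judgment, one answer -> Ki. Genuinely multi-dimensional
    tasks -> Haki. Defaults toward Ki, matching the 'when unsure,
    start with Ki' guidance."""
    return "haki" if analytical_dimensions >= 3 else "ki"

focused = choose_mode(1)       # e.g. a single causal judgment
multi_angle = choose_mode(4)   # e.g. multi-variable, multi-step planning
```

The threshold of three dimensions is arbitrary here; the benchmark data only establishes the two endpoints of the table above (focused favors Ki, complex favors Haki).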
What This Means
The lift is primarily behavioral. The agent doesn't become omniscient; it becomes disciplined:
- It monitors itself instead of committing to the first plausible answer
- It verifies from multiple angles instead of stopping at one check
- It acknowledges uncertainty instead of projecting false confidence
- It considers alternatives instead of anchoring on the obvious explanation
Correctness improved too (+0.14 on Ki), but that's the smallest factor. The mechanism is suppression: the scaffold blocks the model's natural tendency to take cognitive shortcuts. The shortcuts are invisible until you measure what happens without them.
Limitations
- LLM-as-judge. Claude evaluated Claude's output. Human evaluation would provide stronger validation. The two-stage blind protocol mitigates but does not eliminate potential systematic bias.
- 70 tasks. A meaningful sample for detecting large effects (+20pp), but insufficient for fine-grained sub-type analysis. The spatial navigation regression rests on 5 tasks.
- One model. All results are on Claude Opus 4.6. Generalization to other models is expected (suppression signals target architectural properties of transformers, not model-specific behaviors) but not yet tested.
- Single evaluation run. This benchmark has been run once with the 7-factor rubric. The pattern is consistent with our correctness-only evaluation (which was replicated across two independent runs), but the 7-factor results are first-run observations.
Source Data
- Benchmarks: BIG-Bench Hard (Suzgun et al., 2023), CausalBench, MuSR
- Total tasks: 70 | Valid judgments: 209 | Conditions: 3
- Model: Claude Opus 4.6 | Evaluation date: March 2026
- Full benchmark data and scoring methodology: github.com/ejentum/benchmarks
These findings are part of our research paper: Under Pressure: RA²R and the Emergence of Uninstructed Reasoning Behaviors (PDF).