EjBench: 180 Professional Tasks, Agent-Native, Blind
We built 180 tasks that test what production agents actually fail at. Haki (multi-ability) achieved a +10.1 percentage point composite lift across seven behavioral factors. Self-monitoring nearly doubled. Verification increased 44%. Correctness didn't improve. That's the point.
Why We Built Our Own Benchmark
Published benchmarks test academic reasoning. They're essential for external validity, and we tested on those too. But they don't capture what breaks in production.
Production agents fail differently. They produce plausible-sounding analysis that misses a cross-metric contradiction. They accept the first causal explanation without testing alternatives. They project confidence uniformly instead of calibrating it to the evidence. These failures are invisible to correctness-only scoring: the agent gets a "right enough" answer through shallow reasoning.
EjBench was designed to measure these failures. Not "did the model get the right answer?" but "did the model reason in a way you can trust?"
Task Design
180 tasks across 6 cognitive domains, 30 per domain:
| Domain | Task Types |
|---|---|
| Simulation | Consequence modeling, equilibrium shifts, cascade failure tracking |
| Abstraction | Category enforcement, isomorphism identification, group theory |
| Metacognition | Contradiction detection, bias identification, epistemic evaluation |
| Causal | Intervention analysis, counterfactual reasoning, threshold cascades |
| Temporal | Sequence ordering, interval overlap, temporal constraint propagation |
| Spatial | Topology validation, constraint satisfaction, dimensional reasoning |
Hardening: Tasks target a 50-65% baseline success window. 54% include counter-intuitive elements where the obvious answer is wrong, mirroring real production scenarios where the first instinct misleads. Answer format is mixed (60 multiple-choice, 60 yes/no, 60 free-text) to prevent format-based pattern matching. Distractors match common intermediate computation errors, not random alternatives.
Methodology
Model: Claude Opus 4.6 with extended thinking at maximum effort.
Agent-native execution: Identical to our external benchmark. Agents called the Ejentum production Logic API themselves via tool use. The agent summarized the task, called the endpoint, received the scaffold, and applied it before reasoning. This is how the API works in production; the retrieval variance is real, not simulated.
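The loop above (summarize, call, receive scaffold, apply) can be sketched in a few lines. Everything below is hypothetical: the function names, the payload shape, and the stub standing in for the Logic API endpoint are illustrative assumptions, not the real API contract.

```python
# Hypothetical sketch of the agent-native loop. The payload fields and
# the stubbed endpoint are assumptions, not the real Logic API contract.

def fetch_scaffold(task_summary: str, mode: str, call_api) -> str:
    """Tool-use step: the agent itself calls the Logic API.
    `call_api` is injected so the sketch runs without network access."""
    return call_api({"summary": task_summary, "mode": mode})

def build_prompt(task_text: str, mode: str, call_api) -> str:
    summary = task_text[:200]            # agent-produced task summary
    scaffold = fetch_scaffold(summary, mode, call_api)
    return f"{scaffold}\n\n{task_text}"  # scaffold applied before reasoning

# Stub standing in for the production endpoint:
stub = lambda payload: f"[{payload['mode']} scaffold for: {payload['summary'][:30]}]"
prompt = build_prompt("Trace the cascade failure through the supply network.", "haki", stub)
```

The point of the injected `call_api` is only to keep the sketch self-contained; in production the call is a real tool invocation made by the agent.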
Three conditions:
- A (Baseline): Raw task, no injection, no tool access
- B1 (Ki, single-ability): Task + one cognitive ability injected via API call
- C1 (Haki, multi-ability): Task + four composed abilities injected via API call
Blind 7-factor rubric: Same seven factors as the external benchmark, scored 0-3 by a separate evaluator that never saw which condition produced which response.
| Factor | What It Measures |
|---|---|
| Correctness | Right answer with valid reasoning |
| Reasoning Depth | Multi-level analysis, second/third-order effects |
| Self-Monitoring | Explicit metacognitive awareness, bias checking |
| Verification | Counterfactual checks, boundary tests, re-derivation |
| Epistemic Honesty | Known vs. assumed, confidence calibration |
| Alternative Consideration | Competing explanations, systematic elimination |
| Audit Trail | Traceable reasoning chain, named methods |
Scale: 540 total generation calls. 536 valid judgments (99.3%). Four generation failures (1 baseline, 3 Haki).
Evaluation: LLM-as-judge (Claude Opus 4.6). Same limitation as the external benchmark; human evaluation would be stronger. Two-stage blind protocol mitigates systematic bias.
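The two-stage blind protocol can be sketched as follows. This is our reading of the described setup with hypothetical helper names; the actual judging pipeline is not published here.

```python
import random

# Sketch of a two-stage blind protocol: condition labels are stripped
# and shuffled before judging (stage 1), then scores are mapped back
# to conditions (stage 2). Helper names are hypothetical.

def blind(responses: dict, rng: random.Random) -> tuple[dict, dict]:
    """Stage 1: replace condition labels with anonymous IDs in shuffled order."""
    items = list(responses.items())
    rng.shuffle(items)
    key = {f"R{i}": cond for i, (cond, _) in enumerate(items)}
    blinded = {f"R{i}": text for i, (_, text) in enumerate(items)}
    return blinded, key

def unblind(scores: dict, key: dict) -> dict:
    """Stage 2: after judging, map anonymous IDs back to conditions."""
    return {key[rid]: s for rid, s in scores.items()}

responses = {"baseline": "four sentences", "ki": "names strategy", "haki": "rejects options"}
blinded, key = blind(responses, random.Random(0))
scores = {rid: len(text) for rid, text in blinded.items()}  # stand-in judge
by_condition = unblind(scores, key)
```

The judge only ever sees the `R0`/`R1`/`R2` identifiers, so condition membership cannot leak into scoring.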
Results
Overall Composite
| Condition | Composite | Delta |
|---|---|---|
| A (Baseline) | 0.621 | - |
| B1 (Ki, single-ability) | 0.711 | +9.0pp |
| C1 (Haki, multi-ability) | 0.722 | +10.1pp |
Haki outperformed Ki by 1.1 percentage points. This is the opposite of our external benchmark, where Ki outperformed Haki by 12.2pp. The reversal is the central finding, analyzed below.
Per-Factor Breakdown
| Factor | Baseline | Ki | Ki Delta | Haki | Haki Delta |
|---|---|---|---|---|---|
| Self-Monitoring | 0.94 | 1.70 | +0.76 | 1.81 | +0.86 |
| Verification | 1.50 | 2.01 | +0.51 | 2.16 | +0.67 |
| Alternative Consideration | 1.37 | 1.77 | +0.40 | 1.85 | +0.47 |
| Epistemic Honesty | 1.54 | 1.90 | +0.36 | 1.94 | +0.40 |
| Reasoning Depth | 2.44 | 2.54 | +0.10 | 2.57 | +0.14 |
| Audit Trail | 2.64 | 2.75 | +0.11 | 2.76 | +0.12 |
| Correctness | 2.60 | 2.57 | -0.03 | 2.49 | -0.11 |
Haki outscored Ki on six of the seven factors; correctness, the exception, is analyzed below. On complex multi-variable tasks, four perspectives are better than one.
Self-monitoring nearly doubled (0.94 → 1.81). The baseline agent rarely checks its own assumptions. The injected agent does it consistently: questioning biases, reflecting on its reasoning process, and flagging when it's uncertain.
Verification increased 44% (1.50 → 2.16). Injected agents perform counterfactual checks, test boundary conditions, and re-derive intermediate results. The baseline agent stops at the first verification that confirms its answer.
What +10.1pp Looks Like on Real Tasks
Task 1: Same answer, different reasoning (CA-V2-18, Causal)
A social media company finds that users who receive more 'likes' post more frequently. They implement a feature to artificially boost likes on new users' posts. After 3 months, the boosted group posts only 5% more, far less than the 40% predicted from the correlation. If the dominant causal direction is reverse (prolific posters generate more content, which gets more total likes), what should the relationship between posting frequency and likes-per-post look like?
All three conditions answered correctly: (B) Negative. The difference is entirely in reasoning quality.
Baseline: Four sentences. States the answer as self-evident. Zero self-monitoring, zero verification, zero consideration of why the other options are wrong. Composite: 0.286.
Ki (single-ability): Explicitly names its reasoning strategy ("suppressing linear one-way thinking"), identifies the feedback loop mechanism, and uses the failed intervention as a sanity check. Self-monitoring: 0 → 2. Composite: 0.700.
Haki (multi-ability): Systematically rejects all three wrong options with model-specific reasons: (A) would require a skill mechanism not present, (C) contradicts the model's core structure, and (D) requires non-monotonic behavior with no basis. Connects the intervention result as empirical evidence. Alternative consideration: 0 → 3. Composite: 0.833.
| Factor | Baseline | Ki | Haki |
|---|---|---|---|
| Correctness | 3 | 3 | 3 |
| Self-Monitoring | 0 | 2 | 2 |
| Verification | 0 | 2 | 2 |
| Epistemic Honesty | 0 | 1 | 2 |
| Alternative Consideration | 0 | 1 | 3 |
| Composite | 0.286 | 0.700 | 0.833 |
The agent got the same answer all three times. But only the injected versions produced reasoning you could audit, challenge, and trust.
Task 2: Only Haki breaks through (MC-V2-22, Metacognitive)
A tech company's diversity report shows women represented 25% of applications, 24% of phone screens, 23% of on-sites, 22% of offers, and 22% of hires. The company claims: since the ratio remained within 3 percentage points at each stage, the process shows no significant gender bias. Does the near-constant ratio prove the absence of bias?
Baseline and Ki both answered (C) "No, small drops compound." This is wrong. The correct answer is (D) "No, the 25% application pool may itself reflect bias."
Baseline: Produces thorough math analyzing the compounding effect. Strong reasoning depth (score 3). But explicitly dismisses option D: "it addresses a different question: upstream pipeline bias, not the hiring process." Gets trapped by the seductive-but-wrong option. Correctness: 1.
Ki: Performs a named bias scan before answering, identifying anchoring bias, confirmation bias, and scope insensitivity in the company's argument. But it still falls for option C, stating "while D raises a valid concern, it doesn't address the specific claim." Correctness: 1.
Haki: Constructs an explicit counterfactual: "Suppose the ratio had been perfectly constant at 25% through every stage. Would that prove no bias? No. The argument never asks why only 25% applied in the first place. Maintaining a potentially-biased baseline is not evidence of neutrality; it's evidence of baseline preservation." Correctness flips to 3.
| Factor | Baseline | Ki | Haki |
|---|---|---|---|
| Correctness | 1 | 1 | 3 |
| Self-Monitoring | 2 | 3 | 3 |
| Epistemic Honesty | 1 | 2 | 3 |
| Composite | 0.714 | 0.867 | 0.933 |
Both baseline and Ki had strong reasoning depth (score 3). Strong reasoning aimed at the wrong target still produces the wrong answer. Only Haki's multi-perspective injection, specifically the counterfactual construction from the alternative scaffold, broke through the trap.
The Correctness Paradox
Correctness didn't improve. On Ki it dropped by 0.03 on a 3-point scale. On Haki it dropped by 0.11. This is the opposite of what most people expect from a "reasoning improvement" tool.
Here's why it's the point, not a flaw:
Baseline correctness is already 2.60/3.00 (86.7%). Claude Opus 4.6 is a frontier reasoning model. On most tasks, it already gets the right answer. There's a 13.3% gap between baseline and perfect, and that gap includes tasks that are genuinely hard, where even the best reasoning may not produce the correct conclusion.
The quality factors have far more headroom. Self-monitoring baseline is 0.94/3.00 (31.3%). Verification is 1.50/3.00 (50.0%). These factors have 50-70% room for improvement. Correctness has 13%. The intervention naturally produces larger deltas where there's more room.
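The headroom arithmetic is easy to verify from the baseline column of the per-factor table:

```python
# Headroom per factor: fraction of the 0-3 scale still unclaimed by the
# baseline. Input values are baseline scores from the per-factor table.
baseline = {"self_monitoring": 0.94, "verification": 1.50, "correctness": 2.60}

headroom = {f: round((3.0 - s) / 3.0 * 100, 1) for f, s in baseline.items()}
# self-monitoring has roughly 69% of the scale left; correctness only ~13%
```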
The slight correctness dip is a trade-off, not a failure. When an agent spends more tokens on self-monitoring, verification, and alternative consideration, it allocates attention budget away from pure answer-seeking. The reasoning gets more thorough but occasionally more cautious, hedging where the baseline would have committed. On a 3-point correctness scale, this manifests as a 0.03-0.11 drop. On a production deployment where trust matters, the trade-off is worth it.
We are not spinning this. If correctness had dropped by 0.5 or 1.0, that would be a real problem. The observed drop (-0.03 Ki, -0.11 Haki) is small relative to the scale and consistent with the attention budget hypothesis. But it's a real number and we report it as such.
The Reversal: Why Haki Wins Here
On our external benchmark, Ki outperformed Haki by 12.2pp. Here, Haki outperforms Ki by 1.1pp. The reversal maps cleanly to task complexity:
| Task Type | Ki Lift | Haki Lift | Winner |
|---|---|---|---|
| Focused tasks (external benchmark) | +20.8pp | +8.6pp | Ki |
| Complex tasks (EjBench) | +9.0pp | +10.1pp | Haki |
External benchmark tasks are focused: one reasoning challenge per task, one right answer. Ki delivers one high-precision scaffold with 93% signal density. No noise, no competing perspectives.
EjBench tasks are complex: multiple variables, counter-intuitive elements, multi-step reasoning chains. Haki delivers four abilities: a primary that sets the direction, a dependency that grounds it, an amplifier that deepens it, and an alternative that challenges it. The compound suppression from all four blocks more failure modes than any single scaffold can reach.
The practical decision framework:
- Your agent handles focused, single-domain tasks → Ki
- Your agent handles complex, multi-variable analysis → Haki
- You're not sure → start with Ki, test your hardest tasks, upgrade if single mode doesn't cover them
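The framework reduces to a small selection rule. The function name, the string values, and the parameters below are illustrative only, not part of the API:

```python
def choose_mode(task_profile: str, ki_failed_hard_tasks: bool = False) -> str:
    """Illustrative encoding of the decision framework above.
    `task_profile` is one of "focused", "complex", or "unsure"."""
    if task_profile == "focused":
        return "ki"        # focused, single-domain tasks
    if task_profile == "complex":
        return "haki"      # complex, multi-variable analysis
    # unsure: start with Ki, upgrade only if it fails your hardest tasks
    return "haki" if ki_failed_hard_tasks else "ki"
```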
By Domain
| Domain | Baseline | Ki | Ki Delta | Haki | Haki Delta |
|---|---|---|---|---|---|
| Abstraction | 0.627 | 0.753 | +12.6pp | 0.820 | +19.3pp |
| Simulation | 0.513 | 0.677 | +16.4pp | 0.699 | +18.6pp |
| Causal | 0.637 | 0.732 | +9.5pp | 0.778 | +14.1pp |
| Metacognition | 0.743 | 0.801 | +5.8pp | 0.828 | +8.5pp |
| Spatial | 0.619 | 0.693 | +7.4pp | 0.591 | -2.8pp |
| Temporal | 0.587 | 0.608 | +2.1pp | 0.608 | +2.1pp |
Abstraction showed the strongest Haki lift (+19.3pp). Tasks requiring systematic exploration of mathematical structures (group theory, isomorphism, category enforcement) benefit most from multi-ability suppression of premature categorization.
Simulation had the lowest baseline (0.513) and the largest Ki lift (+16.4pp). The most room for improvement, and the self-monitoring delta was especially large: 0.57 baseline → 1.83 injected (3.2x increase).
Spatial regressed under Haki (-2.8pp). Ki improved Spatial by +7.4pp, but Haki made it worse. On spatial tasks, four competing perspectives confused the constraint tracking rather than clarifying it. This is consistent with the external benchmark finding: focused tasks need focused scaffolds.
Temporal showed minimal lift (+2.1pp both modes). This suggests temporal reasoning tasks may require more domain-specific ability content than current retrieval provides.
Task Flips
| Condition | Improved | Degraded | Net Flip | Net Rate |
|---|---|---|---|---|
| Ki | 88 | 21 | +67 | 37.4% |
| Haki | 92 | 16 | +76 | 42.9% |
Haki improved more tasks (92 vs 88) and degraded fewer (16 vs 21). Nearly half of all tasks showed measurable quality improvement under multi-ability injection. 39% of tasks were neutral: either already high quality, or task types where retrieved abilities didn't activate relevant suppression.
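The reported net rates reproduce if the denominator is the number of valid baseline-condition task pairs (179 for Ki, 177 for Haki, after the generation failures). That denominator is our inference from the reported counts, not a formula stated in the methodology:

```python
# Net flip rate = (improved - degraded) / valid baseline-condition pairs.
# Pair counts (179 for Ki, 177 for Haki) are inferred from the reported
# generation failures, not stated explicitly in the methodology.
def net_flip_rate(improved: int, degraded: int, pairs: int) -> float:
    return round((improved - degraded) / pairs * 100, 1)

ki_rate = net_flip_rate(88, 21, 179)    # reported as 37.4%
haki_rate = net_flip_rate(92, 16, 177)  # reported as 42.9%
```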
Limitations
- First-run benchmark. EjBench has been run once. These are first-observation results, not replicated findings. The factor ranking is consistent with our external benchmark (self-monitoring and verification are always the top two), which provides indirect validation.
- Custom tasks. We designed these tasks. Despite hardening measures (counter-intuitive elements, mixed formats, no answer leaks), selection bias is possible. The external benchmark exists specifically to address this concern.
- LLM-as-judge. Same limitation as the external benchmark: Claude evaluating Claude. Two-stage blind protocol mitigates but does not eliminate potential systematic bias.
- One model. All results are on Claude Opus 4.6. Suppression signals target architectural properties of transformers broadly, but generalization to other models is not yet tested.
- Correctness trade-off. The -0.11 Haki correctness dip is small but real. On tasks where getting the right answer matters more than reasoning transparency, this trade-off may not be acceptable.
What This Means
RA²R injection doesn't make your agent smarter. It makes your agent disciplined.
The baseline agent reasons well enough to get the right answer 87% of the time. But it does so through shortcuts: accepting the first plausible explanation, projecting uniform confidence, skipping verification. These shortcuts are invisible until you measure the reasoning process, not just the conclusion.
The injected agent gets the same answers. But it shows why. It checks itself. It considers alternatives. It acknowledges what it doesn't know. It produces reasoning you can audit, challenge, and trust.
If you're building an agent that needs to be right, correctness-only evaluation is sufficient. If you're building an agent that needs to be trusted, reasoning quality is what separates a prototype from production infrastructure.
Source Data
- Benchmark: EjBench v2 (custom, 180 tasks)
- Domains: 6 | Tasks per domain: 30 | Total tasks: 180
- Valid judgments: 536 | Conditions: 3
- Model: Claude Opus 4.6 | Evaluation date: March 2026
- External validation: BBH/CausalBench/MuSR benchmark
- Full benchmark data and scoring methodology: github.com/ejentum/benchmarks
These findings are part of our research paper: Under Pressure: RA²R and the Emergence of Uninstructed Reasoning Behaviors (PDF).