
EjBench: 180 Professional Tasks, Agent-Native, Blind


We built 180 tasks that test what production agents actually fail at. Haki (multi ability) achieved a +10.1 percentage point composite lift across seven behavioral factors. Self-monitoring nearly doubled. Verification increased 44%. Correctness didn't move. That's the point.


Why We Built Our Own Benchmark

Published benchmarks test academic reasoning. They're essential for external validity, and we tested on those too. But they don't capture what breaks in production.

Production agents fail differently. They produce plausible-sounding analysis that misses a cross-metric contradiction. They accept the first causal explanation without testing alternatives. They project confidence uniformly instead of calibrating it to the evidence. These failures are invisible to correctness-only scoring: the agent gets a "right enough" answer through shallow reasoning.

EjBench was designed to measure these failures. Not "did the model get the right answer?" but "did the model reason in a way you can trust?"


Task Design

180 tasks across 6 cognitive domains, 30 per domain:

| Domain | Task Types |
| --- | --- |
| Simulation | Consequence modeling, equilibrium shifts, cascade failure tracking |
| Abstraction | Category enforcement, isomorphism identification, group theory |
| Metacognition | Contradiction detection, bias identification, epistemic evaluation |
| Causal | Intervention analysis, counterfactual reasoning, threshold cascades |
| Temporal | Sequence ordering, interval overlap, temporal constraint propagation |
| Spatial | Topology validation, constraint satisfaction, dimensional reasoning |

Hardening: Tasks target a 50-65% baseline success window. 54% include counter-intuitive elements where the obvious answer is wrong, mirroring real production scenarios where the first instinct misleads. Answer format is mixed (60 multiple-choice, 60 yes/no, 60 free-text) to prevent format-based pattern matching. Distractors match common intermediate computation errors, not random alternatives.
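The composition constraints are mechanical enough to check in a few lines. A sketch, assuming hypothetical `domain` and `format` fields (the actual EjBench task schema isn't published):

```python
from collections import Counter

# Hypothetical task records; the real EjBench schema is not published.
DOMAINS = ["simulation", "abstraction", "metacognition", "causal", "temporal", "spatial"]
FORMATS = ["multiple_choice", "yes_no", "free_text"]

# 6 domains x 30 tasks, with formats balanced 60/60/60 across the full set.
tasks = [
    {"domain": d, "format": FORMATS[i % 3]}
    for d in DOMAINS
    for i in range(30)
]

def check_composition(tasks):
    """Verify the 6x30 domain split and the 60/60/60 format split."""
    by_domain = Counter(t["domain"] for t in tasks)
    by_format = Counter(t["format"] for t in tasks)
    assert len(tasks) == 180
    assert all(by_domain[d] == 30 for d in DOMAINS)
    assert all(by_format[f] == 60 for f in FORMATS)
    return by_domain, by_format

by_domain, by_format = check_composition(tasks)
```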


Methodology

Model: Claude Opus 4.6 with extended thinking at maximum effort.

Agent-native execution: Identical to our external benchmark. Agents called the Ejentum production Logic API themselves via tool use. The agent summarized the task, called the endpoint, received the scaffold, and applied it before reasoning. This is how the API works in production; the retrieval variance is real, not simulated.
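The execution loop is simple to sketch. Everything below is illustrative: the function names, scaffold strings, and stubbed retrieval stand in for the real Logic API call, whose payload shape we are not reproducing here.

```python
# Sketch of the agent-native flow: summarize the task, call the Logic API
# for a scaffold, and prepend it to the reasoning prompt. The retrieval is
# stubbed; the real call goes through tool use against the production API.

def retrieve_scaffold(task_summary: str, mode: str) -> list[str]:
    """Stub for the Logic API call. 'ki' returns one ability; 'haki'
    returns the four-ability composition (primary, dependency,
    amplifier, alternative)."""
    if mode == "ki":
        return ["primary ability scaffold"]
    if mode == "haki":
        return [
            "primary ability scaffold",
            "dependency scaffold",
            "amplifier scaffold",
            "alternative (challenger) scaffold",
        ]
    return []  # baseline: no injection, raw task only

def build_prompt(task: str, mode: str) -> str:
    summary = task[:200]  # the agent summarizes the task before the call
    scaffolds = retrieve_scaffold(summary, mode)
    injected = "\n".join(f"[ability] {s}" for s in scaffolds)
    return f"{injected}\n\n{task}" if injected else task
```

A baseline call passes the task through untouched; a Haki call prepends all four scaffold lines before the agent reasons.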

Three conditions:

  • A (Baseline): Raw task, no injection, no tool access
  • B1. Ki (single ability): Task + one cognitive ability injected via API call
  • C1. Haki (multi ability): Task + four composed abilities injected via API call

Blind 7-factor rubric: Same seven factors as the external benchmark, scored 0-3 by a separate evaluator that never saw which condition produced which response.

| Factor | What It Measures |
| --- | --- |
| Correctness | Right answer with valid reasoning |
| Reasoning Depth | Multi-level analysis, second/third-order effects |
| Self-Monitoring | Explicit metacognitive awareness, bias checking |
| Verification | Counterfactual checks, boundary tests, re-derivation |
| Epistemic Honesty | Known vs. assumed, confidence calibration |
| Alternative Consideration | Competing explanations, systematic elimination |
| Audit Trail | Traceable reasoning chain, named methods |

Scale: 540 total generation calls (180 tasks × 3 conditions). 536 valid judgments (99.3%). Four generation failures (1 baseline, 3 Haki).

Evaluation: LLM-as-judge (Claude Opus 4.6). Same limitation as the external benchmark; human evaluation would be stronger. The two-stage blind protocol mitigates systematic bias.
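The two-stage blind protocol amounts to stripping condition labels before judging and re-mapping scores afterward. A minimal sketch with a toy judge (the real judge is Claude Opus 4.6; the marker-counting heuristic below is purely illustrative):

```python
import random

def blind_judge(responses: dict[str, str], judge, seed: int = 0) -> dict[str, int]:
    """Stage 1: strip condition labels and shuffle, so the judge never
    sees which condition produced which response. Stage 2: score the
    anonymous texts, then re-map scores back to conditions."""
    rng = random.Random(seed)
    items = list(responses.items())              # [(condition, text), ...]
    rng.shuffle(items)
    scores = [judge(text) for _, text in items]  # judge sees text only
    return {cond: score for (cond, _), score in zip(items, scores)}

# Toy judge: score 0-3 by counting verification markers in the response.
def toy_judge(text: str) -> int:
    markers = ("check", "counterfactual", "boundary")
    return min(3, sum(m in text for m in markers))

scored = blind_judge(
    {"A": "answer only",
     "B1": "answer with a check",
     "C1": "counterfactual check at the boundary"},
    toy_judge,
)
```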


Results

Overall Composite

| Condition | Composite | Delta |
| --- | --- | --- |
| A (Baseline) | 0.621 | - |
| B1. Ki (single ability) | 0.711 | +9.0pp |
| C1. Haki (multi ability) | 0.722 | +10.1pp |
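The deltas are plain percentage-point arithmetic on the 0-1 composite scale:

```python
# Composite scores from the table above (0-1 scale).
composites = {"baseline": 0.621, "ki": 0.711, "haki": 0.722}

def delta_pp(condition: str) -> float:
    """Percentage-point lift over baseline on the 0-1 composite scale."""
    return round((composites[condition] - composites["baseline"]) * 100, 1)
```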

Haki outperformed Ki by 1.1 percentage points. This is the opposite of our external benchmark, where Ki outperformed Haki by 12.2pp. The reversal is the central finding, analyzed below.

Per-Factor Breakdown

| Factor | Baseline | Ki | Ki Delta | Haki | Haki Delta |
| --- | --- | --- | --- | --- | --- |
| Self-Monitoring | 0.94 | 1.70 | +0.76 | 1.81 | +0.86 |
| Verification | 1.50 | 2.01 | +0.51 | 2.16 | +0.67 |
| Alternative Consideration | 1.37 | 1.77 | +0.40 | 1.85 | +0.47 |
| Epistemic Honesty | 1.54 | 1.90 | +0.36 | 1.94 | +0.40 |
| Reasoning Depth | 2.44 | 2.54 | +0.10 | 2.57 | +0.14 |
| Audit Trail | 2.64 | 2.75 | +0.11 | 2.76 | +0.12 |
| Correctness | 2.60 | 2.57 | -0.03 | 2.49 | -0.11 |

Haki beat Ki on every quality factor. Multi-ability injection produced larger improvements than single-ability across all six quality dimensions; only correctness dipped further. On complex multi-variable tasks, four perspectives are better than one.

Self-monitoring nearly doubled (0.94 → 1.81). The baseline agent rarely checks its own assumptions. The injected agent does it consistently: questioning biases, reflecting on its reasoning process, and flagging when it's uncertain.

Verification increased 44% (1.50 → 2.16). Injected agents perform counterfactual checks, test boundary conditions, and re-derive intermediate results. The baseline agent stops at the first verification that confirms its answer.


What +10.1pp Looks Like on Real Tasks

Task 1: Same answer, different reasoning (CA-V2-18, Causal)

A social media company finds that users who receive more 'likes' post more frequently. They implement a feature to artificially boost likes on new users' posts. After 3 months, the boosted group posts only 5% more, far less than the 40% predicted from the correlation. If the dominant causal direction is reverse (prolific posters generate more content, which gets more total likes), what should the relationship between posting frequency and likes-per-post look like?

All three conditions answered correctly: (B) Negative. The difference is entirely in reasoning quality.

Baseline: Four sentences. States the answer as self-evident. Zero self-monitoring, zero verification, zero consideration of why the other options are wrong. Composite: 0.286.

Ki (single ability): Explicitly names its reasoning strategy ("suppressing linear one-way thinking"), identifies the feedback loop mechanism, and uses the failed intervention as a sanity check. Self-monitoring: 0 → 2. Composite: 0.700.

Haki (multi ability): Systematically rejects all three wrong options with model-specific reasons: (A) would require a skill mechanism not present, (C) contradicts the model's core structure, and (D) requires non-monotonic behavior with no basis. Connects the intervention result as empirical evidence. Alternative consideration: 0 → 3. Composite: 0.833.

| Factor | Baseline | Ki | Haki |
| --- | --- | --- | --- |
| Correctness | 3 | 3 | 3 |
| Self-Monitoring | 0 | 2 | 2 |
| Verification | 0 | 2 | 2 |
| Epistemic Honesty | 0 | 1 | 2 |
| Alternative Consideration | 0 | 1 | 3 |
| Composite | 0.286 | 0.700 | 0.833 |

The agent got the same answer all three times. But only the injected versions produced reasoning you could audit, challenge, and trust.

Task 2: Only Haki breaks through (MC-V2-22, Metacognitive)

A tech company's diversity report shows women represented 25% of applications, 24% of phone screens, 23% of on-sites, 22% of offers, and 22% of hires. The company claims: since the ratio remained within 3 percentage points at each stage, the process shows no significant gender bias. Does the near-constant ratio prove the absence of bias?

Baseline and Ki both answered (C) "No, small drops compound." This is wrong. The correct answer is (D) "No, the 25% application pool may itself reflect bias."

Baseline: Produces thorough math analyzing the compounding effect. Strong reasoning depth (score 3). But explicitly dismisses option D: "it addresses a different question: upstream pipeline bias, not the hiring process." Gets trapped by the seductive-but-wrong option. Correctness: 1.

Ki: Performs a named bias scan before answering, identifying anchoring bias, confirmation bias, and scope insensitivity in the company's argument. But it still falls for option C, stating "while D raises a valid concern, it doesn't address the specific claim." Correctness: 1.

Haki: Constructs an explicit counterfactual: "Suppose the ratio had been perfectly constant at 25% through every stage. Would that prove no bias? No. The argument never asks why only 25% applied in the first place. Maintaining a potentially-biased baseline is not evidence of neutrality; it's evidence of baseline preservation." Correctness flips to 3.

| Factor | Baseline | Ki | Haki |
| --- | --- | --- | --- |
| Correctness | 1 | 1 | 3 |
| Self-Monitoring | 2 | 3 | 3 |
| Epistemic Honesty | 1 | 2 | 3 |
| Composite | 0.714 | 0.867 | 0.933 |

Both baseline and Ki had strong reasoning depth (score 3). Strong reasoning aimed at the wrong target still produces the wrong answer. Only Haki's multi-perspective injection, specifically the counterfactual construction from the alternative scaffold, broke through the trap.


The Correctness Paradox

Correctness didn't improve. On Ki it dropped by 0.03 on a 3-point scale. On Haki it dropped by 0.11. This is the opposite of what most people expect from a "reasoning improvement" tool.

Here's why it's the point, not a flaw:

Baseline correctness is already 2.60/3.00 (86.7%). Claude Opus 4.6 is a frontier reasoning model. On most tasks, it already gets the right answer. There's a 13.3% gap between baseline and perfect, and that gap includes tasks that are genuinely hard, where even the best reasoning may not produce the correct conclusion.

The quality factors have far more headroom. Self-monitoring baseline is 0.94/3.00 (31.3%). Verification is 1.50/3.00 (50.0%). These factors have 50-70% room for improvement. Correctness has 13%. The intervention naturally produces larger deltas where there's more room.
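The headroom argument is direct arithmetic on the 0-3 factor scale, using baseline scores from the per-factor table:

```python
# Baseline per-factor scores (0-3 scale) from the results table.
baselines = {
    "self_monitoring": 0.94,
    "verification": 1.50,
    "correctness": 2.60,
}

def headroom_pct(factor: str) -> float:
    """Room left between the baseline score and the 3.0 ceiling,
    as a share of the full scale."""
    return round((3.0 - baselines[factor]) / 3.0 * 100, 1)
```

Self-monitoring has roughly 69% of the scale still available; correctness has about 13%, which is why the intervention's deltas land where the headroom is.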

The slight correctness dip is a trade-off, not a failure. When an agent spends more tokens on self-monitoring, verification, and alternative consideration, it allocates attention budget away from pure answer-seeking. The reasoning gets more thorough but occasionally more cautious, hedging where the baseline would have committed. On a 3-point correctness scale, this manifests as a 0.03-0.11 drop. On a production deployment where trust matters, the trade-off is worth it.

We are not spinning this. If correctness had dropped by 0.5 or 1.0, that would be a real problem. The observed drop (-0.03 Ki, -0.11 Haki) is small relative to the scale and consistent with the attention budget hypothesis. But it's a real number and we report it as such.


The Reversal: Why Haki Wins Here

On our external benchmark, Ki outperformed Haki by 12.2pp. Here, Haki outperforms Ki by 1.1pp. The reversal maps cleanly to task complexity:

| Task Type | Ki Lift | Haki Lift | Winner |
| --- | --- | --- | --- |
| Focused tasks (external benchmark) | +20.8pp | +8.6pp | Ki |
| Complex tasks (EjBench) | +9.0pp | +10.1pp | Haki |

External benchmark tasks are focused: one reasoning challenge per task, one right answer. Ki delivers one high-precision scaffold with 93% signal density. No noise, no competing perspectives.

EjBench tasks are complex: multiple variables, counter-intuitive elements, multi-step reasoning chains. Haki delivers four abilities: a primary that sets the direction, a dependency that grounds it, an amplifier that deepens it, and an alternative that challenges it. The compound suppression from all four blocks more failure modes than any single scaffold can reach.

The practical decision framework:

  • Your agent handles focused, single-domain tasks → Ki
  • Your agent handles complex, multi-variable analysis → Haki
  • You're not sure → start with Ki, test your hardest tasks, upgrade if single mode doesn't cover them
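Codified, the framework is a one-branch rule. The boolean task attributes below are our own hypothetical names, not Logic API parameters:

```python
def choose_mode(multi_variable: bool, cross_domain: bool) -> str:
    """Map the decision framework onto a default: Haki for complex,
    multi-variable or cross-domain analysis, Ki otherwise. When unsure,
    this defaults to Ki, matching the 'start with Ki' guidance above."""
    if multi_variable or cross_domain:
        return "haki"
    return "ki"
```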

By Domain

| Domain | Baseline | Ki | Ki Delta | Haki | Haki Delta |
| --- | --- | --- | --- | --- | --- |
| Abstraction | 0.627 | 0.753 | +12.6pp | 0.820 | +19.3pp |
| Simulation | 0.513 | 0.677 | +16.4pp | 0.699 | +18.6pp |
| Causal | 0.637 | 0.732 | +9.5pp | 0.778 | +14.1pp |
| Metacognition | 0.743 | 0.801 | +5.8pp | 0.828 | +8.5pp |
| Spatial | 0.619 | 0.693 | +7.4pp | 0.591 | -2.8pp |
| Temporal | 0.587 | 0.608 | +2.1pp | 0.608 | +2.1pp |

Abstraction showed the strongest Haki lift (+19.3pp). Tasks requiring systematic exploration of mathematical structures (group theory, isomorphism, category enforcement) benefit most from multi-ability suppression of premature categorization.

Simulation had the lowest baseline (0.513) and the largest Ki lift (+16.4pp). The most room for improvement, and the self-monitoring delta was especially large: 0.57 baseline → 1.83 injected (3.2x increase).

Spatial regressed under Haki (-2.8pp). Ki improved Spatial by +7.4pp, but Haki made it worse. On spatial tasks, four competing perspectives confused the constraint tracking rather than clarifying it. This is consistent with the external benchmark finding: focused tasks need focused scaffolds.

Temporal showed minimal lift (+2.1pp both modes). This suggests temporal reasoning tasks may require more domain-specific ability content than current retrieval provides.


Task Flips

| Condition | Improved | Degraded | Net Flip | Net Rate |
| --- | --- | --- | --- | --- |
| Ki | 88 | 21 | +67 | 37.4% |
| Haki | 92 | 16 | +76 | 42.9% |

Haki improved more tasks (92 vs 88) and degraded fewer (16 vs 21). Roughly half of all tasks showed measurable quality improvement under multi-ability injection. 39% of tasks were neutral: either already high quality, or task types where the retrieved abilities didn't activate relevant suppression.


Limitations

  • First-run benchmark. EjBench has been run once. These are first-observation results, not replicated findings. The factor ranking is consistent with our external benchmark (self-monitoring and verification are always the top two), which provides indirect validation.
  • Custom tasks. We designed these tasks. Despite hardening measures (counter-intuitive elements, mixed formats, no answer leaks), selection bias is possible. The external benchmark exists specifically to address this concern.
  • LLM-as-judge. Same limitation as the external benchmark: Claude evaluating Claude. The two-stage blind protocol mitigates but does not eliminate potential systematic bias.
  • One model. All results are on Claude Opus 4.6. Suppression signals target architectural properties of transformers broadly, but generalization to other models is not yet tested.
  • Correctness trade-off. The -0.11 Haki correctness dip is small but real. On tasks where getting the right answer matters more than reasoning transparency, this trade-off may not be acceptable.

What This Means

RA²R injection doesn't make your agent smarter. It makes your agent disciplined.

The baseline agent reasons well enough to get the right answer 87% of the time. But it does so through shortcuts: accepting the first plausible explanation, projecting uniform confidence, skipping verification. These shortcuts are invisible until you measure the reasoning process, not just the conclusion.

The injected agent gets the same answers. But it shows why. It checks itself. It considers alternatives. It acknowledges what it doesn't know. It produces reasoning you can audit, challenge, and trust.

If you're building an agent that needs to be right, correctness-only evaluation is sufficient. If you're building an agent that needs to be trusted, reasoning quality is what separates a prototype from production infrastructure.


Source Data

  • Benchmark: EjBench v2 (custom, 180 tasks)
  • Domains: 6 | Tasks per domain: 30 | Total tasks: 180
  • Valid judgments: 536 | Conditions: 3
  • Model: Claude Opus 4.6 | Evaluation date: March 2026
  • External validation: BBH/CausalBench/MuSR benchmark
  • Full benchmark data and scoring methodology: github.com/ejentum/benchmarks

These findings are part of our research paper: Under Pressure: RA²R and the Emergence of Uninstructed Reasoning Behaviors (PDF).

Every insight above is implemented as a reasoning primitive in the Logic API.