EjBench: 180 Professional Tasks, Agent-Native, Blind
We built 180 tasks that test what production agents actually fail at. Haki (multi-ability) achieved a +10.1 percentage point composite lift across seven behavioral factors. Self-monitoring nearly doubled. Verification increased 44%. Correctness didn't improve. That's the point.
Why We Built Our Own Benchmark
Published benchmarks test academic reasoning. They're essential for external validity, and we tested on those too. But they don't capture what breaks in production.
Production agents fail differently. They produce plausible-sounding analysis that misses a cross-metric contradiction. They accept the first causal explanation without testing alternatives. They project confidence uniformly instead of calibrating it to the evidence. These failures are invisible to correctness-only scoring: the agent gets a "right enough" answer through shallow reasoning.
EjBench was designed to measure these failures. Not "did the model get the right answer?" but "did the model reason in a way you can trust?"
Task Design
180 tasks across 6 cognitive domains, 30 per domain:
| Domain | Task Types |
|---|---|
| Simulation | Consequence modeling, equilibrium shifts, cascade failure tracking |
| Abstraction | Category enforcement, isomorphism identification, group theory |
| Metacognition | Contradiction detection, bias identification, epistemic evaluation |
| Causal | Intervention analysis, counterfactual reasoning, threshold cascades |
| Temporal | Sequence ordering, interval overlap, temporal constraint propagation |
| Spatial | Topology validation, constraint satisfaction, dimensional reasoning |
Hardening: Tasks target a 50-65% baseline success window. 54% include counter-intuitive elements where the obvious answer is wrong, mirroring real production scenarios where the first instinct misleads. Answer format is mixed (60 multiple-choice, 60 yes/no, 60 free-text) to prevent format-based pattern matching. Distractors match common intermediate computation errors, not random alternatives.
Methodology
Model: Claude Opus 4.6 with extended thinking at maximum effort.
Agent-native execution: Identical to our external benchmark. Agents called the Ejentum production Logic API themselves via tool use. The agent summarized the task, called the endpoint, received the scaffold, and applied it before reasoning. This is how the API works in production; the retrieval variance is real, not simulated.
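The loop above (summarize, call, receive scaffold, apply) can be sketched in a few lines. Everything below is hypothetical: the function names, the payload shape, and the stub standing in for the Logic API endpoint are illustrative assumptions, not the real API contract.

```python
# Hypothetical sketch of the agent-native loop. The payload fields and
# the stubbed endpoint are assumptions, not the real Logic API contract.

def fetch_scaffold(task_summary: str, mode: str, call_api) -> str:
    """Tool-use step: the agent itself calls the Logic API.
    `call_api` is injected so the sketch runs without network access."""
    return call_api({"summary": task_summary, "mode": mode})

def build_prompt(task_text: str, mode: str, call_api) -> str:
    summary = task_text[:200]            # agent-produced task summary
    scaffold = fetch_scaffold(summary, mode, call_api)
    return f"{scaffold}\n\n{task_text}"  # scaffold applied before reasoning

# Stub standing in for the production endpoint:
stub = lambda payload: f"[{payload['mode']} scaffold for: {payload['summary'][:30]}]"
prompt = build_prompt("Trace the cascade failure through the supply network.", "haki", stub)
```

The point of the injected `call_api` is only to keep the sketch self-contained; in production the call is a real tool invocation made by the agent.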
Three conditions:
- A (Baseline): Raw task, no injection, no tool access
- B1 (Ki, single-ability): Task + one cognitive ability injected via API call
- C1 (Haki, multi-ability): Task + four composed abilities injected via API call
Blind 7-factor rubric: Same seven factors as the external benchmark, scored 0-3 by a separate evaluator that never saw which condition produced which response.
| Factor | What It Measures |
|---|---|
| Correctness | Right answer with valid reasoning |
| Reasoning Depth | Multi-level analysis, second/third-order effects |
| Self-Monitoring | Explicit metacognitive awareness, bias checking |
| Verification | Counterfactual checks, boundary tests, re-derivation |
| Epistemic Honesty | Known vs. assumed, confidence calibration |
| Alternative Consideration | Competing explanations, systematic elimination |
| Audit Trail | Traceable reasoning chain, named methods |
Scale: 540 total generation calls. 536 valid judgments (99.3%). Four generation failures (1 baseline, 3 Haki).
Evaluation: LLM-as-judge (Claude Opus 4.6). Same limitation as the external benchmark; human evaluation would be stronger. Two-stage blind protocol mitigates systematic bias.
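The two-stage blind protocol can be sketched as follows. This is our reading of the described setup with hypothetical helper names; the actual judging pipeline is not published here.

```python
import random

# Sketch of a two-stage blind protocol: condition labels are stripped
# and shuffled before judging (stage 1), then scores are mapped back
# to conditions (stage 2). Helper names are hypothetical.

def blind(responses: dict, rng: random.Random) -> tuple[dict, dict]:
    """Stage 1: replace condition labels with anonymous IDs in shuffled order."""
    items = list(responses.items())
    rng.shuffle(items)
    key = {f"R{i}": cond for i, (cond, _) in enumerate(items)}
    blinded = {f"R{i}": text for i, (_, text) in enumerate(items)}
    return blinded, key

def unblind(scores: dict, key: dict) -> dict:
    """Stage 2: after judging, map anonymous IDs back to conditions."""
    return {key[rid]: s for rid, s in scores.items()}

responses = {"baseline": "four sentences", "ki": "names strategy", "haki": "rejects options"}
blinded, key = blind(responses, random.Random(0))
scores = {rid: len(text) for rid, text in blinded.items()}  # stand-in judge
by_condition = unblind(scores, key)
```

The judge only ever sees the `R0`/`R1`/`R2` identifiers, so condition membership cannot leak into scoring.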
Results
Overall Composite
| Condition | Composite | Delta |
|---|---|---|
| A (Baseline) | 0.621 | - |
| B1 (Ki, single-ability) | 0.711 | +9.0pp |
| C1 (Haki, multi-ability) | 0.722 | +10.1pp |
Haki outperformed Ki by 1.1 percentage points. This is the opposite of our external benchmark, where Ki outperformed Haki by 12.2pp. The reversal is the central finding, analyzed below.
Per-Factor Breakdown
| Factor | Baseline | Ki | Ki Delta | Haki | Haki Delta |
|---|---|---|---|---|---|
| Self-Monitoring | 0.94 | 1.70 | +0.76 | 1.81 | +0.86 |
| Verification | 1.50 | 2.01 | +0.51 | 2.16 | +0.67 |
| Alternative Consideration | 1.37 | 1.77 | +0.40 | 1.85 | +0.47 |
| Epistemic Honesty | 1.54 | 1.90 | +0.36 | 1.94 | +0.40 |
| Reasoning Depth | 2.44 | 2.54 | +0.10 | 2.57 | +0.14 |
| Audit Trail | 2.64 | 2.75 | +0.11 | 2.76 | +0.12 |
| Correctness | 2.60 | 2.57 | -0.03 | 2.49 | -0.11 |
Haki outscored Ki on six of the seven factors; correctness, the exception, is analyzed below. On complex multi-variable tasks, four perspectives are better than one.
Self-monitoring nearly doubled (0.94 → 1.81). The baseline agent rarely checks its own assumptions. The injected agent does it consistently: questioning biases, reflecting on its reasoning process, and flagging when it's uncertain.
Verification increased 44% (1.50 → 2.16). Injected agents perform counterfactual checks, test boundary conditions, and re-derive intermediate results. The baseline agent stops at the first verification that confirms its answer.
What +10.1pp Looks Like on Real Tasks
Task 1: Same answer, different reasoning (CA-V2-18, Causal)
A social media company finds that users who receive more 'likes' post more frequently. They implement a feature to artificially boost likes on new users' posts. After 3 months, the boosted group posts only 5% more, far less than the 40% predicted from the correlation. If the dominant causal direction is reverse (prolific posters generate more content, which gets more total likes), what should the relationship between posting frequency and likes-per-post look like?
All three conditions answered correctly: (B) Negative. The difference is entirely in reasoning quality.
Baseline: Four sentences. States the answer as self-evident. Zero self-monitoring, zero verification, zero consideration of why the other options are wrong. Composite: 0.286.
Ki (single-ability): Explicitly names its reasoning strategy ("suppressing linear one-way thinking"), identifies the feedback loop mechanism, and uses the failed intervention as a sanity check. Self-monitoring: 0 → 2. Composite: 0.700.
Haki (multi-ability): Systematically rejects all three wrong options with model-specific reasons: (A) would require a skill mechanism not present, (C) contradicts the model's core structure, and (D) requires non-monotonic behavior with no basis. Connects the intervention result as empirical evidence. Alternative consideration: 0 → 3. Composite: 0.833.
| Factor | Baseline | Ki | Haki |
|---|---|---|---|
| Correctness | 3 | 3 | 3 |
| Self-Monitoring | 0 | 2 | 2 |
| Verification | 0 | 2 | 2 |
| Epistemic Honesty | 0 | 1 | 2 |
| Alternative Consideration | 0 | 1 | 3 |
| Composite | 0.286 | 0.700 | 0.833 |
The agent got the same answer all three times. But only the injected versions produced reasoning you could audit, challenge, and trust.
Task 2: Only Haki breaks through (MC-V2-22, Metacognitive)
A tech company's diversity report shows women represented 25% of applications, 24% of phone screens, 23% of on-sites, 22% of offers, and 22% of hires. The company claims: since the ratio remained within 3 percentage points at each stage, the process shows no significant gender bias. Does the near-constant ratio prove the absence of bias?
Baseline and Ki both answered (C) "No, small drops compound." This is wrong. The correct answer is (D) "No, the 25% application pool may itself reflect bias."
Baseline: Produces thorough math analyzing the compounding effect. Strong reasoning depth (score 3). But explicitly dismisses option D: "it addresses a different question: upstream pipeline bias, not the hiring process." Gets trapped by the seductive-but-wrong option. Correctness: 1.
Ki: Performs a named bias scan before answering, identifying anchoring bias, confirmation bias, and scope insensitivity in the company's argument. But it still falls for option C, stating "while D raises a valid concern, it doesn't address the specific claim." Correctness: 1.
Haki: Constructs an explicit counterfactual: "Suppose the ratio had been perfectly constant at 25% through every stage. Would that prove no bias? No. The argument never asks why only 25% applied in the first place. Maintaining a potentially-biased baseline is not evidence of neutrality; it's evidence of baseline preservation." Correctness flips to 3.
| Factor | Baseline | Ki | Haki |
|---|---|---|---|
| Correctness | 1 | 1 | 3 |
| Self-Monitoring | 2 | 3 | 3 |
| Epistemic Honesty | 1 | 2 | 3 |
| Composite | 0.714 | 0.867 | 0.933 |
Both baseline and Ki had strong reasoning depth (score 3). Strong reasoning aimed at the wrong target still produces the wrong answer. Only Haki's multi-perspective injection, specifically the counterfactual construction from the alternative scaffold, broke through the trap.
The Correctness Paradox
Correctness didn't improve. On Ki it dropped by 0.03 on a 3-point scale. On Haki it dropped by 0.11. This is the opposite of what most people expect from a "reasoning improvement" tool.
Here's why it's the point, not a flaw:
Baseline correctness is already 2.60/3.00 (86.7%). Claude Opus 4.6 is a frontier reasoning model. On most tasks, it already gets the right answer. There's a 13.3% gap between baseline and perfect, and that gap includes tasks that are genuinely hard, where even the best reasoning may not produce the correct conclusion.
The quality factors have far more headroom. Self-monitoring baseline is 0.94/3.00 (31.3%). Verification is 1.50/3.00 (50.0%). These factors have 50-70% room for improvement. Correctness has 13%. The intervention naturally produces larger deltas where there's more room.
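The headroom arithmetic is easy to verify from the baseline column of the per-factor table:

```python
# Headroom per factor: fraction of the 0-3 scale still unclaimed by the
# baseline. Input values are baseline scores from the per-factor table.
baseline = {"self_monitoring": 0.94, "verification": 1.50, "correctness": 2.60}

headroom = {f: round((3.0 - s) / 3.0 * 100, 1) for f, s in baseline.items()}
# self-monitoring has roughly 69% of the scale left; correctness only ~13%
```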
The slight correctness dip is a trade-off, not a failure. When an agent spends more tokens on self-monitoring, verification, and alternative consideration, it allocates attention budget away from pure answer-seeking. The reasoning gets more thorough but occasionally more cautious, hedging where the baseline would have committed. On a 3-point correctness scale, this manifests as a 0.03-0.11 drop. On a production deployment where trust matters, the trade-off is worth it.
We are not spinning this. If correctness had dropped by 0.5 or 1.0, that would be a real problem. The observed drop (-0.03 Ki, -0.11 Haki) is small relative to the scale and consistent with the attention budget hypothesis. But it's a real number and we report it as such.
The Reversal: Why Haki Wins Here
On our external benchmark, Ki outperformed Haki by 12.2pp. Here, Haki outperforms Ki by 1.1pp. The reversal maps cleanly to task complexity:
| Task Type | Ki Lift | Haki Lift | Winner |
|---|---|---|---|
| Focused tasks (external benchmark) | +20.8pp | +8.6pp | Ki |
| Complex tasks (EjBench) | +9.0pp | +10.1pp | Haki |
External benchmark tasks are focused: one reasoning challenge per task, one right answer. Ki delivers one high-precision scaffold with 93% signal density. No noise, no competing perspectives.
EjBench tasks are complex: multiple variables, counter-intuitive elements, multi-step reasoning chains. Haki delivers four abilities: a primary that sets the direction, a dependency that grounds it, an amplifier that deepens it, and an alternative that challenges it. The compound suppression from all four blocks more failure modes than any single scaffold can reach.
The practical decision framework:
- Your agent handles focused, single-domain tasks → Ki
- Your agent handles complex, multi-variable analysis → Haki
- You're not sure → start with Ki, test your hardest tasks, upgrade if single mode doesn't cover them
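The framework reduces to a small selection rule. The function name, the string values, and the parameters below are illustrative only, not part of the API:

```python
def choose_mode(task_profile: str, ki_failed_hard_tasks: bool = False) -> str:
    """Illustrative encoding of the decision framework above.
    `task_profile` is one of "focused", "complex", or "unsure"."""
    if task_profile == "focused":
        return "ki"        # focused, single-domain tasks
    if task_profile == "complex":
        return "haki"      # complex, multi-variable analysis
    # unsure: start with Ki, upgrade only if it fails your hardest tasks
    return "haki" if ki_failed_hard_tasks else "ki"
```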
By Domain
| Domain | Baseline | Ki | Ki Delta | Haki | Haki Delta |
|---|---|---|---|---|---|
| Abstraction | 0.627 | 0.753 | +12.6pp | 0.820 | +19.3pp |
| Simulation | 0.513 | 0.677 | +16.4pp | 0.699 | +18.6pp |
| Causal | 0.637 | 0.732 | +9.5pp | 0.778 | +14.1pp |
| Metacognition | 0.743 | 0.801 | +5.8pp | 0.828 | +8.5pp |
| Spatial | 0.619 | 0.693 | +7.4pp | 0.591 | -2.8pp |
| Temporal | 0.587 | 0.608 | +2.1pp | 0.608 | +2.1pp |
Abstraction showed the strongest Haki lift (+19.3pp). Tasks requiring systematic exploration of mathematical structures (group theory, isomorphism, category enforcement) benefit most from multi-ability suppression of premature categorization.
Simulation had the lowest baseline (0.513) and the largest Ki lift (+16.4pp). The most room for improvement, and the self-monitoring delta was especially large: 0.57 baseline → 1.83 injected (3.2x increase).
Spatial regressed under Haki (-2.8pp). Ki improved Spatial by +7.4pp, but Haki made it worse. On spatial tasks, four competing perspectives confused the constraint tracking rather than clarifying it. This is consistent with the external benchmark finding: focused tasks need focused scaffolds.
Temporal showed minimal lift (+2.1pp both modes). This suggests temporal reasoning tasks may require more domain-specific ability content than current retrieval provides.
Task Flips
| Condition | Improved | Degraded | Net Flip | Net Rate |
|---|---|---|---|---|
| Ki | 88 | 21 | +67 | 37.4% |
| Haki | 92 | 16 | +76 | 42.9% |
Haki improved more tasks (92 vs 88) and degraded fewer (16 vs 21). Nearly half of all tasks showed measurable quality improvement under multi-ability injection. 39% of tasks were neutral: either already high quality, or task types where retrieved abilities didn't activate relevant suppression.
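The reported net rates reproduce if the denominator is the number of valid baseline-condition task pairs (179 for Ki, 177 for Haki, after the generation failures). That denominator is our inference from the reported counts, not a formula stated in the methodology:

```python
# Net flip rate = (improved - degraded) / valid baseline-condition pairs.
# Pair counts (179 for Ki, 177 for Haki) are inferred from the reported
# generation failures, not stated explicitly in the methodology.
def net_flip_rate(improved: int, degraded: int, pairs: int) -> float:
    return round((improved - degraded) / pairs * 100, 1)

ki_rate = net_flip_rate(88, 21, 179)    # reported as 37.4%
haki_rate = net_flip_rate(92, 16, 177)  # reported as 42.9%
```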
Limitations
- First-run benchmark. EjBench has been run once. These are first-observation results, not replicated findings. The factor ranking is consistent with our external benchmark (self-monitoring and verification are always the top two), which provides indirect validation.
- Custom tasks. We designed these tasks. Despite hardening measures (counter-intuitive elements, mixed formats, no answer leaks), selection bias is possible. The external benchmark exists specifically to address this concern.
- LLM-as-judge. Same limitation as the external benchmark: Claude evaluating Claude. Two-stage blind protocol mitigates but does not eliminate potential systematic bias.
- One model. All results are on Claude Opus 4.6. Suppression signals target architectural properties of transformers broadly, but generalization to other models is not yet tested.
- Correctness trade-off. The -0.11 Haki correctness dip is small but real. On tasks where getting the right answer matters more than reasoning transparency, this trade-off may not be acceptable.
What This Means
RA²R injection doesn't make your agent smarter. It makes your agent disciplined.
The baseline agent reasons well enough to get the right answer 87% of the time. But it does so through shortcuts: accepting the first plausible explanation, projecting uniform confidence, skipping verification. These shortcuts are invisible until you measure the reasoning process, not just the conclusion.
The injected agent gets the same answers. But it shows why. It checks itself. It considers alternatives. It acknowledges what it doesn't know. It produces reasoning you can audit, challenge, and trust.
If you're building an agent that needs to be right, correctness-only evaluation is sufficient. If you're building an agent that needs to be trusted, reasoning quality is what separates a prototype from production infrastructure.
Source Data
- Benchmark: EjBench v2 (custom, 180 tasks)
- Domains: 6 | Tasks per domain: 30 | Total tasks: 180
- Valid judgments: 536 | Conditions: 3
- Model: Claude Opus 4.6 | Evaluation date: March 2026
- External validation: BBH/CausalBench/MuSR benchmark
- Full benchmark data and scoring methodology: github.com/ejentum/benchmarks
These findings are part of our research paper: Under Pressure: RA²R and the Emergence of Uninstructed Reasoning Behaviors (PDF).