Benchmarks

Measured behavioral improvements across 10 professional domains. Two-stage blind protocol: separate generation and evaluation, randomized conditions, 250 total tasks.

Full benchmark data, generation outputs, judgment scores, and reproducibility files are on GitHub. These results are consolidated in our research paper: Under Pressure: RA²R and the Emergence of Uninstructed Reasoning Behaviors (PDF).

Headline Numbers

Signal	Without Injection	With Injection	Improvement
Self-Monitoring	0.74/3.0	1.92/3.0	+158%
Verification	1.50/3.0	2.23/3.0	+49%
Epistemic Honesty	1.54/3.0	2.05/3.0	+33%
Alternative Consideration	1.37/3.0	1.93/3.0	+41%
Reasoning Depth	2.44/3.0	2.66/3.0	+9%
Audit Trail Quality	2.64/3.0	2.82/3.0	+7%
Correctness Stability	Baseline	Maintained	No degradation

Mean quality lift across 10 professional domains: +7.4 (9/10 domain wins). Correctness itself held flat.
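
The Improvement column reads as a relative change over the baseline mean. A minimal sketch of that arithmetic follows, assuming the formula (after - before) / before, which the page does not state explicitly. Recomputing from the rounded table values can differ by a point from the published figure (Self-Monitoring comes out near +159% rather than +158%), presumably because the published numbers were derived from unrounded means.

```python
# Headline means copied from the table above (0-3 rubric scale).
HEADLINE = {
    "Self-Monitoring": (0.74, 1.92),
    "Verification": (1.50, 2.23),
    "Epistemic Honesty": (1.54, 2.05),
    "Alternative Consideration": (1.37, 1.93),
    "Reasoning Depth": (2.44, 2.66),
    "Audit Trail Quality": (2.64, 2.82),
}


def relative_improvement(before: float, after: float) -> float:
    """Relative change in percent (assumed formula, not confirmed by the source)."""
    return (after - before) / before * 100.0


if __name__ == "__main__":
    for signal, (before, after) in HEADLINE.items():
        print(f"{signal}: {relative_improvement(before, after):+.1f}%")
```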

What These Signals Mean

Self-Monitoring: Does the agent question its own reasoning mid-execution? Without injection, agents almost never pause to check for bias or course-correct. With injection, the agent actively examines its assumptions.

Verification: Does the agent re-check its conclusions? Without injection, agents reach an answer and stop. With injection, agents run counterfactual tests, boundary checks, and re-derive from different angles.

Epistemic Honesty: Does the agent separate facts from assumptions? Without injection, agents present conclusions as certainties. With injection, agents flag when conclusions rest on unverified premises and calibrate confidence explicitly.

Alternative Consideration: Does the agent evaluate competing hypotheses? Without injection, agents commit to the first plausible explanation. With injection, agents systematically evaluate alternatives and explain why they were rejected.

Reasoning Depth: Does the agent trace second and third-order effects? Without injection, agents provide one level of analysis. With injection, agents trace why something happens, what else it affects, and what would change if a key assumption were different.

Audit Trail Quality: Can a third party follow the reasoning? Without injection, reasoning steps are implicit. With injection, agents produce explicitly labeled steps, intermediate values, and named methods.

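Correctness Stability: Does injection hurt accuracy? Not meaningfully in these benchmarks. Correctness held roughly flat on EjBench (2.60 to 2.49 out of 3.0) and rose on the published academic tasks (2.19 to 2.33 out of 3.0). Reasoning quality improves without sacrificing correctness.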

Dynamic vs Composite (benchmark conditions)

Task Type	Stronger condition	Why
Focused, single-domain tasks	Single ability (dynamic)	One high-precision operation. Minimal token overhead.
Complex, cross-domain tasks	Composite	A primary ability plus cross-domain failure guards caught failure modes a single ability missed.

In our benchmarks, the single-ability condition did not produce the best result on 10 tasks; the composite condition recovered 6 of them. The composite mode is retired as a product; today's adaptive mode pursues task-specific depth instead (see Pricing).

The Benchmarks

EjBench (180 Custom Professional Tasks)

Custom tasks across 6 reasoning domains. Blind two-stage protocol: agents call the API as a tool (not injected artificially), a separate evaluator scores outputs without knowing which condition produced them. Full report: EjBench: 180 Professional Tasks, Agent-Native, Blind.

Signal	Baseline	With Injection	Change
Composite Score	0.621	0.722	+10.1pp
Self-Monitoring	0.94/3.0	1.81/3.0	+92%
Verification	1.50/3.0	2.16/3.0	+45%
Alternative Consideration	1.37/3.0	1.85/3.0	+35%
Epistemic Honesty	1.54/3.0	1.94/3.0	+26%
Correctness	2.60/3.0	2.49/3.0	-0.11 (reported as flat)

Key observation: correctness stayed roughly flat (2.60 to 2.49 out of 3.0) while every other signal in the table improved. The agent doesn't get more right answers. It gets the same answers with better reasoning: more self-checking, more verification, more transparent chains.

Published Academic Benchmarks (70 Tasks)

BIG-Bench Hard, CausalBench, and MuSR (multi-step reasoning). Same blind protocol, same 7-signal rubric. These are published, peer-reviewed tasks that Ejentum has never seen. Full report: RA2R on BBH, CausalBench, and MuSR.

Signal	Baseline	With Injection	Change
Composite Score	0.694	0.774	+8.0pp
Self-Monitoring	0.74/3.0	1.73/3.0	+132%
Verification	0.96/3.0	1.77/3.0	+85%
Correctness	2.19/3.0	2.33/3.0	+0.14

Key observation: on focused tasks with clear right/wrong answers, correctness ALSO improved. Single-ability mode dominates on focused tasks. The injection blocks the specific shortcut the task tests for.

ARC-AGI-3 Interactive Reasoning (50 Steps)

ARC-AGI-3 is the world's only unbeaten AI benchmark. Frontier model performance: 0.26%. It tests interactive reasoning: an agent is dropped into an unknown game environment with no instructions and must explore, hypothesize, revise, and act efficiently across dozens of steps. No memorization possible. Current LLMs fail because they commit to false hypotheses and never self-correct.

This is the first benchmark where we measure reasoning quality over extended execution chains, not single-turn outputs.

Study design: Claude Sonnet 4.6 on game LS20 (spatial navigation, 7 levels). Two conditions: baseline (no RA2R) vs augmented (RA2R as agent-initiated tool). Same model, same seed, 25 steps per condition.

Game outcome: Both conditions scored RHAE 0.0. Neither cleared Level 0. This is expected at <1% frontier solve rates. The evidence is in the reasoning process.

Metric	Baseline	Augmented	Delta
Memory decay slope	-0.005	+0.014	Reversed. Quality improved instead of degrading.
Injection half-life	0 steps	24 steps	Injection never left working memory.
Reasoning depth trend	0.86	10.50	12.2x growth. Analysis deepened over time.
Vocabulary diversity trend	-0.079	+0.415	Baseline narrowed. Augmented expanded.
Stuck episodes	2	1	50% fewer repetitive action loops.
Action diversity (lateral)	8%	16%	Doubled. Prevented vertical fixation.
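
The slope and trend rows above read like linear trends fitted to per-step scores. A minimal sketch, assuming an ordinary least-squares fit against the step index; the full report may use a different estimator, and the example scores are made up for illustration, not taken from the LS20 traces.

```python
from typing import Sequence


def trend_slope(scores: Sequence[float]) -> float:
    """Ordinary least-squares slope of scores against step index 0..n-1."""
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two steps to fit a slope")
    mean_x = (n - 1) / 2.0
    mean_y = sum(scores) / n
    cov = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(scores))
    var = sum((i - mean_x) ** 2 for i in range(n))
    return cov / var


# Hypothetical per-step quality scores (illustrative only).
decaying = [2.0, 1.9, 1.9, 1.8, 1.7]
improving = [1.5, 1.6, 1.8, 1.9, 2.1]

if __name__ == "__main__":
    print(f"decaying slope:  {trend_slope(decaying):+.3f}")
    print(f"improving slope: {trend_slope(improving):+.3f}")
```

A negative slope corresponds to quality decaying over the run; a positive slope corresponds to the reversal described in the first row.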

Cost: Baseline $2.88 (84k tokens). Augmented $8.48 (357k tokens). The augmented condition used roughly 4.2x as many tokens, at about 2.9x the cost, because of its 2-call-per-step architecture.

Unexpected finding: Contradiction rate increased 1.9x (token-normalized). Longer reasoning chains expose more opportunities for self-contradiction. Warrants investigation.

Limitations: n=1 per condition. These are indicative traces, not statistically validated findings. All process metrics are measured in a failure context (neither agent cleared the level).

Full report: RA2R on ARC-AGI-3. Step-by-step reasoning trace: ARC-LS20-TRACE.

LiveCodeBench Hard (28 Hard Competitive Programming Tasks)

28 hard competitive programming tasks from LiveCodeBench, all from AtCoder. Claude Opus 4.6 with maximum-effort extended thinking. The augmented condition used the harness skill file with forced injection on every hard task.

Condition	Passed	Rate
Baseline (Opus max effort)	24/28	85.7%
Augmented (+ harness)	28/28	100.0%
Delta	+4	+14.3pp

Zero regressions across all 28 tasks. The harness fixed 2 reasoning spirals (600-1200s of thinking, zero code), 1 premature convergence (wrong algorithm accepted in 11 seconds), and 1 precision mismatch.
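
The pass rates, the delta in percentage points, and the regression count can be recomputed from per-task pass/fail results. A minimal sketch: only the counts (24/28 and 28/28) come from the table above; the position of the failing tasks in the lists is a placeholder, and the function names are illustrative.

```python
from typing import Dict, List


def pass_rate(results: List[bool]) -> float:
    """Share of passed tasks, in percent."""
    return 100.0 * sum(results) / len(results)


def compare(baseline: List[bool], augmented: List[bool]) -> Dict[str, float]:
    """Paired comparison; index i must refer to the same task in both lists."""
    if len(baseline) != len(augmented):
        raise ValueError("both conditions must cover the same tasks")
    pairs = list(zip(baseline, augmented))
    return {
        "baseline_rate": pass_rate(baseline),
        "augmented_rate": pass_rate(augmented),
        "delta_pp": pass_rate(augmented) - pass_rate(baseline),
        "fixes": sum(1 for b, a in pairs if not b and a),
        "regressions": sum(1 for b, a in pairs if b and not a),
    }


# Counts from the table; which four tasks failed is a placeholder ordering.
BASELINE = [True] * 24 + [False] * 4
AUGMENTED = [True] * 28

if __name__ == "__main__":
    for name, value in compare(BASELINE, AUGMENTED).items():
        print(f"{name}: {value:.1f}")
```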

Independent blind evaluation confirmed three things. The harness never loses on correctness (2-0) or robustness (4-0). Its quality scores show a 3.5x magnitude asymmetry. And a blind evaluator independently traced a fatal bug in the baseline without knowing which solution used the harness.

Full report: LiveCodeBench Hard benchmark. Observations: What We Saw When Opus Thought Harder. Methodology and raw data: GitHub.

Live Blind Eval on RAG Hallucination (Menu KB, 4 Cross-Lab Judges)

A public, reproducible eval on a 49-chunk knowledge base with engineered gaps. Two identical Claude Haiku 4.5 producers use identical retrieval; only one has the harness wired in as a runtime tool. Four blind judges from four different labs (cross-family by design) score the outputs: Kimi K2 (Moonshot), Sonnet 3.7 (Anthropic), MiniMax 2.5, DeepSeek V4 Flash.

Dimension	Avg delta (B - A)
honesty_uncertainty	+0.42
citation_accuracy	+0.21
conflict_handling	+0.11
groundedness	+0.05
specificity	-0.37

On the compound dietary safety question (gluten-free + nut allergy), three of four judges agreed the harness was the safer call. On the "chef's signature dish" fabrication trap, the harness named the absence; the baseline picked a high-value main and labeled it as the signature dish. On one question (egg-allergen on desserts) the harness lost while being structurally correct, which the published findings call out. The sample is small (n=5 reference questions, 19 judge calls), and one judge (Sonnet 3.7) is from the same model family as the producers (Haiku 4.5).
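
The per-dimension averages in the table can be reproduced by averaging B - A over all judge calls. A minimal sketch, assuming one row per judge call and dimension; the row layout and the scores are hypothetical and do not reflect the actual judge CSV schema or data.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List, Tuple

# Hypothetical rows: (judge, question, dimension, score_a, score_b).
Row = Tuple[str, str, str, float, float]

ROWS: List[Row] = [
    ("judge_1", "q1", "honesty_uncertainty", 3, 4),
    ("judge_2", "q1", "honesty_uncertainty", 4, 4),
    ("judge_1", "q1", "specificity", 5, 4),
    ("judge_2", "q1", "specificity", 4, 4),
]


def mean_delta_by_dimension(rows: List[Row]) -> Dict[str, float]:
    """Average of (score_b - score_a) per dimension across all judge calls."""
    deltas = defaultdict(list)
    for _judge, _question, dimension, score_a, score_b in rows:
        deltas[dimension].append(score_b - score_a)
    return {dimension: mean(values) for dimension, values in deltas.items()}


if __name__ == "__main__":
    for dimension, delta in mean_delta_by_dimension(ROWS).items():
        print(f"{dimension}: {delta:+.2f}")
```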

Workflow JSON, code nodes, system prompts, KB, raw judge CSV, findings doc: github.com/ejentum/eval. Reference run: menu_rag_5q.

What Changes in the Output

Real outputs from blind evaluation. The agent called the API itself, received the injection, and produced the response. Full outputs are available on each task profile.

Task CA-V2-18: Reverse causality detection. A social media company observes likes correlate with posting frequency. The agent must reason about what the data would look like if the causal direction were reversed.

Without injection (composite: 0.286):

The answer is (B) Negative. Under the reverse-causal model, prolific posters produce more content but each additional post dilutes quality/attention.

Correct answer but no causal model structure. No mechanistic trace.

With Composite injection (composite: 0.833, +0.548):

The answer is (B) -- Negative. The reverse-causal model works like this: Prolific poster (intrinsic trait) -> posts more content -> more content accumulates more total likes -> but each individual post competes with the poster's own other content...

The injection forced a causal graph before answering. Same answer, fundamentally different reasoning depth. The chain is explicit, traceable, and defensible.

More real before/after outputs: Response Examples. Browse all 29 benchmark tasks.

Methodology

All results were generated using a two-stage blind evaluation protocol:

Generation stage: Agents call the Ejentum API themselves as a tool. No artificial injection. This mirrors how production agents use the harness API.
Evaluation stage: A separate evaluator scores outputs on the 7-signal rubric without knowing which condition produced which output.

Agents are blind to ground truth. Judges are blind to condition. The rubric is applied identically across all conditions.
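
The blinding step of the evaluation stage can be sketched as follows: condition labels are stripped and the order shuffled before judging, and scores are mapped back to conditions only afterwards. This is an illustrative sketch, not the project's pipeline. The function names are hypothetical, and `toy_judge` is a stand-in for the real evaluator, which scores against the 7-signal rubric.

```python
import random
from typing import Callable, Dict, List, Tuple

Outputs = Dict[str, Dict[str, str]]  # task_id -> {condition: response text}


def blind(outputs: Outputs, rng: random.Random):
    """Shuffle responses and replace (task, condition) with opaque item ids."""
    items = [
        (task_id, condition, text)
        for task_id, by_condition in outputs.items()
        for condition, text in by_condition.items()
    ]
    rng.shuffle(items)
    blinded: List[Tuple[str, str]] = []
    key: Dict[str, Tuple[str, str]] = {}
    for index, (task_id, condition, text) in enumerate(items):
        item_id = f"item-{index:04d}"
        blinded.append((item_id, text))      # what the judge sees
        key[item_id] = (task_id, condition)  # kept away from the judge
    return blinded, key


def evaluate(outputs: Outputs, judge: Callable[[str], float], seed: int = 0) -> Dict[str, float]:
    """Score blinded items, then unblind and average per condition."""
    blinded, key = blind(outputs, random.Random(seed))
    scores = {item_id: judge(text) for item_id, text in blinded}
    per_condition: Dict[str, List[float]] = {}
    for item_id, score in scores.items():
        _task_id, condition = key[item_id]
        per_condition.setdefault(condition, []).append(score)
    return {c: sum(v) / len(v) for c, v in per_condition.items()}


def toy_judge(text: str) -> float:
    """Placeholder scorer (word count); a real judge would apply the rubric."""
    return float(len(text.split()))
```

The judge function receives only the response text, so it cannot condition its score on which arm produced it.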

250 total tasks across published academic benchmarks and custom professional scenarios spanning healthcare, finance, legal, cybersecurity, supply chain, logistics, real estate, education, agriculture, and telecom.

When to Use Which Mode

Your situation	Recommended	Why
Agent handles one task type (debugging, analysis, summarization)	Dynamic	One operation is precise enough. Minimal token overhead, zero added latency.
Agent needs the procedure mapped to its exact task	Adaptive	The rewrite names your variables, files, and constraints before the model starts.
Agent fails on specific tasks despite good prompts	Dynamic	The injection targets the exact failure pattern.
Agent produces plausible but shallow analysis	Adaptive	The task-specific procedure pushes past the generic template.
Testing or evaluation phase	Dynamic (reasoning)	Start simple. Measure. Add adaptive only where it earns the extra compute.
Budget-sensitive deployment	Dynamic	1,000 dynamic calls/month on Go at €5. Upgrade when volume or complexity demands it.

Rule of thumb: Start with dynamic (reasoning). Evaluate on your hardest tasks. If dynamic doesn't improve them, try adaptive-reasoning on the same tasks. Also try domain-specific modes (code, anti-deception, memory) for specialized tasks.
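
The rule of thumb can be written down as a small decision helper. A minimal sketch: the mode labels are taken from the text above, but they are used here as plain strings and are not confirmed API parameter values, and the function itself is hypothetical.

```python
from typing import List, Optional

# Domain-specific modes named in the rule of thumb.
SPECIALIZED_MODES = {"code", "anti-deception", "memory"}


def modes_to_try(dynamic_improved: Optional[bool], specialized: Optional[str] = None) -> List[str]:
    """Ordered list of modes to evaluate.

    dynamic_improved: None if dynamic has not been measured yet,
    otherwise whether it improved results on your hardest tasks.
    specialized: optional task domain, e.g. "code".
    """
    plan = ["reasoning"]  # always start with dynamic (reasoning)
    if dynamic_improved is False:
        plan.append("adaptive-reasoning")
    if specialized in SPECIALIZED_MODES:
        plan.append(specialized)
    return plan
```

For example, `modes_to_try(False, "code")` returns `["reasoning", "adaptive-reasoning", "code"]`.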

What We Measure, What We Don't

We measure behavioral change: does the agent reason differently? We do not claim domain expertise injection. If your agent lacks access to your data, Ejentum cannot compensate. We improve HOW the agent reasons about information it already has.

See Real Outputs

Browse 29 benchmark tasks with full verbatim outputs from baseline, Dynamic, and Composite conditions. Each task shows the 7-signal rubric scores side by side.

See how these results apply to specific industries: 13 use case profiles with failure patterns, resolution abilities, and benchmark evidence per vertical.

Reproduce Our Results

Full benchmark data, generation outputs, judgment scores, and reproducibility files are on GitHub.
