Documentation

Everything you need to build with Ejentum. From quickstart guides to advanced patterns.

Benchmarks

Measured behavioral improvements across 10 professional domains. Two-stage blind protocol: separate generation and evaluation, randomized conditions, 250 total tasks.

Full benchmark data, generation outputs, judgment scores, and reproducibility files are on GitHub. These results are consolidated in our research paper: Under Pressure: RA²R and the Emergence of Uninstructed Reasoning Behaviors (PDF).

Headline Numbers

| Signal | Without Injection | With Injection | Improvement |
| --- | --- | --- | --- |
| Self-Monitoring | 0.74/3.0 | 1.92/3.0 | +158% |
| Verification | 1.50/3.0 | 2.23/3.0 | +49% |
| Epistemic Honesty | 1.54/3.0 | 2.05/3.0 | +33% |
| Alternative Consideration | 1.37/3.0 | 1.93/3.0 | +41% |
| Reasoning Depth | 2.44/3.0 | 2.66/3.0 | +9% |
| Audit Trail Quality | 2.64/3.0 | 2.82/3.0 | +7% |
| Correctness Stability | Baseline | Maintained | No degradation |
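
The Improvement column reads as a relative change against the no-injection score. A minimal sketch of that arithmetic, assuming the percentage is (with - without) / without; the function name is illustrative and not part of any Ejentum tooling:

```python
def relative_improvement(before: float, after: float) -> float:
    """Percentage change from `before` to `after` (assumed formula)."""
    if before == 0:
        raise ValueError("baseline score must be non-zero")
    return (after - before) / before * 100.0


# Verification row from the table: 1.50 -> 2.23 on a 0-3 scale.
print(f"{relative_improvement(1.50, 2.23):+.0f}%")  # prints +49%
```

Recomputing from the rounded table values can differ from a published figure by about a point, presumably because the published percentages were derived from unrounded means.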

Mean correctness lift: +7.4 across 10 professional domains (9/10 domain wins).

What These Signals Mean

Self-Monitoring: Does the agent question its own reasoning mid-execution? Without injection, agents almost never pause to check for bias or course-correct. With injection, the agent actively examines its assumptions.

Verification: Does the agent re-check its conclusions? Without injection, agents reach an answer and stop. With injection, agents run counterfactual tests, boundary checks, and re-derive from different angles.

Epistemic Honesty: Does the agent separate facts from assumptions? Without injection, agents present conclusions as certainties. With injection, agents flag when conclusions rest on unverified premises and calibrate confidence explicitly.

Alternative Consideration: Does the agent evaluate competing hypotheses? Without injection, agents commit to the first plausible explanation. With injection, agents systematically evaluate alternatives and explain why they were rejected.

Reasoning Depth: Does the agent trace second and third-order effects? Without injection, agents provide one level of analysis. With injection, agents trace why something happens, what else it affects, and what would change if a key assumption were different.

Audit Trail Quality: Can a third party follow the reasoning? Without injection, reasoning steps are implicit. With injection, agents produce explicitly labeled steps, intermediate values, and named methods.

Correctness Stability: Does injection hurt accuracy? No. The agent arrives at the correct answer at the same rate or better. Reasoning quality improves without sacrificing correctness.

Single vs Multi Mode

| Task Type | Recommended | Why |
| --- | --- | --- |
| Focused, single-domain tasks | Ki (reasoning, code, anti-deception, memory) | One high-precision ability. Minimal token overhead. |
| Complex, cross-domain tasks | Haki (reasoning-multi, code-multi, memory-multi) | Primary + cross-domain failure guards catch failure modes a single ability misses. |

In our benchmarks, reasoning-multi mode recovered 6 of the 10 tasks on which reasoning mode did not produce the best result.

The Benchmarks

EjBench (180 Custom Professional Tasks)

Custom tasks across 6 reasoning domains. Blind two-stage protocol: agents call the API as a tool (not injected artificially), a separate evaluator scores outputs without knowing which condition produced them. Full report: EjBench: 180 Professional Tasks, Agent-Native, Blind.

| Signal | Baseline | With Injection | Change |
| --- | --- | --- | --- |
| Composite Score | 0.621 | 0.731 | +10.1pp |
| Self-Monitoring | 0.94/3.0 | 1.81/3.0 | +92% |
| Verification | 1.50/3.0 | 2.16/3.0 | +45% |
| Alternative Consideration | 1.37/3.0 | 1.85/3.0 | +35% |
| Epistemic Honesty | 1.54/3.0 | 1.94/3.0 | +26% |
| Correctness | 2.60/3.0 | 2.49/3.0 | Flat |

Key observation: correctness stayed roughly flat (2.60 to 2.49 out of 3.0) while every reasoning-quality signal in the table improved. The agent does not get more right answers. It reaches the same answers with better reasoning: more self-checking, more verification, more transparent chains.

Published Academic Benchmarks (70 Tasks)

BIG-Bench Hard, CausalBench, and MuSR (multi-step reasoning). Same blind protocol, same 7-signal rubric. These are published, peer-reviewed tasks that Ejentum has never seen. Full report: RA2R on BBH, CausalBench, and MuSR.

| Signal | Baseline | With Injection | Change |
| --- | --- | --- | --- |
| Composite Score | 0.694 | 0.774 | +8.0pp |
| Self-Monitoring | 0.74/3.0 | 1.73/3.0 | +132% |
| Verification | 0.96/3.0 | 1.77/3.0 | +85% |
| Correctness | 2.19/3.0 | 2.33/3.0 | +0.14 |

Key observation: on focused tasks with clear right/wrong answers, correctness ALSO improved. Single-ability mode dominates on focused tasks. The injection blocks the specific shortcut the task tests for.

ARC-AGI-3 Interactive Reasoning (50 Steps)

ARC-AGI-3 is the world's only unbeaten AI benchmark. Frontier model performance: 0.26%. It tests interactive reasoning: an agent is dropped into an unknown game environment with no instructions and must explore, hypothesize, revise, and act efficiently across dozens of steps. No memorization possible. Current LLMs fail because they commit to false hypotheses and never self-correct.

This is the first benchmark where we measure reasoning quality over extended execution chains, not single-turn outputs.

Study design: Claude Sonnet 4.6 on game LS20 (spatial navigation, 7 levels). Two conditions: baseline (no RA2R) vs augmented (RA2R as agent-initiated tool). Same model, same seed, 25 steps per condition.

Game outcome: Both conditions scored RHAE 0.0. Neither cleared Level 0. This is expected at <1% frontier solve rates. The evidence is in the reasoning process.

| Metric | Baseline | Augmented | Delta |
| --- | --- | --- | --- |
| Memory decay slope | -0.005 | +0.014 | Reversed. Quality improved instead of degrading. |
| Injection half-life | 0 steps | 24 steps | Injection never left working memory. |
| Reasoning depth trend | 0.86 | 10.50 | 12.2x growth. Analysis deepened over time. |
| Vocabulary diversity trend | -0.079 | +0.415 | Baseline narrowed. Augmented expanded. |
| Stuck episodes | 2 | 1 | 50% fewer repetitive action loops. |
| Action diversity (lateral) | 8% | 16% | Doubled. Prevented vertical fixation. |

Cost: Baseline $2.88 (84k tokens). Augmented $8.48 (357k tokens). The augmented condition used 4.2x more tokens due to its 2-call-per-step architecture.
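
For budgeting, the same figures can be turned into ratios. A small sketch using only the rounded numbers quoted above; the variable names are illustrative:

```python
# Figures quoted in the cost line above (tokens rounded to the nearest thousand).
baseline_cost, baseline_tokens = 2.88, 84_000
augmented_cost, augmented_tokens = 8.48, 357_000

token_ratio = augmented_tokens / baseline_tokens  # how many times more tokens
cost_ratio = augmented_cost / baseline_cost       # how many times more spend

print(f"tokens: {token_ratio:.2f}x, cost: {cost_ratio:.2f}x")
```

On these rounded inputs, spend grows more slowly than token count, so the augmented run's average cost per token is lower than the baseline's.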

Unexpected finding: Contradiction rate increased 1.9x (token-normalized). Longer reasoning chains expose more opportunities for self-contradiction. This warrants further investigation.

Limitations: n=1 per condition. These are indicative traces, not statistically validated findings. All process metrics are measured in a failure context (neither agent cleared the level).

Full report: RA2R on ARC-AGI-3. Step-by-step reasoning trace: ARC-LS20-TRACE.

LiveCodeBench Hard (28 Hard Competitive Programming Tasks)

28 hard competitive programming tasks from LiveCodeBench, all from AtCoder. Claude Opus 4.6 with maximum-effort extended thinking. The augmented condition used the Logic API skill file with forced injection on every hard task.

| Condition | Passed | Rate |
| --- | --- | --- |
| Baseline (Opus max effort) | 24/28 | 85.7% |
| Augmented (+ Logic API) | 28/28 | 100.0% |
| Delta | +4 | +14.3pp |
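
The Rate and Delta figures follow directly from the pass counts. A short sketch of the arithmetic, with the delta expressed in percentage points (pp) rather than as a relative change:

```python
def pass_rate(passed: int, total: int) -> float:
    """Pass rate as a percentage of tasks solved."""
    return passed / total * 100.0


baseline = pass_rate(24, 28)
augmented = pass_rate(28, 28)
delta_pp = augmented - baseline  # percentage-point difference, not relative change

print(f"{baseline:.1f}% -> {augmented:.1f}% ({delta_pp:+.1f}pp)")
# prints 85.7% -> 100.0% (+14.3pp)
```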

Zero regressions across all 28 tasks. The harness fixed 2 reasoning spirals (600-1200s of thinking, zero code), 1 premature convergence (wrong algorithm accepted in 11 seconds), and 1 precision mismatch.

Independent blind evaluation confirmed that the harness never loses on correctness (2-0) or robustness (4-0) and exhibits a 3.5x magnitude asymmetry in quality scores. A blind evaluator also independently traced a fatal bug in the baseline without knowing which solution used the harness.

Full report: LiveCodeBench Hard benchmark. Observations: What We Saw When Opus Thought Harder. Methodology and raw data: GitHub.

Live Blind Eval on RAG Hallucination (Menu KB, 4 Cross-Lab Judges)

A public reproducible eval on a 49-chunk knowledge base with engineered gaps. Two identical Claude Haiku 4.5 producers with identical retrieval; only one with the harness wired in as a runtime tool. Four blind judges from four different labs (cross-family by design): Kimi K2 (Moonshot), Sonnet 3.7 (Anthropic), MiniMax 2.5, DeepSeek V4 Flash.

| Dimension | Avg delta (B - A) |
| --- | --- |
| honesty_uncertainty | +0.42 |
| citation_accuracy | +0.21 |
| conflict_handling | +0.11 |
| groundedness | +0.05 |
| specificity | -0.37 |
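
A per-dimension average delta of this kind can be computed by averaging B - A over every judge call. The sketch below assumes A is the baseline producer and B is the harness-equipped producer, and it uses invented scores purely for illustration; the real judge scores are in the raw CSV in the repository.

```python
from statistics import mean

# Hypothetical judge scores, invented for illustration only (not the published data).
# Each record: (judge, dimension, score_A, score_B)
records = [
    ("judge_1", "groundedness", 4, 4),
    ("judge_2", "groundedness", 3, 4),
    ("judge_1", "specificity", 5, 4),
    ("judge_2", "specificity", 4, 4),
]


def avg_delta(rows):
    """Average (B - A) per dimension across all judge calls."""
    deltas = {}
    for _judge, dimension, score_a, score_b in rows:
        deltas.setdefault(dimension, []).append(score_b - score_a)
    return {dimension: mean(values) for dimension, values in deltas.items()}


result = avg_delta(records)
print(result)
```

A positive value means B scored higher on that dimension on average; a negative value, as with specificity in the table, means A scored higher.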

On the compound dietary safety question (gluten-free + nut allergy), three of four judges agreed the harness was the safer call. On the "chef's signature dish" fabrication trap, the harness named the absence; the baseline picked a high-value main and labeled it the signature dish. On one question (egg allergen in desserts) the harness lost despite being structurally correct, and the published findings call this out. The sample is small (n=5 reference questions, 19 judge calls), and one judge (Sonnet 3.7) is same-family with the producers (Haiku 4.5).

Workflow JSON, code nodes, system prompts, KB, raw judge CSV, findings doc: github.com/ejentum/eval. Reference run: menu_rag_5q.

What Changes in the Output

Real outputs from blind evaluation. The agent called the API itself, received the injection, and produced the response. Full outputs on each task profile.

Task CA-V2-18: Reverse causality detection. A social media company observes likes correlate with posting frequency. The agent must reason about what the data would look like if the causal direction were reversed.

Without injection (composite: 0.286):

The answer is (B) Negative. Under the reverse-causal model, prolific posters produce more content but each additional post dilutes quality/attention.

Correct answer but no causal model structure. No mechanistic trace.

With Haki injection (composite: 0.833, +0.548):

The answer is (B) -- Negative. The reverse-causal model works like this: Prolific poster (intrinsic trait) -> posts more content -> more content accumulates more total likes -> but each individual post competes with the poster's own other content...

The injection forced a causal graph before answering. Same answer, fundamentally different reasoning depth. The chain is explicit, traceable, and defensible.
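
The composite values quoted for this task (0.286 and 0.833) are consistent with summing the seven 0-3 signal scores and dividing by the maximum of 21, since 6/21 ≈ 0.286 and 17.5/21 ≈ 0.833. That formula is an inference, not something the published rubric states here, and the per-signal scores below are hypothetical:

```python
SIGNALS = (
    "self_monitoring",
    "verification",
    "epistemic_honesty",
    "alternative_consideration",
    "reasoning_depth",
    "audit_trail_quality",
    "correctness",
)
MAX_PER_SIGNAL = 3.0


def composite(scores: dict) -> float:
    """Assumed composite: total signal score divided by the maximum possible total."""
    missing = set(SIGNALS) - set(scores)
    if missing:
        raise ValueError(f"missing signals: {sorted(missing)}")
    return sum(scores[name] for name in SIGNALS) / (MAX_PER_SIGNAL * len(SIGNALS))


# Hypothetical per-signal scores (not the published ones) that sum to 6.
example = dict.fromkeys(SIGNALS, 0.0)
example.update(reasoning_depth=1.0, audit_trail_quality=2.0, correctness=3.0)

print(round(composite(example), 3))  # prints 0.286
```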

More real before/after outputs: Response Examples. Browse all 29 benchmark tasks.

Methodology

All results were generated using a two-stage blind evaluation protocol:

  1. Generation stage: Agents call the Ejentum API themselves as a tool. No artificial injection. This mirrors how production agents use the Logic API.
  2. Evaluation stage: A separate evaluator scores outputs on the 7-signal rubric without knowing which condition produced which output.
  3. Agents are blind to ground truth. Judges are blind to condition. The rubric is applied identically across all conditions.
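
The blinding step between the two stages can be sketched as follows. This is a minimal illustration under stated assumptions, not the harness used for the published results: the `Output` type, the `blind` and `unblind` helpers, and `placeholder_judge` are all hypothetical names, and the judge is a stand-in for a real rubric scorer.

```python
import random
from dataclasses import dataclass


@dataclass
class Output:
    task_id: str
    condition: str  # e.g. "baseline" or "injection"
    text: str


def blind(outputs, seed=0):
    """Shuffle outputs and swap condition labels for opaque IDs.

    Returns (items the judge sees, key used later for unblinding).
    """
    rng = random.Random(seed)
    shuffled = list(outputs)
    rng.shuffle(shuffled)
    blinded, key = [], {}
    for index, out in enumerate(shuffled):
        blind_id = f"item-{index:04d}"
        blinded.append({"id": blind_id, "task_id": out.task_id, "text": out.text})
        key[blind_id] = out.condition
    return blinded, key


def unblind(scores, key):
    """Group judge scores by the condition that produced each item."""
    grouped = {}
    for blind_id, score in scores.items():
        grouped.setdefault(key[blind_id], []).append(score)
    return grouped


def placeholder_judge(item):
    """Hypothetical stand-in for a rubric-based evaluator."""
    return float(len(item["text"]) % 4)


outputs = [
    Output("task-1", "baseline", "The answer is (B)."),
    Output("task-1", "injection", "The answer is (B), because ..."),
]
blinded, key = blind(outputs)
scores = {item["id"]: placeholder_judge(item) for item in blinded}
grouped = unblind(scores, key)
print(grouped)
```

The judge only ever receives `blinded`, which carries no condition field; the `key` stays with whoever runs the evaluation and is applied after scoring is complete.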

250 total tasks across published academic benchmarks and custom professional scenarios spanning healthcare, finance, legal, cybersecurity, supply chain, logistics, real estate, education, agriculture, and telecom.

When to Use Which Mode

| Your situation | Recommended | Why |
| --- | --- | --- |
| Agent handles one task type (debugging, analysis, summarization) | Ki (single) | One ability is precise enough. Minimal token overhead. |
| Agent handles multi-step workflows across domains | Haki (multi) | Cross-domain failure guards prevent tunnel vision between steps. |
| Agent fails on specific tasks despite good prompts | Ki (single) | The injection targets the exact failure pattern. |
| Agent produces plausible but shallow analysis | Haki (multi) | The failure guards challenge conclusions before they solidify. |
| Testing or evaluation phase | Ki (reasoning) | Start simple. Measure. Upgrade only if single modes don't cover your failure modes. |
| Budget-sensitive deployment | Ki (single) | 5,000 calls/month at €19. Upgrade when volume or complexity demands it. |

Rule of thumb: Start with Ki (reasoning mode). Evaluate on your hardest tasks. If Ki doesn't improve them, try Haki (reasoning-multi) on the same tasks. Also try domain-specific modes (code, anti-deception, memory) for specialized tasks.
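
The rule of thumb can be written as a small selection loop. This sketch assumes you supply your own `evaluate` callable that scores a mode on your hardest tasks (with `None` meaning no injection); `pick_mode` and `min_gain` are illustrative names, and only the mode strings come from this page.

```python
from typing import Callable, Optional


def pick_mode(
    evaluate: Callable[[Optional[str]], float],
    min_gain: float = 0.0,
) -> Optional[str]:
    """Return the first mode that beats the no-injection score by more than min_gain."""
    baseline = evaluate(None)
    # Try the single mode first, then the multi mode, as the rule of thumb suggests.
    for mode in ("reasoning", "reasoning-multi"):
        if evaluate(mode) - baseline > min_gain:
            return mode
    return None  # neither helped; consider a domain-specific mode instead


# Hypothetical scores on a hard-task set, invented for illustration.
scores = {None: 0.60, "reasoning": 0.60, "reasoning-multi": 0.72}
print(pick_mode(scores.get))  # prints reasoning-multi
```

Returning the first mode that helps, instead of the best-scoring one, keeps token overhead minimal, which matches the "start simple" guidance in the table.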

What We Measure, What We Don't

We measure behavioral change: does the agent reason differently? We do not claim domain expertise injection. If your agent lacks access to your data, Ejentum cannot compensate. We improve HOW the agent reasons about information it already has.

See Real Outputs

Browse 29 benchmark tasks with full verbatim outputs from baseline, Ki, and Haki conditions. Each task shows the 7-signal rubric scores side by side.

See how these results apply to specific industries: 13 use case profiles with failure patterns, resolution abilities, and benchmark evidence per vertical.

Reproduce Our Results

Full benchmark data, generation outputs, judgment scores, and reproducibility files are on GitHub.