Concepts

How the harness works and why it improves agent reasoning. If you just want to start using it, go to Quickstart.

The Core Idea

Most AI tools give agents more facts to work with (documents, data, context). Ejentum harnesses the reasoning power agents already have: structured injections that channel the model's existing capability into disciplined execution, preventing the shortcuts and decay that degrade output over long chains.

RAG (Retrieval Augmented Generation) retrieves information. The agent still decides how to reason about it using whatever patterns it learned during training.

RA²R (Reasoning Ability-Augmented Retrieval) retrieves reasoning abilities. A structured injection that governs how the agent thinks: what to focus on, what failure modes to block, and how to verify its own output.

Put simply: where RAG retrieves facts, RA²R retrieves ways of thinking, matched dynamically to the context of each query.

"Isn't this just prompt engineering?"

No. Prompt engineering is writing natural language instructions and hoping the model follows them. Ejentum injects a structured payload with several independent control surfaces: amplification, suppression, cognitive style, and reasoning elasticity. The suppression signals make this categorically different: they constrain the failure space, which natural language instructions cannot do reliably. A system prompt says "be careful." A suppression signal says "reject any output that exhibits symptom_treatment_bias." These produce structurally different model behavior in our testing.

Ejentum replaces a 5,000-token system prompt with a compact reasoning harness injected at the top of the context. Less content, higher attention density, structured constraints instead of prose. The harness does not add intelligence. It prevents the model's existing intelligence from being wasted.

The Four Harnesses

Ejentum houses 679 abilities across four harnesses. Each harness captures and directs a different dimension of the model's existing power. The model already has these capabilities. The harness prevents them from degrading under pressure.

Reasoning Harness (311 abilities) channels the model's analytical power across six cognitive domains. It prevents the shortcuts that turn careful analysis into surface-level pattern matching over long execution chains. Skill file.

Code Harness (128 abilities) channels the model's engineering discipline. It prevents hallucinated APIs, lost safety guards, ignored edge cases, and the subtle bugs that appear when the model generates plausible-looking code that fails in production. Skill file.

Anti-Deception Harness (139 abilities) channels the model's capacity for honesty. It prevents sycophancy, hallucination, prompt injection, and the tendency to tell people what they want to hear instead of what the evidence shows. Skill file.

Memory Harness (101 abilities) channels the model's observational depth. It prevents missed emotional shifts, ignored context drift, stale assumptions, and the tendency to treat every person and every turn the same way regardless of what changed. Skill file.

Reasoning Harness Dimensions

The Reasoning Harness spans six cognitive domains, each addressing a specific class of analytical failure:

1. Causality

Domain: Why things happen. Injects: Deductive rules, root-cause chains, falsification protocols. Prevents: Correlation-causation confusion, post-hoc reasoning, causal reversal.

2. Time

Domain: When things happen. Injects: Lag variables, decay rates, precedent logic, chronological strictness. Prevents: Temporal hallucination, the agent treating past as future or losing event sequence.

3. Space

Domain: Where things are. Injects: Boundary enforcement, topology validation, dimensional constraints. Prevents: Physical impossibilities like routes through walls, overlapping objects, broken continuity. Tested directly on ARC-AGI-3: a spatial navigation game where the injection forced intermediate path validation, preventing the agent from committing to blocked routes.

4. Simulation

Domain: What would happen if. Injects: Feedback loops, domino-effect tracking, systems archetypes. Prevents: Single-step thinking, agents that cannot model downstream consequences.

5. Abstraction

Domain: What things mean. Injects: Category enforcement, ontological boundaries, dimensionality control. Prevents: Concept conflation, treating metaphors as mechanisms, merging unrelated categories.

6. Metacognition

Domain: How the agent is thinking. Injects: Self-monitoring, contradiction detection, loop termination. Prevents: Hallucination spirals, infinite regression, cross-pillar contradictions.

The Failure Taxonomy: Why These Six

These dimensions were not chosen by analogy to human cognition. They were reverse-engineered from production agent failures.

We analyzed thousands of agent failures across domains and found they cluster into exactly six categories:

Failure Mode	Dimension	What Goes Wrong
Causal reversal	Causality	Agent says A causes B when B causes A
Temporal hallucination	Time	Agent confuses past and future, loses event sequence
Physical impossibility	Space	Agent violates boundaries, topology, conservation laws
Single-step myopia	Simulation	Agent cannot model downstream consequences
Category error	Abstraction	Agent conflates metaphor with mechanism
Hallucination spiral	Metacognition	Agent makes a mistake, then uses that mistake to make more mistakes, without noticing

Every ability in the 679-node graph maps to one or more of these failure modes. The dimensions are not a taxonomy of knowledge. They are a taxonomy of breakdown.

What the API Returns

Each call returns one pre-rendered injection string. It is not a bag of JSON fields you assemble; it is a finished cognitive operation, emitted as six labeled blocks in a fixed order. The labels shift per harness, but the slot and its function are the same:

#	Slot	Reasoning label	What it carries
1	Procedure	`[PROCEDURE]`	Step-by-step instructions the agent follows
2	Topology	`[REASONING TOPOLOGY]`	The procedure as a DAG: steps, decision gates, loops, reflection points
3	Cognitive payload	`[COGNITIVE PAYLOAD]`	The `Amplify:` / `Suppress:` / `Cognitive Style:` / `Elasticity:` control surfaces
4	Verification	`[FALSIFICATION TEST]`	A pass/fail criterion the agent checks its output against
5	Failure pattern	`[NEGATIVE GATE]`	The specific failure the agent must not commit
6	Correct shape	`[TARGET PATTERN]`	What correct reasoning looks like for this task

The order never changes; only the labels do. Code uses [ENGINEERING PROCEDURE], [VERIFICATION], [CODE FAILURE], [CORRECT PATTERN]. Anti-deception uses [INTEGRITY PROCEDURE], [DETECTION TOPOLOGY], [INTEGRITY CHECK], [DECEPTION PATTERN], [HONEST BEHAVIOR]. Memory uses [SHARPENING PROCEDURE], [PERCEPTION TOPOLOGY], [PERCEPTION CHECK], [PERCEPTION FAILURE], [CLEAR SIGNAL].

Inside the cognitive payload (slot 3)

Slot 3 holds four independent control surfaces, each on its own line:

Amplify: the 2 to 4 reasoning signals to weight heavily. Positive attractors that pull generation toward specific patterns.
Suppress: the 1 to 3 failure modes to actively penalize. The highest-impact surface: "do NOT treat symptoms as root causes" produces sharper reasoning than "find the root cause" (see the next section).
Cognitive Style: a single persona anchor that sets the methodology.
Elasticity: how far the operation may range, as a coherence= target paired with an expansion= setting. Values and usage in Reasoning Elasticity in Practice below.

[COGNITIVE PAYLOAD]
Amplify: depth first root search; n whys traversal
Suppress: symptom treatment bias; surface level stop
Cognitive Style: root cause isolation
Elasticity: coherence=evidence trail, expansion=conservative

Dynamic vs adaptive. A reasoning (dynamic) call returns all six blocks as authored. An adaptive-reasoning call returns the same six blocks in the same order, but an adapter model rewrites slots 1 and 2 (procedure and topology) to name your task's specific variables; slots 3 through 6 come back verbatim, identical to the dynamic version. The safety guards never loosen. See Dynamic and Adaptive below.

Why Suppression Matters More Than Amplification

This is the empirical insight that drives the architecture.

When you tell a model "find the root cause," it generates a plausible root cause. When you tell it "do NOT treat symptoms as root causes, do NOT stop at the first plausible explanation," it generates a deeper root cause. In our testing across thousands of queries, suppression-only payloads consistently outperformed amplification-only payloads on reasoning depth and failure avoidance. The combined payload (both amplification and suppression) performs best.

Why does this work? Our working hypothesis: negative constraints are more specific than positive instructions. "Find the root cause" is broad, and the model can satisfy it shallowly. "Do NOT stop at the first plausible explanation" is narrow: it blocks a specific failure mode and forces the model to continue reasoning. We believe this relates to how instruction-following models process negation in context, though the full theoretical picture is still emerging.

Worked Example: Root Cause Analysis

Amplification signals tell the model: go deep, ask why repeatedly, extract systemic fixes.

Suppression signals tell the model: do NOT treat symptoms as causes, do NOT stop at the first plausible answer.

Remove the suppression signals and the model still performs root-cause analysis. But it stops earlier. It accepts the first plausible explanation. It treats correlated symptoms as causes. The suppression signals force the model past its natural stopping point, the point where probability says "good enough" but engineering says "not yet."

The Asymmetry Principle

Amplification is additive. It says "also consider X." Suppression is multiplicative. It says "reject every output that exhibits Y."

One suppression signal eliminates an entire class of failure modes. One amplification signal adds one more consideration to an already-noisy generation process. This asymmetry is why every ability contains suppression signals, even when amplification alone would seem sufficient.

The evidence: across two independent benchmarks (250 single-turn tasks), the factor lift ranking is identical. Self-monitoring improves most (+132%), followed by verification (+85%), then alternative consideration, epistemic honesty, reasoning depth, and audit trail. This ranking holds across published academic tasks and custom professional scenarios. A third benchmark (ARC-AGI-3, 50 interactive reasoning steps) extends this to multi-step execution: the same suppression signals that improve single-turn quality also prevent reasoning decay over 25-step chains, with injection persistence measured at a half-life of 24 steps. Three benchmarks, three task types, consistent directional effects. This is evidence for a mechanism, not luck.

Dynamic and Adaptive

Every harness runs in two variants. You select them through the mode string: the bare name for dynamic, the adaptive- prefix for adaptive.

Dynamic retrieves the best-fit operation for your task and returns it as-is. The same operation generalizes across many tasks of the same kind, which is exactly what you want for routine work: a causal-attribution task and a marketing-attribution task can share one well-built operation. Retrieval is a hybrid semantic and lexical match against the abilities in the chosen harness. No LLM call, no inference cost.

Adaptive retrieves the same operation, then rewrites the procedure and the reasoning topology so they name the specific variables, files, or framing in your task. The safety locks stay frozen: the failure guard, the suppression list, and the verification checkpoint are identical to the dynamic version. Adaptation changes how the model approaches the task, never which failure modes it is required to block. It costs more compute per call and draws from a separate, smaller monthly pool.

Worked Example: Causal Attribution

Query: "We changed the checkout flow and conversion rose. Did the change cause it?"

In dynamic mode (mode: reasoning), the harness returns a counterfactual-isolation operation written in general terms: simulate the world where only the intervention is absent, hold correlated variables fixed, check SUTVA, reject attributions that are really proxies for a correlated cause.

In adaptive mode (mode: adaptive-reasoning), the same operation comes back with the procedure and topology rewritten around your task: "simulate the counterfactual where only the checkout-flow change is absent while traffic mix, pricing, promotions, and seasonality are held at their natural values." The [COGNITIVE PAYLOAD], the [FALSIFICATION TEST], and the [NEGATIVE GATE] are byte-for-byte the same as the dynamic version. You get task-specific depth without loosening a single safety constraint.

Reasoning Elasticity in Practice

The Elasticity: line in the cognitive payload pairs a coherence= target (what the reasoning holds stable) with an expansion= setting (how widely it may range). The expansion values:

Value	Behavior
`zero_drift`	Refuse to explore beyond immediate evidence. Maximum constraint.
`conservative`	Evidence-bound. Cautious extrapolation only when directly supported.
`adaptive`	Balanced exploration within logical constraints.
`high_variance`	Broad hypothesis generation. Accepts uncertainty.
`max_entropy`	Unconstrained creative exploration. No guardrails.

The router picks the elasticity that fits the matched ability. The table below shows which kinds of work tend to map to which setting, so you can read the Elasticity: line of a returned operation and know what it is telling the model to do:

If your agent is...	Expect	Because
Auditing financial data, verifying compliance	`zero_drift`	No hallucination tolerated. Facts only.
Debugging a production incident	`zero_drift` or `conservative`	Stay on the evidence trail.
Analyzing quarterly trends	`conservative`	Some extrapolation, but evidence-bound.
Building a product strategy	`adaptive`	Balance exploration with logical constraints.
Generating creative hypotheses	`high_variance`	Broad idea generation, accepts uncertainty.
Brainstorming novel research directions	`max_entropy`	Unconstrained exploration, paired with a metacognitive failure guard.

Where It Applies

The harness is domain-agnostic. The cognitive dimensions map to failure patterns that occur in every industry. The most common applications:

Software Engineering

Agents debugging production incidents, reviewing code, or planning migrations. Common failures: treating symptoms as root causes, missing cascading dependencies, stopping investigation at the first plausible fix. The harness activates causal and metacognitive abilities that force deeper investigation.

Financial Services

Agents analyzing risk, forecasting, or compliance checking. Common failures: anchoring to best-case timelines, ignoring base rates, treating correlation as causation in market data. The harness activates causal and temporal abilities that enforce evidence-based reasoning.

Legal Tech

Agents reviewing contracts, analyzing case law, or drafting compliance assessments. Common failures: conflating precedent with prediction, missing jurisdiction-specific constraints, accepting circular legal reasoning. The harness activates abstraction and metacognitive abilities that separate fact from interpretation.

Healthcare

Agents supporting clinical reasoning, protocol selection, or risk assessment. Common failures: premature diagnosis fixation, ignoring constraint interactions, failing to flag uncertainty in ambiguous presentations. The harness activates simulation and metacognitive abilities that enforce systematic evaluation.

Multi-Agent Orchestration

Systems where multiple agents collaborate on complex tasks. Common failures: contradictions between agents, duplication of reasoning without cross-validation, failure to synthesize competing outputs. Inject a different harness per agent role: the root-cause analyst gets reasoning, the user-facing summarizer gets anti-deception, the long-context tracker gets memory.

In our benchmarks, the harness showed +10.1pp composite improvement on complex cross-domain tasks (EjBench) and +8.0pp on focused tasks (BBH/CausalBench/MuSR). On multi-step execution (ARC-AGI-3), injection persistence across 25-step chains adds a third dimension. On hard competitive programming (LiveCodeBench Hard), the harness improved Opus 4.6 from 85.7% to 100% pass rate, rescuing reasoning spirals and preventing premature convergence on algorithm selection. See Benchmarks for the full methodology.

See all 13 industry applications with specific failure patterns, connected abilities, and benchmark evidence per vertical. Browse 29 real benchmark tasks with verbatim baseline vs harnessed outputs.

What Ejentum Does Not Do

Honest boundaries matter more than inflated claims.

Ejentum does not add domain knowledge. If your agent fails because it lacks information (e.g., it doesn't have access to your database), use RAG. Ejentum improves how the agent reasons about information it already has.
Ejentum operates at the prompt level, not the model level. It does not modify model weights, activations, or fine-tuning. It structures the prompt in ways that measurably improve reasoning output.
Retrieval precision depends on query quality. Highly ambiguous or very short queries may not retrieve the optimal ability. Query specificity directly impacts retrieval quality: send the full task description, not a summary.
Suppression is not absolute constraint enforcement. The model is steered toward avoiding specific failure modes, but LLMs are probabilistic systems. Suppression reduces failure rates significantly; it does not guarantee zero failures.

Quickstart to make your first API call
Evaluate to measure the impact on your agents
Integrations for framework-specific guides
API Reference for the full technical specification
Method for the theoretical foundations (advanced)

Documentation