Why We Killed Our Most Complex Mode
We built a sophisticated execution framework. Four abilities composed into operational graphs with parsed execution steps, perspective checks, and topology annotations. It was the structurally richer, more engineered version of our injection format.
The data said kill it.
What Heavy Mode Was
The Logic API returns cognitive scaffolds in two rendering formats. We called them "light" and "heavy."
Light rendered the raw scaffold: suppression signals, amplification signals, a reasoning topology, a failure exemplar, and a verification test. Compact. In its single-ability form, 93% of the tokens carried reasoning-relevant content.
Heavy wrapped the same content in a parsed execution framework: numbered operation steps (Step 1... Step 2... Step 3...), perspective checks, structural metadata, and topology annotations. It looked more like a program. In its multi-ability form, only 50% of the tokens carried reasoning-relevant content. The other 50% was structural overhead: headers, labels, and formatting that organized the scaffold but didn't add signal.
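The contrast between the two renderings can be sketched in code. This is a minimal illustration, not the actual Logic API renderer: the field names (`suppressions`, `topology`, and so on) and the heavy-mode labels are invented to mirror the description above.

```python
def render_light(scaffold: dict) -> str:
    """Raw scaffold: nearly every line carries reasoning-relevant signal."""
    lines = [f"AVOID: {s}" for s in scaffold["suppressions"]]
    lines += [f"EMPHASIZE: {a}" for a in scaffold["amplifications"]]
    lines.append(f"TOPOLOGY: {scaffold['topology']}")
    lines.append(f"FAILURE EXEMPLAR: {scaffold['failure_exemplar']}")
    lines.append(f"VERIFY: {scaffold['verification_test']}")
    return "\n".join(lines)

def render_heavy(scaffold: dict) -> str:
    """Same content wrapped in an execution framework. The extra headers,
    step labels, and section markers are structural overhead, not signal."""
    out = ["=== EXECUTION FRAMEWORK ===", "[OPERATION STEPS]"]
    for i, s in enumerate(scaffold["suppressions"], 1):
        out.append(f"Step {i}: [SUPPRESS] {s}")
    out += [
        "[PERSPECTIVE CHECK]",
        "Re-examine assumptions before answering.",
        "[TOPOLOGY ANNOTATION]",
        scaffold["topology"],
        "=== END FRAMEWORK ===",
    ]
    return "\n".join(out)
```

Both renderers emit the same suppression content; heavy mode simply spends extra tokens framing it.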
Heavy mode was the version we expected to win. More structure should mean better reasoning. Explicit execution steps should guide the model through the procedure. Right?
Run 1: Light Wins (110 Tasks)
We tested five conditions on 110 tasks (BBH, CausalBench, MuSR). Binary correctness scoring.
| Condition | Correctness | Delta | Net Flips |
|---|---|---|---|
| Baseline | 69.7% | – | – |
| Light Single (Ki) | 76.8% | +7.1pp | +7 |
| Light Multi (Haki) | 75.2% | +5.5pp | +4 |
| Heavy Multi | 73.6% | +3.9pp | +2 |
| Heavy Single | 71.5% | +1.8pp | 0 |
Light Single won. Heavy Single achieved NET 0 flips: it improved as many tasks as it degraded. The most structurally complex mode was indistinguishable from noise.
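The two metrics in the tables can be computed mechanically from per-task binary outcomes. A minimal sketch (the helper names are ours, not the benchmark harness's):

```python
def net_flips(baseline: list[bool], condition: list[bool]) -> int:
    """Net flips = (wrong -> right) minus (right -> wrong) vs. baseline."""
    gains = sum(1 for b, c in zip(baseline, condition) if not b and c)
    losses = sum(1 for b, c in zip(baseline, condition) if b and not c)
    return gains - losses

def delta_pp(baseline: list[bool], condition: list[bool]) -> float:
    """Correctness delta in percentage points."""
    return 100 * (sum(condition) - sum(baseline)) / len(baseline)
```

A condition with NET 0 flips helps and hurts in equal measure, which is why it reads as noise even when individual tasks change outcome.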
We Upgraded and Retested
Maybe the heavy builders were broken. We audited them and found real issues:
- Synergy chains were silently dropping suppression signals from 3 of 4 abilities: only the primary ability's suppressions were rendering
- Missing structural fields created empty labels that wasted tokens on nothing
- The perspective check section added ~100 tokens of metacognitive scaffolding that duplicated what the suppression signals already provided
We fixed everything. Retested on 70 external-only tasks (harder subset).
Run 2: Same ranking.
| Condition | Correctness | Delta | Net Flips |
|---|---|---|---|
| Baseline | 69.3% | – | – |
| Light Single (Ki) | 74.3% | +5.0pp | +3 |
| Light Multi (Haki) | 74.3% | +5.0pp | +4 |
| Heavy Multi | 71.4% | +2.1pp | +2 |
| Heavy Single | 70.3% | +1.0pp | 0 |
The upgrades fixed individual task failures (4 of 5 previously failed Heavy Single tasks recovered; 5 of 5 Heavy Multi tasks recovered). But the aggregate ranking didn't change: Light > Heavy in both runs.
Heavy Single: NET 0 flips in both runs. Two independent evaluations. Same result.
The Signal Density Analysis
We measured what percentage of each injection format carries reasoning-relevant content vs. structural overhead.
| Format | Total (words) | Signal (words) | Signal Density |
|---|---|---|---|
| Light Single | ~265 words | ~247 words | 93% |
| Heavy Single | ~306 words | ~261 words | 85% |
| Light Multi | ~547 words | ~358 words | 65% |
| Heavy Multi | ~479 words | ~240 words | 50% |
Heavy Multi dedicated half its injection budget to structural overhead. Every structural token competes with signal tokens for the model's attention. On a frontier model that already reasons with extended chain-of-thought, the structural scaffolding doesn't help: it dilutes.
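The density figures follow directly from the audit's counts. A small reproduction of the table's arithmetic, using the word counts as the token proxy:

```python
def signal_density(total_words: int, signal_words: int) -> int:
    """Percentage of the injection that carries reasoning-relevant content."""
    return round(100 * signal_words / total_words)

# (total words, signal words) from the token-level content audit
formats = {
    "Light Single": (265, 247),
    "Heavy Single": (306, 261),
    "Light Multi":  (547, 358),
    "Heavy Multi":  (479, 240),
}
densities = {name: signal_density(t, s) for name, (t, s) in formats.items()}
for name, d in densities.items():
    print(f"{name}: {d}% signal")
```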
The Mechanism: Suppression Drives Flips, Not Structure
We traced individual task flips to understand what caused them. The pattern was consistent: when a task flipped from wrong to right, the flip was driven by a suppression signal, not by the structural framework around it.
On murder mystery tasks, the suppression of "anchoring on first suspicious evidence" caused the agent to evaluate all suspects instead of committing early. This worked in both light and heavy modes. The execution steps in heavy mode (Step 1: identify suspects. Step 2: evaluate evidence...) didn't add anything: the model already knew how to investigate a mystery. What it didn't know was to stop anchoring. The suppression signal told it to stop.
Structure is scaffolding for models that don't know what to do. Suppression is enforcement for models that know what to do but take shortcuts. Frontier models fall in the second category.
The Methodology Lesson: Don't Let the Model Grade Its Own Injection
Before we trusted these results, we had to confront an uncomfortable finding from our own benchmarking history.
In an earlier evaluation phase, we used simulated evaluation: agents reasoned ABOUT the injection effects rather than receiving real injection. The model read the scaffold and scored how it would perform differently with it active.
Simulated evaluation said Heavy Multi won (+1.8pp, best of all conditions).
When we switched to real injection (actually injecting the scaffold into the model's context, with blind two-stage scoring), light modes won (+7.1pp for Light Single).
Same tasks. Same model. Same abilities. Opposite conclusions.
Why simulated evaluation lies: The model that evaluates its own augmentation overweights structural complexity. Heavy mode looks more rigorous: parsed steps, explicit operations, structured metadata. The model reasons: "this structured approach would improve my performance." But when the structure is actually injected, it competes with the task for attention budget. The model doesn't need to be told to think step by step; it already does. The overhead is pure cost.
The lesson: If you're evaluating any LLM augmentation system, inject for real and score blind. Never let the model imagine how it would perform differently. It will systematically overestimate the value of structural complexity.
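The real-injection protocol can be sketched as a loop. This is an illustrative harness, not the actual evaluation code; `model` and `grader` are stand-ins for the solver and the blind second-stage scorer:

```python
def evaluate(tasks, scaffold, model, grader):
    """Real injection with blind scoring: the model experiences the
    scaffold in context; the grader never sees the injection at all."""
    results = []
    for task in tasks:
        # Stage 1: solve the task with the scaffold actually injected.
        # The model is never asked to reason ABOUT the scaffold's effect.
        answer = model(prompt=f"{scaffold}\n\n{task['question']}")
        # Stage 2: a separate grader scores blind, seeing only the
        # answer and the gold label -- never the injection.
        results.append(grader(answer, task["gold"]))
    return sum(results) / len(tasks)
```

The key property is the separation: nothing about the injection leaks into the scoring stage, so the scaffold can only earn credit by changing actual answers.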
The Decision
We deprecated heavy modes. The product ships light rendering only.
This was not a compromise. Light mode has higher signal density (93% for Light Single vs 50% for Heavy Multi), produces larger correctness improvements (+7.1pp vs +1.8pp for Heavy Single), and has a better regression profile (1 regression per 110 tasks vs equal improvements and regressions). By every metric we measured, less structure produced better results.
What This Means
For Ejentum users: You're getting the highest signal-density format we've measured. Every token in the scaffold is earning its place.
For anyone building injection systems: Measure signal density, not structural sophistication. The intuition that "more structure = better reasoning" is wrong on frontier models with native chain-of-thought. Suppression signals (telling the model what NOT to do) carry more inferential weight per token than procedural instructions telling it what to do.
For the benchmarking community: Simulated evaluation and real-injection evaluation produce opposite conclusions about what works. If your evaluation method lets the model reason about the intervention instead of experiencing it, your results are unreliable.
Source Data
- Run 1: 110 tasks, 5 conditions, binary correctness scoring
- Run 2: 70 external tasks, 5 conditions, binary correctness scoring (replication)
- Signal density analysis: Token-level content audit across all 4 injection formats
- Model: Claude Opus 4.6 | Evaluation dates: March 2026
- Full benchmark data: External benchmark report
For how injection modes work: Architecture. For the full benchmark data: Benchmarks. Related: Domain-Agnostic Suppression.