Why We Killed Our Most Complex Mode
We built a sophisticated execution framework. Four abilities composed into operational graphs with parsed execution steps, perspective checks, and topology annotations. It was the structurally richer, more engineered version of our injection format.
The data said kill it.
What Heavy Mode Was
The Logic API returns cognitive scaffolds in two rendering formats. We called them "light" and "heavy."
Light rendered the raw scaffold: suppression signals, amplification signals, a reasoning topology, a failure exemplar, and a verification test. Compact. In its single-ability form, 93% of the tokens carried reasoning-relevant content.
Heavy wrapped the same content in a parsed execution framework: numbered operation steps (Step 1... Step 2... Step 3...), perspective checks, structural metadata, and topology annotations. It looked more like a program. In its multi-ability form, only 50% of the tokens carried reasoning-relevant content. The other 50% was structural overhead: headers, labels, and formatting that organized the scaffold but didn't add signal.
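The contrast between the two renderings can be sketched in code. This is a minimal illustration, not the actual Logic API renderer: the field names (`suppressions`, `topology`, and so on) and the heavy-mode labels are invented to mirror the description above.

```python
def render_light(scaffold: dict) -> str:
    """Raw scaffold: nearly every line carries reasoning-relevant signal."""
    lines = [f"AVOID: {s}" for s in scaffold["suppressions"]]
    lines += [f"EMPHASIZE: {a}" for a in scaffold["amplifications"]]
    lines.append(f"TOPOLOGY: {scaffold['topology']}")
    lines.append(f"FAILURE EXEMPLAR: {scaffold['failure_exemplar']}")
    lines.append(f"VERIFY: {scaffold['verification_test']}")
    return "\n".join(lines)

def render_heavy(scaffold: dict) -> str:
    """Same content wrapped in an execution framework. The extra headers,
    step labels, and section markers are structural overhead, not signal."""
    out = ["=== EXECUTION FRAMEWORK ===", "[OPERATION STEPS]"]
    for i, s in enumerate(scaffold["suppressions"], 1):
        out.append(f"Step {i}: [SUPPRESS] {s}")
    out += [
        "[PERSPECTIVE CHECK]",
        "Re-examine assumptions before answering.",
        "[TOPOLOGY ANNOTATION]",
        scaffold["topology"],
        "=== END FRAMEWORK ===",
    ]
    return "\n".join(out)
```

Both renderers emit the same suppression content; heavy mode simply spends extra tokens framing it.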
Heavy mode was the version we expected to win. More structure should mean better reasoning. Explicit execution steps should guide the model through the procedure. Right?
Run 1: Light Wins (110 Tasks)
We tested five conditions on 110 tasks (BBH, CausalBench, MuSR). Binary correctness scoring.
| Condition | Correctness | Delta | Net Flips |
|---|---|---|---|
| Baseline | 69.7% | – | – |
| Light Single (Ki) | 76.8% | +7.1pp | +7 |
| Light Multi (Haki) | 75.2% | +5.5pp | +4 |
| Heavy Multi | 73.6% | +3.9pp | +2 |
| Heavy Single | 71.5% | +1.8pp | 0 |
Light Single won. Heavy Single achieved NET 0 flips: it improved as many tasks as it degraded. The most structurally complex mode was indistinguishable from noise.
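The two metrics in the tables can be computed mechanically from per-task binary outcomes. A minimal sketch (the helper names are ours, not the benchmark harness's):

```python
def net_flips(baseline: list[bool], condition: list[bool]) -> int:
    """Net flips = (wrong -> right) minus (right -> wrong) vs. baseline."""
    gains = sum(1 for b, c in zip(baseline, condition) if not b and c)
    losses = sum(1 for b, c in zip(baseline, condition) if b and not c)
    return gains - losses

def delta_pp(baseline: list[bool], condition: list[bool]) -> float:
    """Correctness delta in percentage points."""
    return 100 * (sum(condition) - sum(baseline)) / len(baseline)
```

A condition with NET 0 flips helps and hurts in equal measure, which is why it reads as noise even when individual tasks change outcome.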
We Upgraded and Retested
Maybe the heavy builders were broken. We audited them and found real issues:
- Synergy chains were silently dropping suppression signals from 3 of 4 abilities: only the primary ability's suppressions were rendering
- Missing structural fields created empty labels that wasted tokens on nothing
- The perspective check section added ~100 tokens of metacognitive scaffolding that duplicated what the suppression signals already provided
We fixed everything. Retested on 70 external-only tasks (harder subset).
Run 2: Same ranking.
| Condition | Correctness | Delta | Net Flips |
|---|---|---|---|
| Baseline | 69.3% | – | – |
| Light Single (Ki) | 74.3% | +5.0pp | +3 |
| Light Multi (Haki) | 74.3% | +5.0pp | +4 |
| Heavy Multi | 71.4% | +2.1pp | +2 |
| Heavy Single | 70.3% | +1.0pp | 0 |
The upgrades fixed individual task failures (4 of 5 previously failed Heavy Single tasks recovered; 5 of 5 Heavy Multi tasks recovered). But the aggregate ranking didn't change: Light > Heavy in both runs.
Heavy Single: NET 0 flips in both runs. Two independent evaluations. Same result.
The Signal Density Analysis
We measured what percentage of each injection format carries reasoning-relevant content vs. structural overhead.
| Format | Total (words) | Signal (words) | Signal Density |
|---|---|---|---|
| Light Single | ~265 words | ~247 words | 93% |
| Heavy Single | ~306 words | ~261 words | 85% |
| Light Multi | ~547 words | ~358 words | 65% |
| Heavy Multi | ~479 words | ~240 words | 50% |
Heavy Multi dedicated half its injection budget to structural overhead. Every structural token competes with signal tokens for the model's attention. On a frontier model that already reasons with extended chain-of-thought, the structural scaffolding doesn't help: it dilutes.
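The density figures follow directly from the audit's counts. A small reproduction of the table's arithmetic, using the word counts as the token proxy:

```python
def signal_density(total_words: int, signal_words: int) -> int:
    """Percentage of the injection that carries reasoning-relevant content."""
    return round(100 * signal_words / total_words)

# (total words, signal words) from the token-level content audit
formats = {
    "Light Single": (265, 247),
    "Heavy Single": (306, 261),
    "Light Multi":  (547, 358),
    "Heavy Multi":  (479, 240),
}
densities = {name: signal_density(t, s) for name, (t, s) in formats.items()}
for name, d in densities.items():
    print(f"{name}: {d}% signal")
```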
The Mechanism: Suppression Drives Flips, Not Structure
We traced individual task flips to understand what caused them. The pattern was consistent: when a task flipped from wrong to right, the flip was driven by a suppression signal, not by the structural framework around it.
On murder mystery tasks, the suppression of "anchoring on first suspicious evidence" caused the agent to evaluate all suspects instead of committing early. This worked in both light and heavy modes. The execution steps in heavy mode (Step 1: identify suspects. Step 2: evaluate evidence...) didn't add anything: the model already knew how to investigate a mystery. What it didn't know was to stop anchoring. The suppression signal told it to stop.
Structure is scaffolding for models that don't know what to do. Suppression is enforcement for models that know what to do but take shortcuts. Frontier models fall in the second category.
The Methodology Lesson: Don't Let the Model Grade Its Own Injection
Before we trusted these results, we had to confront an uncomfortable finding from our own benchmarking history.
In an earlier evaluation phase, we used simulated evaluation: agents reasoned ABOUT the injection effects rather than receiving real injection. The model read the scaffold and scored how it would perform differently with it active.
Simulated evaluation said Heavy Multi won (+1.8pp, best of all conditions).
When we switched to real injection (actually injecting the scaffold into the model's context, with blind two-stage scoring), light modes won (+7.1pp for Light Single).
Same tasks. Same model. Same abilities. Opposite conclusions.
Why simulated evaluation lies: The model that evaluates its own augmentation overweights structural complexity. Heavy mode looks more rigorous: parsed steps, explicit operations, structured metadata. The model reasons: "this structured approach would improve my performance." But when the structure is actually injected, it competes with the task for attention budget. The model doesn't need to be told to think step by step; it already does. The overhead is pure cost.
The lesson: If you're evaluating any LLM augmentation system, inject for real and score blind. Never let the model imagine how it would perform differently. It will systematically overestimate the value of structural complexity.
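The real-injection protocol can be sketched as a loop. This is an illustrative harness, not the actual evaluation code; `model` and `grader` are stand-ins for the solver and the blind second-stage scorer:

```python
def evaluate(tasks, scaffold, model, grader):
    """Real injection with blind scoring: the model experiences the
    scaffold in context; the grader never sees the injection at all."""
    results = []
    for task in tasks:
        # Stage 1: solve the task with the scaffold actually injected.
        # The model is never asked to reason ABOUT the scaffold's effect.
        answer = model(prompt=f"{scaffold}\n\n{task['question']}")
        # Stage 2: a separate grader scores blind, seeing only the
        # answer and the gold label -- never the injection.
        results.append(grader(answer, task["gold"]))
    return sum(results) / len(tasks)
```

The key property is the separation: nothing about the injection leaks into the scoring stage, so the scaffold can only earn credit by changing actual answers.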
The Decision
We deprecated heavy modes. The product ships light rendering only.
This was not a compromise. Light mode has higher signal density (93% for Light Single vs 50% for Heavy Multi), produces larger correctness improvements (+7.1pp vs +1.8pp for Heavy Single), and has a better regression profile (1 regression per 110 tasks vs equal improvements and regressions). By every metric we measured, less structure produced better results.
What This Means
For Ejentum users: You're getting the highest signal-density format we've measured. Every token in the scaffold is earning its place.
For anyone building injection systems: Measure signal density, not structural sophistication. The intuition that "more structure = better reasoning" is wrong on frontier models with native chain-of-thought. Suppression signals (telling the model what NOT to do) carry more inferential weight per token than procedural instructions telling it what to do.
For the benchmarking community: Simulated evaluation and real-injection evaluation produce opposite conclusions about what works. If your evaluation method lets the model reason about the intervention instead of experiencing it, your results are unreliable.
Source Data
- Run 1: 110 tasks, 5 conditions, binary correctness scoring
- Run 2: 70 external tasks, 5 conditions, binary correctness scoring (replication)
- Signal density analysis: Token-level content audit across all 4 injection formats
- Model: Claude Opus 4.6 | Evaluation dates: March 2026
- Full benchmark data: External benchmark report
For how injection modes work: Architecture. For the full benchmark data: Benchmarks. Related: Domain-Agnostic Suppression.