What We Saw When Opus Thought Harder
We gave Claude Opus 4.6 twenty-eight hard competitive programming problems and told it to think as hard as it could. It solved twenty-four. Then we gave it the same problems with one Logic API call before each task. It solved all twenty-eight.
Then we asked a blind evaluator to judge the code without knowing which solution used the scaffold. Here's what it found.
The Scaffold Changes How the Model Thinks, Not What It Knows
Across the twenty-four tasks both conditions solved, the scaffold produced different code every time. On over half the tasks, the change went beyond tweaks: the blind evaluator confirmed the two solutions used different algorithms, different decomposition strategies, different architectural instincts.
The model already knows the algorithms. What it doesn't always know is which one to pick, when to stop exploring, and how to organize its thinking under time pressure. The scaffold addresses exactly that.
Impressive Code vs Maintainable Code
The clearest pattern the blind evaluator found: baseline code optimizes for performance. Augmented code optimizes for clarity.
Baseline uses bit tricks, minimal allocations, terse control flow. Augmented uses explicit variable names, separated concerns, readable structure. Both pass the tests. They optimize for different audiences.
This showed up on 10+ tasks. In competitive programming, the baseline's optimization instinct is appropriate. In production code review, the scaffold's clarity wins. The scaffold shifts the model toward the solution someone else can maintain.
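The contrast is easiest to see on a toy example. Neither function below is taken from the benchmark solutions; both count set bits, and both are correct. One optimizes for terseness, the other for the next reader.

```python
# Illustrative only -- not from the benchmark. Two correct popcounts.

def popcount_terse(n):
    c = 0
    while n:
        n &= n - 1        # Kernighan's trick: clears the lowest set bit
        c += 1
    return c

def popcount_readable(n):
    """Count how many bits of n are 1 by examining each bit in turn."""
    count = 0
    while n > 0:
        lowest_bit = n & 1
        count += lowest_bit
        n >>= 1
    return count
```

Both pass the same tests; a reviewer who has never seen Kernighan's trick can verify the second at a glance.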
General Technique vs Specific Structure
Six tasks showed the baseline reaching for a familiar template while the augmented condition identified the problem's own structure.
On Cans and Openers, the baseline applied ternary search on a concave function, a general optimization technique. The augmented condition recognized that binary search on the marginal tradeoff was sufficient, eliminating the ternary search, the feasibility bounds, and the special-case branch. The blind evaluator ranked it highest of three solutions: "A pattern-matched to a general technique; B identified the specific structure that made a simpler technique sufficient; C reframed the problem so that the hard part dissolved."
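The actual objective function isn't reproduced here, so the following is a hedged sketch of the general pattern only: for a concave function over integers, the marginal gain is monotone, so binary search on its sign finds the same maximizer ternary search would, with no bracketing bounds to maintain.

```python
# Sketch of the pattern, not the Cans and Openers solution.
# f is concave over integers, so f(k+1) - f(k) is non-increasing:
# binary-search the point where the marginal gain stops being positive.

def argmax_concave(f, lo, hi):
    """Return the maximizer of a concave integer function f on [lo, hi]."""
    while lo < hi:
        mid = (lo + hi) // 2
        if f(mid + 1) - f(mid) > 0:   # still climbing: maximizer is right of mid
            lo = mid + 1
        else:                          # flat or falling: maximizer is mid or left
            hi = mid
    return lo
```

Usage: `argmax_concave(lambda k: -(k - 7) ** 2, 0, 100)` returns 7.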
On Art Gallery, the baseline locked onto a standard BFS traversal in 11 seconds. The augmented condition spent 125 seconds and arrived at Dial's algorithm, a bucket-based BFS that handles the edge-weight structure correctly. The blind evaluator independently found the baseline's sentinel bug and scored it 2/10 on correctness.
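For readers unfamiliar with it, Dial's algorithm is Dijkstra with a circular array of buckets in place of a heap, valid when edge weights are small non-negative integers. This is a generic sketch under assumed inputs (adjacency lists of `(neighbor, weight)` pairs, weights in `[0, max_w]`), not the contest solution; note that distances start as `None`, not 0, since 0 is a valid distance.

```python
from collections import deque

def dial_shortest_paths(adj, src, max_w):
    """Dial's algorithm: bucket-queue Dijkstra for integer weights in
    [0, max_w]. adj[u] is a list of (v, w) pairs. Returns a distance
    list with None for unreached nodes (0 is a valid distance, so it
    cannot double as an 'unreached' sentinel)."""
    n = len(adj)
    dist = [None] * n
    dist[src] = 0
    nb = max_w + 1                  # live distances span [d, d + max_w]
    buckets = [deque() for _ in range(nb)]
    buckets[0].append(src)
    remaining = 1                   # entries still sitting in buckets
    d = 0
    while remaining:
        while not buckets[d % nb]:  # advance to the next nonempty bucket
            d += 1
        u = buckets[d % nb].popleft()
        remaining -= 1
        if dist[u] != d:            # stale entry superseded by a shorter path
            continue
        for v, w in adj[u]:
            nd = d + w
            if dist[v] is None or nd < dist[v]:
                dist[v] = nd
                buckets[nd % nb].append(v)
                remaining += 1
    return dist
```

Because every pending distance lies within `max_w` of the current one, `max_w + 1` buckets indexed modulo suffice.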
Local Reasoning vs Global Reasoning
Three tasks revealed a distinction in how the two conditions reason about correctness.
Baseline solutions verify locally. Each component works in isolation. Augmented solutions verify globally. The invariants hold across the entire execution.
The Art Gallery sentinel bug is the extreme case. Initializing to 0 where 0 is a valid computed value passes local reasoning ("0 means no stamina, skip it"). It fails global reasoning ("0 also means unreached, and now those two states are indistinguishable"). The blind evaluator named it precisely: "This is the kind of off-by-one that survives local reasoning but fails global reasoning."
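A minimal reproduction of that bug class, in hypothetical code: the function, the edge format, and the starting stamina are invented for illustration, not taken from the Art Gallery solution.

```python
from collections import deque

# Hypothetical reduction of the bug class -- not the actual contest code.
# Track the best stamina reaching each node, where stamina 0 is a *valid*
# state.
#
# Buggy version:   best = [0] * n
#   A node legitimately reached with stamina 0 is indistinguishable from
#   an unreached node, so it is silently dropped. Local reasoning ("0
#   means no stamina, skip it") passes; global reasoning fails.

def max_stamina_reaching(edges, n, src, start=5):
    """Return the best stamina with which each node can be reached,
    or -1 for unreached. edges[u] is a list of (v, cost) pairs."""
    UNREACHED = -1                       # sentinel outside the valid range
    best = [UNREACHED] * n
    best[src] = start
    q = deque([src])
    while q:
        u = q.popleft()
        for v, cost in edges.get(u, []):
            s = best[u] - cost
            if s >= 0 and s > best[v]:   # stamina 0 still counts as reached
                best[v] = s
                q.append(v)
    return best
```

With the `-1` sentinel, a node reached at stamina exactly 0 stays distinguishable from one never reached at all.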
Two Reasoning Spirals Rescued
Best Performances: 610 seconds of thinking. Zero code. The model explored approaches, rejected them, explored more, never converged. With scaffold: 495 seconds, working code. The blind evaluator noted: "The spiral likely attempted to derive an approach from first principles and got caught evaluating trade-offs without committing. This solution shows no such deliberation."
Tangency of Cuboids: 1,190 seconds. Zero code. With scaffold: the model saw coordinates capped at 100, reframed from 3D geometry to grid adjacency, and produced a working solution. The blind evaluator: "The defining characteristic is constraint recognition as architecture."
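The contest code isn't shown, so here is a hedged sketch of the reframing's core test under an assumed input format: with small integer coordinates, two axis-aligned cuboids share a face of positive area exactly when they abut in one axis and their projections overlap with positive length in the other two, an adjacency check rather than a geometry computation.

```python
# Hypothetical sketch, not the contest solution. Cuboids are given as
# (x1, y1, z1, x2, y2, z2) with integer coordinates and x1 < x2, etc.

def face_tangent(a, b):
    """True iff the two axis-aligned cuboids share a face of positive
    area: they abut in exactly one axis and overlap with positive
    length in the other two."""
    def overlap(lo1, hi1, lo2, hi2):   # positive-length overlap
        return min(hi1, hi2) - max(lo1, lo2) > 0
    def touch(lo1, hi1, lo2, hi2):     # abut without overlapping
        return hi1 == lo2 or hi2 == lo1
    ax = (a[0], a[3], b[0], b[3])
    ay = (a[1], a[4], b[1], b[4])
    az = (a[2], a[5], b[2], b[5])
    return (touch(*ax) and overlap(*ay) and overlap(*az)) or \
           (overlap(*ax) and touch(*ay) and overlap(*az)) or \
           (overlap(*ax) and overlap(*ay) and touch(*az))
```

Strict inequality in `overlap` is what excludes edge and corner contact, which have zero shared area.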
Neither extended thinking alone nor the scaffold alone solved these tasks. Together, structural guidance gave the deep thinking something to converge on.
When the Scaffold Matters, It Matters More
The blind evaluator preferred the augmented solution on 9 tasks and the baseline on 8. Close to a coin flip. But the magnitude tells a different story.
When the augmented solution won, it won by 5.7 points on average. When the baseline won, it won by 1.6. A 3.5x ratio in margin.
When the scaffold wins, it wins because the algorithm is different, a bug is prevented, or the architecture is redesigned. When the baseline wins, it wins because a loop is slightly tighter or a variable name is slightly shorter. Structural improvements vs marginal preferences.
The augmented condition was never outscored on correctness (2-0) or robustness (4-0). Whenever those axes differed, they favored the scaffold.
Transparency: the blind evaluator is Claude Opus 4.6, the same model family as the code generator. This is both a strength (dogfooding) and a limitation (shared biases). Mitigated by randomized A/B assignment and the evaluator's demonstrated ability to identify genuine bugs: Art Gallery scored 2/10 on correctness. The evaluator also preferred baseline on 8 tasks, which we publish.
Nothing Broke
Zero regressions across twenty-eight tasks. The scaffold changed every solution. It never broke one that was correct without it.
46% of tasks produced near-identical solutions regardless of condition. This is by design. The scaffold concentrates its effect on tasks where the model's native reasoning fails: spirals, premature convergence, precision miscalibration. On well-calibrated tasks, it stays out of the way.
When to Use It
The scaffold's value concentrates on tasks where the model might spiral (complex multi-step problems) or converge prematurely (problems with subtle edge cases). On straightforward tasks, it leaves correctness untouched and stays out of the way. The cost is a 2.4x average increase in wall-clock time from the tool-call architecture.
Baseline: 24/28 (85.7%). Augmented: 28/28 (100%). +14.3pp. Zero regressions.
The scaffold does not tell the model what to code. It calibrates when the model commits.
Full results in the benchmark report. Skill file: Logic API documentation. Data: github.com/ejentum/benchmarks/lcb-hard.