# LiveCodeBench Hard: 85.7% to 100% on 28 Hard Competitive Programming Tasks
Claude Opus 4.6 with maximum-effort extended thinking solves 85.7% of 28 hard AtCoder problems. With one Logic API call per task, it solves 100%. Four tasks flipped. Zero broke.
## The Setup
28 hard competitive programming tasks from LiveCodeBench, all AtCoder. Read stdin, compute, write stdout. Exact string match on public test cases.
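A harness for this setup fits in a few lines. A minimal sketch, assuming each solution is invoked as a subprocess (`passes_case` is an illustrative name, not LiveCodeBench's API):

```python
import subprocess
import sys

def passes_case(cmd, stdin_text, expected_stdout):
    """Run a candidate solution on one test case; pass = exact stdout match."""
    result = subprocess.run(cmd, input=stdin_text, capture_output=True, text=True)
    # Exact string match, trailing newline included -- this is why output
    # formatting slips (see Roulettes below) count as failures.
    return result.stdout == expected_stdout

# Usage: a toy "double the input" task, solved by a one-liner.
ok = passes_case([sys.executable, "-c", "print(int(input()) * 2)"],
                 "21\n", "42\n")
print(ok)  # True
```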
Baseline: Opus 4.6, `--effort max`, no augmentation. Augmented: same model, same effort, plus one Logic API call per task, a cognitive scaffold injected before code generation via the skill file.
## Results
| Condition | Passed | Rate |
|---|---|---|
| Baseline | 24/28 | 85.7% |
| + Logic API scaffold | 28/28 | 100.0% |
| Delta | +4 | +14.3pp |
Zero regressions. Every task that passed baseline also passed augmented.
## The Four Failures the Scaffold Fixed
Reasoning spirals (2 tasks). Best Performances: 610 seconds of thinking, zero code. The model explored approaches and never converged. With scaffold: 495 seconds, working code. Tangency of Cuboids: 1,190 seconds, zero code. With scaffold: the model reframed 3D geometry into grid adjacency and produced a solution.
Premature convergence (1 task). Art Gallery on Graph: code in 11 seconds, passed 1 of 3 tests. The model locked onto a BFS traversal with a sentinel collision — initializing to 0 where 0 is a valid computed value. With scaffold: the model spent 125 seconds and arrived at Dial's algorithm, which eliminates the sentinel bug by design.
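The sentinel class of bug is easy to reproduce in miniature. A hedged sketch with illustrative names (not the benchmark code): a BFS that uses 0 as its "unvisited" sentinel will re-expand the source on any cycle, because the source's true distance is also 0. Initializing to -1, a value no BFS distance can take, removes the collision.

```python
from collections import deque

def bfs_sentinel_bug(adj, src):
    dist = [0] * len(adj)        # BUG: 0 is "unvisited" AND a valid distance
    q = deque([src])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if dist[v] == 0:     # true for src too, so src gets overwritten
                dist[v] = dist[u] + 1
                q.append(v)
    return dist

def bfs_fixed(adj, src):
    dist = [-1] * len(adj)       # -1 is out-of-band: no collision possible
    dist[src] = 0
    q = deque([src])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if dist[v] == -1:
                dist[v] = dist[u] + 1
                q.append(v)
    return dist

triangle = [[1, 2], [0, 2], [0, 1]]
print(bfs_sentinel_bug(triangle, 0))  # [2, 1, 1] -- source distance corrupted
print(bfs_fixed(triangle, 0))         # [0, 1, 1]
```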
Precision (1 task). Roulettes: correct algorithm, wrong output formatting. The scaffold forced output validation that the baseline skipped.
## What a Blind Evaluator Found
We submitted all solutions to a blind evaluator — same model, fresh session, no knowledge of which solution used the scaffold. A/B labels randomized per task. Scored on correctness, efficiency, structure, readability, robustness (1-10 each).
3.5x magnitude asymmetry. When the scaffold wins, it wins by 5.7 points on average; when the baseline wins, by 1.6. The scaffold's improvements are structural: different algorithms, fixed bugs. The baseline's wins are marginal: tighter loops, fewer variables.
Never loses on correctness or robustness. Correctness: 2-0. Robustness: 4-0. When these axes differ, they always favor the scaffold.
Independent bug discovery. The evaluator traced the Art Gallery sentinel collision through two sample inputs, scored the baseline 2/10 on correctness, and explained exactly why it fails — without knowing which solution used the scaffold.
46% of tasks produced near-identical solutions. The scaffold changes outcomes only where outcomes need changing.
## Three-Way Blind Evaluations
On two tasks, the scaffold produced algorithmically distinct solutions. All three — baseline plus two scaffold variants — were submitted blind.
Art Gallery: Dial's algorithm and a standard BFS rewrite each scored 44/50, with Dial's ranked highest; the baseline scored 22/50. The evaluator: "The progression is: broken implementation → correct simple implementation → correct optimal implementation."
Cans and Openers: Binary search on marginal tradeoffs scored 46/50. Alternative decomposition scored 43/50. Baseline's ternary search scored 33/50. The evaluator: "A pattern-matched to a general technique; B identified the specific structure; C reframed the problem so that the hard part dissolved."
## Behavioral Patterns
The blind evaluator found four patterns across tasks:
- Impressive → maintainable. Baseline optimizes for performance. Augmented optimizes for clarity. Both correct. Different audiences.
- General technique → specific structure. Baseline reaches for familiar templates. Augmented identifies the problem's own structure.
- Local → global correctness. Baseline verifies per-component. Augmented verifies cross-execution invariants.
- Implicit → explicit sentinels. Augmented eliminates ambiguous initialization values or chooses algorithms that don't need sentinels.
## The Mechanism: Convergence Calibration
The model's convergence threshold is uncalibrated. It commits too early (Art Gallery: 11 seconds to a wrong algorithm) or too late (Best Performances: 610 seconds, zero code). The scaffold calibrates this:
- Suppression signals block premature convergence — forcing the model past the first-plausible solution.
- Reasoning topology prevents spirals — giving extended thinking a structured path with endpoints.
The scaffold does not tell the model what to code. It calibrates when the model commits.
## Cost
2.4x average time overhead from the tool-call architecture. Two exceptions: Defect (515s → 157s, -69%) and Cans and Openers (404s → 269s, -33%) — the scaffold saved more thinking time than it cost.
## Limitations
- 28 tasks. McNemar's p=0.134. The direction is unambiguous (4-0) and the mechanism is documented, but statistical significance would require roughly 50 tasks.
- Public test cases only. Hidden tests may differ.
- AtCoder only. Other platforms untested.
- Single model. All results on Opus 4.6.
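For reference, the p=0.134 figure is consistent with a continuity-corrected McNemar test on the 4-0 discordant split; a quick stdlib-only check (assuming that correction, which the report does not state explicitly):

```python
import math

# Discordant pairs: 4 tasks flipped fail -> pass, 0 flipped pass -> fail.
b, c = 4, 0

chi2 = (abs(b - c) - 1) ** 2 / (b + c)   # continuity-corrected statistic: 2.25
p = math.erfc(math.sqrt(chi2 / 2))       # chi-square survival, 1 degree of freedom
print(f"chi2={chi2:.2f}, p={p:.3f}")     # chi2=2.25, p=0.134
```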
## The Takeaway
Opus 4.6 at maximum effort is already strong, at 85.7% on hard competitive programming. But it loses 4 tasks to reasoning failures: spirals, premature convergence, and precision slips. One API call per task fixes all four without breaking anything that already works. A blind evaluator confirmed it: the scaffold never loses on correctness or robustness, and when it matters, it matters 3.5x more.
- Observations: What We Saw When Opus Thought Harder
- Full report and data: github.com/ejentum/benchmarks/lcb-hard
- Skill file: Logic API skill file