# LiveCodeBench Hard: 85.7% to 100% on 28 Hard Competitive Programming Tasks
Claude Opus 4.6 with maximum-effort extended thinking solves 85.7% of 28 hard AtCoder problems. With one Logic API call per task, it solves 100%. Four tasks flipped. Zero broke.
## The Setup
28 hard competitive programming tasks from LiveCodeBench, all AtCoder. Read stdin, compute, write stdout. Exact string match on public test cases.
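A harness for this setup fits in a few lines. A minimal sketch, assuming each solution is invoked as a subprocess (`passes_case` is an illustrative name, not LiveCodeBench's API):

```python
import subprocess
import sys

def passes_case(cmd, stdin_text, expected_stdout):
    """Run a candidate solution on one test case; pass = exact stdout match."""
    result = subprocess.run(cmd, input=stdin_text, capture_output=True, text=True)
    # Exact string match, trailing newline included -- this is why output
    # formatting slips (see Roulettes below) count as failures.
    return result.stdout == expected_stdout

# Usage: a toy "double the input" task, solved by a one-liner.
ok = passes_case([sys.executable, "-c", "print(int(input()) * 2)"],
                 "21\n", "42\n")
print(ok)  # True
```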
Baseline: Opus 4.6, `--effort max`, no augmentation. Augmented: same model, same effort, plus one Logic API call per task, a cognitive scaffold injected before code generation via the skill file.
## Results
| Condition | Passed | Rate |
|---|---|---|
| Baseline | 24/28 | 85.7% |
| + Logic API scaffold | 28/28 | 100.0% |
| Delta | +4 | +14.3pp |
Zero regressions. Every task that passed baseline also passed augmented.
## The Four Failures the Scaffold Fixed
Reasoning spirals (2 tasks). Best Performances: 610 seconds of thinking, zero code. The model explored approaches and never converged. With scaffold: 495 seconds, working code. Tangency of Cuboids: 1,190 seconds, zero code. With scaffold: the model reframed 3D geometry into grid adjacency and produced a solution.
Premature convergence (1 task). Art Gallery on Graph: code in 11 seconds, passed 1 of 3 tests. The model locked onto a BFS traversal with a sentinel collision — initializing to 0 where 0 is a valid computed value. With scaffold: the model spent 125 seconds and arrived at Dial's algorithm, which eliminates the sentinel bug by design.
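The sentinel class of bug is easy to reproduce in miniature. A hedged sketch with illustrative names (not the benchmark code): a BFS that uses 0 as its "unvisited" sentinel will re-expand the source on any cycle, because the source's true distance is also 0. Initializing to -1, a value no BFS distance can take, removes the collision.

```python
from collections import deque

def bfs_sentinel_bug(adj, src):
    dist = [0] * len(adj)        # BUG: 0 is "unvisited" AND a valid distance
    q = deque([src])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if dist[v] == 0:     # true for src too, so src gets overwritten
                dist[v] = dist[u] + 1
                q.append(v)
    return dist

def bfs_fixed(adj, src):
    dist = [-1] * len(adj)       # -1 is out-of-band: no collision possible
    dist[src] = 0
    q = deque([src])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if dist[v] == -1:
                dist[v] = dist[u] + 1
                q.append(v)
    return dist

triangle = [[1, 2], [0, 2], [0, 1]]
print(bfs_sentinel_bug(triangle, 0))  # [2, 1, 1] -- source distance corrupted
print(bfs_fixed(triangle, 0))         # [0, 1, 1]
```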
Precision (1 task). Roulettes: correct algorithm, wrong output formatting. The scaffold forced output validation that the baseline skipped.
## What a Blind Evaluator Found
We submitted all solutions to a blind evaluator — same model, fresh session, no knowledge of which solution used the scaffold. A/B labels randomized per task. Scored on correctness, efficiency, structure, readability, robustness (1-10 each).
3.5x magnitude asymmetry. When the scaffold wins, it wins by 5.7 points on average; when the baseline wins, by 1.6. The scaffold's improvements are structural: different algorithms, fixed bugs. The baseline's wins are marginal: tighter loops, fewer variables.
Never loses on correctness or robustness. Correctness: 2-0. Robustness: 4-0. When these axes differ, they always favor the scaffold.
Independent bug discovery. The evaluator traced the Art Gallery sentinel collision through two sample inputs, scored the baseline 2/10 on correctness, and explained exactly why it fails — without knowing which solution used the scaffold.
46% of tasks produced near-identical solutions. The scaffold changes outcomes only where outcomes need changing.
## Three-Way Blind Evaluations
On two tasks, the scaffold produced algorithmically distinct solutions. All three — baseline plus two scaffold variants — were submitted blind.
Art Gallery: Dial's algorithm and a standard BFS rewrite each scored 44/50, with Dial's ranked highest; the baseline scored 22/50. The evaluator: "The progression is: broken implementation → correct simple implementation → correct optimal implementation."
Cans and Openers: Binary search on marginal tradeoffs scored 46/50. Alternative decomposition scored 43/50. Baseline's ternary search scored 33/50. The evaluator: "A pattern-matched to a general technique; B identified the specific structure; C reframed the problem so that the hard part dissolved."
## Behavioral Patterns
The blind evaluator found four patterns across tasks:
- Impressive → maintainable. Baseline optimizes for performance. Augmented optimizes for clarity. Both correct. Different audiences.
- General technique → specific structure. Baseline reaches for familiar templates. Augmented identifies the problem's own structure.
- Local → global correctness. Baseline verifies per-component. Augmented verifies cross-execution invariants.
- Implicit → explicit sentinels. Augmented eliminates ambiguous initialization values or chooses algorithms that don't need sentinels.
## The Mechanism: Convergence Calibration
The model's convergence threshold is uncalibrated. It commits too early (Art Gallery: 11 seconds to a wrong algorithm) or too late (Best Performances: 610 seconds, zero code). The scaffold calibrates this:
- Suppression signals block premature convergence — forcing the model past the first-plausible solution.
- Reasoning topology prevents spirals — giving extended thinking a structured path with endpoints.
The scaffold does not tell the model what to code. It calibrates when the model commits.
## Cost
2.4x average time overhead from the tool-call architecture. Two exceptions: Defect (515s → 157s, -69%) and Cans and Openers (404s → 269s, -33%) — the scaffold saved more thinking time than it cost.
## Limitations
- 28 tasks. McNemar's p=0.134. The direction is unambiguous (4-0) and the mechanism is documented, but statistical significance would require roughly 50 tasks.
- Public test cases only. Hidden tests may differ.
- AtCoder only. Other platforms untested.
- Single model. All results on Opus 4.6.
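For reference, the p=0.134 figure is consistent with a continuity-corrected McNemar test on the 4-0 discordant split; a quick stdlib-only check (assuming that correction, which the report does not state explicitly):

```python
import math

# Discordant pairs: 4 tasks flipped fail -> pass, 0 flipped pass -> fail.
b, c = 4, 0

chi2 = (abs(b - c) - 1) ** 2 / (b + c)   # continuity-corrected statistic: 2.25
p = math.erfc(math.sqrt(chi2 / 2))       # chi-square survival, 1 degree of freedom
print(f"chi2={chi2:.2f}, p={p:.3f}")     # chi2=2.25, p=0.134
```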
## The Takeaway
Opus 4.6 at maximum effort is already strong, at 85.7% on hard competitive programming. But it loses 4 tasks to reasoning failures: spirals, premature convergence, and precision slips. One API call per task fixes all four without breaking anything that already works. A blind evaluator confirmed it: the scaffold never loses on correctness or robustness, and when it matters, it matters 3.5x more.
- Observations: What We Saw When Opus Thought Harder
- Full report and data: github.com/ejentum/benchmarks/lcb-hard
- Skill file: Logic API skill file