
LiveCodeBench Hard: 85.7% to 100% on 28 Hard Competitive Programming Tasks


Claude Opus 4.6 with maximum-effort extended thinking solves 85.7% of 28 hard AtCoder problems. With one Logic API call per task, it solves 100%. Four tasks flipped. Zero broke.


The Setup

28 hard competitive programming tasks from LiveCodeBench, all AtCoder. Read stdin, compute, write stdout. Exact string match on public test cases.

Baseline: Opus 4.6, --effort max, no augmentation. Augmented: same model, same effort, plus one Logic API call per task — a cognitive scaffold injected before code generation via the skill file.


Results

Condition               Passed   Rate
Baseline                24/28    85.7%
+ Logic API scaffold    28/28    100.0%
Delta                   +4       +14.3pp

Zero regressions. Every task that passed baseline also passed augmented.


The Four Failures the Scaffold Fixed

Reasoning spirals (2 tasks). Best Performances: 610 seconds of thinking, zero code. The model explored approaches and never converged. With scaffold: 495 seconds, working code. Tangency of Cuboids: 1,190 seconds, zero code. With scaffold: the model reframed 3D geometry into grid adjacency and produced a solution.

Premature convergence (1 task). Art Gallery on Graph: code in 11 seconds, passed 1 of 3 tests. The model locked onto a BFS traversal with a sentinel collision — initializing to 0 where 0 is a valid computed value. With scaffold: the model spent 125 seconds and arrived at Dial's algorithm, which eliminates the sentinel bug by design.
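The sentinel-collision class of bug is easy to reproduce. A minimal sketch — not the actual Art Gallery solution; the graph is illustrative — showing why initializing distances to 0 conflates "unvisited" with "distance 0", and how an out-of-band sentinel avoids it:

```python
from collections import deque

def bfs_dist(adj, src):
    """Unweighted shortest distances from src via BFS.

    Bug pattern: dist = [0] * n collides with the source's real
    distance of 0, so the source (and anything mistaken for it)
    looks already visited. Fix: a sentinel no real distance can take.
    """
    UNSEEN = -1                      # out-of-band sentinel
    dist = [UNSEEN] * len(adj)
    dist[src] = 0
    q = deque([src])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if dist[v] == UNSEEN:    # safe: -1 is never a valid distance
                dist[v] = dist[u] + 1
                q.append(v)
    return dist
```

Dial's algorithm sidesteps the issue differently: its bucket structure makes "not yet settled" explicit, so no initialization value does double duty.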

Precision (1 task). Roulettes: correct algorithm, wrong output formatting. The scaffold forced output validation that the baseline skipped.


What a Blind Evaluator Found

We submitted all solutions to a blind evaluator — same model, fresh session, no knowledge of which solution used the scaffold. A/B labels randomized per task. Scored on correctness, efficiency, structure, readability, robustness (1-10 each).

3.5x magnitude asymmetry. When the scaffold wins, it wins by 5.7 points on average; when baseline wins, the margin is only 1.6 points. The scaffold's improvements are structural — different algorithms, fixed bugs. Baseline's wins are marginal — tighter loops, fewer variables.

Never loses on correctness or robustness. Correctness: 2-0. Robustness: 4-0. When these axes differ, they always favor the scaffold.

Independent bug discovery. The evaluator traced the Art Gallery sentinel collision through two sample inputs, scored the baseline 2/10 on correctness, and explained exactly why it fails — without knowing which solution used the scaffold.

46% of tasks produced near-identical solutions. The scaffold changes outcomes only where outcomes need changing.


Three-Way Blind Evaluations

On two tasks, the scaffold produced algorithmically distinct solutions. All three — baseline plus two scaffold variants — were submitted blind.

Art Gallery: Dial's algorithm ranked highest (44/50). Standard BFS scored 44/50. Baseline scored 22/50. The evaluator: "The progression is: broken implementation → correct simple implementation → correct optimal implementation."

Cans and Openers: Binary search on marginal tradeoffs scored 46/50. Alternative decomposition scored 43/50. Baseline's ternary search scored 33/50. The evaluator: "A pattern-matched to a general technique; B identified the specific structure; C reframed the problem so that the hard part dissolved."
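The "binary search on marginal tradeoffs" idea generalizes beyond this task. A minimal sketch of the technique — an illustration, not the Cans and Openers solution itself: for a concave integer objective, marginal gains are non-increasing, so the peak is found by binary-searching where the marginal turns non-positive.

```python
def argmax_concave(f, lo, hi):
    """Argmax of a concave function f on integers [lo, hi].

    Instead of ternary search on f itself, binary-search the marginal
    f(k+1) - f(k): it is non-increasing for concave f, so the argmax
    is the first k where the marginal stops being positive.
    """
    while lo < hi:
        mid = (lo + hi) // 2
        if f(mid + 1) - f(mid) > 0:  # still gaining: peak is to the right
            lo = mid + 1
        else:                        # flat or losing: peak is at mid or left
            hi = mid
    return lo
```

The reframing is the point the evaluator praised: ternary search pattern-matches "unimodal → ternary", while the marginal view identifies the specific structure (monotone marginals) and reduces the problem to a plain binary search.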


Behavioral Patterns

The blind evaluator found four patterns across tasks:

  • Impressive → maintainable. Baseline optimizes for performance. Augmented optimizes for clarity. Both correct. Different audiences.
  • General technique → specific structure. Baseline reaches for familiar templates. Augmented identifies the problem's own structure.
  • Local → global correctness. Baseline verifies per-component. Augmented verifies cross-execution invariants.
  • Implicit → explicit sentinels. Augmented eliminates ambiguous initialization values or chooses algorithms that don't need sentinels.

The Mechanism: Convergence Calibration

The model's convergence threshold is uncalibrated. It commits too early (Art Gallery: 11 seconds to a wrong algorithm) or too late (Best Performances: 610 seconds, zero code). The scaffold calibrates this:

  • Suppression signals block premature convergence — forcing the model past the first-plausible solution.
  • Reasoning topology prevents spirals — giving extended thinking a structured path with endpoints.

The scaffold does not tell the model what to code. It calibrates when the model commits.


Cost

2.4x average time overhead from the tool-call architecture. Two exceptions: Defect (515s → 157s, -69%) and Cans and Openers (404s → 269s, -33%) — the scaffold saved more thinking time than it cost.


Limitations

  • 28 tasks. McNemar's p=0.134. The direction is unambiguous (4-0) and the mechanism documented, but statistical significance would require ~50 tasks.
  • Public test cases only. Hidden tests may differ.
  • AtCoder only. Other platforms untested.
  • Single model. All results on Opus 4.6.

The Takeaway

Opus 4.6 at maximum effort is already strong — 85.7% on hard competitive programming. But it fails 4 tasks due to reasoning failures: spirals, premature convergence, precision. One API call per task fixes all four without breaking anything that already works. A blind evaluator confirmed: the scaffold never loses on correctness or robustness, and when it matters, it matters 3.5x more.


Every insight above is implemented as a reasoning primitive in the Logic API.