SciCode: Zero Bugs on 10 Hard Scientific Computing Problems

Claude Opus 4.6 produces 7 correctness bugs across 10 hard scientific computing problems. With reasoning and code injection stacked, it produces zero. The eliminated bugs include a critical force-sign error that would collapse a molecular dynamics simulation.


The Setup

10 hard scientific computing problems from SciCode, spanning Maxwell PDE solvers, Schrödinger DFT, Ising models, quantum information, molecular dynamics, and X-ray diffraction. 113 implementation sub-steps total. 7-15 steps per problem.

Model: Claude Opus 4.6 with thinking at maximum effort. Four conditions tested:

Condition | What it does
Raw | Opus native, no injection
Code | 1 coding ability from 128
Code-Multi | 4 synergized coding abilities
Dual | 1 reasoning + 1 coding ability stacked

Results

Condition | Correctness Bugs Found
Raw (no injection) | 7 bugs (1 critical, 2 high, 3 medium, 1 low)
Code (single ability) | 4 bugs
Code-Multi (4 abilities) | 1 bug
Dual (reasoning + code) | 0 bugs

All 20 solutions load and execute without syntax errors. Every bug is a silent correctness failure — the simulation runs, produces output, and the output is physically wrong.


The Bugs the Injection Prevented

P10: Anderson Thermostat — CRITICAL

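The bug: force = f_mag * (r_vec / r_mag). With the raw model's sign conventions, the force comes out attractive at short range instead of repulsive. The force is the negative gradient of the potential, F = -dU/dr, and the Lennard-Jones potential has a negative slope at short distances, so the force there must push particles apart. The raw model dropped the negative sign in the directional component.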

What happens: The simulation runs without errors. Every particle collapses to a single point. Positions update, velocities change, the output looks plausible — but the physics is completely wrong.

What the injection produced: force = -f_mag * dr — derived explicitly from the potential with the negative sign verified against the physical requirement that LJ forces are repulsive at short range. The model added an explicit comment: "Force must be repulsive (negative) at distances below sigma."
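The sign check described above can be sketched as follows. This is a minimal illustration, not the code either condition produced: the function name lj_force, the reduced units sigma = epsilon = 1, and the convention that dr points from particle j toward particle i are all assumptions made for the example.

```python
import numpy as np

def lj_force(r_i, r_j, sigma=1.0, epsilon=1.0):
    """Force on particle i from particle j (hypothetical helper).

    Potential: U(r) = 4*eps*((sigma/r)**12 - (sigma/r)**6).
    """
    dr = r_i - r_j                      # assumed convention: points from j toward i
    r = np.linalg.norm(dr)
    sr6 = (sigma / r) ** 6
    # dU/dr, derived term by term from the potential above
    dU_dr = 4.0 * epsilon * (-12.0 * sr6**2 + 6.0 * sr6) / r
    # F = -dU/dr along the unit separation vector
    return -dU_dr * dr / r

origin = np.zeros(3)
f_close = lj_force(np.array([0.9, 0.0, 0.0]), origin)
f_far = lj_force(np.array([1.5, 0.0, 0.0]), origin)
f_min = lj_force(np.array([2.0 ** (1.0 / 6.0), 0.0, 0.0]), origin)

# Physical sign checks: pushed away when close, pulled back when far
assert f_close[0] > 0.0
assert f_far[0] < 0.0
assert abs(f_min[0]) < 1e-12
```

Under these conventions the force changes sign at the potential minimum, r = 2^(1/6) sigma, so any separation below that gives a positive component along dr. Flipping the sign of the return value reproduces the collapse described above.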

P7: X-ray Diffraction — HIGH

The bug: B[2,2] = 1/c — an incomplete B-matrix that only works for orthogonal crystal systems (cubic, tetragonal, orthorhombic). For triclinic, monoclinic, and hexagonal systems, every calculated structure factor is wrong.

What the injection produced: Full Busing-Levy B-matrix computation with intermediate variables m and n that account for arbitrary triclinic lattice parameters. Handles all 7 crystal systems correctly. Uses np.linalg.solve (numerically stable) instead of np.linalg.inv.

The blind evaluator noted: "Solution A handles the special case. Solution B handles the general case. In crystallography, the general case is the real requirement — most interesting structures are not orthogonal."
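A general-cell B matrix in the Busing-Levy layout can be sketched as below. This is a hedged illustration rather than the solution the evaluator saw: the function names, the omission of the 2*pi factor, and the test cell are assumptions. In this layout the off-diagonal entries vanish only when all cell angles are 90 degrees, which is why a diagonal-only matrix is restricted to orthogonal cells.

```python
import numpy as np

def b_matrix(a, b, c, alpha, beta, gamma):
    """B matrix, Busing-Levy layout, no 2*pi factor; angles in degrees.

    Maps Miller indices to Cartesian reciprocal coordinates: q = B @ hkl.
    """
    al, be, ga = np.radians([alpha, beta, gamma])
    ca, cb, cg = np.cos([al, be, ga])
    sa, sb, sg = np.sin([al, be, ga])

    # Unit-cell volume for an arbitrary (triclinic) cell
    volume = a * b * c * np.sqrt(1.0 - ca**2 - cb**2 - cg**2 + 2.0 * ca * cb * cg)

    # Reciprocal cell lengths
    a_star = b * c * sa / volume
    b_star = a * c * sb / volume
    c_star = a * b * sg / volume

    # Reciprocal angles beta* and gamma* (both lie in (0, pi), so sin > 0)
    cos_beta_star = (ca * cg - cb) / (sa * sg)
    cos_gamma_star = (ca * cb - cg) / (sa * sb)
    sin_beta_star = np.sqrt(1.0 - cos_beta_star**2)
    sin_gamma_star = np.sqrt(1.0 - cos_gamma_star**2)

    return np.array([
        [a_star, b_star * cos_gamma_star, c_star * cos_beta_star],
        [0.0, b_star * sin_gamma_star, -c_star * sin_beta_star * ca],
        [0.0, 0.0, 1.0 / c],
    ])

def metric_tensor(a, b, c, alpha, beta, gamma):
    """Direct-space metric tensor G, used only for the self-check below."""
    ca, cb, cg = np.cos(np.radians([alpha, beta, gamma]))
    return np.array([
        [a * a, a * b * cg, a * c * cb],
        [a * b * cg, b * b, b * c * ca],
        [a * c * cb, b * c * ca, c * c],
    ])

cell = (5.0, 6.0, 7.0, 80.0, 95.0, 110.0)   # arbitrary triclinic test cell
B = b_matrix(*cell)

# Self-check: B^T B must equal the reciprocal metric tensor, inverse(G)
assert np.allclose(B.T @ B, np.linalg.inv(metric_tensor(*cell)))

# Recover hkl from a Cartesian q by solving B @ hkl = q, without forming inv(B)
hkl = np.array([1.0, -2.0, 3.0])
q = B @ hkl
assert np.allclose(np.linalg.solve(B, q), hkl)
```

The metric-tensor check is independent of the closed-form reciprocal-cell formulas, so it catches a wrong sign or a swapped angle in any entry of B.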

P6: Ising Model — MEDIUM (Highest blind eval margin: +9)

The bug: No equilibration burn-in. The raw model averages ALL Monte Carlo sweeps including the thermalization phase. Near the critical temperature, this biases magnetization measurements because early sweeps haven't reached thermal equilibrium.

What the injection produced: Discards first half of sweeps as equilibration. Uses np.gradient (2nd-order centered) instead of np.diff (1st-order) for critical temperature detection. Added 5 explicit assertions testing neighbor_list, site_energy, total_energy, total_magnetization, and flip_probability — all with specific expected values. The raw model added zero assertions.
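The burn-in and the centered-difference estimate can be sketched together. This is an illustrative toy under stated assumptions, not the evaluated solution: the function names, the 2D square lattice with J = k_B = 1, the ordered starting state, and the synthetic magnetization curve are all hypothetical choices.

```python
import numpy as np

def metropolis_sweep(spins, T, rng):
    """One sweep: L*L single-spin flip attempts with periodic boundaries."""
    L = spins.shape[0]
    for _ in range(L * L):
        i, j = rng.integers(0, L, size=2)
        nn = (spins[(i + 1) % L, j] + spins[(i - 1) % L, j]
              + spins[i, (j + 1) % L] + spins[i, (j - 1) % L])
        dE = 2.0 * spins[i, j] * nn          # energy change if this spin flips
        if dE <= 0 or rng.random() < np.exp(-dE / T):
            spins[i, j] *= -1

def mean_abs_magnetization(L, T, n_sweeps, rng):
    """Average |m| over the second half of the sweeps only."""
    spins = np.ones((L, L), dtype=int)
    burn_in = n_sweeps // 2                  # first half discarded as equilibration
    samples = []
    for sweep in range(n_sweeps):
        metropolis_sweep(spins, T, rng)
        if sweep >= burn_in:
            samples.append(abs(spins.mean()))
    return float(np.mean(samples))

def estimate_tc(temps, mags):
    """Temperature where |m|(T) falls fastest, via np.gradient."""
    dm_dT = np.gradient(mags, temps)
    return temps[np.argmin(dm_dT)]

rng = np.random.default_rng(0)
m_cold = mean_abs_magnetization(L=6, T=1.0, n_sweeps=40, rng=rng)
assert 0.0 <= m_cold <= 1.0                  # magnetization bound

# Synthetic curve with its steepest drop placed at T = 2.3 (not simulation data)
temps = np.linspace(1.5, 3.5, 21)
mags = 0.5 * (1.0 - np.tanh((temps - 2.3) / 0.2))
tc = estimate_tc(temps, mags)
```

np.gradient returns an array the same length as its input and uses second-order central differences at interior points, so the index of the steepest slope maps directly onto the temperature grid. np.diff returns one fewer element, with each slope belonging to the midpoint between two grid temperatures.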

P9: LEG Dyson — MEDIUM

The bug: Missing phase factor — Raman intensity omits exp(-2i*kd*(l-l')), producing wrong results for any nonzero wave vector.

What the injection produced: Correct phase factor with proper branch cuts. Numerical root-finding for exact surface plasmon frequency instead of the omega_p/sqrt(2) approximation.

P8: Helium DMC — LOW-MED

The bug: Division-by-zero risk when the wavefunction approaches zero. Guard only checks r == 0, missing near-zero wavefunction values.

What the injection produced: Importance-sampled Langevin VMC with trapezoidal DMC weights and population feedback. The blind evaluator characterized it as "production-quality quantum Monte Carlo" versus the raw model's "simple random walk without importance sampling."
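The division-by-zero risk can be illustrated with a guard that tests the magnitude of the wavefunction rather than r alone. This is a sketch only: the helper name guarded_ratio and the tolerance are assumptions, and what a sampler does with the flagged entries (reject the move, resample the walker) is left open.

```python
import numpy as np

def guarded_ratio(numerator, psi, tol=1e-12):
    """Return numerator / psi and a validity mask (hypothetical helper).

    Entries where |psi| <= tol are set to NaN and flagged invalid instead of
    producing inf or a huge local energy.
    """
    numerator = np.asarray(numerator, dtype=float)
    psi = np.asarray(psi, dtype=float)
    ok = np.abs(psi) > tol
    ratio = np.full(psi.shape, np.nan)
    np.divide(numerator, psi, out=ratio, where=ok)   # divide only where safe
    return ratio, ok

ratio, ok = guarded_ratio([1.0, 2.0, 3.0], [0.5, 1e-300, 0.0])
```

Here the second entry is nonzero, so an exact-zero test would let it through and the division would overflow; the magnitude test flags it along with the exact zero.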


Blind Evaluation: 10/10

A blind evaluator reviewed all 20 solutions with randomized A/B labels. Each scored on 7 criteria: correctness, numerical robustness, code architecture, documentation, algorithmic quality, self-verification, and production readiness.

The evaluator chose the dual injection solution on all 10 problems.

Problem | Domain | Margin | Key Differentiator
P1 Maxwell | Electrodynamics | +3 | 2nd-order Sommerfeld BC, correct symmetry
P2 Schrödinger | Quantum Mechanics | +2 | Correct Z propagation, proper Numerov signs
P3 Berendsen | Molecular Dynamics | +4 | Vectorized PBC, correct barostat formula
P4 GADC Entangle | Quantum Information | +1 | Re-normalized post-selection
P5 GADC Coherent | Quantum Information | +2 | Cleaner entropy handling
P6 Ising | Statistical Mechanics | +9 | Equilibration burn-in, 2nd-order T_c, 5 assertions
P7 X-ray | Crystallography | +8 | Full Busing-Levy B-matrix, all 7 crystal systems
P8 Helium DMC | Quantum Chemistry | +5 | Importance-sampled VMC, trapezoidal DMC
P9 LEG Dyson | Optics | +8 | Correct phase factor, numerical root-finding
P10 Anderson | Molecular Dynamics | +4 | Correct force sign, vectorized thermostat

Aggregate: Dual 158/210 vs Raw 149/210, a relative gain of about 6%. That aggregate understates the difference because it averages strong wins (P6: +9, P7: +8, P9: +8) with narrow wins (P4: +1).


Algorithmic Quality

Beyond correctness, the dual injection consistently chose superior algorithms:

Problem | Raw | Dual | Why it matters
Ising | np.diff (1st-order) | np.gradient (2nd-order) | More accurate critical temperature
X-ray | np.linalg.inv | np.linalg.solve | Numerically more stable
Helium DMC | Basic Metropolis | Importance-sampled Langevin | Production-quality QMC
LEG Dyson | omega_p/sqrt(2) approximation | Numerical root-finding | Exact vs approximate
Anderson | Basic velocity Verlet | Vectorized with verified derivation | Performance + correctness

The injection didn't teach these algorithms. The model already knows them. It chose better algorithms because the injection forced verification against physical constraints before accepting the output.


Self-Verification

Dual condition: 20 assert statements across 10 problems. Raw condition: 0.

The dual model self-tests its code. The raw model never does. The assertions aren't decorative — they test specific physical invariants (energy conservation, magnetization bounds, force sign conventions) with expected values.
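An assertion on a physical invariant might look like the sketch below. It is not taken from the evaluated solutions: the harmonic-oscillator test system, the function names, the step size, and the tolerances are assumptions chosen so that the expected values are known analytically.

```python
import numpy as np

def velocity_verlet(x, v, force, mass, dt, n_steps):
    """Integrate one degree of freedom with the velocity Verlet scheme."""
    f = force(x)
    for _ in range(n_steps):
        v_half = v + 0.5 * dt * f / mass
        x = x + dt * v_half
        f = force(x)
        v = v_half + 0.5 * dt * f / mass
    return x, v

k, mass = 1.0, 1.0                       # hypothetical test system: 1D spring

def force(x):
    return -k * x                        # restoring force, F = -dU/dx

def energy(x, v):
    return 0.5 * mass * v**2 + 0.5 * k * x**2

x0, v0 = 1.0, 0.0
x1, v1 = velocity_verlet(x0, v0, force, mass, dt=0.01, n_steps=1000)
e0, e1 = energy(x0, v0), energy(x1, v1)

# Force sign convention: displaced to +x, the force must point toward -x
assert force(0.5) < 0.0
# Energy conservation: drift stays small over the whole run
assert abs(e1 - e0) / e0 < 1e-4
# Expected value: the exact trajectory is x(t) = cos(t), here at t = 10
assert abs(x1 - np.cos(10.0)) < 1e-3
```

Each assertion compares against a value fixed by the physics rather than by the implementation, so a sign error or a wrong integrator update fails loudly instead of producing plausible-looking output.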


The Mechanism: Dual Stacking

Why does reasoning + code outperform code alone? Single-scaffold code injection consumed attention budget from domain-specific patterns in 2 of 10 problems. Dual stacking gives the model two orthogonal injections:

  • Reasoning injection prevents analytical errors — forces the model to verify derivations, check physical plausibility, and trace causal chains through equations.
  • Code injection prevents engineering errors — blocks hallucinated APIs, enforces modular decomposition, and triggers self-verification.

Together, they cover the failure surface that neither reaches alone.


The Takeaway

Frontier models produce scientific computing code that compiles, runs, and looks correct. The bugs are silent — force signs inverted, matrices incomplete, equilibration skipped. The simulation produces output. The output is physically wrong. Across 10 hard problems spanning 6 scientific domains, the dual injection produced zero found correctness bugs where the raw model produced 7. A blind evaluator chose the injection on all 10. The strongest improvements were on the hardest problems: +9 on Ising, +8 on X-ray and Dyson.


Every insight above is implemented as a reasoning primitive in the Logic API.