SciCode: Zero Bugs on 10 Hard Scientific Computing Problems

Claude Opus 4.6 produces 7 correctness bugs across 10 hard scientific computing problems. With reasoning + code injection stacked, it produces zero. The bugs it avoids include a critical force-sign error that would collapse a molecular dynamics simulation.

The Setup

The benchmark is 10 hard scientific computing problems from SciCode, spanning Maxwell PDE solvers, Schrödinger DFT, Ising models, quantum information, molecular dynamics, and X-ray diffraction. Together they contain 113 implementation sub-steps, with 7 to 15 steps per problem.

Model: Claude Opus 4.6 with thinking at maximum effort. Four conditions tested:

Condition	What it does
Raw	Opus native, no injection
Code	1 coding ability from 128
Composite	4 synergized coding abilities
Dual	1 reasoning + 1 coding ability stacked

Results

Condition	Correctness Bugs Found
Raw (no injection)	7 bugs (1 critical, 2 high, 3 medium, 1 low)
Code (single ability)	4 bugs
Composite (4 abilities)	1 bug
Dual (reasoning + code)	0 bugs

All 20 solutions load and execute without syntax errors. Every bug is a silent correctness failure: the simulation runs, produces output, and the output is physically wrong.

The Bugs the Injection Prevented

P10: Anderson Thermostat (CRITICAL)

The bug: force = f_mag * (r_vec / r_mag): the force is ATTRACTIVE at short range instead of repulsive. The Lennard-Jones potential has a negative gradient at short distances, but the raw model dropped the negative sign in the directional component.

What happens: The simulation runs without errors. Every particle collapses to a single point. Positions update, velocities change, the output looks plausible, but the physics is completely wrong.

What the injection produced: force = -f_mag * dr, derived explicitly from the potential with the negative sign verified against the physical requirement that LJ forces are repulsive at short range. The model added an explicit comment: "Force must be repulsive (negative) at distances below sigma."
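The sign check described above can be sketched by deriving the force directly from the potential. This is a minimal illustration, not the model's actual solution: the function names, the reduced units (epsilon = sigma = 1), and the convention dr = r_i - r_j are assumptions made for the example.

```python
import numpy as np

def lj_potential(r, epsilon=1.0, sigma=1.0):
    """Lennard-Jones pair potential U(r) = 4*eps*((sigma/r)^12 - (sigma/r)^6)."""
    sr6 = (sigma / r) ** 6
    return 4.0 * epsilon * (sr6**2 - sr6)

def lj_force_on_i(r_i, r_j, epsilon=1.0, sigma=1.0):
    """Force on particle i due to particle j.

    F_i = -dU/dr * (r_i - r_j) / |r_i - r_j|
    Assumed convention: dr points from j to i, so a positive component
    along dr pushes i away from j (repulsion).
    """
    dr = np.asarray(r_i, dtype=float) - np.asarray(r_j, dtype=float)
    r = np.linalg.norm(dr)
    sr6 = (sigma / r) ** 6
    # dU/dr = 4*eps*(-12*sigma^12/r^13 + 6*sigma^6/r^7)
    du_dr = 4.0 * epsilon * (-12.0 * sr6**2 + 6.0 * sr6) / r
    return -du_dr * dr / r

origin = np.zeros(3)
f_short = lj_force_on_i([0.9, 0.0, 0.0], origin)  # inside the potential minimum
f_long = lj_force_on_i([1.5, 0.0, 0.0], origin)   # outside the potential minimum

# Physical requirement: repulsive at short range, attractive at long range.
assert f_short[0] > 0.0
assert f_long[0] < 0.0
```

Dropping the leading minus sign in the return statement flips both assertions, which is the silent failure described above: the code still runs, but close pairs attract.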

P7: X-ray Diffraction (HIGH)

The bug: B[2,2] = 1/c, an incomplete B-matrix that only works for crystal systems with orthogonal axes (cubic, tetragonal, orthorhombic). For non-orthogonal systems, including triclinic, monoclinic, and hexagonal, every calculated structure factor is wrong.

What the injection produced: Full Busing-Levy B-matrix computation with intermediate variables m and n that account for arbitrary triclinic lattice parameters. Handles all 7 crystal systems correctly. Uses np.linalg.solve (numerically stable) instead of np.linalg.inv.

The blind evaluator noted: "Solution A handles the special case. Solution B handles the general case. In crystallography, the general case is the real requirement; most interesting structures are not orthogonal."
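The general case can be illustrated with a short sketch that builds a reciprocal basis for arbitrary lattice parameters. It is not the Busing-Levy parameterization itself: it places a along x rather than a* along x, so the resulting matrix differs from the Busing-Levy B-matrix by a rotation, although d-spacings and reciprocal-vector lengths agree. The function names are hypothetical, and the example uses np.linalg.solve rather than an explicit inverse.

```python
import numpy as np

def direct_lattice(a, b, c, alpha, beta, gamma):
    """Rows are the lattice vectors a, b, c in Cartesian coordinates.

    Lengths in angstroms, angles in degrees. Assumed convention:
    a along x, b in the xy-plane.
    """
    al, be, ga = np.radians([alpha, beta, gamma])
    cx = c * np.cos(be)
    cy = c * (np.cos(al) - np.cos(be) * np.cos(ga)) / np.sin(ga)
    cz = np.sqrt(c**2 - cx**2 - cy**2)
    return np.array([
        [a, 0.0, 0.0],
        [b * np.cos(ga), b * np.sin(ga), 0.0],
        [cx, cy, cz],
    ])

def reciprocal_basis(a, b, c, alpha, beta, gamma):
    """Columns are a*, b*, c* (no 2*pi factor), so a_i . b*_j = delta_ij."""
    A = direct_lattice(a, b, c, alpha, beta, gamma)
    # Solve A @ B = I instead of forming an explicit inverse.
    return np.linalg.solve(A, np.eye(3))

def d_spacing(hkl, B):
    """Interplanar spacing d = 1 / |B @ hkl|."""
    return 1.0 / np.linalg.norm(B @ np.asarray(hkl, dtype=float))

# Hexagonal cell: d(100) = a * sqrt(3) / 2, which a diagonal matrix gets wrong.
B_hex = reciprocal_basis(3.0, 3.0, 5.0, 90.0, 90.0, 120.0)
assert np.isclose(d_spacing([1, 0, 0], B_hex), 3.0 * np.sqrt(3.0) / 2.0)
```

A diagonal matrix with entries 1/a, 1/b, 1/c would give d(100) = 3.0 for the hexagonal cell above instead of about 2.598, which is the kind of error the special-case solution introduces.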

P6: Ising Model (MEDIUM, highest blind eval margin: +9)

The bug: No equilibration burn-in. The raw model averages ALL Monte Carlo sweeps including the thermalization phase. Near the critical temperature, this biases magnetization measurements because early sweeps haven't reached thermal equilibrium.

What the injection produced: Discards first half of sweeps as equilibration. Uses np.gradient (2nd-order centered) instead of np.diff (1st-order) for critical temperature detection. Added 5 explicit assertions testing neighbor_list, site_energy, total_energy, total_magnetization, and flip_probability, all with specific expected values. The raw model added zero assertions.
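Both ideas, discarding the equilibration phase and locating the critical temperature with a centered derivative, can be sketched on synthetic data. The function names, the 50% burn-in fraction as a parameter, and the tanh-shaped magnetization curve are assumptions made for illustration; they are not taken from the benchmark solutions.

```python
import numpy as np

def equilibrated_mean(samples, burn_in_fraction=0.5):
    """Average an observable after discarding the initial equilibration sweeps."""
    samples = np.asarray(samples, dtype=float)
    start = int(len(samples) * burn_in_fraction)
    return samples[start:].mean()

def estimate_tc(temperatures, magnetization):
    """Estimate T_c as the temperature where magnetization drops fastest.

    np.gradient uses second-order central differences at interior points
    and returns an array with the same length as its input.
    """
    temperatures = np.asarray(temperatures, dtype=float)
    dm_dt = np.gradient(np.asarray(magnetization, dtype=float), temperatures)
    return temperatures[np.argmin(dm_dt)]

# Synthetic run: 50 sweeps of thermalization drift, then 50 at equilibrium.
trace = np.concatenate([np.linspace(0.0, 1.0, 50), np.ones(50)])
assert np.isclose(equilibrated_mean(trace), 1.0)  # burn-in discarded
assert np.isclose(trace.mean(), 0.75)             # biased by thermalization

# Synthetic magnetization curve with its steepest drop near T = 2.27.
temps = np.linspace(1.5, 3.5, 41)
mag = 0.5 * (1.0 - np.tanh((temps - 2.27) / 0.1))
assert abs(estimate_tc(temps, mag) - 2.27) < 0.05
```

Because np.gradient keeps the array length, the derivative lines up with the temperature grid directly; np.diff returns one fewer element, so each difference sits between two grid points.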

P9: LEG Dyson (MEDIUM)

The bug: Missing phase factor: Raman intensity omits exp(-2i*kd*(l-l')), producing wrong results for any nonzero wave vector.

What the injection produced: Correct phase factor with proper branch cuts. Numerical root-finding for exact surface plasmon frequency instead of the omega_p/sqrt(2) approximation.
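A small sketch shows why omitting the phase factor only goes unnoticed at zero wave vector. It builds the matrix of factors exp(-2i*kd*(l-l')) over layer indices. The function name and the parameter values are hypothetical, and the rest of the Raman intensity expression is not reproduced here.

```python
import numpy as np

def layer_phase_matrix(k, d, n_layers):
    """Matrix of phase factors exp(-2i * k * d * (l - l')) for layer pairs."""
    l = np.arange(n_layers)
    return np.exp(-2j * k * d * (l[:, None] - l[None, :]))

# At k = 0 every factor is exactly 1, so dropping the phase changes nothing.
assert np.allclose(layer_phase_matrix(0.0, 1.0, 4), 1.0)

# At nonzero k the off-diagonal factors are complex and cannot be dropped.
phases = layer_phase_matrix(0.3, 1.0, 4)
assert not np.allclose(phases, 1.0)
```

A test suite that only exercises k = 0 would therefore pass with or without the factor, which fits the description of a silent failure.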

P8: Helium DMC (LOW-MED)

The bug: Division-by-zero risk when the wavefunction approaches zero. Guard only checks r == 0, missing near-zero wavefunction values.

What the injection produced: Importance-sampled Langevin VMC with trapezoidal DMC weights and population feedback. The blind evaluator characterized it as "production-quality quantum Monte Carlo" versus the raw model's "simple random walk without importance sampling."
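The guard issue can be illustrated with a drift term of the form grad(psi)/psi, which appears in importance-sampled quantum Monte Carlo. The sketch below tests the magnitude of the wavefunction rather than r == 0. The function name, the threshold eps, and the choice to zero the drift for affected walkers are assumptions made for the example, not the behavior of either benchmark solution.

```python
import numpy as np

def safe_drift(grad_psi, psi, eps=1e-12):
    """Drift grad(psi)/psi per walker, guarded against near-zero psi.

    grad_psi has shape (n_walkers, dim); psi has shape (n_walkers,).
    Walkers with |psi| < eps get zero drift (an assumed policy) instead
    of an infinite or NaN value.
    """
    grad = np.asarray(grad_psi, dtype=float)
    psi = np.asarray(psi, dtype=float)
    small = np.abs(psi) < eps
    safe_psi = np.where(small, 1.0, psi)  # placeholder avoids dividing by ~0
    drift = grad / safe_psi[:, None]
    drift[small] = 0.0
    return drift

psi = np.array([1.0, 0.0, 1e-20])
grad = np.ones((3, 3))
drift = safe_drift(grad, psi)
assert np.all(np.isfinite(drift))
```

A guard that only checks for an exact zero would let the third walker through and produce a drift of order 1e20.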

Blind Evaluation: 10/10

A blind evaluator reviewed all 20 solutions (the raw and dual solution for each problem) with randomized A/B labels. Each was scored on 7 criteria: correctness, numerical robustness, code architecture, documentation, algorithmic quality, self-verification, and production readiness.

The evaluator chose the dual injection solution on all 10 problems.

Problem	Domain	Margin	Key Differentiator
P1 Maxwell	Electrodynamics	+3	2nd-order Sommerfeld BC, correct symmetry
P2 Schrödinger	Quantum Mechanics	+2	Correct Z propagation, proper Numerov signs
P3 Berendsen	Molecular Dynamics	+4	Vectorized PBC, correct barostat formula
P4 GADC Entangle	Quantum Information	+1	Re-normalized post-selection
P5 GADC Coherent	Quantum Information	+2	Cleaner entropy handling
P6 Ising	Statistical Mechanics	+9	Equilibration burn-in, 2nd-order T_c, 5 assertions
P7 X-ray	Crystallography	+8	Full Busing-Levy B-matrix, all 7 crystal systems
P8 Helium DMC	Quantum Chemistry	+5	Importance-sampled VMC, trapezoidal DMC
P9 LEG Dyson	Optics	+8	Correct phase factor, numerical root-finding
P10 Anderson	Molecular Dynamics	+4	Correct force sign, vectorized thermostat

Aggregate: Dual 158/210 vs Raw 149/210, a relative gain of about 6%. The aggregate understates the difference: it averages strong wins (P6: +9, P7: +8, P9: +8) with narrow wins (P4: +1).

Algorithmic Quality

Beyond correctness, the dual injection consistently chose superior algorithms:

Problem	Raw	Dual	Why it matters
Ising	`np.diff` (1st-order)	`np.gradient` (2nd-order)	More accurate critical temperature
X-ray	`np.linalg.inv`	`np.linalg.solve`	Numerically more stable
Helium DMC	Basic Metropolis	Importance-sampled Langevin	Production-quality QMC
LEG Dyson	`omega_p/sqrt(2)` approximation	Numerical root-finding	Exact vs approximate
Anderson	Basic velocity Verlet	Vectorized with verified derivation	Performance + correctness

The injection didn't teach these algorithms. The model already knows them. It chose better algorithms because the injection forced verification against physical constraints before accepting the output.

Self-Verification

Dual condition: 20 assert statements across 10 problems. Raw condition: 0.

The dual model self-tests its code. The raw model never does. The assertions aren't decorative: they test specific physical invariants (energy conservation, magnetization bounds, force sign conventions) with expected values.
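An assertion on a physical invariant with a specific expected value might look like the following sketch for a 2D Ising lattice with periodic boundaries. The function names echo those mentioned for the Ising problem, but the implementations and test configurations are hypothetical and are not the model's actual assertions.

```python
import numpy as np

def total_energy(spins, J=1.0):
    """Nearest-neighbor Ising energy with periodic boundaries.

    Each bond is counted once by pairing every site with its right
    and lower neighbor.
    """
    right = np.roll(spins, -1, axis=1)
    down = np.roll(spins, -1, axis=0)
    return -J * np.sum(spins * (right + down))

def total_magnetization(spins):
    """Sum of all spins."""
    return np.sum(spins)

L = 4
all_up = np.ones((L, L), dtype=int)
checkerboard = np.indices((L, L)).sum(axis=0) % 2 * 2 - 1  # alternating +1/-1

# Ground state: 2 bonds per site, each contributing -J.
assert total_energy(all_up) == -2 * L * L
assert total_magnetization(all_up) == L * L

# Fully frustrated state (even L): every bond contributes +J.
assert total_energy(checkerboard) == 2 * L * L
assert total_magnetization(checkerboard) == 0

# Magnetization bound holds for any configuration.
rng = np.random.default_rng(0)
random_spins = rng.choice([-1, 1], size=(L, L))
assert abs(total_magnetization(random_spins)) <= L * L
```

Configurations with known closed-form values make a sign error or a double-counted bond fail immediately instead of passing silently.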

The Mechanism: Dual Stacking

Why does reasoning + code outperform code alone? In 2 of 10 problems, the single-scaffold code injection diverted the model's attention budget away from domain-specific patterns. Dual stacking gives the model two orthogonal injections:

Reasoning injection prevents analytical errors: forces the model to verify derivations, check physical plausibility, and trace causal chains through equations.
Code injection prevents engineering errors: blocks hallucinated APIs, enforces modular decomposition, and triggers self-verification.

Together, they cover the failure surface that neither reaches alone.

The Takeaway

Frontier models produce scientific computing code that compiles, runs, and looks correct. The bugs are silent: force signs inverted, matrices incomplete, equilibration skipped. The simulation produces output, and the output is physically wrong. Across 10 hard problems spanning 6 scientific domains, the dual injection produced zero detected correctness bugs where the raw model produced 7. A blind evaluator chose the dual injection solution on all 10. The largest margins were +9 on Ising and +8 on X-ray and Dyson.
