SciCode: Zero Bugs on 10 Hard Scientific Computing Problems
Claude Opus 4.6 produces 7 correctness bugs across 10 hard scientific computing problems. With reasoning and code injection stacked, it produces zero. The bugs eliminated include a critical force-sign error that would collapse a molecular dynamics simulation.
The Setup
10 hard scientific computing problems from SciCode, spanning Maxwell PDE solvers, Schrödinger DFT, Ising models, quantum information, molecular dynamics, and X-ray diffraction. 113 implementation sub-steps total. 7-15 steps per problem.
Model: Claude Opus 4.6 with thinking at maximum effort. Four conditions tested:
| Condition | What it does |
|---|---|
| Raw | Opus native, no injection |
| Code | 1 coding ability from 128 |
| Code-Multi | 4 synergized coding abilities |
| Dual | 1 reasoning + 1 coding ability stacked |
Results
| Condition | Correctness Bugs Found |
|---|---|
| Raw (no injection) | 7 bugs (1 critical, 2 high, 3 medium, 1 low) |
| Code (single ability) | 4 bugs |
| Code-Multi (4 abilities) | 1 bug |
| Dual (reasoning + code) | 0 bugs |
All 20 solutions load and execute without syntax errors. Every bug is a silent correctness failure — the simulation runs, produces output, and the output is physically wrong.
The Bugs the Injection Prevented
P10: Anderson Thermostat — CRITICAL
The bug: force = f_mag * (r_vec / r_mag). The resulting force is attractive at short range instead of repulsive. The Lennard-Jones potential has a negative gradient at short distances, so the force F = -dU/dr must push the pair apart, but the raw model dropped the negative sign in the directional component.
What happens: The simulation runs without errors. Every particle collapses to a single point. Positions update, velocities change, the output looks plausible — but the physics is completely wrong.
What the injection produced: force = -f_mag * dr — derived explicitly from the potential with the negative sign verified against the physical requirement that LJ forces are repulsive at short range. The model added an explicit comment: "Force must be repulsive (negative) at distances below sigma."
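The sign check described above can be sketched in a few lines. This is an illustration, not the benchmark solution: the function name lj_force, the reduced units (epsilon = sigma = 1), and the convention that dr points from particle j to particle i are all assumptions made here.

```python
import numpy as np

def lj_force(r_i, r_j, epsilon=1.0, sigma=1.0):
    """Force on particle i from particle j for a 12-6 Lennard-Jones pair.

    Hypothetical helper: names and sign convention are chosen for this sketch.
    """
    dr = r_i - r_j                     # points from j toward i
    r = np.linalg.norm(dr)
    sr6 = (sigma / r) ** 6
    # U(r) = 4*eps*(sr6**2 - sr6)  =>  dU/dr = -(24*eps/r) * (2*sr6**2 - sr6)
    dU_dr = -24.0 * epsilon * (2.0 * sr6**2 - sr6) / r
    # F_i = -dU/dr * r_hat, with r_hat = dr / r
    return -dU_dr * dr / r

origin = np.zeros(3)
close = np.array([0.9, 0.0, 0.0])      # inside the potential minimum
far = np.array([1.5, 0.0, 0.0])        # outside the potential minimum

# Repulsive at short range: the force on i points away from j.
assert np.dot(lj_force(close, origin), close - origin) > 0
# Attractive at longer range: the force on i points toward j.
assert np.dot(lj_force(far, origin), far - origin) < 0
```

The two assertions encode the physical requirement directly, so an inverted sign fails immediately instead of producing a plausible-looking trajectory.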
P7: X-ray Diffraction — HIGH
The bug: B[2,2] = 1/c, part of an incomplete B-matrix that only works for orthogonal crystal systems (cubic, tetragonal, orthorhombic). For non-orthogonal systems such as triclinic, monoclinic, trigonal, and hexagonal, every calculated structure factor is wrong.
What the injection produced: Full Busing-Levy B-matrix computation with intermediate variables m and n that account for arbitrary triclinic lattice parameters. Handles all 7 crystal systems correctly. Uses np.linalg.solve (numerically stable) instead of np.linalg.inv.
The blind evaluator noted: "Solution A handles the special case. Solution B handles the general case. In crystallography, the general case is the real requirement — most interesting structures are not orthogonal."
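For reference, a general-cell B-matrix in the Busing-Levy convention (no factor of 2π) might look like the sketch below. It is not the solution the model produced: the function name b_matrix, the degree-valued angle arguments, and the example cell are assumptions, and the intermediate variables m and n mentioned above are not reproduced.

```python
import numpy as np

def b_matrix(a, b, c, alpha, beta, gamma):
    """Busing-Levy B matrix for an arbitrary cell; angles in degrees.

    Hypothetical helper for this sketch. Columns are the reciprocal
    basis vectors in a Cartesian frame with a* along x.
    """
    ca, cb, cg = np.cos(np.radians([alpha, beta, gamma]))
    sa, sb, sg = np.sin(np.radians([alpha, beta, gamma]))
    vol = a * b * c * np.sqrt(1 - ca**2 - cb**2 - cg**2 + 2 * ca * cb * cg)
    # Reciprocal lengths and the two reciprocal angles B needs.
    a_star = b * c * sa / vol
    b_star = a * c * sb / vol
    c_star = a * b * sg / vol
    cos_be_star = (ca * cg - cb) / (sa * sg)
    cos_ga_star = (ca * cb - cg) / (sa * sb)
    sin_be_star = np.sqrt(1 - cos_be_star**2)
    sin_ga_star = np.sqrt(1 - cos_ga_star**2)
    return np.array([
        [a_star, b_star * cos_ga_star, c_star * cos_be_star],
        [0.0, b_star * sin_ga_star, -c_star * sin_be_star * ca],
        [0.0, 0.0, 1.0 / c],
    ])

# Illustrative triclinic cell (made-up parameters, not from the benchmark).
B = b_matrix(5.0, 6.0, 7.0, 80.0, 95.0, 105.0)
hkl = np.array([1.0, 2.0, -1.0])
q = B @ hkl                            # scattering vector in Cartesian axes
d_spacing = 1.0 / np.linalg.norm(q)
# Map back to Miller indices with a linear solve rather than an explicit inverse.
hkl_back = np.linalg.solve(B, q)
assert np.allclose(hkl_back, hkl)
```

For an orthogonal cell the off-diagonal terms vanish and the matrix reduces to diag(1/a, 1/b, 1/c), which is why a diagonal-only implementation passes cubic test cases and fails everywhere else.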
P6: Ising Model — MEDIUM (Highest blind eval margin: +9)
The bug: No equilibration burn-in. The raw model averages ALL Monte Carlo sweeps including the thermalization phase. Near the critical temperature, this biases magnetization measurements because early sweeps haven't reached thermal equilibrium.
What the injection produced: Discards first half of sweeps as equilibration. Uses np.gradient (2nd-order centered) instead of np.diff (1st-order) for critical temperature detection. Added 5 explicit assertions testing neighbor_list, site_energy, total_energy, total_magnetization, and flip_probability — all with specific expected values. The raw model added zero assertions.
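A minimal sketch of the two ideas, under stated assumptions: the helper names equilibrated_mean and estimate_tc are hypothetical, and the data are synthetic stand-ins (an exponential relaxation and a tanh-shaped magnetization curve), not output from the benchmark runs.

```python
import numpy as np

def equilibrated_mean(samples, burn_in_fraction=0.5):
    """Average a Monte Carlo time series after discarding the initial burn-in."""
    samples = np.asarray(samples, dtype=float)
    start = int(len(samples) * burn_in_fraction)
    return samples[start:].mean()

def estimate_tc(temperatures, magnetization):
    """Grid temperature where |dM/dT| peaks, using np.gradient."""
    dm_dt = np.gradient(magnetization, temperatures)
    return temperatures[np.argmax(np.abs(dm_dt))]

# Synthetic series: relaxes toward an equilibrium value of 0.8 (assumption).
sweeps = np.arange(400)
series = 0.8 + 0.2 * np.exp(-sweeps / 20.0)
naive = series.mean()                  # includes the thermalization transient
trimmed = equilibrated_mean(series)    # second half only
assert abs(trimmed - 0.8) < abs(naive - 0.8)

# Synthetic magnetization curve with its steepest drop near T = 2.27 (assumption).
temps = np.linspace(1.5, 3.0, 31)
mags = 0.5 * (1.0 - np.tanh((temps - 2.27) / 0.1))
tc = estimate_tc(temps, mags)
assert abs(tc - 2.27) <= 0.05          # within one grid spacing
```

Discarding a fixed half of the sweeps is the simple policy described above; a production code might instead estimate the equilibration time from the autocorrelation of the series.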
P9: LEG Dyson — MEDIUM
The bug: Missing phase factor — Raman intensity omits exp(-2i*kd*(l-l')), producing wrong results for any nonzero wave vector.
What the injection produced: Correct phase factor with proper branch cuts. Numerical root-finding for exact surface plasmon frequency instead of the omega_p/sqrt(2) approximation.
P8: Helium DMC — LOW-MED
The bug: Division-by-zero risk when the wavefunction approaches zero. Guard only checks r == 0, missing near-zero wavefunction values.
What the injection produced: Importance-sampled Langevin VMC with trapezoidal DMC weights and population feedback. The blind evaluator characterized it as "production-quality quantum Monte Carlo" versus the raw model's "simple random walk without importance sampling."
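The division-by-zero risk can be illustrated with a guard on the wavefunction magnitude rather than on the radius. The sketch below is an assumption-laden example, not the code from either solution: drift_velocity, the tolerance, and the policy of zeroing the drift for near-node walkers are all choices made here. A real quantum Monte Carlo code might reject the move or cap the drift instead.

```python
import numpy as np

def drift_velocity(grad_psi, psi, tol=1e-12):
    """Return grad(psi) / psi per walker, with zero drift where |psi| < tol.

    grad_psi has shape (n_walkers, n_dim); psi has shape (n_walkers,).
    Hypothetical helper illustrating a magnitude-based guard.
    """
    grad_psi = np.asarray(grad_psi, dtype=float)
    psi = np.asarray(psi, dtype=float)
    safe = np.abs(psi) >= tol
    # Use a harmless denominator for unsafe walkers, then mask their result.
    denom = np.where(safe, psi, 1.0)
    return np.where(safe[:, None], grad_psi / denom[:, None], 0.0)

psi = np.array([1.0, 0.0, 1e-20, -2.0])
grad = np.ones((4, 3))
v = drift_velocity(grad, psi)
assert np.all(np.isfinite(v))          # no inf or nan leaks into the walkers
```

A check on r == 0 alone would let the third walker through, where the wavefunction is tiny but the radius is not zero.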
Blind Evaluation: 10/10
A blind evaluator reviewed all 20 solutions with randomized A/B labels. Each scored on 7 criteria: correctness, numerical robustness, code architecture, documentation, algorithmic quality, self-verification, and production readiness.
The evaluator chose the dual injection solution on all 10 problems.
| Problem | Domain | Margin | Key Differentiator |
|---|---|---|---|
| P1 Maxwell | Electrodynamics | +3 | 2nd-order Sommerfeld BC, correct symmetry |
| P2 Schrödinger | Quantum Mechanics | +2 | Correct Z propagation, proper Numerov signs |
| P3 Berendsen | Molecular Dynamics | +4 | Vectorized PBC, correct barostat formula |
| P4 GADC Entangle | Quantum Information | +1 | Re-normalized post-selection |
| P5 GADC Coherent | Quantum Information | +2 | Cleaner entropy handling |
| P6 Ising | Statistical Mechanics | +9 | Equilibration burn-in, 2nd-order T_c, 5 assertions |
| P7 X-ray | Crystallography | +8 | Full Busing-Levy B-matrix, all 7 crystal systems |
| P8 Helium DMC | Quantum Chemistry | +5 | Importance-sampled VMC, trapezoidal DMC |
| P9 LEG Dyson | Optics | +8 | Correct phase factor, numerical root-finding |
| P10 Anderson | Molecular Dynamics | +4 | Correct force sign, vectorized thermostat |
Aggregate: Dual 158/210 vs Raw 149/210. The roughly 6% relative gain in aggregate score understates the difference, because it averages strong wins (P6: +9, P7: +8, P9: +8) with narrow wins (P4: +1).
Algorithmic Quality
Beyond correctness, the dual injection consistently chose superior algorithms:
| Problem | Raw | Dual | Why it matters |
|---|---|---|---|
| Ising | np.diff (1st-order) | np.gradient (2nd-order) | More accurate critical temperature |
| X-ray | np.linalg.inv | np.linalg.solve | Numerically more stable |
| Helium DMC | Basic Metropolis | Importance-sampled Langevin | Production-quality QMC |
| LEG Dyson | omega_p/sqrt(2) approximation | Numerical root-finding | Exact vs approximate |
| Anderson | Basic velocity Verlet | Vectorized with verified derivation | Performance + correctness |
The injection didn't teach these algorithms. The model already knows them. It chose better algorithms because the injection forced verification against physical constraints before accepting the output.
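The np.diff versus np.gradient row is easy to check numerically. The sketch below uses sin(x) as a test function with a known derivative; the grid and the error metric are choices made for illustration. The "1st-order" label for np.diff applies when the forward difference is attributed to the left grid point, as it is here.

```python
import numpy as np

x = np.linspace(0.0, np.pi, 101)
f = np.sin(x)
exact = np.cos(x)

centered = np.gradient(f, x)           # 2nd-order accurate at interior points
forward = np.diff(f) / np.diff(x)      # 1st-order, and one sample shorter

# Compare interior points only, where np.gradient uses centered differences.
err_centered = np.max(np.abs(centered[1:-1] - exact[1:-1]))
err_forward = np.max(np.abs(forward - exact[:-1]))
assert err_centered < err_forward
```

On this grid the centered estimate is more accurate by well over an order of magnitude, and the gap widens as the grid is refined, since one error shrinks with the square of the spacing and the other only linearly.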
Self-Verification
Dual condition: 20 assert statements across 10 problems. Raw condition: 0.
The dual model self-tests its code. The raw model never does. The assertions aren't decorative — they test specific physical invariants (energy conservation, magnetization bounds, force sign conventions) with expected values.
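Assertions of this kind can be very short. The sketch below shows what invariant tests with specific expected values might look like for a 2D Ising model. The implementations of total_energy and total_magnetization are hypothetical stand-ins written for this example (periodic square lattice, coupling J = 1), not the functions from the benchmark solutions.

```python
import numpy as np

def total_energy(spins, J=1.0):
    """Nearest-neighbour Ising energy on a periodic square lattice."""
    # Pair each site with its right and down neighbours so every bond counts once.
    right = np.roll(spins, -1, axis=1)
    down = np.roll(spins, -1, axis=0)
    return -J * np.sum(spins * (right + down))

def total_magnetization(spins):
    """Sum of all spins."""
    return int(np.sum(spins))

n = 4
all_up = np.ones((n, n), dtype=int)
# Ground state: 2*n*n satisfied bonds, each contributing -J.
assert total_energy(all_up) == -2 * n * n
assert total_magnetization(all_up) == n * n

# Checkerboard: every bond is frustrated and the net magnetization vanishes.
checker = (np.indices((n, n)).sum(axis=0) % 2) * 2 - 1
assert total_energy(checker) == 2 * n * n
assert total_magnetization(checker) == 0

# Flipping one spin in the ground state breaks 4 bonds: energy rises by 8*J.
one_flip = all_up.copy()
one_flip[1, 2] = -1
assert total_energy(one_flip) == -2 * n * n + 8
```

Each expected value follows from counting bonds by hand, which is what makes the assertion a test of the physics and not just of the code path.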
The Mechanism: Dual Stacking
Why does reasoning + code outperform code alone? In 2 of 10 problems, the single code injection drew the model's attention away from domain-specific patterns. Dual stacking gives the model two orthogonal injections:
- Reasoning injection prevents analytical errors — forces the model to verify derivations, check physical plausibility, and trace causal chains through equations.
- Code injection prevents engineering errors — blocks hallucinated APIs, enforces modular decomposition, and triggers self-verification.
Together, they cover the failure surface that neither reaches alone.
The Takeaway
Frontier models produce scientific computing code that compiles, runs, and looks correct. The bugs are silent: force signs inverted, matrices incomplete, equilibration skipped. The simulation produces output, and the output is physically wrong. Across 10 hard problems spanning 6 scientific domains, no correctness bugs were found in the dual injection solutions, where the raw model produced 7. A blind evaluator chose the dual injection solution on all 10. The strongest improvements were on the hardest problems: +9 on Ising, +8 on X-ray and Dyson.