SCI Statistical Mechanics · SciCode · Correctness Flip
SCI-P06
mode: reasoning + code (dual) · SciCode · Statistical Mechanics
The Task
Ising Model: Implement a 2D Ising model simulation with Metropolis-Hastings algorithm. Calculate magnetization and specific heat near the critical temperature. 9 implementation sub-steps.
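For orientation, a single-spin-flip Metropolis sweep of the kind the task asks for might look like the sketch below. It is not taken from either model's output. The function names, the coupling J = 1, the units k_B = 1, and the periodic boundaries are all assumptions made for illustration.

```python
import numpy as np

def metropolis_sweep(spins, beta, rng):
    """One sweep: L*L single-spin-flip attempts (assumes J = 1, k_B = 1, periodic lattice)."""
    L = spins.shape[0]
    for _ in range(L * L):
        i, j = rng.integers(0, L, size=2)
        # Sum of the four nearest neighbours with periodic boundaries.
        nn = (spins[(i + 1) % L, j] + spins[(i - 1) % L, j]
              + spins[i, (j + 1) % L] + spins[i, (j - 1) % L])
        dE = 2.0 * spins[i, j] * nn  # energy change if spin (i, j) is flipped
        # Metropolis rule: always accept downhill moves, accept uphill with exp(-beta * dE).
        if dE <= 0 or rng.random() < np.exp(-beta * dE):
            spins[i, j] *= -1
    return spins

def total_energy(spins):
    """Total energy with each bond counted once (right and down neighbours)."""
    return -np.sum(spins * (np.roll(spins, -1, axis=0) + np.roll(spins, -1, axis=1)))

rng = np.random.default_rng(0)
spins = rng.choice([-1, 1], size=(8, 8))
for _ in range(20):
    metropolis_sweep(spins, beta=1.0 / 2.0, rng=rng)
```

Recording `total_energy(spins)` and `spins.sum()` after each sweep gives the time series from which magnetization and specific heat are estimated.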
The Outputs
Claude Opus 4.6 with extended thinking at maximum effort. Blind evaluation.
Opus 4.6 · Raw (no injection)
RAW (no injection):
BUG — No equilibration burn-in:
The solution averages over all Monte Carlo sweeps, including the thermalization phase. Near the critical temperature T_c, where relaxation is slowest, this biases the magnetization measurements because the early sweeps have not yet reached thermal equilibrium.
The simulation runs, produces numbers, and yields a magnetization curve that looks approximately right, but the critical temperature estimate is shifted because pre-equilibrium samples contaminate the ensemble average.
Additional issue: uses np.diff (a first-order forward difference) instead of centered differences when locating the critical temperature.
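The difference between the two derivative estimates can be seen on a toy curve. The sketch below uses a smooth tanh step as a hypothetical stand-in for a magnetization curve. Its midpoint of 2.27 and the temperature grid are arbitrary choices for illustration, not data from the evaluation.

```python
import numpy as np

T = np.linspace(1.5, 3.5, 21)
# Hypothetical stand-in for m(T): a smooth step whose steepest point is at T = 2.27.
m = 0.5 * (1.0 - np.tanh(4.0 * (T - 2.27)))

# np.diff gives forward differences: one slope per interval [T[k], T[k+1]],
# so the result has one fewer entry than T and each slope really belongs
# to the interval midpoint, not to T[k].
slope_fwd = np.diff(m) / np.diff(T)
T_fwd = T[np.argmin(slope_fwd)]

# np.gradient uses second-order accurate central differences at interior
# points and returns one slope per grid point.
slope_ctr = np.gradient(m, T)
T_ctr = T[np.argmin(slope_ctr)]

print(T_fwd, T_ctr)
```

Indexing `T` directly with the `np.diff` result attributes each slope to the left edge of its interval, which shifts the located steepest point by up to half a grid spacing. The centered estimate avoids that offset.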
Opus 4.6 · Dual (reasoning + code)
DUAL (reasoning + code injection):
Correct equilibration protocol:
Discards first half of sweeps as equilibration burn-in. Only post-equilibration sweeps contribute to ensemble averages.
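A minimal sketch of this protocol, applied to per-sweep energy and magnetization series, is shown below. The function name `post_burn_in_stats`, the `burn_in_frac` parameter, and the synthetic exponentially relaxing series are assumptions for illustration. The specific heat uses the standard fluctuation formula with k_B = 1.

```python
import numpy as np

def post_burn_in_stats(energies, mags, T, n_sites, burn_in_frac=0.5):
    """Ensemble averages over the sweeps that remain after discarding the burn-in."""
    energies = np.asarray(energies, dtype=float)
    mags = np.asarray(mags, dtype=float)
    start = int(burn_in_frac * len(energies))  # first sweep that is kept
    E, M = energies[start:], mags[start:]
    m_abs = np.mean(np.abs(M)) / n_sites       # <|M|> per site
    c = np.var(E) / (n_sites * T**2)           # (<E^2> - <E>^2) / (N T^2)
    return m_abs, c

# Synthetic series (illustration only): both observables relax toward their
# equilibrium values over roughly 100 sweeps.
n_sites = 64
sweeps = np.arange(1000)
relax = 1.0 - np.exp(-sweeps / 100.0)
mags = 0.9 * n_sites * relax
energies = -1.5 * n_sites * relax

m_all, c_all = post_burn_in_stats(energies, mags, T=2.0, n_sites=n_sites, burn_in_frac=0.0)
m_eq, c_eq = post_burn_in_stats(energies, mags, T=2.0, n_sites=n_sites, burn_in_frac=0.5)
print(m_all, m_eq)
print(c_all, c_eq)
```

With `burn_in_frac=0.0` the transient drags the magnetization average down and inflates the energy variance, and therefore the specific heat. Discarding the first half removes most of that bias in this toy case. A fixed fraction is only a heuristic: near T_c the relaxation time grows, so the required burn-in may be longer.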
Additional improvements:
- np.gradient (2nd-order centered) instead of np.diff for more accurate T_c detection
- 5 explicit assertions: neighbor_list validity, site_energy bounds, total_energy conservation, total_magnetization bounds, flip_probability range
- Blind evaluator margin: +9 points (highest of all 10 problems)
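The sketch below shows what assertions of this kind could look like. The helper names follow the list above, but their signatures and bodies are hypothetical, not the evaluated solution's code. They assume J = 1, k_B = 1, and a periodic L x L lattice. The total_energy check is omitted because the source does not specify its form.

```python
import numpy as np

def neighbor_list(i, j, L):
    """Four nearest neighbours of site (i, j) with periodic boundaries."""
    return [((i - 1) % L, j), ((i + 1) % L, j), (i, (j - 1) % L), (i, (j + 1) % L)]

def site_energy(spins, i, j):
    """Interaction energy of one spin with its four neighbours (assumed J = 1)."""
    L = spins.shape[0]
    return -spins[i, j] * sum(spins[a, b] for a, b in neighbor_list(i, j, L))

def flip_probability(dE, T):
    """Metropolis acceptance probability (assumed k_B = 1)."""
    return min(1.0, float(np.exp(-dE / T)))

rng = np.random.default_rng(1)
L = 6
spins = rng.choice([-1, 1], size=(L, L))

for i in range(L):
    for j in range(L):
        nbrs = neighbor_list(i, j, L)
        # Neighbour validity: four distinct sites, all inside the lattice.
        assert len(set(nbrs)) == 4
        assert all(0 <= a < L and 0 <= b < L for a, b in nbrs)
        # Site energy bounds: four bonds of magnitude 1 each.
        assert -4 <= site_energy(spins, i, j) <= 4

# Total magnetization bounds: |M| cannot exceed the number of sites.
assert abs(int(spins.sum())) <= L * L

# Flip probability range for every energy change a single flip can produce.
for dE in (-8, -4, 0, 4, 8):
    assert 0.0 <= flip_probability(dE, T=2.0) <= 1.0
```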
The evaluator noted: "Solution A computes averages over all sweeps. Solution B explicitly discards thermalization. This is a textbook requirement that Solution A's training data apparently did not enforce."
Source: bbh_production/payloads.json. Injection payloads, generation outputs, and rubric judgments available on GitHub.