SCI · Statistical Mechanics · SciCode · Correctness Flip

SCI-P06

mode: reasoning + code (dual)
SciCode · Statistical Mechanics

The Task

Ising Model: Implement a 2D Ising model simulation with Metropolis-Hastings algorithm. Calculate magnetization and specific heat near the critical temperature. 9 implementation sub-steps.
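For orientation, a single Metropolis sweep for this kind of task can be sketched as below. This is a minimal illustration, not the code from either evaluated output: the function name metropolis_sweep, the coupling J = 1, zero external field, periodic boundaries, and random site selection are all assumptions made for the example.

```python
import numpy as np

def metropolis_sweep(spins, beta, rng):
    """One Monte Carlo sweep: L*L single-spin flip attempts (hypothetical helper)."""
    L = spins.shape[0]
    for _ in range(L * L):
        i, j = rng.integers(0, L, size=2)
        # Sum of the four nearest neighbours with periodic boundary conditions.
        nn = (spins[(i + 1) % L, j] + spins[(i - 1) % L, j]
              + spins[i, (j + 1) % L] + spins[i, (j - 1) % L])
        # Energy change for flipping spin (i, j), assuming J = 1 and no field.
        dE = 2.0 * spins[i, j] * nn
        # Metropolis rule: accept if dE <= 0, otherwise with probability exp(-beta * dE).
        if dE <= 0 or rng.random() < np.exp(-beta * dE):
            spins[i, j] = -spins[i, j]
    return spins

rng = np.random.default_rng(0)
spins = rng.choice(np.array([-1, 1]), size=(8, 8))  # random (hot) start
metropolis_sweep(spins, beta=1.0 / 2.27, rng=rng)   # temperature near T_c in units of J/k_B
```

The lattice is updated in place; magnetization per site after a sweep would be spins.mean() under these assumptions.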


The Outputs

Claude Opus 4.6 with extended thinking at maximum effort. Blind evaluation.

Opus 4.6 · Raw (no injection)

RAW (no injection): BUG: no equilibration burn-in. The solution averages over all Monte Carlo sweeps, including the thermalization phase. Near the critical temperature T_c this biases the magnetization measurements, because the early sweeps have not yet reached thermal equilibrium. The simulation runs, produces numbers, and the magnetization curve looks approximately right, but the critical temperature estimate is shifted because samples from the thermalization phase contaminate the ensemble average. Additional issue: critical temperature detection uses np.diff (first-order finite differences) instead of centered differences.
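The effect of the missing burn-in can be illustrated with a synthetic trace. The sketch below does not reproduce the evaluated code: the helper post_burn_in_mean, the exponential relaxation curve, and the equilibrium value 0.2 are all made up for the example, and the 50% discard fraction is only a common default.

```python
import numpy as np

def post_burn_in_mean(samples, burn_in_fraction=0.5):
    """Average a per-sweep observable after dropping the leading burn-in sweeps."""
    samples = np.asarray(samples, dtype=float)
    n_burn = int(len(samples) * burn_in_fraction)
    return samples[n_burn:].mean()

# Synthetic per-sweep |m|: relaxes from a cold start (|m| = 1) towards 0.2.
sweeps = np.arange(2000)
m_trace = 0.2 + 0.8 * np.exp(-sweeps / 200.0)

biased = m_trace.mean()                 # includes the thermalization transient
corrected = post_burn_in_mean(m_trace)  # uses only the second half of the sweeps
```

With this trace, biased sits noticeably above the assumed equilibrium value, while corrected is close to it. In a real run the relaxation time grows near T_c, so a fixed fraction may still be too short there.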


Opus 4.6 · Dual (reasoning + code)

DUAL (reasoning + code injection): Correct equilibration protocol. The solution discards the first half of the sweeps as equilibration burn-in, so only post-equilibration sweeps contribute to the ensemble averages.

Additional improvements:
- np.gradient (2nd-order centered differences) instead of np.diff for more accurate T_c detection
- 5 explicit assertions: neighbor_list validity, site_energy bounds, total_energy conservation, total_magnetization bounds, flip_probability range
- Blind evaluator margin: +9 points (highest of all 10 problems)

The evaluator noted: "Solution A computes averages over all sweeps. Solution B explicitly discards thermalization. This is a textbook requirement that Solution A's training data apparently did not enforce."
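The difference between the two derivative estimates can be seen on a synthetic curve. This is a sketch under stated assumptions: the tanh-shaped magnetization, the grid, and the rule "T_c is where dM/dT is most negative" are illustrative, and indexing the np.diff result with T[i] is one plausible usage, not a confirmed detail of the Raw output.

```python
import numpy as np

# Synthetic magnetization curve with its steepest drop at T = 2.27 (illustrative only).
T = np.linspace(1.5, 3.0, 16)  # grid spacing 0.1
M = 0.5 * (1.0 - np.tanh((T - 2.27) / 0.15))

# np.diff: forward differences, one element shorter than T. Each value belongs
# to the midpoint between T[i] and T[i+1], so indexing T[i] shifts the estimate.
dM_forward = np.diff(M) / np.diff(T)
Tc_forward = T[np.argmin(dM_forward)]

# np.gradient: second-order centered differences at interior points, same length as T.
dM_centered = np.gradient(M, T)
Tc_centered = T[np.argmin(dM_centered)]
```

On this grid the centered estimate lands on the grid point nearer to 2.27, while the forward-difference estimate falls one grid point lower. Both remain limited by the temperature grid spacing.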


Source: bbh_production/payloads.json. Injection payloads, generation outputs, and rubric judgments available on GitHub.