SP · Spatial · Ejentum · Correctness Flip

SP-V2-24

+39.5pp
0.238 baseline → 0.633 reasoning
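The headline figure is the gap between two rubric scores, expressed in percentage points. A minimal sketch of that arithmetic, using only the three scores reported on this card; the helper name `delta_pp` is hypothetical and not part of the benchmark's tooling.

```python
def delta_pp(baseline: float, variant: float) -> float:
    """Difference between two 0-1 scores, in percentage points (1 decimal)."""
    return round((variant - baseline) * 100, 1)

# Overall scores as reported on this card.
scores = {"baseline": 0.238, "reasoning": 0.633, "reasoning-multi": 0.300}

for name in ("reasoning", "reasoning-multi"):
    print(f"{name}: {delta_pp(scores['baseline'], scores[name]):+.1f}pp")
```

By the same calculation, the reasoning-multi condition sits 6.2pp above baseline.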

The Scores

Opus 4.6 Thinking · Baseline

0.238
Correctness: 1/3
Reasoning Depth: 1/3
Self-Monitoring: 1/3
Verification: 0/3
Epistemic Honesty: 1/3
Alternatives: 0/3
Audit Trail: 1/3

Opus 4.6 Thinking · reasoning

0.633
Correctness: 3/3
Reasoning Depth: 2/3
Self-Monitoring: 2/3
Verification: 2/3
Epistemic Honesty: 1/3
Alternatives: 1/3
Audit Trail: 2/3

Opus 4.6 Thinking · reasoning-multi

0.300
Correctness: 0/3
Reasoning Depth: 2/3
Self-Monitoring: 1/3
Verification: 0/3
Epistemic Honesty: 1/3
Alternatives: 0/3
Audit Trail: 2/3

The Task

An airline must assign 180 passengers to seats on a plane with the following configuration:

First Class: 3 rows × 4 seats = 12 seats.
Business: 5 rows × 6 seats = 30 seats.
Economy: 23 rows × 6 seats = 138 seats.
Total: 180 seats = 180 passengers (full flight).

Constraints:

(1) 3 passengers require wheelchair-accessible seats (only available in rows 1, 6, and 24 — 1 per row, aisle seat).
(2) 5 unaccompanied minors must sit in rows 7-10 (crew-visible zone) AND cannot sit in window seats.
(3) A group of 8 traveling together wants to sit in the same row or adjacent rows.
(4) 2 passengers have a restraining order and must be at least 5 rows apart.

If the group of 8 sits in rows 9-10 of Economy (12 seats across 2 rows), and an unaccompanied minor needs row 9 (within rows 7-10) in a non-window seat, how many non-window seats in row 9 are available for the minor?
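The seat totals and the cabin that each constrained row falls in can be checked mechanically. The sketch below assumes rows are numbered consecutively from 1 with no skipped numbers, which the task does not state explicitly; `CABINS` and `cabin_of` are hypothetical names introduced for illustration.

```python
# Cabin layout from the task: (name, number of rows, seats per row).
CABINS = [("First", 3, 4), ("Business", 5, 6), ("Economy", 23, 6)]

def cabin_of(row: int) -> str:
    """Cabin containing a 1-indexed row, ASSUMING consecutive row numbering."""
    first_row = 1
    for name, rows, _ in CABINS:
        if first_row <= row < first_row + rows:
            return name
        first_row += rows
    raise ValueError(f"row {row} is outside the aircraft")

total_seats = sum(rows * width for _, rows, width in CABINS)
print(total_seats)  # 180

# Crew-visible zone (rows 7-10) and the wheelchair rows (1, 6, 24).
for row in (1, 6, 7, 8, 9, 10, 24):
    print(row, cabin_of(row))
```

Under that numbering assumption, rows 7 and 8 are Business and rows 9 and 10 are the first two Economy rows, which matches the task's phrase "rows 9-10 of Economy".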


The Outputs

Claude Opus 4.6 with extended thinking at maximum effort. Blind evaluation.

Opus 4.6 Thinking · Baseline

**Row 9 is in Economy, which has 6 seats per row.** In a typical 3-3 Economy layout (A-B-C-D-E-F), the non-window seats are B, C, D, E — that's **4 non-window seats** per row. The group of 8 occupies rows 9-10 (12 seats total across 2 rows = all 12 seats). So all 6 seats in row 9 are taken by the group. But the unaccompanied minor **must** sit in rows 7-10 in a non-window seat — and the constraint says the minor "needs row 9." Since the group wants to sit together but the minor's constraint is a hard safety requirement (unaccompanied minors in crew-visible zone, non-window), the minor displaces one group member. **The answer is 4.** There are 4 non-window seats in row 9 (B, C, D, E) where the minor could be placed, with the displaced group member moved to another seat in rows 9-10 (row 10 has available seats since the group of 8 only fills 8 of the 12 seats across both rows).


Opus 4.6 Thinking · reasoning

## Solution

**Economy row layout (6 seats):**
A (window) | B (middle) | C (aisle) || D (aisle) | E (middle) | F (window)

**Non-window seats in row 9:** B, C, D, E → **4 total**

**Group of 8 across rows 9–10:**
- 12 seats available (6 per row), 8 occupied → **4 seats free** across both rows
- The group wants to sit "together," so the most cohesive arrangement is to fill one row completely (6) and place 2 in the adjacent row
- Since the problem presupposes the minor *can* sit in row 9, the group must have only 2 members in row 9 (with 6 filling row 10)
- Those 2 group members in row 9, sitting together as part of the group, occupy 2 adjacent non-window seats (e.g., D and E, near the aisle connecting to their row-10 companions)

**Remaining free seats in row 9:** 6 − 2 = 4, of which:
- 2 are window (A, F)
- 2 are non-window

### Answer: **2** non-window seats in row 9 are available for the unaccompanied minor.
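The outputs on this card disagree because the task does not say how the group of 8 is split across rows 9 and 10 or which seats it takes. A small enumeration makes the range of outcomes visible. It is a sketch under the same assumption all three outputs make (a 3-3 row lettered A to F with A and F as windows), and it does not by itself settle which placement the task intends; `free_non_window_counts` is a hypothetical helper.

```python
from itertools import combinations

SEATS = "ABCDEF"           # ASSUMED 3-3 layout; A and F are the window seats
NON_WINDOW = set("BCDE")
GROUP, ROW_WIDTH = 8, 6

def free_non_window_counts(in_row_9: int) -> set[int]:
    """Every possible number of free non-window seats in row 9 when
    `in_row_9` group members sit there, over all seat choices."""
    return {
        len(NON_WINDOW - set(taken))
        for taken in combinations(SEATS, in_row_9)
    }

# Row 10 holds at most 6 people, so between 2 and 6 members sit in row 9.
for k in range(GROUP - ROW_WIDTH, ROW_WIDTH + 1):
    print(k, sorted(free_non_window_counts(k)))
```

With 2 members in row 9, the count is 2 if they take two non-window seats such as D and E (the placement in the output above, which the rubric scored 3/3 on correctness) and 4 if they take the two window seats.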


Opus 4.6 Thinking · reasoning-multi

The

## Solving the Seat Availability Problem

### Step 1: Establish the row layout

Economy uses 6-seat rows. Standard configuration:

```
A B C | D E F
(win)(mid)(aisle)(aisle)(mid)(win)
```

- **Window seats:** A and F → 2 per row
- **Non-window seats:** B, C, D, E → **4 per row**

### Step 2: Account for the group of 8

- Rows 9-10 combined = 12 seats
- Group of 8 occupies 8 seats → **4 seats remain free**
- The group spans **both** rows (the problem specifies "rows 9-10," not just row 9), so row 9 is not fully occupied

Since the group needs cohesion across both rows, the natural arrangement is to fill row 10 completely (6 members) and take 2 seats in row 9 (2 members). Those 2 row-9 seats would most likely be the window seats (A, F) — groups often cluster toward center/aisle seats for easier communication and movement, leaving the window seats at the edges.

### Step 3: Count available non-window seats in row 9

| Seat | Type | Status |
|------|------|--------|
| A | Window | Taken by group (or free but forbidden for minor) |
| B | Non-window | **Free** |
| C | Non-window | **Free** |
| D | Non-window | **Free** |
| E | Non-window | **Free** |
| F | Window | Taken by group (or free but forbidden for minor) |

### Answer

**4 non-window seats in row 9 are available for the unaccompanied minor** (B, C, D, E).

The key logic: the group of 8 spans rows 9–10 without filling row 9 entirely, and the minor's window-seat restriction eliminates only seats A and F, which are already occupied or irrelevant. All 4 center seats in row 9 remain eligible.


Source: ejbench_production/payloads.json. Injection payloads, generation outputs, and rubric judgments available on GitHub.