CA Causal · Ejentum
CA-V2-01
+51.4pp: 0.286 baseline → 0.800 reasoning
The Scores
Opus 4.6 Thinking · Baseline
0.286
Correctness: 3/3
Reasoning Depth: 1/3
Self-Monitoring: 0/3
Verification: 0/3
Epistemic Honesty: 1/3
Alternatives: 0/3
Audit Trail: 1/3
Opus 4.6 Thinking · reasoning
0.800
Correctness: 3/3
Reasoning Depth: 3/3
Self-Monitoring: 2/3
Verification: 2/3
Epistemic Honesty: 2/3
Alternatives: 1/3
Audit Trail: 3/3
Opus 4.6 Thinking · reasoning-multi
0.733
Correctness: 3/3
Reasoning Depth: 2/3
Self-Monitoring: 1/3
Verification: 2/3
Epistemic Honesty: 2/3
Alternatives: 1/3
Audit Trail: 3/3
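The headline gain is the difference between the overall scores listed above, expressed in percentage points. A minimal Python sketch that recomputes it for both prompted variants; the `scores` dictionary and the `gain_pp` helper are illustrative names, not part of the benchmark's own code:

```python
# Overall scores as reported in the panels above.
scores = {"baseline": 0.286, "reasoning": 0.800, "reasoning-multi": 0.733}


def gain_pp(variant: str, reference: str = "baseline") -> float:
    """Score difference in percentage points, rounded to one decimal."""
    return round((scores[variant] - scores[reference]) * 100, 1)


for name in ("reasoning", "reasoning-multi"):
    print(f"{name}: {gain_pp(name):+.1f}pp")
# reasoning: +51.4pp
# reasoning-multi: +44.7pp
```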
The Task
Variable A causes B with coefficient +2. B causes C with coefficient +3. C inhibits A with coefficient -1. D amplifies B but ONLY when A exceeds threshold T=15. At each cycle: B_new = 2*A + (D_boost if A>15), C_new = 3*B, A_new = A - 1*C (floor at 0). D_boost = B*0.5. Starting values: A=10, B=0, C=0. Cycle 1: A=10 (not >15, no D boost), B_new=2*10=20, C_new=3*0=0, A_new=10-0=10. Cycle 2: A=10, B=20, C_new=3*20=60, A_new=10-60=0 (floored). Cycle 3: A=0, B_new=2*0=0, C_new=3*20=60, A_new=0-60=0. After 3 complete cycles, what is the value of B?
The Outputs
Claude Opus 4.6 with extended thinking at maximum effort. Blind evaluation.
Opus 4.6 Thinking · Baseline
0
Cycle 3: B_new = 2*A = 2*0 = **0**. (A was floored to 0 in cycle 2, so B collapses.)
Opus 4.6 Thinking · reasoning
Tracing through the system carefully, noting that B and C are computed from **old** values, while A uses the **newly computed** C:
**Cycle 1** — Start: A=10, B=0, C=0
- B_new = 2×A = 20
- C_new = 3×B = 0
- A_new = A − C_new = 10 − 0 = 10
- → A=10, B=20, C=0
**Cycle 2** — Start: A=10, B=20, C=0
- B_new = 2×A = 20
- C_new = 3×B = 60
- A_new = A − C_new = 10 − 60 = −50 → **floored to 0**
- → A=0, B=20, C=60
**Cycle 3** — Start: A=0, B=20, C=60
- B_new = 2×A = 2×0 = **0**
- C_new = 3×B = 60
- A_new = A − C_new = 0 − 60 → floored to 0
- → A=0, B=0, C=60
**After 3 complete cycles, B = 0.**
The balancing loop (A→B→C⊣A) dominates: C's inhibition of A drives A to zero by cycle 2, which starves B of input by cycle 3. D's amplification never activates because A never exceeds the threshold T=15. The system collapses to a fixed point at (A=0, B=0, C=60).
Opus 4.6 Thinking · reasoning-multi
Tracing through the system step by step:
**Rules:** B_new = 2·A, C_new = 3·B_old, A_new = max(0, A − C_new). D_boost only if A > 15.
| Cycle | A_start | B_start | B_new = 2·A | C_new = 3·B_old | A_new = A − C_new (floor 0) |
|-------|---------|---------|-------------|-----------------|------------------------------|
| 1 | 10 | 0 | 20 | 0 | 10 |
| 2 | 10 | 20 | 20 | 60 | 0 |
| 3 | 0 | 20 | **0** | 60 | 0 |
- **Cycle 3:** A=0 (no D_boost), so B_new = 2·0 = **0**. C_new = 3·20 = 60. A_new = 0 − 60 → floored to 0.
**After 3 complete cycles, B = 0.**
The balancing loop (A→B→C⊣A) kills itself: B drives C up, C crushes A to zero, and with A=0 the next cycle zeroes out B. The D_boost amplifier never activates because A never exceeds 15.
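The traces above can be checked mechanically. The following Python sketch simulates the task's update rules under the ordering its worked cycles imply: B and C are computed from start-of-cycle values, and A is reduced by the newly computed C. The names `State`, `step`, and `run` are hypothetical, and reading `D_boost = B*0.5` as using the start-of-cycle B is an assumption (the boost never fires in this run, so the choice does not affect the result).

```python
from typing import List, NamedTuple


class State(NamedTuple):
    a: float
    b: float
    c: float


THRESHOLD = 15  # T in the task statement


def step(s: State) -> State:
    """Advance one cycle using the ordering from the task's worked example."""
    # Assumption: D_boost uses B from the start of the cycle.
    d_boost = 0.5 * s.b if s.a > THRESHOLD else 0.0
    b_new = 2 * s.a + d_boost       # B_new = 2*A (+ D_boost if A > 15)
    c_new = 3 * s.b                 # C_new = 3*B, with the old B
    a_new = max(0.0, s.a - c_new)   # A_new = A - C_new, floored at 0
    return State(a_new, b_new, c_new)


def run(start: State, cycles: int) -> List[State]:
    """Return the start state followed by the state after each cycle."""
    history = [start]
    for _ in range(cycles):
        history.append(step(history[-1]))
    return history


history = run(State(10, 0, 0), cycles=4)
for i, s in enumerate(history):
    print(f"after cycle {i}: A={s.a:g}, B={s.b:g}, C={s.c:g}")
```

Under these assumptions the state after three cycles is A=0, B=0, C=60, which agrees with the answer B = 0 in all three outputs. Running one more cycle gives C = 3·0 = 0, so in this sketch the system only becomes stationary at (A=0, B=0, C=0).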
Source: ejbench_production/payloads.json. Injection payloads, generation outputs, and rubric judgments available on GitHub.