CA CausalEjentum

CA-V2-01

+51.4pp

0.286 baseline → 0.800 reasoning

The Scores

Opus 4.6 Thinking · Baseline

0.286

Correctness: 3/3
Reasoning Depth: 1/3
Self-Monitoring: 0/3
Verification: 0/3
Epistemic Honesty: 1/3
Alternatives: 0/3
Audit Trail: 1/3

Opus 4.6 Thinking · reasoning

0.800

Correctness: 3/3
Reasoning Depth: 3/3
Self-Monitoring: 2/3
Verification: 2/3
Epistemic Honesty: 2/3
Alternatives: 1/3
Audit Trail: 3/3

Opus 4.6 Thinking · reasoning-multi

0.733

Correctness: 3/3
Reasoning Depth: 2/3
Self-Monitoring: 1/3
Verification: 2/3
Epistemic Honesty: 2/3
Alternatives: 1/3
Audit Trail: 3/3
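The page does not say how the seven rubric dimensions are combined into the headline score. As a rough check, the sketch below assumes an unweighted mean (total points over the 21-point maximum). That aggregation, and the names `RUBRIC_POINTS` and `unweighted_score`, are assumptions for illustration, not the benchmark's documented method. Under this assumption the baseline figure is reproduced (6/21 ≈ 0.286), while the two other runs come out lower than the published 0.800 and 0.733, which suggests the real aggregation is weighted or otherwise different.

```python
# Rubric points per configuration, in the order listed on the page:
# Correctness, Reasoning Depth, Self-Monitoring, Verification,
# Epistemic Honesty, Alternatives, Audit Trail (each out of 3).
RUBRIC_POINTS = {
    "Baseline": [3, 1, 0, 0, 1, 0, 1],
    "reasoning": [3, 3, 2, 2, 2, 1, 3],
    "reasoning-multi": [3, 2, 1, 2, 2, 1, 3],
}


def unweighted_score(points, max_per_dimension=3):
    """Assumed aggregation: points earned divided by points available."""
    return sum(points) / (max_per_dimension * len(points))


for name, points in RUBRIC_POINTS.items():
    print(f"{name}: {unweighted_score(points):.3f}")

# The headline delta is a plain difference of the published scores,
# expressed in percentage points.
delta_pp = round((0.800 - 0.286) * 100, 1)
print(f"delta: +{delta_pp}pp")
```

Only the published scores (0.286, 0.800, 0.733) and the per-dimension points come from the page; everything else here is a reading aid.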

The Task

Variable A causes B with coefficient +2. B causes C with coefficient +3. C inhibits A with coefficient -1. D amplifies B but ONLY when A exceeds threshold T=15.

At each cycle: B_new = 2*A + (D_boost if A>15), C_new = 3*B, A_new = A - 1*C (floor at 0). D_boost = B*0.5.

Starting values: A=10, B=0, C=0.

Cycle 1: A=10 (not >15, no D boost), B_new=2*10=20, C_new=3*0=0, A_new=10-0=10.
Cycle 2: A=10, B=20, C_new=3*20=60, A_new=10-60=0 (floored).
Cycle 3: A=0, B_new=2*0=0, C_new=3*20=60, A_new=0-60=0.

After 3 complete cycles, what is the value of B?
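The update rules in the task can be checked mechanically. The following is a minimal Python sketch, assuming the ordering implied by the task's own worked cycles: B and C are computed from the values at the start of the cycle, A is then reduced by the newly computed C, and D_boost uses the start-of-cycle B. The function names `step` and `simulate` are illustrative and do not come from the benchmark.

```python
THRESHOLD = 15  # T in the task statement


def step(a, b, c):
    """Advance the system by one cycle and return the new (A, B, C)."""
    # D amplifies B only while A exceeds the threshold (assumed: uses old B).
    d_boost = 0.5 * b if a > THRESHOLD else 0.0
    b_new = 2 * a + d_boost       # A causes B with coefficient +2
    c_new = 3 * b                 # B causes C with coefficient +3 (old B)
    a_new = max(0, a - c_new)     # C inhibits A, floored at 0 (new C)
    return a_new, b_new, c_new


def simulate(a, b, c, cycles):
    """Return the list of (A, B, C) states after each cycle."""
    history = []
    for _ in range(cycles):
        a, b, c = step(a, b, c)
        history.append((a, b, c))
    return history


history = simulate(10, 0, 0, cycles=3)
for i, (a, b, c) in enumerate(history, start=1):
    print(f"Cycle {i}: A={a:g}, B={b:g}, C={c:g}")
```

With the starting values A=10, B=0, C=0 this yields B=0 after the third cycle, matching the answer all three configurations reach.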


The Outputs

Claude Opus 4.6 with extended thinking at maximum effort. Blind evaluation.

Opus 4.6 Thinking · Baseline

0

Cycle 3: B_new = 2*A = 2*0 = **0**. (A was floored to 0 in cycle 2, so B collapses.)


Opus 4.6 Thinking · reasoning

Tracing through the system carefully, noting that B and C are computed from **old** values, while A uses the **newly computed** C:

**Cycle 1** — Start: A=10, B=0, C=0
- B_new = 2×A = 20
- C_new = 3×B = 0
- A_new = A − C_new = 10 − 0 = 10
- → A=10, B=20, C=0

**Cycle 2** — Start: A=10, B=20, C=0
- B_new = 2×A = 20
- C_new = 3×B = 60
- A_new = A − C_new = 10 − 60 = −50 → **floored to 0**
- → A=0, B=20, C=60

**Cycle 3** — Start: A=0, B=20, C=60
- B_new = 2×A = 2×0 = **0**
- C_new = 3×B = 60
- A_new = A − C_new = 0 − 60 → floored to 0
- → A=0, B=0, C=60

**After 3 complete cycles, B = 0.**

The balancing loop (A→B→C⊣A) dominates: C's inhibition of A drives A to zero by cycle 2, which starves B of input by cycle 3. D's amplification never activates because A never exceeds the threshold T=15. The system collapses to a fixed point at (A=0, B=0, C=60).


Opus 4.6 Thinking · reasoning-multi

Tracing through the system step by step:

**Rules:** B_new = 2·A, C_new = 3·B_old, A_new = max(0, A − C_new). D_boost only if A > 15.

| Cycle | A_start | B_start | B_new = 2·A | C_new = 3·B_old | A_new = A − C_new (floor 0) |
|-------|---------|---------|-------------|-----------------|------------------------------|
| 1 | 10 | 0 | 20 | 0 | 10 |
| 2 | 10 | 20 | 20 | 60 | 0 |
| 3 | 0 | 20 | **0** | 60 | 0 |

- **Cycle 3:** A=0 (no D_boost), so B_new = 2·0 = **0**. C_new = 3·20 = 60. A_new = 0 − 60 → floored to 0.

**After 3 complete cycles, B = 0.**

The balancing loop (A→B→C⊣A) kills itself: B drives C up, C crushes A to zero, and with A=0 the next cycle zeroes out B. The D_boost amplifier never activates because A never exceeds 15.


Source: ejbench_production/payloads.json. Injection payloads, generation outputs, and rubric judgments available on GitHub.
