SI Simulation · Ejentum

SI-V2-26

+43.8pp

0.429 baseline → 0.867 reasoning-multi

The Scores

Opus 4.6 Thinking · Baseline

0.429

Correctness: 3/3

Reasoning Depth: 2/3

Self-Monitoring: 0/3

Verification: 0/3

Epistemic Honesty: 1/3

Alternatives: 0/3

Audit Trail: 3/3

Opus 4.6 Thinking · reasoning

0.767

Correctness: 3/3

Reasoning Depth: 3/3

Self-Monitoring: 3/3

Verification: 3/3

Epistemic Honesty: 2/3

Alternatives: 1/3

Audit Trail: 3/3

Opus 4.6 Thinking · reasoning-multi

0.867

Correctness: 3/3

Reasoning Depth: 3/3

Self-Monitoring: 3/3

Verification: 3/3

Epistemic Honesty: 2/3

Alternatives: 1/3

Audit Trail: 3/3

The Task

A distributed system has 4 microservices in a call chain: Service A calls Service B, which calls Service C, which calls Service D. Each service has a 5-second timeout — if it doesn't receive a response in 5 seconds, it retries. Each service retries 3 times before returning an error to its caller (so each service makes up to 4 total attempts: 1 original + 3 retries). Service D experiences an outage and stops responding entirely. Service C calls D: first attempt times out after 5 seconds, then 3 retries at 5 seconds each — C waits a total of 20 seconds before returning an error to B. Service B calls C: each attempt to C takes 20 seconds (because C is waiting for D). B makes 4 attempts, waiting 80 seconds total. Service A calls B: each attempt takes 80 seconds. A makes 4 attempts. Does the total time for Service A to finally receive an error exceed 300 seconds?
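The arithmetic the task describes can be checked with a short sketch. The function and parameter names here are illustrative assumptions, not part of the benchmark; the model assumes sequential retries with no backoff and no delay other than the timeouts themselves.

```python
def time_to_error(layers: int, attempts: int, leaf_timeout_s: float) -> float:
    """Worst-case seconds before the top-level caller receives an error.

    layers: number of calling services above the dead one (A, B, C -> 3).
    attempts: total attempts per service (1 original + 3 retries -> 4).
    leaf_timeout_s: timeout of a single call to the dead service.
    """
    wait = leaf_timeout_s
    for _ in range(layers):
        # Each layer waits for the full retry cycle of the layer below,
        # once per attempt.
        wait *= attempts
    return wait


def calls_reaching_leaf(layers: int, attempts: int) -> int:
    """Number of individual calls that end up hitting the dead service."""
    return attempts ** layers


total = time_to_error(layers=3, attempts=4, leaf_timeout_s=5.0)
print(total, total > 300)          # 320.0 True
print(calls_reaching_leaf(3, 4))   # 64
```

Under these assumptions the wait grows geometrically with chain depth: each extra calling layer multiplies both the wait and the load on the failed service by the attempt count.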

The Outputs

Claude Opus 4.6 with extended thinking at maximum effort. Blind evaluation.

Opus 4.6 Thinking · Baseline

**Yes, it exceeds 300 seconds.** The total time is **320 seconds**.

- **Service C → D:** 4 attempts × 5s = **20s**
- **Service B → C:** 4 attempts × 20s = **80s**
- **Service A → B:** 4 attempts × 80s = **320s**

320 > 300. This is the classic **retry amplification** problem — retries at each layer multiply, causing exponential blowup of total wait time. The fix is typically to use a **global timeout budget** that gets passed down the call chain, or to implement circuit breakers.

Opus 4.6 Thinking · reasoning

Let me work through this step by step, verifying each layer independently.

**Layer 1: Service C → Service D**

- D is down. Each call to D times out after 5 seconds.
- C makes 4 attempts (1 original + 3 retries): 4 × 5s = **20 seconds**
- *Verify:* 4 attempts × 5s timeout = 20s. ✓

**Layer 2: Service B → Service C**

- Each call to C takes 20s (because C exhausts its retries to D before returning an error).
- B makes 4 attempts: 4 × 20s = **80 seconds**
- *Verify:* Each B attempt waits for C's full 20s cycle. 4 × 20 = 80s. ✓

**Layer 3: Service A → Service B**

- Each call to B takes 80s (because B exhausts its retries to C).
- A makes 4 attempts: 4 × 80s = **320 seconds**
- *Verify:* Each A attempt waits for B's full 80s cycle. 4 × 80 = 320s. ✓

**Answer: Yes.** The total time for Service A is **320 seconds**, which exceeds 300 seconds.

This is the classic **retry amplification** problem in distributed systems — retries cascade multiplicatively through the call chain (4 × 4 × 4 = 64 total calls to Service D), and latency compounds as 5s × 4 × 4 × 4 = 320s. This is why patterns like **circuit breakers**, **retry budgets**, and **deadline propagation** are critical in microservice architectures.

Opus 4.6 Thinking · reasoning-multi

Now solving, with explicit verification at each node before building forward:

## Cascade Timeout Analysis

### Node D (outage)

Every call to D times out after 5s. No response ever returns.

### Node C → D

| Attempt | Wait |
|---------|------|
| 1 (original) | 5s |
| 2 (retry 1) | 5s |
| 3 (retry 2) | 5s |
| 4 (retry 3) | 5s |
| **Total** | **20s** |

✓ Verified: C returns an error after exactly 20s.

### Node B → C

Each of B's calls to C costs 20s (C runs its full retry cycle before responding with an error):

| Attempt | Wait |
|---------|------|
| 1 | 20s |
| 2 | 20s |
| 3 | 20s |
| 4 | 20s |
| **Total** | **80s** |

✓ Verified: B returns an error after exactly 80s. No intermediate result assumed correct without checking.

### Node A → B

Each of A's calls to B costs 80s (B runs its full retry cycle before responding with an error):

| Attempt | Wait |
|---------|------|
| 1 | 80s |
| 2 | 80s |
| 3 | 80s |
| 4 | 80s |
| **Total** | **320s** |

✓ Verified: A receives an error after exactly 320s.

## Answer

**Yes — 320 seconds, which exceeds 300 seconds.**

The timeout cascade compounds multiplicatively: `4 attempts × 4 attempts × 4 attempts × 5s = 320s`. This is the classic "retry storm" failure mode in distributed systems — each layer of retries multiplies the downstream wait time, so a 5-second leaf timeout becomes a **5.3-minute outage** at the top of the chain. The fix is to either reduce retry counts, use exponential backoff (without retrying on timeouts that indicate a full outage), or propagate deadline headers so upstream services don't retry once the budget is exhausted.
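The deadline-propagation fix named in the outputs can be illustrated with a small simulation. This is only a sketch under stated assumptions: `call_with_budget` and its parameters are hypothetical names, the 30-second budget is an arbitrary example value, and the model ignores network latency and backoff. Each caller passes its remaining budget downstream and stops retrying once that budget is spent.

```python
def call_with_budget(
    depth: int,
    budget_s: float,
    attempts: int = 4,
    leaf_timeout_s: float = 5.0,
) -> float:
    """Seconds a caller spends before giving up on a dead downstream chain.

    depth: hops to the dead service (1 = this caller talks to it directly).
    budget_s: remaining deadline handed down by the upstream caller.
    """
    spent = 0.0
    for _ in range(attempts):
        remaining = budget_s - spent
        if remaining <= 0:
            break  # budget exhausted: do not retry
        if depth == 1:
            # Direct call to the dead service: wait for the timeout,
            # but never longer than the remaining budget.
            spent += min(leaf_timeout_s, remaining)
        else:
            # Propagate the remaining budget to the next service down.
            spent += call_with_budget(
                depth - 1, remaining, attempts, leaf_timeout_s
            )
    return spent


# Without a deadline (infinite budget) the task's cascade reappears.
print(call_with_budget(depth=3, budget_s=float("inf")))  # 320.0
# With a hypothetical 30-second end-to-end budget, A fails after 30s.
print(call_with_budget(depth=3, budget_s=30.0))          # 30.0
```

In this toy model the top-level wait is capped by the budget regardless of chain depth, which is the property the outputs attribute to deadline propagation.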

Source: ejbench_production/payloads.json. Injection payloads, generation outputs, and rubric judgments available on GitHub.
