TE Temporal · Ejentum · Correctness Flip
TE-V2-25
+47.6pp · 0.191 baseline → 0.667 Haki
The Scores
Opus 4.6 Thinking · Baseline
0.191
Correctness: 1/3
Reasoning Depth: 1/3
Self-Monitoring: 0/3
Verification: 0/3
Epistemic Honesty: 0/3
Alternatives: 0/3
Audit Trail: 2/3
Opus 4.6 Thinking · Ki
0.500
Correctness: 1/3
Reasoning Depth: 2/3
Self-Monitoring: 1/3
Verification: 2/3
Epistemic Honesty: 1/3
Alternatives: 0/3
Audit Trail: 3/3
Opus 4.6 Thinking · Haki
0.667
Correctness: 3/3
Reasoning Depth: 2/3
Self-Monitoring: 2/3
Verification: 2/3
Epistemic Honesty: 2/3
Alternatives: 1/3
Audit Trail: 3/3
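The headline delta and the rubric rows above can be cross-checked with a few lines of Python. This is only a sketch: the dictionary and function names are illustrative, and the unweighted fraction is an assumption, since the page does not say how the seven dimensions are combined. The unweighted fractions come out at roughly 0.190, 0.476, and 0.714, which do not exactly match the published 0.191, 0.500, and 0.667, so some weighting presumably applies.

```python
# Rubric points per output, in the order listed on the page:
# Correctness, Reasoning Depth, Self-Monitoring, Verification,
# Epistemic Honesty, Alternatives, Audit Trail (each out of 3).
RUBRIC = {
    "Baseline": [1, 1, 0, 0, 0, 0, 2],
    "Ki": [1, 2, 1, 2, 1, 0, 3],
    "Haki": [3, 2, 2, 2, 2, 1, 3],
}

# Headline scores exactly as published.
HEADLINE = {"Baseline": 0.191, "Ki": 0.500, "Haki": 0.667}


def raw_fraction(points, max_per_dimension=3):
    """Unweighted share of available rubric points (an assumption, not the page's formula)."""
    return sum(points) / (len(points) * max_per_dimension)


# Percentage-point gap between the Haki and Baseline headline scores.
delta_pp = round((HEADLINE["Haki"] - HEADLINE["Baseline"]) * 100, 1)

if __name__ == "__main__":
    print(f"Headline delta: +{delta_pp}pp")
    for name, points in RUBRIC.items():
        print(f"{name}: {sum(points)}/21 = {raw_fraction(points):.3f}")
```

The headline delta reproduces the published +47.6pp; the per-output fractions are shown only for comparison.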
The Task
A software project has 5 sequential phases: Design (D), Implementation (I), Testing (T), Deployment (Dep), and Monitoring (M). Estimated durations: D=2 weeks, I=6 weeks, T=3 weeks, Dep=1 week, M=2 weeks. Historical data shows: 80% of projects exceed the Implementation estimate by 50-100%. Testing always takes 1 week longer than the overage in Implementation (if Implementation goes 3 weeks over, Testing goes 1 week over). Deployment is fixed at 1 week. Monitoring starts 1 week BEFORE Deployment completes (overlap). Using the PESSIMISTIC estimates (100% Implementation overage), what is the total project duration?
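The Testing rule is ambiguous, and the three outputs recorded on this page each read it differently. The sketch below computes the total duration for a given Testing overage, so the readings can be compared side by side. The function and variable names are illustrative, and the code does not decide which reading is intended; per the scores above, the rubric gave full Correctness marks to the Haki output.

```python
# Base estimates from the task statement, in weeks.
BASE = {"D": 2, "I": 6, "T": 3, "Dep": 1, "M": 2}

# Pessimistic case: Implementation overruns by 100% of its estimate.
IMPL_OVERAGE = BASE["I"] * 100 // 100  # 6 weeks


def total_duration(testing_overage, overlap=1):
    """Weeks from project start until the last phase ends.

    Monitoring starts `overlap` weeks before Deployment completes.
    """
    impl = BASE["I"] + IMPL_OVERAGE
    testing = BASE["T"] + testing_overage
    dep_end = BASE["D"] + impl + testing + BASE["Dep"]
    monitoring_end = dep_end - overlap + BASE["M"]
    return max(dep_end, monitoring_end)


# Testing overage under each reading that appears in the outputs.
READINGS = {
    "Baseline (overage minus 1)": IMPL_OVERAGE - 1,   # 5 weeks over
    "Ki (flat 1 week over)": 1,                       # 1 week over
    "Haki (3:1 linear scaling)": IMPL_OVERAGE // 3,   # 2 weeks over
}

if __name__ == "__main__":
    for label, overage in READINGS.items():
        print(f"{label}: {total_duration(overage)} weeks")
```

With these inputs the three readings give 24, 20, and 21 weeks respectively, matching the totals reported in the outputs.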
The Outputs
Claude Opus 4.6 with extended thinking at maximum effort. Blind evaluation.
Opus 4.6 Thinking · Baseline
**Pessimistic scenario calculation:**
- **D** = 2 weeks
- **I** = 6 + 100% overage = 6 + 6 = 12 weeks (overage = 6 weeks)
- **T** = 3 + (overage - 1) = 3 + (6 - 1) = 3 + 5 = 8 weeks
- **Dep** = 1 week (fixed)
- **M** = 2 weeks, but starts 1 week before Deployment completes (1 week overlap with Dep)
**Timeline:**
Sequential: D → I → T → Dep → M, with M overlapping Dep by 1 week.
Total = 2 + 12 + 8 + 1 + 2 - 1 (overlap) = **24 weeks**
Opus 4.6 Thinking · Ki
## Solution
**Base estimates:** D=2, I=6, T=3, Dep=1, M=2
### Pessimistic adjustments (100% Implementation overage)
| Phase | Base | Adjustment | Adjusted |
|-------|------|-----------|----------|
| **Design** | 2 wk | none | **2 wk** |
| **Implementation** | 6 wk | +100% overage = +6 wk | **12 wk** |
| **Testing** | 3 wk | +1 wk (the rule: any I overage → T goes 1 wk over) | **4 wk** |
| **Deployment** | 1 wk | fixed | **1 wk** |
| **Monitoring** | 2 wk | starts 1 wk before Dep completes | **2 wk** |
### Testing overage interpretation
The problem states: *"Testing always takes 1 week longer than the overage in Implementation (if Implementation goes 3 weeks over, Testing goes 1 week over)."* The parenthetical example is definitive — regardless of how large the I overage is, Testing goes **1 week over** its estimate.
### Timeline with overlap
- **D:** weeks 1–2
- **I:** weeks 3–14 (12 weeks)
- **T:** weeks 15–18 (4 weeks)
- **Dep:** week 19 (1 week)
- **M:** weeks 19–20 — Monitoring starts 1 week before Deployment completes. Since Dep is exactly 1 week, M starts at the **same time** as Dep, creating a 1-week overlap.
Sequential sum: 2 + 12 + 4 + 1 + 2 = 21 weeks
Overlap: −1 week (M and Dep run concurrently for 1 week)
### **Total project duration: 20 weeks**
Opus 4.6 Thinking · Haki
## Pessimistic Project Duration Calculation
### Phase-by-Phase Breakdown
**Design:** 2 weeks (no overage stated)
**Implementation (100% overage):**
- Estimate: 6 weeks
- 100% overage = +6 weeks
- Actual: **12 weeks**
**Testing (cascading overage from Implementation):**
- The rule: "if Implementation goes 3 weeks over, Testing goes 1 week over"
- This establishes a 3:1 ratio (linear scaling)
- Implementation went 6 weeks over → Testing goes **+2 weeks** over
- Actual: 3 + 2 = **5 weeks**
**Deployment:** Fixed at **1 week**
**Monitoring (overlap):**
- Duration: 2 weeks
- Starts 1 week **before** Deployment completes
- Since Deployment = 1 week total, Monitoring starts at the **same time Deployment starts**
- Monitoring ends 2 weeks after Deployment starts = **1 week after Deployment ends**
### Timeline
| Phase | Start | End |
|-------|-------|-----|
| Design | Week 0 | Week 2 |
| Implementation | Week 2 | Week 14 |
| Testing | Week 14 | Week 19 |
| Deployment | Week 19 | Week 20 |
| Monitoring | Week 19 | **Week 21** |
### Total Project Duration: **21 weeks**
Key drivers:
- Implementation absorbs the biggest hit: 6 → 12 weeks (+6)
- Testing cascades: 3 → 5 weeks (+2)
- Monitoring's 1-week overlap with Deployment saves 1 week vs. fully sequential (which would be 22 weeks)
Source: ejbench_production/payloads.json. Injection payloads, generation outputs, and rubric judgments are available on GitHub.