PT Memory Operations · Memory State Benchmark
MEM-STATE-01
mode: memory
The Task
A 20-turn strategy conversation with Kira (Engineering Director at Vantage). Across those turns, the Q3 plan silently shifts from "streaming migration with 5 hires" to "optimize batch pipeline first, streaming Q4," and the claim that Rust is the competitive advantage gets walked back. The agent must detect these implicit changes without being told they happened.
After Turn 20, the agent is asked: "List every fact from turns 1-4 that has implicitly changed."
The Outputs
Claude Opus 4.6 with extended thinking at maximum effort. Blind evaluation.
Baseline
BASELINE scratchpad (Turn 20):
"The core of the platform is written in Rust, which is their competitive advantage."
When asked about Q3 plan:
"Rust is still their competitive advantage. The core was written in Rust which provides their competitive advantage."
Implicit changes detected: 1.6 average
Stale facts served as current: 1.6 per session
The baseline logged facts but never updated them: it kept the present tense for beliefs that had since been superseded.
With memory injection
AUGMENTED scratchpad (Turn 20):
"The core of their platform is written in Rust, which was initially considered their competitive advantage."
When asked to list implicit changes:
"1. Rust was initially considered their competitive advantage — Kira later clarified that the competitive advantage is the real-time analytics capability, not specifically Rust. Go with optimizations can achieve similar results.
2. Q3 roadmap to transition from batch to streaming — Kira is now focusing on optimizing the existing batch pipeline for Q3, with streaming evaluation pushed to Q4.
3. Hiring 5 engineers for streaming — Kira filled one position with a Go engineer and is rethinking remaining role descriptions."
Implicit changes detected: 1.8 average
Stale facts served as current: 0.8 per session (50% reduction)
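The comparison with the baseline figures can be checked with a few lines of arithmetic. This is only a sketch over the four averages quoted in this write-up; the dictionary keys and the `relative_change` helper are illustrative names, not part of the benchmark's code.

```python
# Averages per session, as quoted in this write-up.
baseline = {"changes_detected": 1.6, "stale_facts": 1.6}
augmented = {"changes_detected": 1.8, "stale_facts": 0.8}


def relative_change(before: float, after: float) -> float:
    """Signed relative change from `before` to `after`, as a fraction."""
    return (after - before) / before


stale_delta = relative_change(baseline["stale_facts"], augmented["stale_facts"])
detect_delta = relative_change(
    baseline["changes_detected"], augmented["changes_detected"]
)

print(f"stale facts served as current: {stale_delta:+.1%}")   # -50.0%
print(f"implicit changes detected:     {detect_delta:+.1%}")  # +12.5%
```

On these numbers the larger effect is on stale facts (halved), while detection of implicit changes moves only from 1.6 to 1.8.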
The injection's RESOLVE phase forced tense updates: "was initially considered" instead of "is."
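The write-up does not show how the RESOLVE phase is implemented, so the following is only a hypothetical sketch of the behavior it describes: a scratchpad entry that, once superseded, is rendered in the past tense instead of being served as current. The `Fact` class, the `superseded_by` field, and the `resolve` function are all assumed names for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Fact:
    """One scratchpad entry (hypothetical structure, not the benchmark's)."""
    subject: str                         # e.g. "Rust"
    claim: str                           # e.g. "their competitive advantage"
    superseded_by: Optional[str] = None  # later statement that replaced it

    def render(self) -> str:
        # A current belief is stated in the present tense.
        if self.superseded_by is None:
            return f"{self.subject} is {self.claim}."
        # A superseded belief is shifted to the past tense, with the update.
        return (
            f"{self.subject} was initially considered {self.claim}; "
            f"later: {self.superseded_by}."
        )


def resolve(facts: Dict[str, Fact], key: str, newer_statement: str) -> None:
    """Mark the fact stored under `key` as superseded by a later statement."""
    facts[key].superseded_by = newer_statement


scratchpad = {"rust": Fact("Rust", "their competitive advantage")}
before = scratchpad["rust"].render()
resolve(scratchpad, "rust", "the advantage is the real-time analytics capability")
after = scratchpad["rust"].render()

print(before)  # Rust is their competitive advantage.
print(after)
```

A store that only appends, as the baseline scratchpad appears to, would keep returning the `before` string at Turn 20. Keeping the original claim alongside its replacement is also what makes it possible to list what changed, not just what is true now.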
Source: bbh_production/payloads.json. Injection payloads, generation outputs, and rubric judgments are available on GitHub.