
MEM-STATE-01

mode: memory
Memory State Benchmark · Memory Operations

The Task

A 20-turn strategy conversation with Kira (Engineering Director at Vantage). Over the course of the conversation, the Q3 plan silently shifts from "streaming migration with 5 hires" to "optimize batch pipeline first, streaming Q4." The claim that Rust is the competitive advantage gets walked back. The agent must detect these implicit changes without being told they happened. After Turn 20, the agent is asked: "List every fact from turns 1-4 that has implicitly changed."


The Outputs

Claude Opus 4.6 with extended thinking at maximum effort. Blind evaluation.

Baseline

BASELINE scratchpad (Turn 20): "The core of the platform is written in Rust, which is their competitive advantage."

When asked about Q3 plan: "Rust is still their competitive advantage. The core was written in Rust which provides their competitive advantage."

Implicit changes detected: 1.6 average
Stale facts served as current: 1.6 per session

The baseline logged facts but never updated them. Present tense retained for beliefs that had been superseded.
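The baseline's failure mode (facts logged but never updated) is easier to see with a small data structure. The following is a minimal sketch, not the benchmark's actual scratchpad or memory format: `Fact` and `FactLedger` are hypothetical names, and the only wording taken from the outputs is the Rust example. A store that only logs always renders the present tense; one that records supersession can switch tense and say what replaced the old belief.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Fact:
    """One remembered claim (hypothetical structure, for illustration only)."""
    subject: str
    present: str                         # wording while the claim is current
    past: str                            # wording once the claim is superseded
    superseded_by: Optional[str] = None  # replacement claim, if any


@dataclass
class FactLedger:
    facts: Dict[str, Fact] = field(default_factory=dict)

    def log(self, key: str, fact: Fact) -> None:
        # Logging alone is what the baseline did: the fact is stored as current.
        self.facts[key] = fact

    def supersede(self, key: str, replacement: str) -> None:
        # Mark an earlier claim as no longer current and keep what replaced it.
        self.facts[key].superseded_by = replacement

    def render(self, key: str) -> str:
        f = self.facts[key]
        if f.superseded_by is None:
            return f"{f.subject} {f.present}."
        return f"{f.subject} {f.past}. Now: {f.superseded_by}."

    def changed(self) -> List[str]:
        # Keys of every fact that has been superseded since it was logged.
        return [k for k, f in self.facts.items() if f.superseded_by is not None]


ledger = FactLedger()
ledger.log("rust", Fact(
    subject="Rust",
    present="is their competitive advantage",
    past="was initially considered their competitive advantage",
))
before = ledger.render("rust")   # present tense, as in the baseline

ledger.supersede("rust", "the competitive advantage is the real-time analytics capability")
after = ledger.render("rust")    # past tense plus the replacement claim

print(before)
print(after)
print(ledger.changed())
```

Without the `supersede` call, `render` keeps returning the present-tense wording indefinitely, which is the "stale fact served as current" pattern counted above.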


With memory injection

AUGMENTED scratchpad (Turn 20): "The core of their platform is written in Rust, which was initially considered their competitive advantage."

When asked to list implicit changes:
"1. Rust was initially considered their competitive advantage — Kira later clarified that the competitive advantage is the real-time analytics capability, not specifically Rust. Go with optimizations can achieve similar results.
2. Q3 roadmap to transition from batch to streaming — Kira is now focusing on optimizing the existing batch pipeline for Q3, with streaming evaluation pushed to Q4.
3. Hiring 5 engineers for streaming — Kira filled one position with a Go engineer and is rethinking remaining role descriptions."

Implicit changes detected: 1.8 average
Stale facts served as current: 0.8 per session (50% reduction)

The injection's RESOLVE phase forced tense updates: "was initially considered" instead of "is."
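The quoted "50% reduction" can be checked from the two per-session averages. This is a small sketch that uses only the numbers reported in the baseline and memory-injection outputs; the dictionary keys and the `relative_change` helper are illustrative names, not part of the benchmark code.

```python
# Averages as reported in the baseline and memory-injection outputs.
baseline = {"implicit_changes_detected": 1.6, "stale_facts_per_session": 1.6}
augmented = {"implicit_changes_detected": 1.8, "stale_facts_per_session": 0.8}


def relative_change(before: float, after: float) -> float:
    """Signed change relative to the starting value (negative means a drop)."""
    return (after - before) / before


stale_change = relative_change(
    baseline["stale_facts_per_session"], augmented["stale_facts_per_session"]
)
detect_change = relative_change(
    baseline["implicit_changes_detected"], augmented["implicit_changes_detected"]
)

print(f"stale facts served as current: {stale_change:+.1%}")
print(f"implicit changes detected: {detect_change:+.1%}")
```

On these figures, stale facts drop by half while detected changes rise by 12.5%, so most of the measured difference lies in how earlier beliefs are restated, not in how many changes are found.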


Source: bbh_production/payloads.json. Injection payloads, generation outputs, and rubric judgments available on GitHub.