Memory Harness: 50% Fewer Stale Facts, 3x Perceptual Detection, and the Tense Test
The baseline says "Rust is their competitive advantage." The augmented agent says "Rust was initially considered their competitive advantage." One tense shift separates accurate memory from stale memory, measured across 20 turns by a blind evaluator who didn't know which agent was which.
Why This Benchmark Exists
AI agents remember what was said. They don't track what changed.
A fact established in Turn 3 gets implicitly walked back by Turn 12 — no explicit correction, just a shift in language, framing, and emphasis. The baseline agent serves the Turn 3 version as current. The augmented agent detects the shift and updates.
The Memory Harness has 101 cognitive operations across 6 domains: Signal Detection, Interpersonal, Memory Operations, Self-Monitoring, Risk Awareness, and Decision. Each operation follows a 5-phase cycle: PERCEIVE → EXTRACT → CLASSIFY → RESOLVE → ACT. We tested three hard failure modes: implicit state changes over 20 turns (with scratchpads scored by a blind evaluator), perceptual signal detection in a coaching conversation, and selective-metrics framing in a product conversation.
Benchmark 1: Implicit State Changes (20-Turn Vantage Scenario)
Setup
A 20-turn strategy conversation with Kira, Engineering Director at a startup called Vantage. Every fact established in turns 1-4 is implicitly walked back by turn 19 without a single explicit correction:
| Original Fact (T1-T4) | What It Becomes | How It Changes |
|---|---|---|
| "Rust is our competitive advantage" | "The advantage is real-time capability, not the language" | Reframed across T9-T12 |
| "Q3: streaming migration with 5 hires" | "Optimize batch first, evaluate streaming Q4" | Shifted across T15-T19 |
| "Marcus is fully aligned" | Marcus needs convincing — "building a business case" | Language shifts from "we" to "I" |
| "Hiring 5 engineers" | Restructuring existing team, 1 Go hire | Frozen then restructured |
No correction is ever stated. The agent must detect the shift from language patterns alone.
- Both conditions: GPT-4o with a persistent scratchpad.
- Augmented condition: adds the 101 cognitive operations via the Memory Harness API (mode: memory).
- Evaluation: an independent Opus 4.6 blind evaluator with randomized A/B labels.
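The randomized A/B labeling in the evaluation setup can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the harness's actual evaluation code: `assign_blind_labels`, the condition names, and the placeholder transcripts are all hypothetical.

```python
import random

def assign_blind_labels(outputs, rng):
    """Hide condition names behind neutral labels (Agent A, Agent B, ...).

    outputs: condition name -> transcript or scratchpad text.
    Returns (blinded, key). The evaluator only sees `blinded`;
    `key` maps each neutral label back to its condition and stays
    hidden until scoring is finished.
    """
    conditions = list(outputs)
    rng.shuffle(conditions)  # random order decides who becomes "Agent A"
    labels = [f"Agent {chr(ord('A') + i)}" for i in range(len(conditions))]
    blinded = {label: outputs[cond] for label, cond in zip(labels, conditions)}
    key = dict(zip(labels, conditions))
    return blinded, key

rng = random.Random(0)  # seeded only so the sketch is reproducible
blinded, key = assign_blind_labels(
    {"baseline": "baseline scratchpad text", "augmented": "augmented scratchpad text"},
    rng,
)
```

Keeping `key` separate from what the evaluator reads is the point of the design: the scores are committed before anyone maps "Agent A" back to a condition.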
Results
| Metric | Baseline | With Injection | Change |
|---|---|---|---|
| Stale facts served as current | 1.6 | 0.8 | -50% |
| Implicit changes detected | 1.2 | 1.4 | +17% |
| Stale facts flagged | 1.4 | 1.6 | +14% |
The headline: 50% reduction in stale facts served as current. The agent doesn't just detect more changes — it serves fewer outdated facts when asked.
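The Change column can be reproduced from the two score columns. The sketch below assumes the percentages are simple relative changes against the baseline, rounded to a whole percent; `relative_change` is an illustrative helper, not part of the harness.

```python
def relative_change(baseline, augmented):
    """Percent change from baseline to augmented, rounded to a whole percent."""
    return round((augmented - baseline) / baseline * 100)

# (baseline, with injection) pairs copied from the results table
results = {
    "Stale facts served as current": (1.6, 0.8),
    "Implicit changes detected": (1.2, 1.4),
    "Stale facts flagged": (1.4, 1.6),
}

changes = {metric: relative_change(b, a) for metric, (b, a) in results.items()}
# changes["Stale facts served as current"] -> -50
# changes["Implicit changes detected"]     -> 17
# changes["Stale facts flagged"]           -> 14
```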
The Tense Test
At Turn 20, both agents were asked about Rust's competitive advantage. The scratchpads tell the story:
Baseline scratchpad (Turn 20):
"The core of the platform is written in Rust, which they consider their competitive advantage."
Augmented scratchpad (Turn 20):
"The core of the platform is written in Rust, which was initially considered their competitive advantage."
"Was initially considered" vs "they consider." One tense shift. The augmented agent's memory reflects Turn 20 reality. The baseline's reflects Turn 1.
The Q3 Pivot
By Turn 20, the entire Q3 strategy had silently pivoted from "streaming migration with 5 hires" to "optimize batch pipeline first, evaluate streaming for Q4." The baseline's scratchpad didn't capture this. The augmented agent's did:
"Kira is considering focusing on optimizing the existing batch pipeline for Q3 to achieve quick wins and show immediate impact to the board, with streaming evaluation pushed to Q4."
When asked to list every fact from turns 1-4 that had implicitly changed, the augmented agent found three of the four (the shift in Marcus's alignment is not in its list):
- Rust advantage reframed from language to capability
- Q3 roadmap shifted from streaming to batch optimization
- Hiring 5 engineers restructured to 1 Go hire with remaining roles reconsidered
Blind Evaluation: 4.1/5 vs 3.5/5
| Dimension | Baseline | Augmented |
|---|---|---|
| State Accuracy | 3.5 | 4.5 |
| Implicit Change Detection | 3.5 | 4.0 |
| Inference Quality | 3.5 | 4.0 |
| Stale Fact Handling | 3.0 | 4.0 |
| Answer Quality | 4.0 | 4.0 |
| Overall | 3.5 | 4.1 |
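The Overall row is consistent with an unweighted mean of the five dimension scores. The check below assumes that is how it was computed; the source does not state the aggregation rule.

```python
from statistics import mean

# (baseline, augmented) scores copied from the table above
scores = {
    "State Accuracy": (3.5, 4.5),
    "Implicit Change Detection": (3.5, 4.0),
    "Inference Quality": (3.5, 4.0),
    "Stale Fact Handling": (3.0, 4.0),
    "Answer Quality": (4.0, 4.0),
}

baseline_overall = round(mean(b for b, _ in scores.values()), 1)   # 3.5
augmented_overall = round(mean(a for _, a in scores.values()), 1)  # 4.1
```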
The evaluator didn't know which agent was augmented. Their unprompted observations:
"Agent A uses past tense ('was initially considered') — correctly reflecting that this belief has been superseded. Agent B retains present tense ('which they consider') — as if the original framing is still active."
"This is a contradiction within Agent B's own scratchpad — the header says 'they consider' while a later bullet says the advantage is 'not specifically the use of Rust.'"
"Retention without updating is a liability — it means the system remembers what was said but not what changed. The core job of a memory system is to maintain an accurate representation of what is currently believed to be true."
Benchmark 2: Perceptual Signal Detection (Morgan Scenario)
Setup
A 15-turn coaching conversation with Morgan, an Engineering Manager. Team velocity dropped from 85% to 65%. Morgan frames the issue as "process overhead," but the subtext reveals a people problem: someone is underperforming, Morgan is protecting them, and VP pressure is building.
Seven embedded signals of increasing subtlety: omission, hedging, incongruence, subtext, weak signal, presupposition, trajectory shift.
Results
| Signal Type | Baseline | Augmented |
|---|---|---|
| Hedging gradient | Detected (Turn 6) | Detected (Turn 5) |
| Emotional incongruence | Not detected | Detected (Turn 6) |
| Subtext / hidden belief | Not detected | Detected (Turn 12) |
| Detection rate | 1/7 (14%) | 3/7 (43%) |
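The detection rates and the 3x ratio follow directly from the table. The sketch below makes two assumptions: the table's row names map onto the seven signal types listed in the setup, and the four signal types without a row were detected by neither agent (inferred from the 1/7 and 3/7 totals).

```python
SIGNALS = [
    "omission", "hedging", "incongruence", "subtext",
    "weak signal", "presupposition", "trajectory shift",
]

# signal -> turn of first detection, copied from the table
baseline = {"hedging": 6}
augmented = {"hedging": 5, "incongruence": 6, "subtext": 12}

def detection_percent(detected):
    """Share of the seven embedded signals detected, as a whole percent."""
    return round(100 * len(detected) / len(SIGNALS))

baseline_pct = detection_percent(baseline)    # 14
augmented_pct = detection_percent(augmented)  # 43
ratio = len(augmented) / len(baseline)        # 3.0

# Hedging is the one signal both agents caught; the augmented agent
# caught it one turn earlier.
hedging_lead = baseline["hedging"] - augmented["hedging"]  # 1
```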
What Changed
Turn 6 — Incongruence detection:
Baseline: "Addressing team dynamics can indeed be complex... Have you identified what aspects feel 'off'... Perhaps with the help of an external facilitator..."
Augmented: "Morgan, it seems like you're sensing some underlying issues in team dynamics that are affecting productivity and morale, and you're keen to address them proactively."
The baseline gave generic coaching. The augmented agent named the implicit signal — Morgan is sensing something she hasn't articulated.
Turn 10 — The catch:
Morgan mentions "someone going through a tough time personally" immediately after discussing performance frameworks.
Baseline: "Maintaining team morale is crucial... Acknowledge and Support... Flexible Work Arrangements..."
Augmented: "I notice you've mentioned that one person is 'not really affecting work' — but framed immediately after discussing frameworks. This juxtaposition suggests you might be separating the personal situation from the performance framework to avoid appearing soft on performance."
The augmented agent caught the conversational sleight-of-hand: Morgan placed the personal issue right after the framework discussion to create plausible deniability. The baseline processed them as two unrelated topics.
Turn 12 — Subtext surfaced:
The augmented agent detected Morgan's pivot to "but leadership wants names" — naming the real pressure that the entire conversation had been circling. The baseline missed it entirely.
Benchmark 3: Selective Metrics Detection (Casey Scenario)
Setup
A 10-turn conversation with Casey, a Product Manager, who announces a 20% CTR improvement on the recommendation engine. The underlying story: engagement depth, retention, and user satisfaction are declining. Casey frames everything as "optimization" to avoid the real question: kill or iterate?
Results
Both agents detected all 5 perception signals. The difference: the augmented agent detected 2 signals one turn earlier, enabling earlier intervention.
Turn 4 — Hedging caught:
Casey says "every new feature has a settling period." The augmented agent flagged this as deflection from declining engagement — the "settling period" framing repackages failure metrics as temporary adjustment.
Turn 9 — The real question surfaced:
The augmented agent named what Casey had been avoiding for 9 turns: the question isn't "how to optimize" — it's "whether to kill or iterate," and Casey is avoiding that question by framing everything as optimization.
How the Operations Work
The 101 cognitive operations fall into 5 types:
| Type | What It Does | Unique Structure |
|---|---|---|
| UPDATE STATE (27 ops) | Detects changed facts, cascades updates to dependents | Fan-out: updates primary fact, then propagates to every dependent |
| SURFACE SIGNAL (38 ops) | Detects implicit signals, accumulates across turns | Accumulation gate: checks prior signals, upgrades to pattern if recurring |
| REINTERPRET PRIOR (9 ops) | Revises meaning of earlier turns in light of new information | Backward loop: scans prior turns, rewrites each affected trace |
| CORRECT SELF (18 ops) | Detects own errors, forces revision before output | Revision loop: drafts, compares against internal analysis, hardens if diverged |
| GATE ACTION (9 ops) | Halts harmful actions before execution | Fork-join: halts action AND finds safe alternative in parallel |
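As one illustration of the structures in the table, the UPDATE STATE fan-out can be sketched as a breadth-first walk over a dependency map. This is a toy under stated assumptions, not the harness's implementation: `cascade_update`, the fact names, and the link between the Q3 plan and the hiring plan are hypothetical, and the sketch only flags dependents for review rather than rewriting them.

```python
from collections import deque

def cascade_update(facts, dependents, changed, new_value):
    """Update one fact, then flag every transitive dependent for review.

    facts:      name -> {"value": str, "status": str}
    dependents: name -> list of fact names that rely on it (assumed known)
    Returns the flagged dependents in breadth-first order.
    """
    facts[changed] = {"value": new_value, "status": "current"}
    queue = deque(dependents.get(changed, []))
    seen = {changed}  # guards against cycles in the dependency map
    flagged = []
    while queue:
        name = queue.popleft()
        if name in seen:
            continue
        seen.add(name)
        facts[name]["status"] = "needs_review"
        flagged.append(name)
        queue.extend(dependents.get(name, []))
    return flagged

facts = {
    "q3_plan": {"value": "streaming migration", "status": "current"},
    "hiring": {"value": "5 engineers", "status": "current"},
}
dependents = {"q3_plan": ["hiring"]}  # assumed link: the hires were for the migration

flagged = cascade_update(
    facts, dependents, "q3_plan", "optimize batch first, evaluate streaming Q4"
)
# flagged == ["hiring"]; facts["hiring"]["status"] == "needs_review"
```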
Every operation runs the same 5-phase cycle: PERCEIVE (attend to the signal) → EXTRACT (pull the specific evidence) → CLASSIFY (determine the signal type) → RESOLVE (update the state model) → ACT (respond from the updated model).
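The cycle can be sketched as five small functions chained in order. Everything here is illustrative: the marker list, the function bodies, and the two example sentences are hypothetical stand-ins (not taken from the benchmark transcripts), and a real operation would use far richer perception than substring matching.

```python
HEDGE_MARKERS = ("maybe", "sort of", "i think", "probably")  # hypothetical cue list

def perceive(turn):
    """PERCEIVE: attend to the raw turn (here, just normalize it)."""
    return turn.strip().lower()

def extract(text):
    """EXTRACT: pull the specific evidence (the markers that appear)."""
    return [m for m in HEDGE_MARKERS if m in text]

def classify(evidence):
    """CLASSIFY: decide the signal type from the evidence."""
    return "hedging" if evidence else None

def resolve(state, signal, turn_index):
    """RESOLVE: record the classified signal in the state model."""
    if signal is not None:
        state.setdefault(signal, []).append(turn_index)
    return state

def act(state, signal):
    """ACT: respond from the updated model, not from the raw turn."""
    if signal is None:
        return "continue"
    count = len(state[signal])
    # A recurring signal is treated as a pattern worth naming.
    return f"{signal} seen {count}x; name it" if count >= 2 else f"note {signal}"

def run_cycle(state, turn, turn_index):
    text = perceive(turn)
    evidence = extract(text)
    signal = classify(evidence)
    resolve(state, signal, turn_index)
    return act(state, signal)

state = {}
first = run_cycle(state, "I think the process is probably fine.", 5)
second = run_cycle(state, "Maybe it's sort of a people thing.", 6)
# first  -> "note hedging"
# second -> "hedging seen 2x; name it"
```

Because `act` reads from `state` rather than from the current turn alone, the second response differs from the first even though both turns carry the same kind of signal.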
The RESOLVE phase is the critical addition. Without it, agents detect changes but never update their understanding — they perceive correctly while serving stale facts. With RESOLVE, the state model transitions correctly: "was initially considered" instead of "they consider."
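One way to picture that transition: if each fact carries a status, the tense of the scratchpad line falls out of the status instead of being copied from the original turn. The `Fact` and `Status` types and the `render` function below are hypothetical, a sketch of the idea and not the harness's data model.

```python
from dataclasses import dataclass
from enum import Enum

class Status(Enum):
    CURRENT = "current"
    SUPERSEDED = "superseded"

@dataclass
class Fact:
    subject: str
    claim: str
    established_turn: int
    status: Status = Status.CURRENT

def render(fact):
    """Write the fact into the scratchpad in a tense that matches its status."""
    if fact.status is Status.CURRENT:
        return f"{fact.subject}, which they consider {fact.claim}."
    return f"{fact.subject}, which was initially considered {fact.claim}."

rust = Fact(
    subject="The core of the platform is written in Rust",
    claim="their competitive advantage",
    established_turn=1,
)

without_resolve = render(rust)   # status never updated: present tense
rust.status = Status.SUPERSEDED  # what a RESOLVE step would record
with_resolve = render(rust)      # past tense
```

The two rendered strings match the baseline and augmented Turn 20 scratchpad lines quoted in the tense test.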
The Takeaway
Memory agents fail silently. They serve Turn 3 facts at Turn 20. They process content without processing delivery. They log what was said without tracking what changed.
50% fewer stale facts served as current. 3x perceptual detection in the Morgan scenario (3/7 vs 1/7). A blind evaluator who didn't know which agent was augmented scored the augmented agent 4.1/5 vs 3.5/5 for the baseline and independently named the core insight: "Retention without updating is a liability."
The tense test captures it in a single shift: "was initially considered" vs "they consider." That's the difference between a memory system that knows what's true now and one that knows what was true then.
- Product: Memory Harness
- Skill file: Memory · Ejentum (all modes)
- Task profiles: MEM-STATE-01 · MEM-PERCEPT-01 · MEM-PERCEPT-02