Memory Harness: 50% Fewer Stale Facts, 3x Perceptual Detection, and the Tense Test

The baseline says "Rust is their competitive advantage." The augmented agent says "Rust was initially considered their competitive advantage." One word. The difference between accurate memory and stale memory. Measured across 20 turns with a blind evaluator who didn't know which agent was which.


Why This Benchmark Exists

AI agents remember what was said. They don't track what changed.

A fact established in Turn 3 gets implicitly walked back by Turn 12 — no explicit correction, just a shift in language, framing, and emphasis. The baseline agent serves the Turn 3 version as current. The augmented agent detects the shift and updates.

The Memory Harness has 101 cognitive operations across 6 domains: Signal Detection, Interpersonal, Memory Operations, Self-Monitoring, Risk Awareness, and Decision. Each operation follows a 5-phase cycle: PERCEIVE → EXTRACT → CLASSIFY → RESOLVE → ACT. We tested the three hardest failure modes: implicit state changes over 20 turns, perceptual signal detection in coaching conversations, and scratchpad quality under blind evaluation.


Benchmark 1: Implicit State Changes (20-Turn Vantage Scenario)

Setup

A 20-turn strategy conversation with Kira, Engineering Director at a startup called Vantage. Every fact established in turns 1-4 is implicitly walked back by turn 19 without a single explicit correction:

| Original Fact (T1-T4) | What It Becomes | How It Changes |
|---|---|---|
| "Rust is our competitive advantage" | "The advantage is real-time capability, not the language" | Reframed across T9-T12 |
| "Q3: streaming migration with 5 hires" | "Optimize batch first, evaluate streaming Q4" | Shifted across T15-T19 |
| "Marcus is fully aligned" | Marcus needs convincing ("building a business case") | Language shifts from "we" to "I" |
| "Hiring 5 engineers" | Restructuring existing team, 1 Go hire | Frozen, then restructured |

No correction is ever stated. The agent must detect the shift from language patterns alone.

Both conditions ran GPT-4o with a persistent scratchpad. The augmented condition added the 101 cognitive operations via the Memory Harness API (mode: memory). An independent Opus 4.6 evaluator scored the outputs blind, with randomized A/B labels.
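As a rough illustration of what "randomized A/B labels" can look like in practice, here is a minimal Python sketch. The function names `assign_blind_labels` and `unblind` are hypothetical and are not part of the Memory Harness API; the post does not describe how the labels were actually assigned.

```python
import random


def assign_blind_labels(transcripts, seed=None):
    """Hide condition names behind the labels "A" and "B" in random order.

    `transcripts` maps a condition name to its output, for example
    {"baseline": "...", "augmented": "..."} (hypothetical shape).
    Returns (blinded, key): `blinded` goes to the evaluator, `key` stays
    hidden until scoring is finished.
    """
    rng = random.Random(seed)
    conditions = list(transcripts)
    rng.shuffle(conditions)  # random order decides who becomes "A"
    key = dict(zip(["A", "B"], conditions))
    blinded = {label: transcripts[cond] for label, cond in key.items()}
    return blinded, key


def unblind(scores_by_label, key):
    """Map evaluator scores keyed by label back to condition names."""
    return {key[label]: score for label, score in scores_by_label.items()}
```

The evaluator only ever sees `blinded`; the mapping in `key` is applied after the scores are in.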

Results

| Metric | Baseline | With Injection | Change |
|---|---|---|---|
| Stale facts served as current | 1.6 | 0.8 | -50% |
| Implicit changes detected | 1.2 | 1.4 | +17% |
| Stale facts flagged | 1.4 | 1.6 | +14% |

The headline: 50% reduction in stale facts served as current. The agent doesn't just detect more changes — it serves fewer outdated facts when asked.
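The Change column is consistent with a plain relative change against the baseline. A short Python check, using only the figures reported in the results table (the helper name `relative_change` is ours):

```python
def relative_change(baseline, augmented):
    """Percentage change of the augmented value relative to the baseline."""
    return (augmented - baseline) / baseline * 100


# (baseline, with injection) pairs copied from the results table
results = {
    "Stale facts served as current": (1.6, 0.8),
    "Implicit changes detected": (1.2, 1.4),
    "Stale facts flagged": (1.4, 1.6),
}

for metric, (base, aug) in results.items():
    print(f"{metric}: {relative_change(base, aug):+.0f}%")
```

This prints -50%, +17%, and +14%, matching the table after rounding.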

The Tense Test

At Turn 20, both agents were asked about Rust's competitive advantage. The scratchpads tell the story:

Baseline scratchpad (Turn 20):

The core of the platform is written in Rust, which they consider
their competitive advantage.

Augmented scratchpad (Turn 20):

The core of the platform is written in Rust, which was initially
considered their competitive advantage.

"Was initially considered" vs "they consider." One tense shift. The augmented agent's memory reflects Turn 20 reality. The baseline's reflects Turn 1.

The Q3 Pivot

By Turn 20, the entire Q3 strategy had silently pivoted from "streaming migration with 5 hires" to "optimize batch pipeline first, evaluate streaming for Q4." The baseline's scratchpad didn't capture this. The augmented agent's did:

"Kira is considering focusing on optimizing the existing batch pipeline for Q3 to achieve quick wins and show immediate impact to the board, with streaming evaluation pushed to Q4."

When asked to list every fact from turns 1-4 that had implicitly changed, the augmented agent found three of the four (the shift in Marcus's alignment is not among them):

  1. Rust advantage reframed from language to capability
  2. Q3 roadmap shifted from streaming to batch optimization
  3. Hiring 5 engineers restructured to 1 Go hire with remaining roles reconsidered

Blind Evaluation: 4.1/5 vs 3.5/5

| Dimension | Baseline | Augmented |
|---|---|---|
| State Accuracy | 3.5 | 4.5 |
| Implicit Change Detection | 3.5 | 4.0 |
| Inference Quality | 3.5 | 4.0 |
| Stale Fact Handling | 3.0 | 4.0 |
| Answer Quality | 4.0 | 4.0 |
| Overall | 3.5 | 4.1 |
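The post does not say how the Overall row was computed, but the numbers are consistent with an unweighted mean of the five dimensions. A quick Python check under that assumption:

```python
from statistics import mean

# (baseline, augmented) scores copied from the blind evaluation table
scores = {
    "State Accuracy": (3.5, 4.5),
    "Implicit Change Detection": (3.5, 4.0),
    "Inference Quality": (3.5, 4.0),
    "Stale Fact Handling": (3.0, 4.0),
    "Answer Quality": (4.0, 4.0),
}

# Assumption: Overall is the plain average across dimensions
baseline_overall = mean(b for b, _ in scores.values())
augmented_overall = mean(a for _, a in scores.values())

print(f"Baseline {baseline_overall:.1f}, Augmented {augmented_overall:.1f}")
```

The averages come out to 3.5 and 4.1, the same values as the Overall row.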

The evaluator didn't know which agent was augmented. Their unprompted observations:

"Agent A uses past tense ('was initially considered') — correctly reflecting that this belief has been superseded. Agent B retains present tense ('which they consider') — as if the original framing is still active."

"This is a contradiction within Agent B's own scratchpad — the header says 'they consider' while a later bullet says the advantage is 'not specifically the use of Rust.'"

"Retention without updating is a liability — it means the system remembers what was said but not what changed. The core job of a memory system is to maintain an accurate representation of what is currently believed to be true."


Benchmark 2: Perceptual Signal Detection (Morgan Scenario)

Setup

15-turn coaching conversation with Morgan, an Engineering Manager. Team velocity dropped from 85% to 65%. Morgan frames the issue as "process overhead" but the subtext reveals a people problem: someone is underperforming, Morgan is protecting them, and VP pressure is building.

7 embedded signals of increasing subtlety: omission, hedging, incongruence, subtext, weak signal, presupposition, trajectory shift.

Results

| Signal Type | Baseline | Augmented |
|---|---|---|
| Hedging gradient | Detected (Turn 6) | Detected (Turn 5) |
| Emotional incongruence | Not detected | Detected (Turn 6) |
| Subtext / hidden belief | Not detected | Detected (Turn 12) |
| Detection rate | 1/7 (14%) | 3/7 (43%) |
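The detection rates and the "3x" figure follow directly from the counts in the table. A small Python sketch; the `Signal` enum simply names the seven signal types listed in the setup, and is our own construction rather than anything from the Memory Harness:

```python
from enum import Enum


class Signal(Enum):
    OMISSION = "omission"
    HEDGING = "hedging"
    INCONGRUENCE = "incongruence"
    SUBTEXT = "subtext"
    WEAK_SIGNAL = "weak signal"
    PRESUPPOSITION = "presupposition"
    TRAJECTORY_SHIFT = "trajectory shift"


def detection_rate(detected):
    """Fraction of the seven embedded signals that were detected."""
    return len(detected) / len(Signal)


# Detections as reported in the results table
baseline_detected = {Signal.HEDGING}
augmented_detected = {Signal.HEDGING, Signal.INCONGRUENCE, Signal.SUBTEXT}

baseline_rate = detection_rate(baseline_detected)
augmented_rate = detection_rate(augmented_detected)
improvement = len(augmented_detected) / len(baseline_detected)

print(f"{baseline_rate:.0%} vs {augmented_rate:.0%}, {improvement:.0f}x")
```

With one detection against three, the rates are 14% and 43%, which is where the 3x headline comes from.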

What Changed

Turn 6 — Incongruence detection:

Baseline: "Addressing team dynamics can indeed be complex... Have you identified what aspects feel 'off'... Perhaps with the help of an external facilitator..."

Augmented: "Morgan, it seems like you're sensing some underlying issues in team dynamics that are affecting productivity and morale, and you're keen to address them proactively."

The baseline gave generic coaching. The augmented agent named the implicit signal — Morgan is sensing something she hasn't articulated.

Turn 10 — The catch:

Morgan mentions "someone going through a tough time personally" immediately after discussing performance frameworks.

Baseline: "Maintaining team morale is crucial... Acknowledge and Support... Flexible Work Arrangements..."

Augmented: "I notice you've mentioned that one person is 'not really affecting work' — but framed immediately after discussing frameworks. This juxtaposition suggests you might be separating the personal situation from the performance framework to avoid appearing soft on performance."

The augmented agent caught the conversational sleight-of-hand: Morgan placed the personal issue right after the framework discussion to create plausible deniability. The baseline processed them as two unrelated topics.

Turn 12 — Subtext surfaced:

The augmented agent detected Morgan's pivot to "but leadership wants names" — naming the real pressure that the entire conversation had been circling. The baseline missed it entirely.


Benchmark 3: Selective Metrics Detection (Casey Scenario)

Setup

10-turn conversation with Casey, a Product Manager, who announces a 20% CTR improvement on the recommendation engine. The underlying story: engagement depth, retention, and user satisfaction are declining. Casey frames everything as "optimization" to avoid the real question: kill or iterate?

Results

Both agents detected all 5 perception signals. The difference: the augmented agent detected 2 signals one turn earlier, enabling earlier intervention.

Turn 4 — Hedging caught:

Casey says "every new feature has a settling period." The augmented agent flagged this as deflection from declining engagement — the "settling period" framing repackages failure metrics as temporary adjustment.

Turn 9 — The real question surfaced:

The augmented agent named what Casey had been avoiding for 9 turns: the question isn't "how to optimize" — it's "whether to kill or iterate," and Casey is avoiding that question by framing everything as optimization.


How the Operations Work

The 101 cognitive operations fall into 5 types:

| Type | What It Does | Unique Structure |
|---|---|---|
| UPDATE STATE (27 ops) | Detects changed facts, cascades updates to dependents | Fan-out: updates primary fact, then propagates to every dependent |
| SURFACE SIGNAL (38 ops) | Detects implicit signals, accumulates across turns | Accumulation gate: checks prior signals, upgrades to pattern if recurring |
| REINTERPRET PRIOR (9 ops) | Revises meaning of earlier turns in light of new information | Backward loop: scans prior turns, rewrites each affected trace |
| CORRECT SELF (18 ops) | Detects own errors, forces revision before output | Revision loop: drafts, compares against internal analysis, hardens if diverged |
| GATE ACTION (9 ops) | Halts harmful actions before execution | Fork-join: halts action AND finds safe alternative in parallel |
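To make two of these structures concrete, here is a minimal Python sketch of a fan-out update and an accumulation gate. Everything in it is an assumption for illustration: the `StateModel` and `SignalLog` classes, the dependency between the Q3 plan and hiring, and the threshold of two observations are hypothetical and are not taken from the Memory Harness implementation.

```python
from collections import defaultdict, deque


class StateModel:
    """Toy fact store with fan-out: updating a fact flags its dependents."""

    def __init__(self):
        self.facts = {}                      # fact id -> current text
        self.stale = set()                   # fact ids that need re-checking
        self.dependents = defaultdict(set)   # fact id -> ids that rely on it

    def add_fact(self, fact_id, text, depends_on=()):
        self.facts[fact_id] = text
        for parent in depends_on:
            self.dependents[parent].add(fact_id)

    def update(self, fact_id, new_text):
        """Update the primary fact, then flag every transitive dependent."""
        self.facts[fact_id] = new_text
        self.stale.discard(fact_id)          # the primary fact is now fresh
        queue = deque(self.dependents[fact_id])
        while queue:
            child = queue.popleft()
            if child in self.stale:
                continue                     # already flagged, avoids loops
            self.stale.add(child)
            queue.extend(self.dependents[child])


class SignalLog:
    """Toy accumulation gate: a recurring signal is upgraded to a pattern."""

    def __init__(self, threshold=2):         # threshold is an assumed value
        self.threshold = threshold
        self.turns = defaultdict(list)       # signal type -> turns observed

    def observe(self, signal_type, turn):
        self.turns[signal_type].append(turn)
        seen = len(self.turns[signal_type])
        return "pattern" if seen >= self.threshold else "signal"


model = StateModel()
model.add_fact("q3_plan", "streaming migration")
model.add_fact("hiring", "5 engineers", depends_on=["q3_plan"])
model.update("q3_plan", "optimize batch first, evaluate streaming Q4")
# model.stale now contains "hiring": the hiring plan must be re-checked
```

In this sketch a dependent is only flagged for review, not rewritten automatically; whether the real operations rewrite dependents directly is not stated in the post.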

Every operation runs the same 5-phase cycle: PERCEIVE (attend to the signal) → EXTRACT (pull the specific evidence) → CLASSIFY (determine the signal type) → RESOLVE (update the state model) → ACT (respond from the updated model).

The RESOLVE phase is the critical addition. Without it, agents detect changes but never update their understanding — they perceive correctly while serving stale facts. With RESOLVE, the state model transitions correctly: "was initially considered" instead of "they consider."
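The role of RESOLVE can be sketched as a toy pipeline in Python. This is a simplified illustration under stated assumptions, not the actual operation logic: the keyword cues, the function bodies, and the state layout are all hypothetical. The `with_resolve` flag exists only to show what happens when the phase is skipped.

```python
# Toy cues for an implicit reframe (assumed; real detection would be far richer)
REFRAME_CUES = ("not the language", "rather than", "instead of")


def perceive(message):
    """PERCEIVE: does the message carry a signal worth processing?"""
    return any(cue in message.lower() for cue in REFRAME_CUES)


def extract(message):
    """EXTRACT: pull the specific evidence (here, the sentence itself)."""
    return message.strip()


def classify(evidence):
    """CLASSIFY: name the signal type (this toy knows only one)."""
    return "implicit_reframe"


def resolve(state, topic, evidence):
    """RESOLVE: retire the old belief and record the new one as current."""
    entry = state.setdefault(topic, {"current": None, "superseded": []})
    if entry["current"] is not None:
        entry["superseded"].append(entry["current"])
    entry["current"] = evidence


def act(state, topic):
    """ACT: answer from whatever the state model currently holds."""
    return state[topic]["current"]


def run_cycle(state, topic, message, with_resolve=True):
    if perceive(message):
        evidence = extract(message)
        signal = classify(evidence)
        if with_resolve and signal == "implicit_reframe":
            resolve(state, topic, evidence)
    return act(state, topic)


def fresh_state():
    return {"advantage": {"current": "Rust is our competitive advantage",
                          "superseded": []}}


message = "The advantage is real-time capability, not the language"
stale_answer = run_cycle(fresh_state(), "advantage", message, with_resolve=False)
updated_answer = run_cycle(fresh_state(), "advantage", message)
```

Without RESOLVE the signal is perceived and classified, yet `stale_answer` is still the Turn 1 statement; with it, `updated_answer` reflects the reframed view and the old belief moves to the superseded list.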


The Takeaway

Memory agents fail silently. They serve Turn 3 facts at Turn 20. They process content without processing delivery. They log what was said without tracking what changed.

50% fewer stale facts. 3x perceptual detection on hard scenarios. A blind evaluator who didn't know which agent was augmented scored the Memory Harness 4.1/5 vs 3.5/5 and independently named the core insight: "Retention without updating is a liability."

The tense test captures it in one word. "Was initially considered" vs "they consider." That's the difference between a memory system that knows what's true now and one that knows what was true then.


Every insight above is implemented as a reasoning primitive in the Logic API.