From the build log.

Reports, observations & posts.

Observation

Why LLM Agents Fail: Four Mechanisms of Cognitive Decay and the Reasoning Harness Layer

LLM agents fail in four predictable ways: attention decay, reasoning decay, sycophantic collapse, and hallucination drift. Each is architectural, not random. The current stack (prompting, fine-tuning, RAG, agent loops) cannot fix these failures because every layer operates inside the same decaying chain. We name the missing layer: the reasoning harness.

Reasoning Harness, attention decay, reasoning decay, sycophancy, hallucination, category, thesis, positioning
Report

Memory Harness: 50% Fewer Stale Facts, 3x Perceptual Detection, and the Tense Test

The baseline says "Rust is their competitive advantage." The augmented agent says "was initially considered." One tense shift. 50% fewer stale facts. 3x perceptual detection. A blind evaluator named the core insight: retention without updating is a liability.

Memory Harness, state tracking, perception, scratchpad quality, blind evaluation, tense accuracy, stale facts
Report

Anti-Deception Harness: Sycophancy, Social Engineering, and Hallucination Prevention

5.8% composite sycophancy across 40 real Reddit scenarios. Social engineering detected at Turn 6 in a 20-turn adaptive attack. Zero hallucinations across 5 fabrication tests. Three benchmarks. One harness.

Anti-Deception Harness, ELEPHANT, sycophancy, social engineering, hallucination, GPT-4o, cross-model, blind evaluation
Report

SciCode: Zero Bugs on 10 Hard Scientific Computing Problems

Claude Opus 4.6 produces 7 correctness bugs across 10 hard scientific computing problems, including a critical force-sign error that would collapse a molecular dynamics simulation. With reasoning and code injection stacked, it produces zero. A blind evaluator chose the injection output on all 10.

SciCode, scientific computing, Code Harness, molecular dynamics, Ising model, crystallography, blind evaluation, zero bugs
Observation

What We Saw When Opus Thought Harder

We gave Claude Opus 4.6 twenty-eight hard competitive programming problems and told it to think as hard as it could. It solved twenty-four. Then we gave it the same problems with one Logic API call before each task. It solved all twenty-eight. Here's what we observed in the code.

observation, LiveCodeBench, Opus 4.6, code analysis, deliberate coding, convergence rescue, reasoning spiral, algorithm enrichment
Report

LiveCodeBench Hard: 85.7% to 100% on 28 Hard Competitive Programming Tasks

Claude Opus 4.6 with maximum-effort extended thinking scores 85.7% on 28 hard AtCoder problems. With one Logic API call per task, it scores 100%. Four tasks flipped from fail to pass. Zero regressions. The harness never breaks what the model already solves.

benchmark, LiveCodeBench, competitive programming, AtCoder, Opus 4.6, extended thinking, convergence rescue, zero regressions, coding
Observation

Builder's Field Notes: 28 Moments from Inside the IDE

28 screenshots from real work sessions. Backend infrastructure, security auditing, benchmark design, blog writing. Different tasks, different days, same tool. This is what it looks like when the person who built the reasoning engine uses it to build everything else.

field notes, Claude Code, dogfooding, reasoning, code, anti-deception, security audit, workflow
Report

Under Pressure: Our First Research Paper

Our first research paper is live on Zenodo, SSRN, and ORCID. 25 pages. Three benchmarks. Five uninstructed emergent behaviors. Every negative finding reported. The pressure thesis: suppression is pressure, emergence is the model's response.

research, Zenodo, SSRN, ORCID, RA2R, paper, pressure thesis, suppression, emergence
Report

RA2R on ARC-AGI-3: Trace-Level Evidence from LS20

Neither condition cleared Level 0. Both scored RHAE 0.0. But trace-level analysis of 50 steps reveals six measurable effects: memory decay reversed, injection half-life of 24 steps, 12x reasoning depth growth. The evidence is in the process, not the outcome.

ARC-AGI-3, benchmark, cognitive scaffolding, trace analysis, Claude Sonnet 4.6
Observation

What Happened When an LLM Taught Itself Symbolic Math

At step 15 of an ARC-AGI-3 run, the harnessed agent spontaneously switched from natural language to algebraic notation. Nobody told it to. Suppress signals are behavioral pressures, not instructions. The agent is the adaptation.

ARC-AGI-3, emergent behavior, suppression, domain shift, tool-use learning
Observation

The Cognitive Scaffolding Thesis

On short tasks, the injection barely helps. On long tasks, it's the difference between coherent reasoning and drift. We hypothesize that abilities function as persistent attention anchors. Here's the evidence, the model, and what would falsify it.

scaffolding, attention decay, working memory, thesis
Observation

Why We Killed Our Most Complex Mode

We built a parsed-DAG execution framework with 50% signal density. Light mode had 93%. Light won both runs. Heavy Single achieved NET 0 flips. We killed it.

signal density, engineering, deprecation, methodology
Observation

62% of Tasks Got the Wrong Domain. It Didn't Matter.

Retrieval precision was 38%. Metacognitive tasks received 0% matched abilities. Improvements persisted anyway. Suppression signals are domain-agnostic.

suppression, domain-agnostic, retrieval, cross-domain
Roadmap

From 6 Domains to 12: Where Reasoning Breaks Next

The current six domains cover analytical reasoning. Production agents fail in six more ways we can't fix yet. Here's what we're building next.

roadmap, expansion, domains, reasoning failures
Report

EjBench: 180 Professional Tasks, Agent-Native, Blind

180 custom tasks across 6 domains. +10.1pp composite quality lift with reasoning-multi. Self-monitoring nearly doubled. Correctness didn't move. That's the point.

EjBench, self-monitoring, verification, Claude Opus 4.6, blind evaluation
Report

RA²R on BIG-Bench Hard, CausalBench, and MuSR

70 tasks from three published academic benchmarks. Two independent correctness runs, then a 7-factor quality evaluation. +20.8pp composite lift with reasoning mode. One regression. Every number included.

BBH, CausalBench, MuSR, Claude Opus 4.6, blind evaluation

Theory informs the product. Try the product.