From the build log.

Reports, observations & posts.

Observation

What We Saw When Opus Thought Harder

We gave Claude Opus 4.6 twenty-eight hard competitive programming problems and told it to think as hard as it could. It solved twenty-four. Then we gave it the same problems with one Logic API call before each task. It solved all twenty-eight. Here's what we observed in the code.

observation · LiveCodeBench · Opus 4.6 · code analysis · deliberate coding · convergence rescue · reasoning spiral · algorithm enrichment
Read →
Report

LiveCodeBench Hard: 85.7% to 100% on 28 Hard Competitive Programming Tasks

Claude Opus 4.6 with maximum-effort extended thinking scores 85.7% on 28 hard AtCoder problems. With one Logic API call per task, it scores 100%. Four tasks flipped from fail to pass. Zero regressions. The scaffold never breaks what the model already solves.

benchmark · LiveCodeBench · competitive programming · AtCoder · Opus 4.6 · extended thinking · convergence rescue · zero regressions · coding
Read →
Observation

Builder's Field Notes: 28 Moments from Inside the IDE

28 screenshots from real work sessions. Backend infrastructure, security auditing, benchmark design, blog writing. Different tasks, different days, same tool. This is what it looks like when the person who built the reasoning engine uses it to build everything else.

field notes · Claude Code · dogfooding · Ki · Haki · security audit · workflow
Read →
Report

Under Pressure: Our First Research Paper

Our first research paper is live on Zenodo, SSRN, and ORCID. 25 pages. Three benchmarks. Five uninstructed emergent behaviors. Every negative finding reported. The pressure thesis: suppression is pressure, emergence is the model's response.

research · Zenodo · SSRN · ORCID · RA2R · paper · pressure thesis · suppression · emergence
Read →
Report

RA2R on ARC-AGI-3: Trace-Level Evidence from LS20

Neither condition cleared Level 0; both scored RHAE 0.0. But trace-level analysis of 50 steps reveals six measurable effects, including reversed memory decay, a scaffold half-life of 24 steps, and 12× growth in reasoning depth. The evidence is in the process, not the outcome.

ARC-AGI-3 · benchmark · cognitive scaffolding · trace analysis · Claude Sonnet 4.6
Read →
Observation

What Happened When an LLM Taught Itself Symbolic Math

At step 15 of an ARC-AGI-3 run, the scaffolded agent spontaneously switched from natural language to algebraic notation. Nobody told it to. Suppression signals are behavioral pressures, not instructions; the notation shift is the agent's adaptation.

ARC-AGI-3 · emergent behavior · suppression · domain shift · tool-use learning
Read →
Observation

The Cognitive Scaffolding Thesis

On short tasks, the scaffold barely helps. On long tasks, it's the difference between coherent reasoning and drift. We hypothesize that abilities function as persistent attention anchors. Here's the evidence, the model, and what would falsify it.

scaffolding · attention decay · working memory · thesis
Read →
Observation

Why We Killed Our Most Complex Mode

We built a parsed-DAG execution framework with 50% signal density; Light mode had 93%. Light won both runs, and Heavy Single netted zero flips. We killed it.

signal density · engineering · deprecation · methodology
Read →
Observation

62% of Tasks Got the Wrong Domain. It Didn't Matter.

Retrieval precision was 38%, and metacognitive tasks received 0% matched abilities. The improvements persisted anyway: suppression signals are domain-agnostic.

suppression · domain-agnostic · retrieval · cross-domain
Read →
Roadmap

From 6 Domains to 12: Where Reasoning Breaks Next

The current six domains cover analytical reasoning. Production agents fail in six more ways we can't fix yet. Here's what we're building next.

roadmap · expansion · domains · reasoning failures
Read →
Report

EjBench: 180 Professional Tasks, Agent-Native, Blind

180 custom tasks across 6 domains. +10.1pp composite quality lift with Haki. Self-monitoring nearly doubled. Correctness didn't move. That's the point.

EjBench · self-monitoring · verification · Claude Opus 4.6 · blind evaluation
Read →
Report

RA²R on BIG-Bench Hard, CausalBench, and MuSR

70 tasks from three published academic benchmarks. Two independent correctness runs, then a 7-factor quality evaluation. +20.8pp composite lift with Ki. One regression. Every number included.

BBH · CausalBench · MuSR · Claude Opus 4.6 · blind evaluation
Read →

Theory informs the product. Try the product.