Blog
From the build log.
Reports, observations & posts.
What We Saw When Opus Thought Harder
We gave Claude Opus 4.6 twenty-eight hard competitive programming problems and told it to think as hard as it could. It solved twenty-four. Then we gave it the same problems with one Logic API call before each task. It solved all twenty-eight. Here's what we observed in the code.
LiveCodeBench Hard: 85.7% to 100% on 28 Hard Competitive Programming Tasks
Claude Opus 4.6 with maximum-effort extended thinking scores 85.7% on 28 hard AtCoder problems. With one Logic API call per task, it scores 100%. Four tasks flipped from fail to pass. Zero regressions. The scaffold never breaks what the model already solves.
Builder's Field Notes: 28 Moments from Inside the IDE
28 screenshots from real work sessions. Backend infrastructure, security auditing, benchmark design, blog writing. Different tasks, different days, same tool. This is what it looks like when the person who built the reasoning engine uses it to build everything else.
Under Pressure: Our First Research Paper
Our first research paper is live on Zenodo, SSRN, and ORCID. 25 pages. Three benchmarks. Five uninstructed emergent behaviors. Every negative finding reported. The pressure thesis: suppression is pressure, emergence is the model's response.
RA²R on ARC-AGI-3: Trace-Level Evidence from LS20
Neither condition cleared Level 0. Both scored RHAE 0.0. But trace-level analysis of 50 steps reveals six measurable effects, including reversed memory decay, a scaffold half-life of 24 steps, and 12x growth in reasoning depth. The evidence is in the process, not the outcome.
What Happened When an LLM Taught Itself Symbolic Math
At step 15 of an ARC-AGI-3 run, the scaffolded agent spontaneously switched from natural language to algebraic notation. Nobody told it to. Suppression signals are behavioral pressures, not instructions. The agent is the adaptation.
The Cognitive Scaffolding Thesis
On short tasks, the scaffold barely helps. On long tasks, it's the difference between coherent reasoning and drift. We hypothesize that abilities function as persistent attention anchors. Here's the evidence, the model, and what would falsify it.
Why We Killed Our Most Complex Mode
We built Heavy Single, a parsed-DAG execution framework with 50% signal density. Light mode had 93%. Light won both runs. Heavy Single netted zero flips. We killed it.
62% of Tasks Got the Wrong Domain. It Didn't Matter.
Retrieval precision was 38%. Metacognitive tasks received zero matched abilities. Improvements persisted anyway. Suppression signals are domain-agnostic.
From 6 Domains to 12: Where Reasoning Breaks Next
The current six domains cover analytical reasoning. Production agents fail in six more ways we can't fix yet. Here's what we're building next.
EjBench: 180 Professional Tasks, Agent-Native, Blind
180 custom tasks across 6 domains. +10.1pp composite quality lift with Haki. Self-monitoring nearly doubled. Correctness didn't move. That's the point.
RA²R on BIG-Bench Hard, CausalBench, and MuSR
70 tasks from three published academic benchmarks. Two independent correctness runs, then a 7-factor quality evaluation. +20.8pp composite lift with Ki. One regression. Every number included.