Blog
From the build log.
Reports, observations & posts.
What We Saw When Opus Thought Harder
We gave Claude Opus 4.6 twenty-eight hard competitive programming problems and told it to think as hard as it could. It solved twenty-four. Then we gave it the same problems with one Logic API call before each task. It solved all twenty-eight. Here's what we observed in the code.
LiveCodeBench Hard: 85.7% to 100% on 28 Hard Competitive Programming Tasks
Claude Opus 4.6 with maximum-effort extended thinking scores 85.7% on 28 hard AtCoder problems. With one Logic API call per task, it scores 100%. Four tasks flipped from fail to pass. Zero regressions. The scaffold never breaks what the model already solves.
Builder's Field Notes: 28 Moments from Inside the IDE
28 screenshots from real work sessions. Backend infrastructure, security auditing, benchmark design, blog writing. Different tasks, different days, same tool. This is what it looks like when the person who built the reasoning engine uses it to build everything else.
Under Pressure: Our First Research Paper
Our first research paper is live on Zenodo, SSRN, and ORCID. 25 pages. Three benchmarks. Five uninstructed emergent behaviors. Every negative finding reported. The pressure thesis: suppression is pressure, emergence is the model's response.
RA²R on ARC-AGI-3: Trace-Level Evidence from LS20
Neither condition cleared Level 0. Both scored RHAE 0.0. But trace-level analysis of 50 steps reveals six measurable effects, including reversed memory decay, a scaffold half-life of 24 steps, and 12x growth in reasoning depth. The evidence is in the process, not the outcome.
What Happened When an LLM Taught Itself Symbolic Math
At step 15 of an ARC-AGI-3 run, the scaffolded agent spontaneously switched from natural language to algebraic notation. Nobody told it to. Suppression signals are behavioral pressures, not instructions. The agent is the adaptation.
The Cognitive Scaffolding Thesis
On short tasks, the scaffold barely helps. On long tasks, it's the difference between coherent reasoning and drift. We hypothesize that abilities function as persistent attention anchors. Here's the evidence, the model, and what would falsify it.
Why We Killed Our Most Complex Mode
We built Heavy Single, a parsed-DAG execution framework with 50% signal density. Light mode had 93%. Light won both runs. Heavy Single netted zero flips. We killed it.
62% of Tasks Got the Wrong Domain. It Didn't Matter.
Retrieval precision was 38%. Metacognitive tasks received zero matched abilities. Improvements persisted anyway. Suppression signals are domain-agnostic.
From 6 Domains to 12: Where Reasoning Breaks Next
The current six domains cover analytical reasoning. Production agents fail in six more ways we can't fix yet. Here's what we're building next.
EjBench: 180 Professional Tasks, Agent-Native, Blind
180 custom tasks across 6 domains. +10.1pp composite quality lift with Haki. Self-monitoring nearly doubled. Correctness didn't move. That's the point.
RA²R on BIG-Bench Hard, CausalBench, and MuSR
70 tasks from three published academic benchmarks. Two independent correctness runs, then a 7-factor quality evaluation. +20.8pp composite lift with Ki. One regression. Every number included.