The series, in order.
Twelve essays in four parts, each building on the last. Read it front to back, or pull whatever thread you like.
- 1. The Harness Is the Product
A deployed agent's capability is the product of model quality and harness quality — and in 2026, the harness term has the steeper gradient.
- 2. The Agent Evaluation Crisis
Agent benchmark numbers measure demos, not the capability teams think they bought — and the fix is a measurement regime borrowed from deep RL, not a better leaderboard.
- 3. Environments Are the Bottleneck
Agentic RL is not blocked on algorithms or compute. It is blocked on environments — and the teams that industrialize environment supply will own agent training the way data-pipeline teams owned supervised learning.
- 4. Agents That Learn on the Job
Deployed agents are amnesic by default. The deployment already generates the experience stream continual learning always wanted — and the system-space half of the loop is shippable today.
- 5. The Memory Stack
Agent memory is not one problem but a stack — working, episodic, semantic, procedural. Most production failures are layer-confusion: solving one layer's problem with another layer's tool.
- 6. Tools, Skills, and the Action Interface
Agent capability leaks at the action interface: the model knows a tool is needed and fails to use it, and the dominant protocol taxes every turn. The real question is whether the tools→skills evolution is engineered to compound — or to collapse.
- 7. Planning and the Myopia Problem
A reasoning model's chain of thought looks like a plan: it weighs futures, considers options, deliberates. Extract the search tree behind it and the deliberation turns out to be theater — the model expands deep branches and then chooses by the shallow ones. Where the stakes justify the cost, the fix is not a longer chain of thought. It is search you move outside the model.
- 8. Multi-Agent Systems and Their Failure Modes
A multi-agent system fails in ways a single agent cannot — its diversity collapses, its blame becomes untraceable, its coordination cost outgrows the work. The systems that survive do not fix the org chart. They make coordination something the system learns or something it pays for.
- 9. Software Engineering Agents: The Proving Ground
Software engineering is where agents grew up — the only domain that handed them verifiable rewards, endless environments, and expert oversight all at once. The platform layer is settled now. What is still open — cost, coordination, deployment at scale — is the playbook every other domain inherits next.
- 10. Securing the Agentic Perimeter
An agent is an attack surface that acts. Goal hijack, tool misuse, and memory poisoning are not prompt-injection-with-extra-steps — they are a new perimeter where the payload runs with the agent's privileges. This is the one chapter where practitioner documents lead the research.
- 11. Agent Ops: Running Agents in Production
Production agents need an ops discipline the way services needed SRE — and its founding move is architectural. Make the append-only event log the source of truth and derive the agent loop from it. Auditability, forking, replay, cost control, and context hygiene are not five features you bolt on. They are five consequences of one inversion.
- 12. Self-Improving Agents
Every harness improvement this series described was made by a human. Here are the papers where that stops being true — and the guardrails for a loop that edits itself.