Contents

The series, in order.

Twelve essays in four parts, each building on the last. Read it front to back, or pull whatever thread you like.

I. The Problem: Name the discipline; expose the measurement crisis.
  1. The Harness Is the Product

    A deployed agent's capability is the product of model quality and harness quality — and in 2026, the harness term has the steeper gradient.

    AGTO · 20 min · Yao et al. (2022) · Li et al. (2026) · Zhang et al. (2026) · Xiong et al. (2026)
  2. The Agent Evaluation Crisis

    Agent benchmark numbers measure demos, not the capability teams think they bought — and the fix is a measurement regime borrowed from deep RL, not a better leaderboard.

    EVAG · 23 min · D'Oro et al. (2026) · Kapoor et al. (2026)
II. The Engine: Build the agent's training loop (environments, learning, memory, tools).
  3. Environments Are the Bottleneck

    Agentic RL is not blocked on algorithms or compute. It is blocked on environments — and the teams that industrialize environment supply will own agent training the way data-pipeline teams owned supervised learning.

    ARAG · 26 min
  4. Agents That Learn on the Job

    Deployed agents are amnesic by default. The deployment already generates the experience stream continual learning always wanted — and the system-space half of the loop is shippable today.

    CLAG · 27 min · OpenClaw-RL · Wang et al. (2026) · Memento-Skills (2026) · Tracking · Sutton, Koop & Silver (2007) · Continual Learning Bench · Asawa et al. (2026)
  5. The Memory Stack

    Agent memory is not one problem but a stack — working, episodic, semantic, procedural. Most production failures are layer-confusion: solving one layer's problem with another layer's tool.

    MEAG · 26 min · Jia et al. (2026) · AI Hippocampus · Park et al. (2023) · Generative Agents · Packer et al. (2023) · MemGPT · Behrouz et al. (2025) · Titans
  6. Tools, Skills, and the Action Interface

    Agent capability leaks at the action interface: the model knows a tool is needed and fails to use it, and the dominant protocol taxes every turn. The real question is whether the tools→skills evolution is engineered to compound — or to collapse.

    TOAG · 23 min · ReAct · Yao et al. (2023) · Knowing-Doing Gap · Cheng et al. (2026) · MCP Tax · Sadani & Kumar (2026) · AgenticQwen · Lyu et al. (2026)
III. The Build: Navigate the hard problems (planning, coordination, the proving ground).
  7. Planning and the Myopia Problem

    A reasoning model's chain of thought looks like a plan: it weighs futures, considers options, deliberates. Extract the search tree behind it and the deliberation turns out to be theater — the model expands deep branches and then chooses by the shallow ones. Where the stakes justify the cost, the fix is not a longer chain of thought. It is search you move outside the model.

    AGEV · 25 min · Chen et al. (2026) · Search Trees · Sunkaraneni et al. (2026) · Boosting Weak Reasoners · Tsoukalas et al. (2026) · Formal Proof Search
  8. Multi-Agent Systems and Their Failure Modes

    A multi-agent system fails in ways a single agent cannot — its diversity collapses, its blame becomes untraceable, its coordination cost outgrows the work. The systems that survive do not fix the org chart. They make coordination something the system learns or something it pays for.

    MAAG · 23 min · Beyond Individual Intelligence · Qi et al. (2026) · Multiagent Debate · Du et al. (2023) · Diversity Collapse · Chen et al. (2026) · Conductor · Nielsen & Cetin et al. (2025)
  9. Software Engineering Agents: The Proving Ground

    Software engineering is where agents grew up — the only domain that handed them verifiable rewards, endless environments, and expert oversight all at once. The platform layer is settled now. What is still open — cost, coordination, deployment at scale — is the playbook every other domain inherits next.

    SWAG · 26 min · OpenHands · Wang et al. (2024) · SWE-bench · Jimenez et al. (2023) · RADAR · Meta (2026) · Token Spend · Bai et al. (2026)
IV. The Deployment Frontier: Ship it (security, operations, and autonomous improvement).
  10. Securing the Agentic Perimeter

    An agent is an attack surface that acts. Goal hijack, tool misuse, and memory poisoning are not prompt-injection-with-extra-steps — they are a new perimeter where the payload runs with the agent's privileges. This is the one chapter where practitioner documents lead the research.

    SEAG · 30 min · AgentDojo · Debenedetti et al. (2024) · OWASP Top 10 for Agentic Apps (2025) · CaMeL · Debenedetti et al. (2025) · Opus 4.7 System Card · Anthropic (2026)
  11. Agent Ops: Running Agents in Production

    Production agents need an ops discipline the way services needed SRE — and its founding move is architectural. Make the append-only event log the source of truth and derive the agent loop from it. Auditability, forking, replay, cost control, and context hygiene are not five features you bolt on. They are five consequences of one inversion.

    AGME · 23 min · The Log Is the Agent · Nakajima (2026) · How Do AI Agents Spend Your Money? · Bai et al. (2026) · FrugalGPT · Chen et al. (2023) · Event-Driven Design · Falconer / Confluent (2025)
  12. Self-Improving Agents

    Every harness improvement this series has described was made by a human. Here are the papers where that stops being true, along with the guardrails for a loop that edits itself.

    AGCL · 24 min · Zhang et al. (2026) · Memento-Team (2026) · Hebbar et al. (2026) · Yang et al. (2026)