Agent Ops: Running Agents in Production

§1 · The $400 incident review

The page comes in around 3 a.m. An agent you left running on a routine task — a task that should have cost about forty cents — has spent $400 overnight and produced nothing. By the time anyone looks, it has been turning the same crank for hours: read the file, re-plan, call the model, fail to make progress, append the failure to a context that only ever grows, and start again. Each turn was individually cheap. There were a great many turns. The artifact never arrived; the bill did.

The expensive part is not the $400. The expensive part is the review the next morning, when someone asks you the obvious questions and you discover you cannot answer a single one. Which run was it? You have a billing total and a chat transcript that scrolled off the top. What was the first event that tipped it into the loop? Unknown — the transcript shows the symptom, not the cause. Can you reproduce it? Not deterministically; re-running the prompt takes a different path. What did it actually cost, broken down by step? You have one number for the whole run. Did it touch production data on the way? You genuinely do not know. What you have is a story and a number, with nothing inspectable in between.

Services lived through this exact lesson a decade earlier. When a web service fell over at 3 a.m., the discipline that grew up to answer those questions was not heroics — it was Site Reliability Engineering, and its substrate was telemetry you could replay: structured logs, distributed traces, metrics, the ability to reconstruct precisely what happened and when. Agents are now where services were before SRE: powerful, deployed, and operationally blind. This chapter is about the discipline that has to grow up around them, and its claim is that the discipline has a single founding move, and that the move is architectural rather than procedural.

Here is the thesis, stated plainly enough to be wrong. Make the append-only event log the source of truth and derive the agent loop from it. Do that, and auditability, forking, replay, cost control, and context hygiene stop being five separate features you bolt onto a running system and become five consequences of one inversion. The falsifiable version: if you could get deterministic replay, blame assignment, per-step cost attribution, and safe what-if forking by adding observability after the fact to an ordinary conversation-loop agent, this chapter would be wrong. The argument is that you cannot — that those properties are structural, and that a logging layer bolted on at the end can never reconstruct what an architecture organized around the log gives you for free. The motif above (see Figure 1) is the whole idea in one picture: the agent is a projection of its log.

This is also, frankly, the missing chapter. When the harness was first mapped [← 1], the survey of its components left a slot labeled operations mostly empty. Memory, tools, planning, evaluation, and multi-agent coordination each got their own treatment; how to actually run the resulting machine in production, and answer for it when it misbehaves, was deferred. This is that deferred chapter. We reuse the harness notation throughout (M for the model and H for the harness around it [← 1]) and, in §2, introduce the two new objects on which the rest of the argument rests.

Key Takeaway 1

The cost of an incident is not the dollar figure; it is the questions you cannot answer afterward. Agents today are where web services were before SRE — deployed and operationally blind. The chapter's thesis is one falsifiable architectural claim: make the append-only event log the source of truth and derive the agent from it, and replay, blame assignment, cost attribution, and safe forking follow from that single inversion rather than from five features bolted on later.

§2 · Event-sourcing the agent

Most agent frameworks are built around the language model. A conversation loop comes first; then tools are attached to it; then rules and guardrails; and finally a logging layer is bolted on for observability, with state persisted as retrievable “memory” that the loop queries when it needs to remember. Nakajima (2026) observes that this ordering is exactly backwards for everything you later need operationally, and proposes inverting it. The runtime, called ActiveGraph, is organized so that the append-only event log is the source of truth; the working graph the agent reasons over is a deterministic projection of that log; and behaviors — ordinary functions, classes, model-backed routines, or logic attached to typed edges — react to changes in the graph and emit new events. No component instructs another. Coordination happens entirely through the shared graph, which is itself just the current reading of the log.

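Two new objects carry the rest of the chapter: the log L and its projection G. Fix them now, alongside the inherited notation and the two operations the log enables.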

M: the model (weights and forward pass) [← 1].

H: the harness (prompts, tools, memory) [← 1].

L: the append-only event log, the source of truth.

G = π(L): the working graph, a deterministic projection of the log.

replay: recompute G from L, reproducing any run exactly.

fork: branch L at an event without re-running the prefix.

The reason to pay this architectural price is that the inversion yields three properties that a retrieval-and-summarization memory system cannot give you, no matter how good its retrieval is (see Figure 2). First, deterministic replay: because G = π(L) and the projection π is deterministic, any run can be recomputed from its log alone — the answer to “can you reproduce it?” becomes yes, by construction. Second, cheap forking: you can branch a run at any event without re-executing the shared prefix, because the prefix is already recorded — the answer to “what if we had stopped it here?” becomes a fork, not a re-run. Third, end-to-end lineage: every artifact traces from the high-level goal down to the individual model call that produced it, because each of those steps was an event — the answer to “which step, and what did it cost, and did it touch production?” becomes a query against L. Each of the §1 questions you could not answer maps to exactly one of these properties. That is not a coincidence; it is the point.
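
The three properties can be made concrete with a minimal sketch. It is not ActiveGraph's API: the `Event` type, the `project` fold, and the `fork` helper are hypothetical names chosen for illustration, and the "graph" is reduced to a plain dictionary.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class Event:
    kind: str      # hypothetical kinds: "goal", "model_call"
    payload: dict


def project(log: list) -> dict:
    """pi(L): fold the log into working state G. Pure: no I/O, no clock, no randomness."""
    graph: dict[str, Any] = {"goal": None, "steps": [], "tokens": 0}
    for ev in log:
        if ev.kind == "goal":
            graph["goal"] = ev.payload["text"]
        elif ev.kind == "model_call":
            graph["steps"].append(ev.payload["output"])
            graph["tokens"] += ev.payload["tokens"]
    return graph


def fork(log: list, at: int) -> list:
    """Branch at event index `at`: copy the recorded prefix, re-execute nothing."""
    return list(log[:at])


log = [
    Event("goal", {"text": "fix the failing test"}),
    Event("model_call", {"output": "read file", "tokens": 120}),
    Event("model_call", {"output": "re-plan", "tokens": 300}),
]

replayed = project(log)  # replay: recompute G from L alone
branch = fork(log, 2) + [Event("model_call", {"output": "stop and ask", "tokens": 40})]
what_if = project(branch)  # the "what if we had stopped it here?" run
```

Lineage in this toy is the per-event token count: because each model call is its own event, the total in `G` can always be broken back down by step. The original `log` is untouched by the fork, which is what makes the branch cheap.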

Figure 2: The inversion (Nakajima, 2026). Left: the model-first default — the loop is primary and logging is the last thing added, so cause is not recoverable. Right: log-first — the append-only log L is the foundation, the working graph G is a deterministic projection π(L), and behaviors react and emit new events back to L. Replay, forking, and lineage fall out of the shape, not from an observability add-on. Redrawn from the paper's described architecture; box layout is illustrative.

There is one honest cost, and it is worth stating because it is the thing to build first rather than last. Replay is only sound if the nondeterministic inputs to a run — tool outputs, model samples, clocks, randomness — are themselves recorded as events. Nakajima's architecture includes what the paper calls a determinism contract: the discipline that every external effect enters the system through the log, so that the run becomes a pure function of L. Get that contract right and replay is exact; get it wrong — let a tool reach out to the network without recording what came back — and replay quietly diverges from the run it claims to reproduce. The contract is the real engineering work of event-sourcing an agent. It is also the reason this cannot be retrofitted: you cannot record, after the fact, an effect you never captured.
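
One way to picture the contract is a wrapper through which every external effect must pass. The sketch below is an assumption about how such a wrapper could look, not the paper's implementation; `Recorder`, `effect`, and the two stand-in effects are hypothetical names.

```python
import random
from typing import Callable, Optional


class Recorder:
    """Routes every external effect through the log (a sketch of the determinism contract)."""

    def __init__(self, log: Optional[list] = None, replaying: bool = False):
        self.log = log if log is not None else []
        self.replaying = replaying
        self._cursor = 0

    def effect(self, name: str, fn: Callable[[], object]) -> object:
        if self.replaying:
            # Replay: never touch the outside world, return what was recorded.
            recorded_name, value = self.log[self._cursor]
            self._cursor += 1
            if recorded_name != name:
                raise RuntimeError(f"replay diverged: expected {recorded_name}, got {name}")
            return value
        value = fn()                    # live: perform the effect once
        self.log.append((name, value))  # and record what came back
        return value


def run(rec: Recorder) -> str:
    # Both calls are nondeterministic stand-ins for a model sample and a tool output.
    sample = rec.effect("model_sample", lambda: random.choice(["plan A", "plan B"]))
    reading = rec.effect("tool:read_temp", lambda: random.randint(0, 100))
    return f"{sample} at {reading}"


live = Recorder()
first = run(live)
again = run(Recorder(log=live.log, replaying=True))  # same result, no randomness consulted
```

The failure mode described above corresponds to calling `random` or the network directly inside `run` instead of through `effect`: the live run still works, and the replay silently returns something else.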

One downstream payoff deserves a name here because a sister chapter raised it as an open problem. A multi-agent system fails in a way single agents cannot — blame becomes untraceable, because no one recorded which agent's which step produced the bad handoff [← 8]. Event-sourcing dissolves that problem rather than mitigating it: if every handoff is an event in a shared log, the offending step is found by reading, not guessed by re-running. The blame question that had no answer becomes a query.

[← 8] Multi-Agent Systems and Their Failure Modes — named “untraceable blame” as a failure mode with no fix absent recorded history. Event-sourcing is that history: the recorded handoff turns attribution from guesswork into a read.
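
As a rough illustration of "blame becomes a query", assume each handoff is logged with its sender, its receiver, and the outcome of some validation check. The field names and the `ok` flag are hypothetical; how to judge a handoff as bad is exactly the part that remains open.

```python
# Hypothetical shared log; in a real system these would be read from storage.
shared_log = [
    {"seq": 1, "kind": "handoff", "from": "planner", "to": "coder", "ok": True},
    {"seq": 2, "kind": "tool_result", "agent": "coder", "ok": True},
    {"seq": 3, "kind": "handoff", "from": "coder", "to": "reviewer", "ok": False},
    {"seq": 4, "kind": "handoff", "from": "reviewer", "to": "planner", "ok": False},
]


def first_bad_handoff(log):
    """Return the earliest handoff event that failed its recorded check, or None."""
    for ev in sorted(log, key=lambda e: e["seq"]):
        if ev["kind"] == "handoff" and not ev["ok"]:
            return ev
    return None


culprit = first_bad_handoff(shared_log)
```

Here the later failure at `seq` 4 is downstream of the one at `seq` 3, and the read finds the earlier one without re-running anything.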

Key Takeaway 2

The inversion is the whole discipline in one move: L first, G = π(L) derived. It buys deterministic replay, cheap forking, and end-to-end lineage — the three answers a retrieval-memory system structurally cannot give — and each maps to a question you could not answer at 3 a.m. The price is the determinism contract: every external effect must enter through the log, which is exactly why it must be built first and cannot be bolted on later.

§3 · The cost meter

You cannot govern what you cannot attribute, and the $400 began as a number with no breakdown. The first operational artifact the log gives you is a meter. Bai et al. (2026), in the first systematic study of token consumption in agentic coding, analyzed trajectories from eight frontier LLMs on SWE-bench Verified and asked three questions that read like an ops runbook: where do agents spend their tokens, which models are more token-efficient, and — the one that matters most for production — can an agent predict its token usage before it executes. The first question is a lineage question; “where” only has an answer if spend is attributable to events. The third is the precondition for a budget guard: if you can forecast the spend, you can cap it before it runs away.
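
A minimal sketch of a meter, assuming each model call is logged with its step name, model, and token counts. The price list and event fields are made up for illustration and are not taken from Bai et al.

```python
from collections import defaultdict

# Hypothetical price list in USD per million tokens; real prices vary by provider.
PRICE_PER_MTOK = {
    "big": {"in": 10.0, "out": 30.0},
    "small": {"in": 0.5, "out": 1.5},
}

events = [
    {"step": "plan", "model": "big", "in": 2_000, "out": 500},
    {"step": "read_file", "model": "small", "in": 8_000, "out": 200},
    {"step": "plan", "model": "big", "in": 6_000, "out": 500},
]


def cost_usd(ev: dict) -> float:
    """Price a single model-call event from its token counts."""
    price = PRICE_PER_MTOK[ev["model"]]
    return (ev["in"] * price["in"] + ev["out"] * price["out"]) / 1_000_000


def cost_by_step(log: list) -> dict:
    """Attribute spend to steps: the 'where' question answered as a fold over events."""
    totals = defaultdict(float)
    for ev in log:
        totals[ev["step"]] += cost_usd(ev)
    return dict(totals)


breakdown = cost_by_step(events)
```

The second `plan` event costs more than the first because its input has grown, which is the signature of the only-ever-growing context from §1 showing up as a per-step number rather than a single total.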

Once you can meter, the literature gives you three distinct levers for spending less, and they are easy to confuse, so separate them cleanly (see Figure 3).

Do not just add calls. The naive ops instinct under uncertainty is to spend more inference: sample the model again, add another reviewer, vote. Chen et al. (2024), in Are More LM Calls All You Need?, studied the simplest version of this (majority-vote aggregation, optionally screened by a model filter) and showed that the scaling can be non-monotonic: accuracy can first rise and then fall as the number of calls increases. The mechanism is that a real workload mixes easy and hard queries; more calls lift the easy ones and drag the hard ones down, and past a task-specific peak the aggregate turns over. Past that peak, spending more does not buy reliability; it buys cost and, eventually, worse answers.
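
The easy-versus-hard mechanism can be reproduced with a toy calculation. The sketch assumes two answer options, independent calls, and a made-up workload (60% easy queries answered correctly 85% of the time, 40% hard queries answered correctly 40% of the time); none of these numbers come from the paper.

```python
from math import comb


def majority_correct(p: float, k: int) -> float:
    """P(majority of k independent calls is right), two answer options, k odd."""
    need = k // 2 + 1
    return sum(comb(k, i) * p**i * (1 - p) ** (k - i) for i in range(need, k + 1))


def workload_accuracy(k: int, easy_frac=0.6, p_easy=0.85, p_hard=0.40) -> float:
    """Accuracy of k-call majority voting on a mixed workload (assumed parameters)."""
    easy = majority_correct(p_easy, k)   # rises toward 1 as k grows
    hard = majority_correct(p_hard, k)   # falls toward 0 as k grows
    return easy_frac * easy + (1 - easy_frac) * hard


curve = {k: round(workload_accuracy(k), 3) for k in (1, 3, 5, 11, 101)}
```

Under these assumptions accuracy climbs for the first few calls, because the easy queries saturate quickly, and then drifts down toward the easy fraction as voting entrenches the wrong answer on the hard ones.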

Route. Ong et al. (2024), with RouteLLM, learn a router from human preference data that sends each query to either a stronger or a weaker model depending on what the query needs. The reported result is a cost reduction of over 2× in certain cases without compromising response quality, and the routers keep their performance when the strong and weak models are swapped at test time. Routing is the cheapest structural win available: most traffic does not need the expensive model, and a learned policy can tell which.
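
The shape of a routing decision is small enough to sketch. This is not RouteLLM's interface: `route` and `toy_win_prob` are hypothetical, and the keyword heuristic stands in for a router that would in practice be learned from preference data.

```python
def route(query: str, win_prob, threshold: float = 0.5) -> str:
    """Send the query to the strong model only when the router predicts it is needed."""
    return "strong" if win_prob(query) >= threshold else "weak"


def toy_win_prob(query: str) -> float:
    """Stand-in for a learned router: estimated chance the strong model is needed."""
    hard_markers = ("prove", "refactor", "debug")
    return 0.9 if any(marker in query.lower() for marker in hard_markers) else 0.1


queries = ("What is the capital of France?", "Debug this race condition")

# Each decision is kept as a record so it can be appended to the log as an event.
decisions = [{"query": q, "model": route(q, toy_win_prob)} for q in queries]
```

The `threshold` is the cost knob: raising it sends more traffic to the weak model, and the recorded decisions are what let a later review check whether that trade was worth it.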

Cascade. Chen et al. (2023), with FrugalGPT, gave the founding result of this whole sub-field. They first note the economic fact that makes it matter: LLM-API prices differ by two orders of magnitude across providers. Their cascade learns which combination of models to try per query — cheap first, escalating only when needed — and the headline is striking: it can match the strongest single model (GPT-4 in their study) with up to a 98% cost reduction, or, run the other direction, raise accuracy over GPT-4 by 4 points at the same cost.
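
A cascade differs from a router in that it looks at the cheap model's answer before deciding to escalate. The sketch below is in the spirit of that idea only; the stub models, the scorer, the costs, and the thresholds are all hypothetical, whereas FrugalGPT learns its scoring function and thresholds from data.

```python
def cascade(query, tiers, accept_score):
    """Try models cheapest-first; stop at the first answer the scorer accepts."""
    spent = 0.0
    trail = []  # one record per attempt, suitable for appending to the log
    answer = None
    for name, model, cost, threshold in tiers:
        answer = model(query)
        spent += cost
        score = accept_score(query, answer)
        trail.append({"model": name, "score": score, "cost": cost})
        if score >= threshold:
            break
    return answer, spent, trail


def cheap_model(query):
    return "4" if query == "2+2" else "not sure"


def strong_model(query):
    return "4" if query == "2+2" else "a careful answer"


def scorer(query, answer):
    """Stand-in for a learned reliability score in [0, 1]."""
    return 0.2 if answer == "not sure" else 0.95


# The last tier uses threshold 0.0 so the cascade always returns something.
tiers = [("cheap", cheap_model, 0.001, 0.8), ("strong", strong_model, 0.10, 0.0)]

easy = cascade("2+2", tiers, scorer)
hard = cascade("design a schema", tiers, scorer)
```

The easy query stops at the cheap tier; the hard one pays for both attempts, which is the overhead a cascade accepts in exchange for never calling the strong model on traffic the cheap one handles.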

The ops point unifies the three. Each lever is only governable through the log. A budget guard is not a feature of the model; it is a behavior that reacts to spend events and trips a kill-switch when a forecast or a running total crosses a threshold. A routing decision — which model, why, and what it actually cost — is itself an event you want recorded, so that the next incident review can ask “why did this query go to the expensive model?” and get an answer instead of a shrug. Cost attribution is just lineage measured in dollars; that cost is itself a first-class evaluation metric, not an afterthought [← 2]. The meter is not separate from the log. It is the log, read with a price list.
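
A budget guard written as a behavior might look like the sketch below. The event shape, the `BudgetGuard` class, and the one-dollar cap are assumptions for illustration; a production guard would more likely pause or fork the run than raise an exception.

```python
class BudgetExceeded(Exception):
    """Raised when the running total crosses the cap (the kill-switch)."""


class BudgetGuard:
    """Watches spend events and trips once the running total exceeds a cap."""

    def __init__(self, cap_usd: float):
        self.cap_usd = cap_usd
        self.total = 0.0
        self.tripped = False

    def on_event(self, event: dict) -> None:
        if event.get("kind") != "spend":
            return  # the guard ignores everything except spend events
        self.total += event["usd"]
        if self.total > self.cap_usd:
            self.tripped = True
            raise BudgetExceeded(f"spent ${self.total:.2f} > cap ${self.cap_usd:.2f}")


guard = BudgetGuard(cap_usd=1.00)
turns_run = 0
kill_reason = None
try:
    for turn in range(1000):  # stand-in for the runaway loop of §1
        guard.on_event({"kind": "spend", "usd": 0.40})
        turns_run += 1
except BudgetExceeded as stop:
    kill_reason = str(stop)
```

The guard knows nothing about the model or the task. It only reads events, which is why it can be added, tuned, or replayed against an old log without touching the agent.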

Figure 3: The inference-economics frontier. Cascading (FrugalGPT — up to 98% cost reduction at parity) and routing (RouteLLM — over 2× cheaper, no quality loss) move you down the cost axis at fixed quality, toward the frontier. Naively adding calls walks you right — more cost — and, past a task-specific peak, down: the non-monotone scaling of Chen et al. (2024). Point positions are illustrative; the percentage and multiple are exact from the cited papers.

Key Takeaway 3

Metering is an ops primitive, and it lives in the log: spend is governable only when it is attributable to events. Three levers, cleanly separated — cap calls (Chen et al., 2024, non-monotone), route (Ong et al., 2024, over 2× cheaper), cascade (Chen et al., 2023, up to 98% cheaper at parity). A budget guard is a behavior that reacts to spend events and trips a kill-switch; the routing decision and its realized cost are themselves events. Cost attribution is lineage with a price list [← 2].

§4 · Context hygiene as ops

The runaway loop in §1 was, among other things, a context-hygiene failure: a context that only ever grew, until the model was reasoning over a transcript dominated by its own failures. The reflexive fix is to prune — drop stale content to keep the window lean. The literature says: sometimes, and you had better know which regime you are in, because the same intervention that rescues one deployment wrecks another.

Start with the framing. Shen et al. (2026), in The Efficiency Frontier, argue that context-reduction methods — retrieval, memory compression, and the rest — have been evaluated on quality and on cost independently, which makes them impossible to compare and impossible to choose between for a real deployment. Their proposal is to score them jointly, on a single cost–performance frontier, so the question stops being “does pruning help?” in the abstract and becomes “at what cost, for what quality, on this workload?” That reframing is what turns context management from a modeling tweak into an operational decision with a defensible answer.
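
Scoring methods jointly amounts to asking which of them sit on the cost-performance frontier. The sketch below computes a Pareto frontier over (cost, quality) pairs; the method names and numbers are invented for illustration and are not results from Shen et al.

```python
def pareto_frontier(methods):
    """Keep methods that no other method beats on both cost and quality."""
    frontier = []
    for name, cost, quality in methods:
        dominated = any(
            c <= cost and q >= quality and (c < cost or q > quality)
            for _, c, q in methods
        )
        if not dominated:
            frontier.append((name, cost, quality))
    return sorted(frontier, key=lambda m: m[1])  # cheapest first


# (name, relative cost, task quality): made-up numbers, for shape only.
methods = [
    ("full_context", 1.00, 0.80),
    ("retrieval", 0.30, 0.76),
    ("compression", 0.45, 0.70),
    ("masking", 0.55, 0.82),
]

frontier = pareto_frontier(methods)
```

With these numbers two methods drop out because something else is both cheaper and at least as good. The remaining choice between the frontier points is the operational decision: how much quality a given workload is worth paying for.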

Then the sharp empirical result that shows why the framing matters. Zhang et al. (2026) studied the most common pruning move in long-horizon search agents — masking stale observations, paging out retrieved content the agent has moved past — across backbones from 4B to 284B parameters, three retrievers, and both offline and live-web benchmarks. The gain from masking is not a constant; it traces an asymmetric inverted-U against the model's own accuracy without any context management (see Figure 4). Under weak retrievers it is a plateau — masking barely matters, because the retrieved content was not helping anyway. It peaks when a strong retriever meets a mid-capacity model — there, dropping content the model has stopped attending to frees capacity that the model can actually use. And it collapses, sharply, when the model is already saturated — masking now removes signal the strong model would have used, and accuracy falls. Mechanistically, the authors characterize masking as a token-for-turn trade-off: it buys room for more turns by giving up tokens the model had largely stopped reading.

Figure 4: When pruning helps — and when it backfires (Zhang et al., 2026). The accuracy gain from masking stale observations is an asymmetric inverted-U: a plateau under weak retrievers, a peak where a strong retriever meets a mid-capacity model, and a sharp collapse below zero once the model is saturated. The regime is a property of your retriever and model, swept across 4B–284B backbones — not a universal best practice. Curve shape redrawn from the reported regime map; axes are qualitative.

Read operationally, this is a warning against exactly the reflex that the runaway loop invites. “Always prune the context” is a policy that lands you in the saturated-model regime where masking collapses; “never prune” is the only-ever-grows context that produced the $400. The right amount of pruning is a tuned knob, and which way to turn it depends on measurable properties of your stack. So context hygiene is not a one-time configuration; it is an operational variable you tune per deployment and, crucially, log. Record what was masked and when, as events, and an incident review can replay the same run with and without the cut and see which one looped. That is only possible because memory here is a projection of the log rather than a lossy summary of it [← 5]: a retrieval-and-summarization memory throws away the very history a forensic replay would need.
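
Logging the cut as an event can be sketched as follows. The event kinds and the `build_context` projection are hypothetical, and the projection is simplified to the final context window rather than the window at each turn.

```python
def build_context(log, apply_masks=True):
    """Project the log into a context window; mask events hide earlier observations."""
    masked = set()
    if apply_masks:
        masked = {ev["target"] for ev in log if ev["kind"] == "mask"}
    return [
        ev["text"]
        for ev in log
        if ev["kind"] == "observation" and ev["id"] not in masked
    ]


log = [
    {"kind": "observation", "id": 1, "text": "search result A"},
    {"kind": "observation", "id": 2, "text": "search result B"},
    {"kind": "mask", "target": 1, "reason": "stale"},  # the cut, recorded
    {"kind": "observation", "id": 3, "text": "search result C"},
]

with_cut = build_context(log)
without_cut = build_context(log, apply_masks=False)
```

Nothing is deleted from the log, so the masked observation is still there for a review to inspect, and the same run can be projected both ways to compare.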

Key Takeaway 4

Context management is an ops knob, not a fixed setting. Shen et al. (2026) make it a joint cost–performance decision; Zhang et al. (2026) show the payoff from masking is an inverted-U — a plateau under weak retrievers, a peak at strong-retriever-plus-mid-model, a collapse once the model is saturated. “Always prune” and “never prune” are both wrong defaults. Log what you masked, as events, so the cut is itself replayable [← 5].

§5 · The integration substrate

Everything so far concerns one agent. Production rarely stops there: a real deployment is a fleet, embedded in systems that already exist. The good news is that the architecture this chapter argues for has a name enterprises have been running for years — the event bus — and the inversion generalizes to it cleanly. Confluent's practitioner guide (Falconer, 2025) lays out event-driven design for agents and multi-agent systems: agents as producers and consumers on a shared event backbone, coordinating by publishing and subscribing rather than by calling each other directly. The guide is prescriptive industry guidance, not an empirical study, but the architecture it describes is precisely the one large organizations already use to move orders and payments, and that is the point.

Event-driven design and event-sourcing are the same principle at two scales (see Figure 5). Nakajima's append-only log is the single-agent case; the enterprise event bus is that log generalized to many agents and many services. The defining property carries over exactly: no agent instructs another, coordination happens through the shared, recorded stream of events — which is the same property that makes blame traceable in a multi-agent system [← 8]. An organization that already runs a bus does not need to adopt a novel agent framework to get auditable agents; it needs to treat its agents as first-class participants on the substrate it already trusts.
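
The coordination pattern can be shown with an in-memory stand-in for a bus. This is a toy, not the API of any real broker: the `Bus` class, the topic names, and the two agents are hypothetical.

```python
from collections import defaultdict


class Bus:
    """In-memory stand-in for an event bus: every publish is logged before delivery."""

    def __init__(self):
        self.log = []
        self.subscribers = defaultdict(list)

    def subscribe(self, topic, handler):
        self.subscribers[topic].append(handler)

    def publish(self, topic, payload, source):
        event = {"seq": len(self.log), "topic": topic, "source": source, "payload": payload}
        self.log.append(event)  # recorded first, so the log is the source of truth
        for handler in list(self.subscribers[topic]):
            handler(event)


bus = Bus()


def researcher(event):
    """Reacts to questions; knows nothing about who consumes its findings."""
    question = event["payload"]["question"]
    bus.publish("findings", {"summary": f"notes on {question}"}, source="researcher")


def writer(event):
    """Reacts to findings; never calls the researcher directly."""
    bus.publish("drafts", {"text": event["payload"]["summary"].upper()}, source="writer")


bus.subscribe("questions", researcher)
bus.subscribe("findings", writer)
bus.publish("questions", {"question": "token costs"}, source="user")
```

Neither agent holds a reference to the other, yet `bus.log` contains the full chain from the user's question to the draft, in order, with the producer of each event named.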

The bus also answers the cost curve of §3 structurally. A fleet where everything is published as events is a fleet where you can put the cheapest competent model on each task and audit the result. Lyu et al. (2026), with AgenticQwen, train small agentic models for exactly the industrial regime — strict cost and latency constraints — using multi-round reinforcement learning with what they call dual data flywheels: a reasoning flywheel that raises task difficulty by learning from the model's own errors, and an agentic flywheel that grows linear workflows into multi-branch behavior trees. The relevance to ops is direct: small competent models are the supply side of routing. Put them on the bus, send the easy, high-volume traffic to them, reserve the expensive model for what genuinely needs it — and because every routing decision is an event, the fleet's spend is as auditable as one agent's run.

Figure 5: The same principle at fleet scale. Event-driven design (Falconer, 2025) is event-sourcing generalized: agents publish to and subscribe from a shared append-only bus rather than calling each other. A small-model tier (AgenticQwen — Lyu et al., 2026) supplies the cheap end of routing, and because every decision is an event, fleet spend is as auditable as a single run. Topology is illustrative; the architecture is from the cited sources.

Key Takeaway 5

Event-driven design (Falconer, 2025) and event-sourcing are one principle at two scales: the single-agent log generalizes to the enterprise event bus, and the “no agent instructs another” property is what makes a fleet traceable [← 8]. The bus is also the cost answer — small competent models (AgenticQwen; Lyu et al., 2026) supply the cheap end of routing, and every routing decision being an event keeps fleet spend as auditable as one run.

§6 · The rollout path

The half of agent ops that is not architecture is change management, and it is the half that decides whether the architecture ever gets used. Anthropic's deployment guide (2026), Deploying Claude across your organization, lays out a five-level maturity model and a sequence for getting there: get started, drive adoption at scale, then level up, with concrete use cases and timelines drawn from how teams inside the company actually work. It is a practitioner source — prescriptive guidance from the lab, not a controlled study — and read that way, the sequencing carries one ops lesson worth extracting (see Figure 6).

You do not earn the higher rungs until the lower-rung infrastructure exists. The top of any agent-maturity ladder is autonomous, multi-step agents acting in production with real authority; the bottom is a human running a model in a chat box. The ops claim is that what gates the climb between them is not model capability — that is usually available off the shelf — but operational readiness: the logs, the budgets, the kill-switches that make it safe to let capability run unattended. An organization that climbs the capability ladder faster than it builds the event log is the organization that gets the $400 page — except at the higher rungs the number has more zeros and the blast radius includes production data, because the agent now has the authority to touch it. Maturity, read operationally, is not how capable your agents are. It is how much of what they do you can reconstruct after the fact.

Figure 6: The rollout ladder. The rungs are the guide's own five-level maturity model (Anthropic, 2026); the green gates between them are this chapter's overlay — the ops capability you must have in place before each climb is safe. You do not earn unattended, production-authority agents (Level 5) until logs, budgets, replay, and a fleet bus exist below them. Level count is from the source; the ops gates are the article's synthesis, not the guide's labels.

Key Takeaway 6

Agent ops is half architecture, half change management. Anthropic's deployment guide (2026) frames adoption as a five-level maturity model sequenced from getting started to scaling to leveling up. The operational reading: capability is rarely the gate — operational readiness is. Climb the capability ladder faster than you build the event log and you get the $400 page with more zeros and production data in the blast radius.

§7 · The agent-ops checklist

Here is the artifact to keep on the wall (see Figure 7). Every operational requirement raised in this chapter reduces to the same move — record it as an event, early — and the table below makes that reduction explicit, row by row, so it doubles as a pre-deployment checklist. Hold it against any agent before it ships: for each requirement, name the 3 a.m. question it answers, the event-sourcing move that delivers it, and — the column that keeps you honest — the residual risk that survives the move. A green mechanism is not a green residual. Replay still needs the determinism contract; cost attribution still needs metering wired into events; context audits are still regime-dependent. The thesis, in checklist form, is that there is no row whose mechanism is anything other than “put it in the log.”

Ops requirement (the 3 a.m. question)	The event-sourcing move that delivers it	Residual risk
Incident replay “reproduce the exact run”	✅ deterministic replay from the append-only log `L` (Nakajima, 2026)	⚠️ sound only if the determinism contract holds — tool I/O and model samples must be recorded
Cost attribution “what did this cost, and which step”	✅ per-event token accounting; cost is lineage with a price list (Bai et al., 2026)	⚠️ requires metering wired into every event; forecasts are approximate
Budget guard / kill-switch “stop the runaway loop”	✅ a behavior that reacts to spend events and trips on a threshold (Bai et al., 2026)	⚠️ guard granularity — a coarse cap can fire late or kill useful work
Forking / what-if “branch without re-running the prefix”	✅ cheap fork at any event; the shared prefix is already recorded (Nakajima, 2026)	✅ falls out of the log directly — lowest-cost property to obtain
Blame assignment “which agent, which handoff failed”	✅ every handoff is a recorded event — found by reading, not guessing [← 8]	⚠️ attribution method beyond “which step” is still open research [← 8]
Context audit “is pruning helping or hurting?”	⚠️ log what was masked, as events; replay with and without the cut (Zhang et al., 2026)	❌ the right amount is regime-dependent — must be re-measured per stack
Audit-grade memory “what did it know, and when?”	✅ memory as a projection of `L`, not a lossy summary [← 5]	⚠️ retrieval-and-summarization memory discards the history a replay needs [← 5]
Forensic perimeter “what did the run touch?”	✅ end-to-end lineage — goal down to each model call and external effect [← 10]	⚠️ the record is only as complete as the events you captured [← 10]

Figure 7: The agent-ops checklist — the deployable artifact of this chapter. Each row is an operational requirement, the event-sourcing move that delivers it, and the residual risk that survives. Keep it on the wall: every mechanism column reduces to “record it as an event, early” — and a green move is not a green residual.

Key Takeaway 7

Run the checklist before you ship: logs, budgets, kill-switches, replay, forking, blame, context audits, lineage. Every requirement reduces to one move — event-source it, early — which is why ops cannot be retrofitted onto a conversation-loop agent after an incident. The residual-risk column is the discipline: even with the log, replay needs the determinism contract, metering must be wired in, and the right amount of pruning is regime-dependent. Build the log first; the rest is reading it.

What comes next

Nakajima's paper ends on a deliberately careful note: it discusses, without claiming to demonstrate, why an event-sourced substrate is unusually well suited to self-improving agents. That caution is exactly the right bridge to close on. A system that proposes changes to itself needs, more than anything else, a trustworthy record of what it did and what happened as a result — because the training signal for self-improvement is the log. You cannot safely let an agent rewrite its own harness if you cannot replay the run that motivated the rewrite, attribute the outcome to the change, and roll back deterministically when the change makes things worse. The next chapter takes up self-improving agents directly; the prerequisite it inherits from this one is the append-only log that makes any of that auditable. Build the ops discipline first, and self-improvement becomes a thing you can supervise. Skip it, and self-improvement becomes the most expensive 3 a.m. page you will ever get.

[→ 12] Self-Improving Agents — the training signal for self-improvement is the event log this chapter argued you must build first; an agent cannot safely rewrite itself without replay, attribution, and deterministic rollback.

References

Nakajima, Y. (2026). The Log is the Agent: Event-Sourced Reactive Graphs for Auditable, Forkable Agentic Systems. Preprint. arXiv:2605.21997.
Bai, L., Huang, Z., Wang, X., Sun, J., Mihalcea, R., Brynjolfsson, E., Pentland, A., & Pei, J. (2026). How Do AI Agents Spend Your Money? Analyzing and Predicting Token Consumption in Agentic Coding Tasks. Preprint. arXiv:2604.22750.
Chen, L., Davis, J. Q., Hanin, B., Bailis, P., Stoica, I., Zaharia, M., & Zou, J. (2024). Are More LM Calls All You Need? Towards the Scaling Properties of Compound AI Systems. Preprint. arXiv:2403.02419.
Ong, I., Almahairi, A., Wu, V., Chiang, W.-L., Wu, T., Gonzalez, J. E., Kadous, M. W., & Stoica, I. (2024). RouteLLM: Learning to Route LLMs with Preference Data. Preprint. arXiv:2406.18665.
Chen, L., Zaharia, M., & Zou, J. (2023). FrugalGPT: How to Use Large Language Models While Reducing Cost and Improving Performance. Preprint. arXiv:2305.05176.
Shen, B., Jin, L., Cai, H., Hu, L., & Xin, Y. (2026). The Efficiency Frontier: A Unified Framework for Cost-Performance Optimization in LLM Context Management. Preprint. arXiv:2605.23071.
Zhang, H., Xu, Q., Li, Z., Zhang, L., Jiang, P., Zhang, Y., & McAuley, J. (2026). Masking Stale Observations Helps Search Agents – Until It Doesn't: A Regime Map and Its Mechanism. Preprint. arXiv:2606.00408.
Falconer, S. (2025). A Guide to Event-Driven Design for Agents and Multi-Agent Systems. Confluent (practitioner white paper).
Lyu, Y., Wang, C., Zheng, H., Yue, Y., Yan, J., Wang, M., & Huang, J. (2026). AgenticQwen: Training Small Agentic Language Models with Dual Data Flywheels for Industrial-Scale Tool Use. Preprint. arXiv:2604.21590.
Anthropic. (2026). Deploying Claude across your organization: A practical guide for deploying Claude across your business. Anthropic (practitioner guide).