Part I · The Problem · 1 of 12 · AGTO

The Harness Is the Product

A deployed agent's capability is the product of model quality and harness quality — and in 2026, the harness term has the steeper gradient.

Figure 1: The model M (gray) is one block inside a larger harness scaffold H (teal/amber) that pulls feedback in and pushes actions out. The thesis in one picture — capability is what the whole assembly does, not what the inner block knows.
Anchor papers
Yao et al. (2022) · Li et al. (2026) · Zhang et al. (2026) · Xiong et al. (2026)

§1 · The 1MB question

Two teams deploy the same frontier model. One ships an agent that resolves most of its tasks; the other ships one that stalls, loops, and quietly fails — on the same tasks, with the same weights, against the same tools. Nothing about the model varied. Something else did almost all of the work, and most agent teams cannot say precisely what it was.

We can now put a number on the gap. In a controlled scaling study, Zhang et al. (2026) held the base model fixed and held the raw budget fixed — same token allowance, same number of tool calls — and changed only how the surrounding system converted that budget into usable feedback. Task success moved from 0.27 to 0.90. More than a tripling, with not a single weight changed. The raw expenditure was identical on both ends; what differed was the quality of the loop wrapped around the model.

The most extreme version of this result is almost insulting: a frontier model inside a poor loop can be beaten by something that does no reasoning at all. We hold that punchline for the next article — it is the wedge that cracks open the whole evaluation problem.

[→ 2] The Evaluation Crisis — the replay-script result, and what it means that a recorded action sequence can rival a reasoning agent on a popular benchmark.

For now the point is narrower and sharper. The thing wrapped around the model — the part that assembles context, routes tools, keeps state, verifies outputs, decides when to retry, and orchestrates the whole multi-step episode — is not plumbing. It is where most of the deployed capability lives, and it is engineerable. Call it the harness.

So here is the organizing claim of this series, stated plainly enough to be wrong: the capability of a deployed agent factors into two terms, the model M and the harness H (see Figure 1), and in 2026 the harness term has the steeper gradient. Spending a month improving H buys more deployed capability than waiting for the next M. That is falsifiable: if harness changes could not move end-to-end success much with the model held constant, the claim would be dead. The 0.27-to-0.90 result is the first test it has to survive, and it survives comfortably.

This article does one thing: it defines H precisely — its anatomy, its scaling law, and the evidence that it, not the model, drives much of what we call "agentic" performance. The other eleven articles in this series build on that definition. We are not touring frameworks. We are naming a discipline.

Key Takeaway 1

Holding the model and the raw compute budget fixed and changing only the harness moved task success from 0.27 to 0.90 (Zhang et al., 2026). The variable that moves deployed capability the most right now is not the model — it is the system wrapped around it. Treat that system, the harness, as the product.

§2 · Anatomy of a harness

Before we can engineer a thing, we have to draw it. The harness is the closed loop that turns a stateless next-token predictor into an agent that acts over many steps. Strip it to its irreducible parts and six show up every time. They are the decomposition the rest of this series reuses, so we define them once, carefully.

First, the notation. We will be deliberate about symbols because later articles inherit them.

Symbol | Meaning
M | The model (weights θ). One block.
H | The harness: the six components below.
E | The environment / tool-world the agent acts in.
τ | A trajectory: the multi-step episode.
s, a, o, r | State, action, observation, reward at each step.
Σ | The skill/rule store: memory that outlives one episode.
Cfb | Effective feedback compute: the verified feedback the harness extracts per task.

A trajectory τ is a sequence of steps: from state s the model emits an action a (a tool call or a token), the environment E returns an observation o, and the harness updates its state and memory before the next step. Zhang et al. (2026) formalize exactly this — a run as τ = {(s, a, o, u)} over a horizon set by the harness's stopping rule — which tells you that the loop, not the single forward pass, is the unit of analysis.
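To make the notation concrete, here is a minimal sketch of a trajectory as a data structure. It follows the s, a, o, r symbols from the notation table; the class names `Step` and `Trajectory` are our own illustrative choices, not taken from Zhang et al. (2026), whose tuple carries a fourth element u rather than r.

```python
from dataclasses import dataclass, field
from typing import Any, List


@dataclass
class Step:
    """One step of an episode, using the notation table's symbols."""
    s: Any          # state the harness presented to the model
    a: Any          # action the model emitted (a tool call or a token)
    o: Any          # observation returned by the environment E
    r: float = 0.0  # reward; 0.0 when the environment gives none


@dataclass
class Trajectory:
    """A trajectory tau: the ordered list of steps in one episode."""
    steps: List[Step] = field(default_factory=list)

    def append(self, s: Any, a: Any, o: Any, r: float = 0.0) -> None:
        self.steps.append(Step(s, a, o, r))

    def total_reward(self) -> float:
        return sum(step.r for step in self.steps)

    def __len__(self) -> int:
        return len(self.steps)


# Hypothetical two-step episode: a failed test run, then a passing one.
tau = Trajectory()
tau.append(s="task: fix test", a="run_tests()", o="1 failed", r=0.0)
tau.append(s="1 failed", a="apply_patch()", o="all passed", r=1.0)
```

The point of the structure is the one the paragraph makes: the object you log, score, and analyze is the whole `Trajectory`, not any single forward pass.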

The six components of H partition the work inside that loop (see Figure 2):

  • Context assembly — what goes into the prompt this turn: instructions, retrieved facts, prior observations, trimmed to fit.
  • Tool routing — choosing which tool or model call to make, and how the result is delivered back (inline in the context, or by reference).
  • State & memory (Σ) — what is retained across steps and across episodes: verified facts, failed attempts, task constraints.
  • Verification — checking whether an action's result is correct or sufficient before the agent builds on it.
  • Revision — what the agent does when verification fails: retry, repair, replan.
  • Orchestration — the controller that sequences all of the above and decides when to stop.
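The six components can be sketched as one small loop. This is an illustrative skeleton under our own assumptions, not an implementation from any of the cited papers: `assemble_context`, `run_episode`, the toy model, and the toy tools are all hypothetical names, and the "model" is a hard-coded function standing in for a real model call.

```python
from typing import Callable, Dict, List, Optional, Tuple

Tool = Callable[[str], str]


def assemble_context(task: str, memory: List[str], max_items: int = 3) -> str:
    """Context assembly: the task plus the most recent memory entries."""
    return "\n".join([f"task: {task}"] + memory[-max_items:])


def run_episode(
    model: Callable[[str], Tuple[str, str]],
    tools: Dict[str, Tool],
    verify: Callable[[str, str], bool],
    task: str,
    max_steps: int = 5,
) -> Tuple[Optional[str], List[str]]:
    memory: List[str] = []                         # state & memory (Sigma)
    for _ in range(max_steps):                     # orchestration: sequencing and stopping rule
        context = assemble_context(task, memory)   # context assembly
        tool_name, arg = model(context)            # the model proposes an action
        tool = tools.get(tool_name)                # tool routing
        if tool is None:
            memory.append(f"failed: unknown tool {tool_name}")
            continue
        observation = tool(arg)
        if verify(task, observation):              # verification
            return observation, memory
        # Revision: record the failure so the next turn can replan.
        memory.append(f"failed: {tool_name}({arg}) -> {observation}")
    return None, memory


def toy_model(context: str) -> Tuple[str, str]:
    """Hypothetical stand-in for a model: switches tools after a recorded failure."""
    task = context.splitlines()[0][len("task: "):]
    tool_name = "add" if "failed: concat" in context else "concat"
    return tool_name, task


TOOLS: Dict[str, Tool] = {
    "concat": lambda arg: arg.replace("+", ""),
    "add": lambda arg: str(sum(int(x) for x in arg.split("+"))),
}


def toy_verify(task: str, observation: str) -> bool:
    return observation == str(sum(int(x) for x in task.split("+")))


answer, memory = run_episode(toy_model, TOOLS, toy_verify, "2+3")
```

In this toy run the first action fails verification, the failure is written to memory, and the second turn succeeds only because that memory entry reaches the model through context assembly. Remove any one of the six parts and the episode no longer recovers.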

[→ 5] the memory layer (Σ) and [→ 6] the tool interface each earn a full article later in the series; here we hold them at the level of the loop.

This is not an idiosyncratic list. The founding pattern, ReAct (Yao et al., 2022), is the minimal harness: interleave reasoning and acting so that reasoning traces update the plan while actions gather information from the environment. Every harness since is an elaboration of that loop. And the field's first survey converges on a near-identical anatomy: Li et al. (2026) name agent-harness engineering as an independent system layer "whose engineering quality drives a large share of real-world reliability," and propose a seven-layer taxonomy (ETCLOVG) that extends earlier six-component frameworks by promoting observability and governance to first-class concerns. When a problem this practical produces the same decomposition from two independent directions, the decomposition is probably real.

The cleanest evidence that these components are separable comes from treating them as a ladder. Zhang et al. (2026) define seven harness families, H0 through H6: H0 is the bare model answering in one pass; H1 adds verification; H2 adds tool routing; H3 adds stateful memory; H5 and H6 combine routing, verification, and memory into an iterative closed loop. Each rung adds one of our six components, and — crucially — H4 is a deliberate trap: it spends a larger raw budget under weak routing, verification, and memory, isolating raw expenditure from useful feedback. We will see in §3 that H4 is where the naive "just spend more" intuition goes to die.

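[Figure 2 diagram. Left panel, "Bare model (H0)": prompt → M → one answer, no feedback loop. Right panel, "Model + harness (H)": M wrapped by context assembly, tool routing, verification, revision, memory (Σ), and orchestration, exchanging s, a, o, r with the environment E (tools, files).]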
Figure 2: The harness anatomy the series reuses. Left, a bare model (H0) answers once with no feedback loop; right, the same model wrapped by six components — context assembly, tool routing, verification, revision, memory (Σ), orchestration — running a closed s → a → o → r loop against the environment E. Components named after the survey's taxonomy (Li et al., 2026) and the H0–H6 families of Zhang et al. (2026).
Key Takeaway 2

A harness is a closed loop with six separable parts: context assembly, tool routing, state/memory (Σ), verification, revision, and orchestration. Two independent sources — the founding ReAct loop (Yao et al., 2022) and the field's first survey (Li et al., 2026) — converge on this anatomy, and the H0–H6 family ladder (Zhang et al., 2026) shows the parts stack one at a time. This is the vocabulary; the rest of the series speaks it.

§3 · Harness scaling laws

Here is the result that turns harness engineering from craft into discipline: the harness has a scaling law, and its coordinate is not the one everyone reaches for. Performance does not scale with how much you spend. It scales with effective feedback compute — the verified feedback the harness extracts per task (Zhang et al., 2026).

The setup matters. Most test-time-scaling analyses parameterize an agent by raw expenditure: tokens, tool calls, wall-clock, dollars. Zhang et al. (2026) show that coordinate is almost useless. Across controlled tasks, raw tokens and raw tool calls explain only limited variation in success — coefficients of determination of 0.33 and 0.42. A strong multivariate baseline that combines several raw signals reaches 0.88. But their feedback coordinate — which credits a feedback event only when it is informative, valid, non-redundant, and actually retained for later decisions — reaches 0.94, and 0.99 once normalized by how much feedback the task demands. The same coordinate, estimated at trace time without an oracle, still predicts a held-out, prospective batch at 0.85. The signal is the feedback, not the spend.
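The crediting rule in that paragraph (informative, valid, non-redundant, retained) can be sketched as a simple counter. This is a deliberately crude illustration: the `FeedbackEvent` fields and the count-based score are our assumptions, and the estimator Zhang et al. (2026) actually use is not reproduced here.

```python
from dataclasses import dataclass
from typing import Iterable, Set


@dataclass(frozen=True)
class FeedbackEvent:
    content: str       # what the loop learned, e.g. a test failure message
    informative: bool  # does it bear on the task at all?
    valid: bool        # is it actually true of the environment?
    retained: bool     # did it survive into memory for later decisions?


def effective_feedback(events: Iterable[FeedbackEvent]) -> int:
    """Count events that pass all four filters (an assumed proxy for Cfb)."""
    credited: Set[str] = set()
    for event in events:
        redundant = event.content in credited
        if event.informative and event.valid and event.retained and not redundant:
            credited.add(event.content)
    return len(credited)


# Hypothetical trace: six raw feedback events, only two of which earn credit.
events = [
    FeedbackEvent("test_x fails: KeyError", True, True, True),    # credited
    FeedbackEvent("test_x fails: KeyError", True, True, True),    # redundant
    FeedbackEvent("lint ok", False, True, True),                  # uninformative
    FeedbackEvent("cache says pass", True, False, True),          # invalid
    FeedbackEvent("test_y fails: TypeError", True, True, False),  # not retained
    FeedbackEvent("test_y fails: TypeError", True, True, True),   # credited
]
raw_events = len(events)
c_fb = effective_feedback(events)
```

Here the raw count is six and the credited count is two. A raw-expenditure coordinate would see six units of activity; the feedback coordinate sees two.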

We will write that coordinate as Cfb and carry it through the series. The companion quantity is harness efficiency — how much effective feedback a harness extracts per unit of raw budget — which on its own explains success variation at 0.97 while raw cost explains almost none. This is what the H4 "high-budget-but-noisy" family was built to expose: pour in more tokens and tool calls under weak verification and memory, and the raw budget rises while Cfb does not, so success does not move. Spending is not the lever. Conversion is.
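A small worked example shows why the H4 pattern fails. We assume, for illustration only, that harness efficiency is the plain ratio of credited feedback to raw budget; the numbers are invented and the exact definition in Zhang et al. (2026) may differ.

```python
def harness_efficiency(c_fb: float, raw_budget: float) -> float:
    """Assumed definition: effective feedback extracted per unit of raw budget."""
    if raw_budget <= 0:
        raise ValueError("raw_budget must be positive")
    return c_fb / raw_budget


# Hypothetical runs: (credited feedback, raw budget in tool calls).
tight_loop = (12.0, 100.0)
noisy_h4_style = (12.0, 400.0)  # four times the spend, same credited feedback

eff_tight = harness_efficiency(*tight_loop)
eff_noisy = harness_efficiency(*noisy_h4_style)
```

The noisy run quadruples the budget while Cfb stays at 12, so its efficiency falls to a quarter of the tight loop's. If success tracks Cfb, the extra spend bought nothing.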

The sharpest demonstration is a matched-budget intervention: fix the raw cost and the number of tool calls, improve only feedback quality, and success climbs from 0.27 to 0.90 (Zhang et al., 2026). That is the same number from §1, now with a mechanism attached. It is the empirical content of "the harness has the steeper gradient": the budget was pinned, the model was pinned, and the harness still tripled the outcome by converting that fixed budget into durable, task-sufficient feedback (see Figure 3).

[Figure 3 plot. x-axis: effective feedback compute (C_fb); y-axis: task success, rising from 0.27 to 0.90. Annotations: raw compute, R² ≈ 0.33–0.42; C_fb coordinate, R² up to 0.99; multivariate raw baseline, 0.88.]
Figure 3: Success scales with effective feedback compute, not raw spend. Raw tokens/tool calls are weak predictors (R² ≈ 0.33–0.42); the feedback coordinate predicts at 0.94, up to 0.99 task-normalized (Zhang et al., 2026). Endpoints 0.27 and 0.90 are the matched-budget intervention — same model, same raw budget. Why it matters: it tells you which dial to turn.

One caution keeps this honest. Changing a harness is not the same as improving it. Lin et al. (2026) separate two capabilities in self-evolving agents — the capacity to produce a persistent harness update, and the capacity to actually benefit from one — and find they come apart: a model's base task-solving skill does not reliably predict either. So Cfb is not just a knob to crank; it is a measurement you have to take. Instrument the feedback your loop actually retains, or you will ship harness "improvements" that move nothing.
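One way to take that measurement is a paired comparison on the same task set before and after a harness update. The sketch below is our own minimal version, not the protocol of Lin et al. (2026); `update_benefit` and the outcome lists are hypothetical.

```python
from typing import Dict, List


def update_benefit(before: List[bool], after: List[bool]) -> Dict[str, float]:
    """Compare per-task outcomes on the same tasks before and after an update."""
    if len(before) != len(after) or not before:
        raise ValueError("need two non-empty outcome lists of equal length")
    n = len(before)
    fixed = sum((not b) and a for b, a in zip(before, after))   # newly solved
    broken = sum(b and (not a) for b, a in zip(before, after))  # newly failed
    return {
        "success_before": sum(before) / n,
        "success_after": sum(after) / n,
        "fixed": fixed,
        "broken": broken,
        "net": fixed - broken,
    }


# Hypothetical outcomes: the update changes two results but helps on net zero.
before = [True, False, False, True, False]
after = [True, True, False, False, False]
report = update_benefit(before, after)
```

In this example the update visibly changed behavior (one task fixed, one broken) while the success rate did not move. That is an update without a benefit, and only the paired view makes it visible.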

Key Takeaway 3

The harness scaling coordinate is effective feedback compute (Cfb), not raw expenditure: feedback quality predicts success at up to 0.99 while raw tokens/tool calls sit near 0.33–0.42 (Zhang et al., 2026). Spending more under a weak loop (the H4 trap) buys nothing. And because harness updating is not harness benefit (Lin et al., 2026), Cfb is something you measure, not just something you tune.

§4 · What actually drives performance

If the harness has the steeper gradient, then a lot of what we credit to "the model being smart" is really the harness doing work. Two recent results make that uncomfortably concrete by ablating sophisticated machinery down to simple mechanism.

The first is about orchestration. Elaborate multi-agent systems — coordinators spawning specialists, voting, debating — are usually credited with their own gains. Xi et al. (2026) ask what is actually doing the work and isolate a single mechanism they call heavy thinking: a two-stage pipeline of parallel reasoning followed by summarization that can run beneath any harness. Across domains it consistently outperforms Best-of-N sampling, and with stronger models it approaches Pass@N. The deflationary reading is the important one: much of what intricate orchestration buys is reducible to one simple harness-level pattern, not to the cleverness of the system diagram.
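The two-stage shape is simple enough to sketch in a few lines. This is an illustration of the pattern as described above, not the HeavySkill implementation: `heavy_think` is a hypothetical name, and both `toy_reason` and `toy_summarize` are stand-ins for what would be real model calls. In particular, the toy summarizer just picks the most common final answer, whereas a real one would read and reconcile the drafts.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def heavy_think(
    question: str,
    reason: Callable[[str, int], str],
    summarize: Callable[[str, List[str]], str],
    n: int = 4,
) -> str:
    """Stage 1: n reasoning drafts in parallel. Stage 2: one summarization pass."""
    with ThreadPoolExecutor(max_workers=n) as pool:
        drafts = list(pool.map(lambda i: reason(question, i), range(n)))
    return summarize(question, drafts)


def toy_reason(question: str, seed: int) -> str:
    """Hypothetical sampled model call: one draft in four reaches a wrong answer."""
    final = "41" if seed % 4 == 3 else "42"
    return f"draft {seed}: working through '{question}', so the answer is {final}"


def toy_summarize(question: str, drafts: List[str]) -> str:
    """Hypothetical summarizer: reads every draft and returns the dominant answer."""
    finals = [draft.rsplit(" ", 1)[-1] for draft in drafts]
    return Counter(finals).most_common(1)[0][0]


result = heavy_think("what is 6 * 7?", toy_reason, toy_summarize, n=4)
```

Nothing in the function depends on how many agents or roles the surrounding system has, which is the sense in which the pattern can run beneath any harness.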

The second is about retrieval. Conventional wisdom says agentic search wants sophisticated machinery — dense vector retrieval, rerankers, the full retrieval-augmented stack. Sen et al. (2026) ran the comparison on a 116-question long-conversation memory benchmark, across both custom harnesses and provider-native command-line harnesses, and found that a plain lexical primitive — grep over the corpus — generally yielded higher accuracy than vector retrieval. More striking: overall scores depended strongly on which harness and tool-calling style was used, even when the underlying data were held identical. When grep is a native shell tool, they note, the line between "retrieval strategy" and "agent capability" blurs — the agent writes its own search. The retrieval algorithm was not the deciding factor. The harness was (see Figure 4).
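For readers who have not used it as a retrieval tool, the lexical primitive amounts to a line-by-line regular-expression scan. The sketch below is a minimal in-memory version using Python's standard `re` module; the `grep` function signature and the tiny corpus are our own illustration, not the setup from Sen et al. (2026).

```python
import re
from typing import Dict, List, Tuple

Hit = Tuple[str, int, str]  # (document id, 1-based line number, line text)


def grep(pattern: str, corpus: Dict[str, str],
         ignore_case: bool = True, max_hits: int = 20) -> List[Hit]:
    """Return lines matching a regular expression, capped at max_hits."""
    flags = re.IGNORECASE if ignore_case else 0
    rx = re.compile(pattern, flags)
    hits: List[Hit] = []
    for doc_id, text in corpus.items():
        for lineno, line in enumerate(text.splitlines(), start=1):
            if rx.search(line):
                hits.append((doc_id, lineno, line.strip()))
                if len(hits) >= max_hits:
                    return hits
    return hits


# Hypothetical long-conversation memory, one entry per session.
corpus = {
    "session_01": "User: my dog is named Biscuit\nAssistant: noted.",
    "session_02": "User: I moved to Lisbon in March\nUser: Biscuit hates the rain",
}
hits = grep(r"biscuit", corpus)
```

The `max_hits` cap is a harness decision, not a retrieval one: it bounds how much of the context window a single search can consume, which is one way the harness shapes the score on identical data.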

Observed performance | Usually credited to | What the ablation shows is doing the work | Source
Gains from multi-agent orchestration | ⚠️ a clever system design | ✅ one harness-level pattern (parallel reasoning, then summarization) beating Best-of-N | Xi et al. (2026)
Strong agentic search | ⚠️ sophisticated vector retrieval | ✅ plain lexical grep generally wins; harness + tool-calling style swings scores on identical data | Sen et al. (2026)
Marginal task success | ❌ a bigger model / more tokens | ✅ effective feedback compute: verified feedback per task | Zhang et al. (2026)
Figure 4: The deflationary audit. For three results where capability is reflexively credited to the model or to algorithmic sophistication, the controlled comparison points back to a harness mechanism. Why it matters: it tells you where to spend the next engineering week. (Audit-table style carried from [← A10].)

Neither result says the model is irrelevant — Xi et al. (2026) are explicit that heavy thinking is also a skill internalized in the weights, and a better model raises every harness's ceiling. The claim is narrower and more useful: at the margin, the lever that is cheap to pull and large in effect is the harness. That is exactly what you would predict if Cfb, not raw model capacity, governs the marginal task.

Key Takeaway 4

When you ablate "agentic" performance, simple harness mechanisms keep falling out: a two-stage thinking pattern explains much of multi-agent orchestration's edge (Xi et al., 2026), and plain grep with the right tool-calling style beats sophisticated retrieval (Sen et al., 2026). The model is not irrelevant — but the cheap, high-leverage marginal move is almost always the harness.

§5 · The runtime frontier

If the harness is real engineering, where is it heading? The honest answer is a spectrum, defined by one question: where does the harness logic live — in hand-written code, or in learned weights? Today almost all of it is code. The frontier is the slow migration of harness functions into the model itself (see Figure 5).

At the near end, harness logic is explicit and the engineering goal is reuse. AAFLOW (Sarker et al., 2026) is representative: a catalog of scalable, composable workflow patterns for agentic systems, so that orchestration is assembled from known-good structures rather than rebuilt each time. This is the harness as software architecture — versioned, testable, shared.

At the far end sits the speculative endpoint: dissolve the boundary entirely. Neural Computers (Xiong et al., 2026) — from a Meta team with Schmidhuber — propose folding computation, memory, and I/O into a single learned runtime state, aiming to "make the model itself the running computer," distinct from today's agents that act over external environments. As an early step they show a learned runtime, trained only on interaction traces, can acquire interface primitives like I/O alignment and short-horizon control. They are candid about what does not work yet: routine reuse, controlled updates, and symbolic stability remain open. The components of H we drew in §2 are, on this view, scaffolding the model will eventually internalize — but not soon, and not reliably.

Why expect the migration to favor systems at all, rather than just bigger monoliths? Liao et al. (2026), in an ICML position paper, make the steelmanned case. Grounding the argument in the No Free Lunch theorem, they argue via theoretical derivation that agentic systems (routing that generalizes to directed-acyclic-graph topologies) achieve exponentially better generalization and sample efficiency than monolithic scaling on heterogeneous real-world task distributions. It is a position, not a measurement, and it leans on idealized assumptions. But it gives the harness-first stance a principled spine: structure is not a workaround for small models; it is how you cover a task distribution no single model covers well.

[Figure 5 diagram. A spectrum from "harness logic in hand-written code" to "absorbed into learned weights": reusable patterns (AAFLOW, Sarker '26); today's agent (mostly code); learned runtime (Neural Computers, Xiong '26; open: reuse, controlled updates, stability). Why systems over monoliths: Liao et al. (2026), position.]
Figure 5: The runtime spectrum — where harness logic lives. From explicit, reusable code patterns (Sarker et al., 2026) to a fully learned runtime (Xiong et al., 2026), with the systems-over-monolith argument (Liao et al., 2026) as the directional case. Why it matters: it tells you what stays code (most of it, for now) and what is research, not roadmap.
Key Takeaway 5

Harness logic lives on a spectrum from hand-written code to learned weights. The near end — reusable workflow patterns (Sarker et al., 2026) — is shippable engineering today; the far end — a learned runtime that absorbs the harness (Xiong et al., 2026) — is a real research frontier with reuse and stability still open. The position case for systems over monoliths (Liao et al., 2026) says the migration has a principled reason, not just a pragmatic one.

§6 · The prompt layer as engineering

The harness component closest to the model is the prompt — the context assembled each turn. It is also the one most often left to taste. It does not have to be. The prompt layer can be optimized with the same trace data the harness already generates.

ContraPrompt (Rishav et al., 2026) is a clean instance of the idea. Its observation: when a model fails a task and then succeeds on a retry with feedback, the difference between the two chains of thought is an optimization signal that single-trace methods throw away. Because the two traces share the same model, the same input, and the same base prompt, what remains when you diff them is precisely the reasoning strategy that worked plus the error feedback that triggered it. They call this dyadic reasoning-trace analysis, structure the retry phase as an instrumented agentic loop that manufactures the contrastive pairs automatically — no human annotation — and distill the difference into input-aware prompt rules (see Figure 6).
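The pairing idea can be illustrated with a plain sequence diff. To be clear about what this is not: ContraPrompt distills the difference into input-aware prompt rules, and that distillation step is not shown here. The sketch only isolates the steps that appear in the successful retry and not in the failed attempt, using the standard library's `difflib`; `trace_delta` and both traces are hypothetical.

```python
import difflib
from typing import List


def trace_delta(failed: List[str], succeeded: List[str]) -> List[str]:
    """Return the steps that appear only in the successful retry, in order."""
    matcher = difflib.SequenceMatcher(a=failed, b=succeeded, autojunk=False)
    added: List[str] = []
    for tag, _i1, _i2, j1, j2 in matcher.get_opcodes():
        if tag in ("insert", "replace"):
            added.extend(succeeded[j1:j2])
    return added


# Hypothetical pair: same model, same input, same base prompt.
failed_trace = [
    "read the question",
    "assume units are metres",
    "multiply width by height",
    "answer: 12",
]
retry_trace = [
    "read the question",
    "feedback: inputs are in feet",
    "convert feet to metres",
    "multiply width by height",
    "answer: 1.11",
]
delta = trace_delta(failed_trace, retry_trace)
```

The shared steps drop out and what remains is the error feedback, the strategy it triggered, and the changed answer. That residue is the raw material a rule-mining step would work from.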

The mechanism is less important than the move. The prompt stops being a static artifact you hand-tune and becomes a learned output of the harness's own failure-and-retry traffic. That is the harness improving itself at the layer nearest the model — and it is exactly the kind of update Lin et al. (2026) warn must be measured for benefit, not just produced.

[Figure 6 diagram. Same model, same input, same base prompt. Trace A fails (chain of thought); trace B succeeds on retry (chain of thought + feedback). Their dyadic difference Δ yields an input-aware prompt rule. Pairs are generated by an instrumented retry loop, with no human annotation.]
Figure 6: ContraPrompt's dyadic trace analysis (Rishav et al., 2026). Diffing a failed trace against a successful retry — same model, input, and base prompt — isolates the working strategy as a reusable, input-aware prompt rule. Why it matters: the prompt layer becomes a learned output of the harness, not a hand-tuned artifact.
Key Takeaway 6

The prompt is a harness component, and it is optimizable from the loop's own traces. ContraPrompt (Rishav et al., 2026) diffs failed-then-succeeded reasoning traces to mine input-aware rules with no human annotation — turning the layer nearest the model from taste into engineering. Just measure that the learned rules actually help (Lin et al., 2026).

§7 · The discipline's trajectory

A field crystallizes when it gets its first survey. The survey is the signal that practitioners have built enough of the same thing, in enough places, that the patterns are worth naming and organizing rather than rediscovering. Agent-harness engineering just got that signal (see Figure 7).

The arc is short and steep. In 2022, ReAct (Yao et al.) gave the bare loop — reason, act, observe, repeat — almost as an afterthought to a paper about reasoning. By 2026 the loop had grown six components, a scaling law, and a literature: Li et al. (2026) survey the discipline, propose a taxonomy, and map 170-plus open-source projects onto it. The thing that was glue code in 2022 is a system layer with its own engineering evolution — prompt engineering, then context engineering, then harness engineering — in 2026.

[→ C1] The survey-as-crystallization argument — why the appearance of a field's first survey is a leading indicator that a discipline has formed — is developed in the long-horizon series.

Extrapolate carefully and the job follows the artifact. If Cfb is the metric that governs deployed performance, someone owns it. By 2028 the plausible role is a harness engineer whose work is exactly the six components: instrumenting context assembly and memory, measuring verified feedback per task, and treating tool routing and the revision loop as tuned subsystems with dashboards — not as code someone wrote once and forgot. The model will keep improving on its own schedule. The harness improves on yours.

[Figure 7 timeline. 2022: ReAct, the bare loop (Yao et al.). 2026: first survey, 170+ projects (Li et al.); the field names itself. 2028: "harness engineer", projected role.]
Figure 7: A discipline crystallizing. The loop ReAct sketched in 2022 (Yao et al.) became, by 2026, a named field with a survey and 170+ mapped projects (Li et al., 2026); the projected 2028 node is extrapolation, drawn dashed. Why it matters: the surface you under-invest in today is the role you hire for in two years.
Key Takeaway 7

Harness engineering went from an afterthought in a 2022 reasoning paper (Yao et al.) to a surveyed discipline with 170+ catalogued projects by 2026 (Li et al., 2026). First surveys are crystallization signals [→ C1]. The metric that governs deployed performance — Cfb — will get an owner; plan for the role before you need it.

The harness-design checklist

Everything above reduces to one reusable artifact. If you read nothing else, read this. The harness is six components; for each, there is a thing to build and a number to instrument so you can tell whether you improved it (Cfb) rather than merely changed it (see Figure 8, then Figure 9).

[Figure 8 reference card, with M at the center. 1 · context assembly: budget & context-rot log. 2 · tool routing: route-correctness rate. 3 · memory (Σ): repeat-error reduction. 4 · verification: check coverage. 5 · revision: success-on-retry. 6 · orchestration: measures C_fb across the loop.]
Figure 8: The harness-anatomy reference card — the six components and the one number to instrument in each. Orchestration owns the end-to-end metric, effective feedback compute (Cfb, Zhang et al., 2026). Why it matters: it is the design surface, on one page.
Component | What to build | What to instrument | Grounded in
Context assembly | What enters the prompt each turn; trimming policy | Token budget used; "context rot" as result sets crowd the window | Sen et al. (2026)
Tool routing | Tool selection; inline vs. by-reference result delivery | Route-correctness rate; cost of wrong routes | Sen et al. (2026); Yao et al. (2022)
State & memory (Σ) | Compact store of verified facts, failed attempts, constraints | Reduction in repeated errors across steps/episodes | Zhang et al. (2026, H3)
Verification | Checks on each action before the agent builds on it | Fraction of actions verified; escaped-error rate | Zhang et al. (2026, H1)
Revision | Retry / repair / replan on verification failure | Success-on-retry; trace pairs for prompt mining | Rishav et al. (2026); Zhang et al. (2026, H5–H6)
Orchestration | Sequencing + stopping rule for the whole loop | Effective feedback compute Cfb: verified, non-redundant, retained feedback per task (and benefit, not just change) | Zhang et al. (2026); Lin et al. (2026)
Figure 9: The harness-design checklist. Build the left column; instrument the middle column; the right column is where each metric is grounded. Why it matters: it converts "improve the harness" into six measurable subsystems and one north-star metric (Cfb).
Key Takeaway 8 — the artifact

Engineer the harness as six instrumented subsystems, not as glue code. For each — context assembly, tool routing, memory (Σ), verification, revision, orchestration — build the mechanism and track the metric in Figure 9. The one number that ties them together is effective feedback compute (Cfb): verified, non-redundant, retained feedback per task. Improve that, measure that you improved it, and the deployed agent improves on your schedule rather than the model vendor's.

What comes next

This article assumed we can tell a good harness from a bad one. We mostly cannot, and that is the next crisis. If the harness, not the model, drives most deployed capability, then a benchmark that scores the model in isolation measures the wrong thing, and a benchmark that scores the whole agent can be gamed by mechanisms that have nothing to do with intelligence. The replay-script result we deferred in §1 is the clean demonstration: a recorded action sequence, doing no reasoning at all, can outscore a frontier agent on a popular task. Article 2 takes that result apart and asks what an honest agent evaluation would even look like once you accept that the harness is the product.

[→ 2] The Evaluation Crisis — why current agent benchmarks measure the wrong term, and how the replay-script result exposes it.

References

  1. Zhang, X., Wang, D., Xu, K., Zhu, Q., & Che, W. (2026). Scaling Laws for Agent Harnesses via Effective Feedback Compute. Preprint. arXiv:2605.29682.
  2. Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., & Cao, Y. (2022). ReAct: Synergizing Reasoning and Acting in Language Models. International Conference on Learning Representations (ICLR), 2023. arXiv:2210.03629.
  3. Li, J., Xiao, X., Zhang, Y., Liu, C., et al. (2026). Agent Harness Engineering: A Survey. Preprint (project page: Awesome-Agent-Harness).
  4. Lin, M., Wu, J., Wang, Z., et al. (2026). Harness Updating Is Not Harness Benefit: Disentangling Evolution Capabilities in Self-Evolving LLM Agents. Preprint. arXiv:2605.30621.
  5. Xi, X., Li, X., Wang, W., & Cai, X. (2026). HeavySkill: Heavy Thinking as the Inner Skill in Agentic Harness. Preprint. arXiv:2605.02396.
  6. Sen, S., Kasturi, A., Lumer, E., Gulati, A., & Subbiah, V. K. (2026). Is Grep All You Need? How Agent Harnesses Reshape Agentic Search. Preprint. arXiv:2605.15184.
  7. Sarker, A. K., Staylor, M., Alsaadi, A., von Laszewski, G., & Jha, S. (2026). AAFLOW: Scalable Patterns for Agentic AI Workflows. Preprint. arXiv:2605.02162.
  8. Xiong, Y., Yang, Y., Tian, Y., Shi, Y., Chandra, V., & Schmidhuber, J. (2026). Neural Computers. Preprint. arXiv:2604.06425.
  9. Liao, J., Li, S., Wen, M., Wang, J., & Zhang, W. (2026). Position: Agentic AI System Is a Foreseeable Pathway to AGI. Proceedings of the 43rd International Conference on Machine Learning (ICML), PMLR 306. arXiv:2605.12966.
  10. Rishav, R., Pujari, P., & Rastogi, P. (2026). ContraPrompt: Contrastive Prompt Optimization via Dyadic Reasoning Trace Analysis. Preprint. arXiv:2604.17937.