Tools, Skills, and the Action Interface
Agent capability leaks at the action interface: the model knows a tool is needed and fails to use it, and the dominant protocol taxes every turn. The real question is whether the tools→skills evolution is engineered to compound — or to collapse.
§1 · Where the capability leaks
Give a model a tool it needs. Document it flawlessly. Put it one function call away — and the model will still, on a large fraction of problems it cannot solve unaided, decline to use it. Cheng et al. (2026) put a number on that fraction. They compared what each model should do against what it does across four models on arithmetic and factual question-answering, and found mismatches of 26.5–54.0% on arithmetic and 30.8–41.8% on factual QA. A quarter to a half of the time, the agent had a tool, needed it, and didn't reach for it.
The measurement is careful, and the care is the point. Earlier work treated "does this task need a tool?" as a fixed property a human or judge could label. Cheng et al. make it model-adaptive: a problem a strong model solves in its head genuinely does not need a tool, while the same problem may require one for a weaker model. Necessity is defined per model, grounded in that model's own empirical performance — so the mismatch they report is not a labeling artifact, it is a real gap between capability and behavior.
Then they localize it. Decompose a tool call into two stages: a cognition stage — does the model believe a tool is needed? — and an execution stage — does it actually issue the call? The gap lives disproportionately in execution (see Figure 2). The model knows. It just doesn't act. This is the knowing-doing gap, and naming it precisely matters, because a cognition failure and an execution failure call for opposite fixes.
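The bookkeeping behind that decomposition can be sketched in a few lines. This is a minimal illustration, assuming each evaluation record carries three booleans; the `Record` fields, the `summarize` helper, and the toy data are hypothetical and are not taken from Cheng et al.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Record:
    solved_unaided: bool   # did this model solve the problem with no tool?
    believes_needed: bool  # cognition stage: does the model say a tool is needed?
    called_tool: bool      # execution stage: did it actually issue the call?

def summarize(records):
    """Return the mismatch rate plus a split of missed calls by stage."""
    n = len(records)
    # Model-adaptive necessity (assumed definition): needed iff this model fails unaided.
    mismatches = [r for r in records if (not r.solved_unaided) != r.called_tool]
    missed = [r for r in records if not r.solved_unaided and not r.called_tool]
    return {
        "mismatch_rate": len(mismatches) / n,
        "missed_calls": len(missed),
        "cognition_failures": sum(1 for r in missed if not r.believes_needed),
        "execution_failures": sum(1 for r in missed if r.believes_needed),
    }

toy = [
    Record(True, False, False),   # no tool needed, none used: match
    Record(False, True, True),    # needed, recognized, called: match
    Record(False, True, False),   # needed, recognized, not called: execution failure
    Record(False, False, False),  # needed, not recognized: cognition failure
]
stats = summarize(toy)
```

Splitting missed calls this way is what lets the two failure types be treated differently: a cognition failure points at the model's self-assessment, an execution failure at the path from belief to call.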
To talk about this precisely, fix the series notation [← 1]. An agent is a model M wrapped in a harness H that lets it act on an environment E. The harness assembles context, exposes tools, and carries each action a the model emits across to E, returning an observation o. The action interface is exactly that crossing: the boundary where a leaves H for E and o comes back. Capability that M possesses but cannot push across that boundary is capability lost (see Figure 1) — and the gap above is the first measurement of how much.
The reason-and-act loop that every agent harness descends from is ReAct (Yao et al., 2023): interleave reasoning traces with task actions so that reasoning can induce, track, and update a plan while actions gather information from the world. Demonstrated on HotpotQA and Fever through nothing more exotic than a Wikipedia API, ReAct cut the hallucination and error-propagation that pure chain-of-thought suffered, and produced trajectories a human could actually follow. That loop is the foundation — and the knowing-doing gap is a failure in its act half. The reasoning is fine. The bridge to the world is where it breaks.
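A toy version of that loop, written in the series notation, may make the roles concrete. It is a sketch under stated assumptions: `scripted_model` stands in for M, `run_harness` for H, and `env_step` for E; all three names, the `search:`/`finish:` action format, and the one-entry fact table are invented for illustration and are not ReAct's actual prompts or API.

```python
from typing import Callable

# Toy environment E: a lookup table standing in for an external API.
FACTS = {"capital of France": "Paris"}

def env_step(action: str) -> str:
    """Environment E: execute action a, return observation o."""
    if action.startswith("search:"):
        return FACTS.get(action[len("search:"):].strip(), "no result")
    return "unknown action"

def scripted_model(context: list[str]) -> tuple[str, str]:
    """Stand-in for model M: returns (thought, action) given the context so far."""
    observations = [c for c in context if c.startswith("obs:")]
    if not observations:
        return ("I should look this up.", "search: capital of France")
    answer = observations[-1][len("obs:"):].strip()
    return ("The observation answers the question.", f"finish: {answer}")

def run_harness(model: Callable, question: str, max_steps: int = 5) -> str:
    """Harness H: assemble context, carry each action to E, append the observation."""
    context = [f"question: {question}"]
    for _ in range(max_steps):
        thought, action = model(context)
        context.append(f"thought: {thought}")
        if action.startswith("finish:"):
            return action[len("finish:"):].strip()
        context.append(f"obs: {env_step(action)}")
    return "gave up"

result = run_harness(scripted_model, "What is the capital of France?")
```

The action interface is the single call to `env_step`: everything the model reasons about stays inside the harness until an action string crosses that line.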
So here is the thesis, stated plainly enough to be wrong: agent capability leaks at the action interface; the dominant tool protocol taxes every turn with eager schema injection; and tools are evolving into skills — retrieved, generated, and self-evolving — whose ability to compound depends entirely on how that interface is designed. If interface design were incidental — if you could bolt arbitrarily many tools onto a capable model and watch performance rise monotonically — this article would be wrong. The rest of it is the accounting that shows you cannot.
Capability leaks at the action interface: the boundary where action a crosses from the harness H to the environment E. Cheng et al. (2026) measured the leak: what a model does diverges from what it needs to do on 26.5–54.0% of arithmetic and 30.8–41.8% of factual-QA cases, and the failure lives disproportionately in execution rather than cognition. ReAct (Yao et al., 2023) gave the field its reason-and-act loop; the gap is a failure of its act half.
§2 · The protocol tax
The first thing the interface does to your capability is charge rent on it. The Model Context Protocol (MCP) made tools composable — connect a server, expose its tools, and any agent can call them — and that composability is real and worth keeping. But the default has a cost that is paid on every single turn. Sadani & Kumar (2026), in a practitioner report from Infrrd.ai, give the cost a name: the MCP Tax, or Tools Tax. Its mechanism is precise, and getting the mechanism right matters more than the slogan: MCP's reliance on stateless, eager schema injection imposes a hidden per-turn overhead. Stateless, because the protocol carries no memory of what was already loaded; eager, because the full JSON schema of every connected tool is re-injected into context whether or not this turn could use it.
The size is the surprise. Practitioner reports place the injected payload between roughly 10k and 60k tokens per turn in typical multi-server deployments (Sadani & Kumar, 2026). That is not a one-time setup cost; it is a recurring charge on the context window, paid before the model has reasoned about anything. And it is not only a billing problem. The payload inflates the key-value cache the model must attend over every turn, and — the authors report — reasoning degrades as context utilization climbs toward published fracture points around 70%. So the schema you injected to enable tools begins to crowd out the reasoning that would use them. Capability you paid for, spent against itself (see Figure 3).
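The arithmetic is worth doing once. The sketch below assumes a 128k-token context window and a 20-turn session, neither of which comes from the report; only the 10k–60k payload range and the ~70% region are taken from the text above.

```python
def schema_tax(payload_tokens: int, turns: int, window_tokens: int) -> dict:
    """Schema tokens re-sent over a session, and the share of the window they occupy each turn."""
    return {
        "cumulative_tokens": payload_tokens * turns,
        "window_share": payload_tokens / window_tokens,
    }

WINDOW = 128_000  # assumed context window, in tokens
TURNS = 20        # assumed session length

low = schema_tax(10_000, TURNS, WINDOW)
high = schema_tax(60_000, TURNS, WINDOW)

# Tokens left for everything else before utilization reaches 70% of the window.
headroom_at_high = WINDOW * 70 // 100 - 60_000
```

Under these assumptions the schemas alone occupy about 8% to 47% of the window on every turn, and at the high end fewer than 30k tokens remain for history, retrieved documents, and reasoning before the 70% mark.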
The fix is engineering, not a new protocol — and that distinction is the whole point. Tool Attention, the mechanism Sadani & Kumar propose, generalizes the familiar idea of attention over tokens into gated attention over tools. Three moves: an Intent–Schema Overlap score from sentence embeddings ranks each tool against the current intent; a state-aware gating function refuses tools the current state forbids by enforcing preconditions and access scopes; and a two-phase lazy schema loader keeps a compact summary pool resident and promotes the full JSON schema only for the top-k gated tools. They evaluate it on a simulated 120-tool, six-server benchmark whose per-server token counts are calibrated to public audits of real deployments. The lesson transfers regardless of which protocol you run: do not put schema in front of the model that this turn cannot use. Gate by intent, gate by state, load lazily. None of that argues against MCP; all of it argues against the eager default.
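The three moves compose into a small selection function. What follows is a sketch of the idea, not the authors' implementation: a bag-of-words cosine stands in for the sentence-embedding overlap score, scope sets stand in for preconditions, and the `Tool` class, `select_tools` function, and example tools are all hypothetical.

```python
import math
from collections import Counter
from dataclasses import dataclass, field

@dataclass
class Tool:
    name: str
    summary: str                 # compact line kept resident every turn
    full_schema: dict            # promoted into context only when gated in
    required_scopes: set = field(default_factory=set)

def embed(text: str) -> Counter:
    # Stand-in for a sentence embedding: a bag of lowercase words.
    return Counter(text.lower().split())

def overlap(a: Counter, b: Counter) -> float:
    # Cosine similarity between two bags of words.
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def select_tools(intent: str, tools: list, granted_scopes: set, k: int = 1):
    """Return (resident summaries, full schemas for the top-k permitted tools)."""
    q = embed(intent)
    # Gate by state: drop tools whose scopes the current session lacks.
    permitted = [t for t in tools if t.required_scopes <= granted_scopes]
    # Gate by intent: rank the remainder by overlap with the intent.
    ranked = sorted(permitted, key=lambda t: overlap(q, embed(t.summary)), reverse=True)
    # Design choice in this sketch: one-line summaries stay resident for every tool.
    summaries = [f"{t.name}: {t.summary}" for t in tools]
    # Lazy load: only the top-k tools get their full schema.
    loaded = {t.name: t.full_schema for t in ranked[:k]}
    return summaries, loaded

tools = [
    Tool("search_invoices", "search invoices by customer name", {"type": "object"}, {"billing:read"}),
    Tool("delete_invoice", "delete an invoice by id", {"type": "object"}, {"billing:write"}),
    Tool("get_weather", "get the weather forecast for a city", {"type": "object"}),
]
summaries, loaded = select_tools("find invoices for customer Acme", tools, {"billing:read"}, k=1)
```

With only `billing:read` granted, `delete_invoice` is never a candidate regardless of how well it matches the intent, which is the point of gating by state before ranking by intent.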
The interface charges rent every turn. The MCP Tax is stateless, eager schema injection — the full JSON of every connected tool, re-loaded each turn — at roughly 10k–60k tokens (Sadani & Kumar, 2026): it inflates the KV cache and degrades reasoning as context climbs toward ~70% utilization. The deployable fix is interface engineering, not protocol advocacy: gate by intent (an overlap score), gate by state (preconditions and scopes), and load full schemas lazily for the top-k tools only.
§3 · Acting in stateful worlds
Lowering the tax gets the right tools in front of the model. It does not make the action safe. The standard test-time-compute trick — sample many trajectories, keep the best — quietly assumes a world you can re-roll. In a stateful environment that assumption collapses: a database you can write to, a shell that runs rm, a booking you cannot un-book. Each trial mutates E, and there is no reset. The interface that let you act once will not let you take the action back.
The response is to look before you leap: to model the world rather than query it. Dynamics modelling, DyMo, develops exactly this, and the published series treats it in depth [← A3: The Big World Hypothesis]. The move: during post-training, augment the model with a state-prediction capability alongside function calling, so that before committing an action the model predicts the future state that action would produce, through an internal model of the environment. On the Berkeley Function Calling Leaderboard V2 this improves success rates and reduces hallucinations; folded into self-verification sampling, it raises pass^k over the number of trials k and lets the model refuse outputs it cannot verify (see Figure 4). When you cannot try the world, model it: predict the next state, check the prediction, and decline when the prediction is unreliable. It is the action-interface form of caution.
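The control flow of predict, check, and decline can be sketched as an inference-time wrapper. This is an illustration only: DyMo trains the prediction capability into the model, whereas here `toy_predict` is a hand-written stand-in, and the function names, the confidence threshold, and the seat-booking environment are all assumptions.

```python
from typing import Callable, Optional

def predict_then_verify(
    state: dict,
    candidates: list,
    predict: Callable,      # (state, action) -> (predicted_state, confidence)
    acceptable: Callable,   # predicted_state -> bool
    min_confidence: float = 0.8,
) -> Optional[str]:
    """Pick the first action whose predicted outcome is acceptable and trusted; else refuse."""
    for action in candidates:
        predicted, confidence = predict(state, action)
        if confidence >= min_confidence and acceptable(predicted):
            return action
    return None  # refuse rather than mutate the environment on a guess

def toy_predict(state, action):
    # Hypothetical internal dynamics model for a seat-booking environment.
    if action == "book_seat":
        return {**state, "seats_left": state["seats_left"] - 1}, 0.95
    if action == "cancel_all":
        return {**state, "seats_left": state["capacity"]}, 0.40  # model is unsure here
    return state, 0.0

def no_overbooking(s):
    return s["seats_left"] >= 0

chosen = predict_then_verify({"seats_left": 1, "capacity": 10},
                             ["cancel_all", "book_seat"], toy_predict, no_overbooking)
refused = predict_then_verify({"seats_left": 0, "capacity": 10},
                              ["book_seat"], toy_predict, no_overbooking)
```

Nothing in this loop touches the real environment: every candidate is evaluated against a predicted state, and the only action that ever crosses the interface is the one returned.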
Now the synthesis, and the unique accounting this article exists to offer. Read the interface as a pipeline that capability flows through, and tally what it loses and recovers at each stage (see Figure 5). Capability enters at the model. The schema-injection overhead of §2 takes the first cut: 10k–60k tokens of per-turn pressure, reasoning fraying as utilization nears ~70%. The knowing-doing gap of §1 takes the second: 26.5–54.0% of arithmetic and 30.8–41.8% of factual-QA cases where the model's tool use diverged from what it needed, with the failure concentrated in execution. Dynamics modelling recovers some of the loss, turning blind action into predicted-then-verified action [← A3]. And the skills layer, which the next section builds, recovers more, by amortizing the cost of getting the interface right across every future task instead of re-paying it per turn. The leak is not one bug to swat. It is a pipeline, and each stage has a named, deployable fix.
Stateful worlds break the "sample more, keep the best" trick because each trial mutates E with no reset. Dynamics modelling answers it by predicting then verifying — modelling the next state before committing, and refusing when unsure (DyMo, [← A3]). Account for the whole interface as a pipeline: capability is lost to the schema tax (§2) and the knowing-doing gap (§1), and recovered by dynamics modelling (§3) and the skills layer (§4). The leak is a pipeline; each stage has a fix.
§4 · From tools to skills
The deepest fix changes what crosses the interface. A tool is a capability you wire in; a skill is a procedure the agent keeps — selected, generated, and refined over time, living in the store Σ outside the weights. Skills are procedural memory [← 5]; the move from tools to skills is the move from "what can I call?" to "what have I learned to do?" And it is forced by the same scaling wall as the MCP Tax, one level up.
Su et al. (2026) make the wall explicit. The dominant way to give an agent skills is to enumerate them in the context window — and it fails to scale: as the corpus grows, context budget is consumed and the agent becomes markedly less accurate at identifying the right skill. Their SRA-Bench makes this measurable, with 636 manually constructed gold skills mixed with web-collected distractors into a corpus of 26,262 skills, evaluated across 5,400 capability-intensive instances, and the pipeline decomposed into retrieval, incorporation, and execution. Skill Retrieval Augmentation — retrieving the relevant skills on demand rather than listing all of them — substantially beats enumeration. This is the lesson of §2 again: do not put the whole library in front of the model; retrieve the slice this task needs.
And the benchmark surfaces something that should sound familiar. Su et al. find that current agents "tend to load skills at similar rates, regardless of whether a gold skill is retrieved or whether the task actually requires external capabilities" — the bottleneck, they conclude, lies not only in retrieval but in the base model's ability to determine which skill to load and when. That is the knowing-doing gap of §1, re-appearing at the skill layer: even with the right skill retrieved, deciding to use it is its own failure point. The interface problem does not vanish when tools become skills; it moves up the stack with them.
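Retrieval and the decision to load are two separate steps, and a sketch makes the separation visible. The code below assumes a tiny in-memory corpus and uses Jaccard word overlap in place of a real retriever; `retrieve_skills`, the threshold value, and the skill names are hypothetical and unrelated to SRA-Bench's actual pipeline.

```python
def tokens(text: str) -> set:
    return set(text.lower().split())

def jaccard(a: set, b: set) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def retrieve_skills(task: str, corpus: dict, k: int = 2, load_threshold: float = 0.2) -> list:
    """Rank skills by lexical overlap with the task, then decide which to load."""
    q = tokens(task)
    scored = sorted(((jaccard(q, tokens(desc)), name) for name, desc in corpus.items()),
                    reverse=True)
    top = scored[:k]
    # Separate decision: retrieval proposes, the threshold decides whether to load at all.
    return [name for score, name in top if score >= load_threshold]

corpus = {
    "csv_dedupe": "remove duplicate rows from a csv file",
    "pdf_merge": "merge several pdf documents into one",
    "git_bisect": "find the commit that introduced a bug",
}
loaded = retrieve_skills("remove duplicate rows in my csv export", corpus)
none_needed = retrieve_skills("what is two plus two", corpus)
```

An agent that loads whatever the retriever returns, at the same rate whether or not the task needs it, is skipping the second step; the threshold here is the crudest possible version of that decision.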
Step back and the field is sliding along a spectrum, from cheap-and-static to adaptive-and-compounding (see Figure 6). Four positions:
- Retrieve. SRA (Su et al., 2026) pulls skills from an external corpus on demand. Gorilla (Patil et al., 2023) is the early proof of the same principle for tools: a finetuned LLaMA that surpasses GPT-4 at writing API calls and, paired with a document retriever, adapts to test-time documentation and version changes while cutting hallucinated API usage — evaluated on APIBench, built over HuggingFace, TorchHub, and TensorHub. The Berkeley line that produced Gorilla also produced the function-calling leaderboard that dynamics modelling is measured on.
- Generate. Toolformer (Schick et al., 2023) showed a model can teach itself: in a self-supervised loop it decides which API to call, when, with what arguments, and how to fold the result back into generation — manufacturing its own tool-use training data. ColleagueSkill (Zhou et al., 2026) carries the idea into person-grounded skills, distilling heterogeneous expert traces into a versioned, inspectable, correctable skill package — a capability track of practices and decision heuristics alongside a bounded persona track.
- Self-evolve. SkillOpt (Yang et al., 2026) observes that today's skills are hand-crafted, generated one-shot, or loosely self-revised — none of which behaves like an optimizer for the skill, and none of which reliably improves over its starting point under feedback. SkillOpt is an executive strategy that treats skill improvement as an actual optimization loop. (Self-evolution at the level of the whole system is its own subject [→ 12].)
- Select. Even a good library has to be queried. Gan et al. (2026) cast skill selection as latent-variable learning of a user's implicit preferences, run by a lightweight local preference harness that decouples statistical preference learning from semantic intent parsing: selection pushed to the edge, beside the user, on top of a remote model. It is the harness H [← 1], specialized down to a single job.
The throughline is one idea: a skill is a tool call promoted into reusable procedure. The §2 interface decision — what to put in front of the model — becomes a library decision: retrieve rather than enumerate, generate from experience, let the good procedures evolve, and learn which to select. Done well, each task pays a little into Σ and every future task draws on it; the interface cost is amortized instead of re-paid. Done badly — enumerate everything, regenerate from scratch — the same cost recurs forever. That is the difference between compounding and collapse, and it is decided at the interface.
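Promotion needs a trigger, and recurrence across tasks is the simplest one to detect. The sketch below assumes each task leaves behind a list of tool names; `recurring_sequences`, `promote`, the `skill:` naming scheme, and the traces are all hypothetical, and a dictionary stands in for the store Σ.

```python
from collections import Counter

def recurring_sequences(traces: list, length: int = 2, min_tasks: int = 2) -> list:
    """Find tool-call sequences of a given length that appear in at least min_tasks traces."""
    seen_in = Counter()
    for trace in traces:
        # Count each sequence once per trace so one repetitive task cannot trigger promotion.
        grams = {tuple(trace[i:i + length]) for i in range(len(trace) - length + 1)}
        seen_in.update(grams)
    return sorted(seq for seq, n in seen_in.items() if n >= min_tasks)

def promote(store: dict, sequences: list) -> dict:
    """Add each recurring sequence to the skill store under a derived name."""
    for seq in sequences:
        store.setdefault("skill:" + "+".join(seq), list(seq))
    return store

traces = [
    ["fetch_ticket", "search_docs", "draft_reply"],
    ["fetch_ticket", "search_docs", "escalate"],
    ["list_files", "read_file"],
]
skill_store = promote({}, recurring_sequences(traces))
```

A real store would keep more than the raw sequence (arguments, preconditions, a description to retrieve against), but the amortization argument already shows up here: the third task that starts with `fetch_ticket` can draw on the stored procedure instead of re-deriving it.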
Tools become skills — procedures kept in Σ [← 5]. Enumerating them in context collapses (SRA-Bench: a 26,262-skill corpus; Su et al., 2026), so the field moves along a spectrum: retrieve (SRA, Gorilla), generate (Toolformer, ColleagueSkill), self-evolve (SkillOpt), select by implicit preference (Gan et al.). Tellingly, the knowing-doing gap re-appears here: even a retrieved skill must be chosen. A skill is a tool call promoted into reusable procedure — and promotion is what makes interface cost amortize instead of recur.
§5 · Industrial scale
A reasonable objection: this all sounds like frontier-model luxury. Real deployments run under hard cost and latency budgets that forbid a giant model on every turn. Does the tools→skills loop survive contact with production economics? AgenticQwen (Lyu et al., 2026), from Alibaba, is the existence proof that it does. The premise is exactly the constraint: industrial applications need agents that reason over multiple steps and use tools under strict cost and latency limits, which makes small agentic models the target rather than a compromise.
The method turns interface competence into training data, and back again, through two coupled data flywheels (see Figure 7). A reasoning flywheel raises task difficulty by learning from the model's own errors — failures become harder problems, which produce sharper failures. An agentic flywheel expands linear workflows into multi-branch behavior trees that better reflect how real tool use forks, retries, and recovers. Both feed multi-round reinforcement learning on synthetic data plus a limited amount of open data. The flywheels manufacture their own escalating curriculum: the agent's growing competence at the interface generates harder interface tasks, and training on those tasks makes a small model more competent still. It is the compounding of §4, industrialized — and run inside a budget, which is the only version that ships.
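To see why a behavior tree is a richer training target than a linear workflow, it helps to count trajectories. The following is a generic illustration of expanding a step list into a tree with retry and fallback branches; it is not AgenticQwen's procedure, and the `Node` class, `expand` function, and the fallback tool name are assumptions made for the example.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    step: str
    on_success: "Node | None" = None
    on_failure: list = field(default_factory=list)  # alternative branches

def expand(workflow: list, fallbacks: dict) -> "Node | None":
    """Turn a linear list of steps into a tree with retry and fallback branches."""
    nxt = None
    for step in reversed(workflow):
        node = Node(step, on_success=nxt)
        # Every step can be retried once; some also have a hypothetical fallback tool.
        node.on_failure.append(Node(f"retry:{step}", on_success=nxt))
        if step in fallbacks:
            node.on_failure.append(Node(fallbacks[step], on_success=nxt))
        nxt = node
    return nxt

def count_paths(node) -> int:
    """Number of distinct root-to-leaf trajectories through the tree."""
    if node is None:
        return 1
    return count_paths(node.on_success) + sum(count_paths(b) for b in node.on_failure)

tree = expand(["search", "book", "confirm"], {"book": "book_via_agent"})
paths = count_paths(tree)
```

A three-step linear workflow has exactly one trajectory; the expanded tree has twelve, each exercising a different combination of failure and recovery.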
The compounding holds under production economics. AgenticQwen (Lyu et al., 2026) trains small agentic models for strict cost and latency budgets using two coupled data flywheels — a reasoning flywheel that turns errors into harder tasks, and an agentic flywheel that grows linear workflows into multi-branch behavior trees — feeding multi-round RL. Interface competence becomes training data, which buys more interface competence. The loop is not a frontier luxury; it is an industrial method.
§6 · The systems floor
None of this is new physics, and it helps to know that the field is re-deriving a result the systems world has already settled. The canonical example has a name: io_uring. The io_uring design document (Axboe) lays out the lineage of file IO on Linux (read/write, then pread/pwrite, then the vectored preadv/pwritev, then preadv2/pwritev2), all of which share one trait: they are synchronous, and every operation crosses the user/kernel boundary with its own blocking system call. Each call returns only when the data is ready. When you have enormous numbers of operations, that per-operation boundary crossing becomes the bottleneck. The older native asynchronous interface, aio, does not escape it either: the document notes that it requires at least two system calls per IO (submit, then wait for completion), a real cost in the post-Spectre/Meltdown era.
io_uring's answer is to stop paying per crossing. It sets up a pair of shared ring buffers — a submission queue and a completion queue — so the application can batch many operations, submit them with very few system calls, and reap completions asynchronously. The throughput win does not come from a faster disk. It comes from redesigning the interface so the boundary is crossed rarely and in bulk. Interface design is the performance (see Figure 8).
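The difference is easy to simulate by counting crossings. This toy model does not call the real io_uring interface; `PerCallIO` and `RingIO` are invented classes that only track how many times a pretend user/kernel boundary is crossed for the same eight reads.

```python
from collections import deque

class PerCallIO:
    """Synchronous style: one boundary crossing per operation."""
    def __init__(self):
        self.crossings = 0

    def read(self, block: int) -> str:
        self.crossings += 1
        return f"data{block}"

class RingIO:
    """Ring style: queue requests, cross the boundary once per batch."""
    def __init__(self):
        self.crossings = 0
        self.submission = deque()
        self.completion = deque()

    def prepare_read(self, block: int) -> None:
        self.submission.append(block)       # no crossing: just fill the queue

    def submit(self) -> int:
        self.crossings += 1                 # a single crossing hands over the whole batch
        n = len(self.submission)
        while self.submission:
            self.completion.append(f"data{self.submission.popleft()}")
        return n

    def reap(self) -> list:
        results = list(self.completion)     # completions are read without another crossing
        self.completion.clear()
        return results

sync_io = PerCallIO()
sync_results = [sync_io.read(b) for b in range(8)]

ring = RingIO()
for b in range(8):
    ring.prepare_read(b)
submitted = ring.submit()
ring_results = ring.reap()
```

Both paths return the same data; the per-call version crosses the boundary eight times and the ring version once, which is the whole of the design lesson.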
The mapping is close enough to be useful, not just cute. Eager per-turn schema injection is syscall-per-IO: a fixed cost paid at the boundary for every operation, whether or not the operation needs it. Gated, lazy, cached schema loading — and retrieved skills — are the submission and completion rings: pay once, batch, amortize, and never cross the boundary for what you do not need. Agent runtimes are re-deriving operating-system interface design, and the OS already wrote the answer on the wall: when the boundary is hot, stop paying per crossing. (This is an analogy and nothing more — io_uring's numbers are about disks, not agents — but the design lesson is exactly transferable.)
The systems world has already settled this argument. io_uring (Axboe) replaced syscall-per-IO with shared submission/completion rings, winning throughput by redesigning the interface, not the device. The map onto agents is an analogy, but a close one: eager per-turn schema injection is syscall-per-IO; gated, lazy, cached schema and retrieved skills are the ring buffer. When the boundary is hot, stop paying per crossing.
§7 · Interface design rules
Reduce all of it to decisions you can make on Monday. The interface-design checklist below has five levers; each names a signal that should trigger it and an action grounded in the work above (see Figure 9). It is meant to be useful read alone — if you take nothing else from this article, take the table.
Three of the levers concern what reaches the model. The schema budget: when tool schemas run to thousands of tokens per turn, or context utilization trends toward the ~70% region, gate by intent and load full schemas lazily for the top-k tools only, keeping a compact summary pool resident. The tool-count regime: with a small number of tools, static schemas may be fine; once the count is large enough that selection accuracy drops as you add more, switch from enumeration to retrieval. The knowing-doing gap itself: when the model can describe the right tool but won't call it, the fix is to make acting cheaper than explaining — cut execution friction and lower the schema tax so the act half of the loop is not starved.
The other two concern what the action does and becomes. Statefulness: when actions mutate the world and cannot be cheaply reset, predict-then-verify before committing, and let the agent refuse low-confidence actions rather than gamble the world on them. And the lever that turns this article into a program — promote a tool call into a skill. The trigger is recurrence: the same tool sequence showing up across tasks, the agent re-deriving the same procedure, or a manual procedure stable enough to package. When you see it, promote — retrieve the procedure, generate it from traces, and let it evolve under feedback. The single rule beneath all five: never put capability in front of the model that this turn cannot use, and never re-derive procedure the agent already earned. Promotion is the compounding; enumeration is the collapse.
| Lever | Signal that trips it | Action | Grounded in |
|---|---|---|---|
| Schema budget | schemas > a few k tokens/turn; context use trending to ~70% | gate by intent (overlap score); lazy-load full JSON for top-k; keep a summary pool resident | MCP Tax — Sadani & Kumar (2026) |
| Tool-count regime | selection accuracy falls as you add tools | small N: static may be fine; large N: retrieve, don't enumerate | SRA — Su et al. (2026); MCP Tax |
| Statefulness | actions mutate the world; no cheap reset | predict-then-verify before committing; let the agent refuse low-confidence actions | dynamics modelling — [← A3] |
| Knowing-doing gap | model describes the right tool but won't call it | cut execution friction; lower the schema tax so the act half isn't starved | Knowing-Doing Gap — Cheng et al. (2026) |
| Promote tool → skill | the same tool sequence recurs across tasks | retrieve it; generate it from traces; let it evolve under feedback | SRA; ColleagueSkill; Toolformer; SkillOpt |
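The table can also be read as a function from signals to levers. The sketch below is one possible encoding: the `Signals` fields and the 3,000-token budget are assumptions (the table says only "a few k"), while the 70% limit comes from the text.

```python
from dataclasses import dataclass

@dataclass
class Signals:
    schema_tokens_per_turn: int
    context_utilization: float        # 0.0 to 1.0
    selection_accuracy_dropping: bool
    actions_irreversible: bool
    describes_tool_but_skips_call: bool
    recurring_tool_sequence: bool

def levers(s: Signals, schema_budget: int = 3_000, utilization_limit: float = 0.70) -> list:
    """Map observed signals to the levers in the table, in table order."""
    out = []
    if s.schema_tokens_per_turn > schema_budget or s.context_utilization >= utilization_limit:
        out.append("schema budget: gate by intent, lazy-load top-k")
    if s.selection_accuracy_dropping:
        out.append("tool-count regime: retrieve, don't enumerate")
    if s.actions_irreversible:
        out.append("statefulness: predict-then-verify, allow refusal")
    if s.describes_tool_but_skips_call:
        out.append("knowing-doing gap: cut execution friction")
    if s.recurring_tool_sequence:
        out.append("promote tool -> skill")
    return out

plan = levers(Signals(12_000, 0.45, False, True, False, True))
```

For a deployment with 12k schema tokens per turn, irreversible actions, and a recurring tool sequence, the function returns three levers; a deployment with none of the signals returns an empty list, which is itself a useful answer.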
Carry five levers to any agent build. Schema budget: gate + lazy-load when schemas bloat or context nears ~70%. Tool-count regime: retrieve, don't enumerate, once selection accuracy drops. Statefulness: predict-then-verify and allow refusal when the world can't be reset. Knowing-doing gap: make acting cheaper than explaining. Promote tool → skill: when a procedure recurs, retrieve/generate/evolve it. One rule underneath: never front capability this turn can't use, and never re-derive procedure the agent already earned.
What comes next
This article covered which action crosses the interface and at what cost: schema in, action out, observation back, one step at a time. It said nothing about which sequence of actions to take. And a clean, cheap, well-gated interface is wasted on a search that looks only one step ahead: the agent that picks the locally best tool can still walk a globally doomed path. The interface decides whether an action is cheap and correct; planning decides whether the sequence of cheap, correct actions actually reaches the goal.
[→ 7] Planning and the Myopia Problem — why agents choose locally good actions that are globally wrong, and how planning depth trades against verification cost. A perfect action interface is the precondition for planning, not a substitute for it.
There is also a larger version of §4's last move. Promoting a tool into a skill optimizes one procedure; making the whole system rewrite and improve itself is a different loop, with different risks, which the series takes up later [→ 12]. Here the claim stays narrow and shippable: the action interface is the cheapest large win in agent engineering, because every fix is an interface fix (gate the schema, predict the state, retrieve the skill, promote the recurring procedure), and only state prediction, which DyMo adds during post-training, asks for any retraining of the model you already have.
- Capability leaks at the action interface: tool use diverges from model-specific tool necessity on 26.5–54.0% (arithmetic) and 30.8–41.8% (factual QA) of cases, with the failure concentrated in execution rather than cognition (Cheng et al., 2026).
- The dominant protocol taxes every turn: stateless, eager schema injection at ~10k–60k tokens (Sadani & Kumar, 2026). Gate by intent and state; load lazily.
- Account for the leak as a pipeline — losses (schema tax, knowing-doing gap) and recoveries (predict-then-verify [← A3], retrieved skills). The leak is a pipeline; each stage has a fix.
- Promote recurring tool calls into retrieved, generated, and self-evolving skills; AgenticQwen (Lyu et al., 2026) shows the loop holds under cost and latency budgets. Promotion compounds; enumeration collapses.
References
- Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., & Cao, Y. (2023). ReAct: Synergizing Reasoning and Acting in Language Models. International Conference on Learning Representations (ICLR), 2023. arXiv:2210.03629.
- Cheng, Y., Fan, C., JafariRaviz, M., Rezaei, K., & Feizi, S. (2026). Model-Adaptive Tool Necessity Reveals the Knowing-Doing Gap in LLM Tool Use. Preprint. arXiv:2605.14038.
- Sadani, A., & Kumar, D. (2026). Tool Attention Is All You Need: Dynamic Tool Gating and Lazy Schema Loading for Eliminating the MCP/Tools Tax in Scalable Agentic Workflows. Infrrd.ai, practitioner report (Preprint). arXiv:2604.21816.
- Schick, T., Dwivedi-Yu, J., Dessì, R., Raileanu, R., Lomeli, M., Zettlemoyer, L., Cancedda, N., & Scialom, T. (2023). Toolformer: Language Models Can Teach Themselves to Use Tools. Advances in Neural Information Processing Systems 36 (NeurIPS), 2023. arXiv:2302.04761.
- Patil, S. G., Zhang, T., Wang, X., & Gonzalez, J. E. (2023). Gorilla: Large Language Model Connected with Massive APIs. Preprint. arXiv:2305.15334.
- Su, W., Long, J., Ai, Q., Tang, Y., Wang, C., Tu, Y., & Liu, Y. (2026). Skill Retrieval Augmentation for Agentic AI. Preprint. arXiv:2604.24594.
- Zhou, T., Liu, D., Yuan, L., Shao, J., & Hu, X. (2026). COLLEAGUE.SKILL: Automated AI Skill Generation via Expert Knowledge Distillation. Shanghai Artificial Intelligence Laboratory, Preprint. arXiv:2605.31264.
- Yang, Y., Gong, Z., Huang, W., Yang, Q., Zhou, Z., Huang, Z., et al. (2026). SkillOpt: Executive Strategy for Self-Evolving Agent Skills. Microsoft, Preprint. arXiv:2605.23904.
- Gan, Z., Tang, H., & Liu, Y. (2026). Statistical Priors for Implicit Preferences: Decoupling Skill Selection as a Local Harness in Personal Agents. Renmin University of China, Preprint. arXiv:2606.05828.
- Lyu, Y., Wang, C., Zheng, H., Yue, Y., Yan, J., Wang, M., & Huang, J. (2026). AgenticQwen: Training Small Agentic Language Models with Dual Data Flywheels for Industrial-Scale Tool Use. Alibaba Group, Preprint. arXiv:2604.21590.
- Axboe, J. (n.d.). Efficient IO with io_uring. Systems design document.