Multi-Agent Systems and Their Failure Modes
A multi-agent system fails in ways a single agent cannot — its diversity collapses, its blame becomes untraceable, its coordination cost outgrows the work. The systems that survive do not fix the org chart. They make coordination something the system learns or something it pays for.
§1 · The tenth agent made it worse
Here is a pattern anyone who has shipped agents will recognize. You have one agent that does a job adequately. You add a second to review the first, and the pair is better — the reviewer catches mistakes. So you keep going. A planner, a coder, a tester, a critic, a researcher, a summarizer. By the tenth agent the system is slower, more expensive, harder to debug, and — this is the part that stings — worse at the task than the three-agent version you had two weeks ago. Nobody decided to make it worse. Each addition was locally reasonable. The degradation was structural, and it was waiting for you the whole time.
This is the failure that organizes the chapter, so state it precisely. Chen et al. (2024) studied the simplest possible multi-agent design: ask the model the same question many times and aggregate by majority vote (their Vote), optionally screening answers with a model-based filter first (Filter-Vote). They asked one question: as you add calls, does accuracy go up? Across multiple language tasks the answer was: not monotonically. Performance first rises and then falls as a function of the number of calls (see Figure 2). The mechanism is exact and worth internalizing. A real task is a mix of easy and hard queries; more calls push accuracy on the easy queries up but drag it down on the hard ones, and once the gains on the easy queries saturate while the losses on the hard ones continue, the aggregate turns over. Adding capacity does not add capability uniformly. It redistributes it.
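The rise-then-fall shape can be reproduced with a few lines of arithmetic. The sketch below is a toy model, not the analysis in Chen et al. (2024): it assumes binary answers, independent calls, and illustrative numbers chosen here (60% easy queries answered correctly 85% of the time per call, 40% hard queries answered correctly 40% of the time per call).

```python
from math import comb


def majority_vote_accuracy(p: float, n: int) -> float:
    """P(the correct answer wins a strict majority of n independent calls).

    Assumes binary answers and odd n, so there are no ties.
    """
    need = n // 2 + 1
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(need, n + 1))


def mixed_accuracy(n: int, easy_frac: float = 0.6,
                   p_easy: float = 0.85, p_hard: float = 0.40) -> float:
    """Aggregate accuracy over a task that mixes easy and hard queries."""
    return (easy_frac * majority_vote_accuracy(p_easy, n)
            + (1 - easy_frac) * majority_vote_accuracy(p_hard, n))


if __name__ == "__main__":
    # Easy accuracy climbs toward 1, hard accuracy sinks toward 0,
    # and the aggregate peaks at a small n before sliding toward easy_frac.
    for n in (1, 3, 5, 7, 9, 21, 51):
        print(f"calls={n:>2}  easy={majority_vote_accuracy(0.85, n):.3f}  "
              f"hard={majority_vote_accuracy(0.40, n):.3f}  "
              f"overall={mixed_accuracy(n):.3f}")
```

Under these assumptions the easy queries saturate quickly while the hard queries keep losing ground, which is why the aggregate turns over. With symmetric per-call accuracies (say 0.6 and 0.4) the same model is monotone, so the turn-over depends on the mix, not on voting as such.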
Gu et al. (2026) found the same shape one level up, where the agents are agents rather than retries. They held the model fixed and grew a homogeneous society — same backbone, more members — to isolate the one variable everyone conflates with intelligence: collaboration itself. The collective dynamics are not a smooth climb. There is a regime where more agents help and a regime where they do not, and the boundary is a property of the task and the topology, not of the model. The tenth agent did not get dumber. The system around it changed shape.
So before any architecture, fix the thesis, stated plainly enough to be wrong. Multi-agent systems fail in ways single agents cannot — diversity collapse, untraceable blame, and coordination overhead that exceeds the work — and the field now has names and measurements for each. The systems that work do not solve this with a better fixed org chart; they make coordination a learned component or an economic mechanism. If a naive multi-agent system reliably beat a single strong agent plus good tools, this article would be wrong. The evidence says it does not.
One notation, fixed once and reused. Write M for a model — the weights and the forward pass [← 1]. A multi-agent system, which we abbreviate MAS, is a set of agents {M₁ … Mₙ} each wrapping a model in a harness H (its prompts, tools, and memory [← 1]), connected by a topology — who may send what to whom — and run for some number of rounds. Two systems with identical models and identical agents can behave completely differently because their topology differs. The topology is the system. Most of this article is about who designs it, and whether they design it once or learn it.
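The notation maps onto a small data structure. This is a minimal sketch with hypothetical names (`Agent`, `MAS`, `reach`); the harness H is collapsed to a label, and "who may send what to whom" is reduced to "who may send to whom".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Agent:
    name: str
    model: str                 # M: which weights sit underneath
    harness: str = "default"   # H: prompts, tools, memory (a label in this sketch)


@dataclass
class MAS:
    agents: list[Agent]
    topology: dict[str, set[str]]   # sender name -> names it may send to
    rounds: int = 1

    def reach(self, source: str) -> set[str]:
        """Agents that `source` can influence within `rounds` hops."""
        frontier: set[str] = {source}
        seen: set[str] = set()
        for _ in range(self.rounds):
            frontier = {dst for src in frontier
                        for dst in self.topology.get(src, set())}
            seen |= frontier
        return seen


# Same three agents, same model, two topologies.
team = [Agent("planner", "M"), Agent("coder", "M"), Agent("tester", "M")]
chain = MAS(team, {"planner": {"coder"}, "coder": {"tester"}}, rounds=2)
gap = MAS(team, {"planner": {"coder"}}, rounds=2)

print(chain.reach("planner"))   # the tester is downstream of the planner
print(gap.reach("planner"))     # the tester never sees the plan
```

The two systems share every agent and every weight, yet the tester receives the planner's influence in one and not the other. That difference lives entirely in `topology`.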
Adding agents is non-monotonic, not additive. Chen et al. (2024) demonstrated that majority-vote and filter-vote accuracy rises then falls with the number of calls — easy queries improve, hard queries degrade — and Gu et al. (2026) found the same regime boundary when the calls are agents. A multi-agent system is defined by its topology, not its members; the tenth agent makes it worse because it changes the system's shape, not because it is a worse agent.
§2 · The canonical promise, read honestly
The case for multiple agents has a canonical result, and it is worth stating exactly, because the whole demo wave grew from it. Du et al. (2023) introduced multiagent debate: instead of asking one model once, you run several instances of a model that each produce an answer with its reasoning, then show each instance the others' responses and let them revise over several rounds. The instances converge — and the converged answer is more factual and reasons better than any single instance's first attempt. The result was real, the mechanism was clean (see Figure 3), and it was the right kind of surprise: a coordination procedure, not a bigger model, bought the gain.
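The procedure described above (answer, read the others, revise, repeat) has a simple control flow. The sketch below shows only that loop. The `Instance` signature and the `make_stub` stand-in are hypothetical; a real instance would be a model call that also produces reasoning, which Du et al. (2023) pass between instances along with the answers.

```python
from collections import Counter
from typing import Callable, List, Optional

# Hypothetical signature: (question, own previous answer, peers' answers) -> answer
Instance = Callable[[str, Optional[str], List[str]], str]


def debate(question: str, instances: List[Instance], rounds: int = 2) -> str:
    # Round 0: every instance answers independently.
    answers = [inst(question, None, []) for inst in instances]
    for _ in range(rounds):
        # Each instance sees the others' current answers and may revise.
        # The comprehension reads the old list, so revisions are simultaneous.
        answers = [
            inst(question, answers[i], answers[:i] + answers[i + 1:])
            for i, inst in enumerate(instances)
        ]
    # Aggregate whatever the instances ended up with.
    return Counter(answers).most_common(1)[0][0]


def make_stub(first_guess: str) -> Instance:
    """Toy stand-in for a model: adopts the peers' strict majority, else keeps its answer."""
    def instance(question: str, own: Optional[str], peers: List[str]) -> str:
        if own is None:
            return first_guess
        majority, votes = Counter(peers).most_common(1)[0]
        return majority if votes > len(peers) / 2 else own
    return instance


print(debate("2 + 2 = ?", [make_stub("4"), make_stub("4"), make_stub("5")]))  # prints 4
```

The stub converges on the initial majority whether or not that majority is right, so swapping the guesses to two "5"s and one "4" returns "5". Any quality gain has to come from the instances' reasoning, not from the loop itself.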
But read what it actually claims. Debate improves on self-consistency and single-pass prompting by spending more inference compute on the same model in a structured way. It is a procedure for extracting a better answer from a fixed M — closer to a sophisticated form of sampling-and-aggregation than to a society of specialists. That framing matters, because the demo wave that followed quietly substituted a much stronger claim: that wiring many different agents into an organization buys you something debate never demonstrated. The honest version of the canonical result is narrower and more durable than the slogan it became.
The framework wave made the slogan concrete. MetaGPT (Hong et al., 2023) encoded human standard operating procedures into the prompt sequence — assign roles, pass structured artifacts down an assembly line, have each role verify the last — explicitly to suppress the cascading hallucinations that naive LLM chaining produces. AutoGen (Wu et al., 2023) generalized the plumbing: conversable agents and flexible conversation patterns you could wire into hierarchies or group chats. These frameworks are genuinely useful infrastructure, and they are the context this chapter traces — not its endorsement. They make it easy to build the ten-agent org chart that Figure 2 says will eventually underperform. The frameworks solved the engineering of coordination. They did not solve the question of whether a fixed coordination structure is the right object at all.
That MetaGPT had to invent SOPs specifically to stop cascading hallucinations is the tell. The moment you chain agents, one agent's confident error becomes the next agent's trusted input, and the error compounds down the line. A single agent that hallucinates is wrong once. A pipeline that hallucinates is wrong, then builds on being wrong. The canonical promise is real; the failure modes were baked into the same architecture that delivered it.
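The compounding is easy to put numbers on. The sketch below assumes that every stage errs independently at the same rate, that an uncaught error is never repaired downstream, and that a verifier which catches an error also gets it fixed. All three are simplifications chosen here, not claims from the cited papers.

```python
def pipeline_success(per_stage_accuracy: float, stages: int) -> float:
    """P(no stage errs) when each stage trusts its input and nothing is corrected."""
    return per_stage_accuracy ** stages


def pipeline_success_with_verifier(per_stage_accuracy: float, stages: int,
                                   catch_rate: float) -> float:
    """Same chain, but each stage's error is caught (and fixed) with prob. catch_rate."""
    undetected_error = (1 - per_stage_accuracy) * (1 - catch_rate)
    return (1 - undetected_error) ** stages


# Six stages that are each right 95% of the time.
print(round(pipeline_success(0.95, 6), 3))                      # 0.735
print(round(pipeline_success_with_verifier(0.95, 6, 0.8), 3))   # 0.941
```

Six individually reliable stages lose roughly a quarter of their runs without verification. A verifier that catches four errors in five recovers most of that loss, and this is the arithmetic behind having each role check the last.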
The canonical win, multiagent debate (Du et al., 2023), is real but narrow: structured inference over instances of one model, not a society of specialists. The framework wave (MetaGPT, Hong et al., 2023; AutoGen, Wu et al., 2023) industrialized fixed org charts and, tellingly, had to invent SOPs to suppress cascading hallucination, the first failure mode unique to chained agents, where one agent's error becomes the next's trusted input.
§3 · The failure modes have names now
The most useful thing that happened to this field recently is that the failures got named. You cannot engineer around a problem you can only describe as "it got worse." Two 2026 results give the vocabulary, and they map onto the two columns of any honest multi-agent audit: why the collective underperforms, and where you look when it does.
Diversity collapse
The intuition for putting many agents on a problem is that they will explore more of the solution space than one. Chen et al. (2026) tested that intuition directly on open-ended idea generation and found it fails for a specific, mechanical reason they call structural coupling. When agents read and respond to each other — which is the entire point of making them a system — they pull toward consensus. The shared context that lets them collaborate is the same context that makes them converge, and the convergence is to a narrower region than even a single agent sampled alone would have covered (see Figure 4). The society's diversity collapses precisely because it is a society.
The study is careful — it separates the effect across model intelligence, agent cognition, and system dynamics — and it surfaces an uncomfortable corollary the authors call a compute-efficiency paradox: stronger, more heavily aligned models, the ones you would reach for to improve quality, tend to collapse diversity faster, because alignment makes them agree more readily. The better your agents, the harder this particular failure bites. Diversity collapse is not a tuning problem you can prompt your way out of; it is a consequence of coupling, and coupling is the mechanism of collaboration itself.
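Diversity collapse is only actionable if it can be measured against an independent-sampling baseline. The sketch below uses mean pairwise Jaccard distance over word sets as a crude stand-in for spread in idea space. The metric, the function names, and the sample ideas are all assumptions made here; Chen et al. (2026) may measure diversity differently, and an embedding-based distance would be the more realistic choice.

```python
from itertools import combinations


def jaccard_distance(a: str, b: str) -> float:
    """1 - |A & B| / |A | B| over lowercase word sets."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa and not sb:
        return 0.0
    return 1.0 - len(sa & sb) / len(sa | sb)


def spread(ideas: list[str]) -> float:
    """Mean pairwise distance; higher means the ideas cover more ground."""
    pairs = list(combinations(ideas, 2))
    if not pairs:
        return 0.0
    return sum(jaccard_distance(a, b) for a, b in pairs) / len(pairs)


def collapse_ratio(coupled: list[str], independent: list[str]) -> float:
    """Below 1.0 means the coupled society is narrower than independent sampling."""
    base = spread(independent)
    return spread(coupled) / base if base else float("nan")


# Made-up outputs: one agent sampled three times vs. three agents reading each other.
independent = ["solar desalination kiosk", "fungal packaging foam", "tidal battery barge"]
coupled = ["solar desalination kiosk",
           "solar desalination kiosk network",
           "solar desalination kiosk franchise"]

print(round(collapse_ratio(coupled, independent), 2))   # 0.3
```

Tracking this ratio per round of interaction would show whether coupling narrows the pool as agents read each other, which is the comparison the audit needs.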
Untraceable blame
The second failure is operational. When a single agent fails, you read its trajectory τ and find the bad step. When a ten-agent system fails, the bad output is the end of a chain of handoffs, and the question "which agent, at which step, caused this?" has no easy answer. Qi et al. (2026), surveying the field, elevate failure attribution to a first-class problem — one of the three axes (alongside collaboration and self-evolution) along which they organize the entire literature. That a survey needs a dedicated axis for "whose fault was it" tells you the question is both central and unsolved.
This is where the engineering discipline of the rest of the series pays off. Attribution is hard when the system's history is ephemeral and easy when it is recorded. An agent system whose every message, tool call, and state transition is captured as an append-only event log turns "which agent caused this" from forensics into a query.
[→ 11] Agent Ops: Running Agents in Production — logs make blame assignable. Failure attribution in a MAS reduces to event-sourcing: if every handoff is a recorded event, the bad step is found by reading, not guessing.
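A minimal sketch of "forensics into a query", assuming every step records the result of some validation check. The `Event` fields and the `EventLog` methods are hypothetical names, not an API from the series or from Qi et al. (2026). Where no per-step check exists, attribution remains the open problem the survey describes.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Event:
    seq: int
    agent: str
    kind: str          # e.g. "message", "tool_call", "state"
    payload: str
    check_ok: bool     # outcome of whatever validation ran on this step


class EventLog:
    """Append-only: events are added, never edited or removed."""

    def __init__(self) -> None:
        self._events: list[Event] = []

    def append(self, agent: str, kind: str, payload: str,
               check_ok: bool = True) -> Event:
        event = Event(len(self._events), agent, kind, payload, check_ok)
        self._events.append(event)
        return event

    def first_failure(self) -> Optional[Event]:
        """Earliest step whose recorded check failed, if any."""
        return next((e for e in self._events if not e.check_ok), None)

    def downstream_of(self, seq: int) -> list[Event]:
        """Everything that ran after a step, i.e. what may have inherited its error."""
        return self._events[seq + 1:]


log = EventLog()
log.append("planner", "message", "spec: parse ISO dates")
log.append("coder", "tool_call", "run tests", check_ok=False)
log.append("tester", "message", "looks fine")

culprit = log.first_failure()
print(culprit.agent, culprit.seq)           # coder 1
print(len(log.downstream_of(culprit.seq)))  # 1
```

The frozen dataclass and the absence of any update or delete method are what make the log trustworthy after the fact: the history a query reads is the history that happened.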
Two named, measured failure modes organize every multi-agent audit. Diversity collapse (Chen et al., 2026): structural coupling makes a society converge to a narrower solution space than independent agents — and aligned, stronger models collapse faster (the compute-efficiency paradox). Untraceable blame (Qi et al., 2026): failure attribution is a first-class open problem, solvable in practice only when the system's history is recorded as events [→ 11].
§4 · Coordination as a learned component
If a fixed org chart is the disease, the first cure is to stop fixing it by hand. The strongest recent systems do not hard-code who talks to whom; they learn the coordinator. This is the single most important shift in the chapter, so make it concrete with two systems that learn coordination in two different ways (see Figure 5).
The Conductor (Nielsen, Cetin et al., 2025) trains a coordinator with reinforcement learning whose entire job is to orchestrate a pool of worker LLMs. It learns two things jointly: the communication topology (which workers talk to which, in what order) and the instructions it writes to each worker to extract that worker's strength. The result that matters for the thesis: a 7B Conductor, orchestrating workers, achieves performance beyond any individual worker and reaches state-of-the-art results on demanding reasoning benchmarks the authors name explicitly, LiveCodeBench and GPQA. And because it is trained over randomized pools, it adapts to arbitrary sets of open- and closed-source workers rather than memorizing one lineup. The org chart became a policy. In these results, a small model that has learned to route outperforms any of its workers deployed alone.
TRINITY (Xu, Sun et al., 2025) reaches a similar place from the opposite constraint: what if you cannot touch the workers' weights at all, because they are behind closed APIs? It puts a deliberately tiny coordinator in front of them — a compact language model of roughly 0.6B parameters with a lightweight head of about 10K parameters — and optimizes it with an evolutionary strategy rather than gradients through the workers. At each turn the coordinator assigns one of three roles — Thinker, Worker, or Verifier — to a selected model, which keeps the hard skill of solving the problem in the workers and the skill of delegation in the coordinator. It outperforms the individual models it orchestrates and, notably, generalizes to out-of-distribution tasks — the property fixed org charts conspicuously lack.
The structural point unifies them. A fixed pipeline encodes a human's one-time guess about the right division of labor; it cannot adapt when the task shifts, which is exactly when Figure 2's curve turns over. A learned coordinator treats "who does what, and who checks it" as a function to be optimized against outcomes. Conductor learns it by RL with weight access; TRINITY learns it by evolution without weight access. Both replace the org chart with something that updates. That is the difference between a system that degrades as you add agents and one that can decide an agent is not worth adding.
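"A function to be optimized against outcomes" can be made concrete with a toy. The sketch below learns a routing table with a generic (1+λ) evolution strategy. It is neither Conductor's RL procedure nor TRINITY's optimizer, and the task types, the skill table, and every name in it are invented for illustration. The point it shares with both systems is that the coordinator sees only the reward from its routing decisions, never the workers' internals.

```python
import random

TASK_TYPES = ["math", "code", "writing"]
N_WORKERS = 3
# Hypothetical per-worker success rates. The coordinator never reads this table;
# it only observes the reward that comes back from its routing choices.
WORKER_SKILL = {
    "math":    [0.9, 0.3, 0.5],
    "code":    [0.2, 0.8, 0.4],
    "writing": [0.4, 0.5, 0.7],
}

Policy = dict[str, list[float]]   # task type -> one score per worker


def route(policy: Policy, task_type: str) -> int:
    """Send the task to the highest-scoring worker (ties go to the lowest index)."""
    scores = policy[task_type]
    return max(range(N_WORKERS), key=scores.__getitem__)


def reward(policy: Policy) -> float:
    """Expected success when each task type goes to the worker the policy picks."""
    return sum(WORKER_SKILL[t][route(policy, t)] for t in TASK_TYPES) / len(TASK_TYPES)


def mutate(policy: Policy, rng: random.Random, sigma: float = 0.5) -> Policy:
    return {t: [s + rng.gauss(0.0, sigma) for s in scores]
            for t, scores in policy.items()}


def evolve(generations: int = 200, children: int = 8,
           seed: int = 0) -> tuple[Policy, float, float]:
    rng = random.Random(seed)
    parent: Policy = {t: [0.0] * N_WORKERS for t in TASK_TYPES}
    start = best = reward(parent)
    for _ in range(generations):
        offspring = [mutate(parent, rng) for _ in range(children)]
        champion = max(offspring, key=reward)
        score = reward(champion)
        if score >= best:            # elitist: a worse policy is never accepted
            parent, best = champion, score
    return parent, start, best


policy, start, best = evolve()
print(f"reward before: {start:.2f}  after: {best:.2f}")
print({t: route(policy, t) for t in TASK_TYPES})
```

The starting policy sends everything to worker 0. Because selection is elitist, the learned policy can never score below that, and nothing in the loop requires gradients or weight access, which is the constraint TRINITY works under.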
The first real cure for org-chart failure is to learn the coordinator. Conductor (Nielsen, Cetin et al., 2025) trains a 7B orchestrator by RL to design topologies and worker instructions, reaching state-of-the-art on LiveCodeBench and GPQA beyond any single worker. TRINITY (Xu, Sun et al., 2025) puts a ~0.6B evolved coordinator in front of closed-API workers, assigning Thinker/Worker/Verifier roles per turn and generalizing out-of-distribution. Coordination becomes a function optimized against outcomes, not a human's one-time guess.
§5 · Organizations, and the price of a thought
Learning the coordinator solves routing. It does not, by itself, solve organization — how a workforce of agents is assembled, governed, and improved over a long horizon, decoupled from what any single agent knows. Two 2026 systems push past routing into that organizational layer, and they are interesting precisely because they reach for the two coordination mechanisms humans actually use at scale: hierarchy and markets.
OneManCompany (Yu, Fu et al., 2026) takes the hierarchy seriously. It separates what an agent can do from how the workforce is structured, packaging skills, tools, and runtime configuration into portable identities it calls Talents, recruited on demand from a Talent Market when the organization hits a capability gap. Crucially, its decision-making is an Explore-Execute-Review tree search: tasks are decomposed top-down into accountable units and outcomes aggregated bottom-up to drive review — and the authors give it formal guarantees on termination and deadlock-freedom. That last clause is the point. A fixed multi-agent pipeline can deadlock or loop with no guarantee it ever stops; treating the organization as a search procedure with provable termination is an engineering answer to a structural failure mode.
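The shape of an explore-execute-review recursion (top-down decomposition, bottom-up aggregation, review at every node) can be sketched briefly. Everything below is a stand-in: `explore`, `execute`, and `review` are hypothetical stubs, and termination here comes from a simple depth bound added for this sketch, not from the formal argument in Yu, Fu et al. (2026).

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    task: str
    children: list["Node"] = field(default_factory=list)
    result: str = ""
    approved: bool = False


def explore(task: str) -> list[str]:
    """Hypothetical decomposer: split on ' and '; an atomic task has no subtasks."""
    parts = [p.strip() for p in task.split(" and ")]
    return parts if len(parts) > 1 else []


def execute(task: str) -> str:
    """Hypothetical worker call."""
    return f"done({task})"


def review(node: Node) -> bool:
    """A leaf passes if it produced output; a parent passes if all children passed."""
    if not node.children:
        return bool(node.result)
    return all(child.approved for child in node.children)


def run(task: str, depth: int = 0, max_depth: int = 3) -> Node:
    node = Node(task)
    # The depth bound forces a leaf, so the recursion cannot descend forever.
    subtasks = explore(task) if depth < max_depth else []
    if subtasks:
        node.children = [run(s, depth + 1, max_depth) for s in subtasks]  # top-down
        node.result = "; ".join(c.result for c in node.children)          # bottom-up
    else:
        node.result = execute(task)
    node.approved = review(node)
    return node


root = run("write parser and write tests and update docs")
print(root.approved, "|", root.result)
```

Every call either increases `depth` or executes a leaf, and branching is finite, so the procedure must stop. A hand-wired graph of agents messaging each other carries no comparable invariant, which is the contrast the paragraph draws.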
Economy of Minds (Qi, Su et al., 2026) makes the more radical move: it replaces the org chart with a market. Agents interact through economic transactions — they pay and are paid — so that price becomes the coordination signal. This is worth sitting with, because it inverts the entire framing. In a fixed org chart, a human decides in advance which agent's contribution matters. In a market, that decision is made continuously and locally by what each agent is willing to pay for another's output. Diversity collapse, in this lens, is a market failure — everyone bidding on the same consensus — and the mechanism design becomes the lever. You do not redraw the org chart; you change the incentives and let the structure form.
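To see how a price can stand in for an org-chart decision, consider one agent selling its output to the others. The sketch below uses a sealed-bid second-price auction, a textbook mechanism chosen here for illustration; it is not the mechanism of Qi, Su et al. (2026), and the agent names, wallets, and valuations are made up.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Agent:
    name: str
    wallet: float


def auction_output(seller: Agent, buyers: list[Agent],
                   valuations: dict[str, float]) -> Optional[tuple[str, float]]:
    """Second-price sealed bid for one agent's output.

    Each buyer bids what the output is worth to it, capped by its wallet.
    The highest bidder wins and pays the runner-up's bid to the seller.
    """
    offers = sorted(
        ((min(valuations.get(b.name, 0.0), b.wallet), b) for b in buyers),
        key=lambda offer: offer[0],
        reverse=True,
    )
    if not offers or offers[0][0] <= 0:
        return None          # nobody will pay: the price signal says this output is not needed
    top_bid, winner = offers[0]
    # A lone buyer pays its own bid (an arbitrary choice for this sketch).
    price = offers[1][0] if len(offers) > 1 else top_bid
    winner.wallet -= price
    seller.wallet += price
    return winner.name, price


researcher = Agent("researcher", 0.0)
buyers = [Agent("coder", 10.0), Agent("tester", 10.0), Agent("summarizer", 2.0)]
sale = auction_output(researcher, buyers,
                      {"coder": 6.0, "tester": 4.0, "summarizer": 5.0})
print(sale, researcher.wallet)   # ('coder', 4.0) 4.0
```

No human decided that the coder should receive the research. The allocation fell out of what each agent was willing and able to pay, and an output that attracts no bids is never bought at all.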
Notice what both systems abandon. Neither treats the set of agents and their connections as a thing you specify once and freeze. OneManCompany makes the structure a search; Economy of Minds makes it an equilibrium. The org chart, in the systems that work, is an output, not an input.
Beyond routing lies organization, and the working systems borrow the two mechanisms humans scale with. OneManCompany (Yu, Fu et al., 2026) makes the org a tree search over portable Talents with formal termination and deadlock-freedom guarantees — a structural answer to pipelines that loop forever. Economy of Minds (Qi, Su et al., 2026) replaces the org chart with a market where price is the coordination signal. In both, the structure is an output to be optimized, not an input to be hand-drawn.
§6 · Recursion: scaling the society, not the agent
There is one more axis worth naming, because it is where this chapter touches the frontier. The single-model world recently found a new scaling knob: looped or recursive computation, where the same model refines its own latent state over iterations to deepen reasoning rather than simply growing wider. Recursive MAS (Tong, Zhang et al., 2026) asks the obvious next question — can collaboration itself be scaled by recursion? — and answers it by casting an entire multi-agent system as one recursive computation in latent space.
The construction is specific. A lightweight module the authors call RecursiveLink connects heterogeneous agents into a collaboration loop, passing latent thoughts and transferring latent state between agents across recursion rounds rather than re-serializing everything to text each hop. Training uses an inner-outer loop that co-optimizes the whole system with shared, gradient-based credit assignment across rounds — so the society is tuned end-to-end rather than as a bag of independently-prompted parts. Across the patterns and benchmarks the authors evaluate — spanning mathematics, science, medicine, search, and code — they report an average accuracy improvement of 8.3% over single- and multi-agent and recursive-computation baselines, at 1.2×–2.4× efficiency, with stable gradients through the recursion.
The reason this belongs in a failure-modes chapter and not a victory-lap one: recursion done naively is a new way to fail. A society that feeds its own latent state back into itself is, structurally, a loop that can amplify whatever it started with — including a collapsed consensus or a cascading error. Recursive MAS's contribution is not "recurse and win"; it is making the recursion trainable and stable, which is exactly the part you have to engineer for the axis to be a knob rather than a hazard.
[→ C3] The Long Bet · Recursion as a Scaling Substrate — the looped-compute axis that Recursive MAS applies to societies is, in the long view, one of several substrates that compound when you let a system improve its own process.
Recursion is the newest scaling axis for societies, not just models. Recursive MAS (Tong, Zhang et al., 2026) casts a whole multi-agent system as one latent-space recursive computation, co-optimized end-to-end, reporting an average 8.3% accuracy gain at 1.2×–2.4× efficiency over single-, multi-agent, and recursive baselines. The engineering content is stability: an untrained recursive society is a feedback loop that amplifies collapse, so the contribution is making the loop stable and trainable [→ C3].
§7 · The game-theory floor the LLM wave skipped
Step back and notice something about every system so far: each is, formally, a multi-agent decision problem, and there is a mature science of those. Multi-agent reinforcement learning fused game theory with learning decades ago, and Albrecht, Christianos & Schäfer (2024) is its standard modern reference. The LLM multi-agent wave largely skipped this literature, and it shows — many "novel" coordination failures are textbook results wearing new clothes.
The vocabulary alone is clarifying. Non-stationarity: when other agents are also learning, the environment each agent faces is a moving target. That is the formal reason a coordinator trained against one fixed worker pool degrades against another, and Conductor's randomized-pool training is the mitigation. Equilibria: a collection of individually rational agents settles into a joint outcome that need not be collectively good. Through a game-theoretic lens, that is what diversity collapse looks like: a society at a bad equilibrium. Credit assignment across agents: the formal name for untraceable blame. The LLM-MAS field rediscovered these as engineering surprises; MARL had names, theorems, and algorithms for them already.
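The "bad equilibrium" reading can be checked mechanically on a two-agent toy. The sketch below enumerates pure-strategy Nash equilibria of a small game. The "explore"/"conform" framing and the payoffs (a standard prisoner's-dilemma structure) are an illustrative assumption made here, not a model taken from the cited work.

```python
from itertools import product

# game[(row_action, col_action)] = (row_payoff, col_payoff)
Game = dict[tuple[str, str], tuple[float, float]]


def pure_nash(game: Game) -> list[tuple[str, str]]:
    """Action pairs where neither agent gains by deviating unilaterally."""
    rows = sorted({r for r, _ in game})
    cols = sorted({c for _, c in game})
    equilibria = []
    for r, c in product(rows, cols):
        row_ok = all(game[(r, c)][0] >= game[(alt, c)][0] for alt in rows)
        col_ok = all(game[(r, c)][1] >= game[(r, alt)][1] for alt in cols)
        if row_ok and col_ok:
            equilibria.append((r, c))
    return equilibria


# Toy payoffs: exploring is valuable when both do it, but conforming to
# whatever the other agent says is individually safer.
diversity_game: Game = {
    ("explore", "explore"): (3, 3),
    ("explore", "conform"): (0, 4),
    ("conform", "explore"): (4, 0),
    ("conform", "conform"): (1, 1),
}

print(pure_nash(diversity_game))   # [('conform', 'conform')]
```

Both agents would do better at (explore, explore), yet the only stable outcome is mutual conformity. Changing the payoffs changes the equilibrium, which is the sense in which incentive design, rather than prompting, is the lever.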
This connects directly back to the published Continual-Intelligence series, where the strategic-interaction questions were treated head-on. The work on robust social strategies and on opponent shaping — how an agent should behave when the other agents are themselves adapting, and how it can deliberately shape their learning — lives there, and is the right entry point for anyone building a MAS that must hold up against other adaptive agents rather than fixed ones.
[← A2] Does RL Teach LLMs to Reason? — and the A-series treatment of robust social strategies and opponent shaping. What RL does and does not teach the individual agents you are coordinating sets the ceiling on what any coordination layer can extract from them.
Most "new" multi-agent failures are old theory. MARL (Albrecht, Christianos & Schäfer, 2024) already names them: non-stationarity (why a coordinator trained on one pool degrades on another), bad equilibria (the formal shape of diversity collapse), and cross-agent credit assignment (untraceable blame). The strategic-interaction questions — robust social strategies, opponent shaping — are treated in the published A-series [← A2]; read them before betting a system on agents that face other adaptive agents.
§8 · When to go multi-agent — the decision
Now the deployable part. The honest default, given everything above, is uncomfortable for anyone who wanted to build a society: start with a single strong agent plus good tools, and add agents only against a named failure mode you can measure. A single agent has no coordination overhead, no diversity to collapse, and a single trajectory you can read when it breaks. Every one of this chapter's failure modes is a tax you take on the moment you add the second agent. The tax is sometimes worth paying. It is never free.
Use this checklist before adding agents. Go multi-agent only when at least one is a clear yes:
- Genuine parallelism. The task decomposes into subtasks that can run independently, and wall-clock matters more than the token cost of redundancy. (If the subtasks must mostly run in sequence, you have a pipeline, and pipelines cascade errors — §2.)
- Verification asymmetry. Checking an answer is much cheaper or more reliable than producing it, so a separate verifier earns its keep — the role TRINITY assigns explicitly (§4).
- Required diversity that coupling won't kill. You need genuinely different perspectives and can keep agents from reading each other into consensus (§3). If you cannot, independent sampling of one agent may give you more diversity than a coupled society.
- You can learn or price the coordination. You are willing to make coordination a learned component (§4) or an economic mechanism (§5) rather than a frozen org chart — because a frozen org chart is what Figure 2's curve punishes.
If none of these holds, the second agent is overhead. And whatever you build, instrument it so the failure modes are detectable: the audit table below (see Figure 6) is the artifact to keep on the wall. Every row is a failure unique to multi-agent systems, with how to detect it, how the cited systems mitigate it, and the residual risk that survives mitigation — because none of these is fully solved.
| Failure mode (MAS-only) | How to detect it | Mitigation (and source) | Residual risk |
|---|---|---|---|
| Non-monotonic scaling: more agents → less accuracy | ❌ ablate agent/call count; watch for the turn-over peak | ⚠️ cap agents at the peak; difficulty-aware routing (Chen et al., 2024; Gu et al., 2026) | ⚠️ peak is task-specific; must be re-measured per task |
| Diversity collapse: society converges too narrow | ❌ measure idea-space spread vs independent sampling | ⚠️ reduce coupling; avoid over-aligned homogeneous pools (Chen et al., 2026) | ❌ compute-efficiency paradox: stronger models collapse faster |
| Cascading hallucination: one error becomes trusted input | ⚠️ verify intermediate artifacts, not just final output | ✅ SOP + verifier roles (Hong et al., 2023; Verifier role, Xu et al., 2025) | ⚠️ a confident verifier can still pass a subtle error downstream |
| Untraceable blame: which agent, which step? | ❌ unanswerable without recorded history | ✅ event-source every handoff (Qi et al., 2026 names it; [→ 11] solves it) | ⚠️ attribution method itself is open research (Qi et al., 2026) |
| Coordination overhead > work: talking costs more than doing | ⚠️ compare end-to-end cost/latency to single-agent + tools | ✅ learned coordinator routes only what helps (Nielsen, Cetin et al., 2025; Xu et al., 2025) | ⚠️ training the coordinator is its own cost; needs loop ownership |
| Deadlock / non-termination: the society loops forever | ❌ no built-in stop in a hand-wired graph | ✅ org-as-search with termination + deadlock-freedom guarantees (Yu, Fu et al., 2026) | ⚠️ guarantees hold for the search procedure, not arbitrary topologies |
| Bad equilibrium / non-stationarity: adaptive agents settle badly | ⚠️ test against adaptive opponents, not fixed ones | ⚠️ randomized-pool training; opponent-aware design (Albrecht et al., 2024; [← A2]) | ❌ general multi-agent equilibrium selection is unsolved |
Default to a single strong agent plus tools; add agents only against a measured failure mode — genuine parallelism, verification asymmetry, diversity you can preserve, or coordination you can learn or price. Then instrument for all seven failure modes in Figure 6. The chapter's rule in one line: if you cannot make coordination a learned component or an economic mechanism, you are deploying the fixed org chart that the scaling curve punishes.
What comes next
We have argued in the abstract — failure modes, coordinators, markets, recursion. The next chapter makes it concrete in the one domain where multi-agent claims get tested against a ground truth that does not negotiate: software engineering. A patch either applies and passes the tests or it does not. B9 treats software-engineering agents as the field's drosophila — the model organism where coordination, verification, and the cost curve are all measurable at once — and asks which of this chapter's lessons survive contact with a real proving ground.
[→ 9] Software Engineering Agents: The Proving Ground — where multi-agent coordination meets a verifier that cannot be talked out of a failing test, and the cost curves stop being qualitative.
References
- Qi, S., Ma, J., Xing, R., Guo, W., Huang, X., Gao, Z., Deng, J., Liu, J., Zhang, L., Wei, B., et al. (2026). Beyond Individual Intelligence: Surveying Collaboration, Failure Attribution, and Self-Evolution in LLM-based Multi-Agent Systems. Preprint. arXiv:2605.14892.
- Chen, N., Tong, Y., Yang, Y., He, Y., Zhang, X., Zou, Q., Wang, Q., & He, B. (2026). Diversity Collapse in Multi-Agent LLM Systems: Structural Coupling and Collective Failure in Open-Ended Idea Generation. Preprint. arXiv:2604.18005.
- Du, Y., Li, S., Torralba, A., Tenenbaum, J. B., & Mordatch, I. (2023). Improving Factuality and Reasoning in Language Models through Multiagent Debate. Preprint. arXiv:2305.14325.
- Nielsen, S., Cetin, E., Schwendeman, P., Sun, Q., Xu, J., & Tang, Y. (2025). Learning to Orchestrate Agents in Natural Language with the Conductor. Preprint. arXiv:2512.04388.
- Xu, J., Sun, Q., Schwendeman, P., Nielsen, S., Cetin, E., & Tang, Y. (2025). TRINITY: An Evolved LLM Coordinator. Preprint. arXiv:2512.04695.
- Yu, Z., Fu, Y., He, Z., Huang, Y., Ka Yiu, L., Fang, M., Luo, W., & Wang, J. (2026). From Skills to Talent: Organising Heterogeneous Agents as a Real-World Company (OneManCompany). Preprint. arXiv:2604.22446.
- Qi, Z., Su, H., Qu, A., Wang, C., Yao, Y., Zheng, H., Du, Y., Reddi, V. J., Li, J., Liang, P. P., Lakkaraju, H., Kakade, S., et al. (2026). Economy of Minds: Emerging Multi-Agent Intelligence with Economic Interactions. Preprint. arXiv:2606.02859.
- Tong, Y., Zhang, T., Buehler, M. J., He, J., Zou, J., & Tong, H. (2026). Recursive Multi-Agent Systems. Preprint. arXiv:2604.25917.
- Gu, Z., Li, J., Cai, Y., & Feng, H. (2026). Scaling Behavior of Single LLM-Driven Multi-Agent Systems. Preprint. arXiv:2606.00655.
- Chen, L., Davis, J. Q., Hanin, B., Bailis, P., Stoica, I., Zaharia, M., & Zou, J. (2024). Are More LM Calls All You Need? Towards the Scaling Properties of Compound AI Systems. Preprint. arXiv:2403.02419.
- Hong, S., Zhuge, M., Chen, J., Zheng, X., Cheng, Y., Zhang, C., Wang, J., Wang, Z., Yau, S. K. S., Lin, Z., et al. (2023). MetaGPT: Meta Programming for a Multi-Agent Collaborative Framework. Preprint. arXiv:2308.00352.
- Wu, Q., Bansal, G., Zhang, J., Wu, Y., Li, B., Zhu, E., Jiang, L., Zhang, X., Zhang, S., Liu, J., Awadallah, A., White, R. W., Burger, D., & Wang, C. (2023). AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation. Preprint. arXiv:2308.08155.
- Albrecht, S. V., Christianos, F., & Schäfer, L. (2024). Multi-Agent Reinforcement Learning: Foundations and Modern Approaches. MIT Press.