Recursion: The Third Scaling Axis
After depth and width gave us parameters, and the internet gave us data, a third axis arrived almost unannounced: recursion — running the same computation again over its own evolving output. In 2025–26 it showed up independently at four different scales within months of each other. That convergence is the signal.
1 — Four Papers, One Curve
Over a single stretch of seven months, eleven papers from groups that share no authors converged on one idea. One looped a block of transformer layers and watched accuracy climb. Another iterated a tiny model's latent state until a Sudoku grid resolved. Another let a language model call itself recursively over a long document. Another folded an entire multi-agent system into one loop and trained the loop end-to-end. They were not at the same lab and were not solving the same task, and beneath the different vocabularies — looped transformers, recursive reasoners, recursive agents, recursive systems — sat one move: take a fixed computation and run it again over its own evolving output. The wave crested in a single month — May 2026 alone carries eight of them.
That move deserves a name and a place in the scaling story. For a decade we have scaled two things. We scaled parameters: depth and width, the size of the function the network computes. We scaled data: how much of the world that function is fit to. Those are the axes of the scaling laws everyone quotes. This essay is about a third one. Write any neural computation as a function f mapping a state to a new state, and write M for the model that contains it, meaning its weights and its forward pass. Parameters make f bigger. Data makes f a better fit. Recursion holds f fixed and applies it again: a state x evolves under $x_{t+1} = f(x_t)$ while the computation stays the same. Parameters scale the operator; data scales the target; recursion scales the number of times you run it.
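A minimal sketch of the move itself, assuming nothing beyond the definition above: the function stays fixed and only the number of applications changes. The helper `iterate` and the choice of f (a Babylonian step toward the square root of 2) are illustrative assumptions, not taken from any of the cited papers.

```python
from typing import Callable


def iterate(f: Callable[[float], float], x0: float, steps: int) -> float:
    """Apply the same fixed function f to its own output `steps` times."""
    x = x0
    for _ in range(steps):
        x = f(x)
    return x


def babylonian_step(x: float) -> float:
    """Illustrative f: one Babylonian (Newton) step toward sqrt(2)."""
    return 0.5 * (x + 2.0 / x)


# Error after k = 0, 1, 2, 3, 4 applications of the same f, starting from x = 1.
errors = [abs(iterate(babylonian_step, 1.0, k) - 2.0 ** 0.5) for k in range(5)]
```

Nothing about `babylonian_step` grows between runs; the error shrinks only because the same computation is run again on its own output, which is the sense in which recursion scales the number of applications rather than the operator.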
The reason to take this seriously is not that any single result is decisive. It is the convergence. When four communities that do not read each other's submissions independently reach for the same primitive in the same season, the primitive is usually real — not an artifact of one lab's taste. One of these papers says it out loud. The authors of Recursive Multi-Agent Systems open by observing that "recursive or looped language models have recently emerged as a new scaling axis by iteratively refining the same model computation over latent states," and then ask whether the axis extends past a single model to a society of them (Tong, Zhang et al., 2026). They answer yes. That sentence is the thesis of this essay, generalized: recursion is not a trick that one architecture happens to enjoy. It is an axis, and in 2025–26 it surfaced at four substrates at once — inside the layers, over a latent state, across inference calls, and between agents — four communities, four scales, all posted within a single season (see Figure 2).
They converge in time, and they converge in shape. Plot each paper's capability against how many times its computation is re-applied, and the curve is the same four times — a rise that steepens, then bends toward a plateau (see Figure 3).
The rest of this essay is a walk down that ladder of scales, from the finest grain to the coarsest, with one stop at each substrate. The published companion to this piece already mapped the finest grain in detail — recursion inside the forward pass, where reasoning happens in continuous latent state with no tokens emitted at all.
[← A11] Thinking Without Tokens — established the latent-compute substrate: iterating reasoning-critical layers (ETD), recurrent action heads (RD-VLA), and per-neuron temporal dynamics (the Continuous Thought Machine). That essay is the finest grain of recursion; this one bridges it to coarser substrates and does not re-derive it.
After parameters and data, the third scaling axis is recursion: hold a computation f fixed and apply it again to its own output, $x_{t+1} = f(x_t)$. The evidence is not one result but a convergence: in 2025–26, four independent communities reached for the same primitive at four scales (layers, latent state, inference, agents) within months. One of them named it outright: recursion as "a new scaling axis."
2 — Substrate I: Layers
Start at the smallest scale, inside a single model. A transformer is a stack of layers applied once, top to bottom. The looped-transformer idea is almost embarrassingly simple: tie a block of those layers together and run it more than once before decoding. The parameters do not grow; the depth does. You are spending compute, not memory, to think harder.
What changed in 2026 is that this stopped being a curiosity and became a program with four distinct moves. The first is a retrofit: you do not need to train a looped model at all. A test-time wrapper can take a frozen, already-trained checkpoint, re-apply a contiguous mid-stack block, and improve it — naive re-application usually degrades the model, but if you treat each layer as a forward-Euler step on an underlying differential equation and the loop as smaller, damped sub-steps of that same step, the gains return. Across seven model families this training-free loop lifts Qwen3-4B-Instruct by +2.64 points on MMLU-Pro with no fine-tuning and no new parameters (Li et al., 2026). Recursion here is pure inference-time scaling: same weights, more passes.
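One way to read the forward-Euler framing is the toy sketch below. It is not the implementation from Li et al.; the scalar "block" `g`, the unit time horizon, and the use of the exact ODE solution as the target are all assumptions made for illustration. A residual update x + g(x) is treated as one Euler step of size 1 on dx/dt = g(x); naive looping repeats that full step k times, while the damped loop takes k sub-steps of size 1/k over the same horizon.

```python
import math
from typing import Callable

Block = Callable[[float], float]


def naive_loop(g: Block, x: float, k: int) -> float:
    """Re-apply the full residual update k times (integrates to time k, not 1)."""
    for _ in range(k):
        x = x + g(x)
    return x


def damped_loop(g: Block, x: float, k: int) -> float:
    """Take k Euler sub-steps of size 1/k, staying on the original unit horizon."""
    h = 1.0 / k
    for _ in range(k):
        x = x + h * g(x)
    return x


def g(x: float) -> float:
    """Hypothetical residual branch: dx/dt = -0.5 * x."""
    return -0.5 * x


exact = math.exp(-0.5)           # true state at t = 1 for x(0) = 1
single = naive_loop(g, 1.0, 1)   # the ordinary one-pass update
naive4 = naive_loop(g, 1.0, 4)   # brute repetition
damped4 = damped_loop(g, 1.0, 4)
damped8 = damped_loop(g, 1.0, 8)
```

In this toy, brute repetition lands further from the target than a single pass, while the damped sub-steps land closer and keep improving as k grows. That mirrors the qualitative claim in the text, under the stated assumptions only.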
The second move is making the loop cheap. Looping pairs badly with full attention, because every extra pass pays the quadratic cost again. LT2 replaces softmax attention with linear-time variants and finds that looping and linearity are not merely compatible but synergistic — iteration refines a linear-attention memory and progressively widens a sparse-attention receptive field. Their hybrid matches a standard looped transformer's quality at fully linear-time cost, and a converted 1.4B model becomes competitive with industry 4B models, a +2.1-point gain over the standard looped form (Deng et al., 2026). The third move is showing the principle is not tied to autoregression at all: looping the early-middle layers of a masked diffusion language model buys a depth-scaling effect at fixed parameters, matching a same-size baseline with 3.3× fewer training FLOPs and adding up to +8.5 points on GSM8K, with the number of loops left as an inference-time compute dial (Lee et al., 2026).
The fourth move is the one that matters most for the bet, because it confronts recursion's oldest enemy: instability. Recurrent architectures are hard to train, expensive to deploy, and stuck at small fixed recurrence depths. Attractor Models answer this by not unrolling the loop at all. A backbone proposes an output embedding; an attractor module refines it by solving for the fixed point of the recurrence, with gradients obtained through implicit differentiation. Training memory becomes constant in effective depth, and the number of iterations is chosen adaptively by convergence rather than fixed in advance. The payoff is large at both ends of the size spectrum: a Pareto improvement in language-model pretraining (perplexity better by up to 46.6%, downstream accuracy by up to 19.7%, with a 770M model beating a 1.3B Transformer trained on twice the tokens), and, with only 27M parameters and roughly a thousand examples, 91.4% on Sudoku-Extreme and 93.1% on Maze-Hard — puzzles where frontier general models fail outright (Fein-Ashley & Rashidinejad, 2026) (see Figure 4).
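The fixed-point idea can be sketched in a few lines. This is a scalar toy, not the Attractor Models architecture: the map `f`, its parameter `theta`, and the tolerance are hypothetical. It shows the two properties the paragraph names: the iteration count is set by convergence rather than fixed in advance, and the gradient of the fixed point comes from the implicit function theorem, dz*/dθ = (∂f/∂θ) / (1 − ∂f/∂z), which needs only the converged state and none of the intermediate iterates.

```python
import math
from typing import Tuple


def f(z: float, theta: float) -> float:
    """Hypothetical contractive refinement map (|df/dz| <= 0.5)."""
    return 0.5 * math.tanh(z) + theta


def solve_fixed_point(theta: float, z0: float = 0.0,
                      tol: float = 1e-12, max_iter: int = 1000) -> Tuple[float, int]:
    """Iterate z <- f(z, theta) until successive iterates agree to within tol."""
    z = z0
    for n in range(1, max_iter + 1):
        z_next = f(z, theta)
        if abs(z_next - z) < tol:
            return z_next, n
        z = z_next
    return z, max_iter


def implicit_grad(z_star: float) -> float:
    """dz*/dtheta via the implicit function theorem, evaluated at the fixed point."""
    df_dz = 0.5 * (1.0 - math.tanh(z_star) ** 2)
    df_dtheta = 1.0
    return df_dtheta / (1.0 - df_dz)


z_star, n_iters = solve_fixed_point(theta=0.3)
grad_at_fixed_point = implicit_grad(z_star)
```

Because `implicit_grad` touches only `z_star`, the memory needed for the backward pass does not grow with `n_iters`, which is the toy analogue of training memory being constant in effective depth.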
[Figure caption, start truncated] … k times, converting parameters into iterations. Why it matters: three independent 2026 results (fewer FLOPs to match, a small model beating a larger one, and a training-free retrofit) all show iteration substituting for parameters, the cleanest case that recursion is a scaling lever.
Two of these four, the diffusion loop and the attractor, sit squarely in latent-compute territory that the published series already framed, and I lean on it rather than re-opening it.
[← A11] Thinking Without Tokens — already argued that iterating reasoning-critical layers (ETD) adds compute in latent space with no output tokens. The 2026 looped-transformer program is that argument at scale: same principle, more architectures, harder stability engineering (the attractor's fixed-point solve).
At the layer scale, recursion is depth bought with compute instead of parameters. The 2026 program made four moves: retrofit a loop onto a frozen model (+2.64 pp, training-free), make the loop linear-time (LT2), port it off autoregression onto diffusion (3.3× fewer FLOPs), and defeat the instability by solving for the loop's fixed point instead of unrolling it (770M > 1.3B). The recurring engineering content is stability, and that theme returns at every coarser scale.
3 — Substrate II: Latent State
Move out one notch. Instead of looping layers inside one forward pass, carry an explicit latent state and a candidate answer, and recurse on those across passes — refine, re-examine, refine again. This is the substrate where a 7M-parameter model can out-reason a frontier system on a puzzle, because the puzzle rewards iteration over a small carried state more than it rewards raw scale.
The base case is the Tiny Recursive Model: iteratively refine a latent state and a final answer with a tiny network. Its weakness is the weakness of any deterministic loop — it can converge to a wrong fixed point and has no way out. The fix is to make the recursion stochastic. The Probabilistic Tiny Recursive Model injects Gaussian noise at each deep recursion step so that parallel trajectories explore different solution basins, then selects among them using the model's existing confidence head — no retraining, no task-specific augmentation. On Sudoku-Extreme it moves accuracy from 87.4% to 98.75%; on a suite of pencil puzzles, from 62.6% to 91.2%, with a 7M-parameter model reaching nearly double the accuracy of frontier general models (Sghaier et al., 2026). The lesson generalizes: a companion paper makes "probabilistic multi-trajectory recursion" its explicit thesis, proposing it as a design principle for future recurrent and recursive reasoners rather than a single architecture (Jo et al., 2026). Recursion at this scale is not just deeper computation; it is search, and a noisy loop is a cheap way to search.
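The mechanism can be sketched on a toy landscape. Everything here is an illustrative assumption rather than the Probabilistic Tiny Recursive Model itself: the one-dimensional `energy` function stands in for the task, gradient steps stand in for the learned refinement network, and `confidence` is a stand-in for the model's confidence head. The noise-free loop started at 0.8 settles in the shallow basin near +1; noisy trajectories can cross into the deeper basin near -1, and selection picks whichever candidate scores highest.

```python
import random
from typing import List


def energy(x: float) -> float:
    """Toy landscape: shallow basin near x = +1, deeper basin near x = -1."""
    return (x * x - 1.0) ** 2 + 0.3 * x


def grad(x: float) -> float:
    return 4.0 * x * (x * x - 1.0) + 0.3


def refine(x: float, sigma: float, rng: random.Random,
           noisy_steps: int = 40, clean_steps: int = 200, lr: float = 0.05) -> float:
    """Recursive refinement: noisy steps to explore, then noise-free steps to settle."""
    for step in range(noisy_steps + clean_steps):
        noise = rng.gauss(0.0, sigma) if step < noisy_steps else 0.0
        x = x - lr * grad(x) + noise
        x = max(-2.0, min(2.0, x))  # keep the toy state bounded
    return x


def confidence(x: float) -> float:
    """Stand-in for a learned confidence head: lower energy, higher confidence."""
    return -energy(x)


x0 = 0.8
deterministic = refine(x0, sigma=0.0, rng=random.Random(0))
candidates: List[float] = [
    refine(x0, sigma=0.4, rng=random.Random(seed)) for seed in range(16)
]
selected = max(candidates, key=confidence)
```

The design choice worth noticing is that selection reuses a score the model already produces, so the only added cost is running more trajectories. With these settings the candidate pool will typically contain both basins, though that depends on the seeds and noise scale chosen here.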
The most language-model-native version of this substrate closes the gap back to chain-of-thought. Textual reasoning forces every intermediate step through a discrete, serial token stream — even when the underlying update is continuous and only half-formed. Latent reasoning is higher-bandwidth, but earlier latent methods gave up the things that make autoregression work: probabilistic sampling, KV-cache decoding, tractable likelihoods. NF-CoT recovers them by modeling each continuous "thought" with a normalizing flow inside the language-model backbone, so the model recurses over continuous thoughts and then emits text in the same causal stream, with exact likelihoods and direct policy-gradient training in the latent space — improving pass rates on code generation over both explicit chain-of-thought and prior latent baselines (Tu et al., 2026) (see Figure 5).
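The phrase "exact likelihoods" rests on the standard change-of-variables identity for normalizing flows. Stated generically, and not as NF-CoT's specific parameterization: assume an invertible map $g_\theta$ that sends a continuous thought $h$ to a base variable $z = g_\theta(h; c)$, conditioned on the preceding context $c$, with a simple base density $p_Z$. The symbols $h$, $c$, $g_\theta$, and $p_Z$ are notation introduced here for illustration.

```latex
\log p_\theta(h \mid c)
  = \log p_Z\!\big(g_\theta(h; c)\big)
  + \log \left| \det \frac{\partial g_\theta(h; c)}{\partial h} \right|
```

Every term on the right can be evaluated for a sampled thought, so its log-likelihood is available in closed form. That is what allows sampling and policy-gradient updates to operate directly on latent thoughts.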
[← A11] Thinking Without Tokens — the Continuous Thought Machine put the latent loop at the neuron level and showed adaptive computation is native to it. These tiny recursive reasoners are the same substrate one level up: a carried state instead of a per-neuron history, and a deliberate dose of noise to turn refinement into search.
At the latent-state scale, recursion is iterative refinement of a carried state — and once you add noise, it is search. A stochastic loop over a 7M-parameter model lifts Sudoku-Extreme from 87.4% to 98.75% and proposes multi-trajectory recursion as a design principle, while normalizing-flow latents make the loop compatible with everything that makes autoregression work. The right problems reward iteration over a small state more than they reward parameters.
4 — Substrate III: Inference
Climb out another notch and the loop leaves the weights entirely. Now the unit of recursion is a whole model call. Here it helps to bring in the second piece of series notation: E, the environment a model acts over. Ordinarily a prompt is just input. Recursive Language Models reframe the long prompt as an environment E that the model can examine, decompose, and act on — and the central action available to M is to call itself on a snippet of E, recursively, until the pieces are small enough to answer directly (Recursive Language Models, MIT, 2025).
The numbers are the strongest single argument in this essay that recursion buys capability rather than just smoothing a curve. Treating context as a recursable environment lets the system process inputs more than an order of magnitude beyond the model's context window, and even on inputs that fit, it beats strong scaffolds: against GPT-5, a median 26% over context compaction, 130% over a sub-call coding agent, and 13% over a frontier coding harness across four long-context tasks, at comparable cost. A small model post-trained to use the recursion, RLM-Qwen3-8B, improves on its own base by a median 28%. The mechanism is exactly $x_{t+1} = f(x_t)$ with f a model call and x a shrinking, structured view of the environment (see Figure 6).
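The control flow can be sketched with a stub in place of the model. The assumptions are deliberate simplifications: `WINDOW` is a hypothetical limit on how many records one call can read, `base_model` is a stand-in that counts error lines, and the decomposition is fixed halving, whereas in a Recursive Language Model the model itself decides how to examine and split the environment.

```python
from typing import List, Tuple

WINDOW = 8  # hypothetical: max records a single model call can read


def base_model(records: List[str]) -> int:
    """Stub for one direct model call: count records that mention ERROR."""
    assert len(records) <= WINDOW, "a single call cannot exceed its window"
    return sum("ERROR" in record for record in records)


def recursive_call(env: List[str], depth: int = 0) -> Tuple[int, int]:
    """Answer over env, recursing on snippets until each fits in one call.

    Returns (answer, deepest recursion level reached).
    """
    if len(env) <= WINDOW:
        return base_model(env), depth
    mid = len(env) // 2
    left, left_depth = recursive_call(env[:mid], depth + 1)
    right, right_depth = recursive_call(env[mid:], depth + 1)
    return left + right, max(left_depth, right_depth)


# An environment 12.5x larger than the window of any single call.
env = [f"line {i}: {'ERROR' if i % 7 == 0 else 'ok'}" for i in range(100)]
answer, max_depth = recursive_call(env)
```

No single call ever sees more than `WINDOW` records, yet the composed answer covers all 100. The stub makes the combining step trivial (addition); for real tasks the combine step is itself a model call, which is where the approach earns or loses its accuracy.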
I treat this result as the inference-scale data point on the recursion axis and lean on its dedicated home in this series for the full argument, rather than re-litigating it here.
[← 2] Paradigm Bets: The Ten-Year Tier — makes "context as environment" one of its long bets, with Recursive Language Models (part of the published collection) as the lead exhibit. Here that same system is read narrowly: as the proof that recursion is a scaling lever at the granularity of the model call.
[Figure caption, start truncated] … E; the model's core action is to call itself on snippets of E, recursing until pieces are answerable. Why it matters: processing inputs more than 10× the context window, and a 28% lift over the base model (RLM, MIT, 2025), show recursion adding capability the single forward pass cannot reach: the same $x_{t+1} = f(x_t)$ move, now with f a whole model call.
At the inference scale, the loop leaves the weights: the model recurses by calling itself over a long prompt reframed as an environment E. This lets a system handle inputs >10× its context window and beat strong scaffolds by double-digit margins at comparable cost (RLM, MIT, 2025). It is the clearest evidence that recursion buys capability, not just a smoother accuracy curve, and it is the hinge between recursion inside a model and recursion between models.
5 — Substrate IV: Agents and Societies
The coarsest scale is the one where the loop runs between whole models. An agent that can spawn copies of itself and delegate sub-tasks to them is running $x_{t+1} = f(x_t)$ where f is "instantiate another instance of me on a sub-problem." Recursive Agent Optimization makes this trainable: rather than bolting a recursion scaffold onto a frozen model, it uses reinforcement learning to teach an agent when to delegate, how to decompose, and how to communicate with its own spawned children. Agents trained this way scale to tasks beyond their context window, generalize to problems much harder than they were trained on, and can finish in less wall-clock time than a single agent grinding sequentially (Gandhi et al., 2026). The structure is divide-and-conquer, and the contribution is making the model good at running it.
One scale up again, the loop encloses an entire heterogeneous system. RecursiveMAS casts a whole multi-agent system as a single latent-space recursive computation: agents are connected into a collaboration loop through a lightweight link module that passes latent state between them, and the entire system is co-optimized end-to-end by an inner–outer loop that shares gradient-based credit across recursion rounds. Across four collaboration patterns and nine benchmarks it delivers an average +8.3% accuracy at 1.2×–2.4× the efficiency of single-agent, multi-agent, and recursive baselines (Tong, Zhang et al., 2026). And because recursion at this scale is also breadth — many candidates explored in parallel — it needs a selection step. OpenDeepThink supplies one: when there is no ground-truth verifier and pointwise judging is noisy, it runs a population loop that selects by pairwise Bradley–Terry comparison, mutates survivors using the natural-language critiques produced during judging, and culls the rest. Eight sequential rounds raise a strong model's effective Codeforces Elo by +405 points in about 27 minutes, with the gains concentrated in objectively verifiable domains (Zhou et al., 2026) (see Figure 7).
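The selection step can be sketched with a standard Bradley–Terry fit. This is a generic minorization-maximization update for Bradley–Terry strengths, not OpenDeepThink's code; the `wins` matrix is made-up illustrative data, and the critique-driven mutation and culling steps of the population loop are not modeled.

```python
from typing import List


def fit_bradley_terry(wins: List[List[int]], iters: int = 200) -> List[float]:
    """Estimate Bradley-Terry strengths from pairwise results.

    wins[i][j] is how many times candidate i was judged better than j.
    Assumes every candidate has at least one win, so strengths stay positive.
    """
    n = len(wins)
    p = [1.0] * n
    for _ in range(iters):
        updated = []
        for i in range(n):
            total_wins = sum(wins[i][j] for j in range(n) if j != i)
            denom = sum((wins[i][j] + wins[j][i]) / (p[i] + p[j])
                        for j in range(n) if j != i)
            updated.append(total_wins / denom if denom > 0 else p[i])
        norm = sum(updated)
        p = [value / norm for value in updated]  # fix the scale: strengths sum to 1
    return p


# Hypothetical judging results for four candidates, five comparisons per pair.
wins = [
    [0, 3, 4, 5],
    [2, 0, 3, 4],
    [1, 2, 0, 3],
    [0, 1, 2, 0],
]
strengths = fit_bradley_terry(wins)
ranking = sorted(range(len(strengths)), key=lambda i: -strengths[i])
survivors = ranking[:2]  # keep the top half for the next round
```

Fitting one global strength per candidate pools evidence across all comparisons, so a single noisy judgment moves the ranking less than it would under pointwise scoring. That pooling is the reason pairwise aggregation suits a setting with no ground-truth verifier.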
Two clarifications keep this honest. First, the selection step is not decoration: at breadth, recursion without a way to choose among candidates just multiplies noise, and the absence of a cheap, unbiased selector is the real bottleneck — which is why Bradley–Terry aggregation belongs in this story, not as a footnote but as the part that makes the loop pay. Second, the engineering of running recursive societies in production — keeping the loop stable, attributing blame when it fails, paying for it — is a field-guide problem this essay deliberately does not annex.
[→ B8] Multi-Agent Systems and Their Failure Modes — the practitioner's account of recursive societies: an untrained recursive system is a feedback loop that amplifies collapse, so the real work is making the loop stable, trainable, and debuggable. This essay treats RecursiveMAS as the coarsest data point on the recursion axis; B8 treats it as something you have to operate.
At the coarsest scale, f is "instantiate another model": agents that spawn and delegate to copies of themselves (RAO), and whole systems folded into one trainable latent loop (RecursiveMAS, +8.3% at 1.2–2.4× efficiency). Because recursion here is also breadth, it requires a selection step — pairwise Bradley–Terry aggregation lifts a strong model by +405 Elo in eight rounds. The same axis that bought depth in the layers buys coordination between agents.
6 — What Recursion Buys, and What It Doesn't
Now the discipline. An axis is only an axis if its returns keep coming, and the honest reading of these ten papers is that recursion's returns are real but bounded in ways we do not yet understand. Three boundaries are already visible in the same results that make the case.
The first is instability. Naive looping usually degrades a model rather than improving it; the training-free retrofit works only because it reframes the loop as damped sub-steps, not brute repetition (Li et al., 2026). Recurrent architectures are, in the words of the attractor paper, "unstable to train, costly to optimize and deploy, and constrained to small, fixed recurrence depths" — which is precisely why solving for a fixed point beats unrolling the loop (Fein-Ashley & Rashidinejad, 2026). At the system scale the instability is worse, not better: an untrained recursive society amplifies its own errors around the loop. Every paper that succeeds spends its cleverness on stability. That is the tell of a young axis.
The second is selection. Depth-recursion has a built-in objective; breadth-recursion does not. The moment you explore parallel trajectories — noisy latent rollouts, populations of candidates, spawned sub-agents — you inherit the problem of choosing among them without a ground-truth verifier, and pointwise judging is too noisy to lean on (Zhou et al., 2026). Recursion's returns at breadth are gated by the quality of the selector, and selectors are themselves an open research problem.
The third, and the one the bet turns on, is saturation. The depth-equivalence argument (k loops of one block approximate a network k times as deep) has limits, because not every problem rewards more depth, and the published series already showed the cleanest example: a recurrent action head that fails completely at one iteration exceeds 90% success at four and then stops improving, while simpler tasks saturate early and gain nothing from further loops (RD-VLA, [← A11]). Adaptive-compute mechanisms exist precisely because the right number of iterations is small and input-dependent. If that early saturation is universal, meaning two to four iterations capture essentially all the available gain at every substrate, then recursion is a useful efficiency trick, not a scaling axis (see Figure 8).
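The distinction between a flattening curve and a climbing one can be made operational. The sketch below is one possible test, with an arbitrary `min_gain` threshold; both example curves are hypothetical shapes invented for illustration and are not measurements from any cited paper.

```python
from typing import List, Optional


def saturation_point(scores: List[float], min_gain: float = 0.02) -> Optional[int]:
    """Return the iteration count after which every further gain is below min_gain.

    scores[i] is the score at i + 1 iterations. Returns None if the curve is
    still climbing at the last measured iteration.
    """
    gains = [later - earlier for earlier, later in zip(scores, scores[1:])]
    for k in range(len(gains)):
        if all(gain < min_gain for gain in gains[k:]):
            return k + 1
    return None


# Hypothetical shapes only: one curve that flattens early, one that keeps paying.
early_saturation = [0.00, 0.55, 0.80, 0.92, 0.93, 0.93]
still_climbing = [0.30, 0.40, 0.50, 0.58, 0.65, 0.71]

flat_at = saturation_point(early_saturation)
climbing_at = saturation_point(still_climbing)
```

Here `flat_at` is 4 and `climbing_at` is `None`. The threshold and the measured range both matter: a curve that looks unsaturated over six iterations may still flatten at sixty, so a `None` is evidence of climbing only within the range actually tested.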
Recursion's returns are real but bounded by three forces visible in the very results that establish it: instability (every successful paper spends its ingenuity on stabilizing the loop), selection (breadth-recursion is gated by a verifier we do not have), and saturation (gains concentrate at small iteration counts; simpler tasks stop improving by ~2–4 iterations). Whether saturation is universal is exactly what separates a scaling axis from a clever trick.
7 — The Bet, Stated
Here is the position, stated plainly enough to be wrong. Recursion is a genuine third scaling axis. Like parameters and data, it has a returns curve that keeps paying as you push it — not at every problem, but as a lever you can reliably pull — and because it has now appeared independently at four substrates, the right unit of progress over the next few years is not any single looped architecture but the axis itself. If the bet is right, the 2029 frontier model is natively recursive at more than one scale at once: layers that solve for a fixed point rather than stacking; a latent loop that searches before it speaks; an inference policy that calls itself over its context; and, around all of it, a trained society loop. Depth, breadth, and delegation stop being separate tricks and become settings on one dial. The compute argument underneath is the same one that made the first two axes irresistible: recursion converts cheap, parallel, repeatable computation into capability, and cheap repeatable computation is the thing the field has most of.
And here is the falsifier, with no escape hatch. If loop returns saturate at small iteration counts across all four substrates — if two to four iterations capture essentially all the available gain whether you are looping layers, refining a latent state, recursing inference calls, or rounding a society — then recursion is an efficiency trick, not an axis, and this essay is wrong. An axis is defined by a curve that keeps climbing; a trick is defined by a curve that flattens. The evidence today is genuinely split: the inference substrate (processing >10× the context window) and the systems substrate (gains compounding across rounds) look like climbing curves, while the robotics and puzzle results look like early saturation. That split is why this is a bet and not a report (see Figure 9).
The 2029 review. I will revisit this bet in 2029 against one question: at how many substrates do loop returns still climb past a handful of iterations? If at least the inference and systems substrates show curves that keep paying — and especially if a frontier model ships that is recursive at two scales at once — the axis is real. If every substrate has flattened into a two-to-four-iteration efficiency tweak, the bet was wrong and "recursion" was the wrong word for what was really just adaptive compute. There is no third reading that lets the position survive a flat curve.
The bet: recursion is a true third scaling axis, and the 2029 frontier model is natively recursive at more than one substrate at once. The falsifier: loop returns saturate at small iteration counts across all four substrates, making recursion an efficiency trick rather than an axis. The evidence is split today — inference and systems look like climbing curves, robotics and puzzles like early saturation — which is exactly why it is a bet, to be settled at a dated 2029 review.
What Comes Next
Recursion produces something valuable as a by-product: better trajectories. A looped reasoner, a recursive agent, a society that rounds to a good answer — each emits a path through problem space that is stronger than what a single forward pass would have produced. The obvious next question is how to keep that quality: how to bake the trajectories a recursive process discovers back into the weights, so the model does in one pass what previously took a loop. That is a training problem, not an architecture problem, and it is the subject of the next essay.
[→ 4] On-Policy Distillation Quietly Ate Post-Training — while RLVR took the headlines, on-policy distillation became the workhorse for moving frontier-agent skill into deployed models: on-policy trajectories with dense teacher supervision. If recursion is how you generate better trajectories, distillation is how you internalize them. And the practitioner's view of running recursive societies in production remains [→ B8] Multi-Agent Systems and Their Failure Modes.
References
- Li, J., Liang, C., & Lao, N. (2026). Training-Free Looped Transformers. Preprint. arXiv:2605.23872.
- Deng, C., Zhang, Y., Zhu, R., Xu, Y., Liu, J., Ng, T. S. E., & Chen, H. (2026). LT2: Linear-Time Looped Transformers. Preprint. arXiv:2605.20670.
- Lee, S., Hong, C., Kim, S., Lee, J., Park, J., & Park, D. (2026). Looped Diffusion Language Models. Preprint. arXiv:2605.26106.
- Fein-Ashley, J., & Rashidinejad, P. (2026). Solve the Loop: Attractor Models for Language and Reasoning. Preprint. arXiv:2605.12466.
- Sghaier, A., Parviz, A., & Jolicoeur-Martineau, A. (2026). Probabilistic Tiny Recursive Model. Preprint. arXiv:2605.19943.
- Jo, M., Kim, M., & Ren, M. (2026). Generative Recursive Reasoning. Preprint. arXiv:2605.19376.
- Tu, G., Fu, X., Yu, S., Tang, Y., Kang, H., Qin, L., Zhang, Y., & Gu, J. (2026). Latent Reasoning with Normalizing Flows. Preprint. arXiv:2606.06447.
- Gandhi, A., Chakraborty, S., Wang, X., Kumar, A., & Neubig, G. (2026). Recursive Agent Optimization. Preprint. arXiv:2605.06639.
- Tong, H., Zhang, T., Buehler, M. J., He, J., & Zou, J. (2026). Recursive Multi-Agent Systems. Preprint. arXiv:2604.25917.
- Zhou, S., Chai, W., Liu, K., Mao, H., Mang, Q., & Shang, J. (2026). OpenDeepThink: Parallel Reasoning via Bradley–Terry Aggregation. Preprint. arXiv:2605.15177.