Part II · The Engine · 3 of 12 · ARAG

Environments Are the Bottleneck

Agentic RL is not blocked on algorithms or compute. It is blocked on environments — and the teams that industrialize environment supply will own agent training the way data-pipeline teams owned supervised learning.

Figure 1: The thesis in one picture. An environment factory stamps out interactive tiles (E) onto a conveyor that feeds a model M in a training loop; a reward signal (r) closes the loop. The scarce machine is the factory, not the loop.

§1 — Your RL project is stuck, and it's not the algorithm

Take an inventory of your stalled agentic-RL project. Count what you have. Algorithms: more than you can read — PPO, GRPO, DPO, and a dozen multi-turn variants published this year alone. Compute: enough, or close enough that it is not what wakes you at night. Environments: three. They are hand-built, they took a quarter to write, and your agent has already saturated all of them. That third number is the whole story. The bottleneck is not the optimizer; it is the supply of worlds for the optimizer to learn in.

This is not a metaphor reached for after the fact. It is the opening sentence of the most direct paper on the subject: "Environments are the bottleneck for self-improving agents" (Gandhi et al., 2026). Their system, Endless Terminals, makes the point by removing everything else. They train terminal agents with vanilla PPO, binary episode-level rewards, and a minimal 16-turn interaction loop — no retrieval, no multi-agent coordination, no specialized tools — and still take Qwen2.5-7B from 10.7% to 53.3% on a held-out evaluation set. The algorithm is deliberately boring. The variable they scaled is the number of environments: a procedural pipeline that yields 3,255 verified, containerized tasks with no human annotation (see Figure 1).

Before going further, fix the vocabulary this series uses for the moving parts. An environment E is the tool-world the agent acts in — its state, its transitions, and crucially the function that decides whether an outcome counts as success. A trajectory τ is one multi-turn episode of the agent acting and observing inside E. A reward r is the scalar the environment returns. The harness H is the scaffold around the model — context assembly, tools, memory, verification — that we will return to in §4. Hold those four symbols; the rest of the article is about who manufactures E, and at what scale.
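To make the four symbols concrete, here is a minimal sketch of how they might look in code. Every name in it (`Environment`, `Trajectory`, `rollout`, `CounterEnv`, `counting_policy`) is hypothetical and chosen for illustration; the harness H is collapsed into a plain policy callable, and the 16-turn cap and binary reward simply mirror the Endless Terminals setup described above.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Protocol, Tuple


class Environment(Protocol):
    """E: state, transitions, and the success check (hypothetical interface)."""

    def reset(self) -> str: ...
    def step(self, action: str) -> Tuple[str, bool]: ...  # (observation, done)
    def verify(self) -> bool: ...  # decides whether the outcome counts as success


@dataclass
class Trajectory:
    """tau: one multi-turn episode of (observation, action) pairs plus its reward."""
    turns: List[Tuple[str, str]] = field(default_factory=list)
    reward: float = 0.0


def rollout(env: Environment, policy: Callable[[str], str], max_turns: int = 16) -> Trajectory:
    """Run one episode and attach a binary, episode-level reward r."""
    tau = Trajectory()
    obs = env.reset()
    for _ in range(max_turns):
        action = policy(obs)
        tau.turns.append((obs, action))
        obs, done = env.step(action)
        if done:
            break
    tau.reward = 1.0 if env.verify() else 0.0
    return tau


class CounterEnv:
    """Toy E: succeed by incrementing a counter to exactly `target`."""

    def __init__(self, target: int = 3) -> None:
        self.target = target
        self.count = 0

    def reset(self) -> str:
        self.count = 0
        return f"count={self.count}"

    def step(self, action: str) -> Tuple[str, bool]:
        if action == "inc":
            self.count += 1
        return f"count={self.count}", action == "stop"

    def verify(self) -> bool:
        return self.count == self.target


def counting_policy(obs: str) -> str:
    """Stand-in for the model M behind a trivial harness: count to 3, then stop."""
    return "inc" if int(obs.split("=")[1]) < 3 else "stop"


tau = rollout(CounterEnv(target=3), counting_policy)
print(len(tau.turns), tau.reward)
```

The point of the sketch is where the success check lives: `verify` belongs to the environment, not to the policy, which is why manufacturing E means manufacturing the grader too.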

The reason this caught the field off guard is historical. For a decade the binding constraint in machine learning was the model: architectures, optimizers, parameter counts. Agentic RL inherited a world where models are strong and cheap to fine-tune, and discovered the constraint had moved underneath it — to the substrate the model trains against. You can feel the shift in the failure mode. A 2019-era project failed because the network would not learn. A 2026-era project fails because there is nothing left to learn from.

Key Takeaway

1. The scarce resource in agentic RL is not algorithms or compute — it is environments. Endless Terminals (Gandhi et al., 2026) demonstrates the asymmetry directly: hold the algorithm at vanilla PPO with binary rewards, scale only the environment supply to 3,255 procedurally generated tasks, and a 7B agent still jumps from 10.7% to 53.3%. When the boring lever moves the number, the boring lever is the bottleneck.

§2 — Evaluation is not a training substrate

The deepest mistake in applied agentic RL is reaching for a benchmark when you need a pipeline. They are different objects with different jobs, and the confusion is expensive. A benchmark is built to be fixed: a frozen, finite set of tasks with a leaderboard, designed so two systems can be compared fairly. A training substrate is built to be consumed: it must hand the learner a fresh, gradeable problem on every step, forever. Freezing is a feature for the first and a fatal flaw for the second.

Concretely, a substrate has to do four things a benchmark never promised. It must reset cleanly so each τ starts from a known state. It must return verifiable rewards — programmatic checks, not vibes — so r is trustworthy at scale. It must offer a difficulty curriculum, because an agent that has solved everything easy needs harder problems on tap. And it must provide unbounded supply, because RL burns through tasks orders of magnitude faster than evaluation does. WebArena (Zhou et al., 2023) is the cautionary example: a beautifully reproducible, standalone web environment where the best GPT-4 agent reached only 14.41% against 78.24% for humans. It was engineered to expose a gap, and it does — but it is a yardstick, not a quarry. You cannot run a million RL episodes against it without the agent memorizing its finite task set.
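The four requirements can be read as an interface. The sketch below is a toy under stated assumptions, not any cited system: tasks are arithmetic sums, "difficulty" is the number of operands, and all names (`Task`, `make_task`, `verify`, `task_stream`) are invented for illustration. Resets are trivial here because each task is stateless and the stream is seeded.

```python
import itertools
import random
from dataclasses import dataclass
from typing import Iterator


@dataclass(frozen=True)
class Task:
    prompt: str
    answer: int
    difficulty: int


def make_task(rng: random.Random, difficulty: int) -> Task:
    """Difficulty curriculum: more operands means a harder task."""
    terms = [rng.randint(1, 9) for _ in range(difficulty + 1)]
    return Task(" + ".join(map(str, terms)), sum(terms), difficulty)


def verify(task: Task, response: str) -> float:
    """Verifiable reward: 1.0 only for the exact integer answer."""
    try:
        return 1.0 if int(response.strip()) == task.answer else 0.0
    except ValueError:
        return 0.0


def task_stream(seed: int, start_difficulty: int = 1) -> Iterator[Task]:
    """Unbounded supply; a fixed seed makes every run start from a known state."""
    rng = random.Random(seed)
    for i in itertools.count():
        yield make_task(rng, start_difficulty + i // 100)  # harder every 100 tasks


stream = task_stream(seed=0)
first = next(stream)
print(first.prompt, verify(first, str(first.answer)))
```

A frozen benchmark is the same code with `itertools.count()` replaced by a fixed list, which is exactly the property that makes it a yardstick and not a quarry.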

[Figure 2 diagram: the supervised-learning era mapped onto the agentic-RL era. Labels → Rewards r. Datasets → Environment distributions E. Data cleaning → Reward verification. Data flywheel → Interaction flywheel. Working instantiations named beneath the right column: Endless Terminals, Gym-Anything audit agent, Agent-World self-evolution.]
Figure 2: "Environments are the new data," made literal. Each pillar of the supervised data pipeline has an exact counterpart in agentic RL. The right column is not an analogy for its own sake — each row already has a working instantiation (named below it), which the rest of this article walks through.

The cleanest way to think about the new constraint is to run an old playbook forward. In the supervised era, the asset that decided who won was not the model — it was the data pipeline. The map is exact (see Figure 2). Labels became rewards: the supervision signal moved from a human annotation attached to an input to a programmatic verdict attached to a trajectory. Datasets became environment distributions: you no longer ship a static file, you ship a generator over E. Data cleaning became reward verification: the unglamorous, decisive work of making sure the signal is correct. And the data flywheel became the interaction flywheel: the system that turns a deployed model's experience back into the next round of training material. Whoever industrializes that column owns the next era, exactly as the data-infrastructure teams owned the last one.

[← 2] established that agent benchmark numbers measure demos, not the capability you think you bought — the 1MB replay script that beats frontier models on prominent computer-use benchmarks is the proof. This article takes the next step: even a benchmark you could trust is built to be measured once, not consumed a million times.
Key Takeaway

2. Benchmarks are built to be frozen; training substrates are built to be consumed. A substrate needs four things a benchmark never promised — resets, verifiable rewards, a difficulty curriculum, and unbounded supply. Run the supervised playbook forward and the lesson is unambiguous: the winning asset is the environment pipeline, the way the data pipeline was the winning asset before it.

§3 — The environment factories

The good news for a practitioner is that the factories already exist, in the open, and they disagree about how to build environments in instructive ways. Five are worth knowing, because between them they cover the whole design space (see Figure 3).

Endless Terminals (Gandhi et al., 2026) is the procedural extreme. Its pipeline runs four stages — generate diverse task descriptions, build and validate a containerized environment, generate completion tests, and filter for solvability by sampling solutions from a strong model — and emits tasks with verification baked in. Resets are free because every task is a fresh container; the reward is a binary completion test. Gym-Anything (Aggarwal et al., 2026) attacks a different axis: economic coverage. It treats environment creation itself as a multi-agent task — a coding agent writes setup scripts, downloads real data, and configures the software while producing evidence, and an independent audit agent checks that evidence against a quality checklist. Pointed at a GDP-grounded taxonomy of occupations, it converts 200 applications into 10,000+ long-horizon tasks. That audit agent is the most important detail in the section: it is reward verification (Figure 2, row 3) implemented as a second model whose only job is to refuse bad environments.
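The four-stage shape of a procedural factory is easy to sketch, with the caveat that what follows is a toy and not the Endless Terminals implementation. A dict stands in for a container, a single "write" action stands in for a terminal session, and `oracle_solver` stands in for sampling from a strong model; every name is hypothetical. The read-only `/ro` directory exists only to give the solvability filter something to reject.

```python
import random
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

FileSystem = Dict[str, str]
Command = Tuple[str, str]  # (path, content) for a single "write" action


@dataclass
class Task:
    description: str
    path: str
    content: str
    test: Callable[[FileSystem], bool]


def generate_specs(rng: random.Random, n: int) -> List[Tuple[str, str]]:
    """Stage 1: sample diverse (path, content) task specs."""
    dirs = ["/tmp", "/home", "/ro"]  # "/ro" is read-only, so those tasks are unsolvable
    return [(f"{rng.choice(dirs)}/f{i}.txt", f"v{rng.randint(0, 99)}") for i in range(n)]


def build_environment() -> FileSystem:
    """Stage 2: every attempt gets a fresh, empty 'container' (here, a dict)."""
    return {}


def apply(fs: FileSystem, command: Command) -> None:
    """Environment transition: writes under /ro/ are silently dropped."""
    path, content = command
    if not path.startswith("/ro/"):
        fs[path] = content


def make_test(path: str, content: str) -> Callable[[FileSystem], bool]:
    """Stage 3: a binary completion test over the final state."""
    return lambda fs: fs.get(path) == content


def solvable(task: Task, solver: Callable[[Task], Command], samples: int = 4) -> bool:
    """Stage 4: keep the task only if some sampled solution passes its test."""
    for _ in range(samples):
        fs = build_environment()
        apply(fs, solver(task))
        if task.test(fs):
            return True
    return False


def factory(seed: int, n: int, solver: Callable[[Task], Command]) -> List[Task]:
    rng = random.Random(seed)
    tasks = [
        Task(f"write {content!r} to {path}", path, content, make_test(path, content))
        for path, content in generate_specs(rng, n)
    ]
    return [t for t in tasks if solvable(t, solver)]


def oracle_solver(task: Task) -> Command:
    """Stand-in for sampling a solution from a strong model."""
    return (task.path, task.content)


verified = factory(seed=0, n=50, solver=oracle_solver)
print(len(verified), all(not t.path.startswith("/ro/") for t in verified))
```

Note what the filter buys: no human ever labels a task as broken. An unsolvable task simply never produces a passing sample and drops out of the supply.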

Agent-World (Dong et al., 2026) closes the loop into a curriculum. Built on the Model Context Protocol — a unified interface for connecting an agent to real-world tool services — it pairs an environment-and-task discovery component that synthesizes verifiable tasks with controllable difficulty against a self-evolving trainer that identifies the agent's capability gaps and synthesizes new tasks aimed at them. Its 8B and 14B agents beat strong proprietary baselines across 23 agent benchmarks, and the paper reports clean scaling trends in both environment diversity and self-evolution rounds. OpenResearcher (Li et al., 2026) is the reproducibility play: it runs the search-and-browse loop entirely offline over a 15-million-document corpus through three primitives — search, open, find — and uses a teacher (GPT-OSS-120B) to synthesize 97,000+ trajectories, including a long tail with 100+ tool calls. Fine-tuning a 30B-A3B model on that synthetic experience reaches 54.8% on BrowseComp-Plus, a +34.0-point jump over the base model. Note the honest asterisk: OpenResearcher's signal is distilled teacher trajectories, not a live RL reward, which is why it scores differently on the audit. Finally, ROME/iFlow (ROCK, ROLL & iFlow Joint Team, 2025) is the systems answer — an open ecosystem that cleanly separates the post-training framework (ROLL) from the sandbox environment manager (ROCK), which is exactly the modularization that lets an environment pipeline scale independently of the trainer.

There is also a sixth design that belongs in any honest survey, even though it lives in a different lineage: generate the environment itself. Genie (Bruce et al., 2024) is a foundation world model that turns prompts, sketches, and photographs into playable, controllable environments learned from unlabeled video. It is "environments are data" taken to the limit — the distribution over E is no longer hand-authored, it is sampled from a model. We return to that frontier in §6.

| Environment factory | Resets | Verifiable rewards | Difficulty curriculum | Supply |
| --- | --- | --- | --- | --- |
| Endless Terminals (Gandhi et al., 2026) | ✅ fresh container / task | ✅ binary completion tests | ⚠️ solvability-filtered; diversity by sampled category | ✅ 3,255 tasks, no human labels |
| Gym-Anything (Aggarwal et al., 2026) | ✅ per-software setup scripts | ✅ audit agent verifies setup + task | ⚠️ breadth via GDP occupation taxonomy | ✅ 200 apps → 10,000+ tasks |
| Agent-World (Dong et al., 2026) | ✅ stateful MCP tool worlds | ✅ synthesized verifiable tasks | ✅ controllable difficulty + self-evolution to gaps | ✅ thousands of themes; 23-benchmark gains |
| OpenResearcher (Li et al., 2026) | ✅ fully offline, instrumented | ⚠️ teacher-distilled trajectories, not RL reward | ⚠️ long-horizon tail (100+ tool calls) | ✅ 97,000+ trajectories, reproducible |
| ROME / iFlow (iFlow Joint Team, 2025) | ✅ ROCK sandbox manager | ⚠️ ROLL post-training + Terminal Bench Pro | ❌ ecosystem, not a single curriculum | ✅ open end-to-end stack |
Figure 3: The environment factory floor, audited against the four substrate requirements from §2. No single factory maxes every column — Endless Terminals optimizes verified supply, Agent-World optimizes curriculum, OpenResearcher optimizes reproducibility — which is exactly why "environment supply" is a portfolio decision, not a product purchase. Every cell is sourced to the cited paper's reported design.
Key Takeaway

3. Five open factories already span the design space: procedural supply (Endless Terminals), economic coverage with an audit agent (Gym-Anything), self-evolving curriculum (Agent-World), offline reproducibility (OpenResearcher), and modular systems (ROME/iFlow). No one factory wins every column. Treat environment supply as a portfolio you assemble, and notice that reward verification keeps showing up as a second model whose job is to reject bad environments.

§4 — Train where you serve

Here is the failure that quietly killed most applied RL of the 2018 era, and it had nothing to do with reward design. Teams trained agents in a clean simplified simulator and deployed them into a messy production system, and the policy fell apart on contact — the train/serve gap. The two distributions were different, so a policy optimal in the first was mediocre in the second. The modern equivalent is sharper, because the production system is the harness H: the context-assembly, tool-routing, memory, and verification scaffold the agent actually runs inside. If you train in a stripped-down environment and serve inside a rich harness, you have rebuilt the 2018 gap with new words.

The fix is the most important infrastructure idea in agentic RL right now: train on the production harness itself. Polar (Zou et al., 2026) does this by treating the harness as a black box. It proxies the agent's LLM API calls, records the token-level interactions, and reconstructs token-faithful trajectories from whatever the real harness did — so the RL trainer sees exactly the distribution the agent will be served in. The decoupling is what makes it scale: rollout nodes are agnostic to harness, trainer, and algorithm. And the result makes the train/serve gap visible as a number. Using simple GRPO, Polar improves the same Qwen3.5-4B by +22.6, +4.8, +0.6, and +6.2 points on SWE-Bench Verified across the Codex, Claude Code, Qwen Code, and Pi harnesses respectively. Same model, same algorithm, four harnesses, and the gain ranges from negligible to enormous (see Figure 4). The harness is not a deployment detail you bolt on afterward — it is part of the MDP, and which harness you train against is a first-order decision.
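The proxy idea is simple enough to sketch. This is not Polar's code or API; it is a minimal illustration, under the assumption that the harness reaches the model through a single callable that maps prompt token ids to completion token ids. `RecordingProxy`, `to_training_sequences`, `toy_backend`, and `toy_harness` are all hypothetical names.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

TokenIds = List[int]
LLM = Callable[[TokenIds], TokenIds]  # prompt token ids -> completion token ids


@dataclass
class RecordedCall:
    prompt_ids: TokenIds
    completion_ids: TokenIds


@dataclass
class RecordingProxy:
    """Sits between a black-box harness and the model, logging every call."""
    backend: LLM
    calls: List[RecordedCall] = field(default_factory=list)

    def __call__(self, prompt_ids: TokenIds) -> TokenIds:
        completion_ids = self.backend(prompt_ids)
        self.calls.append(RecordedCall(list(prompt_ids), list(completion_ids)))
        return completion_ids


def to_training_sequences(calls: List[RecordedCall]) -> List[Tuple[TokenIds, List[int]]]:
    """One (token ids, loss mask) pair per call; the mask selects sampled tokens only."""
    out = []
    for c in calls:
        ids = c.prompt_ids + c.completion_ids
        mask = [0] * len(c.prompt_ids) + [1] * len(c.completion_ids)
        out.append((ids, mask))
    return out


def toy_backend(prompt_ids: TokenIds) -> TokenIds:
    """Deterministic stand-in for sampling from the model."""
    return [prompt_ids[-1] + 1]


def toy_harness(llm: LLM, task_ids: TokenIds, turns: int = 3) -> TokenIds:
    """A black box from the trainer's view: it assembles context however it likes."""
    context = list(task_ids)
    for _ in range(turns):
        context += llm(context)
    return context


proxy = RecordingProxy(toy_backend)
final_context = toy_harness(proxy, [10, 11])
sequences = to_training_sequences(proxy.calls)
print(len(sequences), sequences[0])
```

The design choice worth noticing is that the trainer never imports the harness. It only sees what crossed the model boundary, which is why the rollout side can stay agnostic to whichever harness produced the calls.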

[← 1] defined the harness H and argued that a deployed agent's capability is the product of model quality and harness quality — and that H belongs inside the MDP, not outside it. Polar's 0.6-to-22.6 spread across harnesses is that thesis measured during training: the substrate you optimize against is a design variable, not a constant.

Harness-1 (Jiang et al., 2026) attacks the same gap from the other side: instead of making RL swallow the whole harness, it offloads work out of the policy and into the environment. A search agent trained as a policy over a growing transcript is forced to optimize two things at once — the semantic decision of what to search next, and the clerical bookkeeping of what it has already seen, which evidence matters, and which claims are checked. Harness-1 gives the environment a working memory: a candidate pool, an importance-tagged curated set, compact evidence links, verification records, deduplicated observations, and budget-aware context rendering. With that state externalized, its 20B search agent only has to learn the genuinely hard decisions; the recoverable state is maintained by H, reliably, for free. The general principle: every bit of state the harness can hold is a bit the policy does not have to learn to hold.
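As a rough picture of what "working memory in the environment" can mean, here is a small sketch covering a subset of the components listed above: a candidate pool, importance tags, verification records, deduplication, and budget-aware rendering. It is an illustration only, not Harness-1's data model; the class and method names and the character-based budget are assumptions.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class Evidence:
    doc_id: str
    snippet: str
    importance: int = 0   # higher means more important to the curated set
    verified: bool = False


@dataclass
class WorkingMemory:
    candidates: Dict[str, Evidence] = field(default_factory=dict)
    seen: Set[str] = field(default_factory=set)

    def observe(self, doc_id: str, snippet: str) -> bool:
        """Add an observation to the pool; return False for a duplicate."""
        if doc_id in self.seen:
            return False
        self.seen.add(doc_id)
        self.candidates[doc_id] = Evidence(doc_id, snippet)
        return True

    def curate(self, doc_id: str, importance: int) -> None:
        self.candidates[doc_id].importance = importance

    def mark_verified(self, doc_id: str) -> None:
        self.candidates[doc_id].verified = True

    def render(self, budget_chars: int) -> str:
        """Emit the most important evidence first and stop at the budget."""
        lines: List[str] = []
        used = 0
        ranked = sorted(self.candidates.values(), key=lambda e: -e.importance)
        for e in ranked:
            tag = "verified" if e.verified else "unverified"
            line = f"[{e.doc_id}|{tag}] {e.snippet}"
            if used + len(line) > budget_chars:
                break
            lines.append(line)
            used += len(line)
        return "\n".join(lines)


mem = WorkingMemory()
mem.observe("d1", "alpha")
mem.observe("d2", "beta")
duplicate_accepted = mem.observe("d1", "alpha again")
mem.curate("d2", importance=5)
mem.mark_verified("d2")
print(mem.render(budget_chars=25))
```

None of this bookkeeping needs gradients. Deduplication and budget-aware rendering are ordinary code, so the policy can spend its capacity on what to search next.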

[Figure 4 diagram. Left panel, "The 2018 gap · train ≠ serve": train on a surrogate E′, serve on the production H, with a distribution gap between them. Right panel, "Train on the harness (Polar)": the trainer's model M drives the production H (Codex, Claude Code, …) through proxied LLM calls, yielding token-faithful τ. Same model, same GRPO, on SWE-Bench Verified: +22.6 Codex, +6.2 Pi, +4.8 Claude Code, +0.6 Qwen Code.]
Figure 4: Closing the train/serve gap. Left, the classic failure: optimize in a surrogate E′, deploy in a different production H, lose the policy to distribution shift. Right, Polar (Zou et al., 2026) trains against the real harness by proxying its LLM calls and reconstructing token-faithful trajectories. The +0.6-to-+22.6 spread across four harnesses is the gap, quantified: which H you train on is a first-order choice.
Key Takeaway

4. The harness H is part of the MDP, so train where you serve. Polar (Zou et al., 2026) makes this practical by reconstructing token-faithful trajectories from the real production harness, and exposes the stakes: the same model and algorithm gain anywhere from +0.6 to +22.6 on SWE-Bench Verified depending only on which harness they trained against. Harness-1 (Jiang et al., 2026) adds the dual move — externalize recoverable state into H so the policy only learns the decisions that genuinely require learning.

§5 — The credit-assignment hole

Once the environment supply is solved, the next bottleneck announces itself immediately: a long trajectory earns one reward at the end, and the agent has to figure out which of forty actions deserved the blame. This is credit assignment, and it is the reason a coarse reward over a long τ trains slowly and unstably. The right way to read the recent algorithm work is not as a zoo of competing optimizers but as a single axis — how finely the learning signal is resolved along the trajectory — with the whole field migrating from coarse to dense (see Figure 5).

At the coarse end sits the honest baseline. RAGEN (Wang et al., 2025), through its StarPO framework, optimizes whole trajectories against the outcome reward, and the paper is candid about what that costs: a recurring instability it names the Echo Trap, where reward variance collapses and gradients spike, plus the finding that without reasoning-aware signals, multi-turn reasoning barely emerges at all. One step denser, GiGPO (Feng et al., 2025) keeps the critic-free simplicity of group methods but adds a second level of comparison: an anchor-state grouping mechanism that retroactively finds repeated environment states across trajectories and computes a step-level advantage from the actions taken at each. Denser still, SWEET-RL (Zhou et al., 2025) trains a critic with access to training-time information the agent never sees at inference and uses it to hand out step-level rewards — worth a +6% absolute gain on its ColBench benchmark, enough to let Llama-3.1-8B match or exceed GPT-4o. At the dense end, SDAR (Lu et al., 2026) pushes credit down to individual tokens: it gates On-Policy Self-Distillation as an auxiliary objective on top of RL, strengthening dense supervision on teacher-endorsed tokens while softly suppressing spurious negatives, for +9.4% on ALFWorld, +10.2% on WebShop, and +7.0% on Search-QA over GRPO. And ArCHer (Zhou et al., 2024) gives the structural picture the others fill in: a hierarchical actor-critic that assigns credit at two levels at once — across turns and within a turn.
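The difference between the first two points on the axis fits in a few lines. The sketch below is a simplification in the spirit of group-relative methods and GiGPO's anchor-state grouping, not a reproduction of either paper: it assumes a single terminal reward per trajectory, no discounting, and exact state matching, and the function names are hypothetical.

```python
from collections import defaultdict
from statistics import mean, pstdev
from typing import Dict, Hashable, List, Tuple

Step = Tuple[Hashable, str]  # (environment state, action)


def normalize(values: List[float]) -> List[float]:
    """Group-relative score: subtract the group mean, divide by the group std."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma if sigma > 0 else 0.0 for v in values]


def episode_advantages(returns: List[float]) -> List[float]:
    """Coarse credit: one advantage per trajectory, shared by all its steps."""
    return normalize(returns)


def step_advantages(
    group: List[List[Step]], returns: List[float]
) -> Dict[Tuple[int, int], float]:
    """Finer credit: compare outcomes only among visits to the same state."""
    by_state = defaultdict(list)
    for i, traj in enumerate(group):
        for t, (state, _action) in enumerate(traj):
            by_state[state].append((i, t))
    adv: Dict[Tuple[int, int], float] = {}
    for visits in by_state.values():
        scores = normalize([returns[i] for i, _ in visits])
        for (i, t), a in zip(visits, scores):
            adv[(i, t)] = a
    return adv


group = [
    [("s0", "left"), ("s1", "open")],   # succeeds
    [("s0", "left"), ("s1", "wait")],   # fails
    [("s0", "right"), ("s2", "open")],  # fails
]
returns = [1.0, 0.0, 0.0]

print(episode_advantages(returns))
print(step_advantages(group, returns))
```

In this toy group, the episode-level signal blames the whole second trajectory equally. The state-grouped signal isolates the decision at `s1`: "open" scores above "wait" from the same state, which is the comparison a trajectory-level reward cannot make.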

[→ C4] The dense-supervision frontier — on-policy trajectories with dense teacher supervision, sitting between SFT's distribution shift and outcome-RL's sparse credit — is the subject of its own deep dive. SDAR's gated self-distillation is one instance of a pattern that deserves a full chapter; this section places it on the map without pre-empting that treatment.

Two more results keep this section honest about what "dense" buys you. LOOP (Hamburger et al., 2025) shows the coarse end is not hopeless if the plumbing is right: a memory-efficient PPO variant with no value network and a single LLM copy in memory trains a 32B agent that beats a much larger baseline on AppWorld, where prior methods cleared less than half the tasks. And RAGEN-2 (Wang et al., 2026) supplies the warning that denser signals do not automatically mean healthier ones. It identifies template collapse — a failure where the agent's reasoning looks diverse but is input-agnostic, invisible to entropy, the field's usual stability gauge. Its diagnosis decomposes reasoning quality into within-input diversity (entropy) and cross-input distinguishability (mutual information), traces the collapse to a signal-to-noise mechanism where low reward variance lets regularization erase real differences, and proposes filtering high-signal prompts as the fix.
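The decomposition is worth seeing numerically, because it shows why entropy alone cannot catch the failure. The sketch below assumes reasoning traces have already been bucketed into discrete template labels (the labels "A" through "D" are made up) and uses plain empirical estimates; the paper's own estimators may differ.

```python
import math
from collections import Counter
from typing import Dict, Hashable, List, Sequence


def entropy(samples: Sequence[Hashable]) -> float:
    """Empirical Shannon entropy in bits."""
    n = len(samples)
    return -sum((c / n) * math.log2(c / n) for c in Counter(samples).values())


def diagnose(samples_by_input: Dict[str, List[str]]) -> Dict[str, float]:
    """Split diversity into within-input entropy H(Z|X) and cross-input MI I(X;Z)."""
    pooled = [z for zs in samples_by_input.values() for z in zs]
    total = len(pooled)
    within = sum(len(zs) / total * entropy(zs) for zs in samples_by_input.values())
    return {
        "within_input_entropy": within,
        "mutual_information": entropy(pooled) - within,  # I(X;Z) = H(Z) - H(Z|X)
    }


# Same mix of templates regardless of the input: diverse-looking but input-agnostic.
collapsed = {"q1": ["A", "B", "A", "B"], "q2": ["A", "B", "A", "B"]}
# Equally diverse per input, but the templates depend on the input.
healthy = {"q1": ["A", "B", "A", "B"], "q2": ["C", "D", "C", "D"]}

print(diagnose(collapsed))
print(diagnose(healthy))
```

Both agents show one bit of within-input entropy, so an entropy monitor rates them identically. Only the mutual-information term separates them: zero bits for the collapsed agent, one bit for the healthy one.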

[Figure 5 diagram: one axis from coarse/cheap to dense/expensive. Trajectory r: RAGEN / StarPO ("Echo Trap"). Episode + step groups: GiGPO (anchor-state grouping). Step-level critic: SWEET-RL (+6% ColBench). Token-level distillation: SDAR (+9.4% ALFWorld). Turn × token: ArCHer (hierarchical). Annotations: LOOP holds the coarse end viable with the right plumbing; RAGEN-2 warns dense ≠ healthy (template collapse).]
Figure 5: The credit-assignment granularity spectrum — read it as one axis, not a zoo. Moving right resolves the learning signal more finely along the trajectory, from a single outcome reward to per-token supervision; it buys faster, more stable credit at higher engineering and compute cost. Every datapoint is the cited paper's own reported result.
[← A7] The instabilities named here — the Echo Trap, template collapse — are not new species; they are the multi-turn faces of the stability failures catalogued in the RL-at-scale toolkit, where the prescription is to monitor distribution-level signals (not just the loss) and intervene with the matching fix. Bring that toolkit to bear when you push the spectrum rightward.
Key Takeaway

5. Multi-turn credit assignment is one axis — coarse to dense — not a zoo. The field is migrating rightward: trajectory reward (RAGEN) → step groups (GiGPO) → step-level critic (SWEET-RL, +6%) → token-level distillation (SDAR, +9.4%) → hierarchical (ArCHer). Denser signals train faster but cost engineering and stability; RAGEN-2's template collapse is the reminder that a denser signal still has to be a real one.

§6 — When hand-building runs out: recursive and test-time axes

Hand-authored factories have a ceiling. Even a procedural pipeline is bounded by the templates a human wrote for it, and an agent that wants to keep improving eventually needs experience that no fixed generator can supply. Three escape routes are open, and each is a way of manufacturing experience rather than authoring it.

The first is to train inside a learned world model — let a model dream the environment. Dreamer 4 (Hafner et al., 2025) is the proof that this is no longer a toy. It learns control by reinforcement learning entirely inside a world model that runs in real time on a single GPU, and in Minecraft it becomes the first agent to obtain diamonds purely from offline data — a task requiring sequences of over 20,000 mouse-and-keyboard actions from raw pixels, with no live environment interaction at all. Coupled with Genie from §3, this is the literal endpoint of Figure 2's second row: the environment distribution E is itself a learned, samplable object. The second route is recursion. RAO (Gandhi et al., 2026) trains agents that spawn and delegate subtasks to fresh instantiations of themselves — divide-and-conquer as a learned skill — which lets a trained agent scale to problems beyond its own context window and generalize to tasks harder than any it saw in training, at reduced wall-clock time. The agent manufactures its own subproblems, each in a clean context. Search is the close cousin: Agent Q (Putta et al., 2024) couples MCTS-guided tree search with off-policy DPO to manufacture high-quality trajectories from sparse rewards, lifting a Llama-3 70B agent from 18.6% to 81.7% on real-world booking tasks — searching the action tree for the experience worth training on.
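The recursion route has a familiar skeleton. The toy below hard-codes what RAO is described as learning: when to delegate and how to split. The task (summing a list), the `context_limit` parameter, and the `Stats` counter are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Stats:
    calls: int = 0       # number of agent instantiations
    max_depth: int = 0   # deepest level of delegation


def solve(items: List[int], context_limit: int, stats: Stats, depth: int = 0) -> int:
    """Handle the task directly if it fits in context, else delegate two halves."""
    if context_limit < 1:
        raise ValueError("context_limit must be at least 1")
    stats.calls += 1
    stats.max_depth = max(stats.max_depth, depth)
    if len(items) <= context_limit:
        return sum(items)  # small enough for a single fresh context
    mid = len(items) // 2  # hard-coded split rule; the learned policy would choose this
    left = solve(items[:mid], context_limit, stats, depth + 1)
    right = solve(items[mid:], context_limit, stats, depth + 1)
    return left + right


stats = Stats()
total = solve(list(range(1, 101)), context_limit=16, stats=stats)
print(total, stats.calls, stats.max_depth)
```

No single call ever sees more than 16 items, yet the 100-item problem gets solved. That is the sense in which delegation lets an agent work past its own context window.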

The third route is to reuse experience at test time. Scaling test-time compute for agentic coding (Silva et al., 2026) confronts the fact that a long agent rollout is not a short answer you can rank — it is a sprawling trace of actions, errors, and partial progress. Their move is to compress each rollout into a structured summary that keeps the salient hypotheses and failure modes, then scale two ways over those summaries: Recursive Tournament Voting narrows a population through small-group comparisons (parallel), and Parallel-Distill-Refine conditions new rollouts on summaries distilled from prior attempts (sequential). The experience an agent generates while solving a task becomes raw material for solving it better — the interaction flywheel turning at inference time (see Figure 6).
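The parallel half of that scheme, narrowing a population through small-group comparisons, can be sketched as a bracket. This is a generic tournament under stated assumptions, not the authors' implementation: the `judge` here is a trivial scoring rule over made-up summary dicts, where the real system would presumably have a model compare structured summaries.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def tournament(
    candidates: Sequence[T],
    judge: Callable[[Sequence[T]], T],
    group_size: int = 4,
) -> T:
    """Split into small groups, keep each group's winner, repeat until one remains."""
    if not candidates:
        raise ValueError("need at least one candidate")
    if group_size < 2:
        raise ValueError("group_size must be at least 2")
    pool: List[T] = list(candidates)
    while len(pool) > 1:
        groups = [pool[i:i + group_size] for i in range(0, len(pool), group_size)]
        pool = [judge(g) if len(g) > 1 else g[0] for g in groups]
    return pool[0]


# Hypothetical rollout summaries; "tests_passed" is an invented field.
summaries = [
    {"id": i, "tests_passed": p} for i, p in enumerate([3, 7, 2, 9, 4, 1, 8, 5, 6])
]


def judge(group):
    """Stand-in for a model-based comparison of a few summaries at a time."""
    return max(group, key=lambda s: s["tests_passed"])


winner = tournament(summaries, judge, group_size=4)
print(winner)
```

The small group size is the point. A judge that only ever compares a handful of compressed summaries never has to hold the whole population of sprawling traces in context at once.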

[Figure 6 diagram, three panels. Generate the world: an agent inside a learned world model (Dreamer 4 · 20k+ actions, single GPU, offline). Recurse: RAO · delegate to self, beyond the context window. Reuse: τ → summaries → refined rollouts, via RTV (parallel) and PDR (sequential).]
Figure 6: Three ways to manufacture experience when hand-built factories run out. Generate the world and train inside it (Dreamer 4); recurse, so the agent spawns its own subproblems (RAO); or reuse, compressing past rollouts into material for better ones (test-time compute, Silva et al., 2026). Each turns a fixed environment budget into a renewable one.
Key Takeaway

6. When hand-authored factories saturate, manufacture experience instead of authoring it: train inside a learned world model (Dreamer 4 — diamonds in Minecraft from offline data, 20,000+ actions on a single GPU), recurse so the agent generates its own subproblems (RAO), or reuse rollouts as raw material at test time (Silva et al., 2026). Each converts a fixed environment budget into a renewable one.

§7 — Exploration is still the boss

Push the thesis to its limit and it turns into a question the environment factory cannot answer for you: which environment should you generate next? Infinite supply is not the same as useful supply. An agent that has saturated everything easy gains nothing from the ten-thousandth easy task; it needs the specific next problem that sits just past its frontier. Manufacturing capacity solves "how much"; it does not solve "what." That is the exploration problem, and it does not go away — it gets promoted.

This was seen early and clearly. Jiang et al. (2022) framed the field as moving from "learning from data" to "learning what data to learn from," and argued that this — generalized exploration — is the universal bottleneck of open-ended learning, common to supervised learning and RL alike. Read through this article's lens, that is the exact statement that environment supply, once industrialized, collapses back into an exploration problem. Agent-World (Dong et al., 2026) is the live instance: its self-evolving loop does not just synthesize more tasks, it identifies the agent's capability gaps and synthesizes tasks aimed at them, and the paper's scaling trend in self-evolution rounds is what a working interaction flywheel looks like when exploration steers it (see Figure 7). The factory makes the worlds; exploration decides which worlds are worth making.
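One simple way to turn "which world next?" into code is to score task buckets by how unpredictable their outcomes currently are. This is a generic curriculum heuristic swapped in for illustration; it is not the mechanism Agent-World or Jiang et al. describe. The bucket names, the statistics, and the smoothing choice are assumptions.

```python
from typing import Dict, Tuple


def learning_potential(successes: int, attempts: int) -> float:
    """p * (1 - p) with a Laplace-smoothed success rate; peaks at p = 0.5."""
    p = (successes + 1) / (attempts + 2)
    return p * (1 - p)


def next_bucket(stats: Dict[str, Tuple[int, int]]) -> str:
    """Choose the task bucket whose outcomes are currently least predictable."""
    return max(stats, key=lambda name: learning_potential(*stats[name]))


# (successes, attempts) per bucket; all numbers are made up.
stats = {"easy": (98, 100), "medium": (45, 100), "hard": (1, 100)}
print(next_bucket(stats))
print(next_bucket({**stats, "untried": (0, 0)}))
```

Saturated buckets and hopeless buckets both score near zero, so the generator is steered toward the frontier. An untried bucket scores highest of all, which acts as a crude exploration bonus.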

[Figure 7 diagram: the interaction flywheel, steered by exploration. 1 · Explore (which world to build?) → 2 · Synthesize E + tasks (controllable difficulty) → 3 · Train (RL on E; credit assignment, §5) → 4 · Stronger agent (new, harder frontier) → back to 1.]
Figure 7: The interaction flywheel — the data flywheel's successor (Figure 2, row 4). Exploration decides which world to build, the factory synthesizes it, RL trains on it, and the stronger agent returns to exploration with harder questions. Agent-World (Dong et al., 2026) implements exactly this loop, aiming self-evolution at the agent's capability gaps; the cycle is only as good as the exploration that steers it.
[→ C2] Why exploration — "learning what data to learn from" — is the load-bearing bet for open-ended capability, and what it would take to falsify it, is argued at length in the long-horizon essay. Here it is the natural successor bottleneck: the thing that becomes scarce after you have industrialized environment supply.
Key Takeaway

7. Industrializing environment supply answers "how many worlds"; it does not answer "which world next." Jiang et al. (2022) named that successor bottleneck years early, as the shift from learning from data to learning what data to learn from, and Agent-World shows it in practice: a flywheel is only as good as the exploration that aims it at the agent's actual frontier.

§8 — The practitioner's path

Collapse everything above into an order of operations you can run on Monday. The mistake teams make is to reach straight for RL — the most expensive, least stable tool — before they have earned the right to use it. The cheaper interventions come first, and each one either fixes the problem outright or makes the next stage tractable (see Figure 8). The open frontier recipes confirm the order: Kimi K2 (Kimi Team, 2025) — a practitioner report on a 1T-parameter MoE — describes post-training as a large-scale agentic data-synthesis pipeline feeding a joint RL stage over real and synthetic environments, and reaches 65.8 on SWE-Bench Verified; Qwen3 (Qwen Team, 2025) reports a comparable agentic, tool-use-centered recipe across its 0.6B-to-235B family. Both, read as lab reports rather than peer-reviewed findings, say the same thing: the recipe is the environment supply.

Step 1: prompt and measure. Establish the ceiling of the model as-is, with a good prompt and the production harness, before you change any weights. Most "the agent can't do this" problems are harness problems in disguise.

Step 2: tune the harness. If the failures are in state-tracking, context assembly, or tool routing, fix H first: externalize recoverable state the way Harness-1 does, so the policy is left only the decisions that need learning. Re-measure.

Step 3: distill from a stronger agent (OPD). If a stronger teacher exists, buy dense supervision before you pay for sparse RL: synthesize trajectories the way OpenResearcher does and distill them, the cheapest path to a competent starting policy.

Step 4: RL on synthesized environments. Only now reach for RL: on the production harness (Polar), against a scalable environment pipeline (Endless Terminals / Gym-Anything / Agent-World, or a learned world model), with a credit-assignment algorithm matched to your reward's granularity (§5), watching for Echo Trap and template collapse.

As the agent saturates the supply, the bottleneck becomes exploration: which environment to synthesize next.

[Figure 8 flowchart. Start: agent underperforms. 1 · Prompt + measure on production H (establish the as-is ceiling first). Failure in state-tracking / context assembly? Yes: 2 · Tune H (externalize state [← 1]), then re-measure. No: stronger teacher available? Yes: 3 · OPD distill (dense τ [→ C4]). No, or after distilling: scalable env pipeline (resets + rewards + curriculum + supply)? No: build or borrow a factory, or learn E. Yes: 4 · RL on production H (Polar), with a credit algorithm chosen by granularity (GiGPO / SWEET-RL / ArCHer), watching for Echo Trap and template collapse [← A7]. Then exploration becomes the bottleneck [→ C2].]
Figure 8: The deployable artifact — an order of operations from cheapest to most expensive. Prompt, then tune the harness, then distill from a stronger agent, then (and only then) run RL on a scalable, production-faithful environment pipeline. Each gate either fixes the problem or makes the next stage tractable; RL is the last resort, not the first move.
Practitioner contract

The decision tree, in one line: prompt → tune harness H → OPD-distill from a stronger agent → RL on synthesized environments. Spend in that order. If you skip to RL before the harness is right and the environment pipeline is real, you will pay the most for the least — exactly the trap the 2018 applied-RL projects fell into.
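The contract is short enough to write down as a function. The sketch below encodes the gates in the order the article gives them; the `Diagnosis` fields and the returned strings are hypothetical, and a real project would replace the booleans with measurements.

```python
from dataclasses import dataclass


@dataclass
class Diagnosis:
    measured_on_production_harness: bool
    harness_failures: bool            # state-tracking, context assembly, tool routing
    stronger_teacher_available: bool
    distilled: bool
    scalable_env_pipeline: bool       # resets + rewards + curriculum + supply


def next_step(d: Diagnosis) -> str:
    """Return the cheapest intervention that has not been exhausted yet."""
    if not d.measured_on_production_harness:
        return "1: prompt and measure on the production harness"
    if d.harness_failures:
        return "2: tune the harness, then re-measure"
    if d.stronger_teacher_available and not d.distilled:
        return "3: distill from the stronger agent"
    if not d.scalable_env_pipeline:
        return "build or borrow an environment factory"
    return "4: RL on the production harness"


stalled = Diagnosis(
    measured_on_production_harness=True,
    harness_failures=True,
    stronger_teacher_available=True,
    distilled=False,
    scalable_env_pipeline=False,
)
print(next_step(stalled))
```

The ordering of the `if` statements is the whole argument: RL is reachable only after every cheaper branch has returned nothing to do.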

Key Takeaway

8. Run the cheap interventions first: prompt, then harness, then distillation, then RL on synthesized environments. The open frontier recipes (Kimi K2, Qwen3 — practitioner reports) describe exactly this, with environment supply as the centerpiece of post-training. RL is the last and most expensive tool; earn the right to use it by exhausting the cheaper ones.

§ What Comes Next

Everything in this article assumed a clean separation between training and serving: build the factory, run the loop, ship the agent. But the most valuable agents are never finished. They are deployed into a world that keeps changing, and the experience that matters most arrives after training, in production, one interaction at a time. The interaction flywheel of Figure 2 was drawn as a closed offline loop; the next article cuts it open and lets it run online. [→ 4] takes this same thesis (environments, not algorithms) into the continual setting, where the environment is no longer a pipeline you control but a stream you cannot reset, and where the bill for re-learning what you already knew comes due daily.

Final Key Takeaways
  1. The bottleneck in agentic RL has moved from the model to the substrate. Algorithms and compute are abundant; environments are not. The Endless Terminals result — boring algorithm, scaled environments, 10.7%→53.3% — is the whole thesis in one experiment.
  2. Environments are the new data, line for line: rewards are labels, environment distributions are datasets, reward verification is data cleaning, and the interaction flywheel is the data flywheel. The team that industrializes that column wins the era.
  3. Train where you serve. The harness is part of the MDP; Polar's +0.6-to-+22.6 spread across harnesses is the train/serve gap measured in points.
  4. Credit assignment is one axis from coarse to dense, and the field is migrating rightward — but a denser signal still has to be a real one (RAGEN-2's template collapse).
  5. The order of operations is fixed: prompt → harness → distill → RL on synthesized environments. Then exploration — which world to build next — becomes the boss.

References

  1. Aggarwal, P., Neubig, G., & Welleck, S. (2026). Gym-Anything: Turn any Software into an Agent Environment. Carnegie Mellon University. arXiv:2604.06126.
  2. Bruce, J., Dennis, M., Edwards, A., Parker-Holder, J., Shi, Y., et al. (2024). Genie: Generative Interactive Environments. Google DeepMind. arXiv:2402.15391.
  3. Dong, G., Dou, Z., et al. (2026). Agent-World: Scaling Real-World Environment Synthesis for Evolving General Agent Intelligence. Renmin University of China & ByteDance Seed. arXiv:2604.18292.
  4. Feng, L., Xue, Z., Liu, T., & An, B. (2025). Group-in-Group Policy Optimization for LLM Agent Training. Nanyang Technological University & Skywork AI. arXiv:2505.10978.
  5. Gandhi, A., Chakraborty, S., Wang, X., Kumar, A., & Neubig, G. (2026). Recursive Agent Optimization. Carnegie Mellon University & Amazon AGI Labs. arXiv:2605.06639.
  6. Gandhi, K., Garg, S., Goodman, N. D., & Papailiopoulos, D. (2026). Endless Terminals: Scaling RL Environments for Terminal Agents. Stanford University & Microsoft Research. arXiv:2601.16443.
  7. Hafner, D., Yan, W., & Lillicrap, T. (2025). Training Agents Inside of Scalable World Models (Dreamer 4). Google DeepMind. arXiv:2509.24527.
  8. Hamburger, J., Koltun, V., & Krähenbühl, P. (2025). Reinforcement Learning for Long-Horizon Interactive LLM Agents (LOOP). arXiv:2502.01600.
  9. Jiang, M., Rocktäschel, T., & Grefenstette, E. (2022). General Intelligence Requires Rethinking Exploration. Meta AI, UCL & Cohere. arXiv:2211.07819.
  10. Jiang, P., Shi, Z., Hong, K., et al. (2026). Harness-1: Reinforcement Learning for Search Agents with State-Externalizing Harnesses. UIUC, UC Berkeley & Chroma. arXiv:2606.02373.
  11. Kimi Team. (2025). Kimi K2: Open Agentic Intelligence. Moonshot AI. arXiv:2507.20534.
  12. Li, Z., Jiang, D., Ma, X., et al. (2026). OpenResearcher: A Fully Open Pipeline for Long-Horizon Deep Research Trajectory Synthesis. Texas A&M University & University of Waterloo. arXiv:2603.20278.
  13. Lu, Z., Yao, Z., et al., & Shen, Y. (2026). Self-Distilled Agentic Reinforcement Learning (SDAR). Zhejiang University & Meituan. arXiv:2605.15155.
  14. Putta, P., Mills, E., Garg, N., Motwani, S., Finn, C., Garg, D., & Rafailov, R. (2024). Agent Q: Advanced Reasoning and Learning for Autonomous AI Agents. MultiOn & Stanford University. arXiv:2408.07199.
  15. Qwen Team. (2025). Qwen3 Technical Report. arXiv:2505.09388.
  16. ROCK, ROLL & iFlow Joint Team. (2025). Let It Flow: Agentic Crafting on Rock and Roll, Building the ROME Model within an Open Agentic Learning Ecosystem. arXiv:2512.24873.
  17. Silva, R., Chen, Z., Iyer, S., et al. (2026). Scaling Test-Time Compute for Agentic Coding. Meta Superintelligence Labs. arXiv:2604.16529.
  18. Wang, Z., Wang, K., Wang, Q., Zhang, P., Li, L., et al. (2025). RAGEN: Understanding Self-Evolution in LLM Agents via Multi-Turn Reinforcement Learning (StarPO). arXiv:2504.20073.
  19. Wang, Z., Gui, C., Jin, X., et al. (2026). RAGEN-2: Reasoning Collapse in Agentic RL. arXiv:2604.06268.
  20. Zhou, S., Xu, F. F., et al. (2023). WebArena: A Realistic Web Environment for Building Autonomous Agents. arXiv:2307.13854.
  21. Zhou, Y., Zanette, A., Pan, J., Levine, S., & Kumar, A. (2024). ArCHer: Training Language Model Agents via Hierarchical Multi-Turn RL. UC Berkeley & Google DeepMind. arXiv:2402.19446.
  22. Zhou, Y., Jiang, S., Tian, Y., Weston, J., Levine, S., Sukhbaatar, S., & Li, X. (2025). SWEET-RL: Training Multi-Turn LLM Agents on Collaborative Reasoning Tasks. FAIR at Meta & UC Berkeley. arXiv:2503.15478.
  23. Zou, Y., Demoret, M., Kautz, J., & Dong, Y. (2026). Polar: Agentic RL on Any Harness at Scale. NVIDIA. arXiv:2605.24220.