Environments Are the Bottleneck
Agentic RL is not blocked on algorithms or compute. It is blocked on environments — and the teams that industrialize environment supply will own agent training the way data-pipeline teams owned supervised learning.
E) onto a conveyor that feeds a model M in a training loop; a reward signal (r) closes the loop. The scarce machine is the factory, not the loop.
§1 — Your RL project is stuck, and it's not the algorithm
Take an inventory of your stalled agentic-RL project. Count what you have. Algorithms: more than you can read — PPO, GRPO, DPO, and a dozen multi-turn variants published this year alone. Compute: enough, or close enough that it is not what wakes you at night. Environments: three. They are hand-built, they took a quarter to write, and your agent has already saturated all of them. That third number is the whole story. The bottleneck is not the optimizer; it is the supply of worlds for the optimizer to learn in.
This is not a metaphor reached for after the fact. It is the opening sentence of the most direct paper on the subject: "Environments are the bottleneck for self-improving agents" (Gandhi et al., 2026). Their system, Endless Terminals, makes the point by removing everything else. They train terminal agents with vanilla PPO, binary episode-level rewards, and a minimal 16-turn interaction loop — no retrieval, no multi-agent coordination, no specialized tools — and still take Qwen2.5-7B from 10.7% to 53.3% on a held-out evaluation set. The algorithm is deliberately boring. The variable they scaled is the number of environments: a procedural pipeline that yields 3,255 verified, containerized tasks with no human annotation (see Figure 1).
Before going further, fix the vocabulary this series uses for the moving parts. An environment E is the tool-world the agent acts in — its state, its transitions, and crucially the function that decides whether an outcome counts as success. A trajectory τ is one multi-turn episode of the agent acting and observing inside E. A reward r is the scalar the environment returns. The harness H is the scaffold around the model — context assembly, tools, memory, verification — that we will return to in §4. Hold those four symbols; the rest of the article is about who manufactures E, and at what scale.
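A minimal sketch of how those four symbols might map onto code. Everything here is illustrative: `Environment`, `Trajectory`, `CountingEnv`, `rollout`, and `scripted_policy` are hypothetical names, not an API from any of the cited systems, and the `policy` callable stands in for the model wrapped in its harness H.

```python
from dataclasses import dataclass, field
from typing import Protocol


class Environment(Protocol):
    """E: state, transitions, and the success check (illustrative interface)."""

    def reset(self) -> str: ...
    def step(self, action: str) -> tuple[str, bool]: ...
    def reward(self) -> float: ...


@dataclass
class Trajectory:
    """tau: one multi-turn episode, stored as (observation, action) turns."""
    turns: list[tuple[str, str]] = field(default_factory=list)
    reward: float = 0.0


class CountingEnv:
    """Toy E: succeed by saying 'inc' exactly `target` times, then 'done'."""

    def __init__(self, target: int = 3) -> None:
        self.target = target
        self.count = 0
        self.finished = False

    def reset(self) -> str:
        self.count, self.finished = 0, False
        return f"count=0 target={self.target}"

    def step(self, action: str) -> tuple[str, bool]:
        if action == "inc":
            self.count += 1
        elif action == "done":
            self.finished = True
        return f"count={self.count} target={self.target}", self.finished

    def reward(self) -> float:
        # The success check lives in the environment, not in the policy.
        return 1.0 if self.finished and self.count == self.target else 0.0


def rollout(env: Environment, policy, max_turns: int = 16) -> Trajectory:
    """Run one episode and attach the scalar reward r at the end."""
    traj = Trajectory()
    obs = env.reset()
    for _ in range(max_turns):
        action = policy(obs)
        traj.turns.append((obs, action))
        obs, done = env.step(action)
        if done:
            break
    traj.reward = env.reward()
    return traj


def scripted_policy(obs: str) -> str:
    """Stand-in for model + harness: parse the observation and act."""
    count, target = (int(part.split("=")[1]) for part in obs.split())
    return "inc" if count < target else "done"


traj = rollout(CountingEnv(target=3), scripted_policy)
print(len(traj.turns), traj.reward)
```

The point of the toy is the division of labor: the environment owns reset, transition, and the verdict, so swapping in a different policy changes nothing about how r is computed.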
The reason this caught the field off guard is historical. For a decade the binding constraint in machine learning was the model: architectures, optimizers, parameter counts. Agentic RL inherited a world where models are strong and cheap to fine-tune, and discovered the constraint had moved underneath it — to the substrate the model trains against. You can feel the shift in the failure mode. A 2019-era project failed because the network would not learn. A 2026-era project fails because there is nothing left to learn from.
1. The scarce resource in agentic RL is not algorithms or compute — it is environments. Endless Terminals (Gandhi et al., 2026) demonstrates the asymmetry directly: hold the algorithm at vanilla PPO with binary rewards, scale only the environment supply to 3,255 procedurally generated tasks, and a 7B agent still jumps from 10.7% to 53.3%. When the boring lever moves the number, the boring lever is the bottleneck.
§2 — Evaluation is not a training substrate
The deepest mistake in applied agentic RL is reaching for a benchmark when you need a pipeline. They are different objects with different jobs, and the confusion is expensive. A benchmark is built to be fixed: a frozen, finite set of tasks with a leaderboard, designed so two systems can be compared fairly. A training substrate is built to be consumed: it must hand the learner a fresh, gradeable problem on every step, forever. Freezing is a feature for the first and a fatal flaw for the second.
Concretely, a substrate has to do four things a benchmark never promised. It must reset cleanly so each τ starts from a known state. It must return verifiable rewards — programmatic checks, not vibes — so r is trustworthy at scale. It must offer a difficulty curriculum, because an agent that has solved everything easy needs harder problems on tap. And it must provide unbounded supply, because RL burns through tasks orders of magnitude faster than evaluation does. WebArena (Zhou et al., 2023) is the cautionary example: a beautifully reproducible, standalone web environment where the best GPT-4 agent reached only 14.41% against 78.24% for humans. It was engineered to expose a gap, and it does — but it is a yardstick, not a quarry. You cannot run a million RL episodes against it without the agent memorizing its finite task set.
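The four requirements can be made concrete with a toy generator. This is a sketch under stated assumptions, not any cited system's design: `Task`, `make_task`, and `task_stream` are hypothetical names, and arithmetic stands in for a real tool-world. A seed gives a clean reset, `verify` is a programmatic reward, `difficulty` is the curriculum knob, and the generator never runs out.

```python
import random
from dataclasses import dataclass
from typing import Callable, Iterator


@dataclass(frozen=True)
class Task:
    prompt: str
    difficulty: int
    verify: Callable[[str], float]  # programmatic check that returns reward r


def make_task(seed: int, difficulty: int) -> Task:
    """Same (seed, difficulty) always yields the same task: a clean reset."""
    rng = random.Random(seed)
    terms = [rng.randint(1, 10 ** difficulty) for _ in range(difficulty + 1)]
    answer = sum(terms)

    def verify(submission: str) -> float:
        try:
            return 1.0 if int(submission.strip()) == answer else 0.0
        except ValueError:
            return 0.0  # unparseable output is simply a failure

    return Task(" + ".join(map(str, terms)), difficulty, verify)


def task_stream(difficulty: int, start_seed: int = 0) -> Iterator[Task]:
    """Unbounded supply: a fresh, gradeable task on every call to next()."""
    seed = start_seed
    while True:
        yield make_task(seed, difficulty)
        seed += 1
```

A frozen benchmark corresponds to calling `make_task` on a fixed list of seeds; a substrate corresponds to consuming `task_stream` and raising `difficulty` as the agent saturates.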
The cleanest way to think about the new constraint is to run an old playbook forward. In the supervised era, the asset that decided who won was not the model — it was the data pipeline. The map is exact (see Figure 2). Labels became rewards: the supervision signal moved from a human annotation attached to an input to a programmatic verdict attached to a trajectory. Datasets became environment distributions: you no longer ship a static file, you ship a generator over E. Data cleaning became reward verification: the unglamorous, decisive work of making sure the signal is correct. And the data flywheel became the interaction flywheel: the system that turns a deployed model's experience back into the next round of training material. Whoever industrializes that column owns the next era, exactly as the data-infrastructure teams owned the last one.
2. Benchmarks are built to be frozen; training substrates are built to be consumed. A substrate needs four things a benchmark never promised — resets, verifiable rewards, a difficulty curriculum, and unbounded supply. Run the supervised playbook forward and the lesson is unambiguous: the winning asset is the environment pipeline, the way the data pipeline was the winning asset before it.
§3 — The environment factories
The good news for a practitioner is that the factories already exist, in the open, and they disagree about how to build environments in instructive ways. Five are worth knowing, because between them they cover the whole design space (see Figure 3).
Endless Terminals (Gandhi et al., 2026) is the procedural extreme. Its pipeline runs four stages — generate diverse task descriptions, build and validate a containerized environment, generate completion tests, and filter for solvability by sampling solutions from a strong model — and emits tasks with verification baked in. Resets are free because every task is a fresh container; the reward is a binary completion test. Gym-Anything (Aggarwal et al., 2026) attacks a different axis: economic coverage. It treats environment creation itself as a multi-agent task — a coding agent writes setup scripts, downloads real data, and configures the software while producing evidence, and an independent audit agent checks that evidence against a quality checklist. Pointed at a GDP-grounded taxonomy of occupations, it converts 200 applications into 10,000+ long-horizon tasks. That audit agent is the most important detail in the section: it is reward verification (Figure 2, row 3) implemented as a second model whose only job is to refuse bad environments.
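The four-stage shape described above can be sketched as a filter chain. This is an illustration only: `Candidate`, `build_pipeline`, and the toy stand-in functions are hypothetical, and a real pipeline would build containers and sample solutions from a strong model where this sketch calls plain Python functions.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass
class Candidate:
    description: str
    env_ok: bool = False
    test: Optional[Callable[[str], bool]] = None


def build_pipeline(
    descriptions: Iterable[str],
    build_env: Callable[[str], bool],
    make_test: Callable[[str], Callable[[str], bool]],
    sample_solution: Callable[[str, int], str],
    attempts: int = 4,
) -> list[Candidate]:
    kept = []
    for desc in descriptions:                # stage 1: task descriptions
        cand = Candidate(desc)
        cand.env_ok = build_env(desc)        # stage 2: build and validate the environment
        if not cand.env_ok:
            continue
        cand.test = make_test(desc)          # stage 3: completion test
        solved = any(cand.test(sample_solution(desc, i)) for i in range(attempts))
        if solved:                           # stage 4: keep only tasks shown to be solvable
            kept.append(cand)
    return kept


# Toy stand-ins so the sketch runs end to end.
def toy_build(desc: str) -> bool:
    return desc != "broken env"


def toy_make_test(desc: str) -> Callable[[str], bool]:
    return lambda out: desc != "unsolvable" and out == desc.split()[-1]


def toy_solver(desc: str, attempt: int) -> str:
    return desc.split()[-1] if attempt >= 1 else "wrong"


tasks = build_pipeline(
    ["echo hello", "echo world", "broken env", "unsolvable"],
    toy_build, toy_make_test, toy_solver,
)
print([t.description for t in tasks])
```

The solvability filter is what lets the pipeline run without human annotation: a task with no passing sample is dropped rather than hand-checked.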
Agent-World (Dong et al., 2026) closes the loop into a curriculum. Built on the Model Context Protocol, a unified interface for connecting an agent to real-world tool services, it pairs two components: an environment-and-task discovery component that synthesizes verifiable tasks with controllable difficulty, and a self-evolving trainer that identifies the agent's capability gaps and synthesizes new tasks aimed at them. Its 8B and 14B agents beat strong proprietary baselines across 23 agent benchmarks, and the paper reports clean scaling trends in both environment diversity and self-evolution rounds. OpenResearcher (Li et al., 2026) is the reproducibility play: it runs the search-and-browse loop entirely offline over a 15-million-document corpus through three primitives (search, open, find) and uses a teacher (GPT-OSS-120B) to synthesize 97,000+ trajectories, including a long tail with 100+ tool calls. Fine-tuning a 30B-A3B model on that synthetic experience reaches 54.8% on BrowseComp-Plus, a +34.0-point jump over the base model. Note the honest asterisk: OpenResearcher's signal is distilled teacher trajectories, not a live RL reward, which is why it is marked differently in the verifiable-rewards column of the comparison table below. Finally, ROME/iFlow (ROCK, ROLL & iFlow Joint Team, 2025) is the systems answer: an open ecosystem that cleanly separates the post-training framework (ROLL) from the sandbox environment manager (ROCK), which is exactly the modularization that lets an environment pipeline scale independently of the trainer.
There is also a sixth design that belongs in any honest survey, even though it lives in a different lineage: generate the environment itself. Genie (Bruce et al., 2024) is a foundation world model that turns prompts, sketches, and photographs into playable, controllable environments learned from unlabeled video. It is "environments are data" taken to the limit — the distribution over E is no longer hand-authored, it is sampled from a model. We return to that frontier in §6.
| Environment factory | Resets | Verifiable rewards | Difficulty curriculum | Supply |
|---|---|---|---|---|
| Endless Terminals (Gandhi et al., 2026) | ✅ fresh container / task | ✅ binary completion tests | ⚠️ solvability-filtered; diversity by sampled category | ✅ 3,255 tasks, no human labels |
| Gym-Anything (Aggarwal et al., 2026) | ✅ per-software setup scripts | ✅ audit agent verifies setup + task | ⚠️ breadth via GDP occupation taxonomy | ✅ 200 apps → 10,000+ tasks |
| Agent-World (Dong et al., 2026) | ✅ stateful MCP tool worlds | ✅ synthesized verifiable tasks | ✅ controllable difficulty + self-evolution to gaps | ✅ thousands of themes; 23-benchmark gains |
| OpenResearcher (Li et al., 2026) | ✅ fully offline, instrumented | ⚠️ teacher-distilled trajectories, not RL reward | ⚠️ long-horizon tail (100+ tool calls) | ✅ 97,000+ trajectories, reproducible |
| ROME / iFlow (ROCK, ROLL & iFlow Joint Team, 2025) | ✅ ROCK sandbox manager | ⚠️ ROLL post-training + Terminal Bench Pro | ❌ ecosystem, not a single curriculum | ✅ open end-to-end stack |
3. Five open factories already span the design space: procedural supply (Endless Terminals), economic coverage with an audit agent (Gym-Anything), self-evolving curriculum (Agent-World), offline reproducibility (OpenResearcher), and modular systems (ROME/iFlow). No one factory wins every column. Treat environment supply as a portfolio you assemble, and notice that reward verification keeps showing up as a second model whose job is to reject bad environments.
§4 — Train where you serve
Here is the failure that quietly killed most applied RL of the 2018 era, and it had nothing to do with reward design. Teams trained agents in a clean simplified simulator and deployed them into a messy production system, and the policy fell apart on contact — the train/serve gap. The two distributions were different, so a policy optimal in the first was mediocre in the second. The modern equivalent is sharper, because the production system is the harness H: the context-assembly, tool-routing, memory, and verification scaffold the agent actually runs inside. If you train in a stripped-down environment and serve inside a rich harness, you have rebuilt the 2018 gap with new words.
The fix is the most important infrastructure idea in agentic RL right now: train on the production harness itself. Polar (Zou et al., 2026) does this by treating the harness as a black box. It proxies the agent's LLM API calls, records the token-level interactions, and reconstructs token-faithful trajectories from whatever the real harness did — so the RL trainer sees exactly the distribution the agent will be served in. The decoupling is what makes it scale: rollout nodes are agnostic to harness, trainer, and algorithm. And the result makes the train/serve gap visible as a number. Using simple GRPO, Polar improves the same Qwen3.5-4B by +22.6, +4.8, +0.6, and +6.2 points on SWE-Bench Verified across the Codex, Claude Code, Qwen Code, and Pi harnesses respectively. Same model, same algorithm, four harnesses, and the gain ranges from negligible to enormous (see Figure 4). The harness is not a deployment detail you bolt on afterward — it is part of the MDP, and which harness you train against is a first-order decision.
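The black-box idea can be illustrated with a recording proxy. This is a deliberately simplified sketch, not Polar's implementation: it logs strings where the paper describes token-level records, and `RecordingProxy`, `black_box_harness`, and `fake_model` are hypothetical names. The only assumption about the harness is that it reaches the model through a callable we control.

```python
from dataclasses import dataclass, field
from typing import Callable

LLMCall = Callable[[str], str]  # prompt -> completion; stand-in for an LLM API


@dataclass
class RecordingProxy:
    """Sits between the harness and the model and logs every exchange."""
    backend: LLMCall
    log: list[tuple[str, str]] = field(default_factory=list)

    def __call__(self, prompt: str) -> str:
        completion = self.backend(prompt)
        self.log.append((prompt, completion))
        return completion


def black_box_harness(llm: LLMCall, task: str) -> str:
    """Hypothetical harness: we do not inspect it, we only see its LLM calls."""
    plan = llm(f"Plan: {task}")
    return llm(f"Execute: {plan}")


def fake_model(prompt: str) -> str:
    return prompt.upper()


proxy = RecordingProxy(fake_model)
result = black_box_harness(proxy, "fix bug")
trajectory = list(proxy.log)  # what the model saw and produced, in call order
print(len(trajectory), result)
```

Because the proxy never looks inside the harness, the same recorder works for any harness that can be pointed at it, which is the decoupling the paragraph above describes.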
H and argued that a deployed agent's capability is the product of model quality and harness quality, and that H belongs inside the MDP, not outside it. Polar's 0.6-to-22.6 spread across harnesses is that thesis measured during training: the substrate you optimize against is a design variable, not a constant.
Harness-1 (Jiang et al., 2026) attacks the same gap from the other side: instead of making RL swallow the whole harness, it offloads work out of the policy and into the environment. A search agent trained as a policy over a growing transcript is forced to optimize two things at once: the semantic decision of what to search next, and the clerical bookkeeping of what it has already seen, which evidence matters, and which claims are checked. Harness-1 gives the environment a working memory: a candidate pool, an importance-tagged curated set, compact evidence links, verification records, deduplicated observations, and budget-aware context rendering. With that state externalized, its 20B search agent only has to learn the genuinely hard decisions; the recoverable state is maintained by H, reliably, for free. The general principle: every bit of state the harness can hold is a bit the policy does not have to learn to hold.
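A small sketch of what externalized state can look like, covering only two of the listed pieces (deduplicated observations and budget-aware rendering). `WorkingMemory` and its methods are hypothetical and much simpler than the system the paper describes; a character count stands in for a token budget.

```python
from dataclasses import dataclass, field


@dataclass
class WorkingMemory:
    """Hypothetical environment-side state for a search agent."""
    seen: set[str] = field(default_factory=set)
    evidence: list[tuple[int, str]] = field(default_factory=list)  # (importance, text)

    def observe(self, doc: str, importance: int) -> bool:
        """Record an observation once; return False for a duplicate."""
        if doc in self.seen:
            return False
        self.seen.add(doc)
        self.evidence.append((importance, doc))
        return True

    def render(self, budget_chars: int) -> str:
        """Most important evidence first; skip items that would exceed the budget."""
        lines, used = [], 0
        for _importance, doc in sorted(self.evidence, key=lambda e: -e[0]):
            if used + len(doc) > budget_chars:
                continue
            lines.append(doc)
            used += len(doc)
        return "\n".join(lines)


memory = WorkingMemory()
memory.observe("alpha fact", importance=1)
memory.observe("key claim", importance=5)
memory.observe("alpha fact", importance=3)  # duplicate, ignored
print(memory.render(budget_chars=100))
```

None of this bookkeeping is learned. The policy is handed the rendered context and only has to decide what to search next.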
E′, deploy in a different production H, lose the policy to distribution shift. Right, Polar (Zou et al., 2026) trains against the real harness by proxying its LLM calls and reconstructing token-faithful trajectories. The +0.6-to-+22.6 spread across four harnesses is the gap, quantified: which H you train on is a first-order choice.
4. The harness H is part of the MDP, so train where you serve. Polar (Zou et al., 2026) makes this practical by reconstructing token-faithful trajectories from the real production harness, and exposes the stakes: the same model and algorithm gain anywhere from +0.6 to +22.6 on SWE-Bench Verified depending only on which harness they trained against. Harness-1 (Jiang et al., 2026) adds the dual move: externalize recoverable state into H so the policy only learns the decisions that genuinely require learning.
§5 — The credit-assignment hole
Once the environment supply is solved, the next bottleneck announces itself immediately: a long trajectory earns one reward at the end, and the agent has to figure out which of forty actions deserved the blame. This is credit assignment, and it is the reason a coarse reward over a long τ trains slowly and unstably. The right way to read the recent algorithm work is not as a zoo of competing optimizers but as a single axis — how finely the learning signal is resolved along the trajectory — with the whole field migrating from coarse to dense (see Figure 5).
At the coarse end sits the honest baseline. RAGEN (Wang et al., 2025), through its StarPO framework, optimizes whole trajectories against the outcome reward, and the paper is candid about what that costs: a recurring instability it names the Echo Trap, where reward variance collapses and gradients spike, plus the finding that without reasoning-aware signals, multi-turn reasoning barely emerges at all. One step denser, GiGPO (Feng et al., 2025) keeps the critic-free simplicity of group methods but adds a second level of comparison: an anchor-state grouping mechanism that retroactively finds repeated environment states across trajectories and computes a step-level advantage from the actions taken at each. Denser still, SWEET-RL (Zhou et al., 2025) trains a critic with access to training-time information the agent never sees at inference and uses it to hand out step-level rewards — worth a +6% absolute gain on its ColBench benchmark, enough to let Llama-3.1-8B match or exceed GPT-4o. At the dense end, SDAR (Lu et al., 2026) pushes credit down to individual tokens: it gates On-Policy Self-Distillation as an auxiliary objective on top of RL, strengthening dense supervision on teacher-endorsed tokens while softly suppressing spurious negatives, for +9.4% on ALFWorld, +10.2% on WebShop, and +7.0% on Search-QA over GRPO. And ArCHer (Zhou et al., 2024) gives the structural picture the others fill in: a hierarchical actor-critic that assigns credit at two levels at once — across turns and within a turn.
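The first two points on that axis can be contrasted in a few lines. This is a simplified illustration, not the GiGPO algorithm as published: it uses the undiscounted episode return as each step's score and exact string equality to detect repeated states, and all function names are hypothetical.

```python
from collections import defaultdict
from statistics import mean, pstdev


def normalize(values: list[float]) -> list[float]:
    """Group-relative score: subtract the group mean, divide by the group std."""
    mu, sigma = mean(values), pstdev(values)
    return [0.0 if sigma == 0 else (v - mu) / sigma for v in values]


def episode_advantages(returns: list[float]) -> list[float]:
    """Coarse: one advantage per trajectory, shared by every step in it."""
    return normalize(returns)


def step_advantages(
    trajs: list[list[tuple[str, str]]], returns: list[float]
) -> dict[tuple[int, int], float]:
    """Denser: compare actions taken from the same state across trajectories."""
    groups = defaultdict(list)  # state -> [(traj_idx, step_idx, return)]
    for i, traj in enumerate(trajs):
        for t, (state, _action) in enumerate(traj):
            groups[state].append((i, t, returns[i]))
    adv = {}
    for members in groups.values():
        scores = normalize([r for _, _, r in members])
        for (i, t, _), a in zip(members, scores):
            adv[(i, t)] = a
    return adv


# Two episodes share state "s0" and diverge there; only one succeeds.
trajs = [[("s0", "left"), ("s1", "up")], [("s0", "right"), ("s2", "up")]]
returns = [1.0, 0.0]
print(episode_advantages(returns))
print(step_advantages(trajs, returns))
```

In the toy, the episode-level signal credits or blames every step equally, while the step-level signal is nonzero only at the shared state `s0`, where the two trajectories actually chose differently.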
Two more results keep this section honest about what "dense" buys you. LOOP (Hamburger et al., 2025) shows the coarse end is not hopeless if the plumbing is right: a memory-efficient PPO variant with no value network and a single LLM copy in memory trains a 32B agent that beats a much larger baseline on AppWorld, where prior methods cleared less than half the tasks. And RAGEN-2 (Wang et al., 2026) supplies the warning that denser signals do not automatically mean healthier ones. It identifies template collapse — a failure where the agent's reasoning looks diverse but is input-agnostic, invisible to entropy, the field's usual stability gauge. Its diagnosis decomposes reasoning quality into within-input diversity (entropy) and cross-input distinguishability (mutual information), traces the collapse to a signal-to-noise mechanism where low reward variance lets regularization erase real differences, and proposes filtering high-signal prompts as the fix.
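Why entropy alone misses template collapse is easy to see on a toy example. The sketch below assumes reasoning traces have already been bucketed into discrete labels, which is a simplification and not the paper's estimator; `diagnose` is a hypothetical helper. It uses the standard identity I(X;Z) = H(Z) − H(Z|X).

```python
from collections import Counter
from math import log2


def entropy(counts: Counter) -> float:
    """Shannon entropy in bits of an empirical distribution."""
    total = sum(counts.values())
    return -sum(c / total * log2(c / total) for c in counts.values() if c)


def diagnose(samples: list[tuple[str, str]]) -> tuple[float, float]:
    """samples are (input x, reasoning label z) pairs.

    Returns (H(Z|X), I(X;Z)): within-input diversity and
    cross-input distinguishability, both in bits.
    """
    by_input: dict[str, Counter] = {}
    for x, z in samples:
        by_input.setdefault(x, Counter())[z] += 1
    n = len(samples)
    h_z = entropy(Counter(z for _, z in samples))
    h_z_given_x = sum(sum(c.values()) / n * entropy(c) for c in by_input.values())
    return h_z_given_x, h_z - h_z_given_x


# Collapsed: the same two templates appear regardless of the input.
collapsed = [("x1", "a"), ("x1", "b"), ("x2", "a"), ("x2", "b")]
# Healthy: each input elicits its own reasoning.
healthy = [("x1", "a"), ("x1", "b"), ("x2", "c"), ("x2", "d")]
print(diagnose(collapsed), diagnose(healthy))
```

Both toy policies show one bit of within-input diversity, so an entropy gauge cannot tell them apart. Only the mutual-information term separates them: zero bits for the collapsed policy, one bit for the healthy one.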
5. Multi-turn credit assignment is one axis — coarse to dense — not a zoo. The field is migrating rightward: trajectory reward (RAGEN) → step groups (GiGPO) → step-level critic (SWEET-RL, +6%) → token-level distillation (SDAR, +9.4%) → hierarchical (ArCHer). Denser signals train faster but cost engineering and stability; RAGEN-2's template collapse is the reminder that a denser signal still has to be a real one.
§6 — When hand-building runs out: recursive and test-time axes
Hand-authored factories have a ceiling. Even a procedural pipeline is bounded by the templates a human wrote for it, and an agent that wants to keep improving eventually needs experience that no fixed generator can supply. Three escape routes are open, and each is a way of manufacturing experience rather than authoring it.
The first is to train inside a learned world model — let a model dream the environment. Dreamer 4 (Hafner et al., 2025) is the proof that this is no longer a toy. It learns control by reinforcement learning entirely inside a world model that runs in real time on a single GPU, and in Minecraft it becomes the first agent to obtain diamonds purely from offline data — a task requiring sequences of over 20,000 mouse-and-keyboard actions from raw pixels, with no live environment interaction at all. Coupled with Genie from §3, this is the literal endpoint of Figure 2's second row: the environment distribution E is itself a learned, samplable object. The second route is recursion. RAO (Gandhi et al., 2026) trains agents that spawn and delegate subtasks to fresh instantiations of themselves — divide-and-conquer as a learned skill — which lets a trained agent scale to problems beyond its own context window and generalize to tasks harder than any it saw in training, at reduced wall-clock time. The agent manufactures its own subproblems, each in a clean context. Search is the close cousin: Agent Q (Putta et al., 2024) couples MCTS-guided tree search with off-policy DPO to manufacture high-quality trajectories from sparse rewards, lifting a Llama-3 70B agent from 18.6% to 81.7% on real-world booking tasks — searching the action tree for the experience worth training on.
The third route is to reuse experience at test time. Scaling test-time compute for agentic coding (Silva et al., 2026) confronts the fact that a long agent rollout is not a short answer you can rank — it is a sprawling trace of actions, errors, and partial progress. Their move is to compress each rollout into a structured summary that keeps the salient hypotheses and failure modes, then scale two ways over those summaries: Recursive Tournament Voting narrows a population through small-group comparisons (parallel), and Parallel-Distill-Refine conditions new rollouts on summaries distilled from prior attempts (sequential). The experience an agent generates while solving a task becomes raw material for solving it better — the interaction flywheel turning at inference time (see Figure 6).
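The parallel half of that recipe, narrowing a population through small-group comparisons, can be sketched as a bracket. This is an illustration of the general shape only: `tournament` and `pick_best` are hypothetical names, and the judge here is a trivial function where the paper describes comparisons over structured rollout summaries.

```python
from typing import Callable, Sequence


def tournament(
    candidates: Sequence[str],
    pick_best: Callable[[Sequence[str]], str],
    group_size: int = 3,
) -> str:
    """Narrow a population by repeated small-group comparisons."""
    if group_size < 2:
        raise ValueError("group_size must be at least 2 to make progress")
    pool = list(candidates)
    if not pool:
        raise ValueError("need at least one candidate")
    while len(pool) > 1:
        # Split into small groups, keep one winner per group, repeat.
        groups = [pool[i:i + group_size] for i in range(0, len(pool), group_size)]
        pool = [pick_best(group) for group in groups]
    return pool[0]


# Toy judge: prefer the longest summary. A real judge would be a model call.
summaries = ["a", "bbb", "cc", "dddd", "e"]
winner = tournament(summaries, pick_best=lambda group: max(group, key=len))
print(winner)
```

Small groups keep each comparison within a judge's context even when whole rollouts would not fit, which is why the rollouts are compressed into summaries first.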
6. When hand-authored factories saturate, manufacture experience instead of authoring it: train inside a learned world model (Dreamer 4 — diamonds in Minecraft from offline data, 20,000+ actions on a single GPU), recurse so the agent generates its own subproblems (RAO), or reuse rollouts as raw material at test time (Silva et al., 2026). Each converts a fixed environment budget into a renewable one.
§7 — Exploration is still the boss
Push the thesis to its limit and it turns into a question the environment factory cannot answer for you: which environment should you generate next? Infinite supply is not the same as useful supply. An agent that has saturated everything easy gains nothing from the ten-thousandth easy task; it needs the specific next problem that sits just past its frontier. Manufacturing capacity solves "how much"; it does not solve "what." That is the exploration problem, and it does not go away — it gets promoted.
This was seen early and clearly. Jiang et al. (2022) framed the field as moving from "learning from data" to "learning what data to learn from," and argued that this — generalized exploration — is the universal bottleneck of open-ended learning, common to supervised learning and RL alike. Read through this article's lens, that is the exact statement that environment supply, once industrialized, collapses back into an exploration problem. Agent-World (Dong et al., 2026) is the live instance: its self-evolving loop does not just synthesize more tasks, it identifies the agent's capability gaps and synthesizes tasks aimed at them, and the paper's scaling trend in self-evolution rounds is what a working interaction flywheel looks like when exploration steers it (see Figure 7). The factory makes the worlds; exploration decides which worlds are worth making.
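One common heuristic for "which world next" is to prioritize tasks the agent solves some of the time. The sketch below is that generic heuristic, offered as an assumption for illustration and not as Agent-World's gap-identification mechanism; `learnability` and `pick_next` are hypothetical names.

```python
def learnability(success_rate: float) -> float:
    """p * (1 - p): zero for always-solved or never-solved tasks, largest at p = 0.5."""
    return success_rate * (1.0 - success_rate)


def pick_next(success_rates: dict[str, float], k: int = 2) -> list[str]:
    """Return the k task families closest to the agent's frontier."""
    ranked = sorted(
        success_rates,
        key=lambda name: learnability(success_rates[name]),
        reverse=True,
    )
    return ranked[:k]


rates = {"easy": 0.98, "frontier": 0.45, "stretch": 0.15, "impossible": 0.0}
print(pick_next(rates))
```

Under this scoring the ten-thousandth easy task and the impossible one are both worth almost nothing, which is the sense in which supply without steering is wasted.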
7. Industrializing environment supply answers "how many worlds"; it does not answer "which world next." Jiang et al. (2022) named that successor bottleneck years early, as the shift from learning from data to learning what data to learn from, and Agent-World shows it in practice: a flywheel is only as good as the exploration that aims it at the agent's actual frontier.
§8 — The practitioner's path
Collapse everything above into an order of operations you can run on Monday. The mistake teams make is to reach straight for RL — the most expensive, least stable tool — before they have earned the right to use it. The cheaper interventions come first, and each one either fixes the problem outright or makes the next stage tractable (see Figure 8). The open frontier recipes confirm the order: Kimi K2 (Kimi Team, 2025) — a practitioner report on a 1T-parameter MoE — describes post-training as a large-scale agentic data-synthesis pipeline feeding a joint RL stage over real and synthetic environments, and reaches 65.8 on SWE-Bench Verified; Qwen3 (Qwen Team, 2025) reports a comparable agentic, tool-use-centered recipe across its 0.6B-to-235B family. Both, read as lab reports rather than peer-reviewed findings, say the same thing: the recipe is the environment supply.
Step 1: prompt and measure. Establish the ceiling of the model as-is, with a good prompt and the production harness, before you change any weights. Most "the agent can't do this" problems are harness problems in disguise.
Step 2: tune the harness. If the failures are in state-tracking, context assembly, or tool routing, fix H first: externalize recoverable state the way Harness-1 does, so the policy is left only the decisions that need learning. Re-measure.
Step 3: distill from a stronger agent (OPD). If a stronger teacher exists, buy dense supervision before you pay for sparse RL: synthesize trajectories the way OpenResearcher does and distill them, the cheapest path to a competent starting policy.
Step 4: RL on synthesized environments. Only now reach for RL: on the production harness (Polar), against a scalable environment pipeline (Endless Terminals / Gym-Anything / Agent-World, or a learned world model), with a credit-assignment algorithm matched to your reward's granularity (§5), watching for Echo Trap and template collapse. As the agent saturates the supply, the bottleneck becomes exploration: which environment to synthesize next.
The decision tree, in one line: prompt → tune harness H → OPD-distill from a stronger agent → RL on synthesized environments. Spend in that order. If you skip to RL before the harness is right and the environment pipeline is real, you will pay the most for the least — exactly the trap the 2018 applied-RL projects fell into.
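The same ordering, written as a function so the gating conditions are explicit. The four boolean inputs are assumptions about what a team would measure, and `Stage` and `next_stage` are hypothetical names for illustration.

```python
from enum import Enum


class Stage(Enum):
    PROMPT = "prompt and measure"
    HARNESS = "tune harness"
    DISTILL = "distill from a stronger agent"
    RL = "RL on synthesized environments"


def next_stage(
    measured: bool,
    harness_failures: bool,
    teacher_available: bool,
    distilled: bool,
) -> Stage:
    """Return the cheapest intervention that has not yet been exhausted."""
    if not measured:
        return Stage.PROMPT
    if harness_failures:
        return Stage.HARNESS
    if teacher_available and not distilled:
        return Stage.DISTILL
    return Stage.RL


print(next_stage(measured=True, harness_failures=True,
                 teacher_available=True, distilled=False).value)
```

RL is only reachable once every earlier branch has been ruled out, which is the "spend in that order" rule in executable form.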
8. Run the cheap interventions first: prompt, then harness, then distillation, then RL on synthesized environments. The open frontier recipes (Kimi K2, Qwen3 — practitioner reports) describe exactly this, with environment supply as the centerpiece of post-training. RL is the last and most expensive tool; earn the right to use it by exhausting the cheaper ones.
§ What Comes Next
Everything in this article assumed a clean separation between training and serving: build the factory, run the loop, ship the agent. But the most valuable agents are never finished: they are deployed into a world that keeps changing, and the experience that matters most arrives after training, in production, one interaction at a time. The interaction flywheel of Figure 2 was drawn as a closed offline loop; the next article cuts it open and lets it run online. Article 4 takes this same thesis (environments, not algorithms) into the continual setting, where the environment is no longer a pipeline you control but a stream you cannot reset, and where the bill for re-learning what you already knew comes due daily.
- The bottleneck in agentic RL has moved from the model to the substrate. Algorithms and compute are abundant; environments are not. The Endless Terminals result — boring algorithm, scaled environments, 10.7%→53.3% — is the whole thesis in one experiment.
- Environments are the new data, line for line: rewards are labels, environment distributions are datasets, reward verification is data cleaning, and the interaction flywheel is the data flywheel. The team that industrializes that column wins the era.
- Train where you serve. The harness is part of the MDP; Polar's +0.6-to-+22.6 spread across harnesses is the train/serve gap measured in points.
- Credit assignment is one axis from coarse to dense, and the field is migrating rightward — but a denser signal still has to be a real one (RAGEN-2's template collapse).
- The order of operations is fixed: prompt → harness → distill → RL on synthesized environments. Then exploration — which world to build next — becomes the boss.
References
- Aggarwal, P., Neubig, G., & Welleck, S. (2026). Gym-Anything: Turn any Software into an Agent Environment. Carnegie Mellon University. arXiv:2604.06126.
- Bruce, J., Dennis, M., Edwards, A., Parker-Holder, J., Shi, Y., et al. (2024). Genie: Generative Interactive Environments. Google DeepMind. arXiv:2402.15391.
- Dong, G., Dou, Z., et al. (2026). Agent-World: Scaling Real-World Environment Synthesis for Evolving General Agent Intelligence. Renmin University of China & ByteDance Seed. arXiv:2604.18292.
- Feng, L., Xue, Z., Liu, T., & An, B. (2025). Group-in-Group Policy Optimization for LLM Agent Training. Nanyang Technological University & Skywork AI. arXiv:2505.10978.
- Gandhi, A., Chakraborty, S., Wang, X., Kumar, A., & Neubig, G. (2026). Recursive Agent Optimization. Carnegie Mellon University & Amazon AGI Labs. arXiv:2605.06639.
- Gandhi, K., Garg, S., Goodman, N. D., & Papailiopoulos, D. (2026). Endless Terminals: Scaling RL Environments for Terminal Agents. Stanford University & Microsoft Research. arXiv:2601.16443.
- Hafner, D., Yan, W., & Lillicrap, T. (2025). Training Agents Inside of Scalable World Models (Dreamer 4). Google DeepMind. arXiv:2509.24527.
- Hamburger, J., Koltun, V., & Krähenbühl, P. (2025). Reinforcement Learning for Long-Horizon Interactive LLM Agents (LOOP). arXiv:2502.01600.
- Jiang, M., Rocktäschel, T., & Grefenstette, E. (2022). General Intelligence Requires Rethinking Exploration. Meta AI, UCL & Cohere. arXiv:2211.07819.
- Jiang, P., Shi, Z., Hong, K., et al. (2026). Harness-1: Reinforcement Learning for Search Agents with State-Externalizing Harnesses. UIUC, UC Berkeley & Chroma. arXiv:2606.02373.
- Kimi Team. (2025). Kimi K2: Open Agentic Intelligence. Moonshot AI. arXiv:2507.20534.
- Li, Z., Jiang, D., Ma, X., et al. (2026). OpenResearcher: A Fully Open Pipeline for Long-Horizon Deep Research Trajectory Synthesis. Texas A&M University & University of Waterloo. arXiv:2603.20278.
- Lu, Z., Yao, Z., et al., & Shen, Y. (2026). Self-Distilled Agentic Reinforcement Learning (SDAR). Zhejiang University & Meituan. arXiv:2605.15155.
- Putta, P., Mills, E., Garg, N., Motwani, S., Finn, C., Garg, D., & Rafailov, R. (2024). Agent Q: Advanced Reasoning and Learning for Autonomous AI Agents. MultiOn & Stanford University. arXiv:2408.07199.
- Qwen Team. (2025). Qwen3 Technical Report. arXiv:2505.09388.
- ROCK, ROLL & iFlow Joint Team. (2025). Let It Flow: Agentic Crafting on Rock and Roll, Building the ROME Model within an Open Agentic Learning Ecosystem. arXiv:2512.24873.
- Silva, R., Chen, Z., Iyer, S., et al. (2026). Scaling Test-Time Compute for Agentic Coding. Meta Superintelligence Labs. arXiv:2604.16529.
- Wang, Z., Wang, K., Wang, Q., Zhang, P., Li, L., et al. (2025). RAGEN: Understanding Self-Evolution in LLM Agents via Multi-Turn Reinforcement Learning (StarPO). arXiv:2504.20073.
- Wang, Z., Gui, C., Jin, X., et al. (2026). RAGEN-2: Reasoning Collapse in Agentic RL. arXiv:2604.06268.
- Zhou, S., Xu, F. F., et al. (2023). WebArena: A Realistic Web Environment for Building Autonomous Agents. arXiv:2307.13854.
- Zhou, Y., Zanette, A., Pan, J., Levine, S., & Kumar, A. (2024). ArCHer: Training Language Model Agents via Hierarchical Multi-Turn RL. UC Berkeley & Google DeepMind. arXiv:2402.19446.
- Zhou, Y., Jiang, S., Tian, Y., Weston, J., Levine, S., Sukhbaatar, S., & Li, X. (2025). SWEET-RL: Training Multi-Turn LLM Agents on Collaborative Reasoning Tasks. FAIR at Meta & UC Berkeley. arXiv:2503.15478.
- Zou, Y., Demoret, M., Kautz, J., & Dong, Y. (2026). Polar: Agentic RL on Any Harness at Scale. NVIDIA. arXiv:2605.24220.