Sources & references · Harness Engineering

On attribution

Each work below is cited to its authors and linked to its canonical source: an arXiv page or the publishing venue's proceedings. The papers remain the work and property of their respective authors; they are cited here for scholarly commentary, following normal academic citation practice. The selection draws on a curated corpus of several hundred papers; only those actually cited by an essay appear here. This list reflects the literature as cited through June 2026. If you spot a citation that is wrong or incomplete, let me know at dattgoswami@gmail.com.

References by essay

The papers behind each of the twelve essays, in the order they are cited.

1 The Harness Is the Product

Zhang, X., Wang, D., Xu, K., Zhu, Q., & Che, W. (2026). Scaling Laws for Agent Harnesses via Effective Feedback Compute. Preprint. arXiv:2605.29682.
Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., & Cao, Y. (2023). ReAct: Synergizing Reasoning and Acting in Language Models. International Conference on Learning Representations (ICLR), 2023. arXiv:2210.03629.
Li, J., Xiao, X., Zhang, Y., Liu, C., et al. (2026). Agent Harness Engineering: A Survey. Preprint (project page: Awesome-Agent-Harness).
Lin, M., Wu, J., Wang, Z., et al. (2026). Harness Updating Is Not Harness Benefit: Disentangling Evolution Capabilities in Self-Evolving LLM Agents. Preprint. arXiv:2605.30621.
Xi, X., Li, X., Wang, W., & Cai, X. (2026). HeavySkill: Heavy Thinking as the Inner Skill in Agentic Harness. Preprint. arXiv:2605.02396.
Sen, S., Kasturi, A., Lumer, E., Gulati, A., & Subbiah, V. K. (2026). Is Grep All You Need? How Agent Harnesses Reshape Agentic Search. Preprint. arXiv:2605.15184.
Sarker, A. K., Staylor, M., Alsaadi, A., von Laszewski, G., & Jha, S. (2026). AAFLOW: Scalable Patterns for Agentic AI Workflows. Preprint. arXiv:2605.02162.
Xiong, Y., Yang, Y., Tian, Y., Shi, Y., Chandra, V., & Schmidhuber, J. (2026). Neural Computers. Preprint. arXiv:2604.06425.
Liao, J., Li, S., Wen, M., Wang, J., & Zhang, W. (2026). Position: Agentic AI System Is a Foreseeable Pathway to AGI. Proceedings of the 43rd International Conference on Machine Learning (ICML), PMLR 306. arXiv:2605.12966.
Rishav, R., Pujari, P., & Rastogi, P. (2026). ContraPrompt: Contrastive Prompt Optimization via Dyadic Reasoning Trace Analysis. Preprint. arXiv:2604.17937.

2 The Agent Evaluation Crisis

D'Oro, P., Silwal, S., Wong, W., Sun, Y., Xiao, F., Wang, M., Gan, E., Bolourchi, A., & Tighe, J. (2026). Computer Use at the Edge of the Statistical Precipice. arXiv preprint. arXiv:2605.08261.
Wiemann, M. L., Smith, L. M., Melchior, P., Mishra-Sharma, S., Wilson, A. G., Izmailov, P., & Cuesta-Lázaro, C. (2026). DiscoverPhysics: Benchmarking LLMs for Out-of-the-Box Scientific Thinking. arXiv preprint. arXiv:2605.26087.
Zhou, S., Xu, F. F., et al. (2023). WebArena: A Realistic Web Environment for Building Autonomous Agents. arXiv preprint. arXiv:2307.13854.
Mialon, G., Fourrier, C., Swift, C., Wolf, T., LeCun, Y., & Scialom, T. (2023). GAIA: A Benchmark for General AI Assistants. arXiv preprint. arXiv:2311.12983.
Yao, S., Shinn, N., Razavi, P., & Narasimhan, K. (2024). τ-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains. arXiv preprint. arXiv:2406.12045.
Liu, X., Yu, H., Zhang, H., Xu, Y., et al. (2023). AgentBench: Evaluating LLMs as Agents. arXiv preprint. arXiv:2308.03688.
Merrill, M. A., Shaw, A. G., Carlini, N., et al. (2026). Terminal-Bench: Benchmarking Agents on Hard, Realistic Tasks in Command Line Interfaces. arXiv preprint. arXiv:2601.11868.
Imajuku, Y., Horie, K., Iwata, Y., Aoki, K., Takahashi, N., & Akiba, T. (2025). ALE-Bench: A Benchmark for Long-Horizon Objective-Driven Algorithm Engineering. 39th Conference on Neural Information Processing Systems (NeurIPS 2025), Datasets and Benchmarks Track.
Towards a Science of AI Agent Reliability. (2026). arXiv preprint. arXiv:2602.16666.
Agents' Last Exam. (2026). arXiv preprint. arXiv:2606.05405.
Asawa, P., Glaze, C. M., Orlanski, G., Ramakrishnan, R., Xu, B., Biswal, A., Chen, V. S., Sala, F., Zaharia, M., & Gonzalez, J. E. (2026). Continual Learning Bench: Evaluating Frontier AI Systems in Real-World Stateful Environments. arXiv preprint. arXiv:2606.05661.
Xu, Z., Chen, J., Huang, Y., et al. (2026). AutoLab: Can Frontier Models Solve Long-Horizon Auto Research and Engineering Tasks? arXiv preprint. arXiv:2606.05080.
Murphy, K. (2026). Agentic Forecasting using Sequential Bayesian Updating of Linguistic Beliefs. arXiv preprint. arXiv:2604.18576.
Patel, A., Reddy, S., Mosbach, M., & Bahdanau, D. (2026). Forecasting Downstream Performance of LLMs with Proxy Metrics. arXiv preprint. arXiv:2605.18607.
Kapoor, S., Kirgis, P., Schwartz, A., Rabanser, S., Allaire, J. J., Bommasani, R., et al. (2026). Open-World Evaluations for Measuring Frontier AI Capabilities. arXiv preprint. arXiv:2605.20520.
Anthropic. (2026). Claude Opus 4.7 System Card. Anthropic.
Singh, C., Tan, Y. S., Xu, W., Gero, Z., Yang, W., Galley, M., & Gao, J. (2026). Agentic-imodels: Evolving Agentic Interpretability Tools via Autoresearch. arXiv preprint. arXiv:2605.03808.

3 Environments Are the Bottleneck

Aggarwal, P., Neubig, G., & Welleck, S. (2026). Gym-Anything: Turn any Software into an Agent Environment. Carnegie Mellon University. arXiv:2604.06126.
Bruce, J., Dennis, M., Edwards, A., Parker-Holder, J., Shi, Y., et al. (2024). Genie: Generative Interactive Environments. Google DeepMind. arXiv:2402.15391.
Dong, G., Dou, Z., et al. (2026). Agent-World: Scaling Real-World Environment Synthesis for Evolving General Agent Intelligence. Renmin University of China & ByteDance Seed. arXiv:2604.18292.
Feng, L., Xue, Z., Liu, T., & An, B. (2025). Group-in-Group Policy Optimization for LLM Agent Training. Nanyang Technological University & Skywork AI. arXiv:2505.10978.
Gandhi, A., Chakraborty, S., Wang, X., Kumar, A., & Neubig, G. (2026). Recursive Agent Optimization. Carnegie Mellon University & Amazon AGI Labs. arXiv:2605.06639.
Gandhi, K., Garg, S., Goodman, N. D., & Papailiopoulos, D. (2026). Endless Terminals: Scaling RL Environments for Terminal Agents. Stanford University & Microsoft Research. arXiv:2601.16443.
Hafner, D., Yan, W., & Lillicrap, T. (2025). Training Agents Inside of Scalable World Models (Dreamer 4). Google DeepMind. arXiv:2509.24527.
Chen, K., Cusumano-Towner, M., Huval, B., Petrenko, A., Hamburger, J., Koltun, V., & Krähenbühl, P. (2025). Reinforcement Learning for Long-Horizon Interactive LLM Agents (LOOP). arXiv:2502.01600.
Jiang, M., Rocktäschel, T., & Grefenstette, E. (2022). General Intelligence Requires Rethinking Exploration. Meta AI, UCL & Cohere. arXiv:2211.07819.
Jiang, P., Shi, Z., Hong, K., et al. (2026). Harness-1: Reinforcement Learning for Search Agents with State-Externalizing Harnesses. UIUC, UC Berkeley & Chroma. arXiv:2606.02373.
Kimi Team. (2025). Kimi K2: Open Agentic Intelligence. Moonshot AI. arXiv:2507.20534.
Li, Z., Jiang, D., Ma, X., et al. (2026). OpenResearcher: A Fully Open Pipeline for Long-Horizon Deep Research Trajectory Synthesis. Texas A&M University & University of Waterloo. arXiv:2603.20278.
Lu, Z., Yao, Z., ... Shen, Y. (2026). Self-Distilled Agentic Reinforcement Learning (SDAR). Zhejiang University & Meituan. arXiv:2605.15155.
Putta, P., Mills, E., Garg, N., Motwani, S., Finn, C., Garg, D., & Rafailov, R. (2024). Agent Q: Advanced Reasoning and Learning for Autonomous AI Agents. MultiOn & Stanford University. arXiv:2408.07199.
Qwen Team. (2025). Qwen3 Technical Report. arXiv:2505.09388.
ROCK, ROLL & iFlow Joint Team. (2025). Let It Flow: Agentic Crafting on Rock and Roll, Building the ROME Model within an Open Agentic Learning Ecosystem. arXiv:2512.24873.
Kim, J. (D.), Yang, W., Niu, K., et al. (2026). Scaling Test-Time Compute for Agentic Coding. Meta Superintelligence Labs. arXiv:2604.16529.
Wang, Z., Wang, K., Wang, Q., Zhang, P., Li, L., et al. (2025). RAGEN: Understanding Self-Evolution in LLM Agents via Multi-Turn Reinforcement Learning (StarPO). arXiv:2504.20073.
Wang, Z., Gui, C., Jin, X., et al. (2026). RAGEN-2: Reasoning Collapse in Agentic RL. arXiv:2604.06268.
Zhou, S., Xu, F. F., et al. (2023). WebArena: A Realistic Web Environment for Building Autonomous Agents. arXiv:2307.13854.
Zhou, Y., Zanette, A., Pan, J., Levine, S., & Kumar, A. (2024). ArCHer: Training Language Model Agents via Hierarchical Multi-Turn RL. UC Berkeley & Google DeepMind. arXiv:2402.19446.
Zhou, Y., Jiang, S., Tian, Y., Weston, J., Levine, S., Sukhbaatar, S., & Li, X. (2025). SWEET-RL: Training Multi-Turn LLM Agents on Collaborative Reasoning Tasks. FAIR at Meta & UC Berkeley. arXiv:2503.15478.
Zou, Y., Demoret, M., Kautz, J., & Dong, Y. (2026). Polar: Agentic RL on Any Harness at Scale. NVIDIA. arXiv:2605.24220.

4 Agents That Learn on the Job

Asawa, P., Glaze, C. M., Orlanski, G., Ramakrishnan, R., Xu, B., Biswal, A., Chen, V. S., Sala, F., Zaharia, M., & Gonzalez, J. E. (2026). Continual Learning Bench: Evaluating Frontier AI Systems in Real-World Stateful Environments. Preprint. arXiv:2606.05661.
Wang, Y., Chen, X., Jin, X., Wang, M., & Yang, L. (2026). OpenClaw-RL: Train Any Agent Simply by Talking. Preprint. arXiv:2603.10165.
Xue, T., Liao, Z., Shi, T., Wang, Z., Zhang, K., Song, D., Su, Y., & Sun, H. (2026). Autonomous Continual Learning for Environment Adaptation of Computer-Use Agents. Preprint. arXiv:2602.10356.
Memento-Team. (2026). Memento-Skills: Let Agents Design Agents. Preprint. arXiv:2603.18743.
Goswami, D. (2026). cl-agent: A Continual-Learning Substrate for Coding Agents — Episode Capture, Replay, and Rule-Based Distillation for Cross-Session Improvement Without Fine-Tuning. Independent Research, Preprint. PDF · github.com/dattgoswami/cl-agent
Wang, G., Xie, Y., Jiang, Y., Mandlekar, A., Xiao, C., Zhu, Y., Fan, L., & Anandkumar, A. (2023). Voyager: An Open-Ended Embodied Agent with Large Language Models. Preprint. arXiv:2305.16291.
Dai, A., Meinardus, B., Regan, C., Tian, Y., & Tang, Y. (2026). Discovering Novel LLM Experts via Task-Capability Coevolution. International Conference on Learning Representations (ICLR), 2026. arXiv:2604.14969.
Tiwari, R., Sareen, K., Agrawal, L. A., Gonzalez, J. E., Zaharia, M., Keutzer, K., Dhillon, I. S., Agarwal, R., & Khatri, D. (2026). Learning, Fast and Slow: Towards LLMs That Adapt Continually. Preprint. arXiv:2605.12484.
Dohare, S., Hernandez-Garcia, J. F., Lan, Q., Rahman, P., Mahmood, A. R., & Sutton, R. S. (2024). Loss of plasticity in deep continual learning. Nature, 632, 768–774.
Sutton, R. S., Koop, A., & Silver, D. (2007). On the Role of Tracking in Stationary Environments. Proceedings of the 24th International Conference on Machine Learning (ICML), 2007.
Tamborski, M., & Abel, D. (2025). Memory Allocation in Resource-Constrained Reinforcement Learning. Preprint. arXiv:2506.17263.
Orenstein, A., Chen, J., Delos Santos, G. A., Sapara, B., & Bowling, M. (2025). Toward Agents That Reason About Their Computation. Preprint. arXiv:2510.22833.
Mitchell, T., Cohen, W., Hruschka, E., Talukdar, P., Betteridge, J., Carlson, A., et al. (2015). Never-Ending Learning. Proceedings of the 29th AAAI Conference on Artificial Intelligence (AAAI), 2015.

5 The Memory Stack

Jia, Z., Li, J., Kang, Y., Wang, Y., Wu, T., Wang, Q., Qi, S., Liang, Y., He, D., Zheng, Z., & Zhu, S.-C. (2026). The AI Hippocampus: How Far Are We From Human Memory? Preprint (OpenReview). arXiv:2601.09113.
Park, J. S., O'Brien, J. C., Cai, C. J., Morris, M. R., Liang, P., & Bernstein, M. S. (2023). Generative Agents: Interactive Simulacra of Human Behavior. Proceedings of the 36th ACM Symposium on User Interface Software and Technology (UIST '23). arXiv:2304.03442.
Packer, C., Fang, V., Patil, S. G., Lin, K., Wooders, S., Stoica, I., & Gonzalez, J. E. (2023). MemGPT: Towards LLMs as Operating Systems. arXiv preprint. arXiv:2310.08560.
Shen, B., Jin, L., Cai, H., Hu, L., & Xin, Y. (2026). The Efficiency Frontier: A Unified Framework for Cost–Performance Optimization in LLM Context Management. arXiv preprint. arXiv:2605.23071.
Zhang, H., Xu, Q., Li, Z., Zhang, L., Jiang, P., Zhang, Y., & McAuley, J. (2026). Masking Stale Observations Helps Search Agents — Until It Doesn't: A Regime Map and Its Mechanism. arXiv preprint. arXiv:2606.00408.
Yao, Y., Zhu, Y., Du, L., & Deng, S. (2026). StructMem: Structured Memory for Long-Horizon Behavior in LLMs. arXiv preprint. arXiv:2604.21748.
Lei, J., Zhang, D., Li, J., Wang, W., Fan, K., Liu, X., Ma, X., Chen, B., & Poria, S. (2026). δ-mem: Efficient Online Memory for Large Language Models. arXiv preprint. arXiv:2605.12357.
Chaudhry, A., Rohrbach, M., Elhoseiny, M., Ajanthan, T., Dokania, P. K., Torr, P. H. S., & Ranzato, M. (2019). On Tiny Episodic Memories in Continual Learning. arXiv preprint. arXiv:1902.10486.
Shinn, N., Cassano, F., Berman, E., Gopinath, A., Narasimhan, K., & Yao, S. (2023). Reflexion: Language Agents with Verbal Reinforcement Learning. Advances in Neural Information Processing Systems 36 (NeurIPS 2023). arXiv:2303.11366.
Dong, B., Zhu, H., Huang, R., Yu, G., Wei, Y., Zheng, G., Xiong, F., Wang, H., Chen, H., & Zhang, N. (2026). Rethinking Memory as Continuously Evolving Connectivity. arXiv preprint. arXiv:2605.28773.
Gutiérrez, B. J., Shu, Y., Gu, Y., Yasunaga, M., & Su, Y. (2024). HippoRAG: Neurobiologically Inspired Long-Term Memory for Large Language Models. Advances in Neural Information Processing Systems 37 (NeurIPS 2024). arXiv:2405.14831.
Wang, G., Xie, Y., Jiang, Y., Mandlekar, A., Xiao, C., Zhu, Y., Fan, L., & Anandkumar, A. (2023). Voyager: An Open-Ended Embodied Agent with Large Language Models. arXiv preprint. arXiv:2305.16291.
McLeish, S., Goldstein, T., & Fanti, G. (2026). Language Models Need Sleep. arXiv preprint. arXiv:2605.26099.
van de Ven, G. M., Siegelmann, H. T., & Tolias, A. S. (2020). Brain-Inspired Replay for Continual Learning with Artificial Neural Networks. Nature Communications, 11, 4069.
Behrouz, A., Zhong, P., & Mirrokni, V. (2025). Titans: Learning to Memorize at Test Time. arXiv preprint. arXiv:2501.00663.
Sun, Y., Li, X., Dalal, K., Xu, J., Vikram, A., Zhang, G., Dubois, Y., Chen, X., Wang, X., Koyejo, S., Hashimoto, T., & Guestrin, C. (2024). Learning to (Learn at Test Time): RNNs with Expressive Hidden States. arXiv preprint. arXiv:2407.04620.
Hatamizadeh, A., Choi, Y., & Kautz, J. (2026). Gated DeltaNet-2: Decoupling Erase and Write in Linear Attention. arXiv preprint. arXiv:2605.22791.
Srinivasan, V. (2026). Stateless Decision Memory for Enterprise AI Agents. Practitioner analysis (arXiv preprint). arXiv:2604.20158.

6 Tools, Skills, and the Action Interface

Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., & Cao, Y. (2023). ReAct: Synergizing Reasoning and Acting in Language Models. International Conference on Learning Representations (ICLR), 2023. arXiv:2210.03629.
Cheng, Y., Fan, C., JafariRaviz, M., Rezaei, K., & Feizi, S. (2026). Model-Adaptive Tool Necessity Reveals the Knowing-Doing Gap in LLM Tool Use. Preprint. arXiv:2605.14038.
Sadani, A., & Kumar, D. (2026). Tool Attention Is All You Need: Dynamic Tool Gating and Lazy Schema Loading for Eliminating the MCP/Tools Tax in Scalable Agentic Workflows. Infrrd.ai, practitioner report (Preprint). arXiv:2604.21816.
Schick, T., Dwivedi-Yu, J., Dessì, R., Raileanu, R., Lomeli, M., Zettlemoyer, L., Cancedda, N., & Scialom, T. (2023). Toolformer: Language Models Can Teach Themselves to Use Tools. Advances in Neural Information Processing Systems 36 (NeurIPS), 2023. arXiv:2302.04761.
Patil, S. G., Zhang, T., Wang, X., & Gonzalez, J. E. (2023). Gorilla: Large Language Model Connected with Massive APIs. Preprint. arXiv:2305.15334.
Su, W., Long, J., Ai, Q., Tang, Y., Wang, C., Tu, Y., & Liu, Y. (2026). Skill Retrieval Augmentation for Agentic AI. Preprint. arXiv:2604.24594.
Zhou, T., Liu, D., Yuan, L., Shao, J., & Hu, X. (2026). COLLEAGUE.SKILL: Automated AI Skill Generation via Expert Knowledge Distillation. Shanghai Artificial Intelligence Laboratory, Preprint. arXiv:2605.31264.
Yang, Y., Gong, Z., Huang, W., Yang, Q., Zhou, Z., Huang, Z., et al. (2026). SkillOpt: Executive Strategy for Self-Evolving Agent Skills. Microsoft, Preprint. arXiv:2605.23904.
Gan, Z., Tang, H., & Liu, Y. (2026). Statistical Priors for Implicit Preferences: Decoupling Skill Selection as a Local Harness in Personal Agents. Renmin University of China, Preprint. arXiv:2606.05828.
Lyu, Y., Wang, C., Zheng, H., Yue, Y., Yan, J., Wang, M., & Huang, J. (2026). AgenticQwen: Training Small Agentic Language Models with Dual Data Flywheels for Industrial-Scale Tool Use. Alibaba Group, Preprint. arXiv:2604.21590.
Axboe, J. (n.d.). Efficient IO with io_uring. Systems design document.

7 Planning and the Myopia Problem

Chen, S., Li, J.-A., et al. (2026). Extracting Search Trees from LLM Reasoning Traces Reveals Myopic Planning. Preprint. arXiv:2605.06840.
Islah, N., Abbes, I., & Rish, I. (2026). Failed Reasoning Traces Tell You What Is Fixable (But Not by Reading Them). Preprint. arXiv:2606.05145.
Setlur, A., Yang, M. Y. R., Snell, C., Greer, J., Wu, I., Smith, V., Simchowitz, M., & Kumar, A. (2025). e3: Learning to Explore Enables Extrapolation of Test-Time Compute for LLMs. Preprint. arXiv:2506.09026.
Qu, Y., Singh, A., Lee, Y., Setlur, A., Salakhutdinov, R., Finn, C., & Kumar, A. (2025). RLAD: Training LLMs to Discover Abstractions for Solving Reasoning Problems. Preprint. arXiv:2510.02263.
Snell, C., Lee, J., Xu, K., & Kumar, A. (2024). Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters. Preprint. arXiv:2408.03314.
Sunkaraneni, V., Beneventano, P., Neumarker, R., Poggio, T., & Galanti, T. (2026). Agentic Systems as Boosting Weak Reasoning Models. Preprint. arXiv:2605.14163.
Zhou, S., Chai, W., Liu, K., Mao, H., Mang, Q., & Shang, J. (2026). OpenDeepThink: Parallel Reasoning via Bradley–Terry Aggregation. Preprint. arXiv:2605.15177.
Chu, M., Zhang, X. B., Lin, K. Q., Kong, L., et al. (2026). Agentic World Modeling: Foundations, Capabilities, Laws, and Beyond. Preprint. arXiv:2604.22748.
Timor, N., Shwartz-Ziv, R., Goldblum, M., LeCun, Y., & Harel, D. (2026). On Training in Imagination. Preprint. arXiv:2605.06732.
Tsoukalas, G., Kovsharov, A., Shirobokov, S., Surina, A., Firsching, M., et al. (2026). Advancing Mathematics Research with AI-Driven Formal Proof Search. Preprint. arXiv:2605.22763.
Kung, P.-N., Song, L., Hwang, D., Yoon, J., Li, C.-L., et al. (2026). LEAP: Supercharging LLMs for Formal Mathematics with Agentic Frameworks. Preprint. arXiv:2606.03303.

8 Multi-Agent Systems and Their Failure Modes

Qi, S., Ma, J., Xing, R., Guo, W., Huang, X., Gao, Z., Deng, J., Liu, J., Zhang, L., Wei, B., et al. (2026). Beyond Individual Intelligence: Surveying Collaboration, Failure Attribution, and Self-Evolution in LLM-based Multi-Agent Systems. Preprint. arXiv:2605.14892.
Chen, N., Tong, Y., Yang, Y., He, Y., Zhang, X., Zou, Q., Wang, Q., & He, B. (2026). Diversity Collapse in Multi-Agent LLM Systems: Structural Coupling and Collective Failure in Open-Ended Idea Generation. Preprint. arXiv:2604.18005.
Du, Y., Li, S., Torralba, A., Tenenbaum, J. B., & Mordatch, I. (2023). Improving Factuality and Reasoning in Language Models through Multiagent Debate. Preprint. arXiv:2305.14325.
Nielsen, S., Cetin, E., Schwendeman, P., Sun, Q., Xu, J., & Tang, Y. (2025). Learning to Orchestrate Agents in Natural Language with the Conductor. Preprint. arXiv:2512.04388.
Xu, J., Sun, Q., Schwendeman, P., Nielsen, S., Cetin, E., & Tang, Y. (2025). TRINITY: An Evolved LLM Coordinator. Preprint. arXiv:2512.04695.
Yu, Z., Fu, Y., He, Z., Huang, Y., Ka Yiu, L., Fang, M., Luo, W., & Wang, J. (2026). From Skills to Talent: Organising Heterogeneous Agents as a Real-World Company (OneManCompany). Preprint. arXiv:2604.22446.
Qi, Z., Su, H., Qu, A., Wang, C., Yao, Y., Zheng, H., Du, Y., Reddi, V. J., Li, J., Liang, P. P., Lakkaraju, H., Kakade, S., et al. (2026). Economy of Minds: Emerging Multi-Agent Intelligence with Economic Interactions. Preprint. arXiv:2606.02859.
Tong, Y., Zhang, T., Buehler, M. J., He, J., Zou, J., & Tong, H. (2026). Recursive Multi-Agent Systems. Preprint. arXiv:2604.25917.
Gu, Z., Li, J., Cai, Y., & Feng, H. (2026). Scaling Behavior of Single LLM-Driven Multi-Agent Systems. Preprint. arXiv:2606.00655.
Chen, L., Davis, J. Q., Hanin, B., Bailis, P., Stoica, I., Zaharia, M., & Zou, J. (2024). Are More LM Calls All You Need? Towards the Scaling Properties of Compound AI Systems. Preprint. arXiv:2403.02419.
Hong, S., Zhuge, M., Chen, J., Zheng, X., Cheng, Y., Zhang, C., Wang, J., Wang, Z., Yau, S. K. S., Lin, Z., et al. (2023). MetaGPT: Meta Programming for a Multi-Agent Collaborative Framework. Preprint. arXiv:2308.00352.
Wu, Q., Bansal, G., Zhang, J., Wu, Y., Li, B., Zhu, E., Jiang, L., Zhang, X., Zhang, S., Liu, J., Awadallah, A., White, R. W., Burger, D., & Wang, C. (2023). AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation. Preprint. arXiv:2308.08155.
Albrecht, S. V., Christianos, F., & Schäfer, L. (2024). Multi-Agent Reinforcement Learning: Foundations and Modern Approaches. MIT Press.

9 Software Engineering Agents: The Proving Ground

Wang, X., Li, B., Song, Y., Xu, F. F., Tang, X., Zhuge, M., Pan, J., Song, Y., Li, B., Singh, J., Tran, H. H., Peng, H., Ji, H., & Neubig, G. (2024). OpenHands: An Open Platform for AI Software Developers as Generalist Agents. Preprint. arXiv:2407.16741.
Jimenez, C. E., Yang, J., Wettig, A., Yao, S., Pei, K., Press, O., & Narasimhan, K. (2023). SWE-bench: Can Language Models Resolve Real-World GitHub Issues? Preprint. arXiv:2310.06770.
Geng, J., & Neubig, G. (2026). Effective Strategies for Asynchronous Software Engineering Agents. Preprint. arXiv:2603.21489.
Kim, J. (D.), Yang, W., Niu, K., Zhang, H., Zhu, Y., Helenowski, E., Silva, R., Chen, Z., Iyer, S., Fried, D., Synnaeve, G., Salakhutdinov, R., & Goyal, A. (2026). Scaling Test-Time Compute for Agentic Coding. Preprint, Meta Superintelligence Labs. arXiv:2604.16529.
Goswami, D. (2026). cl-agent: A Continual-Learning Substrate for Coding Agents — Episode Capture, Replay, and Rule-Based Distillation for Cross-Session Improvement Without Fine-Tuning. Independent Research preprint. PDF · github.com/dattgoswami/cl-agent
Bai, L., Huang, Z., Wang, X., Sun, J., Mihalcea, R., Brynjolfsson, E., Pentland, A., & Pei, J. (2026). How Do AI Agents Spend Your Money? Analyzing and Predicting Token Consumption in Agentic Coding Tasks. Preprint. arXiv:2604.22750.
Adams, C., Banga, A. S., Bansal, P., Bhattacharya, S., Cao, R., Cook, N., Ellis, B., Goyal, P., Grewal, G., Mockus, A., Rigby, P., & Nagappan, N. (Meta). (2026). Automating Low-Risk Code Review at Meta: RADAR, Risk Calibration, and Review Efficiency. Practitioner deployment report. arXiv:2605.30208.
Merrill, M. A., Shaw, A. G., Carlini, N., Li, B., Raj, H., Muennighoff, N., Konwinski, A., & Schmidt, L. (2026). Terminal-Bench: Benchmarking Agents on Hard, Realistic Tasks in Command Line Interfaces. Preprint. arXiv:2601.11868.
BUAA-SKLCCSE, Alibaba, ByteDance, Shanghai AI Lab, et al. (2025). From Code Foundation Models to Agents and Applications: A Comprehensive Survey and Practical Guide to Code Intelligence. Preprint. arXiv:2511.18538.
GLM-4.5 Team, Zhipu AI & Tsinghua University. (2025). GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models. Preprint. arXiv:2508.06471.
Imajuku, Y., Horie, K., Iwata, Y., Aoki, K., Takahashi, N., & Akiba, T. (2025). ALE-Bench: A Benchmark for Long-Horizon Objective-Driven Algorithm Engineering. Sakana AI. github.com/SakanaAI/ALE-Bench.
Kojic, M., Bondyrev, I., de Moor, A., Shtok, J., Borovlev, P., Lysaniuk, K., Kannan, M., Dolgov, I., & Pavlichenko, N. (2026). Mellum2 Technical Report. JetBrains. arXiv:2605.31268.
Zheng, K., Chambon, P., Decugis, J., Gehring, J., Cohen, T., Negrevergne, B., & Synnaeve, G. (2026). Extrapolative Weight Averaging Reveals Correctness-Efficiency Frontiers in Code RL. Preprint, Meta Superintelligence Labs & Université Paris-Dauphine. arXiv:2605.28751.
Holzmann, G. J. (2006). The Power of 10: Rules for Developing Safety-Critical Code. IEEE Computer, 39(6), 95–98. NASA/JPL Laboratory for Reliable Software.

10 Securing the Agentic Perimeter

Debenedetti, E., Zhang, J., Balunović, M., Beurer-Kellner, L., Fischer, M., & Tramèr, F. (2024). AgentDojo: A Dynamic Environment to Evaluate Prompt Injection Attacks and Defenses for LLM Agents. Preprint. arXiv:2406.13352.
Zhan, Q., Liang, Z., Ying, Z., & Kang, D. (2024). InjecAgent: Benchmarking Indirect Prompt Injections in Tool-Integrated Large Language Model Agents. Preprint. arXiv:2403.02691.
OWASP GenAI Security Project. (2025). Security Foundations for the OWASP Top 10 for Agentic Applications. Zenity (practitioner artifact).
Cheng, S., Tao, G., Liu, Y., An, S., Xu, X., Feng, S., Shen, G., Zhang, K., Xu, Q., Ma, S., & Zhang, X. (2023). BEAGLE: Forensics of Deep Learning Backdoor Attack for Better Defense. Network and Distributed System Security Symposium (NDSS).
Anthropic. (2026). System Card: Claude Opus 4.7. Anthropic (practitioner artifact).
Bai, Y., Kadavath, S., Kundu, S., Askell, A., Kernion, J., et al. (2022). Constitutional AI: Harmlessness from AI Feedback. Preprint. arXiv:2212.08073.
Li, C., Price, S., Marks, S., & Kutasov, J. (2026). Model Spec Midtraining: Improving How Alignment Training Generalizes. Preprint. arXiv:2605.02087.
Wallace, E., Xiao, K., Leike, R., Weng, L., Heidecke, J., & Beutel, A. (2024). The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions. Preprint. arXiv:2404.13208.
Zhang, H., & Guo, Y. (2025). Skill-based Safe Reinforcement Learning with Risk Planning. Preprint. arXiv:2505.01619.
Liu, W., Mou, X., Yan, H., Wei, Z., & He, Y. (2026). Large Language Models Hack Rewards, and Society. Preprint. arXiv:2606.04075.
Debenedetti, E., Shumailov, I., Fan, T., Hayes, J., Carlini, N., Fabian, D., Kern, C., Shi, C., Terzis, A., & Tramèr, F. (2025). Defeating Prompt Injections by Design (CaMeL). Preprint. arXiv:2503.18813.
Greenblatt, R., Shlegeris, B., Sachan, K., & Roger, F. (2024). AI Control: Improving Safety Despite Intentional Subversion. Proceedings of the 41st International Conference on Machine Learning (ICML). arXiv:2312.06942.
Bowman, S. R., Hyun, J., Perez, E., et al. (2022). Measuring Progress on Scalable Oversight for Large Language Models. Preprint. arXiv:2211.03540.
Kumar, A., Bahlous-Boldi, R., Sharma, P., Isola, P., Risi, S., Tang, Y., & Ha, D. (2026). Digital Red Queen: Adversarial Program Evolution in Core War with LLMs. Preprint. arXiv:2601.03335.

11 Agent Ops: Running Agents in Production

Nakajima, Y. (2026). The Log is the Agent: Event-Sourced Reactive Graphs for Auditable, Forkable Agentic Systems. Preprint. arXiv:2605.21997.
Bai, L., Huang, Z., Wang, X., Sun, J., Mihalcea, R., Brynjolfsson, E., Pentland, A., & Pei, J. (2026). How Do AI Agents Spend Your Money? Analyzing and Predicting Token Consumption in Agentic Coding Tasks. Preprint. arXiv:2604.22750.
Chen, L., Davis, J. Q., Hanin, B., Bailis, P., Stoica, I., Zaharia, M., & Zou, J. (2024). Are More LM Calls All You Need? Towards the Scaling Properties of Compound AI Systems. Preprint. arXiv:2403.02419.
Ong, I., Almahairi, A., Wu, V., Chiang, W.-L., Wu, T., Gonzalez, J. E., Kadous, M. W., & Stoica, I. (2024). RouteLLM: Learning to Route LLMs with Preference Data. Preprint. arXiv:2406.18665.
Chen, L., Zaharia, M., & Zou, J. (2023). FrugalGPT: How to Use Large Language Models While Reducing Cost and Improving Performance. Preprint. arXiv:2305.05176.
Shen, B., Jin, L., Cai, H., Hu, L., & Xin, Y. (2026). The Efficiency Frontier: A Unified Framework for Cost-Performance Optimization in LLM Context Management. Preprint. arXiv:2605.23071.
Zhang, H., Xu, Q., Li, Z., Zhang, L., Jiang, P., Zhang, Y., & McAuley, J. (2026). Masking Stale Observations Helps Search Agents – Until It Doesn't: A Regime Map and Its Mechanism. Preprint. arXiv:2606.00408.
Falconer, S. (2025). A Guide to Event-Driven Design for Agents and Multi-Agent Systems. Confluent (practitioner white paper).
Lyu, Y., Wang, C., Zheng, H., Yue, Y., Yan, J., Wang, M., & Huang, J. (2026). AgenticQwen: Training Small Agentic Language Models with Dual Data Flywheels for Industrial-Scale Tool Use. Preprint. arXiv:2604.21590.
Anthropic. (2026). Deploying Claude across your organization: A practical guide for deploying Claude across your business. Anthropic (practitioner guide).

12 Self-Improving Agents

Memento-Team. (2026). Memento-Skills: Let Agents Design Agents. Preprint. arXiv:2603.18743.
Wang, F. Y., & Buehler, M. J. (2026). Self-Revising Discovery Systems for Science: A Categorical Framework for Agentic Artificial Intelligence. Preprint. arXiv:2606.01444.
Yang, Y., Gong, Z., Huang, W., Yang, Q., et al. (2026). SkillOpt: Executive Strategy for Self-Evolving Agent Skills. Preprint. arXiv:2605.23904.
Shinn, N., Cassano, F., Berman, E., Gopinath, A., Narasimhan, K., & Yao, S. (2023). Reflexion: Language Agents with Verbal Reinforcement Learning. Advances in Neural Information Processing Systems (NeurIPS), 2023. arXiv:2303.11366.
Hebbar, P., Manawat, Y., Verboomen, S., Ivanova, A., et al. (2026). SIA: Self-Improving AI with Harness and Weight Updates. Preprint. arXiv:2605.27276.
Gandhi, A., Chakraborty, S., Wang, X., Kumar, A., & Neubig, G. (2026). Recursive Agent Optimization. Preprint. arXiv:2605.06639.
Wu, C. H., & Raghunathan, A. (2026). Self-Trained Verification for Training- and Test-Time Self-Improvement. Preprint. arXiv:2605.30290.
Dai, A., Meinardus, B., Regan, C., Tian, Y., & Tang, Y. (2026). Discovering Novel LLM Experts via Task-Capability Coevolution. International Conference on Learning Representations (ICLR), 2026. arXiv:2604.14969.
Zhang, J., Zhao, B., Yang, W., Foerster, J., Clune, J., Jiang, M., Devlin, S., & Shavrina, T. (2026). HyperAgents. Preprint. arXiv:2603.19461.
Stanley, K. O., & Miikkulainen, R. (2002). Evolving Neural Networks through Augmenting Topologies. Evolutionary Computation, 10(2), 99–127.
Pourcel, J., Colas, C., & Oudeyer, P.-Y. (2025). Self-Improving Language Models for Evolutionary Program Synthesis: A Case Study on ARC-AGI. Preprint. arXiv:2507.14172.
Kumar, A., Bahlous-Boldi, R., Sharma, P., Isola, P., Risi, S., Tang, Y., & Ha, D. (2026). Digital Red Queen: Adversarial Program Evolution in Core War with LLMs. Preprint. arXiv:2601.03335.
Meng, R., Dalvi Mishra, B., Chen, J., Li, C.-L., et al. (2026). ScientistOne: Towards Human-Level Autonomous Research via Chain-of-Evidence. Preprint. arXiv:2605.26340.