Software Engineering Agents: The Proving Ground

§1 · The drosophila of agents

Genetics did not make its fastest progress on humans, or on anything we particularly cared about. It made it on Drosophila melanogaster, the common fruit fly — because the fly was tractable. It was cheap to keep, it bred in days, you could see what happened to it, and its genome was small enough to actually map. A century of foundational biology came out of an organism nobody chose for its importance and everybody chose for its convenience. The science generalized; the fly was just where it was discovered.

Agentic AI has a drosophila, and it is software engineering. Three properties, rare to find together, made it the place agents grew up first.

Verifiable rewards. A patch either applies and passes the test suite or it does not. The reward is not a learned preference model, not a human rating, not a rubric: it is the exit code of pytest. You get a ground-truth signal, for free, on every attempt.

Abundant environments. Every public repository with a test suite is a ready-made world for an agent to act in. There are millions of them. Nobody had to build the environment; the global open-source commons already did, and keeps doing it.

Expert oversight. The people who can tell whether the output is good (engineers) are the same people building the agents. The loop between "the agent did something" and "an expert caught that it was wrong" is as short as it gets.
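To make the verifiable-reward property concrete, here is a minimal sketch of a reward that is nothing more than an exit code. The function names `run_reward` and `pytest_reward` and the timeout value are assumptions made for illustration; the only behavior relied on is that pytest exits with code 0 when every collected test passes.

```python
import subprocess
import sys
from pathlib import Path


def run_reward(command: list[str], repo_dir: Path, timeout_s: float = 600.0) -> float:
    """Return 1.0 if the command exits with code 0 inside repo_dir, else 0.0."""
    try:
        result = subprocess.run(
            command,
            cwd=repo_dir,
            capture_output=True,  # keep test output out of the caller's terminal
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return 0.0  # a hung test suite is scored as a failure
    return 1.0 if result.returncode == 0 else 0.0


def pytest_reward(repo_dir: Path) -> float:
    """Reward for one attempt: did the repository's own test suite pass?"""
    return run_reward([sys.executable, "-m", "pytest", "-q"], repo_dir)
```

Note that this sketch scores a repository with no collected tests as a failure, because pytest reports that case with a nonzero exit code. Whether that is the right choice depends on the benchmark.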

Most domains have at most one of these. Law has expert oversight but no cheap verifier and few clean environments. Open-ended scientific research has none of the three. Medicine has expensive verification and almost no abundant, safe environment to practice in. Software has all three at once — and that is the entire reason the first agents that did real economic work were coding agents, and the reason the sharpest empirical findings about how agents behave were discovered here before anywhere else.

The benchmark that crystallized this is worth re-reading for its shape, not its scores. Jimenez et al. (2023) built SWE-bench from 2,294 real issues drawn from the pull-request histories of 12 popular Python repositories: hand the agent a codebase and an issue, let it edit the code, and let the repository's own tests judge the result. The contribution that matters here is not any number on any leaderboard — those belong to the evaluation chapter [← 2]. It is that someone showed you could assemble a benchmark out of tasks that are simultaneously real, verifiable, and abundant. That combination is the fly.

So here is the thesis of this article, stated plainly enough to be wrong. The SWE-agent platform layer is settled; the open frontier is economics, long-horizon coordination, and deployment evidence at scale; and the lessons learned here transfer out to every domain that industrializes agents next. If the hard problems of agentic software engineering turned out to be idiosyncratic to code, if nothing discovered on the fly generalized to other organisms, this article would be wrong. The rest of it makes the case that the findings do generalize.

One notation, fixed once and reused throughout the series. We write it out because the SWE setting gives each symbol a concrete referent for the first time.

M: the model, its weights and forward pass [← 1]

H: the harness, meaning the prompts, tools, memory, and verification around M

E: the environment; here, a repo, its test suite, and a sandboxed runtime

τ: a trajectory, one multi-step episode of acting on E

r: the reward; here, simply: do the tests pass?

Σ: the skill / rule store, durable memory across episodes

Key Takeaway 1

Software engineering is the model organism of agentic AI: it is the rare domain that supplies verifiable rewards (tests), abundant environments (every repo), and expert oversight (engineers build the agents) all at once. SWE-bench (Jimenez et al., 2023) matters not for its leaderboard but for its shape: proof that real, verifiable, abundant tasks exist. Read this article as findings about agents that were merely discovered in code first.

§2 · The platform layer is settled

The first thing a maturing engineering field does is stop rebuilding its own foundation. In 2023, every team building a coding agent wrote its own agent loop, its own sandbox, its own file-editing tool, its own way of feeding terminal output back to the model. By 2025 they had mostly stopped — not because the problem got boring, but because the substrate got standard.

OpenHands (Wang et al., 2024) is the clearest marker of that transition: a shared platform where an agent perceives an environment, takes actions through a fixed interface — a code editor, a bash shell, a browser — and observes the results, all inside a sandboxed runtime that can actually execute code. The significance is not any single feature. It is that the interface stopped being a research question. The action space, the runtime isolation, the perceive-act-observe loop — these became infrastructure you import rather than decisions you relitigate per project. That is what "settled" means: not finished, but agreed upon.
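The perceive-act-observe loop described above can be sketched in a few lines. This is an illustrative toy, not the OpenHands API: `Action`, `ToyEnvironment`, `run_episode`, and the tool names `"edit"` and `"finish"` are all hypothetical, and the "sandbox" is only an in-memory file map.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass(frozen=True)
class Action:
    tool: str       # hypothetical tool names: "edit" or "finish"
    argument: str


@dataclass
class ToyEnvironment:
    """Stands in for a sandboxed runtime: here, just an in-memory file map."""
    files: dict[str, str] = field(default_factory=dict)

    def step(self, action: Action) -> str:
        if action.tool == "edit":
            path, _, content = action.argument.partition("=")
            self.files[path] = content
            return f"wrote {path}"
        return f"unknown tool: {action.tool}"


Policy = Callable[[list[str]], Action]  # observations so far -> next action


def run_episode(env: ToyEnvironment, policy: Policy, task: str,
                max_steps: int = 10) -> list[tuple[Action, str]]:
    """Perceive-act-observe until the policy finishes or the step cap is hit."""
    observations = [task]
    trajectory: list[tuple[Action, str]] = []
    for _ in range(max_steps):
        action = policy(observations)          # act on what has been perceived
        if action.tool == "finish":
            break
        observation = env.step(action)         # the environment executes it
        trajectory.append((action, observation))
        observations.append(observation)       # feed the result back
    return trajectory


def scripted_policy(observations: list[str]) -> Action:
    """A stand-in for the model M: edit once, then finish."""
    if len(observations) == 1:
        return Action("edit", "hello.txt=hi")
    return Action("finish", "")
```

The point of the sketch is the shape: the policy is the only swappable part, while the action type, the environment's `step`, and the loop stay fixed.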

Look at the stack and the claim sharpens (see Figure 2). The platform layer is the bottom: the environment E, the action interface, and the agent loop. It is stable across teams. What sits on top — which model you drop in, how you coordinate multiple agents, what you are willing to spend — is exactly where the open questions live. The 2025 code-intelligence survey, a synthesis spanning the full model lifecycle from data curation through pre-training, fine-tuning, reinforcement learning, and autonomous agents, names the gap that organizes the rest of this article: a research–practice gap between what benchmarks measure and what deployment actually demands. The platform is settled. The practice is wide open.

State the transferable lesson at altitude, because it is the one that predicts the future of every other domain: the platform layer commoditizes first. Once the action interface is standard, competition moves up the stack — to model quality, to coordination, to cost. The earliest reliable sign that an agentic domain is maturing is mundane and unmistakable: people stop writing their own harness and start arguing about what runs on the shared one. We argued in B1 that the harness is the product; a settled platform is simply the harness a whole field agreed to share.

Figure 2: The settled platform vs the open frontier. The bottom of the stack — environment, action interface, agent loop (OpenHands; Wang et al., 2024) — is stable across teams; the model is now swappable. Why it matters: a domain matures when its platform commoditizes and competition moves up-stack to cost, coordination, and deployment — exactly the gap the 2025 code-intelligence survey names.

Key Takeaway 2

The SWE-agent platform layer is settled: OpenHands (Wang et al., 2024) made the environment, action interface, and agent loop shared infrastructure you import rather than rebuild. The general law is that the platform commoditizes first; competition then moves up-stack to model, coordination, and cost. The first sign any agentic domain is maturing is that teams stop writing their own harness — and the 2025 code-intelligence survey names the resulting research–practice gap as the open territory.

§3 · Long-horizon work needs coordination and memory

A single GitHub issue is the easy case. Real software work is long-horizon: many interdependent subtasks spread across files, sessions, and days. Two problems appear at length that simply do not exist in a one-shot patch — coordination and memory — and, true to form, SWE agents ran into both first and produced the first credible answers.

Coordination, the hard way and the human way

The naive route to long-horizon work is to run many agents in parallel. It fails for a mechanical reason anyone who has tried it knows: concurrent edits collide, dependencies fall out of sync, and partial results refuse to merge into a coherent whole. Geng and Neubig (2026) observe that human teams solved this problem decades ago, with version control, task delegation, and isolated branches, and they encode those primitives directly as CAID (Centralized Asynchronous Isolated Delegation): a central manager builds a dependency-aware plan, delegates subtasks to agents working asynchronously in isolated workspaces, and consolidates their progress through structured integration that the test suite gates (see Figure 3). The lesson is not "adopt CAID." It is that long-horizon multi-agent software work needs the same coordination scaffolding human teams needed, and the verifiable reward is what makes the eventual merge safe rather than hopeful.
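The three primitives named above (a dependency-aware plan, isolated workspaces, and test-gated integration) can be sketched as follows. This is not the CAID implementation: it runs subtasks sequentially for clarity where the paper describes asynchronous workers, and `run_plan`, `Worker`, `Gate`, and the demo subtask names are hypothetical.

```python
from graphlib import TopologicalSorter
from typing import Callable

Workspace = dict[str, str]                       # path -> file content
Worker = Callable[[str, Workspace], Workspace]   # (subtask, isolated copy) -> edited copy
Gate = Callable[[Workspace], bool]               # stands in for running the test suite


def run_plan(dependencies: dict[str, set[str]], main: Workspace,
             worker: Worker, gate: Gate) -> tuple[Workspace, list[str]]:
    """Run subtasks in dependency order; merge a result only if the gate passes."""
    rejected: list[str] = []
    for subtask in TopologicalSorter(dependencies).static_order():
        isolated = dict(main)            # each worker edits its own copy
        candidate = worker(subtask, isolated)
        if gate(candidate):
            main = candidate             # integration accepted
        else:
            rejected.append(subtask)     # a real manager would re-plan here
    return main, rejected


def demo_worker(subtask: str, workspace: Workspace) -> Workspace:
    # Pretend the "cache" subtask produces a broken file.
    workspace[f"{subtask}.py"] = "broken" if subtask == "cache" else "ok"
    return workspace


def demo_gate(workspace: Workspace) -> bool:
    return "broken" not in workspace.values()


# Each key depends on the subtasks in its value set.
plan = {"models": set(), "api": {"models"}, "cache": {"models"}}
merged, rejected = run_plan(plan, {}, demo_worker, demo_gate)
```

Because the broken candidate never reaches `main`, later subtasks start from a workspace that still passes the gate. That is the sense in which the verifier, rather than trust in any one worker, keeps the merge safe.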

Compute as compressed experience

Kim et al. (2026), at Meta Superintelligence Labs, attack length from a different angle. Standard test-time scaling assumes short, comparable outputs you can rank or refine. A long-horizon coding agent violates that premise: each attempt is a sprawling trajectory τ of actions, observations, errors, and half-finished progress. Their insight is that the bottleneck is no longer generating more attempts — it is representing prior attempts compactly enough to reuse. They compress each rollout into a structured summary of its hypotheses, progress, and failure modes, then scale two ways: Recursive Tournament Voting narrows a population of parallel rollout summaries through small-group comparisons, and Parallel-Distill-Refine conditions fresh attempts on summaries distilled from prior ones. The transferable idea outlives the method: at long horizons, the scarce resource is a good representation of experience, not more of it.
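The narrowing step can be illustrated with a generic tournament over rollout summaries. This is a sketch of the idea only, assuming a judge that picks one winner per small group; it is not the authors' Recursive Tournament Voting procedure, and `tournament_select`, `Judge`, and the toy judge are hypothetical.

```python
from typing import Callable, Sequence

Judge = Callable[[Sequence[str]], int]  # returns the index of the best summary in a group


def tournament_select(summaries: Sequence[str], judge: Judge, group_size: int = 3) -> str:
    """Repeatedly keep one winner per small group until a single summary remains."""
    if not summaries:
        raise ValueError("need at least one summary")
    if group_size < 2:
        raise ValueError("group_size must be at least 2")
    survivors = list(summaries)
    while len(survivors) > 1:
        groups = [survivors[i:i + group_size]
                  for i in range(0, len(survivors), group_size)]
        survivors = [group[judge(group)] for group in groups]
    return survivors[0]


def toy_judge(group: Sequence[str]) -> int:
    """Toy stand-in for a model comparison: prefer the summary reporting more passes."""
    return max(range(len(group)), key=lambda i: group[i].count("pass"))


rollout_summaries = ["fail fail", "pass fail", "pass pass pass", "pass", "fail"]
best = tournament_select(rollout_summaries, toy_judge)
```

The judge only ever sees a handful of short summaries at once, which is why compact representations of long trajectories are the prerequisite for this kind of scaling.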

Memory across the session boundary

Even with coordination and compute, an agent that forgets everything between runs re-derives the repository from scratch every time. The cl-agent substrate (Goswami, 2026) treats this as a continual-learning problem [← 4]: capture each coding episode as append-only records, replay the relevant past failures and fixes before a new run, and distill durable rules into inspectable artifacts — all without touching model weights, writing into Σ rather than into θ. Keep it in proportion: the point for this article is not the specific design but why software was where cross-session memory got a concrete substrate at all. SWE has a clean episode boundary — an issue opens, a pull request closes it — that makes "capture and replay" a well-defined operation. Domains without that natural boundary will have to manufacture one.
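A minimal sketch of the capture, replay, and distill cycle, writing into a store that plays the role of Σ. It is not the cl-agent codebase: `EpisodeRecord`, `EpisodeLog`, the file-overlap relevance test, and the repeat-count distillation rule are assumptions chosen to keep the example small.

```python
import json
from collections import Counter
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class EpisodeRecord:
    issue: str
    files: tuple[str, ...]
    outcome: str   # "pass" or "fail"
    lesson: str


class EpisodeLog:
    """Append-only store: records are serialized once and never rewritten."""

    def __init__(self) -> None:
        self._lines: list[str] = []

    def append(self, record: EpisodeRecord) -> None:
        self._lines.append(json.dumps(asdict(record)))

    def replay(self, files: set[str]) -> list[str]:
        """Lessons from past episodes that touched any of the given files."""
        lessons = []
        for line in self._lines:
            record = json.loads(line)
            if files & set(record["files"]):
                lessons.append(record["lesson"])
        return lessons

    def distill(self, min_count: int = 2) -> list[str]:
        """Promote lessons seen at least min_count times to durable rules."""
        counts = Counter(json.loads(line)["lesson"] for line in self._lines)
        return sorted(lesson for lesson, n in counts.items() if n >= min_count)


log = EpisodeLog()
log.append(EpisodeRecord("issue-1", ("db.py",), "fail", "run migrations before tests"))
log.append(EpisodeRecord("issue-2", ("api.py",), "pass", "pin the client version"))
log.append(EpisodeRecord("issue-3", ("db.py", "api.py"), "fail", "run migrations before tests"))
```

Everything here is inspectable text. Nothing touches model weights, which is the property the paragraph above emphasizes.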

Figure 3: Coordination for long-horizon work (CAID; Geng & Neubig, 2026). A central manager delegates to async agents in isolated workspaces; integration is gated by the test suite, and a failure re-plans. Why it matters: long-horizon agent work needs the scaffolding human teams already built — and the verifiable reward r is what makes the merge safe. Test-time scaling (Kim et al., 2026) adds reuse of compressed prior experience.

Key Takeaway 3

Length, not difficulty, is what breaks single-shot agents — and SWE solved it first along three axes. Coordination: CAID (Geng & Neubig, 2026) borrows human collaboration primitives — delegation, isolation, test-gated integration. Compute: Kim et al. (2026) show the bottleneck at length is a compact representation of prior experience (RTV, PDR), not more attempts. Memory: cl-agent (Goswami, 2026) writes durable rules into Σ across the clean issue→PR boundary [← 4]. Other domains must manufacture the boundary SWE got for free.

§4 · The economics nobody benchmarks

Here is the result that should change how you build, and almost no leaderboard reports it. Bai et al. (2026) ran the first systematic study of where agents spend tokens — eight frontier models on SWE-bench Verified — and the headline is a cost curve, not an accuracy curve (see Figure 4). Average cost per task came out to $0.016 for single-turn code reasoning, $0.023 for multi-turn code chat, and $1.857 for a tool-using agentic coding loop. That is more than two orders of magnitude — roughly 116× — separating the cheapest and most expensive way to apply the same underlying model to a coding problem. The cost is bought almost entirely by wrapping the model in an agent. An agentic task averaged on the order of 17 million tokens.
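The multiplier is simple arithmetic on the three per-task averages quoted above, and it is worth checking by hand. The snippet below uses only those reported figures; the dictionary keys are shorthand labels, not the paper's terminology.

```python
import math

# Per-task averages as quoted in the text (dollars).
cost_per_task = {
    "single-turn reasoning": 0.016,
    "multi-turn chat": 0.023,
    "agentic loop": 1.857,
}

baseline = cost_per_task["single-turn reasoning"]
ratios = {mode: cost / baseline for mode, cost in cost_per_task.items()}

agentic_ratio = ratios["agentic loop"]
orders_of_magnitude = math.log10(agentic_ratio)

for mode, ratio in ratios.items():
    print(f"{mode:>22}: {ratio:7.1f}x the single-turn cost")
print(f"agentic vs single-turn: 10^{orders_of_magnitude:.2f}")
```

Going multi-turn adds well under 2× to the single-turn cost, so nearly all of the gap opens at the step where tools and an agent loop are introduced.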

And the spread across models was wide: some consumed, on average, over 1.5 million more tokens per task than the most efficient model in the study. Two models of comparable capability can differ enormously in what they cost to run as agents — which means the question "which model is best?" is incomplete until you ask "at what price?"

Two consequences follow, and they are the actionable core of this section. First, cost is a first-class product metric, not an afterthought. The harness change that doubles your pass rate may also multiply your bill by a hundred, and whether that trade is worth making is a business question no accuracy benchmark can answer for you. Second, and this is the hopeful part, Bai et al. found that token usage is, in meaningful part, predictable before a task is run. Predictability is what makes budgeting possible: send tasks that a single turn can handle down the cheap path, and reserve the expensive agentic loop for the tasks that actually warrant it.
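A budget-aware router is one way to act on predictable cost. The sketch below assumes two inputs that the source does not specify: a flag saying whether the task needs tools, and a predicted agentic cost in dollars from some upstream predictor. `choose_route`, `Route`, and the `"needs_approval"` outcome are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    path: str             # "single_turn", "agentic", or "needs_approval"
    expected_cost: float  # dollars per task


def choose_route(needs_tools: bool, predicted_agentic_cost: float,
                 budget_per_task: float, single_turn_cost: float = 0.016) -> Route:
    """Cheap path by default; agentic loop only when warranted and affordable."""
    if not needs_tools:
        return Route("single_turn", single_turn_cost)
    if predicted_agentic_cost <= budget_per_task:
        return Route("agentic", predicted_agentic_cost)
    # Over budget: surface the decision instead of silently spending.
    return Route("needs_approval", predicted_agentic_cost)
```

The default `single_turn_cost` reuses the single-turn average quoted in this section. In practice both it and the budget would be measured per deployment.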

The transferable lesson is blunt. The moment a capability becomes an agent loop, its unit economics change by orders of magnitude — and the discipline that survives contact with a finance team is metering. Measure dollars-per-task before you celebrate win-rate, because a win you cannot afford to serve is not a win. We pick this thread up in production [→ 11].

[→ 11] Agent Ops: Running Agents in Production — cost curves stop being a study and become a billing line: budgeting, routing, and the $/task you actually pay when the agent runs all day.

Figure 4: The cost curve of agency (Bai et al., 2026; eight frontier models on SWE-bench Verified). Per-task spend rises from $0.016 (single-turn) to $0.023 (multi-turn) to $1.857 for an agentic loop — over two orders of magnitude for the same model, averaging ~17M tokens. Why it matters: agency is the cost multiplier, not the model; metering $/task is the discipline that scales. Values are the paper's reported per-task averages; the log axis mirrors the source figure.

Key Takeaway 4

Agency is a cost multiplier of orders of magnitude, and it is the metric leaderboards skip. Bai et al. (2026) measured per-task spend climbing from $0.016 (single-turn) to $1.857 (agentic), over 100× for the same model at roughly 17M tokens per task. Some models consumed over 1.5M more tokens per task than the most efficient one, and token use was partly predictable before execution. The rule: meter $/task before win-rate, then route cheap-by-default and pay for the agentic loop only where the task earns it.

§5 · Deployment evidence, at the scale of a company

Benchmarks ask "can it?" Deployment asks "at what scale, and who reviews the output?" The most honest data we have on the second question comes from inside Meta, and it is worth treating with the care its provenance demands: it is a practitioner deployment report, not a peer-reviewed finding, and the numbers describe one company's production system rather than a controlled experiment.

With that caveat stated, the numbers are striking. Meta's RADAR report (2026) records that significant lines of code per human-landed diff grew by 105.9% year over year, per-developer diff volume rose 51%, and over 80% of that growth was attributable to agentic AI. The bottleneck moved, visibly and measurably: code supply outran reviewer bandwidth, and the share of diffs receiving timely human review fell. This is what the proving ground looks like when it stops being a benchmark and becomes a workplace — the constraint is no longer whether an agent can write the code, but whether anyone can review it fast enough.

RADAR — Risk Aware Diff Auto Review — is the response, and its shape is the transferable lesson (see Figure 5). It does not try to automate review wholesale. It risk-stratifies: a multi-stage funnel classifies each diff by authorship and source, applies eligibility gates and static heuristics, and computes a machine-learned Diff Risk Score. Low-risk diffs flow to automated review; higher-risk diffs are routed to human attention. The crucial engineering detail is that the risk threshold is a tunable, not a constant — turn it up to harvest more automation, down to buy more safety — and they treat that dial as something you calibrate against measured outcomes, not a value you hard-code once. Mellum2 (Kojic et al., 2026, JetBrains) shows the other deployment surface entirely: an open-weight 12B Mixture-of-Experts model with 2.5B active parameters, tuned for IDE-native, low-latency assistance with speculative decoding. That is the agent that lives in the editor, beside the developer, rather than in the CI cluster behind a risk score.
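The "threshold as a dial" idea can be shown with a toy router. This is a sketch under stated assumptions, not Meta's system: the `Diff` fields, the single eligibility gate, the sample scores, and the function names are invented for illustration, and the learned Diff Risk Score is replaced by a hard-coded number.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Diff:
    diff_id: str
    touches_sensitive_path: bool  # hypothetical eligibility gate
    risk_score: float             # in [0, 1]; a learned model would supply this


def route_diff(diff: Diff, threshold: float) -> str:
    """Eligibility gate first, then compare the risk score to the dial."""
    if diff.touches_sensitive_path:
        return "human"
    return "automated" if diff.risk_score < threshold else "human"


def automation_rate(diffs: Sequence[Diff], threshold: float) -> float:
    """Share of diffs that would skip human review at this threshold."""
    if not diffs:
        return 0.0
    automated = sum(1 for d in diffs if route_diff(d, threshold) == "automated")
    return automated / len(diffs)


SAMPLE = [
    Diff("d1", False, 0.05),
    Diff("d2", False, 0.30),
    Diff("d3", True, 0.01),   # low score, but gated to a human regardless
    Diff("d4", False, 0.80),
]

for threshold in (0.1, 0.5, 0.9):
    print(threshold, automation_rate(SAMPLE, threshold))
```

Raising the threshold raises the automation rate, and the eligibility gate caps it. Calibration then means choosing the threshold from measured outcomes, such as how often automated diffs later needed a fix, rather than from intuition.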

State the lesson at altitude: at scale, the binding constraint is not generation, it is review — and the pattern that works is risk-calibrated automation, handling the low-risk mass automatically and reserving scarce expert attention for the consequential tail. Every domain that deploys agents in volume will hit the same review wall, and will need the same dial.

Figure 5: Risk-stratified review at scale (RADAR; Meta, 2026, practitioner report). A funnel scores each diff and splits low-risk (automated) from higher-risk (human). Why it matters: when agentic AI drove >80% of a 105.9% YoY jump in code volume, the bottleneck became review, not generation — and the working answer is a calibrated risk dial, not blanket automation.

Key Takeaway 5

Deployment moves the bottleneck from generation to review. Meta's RADAR report (2026) — a practitioner source — records lines per human-landed diff up 105.9% YoY with >80% from agentic AI, and answers it not by automating review wholesale but by risk-stratifying it behind a tunable threshold. Mellum2 (Kojic et al., 2026) covers the IDE-native surface. The transferable pattern for every scaling domain: auto-handle the low-risk mass, reserve experts for the tail, and make the risk threshold a dial you calibrate.

§6 · Training the coder against a reward that means something

Where does the capability come from in the first place? Increasingly, from reinforcement learning against the verifiable reward that software hands you for free; what RL does and does not teach a model is a subject the published series took head-on [← A2]. But RL on code carries a subtlety that generalizes well beyond code: correctness is not the only axis you care about, and if your reward only measures one axis, that is the only axis you get.

[← 3] Environments Are the Bottleneck — the verifiable reward is an environment, and environments are the scarce input RL is bottlenecked on; the coder is only ever as good as the signal it trains against.

Zheng et al. (2026), at Meta Superintelligence Labs, make this concrete in competitive programming, where hidden unit tests enforce both functional correctness and computational efficiency under time and memory limits. Training checkpoints under nested test-coverage rewards surfaces a correctness–efficiency frontier — and the elegant part is what you can do with it without any further training. Linear interpolation between a low-coverage and a high-coverage checkpoint traces the frontier; weight extrapolation beyond the trained endpoints extends it to new operating points (see Figure 6). The effect holds across three inference settings — pure reasoning, tool use, and agentic coding — and across two model scales, 32B and 7B. You can pick where on the correctness-versus-efficiency trade you want to sit, at deployment, by averaging weights rather than retraining.
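Interpolation and extrapolation are the same formula with different coefficients: θ(α) = θ_low + α(θ_high − θ_low), where α in [0, 1] interpolates and α outside that range extrapolates. The sketch below applies it to plain Python lists. `blend` and the tiny weight dictionaries are hypothetical, and real checkpoints would be framework tensors rather than lists.

```python
Weights = dict[str, list[float]]  # parameter name -> flattened values


def blend(low: Weights, high: Weights, alpha: float) -> Weights:
    """Return low + alpha * (high - low), parameter by parameter.

    alpha in [0, 1] interpolates between the two checkpoints;
    alpha < 0 or alpha > 1 extrapolates past an endpoint.
    """
    if low.keys() != high.keys():
        raise ValueError("checkpoints must have the same parameter names")
    blended: Weights = {}
    for name in low:
        if len(low[name]) != len(high[name]):
            raise ValueError(f"shape mismatch for {name}")
        blended[name] = [a + alpha * (b - a) for a, b in zip(low[name], high[name])]
    return blended


low_coverage = {"w": [0.0, 2.0]}
high_coverage = {"w": [1.0, 4.0]}

midpoint = blend(low_coverage, high_coverage, 0.5)   # between the checkpoints
beyond = blend(low_coverage, high_coverage, 1.5)     # past the high-coverage one
```

No gradient step is involved. Choosing α is a deployment-time decision, which is what makes the trade between correctness and efficiency cheap to revisit.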

The settled-platform thesis reaches the model layer too. GLM-4.5 (GLM-4.5 Team, 2025) is an open-weight Mixture-of-Experts model — 355B total parameters, 32B active, trained on 23T tokens — built deliberately for agentic, reasoning, and coding work, and frontier-competitive with substantially fewer active parameters than its rivals. The specific scores that rank it belong to the evaluation chapter [← 2]; what matters here is the trajectory. Open-weight models trained specifically for the agentic-coding loop now sit near the frontier — which means the model term in the capability product, like the platform term before it, is on its way to becoming a commodity you select rather than a moat you build.

The transferable lesson is a discipline about reward design: when your domain has a verifiable reward, train against it directly — and make the reward capture every axis you actually care about, because the optimizer will give you exactly what you measured and nothing you forgot to. Weight-space techniques then let you trade between those axes at deployment without paying to retrain.

Figure 6: The correctness–efficiency frontier from weight averaging (Zheng et al., 2026). Interpolating between checkpoints trained under low- and high-coverage test rewards traces the frontier; extrapolation extends it past the trained endpoints — no extra RL. Why it matters: design the reward for every axis you care about, then trade between them at deployment by averaging weights. Axes are qualitative; the shape redraws the paper's reported frontier.

Key Takeaway 6

A verifiable reward lets you train the coder directly — but design it for every axis you care about. Zheng et al. (2026) show that nested correctness/efficiency rewards surface a frontier, and that weight interpolation traces it while extrapolation extends it — across 32B and 7B, with no extra training. GLM-4.5 (GLM-4.5 Team, 2025), an open-weight 355B/32B-active MoE built for agentic coding, shows the model term commoditizing too. The rule: the optimizer returns exactly what you measure, so measure correctness and cost [← A2].

§7 · Discipline for machine coders

Now the angle nobody covers. Software engineering spent decades learning to write code that must not fail — flight control, spacecraft, medical devices — and it codified what it learned. The cleanest codification is Holzmann's Power of Ten (2006), ten rules for safety-critical code from NASA/JPL: keep control flow simple, avoid recursion, give every loop a fixed upper bound, check the return value of every function, keep functions short and single-purpose, compile with all warnings treated as errors and run static analysis continuously. These are practitioner rules, written for human engineers. Read them again on the assumption that the coder is an agent, and they stop being style guidance and become a harness specification (see Figure 7).

Walk a few of them. "Give every loop a fixed upper bound" is, for an agent, a hard cap on the number of action-steps it may take on a task — the very budget §4 says your economics require. "Check the return value of every function" becomes "verify every tool result before you act on it" — the direct antidote to the cascading errors that chained agents suffer [← 8]. "Keep functions short and scoped" becomes "emit small, reviewable diffs" — which is precisely what makes RADAR's risk-stratified review tractable in the first place. And "use static analysis continuously, treat all warnings as errors" is nothing less than the verifiable reward gate r that the agent's work must pass before it counts. A document written in 2006 to constrain human programmers turns out to specify, almost line for line, what a disciplined agent harness needs to be.
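Three of these translations fit in a short sketch: a fixed bound on action-steps, a check on every tool result, and a diff-size gate. The names (`run_with_gates`, `ToolResult`, `changed_lines`) and the exception types are hypothetical, and the diff counter is a rough approximation rather than a full unified-diff parser.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


class StepBudgetExceeded(RuntimeError):
    """Raised when a task needs more action-steps than the cap allows."""


class ToolFailure(RuntimeError):
    """Raised when a tool result fails verification."""


@dataclass(frozen=True)
class ToolResult:
    ok: bool
    output: str


def run_with_gates(steps: Sequence[Callable[[], ToolResult]], max_steps: int) -> list[str]:
    """Bounded loop plus a check on every return value."""
    outputs: list[str] = []
    for index, step in enumerate(steps):
        if index >= max_steps:                 # rule: fixed upper bound on the loop
            raise StepBudgetExceeded(f"cap of {max_steps} steps reached")
        result = step()
        if not result.ok:                      # rule: check every return value
            raise ToolFailure(result.output)
        outputs.append(result.output)
    return outputs


def changed_lines(unified_diff: str) -> int:
    """Count added and removed lines, skipping the '+++' and '---' file headers.

    Approximation: a changed line whose own text begins with '++' or '--'
    would be miscounted as a header.
    """
    return sum(
        1
        for line in unified_diff.splitlines()
        if line.startswith(("+", "-")) and not line.startswith(("+++", "---"))
    )


def diff_within_limit(unified_diff: str, max_changed: int) -> bool:
    """Rule: keep units short, so reject diffs too large to review."""
    return changed_lines(unified_diff) <= max_changed
```

Failing loudly on a bad tool result is the design choice that matters here: the alternative, continuing on unverified output, is how one early error cascades through a chain of steps.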

Then there is the question mixed authorship forces. When a single diff is part-human and part-agent, "who reviewed this?" loses its clean answer — you cannot assume the author understood the code, because one of the authors was a model. The discipline that survives is exactly the one RADAR encodes: classify by authorship, gate by risk, and make the verifier, not the author, the thing you trust. Code review, when the coder is an agent, is no longer one human reading another's intent. It is a calibrated machine deciding what a human must still personally see — and being honest about its own residual error.

The transferable lesson is the one most teams will need and fewest will reach for: every safety-critical domain already has its Power of Ten — its hard-won rules for work that must not fail. When you put an agent into that domain, those rules are not a relic to modernize past. They are your harness specification, pre-written. Translate them, do not discard them.

Power-of-Ten rule (Holzmann, 2006)	Intent for a human coder	Meaning when the coder is an agent	Harness gate
Fixed upper bound on every loop	prevent runaway execution	a hard cap on action-steps per task — the budget §4 requires	✅ step limit
No recursion; simple control flow	keep the call graph analyzable	no unbounded self-invocation; bounded sub-agent depth	✅ depth cap
Check every return value	no silent failures	verify every tool result before acting — anti-cascade [← 8]	⚠️ verifier coverage
Short, single-purpose functions	reviewable units	small, scoped diffs → tractable risk-stratified review (§5)	✅ diff-size gate
All warnings as errors; static analysis always on	catch defects before runtime	tests + linters are the verifiable reward `r` the agent must pass	✅ CI gate

Figure 7: Safety-critical rules, re-read as a harness spec (Power of Ten; Holzmann, 2006, practitioner standard). Each rule written for human coders maps onto a concrete agent-harness control. Why it matters: the discipline a domain already built for code that must not fail is, almost verbatim, the specification for a disciplined agent in that domain — and one row (verify every result) remains only partially solved.

Key Takeaway 7

The discipline nobody ports: a domain's safety-critical coding rules become the agent's harness spec. Holzmann's Power of Ten (2006) maps almost line-for-line onto agent controls — bounded loops → step caps, check-every-return → verify-every-tool-result, short functions → small reviewable diffs, static-analysis-always → the verifiable reward gate. Mixed authorship breaks "who reviewed this," and the surviving answer is RADAR's: trust the verifier, not the author. Every safety-critical field already wrote your harness spec — translate it.

§8 · The transfer playbook

Now pull it together into the artifact that justifies reading this chapter even alone. The discoveries above are not facts about software. They are facts about agents that software happened to expose first, because software was the tractable organism. Each one predicts something specific about the domains industrializing agents next — legal, finance, operations — and the prediction is concrete enough to act on (see Figure 8).

The benchmark canon already shows the leading edge moving outward. ALE-Bench (Imajuku et al., 2025) pushes evaluation from pass/fail coding toward long-horizon, score-based optimization — routing, scheduling, planning — where there is no known exact answer and the agent must iteratively refine. Terminal-Bench (Merrill, Shaw et al., 2026) stretches it toward open-ended, realistic command-line work. Neither is here as a leaderboard; they are here as the shape of what "harder and more realistic" looks like, and the next domains will resemble them more than they resemble a single clean GitHub issue.

So here is the deployable playbook — the six moves that the SWE proving ground predicts for any domain about to put agents to work.

SWE-agent lesson	Legal agents	Finance agents	Ops / SRE agents
Verifiable reward gate §1 — tests decide	citation & clause validators	reconciliation & back-tests	synthetic transactions & health checks
Platform commoditizes first §2 — buy, don't build	shared case/research substrate	shared market-data + execution harness	shared runbook / automation platform
$/task is the product metric §4 — meter first	cost per matter	cost per decision	cost per incident
Risk-stratified review §5 — calibrate the dial	auto-file routine, human-gate novel	auto-clear small trades, gate large	auto-remediate low-sev, page on high-sev
Length needs memory + coordination §3 — manufacture the boundary	matter-level episode memory	portfolio state across sessions	incident-history replay
Translate the Power of Ten §7 — rules become harness	bar rules & privilege scoping	audit & compliance controls	change-management / SRE discipline

Figure 8: The transfer playbook — the deployable artifact. Each row is a lesson the SWE proving ground established; each column is its concrete translation to a domain industrializing agents next. Why it matters: the hard part everywhere is row one — a cheap verifiable reward — so domains that can manufacture one will move first, and the cost curve (§4) sets how fast.

Key Takeaway 8 · the playbook

Run these six moves, in order, for any domain putting agents to work:

1. Find or manufacture the verifiable reward. SWE had pytest for free; most domains must build the verifier first. No verifier, no agent.
2. Buy the platform, don't build it. The action interface commoditizes first (§2); spend your effort up-stack.
3. Meter $/task before win-rate. Agentic loops cost ~100× their single-turn equivalent (§4); know the unit economics before you scale.
4. Risk-stratify review (§5): auto-handle the low-risk mass, route the tail to experts, make the threshold a calibrated dial.
5. Engineer memory and coordination for length (§3): clean episode boundaries enable capture/replay; interdependent subtasks need delegation with a verification gate.
6. Translate the domain's Power of Ten (§7): its safety-critical rules are your harness spec, pre-written.

Honest limit: domains without a cheap verifier — the hardest move — will lag, and the cost curve may be steeper where a single "task" runs far longer than a coding task. The playbook predicts where agents go next, not that they arrive for free.

What comes next

The proving ground gave us a settled platform, a cost curve nobody had drawn, a review discipline that scales, and a playbook that transfers. It also opened a wound. The very property that made software tractable — agents that read from the world (issues, docs, tool output) and write back to it (commits, pull requests, deployments) — places the coder squarely on the boundary between untrusted input and production systems. A hostile instruction smuggled into a tool result, a poisoned dependency, a booby-trapped issue: any of them can turn the agent into a confused deputy that files a pull request against its own repository. When the coder can act, the coder is an attack surface. The next chapter names the threats and the mechanisms that actually hold.

[→ 10] Securing the Agentic Perimeter — the injection that files a PR, mapped from OWASP risk to lab practice to the gaps still left honest; and the harness as a security boundary.

References

Wang, X., Li, B., Song, Y., Xu, F. F., Tang, X., Zhuge, M., Pan, J., Song, Y., Li, B., Singh, J., Tran, H. H., Peng, H., Ji, H., & Neubig, G. (2024). OpenHands: An Open Platform for AI Software Developers as Generalist Agents. Preprint. arXiv:2407.16741.
Jimenez, C. E., Yang, J., Wettig, A., Yao, S., Pei, K., Press, O., & Narasimhan, K. (2023). SWE-bench: Can Language Models Resolve Real-World GitHub Issues? Preprint. arXiv:2310.06770.
Geng, J., & Neubig, G. (2026). Effective Strategies for Asynchronous Software Engineering Agents. Preprint. arXiv:2603.21489.
Kim, J. (D.), Yang, W., Niu, K., Zhang, H., Zhu, Y., Helenowski, E., Silva, R., Chen, Z., Iyer, S., Fried, D., Synnaeve, G., Salakhutdinov, R., & Goyal, A. (2026). Scaling Test-Time Compute for Agentic Coding. Preprint, Meta Superintelligence Labs. arXiv:2604.16529.
Goswami, D. (2026). cl-agent: A Continual-Learning Substrate for Coding Agents: Episode Capture, Replay, and Rule-Based Distillation for Cross-Session Improvement Without Fine-Tuning. Independent research preprint. github.com/dattgoswami/cl-agent
Bai, L., Huang, Z., Wang, X., Sun, J., Mihalcea, R., Brynjolfsson, E., Pentland, A., & Pei, J. (2026). How Do AI Agents Spend Your Money? Analyzing and Predicting Token Consumption in Agentic Coding Tasks. Preprint. arXiv:2604.22750.
Adams, C., Banga, A. S., Bansal, P., Bhattacharya, S., Cao, R., Cook, N., Ellis, B., Goyal, P., Grewal, G., Mockus, A., Rigby, P., & Nagappan, N. (Meta). (2026). Automating Low-Risk Code Review at Meta: RADAR, Risk Calibration, and Review Efficiency. Practitioner deployment report. arXiv:2605.30208.
Merrill, M. A., Shaw, A. G., Carlini, N., Li, B., Raj, H., Muennighoff, N., Konwinski, A., & Schmidt, L. (2026). Terminal-Bench: Benchmarking Agents on Hard, Realistic Tasks in Command Line Interfaces. Preprint. arXiv:2601.11868.
BUAA-SKLCCSE, Alibaba, ByteDance, Shanghai AI Lab, et al. (2025). From Code Foundation Models to Agents and Applications: A Comprehensive Survey and Practical Guide to Code Intelligence. Preprint. arXiv:2511.18538.
GLM-4.5 Team, Zhipu AI & Tsinghua University. (2025). GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models. Preprint. arXiv:2508.06471.
Imajuku, Y., Horie, K., Iwata, Y., Aoki, K., Takahashi, N., & Akiba, T. (2025). ALE-Bench: A Benchmark for Long-Horizon Objective-Driven Algorithm Engineering. Sakana AI. github.com/SakanaAI/ALE-Bench.
Kojic, M., Bondyrev, I., de Moor, A., Shtok, J., Borovlev, P., Lysaniuk, K., Kannan, M., Dolgov, I., & Pavlichenko, N. (2026). Mellum2 Technical Report. JetBrains. arXiv:2605.31268.
Zheng, K., Chambon, P., Decugis, J., Gehring, J., Cohen, T., Negrevergne, B., & Synnaeve, G. (2026). Extrapolative Weight Averaging Reveals Correctness-Efficiency Frontiers in Code RL. Preprint, Meta Superintelligence Labs & Université Paris-Dauphine. arXiv:2605.28751.
Holzmann, G. J. (2006). The Power of Ten — Rules for Developing Safety-Critical Code. IEEE Computer, 39(6), NASA/JPL Laboratory for Reliable Software.