Self-Improving Agents

§1 · The engineer in the loop

Here is a result that should unsettle anyone who builds agents for a living. An agent handed no new skills, no fine-tuning, and no human-written playbook — only the ability to write and revise its own skills between attempts — raised its accuracy on Humanity's Last Exam by 116.2% relative to where it started, and by 26.2% relative on the General AI Assistants (GAIA) benchmark (Memento-Team, 2026). It got there by designing better versions of itself, run over run. No engineer touched it mid-flight.

For eleven articles this series has made a single argument: the harness — the system wrapped around the model — is the product [← 1], and improving it is engineering work. We learned to measure it, to evaluate it honestly [← 2], to feed it environments, to give it memory and tools and plans, and to let it learn on the job [← 4]. Every one of those improvements had the same author. A human sat in the loop, read the traces, and changed the system. The engineer is the last human in the loop.

This article is about the papers where that stops being true, where the loop that used to run through a person closes without one (see Figure 1). Skills that evolve. Harnesses that rewrite themselves. Curricula that regenerate. Systems that edit the very procedure they use to improve. Stated plainly enough to be wrong: production self-improvement is now closing the engineer's loop under real deployment constraints (budgets, eval gates, and rollback). If a self-modifying agent could not move deployed capability without a human authoring each change, the claim would be dead. The 116.2% result is the first test the claim has to survive, and it survives.

There is a trap here, and most writing on the subject falls into it. The phrase "self-improvement" names two different machines, and conflating them is how you get both the hype and the paralysis. One machine is a research programme: open-ended, unbounded, reaching for capabilities nobody specified. The other is a production discipline: bounded, gated, reversible, aimed at shipping a better agent under the constraints you already have. The next section pulls them apart and keeps them apart. This article is the field guide to the second one — and it ends with a guardrail table you can put into an on-call runbook.

Key Takeaway 1

The loop that improves the agent is closing. An agent that only edited its own skills lifted its score on a frontier exam by 116.2% relative with no human in the loop (Memento-Team, 2026). The engineering question is no longer whether agents improve themselves, but which of two loops you are running — the open-ended research loop, or the bounded production loop — because they demand opposite disciplines.

§2 · Two loops, one word

Self-improvement is two machines, not one, and the entire engineering decision is which one you are running. Get this distinction wrong and you will either ship something unbounded into production, or refuse to ship something bounded and safe because it shares a name with the unbounded thing.

The research loop is open-ended code evolution. Propose a variant of the system, test it empirically, keep what scores higher in an archive that only ever grows, and repeat — searching for capability nobody asked for in advance. This is the lineage of the Darwin Gödel Machine and of large-scale evolution strategies, and it is the subject of a published article in the companion series [← A8]. We will not re-tell that story. What matters here is its shape: there is no acceptance gate beyond "the benchmark went up," no budget ceiling by design, and no obligation to ever ship. Its purpose is discovery.

The production loop has the opposite shape. Every proposed self-edit passes an eval gate [← 2] before it touches anything users see; every edit is reversible; every run has a budget; and every change lands in an event log you can read at 3 a.m. Its purpose is not to discover the unknown. It is to ship a measurably better agent under constraints that already exist. Same word, opposite contracts.
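The four properties of the production loop can be sketched in a few lines. This is a minimal illustration under stated assumptions, not an implementation from any cited paper: the class name `ProductionLoop`, its fields, and the scoring callable are hypothetical, and "rollback" here simply means a failed candidate never replaces the current state.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class ProductionLoop:
    """Toy production loop: eval gate, rollback, budget, and event log (all names hypothetical)."""
    state: str                          # currently deployed artifact, e.g. a skill text
    evaluate: Callable[[str], float]    # held-out eval supplied from outside the loop
    budget: int                         # maximum number of proposals the loop may spend
    log: List[dict] = field(default_factory=list)

    def propose(self, candidate: str) -> bool:
        """Spend one unit of budget on a self-edit; promote it only if it scores higher."""
        if self.budget <= 0:
            self.log.append({"event": "rejected", "reason": "budget exhausted"})
            return False
        self.budget -= 1
        old_score = self.evaluate(self.state)
        new_score = self.evaluate(candidate)
        shipped = new_score > old_score
        # Every proposal is logged, whether it ships or not.
        self.log.append({
            "event": "shipped" if shipped else "rolled_back",
            "old_score": old_score,
            "new_score": new_score,
        })
        if shipped:
            self.state = candidate
        return shipped
```

The design choice worth noticing is that `evaluate` and `budget` are passed in from outside; nothing in `propose` can change the gate or raise the ceiling.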

There is a clean way to say what separates them. Wang and Buehler (2026), building a category-theoretic account of self-revising discovery systems, distinguish operations that update a system's state within a fixed representational regime from transitions that change the regime itself. Production self-improvement is the first kind: bounded updates inside a regime you fixed and can audit. Open-ended evolution reaches for the second: it tries to change the rules of its own game. Most of the danger, and most of the engineering, lives in keeping a production system firmly in the first category while it still feels like it is doing the second.

We will use the series notation throughout. The pieces a self-improvement loop can edit are exactly the pieces we have named all along:

M: The model (weights θ).

H: The harness (context, tools, memory, verification, revision, orchestration).

E: The environment / task-world the agent acts in, including the curriculum.

Σ: The skill/rule store, memory that outlives one episode.

τ: A trajectory, one multi-step episode the loop runs on.

A self-improvement loop is just an operator that takes feedback from trajectories τ and edits one of Σ, H, M, or E (or, at the frontier, edits the operator itself). The rest of this article walks up that list, cheapest and safest first (see Figure 2).

Figure 2: The two loops, drawn so they cannot be confused. The research loop (left) grows an archive without bound and ships nothing by design — the open-ended programme covered in [← A8]. The production loop (right) routes every self-edit through an eval gate; passes ship, failures roll back, and the whole cycle runs under a budget and an event log. The gate, rollback, budget, and log appear only on the right. This is the figure that keeps the rest of the article honest.

Key Takeaway 2

One word, two machines. The research loop optimizes for discovery and is unbounded by design; the production loop optimizes for a shippable, reversible improvement under a budget. The one-line test: is the change meant to reach users? If yes, you are running the production loop: gate it, log it, and engineer it as such. Everything below is the production loop.

§3 · Self-improving the skills (Σ)

Start where the loop is cheapest and safest to close: the skill store Σ. It needs no weight updates, it is fully inspectable, and it is reversible by construction — skills are just files. If you are going to let an agent improve anything about itself in production, let it improve this first.

Two papers show what disciplined skill-improvement looks like. SkillOpt (Yang et al., 2026) makes the sharp move: treat a skill as the external state of a frozen agent and optimize it with the same discipline you would apply to training a network — a specified objective, and stable, measurable steps under feedback. Today's skills are hand-crafted, generated one-shot, or evolved by loosely controlled self-revision; none of those behaves like an optimizer, and none reliably gets better than where it started. SkillOpt's contribution is to make skill-evolution behave like optimization rather than like luck. The skill stops being a prompt someone wrote and becomes a quantity the system descends.

Memento-Skills (Memento-Team, 2026) shows the same idea wired into a working agent that designs other agents. Reusable skills are stored as structured markdown files that serve as persistent, evolving memory; a read–write reflective loop selects the right skill for the current state, then updates and expands the library from new experience. The crucial property for a deployment engineer: this is continual learning without updating the model's weights. All adaptation lives in the externalized skills and prompts — which means all adaptation is inspectable and revertible. That is how the same system reached 26.2% and 116.2% relative gains on GAIA and Humanity's Last Exam: not by becoming a different model, but by accumulating a better Σ (see Figure 3).
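To make "inspectable and revertible" concrete, here is a small sketch of a versioned skill store. It is an in-memory stand-in for a directory of markdown files under version control, written for illustration only; the class `SkillStore` and its methods are assumptions introduced here and are not the Memento-Skills API.

```python
from typing import Dict, List


class SkillStore:
    """In-memory stand-in for markdown skill files under version control (hypothetical)."""

    def __init__(self) -> None:
        # skill name -> every version ever written, oldest first
        self._history: Dict[str, List[str]] = {}

    def write(self, name: str, markdown: str) -> int:
        """Append a new version of a skill and return its version index."""
        versions = self._history.setdefault(name, [])
        versions.append(markdown)
        return len(versions) - 1

    def read(self, name: str) -> str:
        """Return the latest version of a skill."""
        return self._history[name][-1]

    def revert(self, name: str) -> str:
        """Drop the latest version and return the one that is now current."""
        versions = self._history[name]
        if len(versions) < 2:
            raise ValueError(f"no earlier version of {name!r} to revert to")
        versions.pop()
        return versions[-1]
```

Because a write only appends, a bad self-edit costs one `revert` call; the model's weights never enter the picture.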

None of this is unprecedented. The ancestor is Reflexion (Shinn et al., 2023): verbal reinforcement, where an agent writes natural-language reflections on its failures and carries them forward as episodic memory — learning from feedback without gradient updates. Modern skill-evolution is Reflexion with two things added: an optimizer that makes the updates principled, and a library that makes them compound. The lineage matters because it tells you the mechanism is mature; what is new is the discipline around it. Memory and the action interface each earned their own articles earlier in the series [← 5] [← 6]; here the point is narrower — those stores are now things the agent edits, not just things it reads.

Figure 3: Self-improving the skill store. A frozen model runs a task; the feedback writes back into the skill Σ, which SkillOpt treats as external state to be optimized (the descending curve), while Memento-Skills lets the markdown library grow round over round. Because nothing updates the weights, every gain is a file you can read and revert. Why it matters: this is the lowest-blast-radius place to let an agent improve itself.

Key Takeaway 3

Let the agent improve its skills first. Treating a skill as optimizable external state (SkillOpt) and storing the library as inspectable files (Memento-Skills) delivers real, compounding gains (26.2% and 116.2% relative on GAIA and Humanity's Last Exam) while keeping every change reversible and the model's weights untouched. The blast radius of a bad self-edit is a file you can roll back, which is exactly why Σ is where the production loop should start.

§4 · Self-improving the harness (H) and weights (M)

Go up a level and two bigger levers appear: rewriting the scaffold, and retraining the weights. They have lived in separate research worlds, and most teams pull only one. The result worth internalizing is what happens when a single system pulls both — and the constraint that decides how far either can go.

The two worlds are easy to name. The harness-update school rewrites the scaffold of a task-specific agent — its tools, prompts, retry logic, and search procedure — while holding the model's weights fixed. The test-time-training school does the opposite: it updates the weights on task feedback while holding the harness fixed. SIA (Hebbar et al., 2026) observes that these are complementary, not competing, and builds an agent that turns both at once — a language-model agent that updates both the harness and the weights of a task-specific agent. Combining the levers produced gains across three unrelated domains: 25.1% on legal charge classification, an 8× speedup on GPU-kernel optimization, and a 51.3% reduction in error on RNA denoising. The headline is not any single number; it is that a system pulling both levers beats systems pulling either alone, and most deployments are leaving one lever untouched (see Figure 4).

The harness has a second, subtler thing it can improve: its own control policy. Recursive Agent Optimization (RAO; Gandhi et al., 2026) uses reinforcement learning to teach an agent when and how to spawn and delegate sub-tasks to fresh copies of itself. Delegation stops being a hand-written orchestration rule and becomes trained behavior: a divide-and-conquer policy the model learned. The payoff is concrete for anyone fighting context limits: the recursive agent scales to tasks beyond a single context window and cuts wall-clock time versus a single-agent baseline. It is the multi-agent question from earlier in the series [← 8], answered by letting the system learn its own coordination instead of decreeing it.
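The shape of recursive delegation is easy to sketch, with one loud caveat: in the paper the decision of when to delegate is learned with reinforcement learning, whereas the rule below is hard-coded ("split in half whenever the task exceeds the context limit"). The function name, its parameters, and the list-of-integers task are all hypothetical stand-ins.

```python
from typing import Callable, List


def solve_recursive(
    items: List[int],
    context_limit: int,
    solve_leaf: Callable[[List[int]], int],
    combine: Callable[[int, int], int],
) -> int:
    """Solve directly if the task fits in one context; otherwise split it and delegate."""
    if context_limit < 1:
        raise ValueError("context_limit must be at least 1")
    if len(items) <= context_limit:
        return solve_leaf(items)          # small enough for a single agent
    mid = len(items) // 2
    # Each half goes to a fresh call, standing in for a fresh copy of the agent.
    left = solve_recursive(items[:mid], context_limit, solve_leaf, combine)
    right = solve_recursive(items[mid:], context_limit, solve_leaf, combine)
    return combine(left, right)
```

For example, `solve_recursive(list(range(10)), 3, sum, lambda a, b: a + b)` never hands any single call more than three items, yet still returns the sum of all ten.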

But every lever in this section is gated by one thing, and naming it now sets up the article's last move. Self-Trained Verification (Wu and Raghunathan, 2026) shows that both test-time verification-refinement loops and train-time self-training bottleneck on the same component — the verifier. Verification–refinement loops stall when verifier scores inflate while real accuracy stagnates; self-training fails when bad self-generated data leak into training. Their fix is to train the verifier by showing the model the reference solution, which it can learn to check even when it cannot solve. The effect is large: roughly doubled accuracy on hard math, a 14× lift on scientific reasoning (from 1.5% to 21%), and a further 33% gain in pass@1 when the verifier is put in the training loop. Hold that result. The verifier is the thing the whole self-improving system trusts to tell it whether it got better — and in §8 we make it the load-bearing wall.
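The stall condition described above (verifier scores climbing while real accuracy does not) can be turned into a simple monitor. The sketch below is an assumption-laden illustration rather than the method of Wu and Raghunathan: the function name, the window of three rounds, and the 0.01 tolerance are arbitrary choices, and `true_accuracy` is assumed to come from a reference-labelled set the loop cannot tune.

```python
from typing import Sequence


def verifier_inflating(
    verifier_scores: Sequence[float],
    true_accuracy: Sequence[float],
    window: int = 3,
    tolerance: float = 0.01,
) -> bool:
    """Return True if, across the last `window` rounds, the verifier's score
    rose by more than `tolerance` while reference accuracy did not."""
    if len(verifier_scores) != len(true_accuracy):
        raise ValueError("histories must have the same length")
    if len(verifier_scores) < window:
        return False  # not enough rounds to judge
    verifier_gain = verifier_scores[-1] - verifier_scores[-window]
    accuracy_gain = true_accuracy[-1] - true_accuracy[-window]
    return verifier_gain > tolerance and accuracy_gain <= tolerance
```

A loop that sees this return `True` would freeze further updates until the verifier is repaired, which is the behavior the weights row of the guardrail table asks for.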

Figure 4: Two levers, and the corner most teams never reach. Harness-update (right edge) and test-time training (top edge) are usually studied apart; SIA (top-right, teal) turns both and reports gains across three unrelated domains. The inset shows Recursive Agent Optimization learning when to delegate to copies of itself rather than being told. Why it matters: if you are pulling one lever, the other is sitting idle.

Key Takeaway 4

The harness can rewrite itself and the weights can retrain themselves — and they are complementary: SIA pulls both and beats pulling either alone. The harness can even learn its own control policy (RAO). But all of it rests on the verifier: self-improvement loops stall or collapse exactly when the verifier inflates (Wu and Raghunathan, 2026). Pull both levers, learn the policy — and treat the verifier as the component that decides how far either can safely go.

§5 · Self-improving the curriculum (E)

The deepest lever is not the agent at all — it is the distribution the agent learns from. When the curriculum starts generating itself, you have removed the single highest-leverage human in the pipeline: the one who decides what the system trains on next.

The current paradigm hides that human in plain sight. Every pre-training and post-training run begins with a person choosing a static dataset or reward function and pressing start; to extend a model's capabilities, someone restarts the process with new data. AC/DC (Dai et al., 2026) — Assessment Coevolving with Diverse Capabilities — asks what happens if the models and the tasks evolve together in a single run instead. It coevolves a population of LLMs, via model merging, with a population of natural-language tasks, via synthetic generation: harder tasks pull the models forward, and stronger models justify generating harder tasks. The reported outcome is that the evolved population covers a broader range of expertise than curated baselines on downstream benchmarks — and it does so without optimizing for any of those benchmarks. The training distribution has become a moving target the system produces for itself (see Figure 5).

This is the biggest unlock in the article and the biggest risk, for the same reason. The human who chooses what to train on next holds more leverage over what a system becomes than almost anyone else in the loop; automating that choice buys reach no hand-authored curriculum can match. It also creates a failure mode you cannot see from inside, because the system is now choosing both what it learns and, implicitly, what it will be good at. Environments and data were already the bottleneck for agent capability [← 3]; a self-generating curriculum makes that bottleneck self-referential. The governance move is unambiguous: a self-written task may train the model, but it must never be allowed to also grade it — the held-out set has to be one the system did not write.
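The separation-of-powers rule can be enforced mechanically if every task carries a provenance tag. The following is a sketch under assumptions: the `Task` fields, the `"human"` / `"system"` labels, and the `held_out` flag are hypothetical conventions introduced here, not part of AC/DC.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class Task:
    prompt: str
    author: str             # hypothetical provenance tag: "human" or "system"
    held_out: bool = False  # True if the task is reserved for grading


def split_tasks(tasks: Iterable[Task]) -> Tuple[List[Task], List[Task]]:
    """Return (training pool, grading set); refuse any self-written grading task."""
    train: List[Task] = []
    grade: List[Task] = []
    for task in tasks:
        if task.held_out:
            if task.author == "system":
                raise ValueError("a system-written task cannot be in the grading set")
            grade.append(task)   # held-out tasks never enter training
        else:
            train.append(task)   # self-written tasks may train the model
    return train, grade
```

Raising an error, rather than silently dropping the offending task, is deliberate: a self-written task showing up in the grading set is a governance failure someone should see.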

Figure 5: When the curriculum improves itself. AC/DC coevolves a population of models (merged into novel experts) with a population of synthetically generated tasks; each pushes the other, and coverage of expertise widens over a single run with no manual restart and no benchmark optimization. Why it matters: this removes the highest-leverage human in the pipeline — and the one whose absence is hardest to audit.

Key Takeaway 5

A self-generating curriculum (AC/DC) is the highest-leverage lever and the highest-variance one: it buys capability coverage no hand-authored dataset can match, while removing the human whose absence you can least afford to ignore. The non-negotiable guardrail is separation of powers — a task the system wrote may train it, but the set that grades it must be one the system did not write.

§6 · Self-improving the system

Assemble all of it under one roof and you reach the frontier question: can a system improve the very procedure it uses to improve itself? HyperAgents (Zhang et al., 2026) is the cleanest answer, and reading it carefully shows why the bounds from §8 are not optional decoration but part of the architecture.

The problem it solves is a regress. Most self-improvement architectures rely on a fixed meta-level — a higher-level system that modifies a base system. That base can only ever improve within the boundaries the meta-level's design allows. Adding a meta-meta-level to improve the meta-level does not dissolve the limit; it just lifts the ceiling one floor and threatens an infinite tower of meta-levels. HyperAgents collapses the tower: it folds the task agent (which solves the problem) and the meta agent (which modifies the task agent and itself) into a single editable program in which the meta-level modification procedure is itself editable. The result is metacognitive self-modification — the system improves not only its task behavior but the mechanism that generates its future improvements (see Figure 6).

Two findings make this concrete rather than mystical. Across diverse domains — coding, paper review, robotics reward design, and Olympiad-level math-solution grading — the system improves over time and outperforms both baselines without self-improvement and prior self-improving systems. And critically, the improvements it makes to its own improvement process transfer across domains and accumulate across runs: it gets better at getting better, and that compounding carries over. This is the production-leaning descendant of the open-ended self-improvement demonstrated by the Darwin Gödel Machine [← A8] — but where the research lineage runs unbounded, HyperAgents reports that every experiment was conducted with explicit safety precautions, including sandboxing and human oversight. That sentence is not a disclaimer. When the thing being edited is the editor, the sandbox is the design.

Figure 6: The loop assembled. A fixed meta-level (left) can only lift its ceiling one floor at a time, toward an infinite regress. HyperAgents (right) folds task agent and meta agent into one editable program where the meta-level procedure edits itself — and reports that it runs under sandboxing and human oversight. Why it matters: when the editor is editable, the bounds stop being a policy and become the architecture.

Key Takeaway 6

The frontier of self-improvement is editing the thing that does the editing. HyperAgents folds the task and meta agents into one editable program whose improvements to its own improvement process transfer across domains and accumulate across runs. The same paper runs every experiment sandboxed and overseen — because at this level a self-edit can reach the mechanism itself, and the bounds are no longer optional.

§7 · The evolutionary ancestry

None of this is new in spirit. Strip the language model away and what remains is population search — propose, select, keep an archive — and it has returned at every scale of compute we have ever had. Knowing the ancestry tells you which parts are mature mechanism and which parts are the genuinely new constraint.

The mechanism is old. NEAT (Stanley and Miikkulainen, 2002) evolved neural-network topologies, not just weights, and earned its results from three ideas that long predate deep learning: principled crossover of different structures, speciation to protect a structural innovation long enough for it to mature, and complexification (growing from minimal structure rather than starting large). Variation, selection, and protecting the new: the vocabulary of every self-improving system in this article was written down more than two decades before those systems appeared.

The modern reappearances are recognizable. SOAR (Pourcel, Colas, and Oudeyer, 2025) places a language model inside a self-improving evolutionary loop on the ARC-AGI program-synthesis benchmark: it alternates evolutionary search to sample and refine candidate programs with hindsight learning that turns failed search attempts into valid training pairs, fine-tuning the model that does the sampling. The loop lifts performance across model scales and iterations, which is evidence from outside the Darwin Gödel Machine (DGM) lineage that population search plus a learnable sampler is a working recipe. And large-scale evolution strategies show the oldest pattern of all: a backprop-free optimizer that simply rides compute, scaling the way everything in this field eventually scales [← A8].

One reappearance carries a warning the others do not. Digital Red Queen (Kumar et al., 2026) points out that most LLM-driven evolution is framed as static optimization toward a fixed target, which is not how real evolution works. In Core War — a game where programs battle for control of memory — the authors show LLMs driving genuinely adversarial co-evolution: an arms race with no fixed objective, where each improvement provokes a counter-improvement. This is the bridge to the next section. A self-improving system under adversarial pressure does not converge; it escalates. The lesson of the ancestry, then, is double: evolution is the heritage, not the architecture — production loops borrow selection and the archive and then bound them with a fixed budget, an eval gate, and rollback, which is exactly what biological and adversarial evolution never have (see Figure 7).

Figure 7: The ancestry. Variation, selection, and protecting innovation run from NEAT (2002) through verbal feedback and hindsight learning to today's bounded production loops; large-scale evolution strategies show the pattern simply riding compute [← A8]. The red branch — Digital Red Queen — is the warning: under an adversary, the open-ended version becomes an arms race that does not converge. Why it matters: the search is old; the engineering is the bound.

Key Takeaway 7

The machinery of self-improvement — variation, selection, archives, speciation — is decades old and demonstrably rides compute (NEAT, SOAR, evolution strategies). What is new in the production loop is not the search but the bound: a fixed budget, an eval gate, and rollback. And Digital Red Queen names the reason those bounds are mandatory — under adversarial pressure, open-ended self-improvement escalates rather than converges.

§8 · Guardrails for a loop that edits itself

A loop that edits itself is not a research toy; it is a production system with a new failure surface, and it must be run as an ops problem and a security problem before it is run as a capability win. This section delivers the artifact — a guardrail table you can lift into a runbook — but first the three failure modes it is built to stop.

The grader is gameable. ScientistOne (Meng et al., 2026) studied autonomous research agents and found that their professional-looking outputs routinely contain verifiability failures: fabricated citations, unreproducible scores, and method descriptions that diverge from the implementation. Across 75 papers produced by five systems on five research tasks, every baseline showed at least one systematic failure mode, with hallucinated reference rates reaching 21%. A self-improving system optimizes whatever its grader measures — including the gaps in the grader. Their countermeasure, chain-of-evidence, requires every claim to be traceable to its source by construction; that traceability is what an event log has to provide for a self-editing loop.
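One way an event log can provide that traceability is to refuse any self-edit that names no evidence and to chain entries by hash so that later tampering is detectable. This is a sketch of that idea only; it is not ScientistOne's chain-of-evidence mechanism, and the function names, entry fields, and trajectory IDs are hypothetical.

```python
import hashlib
import json
from typing import List


def _digest(edit: str, evidence: List[str], prev: str) -> str:
    """Hash the content of one entry together with the previous entry's hash."""
    body = {"edit": edit, "evidence": evidence, "prev": prev}
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()


def append_event(log: List[dict], edit: str, evidence: List[str]) -> dict:
    """Append a self-edit that cites its source trajectories and chains to the last entry."""
    if not evidence:
        raise ValueError("every self-edit must cite at least one source trajectory")
    prev = log[-1]["hash"] if log else ""
    entry = {"edit": edit, "evidence": evidence, "prev": prev,
             "hash": _digest(edit, evidence, prev)}
    log.append(entry)
    return entry


def verify(log: List[dict]) -> bool:
    """Recompute the chain; return False if any entry was altered or reordered."""
    prev = ""
    for entry in log:
        expected = _digest(entry["edit"], entry["evidence"], entry["prev"])
        if entry["prev"] != prev or entry["hash"] != expected:
            return False
        prev = entry["hash"]
    return True
```

In use, the loop would call `append_event` before promoting any edit, and an on-call engineer could run `verify` on the log before trusting it during an incident.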

The verifier inflates. Recall §4: verification–refinement loops stall precisely when verifier scores climb while real accuracy does not (Wu and Raghunathan, 2026). If the gate your loop must pass is the same quantity the loop is optimizing, you do not have a gate — you have a target. The eval gate has to measure something the self-improvement step cannot directly tune.

The adversary is assumed. Digital Red Queen showed self-improvement escalating under adversarial pressure. In production that translates to a concrete threat model: someone will try to feed your loop a poisoned task, a malicious skill, or a crafted trajectory designed to push a self-edit in their favor. Self-improvement is therefore a security discipline as much as an ML one — the same posture the series brought to agent security applies, now to a system that rewrites itself.

[← 11] Agent Ops: Running Agents in Production — the event log, alerting, and on-call posture a self-editing loop inherits; every self-edit is an event that must be attributable and replayable.

[← 10] Securing the Agentic Perimeter — the threat model for adversarial inputs to a loop that rewrites itself; treat poisoned tasks and malicious skills as the default, not the exception.

The contract that follows from the three failure modes is short enough to memorize: a self-improving change ships only behind a gate it cannot edit, a rollback it cannot disable, and a budget it cannot raise. The table below (see Figure 8) instantiates that contract for each thing a loop can edit — skills, harness, curriculum, weights, and the meta-level itself — with the source that motivates each row. Note that HyperAgents already runs its meta-level experiments sandboxed and overseen; the table generalizes that instinct into a checklist.

Self-edited target	Eval gate [← 2]	Rollback	Resource bound	If left ungoverned	Source
Σ — skills (markdown / rules)	✅ score a new skill on a held-out set before promoting it	✅ skills are files — revert the commit	✅ cap library size and write-rate	⚠️ silent skill drift; regressions hide in the library	SkillOpt; Memento-Skills
H — harness (scaffold, policy)	✅ A/B the rewritten scaffold against the current one	✅ keep the prior scaffold as a live fallback	✅ cap retries, search depth, tool budget	❌ a self-evolving harness games the metric, not the task	SIA; RAO
E — curriculum (generated tasks)	✅ grade only on a set the system did not write	⚠️ hard once the distribution has drifted	✅ bound the synthetic-data fraction per round	❌ reward hacking; coverage gaps you cannot see	AC/DC; ScientistOne
M — weights (test-time training)	✅ verifier-gated; freeze on verifier–accuracy divergence	✅ checkpoint before every update	✅ trust-region / step cap	❌ self-training collapse on bad self-data	SIA; Self-Trained Verification
Meta-level (edits the editor)	✅ human sign-off to promote a meta-edit	✅ archive every meta-variant	✅ sandbox; no production credentials	❌ adversarial escalation; unbounded regress	HyperAgents; Digital Red Queen

Figure 8: The guardrail table — the deployable artifact. For each thing a production loop can edit, the gate that must pass, the rollback that must exist, the bound that must hold, and the failure that follows if you skip it. Every row is sourced to a paper in this article. Read alone, this is the on-call runbook for a self-improving agent.
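The contract behind the table (a gate the loop cannot edit, a rollback it cannot disable, a budget it cannot raise) can be written down as a single check. The sketch below is illustrative: the `Guardrail` fields and the `may_ship` function are names invented here, and a frozen dataclass is only a small in-process stand-in for keeping guardrail state outside the loop's write access.

```python
from dataclasses import dataclass


@dataclass(frozen=True)           # frozen: fields cannot be reassigned after creation
class Guardrail:
    gate_passed: bool             # result of an eval the loop cannot edit
    rollback_ready: bool          # a prior version exists and can be restored
    spent: int                    # resource units used so far
    budget: int                   # fixed ceiling on resource units
    needs_human: bool = False     # True for meta-level edits
    human_approved: bool = False  # sign-off recorded by a person


def may_ship(g: Guardrail) -> bool:
    """A self-edit ships only if every condition of the contract holds."""
    if g.needs_human and not g.human_approved:
        return False
    return g.gate_passed and g.rollback_ready and g.spent <= g.budget
```

Each row of the table then becomes one `Guardrail` instance filled in by the surrounding system, with `needs_human=True` reserved for the meta-level row.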

Key Takeaway 8 — the artifact

Run self-improvement as ops and security, not just capability. A self-editing change ships only behind a gate it cannot edit, a rollback it cannot disable, and a budget it cannot raise, applied to each editable target in Figure 8. The gate must measure something the self-improvement step cannot directly tune (Wu and Raghunathan, 2026), every self-edit must be an attributable, replayable event (the lesson of the hallucinated-reference rates of up to 21% that ScientistOne measured in baseline research agents), and the meta-level runs sandboxed and overseen by default (HyperAgents). If you implement only the table, you have implemented the article.

What comes next

This is where the field guide ends and the bet begins. For twelve articles the Agentic Engineering series has answered how to build the agent today: name the harness, evaluate it, feed it environments, give it memory and tools and plans, let it learn on the job, and — here — let it begin to improve itself behind gates you control. The production loop closes the engineer's loop under today's constraints. It does not tell you which of these directions is merely useful and which is inevitable.

That is the question the companion series takes up. Its argument is that the bets worth making over a decade share two traits: they ride compute, and they remove a human from a loop. Self-improving agents are the purest instance of the second — the human they remove is the engineer who has been the protagonist of this entire series. The capstone of one series is the opening move of the next.

[→ C2] Paradigm Bets: The Ten-Year Tier — which of these self-improving loops will look obvious in 2036, why they ride compute and remove the human, and what observation would prove each bet wrong. (The Long Bet, C-series.)

References

Memento-Team. (2026). Memento-Skills: Let Agents Design Agents. Preprint. arXiv:2603.18743.
Wang, F. Y., & Buehler, M. J. (2026). Self-Revising Discovery Systems for Science: A Categorical Framework for Agentic Artificial Intelligence. Preprint. arXiv:2606.01444.
Yang, Y., Gong, Z., Huang, W., Yang, Q., et al. (2026). SkillOpt: Executive Strategy for Self-Evolving Agent Skills. Preprint. arXiv:2605.23904.
Shinn, N., Cassano, F., Berman, E., Gopinath, A., Narasimhan, K., & Yao, S. (2023). Reflexion: Language Agents with Verbal Reinforcement Learning. Advances in Neural Information Processing Systems (NeurIPS), 2023. arXiv:2303.11366.
Hebbar, P., Manawat, Y., Verboomen, S., Ivanova, A., et al. (2026). SIA: Self-Improving AI with Harness and Weight Updates. Preprint. arXiv:2605.27276.
Gandhi, A., Chakraborty, S., Wang, X., Kumar, A., & Neubig, G. (2026). Recursive Agent Optimization. Preprint. arXiv:2605.06639.
Wu, C. H., & Raghunathan, A. (2026). Self-Trained Verification for Training- and Test-Time Self-Improvement. Preprint. arXiv:2605.30290.
Dai, A., Meinardus, B., Regan, C., Tian, Y., & Tang, Y. (2026). Discovering Novel LLM Experts via Task-Capability Coevolution. International Conference on Learning Representations (ICLR), 2026. arXiv:2604.14969.
Zhang, J., Zhao, B., Yang, W., Foerster, J., Clune, J., Jiang, M., Devlin, S., & Shavrina, T. (2026). HyperAgents. Preprint. arXiv:2603.19461.
Stanley, K. O., & Miikkulainen, R. (2002). Evolving Neural Networks through Augmenting Topologies. Evolutionary Computation, 10(2), 99–127.
Pourcel, J., Colas, C., & Oudeyer, P.-Y. (2025). Self-Improving Language Models for Evolutionary Program Synthesis: A Case Study on ARC-AGI. Preprint. arXiv:2507.14172.
Kumar, A., Bahlous-Boldi, R., Sharma, P., Isola, P., Risi, S., Tang, Y., & Ha, D. (2026). Digital Red Queen: Adversarial Program Evolution in Core War with LLMs. Preprint. arXiv:2601.03335.
Meng, R., Dalvi Mishra, B., Chen, J., Li, C.-L., et al. (2026). ScientistOne: Towards Human-Level Autonomous Research via Chain-of-Evidence. Preprint. arXiv:2605.26340.