The Continual Agent
Deployed agents are amnesic, and fine-tuning is the wrong first answer. The minimal continual substrate is episode capture, replay, and rule distillation in system space — and that is not a stopgap awaiting real learning. It is the auditable half of a two-substrate architecture production agents will keep.
§1 · The agent that forgot my codebase, twice
Yesterday my coding agent spent the better part of an hour discovering that a FastAPI endpoint was failing because of a Pydantic validation error, traced it, and fixed it. Today, on the same repository, it made the identical diagnostic journey from scratch — same dead ends, same rediscovery, same fix. It had learned nothing, because there was no place for it to put what it learned. The next morning it did it a third time. This is not a story about a weak model; the model is extraordinary within a single session. It is a story about a missing organ.
I will state the diagnosis in the words I used when I wrote it down: LLM-based coding agents are amnesic across sessions. Each new run begins without explicit access to prior attempts, failed tests, successful repair patterns, or repository-specific conventions. The capability is there and the experience is there, but nothing connects them across time. As I put it in the cl-agent paper, this is "not a failure of the underlying model; it is a failure of the interface between the model and its environment."
The reflexive fix is to make the model itself remember — to fine-tune on the agent's own transcripts so the weights absorb yesterday's lesson. I think that is the wrong first answer, and most of this essay is an argument for why. The thesis is one sentence, stated plainly enough to be wrong: the minimal continual substrate is episode capture, replay, and rule-based distillation in system space — cross-session learning without fine-tuning — and system-space learning is not a workaround we tolerate until "real," weight-based learning arrives. It is the auditable half of a two-substrate architecture that production agents will keep even after weight-space learning is cheap.
Two symbols carry the whole argument, so I fix them now. Write θ for the model's weights; learning in weight-space means changing θ. Write Σ for the skill/rule store — durable memory that lives outside the weights, in files, episodes, and prompts, and outlives any single run. (These are the series' symbols: M is the model, H the harness around it, E the environment, τ a trajectory or episode.) Everything below is a claim about where learning gets deposited — into θ or into Σ — and how often (see Figure 1).
[→ B4] Agents That Learn on the Job is the field-guide survey of this whole problem — every on-the-job-learning system, placed and compared. This essay is the opposite altitude: a first-person research programme built around one system I deployed and the position I will defend with it. Where B4 maps the territory, I am planting a flag on it.
A deployed agent that cannot deposit experience anywhere durable will re-derive the same fix forever. The bottleneck is not model capability; it is the absence of a place to put learning. The cheapest such place is not the weights.
§2 · The design space
Before defending a corner, I want the whole map. Learning in a deployed agent can be sorted along two axes (see Figure 2). The first is substrate: does an update change the weights (weight-space, θ) or an external store (system-space, Σ)? The second is cadence: does the update happen online, per interaction off the live stream, or episodically, in batches after episodes complete? Every system I will discuss lands in one of the four quadrants, and the map's punchline is its bottom row: system-space learning needs no gradient access and is deployable on top of a model you cannot retrain.
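One way to make the two axes concrete is to write out the update each substrate performs. This is only a sketch, and several symbols are assumptions that do not appear in the text above: $U$ is a hypothetical update operator on the store, $\mathcal{L}$ a training loss, $\eta$ a step size, and $\tau_t$ the trajectory observed at step $t$.

```latex
\begin{aligned}
\text{weight-space:}\quad & \theta_{t+1} = \theta_t - \eta\,\nabla_\theta \mathcal{L}(\theta_t;\tau_t), & \Sigma_{t+1} &= \Sigma_t \\
\text{system-space:}\quad & \theta_{t+1} = \theta_t, & \Sigma_{t+1} &= U(\Sigma_t,\tau_t)
\end{aligned}
```

Cadence then decides when the update fires: online applies it after every interaction, episodic applies it once per batch of completed episodes. Only the first row needs $\nabla_\theta$, which is why the second row is available on a model you cannot retrain.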
Why does production pin you to the bottom row of that map first? Three reasons, each structural rather than temporary. First, access. In most deployments the model is not yours to retrain: it is a frozen endpoint behind an API. System-space learning is the only kind you can do at all when you do not own θ. Second, plasticity. Even when you can touch the weights, you should know what happens when you keep touching them. Dohare and colleagues demonstrated that ordinary backprop, run continually, steadily loses its ability to learn: stochastic gradient descent alone is insufficient to learn continually, and a network's plasticity decays unless something actively restores it. A frozen model never has this problem, because it is never asked to keep absorbing; the accumulation happens in Σ, which has no plasticity to lose.
[← A1] The Plasticity Crisis in Continual Deep Learning — the published argument that a network's capacity to learn new tasks degrades the more it has already been trained, distinct from catastrophic forgetting. It is the deeper reason weight-space continual learning is hard, and the deeper reason the system-space alternative is attractive.
Third, auditability — the reason I will keep coming back to, and the one that turns "deployable first" into "permanent." A weight update is opaque and effectively irreversible at the granularity you care about; a line in a Markdown file is neither. I will develop this in §6. For now, the map's bottom row is simply the row you can stand on without owning the model, without fighting plasticity, and without giving up the ability to read what your agent has learned.
We can finally measure how much this matters, because a benchmark for it now exists. Asawa and colleagues built Continual Learning Bench (CL-Bench) as, in their words, "the first difficult, expert-validated benchmark designed to measure whether LLM-based systems genuinely improve with experience," spanning six real domains — software engineering, signal processing, disease-outbreak forecasting, database querying, strategic game-playing, demand forecasting — each constructed so tasks "share a learnable latent structure … that a stateful system can discover online but a stateless one cannot." Their finding is the one that should make every agent builder uncomfortable: current systems "leave headroom," agents "frequently overfit to immediate observations or fail to reuse knowledge across instances," and — crucially — "dedicated memory systems do not fix this; in fact, naive in-context learning outperforms systems dedicated to memory management." The gap is real, and elaborate memory machinery is not automatically the answer. That is the case for a minimal substrate, not a maximal one.
Production pins you to system-space first for three structural reasons — you rarely own the weights, weights lose plasticity under continual updates, and only a store you can read is a store you can audit. CL-Bench confirms the gap is real and that bigger memory systems do not automatically close it.
§3 · cl-agent: the minimal substrate
So I built the smallest thing that could work, and deployed it against my own coding agent. cl-agent is a continual-learning substrate that gives a coding agent cross-session memory through episode capture, replay, and rule-based distillation — without fine-tuning. It borrows four responsibilities from the continual-learning literature and adapts them to the agent setting: capture (record what happened with maximum fidelity), replay (surface relevant past experience before a new run), distillation (compile durable patterns out of the episode history), and evaluation (measure whether cross-session improvement is real and attributable). Together, these four responsibilities form a single loop (see Figure 3).
The mechanics are deliberately unglamorous. Episodes are recorded as normalized, append-only JSONL objects that preserve raw observable outcomes before any reward compression. A domain-heuristic replay buffer surfaces recent failures (as warnings) and same-domain successes (as guidance) before each new run, with no embeddings, no vector database, and no model inference at retrieval time. A rule-based distillation pipeline compiles the episode history into three inspectable, human-editable Markdown artifacts: skills.md, dreams.md, and program.md. And the substrate assumes no access to model internals: context injection through developer_instructions is the only coupling point. A captured episode and the rule it distills look like this:
episode.jsonl (append-only, one record per run; a representative record, wrapped here for readability):

```json
{"task": "fix FastAPI 422 on /orders", "tests_failed": ["test_orders_create"],
 "root_cause": "Pydantic model expected `customer_id:int`, body sent string",
 "repair": "coerce in schema + add validator", "outcome": "pass", "domain": "fastapi"}
```

skills.md (distilled rule, injected via developer_instructions):

```markdown
- When a FastAPI route returns 422, check the Pydantic request model FIRST:
  type mismatches between the JSON body and the schema are the usual cause.
```
That is the entire trick: yesterday's hour-long diagnosis becomes one line the agent reads before it starts. Nothing about the model changed; what changed is what the model is told. The design philosophy is, by intention, conservative. cl-agent does not fine-tune the underlying model; it does not wrap, replace, or compete with the agent; it does not depend on vector databases, embeddings, or ML inference at distillation time; and it persists knowledge as plain Markdown that any agent can read and any human can inspect or edit. It is a substrate-level approach, closer to the case-based-reasoning and lifelong-learning traditions than to gradient-level continual learning.
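To show how little machinery the capture, replay, and distillation steps need, here is a minimal sketch in Python. It is not cl-agent's actual code: the function names (`capture`, `load`, `replay`, `distill`), the `min_support` threshold, and the wording of the generated lines are all hypothetical, and the only thing taken from the text is the set of field names in the representative episode record.

```python
import json
import tempfile
from collections import Counter
from pathlib import Path


def capture(log: Path, episode: dict) -> None:
    """Append one episode as a single JSON line; existing lines are never rewritten."""
    with log.open("a", encoding="utf-8") as f:
        f.write(json.dumps(episode) + "\n")


def load(log: Path) -> list[dict]:
    """Read all episodes back in capture order."""
    with log.open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def replay(episodes: list[dict], domain: str, k: int = 3) -> str:
    """Build the text shown before a run: recent failures as warnings,
    same-domain successes as guidance. Plain filtering, no embeddings."""
    failures = [e for e in episodes if e["outcome"] != "pass"][-k:]
    successes = [e for e in episodes
                 if e["outcome"] == "pass" and e["domain"] == domain][-k:]
    lines = [f"WARNING: '{e['task']}' failed ({e['root_cause']})" for e in failures]
    lines += [f"GUIDANCE: '{e['task']}' was fixed by: {e['repair']}" for e in successes]
    return "\n".join(lines)


def distill(episodes: list[dict], min_support: int = 2) -> list[str]:
    """Turn (domain, root_cause, repair) patterns seen in at least
    `min_support` passing episodes into Markdown bullets."""
    counts = Counter((e["domain"], e["root_cause"], e["repair"])
                     for e in episodes if e["outcome"] == "pass")
    return [f"- [{d}] Seen {n}x: {cause}. Try first: {repair}."
            for (d, cause, repair), n in sorted(counts.items()) if n >= min_support]


if __name__ == "__main__":
    episode = {"task": "fix FastAPI 422 on /orders",
               "tests_failed": ["test_orders_create"],
               "root_cause": "Pydantic model expected customer_id:int, body sent string",
               "repair": "coerce in schema + add validator",
               "outcome": "pass", "domain": "fastapi"}
    with tempfile.TemporaryDirectory() as tmp:
        log = Path(tmp) / "episode.jsonl"
        capture(log, episode)
        capture(log, episode)  # the same fix rediscovered in a second session
        history = load(log)
    print(replay(history, domain="fastapi"))
    print("\n".join(distill(history)))
```

The design choice worth noticing is that every step is a filter or a count over a text file. Anything `distill` emits can be traced back to the exact episodes that produced it, which is the property the rest of the essay leans on.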
I want to be honest about the claim, because the honesty is the point. cl-agent makes a deliberately narrow, falsifiable claim: a thin substrate built from episode recording, replay, and rule-based distillation can measurably improve coding-agent performance across repeated tasks in a narrow domain, without fine-tuning the underlying model. Not general intelligence; not transfer across arbitrary domains; a measurable improvement, in a narrow domain, attributable to the substrate. To make "attributable" mean something, the evaluation runs two modes — a baseline agent with its own native memory, and the same agent integrated with the substrate — and separates operational metrics (computable from any run) from research-grade transfer metrics (forward and backward transfer) that are only valid under a controlled benchmark protocol.
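The split between operational and transfer metrics can be sketched in a few lines. The operational side is a pass-rate difference between the two modes. The transfer side below uses one common formulation from the continual-learning literature (the forward and backward transfer definitions of Lopez-Paz and Ranzato, 2017); whether cl-agent uses exactly these formulas is an assumption, as are the function names and the `cold` baseline vector.

```python
from statistics import mean


def pass_rate(runs: list[dict]) -> float:
    """Operational metric: fraction of runs whose outcome is 'pass'."""
    return mean(1.0 if r["outcome"] == "pass" else 0.0 for r in runs)


def substrate_effect(baseline_runs: list[dict], integrated_runs: list[dict]) -> float:
    """Pass-rate difference between the two modes on the same task list."""
    return pass_rate(integrated_runs) - pass_rate(baseline_runs)


def backward_transfer(R: list[list[float]]) -> float:
    """R[i][j] is the score on task j measured after learning tasks 0..i.
    Negative values mean later learning hurt earlier tasks."""
    last = len(R) - 1
    return mean(R[last][j] - R[j][j] for j in range(last))


def forward_transfer(R: list[list[float]], cold: list[float]) -> float:
    """cold[j] is the score on task j with an empty store.
    Positive values mean earlier tasks helped before task j was learned."""
    return mean(R[j - 1][j] - cold[j] for j in range(1, len(R)))
```

The matrix `R` only exists if tasks are run in a fixed order with the store snapshotted between them, which is why these two numbers are valid only under a controlled benchmark protocol while `pass_rate` can be computed from any run log.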
The revision is evidence
There are two drafts of this paper in front of me — an earlier one and the current one — and the diff between them is data about the problem, so I will use it as such. What is striking is how little changed at the core. Both drafts open with the same diagnosis (amnesia across sessions), the same four responsibilities, the same single coupling point through developer_instructions, and the same narrow falsifiable claim. Under revision I did not reach for more power — not fine-tuning, not embeddings, not a heavier memory system. I reached for more discipline: a sharper line between operational and research claims, an explicit baseline-versus-integrated protocol so the substrate's effect could be isolated from the agent's native memory. The thing that survived contact with deployment unchanged was the conservative core. When a design refuses to grow more elaborate across revisions, that refusal is telling you the bottleneck was never capability — it was the interface.
cl-agent is the minimal substrate: capture episodes as append-only JSONL, replay them before each run, distill them into editable Markdown rules injected through a single context channel, and evaluate baseline-vs-integrated. Its narrow falsifiable claim — measurable cross-session improvement in a narrow domain without fine-tuning — and its refusal to grow more elaborate across revisions are the same statement: the bottleneck was the interface, not the model.
§4 · The live programme
cl-agent is not a lone artifact; it is one move in a programme that several groups are running at once, from different corners of the map. Reading them together is what convinced me the substrate question is the right question. Figure 4 places the live cohort.
| System | Substrate · cadence | Signal it learns from | Ships on a frozen model? |
|---|---|---|---|
| OpenClaw-RL (Wang et al., 2026) | weight-space · online | next-state signal (PRM reward + directive distillation) | ❌ no: you own the training loop |
| ACuRL (Xue et al., 2026) | weight-space · in rounds | self-generated curriculum, judged by CUAJudge | ❌ no: RL on a specific environment |
| Memento-Skills (Memento-Team, 2026) | system-space · online + episodic | reflective read/write over a skill library | ✅ yes: no parameter updates |
| Learning Fast & Slow (Tiwari et al., 2026) | both substrates | textual feedback (fast) + RL (slow) | ⚠️ partial: fast half, yes; slow half, no |
| cl-agent (Goswami, 2026) | system-space · episodic | captured episodes → distilled rules | ✅ yes: developer_instructions only |
OpenClaw-RL (Wang et al., 2026) names the signal that all the others quietly rely on. Every agent action is followed by a next-state — the user's reply, the tool's output, the terminal or GUI change — and "no existing agentic RL system recovers it as a live, online learning source." Their observation is that these next-states are universal: personal conversations, terminal executions, GUI interactions, software tasks, and tool-call traces "are not separate training problems," and one policy can learn from all of them at once. The next-state carries two kinds of information — evaluative (how well the action did, extracted as a scalar reward via a process-reward-model judge) and directive (how it should have differed, recovered through hindsight-guided on-policy distillation as token-level advantage richer than any scalar). It is weight-space and online, and its engineering achievement is that serving and training run asynchronously with zero coordination overhead — but it is the hardest corner to deploy, because you own the loop and its risks.
ACuRL (Xue et al., 2026) takes the harder version of the data problem: what if the tasks worth learning do not exist yet? For computer-use agents in diverse, shifting digital environments, "high-quality and environment-grounded training data" is exactly what is missing. So the agent manufactures its own: it explores, a curriculum generator synthesizes difficulty-matched tasks from past experience and the previous round's feedback, and a robust automatic judge (CUAJudge) supplies the reward — continual adaptation to a specific environment "with zero human data." It stays in weight-space, but its cadence is rounds, not interactions.
Memento-Skills (Memento-Team, 2026) crosses to the bottom of the map and shows the system-space corner can be strong, not merely cheap. Reusable skills, stored as structured Markdown, serve as "persistent, evolving memory"; a Read–Write Reflective Learning loop selects the most relevant skill (read) and updates the library from new experience (write). The headline is the phrase I built cl-agent around: "continual learning without updating LLM parameters." And it is not a toy — on the General AI Assistants benchmark and Humanity's Last Exam, the authors report 26.2% and 116.2% relative improvements in overall accuracy, entirely from a growing skill store on a fixed model.
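Because these are relative improvements, it helps to spell out what the numbers do and do not say. Assuming the standard definition, and writing $a_{\text{base}}$ and $a_{\text{new}}$ (notation introduced here, not the authors') for accuracy before and after the skill store is added:

```latex
r = \frac{a_{\text{new}} - a_{\text{base}}}{a_{\text{base}}}
\quad\Longrightarrow\quad
a_{\text{new}} = (1 + r)\,a_{\text{base}},
\qquad
r = 1.162 \;\Rightarrow\; a_{\text{new}} = 2.162\,a_{\text{base}}
```

So the larger figure means accuracy slightly more than doubled, and the smaller one ($r = 0.262$) means it rose by about a quarter. Neither fixes the absolute accuracy, which is not quoted here.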
Learning Fast and Slow (Tiwari et al., 2026) is the one that refuses to pick a corner, and is right to. Updating parameters "forces them to absorb task-specific information, which can result in catastrophic forgetting and loss of plasticity"; in-context learning with fixed parameters adapts cheaply but cannot, by itself, match what weight updates buy. Their conclusion is the cleanest statement of the whole design space: "there is no good reason for restricting learning to being in-context or in-weights." So they keep both — parameters as slow weights, optimized context as fast weights — and let the fast half learn from textual feedback. On the map, it is the arrow spanning two quadrants.
Read together, the cohort says one thing: the substrate question is primary, and the cadence question is secondary. cl-agent occupies the system-space, episodic corner deliberately — the corner you can ship on a frozen model, behind a single coupling point, with everything it learns written in a language a human can read. For the full field-guide treatment of all of these and more, the survey is the place to go; here, the point is that the corner I chose is a defensible position, not a default.
[→ B4] Agents That Learn on the Job places this entire cohort on the same 2×2 and works through each in engineering detail. I am not reproducing that survey — I am arguing from one corner of it.
Four live systems attack agent learning from every corner of the map. OpenClaw-RL names the universal signal; ACuRL manufactures its own curriculum; Memento-Skills shows system-space can be powerful, not just cheap; Fast-Slow keeps both substrates. The substrate choice is the primary decision — and cl-agent's system-space corner is a position I chose, not a fallback.
§5 · The lineage says this was always the point
It would be easy to read all of this as a 2026 trend — a reaction to frozen API models and expensive fine-tuning. It is not. System-space continual learning is one of the oldest ideas in the field; what is new is only that it finally got cheap, because the model is frozen and the store is text. The lineage in Figure 5 is worth walking, because it tells you the substrate idea has been load-bearing for twenty years.
Start at the deepest root. In 2007, Sutton, Koop and Silver argued something counterintuitive: tracking can beat converging even in a stationary world. The folklore held that algorithms which keep adjusting (tracking the best current solution) rather than settling on a fixed answer (converging) only help when the world is non-stationary. They showed otherwise — on a stationary problem, a tracking algorithm could outperform any converging one. That is the deepest argument for an always-learning agent: continual adaptation is not a concession to a changing world; it can be the better policy even when the world holds still.
Three years later, NELL made it concrete at scale. Mitchell and colleagues proposed never-ending learning as a paradigm, and the Never-Ending Language Learner has, in their words, "been learning to read the web 24 hours/day since January 2010," accumulating a knowledge base of over 80 million confidence-weighted beliefs. NELL is the agentic continual-learning vision in its purest form — an always-on system whose learning lives in an externalized, inspectable knowledge base — and it predates the LLM era by a decade. The substrate was never the model; it was the store.
Then the cautionary result. In 2021, Dohare, Sutton and Mahmood demonstrated that backprop run continually degrades — the first clear demonstration that stochastic gradient descent alone cannot learn continually, and that plasticity must be actively maintained (their generate-and-test method continually re-injects randomness to preserve it). Read alongside the others, Continual Backprop is the negative space that makes the positive case: weight-space continual learning is not free; it fights a decay the system-space alternative simply does not have.
[← A4] The formal scaffolding for "an agent as a continual learner" lives in the published series — Abel and Barreto's A Definition of Continual Reinforcement Learning and the history-process foundations of Elelimy, White and Bowling. They give the notion of a never-converging agent a precise home; I lean on that home rather than rebuild it.
By 2023 the same shape had jumped from knowledge bases to embodied agents. Voyager (Wang et al., 2023) gave a frozen GPT-4 an ever-growing skill library of executable code and let it learn lifelong in Minecraft through blackbox queries alone — no fine-tuning — obtaining 3.3× more unique items and reaching key tech-tree milestones up to 15.3× faster than prior agents. The skills it wrote were compositional and, the authors note, alleviated catastrophic forgetting precisely because they lived outside the weights. That is the system-space thesis, demonstrated in a game world three years before I needed it for a codebase.
The throughline is unbroken: tracking over converging (2007), an externalized always-on store (2010), the demonstration that weights resist continual updating (2021), the formal definition of an agent that never stops learning (2023–25), and, in 2026, a frozen model with a Markdown memory that ships on a laptop. The reason this idea is suddenly everywhere is not that it is new. It is that a frozen foundation model plus a plain-text store made the oldest idea in continual learning finally cheap.
System-space continual learning is not a 2026 trend. Tracking-beats-converging (2007), NELL's externalized store (2010), and Continual Backprop's plasticity decay (2021) make the same point across two decades: keep learning in an inspectable store, not only in the weights. cl-agent is the cheap, frozen-model instance of that argument.
§6 · The two-substrate architecture
Now the claim that turns "deployable first" into "permanent." I do not think system-space learning is a phase we pass through on the way to weight-space learning. I think the mature architecture keeps both substrates, with different jobs, joined by a consolidation bridge (see Figure 6). The reason is a property that weight updates structurally cannot have and Markdown rules structurally do: auditability.
Consider what each substrate is good at. Weight-space learning is compounding (capability folds permanently into the model and is free at inference), slow (an update is a training run, not an edit), and opaque (you cannot point at the parameter that learned the Pydantic lesson, and you cannot remove it without another training run). System-space learning is the mirror image: auditable (every rule is a line of Markdown you can read), instant (a new rule takes effect on the next run with no training), and reversible (a bad rule is deleted in one edit). These are complementary, not competing, profiles.
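The "auditable" and "reversible" properties are concrete enough to show with the standard library. The sketch below assumes rules are stored one per line in a Markdown file; the helper names `rule_diff` and `revoke`, and the example of a bad rule, are hypothetical.

```python
import difflib


def rule_diff(before: str, after: str) -> str:
    """Unified diff between two versions of a rule file, for human review."""
    return "\n".join(difflib.unified_diff(
        before.splitlines(), after.splitlines(),
        fromfile="skills.md (before)", tofile="skills.md (after)", lineterm=""))


def revoke(rules: str, bad_rule: str) -> str:
    """Delete one rule line; every other line is left untouched."""
    return "\n".join(line for line in rules.splitlines() if line != bad_rule)


before = "- When a FastAPI route returns 422, check the Pydantic request model first."
after = before + "\n- Always pin pydantic<2 before running tests."  # hypothetical bad rule
print(rule_diff(before, after))
assert revoke(after, "- Always pin pydantic<2 before running tests.") == before
```

The diff is the audit record and the one-line deletion is the rollback. There is no comparable pair of operations on θ without a further training run.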
[← B5] The Memory Stack is the engineering treatment of how these layers fit together — including the consolidation step that selectively moves durable memories from the fast, external store toward slower, deeper storage, and the surrounding memory canon it surveys. I take its audit-constraint argument as a premise here.
The bridge between them is consolidation: not everything that lands in Σ should ever reach θ, but the rules that prove themselves — stable across many episodes, never overturned, broadly useful — are exactly the candidates to fold permanently into the weights, the way a skill rehearsed for months becomes reflex. The discipline is that consolidation should be earned in Σ first, where a rule is cheap to test and cheap to revoke, before it is paid for in θ, where it is expensive to install and expensive to remove.
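The promotion test described above can be written as a simple gate. This is a sketch under stated assumptions: the `RuleRecord` fields, the function name, and both thresholds are hypothetical stand-ins for "stable across many episodes, never overturned, broadly useful", not values taken from any deployed system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RuleRecord:
    text: str
    supporting_episodes: int  # episodes in which following the rule led to a pass
    times_overturned: int     # times the rule was edited or deleted after a failure
    repos_seen: int           # distinct repositories in which the rule helped


def consolidation_candidates(rules: list[RuleRecord],
                             min_support: int = 20,
                             min_repos: int = 2) -> list[RuleRecord]:
    """Keep only rules that have earned promotion in the store:
    many supporting episodes, never overturned, useful in several repositories."""
    return [r for r in rules
            if r.supporting_episodes >= min_support
            and r.times_overturned == 0
            and r.repos_seen >= min_repos]
```

Everything that fails the gate stays in Σ, where it remains cheap to test and cheap to revoke.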
And here is why the bottom layer is permanent rather than transitional. Suppose weight-space online learning becomes everything its proponents hope — cheap, fast, safe. It still cannot, by its nature, give you a per-update record you can read, approve, and revert. In any setting where you must answer "why did the agent do that, and how do I undo it" — which is to say, any production setting that matters — you will keep an auditable store of what the agent has learned, because the alternative is an agent whose acquired behavior you cannot inspect. The audit constraint is not a temporary limitation of today's tooling; it is a permanent requirement of deploying systems you are accountable for. That is the sense in which system-space continual learning is the auditable half of a two-substrate architecture production agents will keep.
Weight-space (compounding, slow, opaque) and system-space (auditable, instant, reversible) are complementary profiles, bridged by consolidation that promotes only rules already proven in Σ. Because accountability requires a learned-behavior record you can read and revert, the auditable layer does a job the weights structurally cannot — so it is permanent, not a stopgap.
§7 · Open problems — the research agenda
A manifesto that only celebrates its own design is propaganda. So here is where the substrate is weak, stated as the four papers I most want to write next — and, in keeping with this series, each with the observation that would prove it a dead end (see Figure 7). These are not rhetorical caveats; they are the open front of the programme.
| Open problem | What breaks if ignored | Falsifier — what would kill the line |
|---|---|---|
| 1 · Rule conflict & staleness (the headline problem) | As Σ grows, distilled rules contradict each other and age out as the repo changes; a stale rule is worse than no rule. | If conflict and staleness cannot be bounded as Σ scales, that is, if a growing store reliably degrades the agent, the minimal substrate does not scale. |
| 2 · Eval protocol for continual agents [← B2] | Without a shared protocol, "it got better" is unfalsifiable; baseline-vs-integrated is a start, not a standard. | If no protocol can separate genuine cross-session learning from prompt-context luck, every claim in this essay — including mine — is untestable. |
| 3 · Cross-repo transfer | If rules are purely repo-local, Σ is sophisticated caching, not learning; the interesting claim is transfer. | If distilled rules show zero transfer across repositories — if every Σ must be rebuilt from scratch — the substrate is a cache, and should be sold as one. |
| 4 · Rule expiry / principled forgetting | A store that only grows eventually buries the useful rule under dead ones; forgetting is half of memory. | If rules must be hand-pruned forever — if no principled expiry policy works — the substrate does not self-maintain and will not scale operationally. |
Rule conflict and staleness is the one that keeps me up. A store that grows monotonically will, eventually, contain two rules that disagree, or a rule that was true before a refactor and is false after it. Distillation today is rule-based and conservative, which limits the damage but does not solve it; the real problem is keeping a growing Σ coherent. An evaluation protocol is the second, and it is not only my problem — it is the field's. CL-Bench is a genuine advance, and the baseline-versus-integrated design in cl-agent is a local answer, but the community does not yet have a shared standard for "this agent learned across sessions" that an adversary could not game. Cross-repo transfer is the third and the most ambitious: do rules learned on one codebase help on another, or is each Σ hostage to its repository? If nothing transfers, the honest framing is caching, and I should say so. Rule expiry is the fourth and the most overlooked — forgetting is not the failure of memory, it is half of its function, and a substrate with no principled way to retire dead rules will eventually drown in them.
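To make the expiry problem tangible, here is a deliberately naive baseline: retire a rule once it has gone unconfirmed for too long or has been contradicted too often. It is a strawman, not a proposed solution; the `Rule` fields, the function name, and both thresholds are hypothetical, and a principled policy is exactly what this open problem asks for.

```python
from dataclasses import dataclass


@dataclass
class Rule:
    text: str
    last_confirmed_run: int  # index of the latest run in which the rule helped
    contradictions: int = 0  # runs in which following the rule led to a failure


def retire_stale(rules: list[Rule], current_run: int,
                 max_age: int = 50, max_contradictions: int = 2):
    """Split rules into (kept, retired) using age and contradiction counts only."""
    kept, retired = [], []
    for rule in rules:
        too_old = current_run - rule.last_confirmed_run > max_age
        too_wrong = rule.contradictions >= max_contradictions
        (retired if too_old or too_wrong else kept).append(rule)
    return kept, retired
```

A fixed age cutoff will retire rules that are rarely needed but still true, and it will keep a rule that a refactor falsified yesterday until the contradictions pile up. That gap is the research question.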
I am stating these as my next papers because that is what they are. The point of a falsifiable programme is that its author can name the experiments that would end it, and mean it.
[← B2] The Agent Evaluation Crisis — open problem #2 is a special case of its thesis: today's agent evaluations rarely measure what they claim, and a continual agent, whose whole point is to change over time, is the hardest case of all. The shared protocol I need is the one B2 argues the field is still missing.
The substrate's open front is four problems, each falsifiable: keeping a growing Σ coherent (conflict/staleness), proving cross-session gains under an adversary-proof protocol [← B2], transferring rules across repositories, and retiring dead rules. If conflict cannot be bounded, transfer is zero, or expiry is impossible, the minimal substrate does not scale — and I would have to say so.
§8 · The long bet, restated
This is the last essay in the series, so let me state the bet as plainly as I can, in the form the series demands — with its falsifier and a date.
The bet is that system-space continual learning is a permanent layer of agent architecture, not a workaround we are stuck with until weight-space learning matures. The whole argument reduces to one load-bearing claim: the value of Σ is not that it is the only thing you can do to a frozen model today, but that it is auditable — readable, reviewable, reversible — and accountability for deployed agents will always require that. If that is right, the two-substrate architecture is the steady state, not a way station (see Figure 8).
- The bet. Production agents keep a system-space layer (episode capture → replay → rule distillation) permanently, as the auditable half of a two-substrate architecture, even after weight-space online learning is cheap.
- The falsifier. If, by 2029, a deployed agent can learn online in weight-space with per-update inspectability and one-command rollback equal to a Markdown diff, so that θ becomes as auditable, instant, and reversible as Σ, then the auditable advantage vanishes, the two substrates collapse into one, and I was wrong: system-space was a stopgap.
- The review. I will revisit this in 2029, against that specific test, and report the result whichever way it falls.
I want to be precise about what would change my mind, because that is the discipline this series has tried to model. The falsifier is not "weight-space learning gets good." It is narrower and sharper: weight-space learning becomes as auditable as a text file — every update inspectable before it lands, every update reversible in one command. If that arrives, the reason to keep a separate system-space layer evaporates, and the honest move is to drop it. Until it arrives, the auditable layer is doing a job nothing else does, and dropping it would mean deploying agents whose learned behavior you cannot read. I will check in 2029.
And now the loop closes. This series opened, in the published work, with the Big World Hypothesis — Javed and Sutton's claim that for real problems "the world is multiple orders of magnitude larger than the agent," so the agent is permanently underfit and convergence to an optimal policy is the wrong goal. The consequence the A-series drew was stark: continual learning is not a feature; it is a requirement. Everything in this capstone is the engineering completion of that sentence. If the world is bigger than the agent — permanently, structurally bigger — then an agent that stops learning when it is deployed has not been deployed; it has been frozen mid-thought. Continual learning is not a feature of a deployed agent. It is the definition of deployment. The C-series ends exactly where the A-series began.
[← A3] The Big World Hypothesis: Why Continual Learning Is Inevitable — the published argument that the world is multiple orders of magnitude larger than any agent, so the agent is permanently underfit and must keep learning. This capstone is its engineering completion: if the big world is real, continual learning is what "deployed" means.
The bet: production agents keep an auditable system-space layer permanently. The falsifier: weight-space learning becomes as inspectable and reversible as a Markdown diff by 2029 — then the two substrates collapse to one and I was wrong. The review: 2029. The close: if the big world is bigger than the agent [← A3], continual learning is not a feature — it is the definition of deployment.
References
- Goswami, D. (2026). cl-agent: A Continual-Learning Substrate for Coding Agents — Episode Capture, Replay, and Rule-Based Distillation for Cross-Session Improvement Without Fine-Tuning. Preprint, Independent Research, April 2026. PDF · github.com/dattgoswami/cl-agent
- Goswami, D. (2026). cl-agent (earlier draft). Preprint, Independent Research, April 2026. [Superseded by the current draft; cited as revision history in §3.]
- Wang, Y., Chen, X., Jin, X., Wang, M., & Yang, L. (2026). OpenClaw-RL: Train Any Agent Simply by Talking. Preprint, Princeton University. arXiv:2603.10165.
- Xue, T., Liao, Z., Shi, T., Wang, Z., Zhang, K., Song, D., Su, Y., & Sun, H. (2026). Autonomous Continual Learning for Environment Adaptation of Computer-Use Agents (ACuRL). Preprint, The Ohio State University. arXiv:2602.10356.
- Memento-Team (2026). Memento-Skills: Let Agents Design Agents. Preprint. arXiv:2603.18743.
- Tiwari, R., Sareen, K., Agrawal, L. A., Gonzalez, J. E., Zaharia, M., Keutzer, K., Dhillon, I. S., Agarwal, R., & Khatri, D. (2026). Learning, Fast and Slow: Towards LLMs That Adapt Continually. Preprint, UC Berkeley / Mila / UT Austin. arXiv:2605.12484.
- Asawa, P., Glaze, C. M., Orlanski, G., Ramakrishnan, R., Xu, B., Biswal, A., Chen, V. S., Sala, F., Zaharia, M., & Gonzalez, J. E. (2026). Continual Learning Bench: Evaluating Frontier AI Systems in Real-World Stateful Environments. Preprint, UC Berkeley / Snorkel AI / University of Wisconsin–Madison. arXiv:2606.05661.
- Wang, G., Xie, Y., Jiang, Y., Mandlekar, A., Xiao, C., Zhu, Y., Fan, L., & Anandkumar, A. (2023). Voyager: An Open-Ended Embodied Agent with Large Language Models. Transactions on Machine Learning Research. arXiv:2305.16291.
- Dohare, S., Sutton, R. S., & Mahmood, A. R. (2021). Continual Backprop: Stochastic Gradient Descent with Persistent Randomness. Preprint, University of Alberta / Amii. arXiv:2108.06325.
- Mitchell, T., Cohen, W., Hruschka, E., Talukdar, P., Betteridge, J., Carlson, A., et al. (2015). Never-Ending Learning. Proceedings of the 29th AAAI Conference on Artificial Intelligence (AAAI).
- Sutton, R. S., Koop, A., & Silver, D. (2007). On the Role of Tracking in Stationary Environments. Proceedings of the 24th International Conference on Machine Learning (ICML).