Agents That Learn on the Job

§1 · The amnesia bill

Your coding agent solved this yesterday. It learned that the repo pins dependencies with uv, that the test runner needs an environment variable set, that the linter rejects relative imports — and it learned all of it the hard way, by failing, reading the errors, and recovering. Then the session ended, and every byte of that hard-won knowledge was discarded. This morning it starts over: same repo, same conventions, same failures, same recovery. You are paying — in tokens, in wall-clock, in your own attention — to re-derive yesterday's lesson. That recurring charge is the amnesia bill, and most deployed agents pay it in full, every run, forever.

The previous article left the agent in a factory: environments are the bottleneck, and the teams that industrialize environment supply will own agent training the way data-pipeline teams owned supervised learning. But that factory was an offline, resettable pipeline — build the worlds, run the loop, ship the agent. Deployment breaks the reset button. In production the environment is no longer a pipeline you control; it is a stream you cannot rewind, and the most valuable experience arrives after training, one interaction at a time.

[← 3] Environments Are the Bottleneck — the factory that manufactures experience offline. This article cuts that loop open and lets it run online, against a stream that never resets.

Here is the strange part: the discard is a choice, not a constraint. Every interaction a deployed agent has is already a labeled example of what works in exactly the distribution that matters. The user's reply tells you whether the answer landed. The test suite's exit code tells you whether the patch was right. The terminal's next prompt tells you whether the command did what you meant. The experience stream the continual-learning literature spent two decades wishing for is flowing past every production agent right now (see Figure 1). Almost nobody drinks from it.

We can finally measure the gap. Asawa et al. (2026) built Continual Learning Bench precisely to ask whether frontier systems improve across real, stateful sessions at all — not whether they are clever in one shot, but whether they get better at your environment the way a new hire does. For most deployed agents today the honest answer is that they do not. They are as good on day 100 as on day 1.

This article is about closing that loop, and its thesis is one sentence, stated plainly enough to be wrong: the field is closing the on-the-job learning loop from both ends — weight-space (online RL from the signals the deployment already emits) and system-space (skills, episodes, and rules kept as evolving memory) — and the surprise is that the first deployable wins are not gradient-based. You can ship the system-space half today, on top of a frozen model you do not own.

Two symbols organize everything, so we fix them first. Write θ for the model's weights; learning in weight-space means changing θ. Write Σ for the skill/rule store — durable memory that lives outside the weights, in files, databases, or prompts, and outlives any single episode. (These are the series' symbols [← 1]: M is the model, H the harness around it, E the environment, τ a trajectory or episode.) The whole article is a claim about where learning gets deposited — into θ or into Σ — and how often.

That gives us a map (see Figure 2). One axis is substrate: does the learning change the weights (weight-space) or an external store (system-space)? The other is cadence: does the update happen online, per interaction, off the live stream, or episodically, in batches after episodes complete? Every system in this article lands somewhere on that 2×2, and the map's punchline is its bottom row: system-space learning needs no gradient access and is deployable on top of a model you cannot retrain.

Figure 2: The learning-substrate map. Vertical axis: substrate (weight-space θ vs system-space Σ). Horizontal axis: cadence (online vs episodic). Every system in this article lands here. Why it matters: the bottom row — system-space — ships on a frozen model, so it is where most teams should start; Fast-Slow shows the strongest systems span the axis rather than picking a corner.

Key Takeaway 1

Deployed agents re-derive the same context every run — the amnesia bill — while the experience stream that would fix it flows past unused. Continual Learning Bench (Asawa et al., 2026) makes the gap measurable: most systems are no better on day 100 than day 1. The loop can be closed in weight-space (changing θ) or system-space (an external store Σ), online or episodically — and the system-space half ships today.

§2 · The signal was always there

What, exactly, is the signal? OpenClaw-RL (Wang et al., 2026) gives it a name and a recovery method. The observation is almost embarrassingly simple: every agent action is followed by a next-state — the user's reply, the tool's output, the terminal or GUI change — and that next-state is a learning signal that no prior agentic-RL system bothered to recover as a live, online source. Personal conversations, terminal sessions, GUI clicks, SWE tasks, tool-call traces: they look like different training problems, but they are the same problem. An action was taken; the world responded; the response says something about the action. One policy can learn from all of them in one loop.

The recovery splits the next-state into two kinds of information (see Figure 3). Some of it is evaluative: how good was the action? OpenClaw-RL extracts this as a scalar reward through a process-reward-model judge — a learned critic that scores the ongoing interaction. But the richer part is directive: not just how good the action was, but how it should have been different. A failing test does not only say "wrong"; its error message says what was wrong. OpenClaw-RL recovers this through Hindsight-Guided On-Policy Distillation: it reads textual hints out of the next state, builds an enhanced teacher context from them, and supervises the policy with token-level directional advantage — a signal strictly richer than any single scalar.
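To make the fork concrete, here is a minimal sketch of the data shape only. It is not OpenClaw-RL's method: the scalar there comes from a learned process-reward-model judge and the directive signal from hindsight-guided distillation, whereas this toy uses an exit code and a keyword filter as stand-ins. The `NextState` schema and both function names are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class NextState:
    """Whatever the world returned after one action (hypothetical schema)."""
    text: str                        # user reply, tool output, or terminal text
    exit_code: Optional[int] = None  # set only for commands and test runs


def evaluative_signal(ns: NextState) -> float:
    """Toy stand-in for a judge: a scalar saying how good the action was."""
    if ns.exit_code is not None:
        return 1.0 if ns.exit_code == 0 else -1.0
    return 0.0  # no checkable outcome here; a real system would ask a learned judge


def directive_hints(ns: NextState) -> List[str]:
    """Toy stand-in for hint extraction: keep the lines that say what went wrong."""
    markers = ("error", "expected", "not found")
    return [
        line.strip()
        for line in ns.text.splitlines()
        if any(m in line.lower() for m in markers)
    ]


ns = NextState(
    text="collected 3 items\nImportError: attempted relative import\n1 failed",
    exit_code=1,
)
reward = evaluative_signal(ns)  # one number: the action was bad
hints = directive_hints(ns)     # text: what was bad about it
```

The same failing test run yields one scalar and one line of text, and the text is the part that says how the action should have differed.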

The systems lesson is in the schedule. OpenClaw-RL is asynchronous by construction: the model serves live requests, the judge scores interactions, and the trainer updates the policy concurrently, with zero coordination overhead between them. That is what makes the weight-space half of the loop a deployment rather than a research artifact — you do not pause serving to train; you fold training into serving. On the map, OpenClaw-RL sits in the top-left: weight-space, online, learning off the live stream.
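The decoupling can be sketched with three coroutines joined by queues. This is an illustrative toy, assuming `asyncio` queues as the hand-off; it says nothing about how OpenClaw-RL actually implements its serving, judging, or training components, and every function name and the even/odd "world" are hypothetical.

```python
import asyncio


async def serve(requests, to_judge):
    """Answer live requests and forward each (action, next_state) pair."""
    for req in requests:
        action = f"answer:{req}"                        # stand-in for a model call
        next_state = "ok" if req % 2 == 0 else "error"  # stand-in for the world's response
        await to_judge.put((action, next_state))
    await to_judge.put(None)  # sentinel: no more traffic


async def judge(to_judge, to_trainer):
    """Score interactions as they arrive, without blocking the server."""
    while (item := await to_judge.get()) is not None:
        action, next_state = item
        reward = 1.0 if next_state == "ok" else -1.0  # stand-in for a learned judge
        await to_trainer.put((action, reward))
    await to_trainer.put(None)


async def train(to_trainer, applied):
    """Consume scored interactions; appending stands in for a policy update."""
    while (item := await to_trainer.get()) is not None:
        applied.append(item)


async def main(requests):
    to_judge, to_trainer = asyncio.Queue(), asyncio.Queue()
    applied = []
    await asyncio.gather(
        serve(requests, to_judge),
        judge(to_judge, to_trainer),
        train(to_trainer, applied),
    )
    return applied


updates = asyncio.run(main(range(4)))
```

None of the three stages waits for another to finish its whole workload; each only waits on its own inbox, which is the property that lets training fold into serving.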

Two honest caveats keep this from being magic. First, the directive signal is only as good as the next-state's information content: a curt "no" teaches little, a stack trace teaches a lot. Second, learning online in weight-space means you own the training loop and accept its risks — reward hacking, drift, and the cost of a judge that is itself a model. This is the harder half of the loop, which is exactly why the rest of the article spends most of its time on the half you can ship without any of it.

Figure 3: How OpenClaw-RL (Wang et al., 2026) recovers a learning update from the live stream. Each next-state forks into an evaluative signal (scalar reward via a PRM judge) and a richer directive signal (token-level advantage via hindsight-guided distillation). Why it matters: it turns ordinary deployment traffic into online weight-space learning — the top-left corner of Figure 2 — without pausing serving.

Key Takeaway 2

Every action produces a next-state, and the next-state carries two signals: evaluative (how good — a scalar reward via a judge) and directive (how to differ — token-level advantage via hindsight-guided distillation). OpenClaw-RL (Wang et al., 2026) makes this a live, asynchronous online source, serving and training at once. It is the weight-space, online corner of the map — powerful, and the hardest corner to deploy safely.

§3 · Self-generated curricula

OpenClaw-RL assumes the stream hands you gradeable interactions. What if your environment is unfamiliar and the tasks worth learning do not yet exist? Then the agent has to manufacture its own curriculum. ACuRL (Xue et al., 2026) does exactly this for computer-use agents, which live in the most unforgiving version of the problem: digital environments are diverse and dynamic, so a deployed agent constantly meets unseen apps and distribution shift, and high-quality, environment-grounded training data is exactly what it does not have.

ACuRL — Autonomous Curriculum Reinforcement Learning — adapts an agent to a specific environment with zero human-annotated data (see Figure 4). The agent first explores to gather initial experience. Then a curriculum generator uses that experience, plus feedback from the previous training round, to synthesize new tasks calibrated to the agent's current ability — not too easy, not yet impossible. Rewards come from CUAJudge, an automatic evaluator the authors build so the loop never waits on a human verdict. The result is a closed loop that converts raw exposure to an environment into a stream of gradeable, difficulty-appropriate practice.
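The "not too easy, not yet impossible" calibration can be illustrated with a small filter. This is a sketch under loud assumptions: ACuRL's generator synthesizes new tasks from experience, while this toy only selects among existing ones by their success rate in the previous round. The thresholds, task names, and the `next_round` function are hypothetical.

```python
from typing import Dict, List


def next_round(success_rate: Dict[str, float],
               low: float = 0.2, high: float = 0.8) -> List[str]:
    """Keep tasks the agent sometimes solves: not mastered, not yet hopeless."""
    return sorted(t for t, rate in success_rate.items() if low <= rate <= high)


# Judge feedback from the previous round: task -> fraction of attempts that passed.
feedback = {
    "open settings": 1.0,   # mastered, no longer informative
    "export as pdf": 0.5,   # inside the learnable band
    "mail merge": 0.0,      # out of reach for now
    "rename sheet": 0.3,    # inside the learnable band
}
curriculum = next_round(feedback)
```

Every success rate in `feedback` comes from the automatic judge, so a gameable judge corrupts the curriculum at the same step where it corrupts the reward.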

The conceptual move is that the agent generates the very thing B3 said was scarce — environments and tasks — on demand, grounded in the environment it is actually deployed against. Where Endless Terminals industrialized environment supply offline [← 3], ACuRL pushes a slice of that factory into the deployment itself: the agent that meets a new world writes its own homework for that world. On the map it stays in weight-space — it is still RL, still changing θ — but its cadence is iterative rather than per-interaction; it learns in rounds, each round defined by the curriculum the last one earned.

The dependency to respect is the judge. A self-generated curriculum graded by an unreliable evaluator will happily teach the agent to satisfy the evaluator. CUAJudge is load-bearing; the moment your automatic reward is gameable, autonomous curricula amplify the gaming rather than the skill. Self-generated data is leverage, and leverage cuts both ways.

Figure 4: ACuRL's autonomous curriculum loop (Xue et al., 2026). Exploration becomes experience; a generator turns experience into difficulty-matched tasks; RL trains on them with an automatic judge (CUAJudge); the judge's feedback shapes the next round. Why it matters: the agent manufactures its own environment-grounded training data — moving part of B3's factory inside the deployment — but the loop is only as honest as its judge.

Key Takeaway 3

When the tasks worth learning do not exist yet, the agent can synthesize them. ACuRL (Xue et al., 2026) adapts computer-use agents to a specific environment with zero human data: explore, generate a difficulty-matched curriculum from past experience, grade it with an automatic judge (CUAJudge), repeat. It moves a slice of B3's environment factory into the deployment — but the whole loop is only as trustworthy as its judge.

§4 · Skills, episodes, and rules

Everything so far changes θ. Now cross to the bottom of the map — the half you can deploy without ever touching the weights. System-space learning keeps a store Σ outside the model and grows it from experience; the frozen model is a fixed function, and all the accumulation happens in files and prompts it reads. The surprise of this article is that this is where the first shippable wins are, and three systems show the same shape from three directions (see Figure 5).

Memento-Skills (Memento-Team, 2026) makes Σ a library of skills written as structured markdown — behavior plus context, stored as files. Its Read–Write Reflective Learning loop has two phases: in the read phase a skill router selects the most relevant skill for the current state; in the write phase the agent updates and expands the library from new experience. Crucially, this is continual learning "without updating LLM parameters" — all adaptation is the evolution of externalized skills. On the General AI Assistants benchmark and Humanity's Last Exam, the authors report sustained gains: 26.2% and 116.2% relative improvements in overall accuracy, respectively, entirely from a growing skill store on a fixed model.
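A minimal sketch of the read and write phases, assuming the library is a dictionary of markdown strings. The word-overlap router below is a deliberately crude stand-in for whatever router Memento-Skills uses, and the function names and skill contents are hypothetical.

```python
from typing import Dict, Optional


def tokens(text: str) -> set:
    """Lowercased whitespace tokens; enough for a toy relevance score."""
    return set(text.lower().split())


def read_skill(library: Dict[str, str], state: str) -> Optional[str]:
    """Read phase: name of the skill sharing the most words with the state."""
    best, best_score = None, 0
    for name, body in library.items():
        score = len(tokens(body) & tokens(state))
        if score > best_score:
            best, best_score = name, score
    return best  # None when nothing overlaps


def write_skill(library: Dict[str, str], name: str, lesson: str) -> None:
    """Write phase: extend an existing skill or create a new one."""
    library[name] = (library[name] + "\n" + lesson) if name in library else lesson


library = {"deps": "# deps\nthis repo pins dependencies with uv"}
write_skill(library, "tests",
            "# tests\nset the environment variable before running pytest")
chosen = read_skill(library, "pytest fails because an environment variable is missing")
```

The model never changes in this loop; only `library` does, which is why the same pattern works on a model served by someone else.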

cl-agent (Goswami, 2026) targets the same amnesia from the coding side, and names the diagnosis exactly: LLM coding agents are "amnesic across sessions," starting each run without prior attempts, failed tests, repair patterns, or repository conventions. Its substrate has four responsibilities — capture, replay, distill, evaluate. Episodes are recorded as append-only JSONL; a replay buffer surfaces relevant past failures and successes before each new run; a rule-based distillation step compiles them into inspectable markdown artifacts (skills.md, dreams.md, program.md); and an evaluation layer keeps operational metrics separate from research claims. It assumes no access to model internals — context injection is the only coupling point — and makes a deliberately narrow, falsifiable claim: a thin capture–replay–distill substrate can measurably improve a coding agent across repeated tasks in a narrow domain, without fine-tuning.
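Three of those four responsibilities fit in a few lines of standard-library Python. This is a sketch of the pattern, not cl-agent's code: the episode fields (`repo`, `passed`, `lesson`), the replay rule, and the "seen at least twice" distillation threshold are all assumptions, and the evaluation layer is omitted.

```python
import json
import os
import tempfile
from collections import Counter


def capture(path, episode):
    """Append one episode to an append-only JSONL log."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(episode) + "\n")


def load(path):
    """Read every recorded episode back, oldest first."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def replay(episodes, repo, k=3):
    """Surface the k most recent episodes from the same repo before a new run."""
    return [e for e in episodes if e["repo"] == repo][-k:]


def distill(episodes, min_count=2):
    """Rule-based distillation: a lesson seen min_count times becomes a rule."""
    counts = Counter(e["lesson"] for e in episodes if e.get("lesson"))
    rules = [f"- {lesson}" for lesson, n in sorted(counts.items()) if n >= min_count]
    return "# skills\n" + "\n".join(rules)


log_path = os.path.join(tempfile.mkdtemp(), "episodes.jsonl")
capture(log_path, {"repo": "app", "passed": False, "lesson": "use absolute imports"})
capture(log_path, {"repo": "app", "passed": True, "lesson": "use absolute imports"})
capture(log_path, {"repo": "docs", "passed": True, "lesson": None})

episodes = load(log_path)
relevant = replay(episodes, "app")  # what to show the agent before its next run
skills_md = distill(episodes)       # inspectable markdown, ready for context injection
```

The only coupling to the model is the string in `skills_md`, which is injected into context; nothing here needs access to weights.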

Voyager (Wang et al., 2023) is the early proof that the shape generalizes. Its ever-growing skill library of executable code, fed by an automatic curriculum and iterative self-verification, let a GPT-4 agent learn lifelong in Minecraft through blackbox queries — no fine-tuning — obtaining 3.3× more unique items, traveling 2.3× longer distances, and unlocking key tech-tree milestones up to 15.3× faster than prior state of the art. The skills it wrote were reusable and compositional and, the authors note, alleviated catastrophic forgetting precisely because they lived outside the weights.

Three systems, one pattern: capture experience, surface the relevant slice, distill it into a durable store the frozen model reads next time. They differ in cadence. Skill retrieval is online — the router picks a skill for the current state on every run — but distillation is episodic, happening after episodes close. So these systems straddle the bottom row of the 2×2: online on the read side, episodic on the write side.

A fourth system marks the boundary. AC/DC (Dai et al., 2026) coevolves a population of LLMs (via model merging) with the tasks that test them (via synthetic generation), discovering an archive of experts that covers more capability than larger models while using less GPU memory — improvement in a single open-ended run rather than a restarted training job. It is partly weight-space (merging changes parameters) and partly system-space (the archive is an external store that grows), which is why it sits in the episodic column spanning both rows. And it points straight at the next section's question: why commit to one substrate at all?

Figure 5: The system-space pattern shared by Memento-Skills, cl-agent, and Voyager. Capture each episode, replay the relevant slice, distill it into Σ, inject Σ into the next run — all on a frozen model. Why it matters: this is the deployable-today half of Figure 2, and it requires no access to θ.

Key Takeaway 4

You can close the loop without touching θ. Memento-Skills (Memento-Team, 2026) grows a markdown skill library — 26.2% and 116.2% relative accuracy gains on GAIA and Humanity's Last Exam with no parameter updates; cl-agent (Goswami, 2026) captures, replays, and distills coding episodes into inspectable rules; Voyager (Wang et al., 2023) wrote an executable skill library that learned lifelong in Minecraft (3.3× / 2.3× / 15.3× over prior SOTA). The shared pattern — capture → replay → distill into Σ — is the deployable half of the map.

§5 · Learning fast and slow

If one half of the loop changes θ and the other grows Σ, the obvious question is why you would restrict yourself to either. Tiwari et al. (2026) argue you should not. Their framing: updating parameters lets a model absorb task-specific information but risks catastrophic forgetting and loss of plasticity; in-context learning on fixed parameters adapts cheaply and fast but cannot, by itself, match what parameter updates buy. "There is no good reason for restricting learning to being in-context or in-weights." Humans, they note, learn on multiple timescales — fast and slow — and so should agents.

Their Fast-Slow framework makes the substrate axis of our 2×2 literal (see Figure 6). The slow weights are θ, updated by RL and kept close to the base model so general reasoning behaviors persist. The fast "weights" are an optimized context that learns from textual feedback to absorb the task-specific part. The division of labor is the point: let the cheap, reversible, system-space layer carry what is task-specific and volatile, and let the expensive, weight-space layer change slowly and rarely. Fast-Slow Training is up to 3× more sample-efficient than slow learning (RL) alone across reasoning tasks — it matches RL's reward with up to 3× fewer rollouts, and then reaches a higher ceiling.

This is the cleanest single argument in the article, because it refuses the binary the map might seem to impose. Plotted on Figure 2, Fast-Slow does not sit in a quadrant; it is an arrow connecting two of them — fast/system-space online at one end, slow/weight-space episodic at the other — and its result says the connection is worth more than either endpoint alone. The right question for a deployed agent is not "weights or system?" but "which timescale should carry this particular thing I just learned?"

A parallel line of short-horizon-RL work points the same way — keep the heavy weight updates rare while a lighter loop carries continual improvement — and the convergence is the signal. The field keeps rediscovering that the durable answer is a two-speed system.

Figure 6: The two timescales of Fast-Slow learning (Tiwari et al., 2026). Slow weights (θ, via RL) change rarely and hold general capability; a fast context layer changes constantly and holds task-specific information from textual feedback. Why it matters: splitting learning by timescale yields up to 3× the sample efficiency of RL alone — the strongest case for spanning, not picking, a substrate.

Key Takeaway 5

Stop treating weight-space and system-space as a choice. Tiwari et al. (2026) split learning by timescale — slow weights (θ, via RL) carry general capability, fast context (system-space) carries task-specific information from textual feedback — and get up to 3× the sample efficiency of RL alone. On the 2×2, the best systems are not points in one quadrant; they are arrows that span the substrate axis.

§6 · Why this is hard: the old truths

None of this is new, and pretending it is would be the fastest way to repeat old mistakes. On-the-job learning has a deep literature, and that literature is mostly a catalog of why it is hard. Four results are worth holding (see Figure 7).

The first is plasticity. Train a network continually and it slowly loses the ability to learn — not its memories of old tasks, but its capacity to acquire new ones. Dohare et al. (2024) established this at scale and showed that ordinary gradient descent does not fix it; their continual backprop, which keeps injecting fresh randomness into the network, does. The full treatment is in the published series [← A1]; the load-bearing point here is that an agent learning forever in weight-space will, without intervention, grind to a halt. Online weight-space learning is not free; it decays.

The second is the deepest, and the most counterintuitive. Sutton, Koop & Silver (2007) showed that tracking — continually adjusting your solution rather than converging to a fixed one — can beat any converging algorithm even in a stationary world, where conventional wisdom says convergence is exactly what you want. Their Black-and-White problem makes it concrete: a learner that never stops adjusting outperforms one that finds the "right" answer and freezes. The implication for deployed agents is radical: continual learning is not a concession to non-stationarity; it can be the better policy even when nothing is changing. Never freezing is not a patch — it is sometimes the optimum.

The third is resource constraint. Tamborski & Abel (2025) study agents navigating unknown environments under a fixed memory budget, and find the budget itself reshapes optimal behavior: a memory-constrained agent faces a genuine dilemma about how much memory to spend modeling the world versus planning within it, and the right split changes across MCTS- and DQN-based learners and across episodic and continual settings. Σ is never free; how you allocate it is a first-class decision, not an implementation detail.

The fourth closes the loop on cost. Orenstein et al. (2025) exposed agents to the cost of their own computation and let them decide when to spend it. On the Arcade Learning Environment, with the same training budget, those agents performed better on 75% of games while using 3× less compute on average. An agent that learns on the job should also learn when not to think. Reasoning about your own compute is itself something to be learned.

Together these are the constraints any on-the-job system inherits: weight-space learning decays (plasticity), the right target is a moving one (tracking), memory is a budget (allocation), and compute is a decision (efficiency). The recent systems in this article are not naïve about these — they are engineered responses to them.

Figure 7: The old truths. The schematic curves show tracking beating convergence even in a stationary world (Sutton, Koop & Silver, 2007); the side panel lists the other three constraints — plasticity decay, memory budgets, and compute as a learnable decision. Why it matters: on-the-job learning is a regime with its own physics, and every system here is an answer to one of these forces.

Key Takeaway 6

On-the-job learning is old, and the old results are warnings. Weight-space learning loses plasticity without intervention (Dohare et al., 2024 [← A1]); tracking can beat converging even in a stationary world (Sutton, Koop & Silver, 2007); memory is a budget that reshapes behavior (Tamborski & Abel, 2025); and compute is a decision an agent can learn — better on 75% of games at 3× less compute (Orenstein et al., 2025). Continual learning is not a feature bolted on; it is the regime, with its own physics.

§7 · The lineage

It is worth saying plainly: on-the-job learning was never a feature request. It was the definition of intelligence the field started from, then quietly abandoned when convergence-on-a-benchmark turned out to be easier to publish (see Figure 8).

Run the lineage forward. In 2007, the tracking result already said always-learning could be optimal. In 2010, NELL — the Never-Ending Language Learner (Mitchell et al., 2015) — put the philosophy into a running system: it has read the web 24 hours a day since January 2010, accumulating a knowledge base of over 80 million confidence-weighted beliefs and learning, along the way, the features that let it read. NELL is the agentic-continual-learning vision in its pre-LLM form: a system whose entire point is that it never stops.

By 2023 the idea had a formalism. A Definition of Continual Reinforcement Learning recast the agent's goal away from finding a solution and toward never stopping — the best agents, in this view, are the ones that keep adapting indefinitely. The published series develops this in depth [← A3]; the point for us is that it makes "learning forever" a precise object rather than a slogan. By 2025 the foundations were being rebuilt to match: a history-process reformulation argued that the Markov-decision-process scaffolding and sum-of-rewards metric inherited from convergence-era RL are actively wrong for the continual case, and proposed replacing them.

What changed in 2026 is not the idea. It is that the idea finally has a substrate worth running it on. NELL had to learn the world from scratch; OpenClaw-RL, ACuRL, Memento-Skills, and cl-agent start from a frozen model that already knows most of the world and only has to learn your part of it. The Sutton-and-CMU lineage — tracking, never-ending learning, continual-RL definitions — spent two decades describing a kind of agent the technology could not yet build. Production LLM agents are the first systems for which that description is also a deployment plan.

Figure 8: The lineage of never-stopping agents, drawn with later milestones higher to show acceleration. From the tracking result (2007) and NELL (2010) through the formal definition of continual RL (2023 [← A3]) and the 2025 history-process critique to 2026's production agents. Why it matters: the idea is the field's oldest; only the substrate is new.

Key Takeaway 7

On-the-job learning is the field's oldest ambition, not a new one. Tracking (2007), NELL's never-ending learner (Mitchell et al., 2015; 80M+ beliefs since 2010), the formal definition of continual RL (2023 [← A3]), and the 2025 history-process critique of convergence-era foundations all describe an agent that never stops. The 2026 systems are the first to inherit a model that makes that description shippable.

§8 · The deploy-today playbook

Reduce everything to a decision you can make on Monday. You have a deployed agent paying the amnesia bill. Which corner of the 2×2 do you build first? The map resolves into a path (see Figure 9), and the path is mostly determined by two questions.

First: can you safely update the model's weights in production? For most teams the answer is no — you serve a model you do not own, or you cannot accept the risk and cost of an online training loop with a judge in it. That answer alone puts you on the bottom row, system-space — and that is good news, because §4 showed the bottom row ships on a frozen model.

Second: do you have a trustworthy automatic verdict on the live stream — a test suite, a checkable outcome, a judge you would bet on? If not, start in the bottom-right: system-space, episodic. Capture episodes, distill them offline into inspectable skills and rules, inject them next run. This is the lowest-risk, highest-certainty intervention, and it is shippable on a frozen model this week. If you do have a trustworthy verdict, add the bottom-left: surface relevant past episodes online, before each run, so the agent walks in already knowing your environment. Only once both work — and only if you own the training loop and can afford a judge — should you graduate to the top-left, weight-space online, OpenClaw-RL-style, where the gains are largest and the failure modes are worst.

The order is the whole artifact: episodic system-space first, then online system-space, then weight-space — cheap and safe before expensive and sharp. It is the same shape B3's recipe had [← 3]: exhaust the cheap interventions before reaching for the expensive one.
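The two questions reduce to a small function. This is one reading of the path described above, written as a sketch; the function name, argument names, and step labels are illustrative, and "can update weights" here bundles owning the training loop with being able to afford a judge.

```python
from typing import List


def build_order(can_update_weights: bool, has_trusted_verdict: bool) -> List[str]:
    """Return the steps to build, in order, for a given deployment."""
    # Step 1 is always available: it needs neither gradients nor a live verdict.
    steps = ["system-space episodic: capture and distill"]
    if has_trusted_verdict:
        # Step 2 is added once a verdict you would bet on exists on the live stream.
        steps.append("system-space online: replay before each run")
        if can_update_weights:
            # Step 3 comes last, and only for teams that own the training loop.
            steps.append("weight-space online: learn from next-state signals")
    return steps


frozen_model_no_tests = build_order(can_update_weights=False, has_trusted_verdict=False)
own_loop_with_tests = build_order(can_update_weights=True, has_trusted_verdict=True)
```

Owning the training loop without a trustworthy verdict still returns only the first step, because the weight-space corner depends on the judge more than any other.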

Figure 9: The map resolved into a decision path. Start at step 1 (system-space, episodic — capture and distill, shippable now); add step 2 (system-space, online — replay before each run) when you have a trustworthy verdict; reach step 3 (weight-space, online) only if you own the training loop. Why it matters: it converts the 2×2 into a build order — cheap and safe before expensive and sharp.

None of it works without instrumentation, and instrumentation has to start on day one, because you cannot distill experience you did not record. Figure 10 is the literal artifact: what to log from the first deployment, before you have built any learning at all. Log the next-state after every action, because that is the signal OpenClaw-RL recovers. Log episode boundaries with an outcome verdict and who or what produced it, because that is what ACuRL's judge and Asawa et al.'s benchmark grade. Log the context you actually injected, because without it you cannot assign credit to a skill or rule. Log the conventions you discover, because those are the amnesia targets. Log cost per episode, because effective feedback per unit cost is the harness's real scaling coordinate [← 1], and because an agent that learns should also learn when not to think. And tag every episode with a stable task or repo id, and set aside a held-out replay set, so you can answer the only question that matters: is the agent getting better, or just getting different?

Log from day one	What it powers later	Grounded in
Next-state after every action — reply, exit code, state diff	The online learning signal (evaluative + directive)	OpenClaw-RL (Wang et al., 2026)
Episode boundaries + outcome verdict — and who/what judged it	Curriculum generation; measuring improvement at all	ACuRL (Xue et al., 2026); Asawa et al. (2026)
Context actually injected each run	Credit assignment to a skill or rule	Memento-Skills (2026); cl-agent (Goswami, 2026)
Conventions discovered — repo/env specifics	The amnesia targets to distill into Σ	cl-agent (Goswami, 2026)
Cost per episode — tokens, tool calls, wall-clock	Feedback-per-cost; learning when not to think	[← 1]; Orenstein et al. (2025)
Stable task/repo id + held-out replay set	Telling "better" apart from "different"	[← 1]; Asawa et al. (2026)

Figure 10: What to log from day one — the artifact. Each row is cheap to capture now and is the precondition for a learning mechanism later. Why it matters: every system in this article distills experience, and you cannot distill a stream you never recorded.
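One way to hold those six rows is a single record per episode, written as one JSON line. This is a sketch under assumptions: every field name is hypothetical, and the hash-based held-out split is one common way to keep a stable replay set, not something the cited systems prescribe.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class Step:
    action: str
    next_state: str  # reply, exit code, or state diff, stored verbatim


@dataclass
class EpisodeRecord:
    task_id: str                                               # stable task/repo id
    steps: List[Step] = field(default_factory=list)            # next-state after every action
    verdict: Optional[bool] = None                             # outcome, if anything judged it
    judged_by: Optional[str] = None                            # who or what produced the verdict
    injected_context: List[str] = field(default_factory=list)  # skills/rules shown this run
    conventions: List[str] = field(default_factory=list)       # repo/env specifics discovered
    tokens: int = 0                                            # cost per episode
    tool_calls: int = 0
    wall_clock_s: float = 0.0


def is_held_out(task_id: str, percent: int = 10) -> bool:
    """Deterministic split: the same task id always lands on the same side."""
    digest = hashlib.sha256(task_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % 100 < percent


def to_jsonl(record: EpisodeRecord) -> str:
    """Serialize one episode as a single JSON line."""
    return json.dumps(asdict(record), ensure_ascii=False)


record = EpisodeRecord(
    task_id="repo-a/fix-imports",
    steps=[Step(action="pytest", next_state="exit 1: ImportError")],
    verdict=False,
    judged_by="test suite",
    tokens=1200,
)
line = to_jsonl(record)
```

Hashing the id rather than sampling at random means an episode never migrates between the training and held-out sides as the log grows, which is what makes "better" distinguishable from "different".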

Key Takeaway 8 — the artifact

Build in this order: system-space episodic (capture → distill, shippable now), then system-space online (replay before each run), then — only if you own the loop and trust your judge — weight-space online. And log from day one: next-state signals, episode boundaries with verdicts, injected context, discovered conventions, per-episode cost, and a stable id with a held-out replay set. You cannot learn from a stream you never recorded.

What comes next

This article answered what learns and how the loop closes. It did not answer where the learned thing physically lives. Σ has been an abstraction — "an external store" — but a skill library, an episodic JSONL log, a replay buffer, and a vector index are very different objects with different costs, latencies, and failure modes, and choosing among them is its own engineering problem.

[→ 5] The Memory Stack — takes Σ apart: where memories live, how they are indexed and retrieved, and what the storage hierarchy of a learning agent actually looks like, from the context window out to external stores.

There is also a longer-horizon version of this article's argument, that closing the on-the-job loop is the road to agents that genuinely compound, which the long-bet series develops as a research programme [→ C6]. Here we keep the claim narrow and shippable: the experience is already flowing, the system-space half of the loop needs no weights, and the only unforgivable choice is to keep paying the amnesia bill while the signal that would end it runs past your agent untouched.

Final Key Takeaways

The on-the-job loop is closing from both ends — weight-space (online RL from next-state signals) and system-space (skills, episodes, rules) — and the first deployable wins are not gradient-based.
Place every approach on one 2×2: substrate (weight vs system) × cadence (online vs episodic). The bottom row ships on a frozen model; the strongest systems (Fast-Slow) span the axis rather than pick a corner.
The hard part is old: plasticity decays, tracking can beat converging, memory is a budget, compute is a decision. On-the-job learning is the regime, not a feature.
Deploy in order — system-space episodic, then online, then weight-space — and instrument from day one. You cannot distill a stream you never recorded.

References

Asawa, P., Glaze, C. M., Orlanski, G., Ramakrishnan, R., Xu, B., Biswal, A., Chen, V. S., Sala, F., Zaharia, M., & Gonzalez, J. E. (2026). Continual Learning Bench: Evaluating Frontier AI Systems in Real-World Stateful Environments. Preprint. arXiv:2606.05661.
Wang, Y., Chen, X., Jin, X., Wang, M., & Yang, L. (2026). OpenClaw-RL: Train Any Agent Simply by Talking. Preprint. arXiv:2603.10165.
Xue, T., Liao, Z., Shi, T., Wang, Z., Zhang, K., Song, D., Su, Y., & Sun, H. (2026). Autonomous Continual Learning for Environment Adaptation of Computer-Use Agents. Preprint. arXiv:2602.10356.
Memento-Team. (2026). Memento-Skills: Let Agents Design Agents. Preprint.
Goswami, D. (2026). cl-agent: A Continual-Learning Substrate for Coding Agents — Episode Capture, Replay, and Rule-Based Distillation for Cross-Session Improvement Without Fine-Tuning. Independent Research, Preprint. github.com/dattgoswami/cl-agent
Wang, G., Xie, Y., Jiang, Y., Mandlekar, A., Xiao, C., Zhu, Y., Fan, L., & Anandkumar, A. (2023). Voyager: An Open-Ended Embodied Agent with Large Language Models. Preprint. arXiv:2305.16291.
Dai, A., Meinardus, B., Regan, C., Tian, Y., & Tang, Y. (2026). Discovering Novel LLM Experts via Task-Capability Coevolution. International Conference on Learning Representations (ICLR), 2026. arXiv:2604.14969.
Tiwari, R., Sareen, K., Agrawal, L. A., Gonzalez, J. E., Zaharia, M., Keutzer, K., Dhillon, I. S., Agarwal, R., & Khatri, D. (2026). Learning, Fast and Slow: Towards LLMs That Adapt Continually. Preprint. arXiv:2605.12484.
Dohare, S., Hernandez-Garcia, J. F., Lan, Q., Rahman, P., Mahmood, A. R., & Sutton, R. S. (2024). Loss of plasticity in deep continual learning. Nature, 632, 768–774.
Sutton, R. S., Koop, A., & Silver, D. (2007). On the Role of Tracking in Stationary Environments. Proceedings of the 24th International Conference on Machine Learning (ICML), 2007.
Tamborski, M., & Abel, D. (2025). Memory Allocation in Resource-Constrained Reinforcement Learning. Preprint. arXiv:2506.17263.
Orenstein, A., Chen, J., Delos Santos, G. A., Sapara, B., & Bowling, M. (2025). Toward Agents That Reason About Their Computation. Preprint. arXiv:2510.22833.
Mitchell, T., Cohen, W., Hruschka, E., Talukdar, P., Betteridge, J., Carlson, A., et al. (2015). Never-Ending Learning. Proceedings of the 29th AAAI Conference on Artificial Intelligence (AAAI), 2015.