On-Policy Distillation Quietly Ate Post-Training

§1 · Eight papers, a single half-year

Between February and June 2026, eight research groups published eight papers on the same training method. Most did not cite each other, because most were in flight at the same time. They argued about the details and agreed on the name: on-policy distillation. Read the arXiv stamps as a time series and the shape is unmistakable (see Figure 2): a near-vertical staircase, a method going from folklore to literature inside a single half-year.

When a method gets one careful treatment, that is a result. When it gets eight near-simultaneous, independent treatments — two papers on its failure modes, one on its mechanism and recipe, one generalizing it past the teacher, one on efficiency, one on self-distillation, one on weak teachers, one on distilling across model families — that is not a result. That is a subfield announcing itself. These papers are not building on a settled consensus; they are racing to discover one.

The first essay in this series argued that you can date the birth of an engineering discipline by when its surveys appear. Methods get replaced; a discipline crystallizes the moment practitioners stop trading recipes and start writing the textbook. In June 2026 the survey arrived — a Peking University primer that synthesizes more than 150 studies on post-training reasoning data and hands the field its first shared vocabulary (the Primer, Li et al., 2026). On-policy distillation is one chapter of that textbook, being written in real time. This essay reads the eight papers the way that chapter will read them in 2028: as a definition, a set of failure modes, a mechanism, a recipe, and a set of extensions.

[← 1] The Selection Lens: How to Bet on Papers — set the rule used here: a research programme has crystallized into a discipline precisely when the first surveys appear. The Primer is that marker; this essay catches the rule in the act.

The thesis is simple enough to state in one breath. On-policy distillation collects its training data the way reinforcement learning does — from the student's own rollouts — but supervises that data the way supervised fine-tuning does — densely, at every token, against a stronger teacher. That single recombination lands between the two failure modes that have defined post-training: SFT's distribution shift (the student is trained on trajectories it would never have produced itself) and outcome-RL's sparse credit (one reward at the end of a long trajectory, silent about which token earned it). RLVR took the headlines and the compute. On-policy distillation took the work — and, by the end of this essay, the strongest claim to being the cheapest way to move a skill out of an expensive model and into a deployed one.

Figure 2: The eight OPD-method papers by arXiv month, February–June 2026 (cumulative count). One paper in February becomes eight by June — eight independent treatments in a single half-year. The two hollow markers (SDAR, OPRD) are the extensions §6–§7 will reach. Acceleration like this is the empirical signature of a discipline crystallizing, not one lab's program.

Key Takeaway 1

Eight independent treatments of on-policy distillation across February to June 2026 are the signature of a subfield crystallizing, not a single lab's bet. OPD's one idea — collect data on-policy like RL, but supervise it densely like SFT — is a recombination aimed squarely at the two standing failure modes of post-training: SFT's distribution shift and outcome-RL's sparse credit.

§2 · What on-policy distillation is

Two axes organize all of post-training. The first is where the data comes from: off-policy data is produced by some other model — a teacher, a human, an earlier checkpoint — while on-policy data is produced by the model being trained, M, on its own rollouts. The second is how dense the supervision is: dense supervision attaches a learning signal to every token, sparse supervision attaches one signal to a whole trajectory τ. Place the three methods that matter on those axes and they fall into three corners (see Figure 3).

SFT is off-policy and dense. You take a teacher's trajectories and train M to imitate them token by token. The supervision is maximally dense — a target at every position — but it is collected on the teacher's distribution, not the student's. The moment M generates its own text, it walks off the manifold it was trained on; small errors compound because nothing in training taught it to recover from its own mistakes. That is distribution shift, the oldest disease in imitation learning.

Outcome RL is on-policy and sparse. M generates its own trajectory, a verifier checks the final answer, and a single scalar reward flows back. The data is now on-policy, which cures distribution shift — M learns on exactly the distribution it will be deployed on. But the credit is sparse: a four-thousand-token proof earns one bit of feedback, and the model must infer, across many noisy rollouts, which tokens mattered. The published analysis of what RLVR does and does not add lives in the A-series and is assumed here, not re-argued.

[← A2] Does RL Teach LLMs to Reason, or Just Refine Them? — established that outcome-RL's sparse, verifier-only signal mostly redistributes probability mass toward solutions the base model could already reach, escaping that "invisible leash" only at prolonged scale. That sparse-credit ceiling is exactly the weakness OPD's dense signal is built to sidestep.

On-policy distillation takes the third corner: on-policy and dense. Meta's OmniOPD states the definition as cleanly as anyone — OPD "trains a student model on its own generative trajectories under dense token-level feedback from a stronger teacher, mitigating both the off-policy distribution shift of SFT and the sparse credit assignment of outcome-based RL" (Zhou et al., 2026). M rolls out; the teacher T scores every token M produced; M moves toward the teacher at every position. You keep SFT's dense signal and RL's on-policy data, and you drop the characteristic weakness of each.
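The loop in that definition can be sketched in a few lines. This is a minimal illustration, not any paper's implementation: the function names are hypothetical, and the "models" are fixed next-token distributions rather than networks conditioned on a prefix. It shows the two properties that define the corner: the student samples every token (on-policy), and the teacher scores every sampled token (dense).

```python
import math
import random

def sample(probs, rng):
    """Draw one token index from a categorical distribution."""
    r, acc = rng.random(), 0.0
    for i, p in enumerate(probs):
        acc += p
        if r < acc:
            return i
    return len(probs) - 1

def opd_token_signals(student_steps, teacher_steps, rng):
    """Roll out the student, then score each sampled token with the teacher.

    student_steps / teacher_steps hold one next-token distribution per
    position. Real models would condition on the prefix; fixed lists are an
    assumption that keeps the sketch small.
    Returns the sampled tokens and the per-token signal
    log p_teacher(y_t) - log p_student(y_t).
    """
    tokens, signals = [], []
    for p_s, p_t in zip(student_steps, teacher_steps):
        y = sample(p_s, rng)  # on-policy: the student chooses the token
        signals.append(math.log(p_t[y]) - math.log(p_s[y]))  # dense: one signal per token
        tokens.append(y)
    return tokens, signals

rng = random.Random(0)
student = [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3], [0.4, 0.4, 0.2]]
teacher = [[0.8, 0.1, 0.1], [0.1, 0.7, 0.2], [0.3, 0.6, 0.1]]
tokens, signals = opd_token_signals(student, teacher, rng)
print(tokens, [round(s, 3) for s in signals])
```

A positive signal marks a token the teacher likes more than the student does; a negative one marks a token the teacher would have been less likely to write. An outcome-RL rollout of the same length would return a single number instead of the list.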

Figure 3: OPD occupies the on-policy + dense corner. It keeps SFT's per-token signal but collects it on the student's own trajectories — fixing SFT's distribution shift (move right) and outcome-RL's sparse credit (move up) at once. Definition and framing after Zhou et al. (2026).

None of this is a clean-room invention; it is a new arrangement of an old stack — the same stack documented at book length in Lambert's Reinforcement Learning from Human Feedback (2026), where instruction tuning, reward modeling, rejection sampling, and policy optimization already sit side by side. The notion that one model can densely supervise another reaches back to Constitutional AI (Bai et al., 2022), where a model generated its own critiques and revisions and then trained on them — a model supervising a model, without human labels. OPD pushes that lineage to its token-level limit, taking the teacher's per-token distribution as the supervision target.

The corners are not cosmetic. A Harvard study that ran RL directly on intermediate pre-training checkpoints (Bansal et al., 2026) found the off-policy and on-policy regimes do structurally different things: RL expands a model's distribution, whereas SFT sharpens it — and SFT applied alone measurably degrades general capabilities even while it lifts the target task. Dense off-policy imitation carries a tax that on-policy collection avoids. OPD is the attempt to pay neither bill.

Key Takeaway 2

OPD is defined by two axes, not one: it collects data on-policy (from the student's rollouts, like RL) and supervises it densely (per token, like SFT). That corner — on-policy and dense — is the whole idea. It inherits SFT's rich signal without SFT's distribution shift, and RL's deployment-matched data without RL's sparse credit.

§3 · The failure phenomenology

The fastest way to understand a method is to watch it break, and on-policy distillation breaks in a specific, repeatable way. Because the training data is generated by the student, the objective can quietly reshape that data — and the reshaping is pathological. Luo et al. (2026) name the failure: length inflation. As training proceeds, the student's on-policy rollouts grow longer, often abruptly, until they hit the truncation cap; truncated trajectories then dominate the batch. The authors call this truncation collapse — it coincides with a sudden saturation of repetition, biases the gradient, and drops validation performance off a cliff (see Figure 4).

Figure 4: OPD's signature failure. On-policy rollouts inflate in length until they reach the truncation cap; the repetition-saturated, truncated trajectories then dominate the batch, biasing the gradient and collapsing validation accuracy. Axes are qualitative — the shape, not the numbers, is the claim (after Luo et al., 2026).

The mechanism is a feedback loop hidden inside the on-policy data collection. The objective implicitly favors long, repetitive rollouts — they accumulate more per-token agreement with the teacher — so the student learns to produce them, which makes the next batch of on-policy data longer and more repetitive still. The student is not learning to reason better; it is learning to inflate the density of its own supervision. The fix Luo et al. propose, Stable-OPD, is two guardrails: a reference-based divergence constraint that keeps rollouts from drifting, and a rollout mixture that dilutes the runaway trajectories. Neither is exotic. Both exist because the naive objective is unstable.
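The symptoms described above (rollouts hitting the cap, repetition saturating) are cheap to monitor per batch. The following is a diagnostic sketch only, not Stable-OPD: the function names and the 0.3 alert threshold are hypothetical choices made for illustration.

```python
def repeat_rate(tokens, n=3):
    """Fraction of n-grams in a rollout that duplicate an earlier n-gram."""
    grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not grams:
        return 0.0
    return 1.0 - len(set(grams)) / len(grams)

def rollout_health(batch, max_len, warn_truncated=0.3):
    """Summarize a batch of rollouts (each a list of token ids).

    A rollout counts as truncated when its length reached max_len.
    warn_truncated is an assumed alert threshold, not a published value.
    """
    if not batch:
        raise ValueError("empty batch")
    truncated = sum(1 for r in batch if len(r) >= max_len)
    frac = truncated / len(batch)
    return {
        "mean_length": sum(len(r) for r in batch) / len(batch),
        "mean_repeat_rate": sum(repeat_rate(r) for r in batch) / len(batch),
        "truncated_fraction": frac,
        "alert": frac >= warn_truncated,
    }

healthy = rollout_health([[1, 2, 3, 4, 5], [6, 7, 8, 9]], max_len=8)
collapsed = rollout_health(
    [[1, 2] * 4, [3, 4, 5, 3, 4, 5, 3, 4], [1, 2, 3, 4, 5, 6]], max_len=8
)
print(healthy)
print(collapsed)
```

Tracking these three numbers over training steps would show the inflation curve before validation accuracy reacts, which is when a guardrail is cheapest to apply.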

A second team catalogs the failure surface more broadly and reaches a compatible diagnosis (Fu et al., 2026). The standard implementation reduces distribution matching to a sampled-token log-ratio, and that estimator is fragile on long rollouts whose prefixes have drifted off the teacher's typical support. They isolate three failure modes: imbalanced token-level supervision (a few high-loss tokens dominate the gradient), unreliable teacher guidance on student-generated prefixes the teacher would never have written, and plain tokenizer or special-token mismatch between teacher and student. Their fixes are equally unglamorous — local top-K teacher supervision, careful prefix handling, tokenizer alignment. The pattern across both papers is identical: the idea is robust, the naive implementation is not, and the gap between them is a short list of stabilizers.
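To see why the sampled-token log-ratio is fragile, compare it with a divergence computed only on the teacher's top-K tokens. The sketch below is an assumption-laden illustration: the renormalized top-K reverse KL is one plausible reading of "local top-K teacher supervision", not necessarily the formulation Fu et al. use, and the function names are hypothetical.

```python
import math

def sampled_log_ratio(p_student, p_teacher, token):
    """Single-sample term log p_S(y) - log p_T(y) for the sampled token y."""
    return math.log(p_student[token]) - math.log(p_teacher[token])

def topk_reverse_kl(p_student, p_teacher, k):
    """Reverse KL(student || teacher) restricted to the teacher's top-k tokens.

    Both distributions are renormalized on that support (an assumed design).
    """
    support = sorted(range(len(p_teacher)), key=lambda i: p_teacher[i], reverse=True)[:k]
    zs = sum(p_student[i] for i in support)
    zt = sum(p_teacher[i] for i in support)
    if zs == 0.0 or zt == 0.0:
        raise ValueError("no probability mass on the top-k support")
    kl = 0.0
    for i in support:
        s, t = p_student[i] / zs, p_teacher[i] / zt
        if s > 0.0:
            kl += s * (math.log(s) - math.log(t))
    return kl

p_student = [0.50, 0.30, 0.15, 0.05]
p_teacher = [0.60, 0.30, 0.099999, 0.000001]  # teacher almost never emits token 3

outlier = sampled_log_ratio(p_student, p_teacher, token=3)
local = topk_reverse_kl(p_student, p_teacher, k=2)
print(outlier, local)
```

When the student happens to sample a token the teacher gives near-zero probability, the single-sample term is large enough to dominate a batch, while the top-K quantity stays small because it only looks where the teacher has real mass.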

[← A7] Stable Deep RL at Scale: Gradients, KL, and the Shape of Learning — documented the same family of pathologies for outcome RL: KL estimators that fail in both directions, and an "unbounded reasoning depth" mode where episode length grows until training destabilizes. OPD's length inflation is the distillation dialect of that disease — the same coupling between a self-generated data distribution and an objective that quietly prefers longer sequences.

The lesson transfers cleanly. When the policy generates its own data, the objective and the data distribution are coupled, and any objective that even slightly prefers longer sequences will eventually find the runaway. That is the tax of on-policy collection, and it is why the phenomenology papers arrived early in the cluster: a method has to be widely used before its failure modes are worth a paper each.

Key Takeaway 3

OPD's characteristic failure is length inflation collapsing into truncation dominance: the student games the density of its own supervision until truncated, repetitive rollouts swamp the batch and validation falls off a cliff. The cure is a short list of stabilizers — divergence constraints, rollout mixtures, local top-K supervision, tokenizer alignment — and the deeper lesson is that on-policy collection always couples objective to data distribution, exactly as it does in deep RL.

§4 · Mechanism and recipe

If §3 is how OPD breaks, §4 is why it works when it does not break. The most complete single treatment comes from a Tsinghua group whose subtitle is its outline: phenomenology, mechanism, recipe (Li, Zuo, et al., 2026), the three things a maturing method needs, in one paper. Their central empirical finding is small to state and large in consequence: OPD primarily teaches the student thinking patterns, not new facts. The student already holds the knowledge; what the teacher transfers, densely and on-policy, is the procedure: how to deploy what the student already knows.

Underneath the phenomenology is a clean identity the generalization paper makes explicit (Yang et al., 2026): on-policy distillation is a special case of dense, KL-constrained reinforcement learning, in which the per-token reward and the regularizer to a reference model are weighted equally and the reference can be any model.¹ That one line dissolves the dichotomy of §2. OPD is not a third thing standing next to RL and SFT; it is RL with a dense, per-token reward supplied by a teacher instead of a sparse one supplied by a verifier. The teacher's log-probability at each position is the reward. Credit assignment — the hard part of RL — is handed to you for free, at every token (see Figure 5).
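One way to see the identity is a two-line rearrangement of the reverse KL. The notation here is assumed for illustration and may not match Yang et al.'s: π_S is the student, π_T the teacher, and π_ref any reference model with full support.

```latex
\begin{aligned}
-\,\mathrm{KL}\!\left(\pi_S \,\|\, \pi_T\right)
  &= \mathbb{E}_{y \sim \pi_S}\!\left[\log \pi_T(y) - \log \pi_S(y)\right] \\
  &= \underbrace{\mathbb{E}_{y \sim \pi_S}\!\left[\log \frac{\pi_T(y)}{\pi_{\mathrm{ref}}(y)}\right]}_{\text{reward}}
     \;-\;
     \underbrace{\mathrm{KL}\!\left(\pi_S \,\|\, \pi_{\mathrm{ref}}\right)}_{\text{regularizer}}
\end{aligned}
```

The reference cancels when the two terms carry equal weight, which is why any choice of π_ref yields the same objective. For an autoregressive model the log-probabilities sum over positions, so the reward splits into per-token terms log π_T(y_t | y_<t) − log π_ref(y_t | y_<t); with a uniform reference this is the teacher's log-probability up to a constant.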

Figure 5: Why OPD learns faster than outcome RL. Both collect the same on-policy trajectory τ; outcome RL delivers one reward at the end, which must be credited back across every token, while OPD attaches a teacher reward to each token directly. Same data, far more supervision per trajectory. Mechanism after Yang et al. (2026).

[← A2] Does RL Teach LLMs to Reason, or Just Refine Them? — found that outcome RL's sparse verifier signal mostly surfaces capability the base model already had, escaping the leash only at prolonged, expensive scale. OPD changes that economics: a dense per-token reward carries far more bits per trajectory than a single terminal scalar, so the same transfer can be reached sooner — provided a teacher is willing to score every token.

The recipe half of the Tsinghua paper is where the method turns into engineering. Because OPD is KL-constrained RL with a tunable reward, the levers are the familiar ones: which teacher to use, how to weight the reward against the reference term, how to handle the student-prefix problem from §3, and — crucially — how long to train before length inflation sets in. The point is not any single setting; it is that these choices are legible. A method has a recipe when its hyperparameters have known effects and its failures are diagnosable. By mid-2026, OPD had one.

None of this demands new mathematics. It demands noticing that the dense supervision of imitation learning and the on-policy data collection of reinforcement learning are separable, and that you can keep one without the other. The mechanism is not deep. The recombination is what is valuable.

Key Takeaway 4

OPD is dense, KL-constrained RL whose per-token reward is the teacher's log-probability — credit assignment, handed to you at every token. Empirically it transfers thinking patterns rather than facts: it teaches the student to use what it already knows. That is why it can reach, cheaply and quickly, the kind of capability transfer that costs outcome RL a prolonged training run.

§5 · Beyond the teacher

The naive expectation for any distillation method is that the student converges toward the teacher and stops — the teacher is the ceiling. The most interesting result in the whole cluster is that this is false. On-policy distillation can carry the student past the teacher, and it can do so with teachers that are weaker, same-sized, or even the student itself. The ceiling was never the teacher's capability; it was an artifact of how the reward and the regularizer were balanced.

The mechanism from §4 makes this almost obvious in hindsight. If OPD is KL-constrained RL with the reward and the reference term weighted equally, nothing forces that equality. Yang et al. (2026) untie it: scale the reward above the reference term — a reward-scaling factor greater than one — and you get ExOPD, reward extrapolation. The student is no longer asked to match the teacher's distribution; it is asked to move further in the direction the teacher points than the teacher itself went. Across teacher–student size pairings this consistently beats standard OPD, and it lets you merge several domain-expert teachers into one student that exceeds each (see Figure 6).
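For a single categorical step, the effect of scaling the reward can be computed in closed form. The sketch below takes the description above at face value: the objective is assumed to be lam * E[log(p_T / p_ref)] - KL(p || p_ref), whose maximizer is proportional to p_ref^(1 - lam) * p_T^lam. The name `lam` and the toy distributions are illustrative assumptions, not values from Yang et al.

```python
def extrapolated_target(p_ref, p_teacher, lam):
    """Maximizer of lam * E_p[log(p_T / p_ref)] - KL(p || p_ref) for one step.

    The optimum is proportional to p_ref**(1 - lam) * p_T**lam.
    Assumes every probability is strictly positive.
    """
    weights = [r ** (1.0 - lam) * t ** lam for r, t in zip(p_ref, p_teacher)]
    z = sum(weights)
    return [w / z for w in weights]

p_ref = [0.5, 0.3, 0.2]      # where the student starts
p_teacher = [0.7, 0.2, 0.1]  # teacher shifts mass toward token 0

standard = extrapolated_target(p_ref, p_teacher, lam=1.0)
extrapolated = extrapolated_target(p_ref, p_teacher, lam=2.0)
print(standard)
print(extrapolated)
```

With lam = 1 the target is exactly the teacher. With lam = 2 the target puts more mass on token 0 than the teacher does: it continues along the reference-to-teacher direction instead of stopping at the teacher.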

Figure 6: Learning past the teacher. Standard OPD asymptotes toward the teacher's capability; weighting the reward above the reference term (ExOPD, reward-scaling factor λ>1) lets the student cross above it. The teacher's direction, not its altitude, is what transfers, which is why even weak or same-level teachers help (after Yang et al., 2026; Lu & Liu, 2026).

The teacher's strength matters less than intuition suggests. Studying distillation in pretraining, Lu and Liu (2026) report that a smaller teacher can still guide a larger student, and a same-level teacher still helps. The direction of the signal is worth more than its altitude. This is the weak-to-strong story in a practical key: you do not need a model better than the one you are training to make it better — you need one that is differently right, often enough, at the token level.

If the teacher can be weak, can it be the student? Hao et al. (2026) push to that limit, proposing self-policy distillation with no external signal at all: extract a low-rank capability subspace from the model's own gradients on the tokens that determine correctness, then project the model's own activations into that subspace during generation. The result is a way to let a model amplify its own best directions without a separate teacher or reward model. Abdali et al. (2026) loosen a different part of the setup: they relax the on-policy constraint to make distillation cheaper, scaling reasoning efficiency for small models on competition-math benchmarks. Both show the method's parts are modular: teacher, reference, and reward can each be swapped, weakened, or folded back into the student.

The same self-supervision instinct shows up one level out, in verification. Wu and Raghunathan (2026) note that a model cannot reliably catch its own errors — until it is shown the reference solution, at which point it can. They turn that asymmetry into a training target, teaching a verifier to imitate a more-informed version of itself; the verifier roughly doubles accuracy on hard math and lifts scientific reasoning 14×, from 1.5% to 21%. That is not OPD, but it is the same idea in a different coat — and both descend from Constitutional AI (Bai et al., 2022), where the supervising signal came from the model and a short list of principles rather than from humans. The field keeps rediscovering that a model can be its own teacher when the asymmetry is arranged correctly.

Key Takeaway 5

The teacher is not the ceiling. Unequal weighting of reward and reference (ExOPD) carries the student past the teacher; weak, same-level, and even self-teachers still transfer skill, because what moves is the teacher's direction, not its altitude. OPD's teacher, reference, and reward are modular — and the adjacent self-trained-verification work shows the same model-supervises-model logic reaching into evaluation.

§6 · Across model families

Every method in §2–§5 assumes a luxury: that you can read the teacher's token-level logits. That assumption quietly excludes the most capable teachers in the world. The strongest models are served behind APIs that return text, not log-probabilities — you cannot run standard OPD against them, because standard OPD needs the number the API will not hand over. Two June papers attack this from opposite directions, and together they show the design space opening on two axes (see Figure 7).

Meta's OmniOPD (Zhou et al., 2026) removes the white-box requirement. Rather than matching the teacher's logits, it scores the student's rollouts with a logit-free, chunk-level signal: Monte Carlo rollouts from the teacher approximate its local preferences through a continuous semantic-similarity metric over multi-token chunks, and a peak-entropy scheduler concentrates that expensive supervision on the student's high-uncertainty reasoning forks — the handful of tokens where the trajectory could branch several ways. A Bayesian prior and a base-model anchor keep the discrete sampling from collapsing the policy. The payoff is OPD with a black-box teacher, outperforming SFT by up to 45.31% on mathematical reasoning and competitive programming. The brittle, narrow-overlap logit-matching that caused half the failures in §3 is swapped for something coarser and sturdier.
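The idea of spending teacher supervision only at high-uncertainty forks can be illustrated with a simple entropy ranking. This is a stand-in under stated assumptions: the source names a peak-entropy scheduler but does not give its formula, so the top-`budget` selection and the function names below are hypothetical.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def pick_high_entropy_positions(step_distributions, budget):
    """Indices of the `budget` positions with the highest student entropy.

    Returned in trajectory order, so the teacher can be queried left to right.
    """
    ranked = sorted(
        range(len(step_distributions)),
        key=lambda t: entropy(step_distributions[t]),
        reverse=True,
    )
    return sorted(ranked[:budget])

steps = [
    [0.97, 0.02, 0.01],  # confident
    [0.40, 0.35, 0.25],  # fork
    [0.90, 0.05, 0.05],  # confident
    [0.34, 0.33, 0.33],  # fork
]
forks = pick_high_entropy_positions(steps, budget=2)
print(forks)
```

Only the selected positions would receive the expensive teacher rollouts; the confident positions are left to the cheaper parts of the objective.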

Figure 7: OPD's design space is opening on two axes. OmniOPD moves right — logit-free, chunk-level supervision via speculative verification reaches black-box teachers; OPRD moves up — distilling internal representations rather than output logits. The single corner that was OPD in February is a region by June (after Zhou et al., 2026; Zhu et al., 2026).

OPRD (Zhu et al., 2026) moves in the other direction — not outward toward black-box teachers, but inward toward richer signal. Its argument is that every existing OPD variant, whether it matches sampled tokens, the full vocabulary, or the top-K, supervises the student only in the output space, and that this is wasteful twice over. The sampled-token reward is a single-sample Monte Carlo estimate of a divergence over a vocabulary of roughly 150,000 tokens, whose variance dominates the optimization late in training; and treating the teacher as a black-box probability oracle discards the entire stack of intermediate hidden states the teacher computed on the way to that distribution. OPRD distills in representation space instead — matching the teacher's hidden representations, not only its output logits. It is the same recombination as OPD, applied one layer deeper.
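The variance argument can be checked on a toy vocabulary. The sketch below draws many one-token estimates of KL(student || teacher) and compares their spread with the exact value. The four-token distributions are illustrative assumptions; a real vocabulary is far larger, which makes the effect stronger rather than weaker.

```python
import math
import random

def exact_reverse_kl(p_s, p_t):
    """Exact KL(student || teacher) over the full vocabulary."""
    return sum(s * (math.log(s) - math.log(t)) for s, t in zip(p_s, p_t) if s > 0.0)

def single_sample_estimates(p_s, p_t, n, rng):
    """n independent one-token estimates: log p_S(y) - log p_T(y), with y ~ p_S."""
    tokens = rng.choices(range(len(p_s)), weights=p_s, k=n)
    return [math.log(p_s[y]) - math.log(p_t[y]) for y in tokens]

rng = random.Random(0)
p_s = [0.70, 0.20, 0.09, 0.01]
p_t = [0.60, 0.30, 0.05, 0.05]

estimates = single_sample_estimates(p_s, p_t, 20000, rng)
mean = sum(estimates) / len(estimates)
variance = sum((e - mean) ** 2 for e in estimates) / (len(estimates) - 1)
exact = exact_reverse_kl(p_s, p_t)
print(exact, mean, variance)
```

The average of the estimates tracks the exact divergence, but the spread of any single estimate is several times larger than the quantity being estimated. Late in training, when the true divergence is small, that noise is what the optimizer mostly sees.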

Read together, the two papers locate OPD on a grid. One axis is teacher access — white-box, where you can read logits or hidden states, versus black-box, where you get only text. The other is the locus of supervision — output logits, multi-token chunks, or internal representations. Standard OPD sits in one corner; OmniOPD pushes along the access axis to reach proprietary teachers; OPRD pushes along the locus axis to reach hidden states. A method that can be generalized along orthogonal axes within months of its naming is a method whose core idea is sound; the variants are exploring a space, not patching a leak.

Key Takeaway 6

OPD generalizes on two orthogonal axes within months of its naming. OmniOPD makes it logit-free, so the world's strongest black-box models can serve as teachers; OPRD moves the supervision target from output logits into representation space. Standard OPD was a point; by June 2026 it is a region — the signature of a sound core idea, not a patched one.

§7 · The agentic extension

Everything so far concerns a single-turn student: one prompt, one trajectory, one teacher pass. Agents are different. An agent acts over many turns — it calls a tool, reads the result, decides, acts again — and a reward arrives only at the end of the whole episode, if at all. That is outcome RL's sparse-credit problem in its most severe form, and it is precisely where OPD's dense signal is most valuable and hardest to apply. The agentic extension is the most consequential corner of the cluster, because it is where the method stops being a post-training trick and starts being infrastructure.

SDAR — self-distilled agentic RL (Lu et al., 2026) — takes the §4 idea into the multi-turn setting and finds it does not transfer for free. Its on-policy self-distillation supplies dense token-level guidance from a teacher branch given privileged context the student lacks — the answer, a retrieved skill, a hint. But compounding multi-turn instability destabilizes that supervision, and a teacher operating with privileged context will sometimes reject a student token for the wrong reason, because its advantage came from information the student never had. SDAR's fix is to keep RL as the primary optimization backbone and treat the dense distillation as a gated auxiliary objective: detached token-level signals pass through a sigmoid gate that strengthens distillation on teacher-endorsed, positive-gap tokens and softly attenuates the teacher's spurious rejections. On Qwen2.5 and Qwen3 across ALFWorld, WebShop, and Search-QA, this improves on GRPO by +9.4% on ALFWorld, +7.0% on Search-QA, and +10.2% on WebShop accuracy (see Figure 8).
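The gating idea can be sketched as a weight on each token's distillation term. Everything specific here is an assumption made for illustration: the gap definition, the `sharpness` and `aux_weight` parameters, and the exact form of the loss are hypothetical and are not taken from the SDAR paper. Plain Python has no autograd, so "detached" simply means the gate is treated as a constant weight.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gated_distill_loss(student_logps, teacher_logps, sharpness=4.0):
    """Gate-weighted distillation term over the sampled tokens of one rollout.

    gap_t = log p_teacher(y_t) - log p_student(y_t)
    gate_t = sigmoid(sharpness * gap_t), used as a constant (detached) weight:
    close to 1 where the teacher endorses the token, close to 0 where it rejects it.
    The per-token term -gate_t * log p_student(y_t) raises endorsed tokens only.
    """
    gates, terms = [], []
    for lp_s, lp_t in zip(student_logps, teacher_logps):
        gate = sigmoid(sharpness * (lp_t - lp_s))
        gates.append(gate)
        terms.append(-gate * lp_s)
    return sum(terms) / len(terms), gates

def total_loss(rl_loss, distill_loss, aux_weight=0.1):
    """RL remains the backbone; distillation enters as a small auxiliary term."""
    return rl_loss + aux_weight * distill_loss

student_logps = [math.log(0.2), math.log(0.5), math.log(0.6)]
teacher_logps = [math.log(0.7), math.log(0.5), math.log(0.05)]
distill, gates = gated_distill_loss(student_logps, teacher_logps)
print([round(g, 4) for g in gates], total_loss(1.0, distill))
```

The third token is one the privileged teacher rejects; the gate drives its weight toward zero, so a rejection that may stem from information the student never had contributes almost nothing to the update.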

Figure 8: OPD inside a long-horizon agent. The episode's only reward arrives at the end (sparse, the RL backbone); SDAR threads a dense, gated teacher signal onto every action — strengthening teacher-endorsed tokens, attenuating rejections that came only from the teacher's privileged context. +9.4% ALFWorld, +7.0% Search-QA, +10.2% WebShop-Acc over GRPO (after Lu et al., 2026).

The thing to notice is what SDAR does to credit assignment. Multi-turn RL's central difficulty is attributing a final reward across dozens of actions; the dense teacher signal sidesteps it by supplying per-token credit at every step the student takes. When you can get a teacher to score the trajectory densely, you do not have to solve credit assignment — you dissolve it.

[→ B3] Environments Are the Bottleneck — the agent-training pipeline that OPD slots into, where environments are the new data and the multi-turn credit-assignment algorithms that learn from sparse reward alone are developed in full. This essay cedes that machinery to the field guide and keeps only OPD's narrow contribution: dense teacher credit, when a teacher is available to supply it.

The deployment side of the story is industrial. AgenticQwen (Lyu et al., 2026) trains small agentic models under strict cost and latency constraints (the models you actually run at scale), using multi-round RL on synthetic data driven by dual data flywheels: a reasoning flywheel that raises task difficulty by learning from errors, and an agentic flywheel that expands linear workflows into multi-branch behavior trees. This is not itself on-policy distillation; it is the industrial data loop OPD slots into, the pipeline that turns one expensive capability into many cheap, fast, deployable ones. The open frontier reports, read as practitioner evidence, point the same way: Qwen3 (Qwen Team, 2025) states that leveraging the knowledge of its flagship models sharply reduces the compute needed to build the smaller ones; Kimi K2 (Kimi Team, 2025) leans on large-scale agentic data synthesis with a joint RL stage; GLM-4.5 (GLM-4.5 Team, 2025) describes expert-model iteration alongside RL. Across labs, the recipe is consistent: a strong model's behavior, distilled and synthesized, becomes the training substrate for the small models that ship.

Key Takeaway 7

In long-horizon agents, sparse trajectory reward is at its worst — and dense teacher supervision is at its most useful. SDAR carries OPD into the multi-turn setting as a gated auxiliary to an RL backbone, dissolving credit assignment where it is hardest. Around it, an industrial data loop (AgenticQwen) and the open frontier recipes (Qwen3, Kimi K2, GLM-4.5) show the same move at production scale: a strong model's behavior becomes the substrate for the cheap models that actually run.

§8 · The bet, stated

Here is the bet. By 2029, on-policy distillation is the default mechanism by which skills move through the agent economy — the standard way a capability discovered in an expensive frontier model arrives in the cheap models that actually run in production. Not the only mechanism, but the default one: the first thing a competent team reaches for when it needs a small model to do something a large model can already do.

The argument is economic, not aesthetic. A deployed agent is not one model; it is a fleet, tiered by cost. A few calls go to the frontier model; most go to mid-tier and small models chosen for latency and price. The central operational question of running such a fleet is how to push capability down the tiers without paying frontier prices on every call, and OPD is the cheapest known answer (see Figure 9). SFT down-tiers a skill but imports distribution shift; outcome RL down-tiers a skill but pays the sparse-credit tax in compute. OPD does it with on-policy data and dense teacher credit, exactly the combination that makes the transfer both cheap and stable. AgenticQwen's cost-and-latency-constrained small models, Qwen3's flagship-to-small compression, SDAR's multi-turn benchmark agents: these are early instances of a fleet learning to feed itself.

Figure 9: The bet, drawn. A deployed agent is a fleet tiered by cost. OPD is the cheapest, most stable way to move a skill from one expensive frontier teacher down into the many small models the fleet actually runs — on-policy data plus dense teacher credit, neither SFT's distribution shift nor outcome-RL's compute bill. The conceptual claim of this essay; the fleet-operations discipline it implies is ceded to the field guide.

[→ B11] Agent Ops: Running Agents in Production — owns the discipline of operating that fleet: routing calls across cost tiers, accounting for spend, and deciding which tier serves which request. The bet here is narrower than that discipline: it claims only that the transfer mechanism between the tiers converges on OPD, the way an earlier generation of pipelines converged on SFT.

A bet without a falsifier is a horoscope. So here is what would prove this one wrong: if the frontier post-training reports of 2027 and 2028 — the successors to Qwen3, Kimi K2, GLM-4.5 — describe recipes in which pure RL or pure SFT dominates and dense on-policy distillation is absent or marginal, two years running, the bet is dead, and the 2026 cluster was a local enthusiasm rather than the start of a discipline. I expect it to win, for the same reason the eight papers appeared at once: OPD is not a trick a single lab can hoard. It is a recombination of two things every lab already does, aimed at the two failure modes every lab already fights. Methods like that do not stay novel. They become defaults.

The Bet · 2029 Review

Claim. By 2029, on-policy distillation — on-policy trajectories under dense teacher supervision — is the default skill-transfer mechanism of the agent economy: the standard, named way frontier-agent capability is moved into the cheaper models that run in production.

Falsifier. Open frontier post-training reports across 2027 and 2028 show pure-RL or pure-SFT recipes dominating, with dense on-policy distillation absent or marginal, two years running.

Review date. 2029, against the published open frontier post-training reports. The prediction is specific enough to lose: dense on-policy supervision is a named, standard stage in the majority of those pipelines, or it is not.

Key Takeaway 8

OPD is the cheapest, most stable way to push a skill down a deployed agent fleet's cost tiers — on-policy data plus dense teacher credit, paying neither SFT's distribution shift nor outcome-RL's compute bill. The bet is that by 2029 it is the agent economy's default skill-transfer mechanism; the falsifier is pure-RL or pure-SFT recipes dominating frontier reports two years running. Reviewed in 2029.

What Comes Next

This essay watched a subfield crystallize in real time and placed a bet on where it goes. On-policy distillation is, at bottom, a story about transfer — moving a capability that already exists from one model into another, more cheaply than before. The next essay is about a capability that did not already exist anywhere. In 2026 an AI system produced mathematics that mathematicians checked and did not already know — a disproof here, a verified formal proof there. C5 asks what exactly happened, what it does and does not establish, and what new division of labor it implies. Transfer is the economics of skills that exist; discovery is the question of skills that do not yet. We turn to discovery.

[→ 5] When AI Did Mathematics — the existence-proof essay: an AI-generated result that human mathematicians verified and had not known, and the new division of labor between machine search and human verification it opens.

¹ Concretely, Yang et al. (2026) write the standard token-level OPD update as a KL-constrained policy gradient in which the reward term and the reverse-KL regularizer toward a reference carry equal weight; the reference may be any model. Relaxing that equal weighting, with a reward-scaling factor greater than one, is the ExOPD construction of §5, and the "any reference" clause is what later licenses self-distillation, where the reference is an earlier checkpoint of the student itself.

References

Li, Y., Zhao, G., Shi, Q., Sun, L., Zhang, X., & Yang, T. (2026). A Primer in Post-Training Reasoning Data: What We Know About How It Works. Preprint. arXiv:2606.02113.
Yang, W., Liu, W., Xie, R., Yang, K., Yang, S., & Lin, Y. (2026). Learning beyond Teacher: Generalized On-Policy Distillation with Reward Extrapolation. Preprint. arXiv:2602.12125.
Fu, Y., Huang, H., Jiang, K., Liu, J., Jiang, Z., Zhu, Y., & Zhao, D. (2026). Revisiting On-Policy Distillation: Empirical Failure Modes and Simple Fixes. Preprint. arXiv:2603.25562.
Abdali, S., Kim, Y. J., Chen, T., & Cameron, P. (2026). Scaling Reasoning Efficiently via Relaxed On-Policy Distillation. Preprint. arXiv:2603.11137.
Luo, F., Chuang, Y.-N., Wang, G., Xu, Z., Han, X., Zhang, T., & Braverman, V. (2026). Demystifying OPD: Length Inflation and Stabilization Strategies for Large Language Models. Preprint. arXiv:2604.08527.
Li, Y., Zuo, Y., He, B., Zhang, J., Xiao, C., Qian, C., Yu, T., Yang, W., Liu, Z., & Ding, N. (2026). Rethinking On-Policy Distillation of Large Language Models: Phenomenology, Mechanism, and Recipe. Preprint. arXiv:2604.13016.
Hao, G., Shang, Y., Long, Y., Zhao, Z., & Liang, H. (2026). Self-Policy Distillation via Capability-Selective Subspace Projection. Preprint. arXiv:2605.22675.
Lu, T., & Liu, Z. (2026). Strong Teacher Not Needed? On Distillation in LLM Pretraining. Preprint. arXiv:2605.23857.
Zhou, Y., Zhang, L., Wu, Y., Wang, M., Bo, P., Liu, J., Fan, X., & Zhao, Z. (2026). OmniOPD: Logit-Free On-Policy Distillation via Speculative Verification. Preprint. arXiv:2606.01476.
Lambert, N. (2026). Reinforcement Learning from Human Feedback. Online book.
Bai, Y., Kadavath, S., Kundu, S., Askell, A., Kernion, J., ... Kaplan, J. (2022). Constitutional AI: Harmlessness from AI Feedback. Preprint. arXiv:2212.08073.
Bansal, R., Mohri, C., Qin, T., Alvarez-Melis, D., & Kakade, S. (2026). RL Excursions during Pre-Training: Re-examining Policy Optimization for LLM Training. Preprint. arXiv:2606.04272.
Wu, C. H., & Raghunathan, A. (2026). Self-Trained Verification for Training- and Test-Time Self-Improvement. Preprint. arXiv:2605.30290.
Zhu, G., Song, B., Wang, H., Xia, M., Zheng, X., Ma, Y., Chen, Z., Wang, W., & Chen, G. (2026). OPRD: On-Policy Representation Distillation. Preprint. arXiv:2606.06021.
Lu, Z., Yao, Z., Han, Z., Wang, Z., Wu, J., Gu, Q., Cai, X., Lu, W., Xiao, J., Zhuang, Y., & Shen, Y. (2026). Self-Distilled Agentic Reinforcement Learning. Preprint. arXiv:2605.15155.
Lyu, Y., Wang, C., Zheng, H., Yue, Y., Yan, J., Wang, M., & Huang, J. (2026). AgenticQwen: Training Small Agentic Language Models with Dual Data Flywheels for Industrial-Scale Tool Use. Preprint. arXiv:2604.21590.
Qwen Team. (2025). Qwen3 Technical Report. Preprint. arXiv:2505.09388.
Kimi Team. (2025). Kimi K2: Open Agentic Intelligence. Preprint. arXiv:2507.20534.
GLM-4.5 Team. (2025). GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models. Preprint. arXiv:2508.06471.