The Selection Lens: How to Bet on Papers
Most reading lists optimize for recency. This one optimizes for survivorship — a lens you can backtest against the canon before you trust it with the future.
§1 · Your reading list is a portfolio
A reading list is a portfolio, and most of them are badly allocated. Reading time is the scarcest input a researcher has — more binding than compute, because you cannot buy more of it. Every paper you choose to read is a position you take with that capital, and every position has an opportunity cost: the three papers you did not read instead.
The default allocation strategy is recency. Read what was posted this week, what is trending, what everyone is citing right now. Recency feels like diligence — like staying current — but as an investment rule it is close to buying at the top. The attention a paper commands in its first month is the worst available estimate of how much it will matter in its tenth year, because that early attention is dominated by novelty, institutional megaphone, and the benchmark it happened to top.
The question that should govern allocation is not "is this new?" but "will this still be load-bearing in a decade?" The honest way to ask it is through citation half-life: how long until half of the work that depends on a paper has quietly moved on to something else. Methods have short half-lives. A cleaner optimizer, a stronger baseline, a better architecture arrives, and the old recipe is dropped without ceremony. Definitions have long half-lives. Once a paper names a problem the field agrees is real, everything that follows has to cite it — including, especially, the work that beats it (see Figure 2).
So "this paper aged well" has a precise meaning, and it is not the obvious one. It does not mean the paper was correct. It means the paper is still being cited by the people who showed it was incomplete. That is the asset class this series buys, and the rest of this essay is the lens that finds it.
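One way to make citation half-life measurable is to treat it as the decay time of a paper's yearly citation rate. The sketch below assumes exponential decay after the peak year, which is a simplification: it measures how fast citations fall off, a proxy for dependents moving on rather than the dependency itself. The function name `citation_half_life` and the sample counts are hypothetical.

```python
import math


def citation_half_life(yearly_citations):
    """Estimate the half-life, in years, of a paper's citation rate.

    Assumption (not from the essay): citations decay exponentially after
    their peak year, c(t) = c_peak * exp(-k * t). The decay rate k is fit
    by least squares on log-counts, and the half-life is ln(2) / k.
    Returns math.inf when the rate is flat or still rising.
    """
    peak = max(range(len(yearly_citations)), key=yearly_citations.__getitem__)
    # Keep only post-peak years with nonzero counts so the log is defined.
    tail = [(t, c) for t, c in enumerate(yearly_citations[peak:]) if c > 0]
    if len(tail) < 2:
        return math.inf  # not enough post-peak data to fit a decay
    n = len(tail)
    mean_t = sum(t for t, _ in tail) / n
    mean_y = sum(math.log(c) for _, c in tail) / n
    cov = sum((t - mean_t) * (math.log(c) - mean_y) for t, c in tail)
    var = sum((t - mean_t) ** 2 for t, _ in tail)
    slope = cov / var  # change in log-citations per year
    if slope >= 0:
        return math.inf
    return math.log(2) / -slope


# Hypothetical yearly citation counts, for illustration only.
method_like = [10, 80, 40, 20, 10, 5]       # halves every year after its peak
definition_like = [10, 40, 60, 60, 58, 57]  # barely declines after its peak

print(round(citation_half_life(method_like), 2))
print(round(citation_half_life(definition_like), 2))
```

On these made-up series the method-shaped curve has a half-life of about one year, while the definition-shaped curve has one measured in decades. A paper whose counts are still rising has no finite half-life under this fit.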
A reading list is capital allocation under a hard budget. Optimize it for citation half-life, not recency — buy the papers that the work which eventually beats them will still be obligated to cite.
§2 · Three rules for what survives
Three rules sort papers by how long they will stay load-bearing. They are deliberately falsifiable, which is the only reason to trust them at all; §6 states exactly how each one fails. Read them as horizons of a single lens, far to near (see Figure 3).
Ten-year survivors are problem definitions and paradigm bets, not methods. Methods get replaced; definitions get cited. A method is an answer, and answers are competitive — a better one arrives and yours is retired. A definition is a question the field accepts as worth asking, and questions are not competitive in the same way: naming the problem well is what every subsequent answer must orient against. The ten-year tier is therefore a bet that a frame outlives its first occupant.
Five-year survivors are research programmes with open scaling curves. A programme starts from a paper that opens a curve (accuracy, capability, efficiency, sample-cost) that has not yet flattened, and that names the knob which moves it. You bet on a programme while its curve is still climbing, because a climbing curve is a promise of results you have not had to produce yet. The bet has a clean expiry: when the curve flattens, the programme is over, and pretending otherwise is how people stay invested in a dead idea.
Three-year survivors are engineering disciplines crystallizing now. The reliable tell that a sub-area has reached this stage is the appearance of its first serious survey. A survey is a field's initial public offering: it signals that there are enough independent results to systematize, that the practices have stabilized enough to be written down. A discipline at this stage will be standard practice within three years — which means reading it now is a matter of timing, not insight, and it belongs to a field guide rather than a long bet.
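The five-year rule's expiry condition, "when the curve flattens," can be checked mechanically. A minimal sketch, assuming the curve is roughly a power law in its knob so that it looks linear on log-log axes: compare the slope over the recent half of the measurements with the slope over the early half. The function names, the 0.5 threshold, and the sample points are all illustrative assumptions, not values from the essay.

```python
import math


def loglog_slope(points):
    """Least-squares slope of log(metric) against log(knob)."""
    xs = [math.log(x) for x, _ in points]
    ys = [math.log(y) for _, y in points]
    n = len(points)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    var = sum((a - mx) ** 2 for a in xs)
    return cov / var


def curve_is_open(points, keep_fraction=0.5):
    """True if the recent slope keeps at least `keep_fraction` of the early one.

    `points` are (knob, metric) pairs sorted by knob, e.g. (compute, loss),
    with positive values. The 0.5 default is an illustrative assumption.
    """
    if len(points) < 3:
        raise ValueError("need at least three points to compare two slopes")
    half = len(points) // 2
    early = loglog_slope(points[: half + 1])
    recent = loglog_slope(points[half:])
    if early == 0:
        return False
    # Same sign and enough remaining steepness counts as still climbing.
    return recent / early >= keep_fraction


# Hypothetical curves: a clean power law, and one that saturates at a floor.
still_open = [(10 ** k, (10 ** k) ** -0.5) for k in range(5)]
flattening = [(10 ** k, 1 + (10 ** k) ** -0.5) for k in range(5)]

print(curve_is_open(still_open))
print(curve_is_open(flattening))
```

A ratio test is used instead of an absolute slope so that the same check works whether the metric falls (loss) or rises (accuracy) as the knob grows.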
This lens is not a hypothetical. The published Continual Intelligence series [← A-series] is itself a worked example of running it: it tracked three programmes — the plasticity of continually-trained networks, world models, and reinforcement learning for reasoning — rather than chasing the quarter's leaderboard, which is exactly why its anchors were definitions and open curves and not last spring's state of the art.
The lens has three settings: ten-year definitions, five-year open curves, three-year crystallizing disciplines. Each is falsifiable — a definition that stops being cited, a curve that flattens, or a discipline that never standardizes refutes its tier.
§3 · The backtest: what made the canon last
Before applying a lens, backtest it. A lens that only works going forward is astrology; a lens worth using has to retrodict the canon we already agree on. So take six papers no serious reader disputes and ask, of each, which single property kept it alive — because if the answer is always "definition, frame, or capability proof" and never "method," the rule earns the right to be pointed at the future (see Figure 4).
GPT-3 (Brown et al., 2020) survived as a capability proof. It introduced no new architecture. What it demonstrated is that few-shot, in-context learning emerges when a language model is scaled far enough — that you can specify a task in the prompt and get competent behavior without gradient updates. The particular model is long obsolete. The existence proof is permanent: you cannot un-show that a capability is reachable, and everything built since has been built on the knowledge that it is.
Scaling Laws (Kaplan et al., 2020) survived as a quantitative frame. It established that loss falls as a smooth power law in model size, dataset size, and compute. Every capacity-planning decision since has been made inside that frame — how big a model to train, on how much data, for how long. A frame is not a method and is not replaced; you operate within it, and operating within it is what the field has done for half a decade.
MAML (Finn et al., 2017) survived as a problem definition whose method died. It defined fast adaptation crisply: train a model's initial parameters so that a few gradient steps on a new task produce good performance (in its own words, to make the model "easy to fine-tune"), and it demonstrated the idea on few-shot benchmarks. The specific second-order gradient procedure was promptly approximated away, and the broader meta-learning programme was later largely absorbed by in-context learning. Yet "fast adaptation" is still the name of the problem, and MAML is still the citation that names it. The method is a museum piece; the definition is everywhere.
CLIP (Radford et al., 2021) survived as a capability proof at scale. Trained on 400 million image–text pairs with the deliberately simple objective of predicting which caption goes with which image, it produced zero-shot transfer across more than 30 vision datasets. The recipe has been beaten many times over. The demonstration — that natural-language supervision yields transferable perception — is load-bearing under every computer-use agent shipping today.
NEAT (Stanley & Miikkulainen, 2002) is the dead method, undead idea. Almost no one runs NEAT now. But its core bet — that you can search over structures, not just weights, by growing from minimal structure and protecting innovation long enough to mature — keeps returning: in neural architecture search, in program evolution, and in the self-improving agent loops the field is rediscovering at this very moment. A method can be retired while its paradigm stays undead, surfacing again each time the surrounding compute makes population search affordable.
EWC (Kirkpatrick et al., 2017) is the reference point that outlived its own effectiveness. It overcomes catastrophic forgetting by selectively slowing learning on the weights that matter for earlier tasks, demonstrated on a sequence of MNIST-based tasks and a sequence of Atari games. Stronger continual-learning methods arrived within a year and have kept arriving. EWC is cited more than any of them, because it is where the deep-learning era re-anchored the problem: the term "catastrophic forgetting" is decades older, but EWC is the paper the modern literature cites when it names it. Being the reference point turns out to be a more durable asset than being the state of the art.
The pattern is the whole argument. Every survivor lasted as a definition, a frame, or an existence proof. None lasted as a method. A lens that retrodicts a canon you already agree on is a lens you are allowed to aim at a future you do not.
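The backtest can be written down as a check that would fail if any canonical paper had survived as a method. A minimal sketch: the survival modes are taken from the six readings above, with NEAT labelled a "paradigm bet" and EWC a "definition" as one reading of those paragraphs; the names `CANON`, `DURABLE_MODES`, and `backtest` are hypothetical.

```python
from collections import Counter

# Modes the lens predicts will last. "method" is deliberately absent.
DURABLE_MODES = {"definition", "frame", "capability proof", "paradigm bet"}

# Survival mode assigned to each paper in the backtest above.
CANON = {
    "GPT-3 (Brown et al., 2020)": "capability proof",
    "Scaling Laws (Kaplan et al., 2020)": "frame",
    "MAML (Finn et al., 2017)": "definition",
    "CLIP (Radford et al., 2021)": "capability proof",
    "NEAT (Stanley & Miikkulainen, 2002)": "paradigm bet",
    "EWC (Kirkpatrick et al., 2017)": "definition",
}


def backtest(canon):
    """Return (passed, counts).

    `passed` is False if any paper's survival mode falls outside
    DURABLE_MODES, for example a paper that lasted as a "method".
    """
    counts = Counter(canon.values())
    passed = all(mode in DURABLE_MODES for mode in canon.values())
    return passed, counts


passed, counts = backtest(CANON)
print(passed, dict(counts))
```

The check is only as honest as the labels: a single canonical paper that a fair reader would label "method" is enough to fail it.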
Backtested against the canon, the lens retrodicts cleanly: each surviving paper lasted as a definition, a quantitative frame, or a capability proof — and none survived as a method. A rule that explains the past you agree on is one you can aim at the future you don't.
§4 · Three biases worth having
The lens tells you what survives. Three biases tell you where to look for it before the survival is obvious — before the citations have accumulated and the bet has become consensus. Each is a bias in the literal sense: a thumb on the scale, applied on purpose, because each has positive expected value (see Figure 5).
Bias toward ideas that ride compute. Prefer the simple idea that gets better as hardware gets cheaper over the clever idea that needs a fixed budget. The scaling-law frame is the underwriting case: if a curve is open and compute is rising, the idea is being carried by an exponential you do not personally have to pay for. GPT-3 and CLIP both rode that exponential; their cleverness was mostly the decision to get out of its way.
Bias toward ideas that remove a human from a loop. Every human in an inner loop is a ceiling: on throughput, on scale, on how unattended a system can run. A paper that converts a human-supplied input into something the system produces for itself has lifted a ceiling, and lifted ceilings do not come back. Jiang, Rocktäschel & Grefenstette (2022) make this the central problem of the coming decade: the field is moving from "learning from data" to "learning what data to learn from," automating the choice of what to train on, which is the last major human decision still standing in the supervised loop. That is a problem definition, and §5 places it in the ten-year tier for exactly that reason.
Bias toward existence proofs of new capabilities. A result that shows a capability is reachable for the first time is worth more than a result that makes a known capability cheaper. The first cannot be undone. The second is a method, and methods get replaced — efficiency is a treadmill, capability is a ratchet. When a paper turns an "impossible" into a "demonstrated," it has minted a permanent fact, and permanent facts are the things later definitions get built around.
Why all three carry positive expected value: each bets on a direction the field is structurally forced to move — falling compute cost, rising automation, accumulating capability — rather than on a particular artifact that competitors can obsolete. The deepest version of the underlying claim is the position taken by Kunin et al. (2026): that deep learning will acquire a genuine scientific theory, that the frames will harden into something with predictive law-like content. If that is even partly right, then preferring frames to methods is not a matter of taste. It is a bet that the field is becoming a science — and in a science, the definitions are the durable objects and the methods are the disposable experiments.
Bias toward ideas that ride compute, remove a human from a loop, or prove a new capability exists. Each bets on a direction the field is forced to move, not on an artifact — and if deep learning is becoming a science, its durable objects are its frames, not its methods.
§5 · The lens applied: the 2026 tier lists
A lens you only backtest is a museum piece. Here it is applied, live and accountably, to the 308 papers in the mid-2026 collection — and the output is deliberately not a ranking. It is three tiers, sorted by horizon, with most of the collection left out on purpose (see Figure 6). This essay does not re-argue each entry; the rest of the series is that argument. What §5 owes you is the sorting logic and the few load-bearing calls that the rule produces.
The ten-year tier holds problem definitions and paradigm bets. The clearest definition in the whole collection is exploration: Jiang, Rocktäschel & Grefenstette (2022) argue that "learning what data to learn from" is the central open problem of general intelligence, and unify exploration as a single objective spanning supervised and reinforcement learning. That earns the ten-year tier because it is a name for a problem, not a solution to it — precisely the shape that survives. Beside it sit the paradigm bets that are not solutions either, only frames worth a decade: that a model's context is an environment to be acted in rather than an input to be consumed, and that the runtime state of a model is where long-term memory should live. Those are bets on frames, and C2 argues each one in full, with its own falsifier.
The five-year tier holds programmes with open scaling curves. Four qualify: reinforcement-learning environment pipelines as the genuinely scarce resource; agent harnesses with a scaling variable of their own; the recursion of agents acting on themselves; and the quiet consolidation of on-policy distillation into a named sub-field. Each is a climbing curve with a named knob, and each gets its own essay, the recursion programme and the distillation programme most of all. The test for staying in this tier is brutal and simple: when the curve flattens, the programme leaves it.
The three-year tier holds the engineering disciplines crystallizing now — harness construction, memory systems, the tool interface. These are real, and they will shape what practitioners ship this year. But they belong to the practitioner's quarter, not to the decade, and the honest move is to cede them to the field-guide series rather than dress them up as long bets they are not.
What deliberately did not make a tier is the part that matters most, because the discipline of a lens is mostly subtraction. Point methods that move a benchmark a few points. Papers whose only claim is a new state of the art on a leaderboard that will turn over by winter. Surveys of areas that have not yet produced anything worth surveying. Recency optimizes for exactly these, and they are exactly what the lens is built to leave out. The cost is real: a rule this strict will occasionally exclude something that turns out to matter. That is a cost worth paying — and §6 is where I agree to pay it, in public and on a deadline.
Five-year programmes argued in full: [→ C4] On-Policy Distillation, [→ 3] Recursion as a scaling axis, and [→ B3] Environments Are the Bottleneck. Three-year disciplines ceded to the field guide: [→ B1] harness engineering and [→ B5] the memory stack.
Applied to 308 papers, the lens yields three tiers and one discipline — exclusion. Most of the collection is correctly left out; the ten-year tier is definitions (exploration, context-as-environment, runtime-as-memory), the five-year tier is open curves, and the three-year tier is ceded to the field guide.
§6 · How to be wrong
A bet you cannot lose is not a bet; it is a horoscope. The credibility of this lens rests entirely on being able to say, in advance, what observation would refute it — tier by tier, and on a date. So here is exactly how it fails (see Figure 7).
The ten-year tier is wrong if its definitions stop being cited or its paradigm bets are abandoned rather than extended. Concretely: if, by 2029, "learning what data to learn from" has not become a standard framing of the problem — if context-as-environment and runtime-as-memory have produced no sustained follow-on programme but instead sit as one-off curiosities — then these were recency dressed as definitions, and the tier was wrong.
The five-year tier is wrong if the curves close. A programme earns this tier by having an open scaling curve. If, by 2029, the environment, harness-scaling, recursion, or on-policy-distillation curves have flattened — more compute and data buying no more capability — then they were point results wearing a programme's clothing, and I will have mistaken a local slope for an open one.
The three-year tier is wrong if its disciplines never crystallize. The bet is that harness, memory, and tool-interface engineering become standard practice. If, by 2029, they have instead been absorbed wholesale into the model — if the harness dissolves because the weights now do its job unaided — then the discipline bet was a temporary scaffold mistaken for a foundation, and the field guide was documenting plumbing the model was about to internalize.
The lens itself is wrong if it beats recency only in hindsight. The backtest in §3 is necessary but not sufficient: any lens can be tuned to retrodict a canon everyone already agrees on. The real test is forward and comparative. If, in 2029, a portfolio chosen by this lens in 2026 has not outlasted a portfolio chosen by raw recency — or by citation count at the moment of selection — then the lens added nothing but the appearance of rigor.
The commitment is therefore specific. In 2029 I will publish a scored review of these calls — which tiers held, which definitions accumulated citations, which curves flattened, which disciplines dissolved into the model — and grade the lens against a recency baseline chosen now. The tier lists are dated and public for precisely this reason. A selection lens that is never scored against an alternative is just taste with footnotes.
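The comparative test can be sketched as a scoring rule fixed in advance. The version below is one possible rule, not the one the review commits to: a paper "survives" if its citation rate in the review year is at least half of its rate in the selection year, and the lens passes only if its portfolio's survival rate strictly exceeds the baseline's. The function names, the 0.5 threshold, and every number in the example are placeholders.

```python
def survival_rate(portfolio, retain=0.5):
    """Fraction of papers whose review-year citation rate is at least
    `retain` times their selection-year rate.

    `portfolio` maps a paper id to a pair:
    (citations_in_selection_year, citations_in_review_year).
    The metric and the 0.5 default are illustrative assumptions.
    """
    if not portfolio:
        raise ValueError("portfolio is empty")
    survived = sum(
        1
        for at_selection, at_review in portfolio.values()
        if at_review >= retain * at_selection
    )
    return survived / len(portfolio)


def lens_beats_baseline(lens, baseline, retain=0.5):
    """True only if the lens portfolio strictly outlasts the baseline.

    A tie counts as a failure: a lens that merely matches recency
    has added nothing.
    """
    return survival_rate(lens, retain) > survival_rate(baseline, retain)


# Placeholder data, not real citation counts.
lens_picks = {"a": (100, 90), "b": (50, 60), "c": (40, 10)}
recency_picks = {"x": (300, 40), "y": (200, 150), "z": (120, 20)}

print(survival_rate(lens_picks), survival_rate(recency_picks))
print(lens_beats_baseline(lens_picks, recency_picks))
```

Whatever rule is used, the point is that the threshold and the baseline portfolio are written down at selection time, so the score cannot be tuned after the fact.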
| Tier | The bet | What would falsify it by 2029 | Status |
|---|---|---|---|
| Ten-year | Definitions outlive methods | Its definitions go uncited; the paradigm bets are abandoned, not extended | ⚠ open |
| Five-year | Programmes have open curves | The curves flatten — more compute, no more capability | ⚠ open |
| Three-year | Disciplines crystallize | They never standardize, or dissolve wholesale into the model | ⚠ open |
| The lens | Beats recency going forward | A 2026 lens-picked list fails to outlast a recency-picked one | ⚠ open |
The lens is falsifiable per tier — definitions that go uncited, curves that flatten, disciplines that never standardize — and it commits to a dated, scored 2029 review against a recency baseline. A bet you cannot lose is not a bet.
What comes next
The ten-year tier is the heart of the portfolio and, so far, the least defended: three names and a promise to argue them later. C2 pays that debt. It takes the paradigm bets — context as an environment, the runtime as memory, self-reference as a route to self-improvement — and argues each one as a position with its own falsifier, the way this essay argued the lens. The instrument is now built and calibrated against the canon. Next we use it to place the longest bets in the collection, and to say plainly what would make each of them wrong.
[→ 2] Paradigm Bets: The Ten-Year Tier — the ten-year tier argued in full, one falsifiable bet at a time. Earlier in the project, [→ B12] Self-Improving Agents shows where NEAT's population-search paradigm returns in the agent loop, and [→ 5] When AI Did Mathematics is the existence proof of §4 argued end to end.
References
- Stanley, K. O., & Miikkulainen, R. (2002). Evolving Neural Networks through Augmenting Topologies. Evolutionary Computation, 10(2), 99–127. MIT Press.
- Finn, C., Abbeel, P., & Levine, S. (2017). Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks. Proceedings of the 34th International Conference on Machine Learning (ICML), PMLR 70. arXiv:1703.03400.
- Kirkpatrick, J., Pascanu, R., Rabinowitz, N., Veness, J., Desjardins, G., Rusu, A. A., et al. (2017). Overcoming Catastrophic Forgetting in Neural Networks. Proceedings of the National Academy of Sciences (PNAS), 114(13), 3521–3526. arXiv:1612.00796.
- Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., et al. (2020). Language Models are Few-Shot Learners. Advances in Neural Information Processing Systems 33 (NeurIPS). arXiv:2005.14165.
- Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B., Chess, B., Child, R., et al. (2020). Scaling Laws for Neural Language Models. arXiv preprint. arXiv:2001.08361.
- Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., et al. (2021). Learning Transferable Visual Models From Natural Language Supervision. Proceedings of the 38th International Conference on Machine Learning (ICML), PMLR 139. arXiv:2103.00020.
- Jiang, M., Rocktäschel, T., & Grefenstette, E. (2022). General Intelligence Requires Rethinking Exploration. arXiv preprint. arXiv:2211.07819.
- Kunin, D., Atanasov, A., Boix-Adserà, E., Bordelon, B., Cohen, J., Ghosh, N., et al. (2026). There Will Be a Scientific Theory of Deep Learning. arXiv preprint. arXiv:2604.21691.