Paradigm Bets: The Ten-Year Tier

What a ten-year paper looks like the day it appears

In June 2020, the most consequential paper of the decade was widely received as a parlor trick. Language Models are Few-Shot Learners introduced GPT-3, and the first reactions were not awe but dismissal. That October, Yann LeCun wrote that trying to build intelligent machines by scaling up language models was "like building high-altitude airplanes to go to the Moon" — you might set altitude records, but reaching the Moon would require a completely different approach. Two months earlier, in MIT Technology Review, Gary Marcus and Ernest Davis had run the verdict under the headline "GPT-3, Bloviator," concluding that the system "has no idea what it is talking about." Within thirty months the same architecture, scaled, was the substrate of the fastest-adopted software product in history.

The skeptics were not careless. They were reasoning correctly from what was visible. And that is exactly the problem a ten-year paper poses. The day it appears, it does not look like a result; it looks like a position someone is overpaying to hold. The curve it rides is still short. The capability it removes a human from is still clumsy. The cheaper incumbent still wins on every benchmark that exists. What separates the bets that age into inevitability from the ones that age into footnotes is not optimism, and it is not taste. It is a falsifiable claim about a scaling curve — a statement of the form "this gets better as you add X, and here is the observation that would show it doesn't."

This essay places six such bets from the 2026 collection. None is obvious today; each, I will argue, looks inevitable from 2036. I argue all six the same way — the frame the paper defines, the scaling curve it opens, and the single observation that would make me fold — because the discipline that "future of AI" essays usually skip is the discipline of treating a paper as a position, not a result, and naming the price at which you abandon it. Throughout, I use the series notation: M for the model (its weights), and E for the environment an agent acts in; both are defined again where they first matter.

I score the bets on the two axes the selection lens uses to build this whole reading list: does the idea ride compute (get better as you add parallel hardware, not just cleverness), and does it remove a human from a loop (eliminate a dependency on human data, labels, or hand-engineering). Plotting the six bets together with three historical landmarks (see Figure 2) gives the map the rest of the essay walks across.

Figure 2: The selection lens as a map. Each 2026 bet is scored on whether it rides compute (x) and removes a human from a loop (y); three landmarks (◇, common knowledge) calibrate the axes. AlphaZero sits at the top (self-play removed human data outright) and GPT-3 at the far right (it rode compute like nothing before it) — together they bracket the top-right corner where Genie-class world models, learned runtimes, and recursive context now sit. Evolution Strategies and AlexNet ride compute without removing a human; latent world models remove human supervision but are, so far, an efficiency win rather than a compute monster. Placements are justified bet by bet in the sections below.

Key Takeaway 1

A ten-year paper arrives contested, not obvious — it reads as an overpriced position because its scaling curve is still short. The honest way to hold one is to state, in advance, the observation that would fold it.

The discipline: every bet below names exactly one falsifier, and the portfolio commits to a dated 2029 review. Bets without a stop-loss are not bets; they are vibes.

Bet 1 — Context is an environment

The first bet is that the context window was always a temporary abstraction. We currently treat a long prompt as data you pour into the model: you compact it, chunk it, retrieve over it, and hope it fits. Recursive Language Models, from MIT CSAIL, bet the opposite. They treat a long prompt as an external environment E — a world the model can examine — and let the model "programmatically examine, decompose, and recursively call itself over snippets" of that environment rather than swallowing it whole.

The reframing is small to state and large in consequence. Inference stops being a single pass over a buffer and becomes interaction: the model reads part of E, decides what to look at next, and spends more forward passes to go further (see Figure 3). The reported payoff is the tell — this paradigm processes inputs more than an order of magnitude beyond a model's context-window limit, and on shorter inputs still beats vanilla long-context use and common coding scaffolds at comparable cost. The bottleneck stops being "fit it in the window" and becomes "spend inference to explore the environment," which is a bottleneck you relieve with compute.
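
The interaction pattern can be sketched in a few lines. This is a toy under loud assumptions, not the method of the Recursive Language Models paper: `toy_model` is a hypothetical stand-in for one model call with a hard input limit (`WINDOW`), and split-in-half is the simplest decomposition policy I could write down. The task is to find one line in an environment far larger than any single call can read.

```python
from typing import Optional

WINDOW = 200  # hypothetical per-call input limit, in characters


def toy_model(snippet: str, query: str) -> Optional[str]:
    """Stand-in for one forward pass: it can only read what fits the window."""
    if len(snippet) > WINDOW:
        raise ValueError("snippet exceeds the context window")
    for line in snippet.splitlines():
        if query in line:
            return line
    return None


def recursive_query(env: list[str], query: str, calls: list[int]) -> Optional[str]:
    """Treat `env` (a list of lines) as an environment: read it in one call
    if it fits, otherwise decompose it and recurse over the parts."""
    text = "\n".join(env)
    if len(text) <= WINDOW or len(env) == 1:
        calls[0] += 1  # count forward passes spent on exploration
        return toy_model(text[:WINDOW], query)
    mid = len(env) // 2
    left = recursive_query(env[:mid], query, calls)
    if left is not None:
        return left
    return recursive_query(env[mid:], query, calls)


# An environment roughly two orders of magnitude larger than the window.
env = [f"record {i}: filler" for i in range(1000)]
env[737] = "record 737: the key is 42"

calls = [0]
answer = recursive_query(env, "the key", calls)
print(answer, "| model calls:", calls[0])
```

No single call ever sees more than `WINDOW` characters; the input limit is paid for with more calls, which is the sense in which the bottleneck becomes one you relieve with compute.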

That is why this is a position, not a feature. The feature is "handle long inputs." The position is that the durable interface between a model and a large body of information is an environment you act in, not a buffer you fill — and that as both models and contexts grow, interacting with context will dominate ingesting it.

Figure 3: Context as an environment, not a buffer. The model interacts with the prompt — reading, decomposing, and recursively calling itself over parts of it — rather than ingesting it in one pass. This is inference-time compute pointed at context, which is why it scales past the window.

Recursive Language Models are part of the published Continual Intelligence collection; they are cited here by cross-reference, not re-introduced. The in-weights version of the same instinct — spending compute inside the forward pass instead of in an external loop — is [← A11] Thinking Without Tokens. The bet is expanded into a full scaling-axis argument in [→ 3] Recursion: The Third Scaling Axis.

Key Takeaway 2

Bet 1 is that the context window is a passing abstraction: the lasting interface between a model and a large body of information is an environment it interacts with, and interaction will outscale ingestion as models and contexts grow.

Falsifier: flat or negative returns to recursive context interaction at scale — if a single forward pass over a long window matches recursive decomposition at equal cost, the bet fails.

Bet 2 — The runtime is the memory

Today, memory is something we bolt onto a model from outside: a vector store, a scratchpad, a retrieval layer, a database the agent queries. The engineering of that external stack is real work and it has its own essay ([→ B5] The Memory Stack). The second bet is that the stack is scaffolding — that memory is something the runtime learns, not something you attach.

Two papers stake it from opposite ends. Neural Computers, from Meta AI and Jürgen Schmidhuber, proposes "a new frontier" that "unifies computation, memory, and I/O in a learned runtime state"; the aim is to "make the model itself the running computer" rather than a program that runs on one (see Figure 4). Refreshingly, it is a position paper, not a results paper: it names the destination, a Completely Neural Computer with "stable execution, explicit reprogramming, and durable capability reuse," and concedes that today only "early interface primitives" such as I/O alignment and short-horizon control have been learned, while "routine reuse, controlled updates, and symbolic stability remain open." Titans, from Google Research, attacks the same boundary from the architecture side: it learns to memorize at test time, treating attention as a fast, accurate short-term memory and adding a neural long-term memory that decides what to retain as the model runs.

Put together, the bet is that the line between "the model" and "the model's computer" collapses into a single learned runtime — and that, like hand-engineered features before deep learning, the external memory stack is the thing that gets absorbed. The scaling curve is the usual one: if learned runtimes get better with model size and compute, the engineered stack becomes a temporary convenience.
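
To make "memory the runtime learns" concrete, here is a toy in the spirit of test-time memorization, not the Titans architecture. Everything in it is my assumption: a linear associative memory `M`, a write rule driven by the gradient of the recall error (a "surprise" signal), and hypothetical values for `lr`, `momentum`, and `decay`. The point is only that writing to memory can be a learning step taken while the system runs, with no external store involved.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16

# Eight key/value pairs the memory should come to recall (toy data).
keys = rng.standard_normal((8, d))
keys /= np.linalg.norm(keys, axis=1, keepdims=True)
values = rng.standard_normal((8, d))

M = np.zeros((d, d))   # the memory is a set of weights, updated at test time
S = np.zeros((d, d))   # momentum buffer carrying recent surprise
lr, momentum, decay = 0.5, 0.3, 0.0  # hypothetical; decay > 0 would add forgetting


def recall_error(memory: np.ndarray) -> float:
    """Mean squared error between recalled and stored values."""
    return float(np.mean((keys @ memory.T - values) ** 2))


before = recall_error(M)
for _ in range(100):
    for k, v in zip(keys, values):
        # Surprise: gradient of 0.5 * ||M k - v||^2 with respect to M.
        surprise = np.outer(M @ k - v, k)
        S = momentum * S - lr * surprise
        M = (1.0 - decay) * M + S
after = recall_error(M)

print(f"recall error: {before:.3f} -> {after:.2e}")
```

Writes are proportional to surprise, so a pair the memory already recalls well causes almost no update. That is the crude analogue of a memory that decides what to retain.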

Figure 4: Memory as a learned runtime, not an attached store. Neural Computers aim to fold program, memory, and I/O into one learned state; Titans builds the split into the architecture — fast attention for the short term, a learned long-term memory for the rest. The bet is that this absorbs the external memory stack [→ B5].

Key Takeaway 3

Bet 2 is that memory and computer are not things you attach to a model but things the runtime learns — Neural Computers from the systems side, Titans from the architecture side — and that the learned runtime eventually absorbs the engineered memory stack.

Falsifier: explicit, hand-engineered memory systems [→ B5] stay strictly better than learned runtimes at what matters — durability, reuse, controllable updates. Neural Computers itself lists these as open today; the bet is that they close.

Bet 3 — Latent world models win

There are two ways to build a world model, and they disagree about what a model of the world is for. One camp predicts the next pixels: render the future and you have modeled it. The other camp — the joint-embedding line associated with Yann LeCun — predicts the next latent embedding: you should not have to draw the future in order to plan over it. The third bet is that the latent camp is right, and that its long-standing fragility was an engineering problem, not a verdict.

LeWorldModel is the cleanest evidence yet. It is, by its own account, the first joint-embedding predictive architecture that trains stably end-to-end from raw pixels using only two loss terms — a next-embedding prediction loss and a regularizer that keeps the latent embeddings Gaussian — which cuts the tunable loss hyperparameters from six to one compared with the only prior end-to-end alternative (see Figure 5). It is 15M parameters, trains on a single GPU in a few hours, and plans up to 48× faster than world models built on top of foundation models, while staying competitive across 2D and 3D control. Probing shows its latent space encodes physical structure, and a surprise signal reliably flags physically implausible events.
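
The two-term objective is easy to write down in miniature. The sketch below is an illustration, not LeWorldModel's loss: the essay's source only says the regularizer keeps the latent embeddings Gaussian, so I substitute a simple moment-matching penalty (zero mean, identity covariance) and a single hypothetical weight `lam`. What it shows is why the second term matters: a collapsed encoder predicts perfectly and is still penalized.

```python
import numpy as np


def prediction_loss(pred_next: np.ndarray, true_next: np.ndarray) -> float:
    """Next-embedding prediction error (mean squared error)."""
    return float(np.mean((pred_next - true_next) ** 2))


def gaussian_regularizer(z: np.ndarray) -> float:
    """Stand-in regularizer: distance of the batch from zero mean and
    identity covariance. An assumption, not the paper's formulation."""
    mu = z.mean(axis=0)
    cov = np.cov(z, rowvar=False)
    return float(np.sum(mu ** 2) + np.sum((cov - np.eye(z.shape[1])) ** 2))


def total_loss(pred_next, true_next, z, lam=0.1) -> float:
    """Two terms, one tunable weight."""
    return prediction_loss(pred_next, true_next) + lam * gaussian_regularizer(z)


rng = np.random.default_rng(0)
healthy = rng.standard_normal((512, 8))   # well-spread embeddings
collapsed = np.zeros((512, 8))            # every input mapped to one point

# The healthy encoder predicts imperfectly; the collapsed one predicts exactly.
noisy_pred = healthy + 0.1 * rng.standard_normal(healthy.shape)
loss_healthy = total_loss(noisy_pred, healthy, healthy)
loss_collapsed = total_loss(collapsed, collapsed, collapsed)

print(f"healthy: {loss_healthy:.3f}  collapsed: {loss_collapsed:.3f}")
```

Without the regularizer, collapse would be the global optimum of the prediction term alone, which is the failure the older stabilizing tricks were built to dodge.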

Why this is a ten-year position and not a one-off: latent prediction has been theoretically attractive and practically brittle for years — propped up by exponential moving averages, pretrained encoders, multi-term losses, all to dodge representation collapse. The bet was always that stable, efficient, end-to-end latent prediction is possible. LeWorldModel is the first clean existence proof, and the open curve is whether that efficiency edge survives the trip to large, general world models. The bet is sharp precisely because it has a live rival, which is Bet 6.

Figure 5: The latent bet, in numbers. LeWorldModel trains with two loss terms and cuts the tunable loss hyperparameters from six to one, plans up to 48× faster than foundation-model world models at 15M parameters on a single GPU, and predicts the next latent embedding rather than the next frame. Stable, efficient, end-to-end latent prediction is the existence proof the LeCun line had been missing.

Key Takeaway 4

Bet 3 is that world models should predict in latent space, not pixel space — and that LeWorldModel's stable, two-loss, single-GPU result shows the LeCun line's long fragility was an engineering problem, now solved in miniature.

Falsifier: generative, pixel-space world models keep beating latent prediction on downstream control across tasks. (Bet 6 is exactly that rival — which is why I hold both, and reconcile them in the portfolio.)

Bet 4 — Optimization without backprop

Backpropagation is so dominant that we forget it is a choice with requirements. It needs a differentiable path from loss to parameters, and it needs synchronized gradients flowing back through the whole network. Evolution Strategies needs neither. It perturbs the parameters, evaluates the perturbations with forward passes only, and recombines — which makes it embarrassingly parallel and indifferent to non-differentiable, noisy objectives. The fourth bet is that this is not a small-problem curiosity but a compute-rider: gradient-free optimization gets cheaper relative to backprop exactly as scale and parallelism grow.

The reason ES never displaced backprop at scale was low arithmetic intensity: naïve perturbations waste GPU throughput. Evolution Strategies at the Hyperscale removes that wall. Its EGGROLL method structures each perturbation as a low-rank matrix, which yields a hundredfold increase in training speed for billion-parameter models at large population sizes, reaching up to 91% of the throughput of pure batch inference (see Figure 6). With that wall gone, the abstract's three results read like a probe of the whole training stack: ES can stably pretrain nonlinear recurrent language models that run in pure integer datatypes, it is competitive with GRPO for post-training language models on reasoning, and it matches ordinary ES in from-scratch RL while running far faster.
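
A minimal sketch of the idea, under stated assumptions: this is plain antithetic ES on a toy quadratic, with each perturbation built as a product of two thin Gaussian factors. It is not EGGROLL's implementation, and the function name, population size, rank, and step sizes are all hypothetical. A NumPy loop also cannot show the throughput gain, which comes from batching low-rank matrix products on accelerators; what it can show is that forward evaluations alone are enough to optimize.

```python
import numpy as np


def low_rank_es_step(W, objective, rng, pop=64, rank=2, sigma=0.1, lr=0.05):
    """One ES ascent step using antithetic, low-rank perturbations of W."""
    m, n = W.shape
    update = np.zeros_like(W)
    for _ in range(pop // 2):
        A = rng.standard_normal((m, rank))
        B = rng.standard_normal((n, rank))
        E = (A @ B.T) / np.sqrt(rank)  # rank-limited, unit-variance entries
        # Forward evaluations only: no gradient of `objective` is ever taken.
        delta = objective(W + sigma * E) - objective(W - sigma * E)
        update += delta * E
    # pop // 2 antithetic pairs, each normalized by 2 * sigma.
    return W + lr * update / (pop * sigma)


rng = np.random.default_rng(0)
target = rng.standard_normal((8, 6))


def objective(W: np.ndarray) -> float:
    """Toy objective to maximize: negative squared distance to a target."""
    return -float(np.sum((W - target) ** 2))


W = np.zeros((8, 6))
start = objective(W)
for _ in range(300):
    W = low_rank_es_step(W, objective, rng)
end = objective(W)

print(f"objective: {start:.3f} -> {end:.6f}")
```

Nothing in the loop requires `objective` to be differentiable, and every evaluation inside a step is independent of the others, which is where the parallelism comes from.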

This is the bet that most clearly rides compute and least clearly removes a human — and I place it that way on the map (lower-right in Figure 2) on purpose. ES is an optimization substrate; a human still writes the objective. Its honesty is the point: not every ten-year bet is an autonomy story. Some are bets that, if you can buy throughput, you can buy away a structural constraint — here, the differentiability requirement itself.

Figure 6: Removing the gradient, not the human. Backprop needs a differentiable, synchronized path; ES needs only forward evaluations of perturbed parameters, which parallelize freely. EGGROLL's low-rank perturbations recover ~100× training speed at billion-parameter scale — up to 91% of pure batch-inference throughput — turning a small-problem method into a compute-rider.

Evolution Strategies at the Hyperscale is part of the published Continual Intelligence collection and is cited here by cross-reference. For evolution as a self-improvement outer loop (systems that rewrite themselves under empirical selection), see [← A8] Darwin-Gödel to ShinkaEvolve. The claim here is narrower and sharper: evolution as a drop-in replacement for backprop at scale.

Key Takeaway 5

Bet 4 is that gradient-free optimization is a compute-rider, not a curiosity: once EGGROLL fixes ES's arithmetic intensity, forward-only, parallel optimization becomes competitive at billion-parameter scale — the bet that throughput can buy away the differentiability requirement.

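Falsifier: gradient access stays cheap at every scale that matters. If backprop's requirements (a differentiable path and synchronized gradients) never become the binding constraint, forward-only optimization has nothing to buy away, and the bet fails.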

Bet 5 — Exploration is the master problem

The deepest bet on the board is the oldest paper on it. In 2022, Minqi Jiang, Tim Rocktäschel, and Edward Grefenstette argued that AI is "at the cusp of a transition from 'learning from data' to 'learning what data to learn from.'" Their claim is that once large unified architectures — transformers — largely solved how to train a model, the binding constraint moved to what to train it on. They name that constraint exploration, and they generalize it past reinforcement learning to all learning, supervised included: the problem of generalized exploration is choosing which experience to acquire next (see Figure 7).

I rank it the master problem because it sits underneath the other five. Recursive context (Bet 1) is exploration of an information environment. World models (Bet 6) manufacture the experience there is to explore. The big-world frame says the world is orders of magnitude larger than the agent, so the agent can never converge — it can only keep choosing what to learn next, forever; exploration is the name for that choice. The engineering of it is the subject of two field-guide essays ([→ B3] Environments Are the Bottleneck and [→ B4] Agents That Learn on the Job).

The position, stated plainly: future capability comes less from more data or more parameters and more from better choices about which experience to acquire. It is the least falsifiable-sounding of the six, which is exactly why it needs a sharp falsifier — otherwise it is a slogan.
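
A deliberately tiny illustration of "which experience, not how much," with every detail assumed for the purpose: four hypothetical tasks with unequal remaining error, a learner for which one unit of experience halves the error on the chosen task, and a fixed budget of four units. The passive policy spreads the budget evenly; the exploring policy spends each unit where the current error is largest.

```python
def learn(errors, task):
    """One unit of experience on `task` halves the remaining error there."""
    errors = list(errors)
    errors[task] *= 0.5
    return errors


def run(policy, errors, budget):
    """Spend `budget` units of experience, letting `policy` pick each task."""
    for step in range(budget):
        errors = learn(errors, policy(errors, step))
    return sum(errors)


def uniform(errors, step):
    """Passive data: cycle through the tasks regardless of what is known."""
    return step % len(errors)


def greedy(errors, step):
    """Chosen data: acquire experience where the current error is largest."""
    return max(range(len(errors)), key=lambda i: errors[i])


start = [8.0, 1.0, 1.0, 1.0]
passive = run(uniform, start, budget=4)
chosen = run(greedy, start, budget=4)
print(passive, chosen)  # 5.5 3.5
```

Same learner, same budget, different total error. The toy is rigged to make the gap obvious; the bet is that a version of this gap shows up at frontier scale, which is exactly what the falsifier tests.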

Figure 7: The transition the other bets sit on top of. When unified architectures solved how to train, the bottleneck moved to what to train on. Jiang and colleagues call the general version of that choice exploration — and argue it spans supervised learning and RL, not just RL. It is the substrate of recursive context (Bet 1) and of world-models-as-data (Bet 6).

The frame that makes exploration unavoidable — an agent multiple orders of magnitude smaller than its world, which can never converge and must keep choosing what to learn — is the published series' [← A3] The Big World Hypothesis. This essay does not re-argue it; it inherits it.

Key Takeaway 6

Bet 5 is that exploration — "learning what data to learn from," generalized across supervised learning and RL — is the master problem under the other five: the source of future capability is which experience an agent acquires, not how much.

Falsifier: frontier capability keeps coming from scaling passive data and compute, with no measurable gain from systems that decide what data to collect.

Bet 6 — World models are trainable environments

The previous bet says exploration needs the right environment. This bet draws the consequence: stop building environments and start training them. A world model good enough to act inside is an environment you can generate — which turns the scarcest input in agent learning, the environment, into a trainable, ownable asset. The slogan is "environments are data," and in 2024–2026 it stopped being a slogan. Four papers, pointing the same way, make this the bet with the most existence proofs already on the table (see Figure 8).

Genie is the anchor. It is, in its authors' words, "the first generative interactive environment trained in an unsupervised manner from unlabelled Internet videos." At 11B parameters it behaves as a foundation world model, built from a spatiotemporal video tokenizer, an autoregressive dynamics model, and — the crucial piece — a learned latent action model with a small codebook (eight latent actions) that infers what action occurred between frames with no action labels at all. Trained on a filtered 30,000 hours of 2D-platformer video drawn from over 200,000 hours of public gameplay, using only video at train time, it turns a single image, photograph, or hand-drawn sketch into a world you can step into and play frame by frame. Its own framing is the bet in one line: the learned latent actions let you train agents to imitate behavior from unseen video, "unlocking unlimited data for training the next generation of generalist agents."

GameNGen is the existence proof that a neural network can simply be the environment. It is "the first game engine powered entirely by a neural model," running DOOM in real time at 20 frames per second on a single TPU, with next-frame quality at a PSNR of 29.4 — comparable to lossy JPEG — and stability across multi-minute sessions. The decisive number is human: raters were "only slightly better than random chance" at telling short clips of the real game from the simulation, even after five minutes of autoregressive generation. Dreamer 4 closes the loop from environment to agent: it trains the policy by reinforcement learning inside the world model — in imagination — and is "the first agent to obtain diamonds in Minecraft purely from offline data," a task requiring sequences of over 20,000 mouse and keyboard actions from raw pixels, with the world model running in real time on a single GPU. That is the payoff the bet promises: a hard task learned with no live environment access. Cosmos, finally, is the industrialization signal — NVIDIA shipping a "World Foundation Model Platform for Physical AI," with open-weight world-foundation models, a video-curation pipeline, and tokenizers meant to be fine-tuned into custom world models. As a vendor platform its downstream results are a practitioner claim, not a peer-reviewed finding; what it demonstrates is that world models are now being sold as infrastructure.
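
For readers who want the PSNR figure in more familiar units, the standard definition is PSNR = 10 · log10(MAX² / MSE). The conversion below assumes 8-bit pixels (MAX = 255), which is my assumption about the scale rather than something stated above.

```python
import math


def psnr(mse: float, max_val: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB for a given mean squared error."""
    return 10.0 * math.log10(max_val ** 2 / mse)


def mse_from_psnr(psnr_db: float, max_val: float = 255.0) -> float:
    """Invert the PSNR definition to recover mean squared error."""
    return max_val ** 2 / (10.0 ** (psnr_db / 10.0))


mse = mse_from_psnr(29.4)
rmse = math.sqrt(mse)
print(f"MSE ~ {mse:.1f}, RMSE ~ {rmse:.2f} on a 0-255 scale")
```

Under that assumption, a PSNR of 29.4 corresponds to a root-mean-square error of roughly 8.6 intensity levels per pixel value.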

Note the deliberate tension with Bet 3. These four are mostly generative, pixel-space models — the very camp whose continued dominance on control would falsify the latent-world-model bet. I hold both on purpose, and I reconcile them in the portfolio: I am betting hard that world models matter, and hedging on which representation wins.

System	What it demonstrates	Maturity	Open risk
Genie (DeepMind, 2024)	11B foundation world model; playable environments from a single prompt; latent actions learned unsupervised from unlabelled video (only video at train time)	✅ existence proof	⚠️ 2D platformers; fidelity and horizon
GameNGen (Google, 2024)	A neural net that is the engine: DOOM at 20 fps on one TPU, PSNR 29.4, humans near chance after 5 min	✅ real-time playable	⚠️ single game; long-horizon drift
Dreamer 4 (Hafner et al., 2025)	RL inside the world model: Minecraft diamonds from offline data only, >20,000 actions, single-GPU real time	✅ hard-task transfer	⚠️ one domain so far
Cosmos (NVIDIA, 2025)	Open-weight world-foundation-model platform for Physical AI (curation, tokenizers, fine-tuning)	⚠️ platform / infra (vendor)	❌ downstream agent results pending

Figure 8: The bet is already live. Four systems, one direction — a foundation world model from raw video (Genie), a neural net that is the engine (GameNGen), an agent trained inside one (Dreamer 4), and a vendor platform (Cosmos). Existence proofs, not promises — which is what moves "environments are data" from slogan to position.

Key Takeaway 7

Bet 6 is that environments — the scarcest input in agent learning — become a trainable asset: Genie learns one from raw video, GameNGen is one, Dreamer 4 trains an agent inside one to a hard offline task, and Cosmos sells them. "Environments are data" already has its existence proofs.

Falsifier: agents trained inside learned world models keep failing to transfer — if in-imagination training never beats training on real interaction at equal cost, the bet fails. Dreamer 4's diamonds-from-offline-data is the current evidence against that failure.

The portfolio view

These six are not independent draws, and reading them as a portfolio is the point of putting them on one page (see Figure 9). Two are the same instinct at different layers: Bet 1 (context as an environment) and Bet 5 (exploration) both say intelligence is acting on an environment to obtain the next useful piece of data — and Bet 6 (world models as environments) supplies the environments Bet 5 needs. Two are partial rivals: Bet 3 (latent prediction wins) and Bet 6 (generative world models are already trainable environments) disagree about representation, so the generative camp's success is precisely Bet 3's falsifier. Holding both is not indecision; it is the correct structure of the wager — long "world models matter," neutral on "latent versus pixel." And one stands almost alone: Bet 4 (backprop-free optimization) is a substrate that can win or lose without touching the others.

You do not need all six to pay off. The payoff structure of a ten-year tier is convex: most bets return nothing, and a couple return the decade. Suppose just two hit — say world models become the dominant source of training experience (Bet 6) and "what to learn next" becomes an explicit, optimized objective (Bet 5). That alone sets the shape of 2036: agents trained largely inside generated environments, with exploration as the central control problem. The other four can fail and the decade still turns on those two. That is why a tier is scored as a portfolio and not as a list of predictions to be graded individually.

Figure 9: The six as a portfolio. Bets 1, 5, and 6 reinforce each other; Bet 3 and the generative camp inside Bet 6 are rivals (one falsifies the other); Bet 4 stands alone. The payoffs are convex — two hits set the decade — and the whole tier is reviewed on one date.

Now the contract. Each bet above names the one observation that folds it. Here is the dated review, set for 2029, three years into the ten-year run to 2036, with a concrete thing to look at for each:

Bet 1 (context): by 2029, do frontier systems interact with long context recursively by default, or did larger windows make recursion moot?
Bet 2 (runtime): by 2029, is there a deployed system whose memory is a learned runtime rather than an attached store [→ B5]?
Bet 3 (latent): by 2029, does latent prediction lead generative world models on any control benchmark?
Bet 4 (backprop-free): by 2029, has a gradient-free method trained a frontier-scale model that backprop could not, or could not afford?
Bet 5 (exploration): by 2029, is "what data to collect" a named, optimized component of frontier training, or still incidental?
Bet 6 (world models): by 2029, has an agent trained mostly in a learned world model beaten one trained on real interaction at equal cost, on a task anyone cares about?

If most of those read "no" in 2029, the honest conclusion is not "wait longer." It is that I mis-scored the tier, and the selection lens that produced it ([← 1] The Selection Lens) needs re-fitting — not just these six picks.
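
The review rule is simple enough to write as code, which keeps me honest about what "most" means. In this sketch the falsifier strings are my paraphrases, the verdict strings are hypothetical, and "most" is read as a strict majority; none of that is fixed by the essay beyond the rule itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Bet:
    number: int
    name: str
    folds_if: str  # paraphrase of the bet's 2029 "no" outcome


TIER = [
    Bet(1, "context", "larger windows made recursion moot"),
    Bet(2, "runtime", "no deployed system with learned-runtime memory"),
    Bet(3, "latent", "latent prediction leads no control benchmark"),
    Bet(4, "backprop-free", "no frontier-scale gradient-free training"),
    Bet(5, "exploration", "data collection is still incidental"),
    Bet(6, "world models", "no world-model-trained agent wins at equal cost"),
]


def review(answers: dict[int, bool]) -> str:
    """`answers[n]` is True if bet n's 2029 question reads "yes"."""
    noes = [bet for bet in TIER if not answers[bet.number]]
    if len(noes) > len(TIER) / 2:  # assumption: "most" means a strict majority
        return "refit the lens"
    folded = ", ".join(bet.name for bet in noes) or "none"
    return "hold the tier; fold: " + folded


print(review({1: True, 2: False, 3: False, 4: False, 5: True, 6: True}))
print(review({1: False, 2: False, 3: False, 4: False, 5: True, 6: True}))
```

Three "no" answers fold three bets and leave the tier standing; four indict the lens. Writing the threshold down in advance is the whole point of a dated review.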

Key Takeaway 8

Score the tier as a portfolio, not a list: the bets are correlated (1·5·6), rival (3 vs the generative camp in 6), and independent (4), and the payoffs are convex — two hits set the decade.

The committed review: in 2029 I check each bet against the concrete observable above. If most read "no," the lens is wrong, not just the picks — and that is the falsifier for the essay as a whole.

What comes next

Of the six, one is already metastasizing into a scaling axis of its own. Recursion — the recursive self-calls of Bet 1 — is not just a trick for long context; the same shape shows up in looped transformers, in recursive agents, and in recursive multi-agent systems. [→ 3] Recursion: The Third Scaling Axis expands that bet into a claim that recursion is a third axis beside depth and width, with one curve drawn across four substrates. And the discipline that makes Bet 6 credible — a capability you can point at rather than predict — gets its sharpest test in [→ 5] When AI Did Mathematics, where a system produced mathematics that mathematicians verified and did not already know.

A ten-year bet is not a prediction. It is a position with a stop-loss. I have stated six, and I have stated the price of each. The next essay raises one of them — recursion — and asks how far the curve really goes.

References

Bruce, J., Dennis, M., Edwards, A., Parker-Holder, J., Shi, Y., et al. (2024). Genie: Generative Interactive Environments. Proceedings of the 41st International Conference on Machine Learning (ICML). arXiv:2402.15391.
Chang, E., Le Lan, G., Fei, J., Zhang, W., Sun, Y., Cai, Z., Liu, Z., Xiong, Y., Yang, Y., Tian, Y., Shi, Y., Chandra, V., & Schmidhuber, J. (2026). Neural Computers. arXiv:2604.06425.
Maes, L., Le Lidec, Q., Scieur, D., LeCun, Y., & Balestriero, R. (2026). LeWorldModel: Stable End-to-End Joint-Embedding Predictive Architecture from Pixels. arXiv:2603.19312.
Behrouz, A., Zhong, P., & Mirrokni, V. (2025). Titans: Learning to Memorize at Test Time. arXiv:2501.00663.
Hafner, D., Yan, W., & Lillicrap, T. (2025). Training Agents Inside of Scalable World Models (Dreamer 4). arXiv:2509.24527.
Valevski, D., Leviathan, Y., Arar, M., & Fruchter, S. (2024). Diffusion Models Are Real-Time Game Engines (GameNGen). arXiv:2408.14837.
Jiang, M., Rocktäschel, T., & Grefenstette, E. (2022). General Intelligence Requires Rethinking Exploration. arXiv:2211.07819.
NVIDIA. (2025). Cosmos World Foundation Model Platform for Physical AI. arXiv:2501.03575.

Cross-referenced from the published Continual Intelligence collection and cited there: Recursive Language Models (MIT CSAIL, 2025) and Evolution Strategies at the Hyperscale / EGGROLL (2025). Historical landmarks in §1 and Figure 2 — GPT-3 (2020), AlphaZero, AlexNet — are common-knowledge calibration, not sources in this essay's reading list.