Planning and the Myopia Problem

[← 6] Tools, Skills, and the Action Interface — that article wired the agent to the world: how a single action gets chosen and executed through tools. This one asks the harder question one level up — whether the model can plan the sequence of those actions. The answer, read off the model's own search, is that it cannot plan far, and the engineering response is to stop asking it to.

§1 · The plan that wasn't

Give a strong reasoning model a board position and ask it for the best move. Its chain of thought will read like a plan: it names threats, considers replies, projects a few moves ahead, weighs one line against another. It looks exactly like deliberate lookahead — the thing we mean when we say a system plans (see Figure 1). So it is worth asking, with instruments rather than impressions, whether the lookahead is real.

Chen and colleagues (2026) built the instrument. Working in the board game four-in-a-row — small enough to enumerate, rich enough to require foresight — they take a reasoning model's raw trace and reconstruct the search tree behind it: which positions it considered, how deep each line went, and which of those considerations actually moved its final choice. Then they fit computational models of planning to that tree, the same way a psychologist fits a model of search to a human player's eye movements. The reconstruction is the contribution; what it reveals is the problem (see Figure 2).

Three findings, in increasing order of discomfort. First, the model's search is shallower than a human's on the same positions. Second — and this is the one that should change how you build — its performance is predicted by search breadth, the number of distinct first moves it entertains, and not by search depth, how far down any line it reads. Third, and most damning: although the trace visibly expands deep nodes, the model's actual move is best explained by a myopic model that ignores those deep nodes entirely. A causal test confirms it — prune the deep-lookahead paragraphs out of the chain of thought and the move does not change; prune the shallow ones and it does. The deliberation over distant futures is, mechanically, decoration. This is the inverse of human expertise, where deeper search is exactly what drives skill.
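The two quantities in the second finding are easy to pin down once a trace has been turned into explicit lines of play. The sketch below is a minimal illustration, not the extraction pipeline of Chen et al.: it assumes the hard part (recovering the lines from the prose) is already done, and the move names are made up.

```python
from typing import Sequence


def search_breadth(lines: Sequence[Sequence[str]]) -> int:
    """Breadth: the number of distinct first moves the trace entertains."""
    return len({line[0] for line in lines if line})


def search_depth(lines: Sequence[Sequence[str]]) -> int:
    """Depth: the length of the longest line of play that was read out."""
    return max((len(line) for line in lines), default=0)


# Hypothetical lines recovered from one trace; each is a sequence of moves.
lines = [
    ["c4", "d4", "c5"],
    ["c4", "e4"],
    ["d3"],
    ["e2", "e3", "e4", "e5"],
]

print(search_breadth(lines))  # 3 distinct first moves: c4, d3, e2
print(search_depth(lines))    # longest line is 4 moves
```

On the article's reading, it is the first number that tracks performance, while the second is what the prose makes look impressive.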

This is the myopia problem, and it is not a quirk of one board game; it is a property of how a single forward pass spends its compute. Before going further, fix the notation the series uses (defined in B1 and reused here):

M is the model: the weights θ and the single forward pass that runs the chain of thought.

H is the harness around M: sampling, tools, verifiers, memory, and the loop that calls M more than once.

E is the environment: the world or checker that says, with ground truth, whether an action was good.

τ is a trajectory: the multi-step episode of states, actions, and observations the agent produces.

Here is the thesis the rest of the article defends. Reasoning models search myopically; where the stakes justify the cost, externalized search — verifiers, committees, formal checkers — beats a longer chain of thought, and verified search at the frontier shows the ceiling is high when the search is real. It is a falsifiable claim: if longer chains of thought bought genuine lookahead, or if extracted trees showed deep search driving choices, it would be wrong. They do not. The cure, then, is structural. Myopia lives inside M's forward pass. Every fix in this article is a way of moving search out — first to cheap tricks still inside M, then into the harness H, finally onto ground truth in E — and the whole progression sorts onto a single axis of cost.

Figure 2: A search tree reconstructed from a reasoning trace in four-in-a-row (Chen et al., 2026). The model expands deep lines (faint) but its move is explained by the shallow layer (orange); performance tracks breadth, not depth. Why it matters: the lookahead is present in the text and absent from the decision — myopia is measurable, not a metaphor.

Key Takeaway 1

Reconstruct the search tree behind a reasoning trace and the planning turns out to be myopic: the model expands deep nodes but chooses by shallow ones, and performance tracks search breadth, not depth (Chen et al., 2026). A longer chain of thought does not buy lookahead. Myopia is a property of the single forward pass M — which is why every real fix moves search outside it.

§2 · Why a chain of thought is not a search

Set aside the philosophical question of whether the trace counts as "real reasoning"; it is a distraction from the engineering one. The A-series read these traces for their shape — what separates a good reasoning path from a bad one [← A5]; this article reads them for their search. The engineering question is narrow and answerable: is the search deep enough to trust? And the mechanism behind Figure 2 says why it is not. A classical search procedure has two things this model lacks: a backtracking budget — the license to abandon a committed line and return to it — and a frontier it actually expands in order of promise. A forward pass has neither. It emits tokens left to right, and once it has written its way into a line of analysis, the probability mass that would carry it back out has mostly evaporated. It commits early because the architecture rewards local coherence, and local coherence is precisely what myopia looks like from the inside.

The causal-pruning test makes the point sharper than any aggregate score could (see Figure 3). Take a completed trace, delete the paragraphs that did the deep lookahead, and re-run the decision: the move is unchanged. Delete an early, shallow paragraph instead and the move flips. The deep search is load-bearing in the prose and inert in the choice. This is why a benchmark number alone would have hidden the problem — a myopic model and a deep one can score identically on positions where shallow heuristics happen to suffice; only by reconstructing the tree do you see that the model would not survive positions where depth is the whole game.
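The logic of the pruning test fits in a few lines. What follows is a sketch under stated assumptions: `Paragraph`, `pruning_test`, and the `deep_cutoff` threshold are hypothetical names, and `toy_myopic_decider` is a stand-in for re-prompting the real model with the pruned trace, which is what the actual experiment requires.

```python
from typing import Callable, Dict, List, NamedTuple


class Paragraph(NamedTuple):
    text: str
    depth: int  # how many moves ahead this paragraph reasons (assumed pre-labelled)


def pruning_test(
    trace: List[Paragraph],
    decide: Callable[[List[Paragraph]], str],
    deep_cutoff: int = 3,
) -> Dict[str, object]:
    """Re-run the decision with deep or shallow paragraphs removed."""
    baseline = decide(trace)
    no_deep = decide([p for p in trace if p.depth < deep_cutoff])
    no_shallow = decide([p for p in trace if p.depth >= deep_cutoff])
    return {
        "baseline": baseline,
        "deep_is_causal": no_deep != baseline,
        "shallow_is_causal": no_shallow != baseline,
    }


def toy_myopic_decider(trace: List[Paragraph]) -> str:
    """Toy model: plays the move named by the first shallow paragraph."""
    for p in trace:
        if p.depth <= 1:
            return p.text.split()[-1]
    return "pass"


trace = [
    Paragraph("Immediate threat on the c-file, so play c4", depth=1),
    Paragraph("After c4 d4 c5 d5 the follow-up would be e2", depth=4),
]
result = pruning_test(trace, toy_myopic_decider)
print(result)
```

For this toy decider the deep paragraph is not causal and the shallow one is, which is the signature the article attributes to the real traces.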

There is a deployment corollary that matters even when you never extract a tree. Once a trajectory has committed to a wrong line, what does the next unit of compute buy you? Islah and colleagues (2026) study failed reasoning traces directly and find that the common reflex — resample from the same model — is rank-preserving: temperature and retry reweight the tokens the model already preferred, but cannot promote a token that was never a local mode, so a trajectory locked into a stable wrong line stays wrong no matter how many times you roll again. Their diagnostic reads the distributional signature of the failed trace, not its words, and sorts failures into kinds that demand different responses: an unlucky sample wants a retry; a locked trajectory wants a perturbation or a logit-space steer that can actually invert local ranks; some want nothing at all under the budget you have. The lesson generalizes past their setting. If the model is myopic, spending more of the same myopia is not a plan. Depth has to come from somewhere else — and the rest of this article is a tour of where, sorted by what it costs.
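One narrow piece of the rank-preservation claim can be checked directly: dividing logits by a positive temperature never reorders them, whereas an additive steer in logit space can. The snippet is a self-contained illustration with made-up logits; it is not the diagnostic from Islah et al.

```python
import math
from typing import List


def softmax(logits: List[float], temperature: float = 1.0) -> List[float]:
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def ranking(values: List[float]) -> List[int]:
    """Token indices ordered from most to least preferred."""
    return sorted(range(len(values)), key=lambda i: values[i], reverse=True)


logits = [2.0, 1.5, -0.5, 0.1]  # hypothetical next-token logits
base = ranking(logits)

# Temperature reshapes the distribution but leaves the order untouched.
for t in (0.2, 0.7, 1.0, 2.0, 5.0):
    assert ranking(softmax(logits, t)) == base

# An additive steer on token 1 (an assumed intervention) does invert ranks.
steered = [x + bias for x, bias in zip(logits, [0.0, 1.0, 0.0, 0.0])]
print(base, ranking(steered))
```

Sampling at a higher temperature will still draw lower-ranked tokens more often, but the preferred continuation at each step stays preferred, which is why a retry alone rarely escapes a stable wrong line.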

Figure 3: The causal-pruning test (Chen et al., 2026), schematic. Removing the deep-lookahead paragraphs leaves the move unchanged; removing a shallow node changes it. Why it matters: it isolates cause from correlation — the move is made shallowly, and the deep deliberation is not what produced it.

Key Takeaway 2

A forward pass has no backtracking budget and no expandable frontier, so it commits early and the deep lookahead in its trace is decorative (Chen et al., 2026). The deployment corollary: once a trajectory is locked wrong, resampling is rank-preserving and cannot recover it — read the failed trace to decide whether to retry, perturb, or stop (Islah et al., 2026). Spending more of the same myopia is not a plan.

§3 · Training toward lookahead — the cheap fixes

The cheapest place to add depth is the model itself: change what M does inside its forward pass without standing up any external machinery. Two training ideas push directly on the myopia, and a third tells you how to spend the budget they unlock. None requires a verifier at inference time, which is exactly why they are the bottom of the cost axis — and also why their ceiling is low. (The training-side view — shaping exploration through the environment itself — is B3's subject [← 3]; here the lever is test-time.)

The first is to train the model to explore. Setlur and colleagues (2025), in a recipe they call e3, observe that most reasoning models do not extrapolate: give them more test-time tokens than they saw in training and they do not keep improving. The cure is to teach in-context exploration during training — chaining operations of asymmetric competence (generate, then verify, then refine), amplifying the search with negative gradients from incorrect traces so the model keeps looking instead of settling, and coupling problem difficulty to the token budget through a curriculum. A model trained this way uses a longer budget as search rather than as a longer monologue, and it keeps improving past the horizon it was trained on. This is the honest version of "let it think longer" — it works only because the thinking was reshaped into exploration first.

The second is to give the model better units to search over. Qu and colleagues (2025), in RLAD, train two cooperating policies: one proposes reasoning abstractions — short natural-language descriptions of a procedure or fact worth trying — and the other writes solutions conditioned on them. The abstractions act as planning macros: instead of searching the raw token space one step at a time, the model searches a smaller space of named strategies, which is a structurally less myopic thing to do. Their most telling result is an allocation one: at large test-time budgets, spending compute on generating more abstractions beats spending it on generating more solutions. Better moves to consider beats more rollouts of the same move — the breadth lesson from §1, now as a training target.

The third idea is not a model change but a spending rule, and it governs the other two. Snell and colleagues (2024) ask how to allocate a fixed inference budget and show that the right allocation is difficulty-adaptive: easy prompts want a light touch, hard prompts want verifier-guided search. Matching the strategy to the prompt makes the same compute up to 4× more efficient than a best-of-N baseline. On problems where a small model has non-trivial success, that is enough to match a model 14× its size at matched FLOPs. The result is the quiet foundation under everything that follows: test-time compute is worth paying for, but only when it is steered, and steering means a signal about whether you are on track. Inside M, that signal is whatever the model can check about itself, which is the ceiling these cheap fixes hit. To go further you have to pay for a checker that is not the model (see Figure 4).
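As a rough illustration of what a difficulty-adaptive spending rule looks like in a harness, here is a minimal router. Everything specific in it is an assumption made for the sketch: the `pilot_pass_rate` difficulty estimate, the `easy_threshold` of 0.7, and the strategy names are hypothetical and are not taken from Snell et al.

```python
from typing import Dict, Union


def choose_strategy(
    pilot_pass_rate: float,
    budget: int,
    easy_threshold: float = 0.7,  # illustrative cut-off, not a published value
) -> Dict[str, Union[str, int]]:
    """Pick a test-time strategy from an estimated difficulty.

    pilot_pass_rate: fraction of a few cheap pilot samples judged correct,
    used here as a stand-in difficulty estimate (higher means easier).
    """
    if not 0.0 <= pilot_pass_rate <= 1.0:
        raise ValueError("pilot_pass_rate must lie in [0, 1]")
    if budget < 1:
        raise ValueError("budget must be at least 1 sample")
    if pilot_pass_rate >= easy_threshold:
        # Easy prompt: a light touch, one pass and no search.
        return {"strategy": "single_pass", "samples": 1}
    # Hard prompt: spend the whole budget on verifier-guided search.
    return {"strategy": "verifier_guided_search", "samples": budget}


print(choose_strategy(0.9, budget=16))
print(choose_strategy(0.1, budget=16))
```

The design point is only that the budget is a function of the prompt rather than a constant; a production rule would need a calibrated difficulty estimate and more than two tiers.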

Figure 4: The cheap fixes keep search inside M. Exploration training makes a longer budget behave like search and extrapolate past the training horizon (Setlur et al., 2025); discovered abstractions give the model coarser, less myopic units to search over (Qu et al., 2025); compute-optimal allocation spends the budget where difficulty warrants (Snell et al., 2024). Why it matters: these raise the floor cheaply but are bounded by what M can verify about itself.

Key Takeaway 3

The cheapest fixes keep search inside M: train it to explore so a longer budget extrapolates rather than rambles (Setlur et al., 2025), and let it discover abstractions that act as planning macros — where more abstractions beat more solutions (Qu et al., 2025). Steered by difficulty-adaptive allocation, the same compute goes up to 4× further and can match a 14× larger model (Snell et al., 2024). The ceiling is the model checking itself; to climb past it you must pay for a checker that is not the model.

§4 · Committees as shallow search made wide

The first real externalization is the cheapest one that leaves the model: stop asking one trajectory to be deep and instead run many shallow ones, then select. This is breadth bought in the harness H, and §1 already told us why it should work — for these models, breadth predicts performance and depth does not. A committee is the deliberate exploitation of that fact.

Sunkaraneni and colleagues (2026) give the mechanism its honest name: inference-time boosting. Boosting turns weak predictors into a strong one by combining many imperfect-but-useful signals, and a committee of reasoning-model calls is the same move at inference time. But they are careful about what the committee can and cannot do. Repeated sampling amplifies coverage — the chance that some proposal is correct — but sampling alone cannot build a critic or a comparator that recovers the good proposal from the pile. For that you need an additional local soundness signal: execution, tests, a type checker, a proof checker, a constraint solver — something outside the model that can say, locally, "this step is sound." Their numbers make the structure concrete. On SWE-bench Verified, a single proposal from a weak model (a GPT-5.4 nano) solves 67.0% of tasks; a critic–comparator orchestration over k = 8 of that same model's proposals reaches 76.4%, matching far stronger standalone reasoning models — the frontier systems whose rise the A-series traced [← A9] — and approaching the 79.0% oracle best-of-8 ceiling (see Figure 5). The remaining gap is almost entirely coverage failure — shared blind spots where no proposal was correct — which selection cannot fix. Breadth plus a verifier buys depth; breadth alone does not.
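The three reported numbers support a short piece of arithmetic. The sketch assumes, purely for contrast, that every task had the same 67.0% per-sample success rate and that the eight samples were independent; neither assumption is claimed by the source, and the function names are made up.

```python
def independent_coverage(p_single: float, k: int) -> float:
    """Chance that at least one of k independent samples is correct."""
    return 1.0 - (1.0 - p_single) ** k


def headroom_recovered(single: float, selected: float, oracle: float) -> float:
    """Share of the gap between single-shot and oracle that selection closes."""
    return (selected - single) / (oracle - single)


# Figures quoted above for SWE-bench Verified with k = 8.
single, selected, oracle = 0.670, 0.764, 0.790

naive = independent_coverage(single, k=8)
recovered = headroom_recovered(single, selected, oracle)

print(f"coverage if samples were independent: {naive:.4f}")
print(f"share of oracle headroom recovered:   {recovered:.3f}")
```

Under the independence assumption coverage would be about 0.9999, far above the reported 79.0% oracle ceiling, so failures must be concentrated on tasks the model almost never solves, which matches the "shared blind spots" reading. The second figure, about 0.78, says the critic and comparator recover most of the headroom that coverage leaves available.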

Zhou and colleagues (2026) attack the selection bottleneck head-on. Scaling breadth is trivial; choosing the winner without a ground-truth verifier is not, because a model asked to judge its own outputs pointwise is noisy and biased. OpenDeepThink replaces pointwise judging with a Bradley–Terry tournament: the model compares random pairs of candidates, votes are aggregated into a global ranking, the top of the population is kept and mutated with the natural-language critiques generated during comparison, and the bottom is discarded. Run as a small evolutionary loop, it raises a strong model's effective Codeforces rating by +405 Elo over eight rounds of comparison. The crucial caveat is the same one, observed from the other side: the gains concentrate in objectively verifiable domains and reverse in subjective ones. When there is no soundness signal — when "better" is a matter of taste — more comparison makes the aggregate worse, because the committee is now amplifying a bias instead of recovering a truth. A committee is a search procedure exactly to the degree that something outside it can check the answer.
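To make the aggregation step concrete, here is the standard minorize-maximize fit of a Bradley-Terry model from pairwise votes. It is a textbook fit and not OpenDeepThink's implementation: the `prior` pseudo-count is an added regularizer (an assumption of this sketch) that keeps a candidate with zero wins from collapsing to strength zero, and the votes are invented.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def bradley_terry(
    votes: List[Tuple[str, str]],
    iters: int = 200,
    prior: float = 0.1,  # pseudo-count regularizer, an assumption of this sketch
) -> Dict[str, float]:
    """Fit strengths from (winner, loser) votes; winner and loser must differ."""
    wins: Dict[str, float] = defaultdict(float)
    pair_counts: Dict[frozenset, float] = defaultdict(float)
    items = set()
    for winner, loser in votes:
        wins[winner] += 1.0
        pair_counts[frozenset((winner, loser))] += 1.0
        items.update((winner, loser))

    strength = {i: 1.0 for i in items}
    for _ in range(iters):
        updated = {}
        for i in items:
            denom = 0.0
            for pair, n in pair_counts.items():
                if i in pair:
                    (j,) = pair - {i}
                    denom += n / (strength[i] + strength[j])
            updated[i] = (wins[i] + prior) / denom
        total = sum(updated.values())
        strength = {i: v / total for i, v in updated.items()}
    return strength


# Hypothetical pairwise votes over three candidate solutions.
votes = [("A", "B")] * 3 + [("B", "A")] + [("A", "C")] * 2 + [("B", "C")] * 2 + [("C", "B")]
strengths = bradley_terry(votes)
order = sorted(strengths, key=strengths.get, reverse=True)
print(order)
```

The global ranking comes from all comparisons jointly, so a candidate can rank high without having met every rival. That is also why the method inherits the judge's bias: if the votes are systematically wrong, the fit faithfully aggregates the error.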

Figure 5: A committee externalizes search into H — breadth made wide, then selected. A weak proposer's single shot (67.0%) rises to 76.4% under a critic–comparator over k=8, near the 79.0% oracle ceiling on SWE-bench Verified (Sunkaraneni et al., 2026); a Bradley–Terry tournament adds +405 Elo in eight rounds (Zhou et al., 2026). Why it matters: amplification is real only behind a local soundness signal; the residual gap is coverage, which selection cannot close.

Key Takeaway 4

A committee externalizes search into the harness: many shallow proposals plus selection substitute breadth for depth. A single weak proposal solves 67.0% of SWE-bench Verified; a critic–comparator over k=8 of its own proposals reaches 76.4%, near the 79.0% oracle ceiling (Sunkaraneni et al., 2026). But the amplification is real only behind a local soundness signal — execution, tests, proof or type checks — and gains reverse in subjective domains where no such signal exists (Zhou et al., 2026). The residual failures are coverage, not selection.

§5 · Planning in a world model

Committees buy breadth in the present. The next regime buys depth into the future — by giving the agent a model of the environment it can search inside before it acts. This is the world-model move, and it is where planning research and agentic engineering meet. Chu and colleagues (2026), surveying more than four hundred works, give the capability a clean ladder: an L1 Predictor learns one-step transitions; an L2 Simulator composes them into multi-step, action-conditioned rollouts that respect the domain's laws; an L3 Evolver revises its own model when its predictions break against new evidence. A planner with an L2 simulator can do the one thing a myopic forward pass cannot — expand a real frontier of futures and compare them — because the simulator, not the language model's foresight, supplies the lookahead (see Figure 6). The depth is no longer faked in prose; it is computed in a model.
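The difference between myopic choice and simulator-backed lookahead shows up even in a toy. The sketch below assumes a perfect, hand-written simulator on a number line with a made-up reward (a small toll at position 1, a payoff at position 3); `plan`, `simulate`, and `reward` are hypothetical names, and exhaustive enumeration is used only because the toy is tiny.

```python
from itertools import product
from typing import Callable, Sequence, Tuple


def plan(
    state: int,
    simulate: Callable[[int, int], int],
    reward: Callable[[int], float],
    actions: Sequence[int],
    horizon: int,
) -> Tuple[Tuple[int, ...], float]:
    """Roll out every action sequence in the simulator and keep the best one."""
    best_seq: Tuple[int, ...] = ()
    best_return = float("-inf")
    for seq in product(actions, repeat=horizon):
        s, total = state, 0.0
        for a in seq:
            s = simulate(s, a)  # imagined transition, no real environment call
            total += reward(s)
        if total > best_return:
            best_seq, best_return = seq, total
    return best_seq, best_return


def simulate(s: int, a: int) -> int:
    return s + a


def reward(s: int) -> float:
    return {1: -1.0, 3: 5.0}.get(s, 0.0)


myopic = plan(0, simulate, reward, actions=(-1, 1), horizon=1)
deep = plan(0, simulate, reward, actions=(-1, 1), horizon=3)
print(myopic)  # steps away from the toll
print(deep)    # pays the toll to reach the payoff
```

With a one-step horizon the planner avoids the toll and never sees the payoff; with three steps it walks through the toll to collect it. The lookahead comes entirely from the simulator, which is the point of the L2 rung.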

That power comes with a failure mode that the cheap fixes did not have, and Timor and colleagues (2026) quantify it. When you train a policy on imagined rollouts — trajectories generated by a learned dynamics model and scored by a learned reward model, without touching the real environment — every error in those learned models is an error the policy will happily exploit. They derive the optimal balance between dynamics samples and reward samples that minimizes return error under power-law scaling, identify low Lipschitz constants of the learned dynamics and reward as the representation property that tightens the bound, and show a reassuring fact about noise: zero-mean reward noise leaves the policy-gradient estimator unbiased and only adds variance that shrinks with more rollouts. The engineering reading is a discipline, not a discouragement. Plan in the model to get cheap depth; verify in the world before you trust it, and budget your samples where the model is least reliable. The depth a world model gives you is real but on loan — the loan comes due in E, and an L3 system is one that pays it back by correcting itself when reality disagrees.
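Why a low Lipschitz constant matters for imagined rollouts can be seen from a standard compounding-error argument. This is a textbook bound, not the result derived by Timor et al.: if the learned dynamics are L-Lipschitz and off by at most ε per step, the state error obeys e(t+1) ≤ L·e(t) + ε, so after H steps it is at most ε·(1 + L + ... + L^(H-1)). The numbers below are illustrative.

```python
def rollout_error_bound(eps: float, lipschitz: float, horizon: int) -> float:
    """Upper bound on state error after `horizon` imagined steps.

    Assumes a one-step model error of at most eps and L-Lipschitz learned
    dynamics, giving eps * sum(L**i for i in range(horizon)).
    """
    return eps * sum(lipschitz ** i for i in range(horizon))


eps, horizon = 0.01, 20  # illustrative values
for L in (0.9, 1.0, 1.5):
    print(f"L={L}: bound after {horizon} steps = {rollout_error_bound(eps, L, horizon):.4f}")
```

Below L = 1 the bound saturates, at L = 1 it grows linearly with the horizon, and above 1 it grows geometrically, which is one way to read the advice to keep the learned models low-Lipschitz when planning many steps ahead.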

[← A3] The World-Model Hypothesis and [← A4] World Models for Control — the A-series foundations for why a learned simulator is the natural substrate for lookahead, and the DreamerV3 lineage these agentic systems inherit. Here we use the result, not re-derive it.

Figure 6: Plan in the model, verify in the world. An L2 simulator supplies real lookahead the forward pass cannot — the agent searches imagined futures, acts in E, and an L3 evolver corrects the model when reality disagrees (Chu et al., 2026). Why it matters: world models give cheap depth on loan; the loan is repaid by grounding, and unbalanced dynamics-vs-reward error is how you default on it (Timor et al., 2026).

Key Takeaway 5

A world model gives the lookahead a forward pass cannot: an L2 simulator expands a real frontier of imagined futures, and an L3 system corrects itself when reality disagrees (Chu et al., 2026). But imagined returns are only as good as the learned dynamics and reward — balance the samples and keep the learned models low-Lipschitz or errors compound (Timor et al., 2026). Plan in the model for cheap depth; verify in the world E before you trust it.

§6 · Verified search at the frontier

Follow the cost axis to its end and you reach the regime where myopia stops mattering at all. The trick is to make the verifier ground truth rather than another fallible model — and in formal mathematics it already is. A proof assistant like Lean accepts a step only if it is correct; there is no talking it out of a gap. Couple a language model to that checker and you get a search procedure whose every branch is validated by a machine, so the proposer is free to be as myopic as it likes — wrong branches are simply rejected, and depth is purchased by search rather than by the model's foresight (see Figure 7).

Tsoukalas and colleagues (2026) ran the first large-scale test of this on open mathematics. A basic agent that simply alternates LLM generation with Lean verification already works; their full system, AlphaProof Nexus, coordinates subagents that search independently against the compiler, orchestrated by an evolutionary loop. The agent autonomously resolved 9 of 353 open Erdős problems at a per-problem cost of a few hundred dollars, and 44 of 492 OEIS conjectures, with results now feeding live research in combinatorics, optimization, and beyond. The authors note the direction of travel explicitly: as the base models improve, the specialized trained provers are giving way to simple agentic loops wrapped around a checker. The harness, not a bespoke model, is doing the work — which is the whole thesis of this series stated in the hardest possible domain.
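The "basic agent" shape, a proposer alternating with a ground-truth checker, can be sketched without any model at all. In the toy below the checker verifies a factorization rather than a Lean proof, and the proposer is a deliberately naive generator standing in for LLM calls; `verified_search`, `naive_proposer`, and the target number are all assumptions of the illustration.

```python
from typing import Callable, Iterable, Iterator, Optional, Tuple, TypeVar

T = TypeVar("T")


def verified_search(
    proposals: Iterable[T],
    check: Callable[[T], bool],
    budget: int,
) -> Optional[T]:
    """Return the first proposal the checker accepts, or None if budget runs out."""
    for attempt, candidate in enumerate(proposals):
        if attempt >= budget:
            break
        if check(candidate):  # ground truth decides, not the proposer
            return candidate
    return None


TARGET = 8633  # equals 89 * 97


def naive_proposer() -> Iterator[Tuple[int, int]]:
    """Myopic proposer: guesses a divisor and never checks its own work."""
    a = 2
    while True:
        yield (a, TARGET // a)
        a += 1


def is_factorization(pair: Tuple[int, int]) -> bool:
    a, b = pair
    return a > 1 and b > 1 and a * b == TARGET


found = verified_search(naive_proposer(), is_factorization, budget=100)
gave_up = verified_search(naive_proposer(), is_factorization, budget=10)
print(found, gave_up)
```

Every wrong guess costs one check and nothing else, so the proposer's quality affects how much budget is spent and not whether an accepted answer is correct. That separation is what a proof checker provides in the systems described above.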

Kung and colleagues (2026), in LEAP, show how high the ceiling reaches when the loop is built well. Their agent decomposes a hard problem into a blueprint of smaller claims and discharges each against Lean through continuous compiler interaction. On the 2025 Putnam Competition it solves all 12 problems — against a human median of 2 — and on a new Lean-IMO-Bench it lifts a general-purpose model from below 10% to 70%, surpassing a gold-medal-caliber specialized system. These results are established by machine-checked construction, not asserted: the certainty is the verifier's, and that is exactly the point. Where a ground-truth checker exists, the question "can the model plan deeply enough?" dissolves, because the search no longer depends on the model's planning at all.

[→ C5] The Long Bet · Mathematics as the Existence Proof — verified search is the clean case where machine-checked construction settles the matter; the C-series essay takes up what that existence proof does and does not imply for the rest of agentic work.

Figure 7: Verified search makes the verifier ground truth. A myopic proposer is acceptable because Lean rejects every wrong branch, so depth comes from search, not foresight — simple agentic loops now resolve open Erdős problems (Tsoukalas et al., 2026) and solve all 12 Putnam 2025 problems (Kung et al., 2026). Why it matters: it is the existence proof that the ceiling is high when the search is real.

Key Takeaway 6

When the verifier is a proof checker, myopia stops mattering: a wrong branch is rejected by ground truth, so depth is bought by search rather than foresight. Simple agentic loops over Lean autonomously resolved 9 of 353 open Erdős problems at a few hundred dollars each (Tsoukalas et al., 2026) and solved all 12 Putnam 2025 problems, lifting a general model from below 10% to 70% on Lean-IMO-Bench (Kung et al., 2026). The ceiling is high when the search is real — and the harness, not a bespoke model, is what makes it real.

§7 · Choosing your point on the spectrum

Lay the five regimes on two axes and the article resolves to a single picture (see Figure 8). The horizontal axis is verification cost per decision — what you pay, in compute and dollars and latency, to check and re-search before you commit. The vertical axis is effective planning depth — how far the system actually looks ahead before acting, not how many tokens it emits. Every approach in this article sits on a rising diagonal from the bottom-left, where a raw reasoning model is cheap and myopic, to the top-right, where verified proof search is expensive and effectively unbounded. The diagonal is the thesis: depth is bought with verification cost, not with a longer chain of thought. The flat dashed line along the bottom is the falsifier made visible — that is where "just let it think longer" leaves you, and §1 and §2 are the evidence it stays there.

The practical question is not which regime is best in the abstract; it is which one the stakes justify. Verification cost is only worth paying in proportion to the regret of a wrong action. So the deployable artifact of this article is a decision rule keyed on exactly that (see Figure 9): how reversible is the action, and is there a soundness signal available to check it? Read the table by finding your row and accepting its verdict on the one question that started this article — may you ship myopia here, or may you not?

Figure 8: The planning-depth × verification-cost spectrum. Each regime in this article sits on a rising diagonal from cheap-and-myopic (Chen et al., 2026) to expensive-and-deep verified search (Tsoukalas et al., 2026; Kung et al., 2026), through cheap in-model fixes (Setlur, Qu, Snell), committees (Sunkaraneni, Zhou), and world models (Timor, Chu). Why it matters: it places any team on one axis and shows that depth is bought with verification cost — the flat line is where longer CoT alone leaves you.

The table turns that diagonal into a decision. Each row is keyed on the two questions that actually decide the regime, what a wrong action costs and whether a soundness signal is available, and the final column answers outright the question this article started from: may you ship myopia?

| Stakes (regret of a wrong action) | Recommended regime | Verification cost | May you ship myopia? |
| --- | --- | --- | --- |
| Low · reversible (drafts, search, brainstorming, ranked candidates a human picks from) | Raw reasoning model, perhaps exploration-trained | Cheap | ✅ Yes. Myopia is fine; an unlucky trajectory just gets resampled. |
| Medium · checkable (code, structured outputs, anything with tests, types, or a constraint solver) | Committee + critic/comparator over k proposals | Mid | ⚠️ Only behind a local soundness signal: execution, tests, type or proof checks. No signal, no amplification. |
| High · expensive or irreversible (money moved, production changed, real-world actuation) | World-model planning, or externalized search gated by a ground-truth check before acting | High | ❌ No. Require external verification in `E`; a myopic plan must not reach the world unchecked. |
| Formal correctness (mathematics, verified code, safety-critical proofs) | Verified search against a checker (Lean, a type system, a solver) | Highest | Irrelevant. The checker is ground truth, so the proposer is allowed to be myopic. |

Figure 9: The cost-vs-stakes decision. Pay verification cost in proportion to the regret of a wrong action, and ship myopia only where actions are cheap to undo. Why it matters: it is the deployable rule — and before spending the next unit of compute on any row, read the failed trace, because resampling a locked trajectory only buys the same mistake again (Islah et al., 2026).
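The decision table can be carried into a harness as a small lookup. The sketch encodes the four rows as written; the type names, the `has_soundness_signal` flag, and the use of `None` for the "irrelevant" verdict are choices made for this illustration.

```python
from enum import Enum
from typing import NamedTuple, Optional


class Stakes(Enum):
    LOW_REVERSIBLE = "low"
    MEDIUM_CHECKABLE = "medium"
    HIGH_IRREVERSIBLE = "high"
    FORMAL_CORRECTNESS = "formal"


class Verdict(NamedTuple):
    regime: str
    verification_cost: str
    may_ship_myopia: Optional[bool]  # None: irrelevant, the checker is ground truth


def decide_regime(stakes: Stakes, has_soundness_signal: bool = False) -> Verdict:
    """Map the stakes of an action to the regime the table recommends."""
    if stakes is Stakes.LOW_REVERSIBLE:
        return Verdict("raw reasoning model, perhaps exploration-trained", "cheap", True)
    if stakes is Stakes.MEDIUM_CHECKABLE:
        # Amplification is only real behind a local soundness signal.
        return Verdict(
            "committee + critic/comparator over k proposals", "mid", has_soundness_signal
        )
    if stakes is Stakes.HIGH_IRREVERSIBLE:
        return Verdict(
            "world-model planning, or externalized search gated by a ground-truth check",
            "high",
            False,
        )
    return Verdict("verified search against a checker", "highest", None)


print(decide_regime(Stakes.MEDIUM_CHECKABLE, has_soundness_signal=True))
print(decide_regime(Stakes.HIGH_IRREVERSIBLE))
```

Keeping the rule as data rather than prose makes it auditable: a reviewer can see which actions were classified as reversible and challenge the classification, which is where most of the risk sits.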

Key Takeaway 7

Place your system on the spectrum of verification cost versus effective planning depth, and let the stakes choose the regime. Ship myopia where actions are cheap to undo; where they are not, buy depth with externalized verification, not a longer chain of thought. And before spending the next unit of compute, read the failed trace: resampling a trajectory-locked failure only pays for the same mistake twice (Islah et al., 2026).

What comes next

A committee is a society in miniature — a handful of agents proposing, comparing, and selecting under a verifier. The obvious next move is to make the society real: standing teams of agents with roles, memory, and the authority to act on each other's outputs. But the failure modes do not scale gracefully. The breadth that helped a committee — many independent proposals — turns into diversity collapse when the agents start conditioning on each other; the selection that recovered the right proposal becomes untraceable blame when no single step owns the outcome; and the verification cost that bought depth becomes a coordination cost that can outgrow the work. The tenth agent can make it worse. That is the subject of the next article, where committees are generalized to organizations and the engineering question shifts from how deep can one agent search to what does it cost to make many of them agree.

[→ 8] Multi-Agent Systems and Their Failure Modes — committees generalized to societies, and the three ways a society of agents fails that a single one cannot. · [→ C5] the mathematics existence proof, read for what it does and does not promise the rest of agentic work.

References

Chen, S., Li, J.-A., et al. (2026). Extracting Search Trees from LLM Reasoning Traces Reveals Myopic Planning. Preprint. arXiv:2605.06840.
Islah, N., Abbes, I., & Rish, I. (2026). Failed Reasoning Traces Tell You What Is Fixable (But Not by Reading Them). Preprint. arXiv:2606.05145.
Setlur, A., Yang, M. Y. R., Snell, C., Greer, J., Wu, I., Smith, V., Simchowitz, M., & Kumar, A. (2025). e3: Learning to Explore Enables Extrapolation of Test-Time Compute for LLMs. Preprint. arXiv:2506.09026.
Qu, Y., Singh, A., Lee, Y., Setlur, A., Salakhutdinov, R., Finn, C., & Kumar, A. (2025). RLAD: Training LLMs to Discover Abstractions for Solving Reasoning Problems. Preprint. arXiv:2510.02263.
Snell, C., Lee, J., Xu, K., & Kumar, A. (2024). Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters. Preprint. arXiv:2408.03314.
Sunkaraneni, V., Beneventano, P., Neumarker, R., Poggio, T., & Galanti, T. (2026). Agentic Systems as Boosting Weak Reasoning Models. Preprint. arXiv:2605.14163.
Zhou, S., Chai, W., Liu, K., Mao, H., Mang, Q., & Shang, J. (2026). OpenDeepThink: Parallel Reasoning via Bradley–Terry Aggregation. Preprint. arXiv:2605.15177.
Chu, M., Zhang, X. B., Lin, K. Q., Kong, L., et al. (2026). Agentic World Modeling: Foundations, Capabilities, Laws, and Beyond. Preprint. arXiv:2604.22748.
Timor, N., Shwartz-Ziv, R., Goldblum, M., LeCun, Y., & Harel, D. (2026). On Training in Imagination. Preprint. arXiv:2605.06732.
Tsoukalas, G., Kovsharov, A., Shirobokov, S., Surina, A., Firsching, M., et al. (2026). Advancing Mathematics Research with AI-Driven Formal Proof Search. Preprint. arXiv:2605.22763.
Kung, P.-N., Song, L., Hwang, D., Yoon, J., Li, C.-L., et al. (2026). LEAP: Supercharging LLMs for Formal Mathematics with Agentic Frameworks. Preprint. arXiv:2606.03303.