Part IV · The Existence Proof · 5 of 6

When AI Did Mathematics

In 2026 an AI system produced a piece of mathematics that professional mathematicians checked, accepted, and had not already known. This is the existence proof — stripped of both the hype and the dismissal that surrounded it.

Figure 1: The conjecture that fell. A planar point set carries a web of unit-distance pairs (faint); Erdős conjectured the number can grow no faster than $n^{1+o(1)}$. The gold edges mark a new kind of configuration, built from number fields rather than grids, that pushes the count to $n^{1+\varepsilon}$ for a fixed $\varepsilon > 0$ (Alon, Bloom, Gowers, Litt et al., 2026). Why it matters: it is the object an AI found and humans verified.
Anchor papers
Alon, Bloom, Gowers, Litt et al. (2026) · Tsoukalas et al. (2026) · Kung et al. (2026) · Wiemann et al. (2026)

On an afternoon in 2026, nine mathematicians — among them W. T. Gowers, a Fields Medalist — sat down to check a proof. It refuted a conjecture of Paul Erdős that had stood for decades. None of them had written it, nor had any other human. According to their own account, the result "is due to an internal model at OpenAI"; the argument was, in their words, "first mathematically generated in one shot by an internal model at OpenAI, and then expositionally refined through human interactions with Codex" (Alon, Bloom, Gowers, Litt et al., 2026). They read it, simplified it, generalized it, and signed their names to it.

That is the event this essay is about, and it is easy to get wrong in two opposite directions. One temptation is to call it the moment machines started doing mathematics — the singularity, on schedule. The other is to wave it away — a search engine that retrieved a known trick, dressed up. Both readings skip the only interesting question: what, precisely, was done, and what was not? The honest answer requires separating two regimes that the hype cycle blurred together. In the first, an AI discovered an argument and humans verified it. In the second, AI systems searched for proofs that a machine verified, mechanically, with a certificate. These are different claims with different warrants, and the whole value of the year is in keeping them apart. One theorem is one theorem — but a theorem that was new and correct, found by a machine, is an existence proof, and existence proofs reprice everything downstream.

§1 · The conjecture that fell

The unit-distance problem is one of the oldest questions in combinatorial geometry, and you can state it to a child. Put $n$ points in the plane. How many pairs can be at exactly distance one from each other? Erdős asked it and supplied the first bounds himself. An easy argument caps the count at $O(n^{3/2})$: two unit circles meet in at most two points, so no three points can all be unit-distance from the same two others, and a graph that avoids that forbidden configuration can have only $O(n^{3/2})$ edges. A suitably rescaled $\sqrt{n} \times \sqrt{n}$ integer grid, on the other hand, achieves $n^{1+\Omega(1/\log\log n)}$ unit distances: superlinear, but only barely, with the exponent crawling toward one as $n$ grows. The best upper bound anyone has established is $O(n^{4/3})$, due to Spencer, Szemerédi, and Trotter. Into that gap Erdős placed a conjecture: the truth is $n^{1+o(1)}$. The grid, he guessed, is essentially as dense as it gets; the exponent really does go to one (see Figure 1).
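The grid lower bound can be explored numerically. The sketch below is an illustration only, not taken from any of the cited papers: it counts, for an m × m integer grid, how many unordered point pairs sit at each squared distance d, then picks the most popular d. Rescaling the grid by 1/√d turns those pairs into unit distances. The function names `pairs_at_squared_distance` and `most_popular_distance` are hypothetical.

```python
from math import isqrt, log


def pairs_at_squared_distance(m: int, d: int) -> int:
    """Count unordered pairs of points in the m x m integer grid whose
    squared Euclidean distance is exactly d (d >= 1)."""
    if d < 1:
        raise ValueError("d must be a positive integer")
    total = 0
    for a in range(isqrt(d) + 1):
        b_squared = d - a * a
        b = isqrt(b_squared)
        if b * b != b_squared or a >= m or b >= m:
            continue  # d - a^2 is not a square, or the offset does not fit
        # Each unordered pair is counted once via a canonical direction.
        if a == 0:
            total += m * (m - b)            # direction (0, b)
        elif b == 0:
            total += (m - a) * m            # direction (a, 0)
        else:
            total += 2 * (m - a) * (m - b)  # directions (a, b) and (a, -b)
    return total


def most_popular_distance(m: int) -> tuple:
    """Return (d, count) for the squared distance realised by the most pairs."""
    best_d, best_count = 1, 0
    for d in range(1, 2 * (m - 1) ** 2 + 1):
        count = pairs_at_squared_distance(m, d)
        if count > best_count:
            best_d, best_count = d, count
    return best_d, best_count


if __name__ == "__main__":
    for m in (5, 10, 20, 40):
        n = m * m
        d, count = most_popular_distance(m)
        # log(count) / log(n) is the empirical exponent for this grid size.
        print(n, d, count, round(log(count) / log(n), 3))
```

At these small sizes the printed exponent is only a rough guide; the asymptotic statement in the text concerns the limit of large n.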

That conjecture is now false. The remarks document states the new result as Theorem 1.1: there exists a fixed $\varepsilon > 0$ and a sequence of planar point sets $P_i$ with $|P_i| \to \infty$ whose number of unit distances is at least $|P_i|^{1+\varepsilon}$ for every $i$ (Alon, Bloom, Gowers, Litt et al., 2026). A fixed positive exponent, not a vanishing one. The grid was not the ceiling. The construction does not look like a grid at all: it is built from number fields (from magnitude-one algebraic numbers of bounded denominator living in a carefully chosen field), so that arithmetic, not geometry, supplies the density of coincidences.
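For readers who prefer symbols, the bounds, the conjecture, and the new theorem can be set side by side. The notation $u(\cdot)$ is an assumption introduced here for illustration; it is not claimed to be the notation of the remarks document.

```latex
% Assumed notation: u(P) is the number of unordered pairs {p, q} in P with |p - q| = 1,
% and u(n) is the maximum of u(P) over all planar point sets P with |P| = n.
\[
  n^{1 + \Omega(1/\log\log n)} \;\le\; u(n) \;\le\; O\!\left(n^{4/3}\right)
  \qquad \text{(grid lower bound; Spencer--Szemer\'edi--Trotter upper bound)}
\]
\[
  \text{Conjecture (Erd\H{o}s):}\qquad u(n) = n^{1 + o(1)}
\]
\[
  \text{Theorem 1.1:}\qquad
  \exists\, \varepsilon > 0,\ \exists\, (P_i)_{i \ge 1} \text{ with } |P_i| \to \infty :
  \quad u(P_i) \;\ge\; |P_i|^{1 + \varepsilon} \ \text{ for every } i .
\]
```

The theorem contradicts the conjecture because a fixed $\varepsilon > 0$ cannot be absorbed into an $o(1)$ term along a sequence with $|P_i| \to \infty$.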

What makes the discovery legible is that the system left a record of how it got there. The chain-of-thought behind the original argument contains the pivot in plain language: "in principle all extremal examples can be taken algebraic. But the degree and height of that algebraic realization can be enormous… Maybe that enormous degree is not just an annoyance but a source of possible counterexamples. Number fields deserve a closer look" (quoted in Alon, Bloom, Gowers, Litt et al., 2026). That is the whole idea in two sentences: treat the "enormous degree" of an algebraic configuration not as an obstacle to be bounded away, but as the raw material of a counterexample. A human could have had that thought. The point is that, for decades, none did — and a machine did.

Key Takeaway 1

Erdős conjectured that $n$ points admit at most $n^{1+o(1)}$ unit distances, with the exponent tending to one. An AI-generated construction, built from number fields rather than grids, established a fixed-exponent lower bound of $n^{1+\varepsilon}$, refuting it. The decisive move, "number fields deserve a closer look," is recorded in the system's own reasoning.

§2 · The verification chain

A claim is not mathematics until it is checked, and how it was checked is the load-bearing fact of this whole story. The unit-distance result entered the literature the same way any result does: expert humans read it, found it correct, and vouched for it. The artifact that records this is the "Remarks on the Disproof of the Unit Distance Conjecture" — by Noga Alon, Thomas F. Bloom, W. T. Gowers, Daniel Litt, Will Sawin, Arul Shankar, Jacob Tsimerman, Victor Wang, and Melanie Matchett Wood. They describe it precisely: "a short, digested, human-verified version of the recent OpenAI-generated counterexample" (Alon, Bloom, Gowers, Litt et al., 2026). The machine produced an argument; the nine produced the verification.

The document is structured to make the division of labor visible. After a complete proof of Theorem 1.1, it collects individually signed reflections — a section from Alon, one from Bloom, one from Gowers, and so on through all nine. This is not a press release; it is the ordinary social machinery of mathematics, applied to a non-ordinary author. The proof was generated by the OpenAI model in one shot and then "expositionally refined through human interactions with Codex"; what the nine then did was independently reconstruct it into a version that was, by their description, "human-digested, somewhat simplified, and somewhat generalized." The humans did not merely approve the output. They metabolized it (see Figure 2).

[Figure 2 diagram. Regime A (informal discovery → human-verified): OpenAI model generates in one shot → natural-language argument, refined via Codex → human verification by 9 mathematicians, signed (Alon, Bloom, Gowers, Litt, Sawin, Shankar, Tsimerman, Wang, Matchett Wood) → "Remarks": accepted, generalized. Regime B (formal search → machine-verified): LLM agent (AlphaProof Nexus, LEAP) → Lean candidate (blueprint → tactics) → Lean compiler, mechanical check (no pending goals ⇒ correct) → certified, checkable certificate. The verification node is the whole difference: in A, humans read and vouch (no certificate); in B, a compiler decides (machine certificate).]
Figure 2: The discovery→verification chain, both regimes. In Regime A (Alon, Bloom, Gowers, Litt et al., 2026) the proof is generated by a machine but verified by humans — the nine named mathematicians who digested and signed it. In Regime B (Tsoukalas et al., 2026; Kung et al., 2026) the proof is generated by an agent and verified by a machine — the Lean compiler, which accepts a proof only when no goals remain. Why it matters: "AI did mathematics" means two different things on these two chains, and the difference lives entirely in the verification node.

Why dwell on the artifact? Because the verification did not come with a machine certificate. No compiler confirmed the unit-distance argument line by line; it was checked the way Andrew Wiles's proof was checked — by experts who understand the subject and are willing to put their reputations behind it. That makes the document historically important in a specific way: it is the human record that an AI-generated argument crossed the bar of mathematical truth, attested by people whose names are the bar. It also makes the result, as a claim about correctness, exactly as strong as the humans who signed it — no stronger, and no weaker.

Key Takeaway 2

The unit-distance disproof was generated by a machine and verified by humans — nine named mathematicians who reconstructed, simplified, and signed the argument. There was no machine certificate; correctness rests on expert testimony. The "Remarks" document is the historical artifact recording that an AI argument passed human verification.

§3 · Originality, carefully

Was this original mathematics, or expensive retrieval? The remarks authors are scrupulous about it, and their scruple is the most instructive part of the episode. The argument, they write, "relies crucially on ideas that may, at least in retrospect, be attributed to Ellenberg–Venkatesh, Golod–Shafarevich, and Hajir–Maire–Ramakrishna" (Alon, Bloom, Gowers, Litt et al., 2026). The ingredients were known. Ellenberg and Venkatesh had used small split primes and a pigeonhole argument to bound torsion in class groups; Golod–Shafarevich towers — infinite class field towers — were standard equipment; and the variant using infinitely many split primes, the authors note, "already appears in the literature and has been used for other applications." None of the building blocks was new (see Figure 3).

[Figure 3 diagram. Known ingredients: Ellenberg–Venkatesh (small split primes), Golod–Shafarevich (class field towers), Hajir–Maire–Ramakrishna (prior idea, known in retrospect). Novel ingredient: take [K : ℚ] → ∞. Together they give the assembled construction and the disproof: unit distances ≥ $|P|^{1+\varepsilon}$. Note: Golod–Shafarevich towers with infinitely many split primes already appear in the literature and are not the novel part.]
Figure 3: What "AI-original" means here. Three known ingredients (dashed — "attributable in retrospect") plus one genuinely new move, taking the field degree [K : ℚ] → ∞, assemble into the counterexample (Alon, Bloom, Gowers, Litt et al., 2026). Why it matters: originality in mathematics is almost always novel synthesis plus a new move — the bar this clears — not invention from nothing, which nothing clears.

And yet nobody had assembled them. The remarks identify the new step exactly: "a novel ingredient of the AI argument is to take [K : ℚ] → ∞" — to let the degree of the number field run to infinity, which is precisely the "enormous degree" the system's chain-of-thought had flagged as a possible source of counterexamples. That is the difference between attributable in retrospect and known in advance. After you see the proof, you can trace each component to prior work. Before you saw it, the components sat in separate corners of number theory and discrete geometry, and the conjecture stood.

This is the right place to be careful about the word "original." If the bar for AI-original mathematics is "used only ideas no human ever had," then no theorem in history is original, because every proof stands on prior mathematics — Wiles used Galois representations he did not invent. The workable bar is the one the field already applies to humans: did it produce a correct argument the community did not have and could not readily have reconstructed? By the testimony of the nine, the unit-distance disproof clears that bar. It is original in the way most breakthroughs are original — a synthesis nobody had made, turning on one move nobody had tried — and it happens that the synthesizer was a machine.

Key Takeaway 3

The disproof's ingredients were known and are "attributable in retrospect" to prior work; the new move — letting the field degree [K : ℚ] → ∞ — was not. Judged by the only standard any theorem meets — a correct argument the field lacked and could not readily reconstruct — this counts as AI-original, not AI-retrieved. Originality is novel synthesis plus a new move; here a machine did the synthesizing.

§4 · The other regime: verified search

The unit-distance event is singular and, so far, a sample of one. The second regime is the opposite: systematic, repeatable, and carrying machine certificates instead of human signatures. This is where the word proved is earned, because here a compiler, not a referee, decides. The setting is Lean, "a proof assistant in which definitions, theorems, and proofs are all mechanically verified code"; a Lean proof is correct exactly when the compiler reaches a state with no pending goals (Tsoukalas et al., 2026). Whatever a system claims to have proved, you can check by rerunning the compiler. You do not have to take the prover's word for it.
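A toy example shows what "no pending goals" looks like in practice. The two theorems below are deliberately trivial and are not drawn from any of the cited systems; they assume core Lean 4 only, with no Mathlib import, and the theorem names are made up for illustration.

```lean
-- Core Lean 4 only; no Mathlib import is assumed.

-- The goal `2 + 2 = 4` closes by computation, leaving no pending goals.
theorem two_add_two : 2 + 2 = 4 := by
  rfl

-- Here the goal is closed by citing a library lemma that matches it exactly.
theorem add_comm_example (a b : Nat) : a + b = b + a := by
  exact Nat.add_comm a b
```

If a tactic block stops before every goal is closed, Lean rejects the declaration with an "unsolved goals" error, which is the mechanical sense in which the compiler decides.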

DeepMind's AlphaProof Nexus (Tsoukalas et al., 2026) ran the first large-scale evaluation of this paradigm on open problems: not competition exercises with known answers, but questions nobody had resolved. Its most capable agent autonomously proved 9 of 353 open Erdős problems, including two that had been open for fifty-six years, at an inference cost of a few hundred dollars each. It proved 44 of 492 open conjectures from the On-Line Encyclopedia of Integer Sequences (OEIS), resolved a fifteen-year-old open question on Hilbert functions in algebraic geometry, improved an open bound in convex optimization by discovering a new algorithmic parameter schedule, and is being used in active research across combinatorics, optimization, graph theory, algebraic geometry, and quantum optics. A revealing detail from the paper's own post-hoc analysis: a basic agent (independent prover subagents running multi-turn "Ralph loops" on Gemini 3.1 Pro, each refining a Lean sketch against compiler feedback) also solved all nine Erdős problems, just at higher cost on the hardest ones. The authors read this as "an ongoing shift from specialized trained systems toward simple agentic loops as LLMs become more capable" (see Figure 4).

| System | What it produced | Verification | Headline result |
|---|---|---|---|
| AlphaProof Nexus (Tsoukalas et al., 2026; Google DeepMind) | Lean proofs of open problems | ✓ machine-checked (Lean) | 9 / 353 open Erdős (two open 56 yrs); 44 / 492 OEIS; a 15-yr Hilbert-function question; an open convex-optimization bound. A few hundred dollars per problem. |
| LEAP (Kung et al., 2026; Google) | Lean proofs via agentic scaffold (general LLMs only) | ✓ machine-checked (Lean) | 2025 Putnam: 12 / 12 solved; Lean-IMO-Bench <10% → 70% (vs ATP 5%, Aristotle 48%); a verified subproblem of Knuth's even-order Cayley-graph decomposition. |
| Unit-distance disproof (Alon, Bloom, Gowers, Litt et al., 2026; Regime A, for contrast) | Natural-language argument, generated then digested | ⚠ human-verified (no certificate) | Refuted the Erdős unit-distance conjecture: unit distances can reach $n^{1+\varepsilon}$. A sample of one. |
Figure 4: The verification column is the punchline. The two formal systems carry machine-checkable Lean certificates (✓) — correctness is not in question; the informal disproof carries human testimony (⚠) — verified, but not certified. Why it matters: the regimes trade off coverage against certainty. Lean gives certainty on a narrow, formalizable frontier; the unit-distance line gives reach into open conceptual territory, at the cost of needing humans to vouch.

LEAP (Kung et al., 2026) attacks the same target from the agent-design side, and it is the cleanest illustration of verified search as a planning regime. The "LLM-in-Lean Environment Agentic Prover" uses only general-purpose models — no specialized prover — building a high-level blueprint as a directed acyclic graph, then generating the Lean proof and iteratively correcting it against compiler feedback. On the 2025 Putnam, where the top human score was 110 out of 120 and the median was 2, LEAP proved all twelve problems in Lean. On its new Lean-IMO-Bench it lifted the one-shot formal solve rate of general LLMs from under 10% to 70% — past specialized automated-theorem-proving baselines at 5% and past Aristotle, a gold-medal-caliber IMO system, at 48% — and it produced a verified proof of a key subproblem in Knuth's Hamiltonian decomposition of even-order Cayley graphs.
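The blueprint idea can be made concrete with a small sketch. This is not LEAP's actual data structure, which the text does not specify; it only illustrates what treating a proof plan as a directed acyclic graph buys you: an order in which every lemma is attempted after the lemmas it depends on. The lemma names and the `proving_order` helper are hypothetical.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical blueprint: each key is a lemma, each value the set of
# lemmas it depends on. None of these names come from the LEAP paper.
blueprint = {
    "main_theorem": {"lemma_bound", "lemma_construction"},
    "lemma_bound": {"lemma_base_case"},
    "lemma_construction": {"lemma_base_case"},
    "lemma_base_case": set(),
}


def proving_order(graph: dict) -> list:
    """Return the lemmas in an order where dependencies come first.

    Raises graphlib.CycleError if the blueprint is circular, since a
    circular plan cannot be turned into a proof.
    """
    return list(TopologicalSorter(graph).static_order())


if __name__ == "__main__":
    for step, lemma in enumerate(proving_order(blueprint), start=1):
        print(step, lemma)
```

In a real agent each node would then be handed to a prover and checked by the Lean compiler; the ordering only guarantees that a lemma's prerequisites are available when it is attempted.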

[← B7 · Planning and the Myopia Problem] A verifier-coupled agent loop is exactly the planning regime that essay's spectrum ends on: a generator proposes, a ground-truth verifier accepts or rejects, and search is steered by the rejections. Lean is the strongest possible verifier — the reward is never wrong — which is why formal mathematics is the cleanest place to watch verified search work.
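The propose, verify, and steer loop described here can be sketched in a few lines. Everything below is a toy stand-in: the verifier checks a trivial arithmetic fact rather than a Lean proof, the proposer is a bisection rule rather than a language model, and all names (`verified_search`, `make_bisecting_proposer`, `make_square_root_verifier`) are hypothetical. What carries over is the shape of the loop: the verifier is never wrong, and its rejection messages steer the next proposal.

```python
from typing import Callable, Optional, Tuple

Feedback = Tuple[bool, str]  # (accepted, message from the verifier)


def verified_search(
    propose: Callable[[Optional[int], str], int],
    verify: Callable[[int], Feedback],
    max_rounds: int = 50,
) -> Optional[int]:
    """Generic loop: propose a candidate, verify it, feed the rejection back."""
    candidate: Optional[int] = None
    message = ""
    for _ in range(max_rounds):
        candidate = propose(candidate, message)
        accepted, message = verify(candidate)
        if accepted:
            return candidate  # the verifier's acceptance is the certificate
    return None  # budget exhausted without an accepted candidate


def make_square_root_verifier(target: int) -> Callable[[int], Feedback]:
    """Toy ground-truth verifier: accepts x exactly when x * x == target."""
    def verify(candidate: int) -> Feedback:
        square = candidate * candidate
        if square == target:
            return True, "accepted"
        return False, ("too small" if square < target else "too large")
    return verify


def make_bisecting_proposer(lo: int, hi: int) -> Callable[[Optional[int], str], int]:
    """Toy generator: narrows its search interval using the rejection message."""
    bounds = {"lo": lo, "hi": hi}

    def propose(previous: Optional[int], message: str) -> int:
        if previous is not None:
            if message == "too small":
                bounds["lo"] = previous + 1
            elif message == "too large":
                bounds["hi"] = previous - 1
        return (bounds["lo"] + bounds["hi"]) // 2
    return propose


if __name__ == "__main__":
    found = verified_search(make_bisecting_proposer(0, 100),
                            make_square_root_verifier(1369))
    print(found)
```

When no candidate can satisfy the verifier, the loop returns `None` after `max_rounds` rather than reporting a false success, which mirrors the point that a compiler-backed reward cannot be fooled.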

What does this regime establish, and what does it not? It establishes — in the literal, certificate-bearing sense — that AI systems can find genuinely new proofs of open problems, at low cost, with no human needed to check correctness. That is a real and deployable capability. But the frontier is narrow: you need a Lean formalization, which today mostly exists for combinatorics-flavored and competition-flavored problems; the hit rate is low (nine of 353); and the proofs assemble known machinery into certificates rather than introducing the kind of conceptual surprise the unit-distance move carried. Certainty and reach are, for now, traded against each other.
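The hit rates are easier to compare as percentages. The short calculation below uses only the counts reported in this essay for AlphaProof Nexus; the dictionary layout and the `solve_rate` helper are illustrative choices.

```python
# Counts as reported in the essay for AlphaProof Nexus (Tsoukalas et al., 2026):
# (problems proved, problems attempted).
reported = {
    "open Erdős problems": (9, 353),
    "open OEIS conjectures": (44, 492),
}


def solve_rate(solved: int, attempted: int) -> float:
    """Fraction of attempted problems that were proved."""
    return solved / attempted


if __name__ == "__main__":
    for name, (solved, attempted) in reported.items():
        print(f"{name}: {solve_rate(solved, attempted):.1%}")
    # prints 2.5% for the Erdős problems and 8.9% for the OEIS conjectures
```

Both figures are low in absolute terms, which is the sense in which the formal frontier is real but narrow.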

[← A9 · Reasoning at Scale: Frontier Systems] These results are the "edge of competence" thesis playing out at the boundary of human knowledge: systems succeed where problems are hard but formalizable — the zone where the signal is non-zero but not yet exhausted — and stall where they are not. Prolonged reasoning buys the most exactly there.

Key Takeaway 4

In the formal regime, "proved" is literal: Lean's compiler certifies every step. AlphaProof Nexus proved 9 of 353 open Erdős problems and 44 of 492 OEIS conjectures; LEAP proved all twelve 2025 Putnam problems and lifted general-LLM formal solve rates from under 10% to 70%. The certificate removes any question of correctness — but only on the narrow frontier where a formalization exists.

§5 · Thinking outside the box, measurably

Two events — one informal disproof, one stream of certified proofs — are not a capability curve. They are points. To know whether systems can think outside the box in general, you need a measurement that recall cannot fake, and that is the gap DiscoverPhysics (Wiemann et al., 2026) was built to fill. Its premise is that frontier models already score well on physics exams, "but it is hard to disentangle genuine reasoning from recall of established science." So the benchmark removes the established science. It constructs 22 simulated worlds whose physics deliberately deviates from ours — screened and fractional-power gravity, hidden dark-matter-like particles, time-varying couplings, extra dimensions — each generated on demand by an N-body simulator. An agent cannot look up the answer because the answer does not exist outside the simulator (see Figure 5).

[Figure 5 chart. Strongest agents pass ≈ half of 22 worlds (≈ 11 passed, ≈ 11 failed on latent structure), across 11 frontier models, on non-canonical physics that cannot be recalled. Two independent scores: prediction (trajectory MSE: does it forecast?) and understanding (explanation score: does it know why?). They decouple: best predictor ≠ best explainer. Bar lengths schematic.]
Figure 5: Discovery, made into a number. Across 11 frontier models, the strongest agents pass only about half of 22 non-canonical worlds, failing where latent structure must be uncovered (Wiemann et al., 2026). Each world is scored twice — predictive accuracy and explanation quality — and the two decouple: the best forecaster is not the best explainer (bar lengths schematic). Why it matters: it turns "can it discover?" from an anecdote into a pass-rate that can move.

To solve a world, an agent must design experiments, observe noisy trajectory data, revise hypotheses, and submit both a natural-language law and a Python implementation. DiscoverPhysics scores each submission on two axes: trajectory MSE on held-out particles (can it predict?) and an LLM-judged explanation score against an expert rubric (does it understand?). The finding is sobering and useful: across eleven frontier models, the strongest agents pass only about half of the worlds, and consistently fail on the ones where latent structure — a hidden particle species, an extra dimension — must be uncovered. Predictive accuracy and conceptual understanding decouple: the model with the lowest MSE is not the one with the highest explanation score. The benchmark catches systems fitting the data without revising their picture of the world.
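The two-axis scoring can be illustrated with a minimal sketch. This is not the benchmark's code: the `trajectory_mse` function is one plausible reading of "trajectory MSE on held-out particles" for 2D positions, and the agent names and scores in `toy_scores` are invented placeholders, not benchmark results. The point is only to show how the best predictor and the best explainer can be different submissions.

```python
from typing import Dict, Sequence, Tuple

Point = Tuple[float, float]
Trajectory = Sequence[Point]  # positions of one particle over time


def trajectory_mse(predicted: Sequence[Trajectory],
                   observed: Sequence[Trajectory]) -> float:
    """Mean squared error over all particles, time steps, and coordinates."""
    total, count = 0.0, 0
    for pred_traj, obs_traj in zip(predicted, observed):
        for (px, py), (ox, oy) in zip(pred_traj, obs_traj):
            total += (px - ox) ** 2 + (py - oy) ** 2
            count += 2  # two coordinates per point
    if count == 0:
        raise ValueError("no trajectory points to score")
    return total / count


# Invented placeholder scores, NOT results from DiscoverPhysics.
# Lower MSE is better; higher explanation score is better.
toy_scores: Dict[str, Dict[str, float]] = {
    "agent_a": {"mse": 0.02, "explanation": 0.40},
    "agent_b": {"mse": 0.10, "explanation": 0.85},
}

best_predictor = min(toy_scores, key=lambda name: toy_scores[name]["mse"])
best_explainer = max(toy_scores, key=lambda name: toy_scores[name]["explanation"])

if __name__ == "__main__":
    print(best_predictor, best_explainer)
```

Reporting the two scores separately, rather than blending them into one number, is what lets the decoupling show up at all.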

This is the methodological hinge of the whole essay. The unit-distance disproof and the Lean proofs tell you discovery is possible; DiscoverPhysics tells you how far along it is — and the answer is mid-curve, not finished. A pass rate of one-half on worlds that resist recall is exactly the kind of number the selection lens wants: a measurement that can move, attached to the capability the anecdotes only gestured at.

Key Takeaway 5

DiscoverPhysics measures out-of-the-box discovery by using 22 non-canonical worlds that recall cannot solve. The strongest of eleven frontier models pass only about half, failing where latent structure must be uncovered, and predictive accuracy decouples from genuine understanding. Discovery is now a measurable axis — and the axis reads mid-curve, not saturated.

§6 · The division of labor

Step back from the individual results and ask what actually moved. Mathematical work decomposes, roughly, into five activities: choosing a problem worth solving, generating candidate ideas, searching for a proof, verifying correctness, and judging whether the result matters. The events of 2026 moved some of these a great deal and others not at all (see Figure 6).

[Figure 6 diagram. Five stages. Choose problem: unmoved (human taste: which problem, and why). Generate ideas: moved some (the "[K : ℚ] → ∞" synthesis move, unit distance). Search proof: moved a lot (AlphaProof Nexus, LEAP: agentic proof search). Verify: split into formal and informal (Lean certifies where formalizable; else still human). Judge significance: unmoved (human taste: is the result worth having?). Search and a slice of idea-generation moved most; formal verification became cheap and absolute; taste did not move at all.]
Figure 6: What moved, by stage. Proof search automated heavily (AlphaProof Nexus, LEAP); idea-generation moved partway (the unit-distance synthesis); formal verification became a cheap machine certificate where a Lean formalization exists. Problem-selection and significance-judgment — taste — did not move (Tsoukalas et al., 2026; Kung et al., 2026; Alon, Bloom, Gowers, Litt et al., 2026). Why it matters: the bottleneck of mathematical practice is migrating toward the parts machines did not touch.

Search moved most. AlphaProof Nexus and LEAP automate proof search and couple it to mechanical verification; this is now a deployable tool, in active use across five research areas. Idea-generation moved partway. The OpenAI model generated the decisive synthesis for the unit-distance disproof — a genuine conceptual contribution — but it did so once, and we do not yet know how reliably such moves can be produced on demand. Verification split in two. Where a Lean formalization exists, checking became cheap, absolute, and inhuman; where it does not, verification remained a human bottleneck — the nine mathematicians reading the unit-distance argument by hand. DiscoverPhysics shows the experiment-design loop, the empirical analogue of search, is partly automatable but still unreliable.

And then there is the part that did not move at all. No system chose its own problem. Every result here was pointed at a target a human selected — an Erdős problem, a Putnam exam, a named conjecture. None of these systems decided that the unit-distance conjecture was worth attacking, or that disproving it would matter. The remarks document exists because humans found the result interesting enough to digest and reflect on for eleven sections. Taste — what to work on, and what counts as worth having — is exactly the activity that stayed human. The plausible shape of mathematical practice in 2030 follows directly: a mathematician choosing problems and judging significance, served by a formal-search tool that closes routine lemmas with certificates, an idea-generator that proposes synthesis moves, and an experiment loop for empirical conjectures — with the human still verifying the informal arguments the machines cannot yet certify.

Key Takeaway 6

Proof search moved the most; idea-generation moved partway; formal verification became a cheap machine certificate where a formalization exists. Problem-selection and significance-judgment — taste — did not move at all. The bottleneck of mathematical practice is migrating toward the activities machines did not touch, which is where the next decade's value will sit.

§7 · What the existence proof licenses

The selection lens prizes existence proofs above almost everything else, for a precise reason: an existence proof reprices the question. Before one, you argue about whether a capability is possible; after one, the argument shifts to how far, how cheap, and how general. The 2026 results are exactly that kind of evidence. The unit-distance disproof, plus the stream of certified Lean proofs, together establish that AI can produce mathematics that is both new and correct. That cannot be un-shown.

[← 1 · The Selection Lens] Existence proofs are the strongest signal the lens recognizes, because possibility is the hardest thing to establish and the easiest to extrapolate from. Once a capability is demonstrated once, every downstream forecast that assumed it impossible has to be repriced.

But the discipline of this essay is to say precisely what the proof licenses and what it does not, because one theorem is one theorem, and a single brilliant result is not a trend. Figure 7 is the ledger.

| ✓ What this existence proof licenses | ✗ What it does not license |
|---|---|
| "AI cannot do original mathematics" is now false as stated: one correct, new, human-verified result exists (Alon, Bloom, Gowers, Litt et al., 2026). | That AI mathematicians have arrived. The informal-discovery regime is, so far, a sample of one. |
| Certified formal search is a real research tool today: deployed across five fields, machine-checked (Tsoukalas et al., 2026). | That the rate is fast. 9 of 353 open problems; about half of 22 discovery worlds. |
| The bottleneck of practice is shifting toward taste: problem-selection and significance. | That original synthesis like the unit-distance move is reproducible on demand. We have not shown it scales. |
Figure 7: The licensing ledger. An existence proof is strong evidence and narrow evidence at once: it makes the impossibility claim false (left) without making the optimistic extrapolations true (right). Why it matters: the selection lens buys the repricing, not the timeline — and the honest position holds both columns simultaneously.

The reading that makes 2026 historic — that this was AI-original mathematics — rests on one checkable claim: that the construction was genuinely new, not already sitting in the literature. The authors' own hedge, "attributable in retrospect," is where the risk lives. So here is the falsification contract, stated plainly.

The falsifier & the 2029 review

What would undercut the "AI-original" reading: if the unit-distance construction — or its essential [K : ℚ] → ∞ move — turns out to have been published, or to be trivially reconstructible from a single prior paper, then the event collapses from AI-original to AI-assembled, and the strongest reading is wrong. The certified Lean results would survive this (they are proofs regardless of novelty), but the headline would not.

Dated review — 2029. By 2029, ask: (a) has the informal-discovery regime produced a second, independent, comparably surprising result — is the sample size still one? (b) has any unit-distance-style construction been shown to predate the AI? (c) has a DiscoverPhysics-style discovery pass rate moved materially above one-half? If (a) is still "no" and (b) becomes "yes," downgrade the bet from existence proof of AI-original mathematics to existence proof of AI-assisted formal proof search — still real, far less historic.

Key Takeaway 7

The existence proof licenses three things: that AI can do original mathematics, that certified formal search is a tool today, and that the bottleneck is shifting to taste. It does not license claims about pace, reproducibility, or arrival. The reading is falsifiable — if the construction predates the AI — and is committed to a 2029 review. One theorem is one theorem.

What comes next

There is a quiet limitation running through every system in this essay, and it is the thread into the final chapter. AlphaProof Nexus spins up prover subagents with "no shared state"; LEAP starts each proof from scratch; the OpenAI model generated the unit-distance disproof "in one shot." None of them remembers yesterday's proof when it begins today's. The mathematics happened — but it did not accumulate inside the machine. The accumulation lives where it always has: in the humans who read the results, and in the papers they write. These are systems that did mathematics once, not agents that get better at mathematics over time.

[→ 6 · The Continual Agent] The capstone of this series is the move from episodic competence to accumulated competence — agents that carry what they learned in one session into the next. That is the difference between "AI did mathematics" and "AI does mathematics," and it is the bet the whole series has been building toward.

References

  1. Alon, N., Bloom, T. F., Gowers, W. T., Litt, D., Sawin, W., Shankar, A., Tsimerman, J., Wang, V., & Matchett Wood, M. (2026). Remarks on the Disproof of the Unit Distance Conjecture. Mathematical note (manuscript).
  2. Tsoukalas, G., Kovsharov, A., Shirobokov, S., Surina, A., Firsching, M., Bérczi, G., Ruiz, F. J. R., Suggala, A., Wagner, A. Z., Wieser, E., Yu, L., Huang, A., Horváth, M. Z., Ferrauiolo, A., Michalewski, H., Grosu, C., Hubert, T., Balog, M., Kohli, P., & Chaudhuri, S. (2026). Advancing Mathematics Research with AI-Driven Formal Proof Search. Preprint, Google DeepMind. arXiv:2605.22763.
  3. Kung, P.-N., Song, L., Hwang, D., Yoon, J., Li, C.-L., Severini, S., Olšák, M., Lockhart, E., Le, Q. V., Gokturk, B., Luong, T., Pfister, T., & Peng, N. (2026). LEAP: Supercharging LLMs for Formal Mathematics with Agentic Frameworks. Preprint, Google. arXiv:2606.03303.
  4. Wiemann, M. L., Smith, L. M., Melchior, P., Mishra-Sharma, S., Wilson, A. G., Izmailov, P., & Cuesta-Lázaro, C. (2026). DiscoverPhysics: Benchmarking LLMs for Out-of-the-Box Scientific Thinking. Preprint, Princeton University et al. arXiv:2605.26087.