What an AI Attacking the Millennium Problems Actually Looks Like

The Yang-Mills existence and mass gap problem has been on the Clay Mathematics Institute's Millennium Prize list since 2000, with a $1 million prize for a solution. The problem itself has been open since Yang and Mills formulated their non-Abelian gauge theory in 1954. Seventy-two years. One million dollars. No solution.

This article is not a report of solving it. It is an honest account of what actually happened when an autonomous discovery system attacked the problem across 139 attempts over three distinct experimental phases — what worked, what failed catastrophically, what the system learned about its own limitations, and why the gap between 90.8% and 95% validation is more instructive than the gap between 0% and 90.8%.

"Evolution is cleverer than you are." — Orgel's Second Rule

Phase 1: Enhanced Prompts — 99 Attempts, 3% Success

The first phase used what seemed like a reasonable strategy: take the best available prompt engineering techniques, apply them to Yang-Mills hypothesis generation, and iterate. The results were unambiguous.

Attempts	Success Rate	Best Score	API Cost
99 (Phase 1 total)	3% (above threshold)	85.3% (single attempt)	$1.13 (99 attempts)

The temperature parameter was set to 0.95 — high variance, lots of creative exploration. The hypothesis was that Yang-Mills required genuinely novel mathematical connections that would only emerge from high-temperature sampling. This hypothesis was wrong.

High temperature did not produce novel connections. It produced plausible-sounding arguments with subtle logical inconsistencies. The 3% success rate reflects the rare occasions when high-temperature sampling happened to land on a well-structured argument. The best single result was 85.3%: one attempt, never reproduced. The 96 attempts that fell below threshold scored between 65% and 80%, with high variance and no convergence.

What Went Wrong

Temperature 0.95 maximises vocabulary diversity but destroys logical coherence. For a problem like Yang-Mills that requires precise mathematical reasoning across multiple steps, high temperature causes the generation to drift from its initial framing over the course of a long response. The argument that starts correctly at the gauge field quantization step has already accumulated enough drift by the mass gap step that the consistency dimension fails badly.

Phase 2: Genetic Evolution — The Breakthrough

Phase 2 replaced random sampling with directed evolution. Instead of generating each hypothesis independently, the system extracted the structural pattern (DNA) from the best result in Phase 1 and used it as the starting point for a new generation cycle. Temperature dropped to 0.75.

Attempts	Success Rate	Best Score	API Cost
15 (Phase 2 total)	60% (above threshold)	90.4% (3 breakthroughs ≥90%)	$0.17 (15 attempts)

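The improvement was dramatic. The success rate rose from 3% to 60%. There were three breakthroughs at or above 90%, against none in the 99 Phase 1 attempts. Total API cost fell by 85%, from $1.13 to $0.17, despite the much higher quality; the cost per attempt was essentially unchanged at roughly $0.011, so the saving came from needing far fewer attempts. The genetic evolution approach is not just better; it is an entirely different regime.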

The Genetic Evolution Algorithm

The algorithm works by treating successful hypotheses as genetic material. Instead of generating each hypothesis from scratch, the system identifies the structural features that made the best hypothesis score well, encodes those features as DNA, and generates variations that preserve the successful structure while exploring adjacent possibility space.

src/engines/GeneticDiscoveryEngine.js JavaScript

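// Excerpt: identifySuccessPattern, generate and validate are defined elsewhere in the engine.

// Best result so far; -Infinity lets the first validated hypothesis win.
let bestEver = { hypothesis: null, validation: null, score: -Infinity };

// Encode the structural features of a winning hypothesis as "DNA".
// validationBreakdown is assumed to be the validator's per-dimension result.
function extractDNA(winnerHypothesis, validationBreakdown) {
  return {
    structure: identifySuccessPattern(winnerHypothesis),
    traits: {
      phenomenon: '...',   // What physical effect is claimed
      mechanism: '...',    // What causes it
      quantifiable: '...', // What can be measured
      testable: '...',     // How to verify it
      constants: '...'     // Specific numerical values with error bars
    },
    scores: validationBreakdown
  };
}

// Mutation strategies for the next generation; the formal-logic mutation samples cooler.
function generateVariations(DNA, targetScore) {
  return [
    { mutation: 'Keep structure, change context', temperature: 0.75 },
    { mutation: 'Keep phenomenon, change observable', temperature: 0.75 },
    { mutation: 'Tighten bounds, add formal logic', temperature: 0.70 },
  ];
}

// Evaluate each variation; a new best immediately becomes the DNA baseline.
async function evolveGeneration(DNA, variations) {
  for (const variation of variations) {
    const hypothesis = await generate(DNA, variation);
    const validation = await validate(hypothesis);
    if (validation.score > bestEver.score) {
      // Store the score at the top level so the next comparison still works.
      bestEver = { hypothesis, validation, score: validation.score };
      DNA = extractDNA(hypothesis, validation); // New baseline!
    }
  }
  return DNA; // Seed for the next generation
}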

The key line is the reassignment inside the if block, where DNA is replaced by the output of extractDNA for the new winner. When a variation beats the previous best, it immediately becomes the new DNA baseline. The next generation of variations is generated from the winner, not from the original. This is Lamarckian evolution: acquired characteristics (successful argument structures) are directly inherited by descendants.
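The baseline replacement can be seen in isolation with a small runnable sketch. It is not the engine's code: `generateStub`, `validateStub`, `evolveWithBaseline` and the `plantedScore` field are hypothetical stand-ins, and the scores are planted by hand so the loop's behaviour is easy to follow.

```javascript
// Hypothetical stubs: a real engine would call a language model and a
// multi-dimensional validator. Here each variation carries a planted score.
async function generateStub(dna, variation) {
  return { parent: dna.id, id: variation.mutation, plantedScore: variation.plantedScore };
}

async function validateStub(hypothesis) {
  return { score: hypothesis.plantedScore };
}

// Run one generation. When a variation beats the best score so far, its DNA
// replaces the baseline used for every variation that follows.
async function evolveWithBaseline(dna, variations, best, generate, validate) {
  const lineage = []; // which baseline each hypothesis was generated from
  for (const variation of variations) {
    const hypothesis = await generate(dna, variation);
    const validation = await validate(hypothesis);
    lineage.push(hypothesis.parent);
    if (validation.score > best.score) {
      best = { hypothesis, score: validation.score };
      dna = { id: hypothesis.id }; // the winner becomes the new baseline
    }
  }
  return { dna, best, lineage };
}
```

Starting from a seed baseline with a best score of 85.3 and planted scores of 70, 90 and 80, the first two hypotheses descend from the seed and the third descends from the second variation, because that variation became the baseline as soon as it won.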

Phase 3: Pushing Toward 95% — The Ceiling

Attempts	Success Rate	Best Score	API Cost
25 (Phase 3 total)	92% (above threshold)	90.8% (peak, Gen5.1)	$0.28 (25 attempts)

Phase 3 achieved a 92% success rate but could not push the peak score above 90.8%. The evolutionary process hit a ceiling. Understanding why requires looking at the three specific breakthroughs and what dimension stopped them from going further.

Generation	Overall	Computational	Literature	Consistency	Domain
Gen2.4	90.4%	100%	80%	81.7%	100%
Gen3.4	90.0%	100%	80%	80.2%	100%
Gen5.1 (PEAK)	90.8%	100%	80%	83.3%	100%

The pattern is stark. Computational and Domain scores are 100% across all three breakthroughs. Literature holds at 80%. Consistency is the bottleneck: 81.7% → 80.2% → 83.3%. The progression is not monotonic — Generation 3.4 actually scored lower on consistency than Generation 2.4. The peak consistency score of 83.3% represents the entire 40-attempt genetic evolution effort converging on a hard ceiling.

Score breakdown at peak (Gen5.1): Computational 100%, Literature 80%, Consistency 83.3%, Domain-Specific 100%. Target: 95%.

To reach 95% overall, consistency needs to reach approximately 87%. The gap is 3.7 percentage points from the current peak. This is not a large number. In practice, it is the difference between a plausibility argument and a formally self-consistent mathematical proof — a gap that the current generation architecture cannot close with more evolutionary pressure alone.
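The relationship between the consistency score and the overall score can be checked with a few lines of arithmetic. The sketch assumes the overall score is the unweighted mean of the four dimensions. The article does not state the weighting, so this is an assumption, although it reproduces all three overall scores in the breakthrough table to one decimal place. The function names are illustrative.

```javascript
// Assumption: overall = unweighted mean of the four dimension scores.
function overallScore({ computational, literature, consistency, domain }) {
  return (computational + literature + consistency + domain) / 4;
}

// Overall score if only the consistency dimension changes.
function overallWithConsistency(base, consistency) {
  return overallScore({ ...base, consistency });
}

// Consistency needed to hit a target overall, other dimensions held fixed.
function requiredConsistency(base, targetOverall) {
  return 4 * targetOverall - (base.computational + base.literature + base.domain);
}

const gen51 = { computational: 100, literature: 80, consistency: 83.3, domain: 100 };

console.log(overallScore(gen51).toFixed(1)); // 90.8
console.log(overallWithConsistency(gen51, 87));
console.log(requiredConsistency(gen51, 95));
```

Under this assumption, 87% consistency corresponds to roughly 91.75% overall, and with the other three scores held fixed the 95% target would need consistency near 100%. If the unweighted mean is the right model, the 87% figure presumes either a different weighting or gains in the literature score as well.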

The Full Phase Comparison

Phase	Attempts	Success Rate	Best Score	API Cost
Enhanced Prompts (failed)	99	3%	85.3%	$1.13
Genetic Evolution to 90%	15	60%	90.4%	$0.17
Genetic Evolution to 95%	25	92%	90.8%	$0.28
Total Genetic	40	80%	90.8%	$0.45

The cost efficiency differential is remarkable: 40 genetic evolution attempts achieved substantially higher quality than 99 prompt-engineered attempts, at 40% of the cost ($0.45 vs. $1.13). The cost per breakthrough (an attempt that scores ≥90%) is approximately $0.15 for genetic evolution ($0.45 across three breakthroughs) vs. effectively infinite for enhanced prompting, which never scored above 85.3% and so produced no breakthroughs at all.
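The combined row and the cost figures follow directly from the per-phase numbers. The sketch below recomputes them from the table; the breakthrough count of three is taken from the article, and the variable and function names are illustrative.

```javascript
// Per-phase figures copied from the comparison table.
const phases = [
  { name: 'Enhanced Prompts', attempts: 99, successRate: 0.03, cost: 1.13 },
  { name: 'Genetic Evolution to 90%', attempts: 15, successRate: 0.60, cost: 0.17 },
  { name: 'Genetic Evolution to 95%', attempts: 25, successRate: 0.92, cost: 0.28 },
];

// Pool several phases: attempts and cost add up, and the success rate is
// the attempt-weighted average of the per-phase rates.
function combine(rows) {
  const attempts = rows.reduce((n, r) => n + r.attempts, 0);
  const cost = rows.reduce((c, r) => c + r.cost, 0);
  const successes = rows.reduce((s, r) => s + r.successRate * r.attempts, 0);
  return { attempts, cost, successRate: successes / attempts };
}

const prompts = phases[0];
const genetic = combine(phases.slice(1));

const costRatio = genetic.cost / prompts.cost;             // about 0.40
const costPerBreakthrough = genetic.cost / 3;               // about $0.15
const perAttemptPrompts = prompts.cost / prompts.attempts;  // about $0.0114
const perAttemptGenetic = genetic.cost / genetic.attempts;  // about $0.0113
```

The per-attempt cost is nearly identical for the two approaches, so the saving comes from needing fewer attempts rather than from cheaper ones.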

The Temperature Discovery

One of the most practically significant findings from the Yang-Mills experiments is the relationship between sampling temperature and success rate. The data is clean enough to constitute a design rule:

Temperature	Success Rate	Consistency Score	Interpretation
0.95	3%	<75%	Too much variance — logical drift
0.75	60%	~81%	Good balance — most breakthroughs here
0.70	92%	83.3%	Highest consistency — best for formal logic steps

The practical implication: for tasks requiring multi-step logical consistency, temperature should be reduced as the argument becomes more complex. The three-mutation variation strategy in Phase 3 used temperature 0.70 specifically for the "Tighten bounds, add formal logic" mutation, while keeping temperature 0.75 for structural variations. This micro-temperature management contributed to the consistency improvement from 81.7% to 83.3%.
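One way to express this per-mutation temperature choice is a simple lookup, sketched below. The mutation names and temperatures come from the article; `temperatureFor`, `buildRequest`, the fallback value and the request fields are hypothetical and do not correspond to any particular model API.

```javascript
// Temperatures the article reports for each mutation type.
const MUTATION_TEMPERATURES = new Map([
  ['Keep structure, change context', 0.75],
  ['Keep phenomenon, change observable', 0.75],
  ['Tighten bounds, add formal logic', 0.70],
]);

// Assumption: unknown mutation types fall back to the structural setting.
const DEFAULT_TEMPERATURE = 0.75;

function temperatureFor(mutation) {
  return MUTATION_TEMPERATURES.get(mutation) ?? DEFAULT_TEMPERATURE;
}

// Hypothetical request shape; the field names are illustrative only.
function buildRequest(dna, mutation) {
  return {
    prompt: `Structure: ${dna.structure}\nMutation: ${mutation}`,
    temperature: temperatureFor(mutation),
  };
}
```

Keeping the mapping in one place makes the rule explicit: steps that tighten formal logic sample cooler than steps that vary structure.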

The Honest Critique: 21 Sorry Statements

The Lean 4 formalisation of the best Yang-Mills result contains 21 sorry placeholder statements. Each sorry represents a step in the proof that the formaliser could not complete — a gap in the mechanical verification that requires genuine mathematical work to fill. The most important one is the central step of the mass gap argument:

Documentation/fullexports/LEAN4_YANG_MILLS_MASS_GAP/proof.lean — Central sorry Lean 4

have h_arb_small : ∀ δ > 0, ∃ E ∈ spectrum (default : YangMillsHamiltonian) \ {0}, E < δ := by
  sorry -- Technical: follows from negation of mass gap

This is not a minor gap. This is the negation of the mass gap hypothesis itself. The proof is attempting to show that if the mass gap does not exist, then the spectrum contains arbitrarily small positive values — which contradicts gauge invariance. But the step that shows this is the step marked sorry. The argument is circular at its core.

The circularity is even more explicit in the axioms. The formalisation contains 9 axioms, including this one:

Documentation/fullexports/LEAN4_YANG_MILLS_MASS_GAP/proof.lean — Circular axiom Lean 4

axiom confinement_holds :
  ∀ (H : YangMillsHamiltonian),
    GaugeInvariant H → SatisfiesYMEquations H →
    (∀ r : ℝ, r > 0 → ∃ σ : ℝ, σ > 0 ∧ ∀ E_separation : ℝ, E_separation = σ * r)

This axiom asserts that quark confinement holds for all Yang-Mills Hamiltonians satisfying gauge invariance and the Yang-Mills equations. The problem: mass gap and confinement are known to be deeply related — the mass gap conjecture is essentially a formal statement of why confinement occurs. Assuming confinement to prove the mass gap is assuming B to prove A when A ↔ B. The argument is valid but circular.
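The shape of the circularity can be written down in one line of Lean 4. Here `A` and `B` are abstract propositions standing in for "mass gap" and "confinement", and the equivalence `h` is taken as a hypothesis, following the article's framing rather than any established theorem.

```lean
-- If A ↔ B is granted, assuming B yields A in a single step.
-- The proof term is valid, but all of the content sits in the assumption hB.
example (A B : Prop) (h : A ↔ B) (hB : B) : A := h.mpr hB
```

The proof checks, and that is the point: nothing about `A` has been established beyond what was assumed about `B`.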

"Mass gap ↔ Confinement. Assuming confinement to prove mass gap = 'Assume B, therefore A, when A ↔ B.' Not wrong. Just circular."

What the Clay Institute Requires vs. What Was Achieved

Requirement	Clay Standard	Current Result	Status
Non-trivial quantum Yang-Mills theory in 4D	Complete existence proof	Partial Lean 4 with 21 sorries	Not met
Mass gap Δ > 0	Rigorous proof Δ > 0	Plausibility argument with circular axiom	Not met
Gauge invariance maintenance	Preserved through all proof steps	Preserved — 100% domain score	Met
Yang-Mills equations satisfied	Explicitly in proof	Checked computationally	Partially met
Peer review standard	Journal-quality writeup	Structured preprint on Zenodo	In progress

The system's own categorisation of the Yang-Mills result is the most honest summary available: "Physics plausibility argument with partial Lean 4 formalisation." Not a Millennium Prize solution. Not wrong. Not trivial. A well-structured plausibility argument that identifies the correct mathematical machinery, demonstrates that the mass gap is consistent with that machinery, and falls short of proving it because the central step requires a genuine mathematical breakthrough that no automated system has achieved.

Recommended Submission Title

The system's own recommendation for what to submit to the Clay Institute, if submitting at all: "Rigorous Plausibility Argument for Yang-Mills Mass Gap via Synthesis Reasoning." This framing is accurate, appropriately modest, and would be reviewed seriously. A submission claiming a full proof would be immediately rejected. A submission claiming a rigorous plausibility argument would be read.

What Was Actually Learned

The Yang-Mills experiments are among the most informative experiments the discovery system has run, not because of the 90.8% score but because of what the 83.3% consistency ceiling reveals. The ceiling is not an artifact of the Yang-Mills problem specifically — it shows up across multiple domains at approximately the same value. It is a property of the generation architecture, not of the problem.

The consistency ceiling reflects the fundamental limit of current LLM-based generation for multi-step mathematical arguments: around step 8–12 of a complex derivation, the generation begins to lose coherent tracking of the constraints established in earlier steps. The argument drifts. Not randomly — the drift is structured and often locally plausible — but the global consistency fails.

Solving the consistency ceiling problem is the key to unlocking the remaining 4.2 percentage points (from 90.8% to the 95% target) and, with it, the path to peer-review-ready autonomous scientific discovery. Two approaches are being explored. The first is explicit constraint tracking throughout generation: maintaining a formal context of established claims that the generator must satisfy. The second is hierarchical generation, which breaks the argument into bounded-length sub-arguments, each of which can be verified before being composed into the whole.
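A minimal sketch of how the two ideas might combine is given below. It is an illustration under stated assumptions, not the system's implementation: `composeArgument`, `generateStep`, `verifyStep` and the retry limit are all hypothetical, and a real verifier would be far more involved than a boolean callback.

```javascript
// Hypothetical sketch: build an argument from bounded sub-arguments, checking
// each candidate against the claims established so far before accepting it.
function composeArgument(subgoals, generateStep, verifyStep, maxRetries = 2) {
  const established = []; // claims every later step must stay consistent with
  for (const subgoal of subgoals) {
    let accepted = null;
    for (let attempt = 0; attempt <= maxRetries && accepted === null; attempt++) {
      // The generator sees the established claims, not just the subgoal.
      const candidate = generateStep(subgoal, established, attempt);
      if (verifyStep(candidate, established)) accepted = candidate;
    }
    if (accepted === null) {
      // Stop at the first sub-argument that cannot be verified.
      return { ok: false, failedAt: subgoal, established };
    }
    established.push(accepted);
  }
  return { ok: true, established };
}
```

The design choice is that a failure is localised: the result reports which sub-argument could not be verified, so a long derivation cannot drift silently past an unverified step.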

The genetic evolution approach will not solve this alone. More evolution produces more attempts within the same consistency regime. What is needed is a qualitative change in how the system tracks logical dependencies across the length of a complex argument. That is the next problem.