The Yang-Mills existence and mass gap problem has been on the Clay Mathematics Institute's Millennium Prize list since 2000, with a $1 million prize for a solution. The problem itself has been open since Yang and Mills formulated their non-Abelian gauge theory in 1954. Seventy-two years. One million dollars. No solution.
This article is not a report of solving it. It is an honest account of what actually happened when an autonomous discovery system attacked the problem across 139 attempts over three distinct experimental phases: what worked, what failed catastrophically, what the system learned about its own limitations, and why the gap between 90.8% and 95% validation is more instructive than the gap between 0% and 90.8%.
"Evolution is cleverer than you are." (Orgel's Second Rule)
Phase 1: Enhanced Prompts (99 Attempts, 3% Success)
The first phase used what seemed like a reasonable strategy: take the best available prompt engineering techniques, apply them to Yang-Mills hypothesis generation, and iterate. The results were unambiguous.
The temperature parameter was set to 0.95: high variance, lots of creative exploration. The hypothesis was that Yang-Mills required genuinely novel mathematical connections that would only emerge from high-temperature sampling. This hypothesis was wrong.
High temperature did not produce novel connections. It produced plausible-sounding arguments with subtle logical inconsistencies. The 3% success rate reflects the rare occasions when high temperature sampling happened to land on a well-structured argument. 85.3% was the best single result: one attempt, never reproduced. The other 96 attempts peaked between 65% and 80%, with high variance and no convergence.
Temperature 0.95 maximises vocabulary diversity but destroys logical coherence. For a problem like Yang-Mills that requires precise mathematical reasoning across multiple steps, high temperature causes the generation to drift from its initial framing over the course of a long response. The argument that starts correctly at the gauge field quantization step has already accumulated enough drift by the mass gap step that the consistency dimension fails badly.
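The article does not say which model or API was used, so the following is only a minimal sketch of the standard temperature-scaled softmax, with made-up logits for four candidate tokens. It illustrates why 0.95 spreads probability mass across more candidates than 0.70 does; the names `softmaxWithTemperature`, `hot`, and `cool` are illustrative, not part of the original system.

```javascript
// Temperature-scaled softmax: p_i = exp(z_i / T) / sum_j exp(z_j / T).
function softmaxWithTemperature(logits, temperature) {
  const scaled = logits.map((z) => z / temperature);
  const maxScaled = Math.max(...scaled); // subtract the max for numerical stability
  const exps = scaled.map((z) => Math.exp(z - maxScaled));
  const total = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / total);
}

// Illustrative logits only (assumed values, not measured from any model).
const logits = [2.0, 1.0, 0.5, 0.1];
const hot = softmaxWithTemperature(logits, 0.95);
const cool = softmaxWithTemperature(logits, 0.70);

// The top candidate receives a larger share of probability at the lower temperature.
console.log("top-token probability at 0.95:", hot[0].toFixed(3));
console.log("top-token probability at 0.70:", cool[0].toFixed(3));
```

A flatter distribution at each token means more opportunities per step to pick a less likely continuation, which is one plausible reading of the drift described above over a long response.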
Phase 2: Genetic Evolution (The Breakthrough)
Phase 2 replaced random sampling with directed evolution. Instead of generating each hypothesis independently, the system extracted the structural pattern (DNA) from the best result in Phase 1 and used it as the starting point for a new generation cycle. Temperature dropped to 0.75.
The improvement was dramatic: a 60% success rate (vs. 3%), and an API cost 85% lower ($0.17 vs. $1.13) despite the much higher quality outcomes. Across the genetic runs as a whole, three attempts reached 90% or above (vs. zero in 99 attempts). The genetic evolution approach is not just better; it is an entirely different regime.
The Genetic Evolution Algorithm
The algorithm works by treating successful hypotheses as genetic material. Instead of generating each hypothesis from scratch, the system identifies the structural features that made the best hypothesis score well, encodes those features as DNA, and generates variations that preserve the successful structure while exploring adjacent possibility space.
// Encode the structural features of a winning hypothesis as "DNA".
// identifySuccessPattern, generate, and validate are assumed to be defined elsewhere.
function extractDNA(winnerHypothesis, validationBreakdown) {
  return {
    structure: identifySuccessPattern(winnerHypothesis),
    traits: {
      phenomenon: '...',   // What physical effect is claimed
      mechanism: '...',    // What causes it
      quantifiable: '...', // What can be measured
      testable: '...',     // How to verify it
      constants: '...'     // Specific numerical values with error bars
    },
    scores: validationBreakdown
  };
}

// Propose mutations of the current DNA; the formal-logic mutation samples cooler.
function generateVariations(DNA, targetScore) {
  return [
    { mutation: 'Keep structure, change context', temperature: 0.75 },
    { mutation: 'Keep phenomenon, change observable', temperature: 0.75 },
    { mutation: 'Tighten bounds, add formal logic', temperature: 0.70 },
  ];
}

// Try each variation; a new best score immediately replaces the DNA baseline.
async function evolveGeneration(DNA, variations, bestEver) {
  for (const variation of variations) {
    const hypothesis = await generate(DNA, variation);
    const validation = await validate(hypothesis);
    if (validation.score > bestEver.validation.score) {
      bestEver = { hypothesis, validation };
      DNA = extractDNA(hypothesis, validation); // New baseline!
    }
  }
  return { DNA, bestEver };
}
The key line is the reassignment inside the loop: DNA = extractDNA(hypothesis, validation). When a variation beats the previous best, it immediately becomes the new DNA baseline. The next variations are generated from the winner, not from the original. This is Lamarckian evolution: acquired characteristics (successful argument structures) are directly inherited by descendants.
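To make the "winner becomes the baseline" behaviour concrete, here is a toy, fully self-contained version of the loop. Everything in it is a hypothetical stand-in: a "hypothesis" is just a number, `validateToy` rewards closeness to an arbitrary target of 0.9, and `mutateToy` adds a fixed step. It is a sketch of the control flow only, not of the real generator or validator.

```javascript
// Hypothetical stand-ins: score a numeric "hypothesis" by closeness to 0.9.
const validateToy = (h) => ({ score: 1 - Math.abs(0.9 - h) });
const mutateToy = (baseline, step) => baseline + step;

function evolveToy(start, steps) {
  let baseline = start;
  let bestEver = { hypothesis: start, validation: validateToy(start) };
  const history = [bestEver.validation.score];
  for (const step of steps) {
    const hypothesis = mutateToy(baseline, step);
    const validation = validateToy(hypothesis);
    if (validation.score > bestEver.validation.score) {
      bestEver = { hypothesis, validation };
      baseline = hypothesis; // the winner becomes the new baseline
    }
    history.push(bestEver.validation.score); // best score seen so far
  }
  return { bestEver, history };
}

const run = evolveToy(0.5, [0.2, -0.3, 0.15, 0.3, 0.05]);
console.log(run.history.map((s) => s.toFixed(2)).join(" -> "));
```

Because losing variations never change the baseline, the best-so-far score can only stay flat or rise, which is the property the real loop relies on.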
Phase 3: Pushing Toward 95% (The Ceiling)
Phase 3 achieved a 92% success rate but could not push the peak score above 90.8%. The evolutionary process hit a ceiling. Understanding why requires looking at the three specific breakthroughs and what dimension stopped them from going further.
| Generation | Overall | Computational | Literature | Consistency | Domain |
|---|---|---|---|---|---|
| Gen2.4 | 90.4% | 100% | 80% | 81.7% | 100% |
| Gen3.4 | 90.0% | 100% | 80% | 80.2% | 100% |
| Gen5.1 (PEAK) | 90.8% | 100% | 80% | 83.3% | 100% |
The pattern is stark. Computational and Domain scores are 100% across all three breakthroughs. Literature holds at 80%. Consistency is the bottleneck: 81.7% → 80.2% → 83.3%. The progression is not monotonic: Generation 3.4 actually scored lower on consistency than Generation 2.4. The peak consistency score of 83.3% represents the entire 40-attempt genetic evolution effort converging on a hard ceiling.
To reach 95% overall, consistency needs to reach approximately 87%. The gap is 3.7 percentage points from the current peak. This is not a large number. In practice, it is the difference between a plausibility argument and a formally self-consistent mathematical proof, a gap that the current generation architecture cannot close with more evolutionary pressure alone.
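The article does not state how the four dimensions combine into the overall score. The sketch below assumes an unweighted mean, which reproduces all three rows of the breakthrough table to within rounding. Under that assumption, and with the other three dimensions held at their Gen5.1 values, it computes the consistency a 95% overall score would need. The result is well above the approximately 87% quoted above, so that figure presumably rests on a different weighting or on the literature score rising too; the source does not say which.

```javascript
// Dimension scores (percent) copied from the breakthrough table.
const breakthroughs = [
  { gen: "Gen2.4", overall: 90.4, computational: 100, literature: 80, consistency: 81.7, domain: 100 },
  { gen: "Gen3.4", overall: 90.0, computational: 100, literature: 80, consistency: 80.2, domain: 100 },
  { gen: "Gen5.1", overall: 90.8, computational: 100, literature: 80, consistency: 83.3, domain: 100 },
];

// ASSUMPTION: overall = unweighted mean of the four dimensions.
const meanOf = (b) => (b.computational + b.literature + b.consistency + b.domain) / 4;
const maxDeviation = Math.max(...breakthroughs.map((b) => Math.abs(meanOf(b) - b.overall)));

// Consistency needed for a target overall score, other dimensions held fixed.
function requiredConsistency(target, { computational, literature, domain }) {
  return 4 * target - (computational + literature + domain);
}

const peak = breakthroughs[2];
const needed = requiredConsistency(95, peak); // 4 * 95 - 280 = 100

console.log("max deviation from reported overall:", maxDeviation.toFixed(3));
console.log("consistency needed for 95% overall:", needed);
```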
The Full Phase Comparison
| Phase | Attempts | Success Rate | Best Score | API Cost |
|---|---|---|---|---|
| Enhanced Prompts (failed) | 99 | 3% | 85.3% | $1.13 |
| Genetic Evolution to 90% | 15 | 60% | 90.4% | $0.17 |
| Genetic Evolution to 95% | 25 | 92% | 90.8% | $0.28 |
| Total Genetic | 40 | 80% | 90.8% | $0.45 |
The cost efficiency differential is remarkable: 40 genetic evolution attempts achieved substantially higher quality than 99 prompt-engineered attempts, at 40% of the cost ($0.45 vs. $1.13). The cost per breakthrough (an attempt that scores ≥90%) is approximately $0.15 for genetic evolution vs. effectively infinite for enhanced prompting, which never scored above 85.3% and so produced no breakthroughs.
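The derived figures in this comparison can be rechecked from the table with a few lines of arithmetic. This is a sketch using only the table's numbers; the variable names are illustrative, and the breakthrough count of three is taken from the three rows of the earlier breakthrough table.

```javascript
// Figures copied from the phase comparison table.
const enhanced = { attempts: 99, successRate: 0.03, costUsd: 1.13 };
const geneticPhases = [
  { attempts: 15, successRate: 0.60, costUsd: 0.17 },
  { attempts: 25, successRate: 0.92, costUsd: 0.28 },
];
const breakthroughCount = 3; // Gen2.4, Gen3.4, Gen5.1

// Pool the two genetic phases; success counts are rounded to whole attempts.
const genetic = geneticPhases.reduce(
  (acc, p) => ({
    attempts: acc.attempts + p.attempts,
    successes: acc.successes + Math.round(p.attempts * p.successRate),
    costUsd: acc.costUsd + p.costUsd,
  }),
  { attempts: 0, successes: 0, costUsd: 0 }
);

const summary = {
  pooledSuccessRate: genetic.successes / genetic.attempts,
  costRatio: genetic.costUsd / enhanced.costUsd,
  costPerBreakthrough: genetic.costUsd / breakthroughCount,
  costPerAttemptGenetic: genetic.costUsd / genetic.attempts,
  costPerAttemptEnhanced: enhanced.costUsd / enhanced.attempts,
};
console.log(summary);
```

One thing the per-attempt figures make visible: both regimes cost roughly 1.1 cents per attempt, so the savings come from needing far fewer attempts rather than from cheaper ones.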
The Temperature Discovery
One of the most practically significant findings from the Yang-Mills experiments is the relationship between sampling temperature and success rate. The data is clean enough to constitute a design rule:
| Temperature | Success Rate | Consistency Score | Interpretation |
|---|---|---|---|
| 0.95 | 3% | <75% | Too much variance; logical drift |
| 0.75 | 60% | ~81% | Good balance; most breakthroughs here |
| 0.70 | 92% | 83.3% | Highest consistency; best for formal logic steps |
The practical implication: for tasks requiring multi-step logical consistency, temperature should be reduced as the argument becomes more complex. The three-mutation variation strategy in Phase 3 used temperature 0.70 specifically for the "Tighten bounds, add formal logic" mutation, while keeping temperature 0.75 for structural variations. This micro-temperature management contributed to the consistency improvement from 81.7% to 83.3%.
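A minimal sketch of that per-mutation temperature choice is below. The three mutation names and their temperatures come from the variation list earlier in the article; the fallback value, the request shape, and the function names are assumptions made for illustration.

```javascript
// Temperatures per mutation type, as listed in the variation strategy.
const MUTATION_TEMPERATURES = new Map([
  ["Keep structure, change context", 0.75],
  ["Keep phenomenon, change observable", 0.75],
  ["Tighten bounds, add formal logic", 0.70],
]);
const FALLBACK_TEMPERATURE = 0.75; // hypothetical default for unlisted mutations

function temperatureFor(mutation) {
  return MUTATION_TEMPERATURES.get(mutation) ?? FALLBACK_TEMPERATURE;
}

// Build one generation request per mutation (the request shape is illustrative).
function buildRequests(dna, mutations) {
  return mutations.map((mutation) => ({
    dna,
    mutation,
    temperature: temperatureFor(mutation),
  }));
}

const requests = buildRequests("dna-placeholder", [...MUTATION_TEMPERATURES.keys()]);
console.log(requests.map((r) => `${r.temperature.toFixed(2)}  ${r.mutation}`).join("\n"));
```

Keeping the mapping in one table makes it straightforward to lower the temperature for any additional mutation that leans on formal, multi-step logic.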
The Honest Critique: 21 Sorry Statements
The Lean 4 formalisation of the best Yang-Mills result contains 21 sorry placeholder statements. Each sorry represents a step in the proof that the formaliser could not complete: a gap in the mechanical verification that requires genuine mathematical work to fill. The most important one is the central step of the mass gap argument:
have h_arb_small : ∀ δ > 0, ∃ E ∈ spectrum (default : YangMillsHamiltonian) \ {0}, E < δ := by
  sorry -- Technical: follows from negation of mass gap
This is not a minor gap. This is the negation of the mass gap hypothesis itself. The proof is attempting to show that if the mass gap does not exist, then the spectrum contains arbitrarily small positive values, which contradicts gauge invariance. But the step that shows this is the step marked sorry. The argument is circular at its core.
The circularity is even more explicit in the axioms. The formalisation contains 9 axioms, including this one:
axiom confinement_holds :
  ∀ (H : YangMillsHamiltonian),
    GaugeInvariant H → SatisfiesYMEquations H →
    (∀ r : ℝ, r > 0 → ∃ σ : ℝ, σ > 0 ∧ ∃ E_separation : ℝ, E_separation = σ * r)
This axiom asserts that quark confinement holds for all Yang-Mills Hamiltonians satisfying gauge invariance and the Yang-Mills equations. The problem: mass gap and confinement are known to be deeply related; the mass gap conjecture is essentially a formal statement of why confinement occurs. Assuming confinement to prove the mass gap is assuming B to prove A when A ↔ B. The argument is valid but circular.
"Mass gap ↔ Confinement. Assuming confinement to prove mass gap = 'Assume B, therefore A, when A ↔ B.' Not wrong. Just circular."
What the Clay Institute Requires vs. What Was Achieved
| Requirement | Clay Standard | Current Result | Status |
|---|---|---|---|
| Non-trivial quantum Yang-Mills theory in 4D | Complete existence proof | Partial Lean 4 with 21 sorries | Not met |
| Mass gap Δ > 0 | Rigorous proof Δ > 0 | Plausibility argument with circular axiom | Not met |
| Gauge invariance maintenance | Preserved through all proof steps | Preserved (100% domain score) | Met |
| Yang-Mills equations satisfied | Explicitly in proof | Checked computationally | Partially met |
| Peer review standard | Journal-quality writeup | Structured preprint on Zenodo | In progress |
The system's own categorisation of the Yang-Mills result is the most honest summary available: "Physics plausibility argument with partial Lean 4 formalisation." Not a Millennium Prize solution. Not wrong. Not trivial. A well-structured plausibility argument that identifies the correct mathematical machinery, demonstrates that the mass gap is consistent with that machinery, and falls short of proving it because the central step requires a genuine mathematical breakthrough that no automated system has achieved.
The system's own recommendation for what to submit to the Clay Institute, if submitting at all: "Rigorous Plausibility Argument for Yang-Mills Mass Gap via Synthesis Reasoning." This framing is accurate, appropriately modest, and would be reviewed seriously. A submission claiming a full proof would be immediately rejected. A submission claiming a rigorous plausibility argument would be read.
What Was Actually Learned
The Yang-Mills experiments are among the most informative experiments the discovery system has run, not because of the 90.8% score but because of what the 83.3% consistency ceiling reveals. The ceiling is not an artifact of the Yang-Mills problem specifically: it shows up across multiple domains at approximately the same value. It is a property of the generation architecture, not of the problem.
The consistency ceiling reflects the fundamental limit of current LLM-based generation for multi-step mathematical arguments: around steps 8 to 12 of a complex derivation, the generation begins to lose coherent tracking of the constraints established in earlier steps. The argument drifts. Not randomly (the drift is structured and often locally plausible), but the global consistency fails.
Solving the consistency ceiling problem is the key to unlocking the remaining 4.2% and, with it, the path to peer-review-ready autonomous scientific discovery. The approaches being explored include explicit constraint tracking throughout the generation (essentially maintaining a formal context of established claims that the generator must satisfy), and hierarchical generation that breaks the argument into bounded-length sub-arguments, each of which can be verified before being composed into the whole.
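As a rough illustration of the first idea, explicit constraint tracking combined with step-by-step verification, the sketch below keeps a ledger of established facts and refuses to compose a step that contradicts one. The `ConstraintLedger` class, the `asserts` field, and the example facts are all hypothetical; a real system would need a much richer notion of "claim" than key-value equality.

```javascript
// Hypothetical ledger: each step asserts facts as { symbol: value } pairs.
class ConstraintLedger {
  constructor() {
    this.facts = new Map();
  }

  // Returns the conflicts for a step; commits its facts only if there are none.
  tryCommit(step) {
    const entries = Object.entries(step.asserts);
    const conflicts = entries
      .filter(([symbol, value]) => this.facts.has(symbol) && this.facts.get(symbol) !== value)
      .map(([symbol, value]) => ({ symbol, established: this.facts.get(symbol), proposed: value }));
    if (conflicts.length === 0) {
      for (const [symbol, value] of entries) this.facts.set(symbol, value);
    }
    return conflicts;
  }
}

// Compose steps in order, stopping at the first one that contradicts the ledger.
function composeVerified(steps) {
  const ledger = new ConstraintLedger();
  const accepted = [];
  for (const step of steps) {
    const conflicts = ledger.tryCommit(step);
    if (conflicts.length > 0) return { accepted, rejected: step, conflicts };
    accepted.push(step.id);
  }
  return { accepted, rejected: null, conflicts: [] };
}

// Made-up example: step 3 silently changes a fact fixed in step 1.
const result = composeVerified([
  { id: 1, asserts: { gaugeGroup: "SU(3)", dimension: 4 } },
  { id: 2, asserts: { dimension: 4, coupling: "g" } },
  { id: 3, asserts: { gaugeGroup: "SU(2)" } },
]);
console.log(result);
```

The design choice here is to reject at composition time rather than score after the fact, so a drifting step is caught before later steps build on it.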
The genetic evolution approach will not solve this alone. More evolution produces more attempts within the same consistency regime. What is needed is a qualitative change in how the system tracks logical dependencies across the length of a complex argument. That is the next problem.