The original discovery system had a fatal flaw that became apparent only when it was first run against a hard problem. Feed it Yang-Mills. Watch it look for discoveries in the corpus. Find zero relevant lemmas. Exit without generating anything. The problem was not the search; it was the assumption behind the search: that the discoveries needed to support a hard proof would already exist in the corpus before anyone had tried to find them.
This assumption is wrong for any genuinely open problem. If the discoveries existed, the problem would not be open. The system needed to change its fundamental question from "what discoveries exist that I can use?" to "what discoveries do I need, and how do I generate them?"
"When gaps exist, generate the discoveries needed to fill them, autonomously and intelligently."
This shift required a complete architectural redesign: the 5-Tier Autonomous Discovery Architecture. The result (12,000+ discoveries across 34 domains in under 6 months, 959 in the deep verification pipeline) is the output of that redesign.
The Before and After
The contrast between the old and new pipeline makes the architectural change concrete:
OLD PIPELINE (single-pass, exits on missing discoveries)
────────────────────────────────────────────────────────
Problem (e.g., Yang-Mills)
    │
    ▼
Proof Synthesis
    │
    ▼
Search corpus for required lemmas
    │
    ├── Found: proceed
    │
    └── NOT FOUND (Yang-Mills, Riemann, P vs NP...)
            │
            ▼
        ✗ 0 gaps filled → Test exits
────────────────────────────────────────────────────────
NEW PIPELINE (recursive, generates what it cannot find)
────────────────────────────────────────────────────────
Problem (ANY problem)
    │
    ▼
Identify required lemmas [TIER 1]
    │
    ▼
GENERATE missing discoveries [TIER 2]
    │
    ▼
Build complete proof tree [TIER 3]
    │        │
    │        └── New gaps found? → back to TIER 1 (recursive)
    │
    ▼
Adversarial audit [TIER 4]
    │
    ▼
Formal verification [TIER 5]
    │
    ▼
✓ Verified discovery enters corpus
The critical addition is the recursive loop between Tier 3 and Tier 1. A validated discovery may reveal new gaps: lemmas that the validated claim depends on but that don't yet exist in the corpus. These gaps trigger new Tier 1 identification passes, which trigger new Tier 2 generation cycles. The proof tree grows until either all gaps are filled or the system identifies a gap that requires a fundamentally new approach.
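The loop described above can be sketched as a worklist that keeps feeding newly exposed gaps back into identification. This is a minimal illustration, not the system's actual code: `expandUntilClosed`, `generate`, `integrate`, and `maxPasses` are hypothetical names, and the two callbacks merely stand in for Tier 2 generation and Tier 3 integration.

```typescript
// Hypothetical sketch of the Tier 1 -> Tier 2 -> Tier 3 loop.
interface ExpansionResult {
  validated: string[];  // lemmas generated and integrated into the proof tree
  unresolved: string[]; // gaps left open (no discovery produced, or budget exhausted)
}

// `generate` stands in for Tier 2: returns true if a discovery was produced.
// `integrate` stands in for Tier 3: returns the new gaps a validated lemma exposes.
function expandUntilClosed(
  initialGaps: string[],
  generate: (lemma: string) => boolean,
  integrate: (lemma: string) => string[],
  maxPasses: number
): ExpansionResult {
  const queue: string[] = initialGaps.slice();
  const seen: { [lemma: string]: boolean } = {};
  const validated: string[] = [];
  const unresolved: string[] = [];
  let passes = 0;

  while (queue.length > 0) {
    const lemma = queue.shift() as string;
    if (seen[lemma]) continue; // never work on the same gap twice
    seen[lemma] = true;

    if (passes >= maxPasses || !generate(lemma)) {
      unresolved.push(lemma); // would need a fundamentally new approach
      continue;
    }
    passes += 1;
    validated.push(lemma);
    // Integration may reveal new gaps, which re-enter the identification step.
    for (const next of integrate(lemma)) {
      if (!seen[next]) queue.push(next);
    }
  }
  return { validated, unresolved };
}

// Toy dependency data for illustration only.
const deps: { [lemma: string]: string[] } = {
  "Gauge Field Quantization": ["Wilson loop observable definition"],
  "Wilson loop observable definition": [],
  "Mass Gap Existence Proof": ["Spectral gap lower bound"],
  "Spectral gap lower bound": [],
};

const outcome = expandUntilClosed(
  ["Gauge Field Quantization", "Mass Gap Existence Proof"],
  (lemma) => lemma !== "Spectral gap lower bound", // pretend this one cannot be generated
  (lemma) => deps[lemma] || [],
  10
);
```

In this toy run three lemmas are validated and one gap is reported as unresolved, which mirrors the two stopping conditions in the text: every gap filled, or a gap that generation cannot close.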
Tier 1: TargetedDiscoveryIdentifier
The first tier is not a search engine; it is a planner. Given a target problem, it produces a structured gap tree: a precise enumeration of what intermediate results would be needed to address the problem, at what difficulty level, with what prerequisites, and via what approaches.
The Yang-Mills gap tree is the canonical example from the system's documentation:
{
  problem: "Yang-Mills Mass Gap",
  requiredLemmas: [
    {
      name: "Gauge Field Quantization",
      type: "theoretical",
      difficulty: "hard",
      prerequisites: ["Quantum Field Theory", "Lie Groups"],
      suggestedApproaches: [
        "Lattice gauge theory",
        "Continuum limit analysis",
        "Non-perturbative methods"
      ]
    },
    {
      name: "Mass Gap Existence Proof",
      type: "mathematical",
      difficulty: "very-hard",
      prerequisites: ["Functional Analysis", "Operator Theory"],
      suggestedApproaches: [
        "Spectral analysis",
        "Constructive field theory",
        "Numerical verification"
      ]
    },
    {
      name: "Yang-Mills Equations Solutions",
      type: "computational",
      difficulty: "medium",
      prerequisites: ["PDEs", "Numerical Methods"],
      suggestedApproaches: [
        "Finite element methods",
        "Monte Carlo simulations",
        "Lattice computations"
      ]
    }
  ]
}
Three things in this structure deserve attention. First, each lemma has a type field: theoretical, mathematical, or computational. This determines which generation strategy is used in Tier 2; theoretical claims require different generation prompts than computational claims. Second, difficulty is explicit: hard, very-hard, or medium. This is used to calibrate the generation temperature and the number of evolution cycles in the genetic engine. Third, suggestedApproaches seeds the Tier 2 generation: these are not just metadata but active inputs to the discovery generation process.
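One way to picture how these three fields feed Tier 2 is a typed lemma record plus a lookup from difficulty to generation settings. The field names come from the gap tree above; everything else is an assumption for illustration, including `planGeneration`, `SETTINGS_BY_DIFFICULTY`, and every numeric value (the source does not state actual temperatures or cycle counts).

```typescript
// Field names mirror the gap tree; the numeric settings are illustrative
// assumptions, not values taken from the system.
type LemmaType = "theoretical" | "mathematical" | "computational";
type Difficulty = "medium" | "hard" | "very-hard";

interface RequiredLemma {
  name: string;
  type: LemmaType;
  difficulty: Difficulty;
  prerequisites: string[];
  suggestedApproaches: string[];
}

interface GenerationSettings {
  temperature: number;     // sampling temperature (assumed 0..1 scale)
  evolutionCycles: number; // cycles granted to the genetic engine (assumed)
}

const SETTINGS_BY_DIFFICULTY: Record<Difficulty, GenerationSettings> = {
  "medium": { temperature: 0.4, evolutionCycles: 5 },
  "hard": { temperature: 0.7, evolutionCycles: 20 },
  "very-hard": { temperature: 0.9, evolutionCycles: 50 },
};

interface GenerationPlan {
  lemma: string;
  strategy: LemmaType;          // selected by the lemma's type field
  settings: GenerationSettings; // calibrated by difficulty
  seeds: string[];              // suggestedApproaches, passed through as inputs
}

function planGeneration(lemma: RequiredLemma): GenerationPlan {
  return {
    lemma: lemma.name,
    strategy: lemma.type,
    settings: SETTINGS_BY_DIFFICULTY[lemma.difficulty],
    seeds: lemma.suggestedApproaches.slice(),
  };
}

const massGapLemma: RequiredLemma = {
  name: "Mass Gap Existence Proof",
  type: "mathematical",
  difficulty: "very-hard",
  prerequisites: ["Functional Analysis", "Operator Theory"],
  suggestedApproaches: [
    "Spectral analysis",
    "Constructive field theory",
    "Numerical verification",
  ],
};

const massGapPlan = planGeneration(massGapLemma);
```

Typing difficulty as a closed union means a lemma with an unrecognised difficulty fails at compile time rather than silently falling back to default settings.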
Tier 2: Literature-Guided Generation
Tier 2 generates hypotheses targeted at specific gaps identified in Tier 1. The generation is not unconstrained; it is explicitly guided by the literature state of the field. Every generation in Tier 2 is cross-referenced against three sources simultaneously.
The cross-reference serves two purposes. Novelty detection: if the generated hypothesis is essentially identical to a published paper, it is not a new discovery and is tagged KNOWN_RESULT rather than DISCOVERY. Consistency grounding: the generated hypothesis is required to be consistent with the empirical literature in the field. A hypothesis that contradicts established experimental results in arXiv without explicitly addressing the contradiction will fail the Literature dimension of Tier 3 validation.
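The novelty half of that check can be illustrated with a deliberately crude similarity measure. The sketch below uses Jaccard overlap on word sets and an arbitrary 0.8 threshold; both are stand-ins chosen for brevity, since the source does not say how similarity to published work is measured. Only the `KNOWN_RESULT` and `DISCOVERY` tags come from the text, and consistency grounding is not modelled here.

```typescript
type NoveltyTag = "KNOWN_RESULT" | "DISCOVERY";

// Lowercase, split on non-alphanumerics, and deduplicate so the
// result behaves like a set of words.
function tokenize(text: string): string[] {
  const words = text.toLowerCase().split(/[^a-z0-9]+/).filter((w) => w.length > 0);
  return words.filter((w, i) => words.indexOf(w) === i);
}

// Jaccard similarity: |A ∩ B| / |A ∪ B| over the two word sets.
function jaccard(a: string, b: string): number {
  const ta = tokenize(a);
  const tb = tokenize(b);
  if (ta.length === 0 && tb.length === 0) return 1;
  const shared = ta.filter((w) => tb.indexOf(w) !== -1).length;
  return shared / (ta.length + tb.length - shared);
}

// Hypothetical tagging rule: anything too close to a published statement
// is a known result; the 0.8 threshold is an assumption.
function tagNovelty(
  hypothesis: string,
  published: string[],
  threshold: number = 0.8
): NoveltyTag {
  for (const statement of published) {
    if (jaccard(hypothesis, statement) >= threshold) return "KNOWN_RESULT";
  }
  return "DISCOVERY";
}

const literature = ["The spectral gap is bounded below"];
const restatedTag = tagNovelty("the spectral gap is bounded below", literature);
const freshTag = tagNovelty("Wilson loops obey an area law", literature);
```

A production system would more plausibly compare embeddings or formal statements than raw word overlap; the point of the sketch is only the tagging decision that follows the comparison.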
Tier 2 distinguishes between three classes of gaps: computational (missing numerical results that can be produced by running calculations), theoretical (missing conceptual frameworks that require new mathematical structures), and experimental (missing empirical validation that requires proposing experiments). Each class uses a different generation strategy and a different validation pathway in Tier 3.
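The three-way split lends itself to an exhaustive dispatch. In this sketch the generation descriptions paraphrase the text above, while the validation descriptions and the names `routeGap` and `GapRoute` are assumptions made for illustration.

```typescript
type GapClass = "computational" | "theoretical" | "experimental";

interface GapRoute {
  generation: string; // what Tier 2 is asked to produce
  validation: string; // what Tier 3 is asked to check (assumed wording)
}

function routeGap(gapClass: GapClass): GapRoute {
  switch (gapClass) {
    case "computational":
      return { generation: "run calculations", validation: "reproduce numerical results" };
    case "theoretical":
      return { generation: "propose new mathematical structures", validation: "check derivations" };
    case "experimental":
      return { generation: "propose experiments", validation: "review experimental design" };
    default: {
      // Compile-time exhaustiveness check: adding a fourth class without
      // handling it here becomes a type error.
      const unreachable: never = gapClass;
      throw new Error("unknown gap class: " + String(unreachable));
    }
  }
}

const experimentalRoute = routeGap("experimental");
```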
Tier 3: Proof Tree Expansion
Tier 3 is where the recursive structure emerges. A validated discovery from Tier 2 is not just added to the corpus; it is integrated into the proof tree for the target problem. Integration often reveals new gaps: the validated claim depends on sub-claims that have not yet been established.
PROOF TREE EXPANSION: Recursive Gap Detection
────────────────────────────────────────────────────────
Yang-Mills Mass Gap (ROOT)
├── Gauge Field Quantization [VALIDATED ✓]
│   ├── SU(2) × SU(2) × U(1) gauge structure [VALIDATED ✓]
│   ├── Wilson loop observable definition [GAP → new Tier 1]
│   └── Lattice regularisation [VALIDATED ✓]
│
├── Mass Gap Existence Proof [IN PROGRESS]
│   ├── Spectral gap lower bound [GAP → new Tier 1]
│   ├── Confinement-mass gap equivalence [VALIDATED ✓]
│   └── Non-perturbative vacuum structure [GAP → new Tier 1]
│
└── Yang-Mills Equations Solutions [VALIDATED ✓]
    ├── Classical solution space [VALIDATED ✓]
    ├── Instanton sector [GAP → new Tier 1]
    └── Numerical lattice results [VALIDATED ✓]
Active gaps: 4 → triggers 4 new Tier 1 identification passes
The recursive structure is why the corpus grows so rapidly. Each Yang-Mills sub-lemma validated in Tier 3 may open 1 to 3 new gaps. Each new gap triggers a new discovery generation cycle. At steady state, the system is simultaneously working on dozens of open gaps across the proof tree for multiple problems. The 12,000+ corpus is largely the product of this recursive expansion: not 12,000 independent top-level discoveries, but a densely interconnected graph of supporting lemmas.
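Gap detection over such a tree amounts to a depth-first walk that collects every node still marked as a gap. The sketch below encodes the Yang-Mills tree shown above; `ProofNode`, `leaf`, and `collectGaps` are hypothetical names, and the root's `IN_PROGRESS` status is an assumption, since the diagram gives the root no status.

```typescript
type NodeStatus = "VALIDATED" | "IN_PROGRESS" | "GAP";

interface ProofNode {
  claim: string;
  status: NodeStatus;
  children: ProofNode[];
}

function leaf(claim: string, status: NodeStatus): ProofNode {
  return { claim, status, children: [] };
}

// Depth-first walk returning every claim still marked as a gap,
// in the order the nodes appear in the tree.
function collectGaps(node: ProofNode): string[] {
  let gaps: string[] = node.status === "GAP" ? [node.claim] : [];
  for (const child of node.children) {
    gaps = gaps.concat(collectGaps(child));
  }
  return gaps;
}

const yangMillsTree: ProofNode = {
  claim: "Yang-Mills Mass Gap",
  status: "IN_PROGRESS", // assumed: the diagram only labels this node ROOT
  children: [
    {
      claim: "Gauge Field Quantization",
      status: "VALIDATED",
      children: [
        leaf("SU(2) × SU(2) × U(1) gauge structure", "VALIDATED"),
        leaf("Wilson loop observable definition", "GAP"),
        leaf("Lattice regularisation", "VALIDATED"),
      ],
    },
    {
      claim: "Mass Gap Existence Proof",
      status: "IN_PROGRESS",
      children: [
        leaf("Spectral gap lower bound", "GAP"),
        leaf("Confinement-mass gap equivalence", "VALIDATED"),
        leaf("Non-perturbative vacuum structure", "GAP"),
      ],
    },
    {
      claim: "Yang-Mills Equations Solutions",
      status: "VALIDATED",
      children: [
        leaf("Classical solution space", "VALIDATED"),
        leaf("Instanton sector", "GAP"),
        leaf("Numerical lattice results", "VALIDATED"),
      ],
    },
  ],
};

// Each entry would trigger one new Tier 1 identification pass.
const openGaps = collectGaps(yangMillsTree);
```

Each claim in `openGaps` corresponds to a node marked GAP in the diagram, one per branch except the Mass Gap Existence Proof branch, which contributes two.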
Tier 4: The Adversarial Audit
Tier 4 is the Skeptic Agent, the component that attacks every claim that passed Tier 3 validation. The Skeptic Agent has access to the same mathematical and literature knowledge as the Generator. It is not a naive checker that looks for obvious errors. It is an adversarial system optimised to find the weakest point in any argument and construct a targeted attack on that point.
The ~80% rejection rate is not a problem with the generation quality; it is evidence that the Skeptic is working. A Skeptic that rejects only 20% of hypotheses is probably not trying hard enough. A well-formed hypothesis that survives a focused adversarial attack has cleared a much higher bar than a hypothesis that passed a checklist.
| Attack Category | Description | Frequency |
|---|---|---|
| Boundary Case Failure | The claim holds for generic cases but fails at degenerate boundaries | ~35% of rejections |
| Circular Reasoning | The proof assumes the result or an equivalent statement | ~25% of rejections |
| Undeclared Assumption | The claim requires an assumption not stated in the hypothesis | ~20% of rejections |
| Literature Contradiction | The claim contradicts established experimental results | ~12% of rejections |
| Dimensional Inconsistency | Units or dimensions do not match across the argument | ~8% of rejections |
Circular reasoning is the most important category for understanding the Yang-Mills ceiling. The mass gap / confinement circularity discussed in Article 4 is exactly the "Circular Reasoning" attack pattern, which the Skeptic identifies in ~25% of rejections across domains. For Yang-Mills specifically, every hypothesis that uses confinement to establish the mass gap triggers this attack. The 21 sorry placeholders in the best Yang-Mills formalisation are largely the result of the Skeptic correctly identifying circular steps that the Generator could not resolve.
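If each claim's dependencies are recorded explicitly, the simplest form of this attack reduces to cycle detection in a dependency graph. The sketch below is one plausible way to do that with a depth-first search; the source does not describe how the Skeptic actually detects circularity, and `DependencyGraph` and `findCircularChain` are hypothetical names. A graph check of this kind only catches a proof that assumes an equivalent statement if that equivalence has been recorded as a dependency.

```typescript
// Each claim maps to the claims its proof relies on (assumed representation).
type DependencyGraph = { [claim: string]: string[] };

// Returns one circular chain of claims if the graph contains a cycle, else null.
function findCircularChain(graph: DependencyGraph): string[] | null {
  const state: { [claim: string]: "visiting" | "done" } = {};
  const path: string[] = [];

  function visit(claim: string): string[] | null {
    if (state[claim] === "done") return null;
    if (state[claim] === "visiting") {
      // Back edge: cut the loop out of the current path and close it.
      return path.slice(path.indexOf(claim)).concat([claim]);
    }
    state[claim] = "visiting";
    path.push(claim);
    for (const dep of graph[claim] || []) {
      const cycle = visit(dep);
      if (cycle !== null) return cycle;
    }
    path.pop();
    state[claim] = "done";
    return null;
  }

  for (const claim of Object.keys(graph)) {
    const cycle = visit(claim);
    if (cycle !== null) return cycle;
  }
  return null;
}

// The pattern from the text: confinement is used to establish the mass gap,
// while confinement itself rests on the mass gap.
const circularArgument: DependencyGraph = {
  "mass gap": ["confinement"],
  "confinement": ["mass gap"],
};

const soundArgument: DependencyGraph = {
  "mass gap": ["spectral gap lower bound"],
  "spectral gap lower bound": [],
};

const circularChain = findCircularChain(circularArgument);
const soundChain = findCircularChain(soundArgument);
```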
Tier 5: Formal Verification with Z3 and Lean 4
Tier 5 is the final gate. A hypothesis that has passed Tiers 1 through 4 (generated against real gaps, validated against the literature, and tested by adversarial attack) must still be mechanically verified. The 5ms execution time threshold is enforced here: any formal verification that completes in under 5ms is rejected as not having actually run.
Z3 satisfiability checks that involve real constraint solving take at least 8 to 15ms even for small problems. Lean 4 type-checking of a non-trivial proof takes at minimum 30 to 50ms. Any "verification" completing in 0 to 4ms is not running real solver logic. The 5ms threshold is a sanity check that the verification step actually executed, not a performance requirement.
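The timing guard can be sketched as a thin wrapper around whatever invokes the solver. Only the 5ms threshold comes from the text; `verifyWithTimingGuard`, `steppedClock`, and the result shape are hypothetical, and `run` is a placeholder for a real Z3 or Lean 4 call. The clock is injectable so the guard can be exercised deterministically.

```typescript
const MIN_PLAUSIBLE_MS = 5; // threshold stated in the text

interface TimedVerification {
  verified: boolean; // what the solver reported
  elapsedMs: number;
  accepted: boolean; // false if the run was too fast to be real
  reason: string;
}

// `run` stands in for a Z3 or Lean 4 invocation; `now` defaults to the
// wall clock but can be replaced in tests.
function verifyWithTimingGuard(
  run: () => boolean,
  now: () => number = Date.now
): TimedVerification {
  const start = now();
  const verified = run();
  const elapsedMs = now() - start;

  if (elapsedMs < MIN_PLAUSIBLE_MS) {
    return {
      verified,
      elapsedMs,
      accepted: false,
      reason: "completed in under 5ms; solver likely did not run",
    };
  }
  return {
    verified,
    elapsedMs,
    accepted: verified,
    reason: verified ? "verified" : "solver rejected the claim",
  };
}

// Test clock that advances by a fixed step on every reading.
function steppedClock(stepMs: number): () => number {
  let t = 0;
  return () => {
    const current = t;
    t += stepMs;
    return current;
  };
}

const suspiciousRun = verifyWithTimingGuard(() => true, steppedClock(2));
const plausibleRun = verifyWithTimingGuard(() => true, steppedClock(40));
```

Note that the guard rejects a too-fast run even when the solver reports success, which is the point: a result that arrives in 2ms is treated as evidence that nothing was checked.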
The Z3 and Lean 4 components handle different problem classes. Z3 handles satisfiability problems: given constraints, is there an assignment that satisfies them all? This covers discrete claims in combinatorics, bounded arithmetic, and finite graph theory. Lean 4 handles type-theory based proofs: given axioms and inference rules, does this proof term type-check? This covers continuous analysis, topology, and abstract algebra.
| Tool | Domain | Verification Type | Current Status |
|---|---|---|---|
| Z3 SMT | Combinatorics, bounded arithmetic, graphs | Satisfiability checking | Fully integrated |
| Lean 4 | Analysis, algebra, topology, number theory | Type-theoretic proof checking | Integrated, sorries remain |
| Lean Mathlib | Standard mathematical library | Theorem reference + reuse | Available as dependency |
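For readers unfamiliar with the "sorries" mentioned in the table, the toy Lean 4 example below contrasts a closed proof with an open one. It is unrelated to the Yang-Mills formalisation and is included only to show what a `sorry` placeholder is; the theorem names are made up.

```lean
-- A closed proof: Lean 4 accepts this with no warnings.
theorem add_comm_closed (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b

-- An open proof obligation: the file still compiles, but Lean warns that
-- the declaration uses 'sorry', so the statement is not actually proved.
theorem add_comm_open (a b : Nat) : a + b = b + a := by
  sorry
```

A formalisation with remaining `sorry` placeholders therefore type-checks only conditionally: each placeholder marks a step that Lean has been told to take on trust.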
System-Wide Statistics
The 34-domain coverage reflects an important design choice: the 5-tier architecture is domain-agnostic. The same pipeline that generates Yang-Mills lemmas generates Alzheimer's protein folding hypotheses. The same Skeptic Agent that attacks mathematical circularity attacks unsupported causal claims in biomedical research. The same Lean 4 formaliser that attempts to type-check mass gap proofs attempts to type-check statistical independence claims.
This universality is not just convenient; it is architecturally load-bearing. The cross-domain DNA transfer that produces results like the Wright-Fisher / SGD equivalence requires the system to be operating simultaneously in multiple domains, with a shared representation of hypothesis structure that can transfer between them. A single-domain system cannot produce cross-domain bridges.
Why the Recursive Tree Structure Matters
The recursive proof tree is not just an implementation detail; it is the mechanism that makes the system's output more than a collection of isolated claims. In a flat discovery system, each discovery is independent. In a tree-structured system, discoveries support each other: a validated lemma can be cited by higher-level claims, which reduces the validation burden for those higher-level claims because their sub-lemmas are already verified.
"The 12,000 discoveries are not 12,000 independent claims. They are a densely interconnected graph of supporting lemmas, each validated in context of the others."
This matters enormously for the system's ultimate mission: producing one peer-review-ready discovery. A flat corpus of 12,000 unrelated claims has no path to that goal. A tree-structured corpus where 12,000 lemmas are connected to the proof trees of 34 target problems has a direct path: follow the proof tree for the most advanced problem, identify the remaining gaps, fill them using genetic evolution, and the result is a complete proof supported by a verified sub-lemma corpus.
The 959 discoveries in the deep pipeline are the ones that are active nodes in the proof trees of the most advanced target problems. They have passed Tiers 1 through 3 and are awaiting formal verification in Tier 5. When the Lean 4 formaliser closes the remaining sorry statements in the Yang-Mills proof, the supporting corpus will already be present. The architecture was designed from the beginning for this moment.