There is a fundamental tension at the heart of every AI system deployed for scientific work today. The training objective, predicting the next token from a corpus of human-approved text, is precisely the opposite of the scientific method. Text prediction optimises for plausibility. Science optimises for survival under adversarial attack. These are not the same thing, and conflating them produces a system that generates beautiful, confident, wrong hypotheses.
The Profiled autonomous discovery system was designed around this tension. Rather than treating it as a problem to paper over, the architecture embraces it: every hypothesis generated by the system must survive a gauntlet of increasingly hostile verification before it can be called a discovery. The goal is not to produce content that looks like science. The goal is to produce claims that behave like science when attacked.
"Text prediction + RLHF = human approval at generation time. Science requires the opposite: surviving adversarial attack."
The Core Philosophical Distinction
Consider what a large language model learns when trained on scientific papers. It learns the grammar of scientific writing: the structure of abstracts, the hedging language of results sections, the citation patterns of literature reviews. It learns what a hypothesis looks like. It does not learn whether hypotheses are true, because the training corpus includes an enormous volume of retracted papers, incorrect preprints, and plausible-sounding claims that failed to replicate.
RLHF compounds this problem. Human raters prefer confident, well-structured, internally consistent answers. They rate hedged, uncertain, technically dense responses lower, even when those responses are more accurate. The result is a model fine-tuned to produce the kind of answer a non-expert would find convincing at first glance. This is exactly the wrong optimisation target for discovery.
The discovery system separates generation from validation as completely as possible. The generator is permitted to be wrong. The validator's entire job is to find out how wrong, and on exactly which dimensions.
This separation is not novel in concept; the scientific method has always separated hypothesis generation from experimental testing. What is novel here is making it fully automated and adversarial, and running it across 19 domains simultaneously at a scale that produces 12,157 verified discoveries in under six months.
The 5-Tier Architecture
The system is structured as five sequential tiers, each with a specific responsibility in the pipeline from problem statement to verified discovery. A claim cannot skip tiers or move backward: the pipeline is strictly directional and strictly gated.
AUTONOMOUS DISCOVERY PIPELINE (tiers run top to bottom)

| Tier | Stage | Component | Details |
|---|---|---|---|
| TIER 1 | Gap Identification | TargetedDiscoveryIdentifier | Required lemmas per problem; difficulty + prerequisites; suggested approaches |
| TIER 2 | Discovery Gen | Literature-Guided Generation | arXiv cross-reference; PubMed / Scholar integration; gap-targeted synthesis |
| TIER 3 | Tree Expansion | Proof Tree Expansion | Recursive: new gaps from validated; lemma dependency resolution; cross-domain bridge detection |
| TIER 4 | Adversarial Audit | Skeptic Agent (Adversarial) | ~80% rejection rate; attacks every claim individually; returns specific failure modes |
| TIER 5 | Formal Proof | Formal Verification | Z3 SMT Solver; Lean 4 proof assistant; execution time ≥ 5ms enforced |
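The strictly directional, strictly gated behaviour can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the system's implementation: the `Claim` type, `run_pipeline`, and every gate function below are hypothetical stand-ins, and the toy Tier 4 rule exists only to show a rejection.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Claim:
    text: str
    cleared: List[str] = field(default_factory=list)  # tiers passed, in order


# A gate returns (passed, reason). All gates here are hypothetical placeholders.
Gate = Callable[[Claim], Tuple[bool, str]]


def run_pipeline(claim: Claim, tiers: List[Tuple[str, Gate]]) -> Tuple[bool, str]:
    """Run the tiers strictly in order and stop at the first failing gate."""
    for name, gate in tiers:
        passed, reason = gate(claim)
        if not passed:
            return False, f"rejected at {name}: {reason}"
        claim.cleared.append(name)
    return True, "cleared all tiers"


# Toy gates: only Tier 2 and Tier 4 can reject in this sketch.
TIERS: List[Tuple[str, Gate]] = [
    ("tier1_gap_identification", lambda c: (True, "")),
    ("tier2_generation", lambda c: (bool(c.text.strip()), "empty hypothesis")),
    ("tier3_tree_expansion", lambda c: (True, "")),
    ("tier4_adversarial_audit", lambda c: ("obviously" not in c.text, "unsupported step")),
    ("tier5_formal_verification", lambda c: (True, "")),
]

accepted = Claim("bounded lemma with explicit constants")
rejected = Claim("the bound is obviously tight")
accepted_result = run_pipeline(accepted, TIERS)
rejected_result = run_pipeline(rejected, TIERS)
```

Because the loop returns on the first failure, a claim can never reach a later tier without clearing every earlier one, and the rejection message names the tier and the reason.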
Tier 1: TargetedDiscoveryIdentifier
The first tier does something subtle that most discovery systems skip entirely: before generating anything, it identifies exactly which intermediate results are needed to address the target problem. For a problem like the Yang-Mills mass gap conjecture, this means enumerating the required lemmas, their difficulty, their prerequisites, and the available approaches for each.
This is not a search operation. It is a structured planning operation. The output is a gap tree: a precise statement of what the system does not yet know, organised by what it must know first. Generation in Tier 2 is targeted against these gaps, not against the problem as a whole.
For Yang-Mills Mass Gap, the identifier produces three required lemma classes: Gauge Field Quantization (difficulty: hard, prerequisites: Quantum Field Theory + Lie Groups), Mass Gap Existence Proof (difficulty: very-hard, prerequisites: Functional Analysis + Operator Theory), and Yang-Mills Equations Solutions (difficulty: medium, prerequisites: PDEs + Numerical Methods).
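As a rough illustration, the three lemma classes above could be held in a small data structure like the following. The `LemmaGap` type, the field names, and the easiest-first ordering are assumptions made for this sketch; the source describes the content of the identifier's output, not its format or how gaps are scheduled.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class LemmaGap:
    name: str
    difficulty: str                 # "medium", "hard", or "very-hard"
    prerequisites: Tuple[str, ...]


# The three lemma classes listed for the Yang-Mills mass gap problem.
YANG_MILLS_GAPS: List[LemmaGap] = [
    LemmaGap("Gauge Field Quantization", "hard",
             ("Quantum Field Theory", "Lie Groups")),
    LemmaGap("Mass Gap Existence Proof", "very-hard",
             ("Functional Analysis", "Operator Theory")),
    LemmaGap("Yang-Mills Equations Solutions", "medium",
             ("PDEs", "Numerical Methods")),
]

# Hypothetical ordering of the difficulty labels, used only for sorting.
DIFFICULTY_ORDER = {"medium": 0, "hard": 1, "very-hard": 2}


def easiest_first(gaps: List[LemmaGap]) -> List[LemmaGap]:
    """Return the gaps sorted from lowest to highest difficulty."""
    return sorted(gaps, key=lambda g: DIFFICULTY_ORDER[g.difficulty])


ordered_names = [g.name for g in easiest_first(YANG_MILLS_GAPS)]
```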
The 4-Dimensional Validation Scoring System
Every hypothesis that passes Tier 2 generation is scored across four independent dimensions before it can advance in the pipeline. The scoring is not holistic: each dimension is computed independently, and a hypothesis can pass three dimensions while failing catastrophically on one.
The four dimensions and what they actually measure:
Computational checks notation correctness, logical structure, bound validity, and dimensional analysis. This dimension is the easiest to satisfy: a well-formed mathematical argument with correct notation will typically score near 100%.
Literature cross-references claims against arXiv and PubMed. A discovery that contradicts established experimental results without explaining the contradiction will fail here. A discovery that is genuinely novel but consistent with the literature will pass.
Consistency is the bottleneck. It checks whether the hypothesis is internally consistent: whether the claims made in section three contradict claims made in section one, and whether the mathematical objects referenced in the proof exist in the way the proof assumes they do. This dimension plateaus at 83.3% and is the primary limiting factor across all domains.
Domain-specific checks compliance with the axiomatic foundations of the field: Yang-Mills gauge invariance requirements, Riemannian geometry axioms, or statistical mechanics thermodynamic constraints, depending on the domain.
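The difference between per-dimension gating and a holistic average can be sketched as follows. The function names, the 0.90 threshold, and three of the four scores are illustrative assumptions; only the 0.833 consistency figure comes from the text.

```python
from typing import Dict, List

DIMENSIONS = ("computational", "literature", "consistency", "domain")


def failed_dimensions(scores: Dict[str, float], threshold: float) -> List[str]:
    """Return every dimension whose score falls below the threshold."""
    missing = [d for d in DIMENSIONS if d not in scores]
    if missing:
        raise ValueError(f"missing dimension scores: {missing}")
    return [d for d in DIMENSIONS if scores[d] < threshold]


def passes(scores: Dict[str, float], threshold: float) -> bool:
    """A hypothesis advances only if no single dimension fails."""
    return not failed_dimensions(scores, threshold)


# Illustrative scores: strong everywhere except consistency.
example = {
    "computational": 1.00,
    "literature": 0.95,
    "consistency": 0.833,
    "domain": 0.97,
}
THRESHOLD = 0.90  # hypothetical per-dimension threshold

mean_score = sum(example.values()) / len(example)
failures = failed_dimensions(example, THRESHOLD)
advances = passes(example, THRESHOLD)
```

In this example the mean is above the threshold, so a holistic score would let the hypothesis through, while the per-dimension gate stops it on consistency alone.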
"Consistency is always the hardest gate. You can have perfect notation and full literature support, and still produce a hypothesis that contradicts itself three paragraphs apart."
Natural Language to Formal Verification
Tier 5 adds a capability that most computational discovery systems lack entirely: the ability to translate a natural language mathematical claim into a formally verified Lean 4 statement. This is the difference between a claim that is plausible and a claim that has been mechanically checked.
The translation pipeline uses a component called MathematicalFormalizer that operates on natural language mathematical statements and produces Lean 4 theorem declarations. The canonical example from the system's own documentation illustrates the gap between the two representations:
-- Natural language input:
-- "All non-trivial zeros of ζ lie on Re(s) = 1/2"
-- After MathematicalFormalizer:
theorem riemann_hypothesis :
  ∀ s : ℂ, zeta s = 0 → s.re ≠ 0 ∧ s.re ≠ 1 → s.re = 1/2
The formalization step does two things simultaneously. First, it forces precision: the natural language version of the Riemann Hypothesis is actually ambiguous about what "non-trivial" means, and the formal version must resolve that ambiguity explicitly. Second, it creates a target that the Lean 4 proof assistant can mechanically verify, removing any possibility of human-approval bias contaminating the result.
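One common way to make "non-trivial" explicit is to restrict attention to the open critical strip 0 < Re(s) < 1, which excludes the trivial zeros at the negative even integers. The sketch below states that version as a proposition rather than a theorem, so it needs no proof and no sorry. It assumes a recent Mathlib in which the zeta function is available as `riemannZeta`; the name `CriticalStripRH` is hypothetical and is not taken from the system's documentation.

```lean
import Mathlib

/-- Hypothetical statement: every zero of ζ inside the open critical strip
    has real part 1/2. This is a proposition, not a proved theorem. -/
def CriticalStripRH : Prop :=
  ∀ s : ℂ, riemannZeta s = 0 → 0 < s.re → s.re < 1 → s.re = 1 / 2
```

Stating the hypotheses as separate implications keeps each assumption visible to a reader and to any downstream tool that inspects the statement.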
The Lean 4 formalizer can produce correct theorem statements from natural language inputs. Generating complete Lean 4 proofs, not just statements, remains the system's primary open research problem. The Yang-Mills formalisation contains 21 sorry placeholders marking unproven steps. Each is tracked explicitly as an UNVERIFIED_CLAIM, not hidden.
The Z3 SMT solver integration handles a different class of problems: discrete and bounded verification tasks where exhaustive checking is feasible. For claims in combinatorics, graph theory, and bounded arithmetic, Z3 provides complete mechanical verification. For continuous analysis claims, Lean 4 remains the target, with sorry markers as honest placeholders.
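To make "exhaustive checking is feasible" concrete, here is a small bounded check written in plain Python. It is a brute-force enumeration used as a stand-in, not Z3 and not the system's code: it verifies, for every simple graph on up to four labelled vertices, the standard fact that the number of odd-degree vertices is even. The function names are hypothetical.

```python
from itertools import combinations
from typing import Iterator, List, Tuple

Edge = Tuple[int, int]


def all_graphs(n: int) -> Iterator[List[Edge]]:
    """Yield every simple undirected graph on n labelled vertices as an edge list."""
    possible = list(combinations(range(n), 2))
    for mask in range(1 << len(possible)):
        yield [e for i, e in enumerate(possible) if (mask >> i) & 1]


def odd_degree_count_is_even(n: int, edges: List[Edge]) -> bool:
    """Check that the number of odd-degree vertices is even."""
    degree = [0] * n
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    return sum(d % 2 for d in degree) % 2 == 0


def verify_up_to(max_n: int) -> int:
    """Check every graph with 1..max_n vertices; return how many were checked."""
    checked = 0
    for n in range(1, max_n + 1):
        for edges in all_graphs(n):
            if not odd_degree_count_is_even(n, edges):
                raise AssertionError(f"counterexample on {n} vertices: {edges}")
            checked += 1
    return checked


graphs_checked = verify_up_to(4)  # 1 + 2 + 8 + 64 graphs
```

The result is complete only within the stated bound. The number of graphs grows as 2^(n(n-1)/2), which is why this kind of enumeration stops being practical quickly and why a solver or a proof assistant is needed for larger or unbounded claims.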
Cross-Domain DNA Transfer: The Wright-Fisher / SGD Discovery
One of the most striking results the system has produced was not the result of targeted search within a single domain. It emerged from a process called DNA transfer: applying the structural pattern of a successful hypothesis in one domain to an apparently unrelated domain.
The Wright-Fisher / SGD equivalence was discovered with a validation score of 0.9525. The claim:
"The population size N at which genetic drift no longer dominates adaptive walks in a Wright-Fisher model with constant selection s coincides (up to logarithmic corrections) with the critical batch size B_c above which SGD converges to flat loss minima rather than sharp ones."
This is a genuinely novel cross-domain bridge. Population genetics (Wright-Fisher drift theory) and deep learning optimisation (SGD batch size dynamics) are not typically studied together. The mathematical structure that makes them equivalent, the transition from noise-dominated to signal-dominated dynamics at a critical scale parameter, was identified by the system's Tier 3 tree expansion, which was exploring Collatz Markov chain properties at the time.
The Wright-Fisher / SGD equivalence is currently the highest-scoring cross-domain bridge in the system's discovery corpus. It represents the kind of result that the discovery architecture was specifically designed to produce: a connection that a domain expert in either field would likely never find because they would not think to look across the domain boundary.
The RSI (Recursive Self-Improvement) loop is directly responsible for enabling this class of discovery. Once the system identifies a pattern that produces high-scoring hypotheses in domain A, it stores the structural DNA of that pattern and attempts to instantiate it in domain B. The Riemann → Yang-Mills pattern transfer is the most developed example of this: the system identified a structural similarity between the distribution of Riemann zeta zeros and the mass gap energy spectrum, and generated hypotheses about Yang-Mills guided by the Riemann exploration strategy.
Honest Self-Assessment as an Architectural Choice
Perhaps the most unusual feature of the system is not technical; it is epistemic. The system generates STATUS.md files alongside every major research output. These files contain explicit, machine-generated labels that categorise the confidence level of every claim.
| Label | Meaning | When Applied |
|---|---|---|
| UNVERIFIED_CLAIM | Generated but not validated by any tier | Raw Tier 2 output |
| COMPUTATIONAL_EVIDENCE | Passes Computational + Domain dimensions | After Tier 3 gate |
| STRONG_COMPUTATIONAL_EVIDENCE | Passes all 4 dimensions at threshold | After Tier 4 adversarial audit |
| FORMALLY_VERIFIED | Z3 or Lean 4 proof complete, zero sorry | After Tier 5 completion |
This labeling system runs automatically. No human reviews the labels. The system applies them based on which validation tiers a discovery has cleared. The purpose is to make the epistemic status of every claim legible to any downstream system or human reader without requiring them to dig into the validation pipeline.
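The tier-to-label mapping in the table can be written as a small pure function. This is a sketch under assumptions: the `Label` enum, `assign_label`, and its parameters are hypothetical names, and treating a Tier 5 attempt that still contains sorry placeholders as STRONG_COMPUTATIONAL_EVIDENCE is an inference from the Yang-Mills example rather than a documented rule.

```python
from enum import Enum


class Label(Enum):
    UNVERIFIED_CLAIM = "UNVERIFIED_CLAIM"
    COMPUTATIONAL_EVIDENCE = "COMPUTATIONAL_EVIDENCE"
    STRONG_COMPUTATIONAL_EVIDENCE = "STRONG_COMPUTATIONAL_EVIDENCE"
    FORMALLY_VERIFIED = "FORMALLY_VERIFIED"


def assign_label(highest_tier_cleared: int, sorry_count: int = 0) -> Label:
    """Map the highest tier a claim has cleared to its status label."""
    if sorry_count < 0:
        raise ValueError("sorry_count must be non-negative")
    # FORMALLY_VERIFIED requires a complete proof with zero sorry placeholders.
    if highest_tier_cleared >= 5 and sorry_count == 0:
        return Label.FORMALLY_VERIFIED
    if highest_tier_cleared >= 4:
        return Label.STRONG_COMPUTATIONAL_EVIDENCE
    if highest_tier_cleared >= 3:
        return Label.COMPUTATIONAL_EVIDENCE
    return Label.UNVERIFIED_CLAIM


raw_output = assign_label(2)
partially_formalised = assign_label(4, sorry_count=21)
fully_proved = assign_label(5, sorry_count=0)
```

Keeping the mapping a pure function of pipeline state is what lets the labels be applied without human review: the same inputs always produce the same label.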
The Yang-Mills preprint on Zenodo (DOI: 10.5281/zenodo.19432415) carries the label STRONG_COMPUTATIONAL_EVIDENCE with an explicit note that the Lean 4 formalisation contains 21 sorry placeholders. The system's own honest categorisation of its Yang-Mills work is: "Physics plausibility argument with partial Lean 4 formalisation. NOT a Millennium Prize solution." This categorisation was generated autonomously.
"The system's honest self-assessment is not a feature added after the fact. It is load-bearing. Without it, the validation pipeline has no way to communicate its own limitations to the outside world."
What 12,157 Discoveries Actually Means
The number 12,157 requires context. Not all of these are discoveries in the sense of novel scientific results that would survive peer review. The corpus is stratified:
The large number reflects the recursive tree expansion in Tier 3. Each validated discovery potentially opens new gaps in the proof tree, which triggers new generation cycles. The system is not attempting to produce 12,000 peer-review-ready papers; it is building a knowledge graph dense enough that when a claim in the deep pipeline requires a supporting lemma, that lemma already exists somewhere in the corpus.
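The growth mechanism described here is essentially a worklist loop, which the following sketch illustrates. Everything in it is a hypothetical stand-in: the `expand` function, its callback parameters, the cycle cap, and the toy dependency map are assumptions made for illustration only.

```python
from collections import deque
from typing import Callable, Dict, Iterable, List


def expand(
    root_gaps: Iterable[str],
    generate: Callable[[str], str],
    validate: Callable[[str], bool],
    open_gaps: Callable[[str], List[str]],
    max_cycles: int = 1000,
) -> Dict[str, str]:
    """Fill gaps breadth-first; each validated result may open new gaps."""
    corpus: Dict[str, str] = {}
    queue = deque(root_gaps)
    cycles = 0
    while queue and cycles < max_cycles:
        gap = queue.popleft()
        cycles += 1
        if gap in corpus:
            continue  # the supporting lemma already exists in the corpus
        candidate = generate(gap)
        if not validate(candidate):
            continue  # rejected candidates open no new gaps
        corpus[gap] = candidate
        queue.extend(open_gaps(candidate))
    return corpus


# Toy dependency map: validating "A" opens gaps "B" and "C", and so on.
DEPENDS = {"A": ["B", "C"], "B": ["C"], "C": []}

corpus = expand(
    root_gaps=["A"],
    generate=lambda gap: gap,          # the candidate is just the gap name
    validate=lambda candidate: True,   # accept everything in this toy run
    open_gaps=lambda candidate: DEPENDS[candidate],
)
filled_order = list(corpus)
```

The `if gap in corpus` check is the reuse step: a lemma requested twice is generated once, which is how a dense corpus reduces the work of later cycles.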
The mission statement from the project's own documentation is clear about this: the target is not 12,000 discoveries, but one scientifically irrefutable, peer-review-ready discovery. The 12,000 are scaffolding.
The consistency dimension ceiling at 83.3% is the primary obstacle to reaching that target. Hitting 95% overall validation, the threshold the team set for "publication-ready", requires pushing the consistency score to approximately 87%. The current gap is 3.7 percentage points (87% minus 83.3%). This does not sound large. In practice, it represents the difference between a formally self-consistent argument and a plausibility argument with good notation.
Why This Architecture Matters Beyond the Discoveries
The architectural choices made in this system (adversarial Tier 4 validation, honest status labeling, formal verification targets, DNA-based cross-domain transfer) are not just implementation details. They represent a position on a fundamental question: what should it mean for an AI system to produce knowledge?
The dominant answer in the current AI landscape is: produce output that human experts find plausible. The Profiled architecture stakes out a different answer: produce output that survives being attacked by a hostile agent with the same capabilities as the generating agent. The Skeptic in Tier 4 knows everything the Generator knows. It cannot be impressed by confident tone or good notation. It can only evaluate whether the logical structure holds.
This is closer to the actual structure of scientific progress than anything achievable through RLHF-optimised generation alone. Science advances through the mechanism of other scientists trying to destroy your result. The Tier 4 Skeptic Agent is that mechanism, running autonomously at scale.
The 83.3% consistency ceiling is not a failure. It is the system being honest about where the hardest unsolved problem in automated scientific reasoning actually lives: not in notation, not in literature coverage, but in guaranteeing that a complex argument does not contradict itself. Solving that problem, truly solving it and not papering over it, is the next frontier.
"The goal is not to produce content that looks like science. The goal is to produce claims that behave like science when attacked."