Evolution is not a metaphor for this research system; it is the actual architecture. The EvolutionaryResearchOrchestrator applies the core mechanisms of biological natural selection to scientific hypothesis generation: variation (many hypotheses), selection (computational and multi-agent scoring), inheritance (successful hypotheses seed the next generation), and mutation (deliberate variation of surviving hypotheses). Dead ends accumulate as anti-patterns. Anti-patterns feed the next generation. The search space shrinks as failures build.
This article examines the full architecture, the CLI interface, the available sub-problems sorted by feasibility, and why failure learning is the most important component of the system.
The Evolution Cycle
EvolutionaryResearchOrchestrator: Full Cycle
══════════════════════════════════════════════════════════════
ANALYZE → Deep reasoning: identify attack angles (generation 0)
   ↓
SEED    → Generate population of hypotheses from angles
   ↓
VERIFY  → ComputationalVerifier: score each hypothesis
   ↓
DEBATE  → Multi-agent critique (4 agents, optional)
   │   ├── PROPONENT: argue for hypothesis
   │   ├── SKEPTIC: identify weaknesses
   │   ├── IMPROVER: suggest enhancements
   │   └── VISIONARY: extrapolate implications
   ↓
SELECT  → Survival of the fittest (by fitness score)
   ↓
MUTATE  → Deliberate variation of survivors
   ↓
CROSS   → Cross-breeding of compatible survivors
   ↓
TRACK   → LineageTracker: phylogenetic tree of all hypotheses
   ↓
REPORT  → Generate detailed report + fitness history
   ↓
LOOP    → Next generation (repeat from SEED with survivors)
The 4-agent debate (PROPONENT, SKEPTIC, IMPROVER, VISIONARY) is optional and activated with --debate. When enabled, each hypothesis that survives the verification pass is subjected to structured critique. The PROPONENT makes the strongest case for the hypothesis, finding the best evidence for it. The SKEPTIC identifies the most serious weaknesses: logical gaps, missing evidence, alternative explanations. The IMPROVER synthesizes the PROPONENT's and SKEPTIC's views to suggest a modified hypothesis that preserves the strengths and addresses the weaknesses. The VISIONARY looks past the hypothesis itself and asks what it implies about unsolved related problems; sometimes the most valuable output is the question a hypothesis raises rather than the hypothesis itself.
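The sketch below shows one way the four verdicts could be folded into a single consensus value. Everything in it is an assumption made for illustration: the role task strings paraphrase the description above, and the `support`, `revision`, and `questions` fields, the `summarizeDebate` helper, and the use of an unweighted mean are hypothetical. The article does not specify how MultiAgentCritique aggregates its agents.

```javascript
// Hypothetical role table: the task strings paraphrase the article, not real prompts.
const DEBATE_ROLES = [
  { name: 'PROPONENT', task: 'Make the strongest case for the hypothesis.' },
  { name: 'SKEPTIC', task: 'Identify logical gaps, missing evidence, and alternative explanations.' },
  { name: 'IMPROVER', task: 'Propose a revision that keeps the strengths and fixes the weaknesses.' },
  { name: 'VISIONARY', task: 'Ask what the hypothesis implies for related unsolved problems.' },
];

// Assumed verdict shape: { support: number in [0, 1], revision?: string, questions?: string[] }.
function summarizeDebate(verdicts) {
  const missing = DEBATE_ROLES.map(r => r.name).filter(name => !(name in verdicts));
  if (missing.length > 0) {
    throw new Error(`Missing verdicts from: ${missing.join(', ')}`);
  }
  const scores = DEBATE_ROLES.map(r => verdicts[r.name].support);
  // Assumption: consensus is the unweighted mean of the four support scores.
  const consensus = scores.reduce((sum, s) => sum + s, 0) / scores.length;
  return {
    consensus,
    revision: verdicts.IMPROVER.revision ?? null,
    openQuestions: verdicts.VISIONARY.questions ?? [],
  };
}

const summary = summarizeDebate({
  PROPONENT: { support: 1 },
  SKEPTIC: { support: 0.25 },
  IMPROVER: { support: 0.75, revision: 'Restrict the claim to rank-1 curves.' },
  VISIONARY: { support: 0.5, questions: ['Does the bound extend to rank 2?'] },
});
console.log(summary.consensus); // 0.625
```

Keeping the IMPROVER's revision and the VISIONARY's questions alongside the score means a hypothesis that loses the debate can still contribute material to later generations.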
Architecture: Six Components
EvolutionaryResearchOrchestrator
├── DeepReasoningEngine → Identifies attack angles (gen 0)
├── AIHypothesisGenerator → Creates hypotheses from angles
├── ComputationalVerifier → Scores each hypothesis
├── MultiAgentCritique → 4-agent debate (optional)
│   ├── PROPONENT
│   ├── SKEPTIC
│   ├── IMPROVER
│   └── VISIONARY
├── LineageTracker → Phylogenetic tree of all hypotheses
└── FailureLearner → Anti-patterns fed forward
The LineageTracker is architecturally important: it maintains the full phylogenetic tree of all hypotheses across all generations. Every hypothesis knows its parents (the hypotheses it was mutated or cross-bred from), its generation, its fitness score at each generation, and whether it survived to reproduce. This tree provides the data for evolutionary analysis: which attack angles consistently produce fit hypotheses? Which mutations reliably improve fitness? Which cross-breedings produce novel high-fitness offspring that neither parent could have generated?
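A minimal sketch of such a tracker follows. The record fields mirror the ones listed above, but the class body, the method names (`register`, `recordFitness`, `markSurvived`, `ancestors`, `survivalRateByAngle`), and the idea of storing an `angle` label on each hypothesis are assumptions for illustration, not the system's actual implementation.

```javascript
class LineageTracker {
  constructor() {
    this.nodes = new Map(); // hypothesis id -> lineage record
  }

  register(id, { parents = [], generation, angle = null }) {
    for (const parentId of parents) this.get(parentId); // throws on unknown parents
    this.nodes.set(id, {
      id, parents, generation, angle,
      fitnessByGeneration: new Map(),
      survived: false,
    });
  }

  get(id) {
    const node = this.nodes.get(id);
    if (!node) throw new Error(`Unknown hypothesis: ${id}`);
    return node;
  }

  recordFitness(id, generation, fitness) {
    this.get(id).fitnessByGeneration.set(generation, fitness);
  }

  markSurvived(id) {
    this.get(id).survived = true;
  }

  // All ancestors of a hypothesis, found by walking parent links.
  ancestors(id) {
    const seen = new Set();
    const stack = [...this.get(id).parents];
    while (stack.length > 0) {
      const current = stack.pop();
      if (seen.has(current)) continue;
      seen.add(current);
      stack.push(...this.get(current).parents);
    }
    return seen;
  }

  // Fraction of hypotheses per attack angle that survived to reproduce.
  survivalRateByAngle() {
    const totals = new Map();
    for (const { angle, survived } of this.nodes.values()) {
      const t = totals.get(angle) ?? { count: 0, survived: 0 };
      t.count += 1;
      if (survived) t.survived += 1;
      totals.set(angle, t);
    }
    return Object.fromEntries([...totals].map(([angle, t]) => [angle, t.survived / t.count]));
  }
}

const tracker = new LineageTracker();
tracker.register('h1', { generation: 0, angle: 'spectral' });
tracker.register('h2', { generation: 0, angle: 'analytic' });
tracker.register('h3', { parents: ['h1', 'h2'], generation: 1, angle: 'spectral' });
tracker.recordFitness('h3', 1, 0.72);
tracker.markSurvived('h1');
tracker.markSurvived('h3');
console.log([...tracker.ancestors('h3')].sort()); // [ 'h1', 'h2' ]
console.log(tracker.survivalRateByAngle());       // { spectral: 1, analytic: 0 }
```

Storing parents as a list rather than a single reference lets the same structure cover both mutation (one parent) and cross-breeding (two parents).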
CLI Interface
# Run specific sub-problem
node scripts/run-evolutionary-research.js \
  --problem bsd-conjecture \
  --subproblem sha-finiteness \
  --generations 10 \
  --population 5

# Fast run (no multi-agent debate)
node scripts/run-evolutionary-research.js \
  --problem alzheimers --subproblem biomarker-panel \
  --generations 20 --population 5 --no-debate
The --generations and --population flags control the evolutionary budget. More generations allow the evolution to converge on stronger hypotheses but require more computational time. Larger populations create more diversity but also require more scoring and debate time per generation. The default settings (10 generations, population 5) are calibrated for overnight autonomous runs by KAALI: enough evolution to produce meaningful results, completing within a reasonable time window.
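To make the trade-off concrete, the sketch below gives a rough call count for a run. It rests on assumptions the article does not state: the population is held constant across generations, the verifier is called once per hypothesis per generation, and each debated hypothesis costs one call per agent. The `estimateBudget` helper and its `passRate` parameter (the fraction of hypotheses that reach the debate stage) are hypothetical.

```javascript
// Rough upper-bound call count for one run, under the assumptions stated above.
function estimateBudget({ generations, population, debate = true, agents = 4, passRate = 1 }) {
  const verifierCalls = generations * population;
  // Only hypotheses that pass verification are debated; passRate = 1 is the worst case.
  const debateCalls = debate ? Math.ceil(verifierCalls * passRate) * agents : 0;
  return { verifierCalls, debateCalls, totalCalls: verifierCalls + debateCalls };
}

console.log(estimateBudget({ generations: 10, population: 5 }));
// { verifierCalls: 50, debateCalls: 200, totalCalls: 250 }
console.log(estimateBudget({ generations: 20, population: 5, debate: false }));
// { verifierCalls: 100, debateCalls: 0, totalCalls: 100 }
```

Under these assumptions a debated run at the default settings makes more calls than a no-debate run at twice the generations, which is consistent with treating the no-debate mode as the fast path.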
Available Sub-Problems: Sorted by Feasibility
| Problem | Sub-Problem | Feasibility | Description |
|---|---|---|---|
| alzheimers | biomarker-panel | 60% | Blood-based early detection panel |
| bsd-conjecture | rmt-verification | 50% | Random Matrix Theory verification |
| bsd-conjecture | bsd-formula-extension | 45% | Full BSD formula for curve families |
| alzheimers | multi-target-protocol | 45% | Multi-target therapeutic protocol |
| alzheimers | trem2-agonism | 40% | TREM2 microglial reprogramming |
| collatz | cycle-exclusion | 40% | Prove no non-trivial cycles |
| bsd-conjecture | sha-finiteness | 35% | Tate-Shafarevich finiteness |
| navier-stokes | liouville-ancient-solutions | 35% | Liouville theorems for ancient solutions |
| riemann-hypothesis | zero-free-extension | 30% | Extend the zero-free region |
| p-vs-np | barrier-avoiding-bounds | 30% | Circuit bounds avoiding relativization barriers |
The feasibility scores are calibrated estimates, not guarantees. A 60% feasibility score for the Alzheimer's biomarker panel means the team estimates a 60% probability that evolutionary research will produce a hypothesis novel enough and well-supported enough to merit publication or patent filing. It does not mean the problem is easy; it means the attack space is well-defined enough that evolutionary exploration is likely to find productive territory.
The 30% feasibility scores for Riemann zero-free extension and P vs NP circuit bounds reflect the extreme difficulty of these problems. Evolutionary research is not expected to produce complete proofs in these areas; it is expected to produce novel attack angles and partial results that advance the state of knowledge, even without a complete solution. At 30% feasibility, a publishable partial result is the realistic target.
What One Generation Looks Like in Code
Each generation passes every hypothesis through the VERIFY → DEBATE → SELECT stages of the pipeline (ANALYZE runs only in generation 0). Failure learning is built into the mutation and crossover steps, where anti-patterns from dead ends are fed forward as generation-level context:
// Excerpt: constructor wiring (verifier, multiAgentCritique, generator, config) omitted.
class EvolutionaryResearchOrchestrator {
  async runGeneration(genNumber, population, deadEnds) {
    const survivors = [];
    const newDeadEnds = [];

    for (const hypothesis of population) {
      // VERIFY: score the hypothesis against real data sources
      const score = await this.verifier.verify(hypothesis);
      if (score < this.config.selectionThreshold) {
        newDeadEnds.push({
          hypothesis,
          score,
          reason: await this.verifier.getFailureReason(hypothesis),
          generation: genNumber,
        });
        continue;
      }

      // DEBATE: 4-agent critique (optional, enabled for hard problems)
      if (this.config.enableDebate) {
        const critique = await this.multiAgentCritique.evaluate(hypothesis);
        if (critique.consensus < 0.6) {
          newDeadEnds.push({
            hypothesis,
            score,
            reason: 'Debate consensus too low',
            generation: genNumber,
          });
          continue;
        }
      }

      survivors.push({ hypothesis, score });
    }

    // SELECT: keep the top half of the population by score (at least one slot),
    // sorting a copy so the survivors array is not mutated in place
    const keep = Math.max(1, Math.floor(this.config.populationSize / 2));
    const selected = [...survivors].sort((a, b) => b.score - a.score).slice(0, keep);

    // Failure learning uses every dead end seen so far, not only this generation's
    const allDeadEnds = [...deadEnds, ...newDeadEnds];
    const nextGen = await this.generateNextGeneration(selected, allDeadEnds);
    return { survivors: selected, deadEnds: allDeadEnds, nextPopulation: nextGen };
  }

  // Failure learning: anti-patterns are passed to the generator as context
  async generateNextGeneration(survivors, accumulatedDeadEnds) {
    // Deduplicate reasons so a repeated failure is listed once
    const antiPatterns = [...new Set(accumulatedDeadEnds.map(d => d.reason))];
    const options = { avoidPatterns: antiPatterns };
    return Promise.all(survivors.flatMap((parent, i) => {
      // Pick a random mate other than the parent itself when one is available
      const others = survivors.filter((_, j) => j !== i);
      const pool = others.length > 0 ? others : survivors;
      const mate = pool[Math.floor(Math.random() * pool.length)];
      return [
        this.generator.mutate(parent.hypothesis, options),
        this.generator.crossover(parent.hypothesis, mate.hypothesis, options),
      ];
    }));
  }
}
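The LOOP step that chains generations together is not shown above. A minimal sketch of it, assuming only the `runGeneration(genNumber, population, deadEnds)` interface and its `{ survivors, deadEnds, nextPopulation }` return shape, could look like this. The `evolve` function and the `stubOrchestrator` used to exercise it are hypothetical; the stub stands in for the real verifier, debate, and generator.

```javascript
// Drives several generations, carrying dead ends forward between them.
async function evolve(orchestrator, seedPopulation, generations) {
  let population = seedPopulation;
  let deadEnds = [];
  const bestByGeneration = [];
  for (let gen = 1; gen <= generations; gen++) {
    const result = await orchestrator.runGeneration(gen, population, deadEnds);
    deadEnds = result.deadEnds; // already includes earlier generations
    bestByGeneration.push(result.survivors[0]?.score ?? null);
    if (result.nextPopulation.length === 0) break; // extinction: nothing left to evolve
    population = result.nextPopulation;
  }
  return { population, deadEnds, bestByGeneration };
}

// Stub for illustration only: first hypothesis survives, last one becomes a dead end.
const stubOrchestrator = {
  async runGeneration(gen, population, deadEnds) {
    return {
      survivors: [{ hypothesis: population[0], score: 0.5 + gen / 10 }],
      deadEnds: [...deadEnds, {
        hypothesis: population[population.length - 1],
        reason: `fail-${gen}`,
        generation: gen,
      }],
      nextPopulation: population.map(h => `${h}'`),
    };
  },
};

evolve(stubOrchestrator, ['a', 'b'], 3).then(({ deadEnds, bestByGeneration }) => {
  console.log(deadEnds.length, bestByGeneration.length); // 3 3
});
```

The early exit on an empty next population matters because a generation in which nothing survives produces no parents to mutate or cross.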
Real Results Across All Active Research Tracks
| Problem | Sub-Problem | Start | Peak | Gens | Dead Ends | Status |
|---|---|---|---|---|---|---|
| Riemann Hypothesis | arithmetic-site | 69.3% | 97.2% | 7 seeded | 100+ | Strong evidence |
| Parkinson's Disease | cma-autophagy | 37.1% | 95.0% | 1 (gen 0) | 10 | Breakthrough |
| Alzheimer's | multi-target-protocol | 37.1% | 49.2% | 50 | 300 | Landscape mapped |
| Yang-Mills | mass-gap | 89.1%* | 90.8% | 40 | 100+ | Ceiling at ~91% |
| BSD Conjecture | sha-finiteness | 53.3% | 53.3% | 5 | 5 | Early stage |
| Navier-Stokes | liouville-ancient | n/a | 80% feasibility | n/a | 7 patterns | Synthesis viable |
| Collatz | tao-extension | n/a | Lean4 scaffold | n/a | n/a | Formalized |
Failure Learning: The Critical Component
The FailureLearner is the component that makes the evolutionary system genuinely learn rather than just sample. Every hypothesis that fails (every approach that does not survive the computational verification pass, every mutation that reduces fitness rather than increasing it) is analyzed for the pattern that caused its failure.
Anti-patterns extracted from failures are fed to the next generation's AIHypothesisGenerator as explicit constraints: "Do NOT generate hypotheses of type X." The generator uses these constraints to avoid hypothesis forms that have repeatedly failed, focusing exploration on unexplored territory rather than revisiting known dead ends. Examples of extracted anti-patterns:
- For BSD conjecture: "Hypotheses assuming finiteness of the Tate-Shafarevich group without providing a new approach to the rank distribution consistently fail verification."
- For Alzheimer's: "Biomarker panels based solely on amyloid-β concentration without inflammation markers show insufficient discriminative power in the verification benchmarks."
- For Collatz: "Approaches assuming ergodicity of the Collatz map without addressing the measure-theoretic subtleties consistently fail the proof structure verification."
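As a rough illustration of how dead ends could become generator constraints, the sketch below groups failures by their recorded reason and renders each group as a "Do NOT generate" line. The helper names `extractAntiPatterns` and `toConstraints`, the `minOccurrences` parameter, and the exact-string grouping are assumptions; the real FailureLearner presumably does richer pattern analysis than matching reason strings.

```javascript
// Collapse dead ends into anti-patterns, counting how often each failure reason recurs.
function extractAntiPatterns(deadEnds, minOccurrences = 1) {
  const counts = new Map();
  for (const { reason } of deadEnds) {
    const key = reason.trim();
    counts.set(key, (counts.get(key) ?? 0) + 1);
  }
  return [...counts.entries()]
    .filter(([, occurrences]) => occurrences >= minOccurrences)
    .sort((a, b) => b[1] - a[1]) // most frequent failures first
    .map(([pattern, occurrences]) => ({ pattern, occurrences }));
}

// Render anti-patterns as explicit constraints for the generator's context.
function toConstraints(antiPatterns) {
  return antiPatterns.map(({ pattern, occurrences }) =>
    `Do NOT generate hypotheses matching: ${pattern} (failed ${occurrences}x)`);
}

const antiPatterns = extractAntiPatterns([
  { reason: 'Assumes ergodicity without a measure-theoretic argument', generation: 1 },
  { reason: 'Debate consensus too low', generation: 1 },
  { reason: 'Assumes ergodicity without a measure-theoretic argument', generation: 2 },
]);
console.log(toConstraints(antiPatterns)[0]);
// Do NOT generate hypotheses matching: Assumes ergodicity without a measure-theoretic argument (failed 2x)
```

Ordering by frequency puts the most reliably fatal patterns at the top of the constraint list, where they are least likely to be truncated if the generator's context is limited.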
Reseeding with knowledge uses the accumulated failure knowledge plus higher creativity settings: the hypothesis generator is told to produce more speculative, less conventional hypotheses because all the conventional approaches are in the anti-pattern list. This is the evolutionary equivalent of a population that has exhausted the local fitness landscape and needs to explore a new region: the anti-pattern list forces the search away from the known local maxima.
Autonomous Mode: KAALI as Research Director
Autonomous mode is the fully operational state: 13 sub-problems are seeded in MongoDB, and KAALI picks up EVOLUTIONARY_RESEARCH tasks from the task queue by priority. Each task runs 10 generations; if the run has not converged (fitness has not plateaued and no publishable result has been found), the task auto-creates a follow-up task for the next 10 generations with updated failure knowledge. No human intervention is needed until a result meets publication criteria.
# Output artifacts per run
data/discoveries/fullexports/{problem}/
├── evolution_gen1_{timestamp}.json      # Generation 1 population + scores
├── evolution_gen2_{timestamp}.json      # Generation 2 population + scores
├── ...
├── evolution_gen10_{timestamp}.json     # Final generation
├── full_report_{timestamp}.md           # Narrative research report
├── computational_code_{timestamp}.js    # Verification code
├── validation_results_{timestamp}.json
├── proof_document_{timestamp}.md        # Formal proof attempt
├── logs_{timestamp}.txt                 # Full execution log
└── STATUS.md                            # Live fitness history + summary
The STATUS.md file is the live dashboard for autonomous research. It shows the fitness history of the current best hypothesis across all generations, the anti-pattern list accumulated so far, the current generation number, and the next scheduled run time. KAALI updates this file after each generation; researchers can monitor progress without interrupting the autonomous process.
The 10-Generation Convergence Criterion
The system is designed to run 10 generations per task as the default convergence window. If the best hypothesis fitness score has not improved by more than 5% in the last 3 generations, the system flags the run as "plateaued" and initiates reseeding with knowledge rather than continuing straight evolution. This prevents the system from spinning in the same local fitness maximum for 7 additional generations when it has clearly converged.
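The plateau rule can be sketched as a small check over the per-generation best fitness. This version assumes "improved by more than 5%" means relative improvement of the best score in the last 3 generations over the best score seen before them; the function name `hasPlateaued` and that interpretation are assumptions, since the article does not say whether the 5% is relative or absolute.

```javascript
// Returns true when the last `window` generations improved the best fitness
// by no more than `minGain` (relative) over the best seen before them.
function hasPlateaued(bestFitnessHistory, window = 3, minGain = 0.05) {
  if (bestFitnessHistory.length <= window) return false; // not enough history yet
  const baseline = Math.max(...bestFitnessHistory.slice(0, -window));
  const recentBest = Math.max(...bestFitnessHistory.slice(-window));
  if (baseline <= 0) return recentBest <= 0; // avoid dividing by zero
  return (recentBest - baseline) / baseline <= minGain;
}

console.log(hasPlateaued([0.5, 0.6, 0.61, 0.61, 0.62])); // true  (about 3% gain)
console.log(hasPlateaued([0.5, 0.6, 0.7, 0.8, 0.9]));    // false (50% gain)
console.log(hasPlateaued([0.5, 0.6]));                   // false (too little history)
```

Comparing against the best score before the window, rather than the score exactly three generations back, keeps a single noisy dip from hiding a plateau.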
The auto-follow-up mechanism: after the reseeding run's 10 generations complete, if the new population has produced a hypothesis with fitness above the "publishable" threshold (typically 0.80+), the task is marked complete and the result is routed to the publication pipeline. If not, a new follow-up task is created with the expanded anti-pattern list and further increased creativity settings. This allows the system to continue indefinitely on hard problems without human management.
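The decision at the end of a run can be sketched as a pure function from the finished task and its result to either a publication hand-off or a new task. The task fields (`id`, `antiPatterns`, `creativity`, `parentTaskId`), the result fields (`bestFitness`, `newAntiPatterns`), the `planFollowUp` name, and the creativity step size are all hypothetical; only the 0.80 threshold, the 10-generation window, and the EVOLUTIONARY_RESEARCH task type come from the article.

```javascript
// Decide what happens after a run: publish, or queue a follow-up with more failure knowledge.
function planFollowUp(task, runResult, options = {}) {
  const { publishableThreshold = 0.8, creativityStep = 0.1, maxCreativity = 1.0 } = options;
  if (runResult.bestFitness >= publishableThreshold) {
    return { action: 'publish', task: { ...task, status: 'complete' } };
  }
  return {
    action: 'follow-up',
    task: {
      type: 'EVOLUTIONARY_RESEARCH',
      problem: task.problem,
      subproblem: task.subproblem,
      generations: 10,
      // Expanded anti-pattern list: previous knowledge plus this run's failures, deduplicated.
      antiPatterns: [...new Set([...task.antiPatterns, ...runResult.newAntiPatterns])],
      // Further increased creativity, capped so it cannot grow without bound.
      creativity: Math.min(maxCreativity, task.creativity + creativityStep),
      parentTaskId: task.id,
    },
  };
}

const finished = {
  id: 't1', problem: 'bsd-conjecture', subproblem: 'sha-finiteness',
  antiPatterns: ['A'], creativity: 0.5,
};
const next = planFollowUp(finished, { bestFitness: 0.53, newAntiPatterns: ['A', 'B'] },
  { creativityStep: 0.25 });
console.log(next.action, next.task.antiPatterns, next.task.creativity);
// follow-up [ 'A', 'B' ] 0.75
```

Capping creativity is a design choice of this sketch: without a ceiling, an indefinitely repeating follow-up chain would push the setting past whatever range the generator accepts.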
Connection to the Discovery Engine
The EvolutionaryResearchOrchestrator is the most powerful tool in the discovery engine's toolkit, but it is not the only one. For simpler discovery tasks (finding unexpected connections between well-understood domains, identifying patterns in literature), the single-hypothesis approach (generate one strong hypothesis, verify it, publish or discard) is sufficient and faster. The evolutionary approach is reserved for hard problems where the search space is large, the answer is not obvious, and failure information is genuinely valuable for narrowing the search.
The Millennium Prize problems, certain Alzheimer's mechanism questions, and the Wright-Fisher ↔ SGD formalization are all in the evolutionary category. The connection discovery work (finding that a learning science paper's methodology applies to a materials science problem) is typically in the single-hypothesis category. The system chooses the appropriate tool based on the estimated search space size and the value of failure information for that problem class.