Evolutionary Research Orchestrator: Natural Selection for Hypotheses

Evolution is not a metaphor for this research system — it is the actual architecture. The EvolutionaryResearchOrchestrator applies the core mechanisms of biological natural selection to scientific hypothesis generation: variation (many hypotheses), selection (computational and multi-agent scoring), inheritance (successful hypotheses seed the next generation), and mutation (deliberate variation of surviving hypotheses). Dead ends accumulate as anti-patterns. Anti-patterns feed the next generation. The search space shrinks as failures build.

This article examines the full architecture, the command-line interface, the available sub-problems sorted by feasibility, and why failure learning is the most important component of the system.

The Evolution Cycle

EvolutionaryResearchOrchestrator — Full Cycle
──────────────────────────────────────────────────────────────
ANALYZE → Deep reasoning: identify attack angles (generation 0)
   │
SEED → Generate population of hypotheses from angles
   │
VERIFY → ComputationalVerifier: score each hypothesis
   │
DEBATE → Multi-agent critique (4 agents, optional)
   │       ├── PROPONENT: argue for hypothesis
   │       ├── SKEPTIC: identify weaknesses
   │       ├── IMPROVER: suggest enhancements
   │       └── VISIONARY: extrapolate implications
   │
SELECT → Survival of the fittest (by fitness score)
   │
MUTATE → Deliberate variation of survivors
   │
CROSS → Cross-breeding of compatible survivors
   │
TRACK → LineageTracker: phylogenetic tree of all hypotheses
   │
REPORT → Generate detailed report + fitness history
   │
LOOP → Next generation (repeat from SEED with survivors)

The 4-agent debate (PROPONENT, SKEPTIC, IMPROVER, VISIONARY) is optional and activated with --debate. When enabled, each hypothesis that survives the verification pass is subjected to structured critique. The PROPONENT makes the strongest possible case for the hypothesis, gathering the best evidence for it. The SKEPTIC identifies its most serious weaknesses: logical gaps, missing evidence, alternative explanations. The IMPROVER synthesizes the PROPONENT's and SKEPTIC's views to suggest a modified hypothesis that preserves the strengths and addresses the weaknesses. The VISIONARY looks beyond the hypothesis itself and asks what it implies about related unsolved problems; sometimes the most valuable output is the question a hypothesis raises rather than the hypothesis itself.
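The data flow between the four agents can be sketched as follows. This is a minimal illustration, not the actual MultiAgentCritique implementation: `runDebate`, the `agents` object, and the string outputs are all hypothetical, and real agents would presumably be asynchronous model calls rather than synchronous functions.

```javascript
// Hypothetical sketch: agent functions are injected stand-ins,
// not the real MultiAgentCritique API.
const ROLES = ['PROPONENT', 'SKEPTIC', 'IMPROVER', 'VISIONARY'];

function runDebate(hypothesis, agents) {
  for (const role of ROLES) {
    if (typeof agents[role] !== 'function') {
      throw new Error(`Missing agent for role ${role}`);
    }
  }
  // PROPONENT and SKEPTIC each see only the hypothesis.
  const proponent = agents.PROPONENT(hypothesis);
  const skeptic = agents.SKEPTIC(hypothesis);
  // IMPROVER receives both earlier views so it can keep the strengths
  // and address the weaknesses.
  const improver = agents.IMPROVER(hypothesis, { proponent, skeptic });
  // VISIONARY asks what the hypothesis implies for related open problems.
  const visionary = agents.VISIONARY(hypothesis);
  return { hypothesis, proponent, skeptic, improver, visionary };
}

// Usage with trivial stand-in agents.
const debateTranscript = runDebate('H1', {
  PROPONENT: (h) => `strongest case for ${h}`,
  SKEPTIC: (h) => `main weakness of ${h}`,
  IMPROVER: (h, views) => `revise ${h} given: ${views.proponent} / ${views.skeptic}`,
  VISIONARY: (h) => `open question raised by ${h}`,
});
```

The ordering matters only for the IMPROVER, which depends on the two earlier outputs; the other three calls could run in parallel.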

Architecture: Six Components

EvolutionaryResearchOrchestrator
├── DeepReasoningEngine    → Identifies attack angles (gen 0)
├── AIHypothesisGenerator  → Creates hypotheses from angles
├── ComputationalVerifier  → Scores each hypothesis
├── MultiAgentCritique     → 4-agent debate (optional)
│   ├── PROPONENT
│   ├── SKEPTIC
│   ├── IMPROVER
│   └── VISIONARY
├── LineageTracker         → Phylogenetic tree of all hypotheses
└── FailureLearner         → Anti-patterns fed forward

The LineageTracker is architecturally important: it maintains the full phylogenetic tree of all hypotheses across all generations. Every hypothesis knows its parents (the hypotheses it was mutated or cross-bred from), its generation, its fitness score at each generation, and whether it survived to reproduce. This tree provides the data for evolutionary analysis: which attack angles consistently produce fit hypotheses? Which mutations reliably improve fitness? Which cross-breedings produce novel high-fitness offspring that neither parent could have generated?
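One way to picture the data the LineageTracker holds is a map from hypothesis id to a record with parent links and a fitness history. The sketch below is an assumption about shape only: `LineageSketch`, its method names, and the record fields are invented for illustration and are not the system's actual API.

```javascript
// Hypothetical sketch of a lineage store; names and record shapes are assumptions.
class LineageSketch {
  constructor() {
    this.nodes = new Map(); // hypothesis id -> record
  }

  // Register a hypothesis with its parent ids (empty for generation-0 seeds).
  add(id, { parents = [], generation, operator = 'seed' }) {
    for (const p of parents) {
      if (!this.nodes.has(p)) throw new Error(`Unknown parent ${p}`);
    }
    this.nodes.set(id, { id, parents, generation, operator, fitnessHistory: [], survived: false });
  }

  // Append the fitness measured in a given generation.
  recordFitness(id, generation, fitness, survived) {
    const node = this.nodes.get(id);
    if (!node) throw new Error(`Unknown hypothesis ${id}`);
    node.fitnessHistory.push({ generation, fitness });
    node.survived = survived;
  }

  // All ancestors of a hypothesis, found by walking parent links.
  ancestors(id) {
    const seen = new Set();
    const stack = [...(this.nodes.get(id)?.parents ?? [])];
    while (stack.length > 0) {
      const current = stack.pop();
      if (seen.has(current)) continue;
      seen.add(current);
      stack.push(...this.nodes.get(current).parents);
    }
    return [...seen];
  }

  // True when a child's latest fitness beats every parent's latest fitness.
  outperformsParents(id) {
    const latest = (n) => n.fitnessHistory.at(-1)?.fitness ?? -Infinity;
    const node = this.nodes.get(id);
    if (!node) throw new Error(`Unknown hypothesis ${id}`);
    return node.parents.length > 0 &&
      node.parents.every((p) => latest(node) > latest(this.nodes.get(p)));
  }
}

// Usage: two seeds and one cross-bred child (ids and scores are made up).
const lineage = new LineageSketch();
lineage.add('a', { generation: 0 });
lineage.add('b', { generation: 0 });
lineage.add('c', { parents: ['a', 'b'], generation: 1, operator: 'crossover' });
lineage.recordFitness('a', 0, 0.4, true);
lineage.recordFitness('b', 0, 0.5, true);
lineage.recordFitness('c', 1, 0.7, true);
```

Questions such as "which cross-breedings beat both parents" then reduce to filtering nodes by `operator` and `outperformsParents`.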

CLI Interface

bash — Run Evolutionary Research (Full)

# Run specific sub-problem
node scripts/run-evolutionary-research.js \
  --problem bsd-conjecture \
  --subproblem sha-finiteness \
  --generations 10 \
  --population 5

# Fast run (no multi-agent debate)
node scripts/run-evolutionary-research.js \
  --problem alzheimers --subproblem biomarker-panel \
  --generations 20 --population 5 --no-debate

The --generations and --population flags control the evolutionary budget. More generations allow the evolution to converge on stronger hypotheses but require more computational time. Larger populations create more diversity but also require more scoring and debate time per generation. The default settings (10 generations, population 5) are calibrated for overnight autonomous runs by KAALI: enough evolution to produce meaningful results, completing within a reasonable time window.
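To make the flags concrete, here is a minimal sketch of how such a command line could be parsed. The real script's parser is not shown in this article, so `parseArgs` and `DEFAULTS` are hypothetical. The defaults of 10 generations and population 5 come from the text above; treating debate as off unless --debate is passed is an assumption.

```javascript
// Hypothetical sketch: not the parser used by run-evolutionary-research.js.
const DEFAULTS = { generations: 10, population: 5, debate: false };

function parseArgs(argv) {
  const options = { ...DEFAULTS };
  for (let i = 0; i < argv.length; i++) {
    const flag = argv[i];
    switch (flag) {
      case '--problem':
        options.problem = argv[++i];
        break;
      case '--subproblem':
        options.subproblem = argv[++i];
        break;
      case '--generations':
      case '--population': {
        const value = Number.parseInt(argv[++i], 10);
        if (!Number.isInteger(value) || value < 1) {
          throw new Error(`${flag} expects a positive integer`);
        }
        options[flag.slice(2)] = value; // '--generations' -> 'generations'
        break;
      }
      case '--debate':
        options.debate = true;
        break;
      case '--no-debate':
        options.debate = false;
        break;
      default:
        throw new Error(`Unknown flag: ${flag}`);
    }
  }
  return options;
}

// The "fast run" command from above, as an argument array.
const fastRun = parseArgs([
  '--problem', 'alzheimers', '--subproblem', 'biomarker-panel',
  '--generations', '20', '--population', '5', '--no-debate',
]);
// Rough budget estimate, assuming one verification per hypothesis per generation.
const verificationCalls = fastRun.generations * fastRun.population;
```

Under that assumption the fast run costs on the order of 100 verification calls, which is why both flags directly drive run time.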

Available Sub-Problems: Sorted by Feasibility

Problem	Sub-Problem	Feasibility	Description
alzheimers	biomarker-panel	60%	Blood-based early detection panel
bsd-conjecture	rmt-verification	50%	Random Matrix Theory verification
bsd-conjecture	bsd-formula-extension	45%	Full BSD formula for curve families
alzheimers	multi-target-protocol	45%	Multi-target therapeutic protocol
alzheimers	trem2-agonism	40%	TREM2 microglial reprogramming
collatz	cycle-exclusion	40%	Prove no non-trivial cycles
bsd-conjecture	sha-finiteness	35%	Tate-Shafarevich finiteness
navier-stokes	liouville-ancient-solutions	35%	Liouville theorems for ancient solutions
riemann-hypothesis	zero-free-extension	30%	Extend the zero-free region
p-vs-np	barrier-avoiding-bounds	30%	Circuit bounds avoiding relativization barriers

The feasibility scores are calibrated estimates, not guarantees. A 60% feasibility score for the Alzheimer's biomarker panel means the team estimates a 60% probability that evolutionary research will produce a hypothesis novel enough and well-supported enough to merit publication or patent filing. It does not mean the problem is easy; it means the attack space is well-defined enough that evolutionary exploration is likely to find productive territory.

The 30% feasibility scores for Riemann zero-free extension and P vs NP circuit bounds reflect the extreme difficulty of these problems. Evolutionary research is not expected to produce solved proofs in these areas — it is expected to produce novel attack angles and partial results that advance the state of knowledge, even without a complete solution. At 30% feasibility, a publishable partial result is the realistic target.

"This is how real mathematicians work — knowing what doesn't work narrows the search space. The FailureLearner accumulates all dead ends and feeds them forward as anti-patterns to the next generation's hypothesis generator. Dead ends are not wasted effort — they are negative knowledge."

What One Generation Looks Like in Code

Each generation passes every hypothesis through the VERIFY → DEBATE → SELECT pipeline (ANALYZE runs only once, at generation 0). Failure learning is built into the mutation and crossover steps, where anti-patterns from dead ends are fed forward as generation-level context:

JavaScript — Single Generation Run

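// Assumption: these methods live on the orchestrator class. The constructor is
// not shown in the original, so the wiring below is illustrative.
class EvolutionaryResearchOrchestrator {
  constructor({ verifier, multiAgentCritique, generator, config }) {
    this.verifier = verifier;
    this.multiAgentCritique = multiAgentCritique;
    this.generator = generator;
    this.config = config; // { selectionThreshold, enableDebate, populationSize }
  }

  async runGeneration(genNumber, population, deadEnds) {
    const survivors = [];
    const newDeadEnds = [];

    for (const hypothesis of population) {
      // VERIFY: Score hypothesis against real data sources
      const score = await this.verifier.verify(hypothesis);

      if (score < this.config.selectionThreshold) {
        newDeadEnds.push({
          hypothesis, score,
          reason: await this.verifier.getFailureReason(hypothesis),
          generation: genNumber
        });
        continue;
      }

      // DEBATE: 4-agent critique (optional, enabled for hard problems)
      if (this.config.enableDebate) {
        const critique = await this.multiAgentCritique.evaluate(hypothesis);
        if (critique.consensus < 0.6) {
          newDeadEnds.push({
            hypothesis, score,
            reason: 'Debate consensus too low',
            generation: genNumber
          });
          continue;
        }
      }
      survivors.push({ hypothesis, score });
    }

    // SELECT: Keep the top half (at least one), ranked by score
    const keep = Math.max(1, Math.floor(this.config.populationSize / 2));
    const selected = survivors.sort((a, b) => b.score - a.score).slice(0, keep);

    // MUTATE / CROSS: pass ALL dead ends so far, not just this generation's
    const allDeadEnds = [...deadEnds, ...newDeadEnds];
    const nextGen = await this.generateNextGeneration(selected, allDeadEnds);
    return { survivors: selected, deadEnds: allDeadEnds, nextPopulation: nextGen };
  }

  // Failure learning: anti-patterns passed to generator as context
  async generateNextGeneration(survivors, accumulatedDeadEnds) {
    // De-duplicate reasons so repeated failures do not bloat the context
    const antiPatterns = [...new Set(accumulatedDeadEnds.map(d => d.reason))];
    return Promise.all(survivors.flatMap((parent, i) => {
      const offspring = [
        this.generator.mutate(parent.hypothesis, { avoidPatterns: antiPatterns })
      ];
      if (survivors.length > 1) {
        // Pick a crossover partner other than the parent itself
        const offset = 1 + Math.floor(Math.random() * (survivors.length - 1));
        const partner = survivors[(i + offset) % survivors.length];
        offspring.push(this.generator.crossover(parent.hypothesis,
          partner.hypothesis, { avoidPatterns: antiPatterns }));
      }
      return offspring;
    }));
  }
}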

Real Results Across All Active Research Tracks

Problem	Sub-Problem	Start	Peak	Gens	Dead Ends	Status
Riemann Hypothesis	arithmetic-site	69.3%	97.2%	7 seeded	100+	Strong evidence
Parkinson's Disease	cma-autophagy	37.1%	95.0%	1 (gen 0)	10	Breakthrough
Alzheimer's	multi-target-protocol	37.1%	49.2%	50	300	Landscape mapped
Yang-Mills	mass-gap	89.1%*	90.8%	40	100+	Ceiling at ~91%
BSD Conjecture	sha-finiteness	53.3%	53.3%	5	5	Early stage
Navier-Stokes	liouville-ancient	—	80% feasibility	—	7 patterns	Synthesis viable
Collatz	tao-extension	—	Lean4 scaffold	—	—	Formalized

Failure Learning: The Critical Component

The FailureLearner is the component that makes the evolutionary system genuinely learn rather than just sample. Every hypothesis that fails — every approach that does not survive the computational verification pass, every mutation that reduces fitness rather than increasing it — is analyzed for the pattern that caused its failure.

Anti-patterns extracted from failures are fed to the next generation's AIHypothesisGenerator as explicit constraints: "Do NOT generate hypotheses of type X." The generator uses these constraints to avoid hypothesis forms that have repeatedly failed, focusing exploration on unexplored territory rather than revisiting known dead ends.

Examples of Anti-Patterns:

For BSD conjecture: "Hypotheses assuming finiteness of the Tate-Shafarevich group without providing a new approach to the rank distribution consistently fail verification."
For Alzheimer's: "Biomarker panels based solely on amyloid-β concentration without inflammation markers show insufficient discriminative power in the verification benchmarks."
For Collatz: "Approaches assuming ergodicity of the Collatz map without addressing the measure-theoretic subtleties consistently fail the proof structure verification."
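The step from raw dead ends to generator constraints can be sketched as a simple counting pass. This is a minimal illustration under stated assumptions: `extractAntiPatterns` and `toConstraints` are hypothetical names, matching failure reasons by normalized text is a stand-in for whatever pattern analysis the FailureLearner really performs, and the sample dead ends are made up.

```javascript
// Hypothetical sketch: groups dead-end reasons into counted anti-patterns.
// Real pattern extraction is presumably richer than exact text matching.
function extractAntiPatterns(deadEnds, minOccurrences = 2) {
  const counts = new Map();
  for (const { reason } of deadEnds) {
    const key = reason.trim().toLowerCase();
    counts.set(key, (counts.get(key) ?? 0) + 1);
  }
  return [...counts.entries()]
    .filter(([, count]) => count >= minOccurrences) // keep repeated failures only
    .sort((a, b) => b[1] - a[1])                    // most frequent first
    .map(([pattern, count]) => ({ pattern, count }));
}

// Render anti-patterns as explicit constraints for the generator's context.
function toConstraints(antiPatterns) {
  return antiPatterns.map(({ pattern, count }) =>
    `Do NOT generate hypotheses of type: ${pattern} (failed ${count} times)`);
}

// Usage with made-up dead ends.
const sampleDeadEnds = [
  { reason: 'Amyloid-only panel', generation: 1 },
  { reason: 'amyloid-only panel ', generation: 2 },
  { reason: 'Single-marker threshold', generation: 2 },
];
const constraints = toConstraints(extractAntiPatterns(sampleDeadEnds));
```

The `minOccurrences` cutoff reflects the idea that only forms that have repeatedly failed should be excluded; a single failure may be noise.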

Reseeding with knowledge uses the accumulated failure knowledge plus higher creativity settings: the hypothesis generator is told to produce more speculative, less conventional hypotheses because all the conventional approaches are in the anti-pattern list. This is the evolutionary equivalent of a population that has exhausted the local fitness landscape and needs to explore a new region — the anti-pattern list forces the search away from the known local maxima.

Autonomous Mode: KAALI as Research Director

Autonomous mode is the fully operational state: 13 sub-problems are seeded in MongoDB, and KAALI picks up EVOLUTIONARY_RESEARCH tasks from the task queue by priority. Each task runs 10 generations; if the run has not converged (fitness has not plateaued and no publishable result has been found), the task auto-creates a follow-up task for the next 10 generations with updated failure knowledge. No human intervention is needed until a result meets publication criteria.

bash — Output Structure

# Output artifacts per run
data/discoveries/fullexports/{problem}/
├── evolution_gen1_{timestamp}.json    # Generation 1 population + scores
├── evolution_gen2_{timestamp}.json    # Generation 2 population + scores
├── ...
├── evolution_gen10_{timestamp}.json   # Final generation
├── full_report_{timestamp}.md         # Narrative research report
├── computational_code_{timestamp}.js  # Verification code
├── validation_results_{timestamp}.json
├── proof_document_{timestamp}.md      # Formal proof attempt
├── logs_{timestamp}.txt               # Full execution log
└── STATUS.md                          # Live fitness history + summary

The STATUS.md file is the live dashboard for autonomous research. It shows the fitness history of the current best hypothesis across all generations, the anti-pattern list accumulated so far, the current generation number, and the next scheduled run time. KAALI updates this file after each generation; researchers can monitor progress without interrupting the autonomous process.

The 10-Generation Convergence Criterion

The system is designed to run 10 generations per task as the default convergence window. If the best hypothesis fitness score has not improved by more than 5% in the last 3 generations, the system flags the run as "plateaued" and initiates reseeding with knowledge rather than continuing straight evolution. This prevents the system from sitting on the same local fitness maximum for the rest of the window (up to 7 more generations) when it has clearly converged.
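The plateau rule can be written as a short check over the best-fitness history. The sketch below is one reading of it, not the system's code: `hasPlateaued` is a hypothetical name, and "improved by more than 5%" is interpreted as relative improvement over the best score from just before the 3-generation window. The sample histories are made up.

```javascript
// Hypothetical sketch: the 5% rule is read as RELATIVE improvement.
function hasPlateaued(bestFitnessByGeneration, window = 3, minImprovement = 0.05) {
  // Need one baseline generation plus `window` later generations to judge.
  if (bestFitnessByGeneration.length <= window) return false;
  const baseline = bestFitnessByGeneration[bestFitnessByGeneration.length - 1 - window];
  const recentBest = Math.max(...bestFitnessByGeneration.slice(-window));
  // Guard against division by zero for degenerate scores.
  if (baseline <= 0) return recentBest <= 0;
  return (recentBest - baseline) / baseline <= minImprovement;
}

// Made-up fitness histories (best score per generation).
const climbing = [0.37, 0.41, 0.45, 0.49]; // still improving quickly
const stalled = [0.37, 0.45, 0.46, 0.46, 0.47]; // about 4% gain over the last 3 generations

const climbingPlateaued = hasPlateaued(climbing);
const stalledPlateaued = hasPlateaued(stalled);
```

If the intended meaning is an absolute gain of 5 percentage points, the final comparison would become `recentBest - baseline <= minImprovement`.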

The auto-follow-up mechanism: after the reseeding run's 10 generations complete, if the new population has produced a hypothesis with fitness above the "publishable" threshold (typically 0.80+), the task is marked complete and the result is routed to the publication pipeline. If not, a new follow-up task is created with the expanded anti-pattern list and further increased creativity settings. This allows the system to continue indefinitely on hard problems without human management.
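The follow-up decision described above can be sketched as a pure function from a finished run to the next action. Everything here is illustrative: `nextStep`, the task and result field names, the numeric `creativity` setting, and its 0.1 step are assumptions. Only the 0.80 threshold and the 10-generation window come from the text.

```javascript
// Hypothetical sketch of the follow-up decision; field names are assumptions.
const PUBLISHABLE_THRESHOLD = 0.8; // "typically 0.80+" in the text

function nextStep(task, runResult) {
  if (runResult.bestFitness >= PUBLISHABLE_THRESHOLD) {
    // Good enough: hand the result to the publication pipeline.
    return { action: 'publish', taskId: task.id };
  }
  // Not publishable: queue another run carrying everything learned so far.
  const antiPatterns = [...new Set([...task.antiPatterns, ...runResult.newAntiPatterns])];
  return {
    action: 'follow-up',
    task: {
      parentTaskId: task.id,
      generations: 10,
      antiPatterns,
      // Raise creativity each round, capped at 1 (the step size is an assumption).
      creativity: Math.min(1, task.creativity + 0.1),
    },
  };
}

// Usage with a made-up task and run result.
const followUp = nextStep(
  { id: 'task-1', antiPatterns: ['A'], creativity: 0.5 },
  { bestFitness: 0.533, newAntiPatterns: ['A', 'B'] },
);
```

Because each follow-up task carries the merged anti-pattern list forward, the chain of tasks can continue without anyone re-entering what has already failed.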

On the Feasibility Scores: The feasibility scores are the team's honest estimates based on problem structure analysis and the system's current capabilities. They reflect two separate uncertainties: (1) whether the problem has enough well-defined structure for evolutionary search to make progress, and (2) whether the current system's mathematical reasoning capabilities are sufficient for the verification step. For the hardest problems (Riemann, P vs NP), the second uncertainty dominates — the verification step requires mathematical reasoning at a level the system is still developing. These are research targets that will become more tractable as the mathematical engine capabilities improve.

"Evolution doesn't care about your hypothesis. It only cares whether your hypothesis survives contact with reality. The orchestrator operationalises this ruthlessness — every hypothesis either clears the threshold or joins the permanent record of what doesn't work. The 300 dead ends in the Alzheimer's corpus are not waste. They are 10,500 seconds of permanently excluded territory."

Connection to the Discovery Engine

The EvolutionaryResearchOrchestrator is the most powerful tool in the discovery engine's toolkit, but it is not the only one. For simpler discovery tasks (finding unexpected connections between well-understood domains, identifying patterns in literature), the single-hypothesis approach (generate one strong hypothesis, verify it, publish or discard) is sufficient and faster. The evolutionary approach is reserved for hard problems where the search space is large, the answer is not obvious, and failure information is genuinely valuable for narrowing the search.

The Millennium Prize problems, certain Alzheimer's mechanism questions, and the Wright-Fisher ↔ SGD formalization are all in the evolutionary category. The connection discovery work (finding that a learning science paper's methodology applies to a materials science problem) is typically in the single-hypothesis category. The system chooses the appropriate tool based on the estimated search space size and the value of failure information for that problem class.