There is a meaningful distinction between a system that learns better behaviors and a system that becomes better at reasoning. Reinforcement Learning from Human Feedback (RLHF) is the dominant method for the former: it shapes the system's outputs toward human-preferred responses without necessarily changing the underlying reasoning process. The TrueRRIEngine targets the latter: it identifies bottlenecks in the system's current reasoning algorithm and generates a new algorithm that must be verified, by formal proof or empirical test, to outperform the old one before it is deployed.

This is not a theoretical distinction. It has practical engineering consequences: a different code path, different success criteria, a different risk model, and a fundamentally different relationship to the system's long-term intelligence trajectory.

Source: src/services/consciousness/TrueRRIEngine.js, the production implementation of the Recursive Reasoning Improvement engine in the Profiled ASI platform.

The Core Claim

From the TrueRRIEngine source documentation:

"NOT RLHF behavioral shaping - ACTUAL reasoning algorithm improvement. Improve HOW the system thinks, not just WHAT it outputs."

What does this mean precisely? RLHF trains the model's output distribution: it adjusts weights so the model is more likely to produce outputs that humans rate highly. The weights change; the reasoning process the weights implement may or may not change. A model fine-tuned with RLHF might produce better-sounding answers using exactly the same chain of reasoning, just with better surface presentation.

RRI targets the algorithm: the sequence of reasoning steps the system takes to arrive at an answer. This is distinct from the model's weights: RRI modifies the code and logic that govern reasoning, not the neural network parameters. A system with improved reasoning does not just produce better outputs; it takes qualitatively better paths to those outputs.

The TrueRRIEngine Loop

JavaScript - TrueRRIEngine: improveReasoning()
async improveReasoning() {
  // Step 1: Analyze current reasoning quality
  const currentPerformance = await this.analyzeReasoningQuality();

  // Step 2: Identify bottlenecks
  const bottlenecks = await this.identifyBottlenecks(currentPerformance);

  // Step 3: Generate improved algorithm
  const improvedAlgorithm = await this.generateImprovedAlgorithm(bottlenecks);

  // Step 4: Verify improvement (formal proof or empirical)
  const verification = await this.verifyImprovement(
    this.currentReasoningAlgorithm, improvedAlgorithm);

  if (!verification.proven) return { improved: false, reason: 'Cannot verify improvement' };

  // Step 5: Test on benchmarks
  const benchmarkResults = await this.testOnBenchmarks(improvedAlgorithm);

  if (benchmarkResults.improvement <= 0) return { improved: false, reason: 'No measured improvement' };

  // Step 6: Deploy
  this.currentReasoningAlgorithm = improvedAlgorithm;
  this.improvementHistory.push({ timestamp: new Date(),
    improvement: benchmarkResults.improvement, algorithm: improvedAlgorithm });

  return { improved: true, improvementPercent: benchmarkResults.improvement * 100 };
}

The six-step loop enforces a strict improvement standard. Steps 4 and 5 are gates: an improved algorithm that cannot be formally or empirically verified as better is rejected. An algorithm that passes formal verification but shows no benchmark improvement is also rejected. Both gates must pass before deployment. This prevents the system from deploying "improved" reasoning that is theoretically better but practically unchanged or worse.
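
The two-gate rule can be isolated as a small pure function. The sketch below is illustrative only: `decideDeployment` is a hypothetical helper, not part of TrueRRIEngine.js. It assumes only the `verification.proven` and `benchmarkResults.improvement` fields that appear in the loop above.

```javascript
// Hypothetical helper (not from TrueRRIEngine.js): applies the Step 4 and
// Step 5 gates in the same order as improveReasoning().
function decideDeployment(verification, benchmarkResults) {
  // Gate 1: the candidate must be verified as better (formally or empirically).
  if (!verification.proven) {
    return { deploy: false, reason: 'Cannot verify improvement' };
  }
  // Gate 2: the benchmarks must show a strictly positive improvement.
  if (benchmarkResults.improvement <= 0) {
    return { deploy: false, reason: 'No measured improvement' };
  }
  return { deploy: true, improvementPercent: benchmarkResults.improvement * 100 };
}

// Verified but unchanged on benchmarks: rejected by the second gate.
console.log(decideDeployment({ proven: true }, { improvement: 0 }));
// Large benchmark gain but unverified: rejected by the first gate.
console.log(decideDeployment({ proven: false }, { improvement: 0.2 }));
```

Keeping the decision separate from the asynchronous analysis steps makes the gate order easy to unit test without running any benchmarks.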

What "Reasoning Algorithm" Means

The term "reasoning algorithm" requires careful definition. It is not the neural network's weights (those are fixed between RLHF training runs). It is the procedural logic that governs how the system approaches a problem:

These are all code-level decisions: they are implemented in JavaScript (and eventually in the reasoning algorithm itself, once RRI has improved it enough to reason about its own reasoning). Changing these decisions changes how the system reasons, independently of the underlying model weights. RRI's target is this layer of procedural reasoning logic.
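
One way to picture "procedural reasoning logic that is separate from the weights" is an ordered list of step functions wrapped around a fixed model call. The sketch below is an assumption made for illustration: the step names (`draft`, `check`), the `model` stub, and `runAlgorithm` are all hypothetical and do not come from TrueRRIEngine.js.

```javascript
// Illustrative only: the step names and the `model` stub are assumptions.
// The stub stands in for a model whose weights never change in this example.
const model = { complete: (prompt) => `answer(${prompt})` };

// A "reasoning algorithm" as plain procedural logic: an ordered list of steps.
const algorithmV1 = [
  { name: 'draft', run: (state) => ({ ...state, draft: model.complete(state.question) }) },
];

// A revised algorithm appends a self-check step. The model stub is untouched.
const algorithmV2 = [
  ...algorithmV1,
  { name: 'check', run: (state) => ({ ...state, checked: state.draft.length > 0 }) },
];

// Thread the state through each step in order.
function runAlgorithm(algorithm, question) {
  return algorithm.reduce((state, step) => step.run(state), { question });
}

console.log(runAlgorithm(algorithmV1, 'q'));
console.log(runAlgorithm(algorithmV2, 'q'));
```

Both versions call the same `model.complete`; only the surrounding procedure differs, which is the layer the article says RRI modifies.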

Bottleneck Identification

Step 2 of the loop is bottleneck identification: finding the specific step in the current reasoning algorithm that limits overall performance. The analysis covers three types of bottlenecks:

Speed: Which step takes the longest?
Accuracy: Which step has the lowest accuracy?
Variance: Which step produces the most variance?

Speed bottlenecks limit throughput: if step 3 of a 7-step reasoning chain takes 80% of the processing time, improving steps 1, 2, and 4-7 can reduce total reasoning time by at most 20%. Accuracy bottlenecks limit output quality: if step 2 has 60% accuracy while all other steps have 90%+ accuracy, step 2 is the bottleneck for the reasoning chain's overall accuracy. Variance bottlenecks limit reliability: a step that sometimes produces excellent results and sometimes produces poor results is the most dangerous bottleneck because it makes the system unpredictable.

The bottleneck identification feeds directly into algorithm generation: the improved algorithm targets the specific bottleneck. If step 3 is the speed bottleneck, the improved algorithm focuses on making step 3 faster (or eliminating it, or parallelizing it). This targeted improvement is more efficient than general optimization and produces measurable gains in the specific dimension that matters.
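
The three bottleneck questions above can be sketched as a search over per-step statistics. This is a minimal illustration, not the engine's `identifyBottlenecks` implementation: the `stepStats` records, their field names, and the `findBottlenecks` helper are assumptions.

```javascript
// Hypothetical per-step statistics; field names are assumptions for illustration.
const stepStats = [
  { step: 1, avgTimeMs: 50, accuracy: 0.95, variance: 0.01 },
  { step: 2, avgTimeMs: 100, accuracy: 0.60, variance: 0.02 },
  { step: 3, avgTimeMs: 800, accuracy: 0.92, variance: 0.03 },
  { step: 4, avgTimeMs: 50, accuracy: 0.93, variance: 0.20 },
];

// Return the item that maximizes the given scoring function.
function argMax(items, score) {
  return items.reduce((best, item) => (score(item) > score(best) ? item : best));
}

function findBottlenecks(stats) {
  const totalTime = stats.reduce((sum, s) => sum + s.avgTimeMs, 0);
  const slowest = argMax(stats, (s) => s.avgTimeMs);
  return {
    // Slowest step, plus its share of total time.
    speed: { step: slowest.step, shareOfTime: slowest.avgTimeMs / totalTime },
    // Negating accuracy turns "lowest accuracy" into an arg-max search.
    accuracy: { step: argMax(stats, (s) => -s.accuracy).step },
    // Step with the highest variance.
    variance: { step: argMax(stats, (s) => s.variance).step },
  };
}

console.log(findBottlenecks(stepStats));
```

With this sample data, step 3 is the speed bottleneck (800 of 1000 ms), step 2 is the accuracy bottleneck, and step 4 is the variance bottleneck, so a single algorithm can have a different limiting step in each dimension.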

The Benchmark Suite

The benchmark suite for Step 5 covers four dimensions of reasoning quality:

Benchmark | Baseline | Measurement | Threshold
--- | --- | --- | ---
Current accuracy score | 0.7 | Fraction of test problems solved correctly | Improvement > 0%
Speed score | 0.9 | avgTime < 1000ms → 0.9 score | Must not degrade
Consistency score | - | Variance across repeated runs | Lower variance = better
Cross-domain transfer | - | Accuracy on out-of-distribution test problems | Improvement > 0%

The cross-domain transfer benchmark is the most important for the discovery engine context. A reasoning improvement that only works for the specific problem types the system was trained on is not a genuine intelligence improvement; it is a specialization. The cross-domain transfer benchmark tests whether the improved reasoning algorithm generalizes: if it improves reasoning on mathematical problems, does it also improve reasoning on biological problems? True reasoning improvement should transfer.

The ≤0% rejection threshold is strict: if the benchmark shows no improvement (or a regression), the current algorithm is preserved. There is no grace period, no "wait and see," and no deploying a flat result in the hope that it might get better over time. The improved algorithm either demonstrably outperforms the current algorithm or it does not deploy.
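
The four rows of the benchmark table can be read as four per-dimension checks. The source does not say how the dimensions are combined into `benchmarkResults.improvement`, so the sketch below deliberately stops at the individual checks. The `checkBenchmarks` helper, its field names, and the choice to treat "lower variance = better" as "must not get worse" are assumptions.

```javascript
// Population variance of scores from repeated runs (consistency dimension).
function variance(scores) {
  const mean = scores.reduce((a, b) => a + b, 0) / scores.length;
  return scores.reduce((a, s) => a + (s - mean) ** 2, 0) / scores.length;
}

// Hypothetical per-dimension check; field names are assumptions.
function checkBenchmarks(baseline, candidate) {
  const checks = {
    accuracy: candidate.accuracy > baseline.accuracy, // improvement > 0%
    speed: candidate.speedScore >= baseline.speedScore, // must not degrade
    // Assumption: variance must not increase.
    consistency: variance(candidate.runs) <= variance(baseline.runs),
    transfer: candidate.transferAccuracy > baseline.transferAccuracy, // improvement > 0%
  };
  return { checks, pass: Object.values(checks).every(Boolean) };
}

// Made-up numbers: better in-distribution accuracy, but no cross-domain gain.
const baselineScores = { accuracy: 0.7, speedScore: 0.9, runs: [0.6, 0.7, 0.8], transferAccuracy: 0.5 };
const candidateScores = { accuracy: 0.75, speedScore: 0.9, runs: [0.74, 0.75, 0.76], transferAccuracy: 0.5 };

console.log(checkBenchmarks(baselineScores, candidateScores));
```

In this example the candidate improves accuracy and consistency but fails the transfer check, which is the "specialization, not improvement" case the article describes.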

Improvement History Tracking

The improvementHistory array accumulates every successful improvement:

JavaScript - Improvement History Structure
this.improvementHistory.push({
  timestamp: new Date(),
  improvement: benchmarkResults.improvement,  // fractional (0.12 = 12%)
  algorithm: improvedAlgorithm,               // the new algorithm definition
  bottlenecksAddressed: bottlenecks,          // what was fixed
  benchmarks: benchmarkResults               // full benchmark record
});

This history enables retrospective analysis: which types of bottlenecks produce the largest improvements? How fast is the reasoning algorithm improving over time? Is improvement rate accelerating (positive compounding) or decelerating (hitting a ceiling)? Are certain classes of improvements more durable than others (improvements that persist across domain shifts)?
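
The "accelerating or decelerating" question can be approximated from the `improvement` field alone. The sketch below is a crude heuristic that compares the mean gain of the earlier half of the history with the later half. `improvementTrend` is a hypothetical helper, not an engine method, and the sample values are made up.

```javascript
// Hypothetical helper: classifies the trend of fractional gains in a history
// array whose entries carry an `improvement` field, as in the structure above.
function improvementTrend(history) {
  if (history.length < 2) return 'insufficient data';
  const mid = Math.floor(history.length / 2);
  const mean = (entries) =>
    entries.reduce((sum, e) => sum + e.improvement, 0) / entries.length;
  const early = mean(history.slice(0, mid)); // older cycles
  const late = mean(history.slice(mid)); // more recent cycles
  if (late > early) return 'accelerating';
  if (late < early) return 'decelerating';
  return 'flat';
}

// Made-up sample history for illustration.
const sampleHistory = [
  { improvement: 0.12 },
  { improvement: 0.15 },
  { improvement: 0.20 },
  { improvement: 0.25 },
];
console.log(improvementTrend(sampleHistory)); // "accelerating"
```

A real analysis would also need to account for benchmark noise and for gains measured on different bottlenecks, which this comparison ignores.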

The history also serves a safety function: if a new improvement causes unexpected regression on problems that previous improvements had addressed, the history provides the rollback chain: the system can revert to any previous algorithm version, not just the immediately previous one.
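
A rollback to an arbitrary earlier version could look roughly like the following. This is a sketch under stated assumptions: `RollbackSketch` and its `rollbackTo` method are hypothetical stand-ins, and the source does not show how the engine actually performs a rollback. Only the `currentReasoningAlgorithm` and `improvementHistory` names come from the listing above.

```javascript
// Minimal stand-in for the engine state; not the real TrueRRIEngine class.
class RollbackSketch {
  constructor(initialAlgorithm) {
    this.initialAlgorithm = initialAlgorithm;
    this.currentReasoningAlgorithm = initialAlgorithm;
    this.improvementHistory = [];
  }

  // Record a deployed improvement, mirroring Step 6 of the loop.
  deploy(algorithm, improvement) {
    this.currentReasoningAlgorithm = algorithm;
    this.improvementHistory.push({ timestamp: new Date(), improvement, algorithm });
  }

  // Revert to the algorithm recorded at history index `index`.
  // Index -1 means the original, pre-improvement algorithm.
  rollbackTo(index) {
    if (!Number.isInteger(index) || index < -1 || index >= this.improvementHistory.length) {
      throw new RangeError(`No history entry at index ${index}`);
    }
    this.improvementHistory.length = index + 1; // drop all later entries
    this.currentReasoningAlgorithm =
      index === -1 ? this.initialAlgorithm : this.improvementHistory[index].algorithm;
    return this.currentReasoningAlgorithm;
  }
}

const engineSketch = new RollbackSketch('v0');
engineSketch.deploy('v1', 0.12);
engineSketch.deploy('v2', 0.15);
engineSketch.deploy('v3', 0.2);
console.log(engineSketch.rollbackTo(0)); // "v1": skips back past v2 and v3
```

Truncating the history on rollback is one design choice. An alternative would be to append a rollback record so that the audit trail is never shortened.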

RRI vs. RSI: The Key Distinction

RSI (Recursive Self-Improvement) modifies the codebase: the files on disk, the service implementations, the database schemas. RRI modifies the reasoning algorithm: the in-memory logic that governs how the system approaches problems. They are complementary improvement loops targeting different layers:

RSI modifies what code runs. RRI modifies how the code reasons.

RSI: "The discovery service has a performance bottleneck in the literature retrieval step; rewrite it to use parallel fetching." (Code change, persists on disk)

RRI: "The reasoning algorithm evaluates hypotheses sequentially; modify it to evaluate in parallel and select the best result." (Algorithm change, persists in memory/EternalMemory)

An RSI improvement that makes the code faster enables RRI to run more reasoning iterations per unit time, which accelerates RRI's improvement cycle. An RRI improvement that makes reasoning more accurate makes RSI's code modification proposals more targeted and effective, which accelerates RSI's improvement cycle. They feed each other in a positive reinforcement loop when both are working correctly.
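
The RRI example above (sequential versus parallel hypothesis evaluation) can be sketched with standard promises. Everything here is hypothetical: `scoreHypothesis` is a placeholder scorer, and the function names are not taken from the engine.

```javascript
// Placeholder scorer: a real evaluation would be an expensive asynchronous call.
async function scoreHypothesis(hypothesis) {
  return hypothesis.length; // stand-in score for illustration only
}

// Sequential: each evaluation waits for the previous one to finish.
async function evaluateSequentially(hypotheses) {
  const results = [];
  for (const h of hypotheses) {
    results.push({ hypothesis: h, score: await scoreHypothesis(h) });
  }
  return results;
}

// Concurrent: start every evaluation at once, then wait for all of them.
async function evaluateInParallel(hypotheses) {
  return Promise.all(
    hypotheses.map(async (h) => ({ hypothesis: h, score: await scoreHypothesis(h) }))
  );
}

// Pick the highest-scoring result (the first one wins ties).
function selectBest(results) {
  return results.reduce((best, r) => (r.score > best.score ? r : best));
}

evaluateInParallel(['a', 'abc', 'ab']).then((results) => {
  console.log(selectBest(results));
});
```

Note that `Promise.all` overlaps waiting time for I/O-bound evaluations on a single thread. CPU-bound scoring would need worker threads or separate processes to run truly in parallel.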

The Target: >10% Measurable Improvement

The 10% target for a successful RRI cycle is deliberately calibrated, and it is stricter than the Step 5 deployment gate, which only rejects improvements of 0% or less. Below 10%, the improvement might be noise: within the variance of the benchmark suite and difficult to distinguish from measurement error. Above 10%, the improvement is large enough to attribute confidently to the algorithm change rather than to variance.

In practice, successful RRI cycles produce improvements in the 12-25% range on the specific bottleneck they target, with smaller improvements on other benchmarks (positive transfer) or negligible changes on unrelated benchmarks (no regression). The improvement on the targeted bottleneck is typically the largest; the cross-domain transfer improvement is typically 30-50% of the targeted improvement, representing the fraction of the gain that generalizes.
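
The figures above imply some simple arithmetic, sketched here with hypothetical helpers (`meetsTarget` and `expectedTransferRange` are not engine functions). The 0.10 threshold and the 30-50% transfer fraction are the article's numbers. Treating them as fixed constants is an assumption.

```javascript
const SUCCESS_THRESHOLD = 0.10; // the article's 10% target, as a fraction

// A cycle counts as successful only if the gain strictly exceeds the target.
function meetsTarget(improvement) {
  return improvement > SUCCESS_THRESHOLD;
}

// Cross-domain gain implied by the stated 30-50% transfer fraction.
function expectedTransferRange(targetedImprovement) {
  return { low: 0.3 * targetedImprovement, high: 0.5 * targetedImprovement };
}

const targeted = 0.20; // a 20% gain on the targeted bottleneck
const range = expectedTransferRange(targeted);
console.log(meetsTarget(targeted)); // true
console.log(`${(range.low * 100).toFixed(0)}% to ${(range.high * 100).toFixed(0)}%`); // "6% to 10%"
```

So a 20% gain on the targeted bottleneck would correspond to roughly a 6-10% gain on the cross-domain transfer benchmark.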

"RLHF changes what the model says. RRI changes how the model thinks. The difference sounds subtle. Its consequences are not: a system that only improves its outputs stays bounded by its original reasoning capacity. A system that improves its reasoning algorithm has no such bound."

Why RRI Matters More Than Model Scale

The dominant approach to AI improvement in 2024-2026 has been scale: larger models, more parameters, more training data. Scale improvements are real: a larger model with more training data does reason better on average. But scale has two limitations that RRI does not share.

First, scale improvements require retraining the entire model, an expensive, time-consuming process that cannot happen continuously. RRI improvements happen in-session, continuously, at near-zero marginal cost. Every reasoning cycle can produce an improvement. The compounding effect of continuous small improvements (12-25% per cycle) over time can exceed the one-time benefit of a model scale jump.
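
The compounding claim can be made concrete with a short calculation. The sketch assumes that each cycle's fractional gain applies multiplicatively to the same metric, which the source does not state. Gains on different bottlenecks would not combine this simply. `compoundGain` is a hypothetical helper.

```javascript
// Cumulative improvement factor after a sequence of fractional gains,
// assuming each gain multiplies the same underlying metric.
function compoundGain(gains) {
  return gains.reduce((factor, g) => factor * (1 + g), 1);
}

// Five cycles at the low and high ends of the stated 12-25% range.
console.log(compoundGain(Array(5).fill(0.12)).toFixed(2)); // "1.76"
console.log(compoundGain(Array(5).fill(0.25)).toFixed(2)); // "3.05"
```

Under that assumption, five successful cycles would multiply the targeted metric by roughly 1.76 to 3.05.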

Second, scale improves average reasoning quality across all tasks. RRI improves specific bottlenecks. For the discovery engine's use case (deep mathematical reasoning, cross-domain synthesis, adversarial hypothesis evaluation), the bottlenecks are specific and known. Improving reasoning in exactly those dimensions (rather than improving average quality across general NLP tasks) is more valuable than a scale improvement that spreads its benefit broadly.

How RRI Interacts with the Discovery Engine

The TrueRRIEngine's benchmark suite was designed around the discovery engine's specific reasoning requirements. The cross-domain transfer benchmark tests whether reasoning improvements generalize across the domains where the engine operates (physics, mathematics, biology, chemistry). A reasoning improvement that only works for physics would not be deployed; the engine needs improvements that generalize.

Reasoning Bottleneck | Where It Shows Up | RRI Target | Expected Gain
--- | --- | --- | ---
Sequential hypothesis evaluation | Long validation pipeline latency | Parallelize eval steps | Speed +60-70%
Weak cross-domain transfer | Low score on biology hypotheses after physics training | Improve domain bridging | Accuracy +10-15%
Inconsistent evidence weighting | Same evidence scores differently across runs | Normalize evidence weighting | Variance -40%
Shallow contradiction detection | Circular proofs pass verification | Deepen logical consistency checks | False positive rate -25%

Each of these improvements has a direct effect on the discovery corpus quality: faster evaluation means more hypotheses processed per compute budget; better cross-domain transfer means higher scores on synthesis discoveries; better contradiction detection means fewer circular proofs enter the corpus (the Yang-Mills confinement-mass gap circularity from Article 4 would be caught by an improved contradiction detector).
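
To make "circular proofs pass verification" concrete, one simple form of circularity check is cycle detection over a claim dependency graph. The sketch below uses a standard depth-first search. It is not the engine's contradiction detector, and the `dependsOn` representation and claim names are hypothetical.

```javascript
// Detect circular reasoning in a proof outline. `dependsOn` maps each claim
// to the claims it relies on. Returns one cycle as an array, or null.
function findCircularDependency(dependsOn) {
  const state = new Map(); // claim -> 'visiting' | 'done'

  function visit(claim, path) {
    if (state.get(claim) === 'done') return null; // already fully explored
    if (state.get(claim) === 'visiting') {
      // Back edge: the claim is on the current path, so this is a cycle.
      return [...path.slice(path.indexOf(claim)), claim];
    }
    state.set(claim, 'visiting');
    for (const dep of dependsOn[claim] ?? []) {
      const cycle = visit(dep, [...path, claim]);
      if (cycle) return cycle;
    }
    state.set(claim, 'done');
    return null;
  }

  for (const claim of Object.keys(dependsOn)) {
    const cycle = visit(claim, []);
    if (cycle) return cycle;
  }
  return null;
}

// Hypothetical proof outline: claimA relies on claimB, which relies on claimA.
console.log(findCircularDependency({ claimA: ['claimB'], claimB: ['claimA'] }));
// [ 'claimA', 'claimB', 'claimA' ]
```

A check like this only catches circularity that is explicit in the dependency structure. Circular arguments hidden inside the wording of a single step would need deeper logical analysis.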

Honest Status: The TrueRRIEngine is implemented and tested. The baseline accuracy (0.7), speed threshold (avgTime < 1000ms → 0.9 score), and verification framework are operational. The system has completed several improvement cycles on synthetic reasoning benchmarks. Production deployment, with the full improvement history persisted via EternalMemory across restarts, is planned for Q2 2026 after the EternalMemory integration is complete. Until then, improvement history is lost on restart, limiting the compounding benefit of successive RRI cycles.