There is a meaningful distinction between a system that learns better behaviors and a system that becomes better at reasoning. Reinforcement Learning from Human Feedback (RLHF) is the dominant method for the former: it shapes the system's outputs toward human-preferred responses without necessarily changing the underlying reasoning process. The TrueRRIEngine targets the latter: it identifies bottlenecks in the system's current reasoning algorithm and generates a new algorithm that provably outperforms the old one.
This distinction is not merely theoretical. It has practical engineering consequences: a different code path, different success criteria, a different risk model, and a fundamentally different relationship to the system's long-term intelligence trajectory.
The Core Claim
From the TrueRRIEngine source documentation:
What does this mean precisely? RLHF trains the model's output distribution: it adjusts weights so the model is more likely to produce outputs that humans rate highly. The weights change; the reasoning process the weights implement may or may not change. A model fine-tuned with RLHF might produce better-sounding answers using exactly the same chain of reasoning, just with better surface presentation.
RRI targets the algorithm: the sequence of reasoning steps the system takes to arrive at an answer. This is distinct from the model's weights. RRI modifies the code and logic that govern reasoning, not the neural network parameters. A system with improved reasoning does not just produce better outputs; it takes qualitatively better paths to those outputs.
The TrueRRIEngine Loop
```javascript
async improveReasoning() {
  // Step 1: Analyze current reasoning quality
  const currentPerformance = await this.analyzeReasoningQuality();

  // Step 2: Identify bottlenecks
  const bottlenecks = await this.identifyBottlenecks(currentPerformance);

  // Step 3: Generate improved algorithm
  const improvedAlgorithm = await this.generateImprovedAlgorithm(bottlenecks);

  // Step 4: Verify improvement (formal proof or empirical)
  const verification = await this.verifyImprovement(
    this.currentReasoningAlgorithm, improvedAlgorithm);
  if (!verification.proven) return { improved: false, reason: 'Cannot verify improvement' };

  // Step 5: Test on benchmarks
  const benchmarkResults = await this.testOnBenchmarks(improvedAlgorithm);
  if (benchmarkResults.improvement <= 0) return { improved: false, reason: 'No measured improvement' };

  // Step 6: Deploy
  this.currentReasoningAlgorithm = improvedAlgorithm;
  this.improvementHistory.push({ timestamp: new Date(),
    improvement: benchmarkResults.improvement, algorithm: improvedAlgorithm });
  return { improved: true, improvementPercent: benchmarkResults.improvement * 100 };
}
```
The six-step loop enforces a strict improvement standard. Steps 4 and 5 are gates: an improved algorithm that cannot be formally or empirically verified as better is rejected. An algorithm that passes formal verification but shows no benchmark improvement is also rejected. Both gates must pass before deployment. This prevents the system from deploying "improved" reasoning that is theoretically better but practically unchanged or worse.
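The two gates can be exercised in isolation. The following is a minimal sketch, not the TrueRRIEngine itself: `runGatedCycle`, `demoGates`, and the stubbed `verify` and `benchmark` callbacks are hypothetical names introduced for illustration, and only the return shapes are taken from `improveReasoning()`.

```javascript
// Hypothetical helper: applies the two gates from improveReasoning() to a
// candidate algorithm. `verify` and `benchmark` are caller-supplied async
// stand-ins for verifyImprovement and testOnBenchmarks.
async function runGatedCycle(current, candidate, { verify, benchmark }) {
  // Gate 1 (Step 4): the candidate must be verified as better.
  const verification = await verify(current, candidate);
  if (!verification.proven) {
    return { improved: false, reason: 'Cannot verify improvement', algorithm: current };
  }
  // Gate 2 (Step 5): the candidate must show a positive benchmark gain.
  const results = await benchmark(candidate);
  if (results.improvement <= 0) {
    return { improved: false, reason: 'No measured improvement', algorithm: current };
  }
  return { improved: true, improvementPercent: results.improvement * 100, algorithm: candidate };
}

// Two stubbed scenarios: verified but flat, and verified with a 12% gain.
async function demoGates() {
  const current = { name: 'v1' };
  const candidate = { name: 'v2' };
  const verifiedButFlat = await runGatedCycle(current, candidate, {
    verify: async () => ({ proven: true }),
    benchmark: async () => ({ improvement: 0 }),
  });
  const verifiedAndBetter = await runGatedCycle(current, candidate, {
    verify: async () => ({ proven: true }),
    benchmark: async () => ({ improvement: 0.12 }),
  });
  return { verifiedButFlat, verifiedAndBetter };
}
```

In the first scenario the candidate passes verification but is still rejected, and the current algorithm is returned unchanged. Only the second scenario yields the candidate.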
What "Reasoning Algorithm" Means
The term "reasoning algorithm" requires careful definition. It is not the neural network's weights (those are fixed between RLHF training runs). It is the procedural logic that governs how the system approaches a problem:
- What context does it retrieve before reasoning?
- In what order does it process sub-problems?
- When does it decide it has enough evidence to reach a conclusion?
- How does it handle contradictory evidence?
- How does it calibrate confidence in its conclusions?
- When does it recurse on a sub-problem versus accept an approximate answer?
These are all code-level decisions: they are implemented in JavaScript (and eventually in the reasoning algorithm itself, once RRI has improved it enough to reason about its own reasoning). Changing these decisions changes how the system reasons, independently of the underlying model weights. RRI's target is this layer of procedural reasoning logic.
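One way to picture this layer is as plain data and functions that sit outside the model. The sketch below is purely illustrative: the policy shape, its field names, and `nextAction` are assumptions made for this example, not the TrueRRIEngine's actual representation.

```javascript
// Hypothetical shape for a "reasoning algorithm": procedural choices held
// as plain data and functions, separate from any model weights.
const baselinePolicy = {
  // What context to retrieve before reasoning (here: the first three hints).
  retrieveContext: (problem) => problem.hints.slice(0, 3),
  // In what order to process sub-problems (here: cheapest first).
  orderSubProblems: (subs) => [...subs].sort((a, b) => a.cost - b.cost),
  // Confidence required before reaching a conclusion.
  evidenceThreshold: 0.8,
  // Depth at which to stop recursing and accept an approximate answer.
  maxRecursionDepth: 2,
};

// Decide whether to conclude, recurse on a sub-problem, or approximate.
function nextAction(policy, confidence, depth) {
  if (confidence >= policy.evidenceThreshold) return 'conclude';
  if (depth < policy.maxRecursionDepth) return 'recurse';
  return 'approximate';
}

// Same inputs, same model, different procedural behavior.
const stricterPolicy = { ...baselinePolicy, evidenceThreshold: 0.9 };
```

With a confidence of 0.85 at depth 0, `nextAction` concludes under `baselinePolicy` but recurses under `stricterPolicy`, which is the sense in which a change at this layer alters reasoning without touching weights.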
Bottleneck Identification
Step 2 of the loop is bottleneck identification: finding the specific step in the current reasoning algorithm that limits overall performance. The analysis covers three types of bottlenecks:
- Speed bottlenecks limit throughput. If step 3 of a 7-step reasoning chain takes 80% of the processing time, then optimizing every other step (1, 2, and 4 through 7) can cut total reasoning time by at most 20%.
- Accuracy bottlenecks limit output quality. If step 2 has 60% accuracy while all other steps have 90%+ accuracy, step 2 is the bottleneck for the reasoning chain's overall accuracy.
- Variance bottlenecks limit reliability. A step that sometimes produces excellent results and sometimes produces poor results is the most dangerous bottleneck because it makes the system unpredictable.
The bottleneck identification feeds directly into algorithm generation: the improved algorithm targets the specific bottleneck. If step 3 is the speed bottleneck, the improved algorithm focuses on making step 3 faster (or eliminating it, or parallelizing it). This targeted improvement is more efficient than general optimization and produces measurable gains in the specific dimension that matters.
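The three bottleneck types can be located from per-step statistics. This is a minimal sketch under stated assumptions: the step records (`timeMs`, `accuracy`, `scores`) and the function `identifyBottlenecksSketch` are hypothetical, and the real `identifyBottlenecks` method may work quite differently.

```javascript
// Population variance of a list of numbers.
function variance(xs) {
  const mean = xs.reduce((s, x) => s + x, 0) / xs.length;
  return xs.reduce((s, x) => s + (x - mean) ** 2, 0) / xs.length;
}

// Hypothetical per-step statistics (field names are assumptions):
//   timeMs: mean latency, accuracy: fraction correct,
//   scores: quality scores from repeated runs of the same step.
function identifyBottlenecksSketch(steps) {
  const totalTime = steps.reduce((s, st) => s + st.timeMs, 0);
  const slowest = steps.reduce((a, b) => (b.timeMs > a.timeMs ? b : a));
  const leastAccurate = steps.reduce((a, b) => (b.accuracy < a.accuracy ? b : a));
  const noisiest = steps.reduce((a, b) => (variance(b.scores) > variance(a.scores) ? b : a));
  return {
    speed: { step: slowest.id, shareOfTime: slowest.timeMs / totalTime },
    accuracy: { step: leastAccurate.id, accuracy: leastAccurate.accuracy },
    variance: { step: noisiest.id, variance: variance(noisiest.scores) },
  };
}

// Illustrative data only: step 3 is slow, step 2 inaccurate, step 4 erratic.
const exampleSteps = [
  { id: 1, timeMs: 50, accuracy: 0.95, scores: [0.9, 0.9, 0.9] },
  { id: 2, timeMs: 100, accuracy: 0.6, scores: [0.6, 0.6, 0.6] },
  { id: 3, timeMs: 800, accuracy: 0.92, scores: [0.9, 0.92, 0.94] },
  { id: 4, timeMs: 50, accuracy: 0.93, scores: [0.99, 0.4, 0.95] },
];
const found = identifyBottlenecksSketch(exampleSteps);
```

Here `found.speed` points at step 3 with an 80% share of total time, `found.accuracy` at step 2, and `found.variance` at step 4. Each dimension can name a different step, which is why the generated algorithm targets one bottleneck at a time.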
The Benchmark Suite
The benchmark suite for Step 5 covers four dimensions of reasoning quality:
| Benchmark | Baseline | Measurement | Threshold |
|---|---|---|---|
| Accuracy score | 0.7 | Fraction of test problems solved correctly | Improvement > 0% |
| Speed score | 0.9 | Score of 0.9 when avgTime < 1000 ms | Must not degrade |
| Consistency score | not stated | Variance across repeated runs | Lower variance = better |
| Cross-domain transfer | not stated | Accuracy on out-of-distribution test problems | Improvement > 0% |
The cross-domain transfer benchmark is the most important for the discovery engine context. A reasoning improvement that only works for the specific problem types the system was trained on is not a genuine intelligence improvement; it is a specialization. The cross-domain transfer benchmark tests whether the improved reasoning algorithm generalizes: if it improves reasoning on mathematical problems, does it also improve reasoning on biological problems? True reasoning improvement should transfer.
The ≤0% rejection threshold is strict: if the benchmark shows no improvement (or a regression), the current algorithm is preserved. There is no grace period, no "wait and see," and no deploying a candidate on the expectation that it might get better over time. The improved algorithm either demonstrably outperforms the current algorithm or it does not deploy.
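The per-row thresholds in the table can be checked mechanically. The sketch below is an illustration under assumptions: the source does not say how the four rows combine into the single `benchmarkResults.improvement` value, so this code only checks each row independently. The field names, the function `checkBenchmarkRows`, and the variance and transfer numbers are invented for the example; only the 0.7 and 0.9 baselines come from the table. Treating "lower variance = better" as "variance must not increase" is also an assumption.

```javascript
// Hypothetical per-row checks against the thresholds in the table above.
function checkBenchmarkRows(baseline, candidate) {
  const rows = {
    accuracy: candidate.accuracy > baseline.accuracy, // improvement > 0%
    speed: candidate.speedScore >= baseline.speedScore, // must not degrade
    consistency: candidate.variance <= baseline.variance, // assumption: no increase
    transfer: candidate.transferAccuracy > baseline.transferAccuracy, // improvement > 0%
  };
  const failed = Object.keys(rows).filter((name) => !rows[name]);
  return { rows, failed, pass: failed.length === 0 };
}

// Illustrative scores: better in-domain accuracy, but no transfer gain.
const baselineScores = { accuracy: 0.7, speedScore: 0.9, variance: 0.04, transferAccuracy: 0.55 };
const candidateScores = { accuracy: 0.78, speedScore: 0.9, variance: 0.03, transferAccuracy: 0.55 };
const rowCheck = checkBenchmarkRows(baselineScores, candidateScores);
```

In this example the candidate improves accuracy and consistency and holds speed, yet `rowCheck.pass` is `false` because the transfer row shows no gain. That is the specialization case the cross-domain benchmark exists to catch.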
Improvement History Tracking
The improvementHistory array accumulates every successful improvement:
```javascript
this.improvementHistory.push({
  timestamp: new Date(),
  improvement: benchmarkResults.improvement, // fractional (0.12 = 12%)
  algorithm: improvedAlgorithm,              // the new algorithm definition
  bottlenecksAddressed: bottlenecks,         // what was fixed
  benchmarks: benchmarkResults               // full benchmark record
});
```
This history enables retrospective analysis: which types of bottlenecks produce the largest improvements? How fast is the reasoning algorithm improving over time? Is improvement rate accelerating (positive compounding) or decelerating (hitting a ceiling)? Are certain classes of improvements more durable than others (improvements that persist across domain shifts)?
The history also serves a safety function: if a new improvement causes unexpected regression on problems that previous improvements had addressed, the history provides the rollback chain. The system can revert to any previous algorithm version, not just the immediately previous one.
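Reverting to an arbitrary earlier version could look like the following. This is a sketch, not the engine's rollback code: `rollbackTo` and its `initialAlgorithm` parameter are hypothetical, and the choice to drop later history entries is one option among several.

```javascript
// Hypothetical rollback helper over an improvementHistory-style array.
// Each entry holds the algorithm that was deployed at that point.
function rollbackTo(history, index, initialAlgorithm) {
  if (!Number.isInteger(index) || index < -1 || index >= history.length) {
    throw new RangeError(`No history entry at index ${index}`);
  }
  // Index -1 means "before any improvement was deployed".
  const algorithm = index === -1 ? initialAlgorithm : history[index].algorithm;
  // Keep entries up to and including the target; later ones are dropped.
  const trimmed = history.slice(0, index + 1);
  return { algorithm, history: trimmed };
}

// Illustrative history of three deployed versions.
const exampleHistory = [
  { improvement: 0.12, algorithm: 'v2' },
  { improvement: 0.15, algorithm: 'v3' },
  { improvement: 0.11, algorithm: 'v4' },
];
const restored = rollbackTo(exampleHistory, 0, 'v1');
```

Here `restored.algorithm` is `'v2'` and the returned history holds a single entry. The helper returns a new array instead of mutating its input; a real system might prefer to keep the dropped entries and append a rollback record so the regression itself stays visible in later analysis.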
RRI vs. RSI: The Key Distinction
RSI (Recursive Self-Improvement) modifies the codebase: the files on disk, the service implementations, the database schemas. RRI modifies the reasoning algorithm: the in-memory logic that governs how the system approaches problems. They are complementary improvement loops targeting different layers:
RSI: "The discovery service has a performance bottleneck in the literature retrieval step. Rewrite it to use parallel fetching." (Code change, persists on disk)
RRI: "The reasoning algorithm evaluates hypotheses sequentially. Modify it to evaluate in parallel and select the best result." (Algorithm change, persists in memory/EternalMemory)
An RSI improvement that makes the code faster enables RRI to run more reasoning iterations per unit time, which accelerates RRI's improvement cycle. An RRI improvement that makes reasoning more accurate makes RSI's code modification proposals more targeted and effective, which accelerates RSI's improvement cycle. They feed each other in a positive reinforcement loop when both are working correctly.
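The RRI example above, moving from sequential to parallel hypothesis evaluation, can be sketched with `Promise.all`. Everything here is illustrative: `scoreHypothesis` is a hypothetical stand-in that simulates evaluation with a timer, and the hypothesis objects with `id` and `prior` fields are assumptions.

```javascript
// Hypothetical scoring function: resolves to a score after a simulated delay.
const scoreHypothesis = (h) =>
  new Promise((resolve) => setTimeout(() => resolve({ id: h.id, score: h.prior }), 10));

// Pick the highest-scoring result (assumes a non-empty list).
const pickBest = (results) => results.reduce((best, r) => (r.score > best.score ? r : best));

// Sequential: each evaluation waits for the previous one to finish.
async function evaluateSequentially(hypotheses) {
  const results = [];
  for (const h of hypotheses) results.push(await scoreHypothesis(h));
  return pickBest(results);
}

// Parallel: start all evaluations at once, then select the best result.
async function evaluateInParallel(hypotheses) {
  const results = await Promise.all(hypotheses.map((h) => scoreHypothesis(h)));
  return pickBest(results);
}

const exampleHypotheses = [
  { id: 'h1', prior: 0.4 },
  { id: 'h2', prior: 0.9 },
  { id: 'h3', prior: 0.6 },
];
```

Both functions select the same hypothesis. The parallel version overlaps the waits, so its wall-clock time is governed by the slowest single evaluation instead of the sum of all of them, provided the evaluations are independent.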
The Target: >10% Measurable Improvement
The 10% threshold for a successful RRI cycle is deliberately calibrated. Below 10%, the improvement might be noise: within the variance of the benchmark suite and difficult to distinguish from measurement error. Above 10%, the improvement is large enough to attribute to the algorithm change rather than to variance. Note that this is a stricter bar than the Step 5 gate in improveReasoning(), which rejects a candidate only when the measured improvement is zero or negative.
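The deployment gate in `improveReasoning()` (reject at zero or below) and the 10% success target describe two different cut-offs. A minimal sketch of how a cycle might be classified against both follows; the function names, the labels, and the use of a mean over repeated runs are assumptions for illustration, not documented engine behavior.

```javascript
const SUCCESS_THRESHOLD = 0.1; // the 10% target described in this section

function mean(xs) {
  return xs.reduce((s, x) => s + x, 0) / xs.length;
}

// Relative improvement of the candidate over the current algorithm,
// computed from repeated benchmark runs of each.
function relativeImprovement(currentRuns, candidateRuns) {
  const base = mean(currentRuns);
  return (mean(candidateRuns) - base) / base;
}

// Hypothetical three-way classification of a cycle's outcome.
function classifyCycle(improvement) {
  if (improvement <= 0) return 'rejected'; // fails the Step 5 gate
  if (improvement <= SUCCESS_THRESHOLD) return 'possible-noise'; // positive, below target
  return 'successful';
}

// Illustrative runs: one clear gain and one marginal gain.
const currentRuns = [0.7, 0.72, 0.68];
const clearGain = relativeImprovement(currentRuns, [0.8, 0.79, 0.81]);
const marginalGain = relativeImprovement(currentRuns, [0.73, 0.72, 0.74]);
```

With these numbers `clearGain` is about 14% and classifies as `'successful'`, while `marginalGain` is about 4% and classifies as `'possible-noise'`.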
In practice, successful RRI cycles produce improvements in the 12-25% range on the specific bottleneck they target, with smaller improvements on other benchmarks (positive transfer) or negligible changes on unrelated benchmarks (no regression). The improvement on the targeted bottleneck is typically the largest; the cross-domain transfer improvement is typically 30-50% of the targeted improvement, representing the fraction of the gain that generalizes.
Why RRI Matters More Than Model Scale
The dominant approach to AI improvement in 2024-2026 has been scale: larger models, more parameters, more training data. Scale improvements are real: a larger model with more training data does reason better on average. But scale has two limitations that RRI does not share.
First, scale improvements require retraining the entire model, an expensive, time-consuming process that cannot happen continuously. RRI improvements happen in-session, continuously, at near-zero marginal cost. Every reasoning cycle can produce an improvement. The compounding effect of continuous small improvements (12-25% per cycle) over time can exceed the one-time benefit of a model scale jump.
Second, scale improves average reasoning quality across all tasks. RRI improves specific bottlenecks. For the discovery engine's use case (deep mathematical reasoning, cross-domain synthesis, adversarial hypothesis evaluation) the bottlenecks are specific and known. Improving reasoning in exactly those dimensions (rather than improving average quality across general NLP tasks) is more valuable than a scale improvement that spreads its benefit broadly.
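The compounding claim can be made concrete with a small calculation. This sketch assumes that gains multiply and that every cycle succeeds at the same rate. The source states neither assumption, and in practice each cycle targets a different bottleneck, so the result is an upper-bound illustration, not a prediction. `compoundedGain` is a name introduced here.

```javascript
// Cumulative gain after `cycles` successful cycles, assuming each cycle
// multiplies performance by (1 + ratePerCycle).
function compoundedGain(ratePerCycle, cycles) {
  return Math.pow(1 + ratePerCycle, cycles) - 1;
}

// Five cycles at the low and high ends of the 12-25% range.
const lowEnd = compoundedGain(0.12, 5);
const highEnd = compoundedGain(0.25, 5);
```

Under these assumptions, five cycles at 12% compound to roughly a 76% cumulative gain (1.12^5 ≈ 1.762), and five cycles at 25% to roughly 205% (1.25^5 ≈ 3.052).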
How RRI Interacts with the Discovery Engine
The TrueRRIEngine's benchmark suite was designed around the discovery engine's specific reasoning requirements. The cross-domain transfer benchmark tests whether reasoning improvements generalize across the domains where the engine operates (physics, mathematics, biology, chemistry). A reasoning improvement that only works for physics would not be deployed; the engine needs improvements that generalize.
| Reasoning Bottleneck | Where It Shows Up | RRI Target | Expected Gain |
|---|---|---|---|
| Sequential hypothesis evaluation | Long validation pipeline latency | Parallelize eval steps | Speed +60-70% |
| Weak cross-domain transfer | Low score on biology hypotheses after physics training | Improve domain bridging | Accuracy +10-15% |
| Inconsistent evidence weighting | Same evidence scores differently across runs | Normalize evidence weighting | Variance -40% |
| Shallow contradiction detection | Circular proofs pass verification | Deepen logical consistency checks | False positive rate -25% |
Each of these improvements has a direct effect on the discovery corpus quality: faster evaluation means more hypotheses processed per compute budget; better cross-domain transfer means higher scores on synthesis discoveries; better contradiction detection means fewer circular proofs enter the corpus (the Yang-Mills confinement-mass gap circularity from Article 4 would be caught by an improved contradiction detector).
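As an illustration of what a deeper logical consistency check might involve, circular proofs can be modeled as cycles in a claim dependency graph and found with a depth-first search. This is a standard graph technique offered as a sketch; the source does not describe how the engine detects circularity, and the `dependsOn` representation and the function name are assumptions.

```javascript
// Hypothetical representation: each claim maps to the claims it depends on.
// A circular proof appears as a cycle in this dependency graph.
function findCircularDependency(dependsOn) {
  const state = new Map(); // claim -> 'visiting' | 'done'
  const path = []; // claims on the current depth-first path

  function visit(claim) {
    if (state.get(claim) === 'done') return null;
    if (state.get(claim) === 'visiting') {
      // Reached a claim already on the current path: report the cycle.
      return [...path.slice(path.indexOf(claim)), claim];
    }
    state.set(claim, 'visiting');
    path.push(claim);
    for (const dep of dependsOn[claim] || []) {
      const cycle = visit(dep);
      if (cycle) return cycle;
    }
    path.pop();
    state.set(claim, 'done');
    return null;
  }

  for (const claim of Object.keys(dependsOn)) {
    const cycle = visit(claim);
    if (cycle) return cycle;
  }
  return null; // no circular dependency found
}

// Illustrative graphs with placeholder claim names.
const circularProof = { A: ['B'], B: ['C'], C: ['A'] };
const soundProof = { A: ['B', 'C'], B: ['C'], C: [] };
```

`findCircularDependency(circularProof)` returns the cycle as a list of claims that starts and ends at the same claim, while `findCircularDependency(soundProof)` returns `null`. A check of this kind only catches circularity that is explicit in the dependency structure; circular reasoning hidden inside a single step would need a different approach.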