
Correct Is Not Enough: Training Reasoning Planners with Executor-Grounded Rewards

Key Takeaways

  • The paper "Correct Is Not Enough: Training Reasoning Planners with Executor-Grounded Rewards" addresses a critical limitation in how we train large language models to reason: rewarding only final-answer correctness.
  • This outcome-only signal can reinforce traces that are right for the wrong reasons, overstate reasoning gains by rewarding shortcuts, and propagate flawed intermediate states in multi-step systems.
  • The authors propose TraceLift, a planner-executor training framework that treats reasoning as a consumable intermediate artifact.
  • During planner training, the planner emits tagged reasoning that a frozen executor turns into the final artifact for verifier feedback, while an executor-grounded reward shapes the intermediate trace.
Paper Abstract

Reinforcement learning with verifiable rewards has become a common way to improve explicit reasoning in large language models, but final-answer correctness alone does not reveal whether the reasoning trace is faithful, reliable, or useful to the model that consumes it. This outcome-only signal can reinforce traces that are right for the wrong reasons, overstate reasoning gains by rewarding shortcuts, and propagate flawed intermediate states in multi-step systems. To this end, we propose TraceLift, a planner-executor training framework that treats reasoning as a consumable intermediate artifact. During planner training, the planner emits tagged reasoning. A frozen executor turns this reasoning into the final artifact for verifier feedback, while an executor-grounded reward shapes the intermediate trace. This reward multiplies a rubric-based Reasoning Reward Model (RM) score by measured uplift on the same frozen executor, crediting traces that are both high-quality and useful. To make reasoning quality directly learnable, we introduce TRACELIFT-GROUPS, a rubric-annotated reason-only dataset built from math and code seed problems. Each example is a same-problem group containing a high-quality reference trace and multiple plausible flawed traces with localized perturbations that reduce reasoning quality or solution support while preserving task relevance. Extensive experiments on code and math benchmarks show that this executor-grounded reasoning reward improves the two-stage planner-executor system over execution-only training, suggesting that reasoning supervision should evaluate not only whether a trace looks good, but also whether it helps the model that consumes it.

The paper "Correct Is Not Enough: Training Reasoning Planners with Executor-Grounded Rewards" addresses a critical limitation in how we train large language models to reason. Currently, most models are trained using reinforcement learning that rewards them only if they reach the correct final answer. The authors argue that this "outcome-only" approach is flawed because it can reward models for reaching the right answer through incorrect reasoning, logical shortcuts, or "hallucinated" steps. To fix this, the researchers propose a new framework called TraceLift, which treats reasoning as a vital intermediate product that must be both high-quality and genuinely useful to the system that consumes it.

A Two-Stage Training Framework

TraceLift separates the model into two distinct roles: a "planner" that generates reasoning and an "executor" that uses that reasoning to produce a final result. During training, the executor is kept frozen. This forces the planner to learn how to write reasoning that actually helps the executor succeed. The training process uses a unique reward signal that combines two factors: a rubric-based score that evaluates the quality of the reasoning itself, and a "measured uplift" score that tracks whether the reasoning actually improves the executor's performance compared to a baseline where no reasoning is provided.
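A minimal sketch of this executor-grounded reward, assuming it multiplies the rubric-based quality score by the executor's measured uplift over a no-reasoning baseline. The function and parameter names (executor_grounded_reward, rubric_score, executor_accuracy) are illustrative placeholders, not the paper's implementation.

```python
from typing import Callable, Optional

def executor_grounded_reward(
    problem: str,
    trace: str,
    rubric_score: Callable[[str, str], float],                  # Reasoning RM: (problem, trace) -> quality score
    executor_accuracy: Callable[[str, Optional[str]], float],   # frozen executor: (problem, trace or None) -> accuracy
) -> float:
    """Credit traces that are both high-quality and genuinely useful."""
    # Rubric-based quality of the reasoning trace itself
    quality = rubric_score(problem, trace)

    # Executor success when conditioned on the planner's trace...
    with_trace = executor_accuracy(problem, trace)
    # ...versus the same frozen executor with no reasoning provided (assumed baseline)
    baseline = executor_accuracy(problem, None)

    # "Measured uplift": how much the trace actually helps the consuming model
    uplift = with_trace - baseline

    # Multiplicative combination: a trace scores well only if it is both
    # well-formed (rubric) and helpful (uplift)
    return quality * uplift
```

Because the combination is multiplicative, a polished-looking trace that does not move the executor earns little reward, and a helpful but sloppy trace is discounted in the same way.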

Learning from Targeted Mistakes

To teach the model what "good" reasoning looks like, the authors created a dataset called TraceLift-Groups. This dataset consists of 6,000 problems, each paired with a high-quality reference reasoning trace and several "flawed" versions. These flawed traces were created by intentionally introducing specific errors—such as arithmetic slips, missing edge cases, or incorrect logic—while keeping the overall task relevant. By training on these groups, the model learns to distinguish between reliable, high-quality reasoning and plausible-sounding but ultimately flawed logic.
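To make the dataset's structure concrete, here is a hypothetical shape for one same-problem group, inferred from the description above; the class names, fields, and example problem are illustrative assumptions, not the released schema.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class FlawedTrace:
    text: str        # plausible reasoning with a localized perturbation
    flaw_type: str   # e.g. "arithmetic slip", "missing edge case", "incorrect logic"

@dataclass
class TraceGroup:
    problem: str                # math or code seed problem
    reference_trace: str        # high-quality reference reasoning trace
    flawed_traces: List[FlawedTrace] = field(default_factory=list)

# Example group: the flawed variants stay on-topic but degrade reasoning quality
group = TraceGroup(
    problem="Compute the sum of the first 10 positive even integers.",
    reference_trace="The even integers are 2, 4, ..., 20; their sum is 2 * (1 + ... + 10) = 2 * 55 = 110.",
    flawed_traces=[
        FlawedTrace("The even integers sum to 2 * 55 = 100.", "arithmetic slip"),
        FlawedTrace("Sum the first 10 integers to get 55.", "incorrect logic"),
    ],
)
```

Grouping a reference trace with its perturbed variants gives the reward model a contrastive, same-problem signal about what separates reliable reasoning from plausible-sounding but flawed logic.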

Why Reasoning Quality Matters

The experimental results across various code and math benchmarks show that TraceLift consistently outperforms standard execution-only training. Because the executor remains fixed during testing, the improvements observed are directly attributable to the planner’s ability to provide better, more useful guidance. The researchers found that this approach is particularly effective for harder problems where small details in the reasoning process—such as identifying constraints or handling edge cases—are the deciding factors in whether the final output is correct.

Key Takeaways

The core message of this research is that we should stop evaluating reasoning solely by whether the final answer is correct. Instead, we must treat reasoning as a "consumable artifact." By rewarding models for the quality of their intermediate steps and their measurable impact on downstream tasks, we can build more reliable systems that are "right for the right reasons." This shift in perspective is essential for developing models that are not just good at guessing answers, but are truly capable of sound, verifiable, and helpful reasoning.
