
Correct Is Not Enough: Training Reasoning Planners with Executor-Grounded Rewards

Key Takeaways

  • The paper "Correct Is Not Enough: Training Reasoning Planners with Executor-Grounded Rewards" addresses a critical limitation in how we train large language models to reason: rewarding only final-answer correctness.
  • This outcome-only signal can reinforce traces that are right for the wrong reasons, overstate reasoning gains by rewarding shortcuts, and propagate flawed intermediate states in multi-step systems.
  • The authors propose TraceLift, a planner-executor training framework that treats reasoning as a consumable intermediate artifact.
  • During planner training, the planner emits tagged reasoning that a frozen executor turns into the final artifact for verifier feedback, while an executor-grounded reward shapes the intermediate trace.
Paper Abstract

Reinforcement learning with verifiable rewards has become a common way to improve explicit reasoning in large language models, but final-answer correctness alone does not reveal whether the reasoning trace is faithful, reliable, or useful to the model that consumes it. This outcome-only signal can reinforce traces that are right for the wrong reasons, overstate reasoning gains by rewarding shortcuts, and propagate flawed intermediate states in multi-step systems. To this end, we propose TraceLift, a planner-executor training framework that treats reasoning as a consumable intermediate artifact. During planner training, the planner emits tagged reasoning. A frozen executor turns this reasoning into the final artifact for verifier feedback, while an executor-grounded reward shapes the intermediate trace. This reward multiplies a rubric-based Reasoning Reward Model (RM) score by measured uplift on the same frozen executor, crediting traces that are both high-quality and useful. To make reasoning quality directly learnable, we introduce TRACELIFT-GROUPS, a rubric-annotated reason-only dataset built from math and code seed problems. Each example is a same-problem group containing a high-quality reference trace and multiple plausible flawed traces with localized perturbations that reduce reasoning quality or solution support while preserving task relevance. Extensive experiments on code and math benchmarks show that this executor-grounded reasoning reward improves the two-stage planner-executor system over execution-only training, suggesting that reasoning supervision should evaluate not only whether a trace looks good, but also whether it helps the model that consumes it.

The paper "Correct Is Not Enough: Training Reasoning Planners with Executor-Grounded Rewards" addresses a critical limitation in how we train large language models to reason. Currently, most models are trained using reinforcement learning that rewards them only if they reach the correct final answer. The authors argue that this "outcome-only" approach is flawed because it can reward models for reaching the right answer through incorrect reasoning, logical shortcuts, or "hallucinated" steps. To fix this, the researchers propose a new framework called TraceLift, which treats reasoning as a vital intermediate product that must be both high-quality and genuinely useful to the system that consumes it.

A Two-Stage Training Framework

TraceLift separates the model into two distinct roles: a "planner" that generates reasoning and an "executor" that uses that reasoning to produce a final result. During training, the executor is kept frozen. This forces the planner to learn how to write reasoning that actually helps the executor succeed. The training process uses a unique reward signal that combines two factors: a rubric-based score that evaluates the quality of the reasoning itself, and a "measured uplift" score that tracks whether the reasoning actually improves the executor's performance compared to a baseline where no reasoning is provided.
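A minimal sketch of this executor-grounded reward, assuming it multiplies the rubric-based quality score by the executor's measured uplift over a no-reasoning baseline. The function and parameter names (executor_grounded_reward, rubric_score, executor_accuracy) are illustrative placeholders, not the paper's implementation.

```python
from typing import Callable, Optional

def executor_grounded_reward(
    problem: str,
    trace: str,
    rubric_score: Callable[[str, str], float],                  # Reasoning RM: (problem, trace) -> quality score
    executor_accuracy: Callable[[str, Optional[str]], float],   # frozen executor: (problem, trace or None) -> accuracy
) -> float:
    """Credit traces that are both high-quality and genuinely useful."""
    # Rubric-based quality of the reasoning trace itself
    quality = rubric_score(problem, trace)

    # Executor success when conditioned on the planner's trace...
    with_trace = executor_accuracy(problem, trace)
    # ...versus the same frozen executor with no reasoning provided (assumed baseline)
    baseline = executor_accuracy(problem, None)

    # "Measured uplift": how much the trace actually helps the consuming model
    uplift = with_trace - baseline

    # Multiplicative combination: a trace scores well only if it is both
    # well-formed (rubric) and helpful (uplift)
    return quality * uplift
```

Because the combination is multiplicative, a polished-looking trace that does not move the executor earns little reward, and a helpful but sloppy trace is discounted in the same way.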

Learning from Targeted Mistakes

To teach the model what "good" reasoning looks like, the authors created a dataset called TraceLift-Groups. This dataset consists of 6,000 problems, each paired with a high-quality reference reasoning trace and several "flawed" versions. These flawed traces were created by intentionally introducing specific errors—such as arithmetic slips, missing edge cases, or incorrect logic—while keeping the overall task relevant. By training on these groups, the model learns to distinguish between reliable, high-quality reasoning and plausible-sounding but ultimately flawed logic.
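To make the dataset's structure concrete, here is a hypothetical shape for one same-problem group, inferred from the description above; the class names, fields, and example problem are illustrative assumptions, not the released schema.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class FlawedTrace:
    text: str        # plausible reasoning with a localized perturbation
    flaw_type: str   # e.g. "arithmetic slip", "missing edge case", "incorrect logic"

@dataclass
class TraceGroup:
    problem: str                # math or code seed problem
    reference_trace: str        # high-quality reference reasoning trace
    flawed_traces: List[FlawedTrace] = field(default_factory=list)

# Example group: the flawed variants stay on-topic but degrade reasoning quality
group = TraceGroup(
    problem="Compute the sum of the first 10 positive even integers.",
    reference_trace="The even integers are 2, 4, ..., 20; their sum is 2 * (1 + ... + 10) = 2 * 55 = 110.",
    flawed_traces=[
        FlawedTrace("The even integers sum to 2 * 55 = 100.", "arithmetic slip"),
        FlawedTrace("Sum the first 10 integers to get 55.", "incorrect logic"),
    ],
)
```

Grouping a reference trace with its perturbed variants gives the reward model a contrastive, same-problem signal about what separates reliable reasoning from plausible-sounding but flawed logic.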

Why Reasoning Quality Matters

The experimental results across various code and math benchmarks show that TraceLift consistently outperforms standard execution-only training. Because the executor remains fixed during testing, the improvements observed are directly attributable to the planner’s ability to provide better, more useful guidance. The researchers found that this approach is particularly effective for harder problems where small details in the reasoning process—such as identifying constraints or handling edge cases—are the deciding factors in whether the final output is correct.

Key Takeaways

The core message of this research is that we should stop evaluating reasoning solely by whether the final answer is correct. Instead, we must treat reasoning as a "consumable artifact." By rewarding models for the quality of their intermediate steps and their measurable impact on downstream tasks, we can build more reliable systems that are "right for the right reasons." This shift in perspective is essential for developing models that are not just good at guessing answers, but are truly capable of sound, verifiable, and helpful reasoning.
