EvoLM: Self-Evolving Language Models through Co-Evolved Discriminative Rubrics

Key Takeaways

  • Current post-training methods rely on external supervision (human annotations, proprietary models, or scalar reward models) to produce reward signals, and each imposes a ceiling.
  • Human judgment cannot supervise capabilities beyond its own, proprietary APIs create dependencies, and verifiable rewards cover only domains with ground-truth answers.
  • Self-improvement from a model's own evaluative capacity is a reward source that scales with the model itself, yet remains largely untapped by current methods.
  • We introduce EVOLM, a post-training method that structures this capacity into explicit discriminative rubrics and uses them as training signal.
  • All preference signals are constructed from the policy's own outputs via temporal contrast with earlier checkpoints, requiring no human annotation or external supervision.
Paper Abstract

Language models encode substantial evaluative knowledge from pretraining, yet current post-training methods rely on external supervision (human annotations, proprietary models, or scalar reward models) to produce reward signals. Each imposes a ceiling. Human judgment cannot supervise capabilities beyond its own, proprietary APIs create dependencies, and verifiable rewards cover only domains with ground-truth answers. Self-improvement from a model's own evaluative capacity is a reward source that scales with the model itself, yet remains largely untapped by current methods. We introduce EVOLM, a post-training method that structures this capacity into explicit discriminative rubrics and uses them as training signal. EVOLM trains two capabilities within a single language model in alternation: (1) a rubric generator producing instance-specific evaluation criteria optimized for discriminative utility, which maximizes a small frozen judge's ability to distinguish preferred from dispreferred responses; and (2) a policy trained using those rubric-conditioned scores as reward. All preference signals are constructed from the policy's own outputs via temporal contrast with earlier checkpoints, requiring no human annotation or external supervision. EVOLM trains a Qwen3-8B model to generate rubrics that outperform GPT-4.1 on RewardBench-2 by 25.7%. The co-trained policy achieves 69.3% average on the OLMo3-Adapt suite, outperforming policies trained with GPT-4.1 prompted rubrics by 3.9% and with the state-of-the-art 8B reward model SkyWork-RM by 16%. Overall, EVOLM demonstrates that structuring a model's evaluative capacity into co-evolving discriminative rubrics enables self-improvement without external supervision.
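The abstract does not state these objectives in formulas; the following is one plausible formalization, with symbols introduced here rather than taken from the paper: a frozen judge \(J\), an instance-specific rubric \(c\) generated for prompt \(x\), a preferred response \(y^{+}\), and a dispreferred response \(y^{-}\).

\[
r_{\text{rubric}}(c \mid x) \;=\; J(y^{+} \mid x, c) \;-\; J(y^{-} \mid x, c),
\qquad
r_{\text{policy}}(y \mid x) \;=\; J(y \mid x, c).
\]

Under this reading, a rubric is rewarded for widening the judge's score margin between the preferred and dispreferred response (its discriminative utility), while the policy is rewarded by the rubric-conditioned judge score itself.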

EvoLM: Self-Evolving Language Models through Co-Evolved Discriminative Rubrics

Current methods for improving language models after their initial training often rely on external supervision: human feedback, proprietary AI models, scalar reward models, or verifiable rewards that exist only in domains with ground-truth answers. Each of these creates a ceiling on performance, because the model cannot easily surpass the quality of its supervisors. This paper introduces EvoLM, a post-training framework that lets a language model improve itself by drawing on its own internal evaluative knowledge, removing the need for human annotation or external supervision.

The Self-Evolving Mechanism

EvoLM functions by training two distinct capabilities within a single language model in an alternating cycle. First, the model acts as a "rubric generator," creating specific evaluation criteria for a given task. These rubrics are optimized to help a small, frozen judge model effectively distinguish between high-quality and low-quality responses. Second, the model acts as a policy that is trained using the scores derived from these rubrics as a reward signal. By using temporal contrast—comparing the model's current outputs against its own earlier versions—the system generates its own preference signals, allowing it to refine its performance autonomously.
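To make the cycle concrete, here is a minimal Python sketch of one plausible reading of this loop. It is not the authors' implementation: every name in it (respond, generate_rubric, judge_score, and the two update placeholders) is a hypothetical stand-in, and treating the current checkpoint's output as preferred over an earlier checkpoint's output is only an assumption about how the temporal contrast is constructed.

```python
"""Schematic sketch of an EvoLM-style alternating update (illustrative only).

All functions are hypothetical stand-ins. In EvoLM the rubric generator and
the policy are two capabilities of the same underlying model; they appear here
as separate helper functions purely for readability.
"""
import random


def respond(checkpoint: str, prompt: str) -> str:
    # Stand-in for sampling a response from a given policy checkpoint.
    return f"[{checkpoint}] answer to: {prompt}"


def generate_rubric(prompt: str) -> str:
    # Stand-in for the rubric generator emitting instance-specific criteria.
    return f"Criteria for judging answers to: {prompt}"


def judge_score(rubric: str, response: str) -> float:
    # Stand-in for the small frozen judge scoring a response against a rubric.
    return random.random()


def update_rubric_generator(reward: float) -> None:
    # Placeholder for reinforcing rubrics with high discriminative utility.
    pass


def update_policy(reward: float) -> None:
    # Placeholder for updating the policy with rubric-conditioned scores as reward.
    pass


prompts = ["Explain gradient clipping.", "Summarize the causes of inflation."]
checkpoints = ["policy_step_0"]  # earlier snapshots of the same model

for step in range(1, 4):
    current, earlier = f"policy_step_{step}", checkpoints[-1]
    for prompt in prompts:
        # Temporal contrast (assumed construction): the current checkpoint's
        # output is preferred, an earlier checkpoint's output dispreferred.
        preferred = respond(current, prompt)
        dispreferred = respond(earlier, prompt)

        # (1) Rubric-generator step: the reward is the judge's score margin,
        #     i.e. how well the rubric separates the two responses.
        rubric = generate_rubric(prompt)
        margin = judge_score(rubric, preferred) - judge_score(rubric, dispreferred)
        update_rubric_generator(reward=margin)

        # (2) Policy step: the rubric-conditioned judge score is the reward.
        update_policy(reward=judge_score(rubric, respond(current, prompt)))

    checkpoints.append(current)  # trained model becomes the next contrast baseline
```

In a real run, the two update placeholders would be actual optimization steps applied in alternation to a single underlying model, with the judge kept small and frozen throughout.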

Performance and Results

The researchers tested the EvoLM approach using a Qwen3-8B model. The results demonstrate significant improvements over existing benchmarks. The model’s generated rubrics outperformed GPT-4.1 on the RewardBench-2 dataset by 25.7%. Furthermore, the policy trained through this method achieved a 69.3% average on the OLMo3-Adapt suite. This performance surpassed policies guided by GPT-4.1 prompted rubrics by 3.9% and outperformed the state-of-the-art 8B reward model, SkyWork-RM, by 16%.

Why This Matters

The primary contribution of EvoLM is the demonstration that language models can effectively "self-supervise" their own improvement. By structuring evaluative capacity into explicit, co-evolving rubrics, the model moves beyond the limitations of human-dependent training. This approach suggests that as models grow more capable, their ability to evaluate and improve their own outputs can scale alongside them, potentially unlocking new levels of performance without the bottlenecks associated with traditional, externally supervised training methods.
