Date: 2025-10-03

Fine-tuning is essential for every modern large language model. GPT, Claude, LLaMA, DeepSeek, Qwen – all of them are trained first on massive amounts of general data, then fine-tuned to further elicit reasoning capability or become better aligned with human intent.

Reinforcement learning (RL) is currently the primary choice for such fine-tuning. Techniques like PPO and GRPO dominate the field and have shaped nearly every production-grade LLM in use.

But RL comes with known limitations. It struggles with sparse, long-horizon rewards. It's sensitive to model initialization and hyperparameter choices. And without careful constraints, it’s prone to reward hacking, leading to undesirable behaviors.

We explore a long-standing but largely overlooked alternative to reinforcement learning: evolution strategies (ES). Earlier work from OpenAI demonstrated the promise of ES on control and optimization problems. Our paper goes further and presents the first successful use of ES to fine-tune the full parameter set of large language models, including models with billions of parameters. The results show that ES can outperform state-of-the-art RL methods on several key dimensions: sample efficiency, tolerance of long-horizon rewards, robustness across different base LLMs, resistance to reward hacking, and stability of performance across runs.

This approach is not just an optimization tweak. It represents a new direction for how we fine-tune LLMs at scale and points toward simpler, more reliable, and more adaptable post-training techniques for future generations of AI systems.

A scalable framework for full-parameter fine-tuning using ES

Prior work using evolution strategies typically limited the model size (up to millions of parameters) or reduced optimization dimensionality, such as tuning only the output layer or adapters. This new approach instead scales ES to full-parameter fine-tuning of transformers with billions of parameters, without relying on gradients or backpropagation.
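
To make "without relying on gradients or backpropagation" concrete, here is a minimal sketch of a single ES update on a toy problem. It assumes a vanilla Gaussian-perturbation variant of ES with normalized rewards; it is not the implementation from the paper, and the names `es_step` and `toy_reward` as well as all hyperparameter values are hypothetical choices for illustration. In the LLM setting, `theta` would stand for the full set of model weights and the reward would come from evaluating the perturbed model on a task.

```python
import numpy as np

def es_step(theta, reward_fn, rng, population=30, sigma=0.1, lr=0.02):
    """One vanilla ES update: perturb, evaluate, recombine. No gradients."""
    # Sample one Gaussian perturbation per population member.
    noise = rng.standard_normal((population, theta.size))
    # Score each perturbed parameter vector with the black-box reward.
    rewards = np.array([reward_fn(theta + sigma * eps) for eps in noise])
    # Normalize rewards so the update does not depend on the reward scale.
    scores = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    # Move theta along the score-weighted average of the perturbations.
    step = (noise.T @ scores) / (population * sigma)
    return theta + lr * step

def toy_reward(params):
    # Hypothetical stand-in for a task reward; it peaks when every entry is 3.
    return -np.sum((params - 3.0) ** 2)

rng = np.random.default_rng(0)
theta = np.zeros(5)
initial = toy_reward(theta)
for _ in range(300):
    theta = es_step(theta, toy_reward, rng)
final = toy_reward(theta)
print(f"reward before: {initial:.2f}, after: {final:.2f}")
```

The only thing the update needs from the model is a scalar reward per perturbed copy, which is why the same loop applies in principle whether the reward is dense, sparse, or only available at the end of a long episode.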