Date: 2026-03-04

What Are Evolution Strategies (ES)?



Evolution Strategies (ES) are gradient-free optimization methods that update model parameters by directly exploring parameter space, rather than computing gradients through backpropagation.
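To make "exploring parameter space" concrete, here is a minimal sketch of one textbook-style ES update in NumPy. It is not necessarily the exact variant used in this work. The names es_step and toy_reward, the population size, the noise scale sigma, and the learning rate are all illustrative assumptions, and the toy reward stands in for whatever outcome-level score a real task would provide.

```python
import numpy as np

def es_step(theta, reward_fn, rng, pop_size=32, sigma=0.1, lr=0.02):
    """One ES update: perturb the parameters, score each perturbed copy,
    then move toward the perturbations that scored better."""
    # One Gaussian noise vector per population member (hypothetical settings).
    noise = rng.standard_normal((pop_size, theta.size))
    # Each candidate is scored only by its final reward; no gradients are computed.
    rewards = np.array([reward_fn(theta + sigma * eps) for eps in noise])
    # Standardize rewards so the step size does not depend on the reward scale.
    scores = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    # Reward-weighted average of the noise directions.
    return theta + lr / (pop_size * sigma) * (noise.T @ scores)

def toy_reward(params):
    """Hypothetical reward: higher when every parameter is close to 3."""
    return -np.sum((params - 3.0) ** 2)

rng = np.random.default_rng(0)
theta = np.zeros(5)
initial = toy_reward(theta)
for _ in range(300):
    theta = es_step(theta, toy_reward, rng)
final = toy_reward(theta)
print(f"reward before: {initial:.2f}, after: {final:.2f}")
```

The only thing the update needs from each candidate is a single scalar reward, which is why the approach does not depend on backpropagation through the model.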

Unlike reinforcement learning (RL) methods such as PPO or GRPO, ES:

Does not require gradient synchronization across machines

Optimizes directly on outcome-level rewards

Avoids token-level credit assignment

Operates entirely in parameter space

This removes several sources of instability common in RL-based post-training.
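The first point in the list above can be illustrated with a seed-sharing scheme that is common in distributed ES implementations. The text here does not say that this is the mechanism used, so treat the following as a sketch under that assumption. Each worker regenerates its perturbation from an integer seed and reports one scalar reward, and the coordinator rebuilds the same perturbations from the seeds. The helper names perturbation, worker, and coordinator_update are hypothetical.

```python
import numpy as np

def perturbation(seed, dim):
    """Regenerate a noise vector from its integer seed alone."""
    return np.random.default_rng(seed).standard_normal(dim)

def worker(theta, seed, sigma, reward_fn):
    """Runs on one machine: evaluate one perturbed model and return a scalar."""
    return float(reward_fn(theta + sigma * perturbation(seed, theta.size)))

def coordinator_update(theta, seeds, rewards, sigma, lr):
    """Rebuild every perturbation from its seed and apply a reward-weighted step.
    Only (seed, reward) pairs cross the network, never gradient tensors."""
    rewards = np.asarray(rewards, dtype=float)
    scores = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    step = np.zeros_like(theta)
    for seed, score in zip(seeds, scores):
        step += score * perturbation(seed, theta.size)
    return theta + lr / (len(seeds) * sigma) * step

def toy_reward(params):
    """Hypothetical outcome-level reward for illustration only."""
    return -np.sum((params - 1.0) ** 2)

theta = np.zeros(4)
seeds = list(range(16))
# In a real deployment each call would run on a different machine.
rewards = [worker(theta, s, 0.1, toy_reward) for s in seeds]
new_theta = coordinator_update(theta, seeds, rewards, sigma=0.1, lr=0.02)
```

In this sketch the per-worker message is two numbers regardless of model size, whereas gradient-based training has to synchronize a tensor as large as the parameter vector.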

From Proof of Scale to Generalization

Our first demonstration that Evolution Strategies could fine-tune billion-parameter language models without backpropagation challenged a long-standing assumption in the field: that gradient-free methods cannot operate effectively in extremely high-dimensional parameter spaces.

Those results demonstrated that ES was not only viable at the billion-parameter scale, but competitive with state-of-the-art reinforcement learning (RL) across stability, robustness, tolerance to long-horizon outcome rewards, and training cost. We validated this across multiple fine-tuning settings, including symbolic reasoning and behavioral objectives that exposed known weaknesses in RL such as reward hacking and instability across runs.

Why Scaling Alone Is Not Sufficient

A method does not become a foundation for post-training because it succeeds in a few controlled settings. It must prove that its strengths persist as tasks diversify, models vary, and problem structure becomes more demanding.

In this expanded version of Evolution Strategies at Scale, we substantially broaden the evaluation. We extend ES across new reasoning domains and more demanding benchmarks to assess whether its core advantages remain intact under greater complexity.

Taken together, these experiments move ES from an initial proof of capability to a broader demonstration of generality, showing that the stability and robustness of gradient-free optimization persist across diverse reasoning domains.