Authors: Kaiyue Wen, Xingyu Dang, Kaifeng Lyu, Tengyu Ma, Percy Liang
We propose an optimizer wrapper called Hyperball that normalizes the Frobenius norm of both the weights and the optimizer updates of all matrices in the neural network throughout training, instead of using weight decay. This operation leads to a 20-30% speedup over weight decay and enables hyperparameter transfer across widths and depths.
In our previous paper *Fantastic Pretraining Optimizers and Where to Find Them*, we observed that the speedups of matrix-based optimizers such as Muon over AdamW shrink from 30% to only 10% as model size and data scale grow. We have been searching for a way to preserve those speedups at larger compute scales ever since.
It turns out the solution is extremely simple. We introduce a simple optimizer wrapper that enforces constant weight and update norms, transforming any base optimizer into its Hyperball variant (e.g., Muon → Muon Hyperball). This small change leads to two empirical benefits: (1) it preserves optimizer speedups across scales, and (2) it allows hyperparameters to transfer without retuning.
Most modern LLM training uses weight decay, which controls the size of the weights implicitly. Let $W_t$ be the weight matrix at step $t$, $u_t$ be the update provided by a base optimizer (e.g., from Adam), $\eta$ be the learning rate, and $\lambda$ be the weight decay coefficient. The standard update rule is:
$$ W_{t+1} = (1 - \eta \lambda) W_t - \eta u_t $$
Here $-\eta u_t$ adds the new update information; without weight decay, this typically leads to a growing weight norm. The term $(1 - \eta\lambda)$ softly controls the norm by shrinking the weights toward zero at every step.
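As a concrete reference point, here is a minimal PyTorch-style sketch of this decoupled weight decay step; the function and variable names (`weight_decay_step`, `W`, `u`) are illustrative, not the exact code used in our training runs.

```python
import torch

def weight_decay_step(W: torch.Tensor, u: torch.Tensor, lr: float, wd: float) -> torch.Tensor:
    """Decoupled weight decay update: W <- (1 - lr * wd) * W - lr * u.

    `u` is the raw update from the base optimizer (e.g., Adam).
    The (1 - lr * wd) factor shrinks the weights toward zero a little
    at every step, only softly controlling their norm.
    """
    return (1 - lr * wd) * W - lr * u
```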
Hyperball replaces this soft control on weight norm with an explicit constraint. It decouples the magnitude of the weights from the direction of the update entirely. To define the update, we first introduce the following notation:
- $R$: the initial Frobenius norm of the weight matrix.
- $\mathrm{Normalize}(x) = x / \|x\|_F$: a projection operator that maps a matrix onto the unit Frobenius-norm sphere (so $R \cdot \mathrm{Normalize}(x)$ has Frobenius norm $R$).

The Hyperball update rule is defined as:
$$ W_{t+1} = R \cdot \text{Normalize}\left(W_t - \eta R \cdot \text{Normalize}(u_t) \right) $$
Geometrically, Hyperball constrains the optimization trajectory to lie strictly on the surface of a hypersphere with radius $R$. The update takes a step of length $\eta R$ in the direction defined by the normalized update $-\mathrm{Normalize}(u_t)$, and the result is immediately projected back onto the sphere. This ensures that the norms of the weights and updates remain constant, while the base optimizer only determines the direction of each step.

Here $u_t$ can be the update from any base optimizer. In this blog, we focus on two variants: Adam-Hyperball (AdamH) and Muon-Hyperball (MuonH).
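To make the rule concrete, here is a minimal PyTorch sketch of the Hyperball update applied to a single weight matrix; the names (`hyperball_step`, `frobenius_normalize`) and the `eps` guard are our own illustration under the definitions above, not the API of a released implementation.

```python
import torch

def frobenius_normalize(x: torch.Tensor, eps: float = 1e-12) -> torch.Tensor:
    """Normalize(x) = x / ||x||_F, mapping x onto the unit Frobenius sphere."""
    return x / (x.norm() + eps)

def hyperball_step(W: torch.Tensor, u: torch.Tensor, lr: float, R: float) -> torch.Tensor:
    """One Hyperball update for a single weight matrix.

    W:  current weights (Frobenius norm R)
    u:  raw update from the base optimizer (e.g., Adam or Muon)
    lr: learning rate
    R:  initial Frobenius norm of W, kept fixed throughout training
    """
    # Take a step of length lr * R along the normalized update direction ...
    step = W - lr * R * frobenius_normalize(u)
    # ... then project straight back onto the sphere of radius R.
    return R * frobenius_normalize(step)
```

In practice, $R$ would be recorded once per weight matrix at initialization, and this step applied to every matrix after the base optimizer has produced its update $u_t$.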
Empirical Tips: