Aurora: A Leverage-Aware Optimizer for Rectangular Matrices | Tilde


5.05.2026

Alec Dewulf*,  Dhruv Pai*,  Li Yang*,  Ashley Zhang*,  Ben Keigwin

TL;DR

  • We show that Muon's update inherits row-norm anisotropy on tall matrices which can cause a significant portion of neurons in MLP layers to permanently die. Row normalizing updates fixes this, but at the cost of orthogonality.
  • We formulate the problem of steepest descent under the joint constraint of row-norm uniformity and orthogonality, and present Aurora optimizer as a solution.
  • We use Aurora to train a 1.1B model, which achieves 100x data efficiency on open-source internet data and outperforms larger models on general evals like HellaSwag.
  • We submitted a PR to the modded-nanoGPT speedrun where Aurora outperforms the current SoTA run.
  • Untuned Aurora is only a 6% overhead over traditional Muon, and a drop-in replacement.
  • We open-source code for both Riemannian and vanilla Aurora: github.com/tilde-research/aurora-release.


Outline

  1. Introduction — Muon's success and the NorMuon puzzle
  2. Tension Due to Row Normalization — Row normalization competes with orthogonality on tall matrices
  3. Normalization Prevents Neuron Death — Muon kills MLP neurons; leverage scores explain why
  4. Augmenting the Muon Objective — Deriving Aurora from a joint constraint
  5. Results — nanoGPT speedrun SoTA, 1B pretraining, and downstream evals
  6. Inflection — Why this matters

1: Introduction

The Muon optimizer [1] first attracted attention due to its success in the nanoGPT speedrun competition [2], where it outperformed AdamW [3] in wall-clock time to convergence despite requiring more computation per step. Since then, substantial effort has gone into scaling Muon through improved distributed implementations [4], more accurate orthogonalization routines [5] [6], and methods to reduce compute and communication overhead [4]. The result is that Muon (including clipped and per-head versions) has become an increasingly popular choice for training frontier-scale models [7] [8] [18].

Recently, there has been a surge in Muon variants pushing the state-of-the-art on open research benchmarks [9] [10]. One example is NorMuon [9], which is currently SoTA on the modded-nanoGPT speedrun. NorMuon augments Muon with an additional step that scales each row by its inverse RMS norm, analogously to Adam's per-parameter normalization. Notably, this row normalization step can move the Muon update quite far from the true orthogonal gradient. The fact that NorMuon still succeeds suggests that there may be a gap in the Muon formulation that is being addressed by row normalization.

We study the effects of row normalization and find that Muon can result in neuron death in MLP layers, whereby some neurons receive persistently small updates early in training and fail to recover. We show that this failure mode can be avoided by redistributing mass equally across rows for updates to the up and gate projections in the MLP layers. Motivated by this observation, we propose Aurora, which uses this mechanism to prevent neuron death without sacrificing the precision of the gradient orthogonalization. We show that Aurora achieves large gains over both Muon and NorMuon at 1B scale and outperforms the current SoTA on the modded-nanoGPT optimization track. Furthermore, we find that Aurora's performance gains scale with MLP width, leading us to believe that Aurora is particularly effective for networks with large MLP expansion factors.

2: Tension Due to Row Normalization

The core algorithmic component in Muon is an iterative algorithm to compute the polar factor of a matrix. For $G \in \mathbb{R}^{m \times n}$ with thin Singular Value Decomposition (SVD) $G = U\Sigma V^\top$, the polar factor $\operatorname{polar}(G) = UV^\top$ is the closest semi-orthogonal matrix to $G$ in Frobenius norm. If $G$ is the gradient of a parameter $W$, then Muon forms $\operatorname{polar}(G) = UV^\top$ and updates $W \gets W - \eta\, UV^\top$ for a learning rate $\eta$.

The existence of matmul-only iterative algorithms for computing $\operatorname{polar}(G)$ [5] [6] [1] is largely what makes Muon feasible at scale. We focus on two such algorithms here: the quintic iteration from modded-nanoGPT [21] and Polar Express with 8 steps (PE-8) [5]. We give a more thorough overview of different algorithms for computing the polar factor in Appendix A.
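For concreteness, here is a minimal PyTorch sketch of a quintic Newton-Schulz orthogonalization in the style of modded-nanoGPT. The coefficient triple below is the commonly used one and is meant as an illustration rather than as the exact routine benchmarked in this post.

```python
import torch

def newton_schulz_polar(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Approximate polar(G) = U V^T with a matmul-only quintic iteration.

    Coefficients are the widely used modded-nanoGPT quintic triple;
    treat them as illustrative rather than the exact recipe from this post.
    """
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G.to(torch.bfloat16)
    transposed = X.shape[0] > X.shape[1]
    if transposed:                       # iterate on the wide orientation
        X = X.T
    X = X / (X.norm() + 1e-7)            # Frobenius normalization ensures spectral norm <= 1
    for _ in range(steps):
        A = X @ X.T
        B = b * A + c * (A @ A)
        X = a * X + B @ X
    if transposed:
        X = X.T
    return X.to(G.dtype)
```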

We find that in our training settings (described in Appendix B), Muon achieves monotonically better downstream loss under more precise orthogonalization, i.e. under iterative algorithms that more accurately approximate $\operatorname{polar}(G)$. We plot Muon runs using the least precise iteration we test (quintic) and the most precise (PE-8) in Figure 1 to illustrate this gap. This result confirms that polar factor precision is an important component of strong Muon implementations.


Figure 1. Muon achieves monotonically lower loss with more precise orthogonalization routines. PE-8 substantially outperforms the quintic iteration at 1B scale.

NorMuon augments Muon with an additional step that scales each row of the polar factor to have unit RMS norm.

$$
\begin{aligned} &\textbf{Algorithm: NorMuon} \\ &\textbf{Require: } \text{Initial weights } \mathbf{W}_0 \in \mathbb{R}^{m \times n},\; \text{loss } L,\; \text{learning rate } \eta,\; \text{momentum } (\beta_1, \beta_2),\; \text{perturbation } \varepsilon,\; \text{weight decay } \lambda \\ &\text{Initialize } \mathbf{M}_0 \in \mathbb{R}^{m \times n} \leftarrow \mathbf{0},\; \mathbf{v}_0 \in \mathbb{R}^m \leftarrow \mathbf{0} \\ &\textbf{for } t = 1, 2, \ldots \textbf{ do} \\ &\quad \mathbf{G}_t \leftarrow \nabla_{\mathbf{W}} L(\mathbf{W}_t) \\ &\quad \mathbf{M}_t \leftarrow \beta_1 \mathbf{M}_{t-1} + (1 - \beta_1)\, \mathbf{G}_t \\ &\quad \mathbf{O}_t \leftarrow \mathrm{polar}(\mathbf{M}_t) \\ &\quad {\color{#2563EB} \mathbf{v}_t \leftarrow \beta_2 \mathbf{v}_{t-1} + (1 - \beta_2)\, \operatorname{mean\_cols}(\mathbf{O}_t \odot \mathbf{O}_t)} \\ &\quad {\color{#2563EB} \mathbf{V}_t \leftarrow \mathrm{ExpandRows}(\mathbf{v}_t) \qquad (\mathbf{V}_t \in \mathbb{R}^{m \times n})} \\ &\quad {\color{#2563EB} \widehat{\mathbf{O}}_t \leftarrow \mathbf{O}_t \oslash \bigl(\sqrt{\mathbf{V}_t} + \varepsilon\bigr)} \\ &\quad \hat{\eta} = 0.2\, \eta \sqrt{mn}\, /\, \|\widehat{\mathbf{O}}_t\|_F \\ &\quad \mathbf{W}_{t+1} \leftarrow \mathbf{W}_t - \eta \lambda \mathbf{W}_t - \hat{\eta}\, \widehat{\mathbf{O}}_t \\ &\textbf{end for} \end{aligned}
$$

We show that for tall matrices, this row normalization can significantly reduce polar factor precision, which is undesirable both theoretically and in light of our previous results (Figure 1).

Claim 1. A tall matrix cannot be simultaneously column-orthogonal and have uniform unit row norms. That is, if $\mathcal{M} \in \mathbb{R}^{m \times n}$ with $m > n$ satisfies $\mathcal{M}^\top \mathcal{M} = I_n$, then $\mathcal{M}$ cannot have all its row norms equal to one.

Proof. Suppose by contradiction that all the rows of $\mathcal{M}$ have norm one. Then

$$
\operatorname{tr}(\mathcal{M}^\top\mathcal{M}) = \operatorname{tr}(I_n) = n \quad \text{and} \quad \operatorname{tr}(\mathcal{M}^\top\mathcal{M}) = \operatorname{tr}(\mathcal{M}\mathcal{M}^\top) = \sum_{i=1}^m \|\mathcal{M}_{i, \cdot}\|_2^2 = m
$$

Together, these facts imply that $m = n$, contradicting our assumption that $\mathcal{M}$ is tall. In general, the mean squared row norm of a column-orthogonal $\mathcal{M}$ is $n/m$. $\square$
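A quick numerical illustration of the claim, using an arbitrary column-orthogonal tall matrix (our own sketch, not from the post):

```python
import torch

m, n = 512, 128
Q, _ = torch.linalg.qr(torch.randn(m, n))    # column-orthogonal tall matrix, Q^T Q = I_n
sq_row_norms = (Q ** 2).sum(dim=1)

print(sq_row_norms.sum())    # = n, the trace identity from the proof
print(sq_row_norms.mean())   # = n/m < 1, so the rows cannot all have norm 1
```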

Therefore, row normalization in NorMuon necessarily introduces a precision defect into the orthogonalization routine. We find that this polar precision defect can be quite large for matrices with non-uniform row norms. In particular, the magnitude of the introduced error can be much larger than the precision improvement of PE-8 over quintic iteration in the same setting (see Appendix A for a full comparison of different Newton-Schulz iterations).

Row normalization precision

Figure 2. (a) We sample n=1000 random 512×128 matrices with row-norm standard deviations in {1, 2, 3, 4, 5} and orthogonalize them with different polar factor algorithms (PE-8, CANS-10, CANS-12, and quintic); the precision of the resulting polar factors is shown on the plot (dotted lines). We then apply unit row normalization and plot the updated precision (dark lines). At a row-norm standard deviation of 3, the orthogonality defect peaks at 0.06. (b) The average defect magnitude for each row-norm standard deviation.

Indeed, the performance gap between PE-8 and quintic iteration disappears when these algorithms are used with NorMuon in the same setting. However, both runs outperform our Muon+PE-8 baseline, suggesting that row normalization can be independently useful.

Loading...

Figure 3. NorMuon does not benefit from more precise orthogonalization at 1B scale. PE-8 and quintic achieve the same downstream loss, consistent with our result that row normalization degrades the polar factor precision.

To mitigate the defect introduced by row normalization, we can simply normalize tall matrices to have row norms $\sqrt{n/m}$ instead of one. We call this variant U-NorMuon and find that it outperforms NorMuon in our 340M setting (Figure 4; setting described in Appendix B).

$$
\begin{aligned} &\textbf{Algorithm: U-NorMuon} \\ &\textbf{Require: } \text{Initial weights } \mathbf{W}_0 \in \mathbb{R}^{m \times n},\; \text{loss } L,\; \text{learning rate } \eta,\; \text{momentum } (\beta_1, \beta_2),\; \text{perturbation } \varepsilon,\; \text{weight decay } \lambda \\ &\text{Initialize } \mathbf{M}_0 \in \mathbb{R}^{m \times n} \leftarrow \mathbf{0},\; \mathbf{v}_0 \in \mathbb{R}^m \leftarrow \mathbf{0} \\ &\textbf{for } t = 1, 2, \ldots \textbf{ do} \\ &\quad \mathbf{G}_t \leftarrow \nabla_{\mathbf{W}} L(\mathbf{W}_t) \\ &\quad \mathbf{M}_t \leftarrow \beta_1 \mathbf{M}_{t-1} + (1 - \beta_1)\, \mathbf{G}_t \\ &\quad \mathbf{O}_t \leftarrow \mathrm{polar}(\mathbf{M}_t) \\ &\quad \mathbf{v}_t \leftarrow \beta_2 \mathbf{v}_{t-1} + (1 - \beta_2)\, \operatorname{mean\_cols}(\mathbf{O}_t \odot \mathbf{O}_t) \\ &\quad \mathbf{V}_t \leftarrow \mathrm{ExpandRows}(\mathbf{v}_t) \qquad (\mathbf{V}_t \in \mathbb{R}^{m \times n}) \\ &\quad \widehat{\mathbf{O}}_t \leftarrow \mathbf{O}_t \oslash \bigl(\sqrt{\mathbf{V}_t} + \varepsilon\bigr) \\ &\quad \hat{\eta} = 0.2\, \eta {\color{#2563EB} n}\, /\, \|\widehat{\mathbf{O}}_t\|_F \\ &\quad \mathbf{W}_{t+1} \leftarrow \mathbf{W}_t - \eta \lambda \mathbf{W}_t - \hat{\eta}\, \widehat{\mathbf{O}}_t \\ &\textbf{end for} \end{aligned}
$$
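Below is a minimal sketch of the step that differs between the two variants, i.e. the row rescaling and the $\hat{\eta}$ computation; variable names are ours, and the momentum, weight-decay, and EMA bookkeeping are omitted.

```python
import torch

def normuon_scaled_update(O: torch.Tensor, v: torch.Tensor, lr: float,
                          eps: float = 1e-8, uniform_rows: bool = True) -> torch.Tensor:
    """Row rescaling and eta_hat scaling shared by NorMuon and U-NorMuon.

    O: orthogonalized update polar(M_t), shape (m, n).
    v: EMA of the per-row mean of O*O (one entry per row).
    uniform_rows=True  -> U-NorMuon: eta_hat = 0.2 * lr * n / ||O_hat||_F
    uniform_rows=False -> NorMuon:   eta_hat = 0.2 * lr * sqrt(m*n) / ||O_hat||_F
    """
    m, n = O.shape
    O_hat = O / (v.sqrt().unsqueeze(1) + eps)      # divide each row by its RMS estimate
    scale = n if uniform_rows else (m * n) ** 0.5
    eta_hat = 0.2 * lr * scale / (O_hat.norm() + eps)
    return eta_hat * O_hat                         # the term subtracted from W_t
```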

Now we turn our analysis to wide and square matrices, which also receive unit row norm updates under NorMuon. A left-orthogonal wide or square matrix necessarily has all its row norms equal to one, so row-norm uniformity is implied by orthogonality in this case.

Thus, we expect that row-normalization is unnecessary or perhaps even harmful under precise orthogonalization routines like PE-8. Iterative polar factor algorithms can fail to converge precisely early in training when raw gradients tend to be ill-conditioned, allowing for the accumulation of row norm statistics that quickly become stale. In this case, row normalization will make rows less uniform until the statistics are corrected. We find evidence of this occurring at 340M scale where row-normalizing only tall matrices outperforms NorMuon and U-NorMuon by a small but non-trivial margin (Figure 4).

Loading...

Figure 4. Column normalization of down projection performs very similarly to up and gate row normalization under NorMuon. NorMuon applied only to tall matrices outperforms all NorMuon variants and Muon.

We will show that Muon can allow a large subset of neurons in the MLP layers to effectively die, but that this pathology is mitigated by (U-)NorMuon. Building on ideas in the literature, we propose this as an explanation for the performance gap between Muon and NorMuon that we find in all our settings. We then derive Aurora which effectively row normalizes updates to tall parameters without sacrificing precision of the polar factor.

3: Normalization Prevents Neuron Death

We define a dead model component as a subset of model parameters receiving persistently small learning signal after the earliest phases of training. We identify dead model components with the following three criteria:

  1. Low effective gradient norm. Model components which, for a particular batch, receive low effective gradient norm must correspond to relatively flat directions in the loss landscape. Components with persistently small gradient norms will thus receive little learning signal. There is a strong precedent in the attribution and pruning literatures for using gradient norms as a proxy for feature importance [11] [12] [13].

  2. Low effective update norm. For networks trained with optimizers like Muon, small grad norms are not themselves sufficient for identifying dead network components. In particular, the orthogonalization step can drastically alter the gradient before it is applied, including amplifying small magnitude directions. We therefore require that effective update norm, which measures the actual parameter change, also be small.

  3. Persistence over training. It's possible for parameters to receive small update norms at specific points in training but still contribute non-trivially to the network output. For example, some components may learn useful features early and receive smaller updates as those features stabilize. To distinguish these static but useful components from genuinely dead ones, we require low effective gradient and update norms to be persistent and predictable over training.


Figure 5. Three criteria for identifying a dead neuron: low effective gradient norm, low effective update norm, and persistence of both over training.

We will show that neuron death, as we've defined it, can and does occur in networks trained with Muon because tall matrices in MLP layers are allowed to receive updates with very non-uniform row-norms. In particular, we will show that the Muon update privileges network components that happened to get large updates early in training, allowing others to die. We find that U-NorMuon avoids this problem completely.

The Mechanism of Neuron Death

For a matrix $M \in \mathbb{R}^{m \times n}$ with thin SVD $M = U_r \Sigma V^\top$, define the leverage score of row $i$ as $\ell_i(M) = \|(U_r)_{i,\cdot}\|_2^2$. The leverage score of row $i$ of $M$ equals the squared norm of row $i$ of the polar factor $U_r V^\top$, so rows with consistently low leverage scores receive a consistently small portion of the update mass. We can show that row leverage is mostly preserved under orthogonalization of reasonably well-conditioned matrices, so that under Muon, rows with low leverage in the raw gradient tend to also have low leverage in the update.


Figure 6. Interactive illustration of leverage scores. Each row's leverage score determines its share of update energy under the polar factor. Rows with small leverage receive proportionally small updates.
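Leverage scores are straightforward to compute from a thin SVD; here is a short sketch of how we read the definition above (our own illustration):

```python
import torch

def row_leverage_scores(M: torch.Tensor) -> torch.Tensor:
    """Leverage score of each row of M: squared row norms of U_r from the thin SVD.

    The m scores sum to n, so a perfectly uniform distribution assigns n/m per row.
    """
    U_r, _, _ = torch.linalg.svd(M, full_matrices=False)
    return (U_r ** 2).sum(dim=1)

# Example: for a tall 512 x 128 matrix the scores sum to 128 (mean 0.25),
# and they equal the squared row norms of polar(M) = U_r V^T.
M = torch.randn(512, 128)
print(row_leverage_scores(M).sum())
```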

Claim 2 (Informal). For tall matrices, rows with small gradient norm still have small norm after orthogonalization, so that the Muon update does not prevent neuron death.

Intuition. The leverage score of row $i$ measures how much row $i$ participates in the column space of $M$. A row that has near-zero norm contributes very little to any singular vector, so $U_r$ places little weight on it. Unlike singular values, which are unit-normalized by orthogonalization, this row-wise structure lives in $U_r$ and passes directly through to the polar factor. For wide or square matrices, the orthogonality constraint $U_r^\top U_r = I$ forces leverage scores to be uniform, but for tall matrices there are degrees of freedom under this constraint. The $m$ leverage scores must sum to $n$ but are otherwise allowed to concentrate on rows arbitrarily.

Claim 2 (Formal). Let $M \in \mathbb{R}^{m \times n}$ with $m > n$ and thin SVD $M = U_r \Sigma V^\top$. Suppose row $i$ satisfies $\|M_{i,\cdot}\| \leq \epsilon$ for some $\epsilon \geq 0$. Then $\ell_i(M) \leq \epsilon^2 / \sigma_n(M)^2$, where $\sigma_n(M)$ is the smallest singular value of $M$.

Proof. We have $M_{i,\cdot} = (U_r)_{i,\cdot} \Sigma V^\top$, so $(U_r)_{i,\cdot} = M_{i,\cdot} V \Sigma^{-1}$. Then:

$$
\begin{aligned} \ell_i &= \|(U_r)_{i,\cdot}\|^2 = \|M_{i,\cdot} V \Sigma^{-1}\|^2 \leq \|M_{i,\cdot}\|^2 \cdot \|\Sigma^{-1}\|^2 \\ &= \frac{\|M_{i,\cdot}\|^2}{\sigma_n(M)^2} \leq \frac{\epsilon^2}{\sigma_n(M)^2} \quad \square \end{aligned}
$$
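A quick numerical sanity check of this bound (illustrative only; dimensions are arbitrary):

```python
import torch

torch.manual_seed(0)
m, n, eps = 512, 128, 1e-3
M = torch.randn(m, n)
M[0] = M[0] * (eps / M[0].norm())        # force row 0 to have norm eps

U_r, S, _ = torch.linalg.svd(M, full_matrices=False)
leverage_0 = (U_r[0] ** 2).sum()
bound = eps ** 2 / S[-1] ** 2            # epsilon^2 / sigma_n(M)^2
assert leverage_0 <= bound               # the bound from Claim 2 holds
print(float(leverage_0), float(bound))
```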

Thus, the leverage score (and hence update) for a neuron is bounded by the squared norm of the corresponding row in the momentum buffer, normalized by the square of the smallest singular value. In particular, when $M$ is reasonably well-conditioned (so that $\sigma_n(M)$ is not too small), Muon preserves row leverage anisotropy. This leads to the following self-reinforcing feedback loop in networks trained with Muon:

  1. Dead neurons receive near-zero gradient rows.
  2. Momentum accumulates into a buffer $M$ with highly anisotropic row norms.
  3. The updates inherit this anisotropy, resulting in dead neurons not being updated.
  4. Hence, dead neurons remain dead.

We find significant empirical evidence that this loop occurs in our standard 1B and 340M transformer settings.

Figure 7. The neuron death feedback loop under Muon, visualized per-neuron across the SwiGLU MLP. Each row corresponds to one neuron in gate/up projection. Dead neurons receive vanishing activations, near-zero gradient scalars, and near-zero leverage. Press play or drag the slider to watch neurons collapse over training.

Empirical Evidence of Neuron Death

We train vanilla transformer models with 340 million non-embedding parameters following the recipe given in Appendix B, using both Muon and U-NorMuon. For each, we extract the momentum buffer every 50 steps and compute row-wise gradient norms and leverage scores. The U-NorMuon update has approximately uniform row norms (and hence leverage scores) by construction, but leverage scores tend to be very anisotropic for tall matrices under Muon. Our models use SwiGLU MLPs, and we verify later that our results transfer to models that use ReLU² (see Appendix G).

Neuron death is visible in activation space. We first examine the MLP latent statistics directly. For each neuron, we track the mean and standard deviation of activations across tokens at each training step. Under Muon, the distribution of gate means diverges sharply: as training progresses we find that a band of neurons drifts toward zero mean with collapsing variance; we interpret this as activation space collapse. By contrast, gate means and variances remain consistently high under U-NorMuon.
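A sketch of how such per-neuron statistics can be collected with a forward hook; the module path `model.layers[12].mlp.gate_proj` is hypothetical and should be adapted to the model at hand:

```python
import torch

gate_stats = {}

def make_stats_hook(layer_idx: int):
    def hook(module, inputs, output):
        # output: (batch, seq, d_ff) gate activations; reduce over all tokens
        acts = output.detach().float().flatten(0, -2)
        gate_stats[layer_idx] = (acts.mean(dim=0), acts.std(dim=0))  # per-neuron mean / std
    return hook

# handle = model.layers[12].mlp.gate_proj.register_forward_hook(make_stats_hook(12))
# ... run a forward pass over an evaluation batch, then log gate_stats[12] ...
# handle.remove()
```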


Figure 8. SwiGLU hidden activations over training for all 2816 neurons in layer 12. Each dot is one neuron; the x-axis is the mean of hᵢ across tokens and the y-axis is log₁₀ of its standard deviation. Under Muon (left), a cluster of neurons collapses to near-zero variance early in training. Under U-NorMuon (right), all neurons maintain healthy variance. Scrub the slider to watch neurons die.

Muon kills neurons early and permanently. We visualize leverage scores for the tall MLP projections of a middle layer over training (Figure 9). Under Muon, neurons are initially alive with uniformly high leverage, but a large fraction of neurons die during learning rate warmup and never recover. By step 500, more than one in four neurons are effectively dead, producing a sharply bimodal distribution of leverage scores; one mass of neurons receives near-zero updates, and the other receives disproportionately large ones.


Figure 9. Each square on the grid corresponds to a row in the parameter matrix. Gray neurons have low leverage and are classified as "dead". The distribution of leverage scores for Muon becomes highly anisotropic over training, whereas U-NorMuon leverage scores are isotropic and there are no dead neurons.

This neuron death is self-reinforcing and leads to a fundamental collapse in learning dynamics (Figure 10). On a log-log plot, row gradient norms and leverage scores are tightly correlated, indicating that the momentum buffer tends to be in the well-conditioned domain where our bound from Claim 2 is tight.


Figure 10. Muon has a strong tail of dead neurons. These rows have both low gradient norm and row leverage scores. U-NorMuon, by contrast, is highly isotropic. For Muon, the row gradient norm and leverage scores are strongly correlated on log-log scale.

To verify that neuron death persists throughout training, we split neurons into cohorts by their leverage score at step 100 (out of 10,000) and track each cohort as training progresses (Figure 11). The bottom cohort continues to decay, while the remaining cohorts quickly stabilize at high leverage under a "rich get richer" dynamic. If anisotropy were i.i.d. across rows, early leverage would not predict late leverage. This strong stratification confirms that, in our experiments, neuron death is a fixed point of learning rather than just training noise.


Figure 11. Rows are grouped into 8 cohorts by their initial leverage score at step 500. Under Muon, the bottom cohorts collapse to vanishing leverage while the top cohorts stabilize high, and dead neurons stay dead. Under U-NorMuon, all cohorts converge toward uniform leverage. Toggle between optimizers, projections, and layers to explore.

U-NorMuon prevents death, and the prevention propagates.

Under U-NorMuon, leverage scores are isotropic and approximately Gaussian-distributed throughout training, with no dead-neuron tail (Figures 9–10). One might expect this to be a trivial consequence of the normalization step forcing uniformity post-hoc. Instead, we find that the momentum buffer itself converges to a regime with isotropic row norms.

The normalization intervenes most aggressively early in training when anisotropy is developing, and as the buffer stabilizes, the correction shrinks. More surprisingly, this benefit extends to parameters that U-NorMuon does not directly normalize. For example, in Figure 12 we plot column leverage scores for the down projection matrix, which is a wide matrix and thus receives no benefit from row normalization.

By contrast, column leverage becomes anisotropic under Muon. Dead rows in the up/gate projections starve the corresponding columns of the down projection of gradient signal. Under U-NorMuon, keeping the up/gate rows alive ensures isotropic gradient flow into the down projection, stabilizing its column leverage without any direct intervention.


Figure 12. Even though down projection does not benefit directly from row normalization, the induced dynamics prevent (column-wise) leverage anisotropy. Alive rows in the up and gate projections in turn induce gradient flow in the columns of the down projection.

Takeaway. U-NorMuon's benefit is not simply normalization. U-NorMuon breaks a pathological, self-reinforcing collapse that wastes a quarter or more of the network's MLP parameters. However, it does so by forcefully overriding the polar factor with uniform row norms, sacrificing the polar factor precision that is both empirically and theoretically desirable in the Muon framework. We now present Aurora, which achieves both goals simultaneously. We show that Aurora has better downstream performance than both Muon and (U-)NorMuon at 1B scale and on the modded-nanoGPT optimizers track.

4: Augmenting the Muon Objective

Recall that the Muon update direction can be motivated as the solution to the following constrained optimization problem:

$$
U^* = \arg\max_{U} \operatorname{Tr}(G^\top U) \quad \text{subject to} \quad \|U\|_2 \le 1
$$
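For completeness, here is the standard argument (only stated, not spelled out, in the post) that the maximizer is the polar factor: writing the gradient's SVD as $G = U_G \Sigma V^\top$,

$$
\operatorname{Tr}(G^\top U) = \operatorname{Tr}(V \Sigma U_G^\top U) = \operatorname{Tr}(\Sigma\, U_G^\top U V) = \sum_{i=1}^{n} \sigma_i \,[U_G^\top U V]_{ii} \;\le\; \sum_{i=1}^{n} \sigma_i,
$$

since $\|U\|_2 \le 1$ implies $|[U_G^\top U V]_{ii}| \le 1$; equality holds at $U = U_G V^\top = \operatorname{polar}(G)$.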

Here $G = \nabla_W \mathcal{L}$ is the gradient with respect to a matrix parameter $W$, and $\|U\|_2$ is the spectral norm (ignoring the learning rate scalar for simplicity). There is no constraint on the uniformity of the row norms of the update $U \in \mathbb{R}^{m \times n}$, which we found can be quite far from uniform for tall matrices (Sections 2 & 3). Motivated by our results in Section 3, we augment the Muon objective for tall matrices ($m \ge n$) with the additional requirement that all the update row norms be equal to $\sqrt{n/m}$:

$$
U^* = \arg\max_{U} \operatorname{Tr}(G^\top U) \quad \text{s.t.} \quad \|U\|_2 \le 1, \quad \|U_{i:}\|^2 = \frac{n}{m} \;\;\forall\, i
$$

The constant row norm constraint fixes the squared Frobenius norm to $\|U\|_F^2 = m \cdot (n/m) = n$. Because the squared Frobenius norm also equals the sum of the squared singular values and the spectral norm bound implies each $\sigma_j \le 1$, all $n$ singular values must exactly equal 1. This forces $U$ to be left semi-orthogonal. Therefore, the problem reduces to:

$$
U^* = \arg\max_{U} \operatorname{Tr}(G^\top U) \quad \text{s.t.} \quad U^\top U = I_n, \quad \|U_{i:}\|^2 = \frac{n}{m} \;\;\forall\, i
$$

The solution is steepest descent under two constraints: a left semi-orthogonality constraint to make the operator RMS-to-RMS norm preserving, and a constant row norm constraint to enforce uniform updates. Solving this problem yields Aurora.

A Riemannian Solution

We first consider a Riemannian formulation of the constrained update-selection problem. We restrict the parameter update $U$ to lie on the joint Stiefel/equal-row-leverage manifold

$$
\mathcal{M} = \left\{U \in \mathbb{R}^{m \times n} : U^\top U = I, \; \operatorname{diag}(UU^\top) = \frac{n}{m}\mathbf{1}\right\}
$$

Given a matrix gradient $G$, we seek the feasible update most aligned with $G$:

$$
\max_U \langle G, U \rangle \quad \text{s.t.} \quad U \in \mathcal{M}
$$

A Riemannian approach approximately solves this problem by projecting the Euclidean direction $G$ onto the tangent space of $\mathcal{M}$, taking a step along the projected direction, and retracting the result back toward $\mathcal{M}$ (Figure 13). We refer to this reference solver as Riemannian-Aurora; the full derivation is given in Appendix D.

$$
\begin{aligned} &\textbf{Algorithm: Riemannian-Aurora} \\ &\textbf{Require: } \text{Momentum } \mathbf{G} \in \mathbb{R}^{m \times n},\; \text{step size } \eta,\; \text{iterations } T \\ &\text{Set } r \leftarrow n/m \\ &\text{Initialize } \mathbf{U}_0 \in \mathcal{M} \text{ (e.g. polar projection of } \mathbf{G} \text{ followed by row/Stiefel projection)} \\ &\textbf{for } t = 0, 1, \ldots, T-1 \textbf{ do} \\ &\quad \mathbf{B}_t \leftarrow \operatorname{sym}(\mathbf{U}_t^\top \mathbf{G}) \\ &\quad \mathbf{P}_t \leftarrow \mathbf{U}_t \mathbf{U}_t^\top \\ &\quad \mathbf{q}_t \leftarrow \operatorname{diag}(\mathbf{G}\mathbf{U}_t^\top) - \operatorname{diag}(\mathbf{U}_t \mathbf{B}_t \mathbf{U}_t^\top) \\ &\quad \text{Solve: } \bigl(r \mathbf{I}_m - \mathbf{P}_t \circ \mathbf{P}_t\bigr)\,\boldsymbol{\lambda}_t = \mathbf{q}_t \\ &\quad \mathbf{D}_t \leftarrow \operatorname{diag}(\boldsymbol{\lambda}_t) \\ &\quad \mathbf{S}_t \leftarrow \mathbf{B}_t - \mathbf{U}_t^\top \mathbf{D}_t \mathbf{U}_t \\ &\quad \mathbf{Z}_t \leftarrow \mathbf{G} - \mathbf{U}_t \mathbf{S}_t - \mathbf{D}_t \mathbf{U}_t \\ &\quad \mathbf{Y}_t \leftarrow \mathbf{U}_t + \eta\, \mathbf{Z}_t \\ &\quad \mathbf{U}_{t+1} \leftarrow \operatorname{Retract}(\mathbf{Y}_t) \\ &\textbf{end for} \\ &\textbf{Return } \mathbf{U}_T \end{aligned}
$$

Figure 13. 2D cross-section of the Riemannian geometry. The Stiefel and oblique manifolds intersect at U on the joint manifold M. Muon projects the gradient onto the Stiefel tangent only, drifting from the oblique constraint. Aurora projects onto the joint tangent, respecting both constraints simultaneously.

Unlike post-hoc row normalization, which first computes a Muon update and then rescales its rows, Riemannian-Aurora treats orthogonality and equal row leverage as a single joint constraint. Both the initialization of U0U_0 and retraction are done approximately via alternately computing the polar factor and row normalizing the update in our implementation.

Some of the steps in Riemannian-Aurora are particularly expensive, making it infeasible for use on moderately-sized networks as written. Namely:

  1. The tangent space projection involves solving an $m$-dimensional linear system involving $rI - P \odot P$, where $P = UU^\top$. For an $m \times n$ matrix, explicitly forming $P$ costs $O(m^2 n)$ computation and storing it costs $O(m^2)$ memory, which is prohibitive at moderate scales.
  2. Solving for the tangent constraint with conjugate gradient requires repeated products of the form $v \mapsto (P \odot P)v$, i.e. dense row interactions.

We present it mostly as a faithful solution to our dual-constraint problem and as a comparison point for our practical instantiation of Aurora. We include a small-scale verification of Riemannian-Aurora in Appendix E.

An Iterative Approximation: Aurora

Though the joint Stiefel/equal-row-norm projection has no simple closed form, it admits a natural iterative approximation which alternates between the two constraint sets, starting with row normalization and then using Newton-Schulz iteration to approximate the polar factor.

In practice, we use a damped iteration with an EMA of row normalization factors to improve convergence and prevent oscillation. This requires just one additional damping parameter $\beta$.

$$
\begin{aligned} &\textbf{Algorithm: Aurora} \\ &\textbf{Input: } \text{Momentum buffer } M \in \mathbb{R}^{m \times n}, \text{ NS iterations } K, \text{ damping parameter } \beta \\ &X_0 = M / \|M\|_F \\ &D_0 = I_m \\ &\textbf{for } k = 0, 1, \ldots, K-1 \textbf{ do} \\ &\quad r_k = (\|(X_k)_{1,:}\|, \ldots, \|(X_k)_{m,:}\|) \\ &\quad D_k = D_{k-1}^\beta \cdot \operatorname{diag}(r_k)^{(1-\beta)} \\ &\quad \tilde{X}_k = \sqrt{n/m} \cdot D_k^{-1} X_k \\ &\quad X_{k+1} = \mathrm{polar}(\tilde{X}_k) \\ &\textbf{end for} \\ &\textbf{Return } X_K \end{aligned}
$$

These steps alternate because the polar factor computation can disturb row-norm uniformity and vice versa. We show that, under mild conditions, this alternating projection converges to a point on or near the intersection of the Stiefel and row-oblique manifolds (proof in Appendix F). Notably, the final iterate is always on the Stiefel manifold, as the last step is always orthogonalization. Aurora is roughly $K$ times more compute-expensive than Muon, but in practice we can effectively overlap much of this overhead with gradient communication in distributed settings. We present an interactive simulation of vanilla (iterative) Aurora below.
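A minimal PyTorch sketch of this damped iteration is given below; the exact polar factor via SVD stands in for a matmul-only routine like PE-8, and this is an illustration of the algorithm box above rather than the released implementation.

```python
import torch

def polar(X: torch.Tensor) -> torch.Tensor:
    """Exact polar factor via thin SVD (stand-in for a matmul-only routine such as PE-8)."""
    U, _, Vh = torch.linalg.svd(X, full_matrices=False)
    return U @ Vh

def aurora_update(M: torch.Tensor, K: int = 2, beta: float = 0.5, eps: float = 1e-8) -> torch.Tensor:
    """Damped alternating iteration sketched from the Aurora algorithm box (illustrative)."""
    m, n = M.shape
    X = M / (M.norm() + eps)
    d = torch.ones(m, dtype=M.dtype, device=M.device)  # running row-scale D_k, stored as a vector
    for _ in range(K):
        r = X.norm(dim=1) + eps                        # current row norms of X_k
        d = d.pow(beta) * r.pow(1.0 - beta)            # D_k = D_{k-1}^beta * diag(r_k)^(1-beta)
        X = (n / m) ** 0.5 * (X / d.unsqueeze(1))      # X_tilde = sqrt(n/m) * D_k^{-1} X_k
        X = polar(X)                                   # re-orthogonalize: X_{k+1}
    return X
```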


Figure 14. Aurora alternates between row normalization and polar factor computation. Each iteration improves both orthogonality and row-norm uniformity, converging to a point on the joint Stiefel–oblique manifold.

5: Results

nanoGPT Speedrun

We evaluate Aurora on the modded-nanoGPT speedrun benchmark, comparing against NorMuon and the previous state-of-the-art run. Pure Aurora (2 damping iterations, $\beta = 0.5$) improves on NorMuon by 25 steps, reaching the 3.28 target loss in 3225 steps. When combined with Contra-Muon and update/weight flooring, Aurora achieves a new SoTA of 3175 steps.

| Description | Key hparams | Steps to 3.28 (S) | Mean loss @ S |
| --- | --- | --- | --- |
| NorMuon (#10 leaderboard) | lr=0.035, wd=0.025 | 3250 | 3.2789 |
| Aurora (pure, ours) | lr=0.05, wd=0.025, pp_iterations=2, pp_beta=0.5 | 3225 | 3.2772 |
| ContraNorMuon w/ update clamping (#11, prior SoTA) | Setup from #9 + Contra-Muon | 3225 | 3.2785 |
| Aurora + Contra + u/w-floor (ours, new SoTA) | lr=0.0375, wd=0, pp_iterations=2, pp_beta=0.5 | 3175 | 3.2796 |


Figure 15. nanoGPT speedrun convergence curves. Aurora + Contra-Muon reaches the 3.28 eval loss target at step 3175, setting a new state-of-the-art on the optimization track.

Aurora 1.1B Pretraining

We train 1.1B-parameter transformers on ~100B tokens (see Appendix B for full details) and compare Aurora against Muon and NorMuon, each using PE-8. Aurora achieves the lowest final loss of all methods, reaching a smoothed loss of 2.26 at step 24k, which is a clear improvement over Muon (2.31) and NorMuon (2.33).


Figure 16. 1.1B pretraining loss. Aurora (green) converges faster and reaches a lower final loss than all Muon and NorMuon baselines. All runs train for 24k steps.

Downstream evaluations. Aurora's loss improvement translates to consistent gains on standard benchmarks. On HellaSwag, Aurora achieves 67.6%, a 2.5-point improvement over both Muon and NorMuon. Gains are similarly consistent on ARC-Challenge and Winogrande. Strikingly, Aurora improves MMLU by 10.8 points over Muon. We hypothesize that since MLPs are predominantly responsible for memorization, Aurora's gains are most visible on memorization-intensive benchmarks like MMLU.

| Benchmark | Muon | NorMuon | Aurora |
| --- | --- | --- | --- |
| HellaSwag | 65.1 | 64.6 | 67.6 |
| ARC-C | 43.4 | 42.7 | 43.5 |
| Winogrande | 59.7 | 62.2 | 63.1 |
| MMLU | 27.1 | 25.8 | 37.9 |


[Scatter plot: HellaSwag (%) vs. pretraining tokens (log scale). Aurora-1.1B reaches 67.6% with roughly 20–360x fewer tokens than Gemma3-1B (62.94%), Qwen2-1.5B (67%), Llama3.2-1B (65.73%), Qwen2.5-1.5B (67.9%), and Qwen3-1.7B (67.09%).]
Figure 17. Downstream benchmark comparison across optimizers (top) and token efficiency relative to Qwen models (bottom). Aurora matches Qwen2-1.5B on Hellaswag and ARC-C despite using 400M fewer parameters and ~100x fewer pretraining tokens. Toggle benchmarks on the scatter plot to compare.

Remarkably, Aurora-1.1B matches or approaches Qwen3-1.7B (36T tokens) on Hellaswag and Winogrande despite training on two orders of magnitude fewer tokens, having 400M fewer parameters, and an uncurated data mixture. The scatter plot reveals the striking token efficiency: Aurora sits far to the left of frontier small language models on the log-token axis while achieving competitive scores on benchmarks that primarily test language understanding rather than domain-specific knowledge. Note that our 1.1B model is trained on ~100B tokens of the Nemotron CC dataset, which contains no explicit math or science data, explaining the lower MMLU scores compared to models trained on curated multi-domain corpora.

Notably, despite having 25% fewer parameters, 2 orders of magnitude fewer training tokens, and using fully open-source internet-only data, we match Qwen3-1.7B on several benchmarks.

Update Quality

We verify that Aurora's damped iteration actually solves the joint Stiefel–oblique problem for real momentum buffers, and compare it against the Riemannian reference solver from Section 4.

Gradient alignment vs. leverage uniformity


Figure 18. Comparison of alignment vs row norm coefficient of variation tradeoff for Aurora, Riemannian-Aurora, and standard Muon. All updates lie on the Stiefel manifold by construction. Data is computed on a real momentum buffer from a Muon model very early in training (before pathology).

Figure 18 shows the tradeoff between gradient alignment ($\langle G, U \rangle / \|G\|$) and row-norm coefficient of variation (CV) as a function of iteration count, computed on a real momentum buffer from early in training. Standard Muon achieves high alignment but poor row-norm uniformity. Aurora and Riemannian-Aurora converge to low CV with high alignment, but from opposite directions: Aurora starts orthogonal and corrects row norms, whereas Riemannian-Aurora starts from the tangent projection and retracts onto the manifold. Riemannian-Aurora achieves 0.6% higher gradient alignment at convergence, suggesting that the practical Aurora iteration sacrifices very little relative to the reference solver.
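For reference, the two quantities on the axes of Figure 18 can be computed as follows (our reading of the definitions in the text; the function name is ours):

```python
import torch

def alignment_and_row_cv(G: torch.Tensor, U: torch.Tensor) -> tuple[float, float]:
    """Gradient alignment <G, U> / ||G||_F and coefficient of variation of U's row norms."""
    alignment = (G * U).sum() / (G.norm() + 1e-12)
    row_norms = U.norm(dim=1)
    cv = row_norms.std() / (row_norms.mean() + 1e-12)
    return alignment.item(), cv.item()
```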

Leverage score distributions

Figure 19 compares leverage score distributions across optimizers. Muon produces a heavy tail of dead rows spanning several orders of magnitude. Aurora with 1 damping iteration (equivalent to Muon-Eq) improves on Muon but retains visible spread. Aurora with 2 iterations produces by far the most isotropic distribution, tightly clustered around the uniform value $\frac{n}{m}$.


Figure 19. Distribution of leverage scores for Muon, U-NorMuon, Aurora with 1 iter (Muon-Eq) and Aurora with 2 iters. Aurora with 2 iters has by far the most isotropic row leverage scores of all the approaches, tightly clustered around uniform.

Weight spectra

We examine the singular value spectra of the trained weight matrices. Under NorMuon, the down projection spectra are inflated by ~100x ($\sigma_{\max} \approx 500$ to $1500$ vs Aurora's $11$ to $21$), while gate projection and up projection spectra remain comparable. This is consistent with the weight matrix absorbing scale information that the row-normalized update discards.


Figure 20. Singular value spectra of trained weight matrices (log scale). Gate and up projections have similar spectra across optimizers, but NorMuon's down projection is ~100x larger, with worse conditioning. κ denotes the condition number. Slide to compare across layers.

6: Inflection

Four things excite us about this work:

1. Strong empirical gains. We stress that Aurora 1.1B was trained to be a research artifact: we did not do any tricks with data, architecture, or optimizer beyond simply using Aurora, to isolate its effect. Despite this, it achieved 100x data efficiency on several key benchmarks, and we see this as very promising.

2. Architecture-Optimizer Codesign. Aurora is an optimizer specifically for up and gate projections, which we show corrects a fundamental pathology in model training dynamics under Muon. We believe this is a nice example of optimizer-architecture co-design at a level of granularity that is very uncommon in the literature. Crucially, Aurora's advantage over Muon grows monotonically with MLP expansion factor (Figure 21), suggesting that wider MLPs, which increase the row-to-column ratio of the up and gate projections, amplify exactly the pathology Aurora corrects.

| MLP expansion factor | Δ CE (Aurora − Muon) |
| --- | --- |
| 0.5× | −0.0019 |
| 2× | −0.0045 |
| 4× | −0.0161 |
| 8× | −0.0399 |
| 16× | −0.0623 |

Figure 21. Aurora's advantage over Muon scales with MLP expansion factor. Delta CE (Aurora − Muon) at step 3000 of 340M-parameter runs across MLP hidden ratios from 0.5× to 16×. More negative = larger Aurora gain.

3. Compression / Dense Solutions. Aurora's row uniformity constraint is complementary to Muon's motivating constraint. We propose thinking about the row-uniformity constraint as encouraging compression. Under Aurora, networks tend to utilize the parameters in their MLPs more effectively; that is, models tend toward denser solutions. The empirical success of Aurora indicates that dense solutions to language modelling can be feasible.

4. Bottom-Up Optimizer Design. Aurora was derived by mechanistically studying a specific pathology in Muon and designing the minimal objective that corrects it. This contrasts with the top-down approach that dominates optimizer research, where one posits a mathematical formulation and hopes it yields empirical gains; an approach that offers little guidance on which of many sound framings to choose, and no explanation for why one works better than another (Figure 22). We strongly believe the field would benefit from more bottom-up research as a whole.


Figure 22. Two philosophies of optimizer design. Top-down begins with a desirable mathematical property and validates empirically, but offers little signal when validation fails. Bottom-up studies existing optimizers mechanistically, identifies specific pathologies, and designs targeted fixes, yielding both better optimizers and deeper understanding.

Understanding how Muon fails on tall matrices led us to a better objective, a better optimizer, and "denser" models. We are interested in faster and more precise methods of solving the dual-constraint steepest descent objective, and extending our results to large networks with wide MLPs.

Cite this work

@article{dewulf2026aurora,
  title   = {Aurora: A Leverage-Aware Optimizer for Rectangular Matrices},
  author  = {Dewulf, Alec and Pai, Dhruv and Yang, Li and Zhang, Ashley
             and Keigwin, Ben},
  year    = {2026},
  url     = {https://tilderesearch.com/blog/aurora}
}

Appendix

A. Comparing Newton-Schulz Iterations

We present a cost analysis of different polynomials for iterative orthogonalization. As discussed in Sections 2 & 5, we find that more precise polynomials are always helpful for Muon and Aurora, which justifies the additional computation in many settings.

| Method & Steps | Matmuls | Derivative at 0 | Max Deviation from 1 (SVs ≥ 0.01) |
| --- | --- | --- | --- |
| Simple-5 [1] | 15 | 484 | 0.318169 |
| Quintic-5 [21] | 15 | 492 | 0.031775 |
| Polar Express (PE)-8 [5] | 24 | 6,822 | 2.22e-16 |
| CANS-12 [6] | 24 | 6,868 | 1.11e-16 |

NS transfer functions

Transfer functions for Newton-Schulz iterations. PE-8 and CANS-12 converge to machine precision, while quintic retains a persistent spectral error of ~0.03.

CANS-12 and Polar Express-8 are both much more precise than the quintic iteration, which retains a persistent spectral error of ~0.03 even in the limit. Both CANS-12 and PE-8 achieve nearly perfect accuracy for reasonably well-conditioned gradients.

B. Training Setting

We report exact configurations and training settings for our 340M and 1.1B runs to maximize the reproducibility of our results. We arrived at these settings via hyperparameter sweeps; we spent roughly the same compute sweeping each optimizer and setting.

We used an internal tokenizer with 128k vocab size for all our experiments, which was released in the model repository. For reproducibility, we trained on fully open-source internet data from NVIDIA Nemotron CC v2 [19].

Transformer 340M:

| Category | Details |
| --- | --- |
| Training Configuration | 800k tokens per batch, 8192-token sequence length, WSD schedule |
| Data | 10.5B tokens of NemotronCCv2 [19] from the high quality split |
| (Nor)Muon Hyperparameters | lr=8e-3 |
| Aurora Hyperparameters | Two projection iterations, damping beta=0.5. All other settings from NorMuon |
| Model Architecture | hidden_size=1024, layers=24, GQA 4:1, RoPE, QKNorm, ShortConv, Gated Attention |

Transformer 1.1B:

| Category | Details |
| --- | --- |
| Training Configuration | 4M tokens per batch, 2048-token sequence length, WSD with cosine decay schedule |
| Data | 100B tokens of NemotronCCv2 [19] from the high quality split |
| Optimizer Hyperparameters | Same as the 340M settings in all cases, with appropriate hyperparameter transfer |
| Model Architecture | hidden_size=2048, layers=24, GQA 4:1, RoPE, QKNorm, ShortConv, Gated Attention |

C. Literature on Parameter Death

In Transformer FFNs, prior work has shown that intermediate hidden dimensions are not uniformly utilized: FFN activations are often highly sparse, and some intermediate neurons can remain inactive over large corpora, often referred to as dead or dormant neurons [14] [15] [16]. These neurons correspond to the hidden FFN dimensions between the up/gate projections and the down projection. When a hidden neuron remains inactive or near-zero for many tokens/steps across training, the gradients to its associated weights are greatly reduced. As a result, only a subset of the FFN intermediate dimensions and their corresponding weight matrices effectively participate in computation, reducing the model's effective capacity and parameter utilization [17].

D. Solving for the Riemannian Gradient

We want to solve

$$
\max_{U \in \mathbb{R}^{m \times n}} \langle G, U \rangle \quad \text{s.t.} \quad U^\top U = I_n, \qquad \operatorname{diag}(UU^\top) = \frac{n}{m}\mathbf{1}
$$

Let $r := \frac{n}{m}$. The feasible set is the equal-row-norm Stiefel manifold

$$
\mathcal{M} = \left\{ U \in \mathbb{R}^{m \times n} : U^\top U = I_n, \; \|U_{i:}\|_2^2 = r \;\; \forall\, i \right\}
$$

Tangent space. Let $Z \in \mathbb{R}^{m \times n}$ be a tangent vector at $U \in \mathcal{M}$. The Stiefel constraint requires $(U + \epsilon Z)^\top (U + \epsilon Z) = I_n + O(\epsilon^2)$ for small $\epsilon > 0$. Keeping only first-order terms gives

$$
U^\top Z + Z^\top U = 0
$$

The row-norm constraint requires $\|U_{i:} + \epsilon Z_{i:}\|_2^2 = r + O(\epsilon^2)$ for all $i$. Expanding to first order gives $\langle U_{i:}, Z_{i:} \rangle = 0$ for all $i$, or equivalently $\operatorname{diag}(ZU^\top) = 0$. Therefore,

$$
T_U \mathcal{M} = \left\{ Z \in \mathbb{R}^{m \times n} : U^\top Z + Z^\top U = 0, \; \operatorname{diag}(ZU^\top) = 0 \right\}
$$

Riemannian gradient. We compute the Riemannian gradient by projecting $G$ onto the tangent space:

$$
Z = \arg\min_{\widetilde{Z}} \|\widetilde{Z} - G\|_F^2 \quad \text{s.t.} \quad U^\top \widetilde{Z} + \widetilde{Z}^\top U = 0, \qquad \operatorname{diag}(\widetilde{Z}U^\top) = 0
$$

The projected vector has the form $Z = G - US - DU$, where $S = S^\top \in \mathbb{R}^{n \times n}$ and $D = \operatorname{diag}(\lambda) \in \mathbb{R}^{m \times m}$. The $US$ term enforces the Stiefel tangent constraint, while the $DU$ term enforces the row-norm tangent constraint. Define

$$
B := \operatorname{sym}(U^\top G) = \frac{1}{2}(U^\top G + G^\top U), \qquad P := UU^\top
$$

We impose the Stiefel tangent constraint $U^\top Z + Z^\top U = 0$. Substituting $Z = G - US - DU$, we get

$$
U^\top(G - US - DU) + (G - US - DU)^\top U = 0
$$

Using $U^\top U = I_n$, $S = S^\top$, and $D = D^\top$, this becomes

$$
U^\top G + G^\top U - 2S - 2U^\top DU = 0
$$

Therefore,

$$
S = \operatorname{sym}(U^\top G) - U^\top DU = B - U^\top DU
$$

Now impose the row-norm tangent constraint $\operatorname{diag}(ZU^\top) = 0$. Using $Z = G - US - DU$, this gives

$$
\operatorname{diag}(GU^\top) - \operatorname{diag}(USU^\top) - \operatorname{diag}(DUU^\top) = 0
$$

Since $\operatorname{diag}(UU^\top) = r\mathbf{1}$ and $D = \operatorname{diag}(\lambda)$, we have $\operatorname{diag}(DUU^\top) = r\lambda$. Substituting $S = B - U^\top DU$ into $\operatorname{diag}(USU^\top)$ gives

$$
\operatorname{diag}(USU^\top) = \operatorname{diag}(UBU^\top) - \operatorname{diag}(UU^\top DUU^\top)
$$

With $P = UU^\top$, this becomes $\operatorname{diag}(USU^\top) = \operatorname{diag}(UBU^\top) - \operatorname{diag}(PDP)$. Since $D = \operatorname{diag}(\lambda)$,

$$
\operatorname{diag}(PDP) = (P \circ P)\lambda
$$

Therefore $\operatorname{diag}(USU^\top) = \operatorname{diag}(UBU^\top) - (P \circ P)\lambda$. Plugging this into the row constraint gives

$$
\operatorname{diag}(GU^\top) - \left[\operatorname{diag}(UBU^\top) - (P \circ P)\lambda\right] - r\lambda = 0
$$

Rearranging,

$$
(rI - P \circ P)\lambda = \operatorname{diag}(GU^\top) - \operatorname{diag}(UBU^\top)
$$

After solving this linear system for $\lambda$, define $D = \operatorname{diag}(\lambda)$, $S = B - U^\top DU$, and the tangent projection is $Z = G - US - DU$.
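A dense reference implementation of this tangent projection, following the formulas above (our own sketch; it forms $P = UU^\top$ explicitly, so it is only suitable for small matrices):

```python
import torch

def tangent_project(G: torch.Tensor, U: torch.Tensor) -> torch.Tensor:
    """Project G onto T_U M for the equal-row-norm Stiefel manifold (dense reference)."""
    m, n = U.shape
    r = n / m
    B = 0.5 * (U.T @ G + G.T @ U)                      # B = sym(U^T G)
    P = U @ U.T                                        # O(m^2) memory: small matrices only
    q = (G * U).sum(dim=1) - ((U @ B) * U).sum(dim=1)  # diag(G U^T) - diag(U B U^T)
    L = r * torch.eye(m, dtype=U.dtype, device=U.device) - P * P
    lam = torch.linalg.pinv(L) @ q                     # L is singular along the all-ones vector
    S = B - U.T @ (lam.unsqueeze(1) * U)               # S = B - U^T D U with D = diag(lam)
    Z = G - U @ S - lam.unsqueeze(1) * U               # Z = G - U S - D U
    return Z
```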

E. CIFAR-10 Results

We train a 2-layer MLP with ReLU for 3 epochs on CIFAR-10 as a toy verification of Riemannian-Aurora. We use $K=20$ conjugate gradient iterations, Nesterov momentum, lr=4e-3, 2 retraction steps, PE-8, a Riemannian learning rate of $\eta = 0.1$, and an AdamW learning rate of 4e-3 with betas $(0.9, 0.95)$. We found these parameters to be approximately optimal after a brief sweep.

We find that Riemannian-Aurora outperforms the AdamW baseline by a significant amount, and performs very similarly to Muon in terms of train accuracy, but has slightly higher test accuracy. Unlike Muon, it leads to much more uniform row leverage scores.

CIFAR-10 results

CIFAR-10 training and test accuracy for AdamW, Muon, and Riemannian-Aurora, alongside leverage score distributions. Riemannian-Aurora achieves Muon-level accuracy with uniform leverage.

F. Alternating Iteration Converges

Let $m > n$, $r = n/m$, and define

$$
\mathcal{S} = \{U \in \mathbb{R}^{m \times n} : U^\top U = I_n\}, \qquad \mathcal{O}_r = \{U \in \mathbb{R}^{m \times n} : \|U_{i,:}\|_2^2 = r,\; i = 1, \ldots, m\}
$$

Let $\mathcal{M} = \mathcal{S} \cap \mathcal{O}_r$. The alternating projection iteration is

$$
U_{k+1/2} = \Pi_{\mathcal{O}_r}(U_k), \qquad U_{k+1} = \Pi_{\mathcal{S}}(U_{k+1/2})
$$

where $\Pi_{\mathcal{O}_r}$ is row-wise rescaling and $\Pi_{\mathcal{S}}$ is the polar factor.

Lemma F.1: Nonemptiness

$\mathcal{M} \neq \emptyset$.

Proof. By the Schur–Horn theorem, there exists a rank-$n$ orthogonal projector $P \in \mathbb{R}^{m \times m}$ with constant diagonal $P_{ii} = n/m$: the diagonal vector $(n/m, \ldots, n/m)$ is majorized by the eigenvalue vector $(1, \ldots, 1, 0, \ldots, 0)$ with $n$ ones. Write $P = UU^\top$ with $U^\top U = I_n$. Then $\|U_{i,:}\|_2^2 = (UU^\top)_{ii} = P_{ii} = n/m$, so $U \in \mathcal{S} \cap \mathcal{O}_r$. Hence $\mathcal{M} \neq \emptyset$. $\square$

Regularity Condition

For $U \in \mathcal{M}$, define $P = UU^\top$ and $L_U = rI_m - P \circ P$, where $\circ$ denotes the Hadamard product. Since $P^2 = P$ and $P_{ii} = r$,

$$
\sum_j P_{ij}^2 = (P^2)_{ii} = P_{ii} = r
$$

Therefore $L_U$ is a graph Laplacian:

$$
\lambda^\top L_U \lambda = \frac{1}{2} \sum_{i,j} P_{ij}^2 (\lambda_i - \lambda_j)^2
$$

Call $U$ regular if $\ker L_U = \operatorname{span}\{\mathbf{1}\}$.

Lemma F.2: Clean Intersection

If $U \in \mathcal{M}$ is regular, then $\mathcal{S}$ and $\mathcal{O}_r$ intersect cleanly at $U$.

Proof. The normal spaces are $N_U \mathcal{S} = \{UA : A = A^\top\}$ and $N_U \mathcal{O}_r = \{DU : D = \operatorname{diag}(\lambda)\}$. Suppose a vector lies in both normal spaces, so $UA = DU$. Since $P = UU^\top$ is the projector onto $\operatorname{col}(U)$, this implies $(I - P)DU = 0$. But

$$
\|(I - P)DU\|_F^2 = \|DU\|_F^2 - \|PDU\|_F^2
$$

Using $\|U_{i,:}\|_2^2 = r$, we get $\|DU\|_F^2 = r\|\lambda\|_2^2$ and $\|PDU\|_F^2 = \lambda^\top (P \circ P)\lambda$, so

$$
0 = \|(I - P)DU\|_F^2 = \lambda^\top(rI_m - P \circ P)\lambda = \lambda^\top L_U \lambda
$$

By regularity, this forces $\lambda = \alpha \mathbf{1}$, hence $D = \alpha I_m$, $A = \alpha I_n$, and $DU = UA = \alpha U$. Therefore $N_U \mathcal{S} \cap N_U \mathcal{O}_r = \operatorname{span}\{U\}$. This one-dimensional common normal direction is unavoidable since both constraint sets imply the same Frobenius norm constraint $\|U\|_F^2 = n$. Thus the combined constraint map has maximal rank modulo this single redundancy. By the constant-rank theorem, $\mathcal{M}$ is locally a smooth manifold and $T_U \mathcal{M} = T_U \mathcal{S} \cap T_U \mathcal{O}_r$, so the intersection is clean. $\square$

Theorem F.3: Local Linear Convergence

Let $U_\star \in \mathcal{M}$ be regular. Then for every initialization $U_0$ sufficiently close to $U_\star$, the alternating projection sequence $U_{k+1/2} = \Pi_{\mathcal{O}_r}(U_k)$, $U_{k+1} = \Pi_{\mathcal{S}}(U_{k+1/2})$ is well-defined and converges linearly to some $\widehat{U} \in \mathcal{M}$.

Proof. Both $\mathcal{S}$ and $\mathcal{O}_r$ are $C^\infty$ embedded submanifolds. Near $U_\star$, projection onto $\mathcal{S}$ is uniquely given by the polar factor, since nearby matrices have full column rank. Projection onto $\mathcal{O}_r$ is uniquely given by row-wise normalization, since all rows of $U_\star$ have norm $\sqrt{r} > 0$. By Lemma F.2, $\mathcal{S}$ and $\mathcal{O}_r$ intersect cleanly at $U_\star$. Clean intersection of smooth manifolds implies local linear convergence of alternating projections. Therefore the iterates converge linearly to a point $\widehat{U} \in \mathcal{S} \cap \mathcal{O}_r = \mathcal{M}$. $\square$

G. Neuron Death in ReLU² MLPs

The neuron death analysis in Section 3 focuses on SwiGLU MLPs, but, as we show below, this pathology is not specific to the SiLU gating function. For any activation $\phi$ with $\phi(0) = 0$ and $\phi'(0) \approx 0$, the same vanishing-gradient feedback loop occurs: neurons with small pre-activations receive small gradients, accumulate small momentum, and produce small leverage scores under Muon. Common activation functions including ReLU² (defined by $\phi(x) = \max(0,x)^2$), GELU, and SiLU all share this property.

We verify this by repeating our leverage score analysis on a 400M transformer trained with ReLU² MLPs (same setting as Appendix B, with SwiGLU replaced by ReLU²). The results closely mirror those of the SwiGLU experiments.

Neuron death grid (ReLU²). Under Muon, a large fraction of neurons die early in training. Under U-NorMuon, row leverage scores remain isotropic throughout training.


Leverage scores for all 5632 neurons (up projection, layer 12) in a ReLU² MLP. The same pattern of early, permanent death under Muon and healthy leverage under U-NorMuon is observed.

Leverage persistence (ReLU²). We sort neurons into cohorts by their leverage score at step 500 and track each cohort over training. The pattern matches our SwiGLU results: under Muon, the lowest-leverage cohorts collapse and never recover, while under U-NorMuon, all cohorts converge toward uniform leverage.


Leverage persistence for ReLU² MLPs. Toggle between optimizers, projections, and layers. The qualitative behavior matches the SwiGLU results in Section 3.

H. A Note on MoE Architectures

Aurora's benefits are most pronounced for tall matrices, where the ratio $m/n$ is large. This is the typical regime for up and gate projections in dense transformers, which commonly use MLP expansion factors of 4× or more. In MoE architectures, capacity is distributed across many smaller experts, each with a proportionally smaller hidden dimension and a lower $m/n$ ratio. We therefore expect the neuron death pathology to be less severe in MoE models of moderate size, though not entirely absent for experts that remain meaningfully tall.

I. Code Release

Our implementation of Aurora is open-sourced at github.com/tilde-research/aurora-release. The repository includes both the practical damped-iteration Aurora optimizer and the Riemannian-Aurora reference solver. Our modded-nanoGPT speedrun submission (Aurora + Contra-Muon + u/w-floor, 3175 steps) is available as a pull request at KellerJordan/modded-nanogpt#284. We also release the Aurora 1.1B pretrained model at tilde-research/aurora-1.1B.

References

  1. Jordan, K., Jin, Y., Boza, V., You, J., Cesista, F., Newhouse, L., and Bernstein, J. (2024). Muon: An optimizer for hidden layers in neural networks.

  2. Ahn, K., Xu, B., Abreu, N., Fan, Y., Magakyan, G., Sharma, P., Zhan, Z., and Langford, J. (2025). Dion: Distributed Orthonormalized Updates.

  3. Chang, D., Shi, Q., Zhang, L., Li, Y., Zhang, R., Lu, Y., Liu, Y., and Yuan, G. (2026). MuonEq: Balancing Before Orthogonalization with Lightweight Equilibration.

  4. Li, Z., You, C., Bhojanapalli, S., Li, D., Rawat, A. S., Reddi, S. J., Ye, K., Chern, F., Yu, F., Guo, R., and Kumar, S. (2023). The Lazy Neuron Phenomenon: On Emergence of Activation Sparsity in Transformers.

  5. Zhang, Z., Lin, Y., Liu, Z., Li, P., Sun, M., and Zhou, J. (2022). MoEfication: Transformer Feed-forward Layers are Mixtures of Experts.

  6. Zeng, A., Lv, X., Hou, Z., Du, Z., Zheng, Q., Chen, B., et al. (2026). GLM-5: From Vibe Coding to Agentic Engineering.