muon optimizer explained to a toddler


adamw is, as of 2025, the most used optimizer for training deep neural networks

adamw is momentum based though

empirically, momentum-based optimizers tend to favor a few update directions over the rest

the update matrix looks suspiciously low-rank

it’s like trying to roll a boat down the cost plane

muon’s contributors think that if we could dampen these dominant directions and boost the rarer ones, it would lead to better convergence

kinda bringing the boat back to ball shape

muon’s contributors say that if we run an svd on the momentum update matrix and throw away the singular values (keep just U @ V.T), we get roughly unit length in all directions
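a minimal sketch of that exact-but-slow svd route (function name is mine):

```python
import torch

def orthogonalize_via_svd(G: torch.Tensor) -> torch.Tensor:
    # G = U @ diag(S) @ Vt; dropping S (i.e. setting every
    # singular value to 1) keeps the directions but equalizes their lengths
    U, S, Vt = torch.linalg.svd(G, full_matrices=False)
    return U @ Vt
```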

can’t run an svd on each update step though, too costly

we don’t have to be that precise so we can look at faster iterative algorithms

newtonschulz is one such algorithm

basically, if you take an odd matrix polynomial (only odd powers of the matrix) and apply it to a matrix, it’s like applying it straight to the singular values while U and V.T are left unchanged
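quick numerical check of that property (the coefficients here are arbitrary, demo only):

```python
import torch

torch.manual_seed(0)
X = torch.randn(4, 3)
U, S, Vt = torch.linalg.svd(X, full_matrices=False)

# odd polynomial p(x) = 2x + 3x^3 written in matrix form;
# note X @ X.T @ X == U @ diag(S**3) @ Vt
PX = 2 * X + 3 * (X @ X.T @ X)

# same polynomial applied straight to the singular values
PS = U @ torch.diag(2 * S + 3 * S**3) @ Vt

print(torch.allclose(PX, PS, atol=1e-5))  # True
```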

would be neat to apply the sign function to the singular values because then we would just have 1s on that diagonal

can’t use sign because it’s not a polynomial, even though it’s odd

we can approximate the sign function though with a cubic/quintic polynomial if we apply the function to its own output multiple times (not perfectly, but close enough on [-1,1])

5 iterations (or steps) are good enough

if we normalize the update matrix so its singular values land in [0,1] (divide it by its norm), we are always sure the approximation is sign-like and pushes everything toward 1

actually, empirically, we don’t even need to land exactly at 1; anywhere in [0.7, 1.3] is good enough

the cursed quintic gives us one ugly-looking sign function but has all the right characteristics + it’s fast

its parameters are `a, b, c = (3.4445, -4.7750, 2.0315)`
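here’s what those coefficients actually do to a scalar (a stand-in for one singular value) over 5 iterations:

```python
a, b, c = (3.4445, -4.7750, 2.0315)

def f(x):
    # one quintic newton-schulz step on a scalar
    return a * x + b * x**3 + c * x**5

# after normalization, singular values live in (0, 1]
for s in (0.05, 0.3, 0.7, 1.0):
    x = s
    for _ in range(5):
        x = f(x)
    print(f"{s:.2f} -> {x:.3f}")  # everything lands roughly in [0.7, 1.3]
```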

that’s muon
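putting it together, a sketch close to the reference newton-schulz step from keller’s blog (linked below):

```python
import torch

def newtonschulz5(G: torch.Tensor, steps: int = 5, eps: float = 1e-7) -> torch.Tensor:
    # approximates U @ V.T for G = U @ diag(S) @ V.T, without the svd
    assert G.ndim == 2
    a, b, c = (3.4445, -4.7750, 2.0315)
    X = G / (G.norm() + eps)  # pull singular values into [0, 1]
    if G.size(0) > G.size(1):
        X = X.T  # iterate on the short side, cheaper matmuls
    for _ in range(steps):
        A = X @ X.T
        B = b * A + c * A @ A
        X = a * X + B @ X  # = a*X + b*(XX^T)X + c*(XX^T)^2 X, an odd polynomial
    if G.size(0) > G.size(1):
        X = X.T
    return X
```

in muon this runs on the momentum buffer each step, and the result becomes the update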

ps:

  • nesterov better than sgd-momentum

  • muon only works on 2d hidden dense linear layers

  • so no biases and no embeddings

  • convolutional filters are fine, you just need to reshape them back to 2d (see the sketch after this list)

  • also, empirically, the first and last layers should use adamw
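a rough sketch of that parameter routing, assuming a pytorch model (the name-based filter is mine, real implementations split more carefully):

```python
import torch

def split_params(model: torch.nn.Module):
    muon_params, adamw_params = [], []
    for name, p in model.named_parameters():
        # "embed" / "lm_head" are illustrative names for the first/last layers
        if p.ndim >= 2 and "embed" not in name and "lm_head" not in name:
            muon_params.append(p)   # hidden weight matrices -> muon
        else:
            adamw_params.append(p)  # biases, norms, embeddings, last layer -> adamw
    return muon_params, adamw_params

# conv filters: flatten (out, in, kh, kw) -> (out, in*kh*kw) before newton-schulz
w = torch.randn(16, 3, 3, 3)
w2d = w.view(w.size(0), -1)
```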

32 min step-by-step breakdown here:

big shout out to SJ and Adrian for supporting the content!

also

more papers for a deep dive:

📌 overview: https://rixhabh.notion.site/Muon-2302d1eca8f6801c88e3f80e43c316da?pvs=74

📌 Keller blog: https://kellerjordan.github.io/posts/muon/

📌 Jeremy blog: https://jeremybernste.in/writing/deriving-muon

📌 newtonschulz overview: https://docs.modula.systems/algorithms/newton-schulz/
