MHC: Manifold-Constrained Hyper-Connections

arxiv.org

32 points by ipnon 18 days ago · 3 comments

Alifatisk 18 days ago

So if I get this right, all transformers until today have had the same residual design: one stream carrying information between layers. DeepSeek figured out how to widen it without training collapsing. Wow, incredible work, DeepSeek!
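
For anyone who wants a concrete picture, here is a toy sketch of the difference (mine, not DeepSeek's code: the class names and the fixed read/write weights are made up for illustration, and the hyper-connections paper also has dynamic, input-dependent variants):

    import torch
    import torch.nn as nn

    class ResidualBlock(nn.Module):
        """Standard design: a single residual stream, updated as x + f(x)."""
        def __init__(self, d):
            super().__init__()
            self.f = nn.Sequential(nn.LayerNorm(d), nn.Linear(d, d),
                                   nn.GELU(), nn.Linear(d, d))

        def forward(self, x):        # x: (batch, d)
            return x + self.f(x)     # the one stream every layer reads and writes

    class HyperConnectionBlock(nn.Module):
        """Widened design: n parallel streams, mixed by a learned n x n matrix."""
        def __init__(self, d, n=4):
            super().__init__()
            self.f = nn.Sequential(nn.LayerNorm(d), nn.Linear(d, d),
                                   nn.GELU(), nn.Linear(d, d))
            self.mix = nn.Parameter(torch.eye(n))         # stream-to-stream mixing
            self.read = nn.Parameter(torch.ones(n) / n)   # how f reads the streams
            self.write = nn.Parameter(torch.ones(n) / n)  # how f writes back

        def forward(self, xs):       # xs: (batch, n, d)
            h = self.f(torch.einsum('bnd,n->bd', xs, self.read))   # read a mixture
            xs = torch.einsum('nm,bmd->bnd', self.mix, xs)          # mix the streams
            return xs + self.write[None, :, None] * h[:, None, :]   # write f's output

    x = torch.randn(2, 64)
    xs = x[:, None, :].repeat(1, 4, 1)          # expand one stream into n copies
    print(HyperConnectionBlock(64)(xs).shape)   # torch.Size([2, 4, 64])

The unconstrained mix matrix is the part whose effect compounds across dozens of layers; constraining it is, as I understand it, where mHC comes in.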

  • rvz 18 days ago

    Yes. This is the first general improvement to the residual design in deep neural networks in a long time, and it improves on the standard hyper-connections (HC) architecture when training LLMs at large scale.

    So far they have tested this on training 27B models, with tiny overhead and fewer "exploding" signals than the other approaches and the baseline (see the sketch at the end of this comment). It would be interesting to see results from >100B-parameter models.

    This should be recommended reading for those interested in the micro-design changes from the days of residual networks (ResNet) to Manifold-Constrained Hyper-Connections (mHC).

    Instead of just throwing more GPUs + money + parameters + data at the problem.
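
    As I read the paper, the "manifold constraint" keeps the stream-mixing matrix doubly stochastic (rows and columns each summing to 1, i.e. on the Birkhoff polytope), which bounds how much any stream can amplify per layer. A minimal sketch of that projection via Sinkhorn-style normalization (my simplification, with an arbitrary iteration count, not the paper's exact procedure):

        import torch

        def sinkhorn(scores, n_iters=10):
            # Alternately normalize rows and columns; the positive matrix
            # converges toward doubly stochastic (rows and columns sum to 1).
            m = scores.exp()
            for _ in range(n_iters):
                m = m / m.sum(dim=1, keepdim=True)  # rows sum to 1
                m = m / m.sum(dim=0, keepdim=True)  # columns sum to 1
            return m

        mix = sinkhorn(torch.randn(4, 4))
        print(mix.sum(dim=0))  # ~[1, 1, 1, 1]: mixing preserves total signal mass
        print(mix.sum(dim=1))  # ~[1, 1, 1, 1]: no stream explodes layer over layer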

  • karmakaze 17 days ago

    I saw this topic in my YouTube feed (YouTubers are fast). Looking for a bit more info for laypeople, I found this[0].

    [0] https://www.toolmesh.ai/news/deepseek-mhc-architecture-ai-pe...
