Introducing NSA: A Hardware-Aligned and Natively Trainable Sparse Attention mechanism for ultra-fast long-context training & inference!

Core components of NSA:
• Dynamic hierarchical sparse strategy
• Coarse-grained token compression
• Fine-grained token selection

With optimized design for modern hardware, NSA speeds up inference while reducing pre-training costs, without compromising performance. It matches or outperforms Full Attention models on general benchmarks, long-context tasks, and instruction-based reasoning.

For more details, check out our paper here: arxiv.org/abs/2502.11089
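
To give a rough feel for how coarse-grained compression and fine-grained selection can fit together, here is a toy NumPy sketch for a single query. It is not NSA's implementation: the announcement does not say how blocks are compressed or scored, so the mean pooling, the dot-product block scoring, the fixed `top_k`, and every name below (`sparse_attention_sketch`, `block_size`, `top_k`) are assumptions made for illustration. It also ignores the hardware-aligned kernels and the training side entirely.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D array.
    x = x - np.max(x)
    e = np.exp(x)
    return e / e.sum()

def sparse_attention_sketch(q, K, V, block_size=4, top_k=2):
    """Toy single-query attention over a few selected key blocks.

    Assumptions (not from the announcement): mean pooling as the
    compression step, dot-product block scoring, and a fixed top_k.
    """
    n, d = K.shape
    assert n % block_size == 0, "sketch assumes n is divisible by block_size"
    n_blocks = n // block_size

    # Coarse-grained compression: one summary key per block (mean pooling here).
    K_comp = K.reshape(n_blocks, block_size, d).mean(axis=1)

    # Score each block against the query and keep the top_k highest-scoring blocks.
    block_scores = K_comp @ q / np.sqrt(d)
    chosen = np.sort(np.argsort(block_scores)[-top_k:])

    # Fine-grained selection: attend over the individual tokens of chosen blocks only.
    token_idx = (chosen[:, None] * block_size + np.arange(block_size)).ravel()
    weights = softmax(K[token_idx] @ q / np.sqrt(d))
    return weights @ V[token_idx], token_idx

# Example: 16 tokens, 8-dimensional keys and values, attend to 2 of 4 blocks.
rng = np.random.default_rng(0)
q = rng.standard_normal(8)
K = rng.standard_normal((16, 8))
V = rng.standard_normal((16, 8))
out, idx = sparse_attention_sketch(q, K, V)
```

In this sketch the query touches only `top_k * block_size` keys instead of all of them, which is where the savings on long contexts would come from. Setting `top_k` to the number of blocks recovers ordinary full attention.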