πŸš€ Introducing NSA: A Hardware-Aligned and Natively Trainable Sparse Attention mechanism for ultra-fast long-context training & inference! Core components of NSA: β€’ Dynamic hierarchical sparse strategy β€’ Coarse-grained token compression β€’ Fine-grained token selection πŸ’‘ With https://t.co/zjXuBzzDCp

1 min read Original article β†—

πŸš€ Introducing NSA: A Hardware-Aligned and Natively Trainable Sparse Attention mechanism for ultra-fast long-context training & inference! Core components of NSA: β€’ Dynamic hierarchical sparse strategy β€’ Coarse-grained token compression β€’ Fine-grained token selection πŸ’‘ With optimized design for modern hardware, NSA speeds up inference while reducing pre-training costsβ€”without compromising performance. It matches or outperforms Full Attention models on general benchmarks, long-context tasks, and instruction-based reasoning. πŸ“– For more details, check out our paper here: arxiv.org/abs/2502.11089