Senko – Very Fast Speaker Diarization
1 hour of audio processed in 5 seconds (RTX 4090, Ryzen 9 7950X). ~17x faster than Pyannote 3.1.
On an M3 MacBook Air, 1 hour takes 23.5 seconds (~14x faster).
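For a sense of scale, the figures above can be turned into real-time factors. This is only arithmetic on the numbers quoted in this post. The implied Pyannote 3.1 timings are back-calculated from the stated speedups, assuming each speedup was measured on the same machine; they are not separate measurements.

```python
AUDIO_SECONDS = 3600  # 1 hour of audio

# (processing time in seconds, stated speedup over Pyannote 3.1)
benchmarks = {
    "RTX 4090 + Ryzen 9 7950X": (5.0, 17),
    "M3 MacBook Air": (23.5, 14),
}

def summarize(proc_seconds, speedup):
    """Return (times faster than real time, implied baseline time in seconds)."""
    times_realtime = AUDIO_SECONDS / proc_seconds
    implied_baseline = proc_seconds * speedup  # back-calculated, not measured
    return times_realtime, implied_baseline

for name, (proc, speedup) in benchmarks.items():
    rt, base = summarize(proc, speedup)
    print(f"{name}: {rt:.0f}x real time, implied Pyannote 3.1 time ~{base:.0f} s")
```

On the RTX 4090 setup this works out to 720x real time.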
This is a custom speaker diarization pipeline I've developed; it's a modified version of the pipeline found in the excellent 3D-Speaker project by Alibaba Research.
My optimizations/modifications were the following:
- changed VAD model
- multi-threaded Fbank feature extraction
- batched inference of CAM++ embeddings model
- clustering is accelerated by RAPIDS when an NVIDIA GPU is available
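To illustrate the shape of two of these changes (threaded feature extraction and batched embedding inference), here is a minimal sketch. It is not Senko's actual code: `extract_features` and `embed_batch` are hypothetical stand-ins for Fbank extraction and a CAM++ forward pass, and the frame size, worker count, and batch size are arbitrary values chosen for the example.

```python
from concurrent.futures import ThreadPoolExecutor
import math

def extract_features(segment):
    """Hypothetical stand-in for Fbank: log-energy of each 4-sample frame."""
    frame = 4
    return [
        math.log(sum(x * x for x in segment[i:i + frame]) + 1e-6)
        for i in range(0, len(segment) - frame + 1, frame)
    ]

def extract_all(segments, workers=4):
    # Executor.map preserves input order, so features[i] belongs to segments[i].
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(extract_features, segments))

def make_batches(items, batch_size):
    """Split a list into consecutive chunks of at most batch_size items."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]

def embed_batch(batch):
    """Hypothetical stand-in for one model call: mean of each feature sequence."""
    return [[sum(f) / len(f)] for f in batch]

def embed_all(features, batch_size=3):
    # One "model call" per batch instead of one per segment.
    embeddings = []
    for batch in make_batches(features, batch_size):
        embeddings.extend(embed_batch(batch))
    return embeddings

# Seven toy "speech segments" of 16 samples each.
segments = [[0.1 * (i + j) for j in range(16)] for i in range(7)]
features = extract_all(segments)
embeddings = embed_all(features, batch_size=3)
print(len(features), len(embeddings))
```

In pure Python the GIL limits how much threads help. The pattern pays off when the extraction function is native code that releases the GIL, and when the model processes a whole batch in a single forward pass.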
Optimizations aside, massive credit needs to be given to the CAM++ speaker embeddings model, whose efficiency is where the majority of the speed comes from.
This pipeline powers the Zanshin media player, which is an attempt at a usable integration of diarization in a media player. Check it out at https://zanshin.sh and discuss it at https://news.ycombinator.com/item?id=45104866
Let me know what you think! Were you also frustrated by how slow speaker diarization is? Does Senko's speed unlock new use cases for you? Cheers, everyone.