Senko – Very Fast Speaker Diarization

2 points by hamza_q_ 7 months ago · 1 comment

1 hour of audio processed in 5 seconds (RTX 4090, Ryzen 9 7950X). ~17x faster than Pyannote 3.1.

On an M3 MacBook Air, 1 hour of audio is processed in 23.5 seconds (~14x faster).
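To put those numbers in terms of real-time speed, here is a small back-of-the-envelope calculation. It only uses the figures quoted above; the function names are made up for illustration, and the implied Pyannote timings assume that "Nx faster" means Pyannote takes roughly N times as long on the same hardware.

```python
def realtime_factor(audio_seconds: float, processing_seconds: float) -> float:
    """How many seconds of audio are processed per second of wall-clock time."""
    return audio_seconds / processing_seconds


def implied_baseline_seconds(processing_seconds: float, speedup: float) -> float:
    """Baseline time implied by a quoted speedup (assumes 'Nx faster' = N times the time)."""
    return processing_seconds * speedup


HOUR = 3600.0  # one hour of audio, in seconds

rtx_4090 = realtime_factor(HOUR, 5.0)   # 720.0x real time
m3_air = realtime_factor(HOUR, 23.5)    # about 153.2x real time

# Rough Pyannote 3.1 timings implied by the quoted speedups (approximate,
# since the speedups themselves are given as "~17x" and "~14x").
pyannote_4090 = implied_baseline_seconds(5.0, 17)   # about 85 seconds
pyannote_m3 = implied_baseline_seconds(23.5, 14)    # about 329 seconds

print(f"RTX 4090: {rtx_4090:.0f}x real time, M3 Air: {m3_air:.1f}x real time")
print(f"Implied Pyannote 3.1: ~{pyannote_4090:.0f}s (4090), ~{pyannote_m3:.0f}s (M3 Air)")
```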

This is a custom speaker diarization pipeline I've developed; it's a modified version of the pipeline found in the excellent 3D-Speaker project by Alibaba Research.

My optimizations/modifications were the following:

- changed VAD model

- multi-threaded Fbank feature extraction

- batched inference of CAM++ embeddings model

- clustering accelerated by RAPIDS when an NVIDIA GPU is available
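For readers unfamiliar with the pattern, the multi-threaded feature extraction and batched inference steps can be sketched roughly as below. This is not Senko's actual code: `extract_features` and `embed_batch` are hypothetical stand-ins for Fbank extraction and a CAM++ forward pass, and plain Python lists stand in for tensors so the example runs with the standard library alone.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Iterator, List, Sequence

Segment = Sequence[float]   # raw audio samples for one speech segment
Features = List[float]      # stand-in for an Fbank feature matrix


def extract_features(segment: Segment) -> Features:
    """Hypothetical placeholder for Fbank extraction (not the real algorithm)."""
    # A real extractor would call into native code that releases the GIL,
    # which is what makes a thread pool worthwhile here.
    return [abs(sample) for sample in segment]


def batched(items: Sequence[Features], batch_size: int) -> Iterator[Sequence[Features]]:
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def embed_batch(batch: Sequence[Features]) -> List[float]:
    """Hypothetical placeholder for one forward pass of an embedding model."""
    # One call handles the whole batch, instead of one call per segment.
    return [sum(features) / len(features) for features in batch]


def embed_segments(segments: Sequence[Segment],
                   batch_size: int = 4,
                   workers: int = 4) -> List[float]:
    # Step 1: extract features for all segments in parallel.
    # pool.map preserves input order, so results line up with segments.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        features = list(pool.map(extract_features, segments))

    # Step 2: run the embedding model on fixed-size batches.
    embeddings: List[float] = []
    for batch in batched(features, batch_size):
        embeddings.extend(embed_batch(batch))
    return embeddings


if __name__ == "__main__":
    demo = [[1.0, -1.0], [2.0, -2.0], [3.0, 3.0], [0.5, 0.5], [4.0, -4.0]]
    print(embed_segments(demo, batch_size=2, workers=2))
```

The point of batching is to amortize per-call overhead (and, on a GPU, to keep the device busy) by sending many segments through the model at once; the batch size and worker count shown here are arbitrary.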

Optimizations aside, massive credit needs to be given to the CAM++ speaker embeddings model, whose efficiency is where the majority of the speed comes from.

This pipeline powers the Zanshin media player, which is an attempt at a usable integration of diarization in a media player. Check it out here: https://zanshin.sh. Discuss it here: https://news.ycombinator.com/item?id=45104866

Let me know what you think! Were you also frustrated by how slow speaker diarization is? Does Senko's speed unlock new use cases for you? Cheers, everyone.
