Full speech pipeline in native Swift/MLX – ASR, TTS, speech-to-speech, on-device

github.com

5 points by ipotapov a month ago · 2 comments

ipotapovOP a month ago

Been building this for a few months now and it's turned into a complete on-device audio pipeline for Apple Silicon:

ASR (Qwen3) → TTS (Qwen3 + CosyVoice, 10 languages) → Speech-to-Speech (PersonaPlex 7B, full-duplex) → Speaker Diarization (pyannote + WeSpeaker) → Voice Activity Detection (Silero, real-time streaming) → Forced Alignment (word-level timestamps)

No Python, no server, no CoreML — pure Swift through MLX. Models download automatically from HuggingFace on first run. The whole diarization stack is ~32 MB.

Everything is protocol-based and composable — VAD gates ASR, diarization feeds into transcription, embeddings enable speaker verification. Mix and match.
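To illustrate the "VAD gates ASR" composition described above, here is a minimal sketch. The protocol and type names (`VoiceActivityDetector`, `SpeechRecognizer`, `GatedTranscriber`) are hypothetical stand-ins for illustration; the repo's actual API may differ.

```swift
// Hypothetical protocols sketching the composable design described
// in the post; not the repo's actual API.
protocol VoiceActivityDetector {
    /// Returns true if the audio frame contains speech.
    func isSpeech(_ frame: [Float]) -> Bool
}

protocol SpeechRecognizer {
    /// Transcribes a buffer of audio samples to text.
    func transcribe(_ audio: [Float]) -> String
}

/// VAD gates ASR: only frames flagged as speech are forwarded
/// to the recognizer, so silence never costs inference time.
struct GatedTranscriber {
    let vad: any VoiceActivityDetector
    let asr: any SpeechRecognizer

    func process(frames: [[Float]]) -> String {
        let speechFrames = frames.filter { vad.isSpeech($0) }
        return asr.transcribe(speechFrames.flatMap { $0 })
    }
}
```

Because each stage is behind a protocol, any conforming implementation can be swapped in, which is what makes the "mix and match" composition possible.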

Repo: github.com/ivan-digital/qwen3-asr-swift (Apache 2.0)

Blog post with architecture details: blog.ivan.digital

There's a lot of surface area here and contributions are very welcome — whether it's new model ports, iOS integration, performance work, or just filing issues. If you've been wanting to do anything with audio or MLX in Swift, come build with us.

lyro a month ago

looks really solid, great job!
