AI speech models for Apple Silicon, powered by MLX Swift and CoreML.
Read in: English · 中文 · 日本語 · 한국어 · Español · Deutsch · Français · हिन्दी · Português · Русский

On-device speech recognition, synthesis, and understanding for Mac and iOS. Runs locally on Apple Silicon — no cloud, no API keys, no data leaves your device.

Full Documentation → · HuggingFace Models · Blog
- Qwen3-ASR — Speech-to-text (automatic speech recognition, 52 languages, MLX + CoreML)
- Parakeet TDT — Speech-to-text via CoreML (Neural Engine, NVIDIA FastConformer + TDT decoder, 25 languages)
- Omnilingual ASR — Speech-to-text (Meta wav2vec2 + CTC, 1,672 languages across 32 scripts, CoreML 300M + MLX 300M/1B/3B/7B)
- Streaming Dictation — Real-time dictation with partials and end-of-utterance detection (Parakeet-EOU-120M)
- Qwen3-ForcedAligner — Word-level timestamp alignment (audio + text → timestamps)
- Qwen3-TTS — Text-to-speech (highest quality, streaming, custom speakers, 10 languages)
- CosyVoice TTS — Streaming TTS with voice cloning, multi-speaker dialogue, and emotion tags (9 languages)
- Kokoro TTS — On-device TTS (82M, CoreML/Neural Engine, 54 voices, iOS-ready, 10 languages)
- Qwen3.5-Chat — On-device LLM chat (0.8B, MLX INT4 + CoreML INT8, DeltaNet hybrid, streaming tokens)
- PersonaPlex — Full-duplex speech-to-speech (7B, audio in → audio out, 18 voice presets)
- DeepFilterNet3 — Real-time noise suppression (2.1M params, 48 kHz)
- VAD — Voice activity detection (Silero streaming, Pyannote offline, FireRedVAD 100+ languages)
- Speaker Diarization — Who spoke when (Pyannote pipeline, Sortformer end-to-end on the Neural Engine)
- Speaker Embeddings — WeSpeaker ResNet34 (256-dim), CAM++ (192-dim)
Papers: Qwen3-ASR (Alibaba) · Qwen3-TTS (Alibaba) · Omnilingual ASR (Meta) · Parakeet TDT (NVIDIA) · CosyVoice 3 (Alibaba) · Kokoro (StyleTTS 2) · PersonaPlex (NVIDIA) · Mimi (Kyutai) · Sortformer (NVIDIA)
## News

- 20 Mar 2026 — We Beat Whisper Large v3 with a 600M Model Running Entirely on Your Mac
- 26 Feb 2026 — Speaker Diarization and Voice Activity Detection on Apple Silicon — Native Swift with MLX
- 23 Feb 2026 — NVIDIA PersonaPlex 7B on Apple Silicon — Full-Duplex Speech-to-Speech in Native Swift with MLX
- 12 Feb 2026 — Qwen3-ASR Swift: On-Device ASR + TTS for Apple Silicon — Architecture and Benchmarks
## Quick start

Add the package to your `Package.swift`:

```swift
.package(url: "https://github.com/soniqo/speech-swift", from: "0.0.9")
```
Import only the modules you need — every model is its own SPM library, so you don't pay for what you don't use:

```swift
.product(name: "ParakeetStreamingASR", package: "speech-swift"),
.product(name: "SpeechUI", package: "speech-swift"), // optional SwiftUI views
```
Transcribe an audio buffer in 3 lines:
```swift
import ParakeetStreamingASR

let model = try await ParakeetStreamingASRModel.fromPretrained()
let text = try model.transcribeAudio(audioSamples, sampleRate: 16000)
```
Live streaming with partials:
```swift
for await partial in model.transcribeStream(audio: samples, sampleRate: 16000) {
    print(partial.isFinal ? "FINAL: \(partial.text)" : "... \(partial.text)")
}
```
SwiftUI dictation view in ~10 lines:
```swift
import SwiftUI
import ParakeetStreamingASR
import SpeechUI

@MainActor
struct DictateView: View {
    @State private var store = TranscriptionStore()

    var body: some View {
        TranscriptionView(finals: store.finalLines, currentPartial: store.currentPartial)
            .task {
                let model = try? await ParakeetStreamingASRModel.fromPretrained()
                guard let model else { return }
                for await p in model.transcribeStream(audio: samples, sampleRate: 16000) {
                    store.apply(text: p.text, isFinal: p.isFinal)
                }
            }
    }
}
```
`SpeechUI` ships only `TranscriptionView` (finals + partials) and `TranscriptionStore` (a streaming-ASR adapter). Use AVFoundation for audio visualization and playback.
Available SPM products: Qwen3ASR, Qwen3TTS, Qwen3TTSCoreML, ParakeetASR, ParakeetStreamingASR, OmnilingualASR, KokoroTTS, CosyVoiceTTS, PersonaPlex, SpeechVAD, SpeechEnhancement, Qwen3Chat, SpeechCore, SpeechUI, AudioCommon.
## Models

Compact view below. Full model catalogue with sizes, quantisations, download URLs, and memory tables → soniqo.audio/architecture.
| Model | Task | Backends | Sizes | Languages |
|---|---|---|---|---|
| Qwen3-ASR | Speech → Text | MLX, CoreML (hybrid) | 0.6B, 1.7B | 52 |
| Parakeet TDT | Speech → Text | CoreML (ANE) | 0.6B | 25 European |
| Parakeet EOU | Speech → Text (streaming) | CoreML (ANE) | 120M | 25 European |
| Omnilingual ASR | Speech → Text | CoreML (ANE), MLX | 300M / 1B / 3B / 7B | 1,672 |
| Qwen3-ForcedAligner | Audio + Text → Timestamps | MLX, CoreML | 0.6B | Multi |
| Qwen3-TTS | Text → Speech | MLX, CoreML | 0.6B, 1.7B | 10 |
| CosyVoice3 | Text → Speech | MLX | 0.5B | 9 |
| Kokoro-82M | Text → Speech | CoreML (ANE) | 82M | 10 |
| Qwen3.5-Chat | Text → Text (LLM) | MLX, CoreML | 0.8B | Multi |
| PersonaPlex | Speech → Speech | MLX | 7B | EN |
| Silero VAD | Voice Activity Detection | MLX, CoreML | 309K | Agnostic |
| Pyannote | VAD + Diarization | MLX | 1.5M | Agnostic |
| Sortformer | Diarization (E2E) | CoreML (ANE) | — | Agnostic |
| DeepFilterNet3 | Speech Enhancement | CoreML | 2.1M | Agnostic |
| WeSpeaker | Speaker Embedding | MLX, CoreML | 6.6M | Agnostic |
## Installation

### Homebrew

Requires native ARM Homebrew (`/opt/homebrew`). Rosetta/x86_64 Homebrew is not supported.

```shell
brew tap soniqo/speech https://github.com/soniqo/speech-swift
brew install speech
```
Then:
```shell
audio transcribe recording.wav
audio speak "Hello world"
audio respond --input question.wav --transcript
```

### Swift Package Manager
```swift
dependencies: [
    .package(url: "https://github.com/soniqo/speech-swift", from: "0.0.9")
]
```
Import only what you need — every model is its own SPM target:

```swift
import Qwen3ASR             // Speech recognition (MLX)
import ParakeetASR          // Speech recognition (CoreML, batch)
import ParakeetStreamingASR // Streaming dictation with partials + EOU
import OmnilingualASR       // 1,672 languages (CoreML + MLX)
import Qwen3TTS             // Text-to-speech
import CosyVoiceTTS         // Text-to-speech with voice cloning
import KokoroTTS            // Text-to-speech (iOS-ready)
import Qwen3Chat            // On-device LLM chat
import PersonaPlex          // Full-duplex speech-to-speech
import SpeechVAD            // VAD + speaker diarization + embeddings
import SpeechEnhancement    // Noise suppression
import SpeechUI             // SwiftUI components for streaming transcripts
import AudioCommon          // Shared protocols and utilities
```
## Requirements
- Swift 5.9+, Xcode 15+ (with Metal Toolchain)
- macOS 14+ or iOS 17+, Apple Silicon (M1/M2/M3/M4)
## Build from source

```shell
git clone https://github.com/soniqo/speech-swift
cd speech-swift
make build
```

`make build` compiles the Swift package and the MLX Metal shader library. The Metal library is required for GPU inference — without it you'll see `Failed to load the default metallib` at runtime. Use `make debug` for debug builds and `make test` for the test suite.
Full build and install guide →
## Demo apps

- DictateDemo (docs) — macOS menu-bar streaming dictation with live partials, VAD-driven end-of-utterance detection, and one-click copy. Runs as a background agent (Parakeet-EOU-120M + Silero VAD).
- iOSEchoDemo — iOS echo demo (Parakeet ASR + Kokoro TTS). Device and simulator.
- PersonaPlexDemo — Conversational voice assistant with mic input, VAD, and multi-turn context. macOS. RTF ~0.94 on M2 Max (faster than real time).
- SpeechDemo — Dictation and TTS synthesis in a tabbed interface. macOS.
Each demo's README has build instructions.
## Code examples
The snippets below show the minimal path for each domain. Every section links to a full guide on soniqo.audio with configuration options, multiple backends, streaming patterns, and CLI recipes.
### Speech-to-Text — full guide →

```swift
import Qwen3ASR

let model = try await Qwen3ASRModel.fromPretrained()
let text = model.transcribe(audio: audioSamples, sampleRate: 16000)
```
Alternative backends: Parakeet TDT (CoreML, 32× real-time), Omnilingual ASR (1,672 languages, CoreML or MLX), Streaming dictation (live partials).
### Forced Alignment — full guide →

```swift
import Qwen3ASR

let aligner = try await Qwen3ForcedAligner.fromPretrained()
let aligned = aligner.align(
    audio: audioSamples,
    text: "Can you guarantee that the replacement part will be shipped tomorrow?",
    sampleRate: 24000
)
for word in aligned {
    print("[\(word.startTime)s - \(word.endTime)s] \(word.text)")
}
```
### Text-to-Speech — full guide →

```swift
import Qwen3TTS
import AudioCommon

let model = try await Qwen3TTSModel.fromPretrained()
let audio = model.synthesize(text: "Hello world", language: "english")
try WAVWriter.write(samples: audio, sampleRate: 24000, to: outputURL)
```
Alternative TTS engines: CosyVoice3 (streaming + voice cloning + emotion tags), Kokoro-82M (iOS-ready, 54 voices), Voice cloning.
### Speech-to-Speech — full guide →

```swift
import PersonaPlex

let model = try await PersonaPlexModel.fromPretrained()
let responseAudio = model.respond(userAudio: userSamples)
// 24 kHz mono Float32 output ready for playback
```
### LLM Chat — full guide →

```swift
import Qwen3Chat

let chat = try await Qwen35MLXChat.fromPretrained()
chat.chat(messages: [(.user, "Explain MLX in one sentence")]) { token, isFinal in
    print(token, terminator: "")
}
```
### Voice Activity Detection — full guide →

```swift
import SpeechVAD

let vad = try await SileroVADModel.fromPretrained()
let segments = vad.detectSpeech(audio: samples, sampleRate: 16000)
for s in segments {
    print("\(s.startTime)s – \(s.endTime)s")
}
```
### Speaker Diarization — full guide →

```swift
import SpeechVAD

let diarizer = try await DiarizationPipeline.fromPretrained()
let segments = diarizer.diarize(audio: samples, sampleRate: 16000)
for s in segments {
    print("Speaker \(s.speakerId): \(s.startTime)s - \(s.endTime)s")
}
```
### Speech Enhancement — full guide →

```swift
import SpeechEnhancement

let denoiser = try await DeepFilterNet3Model.fromPretrained()
let clean = try denoiser.enhance(audio: noisySamples, sampleRate: 48000)
```
### Voice Pipeline (ASR → LLM → TTS) — full guide →

```swift
import SpeechCore

let pipeline = VoicePipeline(
    stt: parakeetASR,
    tts: qwen3TTS,
    vad: sileroVAD,
    config: .init(mode: .voicePipeline),
    onEvent: { event in print(event) }
)
pipeline.start()
pipeline.pushAudio(micSamples)
```
`VoicePipeline` is the real-time voice-agent state machine (powered by speech-core) with VAD-driven turn detection, interruption handling, and eager STT. It connects any `SpeechRecognitionModel` + `SpeechGenerationModel` + `StreamingVADProvider`.
## HTTP API server

The server exposes every model via HTTP REST + WebSocket endpoints, including an OpenAI Realtime API-compatible WebSocket at `/v1/realtime`. See `Sources/AudioServer/`.
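As an illustrative sketch (not taken from the repo), a client could reach the Realtime WebSocket with plain `URLSession`. Only the `/v1/realtime` path comes from the description above — the host, port, and JSON event shape are assumptions to check against `Sources/AudioServer/`:

```swift
import Foundation

// Hypothetical local endpoint — the host and port are assumptions;
// /v1/realtime is the documented path.
let url = URL(string: "ws://localhost:8080/v1/realtime")!
let task = URLSession.shared.webSocketTask(with: url)
task.resume()

// OpenAI Realtime-style servers exchange JSON events as text frames.
// The event type below is illustrative, not taken from this repo.
task.send(.string(#"{"type": "session.update"}"#)) { error in
    if let error { print("send failed: \(error)") }
}

task.receive { result in
    switch result {
    case .success(.string(let json)): print("event: \(json)")
    case .success(.data(let data)):   print("binary frame: \(data.count) bytes")
    case .failure(let error):         print("receive failed: \(error)")
    default: break
    }
}
```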
## Architecture

speech-swift is split into one SPM target per model so consumers only pay for what they import. Shared infrastructure lives in `AudioCommon` (protocols, audio I/O, HuggingFace downloader, `SentencePieceModel`) and `MLXCommon` (weight loading, `QuantizedLinear` helpers, an SDPA multi-head attention helper).
Full architecture diagram with backends, memory tables, and module map → soniqo.audio/architecture · API reference → soniqo.audio/api · Benchmarks → soniqo.audio/benchmarks
Local docs (repo):
- Models: Qwen3-ASR · Qwen3-TTS · CosyVoice · Kokoro · Parakeet TDT · Parakeet Streaming · Omnilingual ASR · PersonaPlex · FireRedVAD
- Inference: Qwen3-ASR · Parakeet TDT · Parakeet Streaming · Omnilingual ASR · TTS · Forced Aligner · Silero VAD · Speaker Diarization · Speech Enhancement
- Reference: Shared Protocols
## Cache configuration

Model weights download from HuggingFace on first use and are cached to `~/Library/Caches/qwen3-speech/`. Override the location with `QWEN3_CACHE_DIR` (CLI) or `cacheDir:` (Swift API). All `fromPretrained()` entry points also accept `offlineMode: true` to skip the network when weights are already cached.
See `docs/inference/cache-and-offline.md` for full details, including sandboxed iOS container paths.
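A minimal sketch of both overrides, assuming `cacheDir:` takes a file URL and pairs with `offlineMode:` on the same `fromPretrained()` call — the parameter labels come from the section above, but the exact signature and the directory used here are illustrative:

```swift
import Foundation
import ParakeetStreamingASR

// Illustrative cache location — any writable directory should work.
let cache = URL(fileURLWithPath: NSHomeDirectory()).appendingPathComponent("ModelCache")

// First run: downloads weights from HuggingFace into the custom cache.
let model = try await ParakeetStreamingASRModel.fromPretrained(cacheDir: cache)

// Subsequent runs: reuse the cached weights and never touch the network.
let offline = try await ParakeetStreamingASRModel.fromPretrained(
    cacheDir: cache,
    offlineMode: true
)
```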
## MLX Metal library

If you see `Failed to load the default metallib` at runtime, the Metal shader library is missing. Run `make build` or `./scripts/build_mlx_metallib.sh release` after a manual `swift build`. If the Metal Toolchain is missing, install it first:

```shell
xcodebuild -downloadComponent MetalToolchain
```
## Testing

```shell
make test                          # full suite (unit + E2E with model downloads)
swift test --skip E2E              # unit only (CI-safe, no downloads)
swift test --filter Qwen3ASRTests  # specific module
```
E2E test classes use the E2E prefix so CI can filter them out with --skip E2E. See CLAUDE.md for the full testing convention.
## Contributing

PRs welcome — bug fixes, new model integrations, documentation. Fork, create a feature branch, run `make build && make test`, and open a PR against `main`.
## License
Apache 2.0