speech-swift — AI speech toolkit for Apple Silicon: ASR, TTS, speech-to-speech, VAD, and diarization powered by MLX and CoreML


AI speech models for Apple Silicon, powered by MLX Swift and CoreML.

📖 Read in: English · 中文 · 日本語 · 한국어 · Español · Deutsch · Français · हिन्दी · Português · Русский

On-device speech recognition, synthesis, and understanding for Mac and iOS. Runs locally on Apple Silicon — no cloud, no API keys, no data leaves your device.

📚 Full Documentation → · 🤗 HuggingFace Models · 📝 Blog

  • Qwen3-ASR — Speech-to-text (automatic speech recognition, 52 languages, MLX + CoreML)
  • Parakeet TDT — Speech-to-text via CoreML (Neural Engine, NVIDIA FastConformer + TDT decoder, 25 languages)
  • Omnilingual ASR — Speech-to-text (Meta wav2vec2 + CTC, 1,672 languages across 32 scripts, CoreML 300M + MLX 300M/1B/3B/7B)
  • Streaming Dictation — Real-time dictation with partials and end-of-utterance detection (Parakeet-EOU-120M)
  • Qwen3-ForcedAligner — Word-level timestamp alignment (audio + text → timestamps)
  • Qwen3-TTS — Text-to-speech (highest quality, streaming, custom speakers, 10 languages)
  • CosyVoice TTS — Streaming TTS with voice cloning, multi-speaker dialogue, emotion tags (9 languages)
  • Kokoro TTS — On-device TTS (82M, CoreML/Neural Engine, 54 voices, iOS-ready, 10 languages)
  • Qwen3.5-Chat — On-device LLM chat (0.8B, MLX INT4 + CoreML INT8, DeltaNet hybrid, streaming tokens)
  • PersonaPlex — Full-duplex speech-to-speech (7B, audio in → audio out, 18 voice presets)
  • DeepFilterNet3 — Real-time noise suppression (2.1M params, 48 kHz)
  • VAD — Voice activity detection (Silero streaming, Pyannote offline, FireRedVAD 100+ languages)
  • Speaker Diarization — Who spoke when (Pyannote pipeline, Sortformer end-to-end on Neural Engine)
  • Speaker Embeddings — WeSpeaker ResNet34 (256-dim), CAM++ (192-dim)

Papers: Qwen3-ASR (Alibaba) · Qwen3-TTS (Alibaba) · Omnilingual ASR (Meta) · Parakeet TDT (NVIDIA) · CosyVoice 3 (Alibaba) · Kokoro (StyleTTS 2) · PersonaPlex (NVIDIA) · Mimi (Kyutai) · Sortformer (NVIDIA)


Quick start

Add the package to your Package.swift:

.package(url: "https://github.com/soniqo/speech-swift", from: "0.0.9")

Import only the modules you need — every model is its own SPM library, so you don't pay for what you don't use:

.product(name: "ParakeetStreamingASR", package: "speech-swift"),
.product(name: "SpeechUI",             package: "speech-swift"),  // optional SwiftUI views

Transcribe an audio buffer in 3 lines:

import ParakeetStreamingASR

let model = try await ParakeetStreamingASRModel.fromPretrained()
let text = try model.transcribeAudio(audioSamples, sampleRate: 16000)

Live streaming with partials:

for await partial in model.transcribeStream(audio: samples, sampleRate: 16000) {
    print(partial.isFinal ? "FINAL: \(partial.text)" : "... \(partial.text)")
}

SwiftUI dictation view in ~10 lines:

import SwiftUI
import ParakeetStreamingASR
import SpeechUI

@MainActor
struct DictateView: View {
    @State private var store = TranscriptionStore()

    var body: some View {
        TranscriptionView(finals: store.finalLines, currentPartial: store.currentPartial)
            .task {
                let model = try? await ParakeetStreamingASRModel.fromPretrained()
                guard let model else { return }
                for await p in model.transcribeStream(audio: samples, sampleRate: 16000) {
                    store.apply(text: p.text, isFinal: p.isFinal)
                }
            }
    }
}

SpeechUI ships only TranscriptionView (finals + partials) and TranscriptionStore (streaming ASR adapter). Use AVFoundation for audio visualization and playback.
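For playback, a minimal AVFoundation sketch (standard Apple APIs only, nothing from speech-swift) that plays a mono Float32 buffer such as a model's 24 kHz TTS output could look like:

```swift
import AVFoundation

// Play a mono Float32 sample buffer through AVAudioEngine.
// `samples` is assumed to come from one of the TTS models above.
func play(samples: [Float], sampleRate: Double) throws {
    let engine = AVAudioEngine()
    let player = AVAudioPlayerNode()
    engine.attach(player)

    let format = AVAudioFormat(standardFormatWithSampleRate: sampleRate, channels: 1)!
    engine.connect(player, to: engine.mainMixerNode, format: format)

    // Copy the samples into a PCM buffer the player can schedule.
    let frames = AVAudioFrameCount(samples.count)
    let buffer = AVAudioPCMBuffer(pcmFormat: format, frameCapacity: frames)!
    buffer.frameLength = frames
    samples.withUnsafeBufferPointer { src in
        buffer.floatChannelData![0].update(from: src.baseAddress!, count: samples.count)
    }

    try engine.start()
    player.scheduleBuffer(buffer, completionHandler: nil)
    player.play()
}
```

Keep the engine alive for the duration of playback; if it is deallocated, audio stops.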

Available SPM products: Qwen3ASR, Qwen3TTS, Qwen3TTSCoreML, ParakeetASR, ParakeetStreamingASR, OmnilingualASR, KokoroTTS, CosyVoiceTTS, PersonaPlex, SpeechVAD, SpeechEnhancement, Qwen3Chat, SpeechCore, SpeechUI, AudioCommon.

Models

Compact view below. Full model catalogue with sizes, quantisations, download URLs, and memory tables → soniqo.audio/architecture.

| Model | Task | Backends | Sizes | Languages |
| --- | --- | --- | --- | --- |
| Qwen3-ASR | Speech → Text | MLX, CoreML (hybrid) | 0.6B, 1.7B | 52 |
| Parakeet TDT | Speech → Text | CoreML (ANE) | 0.6B | 25 European |
| Parakeet EOU | Speech → Text (streaming) | CoreML (ANE) | 120M | 25 European |
| Omnilingual ASR | Speech → Text | CoreML (ANE), MLX | 300M / 1B / 3B / 7B | 1,672 |
| Qwen3-ForcedAligner | Audio + Text → Timestamps | MLX, CoreML | 0.6B | Multi |
| Qwen3-TTS | Text → Speech | MLX, CoreML | 0.6B, 1.7B | 10 |
| CosyVoice3 | Text → Speech | MLX | 0.5B | 9 |
| Kokoro-82M | Text → Speech | CoreML (ANE) | 82M | 10 |
| Qwen3.5-Chat | Text → Text (LLM) | MLX, CoreML | 0.8B | Multi |
| PersonaPlex | Speech → Speech | MLX | 7B | EN |
| Silero VAD | Voice Activity Detection | MLX, CoreML | 309K | Agnostic |
| Pyannote | VAD + Diarization | MLX | 1.5M | Agnostic |
| Sortformer | Diarization (E2E) | CoreML (ANE) | — | Agnostic |
| DeepFilterNet3 | Speech Enhancement | CoreML | 2.1M | Agnostic |
| WeSpeaker | Speaker Embedding | MLX, CoreML | 6.6M | Agnostic |

Installation

Homebrew

Requires native ARM Homebrew (/opt/homebrew). Rosetta/x86_64 Homebrew is not supported.

brew tap soniqo/speech https://github.com/soniqo/speech-swift
brew install speech

Then:

audio transcribe recording.wav
audio speak "Hello world"
audio respond --input question.wav --transcript

Full CLI reference →

Swift Package Manager

dependencies: [
    .package(url: "https://github.com/soniqo/speech-swift", from: "0.0.9")
]

Import only what you need — every model is its own SPM target:

import Qwen3ASR             // Speech recognition (MLX)
import ParakeetASR          // Speech recognition (CoreML, batch)
import ParakeetStreamingASR // Streaming dictation with partials + EOU
import OmnilingualASR       // 1,672 languages (CoreML + MLX)
import Qwen3TTS             // Text-to-speech
import CosyVoiceTTS         // Text-to-speech with voice cloning
import KokoroTTS            // Text-to-speech (iOS-ready)
import Qwen3Chat            // On-device LLM chat
import PersonaPlex          // Full-duplex speech-to-speech
import SpeechVAD            // VAD + speaker diarization + embeddings
import SpeechEnhancement    // Noise suppression
import SpeechUI             // SwiftUI components for streaming transcripts
import AudioCommon          // Shared protocols and utilities

Requirements

  • Swift 5.9+, Xcode 15+ (with Metal Toolchain)
  • macOS 14+ or iOS 17+, Apple Silicon (M1/M2/M3/M4)

Build from source

git clone https://github.com/soniqo/speech-swift
cd speech-swift
make build

make build compiles the Swift package and the MLX Metal shader library. The Metal library is required for GPU inference — without it you'll see Failed to load the default metallib at runtime. Use make debug for debug builds and make test for the test suite.

Full build and install guide →

Demo apps

  • DictateDemo (docs) — macOS menu-bar streaming dictation with live partials, VAD-driven end-of-utterance detection, and one-click copy. Runs as a background agent (Parakeet-EOU-120M + Silero VAD).
  • iOSEchoDemo — iOS echo demo (Parakeet ASR + Kokoro TTS). Device and simulator.
  • PersonaPlexDemo — Conversational voice assistant with mic input, VAD, and multi-turn context. macOS. RTF ~0.94 on M2 Max (faster than real-time).
  • SpeechDemo — Dictation and TTS synthesis in a tabbed interface. macOS.

Each demo's README has build instructions.

Code examples

The snippets below show the minimal path for each domain. Every section links to a full guide on soniqo.audio with configuration options, multiple backends, streaming patterns, and CLI recipes.

Speech-to-Text — full guide →

import Qwen3ASR

let model = try await Qwen3ASRModel.fromPretrained()
let text = model.transcribe(audio: audioSamples, sampleRate: 16000)

Alternative backends: Parakeet TDT (CoreML, 32× realtime), Omnilingual ASR (1,672 languages, CoreML or MLX), Streaming dictation (live partials).
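As a sketch of the batch CoreML path, assuming the ParakeetASR product mirrors the streaming API from the quick start (the ParakeetASRModel name is an assumption — check the module's API docs):

```swift
import ParakeetASR

// Hypothetical batch transcription via the CoreML/Neural Engine backend.
// `ParakeetASRModel` and its method names are assumed to mirror
// ParakeetStreamingASRModel above; verify against the API reference.
let model = try await ParakeetASRModel.fromPretrained()
let text = try model.transcribeAudio(audioSamples, sampleRate: 16000)
print(text)
```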

Forced Alignment — full guide →

import Qwen3ASR

let aligner = try await Qwen3ForcedAligner.fromPretrained()
let aligned = aligner.align(
    audio: audioSamples,
    text: "Can you guarantee that the replacement part will be shipped tomorrow?",
    sampleRate: 24000
)
for word in aligned {
    print("[\(word.startTime)s - \(word.endTime)s] \(word.text)")
}

Text-to-Speech — full guide →

import Qwen3TTS
import AudioCommon

let model = try await Qwen3TTSModel.fromPretrained()
let audio = model.synthesize(text: "Hello world", language: "english")
try WAVWriter.write(samples: audio, sampleRate: 24000, to: outputURL)

Alternative TTS engines: CosyVoice3 (streaming + voice cloning + emotion tags), Kokoro-82M (iOS-ready, 54 voices), Voice cloning.

Speech-to-Speech — full guide →

import PersonaPlex

let model = try await PersonaPlexModel.fromPretrained()
let responseAudio = model.respond(userAudio: userSamples)
// 24 kHz mono Float32 output ready for playback

LLM Chat — full guide →

import Qwen3Chat

let chat = try await Qwen35MLXChat.fromPretrained()
chat.chat(messages: [(.user, "Explain MLX in one sentence")]) { token, isFinal in
    print(token, terminator: "")
}

Voice Activity Detection — full guide →

import SpeechVAD

let vad = try await SileroVADModel.fromPretrained()
let segments = vad.detectSpeech(audio: samples, sampleRate: 16000)
for s in segments { print("\(s.startTime)s → \(s.endTime)s") }

Speaker Diarization — full guide →

import SpeechVAD

let diarizer = try await DiarizationPipeline.fromPretrained()
let segments = diarizer.diarize(audio: samples, sampleRate: 16000)
for s in segments { print("Speaker \(s.speakerId): \(s.startTime)s - \(s.endTime)s") }

Speech Enhancement — full guide →

import SpeechEnhancement

let denoiser = try await DeepFilterNet3Model.fromPretrained()
let clean = try denoiser.enhance(audio: noisySamples, sampleRate: 48000)

Voice Pipeline (ASR → LLM → TTS) — full guide →

import SpeechCore

let pipeline = VoicePipeline(
    stt: parakeetASR,
    tts: qwen3TTS,
    vad: sileroVAD,
    config: .init(mode: .voicePipeline),
    onEvent: { event in print(event) }
)
pipeline.start()
pipeline.pushAudio(micSamples)

VoicePipeline is the real-time voice-agent state machine (powered by speech-core) with VAD-driven turn detection, interruption handling, and eager STT. It connects any SpeechRecognitionModel + SpeechGenerationModel + StreamingVADProvider.

HTTP API server

Exposes every model via HTTP REST + WebSocket endpoints, including an OpenAI Realtime API-compatible WebSocket at /v1/realtime. See Sources/AudioServer/.
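For instance, a client could open the Realtime-compatible WebSocket with Foundation's URLSessionWebSocketTask. The host and port below are placeholders — use whatever address AudioServer is configured to listen on:

```swift
import Foundation

// Connect to the OpenAI Realtime API-compatible endpoint at /v1/realtime.
// "localhost:8080" is an assumption, not the server's documented default.
let url = URL(string: "ws://localhost:8080/v1/realtime")!
let task = URLSession.shared.webSocketTask(with: url)
task.resume()

// Receive one server event (Realtime events are JSON text frames).
task.receive { result in
    if case .success(.string(let event)) = result {
        print("server event:", event)
    }
}
```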

Architecture

speech-swift is split into one SPM target per model so consumers only pay for what they import. Shared infrastructure lives in AudioCommon (protocols, audio I/O, HuggingFace downloader, SentencePieceModel) and MLXCommon (weight loading, QuantizedLinear helpers, SDPA multi-head attention helper).

Full architecture diagram with backends, memory tables, and module map → soniqo.audio/architecture · API reference → soniqo.audio/api · Benchmarks → soniqo.audio/benchmarks

Local docs live in the repo's docs/ directory.

Cache configuration

Model weights download from HuggingFace on first use and cache to ~/Library/Caches/qwen3-speech/. Override with QWEN3_CACHE_DIR (CLI) or cacheDir: (Swift API). All fromPretrained() entry points also accept offlineMode: true to skip network when weights are already cached.
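A sketch of both knobs together — the parameter labels follow the docs above (cacheDir:, offlineMode:), but the exact fromPretrained() signature may vary per model, so treat this as illustrative:

```swift
import ParakeetStreamingASR

// Point the weight cache at a custom directory and forbid network access.
// With offlineMode, loading should fail fast if weights aren't cached yet.
let cache = URL(fileURLWithPath: "/path/to/models", isDirectory: true)
let model = try await ParakeetStreamingASRModel.fromPretrained(
    cacheDir: cache,
    offlineMode: true
)
```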

See docs/inference/cache-and-offline.md for full details including sandboxed iOS container paths.

MLX Metal library

If you see Failed to load the default metallib at runtime, the Metal shader library is missing. Run make build or ./scripts/build_mlx_metallib.sh release after a manual swift build. If the Metal Toolchain is missing, install it first:

xcodebuild -downloadComponent MetalToolchain

Testing

make test                            # full suite (unit + E2E with model downloads)
swift test --skip E2E                # unit only (CI-safe, no downloads)
swift test --filter Qwen3ASRTests    # specific module

E2E test classes use the E2E prefix so CI can filter them out with --skip E2E. See CLAUDE.md for the full testing convention.
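A test class following that convention might look like this (class and test names here are illustrative, not from the repo):

```swift
import XCTest

// The `E2E` prefix lets CI exclude model-downloading tests
// with `swift test --skip E2E`.
final class E2EQwen3ASRSmokeTests: XCTestCase {
    func testTranscribesFixture() async throws {
        // Load a model (triggers a HuggingFace download on first run),
        // transcribe a bundled audio fixture, and assert on the transcript.
    }
}
```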

Contributing

PRs welcome — bug fixes, new model integrations, documentation. Fork, create a feature branch, run make build && make test, and open a PR against main.

License

Apache 2.0