speech-swift — AI speech toolkit for Apple Silicon: ASR, TTS, speech-to-speech, VAD, and diarization powered by MLX and CoreML


AI speech models for Apple Silicon, powered by MLX Swift and CoreML.

📖 Read in: English · 中文 · 日本語 · 한국어 · Español · Deutsch · Français · हिन्दी · Português · Русский

On-device speech recognition, synthesis, and understanding for Mac and iOS. Runs locally on Apple Silicon — no cloud, no API keys, no data leaves your device.

📚 Full Documentation → · 🤗 HuggingFace Models · 📝 Blog

  • Qwen3-ASR — Speech-to-text (automatic speech recognition, 52 languages, MLX + CoreML)
  • Parakeet TDT — Speech-to-text via CoreML (Neural Engine, NVIDIA FastConformer + TDT decoder, 25 languages)
  • Omnilingual ASR — Speech-to-text (Meta wav2vec2 + CTC, 1,672 languages across 32 scripts, CoreML 300M + MLX 300M/1B/3B/7B)
  • Streaming Dictation — Real-time dictation with partials and end-of-utterance detection (Parakeet-EOU-120M)
  • Qwen3-ForcedAligner — Word-level timestamp alignment (audio + text → timestamps)
  • Qwen3-TTS — Text-to-speech (highest quality, streaming, custom speakers, 10 languages)
  • CosyVoice TTS — Streaming TTS with voice cloning, multi-speaker dialogue, emotion tags (9 languages)
  • Kokoro TTS — On-device TTS (82M, CoreML/Neural Engine, 54 voices, iOS-ready, 10 languages)
  • Qwen3.5-Chat — On-device LLM chat (0.8B, MLX INT4 + CoreML INT8, DeltaNet hybrid, streaming tokens)
  • PersonaPlex — Full-duplex speech-to-speech (7B, audio in → audio out, 18 voice presets)
  • DeepFilterNet3 — Real-time noise suppression (2.1M params, 48 kHz)
  • VAD — Voice activity detection (Silero streaming, Pyannote offline, FireRedVAD 100+ languages)
  • Speaker Diarization — Who spoke when (Pyannote pipeline, Sortformer end-to-end on Neural Engine)
  • Speaker Embeddings — WeSpeaker ResNet34 (256-dim), CAM++ (192-dim)

Papers: Qwen3-ASR (Alibaba) · Qwen3-TTS (Alibaba) · Omnilingual ASR (Meta) · Parakeet TDT (NVIDIA) · CosyVoice 3 (Alibaba) · Kokoro (StyleTTS 2) · PersonaPlex (NVIDIA) · Mimi (Kyutai) · Sortformer (NVIDIA)


Quick start

Add the package to your Package.swift:

.package(url: "https://github.com/soniqo/speech-swift", from: "0.0.9")

Import only the modules you need — every model is its own SPM library, so you don't pay for what you don't use:

.product(name: "ParakeetStreamingASR", package: "speech-swift"),
.product(name: "SpeechUI",             package: "speech-swift"),  // optional SwiftUI views

Transcribe an audio buffer in 3 lines:

import ParakeetStreamingASR

let model = try await ParakeetStreamingASRModel.fromPretrained()
let text = try model.transcribeAudio(audioSamples, sampleRate: 16000)

Live streaming with partials:

for await partial in model.transcribeStream(audio: samples, sampleRate: 16000) {
    print(partial.isFinal ? "FINAL: \(partial.text)" : "... \(partial.text)")
}

SwiftUI dictation view in ~10 lines:

import SwiftUI
import ParakeetStreamingASR
import SpeechUI

@MainActor
struct DictateView: View {
    @State private var store = TranscriptionStore()

    var body: some View {
        TranscriptionView(finals: store.finalLines, currentPartial: store.currentPartial)
            .task {
                let model = try? await ParakeetStreamingASRModel.fromPretrained()
                guard let model else { return }
                for await p in model.transcribeStream(audio: samples, sampleRate: 16000) {
                    store.apply(text: p.text, isFinal: p.isFinal)
                }
            }
    }
}

SpeechUI ships only TranscriptionView (finals + partials) and TranscriptionStore (streaming ASR adapter). Use AVFoundation for audio visualization and playback.
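For playback, a minimal AVFoundation sketch (standard Apple APIs only, nothing from speech-swift) that plays a mono Float32 buffer such as a model's 24 kHz TTS output could look like:

```swift
import AVFoundation

// Play a mono Float32 sample buffer through AVAudioEngine.
// `samples` is assumed to come from one of the TTS models above.
func play(samples: [Float], sampleRate: Double) throws {
    let engine = AVAudioEngine()
    let player = AVAudioPlayerNode()
    engine.attach(player)

    let format = AVAudioFormat(standardFormatWithSampleRate: sampleRate, channels: 1)!
    engine.connect(player, to: engine.mainMixerNode, format: format)

    // Copy the samples into a PCM buffer the player can schedule.
    let frames = AVAudioFrameCount(samples.count)
    let buffer = AVAudioPCMBuffer(pcmFormat: format, frameCapacity: frames)!
    buffer.frameLength = frames
    samples.withUnsafeBufferPointer { src in
        buffer.floatChannelData![0].update(from: src.baseAddress!, count: samples.count)
    }

    try engine.start()
    player.scheduleBuffer(buffer, completionHandler: nil)
    player.play()
}
```

Keep the engine alive for the duration of playback; if it is deallocated, audio stops.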

Available SPM products: Qwen3ASR, Qwen3TTS, Qwen3TTSCoreML, ParakeetASR, ParakeetStreamingASR, OmnilingualASR, KokoroTTS, CosyVoiceTTS, PersonaPlex, SpeechVAD, SpeechEnhancement, Qwen3Chat, SpeechCore, SpeechUI, AudioCommon.

Models

Compact view below. Full model catalogue with sizes, quantisations, download URLs, and memory tables → soniqo.audio/architecture.

| Model | Task | Backends | Sizes | Languages |
| --- | --- | --- | --- | --- |
| Qwen3-ASR | Speech → Text | MLX, CoreML (hybrid) | 0.6B, 1.7B | 52 |
| Parakeet TDT | Speech → Text | CoreML (ANE) | 0.6B | 25 European |
| Parakeet EOU | Speech → Text (streaming) | CoreML (ANE) | 120M | 25 European |
| Omnilingual ASR | Speech → Text | CoreML (ANE), MLX | 300M / 1B / 3B / 7B | 1,672 |
| Qwen3-ForcedAligner | Audio + Text → Timestamps | MLX, CoreML | 0.6B | Multi |
| Qwen3-TTS | Text → Speech | MLX, CoreML | 0.6B, 1.7B | 10 |
| CosyVoice3 | Text → Speech | MLX | 0.5B | 9 |
| Kokoro-82M | Text → Speech | CoreML (ANE) | 82M | 10 |
| Qwen3.5-Chat | Text → Text (LLM) | MLX, CoreML | 0.8B | Multi |
| PersonaPlex | Speech → Speech | MLX | 7B | EN |
| Silero VAD | Voice Activity Detection | MLX, CoreML | 309K | Agnostic |
| Pyannote | VAD + Diarization | MLX | 1.5M | Agnostic |
| Sortformer | Diarization (E2E) | CoreML (ANE) | — | Agnostic |
| DeepFilterNet3 | Speech Enhancement | CoreML | 2.1M | Agnostic |
| WeSpeaker | Speaker Embedding | MLX, CoreML | 6.6M | Agnostic |

Installation

Homebrew

Requires native ARM Homebrew (/opt/homebrew). Rosetta/x86_64 Homebrew is not supported.

brew tap soniqo/speech https://github.com/soniqo/speech-swift
brew install speech

Then:

audio transcribe recording.wav
audio speak "Hello world"
audio respond --input question.wav --transcript

Full CLI reference →

Swift Package Manager

dependencies: [
    .package(url: "https://github.com/soniqo/speech-swift", from: "0.0.9")
]

Import only what you need — every model is its own SPM target:

import Qwen3ASR             // Speech recognition (MLX)
import ParakeetASR          // Speech recognition (CoreML, batch)
import ParakeetStreamingASR // Streaming dictation with partials + EOU
import OmnilingualASR       // 1,672 languages (CoreML + MLX)
import Qwen3TTS             // Text-to-speech
import CosyVoiceTTS         // Text-to-speech with voice cloning
import KokoroTTS            // Text-to-speech (iOS-ready)
import Qwen3Chat            // On-device LLM chat
import PersonaPlex          // Full-duplex speech-to-speech
import SpeechVAD            // VAD + speaker diarization + embeddings
import SpeechEnhancement    // Noise suppression
import SpeechUI             // SwiftUI components for streaming transcripts
import AudioCommon          // Shared protocols and utilities

Requirements

  • Swift 5.9+, Xcode 15+ (with Metal Toolchain)
  • macOS 14+ or iOS 17+, Apple Silicon (M1/M2/M3/M4)

Build from source

git clone https://github.com/soniqo/speech-swift
cd speech-swift
make build

make build compiles the Swift package and the MLX Metal shader library. The Metal library is required for GPU inference — without it you'll see Failed to load the default metallib at runtime. Use make debug for debug builds and make test for the test suite.

Full build and install guide →

Demo apps

  • DictateDemo (docs) — macOS menu-bar streaming dictation with live partials, VAD-driven end-of-utterance detection, and one-click copy. Runs as a background agent (Parakeet-EOU-120M + Silero VAD).
  • iOSEchoDemo — iOS echo demo (Parakeet ASR + Kokoro TTS). Device and simulator.
  • PersonaPlexDemo — Conversational voice assistant with mic input, VAD, and multi-turn context. macOS. RTF ~0.94 on M2 Max (faster than real-time).
  • SpeechDemo — Dictation and TTS synthesis in a tabbed interface. macOS.

Each demo's README has build instructions.

Code examples

The snippets below show the minimal path for each domain. Every section links to a full guide on soniqo.audio with configuration options, multiple backends, streaming patterns, and CLI recipes.

Speech-to-Text — full guide →

import Qwen3ASR

let model = try await Qwen3ASRModel.fromPretrained()
let text = model.transcribe(audio: audioSamples, sampleRate: 16000)

Alternative backends: Parakeet TDT (CoreML, 32× realtime), Omnilingual ASR (1,672 languages, CoreML or MLX), Streaming dictation (live partials).
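As a sketch of the batch CoreML path, assuming the ParakeetASR product mirrors the streaming API from the quick start (the ParakeetASRModel name is an assumption — check the module's API docs):

```swift
import ParakeetASR

// Hypothetical batch transcription via the CoreML/Neural Engine backend.
// `ParakeetASRModel` and its method names are assumed to mirror
// ParakeetStreamingASRModel above; verify against the API reference.
let model = try await ParakeetASRModel.fromPretrained()
let text = try model.transcribeAudio(audioSamples, sampleRate: 16000)
print(text)
```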

Forced Alignment — full guide →

import Qwen3ASR

let aligner = try await Qwen3ForcedAligner.fromPretrained()
let aligned = aligner.align(
    audio: audioSamples,
    text: "Can you guarantee that the replacement part will be shipped tomorrow?",
    sampleRate: 24000
)
for word in aligned {
    print("[\(word.startTime)s - \(word.endTime)s] \(word.text)")
}

Text-to-Speech — full guide →

import Qwen3TTS
import AudioCommon

let model = try await Qwen3TTSModel.fromPretrained()
let audio = model.synthesize(text: "Hello world", language: "english")
try WAVWriter.write(samples: audio, sampleRate: 24000, to: outputURL)

Alternative TTS engines: CosyVoice3 (streaming + voice cloning + emotion tags), Kokoro-82M (iOS-ready, 54 voices), Voice cloning.

Speech-to-Speech — full guide →

import PersonaPlex

let model = try await PersonaPlexModel.fromPretrained()
let responseAudio = model.respond(userAudio: userSamples)
// 24 kHz mono Float32 output ready for playback

LLM Chat — full guide →

import Qwen3Chat

let chat = try await Qwen35MLXChat.fromPretrained()
chat.chat(messages: [(.user, "Explain MLX in one sentence")]) { token, isFinal in
    print(token, terminator: "")
}

Voice Activity Detection — full guide →

import SpeechVAD

let vad = try await SileroVADModel.fromPretrained()
let segments = vad.detectSpeech(audio: samples, sampleRate: 16000)
for s in segments { print("\(s.startTime)s → \(s.endTime)s") }

Speaker Diarization — full guide →

import SpeechVAD

let diarizer = try await DiarizationPipeline.fromPretrained()
let segments = diarizer.diarize(audio: samples, sampleRate: 16000)
for s in segments { print("Speaker \(s.speakerId): \(s.startTime)s - \(s.endTime)s") }

Speech Enhancement — full guide →

import SpeechEnhancement

let denoiser = try await DeepFilterNet3Model.fromPretrained()
let clean = try denoiser.enhance(audio: noisySamples, sampleRate: 48000)

Voice Pipeline (ASR → LLM → TTS) — full guide →

import SpeechCore

let pipeline = VoicePipeline(
    stt: parakeetASR,
    tts: qwen3TTS,
    vad: sileroVAD,
    config: .init(mode: .voicePipeline),
    onEvent: { event in print(event) }
)
pipeline.start()
pipeline.pushAudio(micSamples)

VoicePipeline is the real-time voice-agent state machine (powered by speech-core) with VAD-driven turn detection, interruption handling, and eager STT. It connects any SpeechRecognitionModel + SpeechGenerationModel + StreamingVADProvider.

HTTP API server

Exposes every model via HTTP REST + WebSocket endpoints, including an OpenAI Realtime API-compatible WebSocket at /v1/realtime. See Sources/AudioServer/.
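For instance, a client could open the Realtime-compatible WebSocket with Foundation's URLSessionWebSocketTask. The host and port below are placeholders — use whatever address AudioServer is configured to listen on:

```swift
import Foundation

// Connect to the OpenAI Realtime API-compatible endpoint at /v1/realtime.
// "localhost:8080" is an assumption, not the server's documented default.
let url = URL(string: "ws://localhost:8080/v1/realtime")!
let task = URLSession.shared.webSocketTask(with: url)
task.resume()

// Receive one server event (Realtime events are JSON text frames).
task.receive { result in
    if case .success(.string(let event)) = result {
        print("server event:", event)
    }
}
```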

Architecture

speech-swift is split into one SPM target per model so consumers only pay for what they import. Shared infrastructure lives in AudioCommon (protocols, audio I/O, HuggingFace downloader, SentencePieceModel) and MLXCommon (weight loading, QuantizedLinear helpers, SDPA multi-head attention helper).

Full architecture diagram with backends, memory tables, and module map → soniqo.audio/architecture · API reference → soniqo.audio/api · Benchmarks → soniqo.audio/benchmarks

Local docs live in the repo's docs/ directory.

Cache configuration

Model weights download from HuggingFace on first use and cache to ~/Library/Caches/qwen3-speech/. Override with QWEN3_CACHE_DIR (CLI) or cacheDir: (Swift API). All fromPretrained() entry points also accept offlineMode: true to skip network when weights are already cached.
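A sketch of both knobs together — the parameter labels follow the docs above (cacheDir:, offlineMode:), but the exact fromPretrained() signature may vary per model, so treat this as illustrative:

```swift
import ParakeetStreamingASR

// Point the weight cache at a custom directory and forbid network access.
// With offlineMode, loading should fail fast if weights aren't cached yet.
let cache = URL(fileURLWithPath: "/path/to/models", isDirectory: true)
let model = try await ParakeetStreamingASRModel.fromPretrained(
    cacheDir: cache,
    offlineMode: true
)
```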

See docs/inference/cache-and-offline.md for full details including sandboxed iOS container paths.

MLX Metal library

If you see Failed to load the default metallib at runtime, the Metal shader library is missing. Run make build or ./scripts/build_mlx_metallib.sh release after a manual swift build. If the Metal Toolchain is missing, install it first:

xcodebuild -downloadComponent MetalToolchain

Testing

make test                            # full suite (unit + E2E with model downloads)
swift test --skip E2E                # unit only (CI-safe, no downloads)
swift test --filter Qwen3ASRTests    # specific module

E2E test classes use the E2E prefix so CI can filter them out with --skip E2E. See CLAUDE.md for the full testing convention.
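A test class following that convention might look like this (class and test names here are illustrative, not from the repo):

```swift
import XCTest

// The `E2E` prefix lets CI exclude model-downloading tests
// with `swift test --skip E2E`.
final class E2EQwen3ASRSmokeTests: XCTestCase {
    func testTranscribesFixture() async throws {
        // Load a model (triggers a HuggingFace download on first run),
        // transcribe a bundled audio fixture, and assert on the transcript.
    }
}
```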

Contributing

PRs welcome — bug fixes, new model integrations, documentation. Fork, create a feature branch, run make build && make test, and open a PR against main.

License

Apache 2.0