localvoxtral is a native macOS menu bar app for realtime dictation.
It keeps the loop simple: start dictation, speak, get text fast.
Unlike Whisper-based tools that transcribe after you stop speaking, Voxtral Realtime streams text as audio arrives, so words appear while you're still talking.
On Apple Silicon, localvoxtral + voxmlx provides a fully local path (audio + inference stay on-device), improving privacy and avoiding API costs.
localvoxtral connects to any OpenAI Realtime-compatible endpoint. It is built for Mistral AI's Voxtral Mini 4B Realtime model, but works with any backend that speaks the OpenAI Realtime API. Recommended backends are voxmlx (Apple Silicon) and vLLM (NVIDIA GPU); an mlx-audio backend is also supported but deprecated.
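Concretely, an OpenAI Realtime-compatible backend receives JSON events over a WebSocket. A minimal sketch of the event shapes a dictation client sends (field names follow OpenAI's Realtime API; individual backends may implement only a subset, so treat the exact session fields as assumptions):

```python
import base64
import json
import struct

def session_update(model: str) -> str:
    """Configure the realtime session (event shape from the OpenAI Realtime API)."""
    return json.dumps({
        "type": "session.update",
        "session": {"model": model, "input_audio_format": "pcm16"},
    })

def audio_append(pcm16_samples: list[int]) -> str:
    """Wrap raw PCM16 audio as a base64 input_audio_buffer.append event."""
    raw = struct.pack(f"<{len(pcm16_samples)}h", *pcm16_samples)
    return json.dumps({
        "type": "input_audio_buffer.append",
        "audio": base64.b64encode(raw).decode("ascii"),
    })

event = json.loads(audio_append([0, 1000, -1000]))
print(event["type"])  # input_audio_buffer.append
```

The backend streams back transcript deltas as events on the same socket, which is what lets text appear while you are still speaking.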
## Features
- Global shortcut with selectable behavior: `Toggle` (press to start/stop) or `Push to Talk` (hold to dictate)
- Native menu bar app with instant open and visual feedback via the icon
- Output modes: overlay buffer (commit on stop) or live auto-paste into the focused input
- Fully local dictation option with `voxmlx` (no third-party API traffic)
- Pick your preferred microphone input device
- Copy the latest transcribed segment
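The two shortcut behaviors map key events to dictation state differently. An illustrative sketch (not the app's actual implementation):

```python
class DictationShortcut:
    """Toy state machine for the two shortcut behaviors."""

    def __init__(self, behavior: str):
        assert behavior in ("toggle", "push_to_talk")
        self.behavior = behavior
        self.dictating = False

    def key_down(self):
        if self.behavior == "toggle":
            self.dictating = not self.dictating  # each press flips start/stop
        else:
            self.dictating = True  # hold to dictate

    def key_up(self):
        if self.behavior == "push_to_talk":
            self.dictating = False  # release stops immediately

toggle = DictationShortcut("toggle")
toggle.key_down(); toggle.key_up()
print(toggle.dictating)  # True: still dictating until the next press
```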
## Quick start
### Recommended: install from GitHub Releases (DMG)

1. Download the latest `.dmg` from Releases.
2. If macOS blocks the first launch, go to System Settings -> Privacy & Security and click Open Anyway for localvoxtral.
### Alternatively, build from source as an app bundle

```sh
./scripts/package_app.sh release
open ./dist/localvoxtral.app
```
## Settings
Open Settings from the menu bar popover to set:

- Dictation keyboard shortcut
- Shortcut behavior (`Toggle` / `Push to Talk`)
- Realtime endpoint
- Model name
- API key
- Commit interval (`vLLM` / `voxmlx`)
- Transcription delay (`mlx-audio`)
- Output mode (`Overlay Buffer` / `Live Auto-Paste`)
- Auto-copy final segment
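To get a feel for what the commit interval trades off: shorter intervals commit smaller audio buffers more often, lowering latency at the cost of more frequent backend work. A rough sizing sketch, assuming 24 kHz mono PCM16 (the standard OpenAI Realtime audio format; your backend's rate may differ):

```python
SAMPLE_RATE_HZ = 24_000   # assumed OpenAI Realtime default
BYTES_PER_SAMPLE = 2      # PCM16

def bytes_per_commit(interval_ms: int) -> int:
    """Audio bytes buffered between commits at the assumed format."""
    return SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * interval_ms // 1000

print(bytes_per_commit(500))  # 24000 bytes per half-second commit
```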
## Tested setup

In the setups below, vLLM and voxmlx stream partial text fast enough for realtime dictation; latency and throughput will vary with hardware, model, and quantization.
### voxmlx
An OpenAI Realtime-compatible voxmlx server running on an M1 Pro with a 4-bit quantized model. Use this fork, which adds a WebSocket server that speaks the OpenAI Realtime API protocol, plus memory-management optimizations.
```sh
# install uv once: https://docs.astral.sh/uv/getting-started/installation/
uvx --from "git+https://github.com/T0mSIlver/voxmlx.git[server]" \
  voxmlx-serve --model T0mSIlver/Voxtral-Mini-4B-Realtime-2602-MLX-4bit
```
### vLLM
An OpenAI Realtime-compatible vLLM server running on an NVIDIA RTX 3090, using the default settings recommended on the Voxtral Mini 4B Realtime model page.
```sh
VLLM_DISABLE_COMPILE_CACHE=1 \
  vllm serve mistralai/Voxtral-Mini-4B-Realtime-2602 \
  --compilation_config '{"cudagraph_mode": "PIECEWISE"}'
```

### mlx-audio (deprecated)
Deprecated: mlx-audio does not provide true incremental inference for Voxtral Realtime, so partial transcripts are chunked and less responsive than the vLLM and voxmlx backends.
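The chunking behavior can be sketched numerically (timings invented for illustration): a hard 6-second max chunk forces cuts at fixed offsets regardless of where the speaker pauses, while a higher ceiling leaves short utterances uncut so silence detection decides the boundaries.

```python
def fixed_chunks(total_s: float, max_chunk_s: float) -> list[float]:
    """Cut points imposed by a hard max chunk length."""
    cuts, t = [], max_chunk_s
    while t < total_s:
        cuts.append(t)
        t += max_chunk_s
    return cuts

# A hypothetical 20 s utterance in which the speaker pauses at 8 s and 15 s:
print(fixed_chunks(20.0, 6.0))   # [6.0, 12.0, 18.0] -> every cut lands mid-speech
print(fixed_chunks(20.0, 30.0))  # [] -> no forced cuts; silence at 8 s / 15 s governs
```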
An mlx-audio server on an M1 Pro, running a 4-bit quant of Voxtral Mini 4B Realtime.
```sh
# Default max_chunk (6 s) force-splits continuous speech mid-sentence;
# 30 lets silence detection handle segmentation naturally.
MLX_AUDIO_REALTIME_MAX_CHUNK_SECONDS=30 python -m mlx_audio.server --workers 1
```

## Roadmap
- Add on-device LLM post-processing for dictation cleanup and technical wording improvements.
- Enhance the server connection UX
- Drive `voxmlx-serve` (from the `voxmlx` fork) upstream and assess app-managed local serving (start/stop/config) in localvoxtral.
- Implement more of the on-device Voxtral Realtime integrations recommended in the model README:
  - Pure C - thanks Salvatore Sanfilippo - done
  - mlx-audio framework - thanks Shreyas Karnik - done
  - MLX - thanks Awni Hannun
  - Rust - thanks TrevorS
## UI


