GitHub - T0mSIlver/localvoxtral: Native macOS menu bar app for real-time dictation with Voxtral, streaming STT directly into any app via OpenAI Realtime-compatible backends, including local vLLM and voxmlx.


localvoxtral demo

localvoxtral app icon

localvoxtral is a native macOS menu bar app for realtime dictation. It keeps the loop simple: start dictation, speak, get text fast. Unlike Whisper-based tools that transcribe after you stop speaking, Voxtral Realtime streams text as audio arrives, so words appear while you're still talking. On Apple Silicon, localvoxtral + voxmlx provides a fully local path (audio + inference stay on-device), improving privacy and avoiding API costs.

It connects to any OpenAI Realtime-compatible endpoint. Recommended backends are voxmlx (Apple Silicon) and vLLM (NVIDIA GPU). An mlx-audio backend is also supported but deprecated.

Built for Mistral AI's Voxtral Mini 4B Realtime model, but it works with any OpenAI-compatible Realtime API backend.

Features

  • Global shortcut with selectable behavior: Toggle (press-to-start/stop) or Push to Talk (hold-to-dictate)
  • Native menu bar app that opens instantly and shows dictation state in its icon
  • Output modes: overlay buffer (commit on stop) or live auto-paste into focused input
  • Fully local dictation option with voxmlx (no third-party API traffic)
  • Pick your preferred microphone input device
  • Copy the latest transcribed segment
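
The two output modes above differ only in when text leaves the transcript buffer. A simplified sketch of that distinction (illustrative logic, not the app's actual implementation):

```python
class TranscriptBuffer:
    """Overlay buffer holds deltas and commits once on stop;
    live auto-paste emits each delta into the focused input as it arrives."""

    def __init__(self, live: bool):
        self.live = live
        self.pending: list[str] = []   # text not yet delivered
        self.emitted: list[str] = []   # text delivered to the target app

    def on_delta(self, text: str) -> None:
        if self.live:
            self.emitted.append(text)  # paste immediately
        else:
            self.pending.append(text)  # accumulate in the overlay

    def on_stop(self) -> str:
        committed = "".join(self.pending)
        if committed:
            self.emitted.append(committed)  # commit the whole segment at once
        self.pending.clear()
        return committed

overlay = TranscriptBuffer(live=False)
for d in ["start ", "dictation"]:
    overlay.on_delta(d)
result = overlay.on_stop()
```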

Quick start

Recommended: install from GitHub Releases (DMG)

Download the latest .dmg from Releases.

If macOS blocks first launch, go to System Settings -> Privacy & Security and click Open Anyway for localvoxtral.

Alternatively, build from source as an app bundle

./scripts/package_app.sh release
open ./dist/localvoxtral.app

Settings

  • Open Settings from the menu bar popover to set:
    • Dictation keyboard shortcut
    • Shortcut behavior (Toggle / Push to Talk)
    • Realtime endpoint
    • Model name
    • API key
    • Commit interval (vLLM/voxmlx)
    • Transcription delay (mlx-audio)
    • Output mode (Overlay Buffer / Live Auto-Paste)
    • Auto-copy final segment
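
The commit interval governs how often buffered audio is handed to the backend: shorter intervals mean lower latency but more requests. A sketch of interval-based chunking under the assumption of 16 kHz mono PCM16 audio (a hypothetical helper, not the app's code):

```python
def chunk_on_interval(sample_rate: int, commit_interval_s: float,
                      stream: bytes) -> list[bytes]:
    """Split a PCM16 mono byte stream into commit-sized chunks."""
    bytes_per_commit = int(sample_rate * commit_interval_s) * 2  # 2 bytes/sample
    return [stream[i:i + bytes_per_commit]
            for i in range(0, len(stream), bytes_per_commit)]

# 2 s of silence at 16 kHz, committed every 0.5 s -> four chunks.
chunks = chunk_on_interval(16000, 0.5, b"\x00" * 64000)
```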

Tested setup

In this tested setup, vLLM and voxmlx stream partial text fast enough for realtime dictation; latency and throughput will vary by hardware, model, and quantization.

voxmlx

A voxmlx OpenAI Realtime-compatible server running on an M1 Pro with a 4-bit quantized model. Use this fork, which adds a WebSocket server that speaks the OpenAI Realtime API protocol, plus memory-management optimizations.

# install uv once: https://docs.astral.sh/uv/getting-started/installation/
uvx --from "git+https://github.com/T0mSIlver/voxmlx.git[server]" \
  voxmlx-serve --model T0mSIlver/Voxtral-Mini-4B-Realtime-2602-MLX-4bit

vLLM

A vLLM OpenAI Realtime-compatible server running on an NVIDIA RTX 3090, using the default settings recommended on the Voxtral Mini 4B Realtime model page.

VLLM_DISABLE_COMPILE_CACHE=1 vllm serve mistralai/Voxtral-Mini-4B-Realtime-2602 \
  --compilation_config '{"cudagraph_mode": "PIECEWISE"}'

mlx-audio (deprecated)

Deprecated: mlx-audio does not provide true incremental inference for Voxtral Realtime, so partial transcripts are chunked and less responsive than the vLLM and voxmlx backends.

mlx-audio server on M1 Pro, running a 4-bit quant of Voxtral Mini 4B Realtime.

# Default max_chunk (6s) force-splits continuous speech mid-sentence; 30 lets silence detection handle segmentation naturally
MLX_AUDIO_REALTIME_MAX_CHUNK_SECONDS=30 python -m mlx_audio.server --workers 1
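
The comment above can be made concrete: with a low cap, continuous speech is force-split mid-utterance, while a generous cap leaves segmentation to silence detection. A simplified model of that chunking behavior (illustrative only, not mlx-audio's implementation):

```python
def segment(utterance_durations: list[float],
            max_chunk_s: float) -> list[list[float]]:
    """Greedily pack utterance durations (seconds) into chunks,
    force-splitting any utterance longer than max_chunk_s."""
    chunks: list[list[float]] = []
    current: list[float] = []
    used = 0.0
    for dur in utterance_durations:
        while dur > 0:
            take = min(dur, max_chunk_s - used)
            current.append(take)
            used += take
            dur -= take
            if used >= max_chunk_s:   # cap reached: split here, mid-utterance
                chunks.append(current)
                current, used = [], 0.0
    if current:
        chunks.append(current)
    return chunks

# One continuous 10 s utterance: a 6 s cap splits it; a 30 s cap does not.
short_cap = segment([10.0], 6.0)
long_cap = segment([10.0], 30.0)
```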

Roadmap

  • Add on-device LLM post-processing for dictation cleanup and technical wording improvements.
  • Enhance the server connection UX
  • Drive voxmlx-serve (from the voxmlx fork) upstream and assess app-managed local serving (start/stop/config) in localvoxtral.
  • Implement more of the on-device Voxtral Realtime integrations recommended in the model README.

UI

Menubar icon

Settings view and popover view