vox

System-wide speech-to-text for macOS. Hold a hotkey, speak, release -- transcribed text appears wherever your cursor is. Runs entirely locally using whisper.cpp. No paid services, no rate limits.

How it works

Hold hotkey --> Record mic --> Whisper transcribes --> Text pasted at cursor

Run vox in a terminal (or as a background process)
Switch to any app -- editor, browser, terminal, chat
Hold your hotkey (e.g. Fn or Cmd+Shift) and speak naturally
Release the hotkey -- the transcribed text appears at your cursor

A gentle chime plays when recording starts and again when it stops.
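
The hold-to-talk flow above can be sketched as a small event loop. This is a minimal illustration, not vox's actual code: the `Event` type, the `Pipeline` struct, and its function fields are hypothetical names, and the recorder, transcriber, and paste steps are replaced with fakes so the sketch runs on its own.

```go
package main

import (
	"fmt"
	"strings"
)

// Event is a hypothetical hotkey event type (not taken from vox's source).
type Event int

const (
	KeyDown Event = iota
	KeyUp
)

// Pipeline bundles the stages as plain functions so each can be swapped for a
// fake. All field names are illustrative.
type Pipeline struct {
	StartRecording func()
	StopRecording  func() []byte // returns the recorded audio
	Transcribe     func(audio []byte) (string, error)
	Paste          func(text string)
}

// Run consumes hotkey events until the channel is closed.
// Key down starts a recording; key up stops it, transcribes, and pastes.
func (p Pipeline) Run(events <-chan Event) {
	recording := false
	for ev := range events {
		switch {
		case ev == KeyDown && !recording:
			recording = true
			p.StartRecording()
		case ev == KeyUp && recording:
			recording = false
			text, err := p.Transcribe(p.StopRecording())
			if err != nil {
				fmt.Println("transcribe failed:", err)
				continue
			}
			// Skip empty results so silence does not paste anything.
			if text = strings.TrimSpace(text); text != "" {
				p.Paste(text)
			}
		}
	}
}

func main() {
	events := make(chan Event, 2)
	events <- KeyDown
	events <- KeyUp
	close(events)

	p := Pipeline{
		StartRecording: func() { fmt.Println("recording started") },
		StopRecording:  func() []byte { return []byte("fake audio") },
		Transcribe:     func([]byte) (string, error) { return " hello world ", nil },
		Paste:          func(s string) { fmt.Printf("paste: %q\n", s) },
	}
	p.Run(events)
}
```

The `recording` flag ignores repeated key-down events, which matters because a held key can auto-repeat.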

Install

Quick start (recommended)

Requirements:

macOS
Homebrew
Go 1.24+

git clone https://github.com/mattthewong/vox.git
cd vox
make setup   # installs sox + whisper-cpp and downloads a default model if missing
make start   # starts whisper-server and runs vox

Manual setup (advanced)

brew install sox whisper-cpp
mkdir -p ~/.local/share/whisper-cpp
curl -L -o ~/.local/share/whisper-cpp/ggml-base.en.bin \
  "https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-base.en.bin"

Build or install the binary

make build      # outputs bin/vox
make install    # installs to /usr/local/bin/vox

Start manually

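# Terminal 1: start the local Whisper server (it stays in the foreground)
whisper-server --host 127.0.0.1 --port 2022 \
  --model ~/.local/share/whisper-cpp/ggml-base.en.bin
# Terminal 2: run vox once the server is up
vox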

macOS permissions

On first run, macOS will prompt for two permissions. Grant them to your terminal app (Terminal, iTerm2, Ghostty, etc.):

Microphone -- System Settings > Privacy & Security > Microphone
Accessibility -- System Settings > Privacy & Security > Accessibility

Configuration

All configuration is done through environment variables:

Variable	Default	Description
`VOX_HOTKEY`	`option+space`	Hotkey to trigger recording. Comma-separated for multiple.
`WHISPER_URL`	`http://127.0.0.1:2022`	Whisper server URL
`VOX_HOLD_TO_TALK`	`true`	`true` = hold to record, `false` = toggle on/off
`VOX_LANGUAGE`	(auto-detect)	BCP-47 language code (e.g. `en`, `es`)
`VOX_VERBOSE`	`false`	Debug logging

Hotkey formats

VOX_HOTKEY="fn"                 # Fn / Globe key
VOX_HOTKEY="cmd+shift"          # Modifier-only (no extra key needed)
VOX_HOTKEY="option+space"       # Modifier + key
VOX_HOTKEY="ctrl+shift+d"       # Multiple modifiers + key
VOX_HOTKEY="fn,cmd+shift"       # Multiple hotkeys (either triggers)

Available modifiers: ctrl, shift, option/alt, cmd/command
Available keys: a-z, 0-9, f1-f20, space, return, escape, tab, delete, arrow keys
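
The hotkey formats above suggest a simple grammar: commas separate alternative hotkeys, and `+` joins modifiers with at most one regular key. A minimal parsing sketch follows. The `Hotkey` type and `ParseHotkeys` function are hypothetical, treating `fn` as a modifier-like token is an assumption, and key names are not checked against the list of available keys.

```go
package main

import (
	"fmt"
	"strings"
)

// Hotkey is an illustrative parsed form of one combination.
type Hotkey struct {
	Modifiers []string // canonical names: ctrl, shift, option, cmd, fn
	Key       string   // empty for modifier-only hotkeys such as "cmd+shift"
}

// modifierAliases maps accepted spellings to one canonical name.
// Handling "fn" like a modifier is an assumption of this sketch.
var modifierAliases = map[string]string{
	"ctrl": "ctrl", "shift": "shift",
	"option": "option", "alt": "option",
	"cmd": "cmd", "command": "cmd",
	"fn": "fn",
}

// ParseHotkeys splits a spec such as "fn,cmd+shift" into its combinations.
func ParseHotkeys(spec string) ([]Hotkey, error) {
	var out []Hotkey
	for _, combo := range strings.Split(spec, ",") {
		var hk Hotkey
		for _, part := range strings.Split(combo, "+") {
			part = strings.ToLower(strings.TrimSpace(part))
			if part == "" {
				return nil, fmt.Errorf("empty token in %q", combo)
			}
			if mod, ok := modifierAliases[part]; ok {
				hk.Modifiers = append(hk.Modifiers, mod)
				continue
			}
			// Anything that is not a modifier is taken as the single key.
			if hk.Key != "" {
				return nil, fmt.Errorf("more than one key in %q", combo)
			}
			hk.Key = part
		}
		out = append(out, hk)
	}
	return out, nil
}

func main() {
	hotkeys, err := ParseHotkeys("fn,ctrl+shift+d")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, hk := range hotkeys {
		fmt.Printf("modifiers=%v key=%q\n", hk.Modifiers, hk.Key)
	}
}
```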

Architecture

cmd/vox/main.go          -- Entrypoint, hotkey event loop, orchestration
internal/hotkey/          -- CGEventTap-based global hotkey (supports modifier-only, fn, modifier+key)
  hotkey_darwin.go        -- Go listener with keydown/keyup channels
  bridge.c                -- C event tap callback
internal/audio/           -- Mic recording via ffmpeg/sox subprocess
  recorder.go             -- Start/stop recording, WAV output
  sound.go                -- Embedded chime sounds (start/stop)
internal/transcribe/      -- Whisper HTTP client
  client.go               -- Multipart upload, auto-detects /inference vs /v1/audio/transcriptions
internal/inject/          -- Text injection into focused app
  paste_darwin.go         -- pbcopy + CGEvent Cmd+V (works in any app)
internal/config/          -- Env var config + hotkey string parsing
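
To make the multipart upload in the transcribe step concrete, here is a hedged Go sketch of a client for the `/inference` endpoint. It is not vox's `client.go`: the `Transcribe` function is hypothetical, and the form field names (`file`, `response_format`, `language`) and the JSON `text` reply are assumptions based on the whisper.cpp example server. The demo talks to a stand-in server from `net/http/httptest`, so it runs without whisper-server.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"mime/multipart"
	"net/http"
	"net/http/httptest"
)

// Transcribe uploads WAV bytes to <baseURL>/inference as multipart/form-data
// and returns the "text" field of the JSON reply (field names are assumed).
func Transcribe(baseURL string, wav []byte, language string) (string, error) {
	var body bytes.Buffer
	w := multipart.NewWriter(&body)
	part, err := w.CreateFormFile("file", "audio.wav")
	if err != nil {
		return "", err
	}
	if _, err := part.Write(wav); err != nil {
		return "", err
	}
	if err := w.WriteField("response_format", "json"); err != nil {
		return "", err
	}
	if language != "" { // empty means: let the server auto-detect
		if err := w.WriteField("language", language); err != nil {
			return "", err
		}
	}
	if err := w.Close(); err != nil { // writes the closing boundary
		return "", err
	}

	resp, err := http.Post(baseURL+"/inference", w.FormDataContentType(), &body)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		msg, _ := io.ReadAll(io.LimitReader(resp.Body, 512))
		return "", fmt.Errorf("whisper server returned %s: %s", resp.Status, msg)
	}
	var out struct {
		Text string `json:"text"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		return "", err
	}
	return out.Text, nil
}

// fakeWhisper is a stand-in server that echoes what it received.
func fakeWhisper() *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.URL.Path != "/inference" {
			http.Error(w, "not found", http.StatusNotFound)
			return
		}
		f, _, err := r.FormFile("file")
		if err != nil {
			http.Error(w, "missing file", http.StatusBadRequest)
			return
		}
		defer f.Close()
		data, _ := io.ReadAll(f)
		json.NewEncoder(w).Encode(map[string]string{
			"text": fmt.Sprintf("got %d bytes, language=%q", len(data), r.FormValue("language")),
		})
	}))
}

func main() {
	srv := fakeWhisper()
	defer srv.Close()
	text, err := Transcribe(srv.URL, []byte("fake wav data"), "en")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(text)
}
```

Against a real server, `baseURL` would be the `WHISPER_URL` value, for example `http://127.0.0.1:2022`.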

Development

make build        # Build binary
make test         # Run all tests
make test-short   # Skip integration tests
make lint         # go vet
make fmt          # gofmt
make run          # Build and run

Why

I was using Whisper Flow for speech-to-text but kept hitting rate limits on their free plan. Vox does the same thing -- system-wide dictation with a hold-to-talk hotkey -- but runs entirely on your machine, with no external services and no rate limits.

License

MIT