GitHub - PrestigePvP/Voicebox


Voice-to-text tool that captures speech, transcribes it via Whisper, and formats the output with an LLM. Press a hotkey, speak, release — formatted text lands in your clipboard and is auto-pasted into whatever you were typing in.

How It Works

┌──────────────────────┐  PCM chunks   ┌──────────────────────────────────┐   formatted
│  Wails Desktop App   │ ──WebSocket──▶ │  Cloudflare Worker (Durable Obj) │ ──text────▶ Clipboard → Auto-paste
│  (Go + React WebView)│ ◀─────────────│  Whisper STT → LLM Formatter     │
└──────────────────────┘               └──────────────────────────────────┘
  1. Hold Ctrl+Cmd — focus context is captured, recording starts, overlay appears at top-center
  2. Speak into your microphone (voice level meter shows input)
  3. Release Ctrl+Cmd — audio streams to the cloud
  4. Whisper transcribes, LLM formats, result is copied to clipboard and auto-pasted into the originating app
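The glue for steps 3–4 can be sketched in Go. This is illustrative only — the real orchestration lives in app.go and internal/, and these function names are hypothetical:

```go
package main

// onHotkeyRelease wires together the release half of the flow: stop
// capture, send the buffered audio plus the focus context captured at
// key-down to the worker, then paste the formatted result back into
// the originating app. All parameter names are illustrative.
func onHotkeyRelease(
	stopCapture func() []byte, // drain buffered PCM from the recorder
	transcribe func(pcm []byte, focus string) (string, error), // worker round-trip
	focus string, // focused-element context captured at key-down
	paste func(text string), // clipboard copy + synthesized Cmd+V
) error {
	pcm := stopCapture()
	text, err := transcribe(pcm, focus)
	if err != nil {
		return err
	}
	paste(text)
	return nil
}
```

Capturing the focus context at key-down (step 1) rather than at release matters: by the time the result arrives, focus may have moved, so the paste targets the app that was active when recording began.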

Project Structure

voicebox/
├── main.go                 # Wails entrypoint, app menu
├── app.go                  # App lifecycle, hotkey handlers, pipeline orchestration
├── window_darwin.go        # macOS window management (overlay, settings, dock click)
├── window_other.go         # Stub for non-macOS builds
├── internal/
│   ├── audio/              # PCM audio capture (malgo/miniaudio), RMS level
│   ├── pipeline/           # WebSocket client, streams audio + focus context to worker
│   ├── accessibility/      # macOS AX API: focused element context + auto-paste (Cmd+V)
│   ├── config/             # TOML config loading and saving
│   ├── hotkey/             # Global hotkey registration
│   ├── stt/                # STT provider interface (stubs)
│   └── formatter/          # LLM formatting provider interface (stubs)
├── frontend/               # React + Tailwind overlay UI (Vite)
│   └── src/
│       ├── App.tsx         # Routes between settings mode and overlay mode
│       ├── components/
│       │   ├── settings-form.tsx  # Config editor (react-hook-form + zod)
│       │   └── title-bar.tsx      # Frameless title bar with drag region
│       └── hooks/
│           ├── use-voicebox.ts    # voicebox:state / voicebox:mode / voicebox:level events
│           └── use-config.ts      # GetConfig / SaveConfig / GetConfigPath bindings
├── worker/                 # Cloudflare Worker (TypeScript)
│   ├── src/
│   │   ├── index.ts        # Router: /ws (WebSocket), /health
│   │   ├── session.ts      # Durable Object: audio accumulation + AI pipeline
│   │   ├── prompt.ts       # System prompt + user message builder
│   │   ├── wav.ts          # PCM-to-WAV wrapper
│   │   └── types.ts        # Shared types
│   ├── test/               # Vitest tests
│   └── wrangler.jsonc      # Worker configuration
├── go.mod
└── voicebox.toml           # User config (gitignored)

Setup

Prerequisites

  • Go 1.24+
  • Node.js + pnpm
  • Wails v2 CLI
  • A Cloudflare account with Workers AI access
  • macOS (accessibility permission required for auto-paste)

Deploy the Worker

cd worker
pnpm install
wrangler secret put VOICEBOX_TOKEN     # set a shared secret
pnpm deploy

Configure the Desktop Client

On first launch, Voicebox opens a settings window. You can also create the config manually at ~/.config/voicebox/voicebox.toml:

[provider]
mode = "cloud"

[cloud]
worker_url = "https://voicebox.<your-subdomain>.workers.dev"
token = "your-shared-secret"

[audio]
sample_rate = 16000
channels = 1
chunk_size = 4096

[hotkey]
record = "ctrl+cmd"

Config is searched in order: ~/.config/voicebox/voicebox.toml, a voicebox.toml next to the binary, then ./voicebox.toml in the working directory. The first file found wins.

macOS Accessibility Permission

Auto-paste requires macOS Accessibility access. On first use, macOS will prompt for permission, or you can grant it manually in System Settings → Privacy & Security → Accessibility.

Build and Run

wails dev      # dev mode with hot reload
wails build    # production binary

Window Modes

Settings (700×450, centered): Opens on launch, dock click, or via the Recording menu. Edit config here.

Overlay (160×48, top-center, floating): Appears during recording. Shows recording indicator with voice level meter, spinner while processing, checkmark on success.

WebSocket Protocol

Client connects to GET /ws?token=<auth-token>.

After receiving {"type":"ready"}, the client sends a configure message with audio and focus context, then streams binary PCM chunks:

Client                          Server
  │── connect /ws?token=... ──────▶│
  │◀── {"type":"ready"} ──────────│
  │── {"type":"configure", ...} ──▶│
  │── [binary PCM chunk] ─────────▶│
  │── [binary PCM chunk] ─────────▶│
  │── {"type":"audio_end"} ───────▶│
  │◀── {"type":"processing",...} ──│
  │◀── {"type":"result",...} ──────│

The configure message carries audio params and focused element context (app name, bundle ID, element role, title, placeholder, current value) used by the LLM formatter to tailor output.

Cloud Backend

  • STT: @cf/openai/whisper-large-v3-turbo
  • Formatter: @cf/qwen/qwen3-30b-a3b-fp8

Local Backend (Phase 2)

  • STT: faster-whisper
  • Formatter: Ollama
  • Provider interfaces exist at internal/stt/ and internal/formatter/

Audio Specs

  • 16kHz sample rate, mono, PCM signed 16-bit LE
  • ~4096 byte chunks (~128ms each)
  • Max recording: ~25 MiB (~13 minutes)
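These figures follow directly from the format: 16-bit mono at 16 kHz is 32,000 bytes per second, so a 4096-byte chunk spans exactly 128 ms and the 25 MiB cap holds about 13.6 minutes:

```go
package main

// Derived figures for 16 kHz mono signed 16-bit little-endian PCM.
const (
	sampleRate     = 16000
	bytesPerSample = 2
	bytesPerSecond = sampleRate * bytesPerSample // 32,000 B/s

	chunkBytes = 4096
	maxBytes   = 25 << 20 // 25 MiB
)

// chunkMillis is the duration of one chunk: 4096 / 32000 s = 128 ms.
func chunkMillis() int { return chunkBytes * 1000 / bytesPerSecond }

// maxSeconds is the longest recording that fits under the size cap:
// 26,214,400 / 32,000 = 819 s, i.e. ~13.6 minutes.
func maxSeconds() int { return maxBytes / bytesPerSecond }
```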

Development

# Desktop app
wails dev                              # dev server (Go + Vite hot reload)
wails build                            # production build
go vet ./...                           # lint Go
go test ./internal/...                 # test Go

# Frontend
cd frontend && pnpm install && pnpm build

# Worker
cd worker
pnpm dev                               # local dev server
pnpm lint                              # type-check
pnpm format                            # prettier
pnpm test                              # vitest
pnpm deploy                            # deploy to Cloudflare