
Vibevoice 🎙️

Fast local speech-to-text for any app using faster-whisper.

Hi, I'm Marc Päpper and I wanted to vibe code like Karpathy ;D, so I looked around and found the cool work of Vlad. I extended it to run with a local whisper model, so I don't need to pay for OpenAI tokens. I hope you have fun with it!

What it does 🚀

Demo Video

Simply run cli.py and start dictating text anywhere in your system:

  1. Hold down the right Control key (Ctrl_r)
  2. Speak your text
  3. Release the key
  4. Watch as your spoken words are transcribed and automatically typed!

Works in any application or window - your text editor, browser, chat apps, anywhere you can type!
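
If you're curious how these steps fit together under the hood, here is a rough sketch of a push-to-talk loop. This is not the actual cli.py; the library choices (sounddevice for capture, pynput for the hotkey and the typing, faster-whisper for transcription) and the model name are assumptions:

import numpy as np
import sounddevice as sd
from faster_whisper import WhisperModel
from pynput import keyboard
from pynput.keyboard import Controller, Key

SAMPLE_RATE = 16000                      # Whisper expects 16 kHz mono audio
model = WhisperModel("base.en", device="cuda", compute_type="float16")
typer = Controller()
frames, recording = [], False

def audio_callback(indata, frame_count, time_info, status):
    if recording:
        frames.append(indata.copy())     # collect audio only while the key is held

def on_press(key):
    global recording, frames
    if key == Key.ctrl_r and not recording:
        frames, recording = [], True

def on_release(key):
    global recording
    if key == Key.ctrl_r and recording:
        recording = False
        if not frames:
            return
        audio = np.concatenate(frames)[:, 0]
        segments, _ = model.transcribe(audio, language="en")
        text = "".join(s.text for s in segments).strip()
        typer.type(text + " ")           # type the result wherever the cursor is

with sd.InputStream(samplerate=SAMPLE_RATE, channels=1, dtype="float32",
                    callback=audio_callback):
    with keyboard.Listener(on_press=on_press, on_release=on_release) as listener:
        listener.join()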

NEW: LLM voice command mode:

  1. Hold down the scroll_lock key (I think it's normally unused these days, which is why I chose it)
  2. Speak what you want the LLM to do
  3. The LLM receives your transcribed text and a screenshot of your current view
  4. The LLM's answer is streamed back and typed out as keyboard input

Works everywhere on your system, and the LLM always has your screen as context.
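
Conceptually, command mode boils down to: take the transcription, optionally grab a screenshot, send both to Ollama, and type the streamed reply. A minimal sketch of that idea, assuming the ollama Python package, pyautogui, and pynput (not necessarily what the repo actually uses):

import ollama
import pyautogui
from pynput.keyboard import Controller

typer = Controller()

def run_ai_command(transcribed_text, include_screenshot=True):
    message = {"role": "user", "content": transcribed_text}
    if include_screenshot:
        # needs a system screenshot tool, e.g. gnome-screenshot (see below)
        pyautogui.screenshot("/tmp/vibevoice_screen.png")
        message["images"] = ["/tmp/vibevoice_screen.png"]
    # stream the answer and type each chunk as it arrives
    for chunk in ollama.chat(model="gemma3:27b", messages=[message], stream=True):
        typer.type(chunk["message"]["content"])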

Installation 🛠️

git clone https://github.com/mpaepper/vibevoice.git
cd vibevoice
pip install -r requirements.txt
python src/vibevoice/cli.py

Requirements 📋

Python Dependencies

  • Python 3.13 or higher

System Requirements

  • CUDA-capable GPU (recommended); CPU use can be enabled in server.py (see the sketch after this list)
  • CUDA 12.x
  • cuBLAS
  • cuDNN 9.x
  • If you get the error OSError: PortAudio library not found, run sudo apt install libportaudio2
  • Ollama for AI command mode (with multimodal models for screenshot support)
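
The CPU fallback mentioned above comes from faster-whisper itself. The exact line lives in server.py; a typical change looks roughly like this (the model name is just an example):

from faster_whisper import WhisperModel

# default GPU setup:
# model = WhisperModel("base.en", device="cuda", compute_type="float16")
# CPU fallback if no CUDA GPU is available:
model = WhisperModel("base.en", device="cpu", compute_type="int8")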

Setting up Ollama

  1. Install Ollama by following the instructions at ollama.com
  2. Pull a model that supports both text and images for best results:
    ollama pull gemma3:27b  # Great model which can run on RTX 3090 or similar
  3. Make sure Ollama is running in the background:
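    ollama serve  # not needed if the installer already set Ollama up as a background service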

Handling the CUDA requirements

sudo apt update && sudo apt upgrade
sudo apt autoremove nvidia* --purge
ubuntu-drivers devices
sudo ubuntu-drivers autoinstall
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb && sudo apt update
sudo apt install cuda-toolkit-12-8

or alternatively:

wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt update
sudo apt install cudnn9-cuda-12
After rebooting, everything worked well.
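
A quick way to confirm that faster-whisper can actually see the CUDA stack is to load a tiny model on the GPU (a throwaway snippet, not part of the repo); a missing cuBLAS or cuDNN typically shows up here as a library load error:

from faster_whisper import WhisperModel

model = WhisperModel("tiny", device="cuda", compute_type="float16")
print("CUDA, cuBLAS and cuDNN look OK")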

Usage 💡

  1. Start the application:
    python src/vibevoice/cli.py
  2. Hold down the right Control key (Ctrl_r) while speaking
  3. Release to transcribe
  4. Your text appears wherever your cursor is!

Configuration

You can customize various aspects of VibeVoice with the following environment variables:

Keyboard Controls

  • VOICEKEY: Change the dictation activation key (default: "ctrl_r")
    export VOICEKEY="ctrl"  # Use left control instead
  • VOICEKEY_CMD: Set the key for AI command mode (default: "scroll_lock")
    export VOICEKEY_CMD="ctsl"  # Use left control instead of Scroll Lock key

AI and Screenshot Features

  • OLLAMA_MODEL: Specify which Ollama model to use (default: "gemma3:27b")
    export OLLAMA_MODEL="gemma3:4b"  # Use a smaller VLM in case you have less GPU RAM
  • INCLUDE_SCREENSHOT: Enable or disable screenshots in AI command mode (default: "true")
    export INCLUDE_SCREENSHOT="false"  # Disable screenshots (but they are local only anyways)
  • SCREENSHOT_MAX_WIDTH: Set the maximum width for screenshots (default: "1024")
    export SCREENSHOT_MAX_WIDTH="800"  # Smaller screenshots

Screenshot Dependencies

To use the screenshot functionality:

sudo apt install gnome-screenshot
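
The screenshot tool matters because AI command mode captures the current screen and scales it down to SCREENSHOT_MAX_WIDTH before sending it to the model. A rough sketch of that step (pyautogui and the exact resize logic are assumptions):

import os
import pyautogui  # on Linux this relies on a system tool such as gnome-screenshot

MAX_WIDTH = int(os.environ.get("SCREENSHOT_MAX_WIDTH", "1024"))

def grab_screen(path="/tmp/vibevoice_screen.png"):
    img = pyautogui.screenshot()                 # returns a PIL Image of the full screen
    if img.width > MAX_WIDTH:                    # downscale to keep the prompt small
        new_height = int(img.height * MAX_WIDTH / img.width)
        img = img.resize((MAX_WIDTH, new_height))
    img.save(path)
    return path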

Usage Modes 💡

VibeVoice supports two modes:

1. Dictation Mode

  1. Hold down the dictation key (default: right Control)
  2. Speak your text
  3. Release to transcribe
  4. Your text appears wherever your cursor is!

2. AI Command Mode

  1. Hold down the command key (default: Scroll Lock)
  2. Ask a question or give a command
  3. Release the key
  4. The AI will analyze your request (and current screen if enabled) and type a response

Credits 🙏