
Vibevoice 🎙️

Fast local speech-to-text for any app using faster-whisper.

Hi, I'm Marc Päpper and I wanted to vibe code like Karpathy ;D, so I looked around and found the cool work of Vlad. I extended it to run with a local whisper model, so I don't need to pay for OpenAI tokens. I hope you have fun with it!

What it does 🚀

Demo Video

Simply run cli.py and start dictating text anywhere in your system:

  1. Hold down the right Control key (Ctrl_r)
  2. Speak your text
  3. Release the key
  4. Watch as your spoken words are transcribed and automatically typed!

Works in any application or window - your text editor, browser, chat apps, anywhere you can type!
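
If you're curious how these steps fit together under the hood, here is a rough sketch of a push-to-talk loop. This is not the actual cli.py; the library choices (sounddevice for capture, pynput for the hotkey and the typing, faster-whisper for transcription) and the model name are assumptions:

import numpy as np
import sounddevice as sd
from faster_whisper import WhisperModel
from pynput import keyboard
from pynput.keyboard import Controller, Key

SAMPLE_RATE = 16000                      # Whisper expects 16 kHz mono audio
model = WhisperModel("base.en", device="cuda", compute_type="float16")
typer = Controller()
frames, recording = [], False

def audio_callback(indata, frame_count, time_info, status):
    if recording:
        frames.append(indata.copy())     # collect audio only while the key is held

def on_press(key):
    global recording, frames
    if key == Key.ctrl_r and not recording:
        frames, recording = [], True

def on_release(key):
    global recording
    if key == Key.ctrl_r and recording:
        recording = False
        if not frames:
            return
        audio = np.concatenate(frames)[:, 0]
        segments, _ = model.transcribe(audio, language="en")
        text = "".join(s.text for s in segments).strip()
        typer.type(text + " ")           # type the result wherever the cursor is

with sd.InputStream(samplerate=SAMPLE_RATE, channels=1, dtype="float32",
                    callback=audio_callback):
    with keyboard.Listener(on_press=on_press, on_release=on_release) as listener:
        listener.join()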

NEW: LLM voice command mode:

  1. Hold down the scroll_lock key (I think it's normally unused these days, which is why I chose it)
  2. Speak what you want the LLM to do
  3. The LLM receives your transcribed text and a screenshot of your current view
  4. The LLM's answer is streamed back and typed out as keyboard input

Works everywhere on your system, and the LLM always has your screen as context.
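
Conceptually, command mode boils down to: take the transcription, optionally grab a screenshot, send both to Ollama, and type the streamed reply. A minimal sketch of that idea, assuming the ollama Python package, pyautogui, and pynput (not necessarily what the repo actually uses):

import ollama
import pyautogui
from pynput.keyboard import Controller

typer = Controller()

def run_ai_command(transcribed_text, include_screenshot=True):
    message = {"role": "user", "content": transcribed_text}
    if include_screenshot:
        # needs a system screenshot tool, e.g. gnome-screenshot (see below)
        pyautogui.screenshot("/tmp/vibevoice_screen.png")
        message["images"] = ["/tmp/vibevoice_screen.png"]
    # stream the answer and type each chunk as it arrives
    for chunk in ollama.chat(model="gemma3:27b", messages=[message], stream=True):
        typer.type(chunk["message"]["content"])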

Installation 🛠️

git clone https://github.com/mpaepper/vibevoice.git
cd vibevoice
pip install -r requirements.txt
python src/vibevoice/cli.py

Requirements 📋

Python Dependencies

  • Python 3.13 or higher

System Requirements

  • CUDA-capable GPU (recommended); CPU use can be enabled in server.py (see the sketch after this list)
  • CUDA 12.x
  • cuBLAS
  • cuDNN 9.x
  • If you get the error OSError: PortAudio library not found, run sudo apt install libportaudio2
  • Ollama for AI command mode (with multimodal models for screenshot support)
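
The CPU fallback mentioned above comes from faster-whisper itself. The exact line lives in server.py; a typical change looks roughly like this (the model name is just an example):

from faster_whisper import WhisperModel

# default GPU setup:
# model = WhisperModel("base.en", device="cuda", compute_type="float16")
# CPU fallback if no CUDA GPU is available:
model = WhisperModel("base.en", device="cpu", compute_type="int8")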

Setting up Ollama

  1. Install Ollama by following the instructions at ollama.com
  2. Pull a model that supports both text and images for best results:
    ollama pull gemma3:27b  # Great model which can run on RTX 3090 or similar
  3. Make sure Ollama is running in the background:
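    ollama serve  # not needed if the installer already set Ollama up as a background service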

Handling the CUDA requirements

sudo apt update && sudo apt upgrade
sudo apt autoremove nvidia* --purge
ubuntu-drivers devices
sudo ubuntu-drivers autoinstall
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb && sudo apt update
sudo apt install cuda-toolkit-12-8

or alternatively:

wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt update
sudo apt install cudnn9-cuda-12
After rebooting, everything worked well.
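
A quick way to confirm that faster-whisper can actually see the CUDA stack is to load a tiny model on the GPU (a throwaway snippet, not part of the repo); a missing cuBLAS or cuDNN typically shows up here as a library load error:

from faster_whisper import WhisperModel

model = WhisperModel("tiny", device="cuda", compute_type="float16")
print("CUDA, cuBLAS and cuDNN look OK")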

Usage 💡

  1. Start the application:
    python src/vibevoice/cli.py
  2. Hold down the right Control key (Ctrl_r) while speaking
  3. Release to transcribe
  4. Your text appears wherever your cursor is!

Configuration

You can customize various aspects of VibeVoice with the following environment variables:

Keyboard Controls

  • VOICEKEY: Change the dictation activation key (default: "ctrl_r")
    export VOICEKEY="ctrl"  # Use left control instead
  • VOICEKEY_CMD: Set the key for AI command mode (default: "scroll_lock")
    export VOICEKEY_CMD="ctsl"  # Use left control instead of Scroll Lock key

AI and Screenshot Features

  • OLLAMA_MODEL: Specify which Ollama model to use (default: "gemma3:27b")
    export OLLAMA_MODEL="gemma3:4b"  # Use a smaller VLM in case you have less GPU RAM
  • INCLUDE_SCREENSHOT: Enable or disable screenshots in AI command mode (default: "true")
    export INCLUDE_SCREENSHOT="false"  # Disable screenshots (but they are local only anyways)
  • SCREENSHOT_MAX_WIDTH: Set the maximum width for screenshots (default: "1024")
    export SCREENSHOT_MAX_WIDTH="800"  # Smaller screenshots

Screenshot Dependencies

To use the screenshot functionality:

sudo apt install gnome-screenshot
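
The screenshot tool matters because AI command mode captures the current screen and scales it down to SCREENSHOT_MAX_WIDTH before sending it to the model. A rough sketch of that step (pyautogui and the exact resize logic are assumptions):

import os
import pyautogui  # on Linux this relies on a system tool such as gnome-screenshot

MAX_WIDTH = int(os.environ.get("SCREENSHOT_MAX_WIDTH", "1024"))

def grab_screen(path="/tmp/vibevoice_screen.png"):
    img = pyautogui.screenshot()                 # returns a PIL Image of the full screen
    if img.width > MAX_WIDTH:                    # downscale to keep the prompt small
        new_height = int(img.height * MAX_WIDTH / img.width)
        img = img.resize((MAX_WIDTH, new_height))
    img.save(path)
    return path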

Usage Modes 💡

VibeVoice supports two modes:

1. Dictation Mode

  1. Hold down the dictation key (default: right Control)
  2. Speak your text
  3. Release to transcribe
  4. Your text appears wherever your cursor is!

2. AI Command Mode

  1. Hold down the command key (default: Scroll Lock)
  2. Ask a question or give a command
  3. Release the key
  4. The AI will analyze your request (and current screen if enabled) and type a response

Credits 🙏