# Vilberta
An interactive voice assistant powered by LLMs, featuring a three-stage ASR + LLM + TTS pipeline, intelligent interruption handling, and MCP support for tool calling.
**Core Objective:** Not everything needs to be spoken.
Vilberta intelligently separates conversational responses (audio) from visual content like code, lists, and data (text), creating a more natural and efficient interaction model.
## Features
- Voice Interaction: Bidirectional audio communication with hands-free operation
- Voice Activity Detection (VAD): Uses Silero VAD for automatic speech detection
- Context-Aware Transcription: Uses conversation history to improve accuracy for technical terms
- Smart Interruption Handling: Gracefully interrupts TTS when you speak and resumes context
- MCP Integration: Support for Model Context Protocol with tool calling capabilities
- Multimodal Output (see the sketch after this list):
  - `[speak]`: Audio responses for conversational interaction
  - `[text]`: Visual content for code, lists, and complex information
- Audio Feedback: Sound effects for user events and system states
- Rich Terminal Interface: Basic CLI and TUI (Text User Interface) options
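The `[speak]`/`[text]` split implies the LLM emits tagged segments that are routed either to the TTS engine or to the terminal. Below is a minimal sketch of such routing, assuming a flat tag-prefixed response format; the regex and function names are illustrative, not Vilberta's actual code.

```python
import re

# Hypothetical splitter: routes [speak]/[text] tagged segments from the
# LLM response to audio vs. terminal output. The tag syntax is assumed.
SEGMENT_RE = re.compile(r"\[(speak|text)\](.*?)(?=\[(?:speak|text)\]|$)", re.DOTALL)

def speak(text: str) -> None:
    # Placeholder for the local TTS call (Pockety TTS in Vilberta).
    print(f"(speaking) {text}")

def route_response(response: str) -> None:
    for kind, content in SEGMENT_RE.findall(response):
        content = content.strip()
        if not content:
            continue
        if kind == "speak":
            speak(content)   # conversational parts go to audio
        else:
            print(content)   # code, lists, and data stay on screen

route_response("[speak]Here is the function you asked for.[text]def add(a, b): return a + b")
```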
> **Note:** There is no echo cancellation, so use a headset or a conference speakerphone with built-in echo cancellation to prevent speaker audio from feeding back into the microphone.
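For reference, the Silero VAD model from the feature list is typically loaded via `torch.hub`, as in this minimal offline sketch (Vilberta's real-time integration may differ):

```python
import torch

# Load the Silero VAD model and its helper utilities from torch.hub.
model, utils = torch.hub.load("snakers4/silero-vad", "silero_vad")
get_speech_timestamps, _, read_audio, _, _ = utils

# Detect speech segments in a 16 kHz mono recording.
wav = read_audio("mic_capture.wav", sampling_rate=16000)
timestamps = get_speech_timestamps(wav, model, sampling_rate=16000)
print(timestamps)  # e.g. [{'start': 1024, 'end': 23552}, ...]
```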
## Architecture
Vilberta uses a three-stage pipeline approach instead of direct speech-to-speech models. This design choice was made because multimodal models currently have poor tool calling capabilities, which are essential for practical applications.
Pipeline stages:
- ASR (Automatic Speech Recognition): Gemini 2.5 Flash (transcription only)
- LLM (Large Language Model):
  - GPT-4o mini for basic chat
  - Gemini 2.5 Flash for tool calling with MCP
- TTS (Text-to-Speech): Pockety TTS (runs locally)
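Conceptually, each turn flows through the three stages in order. The sketch below is illustrative only: the function bodies are stubs standing in for the real Gemini / GPT-4o mini / Pockety calls, and none of the names are Vilberta's actual API.

```python
# Placeholder stages -- the real calls go to Gemini 2.5 Flash (ASR and
# tool calling), GPT-4o mini (chat), and the local Pockety TTS.
def transcribe(audio: bytes, context: list[str]) -> str:
    return "what does this error mean?"

def chat_completion(history: list[dict[str, str]]) -> str:
    return "[speak]It means the file was not found."

def speak(text: str) -> None:
    print(f"(speaking) {text}")

def run_turn(audio: bytes, history: list[dict[str, str]]) -> None:
    # Stage 1: ASR, with recent history as context for technical terms.
    transcript = transcribe(audio, [m["content"] for m in history])
    history.append({"role": "user", "content": transcript})
    # Stage 2: LLM response (tool calls via MCP happen at this stage).
    reply = chat_completion(history)
    history.append({"role": "assistant", "content": reply})
    # Stage 3: TTS for the spoken parts of the reply.
    speak(reply)

run_turn(b"", [])
```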
## Installation
- Clone the repository:

  ```bash
  git clone https://github.com/charstorm/vilberta.git
  cd vilberta
  ```

- Create and activate a virtual environment:

  ```bash
  python -m venv venv
  source venv/bin/activate
  ```

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- Set your OpenRouter API key:

  ```bash
  export OPENROUTER_API_KEY="your-openrouter-api-key"
  ```
## Configuration
Vilberta can be configured via `config.toml`. Key settings include:
- Mode: `basic` (simple chat) or `mcp` (with tool calling support)
- Models: Configure ASR, chat, and tool calling models separately
- TTS Voice: Choose from available voices (alba, marius, javert, jean, fantine, cosette, eponine, azelma)
- Context Management: Adjust history size and pruning thresholds
See `config.toml` for detailed configuration options.
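Since Python 3.11, the standard library can parse TOML directly, so reading the configuration might look like the sketch below. The key names here are hypothetical, purely for illustration; consult the shipped `config.toml` for the actual schema.

```python
import tomllib  # stdlib TOML parser, Python 3.11+

with open("config.toml", "rb") as f:  # tomllib requires binary mode
    config = tomllib.load(f)

# Hypothetical keys, for illustration only -- check config.toml.
mode = config.get("mode", "basic")                  # "basic" or "mcp"
voice = config.get("tts", {}).get("voice", "alba")  # one of the listed voices
print(f"mode={mode}, voice={voice}")
```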
## Usage
Start the voice assistant with the TUI interface:

```bash
python -m vilberta -i tui
```

Start with the basic CLI interface:

```bash
python -m vilberta -i cli
```
### Operating Modes
- Basic Chat Mode: Simple conversational interaction with the LLM
- MCP Mode: Full tool calling capabilities with MCP server integration
Configure the mode in `config.toml` before starting the application.
## Known Issues
- Sound Effect Timing: Sound effects may occasionally overlap with TTS output, creating jarring transitions. The current implementation has more sound effects than necessary.
- NNPACK Warnings: You may see `Could not initialize NNPACK! Reason: Unsupported hardware` warnings on startup. These are typically harmless, and restarting the application usually resolves them.
- TTS Hang: The application may hang if you quit (`q`) while TTS is playing an utterance. Wait for playback to finish before quitting.
## Technical Notes
- Currently supports the OpenRouter API, but can be extended to work with any OpenAI-compatible API provider (see the sketch after this list)
- The three-stage architecture ensures robust tool calling while maintaining conversational quality
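Because OpenRouter exposes an OpenAI-compatible endpoint, pointing the standard `openai` Python client at it (or at any compatible provider) usually only requires changing the base URL and key. A minimal sketch; the model identifier follows OpenRouter's naming and is shown for illustration:

```python
import os
from openai import OpenAI

# Any OpenAI-compatible provider works by swapping base_url and api_key.
client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENROUTER_API_KEY"],
)

resp = client.chat.completions.create(
    model="openai/gpt-4o-mini",  # OpenRouter-style model id
    messages=[{"role": "user", "content": "Say hello in one short sentence."}],
)
print(resp.choices[0].message.content)
```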
## Contributing
Vibe-coded PRs are encouraged! But please keep them usable and stable.
Make sure to fix issues reported by `ruff check vilberta` and `mypy --strict vilberta` before submitting.
