Show HN: Vilberta: speech to speech/text chatbot
github.com
Vilberta is a speech-to-speech/text chatbot built on a pipeline that combines a voice activity detector (VAD), speech recognition (ASR), a large language model (LLM), tool calling (via MCP), and text-to-speech (TTS). This is a standard pipeline. Initially, I was going for a more novel approach where a multimodal LLM handled both speech recognition and the LLM role. But I found that the multimodal LLM struggled with certain capabilities like tool calling. So I ended up with VAD+ASR+LLM+TTS, where I can configure the model for each stage.
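A minimal sketch of how a VAD+ASR+LLM+TTS pipeline with a swappable model per stage could be wired together. This is not Vilberta's actual code: the `Pipeline` class, its field names, and the stub stages are all hypothetical, and tool calling is left out to keep it short.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Pipeline:
    """Hypothetical container: each stage is a plain callable, so any model can be swapped in."""
    vad: Callable[[bytes], bool]   # True if the audio chunk contains speech
    asr: Callable[[bytes], str]    # audio -> transcript
    llm: Callable[[str], str]      # transcript -> reply text
    tts: Callable[[str], bytes]    # reply text -> audio

    def handle(self, audio: bytes) -> Optional[bytes]:
        # Skip chunks without speech so the later, slower stages are not invoked.
        if not self.vad(audio):
            return None
        transcript = self.asr(audio)
        reply = self.llm(transcript)
        return self.tts(reply)


# Stub stages standing in for real models (assumptions, for illustration only).
pipeline = Pipeline(
    vad=lambda audio: len(audio) > 0,
    asr=lambda audio: audio.decode("utf-8"),
    llm=lambda text: f"You said: {text}",
    tts=lambda text: text.encode("utf-8"),
)
```

Because every stage is just a callable, changing the ASR or TTS model means passing a different function, with no change to `handle`.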
No echo cancellation, so you will need a headset.
"Not everything needs to be spoken" - this is the idea I wanted to capture with this project. It generates text for TTS (usually short) and displays relevant information on the screen (longer sections).
A few things I learned along the way:
- collecting jargon from previous chats and feeding it to the LLM-based ASR system greatly improves its ability to handle that jargon during transcription
- openrouter is awesome!
- end-to-end speech-to-speech systems aren't all that great once tool calling is involved. for any serious use case, tool calling will be involved. so it has to go through speech -> text, text processing, text -> speech anyhow.
- once you are serious about a project, Claude Code will consume the weekly quota rather quickly. I ended up with opencode + kimi 2.5. 90% of the code was written by chatbots
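The jargon point from the list above can be sketched roughly as follows: pull candidate terms out of earlier chat messages and prepend them to the transcription prompt of an LLM-based ASR model. The regex heuristic, the function names, and the prompt wording are all assumptions for illustration, not what Vilberta actually does.

```python
import re
from collections import Counter

# Hypothetical heuristic for "jargon": acronyms (MCP, VAD) or tokens with an
# internal capital or digit (OpenRouter). Ordinary capitalized words do not match.
TERM_RE = re.compile(r"\b(?:[A-Z]{2,}|[A-Za-z]+[A-Z0-9][A-Za-z0-9]*)\b")


def collect_jargon(chats: list[str], top_n: int = 20) -> list[str]:
    """Return the most frequent candidate terms seen in previous chat messages."""
    counts: Counter = Counter()
    for message in chats:
        counts.update(TERM_RE.findall(message))
    return [term for term, _ in counts.most_common(top_n)]


def build_asr_prompt(jargon: list[str]) -> str:
    """Build a transcription instruction that hints at likely domain terms."""
    base = "Transcribe the audio."
    if not jargon:
        return base
    return f"{base} Terms that may appear: {', '.join(jargon)}."


chats = ["Route the call through MCP and OpenRouter", "MCP tool failed"]
prompt = build_asr_prompt(collect_jargon(chats))
```

The resulting `prompt` would be sent along with the audio to the ASR model, nudging it toward spellings it has already seen in the conversation history.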
Usable, tested vibe-coded PRs are welcome!

Nice work on the speech-to-speech pipeline! You're absolutely right that it has to go through the text intermediate step - that's actually where a lot of the interesting processing can happen. I've found that the speech->text->speech approach gives you much more control over the output quality. The text intermediate step lets you clean up transcription errors, adjust tone, and even restructure the content before converting back to speech. Have you experimented with different text processing steps in between? I've been building something similar at voicevoyage.io focused on that middle text processing layer - turning raw transcriptions into polished content before the final output.