Micdrop: Real-Time Voice Conversations with AI
Micdrop website | Documentation
Micdrop is a set of open source TypeScript packages for building real-time voice conversations with AI agents. It handles the complexities on both the browser and server side (microphone, speaker, VAD, network communication, etc.) and provides ready-to-use implementations for various AI providers.
Packages
Core Packages (start here)
- @micdrop/client - Browser library handling microphone input, audio playback, and real-time communication
- @micdrop/server - Server implementation for audio streaming and AI integration orchestration
AI Implementations
- @micdrop/openai - OpenAI integration providing LLM agent and speech-to-text capabilities
- @micdrop/ai-sdk - AI SDK agent compatible with a lot of LLM providers
- @micdrop/elevenlabs - ElevenLabs text-to-speech integration with streaming support
- @micdrop/cartesia - Cartesia text-to-speech integration for real-time voice synthesis
- @micdrop/mistral - Mistral AI agent integration for conversation handling
- @micdrop/gladia - Gladia speech-to-text integration for audio transcription
Utility Packages
- @micdrop/react - React hooks for Micdrop
Demo Applications
- demo-client - Example web application with React
- demo-server - Example server with Fastify
Demo and technical details (video)
See the author Godefroy de Compreignac talking about Micdrop and voice AI in this video:
Why Micdrop?
While real-time multimodal models (voice-to-voice) offer impressive capabilities, they often come with limitations in terms of customization and cost. Micdrop takes a different approach by:
- Allowing you to choose the best-in-class API for each component:
  - Select specific voices from TTS providers
  - Use different LLMs optimized for your use case
  - Pick STT engines suited for specific languages/accents
- Reducing costs by letting you:
  - Use more cost-effective API providers
  - Mix open source and commercial solutions
  - Control exactly when APIs are called
- Providing granular control over the conversation flow
- Supporting a wider range of languages and voices through specialized providers
This modular approach gives you the flexibility to build voice applications that are both powerful and cost-effective.
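To make the modular approach concrete, here is a minimal TypeScript sketch of one voice turn assembled from three swappable parts. The interfaces `SpeechToText`, `Agent`, and `TextToSpeech`, the `handleTurn` function, and the stub providers are hypothetical names invented for this illustration; they are not Micdrop's actual API, which is documented in each package.

```typescript
// Hypothetical interfaces for illustration only; not Micdrop's real API.
interface SpeechToText {
  transcribe(audio: Uint8Array): Promise<string>
}

interface Agent {
  answer(history: string[], userText: string): Promise<string>
}

interface TextToSpeech {
  synthesize(text: string): Promise<Uint8Array>
}

// Each slot can be filled by a different provider.
interface Pipeline {
  stt: SpeechToText
  agent: Agent
  tts: TextToSpeech
}

// One conversational turn: user audio in, assistant audio out.
async function handleTurn(
  pipeline: Pipeline,
  history: string[],
  audio: Uint8Array
): Promise<Uint8Array> {
  const userText = await pipeline.stt.transcribe(audio)
  const reply = await pipeline.agent.answer(history, userText)
  history.push(`user: ${userText}`, `assistant: ${reply}`)
  return pipeline.tts.synthesize(reply)
}

// Stub providers standing in for real STT, LLM, and TTS services.
const fakeStt: SpeechToText = {
  async transcribe(_audio) {
    return 'hello' // a real engine would decode the audio
  },
}

const echoAgent: Agent = {
  async answer(_history, userText) {
    return `You said: ${userText}`
  },
}

const fakeTts: TextToSpeech = {
  async synthesize(text) {
    // Placeholder "audio": one zero byte per character of text.
    return new Uint8Array(text.length)
  },
}
```

In this sketch, swapping a provider means replacing one field of the `Pipeline` object; the turn logic in `handleTurn` stays unchanged.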
Features
- Microphone handling with:
  - Streaming support
  - Voice Activity Detection (VAD)
- Advanced audio playback with:
  - Streaming support
  - Device selection and control
- WebSocket communication
- AI implementations provided for OpenAI, ElevenLabs, Mistral, Gladia, and more
- Bring your own AI components (framework agnostic):
  - Large Language Models (LLM)
  - Text-to-Speech (TTS)
  - Speech-to-Text (STT)
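Voice Activity Detection, listed among the features above, decides frame by frame whether incoming audio contains speech. How Micdrop implements it is not described here; the sketch below shows one common generic technique, an RMS energy threshold with a short hangover, using hypothetical names (`rms`, `detectSpeech`, `VadOptions`) and illustrative parameter values.

```typescript
// Generic energy-based VAD sketch; an assumption for illustration,
// not Micdrop's implementation.

// Root-mean-square energy of one frame of samples in [-1, 1].
function rms(frame: Float32Array): number {
  if (frame.length === 0) return 0
  let sum = 0
  for (let i = 0; i < frame.length; i++) {
    sum += frame[i] * frame[i]
  }
  return Math.sqrt(sum / frame.length)
}

interface VadOptions {
  threshold: number // RMS level at or above which a frame counts as speech
  hangoverFrames: number // quiet frames tolerated before speech is considered over
}

// Returns one boolean per frame: true while speech is considered active.
function detectSpeech(frames: Float32Array[], opts: VadOptions): boolean[] {
  const result: boolean[] = []
  let speaking = false
  let quietRun = 0
  for (const frame of frames) {
    if (rms(frame) >= opts.threshold) {
      speaking = true
      quietRun = 0
    } else if (speaking) {
      quietRun++
      // End the speech segment only after the hangover is exceeded,
      // so short pauses between words do not cut the utterance.
      if (quietRun > opts.hangoverFrames) speaking = false
    }
    result.push(speaking)
  }
  return result
}
```

The hangover keeps a brief pause inside a sentence from being treated as the end of speech; production detectors often rely on trained models rather than a fixed threshold.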
Development
For detailed development instructions, including how to build, test, and publish packages, please see DEVELOPMENT.md.
License
MIT License - see the LICENSE file for details
Author
Originally developed for Raconte.ai and open sourced by Lonestone (GitHub)
