The Unified Text-to-Speech SDK
The SpeechSDK is a free, open-source toolkit for building AI audio applications with multiple voice providers.
Multi-Provider
One interface across OpenAI, ElevenLabs, Deepgram, Cartesia, Google, Mistral, Hume, and more. Unified model strings, consistent response format, BYO API keys.
Cross-Platform Streaming
Low-latency audio streaming with streamSpeech that runs everywhere — Node.js, Edge, and the browser. Same API, zero platform-specific code.
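As a rough illustration of why streaming lowers latency, the sketch below handles audio chunks as they arrive instead of waiting for the whole file. It does not call streamSpeech itself: `fakeAudioStream` and `consume` are hypothetical names, and the assumption that chunks arrive as an async iterable of `Uint8Array` may not match the SDK's actual return type.

```typescript
// Stand-in for a streaming TTS call: yields audio chunks as they "arrive".
// Hypothetical; the real streamSpeech signature may differ.
async function* fakeAudioStream(): AsyncGenerator<Uint8Array> {
  yield new Uint8Array([1, 2, 3]);
  yield new Uint8Array([4, 5]);
}

// Pass each chunk to the caller as soon as it is available and
// return the total number of bytes seen.
async function consume(
  stream: AsyncIterable<Uint8Array>,
  onChunk: (chunk: Uint8Array) => void,
): Promise<number> {
  let total = 0;
  for await (const chunk of stream) {
    onChunk(chunk); // e.g. feed an audio player or write to an HTTP response
    total += chunk.byteLength;
  }
  return total;
}
```

Because the loop relies only on async iteration and `Uint8Array`, the same consumer code needs no platform-specific branches.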
Standardized Audio Tags
Add expressive cues like [laugh] and [sigh], as well as emotions, with a universal bracket syntax. For models that don't support audio tags, the tags are stripped automatically, so you can write once and run anywhere.
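The stripping behavior can be pictured with a small sketch. This is not the SDK's implementation: `stripAudioTags`, `prepareText`, and the `supportsTags` flag are hypothetical names, and the sketch assumes a tag is a short run of letters, spaces, or hyphens inside square brackets.

```typescript
// Hypothetical helpers; not part of SpeechSDK's documented API.
// Assumption: an audio tag looks like [laugh] or [sigh], i.e. letters,
// spaces, or hyphens inside square brackets.
const AUDIO_TAG = /\[[a-zA-Z][a-zA-Z -]*\]/g;

function stripAudioTags(text: string): string {
  return text
    .replace(AUDIO_TAG, "")          // drop the tags themselves
    .replace(/\s{2,}/g, " ")         // collapse the gaps they leave behind
    .replace(/\s+([.,!?;:])/g, "$1") // remove stray space before punctuation
    .trim();
}

// Keep tags for models that understand them, strip them otherwise.
function prepareText(text: string, supportsTags: boolean): string {
  return supportsTags ? text : stripAudioTags(text);
}
```

With this sketch, `prepareText("Oh [sigh], fine.", false)` yields `"Oh, fine."`, while passing `true` returns the text untouched.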
AI Engineering
For Production Voice Applications
Lazy base64 conversion
Only computes the format you access — uint8Array or base64 — and caches it. No unnecessary encoding or wasted memory.
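One way to picture lazy, cached conversion is a pair of getters that each compute their value on first access. The class below is a sketch under that assumption. `LazyAudio` and its factory methods are hypothetical names, and only the property names `uint8Array` and `base64` come from the text above.

```typescript
// Hypothetical class illustrating lazy, cached conversion between formats.
class LazyAudio {
  private bytes?: Uint8Array;
  private encoded?: string;

  private constructor(bytes?: Uint8Array, encoded?: string) {
    this.bytes = bytes;
    this.encoded = encoded;
  }

  static fromBytes(bytes: Uint8Array): LazyAudio {
    return new LazyAudio(bytes, undefined);
  }

  static fromBase64(encoded: string): LazyAudio {
    return new LazyAudio(undefined, encoded);
  }

  // Decoded only on first access, then reused.
  get uint8Array(): Uint8Array {
    if (this.bytes === undefined) {
      const binary = atob(this.encoded ?? "");
      const out = new Uint8Array(binary.length);
      for (let i = 0; i < binary.length; i++) out[i] = binary.charCodeAt(i);
      this.bytes = out;
    }
    return this.bytes;
  }

  // Encoded only on first access, then reused.
  get base64(): string {
    if (this.encoded === undefined) {
      const source = this.bytes ?? new Uint8Array(0);
      let binary = "";
      for (let i = 0; i < source.length; i++) binary += String.fromCharCode(source[i]);
      this.encoded = btoa(binary);
    }
    return this.encoded;
  }
}
```

A caller that only ever reads `uint8Array` from a byte-backed instance never pays for base64 encoding, which is the point of the design.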
Content-type awareness
The mediaType is read directly from each provider's response headers. You always know the actual audio format — MP3 from OpenAI, WAV from Cartesia, etc.
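Reading the media type from a response might look like the sketch below. `readMediaType` is a hypothetical helper, and the `application/octet-stream` fallback is an assumption rather than documented SDK behavior. The parameter type is structural so the sketch works with anything that exposes a Headers-like `get`, such as a fetch `Response`.

```typescript
// Minimal structural type: anything with a Headers-like `get` method.
type HasHeaders = { headers: { get(name: string): string | null } };

// Hypothetical helper; the fallback value is an assumption.
function readMediaType(response: HasHeaders): string {
  const raw = response.headers.get("content-type");
  if (!raw) return "application/octet-stream";
  // Drop parameters such as "; charset=..." and normalize case.
  const semicolon = raw.indexOf(";");
  const type = semicolon === -1 ? raw : raw.slice(0, semicolon);
  return type.trim().toLowerCase();
}
```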
Custom fetch & Base URL
Every provider accepts a custom fetch and baseURL. Point at OpenAI-compatible proxies, Azure OpenAI, LiteLLM, or local models. Swap in undici, a proxy-aware fetch, or a mock.
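The value of an injectable fetch and base URL is easiest to see with a mock. The sketch below is not the SDK's code: `ProviderOptions`, `requestSpeech`, and `mockFetch` are hypothetical, the local proxy URL in the usage note is made up, and the `/audio/speech` path follows OpenAI's speech endpoint on the assumption that the proxy is OpenAI-compatible.

```typescript
// Structural stand-ins for the fetch types, so the sketch has no platform dependency.
type FetchResponse = { ok: boolean; status: number; arrayBuffer(): Promise<ArrayBuffer> };
type FetchLike = (
  url: string,
  init: { method: string; headers: Record<string, string>; body: string },
) => Promise<FetchResponse>;

// Hypothetical option names mirroring the prose above.
interface ProviderOptions {
  apiKey: string;
  baseURL: string;
  fetch: FetchLike;
}

// Hypothetical request helper that sends everything through the injected fetch.
async function requestSpeech(
  options: ProviderOptions,
  model: string,
  input: string,
): Promise<Uint8Array> {
  const url = `${options.baseURL.replace(/\/+$/, "")}/audio/speech`;
  const response = await options.fetch(url, {
    method: "POST",
    headers: {
      Authorization: `Bearer ${options.apiKey}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({ model, input }),
  });
  if (!response.ok) throw new Error(`Speech request failed with status ${response.status}`);
  return new Uint8Array(await response.arrayBuffer());
}

// A mock fetch that records the URL and returns two fake bytes.
const seenUrls: string[] = [];
const mockFetch: FetchLike = async (url) => {
  seenUrls.push(url);
  return { ok: true, status: 200, arrayBuffer: async () => new ArrayBuffer(2) };
};
```

Calling `requestSpeech({ apiKey: "test", baseURL: "http://localhost:4000/v1", fetch: mockFetch }, "tts-1", "hi")` exercises the full request path without touching the network, which is how a test suite or a proxy-aware fetch would plug in.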
Smart retries
Built-in retry with exponential backoff via p-retry. Retries 5xx and network errors automatically. 4xx errors (auth failures, bad requests) abort immediately — no wasted time.
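The retry policy described above can be sketched with a hand-rolled loop. This stands in for p-retry rather than showing how the SDK configures it: `httpError`, `isRetryable`, and `withRetry` are hypothetical names, and the retry count and base delay are arbitrary illustrative defaults.

```typescript
// Hypothetical error factory: attaches an HTTP status to a plain Error.
function httpError(status: number): Error & { status: number } {
  return Object.assign(new Error(`HTTP ${status}`), { status });
}

// 4xx means the request itself is wrong, so retrying cannot help.
// 5xx and errors without a status (e.g. network failures) are treated as transient.
function isRetryable(error: unknown): boolean {
  const status = (error as { status?: unknown } | null)?.status;
  if (typeof status === "number") return status >= 500;
  return true;
}

async function withRetry<T>(
  task: () => Promise<T>,
  retries = 3,
  baseDelayMs = 200,
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await task();
    } catch (error) {
      // Give up immediately on non-retryable errors or when out of attempts.
      if (!isRetryable(error) || attempt >= retries) throw error;
      const delay = baseDelayMs * 2 ** attempt; // doubles each attempt: 200, 400, 800, ...
      await new Promise<void>((resolve) => setTimeout(resolve, delay));
    }
  }
}
```

A task that fails twice with a 503 and then succeeds resolves on the third call, while a 401 is rethrown after a single call.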
Lightweight & portable
Built on standard Web APIs — fetch and Uint8Array. No heavy audio libraries, no provider SDK wrappers. Works anywhere fetch works: Node.js, Edge, and the browser.
Works seamlessly with Speech Gateway
Speech Gateway adds production infrastructure — queuing, quality processing, voice management, and analytics. One config change to connect. Coming Soon.
| Provider | Model String | Default* |
|---|---|---|
| OpenAI | openai/gpt-4o-mini-tts | Yes |
| OpenAI | openai/tts-1 | — |
| OpenAI | openai/tts-1-hd | — |
| ElevenLabs | elevenlabs/eleven_multilingual_v2 | Yes |
| ElevenLabs | elevenlabs/eleven_v3 | — |
| ElevenLabs | elevenlabs/eleven_flash_v2_5 | — |
| ElevenLabs | elevenlabs/eleven_flash_v2 | — |
| Deepgram | deepgram/aura-2 | Yes |
| Cartesia | cartesia/sonic-3 | Yes |
| Hume | hume/octave-2 | Yes |
| Google | google/gemini-2.5-flash-preview-tts | Yes |
| Google | google/gemini-2.5-pro-preview-tts | — |
| Fish Audio | fish-audio/s2-pro | Yes |
| Unreal Speech | unreal-speech/default | Yes |
| Murf | murf/GEN2 | Yes |
| Resemble | resemble/default | Yes |
| fal | fal-ai/* | — |
| Mistral | mistral/voxtral-mini-tts-2603 | Yes |
| xAI | xai/grok-tts | Yes |
* Pass just the provider name to use its default model — e.g. model: 'openai' resolves to openai/gpt-4o-mini-tts.
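The resolution rule in the footnote can be sketched as a lookup. The default values below are copied from a subset of the table, but `DEFAULT_MODELS` and `resolveModel` are hypothetical names, and throwing on an unknown provider is an assumption about how such a resolver might behave.

```typescript
// Defaults copied from the table above (subset); the resolver is a hypothetical sketch.
const DEFAULT_MODELS = new Map<string, string>([
  ["openai", "openai/gpt-4o-mini-tts"],
  ["elevenlabs", "elevenlabs/eleven_multilingual_v2"],
  ["deepgram", "deepgram/aura-2"],
  ["cartesia", "cartesia/sonic-3"],
  ["mistral", "mistral/voxtral-mini-tts-2603"],
]);

function resolveModel(model: string): string {
  // A string containing "/" is already a full "provider/model" identifier.
  if (model.indexOf("/") !== -1) return model;
  const resolved = DEFAULT_MODELS.get(model);
  if (resolved === undefined) {
    throw new Error(`No default model for provider "${model}"`);
  }
  return resolved;
}
```

Here `resolveModel("openai")` returns `"openai/gpt-4o-mini-tts"`, and a full string such as `"openai/tts-1"` passes through unchanged.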
Frequently asked questions
Each provider has its own SDK, request format, auth pattern, and response shape. SpeechSDK gives you one interface for all of them — same function call, same result type, same error handling. Switch providers by simply changing a model string.
One SDK, every provider. Add text-to-speech to your app in minutes with a unified, open-source interface.