SpeechSDK | Universal Text-to-Speech SDK

The Unified Text-to-Speech SDK

The SpeechSDK is a free, open-source toolkit for building AI audio applications with multiple voice providers.

Multi-Provider

One interface across OpenAI, ElevenLabs, Deepgram, Cartesia, Google, Mistral, Hume, and more, with unified model strings, a consistent response format, and bring-your-own API keys.

Cross-Platform Streaming

Low-latency audio streaming with streamSpeech that runs everywhere — Node.js, Edge, and the browser. Same API, zero platform-specific code.

Standardized Audio Tags

Add expressive cues and emotions, such as [laugh] and [sigh], with a universal bracket syntax. Tags are stripped automatically for models that don't support them, so you can write once and run anywhere.
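To make the stripping behavior concrete, here is a minimal sketch of how bracketed tags could be removed before text reaches a model without tag support. The names stripAudioTags and prepareText, and the exact tag pattern, are assumptions for illustration and are not taken from SpeechSDK's API.

```typescript
// Hypothetical helpers for illustration; not SpeechSDK's actual implementation.
// Assumed tag shape: "[" + letters, spaces, or hyphens + "]", e.g. [laugh], [sigh].
const AUDIO_TAG = /\[[a-zA-Z][a-zA-Z -]*\]/g;

function stripAudioTags(text: string): string {
  return text
    .replace(AUDIO_TAG, "")  // drop the bracketed cues
    .replace(/\s{2,}/g, " ") // collapse the gaps they leave behind
    .trim();
}

// Pass text through untouched when the model understands tags,
// otherwise remove them first.
function prepareText(text: string, supportsTags: boolean): string {
  return supportsTags ? text : stripAudioTags(text);
}
```

For example, prepareText("Well [sigh] that was funny [laugh]", false) yields "Well that was funny", while passing true returns the input unchanged.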

AI Engineering

For Production Voice Applications

Lazy base64 conversion

Only computes the format you access — uint8Array or base64 — and caches it. No unnecessary encoding or wasted memory.
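The lazy conversion described above can be sketched with cached getters. This is an illustrative class under assumed names (LazyAudio, fromBytes, fromBase64); only the uint8Array and base64 property names come from the text, and the real result type may differ.

```typescript
// Illustrative sketch; LazyAudio and its factory methods are hypothetical.
class LazyAudio {
  private bytes: Uint8Array | undefined;
  private b64: string | undefined;

  private constructor(bytes: Uint8Array | undefined, b64: string | undefined) {
    this.bytes = bytes;
    this.b64 = b64;
  }

  static fromBytes(bytes: Uint8Array): LazyAudio {
    return new LazyAudio(bytes, undefined);
  }

  static fromBase64(b64: string): LazyAudio {
    return new LazyAudio(undefined, b64);
  }

  // Decodes from base64 only on first access, then reuses the result.
  get uint8Array(): Uint8Array {
    if (this.bytes === undefined) {
      const binary = atob(this.b64 as string);
      this.bytes = Uint8Array.from(binary, (ch) => ch.charCodeAt(0));
    }
    return this.bytes;
  }

  // Encodes to base64 only on first access, then reuses the result.
  get base64(): string {
    if (this.b64 === undefined) {
      const source = this.bytes as Uint8Array;
      let binary = "";
      for (let i = 0; i < source.length; i++) {
        binary += String.fromCharCode(source[i]);
      }
      this.b64 = btoa(binary);
    }
    return this.b64;
  }
}
```

A caller that only ever reads uint8Array never pays for base64 encoding, and repeated reads of either property return the cached value.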

Content-type awareness

The mediaType is read directly from each provider's response headers. You always know the actual audio format — MP3 from OpenAI, WAV from Cartesia, etc.
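As a rough sketch of the idea, a media type can be derived from a standard fetch Response like this. The helper name readMediaType and the application/octet-stream fallback are assumptions; the text does not say what happens when a provider omits the header.

```typescript
// Hypothetical helper; the fallback value is an assumption, not documented behavior.
function readMediaType(response: Response): string {
  const header = response.headers.get("content-type");
  if (header === null) return "application/octet-stream";
  // Keep only the type itself, dropping parameters such as "; rate=24000".
  return header.split(";")[0].trim().toLowerCase();
}
```

With this approach the caller never has to guess the format from the provider name: whatever the server reports is what gets surfaced.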

Custom fetch & Base URL

Every provider accepts a custom fetch and baseURL. Point at OpenAI-compatible proxies, Azure OpenAI, LiteLLM, or local models. Swap in undici, a proxy-aware fetch, or a mock.

Smart retries

Built-in retry with exponential backoff via p-retry. Retries 5xx and network errors automatically. 4xx errors (auth failures, bad requests) abort immediately — no wasted time.
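The retry policy above can be approximated with a small hand-rolled loop. This sketch does not use p-retry and is not the SDK's code; HttpError, isRetryable, backoffMs, withRetry, the 250 ms base delay, and the attempt limit are all assumed for illustration.

```typescript
// Hand-rolled illustration of the policy; every name and default here is hypothetical.
class HttpError extends Error {
  readonly status: number;
  constructor(status: number) {
    super(`HTTP ${status}`);
    this.status = status;
  }
}

// 5xx responses and non-HTTP failures (treated as network errors) are retried;
// 4xx and other HTTP statuses are not.
function isRetryable(error: unknown): boolean {
  if (error instanceof HttpError) return error.status >= 500;
  return true;
}

// Delay doubles with each attempt: base, 2x base, 4x base, ...
function backoffMs(attempt: number, baseMs = 250): number {
  return baseMs * 2 ** (attempt - 1);
}

async function withRetry<T>(
  task: () => Promise<T>,
  maxAttempts = 3,
  baseMs = 250,
): Promise<T> {
  for (let attempt = 1; ; attempt++) {
    try {
      return await task();
    } catch (error) {
      // Give up immediately on non-retryable errors or when attempts run out.
      if (!isRetryable(error) || attempt >= maxAttempts) throw error;
      await new Promise((resolve) => setTimeout(resolve, backoffMs(attempt, baseMs)));
    }
  }
}
```

Wrapping a request as withRetry(() => callProvider()) (where callProvider is whatever function issues the request) would retry a 503 after 250 ms and again after 500 ms, while a 401 would surface on the first failure.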

Lightweight & portable

Built on standard Web APIs — fetch and Uint8Array. No heavy audio libraries, no provider SDK wrappers. Works anywhere fetch works: Node.js, Edge, and the browser.

Works seamlessly with Speech Gateway

Speech Gateway adds production infrastructure — queuing, quality processing, voice management, and analytics. One config change to connect. Coming Soon.

Provider       Model String                          Default*
OpenAI         openai/gpt-4o-mini-tts                Yes
OpenAI         openai/tts-1
OpenAI         openai/tts-1-hd
ElevenLabs     elevenlabs/eleven_multilingual_v2     Yes
ElevenLabs     elevenlabs/eleven_v3
ElevenLabs     elevenlabs/eleven_flash_v2_5
ElevenLabs     elevenlabs/eleven_flash_v2
Deepgram       deepgram/aura-2                       Yes
Cartesia       cartesia/sonic-3                      Yes
Hume           hume/octave-2                         Yes
Google         google/gemini-2.5-flash-preview-tts   Yes
Google         google/gemini-2.5-pro-preview-tts
Fish Audio     fish-audio/s2-pro                     Yes
Unreal Speech  unreal-speech/default                 Yes
Murf           murf/GEN2                             Yes
Resemble       resemble/default                      Yes
fal            fal-ai/*
Mistral        mistral/voxtral-mini-tts-2603         Yes
xAI            xai/grok-tts                          Yes

* Pass just the provider name to use its default model — e.g. model: 'openai' resolves to openai/gpt-4o-mini-tts.
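The default-model rule in the footnote can be sketched as a simple lookup. The defaults below are copied from the table; the function name resolveModel and the error thrown for providers without a default are assumptions, not documented SDK behavior.

```typescript
// Defaults taken from the table above; resolveModel itself is a hypothetical helper.
const DEFAULT_MODELS = new Map<string, string>([
  ["openai", "gpt-4o-mini-tts"],
  ["elevenlabs", "eleven_multilingual_v2"],
  ["deepgram", "aura-2"],
  ["cartesia", "sonic-3"],
  ["hume", "octave-2"],
  ["google", "gemini-2.5-flash-preview-tts"],
  ["fish-audio", "s2-pro"],
  ["unreal-speech", "default"],
  ["murf", "GEN2"],
  ["resemble", "default"],
  ["mistral", "voxtral-mini-tts-2603"],
  ["xai", "grok-tts"],
]);

function resolveModel(model: string): string {
  // A string that already contains "/" is treated as a full "provider/model" pair.
  if (model.includes("/")) return model;
  const fallback = DEFAULT_MODELS.get(model);
  if (fallback === undefined) {
    throw new Error(`No default model for provider "${model}"`);
  }
  return `${model}/${fallback}`;
}
```

Here resolveModel("openai") returns "openai/gpt-4o-mini-tts", and a full string such as "openai/tts-1" passes through unchanged. Since fal has no default in the table, this sketch throws for a bare "fal-ai".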

Frequently asked questions

Each provider has its own SDK, request format, auth pattern, and response shape. SpeechSDK gives you one interface for all of them — same function call, same result type, same error handling. Switch providers by simply changing a model string.

SpeechSDK

One SDK, every provider. Add text-to-speech to your app in minutes with a unified, open-source interface.