SpeechSDK | Universal Text-to-Speech SDK

The Unified Text-to-Speech SDK

The SpeechSDK is a free, open-source toolkit for building AI audio applications with multiple voice providers.

Multi-Provider

One interface across OpenAI, ElevenLabs, Deepgram, Cartesia, Google, Mistral, Hume, and more. Unified model strings, consistent response format, BYO API keys.

Cross-Platform Streaming

Low-latency audio streaming with streamSpeech that runs everywhere — Node.js, Edge, and the browser. Same API, zero platform-specific code.
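As a rough sketch of what platform-neutral streaming consumption can look like: the page does not show what streamSpeech returns, so this assumes the audio arrives as an async iterable of Uint8Array chunks. The names fakeSpeechStream and collectAudio are hypothetical stand-ins, not SDK exports.

```typescript
// Hypothetical stand-in for a speech stream: yields audio chunks one at a time.
// Assumption: the real stream can be consumed as AsyncIterable<Uint8Array>.
async function* fakeSpeechStream(chunks: Uint8Array[]): AsyncGenerator<Uint8Array> {
  for (const chunk of chunks) {
    yield chunk;
  }
}

// Handles each chunk as soon as it arrives (for example, to start playback),
// then returns all bytes joined into a single Uint8Array.
async function collectAudio(
  stream: AsyncIterable<Uint8Array>,
  onChunk?: (chunk: Uint8Array) => void,
): Promise<Uint8Array> {
  const parts: Uint8Array[] = [];
  let total = 0;
  for await (const chunk of stream) {
    if (onChunk) onChunk(chunk);
    parts.push(chunk);
    total += chunk.length;
  }
  const joined = new Uint8Array(total);
  let offset = 0;
  for (const part of parts) {
    joined.set(part, offset);
    offset += part.length;
  }
  return joined;
}
```

Nothing here touches Node-only or browser-only APIs, which is the property the streaming claim above depends on.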


Standardized Audio Tags

Add expressive cues like [laugh] and [sigh], plus emotions, with a universal bracket syntax. For models that don't support audio tags, the tags are stripped automatically, so you can write once and run anywhere.
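A minimal sketch of how tag stripping could work, assuming a per-model capability table. The model names and the table itself are hypothetical, and the SDK's actual tag grammar may differ from the simple pattern used here.

```typescript
// Hypothetical capability table. These entries are illustrative assumptions,
// not the SDK's real data.
const SUPPORTS_AUDIO_TAGS = new Map<string, boolean>([
  ["example/expressive-model", true],
  ["example/plain-model", false],
]);

// Matches a short bracketed cue such as [laugh] or [sigh].
const AUDIO_TAG = /\[[a-z][a-z ]*\]/gi;

// Returns the text unchanged for tag-aware models. For all other models
// (including unknown ones) the tags are removed and the spacing is tidied.
function prepareText(text: string, model: string): string {
  if (SUPPORTS_AUDIO_TAGS.get(model) === true) {
    return text;
  }
  return text
    .replace(AUDIO_TAG, "")
    .replace(/\s{2,}/g, " ") // collapse gaps left behind by removed tags
    .replace(/\s+([.,!?])/g, "$1") // no space before punctuation
    .trim();
}
```

Treating unknown models as unsupported is a deliberate choice in this sketch: a stripped tag loses a cue, while an unstripped one may be read aloud.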

AI Engineering

For Production Voice Applications

Lazy base64 conversion

Only computes the format you access — uint8Array or base64 — and caches it. No unnecessary encoding or wasted memory.
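The idea can be sketched with a cached getter. This is an illustration under assumptions, not the SDK's implementation: the class name LazyAudio and the encodeCount counter are hypothetical, and only the bytes-to-base64 direction is shown (the reverse would follow the same pattern). The encoder is hand-written so the sketch relies on neither Buffer nor btoa.

```typescript
const B64 = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

// Plain base64 encoder: 3 input bytes become 4 output characters, padded with "=".
function encodeBase64(bytes: Uint8Array): string {
  let out = "";
  for (let i = 0; i < bytes.length; i += 3) {
    const has1 = i + 1 < bytes.length;
    const has2 = i + 2 < bytes.length;
    const b0 = bytes[i];
    const b1 = has1 ? bytes[i + 1] : 0;
    const b2 = has2 ? bytes[i + 2] : 0;
    out += B64.charAt(b0 >> 2);
    out += B64.charAt(((b0 & 3) << 4) | (b1 >> 4));
    out += has1 ? B64.charAt(((b1 & 15) << 2) | (b2 >> 6)) : "=";
    out += has2 ? B64.charAt(b2 & 63) : "=";
  }
  return out;
}

// Hypothetical result wrapper: base64 is computed on first access, then cached.
class LazyAudio {
  private cachedBase64: string | undefined;
  encodeCount = 0; // demonstration only: counts how often encoding actually ran

  constructor(readonly uint8Array: Uint8Array) {}

  get base64(): string {
    if (this.cachedBase64 === undefined) {
      this.cachedBase64 = encodeBase64(this.uint8Array);
      this.encodeCount += 1;
    }
    return this.cachedBase64;
  }
}
```

A caller that only ever reads uint8Array never pays for encoding, and a caller that reads base64 repeatedly pays once.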

Content-type awareness

The mediaType is read directly from each provider's response headers. You always know the actual audio format — MP3 from OpenAI, WAV from Cartesia, etc.
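A small sketch of reading the media type from response headers. The HeaderReader interface mirrors the get method of the standard Headers class so the example stays self-contained; readMediaType, extensionFor, and the extension table are hypothetical helpers, not SDK functions.

```typescript
// Minimal shape of the standard Headers object: get() returns null when absent.
interface HeaderReader {
  get(name: string): string | null;
}

// Illustrative mapping from common audio media types to file extensions.
const EXTENSIONS = new Map<string, string>([
  ["audio/mpeg", "mp3"],
  ["audio/wav", "wav"],
  ["audio/x-wav", "wav"],
  ["audio/ogg", "ogg"],
  ["audio/flac", "flac"],
]);

// Reads Content-Type, drops parameters such as "; charset=...", and lowercases it.
function readMediaType(headers: HeaderReader, fallback = "application/octet-stream"): string {
  const raw = headers.get("content-type");
  if (raw === null || raw.trim() === "") {
    return fallback;
  }
  return raw.split(";")[0].trim().toLowerCase();
}

// Returns a file extension for a known media type, or undefined otherwise.
function extensionFor(mediaType: string): string | undefined {
  return EXTENSIONS.get(mediaType);
}
```

Trusting the header rather than the requested format matters when a provider or proxy returns something other than what was asked for.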

Custom fetch & Base URL

Every provider accepts a custom fetch and baseURL. Point at OpenAI-compatible proxies, Azure OpenAI, LiteLLM, or local models. Swap in undici, a proxy-aware fetch, or a mock.
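To illustrate why an injectable fetch and baseURL are useful, here is a sketch of a request helper exercised with a mock. The helper names (requestSpeech, createMockFetch, joinUrl) and the simplified FetchLike type are hypothetical; the audio/speech path and the model, input, and voice fields follow OpenAI's speech endpoint, but how SpeechSDK builds its requests internally is an assumption.

```typescript
interface SpeechRequestInit {
  method: string;
  headers: Record<string, string>;
  body: string;
}

// Simplified stand-in for the fetch signature, enough for this sketch.
type FetchLike = (url: string, init: SpeechRequestInit) => Promise<{ status: number }>;

// Joins a base URL and a path with exactly one slash between them.
function joinUrl(baseURL: string, path: string): string {
  return baseURL.replace(/\/+$/, "") + "/" + path.replace(/^\/+/, "");
}

// Sends an OpenAI-style speech request through whichever fetch is supplied.
async function requestSpeech(
  text: string,
  options: { baseURL: string; apiKey: string; fetch: FetchLike },
): Promise<number> {
  const response = await options.fetch(joinUrl(options.baseURL, "audio/speech"), {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${options.apiKey}`,
    },
    body: JSON.stringify({ model: "tts-1", input: text, voice: "alloy" }),
  });
  return response.status;
}

// Mock fetch that records every call and answers with a fixed status.
function createMockFetch(status: number) {
  const calls: { url: string; init: SpeechRequestInit }[] = [];
  const mock: FetchLike = async (url, init) => {
    calls.push({ url, init });
    return { status };
  };
  return { mock, calls };
}
```

Pointing baseURL at a local proxy and passing the mock lets a test assert on the outgoing URL and body without any network access.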

Smart retries

Built-in retry with exponential backoff via p-retry. Retries 5xx and network errors automatically. 4xx errors (auth failures, bad requests) abort immediately — no wasted time.
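The retry policy described above can be sketched without dependencies. This is a hand-rolled illustration, not p-retry and not the SDK's code: HttpError, isRetryable, and withRetry are hypothetical names, and the retry count and delay defaults are arbitrary.

```typescript
// Hypothetical error type carrying the HTTP status of a failed request.
class HttpError extends Error {
  constructor(readonly status: number) {
    super(`HTTP ${status}`);
    Object.setPrototypeOf(this, HttpError.prototype); // keeps instanceof working on older targets
  }
}

// 5xx responses and non-HTTP failures (treated as network errors) are retryable.
// 4xx responses are not: repeating a bad or unauthorized request cannot succeed.
function isRetryable(error: unknown): boolean {
  if (error instanceof HttpError) {
    return error.status >= 500;
  }
  return true;
}

// Runs the task, retrying up to `retries` times with delays of
// baseDelayMs, 2x baseDelayMs, 4x baseDelayMs, and so on.
async function withRetry<T>(task: () => Promise<T>, retries = 3, baseDelayMs = 200): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await task();
    } catch (error) {
      if (!isRetryable(error) || attempt >= retries) {
        throw error;
      }
      const delayMs = baseDelayMs * Math.pow(2, attempt);
      await new Promise<void>((resolve) => setTimeout(resolve, delayMs));
    }
  }
}
```

Production code would typically add random jitter to the delay so that many clients do not retry in lockstep.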

Lightweight & portable

Built on standard Web APIs — fetch and Uint8Array. No heavy audio libraries, no provider SDK wrappers. Works anywhere fetch works: Node.js, Edge, and the browser.

Works seamlessly with Speech Gateway

Speech Gateway adds production infrastructure — queuing, quality processing, voice management, and analytics. One config change to connect. Coming Soon.

Provider	Model String	Default*
OpenAI	openai/gpt-4o-mini-tts	Yes
OpenAI	openai/tts-1	—
OpenAI	openai/tts-1-hd	—
ElevenLabs	elevenlabs/eleven_multilingual_v2	Yes
ElevenLabs	elevenlabs/eleven_v3	—
ElevenLabs	elevenlabs/eleven_flash_v2_5	—
ElevenLabs	elevenlabs/eleven_flash_v2	—
Deepgram	deepgram/aura-2	Yes
Cartesia	cartesia/sonic-3	Yes
Hume	hume/octave-2	Yes
Google	google/gemini-2.5-flash-preview-tts	Yes
Google	google/gemini-2.5-pro-preview-tts	—
Fish Audio	fish-audio/s2-pro	Yes
Unreal Speech	unreal-speech/default	Yes
Murf	murf/GEN2	Yes
Resemble	resemble/default	Yes
fal	fal-ai/*	—
Mistral	mistral/voxtral-mini-tts-2603	Yes
xAI	xai/grok-tts	Yes

* Pass just the provider name to use its default model — e.g. model: 'openai' resolves to openai/gpt-4o-mini-tts.
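The default resolution in the footnote can be sketched as a lookup. The defaults below are copied from the table above; resolveModel and its return shape are hypothetical and may not match the SDK's internals.

```typescript
// Default model per provider, taken from the "Default" column of the table.
const DEFAULT_MODELS = new Map<string, string>([
  ["openai", "gpt-4o-mini-tts"],
  ["elevenlabs", "eleven_multilingual_v2"],
  ["deepgram", "aura-2"],
  ["cartesia", "sonic-3"],
  ["hume", "octave-2"],
  ["google", "gemini-2.5-flash-preview-tts"],
  ["fish-audio", "s2-pro"],
  ["unreal-speech", "default"],
  ["murf", "GEN2"],
  ["resemble", "default"],
  ["mistral", "voxtral-mini-tts-2603"],
  ["xai", "grok-tts"],
]);

// Splits "provider/model" on the first slash only, so model IDs that contain
// slashes stay intact. A bare provider name falls back to its default model.
function resolveModel(model: string): { provider: string; modelId: string } {
  const slash = model.indexOf("/");
  if (slash !== -1) {
    return { provider: model.slice(0, slash), modelId: model.slice(slash + 1) };
  }
  const modelId = DEFAULT_MODELS.get(model);
  if (modelId === undefined) {
    throw new Error(`No default model for provider "${model}"`);
  }
  return { provider: model, modelId };
}
```

In this sketch, fal-ai has no entry because the table lists no default for it, so a bare 'fal-ai' throws rather than guessing a model.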

Frequently asked questions

Each provider has its own SDK, request format, auth pattern, and response shape. SpeechSDK gives you one interface for all of them — same function call, same result type, same error handling. Switch providers by simply changing a model string.

One SDK, every provider. Add text-to-speech to your app in minutes with a unified, open-source interface.