A high-performance real-time voice processing server built in Rust that provides unified Speech-to-Text (STT) and Text-to-Speech (TTS) services through WebSocket and REST APIs.
Features
- Unified Voice API: Single interface for multiple STT/TTS providers
- Real-time Processing: WebSocket-based bidirectional audio streaming
- LiveKit Integration: WebRTC audio streaming with room-based communication
- VAD + Turn Detection: Silero-VAD for audio-level silence detection with integrated ML-based end-of-turn detection (
stt-vadfeature) - Advanced Noise Filtering: Optional DeepFilterNet integration (
noise-filterfeature) - Provider Flexibility: Pluggable architecture supporting multiple providers
- Deepgram (STT/TTS)
- ElevenLabs (STT/TTS)
- Google Cloud (STT/TTS) - WaveNet, Neural2, and Studio voices
- Microsoft Azure (STT/TTS) - 400+ neural voices across 140+ languages
- Audio-Disabled Mode: Development mode without API keys
Quick Start
Prerequisites
- Docker
- At least one provider credential source (server config/env or per-request/session auth override; optional in audio-disabled mode)
Run with Docker
docker run -d \ -p 3001:3001 \ -e DEEPGRAM_API_KEY=your-key \ saynaai/sayna
The server will be available at http://localhost:3001.
Docker Compose
version: "3.9" services: sayna: image: saynaai/sayna ports: - "3001:3001" environment: DEEPGRAM_API_KEY: ${DEEPGRAM_API_KEY} ELEVENLABS_API_KEY: ${ELEVENLABS_API_KEY} CACHE_PATH: /data/cache volumes: - sayna-cache:/data/cache volumes: sayna-cache: {}
For complete Docker documentation including LiveKit integration, see docs/docker.md.
Running Without API Keys (Audio-Disabled Mode)
You can run Sayna without provider credentials by using audio-disabled mode. This is useful when you want to exercise the control plane without STT/TTS initialization. Send a WebSocket configuration message with audio: false:
{
"type": "config",
"audio": false
}This mode is useful for:
- Local development and testing
- UI/UX development without audio processing
- Testing WebSocket message flows
- Debugging non-audio features
Authentication (Optional)
Sayna supports customer-based authentication that delegates token validation to an external authentication service. When enabled, protected API endpoints require a valid token provided either as Authorization: Bearer <token> or as ?api_key=<token>.
Enabling Authentication
Add to your .env file:
AUTH_REQUIRED=true AUTH_SERVICE_URL=https://your-auth-service.com/auth AUTH_SIGNING_KEY_PATH=/path/to/auth_private_key.pem AUTH_TIMEOUT_SECONDS=5
Generate signing keys:
# Generate RSA private key openssl genrsa -out auth_private_key.pem 2048 # Extract public key (share with auth service) openssl rsa -in auth_private_key.pem -pubout -out auth_public_key.pem
Making Authenticated Requests
# Authorization header curl -X POST http://localhost:3001/speak \ -H "Authorization: Bearer your-token-here" \ -H "Content-Type: application/json" \ -d '{ "text": "Hello world", "tts_config": { "provider": "deepgram", "model": "aura-asteria-en" } }' # Query parameter alternative curl -X POST "http://localhost:3001/speak?api_key=your-token-here" \ -H "Content-Type: application/json" \ -d '{ "text": "Hello world", "tts_config": { "provider": "deepgram", "model": "aura-asteria-en" } }'
For complete authentication setup and architecture details, see docs/authentication.md.
API Endpoints
WebSocket
- Endpoint:
/ws - Protocol: WebSocket
- Purpose: Real-time bidirectional audio streaming and control
Message Types
Configuration Message:
{
"type": "config",
"audio": true,
"stt_config": {
"provider": "deepgram",
"language": "en-US",
"sample_rate": 16000,
"channels": 1,
"punctuation": true,
"encoding": "linear16",
"model": "nova-3",
"auth": {
"api_key": "dg-session-key"
}
},
"tts_config": {
"provider": "elevenlabs",
"voice_id": "voice_id_here",
"model": "eleven_flash_v2_5",
"audio_format": "mp3_44100_128",
"sample_rate": 44100,
"auth": {
"api_key": "el-session-key"
}
}
}stt_config.auth and tts_config.auth are optional. If omitted, or sent as {}, Sayna falls back to the server-configured provider credentials. If an auth object is provided, it must be complete for that provider. Current shapes are:
- API key providers:
{ "api_key": "..." } - Google Cloud:
{ "credentials": "/path/to/creds.json" }or{ "credentials": { ...service account json... } } - Azure Speech:
{ "api_key": "...", "region": "eastus" }
Audio Input: Binary audio data (16kHz, 16-bit PCM)
Text Input:
{
"type": "text",
"text": "Convert this text to speech"
}REST API
- Health Check:
GET /- Server health check endpoint - Voices:
GET /voices- List available TTS voices (requires auth ifAUTH_REQUIRED=true) - Speak:
POST /speak- Generate speech from text (requires auth ifAUTH_REQUIRED=true) - LiveKit Token:
POST /livekit/token- Generate LiveKit participant token (requires auth ifAUTH_REQUIRED=true) - LiveKit Webhook:
POST /livekit/webhook- Webhook endpoint for LiveKit events (unauthenticated, uses LiveKit signature verification)- Called by LiveKit to deliver room and participant events
- Validates requests using LiveKit's JWT signature mechanism
- Logs SIP-related attributes for phone call troubleshooting
- See docs/livekit_webhook.md for details
Architecture Overview
Core Components
- VoiceManager: Central coordinator for all voice processing operations
- Provider System: Trait-based abstraction for pluggable STT/TTS providers
- WebSocket Handler: Real-time communication and message routing
- LiveKit Integration: WebRTC audio streaming and room management
- DeepFilterNet: Advanced noise reduction with adaptive processing
Request Flow
- Client establishes WebSocket connection to
/ws - Client sends configuration with provider selection
- Audio processing pipeline:
- STT: Audio � Noise Filter (optional) � STT Provider � Text
- TTS: Text � TTS Provider � Audio � Client
- LiveKit mode enables room-based audio streaming
Development
For local development, you can build and run from source.
Prerequisites
- Rust 1.88.0 or later
- Optional: ONNX Runtime (for
stt-vadfeature which includes VAD and turn detection)
Building from Source
# Development build cargo build # Release build (optimized) cargo build --release # Run the server cargo run # Run with a config file cargo run -- -c config.yaml
Feature Flags
Sayna exposes several Cargo features that gate heavyweight subsystems:
stt-vad: Silero-VAD voice activity detection with integrated ONNX-based turn detection. When enabled, VAD monitors audio for silence and triggers the turn detection model to confirm if the speaker's turn is complete.noise-filter: DeepFilterNet noise suppression pipelineopenapi: OpenAPI 3.1 specification generation
# Run with VAD and turn detection cargo run --features stt-vad # Run with noise filter cargo run --features noise-filter # Run with multiple features cargo run --features stt-vad,noise-filter,openapi
The Docker image includes stt-vad and noise-filter by default.
Testing
# Run all tests cargo test # Run specific test cargo test test_name # Run with output cargo test -- --nocapture
Code Quality
# Format code cargo fmt # Run linter cargo clippy # Check for security vulnerabilities cargo audit
Building Docker Image Locally
# Build Docker image docker build -t sayna . # Run container docker run -p 3001:3001 --env-file .env sayna
Configuration
Environment Variables
| Variable | Description | Default | Required |
|---|---|---|---|
DEEPGRAM_API_KEY |
Deepgram API authentication | - | No* |
ELEVENLABS_API_KEY |
ElevenLabs API authentication | - | No* |
GOOGLE_APPLICATION_CREDENTIALS |
Path to Google Cloud service account JSON | - | No* |
AZURE_SPEECH_SUBSCRIPTION_KEY |
Azure Speech Services subscription key | - | No* |
AZURE_SPEECH_REGION |
Azure region (e.g., eastus, westeurope) | eastus |
No* |
LIVEKIT_URL |
LiveKit server WebSocket URL | ws://localhost:7880 |
No |
LIVEKIT_API_KEY |
LiveKit API key (for webhooks and token generation) | - | No*** |
LIVEKIT_API_SECRET |
LiveKit API secret (for webhooks and token generation) | - | No*** |
HOST |
Server bind address | 0.0.0.0 |
No |
PORT |
Server port | 3001 |
No |
AUTH_REQUIRED |
Enable authentication | false |
No |
AUTH_API_SECRETS_JSON |
API secrets JSON array ([{id, secret}]) |
- | Yes** |
AUTH_API_SECRET |
Legacy single API secret | - | No**** |
AUTH_API_SECRET_ID |
Legacy API secret id for AUTH_API_SECRET |
default |
No |
AUTH_SERVICE_URL |
External auth service endpoint | - | Yes** |
AUTH_SIGNING_KEY_PATH |
Path to JWT signing private key | - | Yes** |
AUTH_TIMEOUT_SECONDS |
Auth request timeout | 5 |
No |
*Not required when using audio-disabled mode
**Required when AUTH_REQUIRED=true for the auth method you choose
***Required for LiveKit webhook validation and token generation features
****Legacy single-secret fallback; prefer AUTH_API_SECRETS_JSON
SIP Configuration (Optional)
Sayna supports first-class SIP configuration for managing SIP-specific settings. See docs/sip_config.md for detailed documentation.
YAML Configuration:
sip: room_prefix: "sip-" allowed_addresses: - "192.168.1.0/24" - "10.0.0.1" hook_secret: "your-signing-secret" # Required if hooks configured hooks: - host: "example.com" url: "https://webhook.example.com/events"
Important: All SIP webhook forwarding requests are signed with HMAC-SHA256. You must configure hook_secret (or per-hook secret overrides) if using sip.hooks. See docs/livekit_webhook.md#webhook-signing for signature verification examples.
Environment Variables:
SIP_ROOM_PREFIX: Room name prefix for SIP calls (required if SIP enabled)SIP_ALLOWED_ADDRESSES: Comma-separated IP addresses/CIDRsSIP_HOOK_SECRET: Global signing secret for webhook authentication (min 16 chars)SIP_HOOKS_JSON: JSON array of webhook configurations (with optional per-hooksecretfield)
Performance Considerations
- DeepFilterNet: CPU-intensive processing uses thread pooling
- Audio Buffering: Optimized chunk processing for low latency
- Connection Reuse: Provider connections are maintained for efficiency
- Async Processing: Non-blocking WebSocket message handling
- Memory Management: Careful buffer management in audio loops
Contributing
-
Review the development rules in
.cursor/rules/:rust.mdc: Rust best practicescore.mdc: Business logic specificationsaxum.mdc: Framework patternslivekit.mdc: LiveKit integration details
-
Follow the existing code patterns and conventions
-
Add tests for new features
-
Ensure
cargo fmtandcargo clippypass
Support
For issues, questions, or contributions, please visit the GitHub repository.
