VoiceCraft - AI Voice Cloning

A Docker-based voice cloning application with four TTS engines, each running in its own isolated backend container for better dependency management.

Features

  • 🎙️ Record or Upload - Record reference audio directly in browser or upload existing files
  • 📝 Auto Transcription - AI-powered speech-to-text using OpenAI Whisper
  • 🔄 Multi-Engine Support - Choose between OpenVoice V2, XTTS v2, Qwen3-TTS, or Pocket-TTS
  • 🌍 Multi-language - Support for English, Chinese, Japanese, Korean, and more
  • ⬇️ Download Results - Export cloned audio as WAV files
  • 🐳 Docker Ready - Separate containers for each engine to avoid dependency conflicts

Architecture

📘 Detailed Architecture Documentation: See ARCHITECTURE.md for comprehensive system design, data flows, and deployment patterns.

The application consists of 6 separate Docker containers with a microservices architecture:

                        ┌─────────────────────────────────┐
                        │      Frontend (Nginx/React)     │
                        │              :8080              │
                        └────────────────┬────────────────┘
                                         │
            ┌────────────────────────────┼─────────────────────────────┐
            │                            │                             │
            ▼                            ▼                             ▼
┌───────────────────────┐    ┌────────────────────────┐   ┌──────────────────────────┐
│  Profile Management   │    │  TTS Engine Backends   │   │   TTS Engine Backends    │
│       Backend         │    │                        │   │                          │
│   (Whisper + DB)      │    │  - MeloTTS    :5000    │   │  - Qwen-TTS     :5002    │
│        :5100          │    │  - XTTS       :5001    │   │  - Pocket-TTS   :5003    │
└───────────┬───────────┘    └───────────┬────────────┘   └────────────┬─────────────┘
            │                            │                             │
    ┌───────┴────────┐          ┌────────┴───────────┐        ┌────────┴───────────┐
    │ • Upload Audio │          │ • Voice Cloning    │        │ • Voice Cloning    │
    │ • Transcription│          │ • Multi-language   │        │ • Multi-language   │
    │ • Profile CRUD │          │ • Audio Generation │        │ • Audio Generation │
    └────────────────┘          └────────────────────┘        └────────────────────┘

Why This Architecture?

Centralized Profile Management:

  • Single source of truth for voice profiles and audio files
  • Shared audio transcription service (OpenAI Whisper)
  • All TTS backends connect to profile backend for audio retrieval
  • Simplifies profile management and reduces code duplication

Isolated TTS Backends: Each TTS engine has different dependency requirements:

  • MeloTTS requires transformers==4.40.2 (has BeamSearchScorer)
  • XTTS requires transformers>=4.33.0 (installs 4.57.6, which removed BeamSearchScorer)
  • Qwen3-TTS uses the latest qwen-tts package with Python 3.12
  • Pocket-TTS uses pocket-tts with Python 3.12

By separating them into individual containers, each can have its optimal dependencies without conflicts.
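
Because each engine lives in its own image, you can verify the isolation from a running stack. A quick check, assuming the service names from docker-compose.yml and that pip is on the containers' PATH:

# Inspect the transformers version installed in each TTS container
docker compose exec backend-melotts pip show transformers   # expected: 4.40.2
docker compose exec backend-xtts pip show transformers      # expected: a newer 4.x release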

TTS Engines

Engine          License      Commercial Use   Quality     Speed       Languages        Container
OpenVoice V2    MIT          ✅ Yes           Good        Fast        6 languages      backend-melotts:5000
Coqui XTTS v2   CPML         ⚠️ No            Excellent   Slower      16+ languages    backend-xtts:5001
Qwen3-TTS       Apache 2.0   ✅ Yes           Excellent   Fast        10 languages     backend-qwentts:5002
Pocket-TTS      CC-BY-4.0    ✅ Yes           Good        Very Fast   English only     backend-pockettts:5003

OpenVoice V2 (MeloTTS Backend)

  • License: MIT - Free for commercial use
  • Languages: English, Spanish, French, Chinese, Japanese, Korean
  • Best for: Production use, commercial applications
  • Dependencies: transformers 4.40.2, MeloTTS, OpenVoice

Coqui XTTS v2 (XTTS Backend)

  • License: CPML - Non-commercial only
  • Languages: 16+ languages including German, Italian, Portuguese, Russian, Arabic
  • Best for: Research, personal projects, highest quality needs
  • Dependencies: transformers 4.57.6, TTS 0.22.0

Qwen3-TTS (Qwen-TTS Backend)

  • License: Apache 2.0 - Free for commercial use
  • Languages: 10 languages - Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, Italian
  • Voice Cloning: 3-second rapid voice clone from user audio
  • Best for: Production use, high-quality voice cloning, multilingual applications
  • Dependencies: Python 3.12, qwen-tts, torch
  • Model: Qwen3-TTS-12Hz-0.6B-Base (default, faster) or 1.7B-Base (higher quality)
  • Performance: ~10-20s per generation on CPU with 0.6B model, 30-60s with 1.7B model

Pocket-TTS (Pocket-TTS Backend)

  • License: CC-BY-4.0 - Free for commercial use
  • Languages: English only
  • Voice Cloning: Fast voice cloning from user audio
  • Best for: English-only applications, lightweight deployment, fast inference
  • Dependencies: Python 3.12, pocket-tts, torch 2.5+
  • Model: 100M parameters (very lightweight, CPU-optimized)
  • Performance: ~6x faster than real-time, ~200ms latency, only 2 CPU cores required

Quick Start

Prerequisites

  • Docker and Docker Compose
  • At least 4GB RAM for single backend (8GB for multiple backends)
  • For Pocket-TTS: Hugging Face account and token (see Pocket-TTS Setup)

Quick Deploy (Single Backend - Recommended)

Easy Mode - Using Helper Script:

# Clone or download this repository
cd VoiceCraft

# Start with Qwen3-TTS (recommended - fast, commercial use)
./run.sh qwentts

# Or Pocket-TTS (fastest, commercial use, English only)
./run.sh pockettts

# Or OpenVoice V2 (fast, commercial use)
./run.sh openvoice

# Or XTTS v2 (highest quality, non-commercial)
./run.sh xtts

# View logs
./run.sh logs

# Stop all
./run.sh stop

# Access the application
open http://localhost:8080

Manual Mode - Using Docker Compose:

# Option 1: Start with Qwen3-TTS (recommended - fast, commercial use)
docker compose up -d backend-profiles backend-qwentts frontend

# Option 2: Start with Pocket-TTS (fastest, commercial use, English only)
docker compose up -d backend-profiles backend-pockettts frontend

# Option 3: Start with OpenVoice V2 (fast, commercial use)
docker compose up -d backend-profiles backend-melotts frontend

# Option 4: Start with XTTS v2 (highest quality, non-commercial)
docker compose up -d backend-profiles backend-xtts frontend

# View logs
docker compose logs -f

# Access the application
open http://localhost:8080

Note: The backend-profiles service is required for all TTS backends as it handles audio uploads, transcription, and profile management. The frontend will automatically detect which TTS backends are running.
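
To see which backends the frontend will detect, you can probe the health endpoints (listed under API Endpoints below) through the frontend proxy; backends that are not running simply won't respond:

# Only running backends answer their health check
curl -s http://localhost:8080/api/health            # profile backend
curl -s http://localhost:8080/api/qwentts/health    # Qwen3-TTS, if started
curl -s http://localhost:8080/api/melotts/health    # OpenVoice V2, if started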

Deploy All Backends (For Comparison)

# Build and start all containers
docker compose build
docker compose up -d

# View logs
docker compose logs -f

# Access the application
open http://localhost:8080

Run Individual Backends (Recommended)

For most use cases, you only need one TTS backend at a time. The profile management backend is always required. The frontend will automatically detect which TTS backend is running:

# Option 1: OpenVoice V2 - Fast, commercial use OK
docker compose up -d backend-profiles backend-melotts frontend

# Option 2: XTTS v2 - Best quality, non-commercial only
docker compose up -d backend-profiles backend-xtts frontend

# Option 3: Qwen3-TTS - Fast voice cloning, commercial use OK
docker compose up -d backend-profiles backend-qwentts frontend

# Option 4: Pocket-TTS - Fastest, commercial use OK, English only
docker compose up -d backend-profiles backend-pockettts frontend

Run All Backends (For Comparison)

If you want to test and compare all TTS engines:

# Start all backends (profile + all TTS engines + frontend)
docker compose up -d

# Or build first if needed
docker compose build
docker compose up -d

Switch Between Backends

# Stop current TTS backend
docker compose stop backend-melotts

# Start different TTS backend
docker compose up -d backend-qwentts

# The frontend will auto-detect the change
# Note: backend-profiles stays running and doesn't need to be restarted

Usage

  1. Open http://localhost:8080 in your browser
  2. Create a Voice Profile:
    • Click "Create New Profile" or upload reference audio
    • Optionally transcribe the audio using Whisper to see what was said
    • Save the profile with a descriptive name
  3. Select TTS Engine:
    • The UI will show only the running TTS backends (auto-detected)
    • If multiple backends are running, select your preferred TTS engine
  4. Clone Voice:
    • Select a saved voice profile from your profile library
    • Enter the text you want the cloned voice to say
    • Select the language
    • Click "Clone Voice" to generate the output
  5. Download or Play:
    • Listen to the cloned audio in the browser
    • Download the output as a WAV file

Managing Backends

Check Running Backends
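
To see which services are currently up:

docker compose ps    # lists backend-profiles, any running TTS backends, and the frontend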

Stop a Backend

# Stop specific TTS backend
docker compose stop backend-melotts
docker compose stop backend-xtts
docker compose stop backend-qwentts
docker compose stop backend-pockettts

# Stop profile backend (stops all profile management)
docker compose stop backend-profiles

# Or stop all
docker compose down

Switch TTS Backends

# Stop current TTS backend
docker compose stop backend-qwentts

# Start different TTS backend
docker compose up -d backend-melotts

# Refresh browser - UI will auto-update
# Note: backend-profiles keeps running and retains all profiles

View Logs

# Profile backend
docker compose logs -f backend-profiles

# Specific TTS backend
docker compose logs -f backend-qwentts

# All services
docker compose logs -f

API Endpoints

Profile Management Backend (/api/profiles)

GET  /api/profiles                    # List all voice profiles
POST /api/profiles                    # Create new profile (upload audio)
GET  /api/profiles/:id                # Get profile details
PUT  /api/profiles/:id                # Update profile (name/transcription)
DELETE /api/profiles/:id              # Delete profile
POST /api/profiles/:id/transcribe     # Transcribe profile audio with Whisper
GET  /api/profiles/:id/audio          # Stream profile audio
GET  /api/health                      # Health check
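
For example, to list saved profiles through the frontend proxy:

curl -s http://localhost:8080/api/profiles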

MeloTTS Backend (/api/melotts)

POST /api/melotts/clone           # Clone voice with OpenVoice (requires profile_id)
GET  /api/melotts/audio/:id       # Stream generated audio
GET  /api/melotts/download/:id    # Download generated audio
GET  /api/melotts/health          # Health check

XTTS Backend (/api/xtts)

POST /api/xtts/clone              # Clone voice with XTTS (requires profile_id)
GET  /api/xtts/audio/:id          # Stream generated audio
GET  /api/xtts/download/:id       # Download generated audio
GET  /api/xtts/health             # Health check

Qwen-TTS Backend (/api/qwentts)

POST /api/qwentts/clone           # Clone voice with Qwen3-TTS (requires profile_id)
GET  /api/qwentts/audio/:id       # Stream generated audio
GET  /api/qwentts/download/:id    # Download generated audio
GET  /api/qwentts/health          # Health check

Pocket-TTS Backend (/api/pockettts)

POST /api/pockettts/clone         # Clone voice with Pocket-TTS (requires profile_id)
GET  /api/pockettts/audio/:id     # Stream generated audio
GET  /api/pockettts/download/:id  # Download generated audio
GET  /api/pockettts/health        # Health check

API Usage Examples

# Step 1: Create a voice profile (upload reference audio)
curl -X POST http://localhost:8080/api/profiles \
  -F "audio=@reference_voice.wav" \
  -F "name=John's Voice"

# Response: { "success": true, "profile_id": "abc123", ... }

# Step 2: Optionally transcribe the audio
curl -X POST http://localhost:8080/api/profiles/abc123/transcribe

# Step 3: Clone voice using the profile
# Using MeloTTS (OpenVoice V2)
curl -X POST http://localhost:8080/api/melotts/clone \
  -H "Content-Type: application/json" \
  -d '{
    "profile_id": "abc123",
    "text": "Hello, this is a test.",
    "language": "en"
  }'

# Using XTTS
curl -X POST http://localhost:8080/api/xtts/clone \
  -H "Content-Type: application/json" \
  -d '{
    "profile_id": "abc123",
    "text": "Hello, this is a test.",
    "language": "en"
  }'

# Using Qwen3-TTS
curl -X POST http://localhost:8080/api/qwentts/clone \
  -H "Content-Type: application/json" \
  -d '{
    "profile_id": "abc123",
    "text": "Hello, this is a test.",
    "language": "en",
    "ref_text": "Optional transcript of reference audio"
  }'

# Using Pocket-TTS
curl -X POST http://localhost:8080/api/pockettts/clone \
  -H "Content-Type: application/json" \
  -d '{
    "profile_id": "abc123",
    "text": "Hello, this is a test.",
    "language": "en"
  }'
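
# Step 4: Download the generated audio via the engine's download endpoint
# (the clone response is assumed to include an id for the generated audio;
#  "xyz789" below is illustrative - use the id returned by your clone call)
curl -o cloned.wav http://localhost:8080/api/qwentts/download/xyz789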

Configuration

Environment Variables

backend-profiles:

  • Manages audio uploads, transcription, and profile database
  • Uses OpenAI Whisper for transcription (lazy-loaded on first use)

backend-melotts:

  • PRELOAD_MODELS=false - Set to true to preload models on startup
  • USE_GPU=false - Enable GPU acceleration (requires GPU docker setup)
  • PROFILE_BACKEND_URL=http://backend-profiles:5100 - Profile backend URL

backend-xtts:

  • PRELOAD_MODELS=false - Set to true to preload models on startup
  • USE_GPU=false - Enable GPU acceleration
  • COQUI_TOS_AGREED=1 - Required to use Coqui TTS (auto-set)
  • PROFILE_BACKEND_URL=http://backend-profiles:5100 - Profile backend URL

backend-qwentts:

  • PRELOAD_MODELS=false - Set to true to preload models on startup
  • USE_GPU=false - Enable GPU acceleration (requires GPU docker setup)
  • QWEN_MODEL=Qwen/Qwen3-TTS-12Hz-0.6B-Base - Model to use (0.6B is faster, 1.7B is higher quality)
  • PROFILE_BACKEND_URL=http://backend-profiles:5100 - Profile backend URL

backend-pockettts:

  • PRELOAD_MODELS=false - Set to true to preload models on startup
  • No GPU support needed (CPU-optimized)
  • PROFILE_BACKEND_URL=http://backend-profiles:5100 - Profile backend URL
  • HF_TOKEN=your_token - REQUIRED - Hugging Face token for model access (see Pocket-TTS Setup)
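
A minimal sketch for overriding these settings at startup, assuming docker-compose.yml forwards the variables from your shell or .env file (as it does for HF_TOKEN); otherwise set them under the service's environment: block in docker-compose.yml:

# Example: preload models and select the smaller Qwen model before starting
export PRELOAD_MODELS=true
export QWEN_MODEL=Qwen/Qwen3-TTS-12Hz-0.6B-Base
docker compose up -d backend-profiles backend-qwentts frontend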

Resource Requirements

Container            RAM Limit   RAM Reserved   First Request Load Time
backend-profiles     4GB         2GB            ~10-20 seconds (Whisper lazy-load)
backend-melotts      4GB         2GB            ~30-60 seconds
backend-xtts         8GB         4GB            ~2-3 minutes
backend-qwentts      8GB         4GB            ~1-2 minutes (0.6B model)
backend-pockettts    4GB         2GB            ~30-60 seconds
frontend             Minimal     Minimal        Instant

Recommended: Run only one TTS backend at a time (plus the required profile backend) to minimize resource usage. The frontend auto-detects available backends.
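
To compare actual memory use against these limits while the stack is running:

docker stats --no-stream    # one-shot CPU/RAM snapshot for all running containers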

Project Structure

VoiceCraft/
├── backend-profiles/     # Profile management backend (Port 5100)
│   ├── Dockerfile        # Container with Whisper and FFmpeg
│   └── app.py            # Flask API for profile CRUD and transcription
├── backend-melotts/      # OpenVoice V2 backend (Port 5000)
│   ├── Dockerfile        # Container with transformers 4.40.2
│   └── app.py            # Flask API for MeloTTS voice cloning
├── backend-xtts/         # XTTS v2 backend (Port 5001)
│   ├── Dockerfile        # Container with transformers 4.57.6
│   └── app.py            # Flask API for XTTS voice cloning
├── backend-qwentts/      # Qwen3-TTS backend (Port 5002)
│   ├── Dockerfile        # Container with Python 3.12 and qwen-tts
│   └── app.py            # Flask API for Qwen3-TTS voice cloning
├── backend-pockettts/    # Pocket-TTS backend (Port 5003)
│   ├── Dockerfile        # Container with Python 3.12 and pocket-tts
│   └── app.py            # Flask API for Pocket-TTS voice cloning
├── frontend/             # React frontend (Port 8080)
│   ├── Dockerfile
│   ├── nginx.conf        # Routes to profile backend and TTS backends
│   └── index.html        # SPA with profile management and TTS engine selector
├── docker-compose.yml    # Multi-container orchestration
└── README.md

Supported Languages

OpenVoice V2 (MeloTTS Backend)

English (en), Spanish (es), French (fr), Chinese (zh), Japanese (ja), Korean (ko)

XTTS v2 (Additional Languages)

All above plus: German (de), Italian (it), Portuguese (pt), Polish (pl), Turkish (tr), Russian (ru), Dutch (nl), Czech (cs), Arabic (ar), Hungarian (hu)

Qwen3-TTS (Qwen-TTS Backend)

Chinese (zh), English (en), Japanese (ja), Korean (ko), German (de), French (fr), Russian (ru), Portuguese (pt), Spanish (es), Italian (it)

Pocket-TTS (Pocket-TTS Backend)

English (en) only

Pocket-TTS Setup

Pocket-TTS requires Hugging Face authentication to access the voice cloning model.

Step 1: Accept Model Terms

Visit https://huggingface.co/kyutai/pocket-tts and accept the terms of use.

Step 2: Create Hugging Face Token

  1. Go to https://huggingface.co/settings/tokens
  2. Create a new token (read access is sufficient)
  3. Copy the token

Step 3: Configure Token

Choose one of these methods:

Method A: Using .env file (Recommended)

# In the project root directory
cp .env.example .env
# Edit .env and add your token:
HF_TOKEN=your_huggingface_token_here

Method B: Export as environment variable

export HF_TOKEN=your_huggingface_token_here
docker compose up -d backend-pockettts

Method C: Add directly to docker-compose.yml

backend-pockettts:
  environment:
    - HF_TOKEN=your_huggingface_token_here

Step 4: Start the Service

docker compose up -d backend-profiles backend-pockettts frontend

For more details, see backend-pockettts/README.md.

Troubleshooting

MeloTTS fails with "cannot import name 'BeamSearchScorer'"

This means transformers is too new. The MeloTTS container pins it to 4.40.2. Rebuild the container:

docker compose down
docker compose build --no-cache backend-melotts
docker compose up -d

XTTS fails to load or runs out of memory

XTTS requires more memory. Ensure Docker has at least 8GB RAM allocated:

  • Docker Desktop: Settings → Resources → Memory → 8GB+

Containers fail to start

Check logs for each service:

docker compose logs backend-profiles
docker compose logs backend-melotts
docker compose logs backend-xtts
docker compose logs backend-qwentts
docker compose logs backend-pockettts
docker compose logs frontend

Profile upload fails

  • Check that backend-profiles is running: docker compose ps backend-profiles
  • Check available disk space for audio storage
  • Ensure audio file is in a supported format (wav, mp3, ogg, m4a, webm)
  • Check logs: docker compose logs -f backend-profiles

First request is very slow

Models are loaded on-demand:

  • Profile Backend (Whisper): First transcription takes ~10-20 seconds to load model
  • TTS Backends: First clone request takes 1-3 minutes to download and load models
  • Subsequent requests are much faster

To preload models on startup (slower startup but faster first request):

# In docker-compose.yml
environment:
  - PRELOAD_MODELS=true

Transcription fails or is slow

  • Whisper model is loaded on first transcription request
  • First transcription takes longer (~10-20s for model load + transcription time)
  • Ensure backend-profiles has adequate memory (4GB limit configured)
  • Check logs: docker compose logs -f backend-profiles

Audio quality is poor

  • Use longer reference audio (10-30 seconds)
  • Ensure reference audio is clear with minimal background noise
  • Try XTTS backend for better quality (at the cost of speed)
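
If ffmpeg is available locally, a reference clip can be cleaned up before uploading. A sketch only; the mono downmix, 22.05 kHz sample rate, and 30-second cut are reasonable defaults, not requirements of the engines:

# Convert to a mono WAV and keep the first 30 seconds of the reference recording
ffmpeg -i raw_recording.m4a -ac 1 -ar 22050 -t 30 reference_voice.wav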

Pocket-TTS voice cloning not available

If you see "Voice cloning is not available" error:

  1. Ensure you've accepted the terms at https://huggingface.co/kyutai/pocket-tts
  2. Create a Hugging Face token at https://huggingface.co/settings/tokens
  3. Configure HF_TOKEN environment variable (see Pocket-TTS Setup)
  4. Restart the container: docker compose restart backend-pockettts
  5. Check logs: docker compose logs backend-pockettts

Qwen-TTS is very slow on Mac/CPU

The 1.7B model is large and slow on CPU. To improve performance:

Option 1: Use the smaller 0.6B model (RECOMMENDED)

# In docker-compose.yml, set:
environment:
  - QWEN_MODEL=Qwen/Qwen3-TTS-12Hz-0.6B-Base  # 3x faster, still good quality

Option 2: Switch to 1.7B for best quality (slower)

environment:
  - QWEN_MODEL=Qwen/Qwen3-TTS-12Hz-1.7B-Base  # Best quality, slower

Performance comparison on M2 Pro Mac (CPU):

  • 0.6B model: ~10-20 seconds per generation
  • 1.7B model: ~30-60 seconds per generation

Note: Docker containers cannot access Mac GPU (Metal/MPS), so inference runs on CPU only.
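
To confirm whether PyTorch sees a GPU inside the container (it should report False on Mac, per the note above; assumes python is on the container's PATH):

docker compose exec backend-qwentts python -c "import torch; print(torch.cuda.is_available())"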

Development

Local Development (without Docker)

# MeloTTS backend
cd backend-melotts
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
# Install dependencies manually (see Dockerfile)
python app.py

# XTTS backend
cd backend-xtts
python -m venv venv
source venv/bin/activate
# Install dependencies manually (see Dockerfile)
python app.py

# Frontend
cd frontend
# Serve with any static file server
python -m http.server 8080

License

This project is MIT licensed.

Component Licenses

  • Frontend: MIT
  • OpenVoice V2 (MeloTTS backend): MIT - Free for commercial use
  • Coqui XTTS v2 (XTTS backend): CPML - Non-commercial use only
  • Qwen3-TTS (Qwen-TTS backend): Apache 2.0 - Free for commercial use
  • Pocket-TTS (Pocket-TTS backend): CC-BY-4.0 - Free for commercial use

⚠️ Important: If you use the XTTS backend, you must comply with Coqui's CPML license which restricts commercial use.

Credits

  • OpenVoice - MIT licensed voice cloning
  • MeloTTS - High-quality TTS
  • Coqui TTS - XTTS v2 voice cloning model
  • Qwen3-TTS - Apache 2.0 licensed voice cloning from Alibaba Cloud
  • Pocket-TTS - CC-BY-4.0 licensed lightweight voice cloning from Kyutai
  • OpenAI Whisper - Speech recognition
  • Built with Flask, React, Docker, and Nginx