VoiceCraft - AI Voice Cloning
A Docker-based voice cloning application with four separate TTS engines, each in its own isolated backend container for better dependency management.
Features
- Record or Upload - Record reference audio directly in the browser or upload existing files
- Auto Transcription - AI-powered speech-to-text using OpenAI Whisper
- Multi-Engine Support - Choose between OpenVoice V2, XTTS v2, Qwen3-TTS, or Pocket-TTS
- Multi-language - Support for English, Chinese, Japanese, Korean, and more
- Download Results - Export cloned audio as WAV files
- Docker Ready - Separate containers for each engine to avoid dependency conflicts
Architecture
Detailed Architecture Documentation: See ARCHITECTURE.md for comprehensive system design, data flows, and deployment patterns.
The application consists of 6 separate Docker containers with a microservices architecture:
```
                    ┌───────────────────────────┐
                    │  Frontend (Nginx/React)   │
                    │          :8080            │
                    └─────────────┬─────────────┘
                                  │
          ┌───────────────────────┼───────────────────────┐
          ▼                       ▼                       ▼
┌──────────────────────┐ ┌─────────────────────┐ ┌──────────────────────┐
│  Profile Management  │ │ TTS Engine Backends │ │ TTS Engine Backends  │
│  Backend             │ │                     │ │                      │
│  (Whisper + DB)      │ │ - MeloTTS :5000     │ │ - Qwen-TTS   :5002   │
│  :5100               │ │ - XTTS    :5001     │ │ - Pocket-TTS :5003   │
└──────────────────────┘ └─────────────────────┘ └──────────────────────┘
  • Upload Audio           • Voice Cloning         • Voice Cloning
  • Transcription          • Multi-language        • Multi-language
  • Profile CRUD           • Audio Generation      • Audio Generation
```
Why This Architecture?
Centralized Profile Management:
- Single source of truth for voice profiles and audio files
- Shared audio transcription service (OpenAI Whisper)
- All TTS backends connect to profile backend for audio retrieval
- Simplifies profile management and reduces code duplication
Isolated TTS Backends: Each TTS engine has different dependency requirements:
- MeloTTS requires `transformers==4.40.2` (has `BeamSearchScorer`)
- XTTS requires `transformers>=4.33.0` (installs 4.57.6, which removed `BeamSearchScorer`)
- Qwen3-TTS uses the latest qwen-tts package with Python 3.12
- Pocket-TTS uses pocket-tts with Python 3.12
By separating them into individual containers, each can have its optimal dependencies without conflicts.
TTS Engines
| Engine | License | Commercial Use | Quality | Speed | Languages | Container |
|---|---|---|---|---|---|---|
| OpenVoice V2 | MIT | Yes | Good | Fast | 6 languages | backend-melotts:5000 |
| Coqui XTTS v2 | CPML | No | Excellent | Slower | 16+ languages | backend-xtts:5001 |
| Qwen3-TTS | Apache 2.0 | Yes | Excellent | Fast | 10 languages | backend-qwentts:5002 |
| Pocket-TTS | CC-BY-4.0 | Yes | Good | Very Fast | English only | backend-pockettts:5003 |
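When scripting against the stack, the engine-to-route mapping in the table above can be captured in a small lookup. This is purely illustrative glue on our part (the `ENGINES` dict and `clone_url` helper are not part of the application); the path prefixes and ports come from the table and the API Endpoints section:

```python
# Engine -> (API path prefix, container port), as listed in the table above.
# Illustrative only; the application itself does no such lookup.
ENGINES = {
    "melotts":   ("/api/melotts",   5000),   # OpenVoice V2
    "xtts":      ("/api/xtts",      5001),   # Coqui XTTS v2
    "qwentts":   ("/api/qwentts",   5002),   # Qwen3-TTS
    "pockettts": ("/api/pockettts", 5003),   # Pocket-TTS
}

def clone_url(engine: str, base: str = "http://localhost:8080") -> str:
    """Build the /clone endpoint URL for an engine, routed via the frontend proxy."""
    prefix, _port = ENGINES[engine]
    return f"{base}{prefix}/clone"
```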
OpenVoice V2 (MeloTTS Backend)
- License: MIT - Free for commercial use
- Languages: English, Spanish, French, Chinese, Japanese, Korean
- Best for: Production use, commercial applications
- Dependencies: transformers 4.40.2, MeloTTS, OpenVoice
Coqui XTTS v2 (XTTS Backend)
- License: CPML - Non-commercial only
- Languages: 16+ languages including German, Italian, Portuguese, Russian, Arabic
- Best for: Research, personal projects, highest quality needs
- Dependencies: transformers 4.57.6, TTS 0.22.0
Qwen3-TTS (Qwen-TTS Backend)
- License: Apache 2.0 - Free for commercial use
- Languages: 10 languages - Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, Italian
- Voice Cloning: 3-second rapid voice clone from user audio
- Best for: Production use, high-quality voice cloning, multilingual applications
- Dependencies: Python 3.12, qwen-tts, torch
- Model: Qwen3-TTS-12Hz-0.6B-Base (default, faster) or 1.7B-Base (higher quality)
- Performance: ~10-20s per generation on CPU with 0.6B model, 30-60s with 1.7B model
Pocket-TTS (Pocket-TTS Backend)
- License: CC-BY-4.0 - Free for commercial use
- Languages: English only
- Voice Cloning: Fast voice cloning from user audio
- Best for: English-only applications, lightweight deployment, fast inference
- Dependencies: Python 3.12, pocket-tts, torch 2.5+
- Model: 100M parameters (very lightweight, CPU-optimized)
- Performance: ~6x faster than real-time, ~200ms latency, only 2 CPU cores required
Quick Start
Prerequisites
- Docker and Docker Compose
- At least 4GB RAM for single backend (8GB for multiple backends)
- For Pocket-TTS: Hugging Face account and token (see Pocket-TTS Setup)
Quick Deploy (Single Backend - Recommended)
Easy Mode - Using Helper Script:
```bash
# Clone or download this repository
cd VoiceCraft

# Start with Qwen3-TTS (recommended - fast, commercial use)
./run.sh qwentts

# Or Pocket-TTS (fastest, commercial use, English only)
./run.sh pockettts

# Or OpenVoice V2 (fast, commercial use)
./run.sh openvoice

# Or XTTS v2 (highest quality, non-commercial)
./run.sh xtts

# View logs
./run.sh logs

# Stop all
./run.sh stop

# Access the application
open http://localhost:8080
```
Manual Mode - Using Docker Compose:
```bash
# Option 1: Start with Qwen3-TTS (recommended - fast, commercial use)
docker compose up -d backend-profiles backend-qwentts frontend

# Option 2: Start with Pocket-TTS (fastest, commercial use, English only)
docker compose up -d backend-profiles backend-pockettts frontend

# Option 3: Start with OpenVoice V2 (fast, commercial use)
docker compose up -d backend-profiles backend-melotts frontend

# Option 4: Start with XTTS v2 (highest quality, non-commercial)
docker compose up -d backend-profiles backend-xtts frontend

# View logs
docker compose logs -f

# Access the application
open http://localhost:8080
```
Note: The backend-profiles service is required for all TTS backends as it handles audio uploads, transcription, and profile management. The frontend will automatically detect which TTS backends are running.
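The frontend detects running backends by probing their health endpoints. The sketch below shows the same idea from Python, using only the `/health` routes listed under API Endpoints; the `detect_backends` helper is our illustration, not the frontend's actual code:

```python
import urllib.request
import urllib.error

# Health endpoints routed through the frontend proxy (paths from the
# "API Endpoints" section). Which ones answer depends on what you started.
HEALTH_PATHS = {
    "melotts":   "/api/melotts/health",
    "xtts":      "/api/xtts/health",
    "qwentts":   "/api/qwentts/health",
    "pockettts": "/api/pockettts/health",
}

def detect_backends(base: str = "http://localhost:8080", timeout: float = 2.0) -> list:
    """Return the engines whose health endpoint responds with HTTP 200."""
    up = []
    for engine, path in HEALTH_PATHS.items():
        try:
            with urllib.request.urlopen(base + path, timeout=timeout) as resp:
                if resp.status == 200:
                    up.append(engine)
        except (urllib.error.URLError, OSError):
            pass  # backend not running (connection refused) or proxy error
    return up
```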
Deploy All Backends (For Comparison)
```bash
# Build and start all containers
docker compose build
docker compose up -d

# View logs
docker compose logs -f

# Access the application
open http://localhost:8080
```
Run Individual Backends (Recommended)
For most use cases, you only need one TTS backend at a time. The profile management backend is always required. The frontend will automatically detect which TTS backend is running:
```bash
# Option 1: OpenVoice V2 - Fast, commercial use OK
docker compose up -d backend-profiles backend-melotts frontend

# Option 2: XTTS v2 - Best quality, non-commercial only
docker compose up -d backend-profiles backend-xtts frontend

# Option 3: Qwen3-TTS - Fast voice cloning, commercial use OK
docker compose up -d backend-profiles backend-qwentts frontend

# Option 4: Pocket-TTS - Fastest, commercial use OK, English only
docker compose up -d backend-profiles backend-pockettts frontend
```
Run All Backends (For Comparison)
If you want to test and compare all TTS engines:
```bash
# Start all backends (profile + all TTS engines + frontend)
docker compose up -d

# Or build first if needed
docker compose build
docker compose up -d
```
Switch Between Backends
```bash
# Stop current TTS backend
docker compose stop backend-melotts

# Start different TTS backend
docker compose up -d backend-qwentts

# The frontend will auto-detect the change
# Note: backend-profiles stays running and doesn't need to be restarted
```
Usage
- Open http://localhost:8080 in your browser
- Create a Voice Profile:
- Click "Create New Profile" or upload reference audio
- Optionally transcribe the audio using Whisper to see what was said
- Save the profile with a descriptive name
- Select TTS Engine:
- The UI will show only the running TTS backends (auto-detected)
- If multiple backends are running, select your preferred TTS engine
- Clone Voice:
- Select a saved voice profile from your profile library
- Enter the text you want the cloned voice to say
- Select the language
- Click "Clone Voice" to generate the output
- Download or Play:
- Listen to the cloned audio in the browser
- Download the output as a WAV file
Managing Backends
Check Running Backends
Use `docker compose ps` to see which services (including which TTS backends) are currently up.
Stop a Backend
```bash
# Stop specific TTS backend
docker compose stop backend-melotts
docker compose stop backend-xtts
docker compose stop backend-qwentts
docker compose stop backend-pockettts

# Stop profile backend (stops all profile management)
docker compose stop backend-profiles

# Or stop all
docker compose down
```
Switch TTS Backends
```bash
# Stop current TTS backend
docker compose stop backend-qwentts

# Start different TTS backend
docker compose up -d backend-melotts

# Refresh browser - UI will auto-update
# Note: backend-profiles keeps running and retains all profiles
```
View Logs
```bash
# Profile backend
docker compose logs -f backend-profiles

# Specific TTS backend
docker compose logs -f backend-qwentts

# All services
docker compose logs -f
```
API Endpoints
Profile Management Backend (/api/profiles)
```
GET    /api/profiles                  # List all voice profiles
POST   /api/profiles                  # Create new profile (upload audio)
GET    /api/profiles/:id              # Get profile details
PUT    /api/profiles/:id              # Update profile (name/transcription)
DELETE /api/profiles/:id              # Delete profile
POST   /api/profiles/:id/transcribe   # Transcribe profile audio with Whisper
GET    /api/profiles/:id/audio        # Stream profile audio
GET    /api/health                    # Health check
```
MeloTTS Backend (/api/melotts)
```
POST /api/melotts/clone          # Clone voice with OpenVoice (requires profile_id)
GET  /api/melotts/audio/:id      # Stream generated audio
GET  /api/melotts/download/:id   # Download generated audio
GET  /api/melotts/health         # Health check
```
XTTS Backend (/api/xtts)
```
POST /api/xtts/clone          # Clone voice with XTTS (requires profile_id)
GET  /api/xtts/audio/:id      # Stream generated audio
GET  /api/xtts/download/:id   # Download generated audio
GET  /api/xtts/health         # Health check
```
Qwen-TTS Backend (/api/qwentts)
```
POST /api/qwentts/clone          # Clone voice with Qwen3-TTS (requires profile_id)
GET  /api/qwentts/audio/:id      # Stream generated audio
GET  /api/qwentts/download/:id   # Download generated audio
GET  /api/qwentts/health         # Health check
```
Pocket-TTS Backend (/api/pockettts)
```
POST /api/pockettts/clone          # Clone voice with Pocket-TTS (requires profile_id)
GET  /api/pockettts/audio/:id      # Stream generated audio
GET  /api/pockettts/download/:id   # Download generated audio
GET  /api/pockettts/health         # Health check
```
API Usage Examples
```bash
# Step 1: Create a voice profile (upload reference audio)
curl -X POST http://localhost:8080/api/profiles \
  -F "audio=@reference_voice.wav" \
  -F "name=John's Voice"
# Response: { "success": true, "profile_id": "abc123", ... }

# Step 2: Optionally transcribe the audio
curl -X POST http://localhost:8080/api/profiles/abc123/transcribe

# Step 3: Clone voice using the profile

# Using MeloTTS (OpenVoice V2)
curl -X POST http://localhost:8080/api/melotts/clone \
  -H "Content-Type: application/json" \
  -d '{ "profile_id": "abc123", "text": "Hello, this is a test.", "language": "en" }'

# Using XTTS
curl -X POST http://localhost:8080/api/xtts/clone \
  -H "Content-Type: application/json" \
  -d '{ "profile_id": "abc123", "text": "Hello, this is a test.", "language": "en" }'

# Using Qwen3-TTS
curl -X POST http://localhost:8080/api/qwentts/clone \
  -H "Content-Type: application/json" \
  -d '{ "profile_id": "abc123", "text": "Hello, this is a test.", "language": "en", "ref_text": "Optional transcript of reference audio" }'

# Using Pocket-TTS
curl -X POST http://localhost:8080/api/pockettts/clone \
  -H "Content-Type: application/json" \
  -d '{ "profile_id": "abc123", "text": "Hello, this is a test.", "language": "en" }'
```
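The same workflow can be scripted without curl. The sketch below only builds the request object; the endpoint path and JSON fields mirror the curl examples above, while the `clone_request` helper itself is our own illustration:

```python
import json
import urllib.request
from typing import Optional

BASE = "http://localhost:8080"  # frontend proxy, as in the curl examples

def clone_request(engine: str, profile_id: str, text: str,
                  language: str = "en",
                  ref_text: Optional[str] = None) -> urllib.request.Request:
    """Build a POST /api/<engine>/clone request matching the curl examples above."""
    payload = {"profile_id": profile_id, "text": text, "language": language}
    if ref_text is not None:  # only the Qwen3-TTS example passes a reference transcript
        payload["ref_text"] = ref_text
    return urllib.request.Request(
        f"{BASE}/api/{engine}/clone",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# To actually send it (requires the stack to be running):
# with urllib.request.urlopen(clone_request("qwentts", "abc123", "Hello!")) as resp:
#     result = json.load(resp)
```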
Configuration
Environment Variables
backend-profiles:
- Manages audio uploads, transcription, and profile database
- Uses OpenAI Whisper for transcription (lazy-loaded on first use)
backend-melotts:
- `PRELOAD_MODELS=false` - Set to `true` to preload models on startup
- `USE_GPU=false` - Enable GPU acceleration (requires GPU docker setup)
- `PROFILE_BACKEND_URL=http://backend-profiles:5100` - Profile backend URL
backend-xtts:
- `PRELOAD_MODELS=false` - Set to `true` to preload models on startup
- `USE_GPU=false` - Enable GPU acceleration
- `COQUI_TOS_AGREED=1` - Required to use Coqui TTS (auto-set)
- `PROFILE_BACKEND_URL=http://backend-profiles:5100` - Profile backend URL
backend-qwentts:
- `PRELOAD_MODELS=false` - Set to `true` to preload models on startup
- `USE_GPU=false` - Enable GPU acceleration (requires GPU docker setup)
- `QWEN_MODEL=Qwen/Qwen3-TTS-12Hz-0.6B-Base` - Model to use (0.6B is faster, 1.7B is higher quality)
- `PROFILE_BACKEND_URL=http://backend-profiles:5100` - Profile backend URL
backend-pockettts:
- `PRELOAD_MODELS=false` - Set to `true` to preload models on startup
- No GPU support needed (CPU-optimized)
- `PROFILE_BACKEND_URL=http://backend-profiles:5100` - Profile backend URL
- `HF_TOKEN=your_token` - REQUIRED - Hugging Face token for model access (see Pocket-TTS Setup)
Resource Requirements
| Container | RAM Limit | RAM Reserved | First Request Load Time |
|---|---|---|---|
| backend-profiles | 4GB | 2GB | ~10-20 seconds (Whisper lazy-load) |
| backend-melotts | 4GB | 2GB | ~30-60 seconds |
| backend-xtts | 8GB | 4GB | ~2-3 minutes |
| backend-qwentts | 8GB | 4GB | ~1-2 minutes (0.6B model) |
| backend-pockettts | 4GB | 2GB | ~30-60 seconds |
| frontend | Minimal | Minimal | Instant |
Recommended: Run only one TTS backend at a time (plus the required profile backend) to minimize resource usage. The frontend auto-detects available backends.
Project Structure
```
VoiceCraft/
├── backend-profiles/        # Profile management backend (Port 5100)
│   ├── Dockerfile           # Container with Whisper and FFmpeg
│   └── app.py               # Flask API for profile CRUD and transcription
├── backend-melotts/         # OpenVoice V2 backend (Port 5000)
│   ├── Dockerfile           # Container with transformers 4.40.2
│   └── app.py               # Flask API for MeloTTS voice cloning
├── backend-xtts/            # XTTS v2 backend (Port 5001)
│   ├── Dockerfile           # Container with transformers 4.57.6
│   └── app.py               # Flask API for XTTS voice cloning
├── backend-qwentts/         # Qwen3-TTS backend (Port 5002)
│   ├── Dockerfile           # Container with Python 3.12 and qwen-tts
│   └── app.py               # Flask API for Qwen3-TTS voice cloning
├── backend-pockettts/       # Pocket-TTS backend (Port 5003)
│   ├── Dockerfile           # Container with Python 3.12 and pocket-tts
│   └── app.py               # Flask API for Pocket-TTS voice cloning
├── frontend/                # React frontend (Port 8080)
│   ├── Dockerfile
│   ├── nginx.conf           # Routes to profile backend and TTS backends
│   └── index.html           # SPA with profile management and TTS engine selector
├── docker-compose.yml       # Multi-container orchestration
└── README.md
```
Supported Languages
OpenVoice V2 (MeloTTS Backend)
English (en), Spanish (es), French (fr), Chinese (zh), Japanese (ja), Korean (ko)
XTTS v2 (Additional Languages)
All above plus: German (de), Italian (it), Portuguese (pt), Polish (pl), Turkish (tr), Russian (ru), Dutch (nl), Czech (cs), Arabic (ar), Hungarian (hu)
Qwen3-TTS (Qwen-TTS Backend)
Chinese (zh), English (en), Japanese (ja), Korean (ko), German (de), French (fr), Russian (ru), Portuguese (pt), Spanish (es), Italian (it)
Pocket-TTS (Pocket-TTS Backend)
English (en) only
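If you script against multiple engines, it can help to validate a language code before calling the API. The mapping below simply restates the lists above; the `SUPPORTED` dict and `supports` helper are our own convenience, not part of the application:

```python
# Language codes per engine, restating the "Supported Languages" lists above.
SUPPORTED = {
    "melotts":   {"en", "es", "fr", "zh", "ja", "ko"},
    "xtts":      {"en", "es", "fr", "zh", "ja", "ko",
                  "de", "it", "pt", "pl", "tr", "ru", "nl", "cs", "ar", "hu"},
    "qwentts":   {"zh", "en", "ja", "ko", "de", "fr", "ru", "pt", "es", "it"},
    "pockettts": {"en"},
}

def supports(engine: str, lang: str) -> bool:
    """True if the given engine lists the language code as supported."""
    return lang in SUPPORTED.get(engine, set())
```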
Pocket-TTS Setup
Pocket-TTS requires Hugging Face authentication to access the voice cloning model.
Step 1: Accept Model Terms
Visit https://huggingface.co/kyutai/pocket-tts and accept the terms of use.
Step 2: Create Hugging Face Token
- Go to https://huggingface.co/settings/tokens
- Create a new token (read access is sufficient)
- Copy the token
Step 3: Configure Token
Choose one of these methods:
Method A: Using .env file (Recommended)
```bash
# In the project root directory
cp .env.example .env

# Edit .env and add your token:
HF_TOKEN=your_huggingface_token_here
```
Method B: Export as environment variable
```bash
export HF_TOKEN=your_huggingface_token_here
docker compose up -d backend-pockettts
```

Method C: Add directly to docker-compose.yml

```yaml
backend-pockettts:
  environment:
    - HF_TOKEN=your_huggingface_token_here
```
Step 4: Start the Service
docker compose up -d backend-profiles backend-pockettts frontend
For more details, see backend-pockettts/README.md.
Troubleshooting
MeloTTS fails with "cannot import name 'BeamSearchScorer'"
This means transformers is too new. The MeloTTS container pins it to 4.40.2. Rebuild the container:
```bash
docker compose down
docker compose build --no-cache backend-melotts
docker compose up -d
```
XTTS fails to load or runs out of memory
XTTS requires more memory. Ensure Docker has at least 8GB RAM allocated:
- Docker Desktop: Settings → Resources → Memory → 8GB+
Containers fail to start
Check logs for each service:
```bash
docker compose logs backend-profiles
docker compose logs backend-melotts
docker compose logs backend-xtts
docker compose logs backend-qwentts
docker compose logs backend-pockettts
docker compose logs frontend
```
Profile upload fails
- Check that backend-profiles is running: `docker compose ps backend-profiles`
- Check available disk space for audio storage
- Ensure audio file is in a supported format (wav, mp3, ogg, m4a, webm)
- Check logs: `docker compose logs -f backend-profiles`
First request is very slow
Models are loaded on-demand:
- Profile Backend (Whisper): First transcription takes ~10-20 seconds to load model
- TTS Backends: First clone request takes 1-3 minutes to download and load models
- Subsequent requests are much faster
To preload models on startup (slower startup but faster first request):
```yaml
# In docker-compose.yml
environment:
  - PRELOAD_MODELS=true
```
Transcription fails or is slow
- Whisper model is loaded on first transcription request
- First transcription takes longer (~10-20s for model load + transcription time)
- Ensure backend-profiles has adequate memory (4GB limit configured)
- Check logs: `docker compose logs -f backend-profiles`
Audio quality is poor
- Use longer reference audio (10-30 seconds)
- Ensure reference audio is clear with minimal background noise
- Try XTTS backend for better quality (at the cost of speed)
Pocket-TTS voice cloning not available
If you see "Voice cloning is not available" error:
- Ensure you've accepted the terms at https://huggingface.co/kyutai/pocket-tts
- Create a Hugging Face token at https://huggingface.co/settings/tokens
- Configure HF_TOKEN environment variable (see Pocket-TTS Setup)
- Restart the container: `docker compose restart backend-pockettts`
- Check logs: `docker compose logs backend-pockettts`
Qwen-TTS is very slow on Mac/CPU
The 1.7B model is large and slow on CPU. To improve performance:
Option 1: Use the smaller 0.6B model (RECOMMENDED)
```yaml
# In docker-compose.yml, set:
environment:
  - QWEN_MODEL=Qwen/Qwen3-TTS-12Hz-0.6B-Base  # 3x faster, still good quality
```
Option 2: Switch to 1.7B for best quality (slower)
```yaml
environment:
  - QWEN_MODEL=Qwen/Qwen3-TTS-12Hz-1.7B-Base  # Best quality, slower
```
Performance comparison on M2 Pro Mac (CPU):
- 0.6B model: ~10-20 seconds per generation
- 1.7B model: ~30-60 seconds per generation
Note: Docker containers cannot access Mac GPU (Metal/MPS), so inference runs on CPU only.
Development
Local Development (without Docker)
```bash
# MeloTTS backend
cd backend-melotts
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
# Install dependencies manually (see Dockerfile)
python app.py

# XTTS backend
cd backend-xtts
python -m venv venv
source venv/bin/activate
# Install dependencies manually (see Dockerfile)
python app.py

# Frontend
cd frontend
# Serve with any static file server
python -m http.server 8080
```
License
This project is MIT licensed.
Component Licenses
- Frontend: MIT
- OpenVoice V2 (MeloTTS backend): MIT - Free for commercial use
- Coqui XTTS v2 (XTTS backend): CPML - Non-commercial use only
- Qwen3-TTS (Qwen-TTS backend): Apache 2.0 - Free for commercial use
- Pocket-TTS (Pocket-TTS backend): CC-BY-4.0 - Free for commercial use
Credits
- OpenVoice - MIT licensed voice cloning
- MeloTTS - High-quality TTS
- Coqui TTS - XTTS v2 voice cloning model
- Qwen3-TTS - Apache 2.0 licensed voice cloning from Alibaba Cloud
- Pocket-TTS - CC-BY-4.0 licensed lightweight voice cloning from Kyutai
- OpenAI Whisper - Speech recognition
- Built with Flask, React, Docker, and Nginx