VoiceCraft - AI Voice Cloning
A Docker-based voice cloning application with four separate TTS engines, each in its own isolated backend container for better dependency management.
Features
- Record or Upload - Record reference audio directly in the browser or upload existing files
- Auto Transcription - AI-powered speech-to-text using OpenAI Whisper
- Multi-Engine Support - Choose between OpenVoice V2, XTTS v2, Qwen3-TTS, or Pocket-TTS
- Multi-language - Support for English, Chinese, Japanese, Korean, and more
- Download Results - Export cloned audio as WAV files
- Docker Ready - Separate containers for each engine to avoid dependency conflicts
Architecture
Detailed Architecture Documentation: See ARCHITECTURE.md for comprehensive system design, data flows, and deployment patterns.
The application consists of 6 separate Docker containers with a microservices architecture:
```
                    ┌───────────────────────────┐
                    │  Frontend (Nginx/React)   │
                    │          :8080            │
                    └─────────────┬─────────────┘
                                  │
          ┌───────────────────────┼───────────────────────┐
          ▼                       ▼                       ▼
┌──────────────────────┐ ┌─────────────────────┐ ┌──────────────────────┐
│  Profile Management  │ │ TTS Engine Backends │ │ TTS Engine Backends  │
│  Backend             │ │                     │ │                      │
│  (Whisper + DB)      │ │ - MeloTTS :5000     │ │ - Qwen-TTS   :5002   │
│  :5100               │ │ - XTTS    :5001     │ │ - Pocket-TTS :5003   │
└──────────────────────┘ └─────────────────────┘ └──────────────────────┘
  • Upload Audio           • Voice Cloning         • Voice Cloning
  • Transcription          • Multi-language        • Multi-language
  • Profile CRUD           • Audio Generation      • Audio Generation
```
Why This Architecture?
Centralized Profile Management:
- Single source of truth for voice profiles and audio files
- Shared audio transcription service (OpenAI Whisper)
- All TTS backends connect to profile backend for audio retrieval
- Simplifies profile management and reduces code duplication
Isolated TTS Backends: Each TTS engine has different dependency requirements:
- MeloTTS requires `transformers==4.40.2` (has `BeamSearchScorer`)
- XTTS requires `transformers>=4.33.0` (installs 4.57.6, which removed `BeamSearchScorer`)
- Qwen3-TTS uses the latest qwen-tts package with Python 3.12
- Pocket-TTS uses pocket-tts with Python 3.12
By separating them into individual containers, each can have its optimal dependencies without conflicts.
TTS Engines
| Engine | License | Commercial Use | Quality | Speed | Languages | Container |
|---|---|---|---|---|---|---|
| OpenVoice V2 | MIT | Yes | Good | Fast | 6 languages | backend-melotts:5000 |
| Coqui XTTS v2 | CPML | No | Excellent | Slower | 16+ languages | backend-xtts:5001 |
| Qwen3-TTS | Apache 2.0 | Yes | Excellent | Fast | 10 languages | backend-qwentts:5002 |
| Pocket-TTS | CC-BY-4.0 | Yes | Good | Very Fast | English only | backend-pockettts:5003 |
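When scripting against the stack, the engine-to-route mapping in the table above can be captured in a small lookup. This is purely illustrative glue on our part (the `ENGINES` dict and `clone_url` helper are not part of the application); the path prefixes and ports come from the table and the API Endpoints section:

```python
# Engine -> (API path prefix, container port), as listed in the table above.
# Illustrative only; the application itself does no such lookup.
ENGINES = {
    "melotts":   ("/api/melotts",   5000),   # OpenVoice V2
    "xtts":      ("/api/xtts",      5001),   # Coqui XTTS v2
    "qwentts":   ("/api/qwentts",   5002),   # Qwen3-TTS
    "pockettts": ("/api/pockettts", 5003),   # Pocket-TTS
}

def clone_url(engine: str, base: str = "http://localhost:8080") -> str:
    """Build the /clone endpoint URL for an engine, routed via the frontend proxy."""
    prefix, _port = ENGINES[engine]
    return f"{base}{prefix}/clone"
```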
OpenVoice V2 (MeloTTS Backend)
- License: MIT - Free for commercial use
- Languages: English, Spanish, French, Chinese, Japanese, Korean
- Best for: Production use, commercial applications
- Dependencies: transformers 4.40.2, MeloTTS, OpenVoice
Coqui XTTS v2 (XTTS Backend)
- License: CPML - Non-commercial only
- Languages: 16+ languages including German, Italian, Portuguese, Russian, Arabic
- Best for: Research, personal projects, highest quality needs
- Dependencies: transformers 4.57.6, TTS 0.22.0
Qwen3-TTS (Qwen-TTS Backend)
- License: Apache 2.0 - Free for commercial use
- Languages: 10 languages - Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, Italian
- Voice Cloning: 3-second rapid voice clone from user audio
- Best for: Production use, high-quality voice cloning, multilingual applications
- Dependencies: Python 3.12, qwen-tts, torch
- Model: Qwen3-TTS-12Hz-0.6B-Base (default, faster) or 1.7B-Base (higher quality)
- Performance: ~10-20s per generation on CPU with 0.6B model, 30-60s with 1.7B model
Pocket-TTS (Pocket-TTS Backend)
- License: CC-BY-4.0 - Free for commercial use
- Languages: English only
- Voice Cloning: Fast voice cloning from user audio
- Best for: English-only applications, lightweight deployment, fast inference
- Dependencies: Python 3.12, pocket-tts, torch 2.5+
- Model: 100M parameters (very lightweight, CPU-optimized)
- Performance: ~6x faster than real-time, ~200ms latency, only 2 CPU cores required
Quick Start
Prerequisites
- Docker and Docker Compose
- At least 4GB RAM for single backend (8GB for multiple backends)
- For Pocket-TTS: Hugging Face account and token (see Pocket-TTS Setup)
Quick Deploy (Single Backend - Recommended)
Easy Mode - Using Helper Script:
```bash
# Clone or download this repository
cd VoiceCraft

# Start with Qwen3-TTS (recommended - fast, commercial use)
./run.sh qwentts

# Or Pocket-TTS (fastest, commercial use, English only)
./run.sh pockettts

# Or OpenVoice V2 (fast, commercial use)
./run.sh openvoice

# Or XTTS v2 (highest quality, non-commercial)
./run.sh xtts

# View logs
./run.sh logs

# Stop all
./run.sh stop

# Access the application
open http://localhost:8080
```
Manual Mode - Using Docker Compose:
```bash
# Option 1: Start with Qwen3-TTS (recommended - fast, commercial use)
docker compose up -d backend-profiles backend-qwentts frontend

# Option 2: Start with Pocket-TTS (fastest, commercial use, English only)
docker compose up -d backend-profiles backend-pockettts frontend

# Option 3: Start with OpenVoice V2 (fast, commercial use)
docker compose up -d backend-profiles backend-melotts frontend

# Option 4: Start with XTTS v2 (highest quality, non-commercial)
docker compose up -d backend-profiles backend-xtts frontend

# View logs
docker compose logs -f

# Access the application
open http://localhost:8080
```
Note: The backend-profiles service is required for all TTS backends as it handles audio uploads, transcription, and profile management. The frontend will automatically detect which TTS backends are running.
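The frontend detects running backends by probing their health endpoints. The sketch below shows the same idea from Python, using only the `/health` routes listed under API Endpoints; the `detect_backends` helper is our illustration, not the frontend's actual code:

```python
import urllib.request
import urllib.error

# Health endpoints routed through the frontend proxy (paths from the
# "API Endpoints" section). Which ones answer depends on what you started.
HEALTH_PATHS = {
    "melotts":   "/api/melotts/health",
    "xtts":      "/api/xtts/health",
    "qwentts":   "/api/qwentts/health",
    "pockettts": "/api/pockettts/health",
}

def detect_backends(base: str = "http://localhost:8080", timeout: float = 2.0) -> list:
    """Return the engines whose health endpoint responds with HTTP 200."""
    up = []
    for engine, path in HEALTH_PATHS.items():
        try:
            with urllib.request.urlopen(base + path, timeout=timeout) as resp:
                if resp.status == 200:
                    up.append(engine)
        except (urllib.error.URLError, OSError):
            pass  # backend not running (connection refused) or proxy error
    return up
```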
Deploy All Backends (For Comparison)
```bash
# Build and start all containers
docker compose build
docker compose up -d

# View logs
docker compose logs -f

# Access the application
open http://localhost:8080
```
Run Individual Backends (Recommended)
For most use cases, you only need one TTS backend at a time. The profile management backend is always required. The frontend will automatically detect which TTS backend is running:
```bash
# Option 1: OpenVoice V2 - Fast, commercial use OK
docker compose up -d backend-profiles backend-melotts frontend

# Option 2: XTTS v2 - Best quality, non-commercial only
docker compose up -d backend-profiles backend-xtts frontend

# Option 3: Qwen3-TTS - Fast voice cloning, commercial use OK
docker compose up -d backend-profiles backend-qwentts frontend

# Option 4: Pocket-TTS - Fastest, commercial use OK, English only
docker compose up -d backend-profiles backend-pockettts frontend
```
Run All Backends (For Comparison)
If you want to test and compare all TTS engines:
```bash
# Start all backends (profile + all TTS engines + frontend)
docker compose up -d

# Or build first if needed
docker compose build
docker compose up -d
```
Switch Between Backends
```bash
# Stop current TTS backend
docker compose stop backend-melotts

# Start different TTS backend
docker compose up -d backend-qwentts

# The frontend will auto-detect the change
# Note: backend-profiles stays running and doesn't need to be restarted
```
Usage
- Open http://localhost:8080 in your browser
- Create a Voice Profile:
- Click "Create New Profile" or upload reference audio
- Optionally transcribe the audio using Whisper to see what was said
- Save the profile with a descriptive name
- Select TTS Engine:
- The UI will show only the running TTS backends (auto-detected)
- If multiple backends are running, select your preferred TTS engine
- Clone Voice:
- Select a saved voice profile from your profile library
- Enter the text you want the cloned voice to say
- Select the language
- Click "Clone Voice" to generate the output
- Download or Play:
- Listen to the cloned audio in the browser
- Download the output as a WAV file
Managing Backends
Check Running Backends
Use `docker compose ps` to see which services (including which TTS backends) are currently up.
Stop a Backend
```bash
# Stop specific TTS backend
docker compose stop backend-melotts
docker compose stop backend-xtts
docker compose stop backend-qwentts
docker compose stop backend-pockettts

# Stop profile backend (stops all profile management)
docker compose stop backend-profiles

# Or stop all
docker compose down
```
Switch TTS Backends
```bash
# Stop current TTS backend
docker compose stop backend-qwentts

# Start different TTS backend
docker compose up -d backend-melotts

# Refresh browser - UI will auto-update
# Note: backend-profiles keeps running and retains all profiles
```
View Logs
```bash
# Profile backend
docker compose logs -f backend-profiles

# Specific TTS backend
docker compose logs -f backend-qwentts

# All services
docker compose logs -f
```
API Endpoints
Profile Management Backend (/api/profiles)
```
GET    /api/profiles                  # List all voice profiles
POST   /api/profiles                  # Create new profile (upload audio)
GET    /api/profiles/:id              # Get profile details
PUT    /api/profiles/:id              # Update profile (name/transcription)
DELETE /api/profiles/:id              # Delete profile
POST   /api/profiles/:id/transcribe   # Transcribe profile audio with Whisper
GET    /api/profiles/:id/audio        # Stream profile audio
GET    /api/health                    # Health check
```
MeloTTS Backend (/api/melotts)
```
POST /api/melotts/clone          # Clone voice with OpenVoice (requires profile_id)
GET  /api/melotts/audio/:id      # Stream generated audio
GET  /api/melotts/download/:id   # Download generated audio
GET  /api/melotts/health         # Health check
```
XTTS Backend (/api/xtts)
```
POST /api/xtts/clone          # Clone voice with XTTS (requires profile_id)
GET  /api/xtts/audio/:id      # Stream generated audio
GET  /api/xtts/download/:id   # Download generated audio
GET  /api/xtts/health         # Health check
```
Qwen-TTS Backend (/api/qwentts)
```
POST /api/qwentts/clone          # Clone voice with Qwen3-TTS (requires profile_id)
GET  /api/qwentts/audio/:id      # Stream generated audio
GET  /api/qwentts/download/:id   # Download generated audio
GET  /api/qwentts/health         # Health check
```
Pocket-TTS Backend (/api/pockettts)
```
POST /api/pockettts/clone          # Clone voice with Pocket-TTS (requires profile_id)
GET  /api/pockettts/audio/:id      # Stream generated audio
GET  /api/pockettts/download/:id   # Download generated audio
GET  /api/pockettts/health         # Health check
```
API Usage Examples
```bash
# Step 1: Create a voice profile (upload reference audio)
curl -X POST http://localhost:8080/api/profiles \
  -F "audio=@reference_voice.wav" \
  -F "name=John's Voice"
# Response: { "success": true, "profile_id": "abc123", ... }

# Step 2: Optionally transcribe the audio
curl -X POST http://localhost:8080/api/profiles/abc123/transcribe

# Step 3: Clone voice using the profile

# Using MeloTTS (OpenVoice V2)
curl -X POST http://localhost:8080/api/melotts/clone \
  -H "Content-Type: application/json" \
  -d '{ "profile_id": "abc123", "text": "Hello, this is a test.", "language": "en" }'

# Using XTTS
curl -X POST http://localhost:8080/api/xtts/clone \
  -H "Content-Type: application/json" \
  -d '{ "profile_id": "abc123", "text": "Hello, this is a test.", "language": "en" }'

# Using Qwen3-TTS
curl -X POST http://localhost:8080/api/qwentts/clone \
  -H "Content-Type: application/json" \
  -d '{ "profile_id": "abc123", "text": "Hello, this is a test.", "language": "en", "ref_text": "Optional transcript of reference audio" }'

# Using Pocket-TTS
curl -X POST http://localhost:8080/api/pockettts/clone \
  -H "Content-Type: application/json" \
  -d '{ "profile_id": "abc123", "text": "Hello, this is a test.", "language": "en" }'
```
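The same workflow can be scripted without curl. The sketch below only builds the request object; the endpoint path and JSON fields mirror the curl examples above, while the `clone_request` helper itself is our own illustration:

```python
import json
import urllib.request
from typing import Optional

BASE = "http://localhost:8080"  # frontend proxy, as in the curl examples

def clone_request(engine: str, profile_id: str, text: str,
                  language: str = "en",
                  ref_text: Optional[str] = None) -> urllib.request.Request:
    """Build a POST /api/<engine>/clone request matching the curl examples above."""
    payload = {"profile_id": profile_id, "text": text, "language": language}
    if ref_text is not None:  # only the Qwen3-TTS example passes a reference transcript
        payload["ref_text"] = ref_text
    return urllib.request.Request(
        f"{BASE}/api/{engine}/clone",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# To actually send it (requires the stack to be running):
# with urllib.request.urlopen(clone_request("qwentts", "abc123", "Hello!")) as resp:
#     result = json.load(resp)
```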
Configuration
Environment Variables
backend-profiles:
- Manages audio uploads, transcription, and profile database
- Uses OpenAI Whisper for transcription (lazy-loaded on first use)
backend-melotts:
- `PRELOAD_MODELS=false` - Set to `true` to preload models on startup
- `USE_GPU=false` - Enable GPU acceleration (requires GPU docker setup)
- `PROFILE_BACKEND_URL=http://backend-profiles:5100` - Profile backend URL
backend-xtts:
- `PRELOAD_MODELS=false` - Set to `true` to preload models on startup
- `USE_GPU=false` - Enable GPU acceleration
- `COQUI_TOS_AGREED=1` - Required to use Coqui TTS (auto-set)
- `PROFILE_BACKEND_URL=http://backend-profiles:5100` - Profile backend URL
backend-qwentts:
- `PRELOAD_MODELS=false` - Set to `true` to preload models on startup
- `USE_GPU=false` - Enable GPU acceleration (requires GPU docker setup)
- `QWEN_MODEL=Qwen/Qwen3-TTS-12Hz-0.6B-Base` - Model to use (0.6B is faster, 1.7B is higher quality)
- `PROFILE_BACKEND_URL=http://backend-profiles:5100` - Profile backend URL
backend-pockettts:
- `PRELOAD_MODELS=false` - Set to `true` to preload models on startup
- No GPU support needed (CPU-optimized)
- `PROFILE_BACKEND_URL=http://backend-profiles:5100` - Profile backend URL
- `HF_TOKEN=your_token` - REQUIRED - Hugging Face token for model access (see Pocket-TTS Setup)
Resource Requirements
| Container | RAM Limit | RAM Reserved | First Request Load Time |
|---|---|---|---|
| backend-profiles | 4GB | 2GB | ~10-20 seconds (Whisper lazy-load) |
| backend-melotts | 4GB | 2GB | ~30-60 seconds |
| backend-xtts | 8GB | 4GB | ~2-3 minutes |
| backend-qwentts | 8GB | 4GB | ~1-2 minutes (0.6B model) |
| backend-pockettts | 4GB | 2GB | ~30-60 seconds |
| frontend | Minimal | Minimal | Instant |
Recommended: Run only one TTS backend at a time (plus the required profile backend) to minimize resource usage. The frontend auto-detects available backends.
Project Structure
```
VoiceCraft/
├── backend-profiles/        # Profile management backend (Port 5100)
│   ├── Dockerfile           # Container with Whisper and FFmpeg
│   └── app.py               # Flask API for profile CRUD and transcription
├── backend-melotts/         # OpenVoice V2 backend (Port 5000)
│   ├── Dockerfile           # Container with transformers 4.40.2
│   └── app.py               # Flask API for MeloTTS voice cloning
├── backend-xtts/            # XTTS v2 backend (Port 5001)
│   ├── Dockerfile           # Container with transformers 4.57.6
│   └── app.py               # Flask API for XTTS voice cloning
├── backend-qwentts/         # Qwen3-TTS backend (Port 5002)
│   ├── Dockerfile           # Container with Python 3.12 and qwen-tts
│   └── app.py               # Flask API for Qwen3-TTS voice cloning
├── backend-pockettts/       # Pocket-TTS backend (Port 5003)
│   ├── Dockerfile           # Container with Python 3.12 and pocket-tts
│   └── app.py               # Flask API for Pocket-TTS voice cloning
├── frontend/                # React frontend (Port 8080)
│   ├── Dockerfile
│   ├── nginx.conf           # Routes to profile backend and TTS backends
│   └── index.html           # SPA with profile management and TTS engine selector
├── docker-compose.yml       # Multi-container orchestration
└── README.md
```
Supported Languages
OpenVoice V2 (MeloTTS Backend)
English (en), Spanish (es), French (fr), Chinese (zh), Japanese (ja), Korean (ko)
XTTS v2 (Additional Languages)
All above plus: German (de), Italian (it), Portuguese (pt), Polish (pl), Turkish (tr), Russian (ru), Dutch (nl), Czech (cs), Arabic (ar), Hungarian (hu)
Qwen3-TTS (Qwen-TTS Backend)
Chinese (zh), English (en), Japanese (ja), Korean (ko), German (de), French (fr), Russian (ru), Portuguese (pt), Spanish (es), Italian (it)
Pocket-TTS (Pocket-TTS Backend)
English (en) only
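If you script against multiple engines, it can help to validate a language code before calling the API. The mapping below simply restates the lists above; the `SUPPORTED` dict and `supports` helper are our own convenience, not part of the application:

```python
# Language codes per engine, restating the "Supported Languages" lists above.
SUPPORTED = {
    "melotts":   {"en", "es", "fr", "zh", "ja", "ko"},
    "xtts":      {"en", "es", "fr", "zh", "ja", "ko",
                  "de", "it", "pt", "pl", "tr", "ru", "nl", "cs", "ar", "hu"},
    "qwentts":   {"zh", "en", "ja", "ko", "de", "fr", "ru", "pt", "es", "it"},
    "pockettts": {"en"},
}

def supports(engine: str, lang: str) -> bool:
    """True if the given engine lists the language code as supported."""
    return lang in SUPPORTED.get(engine, set())
```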
Pocket-TTS Setup
Pocket-TTS requires Hugging Face authentication to access the voice cloning model.
Step 1: Accept Model Terms
Visit https://huggingface.co/kyutai/pocket-tts and accept the terms of use.
Step 2: Create Hugging Face Token
- Go to https://huggingface.co/settings/tokens
- Create a new token (read access is sufficient)
- Copy the token
Step 3: Configure Token
Choose one of these methods:
Method A: Using .env file (Recommended)
```bash
# In the project root directory
cp .env.example .env

# Edit .env and add your token:
HF_TOKEN=your_huggingface_token_here
```
Method B: Export as environment variable
```bash
export HF_TOKEN=your_huggingface_token_here
docker compose up -d backend-pockettts
```

Method C: Add directly to docker-compose.yml

```yaml
backend-pockettts:
  environment:
    - HF_TOKEN=your_huggingface_token_here
```
Step 4: Start the Service
docker compose up -d backend-profiles backend-pockettts frontend
For more details, see backend-pockettts/README.md.
Troubleshooting
MeloTTS fails with "cannot import name 'BeamSearchScorer'"
This means transformers is too new. The MeloTTS container pins it to 4.40.2. Rebuild the container:
```bash
docker compose down
docker compose build --no-cache backend-melotts
docker compose up -d
```
XTTS fails to load or runs out of memory
XTTS requires more memory. Ensure Docker has at least 8GB RAM allocated:
- Docker Desktop: Settings → Resources → Memory → 8GB+
Containers fail to start
Check logs for each service:
```bash
docker compose logs backend-profiles
docker compose logs backend-melotts
docker compose logs backend-xtts
docker compose logs backend-qwentts
docker compose logs backend-pockettts
docker compose logs frontend
```
Profile upload fails
- Check that backend-profiles is running: `docker compose ps backend-profiles`
- Check available disk space for audio storage
- Ensure audio file is in a supported format (wav, mp3, ogg, m4a, webm)
- Check logs: `docker compose logs -f backend-profiles`
First request is very slow
Models are loaded on-demand:
- Profile Backend (Whisper): First transcription takes ~10-20 seconds to load model
- TTS Backends: First clone request takes 1-3 minutes to download and load models
- Subsequent requests are much faster
To preload models on startup (slower startup but faster first request):
```yaml
# In docker-compose.yml
environment:
  - PRELOAD_MODELS=true
```
Transcription fails or is slow
- Whisper model is loaded on first transcription request
- First transcription takes longer (~10-20s for model load + transcription time)
- Ensure backend-profiles has adequate memory (4GB limit configured)
- Check logs: `docker compose logs -f backend-profiles`
Audio quality is poor
- Use longer reference audio (10-30 seconds)
- Ensure reference audio is clear with minimal background noise
- Try XTTS backend for better quality (at the cost of speed)
Pocket-TTS voice cloning not available
If you see "Voice cloning is not available" error:
- Ensure you've accepted the terms at https://huggingface.co/kyutai/pocket-tts
- Create a Hugging Face token at https://huggingface.co/settings/tokens
- Configure HF_TOKEN environment variable (see Pocket-TTS Setup)
- Restart the container: `docker compose restart backend-pockettts`
- Check logs: `docker compose logs backend-pockettts`
Qwen-TTS is very slow on Mac/CPU
The 1.7B model is large and slow on CPU. To improve performance:
Option 1: Use the smaller 0.6B model (RECOMMENDED)
```yaml
# In docker-compose.yml, set:
environment:
  - QWEN_MODEL=Qwen/Qwen3-TTS-12Hz-0.6B-Base  # 3x faster, still good quality
```
Option 2: Switch to 1.7B for best quality (slower)
```yaml
environment:
  - QWEN_MODEL=Qwen/Qwen3-TTS-12Hz-1.7B-Base  # Best quality, slower
```
Performance comparison on M2 Pro Mac (CPU):
- 0.6B model: ~10-20 seconds per generation
- 1.7B model: ~30-60 seconds per generation
Note: Docker containers cannot access Mac GPU (Metal/MPS), so inference runs on CPU only.
Development
Local Development (without Docker)
```bash
# MeloTTS backend
cd backend-melotts
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
# Install dependencies manually (see Dockerfile)
python app.py

# XTTS backend
cd backend-xtts
python -m venv venv
source venv/bin/activate
# Install dependencies manually (see Dockerfile)
python app.py

# Frontend
cd frontend
# Serve with any static file server
python -m http.server 8080
```
License
This project is MIT licensed.
Component Licenses
- Frontend: MIT
- OpenVoice V2 (MeloTTS backend): MIT - Free for commercial use
- Coqui XTTS v2 (XTTS backend): CPML - Non-commercial use only
- Qwen3-TTS (Qwen-TTS backend): Apache 2.0 - Free for commercial use
- Pocket-TTS (Pocket-TTS backend): CC-BY-4.0 - Free for commercial use
Credits
- OpenVoice - MIT licensed voice cloning
- MeloTTS - High-quality TTS
- Coqui TTS - XTTS v2 voice cloning model
- Qwen3-TTS - Apache 2.0 licensed voice cloning from Alibaba Cloud
- Pocket-TTS - CC-BY-4.0 licensed lightweight voice cloning from Kyutai
- OpenAI Whisper - Speech recognition
- Built with Flask, React, Docker, and Nginx