
qwen3-tts

Pure Rust inference for Qwen3-TTS, a text-to-speech model from Alibaba. Built on candle — no Python or ONNX runtime required.

All code in this repo was written with Claude Code. This is an experiment -- not a production library.

Changelog

0.4.0

  • Pre-allocated KV cache with InplaceOp2 (zero-copy CUDA writes, no Tensor::cat)
  • GPU-side repetition penalty mask (incremental slice_assign, eliminates growing CPU transfer)
  • Deferred acoustic codes transfer (single bulk GPU→CPU at end of generation)
  • Fused residual + RMSNorm CUDA kernel
  • GPU→CPU syncs reduced from 3/frame to 1/frame (4-byte EOS check)
  • Non-streaming RTF: 0.48–0.67 across all variants (97-100% of theoretical throughput)
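For intuition on the first two items, here is a minimal sketch of the pre-allocation idea using candle's stock slice_set: each step's keys/values are copied into a fixed buffer instead of growing the cache with Tensor::cat. The struct and method names are illustrative only; the crate's actual cache writes through a custom InplaceOp2 kernel for zero-copy CUDA updates.

use candle_core::{DType, Device, Result, Tensor};

// Illustrative pre-allocated KV cache: one fixed [batch, max_seq, dim] buffer,
// filled in place as generation proceeds (no reallocation, no Tensor::cat).
struct PreallocKvCache {
    buf: Tensor,
    len: usize, // number of valid positions along the sequence dimension
}

impl PreallocKvCache {
    fn new(batch: usize, max_seq: usize, dim: usize, device: &Device) -> Result<Self> {
        Ok(Self {
            buf: Tensor::zeros((batch, max_seq, dim), DType::BF16, device)?,
            len: 0,
        })
    }

    // Copy the new step's keys/values into the buffer at the current offset.
    fn append(&mut self, kv: &Tensor) -> Result<()> {
        self.buf.slice_set(kv, 1, self.len)?; // write along dim 1 (sequence)
        self.len += kv.dim(1)?;
        Ok(())
    }

    // The valid prefix that attention should see.
    fn current(&self) -> Result<Tensor> {
        self.buf.narrow(1, 0, self.len)
    }
}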

0.3.0

  • GPU-side sampling: batched argmax, on-device top-k/top-p/repetition penalty
  • Eliminated 15 of 16 GPU→CPU syncs per frame in code predictor
  • Cached token suppression mask in streaming sessions
  • Tokenizer fallback from vocab.json + merges.txt when tokenizer.json is unavailable
  • Profiling infrastructure: Chrome tracing, flamegraph, Nsight Systems via Makefile
  • Benchmarked all 4 model variants (0.6B Base, 1.7B Base/CustomVoice/VoiceDesign)
  • Self-contained model directories (removed tokenizer symlinks)
  • Enhanced waveform plots with stats annotation bar

0.2.0

  • ICL voice cloning now works correctly with proper reference audio
  • Fixed WAV output format (WAVEX/float32 → standard WAV/PCM16) — resolves playback speed issues in some players
  • Improved tokenizer path resolution with explicit --tokenizer-dir override
  • Added benchmarking suite (Criterion micro-benchmarks + E2E speed tests)
  • Automatic resampling of reference audio to 24kHz for voice cloning
  • Docker base image updated to NGC pytorch:25.11 (CUDA 13.0)

Acknowledgements

Thanks to u/rngesius for feedback on playback speed and tokenizer issues.

Features

  • CPU inference with optional MKL/Accelerate for faster BLAS operations
  • CUDA support for NVIDIA GPU acceleration with bf16 and Flash Attention 2
  • Metal support for Apple Silicon
  • Streaming synthesis for low-latency audio output
  • Voice cloning via x-vector or ICL (in-context learning) from reference audio (Base models)
  • Preset speakers with 9 built-in voices (CustomVoice models)
  • Text-described voices via natural language prompts (VoiceDesign models)
  • Auto-detection of model variant from config.json
  • HuggingFace Hub integration for easy model downloads

Performance

Benchmarked on an NVIDIA DGX Spark (GB10 Blackwell, ARM Cortex-X925, 120 GB unified memory). Default generation parameters, seed 42, 2 warmup + 3 timed iterations.

| Model | RTF (short) | RTF (long) | Tok/s | Memory |
|---|---|---|---|---|
| 0.6B Base (CUDA BF16) | 0.48 | 0.50 | 25.9 | 767 MB |
| 1.7B Base (CUDA BF16) | 0.65 | 0.65 | 19.4 | 767 MB |
| 1.7B CustomVoice (CUDA BF16) | 0.64 | 0.67 | 19.2 | 772 MB |
| 1.7B VoiceDesign (CUDA BF16) | 0.64 | 0.66 | 19.3 | 770 MB |
| 1.7B CustomVoice (CPU F32) | 5.39 | 6.48 | 2.1 | 9.1 GB |

RTF (real-time factor) = wall-clock time / audio duration; values below 1.0 are faster than real-time. The table above shows non-streaming results. Streaming adds ~8-12% overhead, with time-to-first-audio (TTFA) of ~444 ms (0.6B) / ~580 ms (1.7B).

See docs/BENCHMARKS.md for full results, test corpus, micro-benchmarks, and reproduction instructions.
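As a concrete example of the metric, here is a minimal sketch that times a single non-streaming synthesize call using the API shown in Quick Start below (output is 24 kHz mono, so duration = samples / 24000):

use std::time::Instant;
use qwen3_tts::{Qwen3TTS, auto_device};

fn main() -> anyhow::Result<()> {
    let device = auto_device()?;
    let model = Qwen3TTS::from_pretrained("path/to/model", device)?;

    let start = Instant::now();
    let audio = model.synthesize("The sun set behind the mountains.", None)?;
    let wall_secs = start.elapsed().as_secs_f64();

    // RTF = wall-clock time / generated audio duration.
    let audio_secs = audio.samples.len() as f64 / 24_000.0;
    println!("RTF = {:.2}", wall_secs / audio_secs);
    Ok(())
}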

Samples

All samples generated with 1.7B models, seed 42. Text: "The sun set behind the mountains, painting the sky in shades of gold and violet."

  • CustomVoice — Ryan
  • CustomVoice — Serena
  • Voice Clone — ICL
  • VoiceDesign — Radio Announcer
  • VoiceDesign — Storyteller
  • VoiceDesign — Sportscaster

Model Variants

Five official model variants exist across two size classes. Each variant supports a different speaker conditioning method:

| Variant | Size | Speaker Conditioning | Use Case |
|---|---|---|---|
| 0.6B Base | 1.8 GB | Voice cloning from reference audio | Clone any voice from a WAV file |
| 0.6B CustomVoice | 1.8 GB | 9 preset speakers | Pick from built-in voices |
| 1.7B Base | 3.9 GB | Voice cloning from reference audio | Higher-quality voice cloning |
| 1.7B CustomVoice | 3.9 GB | 9 preset speakers | Higher-quality preset voices |
| 1.7B VoiceDesign | 3.8 GB | Text description | Describe a voice in natural language |

Which model should I use?

  • Want to clone a specific voice? Use a Base model with --ref-audio (ICL mode) or --ref-audio --x-vector-only (faster, lower quality).
  • Want a quick preset voice? Use a CustomVoice model with --speaker.
  • Want to describe a voice in text? Use 1.7B VoiceDesign with --instruct.
  • Unsure? Start with 0.6B CustomVoice for the fastest results.

Valid combinations

| Model | Preset speakers | Voice clone (x_vector) | Voice clone (ICL) | Text-described voice |
|---|---|---|---|---|
| Base | | x | x | |
| CustomVoice | x | | | |
| VoiceDesign | | | | x |

Using the wrong combination (e.g. preset speakers on a Base model) won't crash, but produces unpredictable voice output. The library and CLI warn when this happens.
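As a rough illustration, the table above boils down to a check like the following. The enums and helper here are hypothetical, written only to mirror the table; the real crate auto-detects the variant from config.json and only warns on mismatches.

#[derive(Debug, Clone, Copy, PartialEq)]
enum Variant { Base, CustomVoice, VoiceDesign }

#[derive(Debug, Clone, Copy, PartialEq)]
enum Conditioning { PresetSpeaker, XVectorClone, IclClone, TextDescribed }

// Hypothetical validity check mirroring the "Valid combinations" table.
fn is_valid(variant: Variant, conditioning: Conditioning) -> bool {
    use Conditioning::*;
    match variant {
        Variant::Base => matches!(conditioning, XVectorClone | IclClone),
        Variant::CustomVoice => conditioning == PresetSpeaker,
        Variant::VoiceDesign => conditioning == TextDescribed,
    }
}

fn main() {
    // e.g. a preset speaker on a Base model: accepted by the API, but unpredictable output.
    assert!(!is_valid(Variant::Base, Conditioning::PresetSpeaker));
}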

Installation

Add to your Cargo.toml:

[dependencies]
qwen3-tts = { version = "0.1", features = ["hub"] }

Feature Flags

| Feature | Description |
|---|---|
| cpu (default) | CPU inference |
| cuda | NVIDIA GPU acceleration |
| flash-attn | Flash Attention 2 (requires CUDA toolkit; enables bf16 compute) |
| metal | Apple Silicon GPU acceleration |
| mkl | Intel MKL for faster CPU inference |
| accelerate | Apple Accelerate framework |
| hub | HuggingFace Hub model downloads |
| cli | Command-line tools |

Quick Start

Preset speakers (CustomVoice)

use qwen3_tts::{Qwen3TTS, Speaker, Language, auto_device};

fn main() -> anyhow::Result<()> {
    let device = auto_device()?;
    let model = Qwen3TTS::from_pretrained("path/to/customvoice_model", device)?;

    let audio = model.synthesize_with_voice(
        "Hello, world!",
        Speaker::Ryan,
        Language::English,
        None,
    )?;
    audio.save("output.wav")?;
    Ok(())
}

Available speakers: Serena, Vivian, UncleFu, Ryan, Aiden, OnoAnna, Sohee, Eric, Dylan

Voice cloning (Base)

use qwen3_tts::{Qwen3TTS, Language, AudioBuffer, auto_device};

fn main() -> anyhow::Result<()> {
    let device = auto_device()?;
    let model = Qwen3TTS::from_pretrained("path/to/base_model", device)?;

    // Load reference audio
    let ref_audio = AudioBuffer::load("reference_voice.wav")?;

    // ICL mode: full voice cloning with reference text
    let prompt = model.create_voice_clone_prompt(&ref_audio, Some("transcript of ref audio"))?;

    // x_vector_only: faster, speaker embedding only (no reference text needed)
    // let prompt = model.create_voice_clone_prompt(&ref_audio, None)?;

    let audio = model.synthesize_voice_clone(
        "Hello in the cloned voice!",
        &prompt,
        Language::English,
        None,
    )?;
    audio.save("cloned.wav")?;
    Ok(())
}

Text-described voice (VoiceDesign)

use qwen3_tts::{Qwen3TTS, Language, auto_device};

fn main() -> anyhow::Result<()> {
    let device = auto_device()?;
    let model = Qwen3TTS::from_pretrained("path/to/voicedesign_model", device)?;

    let audio = model.synthesize_voice_design(
        "Hello from a designed voice!",
        "A cheerful young female voice with high pitch and energetic tone",
        Language::English,
        None,
    )?;
    audio.save("designed.wav")?;
    Ok(())
}

With custom options

use qwen3_tts::{Qwen3TTS, SynthesisOptions, auto_device};

fn main() -> anyhow::Result<()> {
    let device = auto_device()?;
    let model = Qwen3TTS::from_pretrained("path/to/model", device)?;

    let options = SynthesisOptions {
        temperature: 0.8,
        top_k: 30,
        top_p: 0.85,
        repetition_penalty: 1.05,
        ..Default::default()
    };
    let audio = model.synthesize("Custom settings!", Some(options))?;
    audio.save("output.wav")?;
    Ok(())
}

Streaming synthesis

For low-latency applications, stream audio in chunks:

use qwen3_tts::{Qwen3TTS, Speaker, Language, SynthesisOptions, auto_device};

fn main() -> anyhow::Result<()> {
    let device = auto_device()?;
    let model = Qwen3TTS::from_pretrained("path/to/model", device)?;

    let options = SynthesisOptions {
        chunk_frames: 10, // ~800ms per chunk
        ..Default::default()
    };

    for chunk in model.synthesize_streaming(
        "Hello, world!",
        Speaker::Ryan,
        Language::English,
        options,
    )? {
        let audio = chunk?;
        // Play or stream this chunk
        println!("Got {} samples", audio.samples.len());
    }
    Ok(())
}

With HuggingFace Hub

use qwen3_tts::{Qwen3TTS, ModelPaths, auto_device};

fn main() -> anyhow::Result<()> {
    let paths = ModelPaths::download(None)?;
    let device = auto_device()?;

    let model = Qwen3TTS::from_paths(&paths, device)?;
    let audio = model.synthesize("Hello from HuggingFace!", None)?;
    audio.save("output.wav")?;
    Ok(())
}

Architecture

The TTS pipeline consists of three stages:

  1. TalkerModel: 28-layer transformer generating semantic tokens from text autoregressively. Uses MRoPE (multimodal rotary position encoding) across all variants.

  2. CodePredictor: 5-layer decoder that generates 15 acoustic tokens per semantic token. Always 1024 hidden dim; 1.7B models use a projection layer to bridge from the talker's 2048-dim space.

  3. Decoder12Hz: Converts 16-codebook tokens to 24kHz audio via ConvNeXt blocks and transposed convolution upsampling. Shared across all model variants.

Text --> TalkerModel --> Semantic Token --> CodePredictor --> [16 codes] --> Decoder --> Audio
              ^                                  ^
         (autoregressive,                  (per frame,
          one per frame)                    15 acoustic codes)
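In pseudocode, the per-frame loop looks roughly like this. All names are illustrative stand-ins for the stages above, not the crate's internal API; packing one semantic token plus 15 acoustic codes into a 16-codebook frame follows the diagram.

type SemanticToken = u32;
type AcousticCode = u32;

const EOS: SemanticToken = 0; // placeholder EOS id

// Stage 1: talker predicts the next semantic token from text + history (stubbed).
fn talker_step(_text: &str, _history: &[SemanticToken]) -> SemanticToken { EOS }

// Stage 2: code predictor expands one semantic token into 15 acoustic codes (stubbed).
fn predict_codes(_semantic: SemanticToken) -> [AcousticCode; 15] { [0; 15] }

// Stage 3: codec decoder turns 16-codebook frames into 24 kHz samples (stubbed).
fn decode_frames(_frames: &[[AcousticCode; 16]]) -> Vec<f32> { Vec::new() }

fn synthesize(text: &str, max_frames: usize) -> Vec<f32> {
    let mut semantic = Vec::new();
    let mut frames: Vec<[AcousticCode; 16]> = Vec::new();

    for _ in 0..max_frames {
        // One autoregressive talker step per frame; stop at EOS.
        let tok = talker_step(text, &semantic);
        if tok == EOS {
            break;
        }
        semantic.push(tok);

        // Assumed packing: semantic token first, then the 15 acoustic codes.
        let acoustic = predict_codes(tok);
        let mut frame = [0; 16];
        frame[0] = tok;
        frame[1..].copy_from_slice(&acoustic);
        frames.push(frame);
    }

    decode_frames(&frames)
}

fn main() {
    let samples = synthesize("Hello, world!", 2048);
    println!("generated {} samples", samples.len());
}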

CLI

The model variant is auto-detected from config.json. The CLI warns if your flags don't match the model type.

# CustomVoice: preset speaker
cargo run --release --features cli --bin generate_audio -- \
  --model-dir path/to/customvoice \
  --text "Hello world" \
  --speaker ryan \
  --language english

# Base: voice cloning (ICL — best quality, requires reference text)
cargo run --release --features cli --bin generate_audio -- \
  --model-dir path/to/base \
  --text "Hello world" \
  --ref-audio reference.wav \
  --ref-text "transcript of the reference audio"

# Base: voice cloning (x_vector_only — faster, no transcript needed)
cargo run --release --features cli --bin generate_audio -- \
  --model-dir path/to/base \
  --text "Hello world" \
  --ref-audio reference.wav \
  --x-vector-only

# VoiceDesign: describe the voice you want
cargo run --release --features cli --bin generate_audio -- \
  --model-dir path/to/voicedesign \
  --text "Hello world" \
  --instruct "A cheerful young female voice with high pitch and energetic tone" \
  --language english

# Reproducible generation with fixed seed
cargo run --release --features cli --bin generate_audio -- \
  --model-dir path/to/model \
  --text "Hello" \
  --seed 42

CLI options

| Flag | Default | Description |
|---|---|---|
| --model-dir | test_data/model | Path to model directory |
| --text | "Hello" | Text to synthesize |
| --speaker | ryan | Preset speaker (CustomVoice only) |
| --language | english | Target language |
| --instruct | | Voice description for VoiceDesign models |
| --ref-audio | | Reference audio WAV for voice cloning (Base only) |
| --ref-text | | Reference transcript for ICL voice cloning (requires --ref-audio) |
| --x-vector-only | | Speaker embedding only, no ICL (use with --ref-audio) |
| --output | | Output WAV file path (overrides default naming) |
| --device | auto | Device: auto, cpu, cuda, cuda:N, metal |
| --duration | | Max duration in seconds (overrides --frames) |
| --frames | 2048 | Max frames (~164 s); generation stops at EOS |
| --temperature | 0.7 | Sampling temperature |
| --top-k | 50 | Top-k sampling |
| --top-p | 0.9 | Nucleus sampling threshold |
| --repetition-penalty | 1.05 | Repetition penalty |
| --seed | 42 | Random seed for reproducibility |

GPU Acceleration

On CUDA devices, the talker and code predictor automatically run in bf16 for lower memory usage and faster inference. The codec decoder and speaker encoder remain in f32 (convolutional, no attention).

For best performance, build with Flash Attention 2 (requires CUDA toolkit in PATH):

cargo build --release --features flash-attn,cli

Docker

For GPU builds, use the build script which auto-detects your GPU architecture:

# Build GPU image (auto-detects compute capability)
./build-docker.sh qwen3-tts flash-attn,cli

# Build CPU-only image
./build-docker.sh qwen3-tts-cpu cli

The script builds inside a running container with GPU access, ensuring correct PTX compilation for your exact hardware (Ampere, Ada, Hopper, Blackwell, etc.).

Run inference:

docker run --gpus all \
  -v /path/to/models:/models \
  -v /path/to/output:/output \
  qwen3-tts \
    --model-dir /models/0.6b-customvoice \
    --speaker ryan \
    --text "Hello world, this is a test." \
    --device cuda \
    --output /output/hello.wav

For CPU-only builds without GPU access, use the Dockerfile directly:

docker build --build-arg FEATURES=cli --build-arg BASE=ubuntu:22.04 -t qwen3-tts-cpu .

Dtype behavior

| Component | CPU | CUDA/Metal | CUDA + flash-attn |
|---|---|---|---|
| Talker (transformer) | F32 | BF16 | BF16 |
| Code Predictor | F32 | BF16 | BF16 |
| Codec Decoder | F32 | F32 | F32 |
| Speaker Encoder | F32 | F32 | F32 |

Model Files

All models share the same speech tokenizer and text tokenizer.

| Component | HuggingFace Repo | Size |
|---|---|---|
| 0.6B Base | Qwen/Qwen3-TTS-12Hz-0.6B-Base | 1.8 GB |
| 0.6B CustomVoice | Qwen/Qwen3-TTS-12Hz-0.6B-CustomVoice | 1.8 GB |
| 1.7B Base | Qwen/Qwen3-TTS-12Hz-1.7B-Base | 3.9 GB |
| 1.7B CustomVoice | Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice | 3.9 GB |
| 1.7B VoiceDesign | Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign | 3.8 GB |
| Speech Tokenizer | Qwen/Qwen3-TTS-Tokenizer-12Hz | 682 MB |
| Text Tokenizer | Qwen/Qwen2-0.5B | 7 MB |

Supported languages

English, Chinese, Japanese, Korean, German, French, Russian, Portuguese, Spanish, Italian

Sample Rate

Output audio is always 24kHz mono. Use audio::resample() for other rates:

use qwen3_tts::audio;

let audio_48k = audio::resample(&audio, 48000)?;

License

MIT License. See the main Qwen3-TTS repository for model license information.