See more samples on my Xitter post
VITS EVOlution is an open-source text-to-speech stack built around zero-shot voice cloning and low-latency inference. It includes an ONNX speaker encoder, ONNX TTS inference, DeepPhonemizer (MIT) for grapheme-to-phoneme conversion, and a voice blending option that averages multiple speaker embeddings to create new voices.
## Features
- Zero-shot voice cloning from a reference clip
- Voice blending with two or more reference embeddings
- ONNX release format for both the speaker encoder and the TTS model
- Permissive licenses across the entire stack
- CPU inference at about a 0.18 real-time factor on an Intel(R) Xeon(R) Platinum 8470, i.e. about 5.6x faster than real time (see the note after this list)
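Real-time factor (RTF) is wall-clock synthesis time divided by the duration of the generated audio, so an RTF of 0.18 means 1 / 0.18 ≈ 5.6x real time. A minimal sketch of how you could measure it yourself, assuming the `tts` and `embedding` objects from the minimal run below and a `sample_rate` read from your inference config:

```python
import time

import numpy as np

# `tts` and `embedding` come from the minimal run below;
# `sample_rate` is an assumption -- read it from your inference config.
sample_rate = 22050

start = time.perf_counter()
audio = tts.synthesize("A sentence for benchmarking.", speaker_embedding=embedding)
elapsed = time.perf_counter() - start

# RTF = synthesis time / audio duration; speedup over real time = 1 / RTF
duration = len(np.asarray(audio)) / sample_rate
rtf = elapsed / duration
print(f"RTF: {rtf:.2f} ({1 / rtf:.1f}x faster than real time)")
```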
## Released models
| Model | Type | Checkpoint | Hugging Face demo | Colab demo |
|---|---|---|---|---|
| vits-evo-zero-shot-v1 | Zero-shot TTS (ONNX) | Google Drive | HF Space | Google Colab |
## Minimal run
Clone DeepPhonemizer and install it:
```bash
git clone https://github.com/ZDisket/DeepPhonemizer.git
pip install ./DeepPhonemizer
```
After downloading the TTS model (duh!), download the English phonemizer checkpoint:
```bash
wget https://public-asai-dl-models.s3.eu-central-1.amazonaws.com/DeepPhonemizer/en_us_cmudict_ipa_forward.pt
```
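To sanity-check the checkpoint, DeepPhonemizer can also be used standalone. This sketch follows DeepPhonemizer's own documented API; the sample sentence is arbitrary:

```python
from dp.phonemizer import Phonemizer

# Load the English IPA checkpoint downloaded above
phonemizer = Phonemizer.from_checkpoint("en_us_cmudict_ipa_forward.pt")
print(phonemizer("Hello from VITS EVOlution.", lang="en_us"))
```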
Then run inference:

```python
from onnx_tts import VitsEvo
from resemblyzer_onnx import OnnxVoiceEncoder

# Load the ONNX TTS model, its inference config, and the phonemizer checkpoint
tts = VitsEvo(
    "path/to/model.onnx",
    config_path="configs/zero_shot_pretrain_nosdp_1gpu.inference.json",
    phonemizer_checkpoint="en_us_cmudict_ipa_forward.pt",
)

# Embed a reference clip with the ONNX speaker encoder
speaker_encoder = OnnxVoiceEncoder(device="cpu")
embedding = speaker_encoder.make_embedding("reference.wav")

# Synthesize in the reference speaker's voice and write a wav file
audio = tts.synthesize(
    "Hello from VITS EVOlution.",
    speaker_embedding=embedding,
)
tts.save_wav("output.wav", audio)
```
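Speaker embeddings are plain vectors, so one option is to compute them once and reuse them. A small sketch, assuming the embedding converts to a NumPy array and that `synthesize` accepts a loaded array:

```python
import numpy as np

# Cache the reference embedding so later runs can skip the speaker encoder
np.save("reference_embedding.npy", np.asarray(embedding))

# Later: reload it and synthesize directly
embedding = np.load("reference_embedding.npy")
audio = tts.synthesize("Reusing a cached embedding.", speaker_embedding=embedding)
```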
## Voice blending
Voice blending averages two or more speaker embeddings, weighted by per-speaker factors, to create a new voice.
```python
# Blend three reference voices; `factors` weights each speaker's contribution
embedding = speaker_encoder.make_embedding(
    ["speaker_a.wav", "speaker_b.wav", "speaker_c.wav"],
    factors=[1.0, 0.7, 0.4],
)
audio = tts.synthesize(
    "This voice is blended from multiple references.",
    speaker_embedding=embedding,
)
```
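Conceptually, a blend like this is a normalized weighted average of the individual embedding vectors. A minimal sketch of the idea (not the library's actual implementation), assuming unit-norm embeddings:

```python
import numpy as np

def blend_embeddings(embeddings, factors):
    """Weighted average of speaker embeddings, renormalized to unit length."""
    stacked = np.stack([np.asarray(e) for e in embeddings])
    weights = np.asarray(factors, dtype=np.float32)
    blended = (weights[:, None] * stacked).sum(axis=0) / weights.sum()
    # Speaker encoders typically emit unit-norm vectors, so renormalize
    return blended / np.linalg.norm(blended)
```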
## Gradio app
Run the included demo app:
```bash
python gradio_zero_shot.py \
  --model path/to/model.onnx \
  --config configs/zero_shot_pretrain_nosdp_1gpu.inference.json \
  --phonemizer-checkpoint en_us_cmudict_ipa_forward.pt
```
The app lets you:
- upload a primary reference clip
- mix extra reference clips with custom weights
- inspect phoneme output
- test zero-shot cloning and blended voices from the browser (a stripped-down sketch of such an app follows this list)
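For reference, the core of such an app fits in a few lines. A minimal sketch using the standard Gradio API and the `tts` and `speaker_encoder` objects from the minimal run above; the included `gradio_zero_shot.py` does more (blending, phoneme inspection):

```python
import gradio as gr

# `tts` and `speaker_encoder` as constructed in the minimal run above
def clone(text, reference_path):
    embedding = speaker_encoder.make_embedding(reference_path)
    audio = tts.synthesize(text, speaker_embedding=embedding)
    tts.save_wav("output.wav", audio)
    return "output.wav"

demo = gr.Interface(
    fn=clone,
    inputs=[gr.Textbox(label="Text"), gr.Audio(type="filepath", label="Reference clip")],
    outputs=gr.Audio(label="Cloned speech"),
)
demo.launch()
```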
## Contact
Hey you! Like my stuff? Email me or DM me on Xitter with any inquiries.