GitHub - ZDisket/vits-evo: VITS EVOlution: Lightweight, deployable voice cloning TTS model


See more samples on my Xitter post

VITS EVOlution is an open-source text-to-speech stack built around zero-shot voice cloning and low latency. It includes an ONNX speaker encoder, ONNX TTS inference, DeepPhonemizer for phonemization (MIT-licensed), and a voice blending option that averages multiple speaker embeddings to create new voices.

Features

  • Zero-shot voice cloning from a reference clip
  • Voice blending with two or more reference embeddings
  • ONNX release format for both the speaker encoder and the TTS model
  • Permissive licenses across the whole stack
  • CPU inference at about 0.18 real-time factor (roughly 5.6x faster than real time) on an Intel(R) Xeon(R) Platinum 8470
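For reference, real-time factor (RTF) is synthesis time divided by audio duration, so the speedup over real time is its reciprocal. A quick sanity check of the numbers above (the helper function is illustrative, not part of the library):

```python
# Real-time factor (RTF) = synthesis time / audio duration.
# An RTF of 0.18 means 1 second of audio takes 0.18 s to generate,
# i.e. roughly 1 / 0.18 ≈ 5.6x faster than real time.
def speedup_from_rtf(rtf: float) -> float:
    return 1.0 / rtf

print(round(speedup_from_rtf(0.18), 1))  # → 5.6
```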

Released models

| Model | Type | Checkpoint | Hugging Face demo | Colab demo |
| --- | --- | --- | --- | --- |
| vits-evo-zero-shot-v1 | Zero-shot TTS (ONNX) | Google Drive | HF Space | Google Colab |

Minimal run

Clone DeepPhonemizer and install it:

git clone https://github.com/ZDisket/DeepPhonemizer.git
pip install ./DeepPhonemizer

After downloading the TTS model (duh!), download the English phonemizer checkpoint:

wget https://public-asai-dl-models.s3.eu-central-1.amazonaws.com/DeepPhonemizer/en_us_cmudict_ipa_forward.pt

Then run inference in Python:

from onnx_tts import VitsEvo
from resemblyzer_onnx import OnnxVoiceEncoder

tts = VitsEvo(
    "path/to/model.onnx",
    config_path="configs/zero_shot_pretrain_nosdp_1gpu.inference.json",
    phonemizer_checkpoint="en_us_cmudict_ipa_forward.pt",
)

speaker_encoder = OnnxVoiceEncoder(device="cpu")
embedding = speaker_encoder.make_embedding("reference.wav")

audio = tts.synthesize(
    "Hello from VITS EVOlution.",
    speaker_embedding=embedding,
)

tts.save_wav("output.wav", audio)

Voice blending

Voice blending averages two or more speaker embeddings to create a new voice.

embedding = speaker_encoder.make_embedding(
    ["speaker_a.wav", "speaker_b.wav", "speaker_c.wav"],
    factors=[1.0, 0.7, 0.4],
)

audio = tts.synthesize(
    "This voice is blended from multiple references.",
    speaker_embedding=embedding,
)
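Under the hood, blending with `factors` presumably amounts to a weighted average of the per-clip embeddings, re-normalized to unit length (speaker embeddings are typically L2-normalized). A hypothetical NumPy sketch, with `blend_embeddings` as an illustrative name rather than the library's actual API:

```python
import numpy as np

# Hypothetical sketch of what make_embedding likely does with multiple
# references and factors: weight each clip's embedding, average, then
# re-normalize so the result stays on the unit hypersphere.
def blend_embeddings(embeddings, factors):
    embeddings = np.asarray(embeddings, dtype=np.float32)  # (n_clips, dim)
    factors = np.asarray(factors, dtype=np.float32)        # (n_clips,)
    blended = (embeddings * factors[:, None]).sum(axis=0) / factors.sum()
    return blended / np.linalg.norm(blended)
```

With `factors=[1.0, 0.7, 0.4]` as in the snippet above, the first reference dominates the blend while the others pull the voice toward their timbre.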

Gradio app

Run the included demo app:

python gradio_zero_shot.py --model path/to/model.onnx --config configs/zero_shot_pretrain_nosdp_1gpu.inference.json --phonemizer-checkpoint en_us_cmudict_ipa_forward.pt

The app lets you:

  • upload a primary reference clip
  • mix extra reference clips with custom weights
  • inspect phoneme output
  • test zero-shot cloning and blended voices from the browser

Contact

Hey you! Like my stuff? Email me or DM me on Xitter with any inquiries.