GitHub - ZDisket/vits-evo: VITS EVOlution: Lightweight, deployable voice cloning TTS model


See more samples on my Xitter post

VITS EVOlution is an open-source text-to-speech stack built around zero-shot voice cloning and low latency. It includes an ONNX speaker encoder, ONNX TTS inference, DeepPhonemizer for phonemization (MIT-licensed), and a voice blending option that averages multiple speaker embeddings to create new voices.

Features

  • Zero-shot voice cloning from a reference clip
  • Voice blending with two or more reference embeddings
  • ONNX release format for both the speaker encoder and the TTS model
  • Permissive licenses across the whole stack
  • CPU inference at about 0.18 real-time factor (roughly 5.6x faster than real time) on an Intel(R) Xeon(R) Platinum 8470
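For reference, real-time factor (RTF) is synthesis time divided by audio duration, so the speedup over real time is its reciprocal. A quick sanity check of the numbers above (the helper function is illustrative, not part of the library):

```python
# Real-time factor (RTF) = synthesis time / audio duration.
# An RTF of 0.18 means 1 second of audio takes 0.18 s to generate,
# i.e. roughly 1 / 0.18 ≈ 5.6x faster than real time.
def speedup_from_rtf(rtf: float) -> float:
    return 1.0 / rtf

print(round(speedup_from_rtf(0.18), 1))  # → 5.6
```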

Released models

| Model | Type | Checkpoint | Hugging Face demo | Colab demo |
| --- | --- | --- | --- | --- |
| vits-evo-zero-shot-v1 | Zero-shot TTS (ONNX) | Google Drive | HF Space | Google Colab |

Minimal run

Clone DeepPhonemizer and install it:

git clone https://github.com/ZDisket/DeepPhonemizer.git
pip install ./DeepPhonemizer

After downloading the TTS model (duh!), download the English phonemizer checkpoint:

wget https://public-asai-dl-models.s3.eu-central-1.amazonaws.com/DeepPhonemizer/en_us_cmudict_ipa_forward.pt

Then run inference in Python:

from onnx_tts import VitsEvo
from resemblyzer_onnx import OnnxVoiceEncoder

tts = VitsEvo(
    "path/to/model.onnx",
    config_path="configs/zero_shot_pretrain_nosdp_1gpu.inference.json",
    phonemizer_checkpoint="en_us_cmudict_ipa_forward.pt",
)

speaker_encoder = OnnxVoiceEncoder(device="cpu")
embedding = speaker_encoder.make_embedding("reference.wav")

audio = tts.synthesize(
    "Hello from VITS EVOlution.",
    speaker_embedding=embedding,
)

tts.save_wav("output.wav", audio)

Voice blending

Voice blending averages two or more speaker embeddings to create a new voice.

embedding = speaker_encoder.make_embedding(
    ["speaker_a.wav", "speaker_b.wav", "speaker_c.wav"],
    factors=[1.0, 0.7, 0.4],
)

audio = tts.synthesize(
    "This voice is blended from multiple references.",
    speaker_embedding=embedding,
)
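Under the hood, blending with `factors` presumably amounts to a weighted average of the per-clip embeddings, re-normalized to unit length (speaker embeddings are typically L2-normalized). A hypothetical NumPy sketch, with `blend_embeddings` as an illustrative name rather than the library's actual API:

```python
import numpy as np

# Hypothetical sketch of what make_embedding likely does with multiple
# references and factors: weight each clip's embedding, average, then
# re-normalize so the result stays on the unit hypersphere.
def blend_embeddings(embeddings, factors):
    embeddings = np.asarray(embeddings, dtype=np.float32)  # (n_clips, dim)
    factors = np.asarray(factors, dtype=np.float32)        # (n_clips,)
    blended = (embeddings * factors[:, None]).sum(axis=0) / factors.sum()
    return blended / np.linalg.norm(blended)
```

With `factors=[1.0, 0.7, 0.4]` as in the snippet above, the first reference dominates the blend while the others pull the voice toward their timbre.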

Gradio app

Run the included demo app:

python gradio_zero_shot.py --model path/to/model.onnx --config configs/zero_shot_pretrain_nosdp_1gpu.inference.json --phonemizer-checkpoint en_us_cmudict_ipa_forward.pt

The app lets you:

  • upload a primary reference clip
  • mix extra reference clips with custom weights
  • inspect phoneme output
  • test zero-shot cloning and blended voices from the browser

Contact

Hey you! Like my stuff? Email me or DM me on Xitter with any inquiries.