Parkiet: Dutch Text-to-Speech (TTS)

Parkiet is a 1.6B parameter open-weights Dutch text-to-speech (TTS) model based on the Parakeet architecture, ported from Dia to JAX for scalable training. A full walkthrough for training the model for your language on Google Cloud TPUs can be found in the TRAINING.md doc. A comparison to ElevenLabs can be found on my blog.

Parkiet creates highly realistic voices from text. You can guide the audio output to control emotion and tone. The model also supports nonverbal sounds (currently only laughter), and up to four different speakers per prompt. Voice cloning is also supported. Here are some samples.

| Text | File |
| --- | --- |
| [S1] denk je dat je een open source model kan trainen met weinig geld en middelen? [S2] ja ik denk het wel. [S1] oh ja, hoe dan? [S2] nou kijk maar in de repo op Git Hub of Hugging Face. | intro.mp4 |
| [S1] hoeveel stemmen worden er ondersteund? [S2] nou, uhm, ik denk toch wel meer dan twee. [S3] ja, ja, d dat is het mooie aan dit model. [S4] ja klopt, het ondersteund tot vier verschillende stemmen per prompt. | multi.mp4 |
| [S1] h h et is dus ook mogelijk, om eh ... uhm, heel veel t te st stotteren in een prompt. | stutter.mp4 |
| [S1] (laughs) luister, ik heb een mop, wat uhm, drinkt een webdesigner het liefst? [S2] nou ... ? [S1] Earl Grey (laughs) . [S2] (laughs) heel goed. | laughs.mp4 |
| [S1] je hebt maar weinig audio nodig om een stem te clonen de rest van deze tekst is uitgesproken door een computer. [S2] wauw, dat klinkt wel erg goed. [S1] ja, ik hoop dat je er wat aan hebt. | voice_out.mp4 |

Generation Guidelines

  • Use [S1], [S2], [S3], [S4] to indicate the different speakers. Always start with [S1] and always alternate between [S1] and [S2] (i.e. [S1]... [S1]... is not good).
  • Prefer lowercase text prompts with punctuation. Write out digits as words. Even though the model should be able to handle some variety, it is better to stick close to the output of WhisperD-NL.
  • Slowing down can be encouraged by using ... in the prompt.
  • Stuttering and disfluencies can be encouraged by using uh, uhm, mmm.
  • Laughter can be added with the (laughs) tag. However, use it sparingly, because the model quickly derails when given too many events.
  • Reduce hallucinations by tuning the text prompts. The model can be brittle with unexpected events or tokens. Take a look at the example sentences and mimic the style.

News

September 28, 2025: Added safetensors format support, allowing the model to run directly in the Dia pipeline without conversion.

Quickstart

There are three flavours of the model: the HF transformers version (recommended), the original JAX model, and the backported PyTorch model. The HF transformers version is the easiest to use and integrates seamlessly with the Hugging Face ecosystem.

HF Transformers (Recommended)

# Make sure you have the runtime dependencies installed for JAX
# You can also extract the HF inference code and the transformers dependency
sudo apt-get install build-essential cmake protobuf-compiler libprotobuf-dev

uv sync # For CPU
uv sync --extra cuda # For CUDA

# Run the inference demo with HF transformers
uv run python src/parkiet/dia/inference_hf.py

PyTorch

# Make sure you have the runtime dependencies installed for JAX
sudo apt-get install build-essential cmake protobuf-compiler libprotobuf-dev

uv sync # For CPU
uv sync --extra cuda # For CUDA

mkdir -p weights
wget "https://huggingface.co/pevers/parkiet/resolve/main/dia-nl-v1.pth?download=true" -O weights/dia-nl-v1.pth
uv run python src/parkiet/dia/inference.py

JAX

# Make sure you have the runtime dependencies installed for JAX
sudo apt-get install build-essential cmake protobuf-compiler libprotobuf-dev

uv sync --extra tpu # For TPU
uv sync --extra cuda # For CUDA

# Create the checkpoint folder, download the checkpoint, and unzip
mkdir -p weights
wget "https://huggingface.co/pevers/parkiet/resolve/main/dia-nl-v1.zip?download=true" -O weights/dia-nl-v1.zip
unzip weights/dia-nl-v1.zip -d weights

# Run the inference demo
# NOTE: Inference can take a while because of JAX compilation. Subsequent calls will be cached and much faster. I'm working on some performance improvements.
uv run python src/parkiet/jax/inference.py
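The first-call compilation cost mentioned above can also be mitigated across runs with JAX's persistent compilation cache. This is a config fragment, not part of the Parkiet scripts; the option names come from JAX's config API in recent releases and are worth verifying against your installed version:

```python
import jax

# Persist compiled programs to disk so the expensive first-call compilation
# is paid once per machine rather than once per process.
jax.config.update("jax_compilation_cache_dir", "/tmp/jax_cache")
# Only cache programs whose compilation took at least this long.
jax.config.update("jax_persistent_cache_min_compile_time_secs", 1.0)
```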

Hardware Requirements

| Framework | float32 VRAM | bfloat16 VRAM |
| --- | --- | --- |
| JAX | ≥19 GB | ≥10 GB |
| PyTorch | ≥15 GB | ≥10 GB |

Note: bfloat16 typically reduces VRAM usage versus float32 on supported hardware to about 10 GB. However, converting the full model to bfloat16 causes more instability and hallucinations. Setting just the compute_dtype to bfloat16 is a good compromise, and this is also done during training. We would like to reduce the VRAM requirements in a future training run.
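As a rough sanity check on the table above (assuming the stated 1.6B parameters), the weights alone account for only part of the footprint; activations, decoding caches, and framework overhead make up the rest:

```python
# Back-of-envelope memory for the 1.6B-parameter model's weights alone.
params = 1.6e9

def weight_gib(bytes_per_param: float) -> float:
    """Memory for the weights in GiB at the given precision."""
    return params * bytes_per_param / 1024**3

print(f"float32 weights:  {weight_gib(4):.1f} GiB")   # ~6.0 GiB
print(f"bfloat16 weights: {weight_gib(2):.1f} GiB")   # ~3.0 GiB
```

The gap between ~6 GiB of float32 weights and the ≥15–19 GB in the table is why keeping weights in float32 while lowering only the compute_dtype still saves several gigabytes.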

⚠️ Disclaimer

This project offers a high-fidelity speech generation model intended for research and educational use. The following uses are strictly forbidden:

  • Identity Misuse: Do not produce audio resembling real individuals without permission.
  • Deceptive Content: Do not use this model to generate misleading content (e.g. fake news).
  • Illegal or Malicious Use: Do not use this model for activities that are illegal or intended to cause harm.

By using this model, you agree to uphold relevant legal standards and ethical responsibilities. We are not responsible for any misuse and firmly oppose any unethical usage of this technology.

Training

For a full guide on data preparation, model conversion and the TPU setup to train this model for any language, see TRAINING.md.

Acknowledgements

License

Repository code is licensed under the MIT License. The TTS model itself is licensed under RAIL-M.