A new TTS model, VoxCPM2, just landed in Soniqo. It runs on your laptop, outputs studio-quality 48 kHz audio, and clones a voice from a single short clip. This post covers what you can build with it, the three ways it lets you clone a voice, and how the model works inside.
What you can build
Four things that change when cloning runs locally.
Running cloning on the device unlocks four properties at once — privacy, offline use, no per-call cost, and full voice ownership. Each of these opens a class of product that's awkward to build any other way.
Personal audiobook narrators
Record 30 seconds of a parent reading. The audiobook app then narrates any chapter in their voice — same warmth, same accent, locally generated each session.
Multilingual creator content
YouTubers and podcasters keep one consistent voice across 30 languages. Record once in English, ship the same episode in Japanese, Spanish, and Hindi without a vocal cast.
Accessibility & voice banking
People facing voice loss can bank their voice in a short clip and keep speaking through assistive tech that sounds like them — not like a generic TTS engine.
Product voices on demand
Describe the voice you want — "young woman, gentle and warm" — and the model designs it without a reference recording. Useful for game NPCs, kiosk prompts, or A/B testing brand voices.
On-device vs hosted
How VoxCPM2 compares to ElevenLabs.
ElevenLabs is the obvious cloud-API alternative. The trade-off is what runs where — and who owns the voice afterwards.
For products that need privacy guarantees, offline operation, or zero per-call cost, on-device cloning is the only option — every ElevenLabs call uploads audio to their servers.
| | VoxCPM2 (Soniqo) | ElevenLabs |
|---|---|---|
| Where it runs | On the user’s device | Hosted API |
| Audio leaves the device | No | Yes (uploaded to ElevenLabs) |
| Offline use | Yes | No (requires internet) |
| Per-call cost | None | Per-character billing |
| Model licence | Apache 2.0, open weights | Proprietary, SaaS only |
| Max output sample rate | 48 kHz native | 48 kHz (Pro tier and above) |
| Languages | 30 | 29 (Multilingual v2) · 70+ (Eleven v3) |
| Reference clip required | 5–30 s | 1 min (Instant) · 30 min (Professional) |
| Voice design from text | Yes | Yes |
Both engines reach 48 kHz; both support a similar language spread for everyday cloning; both expose voice design from a text description. The genuine difference is whether the audio ever leaves the device.
Three cloning modes
One model, three ways in.
The model is the same in every call. What changes is which arguments you pass — that decides whether you're designing a voice from a description, copying a recorded one, or preserving an accent.
Voice design
When you don't have a reference recording.
Describe the voice in natural language. The model picks a matching voice and stays consistent across calls.
Reference cloning
When you have a short clip of the target speaker.
Pass any 5–30 s of clean speech. The model copies the timbre and rhythm and synthesises new text in that voice.
Ultimate cloning
When the speaker has a distinctive accent and you want it preserved.
Pass the clip AND its transcript. The model can now line up acoustic features with phonemes — accent and vowel choices carry through.
The same input slot, filled with different pieces. The model never sees a flag — it reads the sequence.
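As a rough illustration of "which arguments you pass decides the mode", here is a minimal Python sketch. Every name in it (`CloneRequest`, `cloning_mode`, the field names) is hypothetical and not the Soniqo or VoxCPM2 API; only the three mode names and the 5–30 s clip range come from this post.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical names throughout: this is NOT the Soniqo or VoxCPM2 API.
MIN_CLIP_S, MAX_CLIP_S = 5.0, 30.0  # reference clip range quoted in this post


@dataclass
class CloneRequest:
    text: str                                   # what the voice should say
    voice_description: Optional[str] = None     # e.g. "young woman, gentle and warm"
    reference_seconds: Optional[float] = None   # duration of the reference clip
    reference_transcript: Optional[str] = None  # words spoken in the clip


def cloning_mode(req: CloneRequest) -> str:
    """Infer the mode from which fields are filled in; there is no mode flag."""
    if req.reference_seconds is None:
        if req.reference_transcript is not None:
            raise ValueError("a transcript needs a reference clip")
        if req.voice_description is None:
            raise ValueError("pass a voice description or a reference clip")
        return "voice design"
    if not MIN_CLIP_S <= req.reference_seconds <= MAX_CLIP_S:
        raise ValueError("reference clip should be 5-30 s long")
    if req.reference_transcript is not None:
        return "ultimate cloning"
    return "reference cloning"
```

Under these assumptions, a request with only a description resolves to voice design, a clip alone to reference cloning, and a clip plus its transcript to ultimate cloning.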
Under the hood
How VoxCPM2 produces audio.
Four cooperating modules and a final audio decoder. You don't need to know any of this to use the model, but if you're curious where the 48 kHz comes from, here it is.
The pipeline starts with a local encoder (LocEnc) that fuses text tokens and (optional) reference audio into one stream of vectors. That stream feeds the TSLM — a 28-layer MiniCPM-4 language model that decides what audio "patch" should come next, the same way a text LM picks the next token. A second pass through the RALM refines each patch.
Up to this point everything is a transformer. The interesting twist is the LocDiT: instead of choosing from a fixed vocabulary of discrete audio tokens, it runs a small diffusion process to paint the audio latent in each slot. No discrete codec means no quantisation bottleneck — which is what lets the final stage, AudioVAE V2, decode straight to 48 kHz. Every other on-device engine in this stack tops out at 24 kHz.
The split is worth noting: the autoregressive LM is great at deciding what should come next (content, rhythm, length); the diffusion head is great at painting acoustic detail (phase, spectrum). VoxCPM2 lets each do what it's good at. That's why the model holds its own at only 2B parameters — the architecture earns the perceptual quality, not the size.
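To make that division of labour concrete, here is a toy Python loop. It is only a sketch of the control flow: `language_model` and `diffusion_head` are hypothetical stand-ins, the sizes are made up, and the refinement loop is a simple move-toward-target iteration rather than a real diffusion sampler. What it does show is the shape of the idea: an autoregressive step decides a conditioning vector from the patches so far, then an iterative step fills in a continuous latent that is never snapped to a codebook.

```python
import random

PATCH_DIM = 4      # toy latent size (assumption; the real latent is far larger)
REFINE_STEPS = 8   # toy number of refinement steps (assumption)


def language_model(history):
    """Stand-in for the LM stack: turn past patches into a conditioning vector."""
    if not history:
        return [0.5] * PATCH_DIM
    mean = [sum(p[i] for p in history) / len(history) for i in range(PATCH_DIM)]
    return [m * 0.9 + 0.1 for m in mean]


def diffusion_head(condition, rng):
    """Stand-in for the diffusion head: start from noise, refine toward the condition."""
    latent = [rng.gauss(0.0, 1.0) for _ in range(PATCH_DIM)]
    for _ in range(REFINE_STEPS):
        # Each step removes half of the remaining gap to the condition.
        latent = [x + (c - x) * 0.5 for x, c in zip(latent, condition)]
    return latent  # continuous values, no quantisation step


def generate(num_patches, seed=0):
    rng = random.Random(seed)
    patches = []
    for _ in range(num_patches):
        condition = language_model(patches)             # decide what comes next
        patches.append(diffusion_head(condition, rng))  # fill in the detail
    return patches  # a decoder would turn these latents into a waveform
```

Calling `generate(3)` returns three latent patches, each produced only after the previous ones exist, which is the autoregressive part; the inner loop is where a real model would run its diffusion steps.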
The paper
Read the original research.
VoxCPM: Tokenizer-Free TTS for Context-Aware Speech Generation and True-to-Life Voice Cloning
OpenBMB · arXiv:2509.24650 · Sept 2025
Bundles
Three sizes. Pick by your disk budget.
All three bundles run the same architecture; they differ only in how aggressively the language model is quantised. The int8 bundle is the recommended default — it matches the upstream Python pipeline on the 8-sentence round-trip benchmark while being faster and 40% smaller than bf16.
| Bundle | Size | Best for |
|---|---|---|
| bf16 | ~5.0 GB | Reference / debugging. |
| int8 (default) | ~3.0 GB | Everyday cloning, audiobooks, podcasts. |
| int4 | ~1.9 GB | Disk-constrained deployments. |
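If you want to pick a bundle programmatically, a small helper along these lines might do. The sizes are the approximate figures from the table above; the function names and the fall-back policy (prefer int8, otherwise the largest bundle that fits) are assumptions made for illustration, not part of Soniqo.

```python
# Approximate bundle sizes in GB, copied from the table above.
BUNDLE_SIZES_GB = {"bf16": 5.0, "int8": 3.0, "int4": 1.9}


def pick_bundle(free_disk_gb: float, prefer: str = "int8") -> str:
    """Return the preferred bundle if it fits, else the largest one that does."""
    if BUNDLE_SIZES_GB[prefer] <= free_disk_gb:
        return prefer
    fitting = [name for name, size in BUNDLE_SIZES_GB.items() if size <= free_disk_gb]
    if not fitting:
        raise ValueError("no bundle fits in the available disk space")
    return max(fitting, key=BUNDLE_SIZES_GB.__getitem__)


def percent_smaller_than_bf16(name: str) -> float:
    """Size reduction relative to the bf16 bundle, in percent."""
    return round(100 * (1 - BUNDLE_SIZES_GB[name] / BUNDLE_SIZES_GB["bf16"]), 1)


print(pick_bundle(10.0))                  # int8
print(pick_bundle(2.5))                   # int4
print(percent_smaller_than_bf16("int8"))  # 40.0
```

Because the sizes are approximate, it is worth leaving some headroom rather than comparing against the exact free space.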