Cloning a voice at 48 kHz with VoxCPM2

Voice cloning

May 17, 2026

VoxCPM2, a new TTS model, just landed in Soniqo. It runs on your laptop, outputs studio-quality 48 kHz audio, and clones a voice from a single short clip. This post walks through what you can build with it, the three ways it lets you clone a voice, and a friendly look at how the model works inside.

What you can build

Four things that change when cloning runs locally.

Running cloning on the device unlocks four properties at once — privacy, offline use, no per-call cost, and full voice ownership. Each of these opens a class of product that's awkward to build any other way.

Personal audiobook narrators

Record 30 seconds of a parent reading. The audiobook app then narrates any chapter in their voice — same warmth, same accent, locally generated each session.

Multilingual creator content

YouTubers and podcasters keep one consistent voice across 30 languages. Record once in English, ship the same episode in Japanese, Spanish, and Hindi without a vocal cast.

Accessibility & voice banking

People facing voice loss can bank their voice in a short clip and keep speaking through assistive tech that sounds like them — not like a generic TTS engine.

Product voices on demand

Describe the voice you want — "young woman, gentle and warm" — and the model designs it without a reference recording. Useful for game NPCs, kiosk prompts, or A/B testing brand voices.

On-device vs hosted

How VoxCPM2 compares to ElevenLabs.

ElevenLabs is the obvious cloud-API alternative. The trade-off is what runs where — and who owns the voice afterwards.

For products that need privacy guarantees, offline operation, or zero per-call cost, on-device cloning is the only one of the two that qualifies: every ElevenLabs call uploads audio to their servers.

Feature	VoxCPM2 (Soniqo)	ElevenLabs
Where it runs	On the user’s device	Hosted API
Audio leaves the device	No	Yes (uploaded to ElevenLabs)
Offline use	Yes	No (requires internet)
Per-call cost	None	Per-character billing
Model licence	Apache 2.0, open weights	Proprietary, SaaS only
Max output sample rate	48 kHz native	48 kHz (Pro tier and above)
Languages	30	29 (Multilingual v2) · 70+ (Eleven v3)
Reference clip required	5–30 s	1 min (Instant) · 30 min (Professional)
Voice design from text	Yes	Yes

Both engines reach 48 kHz; both support a similar language spread for everyday cloning; both expose voice design from a text description. The genuine difference is whether the audio ever leaves the device.

Three cloning modes

One model, three ways in.

The model is the same in every call. What changes is which arguments you pass — that decides whether you're designing a voice from a description, copying a recorded one, or preserving an accent.

Voice design

When you don't have a reference recording.

Describe the voice in natural language. The model picks a matching voice and stays consistent across calls.

Reference cloning

When you have a short clip of the target speaker.

Pass any 5–30 s of clean speech. The model copies the timbre and rhythm and synthesises new text in that voice.

Ultimate cloning

When the speaker has a distinctive accent and you want it preserved.

Pass the clip AND its transcript. The model can now line up acoustic features with phonemes — accent and vowel choices carry through.

The same input slot, filled with different pieces. The model never sees a flag — it reads the sequence.
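The argument-to-mode mapping described above can be sketched as a small dispatcher. This is an illustration only: `pick_mode` and its parameter names are hypothetical and are not Soniqo's API, and the rule that a clip takes precedence over a description is an assumption. The 5–30 s bound is the one quoted for reference clips in this post.

```python
from typing import Optional

# Reference-clip bounds quoted in this post (5-30 s of clean speech).
MIN_CLIP_S = 5.0
MAX_CLIP_S = 30.0


def pick_mode(
    description: Optional[str] = None,
    reference_clip_s: Optional[float] = None,
    transcript: Optional[str] = None,
) -> str:
    """Return which of the three cloning modes a set of arguments implies.

    Hypothetical helper: the real call signature may differ.
    """
    if reference_clip_s is not None:
        if not MIN_CLIP_S <= reference_clip_s <= MAX_CLIP_S:
            raise ValueError("reference clip should be 5-30 s long")
        # Clip plus transcript -> ultimate cloning; clip alone -> reference cloning.
        # Assumption: a clip takes precedence over any description passed with it.
        return "ultimate_cloning" if transcript else "reference_cloning"
    if transcript:
        raise ValueError("a transcript only makes sense with the clip it transcribes")
    if description:
        # No recording at all: design the voice from the text description.
        return "voice_design"
    raise ValueError("pass a description or a reference clip")
```

For example, `pick_mode(description="young woman, gentle and warm")` selects voice design, while adding a 12-second clip and its transcript selects ultimate cloning.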

Under the hood

How VoxCPM2 produces audio.

Four cooperating modules and an audio decoder. You don't need to know any of this to use the model, but if you're curious where the 48 kHz comes from, here it is.

The pipeline starts with a local encoder (LocEnc) that fuses text tokens and (optional) reference audio into one stream of vectors. That stream feeds the TSLM — a 28-layer MiniCPM-4 language model that decides what audio "patch" should come next, the same way a text LM picks the next token. A second pass through the RALM refines each patch.

Up to this point everything is a transformer. The interesting twist is the LocDiT: instead of choosing from a fixed vocabulary of discrete audio tokens, it runs a small diffusion process to paint the audio latent in each slot. No discrete codec means no quantisation bottleneck — which is what lets the final stage, AudioVAE V2, decode straight to 48 kHz. Every other on-device engine in this stack tops out at 24 kHz.
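To make the "quantisation bottleneck" idea concrete, here is a toy scalar quantiser. It is a stand-in chosen for illustration, not the codec of any real TTS system (real codecs quantise vectors, not single numbers), and `quantise` and `worst_case_error` are hypothetical names. It shows only the general point: snapping a continuous value to a finite set of levels leaves a rounding error that shrinks as levels are added, whereas a continuous latent has no such error to begin with.

```python
def quantise(x: float, levels: int) -> float:
    """Snap x in [-1, 1] to the nearest of `levels` evenly spaced values."""
    step = 2.0 / (levels - 1)
    index = round((x + 1.0) / step)
    return -1.0 + index * step


def worst_case_error(levels: int, probes: int = 2001) -> float:
    """Largest rounding error seen over an even grid of probe values in [-1, 1]."""
    xs = [-1.0 + 2.0 * i / (probes - 1) for i in range(probes)]
    return max(abs(x - quantise(x, levels)) for x in xs)


if __name__ == "__main__":
    # More levels means a finer grid and a smaller worst-case error.
    for levels in (4, 16, 256):
        print(levels, worst_case_error(levels))
```

The error is bounded by half the step size, so it never reaches zero for any finite number of levels.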

The split is worth noting: the autoregressive LM is great at deciding what should come next (content, rhythm, length); the diffusion head is great at painting acoustic detail (phase, spectrum). VoxCPM2 lets each do what it's good at. That's why the model holds its own at only 2B parameters — the architecture earns the perceptual quality, not the size.
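The data flow described above can be traced with a toy loop. Only the module names and their order come from this post; every function body below is a placeholder doing trivial arithmetic, and nothing here reflects the real weights, shapes, or API. The point is the control flow: encode once, then for each patch let the language-model stages decide a coarse state, let a small iterative head refine a continuous latent, and feed the result back in as context.

```python
from typing import List, Optional


def loc_enc(text: str, reference: Optional[List[float]]) -> List[float]:
    """Placeholder: fuse text and optional reference audio into one stream."""
    stream = [float(ord(ch) % 7) for ch in text]
    if reference:
        stream = list(reference) + stream
    if not stream:
        raise ValueError("need some text or reference audio")
    return stream


def tslm(context: List[float]) -> float:
    """Placeholder for the language model: summarise context into a next-patch state."""
    return sum(context) / len(context)


def ralm(state: float, context: List[float]) -> float:
    """Placeholder refinement pass over the state proposed by tslm."""
    return 0.5 * state + 0.5 * context[-1]


def loc_dit(state: float, steps: int = 4) -> float:
    """Placeholder 'diffusion': start from a fixed value and step toward the target."""
    latent = 1.0  # fixed start instead of random noise, so the example is deterministic
    for _ in range(steps):
        latent += 0.5 * (state - latent)
    return latent


def audio_vae_decode(latents: List[float], samples_per_patch: int) -> List[float]:
    """Placeholder decoder: expand each latent into several output samples."""
    return [z for z in latents for _ in range(samples_per_patch)]


def generate(text: str, reference: Optional[List[float]] = None,
             num_patches: int = 3, samples_per_patch: int = 4) -> List[float]:
    context = loc_enc(text, reference)
    latents: List[float] = []
    for _ in range(num_patches):
        state = ralm(tslm(context), context)   # decide what comes next
        latent = loc_dit(state)                # paint the continuous latent
        latents.append(latent)
        context.append(latent)                 # autoregressive feedback
    return audio_vae_decode(latents, samples_per_patch)
```

Calling `generate("hello")` returns `num_patches * samples_per_patch` values, and passing a `reference` list changes the output because it changes the context the loop starts from.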

The paper

Read the original research.

VoxCPM: Tokenizer-Free TTS for Context-Aware Speech Generation and True-to-Life Voice Cloning

OpenBMB · arXiv:2509.24650 · Sept 2025


Bundles

Three sizes. Pick by your disk budget.

All three bundles run the same architecture; they differ only in how aggressively the language model is quantised. The int8 bundle is the recommended default — it matches the upstream Python pipeline on the 8-sentence round-trip benchmark while being faster and 40% smaller than bf16.

Bundle	Size	Best for
bf16	~5.0 GB	Reference / debugging.
int8 (default)	~3.0 GB	Everyday cloning, audiobooks, podcasts.
int4	~1.9 GB	Disk-constrained deployments.
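The sizes in the table can be turned into a quick disk-budget check. The sketch below uses only the approximate sizes listed above; the helper names (`percent_smaller`, `pick_bundle`) and the rule of preferring int8 and falling back to int4 are assumptions for illustration, not part of Soniqo.

```python
from typing import Optional

# Approximate bundle sizes in GB, taken from the table above.
BUNDLES_GB = {"bf16": 5.0, "int8": 3.0, "int4": 1.9}


def percent_smaller(bundle: str, baseline: str = "bf16") -> int:
    """How much smaller a bundle is than the baseline, as a rounded percentage."""
    return round(100 * (1 - BUNDLES_GB[bundle] / BUNDLES_GB[baseline]))


def pick_bundle(free_disk_gb: float) -> Optional[str]:
    """Prefer the recommended int8 bundle; fall back to int4 when disk is tight."""
    for name in ("int8", "int4"):
        if BUNDLES_GB[name] <= free_disk_gb:
            return name
    return None  # not even the int4 bundle fits


if __name__ == "__main__":
    print(percent_smaller("int8"))  # matches the 40% figure quoted above
    print(pick_bundle(2.5))
```

bf16 is left out of the automatic choice because the table positions it for reference and debugging rather than everyday use.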