Grafting a Speech Head onto Gemma 4 E4B

10 min read Original article ↗

For a Discord buddy, the tempting model shape is small, fast, and multimodal. It should hear the call, see the game, read the chat, and respond quickly enough that the moment is still alive. That is why Gemma 4 E4B is interesting: it is small enough to run locally, takes text, image, and audio as inputs, can analyze video as frames, and still behaves like a regular language model at the end.

The last part matters. Gemma 4 E4B can listen, but it does not speak natively. Its audio tower is an input encoder. The model turns audio into embeddings the text decoder can read, then predicts text tokens. To understand what a speech branch would have to add, first look at where the released model stops.

The useful mental model: image and audio do not become separate output streams. They get converted into the same hidden-space language as text, inserted into the prompt where special placeholder tokens sit, and then a 42-layer text decoder predicts the next text token.

Effective params 4.5B

Decoder layers 42

Text vocab 262K

Context window 128K

Inputs T/I/A

Output Text

Text input

Tokens to 2560 embeddings

Text starts as 262K-vocab token IDs and is already in the decoder's hidden-space format after embedding.

Image / video frames

Vision tower to projector

Visual patches become projected features that replace image placeholder slots in the sequence.

Audio input

Audio tower to projector

Audio is mono 16 kHz float input using 32 ms frames; Gemma 4 budgets 25 audio tokens per second and supports clips up to 30 seconds.

Shared hidden space

One 2560-wide sequence

Text, image, and audio now occupy the same decoder input stream.

Gemma text decoder

42 transformer layers

Hybrid sliding-window and global attention process the unified prompt.

Output head

Text-token logits

The tied text embedding head predicts the next token in the vocabulary.

Text goes straight into the language stream; the other towers sit idle.

Why Gemma stops at text

Gemma 4 E4B's audio support is trained around speech recognition and speech-to-text translation. It can listen to an audio clip and write out text. It is not trained to produce Mimi-style audio tokens, and it does not include the codec decoder needed to turn those tokens into a waveform.

That is where models like Moshi point in a different direction. In the Mimi Codec post, the audio is represented as discrete token streams. Once speech lives as tokens, a model can generate speech directly in roughly the same way a language model generates text. Gemma 4 E4B is not built that way. It is closer to a multimodal text model that can hear than a native speech-to-speech model.

That leaves two paths. The practical one is to let Gemma write text, then send that text to a normal TTS engine. The architectural experiment is more direct: train a new audio head on Gemma's decoder hidden states, before those states are collapsed into text-token logits.

Making it talk

The smoke-test experiment is not ordinary text-to-speech bolted onto the end of a chat response. Text can go into Gemma, then a trainable audio head maps Gemma's hidden states into Kyutai Mimi codec tokens. A frozen Mimi decoder turns those tokens into a waveform. No text is passed to Mimi at inference; it only sees predicted codec tokens.

The same head can also be trained on Gemma states produced from audio input. In that version, the spoken prompt lives inside the WAV, Gemma receives no text instruction, and the audio head still reads the final Gemma decoder states before predicting Mimi tokens.

The tap point is specific: a learned mix of the last six Gemma decoder layers, after the transformer stack and before the tied text-output head. Gemma stays frozen, Mimi stays frozen, and the only trained module is the Gemma-to-Mimi token head.

The code required to replicate the experiment, along with a minimal reproduction dataset and generated samples, is available in the gemma4-audio GitHub repo.

Gemma inputs

Unified token stream

Text tokens and projected vision/audio embeddings enter the same Gemma decoder stream.

Frozen Gemma 4 E4B

42 decoder layers

Gemma processes the prompt first. The highlighted tail of the stack is the part we read for the audio branch.

Tap point

Last 6 hidden layers

A learned layer mix reads Gemma's late decoder activations before the tied vocabulary head turns them into text logits.

Trainable branch

Gemma-to-Mimi audio head

Only this head is trained. It maps the last-six-layer Gemma state into Mimi codec-token logits while Gemma and Mimi stay frozen.

Normal Gemma output

Text logits

This path still exists, but it is not what drives the audio samples below.

Audio-token space

Mimi codec tokens

The smoke run predicts 8 codebooks at 12.5 Hz with cross-entropy over Mimi-encoded teacher speech.

Frozen Mimi decoder

24 kHz waveform

Mimi only receives predicted codec tokens and turns them into playable speech audio.

The prototype grafts an audio-token head onto Gemma hidden states and decodes through frozen Mimi.

Why this is different from TTS

Attaching a normal TTS engine to Gemma's text is useful and probably the easiest production path. This experiment does something narrower and more architectural: it asks whether Gemma's hidden states can be trained to predict speech-code tokens directly. The result is still rough, but the wiring is meaningfully different from a text-to-TTS pipeline.

Tap point

Before text logits

The head reads Gemma's hidden states before the model turns them into vocabulary probabilities. The output branch is not downstream of decoded text.

Loss

Codec-token cross entropy

The training target is Mimi audio tokens from teacher WAVs. The error is measured over predicted codebook IDs, not over text tokens.

Inference

No Kyutai text prompt

For text samples, text enters Gemma only. For the audio sample, audio enters Gemma only. Mimi only receives predicted codec tokens and decodes them to waveform audio.

Seven Gemma-to-Mimi samples

These clips are from the step-500 smoke checkpoint. Each input sentence is fed to Gemma; the audio head reads Gemma hidden states and predicts Mimi tokens; frozen Mimi decodes the result. This is an overfit architecture proof, not a polished TTS model. Whisper found at least two target words in 5 of these 7 shown samples.

For each clip, Gemma receives the prompt template Say this naturally as speech:\n{text}. The {text} field is the input sentence shown below. Gemma is not generating an open-ended chat answer first; this smoke test is rendering a provided sentence through the Gemma-to-Mimi head.

Input 01

“the final layer generates audio codes for the demo.”

Gemma -> Mimi output

Whisper heard: The final area is getting a read of all the available for the development.

1 matched word: final

Input 02

“the teacher model connects the final sample for the demo.”

Gemma -> Mimi output

Whisper heard: the teacher model to connect the instrument to the video.

2 matched words: model, teacher

Input 03

“the training run maps short phrases from text prompts.”

Gemma -> Mimi output

Whisper heard: the training even when that short phrase is fast pause.

2 matched words: short, training

Input 04

“the small adapter conditions clear speech through frozen layers.”

Gemma -> Mimi output

Whisper heard: The small adapter conditions clear the picture through frozen windows.

6 matched words: adapter, clear, conditions, frozen, small, through

Input 05

“the speech head generates the voice output from text prompts.”

Gemma -> Mimi output

Whisper heard: The speech head generates the voices now.

3 matched words: generates, head, speech

Input 06

“the voice sample conditions the final sample for the demo.”

Gemma -> Mimi output

Whisper heard: of always simple conditions on and off.

1 matched word: conditions

Input 07

“the teacher model matches the waveform from text prompts.”

Gemma -> Mimi output

Whisper heard: The feature model matches the waveform with the X-POPs.

3 matched words: matches, model, waveform

Audio in, audio out

We also ran the stricter path where Gemma receives no text instruction at all. The input WAV itself says the instruction and sentence. Gemma processes that audio input, the same adapter reads Gemma's audio-conditioned hidden states, and frozen Mimi decodes the predicted codec tokens.

The checkpoint for this clip is step 850 from the audio-only continuation. This is still a narrow smoke test, but it shows that the branch can be driven by Gemma's audio-input path rather than only by text tokens.

Audio input to Gemma

“Repeat the following sentence naturally as speech. The small adapter conditions clear speech through frozen layers.”

Text passed to Gemma: none. The Gemma message contains only the audio object.

Gemma audio states -> Mimi output

Whisper heard: The small adapter conditions clear speeds through frozen mirrors.

6 matched words: adapter, clear, conditions, frozen, small, through

What this experiment tests

The samples are rough, so this should be read as an architecture smoke test, not a finished voice model. The interesting research claim is the wiring: can Gemma's decoder state predict speech-code tokens directly, without handing decoded text to a separate TTS model?

Architecture

It tests the tap point

The head does not read raw prompt embeddings. It reads the last Gemma decoder layers after Gemma has processed the prompt, before the normal text-output head scores vocabulary tokens.

Roadmap

It tests speech-token wiring

Mimi is only the codec decoder here. The learned part is Gemma-owned: it turns Gemma hidden states into speech tokens.

Product

It is not a quality claim yet

A stronger version would need larger data, temporal modeling, and cleaner evaluation before it competes with real TTS.

Gemma frozen Mimi frozen 152M trainable head params 128 train / 36 valid / 36 test 500 text smoke / 850 audio continuation 5/7 text clips plus audio-input ASR hit

What would make it real

The next version has to move beyond rendering provided phrases. A real speech-native assistant would need Gemma to generate an answer, expose the hidden states for those generated answer tokens, and train the audio head to speak that generated content reliably.

Reasoning

Speak generated answers

Run Gemma's normal autoregressive decoding loop first, then feed the hidden states for the generated answer into the audio head.

Generalization

Train beyond memorized phrases

Use a much broader text/audio set and hold out prompts that are not paraphrases of the training examples. The current samples are an overfit smoke test.

Quality

Add temporal speech modeling

The fixed parallel codec head is enough for a wiring proof. A real system needs better duration control, streaming behavior, and stronger audio-token decoding.

Why this matters for a game buddy

A Discord-native companion needs a fast brain that can fuse what people say, what is visible on screen, and what is happening in chat. Gemma 4 E4B is shaped for that kind of perception loop: image and audio can sit directly beside text in the prompt, and the decoder can answer from the combined context.

The production-simple version is still Gemma plus a low-latency voice layer: Gemma decides what to say and why it matters, while the voice layer decides how it sounds and how it handles interruption. The prototype above is the research branch. It asks whether Gemma's decoder state can become the source of the speech-token stream itself, which could eventually reduce service handoffs and keep more multimodal nuance available to the voice.

Sources and caveats

The model facts here come from Google's Gemma 4 E4B model card, the official E4B config, Google's audio understanding docs, our local MLX-VLM implementation inspection, and Kyutai's Moshi/Mimi paper for the speech-token setup. Video is shown as frame-based understanding, not a separate video-generation stream. The visual is a faithful schematic, not a tensor viewer: the chips and animation show flow and dimensionality, not real activation values. The Gemma-to-Mimi audio head described here is a smoke-test prototype, not a released Gemma capability or a production voice model.