Most of what we know about frontier model architecture comes from reading the tea leaves of benchmarks and reasoning backwards. Open weight releases are different, they let us see some actual architectural choices, you can find decisions that cost billions of parameters and ask what problem they're solving. Gemma-4 has a few of these, and they turn out to be quiet arguments against things we've stopped questioning.
No attention scale — QK-norm instead
Every transformer you've ever implemented probably has this line in it somewhere:
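```python
# something like this; variable names differ, but the 1/sqrt(d) factor is the constant feature
scores = q @ k.T / math.sqrt(d_head)   # then softmax, then @ v
```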
The scalar exists for one reason: when the head dimension $d$ is large, the dot product produces values with large variance and the softmax saturates. The fix has always been to divide by $\sqrt{d}$ to bring things back to unit-variance territory before the softmax can go wrong.
And we've all probably had occasions where we've forgotten to add this scale, and then been surprised when the training diverges and the model fails to learn.
Preserving this unit variance is important; in a deep model with many attention layers, a variance in the dot product that grows with $d$ would cause the softmax to produce near one-hot distributions, leading to vanishing gradients and training instability.
So it's something of a surprise that Gemma 4 removes the scalar entirely. Instead, queries and keys are each passed through RMSNorm before the dot product (values get normalised too, but magnitude-only, with no learned scale):

$$\mathrm{Attn}(Q, K, V) = \mathrm{softmax}\!\big(\mathrm{RMSNorm}_{g_q}(Q)\,\mathrm{RMSNorm}_{g_k}(K)^\top\big)\,\mathrm{RMSNorm}(V)$$

where RMSNorm with scale normalises each vector to unit RMS then rescales with a learned per-dimension gain $g$; the magnitude-only variant skips $g$ entirely:

$$\mathrm{RMSNorm}_g(x) = \frac{x}{\sqrt{\tfrac{1}{d}\sum_{i=1}^{d} x_i^2 + \epsilon}} \odot g, \qquad \mathrm{RMSNorm}(x) = \frac{x}{\sqrt{\tfrac{1}{d}\sum_{i=1}^{d} x_i^2 + \epsilon}}$$
Practice (Q181, Hard): Implement QK-norm causal attention — RMSNorm with learned scale on Q and K, magnitude-only on V, no explicit scaling factor.
QK-norm is a modern addition to the transformer architecture, now fairly common in large models; it appeared in vision transformers before making its way into LLMs.
The weight $g$ is the same size as the head dimension, so each dimension can be scaled independently. The model can learn to put more weight on certain dimensions and less on others, a more flexible way to control the variance of the dot product than a fixed global divisor. It's strictly more expressive: the $1/\sqrt{d}$ scalar is just a special case where all the $g_i$ are equal.
However, the scalar exists because of a clean variance argument. For i.i.d. zero-mean, unit-variance $q_i$ and $k_i$:

$$\mathrm{Var}\!\left(\sum_{i=1}^{d} q_i k_i\right) = d$$

The dot product variance grows with head dimension, and dividing by $\sqrt{d}$ brings it back to 1. After QK-norm, the vectors lie on a sphere of radius $\sqrt{d}$, and for i.i.d. Gaussian inputs:

$$\mathrm{Var}\!\left(\mathrm{RMSNorm}_g(q)^\top\,\mathrm{RMSNorm}_g(k)\right) \approx \sum_{i=1}^{d} g_i^4$$

Same structure, $d$ replaced by $\sum_i g_i^4$. The dependence on the head dimension remains in both cases. The difference is that unconstrained query/key projection weights can grow freely during training whereas $g$ is a stable learned parameter. (Which can just grow unconstrained itself.) You could keep both: use the $1/\sqrt{d}$ scalar for safety and let $g$ add expressivity on top. But $g$ is going to learn to compensate for $d$ regardless (it has to, or the attention breaks), so the scalar just shifts what value $g$ needs to learn. Gemma 4's argument is that you might as well let $g$ do the whole job.
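A quick numerical sanity check of both claims (a throwaway sketch, not Gemma code; magnitude-only norm shown, no learned gain):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 256, 100_000
q = rng.standard_normal((n, d))
k = rng.standard_normal((n, d))

# Raw dot product: variance grows linearly with the head dimension.
print(np.sum(q * k, axis=-1).var())                        # ~ 256

def rms_norm(x):
    return x / np.sqrt((x ** 2).mean(-1, keepdims=True))

# After magnitude-only RMSNorm the vectors sit on a sphere of radius sqrt(d),
# and the dot-product variance is still roughly d; the dependence doesn't go away.
print(np.sum(rms_norm(q) * rms_norm(k), axis=-1).var())    # ~ 256 again
```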
The tradeoff is that you're now relying on the model to learn appropriate values for $g$ rather than having a fixed fallback. Get this wrong during training and you're back to the same saturation problems the scalar was designed to prevent. In practice it seems to work, but it's worth being clear about what's been traded away.
While the model code is all available on Huggingface (and I strongly recommend you read through it), we don't have anything about the training. We don't know what kind of weight initialisation was used, or the learning rate schedule, or any of the other details that might have helped ensure that $g$ starts in a good place and stays there. So it's hard to say how much of a risk this is in practice; only that the risk exists, and with the scalar you had a backup.
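To make the mechanism concrete, here's a minimal single-head sketch of QK-norm causal attention (my own naming and shapes; the production model applies this per head inside grouped-query attention, and the gains would be trained parameters rather than arguments):

```python
import numpy as np

def qk_norm_attention(x, W_q, W_k, W_v, W_o, g_q, g_k, eps=1e-6):
    # x: (T, d_model); W_q/W_k/W_v: (d_head, d_model); g_q/g_k: (d_head,)
    def rms_norm(v, g=None):
        v = v / np.sqrt((v ** 2).mean(-1, keepdims=True) + eps)
        return v * g if g is not None else v

    q = rms_norm(x @ W_q.T, g_q)   # learned per-dimension gain on Q
    k = rms_norm(x @ W_k.T, g_k)   # learned per-dimension gain on K
    v = rms_norm(x @ W_v.T)        # magnitude-only, no learned scale

    scores = q @ k.T               # note: no 1/sqrt(d); g has to do that job
    T = x.shape[0]
    causal_mask = np.triu(np.ones((T, T), dtype=bool), k=1)
    scores = np.where(causal_mask, -np.inf, scores)
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights = weights / weights.sum(-1, keepdims=True)
    return (weights @ v) @ W_o.T
```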
Partial RoPE on global layers
Gemma 4 uses a hybrid attention stack: five local (sliding-window) layers for every one global (full-context) layer. Most tokens only need to attend to nearby context, so local layers handle the bulk of the work cheaply, and global layers let information flow across the full 256K window.
Local layers use standard RoPE with a conventional base, nothing unusual. Global layers are where it gets interesting.
Standard RoPE applies a rotation to every dimension pair in the query and key vectors using frequencies $\omega_i = \theta^{-2i/d}$. Global layers in Gemma 4 only rotate the first 25% of pairs. Global layers also use a larger head dimension than local ones (512 vs 256), so: of the 256 dimension pairs in a global head, only the first 64 get rotated.
There is precedent for this; NOPE removes positional encoding entirely, and some models have used partial PE before. The specific combination of 25% and the large base is new as far as I know.
The remaining 75% of dimension pairs pass through untouched, with no positional signal. And the base $\theta$ for the rotated dimensions jumps to a much larger value than the local layers use, which spreads the positional frequencies far apart to cover the long context window without wrapping.
The rotation for each pair at position $p$ is the standard RoPE formula:

$$\begin{pmatrix} x'_{2i} \\ x'_{2i+1} \end{pmatrix} = \begin{pmatrix} \cos(p\,\omega_i) & -\sin(p\,\omega_i) \\ \sin(p\,\omega_i) & \cos(p\,\omega_i) \end{pmatrix} \begin{pmatrix} x_{2i} \\ x_{2i+1} \end{pmatrix}, \qquad \omega_i = \theta^{-2i/d}$$
Only the first 25% of the pairs go through this. The rest receive inv_freq = 0, which means the rotation is always identity; those dimensions carry no positional signal.
Practice (Q183, Medium): Partial Rotary Positional Embeddings. Implement partial RoPE — apply the rotation to the first partial_rotary_factor fraction of dimension pairs and leave the rest unchanged.
Why keep the high-frequency end and drop the rest? The rotated dimensions (the first 25% by index) are the ones with the shortest periods: even the fastest has a period of a few steps, and the slowest of the rotated set has a period of a few hundred. Beyond that range, the rotation has wrapped around enough times to become incoherent.
This is not a problem: global layers don't need to know that a token is 250,000 positions away rather than 251,000. Attention is always a mixture of content matching and positional bias, but the balance shifts with the length scale involved. For nearby tokens, exact order matters; syntactic structure is local, and the local layers have full RoPE to capture it. For long-range dependencies, what matters is that the relevant token exists and what it contains, not exactly where it sits. The global layers can get by with a much weaker positional signal and rely primarily on content routing. Making 75% of dimensions position-free does exactly that, while the rotated 25% still provides short-range positional grounding for tokens that end up in the same global window.
This is the same intuition behind NOPE, which removes positional encoding entirely. The local layers (five for every global) parse syntax and short-range structure with full RoPE. The global layers inherit those representations and route between them without being constrained by position.
This is also worth noting because the underlying phenomenon (RoPE becoming incoherent at long range) isn't new. It's been true of every model using RoPE with a standard context window; we just don't have to think about it when training at 2048 tokens. Context lengths of 256K force a more principled answer. Open-weight releases are one way to see what that answer looks like in practice.
```python
import numpy as np

def partial_rope(x, positions, theta, partial_rotary_factor):
    B, T, H, D = x.shape
    rope_dim = int(partial_rotary_factor * (D // 2))
    if rope_dim == 0:
        return x.copy()

    # Only compute frequencies for the rotated pairs.
    # Denominator is full D, not 2*rope_dim — frequencies sit proportionally
    # where they would in standard RoPE, covering the high-frequency end.
    i = np.arange(rope_dim, dtype=float)
    inv_freq = 1.0 / (theta ** (2.0 * i / D))
    angles = positions[:, None].astype(float) * inv_freq[None, :]  # (T, rope_dim)
    cos_a, sin_a = np.cos(angles), np.sin(angles)

    result = x.copy()  # unrotated dims are already correct
    x1 = x[..., :rope_dim]                     # first half, rotated slice
    x2 = x[..., D // 2 : D // 2 + rope_dim]    # second half, rotated slice
    c = cos_a[None, :, None, :]
    s = sin_a[None, :, None, :]
    result[..., :rope_dim] = x1 * c - x2 * s
    result[..., D // 2 : D // 2 + rope_dim] = x1 * s + x2 * c
    return result
```
The unrotated 75% never touch the trig functions; we let x.copy() handle them. The rotated 25% use frequencies computed against the full D, not 2 * rope_dim, which keeps them proportional to where they'd sit in standard RoPE rather than artificially expanding to fill the available range.
Per-layer input gating
Gemma 4 uses the tanh-approximate GELU (gelu_pytorch_tanh), not the exact erf form. The difference is numerically negligible, but the coding question uses the exact form (math.erf from stdlib) which is easier to implement without external dependencies.
This one is worth reading carefully, because the implication is more interesting than the description suggests. On the face of it, it's just an additional embedding guiding each token representation through the model, reminding the model what the token is at each layer. But the fact that it's there at all goes against a central assumption about how transformers work: that the residual stream can be trusted to carry everything downstream layers need.
So, how does this work? Each layer receives a small signal $s$ derived from the original input token, independent of whatever has accumulated in the residual stream, and uses the hidden state to gate a low-dimensional update built from it. This runs after attention and the MLP have already processed the hidden state; it's an additional operation on top of normal transformer processing:

$$h \leftarrow h + \mathrm{RMSNorm}\!\left(W_{\text{proj}}\,\big(\mathrm{GELU}(W_{\text{gate}}\,h)\odot s\big)\right)$$

$W_{\text{gate}}$ projects $h$ down to 256 dimensions, GELU is applied, and the result is multiplied element-wise ($\odot$) by $s$ before $W_{\text{proj}}$ projects back up. The hidden state determines which dimensions of the token signal get through; the GELU activations serve as the gate.
Compare this to a standard FFN, which has no such gating and works in the opposite direction:

$$\mathrm{FFN}(h) = W_{\text{down}}\,\mathrm{GELU}(W_{\text{up}}\,h)$$

In a standard FFN, $W_{\text{up}}$ projects the hidden state up to roughly four times the hidden dimension before $W_{\text{down}}$ contracts it back, expanding capacity then compressing. The per-layer gating does the opposite: it projects down to 256 (a small bottleneck), gates that compressed representation with the token signal, then expands back up. The compression serves a different purpose; it's not trying to add capacity in the way an FFN does, but to let the current hidden state selectively read from a fixed token-identity signal.
The signal s is itself a combination of two things. The first is a dedicated embedding table, separate from the main vocab embeddings, with shape (vocab_size, num_layers × 256). Every token in the vocabulary has a distinct 256-dimensional vector per layer: the signal layer 10 receives for the word "parsnip" is independently learned from the signal layer 25 receives for the same word. The full table is around 2 billion parameters for a 30-layer model. The second is a linear projection of the main token embeddings at the input, which provides a complementary signal based on the model's initial representation rather than raw token identity. Both are combined before being sliced per layer.
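A rough sketch of how that combined signal could be assembled (shapes follow the description above; the names, and the choice of a plain sum to combine the two paths, are my assumptions rather than the actual implementation):

```python
import numpy as np

def build_per_layer_signals(token_ids, embed_table, per_layer_table, W_per_layer,
                            num_layers, d_signal=256):
    # Path 1: dedicated per-layer embedding table, shape (vocab_size, num_layers * d_signal).
    direct = per_layer_table[token_ids]                 # (T, num_layers * d_signal)

    # Path 2: linear projection of the main token embeddings at the input.
    projected = embed_table[token_ids] @ W_per_layer.T  # (T, num_layers * d_signal)

    # Combine (a plain sum here, as an assumption), then slice one signal per layer.
    combined = direct + projected
    return combined.reshape(len(token_ids), num_layers, d_signal)  # layer L reads [:, L, :]
```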
Practice (Q184, Medium): Implement the per-layer input gating — project down, GELU, gate with the per-layer signal, project back, add to the residual.
What's notable is that neither of these signals comes from the hidden states; they are derived from the original input. Every layer has a direct channel back to token identity that bypasses the residual stream entirely. By layer 30, the hidden state may carry very little trace of which specific token sat at a given position; it's been transformed repeatedly by attention and MLP operations, and the residual stream is a dense mix of contextual information. The per-layer signal is a way of saying "regardless of what you've accumulated, this is what that token was."
Personally I find this a very interesting architectural choice, the implication being that token identity is fundamentally important to the relationships between tokens deeper in the model, and the residual stream alone isn't a good way to preserve it. But more than that, it challenges the assumption that internal representations are rich enough to reconstruct token identity from the residual stream. If that were true, the per-layer signal would be redundant. The fact that it's there, backed by a 2B parameter table, suggests the model designers don't think that's the case.
What this may be suggesting is that while token identity is certainly present in the residual stream, it's not necessarily preserved as strongly as we might have thought. The model can reconstruct it to some extent, but if it's not a priority for next-token prediction there is little to force it to be preserved. The per-layer signal is a way of ensuring that token identity is always available in a clean form, without relying on the hidden states to preserve it through the layers. It may even free up capacity in the hidden states for other things, since they no longer have to carry token identity as strongly.
We don't know whether this is a problem that only surfaces at scale or a more general property of deep transformers trained on next-token prediction; it will be interesting to see whether the pattern shows up in smaller models going forward.
```python
import numpy as np
import math

def per_layer_gate(hidden, per_layer_signal, W_gate, W_proj, norm_weight, eps=1e-6):
    def gelu(x):
        return 0.5 * x * (1.0 + np.vectorize(math.erf)(x / math.sqrt(2.0)))

    def rms_norm(x, w):
        rms = np.sqrt((x ** 2).mean(-1, keepdims=True) + eps)
        return (x / rms) * w

    h = hidden @ W_gate.T       # project DOWN to 256 — opposite of a standard FFN
    h = gelu(h)                 # activation acts as the gate
    h = h * per_layer_signal    # the hidden state selects which dims of the token signal pass through
    h = h @ W_proj.T            # project back up to d_model
    return hidden + rms_norm(h, norm_weight)
```
The contrast with a standard FFN is in the projection direction: an FFN projects up (expanding capacity), this projects down (compressing to a bottleneck). The compression isn't about capacity; it's about selectivity. With only 256 dimensions and a GELU gate, everything the layer wants to read from the token-identity signal has to fit through that bottleneck, so the hidden state is forced to suppress or amplify specific aspects of it rather than take the whole thing.
KV sharing across layers
The last several layers of each Gemma 4 model don't compute their own key and value projections. They reuse $K$ and $V$ from an earlier anchor layer.
KV sharing across layers is a fairly well-established technique for reducing KV cache memory at inference time, and Gemma 4 uses it in the last block of layers. The basic idea is that several consecutive layers share the same key and value projections from a single anchor layer, rather than each computing their own.
The anchor computes $K$ and $V$ and stores them. Shared layers above it skip those projections entirely and pull the stored tensors directly. Each shared layer still computes its own $Q$, so the attention patterns can differ across the shared block even though K and V are the same.
Practice (Q185, Hard): Implement a two-layer KV-sharing attention block — anchor layer computes and stores K, V; shared layer reuses them with its own Q.
One detail worth noting: Gemma 4 uses two separate attention types (sliding and global), and each type has its own anchor. Shared sliding layers pull from the last non-shared sliding layer, shared global layers pull from the last non-shared global layer.
In the A4B MoE, 57% of layers are shared this way. The MLP at shared layers is doubled to compensate, giving those layers more feed-forward capacity to make up for not having their own K and V projections.
The practical win is straightforward: you only need to store $K$ and $V$ for each anchor rather than for every layer on top of it, which can meaningfully reduce KV cache size in production. For the shared layers to work well, the key and value representations from the anchor need to be useful enough that each layer can do something reasonable with just its own $Q$, or the doubled MLP has to compensate for whatever is lost.
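A two-layer sketch of the mechanism (single head, plain scaled dot-product attention for brevity rather than the QK-norm variant described earlier, and all names are mine):

```python
import numpy as np

def causal_attend(q, k, v):
    T, d = q.shape
    scores = q @ k.T / np.sqrt(d)
    scores = np.where(np.triu(np.ones((T, T), dtype=bool), k=1), -np.inf, scores)
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights = weights / weights.sum(-1, keepdims=True)
    return weights @ v

def kv_shared_pair(x_anchor, x_shared, Wq_a, Wk_a, Wv_a, Wq_s):
    # Anchor layer: computes and caches its own K and V.
    k = x_anchor @ Wk_a.T
    v = x_anchor @ Wv_a.T
    anchor_out = causal_attend(x_anchor @ Wq_a.T, k, v)

    # Shared layer: no K/V projections of its own. It reuses the anchor's cached
    # tensors and only computes a fresh Q, so its attention pattern can still
    # differ from the anchor's even though K and V are identical.
    shared_out = causal_attend(x_shared @ Wq_s.T, k, v)
    return anchor_out, shared_out
```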
And then there's the MoE
The A4B model is a mixture-of-experts variant. In DeepSeek, Qwen, and Mixtral, the standard MoE playbook replaces the dense MLP with a router plus a set of expert networks, each token routed to a small number $k$ of them, with the dense path gone.
In Gemma 4, they have both an MoE block and a parallel dense MLP. This initially looks like an "always-on" expert in MoE parlance, but it's more than that. The dense MLP is a full feed-forward layer that runs for every token, and the MoE block sits alongside it as an additional path. This is a substantial commitment in terms of compute and parameters per token: the MoE isn't a cheap add-on, it's a full second feed-forward block that runs in parallel to the dense MLP. The model is paying the cost of both for every token, regardless of routing decisions.
We can sketch out the structure like this:
mlp_out = dense_mlp(pre_ffn_norm(residual))
expert_out = experts(pre_ffn_norm_2(residual), routing_from(residual))
output = post_norm(mlp_out + expert_out) + residual
The implication is that every token gets a full dense feed-forward pass regardless of routing. A bad routing choice costs you the expert contribution but the dense path runs unconditionally. This is a different tradeoff from standard MoE, where routing is dominant. Whether this is conservative engineering or a deliberate architectural decision about what the dense path should be responsible for, we don't know without the training details.
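Fleshing that sketch out a little (a toy version only: plain two-matrix FFNs instead of the real gated ones, learned norm scales dropped, and dense_W, expert_Ws, router_W are hypothetical names):

```python
import numpy as np

def rms_norm(x, eps=1e-6):
    return x / np.sqrt((x ** 2).mean(-1, keepdims=True) + eps)

def gelu(x):
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

def ffn(x, W_up, W_down):
    return gelu(x @ W_up.T) @ W_down.T

def dense_plus_moe_block(residual, dense_W, expert_Ws, router_W, top_k=2):
    # Dense path: a full feed-forward pass for every token, no routing involved.
    mlp_out = ffn(rms_norm(residual), *dense_W)

    # MoE path: route each token to its top-k experts and mix their outputs.
    normed = rms_norm(residual)
    logits = residual @ router_W.T                      # routing_from(residual)
    expert_out = np.zeros_like(residual)
    for t in range(residual.shape[0]):
        chosen = np.argsort(logits[t])[-top_k:]
        w = np.exp(logits[t, chosen] - logits[t, chosen].max())
        w /= w.sum()
        for wi, e in zip(w, chosen):
            expert_out[t] += wi * ffn(normed[t], *expert_Ws[e])

    # Both contributions go through the post-norm and back onto the residual.
    return rms_norm(mlp_out + expert_out) + residual
```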
Is it a big change or not?
Individually these might seem incremental. Taken together, and especially the per-layer embedding, they hint that something bigger may be starting to move. The dominant intuition about transformers is that tokens are embedded into rich high-dimensional spaces, and the residual stream is rich enough to preserve whatever information about those tokens is needed for next-token prediction. The per-layer embedding challenges that assumption by providing a direct channel back to token identity at every layer, suggesting that the model designers don't think the residual stream alone is sufficient.
We don't yet know if this is a fundamental shift in how we think about token information flow, or just a pragmatic solution to a problem that becomes more acute at scale. But it's a departure from the standard recipe, and it made it into a 144B production model, which suggests someone thought it was important. Not just some ablation that survived.