model: (qwen3next) correct vectorized key_gdiff calculation by ngxson · Pull Request #19324 · ggml-org/llama.cpp


@ngxson

Testing with the provided prompt from #19305





CISC approved these changes Feb 4, 2026

@ngxson

Quite fun: after applying 4bfbf0b, I asked the model to identify the bug (giving it the code from before that commit). It successfully identified the problem and even suggested one more improvement (commit d871ac8).


@Mushoz

We've officially arrived at self-improving AI it looks like ;)

@ggerganov

My test cases that were failing before are now passing with this change.

@ngxson

I updated the compare-logprobs script and reran it. There are still some divergences from vLLM (presumably due to numerical precision issues), but it does look better on long context (see tokens past depth 5000):

PR

| idx | logits_llama.log | logprob_1 | logits_vllm.log | logprob_2 | diff (abs) |
|---:|---|---:|---|---:|---:|
| 1 | ' ' | -3.0408 | ' ' | -3.0440 | 0.0033 |
| 2 | '\n\n' | -0.6087 | '\n\n' | -0.5918 | 0.0170 |
| 3 | ' API' | -0.7177 | ' API' | -0.8431 | 0.1254 |
| 4 | ' lightweight' | -0.2557 | ' lightweight' | -0.2838 | 0.0281 |
| 5 | ' and' | -0.1517 | ' and' | -0.1594 | 0.0077 |
| 6 | ' C' | -0.0635 | ' C' | -0.0332 | 0.0302 |
| 7 | ' HTTP' | -0.0113 | ' HTTP' | -0.0080 | 0.0032 |
| 8 | ' server' | -0.0037 | ' server' | -0.0066 | 0.0029 |
| 9 | ' based' | -0.0240 | ' based' | -0.0691 | 0.0451 |
| 10 | ' on' | -0.0000 | ' on' | -0.0000 | 0.0000 |
| 1011 | ' GPU' | -1.0844 | ' GPU' | -1.1533 | 0.0689 |
| 1012 | ' parameters' | -0.0969 | ' parameters' | -0.1143 | 0.0175 |
| 1013 | ' to' | -0.2201 | ' to' | -0.1712 | 0.0488 |
| 1014 | ' fit' | -0.0660 | ' fit' | -0.0926 | 0.0266 |
| 1015 | ' model' | -0.1862 | ' model' | -0.3159 | 0.1297 |
| 1016 | ' available' | -0.3698 | ' available' | -0.5154 | 0.1456 |
| 1017 | ' memory' | -0.0490 | ' memory' | -0.0509 | 0.0019 |
| 1018 | ' (' | -0.0223 | ' (' | -0.0401 | 0.0178 |
| 1019 | ' or' | -0.1865 | ' or' | -0.3119 | 0.1253 |
| 1020 | ' '' | -0.0016 | ' '' | -0.0011 | 0.0005 |
| 5021 | ' tokens' | -0.0002 | ' tokens' | -0.0001 | 0.0001 |
| 5022 | ' at' | -0.0000 | ' at' | -0.0000 | 0.0000 |
| 5023 | ' a' | -0.6503 | ' minimum' | -0.6290 | 0.0213 |
| 5024 | ' Default' | -0.0021 | ' Default' | -0.0005 | 0.0015 |
| 5025 | ' `' | -0.0000 | ' `' | -0.0000 | 0.0000 |
| 5026 | ' Set' | -0.7499 | ' Time' | -0.6461 | 0.1037 |
| 5027 | ' a' | -0.1898 | ' a' | -0.2087 | 0.0189 |
| 5028 | ' time' | -0.0009 | ' time' | -0.0005 | 0.0003 |
| 5029 | ' limit' | -0.0007 | ' limit' | -0.0009 | 0.0002 |
| 5030 | ' for' | -0.6250 | ' (' | -0.7296 | 0.1046 |

master

| idx | logits_llama.log | logprob_1 | logits_vllm.log | logprob_2 | diff (abs) |
|---:|---|---:|---|---:|---:|
| 1 | ' ' | -3.0408 | ' ' | -3.0440 | 0.0033 |
| 2 | '\n\n' | -0.6088 | '\n\n' | -0.5918 | 0.0170 |
| 3 | ' API' | -0.8385 | ' API' | -0.8431 | 0.0046 |
| 4 | ' lightweight' | -0.2408 | ' lightweight' | -0.2838 | 0.0430 |
| 5 | ' pure' | -0.6919 | ' and' | -0.1594 | 0.5325 |
| 6 | ' C' | -0.0190 | ' C' | -0.0332 | 0.0142 |
| 7 | ' HTTP' | -0.0373 | ' HTTP' | -0.0080 | 0.0293 |
| 8 | ' server' | -0.0021 | ' server' | -0.0066 | 0.0044 |
| 9 | ' based' | -0.6722 | ' based' | -0.0691 | 0.6031 |
| 10 | ' on' | -0.0001 | ' on' | -0.0000 | 0.0000 |
| 1011 | ' GPU' | -1.3448 | ' GPU' | -1.1533 | 0.1915 |
| 1012 | ' GPU' | -0.6565 | ' parameters' | -0.1143 | 0.5422 |
| 1013 | ' to' | -0.7436 | ' to' | -0.1712 | 0.5724 |
| 1014 | ' fit' | -0.1738 | ' fit' | -0.0926 | 0.0812 |
| 1015 | ' model' | -0.6253 | ' model' | -0.3159 | 0.3094 |
| 1016 | ' available' | -0.9195 | ' available' | -0.5154 | 0.4041 |
| 1017 | ' memory' | -0.0584 | ' memory' | -0.0509 | 0.0075 |
| 1018 | ' (' | -0.0105 | ' (' | -0.0401 | 0.0296 |
| 1019 | '/'' | -0.9424 | ' or' | -0.3119 | 0.6305 |
| 1020 | ' '' | -0.0020 | ' '' | -0.0011 | 0.0009 |
| 5021 | ' tokens' | -0.0002 | ' tokens' | -0.0001 | 0.0001 |
| 5022 | ' at' | -0.0004 | ' at' | -0.0000 | 0.0004 |
| 5023 | ' minimum' | -0.2551 | ' minimum' | -0.6290 | 0.3738 |
| 5024 | ' Default' | -0.0005 | ' Default' | -0.0005 | 0.0000 |
| 5025 | ' `' | -0.0000 | ' `' | -0.0000 | 0.0000 |
| 5026 | ' Set' | -0.0209 | ' Time' | -0.6461 | 0.6252 |
| 5027 | ' a' | -0.3147 | ' a' | -0.2087 | 0.1061 |
| 5028 | ' time' | -0.0003 | ' time' | -0.0005 | 0.0002 |
| 5029 | ' limit' | -0.0062 | ' limit' | -0.0009 | 0.0053 |
| 5030 | ' in' | -0.5323 | ' (' | -0.7296 | 0.1973 |

@Mushoz

It does still deviate in the token that was picked at position 5030. Shouldn't numerical precision issues still result in the same token?

@ngxson

Not always; numerical differences can accumulate enough to change the output logits. It may also depend on the quantization I'm using (q8_0). I'll need to do more testing, but for now I think the current fix should already be good enough.
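A toy illustration of the point: when two logits are nearly tied, an error on the order of the accumulated numerical noise can flip which token wins, even though the winning logprob barely moves. The numbers below are synthetic, not taken from the model:

```python
import math

def log_softmax(logits):
    # numerically stable log-softmax over a plain list of floats
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

# Two logit vectors that differ only by a tiny accumulated error (~1e-3).
logits_a = [2.0000, 1.9990, 0.5000]
logits_b = [1.9990, 2.0000, 0.5000]

top_a = max(range(len(logits_a)), key=logits_a.__getitem__)
top_b = max(range(len(logits_b)), key=logits_b.__getitem__)
print(top_a, top_b)  # a different token wins in each run...

# ...yet the logprob of the winning token is essentially identical.
lp_a = log_softmax(logits_a)[top_a]
lp_b = log_softmax(logits_b)[top_b]
print(abs(lp_a - lp_b))
```

This is consistent with the tables above: positions where the picked tokens differ (e.g. 5026) can still show a small absolute logprob difference.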


@CISC

> Btw, it's quite funny watching the ollama bros copy-pasting our bugs into their "new engine" 🤣. Let's see how long it will take them to realize.

That's the only reason we write buggy code, right? *cough*



@ngxson

It should only affect I-quants, since imatrix is generated from intermediate activations.

Normal quants (Qx_0, Qx_1, Qx_K) should not be affected.
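To see why only imatrix-based quants are affected: importance-matrix quantization chooses quantization parameters that minimize an *activation-weighted* error, so if a bug changes the intermediate activations, the importance weights — and hence the chosen quantized values — change too. Below is a minimal sketch of the idea only (a hypothetical round-to-nearest grid search, not llama.cpp's actual quantization code):

```python
def quantize_row(weights, importance, bits=4, grid=64):
    """Pick the scale that minimizes importance-weighted squared error,
    then round each weight to the nearest representable level."""
    qmax = 2 ** (bits - 1) - 1
    max_w = max(abs(w) for w in weights)
    best = None
    for k in range(1, grid + 1):
        scale = max_w * k / (grid * qmax)
        q = [max(-qmax - 1, min(qmax, round(w / scale))) for w in weights]
        err = sum(imp * (w - qi * scale) ** 2
                  for w, qi, imp in zip(weights, q, importance))
        if best is None or err < best[0]:
            best = (err, scale, q)
    return best[2], best[1]

w = [0.9, -0.05, 0.3, -0.7]
q_flat, s_flat = quantize_row(w, [1.0] * 4)              # plain quant: uniform importance
q_imat, s_imat = quantize_row(w, [0.1, 10.0, 0.1, 0.1])  # imatrix-style: w[1] dominates
print(q_flat, s_flat)
print(q_imat, s_imat)
```

With uniform importance the search reduces to plain round-to-nearest, which is why Qx_0/Qx_1/Qx_K files are bit-identical before and after the activation fix, while imatrix-derived quants shift slightly.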

@CISC

Ah, yes, imatrix would be affected.

@danielhanchen

Oh yes, Q8_K_XL, Q8_0, BF16, and MXFP4_MOE are fine; the rest use imatrix, so they did change a bit.

liparetejas pushed a commit to liparetejas/llama.cpp that referenced this pull request

Feb 23, 2026
…#19324)

* model: (qwen3next) correct vectorized key_gdiff calculation

* move transpose to outside of loop
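The second commit ("move transpose to outside of loop") is standard loop-invariant hoisting: the transposed matrix does not depend on the loop index, so it can be built once instead of once per iteration. A generic pure-Python sketch of the pattern — illustrative only, not the actual ggml tensor code; the key/query names and shapes are made up:

```python
def transpose(m):
    # rows <-> columns of a nested-list matrix
    return [list(col) for col in zip(*m)]

def scores_naive(keys_cols, queries):
    out = []
    for q in queries:
        keys_rows = transpose(keys_cols)   # rebuilt every iteration, yet invariant
        out.append([sum(a * b for a, b in zip(q, k)) for k in keys_rows])
    return out

def scores_hoisted(keys_cols, queries):
    keys_rows = transpose(keys_cols)       # hoisted: built once, reused
    return [[sum(a * b for a, b in zip(q, k)) for k in keys_rows]
            for q in queries]

keys_cols = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]   # 2 dims x 3 keys, column-major
queries = [[1.0, 0.0], [0.5, 0.5]]
print(scores_naive(keys_cols, queries) == scores_hoisted(keys_cols, queries))
```

The results are identical; only the amount of redundant work changes, which is why such a commit is a pure optimization with no effect on outputs.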