Testing with the provided prompt from #19305
CISC approved these changes.
I updated the compare-logprobs script and reran it. There are still some divergences from vLLM (presumably due to numerical issues), but it does look better at long context (see tokens past depth 5000):
PR
| idx | logits_llama.log | logprob_1 | logits_vllm.log | logprob_2 | diff (abs) |
|---|---|---|---|---|---|
| 1 | ' ' | -3.0408 | ' ' | -3.0440 | 0.0033 |
| 2 | '\n\n' | -0.6087 | '\n\n' | -0.5918 | 0.0170 |
| 3 | ' API' | -0.7177 | ' API' | -0.8431 | 0.1254 |
| 4 | ' lightweight' | -0.2557 | ' lightweight' | -0.2838 | 0.0281 |
| 5 | ' and' | -0.1517 | ' and' | -0.1594 | 0.0077 |
| 6 | ' C' | -0.0635 | ' C' | -0.0332 | 0.0302 |
| 7 | ' HTTP' | -0.0113 | ' HTTP' | -0.0080 | 0.0032 |
| 8 | ' server' | -0.0037 | ' server' | -0.0066 | 0.0029 |
| 9 | ' based' | -0.0240 | ' based' | -0.0691 | 0.0451 |
| 10 | ' on' | -0.0000 | ' on' | -0.0000 | 0.0000 |
| 1011 | ' GPU' | -1.0844 | ' GPU' | -1.1533 | 0.0689 |
| 1012 | ' parameters' | -0.0969 | ' parameters' | -0.1143 | 0.0175 |
| 1013 | ' to' | -0.2201 | ' to' | -0.1712 | 0.0488 |
| 1014 | ' fit' | -0.0660 | ' fit' | -0.0926 | 0.0266 |
| 1015 | ' model' | -0.1862 | ' model' | -0.3159 | 0.1297 |
| 1016 | ' available' | -0.3698 | ' available' | -0.5154 | 0.1456 |
| 1017 | ' memory' | -0.0490 | ' memory' | -0.0509 | 0.0019 |
| 1018 | ' (' | -0.0223 | ' (' | -0.0401 | 0.0178 |
| 1019 | ' or' | -0.1865 | ' or' | -0.3119 | 0.1253 |
| 1020 | ' '' | -0.0016 | ' '' | -0.0011 | 0.0005 |
| 5021 | ' tokens' | -0.0002 | ' tokens' | -0.0001 | 0.0001 |
| 5022 | ' at' | -0.0000 | ' at' | -0.0000 | 0.0000 |
| 5023 | ' a' | -0.6503 | ' minimum' | -0.6290 | 0.0213 |
| 5024 | ' Default' | -0.0021 | ' Default' | -0.0005 | 0.0015 |
| 5025 | ' `' | -0.0000 | ' `' | -0.0000 | 0.0000 |
| 5026 | ' Set' | -0.7499 | ' Time' | -0.6461 | 0.1037 |
| 5027 | ' a' | -0.1898 | ' a' | -0.2087 | 0.0189 |
| 5028 | ' time' | -0.0009 | ' time' | -0.0005 | 0.0003 |
| 5029 | ' limit' | -0.0007 | ' limit' | -0.0009 | 0.0002 |
| 5030 | ' for' | -0.6250 | ' (' | -0.7296 | 0.1046 |
master
| idx | logits_llama.log | logprob_1 | logits_vllm.log | logprob_2 | diff (abs) |
|---|---|---|---|---|---|
| 1 | ' ' | -3.0408 | ' ' | -3.0440 | 0.0033 |
| 2 | '\n\n' | -0.6088 | '\n\n' | -0.5918 | 0.0170 |
| 3 | ' API' | -0.8385 | ' API' | -0.8431 | 0.0046 |
| 4 | ' lightweight' | -0.2408 | ' lightweight' | -0.2838 | 0.0430 |
| 5 | ' pure' | -0.6919 | ' and' | -0.1594 | 0.5325 |
| 6 | ' C' | -0.0190 | ' C' | -0.0332 | 0.0142 |
| 7 | ' HTTP' | -0.0373 | ' HTTP' | -0.0080 | 0.0293 |
| 8 | ' server' | -0.0021 | ' server' | -0.0066 | 0.0044 |
| 9 | ' based' | -0.6722 | ' based' | -0.0691 | 0.6031 |
| 10 | ' on' | -0.0001 | ' on' | -0.0000 | 0.0000 |
| 1011 | ' GPU' | -1.3448 | ' GPU' | -1.1533 | 0.1915 |
| 1012 | ' GPU' | -0.6565 | ' parameters' | -0.1143 | 0.5422 |
| 1013 | ' to' | -0.7436 | ' to' | -0.1712 | 0.5724 |
| 1014 | ' fit' | -0.1738 | ' fit' | -0.0926 | 0.0812 |
| 1015 | ' model' | -0.6253 | ' model' | -0.3159 | 0.3094 |
| 1016 | ' available' | -0.9195 | ' available' | -0.5154 | 0.4041 |
| 1017 | ' memory' | -0.0584 | ' memory' | -0.0509 | 0.0075 |
| 1018 | ' (' | -0.0105 | ' (' | -0.0401 | 0.0296 |
| 1019 | '/'' | -0.9424 | ' or' | -0.3119 | 0.6305 |
| 1020 | ' '' | -0.0020 | ' '' | -0.0011 | 0.0009 |
| 5021 | ' tokens' | -0.0002 | ' tokens' | -0.0001 | 0.0001 |
| 5022 | ' at' | -0.0004 | ' at' | -0.0000 | 0.0004 |
| 5023 | ' minimum' | -0.2551 | ' minimum' | -0.6290 | 0.3738 |
| 5024 | ' Default' | -0.0005 | ' Default' | -0.0005 | 0.0000 |
| 5025 | ' `' | -0.0000 | ' `' | -0.0000 | 0.0000 |
| 5026 | ' Set' | -0.0209 | ' Time' | -0.6461 | 0.6252 |
| 5027 | ' a' | -0.3147 | ' a' | -0.2087 | 0.1061 |
| 5028 | ' time' | -0.0003 | ' time' | -0.0005 | 0.0002 |
| 5029 | ' limit' | -0.0062 | ' limit' | -0.0009 | 0.0053 |
| 5030 | ' in' | -0.5323 | ' (' | -0.7296 | 0.1973 |
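For reference, the per-token diff column in the tables above can be reproduced with a few lines of Python. This is only a minimal sketch of the comparison, not the actual compare-logprobs script; it assumes each log has already been parsed into a list of `(token, logprob)` pairs:

```python
# Minimal sketch of the per-token logprob comparison shown in the tables.
# Assumes both logs were parsed into (token, logprob) lists; the real
# compare-logprobs script's log format may differ.

def compare_logprobs(run_a, run_b):
    """Yield (idx, tok_a, lp_a, tok_b, lp_b, abs_diff) rows, 1-indexed."""
    for idx, ((tok_a, lp_a), (tok_b, lp_b)) in enumerate(zip(run_a, run_b), start=1):
        yield idx, tok_a, lp_a, tok_b, lp_b, abs(lp_a - lp_b)

# Example values taken from rows 3-4 of the PR table above.
run_llama = [(" API", -0.7177), (" lightweight", -0.2557)]
run_vllm = [(" API", -0.8431), (" lightweight", -0.2838)]

for idx, tok_a, lp_a, tok_b, lp_b, diff in compare_logprobs(run_llama, run_vllm):
    print(f"| {idx} | '{tok_a}' | {lp_a:.4f} | '{tok_b}' | {lp_b:.4f} | {diff:.4f} |")
```

A mismatch in the token columns (as at idx 5023 and 5026 above) is then immediately visible even when the absolute logprob diff is small.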
It does still deviate in the token that was picked at position 5030. Shouldn't numerical precision issues still result in the same token?
Not always; numerical differences can accumulate enough to change the output logits. But I think it may depend on the quantization I'm using (q8_0). Will need to do more testing, but for now I think the current fix should already be good enough.
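To illustrate that point with a toy example (hypothetical candidates and values, not taken from any real model): when two candidate logits are nearly tied, an accumulated error comparable to the per-token diffs in the tables above (~0.1) is enough to flip which token is picked greedily:

```python
# Toy example: two near-tied candidates. A perturbation on the order of the
# logprob diffs observed above can flip the argmax at such positions.
logits_exact = {" Set": 1.00, " Time": 0.95}          # hypothetical values
logits_perturbed = {" Set": 1.00 - 0.08, " Time": 0.95}  # small accumulated error

def pick(logits):
    """Greedy selection: return the highest-logit token."""
    return max(logits, key=logits.get)

print(pick(logits_exact))      # ' Set'
print(pick(logits_perturbed))  # ' Time'
```

This is why two backends can agree almost everywhere yet diverge exactly at positions like 5023 and 5026, where the top candidates are close.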
Btw, it's quite funny watching the ollama bros copy-pasting our bugs into their "new engine" 🤣. Let's see how long it will take them to realize.
That's the only reason we write buggy code, right? * cough *
It should only affect I-quants, since the imatrix is generated from intermediate activations.
Normal quants (Qx_0, Qx_1, Qx_K) should not be affected.
liparetejas pushed a commit to liparetejas/llama.cpp that referenced this pull request
…#19324)

* model: (qwen3next) correct vectorized key_gdiff calculation
* move transpose to outside of loop

