llama + spec: MTP Support by am17an · Pull Request #22673 · ggml-org/llama.cpp


Only a quick test run: 1x 5090, qwen3.6-27b, mtp 3, q4_0 quantized, KV cache also q4_0.

slot launch_slot_: id  0 | task -1 | sampler chain: logits -> penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> min-p -> ?xtc -> ?temp-ext -> dist
slot launch_slot_: id  0 | task 532 | processing task, is_child = 0
slot update_slots: id  0 | task 532 | new prompt, n_ctx_slot = 200192, n_keep = 0, task.n_tokens = 16
slot update_slots: id  0 | task 532 | n_past = 3, slot.prompt.tokens.size() = 1327, seq_id = 0, pos_min = 1326, n_swa = 0
slot update_slots: id  0 | task 532 | forcing full prompt re-processing due to lack of cache data (likely due to SWA or hybrid/recurrent memory, see https://github.com/ggml-org/llama.cpp/pull/13194#issuecomment-2868343055)
slot update_slots: id  0 | task 532 | n_tokens = 0, memory_seq_rm [0, end)
srv  log_server_r: done request: POST /v1/chat/completions 192.168.178.49 200
slot update_slots: id  0 | task 532 | prompt processing progress, n_tokens = 12, batch.n_tokens = 12, progress = 0.750000
slot update_slots: id  0 | task 532 | n_tokens = 12, memory_seq_rm [12, end)
slot init_sampler: id  0 | task 532 | init sampler, took 0.01 ms, tokens: text = 16, total = 16
slot update_slots: id  0 | task 532 | prompt processing done, n_tokens = 16, batch.n_tokens = 4
slot print_timing: id  0 | task 532 |
prompt eval time =      63.16 ms /    16 tokens (    3.95 ms per token,   253.34 tokens per second)
       eval time =   56063.04 ms /  5913 tokens (    9.48 ms per token,   105.47 tokens per second)
      total time =   56126.20 ms /  5929 tokens
draft acceptance rate = 0.79728 ( 4169 accepted /  5229 generated)
statistics mtp: #calls(b,g,a) = 2 2272 1976, #gen drafts = 2272, #acc drafts = 1976, #gen tokens = 6816, #acc tokens = 4950, dur(b,g,a) = 0.007, 15393.656, 64.921 ms
slot      release: id  0 | task 532 | stop processing: n_tokens = 5928, truncated = 0
srv  update_slots: all slots are idle

Same model, same config (except without MTP):

slot update_slots: id  0 | task 0 | prompt processing done, n_tokens = 16, batch.n_tokens = 4
slot print_timing: id  0 | task 0 | 
prompt eval time =      91.85 ms /    16 tokens (    5.74 ms per token,   174.20 tokens per second)
       eval time =  103127.94 ms /  6571 tokens (   15.69 ms per token,    63.72 tokens per second)
      total time =  103219.79 ms /  6587 tokens
slot      release: id  0 | task 0 | stop processing: n_tokens = 6586, truncated = 0
srv  update_slots: all slots are idle

Prompt: "create a flappy bird clone"

(I'm not creative, sorry)

Great speedup! Generation went from 63.72 to 105.47 tokens per second, about 1.66x.
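For anyone who wants to redo the arithmetic, here is a minimal sketch in Python. The numbers are copied from the two print_timing blocks and the draft acceptance line above; the variable names are made up for illustration and are not llama.cpp identifiers. The two runs generated different numbers of tokens (5913 vs 6571), so tokens per second is compared rather than total time.

```python
# Figures copied by hand from the logs above; names are illustrative only.
mtp_tps = 105.47        # eval tokens per second, run with mtp 3
baseline_tps = 63.72    # eval tokens per second, run without MTP

accepted, generated = 4169, 5229  # from the "draft acceptance rate" line

# Ratio of generation throughput between the two runs.
speedup = mtp_tps / baseline_tps

# Fraction of drafted tokens that were accepted.
acceptance = accepted / generated

print(f"generation speedup: {speedup:.2f}x")
print(f"draft acceptance:   {acceptance:.5f}")
```

This is a single run per config on one prompt, so treat the ratio as a rough indication rather than a benchmark.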