Only a quick test run: 1x 5090, qwen3.6-27b, MTP 3, q4_0 quantized, KV cache also q4_0.
slot launch_slot_: id 0 | task -1 | sampler chain: logits -> penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> min-p -> ?xtc -> ?temp-ext -> dist
slot launch_slot_: id 0 | task 532 | processing task, is_child = 0
slot update_slots: id 0 | task 532 | new prompt, n_ctx_slot = 200192, n_keep = 0, task.n_tokens = 16
slot update_slots: id 0 | task 532 | n_past = 3, slot.prompt.tokens.size() = 1327, seq_id = 0, pos_min = 1326, n_swa = 0
slot update_slots: id 0 | task 532 | forcing full prompt re-processing due to lack of cache data (likely due to SWA or hybrid/recurrent memory, see https://github.com/ggml-org/llama.cpp/pull/13194#issuecomment-2868343055)
slot update_slots: id 0 | task 532 | n_tokens = 0, memory_seq_rm [0, end)
srv log_server_r: done request: POST /v1/chat/completions 192.168.178.49 200
slot update_slots: id 0 | task 532 | prompt processing progress, n_tokens = 12, batch.n_tokens = 12, progress = 0.750000
slot update_slots: id 0 | task 532 | n_tokens = 12, memory_seq_rm [12, end)
slot init_sampler: id 0 | task 532 | init sampler, took 0.01 ms, tokens: text = 16, total = 16
slot update_slots: id 0 | task 532 | prompt processing done, n_tokens = 16, batch.n_tokens = 4
slot print_timing: id 0 | task 532 |
prompt eval time = 63.16 ms / 16 tokens ( 3.95 ms per token, 253.34 tokens per second)
eval time = 56063.04 ms / 5913 tokens ( 9.48 ms per token, 105.47 tokens per second)
total time = 56126.20 ms / 5929 tokens
draft acceptance rate = 0.79728 ( 4169 accepted / 5229 generated)
statistics mtp: #calls(b,g,a) = 2 2272 1976, #gen drafts = 2272, #acc drafts = 1976, #gen tokens = 6816, #acc tokens = 4950, dur(b,g,a) = 0.007, 15393.656, 64.921 ms
slot release: id 0 | task 532 | stop processing: n_tokens = 5928, truncated = 0
srv update_slots: all slots are idle
Same model, same config (except without MTP):
slot update_slots: id 0 | task 0 | prompt processing done, n_tokens = 16, batch.n_tokens = 4
slot print_timing: id 0 | task 0 |
prompt eval time = 91.85 ms / 16 tokens ( 5.74 ms per token, 174.20 tokens per second)
eval time = 103127.94 ms / 6571 tokens ( 15.69 ms per token, 63.72 tokens per second)
total time = 103219.79 ms / 6587 tokens
slot release: id 0 | task 0 | stop processing: n_tokens = 6586, truncated = 0
srv update_slots: all slots are idle
Prompt: "create a flappy bird clone"
(I'm not creative, sorry)
Great speedup! Generation goes from 63.72 to 105.47 tokens per second with MTP, roughly 1.66x.
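
For anyone who wants to redo the arithmetic, here is a minimal Python sketch that recomputes the generation throughput, the speedup, and the draft acceptance rate from the numbers in the two logs above. The helper name `tokens_per_second` and the variable names are made up for illustration; only the numeric values come from the logs. It is a single run of one prompt with different output lengths, so treat the ratio as a rough indication rather than a benchmark.

```python
def tokens_per_second(eval_ms: float, n_tokens: int) -> float:
    """Generation throughput from an 'eval time' log line (hypothetical helper)."""
    return n_tokens / (eval_ms / 1000.0)


# Values copied from the 'eval time' lines of the two runs above.
mtp_tps = tokens_per_second(56063.04, 5913)    # run with MTP
base_tps = tokens_per_second(103127.94, 6571)  # run without MTP

# Ratio of generation throughput, MTP vs. no MTP.
speedup = mtp_tps / base_tps

# Values copied from the 'draft acceptance rate' line of the MTP run.
acceptance = 4169 / 5229

print(f"with MTP:    {mtp_tps:.2f} tokens/s")
print(f"without MTP: {base_tps:.2f} tokens/s")
print(f"speedup:     {speedup:.2f}x")
print(f"acceptance:  {acceptance:.5f}")
```

Prompt processing is not included in the ratio on purpose: with only 16 prompt tokens, those timings are too short to say much.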