Initial tests with parallel decoding in llama.cpp.

A simulated server processing 64 client requests with 32 decoding streams on M2 Ultra. Supports hot-plugging of new sequences. Model is 30B LLaMA F16.

~4000 tokens (994 prompt + 3001 gen) with system prompt of 305 tokens in 46s https://t.co/c5e1txZvzD
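For a rough sense of scale, the figures quoted in the post can be turned into throughput numbers. This is only a back-of-the-envelope sketch: it assumes the 46s covers both prompt processing and generation, and that the work is spread evenly over all 32 streams, neither of which the post states. The variable names are illustrative and not part of llama.cpp.

```python
# Figures quoted in the post.
prompt_tokens = 994
generated_tokens = 3001
elapsed_s = 46
streams = 32

# Assumption: the 46 s wall time covers prompt processing and generation together.
total_tokens = prompt_tokens + generated_tokens
aggregate_tps = total_tokens / elapsed_s

# Assumption: generation is spread evenly across all 32 streams for the whole run.
generation_tps = generated_tokens / elapsed_s
per_stream_generation_tps = generation_tps / streams

print(f"total tokens:              {total_tokens}")
print(f"aggregate tokens/s:        {aggregate_tps:.1f}")
print(f"generated tokens/s:        {generation_tps:.1f}")
print(f"generated tokens/s/stream: {per_stream_generation_tps:.2f}")
```

Under those assumptions the aggregate rate comes out at roughly 87 tokens per second. The per-stream figure is a lower bound at best, since streams are hot-plugged and need not all be active at once.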




9:33 PM · Sep 19, 2023 · 63.5K Views
