Initial tests with parallel decoding in llama.cpp

A simulated server processing 64 client requests with 32 decoding streams on M2 Ultra. Supports hot-plugging of new sequences. Model is 30B LLaMA F16

~4000 tokens (994 prompt + 3001 gen) with system prompt of 305 tokens in 46s

https://t.co/c5e1txZvzD
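For a rough sense of scale, the quoted numbers can be turned into aggregate throughput. This is only a back-of-the-envelope sketch: it assumes the 46s covers both prompt processing and generation across all 64 requests, and it leaves the 305-token system prompt out because the post does not say whether that is counted in the 994 prompt tokens. The variable names are illustrative, not taken from llama.cpp.

```python
# Figures quoted in the post.
prompt_tokens = 994
generated_tokens = 3001
elapsed_s = 46
client_requests = 64

# Total tokens handled during the run (the post rounds this to ~4000).
total_tokens = prompt_tokens + generated_tokens

# Aggregate rates over the whole run, assuming 46s is end-to-end wall time.
total_tps = total_tokens / elapsed_s
gen_tps = generated_tokens / elapsed_s

# Average generated tokens per simulated client request.
avg_gen_per_request = generated_tokens / client_requests

print(f"total tokens:            {total_tokens}")
print(f"aggregate throughput:    {total_tps:.1f} tokens/s")
print(f"generation throughput:   {gen_tps:.1f} tokens/s")
print(f"avg generated / request: {avg_gen_per_request:.1f} tokens")
```

These are averages over the full run, not per-stream decoding speeds; the post does not give enough detail to say how evenly the 32 streams were occupied.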


