Three types of LLM workloads and how to serve them

modal.com

75 points by charles_irl 4 months ago · 5 comments

rippeltippel 4 months ago

> Gallia est omnis divisor in partes tres.

OCD-driven fix: The correct Latin quote is "Gallia est omnis divisa in partes tres".

ZsoltT 4 months ago

> we recommend using SGLang with excess tensor parallelism and EAGLE-3 speculative decoding on live edge Hopper/Blackwell GPUs accessed via low-overhead, prefix-aware HTTP proxies

lord

  • charles_irl (OP) 4 months ago

    Sorry to lead with a bunch of jargon! Wanted to make it obvious that we'd give concrete recommendations instead of palaver.

    The technical terms there are later explained and diagrammed, and the recommendations derived from something close to first principles (e.g. roofline analysis).

omneity 4 months ago

Very cool insights, thanks for sharing!

Do you have benchmarks for the SGLang vs vLLM latency and throughput question? Not to challenge your point, but I’d like to reproduce these results and fiddle with the configs a bit, also on different models & hardware combos.

(happy modal user btw)
