Scaling LLama2-70B with Multiple Nvidia/AMD GPU

blog.mlc.ai

13 points by junrushao1994 2 years ago · 6 comments

junrushao1994 (OP) 2 years ago

Machine Learning Compilation (MLC) now supports compiling LLMs to multiple GPUs.

It runs 4-bit quantized Llama2-70B at:

- 34.5 tok/sec on two NVIDIA RTX 4090 at $3k

- 29.9 tok/sec on two AMD Radeon 7900XTX at $2k

- It also scales well to 8 A10G/A100 GPUs in our experiments.

Details:

- Blog post: https://blog.mlc.ai/2023/10/19/Scalable-Language-Model-Infer...

- Project: https://github.com/mlc-ai/mlc-llm
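
For anyone who wants to poke at this from Python, here is a minimal sketch of running a pre-compiled artifact (the ChatModule API, model string, and quantization suffix are assumptions on my part; the blog post and repo above have the actual compile and run commands):

    # Minimal sketch, assuming the mlc_chat Python package's ChatModule API.
    # The model name below is illustrative; the artifact must already be compiled
    # with MLC's multi-GPU (tensor-parallel) settings described in the blog post.
    from mlc_chat import ChatModule

    cm = ChatModule(model="Llama-2-70b-chat-hf-q4f16_1")
    print(cm.generate(prompt="Explain tensor parallelism in one paragraph."))
    print(cm.stats())  # runtime stats, including prefill/decode throughput in tok/s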

brucethemoose2 2 years ago

For those suffering from deceptive graph fatigue, this is impressive.

exLlama is blazing fast. Even if they just benched exllamav1, exllamav2 is only a bit faster, at least on my single 3090 in a similar environment.

vLLM is focused more on batching performance, but even then MLC/TVM looks like it's putting up a fight without batching.

I am a bit fatigued with llama backends myself, and it looks like this won't help me run 70B in a single 3090, but I need to dig into mlc again.

l3jin 2 years ago

Universal deployment is indeed attractive. I have tested Llama2-70B on the 7900 XTX. Love the performance!

Also saw a report earlier today on MLC’s discord about AMD MI-100:

GPU Count | Model Size | Prefill Speed (tok/s) | Decode Speed (tok/s)
1 | 33B | 102.2 | 22.3
2 | 33B | 112.3 | 33.0
4 | 33B | 144.8 | 41.2
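
A quick back-of-the-envelope check on those decode numbers (speeds assumed to be tok/s, as elsewhere in the thread):

    # Speedup and parallel efficiency implied by the MI-100 decode speeds above
    decode = {1: 22.3, 2: 33.0, 4: 41.2}
    for gpus, speed in decode.items():
        speedup = speed / decode[1]
        print(f"{gpus} GPU(s): {speed} tok/s, {speedup:.2f}x speedup, "
              f"{speedup / gpus:.0%} parallel efficiency")
    # 2 GPUs come out to ~1.48x (74% efficiency), 4 GPUs to ~1.85x (46%)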

jinhongyii 2 years ago

The performance is really amazing at such a low cost.

zhye 2 years ago

Serving LLMs with AMD GPUs looks impressive; MLC is evolving fast! Any results with NVLink/xGMI instead of PCIe?
