Llama 405B 506 tokens/second on an H200

developer.nvidia.com

21 points by moondistance a year ago · 5 comments

EgoIncarnate a year ago

not "an H200", "In the table above, tensor parallelism is compared to pipeline parallelism with each across eight GPUs"

  • FanaHOVA a year ago

    Title on HN is wrong. The article says GPUs and it's referring to one of their 8xH200 boxes.
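
    The exchange above hinges on the difference between tensor parallelism (every layer sharded across all eight GPUs) and pipeline parallelism (different layers placed on different GPUs). A minimal sketch of configuring the former, using vLLM purely for illustration; the library choice and model id are assumptions, since the NVIDIA post benchmarks its own stack:

        # Illustrative only: tensor parallelism across the 8 GPUs of one node.
        # Library (vLLM) and model id are assumed, not taken from the article.
        from vllm import LLM, SamplingParams

        llm = LLM(
            model="meta-llama/Llama-3.1-405B-Instruct-FP8",  # assumed model id
            tensor_parallel_size=8,  # shard each layer's weights across 8 GPUs
            # pipeline_parallel_size=8 would instead split the model by layers
        )

        out = llm.generate(["Tensor vs. pipeline parallelism, briefly:"],
                           SamplingParams(max_tokens=128))
        print(out[0].outputs[0].text)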

7e a year ago

And this is why nobody submits MLPerf against NVIDIA.

moondistanceOP a year ago

Significant further optimizations. FP8!
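
The FP8 remark refers to running weights and activations in 8-bit floating point, which roughly halves memory traffic versus FP16 and uses the Hopper FP8 tensor cores. A hedged sketch of requesting that at load time, again assuming vLLM rather than the stack the article actually benchmarks:

    # Assumed library and model id; quantization="fp8" asks vLLM to
    # dynamically quantize an FP16/BF16 checkpoint to FP8 at load time.
    from vllm import LLM

    llm = LLM(
        model="meta-llama/Llama-3.1-405B-Instruct",  # assumed model id
        tensor_parallel_size=8,
        quantization="fp8",
    )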
