Llama Is Expensive (cursor.so)
> As a massive disclaimer, a reason to use Llama over gpt-3.5 is fine-tuning. In this post, we only explore cost and latency. I don't compare Llama-2 to GPT-4, as it is closer to a 3.5-level model. Given the discourse on Twitter, it seems Llama-2 still trails gpt-3.5-turbo. Benchmark performance also supports this claim.
Well, one other massive disclaimer is that the author is backed by OpenAI's Startup Fund, which they failed to disclose in the post.
So of course they would speculate that. This post is essentially a paid marketing piece by OpenAI, who is the lead investor in Anysphere (creators of Cursor).
a founder from anysphere here. while we are funded by the openai startup fund, we have no other formal relationship with them. we love oss models and are exploring how we can use them in amazing code products. the post is our honest reflection on the tradeoffs we see :)
I hear you, but disclosing the relationship up front is the best way to disarm suspicions of bias.
> We serve Llama on 2 80-GB A100 GPUs, as that is the minimum required to fit Llama in memory (with 16-bit precision)
Well there is your problem.
LLaMA quantized to 4 bits fits in 40GB. And it gets similar throughput split between dual consumer GPUs, which likely means much better throughput on a single 40GB A100 (or a cheaper 48GB Pro GPU)
https://github.com/turboderp/exllama#dual-gpu-results
And this is without any consideration of batching (which I am not familiar with TBH).
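The 40GB figure follows from simple arithmetic. A back-of-the-envelope sketch (the 4.5 bits/param figure is an assumption approximating the overhead of GPTQ-style group scales, not a measured number):

```python
# Rough weight-memory footprint of a 70B-parameter model at different
# precisions (weights only; the KV cache and activations add more on top).
PARAMS = 70e9

def weight_gb(bits_per_param: float) -> float:
    """Weight memory in decimal GB for a given effective bits/parameter."""
    return PARAMS * bits_per_param / 8 / 1e9

fp16 = weight_gb(16)   # 140 GB -> needs 2x 80 GB A100s just for weights
q4 = weight_gb(4.5)    # ~39 GB with quantization overhead -> fits one 40 GB GPU
print(f"fp16: {fp16:.0f} GB, 4-bit: {q4:.0f} GB")
```

Which is why the minimum hardware drops from a 2x A100 80GB node to a single 40GB (or 48GB) card.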
Also, I'm not sure which model was tested, but Llama-2 70B chat should perform better than the base model if the prompting syntax is right. The correct syntax was only reverse-engineered from Meta's demo implementation recently.
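For reference, that reverse-engineered chat format looks roughly like this (a sketch; exact whitespace and BOS-token handling vary by loader, and multi-turn conversations chain additional `[INST]` blocks):

```python
def llama2_chat_prompt(system: str, user: str) -> str:
    # Llama-2-chat expects [INST] ... [/INST] instruction blocks, with an
    # optional <<SYS>> system section inside the first instruction.
    return f"[INST] <<SYS>>\n{system}\n<</SYS>>\n\n{user} [/INST]"

prompt = llama2_chat_prompt("You are a helpful assistant.", "Summarize this diff.")
print(prompt)
```

Feeding the base-model format (plain text continuation) to the chat model, or vice versa, noticeably degrades output quality.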
There are other "perks" to Llama too, like manually adjusting various generation parameters, constraining generation with a custom grammar, and extended context.
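As one example of grammar-constrained generation, llama.cpp accepts GBNF grammar files that restrict sampling to strings the grammar accepts. A minimal sketch that forces a bare yes/no answer:

```
root ::= "yes" | "no"
```

None of the hosted OpenAI endpoints offered this kind of hard output constraint at the time.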
The key misconception about many quantization methods is that lower precision = better speed.
I believe GPTQ is not much faster than bf16, from skimming the AWQ paper: https://arxiv.org/pdf/2306.00978.pdf
It's 3x faster for a batch size of 1, but that's still over 10x more expensive than gpt-3.5
For larger batch sizes, bf16 costs dip below 3-bit quantized.
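The "10x more expensive" claim is easy to sanity-check from the hourly GPU price and throughput. A sketch, where the 30 tokens/s figure is an illustrative assumption rather than a benchmark, and ~$0.002/1K tokens is gpt-3.5-turbo's then-current output price:

```python
# Back-of-the-envelope $/1K generated tokens from GPU rental price and throughput.
def dollars_per_1k_tokens(gpu_dollars_per_hour: float, tokens_per_second: float) -> float:
    tokens_per_hour = tokens_per_second * 3600
    return gpu_dollars_per_hour / tokens_per_hour * 1000

# 2x A100 80GB at $4.42/h, assuming ~30 tok/s at batch size 1 (illustrative).
a100_pair = dollars_per_1k_tokens(4.42, 30)
print(f"${a100_pair:.3f} per 1K tokens vs ~$0.002 for gpt-3.5-turbo")
```

At those assumed numbers self-hosting comes out well over 10x the API price per token; batching changes the picture by raising effective tokens/second per dollar.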
exLlama supports batching, and I believe it claws back much of the throughput loss from quantization (depending on the exact settings you used to quantize).
And as said below, whatever throughput you lose is going to be massively offset by the ability to use smaller single GPUs.
You don’t have to run Llama 70B on a rented 2x A100 80GB, which is of course going to be quite pricey. Quantising it to 4-bit, as brucethemoose2 mentioned, allows you to run it on far cheaper hardware: it’ll fit on a single A6000, which can be rented for as low as $0.44/h, 10x cheaper than the $4.42/h they mentioned for their 2x A100 80GB (speed might be impacted, but it shouldn’t be 10x slower).
And if you’re running it on your own machine, then the cost of using Llama is just your electricity bill - you can theoretically run it on 2x 3090 which are now quite cheap to buy, or on a CPU with enough RAM (but it will be very very slow).
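The electricity bill really is small compared to cloud rental. A sketch with illustrative assumptions (~700 W total draw for a 2x 3090 rig under load, $0.15/kWh; both numbers vary by rig and region):

```python
# Rough hourly electricity cost of local inference on 2x RTX 3090.
watts = 700              # assumed total system draw under load
price_per_kwh = 0.15     # assumed residential electricity price ($/kWh)
cost_per_hour = watts / 1000 * price_per_kwh
print(f"~${cost_per_hour:.3f}/hour of inference")
```

About a tenth of a dollar per hour, versus $4.42/h for the rented 2x A100 node (ignoring the upfront hardware cost).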
What are my options for running llama 2 on a single 3080?
Llama.cpp, through kobold.cpp, offloading some of the layers to RAM.