Llama Is Expensive (cursor.so)
> As a massive disclaimer, a reason to use Llama over gpt-3.5 is fine-tuning. In this post, we only explore cost and latency. I don't compare Llama-2 to GPT-4, as it is closer to a 3.5-level model. Given the discourse on Twitter, it seems Llama-2 still trails gpt-3.5-turbo. Benchmark performance also supports this claim.
Well, one other massive disclaimer is that the author is backed by OpenAI's Startup Fund, which they failed to disclose in the post.
So of course they would speculate that. This post is essentially a paid marketing piece by OpenAI, who is the lead investor in Anysphere (creators of Cursor).
a founder from anysphere here. while we are funded by the openai startup fund, we have no other formal relationship with them. we love oss models and are exploring how we can use them in amazing code products. the post is our honest reflection on the tradeoffs we see :)
I hear you, but disclosing the relationship up front is the best way to disarm suspicions of bias.
> We serve Llama on 2 80-GB A100 GPUs, as that is the minimum required to fit Llama in memory (with 16-bit precision)
Well there is your problem.
LLaMA quantized to 4 bits fits in 40GB. And it gets similar throughput split between dual consumer GPUs, which likely means much better throughput on a single 40GB A100 (or a cheaper 48GB Pro GPU)
https://github.com/turboderp/exllama#dual-gpu-results
And this is without any consideration of batching (which I am not familiar with TBH).
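The 40GB figure follows from simple arithmetic. A back-of-the-envelope sketch (the 4.5 bits/param figure is an assumption approximating the overhead of GPTQ-style group scales, not a measured number):

```python
# Rough weight-memory footprint of a 70B-parameter model at different
# precisions (weights only; the KV cache and activations add more on top).
PARAMS = 70e9

def weight_gb(bits_per_param: float) -> float:
    """Weight memory in decimal GB for a given effective bits/parameter."""
    return PARAMS * bits_per_param / 8 / 1e9

fp16 = weight_gb(16)   # 140 GB -> needs 2x 80 GB A100s just for weights
q4 = weight_gb(4.5)    # ~39 GB with quantization overhead -> fits one 40 GB GPU
print(f"fp16: {fp16:.0f} GB, 4-bit: {q4:.0f} GB")
```

Which is why the minimum hardware drops from a 2x A100 80GB node to a single 40GB (or 48GB) card.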
Also, I'm not sure which model was tested, but Llama-2 70B chat should perform better than the base model if the prompting syntax is right. The correct syntax was only reverse-engineered from Meta's demo implementation recently.
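For reference, that reverse-engineered chat format looks roughly like this (a sketch; exact whitespace and BOS-token handling vary by loader, and multi-turn conversations chain additional `[INST]` blocks):

```python
def llama2_chat_prompt(system: str, user: str) -> str:
    # Llama-2-chat expects [INST] ... [/INST] instruction blocks, with an
    # optional <<SYS>> system section inside the first instruction.
    return f"[INST] <<SYS>>\n{system}\n<</SYS>>\n\n{user} [/INST]"

prompt = llama2_chat_prompt("You are a helpful assistant.", "Summarize this diff.")
print(prompt)
```

Feeding the base-model format (plain text continuation) to the chat model, or vice versa, noticeably degrades output quality.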
There are other "perks" to Llama too, like manually adjusting various generation parameters, constraining generation with a custom grammar, and extended context.
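As one example of grammar-constrained generation, llama.cpp accepts GBNF grammar files that restrict sampling to strings the grammar accepts. A minimal sketch that forces a bare yes/no answer:

```
root ::= "yes" | "no"
```

None of the hosted OpenAI endpoints offered this kind of hard output constraint at the time.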
The key misconception about many quantization methods is that lower precision = better speed.
I believe GPTQ is not much faster than bf16, from skimming the AWQ paper: https://arxiv.org/pdf/2306.00978.pdf
It's 3x faster for a batch size of 1, but that's still over 10x more expensive than gpt-3.5
For larger batch sizes, bf16 costs dip below 3-bit quantized.
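The "10x more expensive" claim is easy to sanity-check from the hourly GPU price and throughput. A sketch, where the 30 tokens/s figure is an illustrative assumption rather than a benchmark, and ~$0.002/1K tokens is gpt-3.5-turbo's then-current output price:

```python
# Back-of-the-envelope $/1K generated tokens from GPU rental price and throughput.
def dollars_per_1k_tokens(gpu_dollars_per_hour: float, tokens_per_second: float) -> float:
    tokens_per_hour = tokens_per_second * 3600
    return gpu_dollars_per_hour / tokens_per_hour * 1000

# 2x A100 80GB at $4.42/h, assuming ~30 tok/s at batch size 1 (illustrative).
a100_pair = dollars_per_1k_tokens(4.42, 30)
print(f"${a100_pair:.3f} per 1K tokens vs ~$0.002 for gpt-3.5-turbo")
```

At those assumed numbers self-hosting comes out well over 10x the API price per token; batching changes the picture by raising effective tokens/second per dollar.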
exLlama supports batching, and I believe it claws back much of the throughput loss from quantization (depending on the exact settings you used to quantize).
And as said below, whatever throughput you lose is going to be massively offset by the ability to use smaller single GPUs.
You don’t have to run Llama 70B on a rented 2x A100 80GB, which is of course going to be quite pricey. Quantising it to 4-bit, as brucethemoose2 mentioned, allows you to run it on far cheaper hardware: it’ll fit on a single A6000, which can be rented for as low as $0.44/h, 10x cheaper than the $4.42/h they mentioned for their 2x A100 80GB (speed might be impacted, but it shouldn’t be 10x slower).
And if you’re running it on your own machine, then the cost of using Llama is just your electricity bill - you can theoretically run it on 2x 3090 which are now quite cheap to buy, or on a CPU with enough RAM (but it will be very very slow).
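The electricity bill really is small compared to cloud rental. A sketch with illustrative assumptions (~700 W total draw for a 2x 3090 rig under load, $0.15/kWh; both numbers vary by rig and region):

```python
# Rough hourly electricity cost of local inference on 2x RTX 3090.
watts = 700              # assumed total system draw under load
price_per_kwh = 0.15     # assumed residential electricity price ($/kWh)
cost_per_hour = watts / 1000 * price_per_kwh
print(f"~${cost_per_hour:.3f}/hour of inference")
```

About a tenth of a dollar per hour, versus $4.42/h for the rented 2x A100 node (ignoring the upfront hardware cost).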
What are my options for running llama 2 on a single 3080?
Llama.cpp, through kobold.cpp, offloading some of the layers to RAM.