Punica: Serving multiple LoRA finetuned LLM as one

github.com

135 points by abcdabcd987 2 years ago · 27 comments

huac 2 years ago

I think this is one of the most important pieces of work for open-source LLMs; really glad y'all pushed this forward!

That's not hyperbole. Why is OpenAI able to charge so little for their APIs? I have heard rival mega LLM company CEOs complain that OpenAI's prices would be a loss for their rivals. But I think it's still positive margin: they can charge low prices for the API partly because they've invested more into managing the infra, sure, but most importantly because they have the best utilization of their existing hardware.

If it costs everyone $X/gpu/hr to serve models, the company with the most throughput wins on price. In a world without finetunes, the most capable model, the one that can zero- or few-shot the most tasks, will get the most usage. Finetuned open models can reach parity with GPT on narrow tasks, but until now, having public providers serve them was expensive: your private finetune is only going to be queried by you, not everyone, so it's super expensive to serve on a per-token basis. With hot-swappable LoRA adapters, that calculus changes, and the cost per token can go way down. Super, super exciting!

  • therealpygon 2 years ago

    Doesn’t OpenAI still operate at significant losses, covered by massive infusions of capital from Microsoft and other investors? If you are giving away half your product, it's not surprising that you would be undercutting the competition. Not a new strategy.

    Underprice to avoid or drive out competition and encourage lock-in, then raise prices once you no longer have competitors, or once your user base is large and reliant enough that the attrition is manageable. Then you sell to a bigger company that grinds it up and integrates it into its own products. Same as always. Bonus points if you claim to be open source for the free marketing and/or the free development and testing in the form of user contributions, before switching to a proprietary model.

    Shouldn’t we have a standardized corporate strategy bingo card by now?

    • huac 2 years ago

      I don't have any access to their financials, so this is speculative, but while they do 'give away' GPT-3.5-turbo in the free ChatGPT, the rest of the business is likely extremely profitable. If you want to estimate the cost of serving those free requests, consider how much it would cost to do the same via the API: a 10-message conversation, where ChatGPT outputs 200 tokens each time, comes to $0.002, or two-tenths of a cent (rough math in the snippet below). I believe their API usage is still positive-margin for them. (Of course, now consider how much markup there is in ChatGPT pro!!)

      There is a difference between pricing aggressively and pricing at a loss. Their pricing for gpt-3.5-turbo now matches leading public providers for Llama-70B ($1/million tokens). Rumors are that 3.5-turbo is actually a 20B model, but even if we assume it is larger than 70B, OpenAI can still price more aggressively than Llama-70B providers because they get better throughput and utilization out of the same hardware.
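
      Rough math behind that $0.002 figure, taking the ~$1 per million tokens pricing mentioned above as the assumed rate:

        messages = 10
        tokens_per_reply = 200
        price_per_million_tokens = 1.00   # USD, assumed gpt-3.5-turbo-class pricing
        cost = messages * tokens_per_reply / 1_000_000 * price_per_million_tokens
        print(f"${cost:.4f} per conversation")   # $0.0020, i.e. two-tenths of a cent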

  • ghotli 2 years ago

    Interesting. I'm not so sure I really 'got' that part of finetunes / LoRA adapters before reading this comment. Makes me want to make one to take it for a spin, see what comes out the other side.

    • huac 2 years ago

      The nice thing too is that because you are freezing almost all of the parameters, and generally keeping them in lower precision (e.g. QLoRA loads the full model in 4-bit), GPU memory usage is super low. A free Colab will definitely suffice for finetuning a 7B, and renting a 3090 is less than 50 cents an hour, so the barrier to entry for trying something is pretty low!
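
      For a rough idea of what that setup looks like, here is a minimal sketch assuming recent versions of the HuggingFace transformers, peft, and bitsandbytes libraries (the model name and LoRA hyperparameters are just placeholders):

        import torch
        from transformers import AutoModelForCausalLM, BitsAndBytesConfig
        from peft import LoraConfig, get_peft_model

        # Load the frozen base model in 4-bit (QLoRA-style); only the small LoRA
        # matrices attached below stay in higher precision and get trained.
        bnb = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_quant_type="nf4",
            bnb_4bit_compute_dtype=torch.bfloat16,
        )
        model = AutoModelForCausalLM.from_pretrained(
            "meta-llama/Llama-2-7b-hf",   # placeholder 7B base model
            quantization_config=bnb,
            device_map="auto",
        )

        # Attach a LoRA adapter; r and target_modules are illustrative choices.
        lora = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                          target_modules=["q_proj", "v_proj"],
                          task_type="CAUSAL_LM")
        model = get_peft_model(model, lora)
        model.print_trainable_parameters()  # typically well under 1% of the weights

      With the base weights in 4-bit, a 7B model plus LoRA optimizer state should fit comfortably within the ~15 GB of a free Colab T4.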

kcorbitt 2 years ago

Awesome work! Here's a paper released just yesterday that's also focused on efficiently serving many LoRAs simultaneously: https://arxiv.org/abs/2311.03285

Really looking forward to these innovations becoming more widespread -- I expect we're very close to a world where training a LoRA on a one-off task like "review every HN post from the last 3 years and flag any of them that contain informed speculation about the architecture of GPT-4" will be easy, cheap and routine.

  • abcdabcd987OP 2 years ago

    Thank you! We are also very excited about combining fast fine-tuning with efficient serving. In fact, what you just described is closely related to one of our very first motivations. In my previous blog post [1], I call this scheme "Just-in-Time Fine-tuning". Our earlier measurements show that, for a medium-sized webpage (~10K tokens), it takes around 30 seconds to 2 minutes to finetune a LoRA model. Another upside of this JIT fine-tuning scheme is that we can turn any model into a long-context model.

    We'll keep doing more research on finetuning, and hopefully we'll see the results soon.

    [1] https://le.qun.ch/en/blog/2023/09/11/multi-lora-potentials/

  • 3abiton 2 years ago

    These are all very interesting ideas, like Captain Planet becoming a super LLM.

Palmik 2 years ago

This is amazing, and will unlock many possibilities. I just recently read the S-LoRA paper, which is related, but it's even better to have a working (and extremely efficient!) implementation.

How hard would it be to adapt your kernels to work with the new-gen quants like AWQ or EXL2?

  • abcdabcd987OP 2 years ago

    Thanks for your encouragement! We are working on quantization as well. We recently submitted a paper on Atom [1], which uses 4-bit quantization and delivers 7.73x the throughput of FP16 and 2.53x that of INT8. Atom is able to maintain a perplexity (i.e., model accuracy) close to FP16, outperforming existing quantization approaches.

    We are polishing the 4-bit code; it will be added to the Punica code base soon. Please stay tuned :)

    [1] https://arxiv.org/abs/2310.19102

    • Palmik 2 years ago

      Added to my reading list! The world of quantization is moving so fast that even TheBloke might not be able to keep up!

      So Atom base models would be compatible with Punica?

      I also wonder: many people already train LoRAs with the base model in 8-bit or even 4-bit, so would it make sense to match the quantization algorithm used during training and inference?

vlovich123 2 years ago

Am I correct in understanding that LoRA is basically a way to cheaply create “delta” LLMs that are applied on top of the main large one to create a specialization? In other words, would this obviate all the vector DB stuff that people are doing?

  • lamroger 2 years ago

    The general consensus, imo, is that fine-tuning is more for tone and style than for factual accuracy. People use vector DBs to grab relevant data to throw into the prompt; that's called Retrieval-Augmented Generation.

    From what I can tell, this hosts multiple fine-tuned deltas and hot-swaps them as needed. Incredible optimization. It's like going from AMIs to ECS or Kubernetes.
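
    Mechanically, each "delta" is just a pair of low-rank matrices added on top of a frozen base weight, so swapping adapters means swapping two small tensors. A rough PyTorch sketch of the idea (shapes and scaling are illustrative, not Punica's actual API):

      import torch

      d_in, d_out, r, alpha = 4096, 4096, 16, 32
      W = torch.randn(d_out, d_in)       # frozen base weight, shared by every tenant
      A = torch.randn(r, d_in) * 0.01    # adapter-specific low-rank factor
      B = torch.zeros(d_out, r)          # adapter-specific low-rank factor

      def lora_linear(x, A, B):
          # y = x W^T + (alpha / r) * x A^T B^T
          # Only (A, B) differ per adapter, so "hot swapping" a finetune is cheap.
          return x @ W.T + (alpha / r) * (x @ A.T) @ B.T

      x = torch.randn(1, d_in)
      y = lora_linear(x, A, B)           # same base W, different (A, B) per finetune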

  • Havoc 2 years ago

    Best as I can tell, LoRA is useful for steering the model's behaviour, while injecting 100% new knowledge is still largely done via RAG, i.e. a vector DB.

yyding 2 years ago

Good job! I noticed that you implemented many CUDA kernels yourselves. Just wondering about your considerations or trade-offs between writing the kernels in pure CUDA vs. building on a compiler like TVM/Triton.

  • zhye 2 years ago

    Good question. In general, implementing kernels on page tables is tricky in tensor compilers because integer set analysis sometimes fails (though it can be fixed with some tweaks). I think using compilers like TVM can help deploy serving systems on different platforms (e.g. AMD GPUs), and I'm optimistic about this direction (we also need to make tensor compilers more user-friendly).
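
    For readers unfamiliar with the term: the "page table" here presumably refers to the indirection used by paged KV caches, where each sequence's logical blocks point at physical blocks in a shared pool. A PyTorch-level sketch of just the indexing (illustrative only; the real kernels fuse this gather into attention):

      import torch

      # Shared physical pool of KV pages, plus one sequence's page table.
      num_pages, page_size, num_heads, head_dim = 64, 16, 8, 64
      kv_pool = torch.randn(num_pages, page_size, num_heads, head_dim)
      page_table = torch.tensor([3, 17, 42])   # logical block i -> physical page id
      seq_len = 40                             # tokens actually used by this sequence

      # Data-dependent gather through the page table; indirection like this is
      # what makes static integer-set analysis in tensor compilers difficult.
      kv = kv_pool[page_table].reshape(-1, num_heads, head_dim)[:seq_len]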

j0057 2 years ago

That name is easy to confuse with the unrelated LoRa and LoRaWAN.

  • __void 2 years ago

    It seems that in 2012, along with the end of the world, the supply of possible acronyms/usable names for a project also ran out.

lmeyerov 2 years ago

Super cool!

I'm curious if there is a quality argument to be made: imagine needing to finetune k different classifiers...

Before this work, we could train a single multi-label classifier by pooling the training sets and deploy it as one LoRA.

Now, we can have k distinct classifiers without risking them interfering with one another.

Any sense of when, in realistic scenarios, the quality of k distinct LoRAs would be better?

kkielhofner 2 years ago

Nice!

Any thoughts as to how this would come together with serving frameworks like vLLM, lmdeploy, Triton Inference Server, etc?

  • abcdabcd987OP 2 years ago

    Certainly! We'd like our designs to be picked up by these frameworks so that they can serve all users. Currently, Punica is built on top of the PyTorch and HuggingFace Transformers ecosystems. Therefore, vLLM and LMDeploy, which are also in the PyTorch ecosystem, should be a smooth adaptation. As for NVIDIA Triton and TensorRT-LLM, since our kernels are written in CUDA, I believe they will also work seamlessly.

    We call on the open-source community to help us integrate Punica with all of these frameworks, so that everyone can benefit from the efficiency improvements!

junrushao1994 2 years ago

This is great! Have you guys considered integrating with one of the existing systems?

  • abcdabcd987OP 2 years ago

    Thanks for the question. Currently Punica is built on the PyTorch and HuggingFace Transformers ecosystem, so PyTorch users can start using Punica now.

    Looking forward to collaborating with TVM and MLC to reach more users :)

ruihangl 2 years ago

Great work! I am curious how much effort it would take to support LoRAs with different ranks?

  • zhye 2 years ago

    It will take some effort to implement the operators, but not too much (CUTLASS's grouped GEMM already supports different MNK shapes). However, the performance benefit is marginal compared to simply padding all LoRA ranks to the same rank, because these kernels are not compute-bound.
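
    To see why padding is lossless, here is a rough PyTorch sketch (shapes are illustrative, not the actual grouped-GEMM kernels): zero rows in A and zero columns in B contribute nothing, so one batched matmul over the padded adapters computes exactly the same deltas.

      import torch

      d_in, d_out, r_max = 256, 256, 32
      ranks = [8, 16, 32]                      # adapters with different LoRA ranks
      As = [torch.randn(r, d_in) for r in ranks]
      Bs = [torch.randn(d_out, r) for r in ranks]

      # Zero-pad every adapter to rank r_max so they share one batched GEMM.
      A_pad = torch.stack([torch.cat([A, A.new_zeros(r_max - A.shape[0], d_in)]) for A in As])
      B_pad = torch.stack([torch.cat([B, B.new_zeros(d_out, r_max - B.shape[1])], dim=1) for B in Bs])

      x = torch.randn(len(ranks), 1, d_in)     # one token per adapter, batched
      delta_pad = torch.bmm(torch.bmm(x, A_pad.transpose(1, 2)), B_pad.transpose(1, 2))
      delta_ref = torch.stack([x[i] @ As[i].T @ Bs[i].T for i in range(len(ranks))])
      assert torch.allclose(delta_pad, delta_ref, rtol=1e-4, atol=1e-4)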

busssard 2 years ago

There was word that GPT-4 is just 8 different GPT-3s in a trenchcoat, each finetuned on different topics. If we can now do this with 8 finetuned Vicuna-13Bs for the price of running Vicuna once, this is huge!
