DyLoRA: Parameter Efficient Tuning of Pre-Trained Models (arxiv.org)
When fine-tuning an LLM you can use the LoRA technique to make the fine-tuning faster. LoRA trains only a small added set of parameters (really a learned low-rank approximation of the weight update, in the same spirit as keeping only the largest singular values of an SVD). The size of that set is determined by the rank: the smaller the rank, the faster the fine-tuning. However, if you make the rank too small then quality will suffer, so you want to pick the optimal rank. This paper describes a technique which can be used to find that optimal rank more easily.
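To make that concrete, here's a rough PyTorch sketch of the idea (my own illustration, not the paper's code): the original weight stays frozen and only two small factors of rank r are trained.

    import torch
    import torch.nn as nn

    class LoRALinear(nn.Module):
        """A frozen linear layer plus a trainable low-rank update of rank r."""
        def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
            super().__init__()
            self.base = base
            for p in self.base.parameters():
                p.requires_grad = False  # original weights stay frozen
            d_out, d_in = base.weight.shape
            self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)  # (r, d_in)
            self.B = nn.Parameter(torch.zeros(d_out, r))        # (d_out, r), zero init
            self.scale = alpha / r

        def forward(self, x):
            # W x plus a scaled low-rank correction; only A and B are trained
            return self.base(x) + self.scale * ((x @ self.A.T) @ self.B.T)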
Fascinating progress.
Would you say the following understanding is correct?:
- You can fine-tune a model, regardless of whether it has been quantized (as in the 4-bit versions of models made to fit in consumer grade RAM sizes) or not.
- You can fine-tune any model on any hardware, provided it fits into RAM. That means that the 30B LLaMA-derived models, which in their 4-bit quantized version require about 19.5 GB of VRAM, can be fine-tuned on consumer-grade GPUs with 24 GB of VRAM (like the RTX 3090 and 4090).
Yes to the first.
To the second, I'm not sure the RAM requirements are the same for training, because you have to keep extra state around (gradients, optimizer state), which takes more memory.
Is it possible for many people to simultaneously fine tune models on different data and then combine the new models into something improved?
One approach is to have the model learn to select between several separately fine tuned adapters by learning which adapter works best in a given context. So at any given time it's only really using one adapter but can switch to another. In this case one adapter can't really improve another but the overall impact might be a model which is improved in a variety of different contexts.
Yes, but the naïve way to combine rank k adaptations created by n different people would be to concatenate them to a rank nk adaptation, which wouldn't be as lightweight and easy to share, so you'd likely be better off mushing them into the baseline model.
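For illustration, roughly what the two options look like (my own sketch, assuming each person's adapter is stored as a (B_i, A_i) pair of low-rank factors):

    import torch

    def concat_adapters(adapters):
        """Concatenate n rank-k adapters into one rank n*k adapter.
        Each B_i is (d_out, k), each A_i is (k, d_in)."""
        B = torch.cat([b for b, _ in adapters], dim=1)  # (d_out, n*k)
        A = torch.cat([a for _, a in adapters], dim=0)  # (n*k, d_in)
        return B, A

    def merge_into_base(W, adapters, weights=None):
        """Fold the adapters into the base weight: W' = W + sum_i w_i * B_i @ A_i.
        The result is an ordinary full-size weight matrix, not a small adapter."""
        if weights is None:
            weights = [1.0] * len(adapters)
        W_merged = W.clone()
        for w, (B, A) in zip(weights, adapters):
            W_merged += w * (B @ A)
        return W_merged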
Can they mathematically be “mushed” and then create an improved model?
I have yet to understand the difference between fine tuning and training and therefore yet to understand if a distributed decentralized eventually consistent training approach is a possibility or simply not realistic.
If you make N copies of a model, train them independently for a little while on N machines, and average them back together, it sort of works. But not if you train for very long, as the internal structure diverges.
It becomes an empirical engineering question how many parallel nodes you can train on for how long before averaging them back together. It's an expensive question to answer, since you have to train many variations to get the data.
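For concreteness, the "average them back together" step is literally just parameter-wise averaging, something like this sketch (assuming all copies share the same architecture):

    import copy
    import torch

    def average_models(models):
        """Naive parameter-wise average of N copies of the same model.
        Works tolerably after short independent training runs; the longer the
        copies train apart, the more their internal structure diverges."""
        avg = copy.deepcopy(models[0])
        with torch.no_grad():
            for name, p_avg in avg.named_parameters():
                stacked = torch.stack([dict(m.named_parameters())[name] for m in models])
                p_avg.copy_(stacked.mean(dim=0))
        return avg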
I was wondering whether you can fine-tune / train on a restricted subspace of the weights. If so, then one could assign each worker its own partitioned subspace so the averaging wouldn’t overlap, though maybe that would destroy some valuable cohesion.
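Something like this is what I have in mind (just a sketch to illustrate the partitioning idea, not something I've tested):

    import torch

    def make_disjoint_masks(param_shape, n_workers, seed=0):
        """Randomly partition a parameter tensor into n_workers disjoint masks,
        so no two workers ever touch the same entry."""
        g = torch.Generator().manual_seed(seed)
        assignment = torch.randint(0, n_workers, param_shape, generator=g)
        return [assignment == i for i in range(n_workers)]

    def combine_partitioned_updates(base, deltas, masks):
        """Each worker i only trained the entries in masks[i]; because the
        partitions are disjoint, combining is a masked sum rather than an average."""
        out = base.clone()
        for delta, mask in zip(deltas, masks):
            out[mask] += delta[mask]
        return out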
I haven't heard of that being tried (though I don't read everything.) Someone could do the experiment and write it up, and maybe get it published. The main ML conferences rarely publish anything that's not an improvement on the SOTA, which is why it's so hard to find anything about ideas that don't quite work.
The underlying motivation to my thoughts and comments is investigating if a decentralized but periodically coordinated algorithm for training LLMs exists. We have millions of GPUs distributed across the world which if they could somehow be put to work on training without extreme requirements on data transfer between them could enable training of large LLMs in an open source way even if that training is technically energy suboptimal.
Yeah, your intuition that this would destroy cohesion is correct.
It's basically not possible to do what you are trying to do in an async manner. With advancements in large batch gradients, it might be possible to do some sort of synchronous P2P gradient averaging.
Edit: I’m reading this to try and get some sense of the issues - https://www.amazon.science/blog/near-linear-scaling-of-gigan...
What about with some fairly frequent and periodic synchronization?
Is there potentially some balance where small enough subsets can be chosen and disparate workers broadcast the small changes at small enough intervals that the net gain in learnings is still larger than the loss in fit due to de-cohesion? I was thinking maybe such an algorithm would be 10x less energy efficient but have the benefit of decentralization. Something along those lines.
I’m guessing the current training algorithms do something like this, but since more rapid synchronization always increases efficiency (in the extreme, that giant single-wafer CPU), OpenAI and others use systems with high interconnect bandwidth.
I am not familiar with that work.
> where small enough subsets can be chosen and disparate workers broadcast the small changes at small enough intervals that the net gain in learnings is still larger than the loss in fit due to de-cohesion
I think this really probably depends on the terrain of your loss landscape. My intuition is that many are too spike-y and if you take a step or two in each of your subsets and then average them, you will end up on a steep hill rather than a valley between your two points.
But this is an active area of research for sure.
Kudos to the authors for providing the code https://github.com/huawei-noah/KD-NLP/tree/main/DyLoRA and the RoBERTa example. Considering the current state of the OSS LLM community, I'm guessing someone is already porting it to LLaMA and GPT-style models.
Adding this to the huggingface peft library would be amazing. That's the main library that people using LoRA are currently using. https://github.com/huggingface/peft/issues/289
The Stable Diffusion community has, unfortunately, largely ignored peft because the training/inference scripts mostly ignored diffusers.
I'm unsure of the value of dynamically reducing the rank of the LoRA matrix at inference time given that probably most of the parameter count comes from the original weights rather than the LoRA diff.
But nonetheless, training time improvements look interesting.
e: Oh I see, the training time improvement is compared to a grid search over the LoRA rank. Not for a single run.
I am not convinced that you shouldn't just train on the highest possible rank that you can with your compute budget. If you can train a DynLoRA with rank 8, why not just train a LoRA with that rank?
Yeah, this is interesting but I can't see the immediate value (not that there isn't any).
Maybe if the “optimal rank” of LoRA applies to any adaptation and you're interested in training multiple adaptations for different use cases?
The optimal rank could differ across layers
I would be shocked if the "optimal rank" in terms of performance wouldn't be using the maximum rank from the DynLoRA across all layers.
Err, I suppose trivially, the higher rank terms include the lower-rank subnets, so they dominate in terms of quality.
But if you have some capacity constraint (e.g., memory, I guess?) then you can imagine dynamic rank allocation helping in the case where the maximum rank across all layers isn't within budget.
It's a bit of a stretch though, I agree
As someone else mentioned [0], the procedure would basically be to train a DyLoRA for an initial few iterations, then do a search among the layers to find the best-scoring combination of ranks, and then continue training, pruned to just those ranks, to completion.
Seems complicated but I could see it being useful potentially.
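Roughly, the search step might look like this (a hypothetical sketch: model_builder and eval_fn stand in for whatever builds the rank-truncated adapters and scores them, and the exhaustive sweep shown here would likely be replaced by a greedy per-layer search in practice):

    import itertools

    def search_rank_per_layer(layers, candidate_ranks, model_builder, eval_fn):
        """Score every combination of per-layer ranks and keep the best one.
        model_builder(ranks) returns a model with each layer's adapter truncated
        to the given rank; eval_fn(model) returns e.g. a validation score."""
        best_score, best_ranks = float("-inf"), None
        for combo in itertools.product(candidate_ranks, repeat=len(layers)):
            ranks = dict(zip(layers, combo))
            score = eval_fn(model_builder(ranks))
            if score > best_score:
                best_score, best_ranks = score, ranks
        return best_ranks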
So this can tune a model 7X faster than LoRA, which was already a massive speed boost? Curious to see what this will do to the LLaMA-derivative community in particular.
7x faster compared to grid-search LoRA for best rank.
I am not convinced that the "best rank" is not just the highest possible with your compute budget, personally.
Highest possible in which combination, though? If you’re fine-tuning a model with N layers, then you could apply LoRA to any or all of them. Maybe it’s better to concentrate effort unevenly, in which case a uniform increase of adaptation rank (up to the compute budget) could still be subpar.
Right but the way that this paper proposes determining the best rank is by training a LoRA with the full rank.
What is the fastest way to show that?
Fastest way to show what? That you should train with the maximum sized LoRA you can? Because the only upsides to having a smaller LoRA are in the training time, and if you are already able to train a DynLoRA with max rank 8, then you should just train a LoRA with that rank.
You get diminishing returns as you increase the rank, so with a fixed training budget it's not clear whether you get the best return from increasing the rank vs increasing something else. If you start off by training a DyLoRA with max rank 8, you can see returns diminish fast beyond, say, rank 5. Then you can use rank 5 for the rest of your training. You wouldn't know that with LoRA. I think this is the idea behind the paper. If you are just going to use your entire budget training a DyLoRA with max rank 8 then you're right, there's no advantage over LoRA with rank 8. You'd have to use the ability to assess multiple ranks in order to see some benefit.
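Concretely, the "assess multiple ranks" part can be as simple as scoring truncations of the one trained adapter (a sketch; eval_fn is a stand-in for your validation metric):

    def score_truncations(B, A, eval_fn, r_max):
        """Score the adapter truncated at every rank 1..r_max to see where the
        returns flatten out. B is (d_out, r_max), A is (r_max, d_in); eval_fn
        scores a (B, A) pair, e.g. by validation accuracy."""
        return {r: eval_fn(B[:, :r], A[:r, :]) for r in range(1, r_max + 1)}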
I can see that. But are we sure that a rank-based difference that doesn't manifest early in the training process won't manifest as you get further along? See also 'grokking' [0]
Not sure there's any way to know beforehand whether that would happen, but the advantage of DyLoRA is that at least you will know afterwards whether you really needed the full rank, whereas with LoRA you wouldn't. In some cases that might not be valuable information, but I guess you'd rather know than not.
Why is the only advantage at training time? I might be misunderstanding something, but with this method you can train once and then deploy models that use an arbitrary rank (according to end users' compute requirements) and expect to have a model that performs best for that specific rank.
How does this technique differ from the supernet optimization for one-shot NAS? https://proceedings.mlr.press/v80/bender18a.html
It seems like they use a fixed-distribution controller for training. It’d be nice to see why it’s worth deviating from the original RL paradigm.
It's very different, but hard to distill in a comment. They use a new regularization technique to basically create a LoRA with dynamically adjustable rank.
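As I understand it (a rough sketch, not the authors' code), each training step samples an active rank b <= r_max and only the first b components of the adapter are used, so every rank-b prefix of the trained adapter works as a standalone LoRA:

    import random
    import torch

    def dylora_style_forward(x, W, B, A, r_max):
        """One forward pass with a randomly sampled active rank b <= r_max.
        Only the first b columns of B and rows of A contribute (and receive
        gradients), so each low-rank prefix stays usable on its own."""
        b = random.randint(1, r_max)      # sample the active rank for this step
        delta = B[:, :b] @ A[:b, :]       # truncated low-rank update
        return x @ (W + delta).T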
There are good theoretical reasons behind this as well https://calculatedcontent.com/2023/02/01/deep-learning-and-e...