Finetuning a commercially viable open source LLM (Flan-UL2) using Dolly15K and LoRA

Kevin Rohling

Resources:
- Flan-UL2-Dolly15K (HuggingFace)
- Flan-UL2-Dolly15K (Github)

To say that the world of Large Language Models (LLMs) has been moving fast feels like a wild understatement. Ignoring OpenAI for a moment and just looking at Open Source, every week (day?) there are announcements of new models, datasets, applications, etc. In fact, StabilityAI just announced StableLM! Innovations like quantization have made it possible to run large models on ever smaller and less expensive hardware. We live in a world where you can now have your own AI assistant (of debatable quality) running locally on consumer hardware. Crazy times.

These advancements have many people scrambling to figure out how to apply this new technology to their product or business. Unfortunately, much of the recent open source progress has been focused on Facebook’s LLaMA model, which, according to its Github, is provided under a “Non-commercial bespoke license”. In fact, the weights for LLaMA were never meant to be public in the first place and their proliferation is the result of a leak on 4chan. It would seem Facebook is taking the situation seriously as well. Note that this applies to all derivatives of LLaMA, including Alpaca, GPT4All and Vicuna.

When considering what model to use for your business/commercial application, it is equally important to consider what datasets were used to train it. Alpaca, for example, is a very popular synthetic dataset; however, according to the Github repo:

“…models trained using the dataset should not be used outside of research purposes.”

Nevertheless, synthetic datasets are an awesome innovation: they essentially use a more mature LLM to generate training data for a smaller, new model. This can be incredibly efficient and effective compared to the cost and time required to produce an equivalent human-generated dataset. However, if your synthetic dataset was generated by a private LLM (think OpenAI’s GPT-3.5/4 here), it’s important that your intended usage of that dataset is aligned with the Terms of Use associated with that model. For example, according to OpenAI’s Terms of Use:

You may not… use output from the Services to develop models that compete with OpenAI;

Does this mean you can’t use a synthetic dataset generated by OpenAI? Not necessarily. That entirely depends on your intended usage of the resulting model and whether you (and your legal dept?) believe it’s aligned with OpenAI’s Terms of Use. Disclaimer: I know nothing about business law and this isn’t meant to be any sort of legal advice for you and your biz. I do tech stuff.

Despite all the love that some of the “commercially problematic” models have been getting, there are great options without the licensing issues. This tweet by @theaievangelist is a useful list. In the following example, I’m going to demonstrate how to finetune Google’s Flan-UL2 model for the purpose of creating a Q&A chatbot. The resulting model is free of licensing restrictions (Apache 2.0) and can be used in commercial applications. Additionally, I’ve included results from several popular benchmarks comparing the resulting model to the Flan-UL2 baseline as well as Vicuna, a popular LLaMA variant.

This example includes a model (Flan-UL2) trained on Dolly15K (Databricks). Dolly15K was released by Databricks under the Creative Commons Attribution-ShareAlike 3.0 Unported License and, according to their website, “anyone can use, modify, or extend this dataset for any purpose, including commercial applications.”
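
For reference, the dataset is available on the Hugging Face Hub and can be pulled with the datasets library. This is a quick, hedged sketch; the dataset id and field names below reflect the public databricks-dolly-15k release, but double-check them against the Hub page.

from datasets import load_dataset

# Load the public Dolly15K release; each row has instruction/context/response/category fields
dolly = load_dataset("databricks/databricks-dolly-15k", split="train")
print(len(dolly))               # roughly 15,000 instruction-following examples
print(dolly[0]["instruction"])  # inspect one example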

Why Flan-UL2?
- Licensing: Apache 2.0
- Strong benchmark performance
- A max input length of 2048 tokens makes it useful for many commercial use cases, including Retrieval Augmented Generation (RAG); see the prompt-assembly sketch after this list.
- At 20B parameters it is a bit large for most consumer applications, but when run in 8-bit mode it fits comfortably in under 50GB of VRAM, which is affordable for most businesses.
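
To illustrate the RAG point above, here is a minimal prompt-assembly sketch. It is not part of the training repo: retrieved_passages stands in for the output of whatever retriever you use, and the prompt template is just one reasonable choice.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/flan-ul2")

question = "What are the most effective ways to deal with stress?"
retrieved_passages = ["...passage from your document store...", "...another passage..."]  # hypothetical retriever output

# Concatenate the retrieved context with the question and cap the input at 2048 tokens
prompt = "Answer the question using the context below.\n\nContext:\n" + "\n".join(retrieved_passages) + "\n\nQuestion: " + question
input_ids = tokenizer.encode(prompt, return_tensors="pt", truncation=True, max_length=2048)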

Resulting Model:
- Flan-UL2-Dolly15K (HuggingFace)

Source Code:
- Flan-UL2-Dolly15K (Github)

A goal of this project was to produce this model on a limited budget, demonstrating the ability to train a robust, commercially viable LLM using systems available to even small businesses and individuals. This had the added benefit of personally saving me money as well :). To achieve this, a server was rented on vultr.com with the following pricing/specs:

  • Pricing: $2.604/hour
  • OS: Ubuntu 22.10 x64
  • 12 vCPUs
  • 120 GB CPU RAM
  • 80 GB GPU RAM (1 x A100)

To dramatically reduce the memory footprint and compute requirements, Low-Rank Adaptation (LoRA) was used rather than finetuning the entire network. Additionally, the Flan-UL2 model was loaded and trained in 8-bit mode, further reducing memory requirements. Finally, a batch size of 1 was used with 8 gradient accumulation steps. Here is the list of training parameters used, followed by a rough sketch of how they map onto the peft and transformers APIs:

  • Epochs: 1
  • Learning Rate: 1e-5
  • Batch Size: 1
  • Gradient Accumulation Steps: 8
  • 8 Bit Mode: Yes
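
As a rough illustration of how these parameters wire up with peft and transformers, here is a minimal sketch. It is not the repo’s exact train_lora.py: the LoRA rank, alpha, dropout, and target modules ("q", "v") are assumptions, and tokenized_dataset is a placeholder for the preprocessed Dolly15K data.

from transformers import AutoModelForSeq2SeqLM, Seq2SeqTrainer, Seq2SeqTrainingArguments
from peft import LoraConfig, TaskType, get_peft_model, prepare_model_for_int8_training

# Load the 20B base model in 8-bit so it fits on a single 80GB A100
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-ul2", load_in_8bit=True, device_map="auto")
model = prepare_model_for_int8_training(model)

# Only the low-rank adapter weights are trained; the base model stays frozen
lora_config = LoraConfig(task_type=TaskType.SEQ_2_SEQ_LM, r=8, lora_alpha=32, lora_dropout=0.05, target_modules=["q", "v"])
model = get_peft_model(model, lora_config)

training_args = Seq2SeqTrainingArguments(
    output_dir="flan-ul2-dolly-lora",
    num_train_epochs=1,
    learning_rate=1e-5,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,  # effective batch size of 8
    logging_steps=50,
)

# tokenized_dataset: the preprocessed Dolly15K examples (placeholder)
trainer = Seq2SeqTrainer(model=model, args=training_args, train_dataset=tokenized_dataset)
trainer.train()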

Environment Setup

If you don’t have conda installed:

curl -sL "https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh" > "Miniconda3.sh"
bash Miniconda3.sh

Clone the Github repository and configure the conda environment. This will install all the necessary dependencies.

git clone https://github.com/ConiferLabsWA/flan-ul2-dolly
cd flan-ul2-dolly
conda env create -f environment.yml
conda activate conifer

If you are running in a Unix environment and loading the model in 8 bit mode, you may encounter this error from the bitsandbytes library:

UserWarning: The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers and GPU quantization are unavailable.

If that happens, try this workaround:

cd ~/miniconda3/envs/conifer/lib/python3.10/site-packages/bitsandbytes/
cp libbitsandbytes_cuda120.so libbitsandbytes_cpu.so
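
To check whether the workaround took effect, importing the library again should no longer print the warning:

python -c "import bitsandbytes"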

The following command will finetune the Flan-UL2 model for 1 epoch (1 epoch ≈ 13 hours on 1 x A100 [80GB VRAM]):

python train_lora.py

To use these models for inference:

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from peft import PeftModel, PeftConfig

prompt = "What are the most effective ways to deal with stress?"

# Load the LoRA adapter config, then the 8-bit base model it was trained against
peft_model_id = 'coniferlabs/flan-ul2-dolly-lora'
config = PeftConfig.from_pretrained(peft_model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(config.base_model_name_or_path, device_map="auto", load_in_8bit=True)
# Attach the LoRA adapter weights to the frozen base model
model = PeftModel.from_pretrained(model, peft_model_id, device_map={'': 0})
tokenizer = AutoTokenizer.from_pretrained(config.base_model_name_or_path)
model.eval()

# Tokenize the prompt, generate a response, and decode it back to text
tokenized_text = tokenizer.encode(prompt, return_tensors="pt").to("cuda")
outputs = model.generate(input_ids=tokenized_text, min_length=10, max_length=500)
tokenizer.batch_decode(outputs, skip_special_tokens=True)

Results

These benchmarks were generated using the Flan-Eval tool, developed by Declare Lab, which covers the following:
- Massive Multitask Language Understanding (MMLU)
- Big Bench Hard (BBH)
- Discrete Reasoning Over the content of Paragraphs (DROP)

| Model                               | Precision | MMLU | BBH  | DROP |
|-------------------------------------|-----------|------|------|------|
| google/flan-ul2 | 8-bit | 55.2 | 44.9 | 64.4 |
| coniferlabs/flan-ul2-dolly-lora | 8-bit | 52.2 | 41.7 | 53.3 |
| eachadea/vicuna-13b** | 32-bit | 49.7 | 37.1 | 32.9 |

**Note that results for eachadea/vicuna-13b in 32-bit precision are taken from Declare Lab’s own evaluation of this model.

Below are two examples of questions and responses from the models used here. It’s important to note that I spent very little time tweaking the generation parameters used to create these model outputs. As such, it’s likely that with a bit more work each model is capable of higher-quality responses.

Question: What are the most effective ways to deal with stress?

Flan-UL2: Take a bath, listen to music, read a book.

Flan-UL2-Dolly: Breathe deeply and slowly. Take a deep breath in through your nose, hold it for a few seconds, and then exhale through your nose. Repeat this three times.

Question: What are the primary factors that influence consumer behavior?

Flan-UL2: Consumers make decisions based on the perceived value of the product or service.

Flan-UL2-Dolly: Consumer behavior is the study of how consumers make decisions about what products to buy and how much to pay for them. Consumers make decisions based on a variety of factors, including the price of the product, the quality of the product, the availability of the product, and the reputation of the company that makes the product. Consumers also make decisions based on how they feel about the company that makes the product. Consumers who feel positively about a company are more likely to buy their products than consumers who feel negatively about the company.

Conclusions

The process of finetuning an open source LLM with a commercial license has become surprisingly accessible. This is due in large part to the introduction of LoRA, which makes it possible to finetune a large, pre-trained network with substantially less cost, time, and hardware. Additionally, techniques like quantization can further reduce resource requirements, allowing engineers to experiment much faster. It is worth mentioning, however, that a production commercial model will likely benefit from training at higher precision.

Evaluating LLMs is tricky and definitely has a subjective aspect to it. Nevertheless, the resulting model performs strongly on several benchmarks despite some noticeable drift from the Flan-UL2 baseline, and it still outperforms full-precision Vicuna. Additionally, the model produces quality prompt responses in many cases even without tuning the generation parameters.

Next Steps

Here are a few ways these results could be improved upon:

Training UL2 at higher precision — Training in 8-bit substantially reduces resource requirements, but it does come at the cost of a performance hit for the resulting model. Additionally, when training in 8-bit it is not unusual to see vanishing gradients, which became an issue when training this model. This was resolved by lowering the learning rate; however, I believe this model would perform better with longer training and a higher learning rate, which could be achieved at 16-bit+ precision.
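
As a sketch of what the higher-precision option looks like, loading the base model in bfloat16 instead of 8-bit is a one-line change, at the cost of roughly double the VRAM for the 20B weights:

import torch
from transformers import AutoModelForSeq2SeqLM

# bf16 weights: higher precision than int8, but roughly 2x the memory footprint
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-ul2", torch_dtype=torch.bfloat16, device_map="auto")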

Using a dataset with longer token lengths — The Dolly15K dataset does not take advantage of the full 2048-token input that UL2 is capable of. Training only on shorter token sequences may result in poor performance during inference when longer sequences are used. Ideally, a dataset with input lengths ranging up to 2048 tokens would be used. Note that this would have a meaningful impact on the memory resources used for training. One dataset that is especially interesting here is the ShareGPT dataset used to train Vicuna.
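
A quick way to check this is to tokenize the Dolly15K inputs and look at the length distribution. The field names below match the public databricks-dolly-15k release, and the instruction/context concatenation is just an approximation of the training prompt format.

from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/flan-ul2")
dolly = load_dataset("databricks/databricks-dolly-15k", split="train")

# Approximate input length per example: instruction plus (optional) context
lengths = [len(tokenizer.encode(row["instruction"] + "\n" + row["context"])) for row in dolly]
print(max(lengths), sum(lengths) / len(lengths))  # most examples fall well short of 2048 tokens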

Test and compare other open source models and datasets — A few models of particular interest include: GPT4All-J, Pythia, and Google’s T5. Also, Open Assistant appears to have put together a really impressive human-generated dataset.