Aligning an LLM with Human Preferences
To better align the responses that instruction-tuned LLMs generate with what humans prefer, we can train LLMs against a reward model or a dataset of human preferences in a process known as RLHF (Reinforcement Learning from Human Feedback).
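Preference data for this kind of training consists of triples: a prompt, a response humans preferred, and a response they rejected. As a rough illustration, here is a hypothetical record; the "question", "chosen", and "rejected" field names match the dataset used in the demo below, but the values are made up:

# A hypothetical preference record; the field names match Intel/orca_dpo_pairs,
# the values are invented for illustration only.
preference_example = {
    "question": "What is the boiling point of water at sea level?",
    "chosen": "Water boils at 100 °C (212 °F) at sea level.",  # preferred response
    "rejected": "Water always boils at 90 °C, regardless of pressure.",  # dispreferred response
}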
DataDreamer makes this process simple and straightforward. We demonstrate it below using LoRA to train only a fraction of the model's weights with DPO (Direct Preference Optimization), a more stable and efficient alignment method than traditional RLHF.
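The demo below uses peft's default LoraConfig(). If you want finer control over how much of the model is trained, you can pass explicit LoRA hyperparameters instead; the values here are illustrative assumptions, not recommendations:

from peft import LoraConfig

# Illustrative LoRA hyperparameters (assumed values; tune for your model and task):
# r sets the adapter rank, lora_alpha scales the update, lora_dropout regularizes it,
# and target_modules restricts training to the attention projection layers.
peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
)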
from datadreamer import DataDreamer
from datadreamer.steps import HFHubDataSource
from datadreamer.trainers import TrainHFDPO
from peft import LoraConfig

with DataDreamer("./output"):
    # Get the DPO dataset
    dpo_dataset = HFHubDataSource(
        "Get DPO Dataset", "Intel/orca_dpo_pairs", split="train"
    )

    # Keep only 1000 examples as a quick demo
    dpo_dataset = dpo_dataset.take(1000)

    # Create training data splits
    splits = dpo_dataset.splits(train_size=0.90, validation_size=0.10)

    # Align the TinyLlama chat model with human preferences
    trainer = TrainHFDPO(
        "Align TinyLlama-Chat",
        model_name="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
        peft_config=LoraConfig(),
        device=["cuda:0", "cuda:1"],
        dtype="bfloat16",
    )
    trainer.train(
        train_prompts=splits["train"].output["question"],
        train_chosen=splits["train"].output["chosen"],
        train_rejected=splits["train"].output["rejected"],
        validation_prompts=splits["validation"].output["question"],
        validation_chosen=splits["validation"].output["chosen"],
        validation_rejected=splits["validation"].output["rejected"],
        epochs=3,
        batch_size=1,
        gradient_accumulation_steps=32,
    )
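After training, the resulting LoRA adapter can be loaded for inference with the standard peft and transformers APIs. This is a minimal sketch under assumptions: the adapter path below is hypothetical and should point at wherever you saved or exported the trained adapter.

import torch
from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer

adapter_path = "./output/aligned-tinyllama-chat"  # hypothetical export location

# Load the base model and apply the trained LoRA adapter on top of it
model = AutoPeftModelForCausalLM.from_pretrained(adapter_path, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")

# Format the prompt with the model's chat template before generating
messages = [{"role": "user", "content": "Give me one tip for writing clear emails."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
)
output_ids = model.generate(input_ids, max_new_tokens=128)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))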