Introduction
Vast.ai is an innovative platform that provides a marketplace for renting GPUs. Its mission is to make deep learning and other intensive computing tasks more accessible and affordable. In contrast to other cloud-based solutions, vast.ai operates as a decentralized service. This means that individual users (the renters) can directly rent computing resources from other users (the providers), who list their available hardware for rental. The platform supports various hardware configurations, which means you can select a suitable setup based on your project’s specific needs and your budget.
Llama 2 is a state-of-the-art, open-source language model developed to tackle various Natural Language Processing (NLP) tasks. With a collection of pre-trained models ranging from 7 billion to a staggering 70 billion parameters, Llama 2 offers a broad spectrum of capabilities to cater to different computational needs and complexities.
Requirements
- vast.ai account: If you don’t already have an account, you can create one by visiting vast.ai and following the account creation process.
- Access to Llama 2: You will need access to the Llama 2 models. You can obtain access to Llama 2 by filling out the form available at this link. Once you’ve submitted the form, you will be granted access to download the Llama 2 model files.
Creating an Instance on vast.ai
For running Llama 2 inference, you’ll need a powerful setup. Here’s how you can select the right GPU and create an instance on vast.ai:
1. Log in to your vast.ai account.
2. Choosing the Right Hardware: In the hardware settings, select a machine with 8 GPUs that each have at least 18GB of memory available (a rough estimate of why this is enough follows at the end of this section). The RTX 4090 cards are a great choice, given their high performance. Remember to ensure that the machine also has at least 200GB of storage space to accommodate the model files and data.
3. Setting up the Docker Image: vast.ai uses Docker containers to manage your environment. For running Llama 2, the `pytorch:latest` Docker image is recommended. You can specify this in the ‘Image’ field.
4. After setting up the necessary hardware and Docker image, review the configurations, and then click ‘Rent’ to finalize and create your instance.
Once your instance is successfully created, it will appear in your vast.ai Instances dashboard.
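As a quick sanity check on the hardware recommendation above, here is a rough back-of-envelope sketch of the memory footprint, assuming 16-bit weights and ignoring activation and KV-cache overhead:
# Rough memory estimate for Llama 2 70B in 16-bit precision (weights only).
params = 70e9          # 70 billion parameters
bytes_per_param = 2    # fp16/bf16 stores each parameter in 2 bytes
num_gpus = 8

total_gb = params * bytes_per_param / 1e9   # ~140 GB of weights in total
per_gpu_gb = total_gb / num_gpus            # ~17.5 GB per GPU when sharded 8 ways

print(f"Total: ~{total_gb:.0f} GB, per GPU: ~{per_gpu_gb:.1f} GB")
The weights alone come to roughly 17.5GB per GPU, which is why the 18GB minimum above is a hard floor, and why 24GB cards like the RTX 4090 leave comfortable headroom for activations and the KV cache.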
Setting up the environment
Once you’ve created your instance on vast.ai, the next step is to set up the environment for running the Llama 2 model. Follow these steps to configure your vast.ai environment:
1. Connect to Your Instance: From your vast.ai dashboard, locate the instance you’ve created. Here, you’ll find all the details needed to establish an SSH connection to your rented GPU server. Open your SSH client, enter the provided connection details, and connect to the server.
2. Install Hugging Face Hub: The first software component you need to install is the Hugging Face Hub library, which you’ll use to download the Llama 2 model. Once connected to your instance via SSH, run the following command to install it:
pip install huggingface_hub
3. Create a Directory for the Model: Now create a directory where the Llama 2 model will be stored. Use the following command to create a directory in the root called ‘model’:
mkdir /root/model
4. Download the Llama 2 Model: To download the Llama 2 model, start an interactive Python session by simply typing `python` and pressing enter. Then, enter the following Python commands:
from huggingface_hub import snapshot_download
snapshot_download("meta-llama/Llama-2-70b-chat", token="your-hugging-face-token", local_dir="/root/model")
Make sure to replace `"your-hugging-face-token"` with your actual Hugging Face token, which you can find in your Hugging Face account settings under Access Tokens. The model is quite large, so the download process can take 20–30 minutes.
5. Clone the Llama 2 Repository: Exit the Python session (you can do this by typing `exit()` and pressing enter). Navigate to your home directory with `cd ~`. Now, clone the Llama 2 repository from GitHub to your instance:
git clone https://github.com/facebookresearch/llama.git
6. Install Llama 2 Dependencies: Once the repository has been successfully cloned, navigate to the llama directory and install its dependencies:
cd /root/llama && pip install -e .
With these steps, you’ve set up your environment on vast.ai and are ready to run inference with the Llama 2 model.
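Before moving on, you can optionally confirm that the download completed with a small check like the one below. The expected file names follow Meta’s consolidated-checkpoint layout for the 70B model, so adjust them if your snapshot differs:
# Optional sanity check: confirm the Llama 2 70B files are in /root/model.
# Expected names assume Meta's consolidated checkpoint layout (verify against
# the actual snapshot contents if anything looks off).
from pathlib import Path

model_dir = Path("/root/model")
shards = sorted(model_dir.glob("consolidated.*.pth"))
missing = [f for f in ("tokenizer.model", "params.json") if not (model_dir / f).exists()]

print(f"Checkpoint shards found: {len(shards)}")  # the 70B model ships as 8 shards
print("Missing files:", missing or "none")
The next section will guide you on how to use the model for inference.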
Running Inference
After setting up the environment and downloading the Llama 2 model, you are ready to use the model for inference. Here is how you can proceed:
1. First, navigate to the Llama 2 directory using the following command:
cd /root/llama
2. To run the model, use the `torchrun` command. This is PyTorch’s distributed launcher: it starts one process per GPU so the model can be sharded across them. The `--nproc_per_node` argument specifies the number of processes to launch, one per GPU; here, we’re using all 8 GPUs. The `--ckpt_dir` argument specifies the directory of the model checkpoint, and `--tokenizer_path` points to the tokenizer file. The `--max_seq_len` argument sets the maximum length of the input sequence, and `--max_batch_size` sets the maximum number of sequences processed simultaneously. Use the following command to run the model:
torchrun --nproc_per_node 8 example_chat_completion.py \
--ckpt_dir /root/model \
--tokenizer_path /root/model/tokenizer.model \
--max_seq_len 512 --max_batch_size 4
3. You can modify the `dialogs` variable on line 27 of the `example_chat_completion.py` script to customize the dialogue prompt. For instance, you can set the dialogs variable like this:
dialogs = [[{"role": "user", "content": "What is the recipe of mayonnaise?"}]]
4. If you want to further customize the model’s behavior, you can edit the `DEFAULT_SYSTEM_PROMPT` in the `llama/generation.py` file on line 46. The `DEFAULT_SYSTEM_PROMPT` is a string that is prepended as a system message to any dialog that doesn’t already start with one, and it can be used to gently guide the behavior of the model.
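For reference, here is a minimal sketch of what `example_chat_completion.py` does, using the `Llama.build` and `chat_completion` APIs from the cloned repository. The file name `my_chat.py`, the prompts, and the system message are all illustrative:
# my_chat.py -- minimal chat-completion sketch based on the API used by
# example_chat_completion.py in the facebookresearch/llama repository.
from llama import Llama

def main():
    # Builds the model across the processes launched by torchrun.
    generator = Llama.build(
        ckpt_dir="/root/model",
        tokenizer_path="/root/model/tokenizer.model",
        max_seq_len=512,
        max_batch_size=4,
    )

    # Each inner list is one dialog; roles can be "system", "user", or "assistant".
    dialogs = [
        [
            {"role": "system", "content": "Answer as briefly as possible."},
            {"role": "user", "content": "What is the recipe of mayonnaise?"},
        ],
    ]

    results = generator.chat_completion(dialogs, temperature=0.6, top_p=0.9)

    for dialog, result in zip(dialogs, results):
        print(f"User: {dialog[-1]['content']}")
        print(f"Assistant: {result['generation']['content']}")

if __name__ == "__main__":
    main()
Like the bundled example, this must be launched with `torchrun --nproc_per_node 8 my_chat.py`, since the 70B checkpoint is sharded across 8 model-parallel processes, and the number of dialogs per batch must not exceed `max_batch_size`.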
After running the model, the generated responses to your prompts will be displayed in your terminal.
Remember to terminate your vast.ai instance
One of the essential steps, often overlooked in the process, is to shut down your instance when you’re done using it. If you forget, you will continue to be charged; note that even a stopped instance can still accrue storage charges, so destroy instances you no longer need. Always make a habit of double-checking that your instances are stopped or destroyed when not in use. With that, you are now all set to conduct inference with Llama 2 on vast.ai without any surprises on your bill!