Open source tools like Ollama and Open WebUI are convenient for building local LLM inference stacks that let you create a ChatGPT-like experience on your own infrastructure. Whether you are a hobbyist, someone concerned about privacy, or a business looking to deploy LLMs on-premises, these tools can help you achieve that.
Prerequisites
We assume here that you are running an LTS version of Ubuntu (NVIDIA and AMD tooling is best supported on LTS releases) and that you have a GPU installed on your machine (either NVIDIA or AMD). If you don’t have a GPU, you can still follow this guide, but inference will be much slower as it will run on CPU.
Making sure the system is up-to-date
As long as you use the latest kernels provided by Ubuntu, you can enjoy the pre-built NVIDIA drivers that come with the OS.
First make sure your server is up-to-date:
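For example, with apt:

```bash
sudo apt update
sudo apt upgrade -y
```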
If your system needs a reboot, reboot it before running:
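A typical command for this step, which removes old kernels and other packages that are no longer needed, is:

```bash
sudo apt autoremove --purge -y
```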
Note: You can check whether your system needs a reboot by looking for the file /var/run/reboot-required.
Removing old kernels is important to avoid pulling DKMS NVIDIA drivers during the installation.
NVIDIA driver installation (skip this section if you have an AMD GPU)
Drivers
Install the NVIDIA drivers by following the instructions in this post: How to install NVIDIA drivers on Ubuntu.
NVIDIA Container Toolkit
You will also need the NVIDIA Container Toolkit, which is not available in the Ubuntu archive, so you first need to add NVIDIA's apt repository:
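At the time of writing, NVIDIA's documented way to add the apt repository looks like this (check the documentation linked in the note below for the current version):

```bash
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey \
  | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list \
  | sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' \
  | sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
```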
Then, install the toolkit:
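For example:

```bash
sudo apt update
sudo apt install -y nvidia-container-toolkit
```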
Note: You can find a detailed guide about this section on the NVIDIA documentation.
Verify the installation
You can verify that the installation was successful by running: nvidia-smi.
If everything is working correctly, the output will show your GPU's name, driver version, and memory usage.
AMD driver installation (NVIDIA users can skip this section)
Drivers
The amdgpu driver is included in the Linux kernel modules shipped with Ubuntu and should work out of the box. To make sure it is installed, run:
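One way to check, assuming the module is shipped by your kernel's modules package, is to look for it under /lib/modules:

```bash
find /lib/modules/$(uname -r) -name "amdgpu.ko*"
```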
If you don’t see any output, it’s either because you are running linux-virtual (a lightweight kernel bundle for VMs) or because you are running a cloud kernel flavor that doesn’t include extra modules by default.
If you are on a cloud, install the appropriate extra modules package for your kernel flavor. For example, on AWS, you would run:
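The package name below assumes the aws kernel flavor; substitute the package matching your own kernel flavor (uname -r shows it):

```bash
sudo apt install linux-modules-extra-aws
```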
If you are not running on a cloud, install either linux-generic or linux-generic-hwe-24.04 (or -22.04 if you are using Ubuntu 22.04 LTS) depending on whether you are using the HWE kernel or not:
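For example, on Ubuntu 24.04 LTS:

```bash
# GA (stock) kernel:
sudo apt install linux-generic
# or, if you are on the HWE kernel:
sudo apt install linux-generic-hwe-24.04
```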
AMD Container Toolkit
Since we’re using Docker for the LLM inference server, the ROCm libraries (AMD's equivalent of the CUDA toolkit) are included in the container image, so there is nothing to install on the host.
However, just like with NVIDIA, you need to configure Docker to use the AMD GPU. To do this, first add the AMD container toolkit repository:
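A sketch of what this looks like is below; the key URL, repository path, and release name change between ROCm releases, so treat them as placeholders and copy the exact lines from the AMD documentation linked below:

```bash
# NOTE: illustrative only -- take the exact key URL, repo path and suite
# from the AMD container toolkit documentation for your Ubuntu release.
sudo mkdir -p /etc/apt/keyrings
wget https://repo.radeon.com/rocm/rocm.gpg.key -O - \
  | gpg --dearmor | sudo tee /etc/apt/keyrings/rocm.gpg > /dev/null
echo "deb [signed-by=/etc/apt/keyrings/rocm.gpg] https://repo.radeon.com/amd-container-toolkit/apt/ noble main" \
  | sudo tee /etc/apt/sources.list.d/amd-container-toolkit.list
```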
Then, install the toolkit:
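The package name below is taken from the AMD container toolkit documentation; verify it against the docs linked below:

```bash
sudo apt update
sudo apt install amd-container-toolkit
```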
More information can be found on the AMD ROCm documentation.
Installing Docker
To install Docker on your machine, follow the official documentation from Docker.
Once done, if you are using an NVIDIA GPU, run the following command to configure Docker:
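The nvidia-ctk helper ships with the NVIDIA Container Toolkit:

```bash
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
```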
After running this command, you should find something like this in /etc/docker/daemon.json:
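With the default configuration it looks roughly like this:

```json
{
    "runtimes": {
        "nvidia": {
            "args": [],
            "path": "nvidia-container-runtime"
        }
    }
}
```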
Similarly, for AMD GPUs, run:
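Assuming the amd-ctk helper installed by the AMD container toolkit (check the AMD documentation for the exact invocation on your version):

```bash
sudo amd-ctk runtime configure
sudo systemctl restart docker
```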
and you should find something like this in /etc/docker/daemon.json:
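The runtime name and binary path below are indicative only; the actual entry is written by the AMD tooling:

```json
{
    "runtimes": {
        "amd": {
            "path": "/usr/bin/amd-container-runtime",
            "runtimeArgs": []
        }
    }
}
```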
Installing Ollama and Open WebUI
Ollama is the server that runs the LLMs, and Open WebUI is the ChatGPT-like interface used to chat with them.
Create a compose.yml file with the following content:
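A minimal sketch of such a file is shown below (service names, image tags, and the volume layout are assumptions rather than a canonical setup). It assumes an NVIDIA GPU exposed through the runtime configured above; for AMD, use the ollama/ollama:rocm image and pass /dev/kfd and /dev/dri through to the container instead of the deploy block:

```yaml
services:
  ollama:
    image: ollama/ollama:latest          # use ollama/ollama:rocm for AMD GPUs
    restart: unless-stopped
    volumes:
      - ollama:/root/.ollama             # model storage
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia             # requires the NVIDIA container toolkit
              count: all
              capabilities: [gpu]

  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    restart: unless-stopped
    ports:
      - "8080:8080"                      # UI available on http://localhost:8080
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434
    volumes:
      - open-webui:/app/backend/data
    depends_on:
      - ollama

volumes:
  ollama:
  open-webui:
```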
Simply run docker compose up -d, and you should be able to open http://localhost:8080 in your favorite web browser and start chatting with your model!
But wait, you don’t have any model yet!
Downloading a model
You can download models directly from the ollama container. For example, to download the llama2 model, run:
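Assuming the service is named ollama as in the compose sketch above:

```bash
docker compose exec ollama ollama pull llama2
```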
This will download the model inside the ollama container and make it available for inference.
You can also list available models by running:
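This lists the models already downloaded into the container (again assuming the service name from the sketch above):

```bash
docker compose exec ollama ollama list
```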
or check the Ollama model repository for more models. Make sure the size of the model fits in your GPU memory! For example, llama2 requires at least 4GB of GPU memory. You can check your available GPU memory by running nvidia-smi (or rocm-smi on AMD) or btop.
Note: The first time a model is used, it might take a bit longer to respond as it needs to be loaded into GPU memory.
Maintenance
To update the Ollama and Open WebUI images, simply run:
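From the directory containing your compose.yml:

```bash
docker compose pull
docker compose up -d
```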
To keep the NVIDIA drivers up-to-date and never pull the DKMS packages, follow the instructions in this post: How to install NVIDIA drivers on Ubuntu.
Troubleshooting
I find that the best way to monitor the GPU usage is to use btop. If you have nvidia-smi or rocm-smi installed, btop will show you the GPU usage in its UI.
If you suspect that the GPU is not being used, one of the first things to do is check the logs from the ollama container:
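For example:

```bash
docker compose logs ollama
```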
and look for lines like:
level=INFO source=types.go:42 msg="inference compute" id=GPU-8c5284c3-6336-84e6-f91e-ba027e8d440b filter_id="" library=CUDA compute=8.9 name=CUDA0 description="NVIDIA L4" libdirs=ollama,cuda_v13 driver=13.0 pci_id=0000:31:00.0 type=discrete total="22.5 GiB" available="22.0 GiB"
[...]
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA L4) (0000:31:00.0) - 22560 MiB free
[...]
load_tensors: offloading 32 repeating layers to GPU
If you see that the model is being loaded on CPU instead of GPU, then there is probably something wrong with your NVIDIA or AMD container toolkit or driver installation.
Check that your card is visible by running nvidia-smi, rocm-smi, or btop on the host machine. If it is not, the problem is with your driver installation.