llama.cpp performance breakthrough for multi-GPU setups


László Jagusztin

While we were enjoying our well-deserved end-of-year break, the ik_llama.cpp project (a performance-optimized fork of llama.cpp) achieved a breakthrough in local LLM inference for multi-GPU configurations, delivering a massive performance leap — not just a marginal gain, but a 3x to 4x speed improvement.
While it was already possible to use multiple GPUs to run local models, previous methods either only pooled the available VRAM or offered limited performance scaling. The ik_llama.cpp team has now introduced a new execution mode ("split mode graph") that keeps all available GPUs fully and simultaneously utilized.
Why is it so important? With GPU and memory prices at an all-time high, this is a game-changer. We no longer need overpriced high-end enterprise cards; instead, we can harness the collective power of multiple low-cost GPUs in our homelabs, server rooms, or the cloud.


My measurements:


4 x Nvidia Tesla T4 GPUs on 64 core AMD EPYC 7V12 server


4 x Nvidia Tesla T4 GPUs on 64 core AMD EPYC 7V12 server

What is Split Mode Graph?

Traditionally, llama.cpp uses "layer" or "row" splitting to distribute a model across multiple GPUs. These methods often lead to GPUs "idling" while waiting for one another to finish a computation or transfer data.

Split Mode Graph implements tensor parallelism at the GGML graph level. Instead of just assigning layers to different GPUs, it distributes the compute graph nodes themselves. This allows the system to saturate the compute units of all available GPUs simultaneously.

Key Benefits

  • Performance Boost: Benchmarks show performance gains of 3x-4x over standard methods when using multiple CUDA GPUs for both prompt processing (PP) and token generation (TG).
  • High Utilization: Unlike the default modes which often show fluctuating GPU usage, graph split mode keeps GPUs "pegged at 100%," maximizing the return on multi-GPU hardware.
  • Backend Agnostic Potential: Because it is implemented at the ggml graph level rather than the CUDA backend level, it can theoretically be extended to other backends like Vulkan or ROCm in the future.
  • NCCL Topology Awareness: The implementation leverages NVIDIA's NCCL library, when available, to detect the fastest path between GPUs (such as NVLink or PCIe) and to let GPUs read directly from each other's memory. This keeps them at 100% utilization and eliminates the 20–30% idle time usually wasted waiting for data on the bus.

How to Use It

To enable this mode, install NCCL, build with CUDA support, and pass the -sm graph flag on the command line:

Example on Ubuntu:
sudo apt install libnccl-dev
cmake -B build -DGGML_CUDA=ON ...
llama-cli -m model.gguf -sm graph -ngl 99 ...
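To see what the new mode buys you on your own hardware, you can compare split modes with llama-bench, which accepts the same -sm values as llama-cli. The model path and layer count below are placeholders, and "graph" as a -sm value is the addition from the ik_llama.cpp fork, not upstream llama.cpp:

```shell
# Hypothetical A/B run: same model, same offload, different split mode.
# Compare the resulting pp (prompt processing) and tg (token generation) numbers.
./llama-bench -m model.gguf -ngl 99 -sm layer
./llama-bench -m model.gguf -ngl 99 -sm graph
```

Running both commands back to back on identical settings isolates the split mode as the only variable, which is how the 3x-4x figures above were obtained.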

Appendix: llama.cpp split modes

Layer Split Mode (-sm layer)

This is the standard way llama.cpp handles multiple GPUs. It distributes the model's transformer layers sequentially across your hardware.
If you have 80 layers and two GPUs, it might put layers 1–40 on GPU 0 and layers 41–80 on GPU 1. GPU 0 processes its layers, then sends the output to GPU 1 to process the rest. It is very simple and works with almost every backend (CUDA, Metal, Vulkan). But because execution is sequential, GPU 1 sits idle while GPU 0 is working, and vice versa, which lowers the overall tokens per second: your total compute power is never used at the same time.
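The 80-layer example above can be sketched with a little shell arithmetic. The simple proportional rule here is an assumption for illustration; the real loader also weighs the free VRAM on each device when no -ts ratios are given:

```shell
#!/bin/sh
# Sketch: how a proportional tensor split (-ts 1,1) maps 80 transformer
# layers onto two GPUs in layer mode. Proportional allocation is an
# illustrative assumption, not the loader's exact heuristic.
LAYERS=80
R0=1   # -ts ratio for GPU 0
R1=1   # -ts ratio for GPU 1
TOTAL=$((R0 + R1))
GPU0=$((LAYERS * R0 / TOTAL))   # layers assigned to GPU 0
GPU1=$((LAYERS - GPU0))         # the remainder goes to GPU 1
echo "GPU 0: layers 1-$GPU0, GPU 1: layers $((GPU0 + 1))-$LAYERS"
```

Changing the ratios (e.g. -ts 3,1 for a 24 GB card paired with an 8 GB card) shifts the boundary accordingly.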

Row Split Mode (-sm row)

Introduced to provide a more balanced load, row mode splits the actual weight matrices (tensors) across GPUs rather than whole layers: every GPU holds a piece of every layer. When a matrix multiplication happens, all GPUs work simultaneously, each on its own "rows" of the data. It handles extremely large context windows (KV cache) better because the memory load is distributed more evenly, and it can be faster than layer mode on systems with very high-speed interconnects (such as NVLink) because the GPUs run in parallel.
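To make the contrast with layer mode concrete, here is a toy calculation of what each GPU holds under row splitting. The matrix size and GPU count are made-up numbers for illustration:

```shell
#!/bin/sh
# Toy illustration of row splitting: every GPU holds an equal slice of
# every weight matrix, so all of them take part in every matmul.
ROWS=4096   # rows in one (hypothetical) weight matrix
GPUS=4
PER_GPU=$((ROWS / GPUS))
echo "Each of the $GPUS GPUs holds $PER_GPU rows of every matrix"
```

Contrast this with layer mode, where a given matrix lives entirely on one GPU and the others contribute nothing to that multiplication.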

None Split Mode (-sm none)

As the name suggests, this disables multi-GPU splitting entirely: the whole model is loaded onto a single "main GPU." When to use it: when you have one very powerful GPU (e.g., an A100) and one weak GPU (e.g., a GT 1030) that would only slow things down.
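Two ways to pin the model to the fast card; the model path is a placeholder, and device 0 being the fast GPU is an assumption (check nvidia-smi for your actual device order):

```shell
# Option 1: disable splitting and pick the main GPU with standard flags.
llama-cli -m model.gguf -sm none -mg 0 -ngl 99

# Option 2: hide the weak GPU from CUDA entirely, so only device 0 is visible.
CUDA_VISIBLE_DEVICES=0 llama-cli -m model.gguf -ngl 99
```

Option 2 is the blunter tool but guarantees the slow card is never touched, even for scratch buffers.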