GPUs are not always faster
This is the first post in a series about shortest-path algorithms on the GPU. Stay tuned for more.
In a recent effort to learn more about Vulkan and how to write efficient compute shaders, I turned to my favorite algorithm: the humble Dijkstra's algorithm. It is notoriously one of those algorithms that does not work well on GPUs. In fact, even a fairly naive CPU implementation will easily beat your GPU shader on some common graphs, like road networks. Understanding exactly why helps us figure out how to run things efficiently on GPUs.
Dijkstra’s algorithm finds the shortest path in a graph. Starting at the source, we process nodes until we reach the destination. For that we need a priority queue that gives us the node with the lowest distance to the source. We then look at all neighbours n of that node u and may update their distance dist[n] to the source, depending on whether reaching n via the current node u would be an improvement. If we guarantee that we always process the minimal node u in each step, this will always find the shortest path to the destination.
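The steps above can be sketched in plain CPU-side Python. This is a minimal sketch, not the shader code from this series; the adjacency-list layout (a dict of `(neighbour, weight)` lists) is my own choice for the example:

```python
import heapq

def dijkstra(graph, source):
    # graph: {node: [(neighbour, weight), ...]} -- adjacency list
    dist = {source: 0}
    pq = [(0, source)]  # priority queue ordered by distance to the source
    while pq:
        d, u = heapq.heappop(pq)  # the minimal node u of this step
        if d > dist.get(u, float("inf")):
            continue  # stale queue entry; u was already settled with a smaller distance
        for n, w in graph.get(u, []):
            nd = d + w
            if nd < dist.get(n, float("inf")):
                dist[n] = nd  # reaching n via u is an improvement
                heapq.heappush(pq, (nd, n))
    return dist
```

Note the data dependency: the next `heappop` can only happen after the previous node's updates are in the queue, which is exactly what resists parallelization.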
Now that guarantee is important, because it creates a data dependency: the minimal node can change based on the updates we made to the neighbours in the previous iteration. This is a bit of a problem, because GPUs really need to do things in parallel to work efficiently. So you would like to run the updates of thousands of neighbours concurrently and then continue with the next batch. That would work if each node had thousands of neighbours, but typically that is not the case. Instead we would do something like getting a batch of 1024 nodes and then processing all of their neighbours in parallel.
Doing so has multiple problems. Let’s say two nodes u, v have the same neighbour n. If we concurrently try to update the distance dist[n] with dist[u] + w(u, n) and dist[v] + w(v, n), we potentially write the higher value to dist[n]. Atomics, i.e. an instruction that performs the comparison “is dist[u] + w(u, n) smaller than dist[n]” and the operation “update dist[n] to dist[u] + w(u, n)” as one indivisible step, are a way to solve this race condition. However, if dist[u] or dist[v] is not the minimal value yet, then in turn dist[n] is also not minimal. This means we have to process nodes multiple times to get the correct distance values. We pay for parallelism with a lack of efficiency.
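Here is a tiny single-threaded Python model of that atomic relaxation. In a real shader this would be an atomicMin instruction on the distance buffer; the `atomic_min` helper and the concrete weights below are made up for illustration:

```python
import math

def atomic_min(dist, n, candidate):
    # Models a single atomicMin: the comparison "is candidate smaller than
    # dist[n]?" and the conditional write happen as one indivisible step,
    # so two threads relaxing n concurrently cannot lose the smaller value.
    if candidate < dist[n]:
        dist[n] = candidate
        return True  # signals that n needs to be (re)processed
    return False

# Two nodes u and v share the neighbour n and relax it "concurrently".
# The atomic guarantees the result is min(7, 6) = 6 regardless of order --
# but if dist[u] or dist[v] later improves, n must be relaxed again.
dist = {"u": 3, "v": 5, "n": math.inf}
atomic_min(dist, "n", dist["u"] + 4)  # assumed edge weight w(u, n) = 4 -> candidate 7
atomic_min(dist, "n", dist["v"] + 1)  # assumed edge weight w(v, n) = 1 -> candidate 6
```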
To see the other extreme of the spectrum, let’s consider another well-known algorithm for computing shortest paths: Bellman-Ford. The algorithm is even simpler than Dijkstra’s: you simply iterate over all edges (u, v) in the graph and update dist[v] if dist[u] + w(u, v) would be an improvement. We repeat this |V| - 1 times, or until we don’t do any updates anymore. This is perfectly parallel work that can easily be run on a GPU.
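A serial Python sketch of the idea (on a GPU, each relaxation in a round would be its own thread; the flat edge-list layout is an assumption of this example):

```python
def bellman_ford(num_nodes, edges, source):
    # edges: flat list of (u, v, w) tuples -- the relaxations within a
    # round have no ordering requirement, so they can all run in parallel.
    INF = float("inf")
    dist = [INF] * num_nodes
    dist[source] = 0
    for _ in range(num_nodes - 1):  # at most |V| - 1 rounds
        changed = False
        for u, v, w in edges:
            if dist[u] != INF and dist[u] + w < dist[v]:
                dist[v] = dist[u] + w
                changed = True
        if not changed:
            break  # early exit: no update in this round, distances converged
    return dist
```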
And in some cases Bellman-Ford works really well, namely on graphs where the shortest paths are relatively short in terms of how many “hops” (i.e. edges) it takes from the source to the destination. Intuitively this requires nodes in the graph to have a lot of connections to other nodes (i.e. neighbours). Having a lot of neighbours is called having a high “degree”. Road networks are an example of a type of graph where this algorithm does not work well. Depending on the exact translation of a road network to a graph, nodes typically have either 1 (dead end), 3 (T-intersection) or 4 (cross-intersection) outgoing neighbours. The average degree is relatively low, usually around ~2.5. In those conditions it takes a lot of hops to reach the destination, so Bellman-Ford will do a lot of rounds of updating all edges. Most of the time relatively few nodes have changed, and we waste a lot of effort.

In the early days of parallel computing, people were mostly concerned with “can I run this on multiple CPUs”, and by “multiple” they meant maybe fewer than a dozen. This is of course very different from modern parallel processing on GPUs, where we run thousands of computations in parallel. One algorithm from that era, Delta-Step (Meyer & Sanders), tries to adapt the classic Dijkstra algorithm to the parallel world. Instead of ordering nodes individually, we sort them into “buckets” by their distance, so all nodes with distance 0 - 99 are in bucket 0, 100 - 199 in bucket 1 and so on. The width of those buckets is called delta, and in this example it would be 100.
We are not interested in sorting the nodes inside a bucket at all, since we want to process them in parallel anyway. In the image above you see the distance dist[u] of nodes from a source; on the right you see them sorted into buckets.
As with Dijkstra, in each iteration we update all the neighbours, except here we do it for all nodes in a bucket simultaneously. We already saw that not honoring the ordering requirement, i.e. always processing the nodes with the lowest distance, forces us to update nodes several times. So Delta-Step updates the nodes in a bucket multiple times until we don’t see any updates anymore, then proceeds with the next bucket. Updating nodes that haven’t actually changed is not efficient, so in each iteration of the bucket we only update nodes whose dist[u] to the source has improved in the previous round. The weight of some edges in the graph may exceed the value of delta, so that dist[u] + w(u, n) will always produce a value that would put dist[n] outside the current bucket. We can optimize by updating these edges only once, after the bucket has settled, since these updates will not improve any of the nodes in the current bucket.
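Putting the bucket logic together, here is a serial Python sketch of Delta-Step. It is a sequential stand-in for what the shader does in parallel, and the bucket bookkeeping (a dict of sets, moving a node when its tentative distance improves) is my own simplification, not the GPU data layout:

```python
import math
from collections import defaultdict

def delta_step(graph, source, delta):
    # graph: {u: [(v, w), ...]} adjacency list with non-negative weights
    dist = defaultdict(lambda: math.inf)
    buckets = defaultdict(set)  # bucket i holds nodes with dist in [i*delta, (i+1)*delta)

    def relax(v, nd):
        if nd < dist[v]:
            if dist[v] < math.inf:
                buckets[int(dist[v] // delta)].discard(v)  # move v out of its old bucket
            dist[v] = nd
            buckets[int(nd // delta)].add(v)

    relax(source, 0)
    i = 0
    while any(buckets.values()):
        while not buckets[i]:
            i += 1  # advance to the next non-empty bucket
        settled = set()
        # Re-process the bucket until no node in it changes anymore; only
        # nodes improved in the previous round end up in the frontier.
        # Only "light" edges (w <= delta) can put a node back into bucket i.
        while buckets[i]:
            frontier, buckets[i] = buckets[i], set()
            settled |= frontier
            for u in frontier:
                for v, w in graph.get(u, []):
                    if w <= delta:
                        relax(v, dist[u] + w)
        # "Heavy" edges (w > delta) always land in a later bucket, so they
        # are relaxed only once, after the current bucket has settled.
        for u in settled:
            for v, w in graph.get(u, []):
                if w > delta:
                    relax(v, dist[u] + w)
    return dict(dist)
```

On the GPU, each pass over the frontier would be one dispatch with one thread per node (or per edge); the serial loops here just make the phase structure explicit.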
In practice there are many problems with implementing Delta-Step on the GPU. But to understand how it works, it is educational to first look at how the search actually behaves. One nice thing about working on the GPU is that visualizing algorithms is fairly easy, because the hardware is meant for rendering things. So below you see the search from a random source node expanding through a major European city (can you guess which one?). The dark green nodes are the “settled”, i.e. processed nodes that won’t be updated anymore. The light green nodes are the nodes in the current bucket, and the red nodes are the ones that were updated by the current iteration. You may see flashes of red through the whole graph after every bucket finishes. This is an implementation detail to get around some of the problems of implementing Delta-Step, as it is originally described, on the GPU.

I found this visualization fairly illuminating, since it captures what is hard to put into concrete metrics. At first progress is fairly slow: the number of nodes we are processing is small and we cannot take advantage of the parallelism of the GPU. Then we see rapid progress as the frontier of the bucket suddenly expands, and the GPU can do a lot of work in parallel. As we near the edge of the bucket we stall out again; very few nodes are updated and we can’t really saturate the GPU anymore.
Overall the utilization of the GPU is actually pretty poor; I’m seeing 20-30% on my tiny integrated GPU. As you can imagine, this ends up not being the fastest algorithm overall, especially compared to algorithms that take the GPU execution model into account. But crucially it is more efficient in the sense that it delivers a result faster than Bellman-Ford and uses fewer GPU resources to do so.
Since the more advanced algorithms like Near-Far (Davidson et al.) and ADDS (Wang et al.) are based on or inspired by Delta-Step, we’ll go into the technical details of how Delta-Step is implemented on the GPU in the next post of this series, and use that understanding to see what those algorithms do differently and why.