Optimizing Delta-Step on the GPU
This is the second post in a series about shortest-path algorithms on the GPU. See the previous one if you missed it.
In the previous post we discussed the algorithmic problems of implementing parallel SSSP algorithms on GPUs. In this post we dive into the details of Delta-Step on the GPU and why it is not efficient.
How do we even run computations on the GPU? Besides vendor-specific frameworks like CUDA (NVIDIA) and ROCm (AMD), there are the vendor-agnostic SYCL and OpenCL. The odd one of the bunch is Vulkan, which originally targeted graphics rendering rather than computation. However, Vulkan has mandatory support for compute shaders, and basically every somewhat modern GPU supports it. Its API primitives are low-level, which lets us learn more about the details of GPU execution. Vulkan runs shaders compiled to an intermediate instruction set called SPIR-V, which is translated to the GPU's native instruction set on execution. Compute shaders are typically written in GLSL, the C-like language also used for graphics shaders.
A minimal version of a Delta-Step-like compute shader in GLSL may look something like this:
layout(local_size_x = 64) in;

layout(std430, binding = 0) readonly buffer FirstEdges { uint first_edges[]; };
layout(std430, binding = 1) readonly buffer Targets { uint targets[]; };
layout(std430, binding = 2) readonly buffer Weight { uint weight[]; };
layout(std430, binding = 3) coherent buffer Dist { uint dist[]; };

layout(push_constant) uniform PushConsts {
    uint src_node;
    uint dst_node;
    uint n;
    uint bucket_idx;
    uint delta;
    uint max_weight;
} pc;

void main() {
    uint u = gl_GlobalInvocationID.x;
    if (u >= pc.n) {
        return;
    }
    uint du = dist[u];
    if (du >= pc.bucket_idx * pc.delta && du < (pc.bucket_idx + 1) * pc.delta) {
        for (uint eid = first_edges[u]; eid < first_edges[u+1]; ++eid) {
            uint w = weight[eid];
            if (w < pc.max_weight) {
                uint v = targets[eid];
                uint alt = du + w;
                uint prev = atomicMin(dist[v], alt);
                if (alt < prev) {
                    atomicMax(dist[pc.n], alt);
                }
            }
        }
    }
}
This line configures how large our workgroup size is:

layout(local_size_x = 64) in;

A workgroup is an abstraction of one piece of work we want to schedule on the same execution unit. This execution unit has multiple names depending on the vendor or architecture; common ones are "Streaming Multiprocessor" (NVIDIA) and "Compute Unit" (AMD). The threads that execute in parallel on such a unit form subgroups, also called warps (NVIDIA) or wavefronts (AMD). The subgroup size is usually fixed by the hardware, commonly 32 or 64.

In the image below the work is split into 8 workgroups with a size of 16 each. The hardware in this example requires subgroups of 8, so each workgroup has two subgroups. Workgroups can be freely scheduled between execution units, so there is no guarantee that two consecutive workgroups run on the same hardware. Within a workgroup we are guaranteed that all subgroups run on the same hardware and have access to fast shared memory. Since all threads in a subgroup run together, the longest-running thread determines how long the subgroup executes.
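To make these numbers concrete, here is a small host-side sketch (the helper names are mine, not part of any API; the ceiling division is the same one used in the dispatch call later in the post):

```cpp
#include <cassert>
#include <cstdint>

// Number of workgroups needed so that every one of `num_items` gets a thread.
constexpr uint32_t workgroups_for(uint32_t num_items, uint32_t workgroup_size) {
    return (num_items + workgroup_size - 1) / workgroup_size;
}

// How many subgroups the hardware packs into one workgroup.
constexpr uint32_t subgroups_per_workgroup(uint32_t workgroup_size, uint32_t subgroup_size) {
    return (workgroup_size + subgroup_size - 1) / subgroup_size;
}
```

With the numbers from the example above, 128 items at a workgroup size of 16 give 8 workgroups, and a subgroup size of 8 gives 2 subgroups per workgroup.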
Next we need to expose the graph to the compute shader. The graph is represented as an adjacency array, sometimes referred to as "CSR" (Compressed Sparse Rows, a type of sparse-matrix encoding), using the read-only buffers FirstEdges, Targets and Weight. During the search we need to keep track of the distance from src_node, which is saved in the buffer Dist. One detail to observe here is the binding. This is a numbered slot we can use to attach memory to the shader. This abstraction makes it easy to implement schemes like double-buffering, where we swap out the actual memory behind a buffer between iterations.
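On the host side, building these three arrays from an edge list is straightforward. A sketch (the `Edge` struct and `build_csr` are illustrative names, not from the actual code):

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

struct Edge { uint32_t from, to, weight; };

struct Csr {
    // first_edges has n + 1 entries so that the edges of node u are
    // targets[first_edges[u]] .. targets[first_edges[u + 1] - 1],
    // matching the loop in the shader.
    std::vector<uint32_t> first_edges, targets, weights;
};

// Expects `edges` sorted by their `from` node.
Csr build_csr(uint32_t n, const std::vector<Edge>& edges) {
    Csr csr;
    csr.first_edges.assign(n + 1, 0);
    for (const Edge& e : edges) csr.first_edges[e.from + 1]++;  // count out-degrees
    for (uint32_t u = 0; u < n; ++u)                            // prefix sums -> offsets
        csr.first_edges[u + 1] += csr.first_edges[u];
    for (const Edge& e : edges) {
        csr.targets.push_back(e.to);
        csr.weights.push_back(e.weight);
    }
    return csr;
}
```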
layout(std430, binding = 0) readonly buffer FirstEdges { uint first_edges[]; };
layout(std430, binding = 1) readonly buffer Targets { uint targets[]; };
layout(std430, binding = 2) readonly buffer Weight { uint weight[]; };
layout(std430, binding = 3) coherent buffer Dist { uint dist[]; };

Next we set up our PushConstants, which are constant for all invocations of this shader and act like "function parameters" to it. In this case we provide the source and destination nodes, the total number of nodes, the current bucket we are processing, and the delta value. max_weight is used to implement the light/heavy edge filter.
layout(push_constant) uniform PushConsts {
    uint src_node;
    uint dst_node;
    uint n;
    uint bucket_idx;
    uint delta;
    uint max_weight;
} pc;

The node in our graph the shader is invoked for is determined by the global invocation ID:

uint u = gl_GlobalInvocationID.x;

Each invocation of the shader (or "thread" if you want to call it that) gets its own invocation ID. Since this was originally designed for 3D applications, the invocation IDs are three-dimensional. For our use case we only care about one dimension, x, which spans values from 0 to workgroup_size * num_workgroups - 1. Since the number of nodes in our graph is usually not a multiple of workgroup_size, we need to guard against that:
if (u >= pc.n) {
    return;
}

We only do any real computation if the node's distance to the source is in our current bucket range. From this you may conclude that we run this shader on nodes that are not in the current bucket, and you are right. In fact we run this shader on every node in the graph.
uint du = dist[u];
if (du >= pc.bucket_idx * pc.delta && du < (pc.bucket_idx + 1) * pc.delta) {
    // ...
}

This is wasteful: in each iteration we determine anew which nodes are in the current bucket. The original CPU implementation keeps per-bucket node lists. Since the number of nodes in each bucket is unknown in advance, we would need buckets with dynamic sizes, and growing a bucket requires all threads writing to it to block and wait. On a CPU this is relatively easy because we have mutexes to synchronize between execution threads; GPUs are not architected around supporting this level of synchronization. The main lever that Near-Far or ADDS can apply to improve on Delta-Step on the GPU is not requiring this synchronization or these inefficient sweeps.
for (uint eid = first_edges[u]; eid < first_edges[u+1]; ++eid) {
    uint w = weight[eid];
    if (w < pc.max_weight) {
        uint v = targets[eid];
        uint alt = du + w;
        uint prev = atomicMin(dist[v], alt);
        if (alt < prev) {
            atomicMax(dist[pc.n], alt);
        }
    }
}

The main loop of the shader is very similar to a Dijkstra
implementation. We iterate over all neighbours v, calculate
their distance to the source coming from node u. If taking
the path over u is an improvement over the current distance
to v, we update dist[v] using
atomicMin. Since we can't coordinate between different
workgroups, we need to be careful when writing to shared buffers like
dist to not cause data races. Luckily there are various
atomic operations that guarantee the correct value wins if there is
contention. One detail to note here is that we use dist[pc.n] (i.e., one slot past the last node in the dist array) to keep track of the maximum distance seen. Since we don't have explicit buckets, we can't tell when all remaining buckets are empty and the search can terminate. But if we know the maximum distance seen is smaller than the upper bound of the current bucket, we know all following buckets will be empty.
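The relaxation this performs can be sketched on the CPU with a compare-exchange loop (GLSL's atomicMin returns the original value, which is what the shader compares against alt; the helper name is mine):

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Lower `slot` to `alt` if alt is smaller, returning the value the slot held
// before, mirroring the shader's `uint prev = atomicMin(dist[v], alt);`.
uint32_t atomic_min_u32(std::atomic<uint32_t>& slot, uint32_t alt) {
    uint32_t prev = slot.load();
    // compare_exchange_weak reloads `prev` with the current value on failure.
    while (prev > alt && !slot.compare_exchange_weak(prev, alt)) {
    }
    return prev;
}
```

If `alt < prev` afterwards, this thread's write won, exactly the condition the shader uses before updating the max-distance slot.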
To run this shader from the CPU we use Vulkan. We need a pipeline, which is our compiled shader bound to a specific device and descriptor set layout. The descriptor set layout tells Vulkan which bindings (i.e., the numbered slots from our shader code) are available to connect buffers to. The code below collects commands in a command buffer. We invoke a pipeline by binding buffers to a specific configuration (the descriptor set), setting the push constants, and then dispatching workgroups. The barrier we configure ensures that all memory written by the GPU that we want to read on the host (i.e., the CPU) will be available.
cmd_buf.begin({vk::CommandBufferUsageFlagBits::eOneTimeSubmit});
cmd_buf.bindPipeline(vk::PipelineBindPoint::eCompute, main_pipeline.pipeline);
cmd_buf.bindDescriptorSets(vk::PipelineBindPoint::eCompute,
main_pipeline.layout,
0,
current_desc_set,
{});
cmd_buf.pushConstants(main_pipeline.layout,
vk::ShaderStageFlagBits::eCompute,
0,
sizeof(pc),
&pc);
cmd_buf.dispatch((num_nodes + WORKGROUP_SIZE - 1) / WORKGROUP_SIZE, 1, 1);
cmd_buf.pipelineBarrier(vk::PipelineStageFlagBits::eComputeShader,
vk::PipelineStageFlagBits::eHost,
vk::DependencyFlags{},
vk::MemoryBarrier{vk::AccessFlagBits::eShaderWrite,
vk::AccessFlagBits::eHostRead},
{},
{});
cmd_buf.end();
queue.submit(vk::SubmitInfo{0, nullptr, nullptr, 1, &cmd_buf});
queue.waitIdle();

Once all commands are recorded in the buffer, we submit them to a queue. This actually starts the execution of our shader. When the queue is empty and all commands have been executed, we can continue with the next iteration. One important feature of Vulkan is that we can reuse recorded command buffers or record new commands while previous ones are still executing. This makes it more efficient to run the same code repeatedly on the GPU; in my experiments this improved performance by 10-20%. For now we keep the simpler style of recording and submitting immediately since it is easier to follow.
In addition to the code to invoke the shader, we need to implement the CPU-side control loop of the Delta-Step shader.
uint32_t min_changed = 0, max_changed = 1;  // non-empty range so the loop runs at least once
uint32_t best_distance = UINT_MAX, max_distance = 0;
for (uint32_t bucket_index = 0; bucket_index < MAX_BUCKETS; ++bucket_index) {
    // only process light edges below the delta threshold
    while (max_changed - min_changed > 0) {
        std::tie(min_changed, max_changed, best_distance, max_distance) =
            executeDeltaStepShader(bucket_index, delta);
    }
    // process all edges in the bucket, including the heavy ones
    std::tie(min_changed, max_changed, best_distance, max_distance) =
        executeDeltaStepShader(bucket_index, UINT_MAX);
    // we have found a path to the destination node and it is settled
    if (best_distance < (bucket_index + 1) * delta) {
        break;
    }
    // all following buckets will be empty
    if (max_distance < (bucket_index + 1) * delta) {
        break;
    }
}
return best_distance;

In the original shader code at the beginning we omitted something
important: Tracking which nodes actually changed in each iteration. We
discussed in the previous blog post that Delta-Step has to process each
bucket potentially multiple times until no nodes in the bucket get
updated anymore. We'll add two additional buffers to our shader to keep
track of which nodes changed in the previous run and which ones were
changed in the current run. To make things a bit more memory efficient
we can use bit-vectors here that cut down the size per buffer to
num_nodes / 32.
// Nodes modified by the current iteration (written to)
layout(std430, binding = 4) coherent buffer CurrentChanged { uint current_changed[]; };
// Nodes modified in the previous iteration (read from)
layout(std430, binding = 5) readonly buffer PreviousChanged { uint previous_changed[]; };

With these buffers we can check whether a node was changed in the previous iteration:
uint du = dist[u];
if (du >= pc.bucket_idx * pc.delta &&
du < (pc.bucket_idx + 1) * pc.delta &&
// additional check if the node was changed
was_node_changed(u)) {
// ...
}

We still have to process every node in the graph, but we discard most of them early. To improve on that we could try to process only the nodes that changed. Keeping an explicit list of nodes is challenging, but it is fairly simple to track the range of node IDs that changed. We can reuse two additional slots in the changed buffer to store the min/max and use atomic operations to keep them consistent across all workgroups.
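The was_node_changed and set_node_changed helpers are not spelled out in the shader snippets. Their bit-vector logic, sketched here in single-threaded C++ (the shader versions would use an atomicOr on the changed buffers instead):

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// One bit per node, packed into 32-bit words (num_nodes / 32 words, rounded up).
struct ChangedBits {
    std::vector<uint32_t> words;
    explicit ChangedBits(uint32_t num_nodes) : words((num_nodes + 31) / 32, 0) {}

    // Mark a node as changed; returns whether it was already marked
    // (like set_node_changed, where the shader would use atomicOr).
    bool set(uint32_t node) {
        uint32_t mask = 1u << (node % 32);
        bool was_set = (words[node / 32] & mask) != 0;
        words[node / 32] |= mask;
        return was_set;
    }

    // Like was_node_changed, reading the previous iteration's bits.
    bool test(uint32_t node) const {
        return (words[node / 32] >> (node % 32)) & 1u;
    }
};
```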
if (du >= pc.bucket_idx * pc.delta &&
    du < (pc.bucket_idx + 1) * pc.delta &&
    was_node_changed(u)) {
    for (uint eid = first_edges[u]; eid < first_edges[u+1]; ++eid) {
        uint w = weight[eid];
        if (w < pc.max_weight) {
            uint v = targets[eid];
            uint alt = du + w;
            uint prev = atomicMin(dist[v], alt);
            if (alt < prev) {
                atomicMax(dist[pc.n], alt);
                bool was_set = set_node_changed(v);
                if (!was_set) {
                    atomicMin(current_changed[num_blocks], v);
                    atomicMax(current_changed[num_blocks + 1], v);
                }
            }
        }
    }
}

And then we only process
min_changed_id .. max_changed_id in each iteration. The size of this range highly depends on how your graph nodes are ordered. If node IDs are shuffled uniformly at random, the range will be around |V|/3 on average. Since the road networks we are working with here are from OpenStreetMap, the "natural" order of node IDs is roughly their creation time in OSM. While nearby nodes often have correlated IDs, the range boils down to the oldest and newest node in a bucket: if one node was added recently and another is very old, the range will be quite wide. Road networks have the convenient property that nodes have a location, and we can use that to order them. One way to do that is to use space-filling curves like Z-order curves, which map a 2D value like (longitude, latitude) to a single value. If we order the nodes in the graph by their corresponding Z-order value, then nodes that are sorted close together are geographically close together. Since the nodes in a bucket are found via neighbours, they are geographically close, which in turn means their Z-order values are similar and hence their IDs should be close together.
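Computing such a Z-order value is cheap. A sketch using the classic bit-interleaving trick for 16-bit coordinates (the real code would first have to quantize longitude/latitude into that range):

```cpp
#include <cassert>
#include <cstdint>

// Spread the lower 16 bits of v so there is a zero bit between each of them.
uint32_t spread_bits(uint32_t v) {
    v &= 0x0000FFFF;
    v = (v | (v << 8)) & 0x00FF00FF;
    v = (v | (v << 4)) & 0x0F0F0F0F;
    v = (v | (v << 2)) & 0x33333333;
    v = (v | (v << 1)) & 0x55555555;
    return v;
}

// Z-order (Morton) value: interleave the bits of the two coordinates,
// so nearby (x, y) pairs mostly get nearby values.
uint32_t z_order(uint32_t x, uint32_t y) {
    return spread_bits(x) | (spread_bits(y) << 1);
}
```

Sorting the nodes by this value yields the geographically coherent ordering described above.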

Drag the slider to reveal the natural or Z-curve node order.
The image above shows nodes colored by their node ID. On the
left-hand side we have nodes in their natural order as they appear in
OSM. On the right-hand side we have nodes sorted by the Z-order of their
geographical location. With our new graph ordering we reduce the average
range of min_changed_id .. max_changed_id dramatically.
Below we compare the range size per iteration for each bucket. We
primarily profit from smaller ranges in the early phase of the search
where buckets are small and only span small parts of the graph. Smaller
ranges also help in the late stage of each bucket where few nodes
remain. This approach could be extended for more complex dispatching
strategies based on geographical "cells", but in my experiments I
haven't found an approach yet that balances the additional complexity
(and hence overhead) of the dispatch shader with better actual
runtime.

Not only does this reduce the number of workgroups we need to dispatch, we also do more useful work per workgroup. If the nodes in a workgroup are close together, the chance is higher that they are all in the same bucket and were modified in the last iteration. Hence more threads in each subgroup enter the branch and do useful work, which is critical since the GPU execution model blocks all threads in a subgroup until all of them have completed.

Besides reducing the amount of unnecessary work, we are helping the memory caches on the GPU do the right thing. The bottleneck for most algorithms, be it on the CPU or the GPU, is usually memory latency. By shrinking the range of nodes we access during each iteration, we also access narrower memory ranges and have a higher chance of hitting the L1 or L2 cache. The readonly graph buffers will reside mostly in L1, since the graph data for one workgroup of size 64 is roughly 1.5 KB. Since the dist buffer is marked coherent, all writes to it need to be visible to all workgroups, so L2 is the best we can hope for there. For a small graph like Berlin the dist buffer is only roughly 660 KB in size, which still easily fits into the L2 cache. But even for a relatively small graph like Germany (13M nodes, 32M edges) we are looking at more than 50 MB for the distances alone, which likely no longer fits into L2. For a general overview of GPU caches, see this blog post (albeit the cache sizes are outdated).
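These sizes are simple back-of-the-envelope math, assuming 4-byte uint distances plus the extra max-distance slot at dist[n]:

```cpp
#include <cassert>
#include <cstdint>

// Size of the Dist buffer: one uint per node plus the extra
// max-distance slot at dist[n].
constexpr uint64_t dist_buffer_bytes(uint64_t num_nodes) {
    return (num_nodes + 1) * sizeof(uint32_t);
}
```

For Germany's 13M nodes this gives roughly 52 MB, which is why the distances alone already overflow a typical L2 cache.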

The above image shows a heatmap of the memory accesses for two of the buffers when the graph nodes are sorted in Z-order. On the left side we see the read counts for the dist buffer. Since we read this buffer in every shader invocation for every node in the min_changed .. max_changed node ID range, we see a distinct "block" pattern. On the right side we see the read counts for the first_edges array. This buffer is only read at positions where nodes actually changed, so the map shows the hotspots of nodes that were updated many times.
uint32_t min_changed = 0, max_changed = 1;  // non-empty range so the loop runs at least once
uint32_t best_distance = UINT_MAX, max_distance = 0;
for (uint32_t bucket_index = 0; bucket_index < MAX_BUCKETS; ++bucket_index) {
    // process all edges in a bucket until we converge
    while (max_changed - min_changed > 0) {
        std::tie(min_changed, max_changed, best_distance, max_distance) =
            executeDeltaStepShader(bucket_index);
    }
    // we have found a path to the destination node and it is settled
    if (best_distance < (bucket_index + 1) * delta) {
        break;
    }
    // all following buckets will be empty
    if (max_distance < (bucket_index + 1) * delta) {
        break;
    }
}
return best_distance;

In the original Delta-Step paper they divide the updates of buckets
into two phases: First we update all neighbors connected through "light"
edges (edges with weight smaller than delta) until they
converge. Then we update the heavy edges in a single pass, since they
will only update the distance of nodes outside the current bucket.
However, for this particular GPU implementation the light/heavy split does not yield much benefit. This is due to the GPU execution model, where branches that are not taken simply make the thread idle: conditionally filtering neighbor updates does not significantly decrease the runtime of a subgroup, because other threads in it may still be processing light edges. Avoiding this would require separate edge arrays for light and heavy edges, which wasn't worth the additional overhead. The simplified algorithm that updates all neighbors (via both light and heavy edges) is 15% faster.
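For reference, the separate edge arrays mentioned above would come from a simple partition by weight, sketched here (illustrative names; as noted, this variant was not worth it in practice):

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

struct EdgeRef { uint32_t target, weight; };

// Partition edges into light (< delta) and heavy (>= delta) edges,
// as the two-phase Delta-Step scheme would require.
void split_edges(const std::vector<EdgeRef>& edges, uint32_t delta,
                 std::vector<EdgeRef>& light, std::vector<EdgeRef>& heavy) {
    for (const EdgeRef& e : edges) {
        (e.weight < delta ? light : heavy).push_back(e);
    }
}
```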
After all this work, is it actually faster than Dijkstra? No, not even on a tiny graph like Berlin. A standardized way to compare SSSP algorithms across different query lengths is the Dijkstra rank. The Dijkstra rank groups queries from a source to a destination by the number of nodes that need to be settled to find the shortest path. The rank is a logarithmic unit: each further rank takes twice the number of operations. The graph below shows the performance of Delta-Step, Near-Far and Dijkstra on Berlin. Our Delta-Step variation seems to beat Near-Far slightly (speedup 1.2), but Dijkstra reigns supreme, with a speedup of ~25 over Delta-Step and ~30 over Near-Far.
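One common way to compute the rank, assuming queries are grouped by powers of two of the settled-node count (my reading of the definition above):

```cpp
#include <cassert>
#include <cstdint>

// Dijkstra rank of a query: the floored base-2 logarithm of the number of
// nodes Dijkstra settles before reaching the destination. Each further
// rank therefore doubles the amount of work.
uint32_t dijkstra_rank(uint64_t settled_nodes) {
    uint32_t rank = 0;
    while (settled_nodes >>= 1) {
        ++rank;
    }
    return rank;
}
```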

However, Berlin is a tiny graph, and the real performance of these algorithms only becomes apparent on graphs with more realistic sizes, like all of Germany. Even with our improvement of only processing the range of changed nodes, that range can often be close to the whole graph (even with Z-order). For a tiny graph like Berlin this overhead is smaller than the algorithmic overhead of the Near-Far algorithm; if the graph is much larger, Near-Far has a clear advantage because it avoids processing unnecessary nodes to begin with.

Now, these measurements were done on my ThinkPad X1 (2018 vintage), with its rather underpowered Intel UHD Graphics 620 GPU that does not even have an L1 data cache. When running the same measurements on my aging Radeon RX480, a different picture emerges:

There is hope! We can beat Dijkstra on a GPU; the queries just need to be large enough. Dijkstra still has a speedup of 1.1 over our Delta-Step algorithm, but Near-Far beats Dijkstra with a speedup of 1.48. In the next post of this series we will discuss how Near-Far achieves these speedups. Stay tuned!