Blog - Patrick Niklaus


Optimizing Delta-Step on the GPU

This is the second post in a series about shortest-path algorithms on the GPU. See the previous one if you missed it.

In the previous post we discussed the algorithmic problems of implementing parallelized SSSP algorithms on GPUs. In this post we want to dive into the details of Delta-Step on the GPU and why it is not efficient.

How do we even run computations on the GPU? Besides vendor-specific frameworks like CUDA (NVIDIA) and ROCm (AMD), there are vendor-agnostic options like SYCL and OpenCL. The odd one of the bunch is Vulkan, which originally targeted not computation but graphics rendering. Vulkan has mandatory support for compute shaders though, and basically every somewhat modern GPU supports it. Its API is low-level in its primitives, which lets us learn more about the details of GPU execution. Vulkan runs shaders compiled to an intermediate instruction set called SPIR-V, which gets translated to the native instruction set of the GPU on execution. Compute shaders are typically written in GLSL, a C-like language that is also common for graphics shaders.

A minimal version of a Delta-Step-like compute shader in GLSL may look something like this:

layout(local_size_x = 64) in;

layout(std430, binding = 0) readonly buffer FirstEdges { uint first_edges[]; };
layout(std430, binding = 1) readonly buffer Targets { uint targets[]; };
layout(std430, binding = 2) readonly buffer Weight { uint weight[]; };
layout(std430, binding = 3) coherent buffer Dist { uint dist[]; };

layout(push_constant) uniform PushConsts {
    uint src_node;
    uint dst_node;
    uint n;
    uint bucket_idx;
    uint delta;
    uint max_weight;
} pc;

void main() {
    uint u = gl_GlobalInvocationID.x;

    if (u >= pc.n) {
        return;
    }

    uint du = dist[u];
    if (du >= pc.bucket_idx * pc.delta && du < (pc.bucket_idx + 1) * pc.delta) {
        for (uint eid = first_edges[u]; eid < first_edges[u+1]; ++eid) {
            uint w = weight[eid];
            if (w < pc.max_weight) {
              uint v = targets[eid];
              uint alt = du + w;
              uint prev = atomicMin(dist[v], alt);
              if (alt < prev) {
                atomicMax(dist[pc.n], alt);
              }
            }
        }
    }
}

This line configures the size of our workgroups.

layout(local_size_x = 64) in;

A workgroup is an abstraction for one piece of work we want to schedule on the same execution unit. This execution unit has multiple names depending on the vendor or architecture; common ones are "Streaming Multiprocessor" (NVIDIA) and "Compute Unit" (AMD). The threads that execute together in lockstep on such a unit form a subgroup, also called a warp (NVIDIA) or wavefront (AMD). The subgroup size is usually fixed by the hardware, commonly 32 or 64. In the image below the work is split into 8 workgroups of 16 threads each. The hardware in this example requires subgroups of 8, so each workgroup has two subgroups. Workgroups can be freely scheduled between execution units, so there is no guarantee that two consecutive workgroups run on the same hardware. Within a workgroup, however, we are guaranteed that all subgroups run on the same execution unit and have access to fast shared memory. Since all threads in a subgroup run together, the longest-running thread determines how long the subgroup executes.

Next we need to expose the graph to the compute shader. The graph is represented as an adjacency array, sometimes referred to as "CSR" (Compressed Sparse Rows, a type of sparse matrix encoding), using the read-only buffers FirstEdges, Targets and Weight. During the search we need to keep track of the distance from src_node, which is saved in the buffer Dist. One detail to observe here is the binding. This is a numbered slot we can use to attach memory to the shader. This abstraction makes it easy to implement schemes like double-buffering, where we switch out the actual memory behind a buffer between iterations.

layout(std430, binding = 0) readonly buffer FirstEdges { uint first_edges[]; };
layout(std430, binding = 1) readonly buffer Targets { uint targets[]; };
layout(std430, binding = 2) readonly buffer Weight { uint weight[]; };
layout(std430, binding = 3) coherent buffer Dist { uint dist[]; };

The next thing we want to set up are our PushConstants, which are constant across all invocations of this shader. These act like "function parameters" to the shader. In this case we provide the source and destination nodes, the total number of nodes, the current bucket we are processing, and the delta value. max_weight is used to implement the light/heavy edge filter.

layout(push_constant) uniform PushConsts {
    uint src_node;
    uint dst_node;
    uint n;
    uint bucket_idx;
    uint delta;
    uint max_weight;
} pc;

The node in our graph the shader is invoked for is determined by the global invocation ID.

uint u = gl_GlobalInvocationID.x;

Each invocation of the shader (or "thread", if you want to call it that) gets its own invocation ID. Since this was originally designed for 3D applications, the invocation IDs are three-dimensional. For our use-case we only care about one dimension, x, which spans values from 0 to workgroup_size * num_workgroups - 1. Since the number of nodes in our graph is usually not a multiple of workgroup_size, we need to guard against out-of-range IDs:

if (u >= pc.n) {
    return;
}

We only do any real computation if the node's distance to the source lies in the current bucket's range. From this you may conclude that we run this shader on nodes that are not in the current bucket, and you are right: in fact we run this shader on every node in the graph.

uint du = dist[u];
if (du >= pc.bucket_idx * pc.delta && du < (pc.bucket_idx + 1) * pc.delta) {
    // ...
}

This is wasteful. In each iteration we determine anew which nodes are in the current bucket. The original CPU implementation keeps per-bucket node lists. Since the number of nodes in each bucket is unknown, we would need to implement buckets with dynamic sizes. Growing a bucket requires all threads writing to it to block and wait. On a CPU this is relatively easy because we have mutexes to synchronize execution between threads. GPUs are not architected around this level of synchronization. The main lever that Near-Far or ADDS can apply to improve on Delta-Step on the GPU is not requiring this synchronization or these inefficient sweeps.

for (uint eid = first_edges[u]; eid < first_edges[u+1]; ++eid) {
    uint w = weight[eid];
    if (w < pc.max_weight) {
        uint v = targets[eid];
        uint alt = du + w;
        uint prev = atomicMin(dist[v], alt);
        if (alt < prev) {
            atomicMax(dist[pc.n], alt);
        }
    }
}

The main loop of the shader is very similar to a Dijkstra implementation. We iterate over all neighbours v and calculate their distance to the source via node u. If taking the path over u is an improvement over the current distance to v, we update dist[v] using atomicMin. Since we can't coordinate between different workgroups, we need to be careful when writing to shared buffers like dist to not cause data races. Luckily there are various atomic operations that guarantee the correct value wins if there is contention. One detail to note here is that we use dist[pc.n] (i.e. one past the last node's entry in the dist array) to keep track of the maximum distance seen. Since we don't have explicit buckets, we can't otherwise tell when all remaining buckets are empty and the search can terminate. If the maximum distance seen is smaller than the current bucket's upper bound, we know all following buckets will be empty.

To run this shader from the CPU we use Vulkan. We need a pipeline, which is our compiled shader bound to a specific device and descriptor set layout. The descriptor set layout tells Vulkan which bindings (i.e. the numbered slots from our shader code) are available to connect buffers to. The code below collects commands in a command buffer. We invoke a pipeline by binding buffers to a specific configuration (the descriptor set), setting the push constants, and then dispatching workgroups. The barrier we configure ensures that all memory written by the GPU that we want to read on the host (i.e. the CPU) will be visible.

cmd_buf.begin({vk::CommandBufferUsageFlagBits::eOneTimeSubmit});

cmd_buf.bindPipeline(vk::PipelineBindPoint::eCompute, main_pipeline.pipeline);
cmd_buf.bindDescriptorSets(vk::PipelineBindPoint::eCompute,
                            main_pipeline.layout,
                            0,
                            current_desc_set,
                            {});

cmd_buf.pushConstants(main_pipeline.layout,
                      vk::ShaderStageFlagBits::eCompute,
                      0,
                      sizeof(pc),
                      &pc);

cmd_buf.dispatch((num_nodes + WORKGROUP_SIZE - 1) / WORKGROUP_SIZE, 1, 1);
cmd_buf.pipelineBarrier(vk::PipelineStageFlagBits::eComputeShader,
                        vk::PipelineStageFlagBits::eHost,
                        vk::DependencyFlags{},
                        vk::MemoryBarrier{vk::AccessFlagBits::eShaderWrite,
                                          vk::AccessFlagBits::eHostRead},
                        {},
                        {});


cmd_buf.end();
queue.submit(vk::SubmitInfo{0, nullptr, nullptr, 1, &cmd_buf});
queue.waitIdle();

Once all commands are recorded in the buffer we submit them to a queue, which actually starts the execution of our shader. When the queue is empty and all commands have been executed, we can continue with the next iteration. One important feature of Vulkan is that we can record command buffers once and reuse them multiple times, or record new commands while the last ones are still executing. This makes it more efficient to run the same code repeatedly on the GPU; in my experiments this improved performance by 10-20%. For now we keep the simpler style of recording and submitting immediately, since it is easier to follow.

In addition to the code to invoke the shader, we need to implement the CPU-side control loop of the Delta-Step shader.

uint32_t min_changed = 0, max_changed = 0;
uint32_t best_distance = UINT_MAX, max_distance = 0;

for (auto bucket_index = 0u; bucket_index < MAX_BUCKETS; ++bucket_index) {
    do {
        // only process light edges below the delta threshold
        std::tie(min_changed, max_changed, best_distance, max_distance) =
            executeDeltaStepShader(bucket_index, delta);
    } while (max_changed - min_changed > 0);

    // process all edges in the bucket, also the heavy ones
    std::tie(min_changed, max_changed, best_distance, max_distance) =
        executeDeltaStepShader(bucket_index, UINT_MAX);

    // we have found a path to the destination node and it is settled
    if (best_distance < (bucket_index+1) * delta) {
        break;
    }

    // all following buckets will be empty
    if (max_distance < (bucket_index+1) * delta) {
        break;
    }
}

return best_distance;

In the original shader code at the beginning we omitted something important: tracking which nodes actually changed in each iteration. We discussed in the previous blog post that Delta-Step potentially has to process each bucket multiple times, until no nodes in the bucket get updated anymore. We'll add two additional buffers to our shader to keep track of which nodes changed in the previous run and which change in the current run. To make things a bit more memory efficient we can use bit-vectors here, which cut the size per buffer down to num_nodes / 32 uints (one bit per node).

// Nodes modified by the current iteration (written to)
layout(std430, binding = 4) coherent buffer CurrentChanged { uint current_changed[]; };
// Nodes modified in the previous iteration (read from)
layout(std430, binding = 5) readonly buffer PreviousChanged { uint previous_changed[]; };

With these buffers we can check whether a node was changed in the previous iteration.

uint du = dist[u];
if (du >= pc.bucket_idx * pc.delta &&
    du < (pc.bucket_idx + 1) * pc.delta &&
    // additional check if the node was changed
    was_node_changed(u)) {
    // ...
}

We still have to process every node in the graph, but we discard most of them early. To improve that we could try to process only the nodes that changed. Keeping a list of nodes is challenging, but it is fairly simple to keep track of the range of node IDs that changed. We can reuse two additional values in the changed buffer to store the min/max there and use atomic operations to ensure they are consistent over all workgroups.

if (du >= pc.bucket_idx * pc.delta &&
    du < (pc.bucket_idx + 1) * pc.delta &&
    was_node_changed(u)) {
    for (uint eid = first_edges[u]; eid < first_edges[u+1]; ++eid) {
        uint w = weight[eid];
        if (w < pc.max_weight) {
            uint v = targets[eid];
            uint alt = du + w;
            uint prev = atomicMin(dist[v], alt);
            if (alt < prev) {
                atomicMax(dist[pc.n], alt);
                bool was_set = set_node_changed(v);
                if (!was_set) {
                    atomicMin(current_changed[num_blocks], v);
                    atomicMax(current_changed[num_blocks + 1], v);
                }
            }
        }
    }
}

And then we only process min_changed_id .. max_changed_id in each iteration. The size of this range depends heavily on how your graph nodes are ordered. If nodes are shuffled uniformly at random, this will be around |V|/3 on average. Since the road networks we are working with here come from OpenStreetMap, the "natural" order of node IDs is roughly their creation time in OSM. While nearby nodes tend to have correlated IDs, the range ultimately boils down to the oldest and newest node in a bucket: if one node was added recently and another is very old, the range will be quite wide. Road networks have the convenient property that nodes have a location, and we can use that to order them. One way to do so is a space-filling curve like the Z-order curve, which maps a 2D value like (longitude, latitude) to a single value. If we order the nodes in the graph by their Z-order value, then nodes that sort close together are geographically close together. Since the nodes in a bucket are found via their neighbours, they are geographically close, which in turn means their Z-order values are similar and hence their IDs should be close together.

(Interactive figure: drag the slider to compare the natural and Z-curve node orders.)

The image above shows nodes colored by their node ID. On the left-hand side nodes appear in their natural order from OSM; on the right-hand side they are sorted by the Z-order of their geographical location. With the new graph ordering we reduce the average size of the min_changed_id .. max_changed_id range dramatically. Below we compare the range size per iteration for each bucket. We profit most from smaller ranges in the early phase of the search, where buckets are small and only span small parts of the graph. Smaller ranges also help in the late stage of each bucket, where few nodes remain. This approach could be extended to more complex dispatching strategies based on geographical "cells", but in my experiments I haven't yet found an approach that balances the additional complexity (and hence overhead) of a dispatch shader against better actual runtime.

Not only does this reduce the number of workgroups we need to dispatch, we also do more useful work per workgroup. If the nodes in a workgroup are close together, the chance is higher that they are all in the same bucket and were modified in the last iteration. Hence more threads in each subgroup enter the branch and do useful work, which is critical since the execution model of a GPU blocks all threads in a subgroup until all of them have completed.

Besides reducing the amount of unnecessary work, we are helping the memory caches on the GPU do the right thing. The bottleneck for most algorithms, be it on the CPU or the GPU, is usually memory latency. By shrinking the range of nodes we access during each iteration, we access narrower memory ranges and have a higher chance of hitting the L1 or L2 cache. The read-only graph buffers will mostly reside in L1, since the graph data for one workgroup of size 64 is roughly 1.5 KB. The dist buffer is marked coherent, meaning all writes need to be visible to all workgroups, so L2 is the best we can hope for there. For a small graph like Berlin the dist buffer is only roughly 660 KB, which easily fits into L2 cache. But even for a relatively small graph like Germany (13M nodes, 32M edges) we are looking at more than 50 MB for the distances alone, which likely no longer fits into L2. For a general overview of GPU caches, see this blog post (albeit with outdated cache sizes).

The image above shows a heatmap of memory accesses for two of the buffers, with the graph nodes sorted in Z-order. On the left side we see the read counts for the dist buffer. Since we read this buffer in every shader invocation for every node in the min_changed .. max_changed range, a distinct "block" pattern emerges. On the right side we see the read counts for the first_edges array. This buffer is only read at positions where nodes actually changed, so the map shows the hotspots of nodes that were updated many times.

uint32_t min_changed = 0, max_changed = 0;
uint32_t best_distance = UINT_MAX, max_distance = 0;

for (auto bucket_index = 0u; bucket_index < MAX_BUCKETS; ++bucket_index) {
    // process all edges in a bucket until we converge
    do {
        std::tie(min_changed, max_changed, best_distance, max_distance) =
            executeDeltaStepShader(bucket_index);
    } while (max_changed - min_changed > 0);

    // we have found a path to the destination node and it is settled
    if (best_distance < (bucket_index+1) * delta) {
        break;
    }

    // all following buckets will be empty
    if (max_distance < (bucket_index+1) * delta) {
        break;
    }
}

return best_distance;

In the original Delta-Step paper the updates of a bucket are divided into two phases: first all neighbors connected through "light" edges (edges with weight smaller than delta) are updated until the bucket converges; then the heavy edges are updated in a single pass, since they only update the distance of nodes outside the current bucket. For this particular GPU implementation, however, the light/heavy split does not yield much benefit. This is connected to the GPU execution model, where branches that are not taken simply make the thread idle: conditionally filtering neighbor updates does not significantly decrease the runtime of a subgroup, because other threads in the workgroup may still be processing light edges. To actually benefit we would need separate edge arrays for light and heavy edges. In my experiments this wasn't worth the additional overhead, and the simplified algorithm that updates all neighbors (be it via light or heavy edges) is 15% faster.

After all this work, is it actually faster than Dijkstra? No, not even on a tiny graph like Berlin. A standardized way to compare SSSP algorithms across different query lengths is the Dijkstra rank, which groups queries from a source to a destination by the number of nodes that need to be settled to find the shortest path. The rank is a logarithmic unit: each further rank takes twice the number of operations. The graph below shows the performance of Delta-Step, Near-Far and Dijkstra on Berlin. Our Delta-Step variant beats Near-Far slightly (speedup 1.2), but Dijkstra reigns supreme, with a speedup of ~25 over Delta-Step and ~30 over Near-Far.

However, Berlin is a tiny graph, and the real performance of these algorithms only becomes apparent on graphs of more realistic size like all of Germany. Even with our improvement of only processing the range of changed nodes, that range can often be close to the whole graph (even with Z-order). For a tiny graph like Berlin this overhead is smaller than the algorithmic overhead of the Near-Far algorithm. On much larger graphs Near-Far has a clear advantage, because it avoids processing unnecessary nodes to begin with.

Now these measurements were done on my Thinkpad X1 (2018 vintage), with its rather underpowered Intel UHD Graphics 620 GPU that does not even have an L1 data cache. When running the same measurements on my rather dated Radeon RX480 a different picture emerges:

There is hope! We can beat Dijkstra on a GPU; the queries just need to be large enough. Dijkstra still has a speedup of 1.1 over our Delta-Step algorithm, but Near-Far beats Dijkstra with a speedup of 1.48. In the next post of this series we will discuss how Near-Far achieves these speedups. Stay tuned!