Optimizing Parallel Reduction in Metal for Apple M1


Exploring how an optimal implementation should approach the memory bandwidth of the architecture

Matthew Kieber-Emmons


Parallel reduction is a data-parallel primitive commonly used to reduce an array of elements into a single result. Specifically, the reduction operator sum applied to the array [3, 6, 0, 8] yields 3 + 6 + 0 + 8 = 17. This operation is frequently an integral step in more complex problems. For example, in our scientific application Lumo, which is used to visualize molecular orbitals, we use parallel reduction as part of isosurface construction to count the number of voxels that the isovalue intersects. Of course, the applications of reduction operations extend well beyond scientific problems, with the MapReduce programming model arguably the most well-known paradigm for big data processing.

While parallel reduction is trivial to implement on GPUs, it is a challenge to fully optimize. Mark Harris at NVIDIA published Optimizing Parallel Reduction in CUDA, which is an incredibly useful starting point for understanding the pitfalls of naive implementations. With the addition of templates to kernel functions in Metal Shading Language 2.3, these optimizations can now be directly implemented in Metal to yield working computational kernels.