I was reading Algorithms for Modern Hardware recently, specifically the chapter on SIMD, and was impressed by auto-vectorization. The basic idea is that modern CPUs have so-called SIMD (Single Instruction, Multiple Data) instructions that apply an operation to several values at once, which is much faster than performing the equivalent scalar instructions one at a time. Modern compilers can detect certain programming patterns where an operation is performed repeatedly on scalar values and group the operations to make use of the faster instructions. There are some details around having to either pad the last group so it lines up with the group size or run some scalar instructions at the end; it’s really quite clever.
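To make that grouping concrete, here is a rough sketch of the shape an auto-vectorized sum takes (my own illustration, not from the book): the main loop processes fixed-size groups that the compiler would map onto SIMD registers, and a scalar tail handles the leftovers. The group size of 8 is an assumption, matching a 256-bit AVX2 register holding eight u32 lanes:
fn grouped_sum(arr: &[u32]) -> u32 {
    let mut chunks = arr.chunks_exact(8);
    // Stands in for one 8-lane SIMD register of partial sums.
    let mut lanes = [0u32; 8];
    for chunk in &mut chunks {
        for i in 0..8 {
            // A vectorizer performs these eight additions as one instruction.
            lanes[i] += chunk[i];
        }
    }
    // Scalar tail for the elements that didn't fill a complete group.
    let tail: u32 = chunks.remainder().iter().sum();
    lanes.iter().sum::<u32>() + tail
}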
The book primarily uses C++ for its examples, but I was curious whether the same pattern would work in Rust as well, without having to manually annotate anything. So I opened up Compiler Explorer and typed in my first test, which looked like this:
#[no_mangle]
fn sum() -> u32 {
    (0..1000).sum()
}
To make sure the compiler would feel safe to use SIMD instructions, I used -C opt-level=3 -C target-cpu=skylake, telling it that the target CPU supports
them. But instead of SIMD instructions I got this assembly:
sum:
        mov eax, 499500
        ret
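Sure enough, 499500 is the entire answer, worked out ahead of time:
$$\sum_{i=0}^{999} i = \frac{999 \cdot 1000}{2} = 499500$$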
Turns out I was outsmarted by the compiler, which used constant folding to directly return the final value it computed at compile time. To keep this optimization from happening, we can pass an argument into the function, so that the compiler cannot know the final result:
#[no_mangle]
fn sum(n: u32) -> u32 {
    (0..n).sum()
}
Which generates this assembly:
sum:
        test edi, edi
        je .LBB0_1
        lea eax, [rdi - 1]
        lea ecx, [rdi - 2]
        imul rcx, rax
        shr rcx
        lea eax, [rdi + rcx]
        dec eax
        ret
.LBB0_1:
        xor eax, eax
        ret
Still, there are no SIMD instructions in here. Those would usually start with the letter v, such as vpaddd, which adds integers in parallel: with AVX2, eight 32-bit integers per 256-bit register in a single instruction.
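As a quick illustration of what that looks like from Rust, here is a sketch using the AVX2 intrinsics in std::arch (my own addition, not from the book; with the avx2 feature enabled, the addition itself compiles down to a single vpaddd):
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;

/// Adds eight pairs of 32-bit integers in one go.
///
/// Safety: the caller must ensure the CPU actually supports AVX2.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn add_eight(a: [u32; 8], b: [u32; 8]) -> [u32; 8] {
    let va = _mm256_loadu_si256(a.as_ptr() as *const __m256i);
    let vb = _mm256_loadu_si256(b.as_ptr() as *const __m256i);
    let mut out = [0u32; 8];
    _mm256_storeu_si256(out.as_mut_ptr() as *mut __m256i, _mm256_add_epi32(va, vb));
    out
}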
To compare, I wrote the equivalent program in C++:
unsigned int sum(unsigned int n) {
    unsigned int rv = 0;
    for (unsigned int i = 0; i < n; i++) {
        rv += i;
    }
    return rv;
}
Compiler Explorer defaults to GCC for compiling C++. I used the equivalent -O3 -march=skylake, and sure enough, I got back 106 lines of assembly (omitted for brevity), including the desired SIMD instructions. I then switched to Clang, the C/C++ compiler built on LLVM, the same backend Rust uses, and it produced the exact same assembly as it did for the Rust version. This tells us the difference is not one of languages, but one of compilers. Interestingly enough, the LLVM IR generated for the two languages is quite different (you can inspect it with rustc --emit=llvm-ir and clang -S -emit-llvm), yet it ends up as the same machine code.
A quick comparative benchmark revealed the LLVM version to be significantly faster than GCC’s vectorized loop, especially for large values of n, which is unsurprising when looking at the instructions: about ten scalar instructions will beat hundreds of iterations over vector instructions.
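For reference, a minimal sketch of the kind of timing harness one could use (not my exact benchmark; sum is the sum(n) function from earlier, and black_box keeps the optimizer from constant-folding the input away):
use std::hint::black_box;
use std::time::Instant;

fn main() {
    for n in [1_000u32, 1_000_000, 1_000_000_000] {
        let start = Instant::now();
        // black_box hides n from the optimizer, so the call cannot be
        // folded to a constant like in the first experiment.
        let result = sum(black_box(n));
        println!("sum({n}) = {result}, took {:?}", start.elapsed());
    }
}
A single call is far too fast to time reliably, so a real benchmark would average over many iterations, but the shape is the same.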
The key here is that the sum of consecutive integers from zero to n has a closed-form solution, which I only realized after looking more closely at the assembly:
$$\sum_{i=0}^{n} i = \frac{n(n+1)}{2}$$
Since Rust’s 0..n range excludes n itself, the generated code actually computes $\frac{(n-1)\,n}{2}$, which is precisely what that sequence of lea, imul, and shr instructions spells out.
LLVM apparently detects that this is exactly what we are trying to do, and as a result it can do away with the looping altogether and directly calculate the result in one step, changing sum from O(n) to O(1). Colour me impressed.
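Written back as Rust, the transformation amounts to something like this (my reconstruction from the assembly, not code that rustc ever sees; the product is widened to u64, mirroring the 64-bit imul, so it cannot overflow):
#[no_mangle]
fn sum_closed_form(n: u32) -> u32 {
    if n == 0 {
        return 0;
    }
    // Closed form for 0 + 1 + ... + (n - 1).
    ((n as u64 - 1) * n as u64 / 2) as u32
}
For n = 1000 this gives 999 · 1000 / 2 = 499500, exactly the constant from the first experiment.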
Finally, replicating the book’s example more closely, I managed to get LLVM to automatically vectorize a loop for me, using the following code:
#[no_mangle]
fn sum(arr: &[u32]) -> u32 {
    arr.iter().sum()
}
This works because LLVM has no idea about the contents of the array, so it cannot deduce a more efficient way of calculating their sum than to actually add them all up.
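As a quick counter-check (my own experiment, not from the book), if the contents are visible at compile time, one would expect the folding to come right back:
#[no_mangle]
fn sum_known() -> u32 {
    // The contents are compile-time constants, so LLVM should be able
    // to fold this to a single `mov eax, 10`, like the very first example.
    [1u32, 2, 3, 4].iter().sum()
}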
My takeaway here is that LLVM is even better at generating optimal code than I would have expected, at least for trivial examples. I am happy that I can continue writing idiomatic code without feeling like I am trading off performance for readability.