Rust: Zero-Cost Abstraction in Action
idursun.com

These are not about abstractions but compile-time optimizations. These are also not implemented in the Rust compiler but in LLVM, so any language frontend in front of LLVM would yield the same optimizations. According to Wikipedia the list is:
> [...] variety of front ends: languages with compilers that use LLVM include ActionScript, Ada, C#, Common Lisp, Crystal, CUDA, D, Delphi, Dylan, Fortran, Graphical G Programming Language, Halide, Haskell, Java bytecode, Julia, Kotlin, Lua, Objective-C, OpenGL Shading Language, Ruby, Rust, Scala, Swift, Xojo, and Zig.
Sometimes I think a mention of Rust in the title just gets upvotes here, without anyone even reading the article.
Only those that emit LLVM IR that can be optimized like this.
In poorly performing languages like Python and Ruby, calling a method on a value (like sum(), or even the += operator) essentially means looking up the method name in a hash table, and then doing a dynamic function call on the resulting function pointer. This sort of optimization therefore cannot be done in an ahead-of-time compiler unless all objects are created locally and thus have known method dictionaries (and even then, it depends on whether the methods are available to the compiler in IR form, whether the lookup code is inlined, and whether the code can actually be simplified).
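Concretely, the difference looks something like this in Rust (an illustrative sketch of mine, not from the article; a `dyn` vtable call is much cheaper than a hash-table lookup, but it blocks the folding in a similar way):

    // Static dispatch: the concrete iterator type is known, everything
    // inlines, and LLVM can reduce the loop to a closed-form expression.
    pub fn sum_static(n: i32) -> i32 {
        (1..n).sum()
    }

    // Dynamic dispatch: every call to `next` goes through a vtable
    // pointer, so the optimizer generally cannot see through it and
    // the loop survives in the generated code.
    pub fn sum_dynamic(iter: &mut dyn Iterator<Item = i32>) -> i32 {
        let mut total = 0;
        while let Some(x) = iter.next() {
            total += x;
        }
        total
    }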
> essentially means looking up the method name in a hash table, and then doing a dynamic function call on the resulting function pointer
Well, you do need something very much like this if you want to support "openly extensible" classes in a backward- and forward-compatible way. Otherwise you end up with something like the C++ ABI hell, where changing things around in a base class breaks every binary that links against earlier versions of that class. This is only a "poorly performing" pattern inasmuch as the developer has no way to "opt out" of that extensibility when that's the sensible choice.
10+ years ago this was true, but CRuby's basic operations like += (the bytecode is opt_plus followed by setlocal) don't do a method lookup unless you redefine them, and even if you do, it uses inline caches in the bytecode, so there's only one hash lookup.
TruffleRuby has been constant folding trivial code like this since 2014. AOT is also an option.
Actually, there's a very good chance that this particular optimization is happening due to Miri, which allows very sophisticated constant evaluation of MIR (a Rust intermediate representation):
https://rust-lang.github.io/rustc-guide/miri.html
So while, yes, in general pretty much all Rust optimizations are actually due to LLVM, this one in particular might not be.
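For illustration, a minimal sketch of the kind of thing that const-evaluation machinery handles (my example, and note it only fires in const contexts, which is part of the caveat above):

    // The MIR interpreter (the same machinery that powers Miri) runs this
    // loop at compile time; TOTAL is baked into the binary as 499500 and
    // no loop appears in the generated code.
    const fn triangle(n: u32) -> u32 {
        let mut total = 0;
        let mut i = 1;
        while i < n {
            total += i;
            i += 1;
        }
        total
    }

    pub const TOTAL: u32 = triangle(1000);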
Here's the equivalent in C++, compiled with clang 9.0.0 at optimization level 1. https://godbolt.org/z/nKSG9_
Byte for byte identical assembly.
I think it's both about abstraction and about compile-time optimization. The point was that by default you can write code that uses higher-level abstractions but compiles to the same code as an imperative-style loop. Most widely used languages (or their default implementations) don't use LLVM; moreover, they are interpreted or JIT-compiled, so there might be a performance difference between e.g. for-loops and forEach with a closure.
Even worse, the lack of proper compiler-design courses at many universities and "engineer" bootcamps means that many attribute to language X what is actually a compiler-backend capability, as you explain quite well.
The C# compiler uses LLVM? Is that for something like Xamarin? I was under the impression that it was self-contained. Does anyone know the details?
Unity's Burst compiler https://docs.unity3d.com/Packages/com.unity.burst@1.2/manual... uses LLVM, but only supports a somewhat limited subset of C#. I've used it a bit in some performance-critical parts of a game and saw some dramatic speed increases. I can't speak to the rest of the C# ecosystem beyond Unity, though.
Anyone can build a frontend. "Java bytecode" is in that list too. It doesn't mean the de facto compilers of these languages rely on LLVM.
In this case it is Azul Zing, a very serious product. It's one of the four major JVM JIT compilers (OpenJ9, C2, and Graal are the others, in my opinion).
There are also PTC, Aicas, Virtenio, Ricoh and Gemalto, all targeted at embedded deployments, of which PTC and Aicas are the best known.
Sadly, Excelsior is no more. I imagine that regular JIT compilers making AOT/JIT caches available is what killed them.
There are many implementations of C#; a few of them use LLVM, like Xamarin AOT, IL2CPP, and Burst.
.NET Core and .NET Framework do not.
This isn't really talking about Rust; it's actually talking about the optimization capabilities of LLVM (the compiler backend, which quite a lot of languages use: Clang for C and C++, Rust, Swift, Julia, Zig, ...). These languages all have similar chances of performing those same optimizations, at least in simple cases like these.
What I'm interested in is how well-optimizable the IR emitted by each of those language frontends is for LLVM (in practical situations, not just a few lines of simple numerical code). I've heard that you need to be careful to encode the IR in the right way so that LLVM doesn't generate needless memcpys or something, but I'm not that much of a compiler expert...
rustc does indeed deliberately seek to emit LLVM IR that resembles what Clang would emit, in order to benefit from the same sorts of optimizations that default LLVM is tuned for.
It's both, it's just complicated by the fact that LLVM is optimizing it to a known formula.
The zero-cost abstraction is the fact that you get the same output, special optimization and all, whether you write the original loop, the "(1..n).fold(0, |x,y| x + y)" variant, or the "(1..n).sum()" variant. Rust converts those to intermediate code that is the same, or close enough, that LLVM is able to apply the same optimization.
Would it be better if sample code had been chosen that wasn't special-cased to the degree that the entire algorithm gets replaced wholesale? Probably. It doesn't invalidate the premise, though, just slightly obscures it.
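For concreteness, runnable versions of all three variants (the function names are mine):

    // All three compile to the same closed-form code under optimization.
    pub fn sum_loop(n: i32) -> i32 {
        let mut total = 0;
        for i in 1..n {
            total += i;
        }
        total
    }

    pub fn sum_fold(n: i32) -> i32 {
        (1..n).fold(0, |x, y| x + y)
    }

    pub fn sum_iter(n: i32) -> i32 {
        (1..n).sum()
    }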
There is nothing about zero-cost abstractions in this article, just basic compiler optimizations.
The OP doesn't explicitly demonstrate any zero-cost abstractions; however, these "basic compiler optimizations" are only feasible to achieve statically (i.e. without a JIT) in the presence of sufficiently transparent abstractions, which takes a good deal of language-design effort to enable.
Iterators in Rust are (usually) zero-cost abstractions of loops. Bjarne Stroustrup's definition:
> What you don't use, you don't pay for. And further: What you do use, you couldn't hand code any better.
Rust iterators fit in the second part.
Well, functions are abstractions too, but yes, to me the interesting part of "zero-cost abstraction" is "zero-cost data structure composition", and that indeed is not covered.
For anyone interested in reading more about this, I would recommend Zero Cost Abstractions[1] by withoutboats, who is a contributor to Rust.
1: https://boats.gitlab.io/blog/post/zero-cost-abstractions/
This is not Rust-specific; this is a compiler thing. Both LLVM and GCC can detect sums and generate closed-form formulas instead. There are other fun algorithm detections, e.g. if you try to do bit counts yourself, LLVM will use popcnt instead.
Compilers are awesome, check out Matt Godbolt's talk on this very topic: https://www.youtube.com/watch?v=nAbCKa0FzjQ
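For example, a hand-rolled bit count like this sketch (my example) gets recognized:

    // Kernighan-style popcount: clears the lowest set bit per iteration.
    // With optimizations on, LLVM recognizes the idiom and emits a single
    // popcnt instruction on targets that have one.
    pub fn count_ones(mut x: u64) -> u32 {
        let mut count = 0;
        while x != 0 {
            x &= x - 1; // clear the lowest set bit
            count += 1;
        }
        count
    }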
I assume it's doing `(N-2)(N-3)/2 + 2N - 3` instead of `N(N-1)/2` due to overflow concerns? But couldn't `(N-2)(N-3)` also possibly overflow, just supporting a larger range of `N`?
In this assembly code it cannot overflow because N is a 32-bit integer and the multiplication gives a 64-bit result, which is converted to 32-bit only after shifting.
I can't figure out why it doesn't use the simpler formula (other than the optimizer being bad).
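In Rust terms, what the assembly computes is roughly this (my reconstruction from the description above; it ignores the branch the real code needs so that small n still yields 0):

    // The operands are widened to 64 bits before the multiply, so
    // (n-2)*(n-3) cannot overflow even when it exceeds i32::MAX; only the
    // final result is truncated back to 32 bits. Algebraically,
    // (n-2)*(n-3)/2 + 2*n - 3 == n*(n-1)/2.
    pub fn sum_closed_form(n: i32) -> i32 {
        let w = n as i64;
        ((w - 2) * (w - 3) / 2 + 2 * w - 3) as i32
    }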
The earlier transformations are actually more impressive. Constant folding almost always has a good RoI, so lots of compilers do it, and it's simple because, well, it's a constant: there are no variables and no partial evaluation needed. The others require lots of inlining before the final rule fires, and so are more ambitious.
Seeing what
    pub fn sum3(n: i32) -> i32 {
        (1..n).sum::<i32>() + (1..2 * n).sum::<i32>() + (1..(n + 2)).sum::<i32>()
    }
does would be more interesting to me. Also, while all the inlining is rustc, I assume the "triangle number trick" is LLVM.
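For what it's worth, since (1..k).sum() is k*(k-1)/2, the best case would be a reduction to something like this (my arithmetic; it glosses over the fact that empty ranges must sum to zero for small or negative n):

    // Each (1..k).sum() term collapses to k*(k-1)/2 for k >= 1.
    pub fn sum3_closed(n: i64) -> i64 {
        n * (n - 1) / 2 + (2 * n) * (2 * n - 1) / 2 + (n + 2) * (n + 1) / 2
    }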
Godbolt supports Rust, and can show the LLVM IR: https://godbolt.org/z/GsccW3
A bit off-topic: I love how Godbolt grew out of the C++ community to embrace as many AOT toolchains as possible. Kudos to Matt and everyone involved in making it happen.
I wonder if the const syntax introduced in ES6 will ever result in constant folding, or any JIT optimizations, for code running in JS runtimes. Apparently, from what I've read, const is ignored as far as optimizations go and is only used as a way to prevent the developer from ever reassigning certain variables.
The OCaml compiler does constant folding too, and I consistently look at the output (usually JavaScript, because of BuckleScript) in amazement when it finds shit like that. Unrolling all my tail recursion is great too.
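The same transformation is easy to see in Rust through LLVM, for what it's worth (an illustrative sketch of mine, not OCaml):

    // LLVM's tail-call elimination turns this self-recursive accumulator
    // into a plain loop, which can then be folded further like the other
    // loops discussed in this thread.
    pub fn sum_rec(n: i32, acc: i32) -> i32 {
        if n <= 0 { acc } else { sum_rec(n - 1, acc + n) }
    }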
For anyone that wants to learn about optimizations in AOT compiled ML languages, have a look at:
"The Implementation of Functional Programming Languages"
"Compiling with Continuations"
"Modern Compiler Implementation in ..." (C, Java and ML variants)
Although oriented towards Lisp, "Lisp in Small Pieces" is a classic book as well, with many optimizations like inlining of lambda calls across multiple call levels.
Are compiler optimizations like summing a series done on an ad-hoc basis? Certainly the compiler couldn’t have discovered or inferred (not sure what term to use) that formula, no?
Just generally curious.
Ad-hoc basis. It usually comes out of studying the most common classes of computation and optimizing for them. For example, the "summing a series" case probably falls out of loop optimizations: https://en.wikipedia.org/wiki/Loop_optimization
Those optimizations are happening on LLVM's end (though the higher-level language will have to expose sufficiently transparent abstractions to make these optimizations possible). I'd love to read a book on the optimization techniques that are implemented in LLVM/GCC to make transformations like this possible.
Gotcha. Thanks. So presumably Clang would have done the same for C?
Easy to check, thanks to godbolt:
Yes, and it's mentioned in the post that the C code was also brought into parity.
Yeah I saw that, but just wasn’t sure how using intrinsics played a role.
There is nothing impressive about this. https://godbolt.org/z/S2tDDh
> he was very disappointed because Rust version was twice as fast than the C version which was hand-optimised by pulling off all the tricks he knew to make it perform well.
Why disappointed? It just highlights the quality of Rust's approach.
Evidently the tricks were what made it slow.
This happens all too often: once the code changes enough that the optimizer doesn't recognize the pattern anymore, it throws up its hands, and you're on your own. Some people call this optimizer roulette.
It's not just compilers, either. CPUs have their own peephole optimizers and patterns they recognize, or don't, and it can easily make a 2x difference in your run time depending on whether it cottons to what you're trying, or doesn't.
With CPUs it gets even worse, because that cleverly optimized assembly code can stop being so on another CPU or after a firmware update.
The days of Z80, 6502 and similar are long gone.
This is neat, but there is still a cost in compile time.
Usually in the Rust community, "zero cost" means zero runtime cost. Obviously, there are many other costs you could define.