New Rust hash table leads Benchmarks Game
benchmarksgame.alioth.debian.org

In other Rust/benchmarksgame news, I just submitted a simple fix to the Rust program for "reverse-complement" that makes it faster than the fastest C++ program, on my computer. The old version was spending 2/3 of its time just reading the input into memory, because it wasn't allocating a large enough buffer up front.
https://github.com/TeXitoi/benchmarksgame-rs/pull/44
I'm also working on some additional changes that make it even faster than the C version (again, on my computer) by improving how it divides work across CPUs:
https://github.com/TeXitoi/benchmarksgame-rs/pull/46
These improved Rust programs have not yet been added to the benchmarksgame site. Previous entries are ranked at:
http://benchmarksgame.alioth.debian.org/u64q/performance.php...
Minor improvements to the Rust programs for "mandelbrot" and "binary-trees" are also awaiting review!
Would it be any worse to use `File::open("/dev/stdin")` there, so that it is safe code?
Oh, that's a good idea. (It's unfortunate that neither my code nor yours is portable to non-Unix platforms like Windows.)
UPDATE: Pushed a commit to my latest PR to replace the unsafe `File::from_raw_fd` with your `File::open`. Thanks!
Thanks!
I like Rust for what it does in advancing the state of the art of the languages, but I also like how this example demonstrates how hard it is to avoid "unsafe" constructs and remain competitive.
https://github.com/TeXitoi/benchmarksgame-rs/blob/master/src...
For what it's worth, this alternate implementation has only one line of unsafe code (a call to the libc "memchr" function) and is only 9% slower than the fastest unsafe version:
https://github.com/mbrubeck/benchmarksgame-rs/blob/reverse_c...
It's very easy to write extremely fast safe Rust code. (The safe Rust version above is faster than the fastest C++ submission, on my computer.) Using "unsafe" for optimization is usually only helpful to get a few extra percent speedup in an inner loop. If this were production code rather than the benchmarks game, I'd probably ship the safe version.
Wonderful, thank you! Producing the safe versions which are that efficient is really the best argument for Rust that I can imagine.
As an update, this pull request now removes most of the unsafe code:
Further update: I now have a Rust program that is 100% safe code and is also faster than any of the unsafe programs (including the C and C++ entries):
> one line of unsafe code (a call to the libc "memchr" function)
Is burntsushi's Rust implementation of memchr notably slower than libc's?
The rust-memchr crate uses libc's memchr if it's available and known to be fast. Otherwise, it falls back to a pure Rust implementation (written by bluss, not me).
I expect this to change once we get SIMD. ;-)
> Is burntsushi's Rust implementation of memchr notably slower than libc's?
Not as far as I know. A lot of this benchmarksgame code has evolved slowly over time and hasn't been cleaned up yet to use "modern" crates and language features.
Previously std::collections::HashMap was used with the default hash function --
[46.03 secs] http://benchmarksgame.alioth.debian.org/u64q/program.php?tes...)
and then the hash function was changed to FnvHasher --
[17.10 secs] http://benchmarksgame.alioth.debian.org/u64q/program.php?tes...
and then better use of quad core with futures_cpupool --
[9.44 secs] http://benchmarksgame.alioth.debian.org/u64q/program.php?tes...
and now use of std::collections::HashMap has been replaced with an experimental hash table inspired by Python 3.6's new dict implementation --
[5.30 secs] http://benchmarksgame.alioth.debian.org/u64q/program.php?tes...
afaict comparing #4 to #5 is all about differences between that experimental hash table and std::collections::HashMap --
[9.14 secs] http://benchmarksgame.alioth.debian.org/u64q/program.php?tes...
> afaict comparing Rust program #4 [5.30 secs] to Rust program #5 [9.14 secs] is all about differences between std::collections::HashMap and that experimental hash table.
There was also a small contribution (~6%) of working on bytes rather than strings according to the original PR: https://github.com/TeXitoi/benchmarksgame-rs/pull/39
Please check the source code for Rust #5.
> experimental hash table
When discussing whether or not to use this, at least one person mentioned that their company was using it in production. I don't think it really counts as experimental.
"experimental" is descriptive, not prescriptive:
> Experimental hash table implementation in just Rust (stable, no unsafe code).
That is how the author of ordermap currently describes ordermap at the top of the GitHub README.rst
Java is showing quite impressive numbers! 50% overhead over native C implementations was often cited as a good guess for the ultimate efficiency of JIT code generation back in the Self/HotSpot days.
People who were trying to castigate Go early on as having "Java-like speeds" were really just showing their ignorance of the state of the art of JIT compilation for managed languages and the JVM. Such outdated folk knowledge of performance seems to be a constant in the programming field over the decades. (Programmers have had such distorted views since the mid 80's at least.) Maybe this kind of knowledge needs to be used in job interview questions for a while? Very soon, people will just memorize such trivia for interviews, but it would serve to squash this form of folk-programming "alternative fact."
HotSpot has had great speeds for numeric computation at least since 2005. I was doing financial software in Java in my first job out of college, our CTO was an ex-Sun architect who literally wrote the book on Java, and the speeds we got on numerical computations were basically equivalent to C.
The part where Java really falls down is in memory use & management, which you can see on the binary-tree & mandelbrot benchmarks, where it's roughly 4x slower than C. There are inherent penalties to pointer chasing that you can't get around. While HotSpot is often (amazingly) smart enough to inline & stack-allocate small private structs, typical Java coding style relies on complex object graphs. In C++ or Rust these would all have well-defined object ownership and be contained within a single block of memory, so access is just "add a constant to this pointer, and load". In Java, you often need to trace a graph of pointers 4-5 levels deep, each of which may cause a cache miss.
Rule of thumb while I was at Google was to figure on real-world Java being about 2-3x slower than real-world C++.
> The part where Java really falls down is in memory use & management, which you can see on the binary-tree & mandelbrot benchmarks, where it's roughly 4x slower than C.
binary-tree is not useful for comparing GCed and non-GCed languages. For non-GCed languages, you are allowed to use a memory pool of your choice (the C version uses the Apache Portable Runtime library), for GCed languages you are required to use the standard GC with the default settings (no adjustment of GC parameters permitted). This is apples and oranges.
For mandelbrot, the C version uses handcoded SIMD intrinsics. I.e. it's not even portable to non-x86 processors.
> For non-GCed languages, you are allowed to use a memory pool of your choice (the C version uses the Apache Portable Runtime library), for GCed languages you are required to use the standard GC with the default settings (no adjustment of GC parameters permitted). This is apples and oranges.
Doesn't that match how a library would be used in the real world? A C library can create its own memory pool, but a GC'd one has to live with however its host is configured.
If I were to run a performance-critical application, I'd definitely tune the GC accordingly. It's why the JVM offers several garbage collectors in the first place, for example.
Also, GCed languages aren't prevented from using memory pools, but often they are not part of their common libraries, because there's less need for them.
> If I were to run a performance-critical application, I'd definitely tune the GC accordingly.
But you have to tune it for the performance of the whole application (AFAIK); you can't tune it for an individual algorithm like you can with C. It's a one-size-fits-all approach.
1. That goes towards the other point that I made [1] about how microbenchmarks have only limited relevance for the performance of large applications (the performance of memory pools can also change as a result; as an extreme case, multiple large memory pools can lead to swapping).
2. Many GCs allow you to tune performance for individual computations. For example, Erlang allows you to basically start a new lightweight process with a heap large enough so that collection isn't needed and to throw it away at the end; OCaml's GC parameters can be changed while the program is running.
Probably because people are intelligent enough not to compare speeds inside of a vacuum. When someone denigrates a language as "java-like" they're really just comparing it anecdotally to the sum of all Java projects they've worked with. Rarely is the project a single-purpose, optimized pet-project.
Smalltalk was long castigated for being a "slow, poky interpreted language" long, long after it stopped being that in fact. In all of my time as a consultant for the language vendor, never did I ever come across the VM actually being too slow. In something like 90% of the cases, it was due to IO.
Before I left the Smalltalk part of my career behind, someone had occasion to compare the parser-compiler of one Smalltalk, implemented in C with yacc/lex, against one implemented in pure Smalltalk on a JIT VM. It turns out that once the console logging was disabled, the JIT VM's parser was just as fast as the one in C.
In my experience of almost 2 decades, it has been a constant that uninformed programmers are especially uninformed about the relative performance of managed languages.
If only someone was interested in contributing Smalltalk programs written with MatriX to use quad-core --
Note that in many of the other benchmarks, the fastest Java program takes 2x to 4x longer than the fastest C program. Still not bad! But knucleotide shows Java in a better light than most:
http://benchmarksgame.alioth.debian.org/u64q/java.html
(And as always, it's possible that someone can write a much faster Java program than the ones submitted so far.)
And in many cases it doesn't matter, those 2x, 4x longer are still in the accepted expectation time frame.
There are lots of folk programming like the impact of bounds checking or that all game consoles except for XBox use OpenGL.
Also many younger developers believe that C was always fast, and are unaware that early 8- and 16-bit compilers for home computers were like managed languages: the compilers generated far worse code than hobby assembly developers wrote by hand.
> early 8 and 16 bit compilers for home computers were like managed languages
Yeah, I wrote several published games entirely in assembly back when that was really the only reasonable option.
Of course, some of the things we had to deal with meant that even if the compilers had been good, they couldn't have been good enough.
Extremely limited memory is the obvious one, but ROM pages that need to be swapped in and out at runtime are the other. An example: the Game Boy had 64k of address space, but would ship with 128k and larger cartridges. 32k of the address space was reserved for video memory, RAM, and hardware registers (if I remember correctly). The first 16k of memory was always mapped to the first 16k of ROM, but the second 16k was mapped to arbitrary 16k banks of ROM. 16k wasn't enough space to hold your entire game, so some code would leak into other pages (hopefully not more than ONE other page), but you also had to be able to swap other ROM pages into that second 16k region to, e.g., load graphics and game data.
There isn't a compiler around even today that can juggle all of that automatically. At a minimum you'd need to be marking different functions as belonging to different memory regions, but you'd still have to manually keep track of which function was where and ensure you don't try to call a region-2 function from a region-1 function when region-2 is some arbitrary ROM page instead of the extra code page. But often something in region-2 needs to call region-1 to load graphics into sprite registers and then restore region-2 so that the stack frame becomes valid again. :)
And 32k is SUCH a small amount of memory that being able to chip away at every single function was important. You have a function that does two things, but sometimes you just need to do the second? Put a label halfway through and call directly into the middle. You can save a byte here by loading one register into another, because you know that the registers will have the right values? Or you can save another byte there because you've realized that a particular constant is loaded a lot, and you can store that constant into one of the 128 cheap-memory locations? Do it! We need every byte...
Yes it would be possible to write such a compiler, but the level of effort to customize it to the specific architecture would be extreme. Easier to just make programmers do the hard work.
Nice tricks, it brings back memories from demoscene days!
I was just arguing about plain boring C code. :)
>> Java is showing quite impressive numbers!
359% more RAM isn't very impressive. Even less so when one considers the 20+ years of effort spent to achieve it.
You know what impresses me? A 22 month old language besting everything else while guaranteeing no segfaults or NPEs at compile time. That's impressive.
Not that I disagree with the general idea of this post, but it's worth pointing out that while Rust is fairly young based on the 1.0 release date, there was a long period before that during which the language underwent a number of changes.
It's also worth remembering that this is just a game, and as such it's somewhat of an apples-to-oranges comparison. Java's RAM usage may not be impressive compared to C, but that doesn't make the results overall any less impressive, especially knowing what it was like before those 20+ years of effort.
I don't see a reason that both Rust and Java's results can't be impressive in their own right. Java's numbers are (for the most part) impressive compared to C and Rust's numbers are also impressive compared to C, just in a different way.
>> while Rust is fairly young based on the 1.0 release date, there was a large amount of time prior to that where the language underwent a number of changes.
Both Rust and Java had similar intervals between start of development and 1.0 release, which is what I carefully referenced my claims to. The comparison is fair; deliberately conservative actually given Java 1.0 in 1995 (now 22 years ago.)
>> I don't see a reason that both Rust and Java's results can't be impressive in their own right.
The state of the art has moved on. There was a time when Java pulling to within 50-ish percent of a 45 year old programming language was impressive; back around 2005 or so. It's old hat now and there is little evidence the gap is going to close much further.
It seems to me that there are a lot of apples-to-oranges comparisons here. Some implementations use the language's standard-library hash table while others use a third-party version (with different algorithms and data structures across all of them), some use multiple threads while others are single-threaded, etc. As a result, I wouldn't read too much into the rankings you see here.
Sure, but I'm less worried about absolute rank and more interested in whether the language I'm using is on the order of C/C++/Fortran.
I can sell a 2x slowdown to my boss if I can show that that's the only cost of a significantly more elegant and productive language, but at 100x that's a much harder sell.
Just be careful to remember that good performance on microbenchmarks does not necessarily translate to good performance on large applications. Some factors to consider are:
1. Cache performance and locality may differ considerably for large applications. For example, a microbenchmark may perform well because it can monopolize the L1 cache in a way that is not possible for larger applications.
2. Aggressive inlining, a common factor in microbenchmark performance, may not scale well for large code bases. The tradeoffs here can be rather complex: https://webkit.org/blog/2826/unusual-speed-boost-size-matter...
3. Microbenchmarks often avoid abstraction in order to achieve high performance. This can create unacceptable software engineering costs.
It's still useful as a potential filter. If someone's implemented your microbenchmark in langB and it's within 2x of langA, then maybe they're worth comparing. If on the other hand langB is 20x away from langA in microbenchmarks, you can be pretty sure that it won't ever be a good replacement.
As long as you have people of roughly equivalent skill level within their language implementing the microbenchmarks. It's not that hard to get a 10x or even 100x speed improvement in the same language when an expert rewrites the naive code that a newbie wrote.
That's why microbenchmarks that have been on the internet for a while are quite nice: there's a good chance at least some experts from each language will have had a go and submitted a decent implementation.
But that's a highly (and narrowly) optimized version that took a lot of effort to build. It's not something that the average programmer you have working for you will be able to replicate. Also, some languages will not have equivalent effort put in the corresponding implementations (simple example: some languages have parallelized solutions, some don't).
I think the idea behind real-world software performance is that "if you need it, you really need to have it, but most of the time you don't need it." The benchmarks game isn't all that unrealistic for this. Most of the time, you won't care about performance at all. When you do care about performance, you'll be able to devote an experienced engineer to carefully optimize a specific hot spot, not unlike a microbenchmark. It matters how fast you can get the code to go when it's a hot spot, not how fast all your little support code runs.
My one complaint is that there's no benchmark that measures FFI performance. Realistically, if you build a system in Python or Ruby - you're going to be dropping down to C for your hot spots. And so scripting language performance on all these compute-intensive tasks is somewhat irrelevant, you really want to know how much overhead you'll incur crossing the scripting/C boundary (which, in my experience can sometimes be large enough that it wipes out all the gains of coding in C in the first place).
> measures FFI performance
Many of the pi-digits programs use GMP.
https://benchmarksgame.alioth.debian.org/u64q/performance.ph...
I have some experience with codegolfing for speed and in my experience, these benchmarks do not reflect the reality of such problems, either. If you really want to tweak code for speed, you'll use a mix of algorithmic improvements (that's where the biggest gains are) tailored to your language and compiler/hardware, possibly using FFI/assembly if you absolutely need it. But the benchmark game explicitly disallows algorithmic improvements, for example. Except through the backdoor, where language implementors can sometimes tweak libraries to circumvent restrictions.
And to make things more complicated, the performance increase is generally a function of the time spent on optimizing the problem, which is dependent not only on any innate speed, but also the expressiveness of the language and the programmer's familiarity with the language. Given that you don't have infinite time, there are further tradeoffs here.
> Except through the backdoor, where language implementors can sometimes tweak libraries to circumvent restrictions.
Which programming language implementations are doing that to your knowledge?
"… Kernels … Toy programs … Synthetic benchmarks … discredited today, usually because the compiler writer and architect can conspire to make the computer appear faster on these stand-in programs than on real applications."
http://benchmarksgame.alioth.debian.org/why-measure-toy-benc...
> Which programming language implementations are doing that to your knowledge?
My point here is not that this is being done (I'm generally assuming that language implementors have better things to do than to pollute their standard libraries for a benchmark game); I was making a different point, namely the inability to do algorithmic improvement, and mentioned that possibility for the sake of completeness.
There's been a good deal of play, flex and slop in the interpretation of "use the same algorithm" for some of those tasks.
I think your point is more that "Don't optimize away the work." is antithetical to what we do.
All of the benchmark game problems either prescribe using a fixed algorithm or have an obvious optimal asymptotic complexity. The remaining challenge is generally to reduce the constant factor as much as possible.
This is just not what a lot of computationally expensive problems look like in practice. As a simple example, any practical solution for an NP-hard problem will be full of tradeoffs; often you just want something that's good enough and then you get to choose and adjust an algorithm for your particular problem space to find the sweet spot between time complexity and quality of the solution.
The benchmarks game's problems also have fairly simple and obvious data structures; most of them just deal with arrays and strings. Real-world problems often require you to make difficult choices about representation (where one is optimal in some situations, another in a different set of situations, but you have to handle all of them). Example: adjacency matrices can be represented as bit matrices, integer/float matrices (if edges can have weights), or associative arrays of sets, to name just a few implementation options. Depending on how your graph is structured (sparse vs. dense, connectivity) and what algorithms you require, one or the other can be optimal. Clever choices can make orders of magnitude of difference in performance.
I'll offer you a concrete example: computing the factorial of large numbers basically has three well-known algorithms: (1) naive multiplication, (2) divide-and-conquer, (3) prime decomposition. Each subsequent approach improves performance over its predecessor, but is also increasingly more difficult to implement.
Now, it so happens that this particular problem is well-researched, so you can look it up, but you encounter similar problems all the time where you can't find them on stackoverflow or in the literature. And then you have to consider tradeoffs between the time spent researching better solutions, implementing those better solutions, and the time gained from the increased performance.
You're pushing hard on a wide-open door :-)
In my experience, if you're seeing a 20x difference, we're either already talking compilers vs. interpreters or fundamentally different implementations (e.g. one using SIMD intrinsics vs. one not using them).
And some languages are allowed to use FFI to make their impl faster. There's some rule about this that I don't understand, but oh well. It's all for fun, not serious.
But come on, now Rust can legitimately be called "faster than C" ;)
At least until the Clang C version is added... or maybe it will still be faster.
>And some languages are allowed to use FFI to make their impl faster. There's some rule about this that I don't understand, but oh well. It's all for fun, not serious.
Actually it's pretty easy: if you can write a faster version in any language, given whatever is there in the language, even if it's C-implemented standard-library stuff, do it.
The implementations are not meant to be final -- people can contribute faster ones.
I suspect all languages without exception use some standard-library functionality in at least a few of those sample programs, and most "standard" libraries aren't constrained to be self-hosting, so all of them use some native code, probably written in C or C++. I suspect that's true of Rust too.
FFI is a fact of life. I can imagine it would be perverting the intent of the game if you explicitly used FFI to delegate the actual core of the benchmark program to C as opposed to using "standard" building blocks, but the distinction is necessarily vague.
> I suspect that's true of rust too.
rustc uses LLVM and jemalloc (by default). Other than that, it is all Rust.
> It seems to me that there are a lot of apples-to-oranges comparisons here?
So, the usual. :)
I stopped taking into account the benchmarks game completely after I saw apples-to-potatoes comparisons a few years back.
http://benchmarksgame.alioth.debian.org/dont-jump-to-conclus...
Long ago the Ada program contributors told me -- It's only the same program if the same assembler is generated ;-)
That's why they call it a game.
Huh, I could have sworn at the time that that was the reason but C2 backs it up: http://wiki.c2.com/?GreatComputerLanguageShootout
Maybe they should remove 'game' from the title if they need an FAQ for it. Or even better they should acknowledge that it is a game.
The word game has several different meanings -- not all of them are dismissive.
There was some ruckus regarding the use of standard-library hash tables. See this thread https://news.ycombinator.com/item?id=13266801 for some links.
> some are using multiple threads while others are single threaded
Perhaps sort by cpu (seconds) rather than (elapsed) secs.
http://benchmarksgame.alioth.debian.org/u64q/performance.php...
So? You're welcome to write your version, if you think you can make a faster one.
When SIMD goes stable, Rust may dominate that game. Still, I wish they would use Clang so it was apples to apples with C and C++. Edit: actually, I wish they would add Clang for those languages and keep GCC for comparison. Then I'd want Fortran to add gfortran for the same reason.
"If you're interested in something not shown on the benchmarks game website then please take the program source code and the measurement scripts and publish your own measurements."
Yeah, I think SIMD will be the next big jump for these benchmarks. I also wish we could see clang used with the C/C++ cases.
If LLVM fixes a misoptimization bug, Rust can start passing noalias information again, and that will enable auto-vectorization in many cases. That would show up in the benchmarks too. That is why Fortran is winning n-body without any explicit SIMD code.
Is there any reason why we can't have a "C clang"? Or has just nobody bothered yet?
Circa 2011 the maintainer of the benchmarks game decided to mostly allow only one implementation of each language[0], following PyPy developers trying to get program alternatives that weren't PyPy-pessimal.
[0] some languages get a bye for some reason, e.g. MRI and JRuby but no PyPy, and which implementation is blessed is also arbitrary, e.g. JavaScript is V8 but Lua is Lua.
> alternatives which weren't pypy-pessimal
Not true.
Back in the day, Joe LaFata contributed pi-digits, spectral-norm, and mandelbrot programs written specially for PyPy, and they were all shown on the website side-by-side with the CPython programs.
Back in the day, I noticed the Python n-body program failed with PyPy. I asked about the problem, was told "we have nbody_modified in our benchmarks", and then asked them to contribute the specially modified PyPy program -- and it was displayed on the website within 3 hours.
https://morepypy.blogspot.com/2010/03/introducing-pypy-12-re...
2011 didn't have as widespread Clang/LLVM usage. Now that it is so ubiquitous, probably a good time to revisit for C...
C needs to defend its reputation now!
I am saying this as someone who has loved C my entire life. When I was in college I implemented most of the assignments in C even when the prof said Python was okay, because I loved it and thought I would learn more by doing them in C.
So no hard feelings involved.
There is no reputation to defend. Do you mean the security problems everywhere? The old, broken, nasty build systems? Not having a single good package manager? The languages (C/C++) are clearly intractable (parsing-wise); it is 2017 and we don't have a single good IDE for them. (I don't use an IDE at all, but I am sure you agree with me how important an IDE is for newcomers.)
If that is the case, then writing assembly would vastly outperform C, and by that logic asm is much better than C.
And to be honest, I am in the job-finding phase and preparing for interviews; if that weren't my plan right now, I would abandon C and C++ (even C++11/14) for Rust in a heartbeat.
Rust is clearly an awesome language: well designed, with a wonderful build system (have you seen Cargo?) and a flexible design (you can write an OS in Rust, since it has no required runtime, and you can also write a web app in it). I am stuck with C and C++ for now, but if my opinion counts, Rust is superior to C/C++ in every respect.
Even if Rust were a little slower (say, 10%), I wouldn't mind. Its ecosystem is so healthy that I would trade 10% of my program's performance for something as nice as Rust.
Right. The only thing C had going for it was that it was the fastest non-assembly language.
I'm biased, I want all C development halted and moved over to Rust. If C is no longer the fastest, for some definition of that, then no one should be defending it at this point.
Honestly, I want the LLVM optimized C version specifically so that this last argument for C in any context will be taken off the table. I'm sure there are some people saying, "but it's not using the same compiler backend and optimizer...", I want this to counter that.
Rust all the things.
>Right. The only thing C had going for it was that it was the fastest non-assembly language.
Not really. C has lots of existing code, lots of developers who know it, and for many platforms it is the only language for which you'll find a compiler. As a language it is a trainwreck, but it does have a sizeable moat.
Yes. And I recently did an experiment at work using corrode on some existing software, with some decent results. I was able to easily translate one source file to Rust, and then link that into the existing make based build.
I've also experimented with just FFI for similar integration, with similar ease.
The point is, even with large code bases in C, you can start to migrate and stop feeding the beast.
I am asking this genuinely: I have always thought and heard that LLVM's optimizer is inferior to GCC's. Am I wrong? Is there any scientific benchmark for this?
It's really hard to quantify enormous (millions-of-lines) complex pieces of code that simply. LLVM and GCC take broadly the same approach to optimization: an SSA-based IR with dozens of passes, followed by lowering to a machine-specific IR with many more passes, followed by machine code generation, register allocation, and cleanup optimizations.
Basically, all you can really say is: LLVM is better at some code; GCC is better at other code. The differences are at the level of highly situational details at this point, not broad strokes.
My understanding was that GCC was better in more scenarios most of the time, but now it seems like Clang has caught up, and maybe Clang 4.0 is even faster in more of them.
Here are some recent benchmarks from phoronix: http://www.phoronix.com/scan.php?page=article&item=gcc7-clan...
Yeah. But in this kind of tuned benchmark, I doubt clang will do any better across the board - especially not considering the fact that the current programs are likely to be at least somewhat tuned specifically for gcc.
Clang does well enough to be the default compiler for Apple and Google. It can't be that far behind.
Oh yeah, I have no reason to believe it's worse; it's just that I also doubt it's going to help much either.
But... trial and error and all ;-).
I think they make different choices in different situations. My understanding has been that they are generally on par, with one or the other doing much better in particular situations.
But I have nothing to back this up. Wouldn't it be cool if the benchmark game provided a datapoint on this?
"If you're interested in something not shown on the benchmarks game website then please take the program source code and the measurement scripts and publish your own measurements."
> The only thing C had going for it was that it was the fastest non-assembly language.
Which only happened after the widespread adoption of UNIX clones across the industry.
Back in the 80's and early 90's, the C compilers for 8 and 16 bit home computers generated pretty crappy code, versus what humans were able to write in Assembly.
> it is 2017 and we don't have single good IDE for them
What are your criteria for a good IDE for C/C++?
> javascript is v8
Kind-of: Node.js actually.
"If you're interested in something not shown on the benchmarks game website then please take the program source code and the measurement scripts and publish your own measurements."
Rust and C are only lucky that the really fast languages are not included in these benchmark comparisons.
e.g. Felix or Pony would dominate it then, over C++ with OpenMP.
Here only a missing fast C/C++ hash table is skewing the picture.
I have never heard of either of those, do you have any benchmarks?
EDIT, Nevermind Felix compiles to C++ and Pony looks like an academic project. If you want to provide benchmarks or try to change my mind, I am open to it, but it would require significant evidence.
Just FYI in case it happens to others: I was confused because the page I get shows Rust coming in fourth. I had to reload it/re-sort the columns a couple times to see the new Rust #4 entry.
LLVM IR could squeeze out some more. I should take a crack at it. Did something similar to show DuPont why Criterion rocked.
The hash table is implemented in safe Rust (by using std's Vec). It has some inefficiencies that could maybe have been polished off using `unsafe`.
I had a quick look at it a while back, there aren't many (any?) inefficiencies there like that.
Citation? To be honest, I always heard LLVM is slower than GCC in most scenarios.
Yeah. This is largely IO bound. Operating system read calls from STDIN dominate the runtime.
> Some language implementations have hash tables built-in; some provide a hash table as part of a collections library; some use a third-party hash table library. (For example, use either khash or CK_HT for C language k-nucleotide programs.) The hash table algorithm implemented is likely to be different in different libraries.
> Please don't implement your own custom "hash table" - it will not be accepted.
> The work is to use the built-in or library hash table implementation to accumulate count values - lookup the count for a key and update the count in the hash table.
The C++ implementation is thereby testing an old version of a non-standard extension of libstdc++ that I had never heard of and which was likely contributed once by IBM and never really looked at again (by either maintainers or users ;P), while the C implementation is testing the specified khash library, which is apparently something a number of people actively contribute to and attempt to optimize, giving it some notoriety.
If I were to do this in C++, and I wasn't allowed to use my hash table, I would almost certainly not be using __gnu_pbds::cc_hash_table<>. If I were to just want to use something "included", I would use the C++11 std::unordered_map<> (note that this code is compiled already as C++11). But we all know the STL is designed for flexibility and predictability, not performance, and the culture of C++ is "that's OK, as if you actually care about performance no off-the-shelf data structure is going to be correct". If I decided I wanted speed, I know I'd want to check out Folly, and I might even end up using khash.
Reading other comments, what happened here is the Rust version is now using some "experimental" hash table based on ongoing work to optimize the Python 3000 dict implementation. This is just not a useful benchmark. What we are benchmarking is "how maintained is the implementation's built in hash table and is it tunable for this particular workload".
That's why you should not be surprised to see Java doing so well: the code actually being written here is just some glue... your programming language has to be incompetent to do poorly at this benchmark (especially as many commenters here are using a "within a power of 2" rule of thumb). There are even multiple listings for the Java one, and the one that is faster is using it.unimi.dsi.fastutil.longs.Long2IntOpenHashMap?!?
What we really should be asking here is: why is any language doing "poorly" in this benchmark? It just isn't surprising that Rust is competitive with C/C++, nor is it surprising that Java is also; what is surprising is that Swift, Go, Haskell, and C# are not, and so I bet the issue is something (such as "is allocating memory for a thing which is not required") that can be trivially fixed for each (though by the rules of engagement, it might... or might not :/ as Java "cheated", right? ;P... require a minor fix upstream).
I mean, since the "work" explicitly is not "write a hash table using nothing but primitives from this language", there is no particular reason why Perl and Python (which is using a non-destructive array .replace, which is likely brutal... again: I bet this is almost always a benchmark of ancillary memory allocations) should be doing as poorly as they are: if we all made "optimize for this benchmark" a top priority for a weekend hackathon, I bet we could get every open source language to nail this under 25s. But do we care?
> If I were to do this in C++ … I would use the C++11 std::unordered_map<> …
http://benchmarksgame.alioth.debian.org/play.html#contribute
The thesis of my comment was that we should be surprised that any language does poorly on this benchmark, particularly ones that have similar kinds of targeting, and that if we cared about this benchmark (and I claim we don't), we should all pitch in, possibly upstream to fix various languages and their standard libraries to nail this benchmark. However, I also believe the rules of this benchmark are awkward and even flawed, and that it isn't clear to me that it is worth anyone's time to do that.
I am not lamenting that someone should spend more time on this: I am lamenting that tons of people seem to care about it at all. It is not a "microbenchmark" (as some are calling it), and I think the main lesson we can learn from it is "there is something subtly wrong, either with the implementation that was contributed for this benchmark, the language's runtime, or its standard library", as given these rules we really should expect every language to perform similarly.
And so, if we all cared about this benchmark, I bet we could figure out what is going on and get every open source language down under 25s. Past that point, I think the rules are such that this isn't even a fun game much less a useful metric of anything worth measuring, and you are probably wasting your time contributing. I guess, to make this subthread go somewhere: why do you disagree?
> it is not a "microbenchmark"
Home page -- "Will your toy benchmark program be faster if you write it in a different programming language? It depends how you write it!"
Whatever the motivation for your comments, they did remind me that I'd intended to have background information URLs on other pages (not just the home page).
So thanks for that!
> it.unimi.dsi.fastutil.longs.Long2IntOpenHashMap?!?
That's just a library that provides data structures without boxing of int,double... in Java. This eliminates a lot of overhead.
Oh, I absolutely understand why it is interesting and fast: what I don't understand is how it satisfies the rules, as it is effectively "some random project with a hashtable". Is this particular one so famous that it should be allowed instead of java.util.HashMap? Can I just publish my C++ hashtable and then rely on it, in which case the rule makes no sense? That's why I tack on "?!".
> instead of java.util.HashMap
Not "instead of" as-well-as.
This entire subthread is clearly a digression ;p, but why do you say that? The rules were about using a built-in or standard collection: this Java implementation does not use HashMap and instead uses a random project with a better data structure for this use case. This is a loophole to bypass the restriction on writing your own hashtable: you just have to publish it first? ;P
Java #6 => HashMap
http://benchmarksgame.alioth.debian.org/u64q/program.php?tes...
Java #3 => HashMap
http://benchmarksgame.alioth.debian.org/u64q/program.php?tes...
Java #5 => HashMap
http://benchmarksgame.alioth.debian.org/u64q/program.php?tes...
Java #4 => HashMap
http://benchmarksgame.alioth.debian.org/u64q/program.php?tes...
Those are different implementations; I know they exist, which is why I specifically said there were multiple implementations and pointed at the faster one. That is the one most people will pay attention to, and the most relevant, since we could and probably should expect the implementations for other languages to use some crazy one-off library too. It demonstrates that the limitations placed on the various languages are arbitrary enough that you can't read these results as indicative of the languages themselves, but simultaneously that we shouldn't even care, as the fast Java entry effectively "cheated".
I am going to take a step back: you seem irked by my comments, due to the knee-jerk link to have me contribute a different implementation for C++ as well as arguing with the short reply what I really maintain is a nearly-offtopic point about the Java library in question here (both of which I am reading as slightly aggressive towards me or my comment).
Do you really like this benchmark? Are you super excited that Rust is slightly faster right now than C/C++/Java? What is going through your head when you read my comments? Do you disagree that we should expect all native-ish languages to do equally-enough well for most people given these constraints? Is it that my comment is coming off to you as "no fun allowed", as I am trying to point out that this is a meaningless competition that mostly teaches us about different things than "who is winning"?
In fact, looking at your recent comment history, I am realizing you are being super aggressive about this with everyone, and are nigh unto spamming the play.html link to everyone who tries to comment about the high-level idea of "what are we doing here". What's up?
> I know those exist…
So you knew why I said -- Not "instead of" as-well-as.
> … knee-jerk link to have me contribute a different implementation for C++ …
Improved programs are welcome, period. Hence the link.
> What is going through your head when you read my comments?
Will this sentence never end :-)
> … mostly teaches us about different things than "who is winning"? … the high-level idea of "what are we doing here".
There are links on the home page which provide context.
> What's up?
igouy is the maintainer of the benchmarks game.
I'm not a system programmer, but it makes me very happy to see Rust taking off, with all its potential to, hopefully, replace currently most used unsafe system languages.
Is there something fishy going on with the NaiveHashMap here? What's happening?
impl Hasher for NaiveHasher {
fn finish(&self) -> u64 {
self.0
}
fn write(&mut self, _: &[u8]) {
unimplemented!()
}
fn write_u64(&mut self, i: u64) {
self.0 = i ^ i >> 7;
}
}

NaiveHasher (declared as "struct NaiveHasher(u64);") is a tuple struct with a single 64-bit field named "0". (Tuple fields are implicitly named 0, 1, 2, ...)
It implements a very simple hashing function for 64-bit numbers, each hashed number n overwrites the state with n xor (n >> 7). finish() will return the last written state.
This seems to work okay since the hashed structure "Code" contains exactly one 64-bit field.
I don't know if it's fishy, but it's certainly very custom. For a generic hashing implementation you'd at least want to mix in the previous state. I assume it was done more for brevity than performance though.
Looks like it only handles u64's being hashed and panics on all other input. That must be because it is known it will be used exactly with `.write_u64`. Not having to implement the general byte buffer hashing is a big benefit, removes the whole partial state tracking of the hasher, which the compiler probably would not optimize out (we don't know until we try though).
It's the Rust equivalent of the CUSTOM_HASH_FUNCTION macro in the C source. The comment from there might help explain things:
// Define a custom hash function to use instead of khash's default hash
// function. This custom hash function uses a simpler bit shift and XOR which
// results in several percent faster performance compared to when khash's
// default hash function is used.
#define CUSTOM_HASH_FUNCTION(key) (khint32_t)((key) ^ (key)>>7)
Cool! Is Rust's hash table open addressed?
Yes. It's open addressing, linear probing, Robin Hood hashing.
https://doc.rust-lang.org/std/collections/struct.HashMap.htm...
The program which reached the top of k-n does not use the standard library's hashmap, it uses ordermap: https://github.com/bluss/ordermap
Though that's also an open addressed map.
It is a new Rust hash table from an external crate.
The default hash table has better security against malicious input by using a slower hashing algorithm. It's the right default, but if you really want performance and to compete with C/C++, you have to use an algorithm that makes different tradeoffs, or you would be comparing apples to oranges.
> or you would be comparing apples to oranges
Do the other programs use that same hash table algorithm?
http://www.sebastiansylvan.com/post/robin-hood-hashing-shoul...
Do the hash tables in use by the C/C++ entries use low-security hash functions? Seems like that needs evidence.
The first-place Rust program uses this very simple low-security hash function:

impl Hasher for NaiveHasher {
    fn write_u64(&mut self, i: u64) {
        self.0 = i ^ i >> 7;
    }
}

The second-place C program uses the exact same hash function as the Rust program, except it also truncates the result to 32 bits:

#define CUSTOM_HASH_FUNCTION(key) (khint32_t)((key) ^ (key)>>7)

The third-place C++ program uses the identity function as its hash function:

struct hash {
    uint64_t operator()(const T& t) const {
        return t.data;
    }
};

Sources:

- Rust: http://benchmarksgame.alioth.debian.org/u64q/program.php?tes...
- C: http://benchmarksgame.alioth.debian.org/u64q/program.php?tes...
- C++: http://benchmarksgame.alioth.debian.org/u64q/program.php?tes...
> The default hash table has better security against malicious input by using a slower hashing algorithm.
I think an adaptive hash that switches from fast to secure when collisions are detected would make a better default choice (at the cost of some implementation complexity).
Or possibly even an implementation with log(n) worst case complexity.
It's in native Rust though. Which is all that matters. And it's a publicly available crate, so anyone who wants to use it can... and easily with cargo/crates.io.
Does this have anything to do with: https://news.ycombinator.com/item?id=13742865 ?
No. The Rust program uses https://github.com/bluss/ordermap which is an implementation/variant[0] of Hettinger's naturally ordered hash map.
I'm really curious what this benchmark would be for JavaScript V8? Anyone have the time to recreate the same functionality to test in Node?
There are two "Node.js" entries on this page.
Quite decent performance from the ML family at around 19 seconds (F# and OCaml). Top of the functionals, at least; twice as fast as Haskell.
Also look how ginormous the binaries are for all the VM languages. Kinda would have thought it would be the opposite what with not needing to link in as much runtime?
You could even argue that Rust is a member of the ML family seeing as the ML family of languages were major inspirations and furthermore I believe the original implementation of Rust was written in OCaml.
Before Rust was written in Rust it was indeed written in OCaml.
As a scientific programmer interested in functional-flavoured (i.e. undogmatic) languages, I would switch to Rust immediately if it had a REPL. Dying to dump Python. This post is very convincing on Rust's design decisions:
http://science.raphael.poss.name/rust-for-functional-program...
This post was a total revelation to me, even if I assume it's well known in the community, because it is very credible on the Sophie's Choice issue of mathematical purity versus acknowledgement of the reality of the instruction pointer-based imperative machine that exists underneath.
I've looked at other languages that "do" multiprocessing recently. Go is great, but it's essentially about getting large teams of variable-skill people to work together well. It's not an inspiring language, whereas Rust clearly is. Erlang (via Elixir) is very interesting, but the actor model will never be as performant in reality as the shared memory architecture. Julia is just a modern interpretation of matlab. A number of the JVM languages are great, but the "culture" of that ecosystem will always be corporate.
This is why I believe the science crowd could really gravitate to Rust, because it may have the ability, like the functional crowd, to satisfy the "search for beauty" aspect which motivates many academics, scientists, and indeed, programmers, all the while staying just the right side of pragmatism. And clearly targeting "where the puck is going" on massively multicore hardware.
The trial-and-error nature of scientific/data-science discovery inevitably requires a REPL. If I had the right compiler/interpreter skills I would gladly contribute to making a Rust REPL happen. Unfortunately I don't. As it stands, all I can say is that if the REPL happens, I would be eager to contribute to the Rust scientific ecosystem with great motivation and pleasure.
The size of the binary isn't in that table. I think you're confusing it with the memory consumption numbers.
Ah okay. I guess this reflects the large VMs then. Still not ideal.
Does anybody know what happened to the image charts like this one for example? https://web.archive.org/web/20121218042116/http://shootout.a...
This was my favourite feature of the benchmarks game.
Those charts provide an instant without-thought comparison (which can be helpful in other situations and with other data sets).
In this situation: it's helpful to slow-down, look at the source-code, think about what's being compared …
The Rust version uses multiple CPUs via a pool concept (which looks a lot like the multiprocessing module from Python, so kudos there). But the C version is single-threaded from what I can tell. So Rust is safe but threaded, beating single-threaded C, which isn't that much slower. Hmm...
That's not true. If you look at the comparisons, the CPU time taken by C is actually more than Rust's, and the CPU load looks about even. Note the C version uses:
#pragma omp parallel sections

Ahh dang, I didn't know you could do compiler-run parallelism.
>.. and the cpu load looks about even.
318% is 20% bigger than 265%.
And 5.30s is about 20% faster than 6.46s. We're just throwing around numbers now.
My point was that the cpu load is "about even" compared to if the C version was single-threaded, in which case it would look more like 200-300% bigger cpu load instead of merely 20%.
From what I can see the C version is using OpenMP and uses all of the available CPU cores.
Is the C version optimized? Could you conclude that parallelism in Rust is faster/better than C + OpenMP in this case, considering Rust aims to replace C++ and is safe, at least in this submission?
Yeah I see that now. Missed that; my mistake.
Isn't the whole point of Rust to do threaded things safely by keeping strict tabs on who owns what and for how long?