New Computer Language Benchmarks Game metric: time + source code size

benchmarksgame-team.pages.debian.net

47 points by benstrumental 4 years ago · 58 comments

Reader

Geometric mean of (time + gzipped source code size in bytes) seems statistically wrong.

What if you shifted time to nanoseconds ? Or source code size in terms of Megabytes. The rankings could change. The culprit is the '+'

I would think Geometric mean of (time x gzipped source code size) is the correct way to compare languages together. It would not matter what the units of time or size are in that case.

[Here the geometric mean is the geometric mean of (time x gzipped size) of all benchmark programs of a particular language.]

ntoskrnl 4 years ago
Yep this is correct. Adding disparate units is almost always nonsensical. You can confirm with a scientific calculator like insect:
```
  $ insect '5s + 10MB'
    Conversion error:

      Cannot convert unit MB (base units: bit)
                  to unit s

  $ insect '5s * 10MB'
  50 s·MB
```
- smegsicle 4 years ago
  
  units, frink, insect oh my
tuukkah 4 years ago

I think the summed numbers might be unitless. At least all the other numbers are relative to the fastest/smallest entry. That is, what would make sense is score(x) = time(x) / time(fastest) + size(x) / size(smallest) instead of score(x) = (time(x) + size(x)) / score(best)
dwattttt 4 years ago

It's not necessarily wrong to add disparate units like this. It's implicitly weighting one unit to the other. Changing to nanoseconds just gives more weight to the time metric in the unified benchmark. You could instead explicitly weight them without changing units, if you cared about the size more you could add a multiplier to it.
- sidkshatriya 4 years ago
  
  You really don’t know what weight is the right weight to balance time and gripped size. Multiplying them together sidesteps the whole issue and puts time and size on par with each other regardless of the individual unit scaling.
  The whole point of benchmarks is to protect against accidental bias in your calculations. Adding them seems totally against my intuition. If you did want to give time more weight then I would raise it to some power. Example: geometric mean of (time x time x source size) would give time much more importance in an arguably more principled way.
  - dwattttt 4 years ago
    
    Multiplying them is another way of expressing them as a unified value. It's not a question of accidental bias, you're explicitly choosing how important one second is compared to one byte.
    You could imagine there's a 1 sec/byte multiplier on the bytes value, saying in effect "for every byte of gzipped source, penalise the benchmark by one second".
    
    sidkshatriya 4 years ago
    
    > You could imagine there's a 1 sec/byte multiplier on the bytes value, saying in effect "for every byte of gzipped source, penalise the benchmark by one second".
    Your explanation makes sense. However the main issue is we don’t know if this “penalty” is fair or correct or has some justifiable basis. In absence of any explanation it would make more sense to multiply them together as a “sane default”. Later, having done some research we can attach some weightage perhaps appealing to some physical laws or information theory. Even then I doubt that + would be the operator I would use to combine them.
  - igouy 4 years ago
    
    > Adding them…
    Read '+' as '&'.
igouy 4 years ago

> The culprit is the '+'
That annotation does seem to have caused much frothing and gnashing.
Here's how the calculation is made — "How not to lie with statistics: The correct way to summarize benchmark results."
[pdf] http://www.cse.unsw.edu.au/~cs9242/11/papers/Fleming_Wallace...
- yorwba 4 years ago
  
  That paper is only about the reasoning behind taking the geometric mean, it doesn't have anything to say on the "time + gzipped source code size in bytes" part.

agentgt 4 years ago

I really wish they aggregated the metric of build time (+ whatever).

That is a huge metric I care about.

You can figure out it somewhat by clicking on each language benchmark but it is not aggregated.

BTW as biased guy in the Java world I can tell you this is one area Java is actually mostly the winner even beating out many scripting languages apparently.

igouy 4 years ago

Do Java "build time" measurements include class loading and JIT compilation? :-)
- kaba0 4 years ago
  
  Does C “build time” include the time it takes to load the binary from disk?

_b 4 years ago

I'd be interested to see "C compiled with Clang" added as another language to the benchmark games. In part, digging into Clang vs gcc benchmarks is always interesting, and in part, as Rust & Clang share the same LLVM backend, it would shed light on how much of the C vs Rust difference is from frontend language stuff vs backend code gen stuff.

igouy 4 years ago

Already done:
https://benchmarksgame-team.pages.debian.net/benchmarksgame/...
https://benchmarksgame-team.pages.debian.net/benchmarksgame/...

IshKebab 4 years ago

With a totally arbitrary conversion of 1 second = 1 gzipped byte.

This is basically meaningless. I don't see why you'd even need to do this. You can easily show code size and performance on the same graph.

NeutralForest 4 years ago

This presentation is pretty bad, there should be more context, some kind of color scheme or labels instead of text in the background, spacing between the languages represented, other benchmarks than the geometric mean, etc.

gus_massa 4 years ago

> other benchmarks than the geometric mean
The text is not clear enough, but "geometric mean" is not the benchmark. The 11 problems are listed in https://benchmarksgame-team.pages.debian.net/benchmarksgame/...
The results of the 11 problems are combined using the "geometric mean" into a single number. Some people prefer the "geometric mean", other people prefer the "arithmetic mean" to combine the numbers, other people prefer the maximum, and there rare many other methods (like the average excluding both borders).
- NeutralForest 4 years ago
  
  >The text is not clear enough, but "geometric mean" is not the benchmark.
  Thanks that makes more sense, that's another issue for context then. I don't have anything against geometric means but there should be basic statistics like average, max, min,... available as well.
  - igouy 4 years ago
    
    > … basic statistics like…
    median, quartiles
    https://benchmarksgame-team.pages.debian.net/benchmarksgame/...
    
    NeutralForest 4 years ago
    
    Could you guide me to the ones I mentioned, I'm not seeing them.
    
    stonemetal12 4 years ago
    
    The linked page has box and whisker plots. On a box and whisker plot the lower bar is the min, the upper bar is the max. The box goes from 25th percentile to 75th percentile while the bar in the middle of the box is the 50th percentile.
    Therefore the stats you mentioned are all there min, max, and average with two different definitions of average given (geometric mean, and 50th percentile).
    
    igouy 4 years ago
    
    The bar in the middle of the box is an "average" - the median.
    https://www.merriam-webster.com/dictionary/average
    https://www.itl.nist.gov/div898/handbook/eda/section3/boxplo...
    
    gus_massa 4 years ago
    
    It would be nice to be able to see the numbers in a table (perhaps in a auxiliary page, instead of the main page). Sometimes people want to rearrange the data or use another representation. (log scale? sort by 75% quartile? ...)
    
    igouy 4 years ago
    
    For people who want to rearrange the data or use another representation, there are data files —
    https://salsa.debian.org/benchmarksgame-team/benchmarksgame/...

kibwen 4 years ago

For comparing multiple implementations of a single benchmark in a single language, this sort of data would be interesting as a 2D plot, to see how many lines it takes to improve performance by how much. But for cross-language benchmarking this seems somewhat confounding, as the richness of standard libraries varies between languages (and counting the lines of external dependencies sounds extremely annoying, not only because you have to decide whether to include standard libraries (including libc), you also need to find a way not to penalize those for having many lines devoted to tests).

benstrumentalOP 4 years ago

Not exactly what you're looking for, but here are some 2D plots of code size vs. execution time with geometric means of fastest entries and smallest code size entries of each language:
https://twitter.com/ChapelLanguage/status/152442889069266944...
simion314 4 years ago

And when you want to make the code readable you try to space things out, split things in small functions, use longer and clear variables name. I guess they are asking for running the code trough a minifier so their implementation gains some points.
- benstrumentalOP 4 years ago
  
  I can't find the documentation for it, but you can see here that they measure the size of the source file after gzip compression, which reduces advantage of code-golf solutions:
  https://salsa.debian.org/benchmarksgame-team/benchmarksgame/...
  - igouy 4 years ago
    
    "How source code size is measured"
    https://benchmarksgame-team.pages.debian.net/benchmarksgame/...
    
    benstrumentalOP 4 years ago
    
    Thanks Isaac. How do I reach that page? I expected it on one of the informational pages such as the home page or benchmark programs page, but had no luck.
    
    igouy 4 years ago
    
    There's a "How programs are measured" link on these pages which show measurements:
    https://benchmarksgame-team.pages.debian.net/benchmarksgame/...
    :and these pages:
    https://benchmarksgame-team.pages.debian.net/benchmarksgame/...
    :and these pages:
    https://benchmarksgame-team.pages.debian.net/benchmarksgame/...
- kibwen 4 years ago
  
  On other benchmarks they measure the size of source code after it's been run through compression, as a way to normalize that. Not sure if that's been done here, but it should be.
  - igouy 4 years ago
    
    Yes, they're the same measurements.
mrtranscendence 4 years ago

I'm not sure I see the problem. What does it matter that program A is shorter than program B because language A has a richer standard library? Program A still required less code.
- kibwen 4 years ago
  
  Because that's not what's being measured here, you're also mixing in performance, and it's impossible to tell at a glance whether a score is attributable to one or the other or both.

Thaxll 4 years ago

The thing they should change is to forbid the nonsense like:

https://benchmarksgame-team.pages.debian.net/benchmarksgame/...

Actually if you look at all the top net core submissions the only one fast are the one using low level intrinsics etc ...

spullara 4 years ago

All of the languages now have that trash in them. I'd like a "naive" benchmarks game where you write the code straight forwardly in a normal style for the language.
- igouy 4 years ago
  
  "simple" (2nd link on the homepage.)
  https://benchmarksgame-team.pages.debian.net/benchmarksgame/...
  - spullara 4 years ago
    
    Kind of but less transliteration and more in the natural style of the target language.
  - weberer 4 years ago
    
    >Java: 40 seconds
    >Python 3: 1h 09 minutes
    Well damn.
    
    igouy 4 years ago
    
    :-)
    "Or even acknowledge — Anyone else a bit shocked by how well Javascript on V8 performs? I might need to rethink my assumptions…"
    https://benchmarksgame-team.pages.debian.net/benchmarksgame/...
    
    UncleEntity 4 years ago
    
    Not hard to do, I once took some C code, naively converted it to python [0] and it took hours instead of seconds to run.
    [0] I wanted to output an image and the C code only ran statistics so I would have had to figure out some random C image library which wasn’t how I wanted to spend my day.
  - ufo 4 years ago
    
    Is that only for the mandelbrot benchmark?
    
    igouy 4 years ago
    
    "… you choose which program differencies to explore … you choose which program measurements to compare."
    "Are they similar-enough to be comparable for your purposes?"
    n-body Chapel #3 program
    https://benchmarksgame-team.pages.debian.net/benchmarksgame/...
    n-body C clang program
    https://benchmarksgame-team.pages.debian.net/benchmarksgame/...
    n-body Java program
    https://benchmarksgame-team.pages.debian.net/benchmarksgame/...
igouy 4 years ago

> Actually if you look at all the top net core submissions the only one fast are the one using low level intrinsics etc ...
Do you mean "fast" like a C program using low level intrinsics?

arunc 4 years ago

Just curious, why does this benchmark not include D language? I remember seeing it a few years ago. Was it removed recently?

igouy 4 years ago

https://benchmarksgame-team.pages.debian.net/benchmarksgame/...
2009

cpurdy 4 years ago

Predictably, "The Computer Language Benchmarks Game" once again proves the worthlessness of "The Computer Language Benchmarks Game".

This thing has been a long running joke in the software industry, exceeded only by the level of their defensiveness.

SMH.

guenthert 4 years ago

I think APL had already shown, that brevity in itself is not desirable.

Shadonototra 4 years ago

these pseudo benchmarks should be banned

hexo 4 years ago

I dont buy these results at all. Julia at second place looks like plain lie and complete nonsense, to the point I'm gonna look into this and run it myself.

After trying hard to use julia for about a year and I came to conclusion it's one of the slowest things around. Maybe the stuff changed? Maybe, but julia code still remains incorrect.

I hope they fix both things, speed (including start up speed, it counts A LOT) and correctness.

ChrisRackauckas 4 years ago

Note that these benchmarks include compilation time for Julia, while it does not include compilation time for C, Rust, etc.
- igouy 4 years ago
  
  Julia is presented like this —
  “Julia features optional typing, multiple dispatch, and good performance, achieved using type inference and just-in-time (JIT) compilation, implemented using LLVM.”
  Julia 1.7 Documentation, Introduction
  https://docs.julialang.org/en/v1/
  - ChrisRackauckas 4 years ago
    
    Yes, because it's all set for prime time in the next release.
    "Julia features optional typing, multiple dispatch, and good performance, achieved using type inference and just-in-time (JIT) compilation (and optional ahead-of-time compilation), implemented using LLVM."
    https://docs.julialang.org/en/v1.9-dev/
    So it'll be updated when v1.9 comes out? Anyways, it's a somewhat interesting thing that Julia still gets 3rd even though it's measuring compilation time.
    
    igouy 4 years ago
    
    Tell us when it becomes achieved using ahead-of-time compilation (and optional just-in-time (JIT) compilation).

Settings

New Computer Language Benchmarks Game metric: time + source code size

Keyboard Shortcuts