How Not to Measure Computer System Performance
homes.cs.washington.edu

Benchmarking practices are currently poor, almost without exception. For peak VM performance, we have started to use the Kalibera/Jones method http://kar.kent.ac.uk/33611/7/paper.pdf (we reimplemented the statistical computations at http://soft-dev.org/src/libkalibera/ to make them more accessible). I don't think this method is the end of the story, but it's a definite improvement: we were surprised at some of the odd effects it highlighted (non-determinism was not what I was expecting). It has definitely changed how I think about benchmarking.
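Not the Kalibera/Jones method itself, but the basic idea it builds on — reporting run-to-run variation with a confidence interval rather than a single mean — can be sketched with a plain percentile bootstrap (the timing numbers below are made up for illustration):

```python
import random

def bootstrap_ci(samples, iters=10_000, alpha=0.05, seed=1):
    """Percentile-bootstrap confidence interval for the mean of `samples`."""
    rng = random.Random(seed)
    n = len(samples)
    means = sorted(
        sum(rng.choice(samples) for _ in range(n)) / n
        for _ in range(iters)
    )
    lo = means[int((alpha / 2) * iters)]
    hi = means[int((1 - alpha / 2) * iters)]
    return lo, hi

# Hypothetical wall-clock times (seconds) from repeated runs of one benchmark.
times = [1.02, 0.98, 1.10, 1.01, 0.97, 1.25, 0.99, 1.03]
lo, hi = bootstrap_ci(times)
print(f"mean={sum(times) / len(times):.3f}, 95% CI=({lo:.3f}, {hi:.3f})")
```

The real method goes further (it models nesting of iterations within runs within builds), but even this much makes a lone "best of three" number look as shaky as it is.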
Great article. The gist I get from it is: running an experiment in computer science is easy (just ./bench); running an experiment in computer science correctly is hard. I agree with this assessment.
This plotty tool [0] seems interesting and valuable - but I'm not sure how it relates to the problem the author talks about.
Why would linking order affect runtime performance? Something to do with the interaction between offsets and cache, maybe?
Would it be possible to determine ahead of time what order would maximize performance, or would that require profiling?
I'd speculate that if you're unlucky about link order, two hot cache lines may get mapped to the same slot in an N-way associative cache -- whereas if you're lucky, they end up going to different slots and don't continuously evict each other.
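That conflict scenario is just modular arithmetic on addresses. A toy calculation (cache geometry and addresses are made up, not measured from a real binary) shows how one link-order spacing puts two hot lines in the same set while another doesn't:

```python
def cache_set(addr, line=64, sets=64):
    """Set index an address maps to in a toy set-associative cache."""
    return (addr // line) % sets

hot_a = 0x401000              # hypothetical address of one hot function
hot_b = hot_a + 3 * 64 * 64   # unlucky spacing: a multiple of sets * line
print(cache_set(hot_a), cache_set(hot_b))    # same set -> they evict each other

hot_b2 = hot_a + 17 * 64      # luckier spacing after reordering objects
print(cache_set(hot_a), cache_set(hot_b2))   # different sets -> no conflict
```

With N ways per set you need N+1 hot lines landing in one set before thrashing starts, but the mechanism is the same.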
With regards to alignment... do linkers typically pack objects so tightly that the start of each object isn't aligned on a cache line boundary? AFAIK cache lines are typically 32, 64, or 128 bytes.
Cacheline boundary?
Probably, because cache line sizes are an implementation detail, not part of the architectural specification.
Linking order affects the layout of code and static data, and hence their cache alignment. Similarly, environment variables are pushed onto the stack by the kernel, so their total size affects the alignment of application data on the stack.
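A toy model of that second effect (the addresses and sizes are made up, and real kernels add padding and randomization on top): the environment strings sit at the top of the initial stack, so their total size shifts where everything below them lands relative to cache lines — the ABI only guarantees 16-byte stack alignment, not 64:

```python
def stack_data_offset(env_bytes, stack_top=0x7FFF_FFFF_F000, line=64):
    """Offset within a cache line of stack data, in this toy model."""
    # Env strings occupy the top `env_bytes`; the stack pointer starts
    # below them, rounded down to the ABI's 16-byte alignment.
    sp = (stack_top - env_bytes) & ~0xF
    return sp % line

print(stack_data_offset(200))    # small environment
print(stack_data_offset(1000))   # a few long exported variables
```

So exporting one long variable before a benchmark run can silently move your hot stack data to a different position within (or across) cache lines.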
Maybe there are other causes as well.
> Would it be possible to determine ahead of time what order would maximize performance, or would that require profiling?
I think at the very least, you'd need profiling to determine the hot code path, and that can change depending on input...
Just guessing, but locality of reference might play a role.
That, the branch predictors, and caching behaviors, n-way, alignment, etc.
It probably explains why different runs vary so widely; I always assumed it was other things going on in the OS, and never really thought about the caches, etc.
> That, the branch predictors, and caching behaviors, n-way, alignment, etc.
Those all fall under locality of reference, btw. But yeah, cache and branch prediction play a huge role.
One thing that stung me in the past was OS scheduling.
There are lies, damned lies, and software benchmarks.
Lies, damn lies, and $100 million investments.
I think the author has computer science and computer engineering confused.
I don't think he does. What he describes is part of what my colleagues and I consider "computer science." We typically consider "computer engineering" to be the design and making of hardware. But to be a systems researcher in computer science, you must know how these things work, and be able to reason about how they affect the software systems you care about.