When you want to measure program performance, you can code up a micro-benchmark or you can time a program on the command line (a macro-benchmark). Both are tricky to get right. I wrote BestGuess to make command-line benchmarking easy and accurate, after trying other solutions.
Micro-benchmarks or macro-benchmarks?
Micro-benchmarking measures the CPU time (for instance) of one part of a program. You put some portion of your code inside a wrapper that reads a clock or counter before the code executes, reads it again after, and stores the difference for later analysis.
To do this, you have to be able to modify the code you are measuring, and you need to know which calculations to instrument. If you have profiled your code, you know where it’s spending its time.
But there are many pitfalls, from the dynamic effects of JIT compilation to the subtleties of how CPU time and wall clock time are obtained. Beyond these, there are the usual suspects when it comes to preventing code from achieving its maximum performance:
- Micro-architecture features (caches, branch prediction, etc.)
- Operating system features (more caches, address randomization, etc.)
- Compiler/linker behavior (optimizations, link order, etc.)
- System load and the details of how the OS scheduler handles it
By definition, micro-benchmarks measure the work done by only a fraction of a program’s code. With all of the above factors interfering with the ideal (best) execution, measurement error can be large relative to the performance of the code being measured.
In order to (try to) prevent measurement error from invalidating the results, it’s often necessary to do the same calculation $N$ times, measure the aggregate time, then divide by $N$. Modifying a program so that it repeats a key calculation seems awkward, but you’re already modifying code to create a micro-benchmark. Still, the idea of using a macro-benchmark starts to look appealing, because you may not need to modify your code at all.
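Before moving on, here is roughly what that repeated-measurement pattern looks like in C. It is a minimal sketch: compute_something is a hypothetical placeholder for the code being measured, the repetition count is arbitrary, and none of the pitfalls listed above are addressed.

```c
#define _POSIX_C_SOURCE 200809L   /* for clock_gettime() under -std=c99 on glibc */
#include <stdio.h>
#include <time.h>

static volatile double sink;      /* keeps the compiler from optimizing the work away */

/* Hypothetical stand-in for the code being measured. */
static double compute_something(void) {
    double x = 0.0;
    for (int i = 1; i <= 1000; i++) x += 1.0 / i;
    return x;
}

int main(void) {
    const long N = 100000;        /* arbitrary repetition count */
    struct timespec start, end;
    clock_gettime(CLOCK_MONOTONIC, &start);
    for (long i = 0; i < N; i++) sink = compute_something();
    clock_gettime(CLOCK_MONOTONIC, &end);
    double ns = (end.tv_sec - start.tv_sec) * 1e9 + (end.tv_nsec - start.tv_nsec);
    printf("%.1f ns per call (averaged over %ld calls)\n", ns / N, N);
    return 0;
}
```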
One approach to command-line benchmarking leaves the program unchanged. That is, no modifications are made just for measurement purposes. You benchmark the code as-is, make a code change that may affect performance, and benchmark again. Easy.
The “unmodified code” approach I just described works well when you’ve already done some profiling, and you are testing out possible performance boosts. In this scenario, you’re working on the part of the code responsible for a large fraction of the program’s run time. Any worthwhile performance enhancement will yield a repeatable and measurable boost.
If, instead, you’re working on a piece of code whose run time is a tiny fraction of the program’s overall run time, there are a couple of questions to consider. The first is, why am I doing this? If a program runs in 100ms and I achieve a 20% increase in the speed of a calculation that accounts for 1% of the run time, I’ll see a 99.8ms result. And in many cases, a 20% speed increase is huge and very hard to obtain! Is it worth the effort?
Note: When swapping out one algorithm for another, large performance gains can be obtained. In CS education, we emphasize worst-case and average-case complexity bounds for algorithm analysis. But this is not the whole story. E.g. for small $n$, insertion sort is usually fastest, even though its complexity is higher than that of the best comparison sorts. That means there are cases where replacing merge sort with insertion sort increases performance!
If you are convinced it is worth the effort to improve a small part of your program, the other question to consider is whether you should create a special version of your code for benchmarking. You’ve already written unit tests (right?), some of which access internal (not exposed) APIs, so why not add a performance test program as well?
You can arrange for a key calculation to run $N$ times, and read $N$ from the command line, enabling (1) interactive experiments done at the command line; and (2) automated tests to catch performance regressions.
Testing at the command line is convenient, but also really flexible. You can use one of the ready-made tools for command-line benchmarking, and you won’t be limited to the timing library that comes with Python or Ruby or Go or Rust. You can run quick experiments interactively, then script longer experiments and let them run. Your test automation (some scripts, a benchmarking tool) is likely already portable, making it easy to compare performance across platforms.
When writing a library, I usually write little driver programs that each call an API whose performance I’m interested in. The driver reads command-line arguments to determine what arguments are passed to the API and, often, $N$: how many times it should execute. This becomes part of my test suite.
Set $N$ to zero, and I can measure all of the time my program spends outside of the part I’m measuring. Set $N$ to a low number, and I measure the cost of just a few invocations of the key API. This situation may match what happens in a typical use case. But set $N$ to a high number, and the (possibly high) cost of early API invocations will be averaged out with many warmed-up ones. This scenario may match my use cases better than the one where $N$ is low.
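A sketch of such a driver appears below. The function api_under_test is a hypothetical stand-in for whichever library call is being measured; in a real test suite it would be replaced by the API of interest, and the driver itself would then be benchmarked from the command line for several values of $N$.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the library API being measured. */
static int api_under_test(const char *arg) {
    return (int) strlen(arg) % 2;
}

int main(int argc, char **argv) {
    if (argc != 3) {
        fprintf(stderr, "usage: %s <api-argument> <N>\n", argv[0]);
        return 2;
    }
    long N = strtol(argv[2], NULL, 10);
    int result = 0;
    /* N = 0 measures only the program's fixed overhead; a small N approximates
       "cold" usage; a large N averages early invocations with warmed-up ones. */
    for (long i = 0; i < N; i++)
        result ^= api_under_test(argv[1]);
    return result;
}
```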
Micro-benchmarks have their place, especially in low-level work like tuning the implementation of a virtual machine or a codec. In my work, however, macro-benchmarks have been useful in a much wider range of scenarios. And after using several command-line benchmarking tools, I ended up writing one that did what I needed, BestGuess.
Of course, you can avoid using any tool by rolling your own benchmarking. It’s not dangerous like rolling your own crypto, or impractical like rolling your own linear algebra library.
Roll your own benchmarking
Most shells have a built-in time command, and there’s also /usr/bin/time,
which is supposedly better. On some systems, /usr/bin/time will report
everything in the rusage struct (on Unix), and more. This is macOS, using the
calculator bc to compute 1000 digits of π:
1$ /usr/bin/time -l bc -l <<<"scale=1000;4*a(1)" >/dev/null
2 0.03 real 0.02 user 0.00 sys
3 1703936 maximum resident set size
4 0 average shared memory size
5 0 average unshared data size
6 0 average unshared stack size
7 224 page reclaims
8 1 page faults
9 0 swaps
10 0 block input operations
11 0 block output operations
12 0 messages sent
13 0 messages received
14 0 signals received
15 0 voluntary context switches
16 2 involuntary context switches
17 155614143 instructions retired
18 49804923 cycles elapsed
19 1393536 peak memory footprint
20$
Pretty nice! But the CPU time units are 1/100s, which is rather coarse. If we used a benchmarking tool instead, we could get an idea of how much variation there is.
1$ bestguess -M -r 10 -s '/bin/bash -c' 'bc -l <<<"scale=1000;4*a(1)"'
2Use -o <FILE> or --output <FILE> to write raw data to a file.
3
4Command 1: bc -l <<<"scale=1000;4*a(1)"
5 Mode ╭ Min Median Max ╮
6 Total CPU time 18.03 ms │ 17.98 18.19 24.31 │
7 Wall clock 19.25 ms ╰ 19.19 19.47 26.14 ╯
8
9Only one command benchmarked. No ranking to show.
10$
Looks like there was over 6ms difference in CPU time (about a third of the median time!) between the fastest and slowest executions in a batch of 10. Some variation is expected, in part because the first execution fills some caches, but its size is interesting to observe.
You can put /usr/bin/time into a for loop in a shell script to collect
repeated measurements, and if its granularity is enough for you, this may be a
good way to go. Of course, you’ll have to produce your own statistics. An
afternoon remembering how to use awk, sed, and cut is in your future! Or, for a
one-off experiment, collect the data and import it into a spreadsheet for
analysis.
But if you’ll be running more than a few experiments, you’ll want a benchmarking tool. You can download one and look at all the information it gives you (often a lot). It may tell you when you need to use warmup runs, and it likely emits really official-looking statistics.
But all is not as it seems. The numbers can be subtly but unmistakably wrong, and the statistics may not hold water.
There are plenty of tools for benchmarking parts of a computer, like the CPU itself, the GPU, memory subsystem, disk subsystem, and more. I am not considering those here.
Of the general-purpose benchmarking tools I could find, I tried a few.
Cmdbench
The cmdbench tool wasn’t a good fit for my work. This Python tool launches your program in the background and monitors its execution. It was designed to benchmark long-running programs (think seconds, not milliseconds), and to show how CPU load and allocated memory vary over the duration of each execution. Several times I had problems getting it to run, though when I opened issues, they were fixed quickly. Missing checks for user errors (e.g. a command with unbalanced quotes) result in Python exceptions. I’m sure this will improve over time. For the right use cases, this tool appears worth the effort to use it.
Multitime
Laurie Tratt wrote multitime, a spin
on /usr/bin/time that repeatedly executes a single command, producing a
similar report of real, user, and system times. Basic descriptive statistics
are calculated, and the granularity of the reported data is milliseconds.
1$ multitime -n 5 bc -l <<<"scale=1000;4*a(1)" > /dev/null
2===> multitime results
31: bc -l
4 Mean Std.Dev. Min Median Max
5real 0.014 0.005 0.008 0.015 0.022
6user 0.007 0.006 0.003 0.004 0.018
7sys 0.005 0.002 0.003 0.005 0.009
8$
The multitime tool reflects the author’s expertise in performance matters, and
is well-implemented. I recommend it for measuring single programs, though I
would suggest one enhancement. The ability to save the raw data for later
statistical analysis would be quite valuable. That said, this tool has a small
but clever set of features that play well with other Unix tools.
BestGuess reports virtually the same median run time (shown
earlier) for the bc command as does multitime
above. BestGuess uses a shell to run this command, due to the
input redirection. Running an empty command, e.g. bestguess -r 10 -s '/bin/bash -c' '', shows a median of 3.23ms of shell overhead on my
system. The 18.19ms reported by BestGuess minus 3.23ms of overhead gives
14.96ms, and multitime reported 15ms.
Hyperfine
I used Hyperfine in my research for a couple of years. It is written in Rust, and the excellent Rust toolchain makes it a breeze to install. It is a mature project with many features.
Hyperfine wasn’t the right tool for my work, as it turned out. I had to modify
it to save the raw data, because the individual user and system times were not
being saved, precluding further analysis. And the supplementary Python scripts
introduced a dependence on Python (tragic, due to Python’s abysmal module
system) into the otherwise easy-to-manage automated testing we had crafted. Our
test automation starts with a file of commands to be benchmarked, and ends with
all the raw data in one CSV file, a statistical summary in another, and
performance graphs produced by gnuplot. Benchmarking done this way is more
repeatable and less subject to human error than if all the steps have to be done
manually.
The outlier detection in Hyperfine prompted me to investigate just what constitutes an outlier in performance testing. I concluded that in a very real sense, there are no outliers – all measurements are valid.
Suppose you are concerned with the speed of cars going by your house. You set up a radar device and let it record measurements for a week. Several cars creep slowly by at 5MPH, maybe because they were about to turn into a driveway. Some cars speed by at 40MPH. All of these measurements are valid. If you throw away outliers, you discard the part of the distribution you are most interested in (the rightward tail)!
The radar device should not be warning you that there are outliers. In many distributions there are outliers, and the warning has no scientific basis. The benchmarking tool does not know what you are trying to learn, and does not need to know (and should not care) how you will analyze the data. If you see this warning and re-run your experiment, you are invalidating its results. It’s like measuring vehicle speeds each week until you finally hit a week with no creepers or speeders.
Outliers are valid data. If you are interested in the extreme values, your data analysis should focus on them. If you are not, then your analysis should be insensitive to them.
Benchmarking tools are not oracles. They don’t know what your experiment is designed to test. For example, if you know that there are extra costs for early executions, and that’s not what you want to measure, you can filter those out later, during analysis. Or you can configure some number of unmeasured warmup runs in Hyperfine (or BestGuess) to measure “warmed up” times.
Warming up a program does not guarantee that the subsequent timed executions will be faster than the warmup times, however. You may find that no number of warmup runs can achieve this, because performance is up and down over all the timed executions. There’s no substitute for actually looking at the data, which means you need a tool that saves all the data.
Note that “all measurements are valid” is a claim circumscribed by the experiment design. If the goal is to measure performance on a quiet machine and Chrome starts up, subsequent benchmarks are not valid data for this experiment.
For my work, I had to stop using Hyperfine (and some other tools) because I
could not find support in the literature for using the mean and standard
deviation to characterize distributions of performance data (e.g. CPU times).
These distributions are not normal (Gaussian). You can see for yourself
with the BestGuess -D option, which analyzes the distribution of CPU times for
normality using three different measures (AD score, skew, and excess kurtosis).
Mean and standard deviation are misleading here. Different statistics are
needed, and the ones used by BestGuess are discussed below.
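For the curious, here is a rough sketch of two of those indicators, sample skew and excess kurtosis, computed with the simple moment formulas over made-up numbers. Both are near zero for normally-distributed data. This is only an illustration; it is not the -D implementation, which also computes an Anderson-Darling score.

```c
#include <math.h>
#include <stdio.h>

/* Compute sample skew and excess kurtosis using the simple (biased) moment formulas. */
static void skew_kurtosis(const double *x, int n, double *skew, double *excess_kurt) {
    double mean = 0.0;
    for (int i = 0; i < n; i++) mean += x[i];
    mean /= n;
    double m2 = 0.0, m3 = 0.0, m4 = 0.0;
    for (int i = 0; i < n; i++) {
        double d = x[i] - mean;
        m2 += d * d;  m3 += d * d * d;  m4 += d * d * d * d;
    }
    m2 /= n;  m3 /= n;  m4 /= n;
    *skew = m3 / pow(m2, 1.5);            /* ~0 for a normal distribution */
    *excess_kurt = m4 / (m2 * m2) - 3.0;  /* ~0 for a normal distribution */
}

int main(void) {
    double times[] = {17.9, 18.0, 18.1, 18.2, 18.2, 18.3, 19.0, 24.3};  /* made-up data (ms) */
    double s, k;
    skew_kurtosis(times, 8, &s, &k);
    printf("skew = %.2f  excess kurtosis = %.2f\n", s, k);
    return 0;
}
```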
Making accurate measurements
Having tried several benchmarking tools and used Hyperfine extensively, I wondered if there existed a clear notion of accuracy for command-line benchmarking. It does not appear that practitioners agree on a common definition or metric, and I’m not prepared to offer a proposal. All I can say here is what I have observed.
In my work, I noticed that the CPU times reported by Hyperfine differed
noticeably from what /usr/bin/time would say. Hyperfine’s times were always
higher. I looked into this enough to make an educated guess as to why.
Multitime, BestGuess, and /usr/bin/time produce unsurprisingly similar
numbers, because they work the same way. They fork, manipulate a few file
descriptors, then exec the program to be measured.
Crucially, they all use wait4 to get measurements from the operating system.
Without resorting to external process monitoring of some kind, this appears to
be the best way to get good measurements. These tools report what the OS itself
recorded about the benchmarked command.
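The pattern is small enough to show in full. The sketch below is not BestGuess’s code, just the bare-bones idea: fork, exec the command in the child, and let wait4 hand back the struct rusage that the OS filled in.

```c
#define _DEFAULT_SOURCE           /* for wait4() on glibc */
#include <stdio.h>
#include <sys/resource.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    char *argv[] = {"ls", "-l", NULL};
    pid_t pid = fork();
    if (pid == 0) {               /* child: become the program being measured */
        execvp(argv[0], argv);
        _exit(127);               /* exec failed */
    }
    int status;
    struct rusage usage;
    wait4(pid, &status, 0, &usage);   /* the OS reports what it recorded about the child */
    printf("user %ld.%06ld s  sys %ld.%06ld s  max rss %ld (KB on Linux, bytes on macOS)\n",
           (long) usage.ru_utime.tv_sec, (long) usage.ru_utime.tv_usec,
           (long) usage.ru_stime.tv_sec, (long) usage.ru_stime.tv_usec,
           usage.ru_maxrss);
    return 0;
}
```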
Hyperfine does not follow the “bare bones” fork/exec/wait method. It uses a
general Rust package for managing processes, and it’s possible that some Rust
code executes after the fork, but before the exec. The time spent in that
code will be attributed by Hyperfine to the benchmarked command, and could
explain the data it reports. Of course, if you are measuring programs that run
for several minutes or more, you’ll never notice.
For now, I’m sticking with tools like BestGuess, /usr/bin/time, and multitime,
because I understand the metrics they report, and it’s clear from their source
code what is being measured. It is not at all obvious how a tool could produce
“better” numbers than what the OS itself measures, short of deploying some kind
of external monitoring.
Now, on to what we do with benchmark measurements!
Statistics
When a sample distribution is not normal, the mean is not a useful measure of central tendency, and the standard deviation is not a useful measure of dispersion (spread). The median and inter-quartile range are the usual alternatives, and that’s what BestGuess shows. In case you want to know the “typical” run time, BestGuess also estimates the mode, though this metric is useful only in unimodal cases.
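As a sketch of what that summary involves, the following sorts a small, made-up sample and reads off the median and interquartile range, interpolating between ranks. BestGuess’s own quartile calculation may differ in detail.

```c
#include <stdio.h>
#include <stdlib.h>

static int cmp(const void *a, const void *b) {
    double x = *(const double *) a, y = *(const double *) b;
    return (x > y) - (x < y);
}

/* Percentile of a sorted sample, using linear interpolation between ranks. */
static double percentile(const double *sorted, int n, double p) {
    double idx = p * (n - 1);
    int lo = (int) idx;
    double frac = idx - lo;
    return lo + 1 < n ? sorted[lo] * (1 - frac) + sorted[lo + 1] * frac : sorted[lo];
}

int main(void) {
    double t[] = {18.2, 17.9, 24.3, 18.1, 18.3, 18.0, 19.0, 18.2};   /* made-up run times (ms) */
    int n = (int) (sizeof t / sizeof t[0]);
    qsort(t, n, sizeof t[0], cmp);
    double q1 = percentile(t, n, 0.25), med = percentile(t, n, 0.50), q3 = percentile(t, n, 0.75);
    printf("median %.2f  IQR %.2f (Q1 %.2f, Q3 %.2f)\n", med, q3 - q1, q1, q3);
    return 0;
}
```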
More importantly, we must find another way to rank the benchmarked commands, because the mean and standard deviation are not suited to the task. BestGuess currently compares two sample distributions at a time; call them $X$ and $Y$. We start with a Mann-Whitney U calculation, a statistical test where the null hypothesis is: the chance that a randomly-chosen observation from $X$ is larger than a randomly-chosen observation from $Y$ is equal to the chance that one from $Y$ is larger than one from $X$. Essentially, the null hypothesis is “the two samples, $X$ and $Y$, came from the same distribution”.
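Here is a sketch of the U statistic itself, computed by counting over all cross-sample pairs. The convention below, counting the pairs in which the observation from $X$ is the larger one (ties counting half), appears consistent with the U values in the reports that follow, but treat it as an illustration rather than a specification; the conversion of U to a p-value is omitted.

```c
#include <stdio.h>

/* Count, over all pairs, how often an observation from x exceeds one from y
   (ties count as half). One common form of the Mann-Whitney U statistic. */
static double mann_whitney_u(const double *x, int nx, const double *y, int ny) {
    double u = 0.0;
    for (int i = 0; i < nx; i++)
        for (int j = 0; j < ny; j++) {
            if (x[i] > y[j]) u += 1.0;
            else if (x[i] == y[j]) u += 0.5;
        }
    return u;
}

int main(void) {
    double x[] = {18.0, 18.2, 18.3, 18.1};   /* made-up run times, command 1 (ms) */
    double y[] = {41.0, 41.4, 41.2, 41.1};   /* made-up run times, command 2 (ms) */
    printf("U = %.1f\n", mann_whitney_u(x, 4, y, 4));   /* prints U = 0.0 here */
    return 0;
}
```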
BestGuess ranks the benchmarked commands, marking the best performers with a
✻. In the example below, we see that it took 2.25x as long to compute 1500
digits of π as it did to compute only 1000 digits. Later, we’ll ask BestGuess to explain how it
determined that the two commands really performed differently.
1$ bestguess -QR -r 20 -s '/bin/bash -c' 'bc -l <<<"scale=1000;4*a(1)"' 'bc -l <<<"scale=1500;4*a(1)"'
2Best guess ranking:
3
4 ══════ Command ═══════════════════════════ Total time ═════ Slower by ════════
5 ✻ 1: bc -l <<<"scale=1000;4*a(1)" 18.39 ms
6 ══════════════════════════════════════════════════════════════════════════════
7 2: bc -l <<<"scale=1500;4*a(1)" 41.39 ms 23.00 ms 2.25x
8 ══════════════════════════════════════════════════════════════════════════════
9$
The accompanying p-value indicates (roughly) how strongly we can reject the null hypothesis. In other words, a low p-value is a strong indication that two samples of run time measurements came from different distributions. For instance, a p-value of 0.01 indicates that with high confidence (99%), we can reject the hypothesis that the two commands have indistinguishable performance. Phrased more naturally, a low p-value like 0.01 gives a strong indication that the two commands perform differently.
Below, the -E (explain) option shows that $p < 0.001$, indicating
significance. The median difference between the run times of the two commands
is Δ = 22.86ms. We’ll discuss that next.
1$ bestguess -QRE -r 20 -s '/bin/bash -c' 'bc -l <<<"scale=1000;4*a(1)"' 'bc -l <<<"scale=1500;4*a(1)"'
2
3 ╭────────────────────────────────────────────────────────────────────────────╮
4 │ Parameter: Settings: (modify with -c) │
5 │ Minimum effect size (H.L. median shift) effect 500 μs │
6 │ Significance level, α alpha 0.01 │
7 │ C.I. ± ε contains zero epsilon 250 μs │
8 │ Probability of superiority super 0.33 │
9 ╰────────────────────────────────────────────────────────────────────────────╯
10
11Best guess ranking:
12
13 ══════ Command ═══════════════════════════ Total time ═════ Slower by ════════
14 ✻ 1: bc -l <<<"scale=1000;4*a(1)" 18.29 ms
15
16 ══════ Command ═══════════════════════════ Total time ═════ Slower by ════════
17 2: bc -l <<<"scale=1500;4*a(1)" 41.16 ms 22.87 ms 2.25x
18
19 Timed observations N = 20
20 Mann-Whitney U = 0
21 p-value (adjusted) p < 0.001 (< 0.001)
22 Hodges-Lehmann Δ = 22.86 ms
23 Confidence interval 99.02% (22.74, 22.99) ms
24 Prob. of superiority  = 0.00
25 ══════════════════════════════════════════════════════════════════════════════
26$
A difference in performance can arise from a difference in the shape but not location (e.g. median) of a sample, such as when one is widely dispersed and the other is not, though they are centered on the same value. In my opinion, this is important information in an A/B comparison. If both programs run in the same (median) amount of time, but one has higher dispersion, they behave differently. When choosing between them, we probably want to know this, even if it’s not the foremost consideration in the choice.
When two distributions are similar in shape but differ by location, the Hodges-Lehmann median shift calculation is a good measure of the shift amount. It tells us how much faster program A is compared to B without relying on mean run times (too sensitive to high values) or standard deviations (not meaningful for non-normal empirical distributions). And it comes with a confidence level and corresponding interval, to help with interpretation.
The Mann-Whitney U test and the Hodges-Lehmann median shift figure provide what we really want to know:
- Can we distinguish between the performance of these two programs? (Null hypothesis: Our samples $X$ and $Y$ came from the same underlying distribution.)
- If there is a distinction, how much faster is one compared to the other? (What is the effect size, i.e. the shift of the median run time?)
The effect size is important. We might borrow from the experience of other scientists who analyze experimental results. Any clinical researcher will tell you that statistical significance is not all you need to know. Treatment A might give better outcomes than treatment B in a statistically significant way, but you need to know the effect size before deciding whether the cost or side effects of A are worth it. Moreover, a small effect size could indicate that the experiment produced an unlikely outcome, which subsequent experiments may reveal.
In macro-benchmarking, we can obtain the result that program A is faster than program B in a statistically significant way, but the effect size may be quite small, such as a fraction of a millisecond. Given the interference produced by the myriad processes running on even a “quiet” machine, it’s entirely possible that an experiment with a small effect size is not consistently reproducible.
On my MacBook, there are 704 processes running right now.
BestGuess measures the effect size by examining all of the pairwise differences between run times for two samples, and finds the median difference. This is the Hodges-Lehmann measure. It comes with a confidence measure and interval. If the confidence measure is low, or if the confidence interval contains zero, then we should not put a high value on the computed effect size.
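A sketch of that calculation, using made-up numbers: form every pairwise difference between the two samples and take the median. (The confidence measure and interval that BestGuess also reports are omitted here.)

```c
#include <stdio.h>
#include <stdlib.h>

static int cmp(const void *a, const void *b) {
    double x = *(const double *) a, y = *(const double *) b;
    return (x > y) - (x < y);
}

/* Hodges-Lehmann shift: the median of all pairwise differences y[j] - x[i]. */
static double hodges_lehmann(const double *x, int nx, const double *y, int ny) {
    int npairs = nx * ny, k = 0;
    double *diff = malloc(npairs * sizeof *diff);
    if (!diff) return 0.0;                        /* out of memory; not handled in this sketch */
    for (int i = 0; i < nx; i++)
        for (int j = 0; j < ny; j++)
            diff[k++] = y[j] - x[i];              /* how much slower y is than x */
    qsort(diff, npairs, sizeof *diff, cmp);
    double med = (npairs % 2) ? diff[npairs / 2]
                              : (diff[npairs / 2 - 1] + diff[npairs / 2]) / 2.0;
    free(diff);
    return med;
}

int main(void) {
    double x[] = {18.0, 18.2, 18.3, 18.1};        /* faster command (ms), made-up */
    double y[] = {41.0, 41.4, 41.2, 41.1};        /* slower command (ms), made-up */
    printf("HL shift = %.2f ms\n", hodges_lehmann(x, 4, y, 4));
    return 0;
}
```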
On the other hand, consider the case where (1) the effect size is “large enough” (modest experience suggests 0.5ms to 1ms); (2) the effect size confidence is high (e.g. > 99%); and (3) the confidence interval does not include zero. In this case, we have many good reasons to believe that one program ran faster than the other.
Statistical significance and effect size are used by BestGuess to rank benchmark results.
- When two programs are not statistically different in performance, they are reported as indistinguishable.
- When the effect size is too small or has low confidence, the programs are reported as indistinguishable.
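Putting those rules together, the ranking decision looks roughly like the sketch below. The thresholds mirror the defaults shown in the -E output earlier (α, minimum effect size, ε), but the function is illustrative only; it is not BestGuess’s internal API, and the probability-of-superiority check described next is left out.

```c
#include <math.h>
#include <stdbool.h>
#include <stdio.h>

/* Decide whether two commands are distinguishable, given the test results:
   significant p-value, large enough shift, and a confidence interval that
   (even when widened by epsilon) excludes zero. */
static bool distinguishable(double p_value,      /* from the Mann-Whitney U test */
                            double shift_us,     /* Hodges-Lehmann shift (μs) */
                            double ci_low_us,    /* confidence interval bounds (μs) */
                            double ci_high_us) {
    const double alpha = 0.01, min_effect_us = 500.0, epsilon_us = 250.0;
    if (p_value >= alpha) return false;                        /* not significant */
    if (fabs(shift_us) < min_effect_us) return false;          /* effect too small */
    if (ci_low_us - epsilon_us <= 0 && ci_high_us + epsilon_us >= 0)
        return false;                                          /* CI ± ε contains zero */
    return true;
}

int main(void) {
    /* Made-up values in the style of the bc example above. */
    printf("%s\n", distinguishable(0.0005, 22860, 22740, 22990) ? "distinguishable"
                                                                : "indistinguishable");
    return 0;
}
```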
Below, I compare generating 1005 digits of π to 1000 digits. Using configuration parameters that I left as default values, BestGuess shows a high p-value ($p = 0.081$) indicating a lack of significant difference between the two commands. Also, the effect size was small (Δ = 0.07ms), so even if we considered $p = 0.081$ to be good enough, the difference is minor. Worse still, the 99% confidence interval for the difference includes zero.
1$ bestguess -QRE -r 20 -s '/bin/bash -c' 'bc -l <<<"scale=1000;4*a(1)"' 'bc -l <<<"scale=1005;4*a(1)"'
2[SNIP! Configuration parameters omitted]
3
4Best guess ranking: The top 2 commands performed identically
5
6 ══════ Command ═══════════════════════════ Total time ═════ Slower by ════════
7 ✻ 1: bc -l <<<"scale=1000;4*a(1)" 18.18 ms
8
9 ══════ Command ═══════════════════════════ Total time ═════ Slower by ════════
10 ✻ 2: bc -l <<<"scale=1005;4*a(1)" 18.27 ms 0.09 ms 1.00x
11
12 Timed observations N = 20
13 Mann-Whitney U = 135
14 p-value (adjusted) p = 0.081 (0.080) ✗ Non-signif. (α = 0.01)
15 Hodges-Lehmann Δ = 0.07 ms ✗ Effect size < 500 μs
16 Confidence interval 99.02% (-43, 322) μs ✗ CI ± 250 μs contains 0
17 Prob. of superiority  = 0.34 ✗ Pr. faster obv. > 33%
18
19$
Finally, BestGuess employs one additional check on sample similarity, a calculation related to the Mann-Whitney U value called the probability of superiority. It measures the probability that a randomly-chosen run time from program B is lower (faster) than one randomly-chosen from program A. Assuming A is faster than B, it is unlikely that many measurements for program B were faster than those of A, though some may be. A high probability here simply indicates a large overlap in the distributions of the two samples.
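A sketch of that calculation, with made-up numbers: the probability of superiority is the fraction of cross-sample pairs in which the run time from the other command was the faster of the pair. (Counting ties as half is an assumption here, not necessarily BestGuess’s choice.)

```c
#include <stdio.h>

/* Fraction of all cross-sample pairs in which b's run time is lower (faster)
   than a's. Equivalently, the Mann-Whitney U value scaled by the number of pairs. */
static double prob_superiority(const double *a, int na, const double *b, int nb) {
    double wins = 0.0;
    for (int i = 0; i < na; i++)
        for (int j = 0; j < nb; j++) {
            if (b[j] < a[i]) wins += 1.0;        /* b's run was faster */
            else if (b[j] == a[i]) wins += 0.5;  /* count ties as half */
        }
    return wins / (na * nb);
}

int main(void) {
    double a[] = {18.0, 18.2, 18.3, 18.1};       /* faster-ranked command (ms), made-up */
    double b[] = {18.4, 18.1, 18.5, 18.3};       /* other command (ms), made-up */
    printf("probability of superiority = %.2f\n", prob_superiority(a, 4, b, 4));
    return 0;
}
```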
Currently, BestGuess version 0.7.5 is the latest release. Earlier versions have been tested on a trial basis in my research, and I am confident in the measurements. BestGuess always saves the raw data, so it is easy to re-analyze it (with BestGuess or other systems).
We are not yet ready to declare a version 1.0, but I hope people will try out the current release and give us feedback. I expect to use this tool for a long time, and to develop it further to meet my group’s future needs and those of others.
BestGuess
BestGuess is a tool for command-line benchmarking. It does these things:
- Runs commands and captures run times, memory usage, and other metrics.
- Saves the raw data, for record-keeping or for later analysis.
- Optionally reports on various properties of the data distribution.
- Ranks the benchmarked commands from fastest to slowest.
The default output contains a lot of information about the commands being benchmarked:
1$ bestguess -r 20 "ls -lR" "ls -l" "ps Aux"
2Use -o <FILE> or --output <FILE> to write raw data to a file.
3
4Command 1: ls -lR
5 Mode ╭ Min Q₁ Median Q₃ Max ╮
6 Total CPU time 5.86 ms │ 5.79 5.86 5.92 6.26 9.77 │
7 User time 2.18 ms │ 2.14 2.17 2.18 2.29 3.03 │
8 System time 3.68 ms │ 3.65 3.69 3.75 3.97 6.73 │
9 Wall clock 7.17 ms │ 7.14 7.22 7.34 7.65 12.55 │
10 Max RSS 1.86 MB │ 1.64 1.83 1.86 1.88 2.06 │
11 Context sw 13 ct ╰ 13 13 13 14 18 ╯
12
13Command 2: ls -l
14 Mode ╭ Min Q₁ Median Q₃ Max ╮
15 Total CPU time 2.63 ms │ 2.61 2.64 2.71 2.77 3.44 │
16 User time 1.05 ms │ 1.00 1.01 1.03 1.05 1.19 │
17 System time 1.62 ms │ 1.59 1.62 1.67 1.72 2.25 │
18 Wall clock 3.87 ms │ 3.82 3.90 4.09 4.34 5.14 │
19 Max RSS 1.73 MB │ 1.55 1.73 1.77 1.80 1.98 │
20 Context sw 13 ct ╰ 13 13 13 14 14 ╯
21
22Command 3: ps Aux
23 Mode ╭ Min Q₁ Median Q₃ Max ╮
24 Total CPU time 34.10 ms │ 33.45 33.90 34.10 34.26 34.37 │
25 User time 8.52 ms │ 8.48 8.51 8.53 8.57 8.63 │
26 System time 25.62 ms │ 24.91 25.38 25.55 25.65 25.88 │
27 Wall clock 45.18 ms │ 44.28 44.82 45.05 45.31 45.56 │
28 Max RSS 2.94 MB │ 2.88 2.94 2.96 3.03 3.23 │
29 Context sw 98 ct ╰ 96 98 98 98 99 ╯
30
31Best guess ranking:
32
33 ══════ Command ═══════════════════════════ Total time ═════ Slower by ════════
34 ✻ 2: ls -l 2.71 ms
35 ══════════════════════════════════════════════════════════════════════════════
36 1: ls -lR 5.92 ms 3.22 ms 2.19x
37 3: ps Aux 34.10 ms 31.39 ms 12.60x
38 ══════════════════════════════════════════════════════════════════════════════
39$
For a mini-report, use the -M option. See reporting
options below.
The implementation is C99 and runs on many Linux and BSD platforms, including macOS.
Dependencies:
- C compiler
- make
Example: Compare two ways of running ps
The default report shows CPU times (user, system, and total), wall clock time, max RSS (memory), and the number of context switches.
The mode may be the most relevant, as it is the “typical” value, but the median is also a good indication of central tendency for performance data.
The columns to the right show the conventional quartiles, from the minimum up to the maximum observation. The starred command at the top of the ranking ran the fastest, statistically speaking.
1$ bestguess -r=100 "ps" "ps Aux"
2Use -o <FILE> or --output <FILE> to write raw data to a file.
3
4Command 1: ps
5 Mode ╭ Min Q₁ Median Q₃ Max ╮
6 Total CPU time 7.83 ms │ 7.76 8.69 11.18 14.68 27.40 │
7 User time 1.38 ms │ 1.36 1.53 1.86 2.33 3.96 │
8 System time 6.44 ms │ 6.39 7.16 9.38 12.21 23.49 │
9 Wall clock 8.38 ms │ 8.35 9.64 13.22 17.49 45.47 │
10 Max RSS 1.86 MB │ 1.73 1.86 2.08 2.38 2.77 │
11 Context sw 1 ct ╰ 1 2 29 85 572 ╯
12
13Command 2: ps Aux
14 Mode ╭ Min Q₁ Median Q₃ Max ╮
15 Total CPU time 37.04 ms │ 36.57 36.98 37.45 38.82 86.70 │
16 User time 9.37 ms │ 9.33 9.42 9.56 9.77 19.74 │
17 System time 27.55 ms │ 27.15 27.57 27.88 29.15 66.97 │
18 Wall clock 50.81 ms │ 49.44 50.81 51.30 52.39 115.32 │
19 Max RSS 3.00 MB │ 2.97 3.08 3.62 3.95 5.33 │
20 Context sw 0.12 K ╰ 0.12 0.12 0.13 0.15 1.07 ╯
21
22Best guess ranking:
23
24 ══════ Command ═══════════════════════════ Total time ═════ Slower by ════════
25 ✻ 1: ps 11.18 ms
26 ══════════════════════════════════════════════════════════════════════════════
27 2: ps Aux 37.45 ms 26.28 ms 3.35x
28 ══════════════════════════════════════════════════════════════════════════════
29$
Best practice: Save the raw data
Use -o <FILE> to save the raw data (and silence the pedantic admonition to do
so). We can see in the example below that the raw data file has 76 lines: one
header and 25 observations for each of 3 commands. The first command is empty,
and is used to measure the shell startup time.
1$ bestguess -o /tmp/data.csv -M -r=25 -s "/bin/bash -c" "" "ls -l" "ps Aux"
2Command 1: (empty)
3 Mode ╭ Min Median Max ╮
4 Total CPU time 8.61 ms │ 4.91 8.59 10.41 │
5 Wall clock 12.58 ms ╰ 6.74 14.99 44.84 ╯
6
7Command 2: ls -l
8 Mode ╭ Min Median Max ╮
9 Total CPU time 16.96 ms │ 5.66 16.87 21.22 │
10 Wall clock 30.26 ms ╰ 7.33 24.56 42.76 ╯
11
12Command 3: ps Aux
13 Mode ╭ Min Median Max ╮
14 Total CPU time 34.42 ms │ 32.02 35.31 45.86 │
15 Wall clock 43.78 ms ╰ 42.29 46.54 58.38 ╯
16
17Best guess ranking:
18
19 ══════ Command ═══════════════════════════ Total time ═════ Slower by ════════
20 ✻ 1: (empty) 8.59 ms
21 ══════════════════════════════════════════════════════════════════════════════
22 2: ls -l 16.87 ms 8.28 ms 1.96x
23 3: ps Aux 35.31 ms 26.72 ms 4.11x
24 ══════════════════════════════════════════════════════════════════════════════
25$ wc -l /tmp/data.csv
26 76 /tmp/data.csv
27$
The accompanying program bestreport can read the raw data file (or many of
them) and reproduce any and all of the summary statistics and graphs:
1$ bestreport -QRB /tmp/data.csv
2 5 10 16 21 27 32 37 43
3 ├─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼──────
4 ┌─┬┐
5 1:├┄┄┄┄┤ │├┄┤
6 └─┴┘
7 ┌───────┬────┐
8 2: ├┄┄┄┄┄┄┄┄┄┄┄┄┤ │ ├┄┄┤
9 └───────┴────┘
10 ┌─┬────────┐
11 3: ├┄┄┄┤ │ ├┄┄┄┄┄┄┄┄┄┄┤
12 └─┴────────┘
13 ├─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼──────
14 5 10 16 21 27 32 37 43
15
16Box plot legend:
17 1: (empty)
18 2: ls -l
19 3: ps Aux
20
21Best guess ranking:
22
23 ══════ Command ═══════════════════════════ Total time ═════ Slower by ════════
24 ✻ 1: (empty) 8.59 ms
25 ══════════════════════════════════════════════════════════════════════════════
26 2: ls -l 16.87 ms 8.28 ms 1.96x
27 3: ps Aux 35.31 ms 26.72 ms 4.11x
28 ══════════════════════════════════════════════════════════════════════════════
29$
Example: Measure shell startup time
When running a command via a shell or other command runner, you may want to
measure the overhead of starting the shell. Supplying an empty command string,
"", as one of the commands will run the shell with no command, thus measuring
the time it takes to launch the shell.
Rationale: BestGuess does not compute shell startup time because it doesn’t know how you want it measured, if at all. (Which shell? How to invoke it? How many runs and warmup runs?)
On my machine, as shown below, about 2.4ms is spent in the shell, out of the
5.2ms needed to run ls -l.
When reporting experimental results, we might want to subtract the shell startup time from the run time of the other commands to estimate the net run time.
1$ bestguess -M -w 5 -r 20 -s "/bin/bash -c" "" "ls -l"
2Use -o <FILE> or --output <FILE> to write raw data to a file.
3
4Command 1: (empty)
5 Mode ╭ Min Median Max ╮
6 Total CPU time 2.39 ms │ 2.37 2.39 2.57 │
7 Wall clock 3.15 ms ╰ 3.07 3.15 3.39 ╯
8
9Command 2: ls -l
10 Mode ╭ Min Median Max ╮
11 Total CPU time 5.21 ms │ 5.15 5.22 5.38 │
12 Wall clock 6.75 ms ╰ 6.66 6.77 7.04 ╯
13
14Best guess ranking:
15
16 ══════ Command ═══════════════════════════ Total time ═════ Slower by ════════
17 ✻ 1: (empty) 2.39 ms
18 ══════════════════════════════════════════════════════════════════════════════
19 2: ls -l 5.22 ms 2.83 ms 2.18x
20 ══════════════════════════════════════════════════════════════════════════════
21$
BestGuess reporting options
Mini stats
If the summary statistics included in the default report are more than you want to see, use the “mini stats” option.
1$ bestguess -M -r 20 "ps A" "ps Aux" "ps"
2Use -o <FILE> or --output <FILE> to write raw data to a file.
3
4Command 1: ps A
5 Mode ╭ Min Median Max ╮
6 Total CPU time 23.20 ms │ 22.72 23.94 30.54 │
7 Wall clock 23.98 ms ╰ 23.43 24.83 33.16 ╯
8
9Command 2: ps Aux
10 Mode ╭ Min Median Max ╮
11 Total CPU time 26.50 ms │ 26.39 27.08 29.69 │
12 Wall clock 36.27 ms ╰ 35.72 36.45 41.68 ╯
13
14Command 3: ps
15 Mode ╭ Min Median Max ╮
16 Total CPU time 7.81 ms │ 7.79 7.85 8.88 │
17 Wall clock 8.42 ms ╰ 8.38 8.46 9.60 ╯
18
19Best guess ranking:
20
21 ══════ Command ═══════════════════════════ Total time ═════ Slower by ════════
22 ✻ 3: ps 7.85 ms
23 ══════════════════════════════════════════════════════════════════════════════
24 1: ps A 23.94 ms 16.09 ms 3.05x
25 2: ps Aux 27.08 ms 19.24 ms 3.45x
26 ══════════════════════════════════════════════════════════════════════════════
27$
Bar graph of performance
There’s a cheap (limited) but useful bar graph feature in BestGuess (-G or
--graph) that shows the total time taken for each iteration as a horizontal
bar.
The bar is scaled to the maximum time needed for any iteration of the command. The chart, therefore, is meant to show variation between iterations of the same command. Iteration 0 prints first.
The bar graph is meant to provide an easy way to estimate how many warmup runs may be needed, but can also give some insight about whether performance settles into a steady state or oscillates.
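The scaling itself is simple: each bar’s length is that iteration’s total time divided by the largest total time observed for the command, applied to a fixed character width. The sketch below uses made-up numbers and a plain '=' for the bar; it is not BestGuess’s drawing code.

```c
#include <stdio.h>

int main(void) {
    double times_ms[] = {9.5, 6.3, 4.6, 4.9, 4.5, 5.1, 5.9, 5.5, 5.3, 4.4};  /* made-up data */
    int n = 10, width = 60;
    double max = 0.0;
    for (int i = 0; i < n; i++) if (times_ms[i] > max) max = times_ms[i];
    for (int i = 0; i < n; i++) {
        int bar = (int) (times_ms[i] / max * width + 0.5);   /* scale to the slowest iteration */
        putchar('|');
        for (int j = 0; j < bar; j++) putchar('=');
        putchar('\n');
    }
    return 0;
}
```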
The contrived example below measures shell startup time against the time to run
ls without a shell. It looks like bash could use a few warmup runs.
Interestingly, the performance of ls got better and then worse again in this
(very) small experiment.
1$ bestguess -NG -r 10 /bin/bash ls
2Use -o <FILE> or --output <FILE> to write raw data to a file.
3
4Command 1: /bin/bash
50 max
6│▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭
7│▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭
8│▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭
9│▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭
10│▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭
11│▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭
12│▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭
13│▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭
14│▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭
15│▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭
16
17Command 2: ls
180 max
19│▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭
20│▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭
21│▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭
22│▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭
23│▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭
24│▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭
25│▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭
26│▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭
27│▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭
28│▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭
29
30$
Box plots for comparisons
Box plots are a convenient way to get a sense of how two distributions compare. We found, when using BestGuess (and before that, Hyperfine), that we didn’t want to wait to do statistical analysis of our raw data using a separate program. To get a sense of what the data looked like as we collected it, I implemented a (limited resolution) box plot feature.
The edges of the box are the interquartile range, and the median is shown inside the box. The whiskers reach out to the minimum and maximum values.
In the example below, although bash (launching the shell with no command to
run) appears faster than ls, we can see that their distributions overlap
considerably. The BestGuess ranking analysis concludes that these two commands
performed statistically identically. You can configure the thresholds used to
draw this conclusion to suit your experiment design, such as if you want to
ignore the fact that ls often took a long time to run.
1$ bestguess -NRB -r 100 /bin/bash ls
2Use -o <FILE> or --output <FILE> to write raw data to a file.
3
4 2.0 2.1 2.3 2.4 2.5 2.7 2.8 2.9
5 ├─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼──────
6 ┌┬─┐
7 1: ├┄┄┄┤│ ├┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┤
8 └┴─┘
9 ┌┬┐
10 2:├┤│├┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┤
11 └┴┘
12 ├─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼──────
13 2.0 2.1 2.3 2.4 2.5 2.7 2.8 2.9
14
15Box plot legend:
16 1: /bin/bash
17 2: ls
18
19Best guess ranking: The top 2 commands performed identically
20
21 ══════ Command ═══════════════════════════ Total time ═════ Slower by ════════
22 ✻ 2: ls 2.16 ms
23 ✻ 1: /bin/bash 2.38 ms 0.23 ms 1.10x
24 ══════════════════════════════════════════════════════════════════════════════
25$
Feature set
See the project README for an overview of features, and for more information about the statistical calculations. There’s a section for Hyperfine users, too. BestGuess uses some of the same option names and can produce a Hyperfine-format CSV file of summary statistics.
While the BestGuess documentation is still being written, running bestguess -h is
the best way to see all of the options. Currently, they are:
1$ bestguess -h
2Usage: bestguess [-A <action>] [options] ...
3
4 -w --warmup Number of warmup runs
5 -r --runs Number of timed runs
6 -p --prepare Execute <COMMAND> before each benchmarked command
7 -i --ignore-failure Ignore non-zero exit codes
8 --show-output Show output of commands as they run
9 -s --shell Use <SHELL> (e.g. "/bin/bash -c") to run commands
10 -n --name Name to use in reports instead of full command
11 -o --output Write timing data to CSV <FILE> (use - for stdout)
12 --export-csv Write statistical summary to CSV <FILE>
13 --hyperfine-csv Write Hyperfine-style summary to CSV <FILE>
14 -f --file Read additional commands and arguments from <FILE>
15 -Q --quiet Show only the output requested using other flags
16 -R --ranking Calculate and show statistical ranking of commands
17 -S --summary Show summary statistics for each command
18 -M --mini-stats Show minimal summary statistics for each command
19 -D --dist-stats Report the analysis of each sample distribution
20 -T --tail-stats Report on the tail of each sample distribution
21 -G --graph Show graph of total time for each command execution
22 -B --boxplot Show box plot of timing data comparing all commands
23 -E --explain Show an explanation of the inferential statistics
24 -c Configure <SETTING>=<VALUE>, e.g. width=80.
25 Configuration settings [default]:
26 width Maximum terminal width for graphs, plots [80]
27 alpha Alpha value for statistics [.01]
28 epsilon Epsilon for confidence intervals (μsec) [250]
29 effect Minimum effect size (μsec) [500]
30 super Superiority threshold (probability) [.333]
31 --config Show configuration settings
32 --limits Show compiled-in limits
33 -A --action If the BestGuess executables are installed under custom
34 names, an <ACTION> option is required, and may be either
35 'run' or 'report'. See the manual.
36 -v --version Show version
37 -h --help Show help
38$
Bug reports
Bug reports are welcome!
BestGuess is implemented in C, which we acknowledge makes good code more difficult to write. But BestGuess needs low-level control over the details of how processes are launched and measured, in order to obtain the best measurements we can.
With C, segfaults and errant memory accesses are always a possibility. When BestGuess can detect a violation of intended behavior, it terminates in a controlled panic with an error message.
If you see any kind of bug, including a panic message, please let us know by opening an issue with instructions on how we can reproduce the bug.
Contributing
If you are interested in contributing, get in touch! My main blog page shows several ways to reach me.
Acknowledgments
Natalie Grogan showed me the Anderson-Darling test for normality, and analyzed data that my group had been collecting. The result was our understanding that command-line performance distributions are not remotely close to normal, and this changed how I looked at benchmarking.
Code, papers, and blog posts by people like Laurie Tratt and Daniel Lemire have been invaluable over the last few years as I’ve done some performance engineering and benchmarking.
Making things go fast is fun. Knowing that we’ve measured things accurately is satisfying.