Command-line benchmarking with BestGuess


When you want to measure program performance, you can code up a micro-benchmark or you can time a program on the command line (a macro-benchmark). Both are tricky to get right. I wrote BestGuess to make command-line benchmarking easy and accurate, after trying other solutions.

Micro-benchmarks or macro-benchmarks?

Micro-benchmarking measures the CPU time (for instance) of one part of a program. You put some portion of your code inside a wrapper that reads a clock or counter before the code executes, reads it again after, and stores the difference for later analysis.

To do this, you have to be able to modify the code you are measuring, and you need to know which calculations to instrument. Ideally, you have profiled your code, so you know where it’s spending its time.
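
Concretely, a wrapper of this kind can be as small as the sketch below. This is only an illustration, not code from any real project: do_work() is a stand-in for whatever is being measured, and the clock used here is a monotonic wall clock (CLOCK_PROCESS_CPUTIME_ID is one alternative for CPU time).

/* Minimal micro-benchmark wrapper: read a clock before and after the
   code under test, and keep the difference for later analysis. */
#include <stdio.h>
#include <time.h>

static volatile double sink;          /* keeps the work from being optimized away */

static void do_work(void) {           /* stand-in for the code being measured */
  double s = 0.0;
  for (int i = 1; i <= 1000000; i++) s += 1.0 / i;
  sink = s;
}

int main(void) {
  struct timespec t0, t1;
  clock_gettime(CLOCK_MONOTONIC, &t0);   /* or CLOCK_PROCESS_CPUTIME_ID for CPU time */
  do_work();
  clock_gettime(CLOCK_MONOTONIC, &t1);
  double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (double)(t1.tv_nsec - t0.tv_nsec);
  printf("elapsed: %.0f ns\n", ns);
  return 0;
}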

But there are many pitfalls, from the dynamic effects of JIT compilation to the subtleties of how CPU time and wall clock time are obtained. Beyond these, there are the usual suspects when it comes to preventing code from achieving its maximum performance:

  • Micro-architecture features (caches, branch prediction, etc.)
  • Operating system features (more caches, address randomization, etc.)
  • Compiler/linker behavior (optimizations, link order, etc.)
  • System load and the details of how the OS scheduler handles it

By definition, micro-benchmarks measure the work done by only a fraction of a program’s code. With all of the above factors interfering with the ideal (best) execution, measurement error can be large relative to the performance of the code being measured.

In order to (try to) prevent measurement error from invalidating the results, it’s often necessary to do the same calculation $N$ times, measure the aggregate time, then divide by $N$. Modifying a program so that it repeats a key calculation seems awkward, but you’re already modifying code to create a micro-benchmark. Still, the idea of using a macro-benchmark starts to look appealing, because you may not need to modify your code at all.

One approach to command-line benchmarking leaves the program unchanged. That is, no modifications are made just for measurement purposes. You benchmark the code as-is, make a code change that may affect performance, and benchmark again. Easy.

The “unmodified code” approach I just described works well when you’ve already done some profiling, and you are testing out possible performance boosts. In this scenario, you’re working on the part of the code responsible for a large fraction of the program’s run time. Any worthwhile performance enhancement will yield a repeatable and measurable boost.

If, instead, you’re working on a piece of code whose run time is a tiny fraction of the program’s overall run time, there are a couple of questions to consider. The first is, why am I doing this? If a program runs in 100ms and I achieve a 20% increase in the speed of a calculation that accounts for 1% of the run time, I’ll see a 99.8ms result. And in many cases, a 20% speed increase is huge and very hard to obtain! Is it worth the effort?

Note: When swapping out one algorithm for another, large performance gains can be obtained. In CS education, we emphasize worst-case and average-case complexity bounds for algorithm analysis. But this is not the whole story. E.g. for small $n$, insertion sort is usually fastest, even though its asymptotic complexity is worse than that of the best comparison sorts. That means there are cases where replacing merge sort with insertion sort increases performance!

If you are convinced it is worth the effort to improve a small part of your program, the other question to consider is whether you should create a special version of your code for benchmarking. You’ve already written unit tests (right?), some of which access internal (not exposed) APIs, so why not add a performance test program as well?

You can arrange for a key calculation to run $N$ times, and read $N$ from the command line, enabling (1) interactive experiments done at the command line; and (2) automated tests to catch performance regressions.

Testing at the command line is convenient, but also really flexible. You can use one of the ready-made tools for command-line benchmarking, and you won’t be limited to the timing library that comes with Python or Ruby or Go or Rust. You can run quick experiments interactively, then script longer experiments and let them run. Your test automation (some scripts, a benchmarking tool) is likely already portable, making it easy to compare performance across platforms.

When writing a library, I usually write little driver programs that each call an API whose performance I’m interested in. The driver reads command-line arguments to determine what arguments are passed to the API and, often, $N$: how many times it should execute. This becomes part of my test suite.

Set $N$ to zero, and I can measure all of the time my program spends outside of the part I’m measuring. Set $N$ to a low number, and I measure the cost of just a few invocations of the key API. This situation may match what happens in a typical use case. But set $N$ to a high number, and the (possibly high) cost of early API invocations will be averaged out with many warmed-up ones. This scenario may match my use cases better than the one where $N$ is low.
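
As an illustration (not code from any particular project), a driver of this kind can be as small as the sketch below. Here the “API” being exercised is POSIX regcomp/regexec, chosen only because it is universally available; $N$ comes from the first command-line argument so the same driver works for interactive experiments and scripted regression tests.

/* Hypothetical driver: run a library call N times, where N comes from
   the command line.  The driver itself is timed externally, e.g. by
   /usr/bin/time or a command-line benchmarking tool. */
#include <regex.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
  long n = (argc > 1) ? strtol(argv[1], NULL, 10) : 1;
  const char *pattern = (argc > 2) ? argv[2] : "^[a-z]+[0-9]*$";
  const char *subject = (argc > 3) ? argv[3] : "benchmark2024";
  for (long i = 0; i < n; i++) {      /* N = 0 measures everything except this loop */
    regex_t re;
    if (regcomp(&re, pattern, REG_EXTENDED) != 0) return 1;
    int rc = regexec(&re, subject, 0, NULL, 0);
    regfree(&re);
    (void)rc;                         /* result ignored; we only care about timing */
  }
  return 0;
}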

Micro-benchmarks have their place, especially in low-level work like tuning the implementation of a virtual machine or a codec. In my work, however, macro-benchmarks have been useful in a much wider range of scenarios. And after using several command-line benchmarking tools, I ended up writing one that did what I needed, BestGuess.

Of course, you can avoid using any tool by rolling your own benchmarking. It’s not dangerous like rolling your own crypto, or impractical like rolling your own linear algebra library.

Roll your own benchmarking

Most shells have a built-in time command, and there’s also /usr/bin/time, which is supposedly better. On some systems, /usr/bin/time will report everything in the Unix rusage struct, and more. Here is an example on macOS, using the calculator bc to compute 1000 digits of π:

 1$ /usr/bin/time -l bc -l <<<"scale=1000;4*a(1)" >/dev/null
 2        0.03 real         0.02 user         0.00 sys
 3             1703936  maximum resident set size
 4                   0  average shared memory size
 5                   0  average unshared data size
 6                   0  average unshared stack size
 7                 224  page reclaims
 8                   1  page faults
 9                   0  swaps
10                   0  block input operations
11                   0  block output operations
12                   0  messages sent
13                   0  messages received
14                   0  signals received
15                   0  voluntary context switches
16                   2  involuntary context switches
17           155614143  instructions retired
18            49804923  cycles elapsed
19             1393536  peak memory footprint
20$ 

Pretty nice! But the CPU time units are 1/100s, which is rather coarse. If we used a benchmarking tool instead, we could get an idea of how much variation there is.

 1$ bestguess -M -r 10 -s '/bin/bash -c' 'bc -l <<<"scale=1000;4*a(1)"'
 2⇨ Use -o <FILE> or --output <FILE> to write raw data to a file. ⇦
 3
 4Command 1: bc -l <<<"scale=1000;4*a(1)"
 5                      Mode    ╭     Min   Median      Max   ╮
 6   Total CPU time   18.03 ms  │   17.98    18.19    24.31   │
 7       Wall clock   19.25 ms  ╰   19.19    19.47    26.14   ╯
 8
 9Only one command benchmarked.  No ranking to show.
10$ 

Looks like there was over 6ms of difference in CPU time (about 33% of the median time!) between the fastest and slowest executions in a batch of 10. While this variation is expected, in part because the first execution fills some caches, it is interesting to observe.

You can put /usr/bin/time into a for loop in a shell script to collect repeated measurements, and if its granularity is enough for you, this may be a good way to go. Of course, you’ll have to produce your own statistics. An afternoon remembering how to use awk, sed, and cut is in your future! Or, for a one-off experiment, collect the data and import it into a spreadsheet for analysis.

But if you’ll be running more than a few more experiments, you’ll want a benchmarking tool. You can download one of them and look at all the information it gives you (often a lot). It may tell you when you need to use warmup runs, and likely emits really official-looking statistics.

But all is not as it seems. The numbers can be subtly but unmistakably wrong, and the statistics may not hold water.

There are plenty of tools for benchmarking parts of a computer, like the CPU itself, the GPU, memory subsystem, disk subsystem, and more. I am not considering those here.

Of the general-purpose benchmarking tools I could find, I tried a few.

Cmdbench

The cmdbench tool wasn’t a good fit for my work. This Python tool launches your program in the background and monitors its execution. It was designed to benchmark long-running programs (think seconds, not milliseconds), and to show how CPU load and allocated memory vary over the duration of each execution. Several times I had problems getting it to run, though when I opened issues, they were fixed quickly. Missing checks for user errors (e.g. a command with unbalanced quotes) result in Python exceptions. I’m sure this will improve over time. For the right use cases, this tool appears to be worth the effort.

Multitime

Laurie Tratt wrote multitime, a spin on /usr/bin/time that repeatedly executes a single command, producing a similar report of real, user, and system times. Basic descriptive statistics are calculated, and the granularity of the reported data is milliseconds.

1$ multitime -n 5 bc -l <<<"scale=1000;4*a(1)" > /dev/null
2===> multitime results
31: bc -l
4            Mean        Std.Dev.    Min         Median      Max
5real        0.014       0.005       0.008       0.015       0.022 
6user        0.007       0.006       0.003       0.004       0.018 
7sys         0.005       0.002       0.003       0.005       0.009 
8$ 

The multitime tool reflects the author’s expertise in performance matters, and is well implemented. I recommend it for measuring single programs, though I would suggest one enhancement: the ability to save the raw data for later statistical analysis would be quite valuable. That said, this tool has a small but clever set of features that play well with other Unix tools.

BestGuess reports virtually the same median run time (shown earlier) for the bc command as does multitime above. BestGuess uses a shell to run this command, due to the input redirection. Running an empty command, e.g. bestguess -r 10 -s '/bin/bash -c' '', shows a median of 3.23ms of shell overhead on my system. The 18.19ms reported by BestGuess minus 3.23ms of overhead gives 14.96ms, and multitime reported 15ms.

Hyperfine

I used Hyperfine in my research for a couple of years. It is written in Rust, and the excellent Rust toolchain makes it a breeze to install. It is a mature project with many features.

Hyperfine wasn’t the right tool for my work, as it turned out. I had to modify it to save the raw data, because the individual user and system times were not being saved, precluding further analysis. And the supplementary Python scripts introduced a dependence on Python (tragic, due to Python’s abysmal module system) into the otherwise easy-to-manage automated testing we had crafted. Our test automation starts with a file of commands to be benchmarked, and ends with all the raw data in one CSV file, a statistical summary in another, and performance graphs produced by gnuplot. Benchmarking done this way is more repeatable and less subject to human error than if all the steps have to be done manually.

The outlier detection in Hyperfine prompted me to investigate just what constitutes an outlier in performance testing. I concluded that in a very real sense, there are no outliers – all measurements are valid.

Suppose you are concerned with the speed of cars going by your house. You set up a radar device and let it record measurements for a week. Several cars creep slowly by at 5MPH, maybe because they were about to turn into a driveway. Some cars speed by at 40MPH. All of these measurements are valid. If you throw away outliers, you discard the part of the distribution you are most interested in (the rightward tail)!

The radar device should not be warning you that there are outliers. In many distributions there are outliers, and the warning has no scientific basis. The benchmarking tool does not know what you are trying to learn, and does not need to know (and should not care) how you will analyze the data. If you see this warning and re-run your experiment, you are invalidating its results. It’s like measuring vehicle speeds each week until you finally hit a week with no creepers or speeders.

Outliers are valid data. If you are interested in the extreme values, your data analysis should focus on them. If you are not, then your analysis should be insensitive to them.

Benchmarking tools are not oracles. They don’t know what your experiment is designed to test. For example, if you know that there are extra costs for early executions, and that’s not what you want to measure, you can filter those out later, during analysis. Or you can configure some number of unmeasured warmup runs in Hyperfine (or BestGuess) to measure only “warmed-up” times.

Warming up a program does not guarantee that the subsequent timed executions will be faster than the warmup times, however. You may find that no number of warmup runs can achieve this, because performance is up and down over all the timed executions. There’s no substitute for actually looking at the data, which means you need a tool that saves all the data.

Note that “all measurements are valid” is a claim circumscribed by the experiment design. If the goal is to measure performance on a quiet machine and Chrome starts up, subsequent benchmarks are not valid data for this experiment.

For my work, I had to stop using Hyperfine (and some other tools) because I could not find support in the literature for using the mean and standard deviation to characterize distributions of performance data (e.g. CPU times). These distributions are not normal (Gaussian). You can see for yourself with the BestGuess -D option, which analyzes the distribution of CPU times for normality using three different measures (AD score, skew, and excess kurtosis). Mean and standard deviation are misleading here. Different statistics are needed, and the ones used by BestGuess are discussed below.

Making accurate measurements

Having tried several benchmarking tools and used Hyperfine extensively, I wondered if there existed a clear notion of accuracy for command-line benchmarking. It does not appear that practitioners agree on a common definition or metric, and I’m not prepared to offer a proposal. All I can say here is what I have observed.

In my work, I noticed that the CPU times reported by Hyperfine differed noticeably from what /usr/bin/time would say. Hyperfine’s times were always higher. I looked into this enough to make an educated guess as to why. Multitime, BestGuess, and /usr/bin/time produce unsurprisingly similar numbers, because they work the same way. They fork, manipulate a few file descriptors, then exec the program to be measured.

Crucially, they all use wait4 to get measurements from the operating system. Without resorting to external process monitoring of some kind, this appears to be the best way to get good measurements. These tools report what the OS itself recorded about the benchmarked command.
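
For readers who haven’t seen it, the essential pattern is sketched below. This is a minimal sketch of the technique, not an excerpt from any of these tools, and it assumes a Unix system where wait4 is available (Linux, the BSDs, macOS; a feature-test macro may be needed under strict compiler settings). The OS fills in a struct rusage with what it recorded about the child process.

/* Sketch of the fork/exec/wait4 measurement pattern. */
#include <stdio.h>
#include <sys/resource.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
  char *cmd[] = {"ls", "-l", NULL};          /* command to measure (example) */
  pid_t pid = fork();
  if (pid == 0) {
    execvp(cmd[0], cmd);                     /* child becomes the command */
    _exit(127);                              /* only reached if exec fails */
  }
  int status;
  struct rusage usage;
  wait4(pid, &status, 0, &usage);            /* OS-recorded metrics for the child */
  double user_ms = usage.ru_utime.tv_sec * 1e3 + usage.ru_utime.tv_usec / 1e3;
  double sys_ms  = usage.ru_stime.tv_sec * 1e3 + usage.ru_stime.tv_usec / 1e3;
  printf("user %.3f ms  system %.3f ms  max RSS %ld (units vary by OS)\n",
         user_ms, sys_ms, (long)usage.ru_maxrss);
  return 0;
}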

Hyperfine does not follow the “bare bones” fork/exec/wait method. It uses a general Rust package for managing processes, and it’s possible that some Rust code executes after the fork, but before the exec. The time spent in that code will be attributed by Hyperfine to the benchmarked command, and could explain the data it reports. Of course, if you are measuring programs that run for several minutes or more, you’ll never notice.

For now, I’m sticking with tools like BestGuess, /usr/bin/time, and multitime, because I understand the metrics they report, and it’s clear from their source code what is being measured. It is not at all obvious how a tool could produce “better” numbers than what the OS itself measures, short of deploying some kind of external monitoring.

Now, on to what we do with benchmark measurements!

Statistics

When a sample distribution is not normal, the mean is not a useful measure of central tendency, and the standard deviation is not a useful measure of dispersion (spread). The median and inter-quartile range are the usual alternatives, and that’s what BestGuess shows. In case you want to know the “typical” run time, BestGuess also estimates the mode, though this metric is useful only in unimodal cases.

More importantly, we must find another way to rank the benchmarked commands, because the mean and standard deviation are not suited to the task. BestGuess currently compares two sample distributions at a time; call them $X$ and $Y$. We start with a Mann-Whitney U calculation, a statistical test where the null hypothesis is: the chance that a randomly-chosen observation from $X$ is larger than a randomly-chosen observation from $Y$ is equal to the chance that one from $Y$ is larger than one from $X$. Essentially, the null hypothesis is “the two samples, $X$ and $Y$, came from the same distribution”.
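
To fix ideas, one common formulation of the U statistic counts, over all $n \cdot m$ pairs of observations, how often a value from one sample beats a value from the other, with ties counting one half (this is an illustration of the idea, not a description of the BestGuess implementation):

$U = \sum_{i=1}^{n}\sum_{j=1}^{m}\big(\mathbf{1}[x_i < y_j] + \tfrac{1}{2}\,\mathbf{1}[x_i = y_j]\big)$

Under the null hypothesis, $U$ should land near $n \cdot m / 2$; values near $0$ or near $n \cdot m$ indicate that one sample is consistently smaller than the other.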

BestGuess ranks the benchmarked commands, marking the best performers with a ✻. In the example below, we see that it took 2.25x as long to compute 1500 digits of π as it did to compute only 1000 digits. Later, we’ll ask BestGuess to explain how it determined that the two commands really performed differently.

1$ bestguess -QR -r 20 -s '/bin/bash -c' 'bc -l <<<"scale=1000;4*a(1)"' 'bc -l <<<"scale=1500;4*a(1)"'
2Best guess ranking:
3
4  ══════ Command ═══════════════════════════ Total time ═════ Slower by ════════
5  ✻   1: bc -l <<<"scale=1000;4*a(1)"          18.39 ms 
6  ══════════════════════════════════════════════════════════════════════════════
7      2: bc -l <<<"scale=1500;4*a(1)"          41.39 ms   23.00 ms   2.25x 
8  ══════════════════════════════════════════════════════════════════════════════
9$ 

The accompanying p-value indicates (roughly) how strongly we can reject the null hypothesis. In other words, a low p-value is a strong indication that two samples of run time measurements came from different distributions. For instance, a p-value of 0.01 indicates that with high confidence (99%), we can reject the hypothesis that the two commands have indistinguishable performance. Phrased more naturally, a low p-value like 0.01 gives a strong indication that the two commands perform differently.

Below, the -E (explain) option shows that $p < 0.001$, indicating significance. The median difference between the run times of the two commands is Δ = 22.86ms. We’ll discuss that next.

 1$ bestguess -QRE -r 20 -s '/bin/bash -c' 'bc -l <<<"scale=1000;4*a(1)"' 'bc -l <<<"scale=1500;4*a(1)"'
 2
 3  ╭────────────────────────────────────────────────────────────────────────────╮
 4  │  Parameter:                                    Settings: (modify with -c)  │
 5  │    Minimum effect size (H.L. median shift)       effect   500 μs           │
 6  │    Significance level, α                         alpha    0.01             │
 7  │    C.I. ± ε contains zero                        epsilon  250 μs           │
 8  │    Probability of superiority                    super    0.33             │
 9  ╰────────────────────────────────────────────────────────────────────────────╯
10
11Best guess ranking:
12
13  ══════ Command ═══════════════════════════ Total time ═════ Slower by ════════
14  ✻   1: bc -l <<<"scale=1000;4*a(1)"          18.29 ms 
15                                                                                
16  ══════ Command ═══════════════════════════ Total time ═════ Slower by ════════
17      2: bc -l <<<"scale=1500;4*a(1)"          41.16 ms   22.87 ms   2.25x 
18 
19      Timed observations      N = 20 
20      Mann-Whitney            U = 0 
21      p-value (adjusted)      p < 0.001  (< 0.001) 
22      Hodges-Lehmann          Δ = 22.86 ms 
23      Confidence interval     99.02% (22.74, 22.99) ms 
24      Prob. of superiority    Â = 0.00 
25  ══════════════════════════════════════════════════════════════════════════════
26$ 

A difference in performance can arise from a difference in the shape but not location (e.g. median) of a sample, such as when one is widely dispersed and the other is not, though they are centered on the same value. In my opinion, this is important information in an A/B comparison. If both programs run in the same (median) amount of time, but one has higher dispersion, they behave differently. When choosing between them, we probably want to know this, even if it’s not the foremost consideration in the choice.

When two distributions are similar in shape but differ by location, the Hodges-Lehmann median shift calculation is a good measure of the shift amount. It tells us how much faster program A is compared to B without relying on mean run times (too sensitive to high values) or standard deviations (not meaningful for non-normal empirical distributions). And it comes with a confidence level and corresponding interval, to help with interpretation.
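
As a sketch of the underlying computation (a point estimate only, leaving out the confidence interval, and not taken from the BestGuess source), the Hodges-Lehmann shift between two samples is the median of all their pairwise differences:

#include <stdlib.h>

static int cmp_double(const void *a, const void *b) {
  double da = *(const double *)a, db = *(const double *)b;
  return (da > db) - (da < db);
}

/* Median of all n*m pairwise differences y[j] - x[i].  A positive result
   means the y sample is shifted toward larger (slower) values. */
double hodges_lehmann_shift(const double *x, int n, const double *y, int m) {
  size_t count = (size_t)n * (size_t)m;
  double *d = malloc(count * sizeof *d);
  if (!d) return 0.0;                       /* sketch: no real error handling */
  size_t k = 0;
  for (int i = 0; i < n; i++)
    for (int j = 0; j < m; j++)
      d[k++] = y[j] - x[i];
  qsort(d, count, sizeof *d, cmp_double);
  double delta = (count % 2) ? d[count / 2]
                             : (d[count / 2 - 1] + d[count / 2]) / 2.0;
  free(d);
  return delta;
}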

The Mann-Whitney U test and the Hodges-Lehmann median shift figure provide what we really want to know:

  1. Can we distinguish between the performance of these two programs? (Null hypothesis: Our samples $X$ and $Y$ came from the same underlying distribution.)
  2. If there is a distinction, how much faster is one compared to the other? (What is the effect size, i.e. the shift of the median run time?)

The effect size is important. We might borrow from the experience of other scientists who analyze experimental results. Any clinical researcher will tell you that statistical significance is not all you need to know. Treatment A might give better outcomes than treatment B in a statistically significant way, but you need to know the effect size before deciding whether the cost or side effects of A are worth it. Moreover, a small effect size could indicate that the experiment produced an unlikely outcome, which subsequent experiments may reveal.

In macro-benchmarking, we can obtain the result that program A is faster than program B in a statistically significant way, but the effect size may be quite small, such as a fraction of a millisecond. Given the interference produced by the myriad processes running on even a “quiet” machine, it’s entirely possible that an experiment with a small effect size is not consistently reproducible.

On my MacBook, there are 704 processes running right now.

BestGuess measures the effect size by examining all of the pairwise differences between run times for two samples, and finds the median difference. This is the Hodges-Lehmann measure. It comes with a confidence measure and interval. If the confidence measure is low, or if the confidence interval contains zero, then we should not put a high value on the computed effect size.

On the other hand, consider the case where (1) the effect size is “large enough” (modest experience suggests 0.5ms to 1ms); (2) the effect size confidence is high (e.g. > 99%); and (3) the confidence interval does not include zero. In this case, we have many good reasons to believe that one program ran faster than the other.

Statistical significance and effect size are used by BestGuess to rank benchmark results.

  • When two programs are not statistically different in performance, they are reported as indistinguishable.
  • When the effect size is too small or has low confidence, the programs are reported as indistinguishable.

Below, I compare generating 1005 digits of π to 1000 digits. Using configuration parameters that I left at their default values, BestGuess shows a high p-value ($p = 0.081$), indicating a lack of significant difference between the two commands. Also, the effect size was small (Δ = 0.07 ms), so even if we considered $p = 0.081$ to be good enough, the difference is minor. Worse still, the 99% confidence interval for the difference includes zero.

 1$ bestguess -QRE -r 20 -s '/bin/bash -c' 'bc -l <<<"scale=1000;4*a(1)"' 'bc -l <<<"scale=1005;4*a(1)"'
 2[SNIP! Configuration parameters omitted]
 3
 4Best guess ranking: The top 2 commands performed identically
 5
 6  ══════ Command ═══════════════════════════ Total time ═════ Slower by ════════
 7  ✻   1: bc -l <<<"scale=1000;4*a(1)"          18.18 ms 
 8                                                                                
 9  ══════ Command ═══════════════════════════ Total time ═════ Slower by ════════
10  ✻   2: bc -l <<<"scale=1005;4*a(1)"          18.27 ms    0.09 ms   1.00x 
11                                                                                
12      Timed observations      N = 20 
13      Mann-Whitney            U = 135 
14      p-value (adjusted)      p = 0.081  (0.080)       ✗ Non-signif. (α = 0.01) 
15      Hodges-Lehmann          Δ = 0.07 ms              ✗ Effect size < 500 μs 
16      Confidence interval     99.02% (-43, 322) μs     ✗ CI ± 250 μs contains 0 
17      Prob. of superiority    Â = 0.34                 ✗ Pr. faster obv. > 33% 
18                                                                                
19$ 

Finally, BestGuess employs one additional check on sample similarity, a calculation related to the Mann-Whitney U value called the probability of superiority. It measures the probability that a randomly-chosen run time from program B is lower (faster) than one randomly-chosen from program A. Assuming A is faster than B, it is unlikely that many measurements for program B were faster than those of A, though some may be. A high probability here simply indicates a large overlap in the distributions of the two samples.
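
Under the usual definition, the probability of superiority is just the U statistic scaled by the number of pairs: Â = U/(n·m). I am assuming BestGuess follows this convention; the report above is consistent with it, since 135/(20 × 20) ≈ 0.34.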

Currently, BestGuess version 0.7.5 is the latest release. Earlier versions have been tested on a trial basis in my research, and I am confident in the measurements. BestGuess always saves the raw data, so it is easy to re-analyze it (with BestGuess or other systems).

We are not yet ready to declare a version 1.0, but I hope people will try out the current release and give us feedback. I expect to use this tool for a long time, and to develop it further to meet my group’s future needs and those of others.

BestGuess

BestGuess is a tool for command-line benchmarking. It does these things:

  1. Runs commands and captures run times, memory usage, and other metrics.
  2. Saves the raw data, for record-keeping or for later analysis.
  3. Optionally reports on various properties of the data distribution.
  4. Ranks the benchmarked commands from fastest to slowest.

The default output contains a lot of information about the commands being benchmarked:

 1$ bestguess -r 20 "ls -lR" "ls -l" "ps Aux"
 2Use -o <FILE> or --output <FILE> to write raw data to a file.
 3
 4Command 1: ls -lR
 5                      Mode    ╭     Min      Q₁    Median      Q₃       Max   ╮
 6   Total CPU time    5.86 ms  │    5.79     5.86     5.92     6.26     9.77   │
 7        User time    2.18 ms  │    2.14     2.17     2.18     2.29     3.03   │
 8      System time    3.68 ms  │    3.65     3.69     3.75     3.97     6.73   │
 9       Wall clock    7.17 ms  │    7.14     7.22     7.34     7.65    12.55   │
10          Max RSS    1.86 MB  │    1.64     1.83     1.86     1.88     2.06   │
11       Context sw      13 ct  ╰      13       13       13       14       18   ╯
12
13Command 2: ls -l
14                      Mode    ╭     Min      Q₁    Median      Q₃       Max   ╮
15   Total CPU time    2.63 ms  │    2.61     2.64     2.71     2.77     3.44   │
16        User time    1.05 ms  │    1.00     1.01     1.03     1.05     1.19   │
17      System time    1.62 ms  │    1.59     1.62     1.67     1.72     2.25   │
18       Wall clock    3.87 ms  │    3.82     3.90     4.09     4.34     5.14   │
19          Max RSS    1.73 MB  │    1.55     1.73     1.77     1.80     1.98   │
20       Context sw      13 ct  ╰      13       13       13       14       14   ╯
21
22Command 3: ps Aux
23                      Mode    ╭     Min      Q₁    Median      Q₃       Max   ╮
24   Total CPU time   34.10 ms  │   33.45    33.90    34.10    34.26    34.37   │
25        User time    8.52 ms  │    8.48     8.51     8.53     8.57     8.63   │
26      System time   25.62 ms  │   24.91    25.38    25.55    25.65    25.88   │
27       Wall clock   45.18 ms  │   44.28    44.82    45.05    45.31    45.56   │
28          Max RSS    2.94 MB  │    2.88     2.94     2.96     3.03     3.23   │
29       Context sw      98 ct  ╰      96       98       98       98       99   ╯
30
31Best guess ranking:
32
33  ══════ Command ═══════════════════════════ Total time ═════ Slower by ════════
34  ✻   2: ls -l                                  2.71 ms 
35  ══════════════════════════════════════════════════════════════════════════════
36      1: ls -lR                                 5.92 ms    3.22 ms   2.19x 
37      3: ps Aux                                34.10 ms   31.39 ms  12.60x 
38  ══════════════════════════════════════════════════════════════════════════════
39$ 

For a mini-report, use the -M option. See reporting options below.

The implementation is C99 and runs on many Linux and BSD platforms, including macOS.

Dependencies:

  • C compiler
  • make

Example: Compare two ways of running ps

The default report shows CPU times (user, system, and total), wall clock time, max RSS (memory), and the number of context switches.

The mode may be the most relevant, as it is the “typical” value, but the median is also a good indication of central tendency for performance data.

The figures to the right show the conventional quartile figures, from the minimum up to the maximum observation. The starred command at the top of the ranking ran the fastest, statistically.

 1$ bestguess -r=100 "ps" "ps Aux"
 2Use -o <FILE> or --output <FILE> to write raw data to a file.
 3
 4Command 1: ps
 5                      Mode    ╭     Min      Q₁    Median      Q₃       Max   ╮
 6   Total CPU time    7.83 ms  │    7.76     8.69    11.18    14.68    27.40   │
 7        User time    1.38 ms  │    1.36     1.53     1.86     2.33     3.96   │
 8      System time    6.44 ms  │    6.39     7.16     9.38    12.21    23.49   │
 9       Wall clock    8.38 ms  │    8.35     9.64    13.22    17.49    45.47   │
10          Max RSS    1.86 MB  │    1.73     1.86     2.08     2.38     2.77   │
11       Context sw       1 ct  ╰       1        2       29       85      572   ╯
12
13Command 2: ps Aux
14                      Mode    ╭     Min      Q₁    Median      Q₃       Max   ╮
15   Total CPU time   37.04 ms  │   36.57    36.98    37.45    38.82    86.70   │
16        User time    9.37 ms  │    9.33     9.42     9.56     9.77    19.74   │
17      System time   27.55 ms  │   27.15    27.57    27.88    29.15    66.97   │
18       Wall clock   50.81 ms  │   49.44    50.81    51.30    52.39   115.32   │
19          Max RSS    3.00 MB  │    2.97     3.08     3.62     3.95     5.33   │
20       Context sw    0.12 K   ╰    0.12     0.12     0.13     0.15     1.07   ╯
21
22Best guess ranking:
23
24  ══════ Command ═══════════════════════════ Total time ═════ Slower by ════════
25  ✻   1: ps                                    11.18 ms 
26  ══════════════════════════════════════════════════════════════════════════════
27      2: ps Aux                                37.45 ms   26.28 ms   3.35x 
28  ══════════════════════════════════════════════════════════════════════════════
29$ 

Best practice: Save the raw data

Use -o <FILE> to save the raw data (and silence the pedantic admonition to do so). We can see in the example below that the raw data file has 76 lines: one header and 25 observations for each of 3 commands. The first command is empty, and is used to measure the shell startup time.

 1$ bestguess -o /tmp/data.csv -M -r=25 -s "/bin/bash -c" "" "ls -l" "ps Aux"
 2Command 1: (empty)
 3                      Mode    ╭     Min   Median      Max   ╮
 4   Total CPU time    8.61 ms  │    4.91     8.59    10.41   │
 5       Wall clock   12.58 ms  ╰    6.74    14.99    44.84   ╯
 6
 7Command 2: ls -l
 8                      Mode    ╭     Min   Median      Max   ╮
 9   Total CPU time   16.96 ms  │    5.66    16.87    21.22   │
10       Wall clock   30.26 ms  ╰    7.33    24.56    42.76   ╯
11
12Command 3: ps Aux
13                      Mode    ╭     Min   Median      Max   ╮
14   Total CPU time   34.42 ms  │   32.02    35.31    45.86   │
15       Wall clock   43.78 ms  ╰   42.29    46.54    58.38   ╯
16
17Best guess ranking:
18
19  ══════ Command ═══════════════════════════ Total time ═════ Slower by ════════
20  ✻   1: (empty)                                8.59 ms 
21  ══════════════════════════════════════════════════════════════════════════════
22      2: ls -l                                 16.87 ms    8.28 ms   1.96x 
23      3: ps Aux                                35.31 ms   26.72 ms   4.11x 
24  ══════════════════════════════════════════════════════════════════════════════
25$ wc -l /tmp/data.csv
26      76 /tmp/data.csv
27$ 

The accompanying program bestreport can read the raw data file (or many of them) and reproduce any and all of the summary statistics and graphs:

 1$ bestreport -QRB /tmp/data.csv
 2   5        10        16        21        27        32        37        43      
 3   ├─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼──────
 4        ┌─┬┐
 5 1:├┄┄┄┄┤ │├┄┤
 6        └─┴┘
 7                 ┌───────┬────┐
 8 2: ├┄┄┄┄┄┄┄┄┄┄┄┄┤       │    ├┄┄┤
 9                 └───────┴────┘
10                                                         ┌─┬────────┐
11 3:                                                  ├┄┄┄┤ │        ├┄┄┄┄┄┄┄┄┄┄┤
12                                                         └─┴────────┘
13   ├─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼──────
14   5        10        16        21        27        32        37        43      
15
16Box plot legend:
17  1: (empty)
18  2: ls -l
19  3: ps Aux
20
21Best guess ranking:
22
23  ══════ Command ═══════════════════════════ Total time ═════ Slower by ════════
24  ✻   1: (empty)                                8.59 ms 
25  ══════════════════════════════════════════════════════════════════════════════
26      2: ls -l                                 16.87 ms    8.28 ms   1.96x 
27      3: ps Aux                                35.31 ms   26.72 ms   4.11x 
28  ══════════════════════════════════════════════════════════════════════════════
29$ 

Example: Measure shell startup time

When running a command via a shell or other command runner, you may want to measure the overhead of starting the shell. Supplying an empty command string, "", as one of the commands will run the shell with no command, thus measuring the time it takes to launch the shell.

Rationale: BestGuess does not compute shell startup time because it doesn’t know how you want it measured, if at all. (Which shell? How to invoke it? How many runs and warmup runs?)

On my machine, as shown below, about 2.4ms is spent in the shell, out of the 5.2ms needed to run ls -l.

When reporting experimental results, we might want to subtract the shell startup time from the run time of the other commands to estimate the net run time.

 1$ bestguess -M -w 5 -r 20 -s "/bin/bash -c" "" "ls -l"
 2Use -o <FILE> or --output <FILE> to write raw data to a file.
 3
 4Command 1: (empty)
 5                      Mode    ╭     Min   Median      Max   ╮
 6   Total CPU time    2.39 ms  │    2.37     2.39     2.57   │
 7       Wall clock    3.15 ms  ╰    3.07     3.15     3.39   ╯
 8
 9Command 2: ls -l
10                      Mode    ╭     Min   Median      Max   ╮
11   Total CPU time    5.21 ms  │    5.15     5.22     5.38   │
12       Wall clock    6.75 ms  ╰    6.66     6.77     7.04   ╯
13
14Best guess ranking:
15
16  ══════ Command ═══════════════════════════ Total time ═════ Slower by ════════
17  ✻   1: (empty)                                2.39 ms 
18  ══════════════════════════════════════════════════════════════════════════════
19      2: ls -l                                  5.22 ms    2.83 ms   2.18x 
20  ══════════════════════════════════════════════════════════════════════════════
21$ 

BestGuess reporting options

Mini stats

If the summary statistics included in the default report are more than you want to see, use the “mini stats” option.

 1$ bestguess -M -r 20 "ps A" "ps Aux" "ps"
 2Use -o <FILE> or --output <FILE> to write raw data to a file.
 3
 4Command 1: ps A
 5                      Mode    ╭     Min   Median      Max   ╮
 6   Total CPU time   23.20 ms  │   22.72    23.94    30.54   │
 7       Wall clock   23.98 ms  ╰   23.43    24.83    33.16   ╯
 8
 9Command 2: ps Aux
10                      Mode    ╭     Min   Median      Max   ╮
11   Total CPU time   26.50 ms  │   26.39    27.08    29.69   │
12       Wall clock   36.27 ms  ╰   35.72    36.45    41.68   ╯
13
14Command 3: ps
15                      Mode    ╭     Min   Median      Max   ╮
16   Total CPU time    7.81 ms  │    7.79     7.85     8.88   │
17       Wall clock    8.42 ms  ╰    8.38     8.46     9.60   ╯
18
19Best guess ranking:
20
21  ══════ Command ═══════════════════════════ Total time ═════ Slower by ════════
22  ✻   3: ps                                     7.85 ms 
23  ══════════════════════════════════════════════════════════════════════════════
24      1: ps A                                  23.94 ms   16.09 ms   3.05x 
25      2: ps Aux                                27.08 ms   19.24 ms   3.45x 
26  ══════════════════════════════════════════════════════════════════════════════
27$

Bar graph of performance

There’s a cheap (limited) but useful bar graph feature in BestGuess (-G or --graph) that shows the total time taken for each iteration as a horizontal bar.

The bar is scaled to the maximum time needed for any iteration of the command. The chart, therefore, is meant to show variation between iterations of the same command. Iteration 0 prints first.

The bar graph is meant to provide an easy way to estimate how many warmup runs may be needed, but can also give some insight about whether performance settles into a steady state or oscillates.

The contrived example below measures shell startup time against the time to run ls without a shell. It looks like bash could use a few warmup runs. Interestingly, the performance of ls got better and then worse again in this (very) small experiment.

 1$ bestguess -NG -r 10 /bin/bash ls
 2Use -o <FILE> or --output <FILE> to write raw data to a file.
 3
 4Command 1: /bin/bash
 50                                                                               max
 6│▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭
 7│▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭
 8│▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭
 9│▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭
10│▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭
11│▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭
12│▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭
13│▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭
14│▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭
15│▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭
16
17Command 2: ls
180                                                                               max
19│▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭
20│▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭
21│▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭
22│▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭
23│▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭
24│▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭
25│▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭
26│▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭
27│▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭
28│▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭▭
29
30$ 

Box plots for comparisons

Box plots are a convenient way to get a sense of how two distributions compare. When using BestGuess (and, before that, Hyperfine), we found that we didn’t want to wait to do statistical analysis of our raw data in a separate program. To get a sense of what the data looked like as we collected it, I implemented a (limited-resolution) box plot feature.

The edges of the box are the interquartile range, and the median is shown inside the box. The whiskers reach out to the minimum and maximum values.

In the example below, although bash (launching the shell with no command to run) appears faster than ls, we can see that their distributions overlap considerably. The BestGuess ranking analysis concludes that these two commands performed statistically identically. You can configure the thresholds used to draw this conclusion to suit your experiment design, such as if you want to ignore the fact that ls often took a long time to run.

 1$ bestguess -NRB -r 100 /bin/bash ls
 2Use -o <FILE> or --output <FILE> to write raw data to a file.
 3
 4 2.0       2.1       2.3       2.4       2.5       2.7       2.8       2.9      
 5   ├─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼──────
 6                  ┌┬─┐
 7 1:           ├┄┄┄┤│ ├┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┤
 8                  └┴─┘
 9    ┌┬┐
10 2:├┤│├┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┤
11    └┴┘
12   ├─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼──────
13 2.0       2.1       2.3       2.4       2.5       2.7       2.8       2.9      
14
15Box plot legend:
16  1: /bin/bash
17  2: ls
18
19Best guess ranking: The top 2 commands performed identically
20
21  ══════ Command ═══════════════════════════ Total time ═════ Slower by ════════
22  ✻   2: ls                                     2.16 ms 
23  ✻   1: /bin/bash                              2.38 ms    0.23 ms   1.10x 
24  ══════════════════════════════════════════════════════════════════════════════
25$ 

Feature set

See the project README for an overview of features, and for more information about the statistical calculations. There’s a section for Hyperfine users, too. BestGuess uses some of the same option names and can produce a Hyperfine-format CSV file of summary statistics.

While the BestGuess documentation is still being written, running bestguess -h is the best way to see all of the options. Currently, they are:

 1$ bestguess -h
 2Usage: bestguess [-A <action>] [options] ...
 3
 4  -w  --warmup          Number of warmup runs
 5  -r  --runs            Number of timed runs
 6  -p  --prepare         Execute <COMMAND> before each benchmarked command
 7  -i  --ignore-failure  Ignore non-zero exit codes
 8      --show-output     Show output of commands as they run
 9  -s  --shell           Use <SHELL> (e.g. "/bin/bash -c") to run commands
10  -n  --name            Name to use in reports instead of full command
11  -o  --output          Write timing data to CSV <FILE> (use - for stdout)
12      --export-csv      Write statistical summary to CSV <FILE>
13      --hyperfine-csv   Write Hyperfine-style summary to CSV <FILE>
14  -f  --file            Read additional commands and arguments from <FILE>
15  -Q  --quiet           Show only the output requested using other flags
16  -R  --ranking         Calculate and show statistical ranking of commands
17  -S  --summary         Show summary statistics for each command
18  -M  --mini-stats      Show minimal summary statistics for each command
19  -D  --dist-stats      Report the analysis of each sample distribution
20  -T  --tail-stats      Report on the tail of each sample distribution
21  -G  --graph           Show graph of total time for each command execution
22  -B  --boxplot         Show box plot of timing data comparing all commands
23  -E  --explain         Show an explanation of the inferential statistics
24  -c                    Configure <SETTING>=<VALUE>, e.g. width=80.
25                        Configuration settings [default]:
26                          width    Maximum terminal width for graphs, plots [80]
27                          alpha    Alpha value for statistics [.01]
28                          epsilon  Epsilon for confidence intervals (μsec) [250]
29                          effect   Minimum effect size (μsec) [500]
30                          super    Superiority threshold (probability) [.333]
31      --config          Show configuration settings
32      --limits          Show compiled-in limits
33  -A  --action          If the BestGuess executables are installed under custom
34                        names, an <ACTION> option is required, and may be either
35                        'run' or 'report'.  See the manual.
36  -v  --version         Show version
37  -h  --help            Show help
38$ 

Bug reports

Bug reports are welcome!

BestGuess is implemented in C, which we acknowledge makes good code more difficult to write. But BestGuess needs low-level control over the details of how processes are launched and measured, in order to obtain the best measurements we can.

But with C, segfaults and errant memory accesses are always a possibility. When BestGuess can detect a violation of intended behavior, it terminates in a controlled panic with an error message.

If you see any kind of bug, including a panic message, please let us know by opening an issue with instructions on how we can reproduce the bug.

Contributing

If you are interested in contributing, get in touch! My main blog page shows several ways to reach me.

Acknowledgments

Natalie Grogan showed me the Anderson-Darling test for normality, and analyzed data that my group had been collecting. The result was our understanding that command-line performance distributions are not remotely close to normal, and this changed how I looked at benchmarking.

Code, papers, and blog posts by people like Laurie Tratt and Daniel Lemire have been invaluable over the last few years as I’ve done some performance engineering and benchmarking.

Making things go fast is fun. Knowing that we’ve measured things accurately is satisfying.