Avoid these benchmarking blunders if you want useful data from your system tests
Measuring system performance may sound simple enough, but IT professionals know there’s a lot more to it than meets the eye. In this excerpt from Systems Performance: Enterprise and the Cloud, performance engineer Brendan Gregg offers advice on what not to do when benchmarking.
Casual benchmarking
Benchmarking done well is not a fire-and-forget activity. Benchmark tools provide numbers, but those numbers may not reflect what you think they do, and your conclusions about them may therefore be bogus.
With casual benchmarking, you may benchmark A, but actually measure B and conclude you’ve measured C.
Benchmarking well requires rigor to check what is actually measured and an understanding of what was tested to form valid conclusions.
For example, many tools claim or imply that they measure disk performance but actually test file system performance. The difference between these two can be orders of magnitude, as file systems employ caching and buffering to substitute disk I/O with memory I/O. Even though the benchmark tool may be functioning correctly and testing the file system, your conclusions about the disks will be wildly incorrect.
Understanding benchmarks is particularly difficult for the beginner, who has no instinct for whether numbers are suspicious or not. If you bought a thermometer that showed the temperature of the room you’re in as 1,000 degrees Fahrenheit, you’d immediately know that something was amiss. The same isn’t true of benchmarks, which produce numbers that are probably unfamiliar to you.
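The file-system-versus-disk trap above is easy to reproduce. The following sketch (illustrative only; file size and paths are my assumptions, not from the book) times two sequential reads of the same file. On most operating systems the second read is served from the page cache, so a tool that reports it as "disk" throughput would be wildly optimistic. Note that even the first read may already be cached, since the file was just written:

```python
# Sketch: a "disk read" benchmark that may actually be measuring the
# file system page cache. Sizes and paths are illustrative assumptions.
import os
import tempfile
import time

SIZE = 16 * 1024 * 1024  # 16 MB test file

with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(os.urandom(SIZE))
    path = f.name

def timed_read(p):
    """Read the whole file, returning (elapsed seconds, bytes read)."""
    start = time.perf_counter()
    with open(p, "rb") as fh:
        data = fh.read()
    return time.perf_counter() - start, len(data)

t_first, n1 = timed_read(path)   # may touch disk, or may already be cached
t_second, n2 = timed_read(path)  # almost certainly served from page cache

print(f"first read:  {t_first * 1000:.2f} ms")
print(f"second read: {t_second * 1000:.2f} ms")
os.unlink(path)
```

If the two times are similar and both very fast, you are benchmarking memory, not disks.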
Benchmark faith
It may be tempting to believe that a popular benchmarking tool is trustworthy, especially if it is open source and has been around for a long time. But the misconception that popularity equals validity is a well-known logical fallacy: argumentum ad populum (Latin for “appeal to the people”).
Analyzing the benchmarks you’re using is time-consuming and requires expertise to perform properly. And, for a popular benchmark, it may seem wasteful to analyze what surely must be valid.
The problem isn’t even necessarily with the benchmark software — although bugs do happen — but with the interpretation of the benchmark’s results.
Numbers without analysis
Bare benchmark results, provided with no analytical details, can be a sign that the author is inexperienced and has assumed that the benchmark results are trustworthy and final. Often, this is just the beginning of an investigation, and one that finds the results were wrong or confusing.
Every benchmark number should be accompanied by a description of the limit encountered and the analysis performed. I’ve summarized the risk this way: If you’ve spent less than a week studying a benchmark result, it’s probably wrong.
Much of my book focuses on analyzing performance, which should be carried out during benchmarking. In cases where you don’t have time for careful analysis, it is a good idea to list the assumptions that you haven’t had time to check and include them with the results, for example:
- Assuming the benchmark tool isn’t buggy
- Assuming the disk I/O test actually measures disk I/O
- Assuming the benchmark tool drove disk I/O to its limit, as intended
- Assuming this type of disk I/O is relevant for this application
This can become a to-do list, if the benchmark result is later deemed important enough to spend more effort on.
Complex benchmark tools
It is important that the benchmark tool not hinder benchmark analysis by its own complexity. Ideally, the program is open source so that it can be studied, and short enough that it can be read and understood quickly.
For micro-benchmarks, it is recommended to pick those written in the C programming language. For client simulation benchmarks, it is recommended to use the same programming language as the client, to minimize differences.
A common problem is one of benchmarking the benchmark — where the reported result is limited by the benchmark software itself. Complex benchmark suites can make this difficult to identify, due to the sheer volume of code to comprehend and analyze.
Testing the wrong thing
While there are numerous benchmark tools available to test a variety of workloads, many of them may not be relevant for the target application.
For example, a common mistake is to test disk performance — based on the availability of disk benchmark tools — even though the target environment workload is expected to run entirely out of file system cache and not be related to disk I/O.
Similarly, an engineering team developing a product may standardize on a particular benchmark and spend all its performance efforts improving performance as measured by that benchmark. If it doesn’t actually resemble customer workloads, however, the engineering effort will optimize for the wrong behavior.
A benchmark may have tested an appropriate workload once upon a time but, having gone years without updates, now tests the wrong thing. The article “Eulogy for a Benchmark” describes how a version of the SPEC SFS industry benchmark, commonly cited during the 2000s, was based on a customer usage study from 1986.
Ignoring errors
Just because a benchmark tool produces a result doesn’t mean the result reflects a successful test. Some — or even all — of the requests may have resulted in an error. While this issue is covered by the previous sins, this one in particular is so common that it’s worth singling out.
I was reminded of this during a recent benchmark of Web server performance. Those running the test reported that the average latency of the Web server was too high for their needs: over one second, on average. Some quick analysis determined what went wrong: the Web server did nothing at all during the test, as all requests were blocked by a firewall. All requests. The latency shown was the time it took for the benchmark client to time out and return an error.
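The fix is mechanical: separate failed requests from successful ones before computing any latency statistic. A minimal sketch, using made-up (latency, status) tuples standing in for per-request records a real client would keep:

```python
# Sketch: sanity-check benchmark results before trusting latency numbers.
# The result tuples are illustrative data, not real measurements.
results = [
    (0.004, 200), (0.005, 200), (0.004, 200),
    (1.002, 599), (1.001, 599),  # client-side time-out errors
]

errors = [lat for lat, status in results if status >= 400]
oks = [lat for lat, status in results if status < 400]

error_rate = len(errors) / len(results)
print(f"error rate: {error_rate:.0%}")

if error_rate > 0.01:
    print("WARNING: latency averaged over failed requests is meaningless")

# Report success-only latency, never a blend of successes and time-outs.
avg_ok = sum(oks) / len(oks)
print(f"avg latency (successes only): {avg_ok * 1000:.1f} ms")
```

Had the Web server test done this, a 100% error rate would have been the headline, not a one-second latency.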
Ignoring variance
Benchmark tools, especially micro-benchmarks, often apply a steady and consistent workload based on the average of a series of real-world measurements, such as those taken at different times of day or across a measurement interval. For example, a disk workload may be found to have average rates of 500 reads/sec and 50 writes/sec. A benchmark tool may then either simulate those rates, or simulate the 10:1 read/write ratio so that higher rates can be tested.
This approach ignores variance: The rate of operations may be variable. The types of operations may also vary, and some types may occur in bursts. For example, writes may be applied in bursts every 10 seconds (asynchronous write-back data flushing), whereas synchronous reads are steady. Bursts of writes may cause real issues in production, such as queueing the reads, but are not simulated if the benchmark applies steady average rates.
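The point is easy to see numerically. These two synthetic 60-second write workloads (illustrative numbers, not measurements from the book) have the identical 50 writes/sec average that a steady-rate benchmark would reproduce, yet wildly different variance:

```python
# Sketch: same average rate, very different variance. All numbers
# are illustrative assumptions.
import statistics

steady = [50] * 60  # 50 writes every second for a minute
bursty = [500 if t % 10 == 0 else 0 for t in range(60)]
# bursts: 500 writes every 10th second, idle otherwise

for name, series in (("steady", steady), ("bursty", bursty)):
    print(f"{name}: mean={statistics.mean(series):.0f}/s "
          f"stdev={statistics.pstdev(series):.0f}")
```

A benchmark that replays only the mean would never trigger the queueing that the 500-write bursts cause in production.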
Ignoring perturbations
Consider what external perturbations may be affecting results. Will a timed system activity, such as a system backup, execute during the benchmark run? For the cloud, a perturbation may be caused by unseen tenants on the same system.
A common strategy for ironing out perturbations is to make the benchmark runs longer — minutes instead of seconds. As a rule, the duration of a benchmark should not be shorter than one second. Short tests might be unusually perturbed by device interrupts (pinning the thread while performing interrupt service routines), kernel CPU scheduling decisions (waiting before migrating queued threads to preserve CPU affinity) and CPU cache warmth effects. Try running the benchmark test several times and examining the standard deviation. This should be as small as possible, to ensure repeatability.
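Repeat-and-compare is simple to automate. A minimal sketch, where `run_once()` is a toy stand-in (my assumption) for whatever benchmark you actually invoke:

```python
# Sketch: repeat a benchmark run several times and report mean and
# standard deviation; a large relative spread suggests perturbations.
import statistics
import time

def run_once():
    """Toy CPU workload standing in for a real benchmark run."""
    start = time.perf_counter()
    sum(i * i for i in range(200_000))
    return time.perf_counter() - start

runs = [run_once() for _ in range(5)]
mean = statistics.mean(runs)
stdev = statistics.stdev(runs)
cv = stdev / mean  # coefficient of variation: spread relative to the mean

print(f"mean={mean * 1000:.2f} ms  stdev={stdev * 1000:.2f} ms  CV={cv:.1%}")
```

A coefficient of variation of a few percent or less is a reasonable repeatability target; anything larger warrants investigation before the mean is trusted.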
Also collect data so that perturbations, if present, can be studied. This might include collecting the distribution of operation latency — not just the total runtime for the benchmark — so that outliers can be seen and their details recorded.
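Keeping the full latency distribution makes such outliers jump out. A sketch using an illustrative latency list (the nearest-rank percentile helper is my own simple implementation, not from any particular library):

```python
# Sketch: keep per-operation latencies rather than only a total runtime,
# so that outliers are visible. The latency list is illustrative.
latencies_ms = [4.1, 4.3, 4.0, 4.2, 4.1, 4.4, 4.2, 98.0, 4.3, 4.1]

def percentile(data, p):
    """Nearest-rank percentile of a list (simple, no interpolation)."""
    s = sorted(data)
    k = max(0, min(len(s) - 1, int(round(p / 100 * len(s))) - 1))
    return s[k]

p50 = percentile(latencies_ms, 50)
p99 = percentile(latencies_ms, 99)
print(f"p50={p50} ms  p99={p99} ms  max={max(latencies_ms)} ms")
# A p99 (or max) far above p50 is the cue to go find the outlier.
```

Here the average alone would hide the single 98 ms operation; the p50-versus-p99 gap exposes it immediately.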
Changing multiple factors
When comparing benchmark results from two tests, be careful to understand all the factors that are different between the two.
For example, if two hosts are benchmarked over the network, is the network between them identical? What if one host was more hops away, over a slower network, or over a more congested network? Any such extra factors could make the benchmark result bogus.
In the cloud, benchmarks are sometimes performed by creating instances, testing them, and then destroying them. This creates the potential for many unseen factors: Instances may be created on faster or slower systems, or on systems with higher load and contention from other tenants. It is recommended to test multiple instances and take the average (or better, record the distribution) to avoid outliers caused by testing one unusually fast or slow system.
Benchmarking the competition
Your marketing department would like benchmark results showing how your product beats the competition. This is usually a bad idea, for reasons I’m about to explain.
When customers pick a product, they don’t use it for 5 minutes; they use it for months. During that time, they analyze and tune the product for performance, perhaps shaking out the worst issues in the first few weeks.
You don’t have a few weeks to spend analyzing and tuning your competitor. In the time available, you can only gather untuned — and therefore unrealistic — results. The customers of your competitor — the target of this marketing activity — may well see that you’ve posted untuned results, so your company loses credibility with the very people it was trying to impress.
If you must benchmark the competition, you’ll want to spend serious time tuning their product. Also search for best practices, customer forums, and bug databases. You may even want to bring in outside expertise to tune the system. Then make the same effort for your own company before you finally perform head-to-head benchmarks.
Friendly fire
When benchmarking your own products, make every effort to ensure that the top-performing system and configuration have been tested, and that the system has been driven to its true limit. Share the results with the engineering team before publication; they may spot configuration items that you have missed. And if you are on the engineering team, be on the lookout for benchmark efforts — either from your company or from contracted third parties — and help them out.
Consider this hypothetical situation: An engineering team has worked hard to develop a high-performing product. Key to its performance is a new technology that they have developed that has yet to be documented. For the product launch, a benchmark team has been asked to provide the numbers. They don’t understand the new technology (it isn’t documented), they misconfigure it, and then they publish numbers that undersell the product.
Sometimes the system may be configured correctly but simply hasn’t been pushed to its limit. Ask the question: what is the bottleneck for this benchmark? It may be a physical resource, such as CPUs, disks, or an interconnect, that has been driven to 100% utilization and can be identified using analysis.
Another friendly fire issue arises when benchmarking older software versions whose performance issues were fixed in later releases, or when testing on whatever limited equipment happens to be available, producing a result that is not the best possible (as might be expected of a company benchmark).
Misleading benchmarks
Misleading benchmark results are common in the industry. They may stem from unintentionally limited information about what the benchmark actually measures, or from deliberately omitted information. Often the benchmark result is technically correct but is then misrepresented to the customer.
Consider this hypothetical situation: A vendor achieves a fantastic result by building a custom product that is prohibitively expensive and would never be sold to an actual customer. The price is not disclosed with the benchmark result, which focuses on nonprice/performance metrics. The marketing department liberally shares an ambiguous summary of the result (“We are 2x faster!”), associating it in customers’ minds with either the company in general or a product line. This is a case of omitting details in order to favorably misrepresent products. While it may not be cheating — the numbers are not fake — it is lying by omission.
Such vendor benchmarks may still be useful for you as upper bounds for performance. They are values that you should not expect to exceed (with an exception for cases of friendly fire).
Consider this different hypothetical situation: A marketing department has a budget to spend on a campaign and wants a good benchmark result to use. They engage several third parties to benchmark their product and pick the best result from the group. These third parties are not picked for their expertise; they are picked to deliver a fast and inexpensive result. In fact, non-expertise might be considered advantageous: the more the results deviate from reality, the better. Ideally one of them deviates greatly in a positive direction!
When using vendor results, be careful to check the fine print: what system was tested, what disk types were used and how many, what network interfaces were used and in which configuration, and other factors.
Benchmark specials
A type of sneaky activity — which in the eyes of some is considered a sin and thus prohibited — is the development of benchmark specials. This is when the vendor studies a popular or industry benchmark, and then engineers the product so that it scores well, while disregarding actual customer performance. This is also called optimizing for the benchmark.
The notion of benchmark specials became known in 1993 with the TPC-A benchmark, as described on the Transaction Processing Performance Council (TPC) history page:
The Standish Group, a Massachusetts-based consulting firm, charged that Oracle had added a special option (discrete transactions) to its database software, with the sole purpose of inflating Oracle’s TPC-A results. The Standish Group claimed that Oracle had “violated the spirit of the TPC” because the discrete transaction option was something a typical customer wouldn’t use and was, therefore, a benchmark special. Oracle vehemently rejected the accusation, stating, with some justification, that they had followed the letter of the law in the benchmark specifications. Oracle argued that since benchmark specials, much less the spirit of the TPC, were not addressed in the TPC benchmark specifications, it was unfair to accuse them of violating anything.
TPC added an anti-benchmark special clause:
All benchmark special implementations that improve benchmark results but not real-world performance or pricing, are prohibited.
As TPC is focused on price/performance, another strategy to inflate numbers can be to base them on special pricing — deep discounts that no customer would actually get. Like special software changes, the result doesn’t match reality when a real customer purchases the system. TPC has addressed this in its price requirements:
TPC specifications require that the total price must be within 2% of the price a customer would pay for the configuration.
While these examples may help explain the notion of benchmark specials, TPC addressed them in its specifications many years ago, and you shouldn’t necessarily expect them today.
Cheating
The last sin of benchmarking is cheating: sharing fake results. Fortunately, this is either rare or nonexistent; I’ve not seen a case of purely made-up numbers being shared, even in the most bloodthirsty of benchmarking battles.
Brendan Gregg is lead performance engineer at Joyent and formerly worked as performance lead and kernel engineer at Sun Microsystems and Oracle.
This article is excerpted from the book Systems Performance: Enterprise and the Cloud by Brendan Gregg, published by Prentice Hall Professional, Oct. 2013. Reprinted with permission. Content copyright 2014 Pearson Education, Inc.