Using macros to implement Min and Max has numerous well-known pitfalls that can be avoided by using template functions. However, the definitions of min and max mandated by the C++ standard cause VC++ to generate inefficient code. Until the VC++ compiler fixes this it may be best to use custom Min/Max template functions to avoid the problem.
Update, July 2015: VS 2015 fixes this bug such that StdMaxTest now generates perfect code.
I recommend writing template Min/Max functions that return by value instead of by const reference. In most cases these can be used as drop-in replacements for std::min and std::max, and they cause VC++ to generate much better code. Replacing std::min and std::max won’t always cause a large improvement, but I don’t think it ever hurts.
One of the advantages of working at Valve is having really smart coworkers. This particular code-gen problem was found by Michael Abrash. When he said that the compiler was being difficult I figured it was probably worth investigating.
We don’t need no stinkin’ macros
Macro implementations of Min and Max have four main problems. The first is that if they are not carefully written with lots of parentheses then you can hit non-obvious precedence problems. This can be dealt with by careful placement of parentheses in the macro definitions.
The second problem with macros is multiple evaluations of the arguments. If one of the parameters to a min/max macro has side effects then these multiple evaluations will cause incorrect behavior. In other cases the multiple evaluations might cause bloated code or reduced performance. This problem can be avoided by never passing arguments that have side effects to a macro.
The third problem is that macros don’t care whether your types match. You can pass a float and an int to MIN, or a signed and an unsigned, and the compiler will happily do conversions, sometimes leading to unexpected behavior.
The fourth problem with macros is that they pollute all namespaces. Yucch.
It might be possible to ensure that all of your macros are implemented correctly to avoid precedence problems, but it is very difficult to ensure that all uses of all macros avoid side effects and unplanned conversions, and namespace pollution is unavoidable with macros.
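To make these hazards concrete, here is a sketch of a typically parenthesized MAX macro together with two of the failure modes described above (the variable names and values are mine, purely for illustration):

#include <cstdio>

#define MAX(a, b) (((a) > (b)) ? (a) : (b))

void MacroPitfalls()
{
    // Multiple evaluation: i++ is expanded twice. The condition evaluates it
    // once and the winning branch evaluates it again, so i ends up at 9 and
    // m is 8 rather than the 7 the caller probably expected.
    int i = 7;
    int m = MAX(i++, 5);

    // Mixed types: -1 is converted to a huge unsigned value for the
    // comparison, so the macro picks -1 instead of 2.
    unsigned u = 2;
    int k = MAX(-1, u);

    printf("m = %d, i = %d, k = %d\n", m, i, k);
}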
Functions it is
Inline functions avoid these problems, and inline template functions offer the promise of perfectly efficient min/max for all types with none of the problems associated with macros. Here’s the general form for a template max function:
template<class T> inline const T& max(const T& a, const T& b)
{
    return b < a ? a : b;
}
This function is designed to work with any type that is comparable with operator<, so T can be an int, double, or your class FooBar. Because the function needs to handle arbitrary types the parameters and return type need to be declared as const references. This allows finding the maximum of objects that are expensive to copy or uncopyable:
auto& largest = std::max(NonCopyable1, NonCopyable2);
However that const reference return type gives Visual C++’s optimizer heartburn when std::max is used with built-in types.
In order to see what sort of code template min/max generate we need to call them. Below we have the world’s simplest test of std::max:
int StdMaxTest(int a, int b)
{
    return std::max(a, b);
}
By compiling this in a release build with link-time-code-generation (LTCG) disabled and /FAcs enabled we can get a sense of the code-gen in a simple scenario, without even having to call the function. This technique was described in more detail in How to Report a VC++ Code-Gen Bug. Here’s what the assembly language from the .cod file looks like:
?StdMaxTest@@YAHHH@Z
push ebp
mov ebp, esp
mov ecx, DWORD PTR _a$[ebp]
lea eax, DWORD PTR _b$[ebp]
cmp ecx, DWORD PTR _b$[ebp]
lea edx, DWORD PTR _a$[ebp]
cmovge eax, edx
mov eax, DWORD PTR [eax]
pop ebp
ret 0
The first two and last two instructions are boilerplate prologue and epilogue; the six instructions in between do the actual work. Roughly speaking, they conditionally select the address of the larger value and then load the winner from that address.
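For reference, a listing like this can be produced with a command along these lines, assuming the source file is MaxTests.cpp (to match the MaxTests.cod file mentioned below); compiling without /GL keeps LTCG out of the picture:

cl /c /O2 /FAcs MaxTests.cpp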
Now let’s consider an alternative definition of a template max function. This function is identical to std::max except that its return type is a value instead of a reference. Here’s the function and a test caller:
template <class T>
T FastMax(const T& left, const T& right)
{
    return left > right ? left : right;
}

int FastMaxTest(int a, int b)
{
    return FastMax(a, b);
}
And here’s the generated code:
?FastMaxTest@@YAHHH@Z
push ebp
mov ebp, esp
mov eax, DWORD PTR _b$[ebp]
cmp DWORD PTR _a$[ebp], eax
cmovg eax, DWORD PTR _a$[ebp]
pop ebp
ret 0
The inner section of the function – the part that does the actual work – is three instructions instead of six. Instead of selecting the winning address and then loading the value it just selects the winning value.
All else being equal, smaller and shorter code is better. A shorter dependency chain means higher peak speed, and a smaller footprint means fewer i-cache misses. Lower instruction counts don’t always translate into higher speed, but when no expensive instructions are involved the shorter sequence should be at least as fast, and being smaller is a real advantage in any case.
Timing differences
Measuring the performance of three to six instructions is impossible. Given that modern processors can have far more instructions than that in-flight at one time it isn’t even well defined. So, we need a better test.
The simplest timing test I could think of was a loop that scans an array to find the largest value. To compare FastMax and std::max I just need to call each of them a bunch of times on a moderately large array and see which one is faster. In order to avoid distortions from context switches and interrupts I print both the fastest and slowest times, but I ignore the slowest. Here’s one version of the test code:
int MaxManySlow(const int* p, size_t c)
{
    int result = p[0];
    for (size_t i = 1; i < c; ++i)
        result = std::max(result, p[i]);
    return result;
}
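For comparison, the FastMax version of the loop, MaxManyFast (referenced later), presumably looks like this, since the only source difference between the two functions is which max is called:

int MaxManyFast(const int* p, size_t c)
{
    int result = p[0];
    for (size_t i = 1; i < c; ++i)
        result = FastMax(result, p[i]);
    return result;
}

The timing harness itself can be as simple as the following sketch; this uses std::chrono rather than the attached test code, and kBufferSize and kIterations are made-up names:

#include <algorithm>
#include <chrono>
#include <cstdio>
#include <vector>

void TimeMaxMany()
{
    const size_t kBufferSize = 100000;  // moderately large, to swamp call overhead
    const int kIterations = 100;        // repeat and record the extremes
    std::vector<int> data(kBufferSize);
    for (size_t j = 0; j < kBufferSize; ++j)
        data[j] = (int)(j % 1000);

    long long sum = 0;  // accumulate results so the optimizer can't discard the work
    double fastest = 1e30, slowest = 0.0;
    for (int i = 0; i < kIterations; ++i)
    {
        auto start = std::chrono::steady_clock::now();
        sum += MaxManySlow(&data[0], data.size());  // or MaxManyFast
        double elapsed = std::chrono::duration<double>(
                std::chrono::steady_clock::now() - start).count();
        fastest = std::min(fastest, elapsed);
        slowest = std::max(slowest, elapsed);
    }
    // Print both extremes, but draw conclusions from the fastest time; the
    // slowest runs are the ones polluted by context switches and interrupts.
    printf("fastest: %f s, slowest: %f s (sum %lld)\n", fastest, slowest, sum);
}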
The results are dramatic. The differences are way more extreme than I had expected. The FastMax code runs three times faster than the std::max code! The code using FastMax takes two cycles per iteration, whereas the code using std::max takes six cycles per iteration.
Here is the inner loop generated when using std::max:
SlowLoopTop:
1: cmp ecx,dword ptr [edx]
2: lea eax,[result]
3: cmovl eax,edx
4: add edx,4
5: mov ecx,dword ptr [eax]
6: mov dword ptr [result],ecx
7: dec esi
8: jne SlowLoopTop
Here is the inner loop generated when using FastMax:
FastLoopTop:
1: cmp eax,dword ptr [esi+edx*4]
2: cmovle eax,dword ptr [esi+edx*4]
3: inc edx
4: cmp edx,edi
5: jb FastLoopTop
Remember that the only difference between the source code of these two functions is the return type of the Max function called. If I change FastMax to return a const reference then it generates code identical to std::max.
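That is, a variant like this (the name SlowMax is just for illustration) generates code identical to the std::max version in the micro test:

template <class T>
const T& SlowMax(const T& left, const T& right)
{
    return left > right ? left : right;
}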
The problem handling the std::max return type is apparently an optimizer weakness in VC++. I’ve filed a bug to show the problem and I’m hopeful that the VC++ team will address it. Until then I recommend using FastMax instead of std::max (and FastMin instead of std::min).
All testing was done with VC++ 2013, release builds, with the /O2 optimization setting. Testing was done on an Intel Sandybridge CPU on Windows 7. I tested on one other Intel CPU and got similar but not identical results. I saw similar results with VC++ 2010, so this is not a new problem.
One final glitch
In order to protect programs against exploits of stack-based buffer overruns, VC++ inserts calls to __security_check_cookie() at the end of vulnerable functions if you compile with /GS. A vulnerable function is one that VC++ thinks could have a buffer overrun. Unfortunately, using std::max in our test function triggers the /GS code, so MaxManySlow is further penalized by having code to check for impossible buffer overruns. This doesn’t affect the performance of our test because I passed in a large enough array, but if a small array were passed in then this would be an additional bit of overhead. And the extra code wastes more space in the instruction cache: MaxManySlow is 35 bytes larger than MaxManyFast – 73 bytes versus 38 bytes.
Other types
I’m not interested in testing this with all of the built-in types, but I did test with float. The minimal test functions showed different code-gen for the two template functions, but it wasn’t obvious which was better. When I ran tests on arrays of floats FastMax was about 4.8x faster than std::max. The loop with std::max took 7.33 cycles per iteration, while the loop with FastMax took an impressive 1.5 cycles per iteration.
Test code is attached to the bug and is also (newer version) available here. Run the release build to see the array results, compile the release build and look at MaxTests.cod to see the code-gen of the micro test functions.
I’m surprised I never noticed this before. I guess I got lucky because when I pushed my coworkers from macro MIN/MAX to template Min/Max I accidentally did return by value.
Update
I’ve tested this lightly with gcc and it seems to handle the references without any problems. I hear clang handles it fine also. However, using a single micro-test to compare compiler quality is clearly meaningless, so don’t over-extrapolate.
deniskravtsov suggested using overloads of min/max for built-in types to avoid this problem, which is a fascinating idea. However, if you put the overloads of min/max in the global namespace then they will not always be used and if you put them in the std:: namespace then you are probably breaking the language rules. I like using FastMax better.
While testing the overload suggestion I found that if the overload takes its parameters by value then the performance is slightly lower than if it takes them by reference. I don’t know if that is a general trend or if it is peculiar to this one test. It does seem quite odd that the parameter type and return type of a function that is inlined would cause so much trouble! The performance difference from the parameter types is slight so I wouldn’t read too much into it.
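For reference, the overload idea looks roughly like this; it is just a sketch, not deniskravtsov’s actual code:

// An int overload of max in the global namespace. Unqualified calls to
// max(a, b) can pick this up, but calls that explicitly say std::max will not,
// which is why this approach is not always used.
inline int max(const int& left, const int& right)
{
    return left > right ? left : right;
}

// The by-value variant, which tested slightly slower in the array benchmark:
// inline int max(int left, int right) { return left > right ? left : right; }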
Reddit discussion is here and here.
A few comments to my readers…
A surprising number of people said that the problem would go away if std::max was inlined by the compiler. Uhhh – in both of my examples std::max was inlined by the compiler.
There were also several people saying that this would never happen in real code. Well, it was originally found in real code. That code needed high performance, and it took quite a while to find out what was making the optimizer generate imperfect code. The slowdown in that case wasn’t three times, but it was enough to matter. Also, I think MaxManySlow looks like real code – I’m sure I’m not the first person to write that exact loop.
FastMax may indeed generate slower code for heavyweight types – but maybe not. That test will have to wait for another day.