The 8-Byte Two-Step

zinascii.com

48 points by rzezeski 11 years ago · 33 comments

nkurz 11 years ago

Thanks for writing this. It's great to see more articles like this on HN.

The assembly in the article looks like it was compiled without optimization, which is going to change the exact numbers quite a bit. Contrary to what it sounds like, "without optimization" really means something like "output some strange dialect of antiquated assembly that's 10 times slower than it needs to be". It's really only useful if you are debugging the compiler. In particular, it's pushing all the variables onto the stack even though it doesn't need to, so that they are there in case you need to check them manually. 99% of the time you want to be compiling with -O2, -O3, or -Ofast.

With gcc -O3 for x64, here's what they look like. The functions are shorter, faster, and in some ways easier to understand:

  align_1:
        leal    7(%rdi), %eax
        andl    $-8, %eax
        ret
The first integer arg always comes in %rdi/%edi. The compiler has recognized the idiom, and transformed it into (x + 7) & (0b11..11000). LEA is "load effective address", and in this case is a slightly shorter way of writing 'ADD', but no faster.
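
The recognized idiom, written back out as C; a minimal sketch (align_up8 is my name for it, not the article's):

```c
#include <stdint.h>

/* Round x up to the next multiple of 8: the (x + 7) & ~7 idiom the
   compiler recognized above. Works for any power-of-two alignment.
   Note it maps 0 to 0, whereas the loop version below returns 8. */
static uint64_t align_up8(uint64_t x) {
    return (x + 7) & ~(uint64_t)7;
}
```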

  align_2:
                cmpl    $8, %edi
                movl    $8, %eax
                jbe     .L6
        .L5:
                addl    $8, %eax
                cmpl    %eax, %edi
                ja      .L5
                rep ret
        .L6:
                rep ret
"If x <= 8 (jbe == jump if below or equal), return 8. Otherwise keep adding 8 until it's greater (ja == jump if above), and return that". The return value is whatever is in %eax on return. The 'rep ret' is a funny way of saying return, apparently done to be slightly faster on ancient AMD processors. At this point it's cruft. And there is no real reason to have two return statements --- that's the kind of wacky thing compilers like to do even when optimizing.
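
In C, the loop those instructions implement looks roughly like this (a sketch of the behavior described, not the article's source):

```c
#include <stdint.h>

/* Start at 8 and add 8 until the accumulator reaches or passes x.
   The 'ja .L5' branch keeps looping while x > a; x <= 8 returns 8
   immediately, matching the jbe fast path. */
static uint32_t align_loop(uint32_t x) {
    uint32_t a = 8;
    while (a < x)
        a += 8;
    return a;
}
```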

  align_3:
                movl    %edi, %edi
                subq    $8, %rsp
                cvtsi2sdq       %rdi, %xmm0
                mulsd   .LC0(%rip), %xmm0
                call    ceil
                mulsd   .LC1(%rip), %xmm0
                addq    $8, %rsp
                cvttsd2siq      %xmm0, %rax
                ret
        .LC0:
                .long   0
                .long   1069547520
        .LC1:
                .long   0
                .long   1075838976
The first 'movl' wipes out any bits greater than 32 if they are there. This is because x was defined as an int, and not a long. In general, you might be better off always using long unless there is a specific reason not to. Then we convert the int to floating point. Floating point on modern systems uses the 128-bit XMM registers that are also used for vectors. We put it into %xmm0, because that is the register for the first floating point or vector arg. Because multiplication is much faster than division, the compiler multiplies by a constant (.LC0) that is approximately (and depending on the value of the alignment, this might be a significant fact) equal to 1/ALIGN, rather than dividing by ALIGN. Keeping the result in %xmm0, it then calls ceil(), whose result also comes back in %xmm0, and multiplies by a second constant (.LC1) equal to ALIGN. Convert that back to a 64-bit "signed integer quadword" (siq) and return it.

The trickiness with optimizing (which is probably why you used the mangled unoptimized version) is that the compiler has a tendency to optimize them out completely so you end up testing an empty loop. Your choices are either work in assembly directly so what you see is what you get, or to be crafty and come up with something that prevents the compiler from doing so: a checksum, or perhaps a comparison that you know the result of but hopefully the compiler doesn't.
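
The checksum trick can look like this; a hypothetical benchmark skeleton (run_bench and ident are made-up names, not the linked gist's code):

```c
#include <stdint.h>

/* Stand-in for a function under test. */
static uint32_t ident(uint32_t x) { return x; }

/* Fold every result into a checksum so the compiler can't prove the
   loop body is dead and delete it. */
static uint64_t run_bench(uint32_t (*fn)(uint32_t), uint32_t iters) {
    uint64_t sum = 0;
    for (uint32_t i = 0; i < iters; i++)
        sum += fn(i);   /* the data dependency keeps the call live */
    return sum;         /* compare or print this after timing */
}
```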

Here are the timings in cycles on a Haswell system. I compiled with '-fno-inline'; letting the calls inline would be faster, but might skew the results more in this case, even though function calls are fast on x64.

baseline: 5.3 cycles for a loop that returns x and compares against zero.

align_1: 5.0 cycles including the loop. Yes, in this case adding the two instructions of the function actually took fewer cycles than the baseline loop. It's more than ILP voodoo, it's a whole way of life!

align_2: 62 cycles per iteration. Horrible, but not as bad as the unoptimized mess. But it's worth pointing out that while you don't want to use a loop here, there are much faster loops. Perhaps while (x % align != 0) { x++ }? This gets you down to 11 cycles, only 6 more than the baseline. Interestingly, it's this fast because the processor apparently can perfectly predict the number of loop iterations after the first 30, which is a little scary.
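
That suggested loop, written out as a function (a sketch; note it leaves a value that is already a multiple of 8 unchanged, so it maps 0 to 0, unlike align_2 which returns 8):

```c
#include <stdint.h>

/* Increment until x is a multiple of 8. At most 7 iterations, which
   is why the branch predictor can learn the pattern so well. */
static uint32_t align_mod(uint32_t x) {
    while (x % 8 != 0)
        x++;
    return x;
}
```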

align_3: 19 cycles per call. Faster than the silly loop, but really not a way you want to approach this problem unless you've really thought through all the floating point details.

I put my code up at https://gist.github.com/nkurz/985470b01b999e67d04b. At the bottom are more numbers provided by Likwid for things like number of instructions, number of branch errors, etc.

  • duckingtest 11 years ago

    >In general, you might be better off always using long unless there is a specific reason not to

    I don't agree. That's trading off useful cache for almost invisible micro optimizations.

    As a general rule, it's always better to use the smallest size possible.

    • nkurz 11 years ago

      I phrase it as "might be" since I realize it's controversial, but I think that rule is outdated. Yes, having a large static array would be a fine specific reason to use the smallest size possible. But what would be the benefit when you are dealing with the argument to a function in an example like this?

      Most of the time the argument starts in a register, is passed in a register, and returned in a register. Using a smaller size often just means that the compiler adds some unneeded conversions as in this case. Usually this doesn't matter, but when it does, the benefit is almost always in favor of the simpler rule of always using 64-bit variables.

      • duckingtest 11 years ago

        One argument? Sure, probably no difference. When you start using longs as local variables and arguments, even if they're all in registers in one function, if you call other functions inside they are going to be pushed on the stack. It all adds up and suddenly you're getting L1 cache misses.

Anyway, unnecessary conversions mostly go away when you use link time optimizations (-fwhole-program or -flto in gcc).

        • nkurz 11 years ago

          A reasonable argument, although not one I seem to run up against, perhaps because I'm rarely concerned about high performance when writing functions with that many layers of subcalls.

          By contrast, when trying to optimize inner loops, I frequently encounter cases where the front-end limitation of 4 micro-ops per cycle is a limiting factor, and getting rid of any extraneous instruction is a speedup. And rather than worrying about a deep stack causing L1 data misses, I'm more concerned with missing L1 instruction cache, or with the extra micro-ops causing me to miss the ~1000 slot decoded micro-op cache.

          These concerns are clearly at opposite ends of the performance spectrum, and which should dominate probably depends on the problem at hand.

          (I glanced at your comment history. Welcome to HN! You have good insights. Please stick around.)

kabdib 11 years ago

The floating point version is dangerous and broken on platforms that have 64-bit ints. (Not sure what happens with ceil and negative numbers, it probably works, but I'd have to try it). And while not an issue here, since this looks like it's user-space code, it's bad to use floats in a systems programming environment (floats are often in registers that can't be modified or maybe even referenced in kernel contexts without some save/restore shenanigans).

The loop version works poorly for large queue sizes, and if I saw this in production code I'd replace it immediately.

I'd be okay with a division and re-multiplication (likely going to be optimized by the compiler, though it's foolish to depend on this).

I don't see any problem with the mask/not/and solution. This has been a programming idiom for decades, and should be no more mysterious than (say) a doubly-linked list.

  • leif 11 years ago

I agree, the mask/not/and is standard within systems contexts. It need not be cognitive overhead. Not to be elitist, but if you don't have a copy of Hacker's Delight or the ability to understand constructs like this, you have no business writing systems-level code.

  • syncsynchalt 11 years ago

    Agreed, I had trouble understanding the author's interest/surprise/confusion re. this idiom.

    I asked around the room and the consensus is that a decade of C has brain damaged me.

Dylan16807 11 years ago

Maybe you shouldn't care about cycles, but using a loop for something that can be done without branches is a definite cognitive overhead.

If we want to avoid bit twiddling, how about this:

(mqhp->mq_maxsz + MQ_ALIGNSIZE - 1) / MQ_ALIGNSIZE * MQ_ALIGNSIZE;

I would expect a compiler to emit the same instructions as the original, but at worst you would have an add and two shifts rather than jumps or complex float operations.
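
Wrapped in a function for illustration (MQ_ALIGNSIZE assumed to be 8 here; the expression works for any alignment, power of two or not):

```c
#define MQ_ALIGNSIZE 8u

/* Branch-free round-up: add (align - 1), then truncate back down to a
   multiple of align via integer divide-and-multiply. */
static unsigned round_up(unsigned x) {
    return (x + MQ_ALIGNSIZE - 1) / MQ_ALIGNSIZE * MQ_ALIGNSIZE;
}
```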

  • to3m 11 years ago

    Provided everything is unsigned, this should result in something somewhat similar. I think this form also makes it clearer what's going on.

    Assume MQ_ALIGNSIZE is a power of two. Then if N is the power, MQ_ALIGNSIZE is 1<<N. Then you can replace divisions by right shifts, and multiplications by left shifts, and turn the expression into (value+(1<<N)-1)>>N<<N.

    Consider x>>N<<N. The (unsigned - i.e., guaranteed zero-filling) right shift discards the bottom N bits, and then the left shift puts the remaining bits back where they were. The net result is to mask off the bottom N bits of x.

    Another way of putting that would be x&~((1<<N)-1) - hopefully that is clear. In our example, 1<<N is MQ_ALIGNSIZE. So we have x&~(MQ_ALIGNSIZE-1), or, overall, (x+MQ_ALIGNSIZE-1)&~(MQ_ALIGNSIZE-1).

    I suppose it's possible a compiler could transform this, or the divide-and-multiply version, into the add-then-and version, given enough things that are compile-time constants. But personally if I knew the alignment would be a power of two - which it pretty much always is - I'd just write the code I want. Life is usually simpler that way.
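
The two power-of-two forms derived above, side by side as compilable C (N = 3, so the alignment is 8):

```c
#define N 3u   /* MQ_ALIGNSIZE == 1 << N == 8 */

/* Shift form: the right shift discards the bottom N bits, the left
   shift puts the rest back, so the pair masks off the low bits. */
static unsigned up_shift(unsigned x) {
    return (x + (1u << N) - 1) >> N << N;
}

/* Mask form: the same operation written as an explicit AND. */
static unsigned up_mask(unsigned x) {
    return (x + (1u << N) - 1) & ~((1u << N) - 1);
}
```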

greenyoda 11 years ago

I'd agree with his last sentence: "My guess, this was done more as an idiom of systems programming than as an optimization."

  • syntheticnature 11 years ago

    Doubly agreed -- similar patterns appear elsewhere to avoid off-by-one errors and bitmasking is fairly intuitive to anyone regularly working at that level. I was surprised at the author not recognizing this idiom, but you have to learn it sometime!

    I was further stunned by the seeming naivety of the author's align_2() implementation, but then it does get the job done, eventually. My naive approach would've been roughly align_3(), but using integer math and modulus.

    On the other hand, younger team members were recently stunned by code I did for translation of a binary protocol into more easily handled pieces using what I think of as typical idioms, so this article might get sent around Monday morning.

    • rzezeskiOP 11 years ago

      Hi, author here. My align_2 implementation is indeed very naive. Part of that was to show just how much overhead a naive implementation could add, in this case 3 orders of magnitude. I'm also new to systems programming. This article should serve as a reminder to experienced hackers that idioms are often only obvious _after_ they've been explained. My last 4 years were spent working on a distributed database; I could bring up "obvious" idioms from that world that might perplex a kernel hacker.

      But you are right, this is indeed a very obvious line of code once you understand it. Thank you for your comments.

      • syntheticnature 11 years ago

        The key being "once you understand it." I look forward to sending this around to make sure folks "get it" -- that which I grasp intuitively is magic for them[1], and kernel hacking is rarely the deep topic people imagine it to be. I doubt I have the right perspective to explain this as clearly, and if I could, I lack the time to do so as deeply.

        [1]Their words, heard within the past month.

  • vidarh 11 years ago

    Looking at code from the 80's, it is quite common to come across code that is extremely explicit about types, and that exploits known type sizes of the target platform, overflow behaviour etc.

    E.g. modern C code tends to use "int" or "long" all over the place without considering if "short" or "char" is likely to be sufficient.

Similarly, you'll find bit fiddling wherever it is possible, including a tendency to use bit shift instead of multiply/divide even for constant factors where one might think the compilers would optimize it to shifts (many early compilers didn't).

So it's not necessarily an idiom of systems programming so much as an idiom amongst "old school" C programmers in general. The extent to which it has carried through to more modern code varies, though. I suspect systems programming has an overall higher ratio of "old school" C developers than application code does.

    • zokier 11 years ago

      > E.g. modern C code tends to use "int" or "long" all over the place without considering if "short" or "char" is likely to be sufficient

"Sufficient" is a bit odd choice of word here. Word-sized types should be used unless there is a good reason not to. See nkurz's comment for a great example of this:

      > The first 'movl' wipes out any bits greater than 32 if they are there. This is because x was defined as an int, and not a long.

      https://news.ycombinator.com/item?id=8346336

  • wereHamster 11 years ago

Or for consistency with the rest of the code. It is one way, if not the fastest way, to do alignment. Every systems programmer knows that pattern; it is easily recognizable. You'll find it all over kernels and low-level system libraries.

jgh 11 years ago

It's a fun trick I've used for years, I think originally I gleaned it from http://graphics.stanford.edu/~seander/bithacks.html (or maybe not, now that I look through that page - though it is full of goodies)

suprjami 11 years ago

This is really neat, thanks for the thorough explanation.

I have banned myself from using printfs to figure out things like this. Instead I would use a debugger and breakpoints to view the live variables in their different data formats.

In GDB, that's p/t for binary and p/d for decimal.

thaumaturgy 11 years ago

Thank you for the careful writeup and submission here. I recently picked up a copy of Beautiful Code, but I was hoping it would have more stuff like this.

bitwize 11 years ago

Holy shit, mind blown.

I've done this before, but usually take modulo 8 rather than bitwise-AND with ~7 as the final step.

  • nightcracker 11 years ago

    Doesn't matter, any major compiler will transform this into a bitwise AND.

  • syncsynchalt 11 years ago

    Same, I'd write it out using modulo and would assume the compiler would figure it out.

        uintptr_t p = (uintptr_t)x;
        p += 7;
        p -= p % 8;
    
    Now that I've written it out I suppose it's not any clearer than the mask method:

        uintptr_t p = (uintptr_t)x;
        p += 7;
        p &= ~(uintptr_t)7;

hardwaresofton 11 years ago

Another great article from that site:

http://zinascii.com/2014/a-posix-queue-implementation.html

toolslive 11 years ago

If you don't know history, you are doomed to reinvent it. The one thing you should realize is that in the past, people were not more stupid. They just had less, or different, context. Regarding bit trickery, chess programming is full of it.

Curiously, it's also the area where strong static typing breaks down, as it is sometimes preferred to view the same set of bits as a value of a different type to be able to manipulate it in another way.

Chromozon 11 years ago

TLDR: round a number N up to the nearest multiple of M
