I got nerd sniped into benchmarking legacy x86 instructions (2019)
If you do more than microbenchmarking, then the cache effects start showing up and often the smaller-yet-individually-slower sequence begins to win.
But I disagree that the 3 sequences are actually identical in semantics, because the ones containing adds and xors will also affect the flags, while xlat and movs with the arithmetic done in the addressing mode don't.
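To make the flags point concrete, here is a toy Python model (all names and the single-flag simplification are mine, not real CPU internals): arithmetic like XOR updates the flags, while XLAT's implicit address-mode arithmetic leaves them untouched, so a later conditional branch could behave differently.

```python
# Toy model: XOR sets flags; XLATB does its lookup without touching them.
class CPU:
    def __init__(self):
        self.regs = {"rax": 0, "rbx": 0, "rcx": 0}
        self.zf = False  # zero flag (the only flag modeled here)

    def xor(self, dst, src):
        """xor dst, src -- result written to dst, flags updated."""
        self.regs[dst] ^= self.regs[src]
        self.zf = self.regs[dst] == 0

    def xlatb(self, table):
        """xlatb -- AL = table[AL]; flags are NOT modified."""
        al = self.regs["rax"] & 0xFF
        self.regs["rax"] = (self.regs["rax"] & ~0xFF) | table[al]

cpu = CPU()
cpu.xor("rcx", "rcx")          # zeroes RCX and sets ZF
assert cpu.zf
cpu.regs["rax"] = 3
cpu.xlatb([10, 11, 12, 13])    # AL becomes table[3] = 13
assert cpu.regs["rax"] == 13
assert cpu.zf                  # ZF survived the lookup
```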
The other thing to note is that pushes and pops are essentially free despite containing both a memory access and arithmetic --- I believe they added a special "stack engine" to make this fast starting with the P6.
I remember benchmarking AAD/AAM and they were basically exactly the same as the longer equivalent sequences, although that was on a 2nd generation i7. The (relative) timings do change a little between CPUs, but it seems that Intel mostly tries to optimise them every time so they're not all that much slower. It would be interesting to see this benchmark done on some other CPU models (e.g. AMDs, which tend to have very different relative timings, or something like an Atom or even NetBurst.)
The stack engine only handles the adjustment of the stack pointer, converting the push and pop to regular load/store uops.
But the store-then-load pattern is optimised by the store buffers, which do store-forwarding to forward the result of the in-flight store to the load without having to go though L1 cache.
It's not quite free, you still have to complete the store (the cpu can't assume optimising away a stack push is safe, unless it's actually overwritten) and there is still a 4 cycle latency, but that probably isn't an issue due to out-of-order execution.
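A toy Python sketch of the store-forwarding idea (class and method names are made up, and this ignores store sizes, partial overlap, and ordering): a load first checks the in-flight store buffer before falling back to the cache, but the store itself still has to complete.

```python
# Toy model of store-to-load forwarding: loads check the in-flight store
# buffer (youngest entry first) before falling back to the "L1 cache".
class StoreBuffer:
    def __init__(self, memory):
        self.memory = memory   # backing "cache"
        self.pending = []      # in-flight (addr, value) stores

    def store(self, addr, value):
        self.pending.append((addr, value))

    def load(self, addr):
        # Forward from the youngest matching in-flight store, if any.
        for a, v in reversed(self.pending):
            if a == addr:
                return v       # forwarded: no cache access needed
        return self.memory[addr]

    def retire(self):
        # Stores still have to complete; a push can't just be thrown away.
        for a, v in self.pending:
            self.memory[a] = v
        self.pending.clear()

mem = {0x1000: 0}
sb = StoreBuffer(mem)
sb.store(0x1000, 42)           # push-like store
assert sb.load(0x1000) == 42   # pop gets the forwarded value...
assert mem[0x1000] == 0        # ...before the store has reached the cache
sb.retire()
assert mem[0x1000] == 42
```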
It gets more "free" once you have the zero-latency loads introduced in Zen 2, where the load can be speculatively replaced with a register move if the store is close and obvious enough.
How can you have a zero latency load?
Similar way register movs can have zero latency - the output is renamed from the register source of the corresponding store. Which takes the load out of the dependency chain, effectively having zero latency so long as the correct store was identified.
> pushes and pops are essentially free despite containing both a memory access and arithmetic --- I believe they added a special "stack engine" to make this fast starting with the P6
There is a stack engine. But memory accesses and arithmetic are free even without it!
For an instruction only left in for backwards compatibility, I think the microcode is quite nicely optimized. Sure, it could be faster, but it beat two more naive implementations despite originating from the 386 days.
I do wonder, though, if there could still be some hidden gems buried deep in the legacy instructions that compilers could make use of for some very peculiar algorithms.
BCD instructions: https://news.ycombinator.com/item?id=8477254
> The meme is wrong
The third panel is generally meant to be the correct technical answer, while the last panel is reserved for the punchline.
Understanding the 'galaxy brain' format might have saved the author the trouble (or at least guided proper expectations), although it was a cool exercise.
lol, no. Galaxy brain starts somewhere (generally sane and reasonable) and then moves progressively in a certain direction (generally more complicated). In this case the starting point is an instruction sequence that is reminiscent of RISC architectures and then it gets progressively more CISC as you go down the page. The whole point of galaxy brain is that it follows this sequence, and because there’s no special third panel the sequence is extensible to arbitrary lengths.
I've seen galaxy brain comics where the last row is the punchline. The comic might start out moving in a certain direction in a logical way, which may or may not be humorous by itself, but then the last row has a twist, an unexpected interpretation of the direction.
Some bad examples I found on Google:
https://i.redd.it/j0wwzqe2287z.jpg
https://in.pinterest.com/pin/366128644701746892/
The x86 comic may or may not count, depending on whether you expect the reader to know that using those sorts of legacy instructions is not actually an improvement…
Sure, my point is that the twist doesn't need to be at a particular point, nor does there even need to be a twist. It's just a progression of related images; I think the progression in the ones you're showing is similar to the "evolution of a programmer" joke where a junior engineer starts off with something simple, progressively makes it cleverer and more complicated as they learn more, and eventually returns to the simple solution.
I always thought of the last panel as an initially silly-sounding answer that could still be considered correct in some unexpected way. In this case it fits, because if you compile with `-Os`, xlatb is probably the ideal output. I doubt llvm can output xlatb, but I'd be pleasantly surprised if it could
If this meme was posted to pouet or another demoscene site in the context of writing 4k (or other) space constrained demos, it would operate as you expected.
If space efficiency (or fitting in cache) are important, then this instruction being more compact but having worse execution performance could be a good tradeoff!
> while the last panel is reserved for the punchline
The meme gets used in a number of similar but different ways. Sometimes the last panel is the sequence taken to a logical but unrealistic extreme.
I'm not really sure this meme has a punchline. The number of instructions decreases in each panel.
The punchline is that they use an instruction that Intel themselves do not recommend.
The punchline is that they didn’t think of:
    movzx rax, al
    mov al, [rax+rbx]

"movzx eax, al" is one byte shorter, but I'm not sure if any of this would make a difference. x86 is tricky.
>The meme is wrong
Nah, it's rather that the meme is correctly absurd, as intended.
Wouldn't "movzx ecx, al" save one byte of rex.W prefix? Just wondering.
Yes. Also 'xor rcx, rcx' -> 'xor ecx, ecx'.
Why didn't the author benchmark the one-instruction equivalent MOV AL,[RBX+AL] that the article uses to explain XLATB? How would its performance differ from the third sequence going through RCX?
x86-64 does not have an addressing mode that uses the 8-bit register alias; in particular, the base and index registers are always either 32-bit (in 32-bit mode) or 64-bit (in 64-bit mode). As such, you need to zero-extend AL to a 64-bit register before using it in an offset addressing mode (or use XLATB).
For more information on supported addressing modes, see the manual: https://www.intel.com/content/www/us/en/develop/download/int... (specifically volume 1, section 3.7.5)
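As a rough Python model of the semantics (a sketch, not the real encoding rules): XLATB implicitly uses AL as the table index, while the explicit sequence must first zero-extend AL into a full-width register with MOVZX before it can appear in an addressing mode.

```python
# Sketch of the two equivalent lookups, assuming a 256-entry byte table
# whose base address is in RBX. Only the low 8 bits of RAX (AL) index it.
def xlatb(rax, table):
    """XLATB: AL = [RBX + AL]; upper bits of RAX are preserved."""
    al = rax & 0xFF
    return (rax & ~0xFF) | table[al]

def movzx_then_mov(rax, table):
    """movzx rax, al ; mov al, [rbx+rax] -- the explicit zero-extend that
    x86-64 requires, since addressing modes take no 8-bit index registers.
    Note this clobbers the upper bits of RAX, unlike XLATB."""
    rax = rax & 0xFF                   # movzx rax, al (clears bits 8..63)
    return (rax & ~0xFF) | table[rax]  # mov al, [rbx+rax]

table = [(i * 7) & 0xFF for i in range(256)]
# Both sequences produce the same AL; XLATB also keeps RAX's upper bits.
assert xlatb(0xAB10, table) & 0xFF == movzx_then_mov(0xAB10, table) & 0xFF
assert xlatb(0xAB10, table) == 0xAB70
```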
One could get rid of the push / pop though, assuming that the high bits of EAX don't need to be saved:
    movzx eax, al     ; could also do "and eax, 0ffh"
    mov al, [rbx+rax]
>However, since that time, all modern CPUs have turned RISC-like, by internally using a reduced instruction set and translating the ISA opcodes into internal commands, some implemented using CPU microcode.
Is there a way Intel can expose microcode and commands to outside so compilers can directly target them instead of X86 instruction set?
If yes, would there be anything to gain or lose?
One advantage of not exposing microcode is that newer processors can add support for new microcode instructions and map existing X86 instructions to them. In a sense there's a tiny JIT in the CPU that turns X86 into processor-optimized code.
The disadvantage is of course that this is complex to do in silicon, and the CPU might lack some insights that the compiler had. As I understand it Itanium was HP's and Intel's attempt to give a lot more power to the compiler, with an instruction set that better matches what's going on under the hood. But we all know how that ended: performance was lackluster and the Itanic was nothing but a waste of money for everyone involved.
GPUs have successfully moved the microcode translation one layer up, you generally compile to an intermediate ISA (let's call it a bytecode) and when you load the program (or shader) the GPU driver translates it to GPU-specific instructions. But that model doesn't easily translate to CPUs.
> Is there a way Intel can expose microcode and commands to outside so compilers can directly target them instead of X86 instruction set?
Maybe, but MOV is still MOV, so Intel, for the most part, is simply using a subset of x86 (or AMD64) instructions. Except for a few proprietary commands used to implement the more complex instructions, most simple instructions are implemented as-is and passed through anyway.
> If yes, would there be anything to gain or lose?
Gains: Very slightly faster performance (reduced lookup is always great, but realistically it doesn't matter unless you're doing supercomputer stuff).
Losses: It's pretty much like the kernel land of Linux or NT's undocumented functions: subject to change, fully unsupported. Also, it cannot be done on the current CPU families anyway, since the microcode can't be updated in a way that would make it worth it.
Micro-ops often have more bits than the ISA they're implementing, so you'd pay a program-size penalty.
Moreover, Intel (and I assume AMD) will take a sequence of micro-ops corresponding to a sequence of "instructions" and optimize the micro-op sequence based on dynamic usage, together with an "undo" for when the usage assumptions are wrong.
From my understanding, this microcode can and will change between processors, so you would lose the ability to run your code on more than a specific CPU type/generation.
Isn't microcode specific to a particular microarchitecture, that can, and often does, change between CPU model generations?
> what are the chances this obscure opcode is faster than optimized loads?
Sometimes it's not about being faster, sometimes it's about taking up less space. The graphic doesn't say what it's aiming for, and based on what I see in the graphic, the 4th panel seems to take up the least space.
Would someone mind explaining what all the assembly instructions in the meme do? In particular I'm wondering why you would do xor rcx, rcx when that result is always 0
> why you would do xor rcx, rcx when that result is always 0
It's an idiomatic way to populate a register with the value zero.
Not sure if it's still true, but IIRC it took fewer cycles than the more obvious "load #0 into $rcx" instruction.
These days you also get the benefit that it's four bytes shorter, since it doesn't have to store an immediate:

    48 31 c9                xor rcx,rcx
    48 c7 c1 00 00 00 00    mov rcx,0x0

(This is even shorter:

    31 c9                   xor ecx,ecx

)

It should be easier for the processor to detect xor-reg-with-itself as a special case. Intel has documented this as the preferred instruction to use since the Pentium afaik.
We call those runes “the shibboleth of an assembly programmer.” They are ancient and wise. If one speaks them, one knows of and yearns for a simpler time when MOV vs XOR was a debate.
(Neighbor’s got it, and I am as unsure of contemporary relevance as they are.)
Hmm, uiCA results:

xlatb: https://bit.ly/3cyBNN5

sequence: https://bit.ly/3nCmVTX
xlatb is looking better here. There are also some front-end concerns that may favor xlatb, in particular if it's friendlier to the decoder. xlat is also fewer uops, taking up less of the uop cache once decoded.
>>nerd sniped
I honestly don't know anything about this stuff, but the title is awesome.
I never heard the term until I did it to someone. He said I nerd sniped him, but now there's an algebraic constraint solver written in Rust on GitHub. He wrote a decent blog post about it too.
It's an xkcd reference: https://xkcd.com/356/
Hadn't heard it either until I saw it twice here on HN this week. Interesting how that goes.