I got nerd sniped into benchmarking legacy x86 instructions (2019)
If you do more than microbenchmarking, then the cache effects start showing up and often the smaller-yet-individually-slower sequence begins to win.
But I disagree that the 3 sequences are actually identical in semantics, because the ones containing adds and xors will also affect the flags, while xlat and movs with the arithmetic done in the addressing mode don't.
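To make the flags point concrete, here is a toy Python model (all names and the single-flag simplification are mine, not real CPU internals): arithmetic like XOR updates the flags, while XLAT's implicit address-mode arithmetic leaves them untouched, so a later conditional branch could behave differently.

```python
# Toy model: XOR sets flags; XLATB does its lookup without touching them.
class CPU:
    def __init__(self):
        self.regs = {"rax": 0, "rbx": 0, "rcx": 0}
        self.zf = False  # zero flag (the only flag modeled here)

    def xor(self, dst, src):
        """xor dst, src -- result written to dst, flags updated."""
        self.regs[dst] ^= self.regs[src]
        self.zf = self.regs[dst] == 0

    def xlatb(self, table):
        """xlatb -- AL = table[AL]; flags are NOT modified."""
        al = self.regs["rax"] & 0xFF
        self.regs["rax"] = (self.regs["rax"] & ~0xFF) | table[al]

cpu = CPU()
cpu.xor("rcx", "rcx")          # zeroes RCX and sets ZF
assert cpu.zf
cpu.regs["rax"] = 3
cpu.xlatb([10, 11, 12, 13])    # AL becomes table[3] = 13
assert cpu.regs["rax"] == 13
assert cpu.zf                  # ZF survived the lookup
```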
The other thing to note is that pushes and pops are essentially free despite containing both a memory access and arithmetic --- I believe they added a special "stack engine" to make this fast starting with the P6.
I remember benchmarking AAD/AAM and they were basically exactly the same as the longer equivalent sequences, although that was on a 2nd generation i7. The (relative) timings do change a little between CPUs, but it seems that Intel mostly tries to optimise them every time so they're not all that much slower. It would be interesting to see this benchmark done on some other CPU models (e.g. AMDs, which tend to have very different relative timings, or something like an Atom or even NetBurst.)
The stack engine only handles the adjustment of the stack pointer, converting the push and pop to regular load/store uops.
But the store-then-load pattern is optimised by the store buffers, which do store-forwarding to forward the result of the in-flight store to the load without having to go though L1 cache.
It's not quite free, you still have to complete the store (the cpu can't assume optimising away a stack push is safe, unless it's actually overwritten) and there is still a 4 cycle latency, but that probably isn't an issue due to out-of-order execution.
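A toy Python sketch of the store-forwarding idea (class and method names are made up, and this ignores store sizes, partial overlap, and ordering): a load first checks the in-flight store buffer before falling back to the cache, but the store itself still has to complete.

```python
# Toy model of store-to-load forwarding: loads check the in-flight store
# buffer (youngest entry first) before falling back to the "L1 cache".
class StoreBuffer:
    def __init__(self, memory):
        self.memory = memory   # backing "cache"
        self.pending = []      # in-flight (addr, value) stores

    def store(self, addr, value):
        self.pending.append((addr, value))

    def load(self, addr):
        # Forward from the youngest matching in-flight store, if any.
        for a, v in reversed(self.pending):
            if a == addr:
                return v       # forwarded: no cache access needed
        return self.memory[addr]

    def retire(self):
        # Stores still have to complete; a push can't just be thrown away.
        for a, v in self.pending:
            self.memory[a] = v
        self.pending.clear()

mem = {0x1000: 0}
sb = StoreBuffer(mem)
sb.store(0x1000, 42)           # push-like store
assert sb.load(0x1000) == 42   # pop gets the forwarded value...
assert mem[0x1000] == 0        # ...before the store has reached the cache
sb.retire()
assert mem[0x1000] == 42
```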
It gets more "free" once you have the zero-latency loads introduced in Zen 2, where the load can be speculatively replaced with a register move if the store is close and obvious enough.
How can you have a zero latency load?
Similar way register movs can have zero latency - the output is renamed from the register source of the corresponding store. Which takes the load out of the dependency chain, effectively having zero latency so long as the correct store was identified.
> pushes and pops are essentially free despite containing both a memory access and arithmetic --- I believe they added a special "stack engine" to make this fast starting with the P6
There is a stack engine. But memory accesses and arithmetic are free even without it!
For an instruction only left in for backwards compatibility, I think the microcode is quite nicely optimized. Sure, it could be faster, but it beat two more naive implementations despite originating from the 386 days.
I do wonder, though, if there could still be some hidden gems buried deep in the legacy instructions that compilers could make use of for some very peculiar algorithms.
BCD instructions: https://news.ycombinator.com/item?id=8477254
> The meme is wrong
The third panel is generally meant to be the correct technical answer, while the last panel is reserved for the punchline.
Understanding the 'galaxy brain' format might have saved the author the trouble (or at least guided proper expectations), although it was a cool exercise.
lol, no. Galaxy brain starts somewhere (generally sane and reasonable) and then moves progressively in a certain direction (generally more complicated). In this case the starting point is an instruction sequence that is reminiscent of RISC architectures and then it gets progressively more CISC as you go down the page. The whole point of galaxy brain is that it follows this sequence, and because there’s no special third panel the sequence is extensible to arbitrary lengths.
I've seen galaxy brain comics where the last row is the punchline. The comic might start out moving in a certain direction in a logical way, which may or may not be humorous by itself, but then the last row has a twist, an unexpected interpretation of the direction.
Some bad examples I found on Google:
https://i.redd.it/j0wwzqe2287z.jpg
https://in.pinterest.com/pin/366128644701746892/
The x86 comic may or may not count, depending on whether you expect the reader to know that using those sorts of legacy instructions is not actually an improvement…
Sure, my point is that the twist doesn't need to be at a particular point, nor does there even need to be a twist. It's just a progression of related images; I think the progression in the ones you're showing is similar to the "evolution of a programmer" joke where a junior engineer starts off with something simple, progressively makes it cleverer and more complicated as they learn more, and eventually returns to the simple solution.
I always thought of the last panel as an initially silly-sounding answer that could still be considered correct in some unexpected way. In this case it fits, because if you compile with `-Os`, xlatb is probably the ideal output. I doubt llvm can output xlatb, but I'd be pleasantly surprised if it could
If this meme was posted to pouet or another demoscene site in the context of writing 4k (or other) space constrained demos, it would operate as you expected.
If space efficiency (or fitting in cache) are important, then this instruction being more compact but having worse execution performance could be a good tradeoff!
> while the last panel is reserved for the punchline
The meme gets used in a number of similar but different ways. Sometimes the last panel is the sequence taken to a logical but unrealistic extreme.
I'm not really sure this meme has a punchline. The number of instructions decreases in each panel.
The punchline is that they use an instruction that Intel themselves do not recommend.
The punchline is that they didn’t think of:
    movzx rax, al
    mov al, [rax+rbx]

"movzx eax, al" is one byte shorter, but I'm not sure if any of this would make a difference. x86 is tricky.
>The meme is wrong
Nah, it's rather that the meme is correctly absurd, as intended.
Wouldn't "movzx ecx, al" save one byte of rex.W prefix? Just wondering.
Yes. Also 'xor rcx, rcx' -> 'xor ecx, ecx'.
Why didn't the author benchmark the one-instruction equivalent MOV AL,[RBX+AL] that the article uses to explain XLATB? How would its performance differ from the third sequence going through RCX?
x86-64 does not have an addressing mode that uses the 8-bit register alias; in particular, the base and index registers are always either 32-bit (in 32-bit mode) or 64-bit (in 64-bit mode). As such, you need to zero-extend AL to a 64-bit register before using it in an offset addressing mode (or use XLATB).
For more information on supported addressing modes, see the manual: https://www.intel.com/content/www/us/en/develop/download/int... (specifically volume 1, section 3.7.5)
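As a rough Python model of the semantics (a sketch, not the real encoding rules): XLATB implicitly uses AL as the table index, while the explicit sequence must first zero-extend AL into a full-width register with MOVZX before it can appear in an addressing mode.

```python
# Sketch of the two equivalent lookups, assuming a 256-entry byte table
# whose base address is in RBX. Only the low 8 bits of RAX (AL) index it.
def xlatb(rax, table):
    """XLATB: AL = [RBX + AL]; upper bits of RAX are preserved."""
    al = rax & 0xFF
    return (rax & ~0xFF) | table[al]

def movzx_then_mov(rax, table):
    """movzx rax, al ; mov al, [rbx+rax] -- the explicit zero-extend that
    x86-64 requires, since addressing modes take no 8-bit index registers.
    Note this clobbers the upper bits of RAX, unlike XLATB."""
    rax = rax & 0xFF                   # movzx rax, al (clears bits 8..63)
    return (rax & ~0xFF) | table[rax]  # mov al, [rbx+rax]

table = [(i * 7) & 0xFF for i in range(256)]
# Both sequences produce the same AL; XLATB also keeps RAX's upper bits.
assert xlatb(0xAB10, table) & 0xFF == movzx_then_mov(0xAB10, table) & 0xFF
assert xlatb(0xAB10, table) == 0xAB70
```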
One could get rid of the push / pop though, assuming that the high bits of EAX don't need to be saved:
    movzx eax, al     ; could also do "and eax, 0ffh"
    mov al, [rbx+rax]
>However, since that time, all modern CPUs have turned RISC-like, by internally using a reduced instruction set and translating the ISA opcodes into internal commands, some implemented using CPU microcode.
Is there a way Intel can expose microcode and commands to outside so compilers can directly target them instead of X86 instruction set?
If yes, would there be anything to gain or lose?
One advantage of not exposing microcode is that newer processors can add support for new microcode instructions and map existing X86 instructions to them. In a sense there's a tiny JIT in the CPU that turns X86 into processor-optimized code.
The disadvantage is of course that this is complex to do in silicon, and the CPU might lack some insights that the compiler had. As I understand it Itanium was HP's and Intel's attempt to give a lot more power to the compiler, with an instruction set that better matches what's going on under the hood. But we all know how that ended: performance was lackluster and the Itanic was nothing but a waste of money for everyone involved.
GPUs have successfully moved the microcode translation one layer up, you generally compile to an intermediate ISA (let's call it a bytecode) and when you load the program (or shader) the GPU driver translates it to GPU-specific instructions. But that model doesn't easily translate to CPUs.
> Is there a way Intel can expose microcode and commands to outside so compilers can directly target them instead of X86 instruction set?
Maybe, but MOV is still MOV, so Intel, for the most part, is simply using a subset of x86 (or AMD64) instructions. Except for a few proprietary commands used to implement the more complex instructions, most simple instructions are implemented as-is and passed through anyway.
> If yes, would there be anything to gain or lose?
Gains: Very slightly faster performance (reduced lookup is always great, but realistically it doesn't matter unless you're doing supercomputer stuff).
Losses: It's pretty much like the kernel land of Linux or NT's undocumented functions: subject to change, fully unsupported. Also, it cannot be done on the current CPU families anyway, since the microcode can't be updated in a way that would make it worth it.
Micro-ops often have more bits than the ISA they're implementing, so you'd pay a program-size penalty.
Moreover, Intel (and I assume AMD) will take a sequence of micro-ops corresponding to a sequence of "instructions" and optimize the micro-op sequence based on dynamic usage, together with an "undo" for when the usage assumptions are wrong.
From my understanding, this microcode can and will change between processors, so you would lose the ability to run your code on more than a specific CPU type/generation.
Isn't microcode specific to a particular microarchitecture, that can, and often does, change between CPU model generations?
> what are the chances this obscure opcode is faster than optimized loads?
Sometimes it's not about being faster, sometimes it's about taking up less space. The graphic doesn't say what it's aiming for, and based on what I see in the graphic, the 4th panel seems to take up the least space.
Would someone mind explaining what all the assembly instructions in the meme do? In particular I'm wondering why you would do xor rcx, rcx when that result is always 0
> why you would do xor rcx, rcx when that result is always 0
It's an idiomatic way to populate a register with the value zero.
Not sure if it's still true, but IIRC it took fewer cycles than the more obvious "load #0 into $rcx" instruction.
These days you also get the benefit that it's four bytes shorter, since it doesn't have to store an immediate:

    48 31 c9                xor rcx,rcx
    48 c7 c1 00 00 00 00    mov rcx,0x0

(This is even shorter:

    31 c9                   xor ecx,ecx

)

It should be easier for the processor to detect xor-reg-with-itself as a special case. Intel has documented this as the preferred instruction to use since the Pentium afaik.
We call those runes “the shibboleth of an assembly programmer.” They are ancient and wise. If one speaks them, one knows of and yearns for a simpler time when MOV vs XOR was a debate.
(Neighbor’s got it, and I am as unsure of contemporary relevance as they are.)
Hmm, uiCA results:

xlatb: https://bit.ly/3cyBNN5

sequence: https://bit.ly/3nCmVTX
xlatb is looking better here. There are also some front-end concerns that may favor xlatb, in particular if it's friendlier to the decoder. xlat is also fewer uops, taking up less of the uop cache once decoded.
>>nerd sniped
I honestly don't know anything about this stuff, but the title is awesome.
I never heard the term until I did it to someone. He said I nerd sniped him, but now there's an algebraic constraint solver written in Rust on GitHub. He wrote a decent blog post about it too.
It's an xkcd reference: https://xkcd.com/356/
Hadn't heard it either until I saw it twice here on HN this week. Interesting how that goes.