UPDATE: This article now has an ARM64 port.
Why, in 2021, does anyone need to learn about assembly language? First, reading assembly language is the way to know exactly what your program is doing. Why, exactly, is that C++ program 1 MiB (say) instead of 100 KiB? Is it possible to squeeze some more performance out of that function that gets called all the time?
For C++ in particular, it is easy to forget or just not notice some operation (e.g., an implicit conversion or a call to a copy constructor or destructor) that is implied by the source code and language semantics, but not spelled out explicitly. Looking at the assembly generated by the compiler puts everything in plain sight.
Second, the more practical reason: so far, posts on this blog haven’t required an understanding of assembly language, despite constant links to Compiler Explorer. By popular demand, however, our next topic will be parameter passing, and for that, we will need a basic understanding of assembly language. We will focus only on reading assembly language, not writing it.
Instructions
The basic unit of assembly language is the instruction. Each machine instruction is a small operation, like adding two numbers, loading some data from memory, jumping to another memory location (like the dreaded goto statement), or calling or returning from a function. (The x86 architecture has lots of not-so-small instructions as well. Some of these are legacy cruft built up over the 40-odd years of the architecture’s existence, and others are newfangled additions.)
Example 1: Vector norm
Our first toy example will get us acquainted with simple instructions. It just calculates the square of the norm of a 2D vector:
#include <cstdint>

struct Vec2 {
  int64_t x;
  int64_t y;
};

int64_t normSquared(Vec2 v) {
  return v.x * v.x + v.y * v.y;
}
and here is the resulting x86-64 assembly from clang 11, via Compiler Explorer:1
imulq %rdi, %rdi
imulq %rsi, %rsi
leaq (%rsi,%rdi), %rax
retq
Let’s talk about that first instruction: imulq %rdi, %rdi. This
instruction performs signed integer
multiplication. The q
suffix tells us that it is operating on 64-bit quantities. (In
contrast, l, w, and b would denote 32-bit, 16-bit, and 8-bit,
respectively.) It multiplies the value in the first given register
(rdi; register names are prefixed with a % sign) by the value in
the second register and stores the result in that second
register. This is squaring v.x in our example C++ code.
The second instruction does the same with the value in %rsi, which
squares v.y.
Next, we have an odd instruction: leaq (%rsi,%rdi), %rax. lea
stands for “load effective address”, and it stores the address of the
first operand into the second operand. (%rsi, %rdi) means “the
memory location pointed to by %rsi + %rdi”, so this is just adding
%rsi and %rdi and storing the result in %rax. lea is a quirky
x86-specific instruction; on a more
RISC-y
architecture like ARM64, we would expect to see a plain old add
instruction.2
Finally, retq returns from the normSquared function.
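Putting it all together, here is the same listing again with comments of my own mapping each instruction back to the C++ source:

imulq %rdi, %rdi        # rdi = v.x * v.x
imulq %rsi, %rsi        # rsi = v.y * v.y
leaq (%rsi,%rdi), %rax  # rax = rdi + rsi, i.e., the return value
retq                    # return to the caller; the result is in rax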
Registers
Let’s take a brief detour to explain what the registers we saw in our
example are. Registers are the “variables” of assembly
language. Unlike your favorite programming language (probably), there
are a finite number of them, they have standardized names, and the
ones we’ll be talking about are at most 64 bits in size. Some of them
have specific uses that we’ll see later. I wouldn’t be able to rattle
this off from memory, but per
Wikipedia,
the full list3 of 16 registers on x86_64 is rax, rcx, rdx, rbx,
rsp, rbp, rsi, rdi, r8, r9, r10, r11, r12, r13,
r14, and r15.
Example 2: The stack
Now, let’s extend our example to debug print the Vec2 in normSquared:
#include <cstdint>

struct Vec2 {
  int64_t x;
  int64_t y;

  void debugPrint() const;
};

int64_t normSquared(Vec2 v) {
  v.debugPrint();
  return v.x * v.x + v.y * v.y;
}
and, again, let’s see the generated assembly:
subq $24, %rsp
movq %rdi, 8(%rsp)
movq %rsi, 16(%rsp)
leaq 8(%rsp), %rdi
callq Vec2::debugPrint() const
movq 8(%rsp), %rcx
movq 16(%rsp), %rax
imulq %rcx, %rcx
imulq %rax, %rax
addq %rcx, %rax
addq $24, %rsp
retq
In addition to the obvious call to Vec2::debugPrint() const, we have
some other new instructions and registers! %rsp is special: it is
the “stack pointer”, used to maintain the function call
stack. It points to the top of the stack, i.e., the lowest in-use address, since the stack grows “down” (toward lower addresses) on x86. So, our subq $24, %rsp instruction is making space for three
64-bit integers on the stack. (In general, setting up the stack and
registers at the start of your function is called the function
prologue.) Then, the
following two mov instructions store the first and second arguments to normSquared, which are v.x and v.y (more about how parameter passing works in the next blog post!), to the stack, effectively creating a copy of v in memory at the address %rsp + 8. Next, we
load the address of our copy of v into %rdi with leaq 8(%rsp), %rdi and then call Vec2::debugPrint() const.
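At this point, right before the callq, the stack looks roughly like this (my own sketch, higher addresses first):

%rsp + 24: return address pushed by our caller
%rsp + 16: copy of v.y (from movq %rsi, 16(%rsp))
%rsp + 8:  copy of v.x (from movq %rdi, 8(%rsp))
%rsp + 0:  unused padding; the extra 8 bytes keep %rsp 16-byte aligned for the callq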
After debugPrint has returned, we load v.x and v.y back into
%rcx and %rax. We have the same imulq and addq instructions as
before. Finally, we addq $24, %rsp to clean up the 24
bytes4 of stack space we allocated at the start of
our function (called the function
epilogue),
and then return to our caller with retq.
Example 3: Frame pointer and control flow
Now, let’s look at a different example. Suppose that we want to print an uppercased C string and we’d like to avoid heap allocations for smallish strings.5 We might write something like the following:
#include <cstdio>
#include <cstring>
#include <memory>

void copyUppercase(char *dest, const char *src);

constexpr size_t MAX_STACK_ARRAY_SIZE = 1024;

void printUpperCase(const char *s) {
  auto sSize = strlen(s);
  if (sSize <= MAX_STACK_ARRAY_SIZE) {
    char temp[sSize + 1];
    copyUppercase(temp, s);
    puts(temp);
  } else {
    // std::make_unique_for_overwrite is missing on Godbolt.
    std::unique_ptr<char[]> temp(new char[sSize + 1]);
    copyUppercase(temp.get(), s);
    puts(temp.get());
  }
}
Here is the generated assembly:6
printUpperCase(char const*): # @printUpperCase(char const*)
pushq %rbp
movq %rsp, %rbp
pushq %r15
pushq %r14
pushq %rbx
pushq %rax
movq %rdi, %r14
callq strlen
leaq 1(%rax), %rdi
cmpq $1024, %rax # imm = 0x400
ja .LBB0_2
movq %rsp, %r15
movq %rsp, %rbx
addq $15, %rdi
andq $-16, %rdi
subq %rdi, %rbx
movq %rbx, %rsp
movq %rbx, %rdi
movq %r14, %rsi
callq copyUppercase(char*, char const*)
movq %rbx, %rdi
callq puts
movq %r15, %rsp
leaq -24(%rbp), %rsp
popq %rbx
popq %r14
popq %r15
popq %rbp
retq
.LBB0_2:
callq operator new[](unsigned long)
movq %rax, %rbx
movq %rax, %rdi
movq %r14, %rsi
callq copyUppercase(char*, char const*)
movq %rbx, %rdi
callq puts
movq %rbx, %rdi
leaq -24(%rbp), %rsp
popq %rbx
popq %r14
popq %r15
popq %rbp
jmp operator delete[](void*) # TAILCALL
Our function prologue has gotten a lot longer, and we have some new control flow instructions as well. Let’s take a closer look at the prologue:
pushq %rbp
movq %rsp, %rbp
pushq %r15
pushq %r14
pushq %rbx
pushq %rax
movq %rdi, %r14
The pushq %rbp; movq %rsp, %rbp sequence is very common: it pushes the frame pointer stored in %rbp (our caller’s frame pointer) onto the stack and then saves the current stack pointer, which becomes our new frame pointer, in %rbp. The following four
pushq instructions store registers that we need to save before
using.7
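For later reference, here is a rough sketch of the frame this prologue builds, relative to the new frame pointer (it will make the -24(%rbp) in the epilogue less mysterious):

%rbp + 8:  return address pushed by our caller
%rbp + 0:  caller’s %rbp (pushq %rbp)
%rbp - 8:  saved %r15 (pushq %r15)
%rbp - 16: saved %r14 (pushq %r14)
%rbp - 24: saved %rbx (pushq %rbx)
%rbp - 32: scratch slot from pushq %rax, most likely just to keep the stack 16-byte aligned
(below this: the space carved out later for the variable-length array)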
On to the function body. We save our first argument (%rdi) in
%r14, because we are about to call strlen and it, like any other
function, may overwrite %rdi (we say %rdi is “caller-saved”) but
not %r14 (we say %r14 is “callee-saved”). We call strlen(s) with callq strlen; its return value comes back in %rax, so leaq 1(%rax), %rdi stores sSize + 1 in %rdi.
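That pattern, stashing a caller-saved value in a callee-saved register so it survives a call, is worth internalizing. Isolated from the listing above (comments are mine):

movq %rdi, %r14   # %rdi is caller-saved and may be clobbered by the call, so stash it
callq strlen      # free to overwrite %rdi, %rsi, %rax, and the other caller-saved registers
movq %r14, %rsi   # later: the original argument is still intact in callee-saved %r14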
Next, we finally see our first if statement! cmpq $1024, %rax sets
the flags register
according to the result of %rax - 1024, and then ja .LBB0_2 (“jump if above”, an unsigned comparison, appropriate because sSize is an unsigned size_t) transfers control to the location labeled .LBB0_2 if the flags indicate that %rax > 1024. In general, higher-level
control-flow primitives like if/else statements and loops are
implemented in assembly using conditional jump instructions.
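As a minimal illustration (a toy example of my own, not part of printUpperCase), a compiler might plausibly turn an unsigned comparison like the one above into something like this:

# uint64_t pickBuffer(uint64_t size) { return size <= 1024 ? 1 : 2; }
cmpq $1024, %rdi   # set flags from size - 1024
ja .Lheap          # "jump if above": unsigned size > 1024
movl $1, %eax      # size <= 1024: return 1
retq
.Lheap:
movl $2, %eax      # size > 1024: return 2
retq

The else branch becomes a separate block reached only by the conditional jump, just like .LBB0_2 in the listing above.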
Let’s first look at the path where %rax <= 1024 and thus the branch
to .LBB0_2 was not taken. We have a blob of instructions to create
char temp[sSize + 1] on the stack:
movq %rsp, %r15
movq %rsp, %rbx
addq $15, %rdi
andq $-16, %rdi
subq %rdi, %rbx
movq %rbx, %rsp
We save %rsp to %r15 and %rbx for later
use.8 Then, we add 15 to %rdi (which,
remember, contains the size of our array), mask off the lower 4 bits
with andq $-16, %rdi, and subtract the result from %rbx, which we
then put back into %rsp. In short, this rounds the array size up to
the next multiple of 16 bytes and makes space for it on the stack.
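To make the rounding concrete, here is the same computation written as a small C++ helper of my own, with a couple of spot checks:

#include <cassert>
#include <cstdint>

// Mirrors the addq $15 / andq $-16 pair: -16 is ~15 in two's complement,
// so and-ing with it clears the low 4 bits after we add 15.
uint64_t roundUpTo16(uint64_t size) {
  return (size + 15) & ~uint64_t{15};
}

int main() {
  assert(roundUpTo16(10) == 16);  // e.g., a 10-byte temp array gets 16 bytes
  assert(roundUpTo16(16) == 16);  // already a multiple of 16
  assert(roundUpTo16(17) == 32);
}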
The following block simply calls copyUppercase and puts as written in the code:
movq %rbx, %rdi
movq %r14, %rsi
callq copyUppercase(char*, char const*)
movq %rbx, %rdi
callq puts
Finally, we have our function epilogue:
movq %r15, %rsp
leaq -24(%rbp), %rsp
popq %rbx
popq %r14
popq %r15
popq %rbp
retq
We restore the stack pointer to deallocate our variable-length array: first from the copy we stashed in %r15, and then with leaq -24(%rbp), %rsp, which points %rsp back at the saved %rbx (skipping the scratch slot from pushq %rax). Then, we popq the registers we saved during the function prologue and return control to our caller, and we are done.
Next, let’s look at the path when %rax > 1024 and we branch to
.LBB0_2. This path is more straightforward. We call operator new[], save the result (returned in %rax) to %rbx, and call
copyUppercase and puts as before. We have a separate function
epilogue for this case, and it looks a bit different:
movq %rbx, %rdi
leaq -24(%rbp), %rsp
popq %rbx
popq %r14
popq %r15
popq %rbp
jmp operator delete[](void*) # TAILCALL
The first mov sets up %rdi with a pointer to our heap-allocated
array that we saved earlier. As with the other function epilogue, we
then restore the stack pointer and pop our saved registers. Finally,
we have a new instruction: jmp operator delete[](void *). jmp is
just like goto: it transfers control to the given label or
function. Unlike callq, it does not push the return address onto the
stack. So, when operator delete[] returns, it will instead transfer
control to printUpperCase’s caller. In essence, we’ve combined a
callq to operator delete[] with our own retq. This is called tail
call optimization, hence the
# TAILCALL comment helpfully emitted by the compiler.
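Tail calls are easier to see in isolation. In a hypothetical example like this (the names are mine), an optimizing compiler will typically emit a single jmp g in place of callq g; retq:

void g(long x);  // defined elsewhere

void f(long x) {
  g(x);  // the last thing f does, so g can return directly to f's caller
}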
Practical application: catching surprising conversions
I said in the introduction that reading assembly makes implicit copy and destroy operations abundantly clear. We saw some of that in our previous example, but I want to close by looking at a common C++ move semantics debate.

Is it OK to take parameters by value in order to avoid having one overload for lvalue references and another overload for rvalue references? There is a school of thought that says “yes, because in the lvalue case you will make a copy anyway, and in the rvalue case it’s fine as long as your type is cheap to move”. If we take a look at an example for the rvalue case, we will see that “cheap to move” does not mean “free to move”, as much as we might prefer otherwise. If we want maximum performance, we can demonstrate that the overload solution will get us there and the by-value solution will not. (Of course, if we aren’t willing to write extra code to improve performance, then “cheap to move” is probably cheap enough.)
#include <string>

class MyString {
  std::string str;

 public:
  explicit MyString(const std::string& s);
  explicit MyString(std::string&& s);
};

class MyOtherString {
  std::string str;

 public:
  explicit MyOtherString(std::string s);
};

void createRvalue1(std::string&& s) {
  MyString s2(std::move(s));
}

void createRvalue2(std::string&& s) {
  MyOtherString s2(std::move(s));
}
If we look at the generated
assembly9
(which is too long to include even though I’ve intentionally
outlined10 the constructors in question), we can
see that createRvalue1 does 1 move operation (inside the body of
MyString::MyString(std::string&&)) and 1 std::string::~string()
call (the operator delete before returning). In contrast,
createRvalue2 is much longer: it does a total of 2 move operations
(1 inline, into the s parameter for
MyOtherString::MyOtherString(std::string s), and 1 in the body of
that same constructor) and 2 std::string::~string calls (1 for the
aforementioned s parameter and 1 for the MyOtherString::str
member). To be fair, moving std::string is cheap and so is
destroying a moved-from std::string, but it is not free in terms of
either CPU time or code size.
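To see where the extra work comes from, it helps to imagine what the outlined constructor definitions presumably look like; this is a sketch of mine, since the actual definitions are not shown:

// Hypothetical definitions matching the declarations above.
MyString::MyString(std::string&& s) : str(std::move(s)) {}
// One move total: straight from the caller's string into the str member.

MyOtherString::MyOtherString(std::string s) : str(std::move(s)) {}
// Two moves total: one at the call site to construct the by-value parameter s,
// and one here from s into the str member, plus a destructor call for s.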
Further reading
Assembly language dates back to the late 1940s, so there are plenty of resources for learning about it. Personally, my first introduction to assembly language was in the EECS 370: Introduction to Computer Organization junior-level course at my alma mater, the University of Michigan. Unfortunately, most of the course materials linked on that website are not public. Here are what appear to be the corresponding “how computers really work” courses at Berkeley (CS 61C), Carnegie Mellon (15-213), Stanford (CS107), and MIT (6.004). (Please let me know if I’ve suggested the wrong course for any of these schools!) Nand to Tetris also appears to cover similar material, and the projects and book chapters are freely available.
My first practical exposure to x86 assembly in particular was in the context of security exploits, or learning to become a “l33t h4x0r”, as the kids used to say. If this strikes you as a more entertaining reason to learn about assembly, great! The classic work in the space is Smashing the Stack for Fun and Profit. Unfortunately, modern security mitigations complicate running the examples in that article on your own, so I recommend finding a more modern practice environment. Microcorruption is an industry-created example, or you could try finding an application security project from a college security course to follow along with (e.g., Project 1 from Berkeley’s CS 161, which seems to be publicly available currently).
Finally, there is always Google and Hacker News. Pat Shaughnessy’s “Learning to Read x86 Assembly Language” from 2016 covers the topic from the perspective of Ruby and Crystal, and there was also a recent (2020) discussion on how to learn x86_64 assembly.
Good luck, and happy hacking!