October 29th, 2025
In my Advanced
Programming in the UNIX Environment class, we
discuss the layout of a Unix process in memory with
the aid of a diagram like the one to the right,
illustrating the location of the different segments.
You've no doubt seen similar ones before.
During our discussion, a question relating to multi-threaded applications came up. In such an application, each thread gets its own stack, but within the same process space as the main program. And so the question arose as to the placement of the stacks in a multi-threaded application, and whether or not those would be located at predictable offsets.
Now normally, I'd answer questions about threads with a link to shouldiusethreads.com, but I figured "What the hell, let's explore this." My initial take was that, assuming the use of Address Space Layout Randomization (ASLR), the location of each thread's stack ought to be non-predictable. But of course the answer is never quite that easy.
Locating thread stacks
Like we've
done before, let's print the addresses of a local
variable to estimate the location of the function
frame. Since each thread still runs within the main
process's space, we expect those stacks to be below
main.
$ uname -mrsp
NetBSD 10.1 evbarm aarch64
$ clang -Wall -Wextra thread-stacks1.c -lpthread -lm
$ ./a.out
argc at 0xfffffffc4b58
argv at 0xfffffffc4b50
envp at 0xfffffffc4b48
main at 0xfffffffc4b28
Guard size: 65536
Stack size: 8388608
Thread 1 is at 0xf0ef84bdffc0. Thread 1 stack size: 8388608.
Thread 0 is at 0xf0ef85bfffc0. Thread 0 stack size: 8388608.
Thread 2 is at 0xf0ef853effc0. Thread 2 stack size: 8388608.
Stack address differences between threads:
Thread 0 - Thread 1: 8454144 bytes
Thread 1 - Thread 2: 8454144 bytes
$
Running this multiple times in succession on a
NetBSD/aarch64 10.1 system, this seems to show us that
while the initial thread is placed at an unpredictable
location, each subsequent thread always gets placed at
a fixed offset below the previous thread.
That offset can be calculated as the sum of the stack
size (see ulimit -s / RLIMIT_STACK; here: 8388608) and the size of the stack
guard page (see sysctl
vm.thread_guard_size; here: 65536), adding
up to 8454144 in this example.
Comparing to other OS, we find:
- On a Linux/x86_64 5.15.0 system, this looks very similar, with the predictable offset also equal to the sum of the stack size and the guard page size; on a different Linux/x86_64 6.8.0 system, the offset is larger than that, but still a fixed size.[1]
- macOS 15.7.1 shows a fixed offset as well (although it looks like there the thread stack size is 0.5MB, i.e., smaller than RLIMIT_STACK), and the thread stacks are located above the main stack.
- OpenBSD 7.7 and OmniOS r151044 appear to apply ASLR for this placement as well and locate the thread stacks at unpredictable, inconsistent offsets from one another.
- FreeBSD 15.0 BETA places the thread stacks at unpredictable offsets from one another, but, like macOS, above the main stack.
These layout choices can then be visualized as shown below:
If you've noticed that some of the addresses there
don't add up quite as neatly as the illustration on
the right shows, well, we glossed over a few details
here to illustrate the point. Depending on your OS,
you can inspect all the details via, e.g., /proc/self/maps or (on macOS) use
vmmap(1).
Thread Local Storage
Looking at the placement of the thread stacks, another question that arises is where each thread saves its registers (stack pointers, program counters, link registers, what have you). Due to context switching, all the details of the thread must be saved somewhere, and we might have any number of threads at runtime; but as we saw above, sometimes the thread stacks sit right below one another -- so where do those go?
To investigate this, we are looking for the Thread
Control Block (TCB), which in turn is dynamically
allocated with the Thread
Local Storage (TLS) at thread creation time
together with the space for the stack. The address of
the TLS (no, not that
one) is found in the thread pointer register
(%fs on x86_64,
tpidr_el0 on arm64);
if you want to inspect it, you'll have to inline the
assembly call to pull the address from the register.
(Combined with the use of some non-portable
functions to get a thread's address, the resulting code shows why writing
cross-platform code using POSIX threads is not much
fun.)
Let's again see what this looks like on the different platforms:
It may be easier for you to view the relative
placement of each element by running the command as
./a.out | sort -r -k2 -t: | grep
0x.
Observations of interest:
- NetBSD appears to place four TCBs underneath one another in one location, then place another four at a randomized location below that, then place the first thread at a randomized location below all TCBs, then place all subsequent threads underneath that, one next to the other.
- Linux places the TCB into the thread's stack, shortly below the thread stack's high address, above the thread's argument and the local variables.
- macOS places the TCB at 224 bytes above the thread stack's high address.
- OpenBSD completely randomizes the placement of both the TCBs as well as the thread stacks, with some TCBs and some thread stacks ending up above the heap, and some below.
- FreeBSD (on arm64, anyway) places the heap way above the main stack, the TCBs growing upwards below the heap, and the thread stacks below that, but still above the main stack.
- OmniOS places all TCBs underneath one another, but then randomizes the placement of each thread's stack. In addition, OmniOS may reuse a given thread's stack location if the thread that was placed there first has already terminated by the time another thread is spawned.
What else?
As we've seen here, the layout of virtual memory for a given process really can be quite a bit different from the simplified illustrations we use when explaining core concepts. It's useful to periodically be reminded of the fact that, as so often in Computer Science, we're dealing with layers of abstraction, and that the implementations of such abstractions may well vary from operating system to operating system.
While digging into all of this, I noticed a few other angles worth investigating and explaining, including the way arguments are passed and how to better understand shared memory. But this blog post is already too long, so I'll get back to those topics another time.
Footnotes:
[1] I'm not sure what causes the
observed difference; it could have to do with a change
in the stack_guard_gap between the two
kernel versions, or with the use of a shadow
stack in the newer kernel, or simply with some
alignment of the stacks, but honestly, I'm really just
guessing here.