October 29th, 2025
In my Advanced
Programming in the UNIX Environment class, we
discuss the layout of a Unix process in memory with
the aid of a diagram like the one to the right,
illustrating the location of the different segments.
You've no doubt seen similar ones before.
During our discussion, a question relating to multi-threaded applications came up. In such an application, each thread gets its own stack, but within the same process space as the main program. And so the question arose as to the placement of the stacks in a multi-threaded application, and whether or not those would be located at predictable offsets.
Now normally, I'd answer questions about threads with a link to shouldiusethreads.com, but I figured "What the hell, let's explore this." My initial take was that, assuming the use of Address Space Layout Randomization (ASLR), the location of each thread's stack ought to be non-predictable. But of course the answer is never quite that easy.
Locating thread stacks
Like we've
done before, let's print the addresses of a local
variable to estimate the location of the function
frame. Since each thread still runs within the main
process's space, we expect those stacks to be below
main.
$ uname -mrsp
NetBSD 10.1 evbarm aarch64
$ clang -Wall -Wextra thread-stacks1.c -lpthread -lm
$ ./a.out
argc at 0xfffffffc4b58
argv at 0xfffffffc4b50
envp at 0xfffffffc4b48
main at 0xfffffffc4b28
Guard size: 65536
Stack size: 8388608
Thread 1 is at 0xf0ef84bdffc0. Thread 1 stack size: 8388608.
Thread 0 is at 0xf0ef85bfffc0. Thread 0 stack size: 8388608.
Thread 2 is at 0xf0ef853effc0. Thread 2 stack size: 8388608.
Stack address differences between threads:
Thread 0 - Thread 1: 8454144 bytes
Thread 1 - Thread 2: 8454144 bytes
$
Running this multiple times in succession on a
NetBSD/aarch64 10.1 system, this seems to show us that
while the initial thread is placed at an unpredictable
location, each subsequent thread always gets placed at
a fixed offset below the previous thread.
That offset can be calculated as the sum of the stack
size (see ulimit -s / RLIMIT_STACK; here: 8388608) and the size of the stack
guard page (see sysctl
vm.thread_guard_size; here: 65536), adding
up to 8454144 in this example.
Comparing to other OS, we find:
- On a Linux/x86_64 5.15.0 system, this looks very similar, with the predictable offset also equal to the sum of the stack size and the guard page size; on a different Linux/x86_64 6.8.0 system, the offset is larger than that, but still a fixed size.[1]
- macOS 15.7.1 shows a fixed offset as well (although it looks like there the thread stack size is 0.5MB, i.e., smaller than RLIMIT_STACK), and the thread stacks are located above the main stack.
- OpenBSD 7.7 and OmniOS r151044 appear to apply ASLR for this placement as well and locate the thread stacks at unpredictable, inconsistent offsets from one another.
- FreeBSD 15.0 BETA places the thread stacks at unpredictable offsets from one another, but, like macOS, above the main stack.
These layout choices can then be visualized as shown below:
If you've noticed that some of the addresses there
don't add up quite as neatly as the illustration on
the right shows, well, we glossed over a few details
here to illustrate the point. Depending on your OS,
you can inspect all the details via, e.g., /proc/self/maps or (on macOS) use
vmmap(1).
Thread Local Storage
Looking at the placement of the thread stacks, another question that arises is where each thread saves its registers (stack pointers, program counters, link registers, what have you). Due to context switching, all the details of the thread must be saved somewhere, and we might have any number of threads at runtime; but as we saw above, sometimes the thread stacks sit right below one another -- so where do those go?
To investigate this, we are looking for the Thread
Control Block (TCB), which in turn is dynamically
allocated with the Thread
Local Storage (TLS) at thread creation time
together with the space for the stack. The address of
the TLS (no, not that
one) is found in the thread pointer register
(%fs on x86_64,
tpidr_el0 on arm64);
if you want to inspect it, you'll have to inline the
assembly call to pull the address from the register.
(Combined with the use of some non-portable
functions to get a thread's address, the resulting code shows why writing
cross-platform code using POSIX threads is not much
fun.)
Let's again see what this looks like on the different platforms:
It may be easier for you to view the relative
placement of each element by running the command as
./a.out | sort -r -k2 -t: | grep
0x.
Observations of interest:
- NetBSD appears to place four TCBs underneath one another in one location, then place another four at a randomized location below that, then place the first thread at a randomized location below all TCBs, then place all subsequent threads underneath that, one next to the other.
- Linux places the TCB into the thread's stack, shortly below the thread stack's high address, above the thread's argument and the local variables.
- macOS places the TCB at 224 bytes above the thread stack's high address.
- OpenBSD completely randomizes the placement of both the TCBs as well as the thread stacks, with some TCBs and some thread stacks ending up above the heap, and some below.
- FreeBSD (on arm64, anyway) places the heap way above the main stack, the TCBs growing upwards below the heap, and the thread stacks below that, but still above the main stack.
- OmniOS places all TCBs underneath one another, but then randomizes the placement of each thread's stack. In addition, OmniOS may reuse a given thread's stack location if the thread that was placed there first has already terminated by the time another thread is spawned.
What else?
As we've seen here, the layout of virtual memory for a given process really can be quite a bit different from the simplified illustrations we use when explaining core concepts. It's useful to periodically be reminded of the fact that, as so often in Computer Science, we're dealing with layers of abstraction, and that the implementations of such abstractions may well vary from operating system to operating system.
While digging into all of this, I noticed a few other angles worth investigating and explaining, including the way arguments are passed and how to better understand shared memory. But this blog post is already too long, so I'll get back to those topics another time.
Footnotes:
[1] I'm not sure what causes the
observed difference; it could have to do with a change
in the stack_guard_gap between the two
kernel versions, or with the use of a shadow
stack in the newer kernel, or simply with some
alignment of the stacks, but honestly, I'm really just
guessing here.