JIT Compilers and Cache Coherency


Cache coherency is notoriously tricky to reason about, and even trickier to debug. With that challenge in mind, we’ll take a deep dive into the cache coherency considerations in the design and development of a JIT compiler I’ve been working on.

The JIT compiler targets the Brainfuck programming language, which is intentionally minimal, letting me focus on the interesting aspects of implementing it, such as cache coherency and compiler design. Brainfuck has a very simple instruction set and memory model, consisting of just eight different character inputs. The full specification and some background can be found on its Wikipedia page: https://en.wikipedia.org/wiki/Brainfuck.

The primary goal of creating my own JIT compiler is to learn by doing, failing, debugging by hand, and through that develop intuition and deepen my understanding through experience.

Background

My implementation, dubbed bfjit, is an optimizing interpreter and Just-In-Time (JIT) compiler, with an Ahead-Of-Time (AOT) mode and support for tiered compilation. It supports both the x86 and AArch64 platforms, and can emit native code for loops in JIT mode, or for the entire program in AOT mode.

In JIT mode, execution starts in the interpreter, which means instructions can begin executing quickly. When loops become “hot” enough, a compilation request is sent to a concurrent thread that compiles the loop asynchronously into a compiled method containing native instructions. Once the loop is compiled, the compiler thread installs a reference to the compiled method, which the interpreter picks up and transfers execution to. This installation step, as we’ll see, is where cache coherency becomes critical.

In AOT mode, the same compilation pipeline is used, but code is generated for the entire program synchronously up front before execution begins.

To bridge the interpreter and JIT compiler, Brainfuck instructions are parsed into an intermediate representation, implemented as a simple AST. This representation is shared between the interpreter and JIT, with the JIT emitting instructions for each node in the AST. This allows execution to move between interpreted and compiled code with minimal state transition. The intermediate representation also makes it straightforward to implement optimizations by applying passes over the AST.

Development and Testing

Initially, bfjit only implemented an interpreter, providing a known baseline for validating correctness and performance while developing the JIT compiler. Since my development machine runs Linux x86_64, I started with the x86 implementation.

From there, I refactored the code to support multiple platforms by adding platform detection and code selection in the Makefile, allowing me to move on to the AArch64 implementation. To run AArch64 code on my local machine, I used QEMU, specifically user-mode emulation via the qemu-aarch64 binary. User-mode emulation translates instructions for the target platform into native instructions. As a result, execution still occurs on the x86 host, and AArch64-specific cache behavior is therefore never observed.

With both the x86 and AArch64 implementations complete, I tested them for correctness and performance using a Brainfuck program that renders a Mandelbrot set in the terminal. Everything behaved as expected when running on my Linux x86_64 machine, both natively and when cross-running via qemu-aarch64. The JIT compiler achieved roughly a 98% speedup over the interpreter, with some room left for further optimization.

From my experience working on the OpenJDK HotSpot implementation, I knew that the memory models and overall design differ significantly between x86 and AArch64, so I was eager to test the AArch64 implementation on real hardware.

All of my machines at home are x86_64-based, either Intel or AMD, with the exception of my phone, a Samsung S22+ running an Exynos 2200 based on the ARMv9.0-A architecture. After setting up SSH, installing a C++ compiler, and compiling bfjit via Termux, everything seemed to be working. However, roughly every tenth run, the program would crash with a SIGILL:

$ ./bfjit "$(cat programs/mandelbrot.b)"
Illegal instruction

Illegal Instruction, What?

The “Illegal instruction” message alone doesn’t tell us much, so I installed a signal handler to catch SIGILL and print some information about it, to get more insight into what’s going on.

#include <signal.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <ucontext.h>

static void sigill_handler(int sig, siginfo_t *info, void *ctx) {
  ucontext_t *uc = (ucontext_t *)ctx;
  // uc_mcontext.pc is the AArch64 Linux layout of the machine context
  printf("SIGILL at PC: %p, fault addr: %p\n",
         (void *)uc->uc_mcontext.pc,
         info->si_addr);

  // Print the word at the faulting PC so we can see what was fetched.
  // Note that this read goes through the *data* cache.
  uint32_t *pc = (uint32_t *)uc->uc_mcontext.pc;
  printf("Fetched instruction word: 0x%08x\n", *pc);

  exit(1);
}

Running bfjit again with the signal handler installed, and in debug mode so that we get info on compiled methods, we see the following:

$ ./bfjit --debug "$(cat programs/mandelbrot.b)"
...
Compiled method 0x72bc417120 size 572:
e1 03 00 aa 20 00 40 39
60 00 00 35 e0 03 01 aa
...
SIGILL at PC: 0x72bc417120, fault addr: 0x72bc417120
Fetched instruction word: 0xaa0003e1

Disassembling the observed instruction word, we see an orr, which I know is what the “mov (register)” instruction is aliased to. This is entirely expected, since the first instruction in any compiled method moves the first and only argument of the compiled method from x0 into x1. But this instruction doesn’t seem to be “illegal”, so what’s going on here?

0: aa0003e1 orr x1, xzr, x0, lsl #0

; Which is actually
mov x1, x0

Cache (In)Coherency

The problem is cache coherency! What we’ve observed above is a real-world example of incoherent data and instruction caches.

On most modern processors, the L1 cache is split into two separate caches, one for data and one for instructions, often referred to as L1d (data cache) and L1i (instruction cache). This split design is known as a (Modified) Harvard Architecture. The boundary where the data and instruction caches are guaranteed to observe the same memory contents (on a specific core) is called the Point of Unification, which is often the L2 cache.

Image of cache hierarchy in AArch64 showcasing separated instruction (L1i) and data (L1d) caches below a unified L2 cache and above the CPU

On an x86 system, the hardware guarantees that the data and instruction caches are coherent, requiring no intervention from software. However, on AArch64 systems, cache coherency between data and instruction caches is not guaranteed in hardware and might require software intervention.

Incoherent caches are primarily problematic when dealing with self-modifying code. For example, a JIT compiler writing instructions to memory that will later be executed is self-modifying code.

The problem is that instructions are fetched from memory and stored in the instruction cache. When the JIT compiler emits instructions by writing them to memory, they are buffered in the data cache. This creates two different views of the instructions: one stored in the instruction cache and another stored in the data cache. When the CPU continues with normal execution, it may execute a stale instruction fetched from the instruction cache, where the cache line has not yet been updated to reflect the newly written memory.

This explains why we observed a SIGILL above. It also explains why, even though we executed an “illegal” instruction, we see a valid instruction when reading the contents in memory, since the read fetches memory from the data cache, not the instruction cache.

Ensure Cache Coherency

To fix this, we need a way to make sure that when we write new instructions to memory (e.g., from the JIT compiler), the instruction cache also observes those writes, so that the CPU executes the new instructions. In C/C++, this is straightforward to achieve using GCC’s __builtin___clear_cache(), which flushes the cache(s) for a region of memory.

On systems that do not need cache flushes, __builtin___clear_cache() does nothing. See gcc/builtins.cc, where CLEAR_INSN_CACHE is undefined for x86 and defined for AArch64.

AArch64 Cache Coherency

As we’ve established by now, AArch64 requires more work to achieve cache coherency. To fully explain this, we’ll approach each step of GCC’s implementation of __builtin___clear_cache(), which can be found at libgcc/config/aarch64/sync-cache.c. It starts by getting the contents of the system register “CTR_EL0”, which: “provides information about the architecture of the caches” (see ARM documentation).

“CTR” stands for “Cache Type Register” and “EL0” for “exception level 0”, which is the lowest exception level for unprivileged operations, used by user-space programs.

ARM provides a good visual of the CTR_EL0 register in their documentation, which shows the fieldset of the register. What’s interesting to us are four things: the DIC bit, the IDC bit, DminLine and IminLine:

  • DIC: Instruction cache invalidation requirements for data to instruction coherence. This bit tells us if we need to, via software, invalidate the instruction cache to have data to instruction coherence. A 1 tells us this is handled in hardware and no software support is needed.
  • IDC: Data cache clean requirements for instruction to data coherence. This bit tells us if we need to, via software, clean the data cache to have instruction to data coherence. A 1 tells us this is handled in hardware and no software support is needed. There are some exceptions to this which are listed clearly in the documentation, which we’ll ignore here.
  • DminLine: The size of the (smallest) cache line of all data caches.
  • IminLine: The size of the (smallest) cache line of all instruction caches.

Image of the “fieldset” of AArch64’s CTR_EL0 register

As a side note, while working with this I tend to mix up what the IDC and DIC bits refer to, since IDC starts with an I but is for cleaning the data cache, not the instruction cache. As a rule of thumb:

  • IDC = Instruction to Data Coherence. About cleaning the data cache so the instruction cache can see it.
  • DIC = Data to Instruction Coherence. About invalidating the instruction cache so it can see what’s inside the data cache.

Assembly Instructions for Cache Coherency

With information from the CTR_EL0 register we can figure out how to proceed to make sure that the data and instruction caches are coherent. Let’s assume that the hardware has no support for cache coherency, so we must use software to achieve coherency. The steps are:

  1. Cleaning the data cache (IDC):

    Cleaning the data cache means cleaning it to the Point of Unification, pushing the contents of L1d to L2, where L2 is the Point of Unification. This is achieved with the following assembly code, where dc cvau is executed for each data cache line in the memory we want to clean:

    dc cvau, <address>
    dsb ish
    

    After dc cvau has been executed for all affected cache lines we execute dsb ish once. The dsb ish instruction is a “data synchronization barrier” (where “ish” stands for “inner shareable domain”, see ARM’s “Memory Attributes” for more information). This instruction makes sure that by the time it is finished, all data and instruction cache operations are complete (and also branch predictor and TLB maintenance). This makes sure all dc cvau instructions that were just executed are guaranteed to have finished when we continue.

  2. Invalidate the instruction cache (DIC):

    Invalidating the instruction cache also means invalidating it to the Point of Unification, essentially throwing everything away until the point where the data and instruction caches are unified. This is achieved with the following assembly code, where ic ivau is executed for each instruction cache line in the memory we want to invalidate.

    ic ivau, <address>
    dsb ish
    

    Similarly here, we execute dsb ish once to make sure that the instruction cache invalidation(s) are guaranteed to be complete.

  3. Instruction Synchronization Barrier (ISB):

    After the data and instruction caches are known to be cleaned and invalidated, we execute an isb instruction, which is an “instruction synchronization barrier”. The isb flushes the pipeline in the processor, so that all instructions following it are fetched from cache or memory.

    Regardless of whether software intervention is required for data cache cleaning or instruction cache invalidation, an isb must be executed to make sure that new instructions really are fetched from cache or memory.

With the cache cleaning and invalidation done and having executed the instruction synchronization barrier, all instructions are refetched from the L2 cache, which is now guaranteed to contain the up-to-date instructions that the JIT compiler has emitted.

Hardware Support

Let’s check the state of the DIC and IDC bits in the CTR_EL0 register on my Exynos 2200 processor with a small C program.

#include <stdio.h>
#include <stdint.h>

int main() {
  uint64_t ctr_el0 = 0;
  __asm__ ("mrs %0, ctr_el0" : "=r" (ctr_el0));
  printf("DIC: %d\n", (int)((ctr_el0 >> 29) & 1));
  printf("IDC: %d\n", (int)((ctr_el0 >> 28) & 1));
  return 0;
}
$ ./ctr_el0_printer
DIC: 0
IDC: 1

From the output we can determine that the Exynos 2200 processor handles instruction to data cache coherency in hardware, indicated by the IDC bit being set to 1. This means there is no need to execute dc cvau to clean the affected data cache lines in software. In this case, we only need ic ivau to invalidate the affected instruction cache lines, and as always (regardless of DIC/IDC status) to end with an isb to flush the CPU pipeline. But even with hardware support, things can go wrong.

Hardware Bugs (Neoverse N1)

One exceptionally interesting example of a hardware bug is the Neoverse N1 Errata 1542419.

The Neoverse N1 CPU is designed with hardware support for automatic data and instruction cache coherency, meaning both DIC and IDC should be set to 1. However, in rare cases the caches might not be coherent, which can lead to exactly the kind of SIGILL we observed earlier, even when the user has done everything right.

The mitigation for the erratum consists of two steps and requires cooperation from both software and firmware.

  1. “Hide” the fact that the CPU announces automatic cache coherency by setting the DIC bit to 0, signaling to user-space programs that software intervention is required to have data to instruction cache coherency. Let’s look at how this is done in the Linux kernel.

    Access to the CTR_EL0 register is trapped if the CPU is affected by the erratum (arch/arm64/kernel/cpu_errata.c):

    /* ... or if the system is affected by an erratum */
    if (cap->capability == ARM64_WORKAROUND_1542419)
           enable_uct_trap = true;
    

    In the trap handler, the DIC bit is “hidden” by setting it to 0, and the instruction cache line size is increased so that fewer ic ivau instructions are executed (arch/arm64/kernel/traps.c):

    if (cpus_have_final_cap(ARM64_WORKAROUND_1542419)) {
        /* Hide DIC so that we can trap the unnecessary maintenance...*/
        val &= ~BIT(CTR_EL0_DIC_SHIFT);
    
        /* ... and fake IminLine to reduce the number of traps. */
        val &= ~CTR_EL0_IminLine_MASK;
        val |= (PAGE_SHIFT - 2) & CTR_EL0_IminLine_MASK;
    }
    
  2. In firmware, trap the ic ivau instruction to perform a much more complete (and expensive) instruction cleanse. Firmware implementations are vendor-specific, but ARM has provided a fix to TrustedFirmware that we can reference: https://github.com/ARM-software/arm-trusted-firmware/commit/8094262.

    Instead of executing ic ivau, the trap handler performs a global TLB invalidation with the tlbi vae3is, xzr instruction.

    func neoverse_n1_errata_ic_trap_handler
        cmp   x1, #NEOVERSE_N1_EC_IC_TRAP
        b.ne  1f
        tlbi  vae3is, xzr
        dsb   sy
    

    As mentioned earlier, instruction cache coherency is handled in hardware on Neoverse N1 (DIC bit set to 1). However, due to the erratum, stale instruction fetch state can still survive across isb execution. ic ivau is normally unnecessary on this CPU, and even if you try to use it, it does not address the issue. To solve this, we have to resort to a much bigger hammer: tlbi vae3is, which forces the CPU to refetch instructions from a clean slate.

If you’re interested in knowing how a bug like this might affect performance in real-world applications, I recommend the blog post “It’s Never a Hardware Bug, Until it is.” by Charles Connell, which details how this bug affects Java performance negatively.

Conclusion and Summary

Coming back to bfjit: adding a call to __builtin___clear_cache() before installing a compiled method solves the SIGILL issue we observed earlier in the post. The instructions that were written to memory are now visible to the CPU when instructions are fetched from coherent data and instruction caches.

We’ve learned that on AArch64, JIT compilers must explicitly synchronize data and instruction caches. This requires a bit of work if you want to do it step-by-step yourself, but using GCC’s __builtin___clear_cache() is a portable and easy alternative.

This investigation opened up a real rabbit hole for me; investigating and learning more about caches and cache coherency has been really fun. Personally, caches have always been somewhat abstract, and getting hands-on experience in how caches work has been very valuable. It’s also exciting to learn more about how my own hardware works, especially since I can test things out and see differences across platforms and architectures.

I must admit that cache coherency is a tricky subject, and one that is not immediately obvious as the culprit when encountering an illegal instruction signal. Now that we’re at the end, I hope that you’ve learned something new or drawn a connection that you wouldn’t have otherwise. Thank you for reading and I hope you found this deep dive as interesting as I did!