A Tale of Three WebAssembly Runtimes


WebAssembly (Wasm) is a great little bytecode format. It’s easy to produce, it’s easy to parse, it’s easy to execute, and it’s easy to explain. I like to recommend Wasm interpreters as a fun way to learn the basics of language runtimes1. In practicing what I preach, I’ve ended up making not just one runtime, but three! More specifically, it ends up being something along the lines of “three modes of execution within one runtime”, since they share everything except execution logic, but that doesn’t have as much of a ring to it. This blog post will go into some fun bits I’ve picked out of the three of them. I’ll also go into some performance comparisons between the three and some more established runtimes towards the end.

Note: All code can be found on GitHub, and everything is built for an M1 MacBook. A few of the design choices throughout are dependent on Arm64, but nothing is strictly tied to any specific architecture.

What constitutes a WebAssembly runtime?

An abstract definition for a language runtime might be “a program that allows you to run another program written in some language”. In our case, a WebAssembly runtime is an executable that lets you run a WebAssembly program. Often, Wasm runtimes include some extra bits. For example, most implement some amount of WASI, which provides a syscall layer for Wasm.

The Wasm spec requires an ahead-of-time validation step (which almost all runtimes2 comply with), which is a more interesting extra bit for me. Since all these runtimes require an O(n) processing step off the bat, we can shoehorn a bit of extra work in there, and as long as it doesn’t take much time, it won’t be a super big deal. The three methods of execution all take advantage of this validation step, to varying degrees.

Validation

Wasm binary validation has a few stages, but the main one we’re interested in is looking at function bodies. It’s the most granular, and takes up the majority of the time we spend validating. Validating function bodies is a relatively simple abstract interpretation step, wherein we track the types of the expressions in the Wasm program, check integer constants are in range, etc.

Some quick example functions to get a sense of what valid Wasm looks like:

(func $foobar (param i32 i32) (result i32)
  (local f32)

  i32.const 0
  i32.const 0
  i32.add ;; i32.add pops two i32s, pushes their sum
  local.set 0

  i32.const 1
  drop ;; drop pops a value off the stack and does nothing else

  block
    local.get 3
    drop
  end
)

and what invalid Wasm looks like:

(func $foobar (param i32 i32) (result i32)
  (local f32)

  i32.const 0
  i32.const 0
  f32.add ;; f32.add expects two f32s on the stack

  i32.const 10000000000000000000000 ;; out of range constant

  end ;; no corresponding starting block/loop/if

  local.get 3 ;; no corresponding local (locals 0 and 1 are params, local 2 is the f32 declared on the second line)
)

As mentioned above, since this is abstract interpretation, and each instruction requires a somewhat bespoke validation process, it lends itself very nicely to the same strategies that make a regular interpreter fast. As such, I ended up using a tail-call interpreter style for validation. Below you can see the handler for the call instruction.

HANDLER(call) {
    auto fn_idx = safe_read_leb128<uint32_t>(iter);
    ensure(fn_idx < mod.functions.size(), "unknown function");

    auto &func = mod.functions[fn_idx];
    stack.apply(func.type);

    _(call, func, fn_idx);
    nextop();
}

To anyone who’s worked on an interpreter before, this is probably pretty familiar. The HANDLER macro sets up a function named call with a standardized signature, and nextop is a macro that reads the next byte in the Wasm function and tail-calls to the next handler. The _ macro is a bit more of a stranger, but this is where the validation stage processing comes into play. All these handlers are actually templated on a certain Target (provided by whichever method of execution we chose to use), and the validator forwards any relevant information to this Target. This allows Target to do its own processing, in any way it wants.

#define HANDLER(name)                                                          \
    template <typename Target>                                                 \
    std::byte *validate_##name(Module &mod, safe_byte_iterator &iter,          \
                               FunctionShell &fn, WasmStack &stack,            \
                               std::vector<ControlFlow> &control_stack,        \
                               std::byte *code, Target &jit)

#define nextop()                                                               \
    do {                                                                       \
        auto byte = *iter++;                                                   \
        [[clang::musttail]] return funcs<Target>[byte](                        \
            mod, iter, fn, stack, control_stack, code, jit);                   \
    } while (0)

#define _(name, ...) jit.name(code, stack __VA_OPT__(, ) __VA_ARGS__)

Non-control flow instruction validation doesn’t need as much bespoke logic, since they’re typically just popping and pushing values. In theory, this could be done very, very efficiently; for example, i32.add pops two i32 from the stack, and pushes one i32. In a perfect world, the validator would only have to do a few operations: load the top two 8-bit type labels on the stack, check if they’re i32, jump to some failure cold path if they aren’t, and decrement the stack by 1 label if they are. Since we know we have two i32 already, decrementing is equivalent to popping both and pushing a new one on. We can even skip a lower bound check by putting always-unequal types on the stack at the bottom, so any comparison will fail.

#include <cstddef>

enum class type {
    i32 = 0x7f, i64 = 0x7e, f32 = 0x7d, f64 = 0x7c,
};

void next_validator(type *stack);
[[noreturn]] void fail();

void validate_i32add(type *stack) {
    type t1 = stack[-1], t2 = stack[-2];
    if (t1 != type::i32 || t2 != type::i32) [[unlikely]] {
        fail();
    }

    [[clang::musttail]] return next_validator(stack - 1);
}

Unfortunately, the Wasm spec has an interesting little concept called “stack polymorphism”. The stack becomes polymorphic after an instruction that unconditionally transfers control elsewhere, making the follow-up instructions unreachable. A polymorphic stack effectively allows for anything to be popped from it, leading to behaviour as follows:

unreachable ;; unconditionally "traps" / throws an error

;; stack now effectively contains an infinite amount of "any" values

i32.add ;; legal instruction, pops two "any" values from the stack and pushes an i32

;; stack now contains infinite "any" values, plus an i32 on top

f32.add ;; illegal instruction, f32.add can't work with the i32 on the top of the stack

There are a few reasons why this exists, but the moral of the story is that it helps producers produce their code. It ruins our perfect world for validation, though. Anyways, we end up with a bit of a headache figuring out a system that efficiently & generically compiles down to what we want (including the uncommon but possible polymorphic stack path). My current approach looks something like the below snippet. It looks a tad bloated, and it relies on the compiler heavily inlining & optimizing the constant-ness of it, but it has the positive of being generically applicable to more dynamic signatures. We can see this in the select instruction handler, also shown below. select is a ternary instruction similar to the ternary operator in most languages, and it allows any type in its true/false branches.

template <typename T> void WasmStack::push(const T &values) {
    std::copy(values.begin(), values.end(), buffer);
    buffer += values.size();
}

template <typename T> bool WasmStack::check(const T &expected) const {
    if (is_polymorphic()) [[unlikely]]
        return check_polymorphic(expected);

    return std::equal(end() - expected.size(), end(), expected.begin());
}

template <typename T> void WasmStack::pop(const T &expected) {
    if (is_polymorphic()) [[unlikely]]
        return pop_polymorphic(expected);

    ensure(check(expected), "type mismatch");

    buffer -= expected.size();
}

template <size_t pc, size_t rc>
void WasmStack::apply(std::array<valtype, pc> params, std::array<valtype, rc> results) {
    pop(params);
    push(results);
}

HANDLER(i32add) {
    stack.apply(std::array{valtype::i32, valtype::i32}, std::array{valtype::i32});
    _(i32add);
    nextop();
}

HANDLER(select) {
    // first pop the condition
    stack.pop(valtype::i32);

    auto ty = stack.back();
    ensure(ty == valtype::any || is_numtype(ty), "type mismatch");

    // then apply the dynamic type
    stack.apply(std::array{ty, ty}, std::array{ty});

    _(select, ty);
    nextop();
}
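For the curious, the check_polymorphic / pop_polymorphic fallbacks referenced above might work something like this. This is a standalone sketch, not the real implementation: the ToyStack type, its fields, and the wildcard handling are all my own illustration of the semantics.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

enum class valtype { i32, i64, f32, f64, any };

// Toy model of a validation stack: everything at or below `poly_base`
// is an infinite well of `any` values.
struct ToyStack {
    std::vector<valtype> vals;
    size_t poly_base = SIZE_MAX; // SIZE_MAX means "not polymorphic"

    bool is_polymorphic() const { return poly_base != SIZE_MAX; }
    void make_polymorphic() { poly_base = vals.size(); }

    // Pop `expected` off the stack, letting the polymorphic region
    // stand in for any missing values.
    bool pop_polymorphic(const std::vector<valtype> &expected) {
        for (auto it = expected.rbegin(); it != expected.rend(); ++it) {
            if (vals.size() > poly_base) {
                // a concrete value sits above the polymorphic point,
                // so it still has to match
                valtype top = vals.back();
                if (top != *it && top != valtype::any && *it != valtype::any)
                    return false;
                vals.pop_back();
            }
            // else: popping from the infinite `any` region always succeeds
        }
        return true;
    }
};
```

This reproduces the earlier behaviour sketch: after unreachable, i32.add validates fine (two `any`s come out of the well), but it pushes a concrete i32 that a following f32.add will then choke on.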

Arithmetic instructions become a bit repetitive in terms of boilerplate with this design (imagine i32add repeated with slight variation 40 times over), but you don’t really have to optimize for good looks when it’s code that you don’t look at often. Anyways, after we’re done validating a function, the Target I mentioned before has built us a function to execute! This architecture lets us share the same validator between the three methods of execution. Now onto the juicy stuff.

The In-Place Interpreter

The first method I used was an in-place interpreter, roughly based on Ben Titzer’s 2022 paper. This is a method that’s used in production in JavaScriptCore, and is a very cool little design. The core of the idea is that WebAssembly is already a pretty good format for direct interpretation. One of the few caveats is that the destinations of control flow instructions have to be calculated by the runtime; there’s no relative or absolute “address” given. Titzer’s paper outlines how you can build a side-table for navigation, and end up with speeds comparable to any full-scale rewriting interpreter.

;; A simple dummy program to show Wasm control flow
(func $countdown (param i32) (result i32)
  loop
    local.get 0
    i32.const 1
    i32.sub
    local.tee 0 ;; note: local.tee is equivalent to a local.set immediately followed by a local.get
    br_if 0
    ;; The `br_if` instruction says "conditionally jump to the 0th block/loop", but doesn't say *where* the loop is
  end
  local.get 0
)

My adaptation of his paper is much less performant, largely because I took the easy way out and just used a hash map to map jump instructions’ locations in memory to their destinations’ places in memory (as well as omitting several other optimizations he discusses). The control flow instructions end up looking something like below.

HANDLER(block) {
    // all control flow instructions in Wasm have an associated signature
    // i.e. they might take an `i32` as input and output another
    // usually it's the empty signature though
    auto sn = RuntimeType::read_blocktype(instance.types, iter);

    auto &block_end = instance.block_ends.find(iter)->second;
    auto result_stack = stack.unsafe_ptr() - sn.n_params;
    control_stack.push({
        result_stack,
        block_end,
        sn.n_results
    });

    nextop();
}

The core of the interpreter is structured as a pretty standard tail-call interpreter, very similar to Python 3.14’s new design. The nextop is actually identical to the one in the validator, with just a couple different parameters being passed around. The fun parts of this interpreter are mostly in the control flow. Jumps are a very common instruction, and Wasm has a few different forms of jumps. It allows you to pass not only one return value alongside a jump, but an arbitrary number of return values! So an interpreter has to deal with the 0, 1, and 2+ cases. Luckily, the 2+ case was introduced in a later proposal, so it is exceedingly uncommon to see it out in the wild. That led me to making this fun little function for dealing with jumps:

static void __attribute__((noinline, preserve_most))
slow_memmove(WasmValue *dst, const WasmValue *src, size_t n) {
    std::memmove(dst, src, n * sizeof(WasmValue));
}

static inline bool brk(Instance &, uint8_t *&iter, tape<WasmValue> &stack,
                       tape<BrTarget> &control_stack, uint32_t depth) {
    control_stack -= depth;
    BrTarget target = control_stack.pop();
    stack -= target.arity;
    std::memmove(target.stack, stack.unsafe_ptr(), sizeof(WasmValue));
    if (target.arity > 1) [[unlikely]] {
        slow_memmove(target.stack, stack.unsafe_ptr(), target.arity);
    }
    stack = target.stack + target.arity;
    iter = target.dest;
    return control_stack.empty();
};

Don’t get thrown off by the tape helper; for our purposes, tapes are basically just flexible stacks. But here you can see we’ve folded the 0 and 1 cases into a single memmove, and cast off the 2+ case into an extremely cold path. I wasn’t very happy with how compilers were treating my [[unlikely]] annotation, so I decided to really drill it in with some extra annotations that I don’t care about the performance of that branch. It’s safe to treat the 0 case as moving 1 value because it just ends up copying a garbage value past the top of our target stack, and anything past the top of the value stack is ignored. The unnecessary memory interaction for the 0 case doesn’t hurt too much, since so many interactions happen in that slice of memory that it should be nice and warm in the L1 cache. In return, the compiler knows a call to the memmove function is only needed in the (very, very) cold path, so the registers don’t get messed up unless it hits said path.

There isn’t all that much novel work in this bit, but I think it’s a relatively good starter example of an in-place Wasm interpreter. Again, I encourage you to try making your own version of a Wasm in-place interpreter if you want to get started in the world of language runtimes! Anyways, onto the next.

The Unrolled Interpreter JIT

The big slowdown of interpreters is the whole not-knowing of it all. The interpreter never knows what instruction comes next, nor what immediate arguments that instruction may have. This is particularly bad with our in-place interpreter, since immediate arguments are LEB128-encoded, so it isn’t just a simple load instruction to get e.g. the constant associated with an i32.const instruction. Ergo, many optimizations for interpreters focus on the “interpreter loop” - that is, the logic of how the interpreter fetches & runs the next instruction. This is especially important for Wasm, because its instructions do relatively little, so this overhead piles on. Most directly translate to just 1 native instruction, or even 03. My idea for the unrolled interpreter JIT was born of this struggle to predict the next instruction. Additionally, I wanted to preserve the cross-platform compatibility of interpreters. Nobody likes rewriting their whole codebase to support a new backend.
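To make the LEB128 pain concrete, here's roughly what decoding a single unsigned immediate entails (a sketch of the general algorithm, minus the bounds and overflow checks something like safe_read_leb128 would need): a data-dependent loop, versus a plain load for a native immediate.

```cpp
#include <cassert>
#include <cstdint>

// Decode one unsigned LEB128 integer, advancing `p` past it.
// Each byte carries 7 payload bits (low bits first); the high bit
// signals that another byte follows.
uint32_t read_leb128_u32(const uint8_t *&p) {
    uint32_t result = 0;
    int shift = 0;
    uint8_t byte;
    do {
        byte = *p++;
        result |= uint32_t(byte & 0x7f) << shift;
        shift += 7;
    } while (byte & 0x80);
    return result;
}
```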

The implementation I ended up with is what I’ll call the “unrolled interpreter JIT”. A snippet of the JITted assembly looks something like the following.

;; local.get 0
mov x3, 0 ;; local 0 is at offset x1-0 (stack pointer is in x1)
mov x6, &runtime::localget
call x6
;; i32.const 1
mov x3, 1
mov x6, &runtime::i32const
call x6
;; i32.add
mov x6, &runtime::i32add
call x6
;; local.set 0
mov x3, -8 ;; local 0 is at offset x1-8
mov x6, &runtime::localset
call x6

This translates almost 1:1 with the instructions that the CPU actually ends up executing when running our original tail-call interpreter, minus the LEB128 integer decoding. The functions being called, all under the runtime namespace, are almost identical to those interpreter handlers. One major change is that they no longer read the Wasm source code. This information is all embedded into the code itself. The instructions translate into these calls to runtime:: functions, and the x3 and x4 registers are used to pass instruction-specific information that would otherwise be stored in the Wasm. These are things like local.{get,set,tee} memory locations, the integer an i32.const is pushing to the stack, etc. The other major change is one to help us inform the compiler that we want to preserve the x0/x1/x2 registers, which correspond to the Wasm memory, the stack, and miscellaneous Wasm data (i.e. globals, tables, function addresses). The tail-call interpreter was explicit in its call to the next function, but the compiler doesn’t know we’re going to set up & run this JITted code. It thinks it’s fine to overwrite these registers in the handler functions. To allow for the sequential calls, we end up with functions looking like this:

HANDLER(localget) {
    // tmp1 = local stack offset to local
    PRELUDE;
    auto &local = *byteadd(stack, tmp1);
    *stack++ = local;
    POSTLUDE;
}

byteadd is just a little helper macro that uses integer arithmetic instead of pointer arithmetic, and the PRELUDE macro isn’t really needed; it’s just convenient for helper definitions / debug info. The POSTLUDE macro is where this technique gets funky.

__attribute__((noinline))
void dummy(std::byte *memory, void **misc, WasmValue *stack, uint64_t, uint64_t) {
    asm volatile("" :: "r"(memory), "r"(misc), "r"(stack));
    return;
}

#define POSTLUDE [[clang::musttail]] return dummy(memory, misc, stack, tmp1, tmp2)

So all of our handler functions really tail return to dummy. dummy compiles down to just a ret instruction, but semantically it tells the compiler a couple of things. One, at the end of a handler function, calling it says “I still need memory / misc / stack in the same place where they started, so make sure they’re in x0/x1/x2”. This is why it has a noinline attribute; if it gets inlined, memory / misc / stack could be in any register when dummy starts operating on them. Two, it has an empty asm volatile block that says it reads from memory / misc / stack. Empty asm volatile blocks are the canonical way of expressing a black box; all we’re really trying to do is lie to the compiler about these variables being used. Unfortunately, everything staying where we want it still isn’t guaranteed; the compiler is well within its rights to change the registers memory / misc / stack are in before or after the block.

;; A possible evil-compiler produced code for dummy
mov x5, x0
mov x6, x1
mov x7, x2
;; the empty asm block "uses" x5/x6/x7 (to do nothing)
;; the evil compiler zeros the registers after using them (still legal for the compiler to do)
mov x0, #0
mov x1, #0
mov x2, #0
ret

Fortunately, they don’t really have any reason to, and compilers typically aren’t in the business of doing stuff they don’t have a reason to do.

clang::musttail comes in handy again for control flow instructions. You can make the observation that a tail call is basically just a jump to an arbitrary location in memory, with an additional guarantee that some values are in some specific registers. Luckily, that’s the fundamental part of this execution technique! It’s just a series of calls where we want some values in some specific registers. So the equivalent of the interpreter’s brk function becomes something like this:

template <ssize_t Arity = -1, ssize_t StackOffset = -1> HANDLER(base_br) {
    PRELUDE;
    auto info = std::bit_cast<BrInfo>(tmp2);
    auto arity = Arity == -1 ? info.arity : Arity;
    auto stack_offset = StackOffset == -1 ? info.stack_offset : -StackOffset;

    std::memmove(byteadd(stack, stack_offset), byteadd(stack, -arity), arity);
    stack = byteadd(stack, stack_offset + arity);
    [[clang::musttail]] return reinterpret_cast<Signature *>(tmp1)(PARAMS);
}
HANDLER(br_0_0) { [[clang::musttail]] return base_br<0, 0>(PARAMS); }
// ...
HANDLER(br_n_n) { [[clang::musttail]] return base_br<>(PARAMS); }

The tmp1 / tmp2 restriction becomes a bit of a nuisance here, because there are really 3 values we want to pass in (the address of the destination, the # of values the destination expects, and the stack offset to put those values at), but we can just cram two of them into one register (std::bit_cast comes in handy here). It’s also specialized for a few permutations of arity / stack offsets, a nicety we didn’t get to have in the unknown-everything interpreter-land. br_0_0, the most common form of jump, compiles to just br x3! Awesome.

The great part of all this is that we only need to implement 5 incredibly simple functions for each new architecture: put_prelude, put_postlude, put_call, put_temp1, and put_temp2. These are fairly self-explanatory - the prelude/postlude in this case just save the frame and link pointers, and restore them + return, respectively. They go at the start and end of all functions generated from Wasm code. put_{call,temp1,temp2} are the functions we see in the generated snippet above. Basically just instructions for moving integers into registers, and for call, an instruction to call a register. And that’s it! About 100 lines for the ARM implementation.
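To give a sense of how small those backend functions can be, here's a plausible ARM64 sketch of put_temp1 and put_call. The function names come from the post and the fixed x3/x6 register choices match the earlier generated snippet, but the bodies are my assumption: I'm only handling 16-bit immediates, and put_call here only emits the register-indirect call itself (loading a full 64-bit handler address would take a movz/movk sequence I'm omitting).

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Append one fixed-width 32-bit ARM64 instruction to the code buffer.
void emit(std::vector<uint32_t> &code, uint32_t inst) { code.push_back(inst); }

// movz x<rd>, #imm16 -- materialize a 16-bit immediate in a register
void put_movz(std::vector<uint32_t> &code, unsigned rd, uint16_t imm) {
    emit(code, 0xD2800000u | (uint32_t(imm) << 5) | rd);
}

// tmp1 lives in x3, matching the generated snippet above
void put_temp1(std::vector<uint32_t> &code, uint16_t imm) {
    put_movz(code, 3, imm);
}

// blr x<rn> -- the register-indirect call at the heart of put_call
void put_call(std::vector<uint32_t> &code, unsigned rn) {
    emit(code, 0xD63F0000u | (rn << 5));
}
```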

One footnote - this technique isn’t super novel (though I made it up independently). I originally envisioned it relative to copy-and-patch, but I later found out there’s another project, GNU Jitter, that implements almost the exact same idea. At least, at one point it was implemented in almost the exact same way; Jitter is much more built up & optimized. If this execution technique interested you, there are some pretty cool presentations on similar compiler hacking by the author of that project.

The Single-Pass JIT

The final method of execution I’ve built is a full single-pass JIT. Basically a baseline JIT. For those unfamiliar, the term baseline JIT is usually used in the context of an engine with dynamic tiers of JITting, and its job is typically generating code fast, not generating fast code. Compilation speed was a very high priority of mine, but I probably put a bit too much emphasis on the output code’s speed to be considered a proper baseline JIT to some4. Regardless, the JIT compilation only takes about 1.5-2x what validation takes. On a large Wasm binary like Figma’s core 30 megabyte file, it takes about 73ms to validate, and an additional 123ms when compiling5.

What actually goes on in this JIT? A typical instruction handler looks something like the following, with a bit of formatting for your viewing pleasure:

HANDLER(i32add) {
    auto [p1, p2, res] = allocate_registers<
        std::tuple<iwant::ireg, iwant::literal<>>,
        iwant::ireg
    >(code);

    if (p2.is<value::location::imm>()) {
        masm::add(code, this, false, p2.as<uint32_t>(), p1.as<ireg>(), res.as<ireg>());
    } else {
         raw::add(code,       false, p2.as<ireg>(),     p1.as<ireg>(), res.as<ireg>());
    }

    finalize(code, res.as<ireg>());
}

Hopefully this makes a bit of sense. Most instructions’ register interactions are deferred to allocate_registers, where you declare what kind of values you want from the Wasm stack, and what result you’ll produce, and it spits out the values for you to use. iwant::TYPE (read: I want TYPE) lets you tell allocate_registers if there’s some special kind of value you’re able to use; in our case, ARMv8 supports addition instructions with a literal, so iwant::literal<> is used. You can also use iwant::bitmask for bitwise instructions, which allow for a different kind of literal, and iwant::flags, which allows the native flag register to be used directly in conditional instructions. You then write your logic instructions, feed finalize the result register you wrote to, and then you’re done! Under the hood, allocate_registers and finalize are interacting with a stack of these values, and roughly translate to pop / push respectively.

Many Wasm instructions in a binary often translate into 0 native instructions. The most common Wasm instructions, local.{get,set,tee} and iXX.const, can usually be translated into parameters of an instruction, not instructions themselves. For example, local.get 0 / i32.const 1 / i32.add / local.set 0 might translate into a single add x19, x19, #1 on ARM (as a result of the above i32add’s p2.is<imm>() branch). This is one of the main benefits this JIT has over the previous runtimes we’ve gone over: context-sensitivity. Hence, instructions that can take advantage of this do so with vigor. Instead of passing every value between instructions on the stack, we now have 4 places they can be: reg, stack, imm, and flags. Note that from here on out, whenever I mention a “value”, I mean these. The categories are fairly self-explanatory; imm represents an integer constant pushed by the i32const/i64const instructions, flags is pushed by comparison instructions, and stack is the place we dump stuff that has no other home. reg represents a register, and carries by far the most complexity. In fact, aside from general info like the values stack itself and the stack size, the only persistent state kept in the JIT is 4 kilobytes of register information.

4kb of Registers

Quick intro to register allocation. One big difference between this JIT and the previous two runtimes is that we actually keep values in registers between instructions. Previously, everything just stuck around on the stack, so the maintenance overhead was something like 8 bytes * (# of active values). Having proper register allocation in the previous JIT would require a whole bunch of specialized versions of the handler functions from before, one for every combination of register inputs you can have (which would produce an excessive number of functions).

In this version, instead of a bunch of specialized handler functions, we can just pass around the register information. Still, 4kb of memory just to keep track of a few registers may seem a wee bit excessive. In ARMv8, we have 63 registers to work with, 31 general-purpose and 32 floating point. After some implementation & macOS-specific reservations, this ends up being around 56, so the 4kb is really only about 72 bytes per register. Good register allocation is one of the most important compiler optimizations, so having a decent chunk of register-specific info is a worthy tradeoff.

There are 2 main things we want to track for each register: what values are currently depending on the contents of this register, and what locals are currently depending on the contents of this register. To understand why these are needed, you need to understand a bit of nuance about Wasm locals. Locals are the only way to “repeat” a value in Wasm - there’s no duplicate instruction or anything to that effect, so if you wanted to use a value twice without recomputing it, you’d need to store it in a local. Additionally, when you local.get, the value becomes fully disconnected from that local - if you local.set before the local.get’s value is used, the local.get will not update. We didn’t have to worry about managing this in the previous tiers, because they always copied their values to the stack, but now we want to directly use registers, which can be overwritten if we aren’t careful.

i32.const 100
local.tee 0

i32.const 500
local.tee 0

i32.add ;; bad implementation may erroneously print 1000 instead of 600

Therefore, we must keep track of values corresponding to a local, and when the local is written to, we have to make sure the “in-flight” values are safe. There are two ways for locals to end up in registers. Since a function can declare any number of locals, we can’t always assign them all directly to registers. Since the idea of callee-saved registers maps relatively well to the idea of locals, the first way that locals can end up in registers is having a callee-saved register assigned exclusively to them. This is great for most functions, because many of them don’t have many locals, so sometimes we can end up with all of them in their own registers. When there are more locals than callee-saved registers, the first few still get their own, and the rest get assigned a spot on the stack. On top of this, LLVM (by far the most prominent producer of Wasm binaries) purposefully sorts them in a ~most-to-least-used ordering, so usually the hottest locals get a register.
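That assignment policy can be sketched as: the first k locals claim the k available callee-saved registers, and the rest get 8-byte stack slots. The register numbers and slot layout below are illustrative, not the real allocator's:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

struct LocalPlace {
    bool in_reg;
    uint32_t reg_or_offset; // register number, or byte offset into the frame
};

// First-come-first-served: local i gets callee-saved register i if one
// is left, otherwise an 8-byte stack slot. LLVM's most-to-least-used
// ordering means the hottest locals win the registers.
std::vector<LocalPlace>
assign_locals(uint32_t n_locals, const std::vector<uint32_t> &callee_saved) {
    std::vector<LocalPlace> places;
    for (uint32_t i = 0; i < n_locals; i++) {
        if (i < callee_saved.size()) {
            places.push_back({true, callee_saved[i]});
        } else {
            places.push_back({false, (i - uint32_t(callee_saved.size())) * 8});
        }
    }
    return places;
}
```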

The second way that locals can end up in registers is when one of these stack-based locals is given a value in local.set. Since the value it’s being set to is typically already in a register, it would be a pretty big waste to take that value, persist it to the stack, and read from the stack when the value is used again. Instead, we can use the combination of the value & local register tracking to allow for local.set to “absorb” registers, which allows a code sequence like:

local.get 0
local.get 1
i32.add
local.tee 2
local.get 2
i32.add
local.set 2

to be compiled into something like

ldr x0, [sp+0x0]
ldr x1, [sp+0x8]
add x2, x0, x1
add x2, x2, x2

Without the absorption optimization, it may look something more like this:

ldr x0, [sp+0x0]
ldr x1, [sp+0x8]
add x2, x0, x1
str x2, [sp+0x10]
ldr x3, [sp+0x10]
add x5, x2, x3
str x5, [sp+0x10]

Much less pretty than our version!

Calling Convention

That “locals-in-temporary-registers” trick also allows us to semi-trivially add a custom calling convention. Wasm function parameters are also passed in as locals, so it follows that the special sauce we use for local.set should also allow us to mark parameters as being in temporary registers. However, you may be thinking - aren’t we putting the first n locals in their own callee-saved registers? Isn’t it a waste to have them in temporaries, which will most likely later be shoved onto the stack? Turns out, while they’re identical semantically, parameter locals tend to be used in a distinct way from regular locals. This makes sense if you think about it relative to C variables. Typically the main logic uses variables that you’ve defined, and you just read data from the parameters.

Conventionalizing, i.e. moving arguments around into the proper places at call-sites, is a fun problem to think through. Since you could have a situation where, for example, you want the value of x1 to be the first argument (i.e. wants to be stored in x0), and the value of x0 to be the second argument (i.e. wants to be stored in x1), you may have cycles forming. This is, as all things are, a graph problem, so I did a little graph traversal to fix it up. This kind of “negotiation” forms some interesting kinds of graphs; nodes (registers) have in-degree of at most 1, and arbitrary out-degree. This forms a graph called a pseudoforest, i.e. a graph composed of distinct components that are either cycle-less trees, or trees rooted on nodes in a cycle. This is a very nice property, because it means any cycle can be broken with a single temporary register. Once the cycle is broken, the graph can be traversed like any old tree.

std::function<void(T)> traverse_line = [&](T node) {
    constexpr auto invalid = (T)32;

    for (auto &pair : param_edges) {
        if (pair.src == node) {
            traverse_line(pair.dest);
            raw::mov(code, true, pair.src, pair.dest);
            pair = edge<T>(invalid, invalid);
        }
    }
};

auto traverse_maybe_cycle = [&](T node) {
    constexpr auto tiebreaker = ireg::x30;
    constexpr auto invalid = (T)32;

    raw::mov(code, true, node, tiebreaker);

    for (auto &pair : param_edges) {
        if (pair.src == node) {
            auto dest = pair.dest;
            pair = edge<T>(invalid, invalid);
            // after the cycle has been broken, we can confidently treat
            // any connections as trees whose root is `node`
            traverse_line(dest);
            raw::mov(code, true, tiebreaker, dest);
        }
    }
};

This combo of using a proper (not-everything-passed-on-stack) calling convention + not blindly throwing registers at each other gives a very nice speedup, and allows for a lot of leaf functions to not have to touch the stack at all.

Destination Register Patching

This is one of my favourite optimizations, because it really shows off a nice trait of how ARM instructions are represented. All instructions in ARM have a fixed width of 32 bits, and all instructions that output a value to a register store that output register in the lowest 5 bits6 of the instruction. As I mentioned earlier, our optimizations for local.set allow us to temporarily use a caller-saved register for a stack-based local. However, what if the local.set targets a local that lives in a callee-saved register? Then we have to emit a mov <local register>, <input register>, even if the input register is thrown away afterwards! No good. To fix this, I added “patching” to the local.set implementation. Patching checks if the register we’re using is the one that was written to by the last instruction, and patches the new register on top of the old one if possible. Assuming locals 0/1/2 are in x19/x20/x21 respectively, the same example snippet as above can compile to something like this:

local.get 0 ;; x19
local.get 1 ;; x20
i32.add
local.tee 2 ;; x21
local.get 2
i32.add
local.set 2
;; Without patching
add x0, x19, x20
mov x21, x0
add x0, x21, x21
mov x21, x0
;; With patching
add x21, x19, x20 ;; x21 is patched on top of x0
add x21, x21, x21 ;; x21 is patched on top of x0 again

And this gives us the optimal compilation for that set of instructions! There may be some elegant means of doing this on a variable-width instruction architecture, but it’s ridiculously simple on ARM, and I find it quite beautiful.

if (in_callee_saved && can_overwrite(reg, code)) {
    // backtrack
    code -= sizeof(inst);

    inst instruction;
    std::memcpy(&instruction, code, sizeof(inst));
    assert((instruction & 0b11111u) == (unsigned)reg);
    // mask in the updated destination reg
    instruction &= ~0b11111u;
    instruction |= (unsigned)local_reg;

    // purge (to the stack) any values that were depending on the destination reg
    purge(code, local_reg);
    // write back the updated instruction
    put(code, instruction);
}

Performance

For performance testing, I mainly worked off Figma’s Wasm binary (startup benchmarking only, as it can’t be run standalone), as well as the Wasm ports of esbuild, Spidermonkey, and the Coremark benchmark. Startup times were measured with hyperfine, execution speed is the median of three runs, and everything was measured on an M1 Macbook Air.

Startup Speed7

| Runtime | Spidermonkey (ms) | esbuild (ms) | Figma (ms) |
|---|---|---|---|
| Validation Only | 24.8 | 38.9 | 73.4 |
| In-Place Interpreter | 35.4 | 64.7 | 108.3 |
| Unrolled JIT | 52.8 | 89.8 | 159.7 |
| Single-Pass JIT | 65.8 | 102.0 | 198.8 |
| Wasmtime | 183.3 | 209.9 | 710.4 |
| Wasmtime (Single-Pass) | 197.4 | 243.6 | 767.5 |
| Node | 66.3 | 75.3 | 92.5 |

With regards to startup speed, the in-place interpreter is typically the fastest to get going, with the exception of the largest binary, Figma, where Node edges it out. The interpreter being fast is expected, as it does very little on top of validation. Node’s speed mainly comes from compiling lazily, which is hugely beneficial on this benchmark, because only one function is actually called. In fact, I would guess that the only reason Node isn’t ~fully even with my validation-only baseline is that I have to do some Javascript setup to run the Wasm.

This table shows that I lost the plot for the single-pass JIT somewhere along the way, and started caring about optimizing the output rather than purely minimizing compile times. However, as Node shows off, lazy compilation would probably render most startup speedups moot, so it’s difficult to say without trying it out.

Execution Speed8

↑ means higher is better, ↓ means lower is better

| Runtime | esbuild (↓) | Spidermonkey (↑) | Coremark (↑) |
|---|---|---|---|
| In-Place Interpreter | too slow | ~2.8 | 3749.6 |
| Unrolled JIT | too slow | 11.8 | 1326.6 |
| Single-Pass JIT | 3.126 | 347 | 26616.0 |
| Wasmtime | 3.316 | 577 | 28260.4 |
| Wasmtime (Single-Pass) | 7.337 | 232 | 11546.0 |
| Node | 3.694 | 300 | 24853.7 |

In general, Wasmtime does a decent chunk better than my JIT. I also benchmarked against its single-pass option (-O regalloc-algorithm=single-pass), but the results I see lead me to believe it hasn’t gotten much love, or at least not much on Arm64. Unsurprisingly, the interpreter & unrolled JIT tiers aren’t very fast, but the jump up to the single-pass JIT is a bit interesting, and could make for a later deep dive.

I do feel quite surprised that my JIT beats Node (and even Wasmtime on esbuild). It’s likely my implementation cuts corners that would go against the much (much, much) more safety-centric designs of Node and Wasmtime. In the esbuild comparison, I would guess that Wasmtime has trouble optimizing the WebAssembly generated by the Go compiler9 (as opposed to Spidermonkey & Coremark, which are both written in C++), and as such, its more extensive suite of optimizations has less effect.

I really want to emphasize that while these comparisons may make it seem like my JIT is practically comparable to Wasmtime and Node’s, it really isn’t, and I could probably write a blog post just as long as this one about things that I would improve on / reasons why you should use other runtimes.

I hope you found at least a couple snippets in this post that you enjoyed. Language runtimes are really fun to tinker around with, and WebAssembly is a really cool language to build one for. I first wrote the interpreter when I was prepping for an interview to intern on a team that focused on language runtimes, and it was so fun that I iterated to where we are now. I can’t make any guarantees that you’ll love it as much, but if you found something you enjoyed in this (longer than anticipated) article, I’m confident that you can find your own fun implementation magic along the way.

In the future, I’m planning on adding a proper multi-tier JIT with non-single-pass optimization passes, and I have some fun optimization ideas I’d like to experiment with. Stay tuned for more updates there! I’m also planning on writing some deeper dives into specific bits of my implementation if there’s interest.

Hopefully you’ve come out of this with some new insight into runtimes, or performance, or whatever you may have gleaned! Thanks for reading.


  1. I’m aware of one person so far who has taken up this exercise

  2. wasm3 is the one example I can think of that diverges from the usual ahead-of-time validation step, instead validating functions just-in-time during their first call. This is a bit of a controversial choice from what I’ve seen, because you have to be extra careful about interacting with the unverified bytes in the file, and in general it rubs up against the security goals of Wasm.

  3. Spoilers: We’ll look at this in the baseline JIT

  4. I did not follow the Law of No Noodling

  5. This difference is really easy to measure, because of the shared validation described above. It ends up being a 1-line diff: auto mod = Module::compile<Mac, Noop>(bytes); to auto mod = Module::compile<Mac, arm64::Arm64>(bytes);.

  6. 5 bits = 32 possible values, 1 for each of ARM’s 32 registers

  7. Measured by adding a _start function that immediately returns and executing the file

  8. esbuild was run on a 13mb Javascript file, Spidermonkey was run with the full suite from https://github.com/ahaoboy/js-engine-benchmark

  9. The Go compiler generates what I can only describe as “very tabley” code. I can only assume it’s modeling some sort of state machine, but I doubt it’s fun for an optimizer to try to reason about