Experiment in Hardware Intrinsics for WebAssembly
Summary
A proof of concept demonstrates that SHA-1 written against dedicated AArch64 C intrinsics can be executed via Wasm intrinsics in Wasmtime at 1.3x native execution time (roughly 30% overhead).
Potential issues revealed by this experiment:
- Semantics mismatches between C, Wasm and target instructions can eliminate performance gains if not handled carefully.
- Peak performance may be limited by the relatively simple optimizers in JIT engines.
- Supporting large sets of intrinsics in Wasm JITs would require careful engineering.
Application
The experiment provides a proof-of-concept for a representative use case, namely the SHA-1 hash algorithm using the Cryptographic Extension on AArch64. The prototype demonstrates how C code written against ARM's C intrinsics API can be executed both natively and via Wasm. Wasm execution is achieved with a Wasm AArch64 intrinsics C API layer that serves as a drop-in replacement for native intrinsic header files. In addition, I have a fork of Wasmtime with support for intrinsic calls for a select group of AArch64 instructions. The end result is SHA-1 execution via Wasm with intrinsics at 1.3x the native AArch64 execution time.
To give a feel for the implementation, four rounds of the SHA-1 compression function in C with AArch64 intrinsics are:
    // Rounds 28-31
    e0 = vsha1h_u32(vgetq_lane_u32(abcd, 0));
    abcd = vsha1pq_u32(abcd, e1, t1);
    t1 = vaddq_u32(m1, vdupq_n_u32(K1));
    m2 = vsha1su1q_u32(m2, m1);
    m3 = vsha1su0q_u32(m3, m0, m1);
These intrinsics are defined in arm_neon.h. The proof of concept provides an
alternate wasm_arm_neon.h that the C code can
be compiled against unchanged, and pure Wasm
fallbacks that would work on any platform.
However, when executed under the modified Wasmtime, calls to intrinsic
functions such as vsha1h_u32 are recognized and compiled directly to the
corresponding hardware instructions like SHA1H.
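
To make the fallback idea concrete, here is a minimal sketch of what a portable fallback for a single intrinsic could look like. It relies on the fact that the SHA1H instruction is a fixed rotate-left by 30 bits. The name fallback_vsha1h_u32 is hypothetical and is not taken from the prototype's wasm_arm_neon.h, whose actual fallbacks may be written differently.

```c
#include <stdint.h>

/* Hypothetical fallback for vsha1h_u32 (name is an assumption, not the
   prototype's). SHA1H computes a fixed rotate-left by 30 bits, which is
   the same as a rotate-right by 2 on a 32-bit value. */
static inline uint32_t fallback_vsha1h_u32(uint32_t hash_e) {
    return (hash_e << 30) | (hash_e >> 2);
}
```

A fallback of this shape compiles to ordinary Wasm integer operators, so it would run on any engine, at the cost of losing the dedicated instruction.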
Lessons
Some lessons from this proof-of-concept, with the caveat that they may not generalize to other intrinsics domains.
Challenge of Semantics Mismatches.
Compilation via intrinsics passes through many layers: C intrinsics API, engine
intrinsics API, Wasm operators, CLIF IR and machine code representation. Each of
these has its own semantics and value representations. Earlier stages of this
project showed that if not handled correctly, semantics mismatches can eliminate
any performance you might hope to gain from the intrinsics calls. Specifically,
in this case some of the special SHA-1 instructions have the oddity that they
accept s<n> registers which are scalar 32-bit values in the low bits of vector
registers. Wasm engines naturally want to store 32-bit integers in
general-purpose registers, so without careful handling the intrinsics calls are
surrounded by redundant register moves between the vector and general-purpose
register files (with a significant performance penalty). These details are
important when the entire goal of hardware intrinsics in Wasm is reaching
near-native performance. We might hope that the idiosyncrasies of the SHA-1
instruction set are not widespread. However, it seems possible that this broader
problem of semantic mismatches could rear its head in other cases, for example
when attempting to use wide vector types (e.g. Intel AVX-512) that do not have
Wasm equivalents.
Significance of the Intrinsics API. The design of the C API layer was critical in achieving near-native performance. Specifically, it should be designed to limit the number of intrinsics required in the engine, and the intrinsics offered by the engine should be as close as possible to the machine instructions. Therefore:
- Implement C layer intrinsics as existing Wasm operators wherever possible. For
example, the AArch64 intrinsic vdupq_n_u32 can just be implemented as
i32x4.splat (or wasm_u32x4_splat in C) without the need to add an intrinsic to
the engine. Importantly, this also improves the ability of the AOT compiler to
optimize code around the intrinsics.
- Engine intrinsics API should be as close as possible to machine instructions.
This makes the engine work essentially a passthrough, and limits the
optimizations required from the JIT. As a concrete example, the vsha1h_u32
intrinsic takes and returns a uint32_t. However, it is best if the C layer maps
to an internal __intrinsic_vsha1h_u32 version that takes and returns a v128,
since these match the underlying SHA1H instruction more closely.
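
The two guidelines above can be sketched in portable C as follows. This is an illustration under stated assumptions, not the prototype's header: v128_sketch is a stand-in struct so the example compiles off-Wasm (real code would use v128_t from wasm_simd128.h), the sketch_ names are hypothetical, and the engine-level intrinsic is modeled in plain C where the modified Wasmtime would instead emit SHA1H.

```c
#include <stdint.h>

/* Stand-in for the Wasm v128 type (assumption, for portability only). */
typedef struct { uint32_t lane[4]; } v128_sketch;

/* Guideline 1: vdupq_n_u32 needs no engine intrinsic. In real Wasm C
   code this would be wasm_u32x4_splat, i.e. the i32x4.splat operator. */
static inline v128_sketch sketch_vdupq_n_u32(uint32_t x) {
    v128_sketch v = {{x, x, x, x}};
    return v;
}

/* Guideline 2: the engine-level intrinsic takes and returns a vector,
   with the scalar in lane 0, mirroring how SHA1H reads and writes the
   low 32 bits of a vector register. Modeled here as a rotate-left by 30. */
static inline v128_sketch sketch_intrinsic_vsha1h_u32(v128_sketch v) {
    v128_sketch r = {{0, 0, 0, 0}};
    r.lane[0] = (v.lane[0] << 30) | (v.lane[0] >> 2);
    return r;
}

/* The C layer keeps the arm_neon.h signature (uint32_t in and out) and
   does the scalar/vector conversion itself, where the AOT compiler can
   see and simplify it. */
static inline uint32_t sketch_vsha1h_u32(uint32_t hash_e) {
    v128_sketch in = {{hash_e, 0, 0, 0}};
    v128_sketch out = sketch_intrinsic_vsha1h_u32(in);
    return out.lane[0];
}
```

Keeping the scalar-to-vector conversion in the C layer is the design choice that matters here: when the caller already holds the value in a vector, as in vsha1h_u32(vgetq_lane_u32(abcd, 0)), the AOT compiler has a chance to cancel the extract and re-insert before the engine ever sees them.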
Importance of Accompanying Optimizations. The first version of SHA-1 via Wasm intrinsics had poor performance (3.2x native), showing that merely mapping to the right machine instructions is not enough. Supporting optimization passes are critical. In the SHA-1 case, it was crucial to eliminate redundant moves between register classes, and it is reasonable to expect instances of this problem for other classes of intrinsics. Optimizing JITs are designed for compile speed and therefore have a much more limited set of optimizations than a full AOT compiler. In this case we were able to work around missing Cranelift JIT optimizations by moving the problem to the AOT compilation layer; however, it is not clear that this would always be possible. Indeed, the remaining overhead of approximately 30% over native execution may be a difficult gap to close, given the lack of optimizations such as instruction scheduling in JIT compilers. Overall, we might expect Wasm intrinsics performance to be limited by JIT compiler optimization capabilities.
Fallback Performance.
When the intrinsics implementation is executed under Wasm with the fallback
implementations, the performance is very poor (over 9x native intrinsics). In
fact, it's even worse than a generic version of SHA-1 compiled to Wasm. The
function call overhead is likely a major problem, so inlining of fallbacks would
likely be necessary for tolerable performance. Alternatively we could accept
that fallback performance is not a goal, and the is_available functions are
there to allow users to provide an alternative.
Engineering Aspects. The fork of Wasmtime for this project was modified with this proof-of-concept in mind. While the engineering effort was reasonable, the approach taken is not one that would scale to adding hundreds or thousands of intrinsic calls. At the time of writing, the ARM intrinsics database contains 12,855 functions, with 4,344 in the Neon instruction set extension. A full production-grade version of the Wasm intrinsic header library and accompanying engine support would be a substantial undertaking. You would almost certainly want automation and code generation involved, and even then certain parts of the engine integration would not scale well. The current hand-written assembler would need to support many more instructions. You also probably would not want to extend the engine's IR to support every intrinsic, but instead perhaps support an explicit passthrough or intrinsic IR node that would effectively perform a trivial lowering to a wrapped machine instruction. None of these engineering challenges are intractable, but they would need careful thought.