Scan HTML even faster with SIMD instructions (C++ and C#)

Earlier this year, both major Web engines (WebKit/Safari and Chromium/Chrome/Edge/Brave) accelerated HTML parsing using SIMD instructions. These ‘SIMD’ instructions are special instructions that are present in all our processors that can process multiple bytes at once (e.g., 16 bytes).

The problem that WebKit and Chromium solve is to jump to the next target character as fast as possible: one of <, &, \r and \0. In C++, you can use std::find_first_of for this problem whereas in C#, you might use the SearchValues class. But we can do better with a bit of extra work.

The WebKit and Chromium engineers use vectorized classification (Langdale and Lemire, 2019): we load blocks of bytes and identify the characters you seek using a few instructions. Then we identify the first such character loaded, if any. If no target character is found, we just move forward and load another block of characters. I reviewed the new methods used by the Web engines in C++ and C# in earlier posts along with minor optimizations. The results are good, and it suggests that WebKit and Chromium engineers were right to adopt this optimization.

But can we do better?

Let us consider an example to see how the WebKit/Chromium approach works. Suppose that my HTML file is as follows:

<!doctype html><html itemscope="" itemtype="http://schema.o'...

We load the first 16 bytes…

<!doctype html><

We find the target characters: the first character and the 16th character (<!doctype html><). We move to the first index. Later we start again from this point, this time loading slightly more data…

!doctype html><h

And this time, we will identify the 15th character as a target character, and move there.

Can you see why it is potentially wasteful? We have loaded the same data twice, and identified the same character again as a target character.

Instead, we can adopt an approach similar to that used by systems like simdjson. We load non-overlapping blocks of 64 bytes. Each block of 64 bytes is turned into a 64-bit register where each bit in the word correspond to a loaded character. If the character is a match, then the corresponding bit is set to 1. The computed 64-bit word serves as an index for the 64 characters. Once we have used it up, we can load another block and so forth.

Let us build a small C++ structure to solve this problem. We focus on ARM NEON, but the concept is general. ARM NEON is available on your mobile phone, on some servers and on your macBook.

Our constructor initializes the neon_match64 object with a character range defined by start and end. We define three public methods:

get(): Returns a pointer to the current position within the character range.
consume(): Increments the offset and shifts the matches value right by 1.
advance(): Advances the position within the range

To iterate over the target characters, we might use the class as follows:

neon_match64 m(start, end);
while (m.advance()) {
  char * match = m.get(); m.consume();
}

The class code might look like the following:

struct neon_match64 {
  neon_match64(const char *start, const char *end) 
       : start(start), end(end), offset(0) {
    low_nibble_mask = {0, 0, 0, 0, 0, 0, 0x26, 0, 0, 0, 0, 0, 
        0x3c, 0xd, 0, 0};
    bit_mask = {0x01, 0x02, 0x4, 0x8, 0x10, 0x20, 0x40, 0x80,
                0x01, 0x02, 0x4, 0x8, 0x10, 0x20, 0x40, 0x80};
    v0f = vmovq_n_u8(0xf);
    update();
  }
  const char *get() const { return start + offset; }
  void consume() {
    offset++;
    matches >>= 1;
  }
  bool advance() {
    while (matches == 0) {
      start += 64;
      update();
    }
    int off = __builtin_ctzll(matches);
    matches >>= off;
    offset += off;
    return true;
  }

private:
  inline void update() { update((const uint8_t*)start); }
  inline void update(const uint8_t *buffer) {
    uint8x16_t data1 = vld1q_u8(buffer);
    uint8x16_t data2 = vld1q_u8(buffer + 16);
    uint8x16_t data3 = vld1q_u8(buffer + 32);
    uint8x16_t data4 = vld1q_u8(buffer + 48);

    uint8x16_t lowpart1 = vqtbl1q_u8(low_nibble_mask, 
      vandq_u8(data1, v0f));...
    uint8x16_t lowpart4 = vqtbl1q_u8(low_nibble_mask, 
       vandq_u8(data4, v0f));

    uint8x16_t matchesones1 = vceqq_u8(lowpart1, data1);...

    uint8x16_t matchesones4 = vceqq_u8(lowpart4, data4);

    uint8x16_t sum0 =
        vpaddq_u8(matchesones1 & bit_mask, matchesones2 & bit_mask);
    uint8x16_t sum1 =
        vpaddq_u8(matchesones3 & bit_mask, matchesones4 & bit_mask);
    sum0 = vpaddq_u8(sum0, sum1);
    sum0 = vpaddq_u8(sum0, sum0);
    matches = vgetq_lane_u64(vreinterpretq_u64_u8(sum0), 0);
    offset = 0;
  }

  const char *start;
  const char *end;
  size_t offset;
  uint64_t matches{};
  uint8x16_t low_nibble_mask;
  uint8x16_t v0f;
  uint8x16_t bit_mask;
};

The most complicated function is the update function because ARM NEON makes it a bit difficult. We use use vld1q_u8(buffer) loads 16 bytes (128 bits) of data from the memory location pointed to by buffer into the data1 variable. Similarly, data2, data3, and data4 load subsequent 16-byte chunks from buffer + 16, buffer + 32, and buffer + 48, respectively. The expression vandq_u8(data1, v0f) performs a bitwise AND operation between data1 and a vector v0f (which contains the value 0xF in each lane). The expression vqtbl1q_u8(low_nibble_mask, ...) uses the low_nibble_mask vector to permute the low nibbles of the result of the AND operation. The result is stored in lowpart1, lowpart2, lowpart3, and lowpart4. The expressionvceqq_u8(lowpart1, data1): Compares each lane of lowpart1 with data1. If equal, the corresponding lane in the result is set to all ones (0xFF). They correspond to target characters. You repeat the same computation four times. We then use bitwise AND with bit_mask(...) followed by several applications of pairwise sums vpaddq_u8(...) to compute the 64-bit word.

I have added this C++ structure to an existing benchmark. Furthermore, I have ported it to C# for good measure with a lot of help from Arseniy Zlobintsev.

How well does it do?

I use a C++ small benchmark where I scan the HTML of the Google home page. I run the benchmark on my Apple M2 laptop (LLVM 15).

method	speed	instructions/byte	instructions/cycle
`find_first_of`	2.0 GB/s	9.8 instructions/byte	5.1 instructions/cycle
vectorized classification	6.8 GB/s	1.9 instructions/byte	2.8 instructions/cycle
64-bit vectorized classification	33 GB/s	0.7 instructions/byte	6.6 instructions/cycle

The number speaks for themselves: the 64-bit vectorized classification approach is the clear winner. It requires few instructions and it allows retiring many instructions per cycle on average.

What about C#?

method	speed
Conventional	1.4 GB/s
SearchValues	4.2 GB/s
vectorized classification	6.3 GB/s
64-bit vectorized classification	29 GB/s

Again, the 64-bit vectorized-classification approach is the clear winner. The SearchValues class provided by .NET gets an honorable mention for its good performance.

I have not yet ported the code for x64 processors, but I expect equally good results.