Making Unicode things fast


By “Unicode things” I mean programs that process UTF-8. The basic logic of such programs is typically:

  1. Decode the runes (codepoints) from a UTF-8 string
  2. Look up their categories
  3. Do something with that information

It looks something like this (using Go for example):

for _, r := range myString {  // "DecodeRune" is called under the hood
    if unicode.Is(r, category1) || unicode.Is(r, category3) {  // Binary searches on a range table
        // do something based on category
    }
}

It’s all fine. But! Runes are a deserialized form we only use temporarily, and each `unicode.Is` call is a separate binary search. Can we do better?

Prefix trie + bitwise categories

Yes, and the answer is a prefix trie, a type of KV (key-value) data structure. The K is a slice of UTF-8 bytes, and the V is a bitmap of Unicode categories. It looks like this:

type category uint8

const (
	category1 category = 1 << iota
	category2
	category3
	category4
)

func (this category) is(that category) bool {
	return (this & that) != 0
}

position := 0

for position < len(myString) {
    cat, w := trie.lookup(myString[position:])   // no rune decoding
    position += w

    if cat.is(category1|category3) { // check multiple categories
        // do something based on category
    }
}

This turns out to be 30%-50% faster in my experience. No allocations, no runes, one lookup retrieves all categories.

Further, the result is a bitmap (integer) covering all matching categories: testing membership in several categories is a single O(1) mask operation, rather than a separate lookup per category.
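For instance, with the `category` bitmap defined above, one mask test answers several category questions at once:

```go
package main

import "fmt"

type category uint8

const (
	category1 category = 1 << iota
	category2
	category3
	category4
)

func (this category) is(that category) bool {
	return (this & that) != 0
}

func main() {
	// Suppose a character belongs to two categories at once.
	cat := category1 | category4

	// One bitwise AND answers "is it in category1 OR category3?" —
	// no per-category lookups needed.
	fmt.Println(cat.is(category1 | category3)) // true
	fmt.Println(cat.is(category2))             // false
}
```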

The trie is pretty compact, since each unique byte is only stored once.

In Go, we have to code-gen the trie. Happily, there is a package deep in the x/text module, called triegen. In other languages, you might use compile-time generation.

ASCII optimizations

I have two packages that do Unicode things:

  • uax29 - tokenize graphemes & words
  • displaywidth - determine the monospace display width of strings

Each of those tasks is defined by Unicode standards and requires complicated logic: lookbacks, lookaheads, many rules.

However, the ASCII cases are simple: a byte is a grapheme; words are alphanumeric bytes followed by a space; printable characters always have a display width of 1.

We can carefully exploit those special cases: apply the ASCII rules until we encounter a byte that is not ASCII. Then, fall back to the full logic. In my libraries, this improves the general case by around 50%, and the all-ASCII case by 2x-10x.

Importantly, one needs to understand the rules that one is eliding. For example, I had a gotcha where an ASCII byte followed by a non-ASCII byte…is no longer “ASCII”. The character might legitimately be followed by a modifier, such as an accent. In that case, we back out by one byte and defer to the standard logic.
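Here is a sketch of that fast path for display width, including the back-out. `stringWidth` and `fullWidth` are illustrative names, and `fullWidth` is a deliberately naive stand-in for the real Unicode-aware logic (it only knows about combining marks):

```go
package main

import (
	"fmt"
	"unicode"
	"unicode/utf8"
)

// stringWidth applies the ASCII fast path: printable ASCII bytes have
// width 1, and we fall back to the full logic at the first non-ASCII byte.
func stringWidth(s string) int {
	width := 0
	for i := 0; i < len(s); i++ {
		b := s[i]
		if b >= 0x20 && b < 0x7F { // printable ASCII
			// Gotcha: if the NEXT byte is non-ASCII, it might be a
			// combining mark (e.g. an accent) that modifies this
			// character. Back out by one byte and defer to the
			// standard logic.
			if i+1 < len(s) && s[i+1] >= utf8.RuneSelf {
				return width + fullWidth(s[i:])
			}
			width++
			continue
		}
		// Non-ASCII (or control) byte: fall back for the rest.
		return width + fullWidth(s[i:])
	}
	return width
}

// fullWidth is a simplified placeholder for the standard logic: it counts
// combining marks (category Mn) as width 0 and everything else as width 1.
func fullWidth(s string) int {
	w := 0
	for _, r := range s {
		if unicode.Is(unicode.Mn, r) {
			continue // combining mark: width 0
		}
		w++
	}
	return w
}

func main() {
	fmt.Println(stringWidth("hello"))   // 5: fast path only
	fmt.Println(stringWidth("café"))    // 4: falls back at the é
	fmt.Println(stringWidth("e\u0301")) // 1: 'e' + combining acute accent
}
```

The back-out is the load-bearing detail: without it, `"e\u0301"` would be counted as a width-1 ASCII `e` plus a separately processed accent, double-counting the character.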