By “Unicode things” I mean programs that process UTF-8. The basic logic of such programs is typically:
- Decode the runes (codepoints) from a UTF-8 string
- Look up their categories
- Do something with that information
It looks something like this (using Go for example):
for _, r := range myString { // "DecodeRune" is called under the hood
    if unicode.Is(r, category1) || unicode.Is(r, category3) { // binary searches on a range table
        // do something based on category
    }
}
It’s all fine. But! Runes are a deserialization we only use temporarily, and every category test is a separate binary search. Can we do better?
Prefix trie + bitwise categories
Yes, and the answer is a prefix trie, a type of KV (key-value) data structure. The K is a slice of UTF-8 bytes, and the V is a bitmap of Unicode categories. It looks like this:
type category uint8

const (
    category1 category = 1 << iota
    category2
    category3
    category4
)

func (this category) is(that category) bool {
    return (this & that) != 0
}
position := 0
for position < len(myString) {
    cat, w := trie.lookup(myString[position:]) // no rune decoding
    position += w
    if cat.is(category1 | category3) { // check multiple categories
        // do something based on category
    }
}
This turns out to be 30%-50% faster in my experience. No allocations, no runes, and one lookup retrieves all categories.
Further, the result is a bitmap (an integer) of every matching category, so testing n categories is a single bitwise operation: O(1) instead of O(n) lookups.
The trie is pretty compact, since each unique byte is only stored once.
In Go, we have to code-gen the trie. Happily, there is a package deep in the x/text module, called triegen. In other languages, you might use compile-time generation.
ASCII optimizations
I have two packages that do Unicode things:
- uax29 - tokenize graphemes & words
- displaywidth - determine the monospace display width of strings
Each of those tasks is defined by a Unicode standard, and requires complicated logic: lookbacks, lookaheads, many rules.
However, the ASCII cases are simple: a byte is a grapheme; words are alphanumeric bytes followed by a space; printable characters always have a display width of 1.
We can carefully exploit those special cases: apply the ASCII rules until we encounter a byte that is not ASCII. Then, fall back to the full logic. In my libraries, this improves the general case by around 50%, and the all-ASCII case by 2x-10x.
Importantly, one needs to understand the rules that one is eliding. For example, I had a gotcha where an ASCII byte followed by a non-ASCII byte…is no longer “ASCII”. The character might legitimately be followed by a modifier, such as an accent. In that case, we back out by one byte and defer to the standard logic.
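The fast path with the back-out can be sketched like this, using display width as the example. `fullWidth` is a stand-in stub for the complete Unicode logic (it just counts runes, which is not correct in general), and the back-out assumes the previous byte was a printable, width-1 character; a real implementation tracks what that byte actually contributed.

```go
package main

import (
    "fmt"
    "unicode/utf8"
)

// displayWidth applies the simple ASCII rules until a non-ASCII byte
// appears, then backs out one byte and defers to the full logic.
func displayWidth(s string) int {
    width := 0
    pos := 0
    for pos < len(s) {
        b := s[pos]
        if b >= utf8.RuneSelf {
            // A non-ASCII byte may be a modifier (such as a combining
            // accent) attached to the previous character, so that
            // character is no longer safely "ASCII". Back out one byte
            // and hand the rest to the full logic.
            if pos > 0 {
                pos--
                width-- // undo the previous byte's width-1 contribution
            }
            return width + fullWidth(s[pos:])
        }
        if b >= 0x20 && b != 0x7F { // printable ASCII is always width 1
            width++
        }
        pos++
    }
    return width
}

// fullWidth stands in for the complete Unicode width logic; counting
// runes is NOT correct in general, it just keeps the sketch runnable.
func fullWidth(s string) int {
    return utf8.RuneCountInString(s)
}

func main() {
    fmt.Println(displayWidth("hello")) // 5, via the fast path alone
    fmt.Println(displayWidth("héllo")) // falls back at the second byte
}
```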