Let's Make a Varint
carlmastrangelo.com

This post starts off good, but...
- The motivation behind this post is storing generated IDs for a site the author is working on. Generated IDs don't have the "smaller values are exponentially more likely" property that makes varints useful in the first place.
- The post rejects UTF-8's variable-length encoding because "it can only store numbers up to 1,112,064". That's the maximum number of unicode code points (...ish). The UTF-8 encoding's actual limit is >10^16. A much better reason to not use UTF-8's encoding is that it was created under pretty heavy backwards-compatibility restrictions that the author doesn't have. For example, there's no need for being self-synchronizing or to be compatible with Boyer-Moore substring searches.
- The post goes with a length-prefixed encoding, but then uses that length prefix as part of the folder structure (it's using the generated IDs as filepaths). Which is a great way to create an exponential distribution in the sizes of your directories, instead of the uniform one that was the goal.
It's not a bad post, and there are facts in there, but I wouldn't recommend bothering with it.
Those are all really good points actually. In response to them individually:
- The problem with generating IDs is that it isn't known ahead of time how many there will be. This forces a solution that is suboptimal in all circumstances.
- The reason for rejecting UTF-8 is mostly backwards compatibility with existing software. Using UTF-8-encoded strings that exceed the ~1.1 million code point limit is possible, but it really burns a lot of bridges along the way. The point about Boyer-Moore is really cool; I had no idea that was a goal!
- Having the length in the folder structure is exponential, but only at the topmost level; it will be uniform under each length dir. This is an acceptable price to pay when typing "ls ./dir", since removing the prefix would make the output hard to read quickly:
0/ 0.jpg 1/ 1.jpg
Something probably not particularly important to the author (given that he's just using these essentially as a serial number), but worth mentioning for general use:
> Varints can represent negative numbers too, by casting them to unsigned numbers first.
But, assuming you're working with two's-complement numbers, it's going to be really inefficient, because small negative numbers are going to end up represented as extremely large positive numbers (e.g. -1 will map to the largest unsigned value representable in your underlying integer type, 2^64 - 1 for a 64-bit int).
You also lose the unlimited length property if you need to handle two's-complement integers, and your encoding becomes dependent on the underlying integer type you're serializing from/deserializing to (especially problematic in languages where types are architecture-dependent, like C/C++).
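To make that concrete, here is a rough Python sketch (the function name is mine, not from the post) of the standard base-128 varint encoding, showing what naively casting -1 to a 64-bit unsigned value costs:

    def encode_uvarint(n: int) -> bytes:
        """Base-128 varint for a non-negative int: low 7 bits per byte,
        high bit set while more bytes follow."""
        out = bytearray()
        while True:
            byte = n & 0x7F
            n >>= 7
            if n:
                out.append(byte | 0x80)   # more chunks coming
            else:
                out.append(byte)          # final chunk
                return bytes(out)

    # -1 reinterpreted as an unsigned 64-bit integer (2**64 - 1):
    print(len(encode_uvarint(-1 & 0xFFFFFFFFFFFFFFFF)))  # -> 10 bytes for a "small" value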
Protobuf handles this by doing what they call "ZigZag encoding" if you declare a value as a signed integer type. This maps small negative numbers to small positive numbers, alternating back and forth like so:
    Original    Encoded
           0          0
          -1          1
           1          2
          -2          3
           2          4
and so on. The catch is that you lose sortability if you do this... if you're going to have a bunch of negative numbers and need their encoded versions to sort correctly, you'll need to do something else. (Of course, you'll need to with two's-complement numbers, too, given that -1 will appear to be greater than 1.)
Protobuf documentation on ZigZag encoding: https://developers.google.com/protocol-buffers/docs/encoding...
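If it helps, a minimal Python sketch of that mapping (protobuf itself computes the same thing on fixed-width ints with a shift and an XOR, e.g. (n << 1) ^ (n >> 63) for 64-bit values):

    def zigzag_encode(n: int) -> int:
        # 0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, 2 -> 4, ...
        return (n << 1) if n >= 0 else ((-n << 1) - 1)

    def zigzag_decode(z: int) -> int:
        return (z >> 1) if (z & 1) == 0 else -((z + 1) >> 1)

    assert [zigzag_encode(n) for n in (0, -1, 1, -2, 2)] == [0, 1, 2, 3, 4]
    assert all(zigzag_decode(zigzag_encode(n)) == n for n in range(-1000, 1000))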
FWIW the protobuf varint encoding was seen as a mistake by the inventor. https://groups.google.com/d/msg/capnproto/Qxw7YSoPP18/HS06Kc...
First of all, I believe that Kenton Varda is not the inventor of protobufs, but rather a Google engineer who did a bunch of work on protobuf v2 (I believe on both of the v2 specs, because there are two different v2s). He is certainly quite knowledgeable, but not "the inventor".
Second, while he does characterize it as a "huge mistake" for encoding/decoding performance reasons, there are a few caveats that are obvious here.
1) Half of his remarks here explain why varints are inappropriate for CapnProto, a separate project of his, inspired by his experience with protobuffs. CP has different design goals than PB, and in that context varints are inappropriate. Fine.
2) His argument here against varints in protobuffs (as contrasted with in capnproto) is only that they cost CPU time (and thus latency). That's surely true, but it's not at all clear to me that he's right that this is, in general, the wrong performance tradeoff.
3) In general, if you want to represent a long-tailed distribution of numbers, some variable-length encoding has merits, because the alternative (that protobuf promotes, IIUC) is fixed-width ints, which means you waste a lot of space to accommodate your long tail, and even so you might not have allocated enough space for the largest integer you ever need to handle.
FWIW, before I even joined Google, there were notes from Sanjay (the actual inventor) saying that varint was a poorly-chosen format due to excessive branching. IIRC he was arguing for a format where you can tell the length based on the first byte although I don't completely remember. In theory you could design such a format that is just as space-efficient as varint, by re-arranging bits.
Protobuf -- like most successful systems -- was not so much "designed" as it was "evolved" as a series of decisions made out of expediency rather than because they were the best possible answer. Overall it has obviously worked extremely well, but I would not assume that just because protobuf does something, it is good. (But I'd say that about just about everything, not just protobuf.)
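To illustrate the general idea (a hypothetical UTF-8-style sketch of my own, not Sanjay's actual proposal): stash the length in the leading bits of the first byte, so the decoder branches once on that byte instead of once per 7-bit chunk, while still spending only one bit of overhead per byte:

    def encode_prefix_varint(n: int) -> bytes:
        """Hypothetical 'length in the first byte' varint (UTF-8-style leading ones).
        The number of leading 1 bits in the first byte equals the number of extra
        bytes that follow; this sketch covers 0 <= n < 2**56."""
        assert 0 <= n < 1 << 56
        for extra in range(8):                          # bytes after the first one
            if n < 1 << (7 + 7 * extra):                # (7 - extra) + 8*extra payload bits
                prefix = (0xFF << (8 - extra)) & 0xFF   # 'extra' leading 1 bits, then a 0
                first = prefix | (n >> (8 * extra))
                rest = (n & ((1 << (8 * extra)) - 1)).to_bytes(extra, "big")
                return bytes([first]) + rest

    def decode_prefix_varint(data: bytes) -> int:
        first = data[0]
        extra = 0
        while extra < 8 and first & (0x80 >> extra):    # count leading 1 bits -> length
            extra += 1
        value = first & ((1 << (7 - extra)) - 1)
        for b in data[1:1 + extra]:                     # big-endian payload bytes
            value = (value << 8) | b
        return value

    assert all(decode_prefix_varint(encode_prefix_varint(n)) == n
               for n in (0, 1, 127, 128, 300, 1 << 20, (1 << 56) - 1))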
The mistake wasn't that particular varint encoding, but that it was variable length. That means you need to perform more complex decoding, which is not the tradeoff chosen for Cap'n Proto.
The protobuf varint format itself actually is pretty expensive to decode-- the way it's encoded pretty much mandates a branch per seven bit chunk (of the resulting integer); you can't tell in advance how long an integer is (as opposed to a length-prefixed varint, where you can).
That's in addition to the encoding-independent overhead inherent to any variable-length integer scheme, which is what mandates the encoding/decoding step in protobuf. Combined, this makes protobuf pretty expensive to decode (protobuf varints are everywhere; they're used for field keys and length delimiters for length-delimited field types).
That's unrelated to ZigZag encoding of negative numbers, though, which is fairly straightforward and computationally inexpensive (it's a couple bitshifts and an XOR).
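To make the per-chunk branch concrete, here's a rough decoder sketch in Python (not protobuf's actual code); every byte has to be inspected before you know whether another one follows:

    def decode_uvarint(data: bytes, pos: int = 0):
        """Decode one base-128 varint starting at data[pos]; returns (value, new_pos)."""
        result = 0
        shift = 0
        while True:
            byte = data[pos]
            pos += 1
            result |= (byte & 0x7F) << shift
            if not (byte & 0x80):   # high bit clear -> this was the last chunk
                return result, pos
            shift += 7              # high bit set -> branch again on the next byte

    assert decode_uvarint(bytes([0xAC, 0x02])) == (300, 2)  # the docs' classic example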
Funny thing is, despite all this, protobuf beats almost everything in benchmarks, in part due to excessive micro-optimization. Almost all varints in practice are 1-byte, so the larger ones don't really matter.
That said, the fact that you can't skip forward in a protobuf without parsing through all the previous fields is unfortunate in some use cases, hence Cap'n Proto.
Yeah, we've been pretty happy with protobuf in the projects where we've used it.
Cap'n Proto would've been a pretty strong contender if it had existed when we started using protobuf, though. Serialization/deserialization was on the critical path for our server-side code, so we most likely would have been able to get at least a bit of a performance boost out of Cap'n Proto.
I'm not the inventor of protobuf, just a guy who rewrote and open sourced it. My rewrite did not change the underlying encoding.
Any good reason not to just use a single sign bit?
Honestly, no, there's no "good" reason.
In practice the reason it ended up designed this way is because varint was introduced first with no special support for negative numbers, and then zigzag was introduced years later. Zigzag worked well because it stacked on top of the existing varint parser without changing it.
But for a new format with no backwards-compatibility constraints, it would probably make much more sense to use sign extension (so, the first encoded bit becomes the sign bit).
(I am the person who rewrote and open sourced protobufs at Google, though I am not the inventor.)
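For what it's worth, signed LEB128 (used in DWARF and WebAssembly) is an existing sign-extension-style varint along these lines, though not necessarily exactly what's meant above. A rough Python sketch:

    def sleb128_encode(n: int) -> bytes:
        """Signed LEB128: little-endian 7-bit groups; stop once the remaining bits
        are just sign extension of the current group's top bit."""
        out = bytearray()
        while True:
            byte = n & 0x7F
            n >>= 7                        # Python's >> is arithmetic (sign-preserving)
            done = (n == 0 and not (byte & 0x40)) or (n == -1 and (byte & 0x40))
            out.append(byte if done else byte | 0x80)
            if done:
                return bytes(out)

    def sleb128_decode(data: bytes) -> int:
        result, shift = 0, 0
        for byte in data:
            result |= (byte & 0x7F) << shift
            shift += 7
            if not (byte & 0x80):
                if byte & 0x40:            # sign bit of the last group
                    result -= 1 << shift   # sign-extend
                return result

    assert sleb128_encode(-1) == b"\x7f"   # one byte, vs. 10 for the cast-to-unsigned trick
    assert all(sleb128_decode(sleb128_encode(n)) == n for n in range(-1000, 1000))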
Cool. Thanks :)
The references in the article don't reflect the latest research. So far as I know, this is close to state of the art:
https://github.com/lemire/JavaFastPFOR
The links to the papers in the readme are useful if you want to learn more.
"No “u”, (and no “o” or “i”) so accidental profanity cannot happen."
h0ly fvcking sh1t!
It's a good try, and making all the letters lowercase makes the numbers-as-vowels less legible. But if you're creating random strings you will create profanity and slurs.
If you're generating alphanumeric IDs, removing I, O, and U from your alphabet really does decrease the likelihood of the ID being perceived as English profanity.
Sure, nothing involving letters will be perfect, but this is something to take into account. You can choose a level of risk in your IDs somewhere between purely numeric IDs and the Automated Curse Generator [1].
[1] http://thedailywtf.com/articles/The-Automated-Curse-Generato...
That's why I said it was a good try. I don't think it's a bad idea, but if you ever wanted to push the needle close to zero, you would have to add some other layer of checking.
I find the entire premise of the chosen alphabet ridiculous. Long, random IDs are not meant for typing or reading. It doesn't sound like they are even being used for user-facing URLs.
Varints are widely used in databases. https://www.sqlite.org/src4/doc/trunk/www/varint.wiki