ULID – Sortable Unique Identifier
yadukrishnan.liveThis reminds me a little bit of Twitter's snowflake: To generate the roughly-sorted 64 bit ids in an uncoordinated manner, we settled on a composition of: timestamp, worker number and sequence number.
Sequence numbers are per-thread and worker numbers are chosen at startup via zookeeper (though that’s overridable via a config file).
https://blog.twitter.com/engineering/en_us/a/2010/announcing...
Is that still a viable strategy or are there better options available by now? Is Twitter still using Snowflake? (There’s a nice Elon pun in there somewhere, I’m sure)
The initial version, released in 2010, was based on Apache Thrift and it predated Finagle, their building block for RPC services at Twitter. The Snowflake they're using internally is a full rewrite and heavily relies on existing infrastructure at Twitter to run.
Or just use the standards-track orderable UUID variants: https://uuid6.github.io/uuid6-ietf-draft/
Wouldn't it be better for it to become an actual standard before using the new variants? If something changes between now and when it gets standardized, you'd have to deal with the difference going forward.
And in the meantime, ULID isn't even that far along, and is there even more likely to change.
The ULID spec has been stable for many years now and changes are extremely unlikely. The spec wasn't designed for forwards compatibility for future specs, which on the one hand makes it a possible risk if there are future specs, but also gives some weight to the "promise" that future specs are not expected and will be backwards incompatible anyway.
It would be nice to see ULID recognized by some standards body that isn't just "itself", but also unlike UUID version changes doesn't need to be. The new UUID versions need to be backward and forward compatible with existing standards documents "universally", whereas ULID is designed to be entirely self-contained.
Support for it has also been officially integrated into Laravel[0].
[0] - https://laravel.com/docs/9.x/migrations#column-method-ulid
> If more than one ULID is generated within the same millisecond(so, the timestamp component is the same), the algorithm should increment the previously generated random by 1 bit.
This is bad approach. If you know one ULID, you can with high probability deduce next one. Don't use this approach.
Why with high probability? Because generating several ids within the same millisecond is extremely common case when you're doing batching.
I recently gave thought to it and actually implemented several algorithms and compared them each to each other. I don't care about standards, sorry (I think that standard UUID is oxymoron, UUID is 128 bit and that's about it).
So the best approach I've found is:
6 bits for version/variant (if you don't care about standards you can use those bits for better randomness)
first 48 bits is unix milliseconds.
Then you have 12 more bits in the fist 64 bits. There're two approaches to use them:
either use them as a nanoseconds (nanoseconds_part * 4096 / 1_000_000).
Or use them as a counter if several ids are generated in the same millisecond. It allows for 4096 values per millisecond. Counter should be used when you can't access nanosecond timer like with browser JavaScript.
Then you have 2-3 bits for variant and rest 62-61 bits for pure randomness. Or just 64 bits for randomness. This is enough for security.
If you need to generate more than one ID per 1/4 of microsecond (approach one) or more than 4096 ids per millisecond (approach two), you can just keep generating random part until it's greater than previously generated one. It slightly reduces randomness but not by much.
I'm pretty much sure that my approach is the best approach. It allows for high speed of generation (like 10 000 000 / second with nanosecond approach with unoptimized Java) and good security.
If you insist on using this ULID approach, I suggest to apply the aforementioned approach: don't just increment random bits, generate random bits until you've got higher value.
Of course one should only use this ascending UUID thing (that's what I call it) when you need it. Random UUIDs should be used by default and ascending UUIDs only when you use those as primary keys for RDBMS.
And of course keep in mind that you're leaking generation timestamp which might be a bad thing.
That monotonic support in the ULID spec is described as an option. The default in most implementations I've seen is pure random ULIDs and monotonic has to be opt-in.
(Though yes, anyone opting into that behavior should beware that it reduces the entropy strength of generated IDs.)
Discussed many time before: https://hn.algolia.com/?q=ulid
A lot of distributed systems’ clocks can’t tell you within 1 ms when an event happened, leading to assumptions that aren’t true. I don’t think we should commit to IDs being even partially ordered unless we have a set of monotonic clocks with sequence number generators, and record a timestamp, a sequence number, and which clock they came from.
It of course depends on your application needs. Many applications don't need a total order, they just need a predictable, stable order. ULIDs offer that. ULIDs generated within the same clock ms are sorted randomly. Random is unpredictable to people, but to a system that is a predictable sort order (it is randomly sorted) and the random numbers in ULID are more importantly still a stable sort order.
Obviously, applications exist where you do need a total order of IDs/events and ULID may not be the best choice for those, but don't underestimate the usage scenarios of partially ordered, stable sorts.
People want to believe that sorting ULIDs is semantically different than sorting UUIDs, and it’s really hard not to depend on something that looks sort of true, so I wouldn’t want a system to use them unless our architecture guarantees they always (not just mostly) monotonically increase.
ULIDs are sorted semantically different from UUIDs. UUIDs have 6+ different sort orders depending on who you ask and which library call you make. All of those sort orders are different from what you'd get just string forms of UUIDs. There are cross-platform sorting endian issues due to the struct order that GUIDs were originally designed to be grouped by.
ULIDs have a single, consistent sort that matches both byte patterns and string representation. That's a huge semantic difference.
Sure, ULIDs make no claims to accurate sorting or total ordering or monoticity beyond a single machine, but ULIDs aren't designed to be a Snowflake/Thrift replacement, they are designed to be a UUID replacement. You are correct that they make no more guarantees than UUIDs, but they don't have to, that was out of scope of their design. I can understand how that makes it less useful for some of your applications, but that doesn't make it not useful for all sort of applications. (Including many applications that once used UUIDs successfully but want something with a cleaner string representation and fewer cross-platform sorting headaches.)
ulid1 < ulid2 does not reliably tell you anything about the ordering of events one and two (without careful architecture most systems don’t have), but people read the spec and want to believe that it does. I think this makes the provisions for sorting more misleading than helpful.
It would help if the spec talked about clock skew and other issues to design and test for; ordering has a heavy cost.
You still seem to be assuming that the only reason to sort IDs is a total ordering of events. There are plenty of other reasons someone may want a good enough partial order of events or a sortable ID of things that aren't events in a system.
It might help if the spec mentioned things like clock skew and distributed time keeping, but mostly just to state that those problems are out of scope of the spec itself. You are right, ULID itself wasn't designed for those specific classes of problems. I think you are just too easily dismissing the classes of problems that ULID does solve reliably well.
I admit I don’t see the use case for IDs that are mostly ascending.
If you are curious, just some of them:
- UIs that want minimal visual thrashing in "user time" (wall clock time)
- Databases and B-Trees and other storage with primary key indexes/clusters that offer slightly better writes/clustering for "roughly but not necessarily exactly in order insertions": key/value stores, document databases, SQL databases
Wall clock time is plenty of time for a UI to rely on a partial ordering.
Databases are generally designed for arbitrary order insertions, it's just common that they are most optimized/efficient for "roughly in-order". A partial order is generally good enough to opt-in to most of the optimizations/efficiencies and reduce worst cases, especially if the "out-of-order" insertions occur below wall clock time and things like journaling-based transaction semantics versus competing with clustered inserts, reindexing scanners/services, and so forth.
This is a good point. Something is either reliably sortable or it is not sortable at all.
To give the appearance of being sortable to those who are less familiar with how they are generated is potentially dangerous and misleading. And whilst it is true to say that these IDs will consistently sort in the same order, that is equally true of standard GUIDs etc - the difference being the latter does not lead people to believe that the order has inherent meaning, which the former does.
It's a little similar to how the designers of Go noticed that people were relying on the ordering of keys in maps matching the order items were added. So range iteration over keys was specifically changed to start from a 'random' point in the sequence (not truly random, but enough to stop people relying on it). They understood that the appearance of consistency without the fact of consistency leads to errors.
They have their uses I'm sure, they just need to be carefully considered and clearly understood uses.
> And whilst it is true to say that these IDs will consistently sort in the same order, that is equally true of standard GUIDs etc
GUIDs do not have a consistent sort order: https://devblogs.microsoft.com/oldnewthing/20190426-00/?p=10...
> the difference being the latter does not lead people to believe that the order has inherent meaning, which the former does.
That's mostly an accident of v4 UUIDs becoming the most common. v1 GUIDs had plenty of metadata with lots of semantic data in them that had all sorts of inherent meaning in the sort orders. That's part of why there are so many different ways to sort UUIDs in the first place (with again the above blog post being one of the best discussions on it). v1 GUIDs had machine identifiers and tried to be total orderable in distributed systems. It didn't work out well, there were privacy issues with the machine identifiers, and everyone switched to random v4 UUIDs. (That's relevant in discussions of the v6+ UUIDs which are trying to get back to partially semantically meaningful UUIDs again.)
The obvious counter-example to the Go example to me is Python: python dict (which also powers most "ordinary" python objects) had years of implementation behavior where keys were generally sorted lexicographically. It was undocumented behavior for many years and expected to change if the underlying implementation changed and people were directed towards SortedDict if they needed a dict with a reliable sort.
Python eventually decided to just document the implementation behavior and make it assumed that dict will generally behave with sorted keys, but also still suggests SortedDict for apps that need that guarantee of a total order. Reliable partial order is "good enough" for a lot of programs and a lot of algorithms.
> Something is either reliably sortable or it is not sortable at all.
ULIDs are reliably sortable, they just aren't a total order with respect to monotonic time and distributed clocks. I'm not discounting that there are needs for monotonic time and distributed clock coordination and such not, but I do think it's a bridge much further down from what ULIDs are trying to do.
ULIDs are "UI stable" for sorting: if you've got a UI that shows data as it comes in and sorts by ULID keys the amount of "UI jiggle" is about as low as possible. You are still going to get "middle inserts" in between previous loaded items in a way that a user will sometimes notice, but not in a way that thrashes the UI or drastically changes the viewport. That's a sometimes useful property. (For more things than just UIs, but UIs are easy to visualize. It can also be a useful property in data locality/clustering in DB indexes as something that has been a sorting need in projects I've worked on, across several different kinds of database.)
You all are right, you can't rely on that if you've got a stateless backend whose operations only strictly rely on if the key is greater than the last key they saw on whether or not they have new data to work on. If that is your use case, you need something that isn't ULID. That's not a lot of applications' use case. I think where we are most disagreeing is that you all seem to think that that is an eventual use case for all applications, and in my experience it is much more rare and you have other problems than what style of IDs you are using by the time you get there. For what it is worth, ULIDs have been a reasonable compromise for my applications' needs versus total orderable things like Thrift/Snowflake/et al. They make a great replacement for v4 UUIDs, which are still the default "golden child" among a certain class of dark matter Enterprise developer. That's fine if those aren't uses that you have, I'm saying those are uses that have a lot of applications.
> GUIDs do not have a consistent sort order
I didn't write that they "have a consistent sort order" (they don't), I wrote that they will "consistently sort in the same order".
In other words I was not saying they are generated in a particular order, but that once you have them they will sort consistently - that is, requesting the data multiple times will always return them in the same order. In context it wasn't a comment about their inherent nature in any way, more about what the stable sorting might make people believe about that nature.
My point is that whilst both GUIDs and ULIDs have stable ordering (as per the above), the sort order of GUIDs very clearly has no meaning whereas the sort order of ULIDs gives the impression that it might without further consideration. This is simple fact; the issue is the weight you put on that fact. To me it is dangerous, but feel free to differ.
I think you missed facts that I already referenced in my post: v1 GUIDs have timestamps and machine keys for sorting. The proposed v6 and v7 UUIDs have timestamps again. The "no meaning of GUID sort" today again seems purely an accident of v4 becoming the most common type of GUID/UUID not an actual intent by GUID/UUID developers especially given how important sorting was to v1 GUIDs.
(That's part what makes GUID/UUID sort so inconsistent today: there's still a major database used by lots of applications that sorts even v4 UUIDs by v1 machine key then v1 timestamp!) The sort of GUIDs have too many meanings in practice today, even when you assume the majority of GUIDs you see are v4 UUIDs.