Understanding How UUIDs Are Generated
digitalbunker.devAfter discovering ULIDs [0] I can't see ever using UUIDs ever again.
ULIDs are sortable (time component), short (26 chars) and nearly human readable, and good enough entropy/randomness for everything I'd ever be working on.
Does anyone have any criticisms of ULIDs? I can't see how they don't take over general purpose use of unique ids in the future except where a more guarantee of uniqueness is needed. (ie, bajillion records a second unique...)
The concept of ULID is interesting, but the spec is a bit weird[1]. If you want the benefits of ULID, I'd highly suggest checking out KSUIDs:
https://github.com/segmentio/ksuid
https://segment.com/blog/a-brief-history-of-the-uuid/
Same advantages of ULIDs, but I prefer the base62 to the base32 encoding (more compact; no need to bikeshed about upper versus lower case), it's been tested at scale, and the decisions made are sensible.
[1]: Specifically, they try and guarantee absolute monotonicity. The way they do this is that if you ever try and generate more than on ULID per millisecond, you increment the least significant bit of the random component. In other words, we have a key that's basically <timestamp>-<random-int>, and if you generate more than one key per timestamp, you just increment the random number. If the random number would overflow, by the spec, you have to just throw an exception; no wraparound. There's a lot of issues here. For one thing, none of this can possibly work if you're generating your IDs in a distributed fashion; it assumes a single, central, consistent key generator. For another, our key generator now has state, since it needs to know if any keys have been generated earlier, and if so, what they were. Doable, but...potentially a lot of work depending on your environment. Also, why are we even trying to force strict monotonicity? What does that possibly gain you? Why would we want a spec that, by design, has a chance of sometimes not letting you generate a key? The whole thing feels like the result of someone really wanting an auto-incrementing primary key, hearing that UUIDs were cool, and trying to make a auto-incrementing primary key that looks like a UUID, ending up without the advantages of either. Of course, you could ignore the spec (and several implementations do), but at this point it's worth asking what you're gaining from ULID. It's a weird feature that basically only works if you don't need it (since realistically, anyone generating many keys per millisecond would of course need to generate the keys in a distributed fashion).
Thanks for the alternative (KSUIDs) I took a quick look at it, and from what I can tell, if you aren't generating thousands of IDs 1 per second (or more) they achieve the exact same result. [0]
I find the argument that ULID won't work under extreme and harsh conditions proof that it's just fine for many of us that simply do not work on systems with that kind of load/requirements.
I appreciate seeing the weaknesses of ULID, as this helps me choose whether or not I can live with them. (which I can)
Again, thanks for the detailed reply, it was very helpful.
Yeah. ULID will work. It's just that there's a lot of small annoyances and quirks, and no real advantages. If you've already adopted it, it's probably not worth changing, but I can't think why you'd select it given a choice for greenfield development.
What are the "annoyances and quirks" you are referring to?
I hate programming myself into a corner only to figure this out much later. So I appreciate any serious feedback on this.
I described the advantages for me in my first post. Maybe they are situational?
I have no issues adopting a new UID system, but so far what everyone has described as problems don't apply to the work I am doing.
Right, your initial post said:
> ULIDs are sortable (time component), short (26 chars) and nearly human readable, and good enough entropy/randomness for everything I'd ever be working on.
1) Being roughly sortable is a valuable characteristic if used as a DB index in most DB systems, and ULIDs meet this requirement (as do others). However, they are NOT strictly time sortable except under very specific conditions. If you need that, you should avoid ULIDs (because the design is broken). But if you don't, then...well, ULIDs will work fine for you! Of course, so would other options. :)
2) 26 characters is fine. However, the use of base32 is a bit unfortunate. Some implementations generate lowercase IDs; others generate uppercase. The spec officially recommends generating uppercase IDs, but if you're not careful (and are generating them on different platforms), you may end up with a mix of both, and naive lexical sorting will yield incorrect results. You need to normalise all your IDs; doable but overhead. Conversely, if a denser encoding was chosen (like base62 or base58), you could either get the same entropy in a shorter string, or more entropy in the same string, but ALSO not need to worry about case. Win/win, from my point of view.
3) Human readable...eh, not sure "01ARZ3NDEKTSV4RRFFQ69G5FAV" is very human readable, but okay. I'd score it no better or worse than a normal UUID (like "fe9c8a21-61c1-44dd-b552-95616d62404d") or a KSUID (like "0ujssxh0cECutqzMgbtXSGnjorm").
4) ULIDs are fine in terms of entropy. Again, this doesn't set them apart from others.
Apart from the question of being strictly sortable (if you need that, you need a different tech), ULIDs are fine. But they're also not unique. Some people re-order a v1 UUID so it starts with a timestamp, and obtain the same benefits. Others use KSUIDs and obtain the same benefits. Others roll their own "slam a timestamp and some random data together and base whatever encode it" and...obtain the same benefits.
It's cool to have an ID which is roughly sortable by time, is safe to generate on multiple systems, and can be encoded as a relatively short alpha-numeric string, and I think a lot of people can benefit from this. Certainly I've found uses for this, which is why I've opted to use KSUID.
...that being said, ULID also works for this. It's just...odd. Like, 80 bits of randomness per millisecond is fine! But actually, if you follow the spec and generate them on a single system, only the first ULID is random; the rest just increment until you run out of room, which means you can generate a random number of ULIDs per millisecond. Anywhere between a bit over one septillion and one. Of course, the odds of getting a small amount of room is low, so as a practical matter it's fine. But it's odd that, if you ever did try and generate several ULIDs at once, there's just a random chance it might not work! Of course, you can just wait a millisecond and try again, so no big deal.
Of course, that opens another issue. If you tell me a ULID, then if you're following the spec, and also generating more than one ULID per millisecond, then if I have one ULID I now know the other ULIDs! For many people a major attraction of IDs like this is that they're unguessable even if you have seen a bunch, but a ULID is actually, in theory, open to enumeration attacks. That's abasolutely crazy! If you managed to make a system you don't control generate a ULID for you at the same millisecond it generates one for your target, you can now trivially guess the target's ULID. If you were relying on that ULID being unguessable for security reasons (which you would be if you were using it as an API key, session identifier, or unguessable URL), then it's completely compromised. I admit, hard attack to pull off, but wow, imagine it even being possible to do that!
Of course, in the real world, it's fine. I don't like base32, but it's good enough. The crazy increment logic is a wtf, but whatever. ULIDs are most broken when you use them for what they're designed for (guaranteeing strict monotonicity at scale), but the design is so impractical you'll give up before you hit the landmines. And if used more "normally", then they'll be good enough. I still wouldn't choose ULIDs for a project I control, just because it's so odd. But I doubt you'd gain any benefit from switching to, eg, KSUIDs now.
Wow, thanks this was more than I expected. Here's my considerations on your reply, though it's likely not useful for you, but it is good for me to write this out.
1) DB performance has as much (or more) todo with time sorting as anything. But I am not an expert, I just rely on what I read from others. (hence, my questions to you, and I appreciate your time/effort on your responses here)
But, I do appreciate the inherit use of a time part, just makes me feel better about having a "randomish" id, but not loosing what I felt I had previously with incremented integers.
2) This is one reason I _prefer_ ULID to other options. I am using ULIDs in low count generated IDs per second, and I want to use in them in URLs. All lowercase is a trivial thing to implement. And I remove ambiguity about content (ie, 1,i,I,L, etc...)
So for human readable ULID is almost unparalleled from what I have found.
Is there another UID system that is just as human readable? I'd love to find something else to compare it to. (that I haven't already)
3) #2 dove-tails into #3 I guess, but yes, I look at data all day long, and the shorter the better. The fewer conflicts, misunderstandings, etc... the more human readable it is.
Is there another UID that is _more_ human readable?
4) I don't need my UIDs to be special, I just need them to be useful.
Why is something "odd" to you because of an arbitrary number like "80 bits"? Am I missing something about unique ids that require a specific hex/octet/base-10 value or they don't work properly?
As far as I know _all_ unique ids have the potential for conflicts, there is no getting around this.
A) I don't have any issues with making more than one ULID per millisecond, I spend way more CPU time on other things, I have no issue holding back ULID generation to avoid the "countable" error of multiple ULIDs that can be guessed.
Thank you for pointing out this flaw, it would be nice if it didn't exist, and maybe the libraries should have some kind of flag/option to enable guaranteed random values at the cost of speed. (which considering how long it takes to download very large background video on a splash/hero banner on a site, I think solves what seems to be 100% of the criticisms you have against ULIDs.
B) Here's a discussion about ULID vs KSUID, which includes replies from the author you linked to from Segment, Rick Branson.
https://github.com/segmentio/ksuid/issues/8
Rick says this about ULID vs KSUID:
"If one is concerned about the extra 4 bytes of data or if millisecond precision is needed, then ULID is probably a better choice."
It seems that ULID has some advantages over KSUID, and it's not so cut and dry that KSUID is superior in all ways.
I don't see how a distributed system is an "extreme condition". If you don't have distributed ID generation, why not use an auto incrementing u64 and call it a day?
I will have at most 100 users generating an ID in a day (on the project I am using ULIDs with). The possibility of them having a conflict with generating the ULID is so low that I don't even consider it a concern.
So, "extreme condition" is one where someone could actually generate a conflict. A realm of production I don't work in. (maybe Twitter/Facebook would have this problem)
Therefore, I feel an easier to use/work with UID like ULID is both reasonable and preferable over other options.
Sadly they wrap around in 133 years. Maybe Segment don't plan on still existing in 2153? Also, base62 really should give way to base58 for any identifier that an ops person might have to type in a hurry.
Thus the quest for the perfect identifier continues.
NB: generating many unique IDs per millisecond for a long period may be a hallmark of a large distributed system, but even a small application may want this for a brief time e.g. importing bulk customer data.
I do not understand your criticism. ULID tries to guarantee monotonicity with any single generator, which is nice if you need it and irrelevant if you don't. The state that needs to be saved for that feature is exactly the last generated ULID (if it's the same microsecond still, increase, else regenerate). And if you are generating on the order of 2^40 ULIDs per microsecond, you'll have to have a larger id space anyway.
So,for me, your criticism boils down to "this is weird because it has a feature I don't need". Why would you care?
> ULID tries to guarantee monotonicity with any single generator, which is nice if you need it and irrelevant if you don't
The thing is, you don't need it. Nobody needs it. And I know this is mean, but...if you think people need it, you probably shouldn't be writing specs for things like this, because it suggests you haven't really thought about the problem.
Keep in mind:
1. If you want guaranteed monotinicity within a single generator, just increment an integer in a DB column. This is a solved problem!
2. ULID cannot guarantee monotonicity. Instead, the spec does something that will guarantee it if you use the library in an extremely specific way. The second you generate two ULIDs on different systems (or even in different processes on the same system), all bets are off. Which means you really shouldn't rely on that strict monotonicity!
3. But if you can't rely on it, then it serves no purpose. And if it serves no purpose, then you could remove it, and at a stroke it becomes much simpler to implement tje spec. As you note, you don't need much state, but any state greatly complicates something like this.
> your criticism boils down to "this is weird because it has a feature I don't need"
It's a feature that literally nobody needs, implemented in a way that doesn't work. It's presence is so bizarre, it raises questions about the entire spec.
I disagree with the strong sentiment that nobody needs it. In fact, I can think of several situations where this would have been helpful. For example, "ids generated by each import/client/process are guaranteed to be sortable."
But in any case, this is such a simple spec and such an irrelevant feature thet this sounds very much like bikeshedding to me now. As such, please accept my apologies and do choose what is best for you.
Speaking of human-readable, I really like ssh's "randomart" visualizations of ssh fingerprints.[1][2][3]
They're much easier for humans to differentiate than the usual long string of hex characters (even 26 characters is too long to reliably compare when a single mismatched character might make all the difference).
Examples of randomart:
[1] - http://www.dirk-loss.de/sshvis/drunken_bishop.pdfGenerating public/private rsa key pair. The key fingerprint is: 05:1e:1e:c1:ac:b9:d1:1c:6a:60:ce:0f:77:6c:78:47 you@i The key's randomart image is: +--[ RSA 2048]----+ | o=. | | o o++E | | + . Ooo. | | + O B.. | | = *S. | | o | | | | | | | +-----------------+ Generating public/private dsa key pair. The key fingerprint is: b6:dd:b7:1f:bc:25:31:d3:12:f4:92:1c:0b:93:5f:4b you@i The key's randomart image is: +--[ DSA 1024]----+ | o.o | | .= E.| | .B.o| | .= | | S = .| | . o . .= | | . . . oo.| | . o+| | .o.| +-----------------+[2] - https://www.man7.org/linux/man-pages/man1/ssh.1.html
[3] - https://superuser.com/questions/22535/what-is-randomart-prod...
I love randomart. To see the randomart of the host you're connecting to, append this to the ssh command:
To see the randomart of your own key, or your known hosts:ssh user@host -o VisualHostKey=yesssh-keygen -lv -f ~/.ssh/mykey ssh-keygen -lv -f ~/.ssh/known_hosts> To see the randomart of the host you're connecting to
put it in a ~/.ssh/config or /etc/ssh/ssh_config
why type this stuff over and over?
> Does anyone have any criticisms of ULIDs
I'm not a fan of the Crockford encoding, since it supports noncanonical forms that'll silently trash the lexicographic sorting assertion when present, and the exclusion of "U" as some kind of profanity filter is both prissy and ineffective.
base58 seems better, vs my own fat fingers at any rate.
"It excludes the letters I, L, and O to avoid confusion with digits. It also excludes the letter U to reduce the likelihood of accidental obscenity."
Excluding ILO doesn't seem a bad idea - but leaving out U for that particular reasons seems downright weird:
It doesn’t exclude them. Ignore the Wikipedia text, and look at the decoding table instead.
True - they should be generated during encoding but are allowed during decoding:
https://web.archive.org/web/20021223012947/http://www.crockf...
base58 avoids 0 as it looks similar to capital O in some fonts. Seems more reasonable than U.
Maybe you don’t want to leak the time component.
This occurred to me after looking at the Wikipedia article [1], and seeing the different "versions" and the rationale about not leaking data in the ID.
In general, if you want to encode info in your IDs, you can do that. Just make sure you want to do that, and that you don't run out of entropy.
I view random UUIDs as a silver-bullet type solution to assigning IDs, without overlap. (8 bytes should be enough to just assign them at random)
8 bytes will have a non-negligible (say, 1 in 1 million) chance of generating the same identifier twice even if you only generate a few million million IDs -- see https://en.wikipedia.org/wiki/Birthday_problem#Probability_t... and look at the "16 hex digits" row. Depending on your use case, this may or may not be a problem.
Microsoft research has a great blog post on this.
https://devblogs.microsoft.com/oldnewthing/20160114-00/?p=92...
TLDR: It's far more likely for a UUIDv1 to be corrupted to a collision due to memory errors and bit-flips, than for two UUIDv4 to collide.
TLDRTLDR: All computing is approximate at scale.
I had not considered that. In my use case it's not an issue, but I can see that if the ID value is publicly viewable (ie, in a URL) and can reveal sensitive info via the timestamp... thanks.
So you basically mean UUID v6, the missing version of UUID which would be sortable too [1] [2]
[1] https://tools.ietf.org/html/draft-peabody-dispatch-new-uuid-... [2] http://gh.peabody.io/uuidv6/
It's funny that this took so long since multiple people in this thread (including me) mentioned that they implemented their own UUID generation function which encodes the timestamp into the first bits just to create sortable UUIDs.
It appears that UUID v6 is still a proposal, as of August 2020.
Also, this sounds like a non-starter for UUID v6:
"Like version 1 UUIDs, version 6 UUIDs use a MAC address from a local hardware network interface."
The sort-ability of ULIDs is of questionable usefulness. It looks like the generator relies on the clock of the machine generating the ID. If you generate ULIDs across machines, there is no guarantee that there isn't a time-drift. So to rely on that, you'd have to have a single oracle generating the ids. Why not just use an auto-incrementing ID at that point?
> Why not just use an auto-incrementing ID at that point?
Because sometimes (and by sometimes I mean "surprisingly often"), you don't care about exact sortability but simply want roughly-correct ordering.
Examples for most social media sites:
"Give me the latest 100 tweets/videos/posts for this user" => Milliseconds differences won't matter, because users don't tweet/post videos that often.
"Give me the latest 1000 tweets/videos/posts for this search" => Even a significant drift won't matter, because you don't care if things aren't exactly in the correct order, you just care about showing recent stuff.
And at that scale (even long before that scale), having a single oracle for auto-incrementing IDs is a hassle. So this is a nice solution any time you need globally unique IDs, need to support sharding, and your default sort is time-based (or whatever-based if there's another piece of info you want to put in that timestamp portion of the ID, as long as "drift" is a non-issue).
BTW, not just theoretical: I believe this exact reasoning is why twitter came up with time-sortable snowflake IDs.
That's fair, if you care about close-enough time-ordering but don't care about strict global causality then ULIDs would be useful. I suppose if you have a partition-able workload (say users posts go into Kafka keyed on userId) then you can use ULIDs to have guaranteed order within a partition, which is probably enough for many use-cases.
One reason is database indexes. If you shove truly random data into a standard B-tree (or whatever), performance tends to suck. If the start of the key lets you roughly sort by time, performance improves.
It's not super uncommon for people to use a normal UUID (usually v1, NOT v4; you need a timestamp), then restructure the fields so the timestamp is in front on save, then flip it back on load; this gives you a "proper" UUID, but (in theory) gives you better performance. See, eg, here: https://www.percona.com/blog/2014/12/19/store-uuid-optimized...
Now, ULID mades an odd attempt at making keys strictly sortable by time, which 1) doesn't work and 2) is pointless. :-) In that case, you really would be better off with an auto-incrementing idea. But while their implementation is questionable, the idea isn't absurd.
> sortable (time component)
You can easily prepend a unix millisecond timestamp to a UUID if you want time-sortability, i.e.
174e1da9377-d739f4ed-6aa1-4c99-a0bf-3b0b68e9ce77
where 174e1da9377 is a unix timestamp in milliseconds.
> short (26 chars)
Short == increased possibility of a collision. If you're okay with this possibility, you can just use UUIDv4 and truncate it.
> and nearly human readable
Convert your UUID into base26 or whatever you wish for human readbility.
> and good enough entropy/randomness
Subjective, if you are okay with "good enough entropy" you could arguably just use a random string of the number of bits you'd like.
This makes a ton more sense than messing with the UUID itself.
Also, for most applications you can just add a timestamp record and sort by that.
> Does anyone have any criticisms of ULIDs?
Only that they aren’t more widely adopted or supported.
I use UUIDs in things like OBSERVER support[0]. They just need to be unique within the context of the runtime, so there’s no need for anything more.
However, for things like Bluetooth Attributes, I think ULIDs would be a good thing. That ship has probably sailed, though.
[0] https://github.com/RiftValleySoftware/RVS_GeneralObserver#ob...
Well, you might not want your ids to be sortable by time (you don't want it possible to see which thing was created before which). So the unsortable nature of UUIDs can be a feature.
+1.
Combine that with KV stores like badger or rocks that store things in lexicographic order - ULIDs really comes in handy to have a random ID but still be able to do a scan in sorted order
I "independently invented" ULIDs for a past work project.
Not a criticism, but something to be aware of, is that SQL Server doesn't sort the uniqueidentifier data type as one would expect. Instead of being ordered by bytes 0-1-2-3-4-5-6-7-8-9-10-11-12-13-14-15, uniqueidentifier values are ordered by bytes 10-11-12-13-14-15-8-9-7-6-5-4-3-2-1. So you want the timestamp in the last 48 bits instead of the first 48 bits.
Well, this is beyond my ability to grok.
Does this apply to ULIDs?
It will, but looking at the SQL Server implementation[1] it looks like they're already aware of that.
I never get this. Why are you sorting by ID?
Don't know about the OP, but there are some rarer cases, like dynamodb, where more indexes means more $, you can avoid creating a timestamp column & new index by having a sortable id column.
But even without the need to sort by id, one of the advantage is that sequential ids makes it easier for databases to fetch data, as data stored ends up being more sequential in disk as well.
More reading on that:
- https://eager.io/blog/how-long-does-an-id-need-to-be/#locali...
- https://www.percona.com/blog/2019/11/22/uuids-are-popular-bu...
Well, for some databases, yes, this is "clustering", and there's a command to deal with it ([0] is postgres, which doesn't do this particularly well, other flavours of database have similar commands).
But it's the usual thing with databases: if you're trying to optimise performance by hacking the engine, then your schema probably needs a good hard looking at.
[0] https://www.postgresql.org/docs/current/sql-cluster.html
I don't consider being aware of your data distribution to be "hacking the engine". It's part of a good DBA's job. For any data set of appreciable size you can't treat the data as a black box. Even if you're not clustering the underlying table in postgres the performance of your indices will be heavily affected by data layout.
I agree. Hacking the ID field to include index data so the storage clustering is a bit more performant is not "treating the data as a black box" to my mind.
In my case, I don't really need it, but what I read there seems to be a DB performance boost by having it. And finally, if I wanted it down the road, it's already there.
Default sorting out of the box is kind of nice anyways.
I am wondering how this differs from KSUID
KSUIDs are slightly longer, base62 encoded, have a different epoch, have second precision in the timestamp component, have a different length random component, and omitted the bizarre attempt in the ULID spec to enforce strict monotonicity.
Functionally, they're similar, but I think a lot more work has gone into KSUIDs in terms of usability and ensuring they solve the intended problem. The only advantage ULIDs have is they are the same length as vanilla UUIDs, if such is useful to you.
https://tools.ietf.org/html/rfc4122 is good reading.
Agreed. It should be required reading for anybody making heavy use of UUIDs.
In my experience, "UUID" often means nothing more than "take 16 bytes from a CSPRNG".
The different strategies are interesting, but I don't understand the bit flags. Why would you want your ID to include a record of how it was generated? Some use cases go out of their way to obfuscate that information, and one selling point of UUID over auto increment is that you can't infer missing items or how many are being created by watching PKs.
It's cool that I can make deterministic UUIDs, but it seems silly that the spec should require me to expose the fact that it's deterministic. It's not like my deterministic UUID is any more likely to collide with someone else's random one than two deterministic or two random are to collide with each other.
I don't know the “real” historical reason, but the deterministic UUID versions aren't really for the case where you want to obfuscate ID generation. They're more useful for shoehorning existing database-key-like information into slots where UUIDs are expected. In fact, I very recently ran into a thus-far-hypothetical situation with a set of in-memory objects identified and indexed by UUID, where “real” such objects have UUIDs that are also lookup keys for information elsewhere, and where I wanted to be able to deterministically derive a secondary, distinct “fake” object from a “real” one. My mental design sketch used SHA-1 UUIDs for this. (Sorry for handwaving over the specifics.) Though I do think the digest-based UUID versions are noncentral, especially in greenfield.
From another angle, if you look at the difference between version 1 and version 2 time+node UUIDs, they're actually quite similar except for version 2 having a UID/GID-like field replacing some of the least significant bits of the other data. So collisions between v1 and v2 would be much more possible without the version bits. And at that point there's no reason not to use similar flags to refer to other generation methods; even if in theory you could decide that they're unlikely to create collisions, that just feels like it leaves the door open for edge cases and having to remember which of those decisions were made. Why not be consistent?
From the opposite angle, in a case where you're using random UUIDs generated within a trust boundary as lookup keys, the knowledge that one of them is random doesn't tell an adversary on the outside anything useful about the others, because they're all random, so it's a few harmless “wasted” bits. Of course, since UUIDs didn't have “unguessable” as a selling point to start with, if that's something you're leaning on, then go ahead and use something else…
Xcuse me asking, but is there some standard way to make a copied partition in Ubuntu bootable as an unique entity? First you change the UUID in Gparted and then you waste an hour rebooting and hunting for references to that old UUID. Not good.
This question might be better asked here: https://askubuntu.com/
I think I remember creating the partition first, then copying the partition from one to another with dd, and then letting grub figure the situation out. Maybe it's ill-advised to do this, but it can work.
Except /etc/fstab and /boot/grub/cmdline jump right into that that old disk. Also grub-install sometimes finds the old UUID from some mystery hole, and you have to change that manually.
Along with the comments about ULIDs and KSUIDs, some people might want to check out C4 IDs. Webpage at http://www.cccc.io/, which has a link to its Github page.
A website asking you to drag in a file, yet it doesn't support TLS. Any MITM can trivially modify the delivered javascript to exfiltrate the file or show a wrong identifier for it.
In addition, those "drag a file to upload" dialogs are just awful. I don't have a file explorer installed that supports dragging (I just use midnight commander). If you give me a 'browse' button, I can use the gtk file-picker to get something, but "drag and drop" doesn't work. It's also an accessibility issue: screenreaders can click buttons, but they're much worse at dragging files into specific webpage areas.