UUID, serial or identity columns for PostgreSQL auto-generated primary keys?
From the article (cybertec-postgresql.com):

> Now, sometimes a table has a natural primary key, for example the social security number of a country’s citizens.
You know, you think that, but it's never that simple. The field gets added incorrectly and nobody notices until the value is in countless tables that you now need to update simultaneously. Or the value is something that's supposed to be semi-secret, so now low-level support staff can't reference the row when dealing with a request. Or the table's requirements change and now you need to track two different kinds of data, or data that is missing the field.
Me, I always just have the table make its own ID. It is just simpler, even when you think it is overkill.
In Spain each person has a unique ID number assigned at birth. The numbers for newborns are geographically pre-distributed to guarantee uniqueness despite delays in paperwork. It is universally accepted that this ID "number" (it actually has one letter too) is all you need to identify yourself, ever.
Except that I knew a coworker who had a duplicate ID. An extremely rare event: they messed up the pre-assignment and there is another dude somewhere with the same ID. So from time to time, some system would tell him that his ID was already registered. A lot of banks and private healthcare systems like to use the DNI as a username.
He tried to get his ID changed, but that was such a foreign concept to any of the involved institutions, that he had to give up because there simply is no such procedure. I guess he could have taken it to court, but the guy decided to just live with it (the justice system is quite slow here).
The fact that it's fixed / can never be changed is a massive problem with social security numbers. That, and the fact it's often used as authentication instead of identification. They're moving away from that slowly, but it's taking a lot of time and effort.
The problem with SSNs is specific to the US. Other countries have sane ways of authenticating citizens, and the personal ID number is just used as a global foreign key for every government, public, or bank database where you need to uniquely identify a citizen.
It does seem that a "natural key" is frequently just a really foreign key in a database you and your org don't manage.
That's a good observation. The only meaningful distinction between a natural key and a surrogate key is whether the number ever escapes the original system.
For instance, a driver's license number is printed on the card itself, so a human sees it. Therefore, it's a natural key, just like a name.
When you decide that whatever natural keys already exist aren't good enough for your organization, and you make a new key, it's not good to think of that as a surrogate key. The number will make it out somehow (as a "record locator" in a customer support call or something), and eventually become a natural key.
It's best to just plan for any new key to be a natural key, which means using best practices for natural keys. That means it should be something reasonable to print, read, say, and hear; and it should also follow a pattern so it can be distinguished from other special numbers.
Auto-increment is a shortcut, but usually not great in the long term unless it's something that will be well-contained inside the database as an implementation detail (e.g. a join key designed to refer to rarely-accessed fields of a wide table).
> > Now, sometimes a table has a natural primary key, for example the social security number of a country’s citizens.
> You know, you think that, but it's never that simple.
It’s that simple if you’re the Social Security Administration and it’s a table of Social Security Accounts, not people.
Other than that, using SSNs as a primary key is just plain wrong.
> It’s that simple if you’re the Social Security Administration and its a table of Social Security Accounts, not people.
Nah, they keep track of duplicate usage: https://www.nbcnews.com/technolog/odds-someone-else-has-your...
> The IRS often knows when this happens, when the imposter pays taxes. The Social Security Administration knows, too, for the same reason. And the nation's credit bureaus usually know, because the imposter often ends up applying for some form of credit. Plenty of financial institutions also have access to this information.
Maybe using them as passwords is what's wrong.
How do you cope if you want to record people who don't have an SSN or equivalent? (e.g. I have none, because the country of my citizenship doesn't issue anything comparable)
Or if you're expanding to countries whose SSNs clash with each other?
SSN as PK is wrong regardless of whether you're doing SSN as password.
Right? Credit card numbers as a primary key is far more efficient.
I finally got our company to standardize on someone's employee number as a primary key for everything employee related. It's a simple monotonically increasing integer value -- the best possible primary key.
We moved to a new HR system and they have a set of "reserved" employee numbers that cannot be used and we have employee numbers in that range. Arg!
We had a classic situation at a software house I worked at in the '80s - employee numbers were 1-999 and then jumped to 5,000 - because, you guessed it, this "unique" field was used with magic numbers 1,000 - 5,000 being reserved for project ids in various key accounting systems!
And we were supposed to teach our customers good design principles...
In an event-sourced setup, I found that for projections this is less of a problem. And it opens some possibilities, like more semantic schemas and easier, simpler APIs.
A projection in ES is more of a cache, not your primary store. The primary store is the event log. The latter should, obviously, never use natural IDs.
I've been bitten by using natural keys on several occasions, but I can't think of a time surrogate keys failed me beyond the tediousness of implementation.
In the Netherlands SSNs are not unique, they handed out some duplicate ones back in the day. So not great as a primary key. Besides, I think using them as primary keys is illegal anyway.
In Denmark they are specifically not allowed to be used as a primary key. I mean, many do, but technically you're supposed to have a separate internal ID and then use the SSN to look up that ID.
Huh? How does sign-in work for people with duplicate ID’s?
SSNs predate sign-in and the internet. Anyway, there is a digital government single-sign-on solution called DigiD, but you need to create an account for it using an e-mail address as the sign-in.
Imagine a man with a stick or gun coming to your office to shackle you in chains for how you arranged your database schema.
About UUID as Primary Key and performance, the following article has some insights and benchmarks as well: https://www.2ndquadrant.com/en/blog/sequential-uuid-generato...
Essentially, they observed sizeable performance improvements by using UUID generators that are tweaked to produce more sequential results, which leads to better indexes. The article compares sequences, random UUIDs, and two kinds of sequential-ish UUID generators.
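For illustration, here's a minimal sketch of the general idea behind a "sequential-ish" UUID: a coarse timestamp prefix followed by random bits, so IDs created close together in time cluster in the index. This is not the article's exact generator, and the function name is made up.

```python
import os
import time
import uuid

def sequential_uuid():
    """Sketch of a sequential-ish UUID (illustrative, not the article's scheme).
    A 32-bit prefix of seconds since the epoch means IDs generated close
    together in time share their leading bytes and cluster in a btree index."""
    prefix = int(time.time()) & 0xFFFFFFFF
    # 96 random bits for the remainder, to keep IDs unguessable-ish
    random_part = int.from_bytes(os.urandom(12), "big")
    return uuid.UUID(int=(prefix << 96) | random_part)

u = sequential_uuid()
# u's leading hex digits encode the current second; the tail is random
```

Note that a scheme like this trades away some unguessability for index locality, which is exactly the trade-off discussed downthread.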
Mentioned this in a sibling comment:
There's another benefit to UUIDs - you can generate them anywhere, including on the application side. Doing this on the application side can have tremendous batching benefits, and lets you insert objects with relationships at the same time (vs. waiting for the first insert to return an ID to be used in the FK).
Caveat programmer: this could be problematic, not in the sense it doesn't work, but in the sense that someone working on backend code may have a preconceived expectation that UUIDs are also effectively a keyspace i.e. they're hard to guess. The validity of that is already challenged by variants defining temporal or logical order, and evaporates completely if you let clients declare their own that you accept at face value. Applications may have potentially guessable/gameable object identifiers sloshing around inside as a consequence, which is modestly ironic given that one benefit many folks expect from adopting UUIDs in the first place is hardening up the attack surface of trivially enumerable sequences.
There are a few mitigations but my favourite is the "casino chips" approach: pregenerate them server side, and allocate to clients on demand, including en masse if need be ("here kid, have a few million UUIDs to get you started"). Verify with whatever simple signature scheme comes with your application server framework, or at small scale just toss them in a crude LRU store.
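A minimal sketch of that "casino chips" idea, assuming a server-held secret and Python's standard hmac module (the key and function names here are illustrative, not from any particular framework):

```python
import hashlib
import hmac
import uuid

SERVER_KEY = b"server-secret"  # assumption: a key held only by the server

def issue_chip():
    """Pregenerate a UUID server-side and hand it out with a signature
    (a 'casino chip'), so the server can later check that it minted the ID."""
    chip = str(uuid.uuid4())
    sig = hmac.new(SERVER_KEY, chip.encode(), hashlib.sha256).hexdigest()
    return chip, sig

def verify_chip(chip, sig):
    """Accept a client-supplied ID only if it carries a valid signature."""
    expected = hmac.new(SERVER_KEY, chip.encode(), hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, sig)

chip, sig = issue_chip()
assert verify_chip(chip, sig)                   # an issued chip verifies
assert not verify_chip(str(uuid.uuid4()), sig)  # a made-up ID does not
```

The point is that the server never has to trust a client-declared UUID at face value; only IDs it handed out verify.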
Or, remember where the UUID came from, and apply their organisational scope to any trust you place upon it. This might work particularly for multi-tenanted SaaS. However it requires that all usage is tenant-bounded end-through-end throughout your application. This may be in conflict with a) your framework, b) your zenlike contemplation of simplicity in data management, or c) programmers in a hurry forgetting to scope their queries properly.
Ultimately, relying on UUIDs as intrinsically unguessable security tokens is probably not a great idea, but it's one that remains thoroughly embedded in the programming zeitgeist. As ever, nothing useful comes without a compromise. Keep your eyes open to the systemic consequences of design choices, and don't leave traps for your fellow developers.
He's not saying clients can create their own ids. The applications can.
The concepts he's talking about are required for CQRS, which is a popular pattern applied mostly with DDD or microservices.
There definitely are people out there in this thread proposing clients be able to provide UUIDs. I’ve seen it elsewhere too.
I’ve also personally experienced UUID collisions due to badly set up VM environments under Windows. It isn’t a good idea to blindly trust any value - and that includes supposedly ‘never collide’ IDs like UUIDs.
For what it’s worth, I also had the joy of debugging someone’s distributed hash table that was using md5 as the hash bucket key (this was... 2 decades ago?) and had no way to handle collisions because obviously that is impossible.
This seems more like an issue with the library's random generator used to form the UUIDs.
E.g. I use GUIDs (.NET) and I have never seen an issue.
I getcha, but these days the ambit reach of "application" extends to Javascript executing client-side in an environment that's basically overrun with lions/tigers/bears, and I'll suggest that's particularly a consideration when the front-end is a SPA participating in a CQRS/event-sourced overall application architecture.
For perspective, the npm uuid package is now being downloaded ~50M times/week. Its usage is ubiquitous at this point, on any platform JS is running.
A little late to reply.
Unfortunately that doesn't mean much.
Node.js is a server-side runtime and pulls in that package too, so it's not solely for browser JS/SPAs.
Should you ever use a plain token (where you just check if it exists in some authed_users table) vs, I dunno, some sort of signed/HMAC type thing, where you have to call some function on it? I genuinely don't know but I know enough to generally leave authentication up to those that do know.
Maybe I'm just thinking of OAuth where there are multiple hops involved?
The comparison to OAuth is quite reasonable. Perhaps the most obvious parallel is the use of a state parameter during the three-legged exchange, without which it's exposed to a CSRF attack.
Right. Maybe it's paranoid, but it seems like a bearer token has potential avenues for forgery (CSRF or others), replay attacks, add-on jacking, etc. Also harder to coordinate with distributed apps. I think the Captain Tightpants approach would be to initialize some client-side private key/cert, use that to sign each request and verify based on that cert.
That should also make it easier to, say, verify and unwrap the request at the gateway to the server, before sending it to the rest of the application-proper.
In Java there is a UUID generator based on SecureRandom. That's about as unguessable as you're going to get.
It's not a question of whether UUIDs can be generated unguessably. They can be, as you point out.
It's whether the UUIDs in your system can be reliably presumed to be unguessable - including the UUIDs that were generated by code which was written after you wrote your query that assumes unguessability.
Today you might say "Oh, this SecureRandom-based UUID generator is unguessable and meets all of our requirements". Tomorrow you might say "Ah, this SecureRandom-based UUID generator is too slow, let's generate them in our Android app instead of on the server". But now the UUIDs stored in your database aren't reliably unguessable, because you accept whatever your client API tells you without verification. How plausible is it, within the timeframes you actually get, to review every query for whether it assumes the trustworthiness of the UUID generation? Better to assume UUIDs have some convenient properties, than to assume that they're unguessable just because the API is cryptographically secure today.
I think you can probably get the same "batching benefits" if you use a global ID generation service of some sort, with sequential IDs to improve indexing. Using sequential IDs doesn't necessitate using auto-generated sequential IDs.
Correct, but a global service needs local caching. Some "HiLo"-type identity generation could work. Sequential UUIDs simplify things since you don't need that service, so they're preferable.
HiLo explanation below:
A client simply gets a range of IDs to be used, exclusive to them, and can then use it without any round trip to the server.
This can be achieved with a "hi" value stored on the service, and a batch size that's constant in the system.
Each time a "hi" is requested, it's incremented by 1. The client can then generate IDs in the range [hi * batchSize, (hi+1) * batchSize - 1].
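The scheme above can be sketched like this; the in-process counter stands in for the server-side "hi" sequence, which in practice would be a database sequence or a tiny ID service (one round trip per batch):

```python
import itertools

BATCH_SIZE = 1000  # the constant batch size shared across the system

# Stand-in for the server-side "hi" sequence; in reality a DB sequence
# or ID service that is incremented once per handed-out batch.
_hi_sequence = itertools.count(0)

def request_hi():
    return next(_hi_sequence)

class HiLoGenerator:
    """Client-side generator: fetch one "hi" per batch, then hand out
    BATCH_SIZE ids locally with no further round trips."""
    def __init__(self):
        self._next = 0
        self._limit = 0  # exclusive upper bound of the current batch

    def next_id(self):
        if self._next >= self._limit:       # batch exhausted: one round trip
            hi = request_hi()
            self._next = hi * BATCH_SIZE
            self._limit = (hi + 1) * BATCH_SIZE
        value = self._next
        self._next += 1
        return value

gen = HiLoGenerator()
ids = [gen.next_id() for _ in range(2500)]  # spans three "hi" batches
```

With a fresh counter the 2500 IDs come out as 0..2499 while needing only three trips to the "server", which is the whole point of HiLo.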
The Postgres JDBC driver does not guarantee that batch inserts come back in the same order that you insert them (when you use RETURNING *). So, if you generate UUIDs server-side, you can't conveniently match them up with the records you just inserted. You're better off generating them in the app server first and then sending them to Postgres.
Plus, it's way better to generate a UUID in the app server and send it to the server as a string. The reason is that you can deal with the id as a plain string and don't have to deal with a non-standard datatype in your app.
Another benefit to UUIDs is merging tables/databases without key conflict.
Another benefit is that clients can generate and store objects before sending them to the server.
Allowing simple caching, easy async, easy handling of offline, or far simpler clientside code for dealing with those objects. etc.
For mobile app development which relies on an online (http) backend, clientside generatable UUIDs offer almost only benefits.
There's nothing that prevents you from fetching a batch of N IDs from the server. Server just does +N on the sequence and you can do whatever you like with the IDs client side.
You can also use one sequence for everything on the server, and then you can also pre-create ID based relationships client side.
Made the same claim here: https://news.ycombinator.com/item?id=27347243
You can just use PostgreSQL's writeable CTEs to get the same batching benefits plus the benefits from using serials. So, no, I do not think batching is a good reason for using UUIDs.
Writable CTEs sound like a very niche feature I haven't used and don't plan to use. Batching still feels like a good reason to me.
I use a ulid[1] as a uuidv4 replacement:
They almost got it right; a better implementation would overflow regularly to make use of the entire key space and, counterintuitively, be more resistant to overflows.
Clocks aren't reliable enough for timestamps anyway, so garbage collection is the only thing you kinda want to rely on them for.
A good sweet spot seems to be 32 bits of milliseconds + 96 bits of entropy. This overflows approximately every 50 days, allowing for 50-day rolling data retention.
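A quick sketch of that layout, assuming the 32+96 split described above (the function name is made up):

```python
import os
import time

def rolling_id():
    """128-bit ID: a 32-bit millisecond timestamp that deliberately wraps,
    followed by 96 random bits of entropy."""
    ms = int(time.time() * 1000) & 0xFFFFFFFF  # overflows on purpose
    entropy = int.from_bytes(os.urandom(12), "big")
    return (ms << 96) | entropy

# 2**32 milliseconds is roughly 49.7 days - the ~50-day window mentioned above
wrap_days = 2**32 / (1000 * 60 * 60 * 24)
```

The wrap-around is the feature: after ~50 days the timestamp prefix recycles, so the whole 32-bit space gets used instead of a slowly-growing prefix.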
Not the worst idea—50 days is a nice sweet spot between infrequent enough to have some indexing benefit and frequent enough that potential downsides will be discovered early in the product’s life cycle.
Personally I wouldn’t do this. A scenario where for each individual millisecond of elapsed time, 96 bits of entropy is an upgrade over 80 bits of entropy, is fairly extreme. I don't think there are many databases in the world which would ever need more collision mitigation than that.
> I don't think there are many databases in the world which would ever need more collision mitigation than that.
Individual instances? Maybe not. But for those an autoincrement key would also work. That is not the scenario that ULIDs and GUIDs are advertised for.
The goal is to have a universally/globally unique ID. So whenever you encounter two IDs you can be (reasonably, probability-wise) sure that they won't collide.
Any such scheme thus must, by definition, serve every single use case now and forever, everywhere. That's a tough one.
Also it's not really 80 bits vs 96 bits (which due to the birthday paradox is already a huge difference) but more like 80 bits vs. 128 bits, as the timestamp is recycled with sufficient usage.
I'm actually concerned that 96 bits isn't enough, as it relies on the assumption that you'll use this scheme for data spanning years in order to properly use the timestamp as entropy.
I was debating between using ULID vs just using the same sequential bigint sharding strategy that Instagram uses[1][2].
I ended up deciding to use sharded bigints because it enables better indexing, even though there are drawbacks when first writing the instances; The benefit of faster search was more important for me.
[1]: https://instagram-engineering.com/sharding-ids-at-instagram-...
[2]: http://www.livinginthepast.org/2016/02/24/sharding-into-bigi...
Would you recommend this over https://github.com/ericelliott/cuid ? I use cuid with Postgres and it’s worked out great.
They seem like what I'm looking for. I was looking at moving to something more random because our current UUIDs (uuid1) are too easy to mistake for one another at a glance.
I was hoping someone would mention these. It's really quite nice: 128-bit, sortable, client- or server-generated, no collisions (well, almost zero: 80 bits are random per millisecond).
There is also CUID:
Careful with leaking sensitive information with semi-sequential UUIDs though.
To alleviate the issue of having a sequential part, they make it wrap around so that you cannot tell the order between two UUIDs. It's already some protection, and the random part is still large.
A little late to comment here. But for database IDs, I have found that Instagram's technique to generate IDs works very well: https://instagram-engineering.com/sharding-ids-at-instagram-...
They are not serially incrementing but still sortable, which prevents the index fragmentation issues observed with UUIDs. They are 8 bytes in length, so the index is smaller compared to UUIDs. So you get all the benefits of serial IDs, but they are not easily guessable, preventing sequential-access attacks.
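A rough sketch of how such a 64-bit ID can be composed, following the bit layout the article describes (41 bits of milliseconds since a custom epoch, 13 bits of logical shard ID, 10 bits of a per-shard sequence); the epoch constant here is illustrative, not Instagram's exact value:

```python
import time

EPOCH_MS = 1293840000000  # assumption: 2011-01-01 00:00 UTC as custom epoch

def make_id(shard_id, seq, now_ms=None):
    """Compose a 64-bit ID: 41 bits of ms since EPOCH_MS, then 13 bits of
    shard ID, then a 10-bit per-shard sequence. Time-ordered, not serial."""
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    ms = (now_ms - EPOCH_MS) & ((1 << 41) - 1)
    return (ms << 23) | ((shard_id & 0x1FFF) << 10) | (seq & 0x3FF)

a = make_id(shard_id=5, seq=0, now_ms=EPOCH_MS + 1000)
b = make_id(shard_id=5, seq=1, now_ms=EPOCH_MS + 2000)
# later IDs sort after earlier ones, and the shard is recoverable from the bits
```

In the article's setup the shard and sequence parts come from PostgreSQL itself (schema-local sequences inside a PL/pgSQL default), but the bit arithmetic is the same.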
> With more than 25 photos and 90 likes every second
What unimaginable scale
That was in 2012, when they "only" had 15M users
Today, a decade later, they're at 1.074B
> they are not easily guessable
I don't see how that's true. From reading the article you linked, you only need a valid shard ID (which you can extract from known IDs), the millisecond (which is guessable) and a 10-bit sequence (which you can easily brute-force).
(And that's completely fine if their security model doesn't require unguessable IDs.)
>> which you can easily brute-force
It will result in a very high number of 404s. These can be monitored and the origin IPs can be banned.
2^10 is 1024, so hundreds of requests. Not a very high number. And since it’s a counter, even less. Easy to disguise; the official app is likely to do more requests in a one-minute session. And obviously, sophisticated attackers aren’t limited to one IP.
I’m suspecting you meant “easily guessable” in the human sense, not the cryptographic/security sense. My bad if I misunderstood you. Again, I’m not saying Instagram has any security problem, I’m just saying that this ID scheme in particular isn’t a security feature.
Meta: this company wrote an impressive number of articles about PostgreSQL since 2013. List at https://www.cybertec-postgresql.com/en/tag/postgresql/
Also, here's a list of blog posts from Laurenz Albe (the author of the OP post): https://www.cybertec-postgresql.com/en/author/cybertec_albe/ His blog posts are a great read, I'd recommend checking them out!
I just had to do a double take as I was reading a stack overflow post at the same time and recognised it as the same author.
Laurenz (the author) was Postgres person of the week not too long ago: https://postgresql.life/post/laurenz_albe/
UUIDs are great when you use the ID "publicly" but an incremental value would be too revealing for various reasons.
So it's good to know that performance is not bad.
I don't think much of the argument that integer IDs reveal too much.
Yes, they are guessable but your application should not rely solely on the "secrecy" of the ID to authorize access to a record. If you are worried about someone crawling your public API with wget or curl and an incrementing counter you should re-think whether your data are really public or not, or maybe rate-limit anonymous users, etc.
They also reveal something about the total number of records in your database, I guess that could matter in some contexts but it's never really been an issue in practice for me.
I have definitely used the technique of defining non-overlapping sequences in different shards (with Oracle, not Postgres, but they are very similar in this regard). It worked very well and was easy to reason about.
As a developer, the big issue I have with UUIDs is that they are impossible to read or type. You have to copy/paste and it isn't easy to look at two UUIDs and tell if they are different.
I use integers in general unless the specific use case demands something else.
> Yes, they are guessable but your application should not rely solely on the "secrecy" of the ID to authorize access to a record
Any information you give to a potentially malicious actor can help them attack you. If you have a choice between leaking information and not leaking information, I can’t imagine why you would ever intentionally go with the former, unless you didn’t actually have a choice (feasibility, etc.).
As an example, maybe I needed to CSRF someone but needed their ID (say, in a b2b app where IDs are not generally public) - with sequential IDs I have a decent chance of cracking it with a small number of requests, especially if it’s a small userbase. Sure, the CSRF was the main issue, but this is a contrived example to illustrate the point.
Admittedly, IDs are oftentimes public information by necessity - but there’s no need to allow them to leak extra information.
As soon as those IDs are used by any other people or business processes in any way whatsoever, their usability starts to matter and arguably in most cases is simply more relevant than a minor hypothetical advantage to an attacker.
For example, if some customer ID is used by customers in communication e.g. when calling you on the phone over some billing issue, then there would be strong advantages if your IDs aren't unnecessarily long and if they include some explicit redundancy (e.g. a check-digit with Luhn formula) to protect against communication mistakes.
For another example, if your IDs aren't visible to outsiders but are used in your internal business processes then it may be quite valuable to ensure that IDs of different types (e.g. customer ID vs account ID vs transaction ID) are obviously distinguishable in some way so that someone seeing XXXXXXX knows that it likely is a transaction ID and definitely can't be a customer ID; and it's quite valuable to ensure that you can't have accidental collisions where the same number is a valid ID in different key tables so a bug or miscommunication that confuses them would result in data corruption or information disclosure instead of simply failing.
So "ID design" deserves some attention from UX perspective and blocks of random data aren't optimal UX.
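As a sketch of the Luhn check digit mentioned above (helper names are mine), appending one computed digit lets a customer ID read over the phone be validated against single-digit typos and most adjacent transpositions:

```python
def luhn_check_digit(payload):
    """Compute the Luhn check digit for a string of digits."""
    total = 0
    for i, ch in enumerate(reversed(payload)):
        d = int(ch)
        if i % 2 == 0:  # these positions are doubled once the check digit is appended
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return str((10 - total % 10) % 10)

def luhn_valid(number):
    """True if the last digit is the correct Luhn check digit for the rest."""
    return luhn_check_digit(number[:-1]) == number[-1]

# Classic example: payload 7992739871 has check digit 3
assert luhn_check_digit("7992739871") == "3"
assert luhn_valid("79927398713")
```

The same formula is what payment card numbers use, so libraries and even humans with calculators can verify IDs without access to your database.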
So we run surveys among general and specialized audiences (among other things), and these surveys link to custom scripting, images, videos, etc. The URLs have to be freely accessible, but if they are sequential, anyone can simply try to guess what's in other surveys, potentially getting information about their competitors.
This is an example where you don't need a UUID as the key (since you could have another field that stores this "secret" value), but it makes it very convenient if you do use UUID as primary key by default because you get that "secret" value for free (no need to create another column and index). In my projects I use it by default for all models. It comes in handy. Another use case is needing to know the primary key before inserting into the database (at either the front end or the backend, but typically the backend).
True, the record contains both a classical sequential ID and a UUID (to maintain backwards compatibility), but now everything is linked through the UUID instead of the ID. Convenient, indeed. And there's never much data associated with a single UUID, so performance is not an issue.
I had this issue in a case I think is interesting; a customer had a database with incremental IDs of a certain product they sold. On a web platform, the product owner in turn could log in and view a list of their products and their status. The id of the product was part of the URL; /product/851. Of course, the product owners could not get any information on IDs they didn’t own, but the numbers gave away info on how many devices existed before them. And they wanted to hide that information.
Of course, there are many ways to solve that situation, but UUIDs is one.
It's the German tank problem.
Serial IDs, with some light assumptions, leak information about the total count of items.
Just pick a random number at the beginning, and start incrementing IDs from there. Like personal checks starting at 1000 so they're always(ish) 4 digit. Of course, maybe pick another starting number that's less obvious.
That's not effective at all.
Still leaks count -- you do similar stuff to estimate the minimum and the maximum.
From a security perspective, you're right.
They can inherently leak how much data you do or don't have, which you may not want your competitors to know.
There's another benefit to UUID - You can generate them anywhere including application side.
Doing this on the application side can have tremendous batching benefits, and lets you insert objects with relationships at the same time (vs. waiting for the first insert to return an ID to be used in the FK).
I think ideally your primary key is whatever makes sense for your performance/data model, and then if you want to delegate authority with UUIDs you do that via a separate mapping.
By separating that out you can get a lot:
1. You can extend your delegate system by modifying the delegate table, rather than having to muddy your data model with authority information
2. You can TTL the mappings in the delegate table
3. You can have many mappings to the same data, but each one can have its own restrictions.
It's a bit more complex but you end up with a system that really homes in on the benefit that you're bringing up.
That sounds really overkill for simply wanting an identifier that doesn't change, isn't guessable, and doesn't reveal how many rows there are
We'll have to agree to disagree. Systems like the ones I described are a hell of a lot easier to build on long term while maintaining those invariants.
How is that easier to maintain than literally just having a UUID primary key? I feel like we must be talking about different use cases
If your goal is to use the uuid as a delegated capability it's going to be much more complex to use the primary key for your row than to use a separate key.
Yeah we are talking about different use cases entirely
I’ve stuck with incremental values internally but use Hashid to convert them when exposed publicly. Seems to work well.
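For illustration, here's a minimal reversible integer obfuscation in the same spirit. This is not the Hashids library (which also salts and base-encodes the result); it just shows the core trick, with arbitrary constants:

```python
# Multiplying by an odd constant modulo 2**32 is a bijection on 32-bit
# integers, so the mapping is exactly reversible via the modular inverse.
MULTIPLIER = 0x9E3779B1               # arbitrary odd constant
INVERSE = pow(MULTIPLIER, -1, 2**32)  # modular inverse (Python 3.8+)

def obfuscate(n):
    """Map an internal sequential ID to a scrambled public value."""
    return (n * MULTIPLIER) % 2**32

def deobfuscate(x):
    """Recover the internal ID from the public value."""
    return (x * INVERSE) % 2**32
```

Consecutive internal IDs no longer look consecutive externally, while the database keeps its compact integer keys; this is obfuscation only, not cryptographic protection.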
Yes, I completely forgot about it. I used it a few years ago; I also tried [1], which uses integers instead of strings.
Hashid is super slow though.
What is 'super slow'? Takes milliseconds for me.
That's super slow, you could only handle a few hundred ids per second.
On the other hand, if you can get away with incremental ids it makes debugging much easier during development.
Not really. You should develop better tooling to visualize debugging information. Today's serious systems (this in my opinion includes e.g. collaborative rich text editors) are just too complicated to eyeball.
Pavel, a colleague of mine, is developing a new collaborative rich text editor for OrgPad, and here is how we do some testing currently: https://www.youtube.com/watch?v=VeVcNmNFzmc
We use UUIDs for basically everything. It is simple, and we have good tools for working with UUIDs. When we present any IDs to users, those are URL-safe Base64-encoded UUIDs. In some places we have "short" links that are about half the length - they cannot really be guessed but are much shorter to type/copy&paste/visually compare for our customers, who are not always super computer literate. :-)
Disagree. /user/edit/5 easily tells me what record it is about in the database without having to copy-paste a UUID.
Dev experience truly is a case of death by thousand cuts. I avoid every little cut I can like the plague so energy goes into making cool stuff.
> Should develop better tooling to visualize debugging information.
Thing is, why would I spend time overengineering tooling I don't need if I can get away with incrementing ids? When optimizing for value, I'd rather spend time solving business problems.
Your editor is very cool btw. And it's clearly a case where incrementing ids are not optimal.
Thank you for the reply. You are totally right about the dev experience and the death by thousand cuts.
I really was talking about serious/complicated systems, where you want some consistency in the components even if a particular case could just use an incremental ID. In my view, you then don't spend mental energy on switching between models. I just adjusted/rewrote a part of the system where we, for historical reasons, used logins instead of UUIDs. By using logins in this particular case, we had subtle bugs that wouldn't occur with UUIDs. It would never have happened had we used UUIDs everywhere from the start. UUIDs already have standard validation functions, and you don't have to think about stuff that really isn't your business problem. We are also talking engineering days thrown out of the window just for the change from logins to UUIDs later on. I don't think you waste that much time copy&pasting stuff.
I guess, we arrive at the same conclusion but have our differences about the ways leading to it. We both want to focus on business problems as much as possible.
You're not wrong, but (as I suspect is the case with a lot of us) the vast majority of my work is CRUD and I don't reach for heavyweight debugging tools unless printf() or the equivalent fails me (which is rare). Integer IDs work great in this situation.
Simple rules:
Use integer primary keys internally for identifiers and relationships.
Use English/other-language permalinks for URLs.
Use UUIDs in places like APIs, one-time action links, and "private" links that you only want to share with other people.
Worked fine for me for many, many years.
A vote here against integer/serial PKs, not only because they leak information, but also because they can result in incorrect joins.
IME it's much more often I've quickly made a table with a serial PK and later wished it were uuid; just about never made a uuid and later wished for the compactness or natural clustering of bigint. Maybe for a table of millions and millions of time-ordered events.
Note I said "internal use". But how can primary keys result in incorrect joins?
Unless you're changing a foreign key, joins will always be correct.
Unless I'm doing something wrong in the last 30 years of using SQL.
If you use serial integer ids and accidentally join on the wrong tables/columns, you will get rows back even if the join doesn't make sense, because all serial integer ids have values in common. If you're using UUIDs you will "never" get rows back when joining on the wrong ids, and you'll spot your mistake.
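The failure mode can be demonstrated with a throwaway sqlite database (the schema and values here are made up for illustration):

```python
import sqlite3
import uuid

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE users (id INTEGER, name TEXT)")
con.execute("CREATE TABLE invoices (id INTEGER, total REAL)")
con.execute("INSERT INTO users VALUES (1, 'ann'), (2, 'bob')")
con.execute("INSERT INTO invoices VALUES (1, 9.99), (2, 19.99)")

# Accidentally joining invoices.id to users.id (no FK relationship)
# still returns plausible-looking rows with serial ids:
wrong = con.execute(
    "SELECT * FROM users JOIN invoices ON invoices.id = users.id").fetchall()
print(len(wrong))  # 2 -- silently wrong

# The same mistake with UUID keys returns nothing, so it's obvious:
con.execute("CREATE TABLE u2 (id TEXT, name TEXT)")
con.execute("CREATE TABLE i2 (id TEXT, total REAL)")
con.execute("INSERT INTO u2 VALUES (?, 'ann')", (str(uuid.uuid4()),))
con.execute("INSERT INTO i2 VALUES (?, 9.99)", (str(uuid.uuid4()),))
empty = con.execute(
    "SELECT * FROM u2 JOIN i2 ON i2.id = u2.id").fetchall()
print(len(empty))  # 0 -- the mistake jumps out
```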
I think if you're joining against wrong tables or columns, then you have bigger problems than if the values are BigInt or UUID's.
"Don't ever be wrong" is one way to do it; on the other hand, sometimes we make mistakes anyway, especially in ad hoc queries, and not getting any results back at all helps to spot the problem quickly (more quickly than spending minutes befuddled by the nonempty resultset).
One thing that no one has mentioned so far is the extra space these ids consume.
Is avoiding that mistake really worth using so much more space to store a UUID value?
Well, I in fact did mention the "compactness" of integers at the start of this thread :)
Value safety instead of type safety. It's reasonably likely to work: even if you can't trust the UUIDs to be generated by a trustworthy source, when you test the query the mistake will show up, even if you accidentally join against a table that has a sufficiently similar schema.
I'm more familiar with MySQL where I don't have anything other than basic, automatically converted types.
But in Postgres, where there's more types, can I create a type that is representationally equivalent to a builtin type but not automatically convertible? So that if I say "select * from donkey left join plant on plant.id = donkey.animal" it says "error: cannot compare PlantId with AnimalId, use an explicit cast if you really want to do this". If this were possible and ergonomic, it would be better than a relatively cryptic null in the case that plants and animals have distinct (UUID) PKs.
> […] but also because they can result in incorrect joins.
Side question: can I get Postgres to throw an error if I try to join on two IDs where neither of the IDs have a foreign key reference to the other?
I wouldn't think so, and this sounds like a mighty footgun!
Something I really like about integer incrementing IDs is that you can run ad-hoc "select * from table order by id desc limit 10" queries to see the most recently inserted rows.
I end up doing this a lot when I'm trying to figure out how my applications are currently being used.
Strictly incrementing UUIDs can offer the same benefit.
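For instance, a time-ordered 128-bit id in the spirit of UUIDv7/ULID can be sketched like this (the 48/80-bit split mirrors UUIDv7, but the function is illustrative, not a spec-conformant implementation):

```python
import secrets
import time

def time_ordered_id() -> int:
    # millisecond timestamp in the high bits keeps sort order
    # roughly equal to insertion order
    ms = time.time_ns() // 1_000_000
    return (ms << 80) | secrets.randbits(80)  # 80 random low bits

a = time_ordered_id()
time.sleep(0.002)  # a later millisecond...
b = time_ordered_id()
print(a < b)  # True: later ids sort after earlier ones
```

With ids like these, "order by id desc limit 10" still shows the most recently inserted rows.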
Another point: if there's any temporal locality to your future access patterns - if you're more likely to access multiple rows which were inserted at roughly the same time - then allocating sequential identifiers brings those entries closer together in the primary key index.
I used to work on a reconciliation system which inserted all its results into the database. Only the most recent results were heavily queried, with a long tail of occasional lookups into older results. We never had a problem with primary key indexes (though this was in MySQL, which uses a clustered index on the primary key for row storage, so it's an even bigger benefit); the MD5 column used for identifying repeating data, on the other hand, would blow out the cache on large customers' instances.
To add on: if you are joining on a UUID, the join becomes quite slow with very large tables, like >10 million rows.
PG will say it's doing a hash lookup and you'd think it'd be fast, but it will take quite some time relative to joining two large tables with integer IDs. With UUIDs, PG will sometimes give up on the hash lookup and try to do table scans unless you adjust random_page_cost.
In general joining on UUIDs for large tables is a bad idea. It can be great if you are joining a single row to another row.
I once pondered how I might generate IDs that were as compact as a machine word, without a value (or small set of values) revealing the size of the data set. One application might be user-visible customer numbers that don't easily reveal how many customers there are.
I eventually came across the idea of using maximal period linear-feedback shift registers to transform an integer variable through every possible value (minus one), but in a non-incremental sequence that depends on the LFSR arrangement.
I never ended up putting the idea to use, but I've always been curious about people who have and how it worked out for them. [Edit to clarify: It was meant for obfuscation, not security against a determined attacker.]
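For the curious, the idea can be sketched in a few lines. The taps below (0xB400) are a commonly cited maximal-period choice for a 16-bit Galois LFSR; the width is only illustrative, and a real deployment would use a machine-word-sized register:

```python
def lfsr16_next(state: int) -> int:
    # one step of a 16-bit Galois LFSR with taps 0xB400 (maximal period)
    lsb = state & 1
    state >>= 1
    if lsb:
        state ^= 0xB400
    return state

# Walk the full cycle: every nonzero 16-bit value is visited exactly once,
# in a scrambled (non-incremental) order, before the sequence repeats.
seen = set()
s = 1
while s not in seen:
    seen.add(s)
    s = lfsr16_next(s)
print(len(seen))  # 65535, i.e. 2**16 - 1
```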
I've used a small block cipher like Skip32 or Speck to obfuscate database sequences, either on INSERT or as part of the encoding scheme.
This works well against the German Tank Problem when there's no oracle allowing an attacker to guess lots of IDs quickly (such as when there are reasonable rate limits). It does not provide enough entropy when such an oracle exists (especially an offline one).
For something like a password reset token, it still needs to be paired with suitably random bytes.
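A toy version of the approach, using a small balanced Feistel network. The round function and subkeys below are made up for illustration (this is not Skip32 or Speck, and not vetted for anything beyond obfuscation); the point is that any keyed Feistel is a bijection on 32-bit values, so each sequence number maps to a unique scrambled id and back:

```python
KEYS = [0x1F2E, 0x3D4C, 0x5B6A, 0x7988]  # illustrative fixed subkeys

def _round(half: int, k: int) -> int:
    # illustrative round function: any deterministic 16-bit mix works
    return (half * 0x9E37 + k) & 0xFFFF

def obfuscate(x: int) -> int:
    l, r = x >> 16, x & 0xFFFF
    for k in KEYS:
        l, r = r, l ^ _round(r, k)   # standard Feistel round
    return (l << 16) | r

def recover(y: int) -> int:
    l, r = y >> 16, y & 0xFFFF
    for k in reversed(KEYS):
        l, r = r ^ _round(l, k), l   # inverse round, keys reversed
    return (l << 16) | r

print(recover(obfuscate(42)))  # 42
```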
Please see my previous comment, feel free to give feedback.
So far I haven't encountered any problems in the short term by using the approach described.
The problem is that if your encoding algorithm leaks, it’s game over.
I just started a little side project and chose to use UUID for Postgresql keys. The schema is highly generic and I anticipate the possibility of merging instances. UUID precludes collisions in such a case.
That includes foreign keys?
Yes.
Postgres (and other relational DBs) really need to implement something like snowflake[1] or ksuid[2]
1 https://blog.twitter.com/engineering/en_us/a/2010/announcing...
I’m currently prototyping a little database+api+cli todo app and I want identifiers that can be abbreviated in the same way as partial git commit hashes can be used on the command line. What should I use?
I was thinking of generating random character strings and simply retry when the db throws duplicate key error on insert. No sharding is necessary and I’d like to have efficient foreign keys. Any thoughts?
You could try NanoID[0]? Seems available in many languages.
You could use a serial int and just hex-encode when interacting with the CLI? You could then use range queries to match short hashes by zeroing out the remaining bytes and using >=
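That trick might look something like this (the width and names are illustrative):

```python
WIDTH = 16  # hex digits in a full 64-bit id

def short_id(n: int) -> str:
    # zero-padded hex form shown to the user; any prefix abbreviates it
    return format(n, f"0{WIDTH}x")

def prefix_range(prefix: str) -> tuple:
    # turn an abbreviated prefix into an inclusive id range
    pad = WIDTH - len(prefix)
    lo = int(prefix + "0" * pad, 16)
    hi = int(prefix + "f" * pad, 16)
    return lo, hi  # then query WHERE id BETWEEN lo AND hi

lo, hi = prefix_range("00ab")
print(short_id(lo), short_id(hi))
```

If the range query matches more than one row, the prefix is ambiguous, just like an ambiguous short git hash.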
Check out linear congruential generators or other pseudorandom number generators. Then map the resulting number to letters.
I have no particular expertise with modern databases and it has been decades since I did any work as a DBA.
However, I cannot imagine creating table entries without a datestamp. No matter what else you are doing, or what you index by, I would want YYYY-MM-DD_HH-MM-SS in every row.
Maybe I'm just weird that way ...
Same. Every entity always gets a created column at the minimum, that way when we query later we can order by created to see the last few days worth of data first. Can't do that if you don't know when something was created.
And ideally there is a created time stamp and a last updated time stamp.
(at least in my cases) Ideally nothing ever gets updated, there's just a newer version of the row.
I feel the whole debate is overkill: 99% of businesses/systems will never have so much data that they NEED to use UUIDs. I personally don't like using integers for keys either, as I've been burnt by them before. I also doubt any software I build today or have built in the last 10 years will be used 100 years from now.
Recently I built a new system (typical business-type backend) and forced to use sqlite + C# + dapper. Using this combination I cannot use guid/uuid as dapper cannot properly map it back to c# from sqlite, and my dislike of ints got me thinking. I have a random string generator (have used it for years for things like OTPs and other reference numbers), where I give it an alphabet + length of the desired string. Using 8 to 12 characters, I can get a few million unique permutations. That is, if used as a primary key, a few million per database table. Then I hear in the back of my head the guys from work who would argue I'd run out of unique combinations or would have to do lookups to see if they exist. So I decided to slap the year and month on as a prefix, so a key might look like this: 2105HSUAMWPA. This gets indexed really well too, and there is some inherent information that can be seen from looking at the key: year 21, month 5, and then the unique bits. It's basically 4 lines of code that gets called on every new database entity. I think it will be easy to shard/partition the data too if the need arises in the future, by simply looking at the first 4 digits.
Thus to summarize:
Data is sliced by entity type (customer, invoice, etc), then by date (2105 for May 2021) then by unique string.
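A minimal sketch of the generator described (the alphabet, length, and the 2105-style prefix follow the comment; everything else is illustrative):

```python
import secrets
from datetime import date

ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"  # illustrative; any unambiguous set works

def new_id(length=8, today=None):
    today = today or date.today()
    prefix = today.strftime("%y%m")  # e.g. "2105" for May 2021
    body = "".join(secrets.choice(ALPHABET) for _ in range(length))
    return prefix + body

print(new_id(today=date(2021, 5, 17)))  # e.g. "2105HSUAMWPA"; random part varies
```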
What do you guys think about this approach? Anyone been burnt by something like this?
> Recently I built a new system (typical business-type backend) and forced to use sqlite + C# + dapper. Using this combination I cannot use guid/uuid as dapper cannot properly map it back to c# from sqlite
Are you sure about this? That would be pretty poor for a well-known solution in the ORM world, SQLite or not.
If you were going for sortability/understandability then I understand slapping your own together, but why not generate v1/v4/v6[0] UUIDs in your application and then send them along to the database, possibly prefixed with whatever you want it to be sorted by (though IMO you should just add that metadata to the thing being saved and sort on that properly)?
The problem is when reading the data back with Dapper (writing works fine) and it has to parse the guid stored in sqlite back into a .net Guid. Thus I resorted to making the column type text and ToString() all the Guid's before writing, but that's pretty gross. With a normal string field, sqlite can set it as primary key, make unique indexes on it, use it as foreign keys. Thus my attempt to use my own generator the generate id's. So far it works great but not sure if it will be fine with 5 years of usage (basically on tables that keep growing over time, like a comments table).
Bit of a rabbit hole: https://github.com/DapperLib/Dapper/pull/1082
Ahh thanks for noting this -- I've written off writing C# and PHP for the rest of my life for reasons, but this is a fascinating read.
On the other hand, C# + postgres is a super sweet combo. I don't use EntityFramework, so I don't know how well they play together, but if you write sql queries and run them from C#, it's super nice. I typically have a generic repository class that generates all the statements, and it works great. I recently made the query generator able to generate an inner join if I pass it filters that are not found on the table/object, based on a naming convention (any property that ends in "Id" and is not the main entity Id and not a direct property: use the part before "Id" as the join table name) - it works great too!
I also cache certain prepared statements in a static concurrent dictionary, so I can short-circuit if the query/insert/update has been run before for the specific object type. That way all of the string manipulation can be avoided for each query - I just reuse a previous statement but with different sql parameters. On every deployment the cache gets cleared because the app basically gets restarted, so it won't go stale. It's very cool (at least in my mind). The only indirect rule I have is that nothing else should modify the main schema besides my migrations, and not while the app is using it (using FluentMigrator code that executes on app startup).
If I may ask, why don't you like C#? I totally get the php thing - I cannot stand it at all either.
> If I may ask, why don't you like C#? I totally get the php thing - I cannot stand it at all either.
I'm just not good at it -- I came in expecting it to be better Java (better async ergonomics, slightly less verbose, etc) -- but after working with it (while working with a client no less) I found that I disliked it just as much as Java and had a relatively rough go of it. There are a lot of things that I think contributed:
- .NET4 was becoming .NET core / .NET standard while this project was happening
- I needed a windows VM to work on the project (dotnet was not ready for prime time yet, so I couldn't depend on it to build the legacy code)
- Packaging was uncomfortable and a source of pain (also had to do with the core/standard transition) -- after you've used NPM/yarn/cargo/stack(haskell) nuget+chocolatey (IIRC?) felt like a huge step backwards, basically had to be IDE driven.
- EF was so incomplete (due in part to the core/standard transition), and so much worse than my then-and-now favorite TypeORM.
- The async paradigm and how they handle results are a little weird, IIRC awaiting a task would mean you received the task, not the result of the computation of the task?
- IIRC there was no Option type. Java learned this lesson in 1.8 IIRC (and it rocked my world a bit, in a good way), but C# not taking it up is weird to me.
- The codebase was just like you'd expect an old Java codebase to be -- i.e. terrible, not fun.
- Everything was heavily IDE/visual studio driven which was not fun for me. Visual studio is an amazing tool no doubt, but a lot of it is unintuitive to me at this point after years of straight emacs/vim and occasionally sublime/atom/vscode. This is more personal than any other reason.
I just felt like it was so much worse than Typescript/Javascript for questionable gain -- I had worked with this same client to deliver a JS codebase that was easy for them to work on, on time and at budget. The C# side felt much worse to work on, and I just took it to mean I'm not good at C#, and I have no desire to be. The ecosystem you're almost forced to accept (appveyor, windows machines + powershell, etc) is just not my cup of tea.
I'm spoiled for choice these days -- if I need performance for a backend thing (and need to hand the project off to someone) I can choose Go. If I really need performance I can pick Rust. If I don't need performance per se I pick Typescript (it's still generally better perf than Ruby/Python). If I want to get it right (and really craft software) I pick Haskell (and of course it's near impossible to hand that off to most companies).
In the end C# is a great language (whether I like it or not), but it fell short of my expectations as a "better java". If I'm going to do Java I'll just do Java/Kotlin/Scala/Clojure. I think I could go my whole life without ever touching C# again, so was an easy rule to make for myself.
Got ya, thanks for explaining.
Legacy C# projects can be a mess, I've stepped away after a month or from one or two projects in the past simply because it was beyond fixable - I think there is clean way to use c# and there is an over-engineered way (typically java devs that try to use the exact same patterns in c#, which bloats it).
You are right, the past 3 years or so with .net core and .net standard were messy too! They should've just named it something different. Same with Entity Framework. I'm still steaming about them using the same names for everything, as there are plenty of cases where things don't work the same way as before. It completely screwed up googling things.
And any desktop development was broken too for a time (still bugging out with new .net core xaml editor) - so yeah, skip.
I primarily use C# for api's and tooling (my own build server is code I wrote myself in C#, hosting my own git repo's (not using gitlab/bitbucket etc; rather bare repositories!), and my own nuget server). But for anything web/desktop related, meh skip. In C# api land, you can get such clean/unbloated code with full type safety and compiler checks, I love it. Most C# devs use/throw exceptions but they really muddy the water and make it feel like Java. My own code only have try/catch when talking to the outside world (harddrive, network, etc), the rest of the layers are clean and with minimal null checks (cause at those layers I never have nulls). Exceptions are expensive/slow too, so rather catch it as early as possible and return a Result object (not built in, my own envelope) instead of rethrowing multiple times and doing a crazy amount of null checks. So it is possible to write "clean" C# but most commercial projects I've seen is pretty darn ugly.
Just an fyi, I don't use appveyor/windows/powershell/Azure at all. I work primarily from Fedora with Jetbrains products, with most of my tooling being hand rolled. I use the Digital Ocean Api to manage all my boxes and use SSH.NET (nuget package that can open ssh tunnels in C#, I use it for all sorts of things, incl reaching my db servers (ssh.net + Npgsql)) and FluentFtp if needed. So I don't really touch Microsoft's ecosystem at all. My builder code calls the dotnet sdk and I copy the build artifacts around as needed - no third parties involved. In the beginning I stressed about going this way but I see now how great it is - no need to worry about CI minutes or paying Gitlab or Bitbucket, or dealing with 10 different api keys or dealing with Azure etc. So it's totally possible to have sanity - but to be honest I cannot take "my way" to work as everything is hand rolled - people want to use Gitlab/Bitbucket etc else they lose their minds (other developers seems to be my biggest enemy these days - everything needs to match to their exact pattern else they quit). So yeah, own clients = own stack.
On the visual studio side, the IDE has gotten slower and slower... I have a 6 core cpu, 32Gb ram and a Samsung nvme drive, RX580.. yet visual studio lags like crazy, intellisense take 2 minutes to update etc.. that's with only VS open and no extra extensions installed. It also crashes sometimes while doing nothing out of the ordinary. So not sure what the heck they are doing to the project at Microsoft. On the otherhand, the jetbrains products have super slow startup times (try 10 to 30 seconds for Rider and Datagrip) but once in, they work great!
Another cool thing to note, Gitkraken sees my bare git repositories as valid remotes (provided ssh keys are in place) and pushes to them without issue (and near instantly, bitbucket push takes 20s no matter what I do).
Sorry for dumping all of this on you!
I meant intellisense take 2 seconds, not 2 minutes.
Yeah, a lot of it was a mix of the project, me not being good enough at C# in general to fix up the legacy adeptly, etc -- and I picked a bad time to do to it. You can't be good at everything, and I made a choice to just leave my C# skill where it's at.
> I primarily use C# for api's and tooling (my own build server is code I wrote myself in C#, hosting my own git repo's (not using gitlab/bitbucket etc; rather bare repositories!), and my own nuget server). But for anything web/desktop related, meh skip. In C# api land, you can get such clean/unbloated code with full type safety and compiler checks, I love it. Most C# devs use/throw exceptions but they really muddy the water and make it feel like Java. My own code only have try/catch when talking to the outside world (harddrive, network, etc), the rest of the layers are clean and with minimal null checks (cause at those layers I never have nulls). Exceptions are expensive/slow too, so rather catch it as early as possible and return a Result object (not built in, my own envelope) instead of rethrowing multiple times and doing a crazy amount of null checks. So it is possible to write "clean" C# but most commercial projects I've seen is pretty darn ugly.
That sounds interesting! Yeah I definitely prefer having types these days (basically don't pick languages without them anymore), and errors-as-values is one of the patterns that is a pearl of modern PL development. Languages that come out these days that still rely on exceptions are an instant turn off for me (ex. Dart). I remember something like 8-10 years ago having discussions around whether checked exceptions had meaning in Java and now that I look back all those discussions were so silly.
> Just an fyi, I don't use appveyor/windows/powershell/Azure at all. I work primarily from Fedora with Jetbrains products, with most of my tooling being hand rolled. I use the Digital Ocean Api to manage all my boxes and use SSH.NET (nuget package that can open ssh tunnels in C#, I use it for all sorts of things, incl reaching my db servers (ssh.net + Npgsql)) and FluentFtp if needed. So I don't really touch Microsoft's ecosystem at all. My builder code calls the dotnet sdk and I copy the build artifacts around as needed - no third parties involved. In the beginning I stressed about going this way but I see now how great it is - no need to worry about CI minutes or paying Gitlab or Bitbucket, or dealing with 10 different api keys or dealing with Azure etc. So it's totally possible to have sanity - but to be honest I cannot take "my way" to work as everything is hand rolled - people want to use Gitlab/Bitbucket etc else they lose their minds (other developers seems to be my biggest enemy these days - everything needs to match to their exact pattern else they quit). So yeah, own clients = own stack.
Wow thanks for this level of detail, I exclusively use linux on everything now, and it's great to know that this kind of setup is possible without resorting to a VM. As far as infrastructure/deployment goes, you manage your C# projects with C#, SSHing to them and doing stuff? What are the commands you run like? I'm not super familiar with easily calling C# from the command line, something like `dotnet run deploy.csharp`?
So I'm a huge proponent of CI (I think GitLab has the best CI out there) but it definitely is more complex than it should be. Setting up tokens and figuring out how things interact is really annoying 90% of the time, and the other 10% is bliss.
I'm actually about to relaunch a product I was working on -- one that makes it cheaper to run CI ($10 for 1000 minutes is what GitLab charges and I'm going to offer $15 for unlimited minutes per dedicated vCore/2GB RAM). More to your problem though, I actually make all my stuff work with Makefiles (so much so that I even deploy over-complicated infrastructure you wouldn't like with Makefiles[0][1]), has that not worked for you as far as fixing CI inconsistency? I find that generally knowing that to build I just need to run `make <target>` (assuming you have the correct system-level libraries installed of course) has fixed most of my issues in this area.
> On the visual studio side, the IDE has gotten slower and slower... I have a 6 core cpu, 32Gb ram and a Samsung nvme drive, RX580.. yet visual studio lags like crazy, intellisense take 2 minutes to update etc.. that's with only VS open and no extra extensions installed. It also crashes sometimes while doing nothing out of the ordinary. So not sure what the heck they are doing to the project at Microsoft. On the otherhand, the jetbrains products have super slow startup times (try 10 to 30 seconds for Rider and Datagrip) but once in, they work great!
This is unfortunate, crazy wild conjecture but do you think it's possible visual studio will be subsumed into VS Code? And yeah what you're describing is why I live in emacs/vim (and even emacs is too slow sometimes). I know that I'm giving up a LOT of creature comforts but for me it fits like a worn glove. I'm also constantly impressed with JetBrains -- the IDE experience has generally been good for me as well (I've had to use it before because everything else was just too painful), and it seems like they just keep their heads down and put out good product.
> Another cool thing to note, Gitkraken sees my bare git repositories as valid remotes (provided ssh keys are in place) and pushes to them without issue (and near instantly, bitbucket push takes 20s no matter what I do).
Interesting, is bitbucket push having an issue with web requests or something I wonder, 20s is a long time to wait! Since I use emacs magit[2] works for me and it is amazing.
[0]: https://vadosware.io/post/using-makefiles-and-envsubst-as-an...
[1]: https://vadosware.io/post/setting-up-mailtrain-on-k8s/#step-...
[2]: https://magit.vc/
> As far as infrastructure/deployment goes, you manage your C# projects with C#, SSHing to them and doing stuff?
Basically yes. C# console apps can be run like any other linux cli application, provided the dotnet runtime or sdk is installed (~40mb vs ~150mb). If you installed the sdk, you can build/publish/pack projects for production. You can specify the input directory and output directory, so you can easily call something like this: "Process.Start("dotnet publish ~/code/app1/app1.sln --output ~/builds");". Since you can start a process from C#, I can also git clone and git checkout like this: "Process.Start($"git clone {repo path} {output}");" where Process.Start lives in the System.Diagnostics namespace. So I built my own tooling as Console Apps that call other things (since I dislike bash scripts, yaml files (insert ci dsl) - the c# compiler protects me against a whole bunch of things and their System.IO namespace is awesome when working with files/directories).
So I can clone it, build it, open tunnel somewhere, copy it, straight from c# (although I have to setup ssh keys everywhere when needed, I use some trickery with "ssh-copy-id" that comes with ssh to add keys to new servers). So far I have 3 tools: a builder, a server manager (for droplets) and a tool that makes backups, but I'm looking at wrapping borg in the future as it is waaay better than mine. Next I need to build a tiny api to call "git init --bare" for new repositories, as I have to run it manually at the moment to make new project repo's.
> I'm going to offer $15 for unlimited minutes per dedicated vCore/2GB RAM
Careful, don't get abused by the crypto community; there has been some more incidents lately where miners abuse CI systems.
Thanks again for the detail!
> Careful, don't get abused by the crypto community; there has been some more incidents lately where miners abuse CI systems.
Will do! I’ll have to do some detection of the workloads. I know other platforms are fighting this as well.
It seems like int vs bigint is brushed off rather quickly here. bigint is twice the size of int, therefore indexing will be larger as well. Furthermore, all the FK storage and indexing will also be bloated by this choice. If you design a customer table with a bigint PK, and everything will point to customer (invoices, billing statements, etc), then that's not an insignificant amount of space. While most of us may want to have "billions served" like McDonald's, the reality is my company and your company will never have 2 billion customer accounts, even in the wildest of imaginations. If you ever did reach this point, it's "a good problem to have" and relatively easy to move from int -> bigint. Moving in the reverse direction is likely difficult or impossible.
It would be nice to see real benchmarking on millions of rows to compare the three, but my gut tells me you use int by default, bigint if you outgrow int, and UUID if you have plenty of money for hardware and need distribution capabilities a UUID would enable.
In datamodelling, tables can often be categorized by lifetime. 'Business Relationships' eg. customers, suppliers, products have a fairly long lifetime; whereas 'Business Transactions' are created on a much higher frequency.
I'm generally fairly comfortable using int for business relationships, and bigint (long) for transaction data.
For performance, insertion speed often seems to be dominated by 'commit latency' to sync to the disk; rather than by record size. I would agree that record size affects table scan, but for many datamodels keying may often be a relatively small proportion compared to the size of text fields and other data.
I like to model keyspaces to work for 200 years, for the largest foreseeable market growth, times at least a factor of 10 for safety.
Speaking from personal experience, just use bigint... If you aren't dealing with billions of rows, the size difference isn't that big a deal, and if you are dealing with billions of rows, the int -> bigint migration is definitely not "relatively easy".
One of the most memorable anecdotes of my professional career is a production environment going down because we hit maxint on an important (and busy) table. The dirty hack we used to get the site back up (hint: int is _signed_), and the weeks it took to plan, test, and execute the migration.
This is making me reconsider how I do IDs. I thought the performance of sequential IDs was significantly better, so my approach was to use a standard auto-increment primary ID and then obfuscate by id * p mod m, where p and m are coprime and very large. Then I get back the original ID using the modular inverse. Should I just be using UUID?
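For reference, that scheme sketched in Python (the constants are illustrative; taking m = 2^32 with any odd p guarantees coprimality, and the modular inverse undoes the multiply exactly):

```python
M = 2**32
P = 2_654_435_761       # odd, so gcd(P, M) == 1; Knuth's multiplicative constant
P_INV = pow(P, -1, M)   # modular inverse (Python 3.8+)

def obfuscate(i: int) -> int:
    return (i * P) % M

def recover(x: int) -> int:
    return (x * P_INV) % M

print(recover(obfuscate(12345)))  # 12345
```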
I would use uuid in this case. If p and m are too large you get overflow. If they are too small your keys are guessable. If it matters, use uuid and don't waste time and mental energy on it.
> You are well advised to choose a primary key that is not only unique, but also never changes during the lifetime of a table row. This is because foreign key constraints typically reference primary keys, and changing a primary key that is referenced elsewhere causes trouble or unnecessary work.
In one sense I agree with the author that things are generally just easier when you use surrogate primary keys, but they really should note here that the FOREIGN KEY constraint itself is not a problem at all, as you can just use ON UPDATE CASCADE.
ON UPDATE CASCADE avoids much developer impact, but it isn’t free and has (potentially quite large) performance impacts.
Always, always use a bigserial.
(Actually, all serial sequences are 64-bit under the hood; only the "base type" of the column they add to the table differs, and it'll always come back to bite you later. Ask me how I know…)
I don't think I've ever seen this mentioned anywhere, but if you need a unique ID for an entity with not a lot of records planned (≤10,000,000), why not use a random int64 with a simple for loop on the application side to catch the occasional collisions? Are there any downsides besides making the application side a tiny bit more complex?
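The retry loop in question can be sketched like this, with a set standing in for the table's unique index (in a real application you'd attempt the INSERT and retry on the duplicate-key error):

```python
import secrets

existing = set()  # stands in for rows already in the table

def new_id() -> int:
    while True:
        candidate = secrets.randbits(63)  # fits a signed int64 column
        if candidate not in existing:     # real code: catch duplicate-key on INSERT
            existing.add(candidate)
            return candidate

ids = [new_id() for _ in range(1000)]
print(len(set(ids)))  # 1000 -- all unique, collisions retried away
```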
That’s the UUID approach, but worse. According to the birthday problem[1], you’re 50% likely to get a collision among random 64-bit numbers after about 5 billion insertions. That’s not an awful lot. Replace that with a 128-bit UUID and you’d have to insert 22,000,000,000,000,000,000 rows to get a 50% chance. That’s probably less likely than a cosmic ray flipping a random bit in RAM and corrupting the index that way.
[1] https://en.wikipedia.org/wiki/Birthday_problem#Probability_t...
The post qualified it as <= 10,000,000 total records. For that number of records, there's a chance of roughly 0.00001 that you get a collision, assuming good randomness.
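Those figures can be checked against the usual birthday approximation, p ≈ 1 − exp(−n²/2N) for n draws from a space of N values:

```python
import math

def collision_prob(n: float, bits: int) -> float:
    # birthday-problem approximation for n random draws from 2**bits values
    return 1 - math.exp(-n * n / (2 * 2**bits))

print(collision_prob(5.06e9, 64))  # ~0.5 for ~5 billion 64-bit draws
print(collision_prob(1e7, 64))     # a few in a million for ten million rows
```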
Sure, but stuff always grows, and the experiment gets run a bunch of times. Why not go with the built-in solution and then not have to worry about it?
I'm just answering the poster's question directly; but in the general case, I agree with you. The cognitive overhead of dealing with the various "what ifs" usually aren't worth the couple bytes or cycles that you could save.
Getting a collision with this approach doesn’t matter — the whole point is to loop if you do get a collision. The only issue is getting a long string of sequential collisions, which is highly unlikely.
But now you’ve traded code complexity for a few bytes of storage. That’s just not worth it.
8 extra bytes per row and per foreign key reference relative to an int64 can add up quickly especially if the row is small. I agree it’s not typically the right trade off but it’s not as absolute as you claim.
This is why we need 2TB drives now, when we used to get by with 2GB.
Back in my day we measured things in Ks, not Ms or Gs or Ts.
Get off my lawn...
But seriously, UUIDs work, they don't need application code to avoid collisions. If you want something a bit more compact/shardable, use ULIDs.
for what it's worth, YouTube still uses 11 character base64 strings for their video ids, which are assumed to be 64-bit ints. They also allow unlisted videos, which people usually take to mean "semi-private".
It's an interesting tradeoff. The UX of the smaller YouTube video id links is probably of some benefit to them. Plus they have private videos for when you really don't want your video to be viewed, with unlisted being the middle ground of easy sharing but also keeping it exclusive.
Sure, and it makes sense there: write a service that returns unique 64 bit ints and encapsulate the complexity inside that one location. That’s easier than making every `insert` in your app code have to do a `while not unique` loop.
If you’re in this situation, it isn’t worth the time or effort to avoid UUIDs. You’re adding complication for no gain whatsoever.
> if you need a unique ID for an entity with not a lot of records planned (≤10,000,000), why not use a random int64 with a simple for loop on the application side to catch the occasional collisions?
What’s the use case for this where UUIDv4 or sequential ID isn’t better? Because it sounds like a solution in search of a problem.
> Are there any downsides besides making the application side a tiny bit more complex?
Are there any upsides to warrant the complexity?
I wouldn't say it's a particularly great upside, but I've found tooling doesn't tend to work as well with UUID keys as it does integer keys. E.g., Hasura won't map UUID keys to the GraphQL ID type [1], which makes working with them unnecessarily difficult. Arguably, the issue should be fixed with the tool (or a different tool chosen), but there's only so many battles to pick and sometimes that decision is out of your control.
UUIDv4 has performance implications. But I agree, if you are already coupled to the DB (due to the check loop), generally you might as well use sequential.
Seems too easy to screw up.
Database sequences comply with the ACID properties of transactional processing. Generating your own IDs and adding "a simple for loop" means you lose that capability for no good reason.
If you have 10 million rows, you're looking at 160MB for the storage of UUIDs, vs 80MB for storage of 64-bit ints.
Both of those are entirely cacheable.
I have a web service where I need to generate access tokens of five characters. I just generate a random one and check to see if it’s already in use. Been working fine for years.
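With five-character tokens the space is tiny by ID standards (62⁵ ≈ 916 million for an alphanumeric alphabet), but the retry loop still behaves well as long as the table stays sparse: if a fraction f of the space is taken, the expected number of attempts is 1/(1 − f). A hedged sketch, with an in-memory set standing in for the database uniqueness check:

```python
import secrets
import string

ALPHABET = string.ascii_letters + string.digits  # 62 characters

def new_token(used: set, length: int = 5) -> str:
    """Generate a random token, retrying on collision."""
    while True:
        token = "".join(secrets.choice(ALPHABET) for _ in range(length))
        if token not in used:
            used.add(token)
            return token

space = len(ALPHABET) ** 5
print(space)  # 916132832 possible tokens

# even at 50% occupancy, expected attempts per token is 1/(1-0.5) = 2
```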
You can also do the loop on the server side using PL/pgSQL:
https://www.postgresql.org/docs/current/plpgsql-control-stru...
That's not that trivial. You can't just loop to get a unique ID. Maybe if you lock the whole table for reads first, which is quite drastic.
I should have been more clear. Here are the details:
If you read the linked doc, you'll see an EXCEPTION clause. That can be used for a retry loop inserting into a table with a UNIQUE constraint. No read locks are necessary, because the UNIQUE constraint will catch violations safely (regardless of concurrent activity), and the retry loop can simply retry until the insert succeeds. For instance:
This will obviously loop forever if 0..n are all occupied, but if you choose n as (2::numeric^63 - 1)::int8, that won't happen.

```sql
CREATE TABLE u(i INT8 UNIQUE);

-- insert random unique value in the range 0..n
-- into table u, retrying if it's already present
--
-- NOTE: this will not terminate if 0..n are all
-- present
CREATE OR REPLACE FUNCTION insert_uniq(n INT8) RETURNS VOID
LANGUAGE plpgsql AS $$
DECLARE
  x INT8;
BEGIN
  <<retry_loop>>
  LOOP
    BEGIN
      x := (random() * n)::int8;
      INSERT INTO u VALUES(x);
      RAISE NOTICE 'inserted unique value %', x;
      EXIT retry_loop;
    EXCEPTION WHEN unique_violation THEN
      RAISE NOTICE 'collision with value %; retrying', x;
    END;
  END LOOP;
END;
$$;
```

Thank you.
The point of guids/uuids is no collision... Ever... Including for large datasets. Even when you merge multiple together in 10 years.
For your use-case you could use incremental IDs.
Is the goal here to save space?
Reducing space is less about pure storage amount but rather about the fact that having a not-too-large datatype (e.g. native integer) for keys generally improves all kinds of performance as the indexes are more compact and better fit in caches, comparison is trivial so joins are faster, etc.
The goal is to have a non-sequential ID (for example, to hide the information about the actual size of the data set) in a situation where one needs to support a storage that doesn't have a native support for UUIDs (that isn't just “convert to TEXT”) and reuse as much of the schema as possible without [ab]using ORMs. For example, an application that has to support SQLite as well as PostgreSQL and maybe some other storages.
Another benefit of using sequential integers is that you can leverage a number of optimizations.
For one thing you can represent a range of data more efficiently by just storing offsets. This means that instead of having to store a 'start' and 'end' at 8 + 8 bytes you can store something like 'start' and 'offset', where offset could be based on your window size, like 2 bytes.
You can leverage those offsets in metadata too. For example, I could cache something like 'rows (N..N+Offset) all have field X set to null' or some such thing. Now I can query my cache for a given value and avoid the db lookup, but I can also store way more data in the cache since I can encode ranges. Obviously which things you cache are going to be data dependent.
Sequential ints make great external indexes for this reason. Maybe I tombstone rows in big chunks to some other data store - again, I can just encode that as a range, and then given a lookup within that range I know to look in the other datastore. With a uuid approach I'd have to tombstone each row individually.
These aren't universal optimizations but if you can leverage them they can be significant.
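A minimal sketch of the range-encoding idea from the comments above (hypothetical names; it assumes the IDs in a tombstoned chunk are contiguous, which is exactly what sequential keys buy you):

```python
import bisect

class TombstoneRanges:
    """Track tombstoned rows as (start, length) pairs instead of
    storing every ID individually."""
    def __init__(self):
        self.starts = []   # sorted range start IDs
        self.lengths = []  # parallel list of range lengths

    def add_range(self, start: int, length: int) -> None:
        i = bisect.bisect_left(self.starts, start)
        self.starts.insert(i, start)
        self.lengths.insert(i, length)

    def is_tombstoned(self, row_id: int) -> bool:
        # find the last range starting at or before row_id
        i = bisect.bisect_right(self.starts, row_id) - 1
        return i >= 0 and row_id < self.starts[i] + self.lengths[i]

t = TombstoneRanges()
t.add_range(1_000_000, 500_000)  # one entry covers half a million rows
print(t.is_tombstoned(1_250_000))  # True
print(t.is_tombstoned(2_000_000))  # False
```

With UUID keys the same lookup structure would need one entry per tombstoned row.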
Doesn't the offset approach run into trouble when sequence values get skipped due to rollbacks?
It's going to be an optimization that assumes some constraints on how you interact with your database.
I'm a fan of generating the primary key by copying the natural key (if it's a single integer) or a hash of the natural key. This is done only once, when the row is created, and is never updated, even if the natural key changes. That way you're left with a valuable bit of information that something happened to the natural key.
Another alternative is ULID, which can be stored as UUID on a Postgres side, but is more b-tree friendly.
Are there articles/examples on how to use ULIDs in postgres?
ULID and UUID are same size, passing ULID as UUID to PostgreSQL works seamlessly.
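Both are 128 bits, so the mapping is mechanical. A stdlib-only sketch of decoding a 26-character ULID (Crockford base32) into a value you can hand to Postgres as a uuid:

```python
import uuid

# Crockford base32 alphabet used by ULID (no I, L, O or U)
CROCKFORD = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"

def ulid_to_uuid(ulid: str) -> uuid.UUID:
    """Decode a 26-character ULID string into the equivalent 128-bit UUID."""
    if len(ulid) != 26:
        raise ValueError("ULIDs are exactly 26 characters")
    n = 0
    for ch in ulid.upper():
        n = n * 32 + CROCKFORD.index(ch)
    return uuid.UUID(int=n)

# smallest and largest possible ULIDs map to the UUID extremes
print(ulid_to_uuid("0" * 26))        # 00000000-0000-0000-0000-000000000000
print(ulid_to_uuid("7" + "Z" * 25))  # ffffffff-ffff-ffff-ffff-ffffffffffff
```

The top character is capped at 7 so that 26 × 5 = 130 bits of base32 never overflow the 128-bit value.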
Yeah, just use a UUID unless the bits to store it really are your driving limitation (they're not). A non-linear UUID is almost always the most straightforward option for identifying things, at the cost of human readability (though you can get some of that back with prefixes and other schemes). I'm not going to rehash the benefits that people have brought up for UUIDs; they're in this thread. At this point what I'm concerned about is just... what is the best kind of UUID to use. I've recently started using mostly v1, because the time relationship is important to me (despite the unfortunate byte-order issues), and v6[0] isn't quite so widespread yet. Here's a list of other approaches worth looking at:
- isntauuid[1] (mentioned in this thread, I've given it a name here)
- timeflake[2]
- HiLo[3][4]
- ulid[5]
- ksuid[6] (made popular by segment.io)
- v1-v6 UUIDs (the ones we all know and some love)
- sequential interval based UUIDs in Postgres[7]
Just add a UUID -- this almost surely isn't going to be what bricks your architecture unless you have some crazy high write use case like time series or IoT or something maybe.
[0]: http://gh.peabody.io/uuidv6/
[1]: https://instagram-engineering.com/sharding-ids-at-instagram-...
[2]: https://github.com/anthonynsimon/timeflake
[3]: https://en.wikipedia.org/wiki/Hi/Lo_algorithm
[4]: https://www.npgsql.org/efcore/modeling/generated-properties....
[5]: https://github.com/edoceo/pg-ulid
[6]: https://github.com/segmentio/ksuid
[7]: https://www.2ndquadrant.com/en/blog/sequential-uuid-generato...