Representing Enums in PostgreSQL
TFA is all about how to make changes where you drop elements of an enum, and how hard that is. The obvious thing, though, is not covered: don't do that! Instead you should:
a. Add CHECK() constraints on columns of the relevant ENUM type, checking that the value is one of the "live" values.

b. RENAME "dead" ENUM values to indicate that they're dead.

c. UPDATE ... SET column_of_that_enum_type = ... WHERE column_of_that_enum_type IN (dead_values_here) ...
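Concretely, steps (a) through (c) might look like this (a sketch with hypothetical names — an order_status ENUM with a retired 'cancelled' value):

    -- (b) mark the value as dead in the type itself
    ALTER TYPE order_status RENAME VALUE 'cancelled' TO 'cancelled_DEAD';

    -- (a) constrain columns to live values; NOT VALID skips scanning existing rows
    ALTER TABLE orders
        ADD CONSTRAINT orders_status_live
        CHECK (status <> 'cancelled_DEAD') NOT VALID;

    -- (c) migrate in small batches; Postgres has no UPDATE ... LIMIT,
    -- so a ctid subquery is one way to chunk the work
    UPDATE orders SET status = 'refunded'
    WHERE ctid IN (SELECT ctid FROM orders
                   WHERE status = 'cancelled_DEAD' LIMIT 1000);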
(c) can be done slowly, say, with a LIMIT clause to keep load and disruptions down. Yeah, you might have "broken" data for a while, but you can make your queries report better values for columns of that ENUM type in the same way that the UPDATE would fix them.

Glad to see this as the top comment; completely agree. After reading the article, I was thinking that the only real downside of using enum types (but there is one more issue not mentioned, more on that below) is when you need to remove values. In reality, I've found removing enum values to be a very rare occurrence in prod. Removing an enum value fundamentally breaks backwards compatibility, so usually a better option is to essentially deprecate it, which is more like what you recommend.
Note one other thing I have found is that postgres enums are less "portable" than plain text columns. For example, if you're using BigQuery to query postgres tables as an external connection, BQ doesn't support enums, so your external connection query has to cast any enum return values (e.g. some_enum_column::text) before returning them to BQ, which can be a pain if you want to do "SELECT * FROM ..." some table that returns enum columns.
You can create an implicit cast to and from text with CREATE CAST and it becomes transparent.
Can you explain a little more what you mean here? I'm not sure this helps my use case. E.g. if I run a BQ external query like the following:

    SELECT * FROM EXTERNAL_QUERY("my-pg-connection-string", """ SELECT enum_column FROM foo; """);

that always fails for me, even if I define "CREATE CAST (my_enum_type AS text) WITH INOUT AS IMPLICIT;". The only way I can get it to work is if I use "SELECT enum_column::text FROM foo;" and I don't need a custom cast definition to do that.
Apologies, I’m hallucinating features here. In our codebase we also have lots of JDBC extensions to map PG objects to language types and back, and this is the easiest way to handle queries over enums. But obviously some type info has to trigger the coercion, so SELECT * FROM might work in the context of an INSERT INTO but elsewhere you’d have to be explicit.
I think you would need a way to trigger the cast either way? Eg concatenate with a string?
Otherwise, maybe a function and a generated column (Ed: in a VIEW i guess, if it chokes on the raw enum)?
https://dba.stackexchange.com/questions/276477/immutably-con...
By the time you've done a generated column, you might as well go with the article's solution of using text with a check constraint, since you're now using the space anyway.
The CREATE CAST will help on DMLs, but not SELECTs.
Oh nice. I didn’t know about that feature.
> BQ doesn't support enums
Sounds like a reason to use views to work around the limitations of BigQuery
> Note one other thing I have found is that postgres enums are less "portable" than plain text columns. For example, if you're using BigQuery to query postgres tables as an external connection, BQ doesn't support enums, so your external connection query has to cast any enum return values (e.g. some_enum_column::text) before returning them to BQ, which can be a pain if you want to do "SELECT * FROM ..." some table that returns enum columns.
Yeah, one has to write more adapter queries to add those casts to text (or from text, for DMLs), but it's OK, and ENUMs are just really nice. And as u/faangsticle says, use VIEWs for that.
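A sketch of that VIEW workaround, with hypothetical names (a foo table with a single enum column):

    CREATE VIEW foo_for_bq AS
    SELECT id, enum_column::text AS enum_column
    FROM foo;

BQ's external query can then SELECT * FROM foo_for_bq without ever seeing an enum type.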
Good approach. And with Postgres `DOMAIN` types to DRY up the `CHECK` constraint for the live values, this should be very palatable. I usually opt for enum tables and regularly find myself needing to declare `archived_at` columns for said tables. Did not know of the `RENAME` command either; what you describe renders native enums less warty.
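For illustration, a `DOMAIN` that DRYs up the live-values `CHECK` might look like this (hypothetical names; a sketch, not the article's code):

    CREATE DOMAIN live_status AS text
        CHECK (VALUE IN ('active', 'paused'));

    CREATE TABLE subscriptions (
        ...
        status live_status NOT NULL,
        ...
    );

Every column declared as live_status then shares the one constraint definition.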
Agree, a PostgreSQL enum is really just for mapping to an enum in application code. And just like you should not remove an enum value or change its ordinal in application code, don't do that to a PostgreSQL enum either.
I find it odd how many schema changes in modern RDBMS must be done on the whole table at once. You can split a table in chunks and recode each chunk gradually in a way which doesn't change the data in it (so no downtime) but removes dead entries like updated enums.
In a way you describe how we can emulate this process. The question is why the heck wouldn't databases do this themselves? Same with adding and dropping columns.
Consider how PostgreSQL encodes nulls, for example: by skipping them as fields in the row and recording them in a null bitmap at the front of the row. Meaning rows are not uniformly sized; there's no offset = row * rowsize + field_offset kind of addressing for reading a field in PG, so recoding some of the rows can't be done with simple fixed-offset math.
And yet we have all those huge monolithic operations that need to be done atomically. So weird.
After having suffered through the consequences of "type" enums on MySQL, and having seen some things go through a long life that used "enums" in the database (in multiple different databases, including Postgres), I'm not convinced that either of these is the right choice for representing enumerations. The string with a check constraint seems dumb if for no other reason than that, if the table that uses it winds up having many rows, you're basically burning lots of extra space for arguably no reason; and if you ever have to alter the name of an enum value (or something similar), that update is going to be really expensive.
I think the "right" choice for enums probably looks a little more like:
    CREATE TABLE myEnum(
        enumID SERIAL NOT NULL PRIMARY KEY,
        enumName TEXT UNIQUE NOT NULL
        -- enum description or other metadata columns here.
    );
    CREATE TABLE foo(
        ...
        associatedEnumID INTEGER NOT NULL REFERENCES myEnum(enumID),
        ...
    );
I think this has the benefit of being space efficient like the native typed enum, while being relatively flexible (easy to change names, add new enum values, add data about the enum itself, etc.)

I prefer directly using strings as enums, and using the foreign key constraint only to validate enum values.
The reason is that a SELECT * FROM foo showing cryptic enum ordinals is a headache, and having to join the enum table every time is potentially slower than just reading from the column directly. An ASCII character only takes 1 byte, so an INT enum is just as space efficient as using 4 characters, which affords way more descriptiveness than a meaningless ordinal number.

    CREATE TABLE my_enum (
        name TEXT PRIMARY KEY
    );
    CREATE TABLE foo (
        my_enum TEXT REFERENCES my_enum (name)
    );

If you use longer named enums (eg. my_enum_xyz) in my_enum, does this store a full copy of the enum text bytes of 'my_enum_xyz' into table foo?
Yeah the full text of the enum is stored in the table ('my_enum_xyz' is 11 ASCII characters so it takes up 11 bytes, plus 1 byte needed to store the length of the string).
The point of enums is so the names can be descriptive, which typically will be longer than 4 characters.
Perhaps I was unclear, my underlying point is that short strings are cheap. And enum names are almost always short strings. 4 characters not enough? What about 8? That’s the size of a BIGINT, which everybody now uses for primary keys (they’re not slow). What about 16 characters? That’s the size of a UUID, which everyone now uses for primary keys (they’re not slow).
Even 4 characters (which perform just as fast as Postgres’ native enums) is enough if you develop a system of abbreviations. The airline industry gets by with just three characters!
Just wanted to let you know I really appreciate your comments. I hadn't thought about doing it the way you recommend (in my mind I was thinking "oh, storing enum values as text will be inefficient", but I never did the math as you did to realize storing anything up to 8 chars is no worse than bigint), but I think it's a pretty optimal solution. I also checked my DB where we heavily use enums, and basically every enum text value that is longer than 8 characters could easily be shortened/abbreviated to 8 chars and still be very easily understood. Thanks very much!
Yep, that's what I typically do, except with a string "code" and a string "description"; that way the raw data is more readable without joining on the coded value, and many times you want a short code vs. a long description, like with US states and their abbreviations. Most codes are 1 or 2 characters, up to maybe 4, so you end up with the same or less space used than a 32-bit int with UTF-8.
Metadata like sorting and even what date ranges the code is valid for and even security as in who's allowed to use the code in the app.
This is... exactly what PG does under the covers for ENUM types. And also of course this is historically the canonical way to do ENUMs in SQL.
This is exactly what I do, save with natural keys.
In my experience, you need those descriptions and other metadata on these enum values more often than not.
PostgreSQL enums feel like a bit of a hack in general. I end up using an "enum table" approach in many cases as joining against a very small table has negligible performance impact in all but the most performance sensitive databases and foreign key constraints are a well understood quantity.
PG enums are "enum tables" under the hood. With native enum support the JOINs with the enum tables happen at query compilation time, which is a performance benefit you should not ignore.
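That hidden table is visible in the pg_enum system catalog — each enum value is one row:

    SELECT enumtypid::regtype AS enum_type,
           enumsortorder,
           enumlabel
    FROM pg_enum;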
PostgreSQL enums are there for you to use on your code API. They can go into tables because sometimes it's important to put code data into tables.
What they are absolutely not for is to replace the one standard way to define data enums that everybody has used since relational algebra was created. It's such an outrageously naive idea that I'm sure most people here were attracted to the title thinking it was about algebraic types or some other similar misunderstanding.
Can you elaborate on what it means to use PostgreSQL enums "on your code API"?
> It's such an outrageously naive idea that I'm sure most people here were attracted to the title thinking it's about algebraic types or some other similar misunderstanding.
Just to share how we at Close got into this discussion (which I personally don't find as "outrageous" as you), SQLAlchemy – which we use in our Python code – uses `native_enum=True` as the default value for its `Enum` type [0], which then translates to using native ENUM types in PostgreSQL:
> native_enum – Use the database’s native ENUM type when available. Defaults to True. When False, uses VARCHAR + check constraint for all backends.
[0]: https://docs.sqlalchemy.org/en/20/core/type_basics.html#sqla...
> Can you elaborate on what it means to use PostgreSQL enums "on your code API"?
If you export a procedure for creating a socket into PL/pgSQL, you shouldn't use magic numbers for setting the socket flags. You should use enums.
As for SQLAlchemy, well, that design is not good. It should support more mappings than just to string. But personally, I would ignore the feature and go without enum types (notice that it's a recent addition). You can always declare your own enum and convert the column if you want.
> But well, personally, I would ignore the feature and go without enum types (notice that it's a recent addition).
Postgres has had enums since 2008
So, recent addition.
I would argue "certainly not", since it's had them for more than half its lifetime.
I still don't totally understand that first point. Are you talking about using enums for controlling the connection itself? Can you give an example?
Just to clarify, sqlalchemy supports any arbitrary mapping you want — it's entirely flexible in every direction.
EDIT ah, do you mean something like this?
    CREATE FUNCTION do_thing(flags custom_enum)

Yep, you use it as CREATE FUNCTION do_something(some_option custom_enum). The main use case is not putting them on tables.
Relational algebra has the concept of enums builtin (as long as they all have the same structure). You create them in databases by using foreign keys.
Heyya! Close customer here.
> When False, uses VARCHAR + check constraint for all backends
Is this behaviour new? Or is it that alembic still doesn't pick up changes to the enum to recreate the constraint? We implemented our own alembic hooks to automate the migrations for us. We've found the check constraint model to be fairly effective.
Hiya Aidan! :D
> Is this behaviour new?
Don't think so. Even the docs for SQLA v1.3 ([0]) mention it:
> native_enum – Use the database’s native ENUM type when available. Defaults to True. When False, uses VARCHAR + check constraint for all backends. The VARCHAR length can be controlled with Enum.length
[0]: https://docs.sqlalchemy.org/en/13/core/type_basics.html#sqla...
--
> Or is it that alembic still doesn't pick up changes to the enum to recreate the constraint?
Off the top of my head, I believe Alembic doesn't pick it up automatically. We make a lot of manual tweaks to our auto-generated Alembic migrations, and I believe adding / modifying the CHECK constraints is part of it.
Happy to ping you guys over the code if you want to look. It’s probably fairly tweak-able to handle your specific cases.
Is there a reason ALTER TYPE name ADD VALUE new_enum_value wouldn't Just Work for the first example for 99% of use cases? Seems like the only drawback highlighted for native enums is that you have to lock the whole table if you completely swap out the type of one of the columns, which... Yes, that's true, but also very rare? I guess TFA mentions removing a value from an enum, but you shouldn't do that imo - leave the data in place and ignore it at the app layer, like how you'd treat a deleted flag or similar. Or migrate it and leave the old value hanging around in the type.
> Is there a reason ALTER TYPE name ADD VALUE new_enum_value wouldn't Just Work for the first example for 99% of use cases?
I'm sorry, we don't have any examples in the article about adding values to enums, so I don't know to what you are referring here.
> Yes, that's true, but also very rare?
It is rare, indeed, but we just have a strong inclination towards having a clean schema, so we do prefer to clean up things that would otherwise be unused. But I understand that many people would prefer to leave them hanging there, even if they are not going to be used in any way.
Native enums can be used in queries like strings, but with type checking:
    select * from cust where type = 'company' -- ok
    select * from cust where type = 'c0mpany' -- error
As mentioned, they take less space. Important if you use these columns in an index and have millions of rows.

The "less space" argument does not seem to be true in every situation. Enums may suffer from alignment padding. See here: https://dba.stackexchange.com/a/258591/50410
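You can measure the per-value cost with pg_column_size() (a sketch; my_enum here is a hypothetical enum type, and the actual row size also depends on the alignment of neighbouring columns):

    SELECT pg_column_size('company'::text)    AS text_bytes,  -- string length + varlena header
           pg_column_size('company'::my_enum) AS enum_bytes;  -- 4 bytes, an OID internally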
Isn't index size more proportional to the number of distinct values and the number of rows? With few distinct enum options, the size of each shouldn't have a big impact?
Little known bug, but enums also choke dump/restore if you use them as a hash partition key.
That said, they can be really nice for all the same reasons static types are nice. Good article!
Whew, good to know! Another point towards the `CHECK` constraint approach. Thanks! :)
It sure sounds like a bug to me, too! Is there an upstream bug id? I’d imagine dump/restore should “just work” with anything that can be done in postgres oob.
Dunno about a bug ID, but there were a couple of mailing list discussions. Here: https://www.postgresql.org/message-id/flat/1376149.167526827...
Disclosure: I reported this particular bug. It was a fun one to run down.
Native enums in Postgres just seem to have many more drawbacks than advantages. I never really used them, it didn't seem worth it.
I usually use C# Enums translated by EF Core now, which works perfectly fine on the C# side. The only missing part would be to give the DB the information about the enum names, so it could show those to me instead of just the raw numbers. But I assume there is no way to do that. Adding entire tables just doesn't seem worth it for this use case alone.
> Adding entire tables just doesn't seem worth it for this use case alone.
If you peek at a particular table often enough and it has a crazy enough enum to maybe justify making your life a little easier, what you can do is add a table just for debug purposes and use a view that maps the int to it, but otherwise only use the direct int:enum cast in your app/orm/etc. thereby not paying any runtime cost for foreign keys or integrity checking.
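A sketch of that debug-only pattern (hypothetical names; nothing in the app depends on the lookup table):

    -- maintained purely for humans reading the database
    CREATE TABLE foo_status_names (
        status int PRIMARY KEY,
        name   text NOT NULL
    );

    CREATE VIEW foo_debug AS
    SELECT f.*, n.name AS status_name
    FROM foo f
    LEFT JOIN foo_status_names n USING (status);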
> Adding entire tables just doesn't seem worth it for this use case alone.
I am going down this path. There are a lot of downstream use cases (e.g. support, reporting, etc) that can benefit from having the canonical names & human-friendly labels in the database.
Keeping database in sync with code like this requires some discipline and manual scripting, but I think it's worth it.
> I usually use C# Enums translated by EF Core now (...)
Does this approach ensure data consistency? Or could you accidentally insert a number value into your table which is not represented by any of the enum values?
Well C# is strongly typed so that’s punted to the CLR type system (assuming you’re not using raw sql to insert values via other means/frontends) - but it isn’t too hard to forcibly to cast a raw int to an enum in C# which doesn’t actually contain a definition with that integral value (unlike in even stronger typed languages eg in rust where it’s a lot more work).
I “manually” add a check constraint (via EF, so it’s not so much manual as it is “remember to copy and paste this in your db.cs) to assert the value is greater than or equal to zero and less than the number of inhabitants in the enum, but this fails if you manually code the enum values (eg for flags, legacy interop, etc).
With raw SQL you could write anything into it. Within EF Core it won't allow bad values; you could trick it, I think, but that is not something you'd do accidentally. You can always add a CHECK constraint in addition, if you have more uncontrolled places modifying those values.
IDK I use the npgsql enum mapping to native pg enum and it works great. Pretty sure I have it configured for string values too.
The article misses the most common way to represent enums: foreign keys…
Foreign keys have a different set of tradeoffs. We'd have to either look enum values up before making the query, or use a join that might affect query plan.
When would one use this approach versus a domain type [1]? How does it differ?
That would be an improvement on the `CHECK` constraint approach if the same `CHECK` constraint is used in multiple columns, to DRY it up. The `DOMAIN` types offer the same commands for `CHECK` constraints, namely `NOT VALID` when updating the constraint, and `VALIDATE CONSTRAINT` to make sure it is valid after having updated it.
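So an update to the allowed values might go roughly like this (hypothetical names):

    ALTER DOMAIN order_status DROP CONSTRAINT order_status_check;
    ALTER DOMAIN order_status
        ADD CONSTRAINT order_status_check
        CHECK (VALUE IN ('new', 'paid', 'shipped')) NOT VALID;
    ALTER DOMAIN order_status VALIDATE CONSTRAINT order_status_check;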
I am probably missing something obvious (space/time optimization for very large/busy tables?) but isn't this a perfect example of why foreign keys exist? In fact, both case studies presented by the article (enums and fixed string check constraints) are clunky enough that I would say both are wrong and they should just go with a foreign key.
I agree with this. I don't use enums; they are always more trouble than they are worth.
They break FDW unless they are pre-created on the importing side. Super inconvenient.
Apologies for my ignorance, but what does "FDW" stand for in your comment? :)
Foreign Data Wrapper
specifically, they are one of the coolest features in postgres. Have you ever wanted to pretend that a CSV (or your filesystem or google) or basically anything else was a SQL database? With a couple hundred lines of code, you can do this. The performance won't be great, but it's ridiculously flexible.
I have a two-liner shell script that turns a csv into a virtual SQLite db (./to_sql.sh foo.csv) and lets me use the full power of SQLite. Usually that’s enough for my purposes, but I do use pgsql‘s copy from csv to generate tables representing foreign data imported from a csv (say census info or NAICS occupation codes).
I had to implement a workaround for that for our DW, which imports from two different app databases. IMO this should be part of IMPORT FOREIGN SCHEMA, but then you would have to qualify the names somehow if you were importing from more than one db. I wrote this as a workaround; it runs every day as part of our data import job:

    DROP SCHEMA IF EXISTS fdw_enum CASCADE;
    CREATE SCHEMA fdw_enum;

    -- Get names of the enums using ::regtype casting, plus label and sort_order,
    -- from app_a and app_b.
    CREATE VIEW fdw_enum.app_a_enums AS
    SELECT * FROM dblink('fdw_app_a', $QUERY$
        SELECT enumtypid::regtype AS name, enumsortorder sort_order, enumlabel label FROM pg_enum;
    $QUERY$) AS t (name text, sort_order integer, label text);

    CREATE VIEW fdw_enum.app_b_enums AS
    SELECT * FROM dblink('fdw_app_b', $QUERY$
        SELECT enumtypid::regtype AS name, enumsortorder sort_order, enumlabel label FROM pg_enum;
    $QUERY$) AS t (name text, sort_order integer, label text);

    -- Ensure enums with the same names aren't defined in both app_a and app_b.
    DO $DO$
    DECLARE
        matching_count integer;
    BEGIN
        SELECT COUNT(*) INTO matching_count
        FROM fdw_enum.app_a_enums
        INNER JOIN fdw_enum.app_b_enums USING (name);
        ASSERT matching_count = 0,
            'app_a and app_b-NG have identically named enums. Implement a check that they are identically defined.';
    END
    $DO$;

    CREATE VIEW fdw_enum.upstream_enums AS
    SELECT * FROM fdw_enum.app_a_enums
    UNION ALL
    SELECT * FROM fdw_enum.app_b_enums;

    CREATE PROCEDURE fdw_enum.create_type_if_not_exists(name text)
    LANGUAGE plpgsql AS $PROC$
    BEGIN
        EXECUTE format('CREATE TYPE %s AS ENUM ()', name);
    EXCEPTION WHEN duplicate_object THEN
        RAISE NOTICE 'Enum type % already exists, skipping', name;
    END;
    $PROC$;

    -- To make this idempotent we create the enums empty, then alter them by adding values.
    -- So instead of `CREATE TYPE foo AS ENUM ('bar', 'baz');` we do
    --     CREATE TYPE foo AS ENUM ();  -- only if it doesn't exist yet
    --     ALTER TYPE foo ADD VALUE IF NOT EXISTS 'bar';
    --     ALTER TYPE foo ADD VALUE IF NOT EXISTS 'baz';
    -- This ensures the procedure is re-runnable and can add new values to the enum
    -- as they are added upstream.
    -- Order is ensured by the ORDER BY in the loop query.
    CREATE PROCEDURE fdw_enum.refresh_upsteram_enums()
    LANGUAGE plpgsql AS $PROC$
    DECLARE
        rec record;
        ddl text;
    BEGIN
        FOR rec IN SELECT DISTINCT(name) AS name FROM fdw_enum.upstream_enums LOOP
            CALL fdw_enum.create_type_if_not_exists(rec.name);
        END LOOP;
        FOR rec IN SELECT * FROM fdw_enum.upstream_enums ORDER BY name, sort_order LOOP
            ddl := FORMAT('ALTER TYPE %s ADD VALUE IF NOT EXISTS %s', rec.name, quote_literal(rec.label));
            EXECUTE ddl;
        END LOOP;
    END
    $PROC$;

    CALL fdw_enum.refresh_upsteram_enums();
I'm happy to have a FK to a reftable with int, shortname, description
What is the difference in size on disk between the two options?
An enum stores a small reference to the enum value in the tuples, and the constraint approach stores the string itself. The constraint approach takes more space with (a) larger enum values, (b) more columns using them, and (c) more rows in the tables that use them. The difference depends on the characteristics of the database we're talking about.
smallint looks like a good alternative, with the dictionary in the app or in a separate table. So far I've only seen the approach of storing the dictionary in app source code.
Just to make sure I understand your suggestion fully, you're saying:

1. Use `smallint` instead of `text` for the column type.
2. Otherwise follow our "`CHECK` constraint" approach (without it we're jeopardizing data consistency, because we could store anything between -32768 and 32767 in that column).
3. Translate the int to the enum in your application.
Right?
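In SQL terms, something like this sketch (hypothetical names):

    CREATE TABLE foo (
        ...
        status smallint NOT NULL
            CHECK (status IN (0, 1, 2)),  -- must mirror the enum in application code
        ...
    );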
Most ORMs or just DALs can automate step 3 for you. If you’re using a strongly typed language with any type of ffi support, you’ll probably already have native cast from int to enum available (unless you’re using rust where that’s considered unsafe by default).
Yep!
I don't like this approach much because getting cryptic integers when you do a `SELECT` in the database is really cumbersome.
Agreed, but on the other hand it also saves a lot of complexity and possible headaches down the line. It's kind of a matter of choosing which headache you want. Especially if your table gets larger all those extra bytes in text columns can cost you dozens of GB of disk space, makes indexing slower, etc.
In MySQL/MariaDB, enums are just "aliases" for ints, and that works much nicer IMHO; adding a new value is cheap because it doesn't recheck all the rows (removing values is still expensive, as it needs to check it's not actually used by any row).