Settings

Theme

Representing Enums in PostgreSQL

making.close.com

112 points by wojcikstefan 3 years ago · 75 comments

Reader

cryptonector 3 years ago

TFA is all about how to make changes where you drop elements of an enum, and how hard that is. The obvious thing though is not covered: don't do that! Instead you should:

  a. Add CHECK() constraints on columns of
     the relevant ENUM type checking that
     the value is one of the "live" values.
  b. RENAME "dead" ENUM values to indicate
     that they're dead.
  c. UPDATE ... SET column_of_that_enum_type = ...
     WHERE column_of_that_enum_type IN (dead_values_here) ...
(c) can be done slowly, say, with a LIMIT clause to keep load and disruptions down. Yeah, you might have "broken" data for a while, but you can make your queries report better values for columns of that ENUM type in the same way that the UPDATE would fix them.
  • hn_throwaway_99 3 years ago

    Glad to see this as the top comment, completely agree. After reading the article, was thinking that the only real downside of using enum types (but there is one more issue not mentioned, more on that below) is when you need to remove values. In reality, I've found removing enum values to be a very rare occurrence in prod. Removing an enum value fundamentally breaks backwards compatibility, so usually a better option is to essentially deprecate it, which is more like what you recommend.

    Note one other thing I have found is that postgres enums are less "portable" than plain text columns. For example, if you're using BigQuery to query postgres tables as an external connection, BQ doesn't support enums, so your external connection query has to cast any enum return values (e.g. some_enum_column::text) before returning them to BQ, which can be a pain if you want to do "SELECT * FROM ..." some table that returns enum columns.

    • thom 3 years ago

      You can create an implicit cast to and from text with CREATE CAST and it becomes transparent.

      • hn_throwaway_99 3 years ago

        Can you explain a little more what you mean here? I'm not sure this helps my use case. E.g. if I run a BQ external query like the following

            SELECT * FROM EXTERNAL_QUERY("my-pg-connection-string", """
                SELECT enum_column FROM foo;
            """);
        
        that always fails for me, even if I define "CREATE CAST (my_enum_type AS text) WITH INOUT AS IMPLICIT;"

        The only way I can get it to work is if I use "SELECT enum_column::text FROM foo;" and I don't need a custom cast definition to do that.

        • thom 3 years ago

          Apologies, I’m hallucinating features here. In our codebase we also have lots of JDBC extensions to map PG objects to language types and back, and this is the easiest way to handle queries over enums. But obviously some type info has to trigger the coercion, so SELECT * FROM might work in the context of an INSERT INTO but elsewhere you’d have to be explicit.

        • e12e 3 years ago

          I think you would need a way to trigger the cast either way? Eg concatenate with a string?

          Otherwise, maybe a function and a generated column (Ed: in a VIEW i guess, if it chokes on the raw enum)?

          https://dba.stackexchange.com/questions/276477/immutably-con...

          • aidos 3 years ago

            By the time you've done a generated column you might as well go with the articles solution of using text with a check constraint since you're now using the space anyway.

        • cryptonector 3 years ago

          The CREATE CAST will help on DMLs, but not SELECTs.

      • aidos 3 years ago

        Oh nice. I didn’t know about that feature.

    • faangsticle 3 years ago

      > BQ doesn't support enums

      Sounds like a reason to use views to work around the limitations of BigQuery

    • cryptonector 3 years ago

      > Note one other thing I have found is that postgres enums are less "portable" than plain text columns. For example, if you're using BigQuery to query postgres tables as an external connection, BQ doesn't support enums, so your external connection query has to cast any enum return values (e.g. some_enum_column::text) before returning them to BQ, which can be a pain if you want to do "SELECT * FROM ..." some table that returns enum columns.

      Yeah, one has to write more adapter queries to add those casts to text (or from text, for DMLs), but it's OK, and ENUMs are just really nice. And as u/faangsticle says, use VIEWs for that.

  • cropcirclbureau 3 years ago

    Good approach. And with Postgres `DOMAIN` types features to DRY up the `CHECK` the live constraint, this should be very palatable. I usually opt for enum tables and regularly find myself needing to declare `archived_at` columns for said tables. Did not know of the `RENAME` command either and what you describe renders native enums less warty.

  • dikei 3 years ago

    Agree, PostgreSQL enum is really just for mapping with an enum in application code. And just like you should not remove an enum or change their ordinal in application code, don't do that to a PostgreSQL enum

  • BulgarianIdiot 3 years ago

    I find it odd how many schema changes in modern RDBMS must be done on the whole table at once. You can split a table in chunks and recode each chunk gradually in a way which doesn't change the data in it (so no downtime) but removes dead entries like updated enums.

    In a way you describe how we can emulate this process. The question is why the heck wouldn't databases do this themselves? Same with adding and dropping columns.

    Consider how PostgreSQL encodes null for example, by skipping them in the row as fields, and adding them in a null bitmap in front of the row. Meaning... rows are not uniformly sized, there's no math like offset = row * rowsize + field_offset; kind of addressing for reading a field in PG where recoding some of the rows breaks the entire table.

    And yet we have all those huge monolithic operations that need to be done atomically. So weird.

rpcope1 3 years ago

After having suffered through the consequences of "type" enums on MySQL, and see some things go through a long life that used "enums" in the database (in multiple different databases, include Postgres), I'm not convinced that either of these are the right choice for representing enumerations. The string with check constraint seems dumb if for no other reason than if the table that uses it winds up having many rows, you're basically burning up lots of extra space for arguably no reason, and if you ever have to alter the name of an enum (or something similar), that update is going to be really expensive.

I think the "right" choice for enums probably looks a little more like:

   CREATE TABLE myEnum(
     enumID SERIAL NOT NULL PRIMARY KEY,
     enumName TEXT UNIQUE NOT NULL,
     -- enum description or other metadata columns here.
   );

  CREATE TABLE foo(
    ...
    associatedEnumID INTEGER NOT NULL REFERENCES myEnum(enumID),
    ...
  );
I think this has the benefit of being space efficient like the native typed enum, while being relatively flexible (easy to change names, add new enum values, add data about the enum itself, etc.)
  • SPBS 3 years ago

    I prefer directly using strings as enums, and using the foreign key constraint only to validate enum values.

        CREATE TABLE my_enum ( name TEXT PRIMARY KEY );
    
        CREATE TABLE foo ( my_enum TEXT REFERENCES my_enum (name) );
    
    The reason is because a SELECT * FROM foo showing cryptic enum ordinals is a headache, and having to join the enum table every time is potentially slower than just reading from the column directly. An ASCII character only takes 1 byte, so an INT enum is just as space efficient as using 4 characters, which affords way more descriptiveness than a meaningless ordinal number.
    • ledgerdev 3 years ago

      If you use longer named enums(eg. my_enum_xyz) in my_enum, does this store a full copy of the enum text bytes of 'my_enum_xyz' into table foo?

      • SPBS 3 years ago

        Yeah the full text of the enum is stored in the table ('my_enum_xyz' is 11 ASCII characters so it takes up 11 bytes, plus 1 byte needed to store the length of the string).

    • faangsticle 3 years ago

      The point of enums is so the names can be descriptive, which typically will be longer than 4 characters.

      • SPBS 3 years ago

        Perhaps I was unclear, my underlying point is that short strings are cheap. And enum names are almost always short strings. 4 characters not enough? What about 8? That’s the size of a BIGINT, which everybody now uses for primary keys (they’re not slow). What about 16 characters? That’s the size of a UUID, which everyone now uses for primary keys (they’re not slow).

        Even 4 characters (which perform just as fast as Postgres’ native enums) is enough if you develop a system of abbreviations. The airline industry gets by with just three characters!

        • hn_throwaway_99 3 years ago

          Just wanted to let you know I really appreciate your comments. I hadn't thought about doing it the way you recommend (in my mind I was thinking "oh, storing enum values as text will be inefficient", but I never did the math as you did to realize storing anything up to 8 chars is no worse than bigint), but I think it's a pretty optimal solution. I also checked my DB where we heavily use enums, and basically every enum text value that is longer than 8 characters could easily be shortened/abbreviated to 8 chars and still be very easily understood. Thanks very much!

  • SigmundA 3 years ago

    Yep thats what I typically do except a string "code" and string "description" that way the raw data is more readable without joining the coded value and many times you want a short code vs long description, like with US states and their abbreviations. Most codes are 1 or 2 characters up to maybe 4 so you end up with less or the same space used than a 32 bit int with UTF-8.

    Metadata like sorting and even what date ranges the code is valid for and even security as in who's allowed to use the code in the app.

  • cryptonector 3 years ago

    This is... exactly what PG does under the covers for ENUM types. And also of course this is historically the canonical way to do ENUMs in SQL.

  • jfb 3 years ago

    This is exactly what I do, save with natural keys.

  • netcraft 3 years ago

    In my experience, you need those descriptions and other metadata on these enum values more often than not.

jpgvm 3 years ago

PostgreSQL enums feel like a bit of a hack in general. I end up using an "enum table" approach in many cases as joining against a very small table has negligible performance impact in all but the most performance sensitive databases and foreign key constraints are a well understood quantity.

  • cryptonector 3 years ago

    PG enums are "enum tables" under the hood. With native enum support the JOINs with the enum tables happen at query compilation time, which is a performance benefit you should not ignore.

  • marcosdumay 3 years ago

    PostgreSQL enums are there for you to use on your code API. They can go into tables because sometimes it's important to put code data into tables.

    What they are absolutely not for is to replace the one standard way to define data enums that everybody use since relational algebra was created. It's such an outrageously naive idea that I'm sure most people here were attracted to the title thinking it's about algebraic types or some other similar misunderstanding.

    • wojcikstefanOP 3 years ago

      Can you elaborate on what it means to use PostgreSQL enums "on your code API"?

      > It's such an outrageously naive idea that I'm sure most people here were attracted to the title thinking it's about algebraic types or some other similar misunderstanding.

      Just to share how we at Close got into this discussion (which I personally don't find as "outrageous" as you), SQLAlchemy – which we use in our Python code – uses `native_enum=True` as the default value for its `Enum` type [0], which then translates to using types in PostgreSQL:

      > native_enum – Use the database’s native ENUM type when available. Defaults to True. When False, uses VARCHAR + check constraint for all backends.

      [0]: https://docs.sqlalchemy.org/en/20/core/type_basics.html#sqla...

      • marcosdumay 3 years ago

        > Can you elaborate on what it means to use PostgreSQL enums "on your code API"?

        If you export a procedure for creating a socket into pgPlSQL, you shouldn't use magical numbers for setting the socket flags. You should use enums.

        As for sqlalchmey, well that design is not good. It should support more mappings than just to string. But well, personally, I would ignore the feature and go without enum types (notice that it's a recent addition). You can always declare your own enum and convert the column if you want.

        • faangsticle 3 years ago

          > But well, personally, I would ignore the feature and go without enum types (notice that it's a recent addition).

          Postgres has had enums since 2008

        • aidos 3 years ago

          I still don't totally understand that first point. Are you talking about using enums for controlling the connection itself? Can you give an example?

          Just to clarify, sqlalchemy supports any arbitrary mapping you want — it's entirely flexible in every direction.

          EDIT ah, do you mean something like this?

              CREATE FUNCTION do_thing(flags custom_enum)
          • marcosdumay 3 years ago

            Yep, you use it as CREATE FUNCTION do_something(some_option custom_enum). The main use-case is not putting them on tables.

            Relational algebra has the concept of enums builtin (as long as they all have the same structure). You create them in databases by using foreign keys.

      • aidos 3 years ago

        Heyya! Close customer here.

        > When False, uses VARCHAR + check constraint for all backends

        Is this behaviour new? Or is it that alembic still doesn't pick up changes to the enum to recreate the constraint? We implemented our own alembic hooks to automate the migrations for us. We've found the check constraint model to be fairly effective.

        • wojcikstefanOP 3 years ago

          Hiya Aidan! :D

          > Is this behaviour new?

          Don't think so. Even the docs for SQLA v1.3 ([0]) mention it:

          > native_enum – Use the database’s native ENUM type when available. Defaults to True. When False, uses VARCHAR + check constraint for all backends. The VARCHAR length can be controlled with Enum.length

          [0]: https://docs.sqlalchemy.org/en/13/core/type_basics.html#sqla...

          --

          > Or is it that alembic still doesn't pick up changes to the enum to recreate the constraint?

          Off the top of my head, I believe Alembic doesn't pick it up automatically. We make a lot of manual tweaks to our auto-generated Alembic migrations and I believe adding / modifying the CHECK constrains is part of it.

          • aidos 3 years ago

            Happy to ping you guys over the code if you want to look. It’s probably fairly tweak-able to handle your specific cases.

akersten 3 years ago

Is there a reason ALTER TYPE name ADD VALUE new_enum_value wouldn't Just Work for the first example for 99% of use cases? Seems like the only drawback highlighted for native enums is that you have to lock the whole table if you completely swap out the type of one of the columns, which... Yes, that's true, but also very rare? I guess TFA mentions removing a value from an enum, but you shouldn't do that imo - leave the data in place and ignore it at the app layer, like how you'd treat a deleted flag or similar. Or migrate it and leave the old value hanging around in the type.

  • jpmelos 3 years ago

    > Is there a reason ALTER TYPE name ADD VALUE new_enum_value wouldn't Just Work for the first example for 99% of use cases?

    I'm sorry, we don't have any examples in the article about adding values to enums, so I don't know to what you are referring here.

    > Yes, that's true, but also very rare?

    It is rare, indeed, but we just have a strong inclination towards having a clean schema, so we do prefer to clean up things that would otherwise be unused. But I understand that many people would prefer to leave them hanging there, even if they are not going to be used in any way.

jpalomaki 3 years ago

Native enums can be used in queries like strings, but with type checking:

  select * from cust where type = ’company’ -- ok
  select * from cust where type = ’c0mpany’ -- error
As mentioned, they take less space. Important if you use these columns in index and have millions of rows.
Icathian 3 years ago

Little known bug, but enums also choke dump/restore if you use them as a hash partition key.

That said, they can be really nice for all the same reasons static types are nice. Good article!

fabian2k 3 years ago

Native enums in Postgres just seem to have many more drawbacks than advantages. I never really used them, it didn't seem worth it.

I usually use C# Enums translated by EF Core now, which works perfectly fine on the C# side. The only missing part would be to give the DB the information about the enum names, so it could show those to me instead of just the raw numbers. But I assume there is no way to do that. Adding entire tables just doesn't seem worth it for this use case alone.

  • ComputerGuru 3 years ago

    > Adding entire tables just doesn't seem worth it for this use case alone.

    If you peek at a particular table often enough and it has a crazy enough enum to maybe justify making your life a little easier, what you can do is add a table just for debug purposes and use a view that maps the int to it, but otherwise only use the direct int:enum cast in your app/orm/etc. thereby not paying any runtime cost for foreign keys or integrity checking.

  • bob1029 3 years ago

    > Adding entire tables just doesn't seem worth it for this use case alone.

    I am going down this path. There are a lot of downstream use cases (e.g. support, reporting, etc) that can benefit from having the canonical names & human-friendly labels in the database.

    Keeping database in sync with code like this requires some discipline and manual scripting, but I think it's worth it.

  • wojcikstefanOP 3 years ago

    > I usually use C# Enums translated by EF Core now (...)

    Does this approach ensure data consistency? Or could you accidentally insert a number value into your table which is not represented by any of the enum values?

    • ComputerGuru 3 years ago

      Well C# is strongly typed so that’s punted to the CLR type system (assuming you’re not using raw sql to insert values via other means/frontends) - but it isn’t too hard to forcibly to cast a raw int to an enum in C# which doesn’t actually contain a definition with that integral value (unlike in even stronger typed languages eg in rust where it’s a lot more work).

      I “manually” add a check constraint (via EF, so it’s not so much manual as it is “remember to copy and paste this in your db.cs) to assert the value is greater than or equal to zero and less than the number of inhabitants in the enum, but this fails if you manually code the enum values (eg for flags, legacy interop, etc).

    • fabian2k 3 years ago

      With raw SQL you could write anything in it. Within EF Core it won't allow bad values, you could trick it I think but that is not something you'd do accidentally. You can always add a CHECK constraint in addition, if you have more uncontrolled places modifying those values.

  • Rapzid 3 years ago

    IDK I use the npgsql enum mapping to native pg enum and it works great. Pretty sure I have it configured for string values too.

brycelarkin 3 years ago

The article misses the most common way to represent enums: foreign keys…

  • tsxxst 3 years ago

    Foreign keys have a different set of tradeoffs. We'd have to either look enum values up before making the query, or use a join that might affect query plan.

rollulus 3 years ago

When would one use this approach versus a domain type [1]? How does it differ?

[1]: https://www.postgresql.org/docs/current/domains.html

  • jpmelos 3 years ago

    That would be an improvement on the `CHECK` constraint approach if the same `CHECK` constraint is used in multiple columns, to DRY it up. The `DOMAIN` types offer the same commands for `CHECK` constraints, namely `NOT VALID` when updating the constraint, and `VALIDATE CONSTRAINT` to make sure it is valid after having updated it.

somat 3 years ago

I am probably missing something obvious(space/time optimization for very large/busy tables?) but isn't this a perfect example of why foreign keys exist. In fact both case studies presented by the article(enums and fixed string check constraints) are clunky enough that I would say both are wrong and they should just go with a foreign key.

Canada 3 years ago

I agree with this, I don't use enums they are always more trouble than they are worth.

They break FDW unless they are pre-created on the importing side. Super inconvenient.

  • wojcikstefanOP 3 years ago

    Apologies for my ignorance, but what does "FDW" stand for in your comment? :)

    • Quekid5 3 years ago

      Foreign Data Wrapper

      • adgjlsfhk1 3 years ago

        specifically, they are one of the coolest features in postgres. Have you ever wanted to pretend that a CSV (or your filesystem or google) or basically anything else was a SQL database? With a couple hundred lines of code, you can do this. The performance won't be great, but it's ridiculously flexible.

        • ComputerGuru 3 years ago

          I have a two-liner shell script that turns a csv into a virtual SQLite db (./to_sql.sh foo.csv) and lets me use the full power of SQLite. Usually that’s enough for my purposes, but I do use pgsql‘s copy from csv to generate tables representing foreign data imported from a csv (say census info or NAICS occupation codes).

  • efxhoy 3 years ago

    I had to implement a workaround for that for our DW which imports from two different app databases. IMO this should be part of

      import foreign schema
    
    but then you would have to qualify the names somehow if you were importing from more than one db. I wrote this as a workaround, it runs every day as part of our data import job.

      DROP SCHEMA IF EXISTS fdw_enum CASCADE;
      CREATE SCHEMA fdw_enum;
      
      -- Get names of the enums using ::regtype casting and label and sort_order from app_a and app_b.
      CREATE VIEW fdw_enum.app_a_enums AS SELECT * FROM dblink('fdw_app_a',
      $QUERY$
        SELECT
          enumtypid::regtype AS name,
          enumsortorder sort_order,
          enumlabel label
        FROM pg_enum;
      $QUERY$
      ) AS t (name text, sort_order integer, label text);
      
      CREATE VIEW fdw_enum.app_b_enums AS SELECT * FROM dblink('fdw_app_b',
      $QUERY$
        SELECT
          enumtypid::regtype AS name,
          enumsortorder sort_order,
          enumlabel label
        FROM pg_enum;
      $QUERY$
      ) AS t (name text, sort_order integer, label text);
      
      -- Ensure enums with the same names aren't defined in both app_a and app_b.
      DO
      $DO$
      DECLARE
        matching_count integer;
      BEGIN
        SELECT COUNT(*) into matching_count
        FROM fdw_enum.app_a_enums INNER JOIN fdw_enum.app_b_enums USING (name);
        ASSERT
          matching_count = 0,
          'app_a and app_b-NG have identically named enums. Implement a check that they are identically defined.';
      END
      $DO$;
      
      CREATE VIEW fdw_enum.upstream_enums AS
      SELECT * FROM fdw_enum.app_a_enums
      UNION ALL
      SELECT * FROM fdw_enum.app_b_enums;
      
      
      CREATE PROCEDURE fdw_enum.create_type_if_not_exists(name text)
      LANGUAGE plpgsql
      AS $PROC$
        BEGIN
          EXECUTE format('CREATE TYPE %s AS ENUM ()', name);
          EXCEPTION WHEN duplicate_object THEN RAISE NOTICE 'Enum type % already exists, skipping', name;
        END;
      $PROC$
      ;
      
      
      -- To make this idempotent we create the enums empty then alter them by adding values.
      -- So instead of `CREATE TYPE foo AS ENUM ('bar', 'baz');` we do
      -- `CREATE TYPE IF NOT EXISTS foo AS ENUM ();
      -- `ALTER TYPE foo ADD VALUE IF NOT EXISTS ('bar');
      -- `ALTER TYPE foo ADD VALUE IF NOT EXISTS ('baz');
      -- This ensure the procedure is re-runnable and can add new values to the enum as they are added upstream.
      -- Order is ensured by the ORDER BY in the loop query.
      CREATE PROCEDURE fdw_enum.refresh_upsteram_enums()
      LANGUAGE plpgsql
      AS $PROC$
      DECLARE
        rec record;
        ddl text;
      BEGIN
        FOR rec IN SELECT DISTINCT(name) AS name FROM fdw_enum.upstream_enums
        LOOP
          CALL fdw_enum.create_type_if_not_exists(rec.name);
        END LOOP;
      
        FOR rec in
          SELECT *
          FROM fdw_enum.upstream_enums
          ORDER BY name, sort_order
        LOOP
          ddl := FORMAT('ALTER TYPE %s ADD VALUE IF NOT EXISTS %s', rec.name, quote_literal(rec.label));
          EXECUTE ddl;
        END LOOP;
      END
      $PROC$
      ;
      
      CALL fdw_enum.refresh_upsteram_enums();
cmcconomy 3 years ago

I'm happy to have a FK to a reftable with int, shortname, description

jensenbox 3 years ago

What is the difference in size on disk between the two options?

  • jpmelos 3 years ago

    An enum stores a small reference to the enum value in the tuples, and the constraint approach stores the string itself. The constraint approach takes more space with (a) larger enum values, (b) more columns using them, and (c) more rows in the tables that use them. The difference depends on the characteristics of the database we're talking about.

out_of_protocol 3 years ago

smallint looks like a good alternative, with dictionary in the app or separate table. So far i've only seen storing dictionary in app source code approach

  • wojcikstefanOP 3 years ago

    Just to make sure I understand your suggestion fully, you're saying: 1. Use `smallint` instead of `text` for the column type. 2. Otherwise follow our "`CHECK` constraint" approach (without it we're jeopardizing data consistency because we can store anything between -32768 and 32767 in that column). 3. Translate the int to the enum in your application.

    Right?

    • ComputerGuru 3 years ago

      Most ORMs or just DALs can automate step 3 for you. If you’re using a strongly typed language with any type of ffi support, you’ll probably already have native cast from int to enum available (unless you’re using rust where that’s considered unsafe by default).

    • out_of_protocol 3 years ago

      Yep!

  • jpmelos 3 years ago

    I don't like this approach much because getting cryptic integers when you do a `SELECT` in the database is really cumbersome.

    • arp242 3 years ago

      Agreed, but on the other hand it also saves a lot of complexity and possible headaches down the line. It's kind of a matter of choosing which headache you want. Especially if your table gets larger all those extra bytes in text columns can cost you dozens of GB of disk space, makes indexing slower, etc.

      I MySQL/MariaDB enums are just "aliases" for ints, and that works much nicer IMHO, and adding a new value is cheap because it doesn't recheck all the rows (removing values is still expensive, as it needs to check it's not actually used by any row).

Keyboard Shortcuts

j
Next item
k
Previous item
o / Enter
Open selected item
?
Show this help
Esc
Close modal / clear selection