Choosing a Database Schema for Polymorphic Data (2024)

56 points by gm678 22 days ago

In real life, insurance often doesn't pay for everything and service providers would need to support capturing both sets of info, not either one or the other. But that's beside the point.

Properly naming check constraints can go a long way towards providing better error messages, and then there are stored procs which can be used to ensure that multiple tables are updated as a single logical step. It would've been nice if SQL supported literal values in FK constraints, e.g. FOREIGN KEY (id, 'uninsured') REFERENCES patients(id, payment_type), as well as exclusion constraints, e.g. FOREIGN KEY (id) NOT REFERENCES insured_payment_info(id). As it stands, my preferred approach would be combining approach 4 with stored procs, except I would make id the primary key and (id, payment_type) a unique key.

gm678 - 21 days ago

The article alludes to it by mentioning single table inheritance, but some of these patterns go all the way back to Fowler's _Patterns of Enterprise Application Architecture_:

* https://www.martinfowler.com/eaaCatalog/singleTableInheritan...

* https://www.martinfowler.com/eaaCatalog/classTableInheritanc...

* https://www.martinfowler.com/eaaCatalog/concreteTableInherit...

bob1029 - 21 days ago

JSON can work surprisingly well as a blob column up to a point. The part that is interesting is that point varies dramatically depending on where and how that blob is being serialized.

If you are storing json blobs in SQLite and using a very fast serializer (gigabytes/s), then anything under a megabyte or so won't really show up on a hot path. Updates to complex entities can actually be faster, even if you're burning more IO and device wear.

If you need to join across properties in the JSON, I wouldn't use JSON. My canary is if I find myself using the built in json query functionality, I am too far into noSQL Narnia.

fmjrey - 21 days ago

The article reads like a story of trying to fit a square peg in a round hole, discussing pros and cons of cutting the square corners vs using a bigger hole. At some point one needs to realize we're using the wrong kind of primitives to build the distributed systems of today. In other words, we've reached the limit of the traditional approach based on OO and RDBMS that used to work with 2 and 3-tier systems. Clearly OO and RDBMS will not get us out of the tar pit. FP and NoSQL came to the rescue, but even these are not enough to reduce the accidental complexity of building distributed systems with the kind of volume, data flows, and variability of data and use cases.

I see two major sources of inspiration that can help us get out of the tar pit.

The first is the EAV approach as embodied in databases such as Datomic, XTDB, and the like. This is about recognizing that tables or documents are too coarse-grained and that entity attribute is a better primitive for modeling data and defining schemas. While such flexibility really simplifies a lot of use cases, especially the polymorphic data from the article, the EAV model assumes data is always about an entity with a specific identity. Once again the storage technology imposes a model that may not fit all use cases.

The second source of inspiration, which I believe is more generic and promising, is the one embodied in Rama from Red Planet Labs, which allows for any data shape to be stored following a schema defined by composing vectors, maps, sets, and lists, and possibly more if custom serde are provided. This removes the whole impedance mismatch issue between code and data store, and embraces the fact that normalized data isn't enough by providing physical materialized views. To build these, Rama defines processing topologies using a dataflow language compiled and run by a clustered streaming engine. With partitioning being a first-class primitive, Rama handles the distribution of both compute and data together, effectively reducing accidental complexity and allowing for horizontal scaling.

The difficulty we face today with distributed systems is primarily due to the too many moving parts of having multiple kinds of stores with different models (relational, KV, document, graph, etc.) and having too many separate compute nodes (think microservices). Getting out of this mess requires platforms that can handle the distribution and partitioning of both data and compute together, based on powerful primitives for both data and compute that can be combined to handle any kind of data and volumes.

setr - 21 days ago

I mean this particular problem would be resolved if the database let you define/defend a UNIQUE constraint across tables. Then you could just do approach #2 without the psychotic check constraint.

acquiesce - 22 days ago

I wouldn’t do this personally because the downstream code very often has to handle differences where polymorphism breaks and you end up having to query the type. Polymorphism shouldn’t be used for data, only behavior, and only in very specific circumstances. Subclassing is a different topic.

setr - 22 days ago

You wouldn’t do what? Have polymorphic data to begin with? I don't see how you can choose to avoid the scenario that record A has one of several possible related metadata, other than just ignoring it and allowing invalid representations
- acquiesce - 21 days ago
  
  Correct. This data doesn’t meet criteria for polymorphism because insured and uninsured processes from a book keeping perspective have very distinct flows and requirements in reality. Using OOP here is a mistake. Straight forward procedural code in-flight as well as any batch jobs should deal separately with insured and uninsured data and there should be zero overlap unless we need to extract aggregate data like how many payments in total and what’s the total amount. For those situations you can use a separate domain model that distinctly deals with these queries as value objects themselves if you want to go down that rats nest.
  Other top level comments covered what I wanted to say but my comment is the OG one. I deal with payments, transactions and all that with multiple currencies and other complexities. Just keep it simple and don’t use OOP for this stuff it’s the wrong tool for the job.
  - setr - 19 days ago
    
    The primary impetus is to enforce the constraint: a billing is either insured or uninsured. Insured has additional metadata X, Y, Z. Uninsured has additional metadata A,B,C. A billing cannot be both.
    If you separate them into different insured and uninsured tables, then any tables associated with billing generally needs to be cascaded into an insured/uninsured variants as well. Billing_customer and billing_customer_details now becomes uninsured_billing_customer and insured_billing_customer and uninsured_billing_customer_details, etc.
    As you add more data of this constraint, everything fragments again, scaling at 2^n tables. This is similar to the async coloring problem; what you wanted is to locally fragment the data model, but instead anything that touches it gets poisoned.
    Ideally the DB would let you enforce the constraint UNIQUE(uninsured.billing_id, insured.billing_id) and split-table would suffice
    I’m not seeing any top-level comments that resolve this —- other than ignoring the problem altogether (let the data model encode invalid states, handle it in app code), or switching to a different database.
    
    frollogaston - 18 days ago
    
    Letting the app code enforce these constraints isn't ignoring the problem, it's how you solve this. Your DB is never going to represent all the business logic by itself. You can also add the slightly clunky constraint it mentions if you really want.
    I wouldn't do this with separate tables. I also wouldn't do this with polymorphism, or OOP in general, even if the DBMS properly supported OOP. Trying to represent these constraints by classifying things will get confusing fast.
- ozgrakkurt - 22 days ago
  
  You can have different tables for different data. You don’t have to put all in same table
  - setr - 21 days ago
    
    Only the first strategy shoves it into the same table? The fundamental problem is record A can have type X Y or Z, with each type having additional metadata. You could flatten the model and have a table for each type X Y Z and query them independently, and pay the cost of duplicate schema structure and having to ensure they’re always synchronized manually (including any dependent tables), or you pluck out the common core and run into the article’s problem
  - gm678 - 21 days ago
    
    I think the article alludes to the difficulty of this solution by discussing the need for invariants to be upheld when an insured patient becomes uninsured or vice versa. Different tables for each 'subclass' could be an option, but if that can change later on, you now need to move patients between the insured_patient and uninsured_patient tables and make sure you don't have duplicates.
IceDane - 22 days ago

Want to elaborate on how you're going to magically disappear the inherent polymorphism in your problem domain every time?
Sometimes you can indeed view things from a different perspective and come up with a simpler data model, but sometimes you just can't.
- acquiesce - 21 days ago
  
  There is no polymorphism. There’s nothing polymorphic about the 2 types of payments. And furthermore you’re likely to run into situations where you have to have both an insured amount and an uninsured amount for a given treatment/procedure. So now you’re dealing with arrays of heterogenous data.
  The process for handling the two cases is distinct. This is the classic OOP issue of trying to use a Shape object to represent both Box and Sphere. Just don’t. Stick with transaction/linear code and use transforms as it makes sense for certain cases (ie, MVVM style). Handle the two cases distinctly so there is no room for ambiguity.
  People get this confused and they think it can’t be simpler.

1st1 - 22 days ago

FWIW polymorphic schema and queries are trivial in Gel https://docs.geldata.com/reference/edgeql/select#polymorphic...

oulipo - 22 days ago

Is gel mature enough to be used in prod? what's the main advantage it has over postgres?

hot_gril - 22 days ago

Definitely don't want to store types as columns in a DB, especially because of the inevitable thing that qualifies as two different types. In this situation, I'd usually take the first one (nullable cols) without much consideration. The DB doesn't need an xor constraint, but it can if you really want. New cols can be added without much impact on existing data.

And if the info is non-scalar, it's either option 2 (nullable FK) or 5 (JSON), depending on whether or not other things join with fields inside it.

hobs - 22 days ago

First one gets messier faster than most, the fourth one generally grows the least crappily over time.

j45 - 22 days ago

I’m unsure why one wouldn’t use a polymorphic database for polymorphic data, instead of the gymnastics of bending a relational db.

setr - 21 days ago

If 95% of the data can be trivially modeled in a relational database, your recommendation is to switch databases or add a new system entirely?
- j45 - 21 days ago
  
  A thing about polymorphic references is the schema definitely can be trivial in relational db.
  Using a polymorphic setup in something like Postgres is where I like to start any modeling.. until the complexity of queries reveal themselves.
  The further one goes down that rabbit hole, the sql can easily become more and more complex for the kinds of queries that become beneficial in polymorphic or graph type databases.
  Where SQL gets complex and convoluted over time to handle polymorphic queries, is one place where polymorphic databases can simplify the complex.
  Using a polymorphic database may be worth it for some or all of the schema. It seems reasonable to extend something like Postgres to sync and allow reads or writes from the polymorphic side.
  Happy to learn and listen from alternatives approaches.

4b11b4 - 22 days ago

I learned a lot in a short time, thanks

feitico - 22 days ago

Just use NoSQL

MongoDB is great for this

dinfinity - 21 days ago

It's web scale!
- j45 - 21 days ago
  
  Haha, I haven't quite understood why so much effort is spent making non-relational databases into relational databases, and now polymorphic.

toolslive - 22 days ago

approach # 6: Column Store?