Nobody ever got fired for using a struct

feldera.com

146 points by gz09 23 days ago · 139 comments

amluto 20 days ago

There are many systems that take a native data structure in your favorite language and, using some sort of reflection, make an on-disk structure that resembles it. Python pickles and Java’s serialization system are infamous examples, and rkyv is a less alarming one.

I am quite strongly of the opinion that one should essentially never use these for anything that needs to work well at any scale. If you need an industrial strength on-disk format, start with a tool for defining on-disk formats, and map back to your language. This gives you far better safety, portability across languages, and often performance as well.

Depending on your needs, the right tool might be Parquet or Arrow or protobuf or Cap’n Proto or even JSON or XML or ASN.1. Note that there are zero programming languages in that list. The right choice is probably not C structs or pickles or some other language’s idea of pickles or even a really cool library that makes Rust do this.
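As a sketch of that format-first approach (using Python's stdlib struct module with a hypothetical record layout, not any particular library's API), the on-disk layout is written down explicitly rather than inferred from a language object:

```python
import struct

# Hypothetical record layout, declared explicitly: a little-endian u32
# user_id followed by a u16 flags field. The format definition, not the
# language object, is the source of truth.
RECORD = struct.Struct("<IH")  # 4 + 2 = 6 bytes, no padding

blob = RECORD.pack(1234, 0b101)       # serialize to the declared layout
user_id, flags = RECORD.unpack(blob)  # deserialize from it
print(user_id, flags, RECORD.size)    # 1234 5 6
```

Mapping that fixed layout back into Rust, Java, or any other language is mechanical, which is what buys the portability.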

(OMG I just discovered rkyv_dyn. boggle. Did someone really attempt to reproduce the security catastrophe that is Java deserialization in Rust? Hint: Java is also memory-safe, and that has not saved users of Java deserialization from all the extremely high severity security holes that have shown up over the years. You can shoot yourself in the foot just fine when you point a cannon at your foot, even if the cannon has no undefined behavior.)

  • pjc50 20 days ago

    Dotnet used to have BinaryFormatter, which they had to kill for this reason, and they basically recommend that exact set of serializers as a replacement: https://learn.microsoft.com/en-us/dotnet/standard/serializat...

  • hdjrudni 20 days ago

    Don't forget PHP's serialize/unserialize, it's also sketchy. Looks like they at least put up a big warning in their docs: https://www.php.net/manual/en/function.unserialize.php

    Not hating on PHP, to be clear. It has its warts, but it has served me well.

    • nchmy 20 days ago

      igbinary is often a good drop-in replacement for native serialize/unserialize. Faster and smaller.

  • imtringued 20 days ago

    Fully agreed. rkyv looks like something that is hyper optimizing for a very niche case, but doesn't actually admit that it is doing so. The use case here is transient data akin to swapping in-memory data to disk.

    "However, while the former have external schemas and heavily restricted data types, rkyv allows all serialized types to be defined in code and can serialize a wide variety of types that the others cannot."

    At first glance it might sound like rkyv is better; after all, it has fewer restrictions, and external schemas are annoying. But it doesn't actually solve the schema issue by having a self-describing format like JSON or CBOR. You won't be able to use the data outside of Rust and you're probably tied to a specific Rust version.

    • bombela 19 days ago

      > You won't be able to use the data outside of Rust and you're probably tied to a specific Rust version.

      This seems false after reading the book, the doc, and a cursory reading of the source code.

      It is definitely independent of Rust version. The code makes use of repr(C) on structs (field order follows the source code) and every field gets its own alignment (making it independent from the C ABI alignment). The format is indeed portable. It is also versioned.

      The schema of the user structs is in Rust code. You can make this work across languages, but that's a lot of work and code to support. And this project appears to be in Rust for Rust.

      On a side note, I find the code really easy to understand and follow. In my not so humble opinion, it is carefully crafted for performance while being elegant.

  • gz09OP 20 days ago

    > Depending on your needs, the right tool might be Parquet or Arrow or protobuf or Cap’n Proto

    I think parquet and arrow are great formats, but ultimately they have to solve a similar problem that rkyv solves: for any given type that they support, what does the bit pattern look like in serialized form and in deserialized form (and how do I convert between the two).

    However, it is useful to point out that parquet/arrow on top of that solve many more problems needed to store data 'at scale' than rkyv (which is just a serialization framework after all): well defined data and file format, backward compatibility, bloom filters, run length encoding, compression, indexes, interoperability between languages, etc. etc.

  • bombela 19 days ago

    > (OMG I just discovered rkyv_dyn. boggle. Did someone really attempt to reproduce the security catastrophe that is Java deserialization in Rust?

    Trusting possibly malicious inputs is a universal problem.

    Here is a simple example:

        echo "rm -rf /" > cmd
        sh cmd
    
    And this problem is no different in rkyv than rkyv_dyn or any other serialization format on the planet. The issue is trusting inputs. This is also called a man in the middle attack.

    The solution is to add a cryptographic signature to detect tampering.

    • amluto 19 days ago

      This is an unhelpful interpretation. With a decent memory-safe parser, it’s perfectly safe [1] to deserialize JSON or (most of) XML [0] or protobuf or Cap’n Proto or HTTP requests, etc. Or to query a database containing untrusted data. You need to be careful that you don’t introduce a vulnerability by doing something unwise with the deserialized result, but a good deserializer will safely produce a correctly typed output given any input, and the biggest risk is that the output is excessively large.

      But tools like Pickle or Java deserialization or, most likely, rkyv_dyn will happily give you outputs that contain callables and that contain behavior, and the result is not safe to access. (In Python, it’s wildly unsafe to access, as merely reading a field of a Python object calls functions encoded by the class, and the class may be quite dynamic.)
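      A minimal sketch of that hazard in Python (pickle is the canonical example; the payload here is deliberately harmless):

```python
import pickle

# __reduce__ lets any pickled payload name an arbitrary callable that the
# loader will invoke. A real exploit would name something like os.system;
# eval on a constant expression keeps this demo safe.
class Payload:
    def __reduce__(self):
        return (eval, ("6 * 7",))

blob = pickle.dumps(Payload())
result = pickle.loads(blob)  # eval runs here, before you ever inspect result
```

      The load itself executes behavior; there is no way to get a correctly typed pure-data output first and validate it afterwards.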

      [0] The world is full of infamously dangerous XML parsers. Don’t use them, especially if they’re written in C or C++ or they don’t promise that they will not access the network.

      > The solution is to add a cryptographic signature to detect tampering.

      If you don’t have a deserializer that works on untrusted input, how do you verify signatures? Also, do you really think it’s okay to do “sh $cmd” just because you happen to have verified a signature?

      > This is also called a man in the middle attack.

      I suggest looking up what a man in the middle attack is.

      • bombela 19 days ago

        Ah, I see the confusion. rkyv_dyn doesn't serialize code. Rust is compiled to machine code. It would be quite a feat to accomplish.

        I was a bit confused when you compared it to Python pickle and assumed you were talking about general input validation somehow.

        I agree that pickle and similar are profoundly surprising and error-prone. I struggle to find a good reason one would want that.

        As for the man in the middle attack, I meant that if somebody intercepts the serialized form, they can mutate it. And without a cryptographic signature, you wouldn't know.

        • amluto 19 days ago

          > rkyv_dyn doesn't serialize code. Rust is compiled to machine code.

          Java is compiled to bytecode, and Obj-C is compiled to machine code. Yet both Android and iOS have had repeated severe vulnerabilities related to deserializing an object that contains a subobject of an unexpected type that pulls code along with it. It seems to me that rkyv_dyn has exactly the same underlying issue.

          Sure, Rust is “safe”, and if all the unsafe code is sufficiently careful, it ought to be impossible to get the type of corruption that results in direct code execution, memory writes, etc. But systems can be fully compromised by semantic errors, too.

          If I’m designing a system that takes untrusted input and produces an object of type Thing, I want Thing to be pure data. Once you start allowing an open set of methods on Thing or its subobjects, you have lost control of your own control flow. So doing:

              thing.a.func()
          
          may call a function that wasn’t even written at the time you wrote that line of code or even a function that is only present in some but not all programs that execute that line of code.

          Exploiting this is considerably harder than exploiting pickle, but considerably harder is not the same as impossible.

          • bombela 18 days ago

            You know very well what I meant by "compile to machine code". But you decided to interpret it in a combative way. Even though you seem very knowledgeable, this makes me want to stop discussing with you.

            Ultimately you should read the code of rkyv_dyn to understand what it does instead of making random claims.

            It will be faster for you to read the code than for me to attempt explaining how it works. Especially since you will most likely choose the least charitable interpretation of everything I say. There is very little code, it won't take long.

            • amluto 18 days ago

              > You know very well what I meant by "compile to machine code".

              I really don't. I think you mean that Rust compiles to machine code and neither loads executable code at runtime nor contains a JIT, so you can't possibly open a file and deserialize it and end up with code or particularly code-like things from that file being executed in your process.

              My point is that there's an open-ended global registry of objects that implement a given trait, and it's possible (I think) to deserialize and get an unexpected type out, and calling its methods may run code that was not expected by whoever wrote the calling code. And the set of impls and thus the set of actual methods may expand by the mere fact of linking something else into the project.

              This probably won't blow up quite as badly as NSCoding does in ObjC because Rust is (except when unsafe is used) memory-safe, so use-after-free just from deserializing is pretty unlikely. But I would still never use a mechanism like this if there was any chance of it consuming potentially malicious input.

  • john01dav 20 days ago

    > even a really cool library that makes Rust do this.

    The first library that comes to mind when I think of this is `serde` with `#[derive(Serialize, Deserialize)]`, but that gives persistence-format output, which, as you describe, is preferable to the former case. I usually use it with JSON.

    So, this seems like it may be a false dichotomy.

    • amluto 19 days ago

      Maybe a little bit. But serde works with JSON (among other formats), and you can use it to read and write JSON that interoperates with other libraries and languages just fine. Kind of like how SQLAlchemy looks kind of like you’re writing normal Python code, but it interoperates with SQL.

      rkyv is not like this.

    • thechao 19 days ago

      I know "serde" is a take on "codec" but *rewrite* was right there! Also, as long as I'm whinging about naming? 'print' and 'parse' are five letter p words in a bidirectional relationship. Oh! Oh! push, peek, poke, ... pull! It even makes more sense than pop! And it's four letters!

  • LtWorf 20 days ago

    But if you use complicated serialisation formats you can't mmap a file into memory and use it directly. Which is quite convenient if you don't want to parse the whole file and allocate it to memory because it's too large compared to the amount of memory or time you have.
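    A rough sketch of that pattern in Python (hypothetical fixed-size record layout; the stdlib mmap and struct modules stand in for a zero-copy framework):

```python
import mmap
import os
import struct
import tempfile

# Hypothetical record: little-endian u32 id + f64 value = 12 bytes each.
REC = struct.Struct("<Id")

# Write 1000 fixed-size records to a scratch file.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    for i in range(1000):
        f.write(REC.pack(i, i * 0.5))

# Map the file and jump straight to record 700: no full-file parse, and no
# per-record allocations for the records we never touch.
with open(path, "rb") as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        rec_id, value = REC.unpack_from(mm, 700 * REC.size)
os.remove(path)
```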

  • vlovich123 20 days ago

    Protobufs definitely don’t solve the problems described. Capnproto may solve it but I’m not 100% sure. JSON/XML/ASN.1 definitely don’t.

    It’s like you listed a bunch of serialization technologies without grokking the problem outlined in the post doesn’t have much to do with rkyv itself.

    • imtringued 20 days ago

      Actually, it's you who is giving that impression with an ultra vague "doesn't solve the problems described".

      The only problem in the blog post is efficient coding of optional fields, and all they did was introduce a bitmap. From that perspective, JSON and XML solve the optional-fields problem to perfection, since an absent field costs exactly nothing.
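      For example (stdlib json; field names hypothetical):

```python
import json

# An absent optional field costs zero bytes on the wire, while even an
# explicit null still spends bytes on the key.
present = json.dumps({"id": 1, "nickname": None})
absent = json.dumps({"id": 1})

# Readers treat the missing key as "not set" without any bitmap bookkeeping.
nickname = json.loads(absent).get("nickname")
```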

      • vlovich123 20 days ago

        I guess you missed the part where the size of the data stored on disk and efficient deserialization are also critically important performance characteristics that neither JSON nor XML have?

        Capnproto doesn’t support transform on serialize - the optional fields still take up disk space unless you use the packed representation which has some performance drawbacks. Also the generated capnproto rust code is quite heavy on compile times which is probably some consideration that’s important for compiling queries.

    • locknitpicker 20 days ago

      > Protobufs definitely doesn’t solve the problems described. Capnproto may solve it but I’m not 100% sure. JSON/XML/ASN.1 definitely don’t.

      I'm not sure you are serious. What open problem do you have in mind? Support for persisting and deserializing optional fields? Mapping across data types? I mean, some JSON deserializers support deserializing sparse objects even to dictionaries. In .NET you can even deserialize random JSON objects to a dynamic type.

      Can you be a little more specific about your assertion?

      • vlovich123 20 days ago

        The space overhead and the overhead of serialization/deserialization. Rkyv is zero overhead - it’s random access without needing to deserialize and can even be memory mapped.

        • amluto 20 days ago

          The whole “zero overhead” thing is IMO a red herring. I care about a few things: stability across versions and languages, space efficiency (sometimes) and performance. I do not care about “overhead” — performance trumps overhead every time.

          Your deserializer is probably running on a CPU, and that CPU probably has a very fast L1 cache and might be targeted by a compiler that can do scalar replacement of aggregates and such. A non-zero-overhead deserializer can run very quickly and result in the output being streamed efficiently from its source and ending up hot in L1 in a useful format. A zero-overhead deserializer might do messy reads in a bad order without streaming hints and run much slower.

          And then you get to very, very large records, as in the OP, where getting a good on-disk layout may require thought. And, frequently, the right layout isn’t even array-of-structs, which is why there are so many tools designed to query column stores like Parquet efficiently.

          • zadikian 20 days ago

            Serdes time can be significant. There are use cases for the zero copy formats even though they use more space. Likewise bit-packed asn1 is often slower than byte-aligned.

        • cozzyd 19 days ago

          If you care about space, you're almost certainly going to compress your output (unless, like, you're literally storing random noise) and so you'll necessarily have overhead from that.

          Unless the reason you care about space is because it's some sort of wire protocol for a slow network (like LoRaWAN or Iridium packets or a binary UART protocol), where compression probably doesn't make sense because the compression overhead is too large. But even here, just defining the data layout makes sense, I think.

          This could take the form of a C struct with __attribute__((packed)), but that is fragile if you care about more platforms than one. (I generally don't, so that works for me!)
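          The same packed-versus-aligned trade-off can be sketched with Python's stdlib struct module, where '<' suppresses padding much like __attribute__((packed)) and '@' uses the platform's native C alignment:

```python
import struct

# u8 followed by u32: packed back to back vs. naturally aligned.
packed = struct.calcsize("<BI")  # 1 + 4 = 5 bytes, no padding
native = struct.calcsize("@BI")  # platform-dependent; typically 8 (3 pad bytes)
```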

    • amluto 19 days ago

      Even completely ignoring the issues of language-centric vs data-format-centric serializers, your list is missing two very notable entries from my list: Arrow and Parquet. Both of them go to quite some lengths to efficiently handle optional/missing data efficiently. (I haven’t personally used either one for large data sets, but I have played with them. I think you’ll find that Arrow IPC / Feather (why can’t they just pick one name?) has excellent performance for the actual serialization and deserialization part as long as you do several rows at a time, but Parquet might win for table scans depending on the underlying storage medium.). Both of them are, quite specifically, the result of years of research into storing longish arrays of wide structures with potentially complex shapes and lots of missing data efficiently. (Logical arrays. They’re really struct-of-arrays formats, and I personally have a use case I kind of want to use Feather for except that Feather is not well tuned for emitting one row at a time.)

    • UqWBcuFx6NV4r 20 days ago

      I have zero doubt that you’re on some ‘no true Scotsman’-style “you’re not doing Real Development if you are using these technologies to solve these problems” thing. Let’s just drop that. There are myriad ‘real man webscale development’ scenarios where these are more than acceptable.

    • zadikian 20 days ago

      Pretty sure protobuf used a header to track field presence within a message, similarly to what this article does. That does have its own overhead you could avoid if you knew all fields were present, but that's not the assumption it makes.

  • neilyio 20 days ago

    Delightful metaphor, I'll be looking everywhere for a chance to use that now!

  • userbinator 20 days ago

    > and often performance as well

    BS. Nothing can be faster than a read()/write() (or even mmap()) into a struct, because everything else would need to do more work.

    • pjc50 20 days ago

      Sure, if your structure doesn't contain any pointers and you only ever want to support one endianness and you trust your compiler to fix the machine layout of the struct forever.

      • zadikian 20 days ago

        Mainly the first thing. If your struct is already serial, of course serialization will be easy.

      • userbinator 19 days ago

        ...which is true for 99.999% of the time anyway, so it's not worth worrying about.

duc_minh 20 days ago

> Sometimes the best optimization is not a clever algorithm. Sometimes it is just changing the shape of the data.

This is basically Rob Pike's Rule 5: If you've chosen the right data structures and organized things well, the algorithms will almost always be self-evident. (https://users.ece.utexas.edu/~adnan/pike.html)

  • jeswin 20 days ago

    I wouldn't give too much credit to rules like this. Data structures are often created with an approach in mind. You can't design a data structure without knowing how you will use it.

    If anything it's the other way round, if you're not talking about business domain modeling (where data structures first is a valid approach).

    • scott_w 20 days ago

      > If anything it's the other way round, if you're not talking about business domain modeling (where data structures first is a valid approach).

      And even there, the data models usually come about to make specific business processes easier (or even possible). An Order Summary is structured a specific way to make both the Fulfilment and Invoicing processes possible, which feed down into Payment and Collections processes (and related artefacts).

    • ssivark 20 days ago

      To elaborate on @jeswin's point above (IDK why it got downvoted)... a data structure is basically like a cache for the processing algorithm. The business logic and algorithm needs will dictate what details can be computed on-the-fly -vs- pre-generated and stored (be it RAM or disk). Eg: if you're going to be searching a lot then it makes sense to augment the database with some kind of "index" for fast lookup. Or if you are repeatedly going to be plotting some derived quantity then maybe it makes sense to derive it once and store it with the struct.

      It's not enough for a data structure to represent the "fundamental" degrees of freedom needed to model the situation; the algorithmic needs (vis-a-vis the available resources) most definitely matter a lot.
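      A toy sketch of that trade-off (hypothetical records; the dict plays the role of the "index"):

```python
# Flat records: fine until lookups by id start to dominate.
records = [{"id": i, "name": f"user{i}"} for i in range(10_000)]

def find_scan(uid):
    # Without an index: O(n) scan per lookup.
    return next(r for r in records if r["id"] == uid)

# With an index: O(1) per lookup, paid for with extra memory and the
# obligation to keep the index in sync with the records.
by_id = {r["id"]: r for r in records}
```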

    • sublinear 20 days ago

      If you don't know enough to design a data structure, requirements are missing and someone talking to the client is dropping the ball big time.

      • jeswin 20 days ago

        Where did I say any of that?

        I'm saying that if you care about performance, data structures should be designed with approach specific tradeoffs in mind. And like I've said above, in typical business apps, it's ok to start with data structures because (a) performance is usually not a problem, (b) staying close to the domain is cleaner.

        • reverius42 20 days ago

          You said: "You can't design a data structure without knowing how you will use it."

          But the whole discussion involves knowing how you will use it; the advocacy is for careful consideration of data structures (based on how you will use them) resulting in less pain when designing/choosing algorithms.

          • jeswin 20 days ago

            My point is that one doesn't follow the other. To design good data structures, you need to know how it'll get used (the algorithm).

            > If you've chosen the right data structures and organized things well, the algorithms will almost always be self-evident.

            This is what I was responding to.

        • reverius42 20 days ago

          See also:

          "Show me your flowcharts and conceal your tables, and I shall continue to be mystified. Show me your tables, and I won’t usually need your flowcharts; they’ll be obvious."

          https://en.wikiquote.org/wiki/Fred_Brooks

  • ZaoLahma 20 days ago

    Fully agree on this.

    I (deep, deep in embedded systems) have seen this too often: code that is incredibly complex and impossible to reason about because it needs to reach into some data structure multiple times from different angles to answer what should be rather simple questions about the next step to take.

    Fix that structure, and the code simplifies automagically.

SoftTalker 20 days ago

> But SQL schemas often look like this. Columns are nullable by default, and wide tables are common.

Hard disagree. That database table was a waving red flag. I don't know enough/any rust so don't really understand the rest of the article but I have never in my life worked with a database table that had 700 columns. Or even 100.

  • Mikhail_Edoshin 20 days ago

    I saw tables with more than a thousand columns. It was a law firm home-grown FileMaker tool. Didn't inspect it too closely, so I don't know what was inside.

    I remember a phrase from one of C. J. Date's books: every record is a logical statement. It really stood out for me and I keep returning to it. Such an understanding implies a rather small number of fields or the logical complexity will go through the roof.

    • p_v_doom 17 days ago

      > I saw tables with more than a thousand columns.

      I used to work in a company that had all the tags in their SCADA system feed into SQL tables. They had multiple tables (as in tens of tables), because they ran out of columns ...

  • holden_nelson 20 days ago
    • roblh 20 days ago

      I kinda love this. That sounds like an incredibly entertaining place to work for between 1 and 2 years in your late 20s and not a second longer.

      • tdeck 20 days ago

        If you enjoyed this, you'd probably enjoy thedailywtf.com, which is full of stories like that.

    • lelanthran 20 days ago

      With AI "programmers", this will be the future: bugs galore and the things that do work, work by accident.

      I think this company was ahead of the curve.

    • locknitpicker 20 days ago

      The blog post is an entertaining read, but I was left with the impression the author might have tried to embellish, particularly in its disbelief angle.

      Take this passage:

      > The app relied on a SOAP service, not to do any servicey things. No, the service was a pure function. It was the client that did all the side effects. In that client, I discovered a massive class hierarchy. 120 classes each with various methods, inheritance going 10 levels deep. The only problem? ALL THE METHODS WERE EMPTY. I do not exaggerate here. Not mostly empty. Empty.

      > That one stumped me for a while. Eventually, I learned this was in service of building a structure he could then use reflection on. That reflection would let him create a pipe-delimited string (whose structure was completely database-driven, but entirely static) that he would send over a socket.

      Classes with empty methods? Used reflection to create a pipe-delimited string? The string was sent over the wire?

      Why congratulations, you just rediscovered data transfer objects, specifically API models.

    • linolevan 20 days ago

      This is awesome. Got completely lost reading this and was struggling to figure out where I got this link from. Amazing story.

    • bobson381 20 days ago

      This is like the functional ugly tier of buildings from "how buildings learn". Excellent stuff

  • gz09OP 20 days ago

    Hi, I'm the author of the article.

    As to your hard disagree, I guess it depends... While this particular user is on the higher end (in terms of columns), it's not our only user where column counts are huge. We see tables with 100+ columns on a fairly regular basis especially when dealing with larger enterprises.

    • sublinear 20 days ago

      Can you clarify which knowledge domains those enterprises fall under with examples of what problems they were trying to solve?

      If it's not obvious, I agree with the hard disagree. Every time I see a table with that many columns, I have a hard time believing there isn't some normalization possible.

      Schemas that stubbornly stick to high-level concepts and refuse to dig into the subfeatures of the data are often seen from inexperienced devs or dysfunctional/disorganized places too inflexible to care much. This isn't really negotiable. There will be issues with such a schema if it's meant to scale up or be migrated or maintained long term.

      • fiddlerwoaroof 20 days ago

        Normalization is possible but not practical in a lot of cases: nearly every “legacy” database I’ve seen has at least one table that just accumulates columns because that was the quickest way to ship something.

        Also, normalization solves a problem that’s present in OLTP applications: OLAP/Big Data applications generally have problems that are solved by denormalization.

        • gz09OP 20 days ago

          Yep, this comment sums it up well.

          We have many large enterprises from wildly different domains use Feldera, and from what I can tell there is no correlation between the domain and the number of columns. As fiddlerwoaroof says, it seems to be more a function of how mature/big the company is and how much time it had to 'accumulate things' in their data model. And there might be very good reasons to design things the way they did; it's very hard to question it without being a domain expert in their field. I wouldn't dare :).

          • locknitpicker 20 days ago

            > I can tell there is no correlation between the domain and the amount of columns.

            This is unbelievable. In purely architectural terms that would require your database design to be an amorphous big ball of everything, with no discernible design or modelling involved. This is completely unrealistic. Are queries done at random?

            In practical terms, your assertion is irrelevant. Look at the sparse columns. Figure out those with sparse rows. Then move half of the columns to a new table and keep the other half in the original table. Congratulations, you just cut down your column count by half, and sped up your queries.

            Even better: discover how your data is being used. Look at queries and check what fields are used in each case. Odds are, that's your table right there.

            Let's face it. There is absolutely no technical or architectural reason to reach this point. This problem is really not about structs.
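            The split described above can be sketched with sqlite3 (table and column names are made up for illustration):

```python
import sqlite3

# Keep the hot, dense columns in a narrow core table and move the sparse
# ones to a side table keyed by the same primary key.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE orders_core (id INTEGER PRIMARY KEY, total REAL NOT NULL);
    CREATE TABLE orders_extra (
        order_id INTEGER PRIMARY KEY REFERENCES orders_core(id),
        gift_note TEXT
    );
""")
db.execute("INSERT INTO orders_core VALUES (1, 9.99)")
db.execute("INSERT INTO orders_extra VALUES (1, 'happy birthday')")

# Common queries touch only the narrow table...
(total,) = db.execute("SELECT total FROM orders_core WHERE id = 1").fetchone()
# ...and the sparse columns join in only when actually needed.
(note,) = db.execute(
    "SELECT gift_note FROM orders_extra WHERE order_id = 1"
).fetchone()
```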

            • diarrhea 20 days ago

              Feldera speak from lived experience when they say 100+ column tables are common in their customer base. They speak from lived experience when they say there's no correlation in their customer base.

              Feldera provides a service. They did not design these schemas. Their customers did, and probably over such long time periods that those schemas cannot be referred to as designed anymore -- they just happened.

              IIUC Feldera works in OLAP primarily, where I have no trouble believing these schemas are common. At my $JOB they are, because it works well for the type of data we process. Some OLAP DBs might not even support JOINs.

              Feldera folks are simply reporting on their experience, and people are saying they're... wrong?

              • gz09OP 20 days ago

                Haha, looks like it.

                I remember the first time I encountered this thing called TPC-H back when I was a student. I thought "wow surely SQL can't get more complicated than that".

                Turns out I was very wrong about that. So it's all about perspective.

                We wrote another blog post about this topic a while ago; I find it much more impressive because this is about the actual queries some people are running: https://www.feldera.com/blog/can-your-incremental-compute-en...

        • locknitpicker 20 days ago

          > Normalization is possible but not practical in a lot of cases: nearly every “legacy” database I’ve seen has at least one table that just accumulates columns because that was the quickest way to ship something.

          Strong disagree. I'll explain.

          Your argument would support the idea of adding a few columns to a table to get to a short time to market. That's ok.

          Your comment does not come close to justify why you would keep the columns in. Not the slightest.

          Tables with many columns create all sorts of problems and inefficiencies. Overfetching is a problem all on its own. Even the code gets brittle, where each and every single tweak risks being a major regression.

          Creating a new table is not hard. Add a foreign key, add the columns, do a standard parallel write migration. Done. How on earth is this not practical?

          • fiddlerwoaroof 20 days ago

            I’m not justifying the design but splitting a table with several billion rows is not a trivial task, especially when ORMs and such are involved. Additionally, it’s easier to get work scheduled to ship a feature than it is to convince the relevant players to complete the swing.

            • locknitpicker 20 days ago

              > I’m not justifying the design but splitting a table with several billion rows is not a trivial task, especially when ORMs and such are involved.

              I don't agree. Let me walk you through the process.

              - create the new table
              - follow a basic parallel writes strategy:
                - update your database consumers to write to the new table without reading from it
                - run a batch job to populate the new table with data from the old table
                - update your database consumers to read from the new table while writing to both old and new tables

              From this point onward, just pick a convenient moment to stop writing to the old table and call the migration done. Do post-migration cleanup tasks.

              > Additionally, it’s easier to get work scheduled to ship a feature than it is to convince the relevant players to complete the swing.

              The ease of piling up technical debt is not a justification to keep broken systems and designs. It's only ok to make a mess to deliver things because you're expected to clean up after yourself afterwards.

              • fiddlerwoaroof 19 days ago

                I've done this sort of thing or worked with people doing it. The concept is simple, actually executing can take months.

          • grey-area 20 days ago

            There are sometimes reasons this is harder in practice, for example let’s say the business or even third parties have access to this db directly and have hundreds of separate apps/services relying on this db (also an anti-pattern of course but not uncommon), that makes changing the db significantly harder.

            Mistakes made early on and not corrected can snowball and lead to this kind of mess, which is very hard to back out of.

          • magicalhippo 20 days ago

            > How on earth is this not practical?

            Fine, but you still need to read in those 100+ fields. So now you gotta contend with 20+ joins just to pull in one record. Not more practical than a single SELECT in my opinion.

            • sublinear 20 days ago

              You don't need to join what you don't actually need. You also need to be careful writing your queries, not just the schema. The most common ones should be wrapped in views or functions to avoid the problem of everyone rolling their own later.

              Performance generally isn't an issue for an arbitrary number of joins as long as your indices are set up correctly.

              If you really do need a bulk read like that I think you want json columns, or to just go all in with a nosql database. Even then, the above regarding indexing is still true.

      • rpcope1 20 days ago

        I think you believe the average developer, especially on enterprise software where you see this sort of shit, is far more competent or ambitious than they actually are. Many would be horrified to see the number of monkeys banging out nasty DDL in Hibernate or whatever C# uses that have no idea what "normal forms" or "relational algebra" are and are actively resistant to even attempting to learn.

  • woah 20 days ago

    No idea what these guys do exactly but their tagline says "Feldera's award-winning incremental compute engine runs SQL pipelines of any complexity"

    So it sounds like helping customers with databases full of red flags is their bread and butter

    • gz09OP 20 days ago

      > it sounds like helping customers with databases full of red flags is their bread and butter

      Yes that captures it well. Feldera is an incremental query engine. Loosely speaking: it computes answers to any of your SQL queries by doing work proportional to the incoming changes for your data (rather than the entire state of your database tables).

      If you have queries that take hours to compute in a traditional database like Spark/PostgreSQL/Snowflake (because of their complexity, or data size) and you want to always have the most up-to-date answer for your queries, feldera will give you that answer 'instantly' whenever your data changes (after you've back-filled your existing dataset into it).

      There is some more information about how it works under the hood here: https://docs.feldera.com/literature/papers
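      Loosely, "work proportional to the changes" means maintaining query state and folding deltas into it instead of rescanning the data. A toy sketch for a running SUM (nothing Feldera-specific; Feldera generalizes this to arbitrary SQL):

```rust
// Incremental evaluation in miniature: keep the answer as state and
// update it in O(|delta|) per batch of inserts/deletes.
struct IncrementalSum {
    state: i64,
}

impl IncrementalSum {
    fn apply_delta(&mut self, inserts: &[i64], deletes: &[i64]) -> i64 {
        self.state += inserts.iter().sum::<i64>();
        self.state -= deletes.iter().sum::<i64>();
        self.state
    }
}

fn main() {
    let mut q = IncrementalSum { state: 0 };
    assert_eq!(q.apply_delta(&[10, 20, 30], &[]), 60);
    // A retraction (delete) updates the answer without a full rescan.
    assert_eq!(q.apply_delta(&[5], &[10]), 55);
}
```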

  • unclad5968 20 days ago

    It might not be common in typical software shops. I work in manufacturing and our database has multiple tables with hundreds of columns.

    • pizza-wizard 20 days ago

      I’m working on migrating an IBM Maximo database from the late 90s to a SQL Server deployment on my current project. Also charged with updating the schema to a more maintainable and extensible design. Manufacturing and refurbishing domain - 200+ column tables is the norm. Very demoralizing.

    • ambicapter 20 days ago

      What's in them?

      • unclad5968 20 days ago

        Data from measurement tools. Everything about the tool configuration, time of measurement, operator ID, usually a bunch of electrical data (we make laser diodes) like current, potential, power, and a bunch of emission related data.

        • arethuza 20 days ago

          I think I'd rather work with very wide tables than the entity-attribute-value approach - which seems like a good idea but rapidly becomes a mess...

      • jayanmn 20 days ago

        Columns named Property1 through Property20, or more, are one example. There are better ways to do it, but I have seen columns for storing 'anything'

        • Spivak 20 days ago

          Sounds like a generic form of single table inheritance. I don't honestly see any other way to do it (punting to a JSON field is effectively the same thing) when you have potentially thousands of parts all with their own super specific relevant attributes.

          I've worked on multiple products that have had a concept of "custom fields" who did it this way too.

  • adrianN 20 days ago

    If lots of columns are a red flag then red flags are quite common in many businesses. I’ve seen tables with tens of thousands of columns. Naturally those are not used by humans writing sql by hand, but there are many tools that have crazy data layouts and generate crazy sql to work with it.

  • vharuck 20 days ago

    https://apps.naaccr.org/data-dictionary/data-dictionary/vers...

    771 columns (and I've read the definitions for them all, plus about 50 more that have been retired). In the database, these are split across at least 3 tables (registry, patient, tumor). But when working with the records, it's common to use one joined table. Luckily, even that usually fits in RAM.

  • magicalhippo 20 days ago

    > I have never in my life worked with a database table that had 700 columns

    Main table at work is about 600, though I suspect only 300-400 are actively used these days. A lot come from name and address fields, we have about 10 sets of those in the main table, and around 14 fields per.

    Back when this was created some 20+ years ago it was faster and easier to have it all in one row rather than to do 20+ joins.

    We probably would segment it a bit more if we did it from scratch, but only some.

  • nikhilsimha 20 days ago

    It is very common to find tables with 1000+ columns in machine learning training sets at e-commerce companies. The largest I have seen had over 10000 columns.

  • orthoxerox 20 days ago

    It's OLAP; it's very common for analytical tables to be denormalized. As an example, each UserAction row can include every field from Device and User to maximize the speed at which fraud detection works. You might even want to store multiple Devices in a single row: current, common 1, 2 and 3.

  • bob1029 20 days ago

    Some businesses are genuinely this complicated. Splitting those facts into additional tables isn't going to help very much unless it actually mirrors the shape of the business. If it doesn't align, you are forcing a lot of downstream joins for no good reason.

  • xarope 20 days ago

    OLTP tables typically are normalized.

    But OLAP tables (data lake/warehouse stuff), for speed purposes, are intentionally denormalized and yes, you can have 100+ columns of nullable stuff.

  • wombatpm 20 days ago

    Not everyone understands normal form, much less 3rd normal form. I’ve seen people do worse with excel files where they ran out of columns and had to link across spreadsheets.

  • locknitpicker 20 days ago

    > Hard disagree. That database table was a waving red flag.

    Exactly this.

    This article is not about structs or Rust. This article is about poor design of the whole persistence layer. I mean, hundreds of columns? Almost all of them optional? This is the kind of design that gets candidates to junior engineer positions kicked off a hiring round.

    Nobody gets fired for using a struct? If it's an organization that tolerates database tables with nearly 1k optional columns then that comes as no surprise.

    • grey-area 20 days ago

      The database table is someone else’s data. That’s why this company exists and is explained in the article.

      They don’t have the option to clean up the data.

      • gz09OP 20 days ago

        Yep, pretty much. Feldera is the engine; we don't control what SQL people throw at us.

  • bananamogul 20 days ago

    That statement jumped out at me as well. I've worked as a DBA on tons of databases backing a wide variety of ERPs, web apps, analytics, data warehouses...700 columns?!? No.

    • shakna 20 days ago

      You've never seen an SAP database where the business object had a couple hundred fields? It's pretty much required if you're touching international data.

  • randallsquared 20 days ago

    I have seen tables (SQL and parquet, too) that have at least high hundreds of optional columns, but this was always understood to be a terrible hack, in those cases.

  • shakna 20 days ago

    Salesforce by default comes with tables that have 50 columns before you start tweaking anything.

    100s is not unusual. Thousands happens before you realise.

lsuresh 20 days ago

Feldera co-founder here. Great discussions here.

Some folks pointed out that no one should design a SQL schema like this and I agree. We deal with large enterprise customers, and don't control the schemas that come our way. Trust me, we often ask customers if they have any leeway with changing their SQL and their hands are often tied. We're a query engine, so have to be able to ingest data from existing data sources (warehouse, lakehouse, kafka, etc.), so we have to be able to work with existing schemas.

So what then follows is a big part of the value we add: which is, take your hideous SQL schema and queries, warts and all, run it on Feldera, and you'll get fully incremental execution at low latency and low cost.

700 isn't even the worst number that's come our way. A hyperscale prospect asked about supporting 4000 column schemas. I don't know what's in that table either. :)

  • kwillets 20 days ago

    This site is underweighted on OLAP. Columnstores were invented for precisely this use case; nobody in the field wants to normalize everything.

    Which brings me to the question, why a rowstore? Are Z-sets hard to manage otherwise?

    Another aspect of wide tables is that they tend to have a lot of dependencies, ie different columns come from different aggregations, and the whole table gets held up if one of them is late. IVM seems like a good solution for that problem.

    • lsuresh 19 days ago

      Good questions!

      Feldera tries to be row- and column-oriented in the respective parts that matter. E.g. our LSM trees only store the set of columns that are needed, and we need to be able to pick up individual rows from within those columns for the different operators.

      I don't think we've converged on the best design yet here though. We're constantly experimenting with different layouts to see what performs best based on customer workloads.

pmoati 15 days ago

Great writeup!

The bitmap trick is elegant and I've seen similar patterns in other contexts. The core insight resonates beyond Rust and SQL: the data structure that's "obvious" at design time can become a bottleneck when the real-world usage pattern diverges from your assumptions. "Most fields exist" vs. "most fields might not exist" is a subtle but critical distinction.

The fix being a simple layout change rather than a clever algorithm is also a good reminder. I've spent 20 years building apps and the most impactful optimizations were almost always about changing the shape of data, not adding complexity.

logdahl 20 days ago

Strictly speaking, isn't there still a way to express at least one illegal string in ArchivedString? I'm not sure how to hint to the Rust compiler which values are illegal, but if the inline length (at most 15 characters) is aliased to the pointer string length (assume little-endian), wouldn't {ptr: null, len: 16} and {inline_data: {0...}, len: 16} both technically be an illegal value?

I'm not saying this is better than your solution, just curious :^)

  • bombela 19 days ago

    In the code you will find union { {len, relptr}, [u8; 16] }

    The length is first, the pointer second. The inline string is terminated with 0xFF. The length uses 62 bits out of 64, so that a specific tag pattern lands in the first byte, one that valid UTF-8 can't collide with.
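    A rough sketch of that layout (field names and the simplified relptr type are mine, not rkyv's actual definitions):

```rust
use std::mem::size_of;

// A 16-byte union: short strings live inline, long ones store a
// 62-bit length (with tag bits) plus a relative pointer.
#[derive(Clone, Copy)]
#[repr(C)]
struct OutOfLine {
    len: u64,    // length in the low 62 bits; high bits tag the variant
    relptr: u64, // relative offset to the out-of-line bytes (simplified)
}

#[repr(C)]
union ArchivedStringRepr {
    out_of_line: OutOfLine,
    inline: [u8; 16], // short strings stored in place
}

fn main() {
    // Both variants occupy the same 16 bytes.
    assert_eq!(size_of::<ArchivedStringRepr>(), 16);
    assert_eq!(size_of::<OutOfLine>(), 16);
}
```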

  • gz09OP 20 days ago

    > Isn't there still a way to express at least one Illegal string in ArchivedString?

    There may be good reasons (I don't know any) why it wasn't done like this, but from a high level it looks possible to me too, yes.

SigmundA 20 days ago

Looks like they just recreated a tuple layout in Rust, null bitmap and all; next up would be storing the tuples in pages and mmap-ing the pages.

https://www.postgresql.org/docs/current/storage-page-layout....

  • gz09OP 20 days ago

    Absolutely, it's a very common technique :)

    I wasn't sure about writing the article in the first place because of that, but I figured it may be interesting anyway, since I was kind of happy with how simple this optimization was to write once it was all done (when I started the task I wasn't sure if it would be hard because of how our code is structured, the libraries we use, etc.). I originally posted this in the Rust community, and it seems people enjoyed the post.

    • SigmundA 20 days ago

      I think it's a good article and I enjoyed learning a little more about Rust, but it would have been nice to point out that this is a common technique used for tuple storage in databases, for those not familiar.

      It comes off as a novel solution rather than connecting it to a long tradition of DB design. I believe PG, for instance, has used a null bitmap since the beginning, 40 years ago.

      • gz09OP 19 days ago

        That would be surprising to me if anyone would think this is novel.

        Using bitmaps to indicate whether things are in-use or not is very common in systems programming. Like you said PG does it, but most other systems do this too. It's not specific to databases: in an operating system, one of the first thing it needs is an allocator, the allocator most likely will use some bitmap trick somewhere to indicate what's free vs. what's available etc.
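        For readers who haven't seen it, a minimal sketch of the null-bitmap idea (my own simplification, not Feldera's actual layout): one presence bit per field, plus a packed vector of only the non-null values, with a popcount "rank" to find a value's slot:

```rust
// Sparse row: instead of N Option<i64> slots, keep one bit per field
// and store only the values that are actually present.
struct SparseRow {
    present: Vec<u64>, // bit i set => field i is non-null
    values: Vec<i64>,  // values of present fields, in field order
}

impl SparseRow {
    fn new(fields: &[Option<i64>]) -> Self {
        let mut present = vec![0u64; (fields.len() + 63) / 64];
        let mut values = Vec::new();
        for (i, f) in fields.iter().enumerate() {
            if let Some(v) = f {
                present[i / 64] |= 1 << (i % 64);
                values.push(*v);
            }
        }
        SparseRow { present, values }
    }

    fn get(&self, i: usize) -> Option<i64> {
        if self.present[i / 64] & (1 << (i % 64)) == 0 {
            return None;
        }
        // Rank query: count set bits before bit i to find the packed index.
        let mut rank = 0u32;
        for w in 0..i / 64 {
            rank += self.present[w].count_ones();
        }
        rank += (self.present[i / 64] & ((1u64 << (i % 64)) - 1)).count_ones();
        Some(self.values[rank as usize])
    }
}

fn main() {
    let row = SparseRow::new(&[None, Some(7), None, Some(42)]);
    assert_eq!(row.get(0), None);
    assert_eq!(row.get(1), Some(7));
    assert_eq!(row.get(3), Some(42));
}
```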

        • SigmundA 18 days ago

          Interesting, so you don't find it odd that an article about the storage engine of a SQL database system explains the solution to the problem without once mentioning that it is the way most other SQL database engines solve it? It is mentioned several times that SQL tables are typically sparse, but not that this is how it's typically solved, only that this is how you solved it...

          >The fix is simple: we stop storing Option<T> and instead we store a bitmap that records which fields are None.

          Right here would have been the opportunity to let those not familiar with database internals know. Something like: "This technique has been widely used in many RDBMSs over the years; you can see the PG version of this in their documentation for page layout."

          Instead you go into detail on what a null bitmap is and how to implement it, calling it a "trick". Which is strange if your target audience is assumed to already know this common technique.

          I mean, not one mention of the origin of the trick, or even calling it a common solution to this problem...

          • gz09OP 18 days ago

            > not one mention of the origin of the trick

            And who or what would you say is the origin? The "trick" is so old that who invented the bitmap is, I'm afraid, lost to time. It was used in MULTICS or THE long before PostgreSQL was invented.

            • SigmundA 18 days ago

              There is a lot of space between no acknowledgment at all, like this article, and tracing down the exact origin of who invented the null bitmap, when, and in what system.

              Your version comes off as self-promotion, going into detail on how to implement the technique without once calling it a common technique or even mentioning that it's widely used in the industry to solve the problem you are solving.

              It would be like saying finding rows in storage is slow, so we implemented a B-tree in a separate structure for certain columns, and going into detail on how to create a B-tree, calling it a "trick", while never once acknowledging that it is the most common form of indexing in RDBMSs. Would you understand how this would seem strange?

astrostl 20 days ago

I have mixed feelings about it, but I'm going to fire somebody tomorrow for using a struct just to prove a point to the author.

saghm 20 days ago

I feel like I'm missing something, but the article started by talking about SQL tables, and then in-memory representations, and then on-disk representation, but...isn't storing it on a disk already what a SQL database is doing? It sounds like data is being read from a disk into memory in one format and then written back to a disk (maybe a different one?) in another format, and the second format was not as efficient as the first. I'm not sure I understand why a third format was even introduced in the first place.

  • gz09OP 20 days ago

    Feldera is an incremental query engine; you can think of it as a specialized database. If you have a set of questions you can express in SQL, it will ingest all your data and build many sophisticated indexes for it (these get stored on disk). Whenever new data arrives, Feldera can instantly update the answers to all your questions. This is mostly useful when the data is much larger than what fits in memory, because then the questions are especially expensive to answer with a regular (batch) database.

    Feel free to try it out, it's open source: https://github.com/feldera/feldera/

kristianp 19 days ago

> A new use case processed about the same amount of data as their existing pipelines, but it ran much slower.

Did I miss something? They didn't mention why the new use case was slower. I was expecting some callback to that new usecase in the article somewhere.

Always enjoy reading about performance debugging, thanks for writing this.

Edit: they didn't talk about profiling either. It was an enjoyable read of rust serialisation for non rusty people though.

jamesblonde 20 days ago

Here is an article I wrote this week with a section on Feldera and how it uses its incremental compute engine to compute "rolling aggregates" (the most important real-time feature for detecting changes in user behavior/pricing/anomalies).

https://www.hopsworks.ai/post/rolling-aggregations-for-real-...

jim33442 20 days ago

I did read the rest, but I'm stuck on the first part where their SQL table has almost a thousand cols. Why so many?

everyone 20 days ago

Just cus structs and classes work differently, and classes are much more common. I tend to make everything a class, unless there is a really good reason to make it a struct.

  • tialaramex 20 days ago

    > Just cus structs and classes work differently

    The most common programming language where "struct" and "class" are both kinds of user-defined type is C++. In C++ they only "work differently" in that the default accessibility is different: a "struct" is the same as a "class" if you change the accessibility to public with an access specifier at the top.

    If you think you saw a bigger difference you're wrong.

  • bob1029 20 days ago

    Classes are a safe default even if you expect things to go very, very fast.

    The overhead of screwing up NUMA concerns vastly outstrips any kind of class vs struct differences. It's really one of the very last things you should be worrying about.

    Allocating an array of a class vs an array of struct might seem like you're getting a wildly different memory arrangement, but from the perspective of space & time this distinction is mostly pointless. Where the information resides at any given moment is the most important thing (L1/L2/L3/DRAM/SSD/GPU/AWS). Its shape is largely irrelevant.

arcrwlock 20 days ago

Why not use a struct of arrays?

https://en.wikipedia.org/wiki/Data-oriented_design

  • mustache_kimono 20 days ago

    > Why not use a struct of arrays?

    I would assume because then the shape of the data would be too different? SoA is super effective when it suits the shape of the data. Here, the difference would be the difference between an OLTP and an OLAP DB. And you wouldn't use an OLAP DB for an OLTP workload?
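    For contrast, a tiny AoS-vs-SoA sketch (illustrative types, not from the article):

```rust
// Array-of-structs: each record's fields sit together in memory.
struct ReadingAos {
    temp: f32,
    pressure: f32,
    operator_id: u32,
}

// Struct-of-arrays: each field gets its own contiguous column,
// so a scan over one field touches only that field's memory.
struct ReadingsSoa {
    temps: Vec<f32>,
    pressures: Vec<f32>,
    operator_ids: Vec<u32>,
}

fn main() {
    let soa = ReadingsSoa {
        temps: vec![21.5, 22.0],
        pressures: vec![101.3, 101.4],
        operator_ids: vec![7, 8],
    };
    // Averaging one column only reads that column's contiguous memory.
    let avg: f32 = soa.temps.iter().sum::<f32>() / soa.temps.len() as f32;
    assert!((avg - 21.75).abs() < 1e-6);
    // ReadingAos exists just for contrast with the layout above.
    let _ = ReadingAos { temp: 0.0, pressure: 0.0, operator_id: 0 };
}
```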

kleiba 20 days ago

> This struct we saw earlier had 700+ of optional fields. In Rust you would never design a struct like this. You would pick a different layout long before reaching 700 Options. But SQL schemas often look like this.

Really? I've never had to do any serious db work in my career, but this is a surprise to me.
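For context on the Rust side of that quote: Option<u64> has no spare bit pattern ("niche") to encode None, so each such optional field doubles in size, which is part of why 700 of them is an anti-pattern:

```rust
use std::mem::size_of;

fn main() {
    // A plain u64 is 8 bytes; wrapping it in Option doubles that,
    // because every bit pattern of u64 is a valid value.
    assert_eq!(size_of::<u64>(), 8);
    assert_eq!(size_of::<Option<u64>>(), 16);
    // Types with a niche (here: a non-null pointer) pay nothing for Option.
    assert_eq!(size_of::<Option<Box<u64>>>(), 8);
}
```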

Ciantic 20 days ago

If I understand correctly, this problem was in rkyv, and the solution is using rkyv with glue code. I hope they could integrate some sort of official derive macro `rkyv::Sparse` for this if it can't be done automatically in rkyv.

porise 20 days ago

Why is Rust allowed to reorder fields? If I know that fields are going to be generally accessed together, this prevents me from ordering them so they fit in cache lines.

  • tialaramex 20 days ago

    You can choose in Rust to specify the representation you want for your data type. Unlike in C or C++, that's not a non-portable vendor extension; it's just part of the language. Look at the repr documentation: https://doc.rust-lang.org/nomicon/other-reprs.html

    So if you want "what C does" you can just repr(C) and that's what you get. For most people that's not a good trade unless they're doing FFI with a language that shares this representational choice.
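    A small demonstration of the difference (the repr(C) size is guaranteed; the default-repr size is current-rustc behavior, hence the hedged comparison):

```rust
use std::mem::size_of;

// Default repr(Rust): the compiler may reorder fields to shrink padding.
struct Reordered {
    a: u8,
    b: u64,
    c: u16,
}

// repr(C): declaration order is kept, so padding is inserted as in C:
// 1 byte + 7 pad + 8 bytes + 2 bytes + 6 pad = 24 bytes.
#[repr(C)]
struct CLayout {
    a: u8,
    b: u64,
    c: u16,
}

fn main() {
    assert_eq!(size_of::<CLayout>(), 24);
    // rustc currently sorts by alignment (u64, u16, u8 => 16 bytes),
    // but the language doesn't guarantee it, hence <= rather than ==.
    assert!(size_of::<Reordered>() <= size_of::<CLayout>());
}
```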

    • porise 9 days ago

      Rust's tradeoff is awful. People organize struct members in the logical ways in which they will be accessed. Rust just decides to do things differently so they can say they do things differently and "better" than old C. The featured article is an extreme case, but other normal cases would have the same issue. Packing structs should be the exception.

  • kzrdude 20 days ago

    It's allowed as an optimization, the order it uses will limit the space lost to field alignment.

  • Narishma 20 days ago

    You can tell it not to reorder them if you want but it's not the default.
