Buckets and objects are not enough

sagi.org

15 points by sagiba 5 days ago · 16 comments

gberger 13 hours ago

S3: Simple Storage Service. It's a building block, and it's only natural other abstractions are built on top of it.

  • sagibaOP 5 hours ago

    Agree it doesn't have to be part of S3 itself. My point is that there is a missing semantic layer.

    In practice, many teams use S3 directly without any layer on top. So without better organizational capabilities, they can't keep track of what they have stored where, who created it, whether it is still used, etc.

    And when teams do use a catalog, it's usually detached from the storage layer itself, so you can't easily view a dataset in the catalog and know how much it costs, who accessed it, and so on.

    Have you seen places that figured out a better way to handle this? Without a ton of custom tooling?

  • FridgeSeal 10 hours ago

    No but why doesn’t this object-storage-primitive accommodate all my specific requirements already?

    They should also accommodate my need for all the POSIX filesystem APIs, including cheap moves and renames!!!!!

    /s

    • sagibaOP 5 hours ago

      POSIX isn't the ask. Datasets are. The need to keep track of what data you have stored is universal, not my specific requirement.

      • FridgeSeal an hour ago

        I make the (glib) comment, because it’s a similar argument to the one that was popular a few years ago.

        S3 is an object store; treat it more like a KV store. As other comments have pointed out, the solution here is pick-your-favourite metadata store, be it Postgres or what Iceberg does, and keep the data itself on S3.
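
The metadata-store split described above can be sketched in a few lines. This is an illustrative schema only (table and column names are my own invention, with `sqlite3` standing in for Postgres), showing the kind of questions the database answers that S3 alone can't:

```python
import sqlite3

# Sketch of the "metadata in a database, bytes in S3" split.
# sqlite3 stands in for Postgres; the schema is illustrative, not standard.
db = sqlite3.connect(":memory:")
db.execute("""
    CREATE TABLE objects (
        s3_key  TEXT PRIMARY KEY,   -- full key of the object in the bucket
        dataset TEXT NOT NULL,      -- logical dataset the object belongs to
        owner   TEXT NOT NULL,      -- team or user that created it
        bytes   INTEGER NOT NULL    -- size, so cost can be aggregated
    )
""")
db.executemany(
    "INSERT INTO objects VALUES (?, ?, ?, ?)",
    [
        ("ml/embeddings/part-000.parquet", "embeddings", "ml-team", 7_340_032),
        ("ml/embeddings/part-001.parquet", "embeddings", "ml-team", 7_112_704),
        ("logs/2024/05/app.log.gz",        "app-logs",   "platform", 524_288),
    ],
)

# The questions S3 itself can't answer: what datasets exist,
# who owns each one, and how big it is.
for dataset, owner, total in db.execute(
    "SELECT dataset, owner, SUM(bytes) FROM objects GROUP BY dataset, owner"
):
    print(dataset, owner, total)
```

The S3 keys here are plain values; all organization, indexing, and aggregation lives in the database, which is exactly the division of labour the comment suggests.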

hilariously 14 hours ago

The prefixes are not meaningless; they are performance boundaries as well, if you read the docs.

  • sagibaOP 5 hours ago

    Good point: prefixes are performance boundaries too. Per-prefix rate scaling means you can spread load across prefixes to get aggregate throughput well above the ~3,500 write RPS a single prefix supports [1].

    But that's a different thing than what the post is about. Even teams that use prefixes for performance don't have an S3-native way to ask what a prefix represents, who owns it, whether it's still accessed, and so on. The semantic layer is missing whether you're hashing for throughput or just laying data out the obvious way.

    [1] https://docs.aws.amazon.com/AmazonS3/latest/userguide/optimi...
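The spreading trick is simple to sketch: prepend a short hash-derived shard to each key so writes fan out across several prefixes, each with its own request-rate allowance. The function name and fanout below are illustrative choices, not an AWS API:

```python
import hashlib

def hashed_key(key: str, fanout: int = 16) -> str:
    """Prepend a hash-derived shard so writes spread across `fanout`
    S3 prefixes, each with its own per-prefix request-rate allowance.
    (Illustrative sketch; not an AWS API.)"""
    digest = hashlib.md5(key.encode()).hexdigest()
    shard = int(digest, 16) % fanout
    return f"{shard:02x}/{key}"

# Same key always lands in the same shard, so reads stay deterministic.
print(hashed_key("ml/embeddings/part-000.parquet"))
```

With 16 shards you get up to roughly 16x the per-prefix write rate in aggregate, at the cost of making the key layout opaque, which is precisely the tension with treating prefixes as human-readable organization.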

CodesInChaos 5 hours ago

The "dataset" abstraction this article proposes feels rather specific for their use-case, not universal. None of my S3 use-cases would benefit from it.

Just store such metadata in your database, where you can organize, index and aggregate it whatever way you like.

  • sagibaOP 4 hours ago

    Curious what your use cases look like. If you're storing data where you always know what's there, who created it, and whether it's still in use without needing to query for it, that's actually a great place to be. The post is about the much messier middle ground most teams I've talked to are in.

mdavid626 5 hours ago

Store key to the item in a database. Then you can query it however you want.

I could imagine S3 offering something similar, though. We can already list bucket items, so why not add some querying ability?

skybrian 10 hours ago

Why do they put everything into one huge bucket? Wouldn't the best way to clean it up be to create more buckets?

  • sagibaOP 5 hours ago

    You can have lots of buckets, but each one typically still contains many datasets.

    Think of a team doing ML, for example. They work with data all day across many different tools, each reading some inputs from S3 and writing outputs to S3. They won't create a bucket for every output, that's not practical. So they write to a single bucket with outputs organized under prefixes.

    Buckets are more of an administrative boundary (IAM, cost, replication) than a data organization unit. So even with more buckets, the dataset abstraction is still missing - there's no good native way to track what a prefix represents, who created it, whether it's still accessed, how much it costs, etc.
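
In the absence of anything S3-native, one ad-hoc convention (my own illustration, not an S3 feature or anything from the article) is to write a small manifest object at the root of each dataset prefix, recording exactly the attributes listed above:

```python
import json
from datetime import datetime, timezone

def make_manifest(prefix: str, owner: str, description: str) -> tuple[str, str]:
    """Return the (key, body) pair to PUT alongside a dataset's objects.
    Ad-hoc convention sketch: the file name and fields are illustrative,
    not an S3 feature."""
    manifest = {
        "prefix": prefix,
        "owner": owner,
        "description": description,
        "created_at": datetime.now(timezone.utc).isoformat(),
    }
    return f"{prefix.rstrip('/')}/_dataset.json", json.dumps(manifest, indent=2)

key, body = make_manifest("ml/embeddings/", "ml-team", "nightly embedding dumps")
print(key)  # the manifest lives next to the data it describes
```

This answers "what does this prefix represent and who owns it", but notably not "how much does it cost" or "is it still accessed", which is why it stops short of the dataset abstraction the thread is asking for.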

dchess 12 hours ago

I feel like this is what delta lake and ducklake are largely solving for. And then some.

  • sagibaOP 4 hours ago

    They solve it, partially, for tabular data. Delta, Iceberg, DuckLake are all table formats. And yeah, they do more than dataset abstraction (transactions, time travel, schema evolution).

    But that's just one slice of storage. Most teams also have logs, media, ML artifacts, raw dumps, etc., none of which fit into a table format. And even with tables, you often can't easily look at a Delta table and know what the underlying storage is costing you, whether it's still accessed, etc.

    Another system might solve it for your media files, another for your log streams, and so on. That's the thing: these management nice-to-haves are quite generic but aren't universally supported today, so you end up reinventing them separately in each domain. And even if you did reinvent them everywhere, you still wouldn't have a central aggregated view across all your storage.

    • FridgeSeal an hour ago

      > logs, media, ML artifacts, raw dumps, etc., none of which fit into a table format.

      You would be appalled at the kind of stuff I have seen teams stuff into parquet and iceberg tables.

      • sagibaOP an hour ago

        Ha. The fact that teams reach for iceberg to organize things that aren't really tables is itself a symptom of needing better management tools for other types of data.
