Replacing EBS and Rethinking Postgres Storage from First Principles

tigerdata.com

89 points by mfreed a day ago


0xbadcafebee - 7 hours ago

There's a ton of jargon here. Summarized...

Why EBS didn't work:

  - EBS bills for provisioned capacity, not for what you actually use
  - EBS is slow at restores from snapshot (faster to spin up a database from a Postgres backup stored in S3 than from an EBS snapshot in S3)
  - EBS only lets you attach 24 volumes per instance
  - EBS only lets you resize once every 6–24 hours, and you can't shrink or adjust continuously
  - Detaching and reattaching EBS volumes can take anywhere from ~10 seconds for healthy volumes to ~20 minutes for failed ones, so failover takes longer
Why all this matters:

  - their AI-agent workloads are ephemeral and snapshot-heavy; they constantly destroy and rebuild EBS volumes
What didn't work:

  - local NVMe/bare metal: needs 2–3× the nodes for durability, too expensive; snapshot restores are too slow
  - a custom page-server Postgres storage architecture: too complex/expensive to maintain
Their solution:

  - block-level copy-on-write (COW); see the sketch after this list
  - volume changes (new/snapshot/delete) are a metadata change
  - storage space is logical (effectively infinite), not bound to disk primitives
  - multi-tenant by default
  - versioned, replicated k/v transactions, horizontally scalable
  - independent service layer abstracts blocks into volumes, is the security/tenant boundary, enforces limits
  - user-space block device that pins I/O queues to CPUs and supports zero-copy and resizing; performance is bounded by the Linux primitives it builds on
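To make the COW bullets concrete, here is a minimal sketch of metadata-only snapshots over a block map. All names here (Volume, ExtentStore) and the 4 KB block size are illustrative assumptions, not Tiger Data's actual implementation:

    # Hypothetical sketch: a snapshot copies only the logical->physical map.
    class ExtentStore:
        """Append-only pool of physical extents (illustrative)."""
        def __init__(self):
            self._extents = {}
            self._next_id = 0

        def allocate(self, data: bytes) -> int:
            eid = self._next_id
            self._extents[eid] = data
            self._next_id += 1
            return eid

        def get(self, eid: int) -> bytes:
            return self._extents[eid]

    class Volume:
        def __init__(self, block_map=None):
            # logical block number -> extent id; unmapped blocks use no space
            self.block_map = dict(block_map or {})

        def write(self, lbn: int, store: ExtentStore, data: bytes):
            # COW: writes always go to a fresh extent, so extents shared
            # with snapshots are never overwritten in place.
            self.block_map[lbn] = store.allocate(data)

        def read(self, lbn: int, store: ExtentStore) -> bytes:
            eid = self.block_map.get(lbn)
            return store.get(eid) if eid is not None else b"\x00" * 4096

        def snapshot(self) -> "Volume":
            # O(metadata): copy the map, share every extent.
            return Volume(self.block_map)

    store = ExtentStore()
    vol = Volume()
    vol.write(0, store, b"v1".ljust(4096, b"\x00"))
    snap = vol.snapshot()                          # instant, no data copied
    vol.write(0, store, b"v2".ljust(4096, b"\x00"))
    assert snap.read(0, store).startswith(b"v1")   # snapshot unaffected
    assert vol.read(0, store).startswith(b"v2")

Deleting a volume or snapshot is likewise just dropping a map and releasing its now-unreferenced extents (refcounting omitted here for brevity).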
Performance stats (single volume):

  - (latency/IOPS benchmarks use 4 KB blocks; throughput benchmarks use 512 KB blocks; see the arithmetic check after this list)
  - read: 110,000 IOPS and 1.375 GB/s (bottlenecked by network bandwidth)
  - write: 40,000–67,000 IOPS and 500–700 MB/s, synchronously replicated
  - single-block read latency ~1 ms, write latency ~5 ms
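Note the IOPS and throughput figures come from different block sizes, so they don't multiply together. A quick consistency check with the numbers above:

    # 4 KB random reads: IOPS-bound
    print(110_000 * 4 * 1024 / 1e9)     # ~0.45 GB/s at 4 KB

    # 512 KB reads: bandwidth-bound
    print(1.375e9 / (512 * 1024))       # ~2,600 IOPS at 512 KB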
unsolved73 - 6 hours ago

TimescaleDB was such a great project!

I'm really sad to see them squander the opportunity and instead build yet another managed cloud on top of AWS, chasing buzzword after buzzword.

Had they made deals with cloud providers to offer managed TimescaleDB, so they could focus on their core value proposition, they could have won the time-series business. Instead, ClickHouse made them irrelevant, and Neon has already won the "Postgres for agents" business thanks to a better architecture than this.

DenisM - 2 hours ago

IIUC, they built an EBS replacement on top of NVMe attached to a dynamically sized fleet of EC2 instances.

The advantage is that it allocates pages on demand from an elastic pool of storage, so it appears as an infinite block device. Another advantage is cheap COW clones.

The downside is (probably) specialized tuning for Postgres access patterns. I shudder to think what went into page metadata management. Perhaps it's similar to, e.g., SQL Server's buffer pool manager.

It's not clear to me why this is better than Aurora's design: on the surface, page servers are higher-level concepts and should allow more holistic optimizations (and less page-write traffic, since they ship log records in lieu of whole pages). It's also not clear what stopped Amazon from doing the same (perhaps EBS serving more diverse access patterns?).
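A rough illustration of that write-traffic point, assuming (hypothetically) a ~120-byte average redo record and Postgres's default 8 KB page, in the worst case where every update dirties a distinct page:

    PAGE_SIZE = 8 * 1024      # Postgres default page size
    WAL_RECORD = 120          # assumed average redo record for a small update
    UPDATES = 1_000_000

    print(f"pages shipped: {UPDATES * PAGE_SIZE / 1e9:.1f} GB")   # ~8.2 GB
    print(f"log shipped:   {UPDATES * WAL_RECORD / 1e9:.2f} GB")  # ~0.12 GB

(In practice a block store only writes back dirty pages at checkpoint/eviction time, so the real gap is smaller, but the direction holds.)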

Very cool!

electroly - 4 hours ago

EC2 instances have dedicated throughput to EBS via Nitro that you lose out on when you run your own EBS equivalent over the regular network. You only get 5 Gbps maximum between two EC2 instances in the same AZ that aren't in the same placement group[1], and you're limited by the instance type's general networking throughput. Dedicated throughput to EBS from a typical EC2 instance is multiple times this figure. It's an interesting tradeoff: I assume they must be IOPS-heavy, so throughput is not a concern (quick unit check below).

[1] https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-inst...
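Comparing against the article's stats: the reported 1.375 GB/s single-volume read throughput is about 11 Gbit/s on the wire, well over a single 5 Gbps flow, so presumably they rely on bigger instance types, placement groups, or striping reads across storage servers:

    print(5 / 8)          # 5 Gbit/s cross-instance limit = 0.625 GB/s
    print(1.375 * 8)      # 1.375 GB/s read throughput = 11.0 Gbit/s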

maherbeg - 7 hours ago

This has a similar flavor to xata.io's SimplyBlock-based storage system:

  - https://xata.io/blog/xata-postgres-with-data-branching-and-p...
  - https://www.simplyblock.io/

It's a great way to combine copy-on-write with logically splitting up physical nodes. It's something I wanted to build at a previous role.

stefanha - 6 hours ago

@graveland Which Linux interface was used for the userspace block driver (ublk, nbd, tcmu-runner, NVMe-over-TCP, etc)? Why did you choose it?

Also, were existing network or distributed file systems not suitable? This use case sounds like Ceph might fit, for example.

the8472 - 7 hours ago

Though AWS instance-attached NVMe (NVMe-oF?) still has fewer IOPS per TB than bare-metal NVMe does.

    E.g. i8g.2xlarge, 1875 GB, 300k IOPS read
    vs. WD_BLACK SN8100, 2TB, 2300k IOPS read
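Normalized per terabyte, the figures above work out to roughly a 7x gap:

    print(300_000 / 1.875)     # i8g.2xlarge:     160,000 read IOPS per TB
    print(2_300_000 / 2.0)     # WD_BLACK SN8100: 1,150,000 read IOPS per TB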
thr0w - 8 hours ago

Postgres for agents, of course! It makes too much sense.

runako - 6 hours ago

Thanks for the writeup.

I'm curious whether you evaluated solutions like ZFS or Gluster? Also curious whether you looked at Oracle Cloud, given their faster block storage?

kristianp - 2 hours ago

So they've built a competitor to EBS that runs on EC2 and NVMe. Seems like their prices will need to be much higher than those of AWS to get decent profit margins. I really hate being in the high-cost ecosystem of the large cloud providers, so I wouldn't make use of this.

7e - 5 hours ago

Yes, EBS sucks, but plenty of cloud providers implemented the same thing Tiger Data has built a decade ago. Google, for example.

tayo42 - 6 hours ago

Are they not using AWS anymore? I found that confusing. It says they're not using EBS and not using attached NVMe, but I didn't think there were other options in AWS?

cpt100 - a day ago

pretty cool