SPQR 1.3.0: a production-ready system for horizontal scaling of PostgreSQL
How far can one get these days with vertical scaling of Postgres? I don't know how well the engine could make use of the extra resources. If it scales well vertically, I expect it to cover 80%-95% of use cases for people using Postgres. Probably with less complexity, and less overhead, than scaling horizontally.
Far. As in, really, really far. We started out with Postgres because it was just the simple and sensible option for a production prototype, and when somebody came around recently telling me we need something more scalable, I calculated that there's not even enough addressable market in the world for our business for more than 4x our size. That's exactly the two remaining vertical doublings of our DB instance (to ridiculous-looking RAM & CPU numbers) we could still do in case we need it.
Other than that, thanks to a lot of the recent work on connection handling and concurrency since PG 11, Postgres is getting better and better actually using these additional resources well: https://www.enterprisedb.com/blog/performance-comparison-maj...
Several years ago I was a contractor on a project that would slurp up data from all kinds of sensors and moving objects around a large city. It was decided that we needed Kafka and a few similar tools to handle it.
It was not hard to calculate the current max traffic or estimate the traffic growth over the next 10 years.
I did a demo of the system running on my laptop (all of it) + Postgres handling 100x the current data without too much difficulty.
Still, they went with the "scale" solution because it was the right design. (And of course the consultants and I got quite a lot more work to do, so we made a good deal more money.)
At this point you can get 24TB of RAM in an EC2 instance (along with 448 vCPUs, 100Gbps of network bandwidth and 38Gbps of EBS bandwidth). That won't scale forever, but Stack Overflow has been running on a single primary/standby setup with 1.5TB of RAM so that would be 16x Stack Overflow's RAM.
I think a lot of work goes into horizontal scaling which is necessary at a certain scale, but very few people actually get anywhere near that scale. It can be important to understand which things are needed at your scale and where you can simply buy some beefier hardware. I've been at places where people run a dozen sharded DB servers with each server having 16GB of RAM. Maybe that's resume-driven-development where someone wants to say they've done that.
A bunch of smaller distributed instances could be cheaper than one big one at equivalent size/compute. It also allows you to grow as needed, without worrying about things like DB transfer, instead of absorbing a big upfront cost.
I agree it adds a lot of complexity to the problem, which is another cost.
I guess this would be another argument for pay-as-you go cloud-managed DBs, despite being more expensive than rolling your own.
Scaling horizontally undoubtedly introduces complexity but it also comes with some upsides:
* DB backups are now (much) faster.
* Smaller backups mean faster restores, which reduces your RTO (Recovery Time Objective).
* If you have a well-architected application, a catastrophic DB failure will now only impact a portion of your userbase instead of all of them.

There are probably more good reasons, but these are the ones I could think of now.

Is high availability or easier backups why people look to horizontal scaling though? I don't think that's ever been a primary reason for any story I've read. It's a great "bonus", but I can't think that it would be a compelling reason to choose horizontal vs vertical scaling.
(There are other reasons, of course...)
I remember using PG 10 at a previous company that was kinda abusing Postgres as a data-processing tool with temp tables. Even with parallel scans etc., we found it was a lot faster to split our queries (mostly INSERT ... SELECT) into separate ones operating on separate ranges of rows, one for each CPU core. We'd run EXPLAIN to print out the plan, then shard on the innermost or outermost join. I even implemented a huge sparse matrix addition/multiplication calculator this way, chaining multiple operations into a single huge query, far exceeding the limits of numpy. I've always wondered if Postgres could be used as a more efficient Spark backend.
It usually scaled linearly. We had a 32-core (64-vcore) server, and saturating all cores ran a bit more than 32x as fast as a single query. In some cases it was less than linear, but still much better than a single query, and I think that was only because of mistakes like uuid4 primary keys.
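Roughly, the splitting described above looks like this (a sketch with made-up names: `src`, `dst`, and an `id` column; in practice each statement would go out over its own connection, e.g. via psycopg, so each one runs on a separate Postgres backend):

```python
def split_range(lo, hi, workers):
    """Split the half-open id range [lo, hi) into `workers` contiguous chunks."""
    step, rem = divmod(hi - lo, workers)
    chunks, start = [], lo
    for i in range(workers):
        end = start + step + (1 if i < rem else 0)  # spread the remainder
        chunks.append((start, end))
        start = end
    return chunks

def shard_statements(lo, hi, workers):
    """One INSERT ... SELECT per chunk, each covering a disjoint id range."""
    return [
        f"INSERT INTO dst SELECT * FROM src WHERE id >= {a} AND id < {b}"
        for a, b in split_range(lo, hi, workers)
    ]

# e.g. one statement per core on a 32-core box
stmts = shard_statements(0, 1_000_000, 32)
```

Because the ranges are disjoint, the per-core statements don't step on each other's source rows; the near-linear speedup reported above depends on the shard key being roughly uniform over the range.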
How long does it take to decrypt and back up your database at its current size?
From personal experience, it scales very well vertically. Have a system in production with tens of billions of rows and north of 12 TB of storage total. That system is read-heavy with large batched inserts, not many deletes or updates.
The biggest limiter is memory, where the need for it grows linearly with table index size. Postgres really, really wants to keep the index pages hot in the OS cache. It gets very sad and weird if it can't: it will sometimes unpredictably resort to table scans.
We are running on AWS Aurora, on a db.r6i.12xlarge. Nowhere even close to maxed out on potential vertical scaling.
Isn’t Aurora horizontal by default?
EDIT: Here's what I was thinking about. It's chunked in 10 GiB increments that are replicated across AZs.
> Fault-tolerant and self-healing storage
Aurora's database storage volume is segmented in 10 GiB chunks and replicated across three Availability Zones, with each Availability Zone persisting 2 copies of each write. Aurora storage is fault-tolerant, transparently handling the loss of up to two copies of data without affecting database write availability and up to three copies without affecting read availability. Aurora storage is also self-healing; data blocks and disks are continuously scanned for errors and replaced automatically.
No? It's a standard single-node primary-with-replicas setup, with a fancy log-based storage layer.
I recently did some DB maintenance on a write-heavy workload, and I found that a table with 500 million records will bloat over time. Switching it to use a proper partitioning scheme helped a lot. So people should not read this and assume you can just dump massive workloads into PG and they will be screamingly performant without some tuning and thoughtful design. (I don't think this is what you are implying, just a PSA.)
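For reference, the kind of declarative range partitioning (available since PostgreSQL 10) that helps here can be sketched like this; the `events` table and its columns are hypothetical, and the Python just generates the DDL:

```python
def monthly_partition_ddl(table, year, months):
    """DDL for a parent table partitioned by month, plus one child per month."""
    ddl = [
        f"CREATE TABLE {table} (id bigint, created_at timestamptz NOT NULL, "
        f"payload text) PARTITION BY RANGE (created_at)"
    ]
    for m in months:
        # upper bound is the first day of the following month (exclusive)
        nxt_y, nxt_m = (year + 1, 1) if m == 12 else (year, m + 1)
        ddl.append(
            f"CREATE TABLE {table}_{year}_{m:02d} PARTITION OF {table} "
            f"FOR VALUES FROM ('{year}-{m:02d}-01') TO ('{nxt_y}-{nxt_m:02d}-01')"
        )
    return ddl

statements = monthly_partition_ddl("events", 2024, range(1, 13))
```

Stale partitions can then be detached and dropped almost instantly, which sidesteps the bloat that mass DELETEs plus autovacuum churn would otherwise leave behind.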
Is there a chance you're running an older version of PG? They reduced bloating significantly in the last few releases.
Yes! PG12 in this case. Thanks, this definitely motivates my in-flight project to get these DBs up to PG16. Very happy to hear that.
If you need more motivation, check out the feature matrix; there have been a fair number of pretty big advances since 12.
My cluster clocks in at 230TB in Aurora. It is approaching the hard limit of 250TB that AWS can support.
No, we do not store logs or IoT data. The data are all business-related metrics. I didn't choose Aurora but inherited it from another team. We have 4 read replicas to scale out read access. The internal team owns ingestion (inserts) to the write node; all other, external accesses are reads.
I think the reason behind the Aurora pick was to support arbitrary aggregation, filtering, and low-latency reads (p90 < 3000ms). We could not pick a distributed DB based on Presto, Athena, or Redshift, mainly because of the latency requirements.
The other contender I considered was Elasticsearch. But I do think using it in this case would be, as the saying goes, fitting a square peg into a round hole.
Out of curiosity, I was wondering what type of application could generate this quantity of data.
Is it IoT / remote sensing related?
You are thinking of a normalized (BCNF, if not 3NF), well-architected application storing structured data. Unless the app has 100 million+ users or grew super fast, a 250TB size would be hard to get to.
Time series (like the IoT you mentioned), binary blobs, logs, or any other data that shouldn't really be in SQL storage can hit any size, and wouldn't be all that interesting.
Can't speak for OP; however, managing data for apps with a few million users, what I have observed is that most SQL stores hit the single-digit-TB range and then start getting broken down into smaller DBs. Either teams have grown and now want their own microservice or DB, or infra wants them easier to handle in a variety of ways, including backup/recovery: it is extremely difficult to get reasonable RTO/RPO numbers for larger DBs.
If you want to store video data as BLOBs in a DB, you can get there easily.
Maybe not the best idea, I guess a file system would be better for that, and just use the DB for metadata.
But OTOH all the data is one place, so you just migrate the DB. Less to worry about.
I just looked it up: all of English Wikipedia (including images) is barely even 100 GB ... crazy world we live in.
You wouldn't say "less to worry about" when you have to do a full backup, or demonstrate recovery from backup within a set recovery time.
"One data store is easier" is a myth; it just offloads complexity from developers to infra teams, who are now provisioning premium NVMe storage instead of cold object stores for binary data.
Binary data is not indexed or aggregated in a SQL store, so there is no value in keeping it all in one place except dev experience, at the cost of the infra team's experience.
Super easy to generate arbitrary amounts of data if you start using Postgres as a log of any sort.
I worked for a company that had only a few thousand active customers yet had dozens of terabytes of data, for this reason.
Whenever I've had bottlenecks on a single Postgres instance, it's been because of patterns causing excessive lock contention. Redesigning your stuff to be compatible with a horizontal Postgres in the first place involves eliminating forms of shared state, which could also greatly improve the performance on a vanilla single-machine Postgres instance, so you can get very far.
Like, Citus's FAQ says "if you use Citus, you do not need to manually shard your application, and you do not need to re-architect your application in order to scale out." But the line between application-level and DB-level sharding isn't this sharp. The fundamental limitations of distributed systems surface in their rules* about what you can/cannot do across shards, and you might find yourself re-architecting your application anyway.
* https://docs.citusdata.com/en/stable/develop/reference_worka...
Depends on what you want to scale. Memory, CPU, and disk scale well. High concurrency can become a problem due to the 1 process per connection architecture. Keeping latency low between your app and DB server, and understanding how and where locks are used, helps. So far I have used it, and seen it being used, successfully at pretty big companies in a single-instance (+ standby) setup.
> High concurrency can become a problem due to the 1 process per connection architecture.
If I understand correctly what you mean, then this is no longer a problem. You will simply need to use a connection pool, such as Odyssey or PgBouncer. Even SPQR has its own pool of connections for each shard.
PgBouncer just pools the connection, but each connection still needs its own process in PostgreSQL itself. Each query blocks the whole process. That limits the amount of queries that can run in parallel/concurrently to the amount of connections. Long-running queries can easily clog up everything. No tool can fix this, you need to be aware of it and consider it in your design.
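A toy model of that constraint, with no real database involved: a semaphore stands in for the fixed set of connections/backends, and sleeping threads stand in for queries holding a backend process.

```python
import threading, time

POOL_SIZE = 4                       # pretend max_connections = 4
pool = threading.Semaphore(POOL_SIZE)
peak = 0
active = 0
lock = threading.Lock()

def run_query(duration):
    """A 'query' that occupies one connection slot for its whole duration."""
    global active, peak
    with pool:                      # acquire one of the N "connections"
        with lock:
            active += 1
            peak = max(peak, active)
        time.sleep(duration)        # the backend is blocked for this long
        with lock:
            active -= 1

# 16 queries contend for 4 connections; concurrency never exceeds POOL_SIZE.
threads = [threading.Thread(target=run_query, args=(0.05,)) for _ in range(16)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

However many queries you throw at it, only `POOL_SIZE` make progress at once, and a large `duration` is exactly the long-running query that clogs up everything for everyone else.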
It's not really about the processes. Even if each query ran on a thread within a process, or some form of greenthreading were in use, there are I/O constraints and locking to consider.
What I'm talking about is the blocking of limited resources. Processes/connections are expensive, and therefore you want/have to limit their maximum number. Each query/transaction requires its own process and blocks it for everyone else until it is done.
I/O or compute constraints are another issue; if your CPU or disk is already saturated, you probably get no additional benefit.
But if you wait for something (locks, I/O) the connection/process can't do other things. High latency between app/database and long running transactions can also use up your available processes, even if they don't consume a lot of CPU or I/O or fight for the same locks.
Lock contention is its own problem, but makes the blocking of processes/connections worse.
What I meant is, even if processes had no cost, you will probably bottleneck on the I/O. So it doesn't matter a lot how many Postgres connections you have.
Sure, it's not the most common problem to experience, just something that's useful to know once you have more and more clients. It did happen to me: all connections used up by long-running transactions (for multiple reasons), and then nothing else was able to run, with CPU and I/O nearly idle at <5% and no significant lock contention.
Ah, long-running transactions wreak havoc with connection pools.
Pretty far for pretty cheap! I had a Postgres installation that ran for 6 years without issue with 11 billion rows in one of the larger tables, 1-2 million hits per day, on a single hetzner machine that ran around $130-$150 per month. Same machine ran the web server, reverse proxy, web app back ends, and ML processing in batches.
To me the largest bottleneck is when you do large INSERT/UPDATE/CREATE TABLE AS etc.: all the final writes serialize as single-threaded ops, which makes it hard to scale for heavy ETL workflows.
Scaling vertically will work well in scenarios where you have consistently flat load. Horizontal scaling is good for when you have predictable valleys or spikes in load.
In my experience if you want to be cost effective you need both. A decent amount of vertical scaling to have headroom for baseline and some amount of unpredictable spikes, horizontal scaling for the valleys of traffic that match your primary markets day/night cycle.
For bursty load, why not restart a slave node on a bigger VM and promote said slave to master?
Repeat with bigger and bigger nodes as needed. To scale down, do the inverse.
Wouldn't there be a significant hit to availability or latency during the master promotion?
Really far.
Partial indexes and partitioning go a loooooooong way.
A comparison to https://www.citusdata.com/ would be nice
for one thing:
https://github.com/pg-sharding/spqr/blob/1.3.0/LICENSE (BSD2)
https://github.com/citusdata/citus/blob/v12.1.2/LICENSE (AGPLv3)
I still have a hard time understanding where the AGPL copyleft feature kicks in on a product like Citus. Are you safe just deploying it as part of your backend? Like, do you have to actually touch Citus's own code to trigger it? What if you start using some sprocs or sample connectivity code? Anybody know a good set of guidelines?
This is the entirety of the text which was added to the AGPL causing it to differ from the plain GPL:
> Notwithstanding any other provision of this License, if you modify the Program, your modified version must prominently offer all users interacting with it remotely through a computer network (if your version supports such interaction) an opportunity to receive the Corresponding Source of your version by providing access to the Corresponding Source from a network server at no charge, through some standard or customary means of facilitating copying of software. This Corresponding Source shall include the Corresponding Source for any work covered by version 3 of the GNU General Public License that is incorporated pursuant to the following paragraph.
In other words, if you deploy a modified version of Citus, your modified version is AGPL licensed, and thus you must provide the source code of this modified version to all users who interact with it remotely (e.g. through your web application).
What it does not state is that you must provide the source code of your entire web application just because you deployed a modified version of Citus, nor that your web application becomes an AGPL-licensed derived work of Citus because you used it over a network. The AGPL also does not require anything at all from you other than the plain GPL's basic requirements if the version of Citus you deploy is unmodified. (These are all incredibly common misconceptions on the internet, by people who've never read the license nor the GNU website.)
This is a different understanding than the one I had of AGPLv3. To quote the FSF:
"The AGPLv3 does not adjust or expand the definition of conveying. Instead, it includes an additional right that if the program is expressly designed to accept user requests and send responses over a network, the user is entitled to receive the source code of the version being used."
Have I misread this? As I understand it, AGPLv3 is an extension of GPLv3, and that would mean that if I use an AGPLv3-licensed service, I have the right to receive the code for that service.
ref: https://www.fsf.org/bulletin/2021/fall/the-fundamentals-of-t...
The key word there is "the version being used". If you're using an unmodified version of Citus, the user is free to go get the source code from GitHub.
That's why the text of the clause itself makes a distinction about "if you modify the Program".
sono pazzi questi romani
"On the first page where we find the map of France we see a Roman banner with S.P.Q.R. which means "Senatus Populus Que Romanus" (The senate and people of Rome). The Italian translator made a great pun about it and translated it into "Sono Pazzi questi Romani". The latter can be translated to "These Romans are crazy"."
https://www.asterix-obelix.nl/index.php?page=hjh/dos-italy.i...
Unrelated, but you reminded me of this clip from The Life of Brian:
https://www.youtube.com/watch?v=wjOfQfxmTLQ
It's funnier if you know that the ear-twisting Roman at the start was basically a caricature of many Latin teachers that were around when Latin was taught more widely in schools (such as when the Monty Python team were young).
Happy to see so many tools created for PostgreSQL. Thanks for your great work!
Why use SPQR over Citus? Vitess?
The fact that it's a router and not a PG extension is helpful if you are using cloud hosted postgres and don't want to get into the business of managing backups etc yourselves.
Only Azure supports Citus, as far as I know.
CrunchyBridge will run you a Citus cluster on any of the major cloud providers.
Vitess is a MySQL cluster manager, it doesn’t support Postgres. https://github.com/vitessio/vitess/issues/7084
SPQR, as the name suggests, is just a query router.
Yes, that's what the name suggests
The name also suggests plans to go to war with the Carthaginians.
I was looking for such a comment, thank you for making my day.
Carthago delenda est!
Does it handle byzantine failures?
No, only Roman failures
Sounds like a pretty germane feature to have.
Parent means Eastern Roman failures
Like the one where they failed to conquer that village of Gauls?
As usual, Asterixes are to be avoided.
Besides, there are plenty of Obelixes to be found in Egypt. There is no need to get your phallic stone monuments from the Gauls.
(I expected Roman puns and was not disappointed)
So during a Parthian event, they lose the Crassus.
Veni, Vidi, Vici
Nice
SQL is problematic to scale horizontally, joins in particular.
I see nothing about partition tolerance, so I assume it isn't at all.
Give me Aphyr tests or there is no reason to pay attention.