AWS RDS Postmortem: Is AWS Collapsing Under Its Own Weight?
Hi everyone,
I have been working with AWS for many years, and over those years the service has been outstanding. A few months ago I started the migration of (yet another business) to AWS, and we had an incident. This made me think that maybe AWS is no longer as good as it used to be.
I would like to share the postmortem report with the community; please comment on what you think. I would like to know whether we made a fundamental mistake or whether AWS is actually degrading.
Times are UTC. Personal opinions are removed from the report; only facts are stated.
---------- POSTMORTEM REPORT
The project consists of moving several services to AWS. The system consists of services in one auto scaling group and a PostgreSQL database in RDS.
- Sunday 4.30 am: we migrate the PostgreSQL database to RDS. RDS is configured with 200 GB of storage; the database size is 15 GB.
- Sunday 10.17 am: RDS detects that we are running out of space and decides to grow the storage from 200 GB to 999 GB. The RDS storage auto scaling event starts.
At this point the performance of the database is degraded. Alerts are triggered.
A test performed from within the VPC with the query "SELECT now()" took 20.248 seconds.
The database performance is so bad that many of the services go down.
- Sunday 14:39: the RDS auto scaling event finalizes.
After contacting AWS Support (see details below) we decided to roll back.
We contacted AWS Support (Business). The most important points from the transcript are:
- The fact that RDS decided to grow the disk from 200 GB to 999 GB, when the actual database size is 15 GB, is not considered a problem.
- Performance degradation while the auto scaling event is in progress is expected. As "sessions" are not being dropped, AWS considers the database online, so it is working as expected.
- We pointed out the example of a SELECT now() taking 20 seconds. This did not change their position that the database is online and all is good.
- When asked for an estimate of the duration of the event, it was stated that it could take "from several minutes to several days".
- The objective of AWS Support is to communicate what is happening (implying that you should not expect them to help you fix the actual problem).
-------
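As a footnote to the report: the 20-second figure came from a timed round trip, something along these lines (a sketch, not the actual script; the query call is abstracted into a callable so the probe stays driver-agnostic — with psycopg2 it would wrap `cur.execute("SELECT now()")`, which is an assumption, not what we ran):

```python
# Minimal latency probe: time one trivial round trip against the database
# and compare it to an alerting threshold. The run_query callable stands
# in for a real driver call such as executing "SELECT now()".
import time

def probe_latency(run_query, threshold_s: float = 1.0):
    """Return (elapsed_seconds, healthy) for one round trip."""
    start = time.monotonic()
    run_query()
    elapsed = time.monotonic() - start
    return elapsed, elapsed < threshold_s

# Stub standing in for a real "SELECT now()" round trip:
elapsed, healthy = probe_latency(lambda: time.sleep(0.01))
print(f"{elapsed:.3f}s healthy={healthy}")
```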
Opinions?

To me, this reads like they are trying to tell you to get lost in nice words. Most likely you're not spending $1+ mio annually, so they just don't care about you or your experience. AWS is built to support the biggest cloud empires on the planet. If you're not working at that scale, a smaller provider will likely give you much more personal attention, so better support, better tuning, and potentially better bang per buck. But I wouldn't call AWS collapsing. It's operating as designed. The issue is that too many small companies bought into their "stand on the shoulders of giants" marketing and then convinced themselves that they need planet-scale whatever when really they don't. If you're small and nimble, you want small and nimble solutions, too.

@fxtentacle I agree with you. This actually makes a lot of sense. The title probably reflects the anger and frustration I was feeling after the poor support. The company being migrated to AWS is actually quite large, but not large enough to be spending $1+ mio annually; this migration was one system out of many that need to be moved. Still, 10+ years ago AWS support was available to everybody for free: you would get in contact with a big nerd who knew the ins and outs of the system and would help you massively with pretty much anything. Slowly they moved to paid-only support, and now even paying customers don't get their problems listened to. At the very beginning it was the small(-ish) companies (with a few exceptions) that made AWS's name, along with an army of geeks talking up the services. If they now only work for large corporations, that is a big (and risky) change, and it goes against the Amazon Leadership Principle of "Customer Obsession".

It's very difficult to manage IOPS for RDS. IOPS scales with RDS disk size up to a certain point, so you might not have had nearly enough IOPS at only 200 GB.
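The scaling rule being referred to can be sketched roughly for gp2 storage (a hedged model: the 3 IOPS/GiB baseline, 100 IOPS floor, 16,000 cap, and 3,000-IOPS burst for sub-1 TiB volumes follow AWS's published gp2 description, but verify against current docs, and io1/gp3 behave differently):

```python
# Rough model of gp2 volume IOPS: baseline is 3 IOPS per GiB with a
# floor of 100 and a cap of 16,000; volumes below ~1 TiB can burst to
# 3,000 IOPS while credits last. Numbers per AWS's published gp2
# description -- verify against current documentation.

def gp2_baseline_iops(size_gib: int) -> int:
    """Baseline IOPS for a gp2 volume of the given size."""
    return min(max(100, 3 * size_gib), 16_000)

def gp2_max_iops(size_gib: int) -> int:
    """Peak IOPS: small volumes can burst to 3,000 while credits last."""
    if size_gib < 1_000:
        return max(gp2_baseline_iops(size_gib), 3_000)
    return gp2_baseline_iops(size_gib)

for size in (15, 200, 999):
    print(f"{size} GiB -> baseline {gp2_baseline_iops(size)} IOPS, "
          f"peak {gp2_max_iops(size)} IOPS")
```

On this model, growing from 200 GB to 999 GB raises the baseline from 600 to 2,997 IOPS, which is presumably what the autoscaling bought — at the cost of the resize operation itself.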
It's possible that you used too small an instance type, since you should probably use an instance that can keep all 15 GB in memory and incur only write IOPS from WAL writes, bgwriter, and checkpointing. Even a single NVMe SSD can greatly exceed the maximum IOPS available to RDS, so you have to be very careful migrating database workloads to RDS. Once your workload exceeds the available throughput, latency will be terrible. These are all hard lessons we've learned on our own.

Support will not help you if you have terrible performance issues with RDS. They love to tell you to try optimizing your queries. AWS could afford to staff multiple dedicated support positions for our account, but they pocket the money instead and give terrible canned responses. If you have a tiny account, they definitely won't help you. Some people say support is great, but I've never had a good experience.

Something definitely seems off here, from the fact that RDS chose to scale to the response you got from support (who've always been... mostly OK in my experience). First of all, I'd like to find out a little more context before I jump to any conclusions:

- Does CloudWatch confirm that RDS "needed" to scale? And if it does, do any other metrics increase simultaneously with the increase in storage used?
- Other than the (presumably) one change made to that RDS instance at ~4.30 am, were there any other changes made to it, specifically to its storage, prior to the autoscaling event?
- Had this service been tested on RDS prior to the migration being performed?
- Were any other changes made that may potentially affect your DB? For example, a query being changed or something of that nature.

To me, it sounds like support found something that suggested whatever was happening to RDS at the time was "someone else's problem" under their shared responsibility model.
Whether or not that's true, who knows, but from how you've described it they definitely seem to be trying to palm you off. Worth mentioning to your TAM/rep if you have one, because this is pretty poor service.

@synicalx thanks for your feedback. I will see if there is anything visible in CloudWatch.

> - Other than the (presumably) one change made to that RDS instance at ~4.30 am, were there any other changes made to it, specifically to its storage, prior to the autoscaling event?

At 4.30 the following was logged: "Storage size 999 GB is approaching the maximum storage threshold 1000 GB. Increase the maximum storage threshold." However, the auto scaling event started at 10.17.

> - Had this service been tested on RDS prior to the migration being performed?

We performed a dozen migration simulations in our Sandbox Account over multiple weeks. We developed scripts and automation to carry out the actual migration. The only difference in the Sandbox Account was that the RDS database was smaller in CPU and RAM.

> - Were any other changes made that may potentially affect your DB? For example, a query being changed or something of that nature.

I will double-check with the team, but the whole migration was fully automated with scripts. No action outside executing the automation scripts and following the plan has been reported to me.

Yeah, I feel like something was definitely up with RDS, or at least this event is suspicious enough to warrant further investigation. Step one is definitely getting all your "evidence" together - CloudWatch screenshots, logs, maybe CloudTrail etc. - and then double-checking there's nothing you missed or any boo-boos in the scripts or data. Either way, definitely mention your support interaction to your account manager if you have one; from how you've described it this was a pretty poor interaction, especially if you're paying for Business support.
If it's definitely not something you caused, I would also ask them to escalate the issue and get you an explanation as to why RDS did what it did.

> Does CloudWatch confirm that RDS "needed" to scale?

Actually, you pointed to a clue that I missed. I should have checked CloudWatch! The free space graph shows that at some point something consumed the 200 GB of space that we originally assigned. This is a good clue that I am going to dig into. Thanks for your feedback! EDIT: CloudWatch metrics for RDS were the key to finding the source of the issue.

So what was it? My money would be on (in order): WAL, logs, unexpected index growth.

Things went from bad to worse because the work_mem parameter had not been set up correctly. Some queries that require a large amount of memory to process started using disk. Once the autoscaling kicked in, even if we had realized, it wouldn't have helped, as you are locked out of the system. The auto scaling event was triggered earlier than our alerts for low available disk. What I don't know yet is why: we had been moving services for multiple days, and the service that required the work_mem parameter had been in production on AWS for 48 hours before it started using disk rather than memory to process SQL queries.

Interesting problem to come across; sounds like this is a scenario where RDS and the lack of host access/visibility was a bit of a handicap. Glad you found the issue though, hopefully things go smoothly next time!

Is this RDS Aurora or RDS "classic"? Classic RDS is essentially control-plane-only - AWS spins up an EC2 instance on your behalf, installs the database software, configures replication, etc. But on the data plane, you're connecting to a more or less stock Postgres or MySQL instance. Aurora uses a more modern distributed design [0] akin to Spanner, CockroachDB, etc.
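Going back to the work_mem spill a few comments up, the mechanism can be sketched roughly (the 4 MB default matches stock Postgres; the row counts and widths below are made up for illustration): a sort or hash whose in-memory footprint exceeds work_mem falls back to temp files on disk.

```python
# Rough model of Postgres's work_mem spill behaviour: an operation
# (sort, hash) whose in-memory footprint exceeds work_mem spills to
# temp files on disk. 4 MB is Postgres's stock default; the row
# numbers below are made up for illustration.

DEFAULT_WORK_MEM_KB = 4 * 1024  # Postgres default work_mem: 4 MB

def spills_to_disk(rows: int, avg_row_bytes: int,
                   work_mem_kb: int = DEFAULT_WORK_MEM_KB) -> bool:
    """True if a sort over this many rows would exceed work_mem."""
    return rows * avg_row_bytes > work_mem_kb * 1024

# A 1M-row sort of ~100-byte rows needs ~100 MB, far over the 4 MB default:
print(spills_to_disk(1_000_000, 100))
# With work_mem raised to 256 MB the same sort stays in memory:
print(spills_to_disk(1_000_000, 100, work_mem_kb=256 * 1024))
```

In practice the spill shows up as growth in the temp_files / temp_bytes counters of pg_stat_database and, with log_temp_files enabled, in the Postgres log - which is roughly the signature visible in the CloudWatch free-space graph here.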
They implemented their own quorum-based log layer as a storage backend, so it is now involved in both the control plane and the data plane. Classic RDS has been around long enough, and is essentially in maintenance mode without many new features being added, that I wouldn't expect weird behavior like this from it. So I'm guessing Aurora. YMMV, but personally I don't trust Aurora for databases I administer. I use classic RDS for small stuff, but if I needed availability & throughput beyond what classic RDS with its single-node / scale-up-only model can offer, I would reach for CockroachDB or Cassandra or something similar. Also, if your database size is only 15 GB you definitely don't need (and are overpaying for) a 200 GB Aurora instance; your data fits in RAM on a mid-sized RDS instance.

0: https://www.allthingsdistributed.com/2019/03/amazon-aurora-d...

@evil-olive Thanks for your feedback. Aurora is something we would definitely look into in the future, but with the workload levels we are currently experiencing, Postgres RDS should work for quite a long time.

> CockroachDB or Cassandra or something similar.

I can't wait for a business use case where I need something like this :) but unfortunately right now this is not the case.

Have a look at YugabyteDB. If you can tolerate Postgres 11.2, it would work for you. Yugabyte offers a managed service, on premise, and you can roll out your own system. I'm not affiliated with the company.

Decline is the wrong word. It is getting too big for its own good. It's collapsing under its own weight. The time is ripe for a smaller, sharper, leaner startup to do things differently and better.

@tus666 thanks for the tip. "Collapsing Under Its Own Weight" is more accurate.

You don't mention the type of disks you're using with RDS.
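If those disks are gp2, "burstable IOPS" refers to a credit bucket: a rough model follows (the 5.4 million credit bucket and refill at the 3 IOPS/GiB baseline are AWS's published gp2 description; verify against current docs, and gp3/io1 volumes don't burst this way):

```python
# Rough model of the gp2 burst-credit bucket: the bucket starts full at
# 5.4 million I/O credits, refills at the volume's baseline rate
# (3 IOPS per GiB, floor 100) and drains at the IOPS actually consumed.
# Numbers per AWS's published gp2 description -- verify before relying
# on them.

BUCKET_CREDITS = 5_400_000

def seconds_until_burst_exhausted(size_gib: int, sustained_iops: int) -> float:
    """How long a volume can sustain `sustained_iops` above its baseline."""
    baseline = max(100, 3 * size_gib)
    drain_rate = sustained_iops - baseline
    if drain_rate <= 0:
        return float("inf")  # at or below baseline: credits never run out
    return BUCKET_CREDITS / drain_rate

# A 200 GiB volume (600 IOPS baseline) pushed at the 3,000 IOPS burst cap:
print(seconds_until_burst_exhausted(200, 3000) / 60, "minutes")
```

On this model, a 200 GiB volume driven flat-out exhausts its burst in roughly 37 minutes, after which it drops to the 600 IOPS baseline - right when a background storage operation may still need IOPS of its own.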
Disk scaling/changing operations also consume IOPS to do their work, which can get you into trouble if you, say, run out of burstable IOPS while RDS is still partway through a background mirror operation. If you have RDS Enhanced Monitoring turned on, do look at those metrics. They're a bit trickier to use, but as they come from an agent on the instance rather than the hypervisor, they can sometimes help with debugging. Like any company, sometimes the support agent you get might be having an off day. This is where having an AWS TAM is helpful - you can get them to re-escalate the support issue for a second opinion.

> run out of burstable IOPS but RDS is still partway through a background mirror operation.

This is a very good point. Will check.

> If you have RDS Enhanced Monitoring turned on do look at these metrics.

We don't have it enabled yet, but point taken.

> This is where having an AWS TAM is helpful - you can get them to re-escalate the support issue for a second opinion.

We currently have "Business". I asked support to escalate internally and he said that he could not. The next level up would be "Enterprise On-Ramp" ( https://aws.amazon.com/premiumsupport/plans/?nc=sn&loc=1 ). Comparing the two, it looks like moving to the next support tier would gain us the following:

- Business-critical system down: < 30 minutes
- A pool of Technical Account Managers to provide proactive guidance and coordinate access to programs and AWS experts - I think we already have this service via our account manager.
- Concierge Support Team - no idea what this is. I will check, but if anybody has an opinion from experience on whether this is worth it, I would like to know.

Yes.