Architectures for Mitigating AWS Outages
Poor man's architecture for mitigating outages: Avoid us-east-1 (N. Virginia). It is unequivocally the fail-whale of AWS regions. Every service there seems to have abnormally high usage numbers and an absurd amount of scale to deal with. As James Hamilton likes to remind us, at such scale, even rare events are frequent [0]. The curse of being the default region, I suppose?
us-east-2 (Ohio) and eu-west-1 (Dublin) are my go-to regions. Prices are the same and most new services (and new features) are almost always ready to go on launch days.
[0] https://perspectives.mvdirona.com/2017/04/at-scale-rare-even...
I can recall a number of outages on us-west-1/2, so I'm not sure the argument applies. Maybe a us-east-1 failure is just more visible when it happens. The latest incident didn't impact all of us-east-1 anyway, really just people using Kinesis, right?
No one should use us-west-1 unless they have a good reason to: It's more expensive, only two AZs for new customers, frequent capacity issues, and slow to get new services. us-west-2 should almost always be preferred, as it's a flagship region and only a couple light-milliseconds away.
As a user of us-west-2, I've found it to be very reliable. I can only remember a couple minor issues off the top of my head. The region hasn't even made the leaderboard. [1] ;)
Every time us-east-1 goes haywire, I can kick my feet up and relax as the world burns.
Well, yes, but services like SNS and S3 use Kinesis, so if you used either of those...
It's my understanding that they roll out changes there first, historically. New features land there first. Etc. US-EAST-1 IS the canary.
This is a commonly repeated but incorrect statement.
Every AWS service decides its own deployment strategy - there is no mandate about which region gets updates first. And starting in smaller regions is the recommended practice.
If it's the canary, you don't want to be there if you're single region. You probably do want to be there if you're multi-region, in case they roll out something that breaks you, but not everyone else.
This is kind of... fluffy for an HN post. It makes no mention of distributing your application between availability zones as a first step, waves away the issues involved in any of its suggestions, then ends with “just configure what you need to configure for RDS and S3 replication and restore the rest by hand from backup.”
Author here - totally fair. It is pretty high level. Distributing across AZs is touched on only in the sense that it covers HA, not DR.
The intent of this post was mainly to counter the arguments I was seeing online that dismissed multi-region setups as a way to guard against AWS outages because "it's too hard." I merely wanted to point out there is a way to do it without it being super complicated, or super expensive. There seems to be a trend to build super fancy architectures and as a whole, many people have neglected the simple "good enough" options.
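To make "good enough" concrete, here's roughly what the S3 half can look like. This is just a sketch, not lifted from the post: the bucket names, account ID, and role ARN are placeholders, and both buckets need versioning enabled plus an IAM role with the usual replication permissions.

```python
import boto3

# Sketch: enable S3 cross-region replication from the primary bucket to a
# versioned bucket in the standby region. All names/ARNs are placeholders.
s3 = boto3.client("s3", region_name="us-east-1")

s3.put_bucket_replication(
    Bucket="my-app-data",  # source bucket, versioning must already be enabled
    ReplicationConfiguration={
        "Role": "arn:aws:iam::123456789012:role/s3-replication-role",
        "Rules": [
            {
                "ID": "dr-copy",
                "Status": "Enabled",
                "Priority": 1,
                "Filter": {},  # empty filter = replicate every object
                "DeleteMarkerReplication": {"Status": "Disabled"},
                "Destination": {
                    # versioned bucket sitting in the standby region
                    "Bucket": "arn:aws:s3:::my-app-data-replica",
                },
            }
        ],
    },
)
```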
It's especially unnecessary when AWS already makes considerably more information available if you ask the right questions.
As SREs, we accept that different regions will fail every now and then. How to mitigate those failures is the big question.
I think having a database replicated to two regions and load-balancing all traffic across both has a big impact on performance and read-after-write consistency. The solution is what the last option suggests: the second region is prepared as a fallback only, not as a live region. This way, you accept that you will still have some downtime, and maybe some data loss, whenever a failover takes place. But that is much better than going down for hours, and much better than running two regions in active-active mode, where your system suffers on both performance and data consistency. So accept that you will have failures and work on the better solution. The perfect solution does not exist. There is no 100%.
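To sketch what that fallback-only routing can look like (hostnames, the zone ID, and the health-check path are all placeholders, not anyone's real setup): Route 53 failover records keep traffic on the primary region while its health check passes and only shift to the standby when it fails.

```python
import boto3

route53 = boto3.client("route53")

# Health check against the primary region's public endpoint (placeholder values).
hc = route53.create_health_check(
    CallerReference="primary-endpoint-check-1",
    HealthCheckConfig={
        "Type": "HTTPS",
        "FullyQualifiedDomainName": "primary-lb.us-east-1.example.com",
        "Port": 443,
        "ResourcePath": "/health",
        "RequestInterval": 30,
        "FailureThreshold": 3,
    },
)

# The PRIMARY record answers while the health check passes;
# the SECONDARY (standby region) record takes over when it fails.
route53.change_resource_record_sets(
    HostedZoneId="Z1234567890EXAMPLE",
    ChangeBatch={
        "Changes": [
            {
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": "app.example.com",
                    "Type": "CNAME",
                    "SetIdentifier": "primary",
                    "Failover": "PRIMARY",
                    "TTL": 60,
                    "HealthCheckId": hc["HealthCheck"]["Id"],
                    "ResourceRecords": [{"Value": "primary-lb.us-east-1.example.com"}],
                },
            },
            {
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": "app.example.com",
                    "Type": "CNAME",
                    "SetIdentifier": "standby",
                    "Failover": "SECONDARY",
                    "TTL": 60,
                    "ResourceRecords": [{"Value": "standby-lb.us-east-2.example.com"}],
                },
            },
        ]
    },
)
```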
Sadly there is no multi-region Cognito. Would love to hear if anyone has a strategy for that.
Unfortunately if you use Cognito for authentication you're currently SoL, as many of us learned the hard way last week.
Region rollover is not possible with Cognito. I believe you could backup the user pool to a different region, but users would still need to reset their passwords which would open another can of worms when you switch back to the original pool.
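A rough sketch of what that backup could look like (the pool ID, output path, and attribute list are placeholders, and password hashes can't be exported, so restored users would still have to reset):

```python
import csv
import boto3

cognito = boto3.client("cognito-idp", region_name="us-east-1")

def export_user_pool(user_pool_id: str, out_path: str) -> None:
    """Dump user attributes to CSV so accounts can be re-created in a pool in
    another region. Passwords cannot be exported; users must reset them."""
    paginator = cognito.get_paginator("list_users")
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["username", "email", "email_verified", "status"])
        for page in paginator.paginate(UserPoolId=user_pool_id):
            for user in page["Users"]:
                attrs = {a["Name"]: a["Value"] for a in user["Attributes"]}
                writer.writerow([
                    user["Username"],
                    attrs.get("email", ""),
                    attrs.get("email_verified", ""),
                    user["UserStatus"],
                ])

export_user_pool("us-east-1_EXAMPLE", "user-pool-backup.csv")
```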
This architecture is flawed; your DNS does not seem very resilient. Also, good luck with DB replication across regions and the nasty side effects you can get from out-of-sync data.
My advice is: have a well-designed architecture in a single region with multiple AZs and you will cover most problems.
Flawed how? Route 53 claims 100% uptime. Though if it goes down, options are limited anyway. Are you saying a read replica won't stay in sync? Replication lag tends to be low. Keeping databases in sync across regions is definitely difficult, which is why I recommended against that.
Being in multiple AZs alone increasingly seems to be insufficient. So the point was that by at least pushing backups or replicas into another region, you have options to restore from them.
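For example, a cross-region RDS read replica you can promote during a DR event is one API call. This is a sketch only; the identifiers, ARN, and instance class below are placeholders.

```python
import boto3

# Client in the standby region; the replica gets created there.
rds_dr = boto3.client("rds", region_name="us-east-2")

rds_dr.create_db_instance_read_replica(
    DBInstanceIdentifier="app-db-replica-dr",
    # For cross-region replicas the source must be the full instance ARN.
    SourceDBInstanceIdentifier="arn:aws:rds:us-east-1:123456789012:db:app-db",
    SourceRegion="us-east-1",  # lets boto3 build the required pre-signed URL
    DBInstanceClass="db.t3.medium",
)

# During an actual failover, promote the replica to a standalone writable
# instance and point the application at it:
# rds_dr.promote_read_replica(DBInstanceIdentifier="app-db-replica-dr")
```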
You could host your DNS elsewhere. Cloudflare (though they recently had a serious outage) or any of the million DNS providers that have been around forever.