AWS Glue/Athena/Redshift outage
status.aws.amazon.com

Should edit the title to add "in us-east-1". Other regions are unimpacted.
There are so many outages in us-east-1. I've heard it's because that's where they roll out maintenance first, or something along those lines. Just look at this list of outages on Wikipedia [1] and scan for us-east-1, Northern Virginia, or "US East" (all the same place).
Just don't use US-EAST-1 as your region.
1. https://en.wikipedia.org/wiki/Timeline_of_Amazon_Web_Service...
It's the oldest region, which means:
* It's the largest region (ever had an unexpected scaling bug?).
* It has more legacy stuff lying around. For example, old regions have EC2 Classic, while new regions are VPC only.
* There are more customers there. More whales, more use cases.
Most AWS teams explicitly try not to deploy to us-east-1 first, but because us-east-1 is so different along so many dimensions, it is more likely to have issues that don't manifest elsewhere.
(Source: An AWS Engineer)
> I've heard it's because that's where they roll out maintenance first
That doesn't make sense - why would they do maintenance in their largest (and oldest) region first? I'd expect them to roll out changes to smaller regions first so problems will affect fewer users.
I think the more likely explanation is that it's their largest (and oldest) region.
An AWS TAM (technical account manager) once told me the same thing: us-east-1a gets the new stuff first. I never validated it against anything other than this one person's statement.
"1a" in this context means nothing. The AZ assignments each account gets is random. us-east-1a is probably a different data center for you than me.
It’s also full of legacy infrastructure since it was the first region.
It is usually the cheapest region, though. Maybe that's why.
Source? I see price parity across us-east-* and us-west-2 for every service I've looked at.
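If anyone wants to check, the Pricing API makes the comparison easy enough. A rough boto3 sketch; the instance type and filters are just illustrative, and the Pricing API is only served from a few endpoints (us-east-1 used here):

    import json
    import boto3

    # Compare on-demand Linux pricing for one instance type across regions.
    pricing = boto3.client("pricing", region_name="us-east-1")

    def on_demand_price(location, instance_type="m5.large"):
        resp = pricing.get_products(
            ServiceCode="AmazonEC2",
            Filters=[
                {"Type": "TERM_MATCH", "Field": "location", "Value": location},
                {"Type": "TERM_MATCH", "Field": "instanceType", "Value": instance_type},
                {"Type": "TERM_MATCH", "Field": "operatingSystem", "Value": "Linux"},
                {"Type": "TERM_MATCH", "Field": "tenancy", "Value": "Shared"},
                {"Type": "TERM_MATCH", "Field": "preInstalledSw", "Value": "NA"},
                {"Type": "TERM_MATCH", "Field": "capacitystatus", "Value": "Used"},
            ],
            MaxResults=1,
        )
        product = json.loads(resp["PriceList"][0])
        term = next(iter(product["terms"]["OnDemand"].values()))
        dimension = next(iter(term["priceDimensions"].values()))
        return dimension["pricePerUnit"]["USD"]

    for location in ("US East (N. Virginia)", "US West (Oregon)"):
        print(location, on_demand_price(location))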
It's the biggest region; if something breaks, it breaks in us-east-1.
Redshift changes roll out in us-east-1 after other regions though, so I imagine the root cause is something else.
Summary:
> Between 9:21 AM and 2:36 PM PDT we experienced increased query failures and latency in the US-EAST-1 Region. The issue has been resolved and the service is operating normally.
> The issue with the Data Catalog APIs started with a software update in the US-EAST-1 Region that completed at 9:21 AM PDT. The software update was immediately rolled back[...]
Thankfully the Redshift outage only hit the APIs, not existing clusters. Our cluster was fine today, but external schemas that rely on Glue/Athena did time out.
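That tracks with how Spectrum external schemas work: the table metadata lives in the Glue Data Catalog, so queries against them go through the same Data Catalog APIs that were failing. A quick way to sanity-check that dependency outside the cluster (boto3 sketch; "sales_lake" is a made-up database name):

    import boto3

    # Hit the Glue Data Catalog directly to see whether the metadata lookups
    # behind a Redshift Spectrum external schema are responding.
    glue = boto3.client("glue", region_name="us-east-1")
    try:
        tables = glue.get_tables(DatabaseName="sales_lake", MaxResults=10)
        print("Catalog reachable,", len(tables["TableList"]), "tables returned")
    except Exception as exc:  # timeouts here surface as external-schema query failures
        print("Glue Data Catalog call failed:", exc)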
Cloud services go down more often than my old WordPress sites. Avoiding vendor lock-in and doing multi-provider deployments should be par for the course.
Failovers, man.
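In that spirit, a sketch of the kind of failover people mean here: try the primary region and fall back to a secondary one when the control-plane call fails. The secondary region and the assumption that the catalog is replicated there are mine, not something AWS does for you:

    import boto3
    from botocore.exceptions import BotoCoreError, ClientError

    # Region failover wrapper for Glue Data Catalog reads. Assumes the
    # catalog has been replicated to the secondary region by some other
    # process; AWS does not replicate it automatically.
    REGIONS = ["us-east-1", "us-west-2"]

    def get_table_with_failover(database, table):
        last_error = None
        for region in REGIONS:
            glue = boto3.client("glue", region_name=region)
            try:
                return glue.get_table(DatabaseName=database, Name=table)["Table"]
            except (BotoCoreError, ClientError) as exc:
                last_error = exc  # fall through to the next region
        raise last_error

    # Example: get_table_with_failover("sales_lake", "orders")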