AWS S3 Having Problems Again? (Monday 12PM Pacific)
We're seeing similar problems to last night -- lots of 503s from S3. Anyone else?

Indeed. As I write this we're in the midst of our third S3 outage of the day. The past two were eventually documented on the AWS Service Dashboard. The latest one has not yet received its tiny status icon to indicate an outage.

It's one thing that S3 keeps going down today; we run our own server cluster and I accept that 100% uptime isn't possible. But it's aggravating that they can't at least figure out how to give timely updates on their dashboard when something is broken. We inevitably learn of S3 outages through our internal error reporting systems before AWS posts anything to their status page. When they do finally post, it is usually a tiny "information" icon, even when reporting a problem that makes the service unusable.

The laggy, misleading nature of their status page gives the impression they must be tying bonuses to the status icons. I can't fathom why else they would be so inept at keeping us updated when something is wrong. Surely they have sufficient internal monitoring to pick up on these outages long before they update their customers.

It shouldn't be this way, but I've found https://twitter.com/ylastic to be by far the best way to get early information on AWS issues.

From Amazon: "Hello, We have just become aware of EC2 network connectivity issues in the US-EAST-1 region. The impact of this issue is loss of network connectivity to EC2 instances in US-EAST-1. The AWS support and engineering teams are actively working on bringing closure to this issue. I will share additional information as soon as I learn more about this issue."

Yep, same. Lots of latency too -- here's what we're seeing: http://kuebri.ch/bucket/s3_latency_081015.png

> This webpage is not available

Heh. I'd wanted to put this on imgur instead of my micro instance, but I'm guessing imgur is backed by S3, because their uploads were broken :)

I'm risking being inflammatory here, but do people really believe that they get better uptime from AWS compared to renting dedicated servers? I feel like AWS has way too many moving parts to be stable. It's very tempting for them to reuse bits of infrastructure everywhere, which increases the chances that if something goes wrong somewhere it will break your stuff. For example, hosting instance images on S3 means that when S3 has issues, EC2 now has issues too.

S3's uptime is probably an order of magnitude better than I can provide myself. From 2010-2011 I worked with a medium-size MogileFS cluster, with about 160 million files and 50 TB of data, triple-replicated. We had great uptime until we ran out of I/O capacity on the metadata DB. Then we had a week of 10% downtime while we ordered new database servers from Dell.

As far as I remember, S3's US Standard region hasn't had a serious incident since Fall 2012. That's a pretty great uptime record in general, even though it's terribly frustrating on days like today. My current company spends a few hundred dollars a month on S3. I certainly couldn't match S3's uptime at that budget. Maybe I could do it at 2-3X the budget, but with that budget it might be easier to mirror my files across multiple S3 regions.

> My current company spends a few hundred dollars a month on S3. I certainly couldn't match S3's uptime at that budget. Maybe I could do it at 2-3X the budget, but with that budget it might be easier to mirror my files across multiple S3 regions.

Exactly.
Build multi-region support into your app(s), enable S3's cross-region replication so objects in your primary region are replicated to another region, and then handle loss of a region properly (go read-only, or write to another region and restore consistency later); a rough sketch of this appears further down. You'll still spend less than you would trying to maintain a highly durable object store yourself across multiple datacenters or geographic areas.

AWS is so massive that even when 0.1% of its customers are having problems, it's huge news like this. The reality is that most customers are not affected, and overall service uptime is the highest around. Not to mention that whenever AWS is having issues, it's always in one region at a time, and frequently a single availability zone. As long as you build your application to be AZ-tolerant, you won't run into problems.

> The reality is that most customers are not affected, and overall service uptime is the highest around.

Unfortunately it's really impossible to say in this case, since they don't release numbers. Informally, everyone I know with S3 buckets in US-Default had issues this morning.

> As long as you build your application to be AZ-tolerant, you won't run into problems.

What you say about multiple AZs is true for EC2, but many other AWS services (especially EBS-backed ones) tend to go down across the entire region. If you're serious about availability, you really need to be in multiple regions.

> As long as you build your application to be AZ-tolerant, you won't run into problems.

This is a total crock. On the 31st of July, our EC2 instances across two availability zones were shut down without warning. I waited three hours not being able to do anything. So far I've only got information from first-level support, and it's been "escalated".

S3 is already supposed to be AZ-tolerant. In the US Standard region, they say your data is actually replicated across multiple geographic locations in Virginia and the Pacific Northwest. That's probably true (your data may be highly durable), but availability incidents can apparently span both geographic locations.

> As long as you build your application to be AZ-tolerant, you won't run into problems.

And the perfect "out" for AWS, every single time they have issues.

I avoid almost all of these snafus simply by not putting anything in Virginia.

> I'm risking being inflammatory here, but do people really believe that they get better uptime from AWS compared to renting dedicated servers?

I haven't rented servers in about two years, but yes. My old MediaTemple dedicated servers, and the pair we had colocated at a local telco's large datacenter, experienced occasional network outages and whatnot just like AWS does. IMO, AWS made load balancing and fault-tolerant setups much more accessible to small businesses. At the time of our switch, getting a load balancer at our colo was quite pricey, whereas AWS charges $15ish a month for one. Getting a three-tier setup on AWS was easy, whereas at the colo we'd have had to pay for at least 6U of space even for relatively small amounts of traffic.

Former AWS employee here. I disagree. Amazon S3 is way more "complex" than you think. This is a talk I gave years ago; I guess most of the principles are still valid today: https://vimeo.com/7330740
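Picking up the multi-region suggestion above: here's a rough sketch of what that could look like, assuming boto3 (the thread doesn't name any tooling, so the library choice, bucket names, regions, and the replication-role ARN are all placeholders). Cross-region replication requires versioning to be enabled on both buckets.

    # Sketch only, not the commenter's actual setup. Names and ARNs are hypothetical.
    import boto3
    from botocore.exceptions import ClientError

    PRIMARY_REGION = "us-east-1"              # hypothetical primary (US Standard)
    FAILOVER_REGION = "us-west-2"             # hypothetical replica region
    PRIMARY_BUCKET = "my-app-assets"          # placeholder bucket names
    REPLICA_BUCKET = "my-app-assets-replica"

    def enable_replication(role_arn):
        """Replicate every object written to the primary bucket into the replica.
        Both buckets must already exist and have versioning enabled."""
        s3 = boto3.client("s3", region_name=PRIMARY_REGION)
        s3.put_bucket_replication(
            Bucket=PRIMARY_BUCKET,
            ReplicationConfiguration={
                "Role": role_arn,  # IAM role S3 assumes to copy the objects
                "Rules": [{
                    "ID": "replicate-everything",
                    "Prefix": "",  # empty prefix = all keys
                    "Status": "Enabled",
                    "Destination": {"Bucket": "arn:aws:s3:::" + REPLICA_BUCKET},
                }],
            },
        )

    def get_object_with_failover(key):
        """Read from the primary bucket; on a 5xx error, fall back to the replica.
        (A fuller version would also catch connection errors to the endpoint.)"""
        primary = boto3.client("s3", region_name=PRIMARY_REGION)
        replica = boto3.client("s3", region_name=FAILOVER_REGION)
        try:
            return primary.get_object(Bucket=PRIMARY_BUCKET, Key=key)["Body"].read()
        except ClientError as err:
            if err.response["ResponseMetadata"]["HTTPStatusCode"] >= 500:
                return replica.get_object(Bucket=REPLICA_BUCKET, Key=key)["Body"].read()
            raise

Reads fail over cheaply; the harder design choice, as the comment above notes, is what to do with writes while the primary region is down: go read-only, queue them, or write to the replica and reconcile once the primary recovers.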
In most cases, large cloud infrastructures have higher uptime than dedicated ones. I say "most" because if your IT is particularly good and well funded, then its uptime might be better.

It's not all about uptime. It's about upfront capital costs. It also depends heavily on the service and solution you build.
I know that in traditional EC2 space, over the long term you absolutely can self-host for far less money, but you are talking about 5-year TCO. Things like S3 bring another advantage: a globally distributed data store is HARD to build and operate on your own dime. I would argue things like that are much harder to beat self-hosted.

> I feel like AWS has way too many moving parts to be stable.

Well, everyone has different definitions of stable.

> I'm risking being inflammatory here, but do people really believe that they get better uptime from AWS compared to renting dedicated servers?

I don't think it is inflammatory, but perhaps you have an idealized notion of how well teams manage uptime, and more importantly failures, on average. Sure, a great team will do fantastically, but an average team will... not. That said, it seems likely that AWS is hitting some kind of rough patch right now. There are, however, other cloud services whose uptime records for the last year would be the envy of your typical company's internal hosting services, and I'd expect AWS to return to that fold shortly.

Uptime, probably. At best I'd get equal (no I wouldn't, but for the sake of discussion), but it'd cost me several orders of magnitude more in terms of redundancy, security infrastructure, etc.

Yes, your comment is unnecessarily inflammatory, in case you're wondering why I'm downvoting you. Nobody in this thread made any comments about their uptime expectations before you brought up the subject. There's no sign here of the "people" you are referring to.

It's inflammatory to ask a question that is no doubt in some people's minds at times like this?

It's not inflammatory to ask a question that people are no doubt thinking, even if the answer is obvious. Colo versus cloud is not a typical flamebait subject, but the way he worded it may as well have been trolling. For example, think of someone writing "Do people really believe that Vi is better than Emacs?"

Yeah, I feel that starting off one's comment by bluntly stating one's incredulity at relying on the leading player in the cloud hosting industry is a bit inflammatory. There are a dozen other ways to word the question.

Not to mention the "moving parts" are all closed source.

The us-east-1 region gets treated differently than all other regions by AWS. Part of the reason it gets treated differently is that it is the default, and hence the most popular. It also doesn't help that it is on the east coast and experiences more weather. For the above reasons, and because I work in the SF bay area, I put everything in us-west-2. us-west-2 sometimes has its own issues, but nothing quite at the level of us-east-1.

We're all just huddling in us-east-1 waiting for the day us-west-2 falls into the ocean...

us-east-1 data is replicated to another region on the west coast behind the scenes. It's supposed to fail over automatically. http://shlomoswidler.com/2009/12/read-after-write-consistenc...

> Aha! I had forgotten about the way Amazon defines its S3 regions. US-Standard has servers on both the east and west coasts (remember, this is S3 not EC2) in the same logical “region”. The engineering challenges in providing read-after-write consistency in a smaller geographical area are greatly magnified when that area is expanded.
> The fundamental physical limitation is the speed of light, which takes at least 16 milliseconds to cross the US coast-to-coast (that’s in a vacuum – it takes at least four times as long over the internet due to the latency introduced by routers and switches along the way).

IIRC, the AWS console itself is hosted out of us-east-1, which means you're always somewhat exposed to whatever failure modes it has.

This is no longer true.

East1 is absurdly huge, isn't it? I saw statements to the effect that it's nearly 60% of AWS's capacity.

"12:28 PM PDT Between 12:03 PM to 12:19 PM PDT we experienced elevated errors for requests made to Amazon S3 in the US-STANDARD Region. The issue has been resolved and the service is operating normally"

Our AWS TAM called us. I don't think he wanted the nasty call I gave him at 4:30am.

Amazon yet again lying to its customers about the status of the service is the only real issue I see here. Services fail; it's a fact of life. But at least admit it's broken and that the issue is being fixed, instead of blatantly lying and calling it a minor disruption.

http://status.aws.amazon.com appears to indicate that they did have problems and have now resolved them. We saw a short burst of 503s a short while ago, but we have not seen any since. Hopefully we do not see any more. Also, for the record, S3 has otherwise been very stable for us. We have been rather happy with AWS overall.

Same, though just as I write this we see another spike of errors.

OK, no 503s, but we just got a very small burst of 500s: "com.amazonaws.services.s3.model.AmazonS3Exception: Status Code: 500, AWS Service: Amazon S3, AWS Request ID: -redacted-, AWS Error Code: InternalError, AWS Error Message: We encountered an internal error. Please try again., S3 Extended Request ID: -redacted-" :/ (A retry sketch for errors like this appears further down.)

Looks like it got better around 12:20 PM, about 10 minutes after the incident started. We haven't seen any problems in the last few minutes. ... and errors started up again at 1:00 PM.

What happened to that 99.99% availability? Either way, this just got posted at reddit.com/r/sysadmin, which might be useful to some for tracking error rates: https://pulse.turbobytes.com/results/55c8751aecbe400bf80005f...

Their SLA guarantees 99.9% on a monthly basis. The 99.99% mentioned on the product page isn't guaranteed at all. As for what happened, my money is on this: https://aws.amazon.com/about-aws/whats-new/2015/08/amazon-s3...

> You can now increase your Amazon S3 bucket limit per AWS account... Amazon S3 now supports read-after-write consistency for new objects added to Amazon S3 in US Standard region.

The 100-bucket limit used to be an absolute, unchangeable hard limit - rare for AWS, and thus likely something deep in the architecture from S3 being one of their first services - so I suspect the lifting of that limit involved some fairly major changes to the backend.

They actually would let you increase that, but only up to a certain point and only if you specifically requested it. I don't see them mention the absolute ceiling being lifted, so that is probably still in place somewhere. I'd wager it's more likely the read-after-write change.

503s galore. Is anyone seeing issues in other S3 regions?

We had problems while connecting to the S3 US Standard region from us-east-1 at 19:00 UTC, but it was resolved 20 minutes later. Edit: seeing connectivity issues again at 19:50 UTC.

We're seeing similar symptoms here as well. We have also seen a higher rate of port scans/attacks today. I wonder if it is AWS-wide, causing system overload issues.
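About those 500 "InternalError ... Please try again" and 503 bursts reported above: errors in that class are meant to be retried. Here is a minimal retry-with-backoff sketch, again assuming boto3 (the Java stack trace above suggests that commenter was on the AWS SDK for Java; the bucket and function names here are made up, and the SDKs already do some retrying of their own):

    # Sketch only: retry transient S3 5xx errors with exponential backoff + jitter.
    import random
    import time

    import boto3
    from botocore.exceptions import ClientError

    TRANSIENT_CODES = {"InternalError", "ServiceUnavailable", "SlowDown", "503"}

    def get_with_backoff(bucket, key, max_attempts=5):
        """Fetch an object, sleeping 0.4s, 0.8s, 1.6s, ... (plus jitter) between
        retries of transient errors. Non-transient errors are raised immediately."""
        s3 = boto3.client("s3")
        for attempt in range(max_attempts):
            try:
                return s3.get_object(Bucket=bucket, Key=key)["Body"].read()
            except ClientError as err:
                code = err.response["Error"]["Code"]
                if code not in TRANSIENT_CODES or attempt == max_attempts - 1:
                    raise
                time.sleep(0.2 * (2 ** (attempt + 1)) + random.uniform(0, 0.25))

Retries paper over short bursts like the 12:03 PM to 12:19 PM window; they obviously don't help with a sustained regional outage, which is where the multi-region failover sketched earlier comes in.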
Interesting this article was bumped from the front page so quickly. Makes you wonder...

Just started again in us-east-1.

We had problems with AWS ML tonight.

Same here :(

Same for us.

Yes, seeing the same thing.

Seeing it too. 15:00 EST.

Yes, same here. Bad day.

Yep!