AWS S3 Having Problems Again? (Monday 12PM Pacific)

99 points by RyanGWU82 10 years ago · 58 comments · 1 min read


We're seeing similar problems to last night -- lots of 503s from S3. Anyone else?

wbharding 10 years ago

Indeed. As I write this we're in the midst of our third S3 outage of the day. The past two were eventually documented on the AWS Service Dashboard. The latest one has not yet received its tiny status icon to indicate an outage.

It's one thing that S3 keeps going down today; we run our own server cluster and I accept that 100% uptime isn't possible. But it's aggravating that they can't at least figure out how to give timely updates on their dashboard when something is broken.

We inevitably learn of S3 outages through our internal error reporting systems before AWS posts them to their status page. When they do finally post, it is usually a tiny "information" icon, even when reporting a problem that makes the service unusable. The laggy, misleading nature of their status page gives the impression they must be tying bonuses to the status icons. Can't fathom why else they would be so inept when it comes to keeping us updated when something is wrong. Surely they have sufficient internal monitoring to pick up on these outages long before they update their customers.

KenCochrane 10 years ago

From Amazon:

"Hello, We have just become aware of EC2 network connectivity issues in the US-EAST-1 region. The impact of this issue is loss of network connectivity to EC2 instances in US-EAST-1. The AWS support and engineering teams are actively working on bringing closure to this issue. I will share additional information as soon as I learn more about this issue."

dkuebric 10 years ago

Yep, same. Lots of latency too--here's what we're seeing: http://kuebri.ch/bucket/s3_latency_081015.png

Negitivefrags 10 years ago

I'm risking being inflammatory here, but do people really believe that they get better uptime from AWS compared to renting dedicated servers?

I feel like AWS has way too many moving parts to be stable.

It's very tempting for them to reuse bits of infrastructure everywhere which increases the chances that if something goes wrong somewhere it will break your stuff. So for example, hosting instance images on S3 means that when S3 has issues, now EC2 has issues.

  • RyanGWU82 (OP) 10 years ago

    S3's uptime is probably an order of magnitude better than I can provide myself. From 2010-2011 I worked with a medium-size MogileFS cluster, with about 160 million files and 50 TB of data, triple-replicated. We had great uptime until we ran out of I/O capacity on the metadata DB. Then we had a week of 10% downtime, while we ordered new database servers from Dell.

    As far as I remember, S3's US Standard region hasn't had a serious incident since Fall 2012. That's a pretty great uptime record in general, even though it's terribly frustrating on days like today.

    My current company spends a few hundred dollars a month on S3. I certainly couldn't match S3's uptime at that budget. Maybe I could do it at 2-3X the budget, but with that budget it might be easier to mirror my files across multiple S3 regions.

    • toomuchtodo 10 years ago

      > My current company spends a few hundred dollars a month on S3. I certainly couldn't match S3's uptime at that budget. Maybe I could do it at 2-3X the budget, but with that budget it might be easier to mirror my files across multiple S3 regions.

      Exactly. Build multi-region support into your app(s), enable S3's replication so objects in your primary region are replicated to another region, and then properly handle loss of a region (go read-only or write to another region and restore consistency later).

      You'll still be spending less than attempting to maintain a highly durable object store yourself across multiple datacenters or geographic areas.
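A minimal sketch of the "properly handle loss of a region" idea for reads, assuming each region is wrapped in a fetch function that raises on failure. `RegionUnavailable`, `read_with_fallback`, and the two stand-in readers are hypothetical names for illustration, not part of any AWS SDK; a real version would wrap actual S3 client calls.

```python
from typing import Callable, Sequence, Tuple


class RegionUnavailable(Exception):
    """Raised by a reader when its region cannot serve the request (hypothetical)."""


# A reader is a (region name, fetch function) pair.
Reader = Tuple[str, Callable[[str], bytes]]


def read_with_fallback(key: str, readers: Sequence[Reader]) -> Tuple[str, bytes]:
    """Try each region in order and return (region, body) from the first success."""
    errors = []
    for region, fetch in readers:
        try:
            return region, fetch(key)
        except RegionUnavailable as exc:
            errors.append(f"{region}: {exc}")
    raise RegionUnavailable("all regions failed: " + "; ".join(errors))


# Stand-ins for real per-region clients (assumptions for the example).
def failing_primary(key: str) -> bytes:
    raise RegionUnavailable("503 Service Unavailable")


def healthy_replica(key: str) -> bytes:
    return b"replica copy of " + key.encode()


region, body = read_with_fallback(
    "logo.png",
    [("us-east-1", failing_primary), ("us-west-2", healthy_replica)],
)
```

Writes are the harder half: going read-only or writing to the secondary region and reconciling later is an application-level decision this sketch does not cover.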

  • deanCommie 10 years ago

    AWS is so massive that even when 0.1% of the customers are having problems, it is huge news like this.

    The reality is most customers are not affected, and overall service uptime is highest anywhere around.

    Not to mention that whenever AWS is having issues it's always in one region at a time, and frequently a single availability zone. As long as you build your application to be AZ-tolerant, you won't run into problems.

    • mnutt 10 years ago

      > The reality is most customers are not affected, and overall service uptime is highest anywhere around.

      Unfortunately it's really impossible to say in this case, since they don't release numbers. Informally everyone I know with S3 buckets in US-Default had issues this morning.

      > As long as you build your application to be AZ-tolerant, you won't run into problems.

      What you say about multiple AZs is true for EC2, but many other AWS services (especially EBS-backed ones) tend to go down across the entire region. If you're serious about availability, you really need to be in multiple regions.

    • anko 10 years ago

      > As long as you build your application to be AZ-tolerant, you won't run into problems.

      This is a total crock. On the 31st of July, our EC2 instances across 2 availability zones were shut down without warning. I waited 3 hours not being able to do anything. So far I've only got info from first level support and it's been "escalated".

    • RyanGWU82 (OP) 10 years ago

      S3 is already supposed to be AZ-tolerant. In the US Standard region, they say your data is actually replicated across multiple geographic locations in Virginia and the Pacific Northwest. That's probably true—your data may be highly durable—but availability incidents can apparently span both geographic regions.

    • kordless 10 years ago

      > As long as you build your application to be AZ-tolerant, you won't run into problems.

      And the perfect "out" for AWS, every single time they have issues.

    • juliangregorian 10 years ago

      I avoid almost all of these snafus simply by not putting anything in Virginia.

  • ceejayoz 10 years ago

    > I'm risking being inflammatory here, but do people really believe that they get better uptime from AWS compared to renting dedicated servers?

    I haven't rented servers in about two years, but yes. My old MediaTemple dedicated servers and the pair we had colocated at a local telco's large datacenter experienced occasional network outages and whatnot just like AWS does.

    IMO, AWS made load balancing and fault-tolerant setups much more accessible to small businesses. At the time of our switch, getting a load balancer at our colo was quite pricey whereas AWS charges $15ish a month for it. Getting a three-tier setup on AWS was easy, whereas at the colo we'd have to pay for at least 6U of space even for relatively small amounts of traffic.

  • simonebrunozzi 10 years ago

    Former AWS employee here. I disagree. Amazon S3 is way more "complex" than you think. This is a talk I gave years ago, I guess most of the principles are still valid today: https://vimeo.com/7330740 In most cases, large cloud infrastructures have a higher uptime than a dedicated one. I said "most", because if your IT is particularly good and well funded, then their uptime might be better.

  • spydum 10 years ago

    It's not all about uptime. It's about upfront capital costs.

    It also highly depends on the service and solution you build. I know in traditional EC2 space, over the long term you absolutely can self-host for far less money, but you are talking about 5-year TCO.

    Things like S3 bring another advantage: a globally distributed data store is HARD to build and supply on your own dime. I would argue things like that are much harder to beat self-hosted.

  • cbsmith 10 years ago

    > I feel like AWS has way too many moving parts to be stable.

    Well, everyone has different definitions of stable.

    > I'm risking being inflammatory here, but do people really believe that they get better uptime from AWS compared to renting dedicated servers?

    I don't think it is inflammatory, but perhaps you have an idealized notion of how well teams manage uptime, and more importantly, failures, on average. Sure, a great team will do fantastic, but an average team will... not.

    That said, it seems likely that AWS is hitting some kind of a rough patch right now. There are, however, other cloud services whose uptime records for the last year would be the envy of your typical company's internal hosting services, and I'd expect AWS to return to that fold shortly.

  • bdcravens 10 years ago

    Uptime, probably. At best, I'd get equal (no I wouldn't, but for sake of discussion), but it'd cost me several orders of magnitude more, in terms of redundancy, security infrastructure, etc

  • akamaka 10 years ago

    Yes, your comment is unnecessarily inflammatory, in case you're wondering why I'm downvoting you.

    Nobody in this thread has made any comments about their uptime expectations before you brought up the subject. There's no sign here of the "people" you are referring to.

    • cbsmith 10 years ago

      It's inflammatory to ask a question that is no doubt on some people's minds at times like this?

      It's not inflammatory to ask a question that people are no doubt thinking, even if the answer is obvious.

      • akamaka 10 years ago

        Colo versus cloud is not a typical flamebait subject, but the way he worded it may as well have been trolling.

        For example, think of someone writing "Do people really believe that Vi is better than Emacs?".

        Yeah, I feel that starting off one's comment by bluntly stating one's incredulity at relying on the leading player in the cloud hosting industry is a bit inflammatory. There are a dozen other ways to word the question.

  • kordless 10 years ago

    Not to mention the 'moving parts' are all closed source.

edgan 10 years ago

The us-east-1 region gets treated differently than all other regions by AWS. Part of the reason it gets treated differently is that it is the default, and hence the most popular. It also doesn't help that it is on the east coast, and experiences more weather.

For the above reasons, and because I work in the SF bay area, I put everything in us-west-2. us-west-2 sometimes has its own issues, but nothing quite at the level of us-east-1.

  • numbsafari 10 years ago

    We're all just huddling in us-east-1 waiting for the day us-west-2 falls into the ocean...

  • toomuchtodo 10 years ago

    us-east-1 data is replicated to another region on the west coast behind the scenes. It's supposed to fail over automatically.

    http://shlomoswidler.com/2009/12/read-after-write-consistenc...

    > Aha! I had forgotten about the way Amazon defines its S3 regions. US-Standard has servers on both the east and west coasts (remember, this is S3 not EC2) in the same logical “region”. The engineering challenges in providing read-after-write consistency in a smaller geographical area are greatly magnified when that area is expanded. The fundamental physical limitation is the speed of light, which takes at least 16 milliseconds to cross the US coast-to-coast (that’s in a vacuum – it takes at least four times as long over the internet due to the latency introduced by routers and switches along the way).
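The 16 ms figure in the quote can be checked with quick arithmetic. This is a rough sketch: the 4,800 km coast-to-coast distance is an assumed round number (about 3,000 miles), and the 4x multiplier is the quote's own "at least four times" estimate, not a measurement.

```python
# Speed of light in a vacuum, in kilometres per second.
SPEED_OF_LIGHT_KM_PER_S = 299_792.458

# Assumed straight-line distance across the US (roughly 3,000 miles).
COAST_TO_COAST_KM = 4_800

# One-way travel time in milliseconds for light in a vacuum.
one_way_ms = COAST_TO_COAST_KM / SPEED_OF_LIGHT_KM_PER_S * 1000

# The quoted post estimates real networks are at least four times slower.
internet_one_way_ms = 4 * one_way_ms

print(f"vacuum: {one_way_ms:.1f} ms one way")
print(f"internet estimate: at least {internet_one_way_ms:.1f} ms one way")
```

Under these assumptions the vacuum figure comes out at about 16 ms, and a synchronous write acknowledged from the far coast would pay at least a round trip on top of that.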

  • mdellabitta 10 years ago

    IIRC, the AWS console itself is hosted out of us-east-1. Which means you're always somewhat exposed to whatever failure modes it has.

  • AlphaSite 10 years ago

    East1 is absurdly huge, isn't it? I saw statements to the effect that it's nearly 60% of AWS's capacity.

thspimpolds 10 years ago

"12:28 PM PDT Between 12:03 PM to 12:19 PM PDT we experienced elevated errors for requests made to Amazon S3 in the US-STANDARD Region. The issue has been resolved and the service is operating normally"

Our AWS TAM called us. I don't think he wanted the nasty call I gave him at 4:30am

atopuzov 10 years ago

Amazon yet again lying to its customers about the status of the service is the only real issue I see here. Services fail, it's a fact of life, but at least admit it's broken and that the issue is being fixed instead of blatantly lying and saying minor disruptions.

bhz 10 years ago

We saw a short burst of 503s a short while ago, but we have not seen any since. Hopefully we do not see any more though.

Also, for the record, S3 has been very stable for us otherwise. We have been rather happy with AWS overall.

  • onyxraven 10 years ago

    Same, though just as I write this we see another spike of errors.

    • bhz 10 years ago

      Ok, no 503s but just got a very small burst of 500s:

      "com.amazonaws.services.s3.model.AmazonS3Exception: Status Code: 500, AWS Service: Amazon S3, AWS Request ID: -redacted-, AWS Error Code: InternalError, AWS Error Message: We encountered an internal error. Please try again., S3 Extended Request ID: -redacted-"

      :/
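Since the error message itself says "Please try again", here is a minimal sketch of retrying 500 and 503 responses with exponential backoff and jitter. `TransientError`, `retry_with_backoff`, and `flaky_get` are hypothetical names for illustration; the official AWS SDKs ship their own retry logic, and this is not a reproduction of it.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")

# Status codes treated as retryable in this sketch.
RETRYABLE_STATUSES = {500, 503}


class TransientError(Exception):
    """Hypothetical wrapper for an HTTP error returned by the storage service."""

    def __init__(self, status: int):
        super().__init__(f"HTTP {status}")
        self.status = status


def retry_with_backoff(
    operation: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 0.1,
    max_delay: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call operation(), retrying retryable errors with capped, jittered backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return operation()
        except TransientError as exc:
            last_attempt = attempt == max_attempts - 1
            if exc.status not in RETRYABLE_STATUSES or last_attempt:
                raise
            # Full jitter: wait a random time up to the capped exponential delay.
            cap = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, cap))
    raise AssertionError("unreachable")


# Example: an operation that fails twice with a 500, then succeeds.
calls = {"count": 0}


def flaky_get() -> str:
    calls["count"] += 1
    if calls["count"] < 3:
        raise TransientError(500)
    return "object body"


slept = []  # record delays instead of actually sleeping
result = retry_with_backoff(flaky_get, sleep=slept.append)
```

The jitter matters during an incident like this one: if every client retries on the same schedule, the retries themselves arrive in synchronized waves.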

RyanGWU82 (OP) 10 years ago

Looks like it got better around 12:20 PM, about 10 minutes after the incident started. We haven't seen any problems in the last few minutes.

autotune 10 years ago

What happened to that 99.99% availability? Either way this just got posted at reddit.com/r/sysadmin which might be useful to some for tracking error rate: https://pulse.turbobytes.com/results/55c8751aecbe400bf80005f...

  • ceejayoz 10 years ago

    Their SLA guarantees 99.9% on a monthly basis. The 99.99% mentioned on the product page isn't guaranteed at all.

    As for what happened, my money is on this: https://aws.amazon.com/about-aws/whats-new/2015/08/amazon-s3...

    > You can now increase your Amazon S3 bucket limit per AWS account... Amazon S3 now supports read-after-write consistency for new objects added to Amazon S3 in US Standard region.

    The 100 bucket limit used to be an absolute, unchangeable hard limit - rare for AWS and thus likely something deep in the architecture from S3 being one of their first services - so I suspect the lifting of that limit involved some fairly major changes to the backend.
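To put the 99.9% and 99.99% figures in perspective, here is a back-of-the-envelope conversion to minutes per month. It assumes a 30-day month and treats availability as plain uptime, which is a simplification; the actual SLA defines its own measurement terms. The 16-minute figure is the 12:03 to 12:19 PM PDT window quoted elsewhere in this thread, and counting it as full downtime is also an assumption.

```python
MINUTES_PER_DAY = 24 * 60


def allowed_downtime_minutes(availability_percent: float, days: int = 30) -> float:
    """Minutes of downtime per period that still meet the availability target."""
    total_minutes = days * MINUTES_PER_DAY
    return total_minutes * (1 - availability_percent / 100)


sla_budget = allowed_downtime_minutes(99.9)         # the guaranteed figure
marketing_budget = allowed_downtime_minutes(99.99)  # the product-page figure

# Error window reported on the status dashboard, treated here as full downtime.
incident_minutes = 16

print(f"99.9%  allows about {sla_budget:.1f} minutes per 30 days")
print(f"99.99% allows about {marketing_budget:.2f} minutes per 30 days")
print("within 99.9% budget:", incident_minutes <= sla_budget)
print("within 99.99% budget:", incident_minutes <= marketing_budget)
```

Under these assumptions a single 16-minute incident fits inside the guaranteed 99.9% but not the advertised 99.99%, and several such incidents in one month could exceed both.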

    • StabbyCutyou 10 years ago

      They actually would let you increase that, but only up to a certain point and only if you specifically requested it. I don't see them mention the absolute ceiling being lifted, so that is probably still in place somewhere.

      I'd wager it's more likely that read-after-write change.

toomuchtodo 10 years ago

503s galore. Is anyone seeing issues in other S3 regions?

arturhoo 10 years ago

We had problems while connecting to S3 standard US region from us-east-1 at 19UTC but it was solved 20 minutes later.

edit: seeing connectivity issues again at 19h50UTC

azundo 10 years ago

We're seeing similar symptoms here as well.

matwood 10 years ago

We have also seen a higher rate of port scans/attacks today. I wonder if it is AWS-wide, causing system overload issues.

kordless 10 years ago

Interesting this article was bumped from the front page so quickly. Makes you wonder...

needcaffeine 10 years ago

Just started again in us-east-1.

andrebrov 10 years ago

We had problems with AWS ML tonight

mstkrft 10 years ago

Same here :(

kernel_sanders 10 years ago

Same for us

Stovoy 10 years ago

Yes, seeing the same thing.

AnonNo15 10 years ago

Seeing it too. 15:00 EST

ronreiter 10 years ago

Yes, same here.

ninjakeyboard 10 years ago

bad day.

mej10 10 years ago

Yep!
