AWS S3 Having Problems Again? (Monday 12PM Pacific)
We're seeing similar problems to last night -- lots of 503s from S3. Anyone else?

Indeed. As I write this we're in the midst of our third S3 outage of the day. The past two were eventually documented on the AWS Service Dashboard. The latest one has not yet received its tiny status icon to indicate an outage.

It's one thing that S3 keeps going down today; we run our own server cluster and I accept that 100% uptime isn't possible. But it's aggravating that they can't at least figure out how to give timely updates on their dashboard when something is broken. We inevitably learn of S3 outages through our internal error reporting systems before AWS posts anything to their status page. When they do finally post, it is usually a tiny "information" icon, even when reporting a problem that makes the service unusable.

The laggy, misleading nature of their status page gives the impression they must be tying bonuses to the status icons. I can't fathom why else they would be so inept at keeping us updated when something is wrong. Surely they have sufficient internal monitoring to pick up on these outages long before they update their customers.

It shouldn't be this way, but I've found https://twitter.com/ylastic to be by far the best way to get early information on AWS issues.

From Amazon: "Hello, We have just become aware of EC2 network connectivity issues in the US-EAST-1 region. The impact of this issue is loss of network connectivity to EC2 instances in US-EAST-1. The AWS support and engineering teams are actively working on bringing closure to this issue. I will share additional information as soon as I learn more about this issue."

Yep, same. Lots of latency too -- here's what we're seeing: http://kuebri.ch/bucket/s3_latency_081015.png

> This webpage is not available

Heh. I'd wanted to put this on imgur instead of my micro instance, but I'm guessing imgur is backed by S3, because their uploads were broken :)

I'm risking being inflammatory here, but do people really believe that they get better uptime from AWS compared to renting dedicated servers? I feel like AWS has way too many moving parts to be stable. It's very tempting for them to reuse bits of infrastructure everywhere, which increases the chances that if something goes wrong somewhere it will break your stuff. For example, hosting instance images on S3 means that when S3 has issues, EC2 now has issues too.

S3's uptime is probably an order of magnitude better than I can provide myself. From 2010-2011 I worked with a medium-size MogileFS cluster, with about 160 million files and 50 TB of data, triple-replicated. We had great uptime until we ran out of I/O capacity on the metadata DB. Then we had a week of 10% downtime while we ordered new database servers from Dell.

As far as I remember, S3's US Standard region hasn't had a serious incident since Fall 2012. That's a pretty great uptime record in general, even though it's terribly frustrating on days like today. My current company spends a few hundred dollars a month on S3. I certainly couldn't match S3's uptime at that budget. Maybe I could do it at 2-3X the budget, but with that budget it might be easier to mirror my files across multiple S3 regions.

> My current company spends a few hundred dollars a month on S3. I certainly couldn't match S3's uptime at that budget. Maybe I could do it at 2-3X the budget, but with that budget it might be easier to mirror my files across multiple S3 regions.

Exactly.
Build multi-region support into your app(s), enable S3's cross-region replication so objects in your primary region are replicated to another region, and then handle loss of a region properly (go read-only, or write to another region and restore consistency later); a rough sketch of this appears further down. You'll still spend less than you would trying to maintain a highly durable object store yourself across multiple datacenters or geographic areas.

AWS is so massive that even when 0.1% of its customers are having problems, it's huge news like this. The reality is that most customers are not affected, and overall service uptime is the highest around. Not to mention that whenever AWS is having issues, it's always in one region at a time, and frequently a single availability zone. As long as you build your application to be AZ-tolerant, you won't run into problems.

> The reality is that most customers are not affected, and overall service uptime is the highest around.

Unfortunately it's really impossible to say in this case, since they don't release numbers. Informally, everyone I know with S3 buckets in US-Default had issues this morning.

> As long as you build your application to be AZ-tolerant, you won't run into problems.

What you say about multiple AZs is true for EC2, but many other AWS services (especially EBS-backed ones) tend to go down across the entire region. If you're serious about availability, you really need to be in multiple regions.

> As long as you build your application to be AZ-tolerant, you won't run into problems.

This is a total crock. On the 31st of July, our EC2 instances across two availability zones were shut down without warning. I waited three hours not being able to do anything. So far I've only got information from first-level support, and it's been "escalated".

S3 is already supposed to be AZ-tolerant. In the US Standard region, they say your data is actually replicated across multiple geographic locations in Virginia and the Pacific Northwest. That's probably true (your data may be highly durable), but availability incidents can apparently span both geographic locations.

> As long as you build your application to be AZ-tolerant, you won't run into problems.

And the perfect "out" for AWS, every single time they have issues.

I avoid almost all of these snafus simply by not putting anything in Virginia.

> I'm risking being inflammatory here, but do people really believe that they get better uptime from AWS compared to renting dedicated servers?

I haven't rented servers in about two years, but yes. My old MediaTemple dedicated servers, and the pair we had colocated at a local telco's large datacenter, experienced occasional network outages and whatnot just like AWS does. IMO, AWS made load balancing and fault-tolerant setups much more accessible to small businesses. At the time of our switch, getting a load balancer at our colo was quite pricey, whereas AWS charges $15ish a month for one. Getting a three-tier setup on AWS was easy, whereas at the colo we'd have had to pay for at least 6U of space even for relatively small amounts of traffic.

Former AWS employee here. I disagree. Amazon S3 is way more "complex" than you think. This is a talk I gave years ago; I guess most of the principles are still valid today: https://vimeo.com/7330740
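Picking up the multi-region suggestion above: here's a rough sketch of what that could look like, assuming boto3 (the thread doesn't name any tooling, so the library choice, bucket names, regions, and the replication-role ARN are all placeholders). Cross-region replication requires versioning to be enabled on both buckets.

    # Sketch only, not the commenter's actual setup. Names and ARNs are hypothetical.
    import boto3
    from botocore.exceptions import ClientError

    PRIMARY_REGION = "us-east-1"              # hypothetical primary (US Standard)
    FAILOVER_REGION = "us-west-2"             # hypothetical replica region
    PRIMARY_BUCKET = "my-app-assets"          # placeholder bucket names
    REPLICA_BUCKET = "my-app-assets-replica"

    def enable_replication(role_arn):
        """Replicate every object written to the primary bucket into the replica.
        Both buckets must already exist and have versioning enabled."""
        s3 = boto3.client("s3", region_name=PRIMARY_REGION)
        s3.put_bucket_replication(
            Bucket=PRIMARY_BUCKET,
            ReplicationConfiguration={
                "Role": role_arn,  # IAM role S3 assumes to copy the objects
                "Rules": [{
                    "ID": "replicate-everything",
                    "Prefix": "",  # empty prefix = all keys
                    "Status": "Enabled",
                    "Destination": {"Bucket": "arn:aws:s3:::" + REPLICA_BUCKET},
                }],
            },
        )

    def get_object_with_failover(key):
        """Read from the primary bucket; on a 5xx error, fall back to the replica.
        (A fuller version would also catch connection errors to the endpoint.)"""
        primary = boto3.client("s3", region_name=PRIMARY_REGION)
        replica = boto3.client("s3", region_name=FAILOVER_REGION)
        try:
            return primary.get_object(Bucket=PRIMARY_BUCKET, Key=key)["Body"].read()
        except ClientError as err:
            if err.response["ResponseMetadata"]["HTTPStatusCode"] >= 500:
                return replica.get_object(Bucket=REPLICA_BUCKET, Key=key)["Body"].read()
            raise

Reads fail over cheaply; the harder design choice, as the comment above notes, is what to do with writes while the primary region is down: go read-only, queue them, or write to the replica and reconcile once the primary recovers.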
In most cases, large cloud infrastructures have higher uptime than dedicated ones. I say "most" because if your IT is particularly good and well funded, then its uptime might be better.

It's not all about uptime. It's about upfront capital costs. It also depends heavily on the service and solution you build.
I know that in traditional EC2 space, over the long term you absolutely can self-host for far less money, but you are talking about 5-year TCO. Things like S3 bring another advantage: a globally distributed data store is HARD to build and operate on your own dime. I would argue things like that are much harder to beat self-hosted.

> I feel like AWS has way too many moving parts to be stable.

Well, everyone has different definitions of stable.

> I'm risking being inflammatory here, but do people really believe that they get better uptime from AWS compared to renting dedicated servers?

I don't think it is inflammatory, but perhaps you have an idealized notion of how well teams manage uptime, and more importantly failures, on average. Sure, a great team will do fantastically, but an average team will... not. That said, it seems likely that AWS is hitting some kind of rough patch right now. There are, however, other cloud services whose uptime records for the last year would be the envy of your typical company's internal hosting services, and I'd expect AWS to return to that fold shortly.

Uptime, probably. At best I'd get equal (no I wouldn't, but for the sake of discussion), but it'd cost me several orders of magnitude more in terms of redundancy, security infrastructure, etc.

Yes, your comment is unnecessarily inflammatory, in case you're wondering why I'm downvoting you. Nobody in this thread made any comments about their uptime expectations before you brought up the subject. There's no sign here of the "people" you are referring to.

It's inflammatory to ask a question that is no doubt in some people's minds at times like this?

It's not inflammatory to ask a question that people are no doubt thinking, even if the answer is obvious. Colo versus cloud is not a typical flamebait subject, but the way he worded it may as well have been trolling. For example, think of someone writing "Do people really believe that Vi is better than Emacs?"

Yeah, I feel that starting off one's comment by bluntly stating one's incredulity at relying on the leading player in the cloud hosting industry is a bit inflammatory. There are a dozen other ways to word the question.

Not to mention the "moving parts" are all closed source.

The us-east-1 region gets treated differently than all other regions by AWS. Part of the reason it gets treated differently is that it is the default, and hence the most popular. It also doesn't help that it is on the east coast and experiences more weather. For the above reasons, and because I work in the SF bay area, I put everything in us-west-2. us-west-2 sometimes has its own issues, but nothing quite at the level of us-east-1.

We're all just huddling in us-east-1 waiting for the day us-west-2 falls into the ocean...

us-east-1 data is replicated to another region on the west coast behind the scenes. It's supposed to fail over automatically. http://shlomoswidler.com/2009/12/read-after-write-consistenc...

> Aha! I had forgotten about the way Amazon defines its S3 regions. US-Standard has servers on both the east and west coasts (remember, this is S3 not EC2) in the same logical “region”. The engineering challenges in providing read-after-write consistency in a smaller geographical area are greatly magnified when that area is expanded.
> The fundamental physical limitation is the speed of light, which takes at least 16 milliseconds to cross the US coast-to-coast (that’s in a vacuum – it takes at least four times as long over the internet due to the latency introduced by routers and switches along the way).

IIRC, the AWS console itself is hosted out of us-east-1, which means you're always somewhat exposed to whatever failure modes it has.

This is no longer true.

East1 is absurdly huge, isn't it? I saw statements to the effect that it's nearly 60% of AWS's capacity.

"12:28 PM PDT Between 12:03 PM to 12:19 PM PDT we experienced elevated errors for requests made to Amazon S3 in the US-STANDARD Region. The issue has been resolved and the service is operating normally"

Our AWS TAM called us. I don't think he wanted the nasty call I gave him at 4:30am.

Amazon yet again lying to its customers about the status of the service is the only real issue I see here. Services fail; it's a fact of life. But at least admit it's broken and that the issue is being fixed, instead of blatantly lying and calling it a minor disruption.

http://status.aws.amazon.com appears to indicate that they did have problems and have now resolved them. We saw a short burst of 503s a short while ago, but we have not seen any since. Hopefully we do not see any more. Also, for the record, S3 has otherwise been very stable for us. We have been rather happy with AWS overall.

Same, though just as I write this we see another spike of errors.

OK, no 503s, but we just got a very small burst of 500s: "com.amazonaws.services.s3.model.AmazonS3Exception: Status Code: 500, AWS Service: Amazon S3, AWS Request ID: -redacted-, AWS Error Code: InternalError, AWS Error Message: We encountered an internal error. Please try again., S3 Extended Request ID: -redacted-" :/ (A retry sketch for errors like this appears further down.)

Looks like it got better around 12:20 PM, about 10 minutes after the incident started. We haven't seen any problems in the last few minutes. ... and errors started up again at 1:00 PM.

What happened to that 99.99% availability? Either way, this just got posted at reddit.com/r/sysadmin, which might be useful to some for tracking error rates: https://pulse.turbobytes.com/results/55c8751aecbe400bf80005f...

Their SLA guarantees 99.9% on a monthly basis. The 99.99% mentioned on the product page isn't guaranteed at all. As for what happened, my money is on this: https://aws.amazon.com/about-aws/whats-new/2015/08/amazon-s3...

> You can now increase your Amazon S3 bucket limit per AWS account... Amazon S3 now supports read-after-write consistency for new objects added to Amazon S3 in US Standard region.

The 100-bucket limit used to be an absolute, unchangeable hard limit - rare for AWS, and thus likely something deep in the architecture from S3 being one of their first services - so I suspect the lifting of that limit involved some fairly major changes to the backend.

They actually would let you increase that, but only up to a certain point and only if you specifically requested it. I don't see them mention the absolute ceiling being lifted, so that is probably still in place somewhere. I'd wager it's more likely the read-after-write change.

503s galore. Is anyone seeing issues in other S3 regions?

We had problems while connecting to the S3 US Standard region from us-east-1 at 19:00 UTC, but it was resolved 20 minutes later. Edit: seeing connectivity issues again at 19:50 UTC.

We're seeing similar symptoms here as well. We have also seen a higher rate of port scans/attacks today. I wonder if it is AWS-wide, causing system overload issues.
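About those 500 "InternalError ... Please try again" and 503 bursts reported above: errors in that class are meant to be retried. Here is a minimal retry-with-backoff sketch, again assuming boto3 (the Java stack trace above suggests that commenter was on the AWS SDK for Java; the bucket and function names here are made up, and the SDKs already do some retrying of their own):

    # Sketch only: retry transient S3 5xx errors with exponential backoff + jitter.
    import random
    import time

    import boto3
    from botocore.exceptions import ClientError

    TRANSIENT_CODES = {"InternalError", "ServiceUnavailable", "SlowDown", "503"}

    def get_with_backoff(bucket, key, max_attempts=5):
        """Fetch an object, sleeping 0.4s, 0.8s, 1.6s, ... (plus jitter) between
        retries of transient errors. Non-transient errors are raised immediately."""
        s3 = boto3.client("s3")
        for attempt in range(max_attempts):
            try:
                return s3.get_object(Bucket=bucket, Key=key)["Body"].read()
            except ClientError as err:
                code = err.response["Error"]["Code"]
                if code not in TRANSIENT_CODES or attempt == max_attempts - 1:
                    raise
                time.sleep(0.2 * (2 ** (attempt + 1)) + random.uniform(0, 0.25))

Retries paper over short bursts like the 12:03 PM to 12:19 PM window; they obviously don't help with a sustained regional outage, which is where the multi-region failover sketched earlier comes in.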
Interesting this article was bumped from the front page so quickly. Makes you wonder...

Just started again in us-east-1.

We had problems with AWS ML tonight.

Same here :(

Same for us.

Yes, seeing the same thing.

Seeing it too. 15:00 EST.

Yes, same here. Bad day.

Yep!