Heroku is down for the third time today
status.heroku.com
I use Heroku for subscription software services, online retail stores, and a phone ordering system for our staff.
Right now all of our sites are failing with 503 errors. Our store is down and when one of our employees went to take a phone order they got a "Welcome to your new app" message.
I've been a big evangelist of Heroku since we migrated over last year, but I'm getting deeply concerned about the elevated error rate since every minute is costing us money.
Does Heroku have an SLA? (I could not find it)
At some point they're exposing themselves to serious risk. Rackspace had to pay out ~$3MM (in free service credits) after an outage in 2009:
http://www.networkworld.com/news/2009/070609-rackspace-outag...
This is offtopic, but what's MM? What's the second M for?
M = 1000 in Roman numerals, but the confusing bit is not reading them like Roman numerals (2000) but rather interpreting them as one thousand thousands, or one million.
Hmm, so who uses this? How is it better than just saying $3M?
MM is the standard abbreviation for million in the financial world. "The great thing about standards is that there are so many to choose from."
This came up in a thread a while back (http://news.ycombinator.com/item?id=1483667). Bottom line is that "MM" comes from the banking/finance world. In banking, $3M actually means $3,000 and $3MM means $3,000,000.
I see, thank you. I prefer the SI, kilodollars, megadollars, etc.
My question to you is, could you do better and how much would it cost? If you didn't use Heroku or another cloud provider you would pay a lot more up front to get your applications running. When things go wrong you would have to fix them, which means paying technical staff to be on call. Since you and your company are likely experts in your domain and not in infrastructure, any infrastructure that you built would likely have more downtime than Heroku. You have to debit the cost of Heroku's downtime from the cost of building your own infrastructure.
Disclaimer: I'm not affiliated with Heroku and I don't use their service.
> You have to debit the cost of Heroku's downtime from the cost of building your own infrastructure.
That's silly, and also not how it works, at all. You're paying PaaS/IaaS companies so that it's their headache, not yours. Once it becomes your headache, they are no longer doing their job, and you are no longer receiving the value you are paying for. You don't debit the cost of their downtime from the cost of building your own infrastructure; you debit the cost of their downtime from your business' revenue and reputation.
Whether or not you could do it better yourself does not excuse the downtime one bit.
Excusing or not excusing are irrelevant.
If you stop using Heroku and manage your own infrastructure, you need to take all relevant costs into account.
Of course, finding an alternative provider of the same (or similar) services is also an option.
Isn't it commonly recognized that it's cheaper to run your own hardware than to pay a cloud provider? It just requires more capital outlay and maintenance.
We're actively sending pay per click traffic to our online store and it's very easy to spend hundreds of dollars. When our traffic converts it's great but it pains me to think I could be sending traffic to a Heroku 503 error page and have zero control over it.
So, "could we do better"? I'm not sure. I'm trying to figure that out. It certainly would not be as easy to use or as easy to deploy as Heroku. But at a minimum I need to get some other host option set up that we can switch over to.
Is there a service that will switch off adwords campaigns if your site is down or in maintenance mode? If not, there should be.
Thought about writing an app to do that. Unfortunately I would not be able to host it on Heroku. :-)
I've had clients with sloppy dev teams who decided to change the URL structure of all landing pages without letting me know (I was managing their PPC campaigns). Google stops serving ads after getting 404 errors - unfortunately I don't think they count other errors (like a 503) and they don't stop until they've sent a few hundred (or thousand) clicks.
Would people pay for a good solution here, like $10 a month, or a % of money saved? I imagine that with a good implementation they might.
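For what it's worth, the detection half of such a service is small. Below is a minimal Python sketch, assuming you poll a landing page yourself and treat any 4xx/5xx response (including the 503s mentioned above) or a failed connection as "down". The names `fetch_status` and `should_pause_ads` and the threshold of three consecutive failures are made up for illustration. Actually pausing a campaign would have to go through the ad platform's own API, which is not shown here.

```python
import urllib.error
import urllib.request


def fetch_status(url, timeout=10):
    """Return the HTTP status code for url, or None if the request fails outright."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # urllib raises for 4xx/5xx responses; the code is still available.
        return err.code
    except (urllib.error.URLError, OSError):
        # DNS failure, refused connection, timeout, and similar.
        return None


def should_pause_ads(statuses, threshold=3):
    """True when the last `threshold` checks all failed (None or status >= 400)."""
    recent = statuses[-threshold:]
    if len(recent) < threshold:
        return False  # not enough history to decide yet
    return all(s is None or s >= 400 for s in recent)
```

A caller would append `fetch_status(landing_page_url)` to a list every few minutes and pause campaigns once `should_pause_ads(history)` turns true. Requiring several failures in a row avoids pausing on a single blip.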
If you run your own site and things go wrong, you (hopefully) know what you did. When Heroku (or AWS, or anyone) makes a change, they don't consult every customer to find out if now is a good time to go down.
Are there any companies that provide 'server host failure' insurance for instances like this? It seems like a possible opportunity.
The E&O insurance I looked into getting when I moved into consulting would have covered it -- "lost sales" resulting from a "hardware or software malfunction." I assume if you start making recurring claims on that the insurance company will reevaluate whether they want to continue doing business with you, though.
Don't know specifically about "server failure" insurance, but I assume it exists. There is insurance for practically everything. For example, an Uncle of my friend builds home security systems, and he is insured in case a home he has secured is ever broken into anyway.
I've currently got a few small apps on Heroku and am considering moving some larger ones over. But the "Heroku | Welcome to your new app!" is very worrying.
Custom error pages for these kinds of errors would be very useful.
It took Google App Engine two years to add the option to specify a custom error page for server errors and over quota errors. Hopefully that'll come soon for Heroku as well…
It will. Currently in private beta, Heroku will render an iframe pointing to an arbitrary URL hosted externally (say, on S3).
Eggs, basket, etc.
It just occurred to me that you know you've made some pretty serious traction as a startup when HN posts about your company no longer have something like "(YC W08)" appended to the end.
I think it has to do with Heroku's target market being so similar (identical?) to HN's demographic. If that weren't the case, I doubt we would all recognize Heroku so easily.
Thanks for pointing that out, because I had completely forgotten that this was the case. (I actually can't remember at all, but I figure that I knew that from when they came out)
They did go a long way in a short period of time. Winter 2008 feels so close.
Yeah, I'm in the same boat as you. I see successful, "big" companies mentioned here with "YC-whatever" on the end and am blown away by which ones are YC alumni!
It all depends on what the SLA says, but hypothetically, if they are down for 24 hours a year, that's 99.7% uptime, which isn't terrible.
Heroku had a 1-2 hour outage the week after we switched an app there last year. My boss was freaking out, cursing about how they were unreliable, etc, neglecting the following:
1. The timing was unfortunate, but that was the first outage in months.
2. We had had multiple outages on our Rackspace box that were our own fault, due to bad server management.
In the long term you're likely better off on Heroku, for small companies at least.
Uh... 99.7% is ridiculously bad if you're doing anything that matters.
Depends, really.
Internal examples:
If shadowcat's public facing website is down for a day, a few people can't read blog posts and maybe we'll miss out on a potential customer - but our existing customers will be entirely unaffected.
If our ticket tracking system is down for a day, it'll annoy the hell out of the existing customers but we can still get the work done since they all have direct email and IM contact info for people.
On the other hand if our ircd is down for an hour, it's time to panic, because that massively interrupts our ability to co-ordinate our work.
External examples:
If LinkedIn is down for a day, I don't care - anything I do on that can wait until tomorrow.
If DuckDuckGo is down for a day, I am going to burst into tears because I use it all the time for information I want -now- and going via Google is substantially more annoying.
So "anything that matters" is really quite relative.
99.7%? Ridiculously bad?
I just did the calculation. That's about a day of downtime. I'd say it's bad if:
- The downtime is scattered all over the year. 1 hour downtime here, 30 min downtime there.
But not if:
- This 1 day of downtime is scheduled, e.g. during the holidays. Scheduled and planned is the keyword. If the client is informed and aware of it, the client will also remain happy.
You'd be surprised how much downtime clients are willing to put up with, as long as they are informed well ahead of time.
I agree with you, but only in theory. I can't think of one thing that runs 100% non-stop.
Even in places like medicine or finance or security. Stuff breaks, things fail. It's sad, but the reality is there.
Of course nothing will have 100.0 (repeating)% uptime. But 99.7% uptime means it can be down for over 2 hours every month. Anything less than 99.9% uptime (which means 3x less allowed downtime--a big difference) is probably unacceptable, and if downtime costs you serious money, you're going to want more decimal places.
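The arithmetic behind these percentages is easy to check. Here is a small Python sketch, assuming a 365-day year (8,760 hours) and a month taken as one twelfth of that; the function name `allowed_downtime_hours` is just illustrative.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def allowed_downtime_hours(uptime_percent, period_hours=HOURS_PER_YEAR):
    """Return the hours of downtime permitted by an uptime percentage."""
    if not 0 <= uptime_percent <= 100:
        raise ValueError("uptime_percent must be between 0 and 100")
    return period_hours * (100 - uptime_percent) / 100


if __name__ == "__main__":
    for pct in (99.7, 99.9, 99.99):
        per_year = allowed_downtime_hours(pct)
        per_month = per_year / 12
        print(f"{pct}% uptime: {per_year:.1f} h/year, {per_month:.2f} h/month")
```

Under those assumptions 99.7% allows roughly 26 hours a year (a little over 2 hours a month), while 99.9% allows under 9 hours a year, which is the 3x difference mentioned above.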
Part of my job is network administration of a small (~50 server) colo/hosting service. It's unacceptable for us to be down for even 30 minutes (from our perspective and our clients). We maybe top out at 5 hours of downtime a year (during a bad year) and most of that (unfortunately) is upstream from us.
Move to vps.net for a few weeks, then move back to Heroku, by that time he'll be counting uptime not downtime!
We've been running on Slicehost for almost 2 years and I believe we've had two outages, one of which wasn't a real outage but a backbone provider went kaplooey in Europe. That can't really be helped.
Heroku, on the other hand, feels like it's up and down more than... something that goes up and down a lot. A friend of mine hosts his blog there and he launched a small product today and he kept sending his customers to an error page, because Heroku was up, down, up down, up down.
If it's a misconfiguration of your own, you can get it fixed. But if your hosting provider has an unsound business, you can't fix that except by leaving.
Yo-yo :)
I have an app running on Heroku. Interestingly, it caches itself using HTML 5 application cache, so most people won't even notice the site is down. Need to make sure the background network ops are fault tolerant though.
Interesting. Care to share what you're doing/what the heck that means?
http://motodiaryapp.com -- of course if Heroku is down and it's not already cached for you it won't load. This is the technology the site uses to allow offline access: http://www.whatwg.org/specs/web-apps/current-work/multipage/...
That is really awesome. I just got back from playing with it between Chrome on an old 800MHz P-III (very usable) and an Android (Nexus One). On the Nexus, I went off-line (airplane mode), edited, and then went back on-line. MAGIC! My edits showed up in my Chrome browser on the desktop.
My use case is that I want to use Google Docs (or equivalent) to keep notes while on-line and off-line. MotoDiary ain't quite there yet, but it has the hard part (IMHO), the on-line/off-line syncing. What is rough is text size and fixed(?) edit box size on the Android. Also (obviously), it is diary-oriented (single entry per day) rather than supporting multiple documents.
Google Docs are totally uneditable (?WTF!) on Android, never mind doing it off-line and syncing.
There are some Apps that work better, such as GDocs. GDocs has been a mixed bag, it allows me to edit off-line and sync docs, but has been iffy in terms of success rate. It definitely isn't as smooth as my brief experience with MotoDiary.
Wow, that opened my eyes! That use case didn't occur to me. I might have to add that to my box of treats, especially since I have apps on Heroku too.
(Love you really, Heroku).
You have to give Heroku credit for selling major quantities of Kool Aid. They've been pretty flakey for the past couple of months, and people are here claiming that this is the first outage. Someone's even claiming that 99.7% is a good record.
"Applications are fully restored." via http://status.heroku.com/
Downtime always sucks, but gotta give them credit for the way they kept everyone in the loop and provided status along the way.
Make your error page prettier. You guys are capable of better design (after seeing your pricing page).
It's worth noting that this was not universal as far as I can tell.
I have 5 minute watchdogs on all of my 3 sites in production with Heroku, and none of them pinged me. Given that I know the watchdogs work (regular testing and previous incidents) I would have to conclude that not everyone was affected.
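The commenter's actual watchdog setup isn't described, so the following is only a rough Python sketch of the general idea: poll on a fixed interval (300 seconds here, matching the 5 minutes mentioned) and send a ping only when the site changes state, so one incident produces one alert rather than one per check. The `check` and `alert` callables are supplied by the caller and are hypothetical, as are the `max_checks` and `sleep` parameters, which exist mainly to make the loop testable.

```python
import time


def watch(check, alert, interval_seconds=300, max_checks=None, sleep=time.sleep):
    """Call check() every interval and alert(is_up) only when the state changes."""
    last_state = True  # assume the site starts out healthy
    count = 0
    while max_checks is None or count < max_checks:
        is_up = bool(check())
        if is_up != last_state:
            alert(is_up)  # False: site went down, True: site recovered
        last_state = is_up
        count += 1
        # Skip the final sleep when a bounded number of checks was requested.
        if max_checks is None or count < max_checks:
            sleep(interval_seconds)
```

With `max_checks=None` the loop runs until the process is stopped. A watchdog like this only sees what its own probe sees, which is consistent with the point above that an outage may not hit every app.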
We've been monitoring a heroku instance for the past 8 months. Our current instance uptime is 99.953% (about 200 minutes of downtime). Of the 76 services we monitor, Heroku is #64.
The magic of cloud computing: As someone running an app on Heroku, I had no idea. Luckily, I simply don't care.
Our app has a cyclic usage pattern and all is quiet right now. So rather than freaking out about it, I'll just let someone at Heroku figure it all out.
It would suck if it happened during our busy period, but then again I could say "We're working on it." and just assume the Heroku team will fix things faster than I ever could have with my limited *nix admin skills.
How exactly is the fact that you didn't know about the outage "the magic of cloud computing"?
I get that you're saying your users don't care/didn't notice, but I'm clearly missing something because if I had an app on Heroku, I'd be a little nervous. When the cyclic nature of your app swings back around and it's in regular use again, this kind of outage might not be so magical.
Well technically, I was informed of it. I got email alerts and stuff, but I was busy doing other things, so I didn't read them.
Users surely noticed, but Heroku definitely noticed before my users did. They're quietly working on a solution and I can quietly go about my day. If my users start complaining, I'll have time to talk to them; time I wouldn't have if I was neck deep in log spew.
Having run apps on my own servers before, I know what a pain in the ass it is to deal with downtime yourself. I'm not particularly good at it, so I appreciate having experts take care of it for me.
Having experts be responsible for dealing with problems is not unique to "cloud computing".
Not unique to it, but it is implicit in it. This matters. If you are at the size where you can't have a dedicated staff monitoring your uptime 24/7, then you are at the size where a cloud solution is going to be more responsive than what you can afford.
> to deal with downtime yourself. I'm not particularly good at it, so I appreciate having experts take care of it for me.
This is downtime coming from their infrastructure, not your app.
Sooner or later most people realize that it is not that safe to rely on a specific deployment system that they cannot directly control. It can be dangerous to use a full stack that cannot easily be replaced without a decent amount of effort.
Initial laziness now adds up.
Even if you hosted your own server and it was just serving one static file, there are still services you depend on that could cause an outage.
Heroku so far has not had major outages.
And they will be learning from the current ones.
I host an app on there that I've been using all day and I didn't notice it go down. I reckon I've got some kind of unplugged-TV poltergeist action going on.
I haven't seen an explanation for this, but it could be related to the EC2 issues today. I'm a Heroku user. Downtime with any host always seems to come at a bad time; today it hit during a daily client call. However, I'm not concerned about Heroku - yet... I think they have less downtime than I would have doing it myself.