Heroku is down for the third time today

status.heroku.com

87 points by Janteh 15 years ago · 62 comments

ayb 15 years ago

I use Heroku for subscription software services, online retail stores, and a phone ordering system for our staff.

Right now all of our sites are failing with 503 errors. Our store is down and when one of our employees went to take a phone order they got a "Welcome to your new app" message.

I've been a big evangelist of Heroku since we migrated over last year, but I'm getting deeply concerned about the elevated error rate since every minute is costing us money.

  • qeorge 15 years ago

    Does Heroku have an SLA? (I could not find it)

    At some point they're exposing themselves to serious risk. Rackspace had to pay out ~$3MM (in free service credits) after an outage in 2009:

    http://www.networkworld.com/news/2009/070609-rackspace-outag...

    • StavrosK 15 years ago

      This is offtopic, but what's MM? What's the second M for?

      • jcsalterego 15 years ago

        M = 1000 in Roman numerals. The confusing bit is that MM isn't read as a Roman numeral (which would be 2000) but as one thousand thousands, i.e. one million.

  • SpikeGronim 15 years ago

    My question to you is, could you do better and how much would it cost? If you didn't use Heroku or another cloud provider you would pay a lot more up front to get your applications running. When things go wrong you would have to fix it, which means paying technical staff to be on call. Since you and your company are likely experts in your domain and not in infrastructure then any infrastructure that you built would likely have more downtime than Heroku. You have to debit the cost of Heroku's downtime from the cost of building your own infrastructure.

    Disclaimer: I'm not affiliated with Heroku and I don't use their service.

    • nphase 15 years ago

      You have to debit the cost of Heroku's downtime from the cost of building your own infrastructure.

      That's silly, and also not how it works, at all. You're paying PaaS/IaaS companies so that it's their headache, not yours. Once it becomes your headache, they are no longer doing their job, and you are no longer receiving the value you are paying for. You don't debit the cost of their downtime from what it would have cost to build your own infrastructure; you debit it from your business' revenue and reputation.

      Whether or not you could do it better yourself does not excuse the downtime one bit.

      • jimbokun 15 years ago

        Whether the downtime is excusable is irrelevant.

        If you stop using Heroku and manage your own infrastructure, you need to take all relevant costs into account.

        Of course, finding an alternative provider of the same (or similar) services is also an option.

        • jemfinch 15 years ago

          Isn't it commonly recognized that it's cheaper to run your own hardware than to pay a cloud provider? It just requires more capital outlay and maintenance.

    • ayb 15 years ago

      We're actively sending pay per click traffic to our online store and it's very easy to spend hundreds of dollars. When our traffic converts it's great but it pains me to think I could be sending traffic to a Heroku 503 error page and have zero control over it.

      So, "could we do better"? I'm not sure. I'm trying to figure that out. It certainly would not be as easy to use as Heroku or easy to deploy. But at a minimum I need to get some other host option set that we can switch over to.

      • bl4k 15 years ago

        Is there a service that will switch off adwords campaigns if your site is down or in maintenance mode? If not, there should be.

        • ayb 15 years ago

          Thought about writing an app to do that. Unfortunately I would not be able to host it on Heroku. :-)

          • mikeyur 15 years ago

            I've had clients with sloppy dev teams who decided to change the URL structure of all landing pages without letting me know (I was managing their PPC campaigns). Google stops serving ads after getting 404 errors - unfortunately I don't think they count other errors (like a 503) and they don't stop until they've sent a few hundred (or thousand) clicks.

            • bl4k 15 years ago

              would ppl pay for a good solution here, like $10 a month, or a % of money saved? I imagine that with a good implementation they might.
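A minimal sketch of the service bl4k describes, in Python. It assumes some monitor reports the HTTP status of the landing page every few minutes (None when the request fails outright), and `pause`/`resume` are hypothetical callbacks standing in for whatever ad-platform API you would actually call; no real AdWords call is shown. Because mikeyur notes Google seems to react to 404s but not to 503s, the sketch treats any non-2xx/3xx result, 503 included, as down.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class AdGuard:
    """Pause ads after `threshold` consecutive failed checks; resume on the next success.

    `pause` and `resume` are hypothetical hooks: wire them to your ad platform's API.
    """
    pause: Callable[[], None]
    resume: Callable[[], None]
    threshold: int = 2  # require this many bad checks in a row before pausing
    failures: int = 0
    paused: bool = False

    def record(self, status: Optional[int]) -> None:
        """Feed one check result: an HTTP status code, or None if the request failed."""
        healthy = status is not None and 200 <= status < 400
        if healthy:
            self.failures = 0
            if self.paused:
                self.resume()
                self.paused = False
        else:
            self.failures += 1
            if not self.paused and self.failures >= self.threshold:
                self.pause()
                self.paused = True
```

Feeding it 200, 503, 503, 200 calls `pause` once after the second 503 and `resume` once on the following 200. The consecutive-failure threshold trades a few minutes of wasted clicks for not pausing campaigns over a single slow response.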

    • tedunangst 15 years ago

      If you run your own site and things go wrong, you (hopefully) know what you did. When Heroku (or AWS, or anyone) makes a change, they don't consult every customer to find out if now is a good time to go down.

  • Goosey 15 years ago

    Are there any companies that provide 'server host failure' insurance for instances like this? It seems like a possible opportunity.

    • patio11 15 years ago

      The E&O insurance I looked into getting when I moved into consulting would have covered it -- "lost sales" resulting from a "hardware or software malfunction." I assume if you start making recurring claims on that the insurance company will reevaluate whether they want to continue doing business with you, though.

    • edanm 15 years ago

      Don't know specifically about "server failure" insurance, but I assume it exists. There is insurance for practically everything. For example, my friend's uncle builds home security systems, and he is insured in case a home he has secured is ever broken into anyway.

  • biaxident 15 years ago

    I've currently got a few small apps on Heroku and am considering moving some larger ones over. But the "Heroku | Welcome to your new app!" is very worrying.

    Custom error pages for these kinds of errors would be very useful.

    • iampims 15 years ago

      It took Google App Engine two years to add the option to specify a custom error page for server errors and over quota errors. Hopefully that'll come soon for Heroku as well…

      • sync 15 years ago

        It will. It's currently in private beta: Heroku will render an iframe pointed at an arbitrary URL hosted externally (say, on S3).

  • mike-cardwell 15 years ago

    Eggs, basket, etc.

vegashacker 15 years ago

It just occurred to me that you know you've made some pretty serious traction as a startup when HN posts about your company no longer have something like "(YC W08)" appended to the end.

  • adammichaelc 15 years ago

    I think it has to do with Heroku's target market being so similar (identical?) to HN's demographic. If that weren't the case, I doubt we would all recognize Heroku so easily.

  • Timothee 15 years ago

    Thanks for pointing that out, because I had completely forgotten that this was the case. (I actually can't remember at all, but I figure that I knew that from when they came out)

    They did go a long way in a short period of time. Winter 2008 feels so close.

    • petercooper 15 years ago

      Yeah, I'm in the same boat as you. I see successful, "big" companies mentioned here with "YC-whatever" on the end and am blown away by which ones are YC alumni!

gfunk911 15 years ago

It all depends on what the SLA says, but hypothetically, if they are down for 24 hours a year, that's 99.7% uptime, which isn't terrible.

Heroku had a 1-2 hour outage the week after we switched an app there last year. My boss was freaking out, cursing about how they were unreliable, etc, neglecting the following:

1. The timing was unfortunate, but that was the first outage in months.

2. We had had multiple outages on our Rackspace box that were our own fault, due to bad server management.

In the long term you're likely better on Heroku, for small companies at least.

  • whirlycott1 15 years ago

    Uh... 99.7% is ridiculously bad if you're doing anything that matters.

    • mst 15 years ago

      Depends, really.

      Internal examples:

      If Shadowcat's public-facing website is down for a day, a few people can't read blog posts and maybe we'll miss out on a potential customer - but our existing customers will be entirely unaffected.

      If our ticket tracking system is down for a day, it'll annoy the hell out of the existing customers but we can still get the work done since they all have direct email and IM contact info for people.

      On the other hand if our ircd is down for an hour, it's time to panic, because that massively interrupts our ability to co-ordinate our work.

      External examples:

      If LinkedIn is down for a day, I don't care - anything I do on that can wait until tomorrow.

      If DuckDuckGo is down for a day, I am going to burst into tears because I use it all the time for information I want -now- and going via Google is substantially more annoying.

      So "anything that matters" is really quite relative.

    • sgt 15 years ago

      99.7%? Ridiculously bad?

      I just did the calculation. That's about a day of downtime. I'd say it's bad if:

      - The downtime is scattered all over the year. 1 hour downtime here, 30 min downtime there.

      But not if:

      - This 1 day of downtime is scheduled, e.g. during the holidays. Scheduled and planned is the keyword. If the client is informed and aware of it, the client will also remain happy.

      You'd be surprised how much downtime clients are willing to put up with, as long as they are informed well ahead of time.

    • kes 15 years ago

      I agree with you, but only in theory. I can't think of one thing that runs 100% non-stop.

      Even in places like medicine or finance or security. Stuff breaks, things fail. It's sad, but the reality is there.

      • jackowayed 15 years ago

        Of course nothing will have 100.0 (repeating)% uptime. But 99.7% uptime means it can be down for over 2 hours every month. Anything less than 99.9% uptime (which means 3x less allowed downtime--a big difference) is probably unacceptable, and if downtime costs you serious money, you're going to want more decimal places.

      • invisible 15 years ago

        Part of my job is network administration of a small (~50 server) colo/hosting service. It's unacceptable for us to be down for even 30 minutes (from our perspective and our clients). We maybe top out at 5 hours of downtime a year (during a bad year) and most of that (unfortunately) is upstream from us.
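The percentages traded back and forth above are easy to check. A small Python helper, assuming a 365-day year and an average 730-hour month:

```python
HOURS_PER_YEAR = 365 * 24              # 8760 hours, ignoring leap years
HOURS_PER_MONTH = HOURS_PER_YEAR / 12  # 730 hours in an average month


def allowed_downtime_hours(availability_pct: float, period_hours: float) -> float:
    """Downtime permitted over `period_hours` at the given availability percentage."""
    return period_hours * (1 - availability_pct / 100)


def availability_pct(downtime_hours: float, period_hours: float) -> float:
    """Availability percentage implied by `downtime_hours` of outage in `period_hours`."""
    return 100 * (1 - downtime_hours / period_hours)


if __name__ == "__main__":
    for pct in (99.7, 99.9, 99.99):
        per_year = allowed_downtime_hours(pct, HOURS_PER_YEAR)
        per_month = allowed_downtime_hours(pct, HOURS_PER_MONTH)
        print(f"{pct}%: {per_year:.2f} h/year, {per_month * 60:.1f} min/month")
```

At 99.7% this gives about 26.3 hours a year (roughly 2.2 hours in an average month), and 24 hours of downtime a year works out to about 99.73%, so gfunk911's and jackowayed's figures check out. 99.9% allows about 8.8 hours a year, and invisible's bad-year figure of 5 hours works out to about 99.94%.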

  • citricsquid 15 years ago

    Move to vps.net for a few weeks, then move back to Heroku, by that time he'll be counting uptime not downtime!

  • ahoyhere 15 years ago

    We've been running on Slicehost for almost 2 years and I believe we've had two outages, one of which wasn't a real outage but a backbone provider went kaplooey in Europe. That can't really be helped.

    Heroku, on the other hand, feels like it's up and down more than... something that goes up and down a lot. A friend of mine hosts his blog there and he launched a small product today and he kept sending his customers to an error page, because Heroku was up, down, up down, up down.

    If it's a misconfiguration of your own, you can get it fixed. But if your hosting provider has an unsound business, you can't fix that except by leaving.

awt 15 years ago

I have an app running on Heroku. Interestingly, it caches itself using HTML 5 application cache, so most people won't even notice the site is down. Need to make sure the background network ops are fault tolerant though.

  • davidamcclain 15 years ago

    Interesting. Care to share what you're doing/what the heck that means?

    • awt 15 years ago

      http://motodiaryapp.com -- of course if Heroku is down and it's not already cached for you it won't load. This is the technology the site uses to allow offline access: http://www.whatwg.org/specs/web-apps/current-work/multipage/...

      • gvb 15 years ago

        That is really awesome. I just got back from playing with it between Chrome on an old 800MHz P-III (very usable) and an Android (Nexus One). On the Nexus, I went off-line (airplane mode), edited, and then went back on-line. MAGIC! My edits showed up in my Chrome browser on the desktop.

        My use case is that I want to use Google Docs (or equivalent) to keep notes while on-line and off-line. MotoDiary ain't quite there yet, but it has the hard part (IMHO), the on-line/off-line syncing. What is rough is text size and fixed(?) edit box size on the Android. Also (obviously), it is diary-oriented (single entry per day) rather than supporting multiple documents.

        Google Docs are totally uneditable (?WTF!) on Android, never mind doing it off-line and syncing.

        There are some apps that work better, such as GDocs. GDocs has been a mixed bag: it allows me to edit off-line and sync docs, but has been iffy in terms of success rate. It definitely isn't as smooth as my brief experience with MotoDiary.

      • davidamcclain 15 years ago

        Wow, that opened my eyes! That use case didn't occur to me. I might have to add that to my box of treats, especially since I have apps on Heroku too.

        (Love you really, Heroku).
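awt's point about making the background network operations fault tolerant usually comes down to retrying failed sync requests instead of dropping them. A minimal Python sketch of exponential backoff with full jitter, assuming the sync step is a callable that raises `OSError` (which covers `ConnectionError` and `TimeoutError`) on network failure. `op` is a hypothetical placeholder, for example a function that uploads queued offline edits; this is not MotoDiary's actual code.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_retries(op: Callable[[], T], attempts: int = 5, base_delay: float = 0.5,
                 max_delay: float = 30.0,
                 sleep: Callable[[float], None] = time.sleep) -> T:
    """Call op(), retrying network errors with exponential backoff and full jitter.

    urllib's HTTPError (e.g. a 503) is an OSError subclass, so it is retried too.
    The last failure is re-raised so the caller can keep the edit queued for later.
    """
    if attempts < 1:
        raise ValueError("attempts must be >= 1")
    for attempt in range(1, attempts + 1):
        try:
            return op()
        except OSError:
            if attempt == attempts:
                raise
            # Wait a random time in [0, base_delay * 2^(attempt-1)], capped at max_delay.
            sleep(random.uniform(0, min(max_delay, base_delay * 2 ** (attempt - 1))))
    raise AssertionError("unreachable")
```

The jitter spreads retries out, so many offline clients coming back at once, say right after a platform outage ends, don't all hit the server in the same instant.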

froggie 15 years ago

You have to give Heroku credit for selling major quantities of Kool-Aid. They've been pretty flaky for the past couple of months, and people here are claiming that this is the first outage. Someone's even claiming that 99.7% is a good record.

railsjedi 15 years ago

"Applications are fully restored." via http://status.heroku.com/

Downtime always sucks, but you gotta give them credit for the way they keep everyone in the loop and provide status updates along the way.

n-named 15 years ago

Make your error page prettier. You guys are capable of better design (after seeing your pricing page).

aarongough 15 years ago

It's worth noting that this was not universal as far as I can tell.

I have 5 minute watchdogs on all of my 3 sites in production with Heroku, and none of them pinged me. Given that I know the watchdogs work (regular testing and previous incidents) I would have to conclude that not everyone was affected.
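A watchdog like the one aarongough describes can be sketched with the Python standard library alone. This assumes a plain HTTP GET against each site's front page is a good enough health signal and that `alert` is whatever notification hook you already use (email, SMS); both are assumptions, not details from the thread.

```python
import http.client
import time
import urllib.error
import urllib.request
from typing import Callable, Iterable, List


def probe(url: str, timeout: float = 10.0) -> bool:
    """Return True if `url` answers with a 2xx/3xx status within `timeout` seconds."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 400
    except urllib.error.HTTPError:
        return False  # 4xx/5xx response, e.g. a 503 from the platform's router
    except (OSError, http.client.HTTPException):
        return False  # DNS failure, refused connection, timeout, or malformed response


def run_checks(urls: Iterable[str], alert: Callable[[str], None],
               probe_fn: Callable[[str], bool] = probe) -> List[str]:
    """Probe every URL once, call `alert` for each failure, and return the failures."""
    down = [url for url in urls if not probe_fn(url)]
    for url in down:
        alert(url)
    return down


def watch(urls: List[str], alert: Callable[[str], None], interval: float = 300.0) -> None:
    """Check all sites every `interval` seconds (300 s is a 5-minute cadence)."""
    while True:
        run_checks(urls, alert)
        time.sleep(interval)
```

Run something like `watch([...], alert=print)` from a machine outside the hosting provider; a watchdog hosted on the same platform goes down along with the sites it is supposed to watch.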

jread 15 years ago

We've been monitoring a heroku instance for the past 8 months. Our current instance uptime is 99.953% (about 200 minutes of downtime). Of the 76 services we monitor, Heroku is #64.

http://cloudharmony.com/status
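jread's figures come from periodic external probes. CloudHarmony's exact methodology isn't described in the thread, so the following is only a generic sketch: assuming fixed-interval checks where each failed check stands for one interval of downtime, measured uptime is the share of successful checks.

```python
from typing import Sequence, Tuple


def summarize(results: Sequence[bool], interval_minutes: float) -> Tuple[float, float]:
    """Return (uptime %, estimated downtime in minutes) from fixed-interval check results.

    Each False is counted as one full interval of downtime, which over- or under-counts
    outages that start or end between checks.
    """
    if not results:
        raise ValueError("need at least one check result")
    failed = sum(1 for ok in results if not ok)
    uptime_pct = 100 * (len(results) - failed) / len(results)
    return uptime_pct, failed * interval_minutes
```

With 5-minute checks, 95 passes and 5 failures report 95.0% uptime and an estimated 25 minutes down. The resolution is bounded by the check interval: an outage shorter than one interval can be missed entirely.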

snprbob86 15 years ago

The magic of cloud computing: As someone running an app on Heroku, I had no idea. Luckily, I simply don't care.

Our app has a cyclic usage pattern and all is quiet right now. So rather than freaking out about it, I'll just let someone at Heroku figure it all out.

It would suck if it happened during our busy period, but then again I could say "We're working on it." and just assume the Heroku team will fix things faster than I ever could have with my limited *nix admin skills.

  • jbail 15 years ago

    How exactly is the fact that you didn't know about the outage "the magic of cloud computing"?

    I get that you're saying your users don't care/didn't notice, but I'm clearly missing something because if I had an app on Heroku, I'd be a little nervous. When the cyclic nature of your app swings back around and it's in regular use again, this kind of outage might not be so magical.

    • snprbob86 15 years ago

      Well technically, I was informed of it. I got email alerts and stuff, but I was busy doing other things, so I didn't read them.

      Users surely noticed, but Heroku definitely noticed before my users did. They're quietly working on a solution and I can quietly go about my day. If my users start complaining, I'll have time to talk to them; time I wouldn't have if I was neck deep in log spew.

      Having run apps on my own servers before, I know what a pain in the ass it is to deal with downtime yourself. I'm not particularly good at it, so I appreciate having experts take care of it for me.

      • absconditus 15 years ago

        Having experts be responsible for dealing with problems is not unique to "cloud computing".

        • Goosey 15 years ago

          Not unique to it, but it is implicit in it. This matters. If you are at the size where you can't have dedicated staff monitoring your uptime 24/7, then you are at the size where a cloud solution is going to be more responsive than anything you can afford.

      • kranner 15 years ago

        > to deal with downtime yourself. I'm not particularly good at it, so I appreciate having experts take care of it for me.

        This is downtime coming from their infrastructure, not your app.

itsnotvalid 15 years ago

Sooner or later most people realize that it is not that safe to rely on a deployment system they can't directly control. It can be dangerous to use a full stack that can't be replaced without a decent amount of effort.

Initial laziness now adds up.

boltofblue 15 years ago

Even if you hosted your own server and it was just serving one static file, there are still services you depend on that could cause an outage.

Heroku so far has not had major outages.

And they will be learning from the current ones.

alexyoung 15 years ago

I host an app on there that I've been using all day and I didn't notice it go down. I reckon I've got some kind of unplugged-TV poltergeist action going on.

aneth 15 years ago

I haven't seen an explanation for this, but it could be related to the EC2 issues today. I'm a Heroku user. Downtime with any host always seems to happen at a bad time; today it was during a daily client call. However, I'm not concerned about Heroku - yet... I think they have less downtime than I would have doing it myself.