Incident Response at Heroku
blog.heroku.com
And that's why I pay for Heroku. They let me sleep at night instead of worrying about keeping my stack up to date. I can't afford a dedicated engineer to keep things 'kosher' on my boxes, so Heroku is fantastic in that regard.
I have a pet project up and running on DigitalOcean but my god, it's duct-taped to hell. I don't even know how to apply Nginx patches as they come out without bringing down the website. :(
Thanks Heroku for sharing!
> They let me sleep at night instead of worrying about keeping my stack up to date
I'm not sure what kind of stack you're running on Heroku, but if it happens to be Ruby, Heroku won't do anything (AFAIK) to update gems specified in your Gemfile.lock file when vulnerabilities are found. That's still up to you, and security issues in gems appear to be far more common than those in parts of the infrastructure that Heroku does keep up-to-date.
Generally if it's a large enough security incident and the fix is in gem land, we send out emails. At the end of the day, you're still responsible for security of your own code (and libraries), but we try to help.
Well, as a former Heroku user, I disagree. Sure, your stack is up to date, but then Heroku has downtime related to issues with its own architecture from time to time.
Sure, they have a better, larger ops team than most orgs, but they also need to run a much more complex product. I think it's a wash IMHO.
When did you last use Heroku? Have you seen https://status.heroku.com/uptime? It's not five nines, but it's more than most places I've worked :)
Their uptime calculation is...interesting, to be polite. Look at Dec 2013 or Sep 2013 and you'll see major incidents covering most of a day, but still three nines. How? See https://devcenter.heroku.com/articles/heroku-status#uptime-c..., but basically, if 40% of running applications are down for 10 minutes they count it as a 4-minute outage.
If 40% of your applications are down for 10 minutes, that's 10 minutes of downtime.
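To make the disagreement concrete, here is a minimal sketch of the two ways of counting, based only on the description above (duration multiplied by the fraction of apps affected). The function names, the incident tuple layout, and the 30-day period are assumptions for illustration, not Heroku's actual implementation.

```python
# Hypothetical sketch: each incident is (duration_minutes, fraction_of_apps_affected).

def weighted_downtime_minutes(incidents):
    """Downtime weighted by the fraction of apps affected, as described above."""
    return sum(minutes * fraction for minutes, fraction in incidents)

def wall_clock_downtime_minutes(incidents):
    """Downtime counted in full whenever any apps are affected."""
    return sum(minutes for minutes, _fraction in incidents)

def uptime_percent(downtime_minutes, period_minutes):
    """Uptime as a percentage of the reporting period."""
    return 100.0 * (1 - downtime_minutes / period_minutes)

# The example from the thread: 40% of apps down for 10 minutes.
incidents = [(10, 0.40)]
period = 30 * 24 * 60  # assumed 30-day month, in minutes

weighted = weighted_downtime_minutes(incidents)      # about 4 minutes
wall_clock = wall_clock_downtime_minutes(incidents)  # 10 minutes

print(f"weighted:   {weighted:.1f} min, {uptime_percent(weighted, period):.4f}% uptime")
print(f"wall clock: {wall_clock:.1f} min, {uptime_percent(wall_clock, period):.4f}% uptime")
```

The weighted figure is an average across all apps; the wall-clock figure is what an affected customer actually experienced.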
That sounds pretty reasonable, actually. Sure, you can make the case that if 40% of the apps are down, then that's a whole-service downtime, and should be counted as such. But what if it's 4%? Or 1%, or 20 apps? Should a downtime that took down 20 apps be counted at the same level as one that took down 80% of all of their apps?
Seems like the best compromise is exactly what they did. The numbers are about as relevant to people looking to serve their applications as they could be without breaking things down by availability zone or stack or some other metric.
It's not completely backwards, and I'm glad they're reasonably upfront about it. I think given their business, I'd prefer a range or histogram - "uptime was between 96.4% and 99.99% depending on the customer" or "uptime was 96% for 10% of customers and 99.8% for 90%" or something.
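A range or per-customer breakdown like that could be computed roughly as follows. This is only a sketch: the customer names, downtime figures, and the 1000-minute period are made up for illustration, and it assumes per-customer downtime minutes are available at all.

```python
# Hypothetical sketch: per-customer uptime instead of a single averaged number.

def per_customer_uptime(downtime_by_customer, period_minutes):
    """Map each customer to its own uptime percentage for the period."""
    return {
        customer: 100.0 * (1 - downtime / period_minutes)
        for customer, downtime in downtime_by_customer.items()
    }

def uptime_range(uptimes):
    """Return (worst, best) uptime across customers."""
    values = list(uptimes.values())
    return min(values), max(values)

def share_at_or_above(uptimes, threshold):
    """Fraction of customers whose uptime is at least `threshold` percent."""
    values = list(uptimes.values())
    return sum(1 for v in values if v >= threshold) / len(values)

# Made-up data: minutes of downtime per customer over a 1000-minute period.
downtime = {"a": 0, "b": 2, "c": 40, "d": 1}
uptimes = per_customer_uptime(downtime, period_minutes=1000)

worst, best = uptime_range(uptimes)
print(f"uptime was between {worst:.2f}% and {best:.2f}% depending on the customer")
print(f"{share_at_or_above(uptimes, 99.0):.0%} of customers saw at least 99% uptime")
```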