Heroku's Managed DB's have been down for 2+ hours

status.heroku.com

71 points by kentf 5 years ago · 58 comments

pgn674 5 years ago

It mentions an issue at an upstream service provider. Is it AWS and their Degraded EBS Volume Performance in Northern Virginia? https://status.aws.amazon.com/

mcjiggerlog 5 years ago

Definitely appears to be a wider issue; CircleCI is having a major outage too: https://status.circleci.com.
hiyer 5 years ago

Likely - they're also reporting provisioning failures in Virginia, which is consistent with what AWS is reporting as well.

throwdecro 5 years ago

Is there an "uncanny reliability" range where increasing reliability on the part of a service provider makes things worse, by being so close to 100% reliable that any failure is a shock?

Maybe it's better to go with cheaper services that fail more often, thus keeping customers in good practice for how to deal with it.

strzibny 5 years ago

This is along the lines of what I say in the Scaling chapter of my book[0]. If your infra is really simple (like a server or two), you can actually recreate it at a different provider and beat any hard-to-fix issue (a whole AWS region going down, or this Heroku database problem).
Especially with smaller applications, you might be able to beat the provider time to fix the issue, and you never know when it might be critical for you to be able to do that.
My book also contains a Bash script that sets up a PostgreSQL cluster in a few minutes, with or without attached storage space, with self-signed SSL, SELinux, and more. Great for simple apps and as a start in learning production PostgreSQL.
[0] https://deploymentfromscratch.com/ [1] https://gist.github.com/strzibny/4f38345317a4d0866a35ede5aba...
- busterarm 5 years ago
  
  > Especially with smaller applications, you might be able to beat the provider time to fix the issue, and you never know when it might be critical for you to be able to do that.
  This is so true you have no idea. Several years ago I was working at a Linode customer on the Christmas Eve that they started being DDOS'd for several days.
  We had been working for weeks before then to multi-host our applications just to be prepared for outages and suddenly all of that work paid off.
  We already had all of our data ready at another provider and the infrastructure hot so it was just a matter of flipping some configs and waiting for DNS propagation. I still ended up working 20 hours that day just monitoring everything and calming people down but the alternative would have been working straight through New Year's.
remus 5 years ago

Yes. There's a nice example of this in the Google SRE book (I think it may have been their internal Paxos service?). If I remember correctly, they ended up building in planned downtime so users would learn to degrade gracefully when the service went down.
- GeneralMayhem 5 years ago
  
  Google does this pretty regularly internally. Every system has a published SLO, and for a couple weeks a year major components will respect their SLO and not a single request or millisecond better. If you were relying on something performing 10x better than what it's rated for in order to provide your own guarantees, then that's on you.
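To make the "respect the SLO and not a millisecond better" idea concrete, here is a toy sketch in Python. It is not Google's mechanism, which the comment does not describe; the decorator name `enforce_latency_floor` and the 50 ms figure are hypothetical, chosen only for illustration. The wrapper pads every call so it never returns faster than the published latency target, which lets callers find out whether they were quietly depending on better-than-promised performance.

```python
import time
from functools import wraps


def enforce_latency_floor(slo_seconds):
    """Hypothetical decorator: never answer faster than the published SLO latency."""

    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            start = time.monotonic()
            result = func(*args, **kwargs)
            # Sleep for whatever is left of the SLO window, if anything.
            remaining = slo_seconds - (time.monotonic() - start)
            if remaining > 0:
                time.sleep(remaining)
            return result

        return wrapper

    return decorator


@enforce_latency_floor(0.05)  # assumed SLO: 50 ms per lookup
def lookup(key):
    # Stand-in for a normally very fast backend call.
    return {"answer": 42}.get(key)


if __name__ == "__main__":
    t0 = time.monotonic()
    value = lookup("answer")
    print(value, f"{time.monotonic() - t0:.3f}s")
```

A real exercise of this kind would presumably also cap availability, not just latency; this sketch covers only the latency half.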
deepsun 5 years ago

Often that means spending 10x on building failure-tolerant architecture.
For example, software could assume that files get corrupted while sitting on a disk, and work around that. But it turned out to be easier to build the self-healing redundancy checks into the lowest layer possible, the hard drives, and assume the data is clean afterwards.
Another thing I've heard is that when they make radiation-resistant CPUs for space, instead of making the CPU robust to miscalculations, it's easier to shield it as much as possible and use larger process nodes (like 110nm+). Of course, they also make all kinds of checks in the software as well, because they do real engineering.
zokier 5 years ago

That's the theory behind chaos monkeys/simian army.

forgingahead 5 years ago

Heroku has been strangely unreliable the past few weeks. Even their ticket response team has been slow, with their support engineers often talking past the issue to just send a scripted reply.

We have the majority of our client apps hosted with them, but most don't require 24/7 availability. This is still concerning though, and we do have one high-availability app hosted on them now that we're trying to plan contingencies for.

Open to any suggestions for alternatives! Ideally I'd keep things on Heroku, but it would be nice to have failsafes that could be activated relatively quickly in the event of similar issues.

lbruder 5 years ago

Simple dynos can be replicated with Dokku, with Ledokku as a GUI. Just get an Ubuntu VM on DigitalOcean, Vultr or whatever, install and configure UFW, fail2ban and automatic security updates, install Dokku and you're set.
For managed databases with replication however, Dokku still leaves much to be desired...
- i386 5 years ago
  
  I want a birthday cake. But first I'll be growing and milling my own grain, raising chickens and a cow. Water will be manually pumped from a well.
  - MikeDelta 5 years ago
    
    It is seriously not that bad at all, I would compare it to making your own cake from the flour, water, butter vs buying ready-made batter.
    
    subsection1h 5 years ago
    
    Heroku provides many features like pipelines and review apps that would be impossible to implement on a single VPS and very time-consuming to implement on multiple VPSes. Anyone who recommends a single VPS as a hosting solution (as lbruder did) is likely a hobbyist or a student.
    
    MikeDelta 5 years ago
    
    Maybe it sounded a bit simplistic in the description, but running VMs/servers in the cloud or a datacenter with CI/CD pipelines, VM patching, testing, the whole nine yards is not as extreme, difficult, or ridiculous as most of us think it is (e.g. comparing it with making your own flour or growing your own coffee).
    There are plenty of professionals doing it this way. Agreed that one machine would not be enough for all that, and building it will take more time, but it has its own advantages of being in control.
    Curious to hear your opinion on this if you'd like to share.

Daniel_sk 5 years ago

Signal is down too due to an outage of a service provider (I assume AWS).

supermatt 5 years ago

Whatever happened to 5-nines uptime? It seems no cloud service provider these days is able to offer what was considered an industry standard.

AWS even have documents telling people how to achieve exactly this! https://docs.aws.amazon.com/wellarchitected/latest/reliabili...

Why don't "premium" service providers like heroku, etc, do this?

zokier 5 years ago

> It seems no cloud service provider these days is able to offer what was considered an industry standard.
I wonder how many services really had 5 nines availability pre-cloud era either. Somehow I feel your view of it being "industry standard" might be slightly rose tinted
- nicoffeine 5 years ago
  
  Five nines uptime only exists in the mainframe world. Everywhere else it's a requirement set by someone in management, which is met by the vendor in their marketing material. It's never achieved over the long term, but enough time passes so the inevitable downtime can be blamed on the previous management. The vendor meets their "guarantee" by paying back less than a point on the yearly bill, and then everyone can reset the clock and pretend that it won't happen again.
  The only people who suffer consequences are the staff forced to work overtime performing SEV0 RED ALERT theater. They will work through nights/weekends while the responsible parties tut-tut and "manage" by reading updates they can collate into the post crisis report. After that, everyone participates in the joy of emergency meetings to discuss said report that will be entirely worthless when a completely different part of the system fails the next time. A more reliable HA solution will be worked up by the engineers, finance will estimate implementation costs, and it will be turned down by an executive on the 8th hole green because they don't care about anything except improving profitability so they can hand themselves a bonus.
  Not that I'm bitter or anything.
  - supermatt 5 years ago
    
    I worked on multiple services since the late nineties on bare metal over multiple datacentres to achieve 5-nines.
    With IaaS that is now easier than ever, yet these so-called cloud service providers don't do any of that - they tie themselves to a single AZ and have ZERO redundancy.
- lbriner 5 years ago
  
  The issue is more like the guarantee is not worth anything.
  AWS/Azure/whomever "promise" 5 9s uptime. Something goes wrong, you don't get 5 9s, and what do you get?
  A system that went down for 4 hours and a $50 rebate on your next bill!
  - sciurus 5 years ago
    
    Your point about the credits stands, but cloud providers don't even offer five-nines SLAs.
    E.G. https://aws.amazon.com/compute/sla/
  - supermatt 5 years ago
    
    It's not about AWS/Azure etc. They are providing IaaS. Literally compute services littered around the globe. It is up to these so-called cloud service providers, like Heroku, to utilize that infrastructure to achieve 99.999%.
    I even gave a link in my comment to what AWS says about this.
    Are people downvoting me because they don't read, or what?
    
    weird-eye-issue 5 years ago
    
    You make it sound like AWS has 100% uptime and services built on top of them are completely to blame.
    And for something like Heroku's managed DBs you can't just achieve 99.99999% availability on a DB without making certain sacrifices. Availability isn't everything past a certain point
    
    supermatt 5 years ago
    
    That's not what I'm doing at all. I even gave a link to AWS documentation on achieving 5-nines by utilizing multiple AZs, etc. I also reiterated this in the comment you responded to above.
    What sacrifices are you talking about when synchronously replicating to a backup environment? Write latency? How do you deal with that usually? How much is too much? There are strategies for reducing replication-related latency depending on the level of consistency required.
    
    weird-eye-issue 5 years ago
    
    Costs and general complexity. It is quite easy to accidentally reduce a system's uptime by introducing extra complexity involved with higher availability
    
    supermatt 5 years ago
    
    Oh yeah, there are definitely additional costs and complexity involved.
    I'm just saying that these cloud service providers offering managed services should be covering all that - they certainly charge as if they do!
    And in the case of Heroku - and their specific architecture - it is not that complex. I'm aware that other cases may vary.
    
    Amasuriel 5 years ago
    
    No particular love for Heroku, but you can pay for multi az failover if you want it.
    https://devcenter.heroku.com/articles/heroku-postgres-ha
    If you don’t pay you won’t get the feature. Given that multi az failover directly impacts their cost that seems pretty fair.
    
    supermatt 5 years ago
    
    Yeah, I suppose in some circumstances their offering is OK.
    I don't think I have ever worked on a system where 10 mins of data loss is anywhere near acceptable though.
    I guess for mostly static pages, or self-hosting a blog, it's OK, although I'd be pissed if I had to rewrite an article. Makes you wonder who their target market is.
    
    sitzkrieg 5 years ago
    
    has aws ever suffered multi region blackouts? or even all AZs in one?
- Macha 5 years ago
  
  Fewer customers, fewer moving parts, less to go wrong. I'm sure a lot of places were basically rolling the dice, but I'd imagine a lot won that bet while those that lost it had a much more difficult recovery process than today's vendors.
- RantyDave 5 years ago
  
  Plenty. But in most cases there was luck involved.
CodesInChaos 5 years ago

For non trivial services (in particular ones that need consistency), I'm skeptical that it's realistic to achieve 5-nines at competitive cost. You'll probably achieve it for several years, and then you run into a complex failure which takes 1h to fix, blowing through the downtime budget of a decade.
- supermatt 5 years ago
  
  In Heroku's case:
  Their "dynos" are ephemeral. They could literally deploy the images to a backup environment hosted elsewhere. Their data services could all be synchronously replicated to that backup environment. And that's it - they don't offer any other core services (and their other services run on the same platform).
  So for (at most) double their infrastructure cost they have another network they can immediately switch over to.
  And Heroku's already soooo expensive. Even if you assumed a 1-to-1 mapping of EC2 instances to Heroku dynos (which they don't use - it's multiple dynos per backing instance), you would be looking at a 5-10x markup over on-demand instances! Reserved instances are even less expensive. Spot instances can be 5x cheaper again!
  I think they could retain their current pricing model and still offer this kind of resiliency - at a minimum.
  Fly.io is making strides in this direction, distributing the VMs across multiple availability zones, and routing traffic internally from their multiple geographically distributed POPs - but you need to roll your own DB VMs for multi-AZ synchronization.
  EDIT: seems they do provide managed postgres with synchronized replication now (in beta), neat!
makeitdouble 5 years ago

TBF there are very few real-world services that offer customers and non-giant companies 5-nines of uptime.
E.g. my electricity provider doesn't.
- supermatt 5 years ago
  
  Service providers such as Heroku should easily be able to have five-nines uptime.
  They ONLY offer fully managed services, which can be backed by the multi-cloud, multi-AZ setup I refer to - but instead a single product outage from a single upstream provider in a single datacenter is affecting all their clients.
  This is a regular occurrence for Heroku - and they charge a substantial premium for their "service".
  - danjac 5 years ago
    
    https://floridanewstimes.com/the-decline-of-heroku-infoworld...
busterarm 5 years ago

AWS doesn't offer 5-nines uptime on Compute.
It's S3 that is 5-nines availability.
AWS's published SLA for Compute (which includes EBS) is 4-nines.
https://aws.amazon.com/compute/sla/
- terom 5 years ago
  
  It's worth noting that the AWS EC2 99.99% SLA is a regional SLA, i.e. it only covers a situation where multiple AZs are down simultaneously.
  One AZ going down is not covered by the 99.99% SLA. AFAIK there isn't any per-AZ SLA, only a single-instance SLA of 99.5%. The effective per-AZ SLA is going to be somewhere between the two.
  - busterarm 5 years ago
    
    Design accordingly. :)
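As a rough illustration of what "design accordingly" implies for the numbers quoted above, the sketch below computes the combined availability of redundant copies of a service. It rests on a strong assumption that is stated here and in the code: failures are independent and any single healthy copy is enough to serve traffic. Correlated failures (a shared control plane, a regional EBS problem like this one) break that assumption, so treat the results as an upper bound. The function names are hypothetical.

```python
def combined_availability(single: float, replicas: int) -> float:
    """Availability of `replicas` copies, ASSUMING independent failures
    and that any one healthy copy can serve all traffic."""
    if not 0.0 <= single <= 1.0:
        raise ValueError("single must be a fraction between 0 and 1")
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    # The system is down only when every copy is down at the same time.
    return 1.0 - (1.0 - single) ** replicas


def replicas_needed(single: float, target: float, limit: int = 100) -> int:
    """Smallest number of independent copies that reaches `target`."""
    if not 0.0 < single <= 1.0 or not 0.0 < target < 1.0:
        raise ValueError("single must be in (0, 1] and target in (0, 1)")
    for n in range(1, limit + 1):
        if combined_availability(single, n) >= target:
            return n
    raise ValueError("target not reachable within limit")


if __name__ == "__main__":
    single_instance = 0.995  # the single-instance SLA figure quoted above
    for n in (1, 2, 3):
        print(n, f"{combined_availability(single_instance, n):.7%}")
    print("copies for five nines:", replicas_needed(single_instance, 0.99999))
```

Under these idealized assumptions, two copies at 99.5% each land at 99.9975%, and it takes three to clear 99.999%.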
- supermatt 5 years ago
  
  I know they don't, but in the link I gave they tell you how to achieve 5-nines via redundancy - something these cloud service providers (like Heroku) neglect to implement.
ranguna 5 years ago

99.999% uptime still means around 7 and a half hours of downtime per year.
- rafBM 5 years ago
  
  99.999% uptime is 5m 15s per year: https://uptime.is/99.999
- bobviolier 5 years ago
  
  No, just 5 minutes https://uptime.is/99.999
- ranguna 5 years ago
  
  Oops, quick maths == wrong maths.
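For anyone who wants to check the arithmetic in the replies above, here is a small Python sketch that converts an availability percentage into a downtime budget. It assumes a 365-day year (an assumption; calculators that use 365.25 days differ by a fraction of a second), and the function names are made up for this example.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # assumption: non-leap year


def downtime_budget_seconds(availability_percent: float, years: float = 1.0) -> float:
    """Seconds of downtime allowed by an availability target over `years`."""
    unavailable_fraction = 1.0 - availability_percent / 100.0
    return unavailable_fraction * SECONDS_PER_YEAR * years


def format_duration(seconds: float) -> str:
    """Render a number of seconds as 'Hh Mm Ss', rounded to whole seconds."""
    total = round(seconds)
    hours, remainder = divmod(total, 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours}h {minutes}m {secs}s"


if __name__ == "__main__":
    for pct in (99.5, 99.9, 99.99, 99.999):
        budget = downtime_budget_seconds(pct)
        print(f"{pct}% -> {format_duration(budget)} per year")
```

For 99.999% this gives 0h 5m 15s per year, matching the linked calculator. Passing `years=10` gives roughly 52.6 minutes, so a single one-hour incident does exceed a decade of five-nines budget.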

kentf (OP) 5 years ago

Luckily we had followers in different zones by chance. Still scary though. What are the best solutions for replicating Heroku in different clouds?

mappu 5 years ago

If by "replicating" you mean replicating the experience, then you're looking for Dokku.
Easy setup + MIT license and you get the same git push deploys, Heroku-compatible buildpacks or bring your own Dockerfile.
I would probably recommend running it on Linode / DigitalOcean / Vultr.
- strken 5 years ago
  
  Dokku is great for a single host. If you have a more complicated setup you can go a long way with post-receive hooks, although it won't be as magical without buildpacks.
aswinmohanme 5 years ago

Fly.io comes close. You have Docker-based builds in the cloud with native multi-region support. Postgres is in beta right now.
- mcintyre1994 5 years ago
  
  They also have built-in Heroku migration: https://fly.io/docs/app-guides/speed-up-a-heroku-app/
corobo 5 years ago

DigitalOcean's App Platform plays nicely with Heroku buildpacks from what I've seen.

thomaslord 5 years ago

Interestingly, I have an app using Heroku Postgres that seems to have had zero issues during this outage. I can see data that was stored during this period of time and Rollbar doesn't show any DB connection errors.

pvsukale3 5 years ago

I have been trying to deploy a fix for a bug we shipped yesterday. I think they have stopped deploys as well, as the deploys are being rejected without any explanation.

TedShiller 5 years ago

Heroku may have been down for 2+ hours, but MongoDB has been unreliable for 10+ years.

Insalgo 5 years ago

And why are you referring to mongo here?
