DigitalOcean block storage is down

status.digitalocean.com

165 points by kaendfinger 6 years ago · 87 comments

CaliforniaKarl 6 years ago

Personally, if DO doesn’t have anything new for a status post, I’d prefer seeing an update that says something like “We are continuing to work on the issue. Nothing new to report. Next update in X minutes.” That is a lot easier for me to parse than the text that someone seems to be copy/pasting into each update.

  • iamsb 6 years ago

    Would be great if statuspage.io had a button that, when pushed, publishes a message along those lines.

kyledrake 6 years ago

What unholy thing did they do that broke it across 12 different datacenters, good lord.

  • alexeldeib 6 years ago

    This does seem to indicate a notable lack of isolation for the blast radius between DO datacenters. Would be interesting to see the post mortem.

    • protomyth 6 years ago

      I get the feeling that whoever writes the post-mortem is going to have a bit of pressure to assure folks that there is isolation going forward.

      • klodolph 6 years ago

        That would be a bad sign that there’s something wrong with the culture. I would hope for a postmortem that identified flaws that genuinely needed to be fixed.

        • viraptor 6 years ago

          Those are not mutually exclusive, and both are actually good ideas. You want to fix this specific issue, but also ensure that whatever process took down one DC doesn't affect other DCs. That's scaling and redundancy 101 - not sure why that would be a sign of something wrong.

          • klodolph 6 years ago

            > Those are not mutually exclusive and actually a good idea.

            The goals “assuring folks that there is isolation” and “identifying flaws that need to be fixed” are somewhat contrary to each other.

            The post-mortem should identify flaws in systems, processes, and thinking. It should not try to assure people that there is isolation when there is evidence to the contrary.

            > You want to fix this specific issue, but also ensure that whatever process took down one DC doesn't affect other DCs.

            This was a multi-regional failure. So, this specific issue is also an isolation problem, among other things. You will want to ensure that this problem doesn’t happen again but you shouldn’t assure that it won’t.

        • protomyth 6 years ago

          I would think having all zones go down is a flaw that genuinely needed to be fixed.

          • klodolph 6 years ago

            That’s not the flaw, that’s the outcome. The purpose of a post-mortem is to identify the flaws that caused that outcome, and ways to fix those flaws.

      • swsieber 6 years ago

        Whoever broke it is going to feel significant pressure to actually isolate things too.

    • notyourday 6 years ago

      Propagating mistakes across all the things is devops

    • nodesocket 6 years ago

      Google Cloud recently had a global outage. DevOps tools that interact with all resources across data centers are usually the culprit.

      • sterlind 6 years ago

        Google's RCA is here, I believe: https://status.cloud.google.com/incident/cloud-networking/19...

        Reading between the lines, it looks like their maintenance system needed to take down several Borg clusters within a single AZ, and their BGP route reflectors all ran from the same set of logical clusters. They'd tried to set up geo-redundancy by having different BGP speakers across different AZs, but they were all parented by the same set of logical clusters, and the maintenance engine descheduled all of them together. Then the network ran okay ("designed to fail static for short periods of time") until the routes expired, after which routes got withdrawn and traffic blackholed.

        They realized the issue within an hour. Unfortunately, since they took down multiple replicas of their network control plane, they lost the Paxos primary and had to rebuild the configuration.

        (Disclaimer: I work in Azure, I just find it fascinating to look at Google's RCAs because failure provides an insight into their architecture and risk engineering.)
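
        A minimal sketch of the pre-flight check that failure mode suggests (hypothetical names, not Google's actual tooling): before descheduling anything, group the redundant replicas by the failure domain the maintenance actually acts on, and refuse the operation if it would remove all of them at once.

            # Hypothetical illustration of the blast-radius check implied above.
            # Replicas are spread across AZs, but the schedulable failure domain
            # is the logical cluster that parents them.
            replicas = [
                {"name": "rr-1", "az": "us-east1-a", "logical_cluster": "lc-7"},
                {"name": "rr-2", "az": "us-east1-b", "logical_cluster": "lc-7"},
                {"name": "rr-3", "az": "us-east1-c", "logical_cluster": "lc-7"},
            ]

            def safe_to_deschedule(replicas, clusters_to_drain):
                """Refuse maintenance that would deschedule every replica at once."""
                survivors = [r for r in replicas
                             if r["logical_cluster"] not in clusters_to_drain]
                return len(survivors) > 0

            # Geo-redundant by AZ, yet draining one logical cluster takes out all
            # three replicas, which is roughly the failure described in the RCA.
            print(safe_to_deschedule(replicas, {"lc-7"}))  # False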

    • temikus 6 years ago

      Or just bad rollout procedures.

  • bluedino 6 years ago

    Probably the old "one command ran on everything"

    • astrodust 6 years ago

      tmux is a dangerous tool in the wrong hands.

      • rubbingalcohol 6 years ago

        To be fair, it's dangerous even in the best hands. Mistakes happen, but business processes need to be in place to prevent catastrophes...

        Every time I see something like this, my inclination is to blame the CTO, not the engineer who pulled the trigger.

        • toomuchtodo 6 years ago

          A post mortem should always be a place to highlight deficiencies in processes and communicate necessary improvements put into place, not to blame. Blame should only occur if the cadence of outages becomes excessive. Complex systems are tricky, and to err is human.

          Disclaimer: Ops/infra engineer in a previous life.

          • astrodust 6 years ago

            I wonder how many outages these days start with something like "kubectl apply" and then things go horribly awry.

        • solotronics 6 years ago

          We can blame whoever we want but you better believe shit rolls downhill at most places.

          • GhettoMaestro 6 years ago

            Until it is a big enough F-up that an executive's head must roll.

            • mdaniel 6 years ago

              There's a famous corollary to that approach: "Fire him? Why, I just spent 10 million dollars _educating_ him"

              (regrettably I can't find any evidence it's a true (quote|story), but I enjoy the sentiment)

  • nonbirithm 6 years ago

    It could be DNS. Azure has had an all-region failure due to a single DNS provider outage. It was possible that the same DNS provider's outage was also causing problems for GCE and AWS at the same time.

    https://news.ycombinator.com/item?id=19812919

    • dc352 6 years ago

      That wouldn't be at the top of my list. We have "Volumes" for databases and they were inaccessible for like 6 hours. I don't think any DNS is involved in mounting these. But hey, there's always a lot of crap hidden behind the scenes :)

      • markonen 6 years ago

        I would be absolutely amazed if DNS was not involved in mounting a block storage volume.

  • pmlnr 6 years ago

    A bad puppet/ansible/etc commit is the most probable explanation.

  • mdellavo 6 years ago

    dee ennn esss

  • hinkley 6 years ago

    A bug with no obvious side effects that only became visible once all data centers were upgraded?

    Happens. Statistics are hard.

pastrami_panda 6 years ago

This is OT, but I have a droplet on DO and I'm amazed at the amount of malicious traffic it gets. Is it normal for a very private vps to receive thousands of ssh attempts per hour? I have fail2ban installed and the jail is so busy it's quite astounding. Anyone with more web hosting experience that can weigh in?

  • zeta0134 6 years ago

    I work for a web hosting company in Texas, and this is ridiculously common. Any public IP with any public service at all will be poked, prodded, and generally made uncomfortable by every bot and crawler you can think of, trying common password combinations and scanning for common vulnerabilities in popular software. This catches so many of our customers by surprise, who tend to mistakenly believe they're being targeted in some kind of attack. Generally they're not, unless they're running something vulnerable and one of the bots noticed.

    Fail2ban is great to at least stem the tide. It's good at slowing down SSH brute forcing, and can be set up to throttle poorly behaved scrapers so your site isn't getting hammered constantly. If you can deal with the inconvenience, it's even better to put services that don't need to be truly public behind an IP whitelist. That stops the vast majority of malicious traffic, most of which is going after the low hanging fruit anyway.

    Otherwise, it's kinda just a fact of life. With the good traffic also comes the bad.
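
    For the curious, a toy sketch of the loop a tool like fail2ban automates (this assumes standard OpenSSH "Failed password ... from <ip>" lines in /var/log/auth.log; a real jail also tracks time windows, expires bans, and inserts firewall rules):

        # Count failed SSH logins per source IP and flag repeat offenders.
        import re
        from collections import Counter

        FAILED = re.compile(r"Failed password for .+ from (\d{1,3}(?:\.\d{1,3}){3})")
        THRESHOLD = 5  # failures before an IP would be banned

        counts = Counter()
        with open("/var/log/auth.log") as log:
            for line in log:
                match = FAILED.search(line)
                if match:
                    counts[match.group(1)] += 1

        for ip, failures in counts.most_common():
            if failures >= THRESHOLD:
                print(f"would ban {ip} ({failures} failed attempts)")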

    • davrosthedalek 6 years ago

      I always switch my outward-facing ssh servers to key-only. Is there any advantage to running fail2ban as well?

      • pbhjpbhj 6 years ago

        Presumably then iptables handles the reject/drop rather than SSH sessions being created that fail at login? (Disclaimer: I don't know exactly how any sshd handles dropping clients who attempt to connect by password when you have set it to cert only, but it seems like dropping at the firewall would be more efficient.)

      • pnutjam 6 years ago

        It makes your logs easier to read because they aren't full of ssh failures.

      • ac2u 6 years ago

        For my DO droplet I also changed the ssh port to a silly-high random port, and the last time I checked it reduced the number of nosy bots knocking at the door to zero.

        • davrosthedalek 6 years ago

          I used to do so too, but sometimes had problems with very restrictive firewalls killing connections to high/unknown ports when traveling. They would only allow vpns or ssh to connect.

    • pastrami_panda 6 years ago

      Cheers for weighing in. A whitelist is a good solution, since the sheer amount of attempts is making me uncomfortable. It seems to be accelerating over time as well which is even more disturbing.

  • Crosseye_Jack 6 years ago

    > Is it normal for a very private vps to receive thousands of ssh attempts per hour?

    Yes. The thing about the IPv4 space is that it’s really not that big (3,706,452,992 public addresses), so it’s pretty trivial to poke every single one, especially if you fine-tune your port list.

    The most common advice is to hide your private services: instead of using port 22 for ssh, use something like 23231. It’s a little more annoying, but you can also use port knocking. To open port 22 (or whatever port you like), you first have to poke port 23123, then 7654, then port 39212 within a short period of time, and then the port-knocking software will open up port 22. (Or use a combo of both: change the default port and add port knocking.)

    It won’t stop people “ringing the door bell” to see if anyone is home, but it will help with them then trying to brute-force past the login prompt.
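
    A minimal client-side sketch of the knock sequence described above (the ports are the ones from the example; it assumes a knock daemon such as knockd is watching for that sequence on the server, and the host address is a placeholder):

        # "Knock" on the agreed ports in order, then try the real SSH port.
        import socket

        HOST = "203.0.113.10"              # placeholder address
        KNOCK_SEQUENCE = [23123, 7654, 39212]

        def knock(host, port, timeout=0.5):
            # The knock ports are closed, so the connect fails; the firewall
            # only needs to see the incoming SYN to register the knock.
            with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
                s.settimeout(timeout)
                try:
                    s.connect((host, port))
                except OSError:
                    pass

        for port in KNOCK_SEQUENCE:
            knock(HOST, port)

        # If the knocks landed within the daemon's window, port 22 is now open.
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
            s.settimeout(3)
            s.connect((HOST, 22))
            print("ssh port reachable:", s.getpeername())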

  • ollybee 6 years ago

    I work for a hosting company and this is totally normal. Digital Ocean and other VPS providers' IP ranges are specifically targeted, as many amateurs run servers there.

    If you've disabled password logins then just don't worry about it. fail2ban is overkill; you can rate limit with firewalld or iptables without needing extra tools.

  • pmlnr 6 years ago

    It is "normal". Even my home fixed IP gets it, without any service running on it other than ssh.

  • mercer 6 years ago

    I had the same experience on DO as well as a few other providers.

    Changing the ssh port to something in the 50_000 range drastically reduced the number of attempts and left my logs much cleaner :).

  • gmiller123456 6 years ago

       Is it normal for a very private vps to receive thousands of ssh attempts per hour?
    
    Well, I haven't bothered looking in a long time. But back when I first got a cable modem in the late '90s, the logs of malicious access attempts filled up my hard drive in just a couple of weeks. I don't remember the size of the HD, but I can only imagine the situation has gotten much, much worse since then.

  • tomxor 6 years ago

    Yup, this is normal. When I can't change the SSH port (e.g. for compatibility), I switch f2b to permanently ban, which should reduce any incurred load by black-holing instead of attempting authentication as the list grows more comprehensive over time. (This won't affect other services, e.g. apache, in case a user is unwittingly part of a botnet.)

    • pnutjam 6 years ago

      Fatfingered typist beware...

      • jaster 6 years ago

        Better use a password manager in this case (or even better, use public key auth!)

      • tomxor 6 years ago

        Hah, I also tend to up the attempts... If you have more than one server you can always tunnel through one of the others if you lock yourself out. Worst case, of course, there's the VPS console.

  • mobilemidget 6 years ago

    Off-topic reply: you should monitor the amount of malicious traffic coming from DO networks too. (I did for a few customers at different ISPs and it's insane.)

hartator 6 years ago

Not sure why the previous incident page got flagged. This is the new one.

It's affecting us for real, making almost our whole service - serpapi.com - down, as we are storing database files on block storage volumes.

  • dang 6 years ago

    I took a look at the flags on these stories and am pretty sure they're from users who are tired of "X is down" submissions, which tend to get posted a lot and often to be a little on the trivial side.

    However, since several HN users are expressing that this issue is genuinely affecting them, I've turned off flags on the OP about this and merged the comments here.

    • tyingq 6 years ago

      The "across all regions" part makes this one different for me, and interesting even though I'm not a customer of their block storage. I'm curious about the sequence of events, or design choices, that would cause that.

  • astrodust 6 years ago

    I reported it and they were like "what? oh..."

    Then the status page changed and as things got worse, the dashboard page got an announcement as well.

louwrentius 6 years ago

Isn't Digital Ocean running Ceph for their block storage?

I wonder - as others suggested - whether they stretched the cluster across datacenters?!

Would be interested in the post-mortem.

jacquesm 6 years ago

Thank you Digital Ocean for once again proving that 'The Cloud' is not a backup.

  • vinw 6 years ago

    'The Cloud' is _a_ backup. Just don't let it be your only backup!

    • pnutjam 6 years ago

      It's not backed up if you don't have 3 copies.

      • lunchables 6 years ago

        The saying we always use is "If it is not in 3 places, it doesn't exist."

        And another: "3 copies, at least one offsite"

stephenr 6 years ago

This is your weekly reminder that anything you want to be reasonably “HA” should span multiple vendors in multiple DCs.

  • dc352 6 years ago

    That would be pretty cool, but to have that you need a solution that tolerates high network latency, i.e., pretty much a cold backup. For some time I thought of it as a pretty last-century option, but having been experimenting for a while now, it's the option with the lowest impact on system performance. More importantly, it's reasonably resilient.

    • stephenr 6 years ago

      I've read your comment now about 4 times and all I have come up with is "huh?"

      Literally thousands if not millions of organisations operate multi-DC infrastructure across the planet.

      Is it harder than setting up a single box in one DC? Yes. Is it harder than setting up a mini-cluster of boxes in one DC? Yes. Is it rocket science? No.

  • Bombthecat 6 years ago

    Yeah, the myth that you can "just use aws" to get 99.9999999 percent uptime is coming to an end...

simplehuman 6 years ago

Anyone have a review of using DO k8s or DO managed DB in production?

unilynx 6 years ago

DigitalOcean just posted a post-mortem on http://status.digitalocean.com/incidents/g76kgjxqrzxs

(the same url)

privateSFacct 6 years ago

Higher latency (per the status page) is not the end of the world, especially if it’s just “may experience” higher latency.

  • erikrothoff 6 years ago

    That wording kinda ticked me off because our volume was completely inaccessible. Rebooting did not mount it at all.

imglorp 6 years ago

Hrm, Atlassian BitBucket is also down. Just a coincidence? Does BB use DO?

https://bitbucket.status.atlassian.com/incidents/4t1pkwrdtl8...

seaghost 6 years ago

Their block storage is such a failure. I’ve been back and forth with support about automatically deleting files with lifecycles for over 2 months now and it’s still not resolved.

  • ngrilly 6 years ago

    Since you're trying to "delete file with lifecycles", I'm quite sure your problem is with their object storage (called Spaces), and not their block storage.

sb8244 6 years ago

It looks like they have just updated it as resolved and monitoring.

sunasra 6 years ago

I was always wondering how I could proactively find out when something like this breaks or some service has an outage. As a result, I built this tool: http://incidentok.com

  • sunsetMurk 6 years ago

    Great idea - going to give it a whirl this week.

    I'm curious about the Slack integration. Can you provide some more info on what that looks like? E.g. just a message in real time when it goes down? A daily message of statuses? Any sort of customization with it?

    I currently use a soup of Zapier zaps to take care of this problem.

    • sunasra 6 years ago

      Hey, thanks! IncidentOK will send a message to Slack using a webhook as soon as an incident is reported by any product. I didn't think of sending a status every day, but I'm open to suggestions.

      The message looks like this: https://imgur.com/jjbMKj8
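
      For anyone wondering what such an integration boils down to, a rough sketch of posting an alert to a Slack incoming webhook (the webhook URL is a placeholder, and this is not IncidentOK's actual code):

          # Post a one-line incident alert to a Slack incoming webhook.
          import requests

          WEBHOOK_URL = "https://hooks.slack.com/services/T000/B000/XXXX"  # placeholder

          def notify(product, status):
              payload = {"text": f":rotating_light: {product}: {status}"}
              resp = requests.post(WEBHOOK_URL, json=payload, timeout=5)
              resp.raise_for_status()

          notify("DigitalOcean", "Block storage degraded in all regions")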

sodosopa 6 years ago

So that’s why bot attacks and spam traffic were lower.

golanggeek 6 years ago

This has really been down for more than 2 hours!!!

  • sondh 6 years ago

    Last night I was testing a DO managed Kubernetes cluster with a persistent volume claim, and the volume took 15 minutes to reattach after the pod was rescheduled to another host. I thought it was just some weird hiccup and went to bed.

    The incident report indicated the problem started 4 hours ago (around 9pm GMT), but I was having problems around 4pm. It's definitely not a 2-hour incident.

    • dc352 6 years ago

      Our disks in London went down at about 8:45pm UTC (a 10-minute 100% disk utilization alert triggered at 5 to) and DO's recovery message was sent out at about 2am UTC. We switched our service (keychest.net) back on at 3:15am.

irfanbaigse 6 years ago

DigitalOcean: bad experience.

jbverschoor 6 years ago

Their ad was “you’ve been developing like a beast and your app is ready to go live”

DO is a nice thing to play around with and maybe launch something, but I wouldn’t run full production on it.
