DigitalOcean block storage is down
status.digitalocean.com

Personally, if DO don't have anything new in a status post, I'd prefer seeing an update that says something like "We are continuing to work on the issue. Nothing new to report. Next update in X minutes." That is a lot easier for me to parse than the text that someone seems to be copy/pasting in each update.
Would be great if statuspage.io had a button that, when pushed, publishes a message similar to your suggestion.
What unholy thing did they do that broke it across 12 different datacenters, good lord.
This does seem to indicate a notable lack of isolation for the blast radius between DO datacenters. Would be interesting to see the post mortem.
I get the feeling that whoever writes the post-mortem is going to have a bit of pressure to assure folks that there is isolation going forward.
That would be a bad sign that there’s something wrong with the culture. I would hope for a postmortem that identified flaws that genuinely needed to be fixed.
Those are not mutually exclusive and actually a good idea. You want to fix this specific issue, but also ensure that whatever process took down one DC doesn't affect other DCs. That's scaling and redundancy 101 - not sure why it would be something wrong.
> Those are not mutually exclusive and actually a good idea.
The goals “assuring folks that there is isolation” and “identifying flaws that need to be fixed” are somewhat contrary to each other.
The post-mortem should identify flaws in systems, processes, and thinking. It should not try to assure people that there is isolation when there is evidence to the contrary.
> You want to fix this specific issue, but also ensure that whatever process took down one DC doesn't affect other DCs.
This was a multi-regional failure. So, this specific issue is also an isolation problem, among other things. You will want to ensure that this problem doesn’t happen again but you shouldn’t assure that it won’t.
I would think having all zones go down is a flaw that genuinely needed to be fixed.
That’s not the flaw, that’s the outcome. The purpose of a post-mortem is to identify the flaws that caused that outcome, and ways to fix those flaws.
Whoever broke it is going to feel significant pressure to actually isolate things too.
Propagating mistakes across all the things is devops
DevOps Borat: "To make error is human. To propagate error to all server in automatic way is #devops."
Google Cloud recently had a global outage. DevOps tools that interact with resources across all data centers are usually the culprit.
Google's RCA is here, I believe: https://status.cloud.google.com/incident/cloud-networking/19...
Reading between the lines, it looks like their maintenance system needed to take down several Borg clusters within a single AZ, and their BGP route reflectors all ran from the same set of logical clusters. They'd tried to set up geo-redundancy by having different BGP speakers across different AZs, but they were all parented by the same set of logical clusters, and the maintenance engine descheduled all of them together. Then the network ran okay ("designed to fail static for short periods of time") until the routes expired, after which routes got withdrawn and traffic blackholed.
They realized the issue within an hour. Unfortunately, since they took down multiple replicas of their network control plane, they lost the Paxos primary and had to rebuild the configuration.
(Disclaimer: I work in Azure, I just find it fascinating to look at Google's RCAs because failure provides an insight into their architecture and risk engineering.)
Or just bad rollout procedures.
potato, potato ;)
Probably the old "one command ran on everything"
tmux is a dangerous tool in the wrong hands.
to be fair, it's dangerous even in the best hands. mistakes happen but business processes need to be in place to prevent catastrophes...
every time i see something like this, my inclination is to blame the CTO, not the engineer who pulled the trigger.
A post mortem should always be a place to highlight deficiencies in processes and communicate necessary improvements put into place, not to blame. Blame should only occur if the cadence of outages becomes excessive. Complex systems are tricky, and to err is human.
Disclaimer: Ops/infra engineer in a previous life.
I wonder how many outages these days start with something like "kubectl apply" and then things go horribly awry.
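It's surprisingly easy to do. A hypothetical loop like this (the manifest name is made up) is all it takes to push the same bad change to every cluster at once:

    # Hypothetical: apply the same (possibly broken) manifest to every configured cluster
    for ctx in $(kubectl config get-contexts -o name); do
        kubectl --context "$ctx" apply -f storage-config.yaml
    done

A staged rollout with some bake time between regions is the usual guard against exactly this.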
We can blame whoever we want but you better believe shit rolls downhill at most places.
Until it is a big enough F-up that an executive's head must roll.
There's a famous corollary to that approach: "Fire him? Why, I just spent 10 million dollars _educating_ him"
(regrettably I can't find any evidence it's a true (quote|story), but I enjoy the sentiment)
It could be DNS. Azure has had an all-region failure due to a single DNS provider outage. It's possible that the same DNS provider's outage was also causing problems for GCE and AWS at the same time.
That wouldn't be at the top of my list. We have "Volumes" for databases and they were inaccessible for like 6 hours. I don't think any DNS is involved in mounting these. But hey, there's always a lot of crap hidden behind the scenes :)
I would be absolutely amazed if DNS was not involved in mounting a block storage volume.
A bad puppet/ansible/etc commit is the most probable explanation.
dee ennn esss
A bug with no obvious side effects that only became visible once all the data centers were upgraded?
Happens. Statistics are hard.
This is OT, but I have a droplet on DO and I'm amazed at the amount of malicious traffic it gets. Is it normal for a very private vps to receive thousands of ssh attempts per hour? I have fail2ban installed and the jail is so busy it's quite astounding. Anyone with more web hosting experience that can weigh in?
I work for a web hosting company in Texas, and this is ridiculously common. Any public IP with any public service at all will be poked, prodded, and generally made uncomfortable by every bot and crawler you can think of, trying common password combinations and scanning for common vulnerabilities in popular software. This catches so many of our customers by surprise, who tend to mistakenly believe they're being targeted in some kind of attack. Generally they're not, unless they're running something vulnerable and one of the bots noticed.
Fail2ban is great to at least stem the tide. It's good at slowing down SSH brute forcing, and can be set up to throttle poorly behaved scrapers so your site isn't getting hammered constantly. If you can deal with the inconvenience, it's even better to put services that don't need to be truly public behind an IP whitelist. That stops the vast majority of malicious traffic, most of which is going after the low hanging fruit anyway.
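For SSH, a whitelist can be as small as a couple of iptables rules (a sketch; 203.0.113.10 is a placeholder for your own trusted address, and be careful not to lock yourself out):

    # Allow SSH only from a trusted address, drop everything else (append order matters)
    iptables -A INPUT -p tcp --dport 22 -s 203.0.113.10 -j ACCEPT
    iptables -A INPUT -p tcp --dport 22 -j DROP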
Otherwise, it's kinda just a fact of life. With the good traffic also comes the bad.
I always switch my outward-facing ssh servers to key-only. Is there any advantage to running fail2ban as well?
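For reference, "key-only" here means roughly this in /etc/ssh/sshd_config (standard OpenSSH options; run `sshd -t` before restarting):

    # Disable password and interactive auth, keep public key auth
    PasswordAuthentication no
    ChallengeResponseAuthentication no
    PubkeyAuthentication yes
    PermitRootLogin prohibit-password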
Presumably then iptables handles the reject/drop rather than SSH sessions being created that fail at login? (Disclaimer, I don't know exactly how any sshd handles dropping clients who attempt to connect by password when you have set it to cert only; but it seems like dropping at the firewall would be more optimal).
It makes your logs easier to read because they aren't full of ssh failures.
For my DO droplet I also changed the SSH port to a silly-high random port, and the last time I checked it reduced the number of nosy bots knocking at the door to zero.
I used to do so too, but sometimes had problems with very restrictive firewalls killing connections to high/unknown ports when traveling. They would only allow vpns or ssh to connect.
Cheers for weighing in. A whitelist is a good solution, since the sheer number of attempts is making me uncomfortable. It seems to be accelerating over time as well, which is even more disturbing.
> Is it normal for a very private vps to receive thousands of ssh attempts per hour?
Yes. The thing about the IPv4 space is that it's really not that big (3,706,452,992 public addresses), so it's pretty trivial to poke every single one, especially if you fine-tune your port list.
The most common advice is to hide your private services. Instead of using port 22 for ssh, use 23231. It's a little more annoying, but you can also use port knocking: to open port 22 (or whatever port you like), you first have to poke port 23123, then 7654, then 39212 within a short period of time, and then the port-knocking software will open up port 22. (Or a combo of both: change the default port and use port knocking.)
It won't stop people "ringing the doorbell" to see if anyone is home, but it will help with them then trying to brute-force their way past the login prompt.
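If you go the port-knocking route, knockd is the usual tool; a minimal sketch of /etc/knockd.conf (reusing the made-up knock sequence above) looks roughly like this:

    [options]
        logfile = /var/log/knockd.log

    [openSSH]
        sequence    = 23123,7654,39212
        seq_timeout = 10
        tcpflags    = syn
        command     = /sbin/iptables -I INPUT -s %IP% -p tcp --dport 22 -j ACCEPT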
Another really good option is single packet authorization, which, again, is a little more complex than port knocking but also another step up in security.
I work for a hosting company and this is totally normal. Digital Ocean and other VPS providers' IP ranges are specifically targeted, as many amateurs run servers there.
If you've disabled password logins then just don't worry about it. fail2ban is overkill; you can rate-limit with firewalld or iptables without needing extra tools.
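Something like the iptables "recent" module is enough for that (a sketch; tune the window and hit count to taste):

    # Track new SSH connections, then drop a source's 4th new connection within 60 seconds
    iptables -A INPUT -p tcp --dport 22 -m conntrack --ctstate NEW -m recent --name SSH --set
    iptables -A INPUT -p tcp --dport 22 -m conntrack --ctstate NEW -m recent --name SSH --rcheck --seconds 60 --hitcount 4 -j DROP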
It is "normal". Even my home fixed IP gets it, without any service running on it other than ssh.
I had the same experience on DO as well as a few other providers.
Changing the ssh port to something in the 50_000 range drastically reduced the number of attempts and left my logs much cleaner :).
Well, I haven't bothered looking in a long time. But back when I first got a cable modem in the late 90's, the malicious access attempts filled up my hard drive in just a couple of weeks. I don't remember the size of the HD, but I can only imagine the situation has gotten much, much worse since then.

> Is it normal for a very private vps to receive thousands of ssh attempts per hour?

Yup, this is normal. When I can't change the SSH port (e.g. for compatibility), I switch f2b to permanently ban, which should reduce any incurred load by black-holing instead of attempting authentication as the list grows more comprehensive over time (this won't affect other services, e.g. apache, in case a user is unwittingly part of a botnet).
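The permanent ban is just a jail.local tweak (a sketch; newer fail2ban versions accept -1 for "forever", older ones need a very large bantime instead):

    # /etc/fail2ban/jail.local
    [sshd]
    enabled  = true
    maxretry = 3
    bantime  = -1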
Fatfingered typist beware...
Better use a password manager in this case (or even better, use public key auth!)
hah, i also tend to up the attempts... If you have more than one server you can always tunnel through one of the others if you lock yourself out. Worst case of course VPS console.
Off-topic reply: you should monitor the amount of malicious traffic coming from DO networks too. (I did for a few customers at different ISPs and it's insane.)
Not sure why the previous incident page got flagged. This is the new one.
It's affecting us for real, taking almost our whole service - serpapi.com - down, since we store database files on block storage volumes.
I took a look at the flags on these stories and am pretty sure they're from users who are tired of "X is down" submissions, which tend to get posted a lot and often to be a little on the trivial side.
However, since several HN users are expressing that this issue is genuinely affecting them, I've turned off flags on the OP about this and merged the comments here.
The "across all regions" part makes this one different for me, and interesting even though I'm not a customer of their block storage. I'm curious about the sequence of events, or design choices, that would cause that.
I reported it and they were like "what? oh..."
Then the status page changed and as things got worse, the dashboard page got an announcement as well.
Isn't Digital Ocean running Ceph for their block storage?
I wonder - as others have suggested - whether they may have stretched the cluster across datacenters?!
Would be interested in the post-mortem.
Yes, DO uses Ceph: https://blog.digitalocean.com/why-we-chose-ceph-to-build-blo...
Thank you Digital Ocean for once again proving that 'The Cloud' is not a backup.
'The Cloud' is _a_ backup. Just don't let it be your only backup!
It's not backed up if you don't have 3 copies.
The saying we always use is "If it is not in 3 places, it doesn't exist."
And another: "3 copies, at least one offsite"
This is your weekly reminder that anything you want to be reasonably “HA” should span multiple vendors in multiple DCs.
That would be pretty cool, but to have that you need a solution that tolerates high network latency, i.e., pretty much a cold backup. For some time I thought it was a pretty last-century option, but having experimented with it for a while now, it's the option with the lowest impact on system performance. More importantly, it's reasonably resilient.
I've read your comment now about 4 times and all I have come up with is "huh?"
Literally thousands if not millions of organisations operate multi-DC infrastructure across the planet.
Is it harder than setting up a single box in one DC? Yes. Is it harder than setting up a mini-cluster of boxes in one DC? Yes. Is it rocket science? No.
Yeah, the myth that you can "just use AWS" and get 99.9999999 percent uptime is coming to an end...
Oh I'm sure the myth will persist for many years.
Anyone have a review of using DO k8s or DO managed DB in production?
DigitalOcean just posted a post-mortem on http://status.digitalocean.com/incidents/g76kgjxqrzxs
(the same url)
Higher latency (per the status page) is not the end of the world, especially if it's just "may experience" higher latency.
That wording kinda ticked me off because our volume was completely inaccessible. Rebooting did not mount it at all.
Hrm, Atlassian BitBucket is also down. Just a coincidence? Does BB use DO?
https://bitbucket.status.atlassian.com/incidents/4t1pkwrdtl8...
BitBucket definitely doesn't use DO.
Their block storage is such a failure. I've been back and forth with support for over 2 months now trying to automatically delete files with lifecycles, and it's still not resolved.
Since you're trying to "delete file with lifecycles", I'm quite sure your problem is with their object storage (called Spaces), and not their block storage.
It looks like they have just updated it as resolved and monitoring.
I was always wondering how I could proactively find out when something like this breaks or some service has an outage. As a result, I built this tool: http://incidentok.com
great idea - going to give it a whirl this week.
i'm curious about the slack integration. can you provide some more info on what that looks like? eg. just a message in real-time when it goes down? a daily message of statuses? etc. Any sort of customization w/ it?
I currently use a soup of zapier zaps to take care of this problem.
Hey, thanks. IncidentOK will send a message to Slack using a webhook as soon as an incident is reported by any product. I hadn't thought of sending a daily status, but I'm open to suggestions.
Message looks like this https://imgur.com/jjbMKj8
So that's why bot attacks and spam traffic were lower.
This has really been down for more than 2 hours!!!
Last night I was testing a DO managed Kubernetes cluster with a persistent volume claim, and the volume took 15 minutes to reattach after the pod was rescheduled to another host. I thought it was just some weird hiccup and went to bed.
The incident report indicated the problem started 4 hours ago (around 9pm GMT), but I was having problems around 4pm. It's definitely not a 2-hour incident.
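If anyone else hit the slow reattach, the attach events usually show whether it's the CSI driver or the backing storage that's stuck (a sketch; the PVC name is a placeholder):

    kubectl describe pvc my-data-claim       # placeholder PVC name; shows binding state and events
    kubectl get volumeattachments            # has the CSI attach actually completed?
    kubectl get events --sort-by=.lastTimestamp | grep -iE 'attach|mount'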
Our disks in London went down at about 8:45pm UTC (our 10-minute 100% disk utilization alert triggered at five to nine), and DO's recovery message was sent out at about 2am UTC. We switched our service (keychest.net) back on at 3:15am.
DigitalOcean: bad experience.
Their ad was “you’ve been developing like a beast and your app is ready to go live”
DO is a nice thing to play around with and maybe launch something, but I wouldn’t run full production on it.