Cloudflare outage – 24 hours now (news.ycombinator.com)
Looking forward to a more decentralised global Internet, with packets being routed through alternative paths, so outages like these become a non-event.
I understand we do not have the technology for that just yet, and DevOps engineers who can configure TLS terminators on their own are worth their weight in gold.
Hard to imagine how the Internet could ever exist without Cloudflare.
The Internet has been decentralized from the beginning. Now I don't want to claim that Cloudflare made things worse (at least it's enabling a lot of websites to exist without fear of DDoS), but the fact is that Cloudflare made it more centralized, as there are lots of websites that cannot be accessed without going through Cloudflare.
I think you might have missed the joke on this one.
Honestly, the original post could have been a joke, or it could not have been. I regularly talk to people who seem to genuinely believe this sort of thing (on this topic and others).
I am from an era when no one was serious on the Internet, so we didn't need to add /s every time we were sarcastic.
In that era we also saw the last sysadmin configuring Apache with their bare hands without the help of Cloudflare.
It's such a sad reflection of the state of the devops art when setting up a TLS terminator is considered a black art worthy of vaunted experts being paid huge sums. I've seen this descent over the course of my career, watching the profession go from low-level knowledge to being mere YAML-wiring monkeys, slinging shit over the wall to get functionality working just well enough to make it to the nEXt SprInT. The joke above aside, I think it will continue to get worse, with overall stability reflecting that, until it comes to a head and either people re-learn 'lost' skills, or the ball of baling wire, gum and glue implodes more completely.
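To be concrete, terminating TLS yourself really is a small amount of work; roughly this much, as a Node/TypeScript sketch (cert paths and the upstream address are placeholders, not a production config):

```typescript
// Minimal TLS-terminating reverse proxy sketch.
// Cert paths and the upstream host/port are placeholders.
import https from "node:https";
import http from "node:http";
import { readFileSync } from "node:fs";

const UPSTREAM = { host: "127.0.0.1", port: 8080 }; // plain-HTTP app behind the terminator

https
  .createServer(
    {
      key: readFileSync("/etc/ssl/private/example.key"),
      cert: readFileSync("/etc/ssl/certs/example.crt"),
    },
    (clientReq, clientRes) => {
      // Forward the decrypted request to the upstream over plain HTTP.
      const proxyReq = http.request(
        {
          ...UPSTREAM,
          path: clientReq.url,
          method: clientReq.method,
          headers: clientReq.headers,
        },
        (proxyRes) => {
          clientRes.writeHead(proxyRes.statusCode ?? 502, proxyRes.headers);
          proxyRes.pipe(clientRes);
        }
      );
      proxyReq.on("error", () => clientRes.writeHead(502).end("Bad gateway"));
      clientReq.pipe(proxyReq);
    }
  )
  .listen(443);
```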
> decentralised global Internet, with packets being routed through alternative paths
> I understand we do not have the technology for that just yet
I looked at my router, remembered the term "packet-switched network", and wept.
We have the technology. We can make him better than he was. Better, stronger, faster.
That technology is far too advanced, unfortunately. Maybe someday, packets will freely roam the cyber plains, untethered by the reins of single-point-of-failure gatekeepers. Until that halcyon day dawns, we'll remain humble supplicants at the towering obelisks of centralization, chanting incantations of redundancy and resilience, and laying burnt offerings of legacy hardware upon the altars of the uptime deities.
> Looking forward to a more decentralised global Internet, with packets being routed through alternative paths, so outages like these become a non-event.
It's not just packet routing though, many of their other products seem to be affected as well.
Missing the /s I hope.
As I said elsewhere, I come from a time when everyone was fluent in sarcasm on the Internet, without needing disclaimers.
Cloudflare Continuing to Experience Outages - https://news.ycombinator.com/item?id=38121370 - Nov 2023 (2 comments)
Cloudflare Is Down (Again) - https://news.ycombinator.com/item?id=38116892 - Nov 2023 (2 comments)
Cloudflare API Down - https://news.ycombinator.com/item?id=38112515 - Nov 2023 (141 comments)
Cloudflare incident on October 30, 2023 - https://news.ycombinator.com/item?id=38100932 - Nov 2023 (29 comments)
I never experienced an outage longer than 12 hours with any service provider over my ~13-year career (maybe I was lucky). But thanks to Cloudflare I have been able to enjoy not just one, but two ~24h outages in less than a month!
Jokes aside, it must be extremely stressful to be an SRE at CF recently. But something is clearly wrong over there. We have been burned so badly that there is no chance we will touch CF again in the next decade once our migration off of it is complete.
Azure leap year outage is a famous one.
https://azure.microsoft.com/en-us/blog/summary-of-windows-az...
> But something is clearly wrong over there
We renewed our agreement with them in the middle of the year (~$50k) and they've yet to invoice us for it. Our financial controller noticed and I pinged our account rep a few times. Not a peep back.
My limited interaction with their sales & account management org gave me the impression of remarkable levels of disorganization. I know those tend to have a lot of turnover, but it seemed like they also weren't really training or managing them. Really weird vibes.
> two ~24h outages in not even a month
Wasn't the previous outage on Oct 30 less than an hour?
Yep, but on Oct 9 they were down for 22h.
Aiming high for that 2-nines reliability.
You just can't get that level of reliability if you do it yourself, no matter how hard you try.
We won't do it ourselves, but we also won't do it with a provider that has accumulated 50+ hours of downtime in less than a month, all the while providing no communication or support.
That's barely clearing one nine of availability for the last 30 days (93%) for our particular stack on CF; this is insane.
Mind you, when we were hit by the 22h outage on Oct 9, we didn't get so much as an email from CF either during or after the outage.
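(Back-of-the-envelope, assuming roughly 50 hours of downtime over the 30-day window:)

```typescript
// Rough availability over a 30-day window with ~50h of downtime.
const windowHours = 30 * 24; // 720
const downtimeHours = 50;    // approximate total from the outages above
const availability = (windowHours - downtimeHours) / windowHours;
console.log(`${(availability * 100).toFixed(1)}% availability`); // ~93.1%
```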
The no communication or support is the real killer here, imo. I can understand them having some catastrophic issues, which I would be reasonably confident they could fix; it's the uncertainty of the situation that makes me worry. Is this so easy to fix that it will be back in less than an hour and they will communicate then? Are they going dark and I need to find a new provider asap?
To be fair, their status page says emails don’t work haha
It's long been accepted practice in the hosting industry to have your critical communications as a provider (status page, support system) hosted somewhere that's not your network, for this reason.
It continues to amaze me how major infrastructure providers seem to consistently fuck this one up (see also: AWS' status page outage a while ago).
we used to joke that we had 5 8's of availability
BTW Cloudflare tunnels are not working (for at least the last 16 hours), but it says "Operational" and "restored" on the ticket.
Since Shopify's CLI uses Cloudflare tunnels by default to load local resources, Shopify partners are affected by this outage and unable to develop apps unless they use another tunnel.
Wanted to hack today but the universe is telling me to go enjoy the sun
Data point of just one, but my tunnels are working just fine.
If you've previously created a tunnel it will still work; just don't close it, because you won't be able to open a new one.
Yeah, can confirm the same here. This seems to be the most frequent failure mode that I see with my use of Tunnels. Existing connections will continue working, but attempts to create new ones (or other changes) fail. I.e., the control plane fails more often than the core connectivity service.
I certainly prefer that failure mode to the opposite, but I do find the status information on Cloudflare's page to be very confusing about this.
Tunnels as a product is essentially heavily degraded (putting it lightly) and yet it's listed currently as "Cloudflare Tunnel: restored" [0]
[0] https://www.cloudflarestatus.com
Edit: also, having used Shopify's CLI a little, one thing I noticed immediately is how opaque the whole thing is. They want to push you down a very specific path, and don't provide a lot of information if you want to take a bit more control over your dev process (as I always want to do), which directly leads to points of failure like this. From your GitHub links it looks like devs are struggling to figure out how to quickly switch to a different reverse proxy.
Same here, but they’ve been up for a while. Does anyone know if rebooting the machine will kill them?
I run hirevire.com, a one-way video interview SaaS, and we were pretty much dead in the water during the Cloudflare Stream outage.
We moved to BunnyCDN's Stream after waiting for 20 hours.
One side benefit is that our videos are now stored in the EU instead of Cloudflare's <hand wavy> edge location near you.
I've also been using Bunny's image and video delivery, while using CF for everything else. It's pretty neat - it just works. I like having both in my toolbelt, makes fallbacks like these easy.
How much work was the migration? Were the APIs feature-compatible or did you lose functionality?
The migration work was only a couple of hours for our core process. Took us 4 hours in total to restart collecting video.
We still have some accessory features to move to Bunny's video product, like transcriptions and downloads.
I'm really looking forward to the post-mortem to this.
Cloudflare's greatest product is arguably its blog!
I can't believe we haven't heard anything yet. AFAIK we've only been told "power outage", which was resolved yesterday.
What should our expectations be? The best assumption could be that this is the new normal.
Power outage + data inconsistency issues.
Isn't the real issue that the control plane isn't decentralized/redundant? Entirely dependent on PDX
A silver lining I take from this is that at least the incidents page is hosted somewhere else :)
I look back fondly on earlier AWS outages where everything was green on the status page because the red icon hosted on S3 was down...
Has Cloudflare said anything of substance yet? This is far beyond a simple power outage.
https://www.theregister.com/2023/11/02/cloudflare_outage/
""" In a nutshell, Cloudflare rolled out a new KV build to production. It turned out that the deployment tool had a bug, and some traffic got diverted to the wrong destination, which triggered a rollback … which failed. The result was that engineers had to manually switch the production route to the previous working version of Workers KV.
The problem is that an awful lot of Cloudflare products and services depend on Workers KV, meaning that when there is a problem with the platform, the blast radius can be impressive. """
The KV outage is the previous one, from Nov 1st.
We're currently in the Nov 2-3 outage, soon to roll over into Nov 4 in my timezone. This one is the power outage — also mentioned in the article — but unrelated to KV.
Cloudflare Postmortem:
https://blog.cloudflare.com/post-mortem-on-cloudflare-contro...
"On November 2 at 08:50 UTC Portland General Electric (PGE), the utility company that services PDX-04, had an unplanned maintenance event affecting one of their independent power feeds into the building. That event shut down one feed into PDX-04. The data center has multiple feeds with some level of independence that can power the facility. However, Flexential powered up their generators to effectively supplement the feed that was down.
Counter to best practices, Flexential did not inform Cloudflare that they had failed over to generator power. None of our observability tools were able to detect that the source of power had changed. Had they informed us, we would have stood up a team to monitor the facility closely and move control plane services that were dependent on that facility out while it was degraded.
It is also unusual that Flexential ran both the one remaining utility feed and the generators at the same time. It is not unusual for utilities to ask data centers to drop off the grid when power demands are high and run exclusively on generators. Flexential operates 10 generators, inclusive of redundant units, capable of supporting the facility at full load. It would also have been possible for Flexential to run the facility only from the remaining utility feed. We haven't gotten a clear answer why they ran utility power and generator power."
They are having issues with the new process spanning global MITM'd traffic to the NSA.
Honestly Cloudflare's PR pissed me off yesterday.
Straight up going on LinkedIn and other socials saying everything was solved in one hour (actually 37 minutes), even though we and many other companies I know still had issues with their services *16 hours after* the post.
Those are things that make me reconsider my position with Cloudflare. Straight up lying and not verifying whether your customers are able to operate on your platform, while impacting their operations and making PR stunts about how good and fast they are at solving critical issues, is something that erodes credibility.
Especially after they used the Okta security failure to bash them on their blog for their lack of honest communication to their customers.
Is it possible that you're referencing the other outage from the 30th? Just going by the 37 minutes number as it's very specific.
This outage (not the current one) was 37 minutes long:
https://blog.cloudflare.com/cloudflare-incident-on-october-3...
They are straight up scumbags.
Hmm... Who just changed their DNS vs. riding it out?
I was in the midst of migrating my Namecheap domain from Route53 to Cloudflare. Set up all the DNS records while ignoring the /api/ errors shown at the bottom of the Cloudflare dashboard, thinking some ad-block setting in my browser was messed up.
Then I realised setting the NS in Namecheap to Cloudflare's nameservers was taking an inordinate amount of time to propagate, and that's when I checked X/Twitter. Set it back to Route53.
We did, we were slowly working towards migrating to AWS entirely and this just helped expedite it.
It hasn't affected my home network at all. I use their DNS servers and resolving addresses has not stopped working.
Parent comment was likely referring to authoritative DNS, not Cloudflare's public resolvers.
I'm planning my transition away, for 10 or so subdomains and 30 records.
The only feature I need to research in new providers is access to WHOIS ASN numbers, which I insert into HTTP request headers. I use this to tailor my site for .gov and .edu users.
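(For reference, one way this is done on Cloudflare is via a Worker, since request.cf.asn exposes the client's ASN. A minimal sketch of that approach; the X-Client-ASN header name is purely illustrative:)

```typescript
// Cloudflare Worker sketch: copy the client's ASN into a request header
// before forwarding to the origin. The header name is illustrative.
export default {
  async fetch(request: Request): Promise<Response> {
    // request.cf is Workers-specific; cast keeps this compiling without workers-types.
    const asn = (request as unknown as { cf?: { asn?: number } }).cf?.asn;
    const headers = new Headers(request.headers);
    if (asn !== undefined) {
      headers.set("X-Client-ASN", String(asn)); // origin can tailor content for .gov/.edu networks
    }
    return fetch(new Request(request, { headers }));
  },
};
```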
Is there a summary of what Cloudflare services are operational? Feels like it would be easier to track.
Basic proxying seems to be working fine for me. Existing DNS records continue to be served. Existing files on R2 are accessible. Can't change anything without a bunch of API errors, though. Hope I don't need to turn on "I'm under attack" anytime soon.
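(A rough way to separate the two from the outside; the proxied-site URL and API token below are placeholders, and the tokens/verify endpoint is just a cheap authenticated control-plane call:)

```typescript
// Probe the data plane (a proxied site) and the control plane (Cloudflare API) separately.
// The site URL and CF_API_TOKEN are placeholders.
const probes: Array<[string, string, HeadersInit]> = [
  ["data plane (proxied site)", "https://example.com/", {}],
  [
    "control plane (CF API)",
    "https://api.cloudflare.com/client/v4/user/tokens/verify",
    { Authorization: `Bearer ${process.env.CF_API_TOKEN ?? ""}` },
  ],
];

for (const [name, url, headers] of probes) {
  try {
    const res = await fetch(url, { headers });
    console.log(`${name}: HTTP ${res.status}`);
  } catch (err) {
    console.log(`${name}: request failed (${(err as Error).message})`);
  }
}
```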
Wonder if this is related to the many recent product launches? Even though my general impression is that they test rigorously, with long-running alpha and beta test phases.
It is a consequence of a power outage in Flexential's PDX02 data center.
That's the inference, but AFAIK there's been no direct assertion or explanation of why CF has been knocked back to alpha-status reliability across the board.
I've heard tell of massive DDoS attacks against international news sources (AP, Reuters, NY Times). Not sure if this is related.
In this case it is not. A power outage in a critical data-center is the root cause here: https://www.cloudflarestatus.com/incidents/hm7491k53ppg
I am in this DC, and we lost power to all of our racks but one. Power was restored about 2 hours later. I would assume Cloudflare had some significant failures in equipment due to the power drop. We lost a couple of servers that didn't come back up, which is not an uncommon problem with hardware that has been running without a power-off for 4-5 years.
What I'm confused by is we had "power partially restored" 22 hours ago, and no news from PDX02 since.
I assume both Cloudflare and Flexential are on DEFCON 1 right now, but I'm wondering if it might be more than just the building going dark.
There's something about a failover that was attempted and crashed halfway through, but it's unclear if that's what's causing the 24h+ situation.
If it was a state-sponsored attack, various three-letter agencies may want to investigate, which could slow down restoration of service.
If you can't cope with the loss of a data centre you're not really running a resilient system.
On yesterday's post someone that used to work at CF mentioned that PDX is "the brain" and if it goes down data stops propagating and starts getting stale. It's crazy to me that a company that is so critical to SO MUCH of the traffic on the internet doesn't even have a failover strategy for "the brain" of their operation.
Not great to know an aspiring hyperscaler is one tornado away from insolvency.
I don't think PDX would get many tornadoes, BUT the PNW has the joys of earthquakes that are overdue.
It's amusing: I have plans for how much service degradation I can accept and what I need to keep working in the UK if the Thames Barrier is breached (which would flood most of the UK's internet connectivity in places like Telehouse, Sovereign House, etc.).
To have 30% of the internet relying on a single building in a single city is hilarious.
Especially 15 hours after the DC comes back up and you're still offline.
Any more info on that?
I can't find any published details; it was just circulating in the media biz.