Cloudflare outage – 24 hours now (news.ycombinator.com)
Looking forward to a more decentralised global Internet, with packets being routed through alternative paths, so outages like these become a non-event.
I understand we do not have the technology for that just yet, and DevOps engineers who can configure TLS terminators on their own are worth their weight in gold.
Hard to imagine how the Internet could ever exist without Cloudflare.
The Internet has been decentralized from the beginning. Now I don't want to claim that Cloudflare made things worse (at least it's enabling a lot of websites to exist without fear of DDoS), but the fact is that Cloudflare made it more centralized, as there are lots of websites that cannot be accessed without going through Cloudflare.
I think you might have missed the joke on this one.
Honestly, the original post could have been a joke, or it could not have been. I regularly talk to people who seem to genuinely believe this sort of thing (on this topic and others).
I am from an era when no one was serious on the Internet, so we didn't need to add /s every time we were sarcastic.
In that era we also saw the last sysadmin configuring Apache with their bare hands without the help of Cloudflare.
It's such a sad reflection of the state of the devops art when setting up a TLS terminator is considered a black art worthy of vaunted experts being paid huge sums. I've seen this descent over the course of my career, watching the profession go from low-level knowledge to being mere YAML-wiring monkeys, slinging shit over the wall to get functionality working just well enough to make it to the nEXt SprInT. The joke above aside, I think it will continue to get worse, with overall stability reflecting that, until it comes to a head and either people re-learn 'lost' skills, or the ball of baling wire, gum and glue implodes more completely.
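To be concrete, terminating TLS yourself really is a small amount of work; roughly this much, as a Node/TypeScript sketch (cert paths and the upstream address are placeholders, not a production config):

```typescript
// Minimal TLS-terminating reverse proxy sketch.
// Cert paths and the upstream host/port are placeholders.
import https from "node:https";
import http from "node:http";
import { readFileSync } from "node:fs";

const UPSTREAM = { host: "127.0.0.1", port: 8080 }; // plain-HTTP app behind the terminator

https
  .createServer(
    {
      key: readFileSync("/etc/ssl/private/example.key"),
      cert: readFileSync("/etc/ssl/certs/example.crt"),
    },
    (clientReq, clientRes) => {
      // Forward the decrypted request to the upstream over plain HTTP.
      const proxyReq = http.request(
        {
          ...UPSTREAM,
          path: clientReq.url,
          method: clientReq.method,
          headers: clientReq.headers,
        },
        (proxyRes) => {
          clientRes.writeHead(proxyRes.statusCode ?? 502, proxyRes.headers);
          proxyRes.pipe(clientRes);
        }
      );
      proxyReq.on("error", () => clientRes.writeHead(502).end("Bad gateway"));
      clientReq.pipe(proxyReq);
    }
  )
  .listen(443);
```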
> decentralised global Internet, with packets being routed through alternative paths
> I understand we do not have the technology for that just yet
I looked at my router, remembered the term "packet-switched network", and wept.
We have the technology. We can make him better than he was. Better, stronger, faster.
That technology is far too advanced, unfortunately. Maybe someday, packets will freely roam the cyber plains, untethered by the reins of single-point-of-failure gatekeepers. Until that halcyon day dawns, we'll remain humble supplicants at the towering obelisks of centralization, chanting incantations of redundancy and resilience, and laying burnt offerings of legacy hardware upon the altars of the uptime deities.
> Looking forward to a more decentralised global Internet, with packets being routed through alternative paths, so outages like these become a non-event.
It's not just packet routing though, many of their other products seem to be affected as well.
Missing the /s I hope.
As I said elsewhere, I come from a time when everyone was fluent in sarcasm on the Internet, without needing disclaimers.
Cloudflare Continuing to Experience Outages - https://news.ycombinator.com/item?id=38121370 - Nov 2023 (2 comments)
Cloudflare Is Down (Again) - https://news.ycombinator.com/item?id=38116892 - Nov 2023 (2 comments)
Cloudflare API Down - https://news.ycombinator.com/item?id=38112515 - Nov 2023 (141 comments)
Cloudflare incident on October 30, 2023 - https://news.ycombinator.com/item?id=38100932 - Nov 2023 (29 comments)
I never experienced an outage longer than 12 hours with any service provider over my ~13-year career (maybe I was lucky). But thanks to Cloudflare I have been able to enjoy not just one, but two ~24h outages in less than a month!
Jokes aside, it must be extremely stressful to be an SRE at CF recently. But something is clearly wrong over there. We have been burned so badly that there is no chance we will touch CF again in the next decade once our migration off of it is complete.
Azure leap year outage is a famous one.
https://azure.microsoft.com/en-us/blog/summary-of-windows-az...
> But something is clearly wrong over there
We renewed our agreement with them in the middle of the year (~$50k) and they've yet to invoice us for it. Our financial controller noticed and I pinged our account rep a few times. Not a peep back.
My limited interaction with their sales & account management org gave me the impression of remarkable levels of disorganization. I know those tend to have a lot of turnover, but it seemed like they also weren't really training or managing them. Really weird vibes.
> two ~24h outages in not even a month
Wasn't the previous outage on Oct 30 less than an hour?
Yep, but on Oct 9 they were down for 22h.
Aiming high for that 2-nines reliability.
You just can't get that level of reliability if you do it yourself, no matter how hard you try.
We won't do it ourselves, but we also won't do it with a provider that has accumulated 50+ hours of downtime in less than a month, all the while providing no communication or support.
That's barely clearing one nine of availability for the last 30 days (93%) for our particular stack on CF; this is insane.
Mind you, when we were hit by the 22h outage on Oct 9, we didn't get so much as an email from CF either during or after the outage.
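(Back-of-the-envelope, assuming roughly 50 hours of downtime over the 30-day window:)

```typescript
// Rough availability over a 30-day window with ~50h of downtime.
const windowHours = 30 * 24; // 720
const downtimeHours = 50;    // approximate total from the outages above
const availability = (windowHours - downtimeHours) / windowHours;
console.log(`${(availability * 100).toFixed(1)}% availability`); // ~93.1%
```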
The no communication or support is the real killer here, imo. I can understand them having some catastrophic issues, which I would be reasonably confident they could fix; it's the uncertainty of the situation that makes me worry. Is this so easy to fix that it will be back in less than an hour and they will communicate then? Are they going dark and I need to find a new provider asap?
To be fair, their status page says emails don’t work haha
It's long been accepted practice in the hosting industry to have your critical communications as a provider (status page, support system) hosted somewhere that's not your network, for this reason.
It continues to amaze me how major infrastructure providers seem to consistently fuck this one up (see also: AWS' status page outage a while ago).
we used to joke that we had 5 8's of availability
BTW Cloudflare tunnels are not working (for at least the last 16 hours), but it says "Operational" and "restored" on the ticket.
Since Shopify's CLI uses Cloudflare tunnels by default to load local resources, Shopify partners are affected by this outage and unable to develop apps unless they use another tunnel.
Wanted to hack today but the universe is telling me to go enjoy the sun
Data point of just one, but my tunnels are working just fine.
If you've previously created a tunnel it will still work; just don't close it, because you won't be able to open a new one.
Yeah, can confirm the same here. This seems to be the most frequent failure mode that I see with my use of Tunnels. Existing connections will continue working, but attempts to create new ones (or other changes) fail. I.e., the control plane fails more often than the core connectivity service.
I certainly prefer that failure mode to the opposite, but I do find the status information on Cloudflare's page to be very confusing about this.
Tunnels as a product is essentially heavily degraded (putting it lightly) and yet it's listed currently as "Cloudflare Tunnel: restored" [0]
[0] https://www.cloudflarestatus.com
Edit: also, having used Shopify's CLI a little, one thing I noticed immediately is how opaque the whole thing is. They want to push you down a very specific path, and don't provide a lot of information if you want to take a bit more control over your dev process (as I always want to do), which directly leads to points of failure like this. From your GitHub links it looks like devs are struggling to figure out how to quickly switch to a different reverse proxy.
Same here, but they’ve been up for a while. Does anyone know if rebooting the machine will kill them?
I run hirevire.com, a one-way video interview SaaS, and we were pretty much dead in the water during the Cloudflare Stream outage.
We moved to BunnyCDN's Stream after waiting for 20 hours.
One side benefit is that our videos are now stored in the EU instead of Cloudflare's <hand wavy> edge location near you.
I've also been using Bunny's image and video delivery, while using CF for everything else. It's pretty neat - it just works. I like having both in my toolbelt, makes fallbacks like these easy.
How much work was the migration? Were the APIs feature-compatible or did you lose functionality?
The migration work was only a couple of hours for our core process. Took us 4 hours in total to restart collecting video.
We still have some accessory features to move to Bunny's video product, like transcriptions and downloads.
I'm really looking forward to the post-mortem to this.
Cloudflare's greatest product is arguably its blog!
I can't believe we haven't heard anything yet. AFAIK we've only been told "power outage", which was resolved yesterday.
What should our expectations be? The best assumption could be that this is the new normal.
Power outage + data inconsistency issues.
Isn't the real issue that the control plane isn't decentralized/redundant? Entirely dependent on PDX
A silver lining I take from this is that at least the incidents page is hosted somewhere else :)
I look back fondly on earlier AWS outages where everything was green on the status page because the red icon hosted on S3 was down...
Has Cloudflare said anything of substance yet? This is far beyond a simple power outage.
https://www.theregister.com/2023/11/02/cloudflare_outage/
""" In a nutshell, Cloudflare rolled out a new KV build to production. It turned out that the deployment tool had a bug, and some traffic got diverted to the wrong destination, which triggered a rollback … which failed. The result was that engineers had to manually switch the production route to the previous working version of Workers KV.
The problem is that an awful lot of Cloudflare products and services depend on Workers KV, meaning that when there is a problem with the platform, the blast radius can be impressive. """
The KV outage is the previous one, from Nov 1st.
We're currently in the Nov 2-3 outage, soon to roll over into Nov 4 in my timezone. This one is the power outage — also mentioned in the article — but unrelated to KV.
Cloudflare Postmortem:
https://blog.cloudflare.com/post-mortem-on-cloudflare-contro...
"On November 2 at 08:50 UTC Portland General Electric (PGE), the utility company that services PDX-04, had an unplanned maintenance event affecting one of their independent power feeds into the building. That event shut down one feed into PDX-04. The data center has multiple feeds with some level of independence that can power the facility. However, Flexential powered up their generators to effectively supplement the feed that was down.
Counter to best practices, Flexential did not inform Cloudflare that they had failed over to generator power. None of our observability tools were able to detect that the source of power had changed. Had they informed us, we would have stood up a team to monitor the facility closely and move control plane services that were dependent on that facility out while it was degraded.
It is also unusual that Flexential ran both the one remaining utility feed and the generators at the same time. It is not unusual for utilities to ask data centers to drop off the grid when power demands are high and run exclusively on generators. Flexential operates 10 generators, inclusive of redundant units, capable of supporting the facility at full load. It would also have been possible for Flexential to run the facility only from the remaining utility feed. We haven't gotten a clear answer why they ran utility power and generator power."
They are having issues with the new process spanning global MITM'd traffic to the NSA.
Honestly Cloudflare's PR pissed me off yesterday.
Straight up going on LinkedIn and other socials saying everything was solved in one hour (actually 37 minutes), even though we and many other companies I know still had issues with their services *16 hours after* the post.
Those are things that make me reconsider my position with Cloudflare. Straight up lying and not verifying whether your customers are able to operate on your platform, while impacting their operations and making PR stunts about how good and fast they are at solving critical issues, is something that erodes credibility.
Especially after they used the Okta security failure to bash them on their blog for their lack of honest communication to their customers.
Is it possible that you're referencing the other outage from the 30th? Just going by the 37 minutes number as it's very specific.
This outage (not the current one) was 37 minutes long:
https://blog.cloudflare.com/cloudflare-incident-on-october-3...
They are straight up scumbags.
Hmm... Who just changed their DNS vs. riding it out?
I was in the midst of migrating my Namecheap domain from Route53 to Cloudflare. Set up all the DNS records while ignoring the /api/ errors shown at the bottom of the Cloudflare dashboard, thinking some ad-block setting in my browser was messed up.
Then I realised setting the NS in Namecheap to Cloudflare's nameservers was taking an inordinate amount of time to propagate, and that's when I checked X/Twitter. Set it back to Route53.
We did, we were slowly working towards migrating to AWS entirely and this just helped expedite it.
It hasn't affected my home network at all. I use their DNS servers and resolving addresses has not stopped working.
Parent comment was likely referring to authoritative DNS, not Cloudflare's public resolvers.
I'm planning my transition away, for 10 or so subdomains and 30 records.
The only feature I need to research in new providers is access to WHOIS ASN numbers, which I insert into HTTP request headers. I use this to tailor my site for .gov and .edu users.
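(For reference, one way this is done on Cloudflare is via a Worker, since request.cf.asn exposes the client's ASN. A minimal sketch of that approach; the X-Client-ASN header name is purely illustrative:)

```typescript
// Cloudflare Worker sketch: copy the client's ASN into a request header
// before forwarding to the origin. The header name is illustrative.
export default {
  async fetch(request: Request): Promise<Response> {
    // request.cf is Workers-specific; cast keeps this compiling without workers-types.
    const asn = (request as unknown as { cf?: { asn?: number } }).cf?.asn;
    const headers = new Headers(request.headers);
    if (asn !== undefined) {
      headers.set("X-Client-ASN", String(asn)); // origin can tailor content for .gov/.edu networks
    }
    return fetch(new Request(request, { headers }));
  },
};
```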
Is there a summary of what Cloudflare services are operational? Feels like it would be easier to track.
Basic proxying seems to be working fine for me. Existing DNS records continue to be served. Existing files on R2 are accessible. Can't change anything without a bunch of API errors, though. Hope I don't need to turn on "I'm under attack" anytime soon.
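(A rough way to separate the two from the outside; the proxied-site URL and API token below are placeholders, and the tokens/verify endpoint is just a cheap authenticated control-plane call:)

```typescript
// Probe the data plane (a proxied site) and the control plane (Cloudflare API) separately.
// The site URL and CF_API_TOKEN are placeholders.
const probes: Array<[string, string, HeadersInit]> = [
  ["data plane (proxied site)", "https://example.com/", {}],
  [
    "control plane (CF API)",
    "https://api.cloudflare.com/client/v4/user/tokens/verify",
    { Authorization: `Bearer ${process.env.CF_API_TOKEN ?? ""}` },
  ],
];

for (const [name, url, headers] of probes) {
  try {
    const res = await fetch(url, { headers });
    console.log(`${name}: HTTP ${res.status}`);
  } catch (err) {
    console.log(`${name}: request failed (${(err as Error).message})`);
  }
}
```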
Wonder if this is related to the many recent product launches? Even though my general impression is that they test rigorously, with long-running alpha and beta test phases.
It is a consequence of a power outage in Flexential's PDX02 data center.
That's the inference, but AFAIK there's been no direct assertion or explanation of why CF has been knocked back to alpha-status reliability across the board.
I've heard tell of massive DDoS attacks against international news sources (AP, Reuters, NY Times). Not sure if this is related.
In this case it is not. A power outage in a critical data-center is the root cause here: https://www.cloudflarestatus.com/incidents/hm7491k53ppg
I am in this DC, and we lost power to all of our racks but one. Power was restored about 2 hours later. I would assume Cloudflare had some significant failures in equipment due to the power drop. We lost a couple of servers that didn't come back up, which is not an uncommon problem with hardware that has been running without a power-off for 4-5 years.
What I'm confused by is we had "power partially restored" 22 hours ago, and no news from PDX02 since.
I assume both Cloudflare and Flexential are on DEFCON 1 right now, but I'm wondering if it might be more than just the building going dark.
There's something about a failover that was attempted and crashed halfway through, but it's unclear if that's what's causing the 24h+ situation.
If it was a state-sponsored attack, various three-letter agencies may want to investigate, which could slow down restoration of service.
If you can't cope with the loss of a data centre you're not really running a resilient system.
On yesterday's post someone that used to work at CF mentioned that PDX is "the brain" and if it goes down data stops propagating and starts getting stale. It's crazy to me that a company that is so critical to SO MUCH of the traffic on the internet doesn't even have a failover strategy for "the brain" of their operation.
Not great to know an aspiring hyperscaler is one tornado away from insolvency.
I don't think PDX would get many tornadoes, BUT the PNW has the joys of earthquakes that are overdue.
It's amusing: I have plans for how much service degradation I can accept and what I need to keep working in the UK if the Thames Barrier is breached (which would flood most of the UK's internet connectivity in places like Telehouse, Sovereign House, etc.).
To have 30% of the internet relying on a single building in a single city is hilarious.
Especially 15 hours after the DC comes back up and you're still offline.
Any more info on that?
I can't find any published details; it was just circulating in the media biz.