Ask HN: Cloudflare Workers are down?

157 points by cristaloleg 2 years ago · 86 comments · 1 min read

Getting too many 500s on dozens of services now.

UPD: https://www.cloudflarestatus.com/incidents/l6x2h1zp69bc

elithrar 2 years ago

This should be resolved. We’re still investigating the underlying root cause, and intend to share a write-up once we have that in hand.

This is not the way we wanted anyone to start their week.

(I am the PM lead for Cloudflare Workers: Databases & Storage)

nkcmr 2 years ago

https://www.cloudflarestatus.com/incidents/l6x2h1zp69bc

codegeek 2 years ago

3:55 PM Eastern: Our entire website hosted on Cloudflare Pages is returning 500. I also cannot log in to the dashboard (it just spins).

EDIT 4:10 PM Eastern: Now I can log in to the dashboard, but the "Workers and Pages" menu is returning errors and I have no access. Website still down :(

EDIT at 4:23 PM Eastern: RESOLVED. Website (cloudflare pages) is back up now for me.

Looks like they took about 25 mins to resolve.

  • camjohnson26 2 years ago

    Our prod app and staging just completely died. Bad day for somebody at Cloudflare

    • codegeek 2 years ago

      Our main Marketing website that brings revenue is down. No Sympathy from me. It has been 20 mins now. Losing money as I type this.

      EDIT: I panicked a little. As a dev, I should have been more sympathetic.

      • acdha 2 years ago

        It’s only reasonable to be angry but do try to remember that the people fixing this are people like you who showed up at work to build something and are instead dealing with a fire. Ask their bosses about how they got in that situation but be nice to them, they’re having an even worse day than you are.

        • codegeek 2 years ago

          Fair enough. They resolved it now and I was in a bit of panic considering our revenue depends on the website. As a developer though, I should have been more sympathetic.

          • andrelaszlo 2 years ago

            I'm curious, I definitely get the panic. How much does a 30 minute outage cost you, vs how much would it cost to build a solution with some kind of standby that you could fail over to in scenarios like this?

            It could be worth it, but if you do the math and it seems like it's not worth it, it could perhaps give you some equanimity the next time it happens?
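The comparison suggested above can be sketched as simple arithmetic. This is only an illustration: every number below is a hypothetical placeholder, not a figure from anyone in this thread, and the function names are made up for the example.

```python
def expected_outage_cost(outages_per_year: float,
                         minutes_per_outage: float,
                         cost_per_minute: float) -> float:
    """Expected yearly loss from outages of a given length."""
    return outages_per_year * minutes_per_outage * cost_per_minute


def standby_is_worth_it(outage_cost_per_year: float,
                        standby_cost_per_year: float) -> bool:
    """A standby pays off only if it costs less than the outages it avoids."""
    return outage_cost_per_year > standby_cost_per_year


# Hypothetical inputs: one 30-minute outage a year at $50 per minute,
# versus a standby setup assumed to cost $6,000 a year to build and run.
yearly_loss = expected_outage_cost(1, 30, 50.0)
worth_it = standby_is_worth_it(yearly_loss, 6000.0)
print(yearly_loss, worth_it)
```

With these made-up inputs the standby costs more than the outage it would cover. Reputation damage, which is hard to price, would have to be added to `cost_per_minute` by hand.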

            • codegeek 2 years ago

              Honestly, the impact may be less monetary and more reputational, but as a B2B SaaS provider, I agree that it may not have been as dramatic as I made it sound for the 25 mins of downtime. It is just that we never had a downtime this long, EVER, in 8+ years of business, so I hit the panic button fast. 25 mins seemed like 25 hours if you ask me :).

  • midasuni 2 years ago

    3.55pm eastern. My websites work

    4.10pm eastern, still working

    4.23 eastern. Yep you guessed it

    Half an hour means they’ve lost their five nines for this year based on this outage alone.
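The "five nines" claim can be checked with a little arithmetic. A minimal sketch, assuming a 365-day year and that availability is measured over a calendar year (the function names are illustrative):

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # non-leap year


def downtime_budget_seconds(nines: int) -> float:
    """Allowed downtime per year for an availability of N nines.

    Five nines means 99.999% uptime, so 10**-5 of the year may be down.
    """
    return SECONDS_PER_YEAR * 10 ** (-nines)


def fits_budget(outage_minutes: float, nines: int) -> bool:
    """True if a single outage of this length stays within the yearly budget."""
    return outage_minutes * 60 <= downtime_budget_seconds(nines)


five_nines_minutes = downtime_budget_seconds(5) / 60  # a bit over 5 minutes
print(round(five_nines_minutes, 2), fits_budget(30, 5), fits_budget(30, 3))
```

Five nines leaves a little over five minutes a year, so a single outage of 25 to 30 minutes uses up the whole budget several times over. Three nines would still absorb it.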

  • EthicalSimilar 2 years ago

    Us also, prod and staging are down and the dashboard's API requests are failing (500).

madjam002 2 years ago

And just 30 minutes ago we were about to flip the switch on a months long migration to Cloudflare Pages for our new website, I guess some things weren't meant to be :')

  • codegeek 2 years ago

    Omg. What timing. I feel your pain. We recently migrated to Cloudflare Pages and I was happy at the speed and everything and now this :(. Never had a downtime when I self hosted on my DigitalOcean droplet. damn. Re-considering going back to old school nginx static site hosting.

    • ikekkdcjkfke 2 years ago

      Those might have had downtime, but never reported

    • hobs 2 years ago

      Well then you haven't used DO that long, I get regular emails about X or Y server needing to go down for maint.

      • goldinfra 2 years ago

        I've used Digital Ocean (and many other hosting providers) for as long as most of them have existed. Most of my servers have been running nearly uninterrupted for many years. Yes, there will be a reboot or move every so often but the uptime is incredibly high.

        The idea that a single server can beat the reliability of a massively distributed system is counter-intuitive, and yet it's usually the case.

        The average distributed system is a house of cards that can come tumbling down if any one of a number of pieces fails. The average static server is a rock of stability, with very few failure modes.

    • madjam002 2 years ago

      Yep our current marketing site is NextJS hosted on Hetzner fronted by Cloudflare, fortunately that's still up and never has any problems.

      We've moved to next-on-pages for our new marketing site and I've spent the whole day on finishing touches ready for switch over at 20:00 UTC, and now this :((

    • nabakin 2 years ago

      Not sure why you're being downvoted

    • nijave 2 years ago

      Did you ever reboot for patches or was it load balanced?

    • rexreed 2 years ago

      Heck even shared hosting for $3/mo works just fine

JohnMakin 2 years ago

For any terraform users that may be using code like this:

data "cloudflare_ip_ranges" "cloudflare_ipv4_list" {}

This is coming back with an empty list on some fields and causing havoc in terraform.

  • freedomben 2 years ago

    It is shocking to me how bad to non-existent error handling is in most terraform providers. It leads to some remarkably arcane and esoteric error messages

    • mschuster91 2 years ago

      Terraform error handling as a whole is nuts anyway. Like, I recently tried to delete an ACM cert that was still in use in a CloudFront distribution - didn't work, but it took 20 minutes for Terraform to recognize that, yes, there's an API error. It shouldn't have gotten that far, given that the API call immediately errors out when tried via the CLI or Web Console, but instead of erroring out, Terraform retried for 20 minutes until it hit some sort of timeout.

      To make it worse, you can't even kill Terraform safely because while it does register your Ctrl+C, it won't interrupt an ongoing process, and if you force kill it you run the very serious risk of corrupting your state file.

      Seriously, I'm looking for OpenTofu to light some fire under the ass of Hashicorp. I don't know where all the VC money went, but for what's supposed to be the gold standard of IaC solutions, it's sometimes bloody ridiculous.

      (Not to mention it's written in Go of all things which means there's virtually zero tooling and documentation to debug it or to develop anything for... especially when compared to the state of the art in Java, NodeJS or PHP)

      • nijave 2 years ago

        This is usually down to the provider implementation, which switching the core won't help with. The provider controls HTTP calls and errors against the relevant service API.

        Here are the retries in the provider code https://github.com/hashicorp/terraform-provider-aws/blob/mai...

        It's hard coded to "certificateCrossServicePropagationTimeout" which is 20 minutes here https://github.com/hashicorp/terraform-provider-aws/blob/mai...
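To illustrate the difference being discussed, here is a generic sketch of a retry loop that keeps waiting on transient errors but fails immediately on permanent ones. It is not the provider's actual code; the exception classes, function names, and the error message are hypothetical, and only the 20-minute timeout comes from the comment above.

```python
import time


class RetryableError(Exception):
    """Transient failure, e.g. a change that is still propagating."""


class PermanentError(Exception):
    """Failure that retrying cannot fix, e.g. a resource still in use."""


def call_with_retry(operation, timeout_s, interval_s=5.0,
                    clock=time.monotonic, sleep=time.sleep):
    """Retry only RetryableError until the deadline; anything else propagates."""
    deadline = clock() + timeout_s
    while True:
        try:
            return operation()
        except RetryableError:
            # Give up if another wait would run past the deadline.
            if clock() + interval_s > deadline:
                raise TimeoutError(f"gave up after {timeout_s}s")
            sleep(interval_s)
        # PermanentError is deliberately not caught, so it surfaces at once.


def delete_cert_in_use():
    # Hypothetical stand-in for an API call that rejects the request outright.
    raise PermanentError("certificate is still attached to a distribution")


try:
    call_with_retry(delete_cert_in_use, timeout_s=20 * 60)
except PermanentError as err:
    outcome = f"failed fast: {err}"
print(outcome)
```

The hard part in practice is classifying errors: a provider that treats an "in use" response as possibly transient will keep retrying until the timeout, which matches the 20-minute wait described above.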

        • mschuster91 2 years ago

          Sure, but Terraform Core doesn't provide any way of getting user feedback in case unexpected situations happen, or aborting while saving the current state, both of which would save me serious amounts of time.

          • nijave 2 years ago

            Being able to actually interrupt/cancel would be nice. You can get more feedback by adjusting TF_LOG env var. Logging levels have been getting improvements for a while (it used to just be TRACE that spammed everything)

            • mschuster91 2 years ago

              > You can get more feedback by adjusting TF_LOG env var.

              yep, but that's often useless after the fact. In the PHP world, for example, there's Symfony/Monolog's `fingers_crossed` logger [1]... it keeps logs below the normal threshold in memory, but if there is a single event of a given severity or worse, it dumps out all the logs it has ingested so far for this request.

              A real lifesaver that one is.

              [1] https://symfony.com/doc/current/logging.html
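For readers outside PHP, the same buffering idea can be sketched with Python's standard library. `logging.handlers.MemoryHandler` is a rough analogue, not a port of Monolog's handler: it holds records in memory and forwards them all to a target once a record at `flushLevel` or above arrives. One difference worth noting is that it also flushes when the buffer reaches `capacity`. The `ListHandler` class and the logger name are made up for this demo.

```python
import logging
from logging.handlers import MemoryHandler


class ListHandler(logging.Handler):
    """Demo target that collects messages in a list instead of printing."""

    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        self.messages.append(record.getMessage())


def make_buffered_logger(name, target, capacity=10_000):
    """Buffer everything; release the buffer only on ERROR or worse."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.DEBUG)
    logger.propagate = False
    buffer = MemoryHandler(capacity, flushLevel=logging.ERROR,
                           target=target, flushOnClose=False)
    logger.addHandler(buffer)
    return logger


sink = ListHandler()
log = make_buffered_logger("fingers_crossed_demo", sink)

log.debug("step 1")
log.info("step 2")
before_error = list(sink.messages)  # nothing has been forwarded yet

log.error("boom")                   # triggers a flush of the whole buffer
after_error = list(sink.messages)
print(before_error, after_error)
```

Setting `flushOnClose=False` keeps a clean run quiet at shutdown, so the low-level records only ever appear when something actually went wrong.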

              • nijave 2 years ago

                I used to add TF_LOG_PATH to my shell profile so all TF runs log to disk with a daily cron to truncate the file.

    • JohnMakin 2 years ago

      There was no error message, that was the really unsettling part.

    • eddyfromtheblok 2 years ago

      it's shocking how much of a desirable skill it is in devops job roles given its clear deficiencies.

  • JohnMakin 2 years ago

    In the time between making this post and now, it's come back. Really wish that would've returned an error and not an empty list, that almost caused a disaster in my automation.
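One way to protect automation from this failure mode is a fail-closed guard that treats an empty list as an error before anything consumes it. The sketch below is a generic check in the surrounding tooling, not a Terraform or Cloudflare API; the function name is hypothetical and the CIDRs are documentation ranges, not Cloudflare's real ones.

```python
import ipaddress


def require_cidrs(cidrs, label):
    """Return parsed networks, or raise if the list is empty or malformed."""
    if not cidrs:
        raise ValueError(f"{label}: refusing to continue with an empty CIDR list")
    # ip_network raises ValueError on anything that is not a valid CIDR.
    return [ipaddress.ip_network(cidr) for cidr in cidrs]


# Placeholder input standing in for whatever the data source returned.
ranges = require_cidrs(["192.0.2.0/24", "198.51.100.0/24"], "cloudflare_ipv4_list")

try:
    require_cidrs([], "cloudflare_ipv4_list")
except ValueError as err:
    message = str(err)
print(len(ranges), message)
```

An empty allowlist is rarely a legitimate state for firewall or security-group rules, so stopping the run is usually cheaper than applying it.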

TacticalCoder 2 years ago

Anyone remember big iron and servers with uptimes of 5 or 7 nines?

I mean: it used to be a thing. Now we have the cloud.

  • fizx 2 years ago

    7 9's is 3 seconds of downtime per year. That was never a thing.

vicnov 2 years ago

Auth0 seems to be down as well

  • thom 2 years ago

    Yeah, can confirm this (for those looking at their status pages which claim otherwise).

j-rom 2 years ago

Complete Pages outage for me. I have several sites hosted on Cloudflare Pages and I can't access any of them, they're all returning 500's.

campbellman 2 years ago

Fun day to release a blog post[0] about cloudflare page functions, on a site hosted on cloudflare pages.

[0] https://interbolt.org/blog/split-it-and-forget-it

tootie 2 years ago

Apparently Auth0 as well. Possibly related.

c22 2 years ago

It's probably bad that I noticed this just due to a large percentage of my regular online-habits suddenly breaking. I liked the old internet where websites just broke one at a time.

blintz 2 years ago

This is preventing new logins to ChatGPT.

xrd 2 years ago

I can't login to my domain dashboard either. Maybe that is a downstream effect of workers being offline?

dogweather 2 years ago

Yes — Workaround is to disable your workers. That got my site back up and running.

EDIT 15:07 MDT: People are reporting that Workers are back up. Mine isn't in my site's critical path. So I'm going to leave the Worker disabled (un-routed) until tonight.

  • codegeek 2 years ago

    Are you saying disable workers and then your cloudflare Pages will be back up ?

    • dogweather 2 years ago

      Ah, I don't know about Cloudflare Pages. I think they use Workers underneath. So unfortunately, there's no fix yet. Sorry.

      • codegeek 2 years ago

        Ah well. I cannot access the Workers and pages menu. It returns an error.

gsanderson 2 years ago

My sites have started coming back up now. Their site has also just started working again (previously got a 500): https://pages.cloudflare.com/

pcblues 2 years ago

I won't be the first or last to say these three things:

The internet was meant to stop reliance on single sources (in case of nuclear war)

The size of a house of cards increases the number of failure points

Marketers lie

  • jve 2 years ago

    > The internet was meant to stop reliance on single sources

    You have all the technical means. Your home server possibly won't be reachable, yes.

    The global connectivity as-is is really, really, really fault tolerant.

CommonGuy 2 years ago

Cloudflare Pages aren't working on a few of my sites too

mparnisari 2 years ago

It is funny that just a few days ago the company laughed at Okta for a breach, and now that same company, whose core competency is availability, is experiencing an outage.

gkfasdfasdf 2 years ago

Ongoing DDoS attacks are targeting sites that raise funds for Gaza relief efforts: https://twitter.com/arblauvelt/status/1719027920054702363

I wonder if there's a connection.

ironmagma 2 years ago

That's sad, hopefully something comes along that can brighten their day.

TheCleric 2 years ago

Wonder if this is related to the mini NPM outage I was experiencing earlier:

https://status.npmjs.org/incidents/zdznxkrp22py

  • imslavko 2 years ago

    No way to confirm, but I think so, just because NPM threw this error at me:

         KV GET failed: 401 Unauthorized
    
    where KV could refer to the CF KV in workers

Animats 2 years ago

All of them, or just those in some data centers?

AxiomaticSpace 2 years ago

Looks like it's working now as of 1:23 PST

ChrisArchitect 2 years ago

Damn yeah, noticing for last few mins+

winseybash 2 years ago

Looks to be back up now for my sites

johnmaguire 2 years ago

Yes, down since about 3:55 ET.

tomschlick 2 years ago

Back up for me as of 16:20 ET

darknavi 2 years ago

npmjs is also in a bad spot:

https://status.npmjs.org/

ChrisArchitect 2 years ago

Ask HN:
