Ask HN: Cloudflare Workers are down?

157 points by cristaloleg 3 years ago · 86 comments · 1 min read

Getting too many 500s on dozens of services right now.

UPD: https://www.cloudflarestatus.com/incidents/l6x2h1zp69bc

This should be resolved. We’re still investigating the underlying root cause, and intend to share a write-up once we have that in hand.

This is not the way we wanted anyone to start their week.

(I am the PM lead for Cloudflare Workers: Databases & Storage)

elithrar 3 years ago

Our public postmortem on the incident: https://blog.cloudflare.com/cloudflare-incident-on-october-3...

nkcmr 3 years ago

https://www.cloudflarestatus.com/incidents/l6x2h1zp69bc

huerlisi1 3 years ago

Just noted on HN and the incident has already been upgraded. Much faster "response" than most other companies :-)
All the best to the people fixing it!

NicoJuicy 3 years ago

Works for me: https://blog.sapico.me/
Seems to have lasted 30 minutes, according to the status page.
Fix is fast. Curious what it was.

codegeek 3 years ago

3:55 PM Eastern: Our entire website hosted on Cloudflare Pages is returning 500. I also cannot log in to the dashboard (it just spins)

EDIT 4:10 PM Eastern: Now I can log in to the dashboard but the "Workers and Pages" menu is returning errors and I have no access. Website still down :(

EDIT at 4:23 PM Eastern: RESOLVED. Website (cloudflare pages) is back up now for me.

Looks like they took about 25 mins to resolve.

camjohnson26 3 years ago

Our prod app and staging just completely died. Bad day for somebody at Cloudflare
- codegeek 3 years ago
  
  Our main Marketing website that brings revenue is down. No Sympathy from me. It has been 20 mins now. Losing money as I type this.
  EDIT: I panicked a little. As a dev, I should have been more sympathetic.
  - acdha 3 years ago
    
    It’s only reasonable to be angry but do try to remember that the people fixing this are people like you who showed up at work to build something and are instead dealing with a fire. Ask their bosses about how they got in that situation but be nice to them, they’re having an even worse day than you are.
    
    codegeek 3 years ago
    
    Fair enough. They resolved it now and I was in a bit of panic considering our revenue depends on the website. As a developer though, I should have been more sympathetic.
    
    andrelaszlo 3 years ago
    
    I'm curious, I definitely get the panic. How much does a 30 minute outage cost you, vs how much would it cost to build a solution with some kind of standby that you could fail over to in scenarios like this?
    It could be worth it, but if you do the math and it seems like it's not worth it, it could perhaps give you some equanimity the next time it happens?
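A rough way to do that math, as a minimal Python sketch. Every figure here is a hypothetical placeholder, not a number from this thread, and the model only counts lost revenue (not reputation); plug in your own values.

```python
def outage_cost(revenue_per_hour: float, outage_minutes: float) -> float:
    """Revenue lost during an outage, assuming revenue accrues evenly over time."""
    return revenue_per_hour * outage_minutes / 60.0


def standby_worth_it(revenue_per_hour: float,
                     outage_minutes_per_year: float,
                     standby_cost_per_year: float) -> bool:
    """True if the expected yearly outage loss exceeds the yearly standby cost."""
    return outage_cost(revenue_per_hour, outage_minutes_per_year) > standby_cost_per_year


if __name__ == "__main__":
    # Hypothetical inputs: $200/hour of revenue, 30 minutes of downtime per year,
    # $3,000/year to build and run a failover target.
    print("expected yearly loss:", outage_cost(200.0, 30.0))
    print("standby pays for itself:", standby_worth_it(200.0, 30.0, 3000.0))
```

With those placeholder inputs the standby costs far more than the outage it would prevent, which is the "equanimity" case.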
    
    codegeek 3 years ago
    
    Honestly, the impact may be less monetary and more reputational, but as a B2B SaaS provider, I agree that it may not have been as dramatic as I made it sound for the 25 mins of downtime. It is just that we never had a downtime this long, EVER, in 8+ years of business, so I hit the panic button fast. 25 mins seemed like 25 hours if you ask me :).

midasuni 3 years ago

3.55pm eastern. My websites work
4.10pm eastern, still working
4.23 eastern. Yep you guessed it
Half an hour means they’ve lost their five nines for this year based on this outage alone.
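For anyone who wants to check the five-nines arithmetic, here is a minimal Python sketch. It assumes availability is measured over an average calendar year and that the 30-minute outage is the only downtime in that year.

```python
MINUTES_PER_YEAR = 365.25 * 24 * 60  # average year, including leap days


def downtime_budget_minutes(nines: int) -> float:
    """Allowed downtime per year, in minutes, for an availability of `nines` nines."""
    return MINUTES_PER_YEAR * 10.0 ** (-nines)


def availability_after(outage_minutes: float) -> float:
    """Yearly availability (as a fraction) if this were the only outage all year."""
    return 1.0 - outage_minutes / MINUTES_PER_YEAR


if __name__ == "__main__":
    for n in (3, 4, 5):
        print(f"{n} nines: {downtime_budget_minutes(n):.2f} minutes/year")
    print(f"after a 30-minute outage: {availability_after(30.0) * 100:.4f}%")
```

Five nines leaves roughly 5.26 minutes per year, so a half-hour outage does exceed it, while still fitting inside the roughly 52.6-minute budget for four nines.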

EthicalSimilar 3 years ago

Us also, prod and staging are down and the dashboard's API requests are failing (500).

madjam002 3 years ago

And just 30 minutes ago we were about to flip the switch on a months-long migration to Cloudflare Pages for our new website. I guess some things weren't meant to be :')

codegeek 3 years ago

Omg. What timing. I feel your pain. We recently migrated to Cloudflare Pages and I was happy at the speed and everything and now this :(. Never had a downtime when I self hosted on my DigitalOcean droplet. damn. Re-considering going back to old school nginx static site hosting.
- ikekkdcjkfke 3 years ago
  
  Those might have had downtime, but it was never reported
- hobs 3 years ago
  
  Well then you haven't used DO that long, I get regular emails about X or Y server needing to go down for maint.
  - goldinfra 3 years ago
    
    I've used Digital Ocean (and many other hosting providers) for as long as most of them have existed. Most of my servers have been running nearly uninterrupted for many years. Yes, there will be a reboot or move every so often but the uptime is incredibly high.
    The idea that a single server can beat the reliability of a massively distributed system is counter-intuitive, and yet it's usually the case.
    The average distributed system is a house of cards that can come tumbling down if any one of a number of pieces fails. The average static server is a rock of stability, with very few failure modes.
- madjam002 3 years ago
  
  Yep our current marketing site is NextJS hosted on Hetzner fronted by Cloudflare, fortunately that's still up and never has any problems.
  We've moved to next-on-pages for our new marketing site and I've spent the whole day on finishing touches ready for switch over at 20:00 UTC, and now this :((
- nabakin 3 years ago
  
  Not sure why you're being downvoted
- nijave 3 years ago
  
  Did you ever reboot for patches or was it load balanced?
- rexreed 3 years ago
  
  Heck even shared hosting for $3/mo works just fine

JohnMakin 3 years ago

For any terraform users that may be using code like this:

data "cloudflare_ip_ranges" "cloudflare_ipv4_list" {}

This is coming back with an empty list on some fields and causing havoc in terraform.

freedomben 3 years ago

It is shocking to me how bad to non-existent error handling is in most terraform providers. It leads to some remarkably arcane and esoteric error messages
- mschuster91 3 years ago
  
  Terraform error handling as a whole is nuts anyway. Like, I recently tried to delete an ACM cert that was still in use in a CloudFront distribution - didn't work, but it took 20 minutes for Terraform to recognize that, yes, there's an API error. It shouldn't have gotten that far, given that the API call immediately errors out when you try it over the CLI or Web Console, but instead of erroring out, Terraform retried for 20 minutes until it hit some sort of timeout.
  To make it worse, you can't even kill Terraform safely because while it does register your Ctrl+C, it won't interrupt an ongoing process, and if you force kill it you run the very serious risk of corrupting your state file.
  Seriously, I'm looking for OpenTofu to light some fire under the ass of Hashicorp. I don't know where all the VC money went, but for what's supposed to be the gold standard of IaC solutions, it's sometimes bloody ridiculous.
  (Not to mention it's written in Go of all things which means there's virtually zero tooling and documentation to debug it or to develop anything for... especially when compared to the state of the art in Java, NodeJS or PHP)
  - nijave 3 years ago
    
    This usually comes down to the provider implementation, which switching the core won't help with. The provider controls HTTP calls and errors against the relevant service API.
    Here are the retries in the provider code https://github.com/hashicorp/terraform-provider-aws/blob/mai...
    It's hard coded to "certificateCrossServicePropagationTimeout" which is 20 minutes here https://github.com/hashicorp/terraform-provider-aws/blob/mai...
    
    mschuster91 3 years ago
    
    Sure, but Terraform Core doesn't provide any way of getting user feedback in case unexpected situations happen, or aborting while saving the current state, both of which would save me serious amounts of time.
    
    nijave 3 years ago
    
    Being able to actually interrupt/cancel would be nice. You can get more feedback by adjusting TF_LOG env var. Logging levels have been getting improvements for a while (it used to just be TRACE that spammed everything)
    
    mschuster91 3 years ago
    
    > You can get more feedback by adjusting TF_LOG env var.
    Yep, but that's often useless after the fact. In the PHP world, for example, there's Symfony/Monolog's `fingers_crossed` logger [1]... it keeps logs below the normal threshold in memory, but if there is a single event of a given severity or worse, it dumps out all the logs it has ingested so far for this request.
    A real lifesaver that one is.
    [1] https://symfony.com/doc/current/logging.html
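For comparison, the same buffer-then-dump idea can be approximated with Python's standard library. This is a minimal sketch using `logging.handlers.MemoryHandler`, not a port of Monolog's handler: the helper name `make_fingers_crossed_logger` is hypothetical, and unlike `fingers_crossed`, `MemoryHandler` also flushes when its buffer reaches `capacity` and goes back to buffering after each flush.

```python
import logging
import logging.handlers


def make_fingers_crossed_logger(name: str, target: logging.Handler,
                                capacity: int = 10_000) -> logging.Logger:
    """Buffer all records in memory; forward them to `target` only once a
    record at ERROR or above arrives (or the buffer reaches `capacity`)."""
    buffer = logging.handlers.MemoryHandler(
        capacity=capacity,
        flushLevel=logging.ERROR,
        target=target,
        flushOnClose=False,  # drop buffered records if nothing went wrong
    )
    logger = logging.getLogger(name)
    logger.setLevel(logging.DEBUG)  # capture everything; the buffer decides what is shown
    logger.addHandler(buffer)
    logger.propagate = False  # keep ancestor loggers from emitting records early
    return logger


if __name__ == "__main__":
    log = make_fingers_crossed_logger("request", logging.StreamHandler())
    log.debug("loading config")         # buffered, not written yet
    log.info("calling upstream")        # buffered, not written yet
    log.error("upstream returned 500")  # triggers a flush of all three records
```

If no error is ever logged, the buffered records are simply discarded, which is what keeps normal runs quiet.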
    
    nijave 3 years ago
    
    I used to add TF_LOG_PATH to my shell profile so all TF runs log to disk with a daily cron to truncate the file.
- JohnMakin 3 years ago
  
  There was no error message, that was the really unsettling part.
- eddyfromtheblok 3 years ago
  
  it's shocking how much of a desirable skill it is in devops job roles given its clear deficiencies.

JohnMakin 3 years ago

Between the time I made this post and now, it's come back. Really wish that would've returned an error and not an empty list; that almost caused a disaster in my automation.
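One way to guard automation against this failure mode is to refuse to proceed when a fetched list is empty. A minimal Python sketch of such a pre-apply check follows; `validate_ip_ranges` is a hypothetical helper, not part of Terraform or the Cloudflare provider, and the sample CIDRs are reserved documentation ranges rather than real Cloudflare ranges.

```python
import ipaddress


def validate_ip_ranges(cidrs):
    """Return parsed networks, or raise ValueError if the list is empty or malformed."""
    if not cidrs:
        raise ValueError("IP range list is empty; refusing to apply")
    # ip_network raises ValueError on anything that is not a valid CIDR block.
    return [ipaddress.ip_network(c) for c in cidrs]


if __name__ == "__main__":
    print(validate_ip_ranges(["192.0.2.0/24", "2001:db8::/32"]))
    try:
        validate_ip_ranges([])  # what an outage-time response might look like
    except ValueError as err:
        print("aborting:", err)
```

Failing loudly on an empty list turns a silent allowlist wipe into an aborted run.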

TacticalCoder 3 years ago

Anyone remember big iron and servers with uptimes of 5 or 7 nines?

I mean: it used to be a thing. Now we have the cloud.

fizx 3 years ago

7 9's is 3 seconds of downtime per year. That was never a thing.

vicnov 3 years ago

Auth0 seems to be down as well

thom 3 years ago

Yeah, can confirm this (for those looking at their status pages which claim otherwise).

j-rom 3 years ago

Complete Pages outage for me. I have several sites hosted on Cloudflare Pages and I can't access any of them, they're all returning 500's.

gsanderson 3 years ago

Yep, same for me :(

campbellman 3 years ago

Fun day to release a blog post[0] about cloudflare page functions, on a site hosted on cloudflare pages.

[0] https://interbolt.org/blog/split-it-and-forget-it

tootie 3 years ago

Apparently Auth0 as well. Possibly related.

juancampa 3 years ago

Most likely related, I see a `cf-ray` header in the 500 response.
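That check is easy to script. A minimal Python sketch, assuming the response headers are already available as a mapping of names to values (the sample values are placeholders): the presence of `cf-ray` only shows that the response passed through Cloudflare, not that Cloudflare caused the error.

```python
def served_via_cloudflare(headers) -> bool:
    """True if a response's headers include Cloudflare's `cf-ray` request ID."""
    # HTTP header names are case-insensitive, so compare in lowercase.
    return any(name.lower() == "cf-ray" for name in headers)


if __name__ == "__main__":
    # Placeholder header values for illustration only.
    print(served_via_cloudflare({"CF-RAY": "placeholder-id", "Server": "cloudflare"}))
    print(served_via_cloudflare({"Server": "nginx"}))
```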

c22 3 years ago

It's probably bad that I noticed this just due to a large percentage of my regular online-habits suddenly breaking. I liked the old internet where websites just broke one at a time.

NicoJuicy 3 years ago

That was before ddos became common and cheap to execute.

blintz 3 years ago

This is preventing new logins to ChatGPT.

toomuchtodo 3 years ago

Error 1101: Worker threw exception. Interestingly, they're fronting their Auth0 tenant with CF.
- Recursing 3 years ago
  
  auth0 itself is (edit: was) down, https://auth0.com/

xrd 3 years ago

I can't log in to my domain dashboard either. Maybe that is a downstream effect of Workers being offline?

dogweather 3 years ago

Yes — Workaround is to disable your workers. That got my site back up and running.

EDIT 15:07 MDT: People are reporting that Workers are back up. Mine isn't in my site's critical path. So I'm going to leave the Worker disabled (un-routed) until tonight.

codegeek 3 years ago

Are you saying that if you disable Workers, your Cloudflare Pages will come back up?
- dogweather 3 years ago
  
  Ah, I don't know about Cloudflare Pages. I think they use Workers underneath. So unfortunately, there's no fix yet. Sorry.
  - codegeek 3 years ago
    
    Ah well. I cannot access the Workers and pages menu. It returns an error.

gsanderson 3 years ago

My sites have started coming back up now. Their site has also just started working again (previously got a 500): https://pages.cloudflare.com/

pcblues 3 years ago

I won't be the first or last to say these three things:

The internet was meant to stop reliance on single sources (in case of nuclear war)

The size of a house of cards increases the number of failure points

Marketers lie

jve 3 years ago

> The internet was meant to stop reliance on single sources
You have all the technical means. Your home server possibly won't be reachable, yes.
The global connectivity as-is is really, really, really fault tolerant.

CommonGuy 3 years ago

Cloudflare Pages aren't working on a few of my sites too

mparnisari 3 years ago

It is funny that the company that laughed at Okta for a breach just a few days ago, and whose core competency is availability, is now experiencing an outage.

ystad 3 years ago

You should probably indicate what you meant by laughed at okta. Do you have a link??
- mparnisari 3 years ago
  
  https://blog.cloudflare.com/introducing-har-sanitizer-secure...
  Maybe "laugh" is not accurate, idk. But their post kinda looked like 'here, we built this tool that should have been made by okta'.

gkfasdfasdf 3 years ago

Ongoing DDoS attacks are targeting sites that raise funds for Gaza relief efforts: https://twitter.com/arblauvelt/status/1719027920054702363

I wonder if there's a connection.

elithrar 3 years ago

Not related.
(I am the PM lead for Workers databases & storage)
- sterlind 3 years ago
  
  I really appreciate how you've shown up quickly and given direct answers. It's an admirable level of comms for a company so large.
- codegeek 3 years ago
  
  Is there a postmortem coming? Would you be able to tell us what happened at a high level?
  - elithrar 3 years ago
    
    See my comment here: https://news.ycombinator.com/item?id=38075877
    (We’ll share more when we can)

ironmagma 3 years ago

That's sad, hopefully something comes along that can brighten their day.

TheCleric 3 years ago

Wonder if this is related to the mini NPM outage I was experiencing earlier:

https://status.npmjs.org/incidents/zdznxkrp22py

imslavko 3 years ago

No way to confirm, but I think so, just because NPM threw this error at me:
```
     KV GET failed: 401 Unauthorized
```
where KV could refer to the CF KV in workers

Animats 3 years ago

All of them, or just those in some data centers?

AxiomaticSpace 3 years ago

Looks like it's working now as of 1:23 PST

ChrisArchitect 3 years ago

Damn yeah, noticing for last few mins+

winseybash 3 years ago

Looks to be back up now for my sites

johnmaguire 3 years ago

Yes, down since about 3:55 ET.

chocolatkey 3 years ago

The first error was at [hour]:54:23 for me, in case it helps

cristaloleg (OP) 3 years ago

Yeah, my system alerted at 19:56 UTC

tomschlick 3 years ago

Back up for me as of 16:20 ET

darknavi 3 years ago

npmjs is also in a bad spot:

https://status.npmjs.org/

ChrisArchitect 3 years ago

Ask HN:

cristaloleg (OP) 3 years ago

Thanks, fixed.
