Fastly Outage

fastly.com

1255 points by pcr0 5 years ago · 739 comments

lpmitchell 5 years ago

This seems to be impacting a number of huge sites, including the UK government website[0].

[0] https://www.gov.uk/

https://m.media-amazon.com/

https://pages.github.com/

https://www.paypal.com/

https://stackoverflow.com/

https://nytimes.com/

Edit:

Fastly's incident report status page: https://status.fastly.com/incidents/vpk0ssybt3bj

  • caseymarquis 5 years ago

    Fastly Engineer 1: Seems like a common error message. Can you check stackoverflow to see if there's an easy fix?

    Fastly Engineer 2: I have some very bad news...

    • oefrha 5 years ago

      Well, with SO, at least you can search on Google and view the version cached by Google just fine.

      With Reddit however, these days almost all comments are locked behind “view entire discussion” or “continue this thread”. In fact, just now I searched for something for which the most relevant discussion was on Reddit; Reddit was down so I opened the cached version, and was literally greeted by five “continue this thread”s and nothing else. What a joke.

      • tomrod 5 years ago

        Reddit's attempts at dark patterns are embarrassing from all perspectives. If you use dark patterns it's a laughably abysmal implementation. If you abhor dark patterns, it's a frustration.

        • stevewodil 5 years ago

          It's just enough to annoy you but not enough to make everyone leave the platform

          • marvin 5 years ago

            They've actually done a masterful job of finding this balance. I've been on reddit for 15 years and would have quit if they didn't leave the old interface available.

          • charonn0 5 years ago

            On the same day that old.reddit.com stops working I'll leave.

          • CodesInChaos 5 years ago

            The mobile version is literally unusable. Half the subs show an error and you can't load most comments.

          • dandellion 5 years ago

            I think it's because there haven't been any interesting alternatives. I know if I ever see one I'll probably switch in a femtosecond.

          • Zetaphor 5 years ago

            Not _yet_, the same was said about Digg once.

          • muskox2 5 years ago

            I honestly thought Reddit would die when they introduced Reddit awards, it seemed like such an obvious cash grab. You shouldn't underestimate the amount of community momentum that the site has, though.

      • southerntofu 5 years ago

        Yeah, it's crazy how user-hostile reddit.com has become. Fortunately old.reddit.com is still available, but for how long? If only Javascript did not exist, it would be impossible for UX people to come up with something that bad.

        • gpm 5 years ago

          > If only Javascript did not exist, it would be impossible for UX people to come up with something that bad.

          Arrange the html so that the list of comments is at the end (via css). Keep the http connection open, have the "show more" button send some sort of request, and when you receive that request send the rest of the page over the original http connection.

          As usual, solve people problems via people, not tech.
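gpm's idea above can be made concrete: stream the page as a chunked HTTP/1.1 response, flush everything above the comments immediately, and release the rest only when a separate "show more" request arrives. A hedged Python sketch (toy paths and markup, obviously not how Reddit works):

```python
import http.server
import threading

class StreamingHandler(http.server.BaseHTTPRequestHandler):
    protocol_version = "HTTP/1.1"   # chunked responses require HTTP/1.1
    release = threading.Event()     # set when "show more" is clicked

    def do_GET(self):
        if self.path == "/more":
            # The "show more" button targets this URL; replying 204 No
            # Content keeps the browser on the current page.
            StreamingHandler.release.set()
            self.send_response(204)
            self.end_headers()
            return
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.send_header("Transfer-Encoding", "chunked")
        self.end_headers()
        self._chunk("<html><body><p>the article</p>")
        # Hold the connection open until the reader asks for the rest
        # (bounded wait so the page always finishes eventually).
        StreamingHandler.release.wait(timeout=5)
        self._chunk("<ol><li>comment 1</li></ol></body></html>")
        self._chunk("")                 # zero-length chunk ends the body

    def _chunk(self, data: str) -> None:
        # Write one chunk in HTTP chunked-transfer framing.
        payload = data.encode()
        self.wfile.write(f"{len(payload):x}\r\n".encode() + payload + b"\r\n")

    def log_message(self, *args):       # keep the sketch quiet
        pass
```

With CSS positioning the comment list wherever you like, no JavaScript is involved: the browser simply renders more of the page as it arrives.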

          • peq 5 years ago

            How would you make the button send a request without js and without navigating to another page?

            Maybe css to load an image on :active or is there some better way?

            • chrismorgan 5 years ago

              Here are two robust techniques that I haven’t seen actually employed in production for maybe fifteen years:

              ① A submit button or link targeting an iframe which is visually hidden. (Or even don’t hide it. If only seamless iframes had happened, or any other way of auto-resizing an iframe: relevant spec issues are https://github.com/whatwg/html/issues/555 and https://github.com/w3c/csswg-drafts/issues/1771.)

              ② A submit button or link to a URL that returns status 204 No Content.

              (CSS image loading in any form is not as robust because some clients will have images disabled. background-image is probably (unverified claim!) less robust than pseudoelement content as accessibility modes (like high contrast) are more likely to strip background images, though I’m not sure if they are skipped outright or load and aren’t shown. :active is neither robust nor correct: it doesn’t respond to keyboard activation, and it’s triggered on mouse down rather than mouse up. Little tip here for a thing that people often get wrong: mouse things activate on mouseup, keyboard things on keydown.)

            • cmg 5 years ago

              Yep:

              .button:active { background-image: url('/some-reference-thats-actually-a-tracker'); }

          • southerntofu 5 years ago

            Well technically everything is possible. But Javascript was precisely designed to encourage this kind of pattern.

            > As usual, solve people problems via people, not tech.

            So true..

        • chrismorgan 5 years ago

          “Continue this thread” links don’t depend on JavaScript at all.

          “View entire discussion” couldn’t be implemented perfectly with <details> in its present form, but you can get quite close to it with a couple of different approaches.

          I think the infinite scrolling of subreddits is about the only thing that would really be lost by shedding JavaScript. Even inline replies can be implemented quite successfully with <details> if you really want.

        • metalforever 5 years ago

          Yeah, I'm going to stop using the platform when they get rid of this. Not interested.

        • driminicus 5 years ago

          When it goes away you can try teddit.net

          • ilikepi 5 years ago

            Why wait? Teddit has been a great substitute for reading in a mobile browser, and making an iOS shortcut for transforming Reddit links was pretty straightforward.

        • jaywalk 5 years ago

          Impossible? Man, it's crazy how fast people forget things like good old fashioned <form> GETs and POSTs. It would obviously be a full page refresh, but other than that the same awful UX could still be implemented.

      • comboy 5 years ago

        I wanted to suggest site:old.reddit.com since I use that version with automatic redirect, but this:

        https://old.reddit.com/robots.txt

        is very different from this:

        https://reddit.com/robots.txt

        I guess there is a market for a search engine (maybe accessed through Tor) which does not care about robots.txt, DMCAs, the right to be forgotten, etc. Bootstrapping it should not be that hard, since it can also provide better results for some queries: nobody is fighting over ranking positions until it's widely known.

        I'm not sure how far we are from being able to do full-text internet search. Or rather even quote search, preferably with some fuzziness options. That would be cool; Google's quotation marks were really neat back when they were working.

        • utf_8x 5 years ago

          Wonder what the story is behind these two...

              User-Agent: bender
              Disallow: /my_shiny_metal_ass
              
              User-Agent: Gort
              Disallow: /earth
          • gaius_baltar 5 years ago

            Those are good old Easter eggs, perhaps a memory from when Reddit was a nice place. They stop appearing and are replaced by dark patterns once sites jump the shark.

          • toastal 5 years ago

            I read that some people use false slugs in robots.txt as a honeypot of sorts. IPs that actually read the robots.txt, ignore the disallow, and still access the URI are outright banned.
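A hedged sketch of that honeypot idea (the log format and ban policy are made up for illustration; the path is the one from Reddit's robots.txt quoted above):

```python
# Ban IPs that fetched robots.txt and then requested a path that
# robots.txt explicitly disallows; only a rule-reading, rule-ignoring
# bot would ever find the fake slug.
HONEYPOT_PATHS = {"/my_shiny_metal_ass"}   # listed under Disallow

def ips_to_ban(access_log):
    """access_log: iterable of (ip, path) pairs in request order."""
    read_robots = set()
    banned = set()
    for ip, path in access_log:
        if path == "/robots.txt":
            read_robots.add(ip)
        elif path in HONEYPOT_PATHS and ip in read_robots:
            banned.add(ip)   # read the rules, ignored them anyway
    return banned
```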

            • taylorfinley 5 years ago

              Then when a flamewar breaks out you just have to get your adversary to click a link to get them IP banned.

            • jedberg 5 years ago

              Ha, that would have been a really smart idea! Sadly we didn't think of that at the time. But we had other honey pot URLs.

          • gostsamo 5 years ago

            It might be related to the time a few years ago when Google added exclusions for user agent t1300 in regard to its founders. Gort seems to be a robot from old scifi, and bender might be something similar.

          • jedberg 5 years ago

            Just some fun humor we added for other nerds who read robots.txt files.

          • iso1210 5 years ago

            Easter eggs

          • pluc 5 years ago

            It's neckbeard humor.

        • oefrha 5 years ago

          > I guess there is a market for search engine (maybe accessed through tor) which does not care about robots.txt, DMCAs, right to be forgotten etc. Bootstrapping it should not be that hard since it can also provide better results for some queries since nobody is fighting about the position until it's widely known.

          That’s not going to happen before Cloudflare is dethroned. See this recent thread for some perspective: https://news.ycombinator.com/item?id=27153603

          And even if there’s no Cloudflare, large sites that people want to search will always find ways to block bad bots.

          The only thing I can think of that might work is using crowd-sourced data, with all the problems that come with crowdsourcing.

          • comboy 5 years ago

            Sadness.

            There is a solution for all this mess and I'm blocking HN and a few different domains until I implement at least the first step after which I can share it here.

        • bruce343434 5 years ago

          try editing your hosts file to redirect reddit to old.reddit

              # /etc/hosts
              reddit.com old.reddit.com
              www.reddit.com old.reddit.com
              np.reddit.com old.reddit.com

      • Siira 5 years ago

        I am archiving subreddits on Github in plain-text org-mode. If you have some subreddit in mind, open an issue, and I'll create an archive repo for it.

        - https://github.com/NightMachinary/r_HPfanfiction

        - https://github.com/NightMachinary/r_rational

      • ornornor 5 years ago

        Try “site:old.Reddit.com”

        • oefrha 5 years ago

          That's not going to work.

            $ curl https://old.reddit.com/robots.txt
            User-Agent: *
            Disallow: /
          
          Also, even if search engines are allowed, old.reddit.com pages are not canonical (<link rel="canonical"> points to the www.reddit.com version, which is actually reasonable behavior), so pages there would not be crawled as often or at all.
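The effect of that blanket Disallow on a compliant crawler can be reproduced with Python's stdlib robots.txt parser (a sketch; the URL path is made up):

```python
from urllib import robotparser

# Parse the same rules old.reddit.com serves: a blanket Disallow.
rp = robotparser.RobotFileParser()
rp.parse(["User-Agent: *", "Disallow: /"])

# No compliant crawler, Googlebot included, may fetch anything.
print(rp.can_fetch("Googlebot", "https://old.reddit.com/r/programming"))  # False
```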
    • amirhirsch 5 years ago

      Stack Overflow is down, can someone tell me how to declare a static multidimensional array in C++?

      • vlovich123 5 years ago

        Google and DDG surface SO results cached within their own page. Here’s the copied answer:

            int main() {
                int arr[100][200][100]; // allocate on the stack
                return 0;
            }
    • divs1210 5 years ago

      Haha! That sounds highly plausible!

    • alvis 5 years ago

      Haha! That explains why the internet was down for a while!

    • runawaybottle 5 years ago

      Oh man, how do we keep a pocket copy of SO? All of our jobs depend on it.

  • lucasverra 5 years ago

    But https://news.ycombinator.com/ is UP! :) Prepare those HN servers for massive influx in 3...2..1..

    • unityByFreedom 5 years ago

      While we're here.. I am a bit surprised to see how many sites use Fastly. As a dev I've always been happy with Cloudflare.

      • OJFord 5 years ago

        Me too, but in a way I'm even happier knowing that not everyone does and something else popular exists too.

      • factsaresacred 5 years ago

        Google's Firebase platform uses Fastly so that's a significant chunk of the web.

      • xuki 5 years ago

        Now imagine how many sites would go down if it was CF

        • s_fischer 5 years ago

          No need to imagine! Just search HN for "cloudflare outage" and you'll see that it happened several times over the last few years

        • unityByFreedom 5 years ago

          Is this a call for competition? I regard Cloudflare as state-of-the-art in terms of security and ease-of-use. I certainly hope their knowledge replicates across other organizations. As of now they're still building highly impactful tools that are easy to use and that no one else quite provides. I don't really expect another organization to match them given the strength of their current leadership. I think they've built in a head start for a while.

          • southerntofu 5 years ago

            > Cloudflare as state-of-the-art in terms of security and ease-of-use

            Depends whose security. I value my security dearly and that's why i use the Tor Browser. Cloudflare has decided i cannot browse any of their websites if i care about my security (they filter out tor users and archiving bots aggressively) so i'm not using any cloudflare-powered website. Is it good for security that we prevent people from using security-oriented tooling, and let a single multinational corporation decide who gets to enter a website or not? In my book creating a SPOF is already bad practice, but having them filter out entrances is even worse.

            Also, are all of these CDNs and other cloud providers solving the right problems?

            If you want your service to be resilient against DDOS attacks, you don't need such huge infrastructure. I've seen WP site operators move to Cloudflare because they had no caching in place, let alone a static site.

            If you want better connectivity in remote places where our optic fiber overlords haven't invested yet, P2P technology has much better guarantees than a CDN (content-addressing, no SPOF). IPFS/dat/Freenet/Bittorrent... even multicast can be used for spreading content far and wide.

            Why do sysadmins want/use CDNs? Can't we find better solutions? Solutions that are more respectful to spiders and privacy-minding folks with NoScript and/or Tor Browser?
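The content-addressing guarantee mentioned above is easy to illustrate: the address is derived from the bytes themselves, so any peer can serve them and any client can verify them, with no trusted origin to act as a SPOF. A minimal sketch (plain SHA-256, not a real IPFS CID):

```python
import hashlib

def address(content: bytes) -> str:
    """Derive an address from the content itself (content addressing)."""
    return hashlib.sha256(content).hexdigest()

def verify(addr: str, content: bytes) -> bool:
    """Any peer's copy can be checked against the address, so it does
    not matter who served it; there is no single origin to go down."""
    return address(content) == addr
```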

            • unityByFreedom 5 years ago

              Speaking for myself here, I don't see how people can use the web without javascript. As for Tor, you're routing other people's traffic while they route yours, so I can understand how such connections would be blocked given that blocking IPs is still a method for mitigating security issues, and you can't determine the IP of a Tor browser.

              • falcolas 5 years ago

                > I don't see how people can use the web without javascript.

                It's pretty easy: browse marked-up documents, not applications. If some developer conflates the former with the latter, move on.

              • dolmen 5 years ago

                > As for Tor, you're routing other people's traffic while they route yours

                Using Tor doesn't imply that your machine is also a Tor exit node.

          • belter 5 years ago

            They have also been responsible for one of the worst security incidents ever:

            https://news.ycombinator.com/item?id=13718752

            Only discovered, we should not forget, due to the good graces of Google Project Zero.

            A certain dose of skepticism towards any technical offering out there would be advised.

            • unityByFreedom 5 years ago

              I like Cloudflare's post mortems, and I like how they fight back against patent trolls. For me as a dev they are #1.

              • detaro 5 years ago

                Do you have experience with the competitors?

                • unityByFreedom 5 years ago

                  I prefer tech that I can use both at work and on hobby projects at home.

                  To that end I've only used cloudflare and netlify. The others have too much friction to try out. I expect I would get experience on the job if necessary.

        • oneeyedpigeon 5 years ago

          Do more rely on Cloudflare? Because this felt like it was more than half the internet, certainly more than half the biggest sites.

          • ohashi 5 years ago

            I think so, Fastly seems to have a few huge enterprise clients while Cloudflare seems more balanced (and larger)

            • dotancohen 5 years ago

              I think that Fastly starts at $50/month, no free tier. So that would preclude small or non-profit-motivated sites from using it.

        • monkeydust 5 years ago

          interesting thought ... a new type of 'too big to fail'?

        • mongol 5 years ago

          Where is Akamai in this comparison?

          • unityByFreedom 5 years ago

            Fair point. Maybe Fastly is more akin to Akamai given it seems to be more enterprise-y. By market cap, Cloudflare is 26 billion, Akamai is 18, and Fastly is 6.

            Fastly's free offering gives you "$50 worth of traffic" whereas Cloudflare has a perpetually free option. And for Akamai you have to apply for a free trial.

            • krzyk 5 years ago

              That's market cap, but if you look at the amount of traffic, Akamai is estimated at 15-30% and CF at 10%.

              So if it went down, it would cripple a vast amount of the internet.

              • mekkkkkk 5 years ago

                Akamai is balls deep in video streaming, which is probably the most bandwidth/traffic intense thing for a CDN to dabble with. My guess is that CF has much more diverse traffic. Hence the fallout from an interruption would be quite different.

                • krzyk 5 years ago

                  Not quite. Akamai is more large-corp centric (they don't serve the average Joe); besides that, they also do security. If it went down you would all of a sudden see e.g. a lot of DDoS become possible.

    • atymic 5 years ago

      New error now, hopefully fix in progress.

      Fastly error: unknown domain: www.fastly.com.

      Details: cache-syd10161-SYD

      • Hamuko 5 years ago

        >The issue has been identified and a fix is being implemented.

        According to the status page.

      • foo__-bar 5 years ago

        That doesn't take away their embarrassment. It's insane how many websites rely on Fastly. Twitter hasn't been loading emojis for a while, and I believe it's for the same reason.

    • darkr 5 years ago

      Might not be the case anymore, but a few years back, Hacker News was just running on a single server.

    • VMtest 5 years ago

      I am already here

  • maest 5 years ago

    Amusingly, the Stackoverflow 503 page has a typo:

      Error 503 Service Unavailable
      Service Unavailable
      
      Guru *Mediation*:
      Details: cache-lon4236-LON 1623146049 854282175
      
      Varnish cache server
  • c-fe 5 years ago

    also https://www.reddit.com (at least in Netherlands)

    edit: 12:05 up again for me, no images or custom fonts loading though ... and down again 1 minute later

    edit: 13:01 reliably up again for me

  • kevincox 5 years ago

    > potential impact to performance

    So it is a "performance" issue when all pages give a 503.

  • threeseed 5 years ago

    I wonder why Amazon is not using Cloudfront for their own website.

    • lordpankake 5 years ago

      Cloudfront, by Amazon's own admission, specialises in high bandwidth delivery (ie huge videos). Fastly has consistently better performance as a small object cache, which makes it the choice for web assets

      https://www.streamingmediablog.com/2020/05/fastly-amazon-hom...

    • morelikeborelax 5 years ago

      Fastly gives them the edge performance they need without having to build it themselves. They have been a customer for a while I think.

      • raphaelj 5 years ago

        But they have competing products through AWS.

        • playpause 5 years ago

          I imagine it works well for the whole business that they allow product teams to use the best cloud tools for the job rather than requiring them to use AWS for everything. If AWS is forced to compete even for Amazon.com's custom, that should make the whole company more resilient to long term technical stagnation.

      • threeseed 5 years ago

        AWS Route53 and Cloudfront are direct competitors to Fastly.

    • schappim 5 years ago

      Yeah, this is what makes me feel this is more an AWS thing

      • lpmitchell 5 years ago

        The m.media-amazon.com domain (and a few other CDN'd domains that they use) are running through Fastly:

            nslookup m.media-amazon.com
            
            Name:  media.amazon.map.fastly.net
        
        
        It is very interesting that they are not using CloudFront!
        • cowsandmilk 5 years ago

          really, m.media-amazon.com seems to have a very short TTL (showing 37 seconds right now) and has been weighted to cloudfront now.

          Amazon is also known to use Akamai. Sure, Amazon relies heavily on AWS, but why should it surprise anyone that a retail website obsessed with instant page loads decides to use non-AWS CDNs if the performance is better?

          Even if CloudFront became the default, I'm certain amazon.com would keep contracts with fastly and akamai just so they can weight traffic away from CloudFront in an outage.

        • jamespo 5 years ago

          Good to have 3rd party redundancy, time to fail over to something else now I'd think though.

          • longwave 5 years ago

            They already have:

              $ host m.media-amazon.com
              m.media-amazon.com is an alias for c.media-amazon.com.
              c.media-amazon.com has address 99.86.119.84
            
            (which is a Cloudfront IP)
          • xmdx 5 years ago

            Yep they did exactly this and are now running on cloudfront

        • tester34 5 years ago

          Why?

      • adzicg 5 years ago

        looks like amazon.com started using fastly in May 2020 (https://www.streamingmediablog.com/2020/05/fastly-amazon-hom...) so it's not an AWS thing

      • threeseed 5 years ago

        AWS is reporting no issues across the regions:

        https://status.aws.amazon.com

      • alex_duf 5 years ago

        Fastly deploy their own hardware (that's one of their selling points); I don't think they rely much on AWS, maybe just for network interconnection?

      • lordpankake 5 years ago

        Fastly doesn't run on AWS.

      • apexalpha 5 years ago

        It sure looks like an AWS error, even Amazon.com is mostly down.

  • antihero 5 years ago

    I wonder why amazon.co.uk uses Fastly and not CloudFront?

    • georgekeesjones 5 years ago

      I imagine they use a few different CDNs for things like this.

    • abunuwas 5 years ago

      That doesn't appear to be the case, does it? Amazon sites are all working fine, at least for me.

      • danso 5 years ago

        Their CSS and JS were down for a few minutes. I was able to log in to Amazon, but the entire site was in Times New Roman; it was fixed a few minutes later.

    • jonplackett 5 years ago

      Must be more than fastly. Heroku is also down.

      • abunuwas 5 years ago

        That's also because of Fastly; I got this response from the Heroku dashboard:

        Fastly error: unknown domain: dashboard.heroku.com.

  • lkbm 5 years ago

    Good thing we use Cloudfront and Cloudflare where I work.

    > Statuspage Automation updated third-party component Spreedly Core from Operational to Major Outage.

    > Statuspage Automation updated third-party component Filestack API from Operational to Degraded Performance.

    Oh, right. :-D

    Don't get me wrong, I love the proliferation of APIs and easily-integrated services over the past 20 years. We're all one interdependent family, for better and for worse.

  • huijzer 5 years ago

    CSS/Javascript at https://github.com/ appears to be down as well, making GitHub quite unusable.

    • oneeyedpigeon 5 years ago

      GitHub Pages appears to be down too, taking an awful lot of sites offline

      • busymom0 5 years ago

        Github is working fine for me in Canada but others aren't. Tried without browser cache too and it works okay.

        EDIT: Most sites seem fixed now here in Canada. Tested stackoverflow, reddit, GitHub, PayPal, gov.UK and all worked fine.

  • weird-eye-issue 5 years ago

    Yikes, seeing just a "connection failure" on PayPal is something else.

    edit: PayPal looks to be back up at least in US East, but when I turn off my VPN and access from Asia I get "Fastly error: unknown domain: www.paypal.com."

    Now I'm seeing a 503

  • 1ncorrect 5 years ago

    > Monitoring The issue has been identified and a fix has been applied. Customers may experience increased origin load as global services return. Posted 4 minutes ago. Jun 08, 2021 - 10:57 UTC

    Looks to be working again my end.

  • thrdbndndn 5 years ago

    Interestingly, Twitter only has its emoji SVGs down.

    • oneeyedpigeon 5 years ago

      And this is (one reason) why using images instead of actual emojis is such a stupid idea. Why, Twitter, WHY?

      • numpad0 5 years ago

        err, to make representations platform independent?

        • oneeyedpigeon 5 years ago

          That sounds antithetical to the purpose of emojis.

          • numpad0 5 years ago

            Vendors don't even agree on whether the :gun: is a revolver or an automatic or a space ray gun or even a water gun; btw it's a 1911 in the original DoCoMo emojis

            1: https://blog.emojipedia.org/content/images/2018/04/microsoft...

            • oneeyedpigeon 5 years ago

              Sure, that's a benefit of emojis being semantic. If you want 'SFW' emojis, you can get them. Converting them to images makes that impossible. And uses vastly more bandwidth, makes them impossible to copy+paste, probably has accessibility issues, etc.
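The "semantic" point above is literal: the glyph the vendors disagree about is a single named Unicode code point, and that metadata (standard name, copy-paste-ability, tiny size) is exactly what an image replacement discards. A quick Python illustration:

```python
import unicodedata

# The emoji the thread argues about is one code point with a standard
# name; replacing it with an image throws that metadata away.
gun = "\U0001F52B"
print(unicodedata.name(gun))        # PISTOL
print(len(gun.encode("utf-8")))     # 4 (bytes, versus kilobytes for an SVG)
```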

          • mbreese 5 years ago

            Same reason why Gmail uses their own emojis rather than the system ones — (as said above) branding. When you send a tweet, Twitter wants it to look identical across all devices. The classic native UI vs cross-platform UI debate in a nutshell.

            • gspr 5 years ago

              Cool, so instead of actually serving text, they could also just serve up little SVGs for each letter. Because god forbid the recipient chooses a different font than Gmail!

            • dolmen 5 years ago

              That's not a minor UI issue.

              Twitter is a media between people. Removing emoji representation differences on user devices is a way to hopefully reduce misunderstandings between users.

      • bb010g 5 years ago

        Branding! (Fun fact: Hacker News strips emoji.)

        • busymom0 5 years ago

          How does it strip it? Test:

          Edit: You are right. It got rid of the emoji after Test.

  • doublerabbit 5 years ago

    https://deb.debian.org is down too which borked my installation.

  • benrbray 5 years ago
    • fgaz 5 years ago

      The mirrors still work though, and cabal will just fall back to those

  • jakub_g 5 years ago

    https://www.bbc.com/news/technology-57399628

    "A number of leading media websites are currently not working, including the Guardian, Financial Times, Independent and the New York Times."

  • toaway 5 years ago
  • aliasEli 5 years ago

    What was far worse than half of the internet being down was that Hacker News also had problems. If I waited long enough on a comments page I got an error message. I don't quite understand what happened there. The communication between my system and HN must have been working, otherwise I would never have gotten an error message, so it must have been some internal HN problem. But since HN should only need its own internal "database" to generate comment pages, I don't understand why it should be impacted by the Fastly problems.

  • jzer0cool 5 years ago

    I could not tell from the Fastly status page: what caused the fault? Could anyone point to any past stories of a similar nature, other than DDoS?

  • zhan_eg 5 years ago

    Bitwarden is also down (the Web Vault, not the website).

    • 0xbkt 5 years ago

      I will never understand the point of putting a CDN behind a CDN.

      • numpad0 5 years ago

        What makes sense in the world is what puts bacon on the table, not what actually makes sense.

      • dTal 5 years ago

        Yo dawg, I heard you like CDNs...

    • iso1631 5 years ago

      My self hosted bitwarden is fine, as are all my self hosted sites.

  • stevenwliao 5 years ago

    Seems to affect Target ( https://www.target.com/ ) and Reddit ( https://www.reddit.com/ ) as well.

  • tandav 5 years ago
  • abhiminator 5 years ago

    PayPal seems to be working for me at the moment. Rest of the sites are 503s.

  • rvz 5 years ago

    Centralising everything™ and the whole internet goes down because of that.

    • aembleton 5 years ago

      and yet you're able to leave this comment.

      • rvz 5 years ago

        Because HN and those who use fewer services, or who use backup services, are smart; those who are caught out now have to panic and wait.

        Probably going to short the hell out of $FSLY.

        • morelikeborelax 5 years ago

          Over one issue that highlights they have an abundance of top level customers? Interesting strategy when it's already at a low.

          • rvz 5 years ago

            One issue that should have been mitigated by Fastly at the very least; it's worse if the client has to do it.

            They proudly stated this from their own website to their customers:

            > "Fastly’s network has built-in redundancies and automatic failover routing to ensure optimal performance and uptime."

            If that isn't one huge lie, I don't know what is.

            • aliasEli 5 years ago

              Please don't call it a lie. That would mean they knowingly presented something they knew to be false as the truth. So far I have seen no evidence to support that.

            • southerntofu 5 years ago

              It is definitely a lie, but it's the same lie sold by all cloud offerings. Can you name a single cloud/CDN operator without downtimes?

              It's normal to have downtimes but they are usually scheduled and quick (think <10 minutes per month for rebooting and/or hardware parts replacement). I'm pretty sure most non-profit hosts like disroot.org or globenet.org have similar or better 9's than all these fancy cloud services.

            • detaro 5 years ago

              It can have all these things and still fail; suggesting otherwise would be fairly naive.

    • smhost 5 years ago

      if by "everything" you mean one thing, and by "centralize" you mean not centralized, then sure.

      • darkwater 5 years ago

        How is having a large chunk of the internet using the same CDN provider not "centralizing"? It's not a hard monopoly obviously but still it meets the definition of centralization.

        • smhost 5 years ago

          how is private companies choosing to use a common supplier in a competitive market centralization? monopolies are not centralization either. you need to read a better book.

          • southerntofu 5 years ago

            How is a market competitive when there's a quasi-monopoly on infrastructure? When public money is used to irrigate the same corporations with huge $$$, while non-profit network operators are left to rot?

          • darkwater 5 years ago

            it's centralization because they all use the same provider. Why do you care about incentives here? The result is the same, just like capitalism and the free market tend toward monopolies in the long run.

  • Simran-B 5 years ago
  • jb1991 5 years ago

    For what it's worth, I'm having these problems also with cnn.com, reddit and many others; however, when I switch away from WiFi to my cell provider's network, they work fine.

  • tus89 5 years ago

    Paypal back, off fastly

    • jslakro 5 years ago

      Why don't other sites bypass the CDN and go direct?

      • meibo 5 years ago

        If you aren't prepared to do CDN changes on a whim when something like this happens, it's often better to wait for the problem to be resolved instead of making things worse for yourself due to misconfigurations, revealing your origin IPs, etc.

        Can always improve the process for the next outage.

        • jslakro 5 years ago

          For sure; as in other industries, the changes come after big trouble like this. But it would be interesting to hear how they (PayPal) deal with that.

        • numpad0 5 years ago

          Also, it takes time for DNS changes to propagate (some people hate this word, but it really does).

      • tus89 5 years ago

        You need big infra and crack teams of ops people, which PayPal can't afford not to have.

  • dan-robertson 5 years ago

    Is there anything these big sites could do in this situation, or must they choose between running and maintaining all of their own infra or relying on a single CDN?

    • darkr 5 years ago

      If you have absolutely vanilla CDN requirements, you can run multiple CDNs and fail-over or load balance between them using DNS.

      Quite a few Fastly customers have more than vanilla requirements though, and may have a lot of business logic performed within the CDN itself. That Fastly is "just Varnish" and that you can perform powerful traffic manipulation is one of its main selling points.
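      The DNS-level failover idea above can be sketched as a health probe. This is a minimal illustration, not anyone's real setup: the endpoint URLs are hypothetical placeholders, and in practice the DNS provider runs checks like this and swaps records (subject to TTLs) rather than the client choosing at request time.

```python
import urllib.request

# Hypothetical CDN health-check endpoints -- placeholder names, not real config.
CDN_ENDPOINTS = [
    "https://primary-cdn.example.com/health",
    "https://secondary-cdn.example.com/health",
]

def pick_healthy_endpoint(endpoints, timeout=2.0):
    """Return the first endpoint whose health check answers HTTP 200.

    A DNS-based failover would run probes like this on the provider side
    and update records accordingly; here we just show the probe logic.
    """
    for url in endpoints:
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                if resp.status == 200:
                    return url
        except OSError:
            # URLError/HTTPError/timeouts are all OSError subclasses:
            # treat any failure as "this CDN is down" and try the next one.
            continue
    return None
```

      Even with a working probe, clients cache DNS answers, so some fraction of users keep hitting the dead CDN until their cached records expire (which is why the thread below notes this doesn't fully save the user experience).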

      • dan-robertson 5 years ago

        I suppose it’s still a bad experience for the user if some % of attempts to connect fail or if some % of scripts/styles/images fail to load. So I think that means dns information about failures needs to somehow be propagated quickly. Not sure how well that works in practice.

    • fmajid 5 years ago

      Use two CDNs and DNS providers for redundancy. Gets expensive, but at scale, probably doesn't make a huge difference. More complexity for the site operators to manage, however.

  • zarker 5 years ago

    Spotify is behaving strangely as well https://www.spotify.com/

  • mondaygreens 5 years ago

    Quora and reddit too

  • black_puppydog 5 years ago

    All of these work from here in Grenoble, France...

    • southerntofu 5 years ago

      That's the problem with these black-box cloud offerings, that you can never know what will work (or not) and from where. You get semi-random, pseudo-localized outages that are not accounted for in all the 9's of availability.

      With a standard TCP/UDP session, it mostly just works or doesn't and you can get a proper traceroute to know what's up. With these fancy CDNs, there's a whole new can of worms to deal with and from a client's perspective you have no clue what's happening because it's all taking place in their private network space where we have no "looking glass".

      Fuck the cloud, i want real Internet.

    • terramex 5 years ago

      Same here in central Poland (Łódź area), no problem with any of linked websites.

      edit: My whole Twitter timeline is full of posts saying "Twitter outage? what outage?". Same on Reddit and Twitch chat, feels like for a short time I was invited into some exclusive circle lmao. StackOverflow and other StackExchange sites also work so I can look stuff up for you.

    • clydethefrog 5 years ago

      Interesting. Here in the Netherlands they don't.

    • ju-st 5 years ago

      Germany here (n=1), everything works except reddit and ft.com

    • bsaul 5 years ago

      depends where in France, most people i know here are affected as well

    • bambax 5 years ago

      Not from Paris, France.

    • agumonkey 5 years ago

      Not for me

  • amiga-workbench 5 years ago

    https://www.theverge.com/ seems to be down too

  • cellover 5 years ago

    Is looking at those links like staring at a road accident instead of just passing by?

  • xmdx 5 years ago

    Terraform having issues and rubygems down too

    • meepmorp 5 years ago

      That explains the spotty container build failures over the last half hour. Good thing I decided to procrastinate instead of debugging the issue!

  • sammygreen 5 years ago

    Seems to be every site that runs varnish...

    • meowface 5 years ago

      Fastly largely runs on Varnish, it seems: https://www.fastly.com/blog/benefits-using-varnish

      >At the core of Fastly is Varnish, an open source web accelerator that’s designed for high-performance content delivery. Varnish is the key to being able to accelerate dynamic content, APIs, and logic at the edge.

    • busymom0 5 years ago

      I think Fastly is the one having problems (they happen to use varnish but I haven't seen anything which says varnish is the root cause) - so all sites using it are down.

  • dreamer7 5 years ago

    Firebase hosting has been affected as well

  • rottc0dd 5 years ago

    SSO and github are back online now

  • maelito 5 years ago

    nature.com

  • linuxfan2021 5 years ago

    You would think that the UK GOVERNMENT would have their own private CDN or something...

  • keithnz 5 years ago

    twitch also, lots of other minor ish websites

  • brador 5 years ago

    Searchable offline backup of stack anyone?

  • toaway 5 years ago

    www.gov.uk & bbc are back

  • dpacmittal 5 years ago

    elastic.co down as well

  • treeshateorcs 5 years ago

    developer.spotify.com

  • magicturtle 5 years ago

    reddit down as well

    • bartread 5 years ago

      It's OK though, because large swathes of this discussion seem to have turned HN into reddit, at least temporarily. Normal service will no doubt resume in due course.

  • ramshanker 5 years ago

    twitch.tv Too.

  • Hani1337 5 years ago

    etsy.com too

  • madeofpalk 5 years ago

    > [0] https://www.gov.uk/

    Just checked, thank god the NHS vaccine site is still available - vaccines just got rolled out for under 30s today.

    • patentatt 5 years ago

      Edit: I didn't mean anything negative here! Just slightly shocked that as the UK is opening up under-30 vaccinations, the US is struggling to find any more willing takers. It's really probably a sign that there are fewer anti-vaxxers in the UK more than anything. And that universal healthcare is more efficient at distribution than an inherently for-profit system. I don't know, but I just didn't realize it was so different in the UK

      • bdsa 5 years ago

        I think this may be because we've had much higher uptake as far as I know, so getting down the age ranges has been slower (by which I mean, yes, maybe the US has made it available to all adults, but how many (as a proportion) have taken it up)

      • madeofpalk 5 years ago

        US and UK have very similar vaccination rates despite the US being open to more age ranges. This indicates that a higher percentage of eligible people have gotten the vaccine in the UK, and the US has somewhat hit a wall in terms of vaccinations (though there is the concern that the rates will slow down in the UK also).

        I must admit, it has been strange seeing my US peers getting the vaccine months before I can in the UK, but I guess I take comfort knowing that both countries are still doing pretty well!

      • numpad0 5 years ago

        You know which one’s worst? Japan... still reservation based and for 65 and up only!

      • zhte415 5 years ago

        Both the UK and US are doing well.

        https://ig.ft.com/coronavirus-vaccine-tracker for reference.

        What's important is to share vaccines with all nations, and non-nations.

      • spzb 5 years ago
        • patentatt 5 years ago

          Fascinating. So those rates are including only ages 30+, which means that once it’s unrestricted the UK should have a very high vaccination rate while ~15-25% of the US will still remain unvaccinated entirely by choice. Wow. So you’re absolutely right, the UK is in reality far far ahead and the US is completely broken as far as public health is concerned because of willing ignorance.

        • oneeyedpigeon 5 years ago

          For one dose. For full vaccination, the US is (slightly) ahead according to that same site.

          • spzb 5 years ago

            I think we can agree it's certainly not "far behind"

          • lol768 5 years ago

            This is by design though, the gap between the two doses is higher here.

  • PaywallBuster 5 years ago

    and Imgix

  • collyw 5 years ago

    Click the new tab. Lots of posts about sites being down. All flagged.

austinjp 5 years ago

Yeah so it's been mentioned in the comments already, but to everyone in Fastly right now: I feel for you. Something like this must be insanely stressful, and not just during the outage. There will be (should be) a massive post-mortem. People will be losing sleep over this for days, weeks, months.

:(

Edit: There seems to be a major empathy outage in this thread. Disgusted but not surprised, unfortunately.

  • strictfp 5 years ago

    Meh. Losing sleep sounds like an over-reaction. No system is foolproof. Of course Fastly should do what they can to prevent downtime, but it's still expected that they will go down.

    I would blame anyone who claimed otherwise or couldn't deal with it while not having a fallback.

    • kylen 5 years ago

      I hear that you're suggesting that those involved shouldn't feel bad because it's systemic / just a job / etc. But the reality is that incidents like this can be very traumatic for those involved, and that's not something they can control. If it was that simple to manage, depression and anxiety would not be a thing.

      I think it's best to show a large amount of support and empathy for the individuals having a really bad day today, and how awful they may feel. Some will probably end up reading this thread (I know I would).

      And of course, still hold Fastly the business accountable for their response (but objectively, once we understand what the root cause was, and the long term solution).

      • paublyrne 5 years ago

        I don't see how it's so traumatic for the engineers involved, unless the company culture in Fastly is really awful and there are punitive repercussions, or attempts to pin responsibility on individuals rather than systems, which I doubt.

        Many here have been responsible for web service outages albeit on much smaller scales, and in my experience it feels awful while it's happening but you quickly forget about it because so does everyone else.

        • dagw 5 years ago

          I guess it very much depends on your personality. I screwed up a not-very-important project for a client 4 years ago while working at a different company, and I still feel bad when I think about it, despite the fact that my company had my back through the entire process and literally everybody involved has moved on and probably forgotten about it.

        • efficax 5 years ago

          When CNN is reporting on the bug you deployed it might have some psychological impact

        • Jare 5 years ago

          > on much smaller scales

          > you quickly forget about it because so does everyone else

          This is definitely not the case here, and the experiences are bound to be very different.

      • strictfp 5 years ago

        I wanted to show support to the engineers in the sense that I don't think you should encourage a working culture where you have "massive post-mortems" and expect people to feel bad for extended periods of time over simple mistakes. By not making a big deal out of it, you can also support your staff.

        But I think our disagreement mainly stems from how we interpreted the parent comment. I thought it was rather double-edged: on one hand claiming to show support, on the other emphasizing how big of a catastrophe this was.

        I just wanted to say that I think it most likely was a completely natural mistake, only exacerbated by the scale of the company, and that while you should take some action to prevent it in the future, you should not spend so much time dwelling on it. Shit happens, it's fine.

        • kylen 5 years ago

          I agree, and I think I picked on your comment a bit because it was the top one.

    • busymom0 5 years ago

      I think the government websites being down (UK ones for example) are the bigger issue. Reddit/Stackoverflow etc being down isn't that big of a deal imo.

  • H8crilA 5 years ago

    Imagine losing sleep over a corporate problem where you're just the next Joe Engineer, to be fired the second you're not needed. Have some perspective people.

    • adkadskhj 5 years ago

      I'm confused: why isn't being fired something to lose sleep over, in your eyes?

      I get that you're implying that the job itself is not worth that much concern, but it seems you're ignoring that jobs bring in income, pay your mortgage, etc.

      If I lost my job tomorrow I'd be terrified.

      • ihumanable 5 years ago

        People rarely get fired for outages. The comment you are replying to is saying that engineers shouldn’t stress out over an outage that only impacts a corporation.

        It’s a commentary on work / life balance and the all-too-common phenomenon of employees sacrificing for a company (in this case, feeling such personal stress that they would lose sleep) and contrasting it with the fact that most employers will fire you without a second thought if it’s what’s best for the business (they won’t lose any sleep).

        It’s a critique of the asymmetry that often exists and is frequently exploited by companies. This is often seen in statements like, “we are one big family so put in a few more hours for this launch” coupled with announcements like, “profit projections didn’t meet expectations so we are downsizing 5% of the work force.” You are family when they need you to work hard, and an expendable free market agent when your continued employment might risk hitting the quarterly goal.

        It is, of course, reasonable to lose sleep if you think your employment is in jeopardy. Very few companies, especially in the competitive SV market are firing engineers because of a single outage, even a bad one, because you just paid a bunch of money to train those engineers how to see this coming and fix it.

  • throwaway7747 5 years ago

    I have worked for one of their competitors (I'm not saying which) for quite a while. I've indirectly caused multiple outages that were maybe 1% this bad before, that didn't make the news only due to luck. Code that I owned (but did not write) was once a key cause of a severe outage that did make the news, and it would have been worse if I weren't coincidentally halfway through replacing that code with something more modern. I also had to do some very rapid work on internal failsafes around the time of the infamous Mirai botnet, to minimize service degradation in case it was pointed at us.

    It sucks. Working on CDN reliability is like working on wastewater management: the public forgets you exist until something breaks, when they start asking why you weren't doing your job. Fortunately, internal people at least seem to get it -- I hope this is the same as Fastly.

  • mnordhoff 5 years ago

    They shouldn't lose sleep over it, though.

    • austinjp 5 years ago

      Everyone's got responsibilities and aspirations. To be fair, I was thinking more of the jobbing engineer who's going to face anxiety about losing their job over this, but it extends to all levels. Having a fat bank balance helps get through periods without employment, but it's not just about money. There's anxiety, shame, embarrassment, the whole gamut. Going through a major incident at work is a shitty experience.

  • yvan 5 years ago

    Well, not much. I mean, all our competitors are also using Fastly. I would be more worried if we were the only one using Fastly and everybody else was fine. But as we are all in the same boat, we all lose the same :-)

  • dm319 5 years ago

    Empathy is hard to find around here; maybe someone needs to study it. Is it a feature of people in tech? I don't remember much of it on Slashdot either.

  • willejs 5 years ago

    #HugOps

  • southerntofu 5 years ago

    I feel for the Fastly workers, whom managers are probably currently harassing to get things back online. I certainly don't feel any sympathy for the Fastly administrators/managers who make a business out of exploiting other people.

  • mothsonasloth 5 years ago

    Call me old fashioned, but the latest trend of showing "empathy" for a serious incident, then proceeding to dance around the aftermath of it, whilst people give themselves a pat on the back in a retro/post-mortem, isn't the way to do it.

    People need to be blamed, and responsibility for actions taken (without covering asses)

    • q3k 5 years ago

      The point isn't to dance around the incident, but to not blame people. You can blame systems, design, engineering culture, processes, but don't blame people. Even if someone accidentally pressed the 'destroy prod' button, that's not the fault of that person, it's the fault of that button existing and being accessible in the first place.

      I have no empathy for Fastly-the-company. I hate the fact that the Internet is centralized around CDNs. I wish this idea of 'but we _must_ run a CDN for our 1QPM blog!' would die in a fire. But I can still empathize with the Fastly engineers handling this shitstorm right now.

      • tyrex2017 5 years ago

        I disagree. People implemented those systems, so if you are correct that it is the system's fault, then it is also a person's fault.

        People must be held accountable to have good incentives to reduce such outages in the future.

        I do agree though that we should always be compassionate and realistic with other humans.

        • q3k 5 years ago

          > I disagree. People implemented those systems, so if you are correct that it is the systems fault, then it is also a persons fault.

          How do you make sure that mistakes don't happen, then? Do you blame and fire people who make mistakes, and hope that the next person put in the same spot doesn't make a mistake? Or do you figure out what caused that person to make the mistake and ensure there are processes in place so that next time this is less likely to happen?

          Extrinsic motivators like 'we will give you a bonus' or 'we will fire you' are surprisingly bad at getting people to not fuck things up.

          • tyrex2017 5 years ago

            I see: When I said we need to hold people accountable, you may have heard that we need to fire people. That was honestly never on my mind.

            Maybe it's a cultural thing. I hear a lot about firing in the US. I am from Europe.

          • colesantiago 5 years ago

            This sort of culture worked at Netflix. Did they go down today?

            • stuff4ben 5 years ago

              Let's hope you don't ever go into management. You clearly have no idea how to motivate and retain people, or any insight into how hard it is to hire good people to begin with. And no, I'm pretty certain this is not how Netflix's culture is.

              • colesantiago 5 years ago

                > pretty certain this is not how Netflix's culture is.

                > pReTtY CeRtAiN

                This, the wording in of itself shows you have absolutely no clue whatsoever at all of Netflix's culture.

                • stuff4ben 5 years ago

                  Riiiight... Anyways, you kept complaining of being downvoted, here's a clue: you're being an ass and no one likes you or what you have to say because you're wrong. So go scurry back to reddit where you belong troll...

                  • colesantiago 5 years ago

                    > you're being an ass and no one likes you or what you have to say because you're wrong. So go scurry back to reddit where you belong troll...

                    Okay? some proof please? This is not far off from a baseless character attack which isn't really effective when trying to convince me about your point on you knowing about Netflix's culture.

                    If you really want a proper answer, the truth is, unfortunately for you I am in management (previously was an engineer) and have always known Netflix to have a stellar performance oriented (and fear driven) culture, their playbook operates like a sports team. Not for everyone, but that's the point and it works for them.

                    Maybe you should look inward to yourself if you're so vexed with me to call me silly names, that you can't handle the truth or the culture about why some companies like Netflix adopts this.

                    Peace.

                    • stuff4ben 5 years ago

                      Proof? All the downvotes you got and why your comments are barely visible and all the crying you did in your comments about getting downvoted.

                      • colesantiago 5 years ago

                        You think downvotes and character attacks present as a good argument? Doesn't count as proof IMO if there isn't a valid argument presented, you're going to have to do a lot better than that.

                        And back to the main point, So I assume you agree that Netflix did go completely down the other day then right? It seems according to you that you know better of Netflix's management culture.

                        > I'm pretty certain this is not how Netflix's culture is.

                        Would you be willing to share your expert insight of this if you know better then?

                        • stuff4ben 5 years ago

                          I'm not arguing Netflix; it's mostly your attitude towards management and engineering culture. Basically your reply to the user "q3k": "Extrinsic motivators like 'we will give you a bonus' or 'we will fire you' are surprisingly bad at getting people to not fuck things up". You don't fire people just because they made a mistake. You find out what caused it, how to prevent it in the future, and you move on. That's what blameless post-mortems are about. No one is perfect, and if you really are a manager that expects perfection, you really just suck as a person.

                          But now getting back to Netflix, they have post-mortems and they don't fire people willy-nilly over mistakes. Sure it's not hugops (a term I don't care for either), but they don't just up and fire people over a mistake. I never said anything about netflix going up or down on that day, but they also have problems just like everyone else. Their SLA is not 100% uptime and neither is Fastly.

                          In closing, you are being a pedantic little bitch who wants to argue minutia and I'm done with your trolling. I'm done responding to you, feel free to have the last reply as I really don't care anymore.

        • philjohn 5 years ago

          That's a sure fire way to get a CYA culture, and it's a reason why the most successful tech firms don't do it.

        • altacc 5 years ago

          v1. "It's Bob's fault and so we fired Bob."

          v2. "The issue was caused by a previously unidentified pathway that caused a feedback loop and overloaded our servers in a cascading fashion (or whatever). We have implemented a fix for this and updated our testing and deployment processes to stop similar cascades."

          Which solves the problem long term?

          As an architect making product choices, v2 wins every time.

          (With the caveat that if the cause was something that reveals a fundamental problem with the larger processes/professionalism/culture of the company, especially to do with security concerns, then I'm not buying that product and I'm migrating away if we already use it.)

          • richardwhiuk 5 years ago

            If an employee does something actively malicious, you should absolutely remove them. This is very rare though - incompetence/broken systems are much more likely.

            Otherwise you develop internal process that's entirely scar tissue, and only stops your teams doing their jobs.

            • altacc 5 years ago

              I feel it is somewhat obvious and goes without saying that malicious action results in personal responsibility & repercussions. However I don't have any evidence or past experience that malicious action by an internal employee is a likely scenario for most outages. It may well occur but most examples I've heard of seem apocryphal.

              The scar tissue: this is where good choices come in because it's certainly not a rule that a change as a result of an incident review is an impediment to work. These definitely occur, and sometimes linger after the root cause is phased out. But best practices often reduce cognitive & process overheads.

              A rough example is that there are still people out there FTPing code to servers, having to manually select which files from a directory to upload. Replacing this error prone process with a deployment pipeline leads to a massive reduction in the likelihood of errors and will actually speed up the deployment process. It's all about making the right choices, not knee-jerk protections, and sometimes the choice is to leave things as they are.

          • tyrex2017 5 years ago

            As I replied to a sibling comment, I never thought about firing Bob. I think we can assign responsibilities without being mean or denigrating someone.

            I am criticizing myself all the time for stuff. No hurt feelings there.

        • aetimmes 5 years ago

          > People must be held accountable to have good incentives to reduce such outages in the future.

          Holding specific people "accountable" for outages doesn't incentivize reducing outages; it incentivizes not getting caught for having caused the outage.

          As a result, post-mortems turn into finger-pointing games instead of finding and resolving the root cause of the issue, which costs the company more money in the long run when a political scapegoat is found but the actual bug in the code is not.

        • taurath 5 years ago

          Loss of trust in a service provider and the afterwards loss of business is quite an incentive. Having someone drawn and quartered just provides an incentive to scapegoat.

      • atq2119 5 years ago

        > don't blame people

        I feel like this requires some nuance.

        Don't blame an IC for introducing a bug or misconfiguration that led to the outage.

        Do consider blaming (and firing!) management if, during the postmortem, it turns out that management stood in the way of fixing systemic problems.

        Ultimately, rule #1 should be: don't blame somebody unless malice or gross negligence is proven. Rule #2 should be the assumption that ICs will not have done either. Rule #3 is that sometimes, individual responsibility is required.

    • gurgus 5 years ago

      Blame culture isn't the way forward here.

      Do a post-mortem, work out root causes, work as a unit to ensure this doesn't happen again.

      Obviously if there are levels of gross negligence or misconduct discovered during post-mortem, that will need to be dealt with accordingly, but coming into this with an attitude of "we must find someone to blame and incur repercussions" isn't healthy at all.

      We are humans - don't forget that.

      edit: forgot some words.

      • colesantiago 5 years ago

        > Do a post-mortem, work out root causes, work as a unit to ensure this doesn't happen again.

        And if this happens again? They advertised they had failover and mitigations for this in the RAREST of cases:

        > Notices will be posted here when we re-route traffic, upgrade hardware, or in the extremely rare case our network isn’t serving traffic. - status.fastly.com

        The extremely rare case happened for an hour, which is a very long time in internet time.

      • thrdbndndn 5 years ago

        I think what you said is exactly why people have different opinions on this topic: what counts as "gross negligence" and what doesn't? Different people draw lines at different places.

        • q3k 5 years ago

          There's, to me, no obvious clear cut line. But here are some indicators that make me consider someone was being grossly negligent and/or even malicious:

          - ignoring warnings

          - acting against known-to-them best practices

          - repeating a previous mistake

          But, again, these are just indicators, not a checklist.

          Interestingly, any of these can happen also due to stress, burnout and generally broken company/team culture. Including something like a CYA culture where if they don't do something fast, they will be blamed for it, and thus they need to move fast and break things.

    • mhandley 5 years ago

      The problem is a blame culture ensures the near-misses are never reported. Air safety discovered this many years back - a no-blame culture ensures anything safety-related can be reported without fear of repercussions. This allows you to discover near misses due to human error and ensure that the overall system gains resilience over time.

      If you blame people for mistakes, they cover the non-obvious ones up, so you cannot protect against similar ones in future, and your reliability/safety ends up much lower in the long run. It's all about evolving a system that is resilient to human error - we will make mistakes, but the system overall should catch them before they become catastrophes. In air travel now, the remaining errors almost never have a single simple cause, except in airlines/countries that don't have an effective safety reporting culture.

    • hyper_reality 5 years ago

      I recommend reading about "blameless postmortems" [1]. Our natural tendency is to look for who is responsible for an incident and point the finger of blame. Over time this leads to a cover-your-ass culture, whether you like it or not. Therefore such a tendency needs to be actively fought against to keep the focus on quality engineering and not politics.

      "An atmosphere of blame risks creating a culture in which incidents and issues are swept under the rug, leading to greater risk for the organization."

      [1] https://sre.google/sre-book/postmortem-culture/

    • darkcha0s 5 years ago

      I'm sure you've never made a mistake.

      The best way (in a team), to tackle mistakes, is to ensure the process in place corrects these mistakes. The only way to do that, is a post-mortem/learning from the mistake. If you blame it on some engineer who did it, that guy will eventually be replaced by some other guy, who may make the same mistake.

      • fmajid 5 years ago

        You also need to be proactive about other possible failure modes. Avoiding a culture of blame may or may not help. There needs to be a strong incentive for the organization to expend the resources to do so, and a mere "oops my bad" doesn't provide that without SLAs with teeth.

    • berkes 5 years ago

      We need to learn from our own and others' mistakes, or else we keep repeating them. Nothing "old fashioned" about that.

      And we, especially companies, typically only learn if there is something at stake. Stock-price, a job, customers, liability etc.

      (Call me old fashioned, but what I learned from it, having no stake in the game, is we are truly demolishing the resilient, decentralised nature of the internet; or already have done so)

    • OJFord 5 years ago

      I don't agree about the blame, but I do also find the empathy cringeworthy. Something's broken; someone's job is to fix it; they'll fix it; it will work again. /shrug/

      Post-mortems make far more interesting submissions IMO, but I suppose people up-vote 'yes down for me too'.

    • erwald 5 years ago

      the attitude that "people need to be blamed" will never improve reliability in the long run. people come and go; systems and processes endure. blaming people is the best way to avoid making durable improvements to systems and processes.

      • notyourday 5 years ago

        Doctors that make too many mistakes resulting in too high of payouts can't get individual malpractice insurance. Doctors that can't get individual malpractice insurance go to hospitals. Hospitals that hire too many doctors that make too many mistakes can't get hospital level policy. Hospital has to fire those doctors. That's how the system adjusts.

        We do not have a system that adjusts to "oops"

    • hardwaresofton 5 years ago

      I hear you, but I just want to point out that this rarely happens anywhere else. It's great if tech (and people in general) hold themselves to progressively higher standards than what is out there already, but I don't think tech needs to be that much better, I'd settle for just doing a good honest retro (without throwing anyone under the bus, and without covering their asses)

      A good leader will take the hit (and the repercussions) for their underlings, compensate customers where compensation can make it better (and offer to make it easy to use fallbacks if this happens again) -- and internally fix the problem so it can't happen again, without throwing anyone to the dogs.

    • sophacles 5 years ago

      > People need to be blamed, and responsibility for actions taken (without covering asses)

      What i think this syntactically invalid sentence is trying to say is:

      People need to be blamed, and held responsible for actions taken.

      Why do people need to be blamed? Why do we need to make someone the scapegoat? What does being held responsible look like?

      Let's say we find some sacrificial engineer to pin this on:

      * does the downtime magically disappear?

      * does the engineer suffering (say losing his job or whatever) make your downtime meaningful? You'll recoup your revenue somehow from it?

      * does the fact that there's a scapegoat mean that everyone else at fastly is perfect and it's ok to keep using them?

    • taurath 5 years ago

      Scapegoating in those situations happens more often than not. In an operations team all problems are systemic - having to do with decision makers throughout the process, sometimes acting on perverse incentives set up by others. Blame then gets diluted but still tends to fall upon the organization responsible rather than an individual, which is where it should be. Gross negligence is not so cut and dry.

    • austinjp 5 years ago

      "Call me old-fashioned but..." is a dog-whistle harking back to "better days" that never existed.

      Empathy and responsibility are not mutually exclusive.

    • colesantiago 5 years ago

      > People need to be blamed, and responsibility for actions taken (without covering asses)

      This. People talk about "HugOps", "empathy" and all that while a worldwide incident affecting a huge number of time-critical customers (e.g. trading, HFT, cargo, food delivery, etc.) goes on for an hour, with catastrophic consequences.

      I hope the engineers also understand the other side and why we are paying huge sums of cash for their service.

      • q3k 5 years ago

        It's empathy towards people managing the incident, not towards the company. It's a sign of solidarity from SRE to SRE, not a sign of solidarity with a company.

  • jtdev 5 years ago

    Our fathers and mothers put man on the moon… we build shitty software that helps the technocrats sell more junk to the masses.

  • colesantiago 5 years ago

    Well, while engineers are getting paid $100K/yr to post #HugOps, I know someone in HFT and their dashboard uses the Fastly service, so this has had a huge impact on them for sure.

    Flag and downvote all you want, you know this is true.

    • nsp 5 years ago

      I suspect you'll have trouble convincing a forum of primarily engineers that a high frequency trader is more worthy of sympathy than an engineer. They're both pretty privileged jobs and HFT is not known for having tons of benefits to society

      • notyourday 5 years ago

        > I suspect you'll have trouble convincing a forum of primarily engineers that a high frequency trader is more worthy of sympathy than an engineer.

        Engineers are paid because their companies have customers. It is pure madness that #hugops is the thing. I sincerely hope that Fastly's customers whack it so hard, $$-wise, that it actually affects the #hugops engineering culture.

      • colesantiago 5 years ago

        > I suspect you'll have trouble convincing a forum of primarily engineers that a high frequency trader is more worthy of sympathy than an engineer.

        At least HFT traders don't get paid to spy on their own customers with trackers littered everywhere, I find that very unethical that engineers get paid to even do that sort of thing, and every damn website has these trackers because engineers put them there.

        > They're both pretty privileged jobs and HFT is not known for having tons of benefits to society

        So HFT firms don't have their own foundations and grants to give to charities and organisations then?

    • Deathmax 5 years ago

      And ignore the pre-agreed SLA targets and compensation for not meeting those targets that's in the contract they signed right? If you're going to say you're losing $X/minute of downtime, then either deal with it, architect around it, or negotiate the necessary SLA and compensation.

      • colesantiago 5 years ago

        It's not me you should be telling this to, though; if you know someone at Fastly, perhaps you should be reminding them of that.

        I expect huge clients to be knocking on Fastly's door lining up for answers because of this.

  • colesantiago 5 years ago

    Not my problem. Fastly should work as intended.

    The fault is theirs and they have said that they have failover, this worldwide outage caused by them just goes to show you that Fastly does not actually have a failover system in place.

    > "Fastly’s network has built-in redundancies and automatic failover routing to ensure optimal performance and uptime." - status.fastly.com

      Even their status page was down. Very embarrassing; Fastly did not work as advertised and misled its customers.

    Edit: Offended flaggers circling around silencing misled Fastly customers. How pathetic.

    • macintux 5 years ago

      > this worldwide outage caused by them just goes to show you that Fastly does not actually have a failover system in place.

      I don’t know Fastly at all, but in my experience there’s no such thing as a foolproof failover system that covers all possible scenarios.

      • colesantiago 5 years ago

        Even when they said this was a rare [0] case, they knew this case should be handled, but didn't handle it.

        > or in the extremely rare case our network isn’t serving traffic.

        Reports also came in that this was a service configuration[1] issue, so not only was there no failover system, there wasn't even any validation automation in place that could have prevented this.

        [0] https://status.fastly.com [1] https://twitter.com/fastly/status/1402221348659814411

        • detaro 5 years ago

          Systems failing is not evidence of systems not existing.

          • colesantiago 5 years ago

            So why didn't the 'automatic failover' kick in during the outage? Where was it then? I don't see anything about 're-routing traffic' anywhere in the status page [0]

            [0] https://status.fastly.com/incidents/vpk0ssybt3bj

            • detaro 5 years ago

              We don't know, but the usual scenarios would be "the issue impacts the failover mechanism too", "the failover mechanism overloads other system components, leading to cascading failure" or "something causes the failover mechanism to think all is fine".

              • colesantiago 5 years ago

                > We don't know...

                So, the rarest of cases (our network isn’t serving traffic) just happened right now, and their failover system just took a snooze then, but 'it exists apparently' according to you.

                Tell that the huge clients that lost sales because of this, and all you have to say is: "wE DoN'T kNoW..."

                • q3k 5 years ago

                  > Tell that the huge clients that lost sales because of this, and all you have to say is: "wE DoN'T kNoW..."

                  Tell these clients that they should've carefully read their contract with Fastly, especially the 'Service Level Agreement' part.

                  • colesantiago 5 years ago

                    Not the point. They were also told that a failover system would kick in and re-route traffic had there been any issues, but it was nowhere to be seen.

                    A worldwide outage happened that affected almost all locations and everybody, so the SLA is actually meaningless in this case. Where was the extra redundancy? Where was the failover system? Why were other companies indirectly affected?

                    As far as I know, Fastly's status page was even down during the outage; the fact that the best answer to this is "we don't know" tells you everything you need to know. Maybe stop victim-blaming this situation and focus on the main culprit.

    • q3k 5 years ago

      > Not my problem. Fastly should work as intended.

      What's your SLA with them?

      Just assuming things will always work because the marketing copy said so is a recipe for disaster. It's hoping that things never go wrong, and when they inevitably do, being caught with your pants down.

      Everything fails sometimes. You must know how much your SaaS provider contractually promises, ensure that any SLA breach is something financially acceptable for you, and ensure that you can handle failure time within SLA.

      • colesantiago 5 years ago

        > What's your SLA with them?

        Sorry what?

        You've just witnessed almost the entire internet break because of a catastrophic cascading outage that affected lots of huge companies, since third party services used and trusted Fastly.

        Shopify stores couldn't accept payments on their websites, Coinbase Retail/Pro transactions and trading apps failed to load, and delivery apps stopped loading all of a sudden. These are just a few of the things this outage caused, and now you are trying to blame this on me for not checking their SLA, when millions were indirectly affected?

        Fastly offered a product, their main product, which is a CDN, and it took down lots of websites. I don't care if everything fails sometimes. There are sites that should NOT go down because of this configuration issue which they messed up.

        • detaro 5 years ago

          > I don't care if everything fails sometimes

          You can say you don't care for reality, but it's not going to help you have better systems.

          > There are sites that should NOT go down

          Then they surely either engineered their system to not 100% rely on Fastly or negotiated appropriate terms with Fastly (Or decided Fastly going down was an acceptable business risk, which it is for nearly everybody). Everything else would be negligent, and surely nobody would be negligent when operating a site that "should NOT go down"?

          • colesantiago 5 years ago

            > You can say you don't care for reality, but it's not going to help you have better systems.

            Nowhere in my sentence did I say this, so quit the strawman argument.

            I know a client using a service that has had 100% uptime for the year, and that also serves huge clients. I don't understand why Fastly can't guarantee at least that, plus a failover system to counteract this, but one clearly didn't work (or even exist).

            > (Or decided Fastly going down was an acceptable business risk, which it is for nearly everybody).

            Then why did this cascade to almost everybody, even indirectly? Surely their advertised failover system should have prevented this from dragging on, yet it lasted longer than it should have.

            I don't think a store, exchange or trading desk not accepting payments from people for an hour is acceptable at all.

        • q3k 5 years ago

          > You've just witnessed almost the entire internet break because of a catastrophic cascading outage that affected lots of huge companies, since third party services used and trusted Fastly.

          Blame the companies that relied on Fastly being up 100% of the time, even though Fastly explicitly states that they might be down any number of hours, and will even give you money back for that [1]. If they did offer a 100% SLA, it would probably be out of budget for most users, as those kinds of systems are prohibitively expensive to run.

          Depending on a single CDN like Fastly is building an SPOF into your product. It is no less of a design blunder than whatever Fastly did internally to have an outage. If Shopify lost millions because of a short, simple third-party outage, they have at least as much of a high-priority postmortem to write and issues to address as Fastly.

          [1] - https://docs.fastly.com/products/service-availability-sla

          • colesantiago 5 years ago

            The main problem is that they had a failover system; the mystery is where it was during this outage.

            Why didn't this trigger? Where was the system that was supposed to prevent further cascading failures?

            > Blame the companies that relied on Fastly

            So it's everybody's fault Fastly went down now? That is a new one.

            • Purefan 5 years ago

              If companyA got affected by this, then either: 1) it's companyA's fault for not having a contingency plan, or 2) it's companyA's accepted risk that this might happen.

              We understand you're upset and passionate about this; perhaps now that more information has been published, you understand better the circumstances that caused this problem.

iso1631 5 years ago

https://easydns.com/blog/2020/07/20/turns-out-half-the-inter...

The whole idea of the internet was a distributed network impervious to most attacks.

The reality is that a single failure can knock out 90% of the services people use.

  • fagnerbrack 5 years ago

    The internet still works, only the websites are returning the wrong response

    • abluecloud 5 years ago

      yeah, the internet is working perfectly. if you want to view 503 errors.

      • Doxin 5 years ago

        Believe it or not but "the internet" and "the world wide web" are not synonyms.

        • berkes 5 years ago

          True. But the vast majority of use goes via the "WWW".

          For example, email (the other big internet user) is technically not part of the WWW, but most (? I don't have any stats, just a guess) of our mail clients run on the WWW nonetheless.

          • gspr 5 years ago

            I think that's the point the other person was making: The Internet is still fine, regardless of whether or not the content gets delivered.

            There are roads (or shall I say tubes?). There are cars and busses on the road. Over time, almost everyone has migrated to just a few bus companies. One of them suffers a complete collapse for a few hours. Yes, this means chaos when it comes to transporting people. But the roads are just fine.

            This doesn't mean that the situation is fine and that people aren't affected. But it would be entirely different if the roads had been washed away or something.

          • fmajid 5 years ago

            BitTorrent was half of all Internet traffic for a while, though it has decreased with the rise of legal and convenient streaming services.

            • berkes 5 years ago

              Most of which (unfortunately) run on the WWW.

              I'm not sure what the native clients for Netflix and Spotify actually run, but I use their WWW clients mostly, making most of my internet bits & bytes go over the WWW.

    • iso1631 5 years ago

      Thank god network people haven't drunk the centralization kool-aid.

      • taurath 5 years ago

        It’s the equivalent to JIT manufacturing. Cheaper when everything is going fine, and devastating when it’s not. And then when everything goes down at once there’s not enough advantage to being the only one still up.

      • Zyansheep 5 years ago

        about that...

    • qw 5 years ago

      Interestingly, server-side rendered pages worked well during the outage. Most of the issues were caused by sites relying too heavily on JavaScript.

      • fagnerbrack 5 years ago

        Yes, my personal project was working fine all the time. Only I couldn't access the Stripe payment system dashboard

    • jb1991 5 years ago

      And only those websites on some networks. If I connect my phone to my cell network instead of wifi, the problem sites work for me.

  • emptyparadise 5 years ago

    There are ten websites left on the internet and they're all hosted by four or so megacorps. Isn't it great?

  • nabla9 5 years ago

    The Web (World Wide Web), built atop the Internet, is not impervious.

    ps. "The Internet was built to survive attacks" is not true. It's a myth made popular by Robert Cringely in the early 1990s. The Arpanet was simply a network for computer scientists to connect their mainframes. The Internet is relatively resilient against attacks, but that was not the "whole idea". It was not in the design at all.

    Bob Taylor: “In February of 1966 I initiated the ARPAnet project. I was Director of ARPA‘s Information Processing Techniques Office (IPTO) from late ‚65 to late ‚69. There were only two people involved in the decision to launch the ARPAnet: my boss, the Director of ARPA Charles Herzfeld, and me. The creation of the ARPAnet was not motivated by considerations of war. The ARPAnet was created to enable folks with common interests to connect with one another through interactive computing even when widely separated by geography”.

    Vint Cerf says the same about the invention of the TCP/IP transport protocol.

    • dijit 5 years ago

      BGP (the protocol underpinning the internet) is built entirely for avoiding outages of any size.

      Even email has a method for handling failure baked into the protocol.

      Fallbacks are good, baking in resiliency is better.
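
      The email example is concrete: SMTP failover is just the MX record's preference field, where lower values are tried first and delivery falls back down the list (RFC 5321). A minimal sketch, with hypothetical records:

```python
def delivery_order(mx_records):
    """Order MX records by preference: lower preference is tried
    first, so a dead primary just shifts delivery to the backup."""
    return [host for _, host in sorted(mx_records)]

# Hypothetical MX records for example.com:
mx = [(20, "backup.mail.example.com"), (10, "mail.example.com")]
# delivery_order(mx) tries mail.example.com first, then the backup.
```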

      • iso1631 5 years ago

        BGP has its problems (that time centurylink blackholed traffic but wouldn't drop their connections, bgp hijacks etc), but it's not centralised in single (or very few) points of failure

      • nabla9 5 years ago

        User iso1631 talked about attacks, not just outages.

        The basic design of BGP is very vulnerable to malicious attacks. Email security is nonexistent.

mrzool 5 years ago

Why is this a link to the Fastly homepage, where absolutely no information is provided?

This is the page that should be linked:

https://status.fastly.com

  • jmvoodoo 5 years ago

    Oddly their homepage rendering an error was a more accurate description of the problem than "investigating potential impact to performance with our CDN"

    • taurath 5 years ago

      Stuff is down across the web, but the most it says is “degraded performance” and in my area it’s all green even though the sites are still down.

      • samhh 5 years ago

        All looks orange now, but "degraded performance" is a cheeky way to describe "everything is on fire".

      • kevincox 5 years ago

        > Fastly’s network has built-in redundancies and automatic failover routing to ensure optimal performance and uptime. But when a network issue does arise, we think our customers deserve clear, transparent communication so they can maintain trust in our service and our team.

        What a joke!

    • mrzool 5 years ago

      I didn’t see any error whatsoever on their homepage, while now I see “Global CDN Disruption” on their status page.

  • Haydos585x2 5 years ago

    This is the link you want I think https://status.fastly.com/incidents/vpk0ssybt3bj

  • scolvin 5 years ago

    Because even their homepage is down intermittently/for some people.

  • Silhouette 5 years ago

    To save everyone else hitting the site as well:

    As of 10:44UTC, this status page has just updated to say the issue has been identified and a fix is being implemented.

  • lucasverra 5 years ago

    it is starting to show several Degraded Performance tags

barosl 5 years ago

I didn't know so many sites were depending on Fastly. Stack Overflow, GitHub, reddit, .... Even pip is unavailable. My development workflow is completely janked up. It is a bit scary that we are putting too many eggs in one basket.

  • liveoneggs 5 years ago

    fastly gives free service to things like pip. It's actually very nice.

    • JulianWasTaken 5 years ago

      Bit pedantic, but it's PyPI that Fastly gives services to, not pip (and PyPI that's down, not pip). The two are only loosely related – pip is a piece of software.

  • kypro 5 years ago

    You would think sites like GitHub and key government sites would at least have a fallback at the ready. It's reasonable to use a CDN like Fastly, but having a single point of failure seems silly if you're the BBC or Gov.UK. Although it does seem the BBC managed to get back up and running pretty quickly, so perhaps they were prepared for this.

    • danpalmer 5 years ago

      Gov.UK is back up too. They have a mandate from government to be able to provide emergency communications so I expect they did have a backup and have managed to switch over, but just took 30 mins to do so.

      Gov.UK is supposed to be a bit like BBC 1 or Radio 1 – in a national emergency they can be taken over to disseminate critical information, like if there was a nuclear attack launched on the UK.

  • samhh 5 years ago

    Hackage (Haskell) is down as well: http://hackage.haskell.org

    • fgaz 5 years ago

      The mirrors still work though, and cabal will just fall back to those

  • Zyansheep 5 years ago

    Must... decentralize... internet...

  • joshenders 5 years ago

    Blame site operators that are single-homed and not load balancing across CDNs

    • notyourday 5 years ago

      For sites of any complexity with any dynamic content, having CDN redundancy is akin to being multi-cloud -- it is not worth the effort.

      A lot of dynamic sites use Fastly for its programmatic edge control and near-immediate (~1-4s, typically around 2s) global cache invalidation of any tagged objects with a single call to the tag. That feature alone simplifies backend logic significantly. Making it portable to CDNs that only support regular cache invalidation requires a complicated workflow setup, which significantly increases the cache-bust time and in turn removes all the advantages of the "treat dynamic content as static and cache-bust on write" approach.
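
      For reference, the tag-based invalidation described above is Fastly's surrogate-key purge: one API call invalidates every cached object tagged with that key. A rough sketch of what such a call looks like (the service ID and token below are placeholders; this builds the request without sending it):

```python
import urllib.request

FASTLY_API = "https://api.fastly.com"

def build_purge_request(service_id, surrogate_key, api_token, soft=False):
    """Build (but don't send) a purge-by-surrogate-key request.

    Sending it invalidates every cached object tagged with the
    given key across Fastly's network, typically within seconds.
    """
    req = urllib.request.Request(
        url=f"{FASTLY_API}/service/{service_id}/purge/{surrogate_key}",
        method="POST",
        headers={"Fastly-Key": api_token},
    )
    if soft:
        # A soft purge marks objects stale instead of evicting them,
        # so stale content can still be served while revalidating.
        req.add_header("Fastly-Soft-Purge", "1")
    return req
```

      Objects get tagged at the origin with a `Surrogate-Key` response header, so a write to, say, an article purges the one key all of its cached renderings share.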

      • joshenders 5 years ago

        >> For sites of any complexity with any dynamic content having CDN redundancy is akin to being multi-cloud — it is not worth the effort.

        I proposed and led our multi-CDN project at Pinterest for both static and dynamic content and I can tell you, many many times over, it has been well worth the effort. Everybody should do this, if only for contract negotiating leverage.

        Cache invalidation is fast enough on all CDNs now for most use cases (yes, including Akamai). But realistically, most sites (Pinterest included) are not using clever cache invalidation for dynamic content because it's not worth the integration effort (and it's very difficult to abstract for large 1k+ engineering teams). Most customers are just using DSAs for the L4/L5 benefits (both security and perf). In that case, it's not complicated to implement multi-CDN.
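
        For static assets, a multi-CDN setup can be as simple as weighted host selection when asset URLs are generated; shifting a provider's weight to zero fails it out. A toy sketch (hostnames and weights are made up, not any company's actual setup):

```python
import random

# Hypothetical CDN pool with traffic weights.
CDN_POOL = [
    ("cdn-a.example.com", 70),
    ("cdn-b.example.com", 30),
]

def asset_url(path, rng=random):
    """Pick a CDN host at random, proportionally to its weight."""
    hosts, weights = zip(*CDN_POOL)
    host = rng.choices(hosts, weights=weights, k=1)[0]
    return f"https://{host}{path}"
```

        Real deployments usually do this at the DNS or load-balancer layer, with health checks driving the weights, but the principle is the same.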

csmattryder 5 years ago

Here's the status page incident for this.

https://status.fastly.com/incidents/vpk0ssybt3bj

  • algo_cheese 5 years ago

    > We're currently investigating potential impact to performance with our CDN services.

    Guys, you are offline with a 503 error, this is a little more than "potential impact to performance".

    • superzamp 5 years ago

      Lowkey status reports are the norm now :)

      "some users may experience degraded service" => site completely down for all locations

      • sdflhasjd 5 years ago

        I fully expect that if I find a "major outage" on Slack's status page that it could only mean the outbreak of nuclear war.

      • spoonalious 5 years ago

        Or the AWS typical status of ‘seeing increased error rates on the API’ = us-east-1 is dead

        • maccard 5 years ago

          At least that's accurate. "Degraded performance" would imply to me that things are functional, but slow. increased error rates can be anything from "try again" to ":shrug:"

      • meowface 5 years ago

        "We're investigating reports of intermittent connectivity issues" => transatlantic cables cut, WWIII imminent

      • weird-eye-issue 5 years ago

        Well to be fair some users were not accessing the site at that time

      • strictfp 5 years ago

        Yeah, that's my experience as well. I thought it meant "we have no idea what's going on" though.

    • guiriduro 5 years ago

      Yeah, I also wrote a bot that creates a status incident with the lowest-key neutral message when it detects continued healthcheck failures (outside of maintenance), stepping in if an operator hasn't already created an incident. Maybe they're too busy fixing.
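
      The escalation logic of such a bot can be tiny. A hypothetical sketch (names and threshold invented), staying deliberately neutral until a human takes over:

```python
FAILURE_THRESHOLD = 5  # consecutive failed healthchecks before acting

def auto_incident(consecutive_failures, in_maintenance, operator_incident_open):
    """Return a low-key incident message to post, or None to stay quiet."""
    if in_maintenance or operator_incident_open:
        return None  # planned work, or a human already owns the incident
    if consecutive_failures >= FAILURE_THRESHOLD:
        # Neutral on purpose: the bot can't know the real blast radius.
        return "We're investigating reports of degraded performance."
    return None
```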

    • asmosoinio 5 years ago

      Yes, I also thought the header was hilarious:

      > CDN Performance Impact

  • parksy 5 years ago

    No issues reported for Perth, Australia. Strange, because reddit, Zip Pay, Fastly itself, and probably a bunch of other sites are down.

    It doesn't seem the status page is automatically updated, or perhaps whatever event or polling mechanism is used is also broken.

  • JrProgrammer 5 years ago

    > This incident affects: North America (Ashburn (BWI), Ashburn (DCA)).

    How come we are affected by this in the Netherlands?

    • adamjb 5 years ago

      They've updated it to

      >North America (Ashburn (BWI), Ashburn (DCA), Ashburn (IAD)), Europe (Amsterdam (AMS)), and Asia/Pacific (Hong Kong (HKG), Tokyo (TYO), Singapore (QPG)).

      • martinald 5 years ago

        MAD is affected but not on the list. I assume it's all locations.

      • iso1631 5 years ago

        I've seen errors return to me referencing a LON (London I assume) server - Details: cache-lon4238-LON for example

      • kiwijamo 5 years ago

        Auckland (AKL) is affected but not on the list.

    • jwitthuhn 5 years ago

      Seems like they are still taking stock of exactly what is broken.

      It has now been updated to a pretty sizable list.

      edit: And now it looks like it includes every location.

  • nebulous1 5 years ago

    Currently only listing a small issue in NA

optiomal_isgood 5 years ago

Amazon.com was completely broken here (Europe) and they're back. I was watching where the assets were loaded from, and they switched from EU to NA as a failover. Homework well done.

  • 00deadbeef 5 years ago

    I was surprised to learn Amazon don't use their own CDN

    • optiomal_isgood 5 years ago

      They used to use AWS CloudFront and switched to Fastly, someone shared this in another comment:

      [https://www.streamingmediablog.com/2020/05/fastly-amazon-hom...] (CDN Fastly Wins Content Delivery Business For Amazon.com and IMDB Websites)

      Quoting:

      > "But with small object delivery, like images loading fast on Amazon’s home page, it’s the opposite. Customers will pay for a better level of performance and in this case, Fastly clearly outperformed Amazon’s own CDN CloudFront. This isn’t too surprising since CloudFront’s strength isn’t web performance, or even live streaming, but rather on-demand delivery of video and downloads."

    • dastbe 5 years ago

      Amazon (like a lot of others) use several CDNs for redundancy. You can see from dig that it resolves to combinations of cloudfront, akamai, and (presumably, based on your reported experience) fastly.

        dig +short www.amazon.com
        tp.47cf2c8c9-frontier.amazon.com.
        d3ag4hukkh62yn.cloudfront.net.
        65.8.70.16
      
        dig +short www.amazon.co.uk
        tp.bfbdc3ca1-frontier.amazon.co.uk.
        dmv2chczz9u6u.cloudfront.net.
        13.224.0.89
      
        dig +short www.amazon.in
        tp.c95e7e602-frontier.amazon.in.
        d1elgm1ww0d6wo.cloudfront.net.
        13.224.9.30
      
        dig +short www.amazon.co.jp
        tp.4d5ad1d2b-frontier.amazon.co.jp.
        www.amazon.co.jp.edgekey.net.
        e15312.a.akamaiedge.net.
        104.71.134.162
  • abluecloud 5 years ago

    Still getting broken assets from the UK.

    • optiomal_isgood 5 years ago

      You're right, I should've said *partially* back. At least the CSS now loads, but a few product images are still gone. However, it was completely broken here before (literally loading just the main HTML).

creamyhorror 5 years ago

basically the internet is down

reddit, stackoverflow, github, paypal, pypi, twitter, twitch, NYT, CNN, BBC, the Guardian...

edit: wow, even Amazon.com relies on Fastly for some of its edge caches!

  • iso1631 5 years ago

    https://www.washingtonpost.com/technology/2020/04/06/your-in...

    “This basic architecture is 50 years old, and everyone is online,” Cerf noted in a video interview over Google Hangouts, with a mix of triumph and wonder in his voice. “And the thing is not collapsing.”

    The Internet, born as a Pentagon project during the chillier years of the Cold War, has taken such a central role in 21st Century civilian society, culture and business that few pause any longer to appreciate its wonders — except perhaps, as in the past few weeks, when it becomes even more central to our lives.

  • aunetx 5 years ago

    Opened my browser and my three major web pages: github, gitlab.gnome.org and old.reddit.com... They're all down.

    • Harzza 5 years ago

      Unless you're browsing reddit without logging in, you can just set the old reddit theme from your account settings so you don't need to use the old. prefix :)

      • taurath 5 years ago

        They reset the setting regularly, just to piss off people who only want the old frames.

      • lumenwrites 5 years ago

        And if you're browsing on mobile, you need to request a desktop website, otherwise it switches to the new version anyway. Took me so long to figure out, so many annoying attempts to replace www with old in safari, and losing the selection after misclicking.

  • busymom0 5 years ago

    > stackoverflow

    How will they troubleshoot the error messages now?

  • secondcoming 5 years ago

    BBC is still up at least in the UK

    • fredoralive 5 years ago

      Seems to be mixed for me, BBC News and Sport works but stuff like Weather, iPlayer (video streaming) and Sounds (audio streaming) have died. I guess the BBC is big enough that different bits of the site run off different solutions (perhaps news and sport are still in spirit running off "news.bbc.co.uk" instead of the main servers?).

    • iso1631 5 years ago

      Not here (although won't be long)

      dig bbc.co.uk

        bbc.co.uk.  193 IN A 151.101.64.81
        bbc.co.uk.  193 IN A 151.101.128.81
        bbc.co.uk.  193 IN A 151.101.192.81
        bbc.co.uk.  193 IN A 151.101.0.81
    • easytiger 5 years ago

      it's down

  • 3np 5 years ago

    debian's main apt repo mirror affected as well

atymic 5 years ago

This has got to be even bigger than when cloudflare went offline, in terms of big companies affected. Clearly they have way more F500 customers than CF.

Good luck to the on call engineers!

  • yxhuvud 5 years ago

    The funny part is that it isn't uncommon for sites to depend on both cloudflare and fastly in one way or another, due to buying services from saas companies that also depend on them.

omk 5 years ago

This outage made me realize that github is served over a single IP address (A record) for my point of origin (India). Stackoverflow has 4 A record listing, but all of these belong to fastly.

The internet is designed for redundancy. I wonder why these companies don't have a failover network. Makes me wonder if cost is a factor, considering their already massive infra. But a single point of failure ... <confused>

  • raphaelj 5 years ago

    > The internet is designed for redundancy. I wonder why these companies don't have a failover network. Makes me wonder if cost is a factor, considering their already massive infra. But a single point of failure ..

    Well, the Internet was indeed designed for redundancy, and it worked as intended. At no point in time did it fail to make you reach the server it was supposed to make you talk to.

    What are failing are all the application protocols that are running on top of the network.

  • kayfox 5 years ago

    GitHub's DNS will likely serve up a different IP when there is an outage. I can't talk about the details, but GitHub and the rest of Microsoft use a global load balancing system that works through DNS.

    • omk 5 years ago

      Would be interesting to know what these failover patterns are. Since DNS takes a while to propagate, I thought DNS records could already indicate failover addresses.

      • kayfox 5 years ago

        I think only MX records indicate any priority among the records returned; for A records there's no indication of which records have priority over others. The usual behavior of authoritative DNS servers is to rotate the order in which records for the same name are returned, so returning more than one record for the same question effectively distributes requests across the returned IPs, rather than providing any sort of failover behavior.

        In the case of the software Microsoft uses, it monitors endpoints for the websites in question and then changes which IP(s) are returned based on the availability of those endpoints, the geographic region and other factors.

        • bombcar 5 years ago

          Some reliability systems change the routing for the IPs instead of updating the DNS as BGP can propagate faster than DNS caching.

          Priority for A records would be a nice feature.
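Since A records carry no priority, any failover across them has to happen client-side. A minimal sketch of that idea: resolve every address for a host, then walk the list until one passes a health probe. The function names here are hypothetical, and note the caveat the thread raises: this only helps when the failure is at the connection level, because an HTTP 503 is a perfectly successful connection.

```python
import socket

def resolve_all(host, port):
    """Return every address DNS hands back for host (A and AAAA records)."""
    infos = socket.getaddrinfo(host, port, proto=socket.IPPROTO_TCP)
    # Deduplicate while preserving the order the resolver returned.
    seen, addrs = set(), []
    for info in infos:
        addr = info[4][0]
        if addr not in seen:
            seen.add(addr)
            addrs.append(addr)
    return addrs

def first_usable(addrs, probe):
    """Walk the address list in order; return the first one `probe`
    accepts, or None. `probe` decides what 'usable' means -- e.g. a TCP
    connect succeeding, or a request coming back with a non-5xx status.
    It is injected here so the policy (and tests) stay separate from I/O."""
    for addr in addrs:
        if probe(addr):
            return addr
    return None
```

In the Fastly case every resolved address pointed into the same CDN, so a probe checking for 5xx responses would have exhausted the whole list, which is why multiple A records alone were no help.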

k_ 5 years ago

Update: The issue has been identified and a fix is being implemented. Posted Jun 08, 2021 - 10:44 UTC

Seems like this is being resolved; curious to see the details afterwards

(from https://status.fastly.com/incidents/vpk0ssybt3bj)

permb 5 years ago

Made my Alpine Linux Docker builds fail as well (varnish) - but shouldn't it use a mirror when the primary download site is gone?

    fetch http://dl-cdn.alpinelinux.org/alpine/v3.12/main/x86_64/APKIN...
    fetch http://dl-cdn.alpinelinux.org/alpine/v3.12/community/x86_64/...
    ERROR: http://dl-cdn.alpinelinux.org/alpine/v3.12/main: temporary error (try again later)

ClearAndPresent 5 years ago

What conclusions can we draw about concentrating web content in a few CDNs?

  • threeseed 5 years ago

    In HTML/CSS you should be able to specify a fallback source if the first returns a non-200.

    Or that companies need to have better DNS strategies.

    • allyant 5 years ago

      > In HTML/CSS you should be able to specify a fallback source if the first returns a non-200.

      Except if the HTML/CSS is hosted on that CDN?

    • nickelpro 5 years ago

      DNS didn't fail, and there's nothing you can do in HTML/CSS/JS if your CDN fails to serve those things

    • ilaksh 5 years ago

      Content-centric networking has been a central research topic for many years, and many potentially useful systems have been proposed and implemented.

      At some point some of them will start to become popular.

    • richardwhiuk 5 years ago

      Web Browsers should probably retry a different server in DNS if they get a 503 - but they don't.

  • npteljes 5 years ago

    That sometimes they fail but the world goes on.

  • itsbits 5 years ago

    We had that experience when Cloudflare was down for some time last year. We've now set up a small static server of our own as a backup, in case this happens again. Although we haven't had to use it so far.

oneeyedpigeon 5 years ago

Good marketing for Fastly! I had no idea so much of the internet relied on it...

threeseed 5 years ago

Shopify's CDN is down.

Which is causing $15+ million in lost product sales for every hour of outage.

Not to mention the loss of any new customers.

  • dspillett 5 years ago

    StackOverflow and all the StackExchange family of sites are down. I suspect the lost productivity from that will be more costly over the whole economy than potential lost sales via Shopify. People can go back to Shopify, so those transactions aren't definitely lost forever; any time "lost" due to reference resources being unavailable can't so easily be claimed back.

    • threeseed 5 years ago

      I don't think you understand how ecommerce works.

      A very significant amount of people won't go back. It's why the most effective marketing campaign by far is retargeting those people to convince them to come back. Unfortunately that's not possible in this case since you can't track the users as the site is unusable.

      • gezfrg321 5 years ago

        > A very significant amount of people won't go back

        So they didn't need what they were about to purchase and saved their money. Doesn't sound like a net loss to me.

      • dspillett 5 years ago

        > I don't think you understand how ecommerce works ... people won't go back

        I was talking about the economy in general, not specific e-commerce sites. People that actually need what they were looking for but don't go back will buy it elsewhere. The money still flows, just somewhere else. And if they don't need the item(s), they'll perhaps use the money for something more useful.

      • twic 5 years ago

        Some sites are on Cloudflare, right? Looks like we have a natural experiment to test this belief!

    • willj 5 years ago

      Makes me wonder how the engineers will fix this if they can’t visit Stack Overflow :)

    • diveanon 5 years ago

      Believe it or not, but there are developers out there that read the docs.

  • john37386 5 years ago

    Here is a lesson for Shopify's talented staff to learn: don't put all your eggs in the same basket. I'm sure they can build something better than that. Hopefully, they will learn from this outage.

  • grumple 5 years ago

    Does Shopify do that much when the US is asleep?

Haydos585x2 5 years ago

Such a huge number of sites. It seems like it's mostly US based sites and Australians are okay. Sending good vibes to whatever poor person is on support right now.

  • nineteen999 5 years ago

    I'm in Australia and there are heaps of sites down for me.

    • lysp 5 years ago

      As per report above - most (or all?) of Asia/Pac servers are down.

      This incident affects: North America (Ashburn (BWI), Ashburn (DCA), Ashburn (IAD)), Europe (Amsterdam (AMS)), and Asia/Pacific (Hong Kong (HKG), Tokyo (TYO), Singapore (QPG)).

    • Haydos585x2 5 years ago

      Ah, I meant more sites like ABC, 9NOW, SBS, AFL, Foxtel etc rather than accessing US sites from AU.

  • paimoe 5 years ago

    In Perth, reddit is down. So is Blackboard files for uni

jujodi 5 years ago

Would be fascinating if Fastly is not able to use GitHub, Travis, Terraform, pip, etc. to deploy their fix

  • nraval1729 5 years ago

    Interesting thought. I had not thought about this before. If there is a cyclic dependency (not saying there is at the moment) how would things play out? Do you just ssh into your own servers to deploy the fix?

csomar 5 years ago

So I'm wondering where exactly, in the "hundreds of servers around the world", they went wrong.

This happened with Cloudflare before too. I think we are a little too dependent on these services.

  • 0xbkt 5 years ago

    It is a meaningless premise when you actually have SPoFs baked deep inside the system.

    • patentatt 5 years ago

      I’d love to see a breakdown of what single point of failure causes these worldwide network outages. They even brag about redundancy in their marketing materials. I hope we see a post mortem on this

  • fagnerbrack 5 years ago

    In Software Engineering we call it "coupling"

    /s

  • jfny 5 years ago

    Yeah, seriously. Time to rebuild the architecture from the ground up.

alexchamberlain 5 years ago

Stupid question: why didn't sites "just" fail over to their actual servers to handle the traffic, albeit slowly? I guess they won't be sized to handle the load in a lot of cases, and Fastly was responding, so DNS fail over didn't work?

  • altacc 5 years ago

    Probably a different answer for each site. I'm not a DNS expert but I think you're right on both counts. Having failover also requires a duplicate CDN architecture at the fallback location, which is an increase of costs in time, money & maintenance for relatively little benefit. Often there's a fair amount of background integration with a CDN, and each function slightly differently, so it's not simply plug & play.

  • abluecloud 5 years ago

    Yeah, the DNS was up; the problem was that the servers weren't able to proxy the traffic. Also, as you say, you'll probably end up bringing down the upstream servers if you just fail open (and I'm not even sure that would have been possible with Fastly in the "down" state that we saw).

sjaak 5 years ago

Perhaps Fastly is simply taking their commitment to reducing CO2 seriously? Three hurrays for the climate!

snookdebook 5 years ago

I gave it about 10 tries, and it seems a very small percentage of transactions do go through.

A decent number of tries is rejected right at the Varnish front door:

    < HTTP/2 503
    < server: Varnish
    < retry-after: 0
    < date: Tue, 08 Jun 2021 10:11:41 GMT
    < x-varnish: 271470009
    < via: 1.1 varnish
    < fastly-debug-path: (D cache-bma1666-BMA 1623147101)
    < fastly-debug-ttl: (M cache-bma1666-BMA - - -)
    < content-length: 450
    <
    Service Unavailable
    Guru Mediation: Details: cache-bma1666-BMA 1623147101 271470009

Many more reach some backend system that just dumps "connection failure":

    < HTTP/2 502
    < content-type: text/plain; charset=utf-8
    < content-length: 18
    <
    connection failure

And a tiny few do get through:

    < HTTP/2 200
    < content-type: text/html; charset=UTF-8
    < cache-control: max-age=0, must-revalidate
    < date: Tue, 08 Jun 2021 10:11:43 GMT
    < via: 1.1 varnish
    < vary: accept-encoding
    < set-cookie: ...snip...
    < server: snooserv
    < content-length: 275036
    <
    <!doctype html><html>...snip...
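The `retry-after: 0` header on those 503s is worth noting: a client that obeys it literally will hammer the edge with immediate retries. A hedged sketch of a politer retry loop, with the HTTP call injected as a callable (the function and parameter names here are illustrative, not any particular library's API):

```python
import time

def fetch_with_retry(fetch, max_attempts=5, backoff=0.5, sleep=time.sleep):
    """Retry `fetch` (a callable returning (status, headers, body)) on 5xx
    responses. Honors a Retry-After header but enforces an exponential
    floor, so a server advertising 'retry-after: 0' doesn't turn its
    clients into a thundering herd. `sleep` is injectable for tests."""
    for attempt in range(max_attempts):
        status, headers, body = fetch()
        if status < 500:
            return status, body
        if attempt == max_attempts - 1:
            break  # out of attempts; surface the last error
        try:
            advertised = float(headers.get("retry-after", 0))
        except ValueError:
            advertised = 0.0  # Retry-After may also be an HTTP-date; ignore
        sleep(max(advertised, backoff * (2 ** attempt)))
    return status, body
```

A loop like this matches what the comment above observed: with a small percentage of requests succeeding, a handful of spaced-out retries would occasionally get through.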

pimterry 5 years ago

This is one of the things that excites me about IPFS: in a world of decentralized data storage, yes self-hosting and control over your data is nice and all, but serious resilience to most random infrastructure outages is a much bigger deal.

It's still early days, but I'm hopeful that it can provide a real solution to today's CDN centralization.

  • jokoon 5 years ago

    Agree, but currently, ipfs would serve as a fallback, since it's about files. Decentralized/distributed generally has slower network performance.

    Unless most nodes are high performance, I guess?

    Personally I think a distributed database system, where entries are being made redundant in something like a blockchain+dht, would be a good start?

    Decentralizing the internet works if it financially makes sense for platforms to build such tools.

    • pimterry 5 years ago

      > Agree, but currently, ipfs would serve as a fallback, since it's about files.

      Isn't a CDN fundamentally all about files too?

      > Decentralized/distributed generally has slower network performance. Unless most nodes are high performance, I guess?

      There is definitely more work to do here before this is really useful, but it's well within the realm of things that IPFS should be able to do at reasonable performance for production sites in future. Good performance still requires a serious CDN node network similar to traditional CDNs today (to seed your content for day to day use) but with IPFS if that CDN goes down then existing users on your site can _also_ serve the site to other nearby users directly, or other CDNs can serve your site too, etc etc. Your DNS wouldn't be linked to any specific CDN in any way, just to the hash of the content itself, so anybody could serve it.

      > Decentralizing the internet works if it financially makes sense for platforms to build such tools.

      There's a platform company called Fleek who already do this today: https://fleek.co/hosting/ (no affiliation, and I've never even used the product, just looks cool). Seems to be designed as a Netlify competitor: push code with git and it builds it into static content and then deploys to IPFS.

      The benefits don't exist today of course, because no browsers natively support IPFS, so most users can only access the content via an IPFS gateway, which means you're back to fully centralized server infrastructure again... If we can get IPFS support into browsers though then fully decentralized CDN infrastructure for the web is totally possible.
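The core idea behind that decoupling is content addressing: a resource is named by a hash of its bytes, so anyone holding the bytes can serve them and the requester can verify them. A deliberately simplified toy model (real IPFS wraps the hash in a multihash/CID and adds a DHT for peer discovery; the class and method names here are invented for illustration):

```python
import hashlib

def content_address(data: bytes) -> str:
    """Name content by what it *is*, not where it lives."""
    return hashlib.sha256(data).hexdigest()

class TinySwarm:
    """Toy content-addressed store: any node holding the bytes can serve
    them, and the fetch is self-verifying from the address alone."""

    def __init__(self):
        self.blocks = {}

    def add(self, data: bytes) -> str:
        cid = content_address(data)
        self.blocks[cid] = data
        return cid

    def get(self, cid: str) -> bytes:
        data = self.blocks[cid]
        # The address doubles as an integrity check, so it doesn't
        # matter which node (CDN, peer, mirror) served the bytes.
        assert content_address(data) == cid
        return data
```

This is why a CDN outage is survivable in that model: the address stays valid no matter who serves the content, whereas a URL is bound to one hostname and one provider.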

  • jfny 5 years ago

    I'm pretty sure you can serve hundreds if not thousands of users from a single Raspberry Pi

    • pimterry 5 years ago

      I mean, yes, absolutely, and that works to start with, but I'm willing to bet the overall uptime and performance of a Raspberry Pi in your living room is quite a bit worse than Fastly's :-).

aero-glide2 5 years ago

isitdownrightnow.com is down

  • roachpepe 5 years ago

    Thanks for the best laughs in a while friend - that's pure irony right there!

DoreenMichele 5 years ago

I'm having intermittent Reddit issues, as one more data point.

I'm grateful for HN. I rebooted my computer. I thought it was my device and then saw this on my phone while rebooting.

monkeydust 5 years ago

Just occurring to me how CDNs are a major point of failure now for the internet

unfunco 5 years ago

Amazon being down surely points to something other than Fastly being the cause?

  • austinjp 5 years ago

    I just had a look at amazon.co.uk and most assets fail to load, the browser debug console is full of 503 errors. Picking one at random, it's fastly:

        $ nslookup images-eu.ssl-images-amazon.com
    
        Server:  127.0.0.53
        Address: 127.0.0.53#53
    
        Non-authoritative answer:
        images-eu.ssl-images-amazon.com canonical name = m.media-amazon.com.
        m.media-amazon.com canonical name = media.amazon.map.fastly.net.
        Name: media.amazon.map.fastly.net
        Address: 199.232.177.16
        Name: media.amazon.map.fastly.net
        Address: 2a04:4e42:1d::272
  • mpitt 5 years ago
    • jfny 5 years ago

      [deleted]

      • richardwhiuk 5 years ago

        They will use S3, but they need a CDN in front. Surprised they don't use CloudFront - maybe that's what they've failed over to.

        • macintux 5 years ago

          Apparently they switched from CloudFront after determining Fastly was faster for this use case. CloudFront is focused on large streaming services, not small HTTP resources.

Jamie9912 5 years ago

Yep, seems like:

Reddit, BBC News, Twitch.tv, the Twitter emoji CDN(?)

are all down with 503 service errors

  • another-dave 5 years ago

    Ah didn't cop that Twitter emoji issue was related! Thought an ad-blocker was stepping up its filters aggressively :)

    Stack Overflow, The Guardian, Gov.uk too as some other biggish names getting hit.

    • strogonoff 5 years ago

      Various bits of GitHub on the Web (committing edits, editing releases) were broken for the same reason. Failure modes of JS-heavy GUIs are interesting.

kypro 5 years ago

Some people are claiming online that this is a cyber attack. I contract for the UK Gov and I'm hearing reports that traffic is going through the roof right now.

Anyone know if there is any legitimacy to this?

  • fr2null 5 years ago

    The fastly monitoring/status page says: "Customers may experience increased origin load as global services return". Which sounds like the increased traffic is to be expected.

    [1] status.fastly.com

cph-w 5 years ago

I did not realise fastly adoption was so wide-spread. Can anyone more enlightened tell my why or have some resource on which use-cases fastly is superior to other CDNs such as CloudFlare?

simonbarker87 5 years ago

how will their devs fix it if stackoverflow has gone down?!

lysp 5 years ago

This incident affects: Europe (Amsterdam (AMS), Dublin (DUB), Frankfurt (FRA), Frankfurt (HHN), London (LCY)), North America (Ashburn (BWI), Ashburn (DCA), Ashburn (IAD), Ashburn (WDC), Atlanta (FTY), Atlanta (PDK), Boston (BOS), Chicago (ORD), Dallas (DAL), Los Angeles (LAX)), and Asia/Pacific (Hong Kong (HKG), Tokyo (HND), Tokyo (TYO), Singapore (QPG)).

  • tendencydriven 5 years ago

    Their status page is now saying every location has degraded performance.

  • kiwijamo 5 years ago

    Affecting Auckland (AKL) which is not on the list so I can only assume it's affecting more locations than they're letting on.

  • Banana699 5 years ago

    += North Africa (Egypt, Cairo)

    Stackoverflow.com, Reddit, Quora down. (and probably more; those are the ones I tested)

modshatereality 5 years ago

This post is suspiciously ranked much lower than it should be (1216 points, 9 hours ago), lower than posts with < 100 points.

sleepyshift 5 years ago

Looks like this has taken out Reddit at least.

  • spyke112 5 years ago

    Is it also hitting Github? I'm not getting any css when loading Github.

    • nevi-me 5 years ago

      Looks like it is. If you're still able to see much of the UI, don't force-reload the page as it'll invalidate the CSS in the cache.

      I did that moments ago, and I regret it.

  • algo_cheese 5 years ago

    And a large part of GitLab

optiomal_isgood 5 years ago

FWIW, Fastly ~8 hours ago (3am UTC) reported another incident: https://status.fastly.com/incidents/1glxxb8sf2zv and deployed a fix—either the fix made it worse or wasn't sufficient to mitigate the problem.

marmot777 5 years ago

I think the honorable thing would be for them to have a statement easily findable.

So many companies sweep this sort of thing under the rug if it's only customer data that's been breached. If they can't sweep it, they have a high-priced PR agency do the communicating.

I do not trust companies who handle things this way.

ZoomStop 5 years ago

The outage has already been added to the Fastly Wikipedia page

  • abhiminator 5 years ago

    Holy smokes these Wikipedia writers are quick! I'm sometimes impressed by how fast a page on a super recent happening gets populated with all of the currently known details.

choult 5 years ago

My money is on an expired internal certificate or CA.

dkarp 5 years ago

Before the "Error 503 Service Unavailable" messages appeared, there were a few minutes where the error was a single line:

    connection failure

Not sure if that provides anyone here with more insight into what might have caused this!

  • stordoff 5 years ago

    I got that, then a 'Fastly unknown domain' error (on Reddit), then the 503s on multiple sites (I also had an API I use return a 502 then a 500 error, but I don't know what the full response was as it was just a quickly thrown together script I was using).

    Edit: and now "I/O error" on Reddit.

  • q3k 5 years ago

    I also saw a glimpse of 'I/O error'. That sounds fun.

  • SileNce5k 5 years ago

    It was `connection failure` for me.

tommoor 5 years ago

Hands up if you're also here after being woken up by downtime alerts on the west coast

i386 5 years ago

Anyone want to talk about half the internet going out because one provider couldn't keep their service up, instead of SO jokes and feels for the engineers? The entire internet is like a house of cards, from the protocol to the economic model.

gansai 5 years ago

Wouldn't websites have alternate CDNs managing their traffic? Why would they have a single point of failure?

I was assuming there are a couple of services like Fastly, and that companies might have architected with the alternatives in mind too, I guess.

  • raimondious 5 years ago

    Normally you configure your A record to point at the CDN, as the CDN is the thing that gives you multiple points of presence (caches all over the world). Hard to have a fallback for that. Running multiple CDNs would be extremely expensive. CDN caches are kept useful by traffic running through them, so it's hard to have a backup for that too.

  • ImpactStrafe 5 years ago

    Because interacting and switching between cdns can be very complicated and/or costly

    It should be planned for, especially by major tech organizations like reddit, or Amazon, etc.

    But I won't fault news organizations, who already don't have boatloads of money, for not having failover CDNs

fagnerbrack 5 years ago

https://dashboard.stripe.com/ is down; https://github.com/ is defaced

fullstackwife 5 years ago

No mention of outage on https://status.cloud.google.com/, and I wonder why, because apparently this is a GCP problem.

mschuster91 5 years ago

Ah yes, the wonders of centralized internet infrastructure.

Let's use a handful of providers for everything, they said. It will be cheaper, they said. It will be easier to manage, they said.

And it was cheaper, until downtimes began to affect more and more sites when central SPOFs got hit.

And I wonder how much of that need for these centralized SPOFs actually comes from the sheer absurd amount of bloat, ads, code and assets that sites these days "have" to deliver to the customer. I 'member times when pages had 100kb total size, loaded in an instant and were perfectly usable.

evouga 5 years ago

Since Fastly’s own website is currently down:

What is fastly? Why are a huge number of web sites dependent on them? They are some kind of web host for companies that don’t want to run their own servers/data centers?

  • ImpactStrafe 5 years ago

    Fastly is a Content Distribution Network (CDN).

    Basically the closer the server serving the webpage is to the end user the faster it is for the end user to see and interact with.

    But running servers all over the world 1) isn't efficient 2) costs a lot of money.

    So a few companies (Fastly, Cloudflare, Akamai) figured, hey, why don't we build a bunch of small data centers all over the world and then provide a distributed way to serve web traffic from them.

    It originally was brought about for services like Netflix, but has expanded greatly.

    You still host your servers, but a copy of the webpage/media is given to the CDN to serve to customers.

    • evouga 5 years ago

      Thanks. That makes sense.

      Wouldn’t you build in a failsafe that bypasses Fastly and sends traffic to your own servers in the case of this kind of outage? Or outages are so rare that it’s not worth the trouble?

      • ImpactStrafe 5 years ago

        The number of serious CDN outages in the world are incredibly rare.

        In fact, you can probably remember most of them if you were given dates.

        Plus, going around the CDN can be very complex (depending on the type of content), very expensive (all of a sudden you have a massive data out network traffic that didn't exist previously), and not guaranteed to work (DNS updates can take longer to get to everyone than the actual CDN outage lasts).

        There are places where it is worth it and useful, but for a lot of the sites listed it's not useful.

      • npteljes 5 years ago

        That's the fallback, but the original stack is not designed with the volume of traffic in mind. So it gets overwhelmed very quickly and makes the website practically unavailable.

      • Dumbdo 5 years ago

        > Or outages are so rare that it’s not worth the trouble?

        This. I can't remember the last Fastly outage of this dimension, so the time spent setting up a secondary server serving your assets is probably not really worth it for small-to-medium companies. Although I'd think otherwise for a company like Shopify.

      • ceejayoz 5 years ago

        Many sites do this; Amazon's failed over to their own servers for images for me, it appears. It typically just takes some human intervention, I suspect.

  • ceejayoz 5 years ago

    I'm particularly intrigued as to why Amazon.com uses them.

    They literally have their own directly competing CDN product. You'd think they'd be dogfooding it.

devops000 5 years ago

BTC/USD is down too.

ysavir 5 years ago

Tangential question, but with services like these, is there a known way to handle failure gracefully? Some way to automatically bypass these services if they are known to be down?

  • efficax 5 years ago

    You have to have two separate CDNs and use DNS to fail over. The problem is that means paying for a CDN that just sits dormant for the 99.999% of the time that your primary is up.

    Alternatively, you could use DNS to fail over to content you host yourself, instead of another CDN. But in many cases that would be the same as an outage, since the CDN exists to reduce the impact of all those requests on your infra.
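The control loop behind that DNS-based failover is simple in outline: probe each CDN, publish the record set of the first healthy one, fall back to origin as a last resort. A hedged sketch with the provider-specific parts (health probe, DNS API call) injected as callables; all the names and hostnames here are made up for illustration:

```python
def choose_cdn(cdns, healthy, publish):
    """Pick the first healthy CDN from an ordered preference list and
    publish its record set. `cdns` is a list of (name, cname_target)
    pairs; `healthy` probes a CDN's edge; `publish` pushes the chosen
    CNAME target to the DNS provider. Both are injected because the
    real calls are provider-specific. Short TTLs on the published
    record are what make this failover take effect quickly."""
    for name, target in cdns:
        if healthy(name):
            publish(target)
            return name
    # Last resort: point straight at origin and hope it holds the load.
    publish("origin.example.com")
    return "origin"
```

Run from a scheduler every minute or so, this is roughly the shape of the "global load balancing through DNS" mentioned earlier in the thread, minus all the hard parts (resolver caching, partial outages, geo-steering).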

  • richardwhiuk 5 years ago

    Have two different CDN partners, own your own DNS, and then withdraw one of the CDNs if they are down. Suspect that's what Amazon have done.

sergiomattei 5 years ago

Yikes, seems like a massive outage.

EDIT: Hexdocs is down, elixir-lang.org is down

angled 5 years ago

None of the ES/NQ/RTY/YM futures contracts took kindly to the outage! This could have had a much wider financial impact. Most seem to have recovered now.

asicsp 5 years ago

Related thread: https://news.ycombinator.com/item?id=27432397

hypnoscripto 5 years ago

Looks like fastly.com uses fastly…

mcintyre1994 5 years ago

Do they have an official status page? Googling gets https://docs.fastly.com/en/guides/fastlys-network-status which is 503

Edit: Elsewhere in the comments: https://status.fastly.com/incidents/vpk0ssybt3bj

devops000 5 years ago

Hacker News is the only one UP!

john37386 5 years ago

It should be resolved soon. From the Fastly status page:

The issue has been identified and a fix is being implemented. Posted 1 minute ago. Jun 08, 2021 - 10:44 UTC

  • abluecloud 5 years ago

    Wonder if all the caches will have been wiped, causing knock on issues

    • john37386 5 years ago

      You might be right. Here is another update from fastly:

      The issue has been identified and a fix has been applied. Customers may experience increased origin load as global services return.

      Let's see

  • grumple 5 years ago

    Phew!

    That time to find the issue is always the stressful part. < 1 hour is pretty good for weird stuff, and fortunately the east coast of the US is barely online this early (sorry Europe!).

willvarfar 5 years ago

https://www.bbc.com/news/technology-57399628 is rendering and reporting on the story, but BBC itself was down at the start of the outage, with the same 503 varnish error message.

Presumably the BBC has some kind of fallback in place.

The journalists ought to interview their own techies :)

jchandra 5 years ago

https://www.greenhouse.io/ down as well.

hestefisk 5 years ago

The Guardian summarised this as well: https://www.theguardian.com/technology/2021/jun/08/massive-i...

perino 5 years ago

Anything hosted on Firebase seems to be down

easytiger 5 years ago

I will NEVER understand why people put so much trust in single provider solutions for anything critical.

vfclists 5 years ago

What happens when there is excessive centralization.

I thought that one of the principles behind the Internet is to be able to reroute around failures, but neither these service providers nor their clients ever seem to learn.

I guess in their mind that only applies to packet routing not services. SMH

MrGilbert 5 years ago

Interestingly, https://www.fastly.com/ works for me, whereas https://fastly.com/ doesn't.

Omnious58 5 years ago

I was wondering why my Tidal app just stopped mid-song and wouldn't connect. After much googling, and absolutely no help or even a notification from Tidal explaining there's an issue, it seems this outage is the culprit. Bugger.

diveanon 5 years ago

Time to develop a CDN for CDNs.

It seems like a pattern that CDNs have overly centralized the web and led to issues like this.

Maybe it's time to build a CDN that distributes your static assets to multiple CDNs and has a set of fallback states for service outages.

tfar 5 years ago

https://flutter.dev/ and https://fastlane.tools/ as well.

Dobbs 5 years ago

I got a push notification from the CNN app telling me a bunch of the internet was down due to a cloud provider. I clicked the link only for the app to open to a 503. In hindsight not surprising, but quite amusing.

misnome 5 years ago

pypi.org, but not https://status.python.org/ - I'm impressed that they actually hosted the status page differently!

lopatin 5 years ago

Their status page keeps claiming that my region, Chicago (ORD), is either Degraded Performance, or Operational. But clearly it's down. Is fuzzing metrics like this how they hit their SLA targets?

abhiminator 5 years ago

Looks like they're currently applying a fix.

https://status.fastly.com/incidents/vpk0ssybt3bj

montag 5 years ago

It's funny, I searched Twitter for "Ebay down" and the top result was an Ebay tweet with some not coincidentally broken Twitter emoji SVGs (as another person mentioned)...

theginger 5 years ago

GitHub? I had some issues; I checked the service status page, which said no issues, but images were returning a 503. Maybe they host their service status page elsewhere while still serving assets through Fastly.

monkeydust 5 years ago

Pretty bad that www.gov.uk is down as more services move to digital.

  • zelphirkalt 5 years ago

    I don't think moving to digital is the issue here. The issue is relying on third parties, which can have an issue at any moment, taking down whoever relies on them with them.

    A government should not rely on CDNs like that. In fact government websites should not have any traffic going over third parties. When I want to use/view a government website, I should not be subjected to sharing any data with unwanted third parties and the government should not be affected, when some private company makes mistakes or has outages. It is an unacceptable situation.

    They can set up their own state-owned CDN, using the same underlying technology. Compared to where they spend all that tax money, some servers and some engineers would be a very cheap investment, in relation to the independence achieved.

  • allyant 5 years ago

    They seem to have migrated across to Cloudfront - working now.

plasma 5 years ago

I briefly saw an error about "domain not found" when hitting fastly.com; wonder if some list of domains hit a limit / got flushed / etc.

  • dkarp 5 years ago

    I get this now on reddit:

        Fastly error: unknown domain: www.reddit.com.

fareesh 5 years ago

How does one design a system that has redundancy for when the CDN goes down? Paying for more than one CDN is probably too expensive, isn't it?

grumple 5 years ago

Good job Fastly for getting the issue identified and resolved so quickly. < 1 hour to identify, <13 minutes to fix (assuming status is accurate).

an0n4u 5 years ago

numpy docs, too. I think it's Cloudflare-related as well; at least, I keep seeing some Cloudflare errors interspersed with the 503 Varnish error.

  • AkshitGarg 5 years ago

    Well they thought that using a CDN over a CDN would be a good idea

    • somishere 5 years ago

      We've got Cloudflare sitting in front of our Firebase/GCP instance (which I've just found out is Fastly-cached :/). Getting 503s at the origin but we're up on our URL with an always online notice thanks to CF. Double dip isn't all that bad.

  • lapink 5 years ago

PyTorch and Python docs, all down. No Stack Overflow. I guess this is a forced bank holiday for developers around the world.

MyOnePiece 5 years ago

Quick question: if the CDNs are down, why can't traffic be routed to the central web servers the company owns?

I thought CDNs had fallback configured?

_kyran 5 years ago

Those of you who work in DevOps or SRE, or are CTOs:

What kinds of things do you put in place to manage these kinds of centralised issues that are beyond your control?

  • timthorn 5 years ago

    These issues are in your control: not the centralised service itself, but your use of it. You can build appropriate redundancy for the components/providers in your stack, within the budget you have.
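
    A minimal sketch of that kind of redundancy: a health probe that fails over between two CDN hostnames. The hostnames and the /healthz path here are hypothetical, and a real setup would more likely do this at the DNS or load-balancer layer than in application code.

```python
import urllib.request

# Hypothetical CDN endpoints fronting the same origin; names are made up.
CDN_HOSTS = ["cdn-primary.example.com", "cdn-backup.example.com"]

def healthy(host, timeout=2.0):
    """True if the host answers a lightweight HEAD probe with a 2xx status."""
    try:
        req = urllib.request.Request(f"https://{host}/healthz", method="HEAD")
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except Exception:
        return False

def pick_cdn(hosts=CDN_HOSTS, probe=healthy):
    """Return the first healthy host; None means every CDN is down."""
    for host in hosts:
        if probe(host):
            return host
    return None
```

    The probe function is injectable so the failover logic can be exercised without real network calls; the hard part in practice is keeping both CDNs' configs in sync, which is what makes multi-CDN expensive.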

devops000 5 years ago

Heroku is down https://dashboard.heroku.com/

JCWasmx86 5 years ago

>The issue has been identified and a fix has been applied. Customers may experience increased origin load as global services return.

It's fixed.

Nilef 5 years ago

Ironically, even this outage page is down for me.

ur-whale 5 years ago

Wow, talk about a brutal SPOF. Most of the things I had planned to work with today are broken: reddit, GitHub, Stack Overflow.

taosx 5 years ago

I̶n̶ ̶r̶o̶m̶a̶n̶i̶a̶ ̶e̶v̶e̶r̶y̶t̶h̶i̶n̶g̶ ̶s̶e̶e̶m̶s̶ ̶b̶a̶c̶k̶ ̶t̶o̶ ̶n̶o̶r̶m̶a̶l̶.̶.̶.̶?̶

Edit: nope, just worked for 2-3 requests (10 secs)

anotheryou 5 years ago

Looks fixed: https://downdetector.com/

jl6 5 years ago

Worrying that this is impacting so many dev toolchains and services, which will hinder the ability to respond to the issue.

timvisee 5 years ago

This seems to be a bigger issue. BGP failure?

  • stefan_ 5 years ago

    If they can serve me a garbage Varnish error (shoutout to "software that actually runs your business that none of your devs work on"), it's not BGP.

_kyran 5 years ago

Things seem to have come back online in Australia, although not sure if that's just sites switching over their DNS?

LightG 5 years ago

"The internet will just route around a local / centralised problem ... like water around an object"

Obligatory LOL ...

graphman 5 years ago

Firebase Dynamic Links is affected too. Checking the IP looks like they are using Fastly which is quite surprising.

taurath 5 years ago

I’ve noticed lots of social media content is tied to this - Reddit and Twitter images and some videos, for one.

loriverkutya 5 years ago

The issue has been identified and a fix is being implemented. Posted 3 minutes ago. Jun 08, 2021 - 10:44 UTC

ilaksh 5 years ago

Let's make all of the main internet sites dependent upon one central private service. Great idea guys.

artembugara 5 years ago

Seems like another single point of failure. What is a solution to not be affected by such an outage?

toong 5 years ago

It is time to remove that "100% uptime guarantee" claim from the website :grimacing:

classicflavour 5 years ago

My work's website is down too, and so are the regular sites I use to escape work boredom.

gansai 5 years ago

Fastly is back now. (The issue has been identified and a fix is being implemented.)

pattyj 5 years ago

It would be interesting to see estimations on the man-hour cost of this outage.

mothershesha 5 years ago

Got the same here (Australia)

johnstonnorth 5 years ago

rubygems.org affected too

vincentmarle 5 years ago

Well, now I know where to go next time if I were a Russian hacker.

clawphantom 5 years ago

Twitch isn’t working or responding, and neither is the web dashboard.

luke2m 5 years ago

When this happens to cloudflare, it will be even more impactful.

colesantiago 5 years ago

Looks like Fastly did not work as advertised, very misleading.

reuben_scratton 5 years ago

I'm sure it's just a coincidence that today is Patch Tuesday.

:-|

zwirbl 5 years ago

Spotify is also hit, though it still works without images

ddtaylor 5 years ago

Someone must have 51%-attacked the Pied Piper blockchain!

vlan121 5 years ago

Damn, I thought I could blame myself or the provider...

ronyfadel 5 years ago

Ten Percent Happier is down, and now my day is ruined.

  • selykg 5 years ago

    When viewing a meditation session you can see a download button in the upper right (at least on iOS).

    I always have a small stash of my favorites saved locally in case of internet outage or I’m caught in a situation where I don’t have internet but need a few minutes.

    On top of that I’ve been really trying to rely less on an app. So I throw in a lightly guided or unguided session at least every couple of days, where I focus on going solo, so I don’t need an app, just a timer.

fsnowdin 5 years ago

Just had my own site go down because of this. Glad to see it wasn't my fault, lol. Good luck to the Fastly people fixing the issue.

8K832d7tNmiQ 5 years ago

That explains why I couldn't access reddit

navanchauhan 5 years ago

No wonder, The Verge and NYT are down too.

rich_sasha 5 years ago

www.python.org down as well, with the shortest of messages: 'connection failure'. Probably related?

NewLogic 5 years ago

Even amazon.com styling is borked for me

dilawar 5 years ago

I think reddit in India is down as well.

JosephK 5 years ago

Extremely long shot, but what are the chances this turns out to be connected to the raids on organised crime using the An0m app that started today?

john37386 5 years ago

It's probably a DDoS attack.

dragosbulugean 5 years ago

And all Webflow sites it seems...

alixaxel 5 years ago

Indeed, part of GitHub (.io) too.

ur-whale 5 years ago

Looks like HN is working ;-)

jfny 5 years ago

Do companies really not run test suites / do manual testing before deploying to production?

timetosleep 5 years ago

Seems to be back online

rvz 5 years ago

Basically everything is broken. "Centralising everything", huh?

dragosbulugean 5 years ago

All Webflow sites?

mlnj 5 years ago

StackOverflow too.

schappim 5 years ago

Parts of Shopify

ur-whale 5 years ago

Looks like an SRE team rolled out buggy software.

  • twic 5 years ago

    Let's start getting our guesses in.

    I think it's some dodgy VCL rolled out to all machines at once. For some reason it worked in staging.
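
    If that guess is right, the usual mitigation is a staged rollout rather than pushing to every machine at once. A rough sketch of the idea (purely illustrative; this is not Fastly's actual deploy tooling, and a real pipeline would gate each stage on error-rate metrics and roll back when they spike):

```python
import random

def rollout_stages(nodes, fractions=(0.01, 0.1, 0.5, 1.0)):
    """Yield successive batches of nodes for a staged rollout.

    Each stage widens the deploy to a larger cumulative fraction of the
    fleet; between stages, a real system would check health metrics
    before continuing instead of ploughing ahead.
    """
    shuffled = random.sample(nodes, k=len(nodes))
    done = 0
    for frac in fractions:
        target = max(done, max(1, int(len(shuffled) * frac)))
        batch = shuffled[done:target]
        done = target
        if batch:
            yield batch
```

    The canary stage (1% here) is what would have limited the blast radius: a config that "worked in staging" but breaks in production fails on a handful of machines first, not the whole fleet.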

rottc0dd 5 years ago

github is back online. SSO too.

raylus 5 years ago

Whew, DevOps fire alarms are going off!

raylus 5 years ago

github.com is pretty broken

schappim 5 years ago

SMH.com.au

heavydust 5 years ago

the problem has been fixed

heavydust 5 years ago

reddit.com is affected too

alexannic 5 years ago

cnn.com is down as well.

cwen 5 years ago

A real-world Chaos experiment!

cdev 5 years ago

it seems to be up now

magicturtle 5 years ago

Reddit down as well.

Metacelsus 5 years ago

I first noticed that xkcd was down. Then I went to post about it on reddit . . . also down! Good thing HN is up.

nindalf 5 years ago

It's taken out xkcd as well.

pts_ 5 years ago

Are these sites on the same cloud or CDN?

colesantiago 5 years ago

Also, why has this been allowed to happen? Billions of dollars lost because of this one company?

I don't understand this.

ramraj07 5 years ago

For a moment I thought all of the Western internet had been cut off from India. Shows how siloed my browsing habits are!

raphaelj 5 years ago

Couldn't be happier I moved https://noisycamp.com to BunnyCDN.com.

TheRealDunkirk 5 years ago

Every other comment about what's down in this thread -- as if we needed dozens of site-by-site accountings of this outage in the first place -- is a bitch about reddit. Why is reddit so important to this crowd? The specific topics I used to read the site for (half a dozen years ago) have all been overrun by "bucket people," there is literally never an answer to any question I find a google link to there, and the site's design is actively user-hostile. Seriously: what's keeping that place afloat? Porn, I suppose.

  • sergiomattei 5 years ago

    Of course, the Enlightened Folk of this site can no longer use their leisure time on lowly activities such as the "Reddit".

    Teach me your ways, master! /s

    Jokes aside, people can do whatever they please. Reddit has a bunch of niche communities around many hobbies and fun things. No need to be bitter about it.

    • TheRealDunkirk 5 years ago

      You have put your finger on it. I AM bitter about it. It used to be really cool, and really nice to use, before the Taylor/Pao dustup, and the redesign.

  • afroboy 5 years ago

    old.reddit is still a thing, and there are plenty of educational subreddits with really nice communities around them. It's just like the rest of the internet: pick the things that suit you.

  • modshatereality 5 years ago

    Reddit taught me to never trust a mod, so it does have some purpose still. I think without glaringly bad examples of how (not) to run a community-based site, we would be doomed to repeat its mistakes.
