Visual Studio Code 1.7 overloaded npmjs.org, release reverted

code.visualstudio.com

302 points by eiopa 9 years ago · 93 comments

seldo 9 years ago

I'd just like to say on behalf of npm that Microsoft's handling of this incident was A+. As soon as we alerted them to the issue they were all hands on deck and did a rollback.

We've been really pleased that Microsoft chose to put their @types packages into the npm registry rather than a separate, closed system, and in general happy with Microsoft's support of node and npm. We're confident we can make the new features of VSCode work, we just need to work with Microsoft to tweak the implementation a little.

This was an honest mistake on their part, and we caught it in time that there was very little impact visible to any npm users.

Fun fact: at its peak, VSCode users around the world were sending roughly as many requests to the registry as the entire nation of India.

  • thedaniel 9 years ago

    > This was an honest mistake on their part

    From my outside perspective, it doesn't seem like a mistake on their part at all. Later in the thread you say this accounted for 10% of traffic, mostly 404s. This is (i assume) a hell of a lot of requests, but given npm's position as developer infrastructure, I don't think they could have reasonably expected to melt it. It would have been good of them to give a heads up, but I don't think I'd start assigning blame to the Code team.

    • jfindley 9 years ago

      Yeah. It leaves an unpleasant taste in my mouth to hear npm blaming Microsoft for this. As noted elsewhere, 404s are supposed to be very cheap to handle, otherwise DoS attacks become embarrassingly easy.

      I feel like the npm team have once again failed to own their problems and instead tried to push the blame elsewhere. This is just an outside perspective, but I really feel like it would have been more honest and accurate to at least admit to the possibility that npm isn't perfect, and "blame" (which I'm not sure is even a helpful concept in this instance) is shared between parties more equitably.

      • seldo 9 years ago

        I'm sorry my response looked like I was blaming them, that wasn't my intention. Like I said, it was an honest mistake: these things happen, and they handled it well.

        Once we determined 404s were the problem we put mitigation in place that worked fine, but the problem of request volume remained: the 10% figure I gave was at a 5% rollout of VSCode. A full rollout would therefore have meant registry traffic tripling overnight, with two thirds of it being 404s from VSCode users. At that point the issue is financial, not technical, which is another reason the rollback happened.

        • joepie91_ 9 years ago

          Hmm. What is the "mitigation"?

          • seldo 9 years ago

            More efficiently handling 404s, which as many have pointed out we were handling quite naïvely.

            • joepie91_ 9 years ago

              Right, but I'm curious what exactly the issue was (on a technical level), and how you've mitigated it. This might be useful knowledge for other people building similar things, to avoid making the same mistakes :)

              • seldo 9 years ago

                Check out my detailed answer a few comments down: https://news.ycombinator.com/item?id=12861180

                • i336_ 9 years ago

                  Hi, I just wanted to say kudos for this little reply thread.

                  Many times I've seen someone on HN write a negative/flaming reply to a comment, which then nets a bunch of further agreement and consensus, and the original commentator is nowhere to be seen.

                  You quickly responded and fully acknowledged the faux pas (nuking any negative consensus), then you replied twice more, and one of those replies was to a request for technical info.

                  /o/

    • lucideer 9 years ago

      On the contrary, this seems like too kind a stance for npm to take. The approach Microsoft took here seems enormously and unnecessarily inefficient.

      Microsoft maintain the @types scope. Instead of providing their own metadata endpoint listing available typings to filter requests on, they lazily opted to just mass bombard a repository they maintain, hosted on a free service they don't fund, for any and all possible package names, even though they themselves maintain the list of packages and should know in advance which don't exist.

    • mjpa 9 years ago

      Sounds like you're describing a mistake there...

  • Arnavion 9 years ago

    Can you elaborate on what the issue is and how you want it to be fixed? Is it just something like rate-limiting requests or something more fundamental?

    Edit: Answered at https://news.ycombinator.com/item?id=12861118

    • seldo 9 years ago

      A VSCode person can (and probably will) answer in more detail, but at heart it's simple: if you want to add type-checking goodness to a library that isn't itself written in TypeScript, you can create a thing called a declaration file: https://github.com/DefinitelyTyped/DefinitelyTyped

      Microsoft publishes a list of known good declaration files for popular npm packages to npm, under the scope @types: https://www.npmjs.com/~types

      The 1.7 release of VSCode helpfully tries to automatically load type declarations for any npm package you use by requesting the equivalent declaration package under @types. When the package exists this is fine, because it's cached in our CDN.

      What they forgot to consider is that most CDNs don't cache 404 responses, and since there are 350,000 packages and less than 5000 type declarations, the overwhelming majority of requests from VSCode to the registry were 404s. This hammered the hell out of our servers until we put caching in place for 404s under the @types scope.

      We didn't start caching 404s for every package, and don't plan to, because that creates annoying race conditions for fresh publishes, which is why most CDNs don't cache 404s in the first place.

      There are any number of ways to fix this, and we'll work with Microsoft to find the best one, but fundamentally you just need a more network-efficient way of finding out which type declarations exist. At the moment there are few enough that they could fetch a list of all of them and cache it (the public registry lacks a documented API for doing that right now, but we can certainly provide one).
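A minimal sketch of that cached-list idea, purely for illustration (the list contents, the refresh mechanism, and every function name here are hypothetical, not npm's or VSCode's actual API). It resolves an npm package name to its @types equivalent and only queries the registry when the name appears in a periodically refreshed local list:

```typescript
// Hypothetical sketch: filter @types lookups against a cached list of
// known declaration packages so nonexistent ones never hit the registry.
const knownTypes = new Set<string>([
  "@types/lodash",
  "@types/react",
  "@types/angular__core",
]); // in practice this would be fetched from the registry and refreshed

// DefinitelyTyped maps scoped packages @scope/name to @types/scope__name.
function typesPackageFor(pkg: string): string {
  if (pkg.startsWith("@")) {
    const [scope, name] = pkg.slice(1).split("/");
    return `@types/${scope}__${name}`;
  }
  return `@types/${pkg}`;
}

// Only worth a network request if the declaration package is known to exist.
function shouldQueryRegistry(pkg: string): boolean {
  return knownTypes.has(typesPackageFor(pkg));
}
```

With ~5000 declaration packages, the full list is small enough that shipping and refreshing it client-side is cheap compared to one registry request per dependency.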

      • onlydnaq 9 years ago

        > At the moment there are few enough that they could fetch a list of all of them and cache it (the public registry lacks a documented API for doing that right now, but we can certainly provide one)

        Might I suggest having a bloom filter containing all the existing type declarations (which would be quite small) and only querying the registry if the bloom filter reports the type declaration as a positive.

        Since the filter can be really small it will probably scale a lot better than a complete list of all type-declarations, and a new filter could be downloaded by the clients every now and then.
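For illustration, a minimal bloom filter sketch (the size, hash count, and hash scheme are arbitrary choices, not a tuned design). Names that were added always test positive; absent names almost always test negative, so most 404-bound lookups never leave the client:

```typescript
// Minimal bloom filter sketch. Real deployments would tune size and hash
// count to the expected item count and acceptable false-positive rate.
class BloomFilter {
  private bits: Uint8Array;

  constructor(private size: number, private numHashes: number) {
    this.bits = new Uint8Array(Math.ceil(size / 8));
  }

  // Seeded FNV-1a variant; any decent string hash family works here.
  private hash(s: string, seed: number): number {
    let h = (2166136261 ^ seed) >>> 0;
    for (let i = 0; i < s.length; i++) {
      h ^= s.charCodeAt(i);
      h = Math.imul(h, 16777619) >>> 0;
    }
    return h % this.size;
  }

  add(s: string): void {
    for (let k = 0; k < this.numHashes; k++) {
      const bit = this.hash(s, k);
      this.bits[bit >> 3] |= 1 << (bit & 7);
    }
  }

  // No false negatives; false positives possible but rare at this sizing.
  mightContain(s: string): boolean {
    for (let k = 0; k < this.numHashes; k++) {
      const bit = this.hash(s, k);
      if ((this.bits[bit >> 3] & (1 << (bit & 7))) === 0) return false;
    }
    return true;
  }
}

const filter = new BloomFilter(65536, 4); // 8 KiB of bits: tiny to ship to clients
filter.add("@types/lodash");
filter.add("@types/react");
```

A false positive just means one wasted registry request, which the CDN can absorb; the common case (no declaration exists) is answered locally.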

        • mirhagk 9 years ago

          Is there an efficient diff algorithm for bloom filters?

          • Yugnik 9 years ago

            Depends on the bloom filter, but for the fixed size, fixed hash functions, and other implementations in the same general vein, it would just be XOR of both of the bloom filters. Anything that comes out true is missing from what would be the combined filter. Slightly different for counting filters, where something along the lines of subtraction would get you what you need (anything non-zero is different).
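The XOR diff described above can be sketched directly on the raw bit arrays. This only makes sense under the assumption both filters share the same size and hash functions:

```typescript
// Diff two equal-size bloom filter bit arrays: a set bit in the XOR marks
// a position set in one filter but not the other. Only meaningful when
// both filters were built with identical size and hash functions.
function xorDiff(a: Uint8Array, b: Uint8Array): Uint8Array {
  const out = new Uint8Array(a.length);
  for (let i = 0; i < a.length; i++) out[i] = a[i] ^ b[i];
  return out;
}

// Count differing bit positions across the whole diff.
function popcount(bits: Uint8Array): number {
  let n = 0;
  for (let byte of bits) {
    while (byte) { n += byte & 1; byte >>= 1; }
  }
  return n;
}
```

A client could then fetch only the (sparse) diff to update its cached filter, instead of redownloading the whole thing.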

      • mappu 9 years ago

        > most CDNs don't cache 404s

        Sounds like a good CDN-busting DDoS vector.

        • chrismorgan 9 years ago

          Yeah, 404 generation needs to be efficient for cases like this. Sounds like npm simply hadn’t encountered a situation where this mattered.

          • IsaacSchlueter 9 years ago

            > 404 generation needs to be efficient for cases like this.

            Indeed!

            In general, it is an extremely efficient response. It took a huge number of users all hammering on the same set of 404 handling routes to get our attention, and we were able to handle the load, though it wasn't trivial to do so. The end user impact was minimal.

            If it hadn't been a known-good actor, we had some options to shut down the flood a bit more forcefully, but we didn't want to inadvertently cause errors for vscode users. Like my colleagues have said in this thread already, we really dig what VSCode is doing, and as operational fires go, this one got put out very swiftly and did very little harm.

            All that being said, knowing the npm devops team, this will no doubt be a source of insights for making the registry even more resilient in the future :)

      • dlss 9 years ago

        If your CDN supports stale-while-revalidate and stale-if-error, you should consider enabling them -- it will take the load on your servers from O(users * packages) to O(POPs * packages)
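For reference, those two directives come from RFC 5861 and ride on the ordinary Cache-Control header; the values below are illustrative, not npm's configuration:

```
Cache-Control: max-age=300, stale-while-revalidate=600, stale-if-error=86400
```

Read as: serve the cached copy for 5 minutes; for the next 10 minutes keep serving it stale while refetching in the background; and if the origin is erroring, keep serving the stale copy for up to a day. Revalidation then happens once per CDN point of presence rather than once per user.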

      • netcraft 9 years ago

        > At the moment there are few enough that they could fetch a list of all of them and cache it

        at which point they would be back to the annoying race conditions for fresh publishes, no?

        Can you speak to why it is so expensive on the NPM side to serve a 404? Would a bloom filter like another commenter mentioned be helpful?

      • JohnDeHope 9 years ago

        "There are any number of ways to fix this, and we'll work with Microsoft to find the best one..."

        It's refreshing to read actual engineers' writing. After this, going back to tear-jerking snark-filled twitter and medium gnashing of teeth will be hard.

      • awinder 9 years ago

        Most CDNs are able to invalidate a cache entry, so caching a 404 and busting it once it's no longer going to be a 404 seems like it'd work?

        • ceejbot 9 years ago

          Yeah, you're right. It was a simple oversight that we hadn't been caching 404s already, since we already have infrastructure in place to bust cache on publishes. It would have been our next step if necessary (it wasn't necessary to mitigate this flood).

          You optimize for the use patterns you anticipate or see in normal usage, because, well, see famous saying about premature optimization. The use pattern we see most often is people installing from pre-determined lists in package.json, so 404s aren't all that common ordinarily.

      • kozhevnikov 9 years ago

        Can they publish a package that contains a list of all their other packages under @types?

        • RyanCavanaugh 9 years ago

          This is the tentative fix, though we'll be looking all-up at ways to reduce the load generated by VS Code from this feature.

      • JohnHaugeland 9 years ago

        If there's only 5000 types, why don't you just keep a catalog file? Then it's a single hit.

    • mohamedhegazy 9 years ago

      The TypeScript team will do feature work to minimize npm requests via a well-known cached list of packages.

  • ec109685 9 years ago

    "Many requests to the registry as the entire nation of India" per what time unit?

    • oridecon 9 years ago

      Approximately 3 new JS frameworks per hour.

    • seldo 9 years ago

      I was a bit vague :-) India's about 10% of total requests on any given day. VSCode was 10% of requests for a couple of hours.

    • pdpi 9 years ago

      Doesn't matter, it's a comparison of request rates.

      Requests over time where the user is in India ~= requests over time where the user is a VS Code user

    • lmm 9 years ago

      How does the time unit make a difference? If they're making the same number of requests/second as India then they're also making the same number of requests/hour or requests/day, no?

      • ec109685 9 years ago

        There wasn't any time unit in the original statement, so it wasn't clear that the request per X from India was the same as the request per X from the IDE.

  • Fifer82 9 years ago

    Awesome of you to share your thoughts. Don't mind the children here. You could give them gold and they would moan about the purity.

  • paulftw 9 years ago

    So one day they switched their entire user base to rely on a 3rd party free service without any load testing or heads up? What could possibly go wrong?

    • mohamedhegazy 9 years ago

      We have been testing this on insider builds of vscode for a few weeks as well as preview builds of visual studio with no issues. We were just notified today by npm that we are flooding their servers.

      • ceejbot 9 years ago

        Your installed base is quite large indeed! Your testing load was a drop in the bucket of our daily usage, but once you released to VS users we noticed. Should be straightforward to design something that works for this access pattern and load, now we know what you need. Typeahead package name completion would be a neat feature.

      • paulftw 9 years ago

        Before you start sending a couple thousand QPS to any server, it's generally not a bad idea to test whether that server can handle that much; sometimes it is even worthwhile notifying the other team about the intended change.

        Overall, when you "tested" something but it still breaks in production and requires a rollback, it's usually a sign that your testing strategy could use some improvement - what is the point of testing if it doesn't prevent failures from happening?

    • foota 9 years ago

      Fwiw if I were building a feature on something that's considered as core a technology as npm is, I likely would not have thought of this either. (Though maybe I would have if I were doing an in-depth look into it.)

BenjaminCoe 9 years ago

As one of the folks on the front-lines helping patch this, I certainly have no hard feelings; and I'm excited to be able to support this feature properly.

... also ... not going to lie, this was the first time we've gotten to test several of the checks and balances we have in the npm registry which I was jazzed about :)

  • raisedadead 9 years ago

    Thanks, Benjamin, Laurie and everyone else for mitigating this, it feels great to know when the community chimes in together for such highly unanticipated scenarios.

    On that note, however, I respectfully believe that features with the potential to hit the registry this hard should first be beta-tested against a private registry before being moved on to npm's high-traffic CDNs.

    And 10% of the daily traffic is from India??? Whoa, every day is a school day.

    • poizan42 9 years ago

      > And 10% of the daily traffic is from India??? Whoa, every day is a school day.

      Well, 17% of the world's population lives in India, so it doesn't seem surprising.

mavsman 9 years ago

If I were on the Azure team I'd be offering tons of free credit to npmjs.org to get them to use Azure. Azure coming to the rescue would be the perfect ending to this story for Microsoft.

  • CaveTech 9 years ago

    I don't think most organizations could relaunch their infrastructure on a totally different stack at the drop of a hat. And if it was really a "throw more servers at it" problem then it wouldn't really matter who was hosting them, would it?

  • KayL 9 years ago

    CDN caching globally, please!

Arnavion 9 years ago

https://github.com/Microsoft/vscode/issues/14889

sync 9 years ago

Shouldn't all these requests be cached by a CDN? What exactly is overloading?

  • seldo 9 years ago

    CDNs don't usually cache 404s. VSCode was looking for @types packages for any and every npm package its users were using. Packages that had a type description caused no issue, but most packages don't, so we had a > 1000% spike in 404s. Our workaround before MS did the rollback was to cache 404s for @types packages specifically, and it was effective enough that the registry never really went down.
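A sketch of what scoping 404 cacheability to @types might look like at the origin. The header values, path scheme, and function name here are assumptions for illustration, not npm's real configuration:

```typescript
// Hypothetical origin-side policy: let the CDN cache 404s only under the
// @types scope, where the package set changes slowly; keep other 404s
// uncached to avoid races with fresh publishes.
function cacheControlFor(path: string, status: number): string {
  if (status === 404) {
    return path.startsWith("/@types/")
      ? "public, max-age=300" // a stale @types 404 for a few minutes is harmless
      : "no-store";           // a freshly published package must 200 immediately
  }
  return "public, max-age=300"; // normal cacheable responses
}
```

The trade-off is deliberate: a few minutes of staleness on a newly published @types package is tolerable, whereas caching 404s registry-wide would make every fresh publish briefly invisible.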

    • akfish 9 years ago

      Interesting. Thanks for sharing this information.

    • tankenmate 9 years ago

      It's a pity; DNS handles negative lookup caching / TTLs (in fact, that is exactly what the TTL field in DNS zones is for!). So negative lookups can be cached, but you need to think about it ahead of time and set sensible TTLs (preferably configurable) for those negative caches.
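For comparison, the negative-caching TTL alluded to here is the final (minimum) field of a DNS SOA record, per RFC 2308; the values below are only an example:

```
example.com.  3600  IN  SOA  ns1.example.com. admin.example.com. (
        2016110401  ; serial
        7200        ; refresh
        900         ; retry
        1209600     ; expire
        300 )       ; minimum: negative-caching TTL (RFC 2308)
```

Resolvers cache NXDOMAIN answers for that minimum value, which is exactly the "cache the 404" behavior the registry ended up needing for @types.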

    • natuac 9 years ago

      "a > 1000% spike in 404s" overloaded your servers? Such are your generation times? Can I bring the entire NPM ecosystem down from my ADSL line using some silly threaded code to make requests to randomly named packages?

      • seldo 9 years ago

        99.9% of our requests are handled by the CDN. The CDN doesn't cache 404s, so 404s are handled by our origin servers, which are much fewer in number and therefore quite easy to overwhelm.

        You're right that our handling of 404s was naive, and that's definitely something we'll be improving as a result of what we've learned from this incident.

markatkinson 9 years ago

It is a really foreign feeling being exposed to such an actively developed and well-run project. Every time I see a new release on HN I get a little "wow, that time of the month again." Even this rollback was indicative of how fast they move.

akfish 9 years ago

Which is more likely? A bug, or that they just underestimated the volume of traffic that ATA could cause in real life?

manojlds 9 years ago

Would yarn help here? (since FB have their own CDN and registry for it?)

PudgePacket 9 years ago

I wonder if any warning was given to npm that they would be getting this potentially huge new source of traffic. It doesn't seem to be mentioned anywhere.

  • vonklaus 9 years ago

    Eh, NPM is a pretty core service and both sides probably should have done things a bit differently. I don't necessarily think vscode needed to reach out to NPM to let them know they were going to be consuming their public API. Both teams appear to be in communication as a result, however, which is good.

    This will likely lead to more fault tolerant systems on both projects and hopefully more collaboration & features in the future.

    • CapacitorSet 9 years ago

      >I don't necessarily think vscode needed to reach out to NPM to let them know they were going to be consuming their public API.

      VSCode is used by a non-negligible number of users, and seems to rely on npm to operate at its best. It would have been good etiquette to let npm know, even though they couldn't forecast this exact situation.

      • vonklaus 9 years ago

        I am not the architect of any large scale system-- that said, I wouldn't expect developers to reach out to GitHub.

        However, it isn't bad etiquette and I'm sure Microsoft could get in touch with the devs. Interesting thought.

antrion 9 years ago

This is a really cool feature! Is there a similar extension for Atom / Sublime?

andyburke 9 years ago

oops

_pmf_ 9 years ago

But they told us that pulling in 1.6 GB for "Hello World" is normal and no big deal.

gremlinsinc 9 years ago

I'd love to use VSCode, but I can't until they or someone else rolls out a DocBlockr extension that works for PHP. I'm mostly tied to Laravel right now, and my company requires docblocks, which are not fun to write by hand.

z3t4 9 years ago

They are probably trolling for a debate about NPM ... I can smell politics.

hiou 9 years ago

> The feature was so great that we started to overload the npmjs.org service.

I'm not sure I would call my feature "great" if it could have brought down npm.
