How four packets broke CenturyLink's network

theregister.co.uk

86 points by aditya 6 years ago · 27 comments

cle 6 years ago

> As to what can be done to prevent similar failures, the FCC is recommending CenturyLink and other backbone providers take some basic steps, such as disabling unused features on network equipment, installing and maintaining alarms that warn admins when memory or processor use is reaching its peak, and having backup procedures in the event networking gear becomes unreachable.

Disabling unused services? Alarms when nearing resource limits? Contingency plans? How is this the first time this has come up?! These are like security & devops 101.

  • jchw 6 years ago

    It's kind of funny. These are best practices for running basic run of the mill web services, even something like a forum or personal homepage. Admittedly there's an obvious, massive difference in complexity, but you would expect the gold standard best practices to come from something mission critical like core Internet services and flow down to less critical services, not the other way around.

    • GarvielLoken 6 years ago

      Well it is easy to find time to add gold-plating to a small basically useless service, but those guys are probably swamped or try to cut cost by being agile or something.

      • adrianN 6 years ago

        Cutting costs by fighting fires all the time^W^W^W^W^Wremoving smoke detectors. The classic strategy.

  • jiveturkey 6 years ago

    > How is this the first time this has come up?

    It's not, obviously.

    If one is cynical, it's just a way for the FCC to look like it is doing something. Or, if one attributes great, great rhetorical skill to the FCC, it's their way to lambaste CenturyLink for not even adhering to 101 level principles. I tend to believe the latter.

  • danesparza 6 years ago

    Here is a much more technical and thorough analysis of the incident: https://blog.thousandeyes.com/centurylink-outage-lessons-man...
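
The alarm recommendation quoted at the top of this thread (warn admins when memory or processor use is reaching its peak) can be sketched as a plain threshold check. This is a minimal illustration, not anything from the FCC report or CenturyLink's tooling: the `Threshold` class, the `classify` and `check_node` helpers, and the 80%/95% levels are all assumptions made up for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    """Hypothetical alarm levels, expressed as a fraction of capacity."""
    warn: float
    crit: float


def classify(usage: float, threshold: Threshold) -> str:
    """Map a usage fraction (0.0 to 1.0) to an alarm level."""
    if not 0.0 <= usage <= 1.0:
        raise ValueError(f"usage must be between 0 and 1, got {usage}")
    if usage >= threshold.crit:
        return "critical"
    if usage >= threshold.warn:
        return "warning"
    return "ok"


def check_node(metrics: dict[str, float],
               thresholds: dict[str, Threshold]) -> dict[str, str]:
    """Classify every metric that has a configured threshold."""
    return {
        name: classify(metrics[name], limit)
        for name, limit in thresholds.items()
        if name in metrics
    }


# Assumed levels: warn at 80% of capacity, go critical at 95%.
LIMITS = {"cpu": Threshold(0.80, 0.95), "memory": Threshold(0.80, 0.95)}

if __name__ == "__main__":
    print(check_node({"cpu": 0.97, "memory": 0.85}, LIMITS))
```

The point of the warning level is lead time: an alarm that only fires at 100% arrives when the node controller may already be unreachable.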

walrus01 6 years ago

Network engineer here: clink bridged all of the management controllers on their Infinera DWDM shelves together into one multi-state-sized L2 broadcast domain. Best guess is that it made them easier to SNMP poll and to admin with other management tools.

Within the circle of people who really know what went on, we've been laughing at them for months.

  • ay 6 years ago

    The oddest part was that the FCC recommendations didn’t include “limit the size of your broadcast domain”.

    Large flat L2 are a classic time bomb, with their builders proudly exclaiming “look ma, no hands!” until they painfully get reminded of their mistake :)

  • empath75 6 years ago

    I worked for a regional ISP where almost the entire metro network was a single layer 2 network up to the points of presence. We regularly (as in at least once a week) had spanning tree loops and broadcast storms that took down the entire city.

  • LIV2 6 years ago

    Is there a source or is that a guess at what happened? That sounds immensely incompetent to the point that I find it hard to believe

    • walrus01 6 years ago

      In this particular case, some insider info, and I am also in possession of the RFO CenturyLink sent out for a number of downed 10GbE transport circuits. From the way they described it, a broadcast storm between Infinera node controllers propagated uncontrollably across their entire Infinera chassis fleet in the western US.

  • jlawson 6 years ago

    I know some of these words.
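
Why a large flat L2 domain is a time bomb can be shown with a toy flooding model. The sketch below is an illustration only: it assumes switches that flood a broadcast frame out of every port except the one it arrived on, no spanning tree, and no hop limit in the frame. The `flood` function and the three topologies are invented for the example and are not CenturyLink's actual network.

```python
from collections import Counter


def flood(links, start, steps):
    """Count broadcast copies in flight after each forwarding step.

    links: iterable of (switch_a, switch_b) pairs, treated as bidirectional.
    start: switch that originates a single broadcast frame.
    """
    neighbors = {}
    for a, b in links:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)

    # Each in-flight copy is keyed by (current switch, switch it came from).
    in_flight = Counter({(start, None): 1})
    history = []
    for _ in range(steps):
        next_round = Counter()
        for (node, came_from), copies in in_flight.items():
            for peer in neighbors.get(node, ()):
                if peer != came_from:  # never send back out the ingress port
                    next_round[(peer, node)] += copies
        in_flight = next_round
        history.append(sum(in_flight.values()))
    return history


LINE = [("A", "B"), ("B", "C")]                # loop-free
RING = [("A", "B"), ("B", "C"), ("C", "A")]    # one loop
MESH = [("A", "B"), ("A", "C"), ("A", "D"),
        ("B", "C"), ("B", "D"), ("C", "D")]    # many loops

if __name__ == "__main__":
    for name, links in (("line", LINE), ("ring", RING), ("mesh", MESH)):
        print(name, flood(links, "A", 4))
```

In the loop-free line the frame reaches the far end and disappears. In the ring two copies circulate forever, and in the four-switch mesh the number of copies doubles on every step, because nothing in the frame ever tells a switch to stop forwarding it.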

godelmachine 6 years ago

Page 7 -

>>CenturyLink and Infinera state that, despite an internal investigation, they do not know how or why the malformed packets were generated.

So we still don’t know why the rotten packets were created in the first place?

  • bcaa7f3a8bbc 6 years ago

    I'd bet it's due to a firmware/software bug triggered by a rare condition, or undefined software behavior triggered by a hardware malfunction. If that's true, it means the root cause will probably never be identified, as nobody can reproduce it. It's something pretty scary to think about: we can never guarantee that most software will work correctly all the time, empirical testing is often the only practical assessment, and probabilistic bugs such as mysterious crashes may never be discovered.

    But I think the bigger problem is not the packets, but why the backbone didn't reject those malformed packets.

mkj 6 years ago

What protocol is that? Optional TTL sounds like the really fatal part.

  • sh-run 6 years ago

    Assuming that by

    > 3. no expiration time, meaning that the packet would not be dropped for being created too long ago; and

    they mean the TTL was set to zero.

    From RFC 1812:

    > A router MUST NOT originate or forward a datagram with a Time-to-Live (TTL) value of zero.

    So a packet with TTL=0 should never be on the wire (for example, if a router receives a packet with TTL=1 and it's not destined for that specific router, the packet gets discarded). My guess is the switching vendor had bad code that didn't handle TTL=0.

    • Mathnerd314 6 years ago

      Reading Infinera's network brochure, https://www.infinera.com/wp-content/uploads/Infinera-DTN-X-F..., they are talking about terabit speeds over fiber. I doubt they are using the Internet Protocol or anything close. I mean, they could be (https://en.wikipedia.org/wiki/IPoDWDM), but they have a bunch of different communication protocols going over it. I saw MPLS (https://en.wikipedia.org/wiki/Multiprotocol_Label_Switching) on Twitter and that has a TTL too, but unfortunately the FCC report doesn't go into detail. It's only slightly more informative than the outage report from last year: https://twitter.com/briankrebs/status/1079135599309791235/ph...

      • sh-run 6 years ago

        I agree that MPLS would be used for transport through the Infineras, but the article specifically states that this was caused by management traffic.

        MPLS doesn't have a concept of a broadcast address and wouldn't have been used for management traffic (except maybe during transit). MPLS is really just used to get IP packets to their destination with less L3 overhead. Full disclosure I work in the DC space, not the provider space so I'm far from an expert on MPLS.

        Ethernet famously doesn't have a TTL, so maybe this was just a typical Ethernet broadcast storm. In that case I don't know why TTL would've even been brought up.

        They keep throwing around the word packet, which implies layer 3. Of course lots of people say packet when they mean frame.

        Edit: There is a comment above saying they have an RFO stating this was a broadcast storm. So it was probably Ethernet and CenturyLink brought up TTL as a way to blame the protocol.

    • ajsnigrutin 6 years ago

      This could be a problem

      Usually the lowest TTL on the wire is '1' - the next router then subtracts 1, the value is zero, and the packet is dropped on the same router (and icmp sent back).

      If someone didn't put in an additional if() to check for this, it could cause many problems, especially with broadcasts. And why would they check, if no device normally sends out packets like this (unless some other device is also missing the if() check, or someone sent those packets on purpose)?

  • jlgaddis 6 years ago

    Ethernet.
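
The TTL discussion in this subthread can be made concrete with a small sketch. It assumes an 8-bit, IP-style TTL field and contrasts a forwarding check that guards against TTL=0 on arrival with a naive one that only tests the value after decrementing. Both functions are hypothetical illustrations of the "missing if()" idea; the thread itself ends up concluding the traffic was probably Ethernet, which has no TTL at all, so this is not a claim about what the Infinera gear actually did.

```python
from enum import Enum


class Action(Enum):
    DELIVER = "deliver"            # packet is addressed to this router
    FORWARD = "forward"
    DROP_EXPIRED = "drop_expired"  # TTL ran out; ICMP Time Exceeded would be sent


def forward_guarded(ttl: int, for_me: bool) -> tuple[Action, int]:
    """Forwarding decision that never lets TTL=0 back onto the wire."""
    if for_me:
        return Action.DELIVER, ttl
    if ttl == 0:                   # the additional if(): expired on arrival
        return Action.DROP_EXPIRED, 0
    ttl -= 1
    if ttl == 0:                   # the usual case: arrived with TTL=1
        return Action.DROP_EXPIRED, 0
    return Action.FORWARD, ttl


def forward_naive(ttl: int, for_me: bool) -> tuple[Action, int]:
    """Same logic without the guard, using 8-bit unsigned arithmetic."""
    if for_me:
        return Action.DELIVER, ttl
    ttl = (ttl - 1) & 0xFF         # 0 - 1 wraps around to 255
    if ttl == 0:
        return Action.DROP_EXPIRED, 0
    return Action.FORWARD, ttl


if __name__ == "__main__":
    for ttl in (64, 1, 0):
        print(ttl, forward_guarded(ttl, False), forward_naive(ttl, False))
```

With the guard, a packet that arrives with TTL=0 is dropped. Without it, the unsigned decrement wraps and the packet is forwarded with a fresh TTL of 255, which is the kind of behavior that only shows up when some other device has already broken the rules.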

person_of_color 6 years ago

Is this a broadcast storm?

  • exabrial 6 years ago

    Yes. Combined with several amplification bugs, it became a perfect [broadcast] storm.

awat 6 years ago

From what I’ve read a lot of the reporting on this seems to use frame and packet interchangeably.

  • dlgeek 6 years ago

    There was a footnote in the report about that:

    > In the Bureau’s discussions with Infinera, Infinera used the term “packet” to describe what some experts refer to as Ethernet frames that are sent between nodes. For the sake of simplicity, this report uses the term “packet.”

lightgreen 6 years ago

The correct title is: how a misconfigured CenturyLink network broke when a rotten packet arrived.

The current title makes it sound like a packet failure, which it was not. It was only a matter of time until this problem occurred; hardware must be resilient to malformed input.

  • dang 6 years ago

    We can remove that ambiguity by de-baiting the title and taking out "rotten".
