How four packets broke CenturyLink's network
theregister.co.uk> As to what can be done to prevent similar failures, the FCC is recommending CenturyLink and other backbone providers take some basic steps, such as disabling unused features on network equipment, installing and maintaining alarms that warn admins when memory or processor use is reaching its peak, and having backup procedures in the event networking gear becomes unreachable.
Disabling unused services? Alarms when nearing resource limits? Contingency plans? How is this the first time this has come up?! These are like security & devops 101.
It's kind of funny. These are best practices for running basic, run-of-the-mill web services, even something like a forum or personal homepage. Admittedly there's an obvious, massive difference in complexity, but you would expect the gold-standard best practices to come from something mission-critical like core Internet services and flow down to less critical services, not the other way around.
Well, it is easy to find time to add gold-plating to a small, basically useless service, but those guys are probably swamped, or trying to cut costs by being agile or something.
Cutting costs by fighting fires all the time^W^W^W^W^Wremoving smoke detectors. The classic strategy.
> How is this the first time this has come up?
It's not, obviously.
If one is cynical, it's just a way for the FCC to look like it is doing something. Or, if one attributes great, great rhetorical skill to the FCC, it's their way to lambaste CenturyLink for not even adhering to 101 level principles. I tend to believe the latter.
Here is a much more technical and thorough analysis of the incident: https://blog.thousandeyes.com/centurylink-outage-lessons-man...
Network engineer here: CenturyLink bridged all of the management controllers on their Infinera DWDM shelves together into one multi-state-sized L2 broadcast domain. Best guess is because it made them easier to SNMP poll and to run other management tools to admin them.
Within the circle of people who really know what went on, we've been laughing at them for months.
The oddest part was that the FCC recommendations didn’t include “limit the size of your broadcast domain”.
Large flat L2 are a classic time bomb, with their builders proudly exclaiming “look ma, no hands!” until they painfully get reminded of their mistake :)
I worked for a regional ISP where almost the entire metro network was a single layer 2 network all the way up to the points of presence. We regularly (as in at least once a week) had spanning tree loops and broadcast storms that took down the entire city.
Is there a source, or is that a guess at what happened? That sounds immensely incompetent, to the point that I find it hard to believe.
In this particular case, some insider info; I am also in possession of the RFO CenturyLink sent out for a number of downed 10GbE transport circuits. From the way they described it, a broadcast storm between Infinera node controllers propagated uncontrollably across their entire Infinera chassis fleet in the western US.
I know some of these words.
Page 7:
> CenturyLink and Infinera state that, despite an internal investigation, they do not know how or why the malformed packets were generated.
So we still don’t know why the rotten packets were created in the first place?
I'd bet it's due to a firmware/software bug triggered by a rare condition, or undefined software behavior triggered by a hardware malfunction. If that's true, the root cause will probably never be identified, since nobody can reproduce it. It's pretty scary to think about: we can never guarantee that most software will work correctly all the time, empirical testing is often the only practical assessment, and probabilistic bugs such as mysterious crashes may never be found.
But I think the bigger problem is not the packets themselves, but why the backbone didn't reject those malformed packets.
What protocol is that? Optional TTL sounds like the really fatal part.
Assuming that by
> 3. no expiration time, meaning that the packet would not be dropped for being created too long ago; and
they mean the TTL was set to zero.
From RFC 1812:
> A router MUST NOT originate or forward a datagram with a Time-to-Live (TTL) value of zero.
So a packet with TTL=0 should never be on the wire. (For example, if a router receives a packet with TTL=1 and it's not destined for that specific router, it gets discarded.) My guess is the switching vendor had bad code that didn't handle TTL=0.
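To illustrate the rule (a minimal Python sketch with made-up function and field names, not a claim about what Infinera's or anyone's forwarding code actually does): a compliant forwarding path effectively treats TTL<=1 as "do not forward", which is exactly the kind of check that is easy to get wrong if you only ever test for TTL==1.

    # Minimal sketch of RFC 1812-style TTL handling in a forwarding path.
    # Names are invented for illustration only.

    ICMP_TIME_EXCEEDED = "ICMP Time Exceeded"

    def handle(packet: dict, destined_for_this_router: bool) -> str:
        """Decide what a compliant router does with a packet based on its TTL."""
        if destined_for_this_router:
            return "deliver locally"  # TTL is irrelevant at the final hop
        if packet["ttl"] <= 1:
            # Forwarding would mean putting a TTL=0 datagram on the wire
            # (or the TTL already is 0), which RFC 1812 forbids.
            return f"drop, send {ICMP_TIME_EXCEEDED}"
        packet["ttl"] -= 1
        return "forward"

    print(handle({"ttl": 1}, destined_for_this_router=False))  # drop, send ICMP Time Exceeded
    print(handle({"ttl": 0}, destined_for_this_router=False))  # also dropped, thanks to <=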
Reading Infinera's network brochure, https://www.infinera.com/wp-content/uploads/Infinera-DTN-X-F..., they are talking about terabit speeds over fiber. I doubt they are using the Internet Protocol or anything close. I mean, they could be (https://en.wikipedia.org/wiki/IPoDWDM), but they have a bunch of different communication protocols going over it. I saw MPLS (https://en.wikipedia.org/wiki/Multiprotocol_Label_Switching) on Twitter and that has a TTL too, but unfortunately the FCC report doesn't go into detail. It's only slightly more informative than the outage report from last year: https://twitter.com/briankrebs/status/1079135599309791235/ph...
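For anyone wondering what "MPLS has a TTL too" refers to: an MPLS label stack entry is a 32-bit word with an 8-bit TTL in the low bits (field layout per RFC 3032). A quick illustrative parser, unrelated to anything in the report:

    # Parse one MPLS label stack entry: 20-bit label, 3-bit traffic class
    # (originally "Exp"), 1-bit bottom-of-stack flag, 8-bit TTL.

    def parse_mpls_entry(word: int) -> dict:
        return {
            "label": (word >> 12) & 0xFFFFF,  # top 20 bits
            "tc":    (word >> 9)  & 0x7,      # 3-bit traffic class
            "s":     (word >> 8)  & 0x1,      # bottom-of-stack flag
            "ttl":    word        & 0xFF,     # 8-bit TTL, decremented per hop
        }

    print(parse_mpls_entry(0x0001F1FF))  # label 31, tc 0, s 1, ttl 255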
I agree that MPLS would be used for transport through the Infineras, but the article specifically states that this was caused by management traffic.
MPLS doesn't have a concept of a broadcast address and wouldn't have been used for management traffic (except maybe during transit). MPLS is really just used to get IP packets to their destination with less L3 overhead. Full disclosure: I work in the DC space, not the provider space, so I'm far from an expert on MPLS.
Ethernet famously doesn't have a TTL, so maybe this was just a typical Ethernet broadcast storm. In that case I don't know why TTL would've even been brought up.
They keep throwing around the word packet, which implies layer 3. Of course lots of people say packet when they mean frame.
Edit: There is a comment above saying they have an RFO stating this was a broadcast storm. So it was probably Ethernet and CenturyLink brought up TTL as a way to blame the protocol.
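For anyone who hasn't seen a broadcast storm up close, here's a toy simulation (four hypothetical switches in a full mesh, nothing to do with CenturyLink's real topology): because Ethernet has no hop limit, flooded copies of a single broadcast frame circulate forever and multiply wherever there's a redundant path.

    # Toy broadcast storm: each switch floods a broadcast frame out every
    # port except the one it arrived on. With redundant links and no
    # TTL-like field to age frames out, the copy count grows every hop.
    # Purely illustrative topology.

    links = {
        "A": ["B", "C", "D"],
        "B": ["A", "C", "D"],
        "C": ["A", "B", "D"],
        "D": ["A", "B", "C"],
    }

    # Each in-flight frame is (switch it is at, switch it arrived from).
    frames = [("A", None)]  # a single broadcast frame injected at switch A

    for hop in range(1, 8):
        next_frames = []
        for switch, came_from in frames:
            for neighbour in links[switch]:
                if neighbour != came_from:  # flood everywhere except ingress
                    next_frames.append((neighbour, switch))
        frames = next_frames
        print(f"hop {hop}: {len(frames)} copies in flight")

    # Output roughly doubles each hop (3, 6, 12, 24, ...), which is why
    # loop prevention and storm control matter on flat L2 networks.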
This could be a problem
Usually the lowest TTL on the wire is '1': the next router subtracts 1, the value hits zero, and the packet is dropped at that router (and an ICMP Time Exceeded is sent back).
If someone didn't put in an additional if() to check for TTL=0, this could cause many problems, especially with broadcasts. And why would they check, when no device normally sends out packets like this in the first place (unless some other implementation skipped the same check, or someone sent those packets on purpose)?
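To make that hypothetical concrete: a routine that only ever checks for TTL==1 happily passes a zero-TTL packet along, and if the decrement happens on an unsigned byte it wraps around to 255. This is a made-up illustration of the guess above, not anyone's actual code:

    # Hypothetical buggy handler: it only expects TTL values >= 1, so a
    # packet arriving with TTL=0 slips past the check and wraps to 255
    # when decremented as an unsigned 8-bit field. Illustrative only.

    def buggy_forward(ttl: int) -> tuple[str, int]:
        if ttl == 1:
            return ("drop, send ICMP Time Exceeded", 0)
        new_ttl = (ttl - 1) % 256  # unsigned 8-bit decrement
        return ("forward", new_ttl)

    print(buggy_forward(0))  # ('forward', 255) -- the packet lives on indefinitely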
Ethernet.
Is this a broadcast storm?
Yes; combined with several amplification bugs, it became a perfect [broadcast] storm.
From what I’ve read, a lot of the reporting on this seems to use "frame" and "packet" interchangeably.
There was a footnote in the report about that:
> In the Bureau’s discussions with Infinera, Infinera used the term “packet” to describe what some experts refer to as Ethernet frames that are sent between nodes. For the sake of simplicity, this report uses the term “packet.”
Correct title is: how a misconfigured CenturyLink network broke when a rotten packet arrived.
This title makes it sound like it was a packet failure, when it was not. It was only a matter of time until this problem occurred; hardware must be resilient to malformed input.
We can remove that ambiguity by de-baiting the title and taking out "rotten".