Ask HN: Anyone else having bitflips in Azure Network traffic?

39 points by klaas- 2 years ago · 12 comments · 1 min read


It seems we are having bit flips inside of packets coming into Azure (Germany West Central) via our Express Route. Anyone else seeing this?

dharmab 2 years ago

I did back in ~2017 or so while I was working at $BIGAZURECUSTOMER. It was tough getting a hold of the right person in support, so we pulled some backchannel strings. Turns out an update to their hardware had disabled ECC on some switches.

klaas-OP 2 years ago

So it seems this is their software-defined network: the packets still look good when entering via the Microsoft edge routers, and they get bit flips on their way into our Express Route.

  • klaas-OP 2 years ago

    After a full day of searching for a possible cause, someone from Microsoft suggested moving the Express Route gateways (which are seemingly VMs inside of Azure) to a different hypervisor. For now it seems the first gateway was flipping bits, and on new hardware it is running normally again... That was a very interesting day, troubleshooting-wise.

shrubble 2 years ago

Can you do Wireshark-style packet captures somewhere that show it? Very likely that is what support would ask for.

  • klaas-OP 2 years ago

    Yeah, that's what we did. We captured it at the last place in our network (a router) and then compared the TCP payloads to the packets that we received in Azure on our firewall. The next step was to get dumps from as many points in between as possible; it seems those are only the Azure edge router and our Express Route gateway. After looking at the pcap from the Azure edge myself, I verified that the packet is still okay there. MS support verified it's broken at the Express Route gateway, so it has to be flipped somewhere inside their Azure Germany network stack. They are still searching where ... :)
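A minimal sketch of the payload comparison described above, assuming the TCP payload of the same packet has already been extracted from both captures as raw bytes. The function name `find_bit_flips` and the sample payload are made up for illustration; this is not the tooling used in the thread.

```python
def find_bit_flips(sent: bytes, received: bytes):
    """Return (byte_offset, bit_index, sent_bit, received_bit) for each differing bit."""
    if len(sent) != len(received):
        # A length change means truncation or reassembly trouble, not a plain bit flip.
        raise ValueError("payload lengths differ")
    flips = []
    for offset, (a, b) in enumerate(zip(sent, received)):
        diff = a ^ b  # set bits mark the positions that changed
        for bit in range(8):
            if diff & (1 << bit):
                flips.append((offset, bit, (a >> bit) & 1, (b >> bit) & 1))
    return flips


# Illustrative data only: flip bit 2 of byte 4 in a copy of the payload.
sent = b"hello world"
received = bytearray(sent)
received[4] ^= 0x04
flips = find_bit_flips(sent, bytes(received))
print(flips)
```

Running the same comparison at each capture point narrows down the hop where the first flip appears.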

    • shrubble 2 years ago

      The header of the pcap should show the MAC addresses of the devices, so were you able to determine, using a MAC OUI lookup, who makes the hardware that is not working right? The Express Route gateway would have to be directly connected to see the needed MAC, however.
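As a rough sketch of the OUI idea: the OUI is the first three octets of the MAC address, and bit 1 of the first octet marks a locally administered address, which has no registered vendor to look up. The helper names are hypothetical, the sample addresses are placeholders rather than values from any capture, and the actual vendor lookup against the IEEE registry is left out.

```python
def parse_mac(mac: str) -> bytes:
    """Parse 'aa:bb:cc:dd:ee:ff' or 'aa-bb-cc-dd-ee-ff' into 6 raw bytes."""
    octets = bytes(int(part, 16) for part in mac.replace("-", ":").split(":"))
    if len(octets) != 6:
        raise ValueError(f"not a 48-bit MAC address: {mac!r}")
    return octets


def oui(mac: str) -> str:
    """Return the first three octets, the part a vendor lookup is keyed on."""
    return ":".join(f"{b:02X}" for b in parse_mac(mac)[:3])


def is_locally_administered(mac: str) -> bool:
    """True if the U/L bit (0x02 in the first octet) is set."""
    return bool(parse_mac(mac)[0] & 0x02)


# Placeholder addresses for illustration only.
print(oui("00:00:5e:00:53:01"), is_locally_administered("00:00:5e:00:53:01"))
print(oui("02:00:00:00:00:01"), is_locally_administered("02:00:00:00:00:01"))
```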

      • klaas-OP 2 years ago

        so my guess would be there are more than a few devices between the edge and the express route gateway (which is more or less a VM I was told). So I am guessing right now someone is trying to figure out what hardware is involved in between the two points we captured and looking at that stuff :) For me as a customer that's just a big black box (cloud)

klaas-OP 2 years ago

I also have a follow-up to this topic. It looks like nobody in Azure is verifying TCP checksums along this path. A bit flip in the TCP payload should have been noticed at the NIC/TCP level, but it reaches my application as corrupted data.
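To illustrate why a flipped payload bit should be caught, here is a minimal sketch of the TCP checksum over IPv4: the 16-bit ones' complement sum from RFC 1071, computed over a pseudo-header plus the segment. A segment carrying a valid checksum verifies to 0, and a single flipped bit always changes the result. The addresses, ports, and payload are made-up values, and this says nothing about where Azure does or does not verify checksums.

```python
import struct


def internet_checksum(data: bytes) -> int:
    """16-bit ones' complement checksum (RFC 1071)."""
    if len(data) % 2:
        data += b"\x00"  # pad to a whole number of 16-bit words
    total = sum(word for (word,) in struct.iter_unpack("!H", data))
    while total >> 16:  # fold carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def tcp_checksum(src_ip: bytes, dst_ip: bytes, segment: bytes) -> int:
    """Checksum over the IPv4 pseudo-header (protocol 6 = TCP) plus the TCP segment."""
    pseudo_header = src_ip + dst_ip + struct.pack("!BBH", 0, 6, len(segment))
    return internet_checksum(pseudo_header + segment)


# Made-up example segment: 20-byte TCP header with a zeroed checksum field, then payload.
src, dst = bytes([10, 0, 0, 1]), bytes([10, 0, 0, 2])
header = struct.pack("!HHIIBBHHH", 12345, 443, 1, 0, 5 << 4, 0x18, 65535, 0, 0)
segment = header + b"example payload"

# Sender side: fill in the checksum field (bytes 16-17 of the TCP header).
csum = tcp_checksum(src, dst, segment)
segment = segment[:16] + struct.pack("!H", csum) + segment[18:]

# Receiver side: an intact segment verifies to 0.
intact_result = tcp_checksum(src, dst, segment)

# Flip one payload bit in transit: verification no longer yields 0.
corrupted = bytearray(segment)
corrupted[25] ^= 0x01
corrupted_result = tcp_checksum(src, dst, bytes(corrupted))

print(intact_result, corrupted_result != 0)
```

If corrupted data still reaches the application, then either the checksum was not verified on the receive path or it was recomputed somewhere after the corruption happened; the capture comparison alone cannot distinguish the two.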

johnwheeler 2 years ago

Like from neutrinos?

  • dharmab 2 years ago

    A much more common source of bit flips is heat. If you run a desktop computer with more than about 32GB of RAM, you probably have bit flips from the heat alone, which is part of why DDR5 adds a weaker form of error correction as standard.

  • toast0 2 years ago

    Bitflips in network packets are often stuck bits in networking equipment. Replacing the offending equipment tends to be easy, once it's identified. But you can only do so much when you have no control.
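One way to check for a stuck-bit pattern, sketched under assumptions: if a bit in some buffer or lane is stuck, flips collected from many packets would tend to share the same bit index, the same direction (always read as 0 or always as 1), and a recurring byte offset modulo some width. The `stride` of 8 bytes is a guess that would need tuning to the real hardware, and the observations below are invented for illustration.

```python
from collections import Counter


def summarize_flips(flips, stride=8):
    """Group flips by (byte_offset % stride, bit_index, received_bit), most frequent first.

    flips: iterable of (byte_offset, bit_index, received_bit) tuples.
    stride: assumed width in bytes of the suspect buffer or lane (hypothetical).
    """
    counts = Counter(
        (offset % stride, bit, received_bit) for offset, bit, received_bit in flips
    )
    return counts.most_common()


# Invented observations: three flips land on the same lane, bit, and direction.
observed = [(12, 3, 1), (20, 3, 1), (1044, 3, 1), (7, 0, 0)]
summary = summarize_flips(observed)
print(summary)
```

A single dominant group points toward one faulty component; flips spread evenly across positions and directions point elsewhere.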
