Playing Battleships over BGP (2018)

blog.benjojo.co.uk

124 points by tcard 4 years ago · 24 comments

throw0101a 4 years ago

> For a protocol that was produced on two napkins in 1989 [...]

I'm not sure I'd want to deal with a protocol that can't be explained on a napkin or two. UTF-8 was designed on a diner placemat:

* https://www.cl.cam.ac.uk/~mgk25/ucs/utf-8-history.txt

benjojo12 4 years ago

Author of the post here. Ask me almost anything, I guess(?)

  • kjrose 4 years ago

    If you could replace BGP globally, instantly, with no problems, what would you replace it with?

    • benjojo12 4 years ago

      (Keeping in mind that replacing BGP is about as hard as replacing SMTP, and thus might not be worth it)

      Honestly, the issue that exists with BGP is not the protocol. The issue is one of trust, and that is not a problem a different protocol would instantly fix.

      One issue with the internet as a whole is that seemingly simple questions are actually hard. The one that is slowly being fixed with RPKI is "Who actually owns this IP address?" Knowing this, we can build better filters against direct (origin AS != owner AS) hijacks.
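
      A minimal Python sketch of the origin check described above, in the style of RPKI route origin validation (RFC 6811). The `Roa` record, the `validate_origin` function, and the sample prefixes and AS numbers (taken from the documentation ranges) are illustrative assumptions, not something from the post or from any real validator.

```python
import ipaddress
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Roa:
    """Hypothetical, simplified Route Origin Authorization record."""
    prefix: str      # prefix the holder has signed for, e.g. "192.0.2.0/24"
    max_length: int  # longest prefix length the holder allows to be announced
    asn: int         # AS number authorized to originate the prefix


def validate_origin(route_prefix: str, origin_asn: int, roas: Iterable[Roa]) -> str:
    """Return "valid", "invalid", or "not-found" for one announcement."""
    route = ipaddress.ip_network(route_prefix)
    covered = False
    for roa in roas:
        roa_net = ipaddress.ip_network(roa.prefix)
        # subnet_of() raises TypeError on mixed IPv4/IPv6, so skip those.
        if route.version != roa_net.version:
            continue
        if not route.subnet_of(roa_net):
            continue
        covered = True  # at least one ROA speaks for this address space
        if route.prefixlen <= roa.max_length and origin_asn == roa.asn:
            return "valid"
    # Covered by a ROA but no match: origin AS != owner AS, or too specific.
    return "invalid" if covered else "not-found"


roas = [Roa("192.0.2.0/24", 24, 64500)]
print(validate_origin("192.0.2.0/24", 64500, roas))     # valid
print(validate_origin("192.0.2.0/24", 64501, roas))     # invalid
print(validate_origin("198.51.100.0/24", 64500, roas))  # not-found
```

      Note that this only answers the ownership question. It says nothing about who is allowed to carry the route along the path.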

      However, the next question, which has no solution yet, is "Who is allowed to carry this route/transit this data?" This is going to be unbelievably hard to solve with certainty. There is a question of whether a PKI solution could be deployed (BGPSEC). However, you will also hit the next issue.

      The BGP table is massive: 1M+ routes, stored on machines with reasonably long lifetimes. It does not help that, in terms of computing power, these machines are in general very slow. A multi-Tbit/s router may only have a 2014-era laptop CPU powering it. So computing anything 1M times quickly is a massive ask, and when links go down, it is reasonable to expect fast recompute/reconvergence times.

      Fixing BGP is not an easy issue. Anyone who tells you otherwise is either fraudulent or does not understand the sheer scale/scope of the issues attached to the protocol.

      • convolvatron 4 years ago

        it is if you relax the constraint that the providers keep the legacy allocations and can advertise whatever the hell they want

        Steve Deering had a really nice proposal on geographic addressing that would make pki sufficiently performant by using hierarchical assignments

      • makeworld 4 years ago

        Have you seen Yggdrasil? It provides an alternate routing idea, among other things.

        https://yggdrasil-network.github.io/

    • pyvpx 4 years ago

      and keep IPv[4|6]?

  • scratchadams 4 years ago

    no question, just a selfish request for more blog posts please

  • bmsleight_ 4 years ago

    Considering the events yesterday, how do you test non-live?

    • tg180 4 years ago

      Maybe dn42.eu?

      > Experiment with routing technology

      > Participating in dn42 is primarily useful for learning routing technologies such as BGP, using a reasonably large network (> 1500 AS, > 1700 prefixes).

      > Since dn42 is very similar to the Internet, it can be used as a hands-on testing ground for new ideas, or simply to learn real networking stuff that you probably can't do on the Internet (BGP multihoming, transit). The biggest advantage when compared to the Internet: if you break something in the network, you won't have any big network operator yelling angrily at you.

    • benjojo12 4 years ago

      Who said I tested non-live?

      The actual beta builds/sanity checks were done just with two VMs peered with each other, but the live internet one was done in one take (and never again, at least by me)

      • benjojo12 4 years ago

        To add on, BGP very much has a "meme" status of being scary and dangerous, where any touching will break YouTube, etc. [Mostly perpetuated by infosec circles]

        It's really not the 2000s anymore; BGP is mostly safe and filtered. There are still improvements to be made (I've even written on the blog about them), but one person's immense fuck ups are far less likely to cause issues now that IRR filters and prefix limits exist.
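
        A toy Python sketch of how the two safeguards mentioned above work together. The `PeerSession` class, its exact-match filter, and the way the limit is counted are simplifying assumptions for illustration only. Real routers build filters from IRR data with vendor-specific tooling and differ in how they count and enforce limits.

```python
import ipaddress


class PeerSession:
    """Toy model of one BGP session with an IRR-style filter and a prefix limit."""

    def __init__(self, allowed_prefixes, max_prefixes):
        # Prefixes the peer has registered (assumed to come from IRR data).
        self.allowed = {ipaddress.ip_network(p) for p in allowed_prefixes}
        self.max_prefixes = max_prefixes
        self.received = set()  # every distinct prefix the peer has sent
        self.accepted = set()  # prefixes that passed the filter
        self.up = True

    def announce(self, prefix):
        if not self.up:
            return "session-down"
        net = ipaddress.ip_network(prefix)
        self.received.add(net)
        # Prefix limit: a peer sending far more than expected is probably
        # leaking routes, so drop the session and everything learned on it.
        if len(self.received) > self.max_prefixes:
            self.up = False
            self.accepted.clear()
            return "session-down"
        # IRR-style filter: only registered prefixes are accepted.
        if net not in self.allowed:
            return "filtered"
        self.accepted.add(net)
        return "accepted"


session = PeerSession(["192.0.2.0/24", "198.51.100.0/24"], max_prefixes=3)
print(session.announce("192.0.2.0/24"))     # accepted
print(session.announce("203.0.113.0/24"))   # filtered
print(session.announce("198.51.100.0/24"))  # accepted
print(session.announce("10.0.0.0/8"))       # session-down
```

        The filter stops a single bad announcement; the limit is the backstop when a peer suddenly sends far more than it should.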

        • JadeNB 4 years ago

          > It's really not the 2000s anymore; BGP is mostly safe and filtered. There are still improvements to be made (I've even written on the blog about them), but one person's immense fuck ups are far less likely to cause issues now that IRR filters and prefix limits exist.

          Any non-maliciously designed protocol probably can be used safely, but surely yesterday's events show that it is still eminently possible to use BGP dangerously?

          • benjojo12 4 years ago

            What part of yesterday was showing that it was possible to use BGP dangerously?

            If you are certain of this argument, then your master electric switch is dangerous because you could switch off the power to your house.

            • JadeNB 4 years ago

              > If you are certain of this argument, then your master electric switch is dangerous because you could switch off the power to your house.

              This seems like a response to an argument I haven't made yet! (All else aside, if you prebut my argument, it allows me not to make that argument.)

              Sure, it's possible to do dangerous things with BGP; that alone is not why I say it's possible to use it dangerously. What is dangerous is the fact that a small and apparently innocent change can have such far-reaching consequences—for example, I'll bet there was no serious consideration at Facebook of not being able to open electronic door locks in the case of an apparently innocent BGP update.

              I don't consider my master electric switch dangerous because I could switch off the power to my house. I would consider it dangerous if, after switching off the power to my house, I was ejected from my house, and could no longer open the doors of my house to get in and switch the power back on.

              • HideousKojima 4 years ago

                >for example, I'll bet there was no serious consideration at Facebook of not being able to open electronic door locks in the case of an apparently innocent BGP update.

                If that was actually the case, a lot of heads at FB should roll over this. The logic is simple and obvious, and if the sysadmins and network admins didn't follow this line of thinking then they're overpaid:

                1) Our door control system is accessed via a public IP/address, not via an internal/private address.

                2) Accessing our public IPs/addresses is dependent on BGP and DNS not getting borked.
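
                The two-step reasoning above amounts to a transitive dependency check. Here is a minimal Python sketch; the service names and the dependency map are entirely hypothetical and only mirror the premise of this comment, not any known Facebook architecture.

```python
def depends_on(deps, service, target, _seen=None):
    """Return True if `service` transitively depends on `target`."""
    seen = set() if _seen is None else _seen
    for dep in deps.get(service, ()):
        if dep == target:
            return True
        if dep not in seen:
            seen.add(dep)  # guard against cycles in the map
            if depends_on(deps, dep, target, seen):
                return True
    return False


# Hypothetical dependency map: each service lists what it needs to work.
deps = {
    "door-control": ["public-dns", "public-ip"],
    "public-dns": ["bgp"],
    "public-ip": ["bgp"],
    "oob-console": ["second-isp"],
}

print(depends_on(deps, "door-control", "bgp"))  # True
print(depends_on(deps, "oob-console", "bgp"))   # False
```

                Anything needed to recover from a BGP failure should come out `False` in a check like this.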

                • JadeNB 4 years ago

                  > If that was actually the case, a lot of heads at FB should roll over this. The logic is simple and obvious, and if the sysadmins and network admins didn't think about this line of thinking then they're overpaid:

                  You say "if" as if it's a conditional, but surely the fact that it happened proves that no-one considered it (or, I suppose, that whoever did consider it didn't have enough sway to stop it from happening). There are, rightly, so many laws and regulations requiring that safe egress in case of emergency not be prevented, and I can't imagine anyone actually considering and tolerating even the slightest risk of an Internet issue preventing that egress. Well, I guess I can imagine lots of people doing lots of awful and harmful things, but I can't imagine anyone doing it in such a way that it would be this easy to get caught doing something blatantly illegal.

                  (Or were there safety measures in place that allowed egress, just not entry? I don't know the specifics, since my source is just the news stories that mention that the door locks didn't work and people couldn't get in—but maybe they could still get out?)

          • HideousKojima 4 years ago

            What happened yesterday was (appears to be) Facebook screwing up their own routing and DNS, not anyone else's. They didn't take down routing for any IPs and domains they didn't own. I can't imagine any other protocol making a mistake like FB's impossible.

      • INTPenis 4 years ago

        So how should these problems be mitigated? Have separate infrastructure for critical services or staging BGP or what?

        • midasuni 4 years ago

          It seems that the main problem Facebook had in restoring service was the lack of a completely separate out-of-band management network.

          If my network (way smaller than FB's, but with a way lower budget) goes down, I can get in via another ISP and WireGuard into the OOB network, which is completely separate from the in-band management.

          Not every access switch is on OOB, but the core ones and a few critical devices are.

efitz 4 years ago

This was very cool and also IMO very irresponsible.

dt3ft 4 years ago

So this is why Facebook went down, eh? ;)
