MCC (Master Chief Collection) Server Incident Summary

halowaypoint.com

120 points by DivisionSol 4 years ago · 34 comments

mrguyorama 4 years ago

The most interesting part of this to me is that they have a path, both in code and in process, to fallback to peer-to-peer if stuff breaks. That's pretty impressive.

  • thatguy0900 4 years ago

    Now that big game companies are really starting to shut down old servers en masse it should be the default, really. https://www.gamespot.com/articles/ubisoft-shuts-down-online-...

  • jasomill 4 years ago

    Also interesting and impressive: Halo multiplayer fans have modded the original Xbox version of Halo to act as a dedicated server for Xbox LAN multiplayer[1] that serves the same architectural role as 343's UHS does for MCC online play.

    [1] http://halo1nhe.com

  • zymhan 4 years ago

    Now I'm very curious how the P2P matchmaking is bootstrapped.

    • AgentME 4 years ago

      The clients still talk to the official servers which run matchmaking and group up players together. The difference is whether the matchmaking servers tell all the players to connect to a dedicated gameserver or to connect to one of the players.
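A minimal sketch of the decision described above, assuming a matchmaking service that has already grouped the players. The names `Player`, `MatchAssignment`, and `assign_match`, and the rule of picking the first player as host, are hypothetical illustrations and not 343's actual implementation:

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Player:
    name: str
    address: str  # "ip:port" as seen by the matchmaking service (assumed format)


@dataclass
class MatchAssignment:
    mode: str           # "dedicated" or "peer"
    host_address: str   # where every client is told to connect
    players: List[str]


def assign_match(players: List[Player], dedicated_server: Optional[str]) -> MatchAssignment:
    """Tell a matched group where to connect.

    If a dedicated game server is available, everyone connects to it.
    Otherwise fall back to peer-to-peer and pick one player as host.
    """
    if not players:
        raise ValueError("cannot assign a match with no players")
    names = [p.name for p in players]
    if dedicated_server is not None:
        return MatchAssignment("dedicated", dedicated_server, names)
    # Hypothetical host selection: simply take the first player in the group.
    host = players[0]
    return MatchAssignment("peer", host.address, names)


group = [Player("alice", "203.0.113.5:3074"), Player("bob", "198.51.100.7:3074")]
normal = assign_match(group, "192.0.2.10:30000")
fallback = assign_match(group, None)
```

In both cases the clients talk to the same matchmaking service first; only the address they are handed afterwards differs.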

xmodem 4 years ago

I know we normally never hear about this stuff from game companies at all, but it would be nice to have a little bit more detail.

> Finally, it was identified that there had been some updates behind the scenes to the servers we use to relay STUN traffic as part of the ICE process which had resulted in misconfigurations.

What updates? How were they tested (or not)?

  • Darkphibre 4 years ago

    The thing with using cloud services is that sometimes the hosting provider can make changes to configurations that have impacts (such as monitoring software that steals precious CPU at seemingly random moments, or network topology hardware that impacts connectivity)... without it necessarily being under your own control. I've had a few high-visibility incidents where, after investigation, the buck stopped with us even though technically the studio could have shifted blame to an unanticipated change from the hosting provider.

    Disclaimer: I currently work at Microsoft Game Studios, though this comment reflects my own opinions based on experiences unrelated to this incident.

    • xmodem 4 years ago

      Of course, that's totally understandable - I recently root caused an incident to "AWS likely changed how this component behaves at some point in a particular 2 month timespan."

      But I think there's a lot more to be learned by sharing how STUN changed, what the new behaviour is, what the intent of the change was, how it was tested, etc.

      For a counter-example of the level of detail I'd like to see, I saw this [1] DataDog incident report go by on Twitter this morning. This is straight up awesome and more detailed than most of our internal incident reports. I definitely learned a lot from reading it.

      1: https://www.datadoghq.com/blog/engineering/grpc-dns-and-load...

ozarker 4 years ago

Really cool to see this level of transparency on an issue from a multiplayer game dev. Really cool write up

  • stryan 4 years ago

    Both Bungie and 343 have done an admirable job (well, compared to other devs) about explaining their network infrastructure etc. Back in the day they did a big talk about how their matchmaking in Halo2/3 worked that I think to this day is still one of the best methods of learning when you're not in the industry yet. I can't recall what it was called though: might be the "Chris Butcher - Recreating the LAN Party Online: The Networking and Social Infrastructure of Halo 2" GS talk but I can't listen right now to check

auto 4 years ago

Given that the error resulted from the STUN and ICE servers, which from my understanding exist solely to play a part in the NAT punching process, would this entire situation have been mitigated if things were end-to-end IPv6?

  • pilif 4 years ago

    Only in theory. In practice even when IPv6 is in use, people have stateful firewalls that will drop unsolicited connection attempts.

    IPv4 has UPnP and NAT-PMP, which are widely supported in routers. For IPv6 there are protocols that let clients reconfigure the router with ephemeral firewall rules, but they are not widespread and support is very spotty.

    So in practice, users with just IPv6 would have the exact same problems and would be even more likely to depend on STUN and ICE because their firewalls likely won’t support client-side hole-punching correctly
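For readers unfamiliar with the STUN traffic discussed in this thread, here is a rough sketch of the smallest STUN message, a Binding Request, which a client sends so a server can report the public address it sees. The header layout (message type, length, magic cookie, 12-byte transaction ID) follows the STUN specification; the helper names `build_binding_request` and `parse_header` are illustrative, and nothing here reflects how MCC itself uses STUN:

```python
import os
import struct

STUN_BINDING_REQUEST = 0x0001   # message type for a Binding Request
STUN_MAGIC_COOKIE = 0x2112A442  # fixed value required in every STUN header


def build_binding_request(transaction_id=None):
    """Build a 20-byte STUN Binding Request with no attributes."""
    if transaction_id is None:
        transaction_id = os.urandom(12)
    if len(transaction_id) != 12:
        raise ValueError("transaction ID must be exactly 12 bytes")
    # Header: type (2 bytes), attribute length (2 bytes), magic cookie (4 bytes),
    # all in network byte order, followed by the transaction ID.
    header = struct.pack("!HHI", STUN_BINDING_REQUEST, 0, STUN_MAGIC_COOKIE)
    return header + transaction_id


def parse_header(data):
    """Return (message_type, attribute_length, transaction_id) from a STUN message."""
    if len(data) < 20:
        raise ValueError("STUN header is 20 bytes")
    msg_type, length, cookie = struct.unpack("!HHI", data[:8])
    if cookie != STUN_MAGIC_COOKIE:
        raise ValueError("not a STUN message: bad magic cookie")
    return msg_type, length, data[8:20]


tid = bytes(range(12))
request = build_binding_request(tid)
parsed = parse_header(request)
```

The request would normally be sent over UDP to a STUN server, and the reply carries the client's address as seen from outside. A stateful firewall that drops unsolicited inbound packets still blocks a peer even when that address is a globally routable IPv6 one, which is the point made above.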

xeromal 4 years ago

Wow, that was a fun read! I don't envy the people who had to stare at wireshark logs for 3 days though. Oof.

gundmc 4 years ago

I don't recall seeing this sort of post-mortem from a gaming provider before. Really cool to see! Kudos, Halo team and Microsoft!

BaconPackets 4 years ago

The root cause is REALLY surprising. If it's really an unrelated change to the NAT/STUN relay server, it means that there was a pretty broad lack of change management framework.

darknavi 4 years ago

Interesting that they call their service UDS. I was under the impression that they used PlayFab.

  • tehbeard 4 years ago

    MCC probably has an "interesting" architecture as it's a combination of Halo games that spans a few generations that makes something like playfab a bit too restrictive to work in.

  • sgtfrankieboy 4 years ago

    Halo: The Master Chief Collection had already been out for 4 years (released 2014) when MSFT bought PlayFab (2018).

wyldfire 4 years ago

So if I wanted to refer to a group of them, would they be called 'Masters Chief'?

  • seizethegdgap 4 years ago

    I don't believe so. The full E-9 "title" in the US Navy is Master Chief Petty Officer, so the plural would be Master Chief Petty Officers.

verall 4 years ago

Anyone else having issues viewing this on Firefox? I can see the whole webpage for an instant, then everything disappears, then it is "slowing down my browser".
