Famous outages along with deep postmortems?
I just finished reading the Roblox outage postmortem (trending on HN: https://blog.roblox.com/2022/01/roblox-return-to-service-10-28-10-31-2021/). I learned a lot reading about it and found myself thinking "surely there are other well-written postmortems that are doozies". Does anyone know a good resource where I can find a compilation of such outages? If they're classified/tagged then even better :)

If you'd like to go beyond technical postmortems, I've found the IAEA reports on radiological accidents etc. to make very interesting technical reading. Abstractly, there is often something to be learned about having a good playbook to deal with specific events.

There are links on this page [0] to proper reports on the cause of the 2003 blackout in eastern North America (apologies to the Maritimes and Quebec).
[0]: https://www.ieso.ca/en/Corporate-IESO/Media/Also-of-Interest...

Citizen Lab is known to have good reporting: https://citizenlab.ca/2022/01/cross-country-exposure-analysi...

Check out anything from the NTSB! OUTLANDISHLY-GOOD reporting: https://www.ntsb.gov/investigations/AccidentReports/Pages/Re...

Microsoft was about to spend $500 million on a blitz ad campaign called Five Nines, for 99.999% uptime re NT 5, was it? 2002ish. They crashed the microsoft.com cluster only days before, sending certain accepted metrics re: uptime from 99.999% to 97.312%. The cluster crash was caused by some errant out-of-band JavaScript being published to a live MSCOM cluster. The postmortem was not in-depth, it was a cremation. Burn and hide the body. I thought it odd that all those involved ended up at AWS shortly thereafter, including the executive whose head rolled right out of 1 Microsoft Way. Those involved owe Dave Cutler an apology, with or without the conspiracy intact.

Not a compilation, but here are a couple of interesting postmortems I recall from my time at PagerDuty:
https://www.pagerduty.com/blog/the-discovery-of-apache-zooke...
https://status.pagerduty.com/incidents/v2vrgccbtgxn
The second one was a 30+ hour outage that forced us to really up our incident management game.

I did start a site where I was logging what I could at outagereports.net, but I let it lapse when cpanel fees shot up. Dan Luu's site referenced below is a good source [1]. I think also there is a big list on github somewhere of all k8s-caused outages.

Perhaps not famous, but Bryan Cantrill, who gives my favorite talks, has an interesting and funny talk on one of the Joyent outages: https://youtu.be/30jNsCVLpAE

I love the AWS postmortems for their simplicity, timestamps, and insight into internal AWS systems: https://aws.amazon.com/premiumsupport/technology/pes/

Project Zero's exploit reporting is fantastically in-depth: https://googleprojectzero.blogspot.com