Famous outages along with deep postmortems?

24 points by fizwhiz 4 years ago · 12 comments · 1 min read


I just finished reading the Roblox outage postmortem (trending on HN: https://blog.roblox.com/2022/01/roblox-return-to-service-10-28-10-31-2021/). I learned a lot from it and found myself thinking "surely there are other well-written postmortems that are doozies". Does anyone know a good resource where I can find a compilation of such outages? If they're classified/tagged, then even better :)

gtsteve 4 years ago

If you'd like to go beyond technical postmortems, I've found the IAEA reports on radiological accidents, etc., to make very interesting technical reading. Abstractly, there is often something to be learned about having a good playbook to deal with specific events.

https://www.iaea.org/topics/accident-reports

gfd 4 years ago

https://github.com/danluu/post-mortems

liketochill 4 years ago

There are links on this page [0] to proper reports on the cause of the 2003 blackout in eastern North America (apologies to the Maritimes and Quebec).

0: https://www.ieso.ca/en/Corporate-IESO/Media/Also-of-Interest...

warrenm 4 years ago

Citizen Lab is known to have good reporting: https://citizenlab.ca/2022/01/cross-country-exposure-analysi...

warrenm 4 years ago

Check out anything from the NTSB!

OUTLANDISHLY-GOOD reporting: https://www.ntsb.gov/investigations/AccidentReports/Pages/Re...

tommydoesntknow 4 years ago

Microsoft was about to spend $500 million on a blitz ad campaign called Five Nines, for 99.999% uptime re NT 5, was it? 2002ish.

They crashed the microsoft.com cluster only days before, sending certain accepted metrics re: uptime from 99.999% to 97.312%.

The cluster crash was caused by some errant out-of-band JavaScript being published to a live MSCOM cluster. The postmortem was not in-depth, it was a cremation. Burn and hide the body.

I thought it odd that all those involved ended up at AWS shortly thereafter, including the executive whose head rolled right out of 1 Microsoft Way.

Those involved owe Dave Cutler an apology, with or without the conspiracy intact.

romanhn 4 years ago

Not a compilation, but here are a couple of interesting postmortems I recall from my time at PagerDuty: https://www.pagerduty.com/blog/the-discovery-of-apache-zooke... https://status.pagerduty.com/incidents/v2vrgccbtgxn

The second one was a 30+ hour outage that forced us to really up our incident management game.

gadders 4 years ago

I did start a site where I was logging what I could at outagereports.net, but I let it lapse when cPanel fees shot up. Dan Luu's site referenced below is a good source [1]. I think there is also a big list on GitHub somewhere of all k8s-caused outages.

[1] https://github.com/danluu/post-mortems

OmarAssadi 4 years ago

Perhaps not famous, but Bryan Cantrill, who gives my favorite talks, has an interesting and funny talk on one of the Joyent outages: https://youtu.be/30jNsCVLpAE

mooreds 4 years ago

I love the AWS postmortems for their simplicity, timestamps, and insight into internal AWS systems: https://aws.amazon.com/premiumsupport/technology/pes/

warrenm 4 years ago

Project Zero's exploit reporting is fantastically in-depth: https://googleprojectzero.blogspot.com

kapilvt 4 years ago

https://k8s.af
