Reddit Releases Post Mortem for Its 3 Hour Outage Last Week
old.reddit.com
> In the 1.20 series, Kubernetes changed its terminology from “master” to “control-plane.” And in 1.24, they removed references to “master,” even from running clusters. This is the cause of our outage. Kubernetes node labels.
Wow, so the word police brought down Reddit. Why on earth did someone think it a good idea to screw with existing names in running clusters in a cluster management tool?
I'm sure it was worth making breaking changes to make Kubernetes more "inclusive"... whatever that means. As if Kubernetes ever excluded anybody, or "master" referred to slavery in any way.
These terminology changes date back to the George Floyd protests, and instead of getting action to solve the actual problem, we got blog posts from GitHub about changing the default branch name, so they can feel like they're doing something and elevate their brand.
IIRC this shift was actually much older and started as a pre-Gamergate ideological proxy war. There were even various projects parodying wokeness in tech at the time to comment on it. For example, C+=, a C++ derivative that had more "inclusive" keywords:
https://github.com/TheFeministSoftwareFoundation/C-plus-Equa...
It’s good to see how increasingly fewer people feel.
The word police seem to have gone kinda quiet lately.
I wonder if they realised that banning a few words wasn't really helping their cause.
Those bullshit positions are the first to go in any kind of recession or downturn, that's the most likely explanation.
Or they don't see the point in arguing with people who are obviously against any sort of progressive or inclusive adjustments to society.
I want a progressive, inclusive society. I want a society that argues and fights for actual change, not just changing word usage. I bristle at the “master vs. main” debate not because I’m racist, but because it distracts from real impactful change. Every second spent arguing over whether “master” has meaning outside a slavery context is a second NOT spent on expanding educational equity or progressive taxation. The people who are benefiting from the status quo want us fighting over the dictionary.
Their main home seemed to be Twitter.
I wonder if perhaps Twitter's new ownership (and decrease in moderation) has impacted their activities? I wonder if perhaps it was an effort led by Twitter employees because that sort of thing leads to greater use of Twitter?
Was that BLM/ANTIFA HQ? The building will lose value because of all the woke air inside.
They've had since December 2020 to update their cluster, and the breaking-ness in 1.24 is called out in a section titled 'Urgent Update Notes'[0], and subtitled 'No, really, you MUST read this before you upgrade'.
So by 'word police' you mean 'admins who didn't bother to read the release notes for the last two years and just deployed straight to production while ignoring the release notes'.
Whatever your politics, breaking changes happen. Not reading the release notes and checking to see if anything affects you is just incompetence.
[0] https://github.com/kubernetes/kubernetes/blob/master/CHANGEL...
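For anyone wanting to check their own configs before upgrading, here is a rough sketch of the kind of scan that would flag this. It assumes your manifests are already parsed into nested dicts/lists (e.g. from YAML); the helper name `find_legacy_label` and the sample manifest are made up for illustration, and only the two label strings are the real Kubernetes names.

```python
# Hypothetical helper (not part of any Kubernetes tooling): report where a
# parsed manifest still mentions the legacy "master" node label.
OLD_LABEL = "node-role.kubernetes.io/master"
NEW_LABEL = "node-role.kubernetes.io/control-plane"


def find_legacy_label(obj, path=""):
    """Return the paths inside nested dicts/lists that mention OLD_LABEL."""
    hits = []
    if isinstance(obj, dict):
        for key, value in obj.items():
            child = f"{path}.{key}" if path else str(key)
            # Labels usually appear as dict keys (nodeSelector, labels).
            if OLD_LABEL in str(key):
                hits.append(child)
            hits.extend(find_legacy_label(value, child))
    elif isinstance(obj, list):
        for i, item in enumerate(obj):
            hits.extend(find_legacy_label(item, f"{path}[{i}]"))
    elif isinstance(obj, str) and OLD_LABEL in obj:
        # They can also appear as string values (selector expressions, taint keys).
        hits.append(path)
    return hits


# Made-up manifest fragment, for illustration only.
example = {
    "kind": "DaemonSet",
    "spec": {"template": {"spec": {
        "nodeSelector": {OLD_LABEL: ""},
        "tolerations": [{"key": NEW_LABEL, "effect": "NoSchedule"}],
    }}},
}

for hit in find_legacy_label(example):
    print(f"legacy label referenced at: {hit}")
```

This only catches what lives in files you feed it; selectors held in live cluster objects or third-party operator configs would need a separate check.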
I laughed out loud when I got to that point. Well deserved.
I started laughing, but I'm not as far from crying as I would like to be.
Ironic.
3 hours, IPO not impacted. People who don’t value inclusion identified. Win-win. Hope you’re enjoying Tucker Carlson (alone, likely) tonight.
Imagine if Cisco or Juniper decided to swap master or remove slave from their code. Core routers going down because of word police terminology and an admin who missed it in the change log.
Sounds more like admins need to read over change logs better and properly test updates on dev environments before just blindly updating systems. Features get deprecated, APIs update, crypto algorithms get dropped, it's entirely on the admin to ensure an update will actually work with existing code and systems.
On a very critical system, I wanted to use a newer Python module that fixed a very annoying bug in the much older version we were running. Of course the module required a much newer version of Python too. I upgraded everything and found that a function in a built-in module I had been using had been removed entirely, which was very bad since it was used all over my code. I ended up writing my own module that maps the deprecated function onto the new, proper way of doing what I needed, with only a simple change to my import statements. If I had just properly read over the change logs and run the update on a dev system, I wouldn't have had any downtime, since I could have made the fix early.
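A minimal sketch of that kind of shim, to make the idea concrete. The function involved isn't named above, so this uses `time.clock` (removed in Python 3.8 in favour of `time.perf_counter`) purely as a stand-in; in practice the function would live in its own small module (a hypothetical `compat.py`).

```python
# Sketch of a compatibility shim, e.g. saved as a hypothetical compat.py.
# time.clock is only an illustrative stand-in for "a removed stdlib function".
import time


def clock():
    """Drop-in replacement for the removed time.clock(); returns float seconds."""
    legacy = getattr(time, "clock", None)
    if legacy is not None:
        # Old interpreter: the original function still exists, so defer to it.
        return legacy()
    # New interpreter: route the call to the documented replacement.
    return time.perf_counter()


# Call sites keep calling clock(); only the import line changes, e.g.
#   from time import clock   ->   from compat import clock
start = clock()
elapsed = clock() - start
print(f"elapsed: {elapsed:.6f}s")
```

The appeal is that the fix stays in one place, but it is still something you want to discover on a dev box rather than in production.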
This can be one reason to run the control plane not on k8s itself. When the control plane runs on k8s you can get these weird states where the control plane is borked and the system cannot recover.
Back when we built our own Kubernetes distribution around the Kube 1.6 era I had to fight really hard with our architect to let me run the control plane with systemd instead of within Kube. The extra nodes were considered to be “a waste of resources”.
But in the five or so years we ran that distro the control plane didn’t fail once. Posts like this make me glad I pushed for it.
Technically it already runs kinda “outside of the loop” using static/mirrored pods so it doesn’t go through scheduler assignment/kcm reconciliation loop. If they ran their reflectors that way it probably wouldn’t happen
I always find this sort of dogfooding to be academically clever, but operationally risky.
I appreciate the transparency and detail in publishing this. With that said, the narrative style and wordy, casual language make it harder to get to the meat (the five whys) than a typical postmortem.
The intended audience is probably a mix of engineers and regular Reddit users, hence the more casual tone.
This is a pretty funny "bug". Bring down those Nazi Kubernetes nodes. There's some humor in there somewhere... making a change to be inclusive results in Reddit going offline... mmmm.
I'm still waiting for people to rename "white paper".
314 minutes is not three hours.
It's 3.14 metric hours, close enough.
I know here on HN I shouldn't, but I quite enjoyed this comment ...
Sorry. I realized it a minute after I posted :(