Reddit Releases Post Mortem for Its 3 Hour Outage Last Week

old.reddit.com

109 points by Rebles 3 years ago · 31 comments

mwint 3 years ago

> In the 1.20 series, Kubernetes changed its terminology from “master” to “control-plane.” And in 1.24, they removed references to “master,” even from running clusters. This is the cause of our outage. Kubernetes node labels.

Wow, so the word police brought down Reddit. Why on earth did someone think it a good idea to screw with existing names in running clusters in a cluster management tool?

  • c7DJTLrn 3 years ago

    I'm sure it was worth making breaking changes to make Kubernetes more "inclusive"... whatever that means. As if Kubernetes ever excluded anybody, or "master" referred to slavery in any way.

    • tapoxi 3 years ago

      These terminology changes date back to the George Floyd protests, and instead of getting action to solve the actual problem, we got blog posts from GitHub about changing the default branch name, so they can feel like they're doing something and elevate their brand.

  • londons_explore 3 years ago

    The word police seem to have gone kinda quiet lately.

    I wonder if they realised that banning a few words wasn't really helping their cause.

    • Nextgrid 3 years ago

      Those bullshit positions are the first to go in any kind of recession or downturn, that's the most likely explanation.

      • danudey 3 years ago

        Or they don't see the point in arguing with people who are obviously against any sort of progressive or inclusive adjustments to society.

        • PraetorianGourd 3 years ago

          I want a progressive, inclusive society. I want a society that argues and fights for actual change, not just changing word usage. I bristle at the “master vs. main” debate not because I’m racist, but because it distracts from real impactful change. Every second spent arguing over whether “master” has meaning outside a slavery context is a second NOT spent on expanding educational equity or progressive taxation. The people who are benefiting from the status quo want us fighting over the dictionary.

    • londons_explore 3 years ago

      Their main home seemed to be twitter.

      I wonder if perhaps twitter's new ownership (and decrease in moderation) has impacted their activities? I wonder if perhaps it was an effort led by twitter employees because that sort of thing leads to greater use of twitter?

      • testbjjl 3 years ago

        Was that BLM/ANTIFA HQ? The building will lose value because of all the woke air inside.

  • danudey 3 years ago

    They've had since December 2020 to update their cluster, and the breaking-ness in 1.24 is called out in a section titled 'Urgent Update Notes'[0], and subtitled 'No, really, you MUST read this before you upgrade'.

    So by 'word police' you mean 'admins who didn't bother to read the release notes for the last two years and just deployed straight to production while ignoring the release notes'.

    Whatever your politics, breaking changes happen. Not reading the release notes and checking to see if anything affects you is just incompetence.

    [0] https://github.com/kubernetes/kubernetes/blob/master/CHANGEL...
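As a rough illustration of the pre-upgrade check described above, here is a minimal sketch that scans manifest text for the old node label. The label keys `node-role.kubernetes.io/master` and `node-role.kubernetes.io/control-plane` come from the Kubernetes documentation, not from this thread; the function name and the sample manifest are hypothetical, and a real audit would also need to cover live cluster objects, not just files.

```python
# Label keys as documented by Kubernetes; not quoted anywhere in this thread.
DEPRECATED_LABEL = "node-role.kubernetes.io/master"
REPLACEMENT_LABEL = "node-role.kubernetes.io/control-plane"


def find_deprecated_label_refs(manifest_text):
    """Return (line_number, stripped_line) for each line mentioning the old label."""
    hits = []
    for number, line in enumerate(manifest_text.splitlines(), start=1):
        if DEPRECATED_LABEL in line:
            hits.append((number, line.strip()))
    return hits


# Hypothetical manifest fragment, for illustration only.
SAMPLE = """\
spec:
  nodeSelector:
    node-role.kubernetes.io/master: ""
  tolerations:
    - key: node-role.kubernetes.io/control-plane
      effect: NoSchedule
"""

for number, line in find_deprecated_label_refs(SAMPLE):
    print(f"line {number}: {line} (consider {REPLACEMENT_LABEL})")
```

A plain substring scan like this only flags candidates; whether each hit actually breaks after the upgrade still has to be checked by hand.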

  • hoseja 3 years ago

    I laughed out loud when I got to that point. Well deserved.

  • lordloki 3 years ago

    Ironic.

    • testbjjl 3 years ago

      3 hours, IPO not impacted. People who don’t value inclusion identified. Win-win. Hope you’re enjoying Tucker Carlson (alone, likely) tonight.

post_break 3 years ago

Imagine if Cisco or Juniper decided to swap master or remove slave from their code. Core routers going down because of word police terminology and an admin who missed it in the change log.

  • wildzzz 3 years ago

    Sounds more like admins need to read over change logs better and properly test updates on dev environments before just blindly updating systems. Features get deprecated, APIs update, crypto algorithms get dropped, it's entirely on the admin to ensure an update will actually work with existing code and systems.

    On a very critical system, I wanted to use a newer Python module that fixed a very annoying bug in the much older version we were running. Of course, the module required a much newer version of Python too. I upgraded everything and found that a function in a built-in module I had been using was entirely deprecated, which was very bad since it was used all over my code. I ended up writing my own module that mapped the deprecated function onto the new, proper way of doing what I needed, with only a simple change to my import statements. If I had just properly read over the change logs and run the update on a dev system, I wouldn't have had any downtime, since I could have made the fix early.
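The comment above does not say which function or module was involved, so the following is only a sketch of the general shim technique under an assumed example: `time.clock()`, which was removed in Python 3.8, stood in for by `time.perf_counter()`. The names `make_compat_time` and `compat_time` are hypothetical, and `perf_counter` is not an exact behavioral match for the old `clock` on every platform.

```python
import time
import types


def make_compat_time():
    """Build a module-like object that mirrors `time` and restores `clock`."""
    compat = types.ModuleType("compat_time")
    # Re-export every public name so callers only have to change their import.
    for name in dir(time):
        if not name.startswith("_"):
            setattr(compat, name, getattr(time, name))
    # Assumption: perf_counter is an acceptable substitute for the removed clock().
    if not hasattr(compat, "clock"):
        compat.clock = time.perf_counter
    return compat


compat_time = make_compat_time()

# Existing call sites keep working once they use the shim instead of `time`.
start = compat_time.clock()
compat_time.sleep(0.01)
elapsed = compat_time.clock() - start
print(f"elapsed: {elapsed:.3f}s")
```

The appeal of this approach is that call sites stay untouched; the cost is that the shim hides a semantic difference that still deserves a proper migration later.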

__turbobrew__ 3 years ago

This can be one reason to run the control plane not on k8s itself. When the control plane runs on k8s you can get these weird states where the control plane is borked and the system cannot recover.

  • cpressland 3 years ago

    Back when we built our own Kubernetes distribution around the Kube 1.6 era I had to fight really hard with our architect to let me run the control plane with systemd instead of within Kube. The extra nodes were considered to be “a waste of resources”.

    But in the five or so years we ran that distro the control plane didn’t fail once. Posts like this make me glad I pushed for it.

    • dilyevsky 3 years ago

      Technically it already runs kinda “outside of the loop” using static/mirrored pods, so it doesn’t go through scheduler assignment or the kcm reconciliation loop. If they had run their reflectors that way, it probably wouldn’t have happened.

  • dehrmann 3 years ago

    I always find this sort of dogfooding to be academically clever, but operationally risky.

gundmc 3 years ago

I appreciate the transparency and detail in publishing this. With that said, the narrative style and wordy, casual language make it harder to get to the meat (the five whys) than in a typical postmortem.

  • mynameisvlad 3 years ago

    The intended audience is probably a mix of engineers and regular Reddit users, hence the more casual tone.

ethicalsmacker 3 years ago

This is a pretty funny "bug". Bring down those Nazi Kubernetes nodes. There's some humor in there somewhere... making a change to be inclusive results in Reddit going offline... mmmm.

I'm still waiting for people to rename "white paper".

hoseja 3 years ago

314 minutes is not three hours.
