Ten things not to worry about regarding oncall

Did you know that Wednesday Wisdom is also a podcast? Find it on Apple Podcasts or on Spotify.

Because I am an idiot, I recently volunteered to refactor the oncall situation of my team. It is the kind of thankless organizational task that I gravitate toward, either because I think that I need to atone for something I did wrong (lapsed Catholic here) or because I feel a pressing urge to “take one for the team.” But there was an upside: The effort did give me a nice opportunity to ponder the many problems I have experienced over the decades while being oncall or setting up oncall rotations. I figured that for this week’s edition of this classic article series, I would neatly organize the top ten things not to worry about when dealing with oncall.

The unmistakable number one entry has to be that you should not worry if you feel that you do not know everything there is to know about every service that you are oncall for. Nobody ever does, and if that were the bar, nobody would ever qualify to be oncall for anything. To ease the minds of my junior colleagues a bit, I typically tell them that the task of the oncall engineer is to pick up the phone, give it their best shot, and then escalate to someone else, who will then rinse and repeat that exact same procedure. I have been oncall for many things that I know Jacks From Sheets about and I am still alive 🙂.

At number two, we have the number one complaint from people being oncall, which is that the documentation sucks. Yes, it does, and don’t worry about it, because that is always the case, will always be the case, and there is literally nothing you are going to do about it. All documentation, including the runbooks, is always bad, missing, incomplete, wrong, outdated, or just plain fantasy. No amount of effort you are going to spend will change that. Instead of expecting documentation, learn how to target your AI programming tool of choice (might I recommend Codex?) at the source code of the offending system and let it tell you what is going on and what you should be doing. Personally, I like writing documentation and I like to think that my documentation is in a decent state, but it never survives first contact with one of my colleagues because it turns out that apparently not everyone has the cdrtools installed and at least one of my runbooks assumes that.

Coming in hot at number three is the fact that there is no perfect oncall schedule and neither is there a great scheduling tool. All schedules are suboptimal and even if you carefully fill your calendar with “no-oncall” entries and teach your professional or homegrown constraint solver to honor these, you will end up with a schedule that is hugely inconvenient to everyone. Nothing to be done about it so don’t worry; just extend your fixed-order schedule and let people swap shifts and enter overrides as they see fit. At the end of the day, all oncall schedules run on spreadsheets. So don’t worry that you cannot create the perfect schedule. Nobody can, because it cannot be done. Also, don’t worry about whether your oncall shifts need to start on Monday, Tuesday, or Wednesday; every choice has good and bad aspects. Or if you are in a team that wants to split the oncall into one shift from Monday to Thursday and another shift starting Friday and including the weekend: Great, whatever, it does not matter, but if that is the particular bikeshed you want to paint, go for it.
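For what it is worth, the fixed-order-plus-swaps-and-overrides approach fits in a few lines of Python. This is a minimal sketch, not a real scheduling tool: the function names `build_schedule` and `swap`, the engineer names, and the default one-week shift length are all assumptions made up for the illustration.

```python
from datetime import date, timedelta
from itertools import cycle


def build_schedule(engineers, first_shift_start, num_shifts, shift_days=7, overrides=None):
    """Return a list of (shift_start, engineer) pairs in date order.

    engineers: names in their fixed rotation order (hypothetical input).
    overrides: optional dict mapping a shift start date to a replacement name.
    """
    overrides = overrides or {}
    rotation = cycle(engineers)
    schedule = []
    for i in range(num_shifts):
        start = first_shift_start + timedelta(days=i * shift_days)
        scheduled = next(rotation)  # the rotation advances even when overridden
        schedule.append((start, overrides.get(start, scheduled)))
    return schedule


def swap(schedule, start_a, start_b):
    """Return a new schedule with the engineers of two shifts exchanged."""
    by_start = dict(schedule)
    by_start[start_a], by_start[start_b] = by_start[start_b], by_start[start_a]
    return sorted(by_start.items())
```

Calling `build_schedule(["ann", "bob", "cy"], date(2024, 1, 1), 4)` extends the rotation by four weekly shifts; an override replaces one person for one shift without disturbing the order for everybody else, which is about as much cleverness as a schedule deserves.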

At number four, a strong contender for a higher spot in this top ten: All alerts are terrible. They are too strict, too loose, or resolve themselves before you can even open the alerting dashboard. They fill up your Slack channel or make your phone ring off the hook. On top of that, there are always significant gaps in coverage.

Anecdote: I once was oncall for YouTube when the phone rang and someone who announced herself as the CEO’s admin reported that YouTube uploads were broken because she had uploaded a video an hour ago and it was not live yet. “Surely,” we thought, “if uploads are broken, our pager will be ringing off the hook.” But, because this was Eric Schmidt’s admin, we thought we’d take a look. The dashboard showing the number of files uploaded per minute showed a healthy number. The dashboard containing the number of transcoding errors showed an equally healthy zero. One problem though: The graph of the number of transcodes started also showed zero. It turned out that there was a problem with the job that moved files from the upload bucket to the transcoder bucket and there was no alert on that value reaching zero.
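The missing check in that story can be sketched as a comparison between adjacent pipeline stages: a stage that processed nothing while the stage feeding it was busy is suspicious. The function name `stalled_stages`, the stage names, and the idea of passing per-window counts as a list are assumptions for illustration; this is not how the actual YouTube monitoring worked.

```python
def stalled_stages(stage_counts):
    """Return names of stages that saw zero events while their upstream stage was busy.

    stage_counts: list of (stage_name, events_in_window) pairs in pipeline order.
    """
    stalled = []
    # Pair every stage with the stage directly upstream of it.
    for (_, upstream_count), (name, count) in zip(stage_counts, stage_counts[1:]):
        if upstream_count > 0 and count == 0:
            stalled.append(name)
    return stalled
```

With healthy uploads and zero transcodes started, only the transcode stage is reported; stages further downstream stay quiet because their upstream is idle too, which points the oncall at the first broken link instead of at everything behind it.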

While we are at it, another anecdote: In one team I was on, we did have an alert for zero; in this case for the number of recently committed change lists merged into the production branch (that was regularly pushed to, well, production). One fine Boxing Day my pager rang because the merge bot had not merged a single change from the trunk into the production branch for the last 24 hours. After careful investigation, it turned out that this was because exactly zero change lists had been merged into the trunk in the last 24 hours, on account of it being Christmas. Apparently, this had never happened before.
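One way to avoid the Boxing Day page is to alert on how long work has been waiting rather than on how much work got done. The sketch below assumes you can obtain the commit timestamps of change lists that are on the trunk but not yet on the production branch; the function name `merge_bot_is_stuck` and the 24-hour default are hypothetical, picked to match the anecdote.

```python
from datetime import datetime, timedelta


def merge_bot_is_stuck(unmerged_commit_times, now, max_lag=timedelta(hours=24)):
    """Return True if any trunk commit has waited longer than max_lag to be merged.

    unmerged_commit_times: commit timestamps of changes not yet on the production branch.
    An empty backlog (say, on Christmas) never fires, no matter how quiet the bot was.
    """
    return any(now - committed > max_lag for committed in unmerged_commit_times)
```

The design choice is that "nothing to do" and "not doing anything" look identical in a throughput graph, but only the second one produces a backlog that keeps getting older.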

Writing good alerts is as difficult as writing good software, but whereas software gets a design doc, sprints, code reviews, tests, and sometimes even quality assurance, alerts seemingly are written on a lazy Friday afternoon by an intern on fentanyl. For big events, Twitter is usually a better alert source than anything else you might cobble together. While oncall for YouTube, I always got informed about big outages by my non-Googler friends, through a chat or text message, before any alerts fired.

But then again, one day when I was oncall for Google Maps, a colleague paged me because his friend in Uzbekistan couldn’t reach Google Maps, so surely Maps was down? I kindly suggested this might be an ISP-related problem in Uzbekistan.

In short: All alerts are terrible so don’t worry if yours are too…

On top of sucky alerts, closely following at number five: All metrics are terrible. They are missing, measuring the wrong thing, measuring the thing wrongly, or not in the unit that you think they are in. Literally every metric that I ever dove into, down to the source code, had something wrong with it. Even something as simple as the request count or the time spent executing the request is typically measured at the wrong place in the request handler or in the wrong way entirely. But don’t worry, because, again, there is nothing to be done about it; it is you against an ocean of time series and, as sailors know, the ocean always wins (or does it?).
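To make "measured at the wrong place" a bit more concrete: a common mistake is to record the duration on the last line of the handler, so that requests that throw are never measured at all. Here is a minimal sketch of wrapping the whole handler instead. The decorator name `timed` and the `record` callback are assumptions for illustration and do not correspond to any particular metrics library.

```python
import time
from functools import wraps


def timed(record):
    """Decorator that reports wall-clock seconds for every call, including failed ones.

    record: hypothetical callback taking (handler_name, seconds).
    """
    def decorator(handler):
        @wraps(handler)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return handler(*args, **kwargs)
            finally:
                # Runs on success and on exceptions, so failures are measured too.
                record(handler.__name__, time.perf_counter() - start)
        return wrapper
    return decorator
```

Note that the unit (seconds) is spelled out in the docstring, and that this still only measures time spent inside the handler; queueing in front of it remains invisible, which is rather the point of this entry.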

Surely, you might think, we can put a project together to fix the alerts and the metrics. And surely, you are completely right, theoretically… Practically speaking, few teams ever get that together because there is always something more important to do and the pain of the sucky alerts and metrics is nicely spread out among all members of the oncall rotation, thereby ensuring that it never reaches anyone’s pain threshold. There once was this one service at some employer for which I was the sole oncall 24x7, 52 weeks per year and believe me, those alerts were tuned to never fire unless all hell had broken loose.

Getting to the lower half of this top ten, right there at number six is the worry that during your oncall week you are getting nothing else done because of constant interruptions. This is most likely true, so don’t worry about it, as you should only worry about things that might happen, not about things that are going to happen. But it’s okay; the organization should be fine with this because they have chosen to afford their velocity on the back of your nervous system. Now, to be honest, in the best oncall rotations I have been in, I spent at most twenty percent of my time on oncall-related tasks, but these are the exceptions, not the norm. When your oncall load is unseasonably low, that is more often than not caused by the fact that the SMEs are silently picking up a lot of the operational work around “their” services instead of letting it fall to you. So nice of them…

Getting to number 7: Let go of your worry and fear that you will look stupid during an incident. Incidents are not IQ contests. They are debugging problems. Ask the obvious questions, verify the assumptions. As the oncall engineer, I expect you to perform an honest debugging effort, from first principles if need be. Pages represent a software reliability problem, not a moral defect in the unlucky person holding the pager. As the holder of that pager you are the first witness to the things your colleagues did wrong. It is not your fault…

Unless it is of course 🙂.

Reaching the nether regions at number 8: There is no need to worry that you won’t be able to go anywhere during your oncall week, because it is usually no longer true, provided that you take some measures and don’t act stupid. In the olden days, when I got paged, I had to either solve the problem by phone or jump in the car, drive 45 minutes to the data center, hope security would let me in, fire up my 3270 terminal, and start solving the problem.

These days you can be oncall from anywhere provided you have a decent laptop and cell phone. I have been oncall while on a plane (I don’t recommend it), from the tropical backyard of my aunt’s place in Spain (I do recommend that), and while staying over after a successful Tinder date (I highly recommend that because: Unlimited brownie points if you get paged during the night and you are rescuing a well-known Internet service from your date’s kitchen table; I am obviously not speaking from experience 😉). If you want to go to the cinema, no problem but be prepared to step out, which would suck but hasn’t happened to me thus far. I wouldn’t necessarily go to a rave because conditions there might preclude me from hearing/feeling the pager, but I guess it can be done, if done right.

Almost at the bottom at number 9 we find this common worry: I will look dumb if I ask for help. No, you won’t! Escalating is not a weakness; it is part of doing the job responsibly. Good teams want you to do that. Here is the proof: I recently asked my colleagues to write down their number one, two, and three tips for new engineers on the team who are about to go oncall. More than half of them wrote things like: “You are not alone!” or “Ask for help!” Every seasoned oncall engineer who is also a normal human being remembers what it was like when they were first oncall. No seasoned oncall engineer emerged from the womb knowing the dashboards, the failure modes, the service dependencies, and the bizarre historical reasons why this one restart script lives in some forgotten directory. Everyone who is good at oncall was bad at it once. They learned by doing it, and by asking for help, which is exactly what you should do.

Then finally at number 10: Not a worry but just some plain old advice: Oncall teaches judgment faster than almost anything else. It might be uncomfortable at first, but it is one of the quickest ways to learn how systems really behave and which assumptions are fake. The pager is a harsh teacher, but a very effective one. One of the reasons why I still like being oncall after many years in the trenches is that nothing teaches me the ins and outs of a new system faster than being forced to look into some problem. You don’t have to solve it fast; I sometimes spend days debugging a simple Slack question, but at the end of that I know things about the system that almost nobody else knows.

Don’t be afraid. Yes, the pager is where reality comes to collect and reality is often rude. But remember: The pager is testing the system, not your worth as a person…