Ask HN: For those on call, how often are you called?

22 points by debunn 6 years ago · 31 comments · 1 min read

My current role as an IT Operations Engineer has recently forced me to join an on-call rotation, which, when I'm on primary support, pages me 5-7 times per week after hours (so literally daily).

I've been in various IT administration, development and DevOps positions for the last 20 years with differing "on call" responsibilities, and have never had anything as intrusive as this.

Getting to the point - my current manager says that getting paged every day of your primary support shift is "normal in the industry for operations". While this definitely doesn't match my personal experience - I'm curious: do any of you in technical support roles with "on call" responsibilities get paged this frequently? If not, what does a "normal" shift look like for you?

Thanks kindly for any feedback!

kylek 6 years ago

Worked at a FAANG, 5-7 was peanuts for the rotation I was on there. The interesting thing (I don't know if I liked it or not) was that when you're on call, that's all you do (even during normal hours, that is), no "normal" work/projects during that time (which relieves a giant burden for everyone NOT on call). At the end of the rotation, there is a proper hand-off to the next on call; every issue that came up is reviewed and a plan put in place to fix it "for good" (meaning a backlog task gets created and assigned to someone during the next sprint planning). If there's no planning to root-cause and fix the underlying problems, run.

Niksko 6 years ago

This is super interesting. So you had a high number of pages, but then you also had a really clearly defined and sensible sounding way of dealing with the root causes of the pages?
If you're constantly fixing the things causing you to get pages, why are there still many more than one per day? Just prioritisation of other work over fixes?
We have a similar system, though we have one person on after hours support doing normal work during the day, and one person during the day who doesn't do normal work. That person works on remediating the issues that cause people to get paged. Leads to a pretty low number of pages.
- kylek 6 years ago
  
  My rotation was a bit weird. I was on an ops team for a service, but my ops team did not have our own rotation- each of us took part in the various dev team rotations (the theory is nice, the ops team had a deep view of most aspects of the service. I don't think this was common to other service teams). The dev team I took part in was an absolute trainwreck. Poorly managed at the team level and one level above (the owners/managers of the service). More concerned with getting features out and burning through people to make progress. The issues were always brought up and root-caused properly, but poor architecture led to a lot of "well, we can't do that until x happens". I should reiterate that I'm no longer at the company - definitely wasn't the place for me (and my sanity)!
debunnOP 6 years ago

Thanks - I'm glad I got at least one reply of someone who's confirming that level of paging wasn't abnormally high to them.
When on primary on-call, we also are generally not expected to make progress on project work, although we don't review all of our incidents after our shift (generally just the major ones). I think there's definitely room for improvement here.

wsh 6 years ago

I wouldn’t accept that as normal. In well-run organizations, when there is a regular, ongoing need for evening or overnight coverage, it’s provided by people scheduled to work during those hours, who are selected and trained to be able to handle most situations on their own.

After-hours calls should come infrequently, or in situations where someone’s personal involvement (for example, as the engineer with primary responsibility for a particular component or its maintenance) is indispensable.

In my experience, things that need a lot of unplanned attention are more likely to fail, if they haven’t already, in ways that have other unacceptable consequences. Fixing them should be a priority for this reason, too.

You haven’t mentioned why you keep getting paged. Is it the same problem repeatedly, or lots of different problems? Is there any hope of addressing the underlying causes?

closeparen 6 years ago

>In well-run organizations, when there is a regular, ongoing need for evening or overnight coverage, it’s provided by people scheduled to work during those hours, who are selected and trained to be able to handle most situations on their own.
It's decently common to have engineering teams oncall for their own services, with a regular PagerDuty shift as part of the job. In that case 5-7 alerts per week is pretty healthy. It sucks that you need to keep your work laptop with you and stay sober / within cell coverage, but even then it's pretty rare to catch an actual outage that requires significant attention.
debunnOP 6 years ago

Thanks for your reply - we do have some recurring types of issues, but I'd say it varies a fair bit. A lot of issues are customer support related (that require administrator access to fix), but there are a lot of system issues as well. All are deemed as items that need fixing after hours (even if I don't necessarily agree with that assessment.)
There are actions being taken to reduce both the number of customer support cases and the systems issues - but progress is slow, and our appetite to implement all of our customer-requested changes ends up adding lots of new problems.

aprdm 6 years ago

I have been a lead devops engineer at my last two companies, both of them with more than 1k VMs across 4+ on-prem data centers.

In the first I was on the on-call rotation for a weekend a month for two years and got called twice.

It was 1h of work paid if you didn't get called and 4h if the phone rang; if you worked for more than 4h it then went straight to a full day.
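
The pay rule above is simple enough to write down as a function. This is only an illustrative sketch: the function name is hypothetical, and the 8-hour length of a "full day" is an assumption, since the comment does not say how many hours a full day counted as.

```python
def paid_hours(was_called: bool, hours_worked: float = 0.0,
               full_day_hours: float = 8.0) -> float:
    """Paid hours for one on-call shift under the rule described above.

    full_day_hours is an assumption; the comment only says "a full day".
    """
    if not was_called:
        return 1.0             # flat 1h for being on call without a call
    if hours_worked > 4.0:
        return full_day_hours  # more than 4h of work is paid as a full day
    return 4.0                 # any call is paid as 4h
```

Under these assumptions a quiet shift pays 1 hour, a 30-minute call pays 4 hours, and a 5-hour incident pays the full day.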

Currently I am on call and only get paid if called, but my manager only calls me in critical situations: I have been called 2 times in a year and 7 months. If I get called I get half a day of work paid.

debunnOP 6 years ago

Thanks - that seems like a reasonable way to handle on call, and more in line with what I've seen as well. Appreciate the feedback!

AdamGibbins 6 years ago

This is not normal. Our on-call schedules run 5pm-9am Monday to Friday, and 5pm Friday to 9am Monday. If I were paged twice in a week that would be a bad week; being paged at all is fairly uncommon now. Historically it was more common, but nowhere near daily. That would be entirely unacceptable.

We've invested a load of time in reducing the frequency of paging incidents over the years. The entire technology organisation recognises the importance of fixing said incidents and how disruptive they are to people's lives, sleep, etc.

debunnOP 6 years ago

Thanks - this is what I figured was closer to normal. I appreciate getting confirmation my experience is not as far off reality as I was being told!

sqldba 6 years ago

I don’t think it’s normal.

At a previous company I was on call every second week and would receive a call maybe once every few months. That was with many hundreds of servers.

At another company I’m on call one week per month and get called once or twice. That’s with just a few hundred servers.

In the first case all time was reimbursed in lieu. In the second case my salary more than makes up for any inconvenience.

However in both cases I was very proactive in defining what is on call - critical production issues only. If it’s not critical or not production then I won’t log on to look at it.

And in both cases I had a LOT of false alarms from bad alerts when starting. I had all false alarms disabled.

You’ll get push back but I didn’t care - you can’t have an alarm waking up people every night on the off chance that one in a hundred will actually be an error. And hilariously, if you started including your boss on the call, they’d quickly agree it’s not acceptable. The human cost isn’t worth it.

While there’s often tonnes of room for improvements to monitoring and alerting (root cause analysis etc) that others have mentioned - in my experience most of the metrics and alarms are garbage anyway, and can and should be done away with. If it came from a boxed product it should near all be turned off from the get go. That crap is always pointless.

Oh no a server CPU usage has increased and memory is low because - it’s doing what it’s meant to? What junk.

debunnOP 6 years ago

Thanks - yeah, all of the 5-7 incidents I'm seeing are considered high priority and require action. We get lots of the noisy false system alarms too, but those don't require me to action them thankfully.

mduggles 6 years ago

I mean it depends on whether you are doing anything with the pages and if they’re followed up on. As someone who has been on various oncall rotations for a decade I would describe that as a pretty heavy paging load for an average rotation.

The key criteria for me and paging are:

1. Was the page actionable? Did I need to do something to restore the system to functioning or prevent it from going down.

2. Can I prevent this page in the future and most importantly am I empowered by leadership to do that? If your app is paging me because it’s poorly made and I am not authorized to change it that’s a leadership problem that’s extremely common.

3. Are we auditing the pages? Often alerts in technology are designed in response to a particular problem and then never removed. Paging is, to me, a very serious action for a system to take. It means it is impossible for the system to naturally recover and all automation has failed. So every time we page someone we should as a team review those pages to ensure they’re actionable and actually impossible to naturally recover from.

These criteria have served me well for years and caused me to turn off the vast majority of the alerts of my services.

But you seem to have a culture that accepts this as normal and tbh these rarely change. Just know that it isn’t normal and it’s not acceptable.
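
The audit described in point 3 above can be sketched in a few lines. This is a hypothetical illustration, not a tool anyone in the thread describes: the `Page` record, its field names, and the 50% threshold are all assumptions. It groups past pages by alert name and flags the alerts where most pages were either not actionable or recovered without intervention.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Page:
    alert: str            # name of the alert that fired (hypothetical field)
    actionable: bool      # did the responder have to do something?
    self_recovered: bool  # did the system recover without intervention?


def alerts_to_review(pages: list[Page], min_useful_ratio: float = 0.5) -> list[str]:
    """Return alert names whose share of useful pages is below the threshold.

    A page counts as useful only if it was actionable and the system
    did not recover on its own.
    """
    total: dict[str, int] = defaultdict(int)
    useful: dict[str, int] = defaultdict(int)
    for page in pages:
        total[page.alert] += 1
        if page.actionable and not page.self_recovered:
            useful[page.alert] += 1
    return sorted(
        name for name, count in total.items()
        if useful[name] / count < min_useful_ratio
    )
```

Alerts returned by `alerts_to_review` are candidates for tuning, downgrading to a non-paging notification, or removal at the team's review.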

debunnOP 6 years ago

Thanks - of the 5-7 pages per week I was mentioning, all are items that require me to action them manually. Lots are after-hours customer support issues that require administration-level access; others are systems issues tied to technical debt or legitimate problems that occur.
There is effort to try and resolve the underlying problems, and we do make some headway here - we just keep adding changes to satisfy customers which end up causing new issues. We're being told this will get better over time, but it's certainly not happening fast enough IMHO.
Again, thanks for the feedback and insight!
- lolinder 6 years ago
  
  That comes back to the parent's comment about being empowered to fix the issues. The person on call should have power to prevent such calls in the future. This is important for the health of the individual and of the company.
  Are the people in charge of fixing the underlying issues themselves on call? How about the people producing the changes that cause new issues?
  If those two groups aren't themselves being woken up when there's a problem, you can reasonably expect that this won't change until the support calls start to directly affect the company's bottom line.
  - debunnOP 6 years ago
    
    > Are the people in charge of fixing the underlying issues themselves on call?
    Yes - although we're on call frequently enough, and tasked with other priorities when we're not - so progress is slow. I mentioned in another comment as well that the executive focus is to do pretty much whatever our customers want, so this generally results in lots of new problems by the time we fix older ones.
    > How about the people producing the changes that cause new issues?
    They are responsible for fixing the code, but they can do so more during regular 9-5 type hours. They don't feel the same level of pain. I realise this is a problem, but thanks for suggesting it.

zxcvbn4038 6 years ago

My advice is to use your time on call to your advantage. Don’t address just the symptoms - when you receive a call try to understand the root cause and take steps to prevent that situation from happening again. For example - if paged for low disk space make sure log rotation is present, working, and aggressive enough to stay ahead of the generation rate. Have the thing that checks the disk space perform the most common remediation steps and then page only if unsuccessful. If you’re in the cloud then just kill anything that runs out of disk space, it’s the application owner’s responsibility to arrange for long term storage, etc. Do this for every call you receive and soon your phone will be silent.
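
A minimal sketch of the "remediate first, page only if unsuccessful" idea, assuming a 90% threshold and caller-supplied remediation and paging functions. All names here are hypothetical; the real remediation steps (forcing log rotation, clearing a temp directory) depend on the system in question.

```python
import shutil
from typing import Callable, Iterable


def disk_used_fraction(path: str) -> float:
    """Fraction of the filesystem containing `path` that is in use."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


def check_disk(
    path: str,
    remediations: Iterable[Callable[[], None]],
    page: Callable[[str], None],
    threshold: float = 0.90,  # assumed threshold, tune per system
    used_fraction: Callable[[str], float] = disk_used_fraction,
) -> bool:
    """Try each remediation in turn; page only if the disk is still too full.

    Returns True if a page was sent.
    """
    if used_fraction(path) < threshold:
        return False
    for fix in remediations:
        fix()  # e.g. force log rotation or clear a temp directory
        if used_fraction(path) < threshold:
            return False  # remediation worked, nobody gets woken up
    page(f"{path} is still above {threshold:.0%} after automatic remediation")
    return True
```

The `used_fraction` parameter exists only so the check can be exercised without filling a real disk; in normal use the default reads the actual filesystem.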

My employer makes use of Pagerduty and I’ve spent a lot of time setting up “auto-resolve” of alerts. I even hook into AWS autoscaling lifecycle events and send mock “OK” actions when something gets terminated that had thrown an alarm. I still get paged but most issues solve themselves if I wait one more monitoring interval.
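
The observation that most issues clear up after one more monitoring interval suggests a simple guard: only page once a check has failed several times in a row. The sketch below is a generic illustration and does not use the PagerDuty or AWS APIs mentioned above; the class name and the default of two consecutive failures are assumptions.

```python
class ConsecutiveFailureGate:
    """Signal a page only after `required` failed checks in a row."""

    def __init__(self, required: int = 2) -> None:
        if required < 1:
            raise ValueError("required must be at least 1")
        self.required = required
        self.failures = 0

    def observe(self, healthy: bool) -> bool:
        """Record one monitoring result; return True when a page should fire."""
        if healthy:
            self.failures = 0  # recovered on its own, reset the streak
            return False
        self.failures += 1
        # Fire exactly once, when the streak first reaches the threshold.
        return self.failures == self.required
```

The trade-off is detection latency: with `required=2`, a real outage is paged one monitoring interval later than it otherwise would be.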

I’ve also used being on call as an excuse to leave early - to ensure I’m home and able to respond to calls when everyone else leaves the office; there's not much I can do if I’m stuck in traffic, or in a tunnel, etc.

debunnOP 6 years ago

Thanks - we try to tune our alerts, and we have a lot that are self healing as well. The ones I've been mentioning are ones that we currently don't have automated solutions for, and require me to manually action them. Our management team is working on automating away the work, but the technical debt is going to take longer to fix. We get some flexibility to leave early / start late when alerts affect our shift as well, although it's not worth the cost to me personally.

Niksko 6 years ago

I'm part of a team that operates a roughly 100 node Kubernetes cluster. I'm on call after hours for a week at a time, and am on call roughly every six weeks. I think I've been on call for three weeks this year, and I've been paged twice. Both of those were pretty straightforward problems solved within half an hour or so, with zero customer impact. This is roughly what other people in my team experience, probably averaging less than 1 page per on call rotation.

The question you should be asking is: why am I being paged so often?

Are they legitimate things that you need to respond to? If so, you should be fixing these issues so that they don't happen again. If anyone gets a page, we make it a high priority to fix whatever caused it. We are a team of 7, and we dedicate one person a week to field questions relating to our platform as well as to fix up these issues that wake us up.

If they're not legitimate things that you need to be woken up for, why are you being woken up? If this is the case, you need to make sure everyone is on the same page regarding what constitutes something you need to be paged for after hours.

debunnOP 6 years ago

Thanks for the reply - I appreciate the insight and follow up questions.
> The question you should be asking is: why am I being paged so often? Are they legitimate things that you need to respond to? If so, you should be fixing these issues so that they don't happen again.
This is mostly due to not having anyone else around to handle customer issues (which currently require manual intervention); however, system issues are also pretty frequent here. Management is working on prioritizing the automation of the customer issues so that there are fewer of them in total, but system issues will likely be harder to resolve (we try to resolve them as they come up if possible, but many are more systemic and tied to technical debt.)
So yes - I'm only including the events that are actionable and require breaking out the laptop - these generally vary from 15 minutes to 3 hours of support.

algaeontoast 6 years ago

If I'm not doing devOps work (I explicitly avoid this garbage) and not a founder I expect to not be on call - ever.

So basically, I don't work at companies that make their employees carry a pager etc. Life is too short for that shit.

I worked briefly at a startup shortly after its acquisition by a FAANG. The startup's code was trash - I acknowledged while on call that I didn't exactly know what was going on after digging a while - asked for help - was then reprimanded for "not knowing the code well enough" basically because I asked for help. I left about a month after that. Again, life is too short for that shit.

photonios 6 years ago

Rarely. In a team of 3-4 engineers who share the on-call responsibility, I think one of us gets paged every 3-4 months.

Normal shift is like every other day. Just go to work, do my job. Come home, eat, chill a bit and go to sleep.

It used to be more. The company started with three people three years ago (myself included). Now we're over 50. We have enough resources to fix and solve problems before they become real problems.

debunnOP 6 years ago

Thanks for the reply - this sounds like the right way to run an on call rotation!

EdwardDiego 6 years ago

Hardly ever, but then we've made it an explicit goal that if we're having to fix the system after hours, we need to fix that immediately. It used to be almost daily before we made uninterrupted sleep an explicit priority.

debunnOP 6 years ago

Thanks - that's what I was expecting this to be like also - it's sadly not though. Appreciate the feedback!

shifto 6 years ago

Currently a bit more than a year at my current workplace. I have on-call every 4 weeks for a week. This weekend was my third call.

debunnOP 6 years ago

Thanks - when you're primary on-call, how often do you receive alerts / pages that you have to action?
- shifto 6 years ago
  
  Well, having had my third call in the 14th week of being on call I would say about 0.2 times per week on-call.
