DNA&OC or: how to troubleshoot like a pro

At Amazon, we have on-call for engineers and product owners. The way it is structured is rather clever: there is no divide between software developers adding new features and operators who run the software; instead, the same people wear different hats at different times. As a product owner, I’m on-call for some aspects of what an incident commander is responsible for: I’m trying to figure out if external customers are impacted and, if so, how to communicate it.

So I get paged, of course at 9pm on a Friday (or, alternatively, at 5am on a Saturday, ha ha), and confirm the page. I power up my machine and jump on Slack to figure out what’s going on. I team up with my engineering peer to understand the situation.

Besides the content of the page (a ticket) and our dashboards, I don’t have a lot of context, say, 5 minutes after being engaged. I need to understand if external customers are impacted so that I can decide what we communicate, when, and to whom (PHD, SHD, follow-up communication). The first thing I have to remind myself, because I’m not investigating the incident (my engineering colleagues are), is that I should not assume anything. Ah, that’s an LSE. Ah, I know, that’s a dependency going rogue. Ah, sure, must be the b0rked rollout. Nope. I am listening and asking questions. I form a hypothesis and ask for confirmation. It’s hard, since we tend to see patterns and jump to conclusions.

At the end of the day, our customers first and foremost want to know: Is it me or is it AWS? And if it’s AWS, how long will it take to fix?

When I’m on-call for our services, Amazon Managed Service for Prometheus and Amazon Managed Grafana, I own the internal and external communication. So I’d better be up to speed with what’s going on. But that doesn’t mean I should assume things. That can and will backfire (ask me how) and confuse our customers.

Moving on to communication.

Everyone is under pressure. My engineering peers may be RCAing the living daylights out of the problem, paging in other service teams, fixing, rolling back, whatever it is. It’s a back and forth on the dashboard call between our excellent central team that coordinates things, our engineering on-calls, myself, and potentially account teams. The only thing that helps, in my experience, is to communicate concisely and promptly. When in doubt, repeat what you said on the Chime call or ask a clarifying question. Confirm in writing. Repeat back what you heard. All of this has a single purpose: letting our customers know, as quickly and effectively as possible, what’s going on.

When you are on-call, independent of your role, what did you see and what are your lessons learned? I’m curious :)
