Full technical details on Asana's worst outage

blog.asana.com

77 points by marcog1 9 years ago · 69 comments

merb 9 years ago

> Initially the on-call engineers didn’t understand the severity of the problem

Every outage I read about, something like that happened. At least Asana didn't blame the technology they were using.

  • babo 9 years ago

    For me that was the great part of the post mortem, they identified the response process itself as the root cause.

katzgrau 9 years ago

These sorts of deeply apologetic and hyper-transparent post-mortems have become commonplace, but sometimes I wonder how beneficial they are.

Customers appreciate transparency, but perhaps delving into the fine details of the investigation (various hypotheses, overlooked warning signs, yada yada) might actually end up leaving the customer more unsettled than they would have been otherwise.

Today I learned that Asana had a bunch of bad deploys and put the icing on the cake with one that resulted in an outage the next day.

This is coming from someone who runs an ad server - if that ad server goes down it's damn near catastrophic for my customers and their customers. When we do have a (rare) outage, I sweat it out, reassure customers that people are on it, and give a brief, accurate, and high level explanation without getting into the gruesome details.

I'm not saying my approach is best, but I do think trying to avoid scaring people in your explanation is an idea.

  • bognition 9 years ago

    I work at a shop that does these kinds of post mortems. I find them highly beneficial.

    They require us to actually do the work of identifying the issues and writing up what happened and why. I realize that having a customer contract to do this shouldn't be a requirement, but human psychology is a funny thing. I can turn to my PM and say "I have to do this, it's part of the contract" and they immediately back off.

    I agree it might not be the best solution but it's definitely better than not doing them.

    • dogma1138 9 years ago

      I think the OP didn't mean that these post mortems aren't beneficial internally; what he said was that disclosing all these details to the public can be confusing and maybe counterproductive.

      • merb 9 years ago

        I'm not sure what's better:

        1. Describe the root cause and what you failed at.
        2. Blame the stuff you're using / other people (the clouds you use).
        3. Just say nothing and try to forget what happened.

        What do you think is best?

        • dogma1138 9 years ago

          No need to blame it on anything, and you don't necessarily need to go into the fine details either.

          You can simply make a statement that goes something like this:

          "We've completed our investigation of the outage and we found that it was caused due to both technical and procedural errors in the manner in which we deploy our code and monitor the environment. We gathered all the information we require and have made improvements based on it that would help to prevent these issues and other issues with similar causality from occurring. While we do apologize for any inconvenience that the outage may have caused we do want to stress that because of the lessons learned from it our service would grow to be more robust and reliable in the future."

          That's it: simple, even if generic. Having to read 3 pages of technical details isn't really helpful to anyone; if anything, the more "suspicious" people might see that as an attempt to mask the real cause of the issues.

          But overall, when you go into specifics, you also give people the ability to focus their frustrations and disapproval on a specific subject, which is never good. After reading this, what I "feel" at first glance is that the fault lies with the engineers who monitored the environment, so the engineers are incapable of performing their duties; now I feel like the hiring and management processes in that company are not working well if they let "unqualified" engineers in. This is how a minor outage blows up into a specific complaint or negative bias towards a company, and you can easily avoid it by giving enough "reassuring" information but not enough for anyone to actually sink their teeth into.

          Overall, a generic positive statement is more likely to be accepted as "well, it sucks, but shit melts down sometimes and sometimes people make mistakes." A more technical statement might be received as "well, why did you hire Bob in the first place?" or "why the fuck are you using this_framework_i_dont_like?".

        • cookiecaper 9 years ago

          I think 1 is correct, but it's about the level of resolution into the issue.

          "Asana had an outage for 45 minutes yesterday. This was due to an issue with a deploy that was pushed the night prior. We apologize for the inconvenience and are undertaking a thorough review of processes to ensure that similar events don't occur in the future. Please be assured everything is back in working order now. Thank you for your patience and continued patronage."

          Big detailed postmortems like this should remain internal documents unless they describe a complex or rare technical failure, news and/or discussion of which will actually benefit the larger community.

        • themartorana 9 years ago

          It's maybe a level-of-detail thing. "A bad deploy went unnoticed, causing a cascading failure. We identified how that happened and have new checks in place to prevent it in the future."

          Two lines, with the same information someone not very technically literate would take away from the OP. I agree with being transparent, but I also believe in not unnecessarily scaring and/or confusing customers.

          (Pretty soon they'll just start outing individual engineers...)

          • jwatte 9 years ago

            The fact that you think an engineer can be "outed" is a culture problem.

            The process failed the engineer. Testing, deployment, and monitoring infrastructure was not up to the task of supporting human beings. That it happened to be triggered by engineer X instead of engineer Y is entirely coincidental.

            The audience of the post mortem matters. When I see the two line summary, I have no idea whether that's a CYA whitewash, or a sincere part of a process of improvement. When I see the full PM, it builds more trust.

            If you're not an engineer capable of understanding the details, it may have a different effect. And if you're part of a corporate culture of politics, shaming, and status chasing, it must feel totally alien.

            Three cheers for transparency!

          • marcog1OP 9 years ago

            We will never out individuals. The person who committed the code was innocent. We got him a fun gift as a sort of joke.

            • merb 9 years ago

              Yep, the best things someone could do:

              - train the people more
              - help them get over it (some people can be really upset and lose confidence after they mess up)

            • themartorana 9 years ago

              That was entirely tongue-in-cheek. I wouldn't ever expect you to do that! It was an exaggerated example.

        • zzzcpan 9 years ago

          A post like this is definitely confusing.

          I think you do need to at least acknowledge the problem, with a clear non-technical explanation of it in the first paragraph. The rest should go into the real technical details of the result of the investigation, not the investigation itself.

          • shiven 9 years ago

            Why is it confusing? I did not find it confusing. I like such excruciatingly detailed postmortem analyses; they make for great reading, and my respect for a company that does this increases when I read them.

  • smoe 9 years ago

    Personally I'd produce both: a brief high-level explanation for non-technical people (e.g. customers, press) and an in-depth blog post with the gruesome details.

    The latter is useful, for example, when my boss asks me to evaluate whether to continue using a service after an incident. If I can't get enough information to make a recommendation, I might propose a switch out of distrust, especially when the problem was related to security or privacy.

  • rossjudson 9 years ago

    Your approach works for the incident, but not for the relationship. Transparency about the technical nature of the outage is a commitment to the client that this type of outage won't recur, and steps are being taken to ensure that. It pierces the veil of arrogance by assuming client competence. That client is actually someone who reports to someone else, and they're going to have to explain their outage to the boss. For cloud providers, this kind of transparent post-mortem is the root of a fan-out of incident analysis.

  • marcog1OP 9 years ago

    Every response from our users so far has thanked us for the transparency. It also reflects our internal transparency, and that has a real impact on recruiting.

    • noir_lord 9 years ago

      Not an Asana user, but if I were, this kind of response is exactly what I'd like to see as both a user and a developer, so well done.

madelinecameron 9 years ago

>And to make things even more confusing, our engineers were all using the dogfooding version of Asana, which runs on different AWS-EC2 instances than the production version

... That kind of defeats the purpose of "dogfooding". Sure, you're using the same code (hopefully), but it doesn't give you the same experience.

  • marcog1OP 9 years ago

    You want to replicate as much as possible, but if we ran canary on the same machines we could have testing code bring down production. That's bad.

bArray 9 years ago

Was this incident really recorded minute by minute, or is that made up? I've noticed a lot of companies that give this kind of detail like to give a minute-by-minute report; I just don't understand how they get that accuracy.

  • gjtorikian 9 years ago

    Oh, man. Most definitely that's real.

    If you're working in Slack or chat, you've got a minimum of half a dozen people typing and putting out suggestions and offering to investigate something. That's all time stamped. And even if you're not doing that real-time, you may be using something like a GitHub issue to discuss the problem via comments, which are also time-stamped.

    No one at the moment of the incident is probably going "Ah, it's 8:01, better write down that I identified the problem." It's most likely "hey I think I got it one sec" and then that works. Or doesn't. But hopefully it does.

    • jwatte 9 years ago

      Yes, Slack and IRC timestamps are common. Ideally your shell and auditing give you that for commands, too!

  • stephengillie 9 years ago

    It's from details gathered from tickets and chat history, customer reports and server logs. My team is developing a set of tools to manage our incidents, and automating the gathering of details like this is central to the reporting element.

    • gbin 9 years ago

      As a tool for that, chatops is pretty cool because you can easily record not only your conversations but also your actions.

  • jon-wood 9 years ago

    Generally it's not recorded minute by minute in the moment. When I write post mortems like this I'll piece together the timeline after things have calmed down through a combination of metrics, logs, and the ongoing discussion that takes place on Slack. To assist with that I'll tend to keep a running commentary of what I'm doing in Slack even if I'm the only engineer dealing with the incident; it helps with putting the timeline together later, and also means other people coming to see what's up and offer help can get caught up without interrupting.
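
    A minimal sketch of that kind of timeline reconstruction, assuming a Slack-style JSON export and ordinary timestamped log lines (the file names, fields, and log format here are hypothetical, not any particular team's tooling):

      import json
      from datetime import datetime, timezone

      def slack_events(path):
          # Slack-style exports store each message with a Unix "ts" field.
          with open(path) as f:
              for msg in json.load(f):
                  ts = datetime.fromtimestamp(float(msg["ts"]), tz=timezone.utc)
                  yield ts, f"slack <{msg['user']}>: {msg['text']}"

      def log_events(path):
          # Assumes lines like "2016-02-04T08:08:12+00:00 web42 CPU at 97%".
          with open(path) as f:
              for line in f:
                  stamp, _, rest = line.partition(" ")
                  yield datetime.fromisoformat(stamp), f"log: {rest.strip()}"

      def timeline(*sources):
          # Merge every timestamped source into one sorted incident timeline.
          return sorted((ev for src in sources for ev in src), key=lambda ev: ev[0])

      for ts, text in timeline(slack_events("incident-channel.json"),
                               log_events("webserver.log")):
          print(ts.strftime("%H:%M"), text)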

  • dcosson 9 years ago

    Often one person will be in charge of taking notes while the rest diagnose (using things like server logs or email timestamps to get these times as precise as possible). It's not just for the post mortem; it can be very helpful in figuring out what happened, making sure the timing of events plausibly lines up with your hypothesis, extrapolating based on the length of a particular part of the incident to decide what to do next, etc.

  • dgcoffman 9 years ago

    We reconstruct history from timestamps in Slack and our logging and monitoring systems.

  • beachstartup 9 years ago

    they probably just looked at the chat history and wrote a timestamped summary narrative.

    judging from the number of 'sorry's in the text, seems like post mortems have been slowly adapted into a very specialized form of semi-fictional stage drama in which the audience is pandered to excessively through the use of hyperbolic apology.

kctess5 9 years ago

I find it interesting that they didn't notice the overloading for so long. Also that it took so long to roll back. Given that they reportedly roll out twice a day, it seems like identifying a rollback target would be fairly quick.

  • marcog1OP 9 years ago

    This was the first time we had this class of outage. Many things were in a very bad state, and many of these symptoms were more familiar to us. So we spent time ruling them out before realising webserver CPU was closer to the root cause than the other symptoms.

    We roll back by reverting to a previous release on the load balancers, which is usually pretty instant. The previous releases were bad and themselves rolled back, which is a rare situation for us. So there was a bit of scrambling to look into the chat logs to determine a safe (non-rolled-back) release we could roll back to. Then the high CPU caused our roll back to be really, really slow. Then we still had old processes running the bad release, and killing them on webservers with high CPU took a while to actually work. Then it took a bit of time for load to come down on its own. All of this took place within the 8:08-8:29 window reported in the post. And I'm still simplifying a lot.
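
    A minimal sketch of the "find a safe rollback target" step described above, assuming a release history where each entry records whether it was itself rolled back (the data structure and version names are made up, not Asana's tooling):

      from dataclasses import dataclass

      @dataclass
      class Release:
          version: str
          deployed_at: str          # ISO timestamp, so strings sort chronologically
          rolled_back: bool = False

      def safe_rollback_target(history, current):
          # Walk backwards from the current release and return the newest
          # prior release that was never itself rolled back.
          prior = [r for r in history if r.deployed_at < current.deployed_at]
          for release in sorted(prior, key=lambda r: r.deployed_at, reverse=True):
              if not release.rolled_back:
                  return release
          raise RuntimeError("no known-good release to roll back to")

      # The two most recent prior releases were themselves reverted,
      # so the safe target is three releases back.
      history = [
          Release("r100", "2016-02-03T09:00"),
          Release("r101", "2016-02-03T21:00", rolled_back=True),
          Release("r102", "2016-02-04T06:00", rolled_back=True),
      ]
      print(safe_rollback_target(history, Release("r103", "2016-02-04T07:30")).version)  # r100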

    • tomjen3 9 years ago

      What I don't get is why you didn't see the relatively low CPU usage on the database servers and the super high usage on the webservers immediately in a Nagios (or similar) dashboard.

      • mkagenius 9 years ago

        They were distracted by the previous experience of having issues elsewhere.

      • lrascao 9 years ago

        And apparently there were no alarms in place for these kinds of things.

      • bdob4xcfH 9 years ago

        It's because they don't have a simple rollup dashboard where you can see that at a glance, like most places. Can you imagine if your car just showed you an event log for door open, oil, turn signals on, etc.? That's what most monitoring systems are like these days.
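
        A minimal sketch of that "warning light" idea: roll per-host readings up into one tier-level status instead of an event log (the hosts and thresholds are made up):

          def tier_status(cpu_by_host, warn=0.75, crit=0.90):
              # Collapse per-host CPU into one red/yellow/green signal, the way
              # a car condenses dozens of sensors into a single warning light.
              worst = max(cpu_by_host.values())
              if worst >= crit:
                  return "RED", worst
              if worst >= warn:
                  return "YELLOW", worst
              return "GREEN", worst

          webservers = {"web-1": 0.97, "web-2": 0.95, "web-3": 0.42}
          databases = {"db-1": 0.18, "db-2": 0.22}
          print("web tier:", tier_status(webservers))  # ('RED', 0.97) -> look here first
          print("db tier:", tier_status(databases))    # ('GREEN', 0.22)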

    • jwatte 9 years ago

      Roll backs are in chat logs? I'd assume your scripts would record what they do when they do it, including roll backs.

      Also, when only deploying two times a day, it's harder to tell which of the included changes have the problem. That's an argument for more frequent deploys!

    • abhishekash 9 years ago

      Seems like pretty ambitious logging that it tripped the servers! I'll be careful with my logging next time :)

    • ycombinatorMan 9 years ago

      Out of curiosity, why are you deploying to all your web servers simultaneously? Could you not do a partial roll-out to make sure something like this doesn't happen?

      • mkagenius 9 years ago

        I doubt a partial rollout would have helped in this particular case, since it only happens under high load and they roll out new code twice a day.

        • marcog1OP 9 years ago

          Correct. We don't roll out during peak load either.

          • tonfa 9 years ago

            Considered at least starting your release canary during peak load?

            • marcog1OP 9 years ago

              We have talked about it. It is unlikely to have helped with an event like this, and I don't recall an event where it would have. It also has the downside of extending our deployment cycle by a lot. Notably, we do run a canary internally, and that had no issues, which actually threw us off for a while: the app was partially down for users but working for us, and that hasn't happened to us in a while.

mathattack 9 years ago

Not a bad reaction. With all the reverts is there a QA issue? Or too many releases?

  • marcog1OP 9 years ago

    When you do daily deployments, you can't QA every one much. You rely on automated tests and internal users using the new code for a couple of hours before the deployment. We were unlucky in this case with the number of bad releases. Each was relatively minor, and ironically one was to fix a bug with the code that caused this outage. We run a 5 whys for most of them.

    • Mtinie 9 years ago

      > When you do daily deployments, you can't QA every one much.

      In that case, should you be doing daily deployments to production?

    • mathattack 9 years ago

      I include automated testing in my definition of QA. (Necessary but not sufficient)

      Are the daily drops predominantly bug fixes or also a regular drip of new functionality?

      I think the old world of quarterly releases was also bad for other reasons. I'm curious about the right middle point.

      Every time a company like Asana comes clean about outages and software quality issues, the canon of knowledge improves. Thank you for sharing!

zzzcpan 9 years ago

Strangely, there are no actual technical details in the report, and the blame is on the process, although most of the time there is some way to prevent bugs from causing problems with better architecture.

  • jwatte 9 years ago

    The detail was right there: debugging something in security caused massive logging which caused CPU bottlenecking.

    Performance is the hardest thing to integration test for. Keeping careful track of CPU/memory/network/disk load with automated alerts can help.

    (Fancy systems like running a traffic replica can help, too, but at a much higher cost.)
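
    A minimal sketch of that kind of automated resource alert, using the psutil library for the readings; the thresholds and the alert hook are assumptions, not anyone's production config:

      import psutil

      THRESHOLDS = {          # hypothetical limits; real ones come from your baselines
          "cpu_percent": 85.0,
          "memory_percent": 90.0,
          "disk_percent": 95.0,
      }

      def check_host():
          readings = {
              "cpu_percent": psutil.cpu_percent(interval=1),
              "memory_percent": psutil.virtual_memory().percent,
              "disk_percent": psutil.disk_usage("/").percent,
          }
          # Anything over its threshold is a breach worth alerting on.
          return readings, {k: v for k, v in readings.items() if v >= THRESHOLDS[k]}

      readings, breaches = check_host()
      for metric, value in breaches.items():
          # Stand-in for a pager/chat integration.
          print(f"ALERT: {metric} at {value:.0f}%")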

    • marcog1OP 9 years ago

      We actually have a traffic replica (dark client) setup for the new webserver architecture we are gradually migrating to. It likely would have caught this before deploying to users.

cookiecaper 9 years ago

Reading through this, it sounds like some basic monitoring would've quickly allowed them to pinpoint the cause instead of wasting time with database servers. All it would take is pulling up the charts in Munin or Datadog or whatever and seeing "Oh, there's a big spike correlated with our deploy and the server is redlining now, better roll that back". A bug or issue in the recent deploy would logically be one of the first suspects in such a circumstance. Don't know why they wasted 30-60 minutes on a red herring. The correlation would be even more obvious if they took advantage of Datadog's event stream and marked each deployment.

Additionally, CPU alarms on the web servers should've informed them that the app was inaccessible because the web servers did not have sufficient resources to serve requests. This can be alleviated prior to pinpointing the cause by a) spinning up more web servers and adding them to the load balancer; or b) redirecting portions of the traffic to a static "try again later" page hosted on a CDN or static-only server. This can be done at the DNS level.

Let this be a lesson to all of us. Have basic dashboards and alarming.
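
A minimal sketch of the deploy-to-spike correlation described above: overlay deploy timestamps on a CPU series and flag any deploy followed by a redline (release names, timestamps, and thresholds are illustrative only):

  from datetime import datetime, timedelta

  def deploys_followed_by_spike(deploys, cpu_series, threshold=0.90,
                                window=timedelta(hours=12)):
      # deploys: (name, deployed_at) pairs; cpu_series: (timestamp, cpu_fraction) samples.
      # A deploy is suspect if CPU crosses the threshold within the window after it,
      # which makes "roll that back" the obvious first move.
      suspects = []
      for name, deployed_at in deploys:
          if any(deployed_at <= ts <= deployed_at + window and cpu >= threshold
                 for ts, cpu in cpu_series):
              suspects.append(name)
      return suspects

  deploys = [("release-101", datetime(2016, 2, 3, 21, 0)),
             ("release-102", datetime(2016, 2, 4, 6, 0))]
  cpu_series = [(datetime(2016, 2, 4, 7, 50), 0.55),
                (datetime(2016, 2, 4, 8, 5), 0.97)]
  print(deploys_followed_by_spike(deploys, cpu_series))  # ['release-101', 'release-102']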

  • marcog1OP 9 years ago

    We have very comprehensive dashboards. Getting the perfect ones that help in all cases, while not being information overload (the problem here) and being discoverable is a hard, iterative process.

    • cookiecaper 9 years ago

      Yes, monitoring requires a lot of tuning until you find a sweet spot, but it doesn't sound like this is something that would've been buried deep in the annals of monitoring. CPU/load data on your web servers should be pretty visible/accessible and one of the first graphs that gets pulled up (and your alarms should've pointed out the issue anyway).

      I'm not sure what you're using for dashboards but Datadog makes it pretty easy to find this stuff. I'm not a Datadog shill and I actually am not a huge fan of the product, but it's what we use and it's been a big help over our previous Munin installation.

      Other process changes that could prevent this are good load testing in a stage environment and getting your company using the real prod code on the real prod infrastructure as its main/default install. A lot of the benefits of "dogfooding" are lost if you're using alpha code on dev-only boxes (as you state that you are in another comment).

      As another commenter said, I'm not sure that postmortems like this are valuable unless the problem was particularly complex/interesting. I'm sure that a lot of people at Asana know how to fix this and that it's just a matter of getting management to allow them to do so. I'm sure you owe your customers an explanation of some sort, but I don't know if you need to get into details that say "Yeah, it was just a pretty typical organizational failure, we really should've known better". Everyone has those, but it's best not to publicize them too much.

      I'm not going to hold it against Asana because I've worked at a lot of companies and I know how this goes, but when people come here and analyze the cause, as a postmortem invites the readers to do, you seem a little defensive. Perhaps it's best to keep the explanation more brief/vague when it's not a complex failure.

qaq 9 years ago

This is "not that different" from getting a very high load spike. Do you guys not have some autoscaling set up?

  • marcog1OP 9 years ago

    We do, but it didn't help, given the cause of the high CPU was our logging infrastructure (Amazon Kinesis) being overloaded by the webservers.

    • matt_wulfeck 9 years ago

      Does Kinesis not support UDP syslog-style logging? Some of these old technologies had the right idea: if you're sending too much data, drop the packets on the floor instead of falling over!
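
      A minimal sketch of the fire-and-forget behaviour being described: classic UDP syslog just sends datagrams and lets the network drop them under load instead of blocking the sender (the collector address and message format are placeholders):

        import socket

        SYSLOG_ADDR = ("logs.internal.example", 514)   # hypothetical collector
        _sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

        def udp_log(message):
            # <134> = facility local0, severity info, in classic RFC 3164 framing.
            try:
                _sock.sendto(f"<134>{message}".encode("utf-8", "replace"), SYSLOG_ADDR)
            except OSError:
                # If the collector or the network is overwhelmed, the log line
                # is simply lost -- the application keeps serving requests.
                pass

        udp_log("web42: request served in 8ms")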

  • babo 9 years ago

    Autoscaling is not necessarily driven by CPU load.

jwatte 9 years ago

The real support for a frequent deployment system is in the immune system! I've had good luck with a deployment immune system that rolls back if CPU or other load jumps, even if it doesn't immediately cause user failure. (I.e., monitor crucial internals, not just user availability.)
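
A minimal sketch of that immune-system idea, assuming a metrics hook and a deploy/rollback command already exist (both are placeholders here, not any particular deploy tool):

  import statistics
  import subprocess
  import time

  def fleet_cpu() -> float:
      # Placeholder: in practice this would query your metrics system for
      # fleet-wide webserver CPU, not sample a single host with psutil.
      import psutil
      return psutil.cpu_percent(interval=1)

  def deploy_with_immune_system(baseline_samples=10, watch_samples=30, jump=1.5):
      # Record the pre-deploy baseline, deploy, then watch; revert automatically
      # on a sustained CPU jump even if users aren't visibly failing yet.
      baseline = statistics.mean(fleet_cpu() for _ in range(baseline_samples))
      subprocess.run(["./deploy", "release"], check=True)     # placeholder command
      for _ in range(watch_samples):
          time.sleep(10)
          if fleet_cpu() > baseline * jump:
              subprocess.run(["./deploy", "rollback"], check=True)
              return "rolled back: CPU jumped well above pre-deploy baseline"
      return "deploy looks healthy"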
