Update on Azure Storage Service Interruption
The article on azure.microsoft.com is currently returning a 404, but you can get the text from the RSS feed [1]:
---
Yesterday evening Pacific Standard Time, Azure storage services experienced a service interruption across the United States, Europe and parts of Asia, which impacted multiple cloud services in these regions. I want to first sincerely apologize for the disruption this has caused. We know our customers put their trust in us and we take that very seriously. I want to provide some background on the issue that has occurred.
As part of a performance update to Azure Storage, an issue was discovered that resulted in reduced capacity across services utilizing Azure Storage, including Virtual Machines, Visual Studio Online, Websites, Search and other Microsoft services. Prior to applying the performance update, it had been tested over several weeks in a subset of our customer-facing storage service for Azure Tables. We typically call this “flighting,” as we work to identify issues before we broadly deploy any updates. The flighting test demonstrated a notable performance improvement and we proceeded to deploy the update across the storage service. During the rollout we discovered an issue that resulted in storage blob front ends going into an infinite loop, which had gone undetected during flighting. The net result was an inability for the front ends to take on further traffic, which in turn caused other services built on top to experience issues.
Once we detected this issue, the change was rolled back promptly, but a restart of the storage front ends was required in order to fully undo the update. Once the mitigation steps were deployed, most of our customers started seeing the availability improvement across the affected regions. While services are generally back online, a limited subset of customers are still experiencing intermittent issues, and our engineering and support teams are actively engaged to help customers through this time.
When we have an incident like this, our main focus is rapid time to recovery for our customers, but we also work to closely examine what went wrong and ensure it never happens again. We will continually work to improve our customers’ experiences on our platform. We will update this blog with an RCA (root cause analysis) to ensure customers understand how we have addressed the issue and the improvements we will make going forward.
---
Uhm.
> The configuration change for the Blob Front-Ends exposed a
> bug in the Blob Front-Ends, which had been previously
> performing as expected for the Table Front-Ends.
> This bug resulted in the Blob Front-Ends to go into an
> infinite loop not allowing it to take traffic.
An infinite loop that was not discovered during the partial rollout. That is clearly weird. It also doesn't inspire much confidence in their current monitoring scheme.
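Azure's internal monitoring obviously isn't public, but even a crude external liveness probe along the lines of the sketch below (endpoints, thresholds, and intervals all invented) should flag a front end that is stuck in a loop and no longer answering requests:

    # Rough sketch of an external liveness probe. A front end spinning in a
    # busy loop stops answering its health endpoint, so requests time out and
    # the node gets flagged. All URLs and thresholds here are made up.
    import time
    import urllib.request

    FRONT_ENDS = ["https://fe-01.example.net/health", "https://fe-02.example.net/health"]
    TIMEOUT_SECONDS = 5          # a looping front end simply won't answer in time
    FAILURES_BEFORE_ALERT = 3    # tolerate transient blips

    def probe(url):
        """Return True if the front end answered its health endpoint in time."""
        try:
            with urllib.request.urlopen(url, timeout=TIMEOUT_SECONDS) as resp:
                return resp.status == 200
        except Exception:
            return False

    def watch():
        failures = {url: 0 for url in FRONT_ENDS}
        while True:
            for url in FRONT_ENDS:
                if probe(url):
                    failures[url] = 0
                else:
                    failures[url] += 1
                    if failures[url] >= FAILURES_BEFORE_ALERT:
                        print(f"ALERT: {url} has not served traffic for "
                              f"{failures[url]} consecutive probes")
            time.sleep(30)

Something this simple would start screaming long before a fleet-wide rollout finished, which is why the "undetected during flighting" part is the surprising bit.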
In this video, http://channel9.msdn.com/events/Build/2014/3-615 (start at 39 minutes), Mark Russinovich explains their update rollout procedure (or at least part of it).
Willing to bet the rollout infrastructure depended on storage and so their ability to control or stop the rollout was broken once the storage failures began.
Almost everyone has fully recovered at this point. If you are still seeing problems with your Virtual Machine after the incident earlier this week, we want to help you!! Please send mail to azcommsm@microsoft.com and email me directly at corey.sanders@microsoft.com.
Please send with high importance so it pops in our inbox and we will dig in.
I hope to read more in the post-mortem RCA, but I am curious what their flighting missed. Is flighting so limited that it does not exercise cross-region scale, or something like that? I also had the feeling, from watching Mark Russinovich discuss previous failures, that their patch rollouts were much more controlled.
Keith, you can find more details in the RCA that is published here: http://azure.microsoft.com/blog/2014/11/19/update-on-azure-s.... It is updated with more details on the flighting and issues we encountered.
From what I've seen of their patching of ordinary machines, I would say it's pretty far from controlled or well thought through. Their patching has led to our machines becoming unavailable before, even though we have multiple machines in the same availability set. We've been in contact with support to describe what happens and have gotten an "oh, it's by design" response back.
So they rolled out a performance update (not a critical security fix) to all their datacenters at once?
This sounds incredibly amateur for a provider the size of Azure.
Hey nnx, this is Corey from the Azure engineering team. We have a standard protocol in the team of applying production changes in incremental batches. Due to an operational error, this update was made across most regions in a short period of time. I really apologize for the disruption.
So, you had a bug in your code. That happens to everyone and I think we all understand. However, there are a number of other issues here which seem systemic and much more troubling.

First, that your "flighting" did not catch the problem. Why was that? If the bug caused an infinite loop on all the live storage systems, that seems like it should have been fairly obvious on the customer systems you tested on.

Second, that the patch was rolled out to all servers at the same time. You have admitted this was a mistake, but honestly it looks like amateur hour. If you are running business critical distributed cloud infrastructure, you just don't ever do this.

Third, that there was extended fallout from rolling the patch back. If there are still customers experiencing downtime from this problem a full day later, that speaks to some serious flaws in the ops architecture and process.

If you guys want to compete with AWS and similar platforms, it seems like you have a long way to go still. This set of mistakes should haunt you for a long time, because it's going to come up whenever someone is trying to convince their boss/colleague/team that Azure is a solid solution.
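For reference, the staged-rollout discipline everyone expects here isn't complicated. A toy sketch of the pattern (nothing Azure-specific; deploy_to and healthy are invented stand-ins for whatever the real fleet tooling provides):

    # Toy illustration of a batched rollout with a bake period and automatic
    # halt on the first health regression. Helper names are hypothetical.
    import time

    def staged_rollout(nodes, deploy_to, healthy, batch_size=0.05, bake_seconds=3600):
        """Deploy to a small fraction of nodes at a time, wait, and verify
        health before touching the next batch. Abort on the first regression."""
        step = max(1, int(len(nodes) * batch_size))
        for start in range(0, len(nodes), step):
            batch = nodes[start:start + step]
            for node in batch:
                deploy_to(node)
            time.sleep(bake_seconds)      # let problems surface before expanding
            bad = [n for n in batch if not healthy(n)]
            if bad:
                raise RuntimeError(f"halting rollout, unhealthy nodes: {bad}")

    # Example wiring with stand-in callbacks:
    # staged_rollout(all_front_ends, deploy_to=push_config, healthy=answers_health_probe)

The whole point of the pattern is that a bad change only ever reaches one small batch before the halt kicks in, which is exactly what didn't happen here.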
Thanks. We are continuing to investigate this and driving needed improvements in our process and technology to avoid similar issues in the future.
The last two times there was a big issue, the same thing happened with the status dashboard (it became inaccessible). I remember the same issue when the certs expired 1.5 years ago. I really like Microsoft and was convinced "you" would somehow isolate the dashboard and host it separately, but it turns out I was wrong. Do you happen to know the reasons for hosting the status dashboard inside of Azure? It seems so counter-intuitive to me. Or is it actually hosted externally but died under the load when the issue started to appear?
The OP mentions that Microsoft representatives gave info via public forums. When the issue appeared I looked in different places trying to find info, but all I found was a statement saying "we are aware of issues". I looked at the Azure twitter/blog, ScottGu's twitter/blog, Hanselman's, and the MSDN forums. I also tried this forum and reddit. Do you know where I should have gone to receive details?
Thanks. The communications and the service health dashboard are two areas that we are creating improvement plans from the learning of this event. For the dashboard, we do expect it to continue to run even through outages like this one, but we did encounter an issue with our fallback mechanism that we need to understand more deeply.
For general communications, we did most of our early communication on the event using twitter, announcing the incident and giving updates. We need to build a more formal multi-pronged approach to communicating, including faster responses in the MSDN forums and here on HN, to make sure we are reaching as many of our customers and partners as possible. Thanks again for the feedback!!
^This
How about not rolling out a patch to all data centers at once?
Hi, this is Corey Sanders, an engineer on the Azure compute team. Yes, our normal policy for updates is to roll them in incremental batches. In this case, due to an operational error, we did not apply the changes as per normal policy.