(Continued from Part 3)
Azure has operated under constant strain for as long as I can remember.
Even during the periodic “quality pushes,” the backlog of issues never shrank; it only grew.
In the spring and summer of 2024, a major push began to raise the number of VMs each node could host.
The business case was straightforward: scaling up density on existing servers is far cheaper than building new data centers.
On-premises Azure deployments had always been capped at 16 VMs per node.
Microsoft’s own commercial clouds had run at 32 until that year, still a tiny fraction of the 1,024 the hypervisor itself could theoretically support.
The goal was a 50% increase to 48 VMs per node, with 64 as the longer-term target.
What should have been a matter of removing a few arbitrary software limits turned into a 50% increase in crashes and incidents. The problems scaled in exact proportion to the density.
Earlier, while I was still working on the hypervisor interface re-engineering plan for the bottom of the Azure node stack, I had run a study with the Core OS team that owned the other side of the Hypervisor API.
Call-trace data showed the node agents collectively hammering the hypervisor through its WMI user-mode interface at up to 10,000 calls per second during peak bursts.
The Hyper-V team had no visibility into which agents were responsible or why so many calls were necessary. On our side, no one could give a definitive answer either.
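For context, each of those calls is an ordinary user-mode WMI operation against the Hyper-V provider. Here is a representative sketch (not the agents’ actual code) that queries the root\virtualization\v2 namespace for the VMs on a node; error handling is trimmed for brevity:

```cpp
#include <windows.h>
#include <wbemidl.h>
#include <comdef.h>
#include <cstdio>
#pragma comment(lib, "wbemuuid.lib")

int main() {
    CoInitializeEx(nullptr, COINIT_MULTITHREADED);
    CoInitializeSecurity(nullptr, -1, nullptr, nullptr,
        RPC_C_AUTHN_LEVEL_DEFAULT, RPC_C_IMP_LEVEL_IMPERSONATE,
        nullptr, EOAC_NONE, nullptr);

    IWbemLocator* locator = nullptr;
    CoCreateInstance(CLSID_WbemLocator, nullptr, CLSCTX_INPROC_SERVER,
        IID_IWbemLocator, reinterpret_cast<void**>(&locator));

    // Connect to the Hyper-V WMI provider on the local host.
    IWbemServices* services = nullptr;
    locator->ConnectServer(_bstr_t(L"ROOT\\virtualization\\v2"),
        nullptr, nullptr, nullptr, 0, nullptr, nullptr, &services);
    CoSetProxyBlanket(services, RPC_C_AUTHN_WINNT, RPC_C_AUTHZ_NONE,
        nullptr, RPC_C_AUTHN_LEVEL_CALL, RPC_C_IMP_LEVEL_IMPERSONATE,
        nullptr, EOAC_NONE);

    // Each query is a synchronous round trip through the WMI service;
    // many agents polling like this in tight loops is how the aggregate
    // reached thousands of calls per second.
    IEnumWbemClassObject* results = nullptr;
    services->ExecQuery(_bstr_t(L"WQL"),
        _bstr_t(L"SELECT * FROM Msvm_ComputerSystem"),
        WBEM_FLAG_FORWARD_ONLY | WBEM_FLAG_RETURN_IMMEDIATELY,
        nullptr, &results);

    IWbemClassObject* vm = nullptr;
    ULONG returned = 0;
    while (results->Next(WBEM_INFINITE, 1, &vm, &returned) == S_OK) {
        VARIANT name;
        vm->Get(L"ElementName", 0, &name, nullptr, nullptr);
        wprintf(L"%s\n", name.bstrVal);
        VariantClear(&name);
        vm->Release();
    }
    // Interface releases elided for brevity.
    CoUninitialize();
}
```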
At that point, it became clear that the Overlake offload port would never happen, not only because of the dependencies I described earlier but also because of the stack’s sheer dynamic behavior.
The Hyper-V team had planned a cleaner, HCS-style interface with a gRPC frontend, but the Azure team, under tight timelines, decided to press ahead with the existing VM abstraction layer (VMAL) and keep calling through WMI on the host as a stopgap.
Even setting aside the Linux-port issues, the call volume alone made the plan untenable, before factoring in the 50% and eventually 100% density increases expected to be layered on top.
These elements combined into what I came to see as an unsustainable stretch of work, a plan that lacked the necessary depth and visibility to succeed.
I stepped away from that part of the organization. The principal engineer who inherited the effort, a highly respected Windows veteran who had led the ARM32 port back in the Windows 8 era, lasted ten months before he, too, left the team.
The VM management stack never ran offloaded from the Overlake/Azure Boost SoC.
After stepping away from the VM density and offload work, I turned my attention to another foundational piece of the Azure node stack: the set of components the team called the “instance metadata services.”
The name was borrowed from Amazon’s EC2.
On Azure, it consists of a customer-facing web server (“WireServer”) running on each node’s host OS, together with supporting service components.
One of its endpoints is publicly documented and intended to provide information to guest VMs.
What stood out was that this web service runs on the host OS, the secure side of the machine.
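Reaching it from a guest takes nothing more than an unauthenticated HTTP request. Here is a minimal sketch against the publicly documented IMDS endpoint; the 169.254.169.254 address, the Metadata: true header, and the api-version parameter come from Azure’s public documentation, and error handling is omitted:

```cpp
#include <windows.h>
#include <winhttp.h>
#include <cstdio>
#pragma comment(lib, "winhttp.lib")

int main() {
    // IMDS sits on a link-local address, reachable only from inside
    // the VM, and it rejects proxied requests.
    HINTERNET session = WinHttpOpen(L"imds-sketch/1.0",
        WINHTTP_ACCESS_TYPE_NO_PROXY, WINHTTP_NO_PROXY_NAME,
        WINHTTP_NO_PROXY_BYPASS, 0);
    HINTERNET conn = WinHttpConnect(session, L"169.254.169.254",
        INTERNET_DEFAULT_HTTP_PORT, 0);
    HINTERNET req = WinHttpOpenRequest(conn, L"GET",
        L"/metadata/instance?api-version=2021-02-01",
        nullptr, WINHTTP_NO_REFERER, WINHTTP_DEFAULT_ACCEPT_TYPES, 0);

    // The only required credential is this header.
    WinHttpAddRequestHeaders(req, L"Metadata: true", (DWORD)-1L,
        WINHTTP_ADDREQ_FLAG_ADD);
    WinHttpSendRequest(req, WINHTTP_NO_ADDITIONAL_HEADERS, 0,
        WINHTTP_NO_REQUEST_DATA, 0, 0, 0);
    WinHttpReceiveResponse(req, nullptr);

    char buf[4096];
    DWORD read = 0;
    while (WinHttpReadData(req, buf, sizeof(buf), &read) && read > 0)
        fwrite(buf, 1, read, stdout); // JSON instance document

    WinHttpCloseHandle(req);
    WinHttpCloseHandle(conn);
    WinHttpCloseHandle(session);
}
```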
Virtual machines are designed to provide strong isolation. A guest VM is both a containment and a security boundary: escaping it is difficult, and other VMs on the same node, as well as the host, share almost nothing with it.
A less obvious fact is that the host OS is not isolated from the VMs in the same way.
The memory pages belonging to each VM partition are mapped into processes on the host. On Windows, these are the vmmem.exe processes.
This mapping is necessary for practical operations such as saving a VM’s state to disk, including its full memory contents.
The direct corollary is that any successful compromise of the host can give an attacker access to the complete memory of every VM running on that node. Keeping the host secure is therefore critical.
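The public Windows Hypervisor Platform API makes the relationship easy to see: a host process allocates ordinary memory and maps it into a partition’s guest-physical address space, so host-side code can read guest memory at will. A minimal sketch, assuming the Windows Hypervisor Platform feature is enabled (this is the public API, not Azure’s production path):

```cpp
#include <windows.h>
#include <WinHvPlatform.h>
#include <cstdio>
#pragma comment(lib, "WinHvPlatform.lib")

int main() {
    WHV_PARTITION_HANDLE partition = nullptr;
    WHvCreatePartition(&partition);

    WHV_PARTITION_PROPERTY prop = {};
    prop.ProcessorCount = 1;
    WHvSetPartitionProperty(partition,
        WHvPartitionPropertyCodeProcessorCount, &prop, sizeof(prop));
    WHvSetupPartition(partition);

    // The "guest RAM" is ordinary memory in this host process; anything
    // running in the process can read or write it directly.
    const UINT64 size = 0x100000; // 1 MiB of guest-physical memory
    void* hostView = VirtualAlloc(nullptr, size,
        MEM_COMMIT | MEM_RESERVE, PAGE_READWRITE);
    WHvMapGpaRange(partition, hostView, /*GuestAddress=*/0, size,
        WHvMapGpaRangeFlagRead | WHvMapGpaRangeFlagWrite |
        WHvMapGpaRangeFlagExecute);

    // The host can now inspect guest memory at will.
    printf("first guest byte: %u\n",
        static_cast<unsigned char*>(hostView)[0]);

    WHvDeletePartition(partition);
}
```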
In that context, hosting a web service that is directly reachable from any guest VM and running it on the secure host side created a significantly larger attack surface than I expected.
In that same period, another team introduced the Metadata Security Protocol, which aims to enhance the security of Azure metadata services by adding HTTP headers that contain a hash-based message authentication code.
While this new protocol is a welcome addition to mitigate illegitimate requests, it does not address the core concern I had about an attack directed at the web server itself.
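The mechanism itself is ordinary HMAC. A sketch of the idea using Windows CNG, computing an HMAC-SHA256 over a canonicalized request; the canonical string, key provisioning, and header encoding here are my own illustrative assumptions, not the protocol’s actual definition:

```cpp
#include <windows.h>
#include <bcrypt.h>
#include <string>
#include <vector>
#include <cstdio>
#pragma comment(lib, "bcrypt.lib")

// HMAC-SHA256 of `message` under `key`, via CNG.
std::vector<unsigned char> HmacSha256(const std::vector<unsigned char>& key,
                                      const std::string& message) {
    BCRYPT_ALG_HANDLE alg = nullptr;
    BCRYPT_HASH_HANDLE hash = nullptr;
    std::vector<unsigned char> mac(32); // SHA-256 digest length

    BCryptOpenAlgorithmProvider(&alg, BCRYPT_SHA256_ALGORITHM, nullptr,
                                BCRYPT_ALG_HANDLE_HMAC_FLAG);
    BCryptCreateHash(alg, &hash, nullptr, 0,
                     const_cast<unsigned char*>(key.data()),
                     static_cast<ULONG>(key.size()), 0);
    BCryptHashData(hash, (PUCHAR)message.data(),
                   static_cast<ULONG>(message.size()), 0);
    BCryptFinishHash(hash, mac.data(), static_cast<ULONG>(mac.size()), 0);

    BCryptDestroyHash(hash);
    BCryptCloseAlgorithmProvider(alg, 0);
    return mac;
}

int main() {
    // Hypothetical per-VM shared secret and canonical request string.
    std::vector<unsigned char> key = {1, 2, 3, 4};
    std::string canonical = "GET\n/metadata/instance\n2024-06-01T00:00:00Z";

    // Hex-encode the MAC as it might appear in an authentication header.
    for (unsigned char b : HmacSha256(key, canonical)) printf("%02x", b);
    printf("\n");
}
```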
Many VM-escape exploits target vulnerabilities in the virtual device drivers that sit halfway between the host and the VMs. Running a web server on the host OS, with unsecured endpoints exposed to every guest VM, poses a greater risk still, whether or not the requests reaching it are signed.
My recommendation was to remove WireServer and IMDS from the nodes entirely, a view endorsed without reservation by a VP security architect, the author of a popular book on threat modeling, with whom I discussed my concerns.
Upon further digging, I discovered that WireServer was maintaining in-memory caches containing unencrypted tenant data, all mixed in the same memory areas, in violation of all hostile multi-tenancy security guidelines.
It is conceivable that, with a little poking, an attacker could obtain data, including secrets such as certificates, belonging to other tenants on the node.
Moreover, the code was leaking cached entries, and sometimes entire caches, because of misunderstood memory-ownership rules, and it suffered a staggering crash rate: on the order of 300,000 to 500,000 crashes per month across the fleet for the WireServer web server alone.
New code was throwing C++ exceptions in a codebase originally written to be exception-free. The team’s coding guidelines directly contradicted those of the larger organization, and its testing practices included no long-running tests, so memory leaks and similar defects went undetected.
The codebase had reached a point where any refactoring or engineering improvement was considered too risky. I submitted several bug fixes and refactorings, notably introducing smart pointers, but they were rejected for fear of breaking something.
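To illustrate the class of bug involved (this is a sketch, not the actual WireServer code), consider a cache keyed by tenant, first with raw owning pointers and then with std::unique_ptr:

```cpp
#include <map>
#include <memory>
#include <stdexcept>
#include <string>

struct CacheEntry { std::string tenantBlob; };

// Before: raw owning pointers. If validation throws, `entry` leaks;
// and on overwrite, the map's previous value is never deleted.
void InsertUnsafe(std::map<std::string, CacheEntry*>& cache,
                  const std::string& key, const std::string& blob) {
    CacheEntry* entry = new CacheEntry{blob};
    if (blob.empty())
        throw std::runtime_error("bad blob"); // leaks `entry`
    cache[key] = entry; // leaks the entry it replaces, if any
}

// After: unique_ptr makes ownership explicit and both leak paths vanish.
void InsertSafe(std::map<std::string, std::unique_ptr<CacheEntry>>& cache,
                const std::string& key, const std::string& blob) {
    auto entry = std::make_unique<CacheEntry>(CacheEntry{blob});
    if (blob.empty())
        throw std::runtime_error("bad blob"); // entry freed on unwind
    cache[key] = std::move(entry); // replaced entry destroyed automatically
}
```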
This further illustrates the pervasive gap in technical leadership throughout the organization.
I described the WireServer/IMDS subsystem running on each Azure node as a “walking security liability” that should be moved off the nodes, a view shared by many stakeholders outside the organization. The team’s plan for Overlake was to rebuild the same design under a different name, this time exposing the Azure Boost SoC to any guest VM through a direct network connection.
These services should be hosted as first-party cloud services, with a credential/secrets cache inside each VM that needs it, containing only that VM’s secrets, encrypted with the help of a vTPM where applicable.
This arrangement would also have worked well in bare-metal scenarios as an opt-in package leveraging the physical TPM.
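A sketch of what the in-guest cache could look like on Windows, assuming the vTPM (or physical TPM, in the bare-metal case) is reachable through the platform crypto provider; the key name and flow are hypothetical:

```cpp
#include <windows.h>
#include <ncrypt.h>
#include <vector>
#include <cstdio>
#pragma comment(lib, "ncrypt.lib")

int main() {
    NCRYPT_PROV_HANDLE prov = 0;
    NCRYPT_KEY_HANDLE key = 0;

    // The platform crypto provider stores keys in the TPM/vTPM; the
    // private key never leaves the device.
    NCryptOpenStorageProvider(&prov, MS_PLATFORM_CRYPTO_PROVIDER, 0);
    NCryptCreatePersistedKey(prov, &key, NCRYPT_RSA_ALGORITHM,
                             L"PerVmSecretsCacheKey", 0, 0);
    NCryptFinalizeKey(key, 0);

    // Encrypt one secret for the local cache. Only this VM's (v)TPM can
    // decrypt it, so nothing usable sits on the host or the node.
    unsigned char secret[] = "tenant certificate bytes";
    DWORD needed = 0;
    NCryptEncrypt(key, secret, sizeof(secret), nullptr,
                  nullptr, 0, &needed, NCRYPT_PAD_PKCS1_FLAG);
    std::vector<unsigned char> ciphertext(needed);
    NCryptEncrypt(key, secret, sizeof(secret), nullptr,
                  ciphertext.data(), (DWORD)ciphertext.size(),
                  &needed, NCRYPT_PAD_PKCS1_FLAG);
    printf("cached %lu encrypted bytes\n", (unsigned long)needed);

    NCryptFreeObject(key);
    NCryptFreeObject(prov);
}
```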
The org’s leadership responded with strong defensiveness and denial. Not long afterward, the organization terminated my employment.