Exclusive: OpenAI unveils protocol to stretch compute

4 min read Original article ↗

OpenAI is getting creative to deal with the industry's imminent compute crunch.

On Wednesday, the ChatGPT maker and a coalition of researchers from AMD, Broadcom, Intel, Microsoft, and Nvidia published a paper offering a rare look into company’s training stack, debuting a new compute networking protocol designed to make GPU clusters faster, more reliable and conserve precious compute cycles.

The protocol, which has been in the works for two years, is instrumental in OpenAI scaling the compute that it needs to continue building bigger and better models, noting in a blog post that the networking approach accelerates its vision for Stargate, the company’s long-term effort to garner the compute it needs to build and scale cutting-edge AI.

The paper introduces a protocol called MRC, or Multipath Reliable Connection, which essentially tackles two main issues with the networks that serve as the connective tissue of AI infrastructure: Congestion and failures, Mark Handley, OpenAI’s networking lead, told me. As GPU clusters grow, these are problems that become more arduous to solve.

  • This protocol relies on “packet spraying,” said Handley, which essentially scatters data along hundreds of paths in the network simultaneously to prevent any one network link from getting congested. This also reduces the amount of “tiers” in a GPU cluster, resulting in “flatter” networks that use up less of the data center’s compute and power.
  • When handling network failures, MRC detects and reroutes when paths go down in microseconds. This allows GPU clusters to continue training seamlessly, even if parts of the network break down.
  • Additionally, the MRC pairs with a protocol called SRv6, or IPv6 Segment Routing, which essentially tells data the exact path it needs to take through a network, rather than forcing the network switches to do the routing work themselves, further reducing the energy requirements of these switches and the data center more broadly.

“We want to use as much compute as we can get, but also we want to make sure that we're using it efficiently and effectively, and this is a critical component of that,” Greg Steinbrecher, OpenAI’s workload lead, told The Deep View in an exclusive interview.

The protocol is already in use in OpenAI and Microsoft’s largest training clusters, including the Oracle site in Abilene, Texas, and in Microsoft’s Fairwater supercomputers, and has been used to train multiple OpenAI models.

When implemented, this new protocol introduces several major downstream advantages, Steinbrecher told me. Conventional, large-scale AI training jobs are a “failure amplifier” for GPU clusters, he said: If one thing goes wrong, the ripple effect forces the process to grind to a halt, leaving GPUs to sit idle. Network congestion, additionally, slows the rate at which researchers can innovate.

MRC circumvents these issues, allowing OpenAI to “turn the crank on our entire research pipeline much faster,” Steinbrecher said. “That allows us to make better use of the resources that we have.”

The MRC specification is available through the Open Compute Project under an open license. Steinbrecher emphasized the importance of this, claiming that this protocol is not one in which OpenAI is trying to “differentiate,” but rather move the entire industry past what they consider a legacy bottleneck. Handley said that the infrastructure industry has reached a point where it’s worth establishing open standards, “as opposed to each of these large companies doing their own thing.”

“Several players in the industry have their own in-house implementations of protocols … that type of market fragmentation is bad for the networking industry,” Steinbrecher said. “You want everyone's energy going in one direction and pushing together, and then everyone moves faster as a result.”

Our Deeper View

At the end of the day, all roads lead to compute. The math is simple: The more efficient data centers and AI factories can be, the more capacity OpenAI has to train its models. And by making the protocol an open standard, the company stands to deepen the well of compute resources for the entire industry. In turn, that stands to benefit OpenAI. With MRC, OpenAI wants to unleash a rising tide that lifts all boats, which is increasingly critical as AI companies stare down a looming compute shortage.