After Three Years, Modular’s CUDA Alternative Is Ready


SAN JOSE, Calif. – Building a CUDA alternative was never going to be an easy task.

Chris Lattner’s team of 120 at Modular has been working on it for three years, aiming to replace not just CUDA, but the entire AI software stack from scratch.

“What does that take? Well, it’s actually pretty hard building a replacement for CUDA. It takes years,” Lattner told EE Times. “For the last three years, we’ve been working on the programming language, the graph compiler, the LLM optimizations, getting all these things sorted out, implemented at scale, tested and validated.”

Problems with the existing AI software stack stem from the fact that it emerged rapidly, and is still evolving very fast; layers get added quickly to keep up with new use cases and models. On top of CUDA today are libraries like OneMKL, vLLM for inference serving, Nvidia’s TensorRT-LLM, and now Nvidia’s NIM microservices—what Lattner calls “a gigantic stack of stuff.”

CUDA itself, Lattner pointed out, is 16 years old. In other words, it existed well before the generative AI use case and before GPU hardware features like tensor cores and FP4 were invented.

Chris Lattner (Source: LinkedIn)

What Lattner calls “disposable frameworks,” parts of the stack that get adopted but have a short shelf-life before being superseded, also do not help.

“Everything changes, and it’s not designed for generality, and it falls away,” he said. “What we’re building for enterprises is a technology platform that can actually scale, so they can keep up with AI.”

CUDA Alternatives

There have been other projects aiming to replace CUDA, or to provide some level of code portability from CUDA, or both.

One of the most successful has been the open-source project Apache TVM. TVM’s main aim is to enable AI to run efficiently on diverse hardware by automating kernel fusion, but generative AI proved to be a technical challenge, since its algorithms are larger and more complex than those used in older computer vision applications. Generative AI algorithms are also more hardware-specific (like FlashAttention). TVM’s core contributors formed a company called OctoAI, which developed a generative AI inference stack for enterprise clusters, but it was recently acquired by Nvidia, casting some doubt on the project’s future.

Another widely known technology, OpenCL, is a standard designed to enable code portability between GPUs and other hardware types. It has been broadly adopted in mobile and embedded devices. However, critics of this standard (including Lattner) point to its lack of agility to keep up with fast-moving AI technology, in part because it is driven by a “co-opetition” of competing companies who generally decline to share anything about future hardware features.

Other commercial projects of this nature are still in the early stages, Lattner said.

“There’s a big gap between building a demo, solving one model and one use case, versus building something that’s generalized at scale, that can actually take on the pace of AI research, which is very significant,” he said.

Modular, as a software-only company, is better positioned to build a stack that works for all hardware, according to Lattner.

“We just want software developers to use their silicon,” he said. “We’re helping to break down those barriers, investing over many years across many generations of hardware that can enable [that].”

Performant portability

Modular’s AI inference engine, Max, launched in 2023 with CPU support for x86 and Arm CPUs, and support for Nvidia GPUs was added recently. This means Modular now has a full-stack replacement for CUDA, including the CUDA programming language and the LLM serving stack that builds on top of it.

Crucially, Lattner said Max can meet the performance of CUDA for Nvidia A100 and H100 GPUs.

“[Nvidia] had a bit of a head start on us—they had the help of the entire world that was tuning for their hardware, and A100 at that point was 4 years old, and that was very well understood and optimized [for], so it was a very high bar,” he said. “What [meeting CUDA performance for A100] told me is: we have a stack that can scale and we have a team that can execute.”

Meeting or beating CUDA’s performance for generative AI inference on H100s took two months from first introducing H100 support—an achievement Lattner is confident the team can reproduce for its next target hardware: Nvidia Blackwell-generation GPUs.

“We are engineering this in a way that is able to scale,” Lattner said. “We got to competitive performance on H100 in two months, not two years, because [our] technology investment has allowed us to scale up and actually lean in to these problems.”

The eventual aim is to enable performant portability between all types of AI hardware.

“No other stack can do that,” Lattner said. “Even Nvidia doesn’t have a performance portability story… CUDA can sort of run on A100 and H100, but practically speaking, you have to rewrite your code to get good performance, because [Nvidia] introduced new features like the TMA units in the H100.”

Tensor memory accelerator, or TMA, units were introduced for Hopper-generation GPUs to allow the asynchronous transfer of tensors between global and shared memories. Performant portability is enabled by Modular’s higher level abstractions for hardware features like this. The company aims to become a bridge between chip makers and software developers who simply want to use the hardware, Lattner said.

“As we unlock [the power of this technology], which we’re just coming into now, we can enable an entirely new category of people to be able to program all the new hardware that’s coming on to the market, and do so in a consistent way,” he said. “Developers don’t have to know about all the complexity on the hardware side or on the AI research side. They can focus on building their agentic workflow or their custom RAG solution and benefit from all the innovation that’s happening in the ecosystem; we can make it simple and adoptable.”

Modular support for non-Nvidia GPUs and other types of accelerators will begin towards the end of 2025.

Cluster management

Modular is also working on cluster management features for its stack.

Traditional cloud systems offer elasticity—the ability to dynamically add more nodes to handle requests as demand goes up—but clouds based on GPUs work differently. Since GPUs are expensive, users commit to fixed blocks of GPUs over months or years, which Lattner said is comparable to buying and selling on-prem GPUs from a cost management perspective.

Also, generative AI workloads like chatbots are stateful; that is, they need to store and access previous inputs from the user for future sessions. This means the most efficient way to process queries from the same user is on the same node, rather than sending each query to any available node.
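As a rough illustration of why statefulness matters for routing, the following minimal sketch pins each user to one node by hashing the user ID. It is not Modular's router; the function name `pick_node`, the node names and the hashing scheme are all assumptions made for the example.

```python
import hashlib


def pick_node(user_id: str, nodes: list[str]) -> str:
    """Deterministically map a user to one node (hypothetical helper).

    The same user ID always lands on the same node, so any state cached
    there from earlier queries can be reused.
    """
    if not nodes:
        raise ValueError("at least one node is required")
    # Use a stable hash: Python's built-in hash() is salted per process.
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(nodes)
    return nodes[index]


# Hypothetical node names, for illustration only.
nodes = ["gpu-node-0", "gpu-node-1", "gpu-node-2"]
first = pick_node("user-42", nodes)
second = pick_node("user-42", nodes)
print(first == second)  # repeat queries from one user reach the same node
```

Simple modulo hashing like this reshuffles most users whenever a node is added or removed, and it ignores load and hardware differences, which is part of why routing across a real GPU cluster is harder than this sketch suggests.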

Add heterogeneous hardware types—even Nvidia GPUs with different size memories—to LLM layers that may be either memory- or compute-bound, and the level of complexity for platform teams increases. These teams are faced with managing constant changes in workload and demand from the multiple engineering teams in an AI business.

Modular has built data and control planes that route requests coherently across nodes, managing state and distribution across the cluster.

“You need a level of abstraction so that you can say, I want to throw this [workload] onto this many machines,” Lattner said. “So you need to be able to say which model runs best where. Typically, nobody really understands how any of this stuff works, but we do. And we can use the power of understanding this whole stack to say OK, we’ll build this intelligent router, we’ll actually put things in there to make it super easy to roll out and scale out. This is what we’re looking at right now, and it’s amazingly exciting.”

The idea is to intelligently route queries to the right hardware at the right time, given trade-offs like batch size and sequence length support. Separating parts of the workload onto the most appropriate GPUs is something the top 10 companies can do, Lattner said, but it is something almost everyone else would rather not think about.

“Instead of taking your AI away from you, we’ll give you the tools and technology to deploy it on your computer, whether it’s on-prem or in the cloud,” he said. “This is very different than a lot of [companies] who say AI is too hard, just give us all your data, all your models, and we’ll do it for you. We’re saying: It’s democratized. Let’s give it back to software developers. Let’s enable the platform team to own the AI.”

Modular exhibited at Nvidia’s GTC conference. How does Nvidia feel about this CUDA alternative? Does Modular fit into the CUDA ecosystem?

“It’s complicated,” Lattner said, noting that Nvidia has announced some forthcoming software features he considers Modular-inspired, including some that echo Modular’s Pythonic programming focus.

“[Nvidia’s enhancements] don’t exist yet, it doesn’t run on all the GPUs and I speculate that it will never run on anyone else’s GPUs,” Lattner said. “But I think it’s an incredible amount of validation in Modular’s approach. I welcome good ideas in this space, and I’m very happy that they [Nvidia] also think we’re working in the right direction.”
