NVIDIA's moat is not what it used to be
For years, the lock-in was dead simple: CUDA.
Want top-tier performance? You wrote CUDA. Do that once, and you were all-in on NVIDIA. The ecosystem compounded—libraries, tooling, docs, talent—everything reinforcing the same gravity well.
That world is starting to crack.
We’re entering a phase where low-level code isn’t a rare skill anymore. Models are now capable of generating kernels, bindings, and glue code—good enough to get a first version running fast and iterate from there. The switching cost to a new accelerator is dropping quickly. What used to require a dedicated team now often looks like a decent prompt plus a few review passes.
There’s an old David Wheeler line: “All problems in computer science can be solved by another level of indirection.” AI codegen is exactly that extra level of indirection, applied to hardware portability.
At the same time, the economics are shifting.
For many real workloads—especially inference—VRAM matters more than peak FLOPS. You want models resident in memory, batching cleanly, with predictable latency. On a dollars-per-GB basis, AMD is starting to look compelling. Newer cards bring stronger low-precision throughput (FP8/INT4), structured sparsity, and significantly more memory on mainstream SKUs. If you’re running open models and care about cost/throughput, you’re at least evaluating them.
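To make the dollars-per-GB framing concrete, here is a minimal sketch. The SKU names and prices are hypothetical placeholders, not real quotes; the point is the metric, not the numbers.

```python
# Hypothetical accelerator SKUs -- prices and VRAM sizes are illustrative
# placeholders, not vendor quotes.
skus = {
    "vendor_a_flagship": {"price_usd": 30000, "vram_gb": 80},
    "vendor_b_highmem":  {"price_usd": 15000, "vram_gb": 192},
}

def dollars_per_gb(price_usd: float, vram_gb: float) -> float:
    """For memory-bound inference, $/GB of resident VRAM is often the
    binding constraint, not peak FLOPS."""
    return price_usd / vram_gb

for name, s in skus.items():
    print(f"{name}: ${dollars_per_gb(s['price_usd'], s['vram_gb']):.2f}/GB")
```

With these made-up numbers the high-memory card comes out at a fraction of the cost per gigabyte, which is exactly the comparison a team running open models at fixed batch sizes would make first.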
Intel is entering the mix as well. Battlemage (Arc Pro B-series) pushes high VRAM configurations with competitive price/perf for local inference. Not dominant, but another viable option that didn’t exist in the CUDA-only world.
Then there’s supply.
NVIDIA has built enormous demand and maintained pricing power. But scarcity cuts both ways. If you can’t get hardware—or only at extreme prices—people explore alternatives. Startups take what they can get. Infra teams design for heterogeneity. Open source adapts to whatever is available.
This is how moats erode: not via a single replacement, but through many small workarounds that become standard.
Two datapoints from actually standing up a modern model serving stack:
1. In my recent GLM 5.1 deployment on 8xB200s, getting a novel model to serve reliably was painful. It took me ~12–13 minutes of cold starts (many!), random restarts, non-obvious flags, kernel warmups, and graph captures just to reach a stable baseline. Most of that wasn’t “AI”—it was infra whack-a-mole across memory limits, runtimes, and config quirks.
2. Even once it was running, it was fragile. I kept hitting issues like streaming tool calls producing invalid JSON because the model output, the server-side parser, and the client SDK were out of sync. Fixing it required patches across multiple layers just to get consistent outputs. Real systems are leaky—far from clean abstractions.
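The streaming-JSON failure mode above has a common defensive workaround: buffer the streamed argument fragments and only parse once the accumulated text is valid JSON, instead of parsing each chunk as it arrives. A minimal sketch (the fragment boundaries are invented for illustration; real streaming APIs chunk arbitrarily):

```python
import json

def accumulate_tool_call(fragments):
    """Buffer streamed tool-call argument fragments and parse only when the
    accumulated text forms a complete JSON object. Chunk boundaries from a
    streaming API rarely align with JSON structure, so parsing per-chunk
    fails on partial objects."""
    buf = ""
    for frag in fragments:
        buf += frag
        try:
            return json.loads(buf)   # complete object: done
        except json.JSONDecodeError:
            continue                 # still partial; keep buffering
    raise ValueError(f"stream ended with incomplete JSON: {buf!r}")

# Fragments split mid-string and mid-object, as real streams do:
args = accumulate_tool_call(['{"query": "wea', 'ther", "units": ', '"metric"}'])
print(args)
```

This doesn't fix a model that emits genuinely malformed JSON, but it eliminates the class of bugs where each layer assumes the previous one delivered a syntactically complete payload.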
That’s the actual moat: not CUDA, but the entire stack—libraries, compilers, interconnects, and years of ops knowledge.
But it’s early.
CUDA isn’t just a platform; it’s a decade-plus of battle-tested infra. Getting something to run is one thing. Getting it to run well at scale is still difficult, with performance cliffs in exactly the wrong places.
And NVIDIA is moving up the stack aggressively—higher-level APIs, inference tooling, tighter framework integration. Blackwell-class hardware pushes further efficiency (e.g., low-precision compute like FP4) and targets memory-bound inference directly. If abstractions become the battlefield, they’re positioning to control that layer too.
So what happens:
* Near term: NVIDIA continues to dominate. Demand is still growing fast and they remain the default.
* Medium term: the edges fray. Inference becomes more heterogeneous. AMD and Intel pick up share where cost and memory dominate.
* Long term: value shifts upward—to models, data, orchestration. Hardware still matters, but becomes more interchangeable at the margin.
Bottom line: CUDA used to be a wall. Now it’s closer to a speed bump. AI didn’t remove the moat—it just made it much easier to cross when there’s a reason to. The moat was never CUDA syntax. It was ten years of documented failure modes, and that’s not something you can prompt your way out of. Whoever ends up owning the serving layer inherits it.