Tesla turns on 10k-node Nvidia H100 Cluster
techradar.com
I'm confused. The article from September 1 linked to here is strangely future-tense ("But the firm’s latest investment in 10,000 of the company’s H100 GPUs dwarfs the power of this supercomputer....This AI cluster, worth more than $300 million, will offer a peak performance...").
It links to a Tom's Hardware article (https://www.tomshardware.com/news/teslas-dollar300-million-a...) from August 28 that says "Tesla is about to flip the switch on its new AI cluster, featuring 10,000 Nvidia H100 compute GPUs" and "Tesla is set to launch its highly-anticipated supercomputer on Monday..." (presumably the September 1 event).
So, like, does Tesla actually have 10k H100s? Or do they have an order for 10k H100s? Or an intention to buy 10k H100s?
Is the sole source for these articles this (https://twitter.com/SawyerMerritt/status/1696011140508045660) random Twitter post by some guy who runs an online clothing company?
I don't mean to snipe, but this article doesn't seem to rise to the extremely high editorial standards of such tech-press luminaries as "TechRadar" and "Hacker News".
> high editorial standards of such tech-press luminaries as "TechRadar" and "Hacker News".
If you had just scrolled a little bit further on that Twitter post that you linked, you would’ve seen these:
https://x.com/sawyermerritt/status/1696012091964915744
https://x.com/tim_zaman/status/1695488119729238147
Also, just FYI: Sawyer posts most of the Tesla and SpaceX breaking news on Twitter before major outlets even write their articles.
For example, here’s one from just 12 minutes ago, as confirmed by Elon: https://x.com/sawyermerritt/status/1728092021628313777
A “random Twitter post by some guy who runs an online clothing company” is definitely a wrong assumption.
I think you only see the additional tweets you're talking about if you're for whatever reason actually signed in to Twitter.
> If you had just scrolled a little bit further on that Twitter post that you linked, you would’ve seen these:
I don't see those when I scroll. I see
"Buckle up everyone, the acceleration of progress is about to get nutty!"
and this is the end of the post?
Maybe I'm misusing this thing?
> https://x.com/tim_zaman/status/1695488119729238147
So another guy who claims to be a Tesla employee says (again, strangely future tense) that this is true? I mean, I am willing to believe--'cause he paid $20 for a blue check--that he probably is a Tesla employee.
But the use of future tense is a bit weird, right? And the lack of any followup?
> A “random Twitter post by some guy who runs an online clothing company” is definitely a wrong assumption.
I guess I'm old. Back in my day, "evidence" wasn't some random dude's online posts. But I know things have changed. ;)
==
More seriously:
https://www.hpcwire.com/2023/08/17/nvidia-h100-are-550000-gp... says Nvidia is producing 550k H100s in 2023. And there's obviously a significant lead-time requirement.
So, yes, I can sorta imagine Tesla pre-ordered 2% of the global supply of H100s (10,000 of ~550,000) early in 2023 and was bragging about it at the end of August just 'cause.
But I can also imagine this is smoke and mirrors, and they have, like, a handful with the rest on backorder, and we haven't heard more about it 'cause Tesla doesn't have marketing people, it just has wahoos who post things on Twitter.
Either way, I guess?
> Maybe I'm misusing this thing?
That seems to be the case here. ;)
> So another guy who claims to be a Tesla employee says (again, strangely future tense) that this is true? I mean, I am willing to believe--'cause he paid $20 for a blue check--that he probably is a Tesla employee.
Another case of misuse? Here’s a tip for you: when you see a company logo/icon on someone's Twitter/X profile, that means they are verified as affiliated with that org.
“Accounts affiliated with the organization will receive an affiliate badge on their profile with the organization’s logo, and will be featured on the organization’s Twitter profile, indicating their affiliation. “
https://twitter.com/verified/status/1641596848921276417
Instead of inferring that Tim Zaman is a random Twitter user who paid $20 for a blue check, why not just Google his name? ;)
https://letmegooglethat.com/?q=Tim+Zaman
> I guess I'm old. Back in my day, "evidence" wasn't some random dude's online posts. But I know things have changed. ;)
I linked a video where CNBC was interviewing Sawyer but it seems that you didn’t even bother to check it.
This seems to be the problem today. People refuse to do the bare minimum (which is not even much) required for critical thinking. Instead of verifying information, people tend to uncritically repeat inaccurate assumptions, even when provided with additional information in good faith.
Sure. I’m being a bit snarky. But I think the point stands that a single tweet from an employee saying “we’re about to do $thing” doesn’t exactly mean that, two months later, we should be reading a story whose sole origin is that tweet as evidence that $thing actually ended up happening.
Like, what's the actual news story here?
Totally agree there’s a lack of critical thinking at play.
Also, I think the X.com links only work if you have a login or something, fyi.
I understand that the H100 is Nvidia's leading-edge chip, but can someone let me know if 10K is considered a big cluster?
I've never worked inside one of the leading-edge AI companies like OpenAI, Google, Microsoft, or Meta.
Is this comparable to what they would work with?
My first guess is that it seems much smaller. And if you are running many parallel training jobs then you are getting about 1,000 chips at most to work with.
Or is this about what the leading competitors are working with?
Azure, for one, seems to have orders of magnitude more chips at their disposal.
10k H100 chips is considered a very large cluster. The third-fastest supercomputer in the world is Microsoft’s Eagle, with 14k H100s: https://www.top500.org/lists/top500/2023/11/
Ah, gotcha, so it's the fact that it's 10,000 chips in one dedicated cluster that makes it large, as opposed to Azure, which has an order of magnitude more GPUs but rents many of those out.
High performance on a single task requires simultaneous computation and communication between nodes. If there's high latency between nodes, such as between nodes in different data centers, the communication costs can't be masked by computation.
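A toy way to see it (all the numbers below are made up purely for illustration, not measurements from any real cluster):

    # Per step, communication can hide behind computation only while
    # comm_time <= compute_time; once latency/bandwidth push it past that,
    # the extra time shows up directly in every step.
    def step_time(compute_s, bytes_exchanged, bandwidth_Bps, latency_s):
        comm_s = latency_s + bytes_exchanged / bandwidth_Bps
        return max(compute_s, comm_s)  # assumes perfect overlap

    compute_s = 0.5          # pretend half a second of math per step
    payload = 10e9           # pretend 10 GB exchanged between nodes per step

    # same-hall cluster: ~50 GB/s links, ~10 microseconds of latency
    print(step_time(compute_s, payload, 50e9, 10e-6))    # ~0.5 s, comm is hidden
    # cross-datacenter: ~1.25 GB/s WAN link, ~30 ms of latency
    print(step_time(compute_s, payload, 1.25e9, 30e-3))  # ~8 s, comm dominates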
I guess Azure's are spread out too. Latency is higher across worldwide datacentres.
I previously ran 150,000 AMD GPUs. 10k doesn't seem that large. =)
That said, these GPUs aren't just the GPUs. They're whole chassis: huge onboard storage arrays, TBs of RAM, 800G networking (and associated cables), racks, cooling, power distribution, backup power, etc...
None of it is easy.
Out of interest, what did you use all that compute for?
ETH PoW. When ETH switched to PoS, we shut it all down. It sure was fun while it lasted, not many people on the planet have run that much compute.
I did a lot of unique optimizations to autotune each individual GPU for performance by tweaking the software knobs on them. They are all snowflakes. Same model, different batches (heck, even the same batch!) can produce wildly different performance results.
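In spirit, the per-card loop looked something like this (a heavily simplified sketch, not the real code; the knob names, ranges, and the fake "measure" function are purely illustrative):

    import itertools, random

    def measure(gpu, core_offset, mem_offset):
        # stand-in for: apply the offsets via the driver, run the workload
        # for a while, read back the hashrate; here each simulated card just
        # has its own hidden sweet spot
        return (30.0
                - 0.01 * (core_offset - gpu["best_core"]) ** 2
                - 0.001 * (mem_offset - gpu["best_mem"]) ** 2)

    def autotune(gpu, core_range, mem_range):
        best_rate, best_cfg = float("-inf"), None
        for core, mem in itertools.product(core_range, mem_range):
            rate = measure(gpu, core, mem)
            if rate > best_rate:
                best_rate, best_cfg = rate, (core, mem)
        return best_cfg, best_rate

    random.seed(1)
    fleet = [{"best_core": random.randint(-50, 150),
              "best_mem": random.randint(300, 900)} for _ in range(3)]
    for i, gpu in enumerate(fleet):
        cfg, rate = autotune(gpu, range(-100, 201, 25), range(200, 1001, 100))
        print(f"gpu{i}: best offsets {cfg}, ~{rate:.1f} MH/s")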
Over the years, I did try to find some alternative workloads for it, but nothing could even pay for the power costs. The GPUs were very old models (RX 470-RX 580) and the rest of the hardware wasn't as advanced as what you see in AI clusters, so none of it was transferable.
I'm in the process of building my own AI supercomputer now. Really looking forward to seeing how it turns out.
Make a vid. Or a blog post, at least. Please :)
Thanks, but not my style, sorry! I've been doing PoW mining since 2014 and have so many stories, I've forgotten half of them. I wouldn't even know where to start on trying to document any of it.
Perhaps reach out to a YouTube channel or podcast that could be interested?
Did you manage to recoup the investment?
Of course I can't say anything about that other than I did the job I was hired to do, and I performed far above anyone's wildest expectations.
Nobody else on the planet was able to automate the tuning like I did, which had a direct influence on ROI. I know this because it required a very specific change to the AMD drivers to enable that functionality.
Classified I imagine.
H100-based DGX/HGX doesn't use 800 Gbit per GPU (it doesn't have the PCIe bandwidth); it uses 400 per GPU.
I was talking about between nodes. We're planning on bonding 2x400G NICs to get that 800G between nodes.
That said, the latest 4th-gen NVLink is 900 GB/s per GPU...
https://www.nvidia.com/en-us/data-center/nvlink/
But unless you're sleeping with Jensen, you're not going to see it for 52 weeks+ after you order it.
Between the GPUs you already have 3.2 Tbit/s (8 GPUs x 400G each), plus the 2x400 separately. Pretty sweet.
Our lead time hasn't been horrible actually, but I work for a pretty big corp
It is amazing to me how it is all about who you know. We just got a higher level contact and magically nvidia nic's just appeared in our BOM.
This is a big cluster, definitely large enough to pretrain 100B+ parameter LLMs in months. Source: I work at Databricks on the ML platform.
I don’t know much about AV processing; that’s highly customized to only a few customers, but I’d expect it to also have very large computational requirements for video processing and reinforcement learning.
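To put very rough numbers on "large enough": a back-of-envelope using the common ~6*N*D rule of thumb for training FLOPs. Every specific figure below is my own illustrative assumption (token count, utilization), not a Databricks or Tesla number:

    params = 100e9        # 100B-parameter model
    tokens = 10e12        # assumed training tokens
    train_flops = 6 * params * tokens          # ~6*N*D rule of thumb

    h100_bf16 = 989e12    # dense BF16 FLOP/s per H100 SXM (datasheet peak)
    mfu = 0.4             # assumed achievable utilization
    cluster_flops = 10_000 * h100_bf16 * mfu

    print(round(train_flops / cluster_flops / 86_400))  # ~18 days: weeks, comfortably within "months"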
The most powerful listed supercomputer has 37,888 AMD Instinct MI250X GPUs, so this is in the same order of magnitude.
Interesting choice of words... I take it you work for OpenAI? :) How large is their/'your' cluster? Probably the biggest in the world by now..
Parent is almost certainly talking about Frontier, the supercomputer with the US Department of Energy[0].
Yes, that's "listed".. I'm curious how big the "unlisted" cluster is.
Unfortunately no, but there are almost certainly clusters in the hands of private companies and government organizations that would prefer not to advertise their capabilities.
Last I heard, the estimate was that NVIDIA would build 550k units in 2023, so 2% of all production — especially as at least six others (your four plus Apple and at least one intelligence agency) will be of similar size by themselves — is certainly non-negligible.
550k H100s? Who is buying these? They are hella expensive and China isn't allowed to have them.
Other than the ~12% I just estimated, lots of large-but-not-famous places will be buying ~1k, and small places will be buying tens to hundreds, and quite a lot of AI bubble money will be invested in startups that claim they only need one.
There's probably some scientific modelling that can be done on these, so I bet some universities and private labs will be buying them. NASA, SpaceX, RocketLab, Helion, etc.
There's also probably a lot of AAA game studios and art studios for movies etc. who are each buying dozens of these graphics processing units for… graphics :P
Government agencies.
The Big Cloud
It's a small cluster the size of a large cluster.
What happened to their custom hardware training stack Dojo? They had some interesting ideas there. The last I heard, they had one of those tiles "working" in the lab. Pretty far from a production setup.
I can imagine they either underestimated the software effort needed to squeeze as much performance as possible out of those things, or they underestimated the pace at which Nvidia scales FLOPS/$, or both.
They probably want any and all compute they can get. This doesn't exclude Dojo, nor the previous-generation Nvidia chips they already have.
Vaporware, just like much of what Musk talks about.
Reusable rockets, electric cars, solar panels...
What would you say grants you the standing to opine here?
All of his other false or misleading statements over the last 10 years.
When we've dealt with the oil companies, the chemical manufacturers dumping PFAS into our kids, and the industrial war machine, maybe then we can start complaining about the guy biting off more than he can chew trying to be constructive.
Until then, all of you sound vicious, bitter, and hypocritical.
We are perfectly capable of having issues with all of those as well. I can still ask for a speedbump on my street while also voicing concern about military posturing… Crazy, I know! Musk passed ‘guy biting off more than he can chew’ when he started accusing heroes of being pedophiles.
I'm fairly certain all of those existed prior to Musk's suggestion of them.
He delivered on them though, right? Also, reusable rockets didn’t exist?
John Carmack on a shoestring budget nearly got this working at Armadillo. If he had more money and time he would've had it working a half decade before SpaceX.
"Nearly" and "if he had more money and time" is "no".
And given how much faster SpaceX has been than anyone else, I can only believe this "could've, would've" in form of the slightly longer hypothetical "if only Carmack hired all the rocket scientists (and raised all the money to give them freedom to go fast) before Musk got there".
Got it, so no then right.
I love Carmack, but having an idea for something is infinitely easier than actually successfully executing on it, especially as incredibly successfully as SpaceX.
Rocketry involves building a lot of prototypes and blowing up a lot of things when mistakes happen.
Carmack did execute, they had a working rocket, and with time they could've solved the software issues, but they didn't have the funding to blow up dozens of prototypes.
Carmack drove that project, he wrote code, he built rocket engines himself, he ran missions.
Elon, notably, doesn't do any of that. He just had more money, or was willing to commit more money, to seeing it through. For Carmack it was more of a fun diversion (the X Prize) than a business he wanted to build.
In general I think what is so often missed on HN, which is so ironic given this is in large part a board for founders or those who aspire to be (I thought), is the challenge and effort and success of building companies and teams that execute on larger-than-life projects. It is incredibly hard to recruit the best talent in the world, assemble them, and make it work. The leader must be incredibly competent and believable to do that.
You left out a tier, and that's the embittered one with an axe to grind. It's the "nothing ever happens" crowd, and they want to destroy anyone who stands as a counterexample.
The Nazis nearly invented the atomic bomb.
The McDonnell Douglas DC-X?
Never worked. Project canceled.
You’re also wrong
Which one of those didn't?
Reusable rockets. Tesla popularized EVs
Most of those things have happened actually, but the website makes it seem that they didn't. It just lists everything Elon has said, but doesn't track whether it happened or not. This is a completely pointless website.
Actually they're in the middle of production at TSMC. They have 10,000 units on order, to be delivered "in the coming year".
What that he has talked about has been vaporware?
That Tesla owners can use their cars to make money, as robotaxis, while they are at work - let's just say he vastly underestimates the effort it takes to make progress - FSD is not there yet.
Vaporware assumes it will never happen. Is that the case you think or is it that he was vastly over optimistic? Very likely the latter.
FSD is currently in the quantum valley of product development: it is both vaporware and a shipping product.
The shipping product is FSD in name only. Actual autonomy anywhere near the levels that have been promised to arrive "by the end of this year" for years will surely arrive by the end of this year.
Vaporware makes no assumptions about the future. Everything is vaporware until it isn’t.
They will get there at some acceptable point, but not with the tech in current Teslas - the current compute module will need to be replaced - I think they showed off HW 4 in lieu of HW 3.
Sucks to be you if you paid the early bird fee for it.
Agreed, feel bad for anyone that did that. Hopefully lessons learned on not making huge purchases based on some revolutionary future tech that isn’t fully baked yet / all problems solved for. Definitely an area I would never want to be an early adopter in.
Hmm, that website could be really interesting if it didn't so clearly try to be misleading; e.g., a statement about something being in development but not yet out isn't a promise that it will be out now. Really silly.
Also if it didn’t exclude everything that has been delivered. That long list would be interesting to see as well.
Dojo has always been a lie.
Source? The article mentions they now have / use both.
Your assertion is inaccurate.
Original tweet: https://twitter.com/SawyerMerritt/status/1696011140508045660
Previous article: https://www.tomshardware.com/news/teslas-dollar300-million-a...
This is second-hand blogspam.
Tom's Hardware and Tech Radar belong to the same company. If you consider this to be blog spam, almost any news website these days would be blog spam.
> almost any news website these days would be blog spam
Yes.
Almost everything is in the original tweet.
And the original tweet is very much kool-aid heavy, with "20x performance", "30x performance" claims about the switch from one card to the next.
> This AI cluster, worth more than $300 million, will offer a peak performance of 340 FP64 PFLOPS for technical computing and 39.58 INT8 ExaFLOPS for AI applications, according to Tom’s Hardware.
I was curious why this statement led with fp64 flops (instead of fp32, perhaps), but I looked up the H100 specs, and NV’s marketing page does the same thing. They’re obviously talking about the H100 SXM here, which has the same peak theoretical fp64 throughput as fp32. The cluster perf is estimated by multiplying the GPU perf by 10k.
Also, obviously, int8 tensor ops aren’t ‘FLOPS’. I think Nvidia calls them “TOPS” (tensor ops). There is a separate metric for ‘tensor flops’ or TF32.
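For what it's worth, the headline figures do look like the per-GPU datasheet numbers multiplied straight by 10,000 (assuming the usual H100 SXM specs; treat the per-GPU values below as the commonly quoted datasheet peaks):

    gpus = 10_000
    fp64 = 34e12      # vanilla FP64 FLOP/s per H100 SXM (FP64 Tensor Core is ~67e12)
    int8 = 3958e12    # INT8 Tensor Core ops/s, the with-sparsity figure

    print(gpus * fp64 / 1e15)            # 340.0 -> the "340 FP64 PFLOPS"
    print(round(gpus * int8 / 1e18, 2))  # 39.58 -> the "39.58 INT8 'ExaFLOPS'"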
In the old days, depending on architecture, fp64 performance could be atrocious even when fp32 was decent, so bragging about fp64 performance has an authenticity to it. Not all scientific computing requires 64 bits, but knowing that you can drop to high precision when necessary without penalty is nice.
Also, back in the day, integer ops were just called 'ops', grumble grumble. But yeah FLOPS specifically refers to floating point. Calling them TOPS doesn't make sense to me, since tensor cores were meant for matrix operation speedup, and these matrices are rarely integer.
Still true that fp64 throughput is lower for consumer GPUs - both NV and AMD. That’s kinda why I was curious about leading with that metric - outside of HPC and scientific applications, a lot of people don’t really need fp64, and the machine might normally have a much higher fp32 throughput.
> knowing you can drop to high precision when necessary without penalty is nice.
I guess I maybe don’t know why you’d ever have 1:1 fp32 and fp64 perf. Aren’t the fp64 multipliers (for example) basically 4x fp32 multipliers? I am under the possibly naive impression that if you have all the transistors for 1 fp64 core, that you’d end up with all the transistors you need for 2 or 4 fp32 cores. Maybe that’s not true today, but there does have to be at least 2x the transistors overall for 64-bit vs 32-bit, and lots of those should be shared or reusable, no? It doesn’t seem quite right to frame naturally higher 32-bit op throughput as a “penalty” on 64-bit ops. You’re asking the hardware to do more with 64, and it makes complete sense that given the exact same budget for bandwidth, energy, memory, compute, etc. that 32-bit ops would go faster, no? If the op throughput of fp64 and fp32 is the same, doesn’t that possibly imply that the fp32 ops are potentially being wasted / penalized, just for the sake of having matching numbers?
This is also related to "fast" versions of some operations. You might want the full 32-bit float, but you don't want or need to do full-precision division or sqrt operations. This is common in games/graphics and probably machine learning.
You're right -- I have no idea why fp64 wouldn't be half the speed of fp32, and traditionally it is. I was simply taking them at their word. Maybe they're exaggerating or maybe they did what you suggest and hamstrung fp32.
Nit: INT8 is not a floating point operation and thus cannot be used in the term "ExaFLOPS"
I predict it will run for 5 years and then come up with the answer: FSD needs lidar.
n00b questions from someone just beginning to get interested in HPC
I see mention of using this supercomputer for training models. Is that the only purpose? What other types of things do orgs usually do with these supercomputers?
Are there any good boots-on-the-ground technical blogs that provide interesting detail on day-to-day experiences with these things?
As opposed to keeping all of your servers independent of each other, supercomputers are used any time you want to pretend the entire cluster is one computer.
In other words, they're used when you want to share some kind of state across all of the computers, without the potential overhead of communicating to some other system like a database.
Physics simulations and like, molecular modeling come to mind as common examples.
In the case of ML training, model parameters and broadcasting the deltas that get calculated during training are that shared state.
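A toy sketch of that last point (my own illustration, nothing Tesla-specific): in data-parallel training every worker holds a copy of the parameters, computes gradients on its own data shard, and an "all-reduce" averages those gradients so every replica applies the identical update.

    import numpy as np

    rng = np.random.default_rng(0)
    n_workers, n_params, lr = 4, 8, 0.1

    params = rng.normal(size=n_params)                     # replicated on every worker
    shards = [rng.normal(size=(100, n_params)) for _ in range(n_workers)]
    targets = [x @ np.ones(n_params) for x in shards]      # toy regression targets

    for step in range(3):
        # each worker: local gradient of a least-squares loss on its shard
        grads = [x.T @ (x @ params - y) / len(x) for x, y in zip(shards, targets)]
        # the "all-reduce": on a real cluster this runs over NCCL/MPI and the
        # network fabric, which is why inter-node bandwidth and latency matter
        avg_grad = np.mean(grads, axis=0)
        params -= lr * avg_grad                            # identical update everywhere
        print(step, float(np.square(avg_grad).mean()))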
Newbie question, could this cluster easily calculate the largest prime number? I've found that the largest known prime number was found back in 2018, which is a while back considering how compute has evolved.
Finding the largest prime is more a contest of who's willing to commit the most ridiculous amount of compute to the goal than it is a mathematical obstacle.
The cost of finding the next prime is likely into the millions now.
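For context: the record-holders are Mersenne primes (2^p - 1), found by the GIMPS project using the Lucas-Lehmer test. The test itself is tiny; the expense is that the record exponent p is over 82 million, so each step squares a number tens of megabytes long, p - 2 times. A toy version (fine for small p, hopeless at record sizes):

    def lucas_lehmer(p):
        """True iff 2**p - 1 is prime, for odd prime p (the GIMPS workhorse)."""
        m = (1 << p) - 1
        s = 4
        for _ in range(p - 2):
            s = (s * s - 2) % m     # record-sized p needs FFT-based multiplies here
        return s == 0

    # Mersenne exponents in this range: 3, 5, 7, 13, 17, 19, 31, 61 print True
    for p in (3, 5, 7, 11, 13, 17, 19, 23, 31, 61):
        print(p, lucas_lehmer(p))

The 2018 record mentioned above is 2^82,589,933 - 1, a number of roughly 24.9 million digits.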
Is FSD really a hardware problem for them?
Did they also order a power plant for that cluster? Or how much energy does such a thing need?
It's funny - I'm listening to "The Founders" audiobook and right now they're telling the story of Elon Musk at PayPal wanting to rewrite for Windows server because Linux was too hard.
Weird to think that his next company's compute platform is this.
Linux was a lot harder back then.
Harder for who? Elon certainly didn't have the technical chops to work with it.
Harder for everyone, including his staff, who were asking him to move to Windows…
He should have hired staff that is competent with the tech stack used at his company.
Unforced rewrites are almost always a bad idea.
Oh that simple, huh? Too bad you weren’t around in the late 90s to explain this to him, and to help him find the extremely rare group of folks familiar with Linux…
Actually, it was the other way around: he was strongly pushing Windows, and the CTO and engineers from Confinity strongly wanted Linux.
“Wanting to rewrite for windows” is what you said; was that not accurate?
The key thing here, however, is that Elon didn’t want whatever he asked for in a vacuum, despite what the book says. Surely this was his engineers’ preference.
So THAT's why my power blipped
Only 10K?
It’s a bottleneck on Nvidia’s side. They are producing less than Tesla can consume. Tesla’s compute power will outclass many cloud providers combined in just three or four years with their own custom chip.
> It’s a bottleneck on Nvidia’s side. They are producing less than Tesla can consume. Tesla’s compute power will outclass many cloud providers combined in just three or four years with their own custom chip.
That seems like a bold claim. Google, Microsoft, and Meta make so much more money than Tesla that if making AI chips were so easy, they could clearly out-design and out-build Tesla without thinking too hard about it.
What makes you think that Tesla, a company with far fewer AI workers, less knowledge, and far less money than the above companies, can out-design and out-build them?
> What makes you think that Tesla, a company with far fewer AI workers, less knowledge, and far less money than the above companies, can out-design and out-build them?
Presumably because Elon himself will be involved in the design, and Elon, as we all know, is one of the world's great thinkers. ;)
Elon is one of the world's greatest talent poachers, and that is much better than being a great thinker.
Is he, though?
I recently spoke with someone who quit SpaceX because (among other reasons) they felt Elon was a meddling micro-manager. That's just one anecdote, of course, but the Internet is full of them, replete with summary firings (https://www.businessinsider.com/tesla-elon-musk-ruthlessly-f...), worker safety issues (https://www.washingtonpost.com/technology/2021/03/12/hundred...), and just general bullshit (https://www.reddit.com/r/EnoughMuskSpam/comments/9e360m/elon...).
I don't deny that his public image, for years, was an overall positive one. I really enjoyed Jill Lepore's digging into it here: https://www.pushkin.fm/podcasts/elon-musk-the-evening-rocket.
But it seems like people who worked with him knew, for a long time, that he was full of shit. And increasingly, the public seems to as well.
Do you mind explaining what makes him good at it? Pay? Atmosphere? Management style?
From what I heard about SpaceX it seems to be a place grads go to burn out while being paid below market rate simply because they're excited about the idea. Maybe that impression is wrong, so I'd like to hear other perspectives.
Interesting, what makes you think this?
The dirty little open secret with a lot of these platforms is that the contract sizes, hardware costs, etc. are so massive they come with multiple teams of dedicated engineers and internal expertise to get your application(s) up and running on them. Obviously these things are never quite "pull a docker container and run," and no one dropping eight-to-nine figures on these installs is going to do it without serious vendor backing and support.
It's part of the reason why AMD has had quite a bit of success here but is in single-digit market share for "AI" otherwise.
Most people - even large orgs with thousands of GPUs - are so trapped in CUDA that the theoretical, on-paper performance and cost benefits evaporate immediately once you spend all of your time trying to port everything over to the point where you get equivalent performance and functionality.
Got a source for that?
The original tweet makes the claim, but the tweet seems prone to hyperbole as well.
https://twitter.com/SawyerMerritt/status/1696011140508045660
The original tweet quotes Elon Musk saying "Frankly...if they (NVIDIA) could deliver us enough GPUs, we might not need Dojo"
$300 million for those 10,000
Much, much more. You're not factoring in the disks, chassis, RAM, networking gear, cabling, data center build, setup, install, etc etc etc...
> The firm also built a compute cluster fitted with 5,760 Nvidia A100 GPUs in June 2012
Wow, that's some really early hardware access. /s
Lol, I was wondering if A100 is really that old. Turns out A100 was released in 2020.
Yea I assume they meant 2021. 2012 was still the early days of GPU compute. Best we had were M2090s.
Maybe they picked up the date when Elon first communicated that they were "ready" to go live. Like everything else, it took a decade to materialize.
I only had about 3 NVIDIA H100 in 1980
Someone needs to figure out at what point all the compute in the world became more powerful than a single H100.
The Dojo is open.
I thought Dojo was custom chips.
You are correct; it is, and flippant HN comments that are additionally incorrect are starting to become a thing. See the original tweet: https://twitter.com/SawyerMerritt/status/1696011140508045660
You’re being pedantic (rightfully so) and I’m being loose with words. While Dojo is the supercomputer Tesla built for vision training, I lumped anything contributing to their machine vision model training under Dojo. It’s called Dojo because that’s where the training takes place.
https://en.wikipedia.org/wiki/Tesla_Dojo
From the History section (although Technical Architecture is also worthy of consuming in its entirety):
> In August 2023, Tesla powered on Dojo for production use as well as a new training cluster configured with 10,000 Nvidia H100 GPUs.
I’ll take the L wrt being flippant if we’re using words very specifically in this context, that’s fair. It’s great to see Tesla expand its training resources is my sentiment, regardless of how their aggregate ML compute is segmented.
You’re just using it wrong. Dojo “supercomputer” specifically includes custom chips, which don’t exist yet.
Pain does not exist in this Dojo. Kiai!
Can you imagine how much power 10,000 H100s actually produce in production? I bet you'd be able to run modern games on a cluster that large at a full 60 FPS.
Nvidia is powering a mega Tesla supercomputer powered by 10,000 H100 GPUs
Did you just repeat the headline?