Nvidia Announces H100 NVL – Max Memory Server Card for Large Language Models

anandtech.com

122 points by neilmovva 3 years ago · 110 comments

neilmovvaOP 3 years ago

A bit underwhelming - H100 was announced at GTC 2022, and represented a huge stride over A100. But a year later, H100 is still not generally available at any public cloud I can find, and I haven't yet seen ML researchers reporting any use of H100.

The new "NVL" variant adds ~20% more memory per GPU by enabling the sixth HBM stack (previously only five out of six were used). Additionally, GPUs now come in pairs with 600GB/s bandwidth between the paired devices. However, the pair then uses PCIe as the sole interface to the rest of the system. This topology is an interesting hybrid of the previous DGX (put all GPUs onto a unified NVLink graph), and the more traditional PCIe accelerator cards (star topology of PCIe links, host CPU is the root node). Probably not an issue, I think PCIe 5.0 x16 is already fast enough to not bottleneck multi-GPU training too much.
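
To put rough numbers on the links involved (a sketch with my own approximate, rounded figures, not taken from the article or the comment above):

```python
# Approximate bandwidth hierarchy for the H100 NVL topology described above.
# All figures are rounded assumptions for illustration, not official specs.
links_gb_s = {
    "HBM3 (on-package, per GPU)": 3900,       # assumed ~3.9 TB/s per GPU
    "NVLink (between the paired GPUs)": 600,  # per the announcement
    "PCIe 5.0 x16 (pair to host)": 64,        # assumed usable one-direction bandwidth
}

pcie = links_gb_s["PCIe 5.0 x16 (pair to host)"]
for name, bw in links_gb_s.items():
    print(f"{name:36s} ~{bw:5d} GB/s  (~{bw / pcie:3.0f}x PCIe)")
```

The roughly 60:10:1 ratio is why keeping tightly coupled traffic (activations, gradient exchange) inside the pair and reserving PCIe for less frequent host traffic is plausible, as the parent suggests.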

  • binarymax 3 years ago

    It is interesting that Hopper isn’t widely available yet.

    I have seen some benchmarks from academia but nothing in the private sector.

    I wonder if they thought they were moving too fast and wanted to milk Ampere/Ada as long as possible.

    Not having any competition whatsoever means Nvidia can release what they like when they like.

    • pixl97 3 years ago

      The question is, do they not have much production, or is OpenAI and Microsoft buying every single one they produce?

    • TylerE 3 years ago

      Why bother when you can get cryptobros paying way over MSRP for 3090s?

      • andy81 3 years ago

        GPU mining died last year.

        There's so little liquidity post-merge that it's only worth mining as a way to launder stolen electricity.

        The bitcoin people still waste raw materials, and prices are relatively sticky with so few suppliers and a backlog of demand, but we've already seen prices drop heavily since then.

        • TylerE 3 years ago

          Right, that's why Nvidia is actually trying again. The money printer has run out of ink.

      • binarymax 3 years ago

        Not just cryptobros. A100s are the current top of the line and it’s hard to find them available on AWS and Lambda. Vast.AI has plenty if you trust renting from a stranger.

        AMD really needs to pick up the pace and make a solid competitive offering in deep learning. They’re slowly getting there but they are at least 2 generations out.

        • fbdab103 3 years ago

          I would take a huge performance hit to just not deal with Nvidia drivers. Unless things have changed, it is still not really possible to operate on AMD hardware without a list of gotchas.

          • brucethemoose2 3 years ago

            It's still basically impossible to find MI200s in the cloud.

            On desktops, only the 7000 series is kinda competitive for AI in particular, and you have to go out of your way to get it running quickly in PyTorch. The 6000 and 5000 series just weren't designed for AI.

        • breatheoften 3 years ago

          It's crazy to me that no other hardware company has sought to compete for the deep learning training/inference market yet ...

          The existing ecosystems (cuda, pytorch etc) are all pretty garbage anyway -- aside from the massive number of tutorials it doesn't seem like it would actually be hard to build a vertically integrated competitor ecosystem ... it feels a little like the rise of rails to me -- is a million articles about how to build a blog engine really that deep a moat ..?

          • KeplerBoy 3 years ago

            How could their moat possibly be deeper?

            First of all you need hardware with cutting-edge chips. Chips which can only be supplied by TSMC and Samsung.

            Then you need the software, ranging all the way from the firmware and driver, through something analogous to CUDA with libraries like cuDNN, cuBLAS and many others, to integrations into PyTorch and TensorFlow.

            And none of that will come for free, like it came to Nvidia. Nvidia built CUDA and people built their DL frameworks around it in the last decade, but nobody will invest their time into doing the same for a competitor, when they could just do their research on Nvidia hardware instead.

            Realistically it's up to AMD or Intel.

            • rcme 3 years ago

              There will probably be Chinese options as well. China has an incentive to provide a domestic competitor due to deteriorating relations with the U.S.

              • KeplerBoy 3 years ago

                They certainly will have to try, since nvidia is banned from exporting A100 and H100 chips.

                • HTTP418 3 years ago

                  They do ship the A800 and H800 to China. The H800 is an H100 with much lower memory bandwidth. The A800 is likewise a cut-down version of the A100.

          • runnerup 3 years ago

            No other company has sought this?

            Cerebras (https://www.cerebras.net/) has innovative technology, has actual customers, and is gaining a foothold in software-system stacks by integrating their platform into the OpenXLA GPU compiler.

          • wmf 3 years ago

            There are tons of companies trying; they just aren't succeeding.

  • __anon-2023__ 3 years ago

    Yes, I was expecting a RAM-doubled edition of the H100, this is just a higher-binned version of the same part.

    I got an email from vultr, saying that they're "officially taking reservations for the NVIDIA HGX H100", so I guess all public clouds are going to get those soon.

  • rerx 3 years ago

    You can also join a pair of regular PCIe H100 GPUs with an NVLink bridge. So that topology is not so new either.

  • ksec 3 years ago

    >H100 was announced at GTC 2022, and represented a huge stride over A100. But a year later, H100 is still not generally available at any public cloud I can find

    You can safely assume an entity bought as many as they could.

ecshafer 3 years ago

I was wondering today if we would start to see the reverse of this: small ASICs, or some kind of GPU optimized for LLMs, for desktops or maybe even laptops and mobile. It is evident, I think, that LLMs are here to stay and will be a major part of computing for a while. Getting this local, so we aren't reliant on clouds, would be a huge boon for personal computing. Even if it's a "worse" experience, being able to load an LLM onto our computer, tell it to only look at this directory, and have it help out would be cool.

  • wyldfire 3 years ago

    In fact, Qualcomm has announced a "Cloud AI" PCIe card designed for inference (as opposed to training) [1, 2]. It's populated with NSPs like the ones in mobile SoCs.

    [1] https://www.qualcomm.com/products/technology/processors/clou...

    [2] https://github.com/quic/software-kit-for-qualcomm-cloud-ai-1...

  • ethbr0 3 years ago

    Software/hardware co-evolution. Wouldn't be the first time we went down that road to good effect.

    For anything that can be run remotely, it'll always be deployed and optimized server-side first. Higher utilization means more economy.

    Then trickle down to local and end user devices if it makes sense.

  • wmf 3 years ago

    Apple, Intel, AMD, Qualcomm, Samsung, etc. already have "neural engines" in their SoCs. These engines continue to evolve to better support common types of models.

  • Sol- 3 years ago

    Why is the sentiment here so much that LLMs will somehow be decentralized and run locally at some point? Has the story of the internet so far not been that centralization has pretty much always won?

    • wmf 3 years ago

      Hackers want to run LLMs locally just because. It's not a mainstream thing.

      • capableweb 3 years ago

        It makes business sense as well. It doesn't make much sense to build an entire company around the idea that OpenAI's APIs are always available and you won't eventually get screwed. "Be careful of basing your business on top of another" and all that yadda yadda.

    If you want to build a business around LLMs, it makes a lot of sense to be able to run the core service of what you want to offer on your own infrastructure instead of relying on a 3rd party that most likely doesn't care more than 1% about you.

        • wmf 3 years ago

          Running LLMs on your own servers doesn't mean PCs which is what this thread is about. A100/H100 is fine for a business but people can't justify them for personal use.

    • jacquesm 3 years ago

      Because that is pretty much the pendulum swinging in the IT world. Right now it is solidly in 'centralization' territory, hopefully it will go back towards decentralization again in the future. The whole PC revolution was an excellent datapoint for decentralization, now we're back to 'dumb terminals' but as local compute strengthens the things that you need a whole farm of servers for today can probably fit in your pocket tomorrow, or at the latest in a few years.

      • waboremo 3 years ago

        Not sure this really tracks. Local compute has always been strengthening as a steady incline. Yet we haven't really experienced any sort of pendulum shift, it's always been centralization territory.

        The reasoning seems mostly obvious to me here: people do not care for the effort that decentralization requires. If given the option to run AI off some website to generate all you want, people will gladly do this over using their local hardware due to the setup required.

        The unfortunate part is that it takes so much longer to create not for profit tooling that is just as easy to use, especially when the calling to turn that into your for profit business in such a lucrative field is so tempting. Just ask the people who have contributed to Blender for a decade now.

        • jacquesm 3 years ago

          Absolutely not. Computers used to be extremely centralized and the decentralization revolution powered a ton of progress in both software development and hardware development.

          You can run many AI applications locally today that would have required a massive investment in hardware not all that long ago. It's just that the bleeding edge is still in that territory. One major optimization avenue is the improvement of the models themselves: they are large because they have large numbers of parameters, but the bulk of those parameters has little to no effect on the model output. There is active research on 'model compression', which has the potential to extract the working bits from a model while discarding the non-working bits without affecting the output, and to realize massive gains in efficiency (both in power consumption and in the cost of running the model).

          Have a look at the kind of progress that happened in the chess world with the initial huge ML powered engines that are beaten by the kind of program that you can run on your phone nowadays.

          https://en.wikipedia.org/wiki/Stockfish_(chess)

          I fully expect something similar to happen to language models.

          • waboremo 3 years ago

            The bleeding edge will always be in that territory. It still requires a massive investment today to run AI applications locally to produce anywhere near as good results. People are spending upwards of $2000 for a GPU just to get decent results when it comes to image generation, many forgoing this entirely and just giving Google a monthly fee to use their hardware.

            Which is the point, decentralization will always be playing catch up here unless something really interesting happens. It has absolutely nothing to do with local compute power, that has always been on an incline. We just get fed scraps down the line.

            • jacquesm 3 years ago

              Today's scraps are yesterday's state-of-the-art, and that's very logical and applies to far more than just AI applications. It's the way research and development result in products and in subsequent optimization. This has been true since the dawn of time in one form or another. At some point stone tools were high tech and next to unaffordable. Then it was bronze, then iron, and at some point we finally hit steam power. From there to the industrial revolution was a relatively short span, and from there to electricity, electronics, solid state, computers, personal computers, mobile phones, smartphones and so on in ever decreasing steps.

              If anything, the steps now follow each other so closely that we have far more trouble tracking and dealing with the societal changes than we do with the lag between technological advancement and its eventual commoditization.

    • cavisne 3 years ago

      Nvidia's business model encourages this for starters. They charge a huge markup for their datacenter GPUs through some clever licensing restrictions. So it is cheaper per FLOP to run inference on a personal device.

      Centralization of compute has not always won (even if that compute is mostly controlled by a single company). The failure of cloud gaming vs consoles, and the success of Apple (which is very centralized but pushes a lot of ML compute out to the edge) for example.

    • psychlops 3 years ago

      I think the sentiment is both. There will be advanced centralized LLMs, and people want the option to have a personal one (or two). There needn't be a single solution.

    • throwaway743 3 years ago

      Sure, for big business, but torrents are still alive and well.

    • kaoD 3 years ago

      I think it's because it feels more similar to Google Stadia than to Facebook.

  • 01100011 3 years ago

    A couple of the big players are already looking at developing their own chips.

    • JonChesterfield 3 years ago

      Have been for years. Maybe lots of years. It's expensive to have a go (many engineers plus cost of making the things) and it's difficult to beat the established players unless you see something they're doing wrong or your particular niche really cares about something the off the shelf hardware doesn't.

enlyth 3 years ago

Please give us consumer cards with more than 24GB VRAM, Nvidia.

It was a slap in the face when the 4090 had the same memory capacity as the 3090.

A6000 is 5000 dollars, ain't no hobbyist at home paying for that.

  • andrewstuart 3 years ago

    Nvidia don't want consumers using consumer GPUs for business.

    If you are a business user then you must pay Nvidia gargantuan amounts of money.

    This is the outcome of a market leader with no real competition - you pay much more for lower performance than the consumer GPUs, and you are forced into using their business GPUs through software license restrictions on the drivers.

    • Melatonic 3 years ago

      That was always why the Titan line was so great - they typically unlocked features in between the Quadro and gaming cards. Sometimes it was subtle (like very good FP32 AND FP16 performance), sometimes it was full 10-bit colour support only if you had a Titan. Now it seems like they have opened up even more of those features to consumer cards (at least the creative ones) with the studio drivers.

      • andrewstuart 3 years ago

        Hmmm ... "Studio Drivers" ... how are these tangibly different to gaming drivers?

        According to this, the difference seems to be that Studio Drivers are older and better tested, nothing else.

        https://nvidia.custhelp.com/app/answers/detail/a_id/4931/~/n...

        What am I missing in my understanding of Studio Drivers?

        """ How do Studio Drivers differ from Game Ready Drivers (GRD)?

        In 2014, NVIDIA created the Game Ready Driver program to provide the best day-0 gaming experience. In order to accomplish this, the release cadence for Game Ready Drivers is driven by the release of major new game content giving our driver team as much time as possible to work on a given title. In similar fashion, NVIDIA now offers the Studio Driver program. Designed to provide the ultimate in functionality and stability for creative applications, Studio Drivers provide extensive testing against top creative applications and workflows for the best performance possible, and support any major creative app updates to ensure that you are ready to update any apps on Day 1. """

      • koheripbal 3 years ago

        Isn't a new Titan RTX 4090 coming out soon?

        • enlyth 3 years ago

          An alleged photo of an engineering sample was spotted in the wild a while ago, but no one knows if it's actually going to end up being a thing you can buy.

    • koheripbal 3 years ago

      We're NOT business users, we just want to run our own LLM at home.

      Given the size of LLMs, this should be possible with just a little bit of extra VRAM.

      • enlyth 3 years ago

        Exactly, we're just below that sweet spot right now.

        For example on 24GB, Llama 30B runs only in 4bit mode and very slowly, but I can imagine an RLHF finetuned 30B or 65B version running in at least 8bit would be actually useful, and you could run it on your own computer easily.

        • bick_nyers 3 years ago

          Do you know where the cutoff is? Does 32GB VRAM give us 30B int8 with/without a RLHF layer? I don't think 5090 is going to go straight to 48GB, I'm thinking either 32 or 40GB (if not 24GB).
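
One way to reason about that cutoff (a back-of-envelope sketch with an assumed ~20% overhead for activations and KV cache, not a benchmark):

```python
# Rough VRAM needed for a dense LLM at inference: weights plus an assumed
# ~20% margin for activations / KV cache. Illustrative only; real usage
# depends on context length, batch size, and the runtime used.

def approx_vram_gb(params_billion: float, bits_per_param: int, overhead: float = 0.2) -> float:
    weight_gb = params_billion * bits_per_param / 8  # 1e9 params * (bits/8) bytes ~= GB
    return weight_gb * (1 + overhead)

for n in (13, 30, 65):
    for bits in (16, 8, 4):
        print(f"{n}B @ {bits}-bit: ~{approx_vram_gb(n, bits):.0f} GB")
```

By this crude estimate, 30B at 8-bit lands around 36 GB, so 32 GB would likely still be short while 48 GB looks comfortable; 30B at 4-bit (~18 GB) is what just squeezes into 24 GB today.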

        • riku_iki 3 years ago

          > For example on 24GB, Llama 30B runs only in 4bit mode and very slowly

          why do you think adding VRAM, but not cores, will make it run faster?..

          • enlyth 3 years ago

            I've been told the 4 bit quantization slows it down, but don't quote me on this since I was unable to benchmark at 8 bit locally

            In any case, you're right that it might not be as significant. However, the quality of the output increases with 8/16bit, and running 65B is completely impossible on 24GB.

            • riku_iki 3 years ago

              It's not impossible; there are several projects which load the model layer by layer from disk or RAM for execution, but it will be much slower.
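
For what it's worth, one widely used way to do that layer-by-layer spill at the time was Hugging Face Accelerate's device map. A minimal sketch, assuming the transformers + accelerate stack is installed; the model id and memory limits below are placeholders:

```python
# Sketch: run a model larger than VRAM by letting accelerate place as many
# layers as fit on the GPU and offloading the rest to system RAM / disk.
# Much slower than keeping everything in VRAM, as the parent notes.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "some-org/llama-65b"  # hypothetical placeholder, not a real repo name

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",                         # place layers automatically
    max_memory={0: "22GiB", "cpu": "120GiB"},  # assumed limits for a 24GB card
    offload_folder="offload",                  # spill to disk if RAM runs out too
)

inputs = tokenizer("The H100 NVL is", return_tensors="pt").to(0)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=32)[0]))
```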

      • bick_nyers 3 years ago

        I don't think you understand though, they don't WANT you. They WANT the version of you who makes $150k+ a year and will splurge $5k on a Quadro.

        If they had trouble selling stock we would see this niche market get catered to.

        • koheripbal 3 years ago

          That IS me. $5K is not enough to run an LLM at home (beyond the non-functional reduced quantization smaller models).

          • bick_nyers 3 years ago

            Ahh yes, looks like I was too generous with my numbers. The new Quadro with 48GB VRAM is $7k, so you probably would need $14k and a Threadripper/Xeon/EPYC workstation, because you won't have enough PCIe lanes/RAM/memory bandwidth otherwise.

            So maybe more accurate is $200k+ a year and $20-30k on a workstation.

            I grew up on $20k a year, the numbers in tech are baffling!

  • nullc 3 years ago

    Nvidia can't do a large 'consumer' card without cannibalizing their commercial ML business. ATI doesn't have that problem.

    ATI seems to be holding the idiot ball.

    Port Stable Diffusion and CLIP to their hardware. Train an upsized version sized for a 48GB card. Release a prosumer 48GB card... get huge uptake from artists and creators using the tech.

andrewstuart 3 years ago

GPUs are going to be weird, underconfigured and overpriced until there is real competition.

Whether or not there is real competition depends entirely on whether Intel's Arc line of GPUs stays in the market.

AMD strangely has decided not to compete. Its newest GPU, the 7900 XTX, is an extremely powerful card, close to the top-of-the-line Nvidia RTX 4090 in raster performance.

If AMD had introduced it at an aggressively low price then they could have wedged Nvidia, which is determined to exploit its market dominance by squeezing the maximum money out of buyers.

Instead, AMD has decided to simply follow Nvidia in squeezing for maximum prices, with AMD prices slightly below Nvidia's.

It's a strange decision from AMD, which is well behind in market share and apparently disinterested in increasing that share by competing aggressively.

So a third player is needed - Intel - since it's a lot harder for three companies to sit on outrageously high prices for years rather than compete with each other for market share.

  • dragontamer 3 years ago

    The root cause is that TSMC raised prices on everyone.

    Since Intel GPUs are likewise TSMC-manufactured, you really aren't going to see price improvements unless Intel subsidizes all of this.

  • enlyth 3 years ago

    I suspect that the lack of CUDA is a dealbreaker for too many people when it comes to AMD, with the recent explosion in machine learning.

  • JonChesterfield 3 years ago

    GPUs strike me as absurdly cheap given the performance they can offer. I'd just like them to be easier to program.

    • andrewstuart 3 years ago

      Depends on the GPU of course but at the top end of the market AUD$3000 / USD$1,600 is not cheap and certainly not absurdly cheap.

      Much less powerful GPUs represent better value but the market is ridiculously overpriced at the moment.

brucethemoose2 3 years ago

The really interesting upcoming LLM products are from AMD and Intel... with catches.

- The Intel Falcon Shores XPU is basically a big GPU that can use DDR5 DIMMs directly, hence it can fit absolutely enormous models into a single pool. But it has been delayed to 2025 :/

- AMD has not mentioned anything about the (not delayed) MI300 supporting DIMMs. If it doesn't, it's capped to 128GB, and it's being marketed as an HPC product like the MI200 anyway (which you basically cannot find on cloud services).

Nvidia also has some DDR5 Grace CPUs, but the memory is embedded and I'm not sure how much of a GPU they have. Other startups (Tenstorrent, Cerebras, Graphcore and such) seem to have underestimated the memory requirements of future models.

  • YetAnotherNick 3 years ago

    > DDR5 DIMMS directly

    That's the problem. Good DDR5 RAM's memory speed is <100GB/s, while Nvidia's HBM goes up to 2TB/s, and the bottleneck still lies in memory speed for most applications.

    • brucethemoose2 3 years ago

      Not if the bus is wide enough :P. EPYC Genoa is already ~450GB/s, and the M2 Max is 400GB/s.

      Anyway, what I was implying is that simply fitting a trillion-parameter model into a single pool is probably more efficient than splitting it up over a power-hungry interconnect. Bandwidth is much lower, but latency is also lower and you are shuffling much less data around.
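
A quick way to see why bandwidth dominates here: in single-stream decoding, every generated token has to stream roughly the full weight set from memory, so tokens/second is bounded by bandwidth divided by model size. A sketch using the approximate figures quoted in this subthread; the model size is my own assumption:

```python
# Crude upper bound on autoregressive decode speed:
#   tokens/s <= memory_bandwidth / bytes_of_weights_read_per_token
# Bandwidth numbers are the rough figures mentioned above.

model_bytes = 175e9 * 2  # assume a 175B-parameter dense model in fp16

bandwidth_gb_s = {
    "dual-channel DDR5": 100,
    "EPYC Genoa (12-channel DDR5)": 450,
    "Apple M2 Max unified memory": 400,
    "datacenter HBM (~2TB/s)": 2000,
}

for name, bw in bandwidth_gb_s.items():
    print(f"{name:30s} <= {bw * 1e9 / model_bytes:5.1f} tokens/s")
```

So a wide DDR5 pool buys capacity and respectable (though not HBM-class) bandwidth, which is exactly the trade-off being discussed.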

  • virtuallynathan 3 years ago

    Grace can be paired with Hopper via a 900GB/s NVLink bus (500GB/s memory bandwidth), 1TB of LPDDR5 on the CPU and 80-94GB of HBM3 on the GPU.

int_19h 3 years ago

I wonder how soon we'll see something tailored specifically for local applications. Basically just tons of VRAM to be able to load large models, but not bleeding edge perf. And eGPU form factor, ideally.

  • frankchn 3 years ago

    The Apple M-series CPUs with unified RAM are interesting in this regard. You can get a 16-inch MBP with an M2 Max and 96GB of RAM for $4300 today, and I expect the M2 Ultra to go to 192GB.

  • pixl97 3 years ago

    I'm not an ML scientist by any means, but perf seems as important as RAM from what I'm reading. Running prompts with internal chain of thought (eating up more TPU time) appears to give much better output.

    • int_19h 3 years ago

      It's not that perf is not important, but not having enough VRAM means you can't load the model of a given size at all.

      I'm not saying they shouldn't bother with RAM at all, mind you. But given some target price, it's a balance thing between compute and RAM, and right now it seems that RAM is the bigger hurdle.

aliljet 3 years ago

I'm super duper curious if there are ways to pool VRAM across consumer-grade hardware to make this whole market more accessible to the common hacker?

metadat 3 years ago

How is this card (which is really two physical cards occupying 2 PCIe slots) exposed to the OS? Does it show up as a single /dev/gfx0 device, or is the unification a driver trick?

  • rerx 3 years ago

    The two cards show as two distinct GPUs to the host, connected via NVLink. Unification / load balancing happens via software.
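
A minimal way to see that from software (a sketch assuming a PyTorch environment on such a host; device indices 0 and 1 are an assumption):

```python
# The NVL pair appears as two separate CUDA devices; peer (P2P) access
# between them (over NVLink here, or PCIe P2P on other systems) can be queried.
import torch

n = torch.cuda.device_count()
print(f"{n} CUDA devices visible")

if n >= 2:
    # True if device 0 can directly read/write device 1's memory.
    print("peer access 0 <-> 1:", torch.cuda.can_device_access_peer(0, 1))
```

On the command line, nvidia-smi topo -m prints the equivalent link matrix between the GPUs.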

    • sva_ 3 years ago

      Kinda depressing if you consider how they removed NVLink in the 4090, stating the following reason:

      > “The reason we took [NVLink] off is that we need I/O for other things, so we’re using that area to cram in as many AI processors as possible,” Jen-Hsun Huang explained of the reason for axing NVLink.[0]

      "NVLink is bad for your games and AI, trust me bro."

      But then this card, actually aimed at ML applications, uses it.

      0. https://www.techgoing.com/nvidia-rtx-4090-no-longer-supports...

      • rerx 3 years ago

        Market segmentation. Back when the Pascal architecture was the latest thing, it didn't make much sense to buy expensive Tesla P100 GPUs for many professional applications when consumer GeForce 1080 Ti cards gave you much more bang for the buck with few drawbacks. From the corporation's perspective it makes so much sense to differentiate the product lines more, now that their customers are deeply entrenched.

sargun 3 years ago

What exactly is an SXM5 socket? It sounds like a PCIe competitor, but proprietary to nvidia. Looking at it, it seems specific to nvidia DGX (mother?)boards. Is this just a "better" alternative to PCIe (with power delivery, and such), or fundamentally a new technology?

  • koheripbal 3 years ago

    Yes to all your questions. It's specifically designed for commercial compute servers. It provides significantly more bandwidth and speed over PCIe.

    It's also enormously more expensive and I'm not sure if you can buy it new without getting the nvidia compute server.

  • 0xbadc0de5 3 years ago

    It's one of those /If you have to ask, you can't afford it/ scenarios.

tromp 3 years ago

The TDP row in the comparison table must be in error. It shows the card with dual GH100 GPUs at 700W and the one with a single GH100 GPU at 700-800W ?!

  • rerx 3 years ago

    That's the SXM version, used for instance in servers like the DGX. It's also faster than the PCIe variation.

0xbadc0de5 3 years ago

So it's essentially two H100's in a trenchcoat? (plus a sprinkling of "latest")

ipsum2 3 years ago

I would sell a kidney for one of these. It's basically impossible to train language models on a consumer 24GB card. The jump up is the A6000 ADA, at 48GB for $8,000. This one will probably be priced somewhere in the $100k+ range.
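
For context on why training is so much harder to fit than inference, here is a sketch of the usual mixed-precision-plus-Adam accounting; the constants are assumptions and activation memory is ignored:

```python
# Approximate per-parameter memory for mixed-precision training with Adam:
# fp16 weights (2 B) + fp16 grads (2 B) + fp32 master weights (4 B)
# + Adam momentum (4 B) + Adam variance (4 B) = ~16 bytes/param,
# before activations. Inference in fp16 needs ~2 bytes/param.
TRAIN_BYTES_PER_PARAM = 16
INFER_BYTES_PER_PARAM = 2

for params_billion in (7, 13, 30):
    train_gb = params_billion * TRAIN_BYTES_PER_PARAM
    infer_gb = params_billion * INFER_BYTES_PER_PARAM
    print(f"{params_billion}B params: ~{train_gb} GB to train vs ~{infer_gb} GB to load (fp16)")
```

Even a 7B model lands around 112 GB of weight-plus-optimizer state by this estimate, which is why full training on a 24GB card is off the table without sharding or heavy offload.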

eliben 3 years ago

NVIDIA is selling shovels in a gold rush. Good for them. Their P/E of 150 is frightening, though.

jiggawatts 3 years ago

I was just saying to a colleague the day before this announcement that the inevitable consequence of the popularity of large language models will be GPUs with more memory.

Previously, GPUs were designed for gamers, and no game really "needs" more than 16 GB of VRAM. I've seen reviews of the A100 and H100 cards saying that the 80GB is ample for even the most demanding usage.

Now? Suddenly GPUs with 1 TB of memory could be immediately used, at scale, by deep-pocket customers happy to throw their entire wallets at NVIDIA.

This new H100 NVL model is a Frankenstein's monster stitched together from whatever they had lying around. It's a desperate move to corner the market as early as possible. It's just the beginning, a preview of the times to come.

There will be a new digital moat, a new capitalist's empire, built upon on the scarcity of cards "big enough" to run models that nobody but a handful of megacorps can afford to train.

In fact, it won't be enough to restrict access by making the models expensive to train. The real moat will be models too expensive to run. Users will have to sign up, get API keys, and stand in line.

"Safe use of AI" my ass. Safe profits, more like. Safe monopolies, safe from competition.

g42gregory 3 years ago

I wonder how this compares to AMD Instinct MI300 128GB HBM3 cards?

tpmx 3 years ago

Does AMD have a chance here in the short term (say 24 months)?

  • Symmetry 3 years ago

    AMD seems to be focusing on traditional HPC; they've got a ton of 64-bit FLOPS in their recent compute models. I expect their server GPUs are mostly for chasing supercomputer contracts, which can be pretty lucrative, while they cede model training to Nvidia.

    • shubham-rawat5 3 years ago

      For now Nvidia is a very dominant player for sure, but in the long run do you see it changing, with competition from AMD-Xilinx, Intel, or potential AI hardware startups? Why have the startups or other big players failed to make a dent in Nvidia's dominance? Considering how big this market will be in the coming years, there should have been significant investment by other players, but they seem unable to make even a competitive chip, while Nvidia, which is already so far ahead, is running even faster, expanding its software ecosystem across various industries.

garbagecoder 3 years ago

Sarah Connor is totally coming for NVIDIA.
