Simplifying GPU Application Development with HMM (developer.nvidia.com)
> This new ability to directly read or write to the full application memory address space will significantly improve programmer productivity for all programming models built on top of CUDA: CUDA C++, Fortran, standard parallelism in Python, ISO C++, ISO Fortran, OpenACC, OpenMP, and many others.
This is the part that CUDA alternatives always miss when their models only support C and some C++ subset.
Apple Metal does this though.
Only recently. mmap'ing a file and using it directly in Metal kernels wasn't actually supported until iOS 16 / macOS 13. Also, there are limited optimization opportunities around that, and the recommended way still seems to be to use the specific Metal APIs to stream-load assets from disk.
Apple holds <10% of the total market share when it comes to computers, though, so I'm not sure how helpful that is.
Well, Nvidia holds something like 90% of the GPU market share, so any reply mentioning a competitor would have this property.
You missed the polyglot description regarding which workloads CUDA supports.
It's almost always better for performance to explicitly manage GPU memory and host/device copies than to depend upon the unified memory paging mechanism, if you can go to the extra effort.
My feeling is that unified memory and on-demand paging (introduced with Pascal?) were mainly about making it easier to onboard existing applications (e.g., HPC codes) to the GPU a bit at a time, with less friction. For writing a GPU application from scratch, I don't think it makes much sense (unless the granularity of the data you are moving around is really tiny and/or you can't predict in advance what you will need on the CPU or GPU).
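Even if you do end up on managed memory, you can take back most of that explicit control by hinting and prefetching rather than relying purely on demand-paging faults. A rough sketch (error checking omitted; `N`, `stream`, `grid`, `block`, `init_on_cpu`, and `my_kernel` are just placeholders):

    float *data;
    size_t bytes = N * sizeof(float);
    cudaMallocManaged(&data, bytes);              // managed allocation
    init_on_cpu(data, N);                         // touch the pages on the host first
    // Tell the driver where the data should live, then migrate it up front
    // instead of paying a page fault per touch inside the kernel.
    cudaMemAdvise(data, bytes, cudaMemAdviseSetPreferredLocation, 0 /* device 0 */);
    cudaMemPrefetchAsync(data, bytes, 0, stream);
    my_kernel<<<grid, block, 0, stream>>>(data, N);
    cudaMemPrefetchAsync(data, bytes, cudaCpuDeviceId, stream);  // bring results back
    cudaStreamSynchronize(stream);
    cudaFree(data);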
>"As an aside, new hardware platforms such as NVIDIA Grace Hopper natively support the Unified Memory programming model through hardware-based memory coherence among all CPUs and GPUs. For such systems, HMM is not required, and in fact, HMM is automatically disabled there.
One way to think about this is to observe that HMM is effectively a software-based way of providing the same programming model as an NVIDIA Grace Hopper Superchip."
1) I am curious what the AMD equivalent of nVidia's HMM is, or will be...
2) I am curious if software will be able to be written with HMM (or some higher level abstraction API) such that HMM enabled software will also function on an AMD or other 3rd party GPU...
HMM is a Linux thing, not an nVidia thing. https://www.kernel.org/doc/html/v5.0/vm/hmm.html
AMD has much the same variations as nvidia here, some details at https://github.com/amd/amd-lab-notes/blob/release/mi200-memo.... The single memory systems are called APUs. The internet thinks the MI300 (in El Capitan) is one of those. The games consoles and mobile chips are too.
I'm not sure what the limits are in terms of arbitrary heterogeneous execution if you want to push the boundaries, e.g. can you JIT amdgpu code into memory you got from mmap and have one of the GPU execution units branch to it? I don't see why not, but haven't tried it.
In principle I suppose a page should be able to migrate between nvidia and amdgpu hardware on a machine containing GPUs from both vendors, though that isn't likely to be a well tested path.
HMM is, I believe, a Linux feature.
AMD added HMM support in ROCm 5.0 according to this: https://github.com/RadeonOpenCompute/ROCm/blob/develop/CHANG...
Note: that isn't the same thing as what the OP describes, at least according to those release notes, but it does fall under the "HMM" umbrella. You still need to specifically allocate your memory with hipMallocManaged before it can be transparently used between the CPU and GPU. Nvidia calls this "unified memory" (and has had it for 10 years now.)
It's confusing, because there are basically three levels of "Heterogeneous Memory Management" in this regard, in order of increasing features and improved programming model:
1. Nothing. You have to both allocate memory with the right allocator (no malloc, no mmap), and also explicitly memcpy data between host and device memory when you want to use it. You still need to "synchronize" with the compute kernel to ensure it completes before you can see its results.
2. Unified virtual memory. You have to allocate memory with the right allocator (no malloc, no mmap), but after that, you don't need to copy to/from the device memory via special memcpy routines. Memory pages are migrated to/from the device as you demand them; you can address more memory than your actual GPU has, hence "virtual". You still need to synchronize with the compute kernel to ensure it completes. You can (in theory) LD_PRELOAD a different malloc(2) routine that uses the proper cudaMalloc call or whatever, making all malloc(2)-based memory usable for the accelerator, but that doesn't fix systems/libraries/programs that use custom non-malloc(2) allocators or e.g. mmap.
3. True heterogeneous memory management. You can use ANY piece of allocated memory, from any memory allocator, and share it with the accelerator, and do not need to copy to/from the device memory. You can use mmap'd pages, custom memory allocators, arbitrary 3rd party libraries, it doesn't really matter. Hell, you can probably set the PROT_WRITE bit on your own executable .text sections and then have the GPU modify your .text from the accelerator. The GPU and CPU have a unified view without any handholding from userspace. You still need to synchronize with the compute kernel to ensure it completes.
Nvidia implements all the features above, while HIP/AMD only implements the first two. Note that AMD has long been involved in various HMM-adjacent work for many years now (HSAIL, various GCC HSA stuff), so it's not like they're coming out of nowhere here. But as far as actual features and "It works today" goes, they're now behind if you're looking at HIP vs CUDA.
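To make the three levels concrete, here's roughly what each looks like in CUDA C++. This is only a sketch (error checking omitted, `scale` is a stand-in kernel), and level 3 assumes a driver/kernel/GPU combination where HMM (or hardware ATS) is actually available:

    #include <cstdlib>

    __global__ void scale(float *x, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] *= 2.0f;
    }

    void level1_explicit(float *host, int n) {            // 1. special allocator + explicit copies
        float *dev;
        cudaMalloc(&dev, n * sizeof(float));
        cudaMemcpy(dev, host, n * sizeof(float), cudaMemcpyHostToDevice);
        scale<<<(n + 255) / 256, 256>>>(dev, n);
        cudaMemcpy(host, dev, n * sizeof(float), cudaMemcpyDeviceToHost);
        cudaFree(dev);
    }

    void level2_managed(int n) {                          // 2. special allocator, no explicit copies
        float *buf;
        cudaMallocManaged(&buf, n * sizeof(float));
        for (int i = 0; i < n; ++i) buf[i] = 1.0f;        // CPU touches the pages
        scale<<<(n + 255) / 256, 256>>>(buf, n);          // pages migrate on demand
        cudaDeviceSynchronize();                          // still have to synchronize
        cudaFree(buf);
    }

    void level3_hmm(int n) {                              // 3. any allocator at all
        float *buf = (float *)malloc(n * sizeof(float));  // plain malloc, or mmap, or whatever
        for (int i = 0; i < n; ++i) buf[i] = 1.0f;
        scale<<<(n + 255) / 256, 256>>>(buf, n);          // no cuda* allocation anywhere
        cudaDeviceSynchronize();
        free(buf);
    }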
I can see how you got here from the release notes, but the conclusions are a bit off. For hardware and kernels that support the full HMM setup with AMD, you get 3 today as long as XNACK is turned on. Systems like Frontier have been using it for some time now.
Also, 2 can be subdivided into systems that implement it by having two allocations, one host and one device, and triggering transfers when the GPU might access memory (2.1), and those that implement demand paging (2.2). The HMM support adds demand paging for type 2.2 as well as type 3 on supported hardware, where without it HIP had to use either 2.1 or remote PCIe access to provide "unified memory". Those were dark days, but for current hardware on appropriate kernels appropriately configured, AMD implements memory just as unified as either NVIDIA's HMM or ATS implementations.
This is not true.
3. is supported by AMD on new hardware, e.g., Frontier. See https://docs.olcf.ornl.gov/systems/frontier_user_guide.html#...
Amazing, thanks for the correction(s)!
Oh, very nice!
AMD's answer will be "nothing", imho.
They’ve really left this area wide open for over a decade now when it’s been extremely clear this is where the market was going.
Their GPU and GPU compute story is a mess, because ROCm has the most confusing compatibility story possible. They've been late to compute accelerators as well.
I don’t think there’ll be any abstraction layers either. The community as a whole is more than happy to be single vendor. AMD has shown they can’t build compute stacks, not because of technology reasons but purely long term decisions. The community therefore won’t do it for them.
ROCm already supports HMM.
You're not helping anything by going off on some rant based on an assumption and falsehood - this sort of comment is exactly the sort of thing the phrase "FUD" is used to describe.
You're right that my rant is incorrect on the premise that they don't have HMM, but it's because I missed ROCm adding it two years ago. So my bad, and unfortunately I can't edit my post so I'll leave the link here with my apologies. https://www.phoronix.com/news/Radeon-ROCm-4.3
The reason I missed it is that ROCm dropped support for my cards very unceremoniously. At which point I gave up.
I do think the rest of my point outside of the first sentence is valid though. ROCm isn't reliable to target. Nowhere near CUDA.
That it's so dependent on what card you have and what OS/kernel you use, and is so aggressive about dropping support for older cards, makes the entire ecosystem a mess. CUDA, by comparison, is so much more ubiquitous.
That becomes a chicken-and-egg problem with popular libraries adding ROCm support, because it then ends up targeting such a sliver (and a shifting sliver at that) of the market.
I used to work on Myrinet HPC NICs many years ago, and the ability for a PCI(e) device to access any memory by user virtual address was a desirable feature. I believe that Quadrics did this first using a patched version of DEC OSF/1 (UNIX, Tru64, whatever you want to call it), where they hooked into the kernel pmap (page table) code, and sync'ed the page tables with their NIC. That way the NIC could do the virtual to physical translations, and know if a virtual memory address was backed by a physical page.
What Nvidia is doing here sounds similar. Does Linux provide such primitives now?
It's really hard to google for information on older stuff like this. I did find a presentation from 2000 where they talk about "OS Bypass with Virtual Addressing; no page locking or copying; full protection" (https://hsi.web.cern.ch/HNF-Europe/sem3_2001/hnf.pdf)
Does that mean that anyone with an RTX 20-series card or above can now run local ML models as big as their RAM allows? (Or larger, if they're happy to wait for swapping to SSD.) Or am I misunderstanding the scale of the impact here?
(Not exactly "now", but when the software is recompiled / ported to this)
You could already do that with Unified Memory, which has existed for a while and IIRC supports paging and swapping, assuming you `cudaMallocManaged` and `cudaFree` appropriately for your allocations.
This is not a change to "features" but a change to the programming model. You never need to write cudaMalloc or cudaFree; you can just use any allocator or tool. This means more off-the-shelf code will just work when used with CUDA. So now your io_uring buffers can be shared with the GPU trivially, for example, or mmap'd pages that a library gave you, or whatever.
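For example, on an HMM-capable setup (open kernel modules, supported GPU) something along these lines should work with zero CUDA allocation calls; `double_all` and `process` are just illustrative names:

    #include <vector>

    // The GPU consumes a buffer that ordinary C++ code allocated,
    // with no cudaMalloc/cudaMemcpy anywhere.
    __global__ void double_all(float *p, size_t n) {
        size_t i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) p[i] *= 2.0f;
    }

    void process(std::vector<float> &v) {                 // v came from any off-the-shelf code
        double_all<<<(v.size() + 255) / 256, 256>>>(v.data(), v.size());
        cudaDeviceSynchronize();                          // results visible in v afterwards
    }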
The programming model is one of the things Nvidia does significantly better than any competitor. Single source model + HMM is a big step up from something like OpenCL in productivity and correctness.
On Grace Hopper chips, the coherence is granular down to the cache line (64 bytes); on x86 systems I believe they said it's (of course) 4K page granularity.
mmap'ing weights directly from a file seems to be new (I think). I need to check my notes to remember whether you can already do that with some cuda* API.
Yeah, I think a good simple litmus test for this is "can I directly call mmap(2) on a file, and then launch a kernel on that mmap'd memory, with no extra steps, and it works as I expect it to". With these newer features in CUDA, the answer to that is "yes you can."
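Concretely, the litmus test is something like this sketch ("weights.bin" and the kernel are stand-ins, error checking is omitted, and it assumes an HMM-capable driver/kernel/GPU):

    #include <cstdio>
    #include <cstdlib>
    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    __global__ void first_element(const float *w, float *out) {
        if (blockIdx.x == 0 && threadIdx.x == 0) *out = w[0];   // trivially read the mapped data
    }

    int main() {
        int fd = open("weights.bin", O_RDONLY);            // stand-in file name
        struct stat st;
        fstat(fd, &st);
        const float *w = (const float *)mmap(NULL, st.st_size, PROT_READ,
                                             MAP_PRIVATE, fd, 0);
        float *out = (float *)malloc(sizeof(float));       // plain malloc is also fine under HMM
        // No cudaMalloc, no cudaMemcpy, no cudaHostRegister: just launch on the mapping.
        first_element<<<1, 32>>>(w, out);
        cudaDeviceSynchronize();
        printf("first weight: %f\n", *out);
        munmap((void *)w, st.st_size);
        close(fd);
        free(out);
        return 0;
    }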
You can already do that with GGUF/GGML models, which allow you to split between CPU and GPU. Obviously there is a performance hit when running on your DDR5 and CPU compared to HBM/GDDR and GPU, but it's better than nothing.
I have not been keeping up with developments. Does this mean mortals can run the biggest tier of Llama models (albeit with trash performance) by using system ram? For playing around, I would be willing to let my system chug along just to see what the top tier models can achieve.
Technically yes. If you have lots of RAM you can use that and your CPU; as you say, though, the performance would be pretty poor, especially as it's a tool where you want to tweak your responses quite frequently. I've been running an old Nvidia Tesla P100 card I got cheap on eBay for a while now; it has 16 GB of VRAM but it is pretty old. I'm so interested in this now that I've gone out and got myself a secondhand RTX 3090, something I never thought I'd do, but I'd really like to run 30B models on the GPU.
Yes. I recently benchmarked the 70B Llama 2 model on a 24 vCPU vSphere host with 64GB RAM (through Ollama) and it was capable of spitting out ~0.15 tokens / second. Useless for any interactive use-case but better than nothing. As a comparison the 7B Llama 2 model was ~1.5 tokens / second on the same hardware while the cheapest M1 MacBook Air can do ~10 tokens / second thanks to GPU acceleration.
Already doable. The gotcha is that it is slow AF. Even with a 90%/10% split, the subjective experience tanks hard, so it usually makes sense to pick something that fits into your VRAM.
"What every programmer should know about memory" needs an update.
I don't think so.
The only thing that's been added is bank groups in DDR4, IMO. But all you need to know is that modern RAM is maybe 16- to 32-way parallel per stick. The interface operates faster than the RAM can respond, so an "optimal" CPU will list off 32 to 64 (32 for the first stick, 32 for the 2nd stick) read/write commands before the first command ever responds.
Understanding that mechanism is what that document is about (how CPUs coalesce memory accesses and parallelize requests).
----------------
GPUs have one additional coalescing layer, given channel vs. bank conflicts and all that noise. But most GPU manuals (be they NVidia or AMD) will cover those details.
They used to not support unified memory in their vGPU drivers. It was a major deal-breaker back then.
I think that's still the case; though I assume you're talking about Linux-on-Linux vGPUs, the same is true of e.g. WSL2 where unified memory isn't supported. Sucks, because it's a great feature.
According to the manual, UVM is supposed to work on vGPUs (at least MIG-backed vCS), though I could never get it working.