What happens when you vectorize wide PyTorch expressions?
(probablymarcus.com)

> My so-called CPU “active” time is actually an inferred value; CUDA spins the CPU 100% constantly, even when the CPU is just waiting for the GPU
The CUDA Runtime and Driver APIs allow you to use “blocking synchronization”, where the CPU will go to sleep while waiting for synchronization with the device. However, it seems that PyTorch doesn’t expose this functionality in any of its Python APIs:
https://github.com/pytorch/pytorch/issues/28224
What happens when you try using ctypes to call into libcudart.so to set the device flags as described in the above issue? You’ll have to call torch.cuda.init() for it to work, and unfortunately it won’t work if PyTorch is launching kernels from other threads.
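For reference, here's a minimal ctypes sketch of that idea (assumptions: the flag value 0x04 is cudaDeviceScheduleBlockingSync from cuda_runtime_api.h, the runtime library is loadable under this soname, and whether the flag actually takes effect depends on when the CUDA context was created):

```python
import ctypes
import torch

cudaDeviceScheduleBlockingSync = 0x04  # from cuda_runtime_api.h

torch.cuda.init()  # make sure PyTorch has initialized its CUDA state first

# The soname may differ (e.g. a versioned libcudart.so.X bundled with PyTorch).
libcudart = ctypes.CDLL("libcudart.so")
ret = libcudart.cudaSetDeviceFlags(cudaDeviceScheduleBlockingSync)
print("cudaSetDeviceFlags returned", ret)  # 0 means cudaSuccess
```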
Aha, I was hoping to learn about something like this, thanks for sharing. I'll try this some time. PyTorch does use different threads for the forward and backward pass, so as you suggest, setting that flag might only improve the forward pass.
The CUDA Runtime and Driver APIs have per-thread state, so using threads would unfortunately bypass our trick here to set the flag. Assuming you're on Linux, I might suggest creating a shared library to intercept calls to the Driver API, as all Runtime functions are implemented as wrappers around Driver functions. You'd have to intercept all calls to context creation and flag setting:
* `cuCtxCreate`
* `cuCtxCreate_v3`
* `cuCtxSetFlags`
* `cuDevicePrimaryCtxRetain`
* `cuDevicePrimaryCtxSetFlags`

... and make sure that the three least significant bits of any `flags` variable are set to `CU_CTX_SCHED_BLOCKING_SYNC`.

cuDevicePrimaryCtxSetFlags: https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__PR...
dlsym(3): https://man.archlinux.org/man/dlsym.3.en
ld.so(8): https://man.archlinux.org/man/ld.so.8.en
I’m somewhat confused as to what is exposed, as the description in the quote sounds like a blocking call, but with a busy wait, which seems like it couldn’t be the only or main thing that PyTorch exposes.
Not just that: you can perfectly happily poll a marker you inserted into the CUDA stream, interspersed with sched_yield() syscalls, to let other processes get work done in between your checks of whether the GPU has reached the point where you can retrieve results (as/if relevant) and submit new work. You would have to tune the scheduler time slice so that those other processes don't keep running so long after you yield that your queue of submitted work runs dry before you get to top it off. This isn't as critical when you can completely fill the scheduler queue (I remember ~1000 entries, but it's been years and I haven't checked whether I even remembered that correctly; don't rely on it!), since in that case you may want to force a sleep of some millisecond(s) to keep the CPU core sleeping instead of merely allowing other processes to get work done.
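For what it's worth, a minimal sketch of that polling pattern in PyTorch terms (assumptions: Linux for os.sched_yield, a CUDA device, and no attempt at the scheduler time-slice tuning described above; torch.cuda.Event serves as the stream marker):

```python
import os
import torch

def wait_yielding(event: torch.cuda.Event):
    # Poll the marker, yielding the core to other runnable processes between checks.
    while not event.query():
        os.sched_yield()

stream = torch.cuda.current_stream()
x = torch.randn(4096, 4096, device="cuda")
y = x @ x                      # enqueue some GPU work
marker = torch.cuda.Event()
marker.record(stream)          # insert a marker after the queued work
wait_yielding(marker)          # yield instead of spinning flat out
```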
That is indeed the only API that it exposes.
I really hope those pow(x, 2) calls are getting turned into x * x, else it's a performance catastrophe / extreme beginner mistake even with vectorisation.
Also, this kind of ultra wide buffering consumes a ton of memory bandwidth for each operation, instead of keeping a small portion in cache/registers. FLOPs are scaling sort of infinitely, whereas memory speed is flat, so this is increasingly a losing game; just because it's faster than glacial Python doesn't mean it's fast compared to a language which actually concerns itself with performance or a more cache aware approach.
For an extreme example of how you can even sometimes beat ultra optimised GPU ML libraries in this way, check out https://github.com/NVlabs/tiny-cuda-nn
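Regarding the pow(x, 2) worry above, it's easy to measure directly. A rough timing sketch (assumes a CUDA device; numbers will vary, and torch.compile or the fuser may already rewrite pow(x, 2) into a multiply):

```python
import time
import torch

x = torch.randn(10_000_000, device="cuda")

def bench(fn, iters=100):
    fn(x); torch.cuda.synchronize()          # warm up
    t0 = time.perf_counter()
    for _ in range(iters):
        fn(x)
    torch.cuda.synchronize()
    return (time.perf_counter() - t0) / iters

print("torch.pow(x, 2):", bench(lambda t: torch.pow(t, 2)))
print("x * x:          ", bench(lambda t: t * t))
```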
I wondered about this same thing. Your logic about cache/registers is certainly true on CPUs, but what about GPUs? Hence this blurb:
> I studied the CUDA traces closely and found that vectorization does indeed reduce many aspects of the GPU workload, greatly reducing the number of operations and decreasing the total amount of time spent on the fundamental computations of the algorithm. However it also introduces overhead (mentioned above) by interspersing operations that permute and reorder the tensors, or splitting them into groups then concatenating results. Sometimes the reduced “fundamental” time outweighs the additional overhead, while other times the overhead outweighs the reduction in fundamental time.
Here are some examples not included in the blog post:
- Total time spent in aten::cdist kernel
  - Baseline: 2.834s (4900 calls)
  - Vectorized: 2.686s (500 calls)
- Total time spent in aten::mul kernel
  - Baseline: 5.745s (80700 calls)
  - Vectorized: 5.555s (8100 calls)

This nice little win applies to tons of other kernels, almost across the board. As you point out, CPU intuition suggests this should have been slower, so this was an interesting outcome.

On the other hand, some specific increases occur:

- Total time spent in aten::cat kernel
  - Baseline: 0.680s
  - Vectorized: 1.849s

So working in fewer, larger batches doesn't only enable outrunning the GPU. It decreases the total GPU workload... then adds some overhead. But some of this overhead could be removed with custom CUDA kernels, so I think this is an interesting direction even if you solve the CPU problem some other way.

(The pow(x, 2) is only there in the toy code, not my actual kernel, so I didn't performance-tune it.)
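In case it's useful, per-kernel totals like these can be pulled from torch.profiler; a generic sketch (not necessarily the exact setup behind the numbers above):

```python
import torch
from torch.profiler import profile, ProfilerActivity

x = torch.randn(1024, 64, device="cuda")

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    for _ in range(100):
        torch.cdist(x, x).sum()
    torch.cuda.synchronize()

# Aggregate by op name; rows include aten::cdist, aten::sum, etc.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```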
"For example, what if the parallel sums are of different lengths? On GPUs, fast parallel reductions only work when inputs all have the same length. [...] Vexpr’s vectorizer groups the inputs by length and performs a reduced number of operations—one for each unique length."
I'm surprised this is necessary; I thought modern vectorization on both CPU and GPU handled heterogeneous cases like this handily, with conditional execution (on SIMT GPUs) or mask registers (on SIMD CPUs).
There may be tricks that I don't know about. One quick experimental answer I can give: if I change to looping over the sums and rerun Benchmark 3, my time in the aten::sum CUDA kernel increases from 0.779s (before) to 0.840s (after). So CUDA doesn't seem to automagically handle this.
I will note that these grouped operations occasionally cause a net loss in performance compared to "naive" looping, since they involve calling PyTorch's "x.view(...)", which is usually ~instant but sometimes adds extra CUDA operations on the backward pass. Grouping always reduces the time spent in aten::add, but adds these extra ops. A really smart vectorizer would use heuristics to decide how/whether to group operations according to the target hardware; my current vectorizer just does the grouping every time.
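To make the grouping concrete, here is a toy sketch of the general approach (my own illustration, not Vexpr's actual implementation, and it ignores the view/backward subtleties mentioned above):

```python
from collections import defaultdict

import torch

def grouped_sums(segments):
    """Sum each 1-D tensor in `segments`, batching segments of equal length."""
    by_len = defaultdict(list)
    for i, seg in enumerate(segments):
        by_len[seg.shape[0]].append(i)

    out = torch.empty(len(segments))
    for length, idxs in by_len.items():
        stacked = torch.stack([segments[i] for i in idxs])  # shape (n, length)
        out[torch.tensor(idxs)] = stacked.sum(dim=1)        # one sum per unique length
    return out

# Three segments, two distinct lengths -> two batched sum calls instead of three.
segs = [torch.randn(3), torch.randn(5), torch.randn(3)]
print(grouped_sums(segs))
print(torch.stack([s.sum() for s in segs]))  # same result, one sum per segment
```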
Did you leave Numenta? Enjoyed the paper discussions you all posted to YT.
Glad to hear :)
Yes, I'm off doing my own thing now. Deep Learning went so much further than I ever expected, and now I'm drawn to all the things that can be built today. Who knows, maybe I'll swing back into neuroscience in a few years. (Still friends with my old coworkers / bosses.)
What’s the state-of-the-art in terms of compiler optimization here? Seems like auto-vectorization could be a somewhat simple transform, no?
Pretty cool to see people using compile in the wild :)
Yeah, one unspoken theme of this blog post is "look how nice torch.compile is" :)
Fun fact, I had to put in extra work to get torch.compile working with my code, for understandable reasons. My library, Vexpr, literally runs an interpreter inside of Python, reading a big tree-like namedtuple-of-namedtuples "expression" data structure and evaluating it recursively. That data structure was way too fancy for torch.compile's guards, so I actually wrote code [1] that converts a Vexpr expression into a big Python code string and evals it, factoring the interpreter out of the code, then I pass that eval'd function into torch.compile.
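As a toy illustration of that trick (my own sketch, not Vexpr's actual generator): flatten the expression tree into ordinary Python source, eval it into a plain function, and only then hand it to torch.compile, so the guards never see the fancy data structure:

```python
import torch

# Pretend this string was produced by walking a Vexpr-style expression tree.
src = "lambda x: torch.sum(torch.cdist(x, x), dim=-1)"

fn = eval(src, {"torch": torch})   # a plain function; the interpreter is gone
compiled_fn = torch.compile(fn)    # guards now only need to cover tensor inputs

x = torch.randn(8, 3)
print(compiled_fn(x))
```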
One torch.compile capability I would be excited to see is compatibility with torch.vmap. One selling point of Vexpr is that you can use vmap with it, so I was sad when I found I couldn't use vmap and still support torch.compile. This made me convert a bunch of my GP kernels [2] to be batch-aware. (This missing capability is also understandable -- both vmap and compile are new.)
Anyway, I'm a fan of what y'all are doing!
[1] https://github.com/outergroup/vexpr/blob/e732e034768443386f9... [2] https://github.com/outergroup/outer-loop-cookbook/blob/5d94c...
I spend a lot of sweat on the guards - I am very interested in how it failed! Can you say more? Did guard creation fail, or did guard check_fn perf overhead destroy it?
> One torch.compile capability I would be excited to see is compatibility with torch.vmap
We added support for torch.func.vmap, iirc - check out test_higher_order_ops.py, grep for vmap.
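A minimal way to try that combination (just a sanity-check sketch; behavior will depend on the PyTorch version):

```python
import torch
from torch.func import vmap

def per_row(x):
    return (x * x).sum()

compiled = torch.compile(vmap(per_row))  # vmap over the leading dim, then compile

x = torch.randn(16, 32)
print(torch.allclose(compiled(x), (x * x).sum(dim=1)))
```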