[ANN] cuTile.jl: Tile-based GPU programming for CUDA GPUs

I’m happy to announce an initial release of cuTile.jl, a new JuliaGPU package that makes it possible to program (Blackwell) NVIDIA GPUs using NVIDIA's tile-based abstraction. This simplifies writing kernels, because you no longer have to think about threads or memory hierarchies. Everything is global memory accessed by blocks of threads:

using CUDA
import cuTile as ct

# Define kernel
function vadd(a, b, c, tile_size::Int)
    pid = ct.bid(1)
    tile_a = ct.load(a, pid, (tile_size,))
    tile_b = ct.load(b, pid, (tile_size,))
    ct.store(c, pid, tile_a + tile_b)
    return
end

# Launch
vector_size = 2^20
tile_size = 16
a, b = CUDA.rand(Float32, vector_size), CUDA.rand(Float32, vector_size)
c = CUDA.zeros(Float32, vector_size)

ct.launch(vadd, (cld(vector_size, tile_size), 1, 1), a, b, c, ct.Constant(tile_size))

@assert c == a .+ b

Compare this to a CUDA.jl vector addition:

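using CUDA: i32   # Int32 literal suffix, not exported by default

function vadd(a, b, c)
    i = (blockIdx().x-1i32) * blockDim().x + threadIdx().x
    # guard against threads that fall past the end of the arrays
    if i <= length(c)
        c[i] = a[i] + b[i]
    end
    return
end

# a block holds at most 1024 threads, so spread the work over multiple blocks
threads = 256
@cuda threads=threads blocks=cld(vector_size, threads) vadd(a, b, c)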
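
To make the difference between the two kernels concrete, here is a CPU-only sketch of how each one splits up the work. It uses plain Julia arrays and needs neither a GPU nor cuTile.jl. The function names `vadd_tiles!` and `vadd_threads!` are made up for this illustration, and the tile indexing (1-based tile index `pid` covering elements `(pid-1)*tile_size+1` through `pid*tile_size`) is an assumption inferred from the launch grid `cld(vector_size, tile_size)` above, not a statement about cuTile.jl internals.

```julia
# CPU-only illustration; no GPU or cuTile.jl required.
# Assumption: tile `pid` (1-based) covers elements (pid-1)*tile_size+1 : pid*tile_size.

# Tile style: each "block" processes a whole tile in one go.
function vadd_tiles!(c, a, b, tile_size)
    for pid in 1:cld(length(c), tile_size)        # plays the role of ct.bid(1)
        lo = (pid - 1) * tile_size + 1
        hi = min(pid * tile_size, length(c))      # clamp the last, partial tile
        c[lo:hi] .= a[lo:hi] .+ b[lo:hi]          # whole-tile load, add, store
    end
    return c
end

# Thread style: every (block, thread) pair processes one scalar element.
function vadd_threads!(c, a, b, threads)
    for block in 1:cld(length(c), threads), thread in 1:threads
        i = (block - 1) * threads + thread
        i <= length(c) && (c[i] = a[i] + b[i])    # bounds guard
    end
    return c
end

a, b = rand(Float32, 1000), rand(Float32, 1000)
@assert vadd_tiles!(similar(a), a, b, 16) == a .+ b
@assert vadd_threads!(similar(a), a, b, 256) == a .+ b
```

In the tile version the index arithmetic lives inside the load and store, which is the part the tile abstraction takes off your hands. The thread version has to compute and guard a scalar index by hand.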

Of course, the real power of tile-based programming becomes obvious with more complicated kernels. For example, a full-blown matrix multiplication delivering pretty good performance (about 75% of CUBLAS) is as simple as:

function matmul_kernel(A::ct.TileArray{T,2}, B::ct.TileArray{T,2}, C::ct.TileArray{T,2},
                       tm::Int, tn::Int, tk::Int) where {T}
    M = size(A, 1)
    N = size(B, 2)
    K = ct.num_tiles(A, 2, (tm, tk))

    m, n = ct.bid(1), ct.bid(2)

    # K reduction loop - accumulate partial products
    acc = ct.full((tm, tn), zero(Float32), Float32)
    k = Int32(1)
    while k <= K
        a = ct.load(A, (m, k), (tm, tk); padding_mode=ct.PaddingMode.Zero)
        b = ct.load(B, (k, n), (tk, tn); padding_mode=ct.PaddingMode.Zero)
        if T === Float32
            # make use of tensor cores
            a = convert(ct.Tile{ct.TFloat32}, a)
            b = convert(ct.Tile{ct.TFloat32}, b)
        end
        acc = muladd(a, b, acc)
        k += Int32(1)
    end

    ct.store(C, (m, n), convert(ct.Tile{T}, acc))

    return nothing
end
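
For readers who want to check the structure of the K reduction loop without a GPU, here is a rough CPU reference in plain Julia. It is only a sketch under stated assumptions: `load_tile` and `matmul_tiled` are hypothetical helpers written for this post, not cuTile.jl functions. It also assumes that tile indices are 1-based, that zero padding fills the out-of-bounds part of edge tiles, and that the number of K tiles equals `cld(size(A, 2), tk)`.

```julia
# CPU-only illustration; no GPU or cuTile.jl required.
# `load_tile` is a hypothetical helper, not part of cuTile.jl.

# Copy tile (i, j) of shape (th, tw) out of X; out-of-bounds entries stay zero.
function load_tile(X::AbstractMatrix{T}, (i, j), (th, tw)) where {T}
    tile = zeros(T, th, tw)
    rows = ((i - 1) * th + 1):min(i * th, size(X, 1))
    cols = ((j - 1) * tw + 1):min(j * tw, size(X, 2))
    tile[1:length(rows), 1:length(cols)] .= X[rows, cols]
    return tile
end

function matmul_tiled(A::AbstractMatrix{T}, B::AbstractMatrix{T}, tm, tn, tk) where {T}
    M, N = size(A, 1), size(B, 2)
    K = cld(size(A, 2), tk)                    # tiles along the reduction axis (assumed)
    C = zeros(T, M, N)
    for m in 1:cld(M, tm), n in 1:cld(N, tn)   # one iteration per (bid(1), bid(2)) pair
        acc = zeros(Float32, tm, tn)           # accumulate in Float32, as in the kernel
        for k in 1:K
            a = load_tile(A, (m, k), (tm, tk))
            b = load_tile(B, (k, n), (tk, tn))
            acc .+= a * b                      # partial product for this K tile
        end
        rows = ((m - 1) * tm + 1):min(m * tm, M)
        cols = ((n - 1) * tn + 1):min(n * tn, N)
        # store only the in-bounds part of the accumulator
        C[rows, cols] .= T.(acc[1:length(rows), 1:length(cols)])
    end
    return C
end

A, B = rand(Float32, 50, 70), rand(Float32, 70, 30)
@assert matmul_tiled(A, B, 16, 16, 8) ≈ A * B
```

The matrix sizes are deliberately not multiples of the tile sizes, so the edge tiles exercise the zero-padding path. Padding with zeros is safe here because the padded entries contribute nothing to the sum.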

As should be obvious from the 0.1 version number, cuTile.jl is under heavy development, and many features are still missing. Notably, not all of the Julia language is supported yet, because cuTile.jl brings its own Julia-to-Tile-IR compiler. So please try out the package, file bugs, or create PRs!

For more information, see the NVIDIA developer zone blog post, or check out the repository, which contains many more examples.