HipKittens: Fast and Furious AMD Kernels

Team: William Hu, Drew Wadsworth, Sean Siddens, Stanley Winata, Daniel Fu, Ryan Swann, Muhammad Osama, Christopher Ré, Simran Arora
Links: Arxiv | Code

AI is gated by hardware. We think that opening up AI’s compute landscape is one of the most important problems to be working on right now. Building towards this goal, we present HipKittens: SoTA AMD kernels and a collection of opinionated programming primitives to make AMD kernel dev easier!

Named after AMD's CUDA equivalent, called HIP.

Building towards multi-silicon AI systems

While AI has largely used a single hardware vendor to get to its current stage, AMD GPU hardware now offers state-of-the-art peak compute and memory bandwidth. However, this performance is locked away from AI workflows due to the lack of mature AMD software.

| Spec | NVIDIA B200 SXM5 | AMD MI355X OAM |
| --- | --- | --- |
| BF16 matrix / tensor | 2.2 PFLOPs | 2.5 PFLOPs |
| MXFP8 matrix / tensor | 4.5 PFLOPs | 5.0 PFLOPs |
| MXFP6 matrix / tensor | 4.5 PFLOPs | 10.1 PFLOPs |
| MXFP4 matrix / tensor | 9.0 PFLOPs | 10.1 PFLOPs |
| Memory capacity | 180 GB | 288 GB |
| Memory bandwidth | 8.0 TB/s | 8.0 TB/s |

Table 1: Hardware overview. Peak memory and compute speeds for the latest-generation GPU platforms.

The AMD software ecosystem includes AITER, a high-performance AI kernel library; PyTorch and a few compilers (Triton, Mojo, TileLang); and Composable Kernel (CK), AMD's C++-based programming model for writing kernels. Yet despite gigawatt-scale AMD deployments, the software remains brittle.

  1. The existing software offerings fail to consistently achieve peak performance. CK kernels frequently underperform (see our evaluations below). AITER and PyTorch are volatile; for instance, the AITER and PyTorch SDPA Llama GQA backwards kernels achieve just 30% and 24% of SoTA performance, respectively, on AMD MI355X GPUs. The compilers currently sacrifice significant performance and have not yet demonstrated reusable programming primitives for AMD. Further, we find that some critical aspects of hardware functionality around bank conflict avoidance are undocumented in the CDNA ISA, as discussed in our technical deep-dive blog.
A note on current compilers: while their Pythonic interfaces and portability are helpful, our goal is to identify the principles that lead to peak performance as well.

  2. As a result, AMD's most performant AI kernels are hand-optimized by experts in raw assembly. This approach is difficult to scale to the breadth of AI workloads, and consequently many widely used AI workloads remain unsupported or under-optimized on AMD (e.g., some attention problem shapes, the non-causal GQA backwards pass, and memory-bound kernels).

Figure: what is raw assembly? can't understand it? that's the point!

Given all of this, the best path forward for multi-silicon kernel development remains an open question!

As a result, the AI community says that there’s a CUDA moat in AI software: tweet #1, tweet #2, tweet #3 and many more.

That being said, developing performant NVIDIA kernels was also painstakingly tedious a few years ago. Using low-level CUDA/CUTLASS, it took two years from the H100 GPU's release to the release of peak-performance open-source attention kernels. Compilers and LLMs-for-kernel-development on NVIDIA have so far sacrificed performance for simplicity and struggled to quickly support new hardware features.

Opinionated primitives are simplifying the process of writing performant NVIDIA kernels. Through the community's effort on these DSLs and AI assistants, NVIDIA kernel development is starting to get easier! Last year, we shared our opinionated take on kernel DSLs: keep the familiar PyTorch feel, but make the primitives C++ embedded to get peak performance, simplicity, and extensibility to new hardware platforms and AI workloads. In May 2024, we shared ThunderKittens (TK), and we have been excited to see its ideas appear in a wave of frameworks this year, like CuTe DSL in September 2025, tinygrad "tinykittens", TileLang in April 2025, and Gluon in June 2025. It's been fun to see TK used at companies like Together AI, Jump Trading, and Cursor, and in academic research.

We were curious, then, whether entirely new programming primitives are needed to simplify AMD kernel development, or whether existing primitives suffice. It wasn't obvious to us where this exploration would end up; most modern kernels are designed around NVIDIA-specific hardware features. AMD hardware differs meaningfully (no wgmma/tcgen05 pipelined async matmuls, no TMA, no mbarriers, no register reallocation, smaller shared memory, chiplets instead of a monolithic die, etc.), and we weren't sure where performance would end up, nor how different the primitives might look compared to NVIDIA frameworks.
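
A few of these differences can be read straight off the device properties. The small plain-HIP snippet below is our own illustration (not code from the HipKittens repo); it prints the 64-lane wave size, the per-block shared memory (LDS) budget, and the compute unit count for the current device.

```cpp
#include <cstdio>
#include <hip/hip_runtime.h>

// Query a few device properties behind the scheduling differences discussed
// above: AMD CDNA GPUs report a 64-lane wave (vs. a 32-lane warp), a smaller
// per-block shared memory (LDS) budget, and many compute units spread across
// chiplets rather than a monolithic die.
int main() {
    hipDeviceProp_t props;
    if (hipGetDeviceProperties(&props, 0) != hipSuccess) {
        std::fprintf(stderr, "no HIP device found\n");
        return 1;
    }
    std::printf("device:               %s\n", props.name);
    std::printf("wave/warp size:       %d lanes\n", props.warpSize);
    std::printf("shared mem per block: %zu KiB\n", props.sharedMemPerBlock / 1024);
    std::printf("compute units:        %d\n", props.multiProcessorCount);
    return 0;
}
```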

Our exploration resulted in HipKittens, a minimal, opinionated collection of C++ embedded programming primitives for fast AMD kernels. We find:

  1. The tile abstraction generalizes across architectures. The core tile-based primitives we identified as effective on NVIDIA GPUs—including tile types, PyTorch-like bulk compute operators over tiles, and composable load/store interfaces—translate naturally to AMD.
  2. Backend implementations are architecture-specific. The underlying memory access patterns (e.g., swizzling schemes, register scheduling) that realize the tile interface differ between AMD and NVIDIA due to hardware differences.
  3. Scheduling strategies adapt to hardware constraints. The scheduling patterns both within a processor and across processors differ on AMD compared to NVIDIA, reflecting fundamental architectural distinctions. Wave specialization underperforms on CDNA3 and CDNA4. However, reasoning about schedules at the tile granularity—rather than at the level of individual registers or memory transactions—continues to simplify development, maintain code readability, and enable peak performance.

Ultimately, we see that tile-based abstractions remain general across architectures, providing evidence that a unified, performant programming model for AI accelerators is achievable. The key insight is separating the interface (tiles and operations on tiles) from the implementation (how tiles map to hardware), allowing the same high-level programming model to target diverse GPU architectures.
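
To make the interface-versus-implementation split concrete, here is a minimal hypothetical sketch of what Kittens-style tile primitives look like in C++. The type and function names (reg_tile, shared_tile, load, store, mma_AB) are illustrative assumptions rather than HipKittens' exact API; the point is that kernels program against whole-tile operations, while each backend supplies the architecture-specific definitions.

```cpp
// Illustrative sketch only: these names are assumptions in the spirit of
// Kittens-style tile primitives, not HipKittens' exact API. Kernels are
// written against whole-tile operations; each backend supplies the
// architecture-specific definitions (swizzling, register scheduling, and
// which matrix instructions to issue).
#include <hip/hip_runtime.h>

template <typename T, int Rows, int Cols>
struct reg_tile {                 // a tile distributed across a wave's registers
    T data[Rows * Cols / 64];     // per-lane storage for a 64-lane AMD wave
};

template <typename T, int Rows, int Cols>
struct shared_tile {              // a tile staged in shared memory (LDS on AMD)
    T data[Rows * Cols];          // physical layout is swizzled per architecture
};

// Bulk operators over tiles: the portable interface a kernel programs against.
// Definitions live in the backend and differ between CDNA and NVIDIA targets.
template <typename RT, typename ST>
__device__ void load(RT &dst, const ST &src);             // shared -> registers
template <typename RT, typename ST>
__device__ void store(ST &dst, const RT &src);            // registers -> shared
template <typename CT, typename AT, typename BT>
__device__ void mma_AB(CT &c, const AT &a, const BT &b);  // c += a @ b on matrix cores
```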

Climbing out of the CUDA moat: Introducing HipKittens

We first explored ThunderKittens for NVIDIA, then ThunderMittens on Apple Silicon and now we’re excited to share HipKittens (HK) for AMD!

Figure: The Kittens cinematic universe! Towards multi-silicon AI!

HK kernels are performant, while remaining easy to read and modify! We might not need raw assembly for peak performance AMD kernels any more! Life is good!

Figure: HipKittens riding the ~wave~ (not warp).

Let's go through the results:

  1. Our attention forwards kernels are written in ~500 lines of code and outperform all of AMD's baselines on average, including the AITER kernels, which are written in hand-optimized assembly! We show different head dimensions (D) and sequence lengths (N), for both causal and non-causal settings.
  2. Our GEMM kernel features a hot loop that's <100 lines of code and achieves peak performance. Again, the AITER and hipBLASLt kernels are programmed in ... raw assembly! (For a flavor of what a tiled GEMM hot loop looks like, see the sketch after this list.)
  3. We also get speedy attention backwards pass, rotary, and fused dropout-residual-layernorm kernels compared to the strongest available baselines! These results use head dimension 128 and vary the sequence length.
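
For readers unfamiliar with tiled GEMM structure, here is a minimal plain-HIP sketch of a shared-memory-tiled GEMM hot loop. This is our own illustration, not the HipKittens kernel, and it is nowhere near peak performance: the HK version replaces the hand-written indexing below with tile types and bulk operators, issues MFMA matrix-core instructions instead of scalar FMAs, and adds swizzled LDS layouts and pipelining.

```cpp
#include <hip/hip_runtime.h>

// Minimal plain-HIP tiled GEMM for illustration only (fp32, no MFMA, no
// pipelining). Launch with a (TILE, TILE) thread block per output tile.
constexpr int TILE = 32;

__global__ void gemm_tiled(const float *A, const float *B, float *C,
                           int M, int N, int K) {
    __shared__ float a_s[TILE][TILE];
    __shared__ float b_s[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;   // output row this thread owns
    int col = blockIdx.x * TILE + threadIdx.x;   // output column this thread owns
    float acc = 0.0f;

    for (int k0 = 0; k0 < K; k0 += TILE) {       // march over K, tile by tile
        // Stage one tile of A and B into shared memory (guarding the edges).
        a_s[threadIdx.y][threadIdx.x] =
            (row < M && k0 + threadIdx.x < K) ? A[row * K + k0 + threadIdx.x] : 0.0f;
        b_s[threadIdx.y][threadIdx.x] =
            (k0 + threadIdx.y < K && col < N) ? B[(k0 + threadIdx.y) * N + col] : 0.0f;
        __syncthreads();

        for (int kk = 0; kk < TILE; ++kk)        // the hot loop: one tile's worth of MACs
            acc += a_s[threadIdx.y][kk] * b_s[kk][threadIdx.x];
        __syncthreads();
    }
    if (row < M && col < N) C[row * N + col] = acc;
}
```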

Multi-silicon AI is coming!

Realizing AI's full potential requires diverse, open hardware. Today, that means making AMD GPUs truly accessible.

We want more AI in the world. AI has relied on and innovated around a single hardware provider, but we need to be able to use and experiment with every kind of compute available, and we need access to the fastest hardware out there. We're happy to help address these problems with HipKittens!

Check out part two for a technical deep dive on HK.

Links: Arxiv | Code