> Theoretical TFLOPS ≠ Real-world Performance

# Testing Theoretical Maximum FLOPS on GPUs
This project measures the maximum FLOPS (Floating-Point Operations Per Second) actually achievable on various GPU models, which typically falls short of the theoretical peak quoted on the datasheet. It is based on the original work by Stas Bekman.
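For intuition, the core measurement looks roughly like the minimal sketch below (illustrative, not the project's exact code; assumes PyTorch with a CUDA device and BF16 inputs): time an MxNxK matmul repeatedly and convert the median time to TFLOPS, using the fact that one such matmul performs `2*M*N*K` floating-point operations.

```python
# Minimal sketch of the measurement (illustrative; the real tool also
# warms up, flushes the L2 cache, and searches over many shapes).
import torch

def measure_tflops(m: int, n: int, k: int, repeats: int = 100) -> float:
    a = torch.randn(m, k, dtype=torch.bfloat16, device="cuda")
    b = torch.randn(k, n, dtype=torch.bfloat16, device="cuda")
    times = []
    for _ in range(repeats):
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        torch.mm(a, b)
        end.record()
        torch.cuda.synchronize()
        times.append(start.elapsed_time(end) / 1e3)  # ms -> s
    median = sorted(times)[len(times) // 2]
    return 2 * m * n * k / median / 1e12  # one matmul = 2*M*N*K FLOPs

print(f"{measure_tflops(6912, 16384, 2048):.1f} TFLOPS")
```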
## Key Features
- **Median instead of maximum:** each matrix multiplication is repeated 100 times and we take the median, not the best run.
- **L2 cache clearing:** the GPU's L2 cache is cleared between iterations so that cached data does not inflate the results; see the article by SemiAnalysis.
- **Optimized search:** unlike the original implementation, which uses a brute-force sweep, this version leverages Optuna for efficient parameter optimization (see the sketch after this list).
- **Visualization:** Optuna provides insightful visualizations of the optimization process.
- **Data collection:** an optional feature allows submitting results to a remote API for data collection and analysis.
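As a rough illustration of how these pieces fit together, the sketch below wires an Optuna study to the timing routine from the previous section and flushes the L2 cache by overwriting a large dummy buffer. The buffer size, shape ranges, step, and trial count are assumptions for the example, not the project's exact values.

```python
# Hypothetical sketch: Optuna proposes (M, N, K) shapes, and a buffer much
# larger than the GPU's L2 cache is overwritten between runs to evict
# stale data, so cache hits don't inflate the numbers.
import optuna
import torch

l2_flush = torch.empty(256 * 1024 * 1024 // 4, dtype=torch.int, device="cuda")

def flush_l2_cache() -> None:
    l2_flush.zero_()          # touch every line of the buffer
    torch.cuda.synchronize()

def objective(trial: optuna.Trial) -> float:
    # Step of 64 keeps shapes tile-friendly; ranges are illustrative.
    m = trial.suggest_int("M", 1024, 20480, step=64)
    n = trial.suggest_int("N", 1024, 20480, step=64)
    k = trial.suggest_int("K", 1024, 20480, step=64)
    flush_l2_cache()
    return measure_tflops(m, n, k)  # from the sketch above

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=200)
print(study.best_params, study.best_value)

# Optuna's built-in plots can then visualize the search, e.g.:
# optuna.visualization.plot_optimization_history(study)
```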
## Stats
| GPU Model | Best Shape (MxNxK) | TFLOPS |
|---|---|---|
| NVIDIA RTX 4000 SFF Ada Generation | 2304x5120x1536 | 59.0 |
| NVIDIA A10G | 20480x18112x19712 | 69.7 |
| NVIDIA GeForce RTX 3090 | 5248x15040x1024 | 78.0 |
| NVIDIA RTX 4000 Ada Generation | 14464x5312x20480 | 82.7 |
| NVIDIA GeForce RTX 3090 Ti | 10752x15488x10752 | 86.0 |
| NVIDIA L4 | 1024x6016x1792 | 91.4 |
| NVIDIA RTX A5000 | 17856x17024x3584 | 93.9 |
| Tesla V100-SXM2-32GB | 17216x20480x4096 | 94.0 |
| Tesla V100-SXM2-16GB | 2048x17920x1216 | 96.1 |
| Radeon RX 7900 XTX | 11008x3392x9216 | 113.3 |
| DCU K100_AI | 9344x3968x6592 | 126.3 |
| NVIDIA RTX A6000 | 9856x12480x13248 | 131.2 |
| AMD Instinct MI210 | 17536x7360x2304 | 142.8 |
| NVIDIA L40 | 3712x2624x11136 | 170.3 |
| NVIDIA GeForce RTX 4090 | 14336x4096x4096 | 178.8 |
| NVIDIA L40S | 4416x3776x3072 | 252.0 |
| NVIDIA A100 PCIe | 2304x5120x1536 | 256.4 |
| NVIDIA A100 SXM | 6912x16384x2048 | 267.9 |
| NVIDIA RTX 6000 Ada Generation | 2624x5632x3328 | 278.5 |
| NVIDIA H100 NVL* | 2560x2176x8192 | 488.5 |
| NVIDIA H100 PCIe | 6912x16384x2048 | 499.5 |
| AMD Instinct MI300X | 4096x8448x4864 | 788.2 |
| NVIDIA H100 SXM 96GB | 16896x15680x1024 | 807.1 |
| NVIDIA H100 SXM 80GB | 6144x17920x2816 | 821.2 |
| NVIDIA GH200 96GB | 7616x17664x4480 | 852.5 |
| NVIDIA GH200 144G HBM3e | 7616x17664x4480 | 853.8 |
\*For the H100 NVL we benchmark only a single card, as multi-GPU is not supported.
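As a sanity check, any row can be converted back into the implied median kernel time, since one MxNxK matmul performs `2*M*N*K` floating-point operations. Taking the RTX 4090 row as an example:

```python
# Back-of-the-envelope check for the RTX 4090 row (14336x4096x4096, 178.8 TFLOPS)
m, n, k = 14336, 4096, 4096
flops = 2 * m * n * k        # ~4.81e11 FLOPs per matmul
t = flops / 178.8e12         # implied median kernel time in seconds
print(f"{t * 1e3:.2f} ms per matmul")  # ~2.69 ms
```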
## Install
```bash
# We recommend `uv`, an extremely fast Python package installer written in Rust.
# It is a drop-in replacement for pip:
pip install uv

git clone https://github.com/mag-/gpu_benchmark
cd gpu_benchmark
uv venv
source .venv/bin/activate
uv pip install -r requirements.txt
./mamf-finder.py
```
## TODO

- Check raw CUDA
- Check tinygrad
## Acknowledgements

Thanks to Bernhard from GPTshop.ai for giving me access to a GH200.

Special thanks to Stas Bekman for the original implementation and research.

