> Theoretical TFLOPS ≠ Real-world Performance

# Testing Theoretical Maximum FLOPS on GPUs
This project measures the maximum FLOPS (Floating-Point Operations Per Second) actually achievable on various GPU models, which typically falls short of the theoretical peak quoted on the datasheet. It is based on the original work by Stas Bekman.
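For intuition, the core measurement looks roughly like the minimal sketch below (illustrative, not the project's exact code; assumes PyTorch with a CUDA device and BF16 inputs): time an MxNxK matmul repeatedly and convert the median time to TFLOPS, using the fact that one such matmul performs `2*M*N*K` floating-point operations.

```python
# Minimal sketch of the measurement (illustrative; the real tool also
# warms up, flushes the L2 cache, and searches over many shapes).
import torch

def measure_tflops(m: int, n: int, k: int, repeats: int = 100) -> float:
    a = torch.randn(m, k, dtype=torch.bfloat16, device="cuda")
    b = torch.randn(k, n, dtype=torch.bfloat16, device="cuda")
    times = []
    for _ in range(repeats):
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        torch.mm(a, b)
        end.record()
        torch.cuda.synchronize()
        times.append(start.elapsed_time(end) / 1e3)  # ms -> s
    median = sorted(times)[len(times) // 2]
    return 2 * m * n * k / median / 1e12  # one matmul = 2*M*N*K FLOPs

print(f"{measure_tflops(6912, 16384, 2048):.1f} TFLOPS")
```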
## Key Features
- **Median instead of maximum:** each matrix multiplication is repeated 100 times and we take the median, not the best run.
- **L2 cache clearing:** the GPU's L2 cache is cleared between iterations so that cached data does not inflate the results; see the article by SemiAnalysis.
- **Optimized search:** unlike the original implementation, which uses a brute-force sweep, this version leverages Optuna for efficient parameter optimization (see the sketch after this list).
- **Visualization:** Optuna provides insightful visualizations of the optimization process.
- **Data collection:** an optional feature allows submitting results to a remote API for data collection and analysis.
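As a rough illustration of how these pieces fit together, the sketch below wires an Optuna study to the timing routine from the previous section and flushes the L2 cache by overwriting a large dummy buffer. The buffer size, shape ranges, step, and trial count are assumptions for the example, not the project's exact values.

```python
# Hypothetical sketch: Optuna proposes (M, N, K) shapes, and a buffer much
# larger than the GPU's L2 cache is overwritten between runs to evict
# stale data, so cache hits don't inflate the numbers.
import optuna
import torch

l2_flush = torch.empty(256 * 1024 * 1024 // 4, dtype=torch.int, device="cuda")

def flush_l2_cache() -> None:
    l2_flush.zero_()          # touch every line of the buffer
    torch.cuda.synchronize()

def objective(trial: optuna.Trial) -> float:
    # Step of 64 keeps shapes tile-friendly; ranges are illustrative.
    m = trial.suggest_int("M", 1024, 20480, step=64)
    n = trial.suggest_int("N", 1024, 20480, step=64)
    k = trial.suggest_int("K", 1024, 20480, step=64)
    flush_l2_cache()
    return measure_tflops(m, n, k)  # from the sketch above

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=200)
print(study.best_params, study.best_value)

# Optuna's built-in plots can then visualize the search, e.g.:
# optuna.visualization.plot_optimization_history(study)
```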
## Stats
| GPU Model | Best Shape (MxNxK) | TFLOPS |
|---|---|---|
| NVIDIA RTX 4000 SFF Ada Generation | 2304x5120x1536 | 59.0 |
| NVIDIA A10G | 20480x18112x19712 | 69.7 |
| NVIDIA GeForce RTX 3090 | 5248x15040x1024 | 78.0 |
| NVIDIA RTX 4000 Ada Generation | 14464x5312x20480 | 82.7 |
| NVIDIA GeForce RTX 3090 Ti | 10752x15488x10752 | 86.0 |
| NVIDIA L4 | 1024x6016x1792 | 91.4 |
| NVIDIA RTX A5000 | 17856x17024x3584 | 93.9 |
| Tesla V100-SXM2-32GB | 17216x20480x4096 | 94.0 |
| Tesla V100-SXM2-16GB | 2048x17920x1216 | 96.1 |
| Radeon RX 7900 XTX | 11008x3392x9216 | 113.3 |
| DCU K100_AI | 9344x3968x6592 | 126.3 |
| NVIDIA RTX A6000 | 9856x12480x13248 | 131.2 |
| AMD Instinct MI210 | 17536x7360x2304 | 142.8 |
| NVIDIA L40 | 3712x2624x11136 | 170.3 |
| NVIDIA GeForce RTX 4090 | 14336x4096x4096 | 178.8 |
| NVIDIA L40S | 4416x3776x3072 | 252.0 |
| NVIDIA A100 PCIe | 2304x5120x1536 | 256.4 |
| NVIDIA A100 SXM | 6912x16384x2048 | 267.9 |
| NVIDIA RTX 6000 Ada Generation | 2624x5632x3328 | 278.5 |
| NVIDIA H100 NVL* | 2560x2176x8192 | 488.5 |
| NVIDIA H100 PCIe | 6912x16384x2048 | 499.5 |
| AMD Instinct MI300X | 4096x8448x4864 | 788.2 |
| NVIDIA H100 SXM 96GB | 16896x15680x1024 | 807.1 |
| NVIDIA H100 SXM 80GB | 6144x17920x2816 | 821.2 |
| NVIDIA GH200 96GB | 7616x17664x4480 | 852.5 |
| NVIDIA GH200 144G HBM3e | 7616x17664x4480 | 853.8 |
\*For the H100 NVL we benchmark only a single card, as multi-GPU is not supported.
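As a sanity check, any row can be converted back into the implied median kernel time, since one MxNxK matmul performs `2*M*N*K` floating-point operations. Taking the RTX 4090 row as an example:

```python
# Back-of-the-envelope check for the RTX 4090 row (14336x4096x4096, 178.8 TFLOPS)
m, n, k = 14336, 4096, 4096
flops = 2 * m * n * k        # ~4.81e11 FLOPs per matmul
t = flops / 178.8e12         # implied median kernel time in seconds
print(f"{t * 1e3:.2f} ms per matmul")  # ~2.69 ms
```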
## Install
```bash
# We recommend `uv`, an extremely fast Python package installer written in Rust.
# It is a drop-in replacement for pip:
pip install uv

git clone https://github.com/mag-/gpu_benchmark
cd gpu_benchmark
uv venv
source .venv/bin/activate
uv pip install -r requirements.txt
./mamf-finder.py
```
## TODO

- Check raw CUDA
- Check tinygrad
## Acknowledgements

Thanks to Bernhard from GPTshop.ai for giving me access to a GH200.

Special thanks to Stas Bekman for the original implementation and research.

