Inference Arena runs the same training and inference workload through every supported ML framework on every available platform, then publishes the results side by side. Each tab below covers one model — pick your model, filter to the frameworks you care about, and compare. Lower numbers are better; bold marks the best matching framework on each platform.
HuggingFaceTB/SmolLM2-135M — 134.5M parameter decoder-only language model.
Benchmark config: seq_len=128, float32, input=[0,1,…,127].
| Platform | Framework | Compile (s) | Inference (ms) | Latency (ms) | Training (ms) | Loss |
|---|---|---|---|---|---|---|
| Intel Xeon @ 2.10GHz | PyTorch 2.11.0+cu130 (CPU) | 135.63 | 188 | 18 | 486 | 10.98 |
| | ONNX Runtime 1.24.4 (CPU) | 65.50 | 118 | 20 | — | 10.98 |
| | JAX 0.9.2 (CPU) | 6.79 | 194 | 31 | 2107 | 10.98 |
| | Candle (CPU) | 0.31 | 453 | 61 | — | 11.11 |
| | Luminal (CPU) | 3.37 | 17006 | — | 14459 | 10.81 |
| | Burn (wgpu/Lavapipe) | | | | | |
| | Meganeura (Vulkan/Lavapipe) | 7.29 | 3933 | 852 | 3651 | 10.99 |
| | llama.cpp (CPU) | 0.10 | 221 | 24 | — | 10.98 |
| AMD Radeon 890M Graphics | PyTorch 2.10.0 (ROCm 7.2.53210) | 51.07 | 64 | 27 | 119 | 8.35 |
| | Burn (wgpu/vulkan) | | | | | |
| | Inferi | ✗ | ✗ | ✗ | ✗ | |
| | Meganeura (Vulkan) | 0.59 | 26 | 9.1 | 92 | 8.64 |
| | ONNX Runtime | ✗ | ✗ | ✗ | ✗ | |
| Apple M3 | PyTorch 2.11.0 (MPS) | 0.00 | 356 | 71 | 699 | 8.35 |
| | MLX (MLX) | 0.00 | 97 | — | 253 | 8.64 |
| | Candle (Metal) | | | | | |
| | Burn (wgpu/metal) | | | | | |
| | Inferi | ✗ | ✗ | ✗ | ✗ | |
| | Luminal | ✗ | ✗ | ✗ | ✗ | |
| | Meganeura (Metal) | 1.50 | 201 | 9.1 | 464 | 8.65 |
| | GGML (Metal) | 0.38 | 49 | 11 | — | 8.69 |
| | JAX (METAL) | | | | | |
| NVIDIA GeForce RTX 5080 | PyTorch 2.11.0+cu130 (CUDA 13.0) | 16.83 | 4.0 | 2.8 | 6.5 | 8.35 |
| | Candle (CUDA) | | | | | |
| | Burn (wgpu/vulkan) | | | | | |
| | Inferi | ✗ | ✗ | ✗ | ✗ | |
| | Meganeura (Vulkan) | 0.87 | 5.2 | 2.2 | 17 | 8.64 |
| | GGML (CUDA) | 0.25 | 25 | 1.5 | — | 8.69 |
| | ONNX Runtime (CUDAExecutionProvider) | | | | | |
| | MAX (GPU) | | | | | |
| NVIDIA GeForce RTX 3050 (Windows) | PyTorch 2.11.0+cu128 (CUDA 12.8) | 0.00 | 11 | 5.1 | 51 | 8.35 |
| | Burn (wgpu/vulkan) | | | | | |
| | Inferi | ✗ | ✗ | ✗ | ✗ | |
| | Meganeura (Vulkan/DX12) | 1.40 | 13 | 3.6 | 58 | 8.63 |
| | GGML (CUDA) | 0.31 | 132 | 5.9 | — | 8.69 |
| | ONNX Runtime (CUDAExecutionProvider) | | | | | |
| | JAX | ✗ | ✗ | ✗ | ✗ | |
| Intel(R) Graphics (RPL-U) | PyTorch 2.11.0+xpu (CPU) | 0.00 | 541 | 126 | 1130 | 8.35 |
| | Candle (CPU) | | | | | |
| | Burn (wgpu/vulkan) | | | | | |
| | Inferi (Vulkan) | | | | | |
| | Luminal (CPU) | | | | | |
| | Meganeura (Vulkan) | 1.74 | 172 | 52 | 700 | 8.64 |
| | GGML (CPU) | 0.11 | 433 | 33 | — | 8.69 |
| | ONNX Runtime (CPUExecutionProvider) | | | | | |
| | MAX | ✗ | ✗ | ✗ | ✗ | |
| | JAX (CPU) | | | | | |
| AMD Radeon RX 7900 XT | PyTorch 2.10.0+rocm7.1 (ROCm 7.1.25424) | 73.03 | 10 | 6.8 | 23 | 8.35 |
| | Burn (wgpu/vulkan) | | | | | |
| | Inferi | ✗ | ✗ | ✗ | ✗ | |
| | Meganeura (Vulkan) | 0.99 | 7.3 | 2.1 | 21 | 8.64 |
| | GGML (ROCm) | 0.11 | 259 | 3.4 | — | 8.69 |
| | MAX (GPU) | | | | | |
Correctness: PyTorch vs ONNX Runtime: PASS (loss diff 3.2e-3). PyTorch vs JAX: PASS (loss diff 3.2e-3). PyTorch vs Meganeura: PASS (max error 1.7e-6, loss diff 5.3e-3). PyTorch vs llama.cpp: PASS (loss diff 4.5e-3). Candle, Luminal: CLOSE. Struck-through values are from frameworks running a different (simplified) model.
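The page does not state how a PASS, CLOSE, or DIFFERENT MODEL verdict is derived from the loss diff and max error. The following is a minimal sketch of one plausible rule, assuming each measurement is compared against a tolerance. The function name `classify` and all three tolerance values are hypothetical and are not taken from the benchmark harness.

```python
def classify(loss_diff, max_error=None,
             loss_tol=1e-2, error_tol=1e-3, close_factor=10.0):
    """Return a verdict for one framework compared against the reference.

    All three tolerances are hypothetical defaults chosen for illustration.
    """
    # A missing max-error measurement is treated as "not checked".
    checks = [(loss_diff, loss_tol)]
    if max_error is not None:
        checks.append((max_error, error_tol))

    # PASS: every available measurement is within its tolerance.
    if all(value <= tol for value, tol in checks):
        return "PASS"
    # CLOSE: every measurement is within a relaxed tolerance.
    if all(value <= tol * close_factor for value, tol in checks):
        return "CLOSE"
    return "DIFFERENT MODEL"


print(classify(3.2e-3))          # loss diff only
print(classify(5.3e-3, 1.7e-6))  # loss diff and max error
```

Under these assumed tolerances, a loss diff of 3.2e-3 on its own is a PASS, while a max error above 1e-3 downgrades a run to CLOSE even when the loss diff is tiny.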
Caveats:
- PyTorch and Meganeura load real model weights and run the full architecture; their outputs match.
- Candle runs the real LLaMA architecture with the same safetensors weights, but its forward() returns last-position logits only (private fields prevent getting all-position logits). Loss is computed on 1 position vs 128 for others, hence DIFFERENT MODEL in correctness check. Timing is valid. Backward not yet wired.
- Burn and Luminal use a simplified model (single-head attention, no RoPE/RMSNorm) with random weights.
- Luminal backward is estimated as a second forward pass.
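The Candle caveat comes down to which positions contribute to the loss. As a rough illustration, the sketch below computes cross-entropy over every position and over the last position only. The toy logits, targets, and helper names are made up for the example and have nothing to do with the real model's outputs.

```python
import math


def cross_entropy(logits, target):
    """Negative log-probability of `target` under softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]


def mean_loss(all_logits, targets):
    """Average cross-entropy over every position in the sequence."""
    losses = [cross_entropy(l, t) for l, t in zip(all_logits, targets)]
    return sum(losses) / len(losses)


# Toy data: 4 positions, vocabulary of 3 (hypothetical values).
all_logits = [
    [2.0, 0.0, 0.0],
    [0.0, 2.0, 0.0],
    [0.0, 0.0, 2.0],
    [0.0, 0.0, 0.0],
]
targets = [0, 1, 2, 1]

full = mean_loss(all_logits, targets)              # all positions
last = cross_entropy(all_logits[-1], targets[-1])  # last position only
print(f"all positions: {full:.4f}, last position only: {last:.4f}")
```

The two numbers differ even though the logits are identical, which is why a loss computed on one position cannot be compared directly with one averaged over 128.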
Run it yourself: git clone https://github.com/kvark/inferena && cd inferena && ./run.sh -m SmolLM2-135M
lerobot/smolvla_base — SmolVLA action expert decoder for robotics.
Benchmark config: chunk_size=50, vlm_seq_len=16, float32, random weights, MSE loss.
| Platform | Framework | Compile (s) | Inference (ms) | Latency (ms) | Training (ms) | Loss |
|---|---|---|---|---|---|---|
| Intel Xeon @ 2.10GHz | PyTorch 2.11.0+cu130 (CPU) | 51.63 | 40 | 11 | 116 | 0.00 |
| | Meganeura (Vulkan/Lavapipe) | 2.75 | 696 | — | 3850 | 0.01 |
| | ONNX Runtime (CPU) | ✗ | ✗ | ✗ | ✗ | |
| | JAX (CPU) | ✗ | ✗ | ✗ | ✗ | |
| | Candle (CPU) | ✗ | ✗ | ✗ | ✗ | |
| | Burn (wgpu) | ✗ | ✗ | ✗ | ✗ | |
| | Luminal (CPU) | ✗ | ✗ | ✗ | ✗ | |
| AMD Radeon 890M Graphics | PyTorch 2.10.0 (ROCm 7.2.53210) | 19.72 | 27 | 14 | 49 | 0.00 |
| | Candle | — | — | — | — | |
| | Burn | — | — | — | — | |
| | Inferi | ✗ | ✗ | ✗ | ✗ | |
| | Luminal | — | — | — | — | |
| | Meganeura (Vulkan) | 0.12 | 15 | 6.7 | 47 | 0.00 |
| | GGML | — | — | — | — | |
| | ONNX Runtime (MIGraphXExecutionProvider) | 19.38 | 10 | — | — | 0.00 |
| Apple M3 | PyTorch 2.11.0 (MPS) | 0.00 | 173 | 9.1 | 117 | 0.00 |
| | MLX (MLX) | 0.00 | 13 | — | 24 | 0.00 |
| | Candle | — | — | — | — | |
| | Burn | — | — | — | — | |
| | Inferi | ✗ | ✗ | ✗ | ✗ | |
| | Luminal | — | — | — | — | |
| | Meganeura (Metal) | 0.12 | 34 | 6.4 | 170 | 0.00 |
| | GGML | — | — | — | — | |
| | ONNX Runtime (CoreMLExecutionProvider) | 7.96 | 86 | — | — | 0.00 |
| | JAX (METAL) | 1.17 | 15 | — | 147 | 0.00 |
| NVIDIA GeForce RTX 5080 | PyTorch 2.11.0+cu130 (CUDA 13.0) | 8.39 | 2.5 | 1.2 | 3.2 | 0.00 |
| | Candle | — | — | — | — | |
| | Burn | — | — | — | — | |
| | Inferi | ✗ | ✗ | ✗ | ✗ | |
| | Luminal | — | — | — | — | |
| | Meganeura (Vulkan) | 0.47 | 3.2 | 1.5 | 9.1 | 0.00 |
| | GGML | — | — | — | — | |
| | ONNX Runtime (CUDAExecutionProvider) | 2.54 | 1.6 | — | — | 0.00 |
| | MAX (GPU) | 33.04 | 32 | — | — | 0.00 |
| NVIDIA GeForce RTX 3050 (Windows) | PyTorch 2.11.0+cu128 (CUDA 12.8) | 0.00 | 4.5 | 3.5 | 22 | 0.00 |
| | Candle | — | — | — | — | |
| | Burn | — | — | — | — | |
| | Inferi | ✗ | ✗ | ✗ | ✗ | |
| | Luminal | — | — | — | — | |
| | Meganeura (Vulkan/DX12) | 0.70 | 4.9 | 2.9 | 23 | 0.00 |
| | GGML | — | — | — | — | |
| | ONNX Runtime (CUDAExecutionProvider) | 4.14 | 4.9 | — | — | 0.00 |
| | JAX | ✗ | ✗ | ✗ | ✗ | |
| Intel(R) Graphics (RPL-U) | PyTorch 2.11.0+xpu (CPU) | 0.00 | 183 | 72 | 388 | 0.00 |
| | Candle | — | — | — | — | |
| | Burn | — | — | — | — | |
| | Inferi | — | — | — | — | |
| | Luminal | — | — | — | — | |
| | Meganeura (Vulkan) | 0.39 | 73 | 40 | 222 | 0.00 |
| | GGML | — | — | — | — | |
| | ONNX Runtime (CPUExecutionProvider) | 10.33 | 86 | — | — | 0.00 |
| | MAX | ✗ | ✗ | ✗ | ✗ | |
| | JAX (CPU) | 3.86 | 162 | — | 471 | 0.00 |
| AMD Radeon RX 7900 XT | PyTorch 2.10.0+rocm7.1 (ROCm 7.1.25424) | 9.23 | 4.8 | 4.1 | 8.1 | 0.00 |
| | Candle | — | — | — | — | |
| | Burn | — | — | — | — | |
| | Inferi | ✗ | ✗ | ✗ | ✗ | |
| | Luminal | — | — | — | — | |
| | Meganeura (Vulkan) | 0.81 | 5.1 | 1.5 | 9.9 | 0.00 |
| | GGML | — | — | — | — | |
| | MAX (GPU) | 2.20 | 15 | — | — | 0.00 |
Correctness: PyTorch vs Meganeura: CLOSE (loss diff 1e-5, max error 4.6e-3).
Caveats:
- PyTorch and Meganeura implement the full action expert architecture and should produce matching outputs.
- Burn and Luminal do not implement this architecture yet (reported as ✗).
- Inputs are synthetic: random noisy actions, sinusoidal timestep, random VLM context.
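To make the synthetic inputs concrete, here is a rough sketch of how such tensors could be generated. `CHUNK_SIZE` and `VLM_SEQ_LEN` come from the benchmark config above. `ACTION_DIM`, `HIDDEN`, the seed, and the common transformer-style sinusoidal formula are assumptions for illustration, and the harness may build its inputs differently.

```python
import math
import random

CHUNK_SIZE = 50   # from the benchmark config
VLM_SEQ_LEN = 16  # from the benchmark config
ACTION_DIM = 32   # hypothetical action dimension
HIDDEN = 64       # hypothetical embedding width


def sinusoidal_embedding(t, dim, max_period=10000.0):
    """Embed a scalar timestep as [sin..., cos...] features of length `dim`."""
    assert dim % 2 == 0, "dim must be even"
    half = dim // 2
    # Geometrically spaced frequencies from 1 down towards 1/max_period.
    freqs = [math.exp(-math.log(max_period) * i / half) for i in range(half)]
    return [math.sin(t * f) for f in freqs] + [math.cos(t * f) for f in freqs]


rng = random.Random(0)  # fixed seed so every run sees the same inputs

# Random noisy actions: one ACTION_DIM vector per step in the chunk.
noisy_actions = [[rng.gauss(0.0, 1.0) for _ in range(ACTION_DIM)]
                 for _ in range(CHUNK_SIZE)]
# Random stand-in for the VLM context tokens.
vlm_context = [[rng.gauss(0.0, 1.0) for _ in range(HIDDEN)]
               for _ in range(VLM_SEQ_LEN)]
# Sinusoidal encoding of a single diffusion/flow timestep.
timestep = sinusoidal_embedding(0.5, HIDDEN)
```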
Run it yourself: git clone https://github.com/kvark/inferena && cd inferena && ./run.sh -m SmolVLA
stable-diffusion-v1-5/stable-diffusion-v1-5 — Latent diffusion model for text-to-image generation.
Most frameworks (PyTorch, Meganeura, ONNX Runtime, JAX, MLX) run the simplified U-Net — Conv + GroupNorm + skip connections, no cross-attention or timestep embedding. Batch 2, 32×32×4 latent, base_channels=64, 3 levels, ~2M params. Shared architecture, but each framework uses its own random-init parameters, so losses don’t match across frameworks and several end up marked DIFFERENT MODEL even on identical structure.
Candle runs the full SD 1.5 U-Net (~860M params, 64×64×4 latent, cross-attention + timestep) — the real thing, marked DIFFERENT MODEL by design.
| Platform | Framework | Compile (s) | Inference (ms) | Latency (ms) | Training (ms) | Loss |
|---|---|---|---|---|---|---|
| Intel Xeon @ 2.10GHz | PyTorch 2.11.0+cu130 (CPU) | 53.02 | 14 | 11 | 28 | 0.57 |
| | Meganeura (Vulkan/Lavapipe) | 2.75 | 379 | — | 666 | 0.57 |
| | Candle (CPU) | | | | | |
| | ONNX Runtime (CPU) | ✗ | ✗ | ✗ | ✗ | |
| | JAX (CPU) | ✗ | ✗ | ✗ | ✗ | |
| | Burn (wgpu) | ✗ | ✗ | ✗ | ✗ | |
| | Luminal (CPU) | ✗ | ✗ | ✗ | ✗ | |
| AMD Radeon 890M Graphics | PyTorch 2.10.0 (ROCm 7.2.53210) | 12.68 | 2.6 | 3.0 | 5.4 | 0.50 |
| | Burn | — | — | — | — | |
| | Inferi | ✗ | ✗ | ✗ | ✗ | |
| | Luminal | — | — | — | — | |
| | Meganeura (Vulkan) | 0.09 | 10 | 11 | 15 | 0.53 |
| | GGML | — | — | — | — | |
| | ONNX Runtime (MIGraphXExecutionProvider) | | | | | |
| Apple M3 | PyTorch 2.11.0 (MPS) | 0.00 | 504 | 11 | 222 | 0.50 |
| | MLX (MLX) | 0.00 | 6.9 | — | 9.3 | 0.51 |
| | Candle (Metal) | | | | | |
| | Burn | — | — | — | — | |
| | Inferi | ✗ | ✗ | ✗ | ✗ | |
| | Luminal | — | — | — | — | |
| | Meganeura (Metal) | 0.49 | 8.9 | 8.9 | 68 | 0.53 |
| | GGML | — | — | — | — | |
| | ONNX Runtime (CoreMLExecutionProvider) | | | | | |
| | JAX (METAL) | | | | | |
| NVIDIA GeForce RTX 5080 | PyTorch 2.11.0+cu130 (CUDA 13.0) | 6.35 | 1.0 | 0.9 | 1.4 | 0.50 |
| | Candle (CUDA) | | | | | |
| | Burn | — | — | — | — | |
| | Inferi | ✗ | ✗ | ✗ | ✗ | |
| | Luminal | — | — | — | — | |
| | Meganeura (Vulkan) | 0.42 | 1.0 | 0.9 | 5.5 | 0.53 |
| | GGML | — | — | — | — | |
| | ONNX Runtime (CUDAExecutionProvider) | | | | | |
| | MAX | — | — | — | — | |
| NVIDIA GeForce RTX 3050 (Windows) | PyTorch 2.11.0+cu128 (CUDA 12.8) | 0.00 | 1.4 | 1.0 | 4.7 | 0.50 |
| | Burn | — | — | — | — | |
| | Inferi | ✗ | ✗ | ✗ | ✗ | |
| | Luminal | — | — | — | — | |
| | Meganeura (Vulkan/DX12) | 0.50 | 3.1 | 3.1 | 7.8 | 0.52 |
| | GGML | — | — | — | — | |
| | ONNX Runtime (CUDAExecutionProvider) | | | | | |
| | JAX | ✗ | ✗ | ✗ | ✗ | |
| Intel(R) Graphics (RPL-U) | PyTorch 2.11.0+xpu (CPU) | 0.00 | 118 | 33 | 153 | 0.50 |
| | Candle (CPU) | | | | | |
| | Burn | — | — | — | — | |
| | Inferi | — | — | — | — | |
| | Luminal | — | — | — | — | |
| | Meganeura (Vulkan) | 0.12 | 21 | 21 | 88 | 0.53 |
| | GGML | — | — | — | — | |
| | ONNX Runtime (CPUExecutionProvider) | | | | | |
| | MAX | ✗ | ✗ | ✗ | ✗ | |
| | JAX (CPU) | | | | | |
| AMD Radeon RX 7900 XT | PyTorch 2.10.0+rocm7.1 (ROCm 7.1.25424) | 11.06 | 1.7 | 1.5 | 3.3 | 0.50 |
| | Burn | — | — | — | — | |
| | Inferi | ✗ | ✗ | ✗ | ✗ | |
| | Luminal | — | — | — | — | |
| | Meganeura (Vulkan) | 0.77 | 1.6 | 1.5 | 7.3 | 0.51 |
| | GGML | — | — | — | — | |
| | MAX | — | — | — | — | |
Caveats:
- Only the UNet is benchmarked (not VAE encode/decode or text encoding).
- Input is deterministic synthetic data — no actual image generation.
- PyTorch and Meganeura use a simplified architecture for fair comparison.
- Candle runs the full SD 1.5 UNet but on CPU only (DIFFERENT MODEL vs others).
Run it yourself: git clone https://github.com/kvark/inferena && cd inferena && ./run.sh -m StableDiffusion
openai/whisper-tiny — Encoder-decoder transformer for speech recognition. ~39M parameters.
Uses a custom tiny configuration (4 encoder + 4 decoder layers) for fast benchmarking.
Benchmark config: 30s mel spectrogram (80x3000), 4-token decoder input, float32, random weights.
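The 80x3000 input shape follows from Whisper's standard audio front end. The arithmetic below is a small sketch assuming the benchmark keeps that front end (16 kHz audio, a hop of 160 samples, 80 mel bins, and a stride-2 convolution in the encoder stem). The constant names are illustrative only.

```python
SAMPLE_RATE = 16_000   # Hz, standard Whisper input rate
HOP_LENGTH = 160       # samples between frames (10 ms at 16 kHz)
N_MELS = 80            # mel bins per frame
CLIP_SECONDS = 30      # clip length from the benchmark config

n_samples = SAMPLE_RATE * CLIP_SECONDS   # samples in a 30 s clip
n_frames = n_samples // HOP_LENGTH       # one mel frame every 10 ms
mel_shape = (N_MELS, n_frames)           # matches the 80x3000 input

# The stride-2 convolution halves the time axis before the encoder layers.
encoder_positions = n_frames // 2

print(mel_shape, encoder_positions)
```

If the custom tiny configuration changes the convolution stem, the encoder sequence length would differ from this estimate.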
| Platform | Framework | Compile (s) | Inference (ms) | Latency (ms) | Training (ms) | Loss |
|---|---|---|---|---|---|---|
| Intel Xeon @ 2.10GHz | PyTorch 2.11.0+cu130 (CPU) | 39.88 | 150 | — | 371 | 11.80 |
| | ONNX Runtime 1.24.4 (CPU) | 0.84 | 212 | — | — | 11.80 |
| | Candle (CPU) | | | | | |
| | Meganeura (Vulkan/Lavapipe) | | | | | |
| | Burn (wgpu) | ✗ | ✗ | ✗ | ✗ | |
| | JAX (CPU) | ✗ | ✗ | ✗ | ✗ | |
| AMD Radeon 890M Graphics | PyTorch 2.10.0 (ROCm 7.2.53210) | 17.09 | 79 | 63 | 220 | 0.00 |
| | Burn | — | — | — | — | |
| | Inferi | ✗ | ✗ | ✗ | ✗ | |
| | Luminal | — | — | — | — | |
| | Meganeura (Vulkan) | 0.20 | 34 | 33 | 101 | 0.01 |
| | GGML | ✗ | ✗ | ✗ | ✗ | |
| | ONNX Runtime (MIGraphXExecutionProvider) | 23.58 | 32 | — | — | 0.01 |
| Apple M3 | PyTorch 2.11.0 (MPS) | 0.00 | 318 | 41 | 127 | 0.00 |
| | MLX | — | — | — | — | |
| | Candle (Metal) | 0.01 | 22 | — | — | 0.00 |
| | Burn | — | — | — | — | |
| | Inferi | ✗ | ✗ | ✗ | ✗ | |
| | Luminal | — | — | — | — | |
| | Meganeura (Metal) | 0.15 | 406 | 415 | 1062 | 0.01 |
| | GGML | ✗ | ✗ | ✗ | ✗ | |
| | ONNX Runtime (CoreMLExecutionProvider) | 7.93 | 440 | — | — | 0.01 |
| | JAX (METAL) | 2.17 | 128 | 315 | 445 | 0.01 |
| NVIDIA GeForce RTX 5080 | PyTorch 2.11.0+cu130 (CUDA 13.0) | 4.04 | 2.3 | 2.1 | 13 | 0.00 |
| | Candle (CUDA) | 0.01 | 44 | — | — | 0.00 |
| | Burn | — | — | — | — | |
| | Inferi | ✗ | ✗ | ✗ | ✗ | |
| | Luminal | — | — | — | — | |
| | Meganeura (Vulkan) | 0.48 | 2.9 | 2.8 | 12 | 0.01 |
| | GGML (faster-whisper (CTranslate2, CUDA)) | 7.05 | 19 | 19 | — | 0.00 |
| | ONNX Runtime (CUDAExecutionProvider) | 1.94 | 3.5 | — | — | 0.01 |
| | MAX | — | — | — | — | |
| NVIDIA GeForce RTX 3050 (Windows) | PyTorch 2.11.0+cu128 (CUDA 12.8) | 0.00 | 13 | 13 | 43 | 0.00 |
| | Burn | — | — | — | — | |
| | Inferi | ✗ | ✗ | ✗ | ✗ | |
| | Luminal | — | — | — | — | |
| | Meganeura (Vulkan/DX12) | 0.66 | 19 | 19 | 43 | 0.01 |
| | GGML (faster-whisper (CTranslate2, CUDA)) | 7.00 | 40 | 45 | — | 0.00 |
| | ONNX Runtime (CUDAExecutionProvider) | 4.82 | 20 | — | — | 0.01 |
| | JAX | ✗ | ✗ | ✗ | ✗ | |
| Intel(R) Graphics (RPL-U) | PyTorch 2.11.0+xpu (CPU) | 0.00 | 477 | 420 | 899 | 0.00 |
| | Candle (CPU) | 0.02 | 795 | — | — | 0.00 |
| | Burn | — | — | — | — | |
| | Inferi | — | — | — | — | |
| | Luminal | — | — | — | — | |
| | Meganeura (Vulkan) | 0.39 | 467 | 466 | 1594 | 0.01 |
| | GGML (faster-whisper (CTranslate2, CPU)) | 14.76 | 1036 | 1104 | — | 0.00 |
| | ONNX Runtime (CPUExecutionProvider) | 6.18 | 333 | — | — | 0.01 |
| | MAX | ✗ | ✗ | ✗ | ✗ | |
| | JAX (CPU) | 5.59 | 717 | 686 | 2681 | 0.01 |
| AMD Radeon RX 7900 XT | PyTorch 2.10.0+rocm7.1 (ROCm 7.1.25424) | 5.38 | 12 | 6.5 | 44 | 0.00 |
| | Burn | — | — | — | — | |
| | Inferi | ✗ | ✗ | ✗ | ✗ | |
| | Luminal | — | — | — | — | |
| | Meganeura (Vulkan) | 0.82 | 4.8 | 4.8 | 21 | 0.01 |
| | MAX | — | — | — | — | |
Correctness: PyTorch vs ONNX Runtime: PASS (loss diff 0.0).
Caveats:
- Uses a custom tiny config (4+4 layers, d=384), not the full whisper-tiny from OpenAI.
- Input is a synthetic mel spectrogram, not real audio.
- Decoder runs with a 4-token input (language/task tokens), not full transcription.
Run it yourself: git clone https://github.com/kvark/inferena && cd inferena && ./run.sh -m Whisper-tiny
Legend:
- Bold = best among matching frameworks
- Struck through = different / simplified model
- ✗ = not supported
- Framework names link to tested revision
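For readers working from the raw numbers rather than the rendered page, the "best on each platform" rule can be approximated as follows. This is a sketch only: the rows are copied from the SmolLM2-135M table above, the helper name `best_per_platform` is hypothetical, and the "matching" filter is omitted because the page does not say how a match is decided.

```python
# A few rows from the SmolLM2-135M table; None stands for a "—" cell.
rows = [
    {"platform": "NVIDIA GeForce RTX 5080", "framework": "PyTorch",
     "inference_ms": 4.0, "training_ms": 6.5},
    {"platform": "NVIDIA GeForce RTX 5080", "framework": "Meganeura",
     "inference_ms": 5.2, "training_ms": 17},
    {"platform": "NVIDIA GeForce RTX 5080", "framework": "GGML",
     "inference_ms": 25, "training_ms": None},
    {"platform": "AMD Radeon RX 7900 XT", "framework": "PyTorch",
     "inference_ms": 10, "training_ms": 23},
    {"platform": "AMD Radeon RX 7900 XT", "framework": "Meganeura",
     "inference_ms": 7.3, "training_ms": 21},
    {"platform": "AMD Radeon RX 7900 XT", "framework": "GGML",
     "inference_ms": 259, "training_ms": None},
]


def best_per_platform(rows, metric):
    """Return {platform: framework} with the lowest value for `metric`."""
    best = {}
    for row in rows:
        value = row.get(metric)
        if value is None:  # skip frameworks with no measurement
            continue
        current = best.get(row["platform"])
        if current is None or value < current[metric]:
            best[row["platform"]] = row
    return {platform: row["framework"] for platform, row in best.items()}


print(best_per_platform(rows, "inference_ms"))
print(best_per_platform(rows, "training_ms"))
```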