Results


Inference Arena runs the same training and inference workload through every supported ML framework on every available platform, then publishes the results side by side. Each tab below covers one model — pick your model, filter to the frameworks you care about, and compare. Lower numbers are better; bold marks the best matching framework on each platform.
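
The "best matching framework" rule above can be sketched in a few lines. This is a minimal illustration, not the project's code: the three rows are copied from the SmolLM2-135M table (RTX 5080, Inference column), the `matching` flag is an assumption read off the caveats (Burn runs a simplified model), and `best_matching` is a hypothetical helper name.

```python
# Rows copied from the SmolLM2-135M table (RTX 5080, Inference column, in ms).
# The `matching` flag is an assumption based on the caveats: Burn runs a
# simplified model, so it is excluded from the "best" comparison.
rows = [
    {"framework": "PyTorch", "inference_ms": 4.0, "matching": True},
    {"framework": "Meganeura", "inference_ms": 5.2, "matching": True},
    {"framework": "Burn", "inference_ms": 154.0, "matching": False},
]


def best_matching(rows, metric):
    """Return the matching-framework row with the lowest value for `metric`.

    Hypothetical helper: rows without a value for the metric are skipped,
    and None is returned when no matching framework reports it.
    """
    candidates = [r for r in rows if r["matching"] and r.get(metric) is not None]
    return min(candidates, key=lambda r: r[metric]) if candidates else None


best = best_matching(rows, "inference_ms")
print(best["framework"])
```

Lower is better for every timing column, so the same helper would apply to the latency and training columns with a different `metric` key.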

HuggingFaceTB/SmolLM2-135M — 134.5M parameter decoder-only language model.

Benchmark config: seq_len=128, float32, input=[0,1,…,127].
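
As a rough reference point for the Loss column, the sketch below builds the deterministic input from the config and computes the loss a model would get if it gave every token the same logit. Two things here are assumptions rather than facts from this page: the vocabulary size (49,152, taken to be SmolLM2's tokenizer size) and the use of standard next-token targets, where position i is scored against token i+1.

```python
import math

SEQ_LEN = 128        # from the benchmark config
VOCAB_SIZE = 49152   # assumption: SmolLM2 tokenizer vocabulary size

# Deterministic input from the benchmark config: token ids 0..127.
input_ids = list(range(SEQ_LEN))

# Assumed next-token objective: position i is scored against token i+1,
# which leaves SEQ_LEN - 1 scored positions.
targets = input_ids[1:]

# With identical logits for every token, the softmax is uniform, so the
# cross-entropy at each position is log(VOCAB_SIZE) regardless of the target.
per_position = [math.log(VOCAB_SIZE) for _ in targets]
uniform_loss = sum(per_position) / len(per_position)

print(f"{uniform_loss:.2f}")
```

Under these assumptions the uniform-prediction loss is ln(49152) ≈ 10.80.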

Platform Framework Compile (s) Inference (ms) Latency (ms) Training (ms) Loss
Intel Xeon @ 2.10GHz PyTorch 2.11.0+cu130 (CPU) 135.63 188 18 486 10.98
  ONNX Runtime 1.24.4 (CPU) 65.50 118 20 10.98
  JAX 0.9.2 (CPU) 6.79 194 31 2107 10.98
  Candle (CPU) 0.31 453 61 11.11
  Luminal (CPU) 3.37 17006 14459 10.81
  Burn (wgpu/Lavapipe) 0.00 2369 320 5700 11.73
  Meganeura (Vulkan/Lavapipe) 7.29 3933 852 3651 10.99
  llama.cpp (CPU) 0.10 221 24 10.98
AMD Radeon 890M Graphics PyTorch 2.10.0 (ROCm 7.2.53210) 51.07 64 27 119 8.35
  Burn (wgpu/vulkan) 0.00 182 31 206 11.55
  Inferi  
  Meganeura (Vulkan) 0.59 26 9.1 92 8.64
  ONNX Runtime  
Apple M3 PyTorch 2.11.0 (MPS) 0.00 356 71 699 8.35
  MLX (MLX) 0.00 97 253 8.64
  Candle (Metal) 0.02 22 2.8 10.80
  Burn (wgpu/metal) 0.00 873 39 905 11.59
  Inferi  
  Luminal  
  Meganeura (Metal) 1.50 201 9.1 464 8.65
  GGML (Metal) 0.38 49 11 8.69
  JAX (METAL) 3.13 47 21 253 5.79
NVIDIA GeForce RTX 5080 PyTorch 2.11.0+cu130 (CUDA 13.0) 16.83 4.0 2.8 6.5 8.35
  Candle (CUDA) 0.05 48 2.4 10.80
  Burn (wgpu/vulkan) 0.00 154 26 86 11.69
  Inferi  
  Meganeura (Vulkan) 0.87 5.2 2.2 17 8.64
  GGML (CUDA) 0.25 25 1.5 8.69
  ONNX Runtime (CUDAExecutionProvider) 20.54 5.2 3.1 6.01
  MAX (GPU) 18.79 3.5 0.1 10.80
NVIDIA GeForce RTX 3050 (Windows) PyTorch 2.11.0+cu128 (CUDA 12.8) 0.00 11 5.1 51 8.35
  Burn (wgpu/vulkan) 0.00 125 28 138 11.76
  Inferi  
  Meganeura (Vulkan/DX12) 1.40 13 3.6 58 8.63
  GGML (CUDA) 0.31 132 5.9 8.69
  ONNX Runtime (CUDAExecutionProvider) 39.92 18 14 6.01
  JAX  
Intel(R) Graphics (RPL-U) PyTorch 2.11.0+xpu (CPU) 0.00 541 126 1130 8.35
  Candle (CPU) 0.41 524 76 12.16
  Burn (wgpu/vulkan) 0.00 604 83 1437 11.79
  Inferi (Vulkan) 1.06 25769 9.1 15.16
  Luminal (CPU) 3.57 15551 15473 10.81
  Meganeura (Vulkan) 1.74 172 52 700 8.64
  GGML (CPU) 0.11 433 33 8.69
  ONNX Runtime (CPUExecutionProvider) 66.86 381 64 6.01
  MAX  
  JAX (CPU) 9.97 589 176 1412 5.79
AMD Radeon RX 7900 XT PyTorch 2.10.0+rocm7.1 (ROCm 7.1.25424) 73.03 10 6.8 23 8.35
  Burn (wgpu/vulkan) 0.00 172 25 181 11.76
  Inferi  
  Meganeura (Vulkan) 0.99 7.3 2.1 21 8.64
  GGML (ROCm) 0.11 259 3.4 8.69
  MAX (GPU) 2.13 3.5 0.1 10.80

Correctness: PyTorch vs ONNX Runtime: PASS (loss diff 3.2e-3). PyTorch vs JAX: PASS (loss diff 3.2e-3). PyTorch vs Meganeura: PASS (max error 1.7e-6, loss diff 5.3e-3). PyTorch vs llama.cpp: PASS (loss diff 4.5e-3). Candle, Luminal: CLOSE. Struck-through values are from frameworks running a different (simplified) model.
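
The two quantities quoted in the correctness line, max error and loss diff, can be computed as in the sketch below. The function names and the single tolerance check are hypothetical; the thresholds the project uses to decide between PASS and CLOSE are not stated on this page, so none are reproduced here.

```python
def max_abs_error(reference, candidate):
    """Largest element-wise absolute difference between two flat output lists."""
    if len(reference) != len(candidate):
        raise ValueError("outputs must have the same number of elements")
    return max(abs(r - c) for r, c in zip(reference, candidate))


def compare(reference_out, candidate_out, reference_loss, candidate_loss):
    """Collect the two metrics reported in the correctness line."""
    return {
        "max_error": max_abs_error(reference_out, candidate_out),
        "loss_diff": abs(reference_loss - candidate_loss),
    }


def within(metrics, tol):
    """True when both metrics are at or below `tol` (a caller-chosen placeholder)."""
    return metrics["max_error"] <= tol and metrics["loss_diff"] <= tol


# Toy values, not taken from any benchmark run.
metrics = compare([0.5, -1.0, 2.0], [0.5, -1.0, 2.25], 1.0, 1.125)
print(metrics)
print(within(metrics, 0.5), within(metrics, 0.1))
```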

Caveats:

  • PyTorch and Meganeura load real model weights and run the full architecture, so their outputs match.
  • Candle runs the real LLaMA architecture with the same safetensors weights, but its forward() returns last-position logits only (private fields prevent access to all-position logits). Its loss is therefore computed on 1 position versus 128 for the others, which is why the correctness check flags it as DIFFERENT MODEL. Timing is valid. Backward is not yet wired.
  • Burn and Luminal use a simplified model (single-head attention, no RoPE/RMSNorm) with random weights.
  • Luminal backward is estimated as a second forward pass.
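
To see why a loss computed on the last position alone is not comparable to one averaged over every position, consider the toy sketch below. The logits, targets, and function names are invented for illustration; nothing here is taken from Candle or from the benchmark harness.

```python
import math


def cross_entropy(logits, target):
    """Cross-entropy for one position, using a numerically stable log-sum-exp."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]


def mean_loss(all_logits, targets):
    """Average loss over every position (what an all-position harness reports)."""
    return sum(cross_entropy(l, t) for l, t in zip(all_logits, targets)) / len(targets)


def last_position_loss(all_logits, targets):
    """Loss at the final position only (all that last-position logits allow)."""
    return cross_entropy(all_logits[-1], targets[-1])


# Toy example: 3 positions, 4-token vocabulary. The model is confident and
# correct at the first two positions and undecided at the last one.
all_logits = [
    [2.0, 0.0, 0.0, 0.0],
    [0.0, 2.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]
targets = [0, 1, 2]

print(mean_loss(all_logits, targets), last_position_loss(all_logits, targets))
```

The two numbers differ even though the model and weights are identical, so a mismatch in the Loss column for such a framework says nothing about its timing.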

Run it yourself: git clone https://github.com/kvark/inferena && cd inferena && ./run.sh -m SmolLM2-135M

lerobot/smolvla_base — SmolVLA action expert decoder for robotics.

Benchmark config: chunk_size=50, vlm_seq_len=16, float32, random weights, MSE loss.

Platform Framework Compile (s) Inference (ms) Latency (ms) Training (ms) Loss
Intel Xeon @ 2.10GHz PyTorch 2.11.0+cu130 (CPU) 51.63 40 11 116 0.00
  Meganeura (Vulkan/Lavapipe) 2.75 696 3850 0.01
  ONNX Runtime (CPU)  
  JAX (CPU)  
  Candle (CPU)  
  Burn (wgpu)  
  Luminal (CPU)  
AMD Radeon 890M Graphics PyTorch 2.10.0 (ROCm 7.2.53210) 19.72 27 14 49 0.00
  Candle  
  Burn  
  Inferi  
  Luminal  
  Meganeura (Vulkan) 0.12 15 6.7 47 0.00
  GGML  
  ONNX Runtime (MIGraphXExecutionProvider) 19.38 10 0.00
Apple M3 PyTorch 2.11.0 (MPS) 0.00 173 9.1 117 0.00
  MLX (MLX) 0.00 13 24 0.00
  Candle  
  Burn  
  Inferi  
  Luminal  
  Meganeura (Metal) 0.12 34 6.4 170 0.00
  GGML  
  ONNX Runtime (CoreMLExecutionProvider) 7.96 86 0.00
  JAX (METAL) 1.17 15 147 0.00
NVIDIA GeForce RTX 5080 PyTorch 2.11.0+cu130 (CUDA 13.0) 8.39 2.5 1.2 3.2 0.00
  Candle  
  Burn  
  Inferi  
  Luminal  
  Meganeura (Vulkan) 0.47 3.2 1.5 9.1 0.00
  GGML  
  ONNX Runtime (CUDAExecutionProvider) 2.54 1.6 0.00
  MAX (GPU) 33.04 32 0.00
NVIDIA GeForce RTX 3050 (Windows) PyTorch 2.11.0+cu128 (CUDA 12.8) 0.00 4.5 3.5 22 0.00
  Candle  
  Burn  
  Inferi  
  Luminal  
  Meganeura (Vulkan/DX12) 0.70 4.9 2.9 23 0.00
  GGML  
  ONNX Runtime (CUDAExecutionProvider) 4.14 4.9 0.00
  JAX  
Intel(R) Graphics (RPL-U) PyTorch 2.11.0+xpu (CPU) 0.00 183 72 388 0.00
  Candle  
  Burn  
  Inferi  
  Luminal  
  Meganeura (Vulkan) 0.39 73 40 222 0.00
  GGML  
  ONNX Runtime (CPUExecutionProvider) 10.33 86 0.00
  MAX  
  JAX (CPU) 3.86 162 471 0.00
AMD Radeon RX 7900 XT PyTorch 2.10.0+rocm7.1 (ROCm 7.1.25424) 9.23 4.8 4.1 8.1 0.00
  Candle  
  Burn  
  Inferi  
  Luminal  
  Meganeura (Vulkan) 0.81 5.1 1.5 9.9 0.00
  GGML  
  MAX (GPU) 2.20 15 0.00

Correctness: PyTorch vs Meganeura: CLOSE (loss diff 1e-5, max error 4.6e-3).

Caveats:

  • PyTorch and Meganeura implement the full action expert architecture and should produce matching outputs.
  • Burn and Luminal do not implement this architecture yet (reported as ✗).
  • Inputs are synthetic: random noisy actions, sinusoidal timestep, random VLM context.
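
The synthetic inputs listed above could be generated roughly as follows. This is a sketch under stated assumptions: `ACTION_DIM` and `HIDDEN_DIM` are hypothetical sizes, the Gaussian noise is a guess at what "random" means, and the embedding is the generic transformer-style sinusoidal formula, which may differ from the one the benchmark uses.

```python
import math
import random

CHUNK_SIZE = 50    # from the benchmark config
VLM_SEQ_LEN = 16   # from the benchmark config
ACTION_DIM = 32    # hypothetical: per-step action vector size
HIDDEN_DIM = 64    # hypothetical: context / embedding width


def sinusoidal_embedding(t, dim, max_period=10000.0):
    """Generic sinusoidal embedding: `dim // 2` sines followed by `dim // 2` cosines."""
    half = dim // 2
    freqs = [math.exp(-math.log(max_period) * i / half) for i in range(half)]
    return [math.sin(t * f) for f in freqs] + [math.cos(t * f) for f in freqs]


rng = random.Random(0)  # fixed seed so every run sees the same inputs

# Random noisy actions: one vector per step in the action chunk.
noisy_actions = [[rng.gauss(0.0, 1.0) for _ in range(ACTION_DIM)]
                 for _ in range(CHUNK_SIZE)]

# Random VLM context: stands in for the vision-language prefix features.
vlm_context = [[rng.gauss(0.0, 1.0) for _ in range(HIDDEN_DIM)]
               for _ in range(VLM_SEQ_LEN)]

# Timestep in [0, 1), embedded with sines and cosines.
timestep = rng.random()
time_embedding = sinusoidal_embedding(timestep, HIDDEN_DIM)

print(len(noisy_actions), len(vlm_context), len(time_embedding))
```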

Run it yourself: git clone https://github.com/kvark/inferena && cd inferena && ./run.sh -m SmolVLA

stable-diffusion-v1-5/stable-diffusion-v1-5 — Latent diffusion model for text-to-image generation.

Most frameworks (PyTorch, Meganeura, ONNX Runtime, JAX, MLX) run the simplified U-Net — Conv + GroupNorm + skip connections, no cross-attention or timestep embedding. Batch 2, 32×32×4 latent, base_channels=64, 3 levels, ~2M params. Shared architecture, but each framework uses its own random-init parameters, so losses don’t match across frameworks and several end up marked DIFFERENT MODEL even on identical structure.

Candle runs the full SD 1.5 U-Net (~860M params, 64×64×4 latent, cross-attention + timestep) — the real thing, marked DIFFERENT MODEL by design.
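
For orientation, the feature-map shapes implied by the simplified U-Net configuration can be sketched as below. The batch size, latent size, base channel count, and number of levels come from the description above. The rule that channels double and resolution halves at each level, and the (batch, channels, height, width) layout, are assumptions about a typical U-Net rather than details confirmed here.

```python
BATCH = 2            # from the description
LATENT_CHANNELS = 4  # from the description (32x32x4 latent)
LATENT_SIZE = 32     # from the description
BASE_CHANNELS = 64   # from the description
LEVELS = 3           # from the description


def encoder_shapes(batch, size, base_channels, levels):
    """Feature-map shape per level as (batch, channels, height, width).

    Assumes channels double and spatial resolution halves between levels,
    which is common for U-Nets but not confirmed for this benchmark.
    """
    shapes = []
    for level in range(levels):
        channels = base_channels * 2 ** level
        side = size // 2 ** level
        shapes.append((batch, channels, side, side))
    return shapes


input_shape = (BATCH, LATENT_CHANNELS, LATENT_SIZE, LATENT_SIZE)
shapes = encoder_shapes(BATCH, LATENT_SIZE, BASE_CHANNELS, LEVELS)
print(input_shape, shapes)
```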

Platform Framework Compile (s) Inference (ms) Latency (ms) Training (ms) Loss
Intel Xeon @ 2.10GHz PyTorch 2.11.0+cu130 (CPU) 53.02 14 11 28 0.57
  Meganeura (Vulkan/Lavapipe) 2.75 379 666 0.57
  Candle (CPU) 0.00 10777 0.00
  ONNX Runtime (CPU)  
  JAX (CPU)  
  Burn (wgpu)  
  Luminal (CPU)  
AMD Radeon 890M Graphics PyTorch 2.10.0 (ROCm 7.2.53210) 12.68 2.6 3.0 5.4 0.50
  Burn  
  Inferi  
  Luminal  
  Meganeura (Vulkan) 0.09 10 11 15 0.53
  GGML  
  ONNX Runtime (MIGraphXExecutionProvider) 29.97 3.7 0.05
Apple M3 PyTorch 2.11.0 (MPS) 0.00 504 11 222 0.50
  MLX (MLX) 0.00 6.9 9.3 0.51
  Candle (Metal) 0.01 233 0.00
  Burn  
  Inferi  
  Luminal  
  Meganeura (Metal) 0.49 8.9 8.9 68 0.53
  GGML  
  ONNX Runtime (CoreMLExecutionProvider) 2.38 12 0.05
  JAX (METAL) 0.72 6.0 25 0.05
NVIDIA GeForce RTX 5080 PyTorch 2.11.0+cu130 (CUDA 13.0) 6.35 1.0 0.9 1.4 0.50
  Candle (CUDA) 0.01 108 0.00
  Burn  
  Inferi  
  Luminal  
  Meganeura (Vulkan) 0.42 1.0 0.9 5.5 0.53
  GGML  
  ONNX Runtime (CUDAExecutionProvider) 0.84 0.8 0.05
  MAX  
NVIDIA GeForce RTX 3050 (Windows) PyTorch 2.11.0+cu128 (CUDA 12.8) 0.00 1.4 1.0 4.7 0.50
  Burn  
  Inferi  
  Luminal  
  Meganeura (Vulkan/DX12) 0.50 3.1 3.1 7.8 0.52
  GGML  
  ONNX Runtime (CUDAExecutionProvider) 1.70 4.0 0.05
  JAX  
Intel(R) Graphics (RPL-U) PyTorch 2.11.0+xpu (CPU) 0.00 118 33 153 0.50
  Candle (CPU) 0.00 16529 0.00
  Burn  
  Inferi  
  Luminal  
  Meganeura (Vulkan) 0.12 21 21 88 0.53
  GGML  
  ONNX Runtime (CPUExecutionProvider) 2.39 31 0.05
  MAX  
  JAX (CPU) 4.85 73 206 0.05
AMD Radeon RX 7900 XT PyTorch 2.10.0+rocm7.1 (ROCm 7.1.25424) 11.06 1.7 1.5 3.3 0.50
  Burn  
  Inferi  
  Luminal  
  Meganeura (Vulkan) 0.77 1.6 1.5 7.3 0.51
  GGML  
  MAX  

Run ./run.sh -m StableDiffusion to populate this table.

Caveats:

  • Only the U-Net is benchmarked (not VAE encode/decode or text encoding).
  • Input is deterministic synthetic data; no actual image is generated.
  • PyTorch and Meganeura use a simplified architecture for fair comparison.
  • Candle runs the full SD 1.5 U-Net but on CPU only (DIFFERENT MODEL vs others).

Run it yourself: git clone https://github.com/kvark/inferena && cd inferena && ./run.sh -m StableDiffusion

openai/whisper-tiny — Encoder-decoder transformer for speech recognition. ~39M parameters.

Uses a custom tiny configuration (4 encoder + 4 decoder layers) for fast benchmarking.

Benchmark config: 30s mel spectrogram (80x3000), 4-token decoder input, float32, random weights.
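
The 80x3000 input shape follows from Whisper's usual audio front end: 30 seconds at 16 kHz with a hop of 160 samples gives 3000 frames of 80 mel bins. The sketch below reproduces that arithmetic and builds a synthetic input of the same shape. The sample rate and hop length are the standard Whisper values and are assumed to apply to this custom config; the Gaussian noise and the decoder token ids are placeholders.

```python
import random

SAMPLE_RATE = 16_000  # Hz; standard Whisper input rate (assumed for this config)
HOP_LENGTH = 160      # samples between frames; standard Whisper value (assumed)
N_MELS = 80           # from the benchmark config
CLIP_SECONDS = 30     # from the benchmark config

# Number of spectrogram frames in one clip.
n_frames = CLIP_SECONDS * SAMPLE_RATE // HOP_LENGTH

rng = random.Random(0)  # fixed seed for repeatable synthetic input

# Synthetic mel spectrogram: N_MELS rows of n_frames random values.
mel = [[rng.gauss(0.0, 1.0) for _ in range(n_frames)] for _ in range(N_MELS)]

# 4-token decoder input; these ids are placeholders, not real Whisper tokens.
decoder_input = [0, 1, 2, 3]

print(len(mel), len(mel[0]), len(decoder_input))
```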

Platform Framework Compile (s) Inference (ms) Latency (ms) Training (ms) Loss
Intel Xeon @ 2.10GHz PyTorch 2.11.0+cu130 (CPU) 39.88 150 371 11.80
  ONNX Runtime 1.24.4 (CPU) 0.84 212 11.80
  Candle (CPU) 0.01 616 0.00
  Meganeura (Vulkan/Lavapipe) 7.84 53467 0.01
  Burn (wgpu)  
  JAX (CPU)  
AMD Radeon 890M Graphics PyTorch 2.10.0 (ROCm 7.2.53210) 17.09 79 63 220 0.00
  Burn  
  Inferi  
  Luminal  
  Meganeura (Vulkan) 0.20 34 33 101 0.01
  GGML  
  ONNX Runtime (MIGraphXExecutionProvider) 23.58 32 0.01
Apple M3 PyTorch 2.11.0 (MPS) 0.00 318 41 127 0.00
  MLX  
  Candle (Metal) 0.01 22 0.00
  Burn  
  Inferi  
  Luminal  
  Meganeura (Metal) 0.15 406 415 1062 0.01
  GGML  
  ONNX Runtime (CoreMLExecutionProvider) 7.93 440 0.01
  JAX (METAL) 2.17 128 315 445 0.01
NVIDIA GeForce RTX 5080 PyTorch 2.11.0+cu130 (CUDA 13.0) 4.04 2.3 2.1 13 0.00
  Candle (CUDA) 0.01 44 0.00
  Burn  
  Inferi  
  Luminal  
  Meganeura (Vulkan) 0.48 2.9 2.8 12 0.01
  GGML (faster-whisper (CTranslate2, CUDA)) 7.05 19 19 0.00
  ONNX Runtime (CUDAExecutionProvider) 1.94 3.5 0.01
  MAX  
NVIDIA GeForce RTX 3050 (Windows) PyTorch 2.11.0+cu128 (CUDA 12.8) 0.00 13 13 43 0.00
  Burn  
  Inferi  
  Luminal  
  Meganeura (Vulkan/DX12) 0.66 19 19 43 0.01
  GGML (faster-whisper (CTranslate2, CUDA)) 7.00 40 45 0.00
  ONNX Runtime (CUDAExecutionProvider) 4.82 20 0.01
  JAX  
Intel(R) Graphics (RPL-U) PyTorch 2.11.0+xpu (CPU) 0.00 477 420 899 0.00
  Candle (CPU) 0.02 795 0.00
  Burn  
  Inferi  
  Luminal  
  Meganeura (Vulkan) 0.39 467 466 1594 0.01
  GGML (faster-whisper (CTranslate2, CPU)) 14.76 1036 1104 0.00
  ONNX Runtime (CPUExecutionProvider) 6.18 333 0.01
  MAX  
  JAX (CPU) 5.59 717 686 2681 0.01
AMD Radeon RX 7900 XT PyTorch 2.10.0+rocm7.1 (ROCm 7.1.25424) 5.38 12 6.5 44 0.00
  Burn  
  Inferi  
  Luminal  
  Meganeura (Vulkan) 0.82 4.8 4.8 21 0.01
  MAX  

Correctness: PyTorch vs ONNX Runtime: PASS (loss diff 0.0).

Caveats:

  • Uses a custom tiny config (4+4 layers, d=384), not the full whisper-tiny from OpenAI.
  • Input is a synthetic mel spectrogram, not real audio.
  • Decoder runs with a 4-token input (language/task tokens), not a full transcription.

Run it yourself: git clone https://github.com/kvark/inferena && cd inferena && ./run.sh -m Whisper-tiny

Legend: Bold = best among matching frameworks. Struck through = different / simplified model. ✗ = not supported. Framework names link to the tested revision.