Results

Inference Arena runs the same training and inference workload through every supported ML framework on every available platform, then publishes the results side by side. Each tab below covers one model — pick your model, filter to the frameworks you care about, and compare. Lower numbers are better; bold marks the best matching framework on each platform.

SmolLM2-135M SmolVLA Stable Diffusion ResNet-50 Whisper-tiny

HuggingFaceTB/SmolLM2-135M — 134.5M parameter decoder-only language model.

Benchmark config: seq_len=128, float32, input=[0,1,…,127].
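As a rough aid for reading the Loss column, the sketch below builds the benchmark input and computes the cross-entropy a model would score if it assigned equal probability to every token. The vocabulary size of 49152 is an assumption (it is not stated on this page), and the snippet is not taken from the benchmark harness.

```python
import math

SEQ_LEN = 128        # from the benchmark config above
VOCAB_SIZE = 49152   # assumption: SmolLM2 tokenizer vocabulary size, not stated on this page

# The benchmark input is simply the token ids 0..127.
input_ids = list(range(SEQ_LEN))

# Cross-entropy of a uniform prediction over the vocabulary: -ln(1/V) = ln(V).
uniform_loss = math.log(VOCAB_SIZE)

print(f"{len(input_ids)} tokens, uniform-prediction loss = {uniform_loss:.2f}")
# prints: 128 tokens, uniform-prediction loss = 10.80
```

Under that assumption, losses well below 10.80 indicate the model is using its trained weights, while values near or above it suggest untrained or mismatched parameters.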

Platform	Framework	Compile (s)	Inference (ms)	Latency (ms)	Training (ms)	Loss
Intel Xeon @ 2.10GHz	PyTorch 2.11.0+cu130 (CPU)	135.63	188	18	486	10.98
	ONNX Runtime 1.24.4 (CPU)	65.50	118	20	—	10.98
	JAX 0.9.2 (CPU)	6.79	194	31	2107	10.98
	Candle (CPU)	0.31	453	61	—	11.11
	Luminal (CPU)	3.37	17006	—	14459	10.81
	Burn (wgpu/Lavapipe)	~~0.00~~	~~2369~~	~~320~~	~~5700~~	~~11.73~~
	Meganeura (Vulkan/Lavapipe)	7.29	3933	852	3651	10.99
	llama.cpp (CPU)	0.10	221	24	—	10.98
AMD Radeon 890M Graphics	PyTorch 2.10.0 (ROCm 7.2.53210)	51.07	64	27	119	8.35
	Burn (wgpu/vulkan)	~~0.00~~	~~182~~	31	~~206~~	~~11.55~~
	Inferi	✗	✗	✗	✗
	Meganeura (Vulkan)	0.59	26	9.1	92	8.64
	ONNX Runtime	✗	✗	✗	✗
Apple M3	PyTorch 2.11.0 (MPS)	0.00	356	71	699	8.35
	MLX (MLX)	0.00	97	—	253	8.64
	Candle (Metal)	~~0.02~~	22	~~2.8~~	—	~~10.80~~
	Burn (wgpu/metal)	~~0.00~~	~~873~~	39	~~905~~	~~11.59~~
	Inferi	✗	✗	✗	✗
	Luminal	✗	✗	✗	✗
	Meganeura (Metal)	1.50	201	9.1	464	8.65
	GGML (Metal)	0.38	49	11	—	8.69
	JAX (METAL)	~~3.13~~	47	21	~~253~~	~~5.79~~
NVIDIA GeForce RTX 5080	PyTorch 2.11.0+cu130 (CUDA 13.0)	16.83	4.0	2.8	6.5	8.35
	Candle (CUDA)	~~0.05~~	48	~~2.4~~	—	~~10.80~~
	Burn (wgpu/vulkan)	~~0.00~~	~~154~~	26	86	~~11.69~~
	Inferi	✗	✗	✗	✗
	Meganeura (Vulkan)	0.87	5.2	2.2	17	8.64
	GGML (CUDA)	0.25	25	1.5	—	8.69
	ONNX Runtime (CUDAExecutionProvider)	~~20.54~~	~~5.2~~	~~3.1~~	—	~~6.01~~
	MAX (GPU)	~~18.79~~	~~3.5~~	~~0.1~~	—	~~10.80~~
NVIDIA GeForce RTX 3050 (Windows)	PyTorch 2.11.0+cu128 (CUDA 12.8)	0.00	11	5.1	51	8.35
	Burn (wgpu/vulkan)	~~0.00~~	~~125~~	28	~~138~~	~~11.76~~
	Inferi	✗	✗	✗	✗
	Meganeura (Vulkan/DX12)	1.40	13	3.6	58	8.63
	GGML (CUDA)	0.31	132	5.9	—	8.69
	ONNX Runtime (CUDAExecutionProvider)	~~39.92~~	18	14	—	~~6.01~~
	JAX	✗	✗	✗	✗
Intel(R) Graphics (RPL-U)	PyTorch 2.11.0+xpu (CPU)	0.00	541	126	1130	8.35
	Candle (CPU)	~~0.41~~	~~524~~	76	—	~~12.16~~
	Burn (wgpu/vulkan)	~~0.00~~	~~604~~	83	~~1437~~	~~11.79~~
	Inferi (Vulkan)	~~1.06~~	~~25769~~	~~9.1~~	—	~~15.16~~
	Luminal (CPU)	~~3.57~~	~~15551~~	—	~~15473~~	~~10.81~~
	Meganeura (Vulkan)	1.74	172	52	700	8.64
	GGML (CPU)	0.11	433	33	—	8.69
	ONNX Runtime (CPUExecutionProvider)	~~66.86~~	~~381~~	64	—	~~6.01~~
	MAX	✗	✗	✗	✗
	JAX (CPU)	~~9.97~~	~~589~~	~~176~~	~~1412~~	~~5.79~~
AMD Radeon RX 7900 XT	PyTorch 2.10.0+rocm7.1 (ROCm 7.1.25424)	73.03	10	6.8	23	8.35
	Burn (wgpu/vulkan)	~~0.00~~	~~172~~	25	~~181~~	~~11.76~~
	Inferi	✗	✗	✗	✗
	Meganeura (Vulkan)	0.99	7.3	2.1	21	8.64
	GGML (ROCm)	0.11	259	3.4	—	8.69
	MAX (GPU)	~~2.13~~	~~3.5~~	~~0.1~~	—	~~10.80~~

Correctness: PyTorch vs ONNX Runtime: PASS (loss diff 3.2e-3). PyTorch vs JAX: PASS (loss diff 3.2e-3). PyTorch vs Meganeura: PASS (max error 1.7e-6, loss diff 5.3e-3). PyTorch vs llama.cpp: PASS (loss diff 4.5e-3). Candle, Luminal: CLOSE. Struck-through values are from frameworks running a different (simplified) model.

Caveats:

- PyTorch and Meganeura load real model weights and run the full architecture, so their outputs match.
- Candle runs the real LLaMA architecture with the same safetensors weights, but its forward() returns last-position logits only (private fields prevent access to all-position logits). Its loss is therefore computed on 1 position versus 128 for the others, hence DIFFERENT MODEL in the correctness check. Timing is valid. Backward is not yet wired.
- Burn and Luminal use a simplified model (single-head attention, no RoPE/RMSNorm) with random weights.
- Luminal backward is estimated as a second forward pass.
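The Candle caveat above is easier to see with a toy example. The sketch below uses made-up logits and targets (4 positions, 3-token vocabulary, both hypothetical) to show that a loss averaged over every position generally differs from a loss taken at the last position only, even when the logits are identical.

```python
import math

def cross_entropy(logits, target):
    """Negative log-probability of `target` under softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[target]

# Hypothetical toy data: 4 positions, vocabulary of 3 tokens.
logits = [
    [2.0, 0.5, 0.1],
    [0.2, 1.5, 0.3],
    [0.1, 0.2, 3.0],
    [1.0, 1.0, 1.0],
]
targets = [0, 1, 2, 0]

per_position = [cross_entropy(row, t) for row, t in zip(logits, targets)]
mean_all = sum(per_position) / len(per_position)  # averaged over every position
last_only = per_position[-1]                      # last position only

print(f"all positions: {mean_all:.3f}, last position only: {last_only:.3f}")
```

The two numbers come from the same forward pass, which is why a loss mismatch of this kind says nothing about timing validity.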

Run it yourself: git clone https://github.com/kvark/inferena && cd inferena && ./run.sh -m SmolLM2-135M

lerobot/smolvla_base — SmolVLA action expert decoder for robotics.

Benchmark config: chunk_size=50, vlm_seq_len=16, float32, random weights, MSE loss.

Platform	Framework	Compile (s)	Inference (ms)	Latency (ms)	Training (ms)	Loss
Intel Xeon @ 2.10GHz	PyTorch 2.11.0+cu130 (CPU)	51.63	40	11	116	0.00
	Meganeura (Vulkan/Lavapipe)	2.75	696	—	3850	0.01
	ONNX Runtime (CPU)	✗	✗	✗	✗
	JAX (CPU)	✗	✗	✗	✗
	Candle (CPU)	✗	✗	✗	✗
	Burn (wgpu)	✗	✗	✗	✗
	Luminal (CPU)	✗	✗	✗	✗
AMD Radeon 890M Graphics	PyTorch 2.10.0 (ROCm 7.2.53210)	19.72	27	14	49	0.00
	Candle	—	—	—	—
	Burn	—	—	—	—
	Inferi	✗	✗	✗	✗
	Luminal	—	—	—	—
	Meganeura (Vulkan)	0.12	15	6.7	47	0.00
	GGML	—	—	—	—
	ONNX Runtime (MIGraphXExecutionProvider)	19.38	10	—	—	0.00
Apple M3	PyTorch 2.11.0 (MPS)	0.00	173	9.1	117	0.00
	MLX (MLX)	0.00	13	—	24	0.00
	Candle	—	—	—	—
	Burn	—	—	—	—
	Inferi	✗	✗	✗	✗
	Luminal	—	—	—	—
	Meganeura (Metal)	0.12	34	6.4	170	0.00
	GGML	—	—	—	—
	ONNX Runtime (CoreMLExecutionProvider)	7.96	86	—	—	0.00
	JAX (METAL)	1.17	15	—	147	0.00
NVIDIA GeForce RTX 5080	PyTorch 2.11.0+cu130 (CUDA 13.0)	8.39	2.5	1.2	3.2	0.00
	Candle	—	—	—	—
	Burn	—	—	—	—
	Inferi	✗	✗	✗	✗
	Luminal	—	—	—	—
	Meganeura (Vulkan)	0.47	3.2	1.5	9.1	0.00
	GGML	—	—	—	—
	ONNX Runtime (CUDAExecutionProvider)	2.54	1.6	—	—	0.00
	MAX (GPU)	33.04	32	—	—	0.00
NVIDIA GeForce RTX 3050 (Windows)	PyTorch 2.11.0+cu128 (CUDA 12.8)	0.00	4.5	3.5	22	0.00
	Candle	—	—	—	—
	Burn	—	—	—	—
	Inferi	✗	✗	✗	✗
	Luminal	—	—	—	—
	Meganeura (Vulkan/DX12)	0.70	4.9	2.9	23	0.00
	GGML	—	—	—	—
	ONNX Runtime (CUDAExecutionProvider)	4.14	4.9	—	—	0.00
	JAX	✗	✗	✗	✗
Intel(R) Graphics (RPL-U)	PyTorch 2.11.0+xpu (CPU)	0.00	183	72	388	0.00
	Candle	—	—	—	—
	Burn	—	—	—	—
	Inferi	—	—	—	—
	Luminal	—	—	—	—
	Meganeura (Vulkan)	0.39	73	40	222	0.00
	GGML	—	—	—	—
	ONNX Runtime (CPUExecutionProvider)	10.33	86	—	—	0.00
	MAX	✗	✗	✗	✗
	JAX (CPU)	3.86	162	—	471	0.00
AMD Radeon RX 7900 XT	PyTorch 2.10.0+rocm7.1 (ROCm 7.1.25424)	9.23	4.8	4.1	8.1	0.00
	Candle	—	—	—	—
	Burn	—	—	—	—
	Inferi	✗	✗	✗	✗
	Luminal	—	—	—	—
	Meganeura (Vulkan)	0.81	5.1	1.5	9.9	0.00
	GGML	—	—	—	—
	MAX (GPU)	2.20	15	—	—	0.00

Correctness: PyTorch vs Meganeura: CLOSE (loss diff 1e-5, max error 4.6e-3).

Caveats:

- PyTorch and Meganeura implement the full action expert architecture and should produce matching outputs.
- Burn and Luminal do not implement this architecture yet (reported as ✗).
- Inputs are synthetic: random noisy actions, sinusoidal timestep, random VLM context.
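A minimal sketch of what such synthetic inputs could look like is given below. It is not the benchmark's code: the action dimension, hidden size, embedding width, and the exact sinusoidal formula (the common transformer-style one) are all assumptions chosen for illustration; only chunk_size=50 and vlm_seq_len=16 come from the config above.

```python
import math
import random

def sinusoidal_embedding(t, dim, max_period=10000.0):
    """Transformer-style sinusoidal embedding of a scalar timestep `t`."""
    half = dim // 2
    freqs = [math.exp(-math.log(max_period) * i / half) for i in range(half)]
    return [math.sin(t * f) for f in freqs] + [math.cos(t * f) for f in freqs]

def mse(pred, target):
    """Mean squared error over two equally shaped [row][col] lists."""
    diffs = [(p - q) ** 2 for pr, tr in zip(pred, target) for p, q in zip(pr, tr)]
    return sum(diffs) / len(diffs)

CHUNK_SIZE = 50    # from the benchmark config
VLM_SEQ_LEN = 16   # from the benchmark config
ACTION_DIM = 4     # hypothetical, for illustration only
HIDDEN_DIM = 8     # hypothetical, for illustration only

rng = random.Random(0)  # fixed seed so the synthetic data is reproducible
noisy_actions = [[rng.gauss(0.0, 1.0) for _ in range(ACTION_DIM)] for _ in range(CHUNK_SIZE)]
vlm_context = [[rng.gauss(0.0, 1.0) for _ in range(HIDDEN_DIM)] for _ in range(VLM_SEQ_LEN)]
timestep_embedding = sinusoidal_embedding(0.5, HIDDEN_DIM)

# A stand-in "prediction" of all zeros, scored against the noisy actions with MSE.
zeros = [[0.0] * ACTION_DIM for _ in range(CHUNK_SIZE)]
print(f"MSE of a zero prediction: {mse(zeros, noisy_actions):.3f}")
```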

Run it yourself: git clone https://github.com/kvark/inferena && cd inferena && ./run.sh -m SmolVLA

stable-diffusion-v1-5/stable-diffusion-v1-5 — Latent diffusion model for text-to-image generation.

Most frameworks (PyTorch, Meganeura, ONNX Runtime, JAX, MLX) run the simplified U-Net — Conv + GroupNorm + skip connections, no cross-attention or timestep embedding. Batch 2, 32×32×4 latent, base_channels=64, 3 levels, ~2M params. Shared architecture, but each framework uses its own random-init parameters, so losses don’t match across frameworks and several end up marked DIFFERENT MODEL even on identical structure.

Candle runs the full SD 1.5 U-Net (~860M params, 64×64×4 latent, cross-attention + timestep) — the real thing, marked DIFFERENT MODEL by design.
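For readers unfamiliar with the GroupNorm layer named in the simplified U-Net above, here is a small pure-Python sketch of the operation. It omits the learnable scale and shift, works on a toy [channel][pixel] layout, and is only an illustration of the technique, not code from any of the benchmarked frameworks.

```python
import math

def group_norm(x, num_groups, eps=1e-5):
    """Normalize x[channel][pixel] within each group of channels (no affine scale or shift)."""
    channels = len(x)
    assert channels % num_groups == 0, "channels must divide evenly into groups"
    per_group = channels // num_groups
    out = []
    for g in range(num_groups):
        group = x[g * per_group:(g + 1) * per_group]
        values = [v for ch in group for v in ch]
        mean = sum(values) / len(values)
        var = sum((v - mean) ** 2 for v in values) / len(values)
        inv_std = 1.0 / math.sqrt(var + eps)
        out.extend([[(v - mean) * inv_std for v in ch] for ch in group])
    return out

# Toy input: 4 channels of 2 pixels each, split into 2 groups of 2 channels.
x = [[1.0, 2.0], [3.0, 4.0], [10.0, 20.0], [30.0, 40.0]]
y = group_norm(x, num_groups=2)

for channel in y:
    print([round(v, 3) for v in channel])
```

Each group ends up with roughly zero mean and unit variance regardless of its original scale, which is why the two groups above produce the same normalized values.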

Platform	Framework	Compile (s)	Inference (ms)	Latency (ms)	Training (ms)	Loss
Intel Xeon @ 2.10GHz	PyTorch 2.11.0+cu130 (CPU)	53.02	14	11	28	0.57
	Meganeura (Vulkan/Lavapipe)	2.75	379	—	666	0.57
	Candle (CPU)	~~0.00~~	~~10777~~	—	—	~~0.00~~
	ONNX Runtime (CPU)	✗	✗	✗	✗
	JAX (CPU)	✗	✗	✗	✗
	Burn (wgpu)	✗	✗	✗	✗
	Luminal (CPU)	✗	✗	✗	✗
AMD Radeon 890M Graphics	PyTorch 2.10.0 (ROCm 7.2.53210)	12.68	2.6	3.0	5.4	0.50
	Burn	—	—	—	—
	Inferi	✗	✗	✗	✗
	Luminal	—	—	—	—
	Meganeura (Vulkan)	0.09	10	11	15	0.53
	GGML	—	—	—	—
	ONNX Runtime (MIGraphXExecutionProvider)	~~29.97~~	~~3.7~~	—	—	~~0.05~~
Apple M3	PyTorch 2.11.0 (MPS)	0.00	504	11	222	0.50
	MLX (MLX)	0.00	6.9	—	9.3	0.51
	Candle (Metal)	~~0.01~~	~~233~~	—	—	~~0.00~~
	Burn	—	—	—	—
	Inferi	✗	✗	✗	✗
	Luminal	—	—	—	—
	Meganeura (Metal)	0.49	8.9	8.9	68	0.53
	GGML	—	—	—	—
	ONNX Runtime (CoreMLExecutionProvider)	~~2.38~~	12	—	—	~~0.05~~
	JAX (METAL)	~~0.72~~	~~6.0~~	—	25	~~0.05~~
NVIDIA GeForce RTX 5080	PyTorch 2.11.0+cu130 (CUDA 13.0)	6.35	1.0	0.9	1.4	0.50
	Candle (CUDA)	~~0.01~~	~~108~~	—	—	~~0.00~~
	Burn	—	—	—	—
	Inferi	✗	✗	✗	✗
	Luminal	—	—	—	—
	Meganeura (Vulkan)	0.42	1.0	0.9	5.5	0.53
	GGML	—	—	—	—
	ONNX Runtime (CUDAExecutionProvider)	~~0.84~~	~~0.8~~	—	—	~~0.05~~
	MAX	—	—	—	—
NVIDIA GeForce RTX 3050 (Windows)	PyTorch 2.11.0+cu128 (CUDA 12.8)	0.00	1.4	1.0	4.7	0.50
	Burn	—	—	—	—
	Inferi	✗	✗	✗	✗
	Luminal	—	—	—	—
	Meganeura (Vulkan/DX12)	0.50	3.1	3.1	7.8	0.52
	GGML	—	—	—	—
	ONNX Runtime (CUDAExecutionProvider)	~~1.70~~	~~4.0~~	—	—	~~0.05~~
	JAX	✗	✗	✗	✗
Intel(R) Graphics (RPL-U)	PyTorch 2.11.0+xpu (CPU)	0.00	118	33	153	0.50
	Candle (CPU)	~~0.00~~	~~16529~~	—	—	~~0.00~~
	Burn	—	—	—	—
	Inferi	—	—	—	—
	Luminal	—	—	—	—
	Meganeura (Vulkan)	0.12	21	21	88	0.53
	GGML	—	—	—	—
	ONNX Runtime (CPUExecutionProvider)	~~2.39~~	31	—	—	~~0.05~~
	MAX	✗	✗	✗	✗
	JAX (CPU)	~~4.85~~	73	—	~~206~~	~~0.05~~
AMD Radeon RX 7900 XT	PyTorch 2.10.0+rocm7.1 (ROCm 7.1.25424)	11.06	1.7	1.5	3.3	0.50
	Burn	—	—	—	—
	Inferi	✗	✗	✗	✗
	Luminal	—	—	—	—
	Meganeura (Vulkan)	0.77	1.6	1.5	7.3	0.51
	GGML	—	—	—	—
	MAX	—	—	—	—


Caveats:

- Only the UNet is benchmarked (not VAE encode/decode or text encoding).
- Input is deterministic synthetic data, with no actual image generation.
- PyTorch and Meganeura use a simplified architecture for fair comparison.
- Candle runs the full SD 1.5 UNet but on CPU only (DIFFERENT MODEL vs others).

Run it yourself: git clone https://github.com/kvark/inferena && cd inferena && ./run.sh -m StableDiffusion

openai/whisper-tiny — Encoder-decoder transformer for speech recognition. ~39M parameters.

Uses a custom tiny configuration (4 encoder + 4 decoder layers) for fast benchmarking.

Benchmark config: 30s mel spectrogram (80x3000), 4-token decoder input, float32, random weights.
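The 80x3000 shape follows from simple arithmetic, sketched below. The 16 kHz sample rate and 160-sample hop are Whisper's standard preprocessing constants and are assumed here; the page itself only states the clip length and the resulting shape.

```python
SAMPLE_RATE = 16_000   # Hz; assumption: Whisper's standard audio sample rate
HOP_LENGTH = 160       # samples between frames; assumption: Whisper's standard STFT hop
N_MELS = 80            # mel bins, from the benchmark config
CLIP_SECONDS = 30      # from the benchmark config

n_frames = CLIP_SECONDS * SAMPLE_RATE // HOP_LENGTH
elements = N_MELS * n_frames
bytes_f32 = elements * 4  # float32 is 4 bytes per element

print(n_frames, elements, bytes_f32)
# prints: 3000 240000 960000
```

So each benchmark input is a little under 1 MB of float32 data per sample.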

Platform	Framework	Compile (s)	Inference (ms)	Latency (ms)	Training (ms)	Loss
Intel Xeon @ 2.10GHz	PyTorch 2.11.0+cu130 (CPU)	39.88	150	—	371	11.80
	ONNX Runtime 1.24.4 (CPU)	0.84	212	—	—	11.80
	Candle (CPU)	~~0.01~~	~~616~~	—	—	~~0.00~~
	Meganeura (Vulkan/Lavapipe)	~~7.84~~	~~53467~~	—	—	~~0.01~~
	Burn (wgpu)	✗	✗	✗	✗
	JAX (CPU)	✗	✗	✗	✗
AMD Radeon 890M Graphics	PyTorch 2.10.0 (ROCm 7.2.53210)	17.09	79	63	220	0.00
	Burn	—	—	—	—
	Inferi	✗	✗	✗	✗
	Luminal	—	—	—	—
	Meganeura (Vulkan)	0.20	34	33	101	0.01
	GGML	✗	✗	✗	✗
	ONNX Runtime (MIGraphXExecutionProvider)	23.58	32	—	—	0.01
Apple M3	PyTorch 2.11.0 (MPS)	0.00	318	41	127	0.00
	MLX	—	—	—	—
	Candle (Metal)	0.01	22	—	—	0.00
	Burn	—	—	—	—
	Inferi	✗	✗	✗	✗
	Luminal	—	—	—	—
	Meganeura (Metal)	0.15	406	415	1062	0.01
	GGML	✗	✗	✗	✗
	ONNX Runtime (CoreMLExecutionProvider)	7.93	440	—	—	0.01
	JAX (METAL)	2.17	128	315	445	0.01
NVIDIA GeForce RTX 5080	PyTorch 2.11.0+cu130 (CUDA 13.0)	4.04	2.3	2.1	13	0.00
	Candle (CUDA)	0.01	44	—	—	0.00
	Burn	—	—	—	—
	Inferi	✗	✗	✗	✗
	Luminal	—	—	—	—
	Meganeura (Vulkan)	0.48	2.9	2.8	12	0.01
	GGML (faster-whisper (CTranslate2, CUDA))	7.05	19	19	—	0.00
	ONNX Runtime (CUDAExecutionProvider)	1.94	3.5	—	—	0.01
	MAX	—	—	—	—
NVIDIA GeForce RTX 3050 (Windows)	PyTorch 2.11.0+cu128 (CUDA 12.8)	0.00	13	13	43	0.00
	Burn	—	—	—	—
	Inferi	✗	✗	✗	✗
	Luminal	—	—	—	—
	Meganeura (Vulkan/DX12)	0.66	19	19	43	0.01
	GGML (faster-whisper (CTranslate2, CUDA))	7.00	40	45	—	0.00
	ONNX Runtime (CUDAExecutionProvider)	4.82	20	—	—	0.01
	JAX	✗	✗	✗	✗
Intel(R) Graphics (RPL-U)	PyTorch 2.11.0+xpu (CPU)	0.00	477	420	899	0.00
	Candle (CPU)	0.02	795	—	—	0.00
	Burn	—	—	—	—
	Inferi	—	—	—	—
	Luminal	—	—	—	—
	Meganeura (Vulkan)	0.39	467	466	1594	0.01
	GGML (faster-whisper (CTranslate2, CPU))	14.76	1036	1104	—	0.00
	ONNX Runtime (CPUExecutionProvider)	6.18	333	—	—	0.01
	MAX	✗	✗	✗	✗
	JAX (CPU)	5.59	717	686	2681	0.01
AMD Radeon RX 7900 XT	PyTorch 2.10.0+rocm7.1 (ROCm 7.1.25424)	5.38	12	6.5	44	0.00
	Burn	—	—	—	—
	Inferi	✗	✗	✗	✗
	Luminal	—	—	—	—
	Meganeura (Vulkan)	0.82	4.8	4.8	21	0.01
	MAX	—	—	—	—

Correctness: PyTorch vs ONNX Runtime: PASS (loss diff 0.0).

Caveats:

- Uses a custom tiny config (4+4 layers, d=384), not the full whisper-tiny from OpenAI.
- Input is a synthetic mel spectrogram, not real audio.
- Decoder runs with a 4-token input (language/task tokens), not full transcription.

Run it yourself: git clone https://github.com/kvark/inferena && cd inferena && ./run.sh -m Whisper-tiny

Legend:

- Bold = best among matching frameworks
- ~~Struck through~~ = different / simplified model
- ✗ = not supported
- Framework names link to tested revision
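The "best among matching frameworks" rule can be sketched as follows. This is a guess at the selection logic, not the site's actual code: the helper names are hypothetical, and the sample rows are the Inference (ms) cells for SmolLM2-135M on the RTX 5080, copied from the table above.

```python
def parse_cell(cell):
    """Return (value, matching) for a table cell, or None if it holds no number."""
    text = cell.strip()
    struck = text.startswith("~~") and text.endswith("~~")
    if struck:
        text = text[2:-2]
    try:
        return float(text), not struck
    except ValueError:
        return None  # unsupported or missing entries carry no number

def best_matching(rows):
    """Pick the framework with the lowest value among cells that are not struck through."""
    candidates = []
    for framework, cell in rows:
        parsed = parse_cell(cell)
        if parsed is not None and parsed[1]:
            candidates.append((parsed[0], framework))
    return min(candidates)[1] if candidates else None

# Inference (ms), SmolLM2-135M, NVIDIA GeForce RTX 5080.
rows = [
    ("PyTorch", "4.0"),
    ("Candle", "48"),
    ("Burn", "~~154~~"),
    ("Inferi", "✗"),
    ("Meganeura", "5.2"),
    ("GGML", "25"),
    ("ONNX Runtime", "~~5.2~~"),
    ("MAX", "~~3.5~~"),
]

print(best_matching(rows))
# prints: PyTorch
```

MAX reports a lower number (3.5 ms), but it is struck through as a different model, so it is excluded from the comparison.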