Inference Arena


Inference Arena runs the same training and inference workload through every supported ML framework on every available platform, then publishes the results side by side. Each tab below covers one model — pick your model, filter to the frameworks you care about, and compare. Lower numbers are better; bold marks the best matching framework on each platform.
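As a rough illustration of the "lower is better, best among matching frameworks" rule, the sketch below picks a winner per metric from three SmolLM2-135M rows on the RTX 5080. The `Row` type, the `best` helper, and the `matching` flags are hypothetical and not taken from the project; in particular, treating Burn as non-matching is an assumption based on its differing loss.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Row:
    framework: str
    inference_ms: Optional[float]
    latency_ms: Optional[float]
    matching: bool  # assumption: True if the framework runs the same model as the reference


def best(rows: List[Row], metric: str) -> Optional[Row]:
    """Return the matching row with the lowest value for `metric`, or None."""
    candidates = [r for r in rows if r.matching and getattr(r, metric) is not None]
    return min(candidates, key=lambda r: getattr(r, metric), default=None)


# Values copied from the SmolLM2-135M table (NVIDIA GeForce RTX 5080).
rtx_5080 = [
    Row("PyTorch", 4.0, 2.8, True),
    Row("Burn", 154.0, 26.0, False),  # flagged non-matching here as an assumption
    Row("Meganeura", 5.2, 2.2, True),
]

print(best(rtx_5080, "inference_ms").framework)  # PyTorch
print(best(rtx_5080, "latency_ms").framework)    # Meganeura
```

Note that the winner can differ per column: here PyTorch has the lowest inference time while Meganeura has the lowest latency.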

HuggingFaceTB/SmolLM2-135M — 134.5M parameter decoder-only language model.

Benchmark config: seq_len=128, float32, input=[0,1,…,127].

Platform Framework Compile (s) Inference (ms) Latency (ms) Training (ms) Loss
Intel Xeon @ 2.10GHz PyTorch 2.11.0+cu130 (CPU) 135.63 188 18 486 10.98
  ONNX Runtime 1.24.4 (CPU) 65.50 118 20 10.98
  JAX 0.9.2 (CPU) 6.79 194 31 2107 10.98
  Candle (CPU) 0.31 453 61 11.11
  Luminal (CPU) 3.37 17006 14459 10.81
  Burn (wgpu/Lavapipe) 0.00 2369 320 5700 11.73
  Meganeura (Vulkan/Lavapipe) 7.29 3933 852 3651 10.99
  llama.cpp (CPU) 0.10 221 24 10.98
AMD Radeon 890M Graphics PyTorch 2.10.0 (ROCm 7.2.53210) 51.07 64 27 119 8.35
  Burn (wgpu/vulkan) 0.00 182 31 206 11.55
  Inferi  
  Meganeura (Vulkan) 0.59 26 9.1 92 8.64
  ONNX Runtime  
Apple M3 PyTorch 2.11.0 (MPS) 0.00 356 71 699 8.35
  MLX (MLX) 0.00 97 253 8.64
  Candle (Metal) 0.02 22 2.8 10.80
  Burn (wgpu/metal) 0.00 873 39 905 11.59
  Inferi  
  Luminal  
  Meganeura (Metal) 1.50 201 9.1 464 8.65
  GGML (Metal) 0.38 49 11 8.69
  JAX (METAL) 3.13 47 21 253 5.79
NVIDIA GeForce RTX 5080 PyTorch 2.11.0+cu130 (CUDA 13.0) 16.83 4.0 2.8 6.5 8.35
  Candle (CUDA) 0.05 48 2.4 10.80
  Burn (wgpu/vulkan) 0.00 154 26 86 11.69
  Inferi  
  Meganeura (Vulkan) 0.87 5.2 2.2 17 8.64
  GGML (CUDA) 0.25 25 1.5 8.69
  ONNX Runtime (CUDAExecutionProvider) 20.54 5.2 3.1 6.01
  MAX (GPU) 18.79 3.5 0.1 10.80
NVIDIA GeForce RTX 3050 (Windows) PyTorch 2.11.0+cu128 (CUDA 12.8) 0.00 11 5.1 51 8.35
  Burn (wgpu/vulkan) 0.00 125 28 138 11.76
  Inferi  
  Meganeura (Vulkan/DX12) 1.40 13 3.6 58 8.63
  GGML (CUDA) 0.31 132 5.9 8.69
  ONNX Runtime (CUDAExecutionProvider) 39.92 18 14 6.01
  JAX  
Intel(R) Graphics (RPL-U) PyTorch 2.11.0+xpu (CPU) 0.00 541 126 1130 8.35
  Candle (CPU) 0.41 524 76 12.16
  Burn (wgpu/vulkan) 0.00 604 83 1437 11.79
  Inferi (Vulkan) 1.06 25769 9.1 15.16
  Luminal (CPU) 3.57 15551 15473 10.81
  Meganeura (Vulkan) 1.74 172 52 700 8.64
  GGML (CPU) 0.11 433 33 8.69
  ONNX Runtime (CPUExecutionProvider) 66.86 381 64 6.01
  MAX  
  JAX (CPU) 9.97 589 176 1412 5.79
AMD Radeon RX 7900 XT PyTorch 2.10.0+rocm7.1 (ROCm 7.1.25424) 73.03 10 6.8 23 8.35
  Burn (wgpu/vulkan) 0.00 172 25 181 11.76
  Inferi  
  Meganeura (Vulkan) 0.99 7.3 2.1 21 8.64
  GGML (ROCm) 0.11 259 3.4 8.69
  MAX (GPU) 2.13 3.5 0.1 10.80

Correctness: PyTorch vs ONNX Runtime: PASS (loss diff 3.2e-3). PyTorch vs JAX: PASS (loss diff 3.2e-3). PyTorch vs Meganeura: PASS (max error 1.7e-6, loss diff 5.3e-3). PyTorch vs llama.cpp: PASS (loss diff 4.5e-3). Candle, Luminal: CLOSE. Struck-through values are from frameworks running a different (simplified) model.
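The two metrics quoted in the correctness lines could be computed roughly as follows. This is a minimal sketch, assuming "loss diff" means the absolute difference of two scalar losses and "max error" means the largest element-wise absolute difference between two output vectors; the page does not define either term, and the PASS/CLOSE thresholds are not stated, so none are shown. The function names and sample values are illustrative only.

```python
from typing import Sequence


def loss_diff(loss_a: float, loss_b: float) -> float:
    """Absolute difference between two scalar losses (assumed definition)."""
    return abs(loss_a - loss_b)


def max_abs_error(out_a: Sequence[float], out_b: Sequence[float]) -> float:
    """Largest element-wise absolute difference (assumed definition of 'max error')."""
    if len(out_a) != len(out_b):
        raise ValueError("outputs must have the same length")
    return max(abs(a - b) for a, b in zip(out_a, out_b))


# Illustrative values only, not benchmark data.
reference_out = [0.125, -1.5, 2.0]
candidate_out = [0.125, -1.25, 2.0]

print(max_abs_error(reference_out, candidate_out))  # 0.25
print(loss_diff(2.5, 2.25))                         # 0.25
```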

Caveats:
- PyTorch and Meganeura load real model weights and run the full architecture; their outputs match.

Run it yourself: git clone https://github.com/kvark/inferena && cd inferena && ./run.sh -m SmolLM2-135M

lerobot/smolvla_base — SmolVLA action expert decoder for robotics.

Benchmark config: chunk_size=50, vlm_seq_len=16, float32, random weights, MSE loss.

Platform Framework Compile (s) Inference (ms) Latency (ms) Training (ms) Loss
Intel Xeon @ 2.10GHz PyTorch 2.11.0+cu130 (CPU) 51.63 40 11 116 0.00
  Meganeura (Vulkan/Lavapipe) 2.75 696 3850 0.01
  ONNX Runtime (CPU)  
  JAX (CPU)  
  Candle (CPU)  
  Burn (wgpu)  
  Luminal (CPU)  
AMD Radeon 890M Graphics PyTorch 2.10.0 (ROCm 7.2.53210) 19.72 27 14 49 0.00
  Candle  
  Burn  
  Inferi  
  Luminal  
  Meganeura (Vulkan) 0.12 15 6.7 47 0.00
  GGML  
  ONNX Runtime (MIGraphXExecutionProvider) 19.38 10 0.00
Apple M3 PyTorch 2.11.0 (MPS) 0.00 173 9.1 117 0.00
  MLX (MLX) 0.00 13 24 0.00
  Candle  
  Burn  
  Inferi  
  Luminal  
  Meganeura (Metal) 0.12 34 6.4 170 0.00
  GGML  
  ONNX Runtime (CoreMLExecutionProvider) 7.96 86 0.00
  JAX (METAL) 1.17 15 147 0.00
NVIDIA GeForce RTX 5080 PyTorch 2.11.0+cu130 (CUDA 13.0) 8.39 2.5 1.2 3.2 0.00
  Candle  
  Burn  
  Inferi  
  Luminal  
  Meganeura (Vulkan) 0.47 3.2 1.5 9.1 0.00
  GGML  
  ONNX Runtime (CUDAExecutionProvider) 2.54 1.6 0.00
  MAX (GPU) 33.04 32 0.00
NVIDIA GeForce RTX 3050 (Windows) PyTorch 2.11.0+cu128 (CUDA 12.8) 0.00 4.5 3.5 22 0.00
  Candle  
  Burn  
  Inferi  
  Luminal  
  Meganeura (Vulkan/DX12) 0.70 4.9 2.9 23 0.00
  GGML  
  ONNX Runtime (CUDAExecutionProvider) 4.14 4.9 0.00
  JAX  
Intel(R) Graphics (RPL-U) PyTorch 2.11.0+xpu (CPU) 0.00 183 72 388 0.00
  Candle  
  Burn  
  Inferi  
  Luminal  
  Meganeura (Vulkan) 0.39 73 40 222 0.00
  GGML  
  ONNX Runtime (CPUExecutionProvider) 10.33 86 0.00
  MAX  
  JAX (CPU) 3.86 162 471 0.00
AMD Radeon RX 7900 XT PyTorch 2.10.0+rocm7.1 (ROCm 7.1.25424) 9.23 4.8 4.1 8.1 0.00
  Candle  
  Burn  
  Inferi  
  Luminal  
  Meganeura (Vulkan) 0.81 5.1 1.5 9.9 0.00
  GGML  
  MAX (GPU) 2.20 15 0.00

Correctness: PyTorch vs Meganeura: CLOSE (loss diff 1e-5, max error 4.6e-3).

Caveats:
- PyTorch and Meganeura implement the full action expert architecture and should produce matching outputs.

Run it yourself: git clone https://github.com/kvark/inferena && cd inferena && ./run.sh -m SmolVLA

stable-diffusion-v1-5/stable-diffusion-v1-5 — Latent diffusion model for text-to-image generation.

Most frameworks (PyTorch, Meganeura, ONNX Runtime, JAX, MLX) run the simplified U-Net — Conv + GroupNorm + skip connections, no cross-attention or timestep embedding. Batch 2, 32×32×4 latent, base_channels=64, 3 levels, ~2M params. Shared architecture, but each framework uses its own random-init parameters, so losses don’t match across frameworks and several end up marked DIFFERENT MODEL even on identical structure.

Candle runs the full SD 1.5 U-Net (~860M params, 64×64×4 latent, cross-attention + timestep) — the real thing, marked DIFFERENT MODEL by design.
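The point about per-framework random initialisation can be seen with a toy stand-in. The sketch below is not the benchmark's U-Net: it uses a hypothetical four-weight linear model to show that the same architecture and input give different losses under different seeds, while one seed reproduces exactly.

```python
import random
from typing import List


def init_weights(seed: int, n: int) -> List[float]:
    """Draw n weights from a standard normal using a private, seeded generator."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(n)]


def mse_loss(weights: List[float], inputs: List[float], target: float) -> float:
    """Squared error of a single linear prediction against the target."""
    prediction = sum(w * x for w, x in zip(weights, inputs))
    return (prediction - target) ** 2


inputs = [0.5, -1.0, 2.0, 0.25]
target = 0.0

loss_a = mse_loss(init_weights(1, len(inputs)), inputs, target)  # "framework A"
loss_b = mse_loss(init_weights(2, len(inputs)), inputs, target)  # "framework B"

print(loss_a, loss_b)    # same structure, different parameters
print(loss_a != loss_b)  # the losses do not line up
```

Sharing one set of initial weights across frameworks (for example, by exporting them from a single source) would be one way to make such losses comparable; whether the project does this for the simplified U-Net is not stated here.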

Platform Framework Compile (s) Inference (ms) Latency (ms) Training (ms) Loss
Intel Xeon @ 2.10GHz PyTorch 2.11.0+cu130 (CPU) 53.02 14 11 28 0.57
  Meganeura (Vulkan/Lavapipe) 2.75 379 666 0.57
  Candle (CPU) 0.00 10777 0.00
  ONNX Runtime (CPU)  
  JAX (CPU)  
  Burn (wgpu)  
  Luminal (CPU)  
AMD Radeon 890M Graphics PyTorch 2.10.0 (ROCm 7.2.53210) 12.68 2.6 3.0 5.4 0.50
  Burn  
  Inferi  
  Luminal  
  Meganeura (Vulkan) 0.09 10 11 15 0.53
  GGML  
  ONNX Runtime (MIGraphXExecutionProvider) 29.97 3.7 0.05
Apple M3 PyTorch 2.11.0 (MPS) 0.00 504 11 222 0.50
  MLX (MLX) 0.00 6.9 9.3 0.51
  Candle (Metal) 0.01 233 0.00
  Burn  
  Inferi  
  Luminal  
  Meganeura (Metal) 0.49 8.9 8.9 68 0.53
  GGML  
  ONNX Runtime (CoreMLExecutionProvider) 2.38 12 0.05
  JAX (METAL) 0.72 6.0 25 0.05
NVIDIA GeForce RTX 5080 PyTorch 2.11.0+cu130 (CUDA 13.0) 6.35 1.0 0.9 1.4 0.50
  Candle (CUDA) 0.01 108 0.00
  Burn  
  Inferi  
  Luminal  
  Meganeura (Vulkan) 0.42 1.0 0.9 5.5 0.53
  GGML  
  ONNX Runtime (CUDAExecutionProvider) 0.84 0.8 0.05
  MAX  
NVIDIA GeForce RTX 3050 (Windows) PyTorch 2.11.0+cu128 (CUDA 12.8) 0.00 1.4 1.0 4.7 0.50
  Burn  
  Inferi  
  Luminal  
  Meganeura (Vulkan/DX12) 0.50 3.1 3.1 7.8 0.52
  GGML  
  ONNX Runtime (CUDAExecutionProvider) 1.70 4.0 0.05
  JAX  
Intel(R) Graphics (RPL-U) PyTorch 2.11.0+xpu (CPU) 0.00 118 33 153 0.50
  Candle (CPU) 0.00 16529 0.00
  Burn  
  Inferi  
  Luminal  
  Meganeura (Vulkan) 0.12 21 21 88 0.53
  GGML  
  ONNX Runtime (CPUExecutionProvider) 2.39 31 0.05
  MAX  
  JAX (CPU) 4.85 73 206 0.05
AMD Radeon RX 7900 XT PyTorch 2.10.0+rocm7.1 (ROCm 7.1.25424) 11.06 1.7 1.5 3.3 0.50
  Burn  
  Inferi  
  Luminal  
  Meganeura (Vulkan) 0.77 1.6 1.5 7.3 0.51
  GGML  
  MAX  


Caveats:
- Only the UNet is benchmarked (not VAE encode/decode or text encoding).

Run it yourself: git clone https://github.com/kvark/inferena && cd inferena && ./run.sh -m StableDiffusion

ResNet-50 — Classic convolutional neural network for image classification. 25.6M parameters.

Benchmark config: batch=4, 3x224x224, float32, random weights, cross-entropy loss.

Platform Framework Compile (s) Inference (ms) Latency (ms) Training (ms) Loss
Intel Xeon @ 2.10GHz PyTorch 2.11.0+cu130 (CPU) 60.61 141 40 284 10.10
  ONNX Runtime 1.24.4 (CPU) 0.28 76 18 10.37
  Candle (CPU) 0.00 782 311 6.91
  Meganeura (Vulkan/Lavapipe) 0.98 3906 1192
  Burn (wgpu)  
  JAX (CPU)  
AMD Radeon 890M Graphics PyTorch 2.10.0 (ROCm 7.2.53210) 36.46 48 16 97 6.92
  Burn  
  Inferi  
  Luminal  
  Meganeura (Vulkan) 0.31 58 19 125 6.92
  GGML  
  ONNX Runtime (MIGraphXExecutionProvider) 3.17 29 9.3 6.92
Apple M3 PyTorch 2.11.0 (MPS) 0.00 166 21 274 6.92
  MLX  
  Candle (Metal) 0.01 16 5.3 6.91
  Burn  
  Inferi  
  Luminal  
  Meganeura (Metal) 0.32 63 23 965 6.92
  GGML  
  ONNX Runtime (CoreMLExecutionProvider) 4.87 6.5 2.1 6.92
  JAX (METAL) 0.90 139 9.4 409 6.92
NVIDIA GeForce RTX 5080 PyTorch 2.11.0+cu130 (CUDA 13.0) 8.83 2.4 1.5 4.5 6.92
  Candle (CUDA) 0.17 55 2.4 6.92
  Burn  
  Inferi  
  Luminal  
  Meganeura (Vulkan) 0.58 3.9 2.5 22 6.92
  GGML  
  ONNX Runtime (CUDAExecutionProvider) 1.52 2.4 1.3 6.92
  MAX  
NVIDIA GeForce RTX 3050 (Windows) PyTorch 2.11.0+cu128 (CUDA 12.8) 0.00 12 4.1 36 6.92
  Burn  
  Inferi  
  Luminal  
  Meganeura (Vulkan/DX12) 1.04 19 6.5 52 6.92
  GGML  
  ONNX Runtime (CUDAExecutionProvider) 3.20 13 4.5 6.92
  JAX  
Intel(R) Graphics (RPL-U) PyTorch 2.11.0+xpu (CPU) 0.00 511 127 1048 6.92
  Candle (CPU) 0.36 1187 368 6.92
  Burn  
  Inferi  
  Luminal  
  Meganeura (Vulkan) 0.77 225 73 1049 6.92
  GGML  
  ONNX Runtime (CPUExecutionProvider) 5.75 204 53 6.92
  MAX  
  JAX (CPU) 6.99 436 176 2813 6.92
AMD Radeon RX 7900 XT PyTorch 2.10.0+rocm7.1 (ROCm 7.1.25424) 17.57 7.1 3.8 17 6.92
  Burn  
  Inferi  
  Luminal  
  Meganeura (Vulkan) 0.92 6.5 3.4 26 6.92
  GGML  
  MAX  

Correctness: PyTorch vs ONNX Runtime: CLOSE (loss diff 0.27, rel error 8.8%).
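Most ResNet-50 rows above report a loss near 6.92. With random weights and cross-entropy loss, that is roughly what uniform guessing would give, assuming the standard 1000-class ImageNet head (the class count is an assumption; it is not stated on this page). A quick check of that baseline:

```python
import math

NUM_CLASSES = 1000  # assumption: standard ImageNet classifier head

# Cross-entropy of a uniform prediction over N classes is -ln(1/N) = ln(N).
uniform_loss = math.log(NUM_CLASSES)

print(f"{uniform_loss:.4f}")  # 6.9078
```

A loss well away from this value, such as the 10.10 and 10.37 reported on the Xeon, may indicate a different initialisation or output scale rather than a faster or slower implementation.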

Run it yourself: git clone https://github.com/kvark/inferena && cd inferena && ./run.sh -m ResNet-50

openai/whisper-tiny — Encoder-decoder transformer for speech recognition. ~39M parameters.

Uses a custom tiny configuration (4 encoder + 4 decoder layers) for fast benchmarking.

Benchmark config: 30s mel spectrogram (80x3000), 4-token decoder input, float32, random weights.

Platform Framework Compile (s) Inference (ms) Latency (ms) Training (ms) Loss
Intel Xeon @ 2.10GHz PyTorch 2.11.0+cu130 (CPU) 39.88 150 371 11.80
  ONNX Runtime 1.24.4 (CPU) 0.84 212 11.80
  Candle (CPU) 0.01 616 0.00
  Meganeura (Vulkan/Lavapipe) 7.84 53467 0.01
  Burn (wgpu)  
  JAX (CPU)  
AMD Radeon 890M Graphics PyTorch 2.10.0 (ROCm 7.2.53210) 17.09 79 63 220 0.00
  Burn  
  Inferi  
  Luminal  
  Meganeura (Vulkan) 0.20 34 33 101 0.01
  GGML  
  ONNX Runtime (MIGraphXExecutionProvider) 23.58 32 0.01
Apple M3 PyTorch 2.11.0 (MPS) 0.00 318 41 127 0.00
  MLX  
  Candle (Metal) 0.01 22 0.00
  Burn  
  Inferi  
  Luminal  
  Meganeura (Metal) 0.15 406 415 1062 0.01
  GGML  
  ONNX Runtime (CoreMLExecutionProvider) 7.93 440 0.01
  JAX (METAL) 2.17 128 315 445 0.01
NVIDIA GeForce RTX 5080 PyTorch 2.11.0+cu130 (CUDA 13.0) 4.04 2.3 2.1 13 0.00
  Candle (CUDA) 0.01 44 0.00
  Burn  
  Inferi  
  Luminal  
  Meganeura (Vulkan) 0.48 2.9 2.8 12 0.01
  GGML (faster-whisper (CTranslate2, CUDA)) 7.05 19 19 0.00
  ONNX Runtime (CUDAExecutionProvider) 1.94 3.5 0.01
  MAX  
NVIDIA GeForce RTX 3050 (Windows) PyTorch 2.11.0+cu128 (CUDA 12.8) 0.00 13 13 43 0.00
  Burn  
  Inferi  
  Luminal  
  Meganeura (Vulkan/DX12) 0.66 19 19 43 0.01
  GGML (faster-whisper (CTranslate2, CUDA)) 7.00 40 45 0.00
  ONNX Runtime (CUDAExecutionProvider) 4.82 20 0.01
  JAX  
Intel(R) Graphics (RPL-U) PyTorch 2.11.0+xpu (CPU) 0.00 477 420 899 0.00
  Candle (CPU) 0.02 795 0.00
  Burn  
  Inferi  
  Luminal  
  Meganeura (Vulkan) 0.39 467 466 1594 0.01
  GGML (faster-whisper (CTranslate2, CPU)) 14.76 1036 1104 0.00
  ONNX Runtime (CPUExecutionProvider) 6.18 333 0.01
  MAX  
  JAX (CPU) 5.59 717 686 2681 0.01
AMD Radeon RX 7900 XT PyTorch 2.10.0+rocm7.1 (ROCm 7.1.25424) 5.38 12 6.5 44 0.00
  Burn  
  Inferi  
  Luminal  
  Meganeura (Vulkan) 0.82 4.8 4.8 21 0.01
  MAX  

Correctness: PyTorch vs ONNX Runtime: PASS (loss diff 0.0).

Caveats:
- Uses a custom tiny config (4+4 layers, d=384), not OpenAI's full whisper-tiny.
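The 80x3000 input shape in the benchmark config follows from the usual Whisper audio front end. The sketch below assumes the standard 16 kHz sample rate and 160-sample hop (10 ms per frame); those two constants are not stated on this page.

```python
SAMPLE_RATE_HZ = 16_000  # assumption: standard Whisper sample rate
HOP_LENGTH = 160         # assumption: standard Whisper hop, 10 ms per frame
N_MELS = 80              # from the benchmark config
CLIP_SECONDS = 30        # from the benchmark config

# Number of spectrogram frames for one clip.
n_frames = CLIP_SECONDS * SAMPLE_RATE_HZ // HOP_LENGTH

print((N_MELS, n_frames))  # (80, 3000)
```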

Run it yourself: git clone https://github.com/kvark/inferena && cd inferena && ./run.sh -m Whisper-tiny

Legend: Bold = best among matching frameworks; struck through = different / simplified model; no value = not supported. Framework names link to the tested revision.