Why so many runtimes?
Hardware diversity created this — and you have to live with it
When you quantize a model to INT4 or NF4, you've answered "how small?" — but not "where does it run?" That second question is where things fragment. NVIDIA GPUs speak TensorRT. Qualcomm NPUs speak QNN. Apple Silicon speaks Core ML. CPUs speak GGUF and ONNX. Each vendor built its own inference engine, optimized for its own memory hierarchy, instruction set, and tensor core design. There is no universal runtime — and there won't be for a while.
The good news: once you understand what each runtime actually optimizes for, the choice becomes mechanical. This post maps the full landscape and gives you the decision logic at the end.
GGUF & llama.cpp
The file format that made local LLMs real
GGUF (GPT-Generated Unified Format) replaced the older GGML format in August 2023. It's a binary container format designed for self-contained, portable LLM weights — everything the runtime needs (tokenizer, metadata, quantized tensors) lives in a single .gguf file. llama.cpp is the inference engine that reads it.
The File Format Anatomy
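To make the "self-contained single file" claim concrete, here is a minimal sketch of parsing the fixed GGUF v3 header (magic, version, tensor count, metadata key-value count, all little-endian). The counts used here are made-up; real files follow this header with the metadata KV pairs and tensor descriptors.

```python
import struct

# Build a minimal synthetic GGUF v3 header: magic, version,
# tensor count, metadata key-value count (little-endian).
header = struct.pack("<4sIQQ", b"GGUF", 3, 291, 24)

def parse_gguf_header(buf: bytes) -> dict:
    magic, version, n_tensors, n_kv = struct.unpack_from("<4sIQQ", buf, 0)
    assert magic == b"GGUF", "not a GGUF file"
    return {"version": version, "tensor_count": n_tensors, "metadata_kv_count": n_kv}

info = parse_gguf_header(header)
print(info)  # {'version': 3, 'tensor_count': 291, 'metadata_kv_count': 24}
```

In a real loader this header is read through an mmap of the file, which is exactly what llama.cpp does — no copy into RAM until a tensor is touched.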
K-Quants: What Q4_K_M Actually Means
The naming scheme is Q{bits}_K_{size}. The K stands for "k-means inspired super-block quantization" — a major upgrade over naïve uniform INT4.
```c
// Super-block structure for Q4_K (as in llama.cpp's ggml)
// 256 weights → 1 super-block → 8 sub-blocks of 32
struct block_q4_K {
    uint8_t d[2];       // FP16 super-block scale
    uint8_t dmin[2];    // FP16 super-block min
    uint8_t scales[12]; // 8 sub-block scales + 8 sub-block mins, 6 bits each, packed
    uint8_t qs[128];    // 4-bit quants, 2 per byte → 256 weights
};
// Size = 2 + 2 + 12 + 128 = 144 bytes for 256 weights
// Effective bits/weight = 144×8/256 = 4.5 bpw
// vs naive INT4 = exactly 4.0 bpw
// The 0.5 extra bits buys dramatically better accuracy

// Q4_K_M: "M" = medium — a few sensitive tensors are kept at higher precision (Q6_K)
// Q4_K_S: "S" = small — nearly all layers use the Q4_K format
// Q5_K_M: same super-block idea with 5-bit quants (~5.5 bpw)
```
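The two-level scale scheme is easier to see in arithmetic than in the packed struct. A hedged sketch, with the 6-bit scales/mins kept unpacked for clarity (real llama.cpp packs them into the 12-byte `scales` array) and random stand-in values:

```python
import numpy as np

# Dequantize one Q4_K-style super-block: super-block FP16 scale/min
# times per-sub-block 6-bit scale/min. Values here are random stand-ins.
rng = np.random.default_rng(0)
d, dmin = np.float16(0.02), np.float16(0.01)   # super-block scale / min
sub_scales = rng.integers(0, 64, 8)            # 8 × 6-bit sub-block scales
sub_mins   = rng.integers(0, 64, 8)            # 8 × 6-bit sub-block mins
q = rng.integers(0, 16, 256)                   # 256 × 4-bit quants

w = np.empty(256, dtype=np.float32)
for j in range(8):                             # 8 sub-blocks of 32 weights
    sl = slice(32 * j, 32 * (j + 1))
    w[sl] = float(d) * sub_scales[j] * q[sl] - float(dmin) * sub_mins[j]

print(w.shape)  # (256,)
```

The point of the hierarchy: each 32-weight sub-block gets its own cheap 6-bit scale, so outliers in one sub-block don't blow up the quantization range of its neighbors.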
llama.cpp: The Engine
llama.cpp does four things most people don't realize: (1) it memory-maps the GGUF file — no copy, just virtual address space. (2) It implements its own GPU offloading via CUDA/Metal — you decide how many layers go to GPU vs RAM. (3) It runs mixed CPU+GPU inference for models that don't fit entirely on-device. (4) The KV-cache can be quantized separately (Q8_0 by default).
```bash
# Run Llama-3-8B-Instruct with 30 GPU layers, rest on CPU RAM
./llama-cli \
  -m Meta-Llama-3-8B-Instruct.Q4_K_M.gguf \
  -ngl 30 \
  -c 8192 \
  --temp 0.7 --top-p 0.9 \
  -p "Explain transformer attention:"
# -ngl = GPU layer count; -c = context length
# (comments can't follow a trailing backslash, so flags are annotated here)

# JNI / Android — the llama.cpp Android library path
# Build: cmake -DLLAMA_ANDROID=ON -DCMAKE_TOOLCHAIN_FILE=$NDK/build/cmake/android.toolchain.cmake
# Architecture: NDK JNI → libllama.so → GPU offload via OpenCL/Vulkan
```
- Single portable file, self-contained
- Zero-copy mmap — huge models need no RAM copy
- Works on CPU-only machines
- Excellent macOS Metal integration
- Active community, daily updates
- Android via JNI (your KindKeyboard context!)
- Not a production server at scale (use llama-server + load balancer)
- No INT4 tensor core utilization on NVIDIA (dequants to FP16)
- Mixed CPU+GPU path has memory bandwidth overhead
- QNN/NPU path immature vs native QNN SDK
TensorRT-LLM
NVIDIA's production inference engine — the throughput king
TensorRT-LLM is NVIDIA's open-source library that takes a HuggingFace model, compiles it into a highly optimized TensorRT engine, and runs it with batching strategies designed for LLM inference. It's what NVIDIA uses internally and what AWS, Azure, and GCP run for managed inference on NVIDIA hardware.
The Compilation Pipeline
What Makes It Fast: Five Key Optimizations
1. Kernel Fusion. Standard PyTorch runs LayerNorm → Attention → Softmax → Projection as separate CUDA kernels, each requiring a round-trip to HBM. TRT-LLM fuses these into single custom CUDA kernels. The fusion alone gives 20–35% latency reduction on A100.
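A back-of-envelope sketch of why fusion helps, assuming a hypothetical [batch, seq, hidden] FP16 activation: each unfused kernel must read the tensor from HBM and write it back, while a fused kernel reads the input once and writes the output once, keeping intermediates in registers/shared memory.

```python
# HBM traffic estimate: unfused kernel chain vs one fused kernel.
# Hypothetical shapes; real savings depend on the actual fusion pattern.
batch, seq, hidden = 8, 4096, 4096
bytes_fp16 = 2
act = batch * seq * hidden * bytes_fp16      # one activation tensor in HBM

n_kernels = 4                                # LayerNorm → Attention → Softmax → Proj
unfused = n_kernels * 2 * act                # each kernel: read + write the tensor
fused   = 2 * act                            # one read in, one write out

print(f"unfused: {unfused / 1e9:.1f} GB, fused: {fused / 1e9:.1f} GB")
print(f"traffic reduction: {unfused / fused:.0f}x")
```

This ignores the weight matrices and attention's quadratic score tensor, so it overstates the end-to-end win — which is why the measured benefit is tens of percent rather than the raw traffic ratio.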
2. Paged Attention & Continuous Batching. Borrowed from vLLM — KV-cache is stored in fixed-size "pages" rather than contiguous blocks. This eliminates KV-cache fragmentation and allows different sequences at different stages of generation to share a GPU without waiting. Throughput improvement: 2–4× over naïve static batching.
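The core of paged attention is bookkeeping, not math: a per-sequence block table maps logical token positions to physical pages, so the KV-cache never needs a contiguous allocation. A minimal sketch, assuming a hypothetical page size of 16 token positions:

```python
# Minimal vLLM-style paged KV-cache block table (bookkeeping only —
# the actual key/value tensors would live inside the pages).
PAGE_SIZE = 16

class PagedKVCache:
    def __init__(self, num_pages: int):
        self.free_pages = list(range(num_pages))   # pool of physical pages
        self.block_tables = {}                     # seq_id -> [page ids]

    def append_token(self, seq_id: int, pos: int):
        """Map logical token position -> (physical page, offset within page)."""
        table = self.block_tables.setdefault(seq_id, [])
        if pos // PAGE_SIZE >= len(table):         # current page full: grab a new one
            table.append(self.free_pages.pop())
        return table[pos // PAGE_SIZE], pos % PAGE_SIZE

    def free(self, seq_id: int):
        """Sequence finished: return all its pages to the pool immediately."""
        self.free_pages.extend(self.block_tables.pop(seq_id, []))

cache = PagedKVCache(num_pages=64)
for pos in range(20):                              # 20 tokens span 2 pages
    cache.append_token(seq_id=0, pos=pos)
print(len(cache.block_tables[0]))  # 2
```

Because pages are fixed-size and returned to a shared pool the moment a sequence finishes, fragmentation disappears and sequences at different generation stages can share the GPU.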
3. FP8 on H100/H200. H100 has native FP8 (E4M3/E5M2) tensor cores. TRT-LLM's FP8 workflow quantizes both weights AND activations to FP8, enabling true INT8-speed matrix multiply with FP16-level accuracy. The result: nearly 2× MFU (Model FLOPs Utilization) vs FP16 on H100.
```bash
# Build TRT-LLM engine for Llama-3-8B with INT4 AWQ weights
python convert_checkpoint.py \
  --model_dir meta-llama/Meta-Llama-3-8B \
  --output_dir ./llama3_awq \
  --dtype float16 \
  --use_weight_only \
  --weight_only_precision int4_awq

# GEMM runs in fp16, weights stored in int4; paged KV cache enabled
trtllm-build \
  --checkpoint_dir ./llama3_awq \
  --output_dir ./llama3_trt_engine \
  --gemm_plugin float16 \
  --max_batch_size 32 \
  --max_input_len 4096 \
  --max_output_len 2048 \
  --use_paged_context_fmha enable
```
4. In-flight Batching. New requests join the batch mid-generation — no waiting for all sequences to finish. This is the key to high utilization at serving time. Combined with paged attention, TRT-LLM serves 3–5× more requests/sec than equivalent naive inference.
5. Speculative Decoding (Draft Tokens). A small "draft" model proposes N tokens ahead, and the main model verifies them in parallel in a single forward pass. When the draft is right, you get N tokens at the cost of 1.3 forward passes instead of N. Llama-3-70B + a 1B draft model achieves 2–2.5× decode speedup.
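The draft-verify loop above can be sketched with toy stand-in "models" (the real target model verifies all draft positions in one batched forward pass; this sketch checks them sequentially and uses greedy acceptance — accept until the first disagreement, then emit the target's own token there):

```python
# Toy speculative decoding. Both "models" are hypothetical stand-ins
# that predict the next token deterministically from the last one.
def draft_model(prefix, n):
    # cheap model: propose n tokens ahead
    return [(prefix[-1] + 1 + i) % 100 for i in range(n)]

def target_model(prefix):
    # expensive model: the "correct" next token for this prefix
    return (prefix[-1] + 1) % 100

def speculative_step(prefix, n_draft=4):
    drafts = draft_model(prefix, n_draft)
    accepted, ctx = [], list(prefix)
    for t in drafts:                 # one parallel verify pass in reality
        correct = target_model(ctx)
        accepted.append(correct)
        ctx.append(correct)
        if correct != t:             # draft diverged: stop accepting
            break
    return accepted

print(speculative_step([7]))  # [8, 9, 10, 11] — 4 tokens from one verify pass
```

When the draft agrees (as here), one verification pass yields all N tokens; when it diverges early, you still bank at least one guaranteed-correct token, so the worst case matches ordinary decoding.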
- Highest throughput on NVIDIA hardware
- Native FP8 on H100/H200
- Continuous batching + paged KV built-in
- Triton Inference Server integration
- Speculative decoding support
- NVIDIA-only (no AMD, no edge)
- Long compile times per model+GPU pair
- Static engine — no dynamic shape flex
- Steep learning curve vs vLLM or Ollama
Qualcomm QNN SDK
The NPU path — the only way to unlock Snapdragon's Hexagon DSP
QNN SDK (formerly SNPE — Snapdragon Neural Processing Engine) is Qualcomm's inference framework for deploying neural networks on Snapdragon SoCs. It's the only way to run inference on the Hexagon NPU — the dedicated AI accelerator that provides 30–75 TOPS at sub-1W power budget in Snapdragon 8 Gen 3 / X Elite chips.
Snapdragon AI Architecture: Three Compute Units
The QNN Workflow: From ONNX to Hexagon
```python
# Step 1: Export to ONNX
torch.onnx.export(model, dummy_input, "model.onnx",
                  opset_version=17,
                  dynamic_axes={"input": {0: "batch"}})
```

```bash
# Step 2: Convert to QNN graph (on Linux dev machine)
# --quantization_overrides: optional per-layer precision
# --act_bw 16 --weights_bw 8: W8A16 config
qnn-onnx-converter \
  --input_network model.onnx \
  --output_path model_qnn.cpp \
  --input_dim "input" 1,1,768 \
  --quantization_overrides quant_config.json \
  --act_bw 16 --weights_bw 8

# Step 3: Compile .so for Snapdragon target
qnn-model-lib-generator \
  -m model_qnn.cpp -b model_qnn.bin \
  -t aarch64-android \
  -l libmodel_qnn.so

# Step 4: On-device inference (Android NDK / JNI)
# Load backend → create context → execute graph → read outputs
```
W4A16 vs W8A16 on QNN — the Tradeoff That Matters
For LLMs on Snapdragon, the choice between W4A16 and W8A16 deserves careful analysis: W4A16 halves weight memory and DRAM bandwidth relative to W8A16 — which directly cuts decode latency on a bandwidth-bound NPU — while W8A16 preserves more weight fidelity for accuracy-sensitive layers.
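A quick back-of-envelope, assuming a hypothetical 3B-parameter model (decode streams the full weight set once per token, so weight bytes roughly track per-token latency on a bandwidth-bound device):

```python
# Weight memory footprint: W4A16 vs W8A16 for a hypothetical 3B LLM.
# Activations are FP16 in both cases, so only weight bits differ.
params = 3e9
GiB = 2**30

w4 = params * 4 / 8 / GiB      # 4 bits per weight
w8 = params * 8 / 8 / GiB      # 8 bits per weight

print(f"W4A16 weights: {w4:.2f} GiB")   # ~1.40 GiB
print(f"W8A16 weights: {w8:.2f} GiB")   # ~2.79 GiB
# Every generated token streams the full weight set once, so W4A16
# roughly halves both memory footprint and DRAM traffic per token.
```

The catch is accuracy: 4-bit weights need good grouping/calibration (e.g. via AIMET) to avoid measurable quality loss, which is why W8A16 remains the safer default for accuracy-sensitive deployments.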
- Only path to Hexagon NPU
- Best TOPS/watt on Snapdragon
- AIMET integration for production tuning
- Samsung Galaxy devices (your deployment target)
- Thermal efficiency vs GPU path
- Qualcomm-only, no cross-platform
- Complex toolchain (converter → lib-gen → device)
- Some LLM ops not NPU-supported → CPU fallback
- Model recompilation per SoC generation
ONNX Runtime
The universal adapter — runs everywhere, optimized for nothing specific
ONNX Runtime is the inference engine for the Open Neural Network Exchange format. It's the broadest coverage runtime — a single model can run on x86 CPU, ARM, NVIDIA GPU (via CUDA EP), AMD GPU (via ROCm EP), Apple ANE (via Core ML EP), Qualcomm (via QNN EP), and Intel (via OpenVINO EP). The tradeoff: "Execution Providers" (EPs) are opt-in and vary in maturity.
Execution Providers Architecture
```python
import onnxruntime as ort

# ORT dispatches ops to the best available EP.
# Unsupported ops fall back to the CPU EP automatically.
session_options = ort.SessionOptions()
session_options.graph_optimization_level = (
    ort.GraphOptimizationLevel.ORT_ENABLE_ALL
)

session = ort.InferenceSession(
    "model.onnx",
    providers=[
        "QNNExecutionProvider",    # Qualcomm NPU — try first
        "CUDAExecutionProvider",   # NVIDIA GPU fallback
        "CPUExecutionProvider",    # Always available
    ],
    sess_options=session_options,
)
# ORT routes each op to the highest-priority EP that supports it.
# If QNN doesn't support op X → CPU handles it.
```
ORT + Generative AI Extension (ORT-GenAI)
For LLMs specifically, Microsoft ships onnxruntime-genai — a wrapper that adds KV-cache management, greedy/beam/top-p sampling, and tokenizer integration to ORT. It's the recommended path for running Phi-3, Mistral, and Llama variants on ORT.
```python
# onnxruntime-genai — high-level LLM inference on ORT
import onnxruntime_genai as og

model = og.Model("phi-3-mini-int4-cpu-onnx")
tokenizer = og.Tokenizer(model)

params = og.GeneratorParams(model)
params.set_search_options(max_length=512, temperature=0.7)
input_tokens = tokenizer.encode("Explain quantization:")
params.input_ids = input_tokens

generator = og.Generator(model, params)
while not generator.is_done():
    generator.compute_logits()
    generator.generate_next_token()

output_tokens = generator.get_sequence(0)
print(tokenizer.decode(output_tokens))
```
- Single model file, runs everywhere
- Native Windows on ARM (Snapdragon X Elite PCs)
- Strong Microsoft backing + Phi-3 optimized
- Automatic CPU fallback for unsupported ops
- No hardware-specific kernel optimizations
- QNN EP less mature than native QNN SDK
- ONNX export has dynamic shape limitations
- LLM throughput below TRT-LLM on NVIDIA
Core ML · ExecuTorch · OpenVINO
Apple, Meta, and Intel's answers to the same problem
Core ML is Apple's inference framework and the only way to access the Apple Neural Engine (ANE) — the dedicated ML accelerator in all Apple Silicon chips (M1–M4, A14–A18). ANE runs INT8/FP16 ops at 11–38 TOPS depending on chip generation. For on-device LLMs that should use the ANE, Core ML is the required path; llama.cpp's Metal backend runs on the Apple GPU, not the ANE.
```python
import torch
import coremltools as ct

# Convert from PyTorch to Core ML (.mlpackage)
model = torch.load("llm_backbone.pt")
traced = torch.jit.trace(model, example_inputs)

mlmodel = ct.convert(
    traced,
    inputs=[ct.TensorType(shape=(1, ct.RangeDim(1, 4096), 4096))],
    compute_units=ct.ComputeUnit.ALL,          # ANE + GPU + CPU
    minimum_deployment_target=ct.target.iOS17,
    compute_precision=ct.precision.FLOAT16,
)
mlmodel.save("model.mlpackage")
```
MLX (Apple's own framework) and mlx-lm have become the de-facto standard for running quantized LLMs on Mac. MLX exploits Apple Silicon's unified memory (CPU and GPU share the same DRAM) and is faster than llama.cpp's Metal path for most models on M-series chips.

ExecuTorch is Meta's lightweight runtime for deploying PyTorch models at the edge — targeting iOS, Android, and microcontrollers. It uses torch.export() (the PyTorch 2.x export path) to generate a portable, ahead-of-time compiled graph. The core runtime is ~50KB — far smaller than ONNX Runtime's multi-MB footprint.
```python
# ExecuTorch export pipeline
import torch
from executorch.exir import to_edge
from executorch.backends.xnnpack.partition.xnnpack_partitioner import XnnpackPartitioner

exported = torch.export.export(model, (example_input,))
edge_prog = to_edge(exported)

# Delegate compute-intensive ops to XNNPACK (ARM NEON / AVX2)
lowered = edge_prog.to_backend(XnnpackPartitioner())
exe_prog = lowered.to_executorch()

with open("model.pte", "wb") as f:
    f.write(exe_prog.buffer)
```
ExecuTorch has backends for XNNPACK (optimized CPU kernels for ARM/x86), Vulkan (Android GPU), Core ML (iOS ANE), and QNN (Snapdragon NPU). It's how Meta deploys LLaMA 3.2 1B/3B on-device in Meta AI on WhatsApp/Instagram.
OpenVINO is Intel's inference toolkit optimized for Intel CPUs (especially 4th-gen Xeon with AMX — Advanced Matrix Extensions), Intel GPUs (Arc series), and Intel NPUs (Meteor Lake+). For server-side inference on Intel Xeon, OpenVINO with INT8 quantization achieves 3–5× throughput vs PyTorch FP32 and rivals ONNX Runtime.
```python
# OpenVINO with INT8 quantization via NNCF
import openvino as ov
import nncf

# Quantize to INT8 with calibration data (weights INT8, activations INT8)
quantized = nncf.quantize(
    model,
    calibration_dataset,
    preset=nncf.QuantizationPreset.MIXED,
    subset_size=128,
)
ov_model = ov.convert_model(quantized)
ov.save_model(ov_model, "model_int8.xml")

# Deploy
core = ov.Core()
compiled = core.compile_model("model_int8.xml", "CPU")
```
Relative Performance Landscape
Throughput normalized to baseline — within-hardware comparison
NVIDIA A100 — Throughput (tokens/sec, batch=8)
* FP8 requires H100/H200 — shown here as reference. A100 max is TRT-LLM INT4.
Snapdragon 8 Gen 3 — Latency (ms/token, batch=1)
Values for a 1B–3B parameter model. Larger models require more layers on CPU fallback.
Side-by-Side Comparison
All seven runtimes across the dimensions that matter
| Runtime | Hardware Target | LLM Ready | Best Precision | Setup Effort | Portability | License |
|---|---|---|---|---|---|---|
| GGUF / llama.cpp | CPU, Mac, GPU (CUDA/Metal) | ✓ Native | Q4_K_M / Q5_K_M | Low — single binary | High | MIT |
| TensorRT-LLM | NVIDIA A10 → H200 | ✓ Native | INT4-AWQ / FP8 | High — compile per GPU | None | Apache 2.0 |
| QNN SDK | Snapdragon NPU | Partial (evolving) | W4A16 / W8A16 | High — 3-step pipeline | QC-only | Proprietary |
| ONNX Runtime | CPU, CUDA, QNN EP, CoreML EP | ✓ via GenAI | INT8 / INT4 | Medium | Very High | MIT |
| Core ML / MLX | Apple ANE (M/A series) | ✓ via MLX | INT4 / FP16 | Medium | Apple-only | MIT (MLX) |
| ExecuTorch | Android / iOS / embedded | ✓ Llama 3.2 | INT4 / INT8 | Medium — torch.export | High | BSD |
| OpenVINO | Intel CPU / GPU / NPU | Growing | INT8 via NNCF | Low–Medium | Intel-only | Apache 2.0 |
When to Choose What
The decision logic — hardware first, then use-case
Scenario-Based Guidance
For Mac-side work, use mlx-lm. For iOS app deployment, convert to Core ML .mlpackage. llama.cpp Metal is good for prototyping, but MLX is measurably faster on M-series.

The TL;DR
| If your target is… | Start with… | When you need more… |
|---|---|---|
| NVIDIA cloud GPU | vLLM (fast to set up) | TensorRT-LLM (max throughput) |
| Snapdragon Android | GGUF + llama.cpp (prototype) | QNN SDK → Hexagon NPU (production) |
| Apple Mac / iPhone | llama.cpp Metal (quick) | MLX / Core ML ANE (production) |
| Intel CPU server | ONNX Runtime CPU EP | OpenVINO INT8 + AMX |
| Windows on ARM PC | ONNX Runtime + GenAI | QNN EP or DirectML EP |
| iOS/Android apps | ExecuTorch + XNNPACK | + Core ML / QNN backend |
| Unknown / Research | GGUF + Ollama | Measure, then pick hardware-specific |