Why so many runtimes?
Hardware diversity created this — and you have to live with it
When you quantize a model to INT4 or NF4, you've answered "how small?" — but not "where does it run?" That second question is where things fragment. NVIDIA GPUs speak TensorRT. Qualcomm NPUs speak QNN. Apple Silicon speaks Core ML. CPUs speak GGUF and ONNX. Each vendor built its own inference engine, optimized for its own memory hierarchy, instruction set, and tensor core design. There is no universal runtime — and there won't be for a while.
The good news: once you understand what each runtime actually optimizes for, the choice becomes mechanical. This post maps the full landscape and gives you the decision logic at the end.
GGUF & llama.cpp
The file format that made local LLMs real
GGUF (GPT-Generated Unified Format) replaced the older GGML format in August 2023. It's a binary container format designed for self-contained, portable LLM weights — everything the runtime needs (tokenizer, metadata, quantized tensors) lives in a single .gguf file. llama.cpp is the inference engine that reads it.
The File Format Anatomy
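To make the "self-contained single file" claim concrete, here is a minimal sketch of parsing the fixed GGUF v3 header (magic, version, tensor count, metadata key-value count, all little-endian). The counts used here are made-up; real files follow this header with the metadata KV pairs and tensor descriptors.

```python
import struct

# Build a minimal synthetic GGUF v3 header: magic, version,
# tensor count, metadata key-value count (little-endian).
header = struct.pack("<4sIQQ", b"GGUF", 3, 291, 24)

def parse_gguf_header(buf: bytes) -> dict:
    magic, version, n_tensors, n_kv = struct.unpack_from("<4sIQQ", buf, 0)
    assert magic == b"GGUF", "not a GGUF file"
    return {"version": version, "tensor_count": n_tensors, "metadata_kv_count": n_kv}

info = parse_gguf_header(header)
print(info)  # {'version': 3, 'tensor_count': 291, 'metadata_kv_count': 24}
```

In a real loader this header is read through an mmap of the file, which is exactly what llama.cpp does — no copy into RAM until a tensor is touched.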
K-Quants: What Q4_K_M Actually Means
The naming scheme is Q{bits}_K_{size}. The K stands for "k-means inspired super-block quantization" — a major upgrade over naïve uniform INT4.
```c
// Super-block structure for Q4_K (as in llama.cpp's ggml)
// 256 weights → 1 super-block → 8 sub-blocks of 32
struct block_q4_K {
    uint8_t d[2];       // FP16 super-block scale
    uint8_t dmin[2];    // FP16 super-block min
    uint8_t scales[12]; // 8 sub-block scales + 8 sub-block mins, 6 bits each, packed
    uint8_t qs[128];    // 4-bit quants, 2 per byte → 256 weights
};
// Size = 2 + 2 + 12 + 128 = 144 bytes for 256 weights
// Effective bits/weight = 144×8/256 = 4.5 bpw
// vs naive INT4 = exactly 4.0 bpw
// The 0.5 extra bits buys dramatically better accuracy

// Q4_K_M: "M" = medium — a few sensitive tensors are kept at higher precision (Q6_K)
// Q4_K_S: "S" = small — nearly all layers use the Q4_K format
// Q5_K_M: same super-block idea with 5-bit quants (~5.5 bpw)
```
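The two-level scale scheme is easier to see in arithmetic than in the packed struct. A hedged sketch, with the 6-bit scales/mins kept unpacked for clarity (real llama.cpp packs them into the 12-byte `scales` array) and random stand-in values:

```python
import numpy as np

# Dequantize one Q4_K-style super-block: super-block FP16 scale/min
# times per-sub-block 6-bit scale/min. Values here are random stand-ins.
rng = np.random.default_rng(0)
d, dmin = np.float16(0.02), np.float16(0.01)   # super-block scale / min
sub_scales = rng.integers(0, 64, 8)            # 8 × 6-bit sub-block scales
sub_mins   = rng.integers(0, 64, 8)            # 8 × 6-bit sub-block mins
q = rng.integers(0, 16, 256)                   # 256 × 4-bit quants

w = np.empty(256, dtype=np.float32)
for j in range(8):                             # 8 sub-blocks of 32 weights
    sl = slice(32 * j, 32 * (j + 1))
    w[sl] = float(d) * sub_scales[j] * q[sl] - float(dmin) * sub_mins[j]

print(w.shape)  # (256,)
```

The point of the hierarchy: each 32-weight sub-block gets its own cheap 6-bit scale, so outliers in one sub-block don't blow up the quantization range of its neighbors.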
llama.cpp: The Engine
llama.cpp does four things most people don't realize: (1) it memory-maps the GGUF file — no copy, just virtual address space. (2) It implements its own GPU offloading via CUDA/Metal — you decide how many layers go to GPU vs RAM. (3) It runs mixed CPU+GPU inference for models that don't fit entirely on-device. (4) The KV-cache can be quantized separately (Q8_0 by default).
```bash
# Run Llama-3-8B-Instruct with 30 GPU layers, rest on CPU RAM
./llama-cli \
  -m Meta-Llama-3-8B-Instruct.Q4_K_M.gguf \
  -ngl 30 \
  -c 8192 \
  --temp 0.7 --top-p 0.9 \
  -p "Explain transformer attention:"
# -ngl = GPU layer count; -c = context length
# (comments can't follow a trailing backslash, so flags are annotated here)

# JNI / Android — the llama.cpp Android library path
# Build: cmake -DLLAMA_ANDROID=ON -DCMAKE_TOOLCHAIN_FILE=$NDK/build/cmake/android.toolchain.cmake
# Architecture: NDK JNI → libllama.so → GPU offload via OpenCL/Vulkan
```
- Single portable file, self-contained
- Zero-copy mmap — huge models need no RAM copy
- Works on CPU-only machines
- Excellent macOS Metal integration
- Active community, daily updates
- Android via JNI (your KindKeyboard context!)
- Not a production server at scale (use llama-server + load balancer)
- No INT4 tensor core utilization on NVIDIA (dequants to FP16)
- Mixed CPU+GPU path has memory bandwidth overhead
- QNN/NPU path immature vs native QNN SDK
TensorRT-LLM
NVIDIA's production inference engine — the throughput king
TensorRT-LLM is NVIDIA's open-source library that takes a HuggingFace model, compiles it into a highly optimized TensorRT engine, and runs it with batching strategies designed for LLM inference. It's what NVIDIA uses internally and what AWS, Azure, and GCP run for managed inference on NVIDIA hardware.
The Compilation Pipeline
What Makes It Fast: Five Key Optimizations
1. Kernel Fusion. Standard PyTorch runs LayerNorm → Attention → Softmax → Projection as separate CUDA kernels, each requiring a round-trip to HBM. TRT-LLM fuses these into single custom CUDA kernels. The fusion alone gives 20–35% latency reduction on A100.
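A back-of-envelope sketch of why fusion helps, assuming a hypothetical [batch, seq, hidden] FP16 activation: each unfused kernel must read the tensor from HBM and write it back, while a fused kernel reads the input once and writes the output once, keeping intermediates in registers/shared memory.

```python
# HBM traffic estimate: unfused kernel chain vs one fused kernel.
# Hypothetical shapes; real savings depend on the actual fusion pattern.
batch, seq, hidden = 8, 4096, 4096
bytes_fp16 = 2
act = batch * seq * hidden * bytes_fp16      # one activation tensor in HBM

n_kernels = 4                                # LayerNorm → Attention → Softmax → Proj
unfused = n_kernels * 2 * act                # each kernel: read + write the tensor
fused   = 2 * act                            # one read in, one write out

print(f"unfused: {unfused / 1e9:.1f} GB, fused: {fused / 1e9:.1f} GB")
print(f"traffic reduction: {unfused / fused:.0f}x")
```

This ignores the weight matrices and attention's quadratic score tensor, so it overstates the end-to-end win — which is why the measured benefit is tens of percent rather than the raw traffic ratio.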
2. Paged Attention & Continuous Batching. Borrowed from vLLM — KV-cache is stored in fixed-size "pages" rather than contiguous blocks. This eliminates KV-cache fragmentation and allows different sequences at different stages of generation to share a GPU without waiting. Throughput improvement: 2–4× over naïve static batching.
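The core of paged attention is bookkeeping, not math: a per-sequence block table maps logical token positions to physical pages, so the KV-cache never needs a contiguous allocation. A minimal sketch, assuming a hypothetical page size of 16 token positions:

```python
# Minimal vLLM-style paged KV-cache block table (bookkeeping only —
# the actual key/value tensors would live inside the pages).
PAGE_SIZE = 16

class PagedKVCache:
    def __init__(self, num_pages: int):
        self.free_pages = list(range(num_pages))   # pool of physical pages
        self.block_tables = {}                     # seq_id -> [page ids]

    def append_token(self, seq_id: int, pos: int):
        """Map logical token position -> (physical page, offset within page)."""
        table = self.block_tables.setdefault(seq_id, [])
        if pos // PAGE_SIZE >= len(table):         # current page full: grab a new one
            table.append(self.free_pages.pop())
        return table[pos // PAGE_SIZE], pos % PAGE_SIZE

    def free(self, seq_id: int):
        """Sequence finished: return all its pages to the pool immediately."""
        self.free_pages.extend(self.block_tables.pop(seq_id, []))

cache = PagedKVCache(num_pages=64)
for pos in range(20):                              # 20 tokens span 2 pages
    cache.append_token(seq_id=0, pos=pos)
print(len(cache.block_tables[0]))  # 2
```

Because pages are fixed-size and returned to a shared pool the moment a sequence finishes, fragmentation disappears and sequences at different generation stages can share the GPU.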
3. FP8 on H100/H200. H100 has native FP8 (E4M3/E5M2) tensor cores. TRT-LLM's FP8 workflow quantizes both weights AND activations to FP8, enabling true INT8-speed matrix multiply with FP16-level accuracy. The result: nearly 2× MFU (Model FLOPs Utilization) vs FP16 on H100.
```bash
# Build TRT-LLM engine for Llama-3-8B with INT4 AWQ weights
python convert_checkpoint.py \
  --model_dir meta-llama/Meta-Llama-3-8B \
  --output_dir ./llama3_awq \
  --dtype float16 \
  --use_weight_only \
  --weight_only_precision int4_awq

# GEMM runs in fp16, weights stored in int4; paged KV cache enabled
trtllm-build \
  --checkpoint_dir ./llama3_awq \
  --output_dir ./llama3_trt_engine \
  --gemm_plugin float16 \
  --max_batch_size 32 \
  --max_input_len 4096 \
  --max_output_len 2048 \
  --use_paged_context_fmha enable
```
4. In-flight Batching. New requests join the batch mid-generation — no waiting for all sequences to finish. This is the key to high utilization at serving time. Combined with paged attention, TRT-LLM serves 3–5× more requests/sec than equivalent naive inference.
5. Speculative Decoding (Draft Tokens). A small "draft" model proposes N tokens ahead, and the main model verifies them in parallel in a single forward pass. When the draft is right, you get N tokens at the cost of 1.3 forward passes instead of N. Llama-3-70B + a 1B draft model achieves 2–2.5× decode speedup.
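The draft-verify loop above can be sketched with toy stand-in "models" (the real target model verifies all draft positions in one batched forward pass; this sketch checks them sequentially and uses greedy acceptance — accept until the first disagreement, then emit the target's own token there):

```python
# Toy speculative decoding. Both "models" are hypothetical stand-ins
# that predict the next token deterministically from the last one.
def draft_model(prefix, n):
    # cheap model: propose n tokens ahead
    return [(prefix[-1] + 1 + i) % 100 for i in range(n)]

def target_model(prefix):
    # expensive model: the "correct" next token for this prefix
    return (prefix[-1] + 1) % 100

def speculative_step(prefix, n_draft=4):
    drafts = draft_model(prefix, n_draft)
    accepted, ctx = [], list(prefix)
    for t in drafts:                 # one parallel verify pass in reality
        correct = target_model(ctx)
        accepted.append(correct)
        ctx.append(correct)
        if correct != t:             # draft diverged: stop accepting
            break
    return accepted

print(speculative_step([7]))  # [8, 9, 10, 11] — 4 tokens from one verify pass
```

When the draft agrees (as here), one verification pass yields all N tokens; when it diverges early, you still bank at least one guaranteed-correct token, so the worst case matches ordinary decoding.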
- Highest throughput on NVIDIA hardware
- Native FP8 on H100/H200
- Continuous batching + paged KV built-in
- Triton Inference Server integration
- Speculative decoding support
- NVIDIA-only (no AMD, no edge)
- Long compile times per model+GPU pair
- Static engine — no dynamic shape flex
- Steep learning curve vs vLLM or Ollama
Qualcomm QNN SDK
The NPU path — the only way to unlock Snapdragon's Hexagon DSP
QNN SDK (formerly SNPE — Snapdragon Neural Processing Engine) is Qualcomm's inference framework for deploying neural networks on Snapdragon SoCs. It's the only way to run inference on the Hexagon NPU — the dedicated AI accelerator that provides 30–75 TOPS at sub-1W power budget in Snapdragon 8 Gen 3 / X Elite chips.
Snapdragon AI Architecture: Three Compute Units
The QNN Workflow: From ONNX to Hexagon
```python
# Step 1: Export to ONNX
torch.onnx.export(model, dummy_input, "model.onnx",
                  opset_version=17,
                  dynamic_axes={"input": {0: "batch"}})
```

```bash
# Step 2: Convert to QNN graph (on Linux dev machine)
# --quantization_overrides: optional per-layer precision
# --act_bw 16 --weights_bw 8: W8A16 config
qnn-onnx-converter \
  --input_network model.onnx \
  --output_path model_qnn.cpp \
  --input_dim "input" 1,1,768 \
  --quantization_overrides quant_config.json \
  --act_bw 16 --weights_bw 8

# Step 3: Compile .so for Snapdragon target
qnn-model-lib-generator \
  -m model_qnn.cpp -b model_qnn.bin \
  -t aarch64-android \
  -l libmodel_qnn.so

# Step 4: On-device inference (Android NDK / JNI)
# Load backend → create context → execute graph → read outputs
```
W4A16 vs W8A16 on QNN — the Tradeoff That Matters
For LLMs on Snapdragon, the choice between W4A16 and W8A16 deserves careful analysis: W4A16 halves weight memory and DRAM bandwidth relative to W8A16 — which directly cuts decode latency on a bandwidth-bound NPU — while W8A16 preserves more weight fidelity for accuracy-sensitive layers.
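A quick back-of-envelope, assuming a hypothetical 3B-parameter model (decode streams the full weight set once per token, so weight bytes roughly track per-token latency on a bandwidth-bound device):

```python
# Weight memory footprint: W4A16 vs W8A16 for a hypothetical 3B LLM.
# Activations are FP16 in both cases, so only weight bits differ.
params = 3e9
GiB = 2**30

w4 = params * 4 / 8 / GiB      # 4 bits per weight
w8 = params * 8 / 8 / GiB      # 8 bits per weight

print(f"W4A16 weights: {w4:.2f} GiB")   # ~1.40 GiB
print(f"W8A16 weights: {w8:.2f} GiB")   # ~2.79 GiB
# Every generated token streams the full weight set once, so W4A16
# roughly halves both memory footprint and DRAM traffic per token.
```

The catch is accuracy: 4-bit weights need good grouping/calibration (e.g. via AIMET) to avoid measurable quality loss, which is why W8A16 remains the safer default for accuracy-sensitive deployments.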
- Only path to Hexagon NPU
- Best TOPS/watt on Snapdragon
- AIMET integration for production tuning
- Samsung Galaxy devices (your deployment target)
- Thermal efficiency vs GPU path
- Qualcomm-only, no cross-platform
- Complex toolchain (converter → lib-gen → device)
- Some LLM ops not NPU-supported → CPU fallback
- Model recompilation per SoC generation
ONNX Runtime
The universal adapter — runs everywhere, optimized for nothing specific
ONNX Runtime is the inference engine for the Open Neural Network Exchange format. It's the broadest coverage runtime — a single model can run on x86 CPU, ARM, NVIDIA GPU (via CUDA EP), AMD GPU (via ROCm EP), Apple ANE (via Core ML EP), Qualcomm (via QNN EP), and Intel (via OpenVINO EP). The tradeoff: "Execution Providers" (EPs) are opt-in and vary in maturity.
Execution Providers Architecture
```python
import onnxruntime as ort

# ORT dispatches ops to the best available EP.
# Unsupported ops fall back to the CPU EP automatically.
session_options = ort.SessionOptions()
session_options.graph_optimization_level = (
    ort.GraphOptimizationLevel.ORT_ENABLE_ALL
)

session = ort.InferenceSession(
    "model.onnx",
    providers=[
        "QNNExecutionProvider",    # Qualcomm NPU — try first
        "CUDAExecutionProvider",   # NVIDIA GPU fallback
        "CPUExecutionProvider",    # Always available
    ],
    sess_options=session_options,
)
# ORT routes each op to the highest-priority EP that supports it.
# If QNN doesn't support op X → CPU handles it.
```
ORT + Generative AI Extension (ORT-GenAI)
For LLMs specifically, Microsoft ships onnxruntime-genai — a wrapper that adds KV-cache management, greedy/beam/top-p sampling, and tokenizer integration to ORT. It's the recommended path for running Phi-3, Mistral, and Llama variants on ORT.
```python
# onnxruntime-genai — high-level LLM inference on ORT
import onnxruntime_genai as og

model = og.Model("phi-3-mini-int4-cpu-onnx")
tokenizer = og.Tokenizer(model)

params = og.GeneratorParams(model)
params.set_search_options(max_length=512, temperature=0.7)
input_tokens = tokenizer.encode("Explain quantization:")
params.input_ids = input_tokens

generator = og.Generator(model, params)
while not generator.is_done():
    generator.compute_logits()
    generator.generate_next_token()

output_tokens = generator.get_sequence(0)
print(tokenizer.decode(output_tokens))
```
- Single model file, runs everywhere
- Native Windows on ARM (Snapdragon X Elite PCs)
- Strong Microsoft backing + Phi-3 optimized
- Automatic CPU fallback for unsupported ops
- No hardware-specific kernel optimizations
- QNN EP less mature than native QNN SDK
- ONNX export has dynamic shape limitations
- LLM throughput below TRT-LLM on NVIDIA
Core ML · ExecuTorch · OpenVINO
Apple, Meta, and Intel's answers to the same problem
Core ML is Apple's inference framework and the only way to access the Apple Neural Engine (ANE) — the dedicated ML accelerator in all Apple Silicon chips (M1–M4, A14–A18). ANE runs INT8/FP16 ops at 11–38 TOPS depending on chip generation. For on-device LLMs that should use the ANE, Core ML is the required path; llama.cpp's Metal backend runs on the Apple GPU, not the ANE.
```python
import torch
import coremltools as ct

# Convert from PyTorch to Core ML (.mlpackage)
model = torch.load("llm_backbone.pt")
traced = torch.jit.trace(model, example_inputs)

mlmodel = ct.convert(
    traced,
    inputs=[ct.TensorType(shape=(1, ct.RangeDim(1, 4096), 4096))],
    compute_units=ct.ComputeUnit.ALL,          # ANE + GPU + CPU
    minimum_deployment_target=ct.target.iOS17,
    compute_precision=ct.precision.FLOAT16,
)
mlmodel.save("model.mlpackage")
```
MLX (Apple's own framework) and mlx-lm have become the de-facto standard for running quantized LLMs on Mac. MLX exploits Apple Silicon's unified memory (CPU and GPU share the same DRAM) and is faster than llama.cpp's Metal path for most models on M-series chips.

ExecuTorch is Meta's lightweight runtime for deploying PyTorch models at the edge — targeting iOS, Android, and microcontrollers. It uses torch.export() (the PyTorch 2.x export path) to generate a portable, ahead-of-time compiled graph. The core runtime is ~50KB — far smaller than ONNX Runtime's multi-MB footprint.
```python
# ExecuTorch export pipeline
import torch
from executorch.exir import to_edge
from executorch.backends.xnnpack.partition.xnnpack_partitioner import XnnpackPartitioner

exported = torch.export.export(model, (example_input,))
edge_prog = to_edge(exported)

# Delegate compute-intensive ops to XNNPACK (ARM NEON / AVX2)
lowered = edge_prog.to_backend(XnnpackPartitioner())
exe_prog = lowered.to_executorch()

with open("model.pte", "wb") as f:
    f.write(exe_prog.buffer)
```
ExecuTorch has backends for XNNPACK (optimized CPU kernels for ARM/x86), Vulkan (Android GPU), Core ML (iOS ANE), and QNN (Snapdragon NPU). It's how Meta deploys LLaMA 3.2 1B/3B on-device in Meta AI on WhatsApp/Instagram.
OpenVINO is Intel's inference toolkit optimized for Intel CPUs (especially 4th-gen Xeon with AMX — Advanced Matrix Extensions), Intel GPUs (Arc series), and Intel NPUs (Meteor Lake+). For server-side inference on Intel Xeon, OpenVINO with INT8 quantization achieves 3–5× throughput vs PyTorch FP32 and rivals ONNX Runtime.
```python
# OpenVINO with INT8 quantization via NNCF
import openvino as ov
import nncf

# Quantize to INT8 with calibration data (weights INT8, activations INT8)
quantized = nncf.quantize(
    model,
    calibration_dataset,
    preset=nncf.QuantizationPreset.MIXED,
    subset_size=128,
)
ov_model = ov.convert_model(quantized)
ov.save_model(ov_model, "model_int8.xml")

# Deploy
core = ov.Core()
compiled = core.compile_model("model_int8.xml", "CPU")
```
Relative Performance Landscape
Throughput normalized to baseline — within-hardware comparison
NVIDIA A100 — Throughput (tokens/sec, batch=8)
* FP8 requires H100/H200 — shown here as reference. A100 max is TRT-LLM INT4.
Snapdragon 8 Gen 3 — Latency (ms/token, batch=1)
Values for a 1B–3B parameter model. Larger models require more layers on CPU fallback.
Side-by-Side Comparison
All seven runtimes across the dimensions that matter
| Runtime | Hardware Target | LLM Ready | Best Precision | Setup Effort | Portability | License |
|---|---|---|---|---|---|---|
| GGUF / llama.cpp | CPU, Mac, GPU (CUDA/Metal) | ✓ Native | Q4_K_M / Q5_K_M | Low — single binary | High | MIT |
| TensorRT-LLM | NVIDIA A10 → H200 | ✓ Native | INT4-AWQ / FP8 | High — compile per GPU | None | Apache 2.0 |
| QNN SDK | Snapdragon NPU | Partial (evolving) | W4A16 / W8A16 | High — 3-step pipeline | QC-only | Proprietary |
| ONNX Runtime | CPU, CUDA, QNN EP, CoreML EP | ✓ via GenAI | INT8 / INT4 | Medium | Very High | MIT |
| Core ML / MLX | Apple ANE (M/A series) | ✓ via MLX | INT4 / FP16 | Medium | Apple-only | MIT (MLX) |
| ExecuTorch | Android / iOS / embedded | ✓ Llama 3.2 | INT4 / INT8 | Medium — torch.export | High | BSD |
| OpenVINO | Intel CPU / GPU / NPU | Growing | INT8 via NNCF | Low–Medium | Intel-only | Apache 2.0 |
When to Choose What
The decision logic — hardware first, then use-case
Scenario-Based Guidance
For Mac-side work, use mlx-lm. For iOS app deployment, convert to Core ML .mlpackage. llama.cpp Metal is good for prototyping, but MLX is measurably faster on M-series.

The TL;DR
| If your target is… | Start with… | When you need more… |
|---|---|---|
| NVIDIA cloud GPU | vLLM (fast to set up) | TensorRT-LLM (max throughput) |
| Snapdragon Android | GGUF + llama.cpp (prototype) | QNN SDK → Hexagon NPU (production) |
| Apple Mac / iPhone | llama.cpp Metal (quick) | MLX / Core ML ANE (production) |
| Intel CPU server | ONNX Runtime CPU EP | OpenVINO INT8 + AMX |
| Windows on ARM PC | ONNX Runtime + GenAI | QNN EP or DirectML EP |
| iOS/Android apps | ExecuTorch + XNNPACK | + Core ML / QNN backend |
| Unknown / Research | GGUF + Ollama | Measure, then pick hardware-specific |