Deep Dive · Inference Runtimes

GGUF, TensorRT, QNN
& everything in between

You quantized your model. Now where does it run? GGUF, TensorRT-LLM, QNN SDK, ONNX Runtime, Core ML, ExecuTorch, OpenVINO — the landscape is fragmented by design. Here's the full map, with the decision logic to match.

March 2026 · Prateek Singh, PhD
00

Why so many runtimes?

Hardware diversity created this — and you have to live with it

When you quantize a model to INT4 or NF4, you've answered "how small?" — but not "where does it run?" That second question is where things fragment. NVIDIA GPUs speak TensorRT. Qualcomm NPUs speak QNN. Apple Silicon speaks Core ML. CPUs speak GGUF and ONNX. Each vendor built its own inference engine, optimized for its own memory hierarchy, instruction set, and tensor core design. There is no universal runtime — and there won't be for a while.

The good news: once you understand what each runtime actually optimizes for, the choice becomes mechanical. This post maps the full landscape and gives you the decision logic at the end.

🖥️
CPU / Edge Inference
GGUF + llama.cpp leads here. ONNX Runtime is the universal fallback. These runtimes prioritize portability and low-memory footprints.
GGUF · ONNX RT
NVIDIA GPU Inference
TensorRT-LLM for production throughput. vLLM on top for serving. FP8 on H100. Paged attention. Maximum tokens/sec at the highest hardware tier.
TensorRT-LLM · vLLM
📱
Mobile / On-Device
QNN SDK for Snapdragon NPU. Core ML for Apple chips. ExecuTorch for cross-platform mobile. Battery life and thermal budget define the constraints.
QNN · Core ML · ET
🗺️
What this post covers: GGUF/llama.cpp · TensorRT-LLM · QNN SDK · ONNX Runtime · Core ML · ExecuTorch · OpenVINO — their internals, what they're optimizing, when each wins, and a decision tree at the end.
01

GGUF & llama.cpp

The file format that made local LLMs real

GGUF / llama.cpp
Georgi Gerganov · 2023–present · C/C++ · MIT

GGUF (GPT-Generated Unified Format) replaced the older GGML format in August 2023. It's a binary container format designed for self-contained, portable LLM weights — everything the runtime needs (tokenizer, metadata, quantized tensors) lives in a single .gguf file. llama.cpp is the inference engine that reads it.

The File Format Anatomy

GGUF Binary Layout

  4 bytes    Magic         "GGUF" signature
  4 bytes    Version       format version (v3)
  variable   Metadata KV   model name, arch, context len, tokenizer vocab, special tokens
  variable   Tensor Index  name, shape, dtype, byte offset per tensor
  majority   Tensor Data   quantized weights (Q4_K_M, Q5_K_S, Q8_0 …)
  align      Padding       alignment to 32-byte boundary
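The fixed-size prefix of this layout can be read with a few lines of stdlib Python. A minimal sketch that checks the magic and version — the fabricated header bytes below are for illustration only, not a real model file:

```python
import struct

def parse_gguf_prefix(buf: bytes):
    """Read the fixed GGUF prefix: magic, version (u32),
    tensor count (u64), metadata-KV count (u64), little-endian."""
    magic, version, n_tensors, n_kv = struct.unpack_from("<4sIQQ", buf, 0)
    if magic != b"GGUF":
        raise ValueError("not a GGUF file")
    return version, n_tensors, n_kv

# Fabricated 24-byte header: v3, 291 tensors, 24 metadata keys
header = struct.pack("<4sIQQ", b"GGUF", 3, 291, 24)
print(parse_gguf_prefix(header))  # → (3, 291, 24)
```

After this prefix come the metadata key-value pairs and the tensor index, both variable-length, so a full reader walks them sequentially before seeking to the tensor data.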

K-Quants: What Q4_K_M Actually Means

The naming scheme is Q{bits}_K_{size}. The K stands for "k-means inspired super-block quantization" — a major upgrade over naïve uniform INT4.

// Super-block structure for Q4_K_M
// 256 weights → 1 super-block → 8 sub-blocks of 32

struct block_q4_K {
  // Scales for all 8 sub-blocks (6-bit each, packed)
  uint8_t scales[12];   // 8 six-bit scales + 8 six-bit mins, packed (96 bits)
  uint8_t d[2];          // FP16 super-block scale
  uint8_t dmin[2];       // FP16 super-block min
  uint8_t qs[128];       // 4-bit quants, 2 per byte → 256 weights
};

// Size = 12 + 2 + 2 + 128 = 144 bytes for 256 weights
// Effective bits/weight = 144×8/256 = 4.5 bpw
// vs naive INT4 = exactly 4.0 bpw
// The 0.5 extra bits buys dramatically better accuracy

// Q4_K_M: "M" = medium — most tensors use Q4_K, with a few
//   sensitivity-critical tensors kept at higher precision (Q6_K)
// Q4_K_S: "S" = small — Q4_K across (nearly) all tensors
// Q5_K_M: same super-block idea with 5-bit quants (5.5 bpw)
💡
Q4_K_M is the sweet spot for most use cases: 4.5 bpw, fits a 7B model in ~4 GB, scores within 0.5–1 perplexity points of FP16 on most benchmarks. Q5_K_M trades ~25% more memory for another 0.3–0.5 PPL improvement.
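The byte counts above are easy to sanity-check. This sketch also assumes a 176-byte Q5_K super-block (d + dmin + scales[12] + a 32-byte fifth-bit plane qh + qs[128]) — treat that layout as illustrative, not a spec:

```python
WEIGHTS_PER_SUPERBLOCK = 256

def bits_per_weight(block_bytes: int) -> float:
    """Effective bits/weight for one super-block of 256 weights."""
    return block_bytes * 8 / WEIGHTS_PER_SUPERBLOCK

q4_k = bits_per_weight(12 + 2 + 2 + 128)        # scales + d + dmin + 4-bit quants
q5_k = bits_per_weight(12 + 2 + 2 + 32 + 128)   # adds qh[32], the 5th bit plane
print(q4_k, q5_k)  # → 4.5 5.5

# Rough file size for a 7B model at Q4_K_M (weights only, no metadata):
print(f"{7e9 * q4_k / 8 / 2**30:.1f} GiB")  # ≈ 3.7 GiB — the "~4 GB" claim
```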

llama.cpp: The Engine

llama.cpp does four things most people don't realize: (1) it memory-maps the GGUF file — no copy, just virtual address space. (2) It implements its own GPU offloading via CUDA/Metal — you decide how many layers go to GPU vs RAM. (3) It runs mixed CPU+GPU inference for models that don't fit entirely on-device. (4) The KV-cache can be quantized separately (FP16 by default; Q8_0 or Q4_0 via the cache-type flags).
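Point (1) is easy to demonstrate with Python's stdlib mmap — the same OS mechanism llama.cpp uses from C. Mapping only reserves address space; pages are faulted in as tensors are touched. The file name and size here are stand-ins:

```python
import mmap
import os
import tempfile

# Stand-in for a multi-GB .gguf file
path = os.path.join(tempfile.mkdtemp(), "weights.gguf")
with open(path, "wb") as f:
    f.write(b"GGUF" + b"\x00" * (4 * 1024 * 1024))

with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    # Nothing was copied into our heap yet; this slice faults in
    # only the page(s) backing the first 4 bytes.
    magic = mm[:4]
    print(magic)  # → b'GGUF'
```

The practical consequence: a 30 GB model "loads" instantly, and resident memory grows only as layers are actually evaluated.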

# Run Llama-3-8B-Instruct with 30 GPU layers (-ngl), the rest on
# CPU RAM, and an 8192-token context (-c):
./llama-cli \
  -m Meta-Llama-3-8B-Instruct.Q4_K_M.gguf \
  -ngl 30 \
  -c 8192 \
  --temp 0.7 --top-p 0.9 \
  -p "Explain transformer attention:"

# JNI / Android — the llama.cpp Android library path
# Build: cmake -DCMAKE_TOOLCHAIN_FILE=$NDK/build/cmake/android.toolchain.cmake ...
# Architecture: NDK JNI → libllama.so → GPU offload via OpenCL/Vulkan
Best for
CPU / Apple Silicon / edge Android
Quant support
Q2 – Q8, K-quants, IQ-quants
GPU offload
CUDA, Metal, OpenCL, Vulkan
License
MIT
Strengths
  • Single portable file, self-contained
  • Zero-copy mmap — huge models need no RAM copy
  • Works on CPU-only machines
  • Excellent macOS Metal integration
  • Active community, daily updates
  • Android via JNI (NDK toolchain)
Limitations
  • Not a production server at scale (use llama-server + load balancer)
  • No INT4 tensor core utilization on NVIDIA (dequants to FP16)
  • Mixed CPU+GPU path has memory bandwidth overhead
  • QNN/NPU path immature vs native QNN SDK
02

TensorRT-LLM

NVIDIA's production inference engine — the throughput king

TensorRT-LLM
NVIDIA · 2023–present · C++/Python · Apache 2.0

TensorRT-LLM is NVIDIA's open-source library that takes a HuggingFace model, compiles it into a highly optimized TensorRT engine, and runs it with batching strategies designed for LLM inference. It's what NVIDIA uses internally and what AWS, Azure, and GCP run for managed inference on NVIDIA hardware.

The Compilation Pipeline

HuggingFace checkpoint (fp16 / bf16)
  → trtllm-build (layer fusion, kernel selection)
  → TRT engine (.engine file, GPU-specific)
  → runtime serving (paged KV, continuous batching)

What Makes It Fast: Five Key Optimizations

1. Kernel Fusion. Standard PyTorch runs LayerNorm → Attention → Softmax → Projection as separate CUDA kernels, each requiring a round-trip to HBM. TRT-LLM fuses these into single custom CUDA kernels. The fusion alone gives 20–35% latency reduction on A100.

2. Paged Attention & Continuous Batching. Borrowed from vLLM — KV-cache is stored in fixed-size "pages" rather than contiguous blocks. This eliminates KV-cache fragmentation and allows different sequences at different stages of generation to share a GPU without waiting. Throughput improvement: 2–4× over naïve static batching.
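The page-table bookkeeping behind that claim can be made concrete with a toy allocator. Class and field names here are invented for illustration; a real implementation also manages the key/value tensors stored inside each page:

```python
class PagedKVCache:
    """Toy paged KV-cache: fixed-size pages handed out on demand,
    so no sequence needs a contiguous (and over-provisioned) region."""

    def __init__(self, num_pages: int, page_size: int):
        self.page_size = page_size
        self.free_pages = list(range(num_pages))
        self.page_table = {}   # seq_id -> list of page indices
        self.seq_len = {}      # seq_id -> tokens stored

    def append(self, seq_id: str) -> None:
        """Reserve a KV slot for one new token; grab a fresh page
        only when the current one is full."""
        n = self.seq_len.get(seq_id, 0)
        if n % self.page_size == 0:          # first token, or page boundary
            if not self.free_pages:
                raise MemoryError("out of KV pages")
            self.page_table.setdefault(seq_id, []).append(self.free_pages.pop())
        self.seq_len[seq_id] = n + 1

    def release(self, seq_id: str) -> None:
        """Sequence finished: its pages return to the pool immediately."""
        self.free_pages.extend(self.page_table.pop(seq_id, []))
        self.seq_len.pop(seq_id, None)
```

Because pages return to a shared pool the instant a sequence finishes, a new request can start without waiting for a contiguous block to free up — that is the fragmentation win.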

3. FP8 on H100/H200. H100 has native FP8 (E4M3/E5M2) tensor cores. TRT-LLM's FP8 workflow quantizes both weights AND activations to FP8, enabling true INT8-speed matrix multiply with FP16-level accuracy. The result: nearly 2× MFU (Model FLOPs Utilization) vs FP16 on H100.

# Build TRT-LLM engine for Llama-3-8B with INT4 AWQ weights
python convert_checkpoint.py \
  --model_dir meta-llama/Meta-Llama-3-8B \
  --output_dir ./llama3_awq \
  --dtype float16 \
  --use_weight_only \
  --weight_only_precision int4_awq

# GEMM runs in fp16 with int4 weights; paged KV cache enabled
trtllm-build \
  --checkpoint_dir ./llama3_awq \
  --output_dir ./llama3_trt_engine \
  --gemm_plugin float16 \
  --max_batch_size 32 \
  --max_input_len 4096 \
  --max_output_len 2048 \
  --use_paged_context_fmha enable

4. In-flight Batching. New requests join the batch mid-generation — no waiting for all sequences to finish. This is the key to high utilization at serving time. Combined with paged attention, TRT-LLM serves 3–5× more requests/sec than equivalent naive inference.

5. Speculative Decoding (Draft Tokens). A small "draft" model proposes N tokens ahead, and the main model verifies them in parallel in a single forward pass. When the draft is right, you get N tokens at the cost of 1.3 forward passes instead of N. Llama-3-70B + a 1B draft model achieves 2–2.5× decode speedup.
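Under a simplified model where each draft token is accepted independently with probability p, the expected tokens emitted per target-model forward pass has a closed form. This is a back-of-envelope sketch, not the exact acceptance scheme any runtime implements:

```python
def expected_tokens_per_pass(p: float, k: int) -> float:
    """E[tokens per verify pass] with draft length k and per-token
    acceptance probability p: 1 + p + p^2 + ... + p^k
    (the leading 1 is the token the target model emits regardless)."""
    return (1 - p ** (k + 1)) / (1 - p)

# e.g. a strong 1B draft for a 70B target: p ≈ 0.8, k = 4 draft tokens
print(round(expected_tokens_per_pass(0.8, 4), 2))  # → 3.36
```

So even before accounting for the (cheap) draft passes, a 0.8 acceptance rate with 4 draft tokens cuts target-model passes by more than 3× — consistent with the 2–2.5× end-to-end figure once draft cost is included.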

⚠️
The catch: TRT engines are GPU-specific and not portable. An engine compiled for A100 won't run on H100. Recompile per deployment target. Build times for large models can be 30–120 minutes.
Best for
NVIDIA GPU production serving
Quant support
INT8, INT4-AWQ, FP8, GPTQ
Throughput gain
3–5× vs naive PyTorch
Min GPU
A10G / RTX 4090 and above
Strengths
  • Highest throughput on NVIDIA hardware
  • Native FP8 on H100/H200
  • Continuous batching + paged KV built-in
  • Triton Inference Server integration
  • Speculative decoding support
Limitations
  • NVIDIA-only (no AMD, no edge)
  • Long compile times per model+GPU pair
  • Static engine — no dynamic shape flex
  • Steep learning curve vs vLLM or Ollama
03

Qualcomm QNN SDK

The NPU path — the only way to unlock Snapdragon's Hexagon DSP

Qualcomm Neural Network (QNN) SDK
Qualcomm · 2022–present · C++ API · Proprietary (free to use)

QNN SDK (successor to SNPE, the Snapdragon Neural Processing Engine; also branded Qualcomm AI Engine Direct) is Qualcomm's inference framework for deploying neural networks on Snapdragon SoCs. It's the only way to run inference on the Hexagon NPU — the dedicated AI accelerator that provides 30–75 TOPS at a sub-1W power budget in Snapdragon 8 Gen 3 / X Elite chips.

Snapdragon AI Architecture: Three Compute Units

🔶
Hexagon NPU
INT8/INT16/FP16 matrix ops. Best TOPS/watt. For quantized model inference. QNN is the only path here.
🟣
Adreno GPU
FP16/FP32 compute via OpenCL or Vulkan. Higher latency than NPU but handles dynamic shapes better.
🔵
Kryo CPU
Fallback for unsupported ops. Slowest but most flexible. GGUF with llama.cpp runs here.

The QNN Workflow: From ONNX to Hexagon

PyTorch model (fp32 weights)
  → ONNX export (torch.onnx.export)
  → qnn-onnx-converter (.cpp graph)
  → qnn-model-lib-generator (.so library)
  → on-device inference (libQnnHtp.so backend)
# Step 1: Export to ONNX
torch.onnx.export(model, dummy_input, "model.onnx",
  opset_version=17, dynamic_axes={"input": {0: "batch"}})

# Step 2: Convert to QNN graph (on Linux dev machine)
# quant_config.json holds optional per-layer precision overrides;
# --act_bw 16 --weights_bw 8 selects the W8A16 config
qnn-onnx-converter \
  --input_network model.onnx \
  --output_path model_qnn.cpp \
  --input_dim "input" 1,1,768 \
  --quantization_overrides quant_config.json \
  --act_bw 16 --weights_bw 8

# Step 3: Compile .so for Snapdragon target
qnn-model-lib-generator \
  -m model_qnn.cpp -b model_qnn.bin \
  -t aarch64-android \
  -l libmodel_qnn.so

# Step 4: On-device inference (Android NDK / JNI)
# Load backend → create context → execute graph → read outputs

W4A16 vs W8A16 on QNN — the Tradeoff That Matters

For LLMs on Snapdragon, the choice between W4A16 and W8A16 deserves careful analysis:

W8A16 — Safer Default
Weights in INT8, activations in FP16. Hexagon HTP handles this natively. Lower accuracy loss (~0.2% on most tasks). Wider op support. The recommended starting point for accuracy-critical deployments such as medical/clinical models.
Safer · More ops supported
W4A16 — Aggressive Compression
Weights in INT4, activations in FP16. Fits larger models on device. ~1.5% accuracy loss. Not all ops support INT4 — some layers fall back to W8A16. Requires per-layer sensitivity analysis (AIMET).
Smaller · Needs per-layer tuning
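The "fits larger models" side of the tradeoff is just arithmetic. A rough weight-memory estimate, assuming one FP16 scale per quantization group — the group size and overhead model are simplifications, and real QNN layouts differ:

```python
def weight_gib(n_params: float, weight_bits: int,
               group_size: int = 32, scale_bits: int = 16) -> float:
    """Quantized weight bytes plus one scale per group, in GiB."""
    data = n_params * weight_bits / 8
    scales = n_params / group_size * scale_bits / 8
    return (data + scales) / 2**30

for bits in (8, 4):
    print(f"3B model, W{bits}A16: {weight_gib(3e9, bits):.2f} GiB")
```

On a phone with a few GiB of usable RAM, roughly halving the weight footprint is often the difference between a 3B model fitting on the NPU path or not — which is why W4A16 is worth the per-layer tuning effort.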
🔬
AIMET + QNN: Qualcomm's AI Model Efficiency Toolkit (AIMET) integrates directly with QNN for quantization-aware fine-tuning and per-layer sensitivity analysis. For on-device LLMs on Snapdragon 8 Gen 3-class hardware, this is the production-grade path.
Target SoC
Snapdragon 8 Gen 1–3, X Elite
NPU TOPS
26–75 TOPS (Gen 1–3)
Quant support
W4A16, W8A16, INT8, FP16
Power budget
<1W NPU inference
Strengths
  • Only path to Hexagon NPU
  • Best TOPS/watt on Snapdragon
  • AIMET integration for production tuning
  • Ships on Snapdragon-based Samsung Galaxy flagships
  • Thermal efficiency vs GPU path
Limitations
  • Qualcomm-only, no cross-platform
  • Complex toolchain (converter → lib-gen → device)
  • Some LLM ops not NPU-supported → CPU fallback
  • Model recompilation per SoC generation
04

ONNX Runtime

The universal adapter — runs everywhere, optimized for nothing specific

ONNX Runtime (ORT)
Microsoft · Linux Foundation · 2019–present · MIT License

ONNX Runtime is the inference engine for the Open Neural Network Exchange format. It's the broadest coverage runtime — a single model can run on x86 CPU, ARM, NVIDIA GPU (via CUDA EP), AMD GPU (via ROCm EP), Apple ANE (via Core ML EP), Qualcomm (via QNN EP), and Intel (via OpenVINO EP). The tradeoff: "Execution Providers" (EPs) are opt-in and vary in maturity.

Execution Providers Architecture

# ORT dispatches ops to the best available EP.
# Unsupported ops fall back to the CPU EP automatically.
import onnxruntime as ort

session_options = ort.SessionOptions()
session_options.graph_optimization_level = (
    ort.GraphOptimizationLevel.ORT_ENABLE_ALL
)

session = ort.InferenceSession(
    "model.onnx",
    providers=[
        "QNNExecutionProvider",  # Qualcomm NPU — try first
        "CUDAExecutionProvider", # NVIDIA GPU fallback
        "CPUExecutionProvider",  # Always available
    ],
    sess_options=session_options,
)
# ORT automatically routes each op to the highest-priority EP
# that supports it. If QNN doesn't support op X → CPU handles it.
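That priority-ordered fallback can be modeled in a few lines. This is a pure-Python sketch of the dispatch rule, not ORT's actual graph-partitioning API — the EP names are real, but the op sets are illustrative:

```python
def plan_ops(ops, providers):
    """providers: [(ep_name, supported_ops or None)] in priority order;
    None means 'supports everything' (the CPU EP)."""
    plan = {}
    for op in ops:
        for ep, supported in providers:
            if supported is None or op in supported:
                plan[op] = ep
                break
    return plan

providers = [
    ("QNNExecutionProvider",  {"MatMul", "Add", "Softmax"}),
    ("CUDAExecutionProvider", {"MatMul", "Add", "Softmax", "LayerNorm"}),
    ("CPUExecutionProvider",  None),
]
plan = plan_ops(["MatMul", "LayerNorm", "CustomOp"], providers)
print(plan)
# MatMul lands on QNN, LayerNorm on CUDA, CustomOp falls back to CPU
```

The real partitioner works on subgraphs rather than single ops (crossing an EP boundary costs a data transfer), but the priority-with-fallback logic is the same.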

ORT + Generative AI Extension (ORT-GenAI)

For LLMs specifically, Microsoft ships onnxruntime-genai — a wrapper that adds KV-cache management, greedy/beam/top-p sampling, and tokenizer integration to ORT. It's the recommended path for running Phi-3, Mistral, and Llama variants on ORT.

# onnxruntime-genai — high-level LLM inference on ORT
import onnxruntime_genai as og

model = og.Model("phi-3-mini-int4-cpu-onnx")
tokenizer = og.Tokenizer(model)
params = og.GeneratorParams(model)
params.set_search_options(max_length=512, temperature=0.7)

input_tokens = tokenizer.encode("Explain quantization:")
params.input_ids = input_tokens

generator = og.Generator(model, params)
while not generator.is_done():
    generator.compute_logits()
    generator.generate_next_token()

output_tokens = generator.get_sequence(0)
print(tokenizer.decode(output_tokens))
Best for
Cross-platform, Windows on ARM
EP coverage
CPU, CUDA, QNN, CoreML, OpenVINO
LLM extension
onnxruntime-genai
License
MIT
Strengths
  • Single model file, runs everywhere
  • Native Windows on ARM (Snapdragon X Elite PCs)
  • Strong Microsoft backing + Phi-3 optimized
  • Automatic CPU fallback for unsupported ops
Limitations
  • No hardware-specific kernel optimizations
  • QNN EP less mature than native QNN SDK
  • ONNX export has dynamic shape limitations
  • LLM throughput below TRT-LLM on NVIDIA
05

Core ML · ExecuTorch · OpenVINO

Apple, Meta, and Intel's answers to the same problem

Core ML
Apple · macOS / iOS / iPadOS · Swift & Python API · Free (requires Apple hardware)

Core ML is Apple's inference framework and the only way to access the Apple Neural Engine (ANE) — the dedicated ML accelerator in all Apple Silicon chips (M1–M4, A14–A18). ANE runs INT8/FP16 ops at 11–38 TOPS depending on chip generation. For on-device LLMs on Apple hardware, Core ML + ANE is the path to the NPU — llama.cpp's Metal path runs on the GPU, not the ANE.

# Convert from PyTorch to Core ML (.mlpackage)
import torch
import coremltools as ct

model = torch.load("llm_backbone.pt")
traced = torch.jit.trace(model, example_inputs)

mlmodel = ct.convert(
    traced,
    inputs=[ct.TensorType(shape=(ct.RangeDim(1,1), ct.RangeDim(1,4096), 4096))],
    compute_units=ct.ComputeUnit.ALL,  # ANE + GPU + CPU
    minimum_deployment_target=ct.target.iOS17,
    compute_precision=ct.precision.FLOAT16,
)
mlmodel.save("model.mlpackage")
📌
Apple's LLM kit: The MLX framework (Apple's own) and mlx-lm have become the de-facto standard for running quantized LLMs on Mac. MLX handles Apple Silicon's unified memory (CPU and GPU share the same DRAM) efficiently and is faster than llama.cpp's Metal path for most models on M-series chips.
ExecuTorch
Meta · 2023–present · C++ · BSD License

ExecuTorch is Meta's lightweight runtime for deploying PyTorch models at the edge — targeting iOS, Android, and microcontrollers. It uses torch.export() (the new PyTorch 2.x export path) to generate a portable, ahead-of-time compiled graph. The core runtime is ~50KB — smaller than ONNX Runtime's multi-MB footprint.

# ExecuTorch export pipeline
import torch
from executorch.exir import to_edge
from executorch.backends.xnnpack.partition import XnnpackPartitioner

exported = torch.export.export(model, (example_input,))
edge_prog = to_edge(exported)

# Delegate compute-intensive ops to XNNPACK (ARM NEON / AVX2)
lowered = edge_prog.to_backend(XnnpackPartitioner())
exe_prog = lowered.to_executorch()

with open("model.pte", "wb") as f:
    f.write(exe_prog.buffer)

ExecuTorch has backends for XNNPACK (optimized CPU kernels for ARM/x86), Vulkan (Android GPU), Core ML (iOS ANE), and QNN (Snapdragon NPU). It's how Meta deploys LLaMA 3.2 1B/3B on-device in Meta AI on WhatsApp/Instagram.

OpenVINO
Intel · 2018–present · C++/Python · Apache 2.0

OpenVINO is Intel's inference toolkit optimized for Intel CPUs (especially 4th-gen Xeon with AMX — Advanced Matrix Extensions), Intel GPUs (Arc series), and Intel NPUs (Meteor Lake+). For server-side inference on Intel Xeon, OpenVINO with INT8 quantization achieves 3–5× throughput vs PyTorch FP32 and rivals ONNX Runtime.

# OpenVINO with INT8 quantization via NNCF
import openvino as ov
import nncf

# Quantize to INT8 with calibration data
quantized = nncf.quantize(
    model, calibration_dataset,
    preset=nncf.QuantizationPreset.MIXED,  # INT8 weights and activations
    subset_size=128,
)
ov_model = ov.convert_model(quantized)
ov.save_model(ov_model, "model_int8.xml")

# Deploy
core = ov.Core()
compiled = core.compile_model("model_int8.xml", "CPU")
06

Relative Performance Landscape

Throughput normalized to baseline — within-hardware comparison

⚠️
Caveat: These are representative relative figures for Llama-3-8B inference, not absolute benchmarks. Actual numbers vary by model, batch size, precision, and hardware generation. Use as directional guidance only.

NVIDIA A100 — Throughput (tokens/sec, batch=8)

PyTorch FP16     ~420 t/s
vLLM FP16        ~780 t/s
TRT-LLM INT8    ~1050 t/s
TRT-LLM INT4    ~1260 t/s
TRT-LLM FP8*    ~1400 t/s

* FP8 requires H100/H200 — shown here as reference. A100 max is TRT-LLM INT4.
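Normalizing the A100 figures against the PyTorch baseline makes the claimed 3–5× range easy to see. The numbers are copied from the chart above, so they inherit its caveats:

```python
baseline = 420  # PyTorch FP16, tokens/sec
figures = {"vLLM FP16": 780, "TRT-LLM INT8": 1050, "TRT-LLM INT4": 1260}

speedups = {name: round(tps / baseline, 2) for name, tps in figures.items()}
print(speedups)
# → {'vLLM FP16': 1.86, 'TRT-LLM INT8': 2.5, 'TRT-LLM INT4': 3.0}
```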

Snapdragon 8 Gen 3 — Latency (ms/token, batch=1)

CPU (llama.cpp)     ~180 ms
Adreno GPU (ORT)     ~60 ms
NPU W8A16 (QNN)      ~35 ms
NPU W4A16 (QNN)      ~22 ms

Values for a 1B–3B parameter model. Larger models require more layers on CPU fallback.

07

Side-by-Side Comparison

All seven runtimes across the dimensions that matter

| Runtime          | Hardware Target              | LLM Ready          | Best Precision  | Setup Effort           | Portability | License     |
|------------------|------------------------------|--------------------|-----------------|------------------------|-------------|-------------|
| GGUF / llama.cpp | CPU, Mac, GPU (CUDA/Metal)   | Native             | Q4_K_M / Q5_K_M | Low — single binary    | High        | MIT         |
| TensorRT-LLM     | NVIDIA A10 → H200            | Native             | INT4-AWQ / FP8  | High — compile per GPU | None        | Apache 2.0  |
| QNN SDK          | Snapdragon NPU               | Partial (evolving) | W4A16 / W8A16   | High — 3-step pipeline | QC-only     | Proprietary |
| ONNX Runtime     | CPU, CUDA, QNN EP, CoreML EP | via GenAI          | INT8 / INT4     | Medium                 | Very high   | MIT         |
| Core ML / MLX    | Apple ANE (M/A series)       | via MLX            | INT4 / FP16     | Medium                 | Apple-only  | MIT (MLX)   |
| ExecuTorch       | Android / iOS / embedded     | Llama 3.2          | INT4 / INT8     | Medium — torch.export  | High        | BSD         |
| OpenVINO         | Intel CPU / GPU / NPU        | Growing            | INT8 via NNCF   | Low–Medium             | Intel-only  | Apache 2.0  |
08

When to Choose What

The decision logic — hardware first, then use-case

Runtime Selection — Start Here
🔵 What hardware are you deploying to?
NVIDIA GPU → TensorRT-LLM (or vLLM for faster iteration, TRT for max throughput)
Snapdragon → QNN SDK (NPU) (ONNX Runtime QNN EP if ops unsupported)
Apple Silicon → MLX / Core ML (llama.cpp Metal if quick prototyping)
Intel CPU/GPU → OpenVINO (ONNX Runtime OpenVINO EP as alternative)
Android (multi-SoC) → ExecuTorch (or GGUF + llama.cpp if prototyping on device)
CPU / Unknown → GGUF + llama.cpp (or ONNX Runtime for broader op coverage)
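The branch logic above, encoded as a lookup — purely a restatement of the decision tree, with made-up key strings:

```python
RUNTIME_BY_TARGET = {
    "nvidia-gpu":    "TensorRT-LLM (or vLLM for faster iteration)",
    "snapdragon":    "QNN SDK (ONNX Runtime QNN EP if ops unsupported)",
    "apple-silicon": "MLX / Core ML (llama.cpp Metal for prototyping)",
    "intel":         "OpenVINO (ONNX Runtime OpenVINO EP as alternative)",
    "android-multi": "ExecuTorch (GGUF + llama.cpp for prototyping)",
}

def pick_runtime(hardware: str) -> str:
    # CPU-only or unknown hardware falls through to the portable default
    return RUNTIME_BY_TARGET.get(hardware, "GGUF + llama.cpp")

print(pick_runtime("snapdragon"))
print(pick_runtime("some-new-soc"))  # → GGUF + llama.cpp
```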

Scenario-Based Guidance

🏭 Cloud API / Production Serving
TensorRT-LLM + Triton. Compile once per GPU SKU. Use paged attention and continuous batching. Deploy behind NVIDIA Triton Inference Server. If you want to skip compilation: vLLM is 80% of TRT throughput with 10% of the setup complexity.
📱 Snapdragon On-Device (e.g. Samsung Galaxy)
QNN SDK → Hexagon NPU. Use AIMET for W4A16 quantization + per-layer sensitivity. Target Snapdragon 8 Gen 3. For models where QNN coverage is incomplete, fall back to GGUF on CPU for non-critical paths.
🍎 Mac-side Inference / Research
MLX framework. Unified memory means no CPU↔GPU copy overhead. 4-bit quantization via mlx-lm. For iOS app deployment, convert to Core ML .mlpackage. llama.cpp Metal is good for prototyping but MLX is measurably faster on M-series.
🔬 Research / Rapid Prototyping
GGUF + llama.cpp or Ollama. Single command, no conversion pipeline, runs on any hardware. Use this to validate the model behavior and quantization tradeoffs before committing to a hardware-specific runtime.
🌐 Cross-Platform (Windows/Android/iOS)
ONNX Runtime + GenAI extension. One ONNX model, runtime automatically targets best EP per device. Strong Windows on ARM support (Copilot+ PCs with Snapdragon X Elite). Microsoft ships Phi-3 Mini this way.
🤖 Meta's LLaMA on Mobile
ExecuTorch. torch.export() → XNNPACK (ARM NEON) + optional QNN/Core ML backends. This is how Meta ships Llama 3.2 1B/3B in Meta AI inside WhatsApp, Instagram, and Messenger. Use if your team is already deep in PyTorch 2.x.
🧭
The practical mental model: Think hardware-first, not model-first. The runtime is dictated by your deployment target. Then ask: does the runtime support your model's ops? If not, you need an intermediate format (ONNX) that routes ops to fallback EPs. Only after hardware and op coverage are settled does quantization format (GGUF vs QNN vs TRT engine) become the question.
09

The TL;DR

| If your target is… | Start with…                  | When you need more…                  |
|--------------------|------------------------------|--------------------------------------|
| NVIDIA cloud GPU   | vLLM (fast to set up)        | TensorRT-LLM (max throughput)        |
| Snapdragon Android | GGUF + llama.cpp (prototype) | QNN SDK → Hexagon NPU (production)   |
| Apple Mac / iPhone | llama.cpp Metal (quick)      | MLX / Core ML ANE (production)       |
| Intel CPU server   | ONNX Runtime CPU EP          | OpenVINO INT8 + AMX                  |
| Windows on ARM PC  | ONNX Runtime + GenAI         | QNN EP or DirectML EP                |
| iOS/Android apps   | ExecuTorch + XNNPACK         | + Core ML / QNN backend              |
| Unknown / Research | GGUF + Ollama                | Measure, then pick hardware-specific |