Skip to content

Single-GPU LLM serving — tuning for extraction throughput

This guide consolidates what actually moves throughput when serving a single mid-size GPU (e.g. an L4-class card, or our GB10 host) for the fact-extraction workload (KG-2.64) and other JSON-structured generation. The findings are distilled from an empirical benchmark sweep on an L4 serving a 35B-A3B MoE model; they generalize to any bandwidth-bound single-stream decode.

Why this matters here. Fact extraction is a long structured-JSON generation on one contended GPU slot (KG-2.65 schedules access to it). The knobs below are the difference between ~56 and ~76 tok/s with no quality loss — a 34% wall-clock win on every extraction job.

The one-line summary

Low-bit k-quant + speculative decoding are synergistic, not independent. On a bandwidth-bound single stream, the win comes from doing fewer, cheaper memory passes per accepted token — quantization shrinks the pass, speculation amortizes it. Neither alone reaches the ceiling.

What works (apply these)

Lever Setting Effect Why
Quantization Q3_K_XL (~3.5 bpw) over Q4_K_XL +34% decode tok/s, zero quality loss Decode is memory-bandwidth bound; smaller weights = faster passes.
Speculative decoding MTP / draft-mtp, n-max 3, p-min 0.1 +39% at Q3 (only +13% at Q4) Drafts cheap tokens, verifies in one pass; the win scales with how cheap the pass is (so it compounds with low-bit).
Flash attention on, uniform KV precision baseline-critical Mixing KV precisions disables flash-attn → −57%. Keep KV uniform.
Parallelism --parallel 1 best single-job throughput BS=1 MoE decode already saturates bandwidth; extra streams split it (−10%).
KV-cache reuse reuse across rounds on the same doc prefix faster rounds 2+ The prompt+document prefix is identical across our multi-round recall (KG-2.64); cache it.

What does NOT help (don't bother)

  • KV-cache quantization — KV is tiny next to weights; no measurable effect, and mixed precision kills flash-attn.
  • Smaller context window — same reason; the weights, not the KV, fill VRAM.
  • More draft tokens (n-max 4/5) — acceptance falls, verify cost rises (−2% / −7%).
  • Aggressive draft gating (p-min 0.5) — drops baseline facts for a −2% net.

The quality floor (do not cross)

  • ~3.5 bpw (k-quant) is the floor. Below it, completeness and groundedness collapse: IQ3_XXS (3.1 bpw) was faster but lost ~32% of extracted facts; Q2_K_XL fabricated (groundedness 0.67 → 0.26). i-quants fail this workload.
  • Validate any new quant against extraction coverage (facts per doc) and groundedness (evidence_span actually substring-matches), not just tok/s.

Applying it to our stack

  • We serve via vLLM (vllm.arpa), not llama.cpp — the knowledge transfers, the flags differ. The equivalents: pick a ~Q4/AWQ-or-better quant that holds groundedness, enable speculative decoding where the engine supports it, keep one high-throughput stream per extraction job, and let KG-2.65 serialize slot access rather than running concurrent decode streams.
  • The fact extractor already sets the matching sampling profile (temperature 0.7, top_p 0.8, top_k 20, presence_penalty 1.5, enable_thinking=False) and a strict JSON-schema response format — see knowledge_graph/extraction/fact_extractor.py.
  • The single-GPU slot is a scheduled resource: submit extraction as a job (graph_ingest action=extract_submit) so preemption + backfill keep the GPU busy without oversubscription.