Single-GPU LLM serving — tuning for extraction throughput¶
This guide consolidates what actually moves throughput when serving a single mid-size GPU (e.g. an L4-class card, or our GB10 host) for the fact-extraction workload (KG-2.64) and other JSON-structured generation. The findings are distilled from an empirical benchmark sweep on an L4 serving a 35B-A3B MoE model; they generalize to any bandwidth-bound single-stream decode.
Why this matters here. Fact extraction is a long structured-JSON generation on one contended GPU slot (KG-2.65 schedules access to it). The knobs below are the difference between ~56 and ~76 tok/s with no quality loss — a 34% wall-clock win on every extraction job.
The one-line summary¶
Low-bit k-quant + speculative decoding are synergistic, not independent. On a bandwidth-bound single stream, the win comes from doing fewer, cheaper memory passes per accepted token — quantization shrinks the pass, speculation amortizes it. Neither alone reaches the ceiling.
What works (apply these)¶
| Lever | Setting | Effect | Why |
|---|---|---|---|
| Quantization | Q3_K_XL (~3.5 bpw) over Q4_K_XL | +34% decode tok/s, zero quality loss | Decode is memory-bandwidth bound; smaller weights = faster passes. |
| Speculative decoding | MTP / draft-mtp, n-max 3, p-min 0.1 |
+39% at Q3 (only +13% at Q4) | Drafts cheap tokens, verifies in one pass; the win scales with how cheap the pass is (so it compounds with low-bit). |
| Flash attention | on, uniform KV precision | baseline-critical | Mixing KV precisions disables flash-attn → −57%. Keep KV uniform. |
| Parallelism | --parallel 1 |
best single-job throughput | BS=1 MoE decode already saturates bandwidth; extra streams split it (−10%). |
| KV-cache reuse | reuse across rounds on the same doc prefix | faster rounds 2+ | The prompt+document prefix is identical across our multi-round recall (KG-2.64); cache it. |
What does NOT help (don't bother)¶
- KV-cache quantization — KV is tiny next to weights; no measurable effect, and mixed precision kills flash-attn.
- Smaller context window — same reason; the weights, not the KV, fill VRAM.
- More draft tokens (
n-max 4/5) — acceptance falls, verify cost rises (−2% / −7%). - Aggressive draft gating (
p-min 0.5) — drops baseline facts for a −2% net.
The quality floor (do not cross)¶
- ~3.5 bpw (k-quant) is the floor. Below it, completeness and groundedness collapse: IQ3_XXS (3.1 bpw) was faster but lost ~32% of extracted facts; Q2_K_XL fabricated (groundedness 0.67 → 0.26). i-quants fail this workload.
- Validate any new quant against extraction coverage (facts per doc) and groundedness (evidence_span actually substring-matches), not just tok/s.
Applying it to our stack¶
- We serve via vLLM (
vllm.arpa), not llama.cpp — the knowledge transfers, the flags differ. The equivalents: pick a ~Q4/AWQ-or-better quant that holds groundedness, enable speculative decoding where the engine supports it, keep one high-throughput stream per extraction job, and let KG-2.65 serialize slot access rather than running concurrent decode streams. - The fact extractor already sets the matching sampling profile (temperature 0.7,
top_p 0.8, top_k 20, presence_penalty 1.5,
enable_thinking=False) and a strict JSON-schema response format — seeknowledge_graph/extraction/fact_extractor.py. - The single-GPU slot is a scheduled resource: submit extraction as a job
(
graph_ingest action=extract_submit) so preemption + backfill keep the GPU busy without oversubscription.