KV-Cache Paging: How We Fit 3x More Concurrent Requests Per H100
PagedAttention is not just an optimization — it is a memory allocator for transformers. Here is what we learned shipping it to production.
GPU memory is the constraint that quietly dictates your inference economics. A single H100 holds 80GB of HBM. A 70B model in FP16 needs roughly 140GB for weights alone (70 billion parameters at 2 bytes each), so we shard it across GPUs. But the weights are only half the story. The KV-cache, which grows with both the number of concurrent requests and their context length, eats whatever memory is left.
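To make that budget concrete, here is a rough sketch of the per-token KV-cache cost. The model shape (80 layers, 8 KV heads, head dimension 128) is an assumption typical of 70B-class models with grouped-query attention, and the 20 GiB of free memory is an illustrative figure, not a number from our deployment.

```typescript
// Shape of the attention stack. These values are illustrative assumptions,
// not the configuration of the model discussed in this post.
interface ModelShape {
  layers: number;       // transformer layers
  kvHeads: number;      // key/value heads (fewer than query heads under GQA)
  headDim: number;      // dimension per head
  bytesPerElem: number; // 2 for FP16
}

// Each token stores one key vector and one value vector
// per layer, per KV head.
function kvBytesPerToken(m: ModelShape): number {
  return 2 * m.layers * m.kvHeads * m.headDim * m.bytesPerElem;
}

// How many tokens of KV-cache fit in the memory left after weights.
function kvTokenBudget(freeBytes: number, m: ModelShape): number {
  return Math.floor(freeBytes / kvBytesPerToken(m));
}

const shape: ModelShape = { layers: 80, kvHeads: 8, headDim: 128, bytesPerElem: 2 };
const perToken = kvBytesPerToken(shape); // 327,680 bytes (320 KiB) per token

// Assumed: 20 GiB left on this GPU after its shard of the weights.
const tokenBudget = kvTokenBudget(20 * 1024 * 1024 * 1024, shape); // 65,536 tokens
```

Under these assumptions, the whole GPU has room for about 65k cached tokens, shared across every in-flight request. That is why how those tokens are allocated matters so much.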
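PagedAttention treats the KV-cache the way an operating system treats virtual memory: fixed-size pages, a page table per sequence, and a free list. Sequences allocate pages as they grow and release them on completion. Because pages are fixed-size and need not be contiguous, external fragmentation disappears, and the only waste is the unused tail of each sequence's last page.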
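The three pieces named above (pages, a per-sequence page table, a free list) can be sketched in a few lines. This is a toy bookkeeping model, not our production allocator: the class name, method names, and page size are all hypothetical, and it tracks page ids only, not the tensors themselves.

```typescript
// Minimal page allocator sketch. All names here are hypothetical.
class PagedKvAllocator {
  private pageSize: number;
  private freeList: number[] = [];                  // physical page ids not in use
  private pageTables = new Map<string, number[]>(); // sequence id -> physical page ids
  private lengths = new Map<string, number>();      // sequence id -> tokens stored

  constructor(numPages: number, pageSize: number) {
    this.pageSize = pageSize;
    for (let p = numPages - 1; p >= 0; p--) this.freeList.push(p);
  }

  // Reserve room for one more token. Returns false if the pool is exhausted.
  appendToken(seqId: string): boolean {
    const table = this.pageTables.get(seqId) ?? [];
    const len = this.lengths.get(seqId) ?? 0;
    if (len === table.length * this.pageSize) {
      // Every page this sequence owns is full: take one from the free list.
      const page = this.freeList.pop();
      if (page === undefined) return false;
      table.push(page);
    }
    this.pageTables.set(seqId, table);
    this.lengths.set(seqId, len + 1);
    return true;
  }

  // Translate a logical token index into (physical page, offset in page).
  locate(seqId: string, tokenIndex: number): { page: number; offset: number } | undefined {
    const table = this.pageTables.get(seqId);
    const len = this.lengths.get(seqId) ?? 0;
    if (table === undefined || tokenIndex < 0 || tokenIndex >= len) return undefined;
    const page = table[Math.floor(tokenIndex / this.pageSize)];
    if (page === undefined) return undefined;
    return { page: page, offset: tokenIndex % this.pageSize };
  }

  // Return all of a finished sequence's pages to the free list.
  release(seqId: string): void {
    const table = this.pageTables.get(seqId) ?? [];
    for (const page of table) this.freeList.push(page);
    this.pageTables.delete(seqId);
    this.lengths.delete(seqId);
  }

  freePages(): number {
    return this.freeList.length;
  }
}

const alloc = new PagedKvAllocator(4, 16);
for (let i = 0; i < 17; i++) alloc.appendToken("a"); // 17 tokens need 2 pages
for (let i = 0; i < 5; i++) alloc.appendToken("b");  // 5 tokens need 1 page
const where = alloc.locate("a", 16); // first slot of a's second page
alloc.release("a");                  // both of a's pages go back to the free list
```

Note that sequence "a" wastes 15 slots in its second page and "b" wastes 11 in its only page. That tail waste is bounded by one page per sequence, which is the whole point of keeping pages small.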
In practice, this lets us serve 3.2x more concurrent requests per GPU at the same context length, or 4x longer contexts at the same concurrency. Throughput scales roughly linearly with the number of available pages.
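A back-of-the-envelope comparison shows where a gain of this kind can come from. The sketch below assumes the baseline reserves the full maximum context for every request up front, while the paged scheme holds only the pages each request has actually filled. The pool size, context limit, and typical request length are made-up inputs chosen for illustration; they are not our measurements.

```typescript
// Requests that fit when every request reserves the maximum context up front.
function contiguousCapacity(poolTokens: number, maxContext: number): number {
  return Math.floor(poolTokens / maxContext);
}

// Requests that fit when each request holds only the pages it has filled.
function pagedCapacity(poolTokens: number, actualLen: number, pageSize: number): number {
  const pagesPerRequest = Math.ceil(actualLen / pageSize);
  const totalPages = Math.floor(poolTokens / pageSize);
  return Math.floor(totalPages / pagesPerRequest);
}

// Illustrative inputs: a 65,536-token pool, a 4,096-token context limit,
// and requests that actually use 1,200 tokens on average.
const poolTokens = 65536;
const reservedSlots = contiguousCapacity(poolTokens, 4096); // 16 requests
const pagedSlots = pagedCapacity(poolTokens, 1200, 16);     // 54 requests
```

The ratio depends entirely on how far real request lengths fall below the reserved maximum, so treat the output as a way to reason about your own traffic rather than as a prediction.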
Production deployment
We deployed this gradually behind an internal feature flag, mirroring 1% of traffic for 72 hours before promoting. The instrumentation surface is shipped as part of our open-source edge-trace crate.
// Pseudocode — the actual wiring lives in the repo
const router = createRouter({
  classify: classifier.predict,
  speculate: speculator.draft,
  verify: verifier.confirm,
  windowSize: 8,
});
export default router.handle;
What we got wrong
Our first iteration over-trusted the speculator on long sequences. The fix was a sliding acceptance threshold that decays with prefix length — obvious in hindsight, not obvious during the on-call that surfaced it.
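The exact formula is not spelled out here, so the following is only one possible reading of "a threshold that decays with prefix length": cap the number of draft tokens accepted per verification step, and halve that cap every so many tokens of prefix. The `halfLife` and `minAccept` parameters and both function names are hypothetical; `windowSize` matches the value in the router pseudocode.

```typescript
interface DecayConfig {
  windowSize: number; // draft tokens proposed per step
  halfLife: number;   // hypothetical: prefix length at which the cap halves
  minAccept: number;  // hypothetical: floor so speculation never stalls entirely
}

// Maximum draft tokens we are willing to accept at this prefix length.
function acceptBudget(prefixLen: number, cfg: DecayConfig): number {
  const decayed = cfg.windowSize * Math.pow(0.5, prefixLen / cfg.halfLife);
  return Math.max(cfg.minAccept, Math.floor(decayed));
}

// Count how many leading draft tokens to keep: stop at the first token the
// verifier rejected, or at the budget, whichever comes first.
function acceptDraft(verified: boolean[], prefixLen: number, cfg: DecayConfig): number {
  const budget = acceptBudget(prefixLen, cfg);
  let accepted = 0;
  while (accepted < verified.length && accepted < budget && verified[accepted] === true) {
    accepted++;
  }
  return accepted;
}

const decayCfg: DecayConfig = { windowSize: 8, halfLife: 2048, minAccept: 1 };
const shortPrefix = acceptBudget(0, decayCfg);   // 8: full window on a fresh sequence
const midPrefix = acceptBudget(2048, decayCfg);  // 4: halved after one half-life
const longPrefix = acceptBudget(8192, decayCfg); // 1: clamped to the floor
const kept = acceptDraft([true, true, true, true, true, false], 2048, decayCfg); // 4
```

An exponential schedule is only one choice. A linear ramp or a step function would express the same idea, and the right shape depends on how acceptance rates actually fall off in your traces.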
The bottleneck is rarely where you think it is. Measure first; optimize the thing that actually moves the bill.