
Speculative Decoding at the Gateway: Saving 41% on Inference Spend

We rewired our model router to draft tokens with a 1.3B speculator before the 70B verifier confirms. The economics changed overnight — and so did our P99 latency.

Eda Voss
Staff Engineer · InceptsLab

Modern inference stacks treat the model as the bottleneck. That assumption is increasingly wrong. At our edge, the *router* — not the model — is where margin is won or lost.

Speculative decoding flips the standard generation loop. Rather than sampling tokens one at a time from the verifier, we draft a window of N candidate tokens from a much smaller speculator model (in our case, a tuned 1.3B-parameter Mistral variant) and have the 70B verifier score all N positions in a single forward pass.

The verifier accepts the longest prefix of the draft that agrees with its own predictions and discards the rest. When the speculator is well aligned with the verifier, which is the entire game, we collapse 8-12 forward passes into a single batched verification. Throughput climbs. Cost falls. Tail latency improves because we are no longer bound by per-token serialization.
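To make the accept-and-reject step concrete, here is a minimal sketch of one common acceptance rule, the greedy variant. The post does not say which rule our production verifier uses, so treat this as an assumption. The function name `acceptPrefix` and the token ids are hypothetical.

```typescript
type Token = number;

/**
 * Greedy acceptance rule (an assumed variant, for illustration).
 * draft:    k tokens proposed by the speculator.
 * verified: the verifier's own greedy choice at each of the k + 1 positions,
 *           all obtained from one batched forward pass over context + draft.
 * Returns the tokens to append to the output for this step.
 */
function acceptPrefix(draft: Token[], verified: Token[]): Token[] {
  if (verified.length !== draft.length + 1) {
    throw new Error("verifier must score every draft position plus one");
  }
  const accepted: Token[] = [];
  for (let i = 0; i < draft.length; i++) {
    if (draft[i] !== verified[i]) {
      // First disagreement: keep the verifier's token, drop the rest of the draft.
      accepted.push(verified[i]);
      return accepted;
    }
    accepted.push(draft[i]);
  }
  // Whole window accepted: the same pass also yields one extra verifier token.
  accepted.push(verified[draft.length]);
  return accepted;
}

// Hypothetical token ids: the draft diverges from the verifier at position 2.
const step = acceptPrefix([11, 42, 7, 99], [11, 42, 8, 3, 5]);
console.log(step); // [ 11, 42, 8 ]
```

Under this rule every verification pass emits at least one token, so a poorly aligned speculator wastes draft compute but never stalls generation.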

The hard part is alignment. A naively chosen speculator agrees with the verifier on roughly 35% of tokens. After distillation against verifier logits on a 4M-token in-domain corpus, we see 71-78% acceptance for code workloads and 62% for prose.
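A rough way to see why acceptance rate dominates the economics is to estimate how many tokens each verifier pass yields. The sketch below assumes each drafted token is accepted independently with probability `alpha`, and that the verifier always contributes one token of its own per pass. Both are simplifications: real acceptance is bursty, and the acceptance figures above may be measured differently. The function name is hypothetical.

```typescript
/**
 * Expected tokens emitted per verifier pass under the independence assumption.
 * With window size k: sum over i = 0..k of alpha^i
 *                   = (1 - alpha^(k + 1)) / (1 - alpha).
 */
function expectedTokensPerPass(alpha: number, windowSize: number): number {
  if (alpha < 0 || alpha > 1) {
    throw new RangeError("alpha must be in [0, 1]");
  }
  if (alpha === 1) {
    // Every draft token is accepted, plus the verifier's extra token.
    return windowSize + 1;
  }
  return (1 - Math.pow(alpha, windowSize + 1)) / (1 - alpha);
}

// Acceptance rates quoted in the text, evaluated at a window of 8.
for (const alpha of [0.35, 0.62, 0.71, 0.78]) {
  console.log(`alpha=${alpha}`, expectedTokensPerPass(alpha, 8).toFixed(2));
}
```

The curve is steep near the top of the range, which is why a few extra points of acceptance from distillation are worth more than a larger window.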

The router itself sits in front. It dispatches to one of four pipelines based on prompt class (code, structured, conversational, multilingual) and adjusts the speculator-verifier pair per route. The routing decision takes under a millisecond: it is a single fastText classifier that we keep hot in shared memory.
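The per-route pairing can be pictured as a lookup table keyed by prompt class. The sketch below is illustrative only: the model identifiers and the non-default window sizes are placeholders, and `toyClassify` is a keyword stand-in for the fastText classifier, not a description of it.

```typescript
type PromptClass = "code" | "structured" | "conversational" | "multilingual";

interface Pipeline {
  speculator: string; // placeholder identifier, not a real deployment name
  verifier: string;   // placeholder identifier
  windowSize: number; // draft tokens per verification pass (placeholder values)
}

const pipelines: Record<PromptClass, Pipeline> = {
  code:           { speculator: "spec-code",  verifier: "verifier-70b", windowSize: 8 },
  structured:     { speculator: "spec-json",  verifier: "verifier-70b", windowSize: 8 },
  conversational: { speculator: "spec-chat",  verifier: "verifier-70b", windowSize: 6 },
  multilingual:   { speculator: "spec-multi", verifier: "verifier-70b", windowSize: 4 },
};

// Stand-in classifier so the sketch runs on its own; a real deployment
// would call the trained classifier here instead.
const toyClassify = (prompt: string): PromptClass =>
  /\b(function|class|def|import)\b/.test(prompt) ? "code"
  : prompt.trim().startsWith("{") ? "structured"
  : /[^\x00-\x7F]/.test(prompt) ? "multilingual"
  : "conversational";

function route(prompt: string, classify: (p: string) => PromptClass): Pipeline {
  return pipelines[classify(prompt)];
}

console.log(route("def parse(line): return line.split(',')", toyClassify).speculator);
```

Keeping the table exhaustive over `PromptClass` means the compiler flags any new prompt class that lacks a pipeline.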

Net result over a 30-day rolling window: 41% reduction in GPU-hours billed, P50 latency down from 740ms to 410ms, P99 from 3.1s to 1.4s. No measurable quality regression on our internal eval suite.

Production deployment

We deployed this gradually behind an internal feature flag, mirroring 1% of traffic for 72 hours before promoting. The instrumentation surface is shipped as part of our open-source edge-trace crate.
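For readers who want to reproduce the 1% mirror, one simple approach is to bucket requests deterministically by a hash of the request id, so the same request always lands on the same side of the flag. This is a sketch under assumptions: the post does not describe how our flag selects traffic, and `fnv1a` and `shouldMirror` are hypothetical helpers.

```typescript
/** FNV-1a-style 32-bit hash over a string's UTF-16 code units. */
function fnv1a(s: string): number {
  let h = 0x811c9dc5; // FNV offset basis
  for (let i = 0; i < s.length; i++) {
    h ^= s.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0; // multiply by the FNV prime, keep 32 bits
  }
  return h >>> 0;
}

/**
 * Returns true if this request should be mirrored to the new pipeline.
 * Requests are bucketed into 10,000 slots, so `percent` can be as fine as 0.01.
 */
function shouldMirror(requestId: string, percent: number): boolean {
  return fnv1a(requestId) % 10000 < Math.round(percent * 100);
}

// Count how many synthetic request ids fall inside a 1% mirror.
let mirrored = 0;
const total = 100000;
for (let i = 0; i < total; i++) {
  if (shouldMirror(`req-${i}`, 1)) mirrored++;
}
console.log(`${mirrored} of ${total} synthetic requests mirrored`);
```

Because the decision depends only on the id, retries of the same request stay in the same cohort, which keeps before-and-after comparisons clean.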

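// Pseudocode: the actual wiring lives in the repo, which is also where
// createRouter, classifier, speculator, and verifier are defined.
const router = createRouter({
  classify: classifier.predict, // prompt class: code, structured, conversational, multilingual
  speculate: speculator.draft,  // small speculator drafts the candidate window
  verify: verifier.confirm,     // 70B verifier confirms the accepted prefix
  windowSize: 8,                // candidate tokens drafted per verification pass
});

export default router.handle;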

What we got wrong

Our first iteration over-trusted the speculator on long sequences. The fix was a sliding acceptance threshold that decays with prefix length — obvious in hindsight, not obvious during the on-call that surfaced it.
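The post does not spell out the form of that threshold, so the following is only one possible reading. It assumes a relaxed acceptance rule in which a drafted token may fall some "slack" below the verifier's top-choice probability, and that slack halves every `halfLife` tokens of prefix. The names `DecayConfig`, `slackAt`, and `acceptDraftToken`, and all numeric values, are hypothetical.

```typescript
interface DecayConfig {
  initialSlack: number; // allowed probability gap at prefix length 0 (hypothetical)
  halfLife: number;     // prefix length, in tokens, at which the slack halves (hypothetical)
}

/** Slack shrinks exponentially as the generated prefix grows. */
function slackAt(prefixLength: number, cfg: DecayConfig): number {
  return cfg.initialSlack * Math.pow(0.5, prefixLength / cfg.halfLife);
}

/**
 * Accept the drafted token if the verifier's probability for it is within
 * the current slack of the verifier's top-choice probability.
 */
function acceptDraftToken(
  pDraftToken: number,
  pTopToken: number,
  prefixLength: number,
  cfg: DecayConfig,
): boolean {
  return pDraftToken >= pTopToken - slackAt(prefixLength, cfg);
}

const cfg: DecayConfig = { initialSlack: 0.2, halfLife: 256 };

// The same token probabilities pass early in the sequence but fail later on.
console.log(acceptDraftToken(0.35, 0.5, 0, cfg));   // true
console.log(acceptDraftToken(0.35, 0.5, 256, cfg)); // false
```

As the prefix grows the slack tends to zero, so this rule converges on accepting only the verifier's own top choice, which is the conservative behavior long sequences need.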

The bottleneck is rarely where you think it is. Measure first; optimize the thing that actually moves the bill.