Speculative Decoding at the Gateway: Saving 41% on Inference Spend
We rewired our model router to draft tokens with a 1.3B speculator before the 70B verifier confirms. The economics changed overnight — and so did our P99 latency.
Every long-form publication from the InceptsLab collective. Sorted by recency, indexed for depth.
A 600-line Rust crate, compiled to a 47 KB WebAssembly module, now handles authentication for every request across 280 edge locations. Here is the architecture.
PagedAttention is not just an optimization — it is a memory allocator for transformers. Here is what we learned shipping it to production.
We built a TypeScript-flavored DSL for prompts that statically verifies retrieved-context shape against expected JSON output. Bugs that used to ship to prod now fail CI.
RSC payloads are not HTML and they are not JSON. They are something stranger — and your CDN almost certainly handles them wrong by default.
OpenTelemetry was not designed for million-event-per-second token streams. We extended it with a sampling sidecar that preserves causal chains.