Speculative Decoding at the Gateway: Saving 41% on Inference Spend
We rewired our model router to draft tokens with a 1.3B speculator before the 70B verifier confirms. The economics changed overnight — and so did our P99 latency.
Long-form research, applied AI tooling, and open-source middleware from a collective of staff engineers shipping at the frontier. Published when there is something worth publishing.
A 600-line Rust crate, compiled to a 47KB WebAssembly module, now handles authentication for every request across 280 edge locations. Here is the architecture.
PagedAttention is not just an optimization; it is effectively a memory allocator for the transformer KV cache. Here is what we learned shipping it to production.
We built a TypeScript-flavored DSL for prompts that statically verifies retrieved-context shape against expected JSON output. Bugs that used to ship to prod now fail CI.
RSC payloads are not HTML and they are not JSON. They are something stranger — and your CDN almost certainly handles them wrong by default.
OpenTelemetry was not designed for million-event-per-second token streams. We extended it with a sampling sidecar that preserves causal chains.