One Span Per Token: Tracing LLM Pipelines Without Drowning in Data
OpenTelemetry was not designed for million-event-per-second token streams. We extended it with a sampling sidecar that preserves causal chains.
Tracing an LLM pipeline at token granularity gives you perfect causal visibility and an unsustainable storage bill. A 1M-DAU app generating 200 tokens per response produces ~2.4 billion spans per day (that figure implies roughly 12 responses per user per day).
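For anyone who wants to plug in their own numbers, here is the back-of-envelope arithmetic as a small sketch. The function name is ours for illustration, and the 12 responses per user per day is an assumption: it is simply the value that makes 1M DAU and 200 tokens per response come out at ~2.4 billion spans.

```typescript
// Back-of-envelope span volume at one span per token.
// responsesPerUserPerDay = 12 is an ASSUMPTION: it is the value implied by
// 1M DAU, 200 tokens per response, and ~2.4 billion spans per day.
function spansPerDay(
  dailyActiveUsers: number,
  responsesPerUserPerDay: number,
  tokensPerResponse: number
): number {
  return dailyActiveUsers * responsesPerUserPerDay * tokensPerResponse;
}

const total = spansPerDay(1000000, 12, 200);
console.log(total);                     // 2400000000
console.log(Math.round(total / 86400)); // 27778 spans per second, averaged over a day
```

The per-second figure is a daily average; peak traffic will sit well above it.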
Our sidecar samples on causal chains rather than individual spans. If anything in a request chain is anomalous (latency outlier, error, retry), we keep the entire chain. Chains with no anomaly are sampled at 0.1%. Storage drops 99.4%; debug coverage stays at 100% for incidents.
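The keep-or-drop rule can be sketched in a few lines. This is a minimal illustration, not the sidecar's actual code: the `TokenSpan` shape, the option names, the fixed latency cutoff, and the hash-based baseline sampling are all assumptions, and it presumes the whole chain has been buffered before the decision is made.

```typescript
// Hypothetical span shape; a real OpenTelemetry span carries many more fields.
interface TokenSpan {
  traceId: string; // identifies the request chain
  durationMs: number;
  error: boolean;
  retry: boolean;
}

interface SamplerOptions {
  latencyOutlierMs: number; // assumed fixed cutoff; a real system might use a percentile
  baselineRate: number;     // fraction of healthy chains to keep, e.g. 0.001 for 0.1%
}

// FNV-1a hash mapped to [0, 1), so every span of a chain gets the same verdict.
function hashToUnit(id: string): number {
  let h = 0x811c9dc5;
  for (let i = 0; i < id.length; i++) {
    h ^= id.charCodeAt(i);
    h = Math.imul(h, 0x01000193);
  }
  return (h >>> 0) / 0x100000000;
}

function isAnomalous(span: TokenSpan, opts: SamplerOptions): boolean {
  return span.error || span.retry || span.durationMs > opts.latencyOutlierMs;
}

// Group spans by chain, then keep all spans of a chain or none of them.
function sampleChains(spans: TokenSpan[], opts: SamplerOptions): TokenSpan[] {
  const chains = new Map<string, TokenSpan[]>();
  for (const span of spans) {
    const chain = chains.get(span.traceId) ?? [];
    chain.push(span);
    chains.set(span.traceId, chain);
  }
  const kept: TokenSpan[] = [];
  chains.forEach((chain, traceId) => {
    const keep =
      chain.some((s) => isAnomalous(s, opts)) ||
      hashToUnit(traceId) < opts.baselineRate;
    if (keep) kept.push(...chain);
  });
  return kept;
}

// baselineRate is 0 here only to make the demo deterministic.
const kept = sampleChains(
  [
    { traceId: "a", durationMs: 12, error: false, retry: false },
    { traceId: "a", durationMs: 900, error: false, retry: false }, // latency outlier
    { traceId: "b", durationMs: 10, error: false, retry: false },
  ],
  { latencyOutlierMs: 500, baselineRate: 0 }
);
console.log(kept.length); // 2: both spans of chain "a"; healthy chain "b" is dropped
```

Hashing the trace ID instead of calling a random number generator keeps the baseline decision consistent for every span in a chain, so a sampled chain is never stored with holes in it.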
Production deployment
We rolled this out gradually behind an internal feature flag, mirroring 1% of traffic for 72 hours before promoting it. The instrumentation surface ships as part of our open-source edge-trace crate.
// Pseudocode: the actual wiring lives in the repo
const router = createRouter({
  classify: classifier.predict,
  speculate: speculator.draft,
  verify: verifier.confirm,
  windowSize: 8,
});
export default router.handle;

What we got wrong
Our first iteration over-trusted the speculator on long sequences. The fix was a sliding acceptance threshold that decays with prefix length: obvious in hindsight, not obvious during the on-call shift that surfaced it.
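One way to read that fix, as a rough sketch rather than our production code: treat the threshold as a tolerance for disagreement between the speculator's draft and the verifier, and shrink that tolerance as the prefix grows so that long sequences are checked more strictly. The exponential shape, the floor, and every name and constant below are assumptions made for illustration.

```typescript
// Hypothetical parameters: the decay shape and constants are illustrative only.
interface DecayOptions {
  initialTolerance: number; // tolerance at prefix length 0
  halfLife: number;         // prefix length (in tokens) at which the decaying part halves
  floor: number;            // tolerance never drops below this value
}

// Assumed exponential decay toward a floor.
function acceptanceTolerance(prefixLength: number, o: DecayOptions): number {
  const decay = Math.pow(0.5, prefixLength / o.halfLife);
  return o.floor + (o.initialTolerance - o.floor) * decay;
}

// Accept a drafted token only if its disagreement with the verifier is within tolerance.
function acceptDraft(disagreement: number, prefixLength: number, o: DecayOptions): boolean {
  return disagreement <= acceptanceTolerance(prefixLength, o);
}

const decayOpts: DecayOptions = { initialTolerance: 0.4, halfLife: 64, floor: 0.05 };
console.log(acceptDraft(0.2, 0, decayOpts));   // true: tolerance is 0.4 at the start
console.log(acceptDraft(0.2, 256, decayOpts)); // false: tolerance has decayed to about 0.07
```

The floor keeps the speculator useful on very long sequences; without it the tolerance would approach zero and nearly every draft would be rejected.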
The bottleneck is rarely where you think it is. Measure first; optimize the thing that actually moves the bill.