[OBSERVABILITY / OTEL] · 10 min read

One Span Per Token: Tracing LLM Pipelines Without Drowning in Data

OpenTelemetry was not designed for million-event-per-second token streams. We extended it with a sampling sidecar that preserves causal chains.

Yuki Tanaka
Staff Engineer · InceptsLab

Tracing an LLM pipeline at token granularity gives you perfect causal visibility and an unsustainable storage bill. With one span per token, a 1M-DAU app generating 200 tokens per response produces ~2.4 billion spans per day, which implies roughly 12 responses per user per day (1M × 12 × 200).
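The arithmetic behind that figure is easy to reproduce. The sketch below assumes one span per generated token and about 12 responses per user per day. The second number is an assumption, chosen because it is the value that makes 1M users at 200 tokens per response come out to 2.4 billion.

```typescript
// Back-of-envelope span volume, assuming one span per generated token.
// responsesPerUserPerDay is an assumed input, not a measured one.
function spansPerDay(
  dailyActiveUsers: number,
  responsesPerUserPerDay: number,
  tokensPerResponse: number,
): number {
  return dailyActiveUsers * responsesPerUserPerDay * tokensPerResponse;
}

const total = spansPerDay(1_000_000, 12, 200);
console.log(total); // 2400000000
```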

Our sidecar samples on causal chains rather than individual spans. If anything in a request chain is anomalous (latency outlier, error, retry), we keep the entire chain. Otherwise we keep a 0.1% baseline of chains. Storage drops 99.4%, and because every chain that contains an anomaly is retained in full, debug coverage stays at 100% for incidents.
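A minimal sketch of that chain-level decision might look like the following. The `Span` shape, the fixed latency threshold, and the hash-based baseline are all assumptions made for illustration; they are not the edge-trace API, and a real sidecar might define a latency outlier by percentile rather than by a constant.

```typescript
// Hypothetical span shape; field names are illustrative only.
interface Span {
  traceId: string;
  durationMs: number;
  error: boolean;
  retry: boolean;
}

interface SamplerConfig {
  latencyOutlierMs: number; // assumed fixed cutoff for a "latency outlier"
  baselineRate: number; // fraction of normal chains to keep, e.g. 0.001
}

// FNV-1a hash of the trace ID mapped to [0, 1). Hashing the ID makes the
// baseline decision deterministic, so a chain is never half-kept.
function hashToUnit(id: string): number {
  let h = 0x811c9dc5;
  for (let i = 0; i < id.length; i++) {
    h ^= id.charCodeAt(i);
    h = Math.imul(h, 0x01000193);
  }
  return (h >>> 0) / 4294967296;
}

function isAnomalous(span: Span, cfg: SamplerConfig): boolean {
  return span.error || span.retry || span.durationMs > cfg.latencyOutlierMs;
}

// Decide once per chain: keep everything if any span is anomalous,
// otherwise keep the chain at the baseline rate.
function keepChain(chain: Span[], cfg: SamplerConfig): boolean {
  if (chain.length === 0) return false;
  if (chain.some((s) => isAnomalous(s, cfg))) return true;
  return hashToUnit(chain[0].traceId) < cfg.baselineRate;
}
```

Because the verdict depends on every span in the chain, a sampler like this has to buffer spans until the chain completes before it can drop anything, which is why the decision lives in a sidecar rather than in the per-span hot path.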

Production deployment

We deployed this gradually behind an internal feature flag, mirroring 1% of traffic for 72 hours before promoting. The instrumentation surface is shipped as part of our open-source edge-trace crate.

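// Pseudocode: the actual wiring lives in the repo.
// createRouter, classifier, speculator, and verifier are assumed to be
// defined there; they are not declared in this snippet.
const router = createRouter({
  // Arrow wrappers keep `this` bound when the router calls each stage.
  classify: (...args) => classifier.predict(...args),
  speculate: (...args) => speculator.draft(...args),
  verify: (...args) => verifier.confirm(...args),
  windowSize: 8,
});

// Wrapped for the same reason: exporting the bare method would detach it from router.
export default (...args) => router.handle(...args);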
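The gradual rollout described above (flag-gated, 1% of traffic mirrored for 72 hours) can be sketched as a small gating function. Everything here is hypothetical: the `MirrorRollout` fields, the bucketing scheme, and the idea that mirroring stops automatically when the window ends are assumptions for illustration, not a description of the internal flag system.

```typescript
// Hypothetical rollout settings; names are illustrative only.
interface MirrorRollout {
  enabled: boolean; // the internal feature flag
  percent: number; // share of traffic to mirror, e.g. 1
  startedAtMs: number; // when mirroring began (epoch milliseconds)
  durationMs: number; // how long to mirror, e.g. 72 hours
}

// Stable bucket in [0, 100) derived from the request ID, so the same
// request is always either mirrored or not.
function bucketOf(requestId: string): number {
  let sum = 0;
  for (let i = 0; i < requestId.length; i++) {
    sum = (sum * 31 + requestId.charCodeAt(i)) % 10_000;
  }
  return sum / 100;
}

function shouldMirror(
  requestId: string,
  rollout: MirrorRollout,
  nowMs: number,
): boolean {
  if (!rollout.enabled) return false;
  const endsAtMs = rollout.startedAtMs + rollout.durationMs;
  const inWindow = nowMs >= rollout.startedAtMs && nowMs < endsAtMs;
  return inWindow && bucketOf(requestId) < rollout.percent;
}
```

Bucketing on the request ID rather than drawing a random number means a mirrored request can be found again later by ID, which helps when comparing mirrored output against production.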

What we got wrong

Our first iteration over-trusted the speculator on long sequences. The fix was a sliding acceptance threshold that decays with prefix length — obvious in hindsight, not obvious during the on-call that surfaced it.
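One way to read "a sliding acceptance threshold that decays with prefix length" is as a tolerance for disagreement between the draft and the verifier that shrinks as the sequence grows, so longer prefixes are held to a stricter standard. The sketch below takes that reading. The exponential schedule, the floor, and all parameter names are assumptions; the post does not give the actual schedule.

```typescript
// Hypothetical parameters for an assumed exponential decay schedule.
interface DecayConfig {
  initialTolerance: number; // tolerated disagreement at prefix length 0
  floor: number; // tolerance never drops below this
  halfLifeTokens: number; // prefix length over which tolerance halves
}

// Tolerance halves every halfLifeTokens tokens, down to the floor.
function toleranceAt(prefixLength: number, cfg: DecayConfig): number {
  const decayed =
    cfg.initialTolerance * Math.pow(0.5, prefixLength / cfg.halfLifeTokens);
  return Math.max(cfg.floor, decayed);
}

// disagreement is assumed to be a score in [0, 1], for example
// 1 minus the verifier's probability for the drafted token.
function acceptDraft(
  disagreement: number,
  prefixLength: number,
  cfg: DecayConfig,
): boolean {
  return disagreement <= toleranceAt(prefixLength, cfg);
}
```

With `initialTolerance` 0.4 and `halfLifeTokens` 256, a draft with disagreement 0.3 is accepted at the start of a sequence and rejected once the prefix reaches 256 tokens, which is the kind of tightening that would curb over-trust on long sequences.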

The bottleneck is rarely where you think it is. Measure first; optimize the thing that actually moves the bill.