Distributed tracing

Distributed tracing records the path of a single request as it traverses multiple services. The result is a tree of spans, where each span is a unit of work in a single service, annotated with timings, parent-child relationships, and metadata. A trace lets engineers see exactly where time was spent in a slow request and where an error originated in a failed one.

The concept comes from Google's Dapper (2010 paper). With modern tooling (Jaeger, Tempo, Honeycomb, Datadog APM, the OpenTelemetry SDK), services are instrumented to emit spans, trace context is propagated across service boundaries (typically in a W3C traceparent header), and spans are shipped to a backend that reassembles them into traces. Tracing pays off most in distributed systems where the failure or latency mode isn't visible in any single service's logs.

The cost is real: instrumentation overhead, storage for traces, and sampling decisions. High-volume services typically sample rather than trace every request, either head-based (the decision is made when the request starts) or tail-based (the decision is made after the trace completes). Tracing pairs well with structured logs (linked via trace ID) and metrics (derived from spans).
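
As a rough illustration of context propagation and head-based sampling, the sketch below parses an incoming W3C traceparent header, creates a child span context that keeps the same trace ID, and formats the header to send downstream. It uses only the Python standard library and handles only the version 00 header layout. The names SpanContext, parse_traceparent, child_context, and format_traceparent are hypothetical helpers for this example, not part of any tracing SDK; a real service would normally rely on its SDK's propagator instead.

```python
import random
import re
import secrets
from dataclasses import dataclass
from typing import Optional

# Version 00 layout: 00-<32 hex trace id>-<16 hex parent span id>-<2 hex flags>
_TRACEPARENT_RE = re.compile(r"00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


@dataclass(frozen=True)
class SpanContext:
    """Hypothetical container for the fields carried in traceparent."""
    trace_id: str   # shared by every span in the trace
    span_id: str    # identifies this particular span
    sampled: bool   # whether the trace should be recorded


def parse_traceparent(header: str) -> Optional[SpanContext]:
    """Return the caller's span context, or None if the header is invalid."""
    match = _TRACEPARENT_RE.fullmatch(header.strip())
    if match is None:
        return None
    trace_id, span_id, flags = match.groups()
    # All-zero IDs are defined as invalid by the W3C Trace Context spec.
    if trace_id == "0" * 32 or span_id == "0" * 16:
        return None
    return SpanContext(trace_id, span_id, bool(int(flags, 16) & 0x01))


def child_context(parent: Optional[SpanContext],
                  sample_rate: float = 0.1) -> SpanContext:
    """Create the context for a new span in this service."""
    if parent is None:
        # No incoming context: start a new trace and make the head-based
        # sampling decision once, here at the root.
        trace_id = secrets.token_hex(16)
        sampled = random.random() < sample_rate
    else:
        # Incoming context: keep the trace ID and honor the caller's decision.
        trace_id = parent.trace_id
        sampled = parent.sampled
    return SpanContext(trace_id, secrets.token_hex(8), sampled)


def format_traceparent(ctx: SpanContext) -> str:
    """Build the header value to attach to outgoing requests."""
    flags = "01" if ctx.sampled else "00"
    return f"00-{ctx.trace_id}-{ctx.span_id}-{flags}"


if __name__ == "__main__":
    incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
    span = child_context(parse_traceparent(incoming))
    print(format_traceparent(span))
```

Because the sampling decision is made only at the root and carried in the flags field, every service along the request path either records its span or skips it consistently. Tail-based sampling cannot be expressed this way, since it needs the complete trace before deciding what to keep.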