Spans and Traces
Multiple spans come together to form a trace, which represents the complete journey of a request through a distributed system. Each span's start and end timestamps place it on the request's timeline, and parent-child relationships between spans form a hierarchy that shows how operations are nested or depend on each other. Span links associate related spans across different traces, which is useful for batch processing or asynchronous operations where a single parent-child relationship does not capture how the operations are connected.
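As a rough illustration of how parent-child relationships turn a flat set of spans into a trace hierarchy, here is a minimal Go sketch. The `Span` struct, its field names, and the `BuildTree` and `Render` helpers are hypothetical and much simpler than the types a real SDK or backend uses; the sketch only assumes that each span records its own ID, its parent's ID, and its timestamps.

```go
package main

import (
	"sort"
	"strings"
	"time"
)

// Span is a simplified, hypothetical record; real SDK span types carry more fields.
type Span struct {
	TraceID       string
	SpanID        string
	ParentID      string // empty for the root span
	Name          string
	StartUnixNano int64
	EndUnixNano   int64
}

// BuildTree indexes spans by parent ID. It returns the root spans and a map
// from span ID to that span's children, ordered by start time.
// Spans whose parent is absent from the input are reachable only via the map.
func BuildTree(spans []Span) ([]Span, map[string][]Span) {
	var roots []Span
	children := make(map[string][]Span)
	for _, s := range spans {
		if s.ParentID == "" {
			roots = append(roots, s)
		} else {
			children[s.ParentID] = append(children[s.ParentID], s)
		}
	}
	for _, kids := range children {
		sort.Slice(kids, func(i, j int) bool {
			return kids[i].StartUnixNano < kids[j].StartUnixNano
		})
	}
	return roots, children
}

// Render returns an indented outline of the hierarchy below s,
// one line per span with its duration.
func Render(s Span, children map[string][]Span, depth int) string {
	var b strings.Builder
	b.WriteString(strings.Repeat("  ", depth))
	b.WriteString(s.Name)
	b.WriteString(" (")
	b.WriteString(time.Duration(s.EndUnixNano - s.StartUnixNano).String())
	b.WriteString(")\n")
	for _, c := range children[s.SpanID] {
		b.WriteString(Render(c, children, depth+1))
	}
	return b.String()
}
```

Calling `BuildTree` on the spans of one trace and then `Render` on the root produces an outline in which each child is indented under its parent, which is essentially what a trace viewer's waterfall shows.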
Origins in the Dapper Paper
Google's Dapper paper, published in 2010, introduced the concept of spans as we know them today. In Dapper, a span represents the basic unit of work and contains annotations and key-value pairs that describe the work being performed. The paper defined spans as having a unique identifier, parent span identifier, and timing information, establishing the foundation for modern distributed tracing systems.
Evolution of Distributed Tracing Systems
The distributed tracing landscape has evolved through several key systems and standards:
- Zipkin: Created by Twitter in 2012, inspired by Google's Dapper paper. It was the first widely adopted open-source implementation of distributed tracing.
- Jaeger: Developed at Uber and open-sourced in 2017, with an architecture designed to scale to large deployments.
- OpenTracing: Emerged in 2016 as a vendor-neutral API standard, allowing developers to instrument applications without vendor lock-in.
- OpenTelemetry: Created in 2019 by merging OpenTracing and OpenCensus, becoming the de facto standard for instrumentation and telemetry data collection.
OpenTelemetry has effectively become the successor to both OpenTracing and OpenCensus, providing a unified approach to observability that includes not just tracing, but also metrics and logs. While Zipkin and Jaeger continue to be popular trace visualization and storage backends, they now commonly integrate with OpenTelemetry for data collection.
Implementation of Spans in OpenTracing and OpenTelemetry
OpenTracing standardized the span concept across different tracing systems, and OpenTelemetry, which merged OpenTracing and OpenCensus, further refined the span specification. The terminology differs slightly between the two, but spans in both share these core characteristics:
- Operation name that describes the work being done
- Start and end timestamps
- Span context (trace ID and span ID), plus a reference to the parent span
- Attributes (key-value pairs; called tags in OpenTracing)
- Events (timestamped records; called logs in OpenTracing)
- Links to related spans (called references in OpenTracing)
Semantic Conventions for Span Attributes
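OpenTelemetry defines semantic conventions for span attributes to ensure consistency across different services and applications. Several attribute names were revised as the conventions stabilized, so older instrumentation may still emit the earlier names noted in parentheses. Here are common examples:
- HTTP Requests:
  - http.request.method (formerly http.method): "GET", "POST", "PUT"
  - http.response.status_code (formerly http.status_code): 200, 404, 500
  - url.full (formerly http.url): "https://api.example.com/users"
- Database Operations:
  - db.system.name (formerly db.system): "mysql", "postgresql", "mongodb"
  - db.query.text (formerly db.statement): "SELECT * FROM users"
  - db.operation.name (formerly db.operation): "SELECT", "INSERT", "UPDATE"
- RPC Calls:
  - rpc.system: "grpc", "jsonrpc"
  - rpc.service: "PaymentService"
  - rpc.method: "ProcessPayment"
- General Service Information (resource attributes, which describe the service emitting the spans rather than an individual span):
  - service.name: "payment-processor"
  - service.version: "1.0.0"
  - service.instance.id: "instance-abc123"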
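To show how conventions like these are applied in practice, the following Go sketch derives the RPC attributes listed above from a gRPC full method name. It assumes the usual gRPC form `/package.Service/Method`; the `GRPCAttributes` helper, the constant names, and the `payments` package name in the usage note are hypothetical, and real instrumentation libraries normally set these attributes for you.

```go
package main

import (
	"fmt"
	"strings"
)

// Attribute keys as listed in the RPC examples above.
const (
	AttrRPCSystem  = "rpc.system"
	AttrRPCService = "rpc.service"
	AttrRPCMethod  = "rpc.method"
)

// GRPCAttributes splits a gRPC full method name of the form
// "/package.Service/Method" into span attributes.
// This is an illustrative helper, not part of any OpenTelemetry package.
func GRPCAttributes(fullMethod string) (map[string]string, error) {
	trimmed := strings.TrimPrefix(fullMethod, "/")
	service, method, ok := strings.Cut(trimmed, "/")
	if !ok || service == "" || method == "" {
		return nil, fmt.Errorf("malformed gRPC method name %q", fullMethod)
	}
	return map[string]string{
		AttrRPCSystem:  "grpc",
		AttrRPCService: service,
		AttrRPCMethod:  method,
	}, nil
}
```

For example, `GRPCAttributes("/payments.PaymentService/ProcessPayment")` yields `rpc.system` set to `grpc`, `rpc.service` set to `payments.PaymentService`, and `rpc.method` set to `ProcessPayment`. Keeping the keys in constants avoids typos that would otherwise split one logical attribute across several spellings.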
Span Kinds
Spans are categorized into different kinds based on their role in the system. OpenTelemetry defines the following five:
- INTERNAL: Default span type representing internal operations within a service
- SERVER: Represents the handling of an incoming request on the server side
- CLIENT: Represents outgoing requests from a service to an external system
- PRODUCER: Indicates the sending of a message to a message broker or queue
- CONSUMER: Represents the processing of a message received from a message broker or queue
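The kinds above come in pairs across a process boundary: a CLIENT span in one service usually corresponds to a SERVER span in the service it calls, and a PRODUCER span to a CONSUMER span. The Go sketch below models that pairing with a hypothetical `SpanKind` type and `RemoteCounterpart` helper; it is only an illustration and does not reproduce the OpenTelemetry API's own type.

```go
package main

// SpanKind is a hypothetical enum mirroring the five kinds listed above.
type SpanKind int

const (
	SpanKindInternal SpanKind = iota // default kind
	SpanKindServer
	SpanKindClient
	SpanKindProducer
	SpanKindConsumer
)

// String returns the upper-case name used in the list above.
func (k SpanKind) String() string {
	switch k {
	case SpanKindServer:
		return "SERVER"
	case SpanKindClient:
		return "CLIENT"
	case SpanKindProducer:
		return "PRODUCER"
	case SpanKindConsumer:
		return "CONSUMER"
	default:
		return "INTERNAL"
	}
}

// RemoteCounterpart returns the kind typically found on the other side of a
// process boundary. INTERNAL spans have no counterpart, so ok is false.
func RemoteCounterpart(k SpanKind) (counterpart SpanKind, ok bool) {
	switch k {
	case SpanKindClient:
		return SpanKindServer, true
	case SpanKindServer:
		return SpanKindClient, true
	case SpanKindProducer:
		return SpanKindConsumer, true
	case SpanKindConsumer:
		return SpanKindProducer, true
	default:
		return SpanKindInternal, false
	}
}
```

A backend can use this kind of pairing to tell a service's own processing time apart from time spent waiting on a remote call.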
Context Propagation and Correlation
Context propagation is essential for maintaining trace context across service boundaries in distributed systems. Here's how it works:
- W3C Trace Context: This is the standard specification for propagating context across service boundaries, ensuring interoperability between different tracing systems. It defines two HTTP headers that pass trace context between services:
  - traceparent: Contains the format version, trace ID, parent span ID, and trace flags
  - tracestate: Allows vendors to add custom trace information
- Baggage API: This is a mechanism for carrying arbitrary key-value pairs alongside the trace context. Unlike trace context, baggage carries application-specific data, such as:
  - User IDs
  - Tenant information
  - Custom correlation identifiers
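To make the header format concrete: a traceparent value is four dash-separated, lowercase-hex fields (version, trace ID, parent span ID, trace flags), for example `00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01`. The Go sketch below parses such a value and builds the header for an outgoing call. The `TraceParent` type and its methods are hypothetical, the sketch accepts only version `00`, and in practice the OpenTelemetry SDK's propagators handle this for you.

```go
package main

import (
	"encoding/hex"
	"errors"
	"fmt"
	"strings"
)

// TraceParent holds the fields of a version-00 W3C traceparent header.
type TraceParent struct {
	TraceID  [16]byte
	ParentID [8]byte
	Flags    byte
}

// ErrMalformedTraceParent is returned for headers this sketch cannot parse.
var ErrMalformedTraceParent = errors.New("malformed traceparent header")

// ParseTraceParent parses "version-traceid-parentid-flags".
// Simplification: only version "00" is accepted.
func ParseTraceParent(header string) (TraceParent, error) {
	var tp TraceParent
	parts := strings.Split(header, "-")
	if len(parts) != 4 || parts[0] != "00" || len(parts[1]) != 32 ||
		len(parts[2]) != 16 || len(parts[3]) != 2 {
		return tp, ErrMalformedTraceParent
	}
	if header != strings.ToLower(header) {
		return tp, ErrMalformedTraceParent // hex digits must be lowercase
	}
	var flags [1]byte
	if _, err := hex.Decode(tp.TraceID[:], []byte(parts[1])); err != nil {
		return tp, ErrMalformedTraceParent
	}
	if _, err := hex.Decode(tp.ParentID[:], []byte(parts[2])); err != nil {
		return tp, ErrMalformedTraceParent
	}
	if _, err := hex.Decode(flags[:], []byte(parts[3])); err != nil {
		return tp, ErrMalformedTraceParent
	}
	// All-zero IDs are invalid according to the specification.
	if tp.TraceID == ([16]byte{}) || tp.ParentID == ([8]byte{}) {
		return tp, ErrMalformedTraceParent
	}
	tp.Flags = flags[0]
	return tp, nil
}

// Sampled reports whether the sampled bit (the lowest flag bit) is set.
func (tp TraceParent) Sampled() bool { return tp.Flags&0x01 == 0x01 }

// ChildHeader builds the traceparent value for an outgoing request: the trace
// ID and flags are kept, and the caller's own span ID becomes the parent ID.
func (tp TraceParent) ChildHeader(spanID [8]byte) string {
	return fmt.Sprintf("00-%x-%x-%02x", tp.TraceID[:], spanID[:], tp.Flags)
}
```

A service would call `ParseTraceParent` on the incoming header, start its own span with a fresh span ID, and send `ChildHeader` with that ID on every downstream request, so all spans share one trace ID while each hop records its direct parent.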
Code Examples
Here's how to create and add attributes to spans in Go using OpenTelemetry:
```go
// spans.go
package tracing

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
)

func performOperation(ctx context.Context) {
	// Obtain a tracer from the globally registered TracerProvider.
	tracer := otel.Tracer("service-name")

	// Start a span; it becomes a child of any span already carried in ctx.
	ctx, span := tracer.Start(ctx, "operation-name")
	defer span.End()

	// Add attributes to the span
	span.SetAttributes(
		attribute.String("customer.id", "123"),
		attribute.Int64("items.count", 5),
		attribute.Float64("order.total", 99.99),
	)

	// Add events
	span.AddEvent("processing.started")
	// ... perform work, passing ctx to downstream calls so their spans nest under this one ...
	span.AddEvent("processing.completed")
}
```
And here's the equivalent example in Java:
```java
import io.opentelemetry.api.OpenTelemetry;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.context.Scope;

public class TracingExample {
    private final Tracer tracer;

    public TracingExample(OpenTelemetry openTelemetry) {
        // Obtain a tracer from the injected OpenTelemetry instance.
        this.tracer = openTelemetry.getTracer("service-name");
    }

    public void performOperation() {
        Span span = tracer.spanBuilder("operation-name").startSpan();
        // Make the span current so spans started inside the block become its children.
        try (Scope scope = span.makeCurrent()) {
            // Add attributes
            span.setAttribute("customer.id", "123");
            span.setAttribute("items.count", 5);
            span.setAttribute("order.total", 99.99);

            // Add events
            span.addEvent("processing.started");
            // ... perform work ...
            span.addEvent("processing.completed");
        } finally {
            span.end();
        }
    }
}
```
Best Practices for Span Usage
When working with spans, consider these important practices:
- Keep span names concise but descriptive
- Add relevant attributes that aid in troubleshooting
- Maintain proper parent-child relationships between spans
- End spans as soon as the operation completes
- Use span events to mark significant points in the operation's lifecycle
Understanding spans is crucial for implementing effective distributed tracing. Whether using open-source frameworks or commercial solutions, the fundamental concept of spans provides the foundation for tracking and understanding the behavior of distributed systems.