Distributed Tracing: Trace Context and OpenTelemetry
Master distributed tracing for microservices. Learn trace context propagation, OpenTelemetry instrumentation, and how to debug request flows across services.
Distributed Tracing: Trace Context, OpenTelemetry, and Correlation
When a request touches ten services before returning an error, traditional logging tells you each piece in isolation. Distributed tracing connects the pieces, showing you the complete journey of a request through your system.
This guide covers the fundamentals of tracing: trace context propagation, OpenTelemetry instrumentation, and practical correlation patterns.
If you are building microservices, you need distributed tracing.
Introduction
Logs tell you what happened in a single service. Metrics tell you aggregate patterns. Neither shows causality across service boundaries.
Consider a request that fails after touching five services. With only logs, you search each service’s logs for a shared request ID (if the services log one at all), then manually piece together the sequence. With tracing, you open a single view showing the entire timeline: service A started the request, called B, which called C, D, and E in sequence, and E returned an error that propagated back up.
This makes debugging actually tractable instead of a scavenger hunt.
Core Concepts
Traces and Spans
A trace represents an entire request journey. It contains one or more spans, where each span represents a single operation within that trace.
sequenceDiagram
participant C as Client
participant A as API Gateway
participant O as Order Service
participant P as Payment Service
participant N as Notification Service
C->>A: GET /orders/123
A->>O: GetOrder(123)
O->>P: ProcessPayment(order)
P-->>O: Payment confirmed
O->>N: SendConfirmation(order)
N-->>O: Notification sent
O-->>A: Order details
A-->>C: Response
Each span captures:
- Operation name
- Start and end time
- Parent span ID (linking)
- Attributes (key-value metadata)
- Events (timestamped points within the span)
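To make the parent link concrete, here is a minimal sketch of how a flat list of spans can be turned back into a tree. The `SpanRecord` and `SpanNode` shapes and the `buildTraceTree` helper are illustrative assumptions, not OpenTelemetry types.

```typescript
// Hypothetical, simplified span record; real SDK spans carry more fields.
interface SpanRecord {
  traceId: string;
  spanId: string;
  parentSpanId?: string; // absent on the root span
  name: string;
  startMs: number;
  endMs: number;
}

interface SpanNode {
  span: SpanRecord;
  children: SpanNode[];
}

// Link each span to its parent by ID; spans without a known parent become roots.
function buildTraceTree(spans: SpanRecord[]): SpanNode[] {
  const nodes: Record<string, SpanNode> = {};
  spans.forEach((span) => {
    nodes[span.spanId] = { span, children: [] };
  });

  const roots: SpanNode[] = [];
  spans.forEach((span) => {
    const node = nodes[span.spanId];
    const parent = span.parentSpanId ? nodes[span.parentSpanId] : undefined;
    if (parent) {
      parent.children.push(node);
    } else {
      roots.push(node); // root span, or an orphan whose parent never arrived
    }
  });
  return roots;
}

const exampleSpans: SpanRecord[] = [
  { traceId: "t1", spanId: "a", name: "GET /orders/123", startMs: 0, endMs: 120 },
  { traceId: "t1", spanId: "b", parentSpanId: "a", name: "GetOrder", startMs: 5, endMs: 110 },
  { traceId: "t1", spanId: "c", parentSpanId: "b", name: "ProcessPayment", startMs: 10, endMs: 80 },
];
const exampleTree = buildTraceTree(exampleSpans);
```

A span whose parent ID points at a span that was never received ends up as an extra root, which is how broken propagation typically shows up in a trace view.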
Trace Context
Trace context propagates across service boundaries through HTTP headers. When service A calls service B, it passes trace context in headers. Service B creates a child span using that context, ensuring the spans stay connected in a single trace.
The W3C Trace Context specification standardizes these headers:
traceparent: 00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01
tracestate: congo=t61rcWkgMzE
The traceparent header contains:
- Version (00)
- Trace ID (32 hex characters)
- Parent ID (16 hex characters)
- Flags (01 = sampled)
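The field layout above can be checked mechanically. The following is a minimal sketch of a parser for the four-field layout only; `parseTraceparent` and `formatTraceparent` are hypothetical helper names, and in practice the OpenTelemetry propagator does this work for you.

```typescript
// Parsed form of a W3C traceparent header.
interface TraceParent {
  version: string;
  traceId: string;
  parentId: string;
  sampled: boolean;
}

// Returns null for anything that does not match the four-field layout.
function parseTraceparent(header: string): TraceParent | null {
  const match = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/.exec(
    header.trim(),
  );
  if (!match) return null;
  const version = match[1];
  const traceId = match[2];
  const parentId = match[3];
  const flags = match[4];
  // The spec treats version ff and all-zero trace or parent IDs as invalid.
  if (version === "ff") return null;
  if (/^0+$/.test(traceId) || /^0+$/.test(parentId)) return null;
  // Bit 0 of the flags byte is the "sampled" flag.
  const sampled = (parseInt(flags, 16) & 1) === 1;
  return { version, traceId, parentId, sampled };
}

// Serializes back to header form; only the sampled flag is preserved.
function formatTraceparent(tp: TraceParent): string {
  return [tp.version, tp.traceId, tp.parentId, tp.sampled ? "01" : "00"].join("-");
}

const parsedExample = parseTraceparent(
  "00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01",
);
```

Rejecting malformed headers instead of guessing matters here: a service that accepts a bad trace ID will attach its spans to a trace that no backend can assemble.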
OpenTelemetry Architecture
OpenTelemetry (OTel) is the open standard for observability. It gives you APIs, SDKs, and instrumentation for collecting traces, metrics, and logs.
graph TB
subgraph "Application Code"
A[Your Service]
B[OTel SDK]
C[Language-specific auto-instrumentation]
end
subgraph "Exporters"
D[OTLP Exporter]
E[Jaeger Exporter]
F[Zipkin Exporter]
end
subgraph "Collecting Infrastructure"
G[OTel Collector]
H[Jaeger]
I[Zipkin]
end
A --> B
B --> C
B --> D
D --> G
G --> H
G --> I
OTel Collector
The OTel Collector receives, processes, and exports telemetry data. Think of it as middleware between your application and your observability backend.
# otel-collector-config.yaml
receivers:
otlp:
protocols:
grpc:
endpoint: 0.0.0.0:4317
http:
endpoint: 0.0.0.0:4318
processors:
batch:
timeout: 5s
send_batch_size: 1024
memory_limiter:
check_interval: 1s
limit_percentage: 90
exporters:
otlp:
endpoint: jaeger-collector:4317
tls:
insecure: false
cert_file: /certs/cert.pem
key_file: /certs/key.pem
prometheus:
endpoint: "0.0.0.0:8889"
service:
pipelines:
traces:
receivers: [otlp]
processors: [memory_limiter, batch]
exporters: [otlp]
metrics:
receivers: [otlp]
processors: [memory_limiter, batch]
exporters: [prometheus]
Manual Instrumentation
Auto-instrumentation covers many frameworks, but business-specific operations and custom spans still need manual instrumentation.
Starting Traces
import { trace, SpanStatusCode } from "@opentelemetry/api";
const tracer = trace.getTracer("order-service", "1.0.0");
async function createOrder(orderData: OrderData): Promise<Order> {
const span = tracer.startSpan("OrderService.createOrder", {
attributes: {
"order.customer_id": orderData.customerId,
"order.item_count": orderData.items.length,
"order.total": orderData.total,
},
});
try {
const order = await db.orders.create(orderData);
span.setStatus({ code: SpanStatusCode.OK });
return order;
} catch (error) {
span.recordException(error as Error);
span.setStatus({
code: SpanStatusCode.ERROR,
message: (error as Error).message,
});
throw error;
} finally {
span.end();
}
}
Creating Child Spans
Wrap nested operations as child spans:
import { context, trace, Span, SpanStatusCode } from "@opentelemetry/api";
async function processPayment(
  orderId: string,
  payment: Payment,
): Promise<PaymentResult> {
  // startSpan has no `parent` option. With no explicit context argument,
  // the new span becomes a child of whatever span is active in the current context.
  const paymentSpan = tracer.startSpan("PaymentService.process", {
    attributes: {
      "payment.method": payment.method,
      "payment.amount": payment.amount,
      "payment.currency": payment.currency,
    },
  });
  try {
    // Verify card
    await verifyCard(payment.card, paymentSpan);
    // Charge
    const result = await chargeCard(payment, paymentSpan);
    paymentSpan.setAttribute("payment.transaction_id", result.transactionId);
    paymentSpan.setStatus({ code: SpanStatusCode.OK });
    return result;
  } catch (error) {
    paymentSpan.recordException(error as Error);
    paymentSpan.setStatus({
      code: SpanStatusCode.ERROR,
      message: (error as Error).message,
    });
    throw error;
  } finally {
    paymentSpan.end();
  }
}
async function verifyCard(card: Card, parentSpan: Span): Promise<void> {
  // Parent the new span explicitly by passing a context that carries parentSpan
  const span = tracer.startSpan(
    "PaymentService.verifyCard",
    {
      attributes: {
        "card.type": card.type,
        "card.last_four": card.lastFour,
      },
    },
    trace.setSpan(context.active(), parentSpan),
  );
  try {
    // Verification logic
    await api.verifyCard(card);
  } finally {
    span.end(); // end the span even if verification throws
  }
}
Context Propagation
Proper context propagation connects spans across service boundaries. Without it, spans become orphaned and useless for debugging.
HTTP Propagation
Middleware that extracts incoming trace context and propagates it to downstream calls:
import { trace, context, propagation } from "@opentelemetry/api";
function httpMiddleware(req, res, next) {
// Extract context from incoming headers
const extractedContext = propagation.extract(context.active(), req.headers);
// Run the rest of the request handler within that context
context.with(extractedContext, () => {
// All spans created here are linked to the incoming trace
next();
});
}
// When making outgoing requests, inject context into headers
async function callDownstreamService(url: string, data: any): Promise<any> {
const headers = {};
propagation.inject(context.active(), headers);
return fetch(url, {
method: "POST",
headers: {
...headers,
"Content-Type": "application/json",
},
body: JSON.stringify(data),
});
}
Messaging Propagation
Propagate context through message queues so spans stay connected even with asynchronous processing:
import { trace, context, propagation, SpanStatusCode } from "@opentelemetry/api";
// Producer: inject context into message
async function sendOrderCreatedEvent(order: Order): Promise<void> {
const headers: Record<string, string> = {};
propagation.inject(context.active(), headers);
await kafka.send({
topic: "order.created",
messages: [
{
key: order.id,
value: JSON.stringify(order),
headers: headers,
},
],
});
}
// Consumer: extract context from message and create linked span
async function handleOrderCreated(message: KafkaMessage): Promise<void> {
  // Kafka clients such as kafkajs typically deliver header values as Buffers,
  // while the default propagation getter expects strings
  const carrier: Record<string, string> = {};
  for (const [key, value] of Object.entries(message.headers ?? {})) {
    if (value !== undefined) carrier[key] = value.toString();
  }
  const extractedContext = propagation.extract(context.active(), carrier);
  await context.with(extractedContext, async () => {
    const span = tracer.startSpan("OrderConsumer.handleOrderCreated");
    try {
      const order = JSON.parse(message.value?.toString() ?? "null");
      await processOrder(order);
      span.setStatus({ code: SpanStatusCode.OK });
    } catch (error) {
      span.recordException(error as Error);
      span.setStatus({ code: SpanStatusCode.ERROR });
      throw error;
    } finally {
      span.end();
    }
  });
}
Adding Business Context
Rich span attributes turn traces from timing diagrams into debugging tools.
Semantic Attributes
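Use standard attribute names for common data. The names below follow older releases of the OpenTelemetry semantic conventions; the stable HTTP conventions rename several of them (for example http.request.method, url.full, and http.response.status_code), so check which version your SDK and backend expect: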
// HTTP attributes
span.setAttribute("http.method", "POST");
span.setAttribute("http.url", "https://api.example.com/orders");
span.setAttribute("http.status_code", 201);
span.setAttribute("http.response_content_length", 1024);
// Database attributes
span.setAttribute("db.system", "postgresql");
span.setAttribute("db.name", "orders_db");
span.setAttribute("db.statement", "SELECT * FROM orders WHERE id = $1");
span.setAttribute("db.operation", "SELECT");
// Messaging attributes
span.setAttribute("messaging.system", "kafka");
span.setAttribute("messaging.destination", "order.created");
span.setAttribute("messaging.operation", "publish");
Custom Business Attributes
Add domain-specific context:
span.setAttribute("order.id", order.id);
span.setAttribute("order.status", order.status);
span.setAttribute("order.customer_tier", customer.tier);
span.setAttribute("order.is_first_purchase", customer.orderCount === 0);
These attributes let you filter traces by business properties: find all traces for premium customers, or analyze timing for first-time purchasers.
Correlation with Logs and Metrics
Traces work best when linked to your logs and metrics.
Trace-Log Correlation
Include trace ID in logs:
import { trace } from "@opentelemetry/api";
function logInfo(message: string, data?: Record<string, unknown>): void {
const span = trace.getActiveSpan();
const traceId = span?.spanContext().traceId;
const logEntry = {
timestamp: new Date().toISOString(),
level: "INFO",
message,
traceId,
...data,
};
console.log(JSON.stringify(logEntry));
}
// Now every log entry includes the trace ID
logInfo("Order created successfully", { orderId: "ord_123" });
// {"timestamp":"2026-03-22T14:30:00Z","level":"INFO","message":"Order created successfully","traceId":"abc123...","orderId":"ord_123"}
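Once every entry carries a trace ID, pulling the logs for one trace is a plain filter. The sketch below assumes newline-delimited JSON logs with ISO 8601 UTC timestamps, as produced by the logger above; `logsForTrace` is a hypothetical helper, and a log backend would normally run this as a query.

```typescript
interface LogEntry {
  timestamp: string;
  level: string;
  message: string;
  traceId?: string;
}

// Collect the entries for one trace from newline-delimited JSON logs, oldest first.
function logsForTrace(ndjson: string, traceId: string): LogEntry[] {
  const entries: LogEntry[] = [];
  ndjson.split("\n").forEach((line) => {
    if (line.trim() === "") return;
    try {
      const entry = JSON.parse(line) as LogEntry;
      if (entry.traceId === traceId) entries.push(entry);
    } catch {
      // Skip lines that are not valid JSON (for example, raw stack traces)
    }
  });
  // ISO 8601 UTC timestamps in the same format sort correctly as plain strings
  return entries.sort((a, b) =>
    a.timestamp < b.timestamp ? -1 : a.timestamp > b.timestamp ? 1 : 0,
  );
}
```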
Trace-Metric Correlation
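Record metrics with the same attribute names you put on spans, so a spike on a dashboard can be matched to the spans behind it:
import { metrics } from "@opentelemetry/api";
const meter = metrics.getMeter("payment-service");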
const paymentDuration = meter.createHistogram("payment.duration", {
unit: "ms",
description: "Payment processing duration",
});
async function processPayment(payment: Payment): Promise<void> {
const span = tracer.startSpan("PaymentService.process");
const startTime = Date.now();
try {
await doPayment(payment);
paymentDuration.record(Date.now() - startTime, {
"payment.method": payment.method,
"payment.status": "success",
});
} catch (error) {
paymentDuration.record(Date.now() - startTime, {
"payment.method": payment.method,
"payment.status": "failure",
});
throw error;
} finally {
span.end();
}
}
Sampling Strategies
At high traffic, you cannot capture every trace. Sampling reduces volume while preserving useful data.
Common Sampling Strategies
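Head-based sampling decides at trace start whether to capture, before the outcome of the request is known:
import { NodeTracerProvider } from "@opentelemetry/sdk-trace-node";
import {
  BatchSpanProcessor,
  ParentBasedSampler,
  TraceIdRatioBasedSampler,
} from "@opentelemetry/sdk-trace-base";
// Sample 1% of new traces; child spans follow their parent's decision.
// A head sampler runs before a span has a status, so it cannot "keep all errors";
// use tail-based sampling for that.
const sampler = new ParentBasedSampler({
  root: new TraceIdRatioBasedSampler(0.01),
});
// `exporter` is whichever span exporter you configured.
// Older 1.x SDKs register processors with provider.addSpanProcessor() instead.
const provider = new NodeTracerProvider({
  sampler,
  spanProcessors: [new BatchSpanProcessor(exporter)],
});
provider.register();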
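Ratio-based head sampling is usually derived from the trace ID itself, so every service reaches the same verdict without coordination. The sketch below illustrates that idea under the assumption that trace IDs are uniformly random hex strings; `shouldSampleByTraceId` is a hypothetical helper, and real SDK samplers may derive the number from the ID differently.

```typescript
// Deterministic ratio sampling: the same trace ID always gets the same decision.
function shouldSampleByTraceId(traceId: string, ratio: number): boolean {
  if (ratio >= 1) return true;
  if (ratio <= 0) return false;
  // Interpret the last 8 hex characters as an unsigned 32-bit number
  const bucket = parseInt(traceId.slice(-8), 16);
  // Keep the trace if it falls in the lowest `ratio` share of the 32-bit range
  return bucket < Math.floor(ratio * 0x100000000);
}
```

Because the decision depends only on the ID, a downstream service that re-evaluates it with the same ratio agrees with the upstream one.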
Tail-based sampling captures all spans temporarily, then decides what to keep:
# OTel Collector tail-based sampling
processors:
  tail_sampling:
    decision_wait: 10s
    num_traces: 100000
    policies:
      - name: errors
        type: status_code
        status_code: { status_codes: [ERROR] }
      - name: slow-traces
        type: latency
        latency: { threshold_ms: 1000 }
      - name: probabilistic
        type: probabilistic
        probabilistic: { sampling_percentage: 10 }
A trace is kept if any policy matches, so this captures errors, slow traces, and 10% of everything else. An always_sample policy is left out on purpose: it would match every trace and defeat the other policies.
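The decision logic itself is small once a whole trace has been buffered. This is a simplified sketch of the three policies above, not the Collector's implementation; the `CompletedSpan` and `TailPolicy` shapes and the `keepTrace` function are assumptions made for illustration.

```typescript
interface CompletedSpan {
  name: string;
  startMs: number;
  endMs: number;
  isError: boolean;
}

interface TailPolicy {
  latencyThresholdMs: number;
  samplingPercentage: number; // 0 to 100
}

// Decide after all spans of a trace have arrived. `random` is injectable for tests.
function keepTrace(
  spans: CompletedSpan[],
  policy: TailPolicy,
  random: () => number = Math.random,
): boolean {
  if (spans.length === 0) return false;
  // errors policy: any failed span keeps the whole trace
  if (spans.some((s) => s.isError)) return true;
  // slow-traces policy: compare end-to-end duration with the threshold
  const start = spans.reduce((min, s) => Math.min(min, s.startMs), Infinity);
  const end = spans.reduce((max, s) => Math.max(max, s.endMs), -Infinity);
  if (end - start > policy.latencyThresholdMs) return true;
  // probabilistic policy: keep a share of the remaining traces
  return random() * 100 < policy.samplingPercentage;
}
```

The cost of this approach is memory: every span has to be held until the decision is made, which is what `decision_wait` and `num_traces` bound.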
Cloud-Native Tracing Solutions
AWS X-Ray
AWS X-Ray integrates with services like API Gateway, Lambda, ECS, and EKS:
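import * as AWSXRay from "aws-xray-sdk";
import https from "https";
// Trace AWS SDK v3 calls made through this client
const tracedS3 = AWSXRay.captureAWSv3Client(s3Client);
// Trace outgoing requests made with Node's https module
AWSXRay.captureHTTPsGlobal(https);
// In Lambda with active tracing enabled, the service creates the segment;
// wrap your own work in a subsegment
export const handler = async (event, context) => {
  return AWSXRay.captureAsyncFunc("processOrder", async (subsegment) => {
    try {
      return await processOrder(event);
    } finally {
      subsegment?.close();
    }
  });
};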
X-Ray uses a daemon that buffers traces and sends them to the AWS backend. In ECS, run the X-Ray daemon as a sidecar container.
Google Cloud Trace
GCP Cloud Trace integrates with Cloud Run, GKE, and Compute Engine:
// Option 1: the Cloud Trace agent. Start it before other modules are loaded.
import * as traceAgent from "@google-cloud/trace-agent";
traceAgent.start({
  projectId: process.env.GCP_PROJECT_ID,
  keyFilename: "/path/to/service-account.json",
  logLevel: 1,
});
// Option 2: the OpenTelemetry SDK with the Cloud Trace exporter.
// Credentials are resolved from Application Default Credentials.
import { TraceExporter } from "@google-cloud/opentelemetry-cloud-trace-exporter";
const traceExporter = new TraceExporter({
  projectId: process.env.GCP_PROJECT_ID,
});
Azure Application Insights
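Application Insights also accepts OpenTelemetry data from server-side services. The snippet below uses its browser JavaScript SDK, which correlates page requests with backend traces:
import { ApplicationInsights } from "@microsoft/applicationinsights-web";
// Auto-instrument fetch and XHR (AJAX) calls
const appInsights = new ApplicationInsights({
  config: {
    instrumentationKey: process.env.AZURE_INSTRUMENTATION_KEY,
    enableCorsCorrelation: true, // send correlation headers on cross-origin requests
    autoTrackPageVisitTime: true,
  },
});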
appInsights.loadAppInsights();
appInsights.trackTrace({
message: "Distributed tracing initialized",
severityLevel: 1,
});
Multi-Cloud Trace Correlation
When running across cloud providers, maintain trace context using W3C headers. The traceparent header works across all providers:
function forwardToExternalService(
url: string,
headers: Headers,
): Promise<Response> {
// Always inject W3C trace context
const traceparent = headers.get("traceparent") || generateTraceparent();
return fetch(url, {
headers: {
...Object.fromEntries(headers),
traceparent: traceparent,
tracestate: `cloud=gcp,region=${process.env.REGION}`,
},
});
}
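The snippet above calls `generateTraceparent()` without defining it and overwrites any incoming `tracestate`. The following sketch shows one possible shape for both pieces; `randomHexId`, this `generateTraceparent`, and `withTracestateEntry` are hypothetical helpers, and values passed to `withTracestateEntry` are assumed to already satisfy the tracestate grammar.

```typescript
// Random lowercase hex string of the given length; regenerated if all zeros,
// since the spec treats all-zero IDs as invalid.
function randomHexId(length: number): string {
  let id = "";
  do {
    id = "";
    for (let i = 0; i < length; i++) {
      id += Math.floor(Math.random() * 16).toString(16);
    }
  } while (/^0+$/.test(id));
  return id;
}

// Hypothetical stand-in for the generateTraceparent() helper used above.
function generateTraceparent(sampled: boolean = true): string {
  return ["00", randomHexId(32), randomHexId(16), sampled ? "01" : "00"].join("-");
}

// Put our entry first and keep the other vendors' entries instead of overwriting them.
function withTracestateEntry(existing: string | null, key: string, value: string): string {
  const others = (existing || "")
    .split(",")
    .map((entry) => entry.trim())
    .filter((entry) => entry !== "" && entry.split("=")[0] !== key);
  return [key + "=" + value].concat(others).join(",");
}
```

For example, `withTracestateEntry("congo=t61rcWkgMzE", "cloud", "gcp")` yields `cloud=gcp,congo=t61rcWkgMzE`, so state added by an upstream vendor survives the hop.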
Visualization with Jaeger
Jaeger is a popular distributed tracing backend. It stores traces and provides a UI for exploration.
Key Jaeger Views
Search: Find traces by service, operation, time range, or tags.
Trace Detail: The flame graph showing all spans in a trace with timing information.
Span Detail: The attributes, events, and logs attached to a specific span.
Analyzing Trace Flame Graphs
A flame graph shows the parent-child span relationships:
order-service.createOrder (2.3s)
├── auth-service.validateToken (50ms)
├── inventory-service.checkStock (150ms)
│ └── external-partner.getAvailability (120ms)
├── payment-service.process (1.8s)
│ ├── fraud-check.analyze (400ms)
│ │ └── external-api.call (380ms)
│ └── payment-gateway.charge (1.2s)
│ └── external-bank.authorize (1.1s)
├── notification-service.send (100ms)
└── db.orders.insert (30ms)
Long spans are easy to spot. Here, payment-service.process dominates. Drilling in, payment-gateway.charge is the bottleneck. Further still, external-bank.authorize is where time is spent.
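The same drill-down can be automated by following the slowest child at each level. This is a rough sketch that assumes child spans run one after another; with parallel children, a true critical-path analysis needs start and end times. `TimedSpan`, `timed`, and `slowestPath` are illustrative names.

```typescript
interface TimedSpan {
  name: string;
  durationMs: number;
  children: TimedSpan[];
}

// Small constructor to keep the example tree readable.
function timed(name: string, durationMs: number, children: TimedSpan[] = []): TimedSpan {
  return { name, durationMs, children };
}

// Follow the slowest child at each level, mirroring the manual drill-down.
function slowestPath(root: TimedSpan): string[] {
  const path: string[] = [root.name];
  let current = root;
  while (current.children.length > 0) {
    current = current.children.reduce((slowest, child) =>
      child.durationMs > slowest.durationMs ? child : slowest,
    );
    path.push(current.name);
  }
  return path;
}

// The trace from the flame graph above
const orderTrace = timed("order-service.createOrder", 2300, [
  timed("auth-service.validateToken", 50),
  timed("inventory-service.checkStock", 150, [timed("external-partner.getAvailability", 120)]),
  timed("payment-service.process", 1800, [
    timed("fraud-check.analyze", 400, [timed("external-api.call", 380)]),
    timed("payment-gateway.charge", 1200, [timed("external-bank.authorize", 1100)]),
  ]),
  timed("notification-service.send", 100),
  timed("db.orders.insert", 30),
]);
const bottleneckPath = slowestPath(orderTrace);
```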
Service Mesh Tracing (Istio and Linkerd)
Service meshes add automatic tracing to all service-to-service communication without requiring code changes.
Istio Integration
Istio’s Envoy sidecar proxy automatically creates spans for the HTTP and gRPC traffic it handles:
# istio tracing config
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
metadata:
  name: tracing-config
spec:
  meshConfig:
    enableTracing: true
    defaultProviders:
      tracing:
        - otel # must match an extensionProviders name
    extensionProviders:
      - name: otel
        opentelemetry:
          service: otel-collector.observability.svc.cluster.local
          port: 4317
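Envoy extracts trace context from traceparent headers and creates spans for every request. The proxy cannot tell which outgoing call belongs to which incoming request, so your application still has to copy the trace headers from each incoming request onto the outgoing requests it makes.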
Linkerd Integration
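Linkerd’s ServiceProfile resource does not enable tracing by itself. It defines the routes, timeouts, and retry budget that give you per-route visibility, while span collection is configured separately:
# service-profile.yaml
apiVersion: linkerd.io/v1alpha2
kind: ServiceProfile
metadata:
  name: order-service.default.svc.cluster.local
  namespace: default
spec:
  routes:
    - name: GET /api/orders/{id}
      condition:
        method: GET
        pathRegex: /api/orders/[^/]*
      timeout: 5s
  retryBudget:
    retryRatio: 0.2
    minRetriesPerSecond: 10
    ttl: 10s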
Trade-offs: Mesh vs SDK Tracing
| Aspect | Service Mesh Tracing | SDK Manual Tracing |
|---|---|---|
| Setup effort | Minimal (config only) | Code changes required |
| Network span coverage | Automatic for all traffic | Only where you add it |
| Business context | Limited to mesh metadata | Full custom attributes |
| Performance impact | Sidecar overhead ~1-2ms | Minimal with sampling |
| Portability | Tied to mesh implementation | Portable across platforms |
Common Patterns
Database Query Tracing
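// Register the PostgreSQL instrumentation so pg queries produce spans
import { registerInstrumentations } from "@opentelemetry/instrumentation";
import { PgInstrumentation } from "@opentelemetry/instrumentation-pg";
registerInstrumentations({
  instrumentations: [
    new PgInstrumentation({
      enhancedDatabaseReporting: true,
      addSqlCommenterCommentToQueries: true,
    }),
  ],
});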
HTTP Client Tracing
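import { registerInstrumentations } from "@opentelemetry/instrumentation";
import { FetchInstrumentation } from "@opentelemetry/instrumentation-fetch";
registerInstrumentations({
  instrumentations: [
    new FetchInstrumentation({
      // Browser only: also send trace headers to these cross-origin URLs
      propagateTraceHeaderCorsUrls: [/^https:\/\/api\.example\.com\//],
    }),
  ],
});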
gRPC Tracing
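import { registerInstrumentations } from "@opentelemetry/instrumentation";
import { GrpcInstrumentation } from "@opentelemetry/instrumentation-grpc";
registerInstrumentations({
  instrumentations: [new GrpcInstrumentation()],
});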
When to Use Distributed Tracing
Use distributed tracing when:
- Understanding request flow through complex architectures
- Finding which service causes cascading failures
- Root cause analysis when errors propagate across boundaries
- Performance optimization by identifying bottlenecks
- Validating service dependencies and communication patterns
When Not to Use Distributed Tracing:
- Single monolithic applications (local debugging suffices)
- Low-traffic services where logs provide sufficient context
- When you only need aggregate metrics (use Prometheus)
- Very high-throughput paths where tracing overhead matters (use sampling)
- Systems without clear request boundaries (batch jobs)
Trade-off Analysis
| Aspect | Distributed Tracing | Traditional Logging | Metrics Only |
|---|---|---|---|
| Debugging Speed | Minutes (full context) | Hours (manual correlation) | N/A |
| Storage Cost | High (span data) | Medium (log volume) | Low |
| Overhead | ~1-5% latency | Minimal | Minimal |
| Root Cause | Full causal chain visible | Requires ID correlation | Aggregates only |
| Cardinality | High (many traces) | Medium | Low |
| Error Context | Full request path | Per-service only | None |
SLI/SLO/Error Budget Templates for Tracing
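Distributed tracing does not typically have traditional SLIs/SLOs since it is qualitative debugging tooling rather than quantitative reliability measurement. However, you can define SLOs around tracing coverage and health. The metric names in the templates below are illustrative; map them to what your SDK and Collector actually expose (the Collector, for example, reports counters such as otelcol_exporter_sent_spans and otelcol_exporter_send_failed_spans).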
Trace Health SLI Template
# tracing-sli-config.yaml
service: tracing-observability
environment: production
slis:
- name: trace_ingestion_success_rate
description: "Percentage of started traces successfully exported"
query: |
sum(rate(otel_exporter_sent_spans_total[5m]))
/
sum(rate(otel_span_started_total[5m]))
- name: trace_context_propagation_success
description: "Percentage of requests with valid propagated trace context"
query: |
sum(rate(otel_trace_context_propagated_total{status="success"}[5m]))
/
sum(rate(otel_trace_context_propagated_total[5m]))
- name: tail_sampling_efficiency
description: "Percentage of traces retained by tail sampling"
query: |
sum(rate(otel_tail_sampling_traces_sampled_total[5m]))
/
sum(rate(otel_tail_sampling_traces_evaluated_total[5m]))
- name: span_error_rate
description: "Percentage of spans with error status"
query: |
sum(rate(otel_span_status_code_total{code="ERROR"}[5m]))
/
sum(rate(otel_span_started_total[5m]))
Trace SLO Template
# tracing-slo-config.yaml
objectives:
- display_name: "Trace Ingestion Availability"
sli: trace_ingestion_success_rate
target: 99.5
window: 30d
description: "99.5% of started traces should be exported"
- display_name: "Context Propagation Success"
sli: trace_context_propagation_success
target: 99.9
window: 30d
description: "99.9% of requests should have valid trace context"
- display_name: "Tail Sampling Coverage"
sli: tail_sampling_efficiency
target: 95.0
window: 30d
description: "95% of sampled traces should match sampling policies"
Error Budget Calculator for Tracing
def calculate_tracing_budgets():
"""
Calculate error budgets for tracing SLOs (30-day window).
"""
window_minutes = 30 * 24 * 60
slos = {
"99.5% (Trace Ingestion)": window_minutes * 0.005,
"99.9% (Context Propagation)": window_minutes * 0.001,
"95.0% (Tail Sampling)": window_minutes * 0.050,
}
for slo, budget in slos.items():
print(f"{slo}: {budget:.1f} minutes allowed degradation")
print(f" = {budget / 60:.2f} hours")
print(f" = {budget / 60 / 24:.2f} days")
calculate_tracing_budgets()
Multi-Window Burn-Rate Alerting for Tracing
While tracing does not have traditional error budgets, you can apply burn-rate concepts to detect tracing infrastructure issues.
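Burn rate is the observed failure ratio divided by the failure ratio the SLO allows. The sketch below applies that definition to span counts; `burnRate` and `hoursToExhaustion` are hypothetical helper names, and the inputs would come from counters such as spans exported versus spans started.

```typescript
// Burn rate: how fast the error budget is being consumed relative to plan.
// 1 means the budget lasts exactly the SLO window; above 1 it runs out early.
function burnRate(good: number, total: number, sloTargetPercent: number): number {
  if (total <= 0) return 0;
  const errorRatio = 1 - good / total;
  const budgetRatio = 1 - sloTargetPercent / 100; // a 100% target leaves no budget
  return errorRatio / budgetRatio;
}

// Hours until the whole budget is spent if the burn rate stays constant.
function hoursToExhaustion(rate: number, windowDays: number): number {
  return rate > 0 ? (windowDays * 24) / rate : Infinity;
}
```

With a 99.5% ingestion target, exporting 950 of every 1,000 spans is a burn rate of about 10, which would use up a 30-day budget in roughly three days. That is why a short window with a high threshold is used for paging and a long window with a low threshold for warnings.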
Trace Coverage Burn-Rate Alert (1h Window)
# Tracing burn-rate alerts
groups:
- name: tracing-burn-rate
rules:
# Fast burn: Trace ingestion dropping significantly
- alert: TracingCoverageFastBurn
expr: |
(
sum(rate(otel_exporter_sent_spans_total[1h]))
/
sum(rate(otel_span_started_total[1h]))
)
< 0.95
for: 5m
labels:
severity: critical
category: tracing
window: 1h
annotations:
summary: "Trace coverage dropping fast (1h window)"
description: "Trace ingestion success rate is {{ $value | humanizePercentage }}. Investigate OTel collector health or exporter issues."
Trace Context Propagation Burn-Rate Alert (6h Window)
# Medium burn: Context propagation issues
- alert: TracingContextPropagationBurn
expr: |
(
sum(rate(otel_trace_context_propagated_total{status="failure"}[6h]))
/
sum(rate(otel_trace_context_propagated_total[6h]))
)
> 0.01
for: 15m
labels:
severity: warning
category: tracing
window: 6h
annotations:
summary: "Trace context propagation failures (6h window)"
description: "Context propagation failure rate is {{ $value | humanizePercentage }}. Check service mesh or HTTP middleware configuration."
Multi-Window Trace Health Alert Set
# Complete trace health burn-rate alert
- alert: TracingHealthBurnAllWindows
expr: |
(
sum(rate(otel_exporter_sent_spans_total[1h]))
/
sum(rate(otel_span_started_total[1h]))
)
< 0.95
or
(
sum(rate(otel_trace_context_propagated_total{status="failure"}[1h]))
/
sum(rate(otel_trace_context_propagated_total[1h]))
)
> 0.05
for: 5m
labels:
severity: critical
category: tracing
annotations:
summary: "Tracing health degraded across multiple indicators"
description: |
One or more tracing health metrics are burning fast.
Value that triggered the alert: {{ $value | humanizePercentage }}
Distributed tracing visibility is compromised.
Observability Hooks for Distributed Tracing
This section defines what to log, measure, trace, and alert for tracing systems themselves.
Log (What to Emit)
| Event | Fields | Level |
|---|---|---|
| Collector started | version, endpoint, exporters | INFO |
| Exporter failure | exporter_type, error, retry_count | WARN |
| Sampling decision | sampling_policy, trace_id, decision | DEBUG |
| Context propagation failure | service, direction, error | WARN |
| Span queue full | service, queue_size, drop_count | ERROR |
| Batch export success | exporter, spans_count, bytes | DEBUG |
Measure (Metrics to Collect)
| Metric | Type | Description |
|---|---|---|
| otel_span_started_total | Counter | Total spans started |
| otel_span_ended_total | Counter | Total spans ended |
| otel_exporter_sent_spans_total | Counter | Spans successfully exported |
| otel_exporter_failed_spans_total | Counter | Spans that failed to export |
| otel_trace_context_propagated_total | Counter | Context propagation attempts |
| otel_tail_sampling_traces_evaluated_total | Counter | Traces evaluated by tail sampler |
| otel_tail_sampling_traces_sampled_total | Counter | Traces retained by tail sampler |
| otel_span_queue_depth | Gauge | Pending spans in export queue |
| otel_collector_receive_latency_seconds | Histogram | Time to receive spans |
| otel_exporter_send_latency_seconds | Histogram | Time to send to backend |
Trace (Correlation Points)
| Operation | Trace Attribute | Purpose |
|---|---|---|
| Span started | tracing.otel.version | Track OTel SDK version |
| Sampling decision | tracing.sampling.decision | Monitor sampling efficiency |
| Export batch | tracing.export.batch_size | Track export efficiency |
| Context inject/extraction | tracing.context.direction | Monitor propagation health |
Alert (When to Page)
| Alert | Condition | Severity | Purpose |
|---|---|---|---|
| Trace Silence | No spans exported for 5 minutes | P1 Critical | Tracing pipeline down |
| Export Failure Rate | Export failures > 5% for 5 min | P1 Critical | Data loss imminent |
| Context Propagation Failure | Propagation failures > 1% | P2 High | Incomplete traces |
| Span Queue Critical | Queue > 90% capacity | P2 High | Risk of drops |
| Tail Sampling Bypass | Sampled < expected with high errors | P3 Medium | Sampling misconfigured |
| Collector Latency | Receive latency > 1s p95 | P3 Medium | Performance issue |
Tracing Observability Hook Template
# tracing-observability-hooks.yaml
groups:
- name: tracing-observability-hooks
rules:
# Alert on trace silence
- alert: TracingPipelineSilence
expr: sum(rate(otel_exporter_sent_spans_total[5m])) == 0 or absent(otel_exporter_sent_spans_total)
for: 5m
labels:
severity: critical
annotations:
summary: "No spans being exported (Alert on Silence)"
description: "OTel collectors are not exporting any spans. Tracing visibility is completely lost."
# Alert on high export failure rate
- alert: TracingExportFailuresHigh
expr: |
sum(rate(otel_exporter_failed_spans_total[5m]))
/
sum(rate(otel_span_started_total[5m])) > 0.05
for: 5m
labels:
severity: critical
annotations:
summary: "Trace export failure rate above 5%"
description: "{{ $value | humanizePercentage }} of spans are failing to export. Data loss is occurring."
# Alert on context propagation failures
- alert: TracingContextPropagationFailure
expr: |
sum(rate(otel_trace_context_propagated_total{status="failure"}[5m]))
/
sum(rate(otel_trace_context_propagated_total[5m])) > 0.01
for: 5m
labels:
severity: warning
annotations:
summary: "Trace context propagation failure rate above 1%"
description: "Context propagation is failing. Traces will be broken across service boundaries."
# Alert on span queue capacity
- alert: TracingSpanQueueCritical
expr: otel_span_queue_depth / otel_span_queue_limit > 0.9
for: 5m
labels:
severity: high
annotations:
summary: "Span export queue above 90% capacity"
description: "Span queue is filling up. Risk of memory exhaustion and trace drops."
# Alert on collector latency
- alert: TracingCollectorLatencyHigh
expr: |
histogram_quantile(0.95,
sum(rate(otel_collector_receive_latency_seconds_bucket[5m])) by (le)
) > 1
for: 10m
labels:
severity: warning
annotations:
summary: "OTel collector receive latency above 1 second"
description: "P95 collector latency is {{ $value }}s. Traces may be delayed in reaching the backend."
# SLO burn-rate for trace coverage
- alert: TracingCoverageBurnRateFast
expr: |
(
sum(rate(otel_exporter_sent_spans_total[1h]))
/
sum(rate(otel_span_started_total[1h]))
)
< 0.95
for: 5m
labels:
severity: critical
annotations:
summary: "Trace coverage burning at unsustainable rate"
description: "Trace coverage is at {{ $value | humanizePercentage }}. Investigate exporter health immediately."
Trace Storage and Retention Considerations
Choosing a trace storage backend and defining retention policies are critical decisions for production tracing systems. The storage layer affects query performance, operational costs, and your ability to debug issues after the fact.
Storage Operations
Storage Backend Options
Self-hosted options give you control over data and infrastructure:
| Backend | Best For | Limitations |
|---|---|---|
| Jaeger with Elasticsearch | Flexible querying, multi-tenant | Operational complexity |
| Jaeger with Cassandra | High write throughput | Limited query capabilities |
| Jaeger with Badger | Small-scale, simplicity | Not distributed |
| Zipkin with Elasticsearch | Basic needs | Fewer features than Jaeger |
Managed cloud options reduce operational overhead:
| Service | Advantages | Trade-offs |
|---|---|---|
| AWS X-Ray | Deep AWS integration | Vendor lock-in |
| GCP Cloud Trace | Auto-scaling, strong performance | GCP dependency |
| Azure Application Insights | Full APM features | Azure dependency |
| Honeycomb | Sophisticated queries | Cost at high volume |
| Datadog | Comprehensive platform | Expensive at scale |
Retention Planning
Trace data follows a lifecycle pattern:
# Tiered retention example
retention_tiers:
hot_storage:
duration: 7 days
sampling: 100% for errors, 10% for normal
compression: none
storage: fast SSD
warm_storage:
duration: 30 days
sampling: 100% errors, 1% normal
compression: lz4
storage: standard block storage
cold_storage:
duration: 1 year
sampling: errors only
compression: zstd
storage: object storage (S3, GCS)
Partitioning Strategies
High-volume trace stores require careful partitioning:
# Elasticsearch index per time window
indices:
pattern: "traces-{service}-{yyyy.MM.dd}"
rollovers:
- max_age: 7d
max_docs: 50 million
aliases:
write: "traces-write"
read: "traces-read"
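Time-based index names let a query skip indices that cannot contain matching data. This sketch assumes one index per service per UTC day, as the pattern above suggests; `indexDateSuffix` and `indicesForRange` are hypothetical helpers.

```typescript
// Format a date as yyyy.MM.dd in UTC to match the index pattern.
function indexDateSuffix(date: Date): string {
  const pad = (n: number): string => (n < 10 ? "0" + n : String(n));
  return (
    date.getUTCFullYear() + "." + pad(date.getUTCMonth() + 1) + "." + pad(date.getUTCDate())
  );
}

// List the daily indices that cover [start, end], so a query touches only those.
function indicesForRange(service: string, start: Date, end: Date): string[] {
  const names: string[] = [];
  const dayMs = 24 * 60 * 60 * 1000;
  // Walk day by day from the UTC midnight of the start day
  let cursor = Date.UTC(start.getUTCFullYear(), start.getUTCMonth(), start.getUTCDate());
  while (cursor <= end.getTime()) {
    names.push("traces-" + service + "-" + indexDateSuffix(new Date(cursor)));
    cursor += dayMs;
  }
  return names;
}
```

Passing the resulting list as the search target, rather than a `traces-{service}-*` wildcard, keeps the scan proportional to the time range instead of the retention period.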
Query Performance at Scale
As trace volume grows, query performance degrades without proper optimization:
// Optimize trace queries with date filtering
async function queryTraces(service: string, startTime: Date, endTime: Date) {
// Always filter by time range first - reduces scan scope
const query = {
index: `traces-${service}-*`,
body: {
query: {
bool: {
filter: [ // filter context skips relevance scoring and can be cached
{ range: { timestamp: { gte: startTime, lte: endTime } } },
{ term: { "service.name": service } },
],
},
},
sort: [{ timestamp: "desc" }],
size: 100, // Limit results
},
};
return elasticsearch.search(query);
}
Reliability
Data Lifecycle Management
Automate data lifecycle to prevent unbounded growth:
# OTel Collector with lifecycle management
exporters:
otlp/jaeger:
endpoint: jaeger:4317
retry_on_failure:
enabled: true
initial_interval: 5s
max_interval: 30s
max_elapsed_time: 5m
processors:
# Tag spans with expiration metadata
resource:
attributes:
- action: upsert
key: data_category
value: tracing
# Batch before export (compression is configured on the exporter)
batch:
timeout: 10s
send_batch_size: 8192
Backup and Recovery Considerations
Trace data recovery is often overlooked:
- Regular backups: Schedule Elasticsearch snapshots or managed service backups
- Point-in-time recovery: Test restoration procedures periodically
- Cross-region replication: Replicate critical trace data to secondary region
- RTO/RPO planning: Define acceptable downtime and data loss windows for tracing infrastructure
Cost Optimization Patterns
Trace storage costs scale with volume. Optimize with these approaches:
# Cost optimization configuration
processors:
  # Prune low-value attributes before storage
  transform:
    trace_statements:
      - context: span
        statements:
          - keep_keys(attributes, ["service.name", "operation.name", "error"])
  # Move shared attributes to the resource level so they are stored once per group
  groupbyattrs:
    keys: ["service.name", "operation.name", "http.status_code"]
  # Cap Collector memory so load spikes shed data instead of crashing it
  memory_limiter:
    check_interval: 1s
    limit_mib: 1000
    spike_limit_mib: 200
Multi-Tenant Trace Storage
When serving multiple customers from shared infrastructure:
# Multi-tenant storage isolation
tenants:
- name: customer-a
index_prefix: "traces-a"
retention_days: 30
quota:
storage_gb: 100
queries_per_minute: 60
- name: customer-b
index_prefix: "traces-b"
retention_days: 90
quota:
storage_gb: 500
queries_per_minute: 120
Implement tenant isolation at the query layer to prevent cross-tenant data leakage.
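One way to enforce this at the query layer is to derive the index pattern from the authenticated tenant and never from user input. The sketch below assumes the tenant list mirrors the config above; `TenantConfig`, `tenantIndexPattern`, and `assertIndexAllowed` are hypothetical names.

```typescript
interface TenantConfig {
  name: string;
  indexPrefix: string;
}

// Resolve the only index pattern a tenant may query; unknown tenants are rejected.
function tenantIndexPattern(tenants: TenantConfig[], tenantName: string): string {
  const matches = tenants.filter((t) => t.name === tenantName);
  if (matches.length !== 1) {
    throw new Error("Unknown or ambiguous tenant: " + tenantName);
  }
  return matches[0].indexPrefix + "-*";
}

// Reject a requested index unless it is a single concrete index under the tenant's prefix.
function assertIndexAllowed(
  tenants: TenantConfig[],
  tenantName: string,
  requestedIndex: string,
): void {
  // Keep the trailing "-" so "traces-a" cannot match "traces-ab-..."
  const prefix = tenantIndexPattern(tenants, tenantName).slice(0, -1);
  const hasWildcardOrList =
    requestedIndex.indexOf("*") !== -1 || requestedIndex.indexOf(",") !== -1;
  if (requestedIndex.indexOf(prefix) !== 0 || hasWildcardOrList) {
    throw new Error("Index not allowed for tenant " + tenantName);
  }
}
```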
Production Failure Scenarios
| Failure | Impact | Mitigation |
|---|---|---|
| Trace context not propagated | Incomplete traces; cannot follow requests | Implement propagation in all HTTP clients, message queues, databases |
| Sampling rate too low | Missing rare but important traces | Use tail-based sampling for errors and slow traces |
| Trace storage backend overwhelmed | Traces dropped; gaps in visibility | Implement adaptive sampling; scale storage; use compression |
| Missing span attributes | Cannot filter or group traces meaningfully | Define semantic conventions; require business context in spans |
| Clock skew between services | Invalid timing data; impossible to correlate | Use NTP synchronization; log trace generation time |
| OTel collector bottleneck | Spans queue up; memory pressure; drops | Scale collectors horizontally; add batching; monitor queue depth |
Common Pitfalls / Anti-Patterns
1. Creating Spans for Everything
Every span has overhead. Do not create spans for every loop iteration or minor function call:
// Bad: Spans for everything
async function processItems(items: Item[]) {
const span = tracer.startSpan("processItems");
for (const item of items) {
const itemSpan = tracer.startSpan("processItem"); // Too granular
await processItem(item);
itemSpan.end();
}
span.end();
}
// Good: Batch operations as single span
async function processItems(items: Item[]) {
const span = tracer.startSpan("processItems");
try {
span.setAttribute("items.count", items.length);
// Note: Promise.all runs the items concurrently
return await Promise.all(items.map((item) => processItem(item)));
} finally {
span.end(); // Always end the span, even if an item fails
}
}
2. Forgetting to End Spans
A span that is never ended is never exported, so the operation goes missing from the trace and its children show up without a parent:
// Bad: Span not ended on error path
async function riskyOperation() {
const span = tracer.startSpan('risky');
if (condition) {
throw new Error('condition failed');
}
span.end(); // May never execute
}
// Good: Use try/finally
async function riskyOperation() {
const span = tracer.startSpan('risky');
try {
// Work
span.setStatus({ code: SpanStatusCode.OK });
} catch (e) {
span.recordException(e as Error);
span.setStatus({ code: SpanStatusCode.ERROR });
throw e; // Rethrow so callers still see the failure
} finally {
span.end();
}
}
3. Not Propagating Context Across Async Boundaries
A span created with `startSpan` is not made the active span, so spans started in called functions do not become its children unless you pass the context along:
// Bad: Context lost
async function outer() {
const span = tracer.startSpan("outer");
await inner(); // Span context not passed
span.end();
}
async function inner() {
const span = tracer.startSpan("inner"); // Orphan span
span.end();
}
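// Good: Pass the parent context explicitly
// (uses `trace`, `context`, and `Context` from '@opentelemetry/api')
async function outer() {
const span = tracer.startSpan("outer");
const ctx = trace.setSpan(context.active(), span);
await inner(ctx); // Parent context passed
span.end();
}
async function inner(parentCtx: Context) {
const span = tracer.startSpan("inner", undefined, parentCtx); // Child of "outer"
span.end();
}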
4. Storing Too Much Data in Span Attributes
Span attributes are not a data store. Keep them small and queryable:
// Bad: Large data in attributes
span.setAttribute("response_body", JSON.stringify(largeObject));
// Good: Reference data by ID
span.setAttribute("order_id", order.id);
span.setAttribute("items_count", order.items.length);
5. Ignoring Sampling in High-Volume Services
Recording 100% of traces at high volume creates significant processing and storage overhead. Sample instead, keeping the traces that matter:
# OTel Collector tail sampling
processors:
tail_sampling:
decision_wait: 10s
policies:
- name: errors
type: status_code
status_code: { status_codes: [ERROR] }
- name: slow-traces
type: latency
latency: { threshold_ms: 2000 }
- name: probabilistic
type: probabilistic
probabilistic: { sampling_percentage: 1 }
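To make the effect of those three policies concrete, here is a rough TypeScript model of the keep/drop decision for a completed trace. It is a simplified sketch, not the Collector's implementation: `FinishedSpan` and `keepTrace` are hypothetical names, and trace latency is approximated by the longest span rather than by the trace's overall start-to-end time.

```typescript
// Hypothetical, simplified view of a finished span.
interface FinishedSpan {
  traceId: string;
  durationMs: number;
  isError: boolean;
}

// Keep a trace if ANY policy matches. `random` is injectable so the
// probabilistic branch can be exercised deterministically.
function keepTrace(
  spans: FinishedSpan[],
  thresholdMs: number = 2000,
  samplingPercentage: number = 1,
  random: () => number = Math.random,
): boolean {
  // "errors" policy: any span with error status keeps the whole trace.
  if (spans.some((s) => s.isError)) return true;
  // "slow-traces" policy, approximated here by the longest span.
  const longest = Math.max(0, ...spans.map((s) => s.durationMs));
  if (longest > thresholdMs) return true;
  // "probabilistic" policy: keep a small baseline of everything else.
  return random() * 100 < samplingPercentage;
}
```

Because the decision needs the whole trace, the sampler has to buffer spans until `decision_wait` elapses, which is where tail sampling's memory cost comes from.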
Observability Checklist
Tracing Coverage
- HTTP request/response spans for all API endpoints
- Database query spans with statement and duration
- External API call spans with URL and status
- Message queue publish/consume spans
- Background job spans with job ID and outcome
- Custom business operation spans with relevant context
Span Attributes
- Service name and version
- Operation name
- Trace ID and span ID
- Start time and duration
- HTTP: method, URL, status code
- DB: system, statement, rows affected
- Business: entity IDs, customer tier, transaction amount
Correlation
- Trace ID included in all log entries
- Trace ID attached to metrics as exemplars where supported (avoid using it as a label, since every request would create a new time series)
- Log entries linkable from span events
- Metrics aggregatable by trace-derived dimensions
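As an illustration of the first correlation item, a structured log entry can carry the IDs of the span that was active when it was written. This is a minimal sketch: `formatLog` and the `trace_id` / `span_id` field names are assumptions for the example, and in a real service the IDs would typically be read from the active span's context (for instance via `trace.getSpan(context.active())?.spanContext()` in `@opentelemetry/api`) rather than passed in by hand.

```typescript
// Hypothetical shape for the identifiers taken from the active span.
interface TraceIds {
  traceId: string;
  spanId: string;
}

// Produce one JSON log line, adding trace identifiers when a span is active.
function formatLog(
  level: string,
  message: string,
  ids?: TraceIds,
  now: Date = new Date(),
): string {
  const entry: Record<string, string> = {
    timestamp: now.toISOString(),
    level,
    message,
  };
  if (ids) {
    entry.trace_id = ids.traceId;
    entry.span_id = ids.spanId;
  }
  return JSON.stringify(entry);
}
```

With the trace ID in every entry, a log search for one ID returns the lines from all services that handled that request.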
Sampling Configuration
- Head-based sampling for consistent baseline (1-10%)
- Tail-based sampling for errors (100% of errors)
- Tail-based sampling for slow traces (>threshold)
- Always sample for tagged critical requests
Security Checklist
- Trace data does not include passwords, tokens, or secrets in span attributes
- PII not stored in span attributes or events
- Trace data encrypted in transit (TLS)
- Trace data access logged and audited
- Sampling does not inadvertently exclude security-relevant traces
- Trace storage has appropriate retention policies
- Internal service names not exposed in trace exports to third parties
- Trace context headers sanitized before external calls
Interview Questions
Expected answer points:
- A trace represents the complete end-to-end journey of a single request through all services
- A span is a single unit of work within that trace, representing one operation or service call
- Spans are organized hierarchically with parent-child relationships forming the trace tree
- Each span captures timing, attributes, events, and status about that specific operation
Expected answer points:
- Trace context propagates via HTTP headers, most importantly the `traceparent` header
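- The `traceparent` header contains four hyphen-separated fields: version (2 hex chars), trace ID (32 hex chars), parent ID (16 hex chars), and trace flags (2 hex chars, where the lowest bit marks the trace as sampled)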
- When service A calls service B, it injects the trace context into outgoing request headers
- Service B extracts the context and creates a child span, linking to the parent
- The `tracestate` header allows for vendor-specific propagation data
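The header layout described above can be checked with a few lines of parsing code. The sketch below handles only the version `00` layout; `parseTraceparent` and `TraceParent` are hypothetical names, and production code would normally rely on the OpenTelemetry propagator instead of parsing by hand.

```typescript
interface TraceParent {
  version: string;
  traceId: string;
  parentId: string;
  sampled: boolean;
}

// version - trace-id - parent-id - trace-flags, all lowercase hex.
const TRACEPARENT_RE =
  /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

function parseTraceparent(header: string): TraceParent | null {
  const match = TRACEPARENT_RE.exec(header.trim());
  if (!match) return null;
  const [, version, traceId, parentId, flags] = match;
  // Version "ff" and all-zero IDs are invalid.
  if (version === "ff" || /^0+$/.test(traceId) || /^0+$/.test(parentId)) {
    return null;
  }
  return {
    version,
    traceId,
    parentId,
    // The lowest bit of the flags byte is the "sampled" flag.
    sampled: (parseInt(flags, 16) & 0x01) === 1,
  };
}
```

For example, `parseTraceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")` yields the 32-character trace ID, the 16-character parent ID, and `sampled: true`.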
Expected answer points:
- Application Code / SDK: Language-specific instrumentation libraries that create spans
- Auto-instrumentation: Framework-specific agents that instrument common operations automatically
- Collector: Middleware that receives, processes, and exports telemetry data
- Exporters: Connectors that send data to backends like Jaeger, Zipkin, or cloud providers
- The OTel SDK is vendor-neutral, allowing you to switch backends without code changes
Expected answer points:
- Head-based sampling decides at trace start whether to capture, using probabilistic or rule-based selection
- Tail-based sampling captures all spans temporarily, then decides what to keep after the trace completes
- Head-based sampling is simpler and has lower memory overhead since you discard early
- Tail-based sampling enables intelligent decisions like "keep all errors" or "keep slow traces" after seeing the full picture
- Production systems often use both: head-based for consistent baseline sampling, tail-based for targeted capture of important traces
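A head-based decision is usually made deterministic in the trace ID so that every service reaches the same verdict for the same trace. The function below sketches that idea under simplifying assumptions; `shouldSampleHead` is a hypothetical name and this is not the exact algorithm used by the OpenTelemetry SDK samplers.

```typescript
// Map the last 32 bits of the trace ID onto [0, 2^32) and keep the trace
// when it falls below ratio * 2^32. The same trace ID always yields the
// same answer, regardless of which service evaluates it.
function shouldSampleHead(traceId: string, ratio: number): boolean {
  if (ratio >= 1) return true;
  if (ratio <= 0) return false;
  const bucket = parseInt(traceId.slice(-8), 16);
  return bucket < ratio * Math.pow(2, 32);
}
```

Since the decision happens before any work is recorded, a trace rejected here costs almost nothing, but it also cannot be recovered later if the request turns out to be slow or failing. That gap is what tail-based sampling fills.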
Expected answer points:
- Producer injects trace context into message headers before sending
- Context is serialized into headers like `traceparent` using W3C format
- Consumer extracts context from message headers and creates a linked span
- Use `context.with(extractedContext, () => { ... })` to run handlers within the correct context
- This ensures traces span across async boundaries, showing the full request flow even through queues
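The producer/consumer handoff can be modeled without any tracing library to show what actually travels in the message. Everything here is a hand-rolled illustration: `Message`, `SpanRef`, `publish`, and `consume` are hypothetical, span IDs come from `Math.random` purely for brevity, and a real service would use `propagation.inject` and `propagation.extract` from `@opentelemetry/api` instead.

```typescript
interface Message {
  headers: Record<string, string>;
  body: string;
}

interface SpanRef {
  traceId: string;
  spanId: string;
  parentSpanId?: string;
}

// Illustration only: not a cryptographically strong ID generator.
function randomHex(length: number): string {
  let out = "";
  for (let i = 0; i < length; i++) {
    out += Math.floor(Math.random() * 16).toString(16);
  }
  return out;
}

// Producer side: serialize the current span into a W3C-style header.
function publish(body: string, current: SpanRef): Message {
  return {
    headers: { traceparent: `00-${current.traceId}-${current.spanId}-01` },
    body,
  };
}

// Consumer side: continue the trace if a usable header is present,
// otherwise start a new trace.
function consume(message: Message): SpanRef {
  const parts = (message.headers["traceparent"] ?? "").split("-");
  if (parts.length !== 4 || parts[1].length !== 32 || parts[2].length !== 16) {
    return { traceId: randomHex(32), spanId: randomHex(16) };
  }
  return { traceId: parts[1], spanId: randomHex(16), parentSpanId: parts[2] };
}
```

The consumer's span keeps the producer's trace ID and records the producer's span ID as its parent, which is what lets a backend draw the two sides of the queue as one trace.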
Expected answer points:
- Semantic conventions are standardized attribute names for common operations (HTTP, DB, messaging)
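- Examples: `http.method`, `http.status_code`, `db.system`, `db.statement` (newer versions of the conventions rename some of these, for example `http.request.method` and `http.response.status_code`)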
- They enable interoperability between instrumentation from different libraries
- Backend systems can interpret attributes consistently regardless of instrumentation source
- They make traces queryable across your entire system using consistent filter names
Expected answer points:
- Use W3C `traceparent` header to propagate context to external services
- If the external service supports W3C tracing, spans will be linked automatically
- For services that don't propagate headers, create a span representing the external call with relevant attributes
- Include the downstream service URL, response status, and duration as span attributes
- Add custom attributes for business context even when you cannot instrument the remote service
Expected answer points:
- RED metrics are derived from trace data aggregated across similar spans
- Rate: Request count per second, derived by counting spans per operation over time
- Errors: Error rate calculated from spans with error status codes
- Duration: Latency percentiles (p50, p95, p99) calculated from span durations
- Traces provide the granular data; metrics are the rollup of that data for alerting
- Use traces for debugging specific issues, use RED metrics for alerting and dashboards
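The rollup from spans to RED metrics is simple arithmetic, sketched below for the spans of one operation within a time window. `SpanRecord`, `redMetrics`, and the nearest-rank percentile method are choices made for this example; real backends and the Collector's connectors compute these differently, usually with histograms.

```typescript
interface SpanRecord {
  operation: string;
  durationMs: number;
  isError: boolean;
}

interface RedMetrics {
  ratePerSecond: number;
  errorRate: number;
  p50Ms: number;
  p95Ms: number;
  p99Ms: number;
}

// Nearest-rank percentile over an ascending-sorted array.
function percentile(sorted: number[], p: number): number {
  if (sorted.length === 0) return 0;
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(0, rank - 1)];
}

// Aggregate the spans observed for one operation during `windowSeconds`.
function redMetrics(spans: SpanRecord[], windowSeconds: number): RedMetrics {
  const durations = spans.map((s) => s.durationMs).sort((a, b) => a - b);
  const errors = spans.filter((s) => s.isError).length;
  return {
    ratePerSecond: spans.length / windowSeconds,
    errorRate: spans.length === 0 ? 0 : errors / spans.length,
    p50Ms: percentile(durations, 50),
    p95Ms: percentile(durations, 95),
    p99Ms: percentile(durations, 99),
  };
}
```

Note that metrics derived from sampled traces inherit the sampling bias, so rates need to be scaled by the sampling ratio or computed before sampling takes place.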
Expected answer points:
- Never include passwords, tokens, or secrets in span attributes or events
- Sanitize PII from span attributes before export
- Encrypt trace data in transit using TLS
- Implement access controls and audit logging for trace data access
- Configure sampling to avoid capturing sensitive high-traffic endpoints excessively
- Scrub or exclude headers like Authorization before creating spans
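A small redaction pass applied before attributes are attached to a span is one way to enforce the first and last points. The key pattern and the `redactAttributes` helper below are assumptions for illustration; a name-based filter like this is a safety net and will not catch secrets stored under innocuous keys or embedded in values.

```typescript
type AttributeValue = string | number | boolean;

// Hypothetical deny-list of key fragments that usually indicate secrets.
const SENSITIVE_KEY =
  /(authorization|password|passwd|secret|token|api[-_]?key|cookie)/i;

// Return a copy of the attributes with sensitive values masked.
// The input object is left untouched.
function redactAttributes(
  attrs: Record<string, AttributeValue>,
): Record<string, AttributeValue> {
  const clean: Record<string, AttributeValue> = {};
  for (const [key, value] of Object.entries(attrs)) {
    clean[key] = SENSITIVE_KEY.test(key) ? "[REDACTED]" : value;
  }
  return clean;
}
```

Masking the value while keeping the key preserves the fact that a header was present, which is often useful when debugging authentication problems.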
Expected answer points:
- Check if trace context is being extracted at service entry points (HTTP middleware)
- Verify that context is being injected into outgoing requests
- Look for async boundaries where context might be lost (missing context.with)
- Check if message queue producers are injecting headers and consumers are extracting them
- Verify all HTTP clients and message frameworks are instrumented
- Check collector logs for context propagation failures
- Ensure sampling decisions are consistent across the trace propagation path
Expected answer points:
- Jaeger (Cassandra, Elasticsearch, Badger) - good for self-hosted with flexible querying
- Zipkin (Cassandra, Elasticsearch, MySQL) - simpler alternative with basic search
- AWS X-Ray (managed) - tight integration with AWS services but vendor lock-in
- GCP Cloud Trace (managed) - seamless integration with Google Cloud, scales automatically
- Azure Application Insights (managed) - comprehensive APM with built-in analytics
- Choice depends on: existing cloud provider, query flexibility needs, operational overhead, cost
- For multi-cloud: prefer vendor-neutral backends like Jaeger or self-hosted OTel-compatible storage
Expected answer points:
- Retention depends on use case: debugging (hours to days), compliance (months to years), analytics (aggregated indefinitely)
- Hot storage (fast query): typically 7-30 days for recent traces
- Cold storage (archive): months to years for historical analysis
- Consider sampling older data - keep 100% for recent, sample for historical
- Cost implications: trace data is voluminous; compression and tiered storage help
- Compliance requirements may mandate minimum retention periods
- Balance between investigative value and storage costs
Expected answer points:
- Centralized (Jaeger, Zipkin): simpler operations, single query endpoint, potential network latency for upload
- Edge storage (X-Ray daemon buffers): resilience to network partitions, reduced upload bandwidth, more complex retrieval
- Hybrid approach: buffer at edge, batch upload to central, local fallback during outages
- Consider data locality requirements - some regulations mandate data stays in certain regions
- Edge buffering prevents data loss during collector downtime but requires disk management
- Centralized storage simplifies debugging across services but creates dependency on network
Expected answer points:
- Collector exporters ship with built-in retry (`retry_on_failure`) and queuing (`sending_queue`) settings
- When the backend is down, spans queue in memory, which risks memory exhaustion under sustained load
- Configure the `memory_limiter` processor to refuse incoming data when memory usage exceeds a threshold
- Use a persistent (disk-backed) queue for better resilience during backend outages
- Exponential backoff with jitter prevents a thundering herd when the backend recovers
- Retries are bounded: once `retry_on_failure.max_elapsed_time` is exceeded the batch is dropped, so decide up front whether undeliverable spans can be discarded or must be diverted elsewhere
- Monitor queue depth metrics to anticipate potential data loss
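The backoff-with-jitter point is easy to get subtly wrong, so here is a minimal sketch of the "full jitter" variant. `backoffDelayMs` and its default intervals are illustrative assumptions, not the Collector's actual retry code.

```typescript
// Delay before retry number `attempt` (0-based). The ceiling doubles with
// each attempt up to `maxMs`; the actual delay is drawn uniformly below it
// so that many clients do not retry at the same instant.
function backoffDelayMs(
  attempt: number,
  initialMs: number = 5000,
  maxMs: number = 30000,
  random: () => number = Math.random,
): number {
  const ceiling = Math.min(maxMs, initialMs * Math.pow(2, attempt));
  return Math.floor(random() * ceiling);
}
```

Without the random component, every exporter that failed at the same moment would retry at the same moment too, hitting the recovering backend in synchronized waves.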
Expected answer points:
- Adaptive sampling: lower overall rate, 100% for errors and slow traces
- Attribute pruning: remove low-value attributes before storage
- Span deduplication: compress similar spans in batch operations
- Data tiering: hot storage for recent data, archive/aggregate older data
- Compression: use columnar formats (Parquet) that compress well
- Trace summarization: keep full traces for errors, aggregated metrics for success paths
- TTL enforcement: automatically expire old data based on retention policy
Expected answer points:
- Tenant isolation via separate indices/tables per customer (Jaeger with Elasticsearch)
- Tag-based filtering: all spans tagged with tenant ID, query layer filters
- Separate collectors or collector groups per tenant for strict data isolation
- Consider data residency requirements - tenants may need data in specific regions
- Resource quota enforcement to prevent one tenant from monopolizing storage
- Access control: ensure tenants can only query their own trace data
- Cost attribution: track storage and query costs per tenant for billing
Expected answer points:
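- Trace collection adds overhead: creating a span is normally cheap (well under a millisecond), but the cost adds up with many spans per request, so measure it in your own runtime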
- Batching exporters reduce network overhead by amortizing connection costs
- Async export prevents blocking the main request path
- SimpleSpanProcessor vs BatchSpanProcessor: batch is more efficient at scale
- Collector pipeline: use processors to aggregate and reduce data before export
- Network: consider gRPC vs HTTP exporters; gRPC has lower overhead for high volume
- Profile in staging to understand actual overhead before production deployment
Expected answer points:
- Regional collectors ingest locally, then forward to central aggregation
- Use load balancing across collectors for horizontal scalability
- Implement trace context propagation across regional boundaries
- Consider data residency - some regions may require local storage before aggregation
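- Global sampling: if each region samples independently, cross-region traces can end up partially captured; derive the decision from the trace ID or honor the propagated sampled flag so all regions agree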
- Global view requires stitching traces from multiple regions - use consistent trace ID generation
- Network design: dedicated links for trace traffic prevent interference with application traffic
Expected answer points:
- Spans started vs ended (a growing gap indicates span leaks)
- Export success/failure rate per backend
- Queue depth and memory usage for exporters
- Collector receive latency (p50, p95, p99)
- Dropped spans count and reason (sampling, queue full, export failure)
- Context propagation success/failure rate
- Backend query latency for trace retrieval
- Set SLOs/SLIs on these metrics and alert on violations
Expected answer points:
- Saga orchestrator creates parent span; each saga step is a child span
- Compensation operations (rollbacks) should be spans linked to the original transaction
- Event-driven: inject context into message headers, extract in consumers
- Choreography-based sagas: use correlation ID linking all related spans
- Long-running sagas require sustained context propagation across hours or days
- Consider span linking vs parent-based models for saga step relationships
- Trace visualization helps identify bottleneck steps in saga execution
Further Reading
- OpenTelemetry JavaScript SDK Documentation - Official OTel JS guides and API reference
- W3C Trace Context Specification - The standard that defines trace header format and propagation rules
- Jaeger Documentation - Backend for storing and visualizing distributed traces
- Service Mesh Observability with Linkerd - How Linkerd handles tracing without code changes
- AWS X-Ray Developer Guide - Cloud-native tracing on AWS
- Google Cloud Trace Documentation - GCP tracing solution
- OpenTelemetry Collector Configuration - Building flexible collector pipelines
Conclusion
Key Takeaways:
- Distributed tracing shows complete request journeys across services
- OpenTelemetry provides vendor-neutral instrumentation
- Always propagate trace context through HTTP headers and message queues, and instrument database clients
- Use semantic conventions for span attributes
- Implement tail-based sampling to capture errors and slow traces
- Correlate traces with logs and metrics for complete observability
Copy/Paste Checklist:
// Trace context propagation (HTTP)
import { propagation, context } from '@opentelemetry/api';
function injectTraceContext(headers: Record<string, string>) {
propagation.inject(context.active(), headers);
}
function extractTraceContext(headers: Record<string, string>) {
return propagation.extract(context.active(), headers);
}
// Manual span with error handling
const span = tracer.startSpan('operation');
try {
span.setAttribute('entity.id', entityId);
await doWork();
span.setStatus({ code: SpanStatusCode.OK });
} catch (e) {
span.recordException(e as Error);
span.setStatus({ code: SpanStatusCode.ERROR, message: (e as Error).message });
throw e; // Rethrow so callers still see the failure
} finally {
span.end();
}
# Tail-based sampling config (OTel Collector YAML)
processors:
tail_sampling:
decision_wait: 10s
num_traces: 100000
policies:
- name: errors
type: status_code
status_code: { status_codes: [ERROR]}
- name: slow-traces
type: latency
latency: { threshold_ms: 1000 }
Distributed tracing turns opaque microservices into transparent systems where request flows are visible and debugging is systematic. Start by instrumenting your HTTP and database layers with auto-instrumentation, then add custom spans for business operations.
OpenTelemetry provides vendor-neutral instrumentation, so you can switch backends without re-instrumenting. For implementation details, see our Jaeger guide on trace visualization. For metrics correlation, the Prometheus & Grafana guide covers building complete observability pipelines.