Distributed Tracing: Trace Context and OpenTelemetry

Master distributed tracing for microservices. Learn trace context propagation, OpenTelemetry instrumentation, and how to debug request flows across services.

Published: March 22, 2026 | Reading time: 36 min | Author: GeekWorkBench

When a request touches ten services before returning an error, traditional logging tells you each piece in isolation. Distributed tracing connects the pieces, showing you the complete journey of a request through your system.

This guide covers the fundamentals of tracing: trace context propagation, OpenTelemetry instrumentation, and practical correlation patterns.

If you are building microservices, you need distributed tracing.

Introduction

Logs tell you what happened in a single service. Metrics tell you aggregate patterns. Neither shows causality across service boundaries.

Consider a request that fails after touching five services. With only logs, you search each service’s logs for a request or correlation ID (if one was logged at all), then manually piece together the sequence. With tracing, you open a single view showing the entire timeline: service A started the request and called B, B called C, D, and E in sequence, and E returned an error that propagated back up.

This turns debugging from a scavenger hunt into a tractable task.

Core Concepts

Traces and Spans

A trace represents an entire request journey. It contains one or more spans, where each span represents a single operation within that trace.

sequenceDiagram
    participant C as Client
    participant A as API Gateway
    participant O as Order Service
    participant P as Payment Service
    participant N as Notification Service

    C->>A: GET /orders/123
    A->>O: GetOrder(123)
    O->>P: ProcessPayment(order)
    P-->>O: Payment confirmed
    O->>N: SendConfirmation(order)
    N-->>O: Notification sent
    O-->>A: Order details
    A-->>C: Response

Each span captures:

Operation name
Start and end time
Span ID and parent span ID (the link that places the span in the trace tree)
Attributes (key-value metadata)
Events (timestamped points within the span)

Trace Context

Trace context propagates across service boundaries through HTTP headers. When service A calls service B, it passes trace context in headers. Service B creates a child span using that context, ensuring the spans stay connected in a single trace.

The W3C Trace Context specification standardizes these headers:

traceparent: 00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01
tracestate: congo=t61rcWkgMzE

The traceparent header contains:

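Version (2 hex characters, currently 00)
Trace ID (32 hex characters, shared by every span in the trace)
Parent ID (16 hex characters, the span ID of the caller)
Flags (2 hex characters; 01 = sampled)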
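
To make the field layout concrete, here is a minimal sketch of a parser for the version 00 header format. The TraceParent interface and the parseTraceparent function are names invented for this illustration, not part of any OpenTelemetry package; in real services the SDK's propagator does this work.

```typescript
// Parsed form of a W3C traceparent header (field names are this sketch's own).
interface TraceParent {
  version: string;
  traceId: string;
  parentId: string;
  sampled: boolean;
}

// version - trace ID - parent ID - flags, all lowercase hex
const TRACEPARENT_RE =
  /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

function parseTraceparent(header: string): TraceParent | null {
  const match = TRACEPARENT_RE.exec(header.trim());
  if (match === null) return null;
  const [, version, traceId, parentId, flags] = match;
  // Version ff is reserved as invalid, and all-zero IDs are not allowed.
  if (version === "ff") return null;
  if (/^0+$/.test(traceId) || /^0+$/.test(parentId)) return null;
  // Bit 0 of the flags byte is the "sampled" flag.
  const sampled = (parseInt(flags, 16) & 0x01) === 0x01;
  return { version, traceId, parentId, sampled };
}

const parsed = parseTraceparent(
  "00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01",
);
console.log(parsed);
```

A header that fails validation returns null, which mirrors the usual propagator behaviour of starting a fresh trace rather than trusting malformed input.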

OpenTelemetry Architecture

OpenTelemetry (OTel) is the open standard for observability. It gives you APIs, SDKs, and instrumentation for collecting traces, metrics, and logs.

graph TB
    subgraph "Application Code"
        A[Your Service]
        B[OTel SDK]
        C[Language-specific auto-instrumentation]
    end

    subgraph "Exporters"
        D[OTLP Exporter]
        E[Jaeger Exporter]
        F[Zipkin Exporter]
    end

    subgraph "Collecting Infrastructure"
        G[OTel Collector]
        H[Jaeger]
        I[Zipkin]
    end

    A --> B
    C --> B
    B --> D
    B --> E
    B --> F
    D --> G
    E --> H
    F --> I
    G --> H
    G --> I

OTel collector

The OTel collector receives, processes, and exports telemetry data. Think of it as middleware between your application and your observability backend.

# otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:
    timeout: 5s
    send_batch_size: 1024
  memory_limiter:
    check_interval: 1s
    limit_percentage: 90

exporters:
  otlp:
    endpoint: jaeger-collector:4317
    tls:
      insecure: false
      cert_file: /certs/cert.pem
      key_file: /certs/key.pem

  prometheus:
    endpoint: "0.0.0.0:8889"

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlp]
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [prometheus]

Manual Instrumentation

Auto-instrumentation covers many frameworks automatically, but you need manual instrumentation for business-specific operations and custom spans.

Starting Traces

import { trace, SpanStatusCode } from "@opentelemetry/api";

const tracer = trace.getTracer("order-service", "1.0.0");

async function createOrder(orderData: OrderData): Promise<Order> {
  const span = tracer.startSpan("OrderService.createOrder", {
    attributes: {
      "order.customer_id": orderData.customerId,
      "order.item_count": orderData.items.length,
      "order.total": orderData.total,
    },
  });

  try {
    const order = await db.orders.create(orderData);
    span.setStatus({ code: SpanStatusCode.OK });
    return order;
  } catch (error) {
    span.recordException(error as Error);
    span.setStatus({
      code: SpanStatusCode.ERROR,
      message: (error as Error).message,
    });
    throw error;
  } finally {
    span.end();
  }
}

Creating Child Spans

Wrap nested operations as child spans:

import { context, trace, Span, SpanStatusCode } from "@opentelemetry/api";

// Reuses the `tracer` created in the previous example.
async function processPayment(
  orderId: string,
  payment: Payment,
): Promise<PaymentResult> {
  // startSpan parents the new span to the span in the active context by
  // default, so no explicit parent option is needed here.
  const paymentSpan = tracer.startSpan("PaymentService.process", {
    attributes: {
      "payment.method": payment.method,
      "payment.amount": payment.amount,
      "payment.currency": payment.currency,
    },
  });

  try {
    // Verify card
    await verifyCard(payment.card, paymentSpan);

    // Charge
    const result = await chargeCard(payment, paymentSpan);

    paymentSpan.setAttribute("payment.transaction_id", result.transactionId);
    paymentSpan.setStatus({ code: SpanStatusCode.OK });
    return result;
  } catch (error) {
    paymentSpan.recordException(error as Error);
    paymentSpan.setStatus({
      code: SpanStatusCode.ERROR,
      message: (error as Error).message,
    });
    throw error;
  } finally {
    paymentSpan.end();
  }
}

async function verifyCard(card: Card, parentSpan: Span): Promise<void> {
  // The parent is passed as a context (third argument), not as a span option.
  const parentContext = trace.setSpan(context.active(), parentSpan);
  const span = tracer.startSpan(
    "PaymentService.verifyCard",
    {
      attributes: {
        "card.type": card.type,
        "card.last_four": card.lastFour,
      },
    },
    parentContext,
  );

  try {
    // Verification logic
    await api.verifyCard(card);
  } catch (error) {
    span.recordException(error as Error);
    span.setStatus({ code: SpanStatusCode.ERROR });
    throw error;
  } finally {
    // End the span even when verification throws
    span.end();
  }
}

Context Propagation

Proper context propagation connects spans across service boundaries. Without it, spans become orphaned and useless for debugging.

HTTP Propagation

Middleware that extracts incoming trace context and propagates it to downstream calls:

import { trace, context, propagation } from "@opentelemetry/api";

function httpMiddleware(req, res, next) {
  // Extract context from incoming headers
  const extractedContext = propagation.extract(context.active(), req.headers);

  // Run the rest of the request handler within that context
  context.with(extractedContext, () => {
    // All spans created here are linked to the incoming trace
    next();
  });
}

// When making outgoing requests, inject context into headers
async function callDownstreamService(url: string, data: any): Promise<any> {
  const headers = {};
  propagation.inject(context.active(), headers);

  return fetch(url, {
    method: "POST",
    headers: {
      ...headers,
      "Content-Type": "application/json",
    },
    body: JSON.stringify(data),
  });
}

Messaging Propagation

Propagate context through message queues so spans stay connected even with asynchronous processing:

import { trace, context, propagation, SpanStatusCode } from "@opentelemetry/api";

// Producer: inject context into message
async function sendOrderCreatedEvent(order: Order): Promise<void> {
  const headers: Record<string, string> = {};
  propagation.inject(context.active(), headers);

  await kafka.send({
    topic: "order.created",
    messages: [
      {
        key: order.id,
        value: JSON.stringify(order),
        headers: headers,
      },
    ],
  });
}

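// Consumer: extract context from message and create linked span
async function handleOrderCreated(message: KafkaMessage): Promise<void> {
  // Kafka header values typically arrive as Buffers; the propagator expects strings
  const carrier: Record<string, string> = {};
  for (const [key, value] of Object.entries(message.headers ?? {})) {
    if (value !== undefined) carrier[key] = value.toString();
  }
  const extractedContext = propagation.extract(context.active(), carrier);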

  await context.with(extractedContext, async () => {
    const span = tracer.startSpan("OrderConsumer.handleOrderCreated");
    try {
      const order = JSON.parse(message.value.toString());
      await processOrder(order);
      span.setStatus({ code: SpanStatusCode.OK });
    } catch (error) {
      span.recordException(error as Error);
      span.setStatus({ code: SpanStatusCode.ERROR });
      throw error;
    } finally {
      span.end();
    }
  });
}

Adding Business Context

Rich span attributes turn traces from timing diagrams into debugging tools.

Semantic Attributes

Use standard attribute names for common data:

// HTTP attributes
span.setAttribute("http.method", "POST");
span.setAttribute("http.url", "https://api.example.com/orders");
span.setAttribute("http.status_code", 201);
span.setAttribute("http.response_content_length", 1024);

// Database attributes
span.setAttribute("db.system", "postgresql");
span.setAttribute("db.name", "orders_db");
span.setAttribute("db.statement", "SELECT * FROM orders WHERE id = $1");
span.setAttribute("db.operation", "SELECT");

// Messaging attributes
span.setAttribute("messaging.system", "kafka");
span.setAttribute("messaging.destination", "order.created");
span.setAttribute("messaging.operation", "publish");

Custom Business Attributes

Add domain-specific context:

span.setAttribute("order.id", order.id);
span.setAttribute("order.status", order.status);
span.setAttribute("order.customer_tier", customer.tier);
span.setAttribute("order.is_first_purchase", customer.orderCount === 0);

These attributes let you filter traces by business properties: find all traces for premium customers, or analyze timing for first-time purchasers.

Correlation with Logs and Metrics

Traces work best when linked to your logs and metrics.

Trace-Log Correlation

Include trace ID in logs:

import { trace } from "@opentelemetry/api";

function logInfo(message: string, data?: Record<string, unknown>): void {
  const span = trace.getActiveSpan();
  const traceId = span?.spanContext().traceId;

  const logEntry = {
    timestamp: new Date().toISOString(),
    level: "INFO",
    message,
    traceId,
    ...data,
  };

  console.log(JSON.stringify(logEntry));
}

// Now every log entry includes the trace ID
logInfo("Order created successfully", { orderId: "ord_123" });
// {"timestamp":"2026-03-22T14:30:00Z","level":"INFO","message":"Order created successfully","traceId":"abc123...","orderId":"ord_123"}

Trace-Metric Correlation

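Record metrics next to spans and give both the same attribute names, so a spike in a histogram can be matched to the traces behind it:

import { metrics } from "@opentelemetry/api";

const meter = metrics.getMeter("payment-service");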

const paymentDuration = meter.createHistogram("payment.duration", {
  unit: "ms",
  description: "Payment processing duration",
});

async function processPayment(payment: Payment): Promise<void> {
  const span = tracer.startSpan("PaymentService.process");

  const startTime = Date.now();
  try {
    await doPayment(payment);
    paymentDuration.record(Date.now() - startTime, {
      "payment.method": payment.method,
      "payment.status": "success",
    });
  } catch (error) {
    paymentDuration.record(Date.now() - startTime, {
      "payment.method": payment.method,
      "payment.status": "failure",
    });
    throw error;
  } finally {
    span.end();
  }
}

Sampling Strategies

At high traffic, you cannot capture every trace. Sampling reduces volume while preserving useful data.

Common Sampling Strategies

Head-based sampling decides at trace start whether to capture:

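import { NodeTracerProvider } from "@opentelemetry/sdk-trace-node";
import {
  BatchSpanProcessor,
  ParentBasedSampler,
  TraceIdRatioBasedSampler,
} from "@opentelemetry/sdk-trace-base";

// Sample 1% of new traces; child spans follow their parent's decision.
// A head sampler cannot "keep all errors": the outcome is unknown when the
// trace starts, so error-biased retention belongs in tail-based sampling.
const provider = new NodeTracerProvider({
  sampler: new ParentBasedSampler({ root: new TraceIdRatioBasedSampler(0.01) }),
  // `exporter` is the span exporter configured elsewhere in your setup
  spanProcessors: [new BatchSpanProcessor(exporter)],
});
provider.register();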
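
Ratio-based head sampling only produces complete traces if every service reaches the same decision for a given trace. One way to get that property is to derive the decision from the trace ID itself. The sketch below is a simplified illustration of the idea, not the OpenTelemetry SDK's exact algorithm; shouldSampleTrace is a hypothetical helper.

```typescript
// Decide from the trace ID alone, so every service agrees without coordination.
function shouldSampleTrace(traceId: string, ratio: number): boolean {
  if (ratio >= 1) return true;
  if (ratio <= 0) return false;
  // Interpret the last 8 hex characters as an unsigned 32-bit integer.
  const bucket = parseInt(traceId.slice(-8), 16);
  // Keep the trace when its bucket falls in the lowest `ratio` share of the range.
  return bucket < ratio * 0x100000000;
}

const exampleTraceId = "0af7651916cd43dd8448eb211c80319c";
console.log(shouldSampleTrace(exampleTraceId, 0.5));
console.log(shouldSampleTrace(exampleTraceId, 0.1));
```

Because the function is pure, two services that see the same trace ID and use the same ratio always agree.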

Tail-based sampling captures all spans temporarily, then decides what to keep:

# OTel Collector tail-based sampling
processors:
  tail_sampling:
    decision_wait: 10s
    num_traces: 100000
    policies:
      - name: errors
        type: status_code
        status_code: { status_codes: [ERROR] }
      - name: slow-traces
        type: latency
        latency: { threshold_ms: 1000 }
      # A trace is kept if any policy matches, so an always_sample
      # policy here would retain every trace.
      - name: probabilistic
        type: probabilistic
        probabilistic: { sampling_percentage: 10 }

This captures slow traces, errors, and a percentage of everything else.
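
The decision a tail sampler makes once a trace is complete can be sketched in a few lines. This is an illustration of the policy logic only, assuming a simplified FinishedSpan shape and a hypothetical decideTailSampling helper; the real collector processor buffers spans, handles late arrivals, and has more policy types.

```typescript
interface FinishedSpan {
  startMs: number;
  endMs: number;
  isError: boolean;
}

type TailDecision = "errors" | "slow-traces" | "probabilistic" | "drop";

// Returns the name of the first matching policy, or "drop" if none match.
function decideTailSampling(
  traceId: string,
  spans: FinishedSpan[],
  thresholdMs = 1000,
  samplingPercentage = 10,
): TailDecision {
  // Policy 1: keep any trace that contains an error span.
  if (spans.some((s) => s.isError)) return "errors";
  // Policy 2: keep traces whose end-to-end duration exceeds the threshold.
  const start = Math.min(...spans.map((s) => s.startMs));
  const end = Math.max(...spans.map((s) => s.endMs));
  if (end - start > thresholdMs) return "slow-traces";
  // Policy 3: keep a fixed percentage of the rest, keyed on the trace ID.
  const bucket = parseInt(traceId.slice(-8), 16) % 100;
  return bucket < samplingPercentage ? "probabilistic" : "drop";
}

const fastOk: FinishedSpan[] = [{ startMs: 0, endMs: 120, isError: false }];
console.log(decideTailSampling("0af7651916cd43dd8448eb2100000005", fastOk));
```

The key difference from head sampling is visible in the inputs: the function sees finished spans, so it can use status and duration, which do not exist yet when a trace starts.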

Cloud-Native Tracing Solutions

AWS X-Ray

AWS X-Ray integrates with services like API Gateway, Lambda, ECS, and EKS:

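import * as AWSXRay from "aws-xray-sdk";
import * as https from "https";

// Trace AWS SDK v3 calls made through this client
const tracedS3 = AWSXRay.captureAWSv3Client(s3Client);
// Trace outgoing requests made with the Node https module
AWSXRay.captureHTTPsGlobal(https);

// In Lambda, enable active tracing on the function; the runtime creates the
// segment, so the handler itself needs no wrapper
export const handler = async (event, context) => {
  // Your handler code
  return await processOrder(event);
};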

X-Ray uses a daemon that buffers traces and sends them to the AWS backend. In ECS, run the X-Ray daemon as a sidecar container.

Google Cloud Trace

GCP Cloud Trace integrates with Cloud Run, GKE, and Compute Engine:

import * as traceAgent from "@google-cloud/trace-agent";

// Initialize before other imports
traceAgent.start({
  projectId: process.env.GCP_PROJECT_ID,
  keyFilename: "/path/to/service-account.json",
  logLevel: 1,
});

// Alternative: OpenTelemetry SDK with the Cloud Trace exporter
import { TraceExporter } from "@google-cloud/opentelemetry-cloud-trace-exporter";

// With no options, the exporter authenticates with Application Default Credentials
const traceExporter = new TraceExporter();

Azure Application Insights

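Application Insights offers an OpenTelemetry-based distro for server-side code and a JavaScript SDK for the browser. The browser SDK shown here correlates page activity with backend requests:

import { ApplicationInsights } from "@microsoft/applicationinsights-web";

// Auto-instrument HTTP and AJAX calls
const appInsights = new ApplicationInsights({
  config: {
    instrumentationKey: process.env.AZURE_INSTRUMENTATION_KEY,
    // Send correlation headers on cross-origin requests
    enableCorsCorrelation: true,
    autoTrackPageVisitTime: true,
  },
});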

appInsights.loadAppInsights();
appInsights.trackTrace({
  message: "Distributed tracing initialized",
  severityLevel: 1,
});

Multi-Cloud Trace Correlation

When running across cloud providers, maintain trace context using W3C headers. The traceparent header works across all providers:

function forwardToExternalService(
  url: string,
  headers: Headers,
): Promise<Response> {
  // Always inject W3C trace context
  const traceparent = headers.get("traceparent") || generateTraceparent();

  return fetch(url, {
    headers: {
      ...Object.fromEntries(headers),
      traceparent: traceparent,
      tracestate: `cloud=gcp,region=${process.env.REGION}`,
    },
  });
}
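
The forwarding example calls generateTraceparent without defining it. A minimal sketch of such a helper follows, assuming the version 00 header layout described earlier. It uses Math.random only to stay dependency-free, and the randomHex helper is invented for this illustration; production code would normally let the OpenTelemetry SDK generate IDs.

```typescript
// Build a lowercase hex string of the requested length.
function randomHex(length: number): string {
  let out = "";
  while (out.length < length) {
    out += Math.floor(Math.random() * 16).toString(16);
  }
  return out;
}

// Create a fresh version-00 traceparent value for a request that arrived without one.
function generateTraceparent(sampled = true): string {
  let traceId = randomHex(32);
  let spanId = randomHex(16);
  // All-zero IDs are invalid, so regenerate in the very unlikely case one appears.
  while (/^0+$/.test(traceId)) traceId = randomHex(32);
  while (/^0+$/.test(spanId)) spanId = randomHex(16);
  const flags = sampled ? "01" : "00";
  return `00-${traceId}-${spanId}-${flags}`;
}

console.log(generateTraceparent());
```

Generating a header at the edge starts a new trace, so it only makes sense when no upstream traceparent exists; overwriting an incoming one would break the trace.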

Visualization with Jaeger

Jaeger is a popular distributed tracing backend. It stores traces and provides a UI for exploration.

Key Jaeger Views

Search: Find traces by service, operation, time range, or tags.

Trace Detail: The flame graph showing all spans in a trace with timing information.

Span Detail: The attributes, events, and logs attached to a specific span.

Analyzing Trace Flame Graphs

A flame graph shows the parent-child span relationships:

order-service.createOrder (2.3s)
├── auth-service.validateToken (50ms)
├── inventory-service.checkStock (150ms)
│   └── external-partner.getAvailability (120ms)
├── payment-service.process (1.8s)
│   ├── fraud-check.analyze (400ms)
│   │   └── external-api.call (380ms)
│   └── payment-gateway.charge (1.2s)
│       └── external-bank.authorize (1.1s)
├── notification-service.send (100ms)
└── db.orders.insert (30ms)

Long spans are easy to spot. Here, payment-service.process dominates. Drilling in, payment-gateway.charge is the bottleneck. Further still, external-bank.authorize is where time is spent.
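
The same drill-down can be automated: start at the root span and repeatedly follow the longest child. The sketch below assumes a simplified SpanNode tree built from the durations in the example above; slowestPath and node are hypothetical helpers, not Jaeger APIs.

```typescript
interface SpanNode {
  name: string;
  durationMs: number;
  children: SpanNode[];
}

const node = (
  name: string,
  durationMs: number,
  children: SpanNode[] = [],
): SpanNode => ({ name, durationMs, children });

// Follow the longest child at each level until reaching a leaf span.
function slowestPath(root: SpanNode): string[] {
  const path: string[] = [root.name];
  let current = root;
  while (current.children.length > 0) {
    current = current.children.reduce((slowest, child) =>
      child.durationMs > slowest.durationMs ? child : slowest,
    );
    path.push(current.name);
  }
  return path;
}

const exampleTrace = node("order-service.createOrder", 2300, [
  node("auth-service.validateToken", 50),
  node("inventory-service.checkStock", 150, [
    node("external-partner.getAvailability", 120),
  ]),
  node("payment-service.process", 1800, [
    node("fraud-check.analyze", 400, [node("external-api.call", 380)]),
    node("payment-gateway.charge", 1200, [
      node("external-bank.authorize", 1100),
    ]),
  ]),
  node("notification-service.send", 100),
  node("db.orders.insert", 30),
]);

console.log(slowestPath(exampleTrace).join(" -> "));
// order-service.createOrder -> payment-service.process -> payment-gateway.charge -> external-bank.authorize
```

This greedy walk works when calls are sequential, as they are here. With parallel child spans, a true critical-path analysis would also need start and end timestamps.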

Service Mesh Tracing (Istio and Linkerd)

Service meshes add tracing to service-to-service communication at the proxy layer, with little application code.

Istio Integration

Istio’s Envoy sidecar proxy automatically generates spans for the HTTP and gRPC traffic that passes through it:

# istio tracing config
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
metadata:
  name: tracing-config
spec:
  meshConfig:
    enableTracing: true
    defaultProviders:
      tracing:
        - otel
    extensionProviders:
      - name: otel
        opentelemetry:
          service: otel-collector.observability
          port: 4317

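Envoy extracts trace context from traceparent headers and creates spans for every request. Your application still has to copy the trace headers from each inbound request onto the outbound requests it makes, because the proxy cannot tell which outbound call belongs to which inbound request.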

Linkerd Integration

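Linkerd’s proxies emit spans once a trace collector is configured for the mesh and, as with Istio, the application must forward trace headers. Service profiles do not switch tracing on; they define the named routes, timeouts, and retry budget that give per-route context to what the proxy reports:

# service-profile.yaml
apiVersion: linkerd.io/v1alpha2
kind: ServiceProfile
metadata:
  name: order-service.default.svc.cluster.local
spec:
  routes:
    - name: GET /api/orders/{id}
      condition:
        method: GET
        pathRegex: /api/orders/[^/]+
      timeout: 5s
  # The retry budget applies to the whole service, not to a single route
  retryBudget:
    retryRatio: 0.2
    minRetriesPerSecond: 10
    ttl: 10s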

Trade-offs: Mesh vs SDK Tracing

Aspect	Service Mesh Tracing	SDK Manual Tracing
Setup effort	Minimal (config only)	Code changes required
Network span coverage	Automatic for all traffic	Only where you add it
Business context	Limited to mesh metadata	Full custom attributes
Performance impact	Sidecar overhead ~1-2ms	Minimal with sampling
Portability	Tied to mesh implementation	Portable across platforms

Common Patterns

Database Query Tracing

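// Patch the pg client so every query produces a span
import { registerInstrumentations } from "@opentelemetry/instrumentation";
import { PgInstrumentation } from "@opentelemetry/instrumentation-pg";

registerInstrumentations({
  instrumentations: [
    new PgInstrumentation({
      enhancedDatabaseReporting: true,
      addSqlCommenterCommentToQueries: true,
    }),
  ],
});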

HTTP Client Tracing

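import { registerInstrumentations } from "@opentelemetry/instrumentation";
import { FetchInstrumentation } from "@opentelemetry/instrumentation-fetch";

registerInstrumentations({
  instrumentations: [
    new FetchInstrumentation({
      // Send trace headers to this cross-origin API (example host)
      propagateTraceHeaderCorsUrls: [/^https:\/\/api\.example\.com\//],
    }),
  ],
});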

gRPC Tracing

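import { registerInstrumentations } from "@opentelemetry/instrumentation";
import { GrpcInstrumentation } from "@opentelemetry/instrumentation-grpc";

registerInstrumentations({
  instrumentations: [new GrpcInstrumentation()],
});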

When to Use Distributed Tracing

Use distributed tracing when you need to:

Understand request flow through complex architectures
Find which service causes cascading failures
Trace root causes when errors propagate across boundaries
Identify bottlenecks for performance optimization
Validate service dependencies and communication patterns

When Not to Use Distributed Tracing:

Single monolithic applications (local debugging suffices)
Low-traffic services where logs provide sufficient context
Cases where you only need aggregate metrics (use Prometheus)
Very high-throughput paths where tracing every request costs too much (sample heavily instead)
Systems without clear request boundaries (batch jobs)

Trade-off Analysis

Aspect	Distributed Tracing	Traditional Logging	Metrics Only
Debugging Speed	Minutes (full context)	Hours (manual correlation)	N/A
Storage Cost	High (span data)	Medium (log volume)	Low
Overhead	~1-5% latency	Minimal	Minimal
Root Cause	Full causal chain visible	Requires ID correlation	Aggregates only
Cardinality	High (many traces)	Medium	Low
Error Context	Full request path	Per-service only	None

SLI/SLO/Error Budget Templates for Tracing

Distributed tracing does not typically have traditional SLIs/SLOs since it is qualitative debugging tooling rather than quantitative reliability measurement. However, you can define SLOs around tracing coverage and health.

Trace Health SLI Template

# tracing-sli-config.yaml
service: tracing-observability
environment: production

slis:
  - name: trace_ingestion_success_rate
    description: "Percentage of started traces successfully exported"
    query: |
      sum(rate(otel_exporter_sent_spans_total[5m]))
      /
      sum(rate(otel_span_started_total[5m]))

  - name: trace_context_propagation_success
    description: "Percentage of requests with valid propagated trace context"
    query: |
      sum(rate(otel_trace_context_propagated_total{status="success"}[5m]))
      /
      sum(rate(otel_trace_context_propagated_total[5m]))

  - name: tail_sampling_efficiency
    description: "Percentage of traces retained by tail sampling"
    query: |
      sum(rate(otel_tail_sampling_traces_sampled_total[5m]))
      /
      sum(rate(otel_tail_sampling_traces_evaluated_total[5m]))

  - name: span_error_rate
    description: "Percentage of spans with error status"
    query: |
      sum(rate(otel_span_status_code_total{code="ERROR"}[5m]))
      /
      sum(rate(otel_span_started_total[5m]))

Trace SLO Template

# tracing-slo-config.yaml
objectives:
  - display_name: "Trace Ingestion Availability"
    sli: trace_ingestion_success_rate
    target: 99.5
    window: 30d
    description: "99.5% of started traces should be exported"

  - display_name: "Context Propagation Success"
    sli: trace_context_propagation_success
    target: 99.9
    window: 30d
    description: "99.9% of requests should have valid trace context"

  - display_name: "Tail Sampling Coverage"
    sli: tail_sampling_efficiency
    target: 95.0
    window: 30d
    description: "95% of sampled traces should match sampling policies"

Error Budget Calculator for Tracing

def calculate_tracing_budgets():
    """
    Calculate error budgets for tracing SLOs (30-day window).
    """
    window_minutes = 30 * 24 * 60

    slos = {
        "99.5% (Trace Ingestion)": window_minutes * 0.005,
        "99.9% (Context Propagation)": window_minutes * 0.001,
        "95.0% (Tail Sampling)": window_minutes * 0.050,
    }

    for slo, budget in slos.items():
        print(f"{slo}: {budget:.1f} minutes allowed degradation")
        print(f"  = {budget / 60:.2f} hours")
        print(f"  = {budget / 60 / 24:.2f} days")

calculate_tracing_budgets()

Multi-Window Burn-Rate Alerting for Tracing

While tracing does not have traditional error budgets, you can apply burn-rate concepts to detect tracing infrastructure issues.
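
As a rough illustration of the burn-rate idea: the burn rate is the observed failure ratio divided by the failure ratio the SLO allows. The sketch below assumes the 99.5% ingestion target and 30-day window from the templates above; tracingBurnRate and daysUntilBudgetExhausted are hypothetical helper names.

```typescript
// Burn rate = observed failure ratio / failure ratio the SLO allows.
// A value of 1 spends the budget exactly over the SLO window.
function tracingBurnRate(successRatio: number, sloTarget: number): number {
  const allowedFailure = 1 - sloTarget;
  if (allowedFailure <= 0) {
    throw new RangeError("sloTarget must be below 1");
  }
  return (1 - successRatio) / allowedFailure;
}

// How long the budget lasts if the current burn rate continues.
function daysUntilBudgetExhausted(burnRate: number, windowDays = 30): number {
  return burnRate > 0 ? windowDays / burnRate : Infinity;
}

// 95% of spans exported against a 99.5% target
const currentRate = tracingBurnRate(0.95, 0.995);
console.log(currentRate.toFixed(1), daysUntilBudgetExhausted(currentRate).toFixed(1));
```

Under these assumptions, an ingestion success ratio of 0.95 corresponds to a burn rate of roughly 10, which would consume a 30-day budget in about 3 days.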

Trace Coverage Burn-Rate Alert (1h Window)

# Tracing burn-rate alerts
groups:
  - name: tracing-burn-rate
    rules:
      # Fast burn: Trace ingestion dropping significantly
      - alert: TracingCoverageFastBurn
        expr: |
          (
            sum(rate(otel_exporter_sent_spans_total[1h]))
            /
            sum(rate(otel_span_started_total[1h]))
          )
          < 0.95
        for: 5m
        labels:
          severity: critical
          category: tracing
          window: 1h
        annotations:
          summary: "Trace coverage dropping fast (1h window)"
          description: "Trace ingestion success rate is {{ $value | humanizePercentage }}. Investigate OTel collector health or exporter issues."

Trace Context Propagation Burn-Rate Alert (6h Window)

# Medium burn: Context propagation issues
- alert: TracingContextPropagationBurn
  expr: |
    (
      sum(rate(otel_trace_context_propagated_total{status="failure"}[6h]))
      /
      sum(rate(otel_trace_context_propagated_total[6h]))
    )
    > 0.01
  for: 15m
  labels:
    severity: warning
    category: tracing
    window: 6h
  annotations:
    summary: "Trace context propagation failures (6h window)"
    description: "Context propagation failure rate is {{ $value | humanizePercentage }}. Check service mesh or HTTP middleware configuration."

Multi-Window Trace Health Alert Set

# Complete trace health burn-rate alert
- alert: TracingHealthBurnAllWindows
  expr: |
    (
      sum(rate(otel_exporter_sent_spans_total[1h]))
      /
      sum(rate(otel_span_started_total[1h]))
    )
    < 0.95
    or
    (
      sum(rate(otel_trace_context_propagated_total{status="failure"}[1h]))
      /
      sum(rate(otel_trace_context_propagated_total[1h]))
    )
    > 0.05
  for: 5m
  labels:
    severity: critical
    category: tracing
  annotations:
    summary: "Tracing health degraded across multiple indicators"
    description: |
      One or more tracing health metrics are burning fast.
      Value of the firing indicator: {{ $value | humanizePercentage }}
      Distributed tracing visibility is compromised.

Observability Hooks for Distributed Tracing

This section defines what to log, measure, trace, and alert for tracing systems themselves.

Log (What to Emit)

Event	Fields	Level
Collector started	version, endpoint, exporters	INFO
Exporter failure	exporter_type, error, retry_count	WARN
Sampling decision	sampling_policy, trace_id, decision	DEBUG
Context propagation failure	service, direction, error	WARN
Span queue full	service, queue_size, drop_count	ERROR
Batch export success	exporter, spans_count, bytes	DEBUG

Measure (Metrics to Collect)

Metric	Type	Description
`otel_span_started_total`	Counter	Total spans started
`otel_span_ended_total`	Counter	Total spans ended
`otel_exporter_sent_spans_total`	Counter	Spans successfully exported
`otel_exporter_failed_spans_total`	Counter	Spans that failed to export
`otel_trace_context_propagated_total`	Counter	Context propagation attempts
`otel_tail_sampling_traces_evaluated_total`	Counter	Traces evaluated by tail sampler
`otel_tail_sampling_traces_sampled_total`	Counter	Traces retained by tail sampler
`otel_span_queue_depth`	Gauge	Pending spans in export queue
`otel_collector_receive_latency_seconds`	Histogram	Time to receive spans
`otel_exporter_send_latency_seconds`	Histogram	Time to send to backend

Trace (Correlation Points)

Operation	Trace Attribute	Purpose
Span started	`tracing.otel.version`	Track OTel SDK version
Sampling decision	`tracing.sampling.decision`	Monitor sampling efficiency
Export batch	`tracing.export.batch_size`	Track export efficiency
Context inject/extraction	`tracing.context.direction`	Monitor propagation health

Alert (When to Page)

Alert	Condition	Severity	Purpose
Trace Silence	No spans exported for 5 minutes	P1 Critical	Tracing pipeline down
Export Failure Rate	Export failures > 5% for 5 min	P1 Critical	Data loss imminent
Context Propagation Failure	Propagation failures > 1%	P2 High	Incomplete traces
Span Queue Critical	Queue > 90% capacity	P2 High	Risk of drops
Tail Sampling Bypass	Sampled < expected with high errors	P3 Medium	Sampling misconfigured
Collector Latency	Receive latency > 1s p95	P3 Medium	Performance issue

Tracing Observability Hook Template

# tracing-observability-hooks.yaml
groups:
  - name: tracing-observability-hooks
    rules:
      # Alert on trace silence
      - alert: TracingPipelineSilence
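        # absent() covers the case where the metric disappears entirely
        expr: sum(rate(otel_exporter_sent_spans_total[5m])) == 0 or absent(otel_exporter_sent_spans_total)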
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "No spans being exported (Alert on Silence)"
          description: "OTel collectors are not exporting any spans. Tracing visibility is completely lost."

      # Alert on high export failure rate
      - alert: TracingExportFailuresHigh
        expr: |
          sum(rate(otel_exporter_failed_spans_total[5m]))
          /
          sum(rate(otel_span_started_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Trace export failure rate above 5%"
          description: "{{ $value | humanizePercentage }} of spans are failing to export. Data loss is occurring."

      # Alert on context propagation failures
      - alert: TracingContextPropagationFailure
        expr: |
          sum(rate(otel_trace_context_propagated_total{status="failure"}[5m]))
          /
          sum(rate(otel_trace_context_propagated_total[5m])) > 0.01
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Trace context propagation failure rate above 1%"
          description: "Context propagation is failing. Traces will be broken across service boundaries."

      # Alert on span queue capacity
      - alert: TracingSpanQueueCritical
        expr: otel_span_queue_depth / otel_span_queue_limit > 0.9
        for: 5m
        labels:
          severity: high
        annotations:
          summary: "Span export queue above 90% capacity"
          description: "Span queue is filling up. Risk of memory exhaustion and trace drops."

      # Alert on collector latency
      - alert: TracingCollectorLatencyHigh
        expr: |
          histogram_quantile(0.95,
            sum(rate(otel_collector_receive_latency_seconds_bucket[5m])) by (le)
          ) > 1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "OTel collector receive latency above 1 second"
          description: "P95 collector latency is {{ $value }}s. Traces may be delayed in reaching the backend."

      # SLO burn-rate for trace coverage
      - alert: TracingCoverageBurnRateFast
        expr: |
          (
            sum(rate(otel_exporter_sent_spans_total[1h]))
            /
            sum(rate(otel_span_started_total[1h]))
          )
          < 0.95
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Trace coverage burning at unsustainable rate"
          description: "Trace coverage is at {{ $value | humanizePercentage }}. Investigate exporter health immediately."

Trace Storage and Retention Considerations

Choosing a trace storage backend and defining retention policies are critical decisions for production tracing systems. The storage layer affects query performance, operational costs, and your ability to debug issues after the fact.

Storage Operations

Storage Backend Options

Self-hosted options give you control over data and infrastructure:

Backend	Best For	Limitations
Jaeger with Elasticsearch	Flexible querying, multi-tenant	Operational complexity
Jaeger with Cassandra	High write throughput	Limited query capabilities
Jaeger with Badger	Small-scale, simplicity	Not distributed
Zipkin with Elasticsearch	Basic needs	Fewer features than Jaeger

Managed cloud options reduce operational overhead:

Service	Advantages	Trade-offs
AWS X-Ray	Deep AWS integration	Vendor lock-in
GCP Cloud Trace	Auto-scaling, strong perf	GCP dependency
Azure Application Insights	Full APM features	Azure dependency
Honeycomb	Sophisticated queries	Cost at high volume
Datadog	Comprehensive platform	Expensive at scale

Retention Planning

Trace data follows a lifecycle pattern:

# Tiered retention example
retention_tiers:
  hot_storage:
    duration: 7 days
    sampling: 100% for errors, 10% for normal
    compression: none
    storage: fast SSD

  warm_storage:
    duration: 30 days
    sampling: 100% errors, 1% normal
    compression: lz4
    storage: standard block storage

  cold_storage:
    duration: 1 year
    sampling: errors only
    compression: zstd
    storage: object storage (S3, GCS)

Partitioning Strategies

High-volume trace stores require careful partitioning:

# Elasticsearch index per time window
indices:
  pattern: "traces-{service}-{yyyy.MM.dd}"
  rollovers:
    - max_age: 7d
      max_docs: 50000000
  aliases:
    write: "traces-write"
    read: "traces-read"

Query Performance at Scale

As trace volume grows, query performance degrades without proper optimization:

// Optimize trace queries with date filtering
async function queryTraces(service: string, startTime: Date, endTime: Date) {
  // Always filter by time range first - reduces scan scope
  const query = {
    index: `traces-${service}-*`,
    body: {
      query: {
        bool: {
          // filter clauses skip relevance scoring and can be cached
          filter: [
            { range: { timestamp: { gte: startTime, lte: endTime } } },
            { term: { "service.name": service } },
          ],
        },
      },
      sort: [{ timestamp: "desc" }],
      size: 100, // Limit results
    },
  };

  return elasticsearch.search(query);
}

Reliability

Data Lifecycle Management

Automate data lifecycle to prevent unbounded growth:

# OTel Collector with lifecycle management
exporters:
  otlp/jaeger:
    endpoint: jaeger:4317
    retry_on_failure:
      enabled: true
      initial_interval: 5s
      max_interval: 30s
      max_elapsed_time: 5m

processors:
  # Tag telemetry with a category that storage-side retention rules can key on
  resource:
    attributes:
      - action: upsert
        key: data_category
        value: tracing

  # Batch spans before export (compression is configured on the exporter)
  batch:
    timeout: 10s
    send_batch_size: 8192

Backup and Recovery Considerations

Trace data recovery is often overlooked:

Regular backups: Schedule Elasticsearch snapshots or managed service backups
Point-in-time recovery: Test restoration procedures periodically
Cross-region replication: Replicate critical trace data to secondary region
RTO/RPO planning: Define acceptable downtime and data loss windows for tracing infrastructure

Cost Optimization Patterns

Trace storage costs scale with volume. Optimize with these approaches:

# Cost optimization configuration
processors:
  # Prune low-value span attributes before storage (OTTL keep_keys)
  transform:
    trace_statements:
      - context: span
        statements:
          - keep_keys(attributes, ["service.name", "operation.name", "error"])

  # Regroup spans that share these attributes under one resource to cut duplication
  groupbyattrs:
    keys: ["service.name", "operation.name", "http.status_code"]

  # Bound collector memory; incoming data is refused once the limit is reached
  memory_limiter:
    check_interval: 1s
    limit_mib: 1000
    spike_limit_mib: 200

Multi-Tenant Trace Storage

When serving multiple customers from shared infrastructure:

# Multi-tenant storage isolation
tenants:
  - name: customer-a
    index_prefix: "traces-a"
    retention_days: 30
    quota:
      storage_gb: 100
      queries_per_minute: 60

  - name: customer-b
    index_prefix: "traces-b"
    retention_days: 90
    quota:
      storage_gb: 500
      queries_per_minute: 120

Implement tenant isolation at the query layer to prevent cross-tenant data leakage.
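
A minimal sketch of query-layer isolation follows. It assumes each span also carries a `tenant.name` attribute (a hypothetical attribute name, not a standard one) and that the tenant is resolved from the caller's credentials rather than from request parameters.

```typescript
// Hypothetical tenant record mirroring the configuration above.
interface Tenant {
  name: string;
  indexPrefix: string;
}

interface ScopedQuery {
  index: string;
  filters: Array<Record<string, unknown>>;
}

// Rewrites a caller-supplied query so it can only touch the tenant's own indices.
// The index pattern is derived from the authenticated tenant, never from user input.
function scopeToTenant(
  tenant: Tenant,
  userFilters: Array<Record<string, unknown>>,
): ScopedQuery {
  const tenantFilter = { term: { "tenant.name": tenant.name } };
  return {
    // The trailing "-" keeps "traces-a" from also matching "traces-ab".
    index: `${tenant.indexPrefix}-*`,
    filters: userFilters.concat([tenantFilter]),
  };
}
```

Applying both the index restriction and the attribute filter gives two independent barriers, so a misnamed index alone does not leak data.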

Production Failure Scenarios

Failure	Impact	Mitigation
Trace context not propagated	Incomplete traces; cannot follow requests	Implement propagation in all HTTP clients, message queues, databases
Sampling rate too low	Missing rare but important traces	Use tail-based sampling for errors and slow traces
Trace storage backend overwhelmed	Traces dropped; gaps in visibility	Implement adaptive sampling; scale storage; use compression
Missing span attributes	Cannot filter or group traces meaningfully	Define semantic conventions; require business context in spans
Clock skew between services	Invalid timing data; impossible to correlate	Use NTP synchronization; log trace generation time
OTel collector bottleneck	Spans queue up; memory pressure; drops	Scale collectors horizontally; add batching; monitor queue depth

Common Pitfalls / Anti-Patterns

1. Creating Spans for Everything

Every span has overhead. Do not create spans for every loop iteration or minor function call:

// Bad: Spans for everything
async function processItems(items: Item[]) {
  const span = tracer.startSpan("processItems");
  for (const item of items) {
    const itemSpan = tracer.startSpan("processItem"); // Too granular
    await processItem(item);
    itemSpan.end();
  }
  span.end();
}

// Good: Batch operations as single span
// Note: Promise.all runs the items concurrently, unlike the loop above
async function processItems(items: Item[]) {
  const span = tracer.startSpan("processItems");
  try {
    span.setAttribute("items.count", items.length);
    return await Promise.all(items.map((item) => processItem(item)));
  } finally {
    span.end(); // End the span even if an item fails
  }
}

2. Forgetting to End Spans

A span that is never ended is typically never exported, so the operation silently goes missing from the trace:

// Bad: Span not ended on error path
async function riskyOperation() {
  const span = tracer.startSpan('risky');
  if (condition) {
    throw new Error('condition failed');
  }
  span.end(); // May never execute
}

// Good: Use try/finally
async function riskyOperation() {
  const span = tracer.startSpan('risky');
  try {
    // Work
    span.setStatus({ code: SpanStatusCode.OK });
  } catch (e) {
    span.recordException(e as Error);
    span.setStatus({ code: SpanStatusCode.ERROR });
    throw e; // Rethrow so the caller still sees the failure
  } finally {
    span.end();
  }
}

3. Not Propagating Context Across Async Boundaries

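Context does not follow a call automatically: `tracer.startSpan` creates a span without making it active, so spans started in awaited callees become orphans unless the parent context is passed along: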

// Bad: Context lost
async function outer() {
  const span = tracer.startSpan("outer");
  await inner(); // Span context not passed
  span.end();
}

async function inner() {
  const span = tracer.startSpan("inner"); // Orphan span
  span.end();
}

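// Good: Pass context
// Assumes: import { context, trace, Span } from "@opentelemetry/api";
async function outer() {
  const span = tracer.startSpan("outer");
  await inner(span); // Parent span passed
  span.end();
}

async function inner(parentSpan: Span) {
  // startSpan takes the parent as a Context in its third argument
  const parentContext = trace.setSpan(context.active(), parentSpan);
  const span = tracer.startSpan("inner", undefined, parentContext);
  span.end();
}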

4. Storing Too Much Data in Span Attributes

Span attributes are not a data store. Keep them small and queryable:

// Bad: Large data in attributes
span.setAttribute("response_body", JSON.stringify(largeObject));

// Good: Reference data by ID
span.setAttribute("order_id", order.id);
span.setAttribute("items_count", order.items.length);

5. Ignoring Sampling in High-Volume Services

Unsampled tracing at high volume creates massive overhead:

# OTel Collector tail sampling
processors:
  tail_sampling:
    decision_wait: 10s
    policies:
      - name: errors
        type: status_code
        status_code: { status_codes: [ERROR] }
      - name: slow-traces
        type: latency
        latency: { threshold_ms: 2000 }
      - name: probabilistic
        type: probabilistic
        probabilistic: { sampling_percentage: 1 }
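
The three policies above amount to a simple decision made once the whole trace has been seen. The sketch below restates that logic in TypeScript for illustration only; it is not how the Collector implements it, and `TraceSummary` and `keepTrace` are hypothetical names.

```typescript
// Minimal summary of a completed trace; a hypothetical shape for illustration.
interface TraceSummary {
  traceId: string; // hex string
  hasError: boolean;
  durationMs: number;
}

// Mirrors the three policies: keep errors, keep slow traces, and keep a small
// deterministic percentage of everything else.
function keepTrace(
  trace: TraceSummary,
  thresholdMs: number = 2000,
  samplePercent: number = 1,
): boolean {
  if (trace.hasError) return true;
  if (trace.durationMs > thresholdMs) return true;
  // The last 8 hex chars of the trace ID give a stable bucket in [0, 100).
  const bucket = parseInt(trace.traceId.slice(-8), 16) % 100;
  return bucket < samplePercent;
}
```

Deriving the bucket from the trace ID rather than a fresh random number means every collector instance reaches the same answer for the same trace.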

Observability Checklist

Tracing Coverage

HTTP request/response spans for all API endpoints
Database query spans with statement and duration
External API call spans with URL and status
Message queue publish/consume spans
Background job spans with job ID and outcome
Custom business operation spans with relevant context

Span Attributes

Service name and version
Operation name
Trace ID and span ID
Start time and duration
HTTP: method, URL, status code
DB: system, statement, rows affected
Business: entity IDs, customer tier, transaction amount

Correlation

Trace ID included in all log entries
Trace ID included in metric labels (where appropriate)
Log entries linkable from span events
Metrics aggregatable by trace-derived dimensions

Sampling Configuration

Head-based sampling for consistent baseline (1-10%)
Tail-based sampling for errors (100% of errors)
Tail-based sampling for slow traces (>threshold)
Always sample for tagged critical requests

Security Checklist

Trace data does not include passwords, tokens, or secrets in span attributes
PII not stored in span attributes or events
Trace data encrypted in transit (TLS)
Trace data access logged and audited
Sampling does not inadvertently exclude security-relevant traces
Trace storage has appropriate retention policies
Internal service names not exposed in trace exports to third parties
Trace context headers sanitized before external calls
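
The first items on this list are easiest to enforce at the point where request data becomes span attributes. A minimal sketch, assuming headers are recorded as attributes at all; the deny list and the `redactHeaders` name are illustrative and should be extended for your environment.

```typescript
// Header names that should never be copied into span attributes (illustrative list).
const SENSITIVE_HEADERS = ["authorization", "cookie", "set-cookie", "x-api-key"];

// Returns a copy of the headers that is safe to record on a span.
function redactHeaders(headers: Record<string, string>): Record<string, string> {
  const safe: Record<string, string> = {};
  for (const name of Object.keys(headers)) {
    const isSensitive = SENSITIVE_HEADERS.indexOf(name.toLowerCase()) !== -1;
    safe[name] = isSensitive ? "[REDACTED]" : headers[name];
  }
  return safe;
}
```

An allow list (record only headers you have explicitly approved) is stricter than this deny list and may be the better default for regulated data.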

Interview Questions

1. What is the difference between a trace and a span in distributed tracing?

Expected answer points:

A trace represents the complete end-to-end journey of a single request through all services
A span is a single unit of work within that trace, representing one operation or service call
Spans are organized hierarchically with parent-child relationships forming the trace tree
Each span captures timing, attributes, events, and status about that specific operation

2. How does W3C Trace Context propagation work across service boundaries?

Expected answer points:

Trace context propagates via HTTP headers, most importantly the `traceparent` header
The `traceparent` header contains: version (2 chars), trace ID (32 hex chars), parent ID (16 hex chars), and flags
When service A calls service B, it injects the trace context into outgoing request headers
Service B extracts the context and creates a child span, linking to the parent
The `tracestate` header allows for vendor-specific propagation data
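
The header layout described above can be sketched as a small parser and formatter. This is a dependency-free illustration of the four-field format, not a replacement for the OpenTelemetry propagator; `parseTraceParent` and `childTraceParent` are hypothetical names.

```typescript
interface TraceParent {
  version: string; // 2 hex chars
  traceId: string; // 32 hex chars
  parentId: string; // 16 hex chars
  sampled: boolean; // lowest bit of the flags byte
}

// Parses a four-field traceparent header; returns undefined for malformed input.
function parseTraceParent(header: string): TraceParent | undefined {
  const pattern = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;
  const match = pattern.exec(header.trim());
  if (match === null) return undefined;
  const [, version, traceId, parentId, flags] = match;
  // Version "ff" and all-zero IDs are invalid per the W3C spec.
  if (version === "ff" || /^0+$/.test(traceId) || /^0+$/.test(parentId)) {
    return undefined;
  }
  return { version, traceId, parentId, sampled: (parseInt(flags, 16) & 0x01) === 0x01 };
}

// Builds the header a service sends downstream: same trace ID, with the
// caller's own span ID in the parent position.
function childTraceParent(parent: TraceParent, newSpanId: string): string {
  const flags = parent.sampled ? "01" : "00";
  return `00-${parent.traceId}-${newSpanId}-${flags}`;
}
```

The trace ID stays constant across every hop while the parent ID changes at each one, which is what lets a backend rebuild the span tree.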

3. What are the main components of the OpenTelemetry architecture?

Expected answer points:

Application Code / SDK: Language-specific instrumentation libraries that create spans
Auto-instrumentation: Framework-specific agents that instrument common operations automatically
Collector: Middleware that receives, processes, and exports telemetry data
Exporters: Connectors that send data to backends like Jaeger, Zipkin, or cloud providers
The OTel SDK is vendor-neutral, allowing you to switch backends without code changes

4. Explain head-based sampling vs tail-based sampling. When would you use each?

Expected answer points:

Head-based sampling decides at trace start whether to capture, using probabilistic or rule-based selection
Tail-based sampling captures all spans temporarily, then decides what to keep after the trace completes
Head-based sampling is simpler and has lower memory overhead since you discard early
Tail-based sampling enables intelligent decisions like "keep all errors" or "keep slow traces" after seeing the full picture
Production systems often use both: head-based for consistent baseline sampling, tail-based for targeted capture of important traces
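
To illustrate the head-based half of that answer, here is a minimal sketch of a parent-respecting ratio sampler. It mimics the idea behind parent-based, trace-ID-ratio sampling without using the SDK; the `headSample` name and the choice of the last 8 hex characters are assumptions made for the example.

```typescript
// Head-based decision: made once at the root, then inherited by every child.
function headSample(traceId: string, ratio: number, parentSampled?: boolean): boolean {
  // If an upstream service already decided, follow it so the trace stays whole.
  if (parentSampled !== undefined) return parentSampled;
  // Root span: deterministic ratio check on the trace ID, so the same trace
  // gets the same answer wherever the check runs.
  const value = parseInt(traceId.slice(-8), 16); // 0 .. 2^32 - 1
  return value < ratio * 0x100000000;
}
```

Because the decision happens before any work is done, unsampled requests cost almost nothing, but the sampler cannot know yet whether the request will fail or be slow.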

5. How do you propagate trace context through asynchronous message queues like Kafka?

Expected answer points:

Producer injects trace context into message headers before sending
Context is serialized into headers like `traceparent` using W3C format
Consumer extracts context from message headers and creates a linked span
Use `context.with(extractedContext, () => { ... })` to run handlers within the correct context
This ensures traces span across async boundaries, showing the full request flow even through queues

6. What are semantic conventions for span attributes and why are they important?

Expected answer points:

Semantic conventions are standardized attribute names for common operations (HTTP, DB, messaging)
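Examples: `http.method`, `http.status_code`, `db.system`, `db.statement` (newer semantic-convention releases rename some of these, for example to `http.request.method`, `http.response.status_code`, and `db.query.text`)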
They enable interoperability between instrumentation from different libraries
Backend systems can interpret attributes consistently regardless of instrumentation source
They make traces queryable across your entire system using consistent filter names

7. How would you handle trace context propagation for external API calls that you cannot modify?

Expected answer points:

Use W3C `traceparent` header to propagate context to external services
If the external service supports W3C tracing, spans will be linked automatically
For services that don't propagate headers, create a span representing the external call with relevant attributes
Include the downstream service URL, response status, and duration as span attributes
Add custom attributes for business context even when you cannot instrument the remote service

8. What is the relationship between distributed tracing and the RED method (Rate, Errors, Duration)?

Expected answer points:

RED metrics are derived from trace data aggregated across similar spans
Rate: Request count per second, derived by counting spans per operation over time
Errors: Error rate calculated from spans with error status codes
Duration: Latency percentiles (p50, p95, p99) calculated from span durations
Traces provide the granular data; metrics are the rollup of that data for alerting
Use traces for debugging specific issues, use RED metrics for alerting and dashboards
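
As a rough illustration of that rollup, the sketch below computes RED values from the spans of one operation observed in a time window. The `SpanRecord` shape and function names are hypothetical, and the percentile uses the simple nearest-rank method.

```typescript
interface SpanRecord {
  durationMs: number;
  isError: boolean;
}

interface RedMetrics {
  ratePerSecond: number;
  errorRatio: number;
  p50Ms: number;
  p95Ms: number;
}

// Nearest-rank percentile over an ascending-sorted array.
function percentile(sorted: number[], p: number): number {
  if (sorted.length === 0) return 0;
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(0, rank - 1)];
}

// Rolls up the spans observed in one window into RED metrics.
function redFromSpans(spans: SpanRecord[], windowSeconds: number): RedMetrics {
  const durations = spans.map((s) => s.durationMs).sort((a, b) => a - b);
  const errors = spans.filter((s) => s.isError).length;
  return {
    ratePerSecond: spans.length / windowSeconds,
    errorRatio: spans.length === 0 ? 0 : errors / spans.length,
    p50Ms: percentile(durations, 50),
    p95Ms: percentile(durations, 95),
  };
}
```

If the spans were sampled, the rate has to be scaled by the sampling ratio, and tail sampling that favors errors will skew the error ratio unless the metrics are computed before sampling.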

9. What are the security considerations when implementing distributed tracing?

Expected answer points:

Never include passwords, tokens, or secrets in span attributes or events
Sanitize PII from span attributes before export
Encrypt trace data in transit using TLS
Implement access controls and audit logging for trace data access
Configure sampling to avoid capturing sensitive high-traffic endpoints excessively
Scrub or exclude headers like Authorization before creating spans

10. How would you debug a scenario where traces are being created but not linked across services?

Expected answer points:

Check if trace context is being extracted at service entry points (HTTP middleware)
Verify that context is being injected into outgoing requests
Look for async boundaries where context might be lost (missing context.with)
Check if message queue producers are injecting headers and consumers are extracting them
Verify all HTTP clients and message frameworks are instrumented
Check collector logs for context propagation failures
Ensure sampling decisions are consistent across the trace propagation path

11. What storage backend options exist for distributed traces, and how do you choose between them?

Expected answer points:

Jaeger (Cassandra, Elasticsearch, Badger) - good for self-hosted with flexible querying
Zipkin (Cassandra, Elasticsearch, MySQL) - simpler alternative with basic search
AWS X-Ray (managed) - tight integration with AWS services but vendor lock-in
GCP Cloud Trace (managed) - seamless integration with Google Cloud, scales automatically
Azure Application Insights (managed) - comprehensive APM with built-in analytics
Choice depends on: existing cloud provider, query flexibility needs, operational overhead, cost
For multi-cloud: prefer vendor-neutral backends like Jaeger or self-hosted OTel-compatible storage

12. How do you determine appropriate retention periods for trace data?

Expected answer points:

Retention depends on use case: debugging (hours to days), compliance (months to years), analytics (aggregated indefinitely)
Hot storage (fast query): typically 7-30 days for recent traces
Cold storage (archive): months to years for historical analysis
Consider sampling older data - keep 100% for recent, sample for historical
Cost implications: trace data is voluminous; compression and tiered storage help
Compliance requirements may mandate minimum retention periods
Balance between investigative value and storage costs

13. What are the trade-offs between centralized trace storage and distributed edge storage?

Expected answer points:

Centralized (Jaeger, Zipkin): simpler operations, single query endpoint, potential network latency for upload
Edge storage (X-Ray daemon buffers): resilience to network partitions, reduced upload bandwidth, more complex retrieval
Hybrid approach: buffer at edge, batch upload to central, local fallback during outages
Consider data locality requirements - some regulations mandate data stays in certain regions
Edge buffering prevents data loss during collector downtime but requires disk management
Centralized storage simplifies debugging across services but creates dependency on network

14. How does the OTel Collector handle backpressure when the trace backend is unavailable?

Expected answer points:

OTel Collector has built-in sender functionality with retry mechanisms
When backend is down, spans queue in memory - risk of memory exhaustion under sustained load
Configure the `memory_limiter` processor to refuse incoming data when memory use exceeds the configured limit, which pushes backpressure to senders instead of crashing the Collector
Use persistent queue (disk-backed) for better resilience during backend outages
Exponential backoff with jitter prevents thundering herd when backend recovers
The Collector has no built-in dead letter queue: spans are dropped once `max_elapsed_time` expires or the sending queue fills, so alert on export failures
Monitor queue depth metrics to anticipate potential data loss
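
The backoff-with-jitter point can be sketched as a single function. This uses the "full jitter" variant with defaults that mirror a 5s initial and 30s maximum interval; it is an illustration of the technique, not the Collector's actual retry algorithm, and `backoffDelayMs` is a hypothetical name.

```typescript
// Delay before retry number `attempt` (0-based): a random value between 0 and
// the capped exponential ceiling. `random` is injectable for testing.
function backoffDelayMs(
  attempt: number,
  initialMs: number = 5000,
  maxMs: number = 30000,
  random: () => number = Math.random,
): number {
  const ceiling = Math.min(maxMs, initialMs * Math.pow(2, attempt));
  return Math.floor(random() * ceiling);
}
```

Randomizing the delay spreads retries from many clients over time, so a recovering backend is not hit by all of them at the same instant.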

15. What strategies exist for reducing trace storage costs at scale?

Expected answer points:

Adaptive sampling: lower overall rate, 100% for errors and slow traces
Attribute pruning: remove low-value attributes before storage
Span deduplication: compress similar spans in batch operations
Data tiering: hot storage for recent data, archive/aggregate older data
Compression: use columnar formats (Parquet) that compress well
Trace summarization: keep full traces for errors, aggregated metrics for success paths
TTL enforcement: automatically expire old data based on retention policy

16. How do you implement multi-tenancy in a trace storage system?

Expected answer points:

Tenant isolation via separate indices/tables per customer (Jaeger with Elasticsearch)
Tag-based filtering: all spans tagged with tenant ID, query layer filters
Separate collectors or collector groups per tenant for strict data isolation
Consider data residency requirements - tenants may need data in specific regions
Resource quota enforcement to prevent one tenant from monopolizing storage
Access control: ensure tenants can only query their own trace data
Cost attribution: track storage and query costs per tenant for billing

17. What are the performance implications of trace collection and how do you optimize it?

Expected answer points:

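Trace collection adds overhead: creating a span in the SDK is usually cheap (typically microseconds rather than milliseconds), but the cost adds up on hot paths with many spans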
Batching exporters reduce network overhead by amortizing connection costs
Async export prevents blocking the main request path
SimpleSpanProcessor vs BatchSpanProcessor: batch is more efficient at scale
Collector pipeline: use processors to aggregate and reduce data before export
Network: consider gRPC vs HTTP exporters; gRPC has lower overhead for high volume
Profile in staging to understand actual overhead before production deployment
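
The batching idea behind a batch span processor can be shown with a tiny buffer. This is a simplified sketch, not the SDK's implementation: `SpanBatcher` is a hypothetical class, `exportBatch` stands in for a real exporter, and a real processor would also flush on a timer and bound its queue.

```typescript
// Collects items and hands them to the exporter in groups.
class SpanBatcher<T> {
  private buffer: T[] = [];
  private readonly maxBatchSize: number;
  private readonly exportBatch: (batch: T[]) => void;

  constructor(maxBatchSize: number, exportBatch: (batch: T[]) => void) {
    this.maxBatchSize = maxBatchSize;
    this.exportBatch = exportBatch;
  }

  // Called on the request path: appends, and exports only when the batch is full.
  add(span: T): void {
    this.buffer.push(span);
    if (this.buffer.length >= this.maxBatchSize) this.flush();
  }

  // Called on a timer or at shutdown to send whatever is pending.
  flush(): void {
    if (this.buffer.length === 0) return;
    const batch = this.buffer;
    this.buffer = [];
    this.exportBatch(batch);
  }
}
```

With a batch size of 512, for example, the exporter is invoked once per 512 spans instead of once per span, which is where most of the network savings come from.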

18. How would you design a trace data pipeline for a globally distributed system?

Expected answer points:

Regional collectors ingest locally, then forward to central aggregation
Use load balancing across collectors for horizontal scalability
Implement trace context propagation across regional boundaries
Consider data residency - some regions may require local storage before aggregation
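Global sampling: make sampling decisions consistent across regions (for example, derive them from the trace ID) so a cross-region trace is kept or dropped as a whole rather than fragmented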
Global view requires stitching traces from multiple regions - use consistent trace ID generation
Network design: dedicated links for trace traffic prevent interference with application traffic

19. What monitoring metrics should you track for your trace collection infrastructure?

Expected answer points:

Spans started vs ended (a growing gap indicates span leaks)
Export success/failure rate per backend
Queue depth and memory usage for exporters
Collector receive latency (p50, p95, p99)
Dropped spans count and reason (sampling, queue full, export failure)
Context propagation success/failure rate
Backend query latency for trace retrieval
Set SLOs/SLIs on these metrics and alert on violations

20. How does distributed tracing interact with event-driven architectures and saga patterns?

Expected answer points:

Saga orchestrator creates parent span; each saga step is a child span
Compensation operations (rollbacks) should be spans linked to the original transaction
Event-driven: inject context into message headers, extract in consumers
Choreography-based sagas: use correlation ID linking all related spans
Long-running sagas require sustained context propagation across hours or days
Consider span linking vs parent-based models for saga step relationships
Trace visualization helps identify bottleneck steps in saga execution

Conclusion

Key Takeaways:

Distributed tracing shows complete request journeys across services
OpenTelemetry provides vendor-neutral instrumentation
Always propagate trace context through HTTP headers, queues, and databases
Use semantic conventions for span attributes
Implement tail-based sampling to capture errors and slow traces
Correlate traces with logs and metrics for complete observability

Copy/Paste Checklist:

// Trace context propagation (HTTP)
import { context, propagation, trace, SpanStatusCode } from '@opentelemetry/api';

// The tracer name is a placeholder; use your own service or library name
const tracer = trace.getTracer('my-service');

function injectTraceContext(headers: Record<string, string>) {
  propagation.inject(context.active(), headers);
}

function extractTraceContext(headers: Record<string, string>) {
  return propagation.extract(context.active(), headers);
}

// Manual span with error handling
const span = tracer.startSpan('operation');
try {
  span.setAttribute('entity.id', entityId);
  await doWork();
  span.setStatus({ code: SpanStatusCode.OK });
} catch (e) {
  const err = e as Error;
  span.recordException(err);
  span.setStatus({ code: SpanStatusCode.ERROR, message: err.message });
  throw e; // Rethrow so the caller still sees the failure
} finally {
  span.end();
}

# Tail-based sampling config (OTel Collector YAML)
processors:
  tail_sampling:
    decision_wait: 10s
    num_traces: 100000
    policies:
      - name: errors
        type: status_code
        status_code: { status_codes: [ERROR] }
      - name: slow-traces
        type: latency
        latency: { threshold_ms: 1000 }

Distributed tracing turns opaque microservices into transparent systems where request flows are visible and debugging is systematic. Start by instrumenting your HTTP and database layers with auto-instrumentation, then add custom spans for business operations.

OpenTelemetry provides vendor-neutral instrumentation, so you can switch backends without re-instrumenting. For implementation details, see our Jaeger guide on trace visualization. For metrics correlation, the Prometheus & Grafana guide covers building complete observability pipelines.