Distributed Tracing: Trace Context and OpenTelemetry

Master distributed tracing for microservices. Learn trace context propagation, OpenTelemetry instrumentation, and how to debug request flows across services.

Reading time: 20 min

Distributed Tracing: Trace Context, OpenTelemetry, and Correlation

When a request touches ten services before returning an error, traditional logging tells you each piece in isolation. Distributed tracing connects the pieces, showing you the complete journey of a request through your system.

This guide covers the fundamentals of tracing: trace context propagation, OpenTelemetry instrumentation, and practical correlation patterns.

If you are building microservices, you need distributed tracing.

Why Logs and Metrics Are Not Enough

Logs tell you what happened in a single service. Metrics tell you aggregate patterns. Neither shows causality across service boundaries.

Consider a request that fails after touching five services. With only logs, you search each service’s logs for the trace ID, then manually piece together the sequence. With tracing, you open a single view showing the entire timeline: service A started the request, called B, which called C, D, and E in sequence, and E returned an error that propagated back up.

This turns hours of debugging into minutes.

Core Concepts

Traces and Spans

A trace represents an entire request journey. It contains one or more spans, where each span represents a single operation within that trace.

sequenceDiagram
    participant C as Client
    participant A as API Gateway
    participant O as Order Service
    participant P as Payment Service
    participant N as Notification Service

    C->>A: GET /orders/123
    A->>O: GetOrder(123)
    O->>P: ProcessPayment(order)
    P-->>O: Payment confirmed
    O->>N: SendConfirmation(order)
    N-->>O: Notification sent
    O-->>A: Order details
    A-->>C: Response

Each span captures:

  • Operation name
  • Start and end time
  • Parent span ID (linking)
  • Attributes (key-value metadata)
  • Events (timestamped points within the span)
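
A trace is just a tree of spans sharing one trace ID; the linkage can be sketched with plain records (a toy model, not the OTel API — the sequential IDs are for clarity only):

```typescript
interface SpanRecord {
  traceId: string;
  spanId: string;
  parentSpanId?: string;
  name: string;
  startTime: number;
  attributes: Record<string, string | number | boolean>;
}

let nextId = 0;
const newId = (width: number) =>
  (++nextId).toString(16).padStart(width, "0");

// Start a span; passing a parent links it into the parent's trace
function startSpan(name: string, parent?: SpanRecord): SpanRecord {
  return {
    traceId: parent?.traceId ?? newId(32), // new trace for root spans
    spanId: newId(16),
    parentSpanId: parent?.spanId,
    name,
    startTime: Date.now(),
    attributes: {},
  };
}

const root = startSpan("GET /orders/123");
const child = startSpan("GetOrder", root); // same traceId, parent = root
```

Every backend reconstructs the timeline view from exactly these three fields: trace ID, span ID, and parent span ID.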

Trace Context

Trace context propagates across service boundaries through HTTP headers. When service A calls service B, it passes trace context in headers. Service B creates a child span using that context, ensuring the spans stay connected in a single trace.

The W3C Trace Context specification standardizes these headers:

traceparent: 00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01
tracestate: congo=t61rcWkgMzE

The traceparent header contains:

  • Version (00)
  • Trace ID (32 hex characters)
  • Parent ID (16 hex characters)
  • Flags (01 = sampled)
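
The format is strict enough to validate by hand, which helps when debugging propagation issues. A small parser with no OTel dependency (a sketch; field names in the interface are my own):

```typescript
interface TraceParent {
  version: string;
  traceId: string;  // 32 hex chars
  parentId: string; // 16 hex chars
  sampled: boolean; // low bit of the flags byte
}

// Parse a W3C traceparent header; returns null if malformed
function parseTraceParent(header: string): TraceParent | null {
  const m = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/.exec(
    header,
  );
  if (!m) return null;
  const [, version, traceId, parentId, flags] = m;
  // All-zero trace or parent IDs are invalid per the spec
  if (/^0+$/.test(traceId) || /^0+$/.test(parentId)) return null;
  return {
    version,
    traceId,
    parentId,
    sampled: (parseInt(flags, 16) & 1) === 1,
  };
}
```

Feeding it the header above yields the trace ID `0af7651916cd43dd8448eb211c80319c` with `sampled` set.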

OpenTelemetry Architecture

OpenTelemetry (OTel) is the open standard for observability. It gives you APIs, SDKs, and instrumentation for collecting traces, metrics, and logs.

graph TB
    subgraph "Application Code"
        A[Your Service]
        B[OTel SDK]
        C[Language-specific auto-instrumentation]
    end

    subgraph "Exporters"
        D[OTLP Exporter]
        E[Jaeger Exporter]
        F[Zipkin Exporter]
    end

    subgraph "Collecting Infrastructure"
        G[OTel Collector]
        H[Jaeger]
        I[Zipkin]
    end

    A --> B
    B --> C
    B --> D
    D --> G
    G --> H
    G --> I

OTel Collector

The OTel Collector receives, processes, and exports telemetry data. Think of it as middleware between your application and your observability backend.

# otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:
    timeout: 5s
    send_batch_size: 1024
  memory_limiter:
    check_interval: 1s
    limit_percentage: 90

exporters:
  otlp:
    endpoint: jaeger-collector:4317
    tls:
      insecure: false
      cert_file: /certs/cert.pem
      key_file: /certs/key.pem

  prometheus:
    endpoint: "0.0.0.0:8889"

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlp]
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [prometheus]

Manual Instrumentation

Auto-instrumentation covers many frameworks automatically, but you need manual instrumentation for business-specific operations and custom spans.

Starting Traces

import { trace, SpanStatusCode } from "@opentelemetry/api";

const tracer = trace.getTracer("order-service", "1.0.0");

async function createOrder(orderData: OrderData): Promise<Order> {
  const span = tracer.startSpan("OrderService.createOrder", {
    attributes: {
      "order.customer_id": orderData.customerId,
      "order.item_count": orderData.items.length,
      "order.total": orderData.total,
    },
  });

  try {
    const order = await db.orders.create(orderData);
    span.setStatus({ code: SpanStatusCode.OK });
    return order;
  } catch (error) {
    span.recordException(error as Error);
    span.setStatus({
      code: SpanStatusCode.ERROR,
      message: (error as Error).message,
    });
    throw error;
  } finally {
    span.end();
  }
}

Creating Child Spans

Wrap nested operations as child spans:

async function processPayment(
  orderId: string,
  payment: Payment,
): Promise<PaymentResult> {
  // The active span becomes the parent automatically: startSpan
  // defaults to the active context (SpanOptions has no "parent" field
  // in the OTel JS API)
  const paymentSpan = tracer.startSpan("PaymentService.process", {
    attributes: {
      "payment.method": payment.method,
      "payment.amount": payment.amount,
      "payment.currency": payment.currency,
    },
  });

  try {
    // Verify card
    await verifyCard(payment.card, paymentSpan);

    // Charge
    const result = await chargeCard(payment, paymentSpan);

    paymentSpan.setAttribute("payment.transaction_id", result.transactionId);
    paymentSpan.setStatus({ code: SpanStatusCode.OK });
    return result;
  } catch (error) {
    paymentSpan.recordException(error as Error);
    paymentSpan.setStatus({
      code: SpanStatusCode.ERROR,
      message: (error as Error).message,
    });
    throw error;
  } finally {
    paymentSpan.end();
  }
}

async function verifyCard(card: Card, parentSpan: Span): Promise<void> {
  // Link to an explicit parent by putting it into the context
  // passed as startSpan's third argument
  const ctx = trace.setSpan(context.active(), parentSpan);
  const span = tracer.startSpan(
    "PaymentService.verifyCard",
    {
      attributes: {
        "card.type": card.type,
        "card.last_four": card.lastFour,
      },
    },
    ctx,
  );

  // Verification logic
  await api.verifyCard(card);

  span.end();
}

Context Propagation

Proper context propagation connects spans across service boundaries. Without it, spans become orphaned and useless for debugging.

HTTP Propagation

Middleware that extracts incoming trace context and propagates it to downstream calls:

import { trace, context, propagation } from "@opentelemetry/api";

function httpMiddleware(req, res, next) {
  // Extract context from incoming headers
  const extractedContext = propagation.extract(context.active(), req.headers);

  // Run the rest of the request handler within that context
  context.with(extractedContext, () => {
    // All spans created here are linked to the incoming trace
    next();
  });
}

// When making outgoing requests, inject context into headers
async function callDownstreamService(url: string, data: any): Promise<any> {
  const headers = {};
  propagation.inject(context.active(), headers);

  return fetch(url, {
    method: "POST",
    headers: {
      ...headers,
      "Content-Type": "application/json",
    },
    body: JSON.stringify(data),
  });
}

Messaging Propagation

Propagate context through message queues so spans stay connected even with asynchronous processing:

import { trace, context, propagation, SpanStatusCode } from "@opentelemetry/api";

// Producer: inject context into message
async function sendOrderCreatedEvent(order: Order): Promise<void> {
  const headers: Record<string, string> = {};
  propagation.inject(context.active(), headers);

  await kafka.send({
    topic: "order.created",
    messages: [
      {
        key: order.id,
        value: JSON.stringify(order),
        headers: headers,
      },
    ],
  });
}

// Consumer: extract context from message and create linked span
async function handleOrderCreated(message: KafkaMessage): Promise<void> {
  const extractedContext = propagation.extract(
    context.active(),
    message.headers,
  );

  await context.with(extractedContext, async () => {
    const span = tracer.startSpan("OrderConsumer.handleOrderCreated");
    try {
      const order = JSON.parse(message.value.toString());
      await processOrder(order);
      span.setStatus({ code: SpanStatusCode.OK });
    } catch (error) {
      span.recordException(error as Error);
      span.setStatus({ code: SpanStatusCode.ERROR });
      throw error;
    } finally {
      span.end();
    }
  });
}

Adding Business Context

Rich span attributes turn traces from timing diagrams into debugging tools.

Semantic Attributes

Use standard attribute names for common data:

// HTTP attributes
span.setAttribute("http.method", "POST");
span.setAttribute("http.url", "https://api.example.com/orders");
span.setAttribute("http.status_code", 201);
span.setAttribute("http.response_content_length", 1024);

// Database attributes
span.setAttribute("db.system", "postgresql");
span.setAttribute("db.name", "orders_db");
span.setAttribute("db.statement", "SELECT * FROM orders WHERE id = $1");
span.setAttribute("db.operation", "SELECT");

// Messaging attributes
span.setAttribute("messaging.system", "kafka");
span.setAttribute("messaging.destination", "order.created");
span.setAttribute("messaging.operation", "publish");

Custom Business Attributes

Add domain-specific context:

span.setAttribute("order.id", order.id);
span.setAttribute("order.status", order.status);
span.setAttribute("order.customer_tier", customer.tier);
span.setAttribute("order.is_first_purchase", customer.orderCount === 0);

These attributes let you filter traces by business properties: find all traces for premium customers, or analyze timing for first-time purchasers.
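
A small helper keeps business attribute keys consistent and guards against oversized values (a sketch; the prefix convention and 256-character cap are my own choices, not OTel requirements):

```typescript
type AttrValue = string | number | boolean;

const MAX_VALUE_LENGTH = 256; // arbitrary cap to keep spans small

// Normalize business attributes: prefix keys, truncate long strings
function businessAttributes(
  prefix: string,
  attrs: Record<string, AttrValue>,
): Record<string, AttrValue> {
  const out: Record<string, AttrValue> = {};
  for (const [key, value] of Object.entries(attrs)) {
    out[`${prefix}.${key}`] =
      typeof value === "string" && value.length > MAX_VALUE_LENGTH
        ? value.slice(0, MAX_VALUE_LENGTH)
        : value;
  }
  return out;
}

// Usage:
// span.setAttributes(
//   businessAttributes("order", { id: order.id, status: order.status }),
// );
```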

Correlation with Logs and Metrics

Traces work best when linked to your logs and metrics.

Trace-Log Correlation

Include trace ID in logs:

import { trace } from "@opentelemetry/api";

function logInfo(message: string, data?: Record<string, unknown>): void {
  const span = trace.getActiveSpan();
  const traceId = span?.spanContext().traceId;

  const logEntry = {
    timestamp: new Date().toISOString(),
    level: "INFO",
    message,
    traceId,
    ...data,
  };

  console.log(JSON.stringify(logEntry));
}

// Now every log entry includes the trace ID
logInfo("Order created successfully", { orderId: "ord_123" });
// {"timestamp":"2026-03-22T14:30:00Z","level":"INFO","message":"Order created successfully","traceId":"abc123...","orderId":"ord_123"}

Trace-Metric Correlation

Link metrics to traces through span events:

import { metrics } from "@opentelemetry/api";

const meter = metrics.getMeter("payment-service");

const paymentDuration = meter.createHistogram("payment.duration", {
  unit: "ms",
  description: "Payment processing duration",
});

async function processPayment(payment: Payment): Promise<void> {
  const span = tracer.startSpan("PaymentService.process");

  const startTime = Date.now();
  try {
    await doPayment(payment);
    paymentDuration.record(Date.now() - startTime, {
      "payment.method": payment.method,
      "payment.status": "success",
    });
  } catch (error) {
    paymentDuration.record(Date.now() - startTime, {
      "payment.method": payment.method,
      "payment.status": "failure",
    });
    throw error;
  } finally {
    span.end();
  }
}

Sampling Strategies

At high traffic, you cannot capture every trace. Sampling reduces volume while preserving useful data.

Common Sampling Strategies

Head-based sampling decides at trace start whether to capture:

import { NodeTracerProvider } from "@opentelemetry/sdk-trace-node";
import {
  ParentBasedSampler,
  TraceIdRatioBasedSampler,
} from "@opentelemetry/sdk-trace-base";

// Sample 1% of traces at the root; child spans inherit the parent's
// decision so traces are never partially sampled. Head-based sampling
// cannot special-case errors (they happen after the decision) --
// that is what tail-based sampling is for.
const provider = new NodeTracerProvider({
  sampler: new ParentBasedSampler({
    root: new TraceIdRatioBasedSampler(0.01),
  }),
});

Tail-based sampling captures all spans temporarily, then decides what to keep:

# OTel Collector tail-based sampling
processors:
  tail_sampling:
    decision_wait: 10s
    num_traces: 100000
    policies:
      - name: errors
        type: status_code
        status_code: { status_codes: [ERROR] }
      - name: slow-traces
        type: latency
        latency: { threshold_ms: 1000 }
      - name: probabilistic
        type: probabilistic
        probabilistic: { sampling_percentage: 10 }
      - name: keep-all
        type: always_sample

This captures slow traces, errors, and a percentage of everything else.

Visualization with Jaeger

Jaeger is a popular distributed tracing backend. It stores traces and provides a UI for exploration.

Key Jaeger Views

Search: Find traces by service, operation, time range, or tags.

Trace Detail: The flame graph showing all spans in a trace with timing information.

Span Detail: The attributes, events, and logs attached to a specific span.

Analyzing Trace Flame Graphs

A flame graph shows the parent-child span relationships:

order-service.createOrder (2.3s)
├── auth-service.validateToken (50ms)
├── inventory-service.checkStock (150ms)
│   └── external-partner.getAvailability (120ms)
├── payment-service.process (1.8s)
│   ├── fraud-check.analyze (400ms)
│   │   └── external-api.call (380ms)
│   └── payment-gateway.charge (1.2s)
│       └── external-bank.authorize (1.1s)
├── notification-service.send (100ms)
└── db.orders.insert (30ms)

Long spans are easy to spot. Here, payment-service.process dominates. Drilling in, payment-gateway.charge is the bottleneck. Further still, external-bank.authorize is where time is spent.
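
That drill-down — follow the longest child at each level — can be automated over plain span records (a toy sketch over a duration tree, not a Jaeger API):

```typescript
interface SpanNode {
  name: string;
  durationMs: number;
  children: SpanNode[];
}

// Walk the tree, taking the slowest child at each level,
// to surface the bottleneck chain
function bottleneckPath(span: SpanNode): string[] {
  const path = [span.name];
  let node = span;
  while (node.children.length > 0) {
    node = node.children.reduce((a, b) =>
      b.durationMs > a.durationMs ? b : a,
    );
    path.push(node.name);
  }
  return path;
}
```

Run against the tree above, this returns the chain ending at external-bank.authorize.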

Common Patterns

Database Query Tracing

// Auto-instrumentation patches the pg client to emit spans
import { PgInstrumentation } from "@opentelemetry/instrumentation-pg";

new PgInstrumentation({
  enhancedDatabaseReporting: true,
  addSqlCommenterCommentToQueries: true,
});

HTTP Client Tracing

import { FetchInstrumentation } from "@opentelemetry/instrumentation-fetch";

new FetchInstrumentation({
  // Propagate trace headers to matching cross-origin URLs
  propagateTraceHeaderCorsUrls: [/^https:\/\/api\.example\.com/],
});

gRPC Tracing

import { GrpcInstrumentation } from "@opentelemetry/instrumentation-grpc";

new GrpcInstrumentation();

When to Use Distributed Tracing

When to Use Distributed Tracing:

  • Debugging latency issues across multiple microservices
  • Understanding request flow through complex architectures
  • Finding which service causes cascading failures
  • Root cause analysis when errors propagate across boundaries
  • Performance optimization by identifying bottlenecks
  • Validating service dependencies and communication patterns

When Not to Use Distributed Tracing:

  • Single monolithic applications (local debugging suffices)
  • Low-traffic services where logs provide sufficient context
  • When you only need aggregate metrics (use Prometheus)
  • Very high-throughput paths where tracing overhead matters (use sampling)
  • Systems without clear request boundaries (batch jobs)

Trade-off Analysis

| Aspect | Distributed Tracing | Traditional Logging | Metrics Only |
| --- | --- | --- | --- |
| Debugging Speed | Minutes (full context) | Hours (manual correlation) | N/A |
| Storage Cost | High (span data) | Medium (log volume) | Low |
| Overhead | ~1-5% latency | Minimal | Minimal |
| Root Cause | Clear causal chain | Requires correlation IDs | Aggregates only |
| Cardinality | High (many traces) | Medium | Low |
| Error Context | Full request path | Per-service only | None |

SLI/SLO/Error Budget Templates for Tracing

Distributed tracing does not typically have traditional SLIs/SLOs since it is qualitative debugging tooling rather than quantitative reliability measurement. However, you can define SLOs around tracing coverage and health.

Trace Health SLI Template

# tracing-sli-config.yaml
service: tracing-observability
environment: production

slis:
  - name: trace_ingestion_success_rate
    description: "Percentage of started traces successfully exported"
    query: |
      sum(rate(otel_exporter_sent_spans_total[5m]))
      /
      sum(rate(otel_span_started_total[5m]))

  - name: trace_context_propagation_success
    description: "Percentage of requests with valid propagated trace context"
    query: |
      sum(rate(otel_trace_context_propagated_total{status="success"}[5m]))
      /
      sum(rate(otel_trace_context_propagated_total[5m]))

  - name: tail_sampling_efficiency
    description: "Percentage of traces retained by tail sampling"
    query: |
      sum(rate(otel_tail_sampling_traces_sampled_total[5m]))
      /
      sum(rate(otel_tail_sampling_traces_evaluated_total[5m]))

  - name: span_error_rate
    description: "Percentage of spans with error status"
    query: |
      sum(rate(otel_span_status_code_total{code="ERROR"}[5m]))
      /
      sum(rate(otel_span_started_total[5m]))

Trace SLO Template

# tracing-slo-config.yaml
objectives:
  - display_name: "Trace Ingestion Availability"
    sli: trace_ingestion_success_rate
    target: 99.5
    window: 30d
    description: "99.5% of started traces should be exported"

  - display_name: "Context Propagation Success"
    sli: trace_context_propagation_success
    target: 99.9
    window: 30d
    description: "99.9% of requests should have valid trace context"

  - display_name: "Tail Sampling Coverage"
    sli: tail_sampling_efficiency
    target: 95.0
    window: 30d
    description: "95% of sampled traces should match sampling policies"

Error Budget Calculator for Tracing

def calculate_tracing_budgets():
    """
    Calculate error budgets for tracing SLOs (30-day window).
    """
    window_minutes = 30 * 24 * 60

    slos = {
        "99.5% (Trace Ingestion)": window_minutes * 0.005,
        "99.9% (Context Propagation)": window_minutes * 0.001,
        "95.0% (Tail Sampling)": window_minutes * 0.050,
    }

    for slo, budget in slos.items():
        print(f"{slo}: {budget:.1f} minutes allowed degradation")
        print(f"  = {budget / 60:.2f} hours")
        print(f"  = {budget / 60 / 24:.2f} days")

calculate_tracing_budgets()

Multi-Window Burn-Rate Alerting for Tracing

While tracing does not have traditional error budgets, you can apply burn-rate concepts to detect tracing infrastructure issues.

Trace Coverage Burn-Rate Alert (1h Window)

# Tracing burn-rate alerts
groups:
  - name: tracing-burn-rate
    rules:
      # Fast burn: Trace ingestion dropping significantly
      - alert: TracingCoverageFastBurn
        expr: |
          (
            sum(rate(otel_exporter_sent_spans_total[1h]))
            /
            sum(rate(otel_span_started_total[1h]))
          )
          < 0.95
        for: 5m
        labels:
          severity: critical
          category: tracing
          window: 1h
        annotations:
          summary: "Trace coverage dropping fast (1h window)"
          description: "Trace ingestion success rate is {{ $value | humanizePercentage }}. Investigate OTel collector health or exporter issues."

Trace Context Propagation Burn-Rate Alert (6h Window)

# Medium burn: Context propagation issues
- alert: TracingContextPropagationBurn
  expr: |
    (
      sum(rate(otel_trace_context_propagated_total{status="failure"}[6h]))
      /
      sum(rate(otel_trace_context_propagated_total[6h]))
    )
    > 0.01
  for: 15m
  labels:
    severity: warning
    category: tracing
    window: 6h
  annotations:
    summary: "Trace context propagation failures (6h window)"
    description: "Context propagation failure rate is {{ $value | humanizePercentage }}. Check service mesh or HTTP middleware configuration."

Multi-Window Trace Health Alert Set

# Complete trace health burn-rate alert
- alert: TracingHealthBurnAllWindows
  expr: |
    (
      sum(rate(otel_exporter_sent_spans_total[1h]))
      /
      sum(rate(otel_span_started_total[1h]))
    )
    < 0.95
    or
    (
      sum(rate(otel_trace_context_propagated_total{status="failure"}[1h]))
      /
      sum(rate(otel_trace_context_propagated_total[1h]))
    )
    > 0.05
  for: 5m
  labels:
    severity: critical
    category: tracing
  annotations:
    summary: "Tracing health degraded across multiple indicators"
    description: |
      One or more tracing health metrics are burning fast.
      Trace coverage: {{ printf "%.2f" (index $values "0" | value)) }}
      Propagation failures: {{ printf "%.2f" (index $values "1" | value)) }}
      Distributed tracing visibility is compromised.

Observability Hooks for Distributed Tracing

This section defines what to log, measure, trace, and alert for tracing systems themselves.

Log (What to Emit)

| Event | Fields | Level |
| --- | --- | --- |
| Collector started | version, endpoint, exporters | INFO |
| Exporter failure | exporter_type, error, retry_count | WARN |
| Sampling decision | sampling_policy, trace_id, decision | DEBUG |
| Context propagation failure | service, direction, error | WARN |
| Span queue full | service, queue_size, drop_count | ERROR |
| Batch export success | exporter, spans_count, bytes | DEBUG |

Measure (Metrics to Collect)

| Metric | Type | Description |
| --- | --- | --- |
| otel_span_started_total | Counter | Total spans started |
| otel_span_ended_total | Counter | Total spans ended |
| otel_exporter_sent_spans_total | Counter | Spans successfully exported |
| otel_exporter_failed_spans_total | Counter | Spans that failed to export |
| otel_trace_context_propagated_total | Counter | Context propagation attempts |
| otel_tail_sampling_traces_evaluated_total | Counter | Traces evaluated by tail sampler |
| otel_tail_sampling_traces_sampled_total | Counter | Traces retained by tail sampler |
| otel_span_queue_depth | Gauge | Pending spans in export queue |
| otel_collector_receive_latency_seconds | Histogram | Time to receive spans |
| otel_exporter_send_latency_seconds | Histogram | Time to send to backend |

Trace (Correlation Points)

| Operation | Trace Attribute | Purpose |
| --- | --- | --- |
| Span started | tracing.otel.version | Track OTel SDK version |
| Sampling decision | tracing.sampling.decision | Monitor sampling efficiency |
| Export batch | tracing.export.batch_size | Track export efficiency |
| Context inject/extraction | tracing.context.direction | Monitor propagation health |

Alert (When to Page)

| Alert | Condition | Severity | Purpose |
| --- | --- | --- | --- |
| Trace Silence | No spans exported for 5 minutes | P1 Critical | Tracing pipeline down |
| Export Failure Rate | Export failures > 5% for 5 min | P1 Critical | Data loss imminent |
| Context Propagation Failure | Propagation failures > 1% | P2 High | Incomplete traces |
| Span Queue Critical | Queue > 90% capacity | P2 High | Risk of drops |
| Tail Sampling Bypass | Sampled < expected with high errors | P3 Medium | Sampling misconfigured |
| Collector Latency | Receive latency > 1s p95 | P3 Medium | Performance issue |

Tracing Observability Hook Template

# tracing-observability-hooks.yaml
groups:
  - name: tracing-observability-hooks
    rules:
      # Alert on trace silence
      - alert: TracingPipelineSilence
        expr: sum(rate(otel_exporter_sent_spans_total[5m])) == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "No spans being exported (Alert on Silence)"
          description: "OTel collectors are not exporting any spans. Tracing visibility is completely lost."

      # Alert on high export failure rate
      - alert: TracingExportFailuresHigh
        expr: |
          sum(rate(otel_exporter_failed_spans_total[5m]))
          /
          sum(rate(otel_span_started_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Trace export failure rate above 5%"
          description: "{{ $value | humanizePercentage }} of spans are failing to export. Data loss is occurring."

      # Alert on context propagation failures
      - alert: TracingContextPropagationFailure
        expr: |
          sum(rate(otel_trace_context_propagated_total{status="failure"}[5m]))
          /
          sum(rate(otel_trace_context_propagated_total[5m])) > 0.01
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Trace context propagation failure rate above 1%"
          description: "Context propagation is failing. Traces will be broken across service boundaries."

      # Alert on span queue capacity
      - alert: TracingSpanQueueCritical
        expr: otel_span_queue_depth / otel_span_queue_limit > 0.9
        for: 5m
        labels:
          severity: high
        annotations:
          summary: "Span export queue above 90% capacity"
          description: "Span queue is filling up. Risk of memory exhaustion and trace drops."

      # Alert on collector latency
      - alert: TracingCollectorLatencyHigh
        expr: |
          histogram_quantile(0.95,
            sum(rate(otel_collector_receive_latency_seconds_bucket[5m])) by (le)
          ) > 1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "OTel collector receive latency above 1 second"
          description: "P95 collector latency is {{ $value }}s. Traces may be delayed in reaching the backend."

      # SLO burn-rate for trace coverage
      - alert: TracingCoverageBurnRateFast
        expr: |
          (
            sum(rate(otel_exporter_sent_spans_total[1h]))
            /
            sum(rate(otel_span_started_total[1h]))
          )
          < 0.95
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Trace coverage burning at unsustainable rate"
          description: "Trace coverage is at {{ $value | humanizePercentage }}. Investigate exporter health immediately."

Production Failure Scenarios

| Failure | Impact | Mitigation |
| --- | --- | --- |
| Trace context not propagated | Incomplete traces; cannot follow requests | Implement propagation in all HTTP clients, message queues, databases |
| Sampling rate too low | Missing rare but important traces | Use tail-based sampling for errors and slow traces |
| Trace storage backend overwhelmed | Traces dropped; gaps in visibility | Implement adaptive sampling; scale storage; use compression |
| Missing span attributes | Cannot filter or group traces meaningfully | Define semantic conventions; require business context in spans |
| Clock skew between services | Invalid timing data; impossible to correlate | Use NTP synchronization; log trace generation time |
| OTel collector bottleneck | Spans queue up; memory pressure; drops | Scale collectors horizontally; add batching; monitor queue depth |

Observability Checklist

Tracing Coverage

  • HTTP request/response spans for all API endpoints
  • Database query spans with statement and duration
  • External API call spans with URL and status
  • Message queue publish/consume spans
  • Background job spans with job ID and outcome
  • Custom business operation spans with relevant context

Span Attributes

  • Service name and version
  • Operation name
  • Trace ID and span ID
  • Start time and duration
  • HTTP: method, URL, status code
  • DB: system, statement, rows affected
  • Business: entity IDs, customer tier, transaction amount

Correlation

  • Trace ID included in all log entries
  • Trace ID included in metric labels (where appropriate)
  • Log entries linkable from span events
  • Metrics aggregatable by trace-derived dimensions

Sampling Configuration

  • Head-based sampling for consistent baseline (1-10%)
  • Tail-based sampling for errors (100% of errors)
  • Tail-based sampling for slow traces (>threshold)
  • Always sample for tagged critical requests

Security Checklist

  • Trace data does not include passwords, tokens, or secrets in span attributes
  • PII not stored in span attributes or events
  • Trace data encrypted in transit (TLS)
  • Trace data access logged and audited
  • Sampling does not inadvertently exclude security-relevant traces
  • Trace storage has appropriate retention policies
  • Internal service names not exposed in trace exports to third parties
  • Trace context headers sanitized before external calls
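
One way to enforce the first two items is a redaction pass over attributes before they leave the process (a sketch; the key patterns are illustrative, not a complete denylist, and in production this is often done centrally in the OTel Collector's attribute-processing pipeline instead):

```typescript
// Keys that look like they carry credentials or secrets
const SENSITIVE_KEY = /password|secret|token|authorization|api[_-]?key/i;

// Replace values of sensitive-looking attribute keys before export
function redactAttributes(
  attrs: Record<string, unknown>,
): Record<string, unknown> {
  const out: Record<string, unknown> = {};
  for (const [key, value] of Object.entries(attrs)) {
    out[key] = SENSITIVE_KEY.test(key) ? "[REDACTED]" : value;
  }
  return out;
}
```

Run this over attribute maps at span creation, or wire the same logic into a custom exporter wrapper.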

Common Pitfalls / Anti-Patterns

1. Creating Spans for Everything

Every span has overhead. Do not create spans for every loop iteration or minor function call:

// Bad: Spans for everything
async function processItems(items: Item[]) {
  const span = tracer.startSpan("processItems");
  for (const item of items) {
    const itemSpan = tracer.startSpan("processItem"); // Too granular
    await processItem(item);
    itemSpan.end();
  }
  span.end();
}

// Good: Batch operations as single span
async function processItems(items: Item[]) {
  const span = tracer.startSpan("processItems");
  const results = await Promise.all(items.map((item) => processItem(item)));
  span.setAttribute("items.count", items.length);
  span.end();
  return results;
}

2. Forgetting to End Spans

Unfinished spans remain open and appear as ongoing operations:

// Bad: Span not ended on error path
async function riskyOperation() {
  const span = tracer.startSpan('risky');
  if (condition) {
    throw new Error('condition failed');
  }
  span.end(); // May never execute
}

// Good: Use try/finally
async function riskyOperation() {
  const span = tracer.startSpan('risky');
  try {
    // Work
    span.setStatus({ code: SpanStatusCode.OK });
  } catch (e) {
    span.recordException(e as Error);
    span.setStatus({ code: SpanStatusCode.ERROR });
    throw e;
  } finally {
    span.end();
  }
}

3. Not Propagating Context Across Async Boundaries

Async operations lose trace context without explicit propagation:

// Bad: Context lost
async function outer() {
  const span = tracer.startSpan("outer");
  await inner(); // Span context not passed
  span.end();
}

async function inner() {
  const span = tracer.startSpan("inner"); // Orphan span
  span.end();
}

// Good: Pass context
async function outer() {
  const span = tracer.startSpan("outer");
  await inner(span); // Parent context passed
  span.end();
}

async function inner(parentSpan: Span) {
  // OTel JS takes the parent via context, not a "parent" option
  const ctx = trace.setSpan(context.active(), parentSpan);
  const span = tracer.startSpan("inner", undefined, ctx);
  span.end();
}

4. Storing Too Much Data in Span Attributes

Span attributes are not a data store. Keep them small and queryable:

// Bad: Large data in attributes
span.setAttribute("response_body", JSON.stringify(largeObject));

// Good: Reference data by ID
span.setAttribute("order_id", order.id);
span.setAttribute("items_count", order.items.length);

5. Ignoring Sampling in High-Volume Services

Unsampled tracing at high volume creates massive overhead:

# OTel Collector tail sampling
processors:
  tail_sampling:
    decision_wait: 10s
    policies:
      - name: errors
        type: status_code
        status_code: { status_codes: [ERROR] }
      - name: slow-traces
        type: latency
        latency: { threshold_ms: 2000 }
      - name: probabilistic
        type: probabilistic
        probabilistic: { sampling_percentage: 1 }

Quick Recap

Key Takeaways:

  • Distributed tracing shows complete request journeys across services
  • OpenTelemetry provides vendor-neutral instrumentation
  • Always propagate trace context through HTTP headers, queues, and databases
  • Use semantic conventions for span attributes
  • Implement tail-based sampling to capture errors and slow traces
  • Correlate traces with logs and metrics for complete observability

Copy/Paste Checklist:

// Trace context propagation (HTTP)
import { propagation, context } from '@opentelemetry/api';

function injectTraceContext(headers: Record<string, string>) {
  propagation.inject(context.active(), headers);
}

function extractTraceContext(headers: Record<string, string>) {
  return propagation.extract(context.active(), headers);
}

// Manual span with error handling
const span = tracer.startSpan('operation');
try {
  span.setAttribute('entity.id', entityId);
  await doWork();
  span.setStatus({ code: SpanStatusCode.OK });
} catch (e) {
  span.recordException(e as Error);
  span.setStatus({ code: SpanStatusCode.ERROR, message: (e as Error).message });
  throw e;
} finally {
  span.end();
}

// Tail-based sampling config
processors:
  tail_sampling:
    decision_wait: 10s
    num_traces: 100000
    policies:
      - name: errors
        type: status_code
        status_code: { status_codes: [ERROR] }
      - name: slow-traces
        type: latency
        latency: { threshold_ms: 1000 }

Conclusion

Distributed tracing turns opaque microservices into transparent systems where request flows are visible and debugging is systematic. Start by instrumenting your HTTP and database layers with auto-instrumentation, then add custom spans for business operations.

OpenTelemetry provides vendor-neutral instrumentation, so you can switch backends without re-instrumenting. For implementation details, see our Jaeger guide on trace visualization. For metrics correlation, the Prometheus & Grafana guide covers building complete observability pipelines.
