Jaeger: Distributed Tracing for Microservices

Learn Jaeger for distributed tracing visualization. Covers trace analysis, dependency mapping, and integration with OpenTelemetry.

Jaeger is an open-source distributed tracing system used to monitor and troubleshoot microservices. It gives you visibility into request flows, latency analysis, and service dependencies.

This guide covers Jaeger deployment, trace analysis, and practical debugging workflows. For tracing fundamentals, see our Distributed Tracing guide first.

Jaeger Architecture

graph TB
    A[Services] -->|OTLP| B[Jaeger Collector]
    B --> D[Elasticsearch]
    B --> E[Cassandra]
    B --> F[Kafka]
    G[Jaeger Query] --> D
    G --> E
    H[Jaeger UI] --> G

Jaeger is built from a small set of cooperating components:

  • Agent (legacy): sidecar or DaemonSet that received spans over UDP; deprecated in favor of sending OTLP directly to the collector
  • Collector: receives spans, validates and processes them, and writes them to storage
  • Query: backend service for trace retrieval
  • UI: web interface for trace exploration

Deployment Options

All-in-One Quick Start

For local development:

# Ports: 4317/4318 OTLP gRPC/HTTP, 16686 UI, 14268 collector HTTP,
# 6831/6832 legacy agent UDP, 9411 Zipkin-compatible endpoint
docker run -d \
  --name jaeger \
  -e COLLECTOR_ZIPKIN_HOST_PORT=:9411 \
  -p 9411:9411 \
  -p 6831:6831/udp \
  -p 6832:6832/udp \
  -p 16686:16686 \
  -p 14268:14268 \
  -p 4317:4317 \
  -p 4318:4318 \
  jaegertracing/all-in-one:latest

Access the UI at http://localhost:16686.

Production Deployment

For Kubernetes:

# jaeger-operator.yaml
apiVersion: jaegertracing.io/v1
kind: Jaeger
metadata:
  name: production-jaeger
  namespace: observability
spec:
  strategy: production
  collector:
    maxReplicas: 5
    resources:
      limits:
        cpu: 500m
        memory: 512Mi
  storage:
    type: elasticsearch
    elasticsearch:
      name: elasticsearch
      doNotProvision: true
      secretName: jaeger-elasticsearch
  query:
    replicas: 2
    options:
      query:
        base-path: /jaeger

External Elasticsearch Backend

apiVersion: v1
kind: Secret
metadata:
  name: jaeger-elasticsearch
  namespace: observability
type: Opaque
stringData:
  ELASTICSEARCH_SERVER: "https://elasticsearch:9200"
  ELASTICSEARCH_USERNAME: "jaeger"
  ELASTICSEARCH_PASSWORD: "${ES_PASSWORD}"
---
apiVersion: jaegertracing.io/v1
kind: Jaeger
metadata:
  name: production-jaeger
  namespace: observability
spec:
  strategy: production
  collector:
    autoscale: true
    maxReplicas: 5
  storage:
    type: elasticsearch
    elasticsearch:
      nodeCount: 3
      redundancyPolicy: SingleRedundancy
      storage:
        size: 200Gi
        storageClassName: fast-storage
      indexCleaner:
        enabled: true
        numberOfDays: 14
        schedule: "55 5 * * *"
  query:
    replicas: 2

Jaeger UI

The Jaeger UI provides several views for trace analysis.

Search View

The search view lets you find traces by:

  • Service name
  • Operation name
  • Trace ID
  • Time range
  • Tag filters
  • Duration range
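
These filters map directly onto query parameters of the Jaeger query service's `/api/traces` endpoint, so the same search can be scripted. A small sketch (the endpoint and parameter names follow the HTTP API; the helper itself is hypothetical):

```python
import json
from urllib.parse import urlencode

def build_search_url(base, service, operation=None, lookback="1h",
                     tags=None, min_duration=None, max_duration=None, limit=20):
    """Build a Jaeger /api/traces search URL from UI-style filters."""
    params = {"service": service, "lookback": lookback, "limit": limit}
    if operation:
        params["operation"] = operation
    if tags:
        params["tags"] = json.dumps(tags)  # the API takes a JSON-encoded tag map
    if min_duration:
        params["minDuration"] = min_duration
    if max_duration:
        params["maxDuration"] = max_duration
    return f"{base}/api/traces?{urlencode(params)}"

url = build_search_url("http://jaeger-query:16686", "api-gateway",
                       tags={"error": "true"}, min_duration="2s")
```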

Trace Detail View

graph TD
    A[API Gateway<br/>Total: 1.2s] --> B[Auth Service<br/>200ms]
    A --> C[Order Service<br/>800ms]
    C --> D[Payment Service<br/>500ms]
    C --> E[Inventory Service<br/>150ms]
    C --> F[Notification<br/>50ms]
    D --> G[External Bank<br/>400ms]

The trace view shows:

  • Parent-child span relationships
  • Timing for each span
  • Span tags and logs
  • Total trace duration

Span Detail Panel

Clicking a span reveals:

  • Operation name and service
  • Start time and duration
  • Tags (key-value attributes)
  • Logs (timestamped events)
  • References (parent-child links)
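
In the JSON behind this view, parent-child links appear as `CHILD_OF` entries in each span's `references` array. A sketch of rebuilding the span tree from that structure (synthetic data; real spans carry many more fields):

```python
def build_span_tree(spans):
    """Group spans by parent using CHILD_OF references; return roots and a child map."""
    children = {s["spanID"]: [] for s in spans}
    roots = []
    for s in spans:
        parents = [r["spanID"] for r in s.get("references", [])
                   if r["refType"] == "CHILD_OF"]
        if parents and parents[0] in children:
            children[parents[0]].append(s["spanID"])
        else:
            roots.append(s["spanID"])  # no known parent: a root span
    return roots, children

spans = [
    {"spanID": "gateway", "references": []},
    {"spanID": "auth", "references": [{"refType": "CHILD_OF", "spanID": "gateway"}]},
    {"spanID": "order", "references": [{"refType": "CHILD_OF", "spanID": "gateway"}]},
]
roots, children = build_span_tree(spans)
```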

Trace Analysis Workflows

Debugging a Slow Request

Find the bottleneck in a slow trace:

  1. Search for traces with high duration
  2. Identify which service has the longest spans
  3. Check span tags for business context
  4. Review span logs for errors or unusual events
# Search for slow traces via the query API (traces longer than 5s)
curl -s "http://jaeger-query:16686/api/traces?service=api-gateway&lookback=1h&minDuration=5s" | \
  jq '.data[] | {traceID: .traceID, duration: ([.spans[].duration] | max), services: [.spans[].process.serviceName] | unique}'
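
Step 2 is easy to automate: a span's self time (its own duration minus time spent in direct children) pinpoints where latency actually accrues. A sketch over synthetic span data (it ignores children that overlap in parallel):

```python
def self_times(spans):
    """Self time per span: duration minus summed durations of direct children."""
    child_total = {}
    for s in spans:
        for r in s.get("references", []):
            if r["refType"] == "CHILD_OF":
                child_total[r["spanID"]] = child_total.get(r["spanID"], 0) + s["duration"]
    return {s["spanID"]: s["duration"] - child_total.get(s["spanID"], 0)
            for s in spans}

spans = [
    {"spanID": "gateway", "duration": 1200, "references": []},
    {"spanID": "auth", "duration": 200,
     "references": [{"refType": "CHILD_OF", "spanID": "gateway"}]},
    {"spanID": "order", "duration": 800,
     "references": [{"refType": "CHILD_OF", "spanID": "gateway"}]},
]
st = self_times(spans)
bottleneck = max(st, key=st.get)  # the span with the most unexplained time
```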

Finding Error Sources

Identify which service is causing errors:

  1. Filter traces by error status
  2. Examine the error span and its logs
  3. Follow the trace backward to find root cause
# Find traces that contain an error span
curl -s "http://jaeger-query:16686/api/traces?service=checkout-service&lookback=1h" | \
  jq '.data[] | select([.spans[].tags[]? | select(.key == "error" and .value == true)] | length > 0)'
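
"Following the trace backward" (step 3) usually means finding the deepest error: an error span none of whose children also report errors. A sketch over synthetic data (`error` tags encoded as in Jaeger's JSON):

```python
def root_cause_candidates(spans):
    """Error spans that have no erroring children - the deepest failures."""
    def has_error(s):
        return any(t["key"] == "error" and t["value"] is True
                   for t in s.get("tags", []))
    error_ids = {s["spanID"] for s in spans if has_error(s)}
    # Parents of error spans likely inherited the failure rather than caused it
    parents_of_errors = {r["spanID"]
                         for s in spans if s["spanID"] in error_ids
                         for r in s.get("references", [])
                         if r["refType"] == "CHILD_OF"}
    return [sid for sid in error_ids if sid not in parents_of_errors]

spans = [
    {"spanID": "checkout", "tags": [{"key": "error", "value": True}],
     "references": []},
    {"spanID": "payment", "tags": [{"key": "error", "value": True}],
     "references": [{"refType": "CHILD_OF", "spanID": "checkout"}]},
]
candidates = root_cause_candidates(spans)
```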

Analyzing Service Dependencies

Use the dependency view to understand service relationships:

  1. Navigate to the Dependency graph view
  2. Click on services to see call patterns
  3. Identify services with high fan-out
  4. Spot potential bottlenecks
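
The query service also exposes these links as data: `/api/dependencies` returns parent/child/callCount triples. Fan-out can then be computed directly; a sketch assuming that response shape:

```python
def fan_out(dependencies):
    """Count distinct downstream services per caller from dependency links."""
    downstream = {}
    for d in dependencies:
        downstream.setdefault(d["parent"], set()).add(d["child"])
    return {svc: len(children) for svc, children in downstream.items()}

deps = [
    {"parent": "order", "child": "payment", "callCount": 120},
    {"parent": "order", "child": "inventory", "callCount": 115},
    {"parent": "order", "child": "notification", "callCount": 110},
    {"parent": "gateway", "child": "order", "callCount": 130},
]
result = fan_out(deps)  # services with high fan-out are coordination hotspots
```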

Advanced Features

Adaptive Sampling

Reduce storage by sampling intelligently:

# Adaptive sampling (option names follow the collector's adaptive-sampling
# flags; verify the exact mapping against your operator version)
apiVersion: jaegertracing.io/v1
kind: Jaeger
metadata:
  name: adaptive-sampling
spec:
  strategy: production
  collector:
    options:
      sampling:
        target-samples-per-second: 100
        initial-sampling-probability: 0.001
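
Conceptually, adaptive sampling is a feedback loop: each operation's probability is nudged until observed throughput matches the target. A toy illustration of that loop (not Jaeger's actual algorithm):

```python
def adjust_rate(current_rate, observed_tps, target_tps,
                min_rate=0.0001, max_rate=1.0):
    """Nudge the sampling probability toward the target traces/second."""
    if observed_tps == 0:
        return max_rate  # no traffic seen yet: sample everything
    new_rate = current_rate * (target_tps / observed_tps)
    return max(min_rate, min(max_rate, new_rate))

# Sampling 100% of 1000 traces/s with a 100/s target drops the rate to 10%
rate = adjust_rate(1.0, observed_tps=1000, target_tps=100)
```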

Monitor Tab (Service Performance Monitoring)

The Monitor tab aggregates trace data by service, showing:

  • Average latency per operation
  • Request throughput
  • Error rates
  • Dependency links

Trace Quality Scoring

Jaeger can score traces based on quality:

# Trace quality indicators
- Missing span tags
- Incomplete trace depth
- High error rate
- Excessive span count
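
A simple scorer over those indicators can flag poorly instrumented traces in bulk (thresholds and weights here are illustrative, not Jaeger's own):

```python
def trace_quality(spans, min_spans=2, max_spans=500):
    """Score a trace 0-100 against simple quality heuristics."""
    issues = []
    if any(not s.get("tags") for s in spans):
        issues.append("missing span tags")
    if len(spans) < min_spans:
        issues.append("incomplete trace depth")
    if any(t["key"] == "error" and t["value"] is True
           for s in spans for t in s.get("tags", [])):
        issues.append("contains error spans")
    if len(spans) > max_spans:
        issues.append("excessive span count")
    return max(0, 100 - 25 * len(issues)), issues

score, issues = trace_quality([{"spanID": "a", "tags": []}])
```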

Integrating with OpenTelemetry

OpenTelemetry is the modern standard for tracing:

Automatic Instrumentation

# Python auto-instrumentation
from flask import Flask
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.instrumentation.flask import FlaskInstrumentor
from opentelemetry.instrumentation.requests import RequestsInstrumentor

app = Flask(__name__)

trace.set_tracer_provider(
    TracerProvider(resource=Resource.create({"service.name": "my-service"}))
)
trace.get_tracer_provider().add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://jaeger-collector:4317"))
)

FlaskInstrumentor().instrument_app(app)
RequestsInstrumentor().instrument()

Manual Span Creation

import { trace, context, Span, SpanStatusCode } from "@opentelemetry/api";

const tracer = trace.getTracer("order-service");

async function processOrder(order: Order) {
  const span = tracer.startSpan("OrderService.process");

  try {
    await validateOrder(order, span);
    await chargePayment(order, span);
    await fulfillOrder(order, span);

    span.setStatus({ code: SpanStatusCode.OK });
  } catch (error) {
    span.recordException(error as Error);
    span.setStatus({
      code: SpanStatusCode.ERROR,
      message: (error as Error).message,
    });
    throw error;
  } finally {
    span.end();
  }
}

async function validateOrder(order: Order, parentSpan: Span) {
  // The OpenTelemetry JS API sets a parent via the context, not a `parent` option
  const ctx = trace.setSpan(context.active(), parentSpan);
  const span = tracer.startSpan("OrderService.validate", {}, ctx);

  span.setAttribute("validation.type", "business");

  // Validation logic

  span.end();
}

Performance Analysis

Latency Percentiles

Analyze latency distribution:

# List instrumented services
curl -s "http://jaeger-query:16686/api/services" | jq '.data[]'

# Compute percentiles client-side from fetched traces (root-span durations)
curl -s "http://jaeger-query:16686/api/traces?service=api-gateway&lookback=1h&limit=1000" | \
  jq '[.data[].spans[] | select((.references | length) == 0) | .duration] | sort
      | {p50: .[(length * 50 / 100 | floor)], p95: .[(length * 95 / 100 | floor)], p99: .[(length * 99 / 100 | floor)]}'
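
Percentile computation over already-extracted durations can also be done in a few lines of Python (nearest-rank method; durations in microseconds, as Jaeger reports them):

```python
def percentile(values, p):
    """Nearest-rank percentile of a list of span durations."""
    s = sorted(values)
    k = max(0, min(len(s) - 1, round(p / 100 * len(s)) - 1))
    return s[k]

durations = [120, 80, 95, 400, 150, 110, 2500, 130, 90, 105]  # microseconds
p50 = percentile(durations, 50)
p95 = percentile(durations, 95)
```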

Throughput Analysis

Understand request volume patterns:

# Get traces count over time
curl -s "http://jaeger-query:16686/api/traces?service=api-gateway&start=$(date -d '1 hour ago' +%s000000)&end=$(date +%s000000)&limit=1000" | \
  jq '.data | length'
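
To see the shape of traffic rather than a single count, bucket trace start times per minute (epoch microseconds, matching each span's `startTime` field); a sketch:

```python
from collections import Counter

def traces_per_minute(start_times_us):
    """Bucket trace start timestamps (epoch microseconds) into per-minute counts."""
    return Counter(ts // 60_000_000 for ts in start_times_us)

starts = [0, 30_000_000, 59_000_000, 61_000_000, 125_000_000]
counts = traces_per_minute(starts)
```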

Error Rate Correlation

Correlate errors across services:

# Find traces with error spans and the services that produced them
# (the API requires a concrete service; repeat per service from /api/services)
curl -s "http://jaeger-query:16686/api/traces?service=api-gateway&lookback=1h" | \
  jq '.data[] | {
      traceID: .traceID,
      errors: [.spans[] | select([.tags[]? | select(.key == "error")] | length > 0) | .process.serviceName]
    } | select(.errors | length > 0)'

Alerting on Trace Data

Prometheus Metrics from Jaeger

# jaeger-metrics-exporter.yaml
apiVersion: v1
kind: Service
metadata:
  name: jaeger-metrics
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "8888"
spec:
  ports:
    - port: 8888
      targetPort: 8888
---
# Prometheus scrape config
scrape_configs:
  - job_name: "jaeger"
    static_configs:
      - targets: ["jaeger-collector:8888"]

Key Metrics to Monitor

| Metric | Description |
|---|---|
| jaeger_collector_traces_received | Incoming traces count |
| jaeger_collector_spans_received | Total spans received |
| jaeger_collector_queue_length | Pending spans in queue |
| jaeger_query_latency | Query response time |

Storage Backends

Jaeger supports multiple storage backends.

Elasticsearch

Best for large-scale production deployments:

# Elasticsearch with ILM
apiVersion: jaegertracing.io/v1
kind: Jaeger
metadata:
  name: production-jaeger
spec:
  strategy: production
  storage:
    type: elasticsearch
    elasticsearch:
      nodeCount: 3
      redundancyPolicy: SingleRedundancy
      indexParameters:
        numberOfShards: 5
        numberOfReplicas: 1

Cassandra

Traditional choice for high volume:

apiVersion: jaegertracing.io/v1
kind: Jaeger
metadata:
  name: production-jaeger
spec:
  strategy: production
  storage:
    type: cassandra
    cassandra:
      servers: cassandra:9042
      keyspace: jaeger_v1
      replication_factor: 2

Kafka

For buffering and replay:

apiVersion: jaegertracing.io/v1
kind: Jaeger
metadata:
  name: production-jaeger
spec:
  strategy: streaming
  collector:
    maxReplicas: 10
  storage:
    type: kafka
    kafka:
      brokers:
        - kafka:9092
      topic: jaeger-spans
      partitions: 10
  ingester:
    replicas: 2

When to Use Jaeger

Use Jaeger when:

  • Debugging latency issues across microservice boundaries
  • Understanding service dependencies and call patterns
  • Root cause analysis for cascading failures
  • Optimizing performance by identifying bottlenecks
  • Validating trace context propagation
  • Monitoring distributed transactions
  • Detecting anomalies in request flows

Don’t use Jaeger when:

  • You have single monolithic applications without service boundaries
  • You need purely metric-based monitoring
  • Low-latency tracing overhead is unacceptable
  • You need long-term log storage
  • You only need aggregate analytics (use dashboards)

Production Failure Scenarios

| Failure | Impact | Mitigation |
|---|---|---|
| Jaeger storage backend degraded | Traces dropped; incomplete debugging data | Configure sampling; scale storage; implement a buffer queue |
| Collector queue overflow | Spans dropped; monitoring gaps | Monitor queue depth; scale collectors; implement backpressure |
| Query service overloaded | Slow trace search; UI timeouts | Optimize queries; add caching; scale query replicas |
| Adaptive sampling too aggressive | Missing important traces | Review sampling rates; ensure error traces are always sampled |
| Trace context propagation broken | Incomplete traces; orphaned spans | Propagate context in all services; test regularly |
| Elasticsearch backend slow | Delayed trace availability | Monitor the ES cluster; optimize indices; add a warm storage tier |

Observability Checklist

Jaeger Infrastructure Metrics

  • Traces received per second (ingestion rate)
  • Spans received and processed
  • Collector queue depth
  • Backend storage write latency
  • Query service latency (p50, p95, p99)
  • Active connections to storage

Trace Coverage Metrics

  • Services with instrumentation
  • Trace completeness (spans per trace average)
  • Error trace percentage
  • Slow trace percentage (>threshold)
  • Span attribute coverage

Sampling Configuration

  • Head-based sampling rate
  • Adaptive sampling thresholds
  • Tail sampling policies (errors, slow traces)
  • Always-sampled tag configuration

Alerting Rules

  • Collector queue depth > threshold
  • Storage write latency degraded
  • Query service down
  • Ingestion rate drop (potential issue)

Security Checklist

  • Jaeger UI access authenticated
  • No sensitive data in span tags or logs (passwords, tokens, PII)
  • TLS configured for all endpoints
  • Trace data access logged and audited
  • Sampling does not inadvertently exclude security-relevant traces
  • Trace context headers sanitized before external calls
  • Storage backend access restricted
  • No internal service names exposed to external trace exports

Common Pitfalls / Anti-Patterns

1. Not Sampling Tail for Errors

Head-based sampling misses most error traces at low sampling rates:

# Bad: fixed probabilistic sampling only
sampling:
  type: probabilistic
  probabilistic:
    sampling_rate: 0.01  # 1% - misses most error traces

# Better: adaptive sampling keeps low-traffic operations visible;
# for guaranteed error capture, add tail-based sampling in an
# OpenTelemetry Collector in front of Jaeger
sampling:
  type: adaptive
  adaptive:
    max_traces_per_second: 100
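
The arithmetic behind this pitfall: under head-based sampling at rate p, an incident producing n error traces is captured with probability 1 - (1 - p)^n.

```python
def capture_probability(rate, n_error_traces):
    """Chance that at least one of n error traces survives head-based sampling."""
    return 1 - (1 - rate) ** n_error_traces

# At 1% sampling, a burst of 10 errors leaves any evidence only ~10% of the time
p = capture_probability(0.01, 10)
```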

2. Missing Semantic Attributes

Traces without standard attributes are hard to query:

// Bad: Missing semantic attributes
const span = tracer.startSpan("OrderService.process");
span.setAttribute("orderId", order.id); // Non-standard name

// Good: Use semantic conventions
const span = tracer.startSpan("OrderService.process");
span.setAttribute("order.id", order.id);
span.setAttribute("order.total", order.total);
span.setAttribute("customer.tier", customer.tier);

3. Creating Child Spans Without Parent Context

Orphaned spans break trace continuity:

// Bad: No parent context
async function processOrder(order) {
  const span = tracer.startSpan("OrderService.process");
  await validateOrder(order); // Creates orphaned span
  span.end();
}

// Good: Run child work inside a context that carries the span
async function processOrder(order) {
  const span = tracer.startSpan("OrderService.process");
  const ctx = trace.setSpan(context.active(), span);
  await context.with(ctx, () => validateOrder(order)); // child spans link automatically
  span.end();
}

4. Storing Large Payloads in Span Events

Span events are not a data store:

// Bad: Large payload in span
span.addEvent("response", { body: JSON.stringify(largeResponse) });

// Good: Reference by ID or summary
span.addEvent("response", {
  "response.size_bytes": largeResponse.length,
  "response.status": "success",
});
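
A guard like the following (a hypothetical helper; OTel SDKs expose comparable built-in span limits) keeps oversized values out of spans entirely:

```python
MAX_ATTR_LEN = 256  # illustrative limit

def safe_attributes(attrs, max_len=MAX_ATTR_LEN):
    """Truncate oversized string attribute values before attaching them to a span."""
    out = {}
    for key, value in attrs.items():
        if isinstance(value, str) and len(value) > max_len:
            out[key] = value[:max_len] + "...[truncated]"
        else:
            out[key] = value
    return out

attrs = safe_attributes({"response.body": "x" * 10_000,
                         "response.status": "success"})
```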

5. Not Monitoring Jaeger Itself

Scrape Jaeger's own metrics so that failures in the tracing pipeline are visible:

# Prometheus metrics from Jaeger
scrape_configs:
  - job_name: "jaeger"
    static_configs:
      - targets: ["jaeger-collector:8888"]

Quick Recap

Key Takeaways:

  • Jaeger provides distributed tracing visualization for microservices
  • Deploy collectors, query, and UI separately in production
  • Use adaptive or tail-based sampling to capture errors
  • Always propagate trace context through HTTP headers and queues
  • Monitor Jaeger itself: ingestion rate, queue depth, storage latency
  • Combine with metrics (Prometheus) and logs (ELK) for complete observability

Copy/Paste Checklist:

# Production Jaeger CRD
apiVersion: jaegertracing.io/v1
kind: Jaeger
metadata:
  name: production-jaeger
spec:
  strategy: production
  collector:
    maxReplicas: 5
    resources:
      limits:
        cpu: 500m
        memory: 512Mi
  storage:
    type: elasticsearch
    elasticsearch:
      nodeCount: 3
  query:
    replicas: 2

# Adaptive sampling config
spec:
  sampling:
    type: adaptive
    adaptive:
      max_traces_per_second: 100
      initial_sampling_rate: 10

# Prometheus metrics scrape
scrape_configs:
  - job_name: 'jaeger-collector'
    static_configs:
      - targets: ['jaeger-collector:8888']

// Trace context propagation
import { propagation, context } from "@opentelemetry/api";

function injectTraceContext(headers: Record<string, string>) {
  propagation.inject(context.active(), headers);
}

function extractTraceContext(headers: Record<string, string>) {
  return propagation.extract(context.active(), headers);
}

// Error span recording
span.recordException(error);
span.setStatus({ code: SpanStatusCode.ERROR });
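
The header that `propagation.inject` writes is the W3C `traceparent` format (`version-traceid-spanid-flags`); building and parsing it by hand shows exactly what crosses the wire:

```python
def build_traceparent(trace_id, span_id, sampled=True):
    """Format a W3C traceparent header: version-traceid-spanid-flags."""
    return f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"

def parse_traceparent(header):
    version, trace_id, span_id, flags = header.split("-")
    return {"trace_id": trace_id, "span_id": span_id, "sampled": flags == "01"}

hdr = build_traceparent("4bf92f3577b34da6a3ce929d0e0e4736", "00f067aa0ba902b7")
ctx = parse_traceparent(hdr)
```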

Conclusion

Jaeger provides powerful distributed tracing capabilities for microservices debugging and analysis. Trace visualization, dependency mapping, and performance analysis turn complex request flows into understandable data.

Start with automatic instrumentation for your services, deploy the all-in-one version for development, and scale to a production deployment with Elasticsearch storage when you are ready.

For complete observability, combine tracing with metrics (Prometheus & Grafana) and logs (ELK Stack). The Distributed Tracing guide covers OpenTelemetry integration in more detail.
