Jaeger: Distributed Tracing for Microservices
Learn Jaeger for distributed tracing visualization. Covers trace analysis, dependency mapping, and integration with OpenTelemetry.
Jaeger is an open-source distributed tracing system for monitoring and troubleshooting microservices. It shows you how requests flow through your services, where latency lives, and how your services depend on each other.
This guide covers Jaeger deployment, trace analysis, and practical debugging workflows. For tracing fundamentals, see our Distributed Tracing guide first.
Introduction
graph TB
A[Services] -->|OTLP| B[Jaeger Collector]
B --> C[Jaeger Backend]
C --> D[Elasticsearch]
C --> E[Cassandra]
C --> F[Kafka]
G[Jaeger Query] --> C
H[Jaeger UI] --> G
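Jaeger's classic (v1) architecture has four components:
- Agent: Sidecar or daemonset that receives spans over UDP from the legacy Jaeger client libraries. It is deprecated; services instrumented with OpenTelemetry send OTLP directly to the collector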
- Collector: Receives spans, processes them, and stores them
- Query: Backend service for trace retrieval
- UI: Web interface for trace exploration
Deployment Options
All-in-One Quick Start
For local development:
docker run -d \
  --name jaeger \
  -e COLLECTOR_ZIPKIN_HOST_PORT=:9411 \
  -p 6831:6831/udp \
  -p 6832:6832/udp \
  -p 16686:16686 \
  -p 14268:14268 \
  -p 4317:4317 \
  -p 4318:4318 \
  -p 9411:9411 \
  jaegertracing/all-in-one:latest
Access the UI at http://localhost:16686.
Production Deployment
For Kubernetes:
# jaeger-production.yaml (requires the Jaeger Operator)
apiVersion: jaegertracing.io/v1
kind: Jaeger
metadata:
  name: production-jaeger
  namespace: observability
spec:
  strategy: production
  collector:
    maxReplicas: 5
    resources:
      limits:
        cpu: 500m
        memory: 512Mi
  storage:
    type: elasticsearch
    options:
      es:
        server-urls: https://elasticsearch:9200   # existing cluster, not provisioned by the operator
    secretName: jaeger-elasticsearch              # Secret holding ES_USERNAME / ES_PASSWORD
  query:
    replicas: 2
    options:
      query:
        base-path: /jaeger
External Elasticsearch Backend
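# Credentials read by the Jaeger Operator. The server URL belongs in the Jaeger
# resource (spec.storage.options.es.server-urls), not in the Secret.
apiVersion: v1
kind: Secret
metadata:
  name: jaeger-elasticsearch
  namespace: observability
type: Opaque
stringData:
  ES_USERNAME: "jaeger"
  ES_PASSWORD: "${ES_PASSWORD}"   # substitute before applying; kubectl does not expand variables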
---
# Alternative: let the operator provision Elasticsearch itself
# (self-provisioning relies on the OpenShift Elasticsearch Operator)
apiVersion: jaegertracing.io/v1
kind: Jaeger
metadata:
  name: production-jaeger
  namespace: observability
spec:
  strategy: production
  collector:
    autoscale: true
    maxReplicas: 5
  storage:
    type: elasticsearch
    elasticsearch:
      nodeCount: 3
      redundancyPolicy: SingleRedundancy
      storage:
        size: 200Gi
        storageClassName: fast-storage
    esIndexCleaner:
      enabled: true
      numberOfDays: 14        # delete indices older than 14 days
      schedule: "55 5 * * *"
  query:
    replicas: 2
Jaeger UI
The Jaeger UI has several views for trace analysis.
Search View
The search view lets you find traces by:
- Service name
- Operation name
- Trace ID
- Time range
- Tag filters
- Duration range
Trace Detail View
graph TD
A[API Gateway<br/>Total: 1.2s] --> B[Auth Service<br/>200ms]
A --> C[Order Service<br/>800ms]
C --> D[Payment Service<br/>500ms]
C --> E[Inventory Service<br/>150ms]
C --> F[Notification<br/>50ms]
D --> G[External Bank<br/>400ms]
The trace view shows:
- Parent-child span relationships
- Timing for each span
- Span tags and logs
- Total trace duration
Span Detail Panel
Clicking a span reveals:
- Operation name and service
- Start time and duration
- Tags (key-value attributes)
- Logs (timestamped events)
- References (parent-child links)
Trace Analysis Workflows
Debugging a Slow Request
Find the bottleneck in a slow trace:
- Search for traces with high duration
- Identify which service has the longest spans
- Check span tags for business context
- Review span logs for errors or unusual events
# Search for slow traces via the query API used by the UI (minDuration keeps traces of at least 5s)
curl -s "http://jaeger-query:16686/api/traces?service=api-gateway&lookback=1h&minDuration=5s" | \
  jq '.data[] | {traceID: .traceID, duration_us: ([.spans[].duration] | max), services: ([.processes[].serviceName] | unique)}'
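To make the "longest span" step more concrete, here is a minimal Python sketch that ranks spans by self time, meaning a span's duration minus the time spent in its direct children. It assumes a trace object shaped like one entry of `data[]` from the query API. The `self_times` helper and the sample trace are hypothetical, and the calculation assumes children run sequentially.

```python
from collections import defaultdict

def self_times(trace: dict) -> dict:
    """Return {spanID: self time in microseconds} for one Jaeger trace object."""
    child_total = defaultdict(int)
    for span in trace["spans"]:
        for ref in span.get("references", []):
            if ref.get("refType") == "CHILD_OF":
                # Simplification: assumes children run one after another, not in parallel
                child_total[ref["spanID"]] += span["duration"]
    return {
        s["spanID"]: max(s["duration"] - child_total[s["spanID"]], 0)
        for s in trace["spans"]
    }

# Hypothetical trace: gateway -> charge -> bank-call
sample = {
    "spans": [
        {"spanID": "a", "operationName": "GET /checkout", "duration": 1200, "references": []},
        {"spanID": "b", "operationName": "charge", "duration": 900,
         "references": [{"refType": "CHILD_OF", "spanID": "a"}]},
        {"spanID": "c", "operationName": "bank-call", "duration": 700,
         "references": [{"refType": "CHILD_OF", "spanID": "b"}]},
    ]
}

times = self_times(sample)
slowest = max(times, key=times.get)
print(slowest, times[slowest])  # c 700
```

The span with the largest self time is usually a better starting point than the span with the largest total duration, because a parent's duration includes everything its children did.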
Finding Error Sources
Identify which service is causing errors:
- Filter traces by error status
- Examine the error span and its logs
- Follow the trace backward to find root cause
# Find traces that contain at least one span tagged error=true
curl -s "http://jaeger-query:16686/api/traces?service=checkout-service&lookback=1h" | \
  jq '.data[] | select(any(.spans[]; (.tags // []) | any(.key == "error" and .value == true)))'
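Following a trace backward can also be automated. The sketch below is one possible approach, not a Jaeger feature: it returns the error spans that have no erroring child, which are the deepest points where the failure was observed. The helper names and the sample trace are hypothetical; the span shape follows the query API JSON.

```python
def is_error(span: dict) -> bool:
    """True if the span carries an error tag set to true."""
    return any(t["key"] == "error" and t["value"] in (True, "true")
               for t in span.get("tags", []))

def root_cause_spans(trace: dict) -> list:
    """Operation names of error spans that have no erroring child span."""
    errored = {s["spanID"] for s in trace["spans"] if is_error(s)}
    parents_of_errors = {
        ref["spanID"]
        for s in trace["spans"] if s["spanID"] in errored
        for ref in s.get("references", [])
    }
    return [s["operationName"] for s in trace["spans"]
            if s["spanID"] in errored and s["spanID"] not in parents_of_errors]

err = [{"key": "error", "type": "bool", "value": True}]
sample_trace = {
    "spans": [
        {"spanID": "g", "operationName": "gateway.checkout", "tags": err, "references": []},
        {"spanID": "o", "operationName": "order.create", "tags": err,
         "references": [{"refType": "CHILD_OF", "spanID": "g"}]},
        {"spanID": "p", "operationName": "payment.charge", "tags": err,
         "references": [{"refType": "CHILD_OF", "spanID": "o"}]},
        {"spanID": "i", "operationName": "inventory.reserve", "tags": [],
         "references": [{"refType": "CHILD_OF", "spanID": "o"}]},
    ]
}

print(root_cause_spans(sample_trace))  # ['payment.charge']
```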
Analyzing Service Dependencies
Use the dependency view to understand service relationships:
- Navigate to the Dependency graph view
- Click on services to see call patterns
- Identify services with high fan-out
- Spot potential bottlenecks
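Fan-out can also be computed from dependency data rather than read off the graph. This sketch assumes a list of links with `parent`, `child`, and `callCount` fields, which is the shape the dependency view works with. The `fan_out` helper and the sample links are hypothetical.

```python
from collections import defaultdict

def fan_out(links: list) -> dict:
    """Map each parent service to the sorted list of distinct services it calls."""
    out = defaultdict(set)
    for link in links:
        out[link["parent"]].add(link["child"])
    return {svc: sorted(children) for svc, children in out.items()}

# Hypothetical dependency links
links = [
    {"parent": "api-gateway", "child": "auth", "callCount": 120},
    {"parent": "api-gateway", "child": "order", "callCount": 80},
    {"parent": "order", "child": "payment", "callCount": 80},
    {"parent": "order", "child": "inventory", "callCount": 75},
    {"parent": "order", "child": "notification", "callCount": 60},
]

result = fan_out(links)
widest = max(result, key=lambda svc: len(result[svc]))
print(widest, result[widest])  # order ['inventory', 'notification', 'payment']
```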
Advanced Features
Adaptive Sampling
Reduce storage by sampling intelligently:
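# Adaptive sampling is a collector setting in Jaeger v1, not a deployment strategy.
# Enable it with an environment variable on the collector:
SAMPLING_CONFIG_TYPE=adaptive
# Then tune it with collector flags:
--sampling.target-samples-per-second=100      # target sampled rate per operation
--sampling.initial-sampling-probability=0.1   # starting probability for new operations
# Adaptive sampling also needs a sampling store, and services must use a
# remote sampler that polls the collector for the calculated probabilities.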
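The idea behind adaptive sampling can be shown in a few lines. This is a simplified illustration and not Jaeger's actual algorithm: each interval, the sampling probability is scaled so that the observed sampled rate moves toward a target rate. The function name and parameters are assumptions made for this example.

```python
def next_probability(current: float, observed_per_sec: float, target_per_sec: float,
                     min_p: float = 0.00001, max_p: float = 1.0) -> float:
    """Scale the sampling probability toward the target sampled rate."""
    if observed_per_sec <= 0:
        return current  # nothing observed in this interval: keep the current probability
    proposed = current * (target_per_sec / observed_per_sec)
    return min(max(proposed, min_p), max_p)  # clamp to the allowed range

# Sampling 400 traces/s at p=0.1 with a target of 100/s: probability drops to a quarter
print(next_probability(0.1, observed_per_sec=400, target_per_sec=100))
# A quiet operation sampled at 10/s is raised, but never above 1.0
print(next_probability(0.5, observed_per_sec=10, target_per_sec=100))
```

Note that nothing in this calculation looks at errors or latency; it only balances volume.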
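Monitor View
The Monitor tab (Service Performance Monitoring) aggregates span data per service and operation. It needs a metrics backend such as Prometheus and shows:
- Latency percentiles per operation
- Request throughput
- Error rates
Dependency links between services are shown separately in the Dependency graph view.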
Trace Quality Scoring
Traces can also be assessed for instrumentation quality. Common indicators of a low-quality trace:
- Missing span tags
- Incomplete trace depth
- High error rate
- Excessive span count
Jaeger vs Zipkin: Comparison and Migration
Understanding how Jaeger relates to Zipkin helps when evaluating or migrating distributed tracing solutions.
Shared Foundations
Jaeger and Zipkin use similar span-based data models:
- Span: Represents a unit of work with name, start time, duration, and attributes
- Trace: A collection of spans forming a complete request path
- Context propagation: Zipkin natively uses B3 headers and Jaeger's legacy clients use the uber-trace-id header; OpenTelemetry SDKs can emit W3C Trace Context or B3 regardless of the backend
This compatibility means you can instrument services using OpenTelemetry and route traces to either system.
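Context propagation is what ties spans from different services into one trace. As a rough illustration, this sketch parses a W3C `traceparent` header and renders the same context as a Zipkin B3 single header. In practice the OpenTelemetry propagators do this for you; the function names here are hypothetical.

```python
import re

# version - 32 hex trace id - 16 hex parent span id - 2 hex flags
TRACEPARENT = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")

def parse_traceparent(header: str) -> dict:
    """Parse a W3C traceparent header into its fields."""
    m = TRACEPARENT.match(header.strip())
    if not m:
        raise ValueError(f"malformed traceparent: {header!r}")
    version, trace_id, span_id, flags = m.groups()
    if trace_id == "0" * 32 or span_id == "0" * 16:
        raise ValueError("all-zero trace or span id is invalid")
    return {"version": version, "trace_id": trace_id, "span_id": span_id,
            "sampled": bool(int(flags, 16) & 0x01)}

def to_b3_single(ctx: dict) -> str:
    """Render the context as a B3 single header: traceid-spanid-samplingstate."""
    return f"{ctx['trace_id']}-{ctx['span_id']}-{'1' if ctx['sampled'] else '0'}"

ctx = parse_traceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
print(to_b3_single(ctx))  # 4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-1
```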
Key Differences
| Aspect | Jaeger | Zipkin |
|---|---|---|
| Architecture | Collector, Query, UI, Agent as separate components | Simple architecture with Collector and Query only |
| Storage backends | Elasticsearch, Cassandra (Kafka as an ingestion buffer) | In-memory, Cassandra, Elasticsearch, MySQL |
| Sampling strategies | Probabilistic, Rate-limiting, Adaptive; tail-based via the OpenTelemetry Collector | Probabilistic, Rate-limiting |
| OTLP support | Native OTLP receiver | Requires Zipkin collector adapter |
| Service mesh integration | Native Istio and Linkerd support | Limited built-in support |
| Operations complexity | Higher due to more components | Lower, simpler to operate |
Using the Zipkin Receiver
Jaeger includes a Zipkin-compatible receiver for migrations:
# Configure the Jaeger collector to accept Zipkin spans on port 9411
docker run -d \
  --name jaeger \
  -e COLLECTOR_ZIPKIN_HOST_PORT=:9411 \
  -p 16686:16686 \
  -p 9411:9411 \
  jaegertracing/all-in-one:latest
This lets existing Zipkin-instrumented services send traces to Jaeger without re-instrumentation.
Migration Path
When migrating from Zipkin to Jaeger:
- Phase 1: Deploy Jaeger alongside Zipkin, configure services to send traces to both
- Phase 2: Validate Jaeger data completeness and sampling behavior
- Phase 3: Switch primary monitoring to Jaeger, keep Zipkin as fallback
- Phase 4: Decommission Zipkin once confidence is established
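# Dual-export configuration for gradual migration (OpenTelemetry Collector)
# Services send OTLP to the collector, which fans out to both backends
receivers:
  otlp:
    protocols:
      grpc:
exporters:
  otlp/jaeger:
    endpoint: jaeger-collector:4317
    tls:
      insecure: true   # assumes plaintext traffic inside the cluster
  zipkin:
    endpoint: http://zipkin:9411/api/v2/spans
service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [otlp/jaeger, zipkin]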
When to Choose Jaeger Over Zipkin
Choose Jaeger when:
- You need advanced sampling strategies like tail-based sampling
- Your environment uses Kubernetes with service mesh (Istio/Linkerd)
- You require long-term trace storage with Elasticsearch or Cassandra
- You want native OpenTelemetry protocol support without adapters
- Your team needs more sophisticated trace visualization and analysis
Choose Zipkin when:
- You have existing Zipkin instrumentation and limited migration budget
- Your deployment scale is modest and simple architecture is preferred
- You need quick setup with minimal operational overhead
- Your organization already has expertise with Zipkin tooling
Integrating with OpenTelemetry
OpenTelemetry is the modern standard for tracing:
Automatic Instrumentation
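# Python: Flask and requests auto-instrumentation, exporting over OTLP/gRPC
from flask import Flask
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.instrumentation.flask import FlaskInstrumentor
from opentelemetry.instrumentation.requests import RequestsInstrumentor
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

app = Flask(__name__)

# TracerProvider comes from the SDK package, not from the opentelemetry.trace API module
provider = TracerProvider(resource=Resource.create({"service.name": "my-service"}))
provider.add_span_processor(
    # insecure=True assumes a plaintext gRPC endpoint inside the cluster
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://jaeger-collector:4317", insecure=True))
)
trace.set_tracer_provider(provider)

# Instrument incoming Flask requests and outgoing requests calls
FlaskInstrumentor().instrument_app(app)
RequestsInstrumentor().instrument()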
Manual Span Creation
import { context, trace, Span, SpanStatusCode } from "@opentelemetry/api";

const tracer = trace.getTracer("order-service");

// Order, chargePayment, and fulfillOrder are assumed to be defined elsewhere
async function processOrder(order: Order) {
  const span = tracer.startSpan("OrderService.process");
  try {
    await validateOrder(order, span);
    await chargePayment(order, span);
    await fulfillOrder(order, span);
    span.setStatus({ code: SpanStatusCode.OK });
  } catch (error) {
    span.recordException(error as Error);
    span.setStatus({
      code: SpanStatusCode.ERROR,
      message: (error as Error).message,
    });
    throw error;
  } finally {
    span.end();
  }
}

async function validateOrder(order: Order, parentSpan: Span) {
  // The JS API takes the parent through a Context argument, not a `parent` option
  const ctx = trace.setSpan(context.active(), parentSpan);
  const span = tracer.startSpan("OrderService.validate", {}, ctx);
  try {
    span.setAttribute("validation.type", "business");
    // Validation logic
  } finally {
    span.end(); // end the span even if validation throws
  }
}
Performance Analysis
Latency Percentiles
Analyze latency distribution:
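# List known services
curl -s "http://jaeger-query:16686/api/services" | jq -r '.data[]'
# Approximate p50/p99 (microseconds) from the longest span of each sampled trace
curl -s "http://jaeger-query:16686/api/traces?service=api-gateway&lookback=1h&limit=1000" | \
  jq '[.data[] | [.spans[].duration] | max] | sort | {p50: .[(length * 0.5 | floor)], p99: .[(length * 0.99 | floor)]}'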
Throughput Analysis
Understand request volume patterns:
# Get traces count over time
curl -s "http://jaeger-query:16686/api/traces?service=api-gateway&start=$(date -d '1 hour ago' +%s000000)&end=$(date +%s000000)&limit=1000" | \
jq '.data | length'
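A single count hides the shape of the traffic. The sketch below buckets traces by the minute in which they started, which is closer to a throughput curve. It assumes trace objects shaped like the query API output, with span `startTime` in microseconds since the epoch. The helper name is hypothetical and the sample uses small start times for readability. Counts reflect sampled traces only, so scale by the sampling rate before reading them as request volume.

```python
from collections import Counter

def traces_per_minute(traces: list) -> dict:
    """Count traces per minute, keyed by the minute start in epoch seconds."""
    counts = Counter()
    for t in traces:
        start_us = min(s["startTime"] for s in t["spans"])  # earliest span starts the trace
        counts[(start_us // 60_000_000) * 60] += 1
    return dict(sorted(counts.items()))

# Hypothetical traces with start times at 0s, 30s, 60s, and 125s
traces = [
    {"spans": [{"startTime": 0}, {"startTime": 5_000_000}]},
    {"spans": [{"startTime": 30_000_000}]},
    {"spans": [{"startTime": 60_000_000}]},
    {"spans": [{"startTime": 125_000_000}]},
]

print(traces_per_minute(traces))  # {0: 2, 60: 1, 120: 1}
```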
Error Rate Correlation
Correlate errors across services:
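# Find traces with errors and the services that reported them
# (the service parameter needs a concrete name; the tags filter is URL-encoded {"error":"true"})
curl -s "http://jaeger-query:16686/api/traces?service=api-gateway&lookback=1h&tags=%7B%22error%22%3A%22true%22%7D" | \
  jq '.data[] | . as $t | {
    traceID: .traceID,
    errors: ([.spans[] | select((.tags // []) | any(.key == "error" and .value == true)) | $t.processes[.processID].serviceName] | unique)
  } | select(.errors | length > 0)'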
Alerting on Trace Data
Prometheus Metrics from Jaeger
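# jaeger-metrics-service.yaml
# Jaeger v1 components serve Prometheus metrics on their admin port
# (14269 for the collector); adjust the port if your deployment differs
apiVersion: v1
kind: Service
metadata:
  name: jaeger-metrics
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "14269"
spec:
  ports:
    - port: 14269
      targetPort: 14269
# prometheus.yml (a separate file): static scrape config
scrape_configs:
  - job_name: "jaeger"
    static_configs:
      - targets: ["jaeger-collector:14269"]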
Key Metrics to Monitor
| Metric | Description |
|---|---|
| jaeger_collector_traces_received | Incoming traces count |
| jaeger_collector_spans_received | Total spans received |
| jaeger_collector_queue_length | Pending spans in queue |
| jaeger_query_latency | Query response time |
Storage Backends
Jaeger supports multiple storage backends.
Elasticsearch
Best for large-scale production deployments:
# Elasticsearch with ILM
apiVersion: jaegertracing.io/v1
kind: Jaeger
metadata:
name: production-jaeger
spec:
strategy: production
storage:
type: elasticsearch
elasticsearch:
nodeCount: 3
redundancyPolicy: SingleRedundancy
indexParameters:
numberOfShards: 5
numberOfReplicas: 1
Cassandra
Traditional choice for high volume:
apiVersion: jaegertracing.io/v1
kind: Jaeger
metadata:
name: production-jaeger
spec:
strategy: production
storage:
type: cassandra
cassandra:
servers: cassandra:9042
keyspace: jaeger_v1
replication_factor: 2
Kafka
For buffering and replay:
apiVersion: jaegertracing.io/v1
kind: Jaeger
metadata:
name: production-jaeger
spec:
strategy: streaming
collector:
maxReplicas: 10
storage:
type: kafka
kafka:
brokers:
- kafka:9092
topic: jaeger-spans
partitions: 10
ingester:
replicas: 2
SLO Integration with Tracing
Correlate trace data with Service Level Objectives for better reliability.
Defining SLOs from Traces
# Raise the head-based sampling rate for SLO-critical services (Jaeger Operator)
apiVersion: jaegertracing.io/v1
kind: Jaeger
metadata:
  name: slo-tracking
spec:
  strategy: production
  sampling:
    options:
      default_strategy:
        type: probabilistic
        param: 0.1
      service_strategies:
        # Head-based sampling cannot guarantee that slow traces are kept;
        # it only raises the share of traces recorded for this service
        - service: api-gateway
          type: ratelimiting
          param: 100   # traces per second
Trace-Based SLO Dashboard
Key trace metrics to visualize:
| Metric | SLO Target | Query Method |
|---|---|---|
| p99 Latency | < 500ms | Trace duration percentiles |
| Error Rate | < 0.1% | Error spans / total spans |
| Availability | > 99.9% | Successful traces / total traces |
| Trace Completeness | > 95% | Complete traces / total traces |
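The first two rows of the table can be computed directly from trace data. The sketch below is a minimal example under stated assumptions: durations and error flags have already been extracted per trace, the percentile uses the nearest-rank method, and the targets mirror the table. Because the input is sampled data, the results are estimates; the function names are hypothetical.

```python
import math

def percentile(values: list, pct: float) -> float:
    """Nearest-rank percentile; pct is in (0, 100]."""
    ordered = sorted(values)
    rank = max(math.ceil(pct * len(ordered) / 100), 1)
    return ordered[rank - 1]

def slo_report(durations_ms: list, error_flags: list,
               p99_target_ms: float = 500, error_target: float = 0.001) -> dict:
    """Compare p99 latency and error rate against SLO targets."""
    p99 = percentile(durations_ms, 99)
    error_rate = sum(error_flags) / len(error_flags)
    return {"p99_ms": p99, "p99_ok": p99 < p99_target_ms,
            "error_rate": error_rate, "error_ok": error_rate < error_target}

# Hypothetical window of 100 traces: 98 fast, one at 400 ms, one at 900 ms, one error
durations = [10] * 98 + [400, 900]
errors = [False] * 99 + [True]
report = slo_report(durations, errors)
print(report)
```

In this sample the p99 target is met (400 ms) while the error-rate target is not (1% against a 0.1% target).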
Latency Budgets with Traces
Allocate latency budget across services:
# Get latency breakdown by service for an SLO window
# errors is 1 for a span tagged error=true and 0 otherwise, so every span is counted
curl -s "http://jaeger-query:16686/api/traces?service=api-gateway&lookback=1h" | \
  jq '[.data[] | . as $t | .spans[] | {
    service: $t.processes[.processID].serviceName,
    duration_ms: (.duration / 1000),
    errors: ([(.tags // [])[] | select(.key == "error" and .value == true)] | length)
  }] | group_by(.service) | map({
    service: .[0].service,
    avg_duration_ms: (map(.duration_ms) | add / length),
    max_duration_ms: (map(.duration_ms) | max),
    error_count: (map(.errors) | add)
  })'
Tail-Based Sampling for SLO Traces
Ensure error and slow traces are always captured:
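# Tail-based sampling runs in an OpenTelemetry Collector placed in front of Jaeger
# (tail_sampling processor from opentelemetry-collector-contrib)
processors:
  tail_sampling:
    decision_wait: 10s          # wait for late spans before deciding
    policies:
      - name: keep-errors       # always keep traces with an error status
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: keep-slow         # always keep traces slower than 500 ms
        type: latency
        latency:
          threshold_ms: 500
      - name: baseline          # keep 10% of everything else
        type: probabilistic
        probabilistic:
          sampling_percentage: 10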
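The decision logic behind tail-based sampling is simple once the whole trace is available. This sketch is an illustration of the technique, not the code of any particular sampler: it keeps every trace with an error, every trace above a latency threshold, and a random baseline of the rest. The span fields (`start_ms`, `end_ms`, `error`) and the thresholds are assumptions for the example.

```python
import random

def keep_trace(spans: list, slow_ms: float = 500, baseline: float = 0.1,
               rng=random.random) -> bool:
    """Decide, after the trace has completed, whether to keep it."""
    if any(s.get("error") for s in spans):
        return True                       # always keep traces containing an error
    duration_ms = max(s["end_ms"] for s in spans) - min(s["start_ms"] for s in spans)
    if duration_ms >= slow_ms:
        return True                       # always keep slow traces
    return rng() < baseline               # keep a baseline share of the rest

fast_ok = [{"start_ms": 0, "end_ms": 40}, {"start_ms": 5, "end_ms": 30}]
fast_error = [{"start_ms": 0, "end_ms": 40, "error": True}]
slow_ok = [{"start_ms": 0, "end_ms": 800}, {"start_ms": 10, "end_ms": 790}]

print(keep_trace(fast_error), keep_trace(slow_ok))  # True True
```

The cost of this approach is buffering: every span must be held until the trace is complete, which is why tail-based sampling needs more memory than head-based sampling.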
Multi-Region Deployment Considerations
Deploy Jaeger across regions for global microservices visibility.
Architecture Patterns
graph LR
A[US-East Services] --> B[US-East Jaeger Collector]
C[EU-West Services] --> D[EU-West Jaeger Collector]
E[AP-South Services] --> F[AP-South Jaeger Collector]
B --> G[Central Storage]
D --> G
F --> G
H[Global Query] --> G
Cross-Region Trace Context
Propagate trace context across regions without losing context:
# Cross-region trace context propagation
import requests
from opentelemetry import propagate, trace

def forward_to_region(headers: dict, region: str):
    # Tag the current span with the destination before making the call
    current_span = trace.get_current_span()
    current_span.set_attribute("destination.region", region)
    # Inject the current trace context (traceparent/tracestate by default) into the outgoing headers
    propagate.inject(headers)
    response = requests.post(
        f"https://{region}-api.example.com/process",
        headers=headers,
    )
    return response
Regional Sampling Strategies
# Regional sampling configuration (one Jaeger instance per region)
apiVersion: jaegertracing.io/v1
kind: Jaeger
metadata:
  name: multi-region-jaeger
  namespace: observability
spec:
  strategy: production
  sampling:
    options:
      default_strategy:
        type: probabilistic
        param: 0.2   # higher sampling rate in production regions
      service_strategies:
        # Always sample the service that makes cross-region calls
        # ("region-gateway" is a placeholder service name)
        - service: region-gateway
          type: probabilistic
          param: 1.0
Storage Considerations for Multi-Region
| Approach | Pros | Cons |
|---|---|---|
| Centralized (single ES) | Simple, consistent | Latency for remote collectors |
| Distributed (per-region ES) | Low latency | Complex cross-region queries |
| Hybrid (hot-warm ES) | Balance of both | Operational complexity |
Cross-Region Trace Correlation
# Correlate traces across regions (assumes spans carry a "region" tag)
curl -s "http://jaeger-query.global:16686/api/traces?service=api-gateway&lookback=1h" | \
  jq '.data[] | {
    traceID: .traceID,
    regions: ([.spans[] | (.tags // [])[] | select(.key == "region") | .value] | unique),
    duration_ms: (([.spans[].duration] | max) / 1000)
  } | select((.regions | length > 1) and (.regions | any(. == "us-east")))'
Production Failure Scenarios
| Failure | Impact | Mitigation |
|---|---|---|
| Jaeger storage backend degraded | Traces dropped; incomplete debugging data | Configure sampling; scale storage; implement buffer queue |
| Collector queue overflow | Spans dropped; monitoring gaps | Monitor queue depth; scale collectors; implement backpressure |
| Query service performance | Slow trace search; UI timeouts | Optimize queries; add caching; scale query replicas |
| Adaptive sampling too aggressive | Missing important traces | Review sampling rates; ensure error traces always sampled |
| Trace context propagation broken | Incomplete traces; orphaned spans | Implement proper propagation in all services; test regularly |
| ES backend slow | Delayed trace availability | Monitor ES cluster; optimize indices; add warm storage tier |
Common Pitfalls / Anti-Patterns
1. Not Sampling Tail for Errors
Head-based sampling misses most error traces at low sampling rates:
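# Bad: a flat 1% head-based rate (Jaeger Operator sampling options)
sampling:
  options:
    default_strategy:
      type: probabilistic
      param: 0.01   # 1%: most error traces are never recorded
# Better: keep every error trace with a tail-sampling policy
# (OpenTelemetry Collector tail_sampling processor in front of Jaeger)
processors:
  tail_sampling:
    policies:
      - name: keep-errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: baseline
        type: probabilistic
        probabilistic:
          sampling_percentage: 1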
2. Missing Semantic Attributes
Traces without standard attributes are hard to query:
// Bad: ad-hoc attribute name with no namespace
const span = tracer.startSpan("OrderService.process");
span.setAttribute("orderId", order.id); // inconsistent names are hard to query across services
// Good: consistent, namespaced names (use OpenTelemetry semantic conventions where one exists)
const span = tracer.startSpan("OrderService.process");
span.setAttribute("order.id", order.id);
span.setAttribute("order.total", order.total);
span.setAttribute("customer.tier", customer.tier);
3. Creating Child Spans Without Parent Context
Orphaned spans break trace continuity:
// Bad: the span is never made active, so children cannot find it
async function processOrder(order) {
  const span = tracer.startSpan("OrderService.process");
  await validateOrder(order); // spans started here are not linked to this span
  span.end();
}
// Good: make the span active so children pick it up from context
async function processOrder(order) {
  return tracer.startActiveSpan("OrderService.process", async (span) => {
    try {
      await validateOrder(order); // spans started inside inherit this span as parent
    } finally {
      span.end();
    }
  });
}
4. Storing Large Payloads in Span Events
Span events are not a data store:
// Bad: Large payload in span
span.addEvent("response", { body: JSON.stringify(largeResponse) });
// Good: Reference by ID or summary
span.addEvent("response", {
"response.size_bytes": largeResponse.length,
"response.status": "success",
});
5. Not Monitoring Jaeger Itself
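If Jaeger drops spans and nobody watches its own metrics, the gaps go unnoticed. Scrape the collector's metrics endpoint:
# Prometheus scrape config for the Jaeger v1 collector admin port
scrape_configs:
  - job_name: "jaeger"
    static_configs:
      - targets: ["jaeger-collector:14269"]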
Real-world Failure Scenarios
Scenario 1: Jaeger Collector OOM During Traffic Spike
What happened: A product launch caused a 20x spike in trace volume. The Jaeger collector’s in-memory queue filled up and the process was killed by the OOM killer.
Root cause: The collector was configured with a fixed in-memory queue size but no dead-letter queue or sampling strategy to handle sudden volume increases.
Impact: Approximately 15 minutes of traces were lost during the product launch window. Engineers could not correlate the elevated error rate with specific services.
Lesson learned: Configure adaptive sampling or tail-based sampling to automatically reduce trace volume during traffic spikes. Set up dead-letter queues for failed trace ingestion. Monitor collector queue depth and set OOM alerts.
Scenario 2: Cassandra Storage Saturation
What happened: Over several months, the Cassandra storage cluster used by Jaeger reached its capacity limit. Trace data began being dropped silently as inserts failed.
Root cause: No capacity planning was done for trace retention. The default 30-day retention was never adjusted as the system scaled.
Impact: Historical traces older than 2 weeks became unavailable, making retrospective analysis of a security incident impossible.
Lesson learned: Plan storage capacity based on expected trace volume and retention requirements. Monitor storage node disk usage and set alerts. Consider down-sampling older traces to reduce storage costs.
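Capacity planning for trace storage is mostly arithmetic. The sketch below is a back-of-the-envelope estimate; every input (span rate, average span size, sampling rate, replication factor) is an assumption you would replace with measured values, and index overhead is ignored.

```python
def storage_gb(spans_per_sec: float, avg_span_bytes: int, retention_days: int,
               sampling_rate: float = 1.0, replication: int = 1) -> float:
    """Estimate stored trace data in GB over the retention window."""
    stored_per_day = spans_per_sec * sampling_rate * 86_400 * avg_span_bytes
    return stored_per_day * retention_days * replication / 1e9

# Hypothetical workload: 5,000 spans/s, 500-byte spans, 10% sampling,
# 14-day retention, replication factor 2
estimate = storage_gb(5000, 500, 14, sampling_rate=0.1, replication=2)
print(round(estimate, 1))  # 604.8
```

Re-running the estimate whenever traffic or retention changes gives an early warning before inserts start failing.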
Trade-off Analysis
When designing a Jaeger deployment, several key trade-offs require careful consideration.
Storage Backend Selection
| Criteria | Elasticsearch | Cassandra | Kafka |
|---|---|---|---|
| Scalability | Horizontal sharding, ILM support | Tunable consistency, wide rows | Partition-based parallelism |
| Query Performance | Excellent aggregations, full-text search | Fast reads for trace ID lookups | Requires separate consumer |
| Operational Complexity | High (cluster management, ILM) | Medium (SSTables, compaction) | Medium (brokers, replication) |
| Cost at Scale | Higher (memory-heavy) | Lower (disk-efficient) | Variable (depends on retention) |
| Best For | Large teams, advanced analytics | High write throughput | Event-driven replay scenarios |
Sampling Strategy Trade-offs
| Strategy | Storage Savings | Debug Coverage | Latency Overhead | Complexity |
|---|---|---|---|---|
| Probabilistic | High | Low (misses rare events) | Minimal | Simple |
| Adaptive | Medium | Medium (rebalances rates per operation; does not target errors) | Low | Medium |
| Tail-based | Variable | Highest (captures all important) | Higher | Complex |
| Rate-limiting | Predictable | Medium | Minimal | Simple |
Deployment Architecture Comparison
| Aspect | All-in-One | Production (Separated) | Multi-Region |
|---|---|---|---|
| Resource Usage | Minimal | High | Very High |
| Scalability | None | Horizontal collectors/query | Regional collectors |
| Operational Overhead | Minimal | Medium | High |
| Fault Isolation | Poor (single point) | Good (component isolation) | Excellent (regional failure domains) |
| Latency | Low (local only) | Medium (network to collectors) | Higher (cross-region) |
| Setup Time | Minutes | Hours | Days |
| Use Case | Development, CI/CD | Production (single region) | Global enterprises |
Agent Deployment Models
| Model | Pros | Cons | Best Environment |
|---|---|---|---|
| Sidecar (per pod) | Simple injection, local UDP | Resource overhead per pod | Kubernetes with sidecar injection |
| Daemonset (node-level) | Shared resource pool, lower overhead | Requires host networking | Dense node deployments |
| Agentless (direct to collector) | No agent maintenance | Higher latency (network hop), firewall rules | Secure environments, small scale |
Instrumentation Approach Trade-offs
| Approach | Effort | Granularity | Maintenance | Best Stage |
|---|---|---|---|---|
| Auto-instrumentation | Low | Medium | Low | Initial adoption |
| Manual instrumentation | High | Fine-grained | Higher | Production hardening |
| Hybrid (auto + manual) | Medium | Fine-grained | Medium | Mature observability |
When to Use Jaeger
Use Jaeger when:
- Debugging latency issues across microservice boundaries
- Understanding service dependencies and call patterns
- Root cause analysis for cascading failures
- Optimizing performance by identifying bottlenecks
- Validating trace context propagation
- Monitoring distributed transactions
- Detecting anomalies in request flows
Don’t use Jaeger when:
- You have single monolithic applications without service boundaries
- You need purely metric-based monitoring
- Even minimal tracing overhead is unacceptable for your latency budget
- You need long-term log storage
- You only need aggregate analytics (use dashboards)
Interview Questions
Expected answer points:
- Distributed tracing tracks requests across service boundaries in microservices architectures
- Jaeger implements tracing with collector, query, and UI components, plus an optional (deprecated) agent
- Spans represent individual operations, and traces are composed of connected spans forming a request path
- Jaeger stores traces in backends like Elasticsearch, Cassandra, or Kafka for querying and visualization
Expected answer points:
- Jaeger Agent: Sidecar or daemonset that receives spans via UDP and forwards to collectors
- Jaeger Collector: Receives spans, processes them, and stores in the configured backend
- Jaeger Query: Backend service that retrieves traces from storage for display
- Jaeger UI: Web interface for searching and visualizing traces
- Supported storage backends: Elasticsearch, Cassandra, Kafka
Expected answer points:
- Probabilistic: Fixed percentage of traces captured, simple but may miss important events
- Adaptive: The collector adjusts per-operation sampling probabilities toward a target rate, so low-traffic operations are still sampled
- Tail-based: Collects full trace after seeing the end, ideal for capturing all slow or error traces
- Use adaptive for production with high traffic, tail-based for debugging specific issues
Expected answer points:
- Search for traces with high duration using the Jaeger UI or API
- Identify which service has the longest spans in the trace waterfall
- Check span tags for business context (order ID, user ID, etc.)
- Review span logs for errors, database queries, or external calls
- Follow the trace backward to find where latency was introduced
Expected answer points:
- Trace context propagation passes trace and span IDs across service boundaries via HTTP headers
- Without proper propagation, spans become orphaned and traces appear incomplete
- OpenTelemetry uses W3C Trace Context headers (traceparent, tracestate)
- Context must be injected before outgoing requests and extracted on incoming requests
Expected answer points:
- Use OpenTelemetry SDK with Jaeger exporter or OTLP exporter pointing to Jaeger collector
- Auto-instrumentation available for Python, Java, Node.js, Go, .NET and other languages
- Manual instrumentation using OpenTelemetry API for custom business logic
- Configure resource attributes like service.name for proper identification
Expected answer points:
- Elasticsearch: Best for large-scale production, excellent query performance, ILM support
- Cassandra: Traditional choice, good for very high write throughput, tunable consistency
- Kafka: Enables buffering and replay, useful for event-driven architectures
- Consider ingestion rate, query patterns, operational complexity, and existing infrastructure
Expected answer points:
- Head-based sampling missing error traces at low sampling rates
- Missing semantic attributes making traces hard to query
- Orphaned spans from not propagating parent context
- Storing large payloads in span events causing performance issues
- Not monitoring Jaeger itself leading to observability gaps
Expected answer points:
- Define SLO thresholds based on trace latency percentiles and error rates
- Use tail-based sampling to ensure error and slow traces are always captured
- Export Jaeger metrics to Prometheus for alerting on collector queue depth, ingestion rate
- Create dashboards correlating trace data with business-level SLOs
Expected answer points:
- Authenticate Jaeger UI access to prevent unauthorized trace data exposure
- Avoid sensitive data (passwords, tokens, PII) in span tags and logs
- Configure TLS for all Jaeger endpoints including gRPC and HTTP
- Restrict storage backend access and audit trace data exports
- Sanitize trace context headers before external calls to prevent injection
Expected answer points:
- Jaeger and Zipkin share the same trace data model (spans and traces) making interoperability possible
- Jaeger supports native OTLP ingestion while Zipkin requires additional collectors for OTLP
- Jaeger provides more advanced sampling strategies including tail-based sampling
- Zipkin has a simpler deployment model suited for smaller deployments
- Migration involves re-instrumenting services with Jaeger clients or using the Zipkin receiver in Jaeger collector
- Jaeger stores data in Elasticsearch, Cassandra, or Kafka while Zipkin typically uses in-memory or Cassandra
Expected answer points:
- Istio automatically instruments traffic with Jaeger via the OpenTelemetry collector addon
- Configure Istio to export traces to Jaeger using the mesh config and tracing options
- Linkerd uses its own distributed tracing capability that can export to Jaeger
- Service mesh sidecars handle trace context propagation automatically across proxy boundaries
- Jaeger helps identify service mesh performance issues and proxy overhead
- High cardinality of mesh-generated traces requires careful sampling strategy configuration
Expected answer points:
- Head-based sampling decides at the start of a trace whether to sample, using probabilistic or rate-limiting strategies
- Tail-based sampling collects all traces but makes the sampling decision at the end based on policy rules
- Head-based sampling is simpler and requires less resources but may miss important rare events
- Tail-based sampling ensures error traces and slow traces are always captured for debugging
- Jaeger's adaptive sampling is still head-based: the collector tunes per-operation probabilities, but the decision is made when the trace starts
Expected answer points:
- Flame graphs visualize trace duration as stacked bars showing time spent in each span
- Wide blocks indicate where most time is spent, identifying latency bottlenecks
- Deep stacks show trace depth and parent-child relationships across services
- Jaeger's trace detail view provides a waterfall representation that functions like a flame graph
- Color coding helps distinguish services, errors, and external calls
- Compare flame graphs across time periods to detect performance regressions
Expected answer points:
- Scale Jaeger collectors horizontally to handle increased ingestion load
- Use Kafka as a buffer between collectors and storage to handle spikes
- Configure adaptive sampling to reduce storage requirements without losing critical traces
- Elasticsearch index lifecycle management helps control storage costs
- Query service can be scaled horizontally with read replicas
- Monitor queue depth and span latency to detect scaling needs proactively
Expected answer points:
- Inject trace context into message headers or properties before publishing
- Extract trace context on the consumer side and create child spans
- OpenTelemetry provides automatic propagation for Kafka, RabbitMQ, and other messaging systems
- Ensure message processing spans have proper parent context for complete trace continuity
- Batch message processing requires careful span management to avoid orphaned spans
Expected answer points:
- Use tag allowlisting to restrict which attributes are stored with spans
- Implement sampling strategies that capture complete traces for errors and slow requests
- Aggregate high-cardinality attributes like user IDs or request IDs into bucketed values
- Configure storage backends with appropriate index mappings to handle cardinality
- Use trace quality scoring to identify and drop low-value traces
Expected answer points:
- Spans record an absolute start timestamp from the reporting host's clock, plus a duration
- Clock skew between hosts can make a child span appear to start before its parent
- Jaeger's query service applies a clock-skew adjustment that shifts child spans to fit within their parent span
- The Jaeger UI displays spans as offsets from the trace start after this adjustment
- Keep host clocks synchronized (for example with NTP), and for cross-region traces compare durations rather than timestamps
Expected answer points:
- Use the Jaeger Operator for declarative Kubernetes deployments and upgrades
- Deploy the agent as a daemonset for sidecar-less architecture with lower overhead
- Configure resource limits based on expected trace volume and processing requirements
- Use separate namespaces for observability components to enable proper RBAC
- Implement pod disruption policies for collector and query deployments
- Monitor Jaeger itself with Prometheus metrics exposed on each component's admin port (14269 for the v1 collector)
Expected answer points:
- Use consistent service names and labels across Jaeger, Prometheus, and ELK
- Export Jaeger metrics to Prometheus for correlation between trace latency and system metrics
- Include trace ID in log lines to enable cross-platform correlation
- Use the trace ID from Jaeger spans to search corresponding logs in Elasticsearch
- Build Grafana dashboards that combine Jaeger trace data with Prometheus metrics
- Include trace and span IDs in error messages to link directly to the relevant trace in Jaeger UI
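Log correlation mostly comes down to writing the trace ID into every log line and knowing how to turn it into a link. The sketch below assumes JSON-structured logs and a Jaeger UI reachable at a known base URL; `LogContext`, `formatLogLine`, `jaegerTraceUrl`, and the field names `trace_id` and `span_id` are hypothetical choices for illustration.

```typescript
// Hypothetical context a request handler would have available.
interface LogContext {
  service: string;
  traceId: string;
  spanId: string;
}

// Structured log line carrying the IDs needed to find the trace later.
function formatLogLine(level: string, message: string, ctx: LogContext): string {
  return JSON.stringify({
    level,
    message,
    service: ctx.service,
    trace_id: ctx.traceId,
    span_id: ctx.spanId,
  });
}

// Link to a trace in the Jaeger UI, which serves traces under /trace/<traceId>.
function jaegerTraceUrl(baseUrl: string, traceId: string): string {
  const root = baseUrl.replace(/\/+$/, ""); // drop trailing slashes
  return `${root}/trace/${traceId}`;
}
```

For the local all-in-one setup, `jaegerTraceUrl("http://localhost:16686", traceId)` gives a link that can be embedded in an error message or a Grafana panel; the same `trace_id` value is what you would search for in Elasticsearch.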
Further Reading
- Distributed Tracing - OpenTelemetry integration and tracing fundamentals
- Prometheus & Grafana - Metrics collection and visualization
- ELK Stack - Log aggregation and analysis
Conclusion
Jaeger brings distributed tracing to microservices debugging and analysis. Its trace visualization, dependency mapping, and performance analysis help you make sense of complex request flows.
Start with automatic instrumentation for your services, deploy the all-in-one version for development, and scale to a production deployment with Elasticsearch storage when you are ready.
For complete observability, combine tracing with metrics (Prometheus & Grafana) and logs (ELK Stack). The Distributed Tracing guide covers OpenTelemetry integration in more detail.
Quick Recap
Key Takeaways:
- Jaeger provides distributed tracing visualization for microservices
- Deploy collectors, query, and UI separately in production
- Use adaptive sampling to control volume and tail-based sampling to keep error and slow traces
- Always propagate trace context through HTTP headers and queues
- Monitor Jaeger itself: ingestion rate, queue depth, storage latency
- Combine with metrics (Prometheus) and logs (ELK) for complete observability
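The tail-based sampling takeaway can be made concrete with a small decision function. This is a sketch under stated assumptions: the decision runs after all spans of a trace have been collected, and `FinishedSpan`, `TailPolicy`, and `shouldKeepTrace` are hypothetical names. In practice this logic usually lives in a component such as the OpenTelemetry Collector's tail sampling processor rather than in application code.

```typescript
// Hypothetical summary of a completed span.
interface FinishedSpan {
  durationMs: number;
  isError: boolean;
}

interface TailPolicy {
  slowThresholdMs: number; // keep any trace with a span at least this slow
  baselineRate: number; // probability in [0, 1] of keeping an ordinary trace
}

// Decide after the trace is complete: always keep errors and slow traces,
// and keep a random fraction of everything else.
function shouldKeepTrace(
  spans: FinishedSpan[],
  policy: TailPolicy,
  random: () => number = Math.random,
): boolean {
  if (spans.some((s) => s.isError)) return true;
  const slowest = spans.reduce((max, s) => Math.max(max, s.durationMs), 0);
  if (slowest >= policy.slowThresholdMs) return true;
  return random() < policy.baselineRate;
}
```

The `random` parameter is injectable so the baseline branch can be tested deterministically; head-based sampling cannot make the first two decisions because it runs before the outcome of the request is known.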
Copy/Paste Checklist:
# Production Jaeger CRD
apiVersion: jaegertracing.io/v1
kind: Jaeger
metadata:
  name: production-jaeger
spec:
  strategy: production
  collector:
    maxReplicas: 5
    resources:
      limits:
        cpu: 500m
        memory: 512Mi
  storage:
    type: elasticsearch
    elasticsearch:
      nodeCount: 3
  query:
    replicas: 2
# Adaptive sampling config
spec:
  sampling:
    type: adaptive
    adaptive:
      max_traces_per_second: 100
      initial_sampling_rate: 10
# Prometheus metrics scrape
scrape_configs:
  - job_name: 'jaeger-collector'
    static_configs:
      # 14269 is the Jaeger v1 collector admin port; Jaeger v2 exposes metrics on 8888
      - targets: ['jaeger-collector:14269']
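// Trace context propagation
import { propagation, context, SpanStatusCode } from "@opentelemetry/api";

// Write the active trace context into outgoing headers or message metadata
function injectTraceContext(headers: Record<string, string>) {
  propagation.inject(context.active(), headers);
}

// Build a context from incoming headers; run the handler inside the returned context
function extractTraceContext(headers: Record<string, string>) {
  return propagation.extract(context.active(), headers);
}

// Error span recording (span is the current Span, error is the caught exception)
span.recordException(error);
span.setStatus({ code: SpanStatusCode.ERROR });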
Reference Checklists
Observability Checklist
Jaeger Infrastructure Metrics
- Traces received per second (ingestion rate)
- Spans received and processed
- Collector queue depth
- Backend storage write latency
- Query service latency (p50, p95, p99)
- Active connections to storage
Trace Coverage Metrics
- Services with instrumentation
- Trace completeness (spans per trace average)
- Error trace percentage
- Slow trace percentage (>threshold)
- Span attribute coverage
Sampling Configuration
- Head-based sampling rate
- Adaptive sampling thresholds
- Tail sampling policies (errors, slow traces)
- Always-sampled tag configuration
Alerting Rules
- Collector queue depth > threshold
- Storage write latency degraded
- Query service down
- Ingestion rate drop (may indicate a collector, network, or instrumentation failure)
Security Checklist
- Jaeger UI access authenticated
- No sensitive data in span tags or logs (passwords, tokens, PII)
- TLS configured for all endpoints
- Trace data access logged and audited
- Sampling does not inadvertently exclude security-relevant traces
- Trace context headers sanitized before external calls
- Storage backend access restricted
- No internal service names exposed to external trace exports