Jaeger: Distributed Tracing for Microservices
Learn Jaeger for distributed tracing visualization. Covers trace analysis, dependency mapping, and integration with OpenTelemetry.
Jaeger is an open-source distributed tracing system for monitoring and troubleshooting microservices. It gives you visibility into request flows, per-service latency, and service dependencies.
This guide covers Jaeger deployment, trace analysis, and practical debugging workflows. For tracing fundamentals, see our Distributed Tracing guide first.
Jaeger Architecture
graph TB
A[Services] -->|OTLP| B[Jaeger Collector]
B --> C[Storage Backend]
C --> D[Elasticsearch]
C --> E[Cassandra]
C --> F[Kafka]
G[Jaeger Query] --> C
H[Jaeger UI] --> G
Jaeger is built from a small set of components:
- Agent (deprecated in recent releases): sidecar or DaemonSet that received spans over UDP; modern deployments send OTLP directly to the collector
- Collector: receives spans, validates and processes them, and writes them to storage
- Query: backend service that retrieves traces from storage
- UI: web interface for trace exploration
Deployment Options
All-in-One Quick Start
For local development:
docker run -d \
--name jaeger \
-e COLLECTOR_ZIPKIN_HOST_PORT=:9411 \
-p 6831:6831/udp \
-p 6832:6832/udp \
-p 16686:16686 \
-p 14268:14268 \
-p 4317:4317 \
-p 4318:4318 \
jaegertracing/all-in-one:latest
Access the UI at http://localhost:16686.
Production Deployment
For Kubernetes:
# jaeger-operator.yaml
apiVersion: jaegertracing.io/v1
kind: Jaeger
metadata:
  name: production-jaeger
  namespace: observability
spec:
  strategy: production
  collector:
    maxReplicas: 5
    resources:
      limits:
        cpu: 500m
        memory: 512Mi
  storage:
    type: elasticsearch
    elasticsearch:
      name: elasticsearch
      doNotProvision: true
    secretName: jaeger-elasticsearch
  query:
    replicas: 2
    options:
      query:
        base-path: /jaeger
External Elasticsearch Backend
apiVersion: v1
kind: Secret
metadata:
  name: jaeger-elasticsearch
  namespace: observability
type: Opaque
stringData:
  ELASTICSEARCH_SERVER: "https://elasticsearch:9200"
  ELASTICSEARCH_USERNAME: "jaeger"
  ELASTICSEARCH_PASSWORD: "${ES_PASSWORD}"
---
apiVersion: jaegertracing.io/v1
kind: Jaeger
metadata:
  name: production-jaeger
  namespace: observability
spec:
  strategy: production
  collector:
    autoscale: true
    maxReplicas: 5
  storage:
    type: elasticsearch
    elasticsearch:
      nodeCount: 3
      redundancyPolicy: SingleRedundancy
      storage:
        size: 200Gi
        storageClassName: fast-storage
    esIndexCleaner:
      enabled: true
      numberOfDays: 14
      schedule: "55 5 * * *"
  query:
    replicas: 2
Jaeger UI
The Jaeger UI provides several views for trace analysis.
Search View
The search view lets you find traces by:
- Service name
- Operation name
- Trace ID
- Time range
- Tag filters
- Duration range
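These filters map directly onto query parameters of the jaeger-query HTTP API (which is internal and unversioned, so treat the parameter names as a convention rather than a contract). A minimal sketch of building a search URL from them, using only the standard library:

```python
import json
from urllib.parse import urlencode

def build_search_url(base, service, operation=None, lookback="1h",
                     min_duration=None, tags=None, limit=20):
    """Build a Jaeger trace-search URL from the filters the UI exposes."""
    params = {"service": service, "lookback": lookback, "limit": limit}
    if operation:
        params["operation"] = operation
    if min_duration:
        params["minDuration"] = min_duration  # e.g. "500ms"
    if tags:
        # The API accepts tag filters as a JSON-encoded map
        params["tags"] = json.dumps(tags)
    return f"{base}/api/traces?{urlencode(params)}"

url = build_search_url("http://jaeger-query:16686", "api-gateway",
                       min_duration="500ms", tags={"http.status_code": "500"})
print(url)
```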
Trace Detail View
graph TD
A[API Gateway<br/>Total: 1.2s] --> B[Auth Service<br/>200ms]
A --> C[Order Service<br/>800ms]
C --> D[Payment Service<br/>500ms]
C --> E[Inventory Service<br/>150ms]
C --> F[Notification<br/>50ms]
D --> G[External Bank<br/>400ms]
The trace view shows:
- Parent-child span relationships
- Timing for each span
- Span tags and logs
- Total trace duration
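The same timing data can be inspected programmatically. A small sketch that ranks spans by duration, assuming spans shaped like the Jaeger JSON API output (durations in microseconds, service name under `process.serviceName`):

```python
def slowest_spans(spans, top=3):
    """Return the `top` longest spans as (service, operation, duration_us)."""
    ranked = sorted(spans, key=lambda s: s["duration"], reverse=True)
    return [(s["process"]["serviceName"], s["operationName"], s["duration"])
            for s in ranked[:top]]

trace_spans = [
    {"operationName": "GET /checkout", "duration": 1_200_000,
     "process": {"serviceName": "api-gateway"}},
    {"operationName": "charge", "duration": 500_000,
     "process": {"serviceName": "payment-service"}},
    {"operationName": "reserve", "duration": 150_000,
     "process": {"serviceName": "inventory-service"}},
]
print(slowest_spans(trace_spans, top=2))
```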
Span Detail Panel
Clicking a span reveals:
- Operation name and service
- Start time and duration
- Tags (key-value attributes)
- Logs (timestamped events)
- References (parent-child links)
Trace Analysis Workflows
Debugging a Slow Request
Find the bottleneck in a slow trace:
- Search for traces with high duration
- Identify which service has the longest spans
- Check span tags for business context
- Review span logs for errors or unusual events
# Search for slow traces via the API (minDuration keeps only traces of 5s or more)
curl -s "http://jaeger-query:16686/api/traces?service=api-gateway&lookback=1h&minDuration=5s" | \
  jq '.data[] | {traceID: .traceID, duration: ([.spans[].duration] | max), services: ([.processes[].serviceName] | unique)}'
Finding Error Sources
Identify which service is causing errors:
- Filter traces by error status
- Examine the error span and its logs
- Follow the trace backward to find root cause
# Find traces containing at least one error span (in the API JSON, tags are {key, type, value})
curl -s "http://jaeger-query:16686/api/traces?service=checkout-service&lookback=1h" | \
  jq '.data[] | select(any(.spans[]; any(.tags[]?; .key == "error" and .value == true)))'
Analyzing Service Dependencies
Use the dependency view to understand service relationships:
- Navigate to the Dependency graph view
- Click on services to see call patterns
- Identify services with high fan-out
- Spot potential bottlenecks
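Fan-out can also be computed from the dependency data itself. A sketch over edges shaped like the `/api/dependencies` response (`parent`, `child`, `callCount`), counting distinct downstream callees per service:

```python
from collections import Counter

def fan_out(dependencies):
    """Count distinct downstream callees per service from dependency edges."""
    out = Counter()
    for edge in dependencies:
        out[edge["parent"]] += 1
    return out

edges = [
    {"parent": "api-gateway", "child": "auth-service", "callCount": 120},
    {"parent": "api-gateway", "child": "order-service", "callCount": 98},
    {"parent": "order-service", "child": "payment-service", "callCount": 95},
    {"parent": "order-service", "child": "inventory-service", "callCount": 97},
]
print(fan_out(edges))
```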
Advanced Features
Adaptive Sampling
Reduce storage by sampling intelligently:
apiVersion: jaegertracing.io/v1
kind: Jaeger
metadata:
  name: adaptive-sampling
spec:
  strategy: production
  sampling:
    type: adaptive
    adaptive:
      sampling_server_url: jaeger-agent:5778
      max_traces_per_second: 100
      initial_sampling_rate: 0.1  # 10%
Service Performance Monitoring (Monitor Tab)
The Monitor tab visualizes metrics derived from trace data per service, showing:
- Average latency per operation
- Request throughput
- Error rates
- Dependency links
Trace Quality Scoring
Jaeger's analytics tooling can score traces on completeness and quality. Common quality indicators:
- Missing span tags
- Incomplete trace depth
- High error rate
- Excessive span count
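The indicators above can be approximated with simple heuristics over a trace's span list. A sketch (the thresholds, required tag keys, and data shape are illustrative assumptions, not Jaeger's actual scoring rules):

```python
def trace_quality_issues(spans, max_spans=500, required_tags=("http.method",)):
    """Flag common quality problems in a single trace's span list."""
    issues = []
    if len(spans) > max_spans:
        issues.append(f"excessive span count: {len(spans)}")
    for span in spans:
        keys = {t["key"] for t in span.get("tags", [])}
        missing = [k for k in required_tags if k not in keys]
        if missing:
            issues.append(f"{span['operationName']}: missing tags {missing}")
        if "error" in keys:
            issues.append(f"{span['operationName']}: error tag set")
    return issues

spans = [
    {"operationName": "GET /orders",
     "tags": [{"key": "http.method", "value": "GET"}]},
    {"operationName": "charge", "tags": [{"key": "error", "value": True}]},
]
print(trace_quality_issues(spans))
```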
Integrating with OpenTelemetry
OpenTelemetry is the modern standard for tracing:
Automatic Instrumentation
# Python auto-instrumentation setup
from flask import Flask
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.instrumentation.flask import FlaskInstrumentor
from opentelemetry.instrumentation.requests import RequestsInstrumentor

app = Flask(__name__)

trace.set_tracer_provider(
    TracerProvider(resource=Resource.create({"service.name": "my-service"}))
)
trace.get_tracer_provider().add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://jaeger-collector:4317"))
)
FlaskInstrumentor().instrument_app(app)
RequestsInstrumentor().instrument()
Manual Span Creation
import { context, trace, Span, SpanStatusCode } from "@opentelemetry/api";

const tracer = trace.getTracer("order-service");

async function processOrder(order: Order) {
  const span = tracer.startSpan("OrderService.process");
  try {
    await validateOrder(order, span);
    await chargePayment(order, span);
    await fulfillOrder(order, span);
    span.setStatus({ code: SpanStatusCode.OK });
  } catch (error) {
    span.recordException(error as Error);
    span.setStatus({
      code: SpanStatusCode.ERROR,
      message: (error as Error).message,
    });
    throw error;
  } finally {
    span.end();
  }
}

async function validateOrder(order: Order, parentSpan: Span) {
  // startSpan has no `parent` option; pass the parent via a context value
  const span = tracer.startSpan(
    "OrderService.validate",
    undefined,
    trace.setSpan(context.active(), parentSpan),
  );
  span.setAttribute("validation.type", "business");
  // Validation logic
  span.end();
}
Performance Analysis
Latency Percentiles
Analyze latency distribution:
# List instrumented services
curl -s "http://jaeger-query:16686/api/services" | jq '.data[]'
# Pull recent traces and compute a rough p95 of trace duration (microseconds)
curl -s "http://jaeger-query:16686/api/traces?service=api-gateway&lookback=1h&limit=200" | \
  jq '[.data[] | ([.spans[].duration] | max)] | sort | .[(length * 0.95 | floor)]'
Throughput Analysis
Understand request volume patterns:
# Get traces count over time
curl -s "http://jaeger-query:16686/api/traces?service=api-gateway&start=$(date -d '1 hour ago' +%s000000)&end=$(date +%s000000)&limit=1000" | \
jq '.data | length'
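To see the shape of traffic rather than just a total, bucket trace start times per minute. A stdlib-only sketch over start timestamps in microseconds (the unit Jaeger's API uses):

```python
from collections import Counter

def per_minute_counts(start_times_us):
    """Bucket trace start timestamps (microseconds since epoch) into per-minute counts."""
    buckets = Counter()
    for ts in start_times_us:
        buckets[ts // 60_000_000] += 1  # 60 s * 1e6 us per bucket
    return buckets

# Three traces in the first minute, one in the next
starts = [0, 30_000_000, 59_000_000, 61_000_000]
print(dict(per_minute_counts(starts)))
```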
Error Rate Correlation
Correlate errors across services:
# Find traces with errors and the services emitting them.
# The API does not accept a wildcard service, so query one service at a time.
curl -s "http://jaeger-query:16686/api/traces?service=checkout-service&lookback=1h" | \
  jq '.data[] | . as $t | {
    traceID: .traceID,
    errors: [.spans[] | select(any(.tags[]?; .key == "error")) | $t.processes[.processID].serviceName]
  } | select(.errors | length > 0)'
Alerting on Trace Data
Prometheus Metrics from Jaeger
# jaeger-metrics-exporter.yaml
apiVersion: v1
kind: Service
metadata:
  name: jaeger-metrics
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "8888"
spec:
  ports:
    - port: 8888
      targetPort: 8888
---
# Prometheus scrape config
scrape_configs:
  - job_name: "jaeger"
    static_configs:
      - targets: ["jaeger-collector:8888"]
Key Metrics to Monitor
| Metric | Description |
|---|---|
| jaeger_collector_traces_received | Incoming traces count |
| jaeger_collector_spans_received | Total spans received |
| jaeger_collector_queue_length | Pending spans in queue |
| jaeger_query_latency | Query response time |
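A useful derived signal from these counters is the span drop rate: the fraction of received spans that never reached storage. A trivial sketch of the arithmetic (the counter values here are made up for illustration):

```python
def span_drop_rate(received, saved):
    """Fraction of received spans that never reached storage."""
    if received == 0:
        return 0.0
    return (received - saved) / received

# Example: 1,000,000 spans received, 990,000 written to storage
rate = span_drop_rate(1_000_000, 990_000)
print(f"drop rate: {rate:.2%}")
```

Alert when this rate climbs above a small threshold; a rising drop rate usually means the collector queue or the storage backend is saturated.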
Storage Backends
Jaeger supports multiple storage backends.
Elasticsearch
Best for large-scale production deployments:
# Elasticsearch with index settings
apiVersion: jaegertracing.io/v1
kind: Jaeger
metadata:
  name: production-jaeger
spec:
  strategy: production
  storage:
    type: elasticsearch
    elasticsearch:
      nodeCount: 3
      redundancyPolicy: SingleRedundancy
    options:
      es:
        num-shards: 5
        num-replicas: 1
Cassandra
Traditional choice for high volume:
apiVersion: jaegertracing.io/v1
kind: Jaeger
metadata:
  name: production-jaeger
spec:
  strategy: production
  storage:
    type: cassandra
    options:
      cassandra:
        servers: cassandra:9042
        keyspace: jaeger_v1
Kafka
For buffering and replay:
apiVersion: jaegertracing.io/v1
kind: Jaeger
metadata:
  name: production-jaeger
spec:
  strategy: streaming
  collector:
    maxReplicas: 10
    options:
      kafka:
        producer:
          topic: jaeger-spans  # create the topic with ~10 partitions for parallelism
          brokers: kafka:9092
  ingester:
    replicas: 2
When to Use Jaeger
Use Jaeger when:
- Debugging latency issues across microservice boundaries
- Understanding service dependencies and call patterns
- Root cause analysis for cascading failures
- Optimizing performance by identifying bottlenecks
- Validating trace context propagation
- Monitoring distributed transactions
- Detecting anomalies in request flows
Don’t use Jaeger when:
- You have single monolithic applications without service boundaries
- You need purely metric-based monitoring
- Even small tracing overhead is unacceptable on latency-critical paths
- You need long-term log storage
- You only need aggregate analytics (use dashboards)
Production Failure Scenarios
| Failure | Impact | Mitigation |
|---|---|---|
| Jaeger storage backend degraded | Traces dropped; incomplete debugging data | Configure sampling; scale storage; implement buffer queue |
| Collector queue overflow | Spans dropped; monitoring gaps | Monitor queue depth; scale collectors; implement backpressure |
| Query service performance | Slow trace search; UI timeouts | Optimize queries; add caching; scale query replicas |
| Adaptive sampling too aggressive | Missing important traces | Review sampling rates; ensure error traces always sampled |
| Trace context propagation broken | Incomplete traces; orphaned spans | Implement proper propagation in all services; test regularly |
| ES backend slow | Delayed trace availability | Monitor ES cluster; optimize indices; add warm storage tier |
Observability Checklist
Jaeger Infrastructure Metrics
- Traces received per second (ingestion rate)
- Spans received and processed
- Collector queue depth
- Backend storage write latency
- Query service latency (p50, p95, p99)
- Active connections to storage
Trace Coverage Metrics
- Services with instrumentation
- Trace completeness (spans per trace average)
- Error trace percentage
- Slow trace percentage (>threshold)
- Span attribute coverage
Sampling Configuration
- Head-based sampling rate
- Adaptive sampling thresholds
- Tail sampling policies (errors, slow traces)
- Always-sampled tag configuration
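The key property of head-based sampling is that the keep/drop decision is deterministic in the trace ID, so every service in a request path agrees without coordination. A sketch of the idea (not Jaeger's exact algorithm):

```python
def head_sample(trace_id: int, rate: float) -> bool:
    """Keep a trace iff its 64-bit ID falls in the lowest `rate` fraction
    of the ID space. Deterministic: every service makes the same call."""
    return trace_id < int(rate * (1 << 64))

print(head_sample(0x0123456789ABCDEF, 0.1))  # low ID: sampled
print(head_sample(0xFEDCBA9876543210, 0.1))  # high ID: dropped
```

Because the decision is made before the outcome is known, a pure head-based policy drops errors and slow requests at the same rate as everything else, which is why tail or adaptive policies matter for debugging.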
Alerting Rules
- Collector queue depth > threshold
- Storage write latency degraded
- Query service down
- Ingestion rate drop (potential issue)
Security Checklist
- Jaeger UI access authenticated
- No sensitive data in span tags or logs (passwords, tokens, PII)
- TLS configured for all endpoints
- Trace data access logged and audited
- Sampling does not inadvertently exclude security-relevant traces
- Trace context headers sanitized before external calls
- Storage backend access restricted
- No internal service names exposed to external trace exports
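One way to enforce the "no sensitive data in tags" item is a scrubbing pass before spans are exported. A minimal sketch over tags in the Jaeger JSON shape; the key pattern and placeholder are assumptions to adapt to your own policy:

```python
import re

# Tag keys that look like credentials or identifiers worth scrubbing
SENSITIVE = re.compile(r"(password|token|secret|authorization|ssn)", re.IGNORECASE)

def redact_tags(tags):
    """Replace values of sensitive-looking tag keys with a placeholder."""
    return [
        {**t, "value": "[REDACTED]"} if SENSITIVE.search(t["key"]) else t
        for t in tags
    ]

tags = [
    {"key": "http.method", "value": "POST"},
    {"key": "auth.token", "value": "eyJhbGciOi..."},
]
print(redact_tags(tags))
```

In practice this logic belongs in a span processor or collector-side attribute processor so it runs on every span, not just the ones you remember to clean.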
Common Pitfalls / Anti-Patterns
1. Not Sampling Tail for Errors
Head-based sampling misses most error traces at low sampling rates:
# Bad: only probabilistic head sampling
sampling:
  type: probabilistic
  probabilistic:
    sampling_rate: 0.01  # 1% - misses most errors

# Good: adaptive sampling, with error traces always kept
sampling:
  type: adaptive
  adaptive:
    max_traces_per_second: 100
    sampling_server_url: jaeger-agent:5778
2. Missing Semantic Attributes
Traces without standard attributes are hard to query:
// Bad: Missing semantic attributes
const span = tracer.startSpan("OrderService.process");
span.setAttribute("orderId", order.id); // Non-standard name
// Good: Use semantic conventions
const span = tracer.startSpan("OrderService.process");
span.setAttribute("order.id", order.id);
span.setAttribute("order.total", order.total);
span.setAttribute("customer.tier", customer.tier);
3. Creating Child Spans Without Parent Context
Orphaned spans break trace continuity:
// Bad: no parent context
async function processOrder(order) {
  const span = tracer.startSpan("OrderService.process");
  await validateOrder(order); // Creates an orphaned span
  span.end();
}

// Good: propagate parent context (startSpan has no `parent` option)
async function processOrder(order, parentSpan) {
  const ctx = trace.setSpan(context.active(), parentSpan);
  const span = tracer.startSpan("OrderService.process", undefined, ctx);
  await validateOrder(order, span); // Child span linked via context
  span.end();
}
4. Storing Large Payloads in Span Events
Span events are not a data store:
// Bad: Large payload in span
span.addEvent("response", { body: JSON.stringify(largeResponse) });
// Good: record a summary, not the payload
span.addEvent("response", {
  "response.size_bytes": JSON.stringify(largeResponse).length,
  "response.status": "success",
});
5. Not Monitoring Jaeger Itself
Jaeger monitoring blind spots:
# Prometheus metrics from Jaeger
scrape_configs:
- job_name: "jaeger"
static_configs:
- targets: ["jaeger-collector:8888"]
Quick Recap
Key Takeaways:
- Jaeger provides distributed tracing visualization for microservices
- Deploy collectors, query, and UI separately in production
- Use adaptive or tail-based sampling to capture errors
- Always propagate trace context through HTTP headers and queues
- Monitor Jaeger itself: ingestion rate, queue depth, storage latency
- Combine with metrics (Prometheus) and logs (ELK) for complete observability
Copy/Paste Checklist:
# Production Jaeger CRD
apiVersion: jaegertracing.io/v1
kind: Jaeger
metadata:
  name: production-jaeger
spec:
  strategy: production
  collector:
    maxReplicas: 5
    resources:
      limits:
        cpu: 500m
        memory: 512Mi
  storage:
    type: elasticsearch
    elasticsearch:
      nodeCount: 3
  query:
    replicas: 2

# Adaptive sampling config
spec:
  sampling:
    type: adaptive
    adaptive:
      max_traces_per_second: 100
      initial_sampling_rate: 0.1

# Prometheus metrics scrape
scrape_configs:
  - job_name: 'jaeger-collector'
    static_configs:
      - targets: ['jaeger-collector:8888']
// Trace context propagation
import { propagation, context } from "@opentelemetry/api";

function injectTraceContext(headers: Record<string, string>) {
  propagation.inject(context.active(), headers);
}

function extractTraceContext(headers: Record<string, string>) {
  return propagation.extract(context.active(), headers);
}
// Error span recording
span.recordException(error);
span.setStatus({ code: SpanStatusCode.ERROR });
Conclusion
Jaeger provides powerful distributed tracing capabilities for microservices debugging and analysis. Trace visualization, dependency mapping, and performance analysis turn complex request flows into understandable data.
Start with automatic instrumentation for your services, deploy the all-in-one version for development, and scale to a production deployment with Elasticsearch storage when you are ready.
For complete observability, combine tracing with metrics (Prometheus & Grafana) and logs (ELK Stack). The Distributed Tracing guide covers OpenTelemetry integration in more detail.
Related Posts
Distributed Tracing: Trace Context and OpenTelemetry
Master distributed tracing for microservices. Learn trace context propagation, OpenTelemetry instrumentation, and how to debug request flows across services.
Alerting in Production: Building Alerts That Matter
Build alerting systems that catch real problems without fatigue. Learn alert design principles, severity levels, runbooks, and on-call best practices.
Backpressure Handling: Protecting Pipelines from Overload
Learn how to implement backpressure in data pipelines to prevent cascading failures, handle overload gracefully, and maintain system stability.