The Observability Engineering Mindset: Beyond Monitoring
Transition from traditional monitoring to full observability: structured logs, metrics, traces, and the cultural practices that make observability teams successful.
Monitoring asks: is the system up? Observability asks: why is the system behaving this way? The difference matters when something unexpected happens.
Traditional monitoring is threshold-based: alert when CPU > 80%, alert when error rate > 1%. This works for known failure modes. It fails for unknown ones. When your system fails in a way you did not anticipate, threshold alerts do not help. You need to ask questions of your system in real time.
When to Use
Head-Based vs. Tail-Based Sampling
Use head-based sampling when you need simplicity or when your trace volume is manageable. Head-based sampling decides at trace start whether to sample — every service sees the same sampling decision. It is easy to implement and predictable.
Use tail-based sampling when you have high trace volume and need to ensure you capture all errors and slow requests without storing every trace. Tail-based sampling requires a collector that buffers traces, so it adds operational complexity. The tradeoff is worth it for high-traffic services where errors are rare but important.
For most applications, head-based sampling at 10% is sufficient. If you also need 100% error capture, add a tail-based error policy at the collector — a head-based sampler decides before it knows whether a request will fail, so it cannot guarantee errors are kept on its own.
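The two decision strategies can be sketched with toy functions (hypothetical helpers for illustration; real SDKs and collectors implement this for you). Head-based sampling hashes the trace ID so every service reaches the same decision; tail-based sampling runs after the trace completes, so it can apply error and latency policies first:

```python
import hashlib

def head_sample(trace_id: str, rate: float = 0.10) -> bool:
    """Head-based: decide from the trace ID alone, at trace start.
    Hashing the ID makes every service reach the same decision."""
    bucket = int(hashlib.sha256(trace_id.encode()).hexdigest(), 16) % 100
    return bucket < rate * 100

def tail_sample(trace: dict, rate: float = 0.10) -> bool:
    """Tail-based: decide after the trace is complete, so error and
    latency policies can run before any probabilistic fallback."""
    if trace["error"]:
        return True                       # keep 100% of errors
    if trace["duration_ms"] > 1000:
        return True                       # keep all slow traces
    return head_sample(trace["trace_id"], rate)  # sample the healthy rest
```

The key property to notice: `head_sample` is deterministic per trace ID, which is what lets independent services agree without coordination.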
OpenTelemetry vs. Vendor-Specific Agents
Use OpenTelemetry when you want portability across observability backends, need to instrument once and send to multiple destinations, or are building a multi-cloud service. OpenTelemetry is vendor-neutral and becoming the industry standard.
Use vendor-specific agents when you need deep integrations with a specific vendor’s platform, want the fastest time-to-value, or have a simple single-vendor stack. Vendor agents often have deeper auto-instrumentation for specific frameworks.
The practical choice: instrument with OpenTelemetry, use the OpenTelemetry Collector with vendor exporters to route data wherever you want. This gives you portability without sacrificing vendor features.
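A minimal Collector pipeline fanning out to two backends might look like the sketch below. The backend names, endpoint, and environment variable are placeholders, not a tested configuration; the `datadog` exporter ships in the Collector contrib distribution:

```yaml
receivers:
  otlp:
    protocols:
      grpc: {}

exporters:
  otlphttp:
    endpoint: http://tempo:4318   # placeholder self-hosted backend
  datadog:
    api:
      key: ${env:DD_API_KEY}      # placeholder vendor credentials

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [otlphttp, datadog]   # one instrumentation, two destinations
```

Because routing lives in the Collector, switching vendors is a config change, not a re-instrumentation project.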
SLOs: When the Investment Is Worth It
Define SLOs when you have customer-facing services where availability and latency directly affect the business. SLOs give your observability data meaning — without them, you have metrics with no context.
Do not define SLOs for every internal service. Internal services that nobody directly calls (a library, an intermediate processor) do not need SLOs. Focus on services that customers depend on directly.
Start with 2-3 SLOs per service, not a dashboard full of them. Fewer SLOs that people actually track beat many SLOs that nobody looks at.
Three Pillars in Practice
```mermaid
flowchart LR
    A[Request] --> B[Trace ID generated]
    B --> C[Metric: latency, errors]
    B --> D[Log: event details]
    B --> E[Span: operation timing]
    C --> F[Dashboard<br/>RED / USE]
    D --> G[Log Aggregator<br/>ELK / Loki]
    E --> H[Trace View<br/>Jaeger / Tempo]
    F --> I{SLO<br/>Error Budget?}
    G --> I
    H --> I
    I -->|Budget burning| J[Slow Down<br/>Focus on Reliability]
    I -->|Budget healthy| K[Ship Features<br/>Accept Risk]
```
The power comes from correlating these signals. When an alert fires (metrics), you look at the error rate over time. You find a specific time window. You query logs for errors in that window. You find a correlation ID. You look at the trace for that ID and see exactly where the request failed.
What Observability Really Means
Observability comes from control theory. A system is observable if you can determine its internal state from its external outputs. Applied to software: if your system is observable, you can debug any behavior by asking questions of the data it produces.
The three signals (logs, metrics, and traces) are the external outputs. But the key is not just collecting them; it is being able to correlate them and ask arbitrary questions.
Logs without context (which request caused this error?) are hard to use. Metrics without traces (which users are affected?) are hard to act on. Traces without logs (what happened inside this service?) are incomplete.
The Three Pillars (Logs, Metrics, Traces)
The three pillars are not separate disciplines. They are three views of the same system.
Metrics are numerical measurements aggregated over time. They are cheap to store, fast to query, and good for dashboards. They answer “is something wrong?”
The Metrics & Monitoring post covers metrics in depth, but the key points: use RED (Rate, Errors, Duration) for services, USE (Utilization, Saturation, Errors) for resources.
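As a concrete sketch, the RED signals for a service map onto PromQL queries along these lines (metric names such as `http_requests_total` follow common Prometheus conventions but depend on how your service is instrumented):

```
# Rate: requests per second
sum(rate(http_requests_total{job="checkout"}[5m]))

# Errors: fraction of requests failing
sum(rate(http_requests_total{job="checkout",status=~"5.."}[5m]))
  / sum(rate(http_requests_total{job="checkout"}[5m]))

# Duration: p99 latency from a histogram
histogram_quantile(0.99,
  sum(rate(http_request_duration_seconds_bucket{job="checkout"}[5m])) by (le))
```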
Logs are discrete events. They provide detail that metrics cannot. Structured logging (JSON) makes logs queryable.
The Logging Best Practices post covers structured logging, correlation IDs, and log levels.
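A minimal structured-JSON formatter shows the idea: every log line becomes a queryable object, and a correlation ID ties it back to a trace. This is a sketch using only the standard library; in practice you would use a library like `structlog` or `python-json-logger`:

```python
import json
import logging
import sys

class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""
    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "message": record.getMessage(),
            # trace_id is attached via the `extra` dict at the call site
            "trace_id": getattr(record, "trace_id", None),
        })

handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("checkout")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Emits: {"level": "INFO", "message": "payment_declined", "trace_id": "..."}
logger.info("payment_declined", extra={"trace_id": "4bf92f3577b34da6"})
```

With the trace ID in every line, "find all logs for this failing request" becomes a single query in your log aggregator.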
Traces follow a request across service boundaries. A trace is a collection of spans, each representing a single operation. Distributed tracing shows you the full path of a request.
The Distributed Tracing post covers trace propagation, sampling strategies, and visualization.
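A trace only works if the trace ID travels with the request. OpenTelemetry propagates it in the W3C `traceparent` HTTP header; a minimal parser (illustrative only — the SDK handles propagation for you) shows the pieces:

```python
def parse_traceparent(header: str) -> dict:
    """Parse a W3C traceparent header: version-traceid-spanid-flags.
    Example: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"""
    version, trace_id, span_id, flags = header.split("-")
    return {
        "trace_id": trace_id,                       # 32 hex chars, whole trace
        "parent_span_id": span_id,                  # 16 hex chars, calling span
        "sampled": int(flags, 16) & 0x01 == 1,      # bit 0 = sampling decision
    }
```

Note that the sampled flag rides along with the ID: downstream services inherit the head-based sampling decision instead of re-deciding.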
Observability-Driven Development
Observability should not be added after the fact. Design your services to be observable from the start.
The key is instrumentation at the code level. Use a library that handles trace propagation automatically. Emit structured logs with context. Expose metrics endpoints.
```python
# Example: Python service with OpenTelemetry instrumentation
from flask import Flask
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.instrumentation.flask import FlaskInstrumentor

# Wire up a tracer provider that exports spans over OTLP
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter()))
trace.set_tracer_provider(provider)

app = Flask(__name__)

# This single call instruments Flask, capturing HTTP spans automatically
FlaskInstrumentor().instrument_app(app)

# Add custom spans for business logic
tracer = trace.get_tracer(__name__)

@app.route('/checkout')
def checkout():
    with tracer.start_as_current_span("checkout_process") as span:
        span.set_attribute("user.id", get_current_user_id())
        span.set_attribute("cart.value", cart.total())
        # Your code, traced automatically
        # (get_current_user_id, cart, process_checkout are app-specific)
        result = process_checkout(cart)
        span.set_attribute("checkout.success", result.success)
        return result
```
Design your code so that every operation creates a span. When something breaks, you can see exactly where.
SLOs and Error Budget Burn-Down
SLOs (Service Level Objectives) give your observability work meaning. Without SLOs, you have metrics with no context.
An SLO is a commitment to a level of service. “99.9% availability” is an SLO. The error budget is what you have left: 0.1% of requests can fail per month before you are out of compliance.
Track error budget burn-down as a metric. When the budget is healthy, you can take risks: deploy new features, make architecture changes. When the budget is low, slow down and focus on reliability.
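The arithmetic is simple enough to keep in your head. For a 99.9% availability SLO, the monthly budget is 0.1% of requests; the figures below are illustrative:

```python
# Error budget arithmetic for a 99.9% monthly availability SLO
slo = 0.999
total_requests = 10_000_000            # requests this month (example figure)

budget = (1 - slo) * total_requests    # failures the SLO permits: 10,000
failed = 4_200                         # failures observed so far
remaining = budget - failed            # 5,800 failures left this month
burn_fraction = failed / budget        # 42% of the budget already spent

print(f"budget={budget:.0f}, remaining={remaining:.0f}, "
      f"burned={burn_fraction:.1%}")
```

If 42% of the budget is gone only a third of the way through the month, you are burning faster than you can afford — that is the signal to slow feature work down.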
```yaml
# Prometheus alerting rule for error budget
groups:
  - name: slo-alerts
    rules:
      - alert: ErrorBudgetBurnRateHigh
        expr: |
          sum(rate(http_requests_total{job="checkout",status=~"5.."}[1h]))
            / sum(rate(http_requests_total{job="checkout"}[1h])) > 0.05
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Error budget burning fast"
          description: "Error budget burn rate > 5% over 1 hour"
```
The Prometheus & Grafana post covers how to set up SLO dashboards.
Choosing the Right Level of Fidelity
Not every request needs full tracing. Sampling is how you balance cost and coverage.
Head-based sampling decides at the start of a trace whether to sample it. Easy to implement, but you might miss the interesting 1% of requests that are slow or errored.
Tail-based sampling decides at the end of the trace, after you know whether it is interesting. More complex, but you can guarantee you capture all errors and slow requests while sampling most healthy traffic.
```yaml
# OpenTelemetry tail-based sampling config
processors:
  tail_sampling:
    decision_wait: 10s
    policies:
      - name: errors-policy
        type: status_code
        status_code: { status_codes: [ERROR] }
      - name: slow-traces-policy
        type: latency
        latency: { threshold_ms: 1000 }
      - name: probabilistic-policy
        type: probabilistic
        probabilistic: { sampling_percentage: 10 }
```
For most services: capture 100% of errors, sample 10% of everything else. This gives you enough data to debug issues without overwhelming your storage.
Building an Observability Culture
Tools do not make systems observable. People do. The most sophisticated tracing infrastructure is worthless if engineers do not use it during incidents.
Make observability a first-class concern during development:
- PRs should include observability considerations (what will you measure? how will you debug this?)
- On-call rotation should include time reviewing dashboards and learning the system
- Post-mortems should include “what observability would have helped?”
This is the hardest part. Engineers who are used to console.log debugging need to learn structured logging, trace analysis, and metric interpretation. Budget time for this learning.
The ELK Stack guide covers log aggregation at scale, which is foundational for observability.
Production Failure Scenarios
| Failure | Impact | Mitigation |
|---|---|---|
| Sampling missing critical error trace | Error is in 1% of requests and not captured in sampled traces, root cause unknown | Capture 100% of error traces, sample only healthy requests, tail-based sampling ensures interesting traces are kept |
| Tail-based sampling causing memory pressure | OpenTelemetry collector buffers traces in memory, OOM kills under load | Set conservative buffer limits, use probabilistic sampling before the tail sampler, scale collectors horizontally |
| SLO definition mismatch causing wrong alert thresholds | Alert fires on metric that does not match actual user experience, team ignores it | Define SLOs based on user-perceptible behavior, not internal metrics, validate SLOs against real user complaints |
| OpenTelemetry collector outage causing data loss | Traces stop flowing to backend, no observability during incident | Deploy collectors in HA mode, use local queueing with retry, do not put all telemetry through single collector |
| High-cardinality labels causing metrics storage explosion | PromQL queries slow to a crawl, metrics cost spikes unexpectedly | Set conservative cardinality limits per label, drop high-cardinality labels (user IDs, request IDs) from metrics |
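The last failure row deserves a bit of arithmetic: cardinality is multiplicative across labels, so a single user-ID label can turn hundreds of time series into millions. The figures below are illustrative:

```python
# Each distinct label combination is its own time series in Prometheus.
endpoints = 50       # distinct values of an `endpoint` label
statuses = 5         # distinct values of a `status` label
users = 100_000      # distinct user IDs -- never make this a label

series_without_user = endpoints * statuses           # 250 series
series_with_user = series_without_user * users       # 25,000,000 series
```

This is why user IDs and request IDs belong in logs and trace attributes, which are per-event, rather than in metric labels, which create a series per distinct value.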
Observability Trade-offs
| Dimension | Head-Based Sampling | Tail-Based Sampling |
|---|---|---|
| Implementation complexity | Low | High |
| Memory overhead | Low | High (buffering) |
| Captures interesting 1% | Only if lucky | Yes (by policy) |
| Operational cost | Low | Medium |
| Best for | Low-traffic services | High-traffic production services |
| Dimension | OpenTelemetry SDK | Vendor Agent |
|---|---|---|
| Portability | High | Low |
| Auto-instrumentation depth | Good | Vendor-specific deep integration |
| Operational overhead | Medium (collector) | Low |
| Cost | Free | Vendor pricing |
Common Anti-Patterns
Alert fatigue from too many alerts. If your on-call team ignores 80% of alerts, the remaining 20% get missed. Fewer, higher-quality alerts beat many low-quality ones. Review alert effectiveness monthly.
Metrics without context. CPU at 90% is not actionable. CPU at 90% on the checkout service while error rate is climbing is actionable. Always pair resource metrics with service-level metrics.
Traces without a sampling strategy. Capturing 100% of traces in production at scale is prohibitively expensive. Without a strategy, you either undersample and miss important data, or oversample and overwhelm your storage.
SLOs without error budget tracking. Declaring an SLO is meaningless if nobody tracks whether you are meeting it. Build error budget burn-down into your dashboards and treat budget consumption as a reliability signal.
Collecting data without asking questions. Many teams have ELK stacks full of logs that nobody reads. Observability is active: you ask questions of your system. If you are not using the data to make decisions, stop collecting it.
Quick Recap
Key Takeaways
- Observability means you can debug any behavior from external outputs, not just known failure modes
- The three pillars (logs, metrics, traces) are three views of the same system — correlate them
- Instrument with OpenTelemetry for portability; use collectors to route to your preferred backends
- SLOs give observability data meaning: without them, you have metrics with no context
- Error budget burn-down tells you when to focus on reliability vs. shipping features
- Tail-based sampling ensures you capture errors without storing every trace
- Observability is a cultural practice, not a tool purchase
Observability Checklist
```
# 1. Instrument services with OpenTelemetry
#    FlaskInstrumentor().instrument_app(app)
#    Add custom spans for business logic
# 2. Expose /metrics endpoint for Prometheus scraping
#    prometheus_client.start_http_server(port=8000)
# 3. Emit structured JSON logs with trace correlation IDs
#    logger.info("checkout_processed", extra={"trace_id": trace_id})
# 4. Define SLOs for customer-facing services
#    Error rate < 0.1% per month
#    p99 latency < 500ms
# 5. Set up error budget burn-down dashboard in Grafana
#    Track: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))
# 6. Configure sampling: 100% errors + 10% probabilistic
#    tail_sampling: errors-policy + slow-traces-policy + probabilistic
# 7. Deploy OpenTelemetry Collector in HA mode
#    otelcol --config=collector-config.yaml
```
Conclusion
Observability is not a feature. It is a property of a system that you build intentionally. Start with the three pillars: logs, metrics, traces. Instrument your code. Define SLOs. Correlate signals. And build a culture where engineers use observability tools as part of their daily work, not just during incidents.
The question is not whether your system is monitored. The question is whether your system is observable.