Logging Best Practices: Structured Logs, Levels, Aggregation
Master production logging with structured formats, proper log levels, correlation IDs, and scalable log aggregation. Includes patterns for containerized applications.
Logs are your primary tool for understanding what happens in production when users report issues. If you have ever spent hours searching through plain text log files trying to find a single error, you know how painful unstructured logging can be.
This guide covers structured formats, log levels, correlation IDs for tracing requests, and aggregation strategies that scale.
The Problem with Plain Text Logging
Most applications start with simple string logging:
logger.info("User " + userId + " logged in from " + ipAddress);
This approach has serious problems. Searching for all logs from a specific user requires parsing the string. Machine parsing is brittle and breaks when the format changes. Adding context requires string concatenation throughout your codebase.
Structured logging solves these problems by emitting machine-readable log entries that are also human-readable.
Structured Logging
Structured logs use a defined format, typically JSON, where each field has a specific meaning.
JSON Log Format
{
"timestamp": "2026-03-22T14:32:01.456Z",
"level": "INFO",
"message": "User login successful",
"service": "auth-service",
"version": "2.1.0",
"trace_id": "abc123def456",
"user_id": "usr_789",
"ip_address": "192.168.1.42",
"duration_ms": 45,
"environment": "production"
}
This format lets you search by any field, aggregate metrics across dimensions, and correlate logs with traces and metrics.
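For example, once each line parses as JSON, filtering by any field is a one-liner — a minimal Python sketch (the sample log lines are illustrative):

```python
import json

# Two illustrative structured log lines, as they would appear in a log file
log_lines = [
    '{"level": "INFO", "message": "User login successful", "user_id": "usr_789"}',
    '{"level": "ERROR", "message": "Payment failed", "user_id": "usr_123"}',
]

# Parse each line and filter on a field -- no brittle string matching needed
entries = [json.loads(line) for line in log_lines]
errors = [e for e in entries if e["level"] == "ERROR"]
print(errors[0]["user_id"])  # usr_123
```

The same filter against plain text logs would require a regex that breaks the moment someone reorders the message.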
Implementing Structured Logging
Most languages have structured logging libraries:
# Python with structlog
import structlog
structlog.configure(
processors=[
structlog.processors.TimeStamper(fmt="iso"),
structlog.processors.JSONRenderer()
]
)
log = structlog.get_logger()
log.info("user_login",
user_id="usr_789",
ip_address="192.168.1.42",
duration_ms=45
)
// TypeScript with pino
import pino from "pino";
const log = pino({
level: "info",
base: {
service: "auth-service",
version: process.env.APP_VERSION,
},
});
log.info(
{
userId: "usr_789",
ipAddress: "192.168.1.42",
durationMs: 45,
},
"User login successful",
);
// Go with zerolog
import (
"github.com/rs/zerolog"
"github.com/rs/zerolog/log"
)
zerolog.TimeFieldFormat = zerolog.TimeFormatUnix
log.Info().
Str("user_id", "usr_789").
Str("ip_address", "192.168.1.42").
Int("duration_ms", 45).
Msg("User login successful")
Log Levels
Log levels help filter noise. Not every log entry needs to be visible during normal operations.
Standard Log Levels
| Level | Purpose | When to Use |
|---|---|---|
| DEBUG | Detailed diagnostic information | During development and troubleshooting |
| INFO | Confirmation that things work as expected | Significant business events |
| WARN | Unexpected but handled situations | Recoverable errors, degraded states |
| ERROR | Errors that need attention | Failures that affect requests |
| FATAL | System cannot continue | Critical failures requiring immediate action |
Choosing the Right Level
This feels intuitive but gets harder at scale. A few guidelines:
- INFO for business events like orders placed, users registered. You want these for analytics and auditing.
- WARN for situations that require attention but the system continues: retries succeeded, cache misses, degraded mode.
- ERROR for failures that affect the current request: database timeout, external API failure, validation error.
- DEBUG for information that helps during development but would overwhelm production: loop iterations, intermediate values.
Don’t log DEBUG in production unless you can enable it selectively for specific requests. A debug log in a hot path can generate gigabytes per hour.
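One way to enable DEBUG selectively is to key the decision off the request's trace ID — a sketch using only the standard library (the allowlist mechanism is an assumption; in practice it might be fed by a feature flag service or a request header):

```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("app")

# Hypothetical allowlist of trace IDs flagged for verbose logging
DEBUG_TRACE_IDS = {"abc123def456"}

def log_for_request(trace_id: str) -> logging.Logger:
    """Return a logger that emits DEBUG only for allowlisted requests."""
    if trace_id in DEBUG_TRACE_IDS:
        request_logger = logging.getLogger(f"app.debug.{trace_id}")
        request_logger.setLevel(logging.DEBUG)  # verbose for this request only
        return request_logger
    return logger  # everyone else stays at the default INFO level

log_for_request("abc123def456").debug("cache lookup, key=user:789")  # emitted
log_for_request("other").debug("cache lookup, key=user:790")         # filtered out
```

Only the flagged request pays the DEBUG cost; the hot path for everyone else stays at INFO.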
Correlation IDs
When a request flows through multiple services, correlation IDs let you follow it across all logs.
Propagating Trace Context
// Middleware to extract or generate correlation ID
function correlationMiddleware(req, res, next) {
const traceId = req.headers["x-trace-id"] || generateUUID();
req.correlationId = traceId;
res.setHeader("x-trace-id", traceId);
// Bind the ID to a request-scoped child logger
req.log = log.child({ traceId });
next();
}
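In Python services, the same middleware pattern is often implemented with contextvars, so every log record picks up the current request's ID without threading it through function arguments — a sketch using only the standard library:

```python
import contextvars
import logging
import uuid

# Holds the correlation ID for the current request context
trace_id_var = contextvars.ContextVar("trace_id", default="-")

class TraceIdFilter(logging.Filter):
    """Attach the current trace ID to every log record."""
    def filter(self, record):
        record.trace_id = trace_id_var.get()
        return True

handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter("%(levelname)s trace=%(trace_id)s %(message)s"))
handler.addFilter(TraceIdFilter())
logger = logging.getLogger("app")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

def handle_request(incoming_trace_id=None):
    # Extract the ID from the incoming request, or generate one
    trace_id_var.set(incoming_trace_id or uuid.uuid4().hex)
    logger.info("request received")

handle_request("abc123def456")  # logs: INFO trace=abc123def456 request received
```

Because the ContextVar is scoped per task, concurrent requests in an async server each see their own trace ID.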
Propagate the correlation ID to all downstream calls:
// Outgoing HTTP request
fetch("https://api.example.com/users", {
headers: {
"X-Trace-ID": req.correlationId,
},
});
// Database queries
db.query("SELECT * FROM users WHERE id = $1", [userId], {
traceId: req.correlationId,
});
// Message queue messages
queue.send({
payload: orderData,
headers: {
"X-Trace-ID": req.correlationId,
},
});
Using Correlation IDs for Search
With structured logs and correlation IDs, debugging a user issue looks like this:
# Find all logs for a specific request
grep '"trace_id":"abc123def456"' /var/log/app.log
# Or in your log aggregator
query: trace_id = "abc123def456"
Search for the trace ID and you get the incoming request, database queries, cache hits, outgoing API calls, and the error that occurred.
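Once every entry carries the trace ID, reconstructing the request timeline is just filter-and-sort — a sketch with illustrative log lines:

```python
import json

# Illustrative log lines from two services, out of order
raw_logs = [
    '{"timestamp": "2026-03-22T14:32:01.501Z", "trace_id": "abc123def456", "service": "db", "message": "query users"}',
    '{"timestamp": "2026-03-22T14:32:01.456Z", "trace_id": "abc123def456", "service": "api", "message": "request received"}',
    '{"timestamp": "2026-03-22T14:32:01.480Z", "trace_id": "other", "service": "api", "message": "unrelated"}',
]

entries = [json.loads(line) for line in raw_logs]
# Keep only this request's entries; ISO 8601 UTC timestamps sort lexicographically
timeline = sorted(
    (e for e in entries if e["trace_id"] == "abc123def456"),
    key=lambda e: e["timestamp"],
)
for e in timeline:
    print(e["timestamp"], e["service"], e["message"])
```

This is exactly what a log aggregator does behind its search box; it is worth knowing you can do it by hand against raw files in a pinch.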
What to Include in Logs
Context matters. The more relevant context you include, the easier debugging becomes.
Essential Fields
Every log entry needs at minimum:
- timestamp: ISO 8601 format in UTC
- level: Log severity level
- service: Which service generated this log
- message: Human-readable description
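A small helper that stamps these baseline fields on every entry keeps them consistent across a codebase — a minimal sketch (the make_entry helper and the service name are illustrative, not a library API):

```python
import json
from datetime import datetime, timezone

SERVICE_NAME = "auth-service"  # assumed service name

def make_entry(level: str, message: str, **fields) -> str:
    """Build a JSON log line with the baseline fields plus any extras."""
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),  # ISO 8601, UTC
        "level": level,
        "service": SERVICE_NAME,
        "message": message,
        **fields,  # request or business context merged in
    }
    return json.dumps(entry)

line = make_entry("INFO", "User login successful", user_id="usr_789")
```

In practice a structured logging library (structlog, pino, zerolog) does this for you; the point is that the four baseline fields should never be optional.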
Request Context
For web services, include:
- request_id or trace_id
- user_id (if authenticated)
- HTTP method, path, status code
- Client IP address
- User agent
{
"timestamp": "2026-03-22T14:32:01.456Z",
"level": "INFO",
"service": "api-gateway",
"message": "Request completed",
"request_id": "req_abc123",
"method": "GET",
"path": "/api/users/usr_789",
"status": 200,
"duration_ms": 120,
"ip": "192.168.1.42",
"user_agent": "Mozilla/5.0..."
}
Business Events
For significant business events:
- Event type (login, purchase, registration)
- Entity IDs involved
- Outcome (success, failure)
- Duration if applicable
- Any relevant metadata
{
"timestamp": "2026-03-22T14:32:01.456Z",
"level": "INFO",
"service": "checkout-service",
"message": "Order placed",
"event": "order_placed",
"order_id": "ord_xyz789",
"customer_id": "cust_123",
"total_amount": 99.99,
"currency": "USD",
"item_count": 3
}
What NOT to Log
Logging sensitive data creates security and compliance problems.
Never Log These
Never log:
- Passwords or password hashes
- Credit card numbers or CVV codes
- Social security numbers or national IDs
- API keys or secrets
- Full authorization tokens (log the type and last 4 chars only)
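Masking a token down to its type and last four characters can happen before the value ever reaches the logger — a sketch (the helper name is illustrative):

```python
def mask_token(authorization_header: str) -> str:
    """Reduce 'Bearer eyJhbGciOi...abcd' to 'Bearer ***abcd' for safe logging."""
    try:
        token_type, token = authorization_header.split(" ", 1)
    except ValueError:
        return "[REDACTED]"  # unexpected format: redact entirely
    return f"{token_type} ***{token[-4:]}"

print(mask_token("Bearer eyJhbGciOiJIUzI1NiJ9abcd"))  # Bearer ***abcd
```

The last four characters are enough to correlate a token across log entries without ever exposing a usable credential.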
Redact Sensitive Data
function redactSensitiveFields(
obj: Record<string, unknown>,
): Record<string, unknown> {
const sensitiveFields = ["password", "token", "secret", "creditCard", "ssn"];
const redacted = { ...obj };
for (const key of Object.keys(redacted)) {
if (sensitiveFields.some((f) => key.toLowerCase().includes(f))) {
redacted[key] = "[REDACTED]";
} else if (redacted[key] !== null && typeof redacted[key] === "object") {
redacted[key] = redactSensitiveFields(
redacted[key] as Record<string, unknown>,
);
}
}
return redacted;
}
log.info(
"User authenticated",
redactSensitiveFields({ userId: "usr_123", password: "secret123" }),
);
// Logs: { userId: 'usr_123', password: '[REDACTED]' }
Log Aggregation Architecture
At scale, logs need to be collected, aggregated, and stored efficiently.
Common Architecture
graph LR
A[Application] -->|stdout/JSON| B[Container Runtime]
B --> C[Log Agent]
C --> D[Log Aggregator]
D --> E[Storage]
D --> F[Search Interface]
G[Analytics/BI] --> E
Container Logging
In containerized environments, applications write to stdout and stderr. The container runtime handles collection:
# Write logs to stdout, not files
# Bad: RUN echo "$(date) Log entry" >> /var/log/app.log
# Good: console.log(JSON.stringify({ timestamp, message }))
For applications that must write to files, use a sidecar log agent or mount a shared log directory:
# Pod with log volume
apiVersion: v1
kind: Pod
metadata:
name: myapp
spec:
containers:
- name: app
image: myapp:latest
volumeMounts:
- name: logs
mountPath: /var/log/myapp
- name: log-agent
image: log-agent:latest
volumeMounts:
- name: logs
mountPath: /var/log/myapp
- name: agent-config
mountPath: /etc/log-agent
volumes:
- name: logs
emptyDir: {}
- name: agent-config
configMap:
name: log-agent-config
Shipping Logs to Aggregators
Fluentd/Fluent Bit Configuration
# fluent-bit.conf
[SERVICE]
Flush 5
Daemon Off
Log_Level info
Parsers_File parsers.conf
[INPUT]
Name tail
Path /var/log/containers/*.log
Parser docker
Tag container.*
Refresh_Interval 5
[FILTER]
Name kubernetes
Match container.*
Kube_URL https://kubernetes.default.svc:443
Kube_CA_File /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
Kube_Token_File /var/run/secrets/kubernetes.io/serviceaccount/token
[OUTPUT]
Name es
Match container.*
Host elasticsearch.logging.svc
Port 9200
Logstash_Format On
Logstash_Prefix kubernetes
Retry_Limit False
Vector Configuration
Vector is a newer alternative with better performance and lower resource usage:
# vector.toml
[sources.docker]
type = "docker_logs"
[transforms.parse_json]
type = "remap"
inputs = ["docker"]
source = '.message = parse_json!(.message)'
[sinks.elasticsearch]
type = "elasticsearch"
inputs = ["parse_json"]
endpoint = "http://elasticsearch.logging.svc:9200"
index = "kubernetes-%Y.%m.%d"
Log Storage and Retention
Storage costs grow with log volume. Design retention policies carefully.
Retention Tiers
| Tier | Duration | Use Case |
|---|---|---|
| Hot | 0-7 days | Real-time troubleshooting |
| Warm | 7-30 days | Investigating recent issues |
| Cold | 30-90 days | Compliance, audit |
| Archive | 1+ years | Legal requirements |
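To see what these tiers cost, multiply daily ingest volume by each tier's window — a back-of-the-envelope sketch for one year of retention (the 50 GB/day figure is an assumption):

```python
DAILY_VOLUME_GB = 50  # assumed ingest rate

# (tier, days of data resident in that tier), matching the table above
tiers = [("hot", 7), ("warm", 23), ("cold", 60), ("archive", 275)]

for name, days in tiers:
    print(f"{name:>7}: {DAILY_VOLUME_GB * days:>6} GB resident")

total_days = sum(days for _, days in tiers)
print(f"  total: {DAILY_VOLUME_GB * total_days:>6} GB over {total_days} days")
```

Most of the volume sits in cold and archive tiers, which is why moving data off hot storage quickly dominates the cost equation.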
Elasticsearch Index Lifecycle Management
{
"policy": {
"phases": {
"hot": {
"actions": {
"rollover": {
"max_age": "7d",
"max_size": "50gb"
},
"set_priority": 100
}
},
"warm": {
"min_age": "7d",
"actions": {
"shrink": { "number_of_shards": 1 },
"forcemerge": { "max_num_segments": 1 },
"set_priority": 50
}
},
"cold": {
"min_age": "30d",
"actions": {
"freeze": {},
"set_priority": 0
}
},
"delete": {
"min_age": "365d",
"actions": {
"delete": {}
}
}
}
}
}
Performance Considerations
Logging can become a bottleneck if you don’t design it carefully.
Asynchronous Logging
Write logs asynchronously so they don’t block your application:
import logging
import queue
from threading import Thread
class AsyncLogHandler(logging.Handler):
def __init__(self, batch_size=100, flush_interval=1.0):
super().__init__()
self.queue = queue.Queue(maxsize=10000)
self.batch_size = batch_size
self.flush_interval = flush_interval
self.worker = Thread(target=self._process_logs, daemon=True)
self.worker.start()
def emit(self, record):
try:
self.queue.put_nowait(self.format(record))
except queue.Full:
pass # Drop log if queue is full
def _process_logs(self):
batch = []
while True:
try:
item = self.queue.get(timeout=self.flush_interval)
batch.append(item)
while len(batch) < self.batch_size:
item = self.queue.get_nowait()
batch.append(item)
except queue.Empty:
pass
if batch:
self._send_batch(batch)
batch = []
def _send_batch(self, batch):
# Send to log aggregator
pass
Sampling High-Volume Logs
For debug-level logs in high-traffic paths, sample to reduce volume:
const sampler = new RateSampler({ rate: 0.1 }); // 10% sample rate
if (sampler.sample()) {
  log.debug(
    {
      itemId: item.id,
    },
    "Item processing details",
  );
}
// Emits the debug entry only ~10% of the time
Monitoring Log Health
Logs themselves need monitoring. If logging stops, you lose visibility into your systems.
Metrics to Track
- Log ingestion rate (logs/second)
- Log volume by service and level
- Error rate in logs
- Log processing latency
- Log agent errors and restarts
Alert on Silence
# Prometheus alert for missing logs
- alert: LogIngestionSilence
expr: |
rate(fluentd_input_status_records_total[5m]) == 0
for: 5m
labels:
severity: critical
annotations:
summary: "No logs being ingested from Fluentd"
description: "Fluentd has not sent logs to Elasticsearch in 5 minutes"
When to Use Structured Logging
Use structured logging when:
- Debugging requires cross-referencing multiple log entries
- Requests span multiple services
- You need selective debugging in high-volume APIs
- Audit trails are required for compliance
- You need to correlate logs with traces or metrics
Don’t use structured logging when:
- Simple scripts or one-off utilities where stdout debugging suffices
- Very low-traffic applications where unstructured grep suffices
- Legacy systems where migration cost outweighs benefits
- Development environments where DEBUG-level verbosity is acceptable
Trade-off Analysis
| Aspect | Structured Logging | Plain Text Logging |
|---|---|---|
| Searchability | Field-level queries via log aggregators | grep/string matching only |
| Storage Cost | Higher (JSON overhead per line) | Lower (minimal formatting) |
| Parse Complexity | Zero (machine-readable by default) | Brittle (format changes break parsers) |
| Human Readability | Moderate (requires jq or aggregator UI) | High (direct reading in terminal) |
| Tooling Required | Log aggregator (ELK, Loki, Splunk) | None or basic text tools |
| Correlation | Automatic via shared fields | Manual trace ID injection |
| Performance Impact | Slight overhead for JSON serialization | Minimal |
SLI/SLO/Error Budget Templates for Logging
Log-Based SLI Template
# logging-sli-config.yaml
service: logging-observability
environment: production
slis:
- name: log_ingestion_success_rate
description: "Percentage of emitted logs successfully ingested"
query: |
sum(rate(fluentd_output_status_num_logs_total{status="output"}[5m]))
/
sum(rate(fluentd_input_status_records_total[5m]))
- name: log_processing_latency_p95
description: "Time from log emit to searchable in aggregator"
query: |
histogram_quantile(0.95,
sum(rate(fluentd_output_status_flush_interval_bucket[5m])) by (le)
)
- name: log_error_rate
description: "ERROR level log rate as percentage of total"
query: |
sum(rate(log_entries_total{level="error"}[5m]))
/
sum(rate(log_entries_total[5m])) * 100
Log SLO Template
# logging-slo-config.yaml
objectives:
- display_name: "Log Ingestion Availability"
sli: log_ingestion_success_rate
target: 99.5
window: 30d
description: "99.5% of emitted logs should be ingested"
- display_name: "Log Processing Latency"
sli: log_processing_latency_p95
target: 99.0
threshold_ms: 30000
window: 30d
description: "95% of logs should be searchable within 30 seconds"
- display_name: "Log Error Rate"
sli: log_error_rate
target: 99.9
threshold_percent: 1.0
window: 30d
description: "Error rate should stay below 1%"
Error Budget Calculator
# error-budget-calculator.py
def calculate_error_budget(slo_target, window_days=30):
"""
Calculate error budget in minutes for a given SLO target.
Example: 99.5% SLO over 30 days = 216 minutes of allowed errors
"""
window_seconds = window_days * 24 * 60 * 60
allowed_errors = window_seconds * (1 - slo_target)
return allowed_errors / 60 # Convert to minutes
# Standard SLO error budgets (30-day window)
slo_budgets = {
"99.0%": calculate_error_budget(0.990), # 432 minutes = 7.2 hours
"99.5%": calculate_error_budget(0.995), # 216 minutes = 3.6 hours
"99.9%": calculate_error_budget(0.999), # 43.2 minutes
"99.95%": calculate_error_budget(0.9995), # 21.6 minutes
"99.99%": calculate_error_budget(0.9999), # 4.32 minutes
}
for slo, budget in slo_budgets.items():
print(f"SLO {slo}: {budget:.2f} minutes error budget")
Multi-Window Burn-Rate Alerting for Log Quality
Burn-rate alerts detect when error budgets are being consumed faster than expected. This approach catches both sudden spikes and slow leaks.
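The arithmetic behind a burn-rate alert: a burn rate of 1x exhausts the budget exactly at the end of the SLO window, so time-to-exhaustion is the window divided by the burn rate — a quick sketch for a 99.9% SLO over 30 days:

```python
SLO_TARGET = 0.999
WINDOW_HOURS = 30 * 24  # 720 hours

def burn_rate(observed_error_rate: float) -> float:
    """How many times faster than sustainable the budget is being spent."""
    return observed_error_rate / (1 - SLO_TARGET)

def hours_to_exhaustion(observed_error_rate: float) -> float:
    """Hours until the whole 30-day budget is gone at the current rate."""
    return WINDOW_HOURS / burn_rate(observed_error_rate)

# A 1.44% error rate against a 0.1% budget is a 14.4x burn
rate = burn_rate(0.0144)
print(round(rate, 1))                          # 14.4
print(round(hours_to_exhaustion(0.0144), 1))   # 50.0 hours, about two days
```

This is why 14.4x is the conventional fast-burn threshold: it leaves roughly two days to respond before a 30-day budget is spent.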
1-Hour Window Burn-Rate Alert (Fast Burn)
# Burn-rate alerts for logging
groups:
- name: logging-burn-rate
rules:
# Fast burn: 1-hour window, 14.4x burn rate (burns 2% of the 30-day budget per hour)
- alert: LogErrorBudgetFastBurn
expr: |
(
sum(rate(log_entries_total{level="error"}[1h]))
/
sum(rate(log_entries_total[1h]))
)
> (1 - 0.999) * 14.4
for: 5m
labels:
severity: critical
category: logging
window: 1h
annotations:
summary: "Log error budget burning fast (1h window)"
description: "Error ratio is {{ $value | humanize }}, far above the sustainable 0.001. At a 14.4x burn the 30-day budget is depleted in ~50 hours."
6-Hour Window Burn-Rate Alert (Medium Burn)
# Medium burn: 6-hour window, 6x burn rate (burns 5% of the budget in 6 hours)
- alert: LogErrorBudgetMediumBurn
expr: |
(
sum(rate(log_entries_total{level="error"}[6h]))
/
sum(rate(log_entries_total[6h]))
)
> (1 - 0.999) * 6
for: 30m
labels:
severity: warning
category: logging
window: 6h
annotations:
summary: "Log error budget burning (6h window)"
description: "Error rate is burning budget {{ $value | humanize }}x faster than sustainable. Check for sustained error patterns."
Multi-Window Burn-Rate Alert Set
# Complete burn-rate alert set (multi-window)
- alert: LogErrorBudgetBurnAllWindows
expr: |
(
sum(rate(log_entries_total{level="error"}[1h]))
/
sum(rate(log_entries_total[1h]))
)
> (1 - 0.999) * 14.4
or
(
sum(rate(log_entries_total{level="error"}[6h]))
/
sum(rate(log_entries_total[6h]))
)
> (1 - 0.999) * 6
for: 5m
labels:
severity: critical
category: logging
annotations:
summary: "Log error budget burning across multiple time windows"
description: |
Multi-window burn-rate alert triggered (current error ratio: {{ $value | humanize }}).
Both the 1h and 6h error rates exceed the sustainable ratio of 0.001.
Review error patterns and allocate incident resources.
SLO Error Budget Dashboard Panels
{
"dashboard": {
"title": "Logging SLO Error Budget",
"panels": [
{
"title": "Error Budget Remaining (30d)",
"type": "gauge",
"targets": [
{
"expr": "(1 - (sum(rate(log_entries_total{level=\"error\"}[30d])) / sum(rate(log_entries_total[30d])))) * 100",
"legendFormat": "Budget Used %"
}
],
"fieldConfig": {
"defaults": {
"min": 0,
"max": 100,
"thresholds": {
"steps": [
{ "value": 0, "color": "red" },
{ "value": 50, "color": "yellow" },
{ "value": 90, "color": "green" }
]
}
}
}
},
{
"title": "Burn Rate (1h)",
"type": "graph",
"targets": [
{
"expr": "(sum(rate(log_entries_total{level=\"error\"}[1h])) / sum(rate(log_entries_total[1h]))) / (1 - 0.999)",
"legendFormat": "Burn Rate"
}
]
},
{
"title": "Projected Budget Exhaustion",
"type": "stat",
"targets": [
{
"expr": "(sum(rate(log_entries_total{level=\"error\"}[1h])) / sum(rate(log_entries_total[1h]))) / (1 - 0.999) * 24",
"legendFormat": "Hours until budget exhausted"
}
]
}
]
}
}
Observability Hooks for Logging
This section defines what to log, measure, trace, and alert for logging systems themselves.
Log (What to Emit)
| Event | Fields | Level |
|---|---|---|
| Log ingestion started | service, host, agent_version | INFO |
| Log ingestion stopped | service, host, reason | WARN |
| Buffer approaching full | host, buffer_used_percent, buffer_limit | WARN |
| Malformed log detected | host, parse_error_type, sample | WARN |
| Retry attempt | host, destination, attempt, max_attempts | DEBUG |
| Batch sent successfully | host, destination, batch_size, bytes_sent | DEBUG |
| Authentication failure | host, client_ip, reason | WARN |
Measure (Metrics to Collect)
| Metric | Type | Description |
|---|---|---|
| log_emitted_total | Counter | Total logs emitted by service |
| log_ingested_total | Counter | Total logs ingested to aggregator |
| log_dropped_total | Counter | Logs dropped due to errors/full buffers |
| log_processing_latency_seconds | Histogram | Time from emit to searchable |
| log_buffer_utilization_percent | Gauge | Buffer fill percentage |
| log_parsing_errors_total | Counter | Malformed log entries |
| log_bytes_sent_total | Counter | Bytes sent to aggregators |
| log_aggregator_queue_depth | Gauge | Pending logs in aggregator queue |
Trace (Correlation Points)
| Operation | Trace Attribute | Purpose |
|---|---|---|
| Log emit | log.aggregate | Track logs from emit through aggregation |
| Log parsing | log.parse.status | Monitor parsing health |
| Log shipping | log.ship.destination | Track delivery to aggregators |
| Batch processing | log.batch.size | Monitor batch efficiency |
Alert (When to Page)
| Alert | Condition | Severity | Purpose |
|---|---|---|---|
| Log Silence | No logs received for 5 minutes | P1 Critical | Log pipeline failure |
| High Drop Rate | Drop rate > 1% for 5 minutes | P2 High | Pipeline health |
| Buffer Critical | Buffer > 90% full | P2 High | Prevent data loss |
| Parse Error Spike | Parse errors > 100/min | P3 Medium | Data quality |
| Latency High | Processing latency > 30s p95 | P3 Medium | Performance degradation |
Alerting Hook Template
# logging-observability-hooks.yaml
groups:
- name: logging-observability-hooks
rules:
# Alert on silence - no logs coming in
- alert: LoggingPipelineSilence
expr: rate(fluentd_input_status_records_total[5m]) == 0
for: 5m
labels:
severity: critical
annotations:
summary: "No logs being ingested (Alert on Silence)"
description: "Fluentd/Bit has not sent logs to Elasticsearch in 5 minutes. Either the log pipeline is down or all services have stopped logging."
# Alert on high drop rate
- alert: LoggingDropRateHigh
expr: |
sum(rate(fluentd_output_status_num_errors_total[5m]))
/
sum(rate(fluentd_input_status_records_total[5m])) > 0.01
for: 5m
labels:
severity: high
annotations:
summary: "Log drop rate above 1%"
description: "{{ $value | humanizePercentage }} of logs are being dropped. Check Fluentd/Bit error logs."
# Alert on buffer approaching full
- alert: LoggingBufferCritical
expr: fluentd_buffer_queue_length / fluentd_buffer_limit > 0.9
for: 5m
labels:
severity: high
annotations:
summary: "Log buffer above 90% capacity"
description: "Fluentd/Bit buffer is filling up. Risk of log loss if not addressed."
# Alert on high parsing errors
- alert: LoggingParseErrorSpike
expr: rate(log_parsing_errors_total[5m]) > 100
for: 5m
labels:
severity: warning
annotations:
summary: "High log parsing error rate"
description: "More than 100 parsing errors per minute. Review log format consistency."
# Alert on processing latency
- alert: LoggingProcessingLatencyHigh
expr: |
histogram_quantile(0.95,
sum(rate(fluentd_output_status_flush_interval_bucket[5m])) by (le)
) > 30
for: 10m
labels:
severity: warning
annotations:
summary: "Log processing latency above 30 seconds"
description: "P95 log processing latency is {{ $value }}s. Logs may not be searchable in real-time."
# SLO error budget burn rate
- alert: LoggingErrorBudgetBurningFast
expr: |
(
sum(rate(log_entries_total{level="error"}[1h]))
/
sum(rate(log_entries_total[1h]))
) > (1 - 0.999) * 14.4
for: 5m
labels:
severity: critical
annotations:
summary: "Log error budget burning at unsustainable rate"
description: "Error budget is being consumed 14.4x faster than sustainable. Immediate investigation required."
Production Failure Scenarios
| Failure | Impact | Mitigation |
|---|---|---|
| Log aggregation pipeline downtime | No new logs searchable; teams blind to issues | Buffer logs locally; implement retry with backoff; alert on pipeline health |
| Elasticsearch cluster saturation | Log ingestion backs up; logs dropped | Monitor ES cluster health; implement backpressure; use ILM to manage indices |
| Corrupted log data | Searches return incomplete results; debugging misses context | Validate JSON structure at ingestion; use dead-letter queues for malformed logs |
| Sensitive data logged | Security/compliance breach; potential data exposure | Implement redaction middleware; scan logs before storage; educate developers |
| Excessive log volume | Storage costs spike; performance degradation | Implement sampling for DEBUG logs; enforce log level policies; archive aggressively |
| Missing correlation IDs | Cannot trace requests across services | Auto-inject correlation IDs in middleware; reject requests without trace context in high-security paths |
Observability Checklist
Key Log Metrics
- Log ingestion rate (logs/second) by service and level
- Log volume by service, level, and environment
- Error rate in logs (ERROR level count over time)
- Log processing latency (time from log emit to searchable)
- Log agent errors and restarts
- Storage utilization per index
Logs You Should Have
- Request logs with trace_id, user_id, method, path, status, duration_ms
- Authentication events (login attempts, failures, token refreshes)
- Business events (orders, payments, registrations) with entity IDs
- Database query logs for slow queries (>100ms threshold)
- External API call logs with request/response timing
- Background job start/complete/fail logs with job IDs
- Health check and readiness probe logs
- Configuration change logs (who changed what when)
Alerts You Need
- No logs received from a service for >5 minutes (Alert on Silence pattern)
- Error rate spike above baseline (unexpected errors)
- Log volume anomaly (sudden drop or spike)
- Log processing latency >30 seconds
- Elasticsearch cluster health degraded (yellow/red)
- Log agent restart detected
Security Checklist
- No passwords, API keys, or secrets in log output
- Credit card numbers, CVV, SSN never logged
- Authorization tokens logged as type + last 4 chars only (e.g., “Bearer ***abc123”)
- PII fields identified and redacted in redaction middleware
- Log access requires authentication and is audited
- Log aggregation pipeline uses TLS in transit
- Elasticsearch access restricted to authorized personnel
- Log retention complies with data retention policies
- Sensitive data cannot be searched in Kibana/ES by unauthorized users
Common Pitfalls / Anti-Patterns
1. Logging Everything at DEBUG in Production
DEBUG-level logging in high-throughput services generates gigabytes per hour. Use sampling for debug scenarios, or enable DEBUG selectively via feature flags for specific request IDs.
2. Plain Text Logging with String Concatenation
// Bad: Cannot search, parse, or aggregate
logger.info("User " + userId + " purchased " + item);
// Good: Structured, searchable, aggregatable
logger.info("User purchased item", { userId, itemId, itemName, price });
3. Missing Trace Context Propagation
Logs without correlation IDs are useless for tracing requests across services. Always propagate trace_id through HTTP headers, database connections, and message queues.
4. Logging Sensitive Data
Never log passwords, full tokens, credit card numbers, or PII. Implement redaction at the logger level, not the application level, to catch mistakes.
5. Synchronous Logging to Network Storage
Writing logs synchronously to a remote log server adds latency to every operation. Use async logging with local buffering and background shipping.
6. No Log Retention Policy
Without retention policies, storage costs grow unbounded. Define hot/warm/cold/archive tiers and automate data lifecycle management.
7. Logs as the Only Observability Signal
Relying solely on logs for debugging is insufficient at scale. Combine logs with metrics and traces for complete observability.
Quick Recap
Key Takeaways:
- Structured JSON logs enable efficient searching and aggregation
- Correlation IDs connect logs across service boundaries
- Log levels filter noise: DEBUG for development, ERROR/WARN/INFO for production
- Never log sensitive data; always implement redaction
- Monitor your monitors: log aggregation needs its own observability
- Retention policies prevent unbounded storage growth
Copy/Paste Checklist:
# Verify structured logging format
grep -c '"timestamp".*"level".*"message".*"service"' /var/log/app.json
# Find logs for specific trace
grep '"trace_id":"abc123"' /var/log/app.json
# Count errors by service
jq 'select(.level == "ERROR") | .service' /var/log/app.json | sort | uniq -c
# Alert on log silence (Prometheus)
- alert: LogIngestionSilence
expr: rate(fluentd_input_status_records_total[5m]) == 0
for: 5m
labels:
severity: critical
# Redaction function (TypeScript)
const sensitiveFields = ['password', 'token', 'secret', 'creditCard', 'ssn'];
function redact(obj) {
return Object.fromEntries(
Object.entries(obj).map(([k, v]) =>
sensitiveFields.some(f => k.toLowerCase().includes(f)) ? [k, '[REDACTED]'] : [k, v]
)
);
}
Interview Questions
Q: A user reports a bug but provides no details beyond “the checkout failed.” How do you find the relevant logs?
A: Ask for the approximate time, user ID, or order ID. With a timestamp window, query logs for that timeframe filtering on the service handling checkout. With a user ID, search for all log entries tagged with that user ID. With a correlation ID from their session, search for that ID across all services to reconstruct the full request path. In ELK/Kibana: timeframe AND service.name: checkout-service AND "checkout failed". If no direct match, search for errors in the checkout service within the time window, then trace back via correlation IDs to find the root cause service.
Q: Your Elasticsearch cluster is running out of disk space. How do you reduce storage without losing searchable data?
A: Immediate mitigation: delete old indices beyond your retention policy, drop replica counts on older indices, and mark historical indices read-only. For ongoing cost reduction: use ILM policies to move older indices to cheaper storage tiers (warm, cold, or frozen), and reduce shard count — too many small shards waste memory and file handles. Audit field mappings: disable indexing (index: false) on fields you never query, and disable doc_values on fields you never sort or aggregate on. Finally, enforce log volume budgets per service so no single service can overwhelm the cluster.
Q: What is the relationship between correlation IDs, trace IDs, and span IDs in distributed tracing?
A: A trace ID is a unique identifier for an entire request transaction across all services — it stitches together every span. A span ID represents a single unit of work within that trace (one service call, one database query). Correlation IDs are typically an application-level business identifier (order ID, user session ID) that helps you filter logs across services without relying on trace IDs. In practice: the trace ID propagates via HTTP headers (x-b3-traceId in Zipkin, traceparent in W3C trace context) through every service call. Each service creates a span with the incoming trace ID and its own span ID, creating a parent-child tree of operations.
Q: You find that DEBUG logs are missing during production incidents. What logging level should you set in production and why?
A: Production should run at INFO or WARN in most services — DEBUG is too noisy for production traffic volumes and can itself cause performance problems (disk I/O, log storage costs). However, when an incident is active, dynamically raising a specific service to DEBUG via a config change allows targeted debugging without impacting all services. The pattern: have a mechanism to change log level at runtime (via a config map reload, a logging API endpoint, or a Kubernetes annotation). Keep DEBUG in staging and development environments where you need it for development iteration.
Q: How do you handle sensitive data like PII appearing in logs?
A: The best approach is to never log PII in the first place — sanitize before logging by configuring your logger to mask fields like email addresses, credit cards, and phone numbers using a redaction library. In your logging framework (pino for Node, zap for Go, structlog for Python), add a field filter that replaces sensitive patterns with [REDACTED]. Alternatively, use a log processor (Fluentd filter, Logstash mutate) to strip or hash sensitive fields before forwarding. Also configure your SIEM and log storage to mark PII fields as sensitive so analysts are warned. Audit your logs regularly with automated PII-detection scans to catch accidental leakage.
Conclusion
Good logging practices pay off when you need them most: debugging production issues at 2am. Structured logs with correlation IDs let you trace requests across service boundaries. Appropriate log levels keep noise manageable. Retention policies balance cost with compliance requirements.
Start with JSON structured logging in your applications. Add correlation ID propagation early. Build log aggregation before you need it, not during an incident.
For deeper observability, combine logging with the Metrics, Monitoring & Alerting and Distributed Tracing practices covered in our other guides. These three pillars work together: logs show you what happened, metrics show you patterns, and traces show you why it happened.