Logging Best Practices: Structured Logs, Levels, Aggregation
Master production logging with structured formats, proper log levels, correlation IDs, and scalable log aggregation. Includes patterns for containerized applications.
Logs are your primary tool for understanding what happens in production when users report issues. If you have ever spent hours searching through plain text log files trying to find a single error, you know how painful unstructured logging can be.
This guide covers structured formats, log levels, correlation IDs for tracing requests, and aggregation strategies that scale.
Core Concepts
Structured logs use a defined format, typically JSON, where each field has a specific meaning.
JSON Log Format
{
"timestamp": "2026-03-22T14:32:01.456Z",
"level": "INFO",
"message": "User login successful",
"service": "auth-service",
"version": "2.1.0",
"trace_id": "abc123def456",
"user_id": "usr_789",
"ip_address": "192.168.1.42",
"duration_ms": 45,
"environment": "production"
}
This format lets you search by any field, aggregate metrics across dimensions, and correlate logs with traces and metrics.
Implementing Structured Logging
Most languages have structured logging libraries:
# Python with structlog
import structlog
structlog.configure(
processors=[
structlog.processors.TimeStamper(fmt="iso"),
structlog.processors.JSONRenderer()
]
)
log = structlog.get_logger()
log.info("user_login",
user_id="usr_789",
ip_address="192.168.1.42",
duration_ms=45
)
// TypeScript with pino
import pino from "pino";
const log = pino({
level: "info",
base: {
service: "auth-service",
version: process.env.APP_VERSION,
},
});
log.info(
{
userId: "usr_789",
ipAddress: "192.168.1.42",
durationMs: 45,
},
"User login successful",
);
// Go with zerolog
import (
"github.com/rs/zerolog"
"github.com/rs/zerolog/log"
)
zerolog.TimeFieldFormat = zerolog.TimeFormatUnix
log.Info().
Str("user_id", "usr_789").
Str("ip_address", "192.168.1.42").
Int("duration_ms", 45).
Msg("User login successful")
Log Levels
Log levels help filter noise. Not every log entry needs to be visible during normal operations.
Standard Log Levels
| Level | Purpose | When to Use |
|---|---|---|
| DEBUG | Detailed diagnostic information | During development and troubleshooting |
| INFO | Confirmation that things work as expected | Significant business events |
| WARN | Unexpected but handled situations | Recoverable errors, degraded states |
| ERROR | Errors that need attention | Failures that affect requests |
| FATAL | System cannot continue | Critical failures requiring immediate action |
Choosing the Right Level
This feels intuitive but gets harder at scale. A few guidelines:
- INFO for business events like orders placed, users registered. You want these for analytics and auditing.
- WARN for situations that deserve attention while the system keeps working: retries that eventually succeeded, unexpected cache misses, degraded mode.
- ERROR for failures that affect the current request: database timeout, external API failure, validation error.
- DEBUG for information that helps during development but would overwhelm production: loop iterations, intermediate values.
Don’t log DEBUG in production unless you can enable it selectively for specific requests. A debug log in a hot path can generate gigabytes per hour.
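Selective DEBUG can be sketched with Python's standard `logging` module and a context variable. This is a minimal illustration, not a drop-in implementation: the `debug_enabled` flag, the `SelectiveDebugFilter` class, and the in-memory `ListHandler` are hypothetical names, and it assumes some middleware sets the flag per request (for example from a header or feature flag).

```python
import contextvars
import logging

# Hypothetical per-request flag; a middleware would set it from a header or feature flag.
debug_enabled = contextvars.ContextVar("debug_enabled", default=False)


class SelectiveDebugFilter(logging.Filter):
    """Pass DEBUG records only when the current request opted in."""

    def filter(self, record: logging.LogRecord) -> bool:
        if record.levelno > logging.DEBUG:
            return True  # INFO and above always pass
        return debug_enabled.get()


class ListHandler(logging.Handler):
    """Collects messages in memory so the example is easy to inspect."""

    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        self.messages.append(record.getMessage())


logger = logging.getLogger("selective_debug_demo")
logger.setLevel(logging.DEBUG)  # let DEBUG records reach the filter
logger.propagate = False
handler = ListHandler()
handler.addFilter(SelectiveDebugFilter())
logger.addHandler(handler)

logger.debug("cache lookup details")           # dropped: request did not opt in
token = debug_enabled.set(True)                # e.g. the request carried a debug header
logger.debug("cache lookup details (traced)")  # kept for this request only
debug_enabled.reset(token)
logger.info("request completed")               # unaffected by the flag
```

The filter still pays the cost of creating each DEBUG record, so very hot paths may also want a cheap guard before calling `logger.debug`.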
Correlation IDs
When a request flows through multiple services, correlation IDs let you follow it across all logs.
Propagating Trace Context
// Middleware to extract or generate correlation ID
function correlationMiddleware(req, res, next) {
const traceId = req.headers["x-trace-id"] || generateUUID();
req.correlationId = traceId;
res.setHeader("x-trace-id", traceId);
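// Attach a request-scoped child logger instead of reassigning the shared one,
// so one request's trace ID never leaks into another request's logs
req.log = log.child({ trace_id: traceId });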
next();
}
Propagate the correlation ID to all downstream calls:
// Outgoing HTTP request
fetch("https://api.example.com/users", {
headers: {
"X-Trace-ID": req.correlationId,
},
});
// Database queries
db.query("SELECT * FROM users WHERE id = $1", [userId], {
traceId: req.correlationId,
});
// Message queue messages
queue.send({
payload: orderData,
headers: {
"X-Trace-ID": req.correlationId,
},
});
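The same propagation idea can be sketched in Python with `contextvars` and the standard `logging` module. The names `trace_id_var`, `TraceIdFilter`, `JsonFormatter`, `start_request`, and `outgoing_headers` are illustrative, and the sketch assumes the framework hands over headers as a dict with lowercase keys.

```python
import contextvars
import json
import logging
import uuid

# Holds the trace ID for the request currently being handled.
trace_id_var = contextvars.ContextVar("trace_id", default=None)


class TraceIdFilter(logging.Filter):
    """Copy the current trace ID onto every record so formatters can emit it."""

    def filter(self, record):
        record.trace_id = trace_id_var.get()
        return True


class JsonFormatter(logging.Formatter):
    """Render a record as one JSON object per line."""

    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "message": record.getMessage(),
            "trace_id": getattr(record, "trace_id", None),
        })


def start_request(headers):
    """Reuse the caller's trace ID if present, otherwise generate one."""
    trace_id = headers.get("x-trace-id") or uuid.uuid4().hex
    trace_id_var.set(trace_id)
    return trace_id


def outgoing_headers():
    """Headers to attach to downstream HTTP calls or queue messages."""
    return {"X-Trace-ID": trace_id_var.get()}


# Simulate one incoming request that already carries a trace ID.
start_request({"x-trace-id": "abc123def456"})
record = logging.LogRecord("auth-service", logging.INFO, "demo.py", 0,
                           "User login successful", None, None)
TraceIdFilter().filter(record)
line = JsonFormatter().format(record)
```

Because the ID lives in a context variable rather than on a shared logger, concurrent requests handled in separate threads or asyncio tasks each see their own value.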
Using Correlation IDs for Search
With structured logs and correlation IDs, debugging a user issue looks like this:
# Find all logs for a specific request
grep '"trace_id":"abc123def456"' /var/log/app.log
# Or in your log aggregator
query: trace_id = "abc123def456"
Search for the trace ID and you get the incoming request, database queries, cache hits, outgoing API calls, and the error that occurred.
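When no aggregator is at hand, a small script can do the same lookup on JSON lines. Unlike the `grep` pattern, it does not depend on whether the serializer put a space after the colon. The function name and sample lines below are made up for illustration.

```python
import json


def find_by_trace_id(lines, trace_id):
    """Yield parsed log entries whose trace_id matches, skipping non-JSON lines."""
    for line in lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # plain-text or truncated line
        if isinstance(entry, dict) and entry.get("trace_id") == trace_id:
            yield entry


sample_lines = [
    '{"trace_id": "abc123def456", "service": "api-gateway", "message": "Request received"}',
    '{"trace_id": "zzz999", "service": "auth-service", "message": "User login successful"}',
    'plain text line that is not JSON',
    '{"trace_id":"abc123def456","service":"orders","level":"ERROR","message":"Database timeout"}',
]
matches = list(find_by_trace_id(sample_lines, "abc123def456"))
services = [m["service"] for m in matches]
```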
What to Include in Logs
Context matters. The more relevant context you include, the easier debugging becomes.
Essential Fields
Every log entry needs at minimum:
- timestamp: ISO 8601 format in UTC
- level: Log severity level
- service: Which service generated this log
- message: Human-readable description
Request Context
For web services, include:
- request_id or trace_id
- user_id (if authenticated)
- HTTP method, path, status code
- Client IP address
- User agent
{
"timestamp": "2026-03-22T14:32:01.456Z",
"level": "INFO",
"service": "api-gateway",
"message": "Request completed",
"request_id": "req_abc123",
"method": "GET",
"path": "/api/users/usr_789",
"status": 200,
"duration_ms": 120,
"ip": "192.168.1.42",
"user_agent": "Mozilla/5.0..."
}
Business Events
For significant business events:
- Event type (login, purchase, registration)
- Entity IDs involved
- Outcome (success, failure)
- Duration if applicable
- Any relevant metadata
{
"timestamp": "2026-03-22T14:32:01.456Z",
"level": "INFO",
"service": "checkout-service",
"message": "Order placed",
"event": "order_placed",
"order_id": "ord_xyz789",
"customer_id": "cust_123",
"total_amount": 99.99,
"currency": "USD",
"item_count": 3
}
What NOT to Log
Logging sensitive data creates security and compliance problems. Never log:
- Passwords or password hashes
- Credit card numbers or CVV codes
- Social security numbers or national IDs
- API keys or secrets
- Full authorization tokens (log the type and last 4 chars only)
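As a rough sketch of the last rule, the helper below keeps the token type and the final four characters. `mask_token` is a hypothetical name, and the sketch assumes the value looks like an `Authorization` header ("scheme credential") or a bare credential.

```python
def mask_token(header_value: str, visible: int = 4) -> str:
    """Return the token type plus the last `visible` characters of the credential."""
    scheme, _, credential = header_value.partition(" ")
    if not credential:
        # No scheme prefix: treat the whole value as the credential.
        scheme, credential = "", scheme
    if len(credential) <= visible:
        masked = "***"  # too short to reveal anything safely
    else:
        masked = "***" + credential[-visible:]
    return f"{scheme} {masked}".strip()


masked_bearer = mask_token("Bearer eyJhbGciOiJIUzI1NiJ9.abc123")
masked_key = mask_token("sk_live_1234567890")
masked_short = mask_token("abcd")
```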
Redact Sensitive Data
function redactSensitiveFields(
  obj: Record<string, unknown>,
): Record<string, unknown> {
  // Entries are lowercase because keys are lowercased before matching
  const sensitiveFields = ["password", "token", "secret", "creditcard", "ssn"];
  const redacted: Record<string, unknown> = { ...obj };
  for (const key of Object.keys(redacted)) {
    const value = redacted[key];
    if (sensitiveFields.some((f) => key.toLowerCase().includes(f))) {
      redacted[key] = "[REDACTED]";
    } else if (value !== null && typeof value === "object" && !Array.isArray(value)) {
      // Recurse into nested objects; arrays are left untouched in this sketch
      redacted[key] = redactSensitiveFields(value as Record<string, unknown>);
    }
  }
  return redacted;
}
// pino expects the fields object first, then the message
log.info(
  redactSensitiveFields({ userId: "usr_123", password: "secret123" }),
  "User authenticated",
);
// Logged fields: userId: "usr_123", password: "[REDACTED]"
Log Aggregation Architecture
At scale, logs need to be collected, aggregated, and stored efficiently.
Common Architecture
graph LR
A[Application] -->|stdout/JSON| B[Container Runtime]
B --> C[Log Agent]
C --> D[Log Aggregator]
D --> E[Storage]
D --> F[Search Interface]
G[Analytics/BI] --> E
Container Logging
In containerized environments, applications write to stdout and stderr. The container runtime handles collection:
# Write logs to stdout, not files
# Bad: RUN echo "$(date) Log entry" >> /var/log/app.log
# Good: console.log(JSON.stringify({ timestamp, message }))
For applications that must write to files, use a sidecar log agent or mount a shared log directory:
# Pod with log volume
apiVersion: v1
kind: Pod
metadata:
name: myapp
spec:
containers:
- name: app
image: myapp:latest
volumeMounts:
- name: logs
mountPath: /var/log/myapp
- name: log-agent
image: log-agent:latest
volumeMounts:
- name: logs
mountPath: /var/log/myapp
- name: agent-config
mountPath: /etc/log-agent
volumes:
- name: logs
emptyDir: {}
- name: agent-config
configMap:
name: log-agent-config
Shipping Logs to Aggregators
Fluentd/Fluent Bit Configuration
# fluent-bit.conf
[SERVICE]
Flush 5
Daemon Off
Log_Level info
Parsers_File parsers.conf
[INPUT]
Name tail
Path /var/log/containers/*.log
Parser docker
Tag container.*
Refresh_Interval 5
[FILTER]
Name kubernetes
Match container.*
Kube_URL https://kubernetes.default.svc:443
Kube_CA_File /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
Kube_Token_File /var/run/secrets/kubernetes.io/serviceaccount/token
[OUTPUT]
Name es
Match container.*
Host elasticsearch.logging.svc
Port 9200
Logstash_Format On
Logstash_Prefix kubernetes
Retry_Limit False
Vector Configuration
Vector is a newer alternative that aims for higher throughput and lower resource usage; benchmark it against your own workload before switching:
# vector.toml
[sources.docker]
type = "docker_logs"
[transforms.parse_json]
type = "remap"
inputs = ["docker"]
source = '.message = parse_json!(.message)'
[sinks.elasticsearch]
type = "elasticsearch"
inputs = ["parse_json"]
endpoint = "http://elasticsearch.logging.svc:9200"
index = "kubernetes-%Y.%m.%d"
Log Storage and Retention
Storage costs grow with log volume. Design retention policies carefully.
Retention Tiers
| Tier | Duration | Use Case |
|---|---|---|
| Hot | 0-7 days | Real-time troubleshooting |
| Warm | 7-30 days | Investigating recent issues |
| Cold | 30-90 days | Compliance, audit |
| Archive | 1+ years | Legal requirements |
Elasticsearch Index Lifecycle Management
{
"policy": {
"phases": {
"hot": {
"actions": {
"rollover": {
"max_age": "7d",
"max_size": "50gb"
},
"set_priority": { "priority": 100 }
}
},
"warm": {
"min_age": "7d",
"actions": {
"shrink": { "number_of_shards": 1 },
"forcemerge": { "max_num_segments": 1 },
"set_priority": { "priority": 50 }
}
},
"cold": {
"min_age": "30d",
"actions": {
"freeze": {},
"set_priority": { "priority": 0 }
}
},
"delete": {
"min_age": "365d",
"actions": {
"delete": {}
}
}
}
}
}
Performance Considerations
Logging can become a bottleneck if you don’t design it carefully.
Asynchronous Logging
Write logs asynchronously so they don’t block your application:
import logging
import queue
from threading import Thread
class AsyncLogHandler(logging.Handler):
    def __init__(self, batch_size=100, flush_interval=1.0):
        super().__init__()
        self.queue = queue.Queue(maxsize=10000)
        self.batch_size = batch_size
        self.flush_interval = flush_interval
        self.worker = Thread(target=self._process_logs, daemon=True)
        self.worker.start()

    def emit(self, record):
        try:
            self.queue.put_nowait(self.format(record))
        except queue.Full:
            pass  # Drop log if queue is full

    def _process_logs(self):
        batch = []
        while True:
            try:
                item = self.queue.get(timeout=self.flush_interval)
                batch.append(item)
                while len(batch) < self.batch_size:
                    item = self.queue.get_nowait()
                    batch.append(item)
            except queue.Empty:
                pass
            if batch:
                self._send_batch(batch)
                batch = []

    def _send_batch(self, batch):
        # Send to log aggregator
        pass
Sampling High-Volume Logs
For debug-level logs in high-traffic paths, sample to reduce volume:
const DEBUG_SAMPLE_RATE = 0.1; // 10% sample rate
// Emit the debug entry for roughly 1 in 10 calls
if (Math.random() < DEBUG_SAMPLE_RATE) {
  log.debug(
    {
      itemId: item.id,
      sampleRate: DEBUG_SAMPLE_RATE,
    },
    "Item processing details",
  );
}
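Random sampling keeps an arbitrary subset of each request's debug lines. A deterministic alternative, sketched here in Python, hashes the trace ID so that a given request is either fully sampled or fully skipped, and every service that applies the same function reaches the same decision. `should_sample` is an illustrative name, and the sketch assumes a trace ID is available at the call site.

```python
import hashlib


def should_sample(trace_id: str, rate: float) -> bool:
    """Deterministically keep roughly `rate` (0.0 to 1.0) of all trace IDs."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < rate


# Fraction of 10,000 synthetic trace IDs kept at a 10% rate.
kept = sum(should_sample(f"trace-{i}", 0.1) for i in range(10_000))
kept_fraction = kept / 10_000
```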
Monitoring Log Health
Logs themselves need monitoring. If logging stops, you lose visibility into your systems.
Metrics to Track
- Log ingestion rate (logs/second)
- Log volume by service and level
- Error rate in logs
- Log processing latency
- Log agent errors and restarts
Alert on Silence
# Prometheus alert for missing logs
- alert: LogIngestionSilence
expr: |
sum(rate(fluentd_input_status_records_total[5m])) == 0
or absent(fluentd_input_status_records_total)
for: 5m
labels:
severity: critical
annotations:
summary: "No logs being ingested from Fluentd"
description: "Fluentd has not sent logs to Elasticsearch in 5 minutes"
When to Use Structured Logging
Use structured logging when:
- Debugging requires cross-referencing multiple log entries
- Requests span multiple services
- You need selective debugging in high-volume APIs
- Audit trails are required for compliance
- You need to correlate logs with traces or metrics
Structured logging may not be worth the overhead for:
- Simple scripts or one-off utilities where stdout debugging suffices
- Very low-traffic applications where unstructured grep suffices
- Legacy systems where migration cost outweighs benefits
- Development environments where DEBUG-level verbosity is acceptable
Trade-off Analysis
| Aspect | Structured Logging | Plain Text Logging |
|---|---|---|
| Searchability | Field-level queries via log aggregators | grep/string matching only |
| Storage Cost | Higher (JSON overhead per line) | Lower (minimal formatting) |
| Parse Complexity | Zero (machine-readable by default) | Brittle (format changes break parsers) |
| Human Readability | Moderate (requires jq or aggregator UI) | High (direct reading in terminal) |
| Tooling Required | Log aggregator (ELK, Loki, Splunk) | None or basic text tools |
| Correlation | Automatic via shared fields | Manual trace ID injection |
| Performance Impact | Slight overhead for JSON serialization | Minimal |
SLI/SLO/Error Budget Templates for Logging
Log-Based SLI Template
# logging-sli-config.yaml
service: logging-observability
environment: production
slis:
- name: log_ingestion_success_rate
description: "Percentage of emitted logs successfully ingested"
query: |
sum(rate(fluentd_output_status_num_logs_total{status="output"}[5m]))
/
sum(rate(fluentd_input_status_records_total[5m]))
- name: log_processing_latency_p95
description: "Time from log emit to searchable in aggregator"
query: |
histogram_quantile(0.95,
sum(rate(fluentd_output_status_flush_interval_bucket[5m])) by (le)
)
- name: log_error_rate
description: "ERROR level log rate as percentage of total"
query: |
sum(rate(log_entries_total{level="error"}[5m]))
/
sum(rate(log_entries_total[5m])) * 100
Log SLO Template
# logging-slo-config.yaml
objectives:
- display_name: "Log Ingestion Availability"
sli: log_ingestion_success_rate
target: 99.5
window: 30d
description: "99.5% of emitted logs should be ingested"
- display_name: "Log Processing Latency"
sli: log_processing_latency_p95
target: 99.0
threshold_ms: 30000
window: 30d
description: "95% of logs should be searchable within 30 seconds"
- display_name: "Log Error Rate"
sli: log_error_rate
target: 99.9
threshold_percent: 1.0
window: 30d
description: "Error rate should stay below 1%"
Error Budget Calculator
# error-budget-calculator.py
def calculate_error_budget(slo_target, window_days=30):
    """
    Calculate error budget in minutes for a given SLO target.
    Example: 99.5% SLO over 30 days = 216 minutes of allowed errors
    """
    window_seconds = window_days * 24 * 60 * 60
    allowed_errors = window_seconds * (1 - slo_target)
    return allowed_errors / 60  # Convert to minutes


# Standard SLO error budgets (30-day window)
slo_budgets = {
    "99.0%": calculate_error_budget(0.990),    # 432 minutes = 7.2 hours
    "99.5%": calculate_error_budget(0.995),    # 216 minutes = 3.6 hours
    "99.9%": calculate_error_budget(0.999),    # 43.2 minutes
    "99.95%": calculate_error_budget(0.9995),  # 21.6 minutes
    "99.99%": calculate_error_budget(0.9999),  # 4.32 minutes
}
for slo, budget in slo_budgets.items():
    print(f"SLO {slo}: {budget:.2f} minutes error budget")
Multi-Window Burn-Rate Alerting for Log Quality
Burn-rate alerts detect when error budgets are being consumed faster than expected. This approach catches both sudden spikes and slow leaks.
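The arithmetic behind burn rates is short enough to sketch directly. The helpers below are illustrative and assume a 30-day SLO period and a burn rate that stays constant over the window.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than sustainable the budget is being spent."""
    return error_ratio / (1 - slo_target)


def budget_consumed(rate: float, window_hours: float, period_days: int = 30) -> float:
    """Fraction of the period's error budget spent if `rate` holds for the window."""
    return rate * window_hours / (period_days * 24)


def hours_to_exhaustion(rate: float, period_days: int = 30) -> float:
    """Hours until a full budget is gone at a constant burn rate."""
    return period_days * 24 / rate


fast = budget_consumed(14.4, window_hours=1)    # about 0.02, i.e. 2% of the budget
medium = budget_consumed(6.0, window_hours=6)   # about 0.05, i.e. 5% of the budget
fast_exhaustion = hours_to_exhaustion(14.4)     # about 50 hours
```

With a 99.9% target, a 1.44% error ratio corresponds to a burn rate of about 14.4, which would use up the whole 30-day budget in roughly two days.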
1-Hour Window Burn-Rate Alert (Fast Burn)
# Burn-rate alerts for logging
groups:
  - name: logging-burn-rate
    rules:
      # Fast burn: 1-hour window, 14.4x burn rate (burns 2% of a 30-day budget in 1 hour)
      - alert: LogErrorBudgetFastBurn
        expr: |
          (
            sum(rate(log_entries_total{level="error"}[1h]))
            /
            sum(rate(log_entries_total[1h]))
          )
          > (1 - 0.999) * 14.4
        for: 5m
        labels:
          severity: critical
          category: logging
          window: 1h
        annotations:
          summary: "Log error budget burning fast (1h window)"
          description: "Error ratio is {{ $value | humanizePercentage }}, more than 14.4x the sustainable rate. At this pace the 30-day budget is depleted in about 2 days."
6-Hour Window Burn-Rate Alert (Medium Burn)
      # Medium burn: 6-hour window, 6x burn rate (burns 5% of a 30-day budget in 6 hours)
      - alert: LogErrorBudgetMediumBurn
        expr: |
          (
            sum(rate(log_entries_total{level="error"}[6h]))
            /
            sum(rate(log_entries_total[6h]))
          )
          > (1 - 0.999) * 6
        for: 30m
        labels:
          severity: warning
          category: logging
          window: 6h
        annotations:
          summary: "Log error budget burning (6h window)"
          description: "Error ratio is {{ $value | humanizePercentage }}, more than 6x the sustainable rate. Check for sustained error patterns."
Multi-Window Burn-Rate Alert Set
      # Complete burn-rate alert set (fires when either window exceeds its threshold)
      - alert: LogErrorBudgetBurnAllWindows
        expr: |
          (
            sum(rate(log_entries_total{level="error"}[1h]))
            /
            sum(rate(log_entries_total[1h]))
          )
          > (1 - 0.999) * 14.4
          or
          (
            sum(rate(log_entries_total{level="error"}[6h]))
            /
            sum(rate(log_entries_total[6h]))
          )
          > (1 - 0.999) * 6
        for: 5m
        labels:
          severity: critical
          category: logging
        annotations:
          summary: "Log error budget burning across multiple time windows"
          description: |
            Multi-window burn-rate alert triggered.
            1h burn rate: {{ with query "sum(rate(log_entries_total{level=\"error\"}[1h])) / sum(rate(log_entries_total[1h])) / (1 - 0.999)" }}{{ . | first | value | printf "%.2f" }}{{ end }}x
            6h burn rate: {{ with query "sum(rate(log_entries_total{level=\"error\"}[6h])) / sum(rate(log_entries_total[6h])) / (1 - 0.999)" }}{{ . | first | value | printf "%.2f" }}{{ end }}x
            Review error patterns and allocate incident resources.
SLO Error Budget Dashboard Panels
{
"dashboard": {
"title": "Logging SLO Error Budget",
"panels": [
{
"title": "Error Budget Remaining (30d)",
"type": "gauge",
"targets": [
{
"expr": "(1 - (sum(rate(log_entries_total{level=\"error\"}[30d])) / sum(rate(log_entries_total[30d]))) / (1 - 0.999)) * 100",
"legendFormat": "Budget Remaining %"
}
],
"fieldConfig": {
"defaults": {
"min": 0,
"max": 100,
"thresholds": {
"steps": [
{ "value": 0, "color": "red" },
{ "value": 50, "color": "yellow" },
{ "value": 90, "color": "green" }
]
}
}
}
},
{
"title": "Burn Rate (1h)",
"type": "graph",
"targets": [
{
"expr": "(sum(rate(log_entries_total{level=\"error\"}[1h])) / sum(rate(log_entries_total[1h]))) / (1 - 0.999)",
"legendFormat": "Burn Rate"
}
]
},
{
"title": "Projected Budget Exhaustion",
"type": "stat",
"targets": [
{
"expr": "720 / ((sum(rate(log_entries_total{level=\"error\"}[1h])) / sum(rate(log_entries_total[1h]))) / (1 - 0.999))",
"legendFormat": "Hours until budget exhausted"
}
]
}
]
}
}
OpenTelemetry Integration for Logging
OpenTelemetry hooks into your application at the SDK level and sends logs, traces, and metrics through the same pipeline. No per-vendor instrumentation, no rewriting your log format when you switch backends.
Auto-Instrumentation Setup
// OpenTelemetry SDK setup (Node.js): exports traces and logs to a collector
import { NodeSDK } from "@opentelemetry/sdk-node";
import { OTLPTraceExporter } from "@opentelemetry/exporter-trace-otlp-http";
import { OTLPLogExporter } from "@opentelemetry/exporter-logs-otlp-http";
import { BatchLogRecordProcessor } from "@opentelemetry/sdk-logs";
const sdk = new NodeSDK({
traceExporter: new OTLPTraceExporter({
url: "http://otel-collector:4318/v1/traces",
}),
logRecordProcessor: new BatchLogRecordProcessor(
new OTLPLogExporter({
url: "http://otel-collector:4318/v1/logs",
}),
),
});
sdk.start();
Correlating Logs with Traces and Metrics
// Emit log records through the OpenTelemetry Logs API with trace context attached
import { trace, context } from "@opentelemetry/api";
import { logs } from "@opentelemetry/api-logs";
// The logger name is illustrative
const logger = logs.getLogger("app-logger");
function emitLog(level: string, message: string, attributes = {}) {
  const span = trace.getSpan(context.active());
  logger.emit({
    timestamp: new Date(),
    severityText: level,
    body: message,
    attributes: {
      trace_id: span?.spanContext().traceId,
      span_id: span?.spanContext().spanId,
      ...attributes,
    },
  });
}
Benefits of OpenTelemetry for Logging
| Benefit | Description |
|---|---|
| Vendor neutrality | Switch backends without re-instrumenting |
| Unified data model | Logs, traces, and metrics share the same correlation IDs |
| Automatic context propagation | Trace context automatically injected into logs |
| Sampling coordination | Sample logs and traces together for consistent debugging |
Observability Hooks for Logging
This section defines what to log, measure, trace, and alert for logging systems themselves.
Log (What to Emit)
| Event | Fields | Level |
|---|---|---|
| Log ingestion started | service, host, agent_version | INFO |
| Log ingestion stopped | service, host, reason | WARN |
| Buffer approaching full | host, buffer_used_percent, buffer_limit | WARN |
| Malformed log detected | host, parse_error_type, sample | WARN |
| Retry attempt | host, destination, attempt, max_attempts | DEBUG |
| Batch sent successfully | host, destination, batch_size, bytes_sent | DEBUG |
| Authentication failure | host, client_ip, reason | WARN |
Measure (Metrics to Collect)
| Metric | Type | Description |
|---|---|---|
| log_emitted_total | Counter | Total logs emitted by service |
| log_ingested_total | Counter | Total logs ingested to aggregator |
| log_dropped_total | Counter | Logs dropped due to errors/full buffers |
| log_processing_latency_seconds | Histogram | Time from emit to searchable |
| log_buffer_utilization_percent | Gauge | Buffer fill percentage |
| log_parsing_errors_total | Counter | Malformed log entries |
| log_bytes_sent_total | Counter | Bytes sent to aggregators |
| log_aggregator_queue_depth | Gauge | Pending logs in aggregator queue |
Trace (Correlation Points)
| Operation | Trace Attribute | Purpose |
|---|---|---|
| Log emit | log.aggregate | Track logs from emit through aggregation |
| Log parsing | log.parse.status | Monitor parsing health |
| Log shipping | log.ship.destination | Track delivery to aggregators |
| Batch processing | log.batch.size | Monitor batch efficiency |
Alert (When to Page)
| Alert | Condition | Severity | Purpose |
|---|---|---|---|
| Log Silence | No logs received for 5 minutes | P1 Critical | Log pipeline failure |
| High Drop Rate | Drop rate > 1% for 5 minutes | P2 High | Pipeline health |
| Buffer Critical | Buffer > 90% full | P2 High | Prevent data loss |
| Parse Error Spike | Parse errors > 100/min | P3 Medium | Data quality |
| Latency High | Processing latency > 30s p95 | P3 Medium | Performance degradation |
Alerting Hook Template
# logging-observability-hooks.yaml
groups:
- name: logging-observability-hooks
rules:
# Alert on silence - no logs coming in
- alert: LoggingPipelineSilence
expr: sum(rate(fluentd_input_status_records_total[5m])) == 0 or absent(fluentd_input_status_records_total)
for: 5m
labels:
severity: critical
annotations:
summary: "No logs being ingested (Alert on Silence)"
description: "Fluentd/Fluent Bit has not sent logs to Elasticsearch in 5 minutes. Either the log pipeline is down or all services have stopped logging."
# Alert on high drop rate
- alert: LoggingDropRateHigh
expr: |
sum(rate(fluentd_output_status_num_errors_total[5m]))
/
sum(rate(fluentd_input_status_records_total[5m])) > 0.01
for: 5m
labels:
severity: high
annotations:
summary: "Log drop rate above 1%"
description: "{{ $value | humanizePercentage }} of logs are being dropped. Check Fluentd/Fluent Bit error logs."
# Alert on buffer approaching full
- alert: LoggingBufferCritical
expr: fluentd_buffer_queue_length / fluentd_buffer_limit > 0.9
for: 5m
labels:
severity: high
annotations:
summary: "Log buffer above 90% capacity"
description: "Fluentd/Fluent Bit buffer is filling up. Risk of log loss if not addressed."
# Alert on high parsing errors
- alert: LoggingParseErrorSpike
expr: rate(log_parsing_errors_total[5m]) * 60 > 100
for: 5m
labels:
severity: warning
annotations:
summary: "High log parsing error rate"
description: "More than 100 parsing errors per minute. Review log format consistency."
# Alert on processing latency
- alert: LoggingProcessingLatencyHigh
expr: |
histogram_quantile(0.95,
sum(rate(fluentd_output_status_flush_interval_bucket[5m])) by (le)
) > 30
for: 10m
labels:
severity: warning
annotations:
summary: "Log processing latency above 30 seconds"
description: "P95 log processing latency is {{ $value }}s. Logs may not be searchable in real-time."
# SLO error budget burn rate
- alert: LoggingErrorBudgetBurningFast
expr: |
(
sum(rate(log_entries_total{level="error"}[1h]))
/
sum(rate(log_entries_total[1h]))
) > (1 - 0.999) * 14.4
for: 5m
labels:
severity: critical
annotations:
summary: "Log error budget burning at unsustainable rate"
description: "Error budget is being consumed at least 14.4x faster than sustainable. Immediate investigation required."
Cost Optimization for Logging Pipeline
Logging costs creep up fast when you are not paying attention. Here is how to keep them under control.
Log Volume Budgeting
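# Kubernetes resource quota for logging
# A ResourceQuota applies to the whole namespace, not to a single workload
apiVersion: v1
kind: ResourceQuota
metadata:
  name: logging-budget
  namespace: production
spec:
  hard:
    # Cap total PersistentVolumeClaim storage requested in the namespace
    requests.storage: 100Gi
    # Cap total memory requests and limits across all pods in the namespace
    requests.memory: 2Gi
    limits.memory: 4Gi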
Cost Optimization Strategies
| Strategy | Impact | Implementation |
|---|---|---|
| Reduce DEBUG in production | 60-80% volume reduction | Runtime level control, feature flags |
| Index only essential fields | 40-60% storage reduction | Field mapping optimization in ES |
| Aggressive ILM policies | 50-70% cost reduction | Move old data to cold/archive tiers |
| Sampling high-volume paths | 90% volume reduction | Deterministic sampling for non-critical |
| Compress before shipping | 30-50% bandwidth savings | gzip compression in log agents |
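The compression row can be illustrated with a small Python sketch that batches entries as newline-delimited JSON and gzips them before shipping. The function names are hypothetical, and the actual savings depend on how repetitive your log fields are.

```python
import gzip
import json


def compress_batch(entries: list) -> bytes:
    """Serialize entries as newline-delimited JSON and gzip the result."""
    payload = "\n".join(json.dumps(e, separators=(",", ":")) for e in entries)
    return gzip.compress(payload.encode("utf-8"))


def decompress_batch(blob: bytes) -> list:
    """Inverse of compress_batch, as an aggregator-side reader might do."""
    text = gzip.decompress(blob).decode("utf-8")
    return [json.loads(line) for line in text.splitlines()]


# Synthetic batch with the kind of repeated field names real logs tend to have.
batch = [
    {"level": "INFO", "service": "auth-service", "message": "User login successful", "seq": i}
    for i in range(200)
]
raw_size = len("\n".join(json.dumps(e, separators=(",", ":")) for e in batch).encode("utf-8"))
compressed = compress_batch(batch)
```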
Architecture for Cost-Effective Logging
graph TB
A[Application] --> B[Fluent Bit Agent]
B --> C{Local Buffer}
C -->|Normal hours| D[Hot Storage - 7 days]
C -->|Off-peak batch| E[Warm Storage - 30 days]
D --> F[Cold Storage - 90 days]
E --> F
F --> G[Archive - 1+ year]
G --> H[Glacier/Blob Storage]
Reducing Egress Costs in Multi-Region Setups
# Fluentd filter to drop low-priority logs before shipping
<filter container.**>
@type grep
<exclude>
key level
pattern /DEBUG|TRACE/
</exclude>
</filter>
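# Alternative: sample instead of dropping everything
# Assumes the fluent-plugin-sampling-filter plugin is installed. It samples every
# record it matches, so route only debug-level records through this filter.
<filter container.**>
@type sampling
# Keep 1 record out of every 10
interval 10
sample_unit all
</filter>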
Multi-Region Logging Strategies
Global systems need log collection that respects regional boundaries and does not add latency to user-facing paths.
Regional Log Aggregation
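# Regional Vector aggregator config (TOML)
# Route logs to regional storage based on source
[transforms.route_by_region]
type = "route"
inputs = ["parse_json"]

[transforms.route_by_region.route]
# Assumes each event carries a `region` field such as "eu-central-1"
eu = 'starts_with(string(.region) ?? "", "eu")'

[sinks.s3_eu_central]
type = "aws_s3"
inputs = ["route_by_region.eu"]
bucket = "logs-eu-central-1"
region = "eu-central-1"
encoding.codec = "json"

# Everything that did not match the EU route
[sinks.s3_regional]
type = "aws_s3"
inputs = ["route_by_region._unmatched"]
bucket = "logs-us-east-1"
region = "us-east-1"
encoding.codec = "json"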
Cross-Region Log Correlation
// Fan-out query across regions
async function searchLogsAcrossRegions(
query,
regions = ["us-east-1", "eu-central-1"],
) {
const results = await Promise.all(
regions.map((region) =>
elasticsearch.search({
index: `logs-${region}-*`,
body: {
query: {
bool: {
must: [query],
filter: [{ term: { region } }],
},
},
},
}),
),
);
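// Merge hits from all regions and group them by trace_id
// (keeps every entry for a trace instead of overwriting earlier ones)
return results
.flatMap((r) => r.hits.hits)
.reduce((acc, hit) => {
const traceId = hit._source.trace_id;
(acc[traceId] ??= []).push(hit._source);
return acc;
}, {});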
}
Compliance Considerations
| Requirement | Implementation |
|---|---|
| GDPR (EU data) | Regional aggregation; keep logs containing personal data in-region unless a lawful transfer mechanism applies |
| Data residency | Separate indices per region, regional access controls |
| Audit trails | Immutable WORM storage in each region |
| Incident response | Replicate critical error logs to a central alerting index |
Cross-Region Replication Configuration
{
  "index": {
    "number_of_shards": 3,
    "number_of_replicas": 1,
    "routing": {
      "allocation": {
        "include": {
          "region": "us-east-1"
        }
      }
    }
  },
  "cluster.routing.allocation.awareness.attributes": "region"
}
Production Failure Scenarios
| Failure | Impact | Mitigation |
|---|---|---|
| Log aggregation pipeline downtime | No new logs searchable; teams blind to issues | Buffer logs locally; implement retry with backoff; alert on pipeline health |
| Elasticsearch cluster saturation | Log ingestion backs up; logs dropped | Monitor ES cluster health; implement backpressure; use ILM to manage indices |
| Corrupted log data | Searches return incomplete results; debugging misses context | Validate JSON structure at ingestion; use dead-letter queues for malformed logs |
| Sensitive data logged | Security/compliance breach; potential data exposure | Implement redaction middleware; scan logs before storage; educate developers |
| Excessive log volume | Storage costs spike; performance degradation | Implement sampling for DEBUG logs; enforce log level policies; archive aggressively |
| Missing correlation IDs | Cannot trace requests across services | Auto-inject correlation IDs in middleware; reject requests without trace context in high-security paths |
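The "validate at ingestion, dead-letter the rest" mitigation from the table can be sketched as follows. The required field list mirrors the essential fields named earlier in this guide; `validate_line` and `ingest` are illustrative names, and a real pipeline would write the dead-letter entries to a queue or a separate index rather than a list.

```python
import json

REQUIRED_FIELDS = ("timestamp", "level", "service", "message")


def validate_line(line: str):
    """Return (entry, None) for a valid log line or (None, reason) otherwise."""
    try:
        entry = json.loads(line)
    except json.JSONDecodeError as exc:
        return None, f"invalid_json: {exc.msg}"
    if not isinstance(entry, dict):
        return None, "not_an_object"
    missing = [f for f in REQUIRED_FIELDS if f not in entry]
    if missing:
        return None, "missing_fields: " + ",".join(missing)
    return entry, None


def ingest(lines):
    """Split lines into accepted entries and a dead-letter list for later inspection."""
    accepted, dead_letter = [], []
    for line in lines:
        entry, reason = validate_line(line)
        if entry is None:
            dead_letter.append({"raw": line, "reason": reason})
        else:
            accepted.append(entry)
    return accepted, dead_letter


accepted, dead_letter = ingest([
    '{"timestamp": "2026-03-22T14:32:01.456Z", "level": "INFO", "service": "auth-service", "message": "ok"}',
    '{"level": "INFO"}',
    'not json at all',
])
```

Keeping the raw line and a reason in the dead-letter entry makes it possible to spot a service that renamed a field, which is the failure mode in the dashboard migration scenario.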
Real-world Failure Scenarios
Scenario 1: Log Data Loss During Incident
What happened: During a production incident, engineers discovered that logs from the primary application server were not being shipped to the central log aggregator. The log forwarder had crashed silently 2 hours prior.
Root cause: No health checks were configured for the log forwarder daemon. The process had exited but the orchestration system did not restart it because it was running as a sidecar rather than a managed service.
Impact: Engineers spent 45 minutes manually accessing individual server logs to piece together the sequence of events, delaying incident resolution.
Lesson learned: Monitor log forwarder processes and shipper queues. Implement heartbeat logging so missing heartbeats trigger an alert. Ship logs to multiple destinations for critical services.
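A heartbeat check along these lines can be sketched in a few lines of Python. It assumes something upstream records the time of the most recent heartbeat log per source; `silent_sources` and the sample data are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def silent_sources(last_seen: dict, now: datetime, max_silence: timedelta) -> list:
    """Return sources whose most recent heartbeat is older than `max_silence`."""
    return sorted(
        source
        for source, seen_at in last_seen.items()
        if now - seen_at > max_silence
    )


now = datetime(2026, 3, 22, 14, 30, tzinfo=timezone.utc)
last_seen = {
    "auth-service": now - timedelta(minutes=1),
    "checkout-service": now - timedelta(hours=2),  # forwarder crashed silently
}
stale = silent_sources(last_seen, now, max_silence=timedelta(minutes=5))
```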
Scenario 2: Structured Logging Breaking Search Dashboards
What happened: After migrating from plain-text to structured JSON logging, the Kibana dashboard used by the operations team stopped displaying log events. The team was flying blind for 3 hours until the issue was diagnosed.
Root cause: The Kibana index pattern was configured to look for a message field as the primary text field. Structured logs used field names like msg and event_text, so no events matched the default search.
Impact: All monitoring dashboards showed empty results. A customer-impacting database slowdown went undetected for longer than necessary.
Lesson learned: Validate dashboard queries against a test environment before migrating logging formats. Ensure field naming conventions match across the log pipeline and dashboards. Maintain backwards compatibility during format transitions.
Common Pitfalls / Anti-Patterns
Anti-Patterns to Avoid
1. Logging Everything at DEBUG in Production
DEBUG-level logging in high-throughput services generates gigabytes per hour. Use sampling for debug scenarios, or enable DEBUG selectively via feature flags for specific request IDs.
2. Plain Text Logging with String Concatenation
// Bad: Cannot search, parse, or aggregate
logger.info("User " + userId + " purchased " + item);
// Good: Structured, searchable, aggregatable
logger.info("User purchased item", { userId, itemId, itemName, price });
3. Missing Trace Context Propagation
Logs without correlation IDs are useless for tracing requests across services. Always propagate trace_id through HTTP headers, database connections, and message queues.
4. Logging Sensitive Data
Never log passwords, full tokens, credit card numbers, or PII. Implement redaction at the logger level, not the application level, to catch mistakes.
5. Synchronous Logging to Network Storage
Writing logs synchronously to a remote log server adds latency to every operation. Use async logging with local buffering and background shipping.
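In Python, the standard library already provides the pieces for this pattern: `QueueHandler` enqueues records in memory and `QueueListener` drains them on a background thread. The sketch below uses a hypothetical in-memory `ListHandler` as a stand-in for a slow network handler.

```python
import logging
import logging.handlers
import queue


class ListHandler(logging.Handler):
    """Stand-in for a slow network handler; it only records messages."""

    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        self.messages.append(record.getMessage())


log_queue = queue.Queue(maxsize=10000)  # bounded local buffer
slow_handler = ListHandler()
listener = logging.handlers.QueueListener(log_queue, slow_handler)
listener.start()  # background thread drains the queue

logger = logging.getLogger("async_demo")
logger.setLevel(logging.INFO)
logger.propagate = False
logger.addHandler(logging.handlers.QueueHandler(log_queue))

logger.info("order placed")      # returns after an in-memory enqueue
logger.info("payment captured")
listener.stop()                  # processes remaining records, then joins the thread
```

With a bounded queue, a full buffer makes `QueueHandler` fail the enqueue rather than block the caller, so it is worth counting those failures as dropped logs.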
6. No Log Retention Policy
Without retention policies, storage costs grow unbounded. Define hot/warm/cold/archive tiers and automate data lifecycle management.
7. Logs as the Only Observability Signal
Relying solely on logs for debugging is insufficient at scale. Combine logs with metrics and traces for complete observability.
Observability Checklist
Key Log Metrics
- Log ingestion rate (logs/second) by service and level
- Log volume by service, level, and environment
- Error rate in logs (ERROR level count over time)
- Log processing latency (time from log emit to searchable)
- Log agent errors and restarts
- Storage utilization per index
Logs You Should Have
- Request logs with trace_id, user_id, method, path, status, duration_ms
- Authentication events (login attempts, failures, token refreshes)
- Business events (orders, payments, registrations) with entity IDs
- Database query logs for slow queries (>100ms threshold)
- External API call logs with request/response timing
- Background job start/complete/fail logs with job IDs
- Health check and readiness probe logs
- Configuration change logs (who changed what when)
Alerts You Need
- No logs received from a service for >5 minutes (Alert on Silence pattern)
- Error rate spike above baseline (unexpected errors)
- Log volume anomaly (sudden drop or spike)
- Log processing latency >30 seconds
- Elasticsearch cluster health degraded (yellow/red)
- Log agent restart detected
Security Checklist
- No passwords, API keys, or secrets in log output
- Credit card numbers, CVV, SSN never logged
- Authorization tokens logged as type + last 4 chars only (e.g., “Bearer ***c123”)
- PII fields identified and redacted in redaction middleware
- Log access requires authentication and is audited
- Log aggregation pipeline uses TLS in transit
- Elasticsearch access restricted to authorized personnel
- Log retention complies with data retention policies
- Sensitive data cannot be searched in Kibana/ES by unauthorized users
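The token-masking item in the checklist above could be implemented roughly as follows. The function name `mask_token` and the rule of fully masking short tokens are assumptions made for this sketch.

```python
def mask_token(header_value: str, visible: int = 4) -> str:
    """Reduce 'Bearer <token>' to the scheme plus the last few characters."""
    scheme, _, token = header_value.partition(" ")
    # Short tokens are masked entirely: showing 4 of 6 characters leaks too much.
    if len(token) <= visible * 2:
        return f"{scheme} ***" if token else "***"
    return f"{scheme} ***{token[-visible:]}"
```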
Interview Questions
Ask for the approximate time, user ID, or order ID. With a timestamp window, query logs for that timeframe filtering on the service handling checkout. With a user ID, search for all log entries tagged with that user ID. With a correlation ID from their session, search for that ID across all services to reconstruct the full request path. In ELK/Kibana: timeframe AND service.name: checkout-service AND "checkout failed". If no direct match, search for errors in the checkout service within the time window, then trace back via correlation IDs to find the root cause service.
Immediate mitigation: delete indices that have aged past your retention policy, force a flush to free up translog space, and consider making historical indices read-only. For ongoing cost reduction: lower the replica count on older tiers in a hot-warm architecture, use ILM policies to move older indices to cheaper storage tiers (cold or frozen), and reduce shard count, since many small shards waste resources on per-shard overhead. Audit field mappings: set index: false on fields you never search and doc_values: false on fields you never sort or aggregate on. Finally, enforce log volume budgets per service to prevent any single service from overwhelming the cluster.
A trace ID is a unique identifier for an entire request transaction across all services — it stitches together every span. A span ID represents a single unit of work within that trace (one service call, one database query). Correlation IDs are typically an application-level business identifier (order ID, user session ID) that helps you filter logs across services without relying on trace IDs. In practice: the trace ID propagates via HTTP headers (x-b3-traceId in Zipkin, traceparent in W3C trace context) through every service call. Each service creates a span with the incoming trace ID and its own span ID, creating a parent-child tree of operations.
Production should run at INFO or WARN in most services — DEBUG is too noisy for production traffic volumes and can itself cause performance problems (disk I/O, log storage costs). However, when an incident is active, dynamically raising a specific service to DEBUG via a config change allows targeted debugging without impacting all services. The pattern: have a mechanism to change log level at runtime (via a config map reload, a logging API endpoint, or a Kubernetes annotation). Keep DEBUG in staging and development environments where you need it for development iteration.
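A runtime level switch can be very small. The sketch below uses Python's standard `logging` module; the function name `set_log_level` is an assumption, and how it gets called (an admin endpoint, a config reload hook) is left to your deployment.

```python
import logging


def set_log_level(logger_name: str, level_name: str) -> str:
    """Change a logger's level at runtime and return the previous level name."""
    logger = logging.getLogger(logger_name)
    previous = logging.getLevelName(logger.level)
    # setLevel accepts level names such as "DEBUG"; unknown names raise ValueError.
    logger.setLevel(level_name.upper())
    return previous
```

Returning the previous level makes it easy to restore the normal setting once the incident is over.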
The best approach is to never log PII in the first place: sanitize before logging by configuring your logger to mask fields like email addresses, credit cards, and phone numbers. In your logging framework (Pino for Node, zap for Go, structlog for Python), add a field filter that replaces sensitive values with [REDACTED]. Pino ships a built-in redact option; in the others this is typically a small custom processor or wrapper. Alternatively, use a log processor (Fluentd filter, Logstash mutate) to strip or hash sensitive fields before forwarding. Also configure your SIEM and log storage to mark PII fields as sensitive so analysts are warned. Audit your logs regularly with a tool like DataSommer to catch accidental PII leakage.
Implement local buffering on your log agents so logs are not lost if the aggregation pipeline fails. Fluentd/Fluent Bit agents should have a configurable buffer section (file or memory) with sufficient capacity to handle your expected outage window. Configure the agent to retry with exponential backoff when the destination is unavailable. For critical services, consider a dual-write strategy: ship logs to both your primary aggregator and a fallback like a local file or S3 bucket. Set up alerting on the log pipeline itself so you know immediately when ingestion stops. During the outage, query those fallback destinations to maintain visibility.
Adaptive sampling lets you keep debug logs when you need them most without overwhelming your logging infrastructure. Implement head-based sampling at the log agent level: configure a base sample rate (e.g., 1%) for DEBUG logs but increase it dynamically when errors spike or for specific users/transactions. A simple implementation uses a deterministic hash of the trace_id to ensure you always see all logs for the same trace. Alternatively, tail-based sampling collects full logs for requests with errors and samples everything else — this guarantees complete debugging context for failed requests while reducing volume for successful ones. Combine with feature flags so on-call engineers can override sampling rates in real-time via config maps or API calls.
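The deterministic-hash idea mentioned above can be sketched in a few lines of Python. The function name `keep_debug_log` is hypothetical, and SHA-256 is just one reasonable choice of stable hash; the point is that the decision depends only on the `trace_id`, so every service makes the same choice for the same trace.

```python
import hashlib


def keep_debug_log(trace_id: str, sample_rate: float) -> bool:
    """Decide whether to keep DEBUG logs for a trace, consistently across services."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < sample_rate
```

Raising `sample_rate` during an error spike keeps every trace that was already sampled and adds new ones, because each trace's bucket never changes.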
Writing to stdout is the recommended approach in Kubernetes and other containerized environments because the container runtime captures the stream and handles file rotation, and a node-level agent handles shipping, so your application stays focused on business logic. Logs written this way can be collected by any log agent and, once shipped, outlive the container that produced them. The downside is the delay before the agent picks up new lines and the need to run an agent as a sidecar or DaemonSet. Writing directly to an aggregator (Elasticsearch, Loki) eliminates the agent but couples your application to the aggregation infrastructure: if the aggregator is slow or unavailable, your application suffers. Direct writes work well for serverless functions where you cannot run log agents, but for traditional services, stdout with an agent like Fluent Bit or Vector gives you the best separation of concerns and flexibility to swap aggregation backends.
A memory leak typically manifests as gradual increases in memory usage without corresponding garbage collection events. Configure your application to emit JVM metrics via Micrometer or Dropwizard Metrics, then create alerts based on patterns in those metric logs. Key signals: memory usage increasing across consecutive GC cycles without full recovery, heap utilization trending upward over hours, or the number of allocated objects growing while the GC pause time increases. In your log aggregator, query for metric.name: jvm_memory_used AND metric.area: heap AND increase(metric.value[1h]) > threshold. Set a PagerDuty alert when heap usage exceeds 80% sustained for 15 minutes or when the GC reclaim rate falls below the allocation rate. Correlate with your application logs to identify which code paths are allocating the most objects.
OpenTelemetry provides automatic instrumentation that injects trace context (trace_id, span_id, flags) into log records at the SDK level, eliminating manual correlation ID propagation. Set up the OpenTelemetry SDK in each microservice with an OTLP exporter pointing to your collector. The collector fans out to both your trace backend (Jaeger, Zipkin) and log backend (Elasticsearch, Loki). When you query a trace ID, you get the full span timeline — clicking any span shows you all log records emitted during that span. For manual correlation when needed, use the OTel API: extract the active span context, inject it into your structured log record alongside your business fields. This gives you the best of both worlds — automatic context injection plus explicit business correlation in your log messages.
JSON structured logs win when you need machine parsing, field-level queries, and integration with modern log aggregators like Elasticsearch or Loki. They work well when your logging infrastructure can handle the overhead and you need to search across dimensions (service, user_id, region) without regex matching. Logfmt is a better fit when you want human readability in terminal output while still supporting structured parsing — it's more compact than JSON and reads almost like natural language when you cat a log file. Consider JSON when you have high logging volume, need aggregations, and will mostly query via Kibana/Grafana. Consider logfmt when developers read logs directly during debugging and your aggregation pipeline can handle field extraction.
For a startup with limited resources, managed services beat self-hosted every time. Elasticsearch is powerful but requires significant operational overhead — you need someone who knows cluster sizing, shard allocation, and ILM policies. Loki is cheaper and pairs naturally with Grafana for metrics plus logs. Cloud-specific options like AWS CloudWatch Logs or GCP Cloud Logging reduce operational burden further if you're already in that ecosystem. My recommendation: start with whatever your metrics platform uses (Grafana+Loki for open source, CloudWatch if AWS-native) and migrate only when you hit scaling limits. The worst choice is running Elasticsearch on VMs without dedicated infrastructure expertise.
I worked on a service where developers, trying to be thorough, logged every single function entry and exit with full parameter values. In production, this generated 50GB of logs per day for a single service, not because of traffic, but because debug logging was left enabled. When a real incident happened, the log volume was so high that querying for relevant entries timed out the aggregator. The actual error was buried under gigabytes of noise. The fix was simple: disable verbose debug logging in production by default, and enable it selectively via feature flag for specific request IDs when needed. The lesson: more logs are not always better. Volume management and signal-to-noise ratio matter more than comprehensiveness.
Metrics from instrumentation libraries (Prometheus counters, gauges, histograms) are purpose-built for time-series analysis — they're lightweight, aggregatable, and designed for alerting. Log-based metrics extract measurements from log entries after the fact (counting ERROR lines per minute, for example). The key difference is overhead and precision: library metrics are emitted once per event with negligible CPU cost, while parsing logs to extract metrics adds processing latency and resource usage. Use library metrics for things you always need to measure (request rates, latencies, error counts). Use log-based metrics for derived measurements that would be expensive to instrument directly, like counting specific business events that already appear in logs. In practice, you want both — library metrics for SLIs and alerting, log queries for investigative drilling down when you have a specific hypothesis to test.
A robust redaction function needs to handle nested structures recursively and preserve JSON validity after redaction. The key pattern: traverse the object tree, check each key against sensitive field patterns (password, token, ssn, credit_card, etc.), and replace matching values with a redaction marker. For nested objects, recurse. For arrays, iterate and recurse on each element. The critical edge case is when redaction turns an object into a flattened structure — you want to preserve the structure but mask values. Implementation: use a whitelist approach where you explicitly define which fields to keep readable, and redact everything else by default. This is safer than a blacklist which will inevitably miss a field. Test with: nested objects 5 levels deep, arrays of objects, mixed-type arrays, and null values to ensure your redaction preserves the JSON contract downstream.
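A minimal Python sketch of the allowlist approach described above follows. The names `redact` and `REDACTED` and the choice to keep `None` values as-is are assumptions. Containers are always traversed so that the shape of the event survives; only scalar values are masked.

```python
REDACTED = "[REDACTED]"


def redact(value, allowed, readable=False):
    """Mask every scalar whose key is not in `allowed`, preserving structure."""
    if isinstance(value, dict):
        # Each key decides for itself whether its value stays readable.
        return {k: redact(v, allowed, k in allowed) for k, v in value.items()}
    if isinstance(value, list):
        # List elements inherit the decision made for the key that holds the list.
        return [redact(item, allowed, readable) for item in value]
    if value is None or readable:
        return value
    return REDACTED
```

For example, `redact({"user_id": "usr_789", "password": "hunter2"}, {"user_id"})` keeps `user_id` and masks `password`, and a field nobody remembered to list is masked by default.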
First, measure current spend by breaking down log volume by service, environment, and log level. In many systems, DEBUG-level output makes up the bulk of production log volume, so the first win is disabling DEBUG. Second, audit indexed fields in Elasticsearch: each indexed field adds storage overhead, and many teams index fields they never query. Third, check retention policies: if you have 90-day retention but compliance only requires 30 days, tightening retention can cut stored volume by roughly two-thirds, assuming steady daily volume. Fourth, look at aggregation overhead: are you shipping logs from regions where they never get queried? Finally, consider the agent side: Fluentd/Fluent Bit agents running on every node consume memory and CPU that scales with log volume. A well-tuned logging infrastructure should consume 2-5% of total cloud budget, not 15-20%.
Introduce structured logging incrementally, never in a big bang. Start with your most critical service — add a structured logger alongside the existing text logger during a migration sprint, run both in parallel for a week to validate the new format, then deprecate the text logger. For each service: wrap the existing logger in an abstraction that can output both formats simultaneously during the transition. This way, dashboards and alerts built on text parsing continue working while new debugging workflows use structured queries. The biggest mistake is trying to migrate everything at once — you will miss edge cases and your team will resist the change. Migrate service by service, and maintain a log format registry so the team always knows which format each service uses. After 3-4 months of incremental migration, you can deprecate text logging entirely.
Head-based sampling makes the sampling decision at the point of log emission — before you know whether the request will succeed or fail. You sample X% of all debug logs uniformly. This is simple and predictable but risks losing debug context for the exact requests you need most: the failing ones. Tail-based sampling collects logs for all requests but only persists complete logs for requests with errors, while sampling or dropping logs for successful requests. Tail-based sampling gives you 100% of debugging information for failures and dramatically reduced volume for successes — but requires more infrastructure (the sampler must sit close to the aggregation layer and hold partial logs in memory). Use head-based sampling when volume is the primary concern and failures are evenly distributed. Use tail-based sampling when error investigation is the primary use case and you cannot afford to lose debug context for failed requests.
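To make the tail-based variant concrete, here is a small in-memory sketch in Python. The class name `TailSampler` and its methods are hypothetical, and a production sampler would also need timeouts and memory limits for traces that never finish.

```python
import random
from collections import defaultdict


class TailSampler:
    """Buffers log entries per trace and decides what to keep once the request ends."""

    def __init__(self, success_sample_rate=0.01, rng=random.random):
        self._pending = defaultdict(list)
        self._rate = success_sample_rate
        self._rng = rng  # injectable so that behaviour can be tested deterministically

    def record(self, trace_id, entry):
        self._pending[trace_id].append(entry)

    def finish(self, trace_id, failed):
        """Return the entries to persist: everything for failures, a sample otherwise."""
        entries = self._pending.pop(trace_id, [])
        if failed or self._rng() < self._rate:
            return entries
        return []
```

The buffering in `record` is the extra infrastructure cost the comparison refers to: the decision cannot be made until `finish` knows the outcome.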
First, audit the scope: run a log scanning tool like Log-inspector or a custom regex scanner across your log storage to identify which services, log levels, and field patterns contain PII. This tells you whether the leak is in the logger layer (redaction not applied), the transport layer (logs written to files before redaction), or the aggregator layer (structured fields parsed incorrectly). Second, trace the specific leak: if email addresses are appearing, check if email is being logged explicitly in business logic rather than just as a user_id reference. Third, fix at the logger level: the redaction library should be the outermost wrapper around any log call, not an afterthought in business logic. Finally, add automated PII scanning in your CI pipeline: fail builds if known PII patterns appear in test log output. This prevents regression. Consider tools like DataSommer or grep patterns for credit card numbers, SSNs, and email addresses.
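A custom regex scanner for the CI step could start from something like the sketch below. The patterns are deliberately simple assumptions: the SSN pattern matches the US format only, and the card pattern will also flag other long digit runs, so expect some false positives until you add a checksum test.

```python
import re

# Illustrative patterns only; tune them to the data your services handle.
PII_PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "credit_card": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
}


def scan_for_pii(lines):
    """Return (line_number, pattern_name) for every suspected PII match."""
    findings = []
    for number, line in enumerate(lines, start=1):
        for name, pattern in PII_PATTERNS.items():
            if pattern.search(line):
                findings.append((number, name))
    return findings
```

In a pipeline, a non-empty result from `scan_for_pii` over captured test log output would fail the build.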
Memory leaks under sustained load often show up first in logs before metrics trigger alerts. Configure your service to log periodic memory snapshots: every N minutes or every M requests, emit a log entry with JVM heap used/committed, thread count, and number of loaded classes. Over time, you can query these snapshots and use the log aggregator's built-in statistical functions to detect upward trends. For example, in Elasticsearch, use a date histogram aggregation on the memory snapshot timestamp, with an avg sub-aggregation on heap_used. If the average heap usage shows a statistically significant upward trend over 24-48 hours without corresponding GC events clearing it, you likely have a leak. Correlate the leak timeline with your application logs to identify which code paths were active — often the leak correlates with specific features or traffic patterns. This approach works without external profiling tools and gives you enough signal to prioritize a fix before an OOM kill.
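Once the snapshots are exported from the aggregator, the trend check itself is a least-squares slope. The Python sketch below assumes each snapshot has been reduced to a pair of hours since the first sample and heap used in MB; the function name and the units are illustrative.

```python
def heap_growth_mb_per_hour(snapshots):
    """Least-squares slope of heap usage over time.

    `snapshots` is a list of (hours_since_start, heap_used_mb) pairs.
    """
    n = len(snapshots)
    if n < 2:
        raise ValueError("need at least two snapshots")
    mean_t = sum(t for t, _ in snapshots) / n
    mean_h = sum(h for _, h in snapshots) / n
    covariance = sum((t - mean_t) * (h - mean_h) for t, h in snapshots)
    variance = sum((t - mean_t) ** 2 for t, _ in snapshots)
    if variance == 0:
        raise ValueError("snapshots must span more than one point in time")
    return covariance / variance
```

A slope that stays positive across a 24 to 48 hour window, including periods after full GC cycles, is the signal worth investigating; a single rising hour is not.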
Further Reading
- Elasticsearch Guide: Logstash - Comprehensive resource for log processing pipelines
- Fluentd Documentation - Official Fluentd and Fluent Bit docs for log aggregation
- OpenTelemetry Logging SDK - Vendor-neutral logging specification and SDK documentation
- Loki: Prometheus-inspired Logging - Cost-effective log aggregation designed to work with Grafana
- Python structlog Documentation - Structured logging for Python applications
- Pino JavaScript Logger - Fast, production-ready logger for Node.js with JSON output
- Distributed Tracing with OpenTelemetry - Tutorial correlating logs, traces, and metrics
- Log Observability and SLOs - Architecture for log-based SLOs and error budget monitoring
- JSON Log Format Specification - STIX 2.1, a JSON-based format for exchanging cyber threat intelligence, useful as a reference when structuring security events
Conclusion
Key Takeaways:
- Structured JSON logs enable efficient searching and aggregation
- Correlation IDs connect logs across service boundaries
- Log levels filter noise: DEBUG for development, ERROR/WARN/INFO for production
- Never log sensitive data; always implement redaction
- Monitor your monitors: log aggregation needs its own observability
- Retention policies prevent unbounded storage growth
Copy/Paste Checklist:
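# Verify structured logging format (counts lines that carry all four required fields, in any order)
jq -c 'select(has("timestamp") and has("level") and has("message") and has("service"))' /var/log/app.json | wc -l
# Find logs for a specific trace (tolerates optional whitespace after the colon)
grep -E '"trace_id": ?"abc123def456"' /var/log/app.json
# Count errors by service (-r prints raw service names without JSON quotes)
jq -r 'select(.level == "ERROR") | .service' /var/log/app.json | sort | uniq -c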
# Alert on log silence (Prometheus)
- alert: LogIngestionSilence
  expr: rate(fluentd_input_status_records_total[5m]) == 0
  for: 5m
  labels:
    severity: critical
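# Redaction function (TypeScript)
// Entries are lowercase because keys are lowercased before matching,
// so "creditCard" and "CreditCard" are both caught.
const sensitiveFields = ['password', 'token', 'secret', 'creditcard', 'ssn'];
// Shallow redaction: only top-level keys are checked; nested objects pass through unchanged.
function redact(obj: Record<string, unknown>): Record<string, unknown> {
  return Object.fromEntries(
    Object.entries(obj).map(([k, v]) =>
      sensitiveFields.some(f => k.toLowerCase().includes(f)) ? [k, '[REDACTED]'] : [k, v]
    )
  );
}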
Good logging practices pay off when you need them most: debugging production issues at 2am. Structured logs with correlation IDs let you trace requests across service boundaries. Appropriate log levels keep noise manageable. Retention policies balance cost with compliance requirements.
Start with JSON structured logging in your applications. Add correlation ID propagation early. Build log aggregation before you need it, not during an incident.
For deeper observability, combine logging with the Metrics, Monitoring & Alerting and Distributed Tracing practices covered in our other guides. These three pillars work together: logs show you what happened, metrics show you patterns, and traces show you why it happened.