Logging Best Practices: Structured Logs, Levels, Aggregation

Master production logging with structured formats, proper log levels, correlation IDs, and scalable log aggregation. Includes patterns for containerized applications.

Published: March 22, 2026 | Reading time: 41 min | Author: GeekWorkBench

Logs are your primary tool for understanding what happens in production when users report issues. If you have ever spent hours searching through plain text log files trying to find a single error, you know how painful unstructured logging can be.

This guide covers structured formats, log levels, correlation IDs for tracing requests, and aggregation strategies that scale.

Core Concepts

Structured logs use a defined format, typically JSON, where each field has a specific meaning.

JSON Log Format

{
  "timestamp": "2026-03-22T14:32:01.456Z",
  "level": "INFO",
  "message": "User login successful",
  "service": "auth-service",
  "version": "2.1.0",
  "trace_id": "abc123def456",
  "user_id": "usr_789",
  "ip_address": "192.168.1.42",
  "duration_ms": 45,
  "environment": "production"
}

This format lets you search by any field, aggregate metrics across dimensions, and correlate logs with traces and metrics.
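
As a rough illustration of field-level search and aggregation, the sketch below parses newline-delimited JSON logs using only the Python standard library. The sample lines and the helper names (parse_lines, filter_by, count_by) are made up for this example; a real aggregator does the same work with an index instead of a loop.

```python
import json
from collections import Counter

# Illustrative sample data, including one malformed line
SAMPLE_LOGS = """\
{"level": "INFO", "service": "auth-service", "trace_id": "abc123def456", "duration_ms": 45}
{"level": "ERROR", "service": "auth-service", "trace_id": "abc123def456", "duration_ms": 310}
{"level": "INFO", "service": "checkout-service", "trace_id": "zzz999", "duration_ms": 120}
not json at all
"""


def parse_lines(text):
    """Yield one dict per valid JSON object line, skipping anything else."""
    for line in text.splitlines():
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue
        if isinstance(record, dict):
            yield record


def filter_by(records, **fields):
    """Keep records whose fields equal every given key=value pair."""
    return [r for r in records if all(r.get(k) == v for k, v in fields.items())]


def count_by(records, field):
    """Count records per distinct value of one field."""
    return Counter(r.get(field) for r in records)


records = list(parse_lines(SAMPLE_LOGS))
auth_errors = filter_by(records, service="auth-service", level="ERROR")
per_service = count_by(records, "service")
```

None of this is possible with free-form text without writing a parser for each message shape first.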

Implementing Structured Logging

Most languages have structured logging libraries:

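# Python with structlog
import structlog

structlog.configure(
    processors=[
        structlog.processors.add_log_level,  # adds the "level" field to every entry
        structlog.processors.TimeStamper(fmt="iso"),
        structlog.processors.JSONRenderer(),
    ]
)

log = structlog.get_logger()

# structlog stores the first argument under the "event" key
log.info("user_login",
    user_id="usr_789",
    ip_address="192.168.1.42",
    duration_ms=45
)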

// TypeScript with pino
import pino from "pino";

const log = pino({
  level: "info",
  base: {
    service: "auth-service",
    version: process.env.APP_VERSION,
  },
});

log.info(
  {
    userId: "usr_789",
    ipAddress: "192.168.1.42",
    durationMs: 45,
  },
  "User login successful",
);

// Go with zerolog
package main

import (
    "github.com/rs/zerolog"
    "github.com/rs/zerolog/log"
)

func main() {
    // Emits Unix-epoch timestamps; use time.RFC3339Nano for ISO 8601 output
    zerolog.TimeFieldFormat = zerolog.TimeFormatUnix

    log.Info().
        Str("user_id", "usr_789").
        Str("ip_address", "192.168.1.42").
        Int("duration_ms", 45).
        Msg("User login successful")
}

Log Levels

Log levels help filter noise. Not every log entry needs to be visible during normal operations.

Standard Log Levels

Level	Purpose	When to Use
DEBUG	Detailed diagnostic information	During development and troubleshooting
INFO	Confirmation that things work as expected	Significant business events
WARN	Unexpected but handled situations	Recoverable errors, degraded states
ERROR	Errors that need attention	Failures that affect requests
FATAL	System cannot continue	Critical failures requiring immediate action

Choosing the Right Level

This feels intuitive but gets harder at scale. A few guidelines:

INFO for business events like orders placed, users registered. You want these for analytics and auditing.
WARN for situations that require attention but the system continues: retries succeeded, cache misses, degraded mode.
ERROR for failures that affect the current request: database timeout, external API failure, validation error.
DEBUG for information that helps during development but would overwhelm production: loop iterations, intermediate values.

Don’t log DEBUG in production unless you can enable it selectively for specific requests. A debug log in a hot path can generate gigabytes per hour.
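
One way to enable DEBUG selectively is a per-request flag checked by a logging filter. The sketch below uses Python's standard logging and contextvars modules; the debug_enabled flag and the idea that a middleware sets it from a header or feature flag are assumptions for illustration, not part of any particular framework.

```python
import contextvars
import io
import logging

# Hypothetical per-request flag; a middleware would set it from a header or feature flag
debug_enabled = contextvars.ContextVar("debug_enabled", default=False)


class SelectiveDebugFilter(logging.Filter):
    """Pass INFO and above always; pass DEBUG only for flagged requests."""

    def filter(self, record):
        if record.levelno > logging.DEBUG:
            return True
        return debug_enabled.get()


stream = io.StringIO()  # stands in for stdout so the result can be inspected
handler = logging.StreamHandler(stream)
handler.addFilter(SelectiveDebugFilter())

logger = logging.getLogger("selective_debug_demo")
logger.setLevel(logging.DEBUG)  # the filter, not the level, decides which DEBUG entries pass
logger.addHandler(handler)
logger.propagate = False

logger.debug("dropped: request not flagged")
token = debug_enabled.set(True)  # e.g. the request carried a debug header
logger.debug("kept: request flagged for debugging")
debug_enabled.reset(token)
logger.info("kept: INFO always passes")

output_lines = stream.getvalue().splitlines()
```

Because the flag lives in a context variable, it follows one request through async code without affecting others.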

Correlation IDs

When a request flows through multiple services, correlation IDs let you follow it across all logs.

Propagating Trace Context

// Middleware to extract or generate correlation ID
function correlationMiddleware(req, res, next) {
  const traceId = req.headers["x-trace-id"] || generateUUID();
  req.correlationId = traceId;
  res.setHeader("x-trace-id", traceId);

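  // Attach a request-scoped child logger. Reassigning the shared logger
  // would leak one request's trace ID into concurrent requests.
  req.log = log.child({ traceId });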

  next();
}

Propagate the correlation ID to all downstream calls:

// Outgoing HTTP request
fetch("https://api.example.com/users", {
  headers: {
    "X-Trace-ID": req.correlationId,
  },
});

// Database queries
db.query("SELECT * FROM users WHERE id = $1", [userId], {
  traceId: req.correlationId,
});

// Message queue messages
queue.send({
  payload: orderData,
  headers: {
    "X-Trace-ID": req.correlationId,
  },
});
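
The same extract-or-generate pattern can be sketched in Python with a context variable and a logging filter, so every log line written while a request is being handled carries the trace ID without passing it around by hand. The handle_request function and the header name are illustrative assumptions, not a framework API.

```python
import contextvars
import io
import logging
import uuid

trace_id_var = contextvars.ContextVar("trace_id", default="-")


class TraceIdFilter(logging.Filter):
    """Copy the current trace ID onto every record so formatters can use it."""

    def filter(self, record):
        record.trace_id = trace_id_var.get()
        return True


def handle_request(headers, logger):
    """Reuse the caller's trace ID if present, otherwise start a new trace."""
    trace_id = headers.get("x-trace-id") or uuid.uuid4().hex
    token = trace_id_var.set(trace_id)
    try:
        logger.info("request received")
        # Headers to attach to any downstream call made for this request
        return {"x-trace-id": trace_id}
    finally:
        trace_id_var.reset(token)


stream = io.StringIO()  # stands in for stdout
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter("%(trace_id)s %(message)s"))
handler.addFilter(TraceIdFilter())

logger = logging.getLogger("trace_demo")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False

outgoing = handle_request({"x-trace-id": "abc123def456"}, logger)
generated = handle_request({}, logger)
```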

Using Correlation IDs for Search

With structured logs and correlation IDs, debugging a user issue looks like this:

# Find all logs for a specific request
grep -E '"trace_id": ?"abc123def456"' /var/log/app.log

# Or in your log aggregator
query: trace_id = "abc123def456"

Search for the trace ID and you get the incoming request, database queries, cache hits, outgoing API calls, and the error that occurred.

What to Include in Logs

Context matters. The more relevant context you include, the easier debugging becomes.

Essential Fields

Every log entry needs at minimum:

timestamp: ISO 8601 format in UTC
level: Log severity level
service: Which service generated this log
message: Human-readable description
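
A minimal sketch of a formatter that emits exactly these four fields, using Python's standard logging module. The JsonFormatter class and the service name are illustrative; note that Python names the level WARNING rather than WARN.

```python
import io
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object with the four essential fields."""

    def __init__(self, service):
        super().__init__()
        self.service = service

    def format(self, record):
        # ISO 8601 in UTC with millisecond precision and a trailing Z
        timestamp = (
            datetime.fromtimestamp(record.created, tz=timezone.utc)
            .isoformat(timespec="milliseconds")
            .replace("+00:00", "Z")
        )
        return json.dumps(
            {
                "timestamp": timestamp,
                "level": record.levelname,
                "service": self.service,
                "message": record.getMessage(),
            }
        )


stream = io.StringIO()  # stands in for stdout
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter(service="auth-service"))

logger = logging.getLogger("essential_fields_demo")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False

logger.warning("Cache unavailable, falling back to database")
entry = json.loads(stream.getvalue())
```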

Request Context

For web services, include:

request_id or trace_id
user_id (if authenticated)
HTTP method, path, status code
Client IP address
User agent

{
  "timestamp": "2026-03-22T14:32:01.456Z",
  "level": "INFO",
  "service": "api-gateway",
  "message": "Request completed",
  "request_id": "req_abc123",
  "method": "GET",
  "path": "/api/users/usr_789",
  "status": 200,
  "duration_ms": 120,
  "ip": "192.168.1.42",
  "user_agent": "Mozilla/5.0..."
}

Business Events

For significant business events:

Event type (login, purchase, registration)
Entity IDs involved
Outcome (success, failure)
Duration if applicable
Any relevant metadata

{
  "timestamp": "2026-03-22T14:32:01.456Z",
  "level": "INFO",
  "service": "checkout-service",
  "message": "Order placed",
  "event": "order_placed",
  "order_id": "ord_xyz789",
  "customer_id": "cust_123",
  "total_amount": 99.99,
  "currency": "USD",
  "item_count": 3
}

What NOT to Log

Logging sensitive data creates security and compliance problems. Never log:

Passwords or password hashes
Credit card numbers or CVV codes
Social security numbers or national IDs
API keys or secrets
Full authorization tokens (log the type and last 4 chars only)
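
For the token case, a small helper can reduce an Authorization header to its scheme and last few characters before it reaches the logger. This is a sketch; describe_token is a hypothetical name, and the rule that short credentials are fully redacted is an assumption chosen to avoid revealing most of a short secret.

```python
def describe_token(authorization_header, visible=4):
    """Return a log-safe description: the scheme plus the last few characters."""
    scheme, _, credential = authorization_header.partition(" ")
    if not credential:
        # No scheme prefix present; treat the whole value as the credential
        scheme, credential = "unknown", authorization_header
    if len(credential) <= visible * 2:
        # Too short to reveal any part safely
        return f"{scheme} [REDACTED]"
    return f"{scheme} ...{credential[-visible:]}"


safe = describe_token("Bearer eyJhbGciOiJIUzI1NiJ9.payload.signature9f3c")
short = describe_token("abc")
```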

Redact Sensitive Data

function redactSensitiveFields(
  obj: Record<string, unknown>,
): Record<string, unknown> {
  // Entries must be lowercase because keys are lowercased before matching
  const sensitiveFields = ["password", "token", "secret", "creditcard", "ssn"];
  const redacted: Record<string, unknown> = { ...obj };

  for (const key of Object.keys(redacted)) {
    const value = redacted[key];
    if (sensitiveFields.some((f) => key.toLowerCase().includes(f))) {
      redacted[key] = "[REDACTED]";
    } else if (
      value !== null &&
      typeof value === "object" &&
      !Array.isArray(value)
    ) {
      // Recurse into nested plain objects; arrays and null pass through unchanged
      redacted[key] = redactSensitiveFields(value as Record<string, unknown>);
    }
  }

  return redacted;
}

// pino expects the merge object first, then the message
log.info(
  redactSensitiveFields({ userId: "usr_123", password: "secret123" }),
  "User authenticated",
);
// Logged fields include userId: "usr_123" and password: "[REDACTED]"

Log Aggregation Architecture

At scale, logs need to be collected, aggregated, and stored efficiently.

Common Architecture

graph LR
    A[Application] -->|stdout/JSON| B[Container Runtime]
    B --> C[Log Agent]
    C --> D[Log Aggregator]
    D --> E[Storage]
    D --> F[Search Interface]
    G[Analytics/BI] --> E

Container Logging

In containerized environments, applications write to stdout and stderr. The container runtime handles collection:

# Write logs to stdout, not files
# Bad: RUN echo "$(date) Log entry" >> /var/log/app.log
# Good: console.log(JSON.stringify({ timestamp, message }))

For applications that must write to files, use a sidecar log agent or mount a shared log directory:

# Pod with log volume
apiVersion: v1
kind: Pod
metadata:
  name: myapp
spec:
  containers:
    - name: app
      image: myapp:latest
      volumeMounts:
        - name: logs
          mountPath: /var/log/myapp
    - name: log-agent
      image: log-agent:latest
      volumeMounts:
        - name: logs
          mountPath: /var/log/myapp
        - name: agent-config
          mountPath: /etc/log-agent
  volumes:
    - name: logs
      emptyDir: {}
    - name: agent-config
      configMap:
        name: log-agent-config

Shipping Logs to Aggregators

Fluentd/Fluent Bit Configuration

# fluent-bit.conf
[SERVICE]
    Flush         5
    Daemon        Off
    Log_Level     info
    Parsers_File  parsers.conf

[INPUT]
    Name              tail
    Path              /var/log/containers/*.log
    Parser            docker
    Tag               container.*
    Refresh_Interval  5

[FILTER]
    Name                kubernetes
    Match               container.*
    Kube_URL            https://kubernetes.default.svc:443
    Kube_CA_File        /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    Kube_Token_File     /var/run/secrets/kubernetes.io/serviceaccount/token

[OUTPUT]
    Name        es
    Match       container.*
    Host        elasticsearch.logging.svc
    Port        9200
    Logstash_Format    On
    Logstash_Prefix    kubernetes
    Retry_Limit        False

Vector Configuration

Vector is a newer alternative that is often reported to use fewer resources than Fluentd. Benchmark it against your own workload before switching:

# vector.toml
[sources.docker]
type = "docker_logs"

[transforms.parse_json]
type = "remap"
inputs = ["docker"]
source = '.message = parse_json!(.message)'

[sinks.elasticsearch]
type = "elasticsearch"
inputs = ["parse_json"]
endpoint = "http://elasticsearch.logging.svc:9200"
index = "kubernetes-%Y.%m.%d"

Log Storage and Retention

Storage costs grow with log volume. Design retention policies carefully.

Retention Tiers

Tier	Duration	Use Case
Hot	0-7 days	Real-time troubleshooting
Warm	7-30 days	Investigating recent issues
Cold	30-90 days	Compliance, audit
Archive	1+ years	Legal requirements

Elasticsearch Index Lifecycle Management

{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": {
            "max_age": "7d",
            "max_size": "50gb"
          },
          "set_priority": { "priority": 100 }
        }
      },
      "warm": {
        "min_age": "7d",
        "actions": {
          "shrink": { "number_of_shards": 1 },
          "forcemerge": { "max_num_segments": 1 },
          "set_priority": { "priority": 50 }
        }
      },
      "cold": {
        "min_age": "30d",
        "actions": {
          "freeze": {},
          "set_priority": { "priority": 0 }
        }
      },
      "delete": {
        "min_age": "365d",
        "actions": {
          "delete": {}
        }
      }
    }
  }
}

Performance Considerations

Logging can become a bottleneck if you don’t design it carefully.

Asynchronous Logging

Write logs asynchronously so they don’t block your application:

import logging
import queue
from threading import Thread

class AsyncLogHandler(logging.Handler):
    def __init__(self, batch_size=100, flush_interval=1.0):
        super().__init__()
        self.queue = queue.Queue(maxsize=10000)
        self.batch_size = batch_size
        self.flush_interval = flush_interval
        self.worker = Thread(target=self._process_logs, daemon=True)
        self.worker.start()

    def emit(self, record):
        try:
            self.queue.put_nowait(self.format(record))
        except queue.Full:
            pass  # Drop log if queue is full

    def _process_logs(self):
        batch = []
        while True:
            try:
                item = self.queue.get(timeout=self.flush_interval)
                batch.append(item)
                while len(batch) < self.batch_size:
                    item = self.queue.get_nowait()
                    batch.append(item)
            except queue.Empty:
                pass

            if batch:
                self._send_batch(batch)
                batch = []

    def _send_batch(self, batch):
        # Send to log aggregator
        pass
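
If a hand-rolled handler is more than you need, Python's standard library already provides the same decoupling through logging.handlers.QueueHandler and QueueListener. The sketch below writes to an in-memory stream as a stand-in for a slow destination; the logger name and queue size are arbitrary choices for the example.

```python
import io
import logging
import logging.handlers
import queue

log_queue = queue.Queue(maxsize=10000)

stream = io.StringIO()  # stand-in for a slow destination such as a network sink
slow_handler = logging.StreamHandler(stream)

# The listener drains the queue on its own thread and feeds the real handler
listener = logging.handlers.QueueListener(log_queue, slow_handler)
listener.start()

logger = logging.getLogger("async_demo")
logger.setLevel(logging.INFO)
# The application thread only enqueues records; it never waits on I/O
logger.addHandler(logging.handlers.QueueHandler(log_queue))
logger.propagate = False

for i in range(3):
    logger.info("event %d", i)

listener.stop()  # processes everything still queued, then joins the worker thread
delivered = stream.getvalue().splitlines()
```

Call listener.stop() during shutdown, otherwise entries still sitting in the queue can be lost when the process exits.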

Sampling High-Volume Logs

For debug-level logs in high-traffic paths, sample to reduce volume:

// Keep roughly 10% of debug entries on this hot path
const SAMPLE_RATE = 0.1;

if (Math.random() < SAMPLE_RATE) {
  log.debug(
    {
      itemId: item.id,
      sampleRate: SAMPLE_RATE, // lets readers scale counts back up
    },
    "Item processing details",
  );
}
// Emits the entry for ~10% of calls; the rest are skipped entirely
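
Random sampling keeps an arbitrary subset of entries, so a single request may end up with only some of its debug lines. A deterministic alternative is to decide from a hash of the trace ID, which keeps or drops all entries for a request together. The sketch below is one way to do that in Python; should_sample is a hypothetical helper name.

```python
import hashlib


def should_sample(trace_id, rate):
    """Deterministically keep roughly `rate` of traces, keyed on the trace ID."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash onto the interval [0, 1)
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < rate


trace_ids = [f"trace-{i}" for i in range(10_000)]
kept = [t for t in trace_ids if should_sample(t, 0.1)]
```

Every service that applies the same function and rate makes the same decision for a given trace ID, so sampled requests stay complete across service boundaries.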

Monitoring Log Health

Logs themselves need monitoring. If logging stops, you lose visibility into your systems.

Metrics to Track

Log ingestion rate (logs/second)
Log volume by service and level
Error rate in logs
Log processing latency
Log agent errors and restarts

Alert on Silence

# Prometheus alert for missing logs
- alert: LogIngestionSilence
  expr: |
    sum(rate(fluentd_input_status_records_total[5m])) == 0
    or absent(fluentd_input_status_records_total)
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "No logs being ingested from Fluentd"
    description: "Fluentd has not sent logs to Elasticsearch in 5 minutes"

When to Use Structured Logging

Use structured logging when:

Debugging requires cross-referencing multiple log entries
Requests span multiple services
You need selective debugging in high-volume APIs
Audit trails are required for compliance
You need to correlate logs with traces or metrics

Structured logging may not be worth the effort for:

Simple scripts or one-off utilities where stdout debugging suffices
Very low-traffic applications where unstructured grep suffices
Legacy systems where migration cost outweighs benefits
Development environments where DEBUG-level verbosity is acceptable

Trade-off Analysis

Aspect	Structured Logging	Plain Text Logging
Searchability	Field-level queries via log aggregators	grep/string matching only
Storage Cost	Higher (JSON overhead per line)	Lower (minimal formatting)
Parse Complexity	Zero (machine-readable by default)	Brittle (format changes break parsers)
Human Readability	Moderate (requires jq or aggregator UI)	High (direct reading in terminal)
Tooling Required	Log aggregator (ELK, Loki, Splunk)	None or basic text tools
Correlation	Automatic via shared fields	Manual trace ID injection
Performance Impact	Slight overhead for JSON serialization	Minimal

SLI/SLO/Error Budget Templates for Logging

Log-Based SLI Template

# logging-sli-config.yaml
service: logging-observability
environment: production

slis:
  - name: log_ingestion_success_rate
    description: "Percentage of emitted logs successfully ingested"
    query: |
      sum(rate(fluentd_output_status_num_logs_total{status="output"}[5m]))
      /
      sum(rate(fluentd_input_status_records_total[5m]))

  - name: log_processing_latency_p95
    description: "Time from log emit to searchable in aggregator"
    query: |
      histogram_quantile(0.95,
        sum(rate(fluentd_output_status_flush_interval_bucket[5m])) by (le)
      )

  - name: log_error_rate
    description: "ERROR level log rate as percentage of total"
    query: |
      sum(rate(log_entries_total{level="error"}[5m]))
      /
      sum(rate(log_entries_total[5m])) * 100

Log SLO Template

# logging-slo-config.yaml
objectives:
  - display_name: "Log Ingestion Availability"
    sli: log_ingestion_success_rate
    target: 99.5
    window: 30d
    description: "99.5% of emitted logs should be ingested"

  - display_name: "Log Processing Latency"
    sli: log_processing_latency_p95
    target: 99.0
    threshold_ms: 30000
    window: 30d
    description: "p95 time-to-searchable should stay under 30 seconds for 99% of the window"

  - display_name: "Log Error Rate"
    sli: log_error_rate
    target: 99.9
    threshold_percent: 1.0
    window: 30d
    description: "Error rate should stay below 1%"

Error Budget Calculator

# error-budget-calculator.py
def calculate_error_budget(slo_target, window_days=30):
    """
    Calculate error budget in minutes for a given SLO target.
    Example: 99.5% SLO over 30 days = 216 minutes of allowed errors
    """
    window_seconds = window_days * 24 * 60 * 60
    allowed_errors = window_seconds * (1 - slo_target)
    return allowed_errors / 60  # Convert to minutes

# Standard SLO error budgets (30-day window)
slo_budgets = {
    "99.0%": calculate_error_budget(0.990),  # 432 minutes = 7.2 hours
    "99.5%": calculate_error_budget(0.995),  # 216 minutes = 3.6 hours
    "99.9%": calculate_error_budget(0.999),  # 43.2 minutes
    "99.95%": calculate_error_budget(0.9995), # 21.6 minutes
    "99.99%": calculate_error_budget(0.9999), # 4.32 minutes
}

for slo, budget in slo_budgets.items():
    print(f"SLO {slo}: {budget:.2f} minutes error budget")

Multi-Window Burn-Rate Alerting for Log Quality

Burn-rate alerts detect when error budgets are being consumed faster than expected. This approach catches both sudden spikes and slow leaks.
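
The arithmetic behind the thresholds is worth seeing once. A burn rate of 1 spends the budget exactly over the SLO window, so with a 30-day window a rate of 14.4 held for one hour spends 2% of the budget, and a rate of 6 held for six hours spends 5%. The helper names below are illustrative.

```python
def burn_rate(error_ratio, slo_target):
    """How many times faster than sustainable the budget is being spent."""
    return error_ratio / (1 - slo_target)


def budget_consumed(rate, window_hours, slo_window_days=30):
    """Fraction of the whole SLO-window budget spent during `window_hours`."""
    return rate * window_hours / (slo_window_days * 24)


def hours_to_exhaustion(rate, slo_window_days=30):
    """Hours until a full budget is gone if the burn rate holds."""
    return slo_window_days * 24 / rate


# A 1.44% error ratio against a 99.9% SLO is a burn rate of about 14.4
fast = burn_rate(error_ratio=0.0144, slo_target=0.999)

fast_budget = budget_consumed(14.4, window_hours=1)   # fraction spent in 1 hour
medium_budget = budget_consumed(6, window_hours=6)    # fraction spent in 6 hours
fast_hours_left = hours_to_exhaustion(14.4)           # hours a full budget lasts
```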

1-Hour Window Burn-Rate Alert (Fast Burn)

# Burn-rate alerts for logging
groups:
  - name: logging-burn-rate
    rules:
      # Fast burn: 1-hour window, 14.4x burn rate (burns 2% of the 30-day budget in 1 hour)
      - alert: LogErrorBudgetFastBurn
        expr: |
          (
            sum(rate(log_entries_total{level="error"}[1h]))
            /
            sum(rate(log_entries_total[1h]))
          )
          > (1 - 0.999) * 14.4
        for: 5m
        labels:
          severity: critical
          category: logging
          window: 1h
        annotations:
          summary: "Log error budget burning fast (1h window)"
          description: "Error ratio over the last hour is {{ $value | humanizePercentage }}, more than 14.4x the sustainable rate for a 99.9% SLO. At this pace a full 30-day budget lasts about 50 hours."

6-Hour Window Burn-Rate Alert (Medium Burn)

# Medium burn: 6-hour window, 6x burn rate (burns 5% of the 30-day budget in 6 hours)
- alert: LogErrorBudgetMediumBurn
  expr: |
    (
      sum(rate(log_entries_total{level="error"}[6h]))
      /
      sum(rate(log_entries_total[6h]))
    )
    > (1 - 0.999) * 6
  for: 30m
  labels:
    severity: warning
    category: logging
    window: 6h
  annotations:
    summary: "Log error budget burning (6h window)"
    description: "Error ratio over the last 6 hours is {{ $value | humanizePercentage }}, more than 6x the sustainable rate for a 99.9% SLO. Check for sustained error patterns."

Multi-Window Burn-Rate Alert Set

# Complete burn-rate alert set (multi-window)
- alert: LogErrorBudgetBurnAllWindows
  expr: |
    (
      sum(rate(log_entries_total{level="error"}[1h]))
      /
      sum(rate(log_entries_total[1h]))
    )
    > (1 - 0.999) * 14.4
    or
    (
      sum(rate(log_entries_total{level="error"}[6h]))
      /
      sum(rate(log_entries_total[6h]))
    )
    > (1 - 0.999) * 6
  for: 5m
  labels:
    severity: critical
    category: logging
  annotations:
    summary: "Log error budget burning across multiple time windows"
    description: |
      Multi-window burn-rate alert triggered.
      Error ratio in the firing window: {{ $value | humanizePercentage }}
      (thresholds: 1.44% over 1h, 0.6% over 6h).
      Review error patterns and allocate incident resources.

SLO Error Budget Dashboard Panels

{
  "dashboard": {
    "title": "Logging SLO Error Budget",
    "panels": [
      {
        "title": "Error Budget Remaining (30d)",
        "type": "gauge",
        "targets": [
          {
            "expr": "(1 - (sum(rate(log_entries_total{level=\"error\"}[30d])) / sum(rate(log_entries_total[30d]))) / (1 - 0.999)) * 100",
            "legendFormat": "Budget Remaining %"
          }
          }
        ],
        "fieldConfig": {
          "defaults": {
            "min": 0,
            "max": 100,
            "thresholds": {
              "steps": [
                { "value": 0, "color": "red" },
                { "value": 50, "color": "yellow" },
                { "value": 90, "color": "green" }
              ]
            }
          }
        }
      },
      {
        "title": "Burn Rate (1h)",
        "type": "graph",
        "targets": [
          {
            "expr": "(sum(rate(log_entries_total{level=\"error\"}[1h])) / sum(rate(log_entries_total[1h]))) / (1 - 0.999)",
            "legendFormat": "Burn Rate"
          }
        ]
      },
      {
        "title": "Projected Budget Exhaustion",
        "type": "stat",
        "targets": [
          {
            "expr": "720 / ((sum(rate(log_entries_total{level=\"error\"}[1h])) / sum(rate(log_entries_total[1h]))) / (1 - 0.999))",
            "legendFormat": "Hours until budget exhausted"
          }
        ]
      }
    ]
  }
}

OpenTelemetry Integration for Logging

OpenTelemetry hooks into your application at the SDK level and sends logs, traces, and metrics through the same pipeline. You avoid per-vendor instrumentation and do not have to rewrite your log format when you switch backends.

Auto-Instrumentation Setup

// OpenTelemetry Node.js SDK setup: export traces and logs to a collector over OTLP/HTTP
import { NodeSDK } from "@opentelemetry/sdk-node";
import { OTLPTraceExporter } from "@opentelemetry/exporter-trace-otlp-http";
import { OTLPLogExporter } from "@opentelemetry/exporter-logs-otlp-http";
import { BatchLogRecordProcessor } from "@opentelemetry/sdk-logs";

const sdk = new NodeSDK({
  traceExporter: new OTLPTraceExporter({
    url: "http://otel-collector:4318/v1/traces",
  }),
  logRecordProcessor: new BatchLogRecordProcessor(
    new OTLPLogExporter({
      url: "http://otel-collector:4318/v1/logs",
    }),
  ),
});

sdk.start();

Correlating Logs with Traces and Metrics

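// Inject trace context into log records
import { trace, context } from "@opentelemetry/api";
import { logs } from "@opentelemetry/api-logs";

// Logger name is illustrative; start the SDK before emitting so records are exported
const logger = logs.getLogger("app");

function emitLog(
  level: string,
  message: string,
  attributes: Record<string, string | number | boolean> = {},
) {
  const spanContext = trace.getSpan(context.active())?.spanContext();

  logger.emit({
    severityText: level,
    body: message,
    attributes: {
      ...attributes,
      // Added explicitly so the IDs are also searchable as plain attributes
      ...(spanContext && {
        trace_id: spanContext.traceId,
        span_id: spanContext.spanId,
      }),
    },
  });
}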

Benefits of OpenTelemetry for Logging

Benefit	Description
Vendor neutrality	Switch backends without re-instrumenting
Unified data model	Logs, traces, and metrics share the same correlation IDs
Automatic context propagation	Trace context automatically injected into logs
Sampling coordination	Sample logs and traces together for consistent debugging

Observability Hooks for Logging

This section defines what to log, measure, trace, and alert for logging systems themselves.

Log (What to Emit)

Event	Fields	Level
Log ingestion started	service, host, agent_version	INFO
Log ingestion stopped	service, host, reason	WARN
Buffer approaching full	host, buffer_used_percent, buffer_limit	WARN
Malformed log detected	host, parse_error_type, sample	WARN
Retry attempt	host, destination, attempt, max_attempts	DEBUG
Batch sent successfully	host, destination, batch_size, bytes_sent	DEBUG
Authentication failure	host, client_ip, reason	WARN

Measure (Metrics to Collect)

Metric	Type	Description
`log_emitted_total`	Counter	Total logs emitted by service
`log_ingested_total`	Counter	Total logs ingested to aggregator
`log_dropped_total`	Counter	Logs dropped due to errors/full buffers
`log_processing_latency_seconds`	Histogram	Time from emit to searchable
`log_buffer_utilization_percent`	Gauge	Buffer fill percentage
`log_parsing_errors_total`	Counter	Malformed log entries
`log_bytes_sent_total`	Counter	Bytes sent to aggregators
`log_aggregator_queue_depth`	Gauge	Pending logs in aggregator queue

Trace (Correlation Points)

Operation	Trace Attribute	Purpose
Log emit	`log.aggregate`	Track logs from emit through aggregation
Log parsing	`log.parse.status`	Monitor parsing health
Log shipping	`log.ship.destination`	Track delivery to aggregators
Batch processing	`log.batch.size`	Monitor batch efficiency

Alert (When to Page)

Alert	Condition	Severity	Purpose
Log Silence	No logs received for 5 minutes	P1 Critical	Log pipeline failure
High Drop Rate	Drop rate > 1% for 5 minutes	P2 High	Pipeline health
Buffer Critical	Buffer > 90% full	P2 High	Prevent data loss
Parse Error Spike	Parse errors > 100/min	P3 Medium	Data quality
Latency High	Processing latency > 30s p95	P3 Medium	Performance degradation

Alerting Hook Template

# logging-observability-hooks.yaml
groups:
  - name: logging-observability-hooks
    rules:
      # Alert on silence - no logs coming in
      - alert: LoggingPipelineSilence
        expr: sum(rate(fluentd_input_status_records_total[5m])) == 0 or absent(fluentd_input_status_records_total)
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "No logs being ingested (Alert on Silence)"
          description: "Fluentd/Bit has not sent logs to Elasticsearch in 5 minutes. Either the log pipeline is down or all services have stopped logging."

      # Alert on high drop rate
      - alert: LoggingDropRateHigh
        expr: |
          sum(rate(fluentd_output_status_num_errors_total[5m]))
          /
          sum(rate(fluentd_input_status_records_total[5m])) > 0.01
        for: 5m
        labels:
          severity: high
        annotations:
          summary: "Log drop rate above 1%"
          description: "{{ $value | humanizePercentage }} of logs are being dropped. Check Fluentd/Bit error logs."

      # Alert on buffer approaching full
      - alert: LoggingBufferCritical
        expr: fluentd_buffer_queue_length / fluentd_buffer_limit > 0.9
        for: 5m
        labels:
          severity: high
        annotations:
          summary: "Log buffer above 90% capacity"
          description: "Fluentd/Bit buffer is filling up. Risk of log loss if not addressed."

      # Alert on high parsing errors
      - alert: LoggingParseErrorSpike
        expr: rate(log_parsing_errors_total[5m]) * 60 > 100
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High log parsing error rate"
          description: "More than 100 parsing errors per minute. Review log format consistency."

      # Alert on processing latency
      - alert: LoggingProcessingLatencyHigh
        expr: |
          histogram_quantile(0.95,
            sum(rate(fluentd_output_status_flush_interval_bucket[5m])) by (le)
          ) > 30
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Log processing latency above 30 seconds"
          description: "P95 log processing latency is {{ $value }}s. Logs may not be searchable in real-time."

      # SLO error budget burn rate
      - alert: LoggingErrorBudgetBurningFast
        expr: |
          (
            sum(rate(log_entries_total{level="error"}[1h]))
            /
            sum(rate(log_entries_total[1h]))
          ) > (1 - 0.999) * 14.4
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Log error budget burning at unsustainable rate"
          description: "Error budget is being consumed 14.4x faster than sustainable. Immediate investigation required."

Cost Optimization for Logging Pipeline

Logging costs creep up fast when you are not paying attention. Here is how to keep them under control.

Log Volume Budgeting

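# Kubernetes ResourceQuota for the namespace that runs the logging stack.
# A quota applies to every pod and PVC in the namespace, so keep log agents
# and log storage in their own namespace rather than alongside applications.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: logging-budget
  namespace: logging
spec:
  hard:
    # Cap total PVC storage requested in the namespace
    requests.storage: 100Gi
    # Cap aggregate memory requests and limits across all pods
    requests.memory: 2Gi
    limits.memory: 4Gi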

Cost Optimization Strategies

Strategy	Impact	Implementation
Reduce DEBUG in production	60-80% volume reduction	Runtime level control, feature flags
Index only essential fields	40-60% storage reduction	Field mapping optimization in ES
Aggressive ILM policies	50-70% cost reduction	Move old data to cold/archive tiers
Sampling high-volume paths	90% volume reduction	Deterministic sampling for non-critical
Compress before shipping	30-50% bandwidth savings	gzip compression in log agents

Architecture for Cost-Effective Logging

graph TB
    A[Application] --> B[Fluent Bit Agent]
    B --> C{Local Buffer}
    C -->|Normal hours| D[Hot Storage - 7 days]
    C -->|Off-peak batch| E[Warm Storage - 30 days]
    D --> F[Cold Storage - 90 days]
    E --> F
    F --> G[Archive - 1+ year]
    G --> H[Glacier/Blob Storage]

Reducing Egress Costs in Multi-Region Setups

# Fluentd filter to drop low-priority logs before shipping
<filter container.**>
  @type grep
  <exclude>
    key level
    pattern /DEBUG|TRACE/
  </exclude>
</filter>

# Alternative: keep a fixed fraction of records
# (assumes the fluent-plugin-sampling-filter plugin is installed)
<filter container.**>
  @type sampling
  # Pass 1 record in every 10, counted per tag
  interval 10
  sample_unit tag
</filter>

Multi-Region Logging Strategies

Global systems need log collection that respects regional boundaries and does not add latency to user-facing paths.

Regional Log Aggregation

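# Regional Vector aggregator config (vector.toml)

# Route logs to regional storage based on source.
# Assumes each event carries a `.region` field such as "eu-central-1".
[transforms.route_by_region]
type = "route"
inputs = ["parse_json"]

[transforms.route_by_region.route]
eu = 'starts_with(downcase(string(.region) ?? ""), "eu")'

[sinks.s3_eu_central]
type = "aws_s3"
inputs = ["route_by_region.eu"]
bucket = "logs-eu-central-1"
region = "eu-central-1"
encoding.codec = "json"

# Everything that did not match a named route
[sinks.s3_regional]
type = "aws_s3"
inputs = ["route_by_region._unmatched"]
bucket = "logs-us-east-1"
region = "us-east-1"
encoding.codec = "json"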

Cross-Region Log Correlation

// Fan-out query across regions
async function searchLogsAcrossRegions(
  query,
  regions = ["us-east-1", "eu-central-1"],
) {
  const results = await Promise.all(
    regions.map((region) =>
      elasticsearch.search({
        index: `logs-${region}-*`,
        body: {
          query: {
            bool: {
              must: [query],
              filter: [{ term: { region } }],
            },
          },
        },
      }),
    ),
  );

  // Merge hits from all regions, grouped by trace_id so no entry is lost
  return results
    .flatMap((r) => r.hits.hits)
    .reduce((acc, hit) => {
      const traceId = hit._source.trace_id;
      (acc[traceId] ??= []).push(hit._source);
      return acc;
    }, {});
}

Compliance Considerations

Requirement	Implementation
GDPR (EU data)	Regional aggregation, no cross-border log transfer
Data residency	Separate indices per region, regional access controls
Audit trails	Immutable WORM storage in each region
Incident response	Replicate critical error logs to a central alerting index

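Region-Aware Index Allocation

These index settings pin an index's shards to nodes tagged with a matching region attribute. Allocation awareness is a cluster-level setting (cluster.routing.allocation.awareness.attributes: region) and is applied separately, not inside the index settings. This controls shard placement within one cluster; it does not replicate data between regional clusters.

{
  "index": {
    "number_of_shards": 3,
    "number_of_replicas": 1,
    "routing": {
      "allocation": {
        "include": {
          "region": "us-east-1"
        }
      }
    }
  }
}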

Production Failure Scenarios

Failure	Impact	Mitigation
Log aggregation pipeline downtime	No new logs searchable; teams blind to issues	Buffer logs locally; implement retry with backoff; alert on pipeline health
Elasticsearch cluster saturation	Log ingestion backs up; logs dropped	Monitor ES cluster health; implement backpressure; use ILM to manage indices
Corrupted log data	Searches return incomplete results; debugging misses context	Validate JSON structure at ingestion; use dead-letter queues for malformed logs
Sensitive data logged	Security/compliance breach; potential data exposure	Implement redaction middleware; scan logs before storage; educate developers
Excessive log volume	Storage costs spike; performance degradation	Implement sampling for DEBUG logs; enforce log level policies; archive aggressively
Missing correlation IDs	Cannot trace requests across services	Auto-inject correlation IDs in middleware; reject requests without trace context in high-security paths
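
The "validate at ingestion, dead-letter the rest" mitigation can be sketched in a few lines of Python. The required-field list mirrors the essential fields listed earlier; the ingest function and the shape of a dead-letter entry are assumptions for illustration.

```python
import json

REQUIRED_FIELDS = ("timestamp", "level", "service", "message")


def ingest(lines):
    """Split raw lines into accepted records and a dead-letter list."""
    accepted, dead_letter = [], []
    for raw in lines:
        try:
            record = json.loads(raw)
        except json.JSONDecodeError as exc:
            dead_letter.append({"raw": raw, "reason": f"invalid JSON: {exc.msg}"})
            continue
        if not isinstance(record, dict):
            dead_letter.append({"raw": raw, "reason": "not a JSON object"})
            continue
        missing = [f for f in REQUIRED_FIELDS if f not in record]
        if missing:
            dead_letter.append(
                {"raw": raw, "reason": "missing fields: " + ", ".join(missing)}
            )
            continue
        accepted.append(record)
    return accepted, dead_letter


accepted, dead_letter = ingest(
    [
        '{"timestamp": "2026-03-22T14:32:01.456Z", "level": "INFO", '
        '"service": "auth-service", "message": "User login successful"}',
        '{"level": "ERROR", "message": "no timestamp or service"}',
        "plain text line from a legacy component",
    ]
)
```

Keeping the raw line and a reason in the dead-letter entry means malformed logs can still be inspected during an incident instead of disappearing.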

Real-world Failure Scenarios

Scenario 1: Log Data Loss During Incident

What happened: During a production incident, engineers discovered that logs from the primary application server were not being shipped to the central log aggregator. The log forwarder had crashed silently 2 hours prior.

Root cause: No health checks were configured for the log forwarder daemon. The process had exited but the orchestration system did not restart it because it was running as a sidecar rather than a managed service.

Impact: Engineers spent 45 minutes manually accessing individual server logs to piece together the sequence of events, delaying incident resolution.

Lesson learned: Monitor log forwarder processes and shipper queues. Implement heartbeat logging so missing heartbeats trigger an alert. Ship logs to multiple destinations for critical services.
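
A heartbeat check can be as simple as comparing each forwarder's last-seen time against a silence threshold. The sketch below assumes heartbeat timestamps are already collected per host; the function name, host names, and 5-minute threshold are illustrative.

```python
from datetime import datetime, timedelta, timezone


def silent_forwarders(last_heartbeat, now, max_silence=timedelta(minutes=5)):
    """Return hosts whose most recent heartbeat is older than `max_silence`."""
    return sorted(
        host for host, seen in last_heartbeat.items() if now - seen > max_silence
    )


now = datetime(2026, 3, 22, 14, 30, tzinfo=timezone.utc)
last_heartbeat = {
    "app-server-1": now - timedelta(hours=2),     # forwarder crashed silently
    "app-server-2": now - timedelta(seconds=40),  # healthy
}
stale = silent_forwarders(last_heartbeat, now)
```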

Scenario 2: Structured Logging Breaking Search Dashboards

What happened: After migrating from plain-text to structured JSON logging, the Kibana dashboard used by the operations team stopped displaying log events. The team was flying blind for 3 hours until the issue was diagnosed.

Root cause: The Kibana index pattern was configured to look for a message field as the primary text field. Structured logs used field names like msg and event_text, so no events matched the default search.

Impact: All monitoring dashboards showed empty results. A customer-impacting database slowdown went undetected for longer than necessary.

Lesson learned: Validate dashboard queries against a test environment before migrating logging formats. Ensure field naming conventions match across the log pipeline and dashboards. Maintain backwards compatibility during format transitions.

Common Pitfalls / Anti-Patterns

Anti-Patterns to Avoid

1. Logging Everything at DEBUG in Production

DEBUG-level logging in high-throughput services generates gigabytes per hour. Use sampling for debug scenarios, or enable DEBUG selectively via feature flags for specific request IDs.

2. Plain Text Logging with String Concatenation

// Bad: Cannot search, parse, or aggregate
logger.info("User " + userId + " purchased " + item);

// Good: Structured, searchable, aggregatable
logger.info("User purchased item", { userId, itemId, itemName, price });

3. Missing Trace Context Propagation

Logs without correlation IDs are useless for tracing requests across services. Always propagate trace_id through HTTP headers, database connections, and message queues.
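One way to make propagation automatic inside a single service is sketched below, using only the Python standard library. It is a minimal illustration rather than a complete middleware: the `x-trace-id` header name, `handle_request`, and `build_logger` are hypothetical, and a real service would hook this into its web framework.

```python
import contextvars
import io
import json
import logging

# Holds the trace ID for the current request ("trace_id_var" is an illustrative name).
trace_id_var = contextvars.ContextVar("trace_id", default="unknown")


class TraceIdFilter(logging.Filter):
    """Copy the current trace ID onto every record that reaches the handler."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.trace_id = trace_id_var.get()
        return True


class JsonFormatter(logging.Formatter):
    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "level": record.levelname,
            "message": record.getMessage(),
            "trace_id": record.trace_id,
        })


def build_logger(stream) -> logging.Logger:
    handler = logging.StreamHandler(stream)
    handler.addFilter(TraceIdFilter())
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger("trace_demo")
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


def handle_request(logger: logging.Logger, headers: dict) -> dict:
    """Hypothetical handler: adopt the inbound ID, log, return outbound headers."""
    token = trace_id_var.set(headers.get("x-trace-id", "unknown"))
    try:
        logger.info("processing request")
        # Forward the same ID on any downstream HTTP call or queue message.
        return {"x-trace-id": trace_id_var.get()}
    finally:
        trace_id_var.reset(token)


buffer = io.StringIO()
outbound = handle_request(build_logger(buffer), {"x-trace-id": "abc123def456"})
print(buffer.getvalue().strip())
```

Because the ID lives in a context variable, application code never passes it around by hand, and the filter stamps it on every log line emitted during the request.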

4. Logging Sensitive Data

Never log passwords, full tokens, credit card numbers, or PII. Implement redaction at the logger level, not the application level, to catch mistakes.
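A logger-level safety net can be sketched as a standard-library `logging.Filter` that scrubs the final message text, so a careless call site is still covered. The two regular expressions are illustrative assumptions only; a production pattern set needs review and testing against your own data.

```python
import io
import logging
import re

# Illustrative patterns only, not an exhaustive or production-ready set.
PATTERNS = [
    re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),  # email-like strings
    re.compile(r"\b\d{13,16}\b"),             # bare 13-16 digit numbers (card-like)
]


class RedactingFilter(logging.Filter):
    """Scrub sensitive-looking substrings from the fully formatted message."""

    def filter(self, record: logging.LogRecord) -> bool:
        message = record.getMessage()
        for pattern in PATTERNS:
            message = pattern.sub("[REDACTED]", message)
        # Replace the message and drop the args so nothing is re-interpolated.
        record.msg, record.args = message, None
        return True


stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.addFilter(RedactingFilter())
logger = logging.getLogger("redaction_demo")
logger.handlers = [handler]
logger.setLevel(logging.INFO)
logger.propagate = False

logger.info("password reset requested for %s", "jane@example.com")
logger.info("charge declined for card 4111111111111111")
print(stream.getvalue())
```

Attaching the filter to the handler means every record passing through that handler is scrubbed, regardless of which module emitted it.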

5. Synchronous Logging to Network Storage

Writing logs synchronously to a remote log server adds latency to every operation. Use async logging with local buffering and background shipping.
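In Python, the standard library's `QueueHandler` and `QueueListener` give a small version of this pattern. The sketch below uses an in-memory `StringIO` as a stand-in for a slow network shipper; the queue size is an arbitrary assumption.

```python
import io
import logging
import logging.handlers
import queue

log_queue: queue.Queue = queue.Queue(maxsize=10_000)  # bounded local buffer

# Stand-in for a slow remote destination, so the sketch stays self-contained.
sink = io.StringIO()
slow_handler = logging.StreamHandler(sink)

# The listener drains the queue on a background thread and calls the slow handler.
listener = logging.handlers.QueueListener(log_queue, slow_handler)
listener.start()

logger = logging.getLogger("async_demo")
logger.handlers = [logging.handlers.QueueHandler(log_queue)]
logger.setLevel(logging.INFO)
logger.propagate = False

# Each call only enqueues the record, so the request path does not wait on I/O.
for order_id in range(3):
    logger.info("order processed id=%d", order_id)

listener.stop()  # processes everything still queued, then joins the thread
print(sink.getvalue())
```

The trade-off is that a bounded queue can fill up during a long outage, at which point new records are rejected rather than blocking the application, so the queue size should be chosen against the outage window you want to ride out.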

6. No Log Retention Policy

Without retention policies, storage costs grow unbounded. Define hot/warm/cold/archive tiers and automate data lifecycle management.

7. Logs as the Only Observability Signal

Relying solely on logs for debugging is insufficient at scale. Combine logs with metrics and traces for complete observability.

Observability Checklist

Key Log Metrics

Log ingestion rate (logs/second) by service and level
Log volume by service, level, and environment
Error rate in logs (ERROR level count over time)
Log processing latency (time from log emit to searchable)
Log agent errors and restarts
Storage utilization per index

Logs You Should Have

Request logs with trace_id, user_id, method, path, status, duration_ms
Authentication events (login attempts, failures, token refreshes)
Business events (orders, payments, registrations) with entity IDs
Database query logs for slow queries (>100ms threshold)
External API call logs with request/response timing
Background job start/complete/fail logs with job IDs
Health check and readiness probe logs
Configuration change logs (who changed what when)

Alerts You Need

No logs received from a service for >5 minutes (Alert on Silence pattern)
Error rate spike above baseline (unexpected errors)
Log volume anomaly (sudden drop or spike)
Log processing latency >30 seconds
Elasticsearch cluster health degraded (yellow/red)
Log agent restart detected

Security Checklist

No passwords, API keys, or secrets in log output
Credit card numbers, CVV, SSN never logged
Authorization tokens logged as type + last 4 chars only (e.g., “Bearer ***c123”)
PII fields identified and redacted in redaction middleware
Log access requires authentication and is audited
Log aggregation pipeline uses TLS in transit
Elasticsearch access restricted to authorized personnel
Log retention complies with data retention policies
Sensitive data cannot be searched in Kibana/ES by unauthorized users

Interview Questions

1. A user reports a bug but provides no details beyond "the checkout failed." How do you find the relevant logs?

Ask for the approximate time, user ID, or order ID. With a timestamp window, query logs for that timeframe filtering on the service handling checkout. With a user ID, search for all log entries tagged with that user ID. With a correlation ID from their session, search for that ID across all services to reconstruct the full request path. In ELK/Kibana, set the time picker to the reported window and query service.name: "checkout-service" AND message: "checkout failed" (exact field names depend on your mapping). If there is no direct match, search for ERROR entries in the checkout service within the time window, then follow their correlation IDs to find the service where the failure originated.

2. Your Elasticsearch cluster is running out of disk space. How do you reduce storage without losing searchable data?

Immediate mitigation: delete indices that are already past your retention policy, force-merge indices that are no longer written to (which can reclaim some space), and, where your license allows, move historical data to searchable snapshots so it stays queryable from cheaper object storage. For ongoing cost reduction: reduce replica count on older (warm or cold) indices, use ILM policies to move older indices to cheaper storage tiers (cold or frozen), and reduce shard count, since many small shards waste overhead. Audit field mappings: set index: false on fields you never search, and doc_values: false on fields you never sort or aggregate on. Finally, enforce log volume budgets per service to prevent any single service from overwhelming the cluster.

3. What is the relationship between correlation IDs, trace IDs, and span IDs in distributed tracing?

A trace ID is a unique identifier for an entire request transaction across all services: it stitches together every span. A span ID represents a single unit of work within that trace (one service call, one database query). Correlation IDs are typically an application-level business identifier (order ID, user session ID) that helps you filter logs across services without relying on trace IDs. In practice: the trace ID propagates via HTTP headers (X-B3-TraceId in Zipkin's B3 format, traceparent in W3C Trace Context) through every service call. Each service creates a span with the incoming trace ID and its own span ID, creating a parent-child tree of operations.
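To make the header layout concrete, here is a minimal sketch that parses a W3C `traceparent` value and derives the fields a service might attach to its own log lines. It handles only the version 00 layout (`version-traceid-parentid-flags`); `TraceContext` and `child_log_fields` are hypothetical names, and a real service would normally rely on an OpenTelemetry SDK instead.

```python
import re
import secrets
from typing import NamedTuple, Optional

# Version 00 layout from W3C Trace Context: version-traceid-parentid-flags (lowercase hex).
TRACEPARENT_RE = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


class TraceContext(NamedTuple):
    trace_id: str        # shared by every span in the request
    parent_span_id: str  # span ID of the caller
    sampled: bool        # lowest bit of the flags byte


def parse_traceparent(header: str) -> Optional[TraceContext]:
    """Return the parsed context, or None if the header is malformed."""
    match = TRACEPARENT_RE.match(header.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # The spec forbids version ff and all-zero trace or parent IDs.
    if version == "ff" or set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        return None
    return TraceContext(trace_id, parent_id, bool(int(flags, 16) & 0x01))


def child_log_fields(ctx: TraceContext) -> dict:
    """Fields a service could attach to its logs for one new unit of work."""
    return {
        "trace_id": ctx.trace_id,
        "span_id": secrets.token_hex(8),  # new 16-hex-char span ID
        "parent_span_id": ctx.parent_span_id,
    }


incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
context = parse_traceparent(incoming)
print(child_log_fields(context))
```

The trace ID is copied unchanged while the span ID is freshly generated, which is what produces the parent-child tree described above.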

4. You find that DEBUG logs are missing during production incidents. What logging level should you set in production and why?

Production should run at INFO or WARN in most services — DEBUG is too noisy for production traffic volumes and can itself cause performance problems (disk I/O, log storage costs). However, when an incident is active, dynamically raising a specific service to DEBUG via a config change allows targeted debugging without impacting all services. The pattern: have a mechanism to change log level at runtime (via a config map reload, a logging API endpoint, or a Kubernetes annotation). Keep DEBUG in staging and development environments where you need it for development iteration.

5. How do you handle sensitive data like PII appearing in logs?

The best approach is to never log PII in the first place: sanitize before logging by configuring your logger to mask fields like email addresses, credit cards, and phone numbers. In your logging framework (Pino for Node, zap for Go, structlog for Python), add a redaction step that replaces sensitive fields or patterns with [REDACTED]. Alternatively, use a log processor (a Fluentd filter, Logstash mutate) to strip or hash sensitive fields before forwarding. Also configure your SIEM and log storage to mark PII fields as sensitive so analysts are warned. Audit your logs regularly with a PII scanning tool or a scheduled regex scan to catch accidental leakage.

6. Your log aggregation pipeline goes down during a major incident. How do you prevent log loss and maintain visibility?

Implement local buffering on your log agents so logs are not lost if the aggregation pipeline fails. Fluentd/Fluent Bit agents should have a configurable buffer section (file or memory) with sufficient capacity to handle your expected outage window. Configure the agent to retry with exponential backoff when the destination is unavailable. For critical services, consider a dual-write strategy: ship logs to both your primary aggregator and a fallback like a local file or S3 bucket. Set up alerting on the log pipeline itself so you know immediately when ingestion stops. During the outage, query those fallback destinations to maintain visibility.
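The buffering and retry behavior that agents such as Fluent Bit provide can be illustrated with a small single-threaded sketch. `BufferedShipper`, `flaky_send`, and the delay values are hypothetical; a real agent also persists its buffer to disk and adds jitter to the delays.

```python
import time
from collections import deque
from typing import Callable, Deque, List


def backoff_delays(base: float = 1.0, factor: float = 2.0,
                   cap: float = 60.0, attempts: int = 5) -> List[float]:
    """Delays between attempts: base, base*factor, ... capped at `cap`."""
    return [min(cap, base * factor**i) for i in range(attempts)]


class BufferedShipper:
    def __init__(self, send: Callable[[List[str]], None], max_buffer: int = 1000,
                 sleep: Callable[[float], None] = time.sleep) -> None:
        self._send = send
        self._sleep = sleep
        # Bounded buffer: once full, the oldest lines are dropped first.
        self._buffer: Deque[str] = deque(maxlen=max_buffer)

    def emit(self, line: str) -> None:
        self._buffer.append(line)

    def flush(self) -> bool:
        """Try to ship the buffer; keep it intact if every attempt fails."""
        batch = list(self._buffer)
        if not batch:
            return True
        for delay in backoff_delays():
            try:
                self._send(batch)
            except ConnectionError:
                self._sleep(delay)
                continue
            for _ in batch:
                self._buffer.popleft()
            return True
        return False


# Demo with a fake destination that fails twice, then recovers.
calls = {"count": 0}
delivered: List[str] = []
slept: List[float] = []


def flaky_send(batch: List[str]) -> None:
    calls["count"] += 1
    if calls["count"] <= 2:
        raise ConnectionError("aggregator unavailable")
    delivered.extend(batch)


shipper = BufferedShipper(flaky_send, sleep=slept.append)
shipper.emit('{"level": "ERROR", "message": "db timeout"}')
shipper.emit('{"level": "INFO", "message": "retrying"}')
ok = shipper.flush()
print(ok, slept, len(delivered))
```

Injecting the `sleep` function keeps the demo instant and makes the backoff schedule easy to inspect; the buffer is only emptied after a send succeeds, so a failed flush loses nothing.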

7. Describe how you would implement adaptive sampling for debug logs in a high-throughput API service.

Adaptive sampling lets you keep debug logs when you need them most without overwhelming your logging infrastructure. Implement head-based sampling at the log agent level: configure a base sample rate (e.g., 1%) for DEBUG logs but increase it dynamically when errors spike or for specific users/transactions. A simple implementation uses a deterministic hash of the trace_id to ensure you always see all logs for the same trace. Alternatively, tail-based sampling collects full logs for requests with errors and samples everything else — this guarantees complete debugging context for failed requests while reducing volume for successful ones. Combine with feature flags so on-call engineers can override sampling rates in real-time via config maps or API calls.
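The deterministic-hash idea can be sketched as follows. This is an illustration under assumptions, not a reference implementation: the function names, the 5% error threshold, and the boosted rate are all hypothetical, and the recent error ratio is assumed to come from some metrics source.

```python
import hashlib


def sample_bucket(trace_id: str) -> float:
    """Map a trace ID to a stable value in [0, 1)."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") / 2**32


def effective_rate(base_rate: float, error_ratio: float,
                   error_threshold: float = 0.05, boosted_rate: float = 1.0) -> float:
    """Raise the sample rate while the recent error ratio is above the threshold."""
    return boosted_rate if error_ratio >= error_threshold else base_rate


def keep_debug_log(trace_id: str, base_rate: float, error_ratio: float) -> bool:
    """Same trace ID and same rate always produce the same decision."""
    return sample_bucket(trace_id) < effective_rate(base_rate, error_ratio)


trace_ids = [f"trace-{i}" for i in range(10_000)]
kept = sum(keep_debug_log(t, base_rate=0.1, error_ratio=0.0) for t in trace_ids)
print(f"kept {kept} of {len(trace_ids)} traces at a 10% base rate")
```

Because the decision depends only on the trace ID and the current rate, every service that sees the same trace makes the same choice, and raising the rate during an error spike only adds traces without dropping ones that were already being kept.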

8. What are the trade-offs between writing logs to stdout versus writing directly to a log aggregation service?

Writing to stdout is the recommended approach in Kubernetes and other containerized environments: the container runtime captures the stream, the node handles file rotation, and a log agent handles shipping, so your application stays focused on business logic. Any log agent can collect the output, and logs written before a container restart stay on the node until they are rotated, so the agent can still pick them up. The downsides are a small delay while the agent tails and forwards the files, the need to run an agent as a sidecar or DaemonSet, and the risk of losing lines if files rotate or the pod is removed before the agent has read them. Writing directly to an aggregator (Elasticsearch, Loki) eliminates the agent but couples your application to the aggregation infrastructure: if the aggregator is slow or unavailable, your application suffers. Direct writes work well for serverless functions where you cannot run log agents, but for traditional services, stdout with an agent like Fluent Bit or Vector gives you the best separation of concerns and flexibility to swap aggregation backends.

9. How would you design a log-based alert to detect a memory leak in a Java application before it causes an outage?

A memory leak typically manifests as heap usage that keeps climbing because each garbage collection reclaims less than was allocated since the previous one. Configure your application to emit JVM metrics via Micrometer or Dropwizard Metrics, then create alerts based on patterns in those metric logs. Key signals: memory usage after GC increasing across consecutive cycles without full recovery, heap utilization trending upward over hours, or the number of allocated objects growing while GC pause time increases. In your log aggregator, the alert condition looks like this pseudo-query (adapt it to your backend's query language): metric.name: jvm_memory_used AND metric.area: heap AND increase(metric.value[1h]) > threshold. Page the on-call engineer (for example via PagerDuty) when heap usage stays above 80% for 15 minutes or when the GC reclaim rate falls below the allocation rate. Correlate with your application logs to identify which code paths are allocating the most objects.

10. Explain how you would use OpenTelemetry to correlate logs from a microservices application with distributed traces.

OpenTelemetry provides automatic instrumentation that injects trace context (trace_id, span_id, flags) into log records at the SDK level, eliminating manual correlation ID propagation. Set up the OpenTelemetry SDK in each microservice with an OTLP exporter pointing to your collector. The collector fans out to both your trace backend (Jaeger, Zipkin) and log backend (Elasticsearch, Loki). When you query a trace ID, you get the full span timeline — clicking any span shows you all log records emitted during that span. For manual correlation when needed, use the OTel API: extract the active span context, inject it into your structured log record alongside your business fields. This gives you the best of both worlds — automatic context injection plus explicit business correlation in your log messages.

11. Your team is debating whether to use JSON structured logs or plain text with a standard format like logfmt. What factors influence the right choice?

JSON structured logs win when you need machine parsing, field-level queries, and integration with modern log aggregators like Elasticsearch or Loki. They work well when your logging infrastructure can handle the extra bytes per line and you need to search across dimensions (service, user_id, region) without regex matching. Logfmt is a better fit when you want human readability in terminal output while still supporting structured parsing: it is more compact than JSON and stays readable when you cat a log file. Consider JSON when you need nested fields and aggregations and will mostly query via Kibana/Grafana. Consider logfmt when developers read logs directly during debugging and your aggregation pipeline can handle field extraction.
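To compare the two formats side by side, the sketch below renders the same event as JSON and as logfmt. logfmt has no formal specification, so the quoting rule in the hypothetical `to_logfmt` helper (quote values that are empty or contain spaces, quotes, or `=`) is one common convention rather than a standard.

```python
import json


def to_logfmt(fields: dict) -> str:
    """Render flat key/value pairs as logfmt, quoting values only when needed."""
    parts = []
    for key, value in fields.items():
        text = str(value)
        if text == "" or any(ch in text for ch in ' "='):
            text = '"' + text.replace("\\", "\\\\").replace('"', '\\"') + '"'
        parts.append(f"{key}={text}")
    return " ".join(parts)


event = {
    "level": "INFO",
    "message": "User login successful",
    "service": "auth-service",
    "duration_ms": 45,
}

print(json.dumps(event))
# {"level": "INFO", "message": "User login successful", "service": "auth-service", "duration_ms": 45}
print(to_logfmt(event))
# level=INFO message="User login successful" service=auth-service duration_ms=45
```

The logfmt line is shorter and easier to scan, but it only handles flat fields; nested objects and arrays are where JSON earns its extra bytes.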

12. How do you decide which log aggregation platform to use for a growing startup with limited DevOps resources?

For a startup with limited resources, managed services usually beat self-hosted. Elasticsearch is powerful but requires significant operational overhead: you need someone who knows cluster sizing, shard allocation, and ILM policies. Loki is usually cheaper to run because it indexes only labels rather than full log content, and it pairs naturally with Grafana for metrics plus logs. Cloud-specific options like AWS CloudWatch Logs or GCP Cloud Logging reduce operational burden further if you're already in that ecosystem. My recommendation: start with whatever your metrics platform uses (Grafana+Loki for open source, CloudWatch if AWS-native) and migrate only when you hit scaling limits. The worst choice is running Elasticsearch on VMs without dedicated infrastructure expertise.

13. Describe a scenario where logging too much information actually made production debugging harder.

I worked on a service where developers, trying to be thorough, logged every single function entry and exit with full parameter values. In production, this generated 50GB of logs per day for a single service, not because of traffic, but because debug logging was left enabled. When a real incident happened, the log volume was so high that querying for relevant entries timed out the aggregator. The actual error was buried under gigabytes of noise. The fix was simple: disable verbose debug logging in production by default, and enable it selectively via feature flag for specific request IDs when needed. The lesson: more logs is not always better. Volume management and signal-to-noise ratio matter more than comprehensiveness.

14. What is the difference between log-based metrics and metrics derived from instrumentation libraries like Micrometer or Prometheus client?

Metrics from instrumentation libraries (Prometheus counters, gauges, histograms) are purpose-built for time-series analysis — they're lightweight, aggregatable, and designed for alerting. Log-based metrics extract measurements from log entries after the fact (counting ERROR lines per minute, for example). The key difference is overhead and precision: library metrics are emitted once per event with negligible CPU cost, while parsing logs to extract metrics adds processing latency and resource usage. Use library metrics for things you always need to measure (request rates, latencies, error counts). Use log-based metrics for derived measurements that would be expensive to instrument directly, like counting specific business events that already appear in logs. In practice, you want both — library metrics for SLIs and alerting, log queries for investigative drilling down when you have a specific hypothesis to test.

15. How would you implement a log redaction library that handles nested objects and arrays without accidentally breaking JSON structure?

A robust redaction function needs to handle nested structures recursively and preserve JSON validity after redaction. The key pattern: traverse the object tree, check each key against sensitive field patterns (password, token, ssn, credit_card, etc.), and replace matching values with a redaction marker. For nested objects, recurse. For arrays, iterate and recurse on each element. The critical edge case is redaction that changes the shape of the data, for example replacing a whole nested object with a single string: downstream parsers expect the same structure, so keep the keys and nesting and mask only the leaf values. Implementation: prefer an allowlist approach where you explicitly define which fields stay readable and redact everything else by default. This is safer than a denylist, which will inevitably miss a field. Test with: nested objects 5 levels deep, arrays of objects, mixed-type arrays, and null values to ensure your redaction preserves the JSON contract downstream.
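A minimal sketch of the allowlist variant follows. The key set in `READABLE_KEYS` and the sample event are hypothetical; the point is that keys and nesting survive while every leaf value outside the allowlist is masked, including everything beneath a key that is not allowlisted.

```python
import json
from typing import Any

# Hypothetical allowlist: only values reached entirely through these keys stay readable.
READABLE_KEYS = {"order_id", "customer", "id", "items", "sku", "quantity"}
MARKER = "[REDACTED]"


def redact(value: Any, allowed: bool = True) -> Any:
    """Return a copy with the same shape, masking leaves outside the allowlist."""
    if isinstance(value, dict):
        # A child is readable only if its key and every ancestor key are allowlisted.
        return {k: redact(v, allowed and k in READABLE_KEYS) for k, v in value.items()}
    if isinstance(value, list):
        return [redact(item, allowed) for item in value]
    if value is None:
        return None  # keep nulls so optional fields keep their meaning
    return value if allowed else MARKER


event = {
    "order_id": "ord_1",
    "customer": {"id": "usr_789", "email": "jane@example.com"},
    "items": [{"sku": "A1", "quantity": 2, "gift_note": "happy birthday"}],
    "payment": {"status": "ok", "card_number": "4111111111111111"},
    "coupon": None,
}

print(json.dumps(redact(event), indent=2))
```

Redacting by default means a newly added field such as `gift_note` is masked until someone deliberately allowlists it, which is the failure mode you want.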

16. Your manager asks you to reduce cloud costs. Logging infrastructure is consuming significant budget. What is your audit process?

First, measure current spend by breaking down log volume by service, environment, and log level. DEBUG-level output left enabled in production is often the largest single contributor, so disabling it is usually the first win. Second, audit indexed fields in Elasticsearch: each indexed field adds storage overhead, and many teams index fields they never query. Third, check retention policies: if you have 90-day retention but compliance only requires 30 days, tightening retention can cut stored volume by roughly two-thirds. Fourth, look at aggregation overhead: are you shipping logs from regions where they never get queried? Finally, consider the agent side: Fluentd/Fluent Bit agents running on every node consume memory and CPU that scales with log volume. As a rough rule of thumb, a well-tuned logging infrastructure should consume around 2-5% of total cloud budget, not 15-20%.

17. When you inherit a legacy system with zero structured logging, how do you introduce structured logging without breaking existing monitoring?

Introduce structured logging incrementally, never in a big bang. Start with your most critical service — add a structured logger alongside the existing text logger during a migration sprint, run both in parallel for a week to validate the new format, then deprecate the text logger. For each service: wrap the existing logger in an abstraction that can output both formats simultaneously during the transition. This way, dashboards and alerts built on text parsing continue working while new debugging workflows use structured queries. The biggest mistake is trying to migrate everything at once — you will miss edge cases and your team will resist the change. Migrate service by service, and maintain a log format registry so the team always knows which format each service uses. After 3-4 months of incremental migration, you can deprecate text logging entirely.

18. Explain how head-based sampling differs from tail-based sampling for debug logs. When would you use each approach?

Head-based sampling makes the sampling decision at the point of log emission — before you know whether the request will succeed or fail. You sample X% of all debug logs uniformly. This is simple and predictable but risks losing debug context for the exact requests you need most: the failing ones. Tail-based sampling collects logs for all requests but only persists complete logs for requests with errors, while sampling or dropping logs for successful requests. Tail-based sampling gives you 100% of debugging information for failures and dramatically reduced volume for successes — but requires more infrastructure (the sampler must sit close to the aggregation layer and hold partial logs in memory). Use head-based sampling when volume is the primary concern and failures are evenly distributed. Use tail-based sampling when error investigation is the primary use case and you cannot afford to lose debug context for failed requests.
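The tail-based side can be sketched as an in-memory buffer keyed by trace ID that decides what to persist only once the request outcome is known. `TailSampler` is a hypothetical class; a production sampler would also need timeouts and memory limits for requests that never finish.

```python
from collections import defaultdict
from typing import Dict, List


class TailSampler:
    """Hold each request's logs in memory and decide what to keep when it ends."""

    def __init__(self, keep_on_success=("INFO", "WARN", "ERROR")) -> None:
        self._pending: Dict[str, List[dict]] = defaultdict(list)
        self._keep_on_success = set(keep_on_success)
        self.persisted: List[dict] = []  # stand-in for the real log backend

    def record(self, trace_id: str, level: str, message: str) -> None:
        self._pending[trace_id].append(
            {"trace_id": trace_id, "level": level, "message": message}
        )

    def finish(self, trace_id: str, failed: bool) -> None:
        entries = self._pending.pop(trace_id, [])
        if not failed:
            # Successful request: drop DEBUG, keep the normal production levels.
            entries = [e for e in entries if e["level"] in self._keep_on_success]
        self.persisted.extend(entries)


sampler = TailSampler()
sampler.record("t-ok", "DEBUG", "cache lookup")
sampler.record("t-ok", "INFO", "checkout complete")
sampler.record("t-bad", "DEBUG", "payment payload built")
sampler.record("t-bad", "ERROR", "payment gateway timeout")
sampler.finish("t-ok", failed=False)
sampler.finish("t-bad", failed=True)
for entry in sampler.persisted:
    print(entry["trace_id"], entry["level"], entry["message"])
```

The pending dictionary is the extra infrastructure cost mentioned above: every in-flight request holds its logs in memory until `finish` is called.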

19. Your security team flags that PII is appearing in application logs despite your redaction library. How do you investigate the scope and prevent recurrence?

First, audit the scope: run a log scanning tool or a custom regex scanner across your log storage to identify which services, log levels, and field patterns contain PII. This tells you whether the leak is in the logger layer (redaction not applied), the transport layer (logs written to files before redaction), or the aggregator layer (structured fields parsed incorrectly). Second, trace the specific leak: if email addresses are appearing, check whether business logic logs the email explicitly instead of a user_id reference. Third, fix at the logger level: the redaction library should be the outermost wrapper around any log call, not an afterthought in business logic. Finally, add automated PII scanning in your CI pipeline and fail builds if known PII patterns appear in test log output. This prevents regression. Useful patterns to scan for include credit card formats, SSNs, and email addresses.

20. Describe how you would use logs to detect a slow memory leak in a production service that only manifests under sustained load.

Memory leaks under sustained load often show up first in logs before metrics trigger alerts. Configure your service to log periodic memory snapshots: every N minutes or every M requests, emit a log entry with JVM heap used/committed, thread count, and number of loaded classes. Over time, you can query these snapshots and use the log aggregator's built-in statistical functions to detect upward trends. For example, in Elasticsearch, use a date histogram aggregation on the memory snapshot timestamp, with an avg sub-aggregation on heap_used. If the average heap usage shows a statistically significant upward trend over 24-48 hours without corresponding GC events clearing it, you likely have a leak. Correlate the leak timeline with your application logs to identify which code paths were active — often the leak correlates with specific features or traffic patterns. This approach works without external profiling tools and gives you enough signal to prioritize a fix before an OOM kill.
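Once the snapshots are exported from the aggregator, the trend check itself is a simple least-squares fit. The sketch below assumes each snapshot is a pair of (hours elapsed, heap used in MB); the 2 MB/hour threshold and the synthetic data are illustrative assumptions, and measuring heap after GC gives a cleaner signal than raw usage.

```python
from typing import List, Tuple


def heap_slope_mb_per_hour(snapshots: List[Tuple[float, float]]) -> float:
    """Least-squares slope of heap_used (MB) against elapsed time (hours)."""
    n = len(snapshots)
    if n < 2:
        raise ValueError("need at least two snapshots")
    mean_t = sum(t for t, _ in snapshots) / n
    mean_h = sum(h for _, h in snapshots) / n
    numerator = sum((t - mean_t) * (h - mean_h) for t, h in snapshots)
    denominator = sum((t - mean_t) ** 2 for t, _ in snapshots)
    if denominator == 0:
        raise ValueError("snapshots must span more than one point in time")
    return numerator / denominator


def looks_like_leak(snapshots: List[Tuple[float, float]],
                    threshold_mb_per_hour: float = 2.0) -> bool:
    return heap_slope_mb_per_hour(snapshots) > threshold_mb_per_hour


# Synthetic data: a sawtooth from GC activity on top of a steady 4 MB/hour climb.
leaking = [(hour, 500 + 4 * hour + (15 if hour % 2 else -15)) for hour in range(24)]
steady = [(hour, 500 + (15 if hour % 2 else -15)) for hour in range(24)]

print(round(heap_slope_mb_per_hour(leaking), 2), looks_like_leak(leaking))
print(round(heap_slope_mb_per_hour(steady), 2), looks_like_leak(steady))
```

Fitting a line over many snapshots smooths out the GC sawtooth, so a single high reading does not trigger the check while a sustained climb does.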

Conclusion

Key Takeaways:

Structured JSON logs enable efficient searching and aggregation
Correlation IDs connect logs across service boundaries
Log levels filter noise: DEBUG for development, ERROR/WARN/INFO for production
Never log sensitive data; always implement redaction
Monitor your monitors: log aggregation needs its own observability
Retention policies prevent unbounded storage growth

Copy/Paste Checklist:

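# Verify structured logging format (counts lines that contain all required fields)
jq -c 'select(has("timestamp") and has("level") and has("message") and has("service"))' /var/log/app.json | wc -l

# Find logs for specific trace (works regardless of key order or spacing)
jq -c 'select(.trace_id == "abc123")' /var/log/app.json

# Count errors by service
jq -r 'select(.level == "ERROR") | .service' /var/log/app.json | sort | uniq -c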

# Alert on log silence (Prometheus)
- alert: LogIngestionSilence
  expr: rate(fluentd_input_status_records_total[5m]) == 0
  for: 5m
  labels:
    severity: critical

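// Redaction function (TypeScript); shallow, top-level keys only
// Entries are lowercase because keys are lowercased before matching.
const sensitiveFields = ['password', 'token', 'secret', 'creditcard', 'ssn'];
function redact(obj: Record<string, unknown>): Record<string, unknown> {
  return Object.fromEntries(
    Object.entries(obj).map(([k, v]) =>
      sensitiveFields.some(f => k.toLowerCase().includes(f)) ? [k, '[REDACTED]'] : [k, v]
    )
  );
}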

Good logging practices pay off when you need them most: debugging production issues at 2am. Structured logs with correlation IDs let you trace requests across service boundaries. Appropriate log levels keep noise manageable. Retention policies balance cost with compliance requirements.

Start with JSON structured logging in your applications. Add correlation ID propagation early. Build log aggregation before you need it, not during an incident.

For deeper observability, combine logging with the Metrics, Monitoring & Alerting and Distributed Tracing practices covered in our other guides. These three pillars work together: logs show you what happened, metrics show you patterns, and traces show you where in the request path it happened.