Logging Best Practices: Structured Logs, Levels, Aggregation

Master production logging with structured formats, proper log levels, correlation IDs, and scalable log aggregation. Includes patterns for containerized applications.

Reading time: 24 min read

Logging Best Practices: Structured Logs, Levels, and Aggregation

Logs are your primary tool for understanding what happens in production when users report issues. If you have ever spent hours searching through plain text log files trying to find a single error, you know how painful unstructured logging can be.

This guide covers structured formats, log levels, correlation IDs for tracing requests, and aggregation strategies that scale.

The Problem with Plain Text Logging

Most applications start with simple string logging:

logger.info("User " + userId + " logged in from " + ipAddress);

This approach has serious problems. Searching for all logs from a specific user requires parsing the string. Machine parsing is brittle and breaks when the format changes. Adding context requires string concatenation throughout your codebase.

Structured logging solves these problems by emitting machine-readable log entries that are also human-readable.

Structured Logging

Structured logs use a defined format, typically JSON, where each field has a specific meaning.

JSON Log Format

{
  "timestamp": "2026-03-22T14:32:01.456Z",
  "level": "INFO",
  "message": "User login successful",
  "service": "auth-service",
  "version": "2.1.0",
  "trace_id": "abc123def456",
  "user_id": "usr_789",
  "ip_address": "192.168.1.42",
  "duration_ms": 45,
  "environment": "production"
}

This format lets you search by any field, aggregate metrics across dimensions, and correlate logs with traces and metrics.
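To illustrate the difference, here is a minimal sketch of field-level filtering and aggregation over JSON log lines (field names follow the example entry above; the values are illustrative):

```python
import json

# Three sample entries in the format shown above.
lines = [
    '{"level": "INFO", "user_id": "usr_789", "duration_ms": 45}',
    '{"level": "ERROR", "user_id": "usr_123", "duration_ms": 310}',
    '{"level": "INFO", "user_id": "usr_789", "duration_ms": 52}',
]

entries = [json.loads(line) for line in lines]

# Field-level query: every entry for one user, no regex needed.
user_logs = [e for e in entries if e["user_id"] == "usr_789"]

# Field-level aggregation: mean latency across all entries.
avg_ms = sum(e["duration_ms"] for e in entries) / len(entries)

print(len(user_logs), round(avg_ms, 1))  # 2 135.7
```

With plain text logs, both operations would require fragile string parsing.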

Implementing Structured Logging

Most languages have structured logging libraries:

# Python with structlog
import structlog

structlog.configure(
    processors=[
        structlog.processors.TimeStamper(fmt="iso"),
        structlog.processors.JSONRenderer()
    ]
)

log = structlog.get_logger()

log.info("user_login",
    user_id="usr_789",
    ip_address="192.168.1.42",
    duration_ms=45
)

// TypeScript with pino
import pino from "pino";

const log = pino({
  level: "info",
  base: {
    service: "auth-service",
    version: process.env.APP_VERSION,
  },
});

log.info(
  {
    userId: "usr_789",
    ipAddress: "192.168.1.42",
    durationMs: 45,
  },
  "User login successful",
);

// Go with zerolog
import (
    "github.com/rs/zerolog"
    "github.com/rs/zerolog/log"
)

zerolog.TimeFieldFormat = zerolog.TimeFormatUnix

log.Info().
    Str("user_id", "usr_789").
    Str("ip_address", "192.168.1.42").
    Int("duration_ms", 45).
    Msg("User login successful")

Log Levels

Log levels help filter noise. Not every log entry needs to be visible during normal operations.

Standard Log Levels

Level | Purpose | When to Use
DEBUG | Detailed diagnostic information | During development and troubleshooting
INFO | Confirmation that things work as expected | Significant business events
WARN | Unexpected but handled situations | Recoverable errors, degraded states
ERROR | Errors that need attention | Failures that affect requests
FATAL | System cannot continue | Critical failures requiring immediate action

Choosing the Right Level

This feels intuitive but gets harder at scale. A few guidelines:

  • INFO for business events like orders placed, users registered. You want these for analytics and auditing.
  • WARN for situations that require attention but the system continues: retries succeeded, cache misses, degraded mode.
  • ERROR for failures that affect the current request: database timeout, external API failure, validation error.
  • DEBUG for information that helps during development but would overwhelm production: loop iterations, intermediate values.

Don’t log DEBUG in production unless you can enable it selectively for specific requests. A debug log in a hot path can generate gigabytes per hour.
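One sketch of selective DEBUG, using a stdlib logging filter (the flagging mechanism — a request header or feature flag — is assumed; a plain set stands in for it here):

```python
import logging

# Request IDs explicitly flagged for verbose logging.
DEBUG_REQUEST_IDS = {"req_42"}

class SelectiveDebugFilter(logging.Filter):
    def filter(self, record):
        # DEBUG records pass only for explicitly flagged requests.
        if record.levelno == logging.DEBUG:
            return getattr(record, "request_id", None) in DEBUG_REQUEST_IDS
        return True  # non-DEBUG records always pass

logger = logging.getLogger("hot-path")
logger.setLevel(logging.DEBUG)              # level alone would flood prod
logger.addFilter(SelectiveDebugFilter())

logger.debug("loop detail", extra={"request_id": "req_42"})  # emitted
logger.debug("loop detail", extra={"request_id": "req_99"})  # dropped
```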

Correlation IDs

When a request flows through multiple services, correlation IDs let you follow it across all logs.

Propagating Trace Context

// Middleware to extract or generate correlation ID
function correlationMiddleware(req, res, next) {
  const traceId = req.headers["x-trace-id"] || generateUUID();
  req.correlationId = traceId;
  res.setHeader("x-trace-id", traceId);

  // Attach a child logger so every entry in this request carries the trace ID
  req.log = log.child({ traceId });

  next();
}
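In Python, the same binding can be done with contextvars, so every log call inside a request's execution context picks up the trace ID without explicit plumbing (the helper names here are illustrative):

```python
import contextvars
import json
import uuid

# Holds the current request's trace ID for the active execution context.
trace_id_var = contextvars.ContextVar("trace_id", default=None)

def start_request(incoming_trace_id=None):
    # Extract the inbound ID or mint a new one, as in the middleware above.
    trace_id_var.set(incoming_trace_id or uuid.uuid4().hex)

def log_info(message, **fields):
    # Every entry automatically carries the bound trace ID.
    return json.dumps({
        "level": "INFO",
        "message": message,
        "trace_id": trace_id_var.get(),
        **fields,
    })

start_request("abc123def456")
print(log_info("User login successful", user_id="usr_789"))
```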

Propagate the correlation ID to all downstream calls:

// Outgoing HTTP request
fetch("https://api.example.com/users", {
  headers: {
    "X-Trace-ID": req.correlationId,
  },
});

// Database queries
db.query("SELECT * FROM users WHERE id = $1", [userId], {
  traceId: req.correlationId,
});

// Message queue messages
queue.send({
  payload: orderData,
  headers: {
    "X-Trace-ID": req.correlationId,
  },
});

With structured logs and correlation IDs, debugging a user issue looks like this:

# Find all logs for a specific request
grep '"trace_id":"abc123def456"' /var/log/app.log

# Or in your log aggregator
query: trace_id = "abc123def456"

Search for the trace ID and you get the incoming request, database queries, cache hits, outgoing API calls, and the error that occurred.

What to Include in Logs

Context matters. The more relevant context you include, the easier debugging becomes.

Essential Fields

Every log entry needs at minimum:

  • timestamp: ISO 8601 format in UTC
  • level: Log severity level
  • service: Which service generated this log
  • message: Human-readable description
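A small helper can guarantee those four fields on every entry, with anything else merging in as context (a sketch; field names follow the list above):

```python
import datetime
import json

def make_entry(level, message, service, **context):
    # The four minimum fields first; request/business context merges on top.
    entry = {
        "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "level": level,
        "service": service,
        "message": message,
    }
    entry.update(context)
    return json.dumps(entry)

print(make_entry("INFO", "Request completed", "api-gateway", status=200))
```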

Request Context

For web services, include:

  • request_id or trace_id
  • user_id (if authenticated)
  • HTTP method, path, status code
  • Client IP address
  • User agent

{
  "timestamp": "2026-03-22T14:32:01.456Z",
  "level": "INFO",
  "service": "api-gateway",
  "message": "Request completed",
  "request_id": "req_abc123",
  "method": "GET",
  "path": "/api/users/usr_789",
  "status": 200,
  "duration_ms": 120,
  "ip": "192.168.1.42",
  "user_agent": "Mozilla/5.0..."
}

Business Events

For significant business events:

  • Event type (login, purchase, registration)
  • Entity IDs involved
  • Outcome (success, failure)
  • Duration if applicable
  • Any relevant metadata

{
  "timestamp": "2026-03-22T14:32:01.456Z",
  "level": "INFO",
  "service": "checkout-service",
  "message": "Order placed",
  "event": "order_placed",
  "order_id": "ord_xyz789",
  "customer_id": "cust_123",
  "total_amount": 99.99,
  "currency": "USD",
  "item_count": 3
}

What NOT to Log

Logging sensitive data creates security and compliance problems.

Never Log These

Never log:

  • Passwords or password hashes
  • Credit card numbers or CVV codes
  • Social security numbers or national IDs
  • API keys or secrets
  • Full authorization tokens (log the type and last 4 chars only)
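A sketch of the last-4-chars rule for tokens (the helper name is made up for illustration):

```python
def mask_token(token: str) -> str:
    # Keep the scheme (e.g. "Bearer") plus the last 4 characters only.
    scheme, _, value = token.partition(" ")
    if not value:                 # no scheme prefix present
        scheme, value = "", token
    masked = "***" + value[-4:] if len(value) > 4 else "***"
    return f"{scheme} {masked}".strip()

print(mask_token("Bearer abcdef123456"))  # Bearer ***3456
```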

Redact Sensitive Data

function redactSensitiveFields(
  obj: Record<string, unknown>,
): Record<string, unknown> {
  const sensitiveFields = ["password", "token", "secret", "creditCard", "ssn"];
  const redacted = { ...obj };

  for (const key of Object.keys(redacted)) {
    if (sensitiveFields.some((f) => key.toLowerCase().includes(f))) {
      redacted[key] = "[REDACTED]";
    } else if (typeof redacted[key] === "object") {
      redacted[key] = redactSensitiveFields(
        redacted[key] as Record<string, unknown>,
      );
    }
  }

  return redacted;
}

log.info(
  "User authenticated",
  redactSensitiveFields({ userId: "usr_123", password: "secret123" }),
);
// Logs: { userId: 'usr_123', password: '[REDACTED]' }

Log Aggregation Architecture

At scale, logs need to be collected, aggregated, and stored efficiently.

Common Architecture

graph LR
    A[Application] -->|stdout/JSON| B[Container Runtime]
    B --> C[Log Agent]
    C --> D[Log Aggregator]
    D --> E[Storage]
    D --> F[Search Interface]
    G[Analytics/BI] --> E

Container Logging

In containerized environments, applications write to stdout and stderr. The container runtime handles collection:

# Write logs to stdout, not files
# Bad: RUN echo "$(date) Log entry" >> /var/log/app.log
# Good: console.log(JSON.stringify({ timestamp, message }))

For applications that must write to files, use a sidecar log agent or mount a shared log directory:

# Pod with log volume
apiVersion: v1
kind: Pod
metadata:
  name: myapp
spec:
  containers:
    - name: app
      image: myapp:latest
      volumeMounts:
        - name: logs
          mountPath: /var/log/myapp
    - name: log-agent
      image: log-agent:latest
      volumeMounts:
        - name: logs
          mountPath: /var/log/myapp
        - name: agent-config
          mountPath: /etc/log-agent
  volumes:
    - name: logs
      emptyDir: {}
    - name: agent-config
      configMap:
        name: log-agent-config

Shipping Logs to Aggregators

Fluentd/Fluent Bit Configuration

# fluent-bit.conf
[SERVICE]
    Flush         5
    Daemon        Off
    Log_Level     info
    Parsers_File  parsers.conf

[INPUT]
    Name              tail
    Path              /var/log/containers/*.log
    Parser            docker
    Tag               container.*
    Refresh_Interval  5

[FILTER]
    Name                kubernetes
    Match               container.*
    Kube_URL            https://kubernetes.default.svc:443
    Kube_CA_File        /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    Kube_Token_File     /var/run/secrets/kubernetes.io/serviceaccount/token

[OUTPUT]
    Name        es
    Match       container.*
    Host        elasticsearch.logging.svc
    Port        9200
    Logstash_Format    On
    Logstash_Prefix    kubernetes
    Retry_Limit        False

Vector Configuration

Vector is a newer alternative with better performance and lower resource usage:

# vector.toml
[sources.docker]
type = "docker_logs"

[transforms.parse_json]
type = "remap"
inputs = ["docker"]
source = '.message = parse_json!(.message)'

[sinks.elasticsearch]
type = "elasticsearch"
inputs = ["parse_json"]
endpoint = "http://elasticsearch.logging.svc:9200"
index = "kubernetes-%Y.%m.%d"

Log Storage and Retention

Storage costs grow with log volume. Design retention policies carefully.

Retention Tiers

Tier | Duration | Use Case
Hot | 0-7 days | Real-time troubleshooting
Warm | 7-30 days | Investigating recent issues
Cold | 30-90 days | Compliance, audit
Archive | 1+ years | Legal requirements

Elasticsearch Index Lifecycle Management

{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": {
            "max_age": "7d",
            "max_size": "50gb"
          },
          "set_priority": 100
        }
      },
      "warm": {
        "min_age": "7d",
        "actions": {
          "shrink": { "number_of_shards": 1 },
          "forcemerge": { "max_num_segments": 1 },
          "set_priority": 50
        }
      },
      "cold": {
        "min_age": "30d",
        "actions": {
          "freeze": {},
          "set_priority": 0
        }
      },
      "delete": {
        "min_age": "365d",
        "actions": {
          "delete": {}
        }
      }
    }
  }
}

Performance Considerations

Logging can become a bottleneck if you don’t design it carefully.

Asynchronous Logging

Write logs asynchronously so they don’t block your application:

import logging
import queue
from threading import Thread

class AsyncLogHandler(logging.Handler):
    def __init__(self, batch_size=100, flush_interval=1.0):
        super().__init__()
        self.queue = queue.Queue(maxsize=10000)
        self.batch_size = batch_size
        self.flush_interval = flush_interval
        self.worker = Thread(target=self._process_logs, daemon=True)
        self.worker.start()

    def emit(self, record):
        try:
            self.queue.put_nowait(self.format(record))
        except queue.Full:
            pass  # Drop log if queue is full

    def _process_logs(self):
        batch = []
        while True:
            try:
                item = self.queue.get(timeout=self.flush_interval)
                batch.append(item)
                while len(batch) < self.batch_size:
                    item = self.queue.get_nowait()
                    batch.append(item)
            except queue.Empty:
                pass

            if batch:
                self._send_batch(batch)
                batch = []

    def _send_batch(self, batch):
        # Send to log aggregator
        pass

Sampling High-Volume Logs

For debug-level logs in high-traffic paths, sample to reduce volume:

// RateSampler is an assumed helper that returns true for ~10% of calls
const sampler = new RateSampler({ rate: 0.1 }); // 10% sample rate

// Only emit the debug entry when this call is sampled
if (sampler.sample()) {
  log.debug(
    {
      itemId: item.id,
    },
    "Item processing details",
  );
}
// Only actually logs ~10% of the time

Monitoring Log Health

Logs themselves need monitoring. If logging stops, you lose visibility into your systems.

Metrics to Track

  • Log ingestion rate (logs/second)
  • Log volume by service and level
  • Error rate in logs
  • Log processing latency
  • Log agent errors and restarts

Alert on Silence

# Prometheus alert for missing logs
- alert: LogIngestionSilence
  expr: |
    rate(fluentd_input_status_records_total[5m]) == 0
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "No logs being ingested from Fluentd"
    description: "Fluentd has not sent logs to Elasticsearch in 5 minutes"

When to Use Structured Logging

Use structured logging when:

  • Debugging requires cross-referencing multiple log entries
  • Requests span multiple services
  • You need selective debugging in high-volume APIs
  • Audit trails are required for compliance
  • You need to correlate logs with traces or metrics

Don’t use structured logging when:

  • Simple scripts or one-off utilities where stdout debugging suffices
  • Very low-traffic applications where unstructured grep suffices
  • Legacy systems where migration cost outweighs benefits
  • Development environments where DEBUG-level verbosity is acceptable

Trade-off Analysis

Aspect | Structured Logging | Plain Text Logging
Searchability | Field-level queries via log aggregators | grep/string matching only
Storage Cost | Higher (JSON overhead per line) | Lower (minimal formatting)
Parse Complexity | Zero (machine-readable by default) | Brittle (format changes break parsers)
Human Readability | Moderate (requires jq or aggregator UI) | High (direct reading in terminal)
Tooling Required | Log aggregator (ELK, Loki, Splunk) | None or basic text tools
Correlation | Automatic via shared fields | Manual trace ID injection
Performance Impact | Slight overhead for JSON serialization | Minimal

SLI/SLO/Error Budget Templates for Logging

Log-Based SLI Template

# logging-sli-config.yaml
service: logging-observability
environment: production

slis:
  - name: log_ingestion_success_rate
    description: "Percentage of emitted logs successfully ingested"
    query: |
      sum(rate(fluentd_output_status_num_logs_total{status="output"}[5m]))
      /
      sum(rate(fluentd_input_status_records_total[5m]))

  - name: log_processing_latency_p95
    description: "Time from log emit to searchable in aggregator"
    query: |
      histogram_quantile(0.95,
        sum(rate(fluentd_output_status_flush_interval_bucket[5m])) by (le)
      )

  - name: log_error_rate
    description: "ERROR level log rate as percentage of total"
    query: |
      sum(rate(log_entries_total{level="error"}[5m]))
      /
      sum(rate(log_entries_total[5m])) * 100

Log SLO Template

# logging-slo-config.yaml
objectives:
  - display_name: "Log Ingestion Availability"
    sli: log_ingestion_success_rate
    target: 99.5
    window: 30d
    description: "99.5% of emitted logs should be ingested"

  - display_name: "Log Processing Latency"
    sli: log_processing_latency_p95
    target: 99.0
    threshold_ms: 30000
    window: 30d
    description: "95% of logs should be searchable within 30 seconds"

  - display_name: "Log Error Rate"
    sli: log_error_rate
    target: 99.9
    threshold_percent: 1.0
    window: 30d
    description: "Error rate should stay below 1%"

Error Budget Calculator

# error-budget-calculator.py
def calculate_error_budget(slo_target, window_days=30):
    """
    Calculate error budget in minutes for a given SLO target.
    Example: 99.5% SLO over 30 days = 216 minutes of allowed errors
    """
    window_seconds = window_days * 24 * 60 * 60
    allowed_errors = window_seconds * (1 - slo_target)
    return allowed_errors / 60  # Convert to minutes

# Standard SLO error budgets (30-day window)
slo_budgets = {
    "99.0%": calculate_error_budget(0.990),  # 432 minutes = 7.2 hours
    "99.5%": calculate_error_budget(0.995),  # 216 minutes = 3.6 hours
    "99.9%": calculate_error_budget(0.999),  # 43.2 minutes
    "99.95%": calculate_error_budget(0.9995), # 21.6 minutes
    "99.99%": calculate_error_budget(0.9999), # 4.32 minutes
}

for slo, budget in slo_budgets.items():
    print(f"SLO {slo}: {budget:.2f} minutes error budget")

Multi-Window Burn-Rate Alerting for Log Quality

Burn-rate alerts detect when error budgets are being consumed faster than expected. This approach catches both sudden spikes and slow leaks.
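The standard multipliers follow from a simple formula: the burn rate at which a chosen fraction of the budget is consumed within the alert window. A quick sketch (the 2%/1h and 5%/6h pairings are the conventional fast/medium choices):

```python
def burn_rate_threshold(budget_fraction, alert_window_hours, slo_window_days=30):
    # Burn rate at which `budget_fraction` of the error budget is consumed
    # within `alert_window_hours`, for an SLO window of `slo_window_days`.
    slo_window_hours = slo_window_days * 24
    return budget_fraction * slo_window_hours / alert_window_hours

fast = burn_rate_threshold(0.02, 1)    # 2% of budget in 1 hour
medium = burn_rate_threshold(0.05, 6)  # 5% of budget in 6 hours
print(round(fast, 1), round(medium, 1))
```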

1-Hour Window Burn-Rate Alert (Fast Burn)

# Burn-rate alerts for logging
groups:
  - name: logging-burn-rate
    rules:
      # Fast burn: 1-hour window, 14.4x burn rate (burns 2% of budget in 1 hour)
      - alert: LogErrorBudgetFastBurn
        expr: |
          (
            sum(rate(log_entries_total{level="error"}[1h]))
            /
            sum(rate(log_entries_total[1h]))
          )
          > (1 - 0.999) * 14.4
        for: 5m
        labels:
          severity: critical
          category: logging
          window: 1h
        annotations:
          summary: "Log error budget burning fast (1h window)"
          description: "Error ratio {{ $value | humanize }} exceeds the 14.4x burn-rate threshold; at this pace the 30-day budget is exhausted in about 2 days."

6-Hour Window Burn-Rate Alert (Medium Burn)

# Medium burn: 6-hour window, 6x burn rate (burns 5% of budget in 6 hours)
- alert: LogErrorBudgetMediumBurn
  expr: |
    (
      sum(rate(log_entries_total{level="error"}[6h]))
      /
      sum(rate(log_entries_total[6h]))
    )
    > (1 - 0.999) * 6
  for: 30m
  labels:
    severity: warning
    category: logging
    window: 6h
  annotations:
    summary: "Log error budget burning (6h window)"
    description: "Error rate is burning budget {{ $value | humanize }}x faster than sustainable. Check for sustained error patterns."

Multi-Window Burn-Rate Alert Set

# Complete burn-rate alert set (multi-window)
- alert: LogErrorBudgetBurnAllWindows
  expr: |
    (
      sum(rate(log_entries_total{level="error"}[1h]))
      /
      sum(rate(log_entries_total[1h]))
    )
    > (1 - 0.999) * 14.4
    or
    (
      sum(rate(log_entries_total{level="error"}[6h]))
      /
      sum(rate(log_entries_total[6h]))
    )
    > (1 - 0.999) * 6
  for: 5m
  labels:
    severity: critical
    category: logging
  annotations:
    summary: "Log error budget burning across multiple time windows"
    description: |
      Multi-window burn-rate alert triggered.
      Current error ratio: {{ $value | humanize }}, above the 1h (14.4x) or 6h (6x) burn-rate threshold.
      Review error patterns and allocate incident resources.

SLO Error Budget Dashboard Panels

{
  "dashboard": {
    "title": "Logging SLO Error Budget",
    "panels": [
      {
        "title": "Error Budget Remaining (30d)",
        "type": "gauge",
        "targets": [
          {
            "expr": "(1 - (sum(rate(log_entries_total{level=\"error\"}[30d])) / sum(rate(log_entries_total[30d]))) / (1 - 0.999)) * 100",
            "legendFormat": "Budget Remaining %"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "min": 0,
            "max": 100,
            "thresholds": {
              "steps": [
                { "value": 0, "color": "red" },
                { "value": 50, "color": "yellow" },
                { "value": 90, "color": "green" }
              ]
            }
          }
        }
      },
      {
        "title": "Burn Rate (1h)",
        "type": "graph",
        "targets": [
          {
            "expr": "(sum(rate(log_entries_total{level=\"error\"}[1h])) / sum(rate(log_entries_total[1h]))) / (1 - 0.999)",
            "legendFormat": "Burn Rate"
          }
        ]
      },
      {
        "title": "Projected Budget Exhaustion",
        "type": "stat",
        "targets": [
          {
            "expr": "720 / ((sum(rate(log_entries_total{level=\"error\"}[1h])) / sum(rate(log_entries_total[1h]))) / (1 - 0.999))",
            "legendFormat": "Hours until budget exhausted"
          }
        ]
      }
    ]
  }
}

Observability Hooks for Logging

This section defines what to log, measure, trace, and alert for logging systems themselves.

Log (What to Emit)

Event | Fields | Level
Log ingestion started | service, host, agent_version | INFO
Log ingestion stopped | service, host, reason | WARN
Buffer approaching full | host, buffer_used_percent, buffer_limit | WARN
Malformed log detected | host, parse_error_type, sample | WARN
Retry attempt | host, destination, attempt, max_attempts | DEBUG
Batch sent successfully | host, destination, batch_size, bytes_sent | DEBUG
Authentication failure | host, client_ip, reason | WARN

Measure (Metrics to Collect)

Metric | Type | Description
log_emitted_total | Counter | Total logs emitted by service
log_ingested_total | Counter | Total logs ingested to aggregator
log_dropped_total | Counter | Logs dropped due to errors/full buffers
log_processing_latency_seconds | Histogram | Time from emit to searchable
log_buffer_utilization_percent | Gauge | Buffer fill percentage
log_parsing_errors_total | Counter | Malformed log entries
log_bytes_sent_total | Counter | Bytes sent to aggregators
log_aggregator_queue_depth | Gauge | Pending logs in aggregator queue

Trace (Correlation Points)

Operation | Trace Attribute | Purpose
Log emit | log.aggregate | Track logs from emit through aggregation
Log parsing | log.parse.status | Monitor parsing health
Log shipping | log.ship.destination | Track delivery to aggregators
Batch processing | log.batch.size | Monitor batch efficiency

Alert (When to Page)

Alert | Condition | Severity | Purpose
Log Silence | No logs received for 5 minutes | P1 Critical | Log pipeline failure
High Drop Rate | Drop rate > 1% for 5 minutes | P2 High | Pipeline health
Buffer Critical | Buffer > 90% full | P2 High | Prevent data loss
Parse Error Spike | Parse errors > 100/min | P3 Medium | Data quality
Latency High | Processing latency > 30s p95 | P3 Medium | Performance degradation

Alerting Hook Template

# logging-observability-hooks.yaml
groups:
  - name: logging-observability-hooks
    rules:
      # Alert on silence - no logs coming in
      - alert: LoggingPipelineSilence
        expr: rate(fluentd_input_status_records_total[5m]) == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "No logs being ingested (Alert on Silence)"
          description: "Fluentd/Bit has not sent logs to Elasticsearch in 5 minutes. Either the log pipeline is down or all services have stopped logging."

      # Alert on high drop rate
      - alert: LoggingDropRateHigh
        expr: |
          sum(rate(fluentd_output_status_num_errors_total[5m]))
          /
          sum(rate(fluentd_input_status_records_total[5m])) > 0.01
        for: 5m
        labels:
          severity: high
        annotations:
          summary: "Log drop rate above 1%"
          description: "{{ $value | humanizePercentage }} of logs are being dropped. Check Fluentd/Bit error logs."

      # Alert on buffer approaching full
      - alert: LoggingBufferCritical
        expr: fluentd_buffer_queue_length / fluentd_buffer_limit > 0.9
        for: 5m
        labels:
          severity: high
        annotations:
          summary: "Log buffer above 90% capacity"
          description: "Fluentd/Bit buffer is filling up. Risk of log loss if not addressed."

      # Alert on high parsing errors
      - alert: LoggingParseErrorSpike
        expr: rate(log_parsing_errors_total[5m]) * 60 > 100
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High log parsing error rate"
          description: "More than 100 parsing errors per minute. Review log format consistency."

      # Alert on processing latency
      - alert: LoggingProcessingLatencyHigh
        expr: |
          histogram_quantile(0.95,
            sum(rate(fluentd_output_status_flush_interval_bucket[5m])) by (le)
          ) > 30
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Log processing latency above 30 seconds"
          description: "P95 log processing latency is {{ $value }}s. Logs may not be searchable in real-time."

      # SLO error budget burn rate
      - alert: LoggingErrorBudgetBurningFast
        expr: |
          (
            sum(rate(log_entries_total{level="error"}[1h]))
            /
            sum(rate(log_entries_total[1h]))
          ) > (1 - 0.999) * 14.4
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Log error budget burning at unsustainable rate"
          description: "Error budget is being consumed 14.4x faster than sustainable. Immediate investigation required."

Production Failure Scenarios

Failure | Impact | Mitigation
Log aggregation pipeline downtime | No new logs searchable; teams blind to issues | Buffer logs locally; implement retry with backoff; alert on pipeline health
Elasticsearch cluster saturation | Log ingestion backs up; logs dropped | Monitor ES cluster health; implement backpressure; use ILM to manage indices
Corrupted log data | Searches return incomplete results; debugging misses context | Validate JSON structure at ingestion; use dead-letter queues for malformed logs
Sensitive data logged | Security/compliance breach; potential data exposure | Implement redaction middleware; scan logs before storage; educate developers
Excessive log volume | Storage costs spike; performance degradation | Implement sampling for DEBUG logs; enforce log level policies; archive aggressively
Missing correlation IDs | Cannot trace requests across services | Auto-inject correlation IDs in middleware; reject requests without trace context in high-security paths

Observability Checklist

Key Log Metrics

  • Log ingestion rate (logs/second) by service and level
  • Log volume by service, level, and environment
  • Error rate in logs (ERROR level count over time)
  • Log processing latency (time from log emit to searchable)
  • Log agent errors and restarts
  • Storage utilization per index

Logs You Should Have

  • Request logs with trace_id, user_id, method, path, status, duration_ms
  • Authentication events (login attempts, failures, token refreshes)
  • Business events (orders, payments, registrations) with entity IDs
  • Database query logs for slow queries (>100ms threshold)
  • External API call logs with request/response timing
  • Background job start/complete/fail logs with job IDs
  • Health check and readiness probe logs
  • Configuration change logs (who changed what when)

Alerts You Need

  • No logs received from a service for >5 minutes (Alert on Silence pattern)
  • Error rate spike above baseline (unexpected errors)
  • Log volume anomaly (sudden drop or spike)
  • Log processing latency >30 seconds
  • Elasticsearch cluster health degraded (yellow/red)
  • Log agent restart detected

Security Checklist

  • No passwords, API keys, or secrets in log output
  • Credit card numbers, CVV, SSN never logged
  • Authorization tokens logged as type + last 4 chars only (e.g., “Bearer ***abc123”)
  • PII fields identified and redacted in redaction middleware
  • Log access requires authentication and is audited
  • Log aggregation pipeline uses TLS in transit
  • Elasticsearch access restricted to authorized personnel
  • Log retention complies with data retention policies
  • Sensitive data cannot be searched in Kibana/ES by unauthorized users

Common Pitfalls / Anti-Patterns

1. Logging Everything at DEBUG in Production

DEBUG-level logging in high-throughput services generates gigabytes per hour. Use sampling for debug scenarios, or enable DEBUG selectively via feature flags for specific request IDs.

2. Plain Text Logging with String Concatenation

// Bad: Cannot search, parse, or aggregate
logger.info("User " + userId + " purchased " + item);

// Good: Structured, searchable, aggregatable
logger.info("User purchased item", { userId, itemId, itemName, price });

3. Missing Trace Context Propagation

Logs without correlation IDs are useless for tracing requests across services. Always propagate trace_id through HTTP headers, database connections, and message queues.

4. Logging Sensitive Data

Never log passwords, full tokens, credit card numbers, or PII. Implement redaction at the logger level, not the application level, to catch mistakes.
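One way to push redaction into the logger itself, sketched with Python's stdlib filters (the sensitive-key list mirrors the earlier redaction example):

```python
import logging

SENSITIVE = ("password", "token", "secret", "creditcard", "ssn")

class RedactionFilter(logging.Filter):
    # Scrub sensitive keys from dict payloads once, at the logger,
    # so individual call sites cannot forget to do it.
    def filter(self, record):
        if isinstance(record.msg, dict):
            record.msg = {
                k: "[REDACTED]" if any(s in k.lower() for s in SENSITIVE) else v
                for k, v in record.msg.items()
            }
        return True

logger = logging.getLogger("secure")
logger.addFilter(RedactionFilter())
logger.warning({"user_id": "usr_123", "password": "hunter2"})
# The emitted record carries password: [REDACTED]
```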

5. Synchronous Logging to Network Storage

Writing logs synchronously to a remote log server adds latency to every operation. Use async logging with local buffering and background shipping.

6. No Log Retention Policy

Without retention policies, storage costs grow unbounded. Define hot/warm/cold/archive tiers and automate data lifecycle management.

7. Logs as the Only Observability Signal

Relying solely on logs for debugging is insufficient at scale. Combine logs with metrics and traces for complete observability.

Quick Recap

Key Takeaways:

  • Structured JSON logs enable efficient searching and aggregation
  • Correlation IDs connect logs across service boundaries
  • Log levels filter noise: DEBUG for development, ERROR/WARN/INFO for production
  • Never log sensitive data; always implement redaction
  • Monitor your monitors: log aggregation needs its own observability
  • Retention policies prevent unbounded storage growth

Copy/Paste Checklist:

# Verify structured logging format
grep -c '"timestamp".*"level".*"message".*"service"' /var/log/app.json

# Find logs for specific trace
grep '"trace_id":"abc123"' /var/log/app.json

# Count errors by service
jq 'select(.level == "ERROR") | .service' /var/log/app.json | sort | uniq -c

# Alert on log silence (Prometheus)
- alert: LogIngestionSilence
  expr: rate(fluentd_input_status_records_total[5m]) == 0
  for: 5m
  labels:
    severity: critical

# Redaction function (TypeScript)
const sensitiveFields = ['password', 'token', 'secret', 'creditCard', 'ssn'];
function redact(obj) {
  return Object.fromEntries(
    Object.entries(obj).map(([k, v]) =>
      sensitiveFields.some(f => k.toLowerCase().includes(f)) ? [k, '[REDACTED]'] : [k, v]
    )
  );
}

Interview Questions

Q: A user reports a bug but provides no details beyond “the checkout failed.” How do you find the relevant logs?

A: Ask for the approximate time, user ID, or order ID. With a timestamp window, query logs for that timeframe filtering on the service handling checkout. With a user ID, search for all log entries tagged with that user ID. With a correlation ID from their session, search for that ID across all services to reconstruct the full request path. In ELK/Kibana: timeframe AND service.name: checkout-service AND "checkout failed". If no direct match, search for errors in the checkout service within the time window, then trace back via correlation IDs to find the root cause service.

Q: Your Elasticsearch cluster is running out of disk space. How do you reduce storage without losing searchable data?

A: Immediate mitigation: force a flush to free up translog space, delete old indices beyond your retention policy, and consider a readonly index for historical data. For ongoing cost reduction: reduce replica count in hot-warm architectures, use ILM policies to move older indices to cheaper storage tiers (frozen or cold), and reduce shard count, since too many small shards waste overhead. Audit field mappings to check whether you can reduce the number of indexed fields, setting doc_values: false on fields used only for filtering. Finally, enforce log volume budgets per service to prevent any single service from overwhelming the cluster.

Q: What is the relationship between correlation IDs, trace IDs, and span IDs in distributed tracing?

A: A trace ID is a unique identifier for an entire request transaction across all services; it stitches together every span. A span ID represents a single unit of work within that trace (one service call, one database query). Correlation IDs are typically an application-level business identifier (order ID, user session ID) that helps you filter logs across services without relying on trace IDs. In practice: the trace ID propagates via HTTP headers (X-B3-TraceId in Zipkin, traceparent in the W3C Trace Context standard) through every service call. Each service creates a span with the incoming trace ID and its own span ID, creating a parent-child tree of operations.

Q: You find that DEBUG logs are missing during production incidents. What logging level should you set in production and why?

A: Production should run at INFO or WARN in most services; DEBUG is too noisy for production traffic volumes and can itself cause performance problems (disk I/O, log storage costs). However, when an incident is active, dynamically raising a specific service to DEBUG via a config change allows targeted debugging without impacting all services. The pattern: have a mechanism to change log level at runtime (via a config map reload, a logging API endpoint, or a Kubernetes annotation). Keep DEBUG in staging and development environments where you need it for development iteration.

Q: How do you handle sensitive data like PII appearing in logs?

A: The best approach is to never log PII in the first place: sanitize before logging by configuring your logger to mask fields like email addresses, credit cards, and phone numbers using a redaction library. In your logging framework (Pino for Node, zap for Go, structlog for Python), add a field filter that replaces sensitive patterns with [REDACTED]. Alternatively, use a log processor (Fluentd filter, Logstash mutate) to strip or hash sensitive fields before forwarding. Also configure your SIEM and log storage to mark PII fields as sensitive so analysts are warned. Audit your logs regularly with an automated PII scanner to catch accidental leakage.

Conclusion

Good logging practices pay off when you need them most: debugging production issues at 2am. Structured logs with correlation IDs let you trace requests across service boundaries. Appropriate log levels keep noise manageable. Retention policies balance cost with compliance requirements.

Start with JSON structured logging in your applications. Add correlation ID propagation early. Build log aggregation before you need it, not during an incident.

For deeper observability, combine logging with the Metrics, Monitoring & Alerting and Distributed Tracing practices covered in our other guides. These three pillars work together: logs show you what happened, metrics show you patterns, and traces show you why it happened.


Related Posts

Alerting in Production: Building Alerts That Matter

Build alerting systems that catch real problems without fatigue. Learn alert design principles, severity levels, runbooks, and on-call best practices.

#data-engineering #alerting #monitoring

The Observability Engineering Mindset: Beyond Monitoring

Transition from traditional monitoring to full observability: structured logs, metrics, traces, and the cultural practices that make observability teams successful.

#observability #engineering #sre

Metrics, Monitoring, and Alerting: From SLIs to Alerts

Learn the RED and USE methods, SLIs/SLOs/SLAs, and how to build alerting systems that catch real problems. Includes examples for web services and databases.

#observability #monitoring #metrics