Logging Best Practices: Structured Logs, Levels, Aggregation

Master production logging with structured formats, proper log levels, correlation IDs, and scalable log aggregation. Includes patterns for containerized applications.

Reading time: 24 min read

Logging Best Practices: Structured Logs, Levels, and Aggregation

Logs are your primary tool for understanding what happens in production when users report issues. If you have ever spent hours searching through plain text log files trying to find a single error, you know how painful unstructured logging can be.

This guide covers structured formats, log levels, correlation IDs for tracing requests, and aggregation strategies that scale.

The Problem with Plain Text Logging

Most applications start with simple string logging:

logger.info("User " + userId + " logged in from " + ipAddress);

This approach has serious problems. Searching for all logs from a specific user requires parsing the string. Machine parsing is brittle and breaks when the format changes. Adding context requires string concatenation throughout your codebase.

Structured logging solves these problems by emitting machine-readable log entries that are also human-readable.

Structured Logging

Structured logs use a defined format, typically JSON, where each field has a specific meaning.

JSON Log Format

{
  "timestamp": "2026-03-22T14:32:01.456Z",
  "level": "INFO",
  "message": "User login successful",
  "service": "auth-service",
  "version": "2.1.0",
  "trace_id": "abc123def456",
  "user_id": "usr_789",
  "ip_address": "192.168.1.42",
  "duration_ms": 45,
  "environment": "production"
}

This format lets you search by any field, aggregate metrics across dimensions, and correlate logs with traces and metrics.
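To illustrate the difference, here is a minimal sketch of field-level filtering and aggregation over JSON log lines (field names follow the example entry above; the values are illustrative):

```python
import json

# Three sample entries in the format shown above.
lines = [
    '{"level": "INFO", "user_id": "usr_789", "duration_ms": 45}',
    '{"level": "ERROR", "user_id": "usr_123", "duration_ms": 310}',
    '{"level": "INFO", "user_id": "usr_789", "duration_ms": 52}',
]

entries = [json.loads(line) for line in lines]

# Field-level query: every entry for one user, no regex needed.
user_logs = [e for e in entries if e["user_id"] == "usr_789"]

# Field-level aggregation: mean latency across all entries.
avg_ms = sum(e["duration_ms"] for e in entries) / len(entries)

print(len(user_logs), round(avg_ms, 1))  # 2 135.7
```

With plain text logs, both operations would require fragile string parsing.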

Implementing Structured Logging

Most languages have structured logging libraries:

# Python with structlog
import structlog

structlog.configure(
    processors=[
        structlog.processors.TimeStamper(fmt="iso"),
        structlog.processors.JSONRenderer()
    ]
)

log = structlog.get_logger()

log.info("user_login",
    user_id="usr_789",
    ip_address="192.168.1.42",
    duration_ms=45
)

// TypeScript with pino
import pino from "pino";

const log = pino({
  level: "info",
  base: {
    service: "auth-service",
    version: process.env.APP_VERSION,
  },
});

log.info(
  {
    userId: "usr_789",
    ipAddress: "192.168.1.42",
    durationMs: 45,
  },
  "User login successful",
);

// Go with zerolog
import (
    "github.com/rs/zerolog"
    "github.com/rs/zerolog/log"
)

zerolog.TimeFieldFormat = zerolog.TimeFormatUnix

log.Info().
    Str("user_id", "usr_789").
    Str("ip_address", "192.168.1.42").
    Int("duration_ms", 45).
    Msg("User login successful")

Log Levels

Log levels help filter noise. Not every log entry needs to be visible during normal operations.

Standard Log Levels

Level | Purpose | When to Use
DEBUG | Detailed diagnostic information | During development and troubleshooting
INFO | Confirmation that things work as expected | Significant business events
WARN | Unexpected but handled situations | Recoverable errors, degraded states
ERROR | Errors that need attention | Failures that affect requests
FATAL | System cannot continue | Critical failures requiring immediate action

Choosing the Right Level

This feels intuitive but gets harder at scale. A few guidelines:

  • INFO for business events like orders placed, users registered. You want these for analytics and auditing.
  • WARN for situations that require attention but the system continues: retries succeeded, cache misses, degraded mode.
  • ERROR for failures that affect the current request: database timeout, external API failure, validation error.
  • DEBUG for information that helps during development but would overwhelm production: loop iterations, intermediate values.

Don’t log DEBUG in production unless you can enable it selectively for specific requests. A debug log in a hot path can generate gigabytes per hour.
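One sketch of selective DEBUG, using a stdlib logging filter (the flagging mechanism — a request header or feature flag — is assumed; a plain set stands in for it here):

```python
import logging

# Request IDs explicitly flagged for verbose logging.
DEBUG_REQUEST_IDS = {"req_42"}

class SelectiveDebugFilter(logging.Filter):
    def filter(self, record):
        # DEBUG records pass only for explicitly flagged requests.
        if record.levelno == logging.DEBUG:
            return getattr(record, "request_id", None) in DEBUG_REQUEST_IDS
        return True  # non-DEBUG records always pass

logger = logging.getLogger("hot-path")
logger.setLevel(logging.DEBUG)              # level alone would flood prod
logger.addFilter(SelectiveDebugFilter())

logger.debug("loop detail", extra={"request_id": "req_42"})  # emitted
logger.debug("loop detail", extra={"request_id": "req_99"})  # dropped
```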

Correlation IDs

When a request flows through multiple services, correlation IDs let you follow it across all logs.

Propagating Trace Context

// Middleware to extract or generate correlation ID
function correlationMiddleware(req, res, next) {
  const traceId = req.headers["x-trace-id"] || generateUUID();
  req.correlationId = traceId;
  res.setHeader("x-trace-id", traceId);

  // Attach a child logger so every entry in this request carries the trace ID
  req.log = log.child({ traceId });

  next();
}
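In Python, the same binding can be done with contextvars, so every log call inside a request's execution context picks up the trace ID without explicit plumbing (the helper names here are illustrative):

```python
import contextvars
import json
import uuid

# Holds the current request's trace ID for the active execution context.
trace_id_var = contextvars.ContextVar("trace_id", default=None)

def start_request(incoming_trace_id=None):
    # Extract the inbound ID or mint a new one, as in the middleware above.
    trace_id_var.set(incoming_trace_id or uuid.uuid4().hex)

def log_info(message, **fields):
    # Every entry automatically carries the bound trace ID.
    return json.dumps({
        "level": "INFO",
        "message": message,
        "trace_id": trace_id_var.get(),
        **fields,
    })

start_request("abc123def456")
print(log_info("User login successful", user_id="usr_789"))
```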

Propagate the correlation ID to all downstream calls:

// Outgoing HTTP request
fetch("https://api.example.com/users", {
  headers: {
    "X-Trace-ID": req.correlationId,
  },
});

// Database queries
db.query("SELECT * FROM users WHERE id = $1", [userId], {
  traceId: req.correlationId,
});

// Message queue messages
queue.send({
  payload: orderData,
  headers: {
    "X-Trace-ID": req.correlationId,
  },
});

With structured logs and correlation IDs, debugging a user issue looks like this:

# Find all logs for a specific request
grep '"trace_id":"abc123def456"' /var/log/app.log

# Or in your log aggregator
query: trace_id = "abc123def456"

Search for the trace ID and you get the incoming request, database queries, cache hits, outgoing API calls, and the error that occurred.

What to Include in Logs

Context matters. The more relevant context you include, the easier debugging becomes.

Essential Fields

Every log entry needs at minimum:

  • timestamp: ISO 8601 format in UTC
  • level: Log severity level
  • service: Which service generated this log
  • message: Human-readable description
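A small helper can guarantee those four fields on every entry, with anything else merging in as context (a sketch; field names follow the list above):

```python
import datetime
import json

def make_entry(level, message, service, **context):
    # The four minimum fields first; request/business context merges on top.
    entry = {
        "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "level": level,
        "service": service,
        "message": message,
    }
    entry.update(context)
    return json.dumps(entry)

print(make_entry("INFO", "Request completed", "api-gateway", status=200))
```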

Request Context

For web services, include:

  • request_id or trace_id
  • user_id (if authenticated)
  • HTTP method, path, status code
  • Client IP address
  • User agent

{
  "timestamp": "2026-03-22T14:32:01.456Z",
  "level": "INFO",
  "service": "api-gateway",
  "message": "Request completed",
  "request_id": "req_abc123",
  "method": "GET",
  "path": "/api/users/usr_789",
  "status": 200,
  "duration_ms": 120,
  "ip": "192.168.1.42",
  "user_agent": "Mozilla/5.0..."
}

Business Events

For significant business events:

  • Event type (login, purchase, registration)
  • Entity IDs involved
  • Outcome (success, failure)
  • Duration if applicable
  • Any relevant metadata

{
  "timestamp": "2026-03-22T14:32:01.456Z",
  "level": "INFO",
  "service": "checkout-service",
  "message": "Order placed",
  "event": "order_placed",
  "order_id": "ord_xyz789",
  "customer_id": "cust_123",
  "total_amount": 99.99,
  "currency": "USD",
  "item_count": 3
}

What NOT to Log

Logging sensitive data creates security and compliance problems.

Never Log These

Never log:

  • Passwords or password hashes
  • Credit card numbers or CVV codes
  • Social security numbers or national IDs
  • API keys or secrets
  • Full authorization tokens (log the type and last 4 chars only)
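A sketch of the last-4-chars rule for tokens (the helper name is made up for illustration):

```python
def mask_token(token: str) -> str:
    # Keep the scheme (e.g. "Bearer") plus the last 4 characters only.
    scheme, _, value = token.partition(" ")
    if not value:                 # no scheme prefix present
        scheme, value = "", token
    masked = "***" + value[-4:] if len(value) > 4 else "***"
    return f"{scheme} {masked}".strip()

print(mask_token("Bearer abcdef123456"))  # Bearer ***3456
```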

Redact Sensitive Data

function redactSensitiveFields(
  obj: Record<string, unknown>,
): Record<string, unknown> {
  const sensitiveFields = ["password", "token", "secret", "creditCard", "ssn"];
  const redacted = { ...obj };

  for (const key of Object.keys(redacted)) {
    if (sensitiveFields.some((f) => key.toLowerCase().includes(f))) {
      redacted[key] = "[REDACTED]";
    } else if (typeof redacted[key] === "object") {
      redacted[key] = redactSensitiveFields(
        redacted[key] as Record<string, unknown>,
      );
    }
  }

  return redacted;
}

log.info(
  "User authenticated",
  redactSensitiveFields({ userId: "usr_123", password: "secret123" }),
);
// Logs: { userId: 'usr_123', password: '[REDACTED]' }

Log Aggregation Architecture

At scale, logs need to be collected, aggregated, and stored efficiently.

Common Architecture

graph LR
    A[Application] -->|stdout/JSON| B[Container Runtime]
    B --> C[Log Agent]
    C --> D[Log Aggregator]
    D --> E[Storage]
    D --> F[Search Interface]
    G[Analytics/BI] --> E

Container Logging

In containerized environments, applications write to stdout and stderr. The container runtime handles collection:

# Write logs to stdout, not files
# Bad: RUN echo "$(date) Log entry" >> /var/log/app.log
# Good: console.log(JSON.stringify({ timestamp, message }))

For applications that must write to files, use a sidecar log agent or mount a shared log directory:

# Pod with log volume
apiVersion: v1
kind: Pod
metadata:
  name: myapp
spec:
  containers:
    - name: app
      image: myapp:latest
      volumeMounts:
        - name: logs
          mountPath: /var/log/myapp
    - name: log-agent
      image: log-agent:latest
      volumeMounts:
        - name: logs
          mountPath: /var/log/myapp
        - name: agent-config
          mountPath: /etc/log-agent
  volumes:
    - name: logs
      emptyDir: {}
    - name: agent-config
      configMap:
        name: log-agent-config

Shipping Logs to Aggregators

Fluentd/Fluent Bit Configuration

# fluent-bit.conf
[SERVICE]
    Flush         5
    Daemon        Off
    Log_Level     info
    Parsers_File  parsers.conf

[INPUT]
    Name              tail
    Path              /var/log/containers/*.log
    Parser            docker
    Tag               container.*
    Refresh_Interval  5

[FILTER]
    Name                kubernetes
    Match               container.*
    Kube_URL            https://kubernetes.default.svc:443
    Kube_CA_File        /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    Kube_Token_File     /var/run/secrets/kubernetes.io/serviceaccount/token

[OUTPUT]
    Name        es
    Match       container.*
    Host        elasticsearch.logging.svc
    Port        9200
    Logstash_Format    On
    Logstash_Prefix    kubernetes
    Retry_Limit        False

Vector Configuration

Vector is a newer alternative with better performance and lower resource usage:

# vector.toml
[sources.docker]
type = "docker_logs"

[transforms.parse_json]
type = "remap"
inputs = ["docker"]
source = '.message = parse_json!(.message)'

[sinks.elasticsearch]
type = "elasticsearch"
inputs = ["parse_json"]
endpoint = "http://elasticsearch.logging.svc:9200"
index = "kubernetes-%Y.%m.%d"

Log Storage and Retention

Storage costs grow with log volume. Design retention policies carefully.

Retention Tiers

Tier | Duration | Use Case
Hot | 0-7 days | Real-time troubleshooting
Warm | 7-30 days | Investigating recent issues
Cold | 30-90 days | Compliance, audit
Archive | 1+ years | Legal requirements

Elasticsearch Index Lifecycle Management

{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": {
            "max_age": "7d",
            "max_size": "50gb"
          },
          "set_priority": 100
        }
      },
      "warm": {
        "min_age": "7d",
        "actions": {
          "shrink": { "number_of_shards": 1 },
          "forcemerge": { "max_num_segments": 1 },
          "set_priority": 50
        }
      },
      "cold": {
        "min_age": "30d",
        "actions": {
          "freeze": {},
          "set_priority": 0
        }
      },
      "delete": {
        "min_age": "365d",
        "actions": {
          "delete": {}
        }
      }
    }
  }
}

Performance Considerations

Logging can become a bottleneck if you don’t design it carefully.

Asynchronous Logging

Write logs asynchronously so they don’t block your application:

import logging
import queue
from threading import Thread

class AsyncLogHandler(logging.Handler):
    def __init__(self, batch_size=100, flush_interval=1.0):
        super().__init__()
        self.queue = queue.Queue(maxsize=10000)
        self.batch_size = batch_size
        self.flush_interval = flush_interval
        self.worker = Thread(target=self._process_logs, daemon=True)
        self.worker.start()

    def emit(self, record):
        try:
            self.queue.put_nowait(self.format(record))
        except queue.Full:
            pass  # Drop log if queue is full

    def _process_logs(self):
        batch = []
        while True:
            try:
                item = self.queue.get(timeout=self.flush_interval)
                batch.append(item)
                while len(batch) < self.batch_size:
                    item = self.queue.get_nowait()
                    batch.append(item)
            except queue.Empty:
                pass

            if batch:
                self._send_batch(batch)
                batch = []

    def _send_batch(self, batch):
        # Send to log aggregator
        pass

Sampling High-Volume Logs

For debug-level logs in high-traffic paths, sample to reduce volume:

// RateSampler is an assumed helper that returns true for ~10% of calls
const sampler = new RateSampler({ rate: 0.1 }); // 10% sample rate

// Only emit the debug entry when this call is sampled
if (sampler.sample()) {
  log.debug(
    {
      itemId: item.id,
    },
    "Item processing details",
  );
}
// Only actually logs ~10% of the time

Monitoring Log Health

Logs themselves need monitoring. If logging stops, you lose visibility into your systems.

Metrics to Track

  • Log ingestion rate (logs/second)
  • Log volume by service and level
  • Error rate in logs
  • Log processing latency
  • Log agent errors and restarts

Alert on Silence

# Prometheus alert for missing logs
- alert: LogIngestionSilence
  expr: |
    rate(fluentd_input_status_records_total[5m]) == 0
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "No logs being ingested from Fluentd"
    description: "Fluentd has not sent logs to Elasticsearch in 5 minutes"

When to Use Structured Logging

Use structured logging when:

  • Debugging requires cross-referencing multiple log entries
  • Requests span multiple services
  • You need selective debugging in high-volume APIs
  • Audit trails are required for compliance
  • You need to correlate logs with traces or metrics

Don’t use structured logging when:

  • Simple scripts or one-off utilities where stdout debugging suffices
  • Very low-traffic applications where unstructured grep suffices
  • Legacy systems where migration cost outweighs benefits
  • Development environments where DEBUG-level verbosity is acceptable

Trade-off Analysis

Aspect | Structured Logging | Plain Text Logging
Searchability | Field-level queries via log aggregators | grep/string matching only
Storage Cost | Higher (JSON overhead per line) | Lower (minimal formatting)
Parse Complexity | Zero (machine-readable by default) | Brittle (format changes break parsers)
Human Readability | Moderate (requires jq or aggregator UI) | High (direct reading in terminal)
Tooling Required | Log aggregator (ELK, Loki, Splunk) | None or basic text tools
Correlation | Automatic via shared fields | Manual trace ID injection
Performance Impact | Slight overhead for JSON serialization | Minimal

SLI/SLO/Error Budget Templates for Logging

Log-Based SLI Template

# logging-sli-config.yaml
service: logging-observability
environment: production

slis:
  - name: log_ingestion_success_rate
    description: "Percentage of emitted logs successfully ingested"
    query: |
      sum(rate(fluentd_output_status_num_logs_total{status="output"}[5m]))
      /
      sum(rate(fluentd_input_status_records_total[5m]))

  - name: log_processing_latency_p95
    description: "Time from log emit to searchable in aggregator"
    query: |
      histogram_quantile(0.95,
        sum(rate(fluentd_output_status_flush_interval_bucket[5m])) by (le)
      )

  - name: log_error_rate
    description: "ERROR level log rate as percentage of total"
    query: |
      sum(rate(log_entries_total{level="error"}[5m]))
      /
      sum(rate(log_entries_total[5m])) * 100

Log SLO Template

# logging-slo-config.yaml
objectives:
  - display_name: "Log Ingestion Availability"
    sli: log_ingestion_success_rate
    target: 99.5
    window: 30d
    description: "99.5% of emitted logs should be ingested"

  - display_name: "Log Processing Latency"
    sli: log_processing_latency_p95
    target: 99.0
    threshold_ms: 30000
    window: 30d
    description: "95% of logs should be searchable within 30 seconds"

  - display_name: "Log Error Rate"
    sli: log_error_rate
    target: 99.9
    threshold_percent: 1.0
    window: 30d
    description: "Error rate should stay below 1%"

Error Budget Calculator

# error-budget-calculator.py
def calculate_error_budget(slo_target, window_days=30):
    """
    Calculate error budget in minutes for a given SLO target.
    Example: 99.5% SLO over 30 days = 216 minutes of allowed errors
    """
    window_seconds = window_days * 24 * 60 * 60
    allowed_errors = window_seconds * (1 - slo_target)
    return allowed_errors / 60  # Convert to minutes

# Standard SLO error budgets (30-day window)
slo_budgets = {
    "99.0%": calculate_error_budget(0.990),  # 432 minutes = 7.2 hours
    "99.5%": calculate_error_budget(0.995),  # 216 minutes = 3.6 hours
    "99.9%": calculate_error_budget(0.999),  # 43.2 minutes
    "99.95%": calculate_error_budget(0.9995), # 21.6 minutes
    "99.99%": calculate_error_budget(0.9999), # 4.32 minutes
}

for slo, budget in slo_budgets.items():
    print(f"SLO {slo}: {budget:.2f} minutes error budget")

Multi-Window Burn-Rate Alerting for Log Quality

Burn-rate alerts detect when error budgets are being consumed faster than expected. This approach catches both sudden spikes and slow leaks.
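The standard multipliers follow from a simple formula: the burn rate at which a chosen fraction of the budget is consumed within the alert window. A quick sketch (the 2%/1h and 5%/6h pairings are the conventional fast/medium choices):

```python
def burn_rate_threshold(budget_fraction, alert_window_hours, slo_window_days=30):
    # Burn rate at which `budget_fraction` of the error budget is consumed
    # within `alert_window_hours`, for an SLO window of `slo_window_days`.
    slo_window_hours = slo_window_days * 24
    return budget_fraction * slo_window_hours / alert_window_hours

fast = burn_rate_threshold(0.02, 1)    # 2% of budget in 1 hour
medium = burn_rate_threshold(0.05, 6)  # 5% of budget in 6 hours
print(round(fast, 1), round(medium, 1))
```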

1-Hour Window Burn-Rate Alert (Fast Burn)

# Burn-rate alerts for logging
groups:
  - name: logging-burn-rate
    rules:
      # Fast burn: 1-hour window, 14.4x burn rate (burns 2% of budget in 1 hour)
      - alert: LogErrorBudgetFastBurn
        expr: |
          (
            sum(rate(log_entries_total{level="error"}[1h]))
            /
            sum(rate(log_entries_total[1h]))
          )
          > (1 - 0.999) * 14.4
        for: 5m
        labels:
          severity: critical
          category: logging
          window: 1h
        annotations:
          summary: "Log error budget burning fast (1h window)"
          description: "Error ratio {{ $value | humanize }} exceeds the 14.4x burn-rate threshold; at this pace the 30-day budget is exhausted in about 2 days."

6-Hour Window Burn-Rate Alert (Medium Burn)

# Medium burn: 6-hour window, 6x burn rate (burns 5% of budget in 6 hours)
- alert: LogErrorBudgetMediumBurn
  expr: |
    (
      sum(rate(log_entries_total{level="error"}[6h]))
      /
      sum(rate(log_entries_total[6h]))
    )
    > (1 - 0.999) * 6
  for: 30m
  labels:
    severity: warning
    category: logging
    window: 6h
  annotations:
    summary: "Log error budget burning (6h window)"
    description: "Error rate is burning budget {{ $value | humanize }}x faster than sustainable. Check for sustained error patterns."

Multi-Window Burn-Rate Alert Set

# Complete burn-rate alert set (multi-window)
- alert: LogErrorBudgetBurnAllWindows
  expr: |
    (
      sum(rate(log_entries_total{level="error"}[1h]))
      /
      sum(rate(log_entries_total[1h]))
    )
    > (1 - 0.999) * 14.4
    or
    (
      sum(rate(log_entries_total{level="error"}[6h]))
      /
      sum(rate(log_entries_total[6h]))
    )
    > (1 - 0.999) * 6
  for: 5m
  labels:
    severity: critical
    category: logging
  annotations:
    summary: "Log error budget burning across multiple time windows"
    description: |
      Multi-window burn-rate alert triggered.
      Current error ratio: {{ $value | humanize }}, above the 1h (14.4x) or 6h (6x) burn-rate threshold.
      Review error patterns and allocate incident resources.

SLO Error Budget Dashboard Panels

{
  "dashboard": {
    "title": "Logging SLO Error Budget",
    "panels": [
      {
        "title": "Error Budget Remaining (30d)",
        "type": "gauge",
        "targets": [
          {
            "expr": "(1 - (sum(rate(log_entries_total{level=\"error\"}[30d])) / sum(rate(log_entries_total[30d]))) / (1 - 0.999)) * 100",
            "legendFormat": "Budget Remaining %"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "min": 0,
            "max": 100,
            "thresholds": {
              "steps": [
                { "value": 0, "color": "red" },
                { "value": 50, "color": "yellow" },
                { "value": 90, "color": "green" }
              ]
            }
          }
        }
      },
      {
        "title": "Burn Rate (1h)",
        "type": "graph",
        "targets": [
          {
            "expr": "(sum(rate(log_entries_total{level=\"error\"}[1h])) / sum(rate(log_entries_total[1h]))) / (1 - 0.999)",
            "legendFormat": "Burn Rate"
          }
        ]
      },
      {
        "title": "Projected Budget Exhaustion",
        "type": "stat",
        "targets": [
          {
            "expr": "720 / ((sum(rate(log_entries_total{level=\"error\"}[1h])) / sum(rate(log_entries_total[1h]))) / (1 - 0.999))",
            "legendFormat": "Hours until budget exhausted"
          }
        ]
      }
    ]
  }
}

Observability Hooks for Logging

This section defines what to log, measure, trace, and alert for logging systems themselves.

Log (What to Emit)

Event | Fields | Level
Log ingestion started | service, host, agent_version | INFO
Log ingestion stopped | service, host, reason | WARN
Buffer approaching full | host, buffer_used_percent, buffer_limit | WARN
Malformed log detected | host, parse_error_type, sample | WARN
Retry attempt | host, destination, attempt, max_attempts | DEBUG
Batch sent successfully | host, destination, batch_size, bytes_sent | DEBUG
Authentication failure | host, client_ip, reason | WARN

Measure (Metrics to Collect)

Metric | Type | Description
log_emitted_total | Counter | Total logs emitted by service
log_ingested_total | Counter | Total logs ingested to aggregator
log_dropped_total | Counter | Logs dropped due to errors/full buffers
log_processing_latency_seconds | Histogram | Time from emit to searchable
log_buffer_utilization_percent | Gauge | Buffer fill percentage
log_parsing_errors_total | Counter | Malformed log entries
log_bytes_sent_total | Counter | Bytes sent to aggregators
log_aggregator_queue_depth | Gauge | Pending logs in aggregator queue

Trace (Correlation Points)

Operation | Trace Attribute | Purpose
Log emit | log.aggregate | Track logs from emit through aggregation
Log parsing | log.parse.status | Monitor parsing health
Log shipping | log.ship.destination | Track delivery to aggregators
Batch processing | log.batch.size | Monitor batch efficiency

Alert (When to Page)

Alert | Condition | Severity | Purpose
Log Silence | No logs received for 5 minutes | P1 Critical | Log pipeline failure
High Drop Rate | Drop rate > 1% for 5 minutes | P2 High | Pipeline health
Buffer Critical | Buffer > 90% full | P2 High | Prevent data loss
Parse Error Spike | Parse errors > 100/min | P3 Medium | Data quality
Latency High | Processing latency > 30s p95 | P3 Medium | Performance degradation

Alerting Hook Template

# logging-observability-hooks.yaml
groups:
  - name: logging-observability-hooks
    rules:
      # Alert on silence - no logs coming in
      - alert: LoggingPipelineSilence
        expr: rate(fluentd_input_status_records_total[5m]) == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "No logs being ingested (Alert on Silence)"
          description: "Fluentd/Bit has not sent logs to Elasticsearch in 5 minutes. Either the log pipeline is down or all services have stopped logging."

      # Alert on high drop rate
      - alert: LoggingDropRateHigh
        expr: |
          sum(rate(fluentd_output_status_num_errors_total[5m]))
          /
          sum(rate(fluentd_input_status_records_total[5m])) > 0.01
        for: 5m
        labels:
          severity: high
        annotations:
          summary: "Log drop rate above 1%"
          description: "{{ $value | humanizePercentage }} of logs are being dropped. Check Fluentd/Bit error logs."

      # Alert on buffer approaching full
      - alert: LoggingBufferCritical
        expr: fluentd_buffer_queue_length / fluentd_buffer_limit > 0.9
        for: 5m
        labels:
          severity: high
        annotations:
          summary: "Log buffer above 90% capacity"
          description: "Fluentd/Bit buffer is filling up. Risk of log loss if not addressed."

      # Alert on high parsing errors
      - alert: LoggingParseErrorSpike
        expr: rate(log_parsing_errors_total[5m]) * 60 > 100
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High log parsing error rate"
          description: "More than 100 parsing errors per minute. Review log format consistency."

      # Alert on processing latency
      - alert: LoggingProcessingLatencyHigh
        expr: |
          histogram_quantile(0.95,
            sum(rate(fluentd_output_status_flush_interval_bucket[5m])) by (le)
          ) > 30
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Log processing latency above 30 seconds"
          description: "P95 log processing latency is {{ $value }}s. Logs may not be searchable in real-time."

      # SLO error budget burn rate
      - alert: LoggingErrorBudgetBurningFast
        expr: |
          (
            sum(rate(log_entries_total{level="error"}[1h]))
            /
            sum(rate(log_entries_total[1h]))
          ) > (1 - 0.999) * 14.4
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Log error budget burning at unsustainable rate"
          description: "Error budget is being consumed 14.4x faster than sustainable. Immediate investigation required."

Production Failure Scenarios

Failure | Impact | Mitigation
Log aggregation pipeline downtime | No new logs searchable; teams blind to issues | Buffer logs locally; implement retry with backoff; alert on pipeline health
Elasticsearch cluster saturation | Log ingestion backs up; logs dropped | Monitor ES cluster health; implement backpressure; use ILM to manage indices
Corrupted log data | Searches return incomplete results; debugging misses context | Validate JSON structure at ingestion; use dead-letter queues for malformed logs
Sensitive data logged | Security/compliance breach; potential data exposure | Implement redaction middleware; scan logs before storage; educate developers
Excessive log volume | Storage costs spike; performance degradation | Implement sampling for DEBUG logs; enforce log level policies; archive aggressively
Missing correlation IDs | Cannot trace requests across services | Auto-inject correlation IDs in middleware; reject requests without trace context in high-security paths

Observability Checklist

Key Log Metrics

  • Log ingestion rate (logs/second) by service and level
  • Log volume by service, level, and environment
  • Error rate in logs (ERROR level count over time)
  • Log processing latency (time from log emit to searchable)
  • Log agent errors and restarts
  • Storage utilization per index

Logs You Should Have

  • Request logs with trace_id, user_id, method, path, status, duration_ms
  • Authentication events (login attempts, failures, token refreshes)
  • Business events (orders, payments, registrations) with entity IDs
  • Database query logs for slow queries (>100ms threshold)
  • External API call logs with request/response timing
  • Background job start/complete/fail logs with job IDs
  • Health check and readiness probe logs
  • Configuration change logs (who changed what when)

Alerts You Need

  • No logs received from a service for >5 minutes (Alert on Silence pattern)
  • Error rate spike above baseline (unexpected errors)
  • Log volume anomaly (sudden drop or spike)
  • Log processing latency >30 seconds
  • Elasticsearch cluster health degraded (yellow/red)
  • Log agent restart detected

Security Checklist

  • No passwords, API keys, or secrets in log output
  • Credit card numbers, CVV, SSN never logged
  • Authorization tokens logged as type + last 4 chars only (e.g., “Bearer ***abc123”)
  • PII fields identified and redacted in redaction middleware
  • Log access requires authentication and is audited
  • Log aggregation pipeline uses TLS in transit
  • Elasticsearch access restricted to authorized personnel
  • Log retention complies with data retention policies
  • Sensitive data cannot be searched in Kibana/ES by unauthorized users

Common Pitfalls / Anti-Patterns

1. Logging Everything at DEBUG in Production

DEBUG-level logging in high-throughput services generates gigabytes per hour. Use sampling for debug scenarios, or enable DEBUG selectively via feature flags for specific request IDs.

2. Plain Text Logging with String Concatenation

// Bad: Cannot search, parse, or aggregate
logger.info("User " + userId + " purchased " + item);

// Good: Structured, searchable, aggregatable
logger.info("User purchased item", { userId, itemId, itemName, price });

3. Missing Trace Context Propagation

Logs without correlation IDs are useless for tracing requests across services. Always propagate trace_id through HTTP headers, database connections, and message queues.

4. Logging Sensitive Data

Never log passwords, full tokens, credit card numbers, or PII. Implement redaction at the logger level, not the application level, to catch mistakes.
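One way to push redaction into the logger itself, sketched with Python's stdlib filters (the sensitive-key list mirrors the earlier redaction example):

```python
import logging

SENSITIVE = ("password", "token", "secret", "creditcard", "ssn")

class RedactionFilter(logging.Filter):
    # Scrub sensitive keys from dict payloads once, at the logger,
    # so individual call sites cannot forget to do it.
    def filter(self, record):
        if isinstance(record.msg, dict):
            record.msg = {
                k: "[REDACTED]" if any(s in k.lower() for s in SENSITIVE) else v
                for k, v in record.msg.items()
            }
        return True

logger = logging.getLogger("secure")
logger.addFilter(RedactionFilter())
logger.warning({"user_id": "usr_123", "password": "hunter2"})
# The emitted record carries password: [REDACTED]
```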

5. Synchronous Logging to Network Storage

Writing logs synchronously to a remote log server adds latency to every operation. Use async logging with local buffering and background shipping.

6. No Log Retention Policy

Without retention policies, storage costs grow unbounded. Define hot/warm/cold/archive tiers and automate data lifecycle management.

7. Logs as the Only Observability Signal

Relying solely on logs for debugging is insufficient at scale. Combine logs with metrics and traces for complete observability.

Quick Recap

Key Takeaways:

  • Structured JSON logs enable efficient searching and aggregation
  • Correlation IDs connect logs across service boundaries
  • Log levels filter noise: DEBUG for development, ERROR/WARN/INFO for production
  • Never log sensitive data; always implement redaction
  • Monitor your monitors: log aggregation needs its own observability
  • Retention policies prevent unbounded storage growth

Copy/Paste Checklist:

# Verify structured logging format
grep -c '"timestamp".*"level".*"message".*"service"' /var/log/app.json

# Find logs for specific trace
grep '"trace_id":"abc123"' /var/log/app.json

# Count errors by service
jq 'select(.level == "ERROR") | .service' /var/log/app.json | sort | uniq -c

# Alert on log silence (Prometheus)
- alert: LogIngestionSilence
  expr: rate(fluentd_input_status_records_total[5m]) == 0
  for: 5m
  labels:
    severity: critical

# Redaction function (TypeScript)
const sensitiveFields = ['password', 'token', 'secret', 'creditCard', 'ssn'];
function redact(obj) {
  return Object.fromEntries(
    Object.entries(obj).map(([k, v]) =>
      sensitiveFields.some(f => k.toLowerCase().includes(f)) ? [k, '[REDACTED]'] : [k, v]
    )
  );
}

Interview Questions

Q: A user reports a bug but provides no details beyond “the checkout failed.” How do you find the relevant logs?

A: Ask for the approximate time, user ID, or order ID. With a timestamp window, query logs for that timeframe filtering on the service handling checkout. With a user ID, search for all log entries tagged with that user ID. With a correlation ID from their session, search for that ID across all services to reconstruct the full request path. In ELK/Kibana: timeframe AND service.name: checkout-service AND "checkout failed". If no direct match, search for errors in the checkout service within the time window, then trace back via correlation IDs to find the root cause service.

Q: Your Elasticsearch cluster is running out of disk space. How do you reduce storage without losing searchable data?

A: Immediate mitigation: force a flush to free up translog space, delete old indices beyond your retention policy, and consider a readonly index for historical data. For ongoing cost reduction: reduce replica count in hot-warm architectures, use ILM policies to move older indices to cheaper storage tiers (frozen or cold), and reduce shard count, since too many small shards waste overhead. Audit field mappings to check whether you can reduce the number of indexed fields, setting doc_values: false on fields used only for filtering. Finally, enforce log volume budgets per service to prevent any single service from overwhelming the cluster.

Q: What is the relationship between correlation IDs, trace IDs, and span IDs in distributed tracing?

A: A trace ID is a unique identifier for an entire request transaction across all services; it stitches together every span. A span ID represents a single unit of work within that trace (one service call, one database query). Correlation IDs are typically an application-level business identifier (order ID, user session ID) that helps you filter logs across services without relying on trace IDs. In practice: the trace ID propagates via HTTP headers (X-B3-TraceId in Zipkin, traceparent in the W3C Trace Context standard) through every service call. Each service creates a span with the incoming trace ID and its own span ID, creating a parent-child tree of operations.

Q: You find that DEBUG logs are missing during production incidents. What logging level should you set in production and why?

A: Production should run at INFO or WARN in most services; DEBUG is too noisy for production traffic volumes and can itself cause performance problems (disk I/O, log storage costs). However, when an incident is active, dynamically raising a specific service to DEBUG via a config change allows targeted debugging without impacting all services. The pattern: have a mechanism to change log level at runtime (via a config map reload, a logging API endpoint, or a Kubernetes annotation). Keep DEBUG in staging and development environments where you need it for development iteration.

Q: How do you handle sensitive data like PII appearing in logs?

A: The best approach is to never log PII in the first place: sanitize before logging by configuring your logger to mask fields like email addresses, credit cards, and phone numbers using a redaction library. In your logging framework (Pino for Node, zap for Go, structlog for Python), add a field filter that replaces sensitive patterns with [REDACTED]. Alternatively, use a log processor (Fluentd filter, Logstash mutate) to strip or hash sensitive fields before forwarding. Also configure your SIEM and log storage to mark PII fields as sensitive so analysts are warned. Audit your logs regularly with an automated PII scanner to catch accidental leakage.

Conclusion

Good logging practices pay off when you need them most: debugging production issues at 2am. Structured logs with correlation IDs let you trace requests across service boundaries. Appropriate log levels keep noise manageable. Retention policies balance cost with compliance requirements.

Start with JSON structured logging in your applications. Add correlation ID propagation early. Build log aggregation before you need it, not during an incident.

For deeper observability, combine logging with the Metrics, Monitoring & Alerting and Distributed Tracing practices covered in our other guides. These three pillars work together: logs show you what happened, metrics show you patterns, and traces show you why it happened.


Related Posts

Alerting in Production: Building Alerts That Matter

Build alerting systems that catch real problems without fatigue. Learn alert design principles, severity levels, runbooks, and on-call best practices.

#data-engineering #alerting #monitoring

The Observability Engineering Mindset: Beyond Monitoring

Transition from traditional monitoring to full observability: structured logs, metrics, traces, and the cultural practices that make observability teams successful.

#observability #engineering #sre

Metrics, Monitoring, and Alerting: From SLIs to Alerts

Learn the RED and USE methods, SLIs/SLOs/SLAs, and how to build alerting systems that catch real problems. Includes examples for web services and databases.

#observability #monitoring #metrics