ELK Stack: Elasticsearch, Logstash, Kibana, and Beats

Complete guide to the ELK Stack for log aggregation and analysis. Learn Elasticsearch indexing, Logstash pipelines, Kibana visualizations, and Beats shippers.

Author: GeekWorkBench | Reading time: 29 min

ELK Stack Deep Dive: Elasticsearch, Logstash, Kibana, and Beats

The ELK Stack is a popular open-source solution for centralized logging. It lets you collect logs from multiple sources, transform them into structured format, store them efficiently, and query them interactively.

This guide covers each component in depth. If you are new to logging concepts, read our Logging Best Practices guide first.

Introduction

graph LR
    A[Log Sources] -->|Shippers| B[Beats]
    B --> C[Logstash]
    C --> D[Elasticsearch]
    D --> E[Kibana]
    A -->|Direct| D

The ELK Stack has four main components:

  • Beats: Lightweight shippers that collect data from various sources
  • Logstash: Transforms and enriches data during transit
  • Elasticsearch: Stores and indexes data for fast search
  • Kibana: Visualizes and explores data

Core Concepts

Elasticsearch is a distributed search and analytics engine built on Apache Lucene. It stores documents in JSON format and provides powerful query capabilities.

Elasticsearch Key Concepts

| Concept | Description |
| --- | --- |
| Index | Collection of documents, similar to a database |
| Document | A single JSON record, similar to a row |
| Shard | A partition of an index for horizontal scaling |
| Replica | A copy of a shard for high availability |
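To make the shard row concrete, here is a minimal Python sketch of how a document could be routed to a primary shard. It is a simplification and not Elasticsearch's actual code: the real routing hashes the routing value with Murmur3, while `zlib.crc32` is only a stand-in here, and the function name `pick_shard` is hypothetical.

```python
import zlib


def pick_shard(routing_value: str, number_of_primary_shards: int) -> int:
    """Map a routing value (by default the document _id) to a primary shard.

    Stand-in for Elasticsearch routing: the real implementation uses
    Murmur3, not CRC32, so actual shard numbers will differ.
    """
    if number_of_primary_shards < 1:
        raise ValueError("an index needs at least one primary shard")
    return zlib.crc32(routing_value.encode("utf-8")) % number_of_primary_shards


if __name__ == "__main__":
    for doc_id in ("order-1", "order-2", "order-3"):
        print(doc_id, "-> shard", pick_shard(doc_id, 3))
```

Because the shard is derived from the ID and the shard count, the same document always lands on the same shard. This is also why the number of primary shards cannot simply be edited on a live index.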

Index Lifecycle Management

Define policies to manage index data from creation to deletion:

PUT _ilm/policy/logs-policy
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": {
            "max_age": "7d",
            "max_primary_shard_size": "50gb"
          },
          "set_priority": {
            "priority": 100
          }
        }
      },
      "warm": {
        "min_age": "7d",
        "actions": {
          "shrink": {
            "number_of_shards": 1
          },
          "forcemerge": {
            "max_num_segments": 1
          },
          "set_priority": {
            "priority": 50
          }
        }
      },
      "cold": {
        "min_age": "30d",
        "actions": {
          "freeze": {},
          "set_priority": {
            "priority": 0
          }
        }
      },
      "delete": {
        "min_age": "365d",
        "actions": {
          "delete": {}
        }
      }
    }
  }
}
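To see how the min_age thresholds in this policy translate into phases, the following Python sketch maps an index age to the phase it would be in. It is illustrative only: the thresholds are copied from the policy above, the function name `phase_for_age` is hypothetical, and it assumes the age is measured from rollover, which is how ILM counts min_age when the hot phase uses the rollover action.

```python
from datetime import timedelta

# min_age thresholds copied from logs-policy, latest phase first.
PHASE_MIN_AGE = [
    ("delete", timedelta(days=365)),
    ("cold", timedelta(days=30)),
    ("warm", timedelta(days=7)),
    ("hot", timedelta(0)),
]


def phase_for_age(age_since_rollover: timedelta) -> str:
    """Return the latest phase whose min_age has been reached."""
    for phase, min_age in PHASE_MIN_AGE:
        if age_since_rollover >= min_age:
            return phase
    raise ValueError("age must not be negative")


if __name__ == "__main__":
    for days in (0, 7, 45, 400):
        print(f"{days:>3}d -> {phase_for_age(timedelta(days=days))}")
```

In a real cluster the transitions are not instantaneous, since ILM checks policies periodically rather than continuously, so treat the boundaries as approximate.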

Mapping and Index Templates

Index templates define mappings and settings for new indices:

PUT _index_template/logs-template
{
  "index_patterns": ["logs-*"],
  "template": {
    "settings": {
      "number_of_shards": 3,
      "number_of_replicas": 1,
      "index.lifecycle.name": "logs-policy"
    },
    "mappings": {
      "properties": {
        "@timestamp": {
          "type": "date"
        },
        "level": {
          "type": "keyword"
        },
        "message": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        },
        "service": {
          "type": "keyword"
        },
        "trace_id": {
          "type": "keyword"
        },
        "user_id": {
          "type": "keyword"
        },
        "duration_ms": {
          "type": "long"
        },
        "host": {
          "properties": {
            "name": { "type": "keyword" },
            "ip": { "type": "ip" }
          }
        }
      }
    }
  }
}
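A template only takes effect when a new index name matches one of its index_patterns. The Python sketch below approximates that wildcard check with the standard library so you can sanity-check naming schemes offline. The helper name `template_applies` is hypothetical, and `fnmatchcase` supports more wildcard syntax than Elasticsearch does, so it is only a fair stand-in for simple `*` patterns like the one above.

```python
from fnmatch import fnmatchcase
from typing import Sequence

INDEX_PATTERNS = ["logs-*"]  # copied from logs-template


def template_applies(index_name: str,
                     patterns: Sequence[str] = INDEX_PATTERNS) -> bool:
    """Return True if any wildcard pattern matches the new index name."""
    return any(fnmatchcase(index_name, pattern) for pattern in patterns)


if __name__ == "__main__":
    for name in ("logs-api-gateway-2026.03.22", "metrics-2026.03.22", "logs"):
        print(name, "->", template_applies(name))
```

Note that a bare `logs` index would not pick up the template, because the pattern requires the `logs-` prefix.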

Querying Elasticsearch

GET logs-2026.03.22/_search
{
  "query": {
    "bool": {
      "filter": [
        { "term": { "service": "api-gateway" } },
        { "term": { "level": "ERROR" } },
        { "range": { "@timestamp": { "gte": "now-1h" } } }
      ]
    }
  },
  "sort": [
    { "@timestamp": "desc" }
  ],
  "aggs": {
    "error_by_service": {
      "terms": { "field": "service" },
      "aggs": {
        "error_rate": {
          "avg": { "field": "error_count" }
        }
      }
    }
  }
}

Logstash

Logstash processes and transforms data before it reaches Elasticsearch. It handles complex parsing, enrichment, and filtering.

Logstash Pipeline

graph TB
    A[Input] --> B[Filter]
    B --> C[Output]

A pipeline has three sections: input, filter, and output.
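The three-stage flow can be sketched in plain Python with generators. This is a conceptual model, not how Logstash is implemented (Logstash batches events and runs worker threads), and the function names are hypothetical. The `_jsonparsefailure` tag mirrors what the Logstash json filter adds when parsing fails.

```python
import json
from typing import Iterable, Iterator, List


def input_stage(lines: Iterable[str]) -> Iterator[dict]:
    """Input: wrap each raw line in an event, as an input plugin would."""
    for line in lines:
        yield {"message": line}


def filter_stage(events: Iterable[dict]) -> Iterator[dict]:
    """Filter: parse JSON from 'message' into 'parsed'; tag failures."""
    for event in events:
        try:
            event["parsed"] = json.loads(event["message"])
        except json.JSONDecodeError:
            event.setdefault("tags", []).append("_jsonparsefailure")
        yield event


def output_stage(events: Iterable[dict]) -> List[dict]:
    """Output: collect events; a real output would send them onward."""
    return list(events)


if __name__ == "__main__":
    raw = ['{"level": "ERROR", "service": "api-gateway"}', "not json"]
    for event in output_stage(filter_stage(input_stage(raw))):
        print(event)
```

Each stage only sees what the previous one emits, which is why a slow or failing filter backs up everything behind it.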

Input Plugins

# Receive logs from Beats
input {
  beats {
    port => 5044
    ssl => true
    ssl_certificate => "/etc/ssl/certs/logstash.crt"
    ssl_key => "/etc/ssl/private/logstash.key"
  }

  # Alternative: direct HTTP (decode JSON request bodies by content type)
  http {
    port => 8080
    additional_codecs => { "application/json" => "json" }
  }
}

Filter Plugins

Filters transform and enrich data:

filter {
  # Parse JSON logs
  json {
    source => "message"
    target => "parsed"
  }

  # Parse timestamp (nested fields use [outer][inner] references)
  date {
    match => ["[parsed][timestamp]", "ISO8601"]
    target => "@timestamp"
  }

  # Extract fields from message
  # (a bare DATA capture before \s* can match an empty string, so use stricter patterns)
  grok {
    match => {
      "[parsed][message]" => "%{LOGLEVEL:level}\s+%{NOTSPACE:logger}\s+%{GREEDYDATA:log_message}"
    }
  }

  # Add computed fields
  mutate {
    add_field => {
      "environment" => "%{[parsed][env]}"
      "[@metadata][index_prefix]" => "logs-%{[parsed][service]}"
    }
  }

  # Enrich with GeoIP
  geoip {
    source => "[parsed][client_ip]"
    target => "[parsed][geoip]"
    database => "/etc/logstash/GeoLite2-City.mmdb"
  }

  # Parse query string
  kv {
    source => "[parsed][request_params]"
    field_split => "&"
    prefix => "param_"
  }
}

Output Plugins

output {
  elasticsearch {
    hosts => ["https://elasticsearch:9200"]
    manage_template => false
    index => "%{[@metadata][index_prefix]}-%{+YYYY.MM.dd}"
    ssl => true
    cacert => "/etc/ssl/certs/ca.crt"
    user => "${ELASTICSEARCH_USER}"
    password => "${ELASTICSEARCH_PASSWORD}"
  }

  # Also send to stdout for debugging
  stdout {
    codec => rubydebug
  }
}

Complete Pipeline Example

input {
  beats {
    port => 5044
  }
}

filter {
  if [fields][log_type] == "application" {
    json {
      source => "message"
      target => "parsed"
    }

    date {
      match => ["[parsed][timestamp]", "ISO8601"]
      target => "@timestamp"
    }

    if [parsed][level] {
      mutate {
        add_field => { "level" => "%{[parsed][level]}" }
      }
    }

    if [parsed][exception] {
      mutate {
        add_tag => ["error"]
      }
    }
  }

  if [fields][log_type] == "access" {
    grok {
      match => {
        "message" => '%{IPORHOST:client_ip} %{DATA:ident} %{DATA:auth} \[%{HTTPDATE:timestamp}\] "%{WORD:method} %{URIPATHPARAM:request} HTTP/%{NUMBER:httpversion}" %{NUMBER:status:int} %{NUMBER:bytes:int} "%{DATA:referrer}" "%{DATA:user_agent}"'
      }
    }

    date {
      match => ["timestamp", "dd/MMM/yyyy:HH:mm:ss Z"]
      target => "@timestamp"
    }

    geoip {
      source => "client_ip"
      target => "geoip"
    }

    useragent {
      source => "user_agent"
      target => "ua"
    }
  }
}

output {
  if "error" in [tags] {
    elasticsearch {
      hosts => ["https://elasticsearch:9200"]
      index => "logs-error-%{+YYYY.MM.dd}"
    }
  } else {
    elasticsearch {
      hosts => ["https://elasticsearch:9200"]
      index => "logs-%{[fields][log_type]}-%{+YYYY.MM.dd}"
    }
  }
}
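Before deploying a grok pattern like the access-log one above, it helps to try the same idea against sample lines. The Python sketch below approximates that pattern with a regular expression and parses the timestamp the way the date filter would. It is an approximation, not a translation of the grok library, and the sample line and function name are made up for illustration.

```python
import re
from datetime import datetime
from typing import Optional

# Rough regex equivalent of the access-log grok pattern.
ACCESS_LOG = re.compile(
    r'(?P<client_ip>\S+) (?P<ident>\S+) (?P<auth>\S+) '
    r'\[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\w+) (?P<request>\S+) HTTP/(?P<httpversion>[\d.]+)" '
    r'(?P<status>\d+) (?P<bytes>\d+) '
    r'"(?P<referrer>[^"]*)" "(?P<user_agent>[^"]*)"'
)

# Hypothetical sample line in combined log format.
SAMPLE = ('203.0.113.7 - - [22/Mar/2026:10:15:32 +0000] '
          '"GET /api/users?page=2 HTTP/1.1" 200 1532 "-" "curl/8.5.0"')


def parse_access_line(line: str) -> Optional[dict]:
    """Return extracted fields, or None when the line does not match."""
    match = ACCESS_LOG.match(line)
    if match is None:
        return None
    event = match.groupdict()
    event["status"] = int(event["status"])
    event["bytes"] = int(event["bytes"])
    # Same layout as the date filter's "dd/MMM/yyyy:HH:mm:ss Z".
    event["@timestamp"] = datetime.strptime(
        event["timestamp"], "%d/%b/%Y:%H:%M:%S %z")
    return event


if __name__ == "__main__":
    print(parse_access_line(SAMPLE))
    print(parse_access_line("not an access log line"))
```

Lines that return None here are the ones that would end up tagged as grok failures in Logstash, so a quick pass over real samples shows how much of your traffic the pattern actually covers.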

Beats

Beats are lightweight data shippers that send data from servers to Logstash or Elasticsearch.

Filebeat

Filebeat tails log files and ships them:

# filebeat.yml
filebeat.inputs:
  - type: log
    enabled: true
    paths:
      - /var/log/containers/*.log
    json:
      keys_under_root: true
      add_error_key: true
      message_key: log
    fields:
      log_type: container
    processors:
      - add_kubernetes_metadata:
          host: ${NODE_NAME}
          matchers:
            - logs_path:
                logs_path: "/var/log/containers/"

  - type: log
    enabled: true
    paths:
      - /var/log/nginx/*.log
    fields:
      log_type: nginx
    processors:
      - add_locale: ~

processors:
  - add_host_metadata:
      when.not.contains.tags: forwarded
  - add_cloud_metadata: ~
  - add_docker_metadata: ~

output.logstash:
  hosts: ["logstash:5044"]
  ssl.enabled: true
  ssl.certificate_authorities: ["/etc/filebeat/ca.crt"]
  ssl.certificate: "/etc/filebeat/filebeat.crt"
  ssl.key: "/etc/filebeat/filebeat.key"

logging.level: info
logging.to_files: true
logging.files:
  path: /var/log/filebeat
  name: filebeat
  keepfiles: 7
  permissions: 0644

Metricbeat

Metricbeat collects system and service metrics:

metricbeat.modules:
  - module: system
    metricsets:
      - cpu
      - memory
      - network
      - process
      - diskio
    period: 10s
    processes: [".*"]

  - module: docker
    metricsets:
      - container
      - cpu
      - diskio
      - healthcheck
      - info
      - memory
      - network
    hosts: ["unix:///var/run/docker.sock"]
    period: 10s

  - module: nginx
    metricsets:
      - stubstatus
    hosts: ["http://nginx:8080/nginx_status"]
    period: 10s

output.elasticsearch:
  hosts: ["https://elasticsearch:9200"]
  ssl.enabled: true
  ssl.certificate_authorities: ["/etc/metricbeat/ca.crt"]

Heartbeat

Heartbeat monitors service availability with synthetic checks:

heartbeat.monitors:
  - type: http
    name: api-health-check
    schedule: "@every 30s"
    urls:
      - https://api.example.com/health
    check.response:
      status: 200
    fields:
      service: api-gateway

  - type: tcp
    name: redis-connectivity
    schedule: "@every 60s"
    hosts: ["redis:6379"]
    timeout: 5s

  - type: icmp
    name: host-ping
    schedule: "@every 5m"
    hosts: ["elasticsearch"]

output.elasticsearch:
  hosts: ["https://elasticsearch:9200"]

Kibana

Kibana provides the visualization and exploration interface for your Elasticsearch data.

Index Pattern Setup

Before exploring data, create an index pattern in Kibana:

  1. Navigate to Management > Stack Management > Index Patterns (renamed Data Views in Kibana 8.x)
  2. Click “Create index pattern”
  3. Enter logs-* as the pattern
  4. Select @timestamp as the time field

Building Visualizations

Error Rate Over Time

{
  "title": "Error Rate",
  "type": "line",
  "params": {
    "type": "line",
    "grid": { "categoryLines": false },
    "categoryAxes": [
      {
        "id": "CategoryAxis-1",
        "type": "category",
        "position": "bottom"
      }
    ],
    "valueAxes": [
      {
        "id": "ValueAxis-1",
        "name": "LeftAxis-1",
        "type": "value",
        "position": "left",
        "scale": {
          "type": "linear",
          "mode": "normal"
        }
      }
    ]
  },
  "aggs": [
    {
      "id": "1",
      "type": "avg",
      "schema": "metric",
      "params": {
        "field": "error_rate"
      }
    },
    {
      "id": "2",
      "type": "date_histogram",
      "schema": "segment",
      "params": {
        "field": "@timestamp",
        "interval": "auto"
      }
    }
  ]
}

Service Error Distribution

{
  "title": "Errors by Service",
  "type": "pie",
  "aggs": [
    {
      "id": "1",
      "type": "count",
      "schema": "metric"
    },
    {
      "id": "2",
      "type": "terms",
      "schema": "segment",
      "params": {
        "field": "service",
        "size": 10
      }
    }
  ]
}

Kibana Discover

Discover provides ad-hoc search and exploration:

// Sample Discover query
{
  "query": {
    "bool": {
      "must": [
        { "match": { "level": "ERROR" } },
        { "range": { "@timestamp": { "gte": "now-24h" } } }
      ]
    }
  },
  "sort": [{ "@timestamp": "desc" }],
  "fields": ["@timestamp", "level", "message", "service", "trace_id"],
  "filter": [
    {
      "meta": {
        "index": "logs-*",
        "negate": false,
        "params": {},
        "type": "phrase"
      },
      "query": {
        "match_phrase": {
          "service": "api-gateway"
        }
      }
    }
  ]
}

Kibana Dashboard Example

A complete dashboard might include:

  • Time series of log volume by level
  • Pie chart of error distribution by service
  • Table of recent errors with context
  • Heat map of errors over time by host
  • Metric visualization of error rate and latency percentiles

Deployment Considerations

Hardware Requirements

| Component | CPU | RAM | Disk |
| --- | --- | --- | --- |
| Elasticsearch (per node) | 4+ cores | 8GB+ | SSD, 500GB+ |
| Logstash | 2+ cores | 4GB+ | Minimal |
| Kibana | 2 cores | 2GB+ | Minimal |
| Beats | 1 core | 512MB+ | Minimal |

Elasticsearch is I/O intensive. Use SSDs and ensure adequate disk throughput.

Security

# Enable security in elasticsearch.yml
xpack.security.enabled: true
xpack.security.transport.ssl.enabled: true
xpack.security.http.ssl.enabled: true

# API key authentication
xpack.security.authc.api_key.enabled: true

# Role-based access control: file-based roles are defined in roles.yml
# in the Elasticsearch config directory, or through the role management API

Scaling

Scale Elasticsearch horizontally by adding nodes. The cluster automatically rebalances shards.

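# Minimum master nodes for cluster stability (Elasticsearch 6.x and earlier only).
# The setting is ignored in 7.x and removed in 8.0, where the cluster manages
# its voting configuration automatically.
discovery.zen.minimum_master_nodes: 2  # for 3-node cluster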

# Adjust shard allocation
PUT _cluster/settings
{
  "transient": {
    "cluster.routing.allocation.enable": "all",
    "cluster.routing.allocation.cluster_concurrent_rebalance": 2
  }
}

When to Use the ELK Stack

Use the ELK Stack when:

  • You need centralized logging from multiple services and environments
  • You need full-text search across log entries and application data
  • You need log analysis and pattern detection with Kibana
  • You need security analytics and threat detection
  • You need compliance audit logging and archival
  • You need infrastructure log aggregation (syslog, nginx, apache)

Don’t use the ELK Stack when:

  • You have simple applications with minimal logging needs
  • You only need metrics and dashboards (use Prometheus + Grafana instead)
  • You have high-volume streaming use cases (Kafka is better suited)
  • You need real-time alerting on log data (use dedicated alerting tools)
  • You need large-scale time-series metrics (Elasticsearch is not optimized for pure metrics)

ELK Stack vs Alternatives

| Aspect | ELK Stack | Loki | Splunk |
| --- | --- | --- | --- |
| Cost | Open source (self-hosted) | Open source (self-hosted) | Commercial (expensive) |
| Storage efficiency | Medium (indexed) | High (log-structured) | Medium |
| Query language | KQL (Kibana) | LogQL (Prometheus-style) | SPL |
| Scalability | Excellent (horizontal) | Excellent | Excellent |
| Ease of setup | Moderate | Easy | Easy |
| Full-text search | Excellent | Limited | Excellent |
| Metrics integration | Via Metricbeat | Native Prometheus | Native |
| Best for | Complex log analysis, security analytics | High-volume Kubernetes logs | Enterprise compliance, security |

Capacity Planning

Choosing the right hardware for Elasticsearch prevents performance issues down the line. These are rough guidelines for typical workloads.

Elasticsearch Node Sizing

| Tier | RAM | CPU | Disk (SSD) | Use Case |
| --- | --- | --- | --- | --- |
| Hot | 64GB+ | 8+ cores | 1TB+ | Active indexing, recent data |
| Warm | 32GB | 4+ cores | 2TB+ | Read-only, older indices |
| Cold | 16GB | 2 cores | 4TB+ | Archival, rare queries |

The heap size should be at most 50% of available RAM. Keep heap under 32GB if possible to benefit from compressed object pointers. Set -Xms and -Xmx to the same value to avoid heap resizing during runtime.

# Check JVM settings
GET _nodes/jvm?filter_path=nodes.*.jvm.memory
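The heap guidance above fits in a small helper. This is a sketch under stated assumptions, not an official sizing tool: the 31GB cap is a conservative stand-in for "under 32GB", and the function names are hypothetical.

```python
from typing import List


def recommended_heap_gb(ram_gb: float, cap_gb: float = 31.0) -> float:
    """Half of RAM, capped just below the ~32GB compressed-pointer limit.

    The 31GB default cap is an assumption chosen to stay safely under 32GB.
    """
    if ram_gb <= 0:
        raise ValueError("ram_gb must be positive")
    return min(ram_gb / 2, cap_gb)


def jvm_options(ram_gb: float) -> List[str]:
    """Return matching -Xms/-Xmx flags so the heap never resizes at runtime."""
    heap_mb = int(recommended_heap_gb(ram_gb) * 1024)
    return [f"-Xms{heap_mb}m", f"-Xmx{heap_mb}m"]


if __name__ == "__main__":
    for ram in (16, 32, 64, 128):
        print(f"{ram}GB RAM ->", jvm_options(ram))
```

The RAM left over is not wasted: the operating system uses it for the filesystem cache, which Lucene relies on heavily.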

Estimating Storage Requirements

Calculate expected index size using this formula:

index_size = source_log_volume × compression_ratio × replica_factor

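Here compression_ratio is the on-disk index size divided by the raw log size, and replica_factor is 1 + number_of_replicas. As a rough guide for JSON logs, an index managed by ILM (force-merged, with unneeded fields mapped out) may end up 3-5x smaller than the raw logs, while an unmanaged index with dynamic mappings and extra fields can grow to 10-20x the raw volume. These ratios vary widely with mappings and data, so measure them on a sample before sizing disks.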

Log Volume Estimation

# Estimate daily log volume per service
# Assume: 1000 requests/min × 10KB avg log size × 60 min × 24h = ~14GB/day

# With 3 replicas (4 copies including the primary) and 30% overhead:
# 14GB × 4 copies × 1.3 overhead × 30 days = ~2.2TB/month per service
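The same estimate can be reproduced in a few lines of Python, which makes it easy to plug in your own traffic numbers. This sketch uses hypothetical function names and decimal units (1GB = 1,000,000KB). It assumes three replicas mean four stored copies (the primary plus three replicas) and applies the 30% overhead as a separate multiplier, so its monthly figure is higher than a plain 4x multiplier would give.

```python
def daily_volume_gb(requests_per_min: float, avg_log_kb: float) -> float:
    """Raw log volume per day in GB (1 GB = 1,000,000 KB)."""
    return requests_per_min * avg_log_kb * 60 * 24 / 1_000_000


def stored_gb(raw_gb_per_day: float, replicas: int,
              overhead: float, retention_days: int) -> float:
    """Disk needed for the retention window, counting every shard copy."""
    copies = 1 + replicas  # the primary plus each replica
    return raw_gb_per_day * copies * (1 + overhead) * retention_days


if __name__ == "__main__":
    daily = daily_volume_gb(requests_per_min=1000, avg_log_kb=10)
    monthly = stored_gb(daily, replicas=3, overhead=0.30, retention_days=30)
    print(f"daily raw volume: {daily:.1f} GB")
    print(f"30-day footprint: {monthly / 1000:.1f} TB")
```

Swapping in a lower replica count is the quickest way to see how much of the footprint comes from redundancy rather than from the logs themselves.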

Scaling Triggers

Watch these metrics to decide when to add nodes:

  • Cluster health: Persistent yellow or red status often means the cluster lacks capacity for its shards
  • Indexing latency: P95 above 500ms indicates saturation
  • Search latency: P95 above 1s for interactive queries
  • Disk usage: Nodes approaching 80% capacity
  • JVM heap pressure: Old generation spending more than 30% time in GC
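The thresholds in this list can be encoded as a simple check that runs against whatever monitoring data you already collect. The sketch below is illustrative: the `ClusterMetrics` fields and function name are hypothetical, and "approaching 80%" is interpreted as at or above 80%.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ClusterMetrics:
    status: str              # "green", "yellow", or "red"
    indexing_p95_ms: float
    search_p95_ms: float
    disk_used_pct: float
    old_gen_gc_pct: float    # share of time spent in old-generation GC


def scaling_triggers(m: ClusterMetrics) -> List[str]:
    """Return the reasons, if any, that suggest adding capacity."""
    reasons = []
    if m.status in ("yellow", "red"):
        reasons.append(f"cluster health is {m.status}")
    if m.indexing_p95_ms > 500:
        reasons.append("indexing P95 above 500ms")
    if m.search_p95_ms > 1000:
        reasons.append("search P95 above 1s")
    if m.disk_used_pct >= 80:
        reasons.append("disk usage at or above 80%")
    if m.old_gen_gc_pct > 30:
        reasons.append("old-generation GC above 30% of time")
    return reasons


if __name__ == "__main__":
    sample = ClusterMetrics("yellow", 620.0, 400.0, 83.5, 12.0)
    for reason in scaling_triggers(sample):
        print("-", reason)
```

An empty list means none of the triggers fired; a single trigger is worth investigating, while several at once usually point to a genuine capacity problem.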

Observability Stack Integration

ELK works best as part of a broader observability setup. Here is how it fits with other tools.

Prometheus + Grafana Integration

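Beats do not ship a Prometheus output, so the integration usually runs in the other direction: Metricbeat's prometheus module scrapes Prometheus-format endpoints and indexes the samples into Elasticsearch alongside its own cluster metrics:

# metricbeat.yml - Collect Elasticsearch node metrics and scrape a Prometheus endpoint
metricbeat.modules:
  - module: elasticsearch
    metricsets:
      - node
      - node_stats
    period: 10s
    hosts: ["https://elasticsearch:9200"]

  - module: prometheus
    metricsets:
      - collector
    period: 10s
    hosts: ["app:9100"]  # placeholder address of a Prometheus-format endpoint
    metrics_path: /metrics

output.elasticsearch:
  hosts: ["https://elasticsearch:9200"]

To get cluster metrics into Prometheus instead, run a separate Elasticsearch exporter next to the cluster, scrape it in Prometheus, and build Grafana dashboards for cluster health, indexing throughput, and search latency.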

Distributed Tracing with Jaeger

Add trace context to your logs so you can jump from a Kibana error directly into Jaeger:

# Logstash filter to extract Jaeger trace context
filter {
  if [message] =~ /trace_id:/ {
    grok {
      match => { "message" => "trace_id:%{NOTSPACE:trace_id}\s+span_id:%{NOTSPACE:span_id}" }
    }
    mutate {
      add_field => {
        "trace_url" => "https://jaeger.example.com/trace/%{trace_id}"
      }
    }
  }
}

This lets you correlate log entries with the full request trace in Jaeger.
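The extraction step can be prototyped outside Logstash to confirm that the pattern captures both IDs. The Python sketch below mirrors the filter's logic with a regular expression. The sample message, IDs, and function name are hypothetical; the Jaeger host is the placeholder used in the filter.

```python
import re

TRACE_RE = re.compile(r"trace_id:(?P<trace_id>\S+)\s+span_id:(?P<span_id>\S+)")
JAEGER_BASE_URL = "https://jaeger.example.com/trace/"  # placeholder host


def add_trace_fields(event: dict) -> dict:
    """Add trace_id, span_id, and trace_url when the message carries them."""
    match = TRACE_RE.search(event.get("message", ""))
    if match:
        event.update(match.groupdict())
        event["trace_url"] = JAEGER_BASE_URL + event["trace_id"]
    return event


if __name__ == "__main__":
    event = {"message": "payment failed trace_id:4bf92f3577b34da6 "
                        "span_id:00f067aa0ba902b7"}
    print(add_trace_fields(event))
```

Events without trace context pass through unchanged, which matches the conditional wrapper around the grok filter.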

Trade-off Analysis

| Factor | ELK Stack (Self-Hosted) | Loki (Grafana Cloud) | Splunk Enterprise | CloudWatch Logs |
| --- | --- | --- | --- | --- |
| Deployment Model | Fully self-managed | SaaS / managed | Self-managed | Fully managed SaaS |
| Storage Cost | Infrastructure + ops | Pay-per-GB stored | License + infra | Pay-per-GB ingested |
| Query Performance | Excellent on large data | Good | Excellent | Moderate (throttling) |
| Operational Burden | High (cluster ops) | Low | High (infra + license) | Very low |
| Scalability | Manual shard management | Auto-scaling | Manual | Auto-scaling |
| Learning Curve | Steep (ES DSL) | Low (LogQL) | Steep (SPL) | Low (CloudWatch) |
| Ecosystem | Beats, Fluentd, Logstash | Promtail, Grafana | Heavy Agents | AWS-native |

Production Failure Scenarios

| Failure | Impact | Mitigation |
| --- | --- | --- |
| Elasticsearch cluster red/yellow | Logs not indexing; search degraded | Monitor cluster health; provision more shards; adjust replica settings |
| Logstash pipeline errors | Logs stuck in queue; processing backlog | Monitor pipeline errors; implement dead-letter queues; alert on queue depth |
| Hot tier disk saturation | New indices cannot be created; ingestion fails | Monitor disk usage; implement ILM rollover; add nodes |
| Kibana performance degradation | Slow searches; dashboards timeout | Optimize queries; use filter context; limit time ranges |
| Beats shipper failure | Logs not forwarded; blind spots in coverage | Monitor Beats health; implement local buffering; alert on forward failures |
| Index template mismatch | Fields not indexed correctly; search failures | Version index templates; validate mappings; test before deployment |

Common Pitfalls / Anti-Patterns

1. Too Many Indices with Few Documents

Each index has overhead. Too many small indices overwhelms the cluster:

// Bad: Index per day per service creates thousands of indices
PUT logs-service-a-2026.03.22
PUT logs-service-b-2026.03.22
// ... thousands more

// Good: Use rollover with larger time intervals
PUT logs-service-a-000001
{
  "aliases": {
    "logs-service-a": { "is_write_index": true }
  }
}
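A quick count shows why per-day indices add up. The numbers in this Python sketch are hypothetical (50 services, 90 days of retention), with 3 primaries and 1 replica as in the index template earlier, and the function names are made up for illustration.

```python
import math


def index_count(services: int, retention_days: int, days_per_index: int) -> int:
    """Indices alive at once when each service starts a new index every N days."""
    return services * math.ceil(retention_days / days_per_index)


def shard_count(indices: int, primaries: int, replicas: int) -> int:
    """Total shard copies the cluster has to track."""
    return indices * primaries * (1 + replicas)


if __name__ == "__main__":
    for label, days in (("daily", 1), ("weekly rollover", 7)):
        indices = index_count(services=50, retention_days=90, days_per_index=days)
        shards = shard_count(indices, primaries=3, replicas=1)
        print(f"{label}: {indices} indices, {shards} shards")
```

Under these assumptions, moving from daily indices to a weekly rollover cuts the shard count by roughly a factor of seven without losing any data.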

2. Dynamic Field Mapping Without Controls

Unrestricted dynamic mapping can infer unexpected field types and let the number of mapped fields grow without bound (a mapping explosion):

// Bad: Unrestricted dynamic mapping
{
  "mappings": {
    "dynamic": "true" // Creates any field
  }
}

// Good: Strict dynamic mapping or disabled
{
  "mappings": {
    "dynamic": "strict",
    "properties": {
      "@timestamp": { "type": "date" },
      "level": { "type": "keyword" },
      "message": { "type": "text" }
    }
  }
}

3. Not Using Filter Context for Simple Queries

Filter context is faster because it does not score:

// Bad: Query context for term filter
{
  "query": {
    "match": { "level": "ERROR" } // Scores, slower
  }
}

// Good: Filter context for exact match
{
  "query": {
    "bool": {
      "filter": [
        { "term": { "level": "ERROR" } } // No scoring, faster
      ]
    }
  }
}

4. Ignoring Index Lifecycle Management

Without ILM, indices grow unbounded and performance degrades:

// Good: ILM with hot/warm/cold/delete
{
  "policy": {
    "phases": {
      "hot": {
        "min_age": "0ms",
        "actions": { "rollover": { "max_age": "7d" } }
      },
      "warm": {
        "min_age": "7d",
        "actions": {
          "shrink": { "number_of_shards": 1 },
          "forcemerge": { "max_num_segments": 1 }
        }
      },
      "cold": { "min_age": "30d", "actions": { "freeze": {} } },
      "delete": { "min_age": "365d", "actions": { "delete": {} } }
    }
  }
}

5. Loading Too Much Data into Memory

Kibana visualizations on large time ranges cause OOM:

// Bad: Visualize 90 days of minute-level data
{
  "query": { "range": { "@timestamp": { "gte": "now-90d" } } }
}

// Good: Use date histogram with appropriate interval
{
  "aggs": {
    "over_time": {
      "date_histogram": {
        "field": "@timestamp",
        "fixed_interval": "1h" // Or auto with proper configuration
      }
    }
  }
}
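The cost of a histogram is driven by how many buckets it produces, which is simply the time range divided by the interval. The Python sketch below computes that and picks the finest interval that stays under a cap. The cap of 1000 buckets, the candidate intervals, and the function names are assumptions for illustration, not Kibana settings.

```python
from datetime import timedelta
from typing import Sequence

CANDIDATE_INTERVALS = (
    timedelta(minutes=1),
    timedelta(minutes=5),
    timedelta(hours=1),
    timedelta(days=1),
)


def bucket_count(time_range: timedelta, interval: timedelta) -> int:
    """Number of histogram buckets needed to cover the range (rounded up)."""
    return -(-time_range // interval)  # ceiling division on timedeltas


def pick_interval(time_range: timedelta, max_buckets: int = 1000,
                  candidates: Sequence[timedelta] = CANDIDATE_INTERVALS) -> timedelta:
    """Return the finest candidate interval that stays within max_buckets."""
    for interval in candidates:
        if bucket_count(time_range, interval) <= max_buckets:
            return interval
    return candidates[-1]


if __name__ == "__main__":
    ninety_days = timedelta(days=90)
    print(bucket_count(ninety_days, timedelta(minutes=1)))  # minute-level buckets
    print(bucket_count(ninety_days, timedelta(hours=1)))    # hourly buckets
    print(pick_interval(ninety_days))
```

Ninety days at one-minute resolution is 129,600 buckets, against 2,160 at one hour, which is the difference between a dashboard that loads and one that times out.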

Real-world Failure Scenarios

Scenario 1: Elasticsearch Cluster Red/Yellow

What happened: After an unexpected traffic spike, the Elasticsearch cluster’s hot tier disk became saturated. New indices could not be created, causing log ingestion to fail across multiple services simultaneously.

Root cause: Index Lifecycle Management (ILM) was not configured, and disk alerts were set too high to catch gradual saturation before it became critical.

Impact: Approximately 4 hours of log data were lost. Engineers lost visibility into production systems during a post-incident analysis window.

Lesson learned: Configure ILM policies before going to production. Set disk usage alerts at 70% as a warning threshold and 80% as a critical threshold.

Scenario 2: Logstash Pipeline Backlog

What happened: A misconfigured Logstash filter stalled the pipeline. It processed almost no events while the input queue grew to millions of pending messages.

Root cause: A Grok filter pattern with an overly broad regex caused Logstash to CPU-thrash while failing to match any events.

Impact: Logs were delayed by 12 hours before the queue depth alert fired. Correlating logs with real-time events during the incident window was impossible.

Lesson learned: Implement dead-letter queues for failed events. Monitor pipeline queue depth. Test filter patterns on sample data before deploying to production.

Scenario 3: Kibana Dashboard Timeout

What happened: A dashboard aggregating logs across 30-day windows with complex Vega visualizations began timing out. Users reported that the Kibana UI became unresponsive for all users on the cluster.

Root cause: The cluster’s coordinating node was memory-constrained. Large aggregations caused heap pressure and triggered long GC pauses, affecting every search it coordinated.

Impact: All Kibana users lost access to dashboards for approximately 45 minutes during peak business hours.

Lesson learned: Set query timeout limits in Kibana. Limit the default time range for users. Implement query caching and use rollup indices for historical data.

Interview Questions

1. How does Elasticsearch handle horizontal scaling when you add new nodes to a cluster?

What to cover:

  • Elasticsearch distributes shards across all data nodes automatically
  • New nodes trigger rebalancing; shards are balanced mainly by shard count per node, subject to disk watermarks
  • Master node updates routing tables without manual intervention
  • You can increase replica count after adding nodes for better redundancy

2. When should you use filter context versus query context in Elasticsearch?

What to cover:

  • Filter context: exact matching without relevance scoring, cached automatically (good for term, range, exists queries)
  • Query context: full-text search with scoring (match, query_string queries)
  • Use filter for status codes, user IDs, anything you want to match exactly
  • Use query when you need results ranked by relevance, like searching log messages
  • bool must = query context, bool filter = filter context in compound queries

3. Walk through the ILM phases and what actually happens in each one.

What to cover:

  • Hot: new indices accept writes; rollover triggers based on age or size; set_priority makes these indices recover first after a node restart
  • Warm: index becomes read-only; shrink reduces primary shards, forcemerge combines segments, priority drops
  • Cold: index is frozen and not actively queried; priority goes to zero
  • Delete: index disappears after the retention period; useful for compliance and clearing old data
  • Phases run sequentially once the min_age threshold passes

4. Elasticsearch cluster is stuck in yellow or red. How do you diagnose it?

What to cover:

  • Start with GET _cluster/health?pretty to see the overall status
  • Run GET _cat/shards?h=index,shard,prirep,state to find unassigned shards
  • Yellow = replica shards unassigned; red = primary shards missing
  • Common culprits: disk watermark breaches, heap pressure, network partition, node crashes
  • Solutions vary: add nodes, raise watermark thresholds, drop replica count, increase heap, restart frozen nodes

5. Describe the three Logstash pipeline stages and what each does with a real example.

What to cover:

  • Input: receives data from external sources (Beats, HTTP, files); example: beats plugin listening on port 5044
  • Filter: parses and enriches raw data; example: grok patterns extracting HTTP status code from an access log
  • Output: sends processed data to a destination; example: elasticsearch plugin writing to a daily index
  • Data flows sequentially: input feeds filter, filter feeds output
  • You can run multiple pipelines in parallel for different log types

6. What is the difference between index templates and ILM policies in Elasticsearch?

What to cover:

  • Index templates define mappings and settings for new indices matching a pattern automatically
  • They enforce consistency: field types, shard counts, replica counts, even ILM policy assignment
  • ILM policies govern what happens to indices over time (rollover, shrink, freeze, delete)
  • Templates handle structure at creation; ILM handles data management afterward
  • You can combine them: a template assigns an ILM policy so new indices automatically follow it

7. What is the practical difference between Beats and Logstash, and when would you pick one?

What to cover:

  • Beats are lightweight agents you install on edge machines; Logstash is heavier and runs on dedicated servers
  • Use Filebeat to tail log files, Metricbeat to collect system metrics, Heartbeat for uptime checks
  • Use Logstash when you need grok parsing, multi-step enrichment, conditional routing, or GeoIP lookups
  • In practice: Beats collect and ship, Logstash transforms and routes, Elasticsearch stores

8. What security features does the ELK stack offer and how do you turn them on?

What to cover:

  • X-Pack security adds encryption, authentication, and role-based access control
  • Enable in elasticsearch.yml: xpack.security.enabled: true plus TLS for transport and HTTP
  • API key authentication via xpack.security.authc.api_key.enabled: true for scripted access
  • RBAC with built-in roles like kibana_admin or custom roles defined in roles.yml
  • Audit logging with xpack.security.audit.enabled: true tracks who changed what
  • Kibana spaces let you isolate dev, staging, and prod environments visually

9. What is the difference between rollover and shrinking in ILM?

What to cover:

  • Rollover creates a fresh index when the current one hits max_age or max_primary_shard_size
  • The write alias switches to the new index; the old one stops receiving writes but stays searchable
  • Shrinking takes an existing index and rewrites it with fewer primary shards (say, from 5 down to 1)
  • Shrink copies data into a new index then deletes the original; rollover just switches aliases
  • Rollover handles time-based streams; shrinking optimizes read-heavy historical indices for storage

10. How would you architect an ELK stack for 500GB+ of logs per day?

What to cover:

  • Use hot-warm-cold architecture: SSDs for active indexing, larger spinning disks for warm, cheap storage for cold
  • Run multiple Logstash nodes behind a load balancer; each pipeline handles around 50GB/day
  • Enable local buffering in Beats so temporary Logstash outages do not cause data loss
  • Configure ILM: 7d hot, 30d warm, 90d cold, 365d delete to manage storage growth
  • Set index templates with proper shard sizing (target 50GB per shard) and compression enabled
  • Consider Kafka or Redis as a buffer between Beats and Logstash to absorb traffic spikes
  • Watch queue depth and dead letter queues to catch backlogs before they become outages

11. How do Elasticsearch indexing strategies affect search performance and storage costs?

What to cover:

  • Shard count directly impacts search parallelism: more shards means more parallel searches but also more overhead
  • Target 20-50GB per shard for optimal balance; shards that are too small cause overhead, too large cause slow recovery
  • Use rollover to create new indices based on size or age rather than on fixed calendar boundaries
  • For time-based logs, daily indices work well at moderate volume; high volume may need hourly
  • Index templates enforce consistent mappings and settings across all indices matching a pattern
  • Consider disabling norms on text fields that never need relevance scoring (keyword fields have norms disabled by default)

12. What are the trade-offs between Logstash and Beats for shipping data to Elasticsearch?

What to cover:

  • Beats are lightweight agents with minimal memory footprint (512MB baseline); Logstash requires 4GB+ servers
  • Beats do simple shipping and some preprocessing; Logstash does complex transformations, enrichment, and conditional routing
  • Use Beats for: file tailing, metric collection, heartbeat monitoring, simple field additions
  • Use Logstash for: grok parsing, multi-step enrichment chains, GeoIP lookups, business-logic-based routing
  • In practice, many architectures use both: Beats handle edge collection, Logstash handles transformation
  • Filebeat can do light parsing (JSON, nginx, apache logs) without Logstash if you do not need complex grok

13. How does hot-warm-cold architecture reduce Elasticsearch costs at scale?

What to cover:

  • Hot nodes use SSDs for fast I/O and handle all writes and recent queries
  • Warm nodes use larger spinning disks for read-only historical data that is queried occasionally
  • Cold nodes use cheap bulk storage for archival data that is rarely accessed but must be retained
  • ILM automates movement between tiers: hot (7d) → warm (30d) → cold (90d) → delete (365d)
  • Frozen indices keep their shard data structures out of heap and load them only when searched, dramatically reducing RAM needs
  • For 500GB/day ingestion, hot-warm-cold can cut storage costs by 60-70% compared to all-SSD

14. How do you detect and resolve Elasticsearch cluster rebalancing issues?

What to cover:

  • Rebalancing happens when nodes join or leave; it moves shards to even out shard counts across nodes
  • Monitor with GET _cat/shards?h=index,shard,prirep,state,store and look for RELOCATING shards
  • High rebalance rates can saturate network and disk I/O, degrading search and indexing performance
  • Throttle rebalancing with cluster.routing.allocation.cluster_concurrent_rebalance (the default is 2; lower it to reduce relocation load)
  • Watermark settings control when nodes stop accepting new shards (low 85%, high 90%, flood 95%)
  • If a node is slow, Elasticsearch may think it failed and start relocating shards unnecessarily

15. What are the key Kibana visualization best practices for operational dashboards?

What to cover:

  • Use filter context in all dashboard queries to avoid unnecessary relevance scoring
  • Limit time ranges by default; users can expand but loading 90 days of minute data crashes browsers
  • Use date histogram aggregations with appropriate intervals: 1h for 30d views, 1m for under 24h
  • For high-cardinality fields like user_id, use terms aggregation with size limit to avoid memory issues
  • Pin frequently used filters at the dashboard level so every visualization respects them
  • Break complex dashboards into multiple saved searches rather than one monolithic view
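
The interval guidance above is easy to check with arithmetic. This sketch picks an interval from the time range using the two rules in the text (1m under 24h, 1h up to 30d) and counts the resulting buckets. The `1d` fallback for longer ranges is an assumption added for illustration.

```python
from datetime import timedelta

INTERVAL_SECONDS = {"1m": 60, "1h": 3600, "1d": 86400}


def pick_interval(time_range):
    """Choose a date histogram interval for a given time range."""
    if time_range < timedelta(hours=24):
        return "1m"
    if time_range <= timedelta(days=30):
        return "1h"
    return "1d"  # assumption: daily buckets for anything longer than 30 days


def bucket_count(time_range, interval):
    """Number of histogram buckets a range produces at a given interval."""
    return int(time_range.total_seconds() // INTERVAL_SECONDS[interval])
```

For example, `bucket_count(timedelta(days=90), "1m")` is 129,600 buckets per series, while `bucket_count(timedelta(days=30), "1h")` is 720, which shows why minute-level data over long ranges overwhelms a browser.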
16. How does Elasticsearch handle shard allocation and what causes unassigned shards?

What to cover:

  • Shard allocation is the process of assigning shards to nodes based on resource usage and allocation policies
  • Primary shards are allocated at index creation; replicas are allocated dynamically
  • Unassigned shards appear when a disk watermark is breached, a node leaves the cluster, the replica count is increased, or an index has just been created
  • Yellow status means replicas are unassigned (data safe but fault tolerance compromised)
  • Red status means primaries are missing (data loss risk); check logs for the specific allocation reason
  • Use GET _cluster/allocation/explain to get detailed reason for a specific unassigned shard
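
The yellow and red rules above can be expressed as a small function. This sketch derives a status from `(prirep, state)` pairs such as those in `_cat/shards` output, treating STARTED and RELOCATING shards as active. It is a simplified model for reasoning about the rules, not how Elasticsearch computes health internally.

```python
ACTIVE_STATES = {"STARTED", "RELOCATING"}


def cluster_status(shards):
    """Derive green/yellow/red from (prirep, state) pairs.

    prirep is "p" for a primary and "r" for a replica.
    Any inactive primary means red; otherwise any inactive replica means yellow.
    """
    status = "green"
    for prirep, state in shards:
        if state in ACTIVE_STATES:
            continue
        if prirep == "p":
            return "red"  # a missing primary outranks everything else
        status = "yellow"
    return status
```

A status of yellow or red only tells you that something is unassigned; the allocation explain API mentioned above is still what tells you why.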
17. What monitoring metrics should you watch for Logstash pipeline health?

What to cover:

  • Pipeline throughput (events/sec) to catch degradation before backlog accumulates
  • Queue depth: persistent queue bytes and unacknowledged events indicate backpressure
  • Dead letter queue size and age; a growing DLQ means events are failing to index and piling up unprocessed, so they never reach your dashboards
  • Filter execution time: slow filters (GeoIP, DNS lookup) can become bottlenecks
  • Input metrics: bytes received by Logstash, connection errors from Beats
  • JVM heap pressure: Logstash runs on the JVM; sustained heap usage above 80% causes long GC pauses and slow processing
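
Several of these metrics are cumulative counters, so they only become meaningful when two samples are compared. The sketch below does that for a simplified sample dict. The key names (`events_out`, `queue_events`, `dlq_bytes`, `heap_used_pct`) are hypothetical and do not mirror the raw Logstash API response; only the 80% heap threshold comes from the text.

```python
def assess_pipeline(prev, curr, interval_s, heap_limit_pct=80, min_events_per_s=0.0):
    """Compare two metric samples taken interval_s seconds apart.

    Both samples are plain dicts with HYPOTHETICAL keys:
    events_out (cumulative counter), queue_events, dlq_bytes, heap_used_pct.
    Returns (events_per_second, list_of_warning_codes).
    """
    events_per_s = (curr["events_out"] - prev["events_out"]) / interval_s
    warnings = []
    if events_per_s < min_events_per_s:
        warnings.append("low_throughput")
    if curr["queue_events"] > prev["queue_events"]:
        warnings.append("queue_growing")  # backpressure is building
    if curr["dlq_bytes"] > prev["dlq_bytes"]:
        warnings.append("dlq_growing")  # events are failing downstream
    if curr["heap_used_pct"] > heap_limit_pct:
        warnings.append("heap_high")
    return events_per_s, warnings
```

In practice you would fill the two dicts from successive polls of your monitoring source and tune `min_events_per_s` to the pipeline's normal load.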
18. How do you secure Beats-to-Logstash and Logstash-to-Elasticsearch communication?

What to cover:

  • Enable TLS on Beats output to Logstash with self-signed certificates or CA-issued certs
  • Configure Logstash to verify client certificates for mutual TLS from Beats
  • Store credentials in environment variables or keystore, never in configuration files
  • Use API key authentication for Beats-to-Elasticsearch direct shipping (no Logstash)
  • Rotate SSL certificates regularly; expired certs on Beats cause silent shipping failures
  • Logstash-to-Elasticsearch should use TLS with certificate verification and credential secrets
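
Because expired certificates fail quietly, it helps to check expiry dates ahead of time. This sketch uses Python's standard `ssl.cert_time_to_seconds` to turn a certificate's notAfter string into days remaining. The 14-day rotation threshold and the function names are assumptions; the notAfter value is the kind of string that `openssl x509 -enddate -noout` prints after `notAfter=`.

```python
import ssl
import time


def days_until_expiry(not_after, now=None):
    """Days left before a certificate expires.

    not_after uses the certificate time format, e.g. "Jan  5 09:34:43 2018 GMT".
    now is a Unix timestamp; it defaults to the current time.
    """
    expires_at = ssl.cert_time_to_seconds(not_after)
    if now is None:
        now = time.time()
    return (expires_at - now) / 86400


def needs_rotation(not_after, threshold_days=14, now=None):
    """True when the certificate expires within threshold_days (assumed default)."""
    return days_until_expiry(not_after, now=now) <= threshold_days
```

Running a check like this from cron against every Beats and Logstash certificate, and alerting when `needs_rotation` returns true, turns a silent failure into a scheduled task.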
19. What are the most common causes of Kibana slow query performance and how do you fix them?

What to cover:

  • Querying too wide a time range: use narrow defaults and let users expand; always use date histogram with auto or fixed intervals
  • High-cardinality aggregations: terms aggregation on user_id or trace_id with size: 10000 blows up memory
  • Missing filter context: exact-match clauses placed in query context (for example under must rather than filter in a bool query) are scored and skip the filter cache when they do not need to
  • Visualizations on scripted fields or runtime fields that execute per document
  • Large result sets returned to browser: use query size: 0 with aggregations instead of returning hits
  • Cross-cluster searches add network latency; consider local index patterns instead
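
Three of the fixes above (filter context, a bounded time range, and `size: 0`) fit in a single request body. This sketch builds one as a Python dict that can be serialized to JSON for the search API. The field names `service.name` and `@timestamp` and the function name are assumptions about your mapping, not part of any API.

```python
import json


def histogram_query(service, start, end, interval="1h"):
    """Build an aggregation-only search body that stays in filter context.

    service.name and @timestamp are ASSUMED field names; adjust to your mapping.
    """
    return {
        "size": 0,  # return aggregations only, no hits
        "query": {
            "bool": {
                "filter": [  # filter context: no scoring, cacheable
                    {"term": {"service.name": service}},
                    {"range": {"@timestamp": {"gte": start, "lt": end}}},
                ]
            }
        },
        "aggs": {
            "events_over_time": {
                "date_histogram": {"field": "@timestamp", "fixed_interval": interval}
            }
        },
    }


def to_json(body):
    """Serialize the body for use with any HTTP client."""
    return json.dumps(body, indent=2)
```

A call such as `histogram_query("checkout", "now-24h", "now")` produces a body whose response size is bounded by the number of buckets rather than by the number of matching documents.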
20. How would you design an ELK stack migration from an existing Splunk deployment?

What to cover:

  • Map Splunk indexes to Elasticsearch indices; plan index naming and template structure
  • Beats replace Universal Forwarders, and Filebeat handles most log types natively; Logstash takes over the parsing and routing role of Heavy Forwarders
  • SPL queries convert to KQL or Elasticsearch query DSL; some queries need redesign
  • Field names differ: Splunk "sourcetype" maps to a field in Elasticsearch, not a native concept
  • License cost comparison: Splunk is commercial and expensive; ELK is open source but has operational costs
  • Parallel run: ship same logs to both systems during transition period to validate data integrity
  • Start with non-critical logs, migrate incrementally, validate search results match before decommissioning Splunk
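
During the parallel run, validation can start with a simple per-source count comparison. This sketch takes two dicts of event counts keyed by log source, one exported from each system for the same time window, and reports the sources whose counts differ by more than a tolerance. The 1% default and the source names in the usage note are illustrative.

```python
def compare_counts(splunk_counts, es_counts, tolerance=0.01):
    """Return sources whose event counts diverge beyond the tolerance.

    Both arguments map a log source name to an event count for the same window.
    The result maps each mismatched source to (splunk_count, es_count).
    """
    mismatches = {}
    for source in sorted(set(splunk_counts) | set(es_counts)):
        splunk_n = splunk_counts.get(source, 0)
        es_n = es_counts.get(source, 0)
        baseline = max(splunk_n, es_n)
        if baseline and abs(splunk_n - es_n) / baseline > tolerance:
            mismatches[source] = (splunk_n, es_n)
    return mismatches
```

For example, `compare_counts({"nginx": 1000, "app": 500}, {"nginx": 1005, "app": 400})` flags only `app`. Matching counts do not prove the parsed fields agree, so spot-check individual events as well before decommissioning anything.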

Quick Recap

Key Takeaways:

  • Beats collect, Logstash transforms, Elasticsearch stores, Kibana visualizes
  • Index lifecycle management prevents unbounded growth
  • Use filter context for exact matches; query context only when scoring needed
  • Monitor cluster health and pipeline metrics proactively
  • Implement security early: authentication, TLS, RBAC
  • Design index templates carefully to control field mapping

Copy/Paste Checklist:

# Check cluster health
GET _cluster/health?pretty

# Monitor index size and document count
GET _cat/indices?v&s=store.size:desc

# Check Logstash pipeline status (Logstash monitoring API, default port 9600; run from a shell)
curl -s "http://localhost:9600/_node/stats/pipelines?pretty"

# ILM policy check
GET _ilm/policy/logs-policy?pretty

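# Dead letter queue inspection (the DLQ is stored on disk under Logstash's path.data, not in Elasticsearch;
# /var/lib/logstash is the package-install default, adjust for your setup)
du -sh /var/lib/logstash/dead_letter_queue/*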

# Index template validation
GET _index_template/logs-template?pretty

# Secure your cluster (Elasticsearch)
PUT _security/user/kibana_admin
{
  "password": "${KIBA...ORD}",
  "roles": ["kibana_admin"]
}

Observability Checklist

Infrastructure Monitoring

  • Elasticsearch cluster health (green/yellow/red)
  • Primary shard and replica distribution
  • Index count and size per index
  • Node resource utilization (CPU, heap, disk)
  • Search and indexing latency percentiles
  • JVM heap usage and GC frequency
  • Segment count and merge queue depth

Log Pipeline Monitoring

  • Beats shipper metrics (bytes sent, errors, lag)
  • Logstash pipeline throughput and latency
  • Logstash queue depth and worker utilization
  • Dead-letter queue size and age
  • Log parsing error rate

Kibana Monitoring

  • Search response time (p95, p99)
  • Dashboard load time
  • Visualization render time
  • Active users and session count

Data Management

  • Index count within expected bounds
  • Document count growth rate
  • Disk usage trend and forecasting
  • ILM policy execution success/failure
  • Archive tier accessibility

Security Checklist

  • Elasticsearch security features enabled (formerly X-Pack Security)
  • User authentication configured (LDAP, SAML, or built-in)
  • Role-based access control for indices and spaces
  • TLS encryption for all network traffic
  • API keys rotated regularly
  • Kibana spaces isolation (dev/staging/prod separation)
  • Audit logging enabled for security events
  • No sensitive data in index names or field names
  • Snapshot repositories secured and access logged
  • Cross-cluster search secured if used

Conclusion

The ELK Stack provides a powerful platform for centralized logging and analysis. Beats collect data efficiently, Logstash transforms it into structured format, Elasticsearch stores and indexes it, and Kibana makes it explorable.

Start with Filebeat shipping container logs to Elasticsearch, and build from there. Add Logstash for complex parsing, Kibana for visualizations, and ILM policies for efficient data retention.

For monitoring beyond logs, see our Prometheus & Grafana guide for metrics visualization. For distributed tracing, see the Jaeger and Distributed Tracing guides for correlating logs with request traces.
