ELK Stack: Elasticsearch, Logstash, Kibana, and Beats

Complete guide to the ELK Stack for log aggregation and analysis. Learn Elasticsearch indexing, Logstash pipelines, Kibana visualizations, and Beats shippers.

published: March 22, 2026 reading time: 29 min read author: GeekWorkBench

ELK Stack Deep Dive: Elasticsearch, Logstash, Kibana, and Beats

The ELK Stack is a popular open-source solution for centralized logging. It lets you collect logs from multiple sources, transform them into structured format, store them efficiently, and query them interactively.

This guide covers each component in depth. If you are new to logging concepts, read our Logging Best Practices guide first.

Introduction

graph LR
    A[Log Sources] -->|Shippers| B[Beats]
    B --> C[Logstash]
    C --> D[Elasticsearch]
    D --> E[Kibana]
    A -->|Direct| D

The ELK Stack has four main components:

Beats: Lightweight shippers that collect data from various sources
Logstash: Transforms and enriches data during transit
Elasticsearch: Stores and indexes data for fast search
Kibana: Visualizes and explores data

Core Concepts

Elasticsearch is a distributed search and analytics engine built on Apache Lucene. It stores documents in JSON format and provides powerful query capabilities.

Elasticsearch Key Concepts

Concept	Description
Index	Collection of documents, similar to a database
Document	A single JSON record, similar to a row
Shard	A partition of an index for horizontal scaling
Replica	A copy of a shard for high availability
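To make the shard and replica rows concrete, here is a minimal Python sketch of how a document could be routed to a primary shard and how many shard copies a cluster has to place. It is an illustration under stated assumptions: Elasticsearch hashes the routing value (by default the document ID) with Murmur3, while this sketch substitutes `zlib.crc32`, and the function names are hypothetical.

```python
import zlib


def pick_shard(routing_key, number_of_shards):
    """Map a routing key (by default the document _id) to a primary shard."""
    if number_of_shards < 1:
        raise ValueError("number_of_shards must be at least 1")
    # Stand-in hash for illustration only; Elasticsearch uses Murmur3.
    return zlib.crc32(routing_key.encode("utf-8")) % number_of_shards


def total_shard_copies(number_of_shards, number_of_replicas):
    """Primaries plus replicas that the cluster must allocate for one index."""
    return number_of_shards * (1 + number_of_replicas)


if __name__ == "__main__":
    for doc_id in ("log-1", "log-2", "log-3"):
        print(doc_id, "-> shard", pick_shard(doc_id, 3))
    print("shard copies to place:", total_shard_copies(3, 1))
```

Because the shard is derived from a modulo over the primary shard count, that count cannot be changed in place, which is why resizing goes through shrink, split, or reindex.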

Index Lifecycle Management

Define policies to manage index data from creation to deletion:

PUT _ilm/policy/logs-policy
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": {
            "max_age": "7d",
            "max_primary_shard_size": "50gb"
          },
          "set_priority": { "priority": 100 }
        }
      },
      "warm": {
        "min_age": "7d",
        "actions": {
          "shrink": {
            "number_of_shards": 1
          },
          "forcemerge": {
            "max_num_segments": 1
          },
          "set_priority": { "priority": 50 }
        }
      },
      "cold": {
        "min_age": "30d",
        "actions": {
          "freeze": {}, // deprecated since 7.14 and a no-op in 8.x; drop it on current versions
          "set_priority": { "priority": 0 }
        }
      },
      "delete": {
        "min_age": "365d",
        "actions": {
          "delete": {}
        }
      }
    }
  }
}
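The way min_age selects a phase can be sketched in a few lines of Python. This is a simplified model, not how ILM is implemented: it assumes the thresholds from the policy above, ignores the time each action takes, and treats the age as already measured from rollover (or from index creation when rollover is not used). The names are hypothetical.

```python
from datetime import timedelta

# min_age per phase, mirroring the policy above (hot has no min_age, so 0).
PHASE_MIN_AGE = [
    ("hot", timedelta(0)),
    ("warm", timedelta(days=7)),
    ("cold", timedelta(days=30)),
    ("delete", timedelta(days=365)),
]


def current_phase(age):
    """Return the last phase whose min_age the index age has reached."""
    phase = PHASE_MIN_AGE[0][0]
    for name, min_age in PHASE_MIN_AGE:
        if age >= min_age:
            phase = name
    return phase


if __name__ == "__main__":
    for days in (3, 7, 45, 400):
        print(days, "days ->", current_phase(timedelta(days=days)))
```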

Mapping and Index Templates

Index templates define mappings and settings for new indices:

PUT _index_template/logs-template
{
  "index_patterns": ["logs-*"],
  "template": {
    "settings": {
      "number_of_shards": 3,
      "number_of_replicas": 1,
      "index.lifecycle.name": "logs-policy"
    },
    "mappings": {
      "properties": {
        "@timestamp": {
          "type": "date"
        },
        "level": {
          "type": "keyword"
        },
        "message": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        },
        "service": {
          "type": "keyword"
        },
        "trace_id": {
          "type": "keyword"
        },
        "user_id": {
          "type": "keyword"
        },
        "duration_ms": {
          "type": "long"
        },
        "host": {
          "properties": {
            "name": { "type": "keyword" },
            "ip": { "type": "ip" }
          }
        }
      }
    }
  }
}
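Before shipping logs it can help to check sample documents against the mapping. The following Python sketch does that for a subset of the template above. It is a rough approximation and stricter than Elasticsearch, which coerces some values (for example the string "42" into a long); the helper names are hypothetical.

```python
import ipaddress
from datetime import datetime


def _is_date(value):
    try:
        datetime.fromisoformat(str(value).replace("Z", "+00:00"))
        return True
    except ValueError:
        return False


def _is_ip(value):
    try:
        ipaddress.ip_address(value)
        return True
    except ValueError:
        return False


CHECKS = {
    "date": _is_date,
    "keyword": lambda v: isinstance(v, str),
    "text": lambda v: isinstance(v, str),
    "long": lambda v: isinstance(v, int) and not isinstance(v, bool),
    "ip": _is_ip,
}

# Subset of the "properties" section of the template above.
MAPPING = {
    "@timestamp": {"type": "date"},
    "level": {"type": "keyword"},
    "duration_ms": {"type": "long"},
    "host": {"properties": {"name": {"type": "keyword"}, "ip": {"type": "ip"}}},
}


def invalid_fields(doc, properties, prefix=""):
    """Return dotted paths of fields whose values do not fit the mapping."""
    bad = []
    for name, spec in properties.items():
        if name not in doc:
            continue  # missing fields are allowed
        path = prefix + name
        value = doc[name]
        if "properties" in spec:
            if isinstance(value, dict):
                bad += invalid_fields(value, spec["properties"], path + ".")
            else:
                bad.append(path)
        elif not CHECKS[spec["type"]](value):
            bad.append(path)
    return bad
```

Running this over a handful of real log lines in CI is one way to catch mapping conflicts before they surface as rejected documents.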

Querying Elasticsearch

GET logs-2026.03.22/_search
{
  "query": {
    "bool": {
      "filter": [
        { "term": { "service": "api-gateway" } },
        { "term": { "level": "ERROR" } },
        { "range": { "@timestamp": { "gte": "now-1h" } } }
      ]
    }
  },
  "sort": [
    { "@timestamp": "desc" }
  ],
  "aggs": {
    "error_by_service": {
      "terms": { "field": "service" },
      "aggs": {
        "error_rate": {
          "avg": { "field": "error_count" }
        }
      }
    }
  }
}
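When queries like this are generated from application code, building the body as a plain dictionary keeps it easy to test. The sketch below is one possible builder; it places every exact-match clause in filter context, and the function name and defaults are assumptions rather than part of any Elasticsearch client.

```python
import json


def build_error_query(service, level="ERROR", since="now-1h", size=50):
    """Build a search body that matches one service and level in filter context."""
    return {
        "size": size,
        "query": {
            "bool": {
                "filter": [
                    {"term": {"service": service}},
                    {"term": {"level": level}},
                    {"range": {"@timestamp": {"gte": since}}},
                ]
            }
        },
        "sort": [{"@timestamp": "desc"}],
    }


if __name__ == "__main__":
    # The printed JSON can be pasted into Kibana Dev Tools or sent to _search.
    print(json.dumps(build_error_query("api-gateway"), indent=2))
```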

Logstash

Logstash processes and transforms data before it reaches Elasticsearch. It handles complex parsing, enrichment, and filtering.

Logstash Pipeline

graph TB
    A[Input] --> B[Filter]
    B --> C[Output]

A pipeline has three sections: input, filter, and output.
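The three stages can be pictured as chained generators. This Python sketch is only a mental model of the event flow, not how Logstash is implemented; the stage functions are hypothetical, and the failure tag mirrors the `_jsonparsefailure` tag that the Logstash json filter adds.

```python
import json


def input_stage(lines):
    """Wrap each raw line in an event, as an input plugin would."""
    for line in lines:
        yield {"message": line}


def filter_stage(events):
    """Parse JSON messages; tag events that fail instead of dropping them."""
    for event in events:
        try:
            event["parsed"] = json.loads(event["message"])
        except json.JSONDecodeError:
            event.setdefault("tags", []).append("_jsonparsefailure")
        yield event


def output_stage(events):
    """Serialize events for the destination (here: a list of JSON strings)."""
    return [json.dumps(event, sort_keys=True) for event in events]


if __name__ == "__main__":
    raw = ['{"level": "INFO", "msg": "started"}', "not json"]
    for line in output_stage(filter_stage(input_stage(raw))):
        print(line)
```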

Input Plugins

# Receive logs from Beats
input {
  beats {
    port => 5044
    ssl => true
    ssl_certificate => "/etc/ssl/certs/logstash.crt"
    ssl_key => "/etc/ssl/private/logstash.key"
  }

  # Alternative: direct HTTP (request bodies are decoded as JSON)
  http {
    port => 8080
    codec => "json"
  }
}

Filter Plugins

Filters transform and enrich data:

filter {
  # Parse JSON logs
  json {
    source => "message"
    target => "parsed"
  }

  # Parse timestamp (nested fields use [outer][inner] references)
  date {
    match => ["[parsed][timestamp]", "ISO8601"]
    target => "@timestamp"
  }

  # Extract fields from message
  grok {
    match => {
      "[parsed][message]" => "%{LOGLEVEL:level}\s+%{NOTSPACE:logger}\s+%{GREEDYDATA:log_message}"
    }
  }

  # Add computed fields
  mutate {
    add_field => {
      "environment" => "%{[parsed][env]}"
      "[@metadata][index_prefix]" => "logs-%{[parsed][service]}"
    }
  }

  # Enrich with GeoIP
  geoip {
    source => "[parsed][client_ip]"
    target => "[parsed][geoip]"
    database => "/etc/logstash/GeoLite2-City.mmdb"
  }

  # Parse query string
  kv {
    source => "[parsed][request_params]"
    field_split => "&"
    prefix => "param_"
  }
}

Output Plugins

output {
  elasticsearch {
    hosts => ["https://elasticsearch:9200"]
    manage_template => false
    index => "%{[@metadata][index_prefix]}-%{+YYYY.MM.dd}"
    ssl => true
    cacert => "/etc/ssl/certs/ca.crt"
    user => "${ELASTICSEARCH_USER}"
    password => "${ELASTICSEARCH_PASSWORD}"
  }

  # Also send to stdout for debugging
  stdout {
    codec => rubydebug
  }
}

Complete Pipeline Example

input {
  beats {
    port => 5044
  }
}

filter {
  if [fields][log_type] == "application" {
    json {
      source => "message"
      target => "parsed"
    }

    date {
      match => ["[parsed][timestamp]", "ISO8601"]
      target => "@timestamp"
    }

    if [parsed][level] {
      mutate {
        add_field => { "level" => "%{[parsed][level]}" }
      }
    }

    if [parsed][exception] {
      mutate {
        add_tag => ["error"]
      }
    }
  }

  if [fields][log_type] == "access" {
    grok {
      match => {
        "message" => '%{IPORHOST:client_ip} %{DATA:ident} %{DATA:auth} \[%{HTTPDATE:timestamp}\] "%{WORD:method} %{URIPATHPARAM:request} HTTP/%{NUMBER:httpversion}" %{NUMBER:status:int} %{NUMBER:bytes:int} "%{DATA:referrer}" "%{DATA:user_agent}"'
      }
    }

    date {
      match => ["timestamp", "dd/MMM/yyyy:HH:mm:ss Z"]
      target => "@timestamp"
    }

    geoip {
      source => "client_ip"
      target => "geoip"
    }

    useragent {
      source => "user_agent"
      target => "ua"
    }
  }
}

output {
  if "error" in [tags] {
    elasticsearch {
      hosts => ["https://elasticsearch:9200"]
      index => "logs-error-%{+YYYY.MM.dd}"
    }
  } else {
    elasticsearch {
      hosts => ["https://elasticsearch:9200"]
      index => "logs-%{[fields][log_type]}-%{+YYYY.MM.dd}"
    }
  }
}
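Grok patterns are easier to debug against sample lines before they reach production. As a rough Python equivalent of the access-log branch above, the sketch below parses a combined-format line with a regular expression. The regex approximates the grok pattern rather than reproducing it exactly, and the sample line and function name are made up for illustration.

```python
import re
from datetime import datetime

ACCESS_LOG = re.compile(
    r'(?P<client_ip>\S+) (?P<ident>\S+) (?P<auth>\S+) '
    r'\[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\w+) (?P<request>\S+) HTTP/(?P<httpversion>[\d.]+)" '
    r'(?P<status>\d{3}) (?P<bytes>\d+) '
    r'"(?P<referrer>[^"]*)" "(?P<user_agent>[^"]*)"'
)


def parse_access_line(line):
    """Return a dict of fields, or None when the line does not match."""
    match = ACCESS_LOG.match(line)
    if match is None:
        return None  # Logstash would tag the event with _grokparsefailure
    event = match.groupdict()
    event["status"] = int(event["status"])
    event["bytes"] = int(event["bytes"])
    # Same layout as the date filter pattern dd/MMM/yyyy:HH:mm:ss Z
    event["@timestamp"] = datetime.strptime(
        event["timestamp"], "%d/%b/%Y:%H:%M:%S %z"
    )
    return event


if __name__ == "__main__":
    sample = (
        '203.0.113.7 - alice [22/Mar/2026:10:15:32 +0000] '
        '"GET /api/users?id=42 HTTP/1.1" 200 1532 '
        '"https://example.com/" "Mozilla/5.0"'
    )
    print(parse_access_line(sample))
```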

Beats

Beats are lightweight data shippers that send data from servers to Logstash or Elasticsearch.

Filebeat

Filebeat tails log files and ships them:

# filebeat.yml
filebeat.inputs:
  - type: log
    enabled: true
    paths:
      - /var/log/containers/*.log
    json:
      keys_under_root: true
      add_error_key: true
      message_key: log
    fields:
      log_type: container
    processors:
      - add_kubernetes_metadata:
          host: ${NODE_NAME}
          matchers:
            - logs_path:
                logs_path: "/var/log/containers/"

  - type: log
    enabled: true
    paths:
      - /var/log/nginx/*.log
    fields:
      log_type: nginx
    processors:
      - add_locale: ~

processors:
  - add_host_metadata:
      when.not.contains.tags: forwarded
  - add_cloud_metadata: ~
  - add_docker_metadata: ~

output.logstash:
  hosts: ["logstash:5044"]
  ssl.enabled: true
  ssl.certificate_authorities: ["/etc/filebeat/ca.crt"]
  ssl.certificate: "/etc/filebeat/filebeat.crt"
  ssl.key: "/etc/filebeat/filebeat.key"

logging.level: info
logging.to_files: true
logging.files:
  path: /var/log/filebeat
  name: filebeat
  keepfiles: 7
  permissions: 0644

Metricbeat

Metricbeat collects system and service metrics:

metricbeat.modules:
  - module: system
    metricsets:
      - cpu
      - memory
      - network
      - process
      - diskio
    period: 10s
    processes: [".*"]

  - module: docker
    metricsets:
      - container
      - cpu
      - diskio
      - healthcheck
      - info
      - memory
      - network
    hosts: ["unix:///var/run/docker.sock"]
    period: 10s

  - module: nginx
    metricsets:
      - stubstatus
    hosts: ["http://nginx:8080/nginx_status"]
    period: 10s

output.elasticsearch:
  hosts: ["https://elasticsearch:9200"]
  ssl.enabled: true
  ssl.certificate_authorities: ["/etc/metricbeat/ca.crt"]

Heartbeat

Heartbeat monitors service availability with synthetic checks:

heartbeat.monitors:
  - type: http
    name: api-health-check
    schedule: "@every 30s"
    urls:
      - https://api.example.com/health
    check.response:
      status: 200
    fields:
      service: api-gateway

  - type: tcp
    name: redis-connectivity
    schedule: "@every 60s"
    hosts: ["redis:6379"]
    timeout: 5s

  - type: icmp
    name: host-ping
    schedule: "@every 5m"
    hosts: ["elasticsearch"]

output.elasticsearch:
  hosts: ["https://elasticsearch:9200"]

Kibana

Kibana provides the visualization and exploration interface for your Elasticsearch data.

Index Pattern Setup

Before exploring data, create an index pattern in Kibana:

Navigate to Management > Stack Management > Index Patterns (called Data Views in Kibana 8.0 and later)
Click “Create index pattern”
Enter logs-* as the pattern
Select @timestamp as the time field

Building Visualizations

Error Rate Over Time

{
  "title": "Error Rate",
  "type": "line",
  "params": {
    "type": "line",
    "grid": { "categoryLines": false },
    "categoryAxes": [
      {
        "id": "CategoryAxis-1",
        "type": "category",
        "position": "bottom"
      }
    ],
    "valueAxes": [
      {
        "id": "ValueAxis-1",
        "name": "LeftAxis-1",
        "type": "value",
        "position": "left",
        "scale": {
          "type": "linear",
          "mode": "normal"
        }
      }
    ]
  },
  "aggs": [
    {
      "id": "1",
      "type": "avg",
      "schema": "metric",
      "params": {
        "field": "error_rate"
      }
    },
    {
      "id": "2",
      "type": "date_histogram",
      "schema": "segment",
      "params": {
        "field": "@timestamp",
        "interval": "auto"
      }
    }
  ]
}

Service Error Distribution

{
  "title": "Errors by Service",
  "type": "pie",
  "aggs": [
    {
      "id": "1",
      "type": "count",
      "schema": "metric"
    },
    {
      "id": "2",
      "type": "terms",
      "schema": "segment",
      "params": {
        "field": "service",
        "size": 10
      }
    }
  ]
}

Kibana Discover

Discover provides ad-hoc search and exploration:

// Sample Discover query
{
  "query": {
    "bool": {
      "must": [
        { "match": { "level": "ERROR" } },
        { "range": { "@timestamp": { "gte": "now-24h" } } }
      ]
    }
  },
  "sort": [{ "@timestamp": "desc" }],
  "fields": ["@timestamp", "level", "message", "service", "trace_id"],
  "filter": [
    {
      "meta": {
        "index": "logs-*",
        "negate": false,
        "params": {},
        "type": "phrase"
      },
      "query": {
        "match_phrase": {
          "service": "api-gateway"
        }
      }
    }
  ]
}

Kibana Dashboard Example

A complete dashboard might include:

Time series of log volume by level
Pie chart of error distribution by service
Table of recent errors with context
Heat map of errors over time by host
Metric visualization of error rate and latency percentiles

Deployment Considerations

Hardware Requirements

Component	CPU	RAM	Disk
Elasticsearch (per node)	4+ cores	8GB+	SSD, 500GB+
Logstash	2+ cores	4GB+	Minimal
Kibana	2 cores	2GB+	Minimal
Beats	1 core	512MB+	Minimal

Elasticsearch is I/O intensive. Use SSDs and ensure adequate disk throughput.

Security

# Enable security in elasticsearch.yml
xpack.security.enabled: true
xpack.security.transport.ssl.enabled: true
xpack.security.http.ssl.enabled: true

# API key authentication
xpack.security.authc.api_key.enabled: true

# Role-based access control: define roles in Kibana, through the role
# management API, or in roles.yml in the Elasticsearch config directory

Scaling

Scale Elasticsearch horizontally by adding nodes. The cluster automatically rebalances shards.

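# elasticsearch.yml: bootstrap a new 3-node cluster (Elasticsearch 7.0+).
# The voting quorum is managed automatically; discovery.zen.minimum_master_nodes
# was removed in 7.0. Node names below are placeholders.
cluster.initial_master_nodes: ["es-node-1", "es-node-2", "es-node-3"]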

# Adjust shard allocation
PUT _cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.enable": "all",
    "cluster.routing.allocation.cluster_concurrent_rebalance": 2
  }
}

When to Use the ELK Stack

Use the ELK Stack when:

You need centralized logging from multiple services and environments
You need full-text search across log entries and application data
You need log analysis and pattern detection with Kibana
You need security analytics and threat detection
You need compliance audit logging and archival
You need infrastructure log aggregation (syslog, nginx, apache)

Don’t use the ELK Stack when:

You have simple applications with minimal logging needs
You only need metrics and dashboards (use Prometheus + Grafana instead)
You have high-volume streaming use cases (Kafka is better suited)
You need real-time alerting on log data (use dedicated alerting tools)
You need large-scale time-series metrics (Elasticsearch is not optimized for pure metrics)

ELK Stack vs Alternatives

Aspect	ELK Stack	Loki	Splunk
Cost	Open source (self-hosted)	Open source (self-hosted)	Commercial (expensive)
Storage efficiency	Medium (indexed)	High (log-structured)	Medium
Query language	KQL (Kibana)	LogQL (Prometheus-style)	SPL
Scalability	Excellent (horizontal)	Excellent	Excellent
Ease of setup	Moderate	Easy	Easy
Full-text search	Excellent	Limited	Excellent
Metrics integration	Via Metricbeat	Native Prometheus	Native
Best for	Complex log analysis, security analytics	High-volume Kubernetes logs	Enterprise compliance, security

Capacity Planning

Choosing the right hardware for Elasticsearch prevents performance issues down the line. These are rough guidelines for typical workloads.

Elasticsearch Node Sizing

Tier	RAM	CPU	Disk (SSD)	Use Case
Hot	64GB+	8+ cores	1TB+	Active indexing, recent data
Warm	32GB	4+ cores	2TB+	Read-only, older indices
Cold	16GB	2 cores	4TB+	Archival, rare queries

The heap size should be at most 50% of available RAM. Keep heap under 32GB if possible to benefit from compressed object pointers. Set -Xms and -Xmx to the same value to avoid heap resizing during runtime.

# Check JVM settings
GET _nodes/jvm?filter_path=nodes.*.jvm.mem
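The heap rule above reduces to a small calculation. The Python sketch below applies it; the 31GB cap is an assumption that stays safely below the compressed-pointer threshold (the exact cutoff depends on the JVM), and the function names are hypothetical.

```python
def recommended_heap_gb(ram_gb, compressed_oops_limit_gb=31.0):
    """Half of RAM, capped below the assumed compressed-pointer threshold."""
    if ram_gb <= 0:
        raise ValueError("ram_gb must be positive")
    return min(ram_gb / 2, compressed_oops_limit_gb)


def jvm_options(ram_gb):
    """Return matching -Xms/-Xmx flags so the heap never resizes at runtime."""
    heap_mb = int(recommended_heap_gb(ram_gb) * 1024)
    return ["-Xms%dm" % heap_mb, "-Xmx%dm" % heap_mb]


if __name__ == "__main__":
    for ram in (16, 32, 64, 128):
        print(ram, "GB RAM ->", jvm_options(ram))
```

The remaining RAM is not wasted: the operating system uses it for the filesystem cache, which Lucene relies on heavily.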

Estimating Storage Requirements

Calculate expected index size using this formula:

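index_size = source_log_volume × compression_ratio × replica_factor

Here compression_ratio is the on-disk size of one copy divided by the raw log size, and replica_factor is 1 + number_of_replicas. The ratio depends heavily on your mappings. As a rough guide, well-mapped JSON logs whose older indices are force-merged by ILM can compress 3-5x relative to raw size, while indices with uncontrolled dynamic mappings and many extra fields can grow well beyond the raw volume (10-20x in bad cases). Measure the ratio on a sample of your own logs before sizing disks.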

Log Volume Estimation

# Estimate daily log volume per service
# Assume: 1000 requests/min × 10KB avg log size × 60 min × 24h = ~14.4GB/day

# With 3 replicas (4 copies in total) and 30% overhead:
# 14.4GB × 4 copies × 1.3 overhead × 30 days = ~2.2TB/month per service
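The same estimate can be written as a small Python calculator so the inputs are easy to vary. This sketch assumes decimal units (1GB = 1,000,000KB), counts three replicas as four copies of the data, and applies the 30% overhead as a multiplier on top of those copies; the function names are hypothetical.

```python
def daily_raw_gb(requests_per_min, avg_log_kb):
    """Raw log volume per day in GB (decimal units)."""
    return requests_per_min * avg_log_kb * 60 * 24 / 1_000_000


def monthly_storage_tb(daily_gb, replicas, overhead=0.30, days=30):
    """Disk needed for one month: primaries plus replicas, plus overhead."""
    copies = 1 + replicas
    return daily_gb * copies * (1 + overhead) * days / 1000


if __name__ == "__main__":
    daily = daily_raw_gb(requests_per_min=1000, avg_log_kb=10)
    print("daily raw volume (GB):", round(daily, 1))
    print("monthly storage (TB):", round(monthly_storage_tb(daily, replicas=3), 2))
```

Dropping from three replicas to one halves the number of copies, which is usually the largest single lever in this formula.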

Scaling Triggers

Watch these metrics to decide when to add nodes:

Cluster health: Persistent yellow or red status (unassigned shards) often means nodes are out of disk or capacity
Indexing latency: P95 above 500ms indicates saturation
Search latency: P95 above 1s for interactive queries
Disk usage: Nodes approaching 80% capacity
JVM heap pressure: Old generation spending more than 30% time in GC
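These triggers can be encoded as a simple threshold check that runs against whatever monitoring data you already collect. The Python sketch below uses the thresholds from the list above; the metric names are hypothetical and would need to be mapped to your own monitoring fields.

```python
# Thresholds taken from the list above; metric names are hypothetical.
THRESHOLDS = {
    "indexing_p95_ms": 500,
    "search_p95_ms": 1000,
    "disk_used_pct": 80,
    "old_gen_gc_time_pct": 30,
}


def scaling_reasons(metrics):
    """Return human-readable reasons to consider adding nodes."""
    reasons = []
    status = metrics.get("cluster_status")
    if status in ("yellow", "red"):
        reasons.append("cluster_status is " + status)
    for name, limit in THRESHOLDS.items():
        value = metrics.get(name)
        if value is not None and value > limit:
            reasons.append("%s=%s exceeds %s" % (name, value, limit))
    return reasons


if __name__ == "__main__":
    sample = {"cluster_status": "yellow", "indexing_p95_ms": 650, "disk_used_pct": 82}
    for reason in scaling_reasons(sample):
        print(reason)
```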

Observability Stack Integration

ELK works best as part of a broader observability setup. Here is how it fits with other tools.

Prometheus + Grafana Integration

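Beats have no Prometheus output, so Metricbeat cannot expose an endpoint for Prometheus to scrape. The integration runs in the other direction: the Metricbeat prometheus module scrapes Prometheus-format endpoints and indexes the samples in Elasticsearch, next to the cluster metrics collected by the elasticsearch module:

# metricbeat.yml - collect Elasticsearch and Prometheus metrics
metricbeat.modules:
  - module: elasticsearch
    metricsets:
      - node
      - node_stats
    period: 10s
    hosts: ["https://elasticsearch:9200"]

  # Scrape a Prometheus-format /metrics endpoint (host is a placeholder)
  - module: prometheus
    metricsets:
      - collector
    period: 10s
    hosts: ["prometheus:9090"]
    metrics_path: /metrics

output.elasticsearch:
  hosts: ["https://elasticsearch:9200"]

To chart cluster health, indexing throughput, and search latency in Grafana, add Elasticsearch as a Grafana data source, or run a separate exporter such as the community elasticsearch_exporter for Prometheus to scrape.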

Distributed Tracing with Jaeger

Add trace context to your logs so you can jump from a Kibana error directly into Jaeger:

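# Logstash filter to extract Jaeger trace context
filter {
  if [message] =~ /trace_id:/ {
    grok {
      # NOTSPACE stops at whitespace; a trailing DATA would match an empty string
      match => { "message" => "trace_id:%{NOTSPACE:trace_id}\s+span_id:%{NOTSPACE:span_id}" }
    }
    # Only build the link when grok extracted a trace ID
    if [trace_id] {
      mutate {
        add_field => {
          "trace_url" => "https://jaeger.example.com/trace/%{trace_id}"
        }
      }
    }
  }
}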

This lets you correlate log entries with the full request trace in Jaeger.
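The extraction logic is easy to prototype outside Logstash. The Python sketch below pulls the trace and span IDs out of a message and builds the Jaeger link; it assumes the same `trace_id:... span_id:...` message layout and the placeholder Jaeger host used in this guide, and the function name is hypothetical.

```python
import re

TRACE_RE = re.compile(r"trace_id:(?P<trace_id>\S+)\s+span_id:(?P<span_id>\S+)")


def add_trace_fields(event, base_url="https://jaeger.example.com/trace/"):
    """Add trace_id, span_id, and trace_url when the message carries them."""
    match = TRACE_RE.search(event.get("message", ""))
    if match:
        event.update(match.groupdict())
        event["trace_url"] = base_url + match.group("trace_id")
    return event


if __name__ == "__main__":
    event = {"message": "payment failed trace_id:4bf92f35 span_id:00f067aa"}
    print(add_trace_fields(event))
```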

Trade-off Analysis

Factor	ELK Stack (Self-Hosted)	Loki (Grafana Cloud)	Splunk Enterprise	CloudWatch Logs
Deployment Model	Fully self-managed	SaaS / managed	Self-managed	Fully managed SaaS
Storage Cost	Infrastructure + ops	Pay-per-GB stored	License + infra	Pay-per-GB ingested
Query Performance	Excellent on large data	Good	Excellent	Moderate (throttling)
Operational Burden	High (cluster ops)	Low	High (infra + license)	Very low
Scalability	Manual shard management	Auto-scaling	Manual	Auto-scaling
Learning Curve	Steep (ES DSL)	Low (LogQL)	Steep (SPL)	Low (CloudWatch)
Ecosystem	Beats, Fluentd, Logstash	Promtail, Grafana	Heavy Agents	AWS-native

Production Failure Scenarios

Failure	Impact	Mitigation
Elasticsearch cluster red/yellow	Logs not indexing; search degraded	Monitor cluster health; add data nodes or free disk space; adjust replica settings
Logstash pipeline errors	Logs stuck in queue; processing backlog	Monitor pipeline errors; implement dead-letter queues; alert on queue depth
Hot tier disk saturation	New indices cannot be created; ingestion fails	Monitor disk usage; implement ILM rollover; add nodes
Kibana performance degradation	Slow searches; dashboards timeout	Optimize queries; use filter context; limit time ranges
Beats shipper failure	Logs not forwarded; blind spots in coverage	Monitor Beats health; implement local buffering; alert on forward failures
Index template mismatch	Fields not indexed correctly; search failures	Version index templates; validate mappings; test before deployment

Common Pitfalls / Anti-Patterns

1. Too Many Indices with Few Documents

Each index has overhead. Too many small indices overwhelms the cluster:

// Bad: Index per day per service creates thousands of indices
PUT logs-service-a-2026.03.22
PUT logs-service-b-2026.03.22
// ... thousands more

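// Good: Bootstrap one rollover series per service and let ILM roll it over.
// The first index needs a numeric suffix; the alias cannot share the index name.
PUT logs-service-a-000001
{
  "aliases": {
    "logs-service-a": { "is_write_index": true }
  }
}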
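A quick calculation shows how fast daily per-service indices add up. The Python sketch below compares the two approaches under simple assumptions: 3 primaries and 1 replica per index (as in the template earlier), and rollover triggered only by a 7-day max_age. The function names and the example of 50 services are hypothetical.

```python
import math


def daily_index_totals(services, retention_days, primaries=3, replicas=1):
    """Indices and shard copies when every service gets one index per day."""
    indices = services * retention_days
    return indices, indices * primaries * (1 + replicas)


def rollover_index_count(services, retention_days, rollover_days=7):
    """Indices when each service rolls over by age only (size triggers ignored)."""
    return services * math.ceil(retention_days / rollover_days)


if __name__ == "__main__":
    print("daily indices, shard copies:", daily_index_totals(50, 365))
    print("weekly rollover indices:", rollover_index_count(50, 365))
```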

2. Dynamic Field Mapping Without Controls

Dynamic mapping can create unexpected field types and cause a mapping explosion, where the number of fields keeps growing:

// Bad: Unrestricted dynamic mapping
{
  "mappings": {
    "dynamic": "true" // Creates any field
  }
}

// Good: Strict dynamic mapping or disabled
{
  "mappings": {
    "dynamic": "strict",
    "properties": {
      "@timestamp": { "type": "date" },
      "level": { "type": "keyword" },
      "message": { "type": "text" }
    }
  }
}

3. Not Using Filter Context for Simple Queries

Filter context is faster because it does not score:

// Bad: Query context for term filter
{
  "query": {
    "match": { "level": "ERROR" } // Scores, slower
  }
}

// Good: Filter context for exact match
{
  "query": {
    "bool": {
      "filter": [
        { "term": { "level": "ERROR" } } // No scoring, faster
      ]
    }
  }
}

4. Ignoring Index Lifecycle Management

Without ILM, indices grow unbounded and performance degrades:

// Good: ILM with hot/warm/cold/delete
{
  "policy": {
    "phases": {
      "hot": {
        "min_age": "0ms",
        "actions": { "rollover": { "max_age": "7d" } }
      },
      "warm": { "min_age": "7d", "actions": { "shrink": { "number_of_shards": 1 }, "forcemerge": { "max_num_segments": 1 } } },
      "cold": { "min_age": "30d", "actions": { "freeze": {} } },
      "delete": { "min_age": "365d", "actions": { "delete": {} } }
    }
  }
}

5. Loading Too Much Data into Memory

Kibana visualizations on large time ranges cause OOM:

// Bad: Visualize 90 days of minute-level data
{
  "query": { "range": { "@timestamp": { "gte": "now-90d" } } }
}

// Good: Use date histogram with appropriate interval
{
  "aggs": {
    "over_time": {
      "date_histogram": {
        "field": "@timestamp",
        "fixed_interval": "1h" // Or auto with proper configuration
      }
    }
  }
}
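The cost of a date histogram is roughly the number of buckets it produces, so it is worth computing that number before picking an interval. The Python sketch below does so; it assumes the documented default of 65,536 for `search.max_buckets` in recent Elasticsearch versions, and the target of 1,000 buckets per chart and the candidate intervals are arbitrary choices for illustration.

```python
import math
from datetime import timedelta

MAX_BUCKETS = 65_536  # assumed default of search.max_buckets

CANDIDATES = [
    timedelta(minutes=1),
    timedelta(minutes=5),
    timedelta(minutes=15),
    timedelta(hours=1),
    timedelta(hours=3),
    timedelta(hours=12),
    timedelta(days=1),
]


def bucket_count(time_range, interval):
    """Number of histogram buckets needed to cover the time range."""
    return math.ceil(time_range / interval)


def pick_interval(time_range, target_buckets=1000):
    """Smallest candidate interval that keeps the chart under the target."""
    for interval in CANDIDATES:
        if bucket_count(time_range, interval) <= target_buckets:
            return interval
    return CANDIDATES[-1]


if __name__ == "__main__":
    ninety_days = timedelta(days=90)
    print("1m buckets over 90d:", bucket_count(ninety_days, timedelta(minutes=1)))
    print("suggested interval:", pick_interval(ninety_days))
```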

Real-world Failure Scenarios

Scenario 1: Elasticsearch Cluster Red/Yellow

What happened: After an unexpected traffic spike, the Elasticsearch cluster’s hot tier disk became saturated. New indices could not be created, causing log ingestion to fail across multiple services simultaneously.

Root cause: Index Lifecycle Management (ILM) was not configured, and disk alerts were set too high to catch gradual saturation before it became critical.

Impact: Approximately 4 hours of log data were lost. Engineers lost visibility into production systems during a post-incident analysis window.

Lesson learned: Configure ILM policies before going to production. Set disk usage alerts at 70% as a warning threshold and 80% as a critical threshold.

Scenario 2: Logstash Pipeline Backlog

What happened: A misconfigured Logstash filter stalled the pipeline. Workers spent their CPU time on a single pattern and processed almost no events while the input queue grew to millions of pending messages.

Root cause: A Grok filter pattern with an overly broad regex caused Logstash to CPU-thrash while failing to match any events.

Impact: Logs were delayed by 12 hours before the queue depth alert fired. Correlating logs with real-time events during the incident window was impossible.

Lesson learned: Implement dead-letter queues for failed events. Monitor pipeline queue depth. Test filter patterns on sample data before deploying to production.

Scenario 3: Kibana Dashboard Timeout

What happened: A dashboard aggregating logs across 30-day windows with complex Vega visualizations began timing out. Users reported that the Kibana UI became unresponsive for all users on the cluster.

Root cause: The cluster’s coordinating node was memory-constrained. Large aggregations caused heap pressure and triggered long GC pauses, affecting every request routed through that node.

Impact: All Kibana users lost access to dashboards for approximately 45 minutes during peak business hours.

Lesson learned: Set query timeout limits in Kibana. Limit the default time range for users. Implement query caching and use rollup indices for historical data.

Interview Questions

1. How does Elasticsearch handle horizontal scaling when you add new nodes to a cluster?

What to cover:

Elasticsearch distributes shards across all data nodes automatically
New nodes trigger rebalancing; by default shards are spread to even out shard counts per node, and recent versions also weigh disk usage and write load
Master node updates routing tables without manual intervention
You can increase replica count after adding nodes for better redundancy

2. When should you use filter context versus query context in Elasticsearch?

What to cover:

Filter context: exact matching without relevance scoring, cached automatically (good for term, range, exists queries)
Query context: full-text search with scoring (match, query_string queries)
Use filter for status codes, user IDs, anything you want to match exactly
Use query when you need results ranked by relevance, like searching log messages
bool must = query context, bool filter = filter context in compound queries

3. Walk through the ILM phases and what actually happens in each one.

What to cover:

Hot: new indices accept writes; rollover triggers based on age or size; set_priority makes these indices recover first after a node restart
Warm: index becomes read-only; shrink reduces primary shards, forcemerge combines segments, priority drops
Cold: index is frozen and not actively queried; priority goes to zero
Delete: index disappears after the retention period; useful for compliance and clearing old data
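Phases run in order; an index enters a phase once its age passes that phase's min_age, measured from rollover when rollover is used and from index creation otherwise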

4. Elasticsearch cluster is stuck in yellow or red. How do you diagnose it?

What to cover:

Start with GET _cluster/health?pretty to see the overall status
Run GET _cat/shards?h=index,shard,prirep,state to find unassigned shards
Yellow = replica shards unassigned; red = primary shards missing
Common culprits: disk watermark breaches, heap pressure, network partition, node crashes
Solutions vary: add nodes, free disk space or raise watermark thresholds as a stopgap, drop replica count, increase heap, restart unresponsive nodes
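The yellow versus red distinction can be derived directly from the shard listing. The Python sketch below parses the text output of `GET _cat/shards?h=index,shard,prirep,state` and classifies the result; the sample rows and function names are made up for illustration.

```python
def unassigned_shards(cat_output):
    """Return (index, shard, prirep) for every UNASSIGNED row."""
    rows = []
    for line in cat_output.strip().splitlines():
        parts = line.split()
        if len(parts) < 4:
            continue  # skip blank or malformed lines
        index, shard, prirep, state = parts[:4]
        if state == "UNASSIGNED":
            rows.append((index, int(shard), prirep))
    return rows


def status_from_unassigned(rows):
    """red if any primary is unassigned, yellow if only replicas, else green."""
    if any(prirep == "p" for _, _, prirep in rows):
        return "red"
    return "yellow" if rows else "green"


if __name__ == "__main__":
    sample = """
    logs-api-2026.03.22 0 p STARTED
    logs-api-2026.03.22 0 r UNASSIGNED
    logs-web-2026.03.22 1 p STARTED
    """
    rows = unassigned_shards(sample)
    print(rows, "->", status_from_unassigned(rows))
```

For each row it reports, `GET _cluster/allocation/explain` gives the specific reason the shard could not be placed.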

5. Describe the three Logstash pipeline stages and what each does with a real example.

What to cover:

Input: receives data from external sources (Beats, HTTP, files); example: beats plugin listening on port 5044
Filter: parses and enriches raw data; example: grok patterns extracting HTTP status code from an access log
Output: sends processed data to a destination; example: elasticsearch plugin writing to a daily index
Data flows sequentially: input feeds filter, filter feeds output
You can run multiple pipelines in parallel for different log types

6. What is the difference between index templates and ILM policies in Elasticsearch?

What to cover:

Index templates define mappings and settings for new indices matching a pattern automatically
They enforce consistency: field types, shard counts, replica counts, even ILM policy assignment
ILM policies govern what happens to indices over time (rollover, shrink, freeze, delete)
Templates handle structure at creation; ILM handles data management afterward
You can combine them: a template assigns an ILM policy so new indices automatically follow it

7. What is the practical difference between Beats and Logstash, and when would you pick one?

What to cover:

Beats are lightweight agents you install on edge machines; Logstash is heavier and runs on dedicated servers
Use Filebeat to tail log files, Metricbeat to collect system metrics, Heartbeat for uptime checks
Use Logstash when you need grok parsing, multi-step enrichment, conditional routing, or GeoIP lookups
In practice: Beats collect and ship, Logstash transforms and routes, Elasticsearch stores

8. What security features does the ELK stack offer and how do you turn them on?

What to cover:

XPack Security adds encryption, authentication, and role-based access control
Enable in elasticsearch.yml: xpack.security.enabled: true plus TLS for transport and HTTP
API key authentication via xpack.security.authc.api_key.enabled: true for scripted access
RBAC with built-in roles like kibana_admin or custom roles defined in roles.yml
Audit logging with xpack.security.audit.enabled: true tracks who changed what
Kibana spaces let you isolate dev, staging, and prod environments visually

9. What is the difference between rollover and shrinking in ILM?

What to cover:

Rollover creates a fresh index when the current one hits max_age or max_primary_shard_size
The write alias switches to the new index; the old one stops receiving writes but stays searchable
Shrinking takes an existing index and rewrites it with fewer primary shards (say, from 5 down to 1)
Shrink copies data into a new index then deletes the original; rollover just switches aliases
Rollover handles time-based streams; shrinking optimizes read-heavy historical indices for storage
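One detail worth knowing about shrinking: the target shard count must be a factor of the source shard count, so an index with 5 primaries can only shrink to 1. A minimal Python sketch of that rule (the function name is hypothetical):

```python
def valid_shrink_targets(source_shards):
    """Shard counts an index can shrink to: proper factors of the source count."""
    if source_shards < 1:
        raise ValueError("source_shards must be at least 1")
    return [n for n in range(1, source_shards) if source_shards % n == 0]


if __name__ == "__main__":
    for shards in (5, 6, 8):
        print(shards, "primaries can shrink to", valid_shrink_targets(shards))
```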

10. How would you architect an ELK stack for 500GB+ of logs per day?

What to cover:

Use hot-warm-cold architecture: SSDs for active indexing, larger spinning disks for warm, cheap storage for cold
Run multiple Logstash nodes behind a load balancer; each pipeline handles around 50GB/day
Enable local buffering in Beats so temporary Logstash outages do not cause data loss
Configure ILM: 7d hot, 30d warm, 90d cold, 365d delete to manage storage growth
Set index templates with proper shard sizing (target 50GB per shard) and compression enabled
Consider Kafka or Redis as a buffer between Beats and Logstash to absorb traffic spikes
Watch queue depth and dead letter queues to catch backlogs before they become outages

11. How do Elasticsearch indexing strategies affect search performance and storage costs?

What to cover:

Shard count directly impacts search parallelism: more shards means more parallel searches but also more overhead
Target 20-50GB per shard for optimal balance; shards that are too small cause overhead, too large cause slow recovery
Use rollover APIs to create new indices when a size or age threshold is reached, rather than creating one index per fixed calendar period
For time-based logs, daily indices work well at moderate volume; high volume may need hourly
Index templates enforce consistent mappings and settings across all indices matching a pattern
Consider disabling norms on text fields you never search with relevance scoring (keyword fields already have norms disabled by default)
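The shard sizing guidance translates into a one-line calculation. The Python sketch below derives a primary shard count from the expected index size; the 50GB target and the 20-50GB window come from the points above, and the function names are hypothetical.

```python
import math


def primary_shards_needed(index_size_gb, target_shard_gb=50):
    """Primaries required so that no shard exceeds the target size."""
    if index_size_gb < 0 or target_shard_gb <= 0:
        raise ValueError("sizes must be positive")
    return max(1, math.ceil(index_size_gb / target_shard_gb))


def shard_size_in_range(index_size_gb, primaries, low_gb=20, high_gb=50):
    """True when the resulting shard size falls in the recommended window."""
    return low_gb <= index_size_gb / primaries <= high_gb


if __name__ == "__main__":
    for size in (14.4, 120, 500):
        shards = primary_shards_needed(size)
        print(size, "GB ->", shards, "primaries, in range:",
              shard_size_in_range(size, shards))
```

A small daily index (14.4GB here) lands below the window even with one primary, which is a sign that a longer rollover period would suit it better than daily indices.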

12. What are the trade-offs between Logstash and Beats for shipping data to Elasticsearch?

What to cover:

Beats are lightweight agents with minimal memory footprint (512MB baseline); Logstash requires 4GB+ servers
Beats do simple shipping and some preprocessing; Logstash does complex transformations, enrichment, and conditional routing
Use Beats for: file tailing, metric collection, heartbeat monitoring, simple field additions
Use Logstash for: grok parsing, multi-step enrichment chains, GeoIP lookups, business-logic-based routing
In practice, many architectures use both: Beats handle edge collection, Logstash handles transformation
Filebeat can do light parsing (JSON, nginx, apache logs) without Logstash if you do not need complex grok

13. How does hot-warm-cold architecture reduce Elasticsearch costs at scale?

What to cover:

Hot nodes use SSDs for fast I/O and handle all writes and recent queries
Warm nodes use larger spinning disks for read-only historical data that is queried occasionally
Cold nodes use cheap bulk storage for archival data that is rarely accessed but must be retained
ILM automates movement between tiers: hot (7d) → warm (30d) → cold (90d) → delete (365d)
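Frozen indices keep their data structures off the heap and load them only when queried, which reduces RAM needs; the freeze API is deprecated since 7.14 in favor of the frozen tier with searchable snapshots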
For 500GB/day ingestion, hot-warm-cold can cut storage costs by 60-70% compared to all-SSD

14. How do you detect and resolve Elasticsearch cluster rebalancing issues?

What to cover:

Rebalancing happens when nodes join or leave; it redistributes shards to even out shard counts across nodes (recent versions also weigh disk usage and write load)
Monitor with GET _cat/shards?h=index,shard,prirep,state,store and look for RELOCATING shards
High rebalance rates can saturate network and disk I/O, degrading search and indexing performance
Throttle rebalancing with cluster.routing.allocation.cluster_concurrent_rebalance (the default is 2; lower it to reduce relocation load)
Watermark settings control when nodes stop accepting new shards (low 85%, high 90%, flood 95%)
If a node is slow, Elasticsearch may think it failed and start relocating shards unnecessarily
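The watermark thresholds map onto distinct cluster behaviors, which a small lookup makes explicit. This Python sketch uses the default percentages listed above; the function name and the wording of the effect descriptions are a summary, not Elasticsearch output.

```python
# Default disk watermarks (percent used), highest first.
WATERMARKS = [
    (95, "flood_stage", "indices with a shard on this node are blocked for writes"),
    (90, "high", "shards are relocated away from this node"),
    (85, "low", "no new shards are allocated to this node"),
]


def watermark_state(used_pct):
    """Return (state, effect) for a node's disk usage percentage."""
    for threshold, state, effect in WATERMARKS:
        if used_pct >= threshold:
            return state, effect
    return "ok", "node accepts new shards"


if __name__ == "__main__":
    for pct in (70, 86, 92, 97):
        print(pct, "% ->", watermark_state(pct))
```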

15. What are the key Kibana visualization best practices for operational dashboards?

What to cover:

Use filter context in all dashboard queries to avoid unnecessary relevance scoring
Limit time ranges by default; users can expand but loading 90 days of minute data crashes browsers
Use date histogram aggregations with appropriate intervals: 1h for 30d views, 1m for under 24h
For high-cardinality fields like user_id, use terms aggregation with size limit to avoid memory issues
Pin frequently used filters at the dashboard level so every visualization respects them
Break complex dashboards into multiple saved searches rather than one monolithic view

16. How does Elasticsearch handle shard allocation and what causes unassigned shards?

What to cover:

Shard allocation is the process of assigning shards to nodes based on resource usage and allocation policies
The number of primary shards is fixed at index creation; the replica count can be changed at any time
Unassigned shards appear when: disk watermark breached, node left cluster, replica count increased, newly created index
Yellow status means replicas are unassigned (data safe but fault tolerance compromised)
Red status means primaries are missing (data loss risk); check logs for the specific allocation reason
Use GET _cluster/allocation/explain to get detailed reason for a specific unassigned shard
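
To make the yellow/red distinction concrete, here is a hedged Python sketch that derives the status color from a list of unassigned shards and builds the body for GET _cluster/allocation/explain. The function names and the input shape are hypothetical; the index, shard, and primary keys are the ones the explain API accepts.

```python
from typing import Dict, List

def cluster_color(unassigned: List[Dict]) -> str:
    """Red if any primary is unassigned, yellow if only replicas are, else green."""
    if any(shard["primary"] for shard in unassigned):
        return "red"
    return "yellow" if unassigned else "green"

def explain_request(index: str, shard: int, primary: bool) -> Dict:
    """Body for GET _cluster/allocation/explain targeting one specific shard."""
    return {"index": index, "shard": shard, "primary": primary}

# Hypothetical unassigned shards, e.g. collected from _cat/shards output.
unassigned = [
    {"index": "logs-2026.03.21", "shard": 0, "primary": False},
]

print(cluster_color(unassigned))
print(explain_request("logs-2026.03.21", 0, primary=False))
```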

17. What monitoring metrics should you watch for Logstash pipeline health?

What to cover:

Pipeline throughput (events/sec) to catch degradation before backlog accumulates
Queue depth: persistent queue bytes and unacknowledged events indicate backpressure
Dead letter queue size and age; a growing DLQ means events are failing to index and never reach Elasticsearch unless they are reprocessed
Filter execution time: slow filters (GeoIP, DNS lookup) can become bottlenecks
Input metrics: bytes received by Logstash, connection errors from Beats
JVM heap pressure: Logstash runs Java; heap > 80% causes GC pauses and slow processing
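
One way to turn these metrics into checks is to compare two snapshots taken a fixed interval apart. The sketch below assumes the values were already pulled from the Logstash monitoring API (GET _node/stats) and flattened into plain dictionaries; the key names, the function name, and the sample numbers are all hypothetical.

```python
from typing import Dict

def pipeline_health(prev: Dict, curr: Dict, interval_s: float,
                    heap_limit_pct: int = 80) -> Dict:
    """Derive simple health signals from two flattened stats snapshots."""
    in_delta = curr["events_in"] - prev["events_in"]
    out_delta = curr["events_out"] - prev["events_out"]
    return {
        "throughput_eps": out_delta / interval_s,       # events leaving per second
        "backlog_growing": in_delta > out_delta,        # more arriving than leaving
        "dlq_growing": curr["dlq_bytes"] > prev["dlq_bytes"],
        "heap_pressure": curr["heap_used_percent"] > heap_limit_pct,
    }

# Hypothetical snapshots taken 60 seconds apart.
prev = {"events_in": 100_000, "events_out": 99_000, "dlq_bytes": 1024, "heap_used_percent": 62}
curr = {"events_in": 160_000, "events_out": 129_000, "dlq_bytes": 4096, "heap_used_percent": 85}

health = pipeline_health(prev, curr, interval_s=60)
print(health)
```

Working from deltas rather than absolute counters keeps the checks meaningful after a Logstash restart resets the counters, as long as negative deltas are discarded.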

18. How do you secure Beats-to-Logstash and Logstash-to-Elasticsearch communication?

What to cover:

Enable TLS on Beats output to Logstash with self-signed certificates or CA-issued certs
Configure Logstash to verify client certificates for mutual TLS from Beats
Store credentials in environment variables or keystore, never in configuration files
Use API key authentication for Beats-to-Elasticsearch direct shipping (no Logstash)
Rotate SSL certificates regularly; expired certs on Beats cause silent shipping failures
Logstash-to-Elasticsearch should use TLS with certificate verification and credential secrets
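
Two of the points above lend themselves to small automated checks: warning before a certificate expires, and reading credentials from the environment rather than from files. This is a minimal sketch using only the Python standard library; the function names and the environment variable name are hypothetical.

```python
import os
import ssl
import time
from typing import Optional

def days_until_expiry(not_after: str, now: Optional[float] = None) -> float:
    """Days left before a certificate's notAfter time, e.g. 'Jan  5 09:34:43 2018 GMT'."""
    expires = ssl.cert_time_to_seconds(not_after)
    current = time.time() if now is None else now
    return (expires - current) / 86400  # negative means already expired

def require_secret(name: str) -> str:
    """Read a credential from the environment and fail loudly if it is missing."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"missing required secret: {name}")
    return value

# Fixed 'now' (10 days before expiry) so the example is reproducible.
remaining = days_until_expiry("Jan  5 09:34:43 2018 GMT", now=1515144883 - 10 * 86400)
print(remaining)
```

A scheduled job could call days_until_expiry for each shipper certificate and alert when the value drops below a chosen threshold.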

19. What are the most common causes of Kibana slow query performance and how do you fix them?

What to cover:

Querying too wide a time range: use narrow defaults and let users expand; always use date histogram with auto or fixed intervals
High-cardinality aggregations: terms aggregation on user_id or trace_id with size: 10000 blows up memory
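Missing filter context: exact-match and range conditions placed in the scoring part of a query (must) instead of the bool filter clause are scored when they do not need to be, and they miss filter caching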
Visualizations on scripted fields or runtime fields that execute per document
Large result sets returned to the browser: set size: 0 on aggregation requests instead of returning hits
Cross-cluster searches add network latency; consider local index patterns instead
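
The cardinality and result-size fixes can be combined in one request builder. The Python sketch below caps the terms aggregation size and keeps exact-match conditions in filter context; MAX_TERMS, the function name, and the field names are assumptions chosen for illustration.

```python
from typing import Dict

MAX_TERMS = 100  # hypothetical cap on terms buckets for high-cardinality fields

def top_values_query(field: str, filters: Dict[str, str], size: int = 10) -> Dict:
    """Aggregation-only body: no hits, no scoring, bounded bucket count."""
    return {
        "size": 0,  # do not ship documents to the browser
        "query": {
            "bool": {
                # term clauses under `filter` skip relevance scoring
                "filter": [{"term": {name: value}} for name, value in filters.items()]
            }
        },
        "aggs": {
            "top_values": {"terms": {"field": field, "size": min(size, MAX_TERMS)}}
        },
    }

# A request for 10000 buckets on user_id is clamped to MAX_TERMS.
query = top_values_query("user_id", {"service.name": "checkout"}, size=10000)
print(query["aggs"]["top_values"]["terms"]["size"])
```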

20. How would you design an ELK stack migration from an existing Splunk deployment?

What to cover:

Map Splunk indexes to Elasticsearch indices; plan index naming and template structure
Beats take over the role of Universal Forwarders, while Logstash covers the parsing and routing done by Heavy Forwarders; Filebeat handles most log types natively
SPL queries convert to KQL or Elasticsearch query DSL; some queries need redesign
Field names differ: Splunk "sourcetype" maps to a field in Elasticsearch, not a native concept
License cost comparison: Splunk is commercial and expensive; ELK is open source but has operational costs
Parallel run: ship same logs to both systems during transition period to validate data integrity
Start with non-critical logs, migrate incrementally, validate search results match before decommissioning Splunk
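
During the parallel run, one simple validation is to compare event counts per time bucket from both systems. The sketch below assumes the counts were already exported into plain dictionaries keyed by bucket label; the function name, the 1% tolerance, and the sample numbers are hypothetical.

```python
from typing import Dict, Tuple

def compare_bucket_counts(splunk: Dict[str, int], elastic: Dict[str, int],
                          tolerance: float = 0.01) -> Dict[str, Tuple[int, int]]:
    """Return buckets whose counts differ by more than `tolerance` (relative)."""
    mismatches = {}
    for bucket in sorted(set(splunk) | set(elastic)):
        s = splunk.get(bucket, 0)   # a bucket missing on one side counts as 0
        e = elastic.get(bucket, 0)
        baseline = max(s, e)
        if baseline and abs(s - e) / baseline > tolerance:
            mismatches[bucket] = (s, e)
    return mismatches

# Hypothetical five-minute bucket counts from each system.
splunk_counts = {"10:00": 1000, "10:05": 1200, "10:10": 900}
elastic_counts = {"10:00": 1000, "10:05": 1195, "10:10": 700}

print(compare_bucket_counts(splunk_counts, elastic_counts))
```

A small tolerance absorbs timestamp skew at bucket edges; persistent gaps in the same sources usually point to a parsing or shipping problem worth fixing before decommissioning Splunk.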

Conclusion

The ELK Stack provides a powerful platform for centralized logging and analysis. Beats collect data efficiently, Logstash transforms it into structured format, Elasticsearch stores and indexes it, and Kibana makes it explorable.

Start with Filebeat shipping container logs to Elasticsearch, and build from there. Add Logstash for complex parsing, Kibana for visualizations, and ILM policies for efficient data retention.

For monitoring beyond logs, see our Prometheus & Grafana guide for metrics visualization. For distributed tracing, see the Jaeger and Distributed Tracing guides for correlating logs with request traces.

