ELK Stack: Elasticsearch, Logstash, Kibana, and Beats
Complete guide to the ELK Stack for log aggregation and analysis. Learn Elasticsearch indexing, Logstash pipelines, Kibana visualizations, and Beats shippers.
ELK Stack Deep Dive: Elasticsearch, Logstash, Kibana, and Beats
The ELK Stack is a popular open-source solution for centralized logging. It lets you collect logs from multiple sources, transform them into a structured format, store them efficiently, and query them interactively.
This guide covers each component in depth. If you are new to logging concepts, start with our Logging Best Practices guide first.
Introduction
graph LR
A[Log Sources] -->|Shippers| B[Beats]
B --> C[Logstash]
C --> D[Elasticsearch]
D --> E[Kibana]
A -->|Direct| D
The ELK Stack has four main components:
- Beats: Lightweight shippers that collect data from various sources
- Logstash: Transforms and enriches data during transit
- Elasticsearch: Stores and indexes data for fast search
- Kibana: Visualizes and explores data
Core Concepts
Elasticsearch is a distributed search and analytics engine built on Apache Lucene. It stores documents in JSON format and provides powerful query capabilities.
Elasticsearch Key Concepts
| Concept | Description |
|---|---|
| Index | Collection of documents, similar to a database |
| Document | A single JSON record, similar to a row |
| Shard | A partition of an index for horizontal scaling |
| Replica | A copy of a shard for high availability |
Index Lifecycle Management
Define policies to manage index data from creation to deletion:
PUT _ilm/policy/logs-policy
{
"policy": {
"phases": {
"hot": {
"actions": {
"rollover": {
"max_age": "7d",
"max_primary_shard_size": "50gb"
},
"set_priority": { "priority": 100 }
}
},
"warm": {
"min_age": "7d",
"actions": {
"shrink": {
"number_of_shards": 1
},
"forcemerge": {
"max_num_segments": 1
},
"set_priority": { "priority": 50 }
}
},
"cold": {
"min_age": "30d",
"actions": {
"freeze": {},
"set_priority": { "priority": 0 }
}
},
"delete": {
"min_age": "365d",
"actions": {
"delete": {}
}
}
}
}
}
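To sanity-check a policy like this before applying it, it can help to compute which phase an index should have reached at a given age. The sketch below is plain Python, not an Elasticsearch API call. The function name `expected_phase` is hypothetical, the thresholds are copied from the policy above, and it assumes `min_age` is counted from rollover, as it is for indices that have rolled over.

```python
from datetime import timedelta

# min_age thresholds copied from the logs-policy example, oldest phase first.
PHASE_MIN_AGE = [
    ("delete", timedelta(days=365)),
    ("cold", timedelta(days=30)),
    ("warm", timedelta(days=7)),
    ("hot", timedelta(0)),
]


def expected_phase(age_since_rollover: timedelta) -> str:
    """Return the latest phase whose min_age the index has already passed."""
    for phase, min_age in PHASE_MIN_AGE:
        if age_since_rollover >= min_age:
            return phase
    return "hot"


if __name__ == "__main__":
    for days in (0, 10, 45, 400):
        print(days, "days ->", expected_phase(timedelta(days=days)))
```

Comparing this expectation with the phase that ILM reports for an index is one way to spot a policy that has stalled.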
Mapping and Index Templates
Index templates define mappings and settings for new indices:
PUT _index_template/logs-template
{
"index_patterns": ["logs-*"],
"template": {
"settings": {
"number_of_shards": 3,
"number_of_replicas": 1,
"index.lifecycle.name": "logs-policy"
},
"mappings": {
"properties": {
"@timestamp": {
"type": "date"
},
"level": {
"type": "keyword"
},
"message": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"service": {
"type": "keyword"
},
"trace_id": {
"type": "keyword"
},
"user_id": {
"type": "keyword"
},
"duration_ms": {
"type": "long"
},
"host": {
"properties": {
"name": { "type": "keyword" },
"ip": { "type": "ip" }
}
}
}
}
}
}
Querying Elasticsearch
GET logs-2026.03.22/_search
{
"query": {
"bool": {
"filter": [
{ "term": { "service": "api-gateway" } },
{ "term": { "level": "ERROR" } },
{ "range": { "@timestamp": { "gte": "now-1h" } } }
]
}
},
"sort": [
{ "@timestamp": "desc" }
],
"aggs": {
"error_by_service": {
"terms": { "field": "service" },
"aggs": {
"error_rate": {
"avg": { "field": "error_count" }
}
}
}
}
}
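When the same search is issued from application code, building the request body as a dictionary keeps the clauses easy to test. This is a minimal sketch that only constructs the JSON body and does not call any client library. The helper name `build_error_query` and its defaults are assumptions for illustration, and every clause is placed in filter context because none of them needs relevance scoring.

```python
import json


def build_error_query(service: str, level: str = "ERROR", since: str = "now-1h") -> dict:
    """Build a search body that matches exact values without scoring."""
    return {
        "size": 50,  # hypothetical page size
        "query": {
            "bool": {
                "filter": [
                    {"term": {"service": service}},
                    {"term": {"level": level}},
                    {"range": {"@timestamp": {"gte": since}}},
                ]
            }
        },
        "sort": [{"@timestamp": "desc"}],
    }


if __name__ == "__main__":
    print(json.dumps(build_error_query("api-gateway"), indent=2))
```

The printed body can be pasted into a `GET logs-*/_search` request as-is.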
Logstash
Logstash processes and transforms data before it reaches Elasticsearch. It handles complex parsing, enrichment, and filtering.
Logstash Pipeline
graph TB
A[Input] --> B[Filter]
B --> C[Output]
A pipeline has three sections: input, filter, and output.
Input Plugins
# Receive logs from Beats
input {
beats {
port => 5044
ssl => true
ssl_certificate => "/etc/ssl/certs/logstash.crt"
ssl_key => "/etc/ssl/private/logstash.key"
}
# Alternative: direct HTTP
http {
port => 8080
additional_codecs => { "application/json" => "json" }
}
}
Filter Plugins
Filters transform and enrich data:
filter {
# Parse JSON logs
json {
source => "message"
target => "parsed"
}
# Parse timestamp
date {
match => ["[parsed][timestamp]", "ISO8601"]
target => "@timestamp"
}
# Extract fields from message
grok {
match => {
"[parsed][message]" => "%{LOGLEVEL:level}\s+%{NOTSPACE:logger}\s+%{GREEDYDATA:log_message}"
}
overwrite => ["message"]
}
# Add computed fields
mutate {
add_field => {
"environment" => "%{[parsed][env]}"
"[@metadata][index_prefix]" => "logs-%{[parsed][service]}"
}
}
# Enrich with GeoIP
geoip {
source => "[parsed][client_ip]"
target => "[parsed][geoip]"
database => "/etc/logstash/GeoLite2-City.mmdb"
}
# Parse query string
kv {
source => "[parsed][request_params]"
field_split => "&"
prefix => "param_"
}
}
Output Plugins
output {
elasticsearch {
hosts => ["https://elasticsearch:9200"]
manage_template => false
index => "%{[@metadata][index_prefix]}-%{+YYYY.MM.dd}"
ssl => true
cacert => "/etc/ssl/certs/ca.crt"
user => "${ELASTICSEARCH_USER}"
password => "${ELASTICSEARCH_PASSWORD}"
}
# Also send to stdout for debugging
stdout {
codec => rubydebug
}
}
Complete Pipeline Example
input {
beats {
port => 5044
}
}
filter {
if [fields][log_type] == "application" {
json {
source => "message"
target => "parsed"
}
date {
match => ["[parsed][timestamp]", "ISO8601"]
target => "@timestamp"
}
if [parsed][level] {
mutate {
add_field => { "level" => "%{[parsed][level]}" }
}
}
if [parsed][exception] {
mutate {
add_tag => ["error"]
}
}
}
if [fields][log_type] == "access" {
grok {
match => {
"message" => '%{IPORHOST:client_ip} %{DATA:ident} %{DATA:auth} \[%{HTTPDATE:timestamp}\] "%{WORD:method} %{URIPATHPARAM:request} HTTP/%{NUMBER:httpversion}" %{NUMBER:status:int} %{NUMBER:bytes:int} "%{DATA:referrer}" "%{DATA:user_agent}"'
}
}
date {
match => ["timestamp", "dd/MMM/yyyy:HH:mm:ss Z"]
target => "@timestamp"
}
geoip {
source => "client_ip"
target => "geoip"
}
useragent {
source => "user_agent"
target => "ua"
}
}
}
output {
if "error" in [tags] {
elasticsearch {
hosts => ["https://elasticsearch:9200"]
index => "logs-error-%{+YYYY.MM.dd}"
}
} else {
elasticsearch {
hosts => ["https://elasticsearch:9200"]
index => "logs-%{[fields][log_type]}-%{+YYYY.MM.dd}"
}
}
}
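Grok patterns are easier to trust once they have been tried against sample lines. The sketch below is a rough Python equivalent of the access-log branch above, useful for checking sample data outside Logstash. It uses a hand-written regular expression rather than the real grok pattern library, so it approximates the pattern instead of reproducing it exactly. The function name `parse_access_line` and the sample line are made up for illustration.

```python
import re
from datetime import datetime

# Approximation of the combined access-log grok pattern used in the pipeline.
ACCESS_LOG = re.compile(
    r'(?P<client_ip>\S+) (?P<ident>\S+) (?P<auth>\S+) '
    r'\[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\w+) (?P<request>\S+) HTTP/(?P<httpversion>[\d.]+)" '
    r'(?P<status>\d{3}) (?P<bytes>\d+) '
    r'"(?P<referrer>[^"]*)" "(?P<user_agent>[^"]*)"'
)


def parse_access_line(line: str):
    """Return a dict of extracted fields, or None when the line does not match."""
    match = ACCESS_LOG.match(line)
    if match is None:
        return None
    event = match.groupdict()
    event["status"] = int(event["status"])
    event["bytes"] = int(event["bytes"])
    # Same layout as the date filter: dd/MMM/yyyy:HH:mm:ss Z
    parsed = datetime.strptime(event["timestamp"], "%d/%b/%Y:%H:%M:%S %z")
    event["@timestamp"] = parsed.isoformat()
    return event


if __name__ == "__main__":
    sample = ('203.0.113.7 - - [22/Mar/2026:10:15:32 +0000] '
              '"GET /api/users?id=1 HTTP/1.1" 200 512 "-" "curl/8.0"')
    print(parse_access_line(sample))
```

Lines that return `None` here are good candidates for a closer look before the real pattern goes to production.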
Beats
Beats are lightweight data shippers that send data from servers to Logstash or Elasticsearch.
Filebeat
Filebeat tails log files and ships them:
# filebeat.yml
filebeat.inputs:
  - type: log
    enabled: true
    paths:
      - /var/log/containers/*.log
    json:
      keys_under_root: true
      add_error_key: true
      message_key: log
    fields:
      log_type: container
    processors:
      - add_kubernetes_metadata:
          host: ${NODE_NAME}
          matchers:
            - logs_path:
                logs_path: "/var/log/containers/"
  - type: log
    enabled: true
    paths:
      - /var/log/nginx/*.log
    fields:
      log_type: nginx
    processors:
      - add_locale: ~
processors:
  - add_host_metadata:
      when.not.contains.tags: forwarded
  - add_cloud_metadata: ~
  - add_docker_metadata: ~
output.logstash:
  hosts: ["logstash:5044"]
  ssl.enabled: true
  ssl.certificate_authorities: ["/etc/filebeat/ca.crt"]
  ssl.certificate: "/etc/filebeat/filebeat.crt"
  ssl.key: "/etc/filebeat/filebeat.key"
logging.level: info
logging.to_files: true
logging.files:
  path: /var/log/filebeat
  name: filebeat
  keepfiles: 7
  permissions: 0644
Metricbeat
Metricbeat collects system and service metrics:
metricbeat.modules:
  - module: system
    metricsets:
      - cpu
      - memory
      - network
      - process
      - diskio
    period: 10s
    processes: [".*"]
  - module: docker
    metricsets:
      - container
      - cpu
      - diskio
      - healthcheck
      - info
      - memory
      - network
    hosts: ["unix:///var/run/docker.sock"]
    period: 10s
  - module: nginx
    metricsets:
      - stubstatus
    hosts: ["http://nginx:8080/nginx_status"]
    period: 10s
output.elasticsearch:
  hosts: ["https://elasticsearch:9200"]
  ssl.enabled: true
  ssl.certificate_authorities: ["/etc/metricbeat/ca.crt"]
Heartbeat
Heartbeat monitors service availability with synthetic checks:
heartbeat.monitors:
  - type: http
    name: api-health-check
    schedule: "@every 30s"
    urls:
      - https://api.example.com/health
    check.response:
      status: 200
    fields:
      service: api-gateway
  - type: tcp
    name: redis-connectivity
    schedule: "@every 60s"
    hosts: ["redis:6379"]
    timeout: 5s
  - type: icmp
    name: host-ping
    schedule: "@every 5m"
    hosts: ["elasticsearch"]
output.elasticsearch:
  hosts: ["https://elasticsearch:9200"]
Kibana
Kibana provides the visualization and exploration interface for your Elasticsearch data.
Index Pattern Setup
Before exploring data, create an index pattern in Kibana (called a data view in Kibana 8.x):
- Navigate to Management > Stack Management > Index Patterns
- Click “Create index pattern”
- Enter logs-* as the pattern
- Select @timestamp as the time field
Building Visualizations
Error Rate Over Time
{
"title": "Error Rate",
"type": "line",
"params": {
"type": "line",
"grid": { "categoryLines": false },
"categoryAxes": [
{
"id": "CategoryAxis-1",
"type": "category",
"position": "bottom"
}
],
"valueAxes": [
{
"id": "ValueAxis-1",
"name": "LeftAxis-1",
"type": "value",
"position": "left",
"scale": {
"type": "linear",
"mode": "normal"
}
}
]
},
"aggs": [
{
"id": "1",
"type": "avg",
"schema": "metric",
"params": {
"field": "error_rate"
}
},
{
"id": "2",
"type": "date_histogram",
"schema": "segment",
"params": {
"field": "@timestamp",
"interval": "auto"
}
}
]
}
Service Error Distribution
{
"title": "Errors by Service",
"type": "pie",
"aggs": [
{
"id": "1",
"type": "count",
"schema": "metric"
},
{
"id": "2",
"type": "terms",
"schema": "segment",
"params": {
"field": "service",
"size": 10
}
}
]
}
Kibana Discover
Discover provides ad-hoc search and exploration:
// Sample Discover query
{
"query": {
"bool": {
"must": [
{ "match": { "level": "ERROR" } },
{ "range": { "@timestamp": { "gte": "now-24h" } } }
]
}
},
"sort": [{ "@timestamp": "desc" }],
"fields": ["@timestamp", "level", "message", "service", "trace_id"],
"filter": [
{
"meta": {
"index": "logs-*",
"negate": false,
"params": {},
"type": "phrase"
},
"query": {
"match_phrase": {
"service": "api-gateway"
}
}
}
]
}
Kibana Dashboard Example
A complete dashboard might include:
- Time series of log volume by level
- Pie chart of error distribution by service
- Table of recent errors with context
- Heat map of errors over time by host
- Metric visualization of error rate and latency percentiles
Deployment Considerations
Hardware Requirements
| Component | CPU | RAM | Disk |
|---|---|---|---|
| Elasticsearch (per node) | 4+ cores | 8GB+ | SSD, 500GB+ |
| Logstash | 2+ cores | 4GB+ | Minimal |
| Kibana | 2 cores | 2GB+ | Minimal |
| Beats | 1 core | 512MB+ | Minimal |
Elasticsearch is I/O intensive. Use SSDs and ensure adequate disk throughput.
Security
# Enable security in elasticsearch.yml
xpack.security.enabled: true
xpack.security.transport.ssl.enabled: true
xpack.security.http.ssl.enabled: true
# API key authentication
xpack.security.authc.api_key.enabled: true
# Role-based access control: define file-based roles in roles.yml in the
# Elasticsearch config directory, or manage roles through the security API or Kibana
Scaling
Scale Elasticsearch horizontally by adding nodes. The cluster automatically rebalances shards.
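# Elasticsearch 7.0+ manages the master quorum itself; discovery.zen.minimum_master_nodes is no longer used.
# Bootstrap a new cluster once by listing its master-eligible nodes in elasticsearch.yml (node names are placeholders)
cluster.initial_master_nodes: ["es-node-1", "es-node-2", "es-node-3"]
# Adjust shard allocation
PUT _cluster/settings
{
"persistent": {
"cluster.routing.allocation.enable": "all",
"cluster.routing.allocation.cluster_concurrent_rebalance": 2
}
}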
When to Use the ELK Stack
Use the ELK Stack when:
- You need centralized logging from multiple services and environments
- You need full-text search across log entries and application data
- You need log analysis and pattern detection with Kibana
- You need security analytics and threat detection
- You need compliance audit logging and archival
- You need infrastructure log aggregation (syslog, nginx, apache)
Don’t use the ELK Stack when:
- You have simple applications with minimal logging needs
- You only need metrics and dashboards (use Prometheus + Grafana instead)
- You have high-volume streaming use cases (Kafka is better suited)
- You need real-time alerting on log data (use dedicated alerting tools)
- You need large-scale time-series metrics (Elasticsearch is not optimized for pure metrics)
ELK Stack vs Alternatives
| Aspect | ELK Stack | Loki | Splunk |
|---|---|---|---|
| Cost | Open source (self-hosted) | Open source (self-hosted) | Commercial (expensive) |
| Storage efficiency | Medium (indexed) | High (log-structured) | Medium |
| Query language | KQL (Kibana) | LogQL (Prometheus-style) | SPL |
| Scalability | Excellent (horizontal) | Excellent | Excellent |
| Ease of setup | Moderate | Easy | Easy |
| Full-text search | Excellent | Limited | Excellent |
| Metrics integration | Via Metricbeat | Native Prometheus | Native |
| Best for | Complex log analysis, security analytics | High-volume Kubernetes logs | Enterprise compliance, security |
Capacity Planning
Choosing the right hardware for Elasticsearch prevents performance issues down the line. These are rough guidelines for typical workloads.
Elasticsearch Node Sizing
| Tier | RAM | CPU | Disk (SSD) | Use Case |
|---|---|---|---|---|
| Hot | 64GB+ | 8+ cores | 1TB+ | Active indexing, recent data |
| Warm | 32GB | 4+ cores | 2TB+ | Read-only, older indices |
| Cold | 16GB | 2 cores | 4TB+ | Archival, rare queries |
The heap size should be at most 50% of available RAM. Keep heap under 32GB if possible to benefit from compressed object pointers. Set -Xms and -Xmx to the same value to avoid heap resizing during runtime.
# Check JVM settings
GET _nodes/jvm?filter_path=nodes.*.jvm.memory
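The heap rule above reduces to a one-line calculation. The sketch below is an illustration rather than an official sizing tool. The helper names are hypothetical, and the 31 GB cap is an assumed safe value below the roughly 32 GB compressed-pointer threshold, which varies by JVM.

```python
def recommended_heap_gb(ram_gb: float, compressed_oops_limit_gb: float = 31.0) -> float:
    """Half of RAM, capped below the assumed compressed-pointer threshold."""
    return min(ram_gb / 2, compressed_oops_limit_gb)


def jvm_options(ram_gb: float) -> list:
    """Return matching -Xms/-Xmx flags so the heap never resizes at runtime."""
    heap = int(recommended_heap_gb(ram_gb))
    return [f"-Xms{heap}g", f"-Xmx{heap}g"]


if __name__ == "__main__":
    for ram in (16, 64, 128):
        print(ram, "GB RAM ->", jvm_options(ram))
```

The remaining RAM is not wasted: the operating system uses it for the filesystem cache that Lucene relies on.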
Estimating Storage Requirements
Calculate expected index size using this formula:
index_size = source_log_volume × compression_ratio × replica_factor
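Here compression_ratio is the on-disk size of the primary shards divided by the raw log size, and replica_factor is 1 + number_of_replicas. The ratio varies widely with mappings, analyzers, and codec, so measure it by indexing a representative sample rather than assuming a fixed value. ILM does not compress data by itself, but its forcemerge and delete actions keep total storage bounded; without a retention policy, storage grows without limit.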
Log Volume Estimation
# Estimate daily log volume per service
# Assume: 1000 requests/min × 10KB avg log size × 60 min × 24h = ~14.4GB/day
# With 3 replicas (4 copies including the primary) and 30% overhead:
# 14.4GB × 4 copies × 1.3 overhead × 30 days = ~2.2TB/month per service
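The estimate can be turned into a small calculator so the inputs are easy to vary. This is a sketch under stated assumptions: the function names are hypothetical, sizes use decimal units (1 GB = 1,000,000 KB), the primary plus three replicas count as four copies, the 30% overhead is applied on top of those copies, and the indexed size is taken as equal to the raw size unless `index_ratio` is changed.

```python
def daily_raw_gb(requests_per_min: float, avg_log_kb: float) -> float:
    """Raw log volume per day in GB (decimal units)."""
    return requests_per_min * avg_log_kb * 60 * 24 / 1_000_000


def monthly_storage_tb(raw_gb_per_day: float, replicas: int,
                       overhead: float = 0.30, days: int = 30,
                       index_ratio: float = 1.0) -> float:
    """Total cluster storage for one month of retention, in TB."""
    copies = 1 + replicas  # the primary plus each replica
    total_gb = raw_gb_per_day * index_ratio * copies * (1 + overhead) * days
    return total_gb / 1000


if __name__ == "__main__":
    per_day = daily_raw_gb(requests_per_min=1000, avg_log_kb=10)
    print(f"{per_day:.1f} GB/day raw")
    print(f"{monthly_storage_tb(per_day, replicas=3):.2f} TB/month stored")
```

Replacing `index_ratio` with a value measured on a sample index makes the result considerably more trustworthy.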
Scaling Triggers
Watch these metrics to decide when to add nodes:
- Cluster health: Yellow or red status that persists because shards cannot be allocated (for example, after a disk watermark breach) means you need capacity
- Indexing latency: P95 above 500ms indicates saturation
- Search latency: P95 above 1s for interactive queries
- Disk usage: Nodes approaching 80% capacity
- JVM heap pressure: Old generation spending more than 30% time in GC
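The triggers above can be encoded as a simple check that runs against whatever monitoring data is already collected. The sketch below uses the thresholds from the list. The `ClusterMetrics` fields and the function name are hypothetical, and fetching the numbers from the cluster is left out.

```python
from dataclasses import dataclass


@dataclass
class ClusterMetrics:
    status: str                 # "green", "yellow", or "red"
    indexing_p95_ms: float
    search_p95_ms: float
    disk_used_pct: float
    old_gen_gc_time_pct: float  # share of time spent in old-generation GC


def scaling_triggers(m: ClusterMetrics) -> list:
    """Return the reasons, if any, that suggest adding capacity."""
    reasons = []
    if m.status in ("yellow", "red"):
        reasons.append(f"cluster status is {m.status}")
    if m.indexing_p95_ms > 500:
        reasons.append("indexing p95 latency above 500ms")
    if m.search_p95_ms > 1000:
        reasons.append("search p95 latency above 1s")
    if m.disk_used_pct >= 80:
        reasons.append("disk usage at or above 80%")
    if m.old_gen_gc_time_pct > 30:
        reasons.append("old-generation GC above 30% of time")
    return reasons


if __name__ == "__main__":
    print(scaling_triggers(ClusterMetrics("green", 620.0, 300.0, 82.0, 10.0)))
```

An empty list means none of the listed thresholds is currently breached.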
Observability Stack Integration
ELK works best as part of a broader observability setup. Here is how it fits with other tools.
Prometheus + Grafana Integration
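Beats do not include a Prometheus output. To bring Prometheus-format metrics into Elasticsearch, point Metricbeat's prometheus module at an exporter or at the Prometheus server:
# metricbeat.yml - scrape a Prometheus endpoint into Elasticsearch (hostname is a placeholder)
metricbeat.modules:
  - module: prometheus
    metricsets:
      - collector
    period: 10s
    hosts: ["prometheus:9090"]
    metrics_path: /metrics
output.elasticsearch:
  hosts: ["https://elasticsearch:9200"]
To monitor the cluster from Prometheus instead, run a separate Elasticsearch exporter, scrape it in Prometheus, and build Grafana dashboards for cluster health, indexing throughput, and search latency.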
Distributed Tracing with Jaeger
Add trace context to your logs so you can jump from a Kibana error directly into Jaeger:
# Logstash filter to extract Jaeger trace context
filter {
if [message] =~ /trace_id:/ {
grok {
match => { "message" => "trace_id:%{NOTSPACE:trace_id}\s+span_id:%{NOTSPACE:span_id}" }
}
mutate {
add_field => {
"trace_url" => "https://jaeger.example.com/trace/%{trace_id}"
}
}
}
}
This lets you correlate log entries with the full request trace in Jaeger.
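The same extraction can be prototyped in Python to check it against real log lines before it goes into the pipeline. This sketch mirrors the filter's intent with a plain regular expression. The function name `add_trace_url` is hypothetical, and the base URL is the example host from the filter above.

```python
import re

TRACE_RE = re.compile(r"trace_id:(?P<trace_id>\S+)\s+span_id:(?P<span_id>\S+)")


def add_trace_url(event: dict, base_url: str = "https://jaeger.example.com/trace/") -> dict:
    """Return a copy of the event with trace fields and a Jaeger link added."""
    match = TRACE_RE.search(event.get("message", ""))
    if match is None:
        return event  # no trace context: leave the event untouched
    enriched = dict(event)
    enriched.update(match.groupdict())
    enriched["trace_url"] = base_url + enriched["trace_id"]
    return enriched


if __name__ == "__main__":
    line = {"message": "payment failed trace_id:abc123 span_id:def456 retrying"}
    print(add_trace_url(line))
```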
Trade-off Analysis
| Factor | ELK Stack (Self-Hosted) | Loki (Grafana Cloud) | Splunk Enterprise | CloudWatch Logs |
|---|---|---|---|---|
| Deployment Model | Fully self-managed | SaaS / managed | Self-managed | Fully managed SaaS |
| Storage Cost | Infrastructure + ops | Pay-per-GB stored | License + infra | Pay-per-GB ingested |
| Query Performance | Excellent on large data | Good | Excellent | Moderate (throttling) |
| Operational Burden | High (cluster ops) | Low | High (infra + license) | Very low |
| Scalability | Manual shard management | Auto-scaling | Manual | Auto-scaling |
| Learning Curve | Steep (ES DSL) | Low (LogQL) | Steep (SPL) | Low (CloudWatch) |
| Ecosystem | Beats, Fluentd, Logstash | Promtail, Grafana | Heavy Agents | AWS-native |
Production Failure Scenarios
| Failure | Impact | Mitigation |
|---|---|---|
| Elasticsearch cluster red/yellow | Logs not indexing; search degraded | Monitor cluster health; add nodes or disk so shards can be assigned; adjust replica settings |
| Logstash pipeline errors | Logs stuck in queue; processing backlog | Monitor pipeline errors; implement dead-letter queues; alert on queue depth |
| Hot tier disk saturation | New indices cannot be created; ingestion fails | Monitor disk usage; implement ILM rollover; add nodes |
| Kibana performance degradation | Slow searches; dashboards timeout | Optimize queries; use filter context; limit time ranges |
| Beats shipper failure | Logs not forwarded; blind spots in coverage | Monitor Beats health; implement local buffering; alert on forward failures |
| Index template mismatch | Fields not indexed correctly; search failures | Version index templates; validate mappings; test before deployment |
Common Pitfalls / Anti-Patterns
1. Too Many Indices with Few Documents
Each index has overhead. Too many small indices overwhelms the cluster:
// Bad: Index per day per service creates thousands of indices
PUT logs-service-a-2026.03.22
PUT logs-service-b-2026.03.22
// ... thousands more
// Good: Bootstrap one rollover series per service and let ILM roll it over by size or age
PUT logs-service-a-000001
{
"aliases": {
"logs-service-a": { "is_write_index": true }
}
}
2. Dynamic Field Mapping Without Controls
Dynamic mapping can create unexpected field types and blow up cardinality:
// Bad: Unrestricted dynamic mapping
{
"mappings": {
"dynamic": "true" // Creates any field
}
}
// Good: Strict dynamic mapping or disabled
{
"mappings": {
"dynamic": "strict",
"properties": {
"@timestamp": { "type": "date" },
"level": { "type": "keyword" },
"message": { "type": "text" }
}
}
}
3. Not Using Filter Context for Simple Queries
Filter context is faster because it does not score:
// Bad: Query context for term filter
{
"query": {
"match": { "level": "ERROR" } // Scores, slower
}
}
// Good: Filter context for exact match
{
"query": {
"bool": {
"filter": [
{ "term": { "level": "ERROR" } } // No scoring, faster
]
}
}
}
4. Ignoring Index Lifecycle Management
Without ILM, indices grow unbounded and performance degrades:
// Good: ILM with hot/warm/cold/delete
{
"policy": {
"phases": {
"hot": {
"min_age": "0ms",
"actions": { "rollover": { "max_age": "7d" } }
},
"warm": { "min_age": "7d", "actions": { "shrink": { "number_of_shards": 1 }, "forcemerge": { "max_num_segments": 1 } } },
"cold": { "min_age": "30d", "actions": { "freeze": {} } },
"delete": { "min_age": "365d", "actions": { "delete": {} } }
}
}
}
5. Loading Too Much Data into Memory
Kibana visualizations on large time ranges cause OOM:
// Bad: Visualize 90 days of minute-level data
{
"query": { "range": { "@timestamp": { "gte": "now-90d" } } }
}
// Good: Use date histogram with appropriate interval
{
"aggs": {
"over_time": {
"date_histogram": {
"field": "@timestamp",
"fixed_interval": "1h" // Or auto with proper configuration
}
}
}
}
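One way to keep bucket counts under control is to derive the histogram interval from the requested time range. The sketch below picks the finest interval that stays under a bucket cap. The function name, the candidate intervals, and the cap of 1000 buckets are assumptions for illustration, not Kibana defaults.

```python
from datetime import timedelta

# Candidate fixed_interval values with their length in seconds, finest first.
CANDIDATES = [("1m", 60), ("5m", 300), ("1h", 3600), ("1d", 86400)]


def pick_fixed_interval(time_range: timedelta, max_buckets: int = 1000) -> str:
    """Return the finest interval that produces at most max_buckets buckets."""
    seconds = time_range.total_seconds()
    for name, size in CANDIDATES:
        if seconds / size <= max_buckets:
            return name
    return CANDIDATES[-1][0]  # fall back to the coarsest interval


if __name__ == "__main__":
    for label, span in (("1 hour", timedelta(hours=1)),
                        ("30 days", timedelta(days=30)),
                        ("90 days", timedelta(days=90))):
        print(label, "->", pick_fixed_interval(span))
```

The returned string can be used directly as the `fixed_interval` value of a `date_histogram` aggregation.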
Real-world Failure Scenarios
Scenario 1: Elasticsearch Cluster Red/Yellow
What happened: After an unexpected traffic spike, the Elasticsearch cluster’s hot tier disk became saturated. New indices could not be created, causing log ingestion to fail across multiple services simultaneously.
Root cause: Index Lifecycle Management (ILM) was not configured, and disk alerts were set too high to catch gradual saturation before it became critical.
Impact: Approximately 4 hours of log data were lost. Engineers lost visibility into production systems during a post-incident analysis window.
Lesson learned: Configure ILM policies before going to production. Set disk usage alerts at 70% as a warning threshold and 80% as a critical threshold.
Scenario 2: Logstash Pipeline Backlog
What happened: A misconfigured Logstash filter stalled the pipeline. It processed almost no events while the input queue grew to millions of pending messages.
Root cause: A grok pattern with an overly broad regex backtracked heavily on every event, so Logstash thrashed the CPU while failing to match any events.
Impact: Logs were delayed by 12 hours before the queue depth alert fired. Correlating logs with real-time events during the incident window was impossible.
Lesson learned: Implement dead-letter queues for failed events. Monitor pipeline-to-queue depth. Test filter patterns on sample data before deploying to production.
Scenario 3: Kibana Dashboard Timeout
What happened: A dashboard aggregating logs across 30-day windows with complex Vega visualizations began timing out. Users reported that the Kibana UI became unresponsive for all users on the cluster.
Root cause: The cluster’s coordinating node was memory-constrained. Large aggregations caused heap pressure and triggered long GC pauses, affecting every request routed through that node.
Impact: All Kibana users lost access to dashboards for approximately 45 minutes during peak business hours.
Lesson learned: Set query timeout limits in Kibana. Limit the default time range for users. Implement query caching and use rollup indices for historical data.
Interview Questions
What to cover:
- Elasticsearch distributes shards across all data nodes automatically
- New nodes trigger rebalancing; shards are spread to even out the shard count per node, subject to disk watermarks
- Master node updates routing tables without manual intervention
- You can increase replica count after adding nodes for better redundancy
What to cover:
- Filter context: exact matching without relevance scoring, cached automatically (good for term, range, exists queries)
- Query context: full-text search with scoring (match, query_string queries)
- Use filter for status codes, user IDs, anything you want to match exactly
- Use query when you need results ranked by relevance, like searching log messages
- bool must = query context, bool filter = filter context in compound queries
What to cover:
- Hot: new indices accept writes; rollover triggers based on age or size; set_priority keeps these nodes prioritized
- Warm: index becomes read-only; shrink reduces primary shards, forcemerge combines segments, priority drops
- Cold: index is frozen and not actively queried; priority goes to zero
- Delete: index disappears after the retention period; useful for compliance and clearing old data
- Phases run sequentially once the min_age threshold passes
What to cover:
- Start with GET _cluster/health?pretty to see the overall status
- Run GET _cat/shards?h=index,shard,prirep,state to find unassigned shards
- Yellow = replica shards unassigned; red = primary shards missing
- Common culprits: disk watermark breaches, heap pressure, network partition, node crashes
- Solutions vary: add nodes, raise watermark thresholds, drop replica count, increase heap, restart frozen nodes
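The yellow/red rule above is mechanical enough to script. The sketch below classifies plain-text output from `GET _cat/shards?h=index,shard,prirep,state`. It is a simplification that only looks at UNASSIGNED shards, and the function name and sample output are made up for illustration.

```python
def classify_health(cat_shards_output: str) -> str:
    """Return green, yellow, or red from index/shard/prirep/state rows."""
    status = "green"
    for line in cat_shards_output.strip().splitlines():
        _index, _shard, prirep, state = line.split()[:4]
        if state != "UNASSIGNED":
            continue
        if prirep == "p":
            return "red"   # a missing primary means data is unavailable
        status = "yellow"  # a missing replica only reduces redundancy
    return status


if __name__ == "__main__":
    sample = """
    logs-api-000001 0 p STARTED
    logs-api-000001 0 r UNASSIGNED
    logs-web-000001 0 p STARTED
    """
    print(classify_health(sample))
```

For the reason a specific shard is unassigned, `GET _cluster/allocation/explain` gives more detail than this summary can.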
What to cover:
- Input: receives data from external sources (Beats, HTTP, files); example: beats plugin listening on port 5044
- Filter: parses and enriches raw data; example: grok patterns extracting HTTP status code from an access log
- Output: sends processed data to a destination; example: elasticsearch plugin writing to a daily index
- Data flows sequentially: input feeds filter, filter feeds output
- You can run multiple pipelines in parallel for different log types
What to cover:
- Index templates define mappings and settings for new indices matching a pattern automatically
- They enforce consistency: field types, shard counts, replica counts, even ILM policy assignment
- ILM policies govern what happens to indices over time (rollover, shrink, freeze, delete)
- Templates handle structure at creation; ILM handles data management afterward
- You can combine them: a template assigns an ILM policy so new indices automatically follow it
What to cover:
- Beats are lightweight agents you install on edge machines; Logstash is heavier and runs on dedicated servers
- Use Filebeat to tail log files, Metricbeat to collect system metrics, Heartbeat for uptime checks
- Use Logstash when you need grok parsing, multi-step enrichment, conditional routing, or GeoIP lookups
- In practice: Beats collect and ship, Logstash transforms and routes, Elasticsearch stores
What to cover:
- XPack Security adds encryption, authentication, and role-based access control
- Enable in elasticsearch.yml: xpack.security.enabled: true plus TLS for transport and HTTP
- API key authentication via xpack.security.authc.api_key.enabled: true for scripted access
- RBAC with built-in roles like kibana_admin or custom roles defined in roles.yml
- Audit logging with xpack.security.audit.enabled: true tracks who changed what
- Kibana spaces let you isolate dev, staging, and prod environments visually
What to cover:
- Rollover creates a fresh index when the current one hits max_age or max_primary_shard_size
- The write alias switches to the new index; the old one stops receiving writes but stays searchable
- Shrinking takes an existing index and rewrites it with fewer primary shards (say, from 5 down to 1)
- Shrink copies data into a new index then deletes the original; rollover just switches aliases
- Rollover handles time-based streams; shrinking optimizes read-heavy historical indices for storage
What to cover:
- Use hot-warm-cold architecture: SSDs for active indexing, larger spinning disks for warm, cheap storage for cold
- Run multiple Logstash nodes behind a load balancer; each pipeline handles around 50GB/day
- Enable local buffering in Beats so temporary Logstash outages do not cause data loss
- Configure ILM: 7d hot, 30d warm, 90d cold, 365d delete to manage storage growth
- Set index templates with proper shard sizing (target 50GB per shard) and compression enabled
- Consider Kafka or Redis as a buffer between Beats and Logstash to absorb traffic spikes
- Watch queue depth and dead letter queues to catch backlogs before they become outages
What to cover:
- Shard count directly impacts search parallelism: more shards means more parallel searches but also more overhead
- Target 20-50GB per shard for optimal balance; shards that are too small cause overhead, too large cause slow recovery
- Use rollover to create new indices when a size or age threshold is reached rather than relying on fixed calendar-based index names
- For time-based logs, daily indices work well at moderate volume; high volume may need hourly
- Index templates enforce consistent mappings and settings across all indices matching a pattern
- Consider disabling norms on text fields you never search with relevance scoring (keyword fields have norms disabled by default)
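The shard-size target translates into a quick calculation for the primary shard count of an index. This is a rough sketch: the function name is hypothetical, the 50 GB default sits at the upper end of the 20-50GB target mentioned above, and the expected index size has to come from your own volume estimate.

```python
import math


def primary_shard_count(expected_index_gb: float, target_shard_gb: float = 50.0) -> int:
    """Smallest number of primaries that keeps each shard at or under the target."""
    return max(1, math.ceil(expected_index_gb / target_shard_gb))


if __name__ == "__main__":
    for size_gb in (10, 100.8, 400):
        print(size_gb, "GB ->", primary_shard_count(size_gb), "primary shard(s)")
```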
What to cover:
- Beats are lightweight agents with minimal memory footprint (512MB baseline); Logstash requires 4GB+ servers
- Beats do simple shipping and some preprocessing; Logstash does complex transformations, enrichment, and conditional routing
- Use Beats for: file tailing, metric collection, heartbeat monitoring, simple field additions
- Use Logstash for: grok parsing, multi-step enrichment chains, GeoIP lookups, business-logic-based routing
- In practice, many architectures use both: Beats handle edge collection, Logstash handles transformation
- Filebeat can do light parsing (JSON, nginx, apache logs) without Logstash if you do not need complex grok
What to cover:
- Hot nodes use SSDs for fast I/O and handle all writes and recent queries
- Warm nodes use larger spinning disks for read-only historical data that is queried occasionally
- Cold nodes use cheap bulk storage for archival data that is rarely accessed but must be retained
- ILM automates movement between tiers: hot (7d) → warm (30d) → cold (90d) → delete (365d)
- Frozen indices use memory-mapped files and only load data when queried, dramatically reducing RAM needs
- For 500GB/day ingestion, hot-warm-cold can cut storage costs by 60-70% compared to all-SSD
What to cover:
- Rebalancing happens when nodes join or leave; it redistributes shards so that each node carries a similar share, within the disk watermark limits
- Monitor with GET _cat/shards?h=index,shard,prirep,state,store and look for RELOCATING shards
- High rebalance rates can saturate network and disk I/O, degrading search and indexing performance
- Throttle rebalancing with cluster.routing.allocation.cluster_concurrent_rebalance (the default is 2); lower it if relocation traffic hurts search or indexing
- Watermark settings control when nodes stop accepting new shards (low 85%, high 90%, flood 95%)
- If a node is slow, Elasticsearch may think it failed and start relocating shards unnecessarily
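The watermark thresholds can be expressed as a small lookup, which is handy for alerting before Elasticsearch reacts on its own. The sketch below uses the default percentages quoted above. The function name is hypothetical, and clusters that override the watermark settings would need different values.

```python
# Default disk watermarks, highest first, as percentages of disk used.
WATERMARKS = [("flood_stage", 95.0), ("high", 90.0), ("low", 85.0)]


def watermark_state(disk_used_pct: float) -> str:
    """Return the highest watermark that the node has crossed, or 'ok'."""
    for name, threshold in WATERMARKS:
        if disk_used_pct >= threshold:
            return name
    return "ok"


if __name__ == "__main__":
    for pct in (70.0, 86.5, 91.0, 97.0):
        print(f"{pct}% used ->", watermark_state(pct))
```

Crossing the low watermark stops new shards from being allocated to the node, the high watermark makes Elasticsearch move shards away, and the flood stage blocks writes to indices with a shard on that node.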
What to cover:
- Use filter context in all dashboard queries to avoid unnecessary relevance scoring
- Limit time ranges by default; users can expand but loading 90 days of minute data crashes browsers
- Use date histogram aggregations with appropriate intervals: 1h for 30d views, 1m for under 24h
- For high-cardinality fields like user_id, use terms aggregation with size limit to avoid memory issues
- Pin frequently used filters at the dashboard level so every visualization respects them
- Break complex dashboards into multiple saved searches rather than one monolithic view
What to cover:
- Shard allocation is the process of assigning shards to nodes based on resource usage and allocation policies
- Primary shards are allocated at index creation; replicas are allocated dynamically
- Unassigned shards appear when: disk watermark breached, node left cluster, replica count increased, newly created index
- Yellow status means replicas are unassigned (data safe but fault tolerance compromised)
- Red status means primaries are missing (data loss risk); check logs for the specific allocation reason
- Use GET _cluster/allocation/explain to get detailed reason for a specific unassigned shard
What to cover:
- Pipeline throughput (events/sec) to catch degradation before backlog accumulates
- Queue depth: persistent queue bytes and unacknowledged events indicate backpressure
- Dead letter queue size and age; a growing DLQ means events are failing to index and will be lost unless they are reprocessed
- Filter execution time: slow filters (GeoIP, DNS lookup) can become bottlenecks
- Input metrics: bytes received by Logstash, connection errors from Beats
- JVM heap pressure: Logstash runs Java; heap > 80% causes GC pauses and slow processing
What to cover:
- Enable TLS on Beats output to Logstash with self-signed certificates or CA-issued certs
- Configure Logstash to verify client certificates for mutual TLS from Beats
- Store credentials in environment variables or keystore, never in configuration files
- Use API key authentication for Beats-to-Elasticsearch direct shipping (no Logstash)
- Rotate SSL certificates regularly; expired certs on Beats cause silent shipping failures
- Logstash-to-Elasticsearch should use TLS with certificate verification and credential secrets
What to cover:
- Querying too wide a time range: use narrow defaults and let users expand; always use date histogram with auto or fixed intervals
- High-cardinality aggregations: terms aggregation on user_id or trace_id with size: 10000 blows up memory
- Missing filter context: queries with match instead of term go through scoring when they do not need to
- Visualizations on scripted fields or runtime fields that execute per document
- Large result sets returned to browser: use query size: 0 with aggregations instead of returning hits
- Cross-cluster searches add network latency; consider local index patterns instead
What to cover:
- Map Splunk indexes to Elasticsearch indices; plan index naming and template structure
- Beats can replace Heavy Forwarders; Filebeat handles most log types natively
- SPL queries convert to KQL or Elasticsearch query DSL; some queries need redesign
- Field names differ: Splunk "sourcetype" maps to a field in Elasticsearch, not a native concept
- License cost comparison: Splunk is commercial and expensive; ELK is open source but has operational costs
- Parallel run: ship same logs to both systems during transition period to validate data integrity
- Start with non-critical logs, migrate incrementally, validate search results match before decommissioning Splunk
Further Reading
- Metrics, Monitoring & Alerting - Complete observability stack integration
- Distributed Tracing - End-to-end request tracing with correlation IDs
- OpenTelemetry Official Docs - Vendor-neutral instrumentation standard
- Elasticsearch ILM Documentation - Index lifecycle management
- Fluent Bit Documentation - High-performance log agent
- SLO Embedded on-call Handbook - SLO-based incident management
Quick Recap
Key Takeaways:
- Beats collect, Logstash transforms, Elasticsearch stores, Kibana visualizes
- Index lifecycle management prevents unbounded growth
- Use filter context for exact matches; query context only when scoring needed
- Monitor cluster health and pipeline metrics proactively
- Implement security early: authentication, TLS, RBAC
- Design index templates carefully to control field mapping
Copy/Paste Checklist:
# Check cluster health
GET _cluster/health?pretty
# Monitor index size and document count
GET _cat/indices?v&s=store.size:desc
# Check Logstash pipeline status (Logstash monitoring API, default port 9600)
curl -s "http://localhost:9600/_node/stats/pipelines?pretty"
# ILM policy check
GET _ilm/policy/logs-policy?pretty
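# Dead letter queue inspection (stored on the Logstash host under path.data, not in Elasticsearch;
# the path below assumes the package-install default)
ls -lh /var/lib/logstash/dead_letter_queue/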
# Index template validation
GET _index_template/logs-template?pretty
# Secure your cluster (Elasticsearch)
PUT _security/user/kibana_admin
{
"password": "${KIBA...ORD}",
"roles": ["kibana_admin"]
}
Observability Checklist
Infrastructure Monitoring
- Elasticsearch cluster health (green/yellow/red)
- Primary shard and replica distribution
- Index count and size per index
- Node resource utilization (CPU, heap, disk)
- Search and indexing latency percentiles
- JVM heap usage and GC frequency
- Segment count and merge queue depth
Log Pipeline Monitoring
- Beats shipper metrics (bytes sent, errors, lag)
- Logstash pipeline throughput and latency
- Logstash queue depth and worker utilization
- Dead-letter queue size and age
- Log parsing error rate
Kibana Monitoring
- Search response time (p95, p99)
- Dashboard load time
- Visualization render time
- Active users and session count
Data Management
- Index count within expected bounds
- Document count growth rate
- Disk usage trend and forecasting
- ILM policy execution success/failure
- Archive tier accessibility
Security Checklist
- Elasticsearch security enabled (XPack Security)
- User authentication configured (LDAP, SAML, or built-in)
- Role-based access control for indices and spaces
- TLS encryption for all network traffic
- API keys rotated regularly
- Kibana spaces isolation (dev/staging/prod separation)
- Audit logging enabled for security events
- No sensitive data in index names or field names
- Snapshot repositories secured and access logged
- Cross-cluster search secured if used
Conclusion
The ELK Stack provides a powerful platform for centralized logging and analysis. Beats collect data efficiently, Logstash transforms it into a structured format, Elasticsearch stores and indexes it, and Kibana makes it explorable.
Start with Filebeat shipping container logs to Elasticsearch, and build from there. Add Logstash for complex parsing, Kibana for visualizations, and ILM policies for efficient data retention.
For monitoring beyond logs, see our Prometheus & Grafana guide for metrics visualization. For distributed tracing, see the Jaeger and Distributed Tracing guides for correlating logs with request traces.