Search Scaling: Sharding, Routing, and Horizontal Growth
Learn how to scale search systems horizontally with index sharding strategies, query routing, replication patterns, and cluster management techniques.
Search queries are CPU-intensive, latency-sensitive, and often return large result sets. A single node can only handle so much before query times degrade. At that point, you need to distribute the load across multiple nodes while maintaining fast queries and consistent results.
This post covers sharding strategies, query routing, replication for read scalability, and cluster management.
Core Concepts
Search queries are CPU-bound for relevance scoring and aggregation. A query that scores a million documents uses significantly more CPU than a simple key-value lookup. Memory matters too, but for inverted indexes, the bottleneck is often CPU during the scoring phase.
Search is also partition-sensitive. A query for “recent blog posts” might hit different shards depending on how data is distributed. If your sharding strategy does not match your query patterns, you end up hitting all shards for every query, which defeats the purpose of distribution.
Sharding Strategies
Sharding splits your index into smaller pieces distributed across nodes. Choosing the right sharding strategy is the most important decision in search scaling.
Hash-Based Sharding
The default approach in Elasticsearch: hash the document ID and take the result modulo the primary shard count.
shard = hash(document_id) % num_primary_shards;
This distributes documents evenly but has a downside: queries without a filter on document_id must query all shards. You cannot target a specific shard.
// Every query hits all shards - no optimization possible
GET /my-index/_search
{
"query": { "match": { "content": "search term" } }
}
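To make the hash-and-modulo rule concrete, here is a minimal sketch in Python. It is not Elasticsearch's implementation: Elasticsearch hashes with Murmur3, while this sketch uses CRC32 purely because it ships with the standard library and is stable across runs. The function name `shard_for` and the `doc-N` IDs are made up for illustration.

```python
import zlib
from collections import Counter


def shard_for(routing_key: str, num_primary_shards: int) -> int:
    """Map a routing key (the document ID by default) to a shard number."""
    # CRC32 is a stand-in hash chosen for determinism; it is an assumption
    # of this sketch, not the hash Elasticsearch uses.
    return zlib.crc32(routing_key.encode("utf-8")) % num_primary_shards


if __name__ == "__main__":
    num_shards = 5
    counts = Counter(shard_for(f"doc-{i}", num_shards) for i in range(10_000))
    for shard in sorted(counts):
        print(f"shard {shard}: {counts[shard]} docs")
```

A given ID always lands on the same shard, which is why a lookup by ID can go to one shard, while a full-text query has no ID to hash and must fan out to all of them.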
Range-Based Sharding
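Partition by a field value, like date or category. Elasticsearch has no built-in range sharding, so this is normally done with one index per range (for example, one per quarter). Within each index, index sorting keeps the newest documents first in every segment: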
{
"settings": {
"index": {
"sort.field": "publish_date",
"sort.order": "desc"
}
}
}
Range sharding lets you target subsets of data. A query for recent posts might only hit the relevant time-based shards. The tradeoff is hot spots: if most queries hit recent data, those shards become overloaded.
graph LR
subgraph TimeBasedShards
Shard1[2024-Q1]
Shard2[2024-Q2]
Shard3[2024-Q3]
Shard4[2024-Q4]
end
Query1["query: recent posts"] --> Shard4
Query2["query: old posts"] --> Shard1
Custom Routing
Elasticsearch allows custom routing to target specific shards:
PUT /my-index/_doc/1?routing=category:tutorials
{
"title": "Getting Started",
"category": "tutorials"
}
Queries can then target specific routing values:
GET /my-index/_search?routing=category:tutorials
{
"query": { "match": { "title": "search" } }
}
This reduces the number of shards queried from N to 1, cutting latency significantly for targeted queries.
Query Routing
Once you have multiple shards, query routing determines which shards get queried and how results are merged.
Coordination Node
When a client sends a search request to Elasticsearch, any node can receive it. That node becomes the coordination node for that request. It broadcasts the query to all relevant shards, collects the results, and merges them.
graph TD
Client --> Coord[Coordinator Node]
Coord -->|broadcast| Shard1[Shard 1]
Coord -->|broadcast| Shard2[Shard 2]
Coord -->|broadcast| Shard3[Shard 3]
Shard1 -->|top 10| Coord
Shard2 -->|top 10| Coord
Shard3 -->|top 10| Coord
Coord -->|merged top 10| Client
The coordinator does not do the full search on each shard. Shards return their top N results, and the coordinator merges and re-ranks. This is efficient as long as N is small relative to the total matches.
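The merge step can be sketched in a few lines of Python. This is an illustration of the scatter-gather idea only, with hypothetical names (`Hit`, `shard_top_n`, `merge_top_n`) and hard-coded scores; the real coordinator also handles pagination offsets and a separate fetch phase.

```python
import heapq
from typing import NamedTuple


class Hit(NamedTuple):
    score: float
    doc_id: str


def shard_top_n(hits: list[Hit], n: int) -> list[Hit]:
    """Query phase on one shard: keep only the n best-scoring hits."""
    return heapq.nlargest(n, hits)


def merge_top_n(per_shard: list[list[Hit]], n: int) -> list[Hit]:
    """Coordinator: merge the per-shard lists into a global top n."""
    return heapq.nlargest(n, (hit for hits in per_shard for hit in hits))


if __name__ == "__main__":
    shards = [
        [Hit(0.9, "a"), Hit(0.2, "b"), Hit(0.1, "x")],
        [Hit(0.7, "c"), Hit(0.8, "d")],
        [Hit(0.3, "e")],
    ]
    partial = [shard_top_n(hits, 2) for hits in shards]
    print(merge_top_n(partial, 2))
```

Because every shard returns its own top N, the merged top N is the same as if all hits had been ranked in one place, while only N entries per shard cross the network.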
Adaptive Replica Selection
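By default, Elasticsearch uses adaptive replica selection: for each shard, the coordinating node picks the copy it expects to answer fastest, based on recent response times, service times, and search queue sizes on the nodes holding the copies. It has no explicit notion of datacenters, but in a multi-datacenter setup a copy in the same datacenter usually wins on response time, which keeps most queries off the slower links.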
You can control replica selection:
GET /my-index/_search?preference=_only_local
The _only_local preference restricts the search to shard copies allocated on the node that receives the request. This is useful when you know the data is local and want to avoid cross-node traffic.
Replication for Read Scalability
Replication adds copies of your data to handle read traffic. Each replica is a full copy of the primary shard, capable of serving searches.
Read-Through Scaling
Adding replicas scales read throughput roughly linearly, as long as there are enough nodes to host the extra copies. With 3 primary shards and 2 replicas, you have 9 total shard copies. Under heavy read load, the coordinator distributes requests across all 9 copies.
PUT /my-index/_settings
{
"number_of_replicas": 2
}
The tradeoff is write amplification: every write must be indexed on the primary and all replicas. More replicas mean slower writes.
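The arithmetic behind both sides of this tradeoff is small enough to write down. The helper names below are hypothetical and the sketch only counts copies; it says nothing about how much slower a real write becomes.

```python
def total_shard_copies(primaries: int, replicas: int) -> int:
    """Primaries plus every replica of every primary."""
    return primaries * (1 + replicas)


def index_operations_per_write(replicas: int) -> int:
    """Each document is indexed once on the primary and once per replica."""
    return 1 + replicas


if __name__ == "__main__":
    print(total_shard_copies(primaries=3, replicas=2))   # copies available for reads
    print(index_operations_per_write(replicas=2))        # indexing work per document
```

With 3 primaries and 2 replicas this gives 9 copies to spread reads over, at the price of indexing every document 3 times.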
Consistency Considerations
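With replication, a search may briefly see different data depending on which copy it hits, because each shard copy refreshes on its own schedule. Search requests have no consistency parameter. The old write-side consistency=quorum option was replaced in Elasticsearch 5.0 by wait_for_active_shards, which is set on the write:
PUT /my-index/_doc/1?wait_for_active_shards=2&refresh=wait_for
{ "title": "Getting Started" }
Here wait_for_active_shards=2 requires the primary and one replica to be active before the write proceeds, and refresh=wait_for holds the response until the document is visible to search, reducing the chance of reading stale data right after a write.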
Index Lifecycle Management
As indices grow and age, their access patterns change. A blog post from three years ago gets almost no traffic compared to posts from last week. ILM automates the transition of indices through lifecycle stages, optimizing cost and performance.
Hot-Warm-Cold Architecture
The classic tiered architecture separates nodes by role:
graph TD
subgraph HotTier[Hot Nodes - SSD]
HN1[Node 1]
HN2[Node 2]
HN3[Node 3]
end
subgraph WarmTier[Warm Nodes - High Capacity]
WN1[Node 4]
WN2[Node 5]
end
subgraph ColdTier[Cold Nodes - Archive Storage]
CN1[Node 6]
CN2[Node 7]
end
Ingest[Ingest Pipeline] --> HN1
HN1 -->|recent indices| WN1
WN1 -->|90 days+| CN1
Hot nodes are where all the writes happen and where most reads land. Fast NVMe SSDs and plenty of RAM for the file system cache.
Warm nodes sit in the middle — indices that have cooled off but still get queried occasionally. SATA SSDs do the job.
Cold nodes are for archived data that almost nobody touches. Spinning disks or object store, whatever’s cheapest. Query latency is higher but that’s fine for old logs.
ILM Policy Definition
PUT _ilm/policy/search-indices-policy
{
"policy": {
"phases": {
"hot": {
"min_age": "0ms",
"actions": {
"rollover": {
"max_size": "50gb",
"max_age": "7d"
},
"set_priority": {
"priority": 100
}
}
},
"warm": {
"min_age": "30d",
"actions": {
"set_priority": {
"priority": 50
},
"shrink": {
"number_of_shards": 1
},
"forcemerge": {
"max_num_segments": 1
},
"allocate": {
"require": {
"data": "warm"
}
}
}
},
"cold": {
"min_age": "90d",
"actions": {
"set_priority": {
"priority": 0
},
"allocate": {
"require": {
"data": "cold"
}
}
}
},
"delete": {
"min_age": "365d",
"actions": {
"delete": {}
}
}
}
}
}
The rollover action creates a new index when the current one hits 50GB or is older than 7 days. This creates time-based indices like search-indices-000001, search-indices-000002, etc.
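As a rough mental model of the policy above, the thresholds can be expressed as two small Python functions. The names are hypothetical and the numbers are copied from the policy; real ILM evaluates these conditions periodically and, for rolled-over indices, measures `min_age` from the rollover time rather than from index creation.

```python
# (phase, min_age in days), checked from oldest phase to youngest
PHASE_MIN_AGE_DAYS = [("delete", 365), ("cold", 90), ("warm", 30), ("hot", 0)]


def phase_for_age(age_days: float) -> str:
    """Return the lifecycle phase an index of this age should be in."""
    if age_days < 0:
        raise ValueError("age_days must be non-negative")
    for phase, min_age in PHASE_MIN_AGE_DAYS:
        if age_days >= min_age:
            return phase
    return "hot"


def should_roll_over(size_gb: float, age_days: float,
                     max_size_gb: float = 50, max_age_days: float = 7) -> bool:
    """Rollover fires when either condition is met, not both."""
    return size_gb >= max_size_gb or age_days >= max_age_days


if __name__ == "__main__":
    for age in (0, 45, 120, 400):
        print(age, phase_for_age(age))
    print(should_roll_over(size_gb=12, age_days=8))
```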
ILM Takeaways
- Build hot-warm-cold in early: retrofitting under traffic is painful
- Set `rollover` on hot indices so new indices spin up automatically when limits hit
- Run `forcemerge` in the warm phase to cut segment count and speed up reads
- Check that phases are actually triggering: monitor index age distribution across tiers
- Frozen indices in the cold tier can drop memory usage by ~90% on archived data
Cross-Cluster Search
For organizations with multiple Elasticsearch clusters — whether for multi-tenancy, geographic distribution, or isolation between teams — cross-cluster search enables federated queries across cluster boundaries.
Use Cases
- Multi-datacenter replication: Query data in both primary and DR clusters simultaneously
- Team isolation: Each team manages their own cluster while a central cluster federates queries
- Data sovereignty: Sensitive data stays in a specific region while being queryable globally
Configuration
On the coordinating cluster that federates queries:
PUT _cluster/settings
{
"persistent": {
"cluster.remote": {
"cluster_one": {
"seeds": ["192.168.1.1:9300", "192.168.1.2:9300"]
},
"cluster_two": {
"seeds": ["192.168.2.1:9300", "192.168.2.2:9300"]
}
}
}
}
Query across clusters using the fully qualified index name:
GET cluster_one:my-index,cluster_two:my-index/_search
{
"query": {
"bool": {
"should": [
{ "match": { "title": "search term" } },
{ "term": { "cluster": "cluster_one" } }
]
}
}
}
Cross-Cluster Replication vs Cross-Cluster Search
| Feature | Cross-Cluster Search | Cross-Cluster Replication (CCR) |
|---|---|---|
| Data movement | Query-only, no replication | Real-time index replication |
| Latency | Higher (network round-trip) | Lower (local reads) |
| Consistency | Reflects each remote cluster's current data | Follower is eventually consistent with the leader |
| Failover | Manual query redirection | Manual: promote the follower to a regular index |
| Use case | Ad-hoc federation, read scaling | Active-passive DR |
Cross-Cluster Takeaways
- Cross-cluster search adds network hops: budget 10-30ms of extra latency
- Use CCR if you need a warm standby for disaster recovery; failover is a manual promotion of the follower, so script and rehearse it
- Watch seed node health: failed seeds mean failed queries
- Encrypt and authenticate cross-cluster traffic; don’t let it traverse untrusted networks
Search Capacity Planning
Before scaling, you need to know how much capacity you have and how much you need. Capacity planning prevents both over-engineering (wasted cost) and under-engineering (performance degradation).
Estimating Storage Requirements
Storage per shard depends on several factors:
{
"index": {
"number_of_shards": 5,
"number_of_replicas": 2,
"refresh_interval": "1s"
},
"analysis": {
"analyzer": "standard"
}
}
Rough estimation formula:
raw_data_size * overhead * copies / num_primary_shards
Where:
- overhead = ~1.1-1.3x (segment merge, field data, deleted docs)
- copies = 1 + number_of_replicas
For 100GB of raw JSON data with 5 primary shards, 2 replicas, and 1.2x overhead:
100GB * 1.2 * 3 / 5 = 72GB per shard across its three copies (24GB per individual copy, 360GB for the whole index)
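The same estimate as a small Python helper, treating overhead as a plain multiplier and counting the primary plus its replicas, which is how the worked example arrives at 72GB. The function and field names are hypothetical; plug in your own measured overhead rather than trusting the 1.2 default.

```python
from dataclasses import dataclass


@dataclass
class StorageEstimate:
    per_shard_copy_gb: float           # one primary or one replica
    per_shard_with_replicas_gb: float  # a primary plus all of its replicas
    cluster_total_gb: float            # everything, across all shards


def estimate_storage(raw_gb: float, overhead: float,
                     replicas: int, primary_shards: int) -> StorageEstimate:
    indexed_gb = raw_gb * overhead     # on-disk size of one full copy of the data
    copies = 1 + replicas
    per_copy = indexed_gb / primary_shards
    return StorageEstimate(per_copy, per_copy * copies, indexed_gb * copies)


if __name__ == "__main__":
    print(estimate_storage(raw_gb=100, overhead=1.2, replicas=2, primary_shards=5))
```

Keeping the per-copy figure separate matters because the 50GB guideline applies to a single shard copy, not to a shard together with its replicas.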
Estimating Query Throughput
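A single shard copy can handle roughly 5,000-15,000 queries per minute depending on query complexity and hardware. For P99 latency under 100ms, target the lower end. An unrouted search touches one copy of every shard, so adding primaries does not raise this ceiling; adding copies does:
copies_per_shard = ceil(needed_QPS * 60 / 5000 QPM per shard copy)
number_of_replicas = copies_per_shard - 1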
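One way to turn the per-shard figure into a replica count is sketched below. It rests on a simplifying assumption that is mine, not a measured fact: every search touches exactly one copy of each shard, so one full set of copies sustains about the per-shard rate. The 5,000 QPM default is the conservative end of the range quoted above, and `copies_needed` is a hypothetical helper.

```python
import math


def copies_needed(target_qps: float, qpm_per_shard_copy: float = 5000) -> int:
    """Copies of each shard (primary included) needed to sustain target_qps."""
    target_qpm = target_qps * 60
    return max(1, math.ceil(target_qpm / qpm_per_shard_copy))


def replicas_needed(target_qps: float, qpm_per_shard_copy: float = 5000) -> int:
    return copies_needed(target_qps, qpm_per_shard_copy) - 1


if __name__ == "__main__":
    # 200 QPS is 12,000 QPM, which needs 3 copies at 5,000 QPM each.
    print(copies_needed(200), replicas_needed(200))
```

Treat the result as a starting point for a load test, not as a substitute for one.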
Node Sizing Guidelines
| Node Role | CPU Cores | RAM (GB) | Storage | Typical Use |
|---|---|---|---|---|
| Hot | 16-32 | 64-128 | 1-2TB NVMe | Primary + recent indices |
| Warm | 8-16 | 32-64 | 2-4TB SATA SSD | Older indices |
| Cold | 4-8 | 16-32 | 4-8TB Spinning | Frozen/archive indices |
| Master | 4-8 | 8-16 | 64GB SSD | Cluster coordination only |
| Coordinating | 8-16 | 16-32 | Local SSD | Query routing only |
Capacity Planning Checklist
- Baseline current QPS, P50/P95/P99 latency, and throughput per shard
- Pull storage growth rate (GB/day) from historical data
- Project out 6-12 months, multiply by 1.5-3x for growth
- Size for N+2 redundancy on critical clusters
- Run stress tests with simulated peak load before going live
- Leave 30% headroom when adding nodes — rebalancing chews through resources
Index vs Shard Strategy
A common question: should I add more shards to an existing index or create new indices? The answer depends on your access patterns and data lifecycle.
When to Add Shards to an Existing Index
- Document volume is increasing: More shards distribute the write load
- Query latency is increasing: More shards allow more parallel processing
- Single index is approaching size limits: Elasticsearch recommends 50GB per shard as a guideline
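The primary shard count of an existing index is fixed, so adding shards means creating a new index with the split API (shrink does the opposite). The target count must be a multiple of the source count, and the source index must first be made read-only with index.blocks.write:
# Assumes my-index currently has 3 primary shards and is write-blocked
POST /my-index/_split/my-index-split
{
  "settings": {
    "index.number_of_shards": 6,
    "index.number_of_replicas": 1
  }
}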
When to Create New Indices
- Time-based data: Create daily, weekly, or monthly indices (e.g., `logs-2026-04-01`)
- Rolling retention: Easy to delete old indices without reindexing
- Different mappings: Separate indices for different document types
PUT logs-2026-04-17
{
"aliases": {
"logs-current": {}
}
}
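The naming and retention logic that usually sits around time-based indices can be sketched as follows. The `logs` prefix and the 7-day retention are illustrative assumptions, and the function only decides which names are expired; actually deleting them (or letting ILM do it) is left out.

```python
from datetime import date, timedelta


def index_name(prefix: str, day: date) -> str:
    """Build a daily index name such as logs-2026-04-17."""
    return f"{prefix}-{day.isoformat()}"


def expired_indices(names: list[str], prefix: str,
                    today: date, retention_days: int) -> list[str]:
    """Return the daily indices that are older than the retention window."""
    cutoff = today - timedelta(days=retention_days)
    expired = []
    for name in names:
        if not name.startswith(prefix + "-"):
            continue
        try:
            day = date.fromisoformat(name[len(prefix) + 1:])
        except ValueError:
            continue  # not a dated index, e.g. an alias like logs-current
        if day < cutoff:
            expired.append(name)
    return expired


if __name__ == "__main__":
    existing = ["logs-2026-04-01", "logs-2026-04-16", "logs-current"]
    print(index_name("logs", date(2026, 4, 17)))
    print(expired_indices(existing, "logs", date(2026, 4, 17), retention_days=7))
```

Dropping a whole index is far cheaper than deleting documents out of a shared one, which is the main operational argument for time-based indices.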
Index-Per-Day vs Index-Per-Month Trade-offs
| Strategy | Shard Count | Index Management | Deletion | Best For |
|---|---|---|---|---|
| Index/day | High (365/year) | Complex rollover | Granular | High-volume logs |
| Index/week | Medium (52/year) | Moderate | Less granular | Moderate volume |
| Index/month | Low (12/year) | Simple | Coarse | Search over archives |
Index vs Shard Takeaways
- Index-per-day is the standard for time-series — makes rollover, deletion, and tiering straightforward
- Don’t let shards blow past 50GB — Elasticsearch itself recommends staying under that
- For non-time-series data, index aliases let you change routing without touching application code
- An index template keeps all new indices consistent on settings and mappings
Vector Search Scaling
With the rise of AI-powered search (semantic search, retrieval-augmented generation), many search systems now handle vector embeddings alongside text. Scaling vector search has different characteristics than text search.
ANN Index Structures
Approximate Nearest Neighbor (ANN) indexes like HNSW (Hierarchical Navigable Small World) trade recall for speed. HNSW builds a multi-layer graph where searching starts from the top layer and narrows down.
graph TD
subgraph Layer2[Layer 2 - Sparse]
L2_1((A)) --> L2_2((B))
L2_2 --> L2_3((C))
L2_3 --> L2_4((D))
end
subgraph Layer1[Layer 1 - Medium]
L1_1((E)) --> L1_2((F))
L1_3((G)) --> L1_4((H))
end
subgraph Layer0[Layer 0 - Dense]
L0_1((I)) --> L0_2((J))
L0_3((K)) --> L0_4((L))
end
L2_1 -->|down| L1_1
L1_2 -->|down| L0_2
L0_2 -->|down| L0_3
HNSW parameters to tune:
- `m`: number of connections per node (higher = better recall, more memory)
- `ef_construction`: depth of search during indexing (higher = better recall, slower indexing)
Vector Sharding Challenges
Unlike text search where hash-based sharding works, vector search faces the “curse of dimensionality” when sharding. A query must hit all shards to find true nearest neighbors (unless using approximate routing).
Scaling Strategies
- Coordinating node pre-filters: Route queries to shards that likely contain relevant vectors
- Semantic routing: Use a lightweight classifier to route queries to topic-specific indices
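- Ensemble retrieval: Query all shards with a smaller per-shard candidate list (`num_candidates` in Elasticsearch kNN search) and merge, vs query fewer shards with a larger one. Note that `ef_construction` is fixed at index time and cannot be varied per query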
Vector Search Takeaways
- HNSW is hungry for memory: plan on ~3GB per million float32 vectors at 768 dimensions (768 × 4 bytes each, before graph links)
- Quantization (int8 or product quantization) cuts memory use by 4-8x with acceptable recall loss
- For hybrid search (text + vector), run `knn` alongside `match` in a `bool` query
- Pre-filtering kills vector search performance: design your index structure around the filter, not the other way around
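A back-of-the-envelope calculation helps when sizing memory for your own vector count and element type. The sketch below is an approximation under stated assumptions: `m = 16`, 4 bytes per graph link, and up to `2 * m` links per vector on the bottom HNSW layer, with upper layers ignored. The function names are hypothetical.

```python
def raw_vector_bytes(num_vectors: int, dims: int, bytes_per_dim: int = 4) -> int:
    """Bytes needed for the vector values alone (4 bytes per float32 dimension)."""
    return num_vectors * dims * bytes_per_dim


def hnsw_link_bytes(num_vectors: int, m: int = 16, bytes_per_link: int = 4) -> int:
    """Rough bound for graph links: up to 2 * m neighbours per vector on layer 0."""
    return num_vectors * 2 * m * bytes_per_link


if __name__ == "__main__":
    n, dims = 1_000_000, 768
    for label, width in (("float32", 4), ("int8", 1)):
        total = raw_vector_bytes(n, dims, width) + hnsw_link_bytes(n)
        print(f"{label}: {total / 1e9:.2f} GB")
```

Under these assumptions a million 768-dimensional vectors come to about 3.2GB as float32 and about 0.9GB as int8, so the vector values dominate and quantization is where most of the savings come from.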
Trade-off Analysis
Search scaling involves fundamental trade-offs between competing concerns. Understanding these helps you make informed decisions for your specific workload.
Core Trade-off Dimensions
Consistency vs Performance
Distributed search systems must balance consistency against query performance. Strong consistency (ensuring all replicas reflect the latest writes) requires coordination overhead that increases latency. Most search systems opt for eventual consistency, where queries may briefly return stale results in exchange for lower latency. Use consistency controls like quorum writes only when your application genuinely requires them.
Write Throughput vs Data Safety
More replicas improve read scalability but amplify write overhead. Each write must be indexed on the primary and all replicas, multiplying I/O and CPU consumption. For write-heavy workloads, minimize replica count. For read-heavy workloads, add replicas strategically — do not over-provision.
Query Latency vs Recall
Approximate Nearest Neighbor (ANN) algorithms like HNSW trade recall accuracy for query speed. Higher ef_construction and m parameters improve recall at the cost of memory usage and indexing time. For latency-sensitive production systems, benchmark acceptable recall thresholds before settling on parameters.
Shard Count vs Overhead
More shards enable greater parallelism but each shard incurs overhead: heap usage for the shard object, segment metadata, and coordination overhead. Elasticsearch recommends keeping shard size under 50GB and total shard count per node under 1,000. Over-sharding (too many tiny shards) is a common mistake that degrades cluster stability.
Operational Complexity vs Capability
Features like cross-cluster search, custom routing, and hot-warm-cold architectures add capabilities but also operational complexity. Cross-cluster search introduces network latency. Custom routing requires application-level awareness of shard keys. Hot-warm-cold requires careful capacity planning across tiers. Evaluate whether the added complexity delivers genuine value for your use case.
Operations and Reliability
Node Failure Handling
When a node fails, Elasticsearch automatically promotes replicas to primaries. This is automatic but can cause a brief period of reduced capacity while promotion completes.
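For critical workloads, keep at least 2 replicas of every shard. With a single replica, one node failure leaves the affected shards with only one copy until recovery completes, so a second failure in that window loses data.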
Rebalancing
When you add a new node, Elasticsearch rebalances shards across the cluster automatically. This involves moving shard copies from overloaded nodes to underloaded ones.
Rebalancing can impact query performance. For sensitive workloads, throttle it with `cluster.routing.allocation.cluster_concurrent_rebalance` and `indices.recovery.max_bytes_per_sec`, or temporarily disable rebalancing during maintenance windows.
Frozen Indices
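Frozen indices were introduced in Elasticsearch 6.6 and are available throughout 7.x. These are indices explicitly marked as frozen, consuming less memory because their shard data is not cached. Queries against frozen indices are slower but functional. The freeze API was deprecated in 7.14 and removed in 8.0, where the frozen data tier with searchable snapshots covers the same use case.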
POST /my-index/_freeze
POST /my-index/_unfreeze
If you have time-based data that is rarely queried (like logs from two years ago), freezing can save significant memory.
When to Use / When Not to Use
When to Use Search Scaling Techniques
- Data volume exceeds single-node capacity (typically > 500GB per node for search workloads)
- Query latency degrades under load despite indexing optimization
- Read throughput requirements exceed single-node capabilities
- High availability is mandatory — zero tolerance for single points of failure
- Multi-datacenter deployments requiring geographic distribution of queries
When Not to Use Search Scaling Techniques
- Small to medium datasets (< 100GB) where a single well-tuned node suffices
- Low query concurrency — a single node handling < 100 QPS rarely needs sharding
- Simple key-value lookups that do not benefit from distributed search
- Write-heavy workloads with simple queries — replication does not scale writes
- Cost-constrained projects where operational complexity of multi-node clusters outweighs benefits
Production Failure Scenarios
| Failure | Impact | Mitigation |
|---|---|---|
| Shard hotspot due to skewed routing | Some nodes overloaded while others idle | Use routing keys that distribute evenly; monitor shard sizes |
| Coordinating node bottleneck | All search requests fail if coordination node crashes | Use dedicated coordinating nodes; do not route client traffic to data nodes |
| Network partition causing split-brain | Conflicting primaries accept divergent writes | Run three or more master-eligible nodes so electing a master needs a quorum (before 7.0, also set discovery.zen.minimum_master_nodes); monitor cluster state |
| Replica lag during heavy indexing | Stale results returned from lagging replicas | Tune index.refresh_interval; throttle indexing rate during peak read hours |
| Shard relocation timeout | Rebalancing stalls mid-transfer, data temporarily unavailable | Tune indices.recovery.max_bytes_per_sec; avoid rebalancing during high load |
| Memory pressure from aggregations | Facet/aggregation queries cause OOM | Set max_buckets limits; pre-compute aggregations with background jobs |
Common Pitfalls / Anti-Patterns
Anti-Patterns
Premature Sharding
Sharding adds complexity without benefits for small datasets. A single shard performs scatter-gather across segments within that shard anyway. If your total index size is under 50GB and latency is acceptable, do not shard.
Rule: Shard when a single node cannot handle the workload or data volume.
Ignoring Rebalancing Impact
Adding a new node triggers automatic rebalancing. During rebalance, the cluster moves shard copies, consuming network bandwidth and CPU. This degrades query performance for running requests.
Fix: Disable automatic rebalancing during maintenance windows:
PUT /_cluster/settings
{
"transient": {
"cluster.routing.rebalance.enable": "none"
}
}
Not Using Circuit Breakers
Complex queries with deep aggregations or large page sizes can cause OOM on coordinating nodes. Without circuit breakers, one bad query can crash an entire cluster.
Fix: Configure and monitor circuit breakers:
PUT /_cluster/settings
{
  "persistent": {
    "indices.breaker.total.limit": "70%",
    "indices.breaker.request.limit": "40%",
    "indices.breaker.fielddata.limit": "60%"
  }
}
Targeting Wrong Shards with Routing
Custom routing narrows the shards queried, but if the routing key is unevenly distributed, it creates hotspots. A common mistake is routing by user ID when a small subset of users generates most traffic.
Fix: Choose routing keys with high cardinality and even distribution. Monitor shard sizes per routing value.
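Skew is easy to check offline before committing to a routing key. The sketch below replays a list of routing keys (one per request) through a hash-and-modulo router and reports how far the busiest shard is above the average. CRC32 stands in for Elasticsearch's Murmur3 hash, and the traffic mix is invented for the example.

```python
import zlib
from collections import Counter


def shard_for(routing_key: str, num_shards: int) -> int:
    # CRC32 is an assumption of this sketch; Elasticsearch uses Murmur3.
    return zlib.crc32(routing_key.encode("utf-8")) % num_shards


def shard_load(request_keys: list[str], num_shards: int) -> Counter:
    """Count how many requests each shard would receive."""
    return Counter(shard_for(key, num_shards) for key in request_keys)


def skew_ratio(request_keys: list[str], num_shards: int) -> float:
    """Busiest shard's load divided by the mean load (1.0 is perfectly even)."""
    load = shard_load(request_keys, num_shards)
    mean = len(request_keys) / num_shards
    return max(load.values()) / mean


if __name__ == "__main__":
    even = [f"user_{i}" for i in range(1000)]
    skewed = ["user_1"] * 900 + [f"user_{i}" for i in range(2, 102)]
    print(f"even traffic:   {skew_ratio(even, 5):.2f}")
    print(f"skewed traffic: {skew_ratio(skewed, 5):.2f}")
```

Run it against a sample of real request logs: a ratio well above 1 for a candidate key is a sign that routing on it will concentrate load on a few shards.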
Over-Allocating Replicas
Replicas multiply storage and indexing overhead. Six replicas of three shards means seven copies of every document (one primary plus six replicas) and 21 shard copies in total. For write-heavy workloads, this slows indexing dramatically.
Fix: Start with one replica. Add more only when read scalability demands it.
Quick Reference
Key Bullets
- Hash-based sharding queries all shards by default
- Range-based or custom routing targets subsets when access patterns allow
- Replicas scale reads but amplify writes and storage
- The coordination node broadcasts to all shards and merges results
- Adaptive replica selection routes queries to the least-loaded replica
- Frozen indices trade query speed for memory savings on rarely-accessed data
- Rebalancing is automatic but can degrade performance during heavy load
Copy/Paste Checklist
# Check shard distribution across nodes
GET /_cat/shards?v&h=index,shard,prirep,state,node
# Retry shard allocations that failed earlier (rebalancing onto a new node starts on its own)
POST /_cluster/reroute?retry_failed=true
# Disable rebalancing during maintenance
PUT /_cluster/settings
{
"transient": {
"cluster.routing.rebalance.enable": "none"
}
}
# Update replica count for a specific index
PUT /my-index/_settings
{
"number_of_replicas": 2
}
# Freeze a rarely-used index (6.6-7.x only; the freeze API was removed in 8.0)
POST /my-index/_freeze
# Set custom routing on a document
PUT /my-index/_doc/1?routing=user_123
{
"title": "My Post",
"user_id": "user_123"
}
# Query with custom routing
GET /my-index/_search?routing=user_123
{
"query": { "match": { "title": "search term" } }
}
# Monitor pending tasks
GET /_cluster/pending_tasks
# Check circuit breaker status
GET /_nodes/stats/breaker
Observability Checklist
Metrics to Monitor
{
"cluster_level": {
"cluster_health": "green/yellow/red",
"unassigned_shards": "should be 0 in healthy cluster",
"number_of_pending_tasks": "< 50 typically",
"shard_balance_variance": "< 20% across nodes"
},
"per_node": {
"heap_used_percent": "< 85% sustained",
"cpu_iowait_percent": "< 20% sustained",
"search_latency_p99_per_node": "< 500ms",
"indexing_latency_p99": "< 1s per batch"
},
"per_shard": {
"query_count_per_shard": "high variance indicates hotspot",
"doc_count_per_shard": "rebalance if one shard > 2x average",
"segment_count": "< 50; force merge if higher"
}
}
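The per-shard rule of thumb above (one shard above 2x the average) is simple to automate. The sketch below assumes you have already collected document counts per shard, for example from the `_cat/shards` output; the input dictionary and the function name are hypothetical.

```python
def oversized_shards(doc_counts: dict[str, int], factor: float = 2.0) -> list[str]:
    """Return shard names whose doc count exceeds factor times the average."""
    if not doc_counts:
        return []
    average = sum(doc_counts.values()) / len(doc_counts)
    return sorted(name for name, count in doc_counts.items()
                  if count > factor * average)


if __name__ == "__main__":
    counts = {"my-index[0]": 100, "my-index[1]": 110,
              "my-index[2]": 90, "my-index[3]": 700}
    print(oversized_shards(counts))
```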
Key Logs to Capture
graph LR
subgraph LogSources
ES[Elasticsearch Logs] --> Cluster[Cluster Events]
ES --> Shard[Shard Allocation]
ES --> Search[Slow Search]
ES --> Index[Indexing Operations]
end
Cluster --> |node join/leave| Monitor[Monitoring]
Shard --> |allocation fail| Monitor
Search --> |query > threshold| Monitor
Index --> |bulk reject| Monitor
- Cluster events: node additions/removals, master elections, cluster state transitions
- Shard allocation logs: failed allocations, delayed decisions, recovery progress
- Slow search logs: queries exceeding the `index.search.slowlog.threshold.*` settings
- Indexing logs: bulk request rejections, mapping conflicts, refresh intervals
# Enable cluster-level allocation logging
PUT /_cluster/settings
{
  "transient": {
    "logger.org.elasticsearch.cluster.routing.allocation": "DEBUG",
    "logger.org.elasticsearch.indices.recovery": "INFO"
  }
}
Alerts to Configure
| Alert | Condition | Severity |
|---|---|---|
| Cluster health yellow | status == yellow for > 5 min | Warning |
| Cluster health red | status == red | Critical |
| Shard imbalance | shard_count_max - shard_count_min > 5 | Warning |
| Coordinating node queue | search_queue > 100 | Warning |
| Recovery stalled | recovering_shards > 0 for > 10 min | Warning |
| Disk watermark low | disk.low watermark exceeded | Critical |
Security Checklist
- Network-level isolation: place search cluster on internal network; never expose directly to internet
- Node authentication: use TLS certificates for inter-node communication
- Role-based access: define read-only, read-write, and admin roles; apply per index
- Index-level privileges: restrict write access to pipelines and ingestion services only
- Audit logging: enable security audit logs for all write operations and access attempts
- Field- and document-level security: use field-level security to hide specific fields and document-level security to hide whole documents from some users
- Query restrictions: cap expensive aggregations for public-facing endpoints (`search.max_buckets`)
- API key rotation: rotate service account keys quarterly; use secret management system
- Input validation: sanitize query strings to prevent injection via special characters
# Example: Create a read-only role for production traffic
POST /_security/role/production_reader
{
  "indices": [
    {
      "names": ["production-*"],
      "privileges": ["read", "view_index_metadata"],
      "field_security": {
        "grant": ["title", "content", "publish_date", "author"]
      }
    }
  ]
}
Interview Questions
Hash-based sharding computes a hash of the document ID and modulo by the number of primary shards to determine which shard owns the document. The main limitation is that queries without a filter on document_id must query all shards (scatter-gather), because the routing cannot be determined without the hash. This means even targeted queries like a term filter on a specific field still hit every shard.
The coordination node receives search requests from clients and acts as a load balancer. It broadcasts the query to all relevant shards (or all shards if routing can't narrow it down), collects the top N results from each shard, then merges and re-ranks these results to return the final response. It does not perform the full search itself—shards return only their top N matches to reduce data transfer.
Hot nodes handle all writes and recent read traffic, using the fastest storage (NVMe SSD) and most RAM. Warm nodes hold indices that are accessed less frequently, typically after 7-30 days, using cheaper storage (SATA SSD). Cold nodes store rarely-accessed data, often frozen, using the cheapest storage (spinning disk or object store). Use hot nodes for active indexing and recent data, warm for historical but queryable data, and cold for archival data that rarely gets queried.
Custom routing allows you to specify a routing key at index time (e.g., `routing=category:tutorials`) so that queries can target specific shards without querying all shards. Use it when you have high-cardinality access patterns where a subset of documents is queried much more frequently (e.g., queries by tenant_id or category). The tradeoff is potential hotspotting if the routing key is not evenly distributed.
ILM automates the transition of indices through lifecycle stages (hot → warm → cold → delete) based on age or size thresholds. This optimizes costs by moving older, less-frequently accessed data to cheaper storage tiers and eventually deleting it, rather than keeping all data on expensive hot-node storage. It also automates maintenance actions like rollover, shrink, and forcemerge to keep indices performing well.
Replication linearly scales read throughput because queries can be distributed across all replica copies. However, every write must be indexed on the primary shard and all replicas, causing write amplification. More replicas mean slower indexing throughput and higher storage costs. The optimal replica count depends on your read-to-write ratio: read-heavy workloads benefit from more replicas, while write-heavy workloads should minimize replicas.
Circuit breakers prevent runaway queries from consuming too much memory and causing OOM errors. Elasticsearch has several circuit breakers (parent, request, fielddata, in-flight requests), each with configurable limits. When a query would exceed a circuit breaker's limit, it fails fast with a `circuit_breaking_exception` (HTTP 429) rather than consuming memory until the node crashes. They are critical for cluster stability, especially with complex aggregations or large page sizes.
Cross-cluster search (CCS) federates queries across multiple clusters in real-time without data movement, useful for multi-tenancy or ad-hoc multi-datacenter queries. Cross-cluster replication (CCR) provides active-passive DR by continuously replicating indices from a leader to a follower cluster; failover is a manual step in which the follower is promoted to a regular index. Use CCS when you need unified read access across isolated clusters. Use CCR when you need true disaster recovery with minimal RPO.
In high-dimensional vector spaces, the distance between any two points becomes similar as dimensions increase, making it hard to distinguish true nearest neighbors. For sharding, this means a query must theoretically hit all shards to guarantee finding true nearest neighbors (unlike text search where routing can narrow the scope). Practical solutions include semantic routing, pre-filtering to candidate subsets, or ensemble retrieval with approximate results.
Capacity planning involves: (1) estimating storage: raw data × overhead (1.2-1.3x) × replica factor, keeping shards under 50GB; (2) estimating throughput: single shard handles 5,000-15,000 QPM, scale nodes linearly; (3) node sizing: hot nodes need 16-32 cores, 64-128GB RAM, NVMe storage; (4) accounting for N+2 redundancy and 30% extra capacity for rebalancing. Always stress test with simulated peak load before production deployment.
Replica lag occurs when replicas fall behind the primary shard during indexing, often due to heavy write loads or network bottlenecks. When a query hits a lagging replica, it returns stale or missing documents. This is eventual consistency: the inconsistency window depends on how far behind the replicas are. Mitigation strategies include reducing bulk request size, throttling indexing during read-heavy periods, tuning `index.refresh_interval`, or pinning consistency-sensitive reads to the same shard copy with a custom `preference` string.
`wait_for_active_shards` checks that a minimum number of shard copies are active before a write proceeds, which guards against writing to a shard that has lost most of its copies. `refresh=true` (or the cheaper `refresh=wait_for`) on the write request makes the document searchable before the call returns, at the cost of higher latency. The former is a durability safeguard on the write path; the latter gives read-your-writes behaviour for searches issued after the write returns.
Elasticsearch rebalances shards automatically when nodes are added or removed, using allocation filters and awareness attributes to optimize physical placement. The process moves shard copies from overloaded to underloaded nodes, throttled by `indices.recovery.max_bytes_per_sec` and `cluster.routing.allocation.cluster_concurrent_rebalance`. During rebalance, network bandwidth and disk I/O are consumed for data transfer, and primary promotion causes brief unavailability. For sensitive workloads, disable automatic rebalancing during maintenance windows using `cluster.routing.rebalance.enable: none`.
Circuit breakers prevent runaway queries from causing OOM by failing fast when memory estimates exceed thresholds. Elasticsearch has: parent circuit breaker (total across all breakers), request circuit breaker (per-query memory for aggregations/sorts), fielddata circuit breaker (field data loading), and in-flight requests breaker (memory held by incoming transport and HTTP requests). Recommended configuration: parent at 70%, request at 40%, fielddata at 60%. Monitor `/_nodes/stats/breaker` and adjust based on workload: too tight causes false positives, too loose risks actual OOM crashes.
Hotspots occur when shard distribution does not match query traffic patterns. They are common with time-based data, where the most recent shards take most of the traffic. Solutions include: (1) split the index to add more shards on the hot tier, distributing load; (2) adjust `index.routing.allocation` to move hot shards to dedicated hot nodes with more CPU; (3) use custom routing with a high-cardinality key that distributes traffic evenly; (4) scale horizontally by adding hot nodes, which Elasticsearch will rebalance across; (5) implement read throttling or query queueing at the application layer during extreme spikes.
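The effect of routing-key cardinality on balance can be sketched with a small simulation. It assumes a generic hash-then-modulo routing scheme and uses MD5 as a deterministic stand-in for the engine's actual hash function; the key names and shard count are made up for illustration.

```python
import hashlib
from collections import Counter


def shard_for(routing_key, num_shards):
    """Map a routing key to a shard with hash-then-modulo (MD5 as a stand-in)."""
    digest = hashlib.md5(routing_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def imbalance(routing_keys, num_shards):
    """Ratio of the busiest shard's load to the mean load (1.0 is perfect)."""
    counts = Counter(shard_for(key, num_shards) for key in routing_keys)
    mean = len(routing_keys) / num_shards
    return max(counts.values()) / mean


# Low cardinality: two categories, one of them dominant.
by_category = ["news"] * 900 + ["sports"] * 100
# High cardinality: one key per user.
by_user = [f"user-{i}" for i in range(1000)]

skewed = imbalance(by_category, 8)
even = imbalance(by_user, 8)
```

With the category key, at least 900 of 1000 requests land on one of eight shards, while the per-user key spreads them close to evenly. The tradeoff is that a query must know its routing key to avoid fanning out to every shard.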
Adding replicas scales reads roughly linearly while nodes have spare capacity (more copies to distribute queries across), but it amplifies writes and storage. Adding primary shards increases parallelism for both reads and writes, but changing the shard count of an existing index requires reindexing or a split, and more shards mean more coordination overhead. General guidance: if read latency is the bottleneck, add replicas first (simpler, no reindexing). If indexing throughput is the bottleneck, add primary shards. If data volume exceeds what a single shard can hold comfortably, you need more primary shards anyway. Over-sharding causes metadata overhead; under-sharding limits parallelism.
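The write and storage amplification is simple arithmetic. The sketch below assumes every replica holds a full copy of its primary; the function name and the 200 GB figure are illustrative.

```python
def shard_footprint(primaries, replicas, primary_data_gb):
    """Rough cost of a primaries/replicas choice for one index."""
    return {
        "total_shard_copies": primaries * (1 + replicas),
        "storage_gb": primary_data_gb * (1 + replicas),
        "writes_per_document": 1 + replicas,  # one per copy of the owning shard
        "avg_primary_shard_gb": primary_data_gb / primaries,
    }


one_replica = shard_footprint(primaries=5, replicas=1, primary_data_gb=200)
two_replicas = shard_footprint(primaries=5, replicas=2, primary_data_gb=200)
```

Going from one replica to two adds a third copy to serve reads, at the cost of 200 GB more storage and one extra indexing operation per document.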
Rolling restarts require careful sequencing: (1) Disable shard allocation with `PUT /_cluster/settings {"transient": {"cluster.routing.allocation.enable": "none"}}` so the cluster does not start copying shards around when a node goes down. (2) Stop indexing and pause ILM if needed to avoid rollover during the restart. (3) Gracefully stop one node (SIGTERM). Replicas on the remaining nodes take over as primaries where needed, and the cluster goes yellow. (4) Perform your upgrade or configuration change. (5) Restart the node and verify it joins the cluster. (6) Re-enable allocation with `PUT /_cluster/settings {"transient": {"cluster.routing.allocation.enable": "all"}}`. (7) Wait for cluster health to return to green before proceeding to the next node.
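The sequence above can be sketched as a plan generator. This is a hedged sketch that only builds the list of requests and manual steps; it does not talk to a cluster. The function name, the `MANUAL` marker, and the 120-second timeout are assumptions for illustration.

```python
def rolling_restart_plan(node_names, health_timeout="120s"):
    """Build an ordered list of (method, path_or_note, body) steps per node."""

    def set_allocation(value):
        body = {"transient": {"cluster.routing.allocation.enable": value}}
        return ("PUT", "/_cluster/settings", body)

    plan = []
    for node in node_names:
        # Stop the cluster from reshuffling shards while the node is down.
        plan.append(set_allocation("none"))
        plan.append(("MANUAL", f"stop {node} with SIGTERM, apply the change, start it again", None))
        # Confirm the node has rejoined before touching allocation.
        plan.append(("GET", "/_cat/nodes?v", None))
        plan.append(set_allocation("all"))
        # Block until the cluster has recovered before moving on.
        plan.append(("GET", f"/_cluster/health?wait_for_status=green&timeout={health_timeout}", None))
    return plan


plan = rolling_restart_plan(["node-1", "node-2"])
```

Generating the plan up front makes it easy to review or dry-run the sequence, and the health check at the end of each node's block is what keeps the restart from taking down two copies of the same shard at once.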
`query_then_fetch` (default) sends the query to the relevant shards, each shard scores with its own local term statistics and returns its top N document IDs and scores, and the coordinating node merges them and fetches the winning documents. `dfs_query_then_fetch` adds a preliminary phase that collects term and document frequencies from all shards, then runs the search using those global statistics. This produces more accurate relevance ranking when term distributions differ across shards, at the cost of an extra network round-trip. Use `dfs` when you have few shards with heavily skewed document distribution and relevance accuracy matters more than latency.
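A small worked example shows how local statistics distort scores. It assumes the BM25-style IDF formula used by Lucene, ln(1 + (N - n + 0.5) / (n + 0.5)), and two made-up shards where the same term is common on one and rare on the other.

```python
import math


def bm25_idf(doc_count, doc_freq):
    """BM25-style inverse document frequency."""
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))


# Hypothetical per-shard statistics for a single query term.
shards = [
    {"docs": 1000, "docs_with_term": 500},  # term is common on this shard
    {"docs": 1000, "docs_with_term": 5},    # term is rare on this shard
]

# query_then_fetch: each shard scores with its own statistics.
local_idf = [bm25_idf(s["docs"], s["docs_with_term"]) for s in shards]

# dfs_query_then_fetch: statistics are summed across shards first.
total_docs = sum(s["docs"] for s in shards)
total_freq = sum(s["docs_with_term"] for s in shards)
global_idf = bm25_idf(total_docs, total_freq)
```

With local statistics the second shard weights the term several times more heavily than the first, so its documents can outrank equally relevant ones from the first shard. The global value sits between the two and is applied uniformly.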
The fielddata cache loads field values into heap memory so that `text` fields can be sorted or aggregated on. Keyword, numeric, and geo fields use on-disk `doc_values` by default and do not need it. Without bounds, a single large aggregation (such as a terms aggregation on a high-cardinality text field) can load millions of values and exhaust the heap, causing OOM. Mitigation: aggregate on a `keyword` field backed by `doc_values` rather than setting `fielddata: true` on a text field. Set `indices.breaker.fielddata.limit` to cap fielddata memory. Consider `eager_global_ordinals` for frequently aggregated fields so that global ordinals are built at refresh time rather than on the first query. Monitor fielddata size via `/_nodes/stats/indices/fielddata`.
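As a sketch of the mapping side, the request bodies below use a `keyword` sub-field for aggregations instead of enabling fielddata on the text field. The field names (`title`, `category`) and the 40% limit are hypothetical choices, and the dicts are only serialized here, not sent anywhere.

```python
import json

# Body for index creation: full-text search on `title`,
# aggregations and sorting on the doc_values-backed `title.raw`.
index_body = {
    "mappings": {
        "properties": {
            "title": {
                "type": "text",
                "fields": {"raw": {"type": "keyword"}},
            },
            # Build global ordinals at refresh time for a hot aggregation field.
            "category": {"type": "keyword", "eager_global_ordinals": True},
        }
    }
}

# Body for PUT /_cluster/settings: cap fielddata memory.
breaker_body = {"persistent": {"indices.breaker.fielddata.limit": "40%"}}

# Terms aggregation that targets the keyword sub-field, not the text field.
agg_body = {"size": 0, "aggs": {"by_title": {"terms": {"field": "title.raw"}}}}

payloads = [json.dumps(b) for b in (index_body, breaker_body, agg_body)]
```

Because `title.raw` is a keyword field, the aggregation reads doc values from disk and the filesystem cache, and the fielddata breaker acts only as a safety net.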
Multi-tenant isolation strategies: (1) Index-per-tenant: each tenant gets its own index. This is the simplest isolation and the easiest to manage for per-tenant ILM and scaling, but thousands of indices can stress cluster state. (2) Document-level field with filtered aliases: add a `tenant_id` field, create a filtered alias per tenant, and query through the alias. This has lower overhead per tenant but requires careful filter definitions to prevent cross-tenant leaks. (3) Searchable snapshots with cold storage for rarely accessed tenants. Choose based on tenant count and isolation requirements. For strict compliance isolation, use index-per-tenant, with dedicated clusters for high-security tenants. Always use role-based access and document-level security for shared indices.
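The filtered-alias approach can be sketched as a builder for the body of a `POST /_aliases` request. The index name, the `tenant-` alias prefix, and the choice to also route by tenant are assumptions for illustration; the code only constructs the payload.

```python
def tenant_alias_action(index, tenant_id):
    """One alias that filters (and routes) a shared index down to one tenant."""
    return {
        "add": {
            "index": index,
            "alias": f"tenant-{tenant_id}",
            # Every search through the alias is restricted to this tenant.
            "filter": {"term": {"tenant_id": tenant_id}},
            # Optional: keep each tenant's documents on a single shard.
            "routing": tenant_id,
        }
    }


def aliases_request(index, tenant_ids):
    return {"actions": [tenant_alias_action(index, t) for t in tenant_ids]}


body = aliases_request("shared-docs", ["acme", "globex"])
```

The alias filter is a convenience, not a security boundary: a client that can query `shared-docs` directly bypasses it, which is why the role-based access and document-level security mentioned above still matter.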
Further Reading
- Elasticsearch: The Definitive Guide — Official comprehensive guide from Elastic
- Scaling Elasticsearch — Elastic’s own scaling best practices
- Index Lifecycle Management — Official ILM documentation
- Search Latency Optimization — Troubleshooting slow queries
- Cross-Cluster Search — Federated search setup
- HNSW Algorithm Explained — Original HNSW paper for vector search internals
- Designing Data-Intensive Applications — Chapter on search and batch processing (Martin Kleppmann)
Conclusion
Three things matter most for search scaling: sharding, replication, and routing. Hash-based sharding is simple, but most queries must hit every shard. Range-based sharding or custom routing targets a subset of shards when your access patterns allow. Replication scales reads at the cost of extra write work and storage. The coordinating node handles the scatter-gather that makes distributed search work.
The right configuration depends on your query patterns. If most queries target recent data, time-based sharding with more replicas on recent shards makes sense. If queries span your entire dataset, hash-based sharding with many replicas handles the load.
Measure your query latency distribution before optimizing. It is easy to over-engineer sharding for workloads that do not need it.