Search Scaling: Sharding, Routing, and Horizontal Growth
Learn how to scale search systems horizontally with index sharding strategies, query routing, replication patterns, and cluster management techniques.
Scaling a search system is different from scaling a regular database. Search queries are CPU-intensive, latency-sensitive, and often return large result sets. A single node can only handle so much before query times degrade. At that point, you need to distribute the load across multiple nodes while maintaining fast queries and consistent results.
This post covers the key techniques for scaling search: sharding strategies, query routing, replication for read scalability, and cluster management.
Understanding Search Workload Characteristics
Before diving into scaling techniques, it helps to understand what makes search workloads different.
Search queries are CPU-bound for relevance scoring and aggregation. A query that scores a million documents uses significantly more CPU than a simple key-value lookup. Memory matters too, but for inverted indexes, the bottleneck is often CPU during the scoring phase.
Search is also partition-sensitive. A query for “recent blog posts” might hit different shards depending on how data is distributed. If your sharding strategy does not match your query patterns, you end up hitting all shards for every query, defeating the purpose of distribution.
Sharding Strategies
Sharding splits your index into smaller pieces that can be distributed across nodes. Choosing the right sharding strategy is the most important decision in search scaling.
Hash-Based Sharding
The default approach in Elasticsearch: hash the document ID and modulo by shard count.
shard = hash(document_id) % num_primary_shards;
This distributes documents evenly but has a downside: queries without a filter on document_id must query all shards. You cannot target a specific shard.
// Every query hits all shards - no optimization possible
GET /my-index/_search
{
"query": { "match": { "content": "search term" } }
}
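To make the distribution behaviour concrete, here is a minimal Python sketch of hash-based shard assignment. Note that crc32 stands in for Elasticsearch's murmur3 routing hash, so placements will not match a real cluster:

```python
import zlib

def shard_for(doc_id: str, num_primary_shards: int) -> int:
    # Elasticsearch murmur3-hashes the _routing value (the document ID
    # by default); crc32 is a stand-in here to show the distribution
    # behaviour, not the real placement.
    return zlib.crc32(doc_id.encode("utf-8")) % num_primary_shards

# Deterministic: the same ID always lands on the same shard,
# and assignments spread roughly evenly across shards.
counts = [0] * 3
for i in range(9000):
    counts[shard_for(f"doc-{i}", 3)] += 1
```

The even spread is exactly why you cannot target a shard later: the hash destroys any relationship between document content and shard placement.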
Range-Based Sharding
Shard by a field value, like date or category. Elasticsearch cannot range-partition documents within a single index, so in practice this means one index per range (for example, one per quarter), queried together through a wildcard or alias:
PUT /posts-2024-q4/_doc/1
{
"title": "Getting Started",
"publish_date": "2024-11-02"
}
GET /posts-2024-*/_search
{
"query": { "range": { "publish_date": { "gte": "now-7d/d" } } }
}
Range sharding lets you target subsets of data. A query for “last week’s posts” might only hit the relevant time-based shards. The tradeoff is hot spots: if most queries hit recent data, those shards become overloaded.
graph LR
subgraph TimeBasedShards
Shard1[2024-Q1]
Shard2[2024-Q2]
Shard3[2024-Q3]
Shard4[2024-Q4]
end
Query1["query: recent posts"] --> Shard4
Query2["query: old posts"] --> Shard1
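In practice, time-based sharding means computing which period-indices a query's date range touches. A sketch in Python, assuming a hypothetical posts-YYYY-qN naming scheme (real clusters often use ILM-managed monthly or daily indices instead):

```python
from datetime import date

def quarterly_indices(start: date, end: date, prefix: str = "posts") -> list[str]:
    """Return the quarterly index names a date-range query must hit."""
    indices = []
    year, quarter = start.year, (start.month - 1) // 3 + 1
    end_key = (end.year, (end.month - 1) // 3 + 1)
    while (year, quarter) <= end_key:
        indices.append(f"{prefix}-{year}-q{quarter}")
        quarter += 1
        if quarter > 4:
            year, quarter = year + 1, 1
    return indices

# A query for recent posts touches one index, not the whole dataset.
recent = quarterly_indices(date(2024, 10, 1), date(2024, 12, 31))
```

The narrower the date range, the fewer shards fan out, which is the entire payoff of range-based layouts.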
Custom Routing
Elasticsearch allows custom routing to target specific shards:
PUT /my-index/_doc/1?routing=category:tutorials
{
"title": "Getting Started",
"category": "tutorials"
}
Queries can then target specific routing values:
GET /my-index/_search?routing=category:tutorials
{
"query": { "match": { "title": "search" } }
}
This reduces the number of shards queried from N to 1, cutting latency significantly for targeted queries.
Query Routing
Once you have multiple shards, query routing determines which shards get queried and how results are merged.
Coordinating Node
When a client sends a search request to Elasticsearch, any node can receive it. That node becomes the coordinating node for that request. It broadcasts the query to all relevant shards, collects the results, and merges them.
graph TD
Client --> Coord[Coordinator Node]
Coord -->|broadcast| Shard1[Shard 1]
Coord -->|broadcast| Shard2[Shard 2]
Coord -->|broadcast| Shard3[Shard 3]
Shard1 -->|top 10| Coord
Shard2 -->|top 10| Coord
Shard3 -->|top 10| Coord
Coord -->|merged top 10| Client
The coordinator does not do the full search on each shard. Shards return their top N results, and the coordinator merges and re-ranks. This is efficient as long as N is small relative to the total matches.
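The merge step can be sketched in a few lines of Python: each shard hands back its own sorted top N, and the coordinator does a k-way merge over those small lists rather than rescoring every match:

```python
import heapq

def merge_shard_results(shard_results, size=10):
    """Merge per-shard top-N hits into a global top-N, the way a
    coordinating node does.  Each shard returns (score, doc_id) pairs
    sorted by descending score; the coordinator only ever sees
    len(shard_results) * size candidates, not every match."""
    merged = heapq.merge(*shard_results, key=lambda hit: -hit[0])
    return list(merged)[:size]

shard1 = [(9.1, "a"), (4.2, "b")]
shard2 = [(8.7, "c"), (7.5, "d")]
shard3 = [(3.3, "e")]
top3 = merge_shard_results([shard1, shard2, shard3], size=3)
```

This is why deep pagination is expensive: asking for page 1,000 forces every shard to return 10,000 candidates to the coordinator.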
Adaptive Replica Selection
Since 7.0, Elasticsearch enables adaptive replica selection by default: the coordinating node ranks each eligible shard copy using exponentially weighted averages of past response time, service time, and queue depth, then sends the request to the copy expected to answer fastest. Slow or overloaded replicas automatically receive less traffic.
You can control replica selection:
GET /my-index/_search?preference=_only_local
The _only_local preference ensures the query hits the local replica only. This is useful when you know the data is local and want to avoid cross-node traffic.
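A simplified model of the ranking behind adaptive replica selection (the real Elasticsearch formula uses exponentially weighted moving averages and different coefficients; the weights below are purely illustrative):

```python
def replica_rank(response_time_ms: float, queue_size: int,
                 service_time_ms: float) -> float:
    """Toy rank inspired by adaptive replica selection: lower is better.
    Queue depth multiplies the expected service time, so a busy node
    looks worse even if its past latency was good."""
    return response_time_ms + service_time_ms * (1 + queue_size)

replicas = {
    "node-a": replica_rank(20.0, 0, 5.0),   # idle, fast
    "node-b": replica_rank(20.0, 8, 5.0),   # same latency, deep queue
    "node-c": replica_rank(90.0, 0, 5.0),   # slow link
}
best = min(replicas, key=replicas.get)
```

The point of the queue term is to avoid the "slow node death spiral": without it, a node that was fast yesterday keeps receiving traffic long after it has backed up.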
Replication for Read Scalability
Replication adds copies of your data to handle read traffic. Each replica is a full copy of the primary shard, capable of serving searches.
Read-Through Scaling
Adding replicas scales read throughput roughly linearly. With 3 primary shards and 2 replicas, you have 9 total shard copies. Under heavy read load, the coordinator distributes requests across all 9 copies.
PUT /my-index/_settings
{
"number_of_replicas": 2
}
The tradeoff is write amplification: every write must be indexed on the primary and all replicas. More replicas mean slower writes.
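The copy math is worth making explicit. A trivial sketch of total shard copies and per-document write amplification:

```python
def shard_copies(primaries: int, replicas: int) -> int:
    """Total shard copies in the cluster: each primary plus its replicas."""
    return primaries * (1 + replicas)

def write_amplification(replicas: int) -> int:
    """Every document is indexed once on the primary and once per replica."""
    return 1 + replicas

copies = shard_copies(3, 2)   # the 9 copies from the example above
```

Storage cost scales with shard_copies; indexing CPU scales with write_amplification. Both grow every time you bump number_of_replicas.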
Consistency Considerations
With replication, a search can return stale results if it hits a replica that has not yet applied the latest writes. Elasticsearch does not expose a read-time consistency parameter for search; staleness is managed from the write side and with sticky routing:
PUT /my-index/_doc/1?wait_for_active_shards=2
GET /my-index/_search?preference=user_123
wait_for_active_shards blocks the write until that many shard copies are active, and a stable preference value keeps one user's searches on the same shard copies, so results do not flip between replicas of differing freshness.
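For historical context, Elasticsearch versions before 5.x exposed a write-time consistency=quorum option (since replaced by wait_for_active_shards); the majority it computed can be sketched as:

```python
def write_quorum(replicas: int) -> int:
    """Majority of shard copies for a write: the historical
    consistency=quorum formula, int((1 + number_of_replicas) / 2) + 1,
    counting the primary as one copy."""
    return (1 + replicas) // 2 + 1

q = write_quorum(2)   # primary + 2 replicas -> a majority is 2 copies
```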
Cluster Management
Managing a multi-node search cluster requires planning for failure, rebalancing, and capacity.
Node Failure Handling
When a node fails, Elasticsearch automatically promotes replicas to primaries. This is automatic but can cause a brief period of reduced capacity while promotion completes.
For critical workloads, keep at least 2 replicas of every shard. Single replicas create single points of failure.
Rebalancing
When you add a new node, Elasticsearch rebalances shards across the cluster automatically. This involves moving shard copies from overloaded nodes to underloaded ones.
Rebalancing can impact query performance. For sensitive workloads, cap recovery traffic with indices.recovery.max_bytes_per_sec or temporarily disable rebalancing during maintenance windows.
Frozen Indices
Elasticsearch 6.6 introduced frozen indices (since deprecated in favor of the frozen data tier). These are indices explicitly marked as frozen; they consume less heap because their shard data structures are loaded only transiently while a query runs. Queries against frozen indices are slower but functional.
POST /my-index/_freeze
POST /my-index/_unfreeze
If you have time-based data that is rarely queried (like logs from two years ago), freezing can save significant memory.
When to Use / When Not to Use
When to Use Search Scaling Techniques
- Data volume exceeds single-node capacity (typically > 500GB per node for search workloads)
- Query latency degrades under load despite indexing optimization
- Read throughput requirements exceed single-node capabilities
- High availability is mandatory — zero tolerance for single points of failure
- Multi-datacenter deployments requiring geographic distribution of queries
When Not to Use Search Scaling Techniques
- Small to medium datasets (< 100GB) where a single well-tuned node suffices
- Low query concurrency — a single node handling < 100 QPS rarely needs sharding
- Simple key-value lookups that do not benefit from distributed search
- Write-heavy workloads with simple queries — replication does not scale writes
- Cost-constrained projects where operational complexity of multi-node clusters outweighs benefits
Production Failure Scenarios
| Failure | Impact | Mitigation |
|---|---|---|
| Shard hotspot due to skewed routing | Some nodes overloaded while others idle | Use routing keys that distribute evenly; monitor shard sizes |
| Coordinating node bottleneck | Searches routed through an overloaded or crashed coordinator queue up or fail | Use dedicated coordinating nodes; spread client traffic across several of them |
| Network partition causing split-brain | Conflicting primaries accept divergent writes | Run at least three master-eligible nodes so elections require a quorum; monitor cluster state |
| Replica lag during heavy indexing | Stale results returned from lagging replicas | Throttle indexing during peak read hours; pin user sessions with the preference parameter |
| Shard relocation timeout | Rebalancing stalls mid-transfer, data temporarily unavailable | Tune indices.recovery.max_bytes_per_sec; avoid rebalancing during high load |
| Memory pressure from aggregations | Facet/aggregation queries cause OOM | Set max_buckets limits; pre-compute aggregations with background jobs |
Observability Checklist
Metrics to Monitor
{
"cluster_level": {
"cluster_health": "green/yellow/red",
"unassigned_shards": "should be 0 in healthy cluster",
"number_of_pending_tasks": "< 50 typically",
"shard_balance_variance": "< 20% across nodes"
},
"per_node": {
"heap_used_percent": "< 85% sustained",
"cpu_iowait_percent": "< 20% sustained",
"search_latency_p99_per_node": "< 500ms",
"indexing_latency_p99": "< 1s per batch"
},
"per_shard": {
"query_count_per_shard": "high variance indicates hotspot",
"doc_count_per_shard": "rebalance if one shard > 2x average",
"segment_count": "< 50; force merge if higher"
}
}
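These thresholds can feed a simple evaluator. The metric names and limits below mirror the checklist above and are starting points to tune per cluster, not universal constants:

```python
# Thresholds mirror the checklist above; tune them per cluster.
THRESHOLDS = {
    "heap_used_percent": 85,
    "cpu_iowait_percent": 20,
    "search_latency_p99_ms": 500,
    "unassigned_shards": 0,
}

def evaluate(metrics: dict) -> list[str]:
    """Return the names of metrics breaching their thresholds."""
    breaches = []
    for name, limit in THRESHOLDS.items():
        if metrics.get(name, 0) > limit:
            breaches.append(name)
    return breaches

alerts = evaluate({
    "heap_used_percent": 91,
    "cpu_iowait_percent": 5,
    "search_latency_p99_ms": 220,
    "unassigned_shards": 0,
})
```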
Key Logs to Capture
graph LR
subgraph LogSources
ES[Elasticsearch Logs] --> Cluster[Cluster Events]
ES --> Shard[Shard Allocation]
ES --> Search[Slow Search]
ES --> Index[Indexing Operations]
end
Cluster --> |node join/leave| Monitor[Monitoring]
Shard --> |allocation fail| Monitor
Search --> |query > threshold| Monitor
Index --> |bulk reject| Monitor
- Cluster events: node additions/removals, master elections, cluster state transitions
- Shard allocation logs: failed allocations, delayed decisions, recovery progress
- Slow search logs: queries exceeding search.slowlog.threshold
- Indexing logs: bulk request rejections, mapping conflicts, refresh intervals
# Enable cluster-level allocation logging
PUT /_cluster/settings
{
"transient": {
"logger.index_allocation": "DEBUG",
"logger.index_recovery": "INFO"
}
}
Alerts to Configure
| Alert | Condition | Severity |
|---|---|---|
| Cluster health yellow | status == yellow for > 5 min | Warning |
| Cluster health red | status == red | Critical |
| Shard imbalance | shard_count_max - shard_count_min > 5 | Warning |
| Coordinating node queue | search_queue > 100 | Warning |
| Recovery stalled | recovering_shards > 0 for > 10 min | Warning |
| Disk watermark low | disk.low watermark exceeded | Critical |
Security Checklist
- Network-level isolation — place search cluster on internal network; never expose directly to internet
- Node authentication — use TLS certificates for inter-node communication
- Role-based access — define read-only, read-write, and admin roles; apply per index
- Index-level privileges — restrict write access to pipelines and ingestion services only
- Audit logging — enable security audit logs for all write operations and access attempts
- Field-level security — use document-level security if certain fields must be hidden from some users
- Query restrictions — disable expensive aggregations for public-facing endpoints; cap bucket counts with search.max_buckets
- API key rotation — rotate service account keys quarterly; use a secret management system
- Input validation — sanitize query strings to prevent injection via special characters
# Example: Create a read-only role for production traffic
POST /_security/role/production_reader
{
"indices": [
{
"names": ["production-*"],
"privileges": ["read", "view_index_metadata"]
}
],
"field_security": {
"grant": ["title", "content", "publish_date", "author"]
}
}
Common Pitfalls / Anti-Patterns
Premature Sharding
Sharding adds complexity without benefit for small datasets. A single-shard index already spreads its search work across internal Lucene segments, so splitting a small index buys little extra parallelism. If your total index size is under 50GB and latency is acceptable, do not shard.
Rule: Shard when a single node cannot handle the workload or data volume.
Ignoring Rebalancing Impact
Adding a new node triggers automatic rebalancing. During rebalance, the cluster moves shard copies, consuming network bandwidth and CPU. This degrades query performance for running requests.
Fix: Disable automatic rebalancing during maintenance windows:
PUT /_cluster/settings
{
"transient": {
"cluster.routing.rebalance.enable": "none"
}
}
Not Using Circuit Breakers
Complex queries with deep aggregations or large page sizes can cause OOM on coordinating nodes. Without circuit breakers, one bad query can crash an entire cluster.
Fix: Configure and monitor circuit breakers:
{
"indices.breaker.total.limit": "70%",
"indices.breaker.request.limit": "40%",
"indices.breaker.fielddata.limit": "60%"
}
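The principle behind these breakers can be modeled as a running reservation counter that rejects a request before it allocates, instead of letting the node run out of memory. This is a simplified sketch, not Elasticsearch's implementation:

```python
class CircuitBreakerError(Exception):
    pass

class CircuitBreaker:
    """Track estimated bytes reserved by in-flight requests and refuse
    new reservations past a limit: fail one query instead of OOMing
    the whole node."""

    def __init__(self, limit_bytes: int):
        self.limit = limit_bytes
        self.used = 0

    def add_estimate(self, nbytes: int) -> None:
        if self.used + nbytes > self.limit:
            raise CircuitBreakerError(
                f"would use {self.used + nbytes} of {self.limit} bytes"
            )
        self.used += nbytes

    def release(self, nbytes: int) -> None:
        self.used = max(0, self.used - nbytes)

breaker = CircuitBreaker(limit_bytes=1_000_000)
breaker.add_estimate(600_000)        # first aggregation fits
try:
    breaker.add_estimate(500_000)    # second would tip over the limit
    tripped = False
except CircuitBreakerError:
    tripped = True
```

The rejected query gets an error and can be retried; the accepted one keeps its reservation until it releases.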
Targeting Wrong Shards with Routing
Custom routing narrows the shards queried, but if the routing key is unevenly distributed, it creates hotspots. A common mistake is routing by user ID when a small subset of users generates most traffic.
Fix: Choose routing keys with high cardinality and even distribution. Monitor shard sizes per routing value.
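A quick way to vet a candidate routing key is to replay its observed distribution through the shard hash and measure skew; crc32 stands in here for Elasticsearch's real routing hash:

```python
from collections import Counter
import zlib

def shard_skew(routing_keys: list[str], num_shards: int) -> float:
    """Ratio of the busiest shard's traffic to the average.
    1.0 is perfectly even; above ~2.0 suggests a hotspot."""
    counts = Counter(
        zlib.crc32(k.encode("utf-8")) % num_shards for k in routing_keys
    )
    avg = len(routing_keys) / num_shards
    return max(counts.values()) / avg

# One power user dominating traffic skews a user-ID routing key badly.
skewed = ["user_1"] * 900 + [f"user_{i}" for i in range(2, 102)]
even = [f"user_{i}" for i in range(1000)]
```

Run this against a sample of real request routing values before committing to a key; the math is cheap and the hotspot is expensive.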
Over-Allocating Replicas
Replicas multiply storage and indexing overhead. Three shards with six replicas each means seven copies of every document: the primary plus six replicas. For write-heavy workloads, this slows indexing dramatically.
Fix: Start with one replica. Add more only when read scalability demands it.
Quick Recap
Key Bullets
- Hash-based sharding is simple but queries all shards for every request
- Range-based or custom routing targets subsets when access patterns allow
- Replicas scale reads but amplify writes and storage
- The coordinating node broadcasts to all shards and merges results
- Adaptive replica selection routes queries to the least-loaded replica
- Frozen indices trade query speed for memory savings on rarely-accessed data
- Rebalancing is automatic but can degrade performance during heavy load
Copy/Paste Checklist
# Check shard distribution across nodes
GET /_cat/shards?v&h=index,shard,prirep,state,node
# Retry shard allocations that previously failed
POST /_cluster/reroute?retry_failed=true
# Disable rebalancing during maintenance
PUT /_cluster/settings
{
"transient": {
"cluster.routing.rebalance.enable": "none"
}
}
# Update replica count for a specific index
PUT /my-index/_settings
{
"number_of_replicas": 2
}
# Freeze a rarely-used index
POST /my-index/_freeze
# Set custom routing on a document
PUT /my-index/_doc/1?routing=user_123
{
"title": "My Post",
"user_id": "user_123"
}
# Query with custom routing
GET /my-index/_search?routing=user_123
{
"query": { "match": { "title": "search term" } }
}
# Monitor pending tasks
GET /_cluster/pending_tasks
# Check circuit breaker status
GET /_nodes/stats/breaker
Connecting to Related Concepts
Scaling search systems often involves integrating with other infrastructure components. For event-driven indexing pipelines that feed search engines, see event-driven architecture. For caching strategies that complement search replication, see distributed caching.
Search also frequently sits alongside messaging systems. If you are processing streams of documents to index, Apache Kafka or RabbitMQ handle that pattern well.
Conclusion
Three things matter most for search scaling: sharding, replication, and routing. Hash-based sharding is simple but queries all shards. Range-based or custom routing targets subsets when your access patterns allow. Replication scales reads at the cost of writes. The coordinating node handles the scatter-gather that makes distributed search work.
The right configuration depends on your query patterns. If most queries target recent data, time-based sharding with more replicas on recent shards makes sense. If queries span your entire dataset, hash-based sharding with many replicas handles the load.
Measure your query latency distribution before optimizing. It is easy to over-engineer sharding for workloads that do not need it.
For more on search infrastructure, see our deep dives on Elasticsearch and Apache Solr.