Search Scaling: Sharding, Routing, and Horizontal Growth

Learn how to scale search systems horizontally with index sharding strategies, query routing, replication patterns, and cluster management techniques.


Scaling a search system is different from scaling a regular database. Search queries are CPU-intensive, latency-sensitive, and often return large result sets. A single node can only handle so much before query times degrade. At that point, you need to distribute the load across multiple nodes while maintaining fast queries and consistent results.

This post covers the key techniques for scaling search: sharding strategies, query routing, replication for read scalability, and cluster management.

Understanding Search Workload Characteristics

Before diving into scaling techniques, it helps to understand what makes search workloads different.

Search queries are CPU-bound for relevance scoring and aggregation. A query that scores a million documents uses significantly more CPU than a simple key-value lookup. Memory matters too, but for inverted indexes, the bottleneck is often CPU during the scoring phase.

Search is also partition-sensitive. A query for “recent blog posts” might hit different shards depending on how data is distributed. If your sharding strategy does not match your query patterns, you end up hitting all shards for every query, defeating the purpose of distribution.

Sharding Strategies

Sharding splits your index into smaller pieces that can be distributed across nodes. Choosing the right sharding strategy is the most important decision in search scaling.

Hash-Based Sharding

The default approach in Elasticsearch: hash the routing value, which defaults to the document ID, and take it modulo the primary shard count.

shard = hash(_routing) % num_primary_shards

This distributes documents evenly but has a downside: a query that cannot supply a routing value must be sent to every shard. You cannot target a specific shard.

// Every query hits all shards - no optimization possible
GET /my-index/_search
{
  "query": { "match": { "content": "search term" } }
}
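The distribution property is easy to sandbox. A minimal Python sketch, using md5 as a stand-in for Elasticsearch's murmur3 routing hash (the names here are illustrative):

```python
import hashlib
from collections import Counter

NUM_PRIMARY_SHARDS = 3  # fixed at index creation time

def shard_for(document_id: str) -> int:
    """Map a document ID to a shard, mimicking hash(id) % num_shards."""
    digest = hashlib.md5(document_id.encode()).hexdigest()
    return int(digest, 16) % NUM_PRIMARY_SHARDS

# Hashing spreads documents close to evenly across the shards.
counts = Counter(shard_for(f"doc-{i}") for i in range(9000))
```

With 9,000 synthetic IDs each shard ends up holding roughly a third of the documents. The flip side is exactly the downside above: any ID can hash anywhere, so no shard can be skipped at query time.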

Range-Based Sharding

Shard by a field value, like date or category. Within a single index Elasticsearch always hashes the routing value, so range sharding is typically implemented as one index per range, most often per time period:

PUT /posts-2024-q4
{
  "settings": {
    "number_of_shards": 1
  }
}

A query for recent posts then targets only the newest index:

GET /posts-2024-q4/_search
{
  "query": { "range": { "publish_date": { "gte": "now-7d/d" } } }
}

Range sharding lets you target subsets of data. A query for “last week’s posts” might only hit the relevant time-based shards. The tradeoff is hot spots: if most queries hit recent data, those shards become overloaded.

graph LR
    subgraph TimeBasedShards
        Shard1[2024-Q1]
        Shard2[2024-Q2]
        Shard3[2024-Q3]
        Shard4[2024-Q4]
    end

    Query1["query: recent posts"] --> Shard4
    Query2["query: old posts"] --> Shard1
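A small Python sketch of the routing logic implied by the diagram, assuming one shard per quarter (the shard names and helper functions are invented for illustration):

```python
from datetime import date

# One shard per quarter, mirroring the diagram above.
SHARD_NAMES = ["2024-Q1", "2024-Q2", "2024-Q3", "2024-Q4"]

def shard_key(d: date) -> str:
    """Quarter label for a date, e.g. 2024-11-05 -> '2024-Q4'."""
    return f"{d.year}-Q{(d.month - 1) // 3 + 1}"

def shards_for_range(start: date, end: date) -> list[str]:
    """Only shards whose quarter overlaps [start, end] need to be queried."""
    return [s for s in SHARD_NAMES if shard_key(start) <= s <= shard_key(end)]
```

A query for October through December touches a single shard; only a full-year scan fans out to all four.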

Custom Routing

Elasticsearch allows custom routing to target specific shards:

PUT /my-index/_doc/1?routing=category:tutorials
{
  "title": "Getting Started",
  "category": "tutorials"
}

Queries can then target specific routing values:

GET /my-index/_search?routing=category:tutorials
{
  "query": { "match": { "title": "search" } }
}

This reduces the number of shards queried from N to 1, cutting latency significantly for targeted queries.

Query Routing

Once you have multiple shards, query routing determines which shards get queried and how results are merged.

Coordinating Node

When a client sends a search request to Elasticsearch, any node can receive it. That node becomes the coordinating node for that request. It fans the query out to one copy of each relevant shard, collects the results, and merges them.

graph TD
    Client --> Coord[Coordinator Node]
    Coord -->|broadcast| Shard1[Shard 1]
    Coord -->|broadcast| Shard2[Shard 2]
    Coord -->|broadcast| Shard3[Shard 3]
    Shard1 -->|top 10| Coord
    Shard2 -->|top 10| Coord
    Shard3 -->|top 10| Coord
    Coord -->|merged top 10| Client

The coordinator does not do the full search on each shard. Shards return their top N results, and the coordinator merges and re-ranks. This is efficient as long as N is small relative to the total matches.
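The scatter-gather merge can be sketched with a heap. Each shard returns its own top k by score, and because each of the global best k hits must appear in its home shard's top k, merging those short lists reproduces the exact global ranking (the shard data below is made up):

```python
import heapq

Hit = tuple[str, float]  # (doc_id, relevance score)

def shard_top_k(hits: list[Hit], k: int) -> list[Hit]:
    """Each shard independently returns its k best hits."""
    return heapq.nlargest(k, hits, key=lambda h: h[1])

def coordinate(shard_results: list[list[Hit]], k: int) -> list[Hit]:
    """The coordinating node merges per-shard top-k lists into a global top k."""
    merged = [hit for result in shard_results for hit in result]
    return heapq.nlargest(k, merged, key=lambda h: h[1])

shards = [
    [("a", 9.1), ("b", 3.2), ("c", 1.0)],
    [("d", 8.7), ("e", 7.5)],
    [("f", 2.2), ("g", 6.4)],
]
top2 = coordinate([shard_top_k(s, 2) for s in shards], k=2)  # [("a", 9.1), ("d", 8.7)]
```

Note the network cost: for k = 10 and three shards, at most 30 small entries cross the wire, regardless of how many million documents matched.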

Adaptive Replica Selection

With adaptive replica selection, enabled by default, the coordinating node ranks the copies of each shard by their recent response times, search service times, and queue depths, then sends the query to the copy it expects to answer fastest. This automatically steers traffic away from slow or overloaded nodes.

You can control replica selection:

GET /my-index/_search?preference=_only_local

The _only_local preference restricts the search to shard copies allocated on the node that received the request. This is useful when you know the data is local and want to avoid cross-node traffic.

Replication for Read Scalability

Replication adds copies of your data to handle read traffic. Each replica is a full copy of the primary shard, capable of serving searches.

Scaling Reads with Replicas

Adding replicas scales read throughput roughly linearly. With 3 primary shards and 2 replicas, you have 9 total shard copies. Each search still touches one copy of every primary shard, but under heavy read load the coordinating node spreads successive requests across the three copies of each, roughly tripling aggregate throughput.

PUT /my-index/_settings
{
  "number_of_replicas": 2
}

The tradeoff is write amplification: every write must be indexed on the primary and all replicas. More replicas mean slower writes.
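The arithmetic behind that tradeoff is worth spelling out. A tiny helper (illustrative, not an Elasticsearch API):

```python
def cluster_copies(primaries: int, replicas_per_primary: int) -> tuple[int, int]:
    """Return (total shard copies, index operations per document write).

    Each document lives on one primary and on every replica of that
    primary, so write amplification is 1 + replicas_per_primary.
    """
    total_copies = primaries * (1 + replicas_per_primary)
    writes_per_doc = 1 + replicas_per_primary
    return total_copies, writes_per_doc

# The example above: 3 primaries with 2 replicas each
# -> 9 shard copies, 3 index operations per document.
```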

Consistency Considerations

With replication, two identical searches may return different results if they hit different copies, mainly because each copy refreshes on its own schedule. Note that there is no search-time consistency parameter: the old consistency=quorum option applied to writes, and has since been replaced by wait_for_active_shards:

PUT /my-index/_doc/1?wait_for_active_shards=2
{
  "title": "My Post"
}

This makes the write wait until at least two shard copies are active before proceeding. For stable pagination, pass the same preference value (for example, a session ID) on every page so the same copies serve the whole result set.

Cluster Management

Managing a multi-node search cluster requires planning for failure, rebalancing, and capacity.

Node Failure Handling

When a node fails, Elasticsearch automatically promotes replicas to primaries. This is automatic but can cause a brief period of reduced capacity while promotion completes.

For critical workloads, keep at least 2 replicas of every shard. Single replicas create single points of failure.

Rebalancing

When you add a new node, Elasticsearch rebalances shards across the cluster automatically. This involves moving shard copies from overloaded nodes to underloaded ones.

Rebalancing can impact query performance. For sensitive workloads, throttle recovery bandwidth (indices.recovery.max_bytes_per_sec) or temporarily disable rebalancing during maintenance windows.

Frozen Indices

Elasticsearch 6.6 introduced frozen indices: indices explicitly marked as frozen, consuming less memory because their shard data structures are not kept on the heap. Queries against frozen indices are slower but functional. (From 7.14 onward the freeze API is deprecated in favor of the frozen data tier backed by searchable snapshots.)

POST /my-index/_freeze
POST /my-index/_unfreeze

If you have time-based data that is rarely queried (like logs from two years ago), freezing can save significant memory.

When to Use / When Not to Use

When to Use Search Scaling Techniques

  • Data volume exceeds single-node capacity (typically > 500GB per node for search workloads)
  • Query latency degrades under load despite indexing optimization
  • Read throughput requirements exceed single-node capabilities
  • High availability is mandatory — zero tolerance for single points of failure
  • Multi-datacenter deployments requiring geographic distribution of queries

When Not to Use Search Scaling Techniques

  • Small to medium datasets (< 100GB) where a single well-tuned node suffices
  • Low query concurrency — a single node handling < 100 QPS rarely needs sharding
  • Simple key-value lookups that do not benefit from distributed search
  • Write-heavy workloads with simple queries — replication does not scale writes
  • Cost-constrained projects where operational complexity of multi-node clusters outweighs benefits

Production Failure Scenarios

| Failure | Impact | Mitigation |
| --- | --- | --- |
| Shard hotspot due to skewed routing | Some nodes overloaded while others idle | Use routing keys that distribute evenly; monitor shard sizes |
| Coordinating node bottleneck | Search requests fail or queue when the coordinating tier is overwhelmed | Use dedicated coordinating nodes; do not route client traffic straight to data nodes |
| Network partition causing split-brain | Conflicting primaries accept divergent writes | Configure a proper master quorum via discovery settings; monitor cluster state |
| Replica lag during heavy indexing | Stale results returned from lagging replicas | Monitor replica health; throttle indexing rate during peak read hours |
| Shard relocation timeout | Rebalancing stalls mid-transfer, capacity temporarily reduced | Raise recovery bandwidth (indices.recovery.max_bytes_per_sec); avoid rebalancing during high load |
| Memory pressure from aggregations | Facet/aggregation queries cause OOM | Set search.max_buckets limits; pre-compute aggregations with background jobs |

Observability Checklist

Metrics to Monitor

{
  "cluster_level": {
    "cluster_health": "green/yellow/red",
    "unassigned_shards": "should be 0 in healthy cluster",
    "number_of_pending_tasks": "< 50 typically",
    "shard_balance_variance": "< 20% across nodes"
  },
  "per_node": {
    "heap_used_percent": "< 85% sustained",
    "cpu_iowait_percent": "< 20% sustained",
    "search_latency_p99_per_node": "< 500ms",
    "indexing_latency_p99": "< 1s per batch"
  },
  "per_shard": {
    "query_count_per_shard": "high variance indicates hotspot",
    "doc_count_per_shard": "rebalance if one shard > 2x average",
    "segment_count": "< 50; force merge if higher"
  }
}

Key Logs to Capture

graph LR
    subgraph LogSources
        ES[Elasticsearch Logs] --> Cluster[Cluster Events]
        ES --> Shard[Shard Allocation]
        ES --> Search[Slow Search]
        ES --> Index[Indexing Operations]
    end

    Cluster --> |node join/leave| Monitor[Monitoring]
    Shard --> |allocation fail| Monitor
    Search --> |query > threshold| Monitor
    Index --> |bulk reject| Monitor

  • Cluster events: node additions/removals, master elections, cluster state transitions
  • Shard allocation logs: failed allocations, delayed decisions, recovery progress
  • Slow search logs: queries exceeding search.slowlog.threshold
  • Indexing logs: bulk request rejections, mapping conflicts, refresh intervals

# Enable allocation and recovery logging
PUT /_cluster/settings
{
  "transient": {
    "logger.org.elasticsearch.cluster.routing.allocation": "DEBUG",
    "logger.org.elasticsearch.indices.recovery": "INFO"
  }
}

Alerts to Configure

| Alert | Condition | Severity |
| --- | --- | --- |
| Cluster health yellow | status == yellow for > 5 min | Warning |
| Cluster health red | status == red | Critical |
| Shard imbalance | shard_count_max - shard_count_min > 5 | Warning |
| Coordinating node queue | search_queue > 100 | Warning |
| Recovery stalled | recovering_shards > 0 for > 10 min | Warning |
| Disk watermark low | disk.low watermark exceeded | Critical |

Security Checklist

  • Network-level isolation — place search cluster on internal network; never expose directly to internet
  • Node authentication — use TLS certificates for inter-node communication
  • Role-based access — define read-only, read-write, and admin roles; apply per index
  • Index-level privileges — restrict write access to pipelines and ingestion services only
  • Audit logging — enable security audit logs for all write operations and access attempts
  • Field-level security — grant access only to specific fields when some fields must be hidden from certain users
  • Query restrictions — disable expensive aggregations for public-facing endpoints (max_buckets)
  • API key rotation — rotate service account keys quarterly; use secret management system
  • Input validation — sanitize query strings to prevent injection via special characters

# Example: Create a read-only role for production traffic
POST /_security/role/production_reader
{
  "indices": [
    {
      "names": ["production-*"],
      "privileges": ["read", "view_index_metadata"],
      "field_security": {
        "grant": ["title", "content", "publish_date", "author"]
      }
    }
  ]
}

Common Pitfalls / Anti-Patterns

Premature Sharding

Sharding adds complexity without benefits for small datasets. A single shard performs scatter-gather across segments within that shard anyway. If your total index size is under 50GB and latency is acceptable, do not shard.

Rule: Shard when a single node cannot handle the workload or data volume.

Ignoring Rebalancing Impact

Adding a new node triggers automatic rebalancing. During rebalance, the cluster moves shard copies, consuming network bandwidth and CPU. This degrades query performance for running requests.

Fix: Disable automatic rebalancing during maintenance windows:

PUT /_cluster/settings
{
  "transient": {
    "cluster.routing.rebalance.enable": "none"
  }
}

Not Using Circuit Breakers

Complex queries with deep aggregations or large page sizes can cause OOM on coordinating nodes. Without circuit breakers, one bad query can crash an entire cluster.

Fix: Configure and monitor circuit breakers:

{
  "indices.breaker.total.limit": "70%",
  "indices.breaker.request.limit": "40%",
  "indices.breaker.fielddata.limit": "60%"
}
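Conceptually, a circuit breaker is an up-front reservation against a fixed memory budget: estimate a request's cost before allocating, and reject it cleanly instead of letting it OOM the node. A simplified Python sketch of the idea (not Elasticsearch's actual implementation):

```python
class CircuitBreakerError(Exception):
    """Raised instead of allowing an allocation that would blow the budget."""

class RequestBreaker:
    """Track estimated memory for in-flight requests against a hard limit."""

    def __init__(self, limit_bytes: int):
        self.limit = limit_bytes
        self.used = 0

    def reserve(self, estimated_bytes: int) -> None:
        # Reject before allocating, not after the heap is already exhausted.
        if self.used + estimated_bytes > self.limit:
            raise CircuitBreakerError(
                f"request needs {estimated_bytes} B, "
                f"{self.limit - self.used} B left in budget"
            )
        self.used += estimated_bytes

    def release(self, estimated_bytes: int) -> None:
        # Return the reservation when the request completes.
        self.used = max(0, self.used - estimated_bytes)
```

One expensive aggregation then gets a CircuitBreakerError instead of taking the whole coordinating node down with it.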

Targeting Wrong Shards with Routing

Custom routing narrows the shards queried, but if the routing key is unevenly distributed, it creates hotspots. A common mistake is routing by user ID when a small subset of users generates most traffic.

Fix: Choose routing keys with high cardinality and even distribution. Monitor shard sizes per routing value.
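Skew can be checked offline by replaying your real key distribution through the routing function and comparing per-shard counts. A Python sketch with invented traffic:

```python
import hashlib
from collections import Counter

NUM_SHARDS = 4

def route(key: str) -> int:
    """Stand-in for the cluster's routing hash."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % NUM_SHARDS

# Skewed traffic: one "whale" user produces 90% of the documents.
traffic = ["user-1"] * 900 + [f"user-{i}" for i in range(2, 102)]

per_shard = Counter(route(key) for key in traffic)
average = len(traffic) / NUM_SHARDS
hot_shards = [s for s, n in per_shard.items() if n > 2 * average]
```

Here user-1's shard alone holds most of the index, so routing by user ID would concentrate that user's query and write load on a fraction of the cluster.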

Over-Allocating Replicas

Replicas multiply storage and indexing overhead. Six replicas of three primary shards means seven copies of every document (one primary plus six replicas) and 21 shard copies in total. For write-heavy workloads, this slows indexing dramatically.

Fix: Start with one replica. Add more only when read scalability demands it.

Quick Recap

Key Bullets

  • Hash-based sharding is simple but queries all shards for every request
  • Range-based or custom routing targets subsets when access patterns allow
  • Replicas scale reads but amplify writes and storage
  • The coordination node broadcasts to all shards and merges results
  • Adaptive replica selection routes queries to the least-loaded replica
  • Frozen indices trade query speed for memory savings on rarely-accessed data
  • Rebalancing is automatic but can degrade performance during heavy load

Copy/Paste Checklist

# Check shard distribution across nodes
GET /_cat/shards?v&h=index,shard,prirep,state,node

# Retry shard allocations that previously failed
POST /_cluster/reroute?retry_failed=true

# Disable rebalancing during maintenance
PUT /_cluster/settings
{
  "transient": {
    "cluster.routing.rebalance.enable": "none"
  }
}

# Update replica count for a specific index
PUT /my-index/_settings
{
  "number_of_replicas": 2
}

# Freeze a rarely-used index
POST /my-index/_freeze

# Set custom routing on a document
PUT /my-index/_doc/1?routing=user_123
{
  "title": "My Post",
  "user_id": "user_123"
}

# Query with custom routing
GET /my-index/_search?routing=user_123
{
  "query": { "match": { "title": "search term" } }
}

# Monitor pending tasks
GET /_cluster/pending_tasks

# Check circuit breaker status
GET /_nodes/stats/breaker

Scaling search systems often involves integrating with other infrastructure components. For event-driven indexing pipelines that feed search engines, see event-driven architecture. For caching strategies that complement search replication, see distributed caching.

Search also frequently sits alongside messaging systems. If you are processing streams of documents to index, Apache Kafka or RabbitMQ handle that pattern well.

Conclusion

Three things matter most for search scaling: sharding, replication, and routing. Hash-based sharding is simple but queries all shards. Range-based or custom routing targets subsets when your access patterns allow. Replication scales reads at the cost of writes. The coordination node handles the scatter-gather that makes distributed search work.

The right configuration depends on your query patterns. If most queries target recent data, time-based sharding with more replicas on recent shards makes sense. If queries span your entire dataset, hash-based sharding with many replicas handles the load.

Measure your query latency distribution before optimizing. It is easy to over-engineer sharding for workloads that do not need it.

For more on search infrastructure, see our deep dives on Elasticsearch and Apache Solr.

Related Posts

Database Capacity Planning: A Practical Guide

Plan for growth before you hit walls. This guide covers growth forecasting, compute and storage sizing, IOPS requirements, and cloud vs on-prem decisions.

#database #capacity-planning #infrastructure

Read/Write Splitting

Master-slave replication for read scaling. Routing strategies, consistency lag considerations, and when this pattern helps vs hurts your architecture.

#database #scaling #replication

Database Scaling: Vertical, Horizontal, and Read Replicas

Learn strategies for scaling databases beyond a single instance: vertical scaling, read replicas, write scaling, and when to choose each approach.

#databases #scaling #performance