Load Balancing Algorithms: Round Robin, Least Connections, and Beyond
Explore load balancing algorithms used in microservices including round robin, least connections, weighted, IP hash, and adaptive algorithms.
Load Balancing Algorithms: Round Robin, Least Connections, and Beyond
Every request hitting your microservices deployment faces the same fundamental question: which backend service instance handles it? Someone has to make that call, and that someone is the load balancing algorithm. Get it right and your system hums along even under heavy traffic. Get it wrong and you will be debugging why one server is on fire while others sit idle.
Microservices complicate this decision. You might have dozens of instances spread across availability zones, each with different capacities, varying response times, and potentially different operational states. The algorithm has to navigate all of that while keeping response times low and routing around failures automatically.
This article walks through load balancing approaches from basic round robin to adaptive algorithms that watch real-time server health and adjust accordingly.
Introduction
In a monolith, scaling means copying the same application. Load balancing is simple: distribute requests across identical instances.
Microservices shift the picture. Each service runs multiple instances with different capacities. A payment service making synchronous database calls behaves completely differently from a caching service returning data from memory. A recommendation service might take 500ms while an inventory check finishes in 20ms. The load balancer has to account for all of this variation.
Beyond simple distribution, load balancers in microservices handle service discovery, health checking, circuit breaker integration, metrics collection, and SSL termination. The algorithm you choose affects all of these. Route based on real-time load and your circuit breakers stay quiet. Route poorly and circuit breakers work overtime protecting overloaded servers.
Static Load Balancing Algorithms
Static Algorithms
Static algorithms make routing decisions without considering current system state. They follow predetermined rules configured beforehand. The advantages are real: no state tracking overhead, predictable behavior, and straightforward debugging.
Round Robin
Round robin cycles through servers in order: Server 1, Server 2, Server 3, then back to Server 1. Each request goes to the next server in sequence.
No state to maintain. Each decision is independent. This makes it extremely fast and memory-efficient. No tracking connection counts, no calculating server load.
Round robin works when all servers have identical capacity and similar request processing times. Perfect homogeneity rarely exists though. If Server 1 has twice the memory of Server 2, round robin still sends equal traffic to both. Server 1 sits underutilized while Server 2 struggles.
DNS-based load balancing often uses round robin. Each DNS response rotates through available server IPs. Simple, but lacks awareness of server health or current conditions. Fine for some scenarios, but production microservice deployments usually need more sophistication.
graph LR
A[Request 1] --> B[Server 1]
C[Request 2] --> D[Server 2]
E[Request 3] --> F[Server 3]
G[Request 4] --> B
H[Request 5] --> D
Weighted Round Robin
Weighted round robin assigns a weight to each server based on capacity. Servers with higher weights get more traffic proportionally. If Server 1 has weight 3 and Server 2 has weight 1, Server 1 gets three requests for every one that goes to Server 2.
Weights typically reflect server specs: CPU cores, memory size, expected performance. A newer server with more resources handles heavier loads. An older server running background workloads gets lighter traffic.
The catch is keeping weights accurate. A server that suddenly gets busy still receives its configured share of new requests. Weights reflect theoretical capacity, not current load. Regular recalibration becomes necessary as workloads change.
This approach suits heterogeneous server pools with relatively stable load patterns. When capacities shift frequently, static weights become maintenance burdens.
Random
Random routing distributes requests using a random number generator. Counterintuitively, random selection distributes load quite evenly under moderate to high traffic volumes.
With enough traffic, random selection approximates equal distribution naturally. The law of large numbers ensures convergence over time. For very high traffic systems where state management becomes expensive, random offers a simple alternative with no coordination overhead.
Variance is higher under low traffic. One server might get lucky while another receives fewer requests. Over time this evens out, but bursty traffic causes temporary imbalance.
Random works as a baseline algorithm. Some sophisticated approaches use random selection as a fallback or combine it with other methods.
IP Hash
IP hash routes requests based on a hash of the client IP address. The same client IP always routes to the same backend server. This provides session affinity without cookies or tracking mechanisms.
The hash function maps client IPs to servers. A simple modulo of the IP address integer value by server count works, but causes massive redistribution when servers are added or removed. Consistent hashing reduces this reshuffling, keeping most clients with the same server even when the pool changes.
IP hash breaks down when many clients share the same source IP. Users behind corporate proxies or NAT gateways all appear as one IP, routing to the same server and potentially creating hotspots. It also has no awareness of server load, so a busy server still receives its hash-allocated share.
For simple session affinity where clients need to return to the same server, IP hash works. For more control, cookie-based sticky sessions or application-level routing work better.
Dynamic Load Balancing Algorithms
Least Connections
Least connections routes new requests to the server with the fewest active connections. A server processing ten long-running requests might get a new request before a server that just started on an identical request. The algorithm adapts to current load rather than distributing evenly based on configuration.
This works well for workloads with variable request durations. A request holding a database connection for ten seconds should count differently than one returning cached data in milliseconds. Least connections captures these differences through active connection counts.
The algorithm requires tracking active connections for each backend in load balancer memory. This state updates with each request and response. Under very high traffic, the overhead of tracking and comparing connection counts adds up.
Least connections can cause thrashing under certain patterns. If many requests complete simultaneously, multiple new requests all see the same low count and flood the same server before it updates. Using a smoothed average rather than raw counts mitigates this.
Smoothing: Why Raw Counts Mislead
Raw active connection counts treat every connection as equal. In practice, requests vary wildly in how long they hold connections and how much work they do. A server with 5 long-running database queries consuming 5 seconds each differs vastly from a server with 5 fast cache lookups finishing in 5 milliseconds.
Smoothing addresses this by using exponentially weighted moving averages (EWMAs) of connection counts or latency. Rather than counting connections at a single moment, the algorithm tracks the trend. A server whose connection count is increasing rapidly gets lower weight than one whose count is stable or declining.
HAProxy implements this via queue and rate metrics. NGINX Plus tracks latency-weighted connection rates. AWS ALB’s least outstanding requests implicitly smooths by counting both queued requests and active connections together.
The practical impact: under bursty traffic, smoothed least connections prevents the stampede effect where all new requests rush to a server that just finished a batch of long-running requests simultaneously.
Least Response Time
Least response time routes to the server with the lowest combined metric of active connections and average response time. It combines load awareness with performance awareness in a single metric.
The calculation typically weights active connection count against recent response times. A server with fewer connections but much slower responses might not win. A moderately loaded server with fast responses wins.
AWS ALB uses least outstanding requests, focusing on how many requests are waiting versus actively processed. Google Cloud Load Balancing uses a similar model focused on minimizing latency.
This algorithm works well when response times vary significantly between requests and servers. A mix of fast cached responses and slow database queries benefits from response time awareness.
Resource-Based Routing
Resource-based routing makes decisions based on actual server resource utilization. The load balancer queries each server for current CPU, memory, or application-specific metrics before routing.
This requires agents on each server reporting metrics to the load balancer. The overhead of collecting and communicating metrics limits update frequency. The benefit is routing decisions that truly reflect server capacity rather than indirect signals like connection counts.
Some implementations use active reporting where servers push metrics. Others use passive monitoring where the load balancer tracks response times as a proxy for load. Active reporting is more accurate but adds complexity and network overhead.
Resource-based routing suits environments where server capacity varies significantly or where you want fine-grained control based on actual resource consumption.
Adaptive Algorithms
Adaptive algorithms go beyond simple metrics to make predictive routing decisions. They might watch trends in response time changes, error rates, or capacity utilization and route traffic before problems occur.
These algorithms often use machine learning to identify patterns. A server showing increasing response times might have traffic shifted away before it becomes critical. Error rate spikes trigger preemptive routing away from failing instances.
The complexity of adaptive algorithms makes them harder to debug and predict. The benefit is handling edge cases that rule-based algorithms miss. Production deployments often layer adaptive algorithms on top of simpler fallbacks.
Session Persistence & Consistent Hashing
Session Persistence and Sticky Sessions
Session persistence routes a particular user’s requests to the same backend server. Without it, a user who logs in on Server 1 might get routed to Server 2 on their next request, which has no memory of their session.
Sticky sessions create problems though. They complicate maintenance windows since taking down a server disconnects active users. They make horizontal scaling harder because load cannot be freely redistributed. A server getting stuck with long-running sessions might accumulate disproportionate load.
Sticky sessions matter most for applications that store session state locally rather than in distributed caches. Shopping carts, multi-step form wizard state, in-memory computation results might rely on server affinity. Most modern applications store session state externally in Redis or similar, reducing the need for sticky sessions.
When you do need sticky sessions, cookie-based affinity works better than IP hash. Cookies give more control and work correctly even when clients switch networks or share IPs. The load balancer reads a cookie to determine the target server.
Cookie-based sticky sessions insert a tracking cookie set by the load balancer. The first request gets routed normally, and the load balancer sets a cookie identifying the assigned server. Subsequent requests include the cookie, and the load balancer reads it to maintain affinity.
Consistent Hashing
The algorithms covered so far have a problem when servers join or leave the pool. Round robin, least connections, and even IP hash all redistribute traffic across every server in the pool. Adding one new server means every existing server potentially loses traffic. Removing one server means some traffic has nowhere to go.
Consistent hashing solves this. Instead of mapping clients to servers directly, consistent hashing maps both onto a hash ring. Each server gets a position on the ring based on a hash of its identifier. Each client gets a position based on a hash of its identifier. A client routes to the nearest server clockwise on the ring.
When a server joins, it claims a position on the ring and only takes over traffic from the few clients whose positions fall between it and its predecessor. When a server leaves, its traffic redistributes only to its successor. Most clients keep their same server assignment.
graph TD
subgraph "Hash Ring"
direction TB
C1[Client A<br/>hash=50] --> S2["Server 2<br/>hash=180"]
C2[Client B<br/>hash=90] --> S2
C3[Client C<br/>hash=270] --> S1["Server 1<br/>hash=30"]
C4[Client D<br/>hash=350] --> S3["Server 3<br/>hash=330"]
end
S1 -->|Next CW| S2
S2 -->|Next CW| S3
S3 -->|Next CW| S1
Virtual Nodes
Basic consistent hashing still causes uneven distribution. A server with fewer hash positions gets fewer clients. Virtual nodes fix this by giving each physical server multiple positions on the ring.
Instead of hashing the server ID once, you hash it with a suffix: server1:1, server1:2, server1:3. Each virtual node gets its own spot on the ring. A client looking up its position finds the nearest virtual node clockwise and routes to the physical server that owns it.
With enough virtual nodes, the distribution becomes statistically even. Most implementations use 150-200 virtual nodes per physical server, which provides good balance without excessive memory usage.
The tradeoff is lookup complexity. Finding the nearest virtual node on a ring requires sorted arrays with binary search or specialized data structures like a skip list. Memcached, DynamoDB, and Cassandra all use consistent hashing with virtual nodes for data distribution.
Consistent Hashing Trade-offs
| Factor | Standard CH | Virtual Nodes |
|---|---|---|
| Distribution uniformity | Uneven for small clusters | Statistically even |
| Memory overhead | Low | Higher (per-node entries) |
| Rehashing on add/remove | Minimal | Minimal |
| Implementation complexity | Medium | Higher |
For load balancers handling hundreds or thousands of backend servers, virtual nodes matter less because the law of large numbers already provides even distribution. For databases and distributed caches with fewer nodes, virtual nodes prevent hot spots.
Advanced Load Balancing
Power of Two Choices
Random and round robin both make decisions without considering current load. Under high traffic, statistical averaging keeps distribution reasonably even. But under moderate load, one unlucky server can accumulate more requests than its fair share — creating hot spots while others sit idle.
The power of two choices flips this. Instead of picking one server at random, pick two and choose the better one. When you compare two random servers and send traffic to the one with fewer active connections, the worst-case load drops dramatically, staying close to the average rather than spiking. This sounds almost too simple to work, but the math holds up.
The technique comes from a 1993 paper by Azar, Broder, Karlin, and Upfal. Google’s Maglev load balancer uses it. Envoy’s weighted load balancing builds on the same principle. Facebook’s infrastructure handles it at scale in their internal networking stack.
The algorithm works like this: when a request arrives, the load balancer randomly selects two backend servers from the healthy pool. It compares their current load and routes to the one with less. If both have equal load, pick either.
import random
def pick_two_choices(backends):
"""Pick two random backends, return the one with fewer connections."""
candidates = random.sample(backends, 2)
# Each backend tracks its own active connection count
if candidates[0].active_connections < candidates[1].active_connections:
return candidates[0]
return candidates[1]
The power of two choices needs minimal state — just active connection counts per backend. No historical data, no smoothing, and it handles heterogeneous servers reasonably well when combined with weighted variants.
For systems where temporary imbalance causes real problems, two choices beats random. For high-volume homogeneous clusters where statistical averaging kicks in quickly, simple random or round robin work fine with less implementation overhead.
Circuit Breaker Integration
Load balancing algorithms and circuit breakers work together. The load balancer distributes traffic, but when a service starts failing, the circuit breaker stops traffic to failing instances.
Poor load balancing forces circuit breakers to work harder. A server running hot with CPU maxed out receives requests that timeout, triggering circuit breaker opens for that instance. Better load balancing would have spread work more evenly, keeping the server from becoming overloaded in the first place.
Some load balancers integrate circuit breaking directly. When error rates exceed thresholds for a particular backend, the load balancer stops routing traffic. This happens without needing separate circuit breaker libraries in your application code.
The interaction between load balancing and circuit breaking matters most during recovery. When a circuit breaker closes and traffic resumes, the load balancer should ease traffic back gradually rather than flooding the recovering service.
Client-Side vs Server-Side Load Balancing
Traditional load balancing happens server-side: a dedicated load balancer sits between clients and servers, making routing decisions for all incoming traffic.
Client-side load balancing puts the routing logic in the client. The client maintains a list of available servers and picks which one to call. Netflix’s Ribbon library is an example of client-side load balancing for JVM applications.
Client-side balancing removes the load balancer as a single point of failure. The client directly picks a server, reducing network hops. The tradeoff is that server list management becomes the client’s responsibility. When servers scale up or down, clients need to know.
Service discovery integrates with both approaches. Server-side load balancers often query service registries directly. Client-side load balancers typically receive server lists from service discovery and cache them locally.
graph TD
subgraph "Server-Side Load Balancing"
Client1[Client] --> LB[Load Balancer]
LB --> S1[Server 1]
LB --> S2[Server 2]
LB --> S3[Server 3]
end
subgraph "Client-Side Load Balancing"
Client2[Client] --> CL[Client Library]
CL --> S4[Server 1]
CL --> S5[Server 2]
CL --> S6[Server 3]
end
Server-side load balancing works well when you want centralized control, easier configuration updates, and built-in infrastructure like health checking and circuit breaking. Client-side load balancing suits environments where you want to eliminate the load balancer hop and reduce infrastructure dependencies.
Real-World Implementation Examples
Examples from Real Systems
Tool-specific implementations demonstrate how each platform approaches load balancing.
NGINX
NGINX supports multiple load balancing algorithms in its upstream configuration:
upstream backend {
least_conn; # Least connections algorithm
server 192.168.1.10:8080 weight=3;
server 192.168.1.11:8080 weight=1;
server 192.168.1.12:8080 down; # Marked as down
}
NGINX Plus adds least time and session persistence features. The free version provides round robin, least connections, and IP hash.
HAProxy
HAProxy offers sophisticated load balancing with clear configuration syntax:
backend servers
balance roundrobin
balance leastconn
balance source
server s1 192.168.1.10:8080 check inter 2000 fall 3
server s2 192.168.1.11:8080 check inter 2000 fall 3
server s3 192.168.1.12:8080 check inter 2000 fall 3
HAProxy’s source balance algorithm implements IP hash-like functionality. The check keyword enables health monitoring with configurable intervals and failure thresholds.
AWS ALB
AWS Application Load Balancer provides three routing algorithms:
- Round Robin - Default, cycles through targets in the target group
- Least Outstanding Requests - Routes to the target with the fewest pending requests
- Flow Hash - Routes based on the tuple of protocol, source IP, destination IP, source port, destination port, and TCP sequence number
ALB integrates with Auto Scaling Groups, automatically distributing traffic across healthy instances as they scale.
Algorithm Comparison
| Algorithm | State Required | Adapts to Load | Session Affinity | Complexity | Best For |
|---|---|---|---|---|---|
| Round Robin | None | No | No | Low | Homogeneous servers, simple deployments |
| Weighted Round Robin | Server weights | No | No | Low | Heterogeneous servers with stable load |
| Random | None | No | No | Low | High traffic where simplicity matters |
| IP Hash | None | No | Yes | Low | Session affinity without cookies |
| Least Connections | Active connections | Yes | No | Medium | Variable request durations |
| Least Response Time | Connections + latency | Yes | No | Medium | Latency-sensitive applications |
| Resource-Based | Resource metrics | Yes | No | High | Fine-grained capacity routing |
| Adaptive | Multiple metrics | Yes | No | High | Complex deployments with trends |
Choosing the Right Algorithm
Algorithm selection depends on your workload characteristics and infrastructure. Here is what to think about:
Server homogeneity: If all servers have identical capacity and similar performance, round robin works fine. If servers vary significantly, use weighted variants.
Request characteristics: Do requests take roughly the same time, or do they vary widely? Long-running requests benefit from least connections. Fast, consistent requests work fine with round robin.
Session requirements: Do users need to return to the same server? Cookie-based sticky sessions or IP hash handle this. External session storage eliminates the need entirely.
Latency sensitivity: Are response times critical? Least response time or latency-based routing helps. Background tasks work fine with simple round robin.
Operational complexity: Sophisticated algorithms require more monitoring and tuning. Start simple and add complexity only when measurements show it is needed.
For most web applications, least connections or weighted round robin hits a good balance — these handle heterogeneous servers reasonably well and adapt to varying load without excessive complexity
Advanced Algorithm Implementations
Weighted Least Connections Implementation
Beyond simple least connections, production systems often implement weighted variants:
class WeightedLeastConnections:
def __init__(self, instances):
self.instances = instances
def effective_load(self, instance):
"""Calculate effective load accounting for weights and current connections."""
weight = instance.get('weight', 1)
connections = instance.get('active_connections', 0)
return connections / weight
def select(self):
return min(self.instances, key=lambda i: self.effective_load(i))
Adaptive Load Balancing with ML
Modern load balancers like Envoy support adaptive routing based on real-time metrics:
# Envoy least_request with adaptive routing
clusters:
- name: my_service
type: EDS
lb_policy: LEAST_REQUEST
least_request_lb_config:
choice_count: 2 # Power of two choices
outlier_detection:
consecutive_5xx: 5
interval: 10s
Consistent Hashing with Bounded Load
Some implementations add bounds to prevent any server from being overloaded during rehashing:
import hashlib
class BoundedConsistentHash:
def __init__(self, instances, virtual_nodes=150):
self.ring = {}
self.sorted_keys = []
self.virtual_nodes = virtual_nodes
for instance in instances:
self._add_instance(instance)
def _add_instance(self, instance):
for i in range(self.virtual_nodes):
key = hashlib.md5(f"{instance['id']}:{i}".encode()).digest()
self.ring[key] = instance
self.sorted_keys.append(key)
self.sorted_keys.sort()
def get(self, client_id):
"""Get server for client, bounded to prevent overload during changes."""
key = hashlib.md5(client_id.encode()).digest()
# Find first server clockwise on ring
for k in self.sorted_keys:
if key <= k:
return self.ring[k]
return self.ring[self.sorted_keys[0]]
Failure Scenarios
The Thundering Herd Problem
When a popular service restarts, all clients retry simultaneously. A cache miss triggers multiple backend requests. An hour-long batch job releases workers simultaneously. This “thundering herd” overwhelms servers even though the total request volume hasn’t increased.
Load balancing helps but doesn’t solve thundering herd on its own. Client-side retries with jitter prevent synchronized retries. Server-side request coalescing (deduplicating concurrent requests for the same resource) reduces duplicate work. Token bucket rate limiting prevents request storms from reaching backends.
Zone Failures and Geographic Distribution
When an entire availability zone fails, naive load balancers continue routing traffic to dead instances until health checks detect the failure. With 100 servers across 3 zones and zone C goes dark, you lose 33% of capacity instantly. If health check intervals are 10 seconds, you’re sending 33% of traffic to nowhere for up to 10 seconds.
Multi-zone-aware load balancing tracks zone membership and immediately removes all instances in a failing zone from the healthy pool. Some implementations also consider geographic latency — routing to the closest healthy zone even under normal operation.
The Sticky Session Overload Cascade
Session affinity helps individual users but hurts overall distribution. When a server handling many sticky sessions gets overloaded, those users can’t redistribute. The overloaded server fails faster, triggering circuit breakers that isolate it. Users reconnect and get re-assigned to other servers, potentially overloading those in turn.
This cascade happens when sticky sessions combine with uneven load. Mitigations: external session storage (Redis) so servers can be replaced without user impact; gradual traffic shifting during maintenance; monitoring session distribution per server.
The Cold Start Stampede
Adding new servers to a cluster creates a cold start problem. New instances have empty caches, no JIT compilation, and cold database connections. Clients route to new instances expecting warm behavior and timeout, retry, or queue up.
Newly added servers receive requests via consistent hashing or weighted round robin but can’t handle the load they receive. Overwhelmed, they fail health checks and get removed. The cycle repeats. Mitigation: warm-up period where new servers receive minimal traffic, gradually increasing as they warm up.
Cascade & Rebalancing Failures
The Reweighting Rebalance Event
Manually changing server weights (for maintenance, upgrades, or capacity rebalancing) causes immediate redistribution. A weight change from 5 to 1 for an old server sends its share to others. If those others were already near capacity, they overload.
Automated weight adjustments that account for current load help. Adding capacity before removing weight prevents sudden shifts. Monitoring backend load during reweighting events catches problems early.
The Circuit Breaker Side Effect
Circuit breakers open when error rates exceed thresholds. When a circuit breaker opens on Server A, all traffic shifts to Servers B, C, D. If the shift overwhelms those servers, more circuits open. A cascading failure propagates across the cluster.
Circuit breaker tuning matters: slow half-open recovery (letting a few requests through to test recovery) prevents rapid cycling. Global circuit breaker coordination prevents all instances opening simultaneously. Load balancer integration can shed traffic at the load balancer level before circuit breakers trigger.
The DNS Cache Confusion
DNS-based load balancing creates cache coherency issues. Updating DNS records to remove a failing server takes time to propagate. Clients with cached DNS keep sending traffic to the removed server until their TTL expires.
Short TTLs help but increase DNS query load; Anycast addressing (multiple locations with same IP) helps but complicates health checking — some deployments use HTTP redirects instead of DNS rewriting for faster failover
Common Pitfalls / Anti-Patterns
Treating All Connections as Equal
Least connections counts TCP connections, not actual work. A server holding 5 long-running database queries counts the same as one handling 5 fast cache lookups. Use latency-weighted variants or smoothed averages when request durations vary significantly.
Ignoring Server Capacity Differences
Round robin and random distribute requests evenly, not work. If Server A has 32 cores and Server B has 8, equal traffic distribution leaves Server B overloaded. Use weighted algorithms when capacities differ substantially.
Over-engineering Early
Adaptive algorithms and resource-based routing add complexity that hurts debugging. Start with simple round robin or least connections, instrument thoroughly, and upgrade only when measurements show a specific problem simple algorithms cannot solve.
Skipping Health Check Configuration
The best algorithm fails if traffic routes to dead servers. Configure health checks before deploying to production. Set thresholds that avoid flapping: three failures to remove, two successes to restore works as a baseline.
Adding Servers Without Reassessing Weights
Weighted round robin weights reflect server capacity at configuration time — a newly added powerful server with default weight 1 gets starved while older servers with weight 5 absorb the load; recalibrate weights whenever server pools change substantially
Best Practices Summary
Start simple: Round robin or random handles most web application loads. Add complexity only when measurements show a specific problem.
Prefer least connections for variable workloads: When request durations differ significantly, connection count beats round robin for even distribution.
Use consistent hashing at scale: When you frequently add or remove servers, consistent hashing prevents the reshuffling that breaks other algorithms.
Prefer power of two choices for large pools: At sufficient scale, picking two random backends and choosing the less loaded one dramatically reduces worst-case load.
Account for heterogeneous capacities: Weighted variants exist because identical servers rarely exist in production. Match weights to actual server capacity.
Instrument before optimizing: Without measuring actual distribution and latency, you cannot know whether your algorithm needs improvement.
Combine algorithms with health checks: No routing algorithm helps if traffic goes to dead servers. Health checks are not optional.
Test failure modes
Kill a backend server and verify your load balancer handles it gracefully — do this in staging before it happens in production
Trade-off Analysis
Load balancing algorithms involve fundamental trade-offs between different dimensions. Understanding these helps in algorithm selection and tuning.
Fundamental Trade-offs
State vs Performance: Dynamic algorithms that track connection counts, latency, and resource utilization provide better routing decisions but require memory for state and CPU for calculations. Static algorithms like round robin make O(1) decisions with zero state but can’t adapt to real conditions.
Complexity vs Control: Sophisticated algorithms like adaptive or ML-based routing offer fine-grained control but introduce complexity that makes debugging harder. Simple algorithms like round robin are predictable and debuggable but lack flexibility.
Consistency vs Availability: Consistent hashing provides minimal reshuffling during cluster changes but requires coordination to maintain ring state. Less consistent approaches like random achieve better load distribution in exchange for more redistribution during changes.
Latency vs Accuracy: Real-time metric collection provides accurate routing decisions but adds latency to the routing process. Sampled or aggregated metrics reduce overhead at the cost of slower adaptation.
| Scenario | Recommended Algorithm | Rationale |
|---|---|---|
| Homogeneous servers, simple deployment | Round Robin | Zero state, O(1) decision, adequate distribution |
| Heterogeneous server capacities | Weighted Round Robin | Handles capacity differences without dynamic overhead |
| Variable request durations | Least Connections | Adapts to current load rather than configuration |
| Frequent server additions/removals | Consistent Hashing | Minimizes reshuffling during cluster changes |
| Very large server pools | Power of Two Choices | Reduces worst-case load without O(n) scanning |
| Latency-sensitive applications | Least Response Time | Considers both load and response time |
| Session affinity required | Cookie-based Sticky Sessions | Survives IP changes, no NAT hotspots |
Trade-off Comparison: Load Balancing Algorithms
| Algorithm | Computational Complexity | Memory State | Adaptation Speed | Session Affinity | Best For |
|---|---|---|---|---|---|
| Round Robin | O(1) | None | None | No | Homogeneous servers |
| Weighted Round Robin | O(1) | Server weights | None | No | Known capacity differences |
| Random | O(1) | None | None | No | High traffic, simple deployments |
| IP Hash | O(1) | None | None | Yes | Session affinity needs |
| Least Connections | O(n) | Per-server counts | Medium | No | Variable request durations |
| Least Response Time | O(n) | Counts + latency | Fast | No | Latency-sensitive apps |
| Consistent Hashing | O(log n) | Ring + nodes | Slow | Limited | Frequent pool changes |
| Power of Two Choices | O(1) | Per-server counts | Medium | No | Large heterogeneous pools |
| Adaptive/ML-based | Variable | Multiple metrics | Fast | No | Complex multi-factor routing |
Interview Questions
Standard hashing maps a client directly to a server via modulo: server = hash(client) % N. When N changes, almost every client maps to a different server — total redistribution.
Consistent hashing maps both clients and servers onto a hash ring. A client routes to the nearest server clockwise on the ring. When a server is added, it carves out a position between two existing servers and only takes over clients between it and its predecessor. When removed, only its successor picks up its traffic. In practice, only K/N clients are affected where K is the number of servers — roughly 1/Nth of the total, regardless of how many servers change.
Random and round robin can create temporary load imbalance under moderate traffic. One unlucky server accumulates more requests than others, causing elevated latency or failures.
The power of two choices works because of load balancing mathematics. When you pick one server at random, the maximum load across servers can be much higher than average. When you pick two at random and choose the less loaded one, the maximum load drops dramatically — provably to within a constant factor of the optimal. You get most of the benefit of knowing all server loads with only two samples. This principle underlies Google's load balancer design and Envoy's load balancing.
Least connections routes to the server with fewest active connections. When a burst of long-running requests completes simultaneously, all those servers drop to zero active connections at the same instant. Incoming requests see the same low count and all rush to the same servers before they accumulate new load.
Smoothed averages fix this by tracking not just the raw count but the trend. Exponentially weighted moving averages (EWMAs) weight recent samples less heavily, so a server that was heavily loaded a moment ago retains some of that weight even after its connections complete. The algorithm responds to trends, not just snapshots, preventing the stampede effect.
Server-side load balancing uses a dedicated component (hardware appliance or software like HAProxy/Nginx) between clients and servers. The load balancer handles routing, health checks, and failure handling. Clients just talk to the load balancer's IP.
Client-side load balancing embeds routing logic in the client library. The client maintains a list of healthy servers from service discovery, tracks failures locally, and picks which server to call directly. Netflix Ribbon is a classic example.
Choose server-side when you want centralized control, simpler clients, and built-in infrastructure. Choose client-side when you want to eliminate the load balancer hop, reduce infrastructure dependencies, and are willing to manage server list distribution and client-side failure handling.
Basic consistent hashing maps each physical server to one position on the hash ring. This creates uneven distribution because server IDs hash to random positions. A server whose hash lands near a cluster of client hashes gets disproportionate traffic while an isolated hash position gets few clients.
Virtual nodes give each physical server multiple positions on the ring by hashing `server_id:1`, `server_id:2`, etc. With 150-200 virtual nodes per physical server, the law of large numbers produces statistically even distribution. Cassandra uses 256 virtual nodes by default, DynamoDB uses a similar approach.
Databases need this more than load balancers because they typically have fewer nodes. A load balancer with 50 backend servers gets adequate distribution from raw consistent hashing. A database with 6 nodes does not.
IP hash maps a client by hashing their source IP address. All users behind the same corporate proxy, university network, or mobile carrier NAT gateway appear as a single source IP. The load balancer hashes that IP to the same backend, routing all NAT'd users to the same server. This creates severe hot spots.
Alternatives: cookie-based sticky sessions let the load balancer assign a client to a server via a cookie set on first contact. The cookie survives IP changes from mobile users switching networks. Header-based affinity uses a custom header set by an upstream proxy. Application-level routing can also maintain affinity without load balancer involvement.
Weighted round robin assigns each server a weight proportional to its capacity relative to other servers. A server with weight 3 receives 3 requests for every 1 request sent to a server with weight 1. NGINX, HAProxy, and cloud load balancers all support this.
Weights become stale when servers are upgraded, when workloads change, or when new servers join with default weights. A newly added powerful server with default weight 1 gets starved while old servers with weight 5 absorb disproportionate traffic. Weights require periodic recalibration, which is why many teams prefer dynamic algorithms like least connections that adapt automatically.
Load balancers distribute traffic and handle complete server failures via health checks. Circuit breakers monitor error rates and latency to detect degraded backends that are still responding but poorly. Poor load balancing forces circuit breakers to work harder — a server overloaded by uneven distribution accumulates errors, triggering circuit breaker opens.
During recovery, the interaction matters most. When a circuit breaker closes and traffic resumes, naive load balancers flood the recovering service with requests immediately. Better implementations ease traffic back gradually, letting the recovering service warm up before full traffic resumes. Some load balancers integrate circuit breaking directly; others rely on application-level circuit breaker libraries like Resilience4j or Hystrix.
Traditional least connections routes to the server with the fewest active TCP connections. Each connection counts equally regardless of how long it has been open or how much work it represents.
Least outstanding requests (used by AWS ALB) counts both actively processing requests and queued/pending requests. A server that has dispatched many requests but received none back (they are all outstanding in the network or waiting on downstram services) shows a higher count than a server that has processed many fast responses. This better approximates actual server work in flight rather than just connection count.
Choose static algorithms (round robin, random, weighted variants) when: servers have similar capacities, request processing times are consistent, server pools change infrequently, and you want minimal operational complexity. These are predictable, debuggable, and impose no measurement overhead.
Choose dynamic algorithms (least connections, least response time, resource-based, adaptive) when: servers have heterogeneous capacities or are running different workloads, request durations vary significantly, you need real-time adaptation to load spikes or degraded servers, and you have monitoring infrastructure to observe algorithm behavior. Dynamic algorithms add complexity — state tracking, metric collection, smoothing — so only add them when measurements justify the cost.
Standard consistent hashing can cause temporary overloads during server additions or removals. A newly added server might receive requests beyond its capacity while redistributing from an overloaded departing server.
Bounded load consistent hashing adds a maximum load threshold per server. When selecting a server clockwise on the ring, if the chosen server exceeds its load bound, the algorithm searches clockwise for the next server under threshold. This caps maximum load during reshuffling at the cost of slightly more complex lookup and potential initial scan for heavily loaded rings.
Google's Maglev load balancer uses bounded loads to ensure no server receives more than a configured fraction of average load during rebalancing events.
Request coalescing (also called request deduplication) tracks in-flight requests for the same resource. When a cache miss triggers backend requests for identical data, coalescing ensures only one backend request is made while others wait for the result.
Without coalescing, 1000 concurrent requests for expired cache data result in 1000 simultaneous backend calls. With coalescing, the first request goes to the backend, the remaining 999 wait, and all receive the same response. This reduces backend load dramatically during thundering herd events.
Implementation approaches: shared in-memory tracking with mutex/lock per key; partitioned counters avoiding centralized coordination; probabilistic early response for duplicate requests. The tradeoff is added latency for requests that wait, versus massive backend savings during spikes.
Envoy's least_request load balancing policy implements power of two choices. For each request, it randomly selects N servers from the healthy pool (Envoy defaults to 2) and picks the one with the lowest average latency or lowest active request count.
The implementation requires each server to track: active request count, EWMA of latency, and expose these via membership protocol. The load balancer maintains no global state — it samples from what servers report. This makes the algorithm highly scalable since decision complexity doesn't increase with cluster size.
The choice_count parameter (N) allows tuning. Higher N improves distribution accuracy but increases sampling overhead. At small N (2-3), the benefit curve is steep. At larger N, marginal returns diminish. Envoy allows configuring choice_count via least_request_lb_config.
Health checks verify whether a server responds at all — TCP connection succeeds, HTTP returns non-5xx. But a server can be responding with 200 OK while handling requests slowly due to CPU contention, memory pressure, or network degradation.
Outlier detection extends health checking by analyzing request outcomes. Envoy's outlier detection monitors: consecutive 5xx errors, success rate over time window, latency percentiles, error code frequency. When a server exhibits anomalous patterns, the load balancer marks it unhealthy and ejects it from the pool even if it still responds to health checks.
Ejection is temporary — servers return to the healthy pool after a configured interval. This allows transient issues (GC pause, temporary network hiccup) to resolve without manual intervention while still protecting the cluster from consistently degraded instances.
Unicast DNS: Each DNS response contains a single IP address. Clients connect to that specific endpoint. Easy to implement, predictable routing, but depends on client-side retry logic when that endpoint fails.
Anycast DNS: Multiple endpoints share the same IP address. DNS responses return the same IP regardless of which region answers. Network routing automatically directs traffic to the closest endpoint. Higher availability (loss of one endpoint doesn't break routing) but less control over which endpoint handles specific requests.
Anycast works well for stateless services where request distribution doesn't matter. Unicast provides better control for services requiring session affinity, geographic routing, or capacity planning. Cloud providers often combine both: anycast for global reach, then unicast within regions for precise control.
Session affinity creates asymmetric load distribution — some servers accumulate more sessions than others. During scaling events (adding or removing servers), existing sessions must either break (users reconnect to different servers) or the added server must accept no new sessions until it accumulates its share.
Strategies to minimize tradeoff: external session storage (Redis) decouples session state from server affinity — users can reconnect to any server that accesses the shared session store. Cookie-based affinity allows reassignment without breaking existing sessions, as long as the cookie persists. Gradual traffic shifting during maintenance lets sessions drain naturally before server removal.
Modern architectures prefer external session storage and avoid sticky sessions entirely. For legacy applications requiring affinity, cookie-based approaches with short cookie TTLs allow natural rebalancing over time.
Per-backend metrics: request rate, error rate, latency distribution (p50/p95/p99), active connection count for connection-oriented algorithms. If one backend has significantly higher latency or error rate than others, the algorithm may be routing disproportionate load there.
Distribution metrics: variance in request count across backends. For round robin, expect near-equal distribution. For dynamic algorithms, expect distribution proportional to capacity or load. High variance indicates algorithm malfunction or configuration issues.
Health check metrics: failure rates, time to detection (how long between backend failure and traffic cessation), flapping frequency. Slow detection means traffic continues to failed backends. Frequent flapping indicates threshold misconfiguration.
Algorithm-specific: for least connections, monitor connection count trends rather than absolute values — a server rapidly accumulating connections indicates overload before it becomes critical.
Virtual nodes distribute hash positions more evenly across the ring. With only one position per server, statistical variation causes uneven distribution — some servers accumulate far more client positions than others, especially with fewer servers.
More virtual nodes improve distribution uniformity but increase memory usage and lookup complexity. Each virtual node entry requires storage (ring position, owner pointer). Binary search on sorted keys requires O(log V) where V = virtual nodes per server × servers.
Practical counts: Memcached uses 150 virtual nodes by default, providing good balance for clusters down to tens of nodes. Cassandra uses 256. Load balancers with hundreds of backend servers often use fewer virtual nodes because the law of large numbers already provides adequate distribution with fewer positions.
The right count depends on cluster size and memory constraints. For databases and caches with fewer nodes (3-20), use 100-200. For load balancers with many nodes (50+), 20-50 may suffice.
Latency-aware algorithms (like least response time) combine connection count with recent response times. A common formula: score = active_connections * average_response_time. Lower score wins. This prefers servers with both low connection counts AND fast response times.
Challenges arise when response times vary significantly. A server processing short cached responses (1ms) gets penalized compared to one processing long database queries (500ms) even if the short server has more connections. Mixed workload environments (fast cached + slow database) make latency-aware routing less predictable.
Smoothing helps — using exponentially weighted moving averages rather than raw recent samples prevents temporary spikes from skewing decisions. Some implementations use latency percentiles (p95) instead of averages to ignore outliers. Others weight connections by expected duration (active connections × expected time) for more accurate work-in-flight estimation.
Stateless services: requests are interchangeable. Any instance can handle any request. Load balancing focuses on even distribution and quick failure detection. Algorithms like round robin, random, or least connections work well.
Stateful storage systems: requests for the same data must route to the specific node holding that data. Consistent hashing becomes critical — not for load distribution but for data locality. Adding a node must redistribute minimal data. Virtual nodes matter more with fewer storage nodes.
Storage systems often use replicated data where multiple nodes hold the same data. Read operations can go to any replica; write operations must reach quorum. Load balancing for storage includes routing decisions based on replica health, data freshness requirements, and geographic locality.
Database connection pooling adds another layer — connection pool saturation (not just server load) may limit throughput. Load balancers that track database connection usage rather than just active requests provide more accurate routing for database-heavy workloads.
Further Reading
- Load Balancing — Load balancer architecture, health checking, SSL termination, and GSLB fundamentals
- API Gateway — How load balancing integrates with API management and routing
- Resilience Patterns — Circuit breakers, retries, and bulkheads that work alongside load balancing
- Microservices Architecture — How load balancing fits into larger distributed system patterns
- System Design Fundamentals — Foundational concepts for designing scalable distributed systems
- HAProxy Architecture Guide — Detailed documentation on HAProxy’s load balancing algorithms and configuration
- Envoy Load Balancing Documentation — Power of two choices, weighted load balancing, and outlier detection
- AWS Global Load Balancing Blog — How AWS handles multi-region load distribution and latency-based routing
- Maglev: A Fast and Reliable Software Network Load Balancer — Google’s paper on their consistent hashing-based load balancer used in production
Quick Recap Checklist
Before deploying any load balancing configuration to production, verify the following:
- Servers have similar capacities? Start with round robin or random
- Request durations vary significantly? Use least connections
- Frequently adding or removing servers? Implement consistent hashing
- Large heterogeneous server pools? Try power of two choices
- Session affinity required? Use cookie-based sticky sessions instead of IP hash
- Health checks configured? Set 3 failures to remove, 2 successes to restore
- Weights recalibrated after server changes? Weights drift from reality fast
- Monitoring per-backend request distribution? You cannot fix what you cannot see
- Circuit breakers integrated? They catch what health checks miss
- Failure tested in staging? Kill a backend and verify graceful handling
Conclusion
Load balancing algorithms run from trivially simple to sophisticated. Round robin and random need no state and distribute load evenly under high traffic. Weighted variants handle capacity differences. Least connections adapts to current load but adds complexity.
Latency and resource-based approaches provide more responsive routing but require additional infrastructure. IP hash offers session affinity at the cost of potential hotspots.
The algorithm matters less than the fundamentals: health checking, appropriate server sizing, and not overloading any single instance. Pick something reasonable, monitor it, and adjust as needed. When choosing an algorithm, start with round robin or random for simplicity, move to least connections when request durations vary, and reach for consistent hashing or power of two choices when operating at scale with frequent cluster changes.
For related reading, see my post on Load Balancing for fundamentals of load balancer architecture. To understand how load balancers integrate with API management, see API Gateway. For resilience patterns that work alongside load balancing, see Resilience Patterns.
Category
Related Posts
Cache Stampede Prevention: Protecting Your Cache
Learn how single-flight, request coalescing, and probabilistic early expiration prevent cache stampedes that can overwhelm your database.
Amazon Architecture: Lessons from the Pioneer of Microservices
Learn how Amazon pioneered service-oriented architecture, the famous 'two-pizza team' rule, and how they built the foundation for AWS.
Bloom Filters: Space-Efficient Probabilistic Data Structure
How Bloom filters provide memory-efficient membership testing with configurable false positive rates for caches, databases, and distributed systems.