Load Balancing Algorithms: Round Robin, Least Connections, and Beyond

Explore load balancing algorithms used in microservices including round robin, least connections, weighted, IP hash, and adaptive algorithms.



Every request hitting your microservices deployment faces the same fundamental question: which backend service instance handles it? Someone has to make that call, and that someone is the load balancing algorithm. Get it right and your system hums along even under heavy traffic. Get it wrong and you will be debugging why one server is on fire while others sit idle.

Microservices complicate this decision. You might have dozens of instances spread across availability zones, each with different capacities, varying response times, and potentially different operational states. The algorithm has to navigate all of that while keeping response times low and routing around failures automatically.

This article walks through load balancing approaches from basic round robin to adaptive algorithms that watch real-time server health and adjust accordingly.

The Role of Load Balancing in Microservices

In a monolith, scaling means running more copies of the same application. Load balancing is simple: distribute requests across identical instances.

Microservices shift the picture. Each service runs multiple instances with different capacities. A payment service making synchronous database calls behaves completely differently from a caching service returning data from memory. A recommendation service might take 500ms while an inventory check finishes in 20ms. The load balancer has to account for all of this variation.

Beyond simple distribution, load balancers in microservices handle service discovery, health checking, circuit breaker integration, metrics collection, and SSL termination. The algorithm you choose affects all of these. Route based on real-time load and your circuit breakers stay quiet. Route poorly and circuit breakers work overtime protecting overloaded servers.

Static Algorithms

Static algorithms make routing decisions without considering current system state. They follow predetermined rules configured beforehand. The advantages are real: no state tracking overhead, predictable behavior, and straightforward debugging.

Round Robin

Round robin cycles through servers in order: Server 1, Server 2, Server 3, then back to Server 1. Each request goes to the next server in sequence.

The only state is a position counter, and each decision is trivial. This makes it extremely fast and memory-efficient. No tracking connection counts, no calculating server load.

Round robin works when all servers have identical capacity and similar request processing times. Perfect homogeneity rarely exists though. If Server 1 has twice the memory of Server 2, round robin still sends equal traffic to both. Server 1 sits underutilized while Server 2 struggles.

DNS-based load balancing often uses round robin. Each DNS response rotates through available server IPs. Simple, but lacks awareness of server health or current conditions. Fine for some scenarios, but production microservice deployments usually need more sophistication.

graph LR
    A[Request 1] --> B[Server 1]
    C[Request 2] --> D[Server 2]
    E[Request 3] --> F[Server 3]
    G[Request 4] --> B
    H[Request 5] --> D
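The rotation reduces to a counter and a modulo. A minimal sketch in Python (server names are placeholders; the lock stands in for the concurrency handling a real load balancer needs):

```python
import threading

class RoundRobin:
    """Cycle through servers in order; a single counter is the only state."""

    def __init__(self, servers):
        self.servers = list(servers)
        self._index = 0
        self._lock = threading.Lock()  # requests arrive concurrently

    def next_server(self):
        with self._lock:
            server = self.servers[self._index % len(self.servers)]
            self._index += 1
        return server

lb = RoundRobin(["server1", "server2", "server3"])
print([lb.next_server() for _ in range(5)])
# → ['server1', 'server2', 'server3', 'server1', 'server2']
```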

Weighted Round Robin

Weighted round robin assigns a weight to each server based on capacity. Servers with higher weights get more traffic proportionally. If Server 1 has weight 3 and Server 2 has weight 1, Server 1 gets three requests for every one that goes to Server 2.

Weights typically reflect server specs: CPU cores, memory size, expected performance. A newer server with more resources handles heavier loads. An older server running background workloads gets lighter traffic.

The catch is keeping weights accurate. A server that suddenly gets busy still receives its configured share of new requests. Weights reflect theoretical capacity, not current load. Regular recalibration becomes necessary as workloads change.

This approach suits heterogeneous server pools with relatively stable load patterns. When capacities shift frequently, static weights become maintenance burdens.
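A naive sketch of the weighting, using the 3:1 example above (server names are placeholders):

```python
import itertools

def weighted_round_robin(weights):
    """Yield servers in proportion to their integer weights.
    weights: mapping of server name -> weight."""
    expanded = [server for server, w in weights.items() for _ in range(w)]
    return itertools.cycle(expanded)

gen = weighted_round_robin({"server1": 3, "server2": 1})
print([next(gen) for _ in range(8)])
# → ['server1', 'server1', 'server1', 'server2',
#    'server1', 'server1', 'server1', 'server2']
```

Note that this expansion sends the heavy server its share in bursts; production implementations such as NGINX's "smooth weighted round robin" interleave picks to avoid that, at the cost of a slightly more involved algorithm.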

Random

Random routing distributes requests using a random number generator. Counterintuitively, random selection distributes load quite evenly under moderate to high traffic volumes.

With enough traffic, random selection approximates equal distribution naturally. The law of large numbers ensures convergence over time. For very high traffic systems where state management becomes expensive, random offers a simple alternative with no coordination overhead.

Variance is higher under low traffic. One server might get lucky while another receives fewer requests. Over time this evens out, but bursty traffic causes temporary imbalance.

Random works as a baseline algorithm. Some sophisticated approaches use random selection as a fallback or combine it with other methods.
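A sketch of stateless random selection, plus a quick check that uniform choice evens out at volume (server names are placeholders):

```python
import random
from collections import Counter

def pick(servers):
    """Stateless selection: each request independently picks a random server."""
    return random.choice(servers)

counts = Counter(pick(["server1", "server2", "server3"]) for _ in range(30000))
# At this volume each server lands within a few percent of 10000 requests.
```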

IP Hash

IP hash routes requests based on a hash of the client IP address. The same client IP always routes to the same backend server. This provides session affinity without cookies or tracking mechanisms.

The hash function maps client IPs to servers. A simple modulo of the IP address integer value by server count works, but causes massive redistribution when servers are added or removed. Consistent hashing reduces this reshuffling, keeping most clients with the same server even when the pool changes.

IP hash breaks down when many clients share the same source IP. Users behind corporate proxies or NAT gateways all appear as one IP, routing to the same server and potentially creating hotspots. It also has no awareness of server load, so a busy server still receives its hash-allocated share.

For simple session affinity where clients need to return to the same server, IP hash works. For more control, cookie-based sticky sessions or application-level routing work better.
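A minimal modulo-based sketch, with the redistribution drawback described above (IP and server names are placeholders; a real deployment would prefer consistent hashing):

```python
import hashlib

def ip_hash(client_ip, servers):
    """Map a client IP to a server with a stable hash.
    Simple modulo: adding or removing a server remaps most clients."""
    digest = hashlib.md5(client_ip.encode()).digest()
    return servers[int.from_bytes(digest, "big") % len(servers)]

servers = ["server1", "server2", "server3"]
# The same client IP always lands on the same backend:
assert ip_hash("203.0.113.7", servers) == ip_hash("203.0.113.7", servers)
```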

Dynamic Algorithms

Dynamic algorithms factor in current system state when making routing decisions. They adapt to actual server load, response times, and health status. This sophistication comes with tradeoffs: state tracking, constant calculations, and complexity that sometimes causes unexpected behavior.

Least Connections

Least connections routes new requests to the server with the fewest active connections. A server processing ten long-running requests might get a new request before a server that just started on an identical request. The algorithm adapts to current load rather than distributing evenly based on configuration.

This works well for workloads with variable request durations. A request holding a database connection for ten seconds should count differently than one returning cached data in milliseconds. Least connections captures these differences through active connection counts.

The algorithm requires tracking active connections for each backend in load balancer memory. This state updates with each request and response. Under very high traffic, the overhead of tracking and comparing connection counts adds up.

Least connections can cause thrashing under certain patterns. If many requests complete simultaneously, multiple new requests all see the same low count and flood the same server before it updates. Using a smoothed average rather than raw counts mitigates this.
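The bookkeeping amounts to a counter per backend. A sketch, with illustrative acquire/release names marking request start and completion:

```python
class LeastConnections:
    """Route each request to the backend with the fewest in-flight requests."""

    def __init__(self, servers):
        self.active = {s: 0 for s in servers}  # per-backend connection counts

    def acquire(self):
        server = min(self.active, key=self.active.get)  # ties break by insertion order
        self.active[server] += 1
        return server

    def release(self, server):
        self.active[server] -= 1

lb = LeastConnections(["server1", "server2"])
first = lb.acquire()   # both idle, so the first backend wins the tie
second = lb.acquire()  # the other backend now has fewer connections
lb.release(first)      # first request finishes; that backend is the next target
```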

Least Response Time

Least response time routes to the server with the lowest combined metric of active connections and average response time. It combines load awareness with performance awareness in a single metric.

The calculation typically weights active connection count against recent response times. A server with fewer connections but much slower responses might not win. A moderately loaded server with fast responses wins.

AWS ALB uses least outstanding requests, focusing on how many requests are waiting versus actively processed. Google Cloud Load Balancing uses a similar model focused on minimizing latency.

This algorithm works well when response times vary significantly between requests and servers. A mix of fast cached responses and slow database queries benefits from response time awareness.
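One possible scoring sketch, assuming connections and latency are combined by multiplication (real implementations weight these differently; the numbers are illustrative):

```python
def score(active_connections, avg_response_ms):
    """Lower is better; +1 keeps idle servers from scoring zero regardless of latency."""
    return (active_connections + 1) * avg_response_ms

# server -> (active connections, recent average response time in ms)
metrics = {
    "server1": (2, 40.0),   # more connections, fast responses
    "server2": (1, 200.0),  # fewer connections, much slower
}
best = min(metrics, key=lambda s: score(*metrics[s]))
print(best)  # → server1: fast responses beat a lower connection count
```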

Resource-Based Routing

Resource-based routing makes decisions based on actual server resource utilization. The load balancer queries each server for current CPU, memory, or application-specific metrics before routing.

This requires agents on each server reporting metrics to the load balancer. The overhead of collecting and communicating metrics limits update frequency. The benefit is routing decisions that truly reflect server capacity rather than indirect signals like connection counts.

Some implementations use active reporting where servers push metrics. Others use passive monitoring where the load balancer tracks response times as a proxy for load. Active reporting is more accurate but adds complexity and network overhead.

Resource-based routing suits environments where server capacity varies significantly or where you want fine-grained control based on actual resource consumption.
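Assuming servers push a CPU utilization figure, the routing decision itself is small; the hard part is collecting fresh metrics. A sketch with illustrative values:

```python
def pick_least_loaded(cpu_by_server):
    """cpu_by_server: server -> last reported CPU utilization in [0.0, 1.0]."""
    return min(cpu_by_server, key=cpu_by_server.get)

# Routing reflects reported capacity directly, not indirect signals:
print(pick_least_loaded({"server1": 0.85, "server2": 0.30}))  # → server2
```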

Adaptive Algorithms

Adaptive algorithms go beyond simple metrics to make predictive routing decisions. They might watch trends in response time changes, error rates, or capacity utilization and route traffic before problems occur.

These algorithms often use machine learning to identify patterns. A server showing increasing response times might have traffic shifted away before it becomes critical. Error rate spikes trigger preemptive routing away from failing instances.

The complexity of adaptive algorithms makes them harder to debug and predict. The benefit is handling edge cases that rule-based algorithms miss. Production deployments often layer adaptive algorithms on top of simpler fallbacks.

Session Persistence and Sticky Sessions

Session persistence routes a particular user’s requests to the same backend server. Without it, a user who logs in on Server 1 might get routed to Server 2 on their next request, which has no memory of their session.

Sticky sessions create problems though. They complicate maintenance windows since taking down a server disconnects active users. They make horizontal scaling harder because load cannot be freely redistributed. A server getting stuck with long-running sessions might accumulate disproportionate load.

Sticky sessions matter most for applications that store session state locally rather than in distributed caches. Shopping carts, multi-step form wizard state, in-memory computation results might rely on server affinity. Most modern applications store session state externally in Redis or similar, reducing the need for sticky sessions.

When you do need sticky sessions, cookie-based affinity works better than IP hash. Cookies give more control and work correctly even when clients switch networks or share IPs. The load balancer reads a cookie to determine the target server.

Cookie-based sticky sessions insert a tracking cookie set by the load balancer. The first request gets routed normally, and the load balancer sets a cookie identifying the assigned server. Subsequent requests include the cookie, and the load balancer reads it to maintain affinity.
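That flow can be sketched as follows, assuming a hypothetical lb_server cookie name and any base algorithm for the first assignment:

```python
def route(cookies, servers, pick_next):
    """Return (server, cookie_to_set); pick_next is any base algorithm."""
    assigned = cookies.get("lb_server")
    if assigned in servers:
        return assigned, None             # valid affinity cookie: honor it
    server = pick_next()                  # first visit: route normally
    return server, ("lb_server", server)  # tell the client to pin to this server

servers = ["server1", "server2"]
# First request: no cookie, so the base algorithm assigns and a cookie is set.
server, cookie = route({}, servers, lambda: "server2")
assert cookie == ("lb_server", "server2")
# Later requests carry the cookie and stick to the assigned server.
server, cookie = route({"lb_server": "server2"}, servers, lambda: "server1")
assert server == "server2" and cookie is None
```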

Circuit Breaker Integration

Load balancing algorithms and circuit breakers work together. The load balancer distributes traffic, but when a service starts failing, the circuit breaker stops traffic to failing instances.

Poor load balancing forces circuit breakers to work harder. A server running hot with CPU maxed out receives requests that timeout, triggering circuit breaker opens for that instance. Better load balancing would have spread work more evenly, keeping the server from becoming overloaded in the first place.

Some load balancers integrate circuit breaking directly. When error rates exceed thresholds for a particular backend, the load balancer stops routing traffic. This happens without needing separate circuit breaker libraries in your application code.

The interaction between load balancing and circuit breaking matters most during recovery. When a circuit breaker closes and traffic resumes, the load balancer should ease traffic back gradually rather than flooding the recovering service.

Client-Side vs Server-Side Load Balancing

Traditional load balancing happens server-side: a dedicated load balancer sits between clients and servers, making routing decisions for all incoming traffic.

Client-side load balancing puts the routing logic in the client. The client maintains a list of available servers and picks which one to call. Netflix’s Ribbon library is an example of client-side load balancing for JVM applications.

Client-side balancing removes the load balancer as a single point of failure. The client directly picks a server, reducing network hops. The tradeoff is that server list management becomes the client’s responsibility. When servers scale up or down, clients need to know.

Service discovery integrates with both approaches. Server-side load balancers often query service registries directly. Client-side load balancers typically receive server lists from service discovery and cache them locally.

graph TD
    subgraph ServerSide["Server-Side Load Balancing"]
        Client1[Client] --> LB[Load Balancer]
        LB --> S1[Server 1]
        LB --> S2[Server 2]
        LB --> S3[Server 3]
    end
    subgraph ClientSide["Client-Side Load Balancing"]
        Client2[Client] --> CL[Client Library]
        CL --> S4[Server 1]
        CL --> S5[Server 2]
        CL --> S6[Server 3]
    end

Server-side load balancing works well when you want centralized control, easier configuration updates, and built-in infrastructure like health checking and circuit breaking. Client-side load balancing suits environments where you want to eliminate the load balancer hop and reduce infrastructure dependencies.

Examples from Real Systems

NGINX

NGINX supports multiple load balancing algorithms in its upstream configuration:

upstream backend {
    least_conn;  # Least connections algorithm
    server 192.168.1.10:8080 weight=3;
    server 192.168.1.11:8080 weight=1;
    server 192.168.1.12:8080 down;  # Marked as down
}

NGINX Plus adds least time and session persistence features. The free version provides round robin, least connections, and IP hash.

HAProxy

HAProxy offers sophisticated load balancing with clear configuration syntax:

backend servers
    balance leastconn  # pick one: roundrobin, leastconn, or source (only the last balance directive takes effect)
    server s1 192.168.1.10:8080 check inter 2000 fall 3
    server s2 192.168.1.11:8080 check inter 2000 fall 3
    server s3 192.168.1.12:8080 check inter 2000 fall 3

HAProxy’s source balance algorithm implements IP hash-like functionality. The check keyword enables health monitoring with configurable intervals and failure thresholds.

AWS ALB

AWS Application Load Balancer supports two routing algorithms per target group:

  • Round Robin - Default, cycles through targets in the target group
  • Least Outstanding Requests - Routes to the target with the fewest pending requests

Flow hash routing, which picks a target from the tuple of protocol, source IP, source port, destination IP, destination port, and TCP sequence number, belongs to the Network Load Balancer rather than ALB.

ALB integrates with Auto Scaling Groups, automatically distributing traffic across healthy instances as they scale.

Algorithm Comparison

| Algorithm | State Required | Adapts to Load | Session Affinity | Complexity | Best For |
| --- | --- | --- | --- | --- | --- |
| Round Robin | None | No | No | Low | Homogeneous servers, simple deployments |
| Weighted Round Robin | Server weights | No | No | Low | Heterogeneous servers with stable load |
| Random | None | No | No | Low | High traffic where simplicity matters |
| IP Hash | None | No | Yes | Low | Session affinity without cookies |
| Least Connections | Active connections | Yes | No | Medium | Variable request durations |
| Least Response Time | Connections + latency | Yes | No | Medium | Latency-sensitive applications |
| Resource-Based | Resource metrics | Yes | No | High | Fine-grained capacity routing |
| Adaptive | Multiple metrics | Yes | No | High | Complex deployments with trends |

Choosing the Right Algorithm

Algorithm selection depends on your workload characteristics and infrastructure. Here is what to think about:

Server homogeneity: If all servers have identical capacity and similar performance, round robin works fine. If servers vary significantly, use weighted variants.

Request characteristics: Do requests take roughly the same time, or do they vary widely? Long-running requests benefit from least connections. Fast, consistent requests work fine with round robin.

Session requirements: Do users need to return to the same server? Cookie-based sticky sessions or IP hash handle this. External session storage eliminates the need entirely.

Latency sensitivity: Are response times critical? Least response time or latency-based routing helps. Background tasks work fine with simple round robin.

Operational complexity: Sophisticated algorithms require more monitoring and tuning. Start simple and add complexity only when measurements show it is needed.

For most web applications, least connections or weighted round robin hits a good balance. These handle heterogeneous servers reasonably well and adapt to varying load without excessive complexity.

Conclusion

Load balancing algorithms run from trivially simple to sophisticated. Round robin and random need no state and distribute load evenly under high traffic. Weighted variants handle capacity differences. Least connections adapts to current load but adds complexity.

Latency and resource-based approaches provide more responsive routing but require additional infrastructure. IP hash offers session affinity at the cost of potential hotspots.

Honestly, the algorithm matters less than the fundamentals: health checking, appropriate server sizing, and not overloading any single instance. Pick something reasonable, monitor it, and adjust as needed.

For related reading, see my post on Load Balancing for fundamentals of load balancer architecture. To understand how load balancers integrate with API management, see API Gateway. For resilience patterns that work alongside load balancing, see Resilience Patterns.
