High Availability Patterns: Building Reliable Distributed Systems

Learn essential high availability patterns including redundancy, failover, load balancing, and SLA calculations. Practical strategies for building systems that stay online.


Availability measures how often a system is operational. High availability (HA) means the system stays up even when components fail. For critical systems, downtime costs money, reputation, and sometimes lives.

The CAP theorem tells us that during partitions, we choose between consistency and availability. High availability patterns help minimize the time spent in that trade-off by preventing failures and handling them gracefully when they occur.

This post covers practical patterns for building highly available systems.


What is High Availability?

High availability means systems that remain operational even when components fail. We measure it as a percentage of uptime over time:

graph LR
    A[99%] --> B[3.65 days downtime/year]
    A --> C[7.3 hours downtime/month]
    D[99.9%] --> E[8.76 hours downtime/year]
    D --> F[43.8 minutes downtime/month]
    G[99.99%] --> H[52.6 minutes downtime/year]
    G --> I[4.38 minutes downtime/month]

The “nines” matter. Each additional nine represents a tenfold reduction in downtime. Whether that matters depends on your business context. A video streaming service can probably survive 4 hours of downtime per year. A hospital’s monitoring system cannot.
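
The percentages translate into downtime budgets with simple arithmetic; a quick sketch (the helper name is illustrative):

```javascript
// Downtime allowed per year for a given availability percentage.
const MINUTES_PER_YEAR = 365 * 24 * 60; // 525,600

function downtimePerYearMinutes(availabilityPercent) {
  const unavailableFraction = 1 - availabilityPercent / 100;
  return MINUTES_PER_YEAR * unavailableFraction;
}

console.log(downtimePerYearMinutes(99).toFixed(0)); // 5256 (~3.65 days)
console.log(downtimePerYearMinutes(99.9).toFixed(0)); // 526 (~8.76 hours)
console.log(downtimePerYearMinutes(99.99).toFixed(0)); // 53
```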


Redundancy

The first line of defense against failure is having backup components. Redundancy means duplicating critical components so that if one fails, another takes over.

Types of Redundancy

Active-active: Multiple replicas serve traffic simultaneously. If one fails, others continue without interruption.

// Active-active: all servers handle requests
const servers = ["server1", "server2", "server3"];

async function handleRequest(request) {
  // Try servers in order until one responds
  for (const server of servers) {
    try {
      return await sendToServer(server, request);
    } catch (error) {
      continue; // Try next server
    }
  }
  throw new Error("All servers unavailable");
}

Active-passive: One primary server handles traffic. Standby servers are ready but not processing requests until failover.

// Active-passive: standby takes over on primary failure
const primary = "primaryServer";
const standby = "standbyServer";

async function handleRequest(request) {
  try {
    return await sendToServer(primary, request);
  } catch (error) {
    // Failover to standby
    console.log("Primary failed, activating standby");
    return await sendToServer(standby, request);
  }
}

Active-active requires more complex conflict resolution but provides better resource utilization. Active-passive is simpler but wastes resources on idle standby capacity.


Failover Patterns

Failover is the process of switching from a failed component to a backup. Several patterns exist:

Automatic vs Manual Failover

Automatic failover detects failures and switches without human intervention. Manual failover requires an operator to trigger the switch.

Automatic is faster but riskier. If detection is imperfect, you might failover unnecessarily, causing a cascade of problems. Manual failover gives you control but introduces human delay.

For most production systems, a hybrid approach works: automatic detection with automatic failover for minor issues, manual intervention for major events.
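
One way to sketch the hybrid approach is a consecutive-failure counter with an escape hatch for major events; the `createFailureDetector` helper and the threshold below are illustrative, not a standard API:

```javascript
// Sketch: require N consecutive failed probes before automatic failover,
// and page a human instead when the failure looks like a major event.
const CONSECUTIVE_FAILURES_REQUIRED = 3;

function createFailureDetector({ onFailover, onPage }) {
  let consecutiveFailures = 0;
  return {
    reportProbe(ok, { majorEvent = false } = {}) {
      if (ok) {
        consecutiveFailures = 0; // any success resets the streak
        return "healthy";
      }
      consecutiveFailures++;
      if (majorEvent) {
        onPage(); // e.g. whole-datacenter outage: a human decides
        return "paged";
      }
      if (consecutiveFailures >= CONSECUTIVE_FAILURES_REQUIRED) {
        onFailover(); // minor, well-understood failure: automatic
        return "failover";
      }
      return "waiting";
    },
  };
}
```

Requiring several consecutive failures trades a little detection latency for far fewer false-positive failovers.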

Health Checks

Failover needs accurate failure detection:

// Health check endpoint
app.get("/health", async (req, res) => {
  const health = {
    status: "ok",
    timestamp: Date.now(),
    checks: {
      database: await checkDatabase(),
      cache: await checkCache(),
      disk: await checkDiskSpace(),
    },
  };

  // Return unhealthy if any critical check fails
  const isHealthy =
    health.checks.database === "ok" && health.checks.cache === "ok";

  res.status(isHealthy ? 200 : 503).json(health);
});

async function checkDatabase() {
  try {
    await db.query("SELECT 1");
    return "ok";
  } catch (error) {
    return "failed";
  }
}

Health checks should verify actual functionality, not just process liveness. A process can be running but unable to serve requests.
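
One way to frame this is the split between liveness (the process responds at all) and readiness (its dependencies are usable); a minimal sketch, with dependency check results passed in as plain strings:

```javascript
// Liveness vs readiness: a process can be alive but not ready to serve.
function livenessStatus() {
  return 200; // the process is up and the event loop is responding
}

function readinessStatus(dependencyResults) {
  // Ready only when every critical dependency check passed
  const ready = Object.values(dependencyResults).every((r) => r === "ok");
  return ready ? 200 : 503;
}

console.log(readinessStatus({ database: "ok", cache: "ok" })); // 200
console.log(readinessStatus({ database: "failed", cache: "ok" })); // 503
```

A load balancer that only checks liveness will keep routing traffic to an instance whose database connection is dead.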

Failover Time

Failover introduces latency. Components of failover time:

graph TD
    A[Failover Time] --> B[Failure Detection]
    A --> C[Decision to Failover]
    A --> D[DNS/Route Update]
    A --> E[New Instance Startup]
    A --> F[Health Check Propagation]

    B --> B1[Usually 5-30 seconds]
    D --> D1[Can be 30-60 seconds for DNS]

DNS-based failover is slow because DNS records are cached and TTLs can be 60 seconds or more. Floating IP or anycast approaches are faster.


Load Balancing

Load balancers distribute traffic across multiple servers. They also help with availability by routing around failed instances.

Load Balancing Algorithms

Round robin: Send each request to the next server in sequence. Simple but does not account for varying request complexity.

// Round robin implementation
let currentIndex = 0;
const servers = ["server1", "server2", "server3"];

function getNextServer() {
  const server = servers[currentIndex];
  currentIndex = (currentIndex + 1) % servers.length;
  return server;
}

Least connections: Send new requests to the server with the fewest active connections. Better for varying request durations.

IP hash: Route requests from the same client IP to the same server. Useful when session state is stored locally.

Weighted: Assign weights to servers based on capacity. More powerful servers receive more traffic.
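
A minimal weighted-selection sketch (hosts and weights are made up): pick a point in the total weight range and walk the list until it is exhausted.

```javascript
// Weighted selection: a server with weight 3 receives roughly 3x the
// traffic of a server with weight 1.
const weightedServers = [
  { host: "big-server", weight: 3 },
  { host: "small-server", weight: 1 },
];

function pickWeighted(servers, rand = Math.random()) {
  const totalWeight = servers.reduce((sum, s) => sum + s.weight, 0);
  let threshold = rand * totalWeight; // point in [0, totalWeight)
  for (const server of servers) {
    threshold -= server.weight;
    if (threshold < 0) return server.host;
  }
  return servers[servers.length - 1].host; // guard against rounding
}
```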

Load Balancer Health Checks

// Load balancer health monitoring
const servers = [
  { host: "server1", healthy: true, connections: 10 },
  { host: "server2", healthy: true, connections: 5 },
  { host: "server3", healthy: false, connections: 0 },
];

function routeRequest(request) {
  const healthy = servers.filter((s) => s.healthy);

  if (healthy.length === 0) {
    throw new Error("No healthy servers");
  }

  // Choose server with least connections
  const target = healthy.reduce((a, b) =>
    a.connections < b.connections ? a : b,
  );

  target.connections++;
  return sendRequest(target.host, request).finally(() => target.connections--);
}

SLA Calculations

Service Level Agreements define expected availability. Calculating SLA helps you understand what your system needs to deliver.

SLA Composition

End-to-end SLA depends on the weakest component:

// Calculate combined SLA
function calculateSLA(slaValues) {
  // For independent components, combined availability
  // is the product of individual availabilities

  const combined = slaValues.reduce((acc, sla) => {
    // Convert percentage to decimal
    const availability = sla / 100;
    return acc * availability;
  }, 1);

  return (combined * 100).toFixed(2) + "%";
}

// Example: Load balancer (99.99%) + 3 app servers (99.9% each)
// = 99.99% * 99.9% * 99.9% * 99.9%
const sla = calculateSLA([99.99, 99.9, 99.9, 99.9]);
console.log(sla); // Output: 99.69%

Adding more components reduces overall SLA. This is why HA architectures prefer minimal chains of dependencies.
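
One caveat worth making explicit: the product rule applies to components in series, where each one is a hard dependency. Redundant replicas in parallel behave the opposite way — the system is down only when every replica is down. A sketch of both:

```javascript
// Serial: every component must be up, so availabilities multiply (shrink).
function serialAvailability(availabilities) {
  return availabilities.reduce((acc, a) => acc * a, 1);
}

// Parallel: redundant replicas; the system fails only if ALL replicas fail.
function parallelAvailability(availabilities) {
  const probabilityAllDown = availabilities.reduce((acc, a) => acc * (1 - a), 1);
  return 1 - probabilityAllDown;
}

// Two 99% servers in series vs in parallel:
console.log((serialAvailability([0.99, 0.99]) * 100).toFixed(2) + "%"); // 98.01%
console.log((parallelAvailability([0.99, 0.99]) * 100).toFixed(2) + "%"); // 99.99%
```

This assumes failures are independent, which correlated failures (shared power, shared deployments) routinely violate in practice.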

Planning for SLA

| Target SLA | Max Downtime/Year | Max Downtime/Month |
| --- | --- | --- |
| 99% | 3.65 days | 7.30 hours |
| 99.9% | 8.76 hours | 43.8 minutes |
| 99.99% | 52.6 minutes | 4.38 minutes |
| 99.999% | 5.26 minutes | 26.3 seconds |

99.999% (“five nines”) is extremely difficult. It allows only 5 minutes of downtime per year. Most systems target 99.9% or 99.99%.


Circuit Breakers

Circuit breakers prevent cascading failures. When a downstream service is failing, the circuit breaker trips and fast-fails requests instead of overwhelming the dying service.

class CircuitBreaker {
  constructor(failureThreshold = 5, timeout = 60000) {
    this.failureThreshold = failureThreshold;
    this.timeout = timeout;
    this.failures = 0;
    this.state = "CLOSED"; // CLOSED, OPEN, HALF_OPEN
    this.nextAttempt = 0;
  }

  async call(fn) {
    if (this.state === "OPEN") {
      if (Date.now() > this.nextAttempt) {
        this.state = "HALF_OPEN";
      } else {
        throw new Error("Circuit breaker is OPEN");
      }
    }

    try {
      const result = await fn();
      this.onSuccess();
      return result;
    } catch (error) {
      this.onFailure();
      throw error;
    }
  }

  onSuccess() {
    this.failures = 0;
    this.state = "CLOSED";
  }

  onFailure() {
    this.failures++;
    if (this.failures >= this.failureThreshold) {
      this.state = "OPEN";
      this.nextAttempt = Date.now() + this.timeout;
    }
  }
}

The circuit breaker gives failing services time to recover instead of being buried by retry storms.
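
Circuit breakers protect the failing service; on the calling side, retries should also be spread out so they do not arrive in lockstep. A minimal exponential-backoff-with-full-jitter sketch (function names are illustrative):

```javascript
// Exponential backoff with "full jitter": each retry waits a random delay
// drawn from [0, min(cap, base * 2^attempt)), so synchronized clients do
// not hammer a recovering service at the same instant.
function backoffDelayMs(attempt, { baseMs = 100, capMs = 30000, rand = Math.random() } = {}) {
  const exponentialMs = Math.min(capMs, baseMs * 2 ** attempt);
  return rand * exponentialMs;
}

async function callWithRetry(fn, maxAttempts = 5) {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (error) {
      if (attempt === maxAttempts - 1) throw error; // out of attempts
      await new Promise((resolve) => setTimeout(resolve, backoffDelayMs(attempt)));
    }
  }
}
```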


Graceful Degradation

When full functionality is impossible, provide partial functionality. This is graceful degradation.

// Graceful degradation example
async function getProductDetails(productId) {
  const fullDetails = {
    reviews: null,
    relatedProducts: null,
    priceHistory: null,
  };

  try {
    fullDetails.reviews = await getReviews(productId);
  } catch (error) {
    console.log("Reviews unavailable");
  }

  try {
    fullDetails.relatedProducts = await getRelated(productId);
  } catch (error) {
    console.log("Related products unavailable");
  }

  try {
    fullDetails.priceHistory = await getPriceHistory(productId);
  } catch (error) {
    console.log("Price history unavailable");
  }

  return fullDetails;
}

Users see available information immediately instead of facing a blank screen or error message.
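
The three lookups above are independent, so they can also run in parallel. A sketch using Promise.allSettled (which never rejects), with stand-in fetchers replacing the hypothetical getReviews/getRelated/getPriceHistory helpers:

```javascript
// Parallel graceful degradation: failed sections degrade to null instead
// of failing the whole page. The fetchers below are stand-ins.
const getReviews = async () => ["Great product"];
const getRelated = async () => {
  throw new Error("service down");
};
const getPriceHistory = async () => [9.99, 12.49];

async function getProductDetailsParallel(productId) {
  const [reviews, relatedProducts, priceHistory] = (
    await Promise.allSettled([
      getReviews(productId),
      getRelated(productId),
      getPriceHistory(productId),
    ])
  ).map((result) => (result.status === "fulfilled" ? result.value : null));

  return { reviews, relatedProducts, priceHistory };
}
```

The page latency becomes the slowest single fetch rather than the sum of all three.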



When to Use / When Not to Use

| Scenario | Recommendation |
| --- | --- |
| Critical systems (healthcare, finance) | High availability mandatory |
| SLA of 99.99%+ | Active-active with automatic failover |
| Cost-sensitive applications | Active-passive with manual failover |
| Stateless services | Load balancer with health checks |
| Stateful services with persistence | Database replication with failover |

When TO Use Active-Active

  • Multiple geographic regions: Users in different regions hit their nearest datacenter
  • High traffic volumes: Single active cannot handle load even with vertical scaling
  • Zero-downtime requirements: Failover happens instantly without any interruption
  • Read-heavy workloads: All replicas serve reads, dramatically increasing throughput

When NOT to Use Active-Active

  • Complex conflict resolution: When data has mutable state that is hard to partition
  • Strong consistency requirements: Synchronizing writes across active nodes is expensive
  • Limited budget: Running multiple active datacenters costs significantly more
  • Simple applications: The complexity cost outweighs availability benefits

Active-Active vs Active-Passive Decision Matrix

| Criteria | Active-Active | Active-Passive |
| --- | --- | --- |
| Cost | 2-3x (all sites active) | 1.5-2x (standby idle) |
| Complexity | High (conflict resolution, sync) | Medium (failover logic) |
| Failover Speed | Instant (traffic split) | 30s-5min (promotion) |
| Write Throughput | Higher (all nodes write) | Limited to primary |
| Data Consistency | Complex (multi-master sync) | Simple (async replication) |
| Geographic Diversity | Excellent (multi-region) | Good (standby in other DC) |
| Resource Utilization | High (all nodes busy) | Low (standby idle) |
| Rollback Complexity | Complex (already processing) | Simple (demote standby) |

Multi-Region Deployment Patterns

When deploying across multiple geographic regions, consider these patterns:

| Pattern | Description | Use Case |
| --- | --- | --- |
| Primary-Secondary | One region accepts writes, others replicate asynchronously | Read-heavy, geo-distributed users |
| Primary-Primary | All regions accept writes, conflict resolution required | Write-heavy, globally distributed users |
| CQRS + Global Traffic Routing | Separate read/write paths, route users to nearest region | Complex domains, maximum performance |
| Stateless Microservices + Regional Databases | Compute is stateless and globally distributed | Cloud-native, Kubernetes-based |

Cross-region replication considerations:

// Multi-region write latency expectations
const REPLICATION_LATENCY = {
  "us-east-1 to eu-west-1": "~100-150ms RTT",
  "us-east-1 to ap-southeast-1": "~200-250ms RTT",
  "eu-west-1 to ap-southeast-1": "~150-200ms RTT",
};

// Consistency vs latency tradeoff in multi-region
async function writeMultiRegion(key, value, options = {}) {
  const { consistencyLevel = "QUORUM", regions = ["us-east-1", "eu-west-1"] } =
    options;

  if (consistencyLevel === "ALL_REGIONS") {
    // Strongest consistency, highest latency
    const results = await Promise.all(
      regions.map((region) => writeToRegion(region, key, value)),
    );
    return results;
  } else if (consistencyLevel === "QUORUM") {
    // Majority of regions must acknowledge. Promise.all would wait for
    // every region, so instead count acks and resolve at the threshold.
    const needed = Math.floor(regions.length / 2) + 1;
    const acks = await new Promise((resolve, reject) => {
      let ackCount = 0;
      let failCount = 0;
      setTimeout(() => reject(new Error("quorum timeout")), 5000); // 5 second timeout
      for (const region of regions) {
        writeToRegion(region, key, value).then(
          () => {
            if (++ackCount === needed) resolve(ackCount);
          },
          () => {
            if (++failCount > regions.length - needed) {
              reject(new Error("quorum unreachable"));
            }
          },
        );
      }
    });
    return acks;
  } else {
    // Local region only, async replicate
    await writeToRegion("local", key, value);
    backgroundSyncToOtherRegions(regions);
    return { success: true, replicated: false };
  }
}

Stateful vs Stateless Failover Differences

| Aspect | Stateless Services | Stateful Services |
| --- | --- | --- |
| Failover Complexity | Low (just restart somewhere) | High (must preserve state) |
| State Recovery | None needed | Must recover from replica or WAL |
| Failover Time | 5-30 seconds (container restart) | 30s-5min (state recovery) |
| Data Loss Risk | None | Possible (unreplicated writes) |
| Scaling | Horizontal (easy) | Requires consistent hashing/sharding |
| Session Affinity | Not needed | Often required (or externalize state) |

Stateful failover sequence:

sequenceDiagram
    participant Primary as Primary DB
    participant Standby as Standby DB
    participant LB as Load Balancer
    participant App as Application

    Primary->>Standby: Replicate WAL continuously
    Note over Primary,Standby: Async replication with lag

    Primary->>LB: Health check OK
    App->>LB: Write request
    LB->>Primary: Route to primary

    Primary->>Primary: CRASH - stops responding
    Standby->>Primary: Health check fails
    Standby->>LB: "I am available"
    LB->>Standby: Promote to primary

    Note over App: Brief write failure during failover
    App->>LB: Retry write
    LB->>Standby: Route to new primary
    Standby->>App: Write acknowledged

    Note over Primary,Standby: Re-sync when old primary recovers

Kubernetes HA Patterns

For Kubernetes-based deployments:

| Pattern | Description | Trade-off |
| --- | --- | --- |
| PodDisruptionBudget (PDB) | Ensures minimum pods available during disruptions | May delay node drains |
| PodAntiAffinity | Spreads pods across nodes/AZs | Requires enough nodes |
| Multi-AZ vs multi-region | AZ failure = local; Region failure = full DR | AZ simpler, Region safer |
| StatefulSets | Ordered deployment, persistent storage | More complex than Deployments |

# Example: HA Kubernetes deployment with anti-affinity
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-server
spec:
  replicas: 3
  selector:
    matchLabels:
      app: api-server
  template:
    metadata:
      labels:
        app: api-server
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchExpressions:
                  - key: app
                    operator: In
                    values:
                      - api-server
              topologyKey: topology.kubernetes.io/zone
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: api-server
# Example: PodDisruptionBudget for minimum availability
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-server-pdb
spec:
  minAvailable: 2 # Always keep at least 2 pods running
  selector:
    matchLabels:
      app: api-server


Production Failure Scenarios

| Failure Scenario | Impact | Mitigation |
| --- | --- | --- |
| Load balancer instance failure | All traffic fails to backend | Run multiple LBs; health checks remove failed instances |
| Primary database failure | Write operations fail | Automatic failover to standby with health monitoring |
| DNS cache poisoning | Traffic routed to wrong/invalid IPs | Short TTLs; DNSSEC; use floating IP instead |
| Cascade failure | One component failure triggers others | Circuit breakers; bulkhead pattern; graceful degradation |
| Datacenter power failure | Entire site goes down | Multi-datacenter active-active; UPS and generators |
| Network partition between DCs | Split-brain risk | Quorum-based decisions; automatic failover lock |
| Disk full on primary | Database writes fail | Monitoring; auto-scaling storage; archive old data |

Failover Runbook

Active-Passive Failover Procedure

graph TD
    A[Alert: Primary Down] --> B{Automated Failover?}
    B -->|Yes| C[Health check confirms failure]
    C --> D[Promote standby to primary]
    D --> E[Update routing/DNS]
    E --> F[Verify replication lag < threshold]
    F --> G[Resume traffic]
    B -->|No| H[Page on-call engineer]
    H --> I[Engineer assesses situation]
    I --> J[Manual failover decision]
    J --> K[Follow runbook steps]
    K --> D

Checklist for Failover

- [ ] Verify primary is truly down (multiple checks)
- [ ] Confirm standby has caught up (replication lag check)
- [ ] Stop traffic to old primary (prevent split-brain)
- [ ] Promote standby
- [ ] Update DNS/routing to new primary
- [ ] Verify new primary is accepting writes
- [ ] Verify replicas are replicating to new primary
- [ ] Monitor error rates and latency post-failover
- [ ] Send status update to stakeholders
- [ ] Begin root cause analysis

Availability Math Calculator

Dependency Tree Analysis

For a system with multiple components, overall SLA is the product of individual SLAs:

System SLA = SLA_load_balancer × SLA_app_server × SLA_database × SLA_cache

Example:
- Load balancer: 99.99% (4 nines)
- 3 app servers (each 99.9%): 99.9%³ = 99.7%
- Primary database: 99.99%
- Read replica: 99.99%
- Cache: 99.9%

Combined = 0.9999 × 0.997 × 0.9999 × 0.9999 × 0.999
         = 0.9957 (99.57% - only 2 nines!)
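
To turn a combined figure like 99.57% back into a count of "nines", take the base-10 log of the unavailability; a small sketch:

```javascript
// Count the leading "nines" in an availability figure, e.g. 0.9957 -> 2.
// The small epsilon guards against floating-point error at exact boundaries.
function countNines(availability) {
  return Math.floor(-Math.log10(1 - availability) + 1e-9);
}

console.log(countNines(0.9957)); // 2 (the combined example above)
console.log(countNines(0.999)); // 3
console.log(countNines(0.9999)); // 4
```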

Nines and Downtime Reference

| SLA | Downtime/Year | Downtime/Month | Downtime/Week |
| --- | --- | --- | --- |
| 99% | 3.65 days | 7.3 hours | 1.68 hours |
| 99.9% | 8.76 hours | 43.8 min | 10.1 min |
| 99.99% | 52.6 min | 4.38 min | 1.01 min |
| 99.999% | 5.26 min | 26.3 sec | 6.05 sec |
| 99.9999% | 31.5 sec | 2.63 sec | 0.6 sec |

Chaos Engineering Checklist

Game Days for HA Patterns

Run these chaos experiments to validate availability patterns:

| Chaos Experiment | What It Validates | Success Criteria |
| --- | --- | --- |
| Kill random app server | Load balancer routes around failure | < 1% error rate |
| Terminate primary DB | Failover completes successfully | < 60s downtime |
| Network partition (single DC) | Quorum maintained | Reads continue, writes queued |
| Fill disk on replica | Monitoring detects issue | Alert < 5 min, graceful degradation |
| Restart all app servers simultaneously | Traffic spikes handled | Rate limiting prevents cascade |

Pre-Game Day Checklist

- [ ] Define steady-state hypothesis
- [ ] Establish baseline metrics
- [ ] Notify stakeholders of experiment window
- [ ] Verify recent backup is available
- [ ] Confirm rollback plan
- [ ] Have on-call ready to intervene
- [ ] Document expected outcome
- [ ] Establish abort criteria

Observability Checklist

Golden Signals to Monitor

| Signal | What to Measure | Alert Threshold |
| --- | --- | --- |
| Latency | P50, P95, P99 response time | P99 > 500ms |
| Traffic | Requests per second, concurrent connections | Unusual spikes/drops |
| Errors | Error rate by type (4xx, 5xx, timeouts) | Error rate > 1% |
| Saturation | CPU, memory, disk, connection pool utilization | > 80% |

Metrics to Capture

  • availability_uptime_percentage (gauge) - Per service, per datacenter
  • failover_duration_seconds (histogram) - Time to complete failover
  • health_check_failures_total (counter) - By check type
  • circuit_breaker_state (gauge) - Open/closed/half-open per circuit
  • dependency_health (gauge) - Status of each dependency (1=healthy, 0=down)

Alerts to Configure

| Alert | Threshold | Severity |
| --- | --- | --- |
| Service error rate > 1% | 1% for 5 minutes | Warning |
| Service error rate > 5% | 5% for 1 minute | Critical |
| Failover not completing in 60s | 60 seconds | Critical |
| Dependency health check failing | 2 consecutive failures | Warning |
| All health checks failing | 1 failure | Critical |

Security Checklist

  • Load balancer accessible only via TLS (HTTPS)
  • Health check endpoints authenticated
  • Failover mechanisms protected from unauthorized trigger
  • DNS records protected with short TTLs and DNSSEC
  • Cross-datacenter traffic encrypted
  • Secrets for failover mechanisms rotated regularly
  • Audit logging of all failover events

Common Pitfalls / Anti-Patterns

Pitfall 1: Designing for Five Nines Without Budget

Problem: Targeting 99.999% availability requires significant investment in redundancy, monitoring, and processes. Teams targeting it without the budget to support it often miss the target.

Solution: Start with 99.9% (three nines), measure what actually causes downtime, and improve incrementally. Each nine costs roughly 10x more than the previous.

Pitfall 2: Ignoring Dependency SLAs

Problem: Your service has 99.99% uptime but depends on a service with 99% uptime. Your actual availability is 99.99% × 99% = 98.99%.

Solution: Map all dependencies and calculate composite SLA. If a dependency is weak, add redundancy for that specific dependency or accept the lower composite SLA.

Pitfall 3: Automatic Failover Without Testing

Problem: Automatic failover sounds great until it triggers unnecessarily due to a false positive, causing a cascade of problems.

Solution: Test failover regularly (at least quarterly). Use manual failover for initial deployments until you have confidence in detection accuracy.

Pitfall 4: Forgetting About Session State

Problem: When failover happens, in-memory session state is lost. Users are logged out, shopping carts are emptied.

Solution: Externalize session state to Redis or similar. Use stateless request processing wherever possible. For sessions that must be local, use session affinity (with awareness of failover).
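
A sketch of externalized session state; the Map below stands in for a shared store such as Redis:

```javascript
// Sessions kept in an external store survive app-server failover: any
// instance can resume the session after the original server dies.
const externalSessionStore = new Map(); // stand-in for Redis or similar

function saveSession(sessionId, data) {
  externalSessionStore.set(sessionId, JSON.stringify(data));
}

function loadSession(sessionId) {
  const raw = externalSessionStore.get(sessionId);
  return raw ? JSON.parse(raw) : null;
}

saveSession("abc123", { userId: 42, cart: ["book"] });
console.log(loadSession("abc123").cart); // ["book"]
```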


Quick Recap

  • High availability requires intentional architecture with redundancy at every level.
  • Redundancy (active-active or active-passive) is the first line of defense.
  • Failover must be tested regularly. An untested failover plan is not a failover plan.
  • Circuit breakers prevent cascade failures from spreading.
  • Graceful degradation lets users accomplish core tasks even when features fail.
  • SLA math: add dependencies carefully. Each one reduces overall availability.

Copy/Paste Checklist

- [ ] Map all system dependencies and calculate composite SLA
- [ ] Implement redundancy at each critical component
- [ ] Configure health checks (not just liveness)
- [ ] Document and test failover runbook
- [ ] Schedule quarterly chaos/game days
- [ ] Implement circuit breakers on all downstream calls
- [ ] Externalize session state
- [ ] Monitor golden signals: latency, traffic, errors, saturation
- [ ] Set alerts with appropriate thresholds and escalation paths
- [ ] Budget for the availability target (each nine costs ~10x more)

The PACELC theorem post covers latency trade-offs that also affect availability decisions. The Consistency Models post discusses how consistency guarantees interact with availability.
