High Availability Patterns: Building Reliable Distributed Systems
Learn essential high availability patterns including redundancy, failover, load balancing, and SLA calculations. Practical strategies for building systems that stay online.
Availability measures how often a system is operational. High availability (HA) means the system stays up even when components fail. For critical systems, downtime costs money, reputation, and sometimes lives.
The CAP theorem tells us that during partitions, we choose between consistency and availability. High availability patterns help minimize the time spent in that trade-off by preventing failures and handling them gracefully when they occur.
This post covers practical patterns for building highly available systems.
What is High Availability?
High availability means systems that remain operational even when components fail. We measure it as a percentage of uptime over time:
```mermaid
graph LR
    A[99%] --> B[3.65 days downtime/year]
    A --> C[7.3 hours downtime/month]
    D[99.9%] --> E[8.76 hours downtime/year]
    D --> F[43.8 minutes downtime/month]
    G[99.99%] --> H[52.6 minutes downtime/year]
    G --> I[4.38 minutes downtime/month]
```
The “nines” matter. Each additional nine represents a tenfold reduction in downtime. Whether that matters depends on your business context. A video streaming service can probably survive 4 hours of downtime per year. A hospital’s monitoring system cannot.
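The conversion from nines to a downtime budget is simple enough to keep as a helper. This is my own illustrative sketch, not from any particular library:

```javascript
// Convert an availability percentage into an annual downtime budget.
const MINUTES_PER_YEAR = 365 * 24 * 60; // 525,600

function downtimeMinutesPerYear(availabilityPercent) {
  return MINUTES_PER_YEAR * (1 - availabilityPercent / 100);
}

console.log(downtimeMinutesPerYear(99).toFixed(0)); // 5256 minutes (~3.65 days)
console.log(downtimeMinutesPerYear(99.9).toFixed(0)); // 526 minutes (~8.76 hours)
console.log(downtimeMinutesPerYear(99.99).toFixed(1)); // 52.6 minutes
```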
Redundancy
The first line of defense against failure is having backup components. Redundancy means duplicating critical components so that if one fails, another takes over.
Types of Redundancy
Active-active: Multiple replicas serve traffic simultaneously. If one fails, others continue without interruption.
```javascript
// Active-active: every server handles live traffic
const servers = ["server1", "server2", "server3"];
let next = 0;

async function handleRequest(request) {
  // Rotate the starting server so load spreads across all replicas,
  // then fall through the remaining ones if the chosen server fails
  const start = next;
  next = (next + 1) % servers.length;
  for (let i = 0; i < servers.length; i++) {
    const server = servers[(start + i) % servers.length];
    try {
      return await sendToServer(server, request);
    } catch (error) {
      continue; // Try next replica
    }
  }
  throw new Error("All servers unavailable");
}
```
Active-passive: One primary server handles traffic. Standby servers are ready but not processing requests until failover.
```javascript
// Active-passive: standby takes over on primary failure
const primary = "primaryServer";
const standby = "standbyServer";

async function handleRequest(request) {
  try {
    return await sendToServer(primary, request);
  } catch (error) {
    // Failover to standby
    console.log("Primary failed, activating standby");
    return await sendToServer(standby, request);
  }
}
```
Active-active requires more complex conflict resolution but provides better resource utilization. Active-passive is simpler but wastes resources on idle standby capacity.
Failover Patterns
Failover is the process of switching from a failed component to a backup. Several patterns exist:
Automatic vs Manual Failover
Automatic failover detects failures and switches without human intervention. Manual failover requires an operator to trigger the switch.
Automatic is faster but riskier. If detection is imperfect, you might failover unnecessarily, causing a cascade of problems. Manual failover gives you control but introduces human delay.
For most production systems, a hybrid approach works: automatic detection with automatic failover for minor issues, manual intervention for major events.
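One way to encode that hybrid is a blast-radius rule: a single failed instance fails over automatically, while widespread "failures", which are more likely to be detection false positives, page a human. A sketch with hypothetical thresholds:

```javascript
// Hybrid failover policy: auto-heal small blasts, page humans for big ones.
// The thresholds here are illustrative, not prescriptive.
function failoverDecision(failedInstances, totalInstances) {
  if (failedInstances === 0) return "NO_ACTION";
  // A single failed instance is routine: fail over automatically.
  if (failedInstances === 1) return "AUTO_FAILOVER";
  // Half the fleet "failing" at once suggests a monitoring or network
  // problem rather than real failures -- require a human decision.
  if (failedInstances / totalInstances >= 0.5) return "PAGE_ONCALL";
  return "AUTO_FAILOVER_WITH_ALERT";
}
```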
Health Checks
Failover needs accurate failure detection:
```javascript
// Health check endpoint
app.get("/health", async (req, res) => {
  const health = {
    status: "ok",
    timestamp: Date.now(),
    checks: {
      database: await checkDatabase(),
      cache: await checkCache(),
      disk: await checkDiskSpace(),
    },
  };

  // Return unhealthy if any critical check fails
  const isHealthy =
    health.checks.database === "ok" && health.checks.cache === "ok";
  res.status(isHealthy ? 200 : 503).json(health);
});

async function checkDatabase() {
  try {
    await db.query("SELECT 1");
    return "ok";
  } catch (error) {
    return "failed";
  }
}
```
Health checks should verify actual functionality, not just process liveness. A process can be running but unable to serve requests.
Failover Time
Failover introduces latency. Components of failover time:
```mermaid
graph TD
    A[Failover Time] --> B[Failure Detection]
    A --> C[Decision to Failover]
    A --> D[DNS/Route Update]
    A --> E[New Instance Startup]
    A --> F[Health Check Propagation]
    B --> B1[Usually 5-30 seconds]
    D --> D1[Can be 30-60 seconds for DNS]
```
DNS-based failover is slow because DNS records are cached and TTLs can be 60 seconds or more. Floating IP or anycast approaches are faster.
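Summing those phases gives a rough worst-case budget. The numbers below are illustrative defaults, with the DNS step typically dominating:

```javascript
// Back-of-envelope failover time budget (illustrative numbers, in seconds).
const failoverBudget = {
  failureDetection: 15, // e.g. 3 failed probes at a 5s interval
  failoverDecision: 5,
  dnsOrRouteUpdate: 60, // bounded below by the DNS record TTL
  instanceStartup: 20,
  healthPropagation: 10,
};

const totalSeconds = Object.values(failoverBudget).reduce((a, b) => a + b, 0);
console.log(`Worst-case failover: ~${totalSeconds}s`);
```

Swapping DNS updates for a floating IP collapses the largest line item, which is the argument the paragraph above is making.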
Load Balancing
Load balancers distribute traffic across multiple servers. They also help with availability by routing around failed instances.
Load Balancing Algorithms
Round robin: Send each request to the next server in sequence. Simple but does not account for varying request complexity.
```javascript
// Round robin implementation
let currentIndex = 0;
const servers = ["server1", "server2", "server3"];

function getNextServer() {
  const server = servers[currentIndex];
  currentIndex = (currentIndex + 1) % servers.length;
  return server;
}
```
Least connections: Send new requests to the server with the fewest active connections. Better for varying request durations.
IP hash: Route requests from the same client IP to the same server. Useful when session state is stored locally.
Weighted: Assign weights to servers based on capacity. More powerful servers receive more traffic.
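A weighted pick can be implemented as a single pass over cumulative weights. This is a sketch of the idea; the injectable `rand` parameter is my addition, for testability:

```javascript
// Weighted selection: a server's share of traffic is proportional to its weight.
const pool = [
  { host: "big-server", weight: 3 },
  { host: "small-server", weight: 1 },
];

function pickWeighted(servers, rand = Math.random()) {
  const totalWeight = servers.reduce((sum, s) => sum + s.weight, 0);
  let threshold = rand * totalWeight; // a point in [0, totalWeight)
  for (const s of servers) {
    threshold -= s.weight;
    if (threshold < 0) return s.host;
  }
  return servers[servers.length - 1].host; // guard against float edge cases
}
```

With these weights, big-server receives roughly three requests for every one that small-server sees.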
Load Balancer Health Checks
```javascript
// Load balancer health monitoring
const servers = [
  { host: "server1", healthy: true, connections: 10 },
  { host: "server2", healthy: true, connections: 5 },
  { host: "server3", healthy: false, connections: 0 },
];

function routeRequest(request) {
  const healthy = servers.filter((s) => s.healthy);
  if (healthy.length === 0) {
    throw new Error("No healthy servers");
  }

  // Choose the healthy server with the fewest active connections
  const target = healthy.reduce((a, b) =>
    a.connections < b.connections ? a : b,
  );
  target.connections++;
  return sendRequest(target.host, request).finally(() => target.connections--);
}
```
SLA Calculations
Service Level Agreements define expected availability. Calculating SLA helps you understand what your system needs to deliver.
SLA Composition
End-to-end SLA depends on the weakest component:
```javascript
// Calculate combined SLA
function calculateSLA(slaValues) {
  // For independent components in series, combined availability
  // is the product of the individual availabilities
  const combined = slaValues.reduce((acc, sla) => {
    // Convert percentage to decimal
    const availability = sla / 100;
    return acc * availability;
  }, 1);
  return (combined * 100).toFixed(2) + "%";
}

// Example: Load balancer (99.99%) + 3 app servers (99.9% each)
// = 99.99% * 99.9% * 99.9% * 99.9%
const sla = calculateSLA([99.99, 99.9, 99.9, 99.9]);
console.log(sla); // Output: 99.69%
```
Adding more serial dependencies reduces overall SLA, which is why HA architectures keep dependency chains short. Note that the direction matters: components chained in series multiply failure exposure, while redundant replicas in parallel (where any one can serve the request) increase availability.
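The product rule is for serial chains, where every component must be up. Redundant replicas in parallel follow the opposite rule: the system is down only when all replicas are down, giving 1 − (1 − a)^n. Both in a few lines (my own illustration):

```javascript
// Serial chain: the request needs EVERY component, so availabilities multiply.
function serialAvailability(availabilities) {
  return availabilities.reduce((acc, a) => acc * a, 1);
}

// Parallel redundancy: the system is down only if ALL replicas are down.
function parallelAvailability(a, replicas) {
  return 1 - Math.pow(1 - a, replicas);
}

// Three 99.9% services chained in series: ~99.7%
console.log(serialAvailability([0.999, 0.999, 0.999]));
// Three 99.9% replicas in parallel: ~99.9999999%
console.log(parallelAvailability(0.999, 3));
```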
Planning for SLA
| Target SLA | Max Downtime/Year | Max Downtime/Month |
|---|---|---|
| 99% | 3.65 days | 7.30 hours |
| 99.9% | 8.76 hours | 43.8 minutes |
| 99.99% | 52.6 minutes | 4.38 minutes |
| 99.999% | 5.26 minutes | 26.3 seconds |
99.999% (“five nines”) is extremely difficult. It allows only 5 minutes of downtime per year. Most systems target 99.9% or 99.99%.
Circuit Breakers
Circuit breakers prevent cascading failures. When a downstream service is failing, the circuit breaker trips and fast-fails requests instead of overwhelming the dying service.
```javascript
class CircuitBreaker {
  constructor(failureThreshold = 5, timeout = 60000) {
    this.failureThreshold = failureThreshold;
    this.timeout = timeout;
    this.failures = 0;
    this.state = "CLOSED"; // CLOSED, OPEN, HALF_OPEN
    this.nextAttempt = 0;
  }

  async call(fn) {
    if (this.state === "OPEN") {
      if (Date.now() > this.nextAttempt) {
        this.state = "HALF_OPEN";
      } else {
        throw new Error("Circuit breaker is OPEN");
      }
    }
    try {
      const result = await fn();
      this.onSuccess();
      return result;
    } catch (error) {
      this.onFailure();
      throw error;
    }
  }

  onSuccess() {
    this.failures = 0;
    this.state = "CLOSED";
  }

  onFailure() {
    this.failures++;
    if (this.failures >= this.failureThreshold) {
      this.state = "OPEN";
      this.nextAttempt = Date.now() + this.timeout;
    }
  }
}
```
The circuit breaker gives failing services time to recover instead of being buried by retry storms.
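On the caller side, circuit breakers pair naturally with bounded, jittered retries; without jitter, synchronized clients regenerate exactly the retry storm described above. A minimal sketch (my own, not from the post's code):

```javascript
// Exponential backoff with full jitter: delay grows with each attempt,
// and randomization spreads synchronized clients apart.
function backoffDelayMs(attempt, baseMs = 100, capMs = 30000, rand = Math.random()) {
  const exp = Math.min(capMs, baseMs * 2 ** attempt);
  return rand * exp; // "full jitter": uniform in [0, exp)
}

async function retryWithBackoff(fn, maxAttempts = 5) {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (error) {
      if (attempt === maxAttempts - 1) throw error; // budget exhausted
      await new Promise((r) => setTimeout(r, backoffDelayMs(attempt)));
    }
  }
}
```

Wrapping `fn` in a circuit breaker as well keeps retries from ever reaching a service the breaker already knows is down.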
Graceful Degradation
When full functionality is impossible, provide partial functionality. This is graceful degradation.
```javascript
// Graceful degradation example
async function getProductDetails(productId) {
  const fullDetails = {
    reviews: null,
    relatedProducts: null,
    priceHistory: null,
  };

  try {
    fullDetails.reviews = await getReviews(productId);
  } catch (error) {
    console.log("Reviews unavailable");
  }

  try {
    fullDetails.relatedProducts = await getRelated(productId);
  } catch (error) {
    console.log("Related products unavailable");
  }

  try {
    fullDetails.priceHistory = await getPriceHistory(productId);
  } catch (error) {
    console.log("Price history unavailable");
  }

  return fullDetails;
}
```
Users see available information immediately instead of facing a blank screen or error message.
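The same pattern can be written more compactly with Promise.allSettled, which also fetches the optional sections in parallel instead of one after another. A sketch, where the `fetchers` object stands in for getReviews and friends:

```javascript
// Fetch optional sections in parallel; any section that fails degrades to null.
async function getProductDetailsParallel(productId, fetchers) {
  const keys = Object.keys(fetchers);
  const results = await Promise.allSettled(
    keys.map((key) => fetchers[key](productId)),
  );
  const details = {};
  keys.forEach((key, i) => {
    details[key] = results[i].status === "fulfilled" ? results[i].value : null;
  });
  return details;
}
```

Parallel fetching matters here: serial try/catch blocks stack their timeouts, so a page with three slow dependencies degrades three times slower than it needs to.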
When to Use / When Not to Use
| Scenario | Recommendation |
|---|---|
| Critical systems (healthcare, finance) | High availability mandatory |
| SLA of 99.99%+ | Active-active with automatic failover |
| Cost-sensitive applications | Active-passive with manual failover |
| Stateless services | Load balancer with health checks |
| Stateful services with persistence | Database replication with failover |
When TO Use Active-Active
- Multiple geographic regions: Users in different regions hit their nearest datacenter
- High traffic volumes: Single active cannot handle load even with vertical scaling
- Zero-downtime requirements: Failover happens instantly without any interruption
- Read-heavy workloads: All replicas serve reads, dramatically increasing throughput
When NOT to Use Active-Active
- Complex conflict resolution: When data has mutable state that is hard to partition
- Strong consistency requirements: Synchronizing writes across active nodes is expensive
- Limited budget: Running multiple active datacenters costs significantly more
- Simple applications: The complexity cost outweighs availability benefits
Active-Active vs Active-Passive Decision Matrix
| Criteria | Active-Active | Active-Passive |
|---|---|---|
| Cost | 2-3x (all sites active) | 1.5-2x (standby idle) |
| Complexity | High (conflict resolution, sync) | Medium (failover logic) |
| Failover Speed | Instant (traffic split) | 30s-5min (promotion) |
| Write Throughput | Higher (all nodes write) | Limited to primary |
| Data Consistency | Complex (multi-master sync) | Simple (async replication) |
| Geographic Diversity | Excellent (multi-region) | Good (standby in other DC) |
| Resource Utilization | High (all nodes busy) | Low (standby idle) |
| Rollback Complexity | Complex (already processing) | Simple (demote standby) |
Multi-Region Deployment Patterns
When deploying across multiple geographic regions, consider these patterns:
| Pattern | Description | Use Case |
|---|---|---|
| Primary-Secondary | One region accepts writes, others replicate asynchronously | Read-heavy, geo-distributed users |
| Primary-Primary | All regions accept writes, conflict resolution required | Write-heavy, globally distributed users |
| CQRS + Global Traffic Routing | Separate read/write paths, route users to nearest region | Complex domains, maximum performance |
| Stateless Microservices + Regional Databases | Compute is stateless and globally distributed | Cloud-native, Kubernetes-based |
Cross-region replication considerations:
```javascript
// Multi-region write latency expectations
const REPLICATION_LATENCY = {
  "us-east-1 to eu-west-1": "~100-150ms RTT",
  "us-east-1 to ap-southeast-1": "~200-250ms RTT",
  "eu-west-1 to ap-southeast-1": "~150-200ms RTT",
};

// Consistency vs latency tradeoff in multi-region
async function writeMultiRegion(key, value, options = {}) {
  const { consistencyLevel = "QUORUM", regions = ["us-east-1", "eu-west-1"] } =
    options;

  if (consistencyLevel === "ALL_REGIONS") {
    // Strongest consistency, highest latency: every region must acknowledge
    return await Promise.all(
      regions.map((region) => writeToRegion(region, key, value)),
    );
  } else if (consistencyLevel === "QUORUM") {
    // A majority of regions must acknowledge; resolve as soon as it does
    const needed = Math.floor(regions.length / 2) + 1;
    return await new Promise((resolve, reject) => {
      const acks = [];
      let failures = 0;
      for (const region of regions) {
        writeToRegion(region, key, value).then(
          (ack) => {
            acks.push(ack);
            if (acks.length === needed) resolve(acks);
          },
          () => {
            failures++;
            if (failures > regions.length - needed) {
              reject(new Error("Quorum not reached"));
            }
          },
        );
      }
    });
  } else {
    // Local region only, replicate asynchronously in the background
    await writeToRegion("local", key, value);
    backgroundSyncToOtherRegions(regions);
    return { success: true, replicated: false };
  }
}
```
Stateful vs Stateless Failover Differences
| Aspect | Stateless Services | Stateful Services |
|---|---|---|
| Failover Complexity | Low (just restart somewhere) | High (must preserve state) |
| State Recovery | None needed | Must recover from replica or WAL |
| Failover Time | 5-30 seconds (container restart) | 30s-5min (state recovery) |
| Data Loss Risk | None | Possible (unreplicated writes) |
| Scaling | Horizontal (easy) | Requires consistent hashing/sharding |
| Session Affinity | Not needed | Often required (or externalize state) |
Stateful failover sequence:
```mermaid
sequenceDiagram
    participant Primary as Primary DB
    participant Standby as Standby DB
    participant LB as Load Balancer
    participant App as Application
    Primary->>Standby: Replicate WAL continuously
    Note over Primary,Standby: Async replication with lag
    Primary->>LB: Health check OK
    App->>LB: Write request
    LB->>Primary: Route to primary
    Primary->>Primary: CRASH - stops responding
    Standby->>Primary: Health check fails
    Standby->>LB: "I am available"
    LB->>Standby: Promote to primary
    Note over App: Brief write failure during failover
    App->>LB: Retry write
    LB->>Standby: Route to new primary
    Standby->>App: Write acknowledged
    Note over Primary,Standby: Re-sync when old primary recovers
```
Kubernetes HA Patterns
For Kubernetes-based deployments:
| Pattern | Description | Trade-off |
|---|---|---|
| PodDisruptionBudget (PDB) | Ensures minimum pods available during disruptions | May delay node drains |
| PodAntiAffinity | Spreads pods across nodes/AZs | Requires enough nodes |
| multi-AZ vs multi-region | AZ failure = local; Region failure = full DR | AZ simpler, Region safer |
| StatefulSets | Ordered deployment, persistent storage | More complex than Deployments |
```yaml
# Example: HA Kubernetes deployment with anti-affinity
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-server
spec:
  replicas: 3
  selector:
    matchLabels:
      app: api-server
  template:
    metadata:
      labels:
        app: api-server
    spec:
      containers:
        - name: api-server
          image: api-server:latest # placeholder image
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchExpressions:
                  - key: app
                    operator: In
                    values:
                      - api-server
              topologyKey: topology.kubernetes.io/zone
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: api-server
---
# Example: PodDisruptionBudget for minimum availability
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-server-pdb
spec:
  minAvailable: 2 # Always keep at least 2 pods running
  selector:
    matchLabels:
      app: api-server
```
Production Failure Scenarios
| Failure Scenario | Impact | Mitigation |
|---|---|---|
| Load balancer instance failure | All traffic fails to backend | Run multiple LBs; health checks remove failed instances |
| Primary database failure | Write operations fail | Automatic failover to standby with health monitoring |
| DNS cache poisoning | Traffic routed to wrong/invalid IPs | Short TTLs; DNSSEC; use floating IP instead |
| Cascade failure | One component failure triggers others | Circuit breakers; bulkhead pattern; graceful degradation |
| Datacenter power failure | Entire site goes down | Multi-datacenter active-active; UPS and generators |
| Network partition between DCs | Split-brain risk | Quorum-based decisions; automatic failover lock |
| Disk full on primary | Database writes fail | Monitoring; auto-scaling storage; archive old data |
Failover Runbook
Active-Passive Failover Procedure
```mermaid
graph TD
    A[Alert: Primary Down] --> B{Automated Failover?}
    B -->|Yes| C[Health check confirms failure]
    C --> D[Promote standby to primary]
    D --> E[Update routing/DNS]
    E --> F[Verify replication lag < threshold]
    F --> G[Resume traffic]
    B -->|No| H[Page on-call engineer]
    H --> I[Engineer assesses situation]
    I --> J[Manual failover decision]
    J --> K[Follow runbook steps]
    K --> D
```
Checklist for Failover
- [ ] Verify primary is truly down (multiple checks)
- [ ] Confirm standby has caught up (replication lag check)
- [ ] Stop traffic to old primary (prevent split-brain)
- [ ] Promote standby
- [ ] Update DNS/routing to new primary
- [ ] Verify new primary is accepting writes
- [ ] Verify replicas are replicating to new primary
- [ ] Monitor error rates and latency post-failover
- [ ] Send status update to stakeholders
- [ ] Begin root cause analysis
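The first two checklist items are worth automating as a hard precondition in the failover script, so they cannot be skipped under pressure. The field names and thresholds below are hypothetical:

```javascript
// Promotion guard: refuse automatic promotion unless the primary is
// confirmed down and the standby has acceptably low replication lag.
function canPromote(status, opts = {}) {
  const { minFailedChecks = 3, maxLagMs = 1000 } = opts;
  if (status.primaryFailedChecks < minFailedChecks) {
    return { ok: false, reason: "primary not confirmed down" };
  }
  if (!status.standbyReachable) {
    return { ok: false, reason: "standby unreachable" };
  }
  if (status.replicationLagMs > maxLagMs) {
    return { ok: false, reason: "standby too far behind" };
  }
  return { ok: true };
}
```

A guard like this is what turns "verify primary is truly down (multiple checks)" from a human judgment call into an enforced invariant.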
Availability Math Calculator
Dependency Tree Analysis
For a system with multiple components, overall SLA is the product of individual SLAs:
System SLA = SLA_load_balancer × SLA_app_server × SLA_database × SLA_cache
Example:
- Load balancer: 99.99% (4 nines)
- 3 app servers (each 99.9%): 99.9%³ = 99.7%
- Primary database: 99.99%
- Read replica: 99.99%
- Cache: 99.9%
Combined = 0.9999 × 0.997 × 0.9999 × 0.9999 × 0.999
= 0.9957 (99.57% - only 2 nines!)
Nines and Downtime Reference
| SLA | Downtime/Year | Downtime/Month | Downtime/Week |
|---|---|---|---|
| 99% | 3.65 days | 7.3 hours | 1.68 hours |
| 99.9% | 8.76 hours | 43.8 min | 10.1 min |
| 99.99% | 52.6 min | 4.38 min | 1.01 min |
| 99.999% | 5.26 min | 26.3 sec | 6.05 sec |
| 99.9999% | 31.5 sec | 2.63 sec | 0.6 sec |
Chaos Engineering Checklist
Game Days for HA Patterns
Run these chaos experiments to validate availability patterns:
| Chaos Experiment | What It Validates | Success Criteria |
|---|---|---|
| Kill random app server | Load balancer routes around failure | < 1% error rate |
| Terminate primary DB | Failover completes successfully | < 60s downtime |
| Network partition (single DC) | Quorum maintained | Reads continue, writes queued |
| Fill disk on replica | Monitoring detects issue | Alert < 5 min, graceful degradation |
| Restart all app servers simultaneously | Traffic spikes handled | Rate limiting prevents cascade |
Pre-Game Day Checklist
- [ ] Define steady-state hypothesis
- [ ] Establish baseline metrics
- [ ] Notify stakeholders of experiment window
- [ ] Verify recent backup is available
- [ ] Confirm rollback plan
- [ ] Have on-call ready to intervene
- [ ] Document expected outcome
- [ ] Establish abort criteria
Observability Checklist
Golden Signals to Monitor
| Signal | What to Measure | Alert Threshold |
|---|---|---|
| Latency | P50, P95, P99 response time | P99 > 500ms |
| Traffic | Requests per second, concurrent connections | Unusual spikes/drops |
| Errors | Error rate by type (4xx, 5xx, timeouts) | Error rate > 1% |
| Saturation | CPU, memory, disk, connection pool utilization | > 80% |
Metrics to Capture
- availability_uptime_percentage (gauge) - Per service, per datacenter
- failover_duration_seconds (histogram) - Time to complete failover
- health_check_failures_total (counter) - By check type
- circuit_breaker_state (gauge) - Open/closed/half-open per circuit
- dependency_health (gauge) - Status of each dependency (1=healthy, 0=down)
Alerts to Configure
| Alert | Threshold | Severity |
|---|---|---|
| Service error rate > 1% | 1% for 5 minutes | Warning |
| Service error rate > 5% | 5% for 1 minute | Critical |
| Failover not completing in 60s | 60 seconds | Critical |
| Dependency health check failing | 2 consecutive failures | Warning |
| All health checks failing | 1 consecutive failure | Critical |
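A sliding-window counter is enough to drive the error-rate alerts above. A sketch of the mechanism (timestamps are injectable for testing; a real deployment would use its metrics pipeline instead):

```javascript
// Sliding-window error rate, matching alerts like "error rate > 1% for 5 minutes".
class ErrorRateWindow {
  constructor(windowMs = 5 * 60 * 1000) {
    this.windowMs = windowMs;
    this.events = []; // { timestamp, isError }
  }

  record(isError, timestamp = Date.now()) {
    this.events.push({ timestamp, isError });
  }

  rate(now = Date.now()) {
    // Drop events that have aged out of the window
    this.events = this.events.filter((e) => now - e.timestamp < this.windowMs);
    if (this.events.length === 0) return 0;
    const errors = this.events.filter((e) => e.isError).length;
    return errors / this.events.length;
  }
}
```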
Security Checklist
- Load balancer accessible only via TLS (HTTPS)
- Health check endpoints authenticated
- Failover mechanisms protected from unauthorized trigger
- DNS records protected with short TTLs and DNSSEC
- Cross-datacenter traffic encrypted
- Secrets for failover mechanisms rotated regularly
- Audit logging of all failover events
Common Pitfalls / Anti-Patterns
Pitfall 1: Designing for Five Nines Without Budget
Problem: Targeting 99.999% availability requires significant investment in redundancy, monitoring, and processes. Teams targeting it without the budget to support it often miss the target.
Solution: Start with 99.9% (three nines), measure what actually causes downtime, and improve incrementally. Each nine costs roughly 10x more than the previous.
Pitfall 2: Ignoring Dependency SLAs
Problem: Your service has 99.99% uptime but depends on a service with 99% uptime. Your actual availability is 99.99% × 99% = 98.99%.
Solution: Map all dependencies and calculate composite SLA. If a dependency is weak, add redundancy for that specific dependency or accept the lower composite SLA.
Pitfall 3: Automatic Failover Without Testing
Problem: Automatic failover sounds great until it triggers unnecessarily due to a false positive, causing a cascade of problems.
Solution: Test failover regularly (at least quarterly). Use manual failover for initial deployments until you have confidence in detection accuracy.
Pitfall 4: Forgetting About Session State
Problem: When failover happens, in-memory session state is lost. Users are logged out, shopping carts are emptied.
Solution: Externalize session state to Redis or similar. Use stateless request processing wherever possible. For sessions that must be local, use session affinity (with awareness of failover).
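A sketch of the externalized-store shape, with an in-process Map standing in for Redis (in production the store must of course live outside the app servers, or it dies with them):

```javascript
// Sessions kept in a shared store so ANY server can resume them after
// failover. The Map here is a stand-in for Redis or similar.
const sessionStore = new Map();

function saveSession(sessionId, data, ttlMs = 30 * 60 * 1000) {
  sessionStore.set(sessionId, { data, expiresAt: Date.now() + ttlMs });
}

function loadSession(sessionId) {
  const entry = sessionStore.get(sessionId);
  if (!entry || entry.expiresAt <= Date.now()) return null; // missing or expired
  return entry.data;
}

// After failover, a different app server presents the same session cookie
// to loadSession and picks up where the failed server left off.
```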
Quick Recap
- High availability requires intentional architecture with redundancy at every level.
- Redundancy (active-active or active-passive) is the first line of defense.
- Failover must be tested regularly. An untested failover plan is not a failover plan.
- Circuit breakers prevent cascade failures from spreading.
- Graceful degradation lets users accomplish core tasks even when features fail.
- SLA math: add dependencies carefully. Each one reduces overall availability.
Copy/Paste Checklist
- [ ] Map all system dependencies and calculate composite SLA
- [ ] Implement redundancy at each critical component
- [ ] Configure health checks (not just liveness)
- [ ] Document and test failover runbook
- [ ] Schedule quarterly chaos/game days
- [ ] Implement circuit breakers on all downstream calls
- [ ] Externalize session state
- [ ] Monitor golden signals: latency, traffic, errors, saturation
- [ ] Set alerts with appropriate thresholds and escalation paths
- [ ] Budget for the availability target (each nine costs ~10x more)
The PACELC theorem post covers latency trade-offs that also affect availability decisions. The Consistency Models post discusses how consistency guarantees interact with availability.