High Availability Patterns: Building Reliable Distributed Systems
Learn essential high availability patterns including redundancy, failover, load balancing, and SLA calculations. Practical strategies for building systems that stay online.
Availability measures how often a system is operational. High availability (HA) means the system stays up even when components fail. For critical systems, downtime costs money, reputation, and sometimes lives.
The CAP theorem tells us that during partitions, we choose between consistency and availability. High availability patterns help minimize the time spent in that trade-off by preventing failures and handling them gracefully when they occur.
This post covers practical patterns for building highly available systems.
What is High Availability?
High availability means systems that remain operational even when components fail. We measure it as a percentage of uptime over time:
```mermaid
graph LR
    A[99%] --> B[3.65 days downtime/year]
    A --> C[7.3 hours downtime/month]
    D[99.9%] --> E[8.76 hours downtime/year]
    D --> F[43.8 minutes downtime/month]
    G[99.99%] --> H[52.6 minutes downtime/year]
    G --> I[4.38 minutes downtime/month]
```
The “nines” matter. Each additional nine represents a tenfold reduction in downtime. Whether that matters depends on your business context. A video streaming service can probably survive 4 hours of downtime per year. A hospital’s monitoring system cannot.
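The conversion from nines to a downtime budget is simple enough to keep as a helper. This is my own illustrative sketch, not from any particular library:

```javascript
// Convert an availability percentage into an annual downtime budget.
const MINUTES_PER_YEAR = 365 * 24 * 60; // 525,600

function downtimeMinutesPerYear(availabilityPercent) {
  return MINUTES_PER_YEAR * (1 - availabilityPercent / 100);
}

console.log(downtimeMinutesPerYear(99).toFixed(0)); // 5256 minutes (~3.65 days)
console.log(downtimeMinutesPerYear(99.9).toFixed(0)); // 526 minutes (~8.76 hours)
console.log(downtimeMinutesPerYear(99.99).toFixed(1)); // 52.6 minutes
```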
Redundancy
The first line of defense against failure is having backup components. Redundancy means duplicating critical components so that if one fails, another takes over.
Types of Redundancy
Active-active: Multiple replicas serve traffic simultaneously. If one fails, others continue without interruption.
```javascript
// Active-active: every server handles live traffic
const servers = ["server1", "server2", "server3"];
let next = 0;

async function handleRequest(request) {
  // Rotate the starting server so load spreads across all replicas,
  // then fall through the remaining ones if the chosen server fails
  const start = next;
  next = (next + 1) % servers.length;
  for (let i = 0; i < servers.length; i++) {
    const server = servers[(start + i) % servers.length];
    try {
      return await sendToServer(server, request);
    } catch (error) {
      continue; // Try next replica
    }
  }
  throw new Error("All servers unavailable");
}
```
Active-passive: One primary server handles traffic. Standby servers are ready but not processing requests until failover.
```javascript
// Active-passive: standby takes over on primary failure
const primary = "primaryServer";
const standby = "standbyServer";

async function handleRequest(request) {
  try {
    return await sendToServer(primary, request);
  } catch (error) {
    // Failover to standby
    console.log("Primary failed, activating standby");
    return await sendToServer(standby, request);
  }
}
```
Active-active requires more complex conflict resolution but provides better resource utilization. Active-passive is simpler but wastes resources on idle standby capacity.
Failover Patterns
Failover is the process of switching from a failed component to a backup. Several patterns exist:
Automatic vs Manual Failover
Automatic failover detects failures and switches without human intervention. Manual failover requires an operator to trigger the switch.
Automatic is faster but riskier. If detection is imperfect, you might failover unnecessarily, causing a cascade of problems. Manual failover gives you control but introduces human delay.
For most production systems, a hybrid approach works: automatic detection with automatic failover for minor issues, manual intervention for major events.
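One way to encode that hybrid is a blast-radius rule: a single failed instance fails over automatically, while widespread "failures", which are more likely to be detection false positives, page a human. A sketch with hypothetical thresholds:

```javascript
// Hybrid failover policy: auto-heal small blasts, page humans for big ones.
// The thresholds here are illustrative, not prescriptive.
function failoverDecision(failedInstances, totalInstances) {
  if (failedInstances === 0) return "NO_ACTION";
  // A single failed instance is routine: fail over automatically.
  if (failedInstances === 1) return "AUTO_FAILOVER";
  // Half the fleet "failing" at once suggests a monitoring or network
  // problem rather than real failures -- require a human decision.
  if (failedInstances / totalInstances >= 0.5) return "PAGE_ONCALL";
  return "AUTO_FAILOVER_WITH_ALERT";
}
```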
Health Checks
Failover needs accurate failure detection:
```javascript
// Health check endpoint
app.get("/health", async (req, res) => {
  const health = {
    status: "ok",
    timestamp: Date.now(),
    checks: {
      database: await checkDatabase(),
      cache: await checkCache(),
      disk: await checkDiskSpace(),
    },
  };

  // Return unhealthy if any critical check fails
  const isHealthy =
    health.checks.database === "ok" && health.checks.cache === "ok";
  res.status(isHealthy ? 200 : 503).json(health);
});

async function checkDatabase() {
  try {
    await db.query("SELECT 1");
    return "ok";
  } catch (error) {
    return "failed";
  }
}
```
Health checks should verify actual functionality, not just process liveness. A process can be running but unable to serve requests.
Failover Time
Failover introduces latency. Components of failover time:
```mermaid
graph TD
    A[Failover Time] --> B[Failure Detection]
    A --> C[Decision to Failover]
    A --> D[DNS/Route Update]
    A --> E[New Instance Startup]
    A --> F[Health Check Propagation]
    B --> B1[Usually 5-30 seconds]
    D --> D1[Can be 30-60 seconds for DNS]
```
DNS-based failover is slow because DNS records are cached and TTLs can be 60 seconds or more. Floating IP or anycast approaches are faster.
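Summing those phases gives a rough worst-case budget. The numbers below are illustrative defaults, with the DNS step typically dominating:

```javascript
// Back-of-envelope failover time budget (illustrative numbers, in seconds).
const failoverBudget = {
  failureDetection: 15, // e.g. 3 failed probes at a 5s interval
  failoverDecision: 5,
  dnsOrRouteUpdate: 60, // bounded below by the DNS record TTL
  instanceStartup: 20,
  healthPropagation: 10,
};

const totalSeconds = Object.values(failoverBudget).reduce((a, b) => a + b, 0);
console.log(`Worst-case failover: ~${totalSeconds}s`);
```

Swapping DNS updates for a floating IP collapses the largest line item, which is the argument the paragraph above is making.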
Load Balancing
Load balancers distribute traffic across multiple servers. They also help with availability by routing around failed instances.
Load Balancing Algorithms
Round robin: Send each request to the next server in sequence. Simple but does not account for varying request complexity.
```javascript
// Round robin implementation
let currentIndex = 0;
const servers = ["server1", "server2", "server3"];

function getNextServer() {
  const server = servers[currentIndex];
  currentIndex = (currentIndex + 1) % servers.length;
  return server;
}
```
Least connections: Send new requests to the server with the fewest active connections. Better for varying request durations.
IP hash: Route requests from the same client IP to the same server. Useful when session state is stored locally.
Weighted: Assign weights to servers based on capacity. More powerful servers receive more traffic.
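A weighted pick can be implemented as a single pass over cumulative weights. This is a sketch of the idea; the injectable `rand` parameter is my addition, for testability:

```javascript
// Weighted selection: a server's share of traffic is proportional to its weight.
const pool = [
  { host: "big-server", weight: 3 },
  { host: "small-server", weight: 1 },
];

function pickWeighted(servers, rand = Math.random()) {
  const totalWeight = servers.reduce((sum, s) => sum + s.weight, 0);
  let threshold = rand * totalWeight; // a point in [0, totalWeight)
  for (const s of servers) {
    threshold -= s.weight;
    if (threshold < 0) return s.host;
  }
  return servers[servers.length - 1].host; // guard against float edge cases
}
```

With these weights, big-server receives roughly three requests for every one that small-server sees.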
Load Balancer Health Checks
```javascript
// Load balancer health monitoring
const servers = [
  { host: "server1", healthy: true, connections: 10 },
  { host: "server2", healthy: true, connections: 5 },
  { host: "server3", healthy: false, connections: 0 },
];

function routeRequest(request) {
  const healthy = servers.filter((s) => s.healthy);
  if (healthy.length === 0) {
    throw new Error("No healthy servers");
  }

  // Choose the healthy server with the fewest active connections
  const target = healthy.reduce((a, b) =>
    a.connections < b.connections ? a : b,
  );
  target.connections++;
  return sendRequest(target.host, request).finally(() => target.connections--);
}
```
SLA Calculations
Service Level Agreements define expected availability. Calculating SLA helps you understand what your system needs to deliver.
SLA Composition
End-to-end SLA depends on the weakest component:
```javascript
// Calculate combined SLA
function calculateSLA(slaValues) {
  // For independent components in series, combined availability
  // is the product of the individual availabilities
  const combined = slaValues.reduce((acc, sla) => {
    // Convert percentage to decimal
    const availability = sla / 100;
    return acc * availability;
  }, 1);
  return (combined * 100).toFixed(2) + "%";
}

// Example: Load balancer (99.99%) + 3 app servers (99.9% each)
// = 99.99% * 99.9% * 99.9% * 99.9%
const sla = calculateSLA([99.99, 99.9, 99.9, 99.9]);
console.log(sla); // Output: 99.69%
```
Adding more serial dependencies reduces overall SLA, which is why HA architectures keep dependency chains short. Note that the direction matters: components chained in series multiply failure exposure, while redundant replicas in parallel (where any one can serve the request) increase availability.
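The product rule is for serial chains, where every component must be up. Redundant replicas in parallel follow the opposite rule: the system is down only when all replicas are down, giving 1 − (1 − a)^n. Both in a few lines (my own illustration):

```javascript
// Serial chain: the request needs EVERY component, so availabilities multiply.
function serialAvailability(availabilities) {
  return availabilities.reduce((acc, a) => acc * a, 1);
}

// Parallel redundancy: the system is down only if ALL replicas are down.
function parallelAvailability(a, replicas) {
  return 1 - Math.pow(1 - a, replicas);
}

// Three 99.9% services chained in series: ~99.7%
console.log(serialAvailability([0.999, 0.999, 0.999]));
// Three 99.9% replicas in parallel: ~99.9999999%
console.log(parallelAvailability(0.999, 3));
```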
Planning for SLA
| Target SLA | Max Downtime/Year | Max Downtime/Month |
|---|---|---|
| 99% | 3.65 days | 7.30 hours |
| 99.9% | 8.76 hours | 43.8 minutes |
| 99.99% | 52.6 minutes | 4.38 minutes |
| 99.999% | 5.26 minutes | 26.3 seconds |
99.999% (“five nines”) is extremely difficult. It allows only 5 minutes of downtime per year. Most systems target 99.9% or 99.99%.
Circuit Breakers
Circuit breakers prevent cascading failures. When a downstream service is failing, the circuit breaker trips and fast-fails requests instead of overwhelming the dying service.
```javascript
class CircuitBreaker {
  constructor(failureThreshold = 5, timeout = 60000) {
    this.failureThreshold = failureThreshold;
    this.timeout = timeout;
    this.failures = 0;
    this.state = "CLOSED"; // CLOSED, OPEN, HALF_OPEN
    this.nextAttempt = 0;
  }

  async call(fn) {
    if (this.state === "OPEN") {
      if (Date.now() > this.nextAttempt) {
        this.state = "HALF_OPEN";
      } else {
        throw new Error("Circuit breaker is OPEN");
      }
    }
    try {
      const result = await fn();
      this.onSuccess();
      return result;
    } catch (error) {
      this.onFailure();
      throw error;
    }
  }

  onSuccess() {
    this.failures = 0;
    this.state = "CLOSED";
  }

  onFailure() {
    this.failures++;
    if (this.failures >= this.failureThreshold) {
      this.state = "OPEN";
      this.nextAttempt = Date.now() + this.timeout;
    }
  }
}
```
The circuit breaker gives failing services time to recover instead of being buried by retry storms.
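On the caller side, circuit breakers pair naturally with bounded, jittered retries; without jitter, synchronized clients regenerate exactly the retry storm described above. A minimal sketch (my own, not from the post's code):

```javascript
// Exponential backoff with full jitter: delay grows with each attempt,
// and randomization spreads synchronized clients apart.
function backoffDelayMs(attempt, baseMs = 100, capMs = 30000, rand = Math.random()) {
  const exp = Math.min(capMs, baseMs * 2 ** attempt);
  return rand * exp; // "full jitter": uniform in [0, exp)
}

async function retryWithBackoff(fn, maxAttempts = 5) {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (error) {
      if (attempt === maxAttempts - 1) throw error; // budget exhausted
      await new Promise((r) => setTimeout(r, backoffDelayMs(attempt)));
    }
  }
}
```

Wrapping `fn` in a circuit breaker as well keeps retries from ever reaching a service the breaker already knows is down.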
Graceful Degradation
When full functionality is impossible, provide partial functionality. This is graceful degradation.
```javascript
// Graceful degradation example
async function getProductDetails(productId) {
  const fullDetails = {
    reviews: null,
    relatedProducts: null,
    priceHistory: null,
  };

  try {
    fullDetails.reviews = await getReviews(productId);
  } catch (error) {
    console.log("Reviews unavailable");
  }

  try {
    fullDetails.relatedProducts = await getRelated(productId);
  } catch (error) {
    console.log("Related products unavailable");
  }

  try {
    fullDetails.priceHistory = await getPriceHistory(productId);
  } catch (error) {
    console.log("Price history unavailable");
  }

  return fullDetails;
}
```
Users see available information immediately instead of facing a blank screen or error message.
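The same pattern can be written more compactly with Promise.allSettled, which also fetches the optional sections in parallel instead of one after another. A sketch, where the `fetchers` object stands in for getReviews and friends:

```javascript
// Fetch optional sections in parallel; any section that fails degrades to null.
async function getProductDetailsParallel(productId, fetchers) {
  const keys = Object.keys(fetchers);
  const results = await Promise.allSettled(
    keys.map((key) => fetchers[key](productId)),
  );
  const details = {};
  keys.forEach((key, i) => {
    details[key] = results[i].status === "fulfilled" ? results[i].value : null;
  });
  return details;
}
```

Parallel fetching matters here: serial try/catch blocks stack their timeouts, so a page with three slow dependencies degrades three times slower than it needs to.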
When to Use / When Not to Use
| Scenario | Recommendation |
|---|---|
| Critical systems (healthcare, finance) | High availability mandatory |
| SLA of 99.99%+ | Active-active with automatic failover |
| Cost-sensitive applications | Active-passive with manual failover |
| Stateless services | Load balancer with health checks |
| Stateful services with persistence | Database replication with failover |
When TO Use Active-Active
- Multiple geographic regions: Users in different regions hit their nearest datacenter
- High traffic volumes: Single active cannot handle load even with vertical scaling
- Zero-downtime requirements: Failover happens instantly without any interruption
- Read-heavy workloads: All replicas serve reads, dramatically increasing throughput
When NOT to Use Active-Active
- Complex conflict resolution: When data has mutable state that is hard to partition
- Strong consistency requirements: Synchronizing writes across active nodes is expensive
- Limited budget: Running multiple active datacenters costs significantly more
- Simple applications: The complexity cost outweighs availability benefits
Active-Active vs Active-Passive Decision Matrix
| Criteria | Active-Active | Active-Passive |
|---|---|---|
| Cost | 2-3x (all sites active) | 1.5-2x (standby idle) |
| Complexity | High (conflict resolution, sync) | Medium (failover logic) |
| Failover Speed | Instant (traffic split) | 30s-5min (promotion) |
| Write Throughput | Higher (all nodes write) | Limited to primary |
| Data Consistency | Complex (multi-master sync) | Simple (async replication) |
| Geographic Diversity | Excellent (multi-region) | Good (standby in other DC) |
| Resource Utilization | High (all nodes busy) | Low (standby idle) |
| Rollback Complexity | Complex (already processing) | Simple (demote standby) |
Multi-Region Deployment Patterns
When deploying across multiple geographic regions, consider these patterns:
| Pattern | Description | Use Case |
|---|---|---|
| Primary-Secondary | One region accepts writes, others replicate asynchronously | Read-heavy, geo-distributed users |
| Primary-Primary | All regions accept writes, conflict resolution required | Write-heavy, globally distributed users |
| CQRS + Global Traffic Routing | Separate read/write paths, route users to nearest region | Complex domains, maximum performance |
| Stateless Microservices + Regional Databases | Compute is stateless and globally distributed | Cloud-native, Kubernetes-based |
Cross-region replication considerations:
```javascript
// Multi-region write latency expectations
const REPLICATION_LATENCY = {
  "us-east-1 to eu-west-1": "~100-150ms RTT",
  "us-east-1 to ap-southeast-1": "~200-250ms RTT",
  "eu-west-1 to ap-southeast-1": "~150-200ms RTT",
};

// Consistency vs latency tradeoff in multi-region
async function writeMultiRegion(key, value, options = {}) {
  const { consistencyLevel = "QUORUM", regions = ["us-east-1", "eu-west-1"] } =
    options;

  if (consistencyLevel === "ALL_REGIONS") {
    // Strongest consistency, highest latency: every region must acknowledge
    return await Promise.all(
      regions.map((region) => writeToRegion(region, key, value)),
    );
  } else if (consistencyLevel === "QUORUM") {
    // A majority of regions must acknowledge; resolve as soon as it does
    const needed = Math.floor(regions.length / 2) + 1;
    return await new Promise((resolve, reject) => {
      const acks = [];
      let failures = 0;
      for (const region of regions) {
        writeToRegion(region, key, value).then(
          (ack) => {
            acks.push(ack);
            if (acks.length === needed) resolve(acks);
          },
          () => {
            failures++;
            if (failures > regions.length - needed) {
              reject(new Error("Quorum not reached"));
            }
          },
        );
      }
    });
  } else {
    // Local region only, replicate asynchronously in the background
    await writeToRegion("local", key, value);
    backgroundSyncToOtherRegions(regions);
    return { success: true, replicated: false };
  }
}
```
Stateful vs Stateless Failover Differences
| Aspect | Stateless Services | Stateful Services |
|---|---|---|
| Failover Complexity | Low (just restart somewhere) | High (must preserve state) |
| State Recovery | None needed | Must recover from replica or WAL |
| Failover Time | 5-30 seconds (container restart) | 30s-5min (state recovery) |
| Data Loss Risk | None | Possible (unreplicated writes) |
| Scaling | Horizontal (easy) | Requires consistent hashing/sharding |
| Session Affinity | Not needed | Often required (or externalize state) |
Stateful failover sequence:
```mermaid
sequenceDiagram
    participant Primary as Primary DB
    participant Standby as Standby DB
    participant LB as Load Balancer
    participant App as Application
    Primary->>Standby: Replicate WAL continuously
    Note over Primary,Standby: Async replication with lag
    Primary->>LB: Health check OK
    App->>LB: Write request
    LB->>Primary: Route to primary
    Primary->>Primary: CRASH - stops responding
    Standby->>Primary: Health check fails
    Standby->>LB: "I am available"
    LB->>Standby: Promote to primary
    Note over App: Brief write failure during failover
    App->>LB: Retry write
    LB->>Standby: Route to new primary
    Standby->>App: Write acknowledged
    Note over Primary,Standby: Re-sync when old primary recovers
```
Kubernetes HA Patterns
For Kubernetes-based deployments:
| Pattern | Description | Trade-off |
|---|---|---|
| PodDisruptionBudget (PDB) | Ensures minimum pods available during disruptions | May delay node drains |
| PodAntiAffinity | Spreads pods across nodes/AZs | Requires enough nodes |
| multi-AZ vs multi-region | AZ failure = local; Region failure = full DR | AZ simpler, Region safer |
| StatefulSets | Ordered deployment, persistent storage | More complex than Deployments |
```yaml
# Example: HA Kubernetes deployment with anti-affinity
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-server
spec:
  replicas: 3
  selector:
    matchLabels:
      app: api-server
  template:
    metadata:
      labels:
        app: api-server
    spec:
      containers:
        - name: api-server
          image: api-server:latest # placeholder image
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchExpressions:
                  - key: app
                    operator: In
                    values:
                      - api-server
              topologyKey: topology.kubernetes.io/zone
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: api-server
---
# Example: PodDisruptionBudget for minimum availability
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-server-pdb
spec:
  minAvailable: 2 # Always keep at least 2 pods running
  selector:
    matchLabels:
      app: api-server
```
Production Failure Scenarios
| Failure Scenario | Impact | Mitigation |
|---|---|---|
| Load balancer instance failure | All traffic fails to backend | Run multiple LBs; health checks remove failed instances |
| Primary database failure | Write operations fail | Automatic failover to standby with health monitoring |
| DNS cache poisoning | Traffic routed to wrong/invalid IPs | Short TTLs; DNSSEC; use floating IP instead |
| Cascade failure | One component failure triggers others | Circuit breakers; bulkhead pattern; graceful degradation |
| Datacenter power failure | Entire site goes down | Multi-datacenter active-active; UPS and generators |
| Network partition between DCs | Split-brain risk | Quorum-based decisions; automatic failover lock |
| Disk full on primary | Database writes fail | Monitoring; auto-scaling storage; archive old data |
Failover Runbook
Active-Passive Failover Procedure
```mermaid
graph TD
    A[Alert: Primary Down] --> B{Automated Failover?}
    B -->|Yes| C[Health check confirms failure]
    C --> D[Promote standby to primary]
    D --> E[Update routing/DNS]
    E --> F[Verify replication lag < threshold]
    F --> G[Resume traffic]
    B -->|No| H[Page on-call engineer]
    H --> I[Engineer assesses situation]
    I --> J[Manual failover decision]
    J --> K[Follow runbook steps]
    K --> D
```
Checklist for Failover
- [ ] Verify primary is truly down (multiple checks)
- [ ] Confirm standby has caught up (replication lag check)
- [ ] Stop traffic to old primary (prevent split-brain)
- [ ] Promote standby
- [ ] Update DNS/routing to new primary
- [ ] Verify new primary is accepting writes
- [ ] Verify replicas are replicating to new primary
- [ ] Monitor error rates and latency post-failover
- [ ] Send status update to stakeholders
- [ ] Begin root cause analysis
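The first two checklist items are worth automating as a hard precondition in the failover script, so they cannot be skipped under pressure. The field names and thresholds below are hypothetical:

```javascript
// Promotion guard: refuse automatic promotion unless the primary is
// confirmed down and the standby has acceptably low replication lag.
function canPromote(status, opts = {}) {
  const { minFailedChecks = 3, maxLagMs = 1000 } = opts;
  if (status.primaryFailedChecks < minFailedChecks) {
    return { ok: false, reason: "primary not confirmed down" };
  }
  if (!status.standbyReachable) {
    return { ok: false, reason: "standby unreachable" };
  }
  if (status.replicationLagMs > maxLagMs) {
    return { ok: false, reason: "standby too far behind" };
  }
  return { ok: true };
}
```

A guard like this is what turns "verify primary is truly down (multiple checks)" from a human judgment call into an enforced invariant.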
Availability Math Calculator
Dependency Tree Analysis
For a system with multiple components, overall SLA is the product of individual SLAs:
System SLA = SLA_load_balancer × SLA_app_server × SLA_database × SLA_cache
Example:
- Load balancer: 99.99% (4 nines)
- 3 app servers (each 99.9%): 99.9%³ = 99.7%
- Primary database: 99.99%
- Read replica: 99.99%
- Cache: 99.9%
Combined = 0.9999 × 0.997 × 0.9999 × 0.9999 × 0.999
= 0.9957 (99.57% - only 2 nines!)
Nines and Downtime Reference
| SLA | Downtime/Year | Downtime/Month | Downtime/Week |
|---|---|---|---|
| 99% | 3.65 days | 7.3 hours | 1.68 hours |
| 99.9% | 8.76 hours | 43.8 min | 10.1 min |
| 99.99% | 52.6 min | 4.38 min | 1.01 min |
| 99.999% | 5.26 min | 26.3 sec | 6.05 sec |
| 99.9999% | 31.5 sec | 2.63 sec | 0.6 sec |
Chaos Engineering Checklist
Game Days for HA Patterns
Run these chaos experiments to validate availability patterns:
| Chaos Experiment | What It Validates | Success Criteria |
|---|---|---|
| Kill random app server | Load balancer routes around failure | < 1% error rate |
| Terminate primary DB | Failover completes successfully | < 60s downtime |
| Network partition (single DC) | Quorum maintained | Reads continue, writes queued |
| Fill disk on replica | Monitoring detects issue | Alert < 5 min, graceful degradation |
| Restart all app servers simultaneously | Traffic spikes handled | Rate limiting prevents cascade |
Pre-Game Day Checklist
- [ ] Define steady-state hypothesis
- [ ] Establish baseline metrics
- [ ] Notify stakeholders of experiment window
- [ ] Verify recent backup is available
- [ ] Confirm rollback plan
- [ ] Have on-call ready to intervene
- [ ] Document expected outcome
- [ ] Establish abort criteria
Observability Checklist
Golden Signals to Monitor
| Signal | What to Measure | Alert Threshold |
|---|---|---|
| Latency | P50, P95, P99 response time | P99 > 500ms |
| Traffic | Requests per second, concurrent connections | Unusual spikes/drops |
| Errors | Error rate by type (4xx, 5xx, timeouts) | Error rate > 1% |
| Saturation | CPU, memory, disk, connection pool utilization | > 80% |
Metrics to Capture
- availability_uptime_percentage (gauge) - Per service, per datacenter
- failover_duration_seconds (histogram) - Time to complete failover
- health_check_failures_total (counter) - By check type
- circuit_breaker_state (gauge) - Open/closed/half-open per circuit
- dependency_health (gauge) - Status of each dependency (1=healthy, 0=down)
Alerts to Configure
| Alert | Threshold | Severity |
|---|---|---|
| Service error rate > 1% | 1% for 5 minutes | Warning |
| Service error rate > 5% | 5% for 1 minute | Critical |
| Failover not completing in 60s | 60 seconds | Critical |
| Dependency health check failing | 2 consecutive failures | Warning |
| All health checks failing | 1 consecutive failure | Critical |
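A sliding-window counter is enough to drive the error-rate alerts above. A sketch of the mechanism (timestamps are injectable for testing; a real deployment would use its metrics pipeline instead):

```javascript
// Sliding-window error rate, matching alerts like "error rate > 1% for 5 minutes".
class ErrorRateWindow {
  constructor(windowMs = 5 * 60 * 1000) {
    this.windowMs = windowMs;
    this.events = []; // { timestamp, isError }
  }

  record(isError, timestamp = Date.now()) {
    this.events.push({ timestamp, isError });
  }

  rate(now = Date.now()) {
    // Drop events that have aged out of the window
    this.events = this.events.filter((e) => now - e.timestamp < this.windowMs);
    if (this.events.length === 0) return 0;
    const errors = this.events.filter((e) => e.isError).length;
    return errors / this.events.length;
  }
}
```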
Security Checklist
- Load balancer accessible only via TLS (HTTPS)
- Health check endpoints authenticated
- Failover mechanisms protected from unauthorized trigger
- DNS records protected with short TTLs and DNSSEC
- Cross-datacenter traffic encrypted
- Secrets for failover mechanisms rotated regularly
- Audit logging of all failover events
Common Pitfalls / Anti-Patterns
Pitfall 1: Designing for Five Nines Without Budget
Problem: Targeting 99.999% availability requires significant investment in redundancy, monitoring, and processes. Teams targeting it without the budget to support it often miss the target.
Solution: Start with 99.9% (three nines), measure what actually causes downtime, and improve incrementally. Each nine costs roughly 10x more than the previous.
Pitfall 2: Ignoring Dependency SLAs
Problem: Your service has 99.99% uptime but depends on a service with 99% uptime. Your actual availability is 99.99% × 99% = 98.99%.
Solution: Map all dependencies and calculate composite SLA. If a dependency is weak, add redundancy for that specific dependency or accept the lower composite SLA.
Pitfall 3: Automatic Failover Without Testing
Problem: Automatic failover sounds great until it triggers unnecessarily due to a false positive, causing a cascade of problems.
Solution: Test failover regularly (at least quarterly). Use manual failover for initial deployments until you have confidence in detection accuracy.
Pitfall 4: Forgetting About Session State
Problem: When failover happens, in-memory session state is lost. Users are logged out, shopping carts are emptied.
Solution: Externalize session state to Redis or similar. Use stateless request processing wherever possible. For sessions that must be local, use session affinity (with awareness of failover).
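A sketch of the externalized-store shape, with an in-process Map standing in for Redis (in production the store must of course live outside the app servers, or it dies with them):

```javascript
// Sessions kept in a shared store so ANY server can resume them after
// failover. The Map here is a stand-in for Redis or similar.
const sessionStore = new Map();

function saveSession(sessionId, data, ttlMs = 30 * 60 * 1000) {
  sessionStore.set(sessionId, { data, expiresAt: Date.now() + ttlMs });
}

function loadSession(sessionId) {
  const entry = sessionStore.get(sessionId);
  if (!entry || entry.expiresAt <= Date.now()) return null; // missing or expired
  return entry.data;
}

// After failover, a different app server presents the same session cookie
// to loadSession and picks up where the failed server left off.
```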
Quick Recap
- High availability requires intentional architecture with redundancy at every level.
- Redundancy (active-active or active-passive) is the first line of defense.
- Failover must be tested regularly. An untested failover plan is not a failover plan.
- Circuit breakers prevent cascade failures from spreading.
- Graceful degradation lets users accomplish core tasks even when features fail.
- SLA math: add dependencies carefully. Each one reduces overall availability.
Copy/Paste Checklist
- [ ] Map all system dependencies and calculate composite SLA
- [ ] Implement redundancy at each critical component
- [ ] Configure health checks (not just liveness)
- [ ] Document and test failover runbook
- [ ] Schedule quarterly chaos/game days
- [ ] Implement circuit breakers on all downstream calls
- [ ] Externalize session state
- [ ] Monitor golden signals: latency, traffic, errors, saturation
- [ ] Set alerts with appropriate thresholds and escalation paths
- [ ] Budget for the availability target (each nine costs ~10x more)
The PACELC theorem post covers latency trade-offs that also affect availability decisions. The Consistency Models post discusses how consistency guarantees interact with availability.