Health Checks in Distributed Systems: Beyond Liveness
Kubernetes gave us liveness and readiness probes. Those are useful, but they only solve the problem within a single pod. In a distributed system, you need health checks that span services, track cascading failures, and inform complex routing decisions.
This article goes beyond the Kubernetes probe types. It covers distributed health tracking, aggregated health views, adaptive health protocols, and building health systems that actually capture what it means for a distributed system to be healthy.
The Limitations of Local Health Checks
A local health check knows only about the service it belongs to. It knows whether this specific pod can serve requests. It knows nothing about whether the downstream services this pod depends on are working.
Consider a checkout service. The checkout pod is healthy. Its local health check passes. But the payment service it calls is down. Users can add items to cart but cannot checkout. The local health check gives you a false positive.
Distributed health checks solve this by tracking health across service boundaries.
Distributed Health Tracking
Dependency Health Propagation
Each service reports not just its own health but the health of its dependencies:
```python
class HealthStatus:
    def __init__(self, service_name: str):
        self.service_name = service_name
        self.own_health = "healthy"
        self.dependency_health = {}

    def update_dependency(self, dependency: str, status: str):
        self.dependency_health[dependency] = status

    def is_healthy(self) -> bool:
        if self.own_health != "healthy":
            return False
        return all(d == "healthy" for d in self.dependency_health.values())

    def to_dict(self) -> dict:
        return {
            "service": self.service_name,
            "healthy": self.is_healthy(),
            "own_health": self.own_health,
            "dependencies": self.dependency_health,
        }
```
Services update their health status based on downstream service responses. When the payment service starts returning errors, the checkout service marks payment as unhealthy.
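One way to drive those updates is to map each downstream response to a status before recording it. This is a sketch; the helper name and latency threshold are illustrative, not from the article:

```python
def classify_dependency(status_code: int, latency_ms: float,
                        slow_threshold_ms: float = 500.0) -> str:
    """Translate a downstream response into a dependency health status.

    5xx responses mean the dependency is failing; slow-but-successful
    responses mean it is degraded; anything else counts as healthy.
    """
    if status_code >= 500:
        return "unhealthy"
    if latency_ms > slow_threshold_ms:
        return "degraded"
    return "healthy"
```

The checkout service would call something like `status.update_dependency("payment", classify_dependency(resp.status_code, resp.elapsed_ms))` after each downstream call, or after a sampled subset of them.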
Health Status Storage
Centralize health status in a distributed store:
```python
class HealthRegistry:
    def __init__(self, kv_store: KVStore):
        self.store = kv_store

    def register(self, service_id: str, status: HealthStatus):
        key = f"health:{service_id}"
        self.store.set(key, status.to_dict(), ttl=30)

    def get_all(self) -> dict[str, dict]:
        keys = self.store.scan("health:*")
        return {k: self.store.get(k) for k in keys}

    def get_service_health(self, service_id: str) -> dict:
        return self.store.get(f"health:{service_id}")
```
With a health registry, any service can query the overall system health. The API gateway, for instance, can check if the payment service is healthy before routing traffic there.
Aggregated Health Views
Individual health statuses are not enough. You need aggregated views that answer: is the overall system healthy? Can we serve user requests?
Aggregation Strategies
```python
class HealthAggregator:
    def aggregate(self, service_healths: list[dict]) -> str:
        if not service_healths:
            return "unknown"
        healthy_count = sum(1 for s in service_healths if s.get("healthy"))
        total_count = len(service_healths)
        if healthy_count == total_count:
            return "healthy"
        if healthy_count > total_count / 2:
            return "degraded"
        return "unhealthy"

    def aggregate_by_category(self, service_healths: list[dict]) -> dict:
        categories = {}
        for service in service_healths:
            category = service.get("category", "uncategorized")
            if category not in categories:
                categories[category] = []
            categories[category].append(service)
        results = {}
        for category, services in categories.items():
            results[category] = self.aggregate(services)
        return results
```
This helps answer questions like: are the payment services healthy enough to process orders? Even if recommendations are down, payments might need to stay up.
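To make the per-category view concrete, here is a self-contained rerun of the same majority rule on sample data. The service names are illustrative:

```python
def aggregate(service_healths: list[dict]) -> str:
    """Same majority rule as HealthAggregator.aggregate."""
    if not service_healths:
        return "unknown"
    healthy = sum(1 for s in service_healths if s.get("healthy"))
    if healthy == len(service_healths):
        return "healthy"
    if healthy > len(service_healths) / 2:
        return "degraded"
    return "unhealthy"


statuses = [
    {"service": "payment-1", "healthy": True, "category": "payments"},
    {"service": "payment-2", "healthy": True, "category": "payments"},
    {"service": "recs-1", "healthy": False, "category": "recommendations"},
]

# Group by category, then aggregate each group independently.
by_category: dict[str, list[dict]] = {}
for s in statuses:
    by_category.setdefault(s["category"], []).append(s)
category_health = {c: aggregate(svcs) for c, svcs in by_category.items()}
```

Here `category_health` reports payments as healthy and recommendations as unhealthy: exactly the signal you need to keep taking orders while the recommendation carousel is down.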
Deep Health Checks
Local health checks verify that a process is running. Deep health checks verify that a service can actually do its job.
Database Connectivity
```python
def check_database_deep(db_pool, timeout_seconds: float = 2.0) -> bool:
    try:
        with db_pool.connection(timeout=timeout_seconds) as conn:
            cursor = conn.cursor()
            cursor.execute("SELECT 1")
            result = cursor.fetchone()
            if result[0] != 1:
                return False
            # Verify the database can hand out a transaction id (PostgreSQL)
            cursor.execute("SELECT txid_current()")
            return True
    except Exception:
        return False
```
Set a timeout. A database check that takes 10 seconds to fail is not useful.
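When a client library does not expose a timeout parameter, you can still bound the wait from the outside. One possible sketch, using a worker thread so the caller gets an answer in bounded time (the function name and default are illustrative):

```python
from concurrent.futures import ThreadPoolExecutor


def check_with_deadline(check_fn, deadline_seconds: float = 2.0) -> bool:
    """Run a blocking health check with a hard cap on how long the caller waits.

    A check that cannot answer within the deadline counts as failed. The
    worker thread may linger in the background after a timeout; the point
    is that the *caller* is never blocked past the deadline.
    """
    pool = ThreadPoolExecutor(max_workers=1)
    try:
        future = pool.submit(check_fn)
        try:
            return bool(future.result(timeout=deadline_seconds))
        except Exception:
            return False
    finally:
        # Do not wait for a possibly hung check to finish
        pool.shutdown(wait=False)
```

Note the deliberate `shutdown(wait=False)`: using the executor as a context manager would block on exit until the hung check returned, defeating the purpose.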
Cache Verification
Caches fail silently in most configurations. A connectivity ping might succeed while the cache is returning stale data:
```python
import time
import uuid


def check_cache_deep(cache_client, timeout_seconds: float = 1.0) -> bool:
    try:
        # Round-trip a unique value to verify both writes and reads work
        test_key = f"health_check:{uuid.uuid4()}"
        test_value = str(time.time())
        cache_client.set(test_key, test_value, ex=10, timeout=timeout_seconds)
        retrieved = cache_client.get(test_key, timeout=timeout_seconds)
        if retrieved != test_value:
            return False
        cache_client.delete(test_key)
        return True
    except Exception:
        return False
```
Downstream Service Verification
Check that downstream services are reachable and responding correctly:
```python
import requests


def check_downstream_service(
    service_url: str,
    timeout_seconds: float = 3.0,
) -> bool:
    try:
        response = requests.get(
            f"{service_url}/health",
            timeout=timeout_seconds,
        )
        return response.status_code == 200
    except Exception:
        return False
```
For each downstream service, decide what level of verification makes sense. For critical dependencies, verify every time. For non-critical ones, a simple ping might do.
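One way to encode that decision is a small policy table mapping each dependency to a verification level. The policy values and dispatch helper below are illustrative, not from the article:

```python
from typing import Callable

# Illustrative policy: which verification level each dependency gets.
# "deep" runs the full functional check; "ping" checks reachability only.
CHECK_POLICY = {
    "payment-db": "deep",       # critical: verify real queries work
    "recommendations": "ping",  # non-critical: reachability is enough
}


def run_dependency_checks(
    deep_checks: dict[str, Callable[[], bool]],
    ping_checks: dict[str, Callable[[], bool]],
    policy: dict[str, str],
) -> dict[str, bool]:
    """Run each dependency's check at the depth the policy requests.

    A dependency with no registered check is reported as failed, so a
    misconfigured policy is visible rather than silently ignored.
    """
    results = {}
    for dep, level in policy.items():
        check = deep_checks.get(dep) if level == "deep" else ping_checks.get(dep)
        results[dep] = bool(check()) if check else False
    return results
```

Keeping the policy in data rather than code makes it easy to demote a flaky deep check to a ping during an incident without redeploying.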
Kubernetes Probe Patterns
Kubernetes provides three probe types, each with different purposes:
Startup Probe
The startup probe indicates that the container is starting. Use it for applications that need time to initialize:
```yaml
startupProbe:
  httpGet:
    path: /health/startup
    port: 8080
  failureThreshold: 30  # 30 * 10s = 5 minutes max
  periodSeconds: 10
```
The startup probe must succeed before liveness and readiness probes take effect. Once it succeeds, Kubernetes stops checking it.
Readiness Probe
The readiness probe indicates the container can receive traffic. Use it to determine if the application can serve requests:
```yaml
readinessProbe:
  httpGet:
    path: /health/ready
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 10
  successThreshold: 1
  failureThreshold: 3
```
When a readiness probe fails, Kubernetes removes the pod from service endpoints. The pod continues running but receives no traffic.
Liveness Probe
The liveness probe indicates the container is running. Use it to detect situations where the process is alive but unresponsive:
```yaml
livenessProbe:
  httpGet:
    path: /health/live
    port: 8080
  initialDelaySeconds: 15
  periodSeconds: 20
  failureThreshold: 3
  timeoutSeconds: 5
```
When a liveness probe fails repeatedly, Kubernetes restarts the container.
Probe Differences
| Aspect | Startup | Readiness | Liveness |
|---|---|---|---|
| Purpose | Application initializing | Ready to serve traffic | Container running properly |
| Failure action | None (during startup) | Remove from endpoints | Restart container |
| When to use | Slow-starting apps | Traffic-sensitive deployments | Process hung detection |
| Probe checks | Initialization complete | Can handle requests | Responding to requests |
Implementation Example
```python
@app.get("/health/startup")
def startup():
    if not initialization_complete.is_set():
        return {"status": "starting"}, 503
    return {"status": "ready"}, 200


@app.get("/health/ready")
def ready():
    # Check whether we can serve traffic
    if not db.is_connected():
        return {"status": "not ready", "reason": "database unavailable"}, 503
    if not cache.is_available():
        # Degraded, but can still serve
        return {"status": "degraded", "reason": "cache unavailable"}, 200
    return {"status": "ready"}, 200


@app.get("/health/live")
def live():
    # Simple check that the process is responsive
    return {"status": "alive"}, 200
```
Common Mistakes
- Liveness probe checking dependencies: If the database is down, liveness should still pass (the app is alive, just degraded). Use readiness for dependency checks.
- Missing startup probes: Without a startup probe, liveness checks fail during slow initialization, causing restart loops.
- Too aggressive probes: Probes that run too frequently or timeout too quickly can cause unnecessary restarts.
Health-Based Routing
Distributed health information enables smarter routing decisions.
Service Mesh Integration
When running behind a service mesh like Istio or Linkerd, you can configure routing based on health:
```yaml
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: payment-service
spec:
  host: payment-service
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 100
      http:
        h2UpgradePolicy: UPGRADE
    outlierDetection:
      consecutive5xxErrors: 5
      interval: 30s
      baseEjectionTime: 30s
```
The service mesh handles circuit breaking and traffic shifting automatically based on outlier detection.
Application-Level Routing
For custom routing logic, query the health registry:
```python
class HealthAwareRouter:
    def __init__(self, health_registry: HealthRegistry):
        self.registry = health_registry

    def route_request(self, request: Request) -> str:
        all_instances = self.discover_service(request.target_service)
        healthy_instances = [
            inst for inst in all_instances
            if self.is_instance_healthy(inst)
        ]
        if not healthy_instances:
            # Fail open: with no healthy instances, trying an unhealthy
            # one beats returning nothing
            healthy_instances = all_instances
        return self.select_instance(healthy_instances, request)

    def is_instance_healthy(self, instance: ServiceInstance) -> bool:
        health = self.registry.get_service_health(instance.id)
        if not health:
            return False
        return health.get("healthy", False)
```
Health Check Protocols
Periodic Health Updates
Services publish their health status periodically:
```python
async def publish_health_status(
    service_id: str,
    health_status: HealthStatus,
    interval_seconds: int = 10,
):
    while True:
        try:
            registry.register(service_id, health_status)
        except Exception as e:
            logger.error(f"Failed to publish health: {e}")
        # Sleep on every iteration, including failures, so a broken
        # registry does not turn this loop into a hot spin
        await asyncio.sleep(interval_seconds)
```
This produces a constantly updated health map of your system.
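If every service publishes (and probes its dependencies) on the same fixed interval, the reports synchronize and arrive in bursts. A common mitigation, sketched here with an illustrative helper name, is to jitter the interval:

```python
import random


def jittered_interval(base_seconds: float, jitter_fraction: float = 0.2) -> float:
    """Spread periodic health publishes out in time.

    Returns base_seconds +/- up to jitter_fraction, so a fleet of services
    started at the same moment does not probe and publish in lockstep.
    """
    jitter = base_seconds * jitter_fraction
    return base_seconds + random.uniform(-jitter, jitter)
```

In the publish loop above, this would replace the fixed sleep: `await asyncio.sleep(jittered_interval(interval_seconds))`. The same trick applies to the dependency checks themselves, which helps avoid the health-check-storm failure mode discussed later.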
Watch-Based Health Monitoring
Instead of polling, watch for health changes:
```python
async def watch_health_changes(
    service_id: str,
    callback: Callable[[dict], None],
):
    # registry.watch is assumed to yield a change event each time the key
    # is updated (as etcd and Consul watch APIs do)
    async for change in registry.watch(f"health:{service_id}"):
        callback(change.new_value)
```
When a service transitions from healthy to unhealthy, you can react immediately.
Cascading Failure Detection
One of the most valuable things distributed health checks can do is detect cascading failures before they finish propagating.
Failure Propagation Tracking
```mermaid
graph TD
    A[Payment Service Down] --> B[Checkout Unhealthy]
    B --> C[Cart Unhealthy]
    C --> D[Storefront Unhealthy]
    A --> E[Order History Unhealthy]
    A --> F[Analytics Failing]
```
Track which services depend on failing services. When payment goes down, immediately mark all services that depend on it as potentially affected.
```python
class CascadingFailureDetector:
    def __init__(self, dependency_graph: dict[str, list[str]]):
        # Maps each service to the services that depend on it
        self.graph = dependency_graph

    def get_affected_services(self, failed_service: str) -> list[str]:
        affected = []
        to_check = [failed_service]
        while to_check:
            current = to_check.pop()
            for dependent in self.graph.get(current, []):
                if dependent not in affected:
                    affected.append(dependent)
                    # Only enqueue newly discovered services, so cycles
                    # in the graph cannot loop forever
                    to_check.append(dependent)
        return affected
```
This lets you see the blast radius of a failure. When payment goes down, you immediately know that checkout, order history, and analytics are affected.
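Here is the detector run against the failure graph from the diagram above. The class is repeated so the snippet runs standalone, and the service names are the illustrative ones from the example:

```python
class CascadingFailureDetector:
    """Edges point from a service to the services that depend on it."""
    def __init__(self, dependency_graph: dict[str, list[str]]):
        self.graph = dependency_graph

    def get_affected_services(self, failed_service: str) -> list[str]:
        affected, to_check = [], [failed_service]
        while to_check:
            current = to_check.pop()
            for dependent in self.graph.get(current, []):
                if dependent not in affected:
                    affected.append(dependent)
                    to_check.append(dependent)
        return affected


detector = CascadingFailureDetector({
    "payment": ["checkout", "order-history", "analytics"],
    "checkout": ["cart"],
    "cart": ["storefront"],
})
blast_radius = detector.get_affected_services("payment")
```

Note that the traversal is transitive: storefront appears in the blast radius even though it never calls payment directly, because it depends on cart, which depends on checkout.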
Additional Common Mistakes
Health Checks That Are Too Heavy
A health check that queries the database, calls three downstream services, and processes significant data will itself become a load on the system.
Not Having a Health Check Strategy
Health checks should be designed, not added ad-hoc. Decide what each health endpoint checks, what timeouts to use, and how health information flows through your system.
Ignoring Health Check Latency
If your health check takes 5 seconds to complete, you have added 5 seconds to your failure detection time. Keep health checks fast.
Not Testing Health Check Logic
Test what happens when health checks fail. Does your routing correctly avoid unhealthy instances? Does your alerting fire? Do your fallbacks activate?
See Chaos Engineering for how to test this systematically.
Observability
Health checks are only useful if you observe their results:
```python
def record_health_metrics(service_id: str, status: HealthStatus):
    metrics.gauge(
        "service.health",
        1 if status.is_healthy() else 0,
        tags={"service": service_id},
    )
    for dep, health in status.dependency_health.items():
        metrics.gauge(
            "dependency.health",
            1 if health == "healthy" else 0,
            tags={"service": service_id, "dependency": dep},
        )
```
Track these metrics: service health status over time, dependency health per service, time spent in degraded mode, and cascading failure events.
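"Time spent in degraded mode" needs a small amount of state: you have to remember when the current state began. A minimal sketch (the class name is illustrative; the injectable clock exists only to make the logic testable):

```python
import time


class StateDurationTracker:
    """Accumulates how long a service spends in each health state,
    so you can answer 'how long were we degraded today?'."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._state = None
        self._since = None
        self.totals: dict[str, float] = {}

    def transition(self, new_state: str) -> None:
        now = self._clock()
        if self._state is not None:
            # Credit the elapsed time to the state we are leaving
            self.totals[self._state] = (
                self.totals.get(self._state, 0.0) + (now - self._since)
            )
        self._state, self._since = new_state, now
```

Call `transition()` whenever the aggregated health state changes; periodically flush `totals` to your metrics backend as counters.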
When to Use / When Not to Use Health Checks
When to Use Distributed Health Checks
Distributed health checks are worth the operational complexity when:
- Multi-service routing: Your API gateway or load balancer routes traffic based on service health — local probes alone cannot inform this decision.
- Cascading failure risk: You have deep service dependency chains where one failure can cascade if not caught early.
- Multi-region deployments: You need to route traffic away from unhealthy regions without manual intervention.
- Regulatory SLAs: Your availability requirements demand automated failure detection and response, not human-on-call judgment calls.
- Service mesh environments: You’re running Istio, Linkerd, or similar and want health-based traffic management beyond what the mesh provides out of the box.
When Not to Use Distributed Health Checks
Distributed health checks add complexity. Avoid them when:
- Single service or monolith: A simple liveness probe is enough if there are no downstream dependencies to track.
- Short-lived functions: Lambda functions and similar ephemeral compute don’t benefit from distributed health tracking.
- Fully managed PaaS: Heroku, Cloud Run, and similar platforms handle health-based routing at the platform level.
- Teams can’t operationalize them: A health system nobody monitors or maintains ends up distrusted, and a health system you do not trust is worse than no health system.
Decision Flow
```mermaid
graph TD
    A[Need to route traffic?] --> B[Multiple services involved?]
    B -->|No| C[Simple liveness probe sufficient]
    B -->|Yes| D[Deep service dependencies?]
    D -->|No| E[Basic readiness probe + load balancer]
    D -->|Yes| F[Implement distributed health tracking]
    F --> G[Add cascading failure detection]
    G --> H[Integrate with routing/circuit breakers]
```
Production Failure Scenarios
| Failure | Impact | Mitigation |
|---|---|---|
| Health registry becomes unavailable | Services cannot determine overall system health; routing falls back to local probes | Use eventually-consistent storage (etcd, Consul) with local probe fallback; alert when health registry is unreachable |
| Health check storm | High-frequency health checks from many services overwhelm a downstream dependency | Use jitter in health check intervals; implement backoff when a dependency is degraded |
| False positive: dependency timeout | A slow but alive dependency causes health checks to report unhealthy | Set appropriately generous timeouts for dependency checks; use circuit breakers to stop checking known-failed dependencies |
| False negative: cache-only check | Health check passes but serves stale data | Use read/write verification for caches, not just connectivity checks |
| Probe configured too aggressively | Frequent restarts due to transient database hiccups | Set appropriate failureThreshold and periodSeconds; distinguish startup vs runtime probe requirements |
| Health status partition | One cluster of services has stale health data from another partition | Use timestamp-based health status with TTL expiry; alert when health status age exceeds threshold |
gRPC Health Checking
gRPC has a built-in health check protocol via grpc.health.v1.Health. Rather than defining a custom /health endpoint for every gRPC service, you implement the health service and any gRPC-aware load balancer or proxy can query it.
Implementing the Health Service
```go
// Go example: implementing grpc.health.v1.Health
import "google.golang.org/grpc/health/grpc_health_v1"

type myHealthServer struct {
    grpc_health_v1.UnimplementedHealthServer
    status atomic.Value // stores health status
}

func (s *myHealthServer) Check(ctx context.Context, req *grpc_health_v1.HealthCheckRequest) (*grpc_health_v1.HealthCheckResponse, error) {
    currentStatus := s.status.Load().(HealthStatus)
    status := grpc_health_v1.HealthCheckResponse_SERVING
    if currentStatus == Unhealthy {
        status = grpc_health_v1.HealthCheckResponse_NOT_SERVING
    }
    return &grpc_health_v1.HealthCheckResponse{
        Status: status,
    }, nil
}

// Register with the gRPC server
grpc_health_v1.RegisterHealthServer(srv, &myHealthServer{})
```
HTTP vs gRPC Health Comparison
| Aspect | HTTP /health endpoint | gRPC health.v1 |
|---|---|---|
| Protocol | HTTP/1.1 or HTTP/2 | gRPC (HTTP/2, protobuf) |
| Load balancer support | Most LBs understand HTTP probes | Envoy, NGINX, Linkerd natively support gRPC health |
| Schema | Custom JSON or text | Protobuf (defined by spec) |
| Streaming | No | Watch RPC available for change notification |
| Standardized | No (custom implementation) | Yes (grpc.health.v1) |
gRPC health is worth implementing if you run gRPC services behind Envoy or Linkerd. If your load balancer only speaks HTTP, a REST /health endpoint is simpler.
Partial Health: Degraded vs Unhealthy
Most health systems return binary healthy/unhealthy. Real systems exist in more states. A service might be partially functional — serving cached data but not accepting writes, or handling read traffic but not write traffic.
Health State Spectrum
```python
class HealthState:
    HEALTHY = "healthy"      # Fully operational
    DEGRADED = "degraded"    # Partial functionality, serving from cache or fallback
    UNHEALTHY = "unhealthy"  # Cannot serve any traffic
    STARTING = "starting"    # Initializing, not ready for traffic
    DRAINING = "draining"    # Gracefully shutting down, rejecting new requests


def aggregate_health(states: list[str]) -> str:
    """Aggregate multiple health states into one, worst state first."""
    if any(s == HealthState.UNHEALTHY for s in states):
        return HealthState.UNHEALTHY
    if any(s == HealthState.DRAINING for s in states):
        return HealthState.DRAINING
    if any(s == HealthState.STARTING for s in states):
        return HealthState.STARTING
    if any(s == HealthState.DEGRADED for s in states):
        return HealthState.DEGRADED
    return HealthState.HEALTHY
```
What Degraded Means in Practice
The exact meaning of degraded depends on your service. Define it explicitly:
- Read-only mode: Database writes are failing, so you are serving reads from cache only. Mark degraded. Downstream callers can decide whether degraded reads are acceptable.
- High latency mode: A dependency is slow. You are still serving requests but with elevated latency. Mark degraded so load balancers can route traffic elsewhere.
- Feature flag-off mode: A non-critical feature is broken. Mark degraded but continue serving core functionality.
Routing Based on Health State
```python
def route_based_on_health(instance_health: str, request_type: str) -> bool:
    """
    Decide whether to route to an instance based on its health and request type.
    Returns True if we should route to this instance.
    """
    if instance_health == HealthState.HEALTHY:
        return True
    if instance_health == HealthState.DEGRADED:
        # Writes go to healthy instances only
        if request_type == "write":
            return False
        # Reads may go to degraded instances, with backpressure monitoring
        return True
    # UNHEALTHY, DRAINING, and STARTING instances receive no traffic
    return False
```
Degraded does not mean “stop using this instance.” It means “use this instance for specific request types and monitor closely.”
Cache Health vs Data Health
Cache and database can be in different health states simultaneously. Your health check must distinguish between them because the correct response differs.
The Cache-Backup Fallback Pattern
```python
def check_service_health() -> tuple[str, dict]:
    """
    Check both cache and database health.
    Returns (overall_state, details).
    """
    db_healthy = check_database()
    cache_healthy = check_cache()
    details = {
        "database": "healthy" if db_healthy else "unhealthy",
        "cache": "healthy" if cache_healthy else "unhealthy",
    }
    if db_healthy and cache_healthy:
        return HealthState.HEALTHY, details
    if db_healthy and not cache_healthy:
        # Cache is down but the database is fine: degraded, serving
        # uncached reads at higher latency
        return HealthState.DEGRADED, details
    # Database is down: unhealthy for writes, even if stale cached reads remain
    return HealthState.UNHEALTHY, details
```
Why This Matters for Health Checks
A health endpoint that only pings the database will report unhealthy the moment the database is down, even though your service may still be able to serve reads from the cache. Conversely, a health endpoint that only checks the cache will report healthy even when the database is down and you are serving stale data.
Check both, and report their states separately so callers can make informed routing decisions.
Health Check Authentication
Health endpoints are tempting to leave unauthenticated — they need to be reachable by load balancers and monitoring systems. But an unauthenticated /health endpoint leaks internal system information.
What an Unauthenticated Health Endpoint Reveals
- Internal service names and versions
- Dependency topology
- Whether specific features are enabled
- Error messages from failing dependencies
Defense in Depth
```python
import os
import secrets

# Layer 1: network segmentation.
# The health endpoint is only accessible from internal networks; the load
# balancer security group restricts access to the health check CIDR.

# Layer 2: IP allowlist
ALLOWED_IPS = {"10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16"}


@app.get("/health")
def health():
    # Verify the caller IP is in the allowlist
    client_ip = request.headers.get("X-Forwarded-For", request.remote_addr)
    if not is_internal_ip(client_ip, ALLOWED_IPS):
        return {"error": "unauthorized"}, 401
    # Continue with the health check
    return check_health()


# Layer 3: token-based auth for sensitive health data
@app.get("/health/detailed")
def health_detailed(token: str):
    # Constant-time comparison avoids leaking the token via timing
    if not secrets.compare_digest(token, os.environ.get("HEALTH_TOKEN", "")):
        return {"error": "unauthorized"}, 401
    return check_full_health()  # Returns dependency details, versions, etc.
```
The simple /health endpoint stays public for load balancers. A separate /health/detailed endpoint with a token returns full system introspection for your monitoring tools.
What Not to Put in Health Responses
- Stack traces or exception details (can leak to attackers)
- Internal IP addresses or hostnames
- Credentials or session tokens
- Full configuration dumps
Health responses tell a caller whether to route traffic here. They are not a debugging endpoint.
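A simple enforcement mechanism is a sanitizer applied to every health payload before it leaves the service. The key names here are illustrative examples of the categories listed above:

```python
# Illustrative: keys that belong in logs or debug tooling, never in a
# health response
SENSITIVE_KEYS = {"stack_trace", "internal_ip", "credentials", "config"}


def sanitize_health_response(payload: dict) -> dict:
    """Strip keys that would turn a health response into a debugging
    endpoint before it is returned to the caller."""
    return {k: v for k, v in payload.items() if k not in SENSITIVE_KEYS}
```

Running every handler's return value through a sanitizer like this is cheaper than auditing each endpoint by hand, and it fails safe when someone adds a new field to the health payload.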
Quick Recap
Distributed health checks go beyond local liveness probes:
- Track health across service boundaries, not just within a pod
- Use aggregated views to understand overall system health
- Implement deep health checks that verify actual functionality
- Build health-aware routing to avoid sending traffic to unhealthy instances
- Detect cascading failures before they finish propagating
- Monitor your health checks and alert on degradation
Health checks are the nervous system of your distributed system. They tell every part how every other part is doing.
For more on related topics, see Health Checks, Circuit Breaker Pattern, and Resilience Patterns.
Related Posts
Health Checks: Liveness, Readiness, and Service Availability
Master health check implementation for microservices including liveness probes, readiness probes, and graceful degradation patterns.
Graceful Degradation: Systems That Bend Instead Break
Design systems that maintain core functionality when components fail through fallback strategies, degradation modes, and progressive service levels.
Backpressure Handling: Protecting Pipelines from Overload
Learn how to implement backpressure in data pipelines to prevent cascading failures, handle overload gracefully, and maintain system stability.