Health Checks in Distributed Systems: Beyond Liveness

Explore advanced health check patterns for distributed systems including deep checks, aggregation, distributed health tracking, and health protocols.


Kubernetes gave us liveness and readiness probes. Those are useful, but they only solve the problem within a single pod. In a distributed system, you need health checks that span services, track cascading failures, and inform complex routing decisions.

This article goes beyond the Kubernetes probe types. It covers distributed health tracking, aggregated health views, adaptive health protocols, and building health systems that actually capture what it means for a distributed system to be healthy.

The Limitations of Local Health Checks

A local health check knows only about the service it belongs to. It knows whether this specific pod can serve requests. It knows nothing about whether the downstream services this pod depends on are working.

Consider a checkout service. The checkout pod is healthy. Its local health check passes. But the payment service it calls is down. Users can add items to cart but cannot checkout. The local health check gives you a false positive.

Distributed health checks solve this by tracking health across service boundaries.

Distributed Health Tracking

Dependency Health Propagation

Each service reports not just its own health but the health of its dependencies:

class HealthStatus:
    def __init__(self, service_name: str):
        self.service_name = service_name
        self.own_health = "healthy"
        self.dependency_health = {}

    def update_dependency(self, dependency: str, status: str):
        self.dependency_health[dependency] = status

    def is_healthy(self) -> bool:
        if self.own_health != "healthy":
            return False
        return all(d == "healthy" for d in self.dependency_health.values())

    def to_dict(self) -> dict:
        return {
            "service": self.service_name,
            "healthy": self.is_healthy(),
            "own_health": self.own_health,
            "dependencies": self.dependency_health
        }

Services update their health status based on downstream service responses. When the payment service starts returning errors, the checkout service marks payment as unhealthy.
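One way to do this is to track a rolling window of downstream call results and flip a dependency's status when its error rate crosses a threshold. A minimal sketch, assuming the HealthStatus class above (duck-typed here); the window size and threshold are illustrative:

```python
from collections import deque

class DependencyTracker:
    """Marks a dependency unhealthy when its recent error rate crosses a threshold."""
    def __init__(self, health_status, window: int = 20, max_error_rate: float = 0.5):
        self.health_status = health_status  # any object with update_dependency()
        self.window = window
        self.max_error_rate = max_error_rate
        self.results: dict[str, deque] = {}

    def record(self, dependency: str, success: bool):
        # Keep only the last `window` results per dependency.
        results = self.results.setdefault(dependency, deque(maxlen=self.window))
        results.append(success)
        error_rate = 1 - (sum(results) / len(results))
        status = "unhealthy" if error_rate > self.max_error_rate else "healthy"
        self.health_status.update_dependency(dependency, status)
```

Call `record()` from the response path of each downstream call, so health reflects real traffic rather than synthetic probes alone.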

Health Status Storage

Centralize health status in a distributed store:

class HealthRegistry:
    def __init__(self, kv_store: KVStore):
        self.store = kv_store

    def register(self, service_id: str, status: HealthStatus):
        key = f"health:{service_id}"
        self.store.set(key, status.to_dict(), ttl=30)

    def get_all(self) -> dict[str, dict]:
        keys = self.store.scan("health:*")
        return {k: self.store.get(k) for k in keys}

    def get_service_health(self, service_id: str) -> dict:
        return self.store.get(f"health:{service_id}")

With a health registry, any service can query the overall system health. The API gateway, for instance, can check if the payment service is healthy before routing traffic there.

Aggregated Health Views

Individual health statuses are not enough. You need aggregated views that answer: is the overall system healthy? Can we serve user requests?

Aggregation Strategies

class HealthAggregator:
    def aggregate(self, service_healths: list[dict]) -> str:
        if not service_healths:
            return "unknown"

        healthy_count = sum(1 for s in service_healths if s.get("healthy"))
        total_count = len(service_healths)

        if healthy_count == total_count:
            return "healthy"

        if healthy_count > total_count / 2:
            return "degraded"

        return "unhealthy"

    def aggregate_by_category(self, service_healths: list[dict]) -> dict:
        categories = {}
        for service in service_healths:
            category = service.get("category", "uncategorized")
            if category not in categories:
                categories[category] = []
            categories[category].append(service)

        results = {}
        for category, services in categories.items():
            results[category] = self.aggregate(services)

        return results

This helps answer questions like: are the payment services healthy enough to process orders? Even if recommendations are down, payments might need to stay up.
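That distinction can be made explicit by weighting criticality in the aggregation: a failing critical category drives the overall status, while non-critical failures only degrade it. A sketch, where the set of critical categories is an assumption you would configure per system:

```python
def aggregate_with_criticality(
    category_states: dict[str, str],
    critical_categories: set[str],
) -> str:
    """Overall status from per-category states, weighting critical categories.

    `category_states` has the output shape of aggregate_by_category above,
    e.g. {"payments": "healthy", "recommendations": "unhealthy"}.
    """
    # Any critical category that is not fully healthy makes the system unhealthy.
    for category in critical_categories:
        if category_states.get(category) != "healthy":
            return "unhealthy"
    # Non-critical failures only degrade the system.
    if any(state != "healthy" for state in category_states.values()):
        return "degraded"
    return "healthy"
```

With this, recommendations being down yields "degraded", while a payments failure yields "unhealthy" regardless of how everything else looks.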

Deep Health Checks

Local health checks verify that a process is running. Deep health checks verify that a service can actually do its job.

Database Connectivity

def check_database_deep(db_pool, timeout_seconds: float = 2.0) -> bool:
    try:
        with db_pool.connection(timeout=timeout_seconds) as conn:
            cursor = conn.cursor()
            cursor.execute("SELECT 1")
            result = cursor.fetchone()
            if result[0] != 1:
                return False

            cursor.execute("SELECT txid_current()")
            return True
    except Exception:
        return False

Set a timeout. A database check that takes 10 seconds to fail is not useful.

Cache Verification

Caches fail silently in most configurations. A connectivity ping might succeed while the cache is returning stale data:

import time
import uuid

def check_cache_deep(cache_client, timeout_seconds: float = 1.0) -> bool:
    try:
        test_key = f"health_check:{uuid.uuid4()}"
        test_value = str(time.time())

        cache_client.set(test_key, test_value, ex=10, timeout=timeout_seconds)

        retrieved = cache_client.get(test_key, timeout=timeout_seconds)

        if retrieved != test_value:
            return False

        cache_client.delete(test_key)
        return True
    except Exception:
        return False

Downstream Service Verification

Check that downstream services are reachable and responding correctly:

def check_downstream_service(
    service_url: str,
    timeout_seconds: float = 3.0
) -> bool:
    try:
        response = requests.get(
            f"{service_url}/health",
            timeout=timeout_seconds
        )
        return response.status_code == 200
    except Exception:
        return False

For each downstream service, decide what level of verification makes sense. For critical dependencies, verify every time. For non-critical ones, a simple ping might do.
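That decision can be encoded in a small registry that maps each dependency to a check of the appropriate depth. A sketch with assumed names; the criticality labels and check registry are illustrative, not part of the code above:

```python
from typing import Callable

# dependency name -> (criticality, chosen check)
DEPENDENCY_CHECKS: dict[str, tuple[str, Callable[[], bool]]] = {}

def register_dependency(name: str, criticality: str,
                        deep_check: Callable[[], bool],
                        ping: Callable[[], bool]):
    """Critical dependencies get the deep check; others get a cheap ping."""
    check = deep_check if criticality == "critical" else ping
    DEPENDENCY_CHECKS[name] = (criticality, check)

def check_all_dependencies() -> dict[str, bool]:
    """Run the registered check for every dependency."""
    return {name: check() for name, (_, check) in DEPENDENCY_CHECKS.items()}
```

This keeps the depth decision in one place instead of scattered across health endpoint handlers.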

Kubernetes Probe Patterns

Kubernetes provides three probe types, each with different purposes:

Startup Probe

The startup probe indicates that the container is starting. Use it for applications that need time to initialize:

startupProbe:
  httpGet:
    path: /health/startup
    port: 8080
  failureThreshold: 30 # 30 * 10s = 5 minutes max
  periodSeconds: 10

The startup probe must succeed before liveness and readiness probes take effect. Once it succeeds, Kubernetes stops checking it.

Readiness Probe

The readiness probe indicates the container can receive traffic. Use it to determine if the application can serve requests:

readinessProbe:
  httpGet:
    path: /health/ready
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 10
  successThreshold: 1
  failureThreshold: 3

When a readiness probe fails, Kubernetes removes the pod from service endpoints. The pod continues running but receives no traffic.

Liveness Probe

The liveness probe indicates the container is running. Use it to detect situations where the process is alive but unresponsive:

livenessProbe:
  httpGet:
    path: /health/live
    port: 8080
  initialDelaySeconds: 15
  periodSeconds: 20
  failureThreshold: 3
  timeoutSeconds: 5

When a liveness probe fails repeatedly, Kubernetes restarts the container.

Probe Differences

| Aspect | Startup | Readiness | Liveness |
|---|---|---|---|
| Purpose | Application initializing | Ready to serve traffic | Container running properly |
| Failure action | None (during startup) | Remove from endpoints | Restart container |
| When to use | Slow-starting apps | Traffic-sensitive deployments | Process hung detection |
| Probe checks | Initialization complete | Can handle requests | Responding to requests |

Implementation Example

@app.get("/health/startup")
def startup():
    if not initialization_complete.is_set():
        return {"status": "starting"}, 503
    return {"status": "ready"}, 200

@app.get("/health/ready")
def ready():
    # Check if can serve traffic
    if not db.is_connected():
        return {"status": "not ready", "reason": "database unavailable"}, 503
    if not cache.is_available():
        return {"status": "degraded", "reason": "cache unavailable"}, 200  # Can still serve
    return {"status": "ready"}, 200

@app.get("/health/live")
def live():
    # Simple check that process is responsive
    return {"status": "alive"}, 200

Common Mistakes

  • Liveness probe checking dependencies: If the database is down, liveness should still pass (the app is alive, just degraded). Use readiness for dependency checks.
  • Missing startup probes: Without a startup probe, the liveness probe runs during initialization; if the app starts slowly, the liveness probe fails repeatedly and causes restart loops.
  • Too aggressive probes: Probes that run too frequently or timeout too quickly can cause unnecessary restarts.

Health-Based Routing

Distributed health information enables smarter routing decisions.

Service Mesh Integration

When running behind a service mesh like Istio or Linkerd, you can configure routing based on health:

apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: payment-service
spec:
  host: payment-service
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 100
      http:
        h2UpgradePolicy: UPGRADE
    outlierDetection:
      consecutive5xxErrors: 5
      interval: 30s
      baseEjectionTime: 30s

The service mesh handles circuit breaking and traffic shifting automatically based on outlier detection.

Application-Level Routing

For custom routing logic, query the health registry:

class HealthAwareRouter:
    def __init__(self, health_registry: HealthRegistry):
        self.registry = health_registry

    def route_request(self, request: Request) -> str:
        all_instances = self.discover_service(request.target_service)
        healthy_instances = [
            inst for inst in all_instances
            if self.is_instance_healthy(inst)
        ]

        if not healthy_instances:
            healthy_instances = all_instances

        return self.select_instance(healthy_instances, request)

    def is_instance_healthy(self, instance: ServiceInstance) -> bool:
        health = self.registry.get_service_health(instance.id)
        if not health:
            return False
        return health.get("healthy", False)

Health Check Protocols

Periodic Health Updates

Services publish their health status periodically:

async def publish_health_status(
    service_id: str,
    health_status: HealthStatus,
    interval_seconds: int = 10
):
    while True:
        try:
            registry.register(service_id, health_status)
        except Exception as e:
            logger.error(f"Failed to publish health: {e}")
        # Sleep outside the try block so a failed publish does not busy-loop
        await asyncio.sleep(interval_seconds)

This produces a constantly updated health map of your system.

Watch-Based Health Monitoring

Instead of polling, watch for health changes:

async def watch_health_changes(
    service_id: str,
    callback: Callable[[dict], None]
):
    for change in registry.watch(f"health:{service_id}"):
        callback(change.new_value)

When a service transitions from healthy to unhealthy, you can react immediately.

Cascading Failure Detection

One of the most valuable things distributed health checks can do is detect cascading failures while they are still propagating.

Failure Propagation Tracking

graph TD
    A[Payment Service Down] --> B[Checkout Unhealthy]
    B --> C[Cart Unhealthy]
    C --> D[Storefront Unhealthy]
    A --> E[Order History Unhealthy]
    A --> F[Analytics Failing]

Track which services depend on failing services. When payment goes down, immediately mark all services that depend on it as potentially affected.

class CascadingFailureDetector:
    def __init__(self, dependency_graph: dict[str, list[str]]):
        self.graph = dependency_graph

    def get_affected_services(self, failed_service: str) -> list[str]:
        affected = []
        to_check = [failed_service]

        while to_check:
            current = to_check.pop()
            for dependent in self.graph.get(current, []):
                if dependent not in affected:
                    affected.append(dependent)
                    to_check.append(dependent)

        return affected

This lets you see the blast radius of a failure. When payment goes down, you immediately know that checkout, order history, and analytics are affected.
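The detector above expects a map from each service to its dependents. Teams usually record the opposite direction (service to its dependencies), so a small inversion step helps. A sketch, using the example relationships from the diagram above as assumed data:

```python
def invert_dependencies(depends_on: dict[str, list[str]]) -> dict[str, list[str]]:
    """Turn service -> dependencies into dependency -> dependents."""
    dependents: dict[str, list[str]] = {}
    for service, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, []).append(service)
    return dependents

# service -> what it calls (matches the diagram above)
depends_on = {
    "checkout": ["payment"],
    "cart": ["checkout"],
    "storefront": ["cart"],
    "order-history": ["payment"],
    "analytics": ["payment"],
}
graph = invert_dependencies(depends_on)
# graph["payment"] == ["checkout", "order-history", "analytics"]
```

The inverted map is what you pass to CascadingFailureDetector as its dependency graph.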

Additional Common Mistakes

Health Checks That Are Too Heavy

A health check that queries the database, calls three downstream services, and processes significant data will itself become a load on the system.

Not Having a Health Check Strategy

Health checks should be designed, not added ad hoc. Decide what each health endpoint checks, what timeouts to use, and how health information flows through your system.

Ignoring Health Check Latency

If your health check takes 5 seconds to complete, you have added 5 seconds to your failure detection time. Keep health checks fast.

Not Testing Health Check Logic

Test what happens when health checks fail. Does your routing correctly avoid unhealthy instances? Does your alerting fire? Do your fallbacks activate?
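A sketch of what such a test can look like, using a stub registry and a simplified selection function; the names here are illustrative stand-ins, not a real framework:

```python
class StubRegistry:
    """Test double that answers health queries from a fixed map."""
    def __init__(self, healths: dict[str, bool]):
        self.healths = healths

    def get_service_health(self, instance_id: str) -> dict:
        return {"healthy": self.healths.get(instance_id, False)}

def pick_healthy(instances: list[str], registry: StubRegistry) -> list[str]:
    """Prefer healthy instances; fall back to all if none are healthy."""
    healthy = [i for i in instances
               if registry.get_service_health(i).get("healthy")]
    return healthy or instances

def test_router_skips_unhealthy():
    registry = StubRegistry({"a": True, "b": False})
    assert pick_healthy(["a", "b"], registry) == ["a"]

def test_router_falls_back_when_all_unhealthy():
    registry = StubRegistry({"a": False, "b": False})
    assert pick_healthy(["a", "b"], registry) == ["a", "b"]
```

The fallback test matters as much as the happy path: an empty-healthy-set bug silently drops all traffic.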

See Chaos Engineering for how to test this systematically.

Observability

Health checks are only useful if you observe their results:

def record_health_metrics(service_id: str, status: HealthStatus):
    metrics.gauge(
        "service.health",
        1 if status.is_healthy() else 0,
        tags={"service": service_id}
    )

    for dep, health in status.dependency_health.items():
        metrics.gauge(
            "dependency.health",
            1 if health == "healthy" else 0,
            tags={"service": service_id, "dependency": dep}
        )

Track these metrics: service health status over time, dependency health per service, time spent in degraded mode, and cascading failure events.

When to Use / When Not to Use Health Checks

When to Use Distributed Health Checks

Distributed health checks are worth the operational complexity when:

  • Multi-service routing: Your API gateway or load balancer routes traffic based on service health — local probes alone cannot inform this decision.
  • Cascading failure risk: You have deep service dependency chains where one failure can cascade if not caught early.
  • Multi-region deployments: You need to route traffic away from unhealthy regions without manual intervention.
  • Regulatory SLAs: Your availability requirements demand automated failure detection and response, not human-on-call judgment calls.
  • Service mesh environments: You’re running Istio, Linkerd, or similar and want health-based traffic management beyond what the mesh provides out of the box.

When Not to Use Distributed Health Checks

Distributed health checks add complexity. Avoid them when:

  • Single service or monolith: A simple liveness probe is enough if there are no downstream dependencies to track.
  • Short-lived functions: Lambda functions and similar ephemeral compute don’t benefit from distributed health tracking.
  • Fully managed PaaS: Heroku, Cloud Run, and similar platforms handle health-based routing at the platform level.
  • Teams can’t operationalize them: If you cannot commit to monitoring and maintaining the health system, skip it; a health system you do not trust is worse than no health system.

Decision Flow

graph TD
    A[Need to route traffic?] --> B[Multiple services involved?]
    B -->|No| C[Simple liveness probe sufficient]
    B -->|Yes| D[Deep service dependencies?]
    D -->|No| E[Basic readiness probe + load balancer]
    D -->|Yes| F[Implement distributed health tracking]
    F --> G[Add cascading failure detection]
    G --> H[Integrate with routing/circuit breakers]

Production Failure Scenarios

| Failure | Impact | Mitigation |
|---|---|---|
| Health registry becomes unavailable | Services cannot determine overall system health; routing falls back to local probes | Use eventually-consistent storage (etcd, Consul) with local probe fallback; alert when the health registry is unreachable |
| Health check storm | High-frequency health checks from many services overwhelm a downstream dependency | Use jitter in health check intervals; implement backoff when a dependency is degraded |
| False positive: dependency timeout | A slow but alive dependency causes health checks to report unhealthy | Set appropriately generous timeouts for dependency checks; use circuit breakers to stop checking known-failed dependencies |
| False negative: cache-only check | Health check passes but serves stale data | Use read/write verification for caches, not just connectivity checks |
| Probe configured too aggressively | Frequent restarts due to transient database hiccups | Set appropriate failureThreshold and periodSeconds; distinguish startup vs runtime probe requirements |
| Health status partition | One cluster of services has stale health data from another partition | Use timestamp-based health status with TTL expiry; alert when health status age exceeds a threshold |
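The jitter and backoff mitigations in the table can be sketched in a few lines; the 20% jitter fraction and 300-second cap are illustrative defaults, not recommendations from a specific system:

```python
import random

def jittered_interval(base_seconds: float, jitter_fraction: float = 0.2) -> float:
    """Spread checks out so many services don't probe a dependency in lockstep."""
    jitter = base_seconds * jitter_fraction
    return base_seconds + random.uniform(-jitter, jitter)

def next_interval(base_seconds: float, consecutive_failures: int,
                  max_seconds: float = 300.0) -> float:
    """Back off exponentially while a dependency keeps failing, capped at max_seconds."""
    backoff = base_seconds * (2 ** consecutive_failures)
    return min(jittered_interval(backoff), max_seconds)
```

Feeding `next_interval` into the publish loop's sleep keeps a degraded dependency from being hammered by every instance at once.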

gRPC Health Checking

gRPC has a built-in health check protocol via grpc.health.v1.Health. Rather than defining a custom /health endpoint for every gRPC service, you implement the health service and any gRPC-aware load balancer or proxy can query it.

Implementing the Health Service

// Go example: implementing grpc.health.v1.Health
import "google.golang.org/grpc/health/grpc_health_v1"

type myHealthServer struct {
    grpc_health_v1.UnimplementedHealthServer
    status atomic.Value // stores health status
}

func (s *myHealthServer) Check(ctx context.Context, req *grpc_health_v1.HealthCheckRequest) (*grpc_health_v1.HealthCheckResponse, error) {
    // HealthStatus and Unhealthy are the application's own types, defined elsewhere.
    // The comma-ok form avoids a panic if the status has not been set yet.
    currentStatus, _ := s.status.Load().(HealthStatus)

    status := grpc_health_v1.HealthCheckResponse_SERVING
    if currentStatus == Unhealthy {
        status = grpc_health_v1.HealthCheckResponse_NOT_SERVING
    }

    return &grpc_health_v1.HealthCheckResponse{
        Status: status,
    }, nil
}

// Register with gRPC server
grpc_health_v1.RegisterHealthServer(srv, &myHealthServer{})

HTTP vs gRPC Health Comparison

| Aspect | HTTP /health endpoint | gRPC health.v1 |
|---|---|---|
| Protocol | HTTP/1.1 or HTTP/2 | gRPC (HTTP/2, protobuf) |
| Load balancer support | Most LBs understand HTTP probes | Envoy, NGINX, Linkerd natively support gRPC health |
| Schema | Custom JSON or text | Protobuf (defined by the spec) |
| Streaming | No | Watch RPC available for change notification |
| Standardized | No (custom implementation) | Yes (grpc.health.v1) |

gRPC health is worth implementing if you run gRPC services behind Envoy or Linkerd. If your load balancer only speaks HTTP, a REST /health endpoint is simpler.

Partial Health: Degraded vs Unhealthy

Most health systems return binary healthy/unhealthy. Real systems exist in more states. A service might be partially functional — serving cached data but not accepting writes, or handling read traffic but not write traffic.

Health State Spectrum

class HealthState:
    HEALTHY = "healthy"           # Fully operational
    DEGRADED = "degraded"          # Partial functionality, serving from cache or fallback
    UNHEALTHY = "unhealthy"       # Cannot serve any traffic
    STARTING = "starting"         # Initializing, not ready for traffic
    DRAINING = "draining"         # Gracefully shutting down, rejecting new requests

def aggregate_health(states: list[HealthState]) -> HealthState:
    """Aggregate multiple health states into one."""
    if any(s == HealthState.UNHEALTHY for s in states):
        return HealthState.UNHEALTHY
    if any(s == HealthState.DRAINING for s in states):
        return HealthState.DRAINING
    if any(s == HealthState.STARTING for s in states):
        return HealthState.STARTING
    if any(s == HealthState.DEGRADED for s in states):
        return HealthState.DEGRADED
    return HealthState.HEALTHY

What Degraded Means in Practice

The exact meaning of degraded depends on your service. Define it explicitly:

  • Read-only mode: Database writes are failing, so you are serving reads from cache only. Mark degraded. Downstream callers can decide whether degraded reads are acceptable.
  • High latency mode: A dependency is slow. You are still serving requests but with elevated latency. Mark degraded so load balancers can route traffic elsewhere.
  • Feature flag-off mode: A non-critical feature is broken. Mark degraded but continue serving core functionality.

Routing Based on Health State

def route_based_on_health(instance_health: HealthState, request_type: str) -> bool:
    """
    Decide whether to route to an instance based on its health and request type.
    Returns True if we should route to this instance.
    """
    if instance_health == HealthState.HEALTHY:
        return True

    if instance_health == HealthState.DEGRADED:
        # Route writes to healthy only
        if request_type == "write":
            return False
        # Route reads to degraded in controlled way (backpressure)
        return True

    if instance_health in (HealthState.UNHEALTHY, HealthState.DRAINING, HealthState.STARTING):
        return False

    return False

Degraded does not mean “stop using this instance.” It means “use this instance for specific request types and monitor closely.”

Cache Health vs Data Health

Cache and database can be in different health states simultaneously. Your health check must distinguish between them because the correct response differs.

The Cache-Backup Fallback Pattern

def check_service_health() -> tuple[HealthState, dict]:
    """
    Check both cache and database health.
    Returns (overall_state, details)
    """
    db_healthy = check_database()
    cache_healthy = check_cache()

    details = {
        "database": "healthy" if db_healthy else "unhealthy",
        "cache": "healthy" if cache_healthy else "unhealthy",
    }

    if db_healthy and cache_healthy:
        return HealthState.HEALTHY, details

    if db_healthy and not cache_healthy:
        # Cache is down but database is fine — degraded, serving stale data
        return HealthState.DEGRADED, details

    if not db_healthy and cache_healthy:
        # Database is down — unhealthy for writes, but may have stale reads
        return HealthState.UNHEALTHY, details

    return HealthState.UNHEALTHY, details

Why This Matters for Health Checks

A health endpoint that treats the cache as a hard dependency will report unhealthy when the cache is down, even though the service can still serve reads from the database. Conversely, a health endpoint that only checks the cache will report healthy even when the database is down and you are serving stale data.

Check both, and report their states separately so callers can make informed routing decisions.

Health Check Authentication

Health endpoints are tempting to leave unauthenticated — they need to be reachable by load balancers and monitoring systems. But an unauthenticated /health endpoint leaks internal system information.

What an Unauthenticated Health Endpoint Reveals

  • Internal service names and versions
  • Dependency topology
  • Whether specific features are enabled
  • Error messages from failing dependencies

Defense in Depth

# Layer 1: Network segmentation
# Health check endpoint only accessible from internal networks
# Load balancer security group restricts access to health check CIDR

# Layer 2: IP allowlist
ALLOWED_IPS = {"10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16"}

@app.get("/health")
def health():
    # Verify caller IP is in allowlist
    client_ip = request.headers.get("X-Forwarded-For", request.remote_addr)
    if not is_internal_ip(client_ip, ALLOWED_IPS):
        return {"error": "unauthorized"}, 401

    # Continue with health check
    return check_health()

# Layer 3: Token-based auth for sensitive health data
@app.get("/health/detailed")
def health_detailed(token: str):
    if token != os.environ.get("HEALTH_TOKEN"):
        return {"error": "unauthorized"}, 401

    return check_full_health()  # Returns dependency details, versions, etc.

The simple /health endpoint stays public for load balancers. A separate /health/detailed endpoint with a token returns full system introspection for your monitoring tools.

What Not to Put in Health Responses

  • Stack traces or exception details (can leak to attackers)
  • Internal IP addresses or hostnames
  • Credentials or session tokens
  • Full configuration dumps

Health responses tell a caller whether to route traffic here. They are not a debugging endpoint.
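A simple way to enforce this is to allowlist the fields a public health response may contain, so new diagnostic data never leaks by default. A sketch; the field names are illustrative:

```python
# Only these keys may appear in the public /health response.
PUBLIC_FIELDS = {"status", "healthy", "dependencies_healthy"}

def sanitize_health_response(full: dict) -> dict:
    """Keep only allowlisted keys; never forward raw error details or hostnames."""
    return {k: v for k, v in full.items() if k in PUBLIC_FIELDS}
```

The detailed, authenticated endpoint can return the full dict; the public one returns only the sanitized view.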

Quick Recap

Distributed health checks go beyond local liveness probes:

  • Track health across service boundaries, not just within a pod
  • Use aggregated views to understand overall system health
  • Implement deep health checks that verify actual functionality
  • Build health-aware routing to avoid sending traffic to unhealthy instances
  • Detect cascading failures before they complete
  • Monitor your health checks and alert on degradation

Health checks are the nervous system of your distributed system. They tell every part how every other part is doing.

For more on related topics, see Health Checks, Circuit Breaker Pattern, and Resilience Patterns.
