Health Checks: Liveness, Readiness, and Service Availability
Master health check implementation for microservices including liveness probes, readiness probes, and graceful degradation patterns.
In distributed systems, your services do not exist in isolation. They call each other, depend on databases and caches, and serve traffic through load balancers. When a service starts failing, the rest of the system needs to know quickly. Health checks provide that visibility.
A properly implemented health check system tells Kubernetes when to route traffic to your pod, tells your load balancer which instances are ready, and gives your monitoring system early warning before problems cascade. Without health checks, you get cascading failures, traffic sent to dead instances, and problems that compound silently until they take down your entire application.
This article covers the three probe types Kubernetes provides, how to implement health endpoints in your services, how to handle deep health checks for dependencies, and the patterns that keep your system resilient when individual services fail.
The Three Probe Types
Kubernetes distinguishes between three states a pod can be in. Each state has a corresponding probe type that determines how Kubernetes manages the pod’s lifecycle and traffic routing.
```mermaid
graph TD
    A[Pod Starting] --> B{Startup Probe}
    B -->|Not Ready| C[Initializing]
    B -->|Ready| D{Liveness Probe}
    D -->|Failing| E[Restarting]
    D -->|Healthy| F{Readiness Probe}
    F -->|Failing| G[Remove from Traffic]
    F -->|Passing| H[Receive Traffic]
    E --> D
    G --> F
```
Liveness Probe: Is the Process Alive?
The liveness probe answers a simple question: is the process running and responsive? If the liveness probe fails, Kubernetes restarts the container. This handles situations where the process is alive but stuck in a deadlock or unresponsive state.
A basic liveness probe configuration looks like this:
```yaml
livenessProbe:
  httpGet:
    path: /live
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 15
  timeoutSeconds: 5
  failureThreshold: 3
```
The liveness probe waits 10 seconds after startup before the first check. Then it checks every 15 seconds. If the check takes more than 5 seconds, it counts as a failure. After 3 consecutive failures, Kubernetes restarts the container.
Keep liveness probes simple. A liveness probe that checks dependencies will restart your service whenever your database is temporarily unavailable, which makes outages worse, not better.
Readiness Probe: Can the Service Accept Traffic?
The readiness probe answers: can this instance handle requests right now? A service might be running but not ready if it is warming up, loading configuration, or recovering from a dependency outage.
When the readiness probe fails, Kubernetes removes the pod from the service endpoint slice. Traffic stops being routed to that instance. The pod keeps running and the probe keeps checking. When the probe passes again, traffic resumes.
```yaml
readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 10
  timeoutSeconds: 3
  failureThreshold: 2
```
Use readiness probes for checks that verify dependencies. If your service needs a database connection and a cache to serve requests properly, the readiness probe should verify both. Keep the probe fast to avoid removing instances unnecessarily during brief slowdowns.
Startup Probe: The Initialization Grace Period
The startup probe handles applications that need significant time to initialize. If your service takes 30 seconds to start, a liveness probe that starts checking after 10 seconds will kill the container before it is ready.
The startup probe delays all other probes until it succeeds:
```yaml
startupProbe:
  httpGet:
    path: /started
    port: 8080
  initialDelaySeconds: 0
  periodSeconds: 5
  timeoutSeconds: 3
  failureThreshold: 12
```
With 5-second intervals and 12 failures allowed, the startup probe gives your service up to 60 seconds to initialize. Once the startup probe passes, Kubernetes switches to the liveness and readiness probes.
Startup probes suit applications that load large models, warm up JIT compilers, or perform initial data loads at startup.
Implementing Health Endpoints
Your service needs to expose endpoints that Kubernetes can query. Plan for three endpoints: a liveness endpoint that confirms the process is responsive, a readiness endpoint that checks dependencies, and optionally a startup endpoint for initialization.
Basic Health Endpoint
The liveness endpoint should be trivially simple. It checks nothing except whether the HTTP server can respond:
```python
@app.get("/live")
def liveness():
    return {"status": "alive"}
```
This endpoint must not check dependencies. If your database is down, this endpoint still returns healthy and Kubernetes keeps the container running, which is correct: a restart will not fix the database. If the endpoint itself fails because the process is deadlocked, Kubernetes restarts the container, which is the desired behavior.
Readiness Endpoint with Dependency Checks
The readiness endpoint verifies your service can handle traffic:
```python
@app.get("/ready")
def readiness():
    # Check database connectivity
    try:
        db.execute("SELECT 1")
    except Exception:
        raise HealthCheckFailed("Database unavailable")

    # Check cache connectivity
    try:
        cache.ping()
    except Exception:
        raise HealthCheckFailed("Cache unavailable")

    # Check downstream services
    for service in dependent_services:
        if not service.is_healthy():
            raise HealthCheckFailed(f"{service.name} unavailable")

    return {"status": "ready"}
```
Keep readiness checks fast. A 5-second timeout means 5 seconds of serving bad responses while your health check times out. Set timeouts aggressively and fail fast.
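One way to keep dependency checks bounded is to run them in parallel with a hard per-check deadline. This sketch uses the standard library's `concurrent.futures`; `checks` is an assumed dict mapping names to zero-argument callables that raise on failure, and the per-call waits are sequential, so it is an illustration rather than a drop-in implementation:

```python
import concurrent.futures

def run_checks_with_deadline(checks, timeout=2.0):
    """Run each zero-argument health check with a hard deadline.

    Returns {name: None} on success, {name: error string} on failure
    or timeout, so the endpoint can decide how to respond.
    """
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=max(len(checks), 1))
    results = {}
    try:
        futures = {name: pool.submit(fn) for name, fn in checks.items()}
        for name, future in futures.items():
            try:
                future.result(timeout=timeout)
                results[name] = None
            except concurrent.futures.TimeoutError:
                results[name] = "timed out"
            except Exception as exc:
                results[name] = str(exc)
    finally:
        # Do not wait for hung checks; let the probe respond promptly
        pool.shutdown(wait=False, cancel_futures=True)
    return results
```

A check that hangs shows up as `"timed out"` after `timeout` seconds instead of stalling the whole `/ready` response.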
Startup Endpoint
The startup endpoint mirrors the readiness check but exists only during initialization:
```python
@app.get("/started")
def startup():
    if not initialization_complete.is_set():
        raise HealthCheckFailed("Still initializing")
    return {"status": "started"}
```
Once initialization completes, this endpoint can return healthy permanently, or you can remove the startup probe configuration and let Kubernetes use only liveness and readiness probes.
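A common way to drive the `initialization_complete` flag is a background thread that runs the slow startup work and sets a `threading.Event` when done. A minimal sketch, where `startup_task` stands in for your real initialization:

```python
import threading

initialization_complete = threading.Event()

def initialize_in_background(startup_task):
    """Run a slow startup task off the main thread, flagging completion.

    The HTTP server can start answering /started immediately; the probe
    fails until `startup_task` (model loading, cache warming, ...) ends.
    """
    def worker():
        startup_task()                 # may take tens of seconds
        initialization_complete.set()  # /started now reports healthy

    threading.Thread(target=worker, daemon=True).start()
```

This keeps the startup probe honest: it passes exactly when the real initialization work has finished, not when the HTTP server happens to be up.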
Deep Health Checks
Simple endpoints that just return “healthy” catch process crashes but miss dependency failures. Deep health checks verify your dependencies are actually working.
Database Connectivity
Do not just check if the database process is running. Check if your application can execute queries:
```python
def check_database():
    try:
        with db.connection() as conn:
            cursor = conn.cursor()
            cursor.execute("SELECT 1")
            result = cursor.fetchone()
            if result[0] != 1:
                raise HealthCheckFailed("Database query failed")
    except OperationalError:
        raise HealthCheckFailed("Database connection failed")
```
`SELECT 1` works for both PostgreSQL and MySQL. For MongoDB, use `db.admin.command('ping')`.
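The same check can be exercised locally against SQLite, with no server required. A self-contained sketch — `HealthCheckFailed` is redefined here for completeness, and in production the connection would come from your pool rather than being opened per check:

```python
import sqlite3

class HealthCheckFailed(Exception):
    pass

def check_database(conn):
    """Verify the connection can execute a real query, not merely exist."""
    try:
        cursor = conn.cursor()
        cursor.execute("SELECT 1")
        if cursor.fetchone()[0] != 1:
            raise HealthCheckFailed("Database query failed")
    except sqlite3.OperationalError as e:
        raise HealthCheckFailed(f"Database connection failed: {e}")

check_database(sqlite3.connect(":memory:"))  # passes on a working database
```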
Cache Verification
Caches fail silently in most configurations. Verify your cache is actually storing and retrieving data:
```python
def check_cache():
    test_key = f"health_check:{uuid.uuid4()}"
    test_value = str(time.time())
    try:
        cache.set(test_key, test_value, ex=10)
        retrieved = cache.get(test_key)
        cache.delete(test_key)
    except Exception as e:
        raise HealthCheckFailed(f"Cache check failed: {e}")
    if retrieved != test_value:
        raise HealthCheckFailed("Cache read/write mismatch")
```
Use a unique key per check to avoid collisions in shared cache environments.
Service Mesh Health Checks
When running behind a service mesh like Istio, Envoy handles health checking by default. You configure ReadinessGate in your pod spec and Envoy manages the actual health check calls:
```yaml
readinessGates:
  - conditionType: "envoy.kubernetes.io/ready"
```
Your application still needs to expose a health endpoint for orchestration systems and load balancers that do not use Envoy’s sidecar proxy.
Kubernetes Configuration
Probe Configuration Options
Each probe type supports the same configuration parameters.
| Parameter | Purpose | Typical Value |
|---|---|---|
| initialDelaySeconds | Wait before first check | Liveness: 10-30s, Readiness: 5-10s |
| periodSeconds | How often to check | 10-15s for liveness, 5-10s for readiness |
| timeoutSeconds | When to count as failure | 3-5s |
| failureThreshold | Failures before taking action | Liveness: 3, Readiness: 2 |
| successThreshold | Consecutive successes to recover | 1 (must be 1 for liveness and startup probes) |
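Putting the typical values together, a container spec combining all three probes might look like this (a sketch; the paths and port match the endpoint examples above):

```yaml
startupProbe:
  httpGet: {path: /started, port: 8080}
  periodSeconds: 5
  failureThreshold: 12      # up to 60s to initialize
livenessProbe:
  httpGet: {path: /live, port: 8080}
  initialDelaySeconds: 10
  periodSeconds: 15
  timeoutSeconds: 5
  failureThreshold: 3       # restart after ~45s of failures
readinessProbe:
  httpGet: {path: /ready, port: 8080}
  periodSeconds: 10
  timeoutSeconds: 3
  failureThreshold: 2       # stop traffic after ~20s of failures
```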
Common Mistakes in Probe Configuration
Setting initialDelaySeconds too low causes premature failures. Your application needs time to start before Kubernetes starts checking. Set this based on your observed startup time, not your desired startup time.
Setting periodSeconds too short causes excessive load from health check requests. Setting it too long delays detection of failures. 10-15 seconds balances quick detection with minimal overhead.
Setting failureThreshold too low causes unnecessary restarts from transient issues. Setting it too high delays failure detection. For liveness probes, 3 failures over 45 seconds is reasonable. For readiness probes, 2 failures over 20 seconds balances sensitivity with stability.
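These detection windows follow directly from periodSeconds × failureThreshold. A tiny helper (illustrative, not part of any Kubernetes API) makes the arithmetic explicit:

```python
def worst_case_detection_seconds(period_seconds, failure_threshold):
    """Upper bound on time from first failed check to Kubernetes acting."""
    return period_seconds * failure_threshold

print(worst_case_detection_seconds(15, 3))  # liveness values above: 45
print(worst_case_detection_seconds(10, 2))  # readiness values above: 20
```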
Verifying Probe Configuration
Use kubectl to inspect probe configuration and test probes manually:
```bash
# Describe pod probe configuration
kubectl describe pod my-pod | grep -A 10 "Liveness"
kubectl describe pod my-pod | grep -A 10 "Readiness"

# Port-forward to test health endpoints
kubectl port-forward my-pod 8080:8080
curl http://localhost:8080/live
curl http://localhost:8080/ready

# Check pod status
kubectl get pod my-pod -o jsonpath='{.status.conditions[*]}'
```
Health Check Best Practices
Timeouts and Retries
Health check timeouts must be shorter than your request timeout. If your service times out requests at 10 seconds but your health check waits 30 seconds before counting a failure, the probe keeps reporting healthy while real requests are already failing.
For readiness probes checking dependencies, set timeouts at 2-3 seconds. Most dependency checks should complete in milliseconds. A 3-second timeout catches genuine problems without false positives from brief slowdowns.
Do not implement retry logic in health checks. Kubernetes handles retries at the probe level. If a health check fails, Kubernetes retries based on failureThreshold. Adding your own retry logic inside the health check endpoint adds latency and complexity without benefit.
Fallbacks and Graceful Degradation
When health checks fail, have a plan for degraded operation. If your recommendation service cannot reach its ML model, return popularity-based recommendations instead of errors. If your search service cannot reach Elasticsearch, fall back to database-backed search.
```python
@app.get("/ready")
def readiness():
    try:
        check_database()
    except HealthCheckFailed:
        # Can we serve read-only traffic?
        if not app.allow_read_only_mode():
            raise
        return {"status": "ready", "mode": "read_only"}

    try:
        check_cache()
    except HealthCheckFailed:
        # Cache is optional
        return {"status": "ready", "cache": "degraded"}

    return {"status": "ready"}
```
What Not to Check in Health Endpoints
Keep liveness probes minimal. The liveness probe exists to detect deadlocks and crashes, not dependency outages. If your liveness probe fails whenever your database is unavailable, you restart into the same situation repeatedly.
Do not implement business logic in health checks. Health checks should verify infrastructure and dependencies, not application state. If you need to check application state, use separate monitoring endpoints with their own alerting.
Do not block health checks on long operations. A health check that takes 30 seconds to complete defeats its purpose. Set aggressive timeouts and fail fast.
Monitoring and Alerting
Health checks generate valuable signals for monitoring. Track health check latency and failure rates alongside application metrics.
```python
# Track health check duration
def measure_health_check(name, check_func):
    start = time.time()
    try:
        check_func()
        duration = time.time() - start
        metrics.histogram("health_check_duration_seconds", duration, labels={"check": name})
        metrics.increment("health_check_success_total", labels={"check": name})
    except Exception:
        duration = time.time() - start
        metrics.histogram("health_check_duration_seconds", duration, labels={"check": name})
        metrics.increment("health_check_failure_total", labels={"check": name})
        raise
```
Set alerts on health check failures, not just application error rates. A failing health check often precedes customer-visible errors by several minutes.
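If the counters above are exported to Prometheus, an alert on sustained failures could look like this (a sketch; the metric names match the snippet above, and the thresholds are illustrative):

```yaml
groups:
  - name: health-checks
    rules:
      - alert: HealthCheckFailing
        expr: rate(health_check_failure_total[5m]) > 0
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "Health check {{ $labels.check }} is failing"
```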
When to Use / When Not to Use
When to Use Health Checks
Health checks are essential in these scenarios:
- Container orchestration (Kubernetes, Docker Swarm) where orchestrators need to know when to restart or route traffic to your service
- Load balancer integration where load balancers need to know which instances can receive traffic
- Auto-scaling systems where scaling decisions depend on service health
- Microservices with dependencies where you need to detect when downstream services are unavailable
- Multi-instance deployments where you need to ensure all instances are healthy before serving traffic
When Not to Use Health Checks
Health checks may add unnecessary complexity in these cases:
- Single-instance applications with no orchestration and no load balancing
- Stateless batch jobs that run to completion and exit (though startup/shutdown hooks may still be useful)
- Very short-lived tasks where the overhead of health check implementation outweighs the benefit
- Services where failure is acceptable - non-critical background workers that can fail without impact
Probe Selection Guide
| Scenario | Startup Probe | Liveness Probe | Readiness Probe |
|---|---|---|---|
| Slow-starting application | Required | Not needed until startup completes | Not needed until startup completes |
| Depends on external services | Not needed | Not recommended (restarts on transient deps) | Required (blocks traffic during dependency issues) |
| Serves cached data when deps fail | Not needed | Not recommended | Optional (can return healthy with degraded status) |
| Stateless computation | Required if startup time is non-trivial | Optional (process crash = container restart) | Optional |
| Database-backed API | Required | Not recommended | Required |
Decision Flow
```mermaid
graph TD
    A[Implementing Health Checks] --> B{Application Slow to Start?}
    B -->|Yes| C[Add Startup Probe]
    B -->|No| D{Service Has Dependencies?}
    D -->|Yes| E{Need to Block Traffic When Deps Unavailable?}
    E -->|Yes| F[Add Readiness Probe]
    E -->|No| G[Add Liveness Probe]
    D -->|No| H{Can Crash Indicate Problem?}
    H -->|Yes| G
    H -->|No| I[No Probes Needed]
    C --> F
    C --> G
```
Quick Recap
Health checks let Kubernetes and your load balancers make intelligent routing decisions. Use liveness probes to detect crashed or deadlocked processes. Use readiness probes to control traffic routing based on dependency health. Use startup probes to give slow-starting applications time to initialize.
Keep liveness probes simple. Keep readiness probes fast and thorough. Set timeouts short and failure thresholds reasonable. Monitor your health checks and alert on failures.
For more on building resilient systems, see Resilience Patterns, Circuit Breaker Pattern, and Kubernetes.