Health Checks: Liveness, Readiness, and Service Availability
Master health check implementation for microservices including liveness probes, readiness probes, and graceful degradation patterns.
In distributed systems, your services do not exist in isolation. They call each other, depend on databases and caches, and serve traffic through load balancers. When a service starts failing, the rest of the system needs to know quickly. Health checks provide that visibility.
A properly implemented health check system tells Kubernetes when to route traffic to your pod, tells your load balancer which instances are ready, and gives your monitoring system early warning before problems cascade. Without health checks, you get cascading failures, traffic sent to dead instances, and problems that compound silently until they take down your entire application.
This article covers the three probe types Kubernetes provides, how to implement health endpoints in your services, how to handle deep health checks for dependencies, and the patterns that keep your system resilient when individual services fail.
The Three Probe Types
Kubernetes distinguishes between three states a pod can be in. Each state has a corresponding probe type that determines how Kubernetes manages the pod’s lifecycle and traffic routing.
```mermaid
graph TD
    A[Pod Starting] --> B{Startup Probe}
    B -->|Not Ready| C[Initializing]
    B -->|Ready| D{Liveness Probe}
    D -->|Failing| E[Restarting]
    D -->|Healthy| F{Readiness Probe}
    F -->|Failing| G[Remove from Traffic]
    F -->|Passing| H[Receive Traffic]
    E --> D
    G --> F
```
Liveness Probe: Is the Process Alive?
The liveness probe answers a simple question: is the process running and responsive? If the liveness probe fails, Kubernetes restarts the container. This handles situations where the process is alive but stuck in a deadlock or unresponsive state.
A basic liveness probe configuration looks like this:
```yaml
livenessProbe:
  httpGet:
    path: /live
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 15
  timeoutSeconds: 5
  failureThreshold: 3
```
The liveness probe waits 10 seconds after startup before the first check. Then it checks every 15 seconds. If the check takes more than 5 seconds, it counts as a failure. After 3 consecutive failures, Kubernetes restarts the container.
Keep liveness probes simple. A liveness probe that checks dependencies will restart your service whenever your database is temporarily unavailable, which makes outages worse, not better.
Readiness Probe: Can the Service Accept Traffic?
The readiness probe answers: can this instance handle requests right now? A service might be running but not ready if it is warming up, loading configuration, or recovering from a dependency outage.
When the readiness probe fails, Kubernetes removes the pod from the service endpoint slice. Traffic stops being routed to that instance. The pod keeps running and the probe keeps checking. When the probe passes again, traffic resumes.
```yaml
readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 10
  timeoutSeconds: 3
  failureThreshold: 2
```
Use readiness probes for checks that verify dependencies. If your service needs a database connection and a cache to serve requests properly, the readiness probe should verify both. Keep the probe fast to avoid removing instances unnecessarily during brief slowdowns.
Startup Probe: The Initialization Grace Period
The startup probe handles applications that need significant time to initialize. If your service takes 30 seconds to start, a liveness probe that starts checking after 10 seconds will kill the container before it is ready.
The startup probe delays all other probes until it succeeds:
```yaml
startupProbe:
  httpGet:
    path: /started
    port: 8080
  initialDelaySeconds: 0
  periodSeconds: 5
  timeoutSeconds: 3
  failureThreshold: 12
```
With 5-second intervals and 12 failures allowed, the startup probe gives your service up to 60 seconds to initialize. Once the startup probe passes, Kubernetes switches to the liveness and readiness probes.
Startup probes suit applications that load large models, warm up JIT compilers, or perform initial data loads at startup.
Implementing Health Endpoints
Your service needs to expose endpoints that Kubernetes can query. Plan for three endpoints: a liveness endpoint that confirms the process is responsive, a readiness endpoint that checks dependencies, and optionally a startup endpoint for initialization.
Basic Health Endpoint
The liveness endpoint should be trivially simple. It checks nothing except whether the HTTP server can respond:
```python
@app.get("/live")
def liveness():
    return {"status": "alive"}
```
This endpoint must not check dependencies. If your database is down, this endpoint still returns healthy and Kubernetes keeps the container running, which is correct: a restart will not fix the database. If the endpoint itself fails because the process is deadlocked, Kubernetes restarts the container, which is the desired behavior.
Readiness Endpoint with Dependency Checks
The readiness endpoint verifies your service can handle traffic:
```python
@app.get("/ready")
def readiness():
    # Check database connectivity
    try:
        db.execute("SELECT 1")
    except Exception:
        raise HealthCheckFailed("Database unavailable")

    # Check cache connectivity
    try:
        cache.ping()
    except Exception:
        raise HealthCheckFailed("Cache unavailable")

    # Check downstream services
    for service in dependent_services:
        if not service.is_healthy():
            raise HealthCheckFailed(f"{service.name} unavailable")

    return {"status": "ready"}
```
Keep readiness checks fast. A 5-second timeout means 5 seconds of serving bad responses while your health check times out. Set timeouts aggressively and fail fast.
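One way to keep dependency checks bounded is to run them in parallel with a hard per-check deadline. This sketch uses the standard library's `concurrent.futures`; `checks` is an assumed dict mapping names to zero-argument callables that raise on failure, and the per-call waits are sequential, so it is an illustration rather than a drop-in implementation:

```python
import concurrent.futures

def run_checks_with_deadline(checks, timeout=2.0):
    """Run each zero-argument health check with a hard deadline.

    Returns {name: None} on success, {name: error string} on failure
    or timeout, so the endpoint can decide how to respond.
    """
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=max(len(checks), 1))
    results = {}
    try:
        futures = {name: pool.submit(fn) for name, fn in checks.items()}
        for name, future in futures.items():
            try:
                future.result(timeout=timeout)
                results[name] = None
            except concurrent.futures.TimeoutError:
                results[name] = "timed out"
            except Exception as exc:
                results[name] = str(exc)
    finally:
        # Do not wait for hung checks; let the probe respond promptly
        pool.shutdown(wait=False, cancel_futures=True)
    return results
```

A check that hangs shows up as `"timed out"` after `timeout` seconds instead of stalling the whole `/ready` response.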
Startup Endpoint
The startup endpoint mirrors the readiness check but exists only during initialization:
```python
@app.get("/started")
def startup():
    if not initialization_complete.is_set():
        raise HealthCheckFailed("Still initializing")
    return {"status": "started"}
```
Once initialization completes, this endpoint can return healthy permanently, or you can remove the startup probe configuration and let Kubernetes use only liveness and readiness probes.
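A common way to drive the `initialization_complete` flag is a background thread that runs the slow startup work and sets a `threading.Event` when done. A minimal sketch, where `startup_task` stands in for your real initialization:

```python
import threading

initialization_complete = threading.Event()

def initialize_in_background(startup_task):
    """Run a slow startup task off the main thread, flagging completion.

    The HTTP server can start answering /started immediately; the probe
    fails until `startup_task` (model loading, cache warming, ...) ends.
    """
    def worker():
        startup_task()                 # may take tens of seconds
        initialization_complete.set()  # /started now reports healthy

    threading.Thread(target=worker, daemon=True).start()
```

This keeps the startup probe honest: it passes exactly when the real initialization work has finished, not when the HTTP server happens to be up.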
Deep Health Checks
Simple endpoints that just return “healthy” catch process crashes but miss dependency failures. Deep health checks verify your dependencies are actually working.
Database Connectivity
Do not just check if the database process is running. Check if your application can execute queries:
```python
def check_database():
    try:
        with db.connection() as conn:
            cursor = conn.cursor()
            cursor.execute("SELECT 1")
            result = cursor.fetchone()
            if result[0] != 1:
                raise HealthCheckFailed("Database query failed")
    except OperationalError:
        raise HealthCheckFailed("Database connection failed")
```
`SELECT 1` works for both PostgreSQL and MySQL. For MongoDB, use `db.admin.command('ping')`.
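The same check can be exercised locally against SQLite, with no server required. A self-contained sketch — `HealthCheckFailed` is redefined here for completeness, and in production the connection would come from your pool rather than being opened per check:

```python
import sqlite3

class HealthCheckFailed(Exception):
    pass

def check_database(conn):
    """Verify the connection can execute a real query, not merely exist."""
    try:
        cursor = conn.cursor()
        cursor.execute("SELECT 1")
        if cursor.fetchone()[0] != 1:
            raise HealthCheckFailed("Database query failed")
    except sqlite3.OperationalError as e:
        raise HealthCheckFailed(f"Database connection failed: {e}")

check_database(sqlite3.connect(":memory:"))  # passes on a working database
```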
Cache Verification
Caches fail silently in most configurations. Verify your cache is actually storing and retrieving data:
```python
def check_cache():
    test_key = f"health_check:{uuid.uuid4()}"
    test_value = str(time.time())
    try:
        cache.set(test_key, test_value, ex=10)
        retrieved = cache.get(test_key)
        cache.delete(test_key)
    except Exception as e:
        raise HealthCheckFailed(f"Cache check failed: {e}")
    if retrieved != test_value:
        raise HealthCheckFailed("Cache read/write mismatch")
```
Use a unique key per check to avoid collisions in shared cache environments.
Service Mesh Health Checks
When running behind a service mesh like Istio, Envoy handles health checking by default. You configure ReadinessGate in your pod spec and Envoy manages the actual health check calls:
```yaml
readinessGates:
  - conditionType: "envoy.kubernetes.io/ready"
```
Your application still needs to expose a health endpoint for orchestration systems and load balancers that do not use Envoy’s sidecar proxy.
Kubernetes Configuration
Probe Configuration Options
Each probe type supports the same configuration parameters.
| Parameter | Purpose | Typical Value |
|---|---|---|
| initialDelaySeconds | Wait before first check | Liveness: 10-30s, Readiness: 5-10s |
| periodSeconds | How often to check | 10-15s for liveness, 5-10s for readiness |
| timeoutSeconds | When to count as failure | 3-5s |
| failureThreshold | Failures before taking action | Liveness: 3, Readiness: 2 |
| successThreshold | Consecutive successes to recover | 1 (must be 1 for liveness and startup probes) |
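Putting the typical values together, a container spec combining all three probes might look like this (a sketch; the paths and port match the endpoint examples above):

```yaml
startupProbe:
  httpGet: {path: /started, port: 8080}
  periodSeconds: 5
  failureThreshold: 12      # up to 60s to initialize
livenessProbe:
  httpGet: {path: /live, port: 8080}
  initialDelaySeconds: 10
  periodSeconds: 15
  timeoutSeconds: 5
  failureThreshold: 3       # restart after ~45s of failures
readinessProbe:
  httpGet: {path: /ready, port: 8080}
  periodSeconds: 10
  timeoutSeconds: 3
  failureThreshold: 2       # stop traffic after ~20s of failures
```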
Common Mistakes in Probe Configuration
Setting initialDelaySeconds too low causes premature failures. Your application needs time to start before Kubernetes starts checking. Set this based on your observed startup time, not your desired startup time.
Setting periodSeconds too short causes excessive load from health check requests. Setting it too long delays detection of failures. 10-15 seconds balances quick detection with minimal overhead.
Setting failureThreshold too low causes unnecessary restarts from transient issues. Setting it too high delays failure detection. For liveness probes, 3 failures over 45 seconds is reasonable. For readiness probes, 2 failures over 20 seconds balances sensitivity with stability.
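These detection windows follow directly from periodSeconds × failureThreshold. A tiny helper (illustrative, not part of any Kubernetes API) makes the arithmetic explicit:

```python
def worst_case_detection_seconds(period_seconds, failure_threshold):
    """Upper bound on time from first failed check to Kubernetes acting."""
    return period_seconds * failure_threshold

print(worst_case_detection_seconds(15, 3))  # liveness values above: 45
print(worst_case_detection_seconds(10, 2))  # readiness values above: 20
```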
Verifying Probe Configuration
Use kubectl to inspect probe configuration and test probes manually:
```bash
# Describe pod probe configuration
kubectl describe pod my-pod | grep -A 10 "Liveness"
kubectl describe pod my-pod | grep -A 10 "Readiness"

# Port-forward to test health endpoints
kubectl port-forward my-pod 8080:8080
curl http://localhost:8080/live
curl http://localhost:8080/ready

# Check pod status
kubectl get pod my-pod -o jsonpath='{.status.conditions[*]}'
```
Health Check Best Practices
Timeouts and Retries
Health check timeouts must be shorter than your request timeout. If your service times out requests at 10 seconds but your health check waits 30 seconds before counting a failure, the probe keeps reporting healthy while real requests are already failing.
For readiness probes checking dependencies, set timeouts at 2-3 seconds. Most dependency checks should complete in milliseconds. A 3-second timeout catches genuine problems without false positives from brief slowdowns.
Do not implement retry logic in health checks. Kubernetes handles retries at the probe level. If a health check fails, Kubernetes retries based on failureThreshold. Adding your own retry logic inside the health check endpoint adds latency and complexity without benefit.
Fallbacks and Graceful Degradation
When health checks fail, have a plan for degraded operation. If your recommendation service cannot reach its ML model, return popularity-based recommendations instead of errors. If your search service cannot reach Elasticsearch, fall back to database-backed search.
```python
@app.get("/ready")
def readiness():
    try:
        check_database()
    except HealthCheckFailed:
        # Can we serve read-only traffic?
        if not app.allow_read_only_mode():
            raise
        return {"status": "ready", "mode": "read_only"}

    try:
        check_cache()
    except HealthCheckFailed:
        # Cache is optional
        return {"status": "ready", "cache": "degraded"}

    return {"status": "ready"}
```
What Not to Check in Health Endpoints
Keep liveness probes minimal. The liveness probe exists to detect deadlocks and crashes, not dependency outages. If your liveness probe fails whenever your database is unavailable, you restart into the same situation repeatedly.
Do not implement business logic in health checks. Health checks should verify infrastructure and dependencies, not application state. If you need to check application state, use separate monitoring endpoints with their own alerting.
Do not block health checks on long operations. A health check that takes 30 seconds to complete defeats its purpose. Set aggressive timeouts and fail fast.
Monitoring and Alerting
Health checks generate valuable signals for monitoring. Track health check latency and failure rates alongside application metrics.
```python
# Track health check duration
def measure_health_check(name, check_func):
    start = time.time()
    try:
        check_func()
        duration = time.time() - start
        metrics.histogram("health_check_duration_seconds", duration, labels={"check": name})
        metrics.increment("health_check_success_total", labels={"check": name})
    except Exception:
        duration = time.time() - start
        metrics.histogram("health_check_duration_seconds", duration, labels={"check": name})
        metrics.increment("health_check_failure_total", labels={"check": name})
        raise
```
Set alerts on health check failures, not just application error rates. A failing health check often precedes customer-visible errors by several minutes.
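If the counters above are exported to Prometheus, an alert on sustained failures could look like this (a sketch; the metric names match the snippet above, and the thresholds are illustrative):

```yaml
groups:
  - name: health-checks
    rules:
      - alert: HealthCheckFailing
        expr: rate(health_check_failure_total[5m]) > 0
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "Health check {{ $labels.check }} is failing"
```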
When to Use / When Not to Use
When to Use Health Checks
Health checks are essential in these scenarios:
- Container orchestration (Kubernetes, Docker Swarm) where orchestrators need to know when to restart or route traffic to your service
- Load balancer integration where load balancers need to know which instances can receive traffic
- Auto-scaling systems where scaling decisions depend on service health
- Microservices with dependencies where you need to detect when downstream services are unavailable
- Multi-instance deployments where you need to ensure all instances are healthy before serving traffic
When Not to Use Health Checks
Health checks may add unnecessary complexity in these cases:
- Single-instance applications with no orchestration and no load balancing
- Stateless batch jobs that run to completion and exit (though startup/shutdown hooks may still be useful)
- Very short-lived tasks where the overhead of health check implementation outweighs the benefit
- Services where failure is acceptable - non-critical background workers that can fail without impact
Probe Selection Guide
| Scenario | Startup Probe | Liveness Probe | Readiness Probe |
|---|---|---|---|
| Slow-starting application | Required | Not needed until startup completes | Not needed until startup completes |
| Depends on external services | Not needed | Not recommended (restarts on transient deps) | Required (blocks traffic during dependency issues) |
| Serves cached data when deps fail | Not needed | Not recommended | Optional (can return healthy with degraded status) |
| Stateless computation | Required if startup time is non-trivial | Optional (process crash = container restart) | Optional |
| Database-backed API | Required | Not recommended | Required |
Decision Flow
```mermaid
graph TD
    A[Implementing Health Checks] --> B{Application Slow to Start?}
    B -->|Yes| C[Add Startup Probe]
    B -->|No| D{Service Has Dependencies?}
    D -->|Yes| E{Need to Block Traffic When Deps Unavailable?}
    E -->|Yes| F[Add Readiness Probe]
    E -->|No| G[Add Liveness Probe]
    D -->|No| H{Can Crash Indicate Problem?}
    H -->|Yes| G
    H -->|No| I[No Probes Needed]
    C --> F
    C --> G
```
Quick Recap
Health checks let Kubernetes and your load balancers make intelligent routing decisions. Use liveness probes to detect crashed or deadlocked processes. Use readiness probes to control traffic routing based on dependency health. Use startup probes to give slow-starting applications time to initialize.
Keep liveness probes simple. Keep readiness probes fast and thorough. Set timeouts short and failure thresholds reasonable. Monitor your health checks and alert on failures.
For more on building resilient systems, see Resilience Patterns, Circuit Breaker Pattern, and Kubernetes.