mTLS: Mutual TLS for Service-to-Service Authentication
Learn how mutual TLS secures communication between microservices, how to implement it, and how service meshes simplify mTLS management.
TLS solves one half of the authentication problem. It proves the server is who it says it is. But when your services talk to each other, you need both sides to prove their identity. A regular TLS handshake lets any client connect to any server. In microservices, that is a problem.
A compromised service can impersonate any other service. It can eavesdrop on traffic, steal data, or man-in-the-middle requests between legitimate services. Regular TLS does not stop this.
mTLS fixes this gap. Both the client and server present certificates. Both sides verify each other before the connection proceeds. The result is encrypted, authenticated communication where both endpoints have proven their identity.
When to Use / When Not to Use
| Scenario | Use mTLS | Notes |
|---|---|---|
| Service-to-service communication within a cluster | Yes | Service mesh handles this automatically |
| Cross-cluster or multi-environment communication | Yes | mTLS with SPIFFE federation works well |
| External API calls (third-party services) | No | Use TLS with server certificates only; client certificates require certificate distribution |
| Mobile or desktop clients calling backend services | No | Use OAuth 2.0 / OIDC for user-facing flows |
| Services behind an API gateway that handles auth | Partial | mTLS between gateway and backends, not at the edge |
| IoT devices with limited crypto capability | Caution | CPU overhead and certificate management may not be feasible |
| High-throughput, latency-critical internal paths | Trade-off | Measure impact; connection pooling mitigates handshake latency |
Trade-offs
| Aspect | Regular TLS | mTLS |
|---|---|---|
| Authentication | Server only | Mutual (both sides) |
| Setup complexity | Lower | Higher (requires CA hierarchy, certificate distribution) |
| Operational overhead | Lower | Higher (rotation, revocation, chain validation) |
| Latency overhead | 1 RTT handshake (0-RTT with resumption) | 1 RTT handshake (TLS 1.3; the client certificate adds no extra round trips) |
| CPU overhead | Encryption only | Encryption + client certificate validation |
| Security posture | Server verified | Both endpoints verified cryptographically |
When NOT to Use mTLS
- Browser-based clients: mTLS requires client-side certificates, which browsers handle poorly. Use OIDC instead.
- Third-party integrations: Distributing your internal CA certificates to external parties is impractical. Use API keys or OAuth.
- Migration periods: Running PERMISSIVE mode long-term creates security gaps. Only use during controlled transitions.
- Stateless serverless functions: Cold start latency compounds with certificate provisioning. Evaluate whether connection-level authentication adds value for your invocation pattern.
How mTLS Differs from Regular TLS
Regular TLS uses a one-way authentication model. The client verifies the server’s certificate, but the server does not verify the client at all; it accepts connections from anyone.
```mermaid
graph LR
    Client -->|1. ClientHello| Server
    Server -->|2. Certificate| Client
    Client -->|3. Verify server| Client
    Server -->|4. Encrypted channel established| Client
```
This model falls apart in microservices. Service A needs to verify that Service B is actually Service B, not some other container pretending to be B. Every service in your cluster needs to verify every other service it talks to.
mTLS adds a second layer of authentication. Both sides present certificates. Both sides verify. Here is how the handshake works:
```mermaid
sequenceDiagram
    Client->>Server: 1. ClientHello
    Server->>Client: 2. Certificate
    Client->>Server: 3. Certificate
    Server->>Server: 4. Verify client
    Client->>Server: 5. Key exchange
    Server->>Client: 6. Encrypted channel
```
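Outside a mesh, the difference from one-way TLS shows up in a few lines of socket configuration: the server must require and verify a client certificate, and the client must load one to present. A minimal sketch using Python's standard `ssl` module (the file paths are placeholders for certificates issued by your CA):

```python
import ssl

def mtls_server_context(certfile: str, keyfile: str, ca_bundle: str) -> ssl.SSLContext:
    """Server-side context that requires and verifies a client certificate."""
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    ctx.minimum_version = ssl.TLSVersion.TLSv1_3
    ctx.verify_mode = ssl.CERT_REQUIRED          # the "mutual" in mTLS
    ctx.load_cert_chain(certfile, keyfile)       # the server's own leaf cert + key
    ctx.load_verify_locations(cafile=ca_bundle)  # CA bundle used to verify clients
    return ctx

def mtls_client_context(certfile: str, keyfile: str, ca_bundle: str) -> ssl.SSLContext:
    """Client-side context that presents its own certificate and verifies the server."""
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
    ctx.minimum_version = ssl.TLSVersion.TLSv1_3
    ctx.load_cert_chain(certfile, keyfile)       # the client's own leaf cert + key
    ctx.load_verify_locations(cafile=ca_bundle)  # CA bundle used to verify the server
    return ctx
```

Wrapping a socket with these contexts (`ctx.wrap_socket(...)`) yields a connection where both peers have been verified against the CA bundle, which is exactly what sidecar proxies do on your behalf.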
Certificate Authority Hierarchy
mTLS relies on a chain of trust. Understanding how this hierarchy works helps when debugging certificate issues and designing your PKI (Public Key Infrastructure).
Root CA
At the top sits the Root Certificate Authority. Root CAs are long-lived (often 10-20 years) and stored securely, usually offline. You do not use the Root CA directly to sign workload certificates. Instead, you create intermediate CAs.
Intermediate CA
Intermediate Certificate Authorities sit between the Root CA and leaf certificates. They are signed by the Root CA and can sign other certificates. Intermediates limit exposure: if one is compromised, you revoke it and create a new one without touching the Root CA.
Production mTLS setups usually have one Root CA and multiple intermediates per environment or per team.
Leaf Certificates
Leaf certificates (also called workload certificates) are what services actually use. Each service instance gets its own leaf certificate containing its service identity: service name, namespace, service account, and similar metadata.
Leaf certificates are short-lived. Hours or days, not years. This limits damage if a certificate is stolen. The CA issues new certificates automatically through rotation.
```mermaid
graph TD
    RootCA[Root CA] -->|signs| IntermediateCA[Intermediate CA]
    IntermediateCA -->|signs| ServiceA[Service A Certificate]
    IntermediateCA -->|signs| ServiceB[Service B Certificate]
    IntermediateCA -->|signs| ServiceC[Service C Certificate]
```
Certificate Lifecycle
Certificates are not set-and-forget. They need issuance, distribution, rotation, and revocation. Get any of these wrong and services stop communicating.
Issuance
When a new service pod starts, it needs a certificate. The pod requests a certificate from the CA through an API. The CA verifies the request, signs the certificate, and returns it.
Istio uses SDS (Secret Discovery Service) for this. The control plane issues certificates, and Envoy fetches them via SDS without restarts. Linkerd provisions certificates through its identity component, which verifies each pod’s Kubernetes service account token before issuing a certificate to the proxy.
Rotation
Certificates expire. Leaf certificates typically have 24-hour TTLs in production service mesh deployments. The CA issues new certificates before the old ones expire. Services pick up the new certificates automatically.
If rotation breaks, services lose communication when certificates expire. This is a common cause of production incidents. Monitor certificate expiration dates. Set alerts for certificates expiring within 7 days.
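The renewal schedule itself is simple arithmetic: renew well before expiry so a failed attempt leaves slack for retries and alerts. A sketch, assuming the common policy of renewing once half the TTL has elapsed (the fraction is illustrative; meshes differ):

```python
from datetime import datetime, timedelta

def next_rotation(issued_at: datetime, ttl: timedelta, renew_fraction: float = 0.5) -> datetime:
    """Renew after renew_fraction of the certificate's lifetime has elapsed."""
    return issued_at + ttl * renew_fraction

issued = datetime(2024, 1, 1, 0, 0)
rotate_at = next_rotation(issued, timedelta(hours=24))
expires_at = issued + timedelta(hours=24)
print(rotate_at)                 # 2024-01-01 12:00:00
assert rotate_at < expires_at    # 12 hours of slack before hard expiry
```

With a 24-hour TTL and a 0.5 fraction, a rotation failure gives you a 12-hour window to page someone before traffic breaks.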
Revocation
Sometimes you need to invalidate a certificate before it expires. A service is compromised. A private key leaks. You need to stop trusting that certificate immediately.
CRL (Certificate Revocation List) and OCSP (Online Certificate Status Protocol) handle revocation in traditional PKI. Service meshes handle it differently. Most do not check CRLs or OCSP for every connection due to latency. Instead, they rely on short certificate lifetimes. Remove a compromised workload from the network, and its certificate expires within hours.
Istio supports faster revocation through its control plane. You can also push updated validation contexts to Envoys to deny specific certificates immediately.
How mTLS Works in Service Communication
When Service A calls Service B over mTLS, the handshake happens at the connection layer, transparently to your application code.
```mermaid
sequenceDiagram
    participant A as Service A
    participant PA as Proxy A
    participant PB as Proxy B
    participant B as Service B
    A->>PA: HTTP request to Service B
    PA->>PA: TLS handshake with PB
    PA->>PB: ClientCertificate, Finished
    PB->>PB: Verify PA certificate
    PB->>B: Forward request (plaintext)
    B->>PB: Response
    PB->>PA: TLS encrypted response
    PA->>A: HTTP response
```
Sidecar proxies terminate TLS. Service A makes a plaintext HTTP call to its local proxy. The proxy on Service A’s side establishes mTLS with the proxy on Service B’s side. Service B’s proxy forwards the plaintext request to Service B.
Your application code never sees certificates or TLS. It makes normal network calls. The mesh handles authentication and encryption.
Certificate Path Validation
When a proxy receives a certificate during the TLS handshake, it validates the entire chain:
- Check the certificate is not expired
- Check the signature against the Intermediate CA’s public key
- Check the Intermediate CA’s certificate against the Root CA’s public key
- Check the certificate is not revoked (if configured)
If any check fails, the connection gets rejected.
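The same ordered walk can be modeled in a few lines. A toy sketch with plain dicts standing in for parsed certificates; issuer/subject name equality stands in for real signature verification, so this is illustrative only:

```python
import time

def validate_chain(leaf: dict, intermediate: dict, trusted_roots: set, now=None) -> str:
    """Walk leaf -> intermediate -> root in order; return 'ok' or the first failing check."""
    now = time.time() if now is None else now
    if now > leaf["not_after"]:
        return "leaf expired"
    if leaf["issuer"] != intermediate["subject"]:
        return "leaf not signed by intermediate"       # stands in for a signature check
    if now > intermediate["not_after"]:
        return "intermediate expired"
    if intermediate["issuer"] not in trusted_roots:
        return "intermediate not signed by a trusted root"
    if leaf.get("revoked"):
        return "leaf revoked"                          # only if revocation is configured
    return "ok"

inter = {"subject": "intermediate-ca", "issuer": "root-ca", "not_after": 2_000_000_000}
leaf = {"subject": "service-a", "issuer": "intermediate-ca", "not_after": 1_900_000_000}
print(validate_chain(leaf, inter, {"root-ca"}, now=1_700_000_000))  # ok
```

The early-return structure mirrors real validators: the first failing check aborts the handshake, which is why error messages often name only one broken link in the chain.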
Service Mesh Auto-mTLS
Setting up mTLS manually for every service is painful. You need to issue certificates, distribute them, handle rotation, and configure each service. Service meshes automate this.
Istio
Istio provides automatic mTLS through its control plane (istiod). Enable STRICT mode for a namespace and all communication requires mTLS.
```yaml
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
spec:
  mtls:
    mode: STRICT
```
With STRICT mode, only connections with valid mTLS certificates are allowed. PERMISSIVE mode allows both mTLS and plain text, useful during migration.
Istio issues workload certificates with 24-hour TTLs and rotates them automatically via SDS. Envoy detects certificate changes and reloads TLS context without dropping active connections.
Linkerd
Linkerd uses a different approach. Each service pod gets a Linkerd proxy (written in Rust) that handles mTLS automatically.
Linkerd’s CA issues certificates with short TTLs and handles rotation transparently. You do not configure mTLS explicitly; it is on by default for all mesh traffic. There is no PeerAuthentication resource.
Certificate Management Tools
Outside of service meshes, you need tools to manage certificates. cert-manager and Vault are the most common choices.
cert-manager
cert-manager is a Kubernetes-native certificate controller. It manages certificates from various issuers (Let’s Encrypt, Vault, internal CA) and keeps them renewed.
```yaml
apiVersion: cert-manager.io/v1
kind: Issuer
metadata:
  name: my-ca
spec:
  ca:
    secretName: ca-key-pair
---
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: service-a-cert
spec:
  secretName: service-a-tls
  issuerRef:
    name: my-ca
  commonName: service-a.default.svc
  dnsNames:
    - service-a.default.svc
  duration: 24h
  renewBefore: 4h
```
cert-manager handles rotation automatically. Before the certificate expires, cert-manager contacts the issuer, obtains a new certificate, and updates the Kubernetes secret.
HashiCorp Vault
Vault is a more complete secrets management solution. It can issue certificates dynamically, revoke them, and handle key rotation. Vault’s PKI secrets engine supports mTLS certificate issuance with short TTLs.
```bash
# Configure Vault PKI engine
vault secrets enable pki

# Set certificate TTL
vault secrets tune -max-lease-ttl=24h pki

# Create a role for service certificates
# (domain and SPIFFE trust domain here are illustrative)
vault write pki/roles/service-mesh \
    allowed_domains="svc.cluster.local" \
    allow_subdomains=true \
    allowed_uri_sans="spiffe://cluster/*" \
    max_ttl=24h
```
Services authenticate to Vault using Kubernetes service accounts, request certificates, and receive short-lived credentials. Vault can also handle dynamic secret generation for other secrets beyond certificates.
Common Pitfalls
mTLS adds complexity. Several failure modes cause production incidents.
Certificate Expiration
The most common issue. Certificates expire, rotation breaks, and services lose communication. This happens because rotation logic has bugs, network issues prevent certificate fetch, or misconfigured TTLs.
Monitor certificate expiration actively. Set multiple alerts at different thresholds (7 days, 3 days, 1 day). Test rotation periodically in staging.
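That escalation ladder can live anywhere from a Prometheus rule to a cron script; the logic is threshold comparison. A sketch using the 7/3/1-day thresholds above (the severity labels are illustrative):

```python
DAY = 86_400  # seconds

def expiry_alert(seconds_left: int) -> str:
    """Map time-to-expiry to an alert severity using 7d / 3d / 1d thresholds."""
    if seconds_left < 1 * DAY:
        return "critical"
    if seconds_left < 3 * DAY:
        return "high"
    if seconds_left < 7 * DAY:
        return "warning"
    return "ok"

print(expiry_alert(10 * DAY))  # ok
print(expiry_alert(2 * DAY))   # high
```

Multiple thresholds matter because the first alert should fire while rotation still has time to self-heal, and the last while a human can still intervene.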
Revocation Checking Failures
If you rely on CRL or OCSP for revocation, failures can cause connection timeouts. Clients wait for revocation checks to timeout before failing.
Use short TTLs instead of aggressive revocation checking. If you must use CRLs, ensure they are accessible and small.
Mixed mTLS Modes
During migration, you may run PERMISSIVE mode in some namespaces and STRICT in others. Forgetting to switch back to STRICT leaves security gaps.
Audit PERMISSIVE configurations regularly. Use STRICT by default and only use PERMISSIVE temporarily during controlled migrations.
Certificate Chain Issues
If intermediate CA certificates are not distributed correctly, verification fails with cryptic errors. Applications see “certificate verify failed” without clear indication of the missing intermediate.
Ensure your CA hierarchy is correct. Verify certificates include the full chain. Test in staging before production deployment.
Namespace Isolation Gaps
mTLS policies sometimes have gaps between namespaces. A service in Namespace A may accept connections from Namespace B even if you intended isolation.
Define authorization policies explicitly. Assume default-deny. Explicitly allow only the service pairs that must communicate.
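In Istio, a default-deny posture plus explicit allows can be expressed with AuthorizationPolicy resources. A sketch only; the namespace, service account, and policy names are illustrative:

```yaml
# Deny everything in the namespace by default
# (an ALLOW policy with no rules matches nothing, so it denies all)
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: deny-all
  namespace: payments
spec: {}
---
# Explicitly allow only the orders service to call into payments
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: allow-orders
  namespace: payments
spec:
  action: ALLOW
  rules:
    - from:
        - source:
            principals: ["cluster.local/ns/orders/sa/orders-service"]
```

The `principals` field matches the mTLS identity in the client certificate, so these policies only work when mTLS is enforced.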
Performance Implications
mTLS adds latency and CPU overhead. The TLS handshake requires extra round trips and cryptographic operations. The overhead is usually manageable, but it is not zero.
Handshake Latency
A full TLS 1.3 handshake completes in one round trip, with or without client certificates (TLS 1.2 needed two). The extra certificate exchange and verification on both sides adds CPU time rather than round trips.
For short-lived connections, this matters more. If your services make many short-lived calls, use connection pooling to amortize handshake cost across many requests.
CPU Overhead
TLS encryption and decryption consume CPU. AES-NI hardware acceleration helps significantly. Modern CPUs handle TLS overhead well for most workloads.
Under heavy load with many concurrent connections, CPU may become a bottleneck. Profile your services with mTLS enabled.
Memory Overhead
Each TLS connection consumes memory for buffers and session state. Sidecar proxies add memory consumption per service instance.
Envoy’s memory usage scales with connection count. At high connection counts, tune buffer sizes and connection limits.
mTLS vs SPIFFE/SPIRE for Workload Identity
SPIFFE (Secure Production Identity Framework for Everyone) and its implementation SPIRE provide a standardized approach to workload identity that goes beyond certificates alone.
SPIFFE
SPIFFE defines a URI scheme for workload identity: spiffe://trust-domain/path. These identities are embedded in X.509 certificates (SVIDs - SPIFFE Verifiable Identity Documents) or JWTs.
SPIFFE focuses on the identity layer. It answers: how do I know which workload is making this request?
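Because a SPIFFE ID is a plain URI, splitting it into trust domain and workload path needs nothing beyond the standard library. A small sketch (the example ID is illustrative):

```python
from urllib.parse import urlparse

def parse_spiffe_id(spiffe_id: str):
    """Split spiffe://trust-domain/path into (trust_domain, path)."""
    parts = urlparse(spiffe_id)
    if parts.scheme != "spiffe" or not parts.netloc:
        raise ValueError(f"not a SPIFFE ID: {spiffe_id}")
    return parts.netloc, parts.path

domain, path = parse_spiffe_id("spiffe://cluster.local/ns/payments/sa/payment-service")
print(domain)  # cluster.local
print(path)    # /ns/payments/sa/payment-service
```

The trust domain scopes which CA hierarchy the identity belongs to; the path encodes workload metadata such as namespace and service account.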
SPIRE
SPIRE is the implementation. It runs as an agent on each node and a server that manages registration and policy. The agent attests the workload’s environment (Kubernetes, AWS, etc.) and obtains SVIDs from the server.
```mermaid
graph TD
    SPIREServer[SPIRE Server] -->|issues SVID| Agent[SPIRE Agent]
    Agent -->|attests| Workload[Workload]
    Workload -->|uses SVID| Service[Service]
```
Comparison
mTLS provides encryption and authentication. SPIFFE/SPIRE provides the identity layer that mTLS relies on. They work together.
Istio supports SPIFFE-based identity natively. Linkerd has its own identity system built on the same workload-identity principles. If you use a service mesh, you are already using SPIFFE-like identity, even if you do not use SPIRE explicitly.
Use SPIRE directly when you need workload identity outside a service mesh or across multiple platforms. SPIRE can provision mTLS certificates for any workload, not just Kubernetes.
Production Checklist
Before going to production with mTLS:
- mTLS set to STRICT mode (not PERMISSIVE) in all namespaces
- Authorization policies define which service pairs can communicate
- Certificate rotation tested and monitored
- Alerts configured for certificate expiration (7 days, 3 days, 1 day)
- Certificate chain validated in staging
- Monitoring for mTLS handshake failures
- Resource limits set for sidecar proxies
- Performance profiled under load with mTLS enabled
- Backup CA certificates stored securely
- Rotation procedures documented and tested
Related Concepts
Service meshes handle mTLS automatically. For more on how meshes work, see Service Mesh and Istio and Envoy.
Resilience patterns like circuit breakers protect against cascading failures. See Circuit Breaker Pattern and Resilience Patterns.
Beyond proving service identity, services also need well-defined interfaces; see API Contracts for how services establish them.
Conclusion
mTLS secures service-to-service communication in microservices architectures. Both client and server authenticate each other, preventing impersonation attacks. Certificates issued by a CA hierarchy establish trust, and short TTLs limit the impact of compromised certificates.
Service meshes like Istio and Linkerd handle mTLS automatically. They issue certificates, rotate them, and enforce policies without application code changes. This makes mTLS practical even at scale.
The operational complexity is real. Certificate expiration causes outages. Mixed modes leave security gaps. But with proper monitoring, automated rotation, and explicit authorization policies, mTLS provides strong security for service communication.
Plain text communication between services means a compromised service can eavesdrop on or impersonate any other service in your network. For any production system handling sensitive data, mTLS is not optional.
Production Runbook
Failure Scenarios and Mitigations
Scenario: Certificate Expiration Outage
Symptoms: Services stop communicating. Logs show “certificate verify failed” or “handshake failure” errors. Intermittent 503s between specific service pairs.
Diagnosis:
```bash
# Decode a TLS secret and check its actual expiry date
kubectl get secret <secret-name> -o jsonpath='{.data.tls\.crt}' \
  | base64 -d | openssl x509 -noout -enddate

# For Istio, inspect the workload certificates Envoy is serving
istioctl proxy-config secret <pod-name>

# For Linkerd, validate the data plane's certificates
linkerd check --proxy
```
Mitigation:
- Identify which certificates are expired vs approaching expiry
- If the CA is working but rotation failed for specific pods, restart those pods to force certificate re-fetch
- If the CA itself has issues, you may need to fall back to PERMISSIVE mode temporarily while fixing the CA
- After fixing, restart affected pods and verify valid certificates are back in place (for Istio, istioctl proxy-config secret <pod> shows each workload certificate’s status and expiry)
Prevention:
- Alert at 7 days, 3 days, and 24 hours before expiration
- Test certificate rotation in staging every sprint
- Have backup CA certificates available for emergency key rotation
Scenario: Mixed mTLS Mode Security Gap
Symptoms: Security audit finds services accepting plain text connections. PERMISSIVE mode still configured in production namespaces.
Diagnosis:
```bash
# Check Istio PeerAuthentication policies
kubectl get peerauthentication -A -o yaml

# Find namespaces with PERMISSIVE mode
kubectl get peerauthentication -A | grep -v STRICT

# Verify with Envoy access logs
# Look for non-mTLS connections in logs
```
Mitigation:
- Identify all PERMISSIVE configurations
- Audit which services legitimately need PERMISSIVE (typically only during migration)
- Plan migration to STRICT for each service pair
- Apply STRICT mode incrementally, monitoring for breakage
Prevention:
- Policy-as-code to detect PERMISSIVE in production
- Automated security scans on namespace configurations
- Require PR approval for any PERMISSIVE mode change
Scenario: Intermediate CA Certificate Chain Break
Symptoms: “certificate verify failed” errors without clear indication of which certificate in the chain is problematic. Some services work, others do not.
Diagnosis:
```bash
# Check certificate chain in a pod
kubectl exec -it <pod> -c istio-proxy -- openssl s_client -connect <service>:443 -showcerts

# Verify chain against known good Root CA
openssl verify -CAfile /etc/certs/root-cert.pem /etc/certs/cert-chain.pem

# Check Istio control plane certificate
kubectl get secret istio-ca-secret -n istio-system -o yaml
```
Mitigation:
- Identify which intermediate CA signed the problematic certificates
- Distribute the missing intermediate CA certificate to affected services
- In Istio, restart the control plane to propagate updated certificates
- Verify the full chain is present in Envoy configuration
Prevention:
- Test certificate chain validation in CI/CD pipeline
- Monitor for “unable to get local issuer certificate” errors
- Store intermediate CA certificates in a configmap that updates automatically
Scenario: Sidecar Proxy Memory Pressure
Symptoms: OOM kills on sidecar proxies. Services experiencing latency spikes. Envoy memory usage growing unbounded.
Diagnosis:
```bash
# Check sidecar memory usage
kubectl top pods -n <namespace> -l app=<service>

# Check Envoy stats via the admin endpoint
curl -s http://<pod>:15000/stats | grep "memory"

# Check active downstream connection counts
curl -s http://<pod>:15000/stats | grep "downstream_cx_active"
```
Mitigation:
- Reduce connection limits in Envoy configuration
- Tune buffer sizes for your workload
- If due to connection buildup, check for failed health checks causing connection accumulation
- Scale horizontally if individual services have too many connections
Prevention:
- Set resource limits on sidecar proxies
- Monitor Envoy memory trends
- Configure circuit breakers to prevent cascading connection buildup
Observability Hooks
Metrics to Capture
| Metric | What It Tells You | Alert Threshold |
|---|---|---|
| mtls_handshake_success_total | mTLS handshake success rate | >0.1% failure rate |
| mtls_handshake_duration_seconds | Handshake latency histogram | p99 > 50ms |
| certificate_expiration_seconds | Time until certificate expires | <7 days warning, <1 day critical |
| envoy_worker_threads_busy_percent | Sidecar CPU saturation | >80% |
| envoy_memory_heap_size_bytes | Sidecar memory usage | Growing trend or >80% of limit |
Logs to Collect
From Envoy sidecar (structured logging):
```json
{
  "event": "mtls_handshake",
  "connection_id": "abc123",
  "source_workload": "payment-service",
  "destination_workload": "invoice-service",
  "result": "success|failure",
  "failure_reason": "certificate_expired|chain_validation_failed|revoked",
  "tls_version": "1.3",
  "duration_ms": 12
}
```
Key log fields: source identity, destination identity, handshake result, failure reason, duration, TLS version.
Traces to Capture
Enable tracing on Envoy with Jaeger or Zipkin. Key span attributes:
- mTLS.peer.certificate.valid: boolean
- mTLS.peer.certificate.expiry.unix_timestamp: certificate expiration
- mTLS.peer.identity: SPIFFE ID of the remote peer
Dashboards to Build
- mTLS Health Overview: Handshake success rate, failure breakdown by reason, certificate expiration countdown
- Sidecar Resource Utilization: Memory, CPU per service, connection counts
- Certificate Lifecycle: Issuance rate, rotation success rate, upcoming expirations
- Security Posture: PERMISSIVE mode violations, unauthorized connection attempts
Alerting Rules
```yaml
# Certificate expiring
- alert: CertificateExpiringSoon
  expr: certificate_expiration_seconds < 86400 * 7
  labels:
    severity: warning
  annotations:
    summary: "Certificate expires in {{ $value }} seconds"
- alert: CertificateExpiringCritical
  expr: certificate_expiration_seconds < 86400
  labels:
    severity: critical

# mTLS handshake failures (failure ratio, not raw rate)
- alert: MTLSHandshakeFailures
  expr: >
    rate(mtls_handshake_failure_total[5m])
      / (rate(mtls_handshake_failure_total[5m]) + rate(mtls_handshake_success_total[5m]))
    > 0.001
  labels:
    severity: warning
  annotations:
    summary: "mTLS handshake failure rate above 0.1%"
```
Quick Recap
- mTLS authenticates both client and server in service-to-service communication, preventing impersonation attacks that regular TLS cannot stop
- The CA hierarchy (Root CA, Intermediate CA, Leaf certificates) establishes trust; leaf certificates should be short-lived (hours to days)
- Service meshes like Istio and Linkerd automate certificate issuance, rotation, and enforcement without application code changes
- Certificate expiration is the most common production failure; monitor actively and alert at multiple thresholds (7d, 3d, 1d)
- Always use STRICT mode in production; PERMISSIVE mode is only for controlled migrations
- Sidecar proxies add memory overhead per connection and CPU overhead for TLS operations; profile under load
- SPIFFE/SPIRE provides standardized workload identity that mTLS certificates carry; they work together
- Build dashboards tracking handshake success rate, certificate expiration, and sidecar resource utilization