Istio and Envoy: Deep Dive into Service Mesh Internals
Istio and Envoy are often mentioned together. Istio is the control plane. Envoy is the sidecar proxy that handles actual traffic. Understanding how they work together helps you debug issues, tune performance, and design better service meshes.
This post goes deeper than the overview. We cover Envoy’s architecture, how Istio programs Envoy via xDS APIs, mTLS implementation details, and traffic routing mechanics.
Envoy Proxy Architecture
Envoy is a C++ proxy built for microservices. It runs as a sidecar alongside each service. Every inbound and outbound request passes through Envoy.
Envoy is configured declaratively. Rather than issuing imperative commands to change its behavior, you push configuration and Envoy applies it.
Filter Chain
Envoy processes requests through a chain of filters. Each filter handles a specific concern. The chain is configurable.
graph TD
Req[Incoming Request] --> L[Listener]
L --> F1[Auth Filter]
F1 --> F2[RBAC Filter]
F2 --> F3[Router Filter]
F3 --> Upstream[Upstream Service]
Filters can inspect, modify, or reject requests. The auth filter validates credentials. The router filter decides which upstream cluster to send to. You can add custom filters for metrics, tracing, or business logic.
L4 and L7 Processing
Envoy handles both layer 4 (TCP) and layer 7 (HTTP/gRPC) traffic.
At L4, Envoy forwards raw bytes. It can do port forwarding, TLS passthrough, or TCP proxying.
At L7, Envoy understands HTTP protocols. It can route based on headers, modify request/response bodies, apply rate limiting, and do weighted traffic splitting.
Istio uses L7 processing for its advanced routing features. The sidecar proxy must terminate and re-establish connections to inspect L7 metadata.
Istio’s Architecture
Istio deploys two main components: the control plane (istiod) and the data plane (Envoy proxies).
istiod
istiod is the Istio control plane. It handles configuration distribution, sending routing rules and policies to Envoys. It manages certificates and rotates mTLS credentials for all services.
Envoy Sidecar Injection
In Kubernetes, Istio injects Envoy sidecars via a mutating admission webhook. Label a namespace with istio-injection=enabled, and every new pod gets an Envoy container automatically.
# Enable injection for a namespace
kubectl label namespace default istio-injection=enabled
# Create a pod - Istio adds the sidecar automatically
kubectl apply -f deployment.yaml
The injected istio-proxy container is configured with the istiod discovery address and ISTIO_META_* environment variables that identify the workload to the control plane.
xDS API: How Istio Programs Envoy
xDS stands for “x discovery service,” where x is a placeholder for the resource type being discovered. It is the family of APIs Envoy uses to receive configuration from Istio.
The four main xDS services:
- LDS (Listener Discovery Service): What ports and filters the proxy should set up
- RDS (Route Discovery Service): What routes to use for each listener
- CDS (Cluster Discovery Service): What upstream clusters exist
- EDS (Endpoint Discovery Service): What IPs are in each cluster
graph LR
Istiod -->|LDS/RDS| Envoy[Envoy Sidecar]
Istiod -->|CDS/EDS| Envoy
Envoy -->|request| Upstream[Upstream Service]
Envoy connects to Istiod and streams configuration updates. When you change a VirtualService, Istiod computes the new routing config and pushes it to affected Envoys within seconds.
How a Request Flows with xDS
- Client pod calls http://product-service:8080/api/products
- Envoy on the client side receives the request
- Envoy’s router filter looks up the route in RDS based on the host
- RDS returns the cluster name (e.g., “product-service”)
- EDS returns the endpoints (IPs) for that cluster
- Envoy load balances across endpoints, applies circuit breakers
- Envoy on server side receives the request, passes to the product container
All of this happens transparently. Your application makes a plain HTTP call. Envoy handles the rest.
mTLS Implementation
Istio provides mutual TLS (mTLS) automatically. All traffic between services is encrypted and authenticated.
How mTLS Works in Istio
Istio manages certificates through its built-in CA in istiod. Each workload gets a certificate signed by that CA, carrying a SPIFFE identity derived from the pod’s namespace and service account.
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
spec:
  mtls:
    mode: STRICT
With STRICT mode, only mTLS connections are allowed. Plain text connections are rejected at the proxy level.
Certificate Rotation
Istio rotates certificates automatically. Workload certificates have a short TTL (24 hours by default). The CA issues new certificates before the old ones expire.
Envoy detects certificate changes via its SDS (Secret Discovery Service). It reloads TLS context without restarting the proxy or dropping active connections.
Traffic Management
Istio’s traffic management goes beyond simple routing. It provides retries, timeouts, circuit breakers, and traffic splitting.
VirtualService Routing
VirtualService defines routing rules for traffic to a service.
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: product-service
spec:
  hosts:
  - product-service
  http:
  - match:
    - headers:
        X-Canary:
          exact: "true"
    route:
    - destination:
        host: product-service
        subset: v2
      weight: 100
  - route:
    - destination:
        host: product-service
        subset: v1
      weight: 100
This routes requests carrying the X-Canary: true header to subset v2. All other requests fall through to the second rule and go to v1.
Weighted Traffic Splitting
Gradually shift traffic between versions:
- route:
  - destination:
      host: product-service
      subset: v1
    weight: 90
  - destination:
      host: product-service
      subset: v2
    weight: 10
Start with 10% traffic to v2. Watch error rates. Increase to 50%. If everything looks stable, cut over to 100%.
Circuit Breaking
Prevent cascading failures with outlier detection:
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: product-service
spec:
  host: product-service
  trafficPolicy:
    outlierDetection:
      consecutive5xxErrors: 5
      interval: 30s
      baseEjectionTime: 30s
      maxEjectionPercent: 50
Envoy evaluates outlier detection on a 30-second interval. An endpoint that returns 5 consecutive 5xx errors is ejected from the load-balancing pool for at least 30 seconds (the ejection period grows with repeated ejections), and maxEjectionPercent caps ejections at half the pool. Other endpoints handle traffic while it recovers.
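Outlier detection is often paired with connection-pool limits in the same DestinationRule trafficPolicy, so a misbehaving client cannot exhaust an upstream. The numbers below are illustrative starting points, not recommendations:

```yaml
trafficPolicy:
  connectionPool:
    tcp:
      maxConnections: 100            # cap TCP connections to the upstream
    http:
      http1MaxPendingRequests: 64    # queue depth before requests are rejected
      maxRequestsPerConnection: 1024 # recycle connections periodically
```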
Observability
Istio generates telemetry automatically for every request.
Metrics
Envoy emits standard metrics: request count, request duration, request size, response size. Prometheus scrapes them from the proxy’s metrics endpoint (port 15090 by default).
Istio dashboards in Grafana show service-level metrics, including success rates, latencies, and saturation.
Distributed Tracing
Istio propagates trace context automatically. When a request enters the mesh, Istio creates or propagates a trace ID. Every service call carries the trace ID.
Jaeger or Zipkin collects traces. You see the complete request path across all services.
Access Logging
Envoy logs every request with details: source, destination, duration, response code. You can query logs by trace ID to see exactly what happened at each hop.
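Access logging is disabled by default in many installations; it can be enabled mesh-wide through meshConfig (a fragment of the IstioOperator spec or the istio ConfigMap):

```yaml
meshConfig:
  accessLogFile: /dev/stdout  # write Envoy access logs to the container's stdout
  accessLogEncoding: JSON     # structured logs are easier to query by trace ID
```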
Sidecar Resource Tuning
Envoy sidecars consume memory and CPU. At scale, tune them to avoid resource waste.
apiVersion: networking.istio.io/v1alpha3
kind: EnvoyFilter
metadata:
  name: resource-tuning
spec:
  configPatches:
  - applyTo: CLUSTER
    patch:
      operation: MERGE
      value:
        max_requests_per_connection: 1024
        connect_timeout: 5s
This caps how many requests Envoy sends over a single upstream connection before reconnecting and sets a 5-second connection timeout. Adjust based on your workload characteristics.
When to Use / When Not to Use Istio
Use Istio when:
- You need fine-grained traffic management (header-based routing, weighted splits, retries, timeouts)
- Multi-cluster networking is a requirement
- You need comprehensive observability across services (metrics, traces, logs)
- Compliance requires strong mTLS enforcement between all services
- You want centralized policy enforcement without modifying application code
- You are running on multiple clouds or hybrid environments
Probably not the right choice when:
- You are on a Kubernetes-only environment and Linkerd’s simplicity appeals to you
- Resource overhead is a primary concern (Linkerd has lower memory/CPU footprint)
- You only need basic mTLS without advanced traffic management
- Your team has limited capacity for complex infrastructure
- You are early in your microservices journey and have fewer than 10 services
Trade-off Analysis
| Factor | Istio | Linkerd | No Service Mesh |
|---|---|---|---|
| Setup Complexity | High | Low | None |
| Memory/CPU Overhead | Moderate (Envoy sidecar per pod plus istiod) | Low (lightweight Rust micro-proxy) | Minimal |
| Traffic Management | Full L7 control | Basic L7 | None (app-level) |
| Observability | Native metrics, traces, logs | Native metrics | Custom implementation |
| mTLS | Automatic with rotation | Automatic | Manual or service-level |
| Operational Burden | Requires Istio expertise | Lightweight | None |
| Extensibility | EnvoyFilter, Wasm | Limited | Full control |
| Multi-cluster Support | Native | Limited | Complex |
| Learning Curve | Steep | Gentle | N/A |
Istio Architecture Overview
graph TB
subgraph ControlPlane["Control Plane - istiod"]
CA[Certificate Authority]
Config[Configuration Manager]
Registry[Service Registry]
end
subgraph DataPlane["Data Plane - Envoy Sidecars"]
subgraph PodA["Pod A"]
EnvoyA[Envoy Sidecar]
AppA[Application]
end
subgraph PodB["Pod B"]
EnvoyB[Envoy Sidecar]
AppB[Application]
end
end
CA -->|mTLS Certificates| EnvoyA
CA -->|mTLS Certificates| EnvoyB
Config -->|xDS API LDS/RDS| EnvoyA
Config -->|xDS API LDS/RDS| EnvoyB
Config -->|xDS API CDS/EDS| EnvoyA
Config -->|xDS API CDS/EDS| EnvoyB
Registry -->|Service Discovery| Config
EnvoyA -->|mTLS| EnvoyB
EnvoyB -->|mTLS| EnvoyA
AppA -->|Outbound| EnvoyA
AppB -->|Outbound| EnvoyB
EnvoyA -->|Inbound| AppA
EnvoyB -->|Inbound| AppB
Production Failure Scenarios
| Failure | Impact | Mitigation |
|---|---|---|
| istiod becomes unavailable | New configurations not pushed; existing traffic continues normally | Run istiod in HA mode (at least 2 replicas); existing data plane unaffected |
| Envoy sidecar OOM killed | Service loses all network connectivity | Set appropriate memory limits; tune Envoy’s resource configuration |
| xDS streaming connection breaks | Envoy may use stale configuration | Implement local configuration caching; monitor xDS sync status |
| mTLS certificate rotation failure | Services cannot communicate; encryption breaks | Monitor certificate expiration; use SDS for dynamic rotation; set reasonable TTLs |
| Envoy filter chain misconfiguration | Requests rejected or routed incorrectly | Test configuration changes in staging; use progressive rollout with traffic percentage |
| Network partition between namespaces | Services in separated namespaces cannot communicate | Design namespace isolation appropriately; use multi-cluster networking if needed |
| VirtualService misconfiguration | Traffic routed to wrong service or dropped | Validate routing rules; use dry-run mode where available; monitor 404 rates |
Observability Checklist
Key Metrics
- Request count by service, destination, and response code
- Request duration histograms (p50, p95, p99) per service
- Request size and response size per service
- mTLS connection success rate
- Circuit breaker trip count per service
- Outlier detection ejection count
- xDS configuration sync status per proxy
- Envoy memory and CPU usage per pod
Logs
- Envoy access logging enabled with detailed request metadata
- Include trace ID in all access log entries
- Log all mTLS handshake failures with endpoint details
- Log all circuit breaker state changes
- Capture Envoy logs at appropriate verbosity (info for normal operation, debug when diagnosing issues)
Alerts
- Alert when error rate exceeds 1% for 5 minutes
- Alert when p99 latency exceeds defined threshold
- Alert when sidecar memory usage exceeds 80% of limit
- Alert when certificate expires within 7 days
- Alert when circuit breaker trips more than threshold times per minute
- Alert when xDS sync failures detected
- Alert on unexpected increase in 404 responses
Security Checklist
- PeerAuthentication set to STRICT mode (not PERMISSIVE)
- AuthorizationPolicy enforced between all service pairs (default deny)
- Workload certificates auto-rotated with short TTL (24h default)
- istiod access restricted via RBAC and network policies
- Envoy admin endpoint disabled or restricted (not exposed publicly)
- No plain text traffic allowed at network policy level
- Regular security scanning of Istio version for CVEs
- Audit logs for policy changes and certificate operations
- SDS (Secret Discovery Service) used for dynamic certificate delivery
- Avoid storing sensitive data in Envoy configuration or logs
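A mesh-wide default-deny can be expressed as an AuthorizationPolicy with an empty spec in the root namespace (istio-system by default); each service then needs an explicit ALLOW policy:

```yaml
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: default-deny
  namespace: istio-system
spec: {}  # an empty spec matches nothing, so all requests are denied
```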
Common Pitfalls / Anti-Patterns
Underconfigured sidecar resources: Envoy needs adequate CPU and memory. Underconfigured sidecars cause OOM kills that drop all traffic to the pod. Profile sidecar resource usage under load and set appropriate limits with headroom.
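Istio supports per-pod annotations to override the injected proxy’s resource requests and limits; the values below are illustrative, not recommendations:

```yaml
metadata:
  annotations:
    sidecar.istio.io/proxyCPU: "250m"
    sidecar.istio.io/proxyMemory: "256Mi"
    sidecar.istio.io/proxyCPULimit: "500m"
    sidecar.istio.io/proxyMemoryLimit: "512Mi"
```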
Ignoring proxy warm-up: New Envoy proxies need to fetch xDS configuration before handling traffic. Without proper readiness probes, Kubernetes routes traffic to pods before their proxies are ready. Configure readinessProbe that verifies xDS sync.
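One way to avoid this race is Istio’s holdApplicationUntilProxyStarts option, which reorders container startup so the application only starts after the proxy is ready (set mesh-wide via meshConfig, as sketched below, or per pod via annotation):

```yaml
meshConfig:
  defaultConfig:
    holdApplicationUntilProxyStarts: true
```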
Using PERMISSIVE mTLS in production: PERMISSIVE allows both mTLS and plain text connections. It is useful during migration but leaves a security gap if left enabled in production. Always switch to STRICT mode when migration completes.
Overly broad traffic policies: Applying mesh-wide policies that assume uniform requirements leads to problems. Some services need longer timeouts, different load balancing, or stricter circuit breaking. Use TrafficPolicy overrides per service.
Not tuning Envoy for workload: Default Envoy settings are conservative. Under high-throughput workloads, default connection limits, buffer sizes, and thread counts may become bottlenecks. Tune based on load testing.
Logging sensitive data in Envoy access logs: Envoy logs full request and response details by default. Ensure sensitive data (PII, credentials, tokens) is redacted or masked in logs to avoid security and compliance issues.
Quick Recap
graph LR
Istiod -->|LDS/RDS| Envoy
Istiod -->|CDS/EDS| Envoy
Envoy -->|mTLS| Envoy
Key Points
- Envoy is the sidecar proxy handling actual traffic; Istio is the control plane managing Envoys
- xDS API (LDS, RDS, CDS, EDS) programs Envoy declaratively
- mTLS and certificate rotation are automatic via SDS
- VirtualService and DestinationRule provide rich traffic management
- EnvoyFilter allows custom Envoy configuration when built-in features are insufficient
Production Checklist
# Istio Production Readiness
- [ ] mTLS set to STRICT mode
- [ ] AuthorizationPolicy with default-deny enforced
- [ ] Certificate rotation configured and monitored
- [ ] Sidecar resource limits appropriately configured
- [ ] Readiness probes configured for xDS sync
- [ ] Envoy access logging enabled with trace IDs
- [ ] Metrics dashboards operational
- [ ] Alerts configured for error rate, latency, and resource usage
- [ ] istiod running in HA mode (multi-replica)
- [ ] Circuit breaker thresholds configured per service
- [ ] Envoy admin endpoint restricted
- [ ] Regular Istio version updates for security patches
Related Concepts
For an introduction to service mesh concepts, see Service Mesh. For Kubernetes fundamentals, see Kubernetes.
Conclusion
Istio and Envoy work together to provide transparent service mesh features. Envoy’s filter chain and xDS API let Istio push configuration dynamically. mTLS happens automatically, with certificate rotation handled by the control plane.
The depth of control is significant. Route traffic with header rules, split traffic by percentage, enforce policies at the proxy level, and observe everything without touching application code. The trade-off is operational complexity: Istio is not a simple system to run.