Istio and Envoy: Deep Dive into Service Mesh Internals

Explore Istio service mesh architecture, Envoy proxy internals, mTLS implementation, traffic routing rules, and observability features with practical examples.

Istio and Envoy are often mentioned together. Istio is the control plane. Envoy is the sidecar proxy that handles actual traffic. Understanding how they work together helps you debug issues, tune performance, and design better service meshes.

This post goes deeper than the overview. We cover Envoy’s architecture, how Istio programs Envoy via xDS APIs, mTLS implementation details, and traffic routing mechanics.

Envoy Proxy Architecture

Envoy is a C++ proxy built for microservices. It runs as a sidecar alongside each service. Every inbound and outbound request passes through Envoy.

Envoy is configured declaratively. You do not issue imperative commands to change its behavior; you supply the desired configuration, either as a static file or dynamically over the xDS APIs, and Envoy applies it.

Filter Chain

Envoy processes requests through a chain of filters. Each filter handles a specific concern. The chain is configurable.

graph TD
    Req[Incoming Request] --> L[Listener]
    L --> F1[Auth Filter]
    F1 --> F2[RBAC Filter]
    F2 --> F3[Router Filter]
    F3 --> Upstream[Upstream Service]

Filters can inspect, modify, or reject requests. The auth filter validates credentials. The router filter decides which upstream cluster to send to. You can add custom filters for metrics, tracing, or business logic.
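
To make the chain idea concrete, here is a minimal Python sketch of filters that run in order and short-circuit on rejection. It is a toy model, not Envoy's filter API: the Request class, the three filter functions, and the header names are all hypothetical and exist only for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Request:
    path: str
    headers: dict = field(default_factory=dict)
    cluster: Optional[str] = None  # filled in by the router filter


# A filter returns None to continue, or an HTTP status code to reject.
Filter = Callable[[Request], Optional[int]]


def auth_filter(req: Request) -> Optional[int]:
    # Hypothetical check: any non-empty Authorization header counts as valid.
    return None if req.headers.get("authorization") else 401


def rbac_filter(req: Request) -> Optional[int]:
    # Hypothetical policy: only the "admin" user may reach /admin paths.
    if req.path.startswith("/admin") and req.headers.get("x-user") != "admin":
        return 403
    return None


def router_filter(req: Request) -> Optional[int]:
    # Terminal filter: pick the upstream cluster for the request.
    req.cluster = "product-service"
    return None


def run_chain(filters: list, req: Request) -> int:
    """Run filters in order; the first rejection short-circuits the chain."""
    for f in filters:
        status = f(req)
        if status is not None:
            return status
    return 200


CHAIN = [auth_filter, rbac_filter, router_filter]
```

Calling run_chain(CHAIN, Request("/api/products")) stops at the auth filter, so the router never runs. Order matters: putting the router first would select an upstream before the request had been authorized.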

L4 and L7 Processing

Envoy handles both layer 4 (TCP) and layer 7 (HTTP/gRPC) traffic.

At L4, Envoy forwards raw bytes. It can do port forwarding, TLS passthrough, or TCP proxying.

At L7, Envoy understands HTTP protocols. It can route based on headers, modify requests and responses (headers natively, bodies through extension filters such as Lua or Wasm), apply rate limiting, and do weighted traffic splitting.

Istio uses L7 processing for its advanced routing features. The sidecar proxy must terminate and re-establish connections to inspect L7 metadata.

Istio’s Architecture

Istio deploys two main components: the control plane (istiod) and the data plane (Envoy proxies).

istiod

istiod is the Istio control plane. It handles configuration distribution, sending routing rules and policies to Envoys. It manages certificates and rotates mTLS credentials for all services.

Envoy Sidecar Injection

In Kubernetes, Istio injects Envoy sidecars via a mutating admission webhook. Label a namespace with istio-injection=enabled, and every new pod gets an Envoy container automatically.

# Enable injection for a namespace
kubectl label namespace default istio-injection=enabled

# Create a pod - Istio adds the sidecar automatically
kubectl apply -f deployment.yaml

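The injected container (named istio-proxy) runs Envoy together with the Istio agent. Its ISTIO_META_* environment variables carry workload metadata that the proxy reports to istiod, and its proxy configuration tells it which istiod address to connect to.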
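
The webhook's decision can be sketched in a few lines of Python. This is a deliberately simplified model: the real injector also adds init containers, volumes, and honors pod-level overrides, and the image name below is a made-up placeholder rather than Istio's actual proxy image.

```python
import copy

# Hypothetical sidecar definition; the image name is a placeholder.
SIDECAR = {"name": "istio-proxy", "image": "example/proxyv2:hypothetical"}


def should_inject(namespace_labels: dict) -> bool:
    """Injection applies when the namespace carries istio-injection=enabled."""
    return namespace_labels.get("istio-injection") == "enabled"


def mutate_pod(pod: dict, namespace_labels: dict) -> dict:
    """Return a copy of the pod with the sidecar appended, if injection applies."""
    if not should_inject(namespace_labels):
        return pod
    patched = copy.deepcopy(pod)
    containers = patched["spec"]["containers"]
    # Skip pods that already carry the sidecar so the mutation is idempotent.
    if not any(c["name"] == SIDECAR["name"] for c in containers):
        containers.append(dict(SIDECAR))
    return patched
```

Because the mutation happens at admission time, pods that existed before the namespace was labeled keep running without a sidecar until they are recreated.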

xDS API: How Istio Programs Envoy

xDS stands for “everything discovery service.” It is the protocol Envoy uses to receive configuration from Istio.

The four main xDS services:

  • LDS (Listener Discovery Service): What ports and filters the proxy should set up
  • RDS (Route Discovery Service): What routes to use for each listener
  • CDS (Cluster Discovery Service): What upstream clusters exist
  • EDS (Endpoint Discovery Service): What IPs are in each cluster

graph LR
    Istiod -->|LDS/RDS| Envoy[Envoy Sidecar]
    Istiod -->|CDS/EDS| Envoy
    Envoy -->|request| Upstream[Upstream Service]

Envoy connects to istiod and keeps a gRPC stream open for configuration updates. When you change a VirtualService, istiod computes the new routing config and pushes it to the affected Envoys, typically within seconds.

How a Request Flows with xDS

  1. Client pod calls http://product-service:8080/api/products
  2. Envoy on client side receives the request
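  3. Envoy matches the request’s host against the route table it previously received via RDS
  4. The matching route names the target cluster (e.g., “product-service”), which was defined via CDS
  5. The endpoints (IPs) for that cluster come from the most recent EDS update; no call to istiod happens per request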
  6. Envoy load balances across endpoints, applies circuit breakers
  7. Envoy on server side receives the request, passes to the product container

All of this happens transparently. Your application makes a plain HTTP call. Envoy handles the rest.
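
The lookup in the steps above can be sketched as three in-memory tables that the proxy consults locally. This is a minimal Python illustration, assuming hypothetical IPs and the simplified cluster name used in this post; real Istio cluster names are longer and the tables are Envoy-internal structures, not dictionaries.

```python
from itertools import cycle

# Hypothetical snapshots of what each discovery service has already delivered.
RDS = {"product-service:8080": "product-service"}               # host -> cluster
CDS = {"product-service": {"lb_policy": "ROUND_ROBIN"}}         # cluster -> settings
EDS = {"product-service": ["10.0.1.5:8080", "10.0.2.7:8080"]}   # cluster -> endpoints

_round_robin = {}  # one endpoint iterator per cluster


def resolve(host: str) -> str:
    """Map a request host to one endpoint using only local configuration."""
    cluster = RDS.get(host)
    if cluster is None or cluster not in CDS:
        raise LookupError(f"no route for host {host!r}")
    endpoints = EDS.get(cluster, [])
    if not endpoints:
        raise LookupError(f"no healthy endpoints for cluster {cluster!r}")
    picker = _round_robin.setdefault(cluster, cycle(endpoints))
    return next(picker)
```

Successive calls to resolve("product-service:8080") alternate between the two endpoints. The point of the sketch is that nothing here contacts the control plane: the tables are filled ahead of time by xDS pushes.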

mTLS Implementation

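Istio provides mutual TLS (mTLS) automatically. Traffic between sidecar-injected workloads is encrypted and authenticated; by default the mesh runs in PERMISSIVE mode, so plain text from workloads outside the mesh is still accepted until you enforce STRICT.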

How mTLS Works in Istio

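Istio manages certificates through its built-in CA (certificate authority), which runs inside istiod. The CA root certificate is distributed to every namespace (as the istio-ca-root-cert ConfigMap) so workloads can verify their peers. Each pod gets a workload certificate signed by the CA that encodes the identity of its service account.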

apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
spec:
  mtls:
    mode: STRICT

With STRICT mode, only mTLS connections are allowed. Plain text connections are rejected at the proxy level.
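
The difference between the two modes discussed in this post comes down to one decision per inbound connection. A minimal Python sketch of that decision follows; the enum and function names are hypothetical and only model the policy outcome, not how Envoy detects TLS.

```python
from enum import Enum


class MtlsMode(Enum):
    STRICT = "STRICT"
    PERMISSIVE = "PERMISSIVE"


def accept_connection(mode: MtlsMode, is_mtls: bool) -> bool:
    """Decide whether the sidecar accepts an inbound connection."""
    if mode is MtlsMode.STRICT:
        # Only mutually authenticated TLS is allowed.
        return is_mtls
    # PERMISSIVE accepts both mTLS and plain text.
    return True
```

With STRICT, accept_connection returns False for any plain text client, which is why workloads without sidecars lose access the moment the policy is applied.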

Certificate Rotation

Istio rotates certificates automatically. Workload certificates have a short TTL (24 hours by default). The CA issues new certificates before the old ones expire.

Envoy detects certificate changes via its SDS (Secret Discovery Service). It reloads TLS context without restarting the proxy or dropping active connections.
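
Rotating "before the old ones expire" implies a threshold check along these lines. The sketch below is an assumption-laden illustration: grace_ratio is a hypothetical parameter chosen for the example, not Istio's actual rotation setting.

```python
from datetime import datetime, timedelta, timezone


def should_rotate(issued_at: datetime, ttl: timedelta, now: datetime,
                  grace_ratio: float = 0.2) -> bool:
    """Rotate once the remaining lifetime drops below grace_ratio of the TTL.

    grace_ratio is a hypothetical knob for this sketch.
    """
    expires_at = issued_at + ttl
    return now >= expires_at - ttl * grace_ratio


issued = datetime(2024, 1, 1, tzinfo=timezone.utc)
ttl = timedelta(hours=24)
halfway = should_rotate(issued, ttl, issued + timedelta(hours=12))
near_expiry = should_rotate(issued, ttl, issued + timedelta(hours=20))
```

Here halfway is False and near_expiry is True, because with a 24-hour TTL and a 0.2 ratio the rotation window opens 4.8 hours before expiry. Rotating early leaves time to retry if the CA is briefly unreachable.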

Traffic Management

Istio’s traffic management goes beyond simple routing. It provides retries, timeouts, circuit breakers, and traffic splitting.

VirtualService Routing

VirtualService defines routing rules for traffic to a service.

apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: product-service
spec:
  hosts:
    - product-service
  http:
    - match:
        - headers:
            X-Canary:
              exact: "true"
      route:
        - destination:
            host: product-service
            subset: v2
          weight: 100
    - route:
        - destination:
            host: product-service
            subset: v1
          weight: 100

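This routes requests carrying the header X-Canary: true to version v2; all other requests fall through to the second rule and go to v1. Rules are evaluated in order and the first match wins. The v1 and v2 subsets must be defined in a DestinationRule for product-service, otherwise the routes have no endpoints to send to.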
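
The matching behavior of the rules above can be modeled as an ordered list where the first match wins. This is a minimal Python sketch, not Istio's implementation; the RULES structure and pick_subset function are hypothetical.

```python
# Ordered rules mirroring the VirtualService above; an empty match is a catch-all.
RULES = [
    {"match_headers": {"x-canary": "true"}, "subset": "v2"},
    {"match_headers": {}, "subset": "v1"},
]


def pick_subset(headers: dict) -> str:
    """Return the subset of the first rule whose header matches all hold."""
    # HTTP header names are case-insensitive; exact value matches are not.
    normalized = {name.lower(): value for name, value in headers.items()}
    for rule in RULES:
        wanted = rule["match_headers"]
        if all(normalized.get(name) == value for name, value in wanted.items()):
            return rule["subset"]
    raise LookupError("no rule matched")
```

A request with X-Canary: false does not satisfy the exact match, so it falls through to the catch-all and lands on v1. Swapping the rule order would send everything to v1, since the catch-all would match first.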

Weighted Traffic Splitting

Gradually shift traffic between versions:

- route:
    - destination:
        host: product-service
        subset: v1
      weight: 90
    - destination:
        host: product-service
        subset: v2
      weight: 10

Start with 10% traffic to v2. Watch error rates. Increase to 50%. If everything looks stable, cut over to 100%.
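
Weighted splitting is a per-request random choice, so the observed share only approximates the configured weight. A short Python simulation of the 90/10 split illustrates this; the function name and the fixed seed are choices made for the example.

```python
import random
from collections import Counter


def pick_weighted(routes, rng):
    """Pick one subset at random, proportionally to its weight."""
    subsets = [subset for subset, _ in routes]
    weights = [weight for _, weight in routes]
    return rng.choices(subsets, weights=weights, k=1)[0]


ROUTES = [("v1", 90), ("v2", 10)]
rng = random.Random(42)  # seeded so the simulation is repeatable
counts = Counter(pick_weighted(ROUTES, rng) for _ in range(10_000))
v2_share = counts["v2"] / sum(counts.values())
```

Over 10,000 simulated requests v2_share comes out close to 0.10 but not exactly. At low request volumes the deviation is larger, so judge a canary on enough traffic before raising its weight.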

Circuit Breaking

Prevent cascading failures with outlier detection:

apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: product-service
spec:
  host: product-service
  trafficPolicy:
    outlierDetection:
      consecutive5xxErrors: 5
      interval: 30s
      baseEjectionTime: 30s
      maxEjectionPercent: 50

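If a pod returns 5 consecutive 5xx responses, Envoy ejects it from the load balancer pool. The interval is how often Envoy runs its ejection analysis sweep, not a window in which errors are counted. baseEjectionTime is the minimum ejection duration, and it grows each time the same host is ejected again. maxEjectionPercent caps ejections at 50% of the pool so the remaining pods keep serving traffic while the ejected ones recover.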
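
The counting logic behind consecutive5xxErrors and maxEjectionPercent can be sketched as follows. This is a simplified Python model that leaves out timing entirely (the sweep interval and ejection duration); the OutlierDetector class is hypothetical and is not Envoy's implementation.

```python
class OutlierDetector:
    """Toy model: eject hosts after N consecutive 5xx, up to a percentage cap."""

    def __init__(self, hosts, consecutive_5xx=5, max_ejection_percent=50):
        self.threshold = consecutive_5xx
        # Maximum number of hosts that may be ejected at the same time.
        self.max_ejected = len(hosts) * max_ejection_percent // 100
        self.streak = {host: 0 for host in hosts}
        self.ejected = set()

    def record(self, host, status):
        """Record one response; any non-5xx response resets the streak."""
        if 500 <= status <= 599:
            self.streak[host] += 1
        else:
            self.streak[host] = 0
        if (self.streak[host] >= self.threshold
                and host not in self.ejected
                and len(self.ejected) < self.max_ejected):
            self.ejected.add(host)

    def healthy_hosts(self):
        return [host for host in self.streak if host not in self.ejected]
```

Note that a single successful response resets the streak, so a host failing four out of every five requests is never ejected by this rule alone. The cap also means that with two hosts and a 50% limit, the second failing host stays in rotation.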

Observability

Istio generates telemetry automatically for every request.

Metrics

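Envoy emits standard metrics: request count, request duration, request size, response size. Prometheus scrapes them from the sidecar, typically on port 15090 (Envoy’s /stats/prometheus endpoint) or on the merged metrics endpoint at port 15020, rather than from the admin port itself.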

Istio dashboards in Grafana show service-level metrics, including success rates, latencies, and saturation.

Distributed Tracing

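Envoy generates trace spans and, when a request enters the mesh without trace context, creates a trace ID. Propagation is not fully automatic, though: each application must copy the trace headers (for example x-request-id and the x-b3-* or traceparent headers) from the inbound request onto the outbound calls it makes. Otherwise the spans cannot be stitched into one trace.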

Jaeger or Zipkin collects traces. You see the complete request path across all services.
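
For the request path to show up as a single trace, each service has to copy the trace headers from the request it received onto the calls it makes. A minimal Python sketch of that copy step follows; the helper name is hypothetical, and the header list covers the B3 and W3C headers commonly used with Istio.

```python
# Headers an application should forward so spans join the same trace.
TRACE_HEADERS = (
    "x-request-id",
    "x-b3-traceid",
    "x-b3-spanid",
    "x-b3-parentspanid",
    "x-b3-sampled",
    "x-b3-flags",
    "traceparent",
    "tracestate",
)


def propagate_trace_headers(inbound: dict) -> dict:
    """Return only the trace headers from an inbound request, lowercased."""
    lowered = {name.lower(): value for name, value in inbound.items()}
    return {name: lowered[name] for name in TRACE_HEADERS if name in lowered}
```

The returned dictionary can be merged into the headers of any outbound HTTP call. Headers such as cookies are deliberately left out, so only trace context crosses the service boundary.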

Access Logging

When access logging is enabled, Envoy logs each request with details such as source, destination, duration, and response code. If the log format includes the request or trace ID, you can search by it to see what happened at each hop.

Sidecar Resource Tuning

Envoy sidecars consume memory and CPU. At scale, tune them to avoid resource waste.

apiVersion: networking.istio.io/v1alpha3
kind: EnvoyFilter
metadata:
  name: resource-tuning
spec:
  configPatches:
    - applyTo: CLUSTER
      patch:
        operation: MERGE
        value:
          max_requests_per_connection: 1024
          connect_timeout: 5s

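This caps the number of requests sent over a single upstream connection at 1024, after which Envoy closes that connection and opens a new one, and sets a 5-second timeout for establishing upstream connections. It does not limit concurrent connections. The same settings are available without an EnvoyFilter through a DestinationRule’s connectionPool (http.maxRequestsPerConnection and tcp.connectTimeout). The sidecar’s own CPU and memory requests are set separately, for example with the sidecar.istio.io/proxyCPU and sidecar.istio.io/proxyMemory pod annotations. Adjust based on your workload characteristics.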

When to Use / When Not to Use Istio

Use Istio when:

  • You need fine-grained traffic management (header-based routing, weighted splits, retries, timeouts)
  • Multi-cluster networking is a requirement
  • You need comprehensive observability across services (metrics, traces, logs)
  • Compliance requires strong mTLS enforcement between all services
  • You want centralized policy enforcement without modifying application code
  • You are running on multiple clouds or hybrid environments

Probably not the right choice when:

  • You are on a Kubernetes-only environment and Linkerd’s simplicity appeals to you
  • Resource overhead is a primary concern (Linkerd has lower memory/CPU footprint)
  • You only need basic mTLS without advanced traffic management
  • Your team has limited capacity for complex infrastructure
  • You are early in your microservices journey and have fewer than 10 services

Trade-off Analysis

| Factor | Istio | Linkerd | No Service Mesh |
| --- | --- | --- | --- |
| Setup Complexity | High | Low | None |
| Memory/CPU Overhead | 0.5-1GB per pod | 0.1-0.3GB per pod | Minimal |
| Traffic Management | Full L7 control | Basic L7 | None (app-level) |
| Observability | Native metrics, traces, logs | Native metrics | Custom implementation |
| mTLS | Automatic with rotation | Automatic | Manual or service-level |
| Operational Burden | Requires Istio expertise | Lightweight | None |
| Extensibility | EnvoyFilter, Wasm | Limited | Full control |
| Multi-cluster Support | Native | Limited | Complex |
| Learning Curve | Steep | Gentle | N/A |

Istio Architecture Overview

graph TB
    subgraph ControlPlane["Control Plane - istiod"]
        CA[Certificate Authority]
        Config[Configuration Manager]
        Registry[Service Registry]
    end

    subgraph DataPlane["Data Plane - Envoy Sidecars"]
        subgraph PodA["Pod A"]
            EnvoyA[Envoy Sidecar]
            AppA[Application]
        end
        subgraph PodB["Pod B"]
            EnvoyB[Envoy Sidecar]
            AppB[Application]
        end
    end

    CA -->|mTLS Certificates| EnvoyA
    CA -->|mTLS Certificates| EnvoyB
    Config -->|xDS API LDS/RDS| EnvoyA
    Config -->|xDS API LDS/RDS| EnvoyB
    Config -->|xDS API CDS/EDS| EnvoyA
    Config -->|xDS API CDS/EDS| EnvoyB
    Registry -->|Service Discovery| Config

    EnvoyA -->|mTLS| EnvoyB
    EnvoyB -->|mTLS| EnvoyA
    AppA -->|Outbound| EnvoyA
    AppB -->|Outbound| EnvoyB
    EnvoyA -->|Inbound| AppA
    EnvoyB -->|Inbound| AppB

Production Failure Scenarios

| Failure | Impact | Mitigation |
| --- | --- | --- |
| istiod becomes unavailable | New configurations not pushed; existing traffic continues on last-known config, but new or restarted pods cannot fetch configuration or certificates | Run istiod in HA mode (at least 2 replicas) |
| Envoy sidecar OOM killed | Service loses all network connectivity | Set appropriate memory limits; tune Envoy’s resource configuration |
| xDS streaming connection breaks | Envoy keeps serving its last-known, possibly stale, configuration | Monitor xDS sync status, for example with istioctl proxy-status |
| mTLS certificate rotation failure | Services cannot communicate once certificates expire | Monitor certificate expiration; use SDS for dynamic rotation; set reasonable TTLs |
| Envoy filter chain misconfiguration | Requests rejected or routed incorrectly | Test configuration changes in staging; use progressive rollout with traffic percentage |
| Network partition between namespaces | Services in separated namespaces cannot communicate | Design namespace isolation appropriately; use multi-cluster networking if needed |
| VirtualService misconfiguration | Traffic routed to wrong service or dropped | Validate routing rules before applying, for example with istioctl analyze; monitor 404 rates |

Observability Checklist

Key Metrics

  • Request count by service, destination, and response code
  • Request duration histograms (p50, p95, p99) per service
  • Request size and response size per service
  • mTLS connection success rate
  • Circuit breaker trip count per service
  • Outlier detection ejection count
  • xDS configuration sync status per proxy
  • Envoy memory and CPU usage per pod

Logs

  • Envoy access logging enabled with detailed request metadata
  • Include trace ID in all access log entries
  • Log all mTLS handshake failures with endpoint details
  • Log all circuit breaker state changes
  • Capture Envoy logs at appropriate verbosity (info or warning for normal operation, debug only while troubleshooting)

Alerts

  • Alert when error rate exceeds 1% for 5 minutes
  • Alert when p99 latency exceeds defined threshold
  • Alert when sidecar memory usage exceeds 80% of limit
  • Alert when certificate expires within 7 days
  • Alert when circuit breaker trips more than threshold times per minute
  • Alert when xDS sync failures detected
  • Alert on unexpected increase in 404 responses
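
A rule such as "error rate exceeds 1% for 5 minutes" fires only when every sample in the window is above the threshold. The Python sketch below shows that sustained-condition check; it assumes one error-rate sample per minute, and the function name and parameters are hypothetical rather than part of any alerting tool.

```python
def should_alert(error_rates, threshold=0.01, window=5):
    """Fire only if the last `window` samples all exceed the threshold.

    error_rates holds one sample per minute, newest last (an assumption
    of this sketch).
    """
    recent = error_rates[-window:]
    return len(recent) == window and all(rate > threshold for rate in recent)
```

Requiring the full window avoids paging on a single bad minute. A short dip below the threshold resets the condition, which is usually what you want for a sustained-error alert.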

Security Checklist

  • PeerAuthentication set to STRICT mode (not PERMISSIVE)
  • AuthorizationPolicy enforced between all service pairs (default deny)
  • Workload certificates auto-rotated with short TTL (24h default)
  • istiod access restricted via RBAC and network policies
  • Envoy admin endpoint disabled or restricted (not exposed publicly)
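  • Kubernetes NetworkPolicy applied as defense in depth alongside STRICT mTLS (it restricts which pods can connect but cannot itself tell plain text from mTLS)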
  • Regular security scanning of Istio version for CVEs
  • Audit logs for policy changes and certificate operations
  • SDS (Secret Discovery Service) used for dynamic certificate delivery
  • Avoid storing sensitive data in Envoy configuration or logs

Common Pitfalls / Anti-Patterns

Underconfigured sidecar resources: Envoy needs adequate CPU and memory. Underconfigured sidecars cause OOM kills that drop all traffic to the pod. Profile sidecar resource usage under load and set appropriate limits with headroom.

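Ignoring proxy warm-up: New Envoy proxies need to fetch xDS configuration before handling traffic. If the application container starts before its sidecar is ready, early outbound calls fail, and traffic can reach pods whose proxies are not yet configured. Rely on the sidecar’s readiness check and consider enabling holdApplicationUntilProxyStarts so the application starts only after the proxy is ready.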

Using PERMISSIVE mTLS in production: PERMISSIVE allows both mTLS and plain text connections. It is useful during migration but leaves a security gap if left enabled in production. Always switch to STRICT mode when migration completes.

Overly broad traffic policies: Applying mesh-wide policies that assume uniform requirements leads to problems. Some services need longer timeouts, different load balancing, or stricter circuit breaking. Override the defaults per service with a DestinationRule trafficPolicy.

Not tuning Envoy for workload: Default Envoy settings are conservative. Under high-throughput workloads, default connection limits, buffer sizes, and thread counts may become bottlenecks. Tune based on load testing.

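Logging sensitive data in Envoy access logs: Access logs record request metadata such as paths and selected headers rather than bodies, but URLs, query strings, and any headers added to a custom log format can still carry sensitive data. Ensure PII, credentials, and tokens are redacted or excluded from logs to avoid security and compliance issues.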

Quick Recap

graph LR
    Istiod -->|LDS/RDS| Envoy
    Istiod -->|CDS/EDS| Envoy
    Envoy -->|mTLS| Envoy

Key Points

  • Envoy is the sidecar proxy handling actual traffic; Istio is the control plane managing Envoys
  • xDS API (LDS, RDS, CDS, EDS) programs Envoy declaratively
  • mTLS and certificate rotation are automatic via SDS
  • VirtualService and DestinationRule provide rich traffic management
  • EnvoyFilter allows custom Envoy configuration when built-in features are insufficient

Production Checklist

# Istio Production Readiness

- [ ] mTLS set to STRICT mode
- [ ] AuthorizationPolicy with default-deny enforced
- [ ] Certificate rotation configured and monitored
- [ ] Sidecar resource limits appropriately configured
- [ ] Readiness probes configured for xDS sync
- [ ] Envoy access logging enabled with trace IDs
- [ ] Metrics dashboards operational
- [ ] Alerts configured for error rate, latency, and resource usage
- [ ] istiod running in HA mode (multi-replica)
- [ ] Circuit breaker thresholds configured per service
- [ ] Envoy admin endpoint restricted
- [ ] Regular Istio version updates for security patches

For an introduction to service mesh concepts, see Service Mesh. For Kubernetes fundamentals, see Kubernetes.

Conclusion

Istio and Envoy work together to provide transparent service mesh features. Envoy’s filter chain and xDS API let Istio push configuration dynamically. mTLS happens automatically, with certificate rotation handled by the control plane.

The depth of control is significant. Route traffic with header rules, split traffic by percentage, enforce policies at the proxy level, and observe everything without touching application code. The trade-off is operational complexity: Istio is not a simple system to run.
