Network Security: VPC, Firewall Rules, and Service Mesh mTLS

Design network security for cloud-native applications using VPCs, network policies, and mutual TLS for service-to-service encryption.


Network security is about controlling who can talk to whom. In a monolith, this is simple: one network, trust everything inside it. In a distributed system with hundreds of services, you need a different approach.

When to Use

Istio vs. Linkerd vs. No Service Mesh

Use a service mesh when you have multiple services that need mutual TLS without modifying application code. Linkerd is the better choice when you want simple, low-overhead mTLS with minimal configuration. Istio is better when you need fine-grained traffic control, advanced observability, or multi-cluster federation.

Do not use a service mesh if your services communicate over a dedicated network with no untrusted traffic, or if your team cannot afford the operational overhead. A service mesh adds complexity at every level: debugging, routing, and authentication all become mesh concerns.

cert-manager vs. Cloud-Native Certificate Management

Use cert-manager when you run Kubernetes and want a unified way to manage certificates across cloud and on-premises environments, or when you use Let’s Encrypt or other ACME providers.

Use cloud-native certificate management (AWS ACM, Azure Key Vault, GCP Certificate Manager) when you operate entirely within one cloud and your certificates are primarily for cloud-managed ingresses like ALB or Cloud CDN.

NACLs vs. Security Groups Alone

NACLs add value when you need subnet-wide rules that apply to all resources in a subnet, or when you want explicit deny rules at the network layer (for example, blocking known malicious IP ranges before traffic reaches any security group).

Most workloads do fine with security groups alone. NACLs are worth the additional configuration complexity when you have a specific compliance requirement for network-layer filtering or when multiple security groups need a common deny rule.

VPC Design and CIDR Allocation

A VPC (Virtual Private Cloud) is your network boundary in the cloud. Design it carefully because changing it later is painful.

Allocate CIDR blocks that give you room to grow. If you start with a /24, you will outgrow it. Use a /16 for production environments and segment with subnets.

# AWS VPC example
resource "aws_vpc" "main" {
  cidr_block           = "10.0.0.0/16"
  enable_dns_hostnames = true
  enable_dns_support   = true
}

# Subnets across availability zones
resource "aws_subnet" "private" {
  count             = 3
  vpc_id            = aws_vpc.main.id
  cidr_block        = cidrsubnet(aws_vpc.main.cidr_block, 8, count.index)
  availability_zone = data.aws_availability_zones.available.names[count.index]
}

resource "aws_subnet" "public" {
  count                   = 3
  vpc_id                  = aws_vpc.main.id
  cidr_block              = cidrsubnet(aws_vpc.main.cidr_block, 8, count.index + 10)
  availability_zone       = data.aws_availability_zones.available.names[count.index]
  map_public_ip_on_launch = false  # assign public IPs explicitly, not by default
}
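The `cidrsubnet` arithmetic above can be sanity-checked outside Terraform. A quick sketch with Python's `ipaddress` module, with indexes mirroring the `count.index` values:

```python
import ipaddress

# cidrsubnet("10.0.0.0/16", 8, i) adds 8 bits to the prefix (/16 -> /24)
# and selects the i-th /24 block.
vpc = ipaddress.ip_network("10.0.0.0/16")
subnets = list(vpc.subnets(prefixlen_diff=8))

private = [str(subnets[i]) for i in range(3)]        # count.index 0..2
public = [str(subnets[i + 10]) for i in range(3)]    # count.index + 10

print(private)  # ['10.0.0.0/24', '10.0.1.0/24', '10.0.2.0/24']
print(public)   # ['10.0.10.0/24', '10.0.11.0/24', '10.0.12.0/24']
```

Offsetting the public subnets by 10 leaves indexes 3-9 free for future private subnets without renumbering.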

Segment your VPC into at least three subnet types:

  • Public subnets: load balancers, NAT gateways. These have direct internet access.
  • Private subnets: application workloads. No direct internet access; outbound traffic goes through NAT.
  • Data subnets: databases, caches. Most restricted; no internet access at all.

Security Groups and NACLs

Security groups are stateful firewalls attached to instances or ENIs (Elastic Network Interfaces). They are the primary tool for controlling traffic to your workloads.

# Security group for an application tier
resource "aws_security_group" "app" {
  name        = "app-tier"
  description = "Security group for application servers"
  vpc_id      = aws_vpc.main.id

  ingress {
    from_port       = 8080
    to_port         = 8080
    protocol        = "tcp"
    security_groups = [aws_security_group.load_balancer.id]
  }

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}

Network ACLs (NACLs) are stateless and operate at the subnet level. Use them as a second layer of defense, for example, blocking all traffic to your data subnets except from specific app subnets.

# NACL for data subnet - only allow traffic from app tier
resource "aws_network_acl" "data" {
  vpc_id = aws_vpc.main.id

  ingress {
    rule_number = 100
    from_port   = 5432
    to_port     = 5432
    protocol    = "tcp"
    cidr_block  = "10.0.1.0/24"  # App tier subnet
    rule_action = "allow"
  }

  ingress {
    rule_number = 200
    from_port   = 0
    to_port     = 0
    protocol    = "-1"  # all protocols
    cidr_block  = "0.0.0.0/0"
    rule_action = "deny"
  }
}

The key difference: security groups are stateful (return traffic is automatically allowed), NACLs are stateless (you must explicitly allow return traffic).
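Statelessness cuts both ways: the NACL above allows queries in, but the responses from Postgres back to the app tier also need an explicit egress rule on ephemeral ports. A sketch of that companion rule, in the same inline style (the 1024-65535 range is an assumption; tune it to your clients' ephemeral port range):

```hcl
# Inside the aws_network_acl "data" resource: allow return traffic
# to the app tier. NACLs are stateless, so this is not automatic.
egress {
  rule_number = 100
  from_port   = 1024            # assumed ephemeral range; adjust per client OS
  to_port     = 65535
  protocol    = "tcp"
  cidr_block  = "10.0.1.0/24"  # App tier subnet
  rule_action = "allow"
}
```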

Kubernetes NetworkPolicy Enforcement

In Kubernetes, pods can communicate freely by default. A compromised pod can reach any other pod in the cluster. NetworkPolicy is the Kubernetes-native way to restrict this.

# Default deny all ingress for a namespace
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: production
spec:
  podSelector: {}
  policyTypes:
    - Ingress

---
# Allow only from frontend to backend
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend-to-backend
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: backend
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: frontend
      ports:
        - protocol: TCP
          port: 8080
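One gotcha when you extend this to a default-deny on Egress: pods lose DNS resolution. A companion policy allowing DNS to kube-system is usually needed; the sketch below relies on the `kubernetes.io/metadata.name` label that Kubernetes sets on every namespace since 1.21:

```yaml
# Allow DNS so a default-deny egress policy does not break name resolution
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns-egress
  namespace: production
spec:
  podSelector: {}
  policyTypes:
    - Egress
  egress:
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53
```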

Not all Kubernetes networking plugins enforce NetworkPolicy. Calico, Cilium, and Weave Net do. On EKS, the Amazon VPC CNI historically did not enforce policies, so teams added Calico or Cilium; recent versions of the VPC CNI offer optional native NetworkPolicy enforcement.

Service Mesh mTLS

When you want encryption and authentication between every service, a service mesh with mutual TLS is the answer. Instead of managing certificates in your application code, the mesh handles it.

flowchart LR
    A[Service A] --> B[Linkerd Proxy A]
    B -->|mTLS| C[Linkerd Proxy B]
    C --> D[Service B]
    E[Service C] --> F[Linkerd Proxy C]
    F -->|mTLS| C
    G[Control Plane<br/>certificate issuance<br/>and rotation] --> B
    G --> C
    G --> F

The hop between a service and its local proxy stays on localhost; encryption and certificate verification happen between the proxies.

Istio and Linkerd are the two main options. Linkerd is simpler and lower-overhead. Istio is more feature-rich but more complex.

With Linkerd, mTLS requires no extra configuration: the control plane issues workload certificates and enables mutual TLS by default for all TCP traffic between meshed pods. You opt workloads into the mesh by annotating the namespace so the proxy sidecar is injected:

# Opt a namespace into the mesh; mTLS is then on by default
apiVersion: v1
kind: Namespace
metadata:
  name: production
  annotations:
    linkerd.io/inject: enabled

Your services do not change. Linkerd intercepts traffic at the proxy level, verifies certificates, and encrypts communication automatically.

Istio gives you more control but requires more configuration:

# PeerAuthentication policy for strict mTLS
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: production
spec:
  mtls:
    mode: STRICT
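STRICT rejects plaintext immediately, which can break pods that are not yet meshed. A common rollout pattern is to start in PERMISSIVE mode, where the sidecar accepts both plaintext and mTLS, and tighten to STRICT once every workload carries a proxy:

```yaml
# Accept both plaintext and mTLS during migration, then switch to STRICT
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: production
spec:
  mtls:
    mode: PERMISSIVE
```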

Certificate Management with cert-manager

Whether you are using service mesh mTLS or just securing ingress, certificates need to be managed automatically. cert-manager automates certificate issuance and renewal.

# ClusterIssuer for Let's Encrypt
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: ops@example.com
    privateKeySecretRef:
      name: letsencrypt-prod
    solvers:
      - http01:
          ingress:
            class: nginx

Once you have a ClusterIssuer, you can request certificates for any service:

apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: myapp-tls
  namespace: production
spec:
  secretName: myapp-tls
  issuerRef:
    name: letsencrypt-prod
    kind: ClusterIssuer
  dnsNames:
    - myapp.example.com

cert-manager handles renewal automatically. Certificates from Let’s Encrypt expire every 90 days; cert-manager renews them at 30 days by default.
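Both the lifetime and the renewal window are explicit fields on the Certificate resource. Spelling out the defaults makes the behavior visible in review:

```yaml
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: myapp-tls
  namespace: production
spec:
  secretName: myapp-tls
  duration: 2160h     # 90 days, the Let's Encrypt lifetime
  renewBefore: 720h   # start renewal 30 days before expiry
  issuerRef:
    name: letsencrypt-prod
    kind: ClusterIssuer
  dnsNames:
    - myapp.example.com
```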

Zero-Trust Network Architecture

Zero-trust means never assuming that a request is safe just because it comes from inside your network. Every request should be authenticated and authorized, regardless of source.

The practical implications:

  1. Identity-based access: Services should have identities (certificates, service accounts) that are verified, not just IP-based access.
  2. Microsegmentation: Each service should only be able to reach the services it needs, nothing more.
  3. Short-lived credentials: Service accounts should use short-lived tokens, not long-lived secrets.
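Short-lived credentials map naturally onto Kubernetes bound service account tokens, which the kubelet rotates automatically before expiry. A sketch of a projected token volume with a one-hour TTL (the image name and audience are placeholders):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: app
  namespace: production
spec:
  serviceAccountName: app
  containers:
    - name: app
      image: myapp:latest            # placeholder image
      volumeMounts:
        - name: api-token
          mountPath: /var/run/secrets/tokens
          readOnly: true
  volumes:
    - name: api-token
      projected:
        sources:
          - serviceAccountToken:
              path: token
              expirationSeconds: 3600   # rotated by the kubelet before expiry
              audience: internal-api    # placeholder audience
```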

The Secrets Management guide covers how to implement identity-based secrets distribution.

Production Failure Scenarios

| Failure | Impact | Mitigation |
|---|---|---|
| Security group rule conflict causing intermittent connectivity | Services cannot reach each other, timeouts, partial failures | Keep rules minimal and documented, record expected port ranges, test after any security group change |
| cert-manager failing to renew a certificate | HTTPS becomes unavailable, all TLS connections fail | Monitor the certificate's Ready condition, alert 30 days before expiry, keep a backup certificate |
| Linkerd mTLS causing latency spikes | Service-to-service latency increases, timeouts | Profile services with and without mTLS, use Linkerd's tap command to identify slow connections, check proxy resource limits |
| Misconfigured NetworkPolicy blocking all traffic to a namespace | All pods in the namespace become unreachable, total outage | Apply the policy to one pod first, test connectivity before broad rollout, always have a recovery path |
| Overly restrictive NACL blocking legitimate traffic | Database or API unreachable from the app tier, cascading failures | Test NACL changes on a non-production subnet first, use a consistent rule-numbering scheme for easy identification |

Network Security Trade-offs

| Criterion | Linkerd | Istio | No Service Mesh |
|---|---|---|---|
| Complexity | Low | High | None |
| mTLS overhead | ~1 ms latency | Variable | None |
| Configuration required | Minimal | Extensive | Application-level TLS |
| Traffic control | Basic | Fine-grained | None |
| Best for | mTLS-only needs | Multi-cluster, complex routing | Single service with TLS |

| Criterion | cert-manager | Cloud-native (ACM, Key Vault) |
|---|---|---|
| Kubernetes integration | Native | Requires an external webhook |
| Multi-cloud | Works everywhere | Single cloud only |
| ACME/Let's Encrypt | Native | Via cert-manager |
| Cost | Free (ACME certificates) | Cloud provider pricing |
| Operational overhead | Medium | Low |

| Criterion | Security Groups Only | Security Groups + NACLs |
|---|---|---|
| East-west blocking | Yes | Yes |
| Subnet-wide deny rules | No | Yes |
| Operational complexity | Low | Medium |
| Compliance fit | Most use cases | Regulated environments |

Network Security Observability

Certificate expiration causes complete outages. Set alerts at 60, 30, 7, and 1 day before expiry. If cert-manager reports a certificate as not-ready, investigate immediately — you may have a DNS validation failure or network connectivity issue.
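The threshold logic is trivial to encode in the alerting layer. A minimal sketch in Python (parsing `status.notAfter` and wiring up an alert sink are left out; the function names are my own):

```python
from datetime import datetime, timedelta, timezone

ALERT_DAYS = (60, 30, 7, 1)

def crossed_thresholds(not_after: datetime, now: datetime) -> list[int]:
    """Return every alert threshold (days before expiry) already crossed."""
    remaining = not_after - now
    return [d for d in ALERT_DAYS if remaining <= timedelta(days=d)]

now = datetime(2025, 1, 1, tzinfo=timezone.utc)
# Certificate expiring in 20 days: the 60- and 30-day alerts should fire.
print(crossed_thresholds(now + timedelta(days=20), now))  # [60, 30]
```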

Security group change frequency matters. Teams that modify security groups multiple times per day either have automation problems or unclear ownership. Frequent changes also make auditing harder.

For Linkerd and Istio, watch proxy CPU and memory on each pod. The sidecar adds overhead that catches teams off guard: with tight resource limits, pods can be throttled or evicted under load.

Key commands:

# Check cert-manager certificate status
kubectl get certificates -A -o wide

# Monitor certificate expiration
kubectl get certificates -A -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.notAfter}{"\n"}'

# List security group rules in AWS
aws ec2 describe-security-groups --region us-east-1 --query 'SecurityGroups[*].{Name:GroupName,Rules:IpPermissions}'

# Check Linkerd mTLS status between meshed workloads
linkerd viz edges deployment -n production

# Verify NetworkPolicy is applied
kubectl get networkpolicies -A -o wide

Common Anti-Patterns

Relying on the internal network being safe. Internal networks get breached. Compromised containers can reach any other container in the same VPC. Treat internal traffic as untrusted and use mTLS or at minimum application-layer authentication.

Using self-signed certificates in production. Self-signed certs work for development but break certificate transparency logs and make incident response harder. Use Let’s Encrypt via cert-manager or your cloud’s managed certificate service.

Not rotating certificates. Certificates expire, get compromised, or need to be replaced after incidents. Automate renewal with cert-manager and test the renewal process before expiry.

Over-trusting the Kubernetes network. Pods can reach any other pod by default. A single compromised workload becomes a pivot point to attack every other workload. Default-deny NetworkPolicy costs little and limits lateral movement.

Attaching overly permissive security group rules as a shortcut. Opening a port to 0.0.0.0/0 or allowing all traffic from 10.0.0.0/8 “because it’s the internal network” defeats the purpose of security groups. Restrict source CIDRs to the minimum required.

Quick Recap

Key Takeaways

  • VPC design sets the foundation: use /16 with subnet segmentation, not /24
  • Security groups are stateful and instance-level; NACLs are stateless and subnet-level
  • Kubernetes NetworkPolicy requires a CNI that enforces it; on EKS, enable VPC CNI policy support or add Calico or Cilium
  • Service mesh mTLS offloads certificate management from application code
  • cert-manager automates certificate issuance and renewal across all Kubernetes workloads
  • Zero-trust means authenticating every request regardless of network origin

Network Security Checklist

# 1. VPC uses /16 with public/private/data subnet segmentation
# 2. Security groups restrict source to specific CIDRs or security groups
# 3. NACLs add subnet-level deny rules for known malicious IPs
# 4. Calico or Cilium installed for NetworkPolicy enforcement on EKS
# 5. Default-deny NetworkPolicy applied to all production namespaces
# 6. cert-manager ClusterIssuer created for Let's Encrypt
# 7. Certificates auto-renewed 30 days before expiry
# 8. Service mesh mTLS enabled for all production namespaces
# 9. Certificate expiration alerts configured at 60, 30, 7, 1 days

Trade-off Summary

| Control Layer | Scope | Operational Complexity | Latency Impact |
|---|---|---|---|
| Security groups | Instance-level | Low | Minimal |
| NACLs | Subnet-level | Medium | Minimal |
| VPC endpoints | Service-level | Medium | Reduces (removes IGW hop) |
| PrivateLink / VPC peering | Cross-account | High | Minimal |
| VPN / Direct Connect | On-prem hybrid | High | Adds encryption overhead |
| Service mesh (mTLS) | Pod-to-pod | High | 1-3% overhead |
| NetworkPolicy (K8s) | Pod-level | Medium | Minimal |

Conclusion

Network security is not a perimeter. It is a mesh of controls that limit blast radius when something goes wrong. Design your VPC for growth, use security groups and NACLs to control east-west traffic, enforce Kubernetes NetworkPolicy, and use a service mesh for automatic mTLS between services.

Zero-trust is not a product you buy. It is an architecture you build, layer by layer.
