Network Security: VPC, Firewall Rules, and Service Mesh mTLS
Design network security for cloud-native applications using VPCs, network policies, and mutual TLS for service-to-service encryption.
Network security is about controlling who can talk to whom. In a monolith, this is simple: one network, trust everything inside it. In a distributed system with hundreds of services, you need a different approach.
When to Use
Istio vs. Linkerd vs. No Service Mesh
Use a service mesh when you have multiple services that need mutual TLS without modifying application code. Linkerd is the better choice when you want simple, low-overhead mTLS with minimal configuration. Istio is better when you need fine-grained traffic control, advanced observability, or multi-cluster federation.
Do not use a service mesh if your services communicate over a dedicated network with no untrusted traffic, or if your team cannot afford the operational overhead. A service mesh adds complexity at every level: debugging, routing, and authentication all become mesh concerns.
cert-manager vs. Cloud-Native Certificate Management
Use cert-manager when you run Kubernetes and want a unified way to manage certificates across cloud and on-premises environments, or when you use Let’s Encrypt or other ACME providers.
Use cloud-native certificate management (AWS ACM, Azure Key Vault, GCP Certificate Manager) when you operate entirely within one cloud and your certificates are primarily for cloud-managed ingresses like ALB or Cloud CDN.
NACLs vs. Security Groups Alone
NACLs add value when you need subnet-wide rules that apply to all resources in a subnet, or when you want explicit deny rules at the network layer (for example, blocking known malicious IP ranges before traffic reaches any security group).
Most workloads do fine with security groups alone. NACLs are worth the additional configuration complexity when you have a specific compliance requirement for network-layer filtering or when multiple security groups need a common deny rule.
VPC Design and CIDR Allocation
A VPC (Virtual Private Cloud) is your network boundary in the cloud. Design it carefully because changing it later is painful.
Allocate CIDR blocks that give you room to grow. If you start with a /24, you will outgrow it. Use a /16 for production environments and segment with subnets.
# AWS VPC example
resource "aws_vpc" "main" {
  cidr_block           = "10.0.0.0/16"
  enable_dns_hostnames = true
  enable_dns_support   = true
}
# Subnets across availability zones
resource "aws_subnet" "private" {
  count             = 3
  vpc_id            = aws_vpc.main.id
  cidr_block        = cidrsubnet(aws_vpc.main.cidr_block, 8, count.index)
  availability_zone = data.aws_availability_zones.available.names[count.index]
}
resource "aws_subnet" "public" {
  count             = 3
  vpc_id            = aws_vpc.main.id
  cidr_block        = cidrsubnet(aws_vpc.main.cidr_block, 8, count.index + 10)
  availability_zone = data.aws_availability_zones.available.names[count.index]
  # NAT gateways and load balancers bring their own public IPs;
  # instances launched here should not auto-assign one
  map_public_ip_on_launch = false
}
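The cidrsubnet() arithmetic is easy to sanity-check before applying. A quick sketch using Python's standard ipaddress module (illustrative only, not part of the Terraform):

```python
import ipaddress

# cidrsubnet(cidr, 8, netnum) adds 8 bits to the /16 prefix,
# carving it into /24s; netnum selects which /24
vpc = ipaddress.ip_network("10.0.0.0/16")
subnets = list(vpc.subnets(prefixlen_diff=8))

private = [subnets[i] for i in range(3)]      # netnum 0..2
public = [subnets[i + 10] for i in range(3)]  # netnum 10..12

print([str(s) for s in private])  # ['10.0.0.0/24', '10.0.1.0/24', '10.0.2.0/24']
print([str(s) for s in public])   # ['10.0.10.0/24', '10.0.11.0/24', '10.0.12.0/24']
```

Offsetting the public netnums by 10 leaves room to add private subnets without renumbering.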
Segment your VPC into at least three subnet types:
- Public subnets: Load balancers, NAT gateways. These have direct internet access.
- Private subnets: Application workloads. No direct internet access.
- Data subnets: Databases, caches. Most restricted, no internet access at all.
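The internet-access differences come from route tables, not from the subnets themselves. A sketch of the three route tables, assuming a NAT gateway named `aws_nat_gateway.nat` already exists in a public subnet:

```hcl
resource "aws_internet_gateway" "igw" {
  vpc_id = aws_vpc.main.id
}

# Public: default route to the internet gateway
resource "aws_route_table" "public" {
  vpc_id = aws_vpc.main.id
  route {
    cidr_block = "0.0.0.0/0"
    gateway_id = aws_internet_gateway.igw.id
  }
}

# Private: outbound only, via the NAT gateway
resource "aws_route_table" "private" {
  vpc_id = aws_vpc.main.id
  route {
    cidr_block     = "0.0.0.0/0"
    nat_gateway_id = aws_nat_gateway.nat.id # assumed to exist
  }
}

# Data: no internet route at all; only local VPC routing applies
resource "aws_route_table" "data" {
  vpc_id = aws_vpc.main.id
}
```

Each subnet is then associated with its tier's route table via aws_route_table_association resources (omitted here).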
Security Groups and NACLs
Security groups are stateful firewalls attached to instances or ENIs (Elastic Network Interfaces). They are the primary tool for controlling traffic to your workloads.
# Security group for an application tier
resource "aws_security_group" "app" {
  name        = "app-tier"
  description = "Security group for application servers"
  vpc_id      = aws_vpc.main.id
  ingress {
    from_port       = 8080
    to_port         = 8080
    protocol        = "tcp"
    security_groups = [aws_security_group.load_balancer.id]
  }
  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}
Network ACLs (NACLs) are stateless and operate at the subnet level. Use them as a second layer of defense, for example, blocking all traffic to your data subnets except from specific app subnets.
# NACL for data subnet - only allow traffic from app tier
resource "aws_network_acl" "data" {
  vpc_id = aws_vpc.main.id
  ingress {
    rule_no    = 100
    protocol   = "tcp"
    from_port  = 5432
    to_port    = 5432
    cidr_block = "10.0.1.0/24" # App tier subnet
    action     = "allow"
  }
  ingress {
    rule_no    = 200
    protocol   = "-1"
    from_port  = 0
    to_port    = 0
    cidr_block = "0.0.0.0/0"
    action     = "deny"
  }
}
The key difference: security groups are stateful (return traffic is automatically allowed), NACLs are stateless (you must explicitly allow return traffic).
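Because NACLs are stateless, a data-subnet NACL like the one above also needs an egress rule for the return leg of each connection. A sketch (rule number and CIDR illustrative), placed inside the same aws_network_acl resource:

```hcl
# Stateless: responses from Postgres back to app-tier clients
# must be allowed explicitly on the ephemeral port range
egress {
  rule_no    = 100
  protocol   = "tcp"
  from_port  = 1024
  to_port    = 65535
  cidr_block = "10.0.1.0/24" # App tier subnet
  action     = "allow"
}
```

Forgetting this return rule is the classic NACL mistake: the SYN arrives, but the SYN-ACK is silently dropped and every connection times out.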
Kubernetes NetworkPolicy Enforcement
In Kubernetes, pods can communicate freely by default. A compromised pod can reach any other pod in the cluster. NetworkPolicy is the Kubernetes-native way to restrict this.
# Default deny all ingress for a namespace
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: production
spec:
  podSelector: {}
  policyTypes:
    - Ingress
---
# Allow only from frontend to backend
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend-to-backend
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: backend
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: frontend
      ports:
        - protocol: TCP
          port: 8080
Not all Kubernetes networking plugins enforce NetworkPolicy. Calico, Cilium, and Weave Net do. If you are using EKS with Amazon VPC CNI, you need to add Calico or another plugin for policy enforcement.
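Ingress-only policies still let a compromised pod exfiltrate data outbound. A default-deny egress policy that carves out DNS is a common companion; a sketch, assuming cluster DNS on port 53:

```yaml
# Deny all egress except DNS lookups
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-egress
  namespace: production
spec:
  podSelector: {}
  policyTypes:
    - Egress
  egress:
    - ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53
```

Each workload then gets explicit egress rules for the services it actually calls, mirroring the ingress allow-list pattern above.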
Service Mesh mTLS
When you want encryption and authentication between every service, a service mesh with mutual TLS is the answer. Instead of managing certificates in your application code, the mesh handles it.
flowchart LR
    A[Service A] --> B[Linkerd Proxy]
    B -->|mTLS| C[Linkerd Proxy]
    C --> D[Service B]
    E[Service C] --> F[Linkerd Proxy]
    F -->|mTLS| C
    G[Control Plane<br/>certificates, rotation, policies] --> B
    G --> C
    G --> F
Istio and Linkerd are the two main options. Linkerd is simpler and lower-overhead. Istio is more feature-rich but more complex.
With Linkerd, mTLS is on by default for all meshed TCP traffic. You opt a workload in by injecting the proxy, typically with a namespace annotation:
# Inject the Linkerd proxy into every pod in the namespace
apiVersion: v1
kind: Namespace
metadata:
  name: production
  annotations:
    linkerd.io/inject: enabled
Your services do not change. Linkerd intercepts traffic at the proxy level, verifies certificates, and encrypts communication automatically.
Istio gives you more control but requires more configuration:
# PeerAuthentication policy for strict mTLS
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: production
spec:
  mtls:
    mode: STRICT
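PeerAuthentication handles encryption and workload identity; authorization is a separate layer. With strict mTLS in place, an Istio AuthorizationPolicy can restrict which identities may call a service. A sketch, assuming the frontend runs under a `frontend` service account:

```yaml
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: backend-allow-frontend
  namespace: production
spec:
  selector:
    matchLabels:
      app: backend
  action: ALLOW
  rules:
    - from:
        - source:
            # mTLS-derived identity: <trust-domain>/ns/<namespace>/sa/<service-account>
            principals: ["cluster.local/ns/production/sa/frontend"]
```

This is identity-based access in practice: the caller is identified by its certificate, not its IP address.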
Certificate Management with cert-manager
Whether you are using service mesh mTLS or just securing ingress, certificates need to be managed automatically. cert-manager automates certificate issuance and renewal.
# ClusterIssuer for Let's Encrypt
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: ops@example.com
    privateKeySecretRef:
      name: letsencrypt-prod
    solvers:
      - http01:
          ingress:
            class: nginx
Once you have a ClusterIssuer, you can request certificates for any service:
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: myapp-tls
  namespace: production
spec:
  secretName: myapp-tls
  issuerRef:
    name: letsencrypt-prod
    kind: ClusterIssuer
  dnsNames:
    - myapp.example.com
cert-manager handles renewal automatically. Certificates from Let’s Encrypt expire every 90 days; cert-manager renews them at 30 days by default.
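The renewal window can also be set explicitly on the Certificate with the duration and renewBefore fields. A sketch (note that Let's Encrypt ignores the requested duration and always issues 90-day certificates):

```yaml
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: myapp-tls
  namespace: production
spec:
  secretName: myapp-tls
  duration: 2160h    # 90 days
  renewBefore: 720h  # start renewal 30 days before expiry
  issuerRef:
    name: letsencrypt-prod
    kind: ClusterIssuer
  dnsNames:
    - myapp.example.com
```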
Zero-Trust Network Architecture
Zero-trust means never assuming that a request is safe just because it comes from inside your network. Every request should be authenticated and authorized, regardless of source.
The practical implications:
- Identity-based access: Services should have identities (certificates, service accounts) that are verified, not just IP-based access.
- Microsegmentation: Each service should only be able to reach the services it needs, nothing more.
- Short-lived credentials: Service accounts should use short-lived tokens, not long-lived secrets.
The Secrets Management guide covers how to implement identity-based secrets distribution.
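For short-lived credentials specifically, Kubernetes can project a bound, expiring service account token into a pod instead of mounting a long-lived secret. A sketch, where the `backend` service account and `backend-api` audience are illustrative:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: backend
  namespace: production
spec:
  serviceAccountName: backend
  containers:
    - name: app
      image: myapp:latest
      volumeMounts:
        - name: api-token
          mountPath: /var/run/secrets/tokens
  volumes:
    - name: api-token
      projected:
        sources:
          - serviceAccountToken:
              path: token
              expirationSeconds: 3600 # kubelet rotates the token hourly
              audience: backend-api
```

The token is audience-bound and expires automatically, so a stolen copy is far less useful than a static secret.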
Production Failure Scenarios
| Failure | Impact | Mitigation |
|---|---|---|
| Security group rule conflict causing intermittent connectivity | Services cannot reach each other, timeouts, partial failures | Reference security groups rather than broad CIDRs in rules, document expected port ranges, test connectivity after any security group change |
| cert-manager failing to renew certificate causing production outage | HTTPS becomes unavailable, all TLS connections fail | Monitor cert-manager’s cert-ready condition, set up alerts 30 days before expiry, keep a backup certificate |
| Linkerd mTLS causing latency spikes | Service-to-service latency increases, timeouts | Profile your services with and without mTLS, use Linkerd’s tap command to identify slow connections, check proxy resource limits |
| NetworkPolicy misconfigured blocking all traffic to namespace | All pods in namespace become unreachable, total outage | Apply NetworkPolicy to one pod first, test connectivity before broad rollout, always have a recovery path |
| NACL overly restrictive blocking legitimate traffic | Database or API unreachable from app tier, cascading failures | Test NACL changes on a non-production subnet first, leave gaps between rule numbers so new rules can be inserted without renumbering |
Network Security Trade-offs
| Criterion | Linkerd | Istio | No Service Mesh |
|---|---|---|---|
| Complexity | Low | High | None |
| mTLS overhead | ~1ms latency | Variable | None |
| Configuration required | Minimal | Extensive | Application-level TLS |
| Traffic control | Basic | Fine-grained | None |
| Best for | mTLS-only needs | Multi-cluster, complex routing | Single service with TLS |
| Criterion | cert-manager | Cloud-native ACM (ACM, Key Vault) |
|---|---|---|
| Kubernetes integration | Native | Requires external webhook |
| Multi-cloud | Works everywhere | Single cloud only |
| ACME/Let's Encrypt | Native | No (certificates issued by the provider CA) |
| Cost | Free (cert-manager) + ACME | Cloud provider pricing |
| Operational overhead | Medium | Low |
| Criterion | Security Groups only | Security Groups + NACLs |
|---|---|---|
| East-west blocking | Yes | Yes |
| Subnet-wide deny rules | No | Yes |
| Operational complexity | Low | Medium |
| Compliance requirements | Most use cases | Regulated environments |
Network Security Observability
Certificate expiration causes complete outages. Set alerts at 60, 30, 7, and 1 day before expiry. If cert-manager reports a certificate as not-ready, investigate immediately — you may have a DNS validation failure or network connectivity issue.
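If you scrape cert-manager with the Prometheus Operator, the expiry alerts can be expressed against its certmanager_certificate_expiration_timestamp_seconds metric. A sketch for the 30-day threshold (namespace and labels illustrative):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: certificate-expiry
  namespace: monitoring
spec:
  groups:
    - name: certificates
      rules:
        - alert: CertificateExpiringSoon
          # Fires when any certificate has less than 30 days of validity left
          expr: certmanager_certificate_expiration_timestamp_seconds - time() < 30 * 24 * 3600
          for: 1h
          labels:
            severity: warning
          annotations:
            summary: "Certificate {{ $labels.name }} expires within 30 days"
```

Duplicate the rule with tighter thresholds and higher severities for the 7-day and 1-day alerts.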
Security group change frequency matters. Teams that modify security groups multiple times per day either have automation problems or unclear ownership. Frequent changes also make auditing harder.
For Linkerd and Istio, watch proxy CPU and memory on each pod. The sidecar adds overhead that catches teams off guard when they have tight resource limits and start getting evicted under load.
Key commands:
# Check cert-manager certificate status
kubectl get certificates -A -o wide
# Monitor certificate expiration
kubectl get certificates -A -o jsonpath='{range .items[*]}{.metadata.namespace}{"\t"}{.metadata.name}{"\t"}{.status.notAfter}{"\n"}{end}'
# List security group rules in AWS
aws ec2 describe-security-groups --region us-east-1 --query 'SecurityGroups[*].{Name:GroupName,Rules:IpPermissions}'
# Check Linkerd mTLS status between workloads
linkerd viz edges deployment -n production
# Verify NetworkPolicy is applied
kubectl get networkpolicies -A -o wide
Common Anti-Patterns
Relying on the internal network being safe. Internal networks get breached. Compromised containers can reach any other container in the same VPC. Treat internal traffic as untrusted and use mTLS or at minimum application-layer authentication.
Using self-signed certificates in production. Self-signed certs work for development but break certificate transparency logs and make incident response harder. Use Let’s Encrypt via cert-manager or your cloud’s managed certificate service.
Not rotating certificates. Certificates expire, get compromised, or need to be replaced after incidents. Automate renewal with cert-manager and test the renewal process before expiry.
Over-trusting the Kubernetes network. Pods can reach any other pod by default. A single compromised workload becomes a pivot point to attack every other workload. Default-deny NetworkPolicy costs little and limits lateral movement.
Attaching overly permissive security group rules as a shortcut. Opening a port to 0.0.0.0/0 or allowing all traffic from 10.0.0.0/8 “because it’s the internal network” defeats the purpose of security groups. Restrict source CIDRs to the minimum required.
Quick Recap
Key Takeaways
- VPC design sets the foundation: use /16 with subnet segmentation, not /24
- Security groups are stateful and instance-level; NACLs are stateless and subnet-level
- Kubernetes NetworkPolicy is not enforced by default CNI plugins on EKS — add Calico or Cilium
- Service mesh mTLS offloads certificate management from application code
- cert-manager automates certificate issuance and renewal across all Kubernetes workloads
- Zero-trust means authenticating every request regardless of network origin
Network Security Checklist
# 1. VPC uses /16 with public/private/data subnet segmentation
# 2. Security groups restrict source to specific CIDRs or security groups
# 3. NACLs add subnet-level deny rules for known malicious IPs
# 4. Calico or Cilium installed for NetworkPolicy enforcement on EKS
# 5. Default-deny NetworkPolicy applied to all production namespaces
# 6. cert-manager ClusterIssuer created for Let's Encrypt
# 7. Certificates auto-renewed 30 days before expiry
# 8. Service mesh mTLS enabled for all production namespaces
# 9. Certificate expiration alerts configured at 60, 30, 7, 1 days
Trade-off Summary
| Control Layer | Scope | Operational Complexity | Latency Impact |
|---|---|---|---|
| Security groups | Instance-level | Low | Minimal |
| NACLs | Subnet-level | Medium | Minimal |
| VPC endpoints | Service-level | Medium | Slightly reduces (no NAT/IGW hop) |
| PrivateLink / VPC peering | Cross-account | High | Minimal |
| VPN / Direct Connect | On-prem hybrid | High | Adds encryption overhead |
| Service mesh (mTLS) | Pod-to-pod | High | 1-3% overhead |
| NetworkPolicy (K8s) | Pod-level | Medium | Minimal |
Conclusion
Network security is not a perimeter. It is a mesh of controls that limit blast radius when something goes wrong. Design your VPC for growth, use security groups and NACLs to control east-west traffic, enforce Kubernetes NetworkPolicy, and use a service mesh for automatic mTLS between services.
Zero-trust is not a product you buy. It is an architecture you build, layer by layer.