Multi-Cluster Kubernetes: Federation, Cross-Cluster Networking, Registries
Manage multiple Kubernetes clusters across regions and clouds using federation, cluster registries, and cross-cluster service discovery.
Running a single Kubernetes cluster works for many use cases. But as you scale, you may need multiple clusters for different regions, cloud providers, or environments. Managing multiple clusters introduces challenges around deployment consistency, service discovery, and network connectivity.
This post covers multi-cluster architectures, Kubernetes Federation v2, cluster registration, and patterns for cross-cluster communication.
For Kubernetes basics, see the Kubernetes fundamentals post. For high availability across availability zones, see the High Availability post.
When to Use / When Not to Use
When multi-cluster makes sense
Geographic distribution is the clearest win. If your users are spread across regions, running pods in each region to serve local users reduces latency. Data residency rules sometimes mandate it.
Different trust boundaries also warrant separation. A cluster for production and a cluster for testing are fundamentally different trust levels. Merging them adds attack surface for both.
Blast radius containment matters. If one cluster fails, does it take down everything? For critical production workloads, isolation means one cluster’s problem does not cascade.
Regulated environments often need it. Auditing requirements, data residency laws, compliance frameworks may all require cluster-level separation.
When single cluster is better
Namespace isolation is simpler. If your team can manage one cluster well, adding more clusters adds management overhead without proportional benefit.
Operational maturity matters. Multi-cluster means multiple control planes, multiple upgrade cycles, multiple network configurations. If your team is still learning Kubernetes, start with one cluster.
The complexity tax is real. Cross-cluster networking is harder than intra-cluster. Service discovery across clusters requires explicit configuration. Debugging is harder when the problem might be in a different cluster.
Common multi-cluster architectures
| Architecture | Use Case | Complexity |
|---|---|---|
| Federation v2 | Centralized control plane, propagate to member clusters | Medium |
| Cluster API | Infrastructure-as-code cluster lifecycle management | High |
| GitOps (ArgoCD) | One repo deploys to multiple clusters | Low-Medium |
| Service mesh (Istio) | Cross-cluster service discovery and traffic management | High |
Why Multi-Cluster?
Single clusters have limits. The maximum number of nodes depends on etcd performance and network plugin (CNI) capabilities. More importantly, separate clusters provide isolation for:
| Use Case | Benefit |
|---|---|
| Multi-region deployments | Lower latency for global users |
| Cloud provider diversification | Avoid vendor lock-in, regional outages |
| Environment separation | Dev, staging, production isolation |
| Compliance | Data residency requirements |
| Blast radius limitation | Failure in one cluster does not affect others |
Multi-cluster also lets you roll out upgrades cluster by cluster rather than putting every workload at risk at once.
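As a minimal sketch of that pattern (assuming kubeconfig contexts named after the clusters, as used later in this post, and a deployment name that is illustrative):
# Roll out to clusters one at a time, verifying health before moving on
for ctx in cluster-us-east-1 cluster-eu-west-1; do
  kubectl --context "$ctx" apply -f manifests/
  kubectl --context "$ctx" rollout status deployment/web-frontend -n production --timeout=10m || break
done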
Federation v2 Architecture
Kubernetes Federation v2 (KubeFed) provides a control plane for managing resources across multiple clusters. Instead of deploying to each cluster individually, you deploy to the federation control plane, which propagates resources to member clusters.
KubeFed components
┌─────────────────────────────────────────┐
│ Federation Control Plane │
│ ┌─────────────┐ ┌─────────────────┐ │
│ │ KubeFed │ │ Federated │ │
│ │ Controller │ │ Resources │ │
│ └─────────────┘ └─────────────────┘ │
└─────────────────────────────────────────┘
│ │
▼ ▼
┌──────────────┐ ┌──────────────┐
│ Cluster │ │ Cluster │
│ us-east-1 │ │ eu-west-1 │
└──────────────┘ └──────────────┘
Installing KubeFed
helm repo add kubefed-charts https://kubernetes-sigs.github.io/kubefed/charts
helm install -n kube-federation-system --create-namespace \
kubefed kubefed-charts/kubefed
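Once the chart is installed, you can confirm the control-plane pods are running (the namespace matches the one created above):
kubectl -n kube-federation-system get pods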
Registering clusters
# Add cluster to federation
kubefedctl join cluster-us-east-1 \
--cluster-context=us-east-1 \
--host-cluster-context=federation-system
kubefedctl join cluster-eu-west-1 \
--cluster-context=eu-west-1 \
--host-cluster-context=federation-system
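After joining, each member cluster should appear as a KubeFedCluster resource on the host cluster; a quick check (assuming the default kube-federation-system namespace used above) is:
kubectl -n kube-federation-system get kubefedclusters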
Federated deployment
apiVersion: types.kubefed.io/v1beta1
kind: FederatedDeployment
metadata:
  name: web-frontend
  namespace: production
spec:
  template:
    metadata:
      labels:
        app: web-frontend
    spec:
      replicas: 3
      selector:
        matchLabels:
          app: web-frontend
      template:
        metadata:
          labels:
            app: web-frontend
        spec:
          containers:
          - name: nginx
            image: nginx:1.25
  placement:
    clusters:
    - name: cluster-us-east-1
    - name: cluster-eu-west-1
  overrides:
  - clusterName: cluster-eu-west-1
    clusterOverrides:
    - path: "/spec/replicas"
      value: 5
This FederatedDeployment deploys to both clusters but overrides the replica count for eu-west-1 to handle higher traffic there.
Cluster Registration and Lifecycle
Without federation, you can still manage multiple clusters through a cluster registry: a central record of your clusters, their endpoints, and their configuration, exposed through a common API for registration and lookup.
Cluster API provider
Cluster API (CAPI) automates cluster provisioning and lifecycle management:
apiVersion: cluster.x-k8s.io/v1beta1
kind: Cluster
metadata:
  name: production-us-east
  namespace: default
spec:
  clusterNetwork:
    services:
      cidrBlocks: ["10.96.0.0/12"]
    pods:
      cidrBlocks: ["10.244.0.0/16"]
    serviceDomain: "cluster.local"
  infrastructureRef:
    apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
    kind: AWSCluster
    name: production-us-east-infra
CAPI providers exist for AWS, GCP, Azure, vSphere, and bare metal.
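In practice, the management cluster is bootstrapped and manifests like the one above are generated with clusterctl. A typical flow looks roughly like this (the Kubernetes version and file name are illustrative):
# Install CAPI components plus the AWS infrastructure provider into the management cluster
clusterctl init --infrastructure aws
# Generate Cluster and machine manifests for a new workload cluster, then apply them
clusterctl generate cluster production-us-east --kubernetes-version v1.29.0 > production-us-east.yaml
kubectl apply -f production-us-east.yaml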
kubeconfig management
For simple multi-cluster management, use kubeconfig contexts:
# Switch between clusters
kubectl config use-context cluster-us-east-1
kubectl config use-context cluster-eu-west-1
# List contexts
kubectl config get-contexts
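You can also run one-off commands against a specific cluster without switching the current context:
kubectl --context cluster-eu-west-1 get nodes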
A kubeconfig file can contain multiple clusters, users, and contexts:
apiVersion: v1
kind: Config
clusters:
- name: cluster-us-east-1
  cluster:
    server: https://us-east-1.example.com
    certificate-authority-data: ...
- name: cluster-eu-west-1
  cluster:
    server: https://eu-west-1.example.com
    certificate-authority-data: ...
users:
- name: admin
  user:
    client-certificate-data: ...
    client-key-data: ...
contexts:
- name: cluster-us-east-1
  context:
    cluster: cluster-us-east-1
    user: admin
- name: cluster-eu-west-1
  context:
    cluster: cluster-eu-west-1
    user: admin
current-context: cluster-us-east-1
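If each cluster ships its own kubeconfig file, the KUBECONFIG environment variable can merge them on the fly (the file paths here are illustrative):
# Merge several kubeconfig files and write a single flattened config
export KUBECONFIG=~/.kube/us-east-1.yaml:~/.kube/eu-west-1.yaml
kubectl config view --flatten > ~/.kube/config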
Cross-Cluster Service Discovery
When services run on different clusters, you need ways to discover and communicate with them.
Cluster Federation DNS
KubeFed enables DNS-based service discovery across clusters. Services get federated DNS names that resolve to endpoints in all member clusters:
web-frontend.production.svc.global
This global DNS name returns all endpoints from all clusters where the service is deployed.
Service export and import
apiVersion: multicluster.x-k8s.io/v1alpha1
kind: ServiceExport
metadata:
  name: api-backend
  namespace: production
---
apiVersion: multicluster.x-k8s.io/v1alpha1
kind: ServiceImport
metadata:
  name: api-backend
  namespace: production
spec:
  type: ClusterSetIP
  ports:
  - port: 8080
    protocol: TCP
A ServiceExport in one cluster marks the service for export to the cluster set. The multi-cluster services controller then creates a matching ServiceImport in the other clusters, aggregating the endpoints from every cluster that exports the service.
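With an MCS-aware DNS implementation, consumers in any member cluster resolve the imported service under the clusterset.local zone; the port and path below are illustrative:
# From a pod in either cluster
nslookup api-backend.production.svc.clusterset.local
curl http://api-backend.production.svc.clusterset.local:8080/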
External DNS for multi-cluster
ExternalDNS synchronizes Kubernetes Services with DNS providers:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: external-dns
spec:
  selector:
    matchLabels:
      app: external-dns
  template:
    metadata:
      labels:
        app: external-dns
    spec:
      containers:
      - name: external-dns
        image: registry.k8s.io/external-dns/external-dns:v0.14.0
        args:
        - --source=service
        - --source=ingress
        - --domain-filter=example.com
        - --provider=cloudflare
        - --policy=sync
        - --txt-owner-id=cluster-us-east-1
Run ExternalDNS in every cluster with --policy=sync and a per-cluster --txt-owner-id. Each instance keeps the provider's records in sync with its own cluster's Services and Ingresses without clobbering records owned by the other clusters, so clients resolve the same hostnames regardless of which cluster serves the request.
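ExternalDNS creates records for Services and Ingresses that request them. The standard hostname annotation looks like this (the hostname and Service below are illustrative):
apiVersion: v1
kind: Service
metadata:
  name: web-frontend
  annotations:
    external-dns.alpha.kubernetes.io/hostname: web.example.com
spec:
  type: LoadBalancer
  selector:
    app: web-frontend
  ports:
  - port: 80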
Network Connectivity Options
Cross-cluster communication requires network connectivity. Several options exist:
VPC peering
Connect virtual networks within or across regions using the cloud provider's native peering:
AWS: VPC Peering
GCP: VPC Network Peering
Azure: Virtual Network Peering
VPC peering creates direct network paths between clusters without traversing public internet. DNS resolution requires private hosted zones or CoreDNS with static entries.
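For the CoreDNS option, one sketch is a forward stanza in the CoreDNS ConfigMap, assuming the peer cluster uses a distinct service domain (peer.local here) and its DNS service is reachable at 10.1.0.10 over the peering; both values are assumptions for illustration:
# Forward lookups for the peer cluster's service domain to its DNS over the peered network
peer.local:53 {
    errors
    cache 30
    forward . 10.1.0.10
}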
VPN connections
Site-to-site VPN connects on-premises or cloud networks:
# WireGuard example
ip link add wg0 type wireguard
ip addr add 10.0.0.1/24 dev wg0
wg set wg0 private-key ./privatekey listen-port 51820 \
  peer <PEER_PUBLIC_KEY> endpoint <PEER_ENDPOINT>:51820 allowed-ips <REMOTE_CIDR>
ip link set wg0 up
VPNs encrypt traffic and allow full network access between sites.
Submariner for cross-cluster networking
Submariner provides direct pod-to-pod connectivity across Kubernetes clusters:
subctl deploy-broker --kubeconfig ~/.kube/config
subctl join --kubeconfig ~/.kube/config broker-info.subm --cable-driver libreswan
Submariner handles NAT traversal and encryption automatically. After joining, pods on one cluster can reach pods on another cluster using the pod’s IP address.
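After the join completes, subctl offers diagnostic subcommands for checking tunnel and gateway status from either cluster:
subctl show connections
subctl show gateways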
Service mesh for multi-cluster
Service meshes like Istio and Linkerd support multi-cluster deployments:
# Istio remote profile for cross-cluster
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
metadata:
  name: istio-multicluster
spec:
  profile: remote
  values:
    global:
      meshID: my-mesh
      multiCluster:
        clusterName: cluster-us-east-1
      network: network1
With Istio’s multi-cluster configuration, services can communicate across clusters transparently, and traffic policies apply consistently.
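Each cluster also needs API access to the others' service endpoints for discovery. In current Istio releases this is done with istioctl create-remote-secret, roughly as follows (context names match the earlier examples):
# Give cluster eu-west-1 read access to endpoint discovery in us-east-1
istioctl create-remote-secret \
  --context=cluster-us-east-1 \
  --name=cluster-us-east-1 | \
  kubectl apply -f - --context=cluster-eu-west-1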
GitOps for Multi-Cluster
Managing multiple clusters manually becomes unmanageable at scale. GitOps automates deployment across clusters using a Git repository as the source of truth.
ArgoCD for multi-cluster
ArgoCD runs in its own namespace and syncs applications to target clusters:
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: web-frontend
  namespace: argocd
spec:
  project: production
  source:
    repoURL: https://github.com/example/manifests.git
    targetRevision: main
    path: production/web-frontend
  destination:
    server: https://kubernetes.default.svc
    namespace: production
ArgoCD Applications can target the local cluster or remote clusters. For multi-cluster, either register the remote clusters with one central ArgoCD instance or run an ArgoCD instance per cluster, all driven by the same Git repository.
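With the central-instance approach, each target cluster is registered from a kubeconfig context, which stores its credentials as a cluster secret in ArgoCD:
# Register a remote cluster with the central ArgoCD instance
argocd cluster add cluster-eu-west-1
# List registered clusters and their API server URLs
argocd cluster list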
Cluster fleet management
# AppProject for environment separation
apiVersion: argoproj.io/v1alpha1
kind: AppProject
metadata:
  name: production
  namespace: argocd
spec:
  sourceRepos:
  - https://github.com/example/manifests.git
  destinations:
  - server: https://us-east-1.example.com
    namespace: production
  - server: https://eu-west-1.example.com
    namespace: production
AppProjects define which clusters and namespaces applications in the project may deploy to, limiting the blast radius of misconfigured deployments.
Drift detection
ArgoCD continuously monitors deployed resources against Git:
argocd app get web-frontend
If someone changes resources directly on the cluster, ArgoCD detects the drift and either syncs automatically or alerts depending on your configuration.
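To close that window automatically, an Application can opt into automated sync with pruning and self-healing. A sketch of the relevant fields, added to the Application spec shown earlier:
spec:
  syncPolicy:
    automated:
      prune: true
      # selfHeal reverts direct cluster changes back to the Git state
      selfHeal: true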
Cross-Cluster Network Connectivity Trade-offs
| Approach | Latency | Security | Complexity | Best For |
|---|---|---|---|---|
| VPC Peering | Lowest | High (AWS-native) | Medium | Same cloud, same region |
| VPN (WireGuard/IPSec) | Low | High (encrypted tunnel) | Medium-High | Cross-cloud, cross-region |
| Submariner | Low | Medium (mTLS) | High | Service mesh across clusters |
| Service Mesh Federation | Medium | High (mTLS + policy) | Highest | Multi-cluster service mesh |
| ExternalDNS + Global LB | Varies | Medium | Medium | Global traffic routing |
Multi-Cluster Security Checklist
- Cluster API access restricted via RBAC and cluster roles
- GitOps enforces all cluster changes (no direct kubectl to production)
- Network policies restrict cross-cluster traffic at the CNI level
- Service account tokens not shared between clusters
- Secrets not stored in Git (use external secrets operator or vault)
- Cluster credentials rotated regularly
- Audit logging enabled on all clusters
- Centralized identity provider (OIDC) for cross-cluster auth
Multi-Cluster Observability
Metrics aggregation:
# Cluster health at a glance
sum(kube_node_status_condition{condition="Ready"}) by (cluster)
# Deployment status across clusters
sum(kube_deployment_status_replicas) by (cluster, namespace)
# API server request rate by cluster
sum(rate(apiserver_request_total[5m])) by (cluster)
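These queries assume every cluster's Prometheus attaches a cluster label before metrics reach the central store. One common way is external_labels plus remote_write; the endpoint below is illustrative:
# Per-cluster Prometheus configuration
global:
  external_labels:
    cluster: us-east-1
remote_write:
- url: https://metrics.example.com/api/v1/push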
Unified logging across clusters:
Ship logs from all clusters to a central Loki or Elasticsearch instance. Label logs with cluster metadata:
# Promtail config for multi-cluster
scrape_configs:
- job_name: kubernetes
  kubernetes_sd_configs:
  - role: pod
  relabel_configs:
  # Attach a static cluster label so logs can be filtered per cluster centrally
  - target_label: cluster
    replacement: "us-east-1"
Key dashboards to build:
- Cluster inventory: number of nodes, pods, deployments per cluster
- Cross-cluster traffic: east-west bandwidth between clusters
- Deployment drift: Git desired state vs actual state per cluster
- Cost by cluster: compute costs attributed per cluster for chargeback
Multi-Cluster Cost Management
Multi-cluster costs scale with cluster count, not workload. Consider:
| Cost Factor | Single Cluster | Multi-Cluster |
|---|---|---|
| Control plane | 1× | N× control plane costs (each cluster runs its own) |
| Node overhead | Lower (bin-packing) | Higher (each cluster has system overhead) |
| Networking | Internal only | Cross-cluster egress costs |
| Operational | One cluster to manage | N clusters, N× monitoring burden |
Cost optimization strategies:
- Use cluster autoscalers to right-size nodes per cluster
- Consolidate dev/test environments onto shared clusters with namespace isolation
- Use spot/preemptible instances for non-production clusters
- Schedule batch workloads during off-peak hours to avoid provisioning extra capacity
- Monitor cross-cluster egress costs (VPN/data transfer adds up)
Conclusion
Multi-cluster Kubernetes addresses needs for geographic distribution, fault isolation, and compliance. Federation v2 provides a control plane for managing resources across clusters. Cluster API automates cluster provisioning and lifecycle.
Cross-cluster service discovery requires explicit configuration. Cluster federation DNS, ExternalDNS, and service meshes provide different approaches to making services reachable across clusters.
Network connectivity options include VPC peering, VPNs, Submariner, and service mesh configurations. Choose based on your latency requirements, security posture, and operational complexity tolerance.
GitOps with ArgoCD or similar tools manages deployment consistency across clusters. A Git repository as the single source of truth ensures all clusters converge to the desired state.
Multi-cluster adds significant complexity. Start with a single cluster and strong namespace isolation before moving to multiple clusters. The operational overhead of multi-cluster is substantial and justified only when single-cluster limits or requirements demand it.
Production Failure Scenarios
Cluster Isolation Failures
Intra-cluster network policies do not automatically extend across clusters. If you rely on network policies for security between services in different clusters, cross-cluster traffic may bypass those policies entirely.
If service A in cluster us-east-1 cannot reach service B in cluster eu-west-1, the problem could be DNS, routing, firewall rules, or the CNI configuration on either end. Nothing logs “blocked by cluster boundary.”
Verify cross-cluster connectivity separately. Test DNS resolution, test network routing, test firewall rules.
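A minimal way to test each layer is to run a throwaway debug pod in one cluster and probe the other; the image, service name, and target below are illustrative:
# DNS and connectivity checks from inside cluster us-east-1
kubectl --context cluster-us-east-1 run net-debug --rm -it --image=nicolaka/netshoot -- \
  sh -c 'nslookup api-backend.production.svc.clusterset.local; nc -vz <REMOTE_SERVICE_IP> 8080'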
Cross-Cluster Network Partition
Network partitions between clusters cause split-brain in databases or services that rely on leader election. Both clusters keep running, both think they are primary, data diverges.
This is not a Kubernetes problem. It is a network problem. If your databases do not handle network partitions gracefully, multi-cluster adds risk rather than removing it.
Use database-native replication that handles cross-region consistency. Avoid cross-region reads on multi-region writes without proper replication lag handling.
Drift Detection Gaps
ArgoCD reconciles on a schedule. If someone makes a change directly on a cluster between reconciliations, the drift exists until the next sync cycle. For critical applications, this window matters.
Set sync intervals short enough that drift does not persist for long. Monitor ArgoCD sync status and alert on drift.
Anti-Patterns
Federating Everything
Not every resource belongs in a federation. Namespaces, RBAC roles, and cluster-scoped resources are usually better managed per-cluster. Adding them to federation just creates noise and potential conflict.
Federate the resources that genuinely need consistent state across clusters. Everything else stays cluster-local.
Manual Cluster Provisioning
Hand-rolling clusters means each one ends up slightly different. The first cluster has a workaround in its kubelet config, the second has a different CNI version, the third has an older API version. Configuration drift compounds.
Use Cluster API or similar infrastructure-as-code tools. Treat cluster creation as code, review it in Git, apply consistently.
Inconsistent RBAC
A role binding that exists in cluster us-east-1 but not in eu-west-1 causes confusing failures. Developers with correct permissions in one cluster get denied in another. They assume it is a cluster-specific bug, not an RBAC gap.
Manage RBAC through GitOps. The same manifests deploy to all clusters. If a role binding is missing somewhere, it shows up as drift.
Quick Recap Checklist
- Established clear justification for multi-cluster (not just because it sounds robust)
- Used GitOps (ArgoCD) to manage deployments across all clusters
- Implemented cross-cluster service discovery (ExternalDNS, ServiceImport)
- Configured network connectivity between clusters (VPC peering, VPN, or service mesh)
- Set up drift detection and alerting for cluster state divergence
- Standardized RBAC policies across all clusters via GitOps
- Tested failover scenarios in staging before relying on multi-cluster for HA
- Monitored cross-cluster network latency and addressed performance issues
For more on advanced Kubernetes topics, see the Advanced Kubernetes post.