Multi-Cluster Kubernetes: Federation, Cross-Cluster Networking, Registries
Manage multiple Kubernetes clusters across regions and clouds using federation, cluster registries, and cross-cluster service discovery.
Running a single Kubernetes cluster works for many use cases. But as you scale, you may need multiple clusters for different regions, cloud providers, or environments. Managing multiple clusters introduces challenges around deployment consistency, service discovery, and network connectivity.
This post covers multi-cluster architectures, Kubernetes Federation v2, cluster registration, and patterns for cross-cluster communication.
Introduction
When multi-cluster makes sense
Geographic distribution is the clearest win. If your users are spread across regions, running pods in each region close to local users reduces latency. Data residency rules sometimes mandate it.
Different trust boundaries also warrant separation. A cluster for production and a cluster for testing are fundamentally different trust levels. Merging them adds attack surface for both.
Blast radius containment matters. If one cluster fails, does it take down everything? For critical production workloads, isolation means one cluster’s problem does not cascade.
Regulated environments often need it. Auditing requirements, data residency laws, and compliance frameworks may all require cluster-level separation.
When single cluster is better
Namespace isolation is simpler. If your team can manage one cluster well, adding more clusters adds management overhead without proportional benefit.
Operational maturity matters. Multi-cluster means multiple control planes, multiple upgrade cycles, multiple network configurations. If your team is still learning Kubernetes, start with one cluster.
The complexity tax is real. Cross-cluster networking is harder than intra-cluster. Service discovery across clusters requires explicit configuration. Debugging is harder when the problem might be in a different cluster.
Common multi-cluster architectures
| Architecture | Use Case | Complexity |
|---|---|---|
| Federation v2 | Centralized control plane, propagate to member clusters | Medium |
| Cluster API | Infrastructure-as-code cluster lifecycle management | High |
| GitOps (ArgoCD) | One repo deploys to multiple clusters | Low-Medium |
| Service mesh (Istio) | Cross-cluster service discovery and traffic management | High |
Multi-Cluster Architecture
Single clusters have limits. The maximum number of nodes depends on etcd performance and network plugin capabilities. More importantly, different clusters provide isolation for:
| Use Case | Benefit |
|---|---|
| Multi-region deployments | Lower latency for global users |
| Cloud provider diversification | Avoid vendor lock-in, regional outages |
| Environment separation | Dev, staging, production isolation |
| Compliance | Data residency requirements |
| Blast radius limitation | Failure in one cluster does not affect others |
Multi-cluster also enables rolling upgrades across clusters sequentially rather than risking all nodes simultaneously.
Federation v2 Architecture
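Kubernetes Federation v2 (KubeFed) provides a control plane for managing resources across multiple clusters. Instead of deploying to each cluster individually, you deploy to the federation control plane, which propagates resources to member clusters. The kubernetes-sigs/kubefed repository has since been archived, so check the project's status before adopting it for new deployments.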
KubeFed components
┌─────────────────────────────────────────┐
│        Federation Control Plane         │
│  ┌─────────────┐  ┌─────────────────┐   │
│  │ KubeFed     │  │ Federated       │   │
│  │ Controller  │  │ Resources       │   │
│  └─────────────┘  └─────────────────┘   │
└─────────────────────────────────────────┘
         │                    │
         ▼                    ▼
  ┌──────────────┐     ┌──────────────┐
  │   Cluster    │     │   Cluster    │
  │  us-east-1   │     │  eu-west-1   │
  └──────────────┘     └──────────────┘
Installing KubeFed
helm repo add kubefed-charts https://raw.githubusercontent.com/kubernetes-sigs/kubefed/master/charts
helm repo update
helm install -n kube-federation-system --create-namespace \
  kubefed kubefed-charts/kubefed
Registering clusters
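# Add clusters to the federation.
# --cluster-context is the kubeconfig context of the joining cluster;
# --host-cluster-context is the context of the cluster running KubeFed.
kubefedctl join cluster-us-east-1 \
  --cluster-context=us-east-1 \
  --host-cluster-context=federation-system
kubefedctl join cluster-eu-west-1 \
  --cluster-context=eu-west-1 \
  --host-cluster-context=federation-system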
Federated deployment
apiVersion: types.kubefed.io/v1beta1
kind: FederatedDeployment
metadata:
  name: web-frontend
  namespace: production
spec:
  template:
    metadata:
      labels:
        app: web-frontend
    spec:
      replicas: 3
      selector:
        matchLabels:
          app: web-frontend
      template:
        metadata:
          # Pod labels must match the selector above
          labels:
            app: web-frontend
        spec:
          containers:
          - name: nginx
            image: nginx:1.25
  placement:
    clusters:
    - name: cluster-us-east-1
    - name: cluster-eu-west-1
  overrides:
  - clusterName: cluster-eu-west-1
    clusterOverrides:
    - path: "/spec/replicas"
      value: 5
This FederatedDeployment deploys to both clusters but overrides the replica count for eu-west-1 to handle higher traffic there.
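To make the template-plus-override idea concrete, here is a minimal Python sketch of how a per-cluster manifest could be derived. It is a simplified model for illustration, not KubeFed's actual controller logic; the function name `render_for_cluster` is hypothetical, and only simple object-key paths such as `/spec/replicas` are handled.

```python
import copy


def render_for_cluster(template, overrides, cluster_name):
    """Return a copy of the template with overrides for cluster_name applied.

    Paths use the "/spec/replicas" form. Only plain object keys are handled;
    list indices are out of scope for this sketch.
    """
    rendered = copy.deepcopy(template)
    for entry in overrides:
        if entry["clusterName"] != cluster_name:
            continue
        for override in entry["clusterOverrides"]:
            keys = override["path"].strip("/").split("/")
            target = rendered
            for key in keys[:-1]:
                target = target.setdefault(key, {})
            target[keys[-1]] = override["value"]
    return rendered


template = {"spec": {"replicas": 3}}
overrides = [{
    "clusterName": "cluster-eu-west-1",
    "clusterOverrides": [{"path": "/spec/replicas", "value": 5}],
}]

for name in ("cluster-us-east-1", "cluster-eu-west-1"):
    replicas = render_for_cluster(template, overrides, name)["spec"]["replicas"]
    print(f"{name}: {replicas} replicas")
```

Clusters without a matching override entry receive the template unchanged, which is why only eu-west-1 ends up with five replicas.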
Cluster Registration and Lifecycle
Without federation, you can still manage multiple clusters through a cluster registry. A cluster registry is an inventory of Kubernetes clusters, exposed through a common API, that records how each cluster is registered and configured.
Cluster API provider
Cluster API (CAPI) automates cluster provisioning and lifecycle management:
apiVersion: cluster.x-k8s.io/v1beta1
kind: Cluster
metadata:
  name: production-us-east
  namespace: default
spec:
  clusterNetwork:
    # Service and pod ranges must not overlap with each other
    services:
      cidrBlocks: ["10.112.0.0/12"]
    pods:
      cidrBlocks: ["10.96.0.0/12"]
    serviceDomain: "cluster.local"
  infrastructureRef:
    kind: AWSCluster
    apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
    name: production-us-east-infra
CAPI providers exist for AWS, GCP, Azure, vSphere, and bare metal.
kubeconfig management
For simple multi-cluster management, use kubeconfig contexts:
# Switch between clusters
kubectl config use-context cluster-us-east-1
kubectl config use-context cluster-eu-west-1
# List contexts
kubectl config get-contexts
A kubeconfig file can contain multiple clusters, users, and contexts:
apiVersion: v1
kind: Config
clusters:
- name: cluster-us-east-1
  cluster:
    server: https://us-east-1.example.com
    certificate-authority-data: ...
- name: cluster-eu-west-1
  cluster:
    server: https://eu-west-1.example.com
    certificate-authority-data: ...
users:
- name: admin
  user:
    client-certificate-data: ...
    client-key-data: ...
contexts:
- name: cluster-us-east-1
  context:
    cluster: cluster-us-east-1
    user: admin
- name: cluster-eu-west-1
  context:
    cluster: cluster-eu-west-1
    user: admin
current-context: cluster-us-east-1
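Scripts that act on several clusters benefit from checking which context they are about to use. The following Python sketch works on the JSON form of a kubeconfig (for example the output of `kubectl config view -o json`); the helper names `servers_by_context` and `require_context` and the sample data are hypothetical.

```python
import json


def servers_by_context(kubeconfig):
    """Map each context name to the API server URL of the cluster it uses."""
    clusters = {c["name"]: c["cluster"]["server"] for c in kubeconfig.get("clusters", [])}
    return {
        ctx["name"]: clusters[ctx["context"]["cluster"]]
        for ctx in kubeconfig.get("contexts", [])
    }


def require_context(kubeconfig, expected):
    """Raise if the active context is not the one the script was written for."""
    current = kubeconfig.get("current-context")
    if current != expected:
        raise RuntimeError(f"current context is {current!r}, expected {expected!r}")


# Stand-in for `kubectl config view -o json`
SAMPLE = json.loads("""
{
  "clusters": [
    {"name": "cluster-us-east-1", "cluster": {"server": "https://us-east-1.example.com"}},
    {"name": "cluster-eu-west-1", "cluster": {"server": "https://eu-west-1.example.com"}}
  ],
  "contexts": [
    {"name": "cluster-us-east-1", "context": {"cluster": "cluster-us-east-1", "user": "admin"}},
    {"name": "cluster-eu-west-1", "context": {"cluster": "cluster-eu-west-1", "user": "admin"}}
  ],
  "current-context": "cluster-us-east-1"
}
""")

for context, server in servers_by_context(SAMPLE).items():
    print(context, "->", server)
require_context(SAMPLE, "cluster-us-east-1")
```

Calling `require_context` at the top of a deployment script is a cheap guard against running it against the wrong cluster.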
Cross-Cluster Service Discovery
When services run on different clusters, you need ways to discover and communicate with them.
Cluster Federation DNS
KubeFed enables DNS-based service discovery across clusters. Services get federated DNS names that resolve to endpoints in all member clusters:
web-frontend.production.svc.global
This global DNS name returns all endpoints from all clusters where the service is deployed.
Service export and import
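# Created by you in each cluster that should export the Service
apiVersion: multicluster.x-k8s.io/v1alpha1
kind: ServiceExport
metadata:
  name: api-backend
  namespace: production
---
# Normally created by the MCS controller in importing clusters, shown for reference
apiVersion: multicluster.x-k8s.io/v1alpha1
kind: ServiceImport
metadata:
  name: api-backend
  namespace: production
spec:
  type: ClusterSetIP
  ports:
  - port: 80          # illustrative port
    protocol: TCP
status:
  clusters:
  - cluster: cluster-us-east-1
  - cluster: cluster-eu-west-1
A ServiceExport marks an existing Service for export from its cluster. The Multi-Cluster Services (MCS) implementation then maintains a matching ServiceImport in the other clusters of the cluster set, aggregating endpoints from every exporting cluster. Consumers reach the service through a name of the form api-backend.production.svc.clusterset.local.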
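The aggregation step can be pictured with a small Python model. This is a toy illustration of merging exported endpoints per service, not the behavior of any particular MCS implementation; the function names, the input shape, and the IP addresses are assumptions. The `clusterset.local` suffix follows the naming used by the Multi-Cluster Services API.

```python
def clusterset_dns_name(name, namespace):
    """Name under which an imported service is resolved across the cluster set."""
    return f"{name}.{namespace}.svc.clusterset.local"


def aggregate_exports(exports):
    """Merge endpoints exported by each cluster under one (namespace, name) key."""
    imports = {}
    for export in exports:
        key = (export["namespace"], export["name"])
        entry = imports.setdefault(key, {"clusters": [], "endpoints": []})
        entry["clusters"].append(export["cluster"])
        entry["endpoints"].extend(export["endpoints"])
    return imports


exports = [
    {"cluster": "cluster-us-east-1", "namespace": "production",
     "name": "api-backend", "endpoints": ["10.1.0.12", "10.1.0.13"]},
    {"cluster": "cluster-eu-west-1", "namespace": "production",
     "name": "api-backend", "endpoints": ["10.2.0.7"]},
]

merged = aggregate_exports(exports)[("production", "api-backend")]
print(clusterset_dns_name("api-backend", "production"), merged)
```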
External DNS for multi-cluster
ExternalDNS synchronizes Kubernetes Services with DNS providers:
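apiVersion: apps/v1
kind: Deployment
metadata:
  name: external-dns
spec:
  selector:
    matchLabels:
      app: external-dns
  template:
    metadata:
      labels:
        app: external-dns
    spec:
      containers:
      - name: external-dns
        image: registry.k8s.io/external-dns/external-dns:<VERSION>
        args:
        - --source=service
        - --source=ingress
        - --domain-filter=example.com
        - --provider=cloudflare
        - --policy=upsert-only
        - --registry=txt
        - --txt-owner-id=cluster-us-east-1
Run one ExternalDNS instance per cluster, each with a distinct --txt-owner-id. Every instance then manages only the records it created, so several clusters can publish into the same zone without overwriting one another. With --policy=upsert-only, ExternalDNS creates and updates records but never deletes them.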
Network Connectivity Options
Cross-cluster communication requires network connectivity. Several options exist:
VPC peering
Each major cloud offers a peering primitive for connecting virtual networks:
AWS: VPC Peering
GCP: VPC Network Peering
Azure: Virtual Network Peering
VPC peering creates direct network paths between clusters without traversing the public internet. The peered networks must use non-overlapping CIDR ranges. DNS resolution requires private hosted zones or CoreDNS with static entries.
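Peering and pod-IP routing between clusters both assume that the address ranges involved do not overlap, so it is worth checking the planned ranges before provisioning anything. A minimal sketch using Python's standard `ipaddress` module follows; the labels and CIDR values are made up for illustration.

```python
import ipaddress
from itertools import combinations


def find_overlaps(ranges):
    """Return pairs of labels whose CIDR ranges overlap.

    strict=False normalizes a value such as 10.100.0.0/12 to the network it
    actually belongs to (10.96.0.0/12) instead of raising ValueError.
    """
    nets = {label: ipaddress.ip_network(cidr, strict=False) for label, cidr in ranges.items()}
    return [
        (a, b)
        for (a, net_a), (b, net_b) in combinations(nets.items(), 2)
        if net_a.overlaps(net_b)
    ]


planned = {
    "us-east pods": "10.96.0.0/12",
    "us-east services": "10.100.0.0/12",   # falls inside 10.96.0.0/12
    "eu-west pods": "192.168.0.0/16",
    "eu-west services": "172.16.0.0/16",
}

for a, b in find_overlaps(planned):
    print(f"overlap: {a} <-> {b}")
```

Note how `10.100.0.0/12` looks distinct from `10.96.0.0/12` at a glance but describes the same /12 block, which is exactly the kind of mistake this check catches.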
VPN connections
Site-to-site VPN connects on-premises or cloud networks:
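# WireGuard example (run on one side, then mirror it on the peer)
ip link add wg0 type wireguard
ip addr add 10.0.0.1/24 dev wg0
wg set wg0 listen-port 51820 private-key ./privatekey \
  peer <PEER_PUBLIC_KEY> allowed-ips <REMOTE_CIDR> endpoint <PEER_ENDPOINT>:51820
ip link set wg0 up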
VPNs encrypt traffic and allow full network access between sites.
Submariner for cross-cluster networking
Submariner provides direct pod-to-pod connectivity across Kubernetes clusters:
subctl deploy-broker --kubeconfig ~/.kube/config
subctl join --kubeconfig ~/.kube/config broker-info.subm --cable-driver libreswan
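Submariner sets up encrypted tunnels between gateway nodes and handles NAT traversal. After joining, pods on one cluster can reach pods on another cluster by pod IP, provided the clusters use non-overlapping pod and service CIDRs. For clusters whose ranges do overlap, Submariner offers the Globalnet feature.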
Service mesh for multi-cluster
Service meshes like Istio and Linkerd support multi-cluster deployments:
# Istio remote profile for cross-cluster
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
metadata:
  name: istio-multicluster
spec:
  profile: remote
  values:
    global:
      meshID: my-mesh
      multiCluster:
        clusterName: cluster-us-east-1
      network: network1
With Istio’s multi-cluster configuration, services can communicate across clusters transparently, and traffic policies apply consistently.
GitOps for Multi-Cluster
Managing multiple clusters manually becomes unmanageable at scale. GitOps automates deployment across clusters using a Git repository as the source of truth.
ArgoCD Configuration
ArgoCD for multi-cluster
ArgoCD runs in its own namespace and syncs applications to target clusters:
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: web-frontend
  namespace: argocd
spec:
  project: production
  source:
    repoURL: https://github.com/example/manifests.git
    targetRevision: main
    path: production/web-frontend
  destination:
    server: https://kubernetes.default.svc
    namespace: production
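ArgoCD applications can target the local cluster (https://kubernetes.default.svc, as above) or remote clusters. For multi-cluster, you can deploy an ArgoCD instance per cluster with a central Git repository, or run one central instance and register remote clusters with argocd cluster add so that Applications can point their destination at them.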
Cluster fleet management
# AppProject for environment separation
apiVersion: argoproj.io/v1alpha1
kind: AppProject
metadata:
  name: production
  namespace: argocd
spec:
  sourceRepos:
  - https://github.com/example/manifests.git
  destinations:
  # cluster-us-east-1
  - server: https://us-east-1.example.com
    namespace: production
  # cluster-eu-west-1
  - server: https://eu-west-1.example.com
    namespace: production
AppProjects define which clusters applications can deploy to, limiting blast radius of misconfigured deployments.
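The allow-list idea can be expressed in a few lines of Python. This is a simplified model that treats the permitted destinations as server/namespace pairs with glob patterns; ArgoCD's real matching has additional rules, and the function name `is_permitted` is hypothetical.

```python
from fnmatch import fnmatch


def is_permitted(destinations, server, namespace):
    """True if (server, namespace) matches at least one allowed destination."""
    return any(
        fnmatch(server, dest["server"]) and fnmatch(namespace, dest["namespace"])
        for dest in destinations
    )


production_destinations = [
    {"server": "https://us-east-1.example.com", "namespace": "production"},
    {"server": "https://eu-west-1.example.com", "namespace": "production"},
]

print(is_permitted(production_destinations, "https://us-east-1.example.com", "production"))
print(is_permitted(production_destinations, "https://us-east-1.example.com", "kube-system"))
print(is_permitted(production_destinations, "https://staging.example.com", "production"))
```

An Application pointing at a namespace or cluster outside the list is rejected before anything is deployed, which is what keeps a misconfigured manifest from reaching the wrong place.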
Drift detection
ArgoCD continuously monitors deployed resources against Git:
argocd app get web-frontend
If someone changes resources directly on the cluster, ArgoCD detects the drift and either syncs automatically or alerts depending on your configuration.
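Conceptually, drift detection is a structural comparison between the manifest in Git and the live object. The sketch below shows that comparison on plain dictionaries; it is an illustration of the idea rather than ArgoCD's diffing engine, and `find_drift` is a hypothetical helper. Fields that exist only in the live object are ignored, since controllers routinely add defaults and status.

```python
def find_drift(desired, live, path=""):
    """Return the paths at which the live object differs from the desired one."""
    drift = []
    for key, want in desired.items():
        here = f"{path}/{key}"
        if key not in live:
            drift.append(here)
        elif isinstance(want, dict) and isinstance(live[key], dict):
            drift.extend(find_drift(want, live[key], here))
        elif want != live[key]:
            drift.append(here)
    return drift


desired = {"spec": {"replicas": 3, "template": {"image": "nginx:1.25"}}}
live = {
    "spec": {"replicas": 5, "template": {"image": "nginx:1.25"}},
    "status": {"readyReplicas": 5},  # server-populated, not part of Git
}

print(find_drift(desired, live))
```

Here someone scaled the deployment by hand, so the only reported path is `/spec/replicas`.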
Multi-Cluster Operations
Cross-Cluster Network Connectivity Trade-offs
| Approach | Latency | Security | Complexity | Best For |
|---|---|---|---|---|
| VPC Peering | Lowest | High (private provider network) | Medium | Same cloud, same or different regions |
| VPN (WireGuard/IPSec) | Low | High (encrypted tunnel) | Medium-High | Cross-cloud, cross-region |
| Submariner | Low | High (IPsec or WireGuard tunnels) | High | Direct pod and service connectivity across clusters |
| Service Mesh Federation | Medium | High (mTLS + policy) | Highest | Uniform identity and traffic policy across clusters |
| ExternalDNS + Global LB | Varies | Medium | Medium | Global traffic routing |
Multi-Cluster Security Checklist
- Cluster API access restricted via RBAC and cluster roles
- GitOps enforces all cluster changes (no direct kubectl to production)
- Network policies restrict cross-cluster traffic at CNI level
- Service account tokens not shared between clusters
- Secrets not stored in Git (use external secrets operator or vault)
- Cluster credentials rotated regularly
- Audit logging enabled on all clusters
- Centralized identity provider (OIDC) for cross-cluster auth
Multi-Cluster Observability
Metrics aggregation:
# Cluster health at a glance: number of Ready nodes per cluster
sum(kube_node_status_condition{condition="Ready", status="true"}) by (cluster)
# Deployment status across clusters
sum(kube_deployment_status_replicas) by (cluster, namespace)
# API server request rate by cluster
sum(rate(apiserver_request_total[5m])) by (cluster)
Unified logging across clusters:
Ship logs from all clusters to a central Loki or Elasticsearch instance. Label logs with cluster metadata:
# Promtail config for multi-cluster
scrape_configs:
- job_name: kubernetes
  kubernetes_sd_configs:
  - role: pod
  relabel_configs:
  # Attach a static cluster label to every log stream from this cluster
  - target_label: cluster
    replacement: "us-east-1"
Key dashboards to build:
- Cluster inventory: number of nodes, pods, deployments per cluster
- Cross-cluster traffic: east-west bandwidth between clusters
- Deployment drift: Git desired state vs actual state per cluster
- Cost by cluster: compute costs attributed per cluster for chargeback
Multi-Cluster Cost Management
The fixed costs of a multi-cluster setup scale with cluster count rather than with workload. Consider:
| Cost Factor | Single Cluster | Multi-Cluster |
|---|---|---|
| Control plane | 1x | N× control plane costs (each cluster runs its own) |
| Node overhead | Lower (bin-packing) | Higher (each cluster has system overhead) |
| Networking | Internal only | Cross-cluster egress costs |
| Operational | One cluster to manage | N clusters, N× monitoring burden |
Cost optimization strategies:
- Use cluster autoscalers to right-size nodes per cluster
- Consolidate dev/test environments onto shared clusters with namespace isolation
- Use spot/preemptible instances for non-production clusters
- Schedule batch workloads during off-peak hours so they reuse idle capacity instead of requiring extra clusters
- Monitor cross-cluster egress costs (VPN/data transfer adds up)
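A back-of-the-envelope model makes the fixed-cost effect visible. Every number below is invented for illustration and is not a quote from any provider; the function `monthly_cost` and its parameters are hypothetical.

```python
def monthly_cost(clusters, control_plane_fee, system_nodes_fee, workload_cost,
                 egress_gb=0, egress_per_gb=0.0):
    """Total monthly cost: per-cluster fixed overhead plus workload and egress."""
    fixed = clusters * (control_plane_fee + system_nodes_fee)
    return fixed + workload_cost + egress_gb * egress_per_gb


# Same workload, run on one cluster versus spread across four (made-up prices)
single = monthly_cost(clusters=1, control_plane_fee=70, system_nodes_fee=150,
                      workload_cost=5000)
multi = monthly_cost(clusters=4, control_plane_fee=70, system_nodes_fee=150,
                     workload_cost=5000, egress_gb=2000, egress_per_gb=0.02)

print(f"single cluster: {single:.2f}")
print(f"four clusters:  {multi:.2f}")
print(f"overhead of multi-cluster: {multi - single:.2f}")
```

The workload term is identical in both cases; the difference comes entirely from the per-cluster overhead and the cross-cluster egress.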
Production Failure Scenarios
Cluster Isolation Failures
Intra-cluster network policies do not automatically extend across clusters. If you rely on network policies for security between services in different clusters, cross-cluster traffic may bypass those policies entirely.
If service A in cluster us-east-1 cannot reach service B in cluster eu-west-1, the problem could be DNS, routing, firewall rules, or the CNI configuration on either end. Nothing logs “blocked by cluster boundary.”
Verify cross-cluster connectivity separately. Test DNS resolution, test network routing, test firewall rules.
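Testing each layer in order narrows the search quickly. The sketch below, using only Python's standard `socket` module, resolves a name first and then attempts a TCP connection, reporting the first stage that fails. The function name `probe` and the example hostname are hypothetical; run it from a pod in the source cluster.

```python
import socket


def probe(host, port, timeout=3.0):
    """Check DNS resolution, then a TCP connect, and report the first failure."""
    try:
        infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror as exc:
        return {"stage": "dns", "ok": False, "detail": str(exc)}

    family, socktype, proto, _, sockaddr = infos[0]
    try:
        with socket.socket(family, socktype, proto) as sock:
            sock.settimeout(timeout)
            sock.connect(sockaddr)
    except OSError as exc:
        # A timeout often points at a firewall silently dropping packets;
        # "connection refused" means the host was reached but nothing listens.
        return {"stage": "tcp", "ok": False, "detail": str(exc)}
    return {"stage": "tcp", "ok": True, "detail": f"connected to {sockaddr[0]}"}


if __name__ == "__main__":
    # Hypothetical cross-cluster service name and port
    print(probe("api-backend.production.svc.clusterset.local", 80))
```

A failure at the `dns` stage points at service discovery, while a failure at the `tcp` stage points at routing, firewall rules, or the CNI.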
Cross-Cluster Network Partition
Network partitions between clusters can cause split-brain in databases or services that rely on leader election. Both clusters keep running, both think they are primary, and data diverges.
This is not a Kubernetes problem. It is a network problem. If your databases do not handle network partitions gracefully, multi-cluster adds risk rather than removing it.
Use database-native replication that handles cross-region consistency. Avoid cross-region reads on multi-region writes without proper replication lag handling.
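Systems that survive partitions generally rely on a majority quorum: a side may accept writes only if it can reach more than half of the members. The Python sketch below illustrates that rule under simplified assumptions (fixed membership, reachability known exactly); the names and the three-site layout are hypothetical.

```python
def has_quorum(reachable_members, total_members):
    """A side may act as primary only if it sees a strict majority of members."""
    return reachable_members > total_members // 2


def writable_sides(partition, total_members):
    """Return the sides of a partition that are allowed to accept writes."""
    return [side for side in partition if has_quorum(len(side), total_members)]


# Two clusters, one replica each: after a split neither side has a majority,
# so writes stop everywhere (safe, but unavailable).
print(writable_sides([{"us-east-1"}, {"eu-west-1"}], total_members=2))

# Three sites including a small witness: exactly one side keeps a majority.
print(writable_sides([{"us-east-1"}, {"eu-west-1", "witness"}], total_members=3))
```

Because two disjoint groups cannot both hold a strict majority, at most one side stays writable, which is why an odd number of voting members spread over three failure domains is a common layout.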
Drift Detection Gaps
ArgoCD reconciles periodically (every three minutes by default), and it only reverts manual changes when automated sync with self-heal is enabled. Until then, a change made directly on a cluster persists as drift. For critical applications, this window matters.
Keep the reconciliation interval short enough that drift does not persist for long, enable self-heal where reverting is safe, and monitor ArgoCD sync status so that drift raises an alert.
Common Pitfalls / Anti-Patterns
Federating Everything
Not every resource belongs in a federation. Namespaces, RBAC roles, and cluster-scoped resources are usually better managed per-cluster. Adding them to federation just creates noise and potential conflict.
Federate the resources that genuinely need consistent state across clusters. Everything else stays cluster-local.
Manual Cluster Provisioning
Hand-rolling clusters means each one ends up slightly different. The first cluster has a workaround in its kubelet config, the second has a different CNI version, the third has an older API version. Configuration drift compounds.
Use Cluster API or similar infrastructure-as-code tools. Treat cluster creation as code, review it in Git, apply consistently.
Inconsistent RBAC
A role binding that exists in cluster us-east-1 but not in eu-west-1 causes confusing failures. Developers with correct permissions in one cluster get denied in another. They assume it is a cluster-specific bug, not an RBAC gap.
Manage RBAC through GitOps. The same manifests deploy to all clusters. If a role binding is missing somewhere, it shows up as drift.
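Until RBAC is fully under GitOps, a quick comparison across clusters can surface the gaps. This sketch assumes you have already collected role binding names per cluster (for example from `kubectl get rolebindings -A -o json`); the function name and the sample data are hypothetical.

```python
def missing_bindings(bindings_by_cluster):
    """For each cluster, list bindings that exist elsewhere but not there."""
    all_bindings = set().union(*bindings_by_cluster.values())
    return {
        cluster: sorted(all_bindings - present)
        for cluster, present in bindings_by_cluster.items()
        if all_bindings - present
    }


observed = {
    "cluster-us-east-1": {"production/deployers", "production/viewers"},
    "cluster-eu-west-1": {"production/viewers"},
}

print(missing_bindings(observed))
```

A cluster absent from the result has every binding that any other cluster has; it says nothing about whether those bindings are the right ones.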
Interview Questions
Expected answer points:
- Multi-cluster is justified when you need geographic distribution for lower latency to users in different regions
- Different trust boundaries warrant separation: production vs testing, regulated vs non-regulated environments
- Blast radius containment matters for critical workloads — one cluster's failure should not cascade to others
- Compliance requirements may mandate data residency or cluster-level isolation that namespaces cannot provide
- Single cluster with namespaces is simpler when these concerns do not apply — multi-cluster adds significant operational overhead
Expected answer points:
- KubeFed provides a central control plane that propagates resources to member clusters instead of deploying to each cluster individually
- Components: KubeFed Controller manages the federation control plane, Federated Resources define how resources spread across clusters
- FederatedDeployments, FederatedServices, and similar types define both the template (spec.template) and the placement (spec.placement)
- Override capability allows cluster-specific configuration: e.g., higher replica count in eu-west-1 than us-east-1
- Use KubeFed when you need centralized control and automatic propagation to all clusters simultaneously
- Skip KubeFed when GitOps with ArgoCD is sufficient — GitOps provides more flexibility and auditability
Expected answer points:
- Cluster Federation DNS (via KubeFed) provides federated DNS names like web-frontend.production.svc.global that resolve to all endpoints across member clusters
- ServiceExport and ServiceImport CRDs aggregate endpoints from multiple clusters under a single DNS name
- ExternalDNS synchronizes Kubernetes Services and Ingresses with external DNS providers (Cloudflare, Route53)
- Running one ExternalDNS instance per cluster with a distinct --txt-owner-id lets several clusters publish records into the same zone without overwriting one another
- Federation DNS works for internal cluster discovery; ExternalDNS works for external traffic routing to multi-cluster deployments
Expected answer points:
- Kubeconfig files contain clusters, users, and contexts — switch between clusters with kubectl config use-context
- List all contexts with kubectl config get-contexts to see available clusters
- For larger environments, use tools like kubeconfig-manager or kconf to organize and merge multiple kubeconfig files
- ArgoCD can target remote clusters by specifying the server URL in the Application spec destination
- Always use unique context names that indicate environment and region to avoid accidentally deploying to the wrong cluster
Expected answer points:
- VPC Peering: lowest latency, traffic stays on the provider's private network, but limited to a single cloud and requires non-overlapping CIDR ranges
- VPN (WireGuard/IPSec): works across clouds and regions, encrypted tunnel, but adds latency and requires VPN infrastructure management
- Submariner: direct pod-to-pod connectivity with NAT traversal and encrypted tunnels (IPsec or WireGuard) between gateway nodes; flexible but high complexity
- Service Mesh Federation (Istio): highest security (mTLS + policy), works across clouds, but highest operational complexity and cost
- ExternalDNS + Global LB: latency varies, medium security, good for global traffic routing but does not provide pod-to-pod networking
Expected answer points:
- Cluster API (CAPI) defines Cluster, MachineDeployment, and infrastructure provider resources that together provision a full cluster
- InfrastructureRef points to provider-specific resources (AWSCluster, GCPCluster, AzureCluster) that create the underlying infrastructure
- Providers exist for AWS, GCP, Azure, vSphere, and bare metal — abstracting the infrastructure differences
- Cluster provisioning is declarative: apply a Cluster CR and the provider creates all required infrastructure and bootstraps the control plane
- Use CAPI when you need infrastructure-as-code cluster creation with consistent configuration across environments
Expected answer points:
- Network partitions between clusters cause databases on both clusters to think they are primary simultaneously
- Data divergence occurs when both clusters continue accepting writes during the partition
- This is a network problem, not a Kubernetes problem — Kubernetes does not prevent split-brain at the database layer
- Use database-native replication that handles cross-region consistency (e.g., CockroachDB, Spanner)
- Avoid cross-region reads on multi-region writes without proper replication lag handling
- Implement fencing and leader election mechanisms that database systems provide rather than relying on network reliability
Expected answer points:
- ArgoCD runs in its own namespace and syncs applications to target clusters by specifying server and namespace in the Application destination
- AppProjects define which clusters applications can deploy to, limiting blast radius of misconfigured deployments
- ArgoCD continuously monitors deployed resources against Git — if someone changes resources directly on the cluster, drift is detected
- Sync policy can be automatic (sync on drift) or manual — configure based on your risk tolerance
- For multi-cluster, either deploy an ArgoCD instance per cluster with a central Git repository, or run one central ArgoCD instance that deploys to remote clusters registered with argocd cluster add
- Set sync intervals short enough that drift does not persist for long in production environments
Expected answer points:
- Multi-cluster costs scale with cluster count, not workload — each cluster has its own control plane overhead
- Control plane: single cluster has 1x control plane cost; N clusters have N× control plane costs
- Node overhead: single cluster benefits from bin-packing efficiency; multi-cluster has higher aggregate overhead from system pods per cluster
- Networking: single cluster is internal only; multi-cluster adds cross-cluster egress costs (VPN/data transfer)
- Operational: one cluster to monitor and manage vs N clusters with N× monitoring burden
- Cost optimization: use cluster autoscalers, consolidate dev/test onto shared clusters, use spot instances for non-production
Expected answer points:
- Prometheus Operator with federation or Thanos for metrics aggregation across clusters
- Label logs with cluster metadata: use Promtail config with relabel_configs to add cluster label to all logs from a cluster
- Ship logs to central Loki or Elasticsearch — query across clusters by filtering on cluster label
- Key dashboards: cluster inventory (nodes, pods, deployments per cluster), cross-cluster traffic, deployment drift, cost by cluster
- PromQL for cluster health: sum(kube_node_status_condition{condition="Ready", status="true"}) by (cluster)
- Set up alerts for cross-cluster network latency spikes and deployment drift between clusters
Expected answer points:
- Not everything belongs in federation — namespaces, RBAC roles, and cluster-scoped resources are usually better managed per-cluster
- Manual cluster provisioning leads to configuration drift: each cluster ends up slightly different with different CNI versions and workarounds
- Inconsistent RBAC: a role binding in us-east-1 but not in eu-west-1 causes confusing failures for developers with correct permissions in one cluster
- Manage RBAC through GitOps — same manifests deploy to all clusters and drift shows up as ArgoCD drift
- Use Cluster API or infrastructure-as-code tools for consistent cluster creation and configuration management
Expected answer points:
- Never store secrets in Git — use External Secrets Operator to sync secrets from Vault, AWS Secrets Manager, or GCP Secret Manager
- External Secrets Operator creates Secret resources in the cluster backed by external secret stores
- HashiCorp Vault CSI provider can inject secrets as mounted files without pod-level secret synchronization
- For cluster-specific secrets (like database passwords), manage them per-cluster with ESO
- For shared secrets across clusters (like image pull secrets), consider Vault's transit encryption or ESO cluster-scoped resources
Expected answer points:
- Istio remote profile allows configuring cross-cluster service discovery and traffic management
- meshID and clusterName in global configuration identify the cluster within the mesh
- network setting in IstioOperator defines which network the cluster belongs to — different networks require cross-network routing
- Services communicate across clusters transparently, and Istio traffic policies (circuit breaking, retries, mTLS) apply consistently
- On a single flat network, sidecars reach remote pods directly; across different networks, cross-cluster traffic goes through east-west gateways exposed via LoadBalancer or NodePort
Expected answer points:
- Rolling upgrades across clusters sequentially rather than all at once — upgrade cluster-by-cluster to minimize risk
- Use GitOps with ArgoCD: update the Git repository with new version, and ArgoCD propagates to target clusters
- Use ArgoCD sync waves or application sets with rollback capability if issues occur
- ArgoCD's diff visualization shows exactly what would change before applying — catch issues in review, not in production
- Test in staging cluster first, then apply to production clusters one at a time with monitoring between each
- For critical applications, have a rollback playbook ready that uses helm rollback or ArgoCD app rollback
Expected answer points:
- Rotate cluster credentials (kubeconfig, service account tokens) regularly using automation
- Use tools like kubelogin or the token request API for dynamic short-lived credentials instead of static service account tokens
- For long-lived kubeconfig credentials, automate rotation using a secrets manager and GitOps to push updated kubeconfigs
- Monitor credential expiration dates and alert before expiration
- Ensure ArgoCD or other GitOps tools have updated credentials after rotation — test automated deployments after credential changes
Expected answer points:
- Namespace-scoped resources (Deployments, Services, ConfigMaps) exist within a namespace and are managed per-namespace
- Cluster-scoped resources (ClusterRoles, CustomResourceDefinitions, PersistentVolumes) exist at cluster level and apply to the entire cluster
- Federate namespace-scoped resources that need consistent state across clusters — they propagate to matching namespaces
- Do not federate cluster-scoped resources unless necessary — they can conflict when applied to different cluster contexts
- RBAC cluster roles and cluster role bindings are typically managed per-cluster, not via federation
Expected answer points:
- NetworkPolicies are namespaced resources enforced by each cluster's CNI plugin, so they only apply within a single cluster and cannot span clusters
- Intra-cluster network policies do not automatically extend across clusters — cross-cluster traffic may bypass those policies
- If you rely on network policies for security between services in different clusters, verify cross-cluster connectivity separately
- For cross-cluster security, use the network connectivity layer (VPC security groups, VPN firewalls) or service mesh authorization policies
- Test DNS resolution, routing, and firewall rules on both ends when diagnosing cross-cluster connectivity issues
Expected answer points:
- Maintain at least one secondary cluster in a different region for DR — primary cluster failure does not affect secondary
- Use Velero for cluster backup and restore: it captures Kubernetes resources and PersistentVolume data through the API server rather than copying etcd directly
- ArgoCD Disaster Recovery: if primary cluster fails, promote secondary by updating ArgoCD Application destinations to point to secondary
- Database-level DR: replicate databases to secondary region using database-native async replication
- DNS failover: use health-checked DNS records (for example Route53 health checks) or a global load balancer to route traffic to the surviving cluster when the primary fails
- Test DR scenarios regularly — a DR plan that has not been tested is not a DR plan
Expected answer points:
- Each cluster runs its own control plane and may be at a different Kubernetes version — upgrade cycles are not synchronized
- Application compatibility: applications developed against one Kubernetes API version may behave differently on another version
- Use Cluster API to manage version skew during upgrades — control plane and node pools can be upgraded incrementally
- Maintain compatibility matrix: which application versions work with which Kubernetes versions
- For rolling upgrades across clusters, test application compatibility on target version in staging before upgrading production clusters
- Consider using managed Kubernetes (EKS, GKE, AKS) to reduce control plane upgrade burden
Further Reading
- KubeFed GitHub repository - Federation v2 control plane for managing resources across clusters
- Cluster API documentation - Infrastructure-as-code cluster lifecycle management
- ArgoCD documentation - GitOps declarative continuous delivery for multi-cluster
- Submariner project - Direct pod-to-pod connectivity across Kubernetes clusters
- Istio multi-cluster setup - Service mesh cross-cluster configuration
- Prometheus Operator - Metrics aggregation for multi-cluster observability
Conclusion
Multi-cluster Kubernetes addresses needs for geographic distribution, fault isolation, and compliance. Federation v2 provides a control plane for managing resources across clusters. Cluster API automates cluster provisioning and lifecycle.
Cross-cluster service discovery requires explicit configuration. Cluster federation DNS, ExternalDNS, and service meshes provide different approaches to making services reachable across clusters.
Network connectivity options include VPC peering, VPNs, Submariner, and service mesh configurations. Choose based on your latency requirements, security posture, and operational complexity tolerance.
GitOps with ArgoCD or similar tools manages deployment consistency across clusters. A Git repository as the single source of truth ensures all clusters converge to the desired state.
Multi-cluster adds significant complexity. Start with single cluster and strong namespace isolation before moving to multiple clusters. The operational overhead of multi-cluster is substantial and justified only when single-cluster limits or requirements demand it.
- Established clear justification for multi-cluster (not just because it sounds robust)
- Used GitOps (ArgoCD) to manage deployments across all clusters
- Implemented cross-cluster service discovery (ExternalDNS, ServiceImport)
- Configured network connectivity between clusters (VPC peering, VPN, or service mesh)
- Set up drift detection and alerting for cluster state divergence
- Standardized RBAC policies across all clusters via GitOps
- Tested failover scenarios in staging before relying on multi-cluster for HA
- Monitored cross-cluster network latency and addressed performance issues