Advanced Kubernetes: Controllers, Operators, RBAC, Production Patterns

Explore Kubernetes custom controllers, operators, RBAC, network policies, storage classes, and advanced patterns for production cluster management.

published: March 22, 2026 reading time: 61 min read author: GeekWorkBench updated: June 17, 2026

Quick Summary

This post digs into the advanced Kubernetes territory you hit once basic Deployments stop being enough. It walks through CRDs and controller-runtime for building custom reconciliation loops, operator patterns with Operator SDK for domain-specific automation, RBAC mechanics with Roles and ClusterRoles, network policies for pod-level isolation, storage classes for dynamic provisioning, and scheduling controls like taints, tolerations, and affinity. The decision trees and failure scenarios come from actual production situations. By the end, you will know when to reach for native resources, custom controllers, or full operators to automate operational logic reliably.

Advanced Kubernetes: Controllers, Operators, RBAC, and Production Patterns

Kubernetes has become the de facto standard for container orchestration. If you have been running clusters for a while, you have likely encountered scenarios that basic Kubernetes resources do not handle well. Custom controllers, operators, and advanced security patterns solve these problems.

This guide assumes you already know Kubernetes basics. If you are just starting, our Docker Fundamentals guide covers containers first, which is essential groundwork before tackling Kubernetes.

Introduction

Before diving into advanced topics, let us review how Kubernetes control plane components work together.

graph TB
    subgraph "Control Plane"
        A[API Server] --> B[etcd]
        A --> C[Controller Manager]
        A --> D[Scheduler]
        C --> E[Controllers]
        E --> A
    end
    subgraph "Worker Nodes"
        F[Kubelet] --> A
        G[Container Runtime] --> F
        H[Kube Proxy] --> F
    end

The API server is the gateway to everything. All cluster operations go through it, and it validates configurations before persisting to etcd. Controllers watch the API server for changes and reconcile actual state toward desired state.

CRDs extend the Kubernetes API to define new resource types. They let you create domain-specific objects that Kubernetes can manage like native resources.

Defining a CRD

apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: databases.example.com
spec:
  group: example.com
  names:
    kind: Database
    plural: databases
    shortNames:
      - db
  scope: Namespaced
  versions:
    - name: v1
      served: true
      storage: true
      schema:
        openAPIV3Schema:
          type: object
          properties:
            spec:
              type: object
              properties:
                engine:
                  type: string
                  enum: [postgresql, mysql, mongodb]
                version:
                  type: string
                replicas:
                  type: integer
                  minimum: 1
                storage:
                  type: object
                  properties:
                    size:
                      type: string
                    storageClass:
                      type: string
            status:
              type: object
              properties:
                phase:
                  type: string
                endpoint:
                  type: string

After applying this CRD, you can create Database objects just like built-in resources:

kubectl apply -f database-crd.yaml
kubectl get databases

CRD Versioning

Kubernetes supports multiple versions of a CRD simultaneously. The storage flag indicates which version persists to etcd. This enables zero-downtime migrations when you need to change your schema.

versions:
  - name: v1
    served: true
    storage: true
  - name: v1beta1
    served: true
    storage: false

Clients can request specific versions via the Accept header. This gives you flexibility during rolling upgrades.

Custom Controllers

Controllers are control loops that watch resources and take action to achieve desired state. The Kubernetes ecosystem is built on this pattern, and you can extend it with custom controllers.

The Controller Pattern

A controller follows a reconcile loop:

Watch for changes to resources
Fetch current state
Compare current state with desired state
Take action to reconcile the difference
Update status
Repeat

The reconcile loop is simple in concept, but that simplicity is the point. Kubernetes itself is built entirely on this pattern. The Deployment controller watches for changes to Deployment objects, compares the current state (how many replicas are actually running) against the desired state (how many should be running), and takes action to make reality match. The ReplicaSet controller does the same for pods. The Job controller for, well, jobs. Every controller in Kubernetes follows this same structure.

What makes controllers powerful is that they run continuously. The loop does not execute once and exit; it runs as a background process, watching the API server indefinitely. When something changes, the controller wakes up, evaluates the situation, and acts. When nothing changes, it sits idle until the next event. This event-driven model keeps resource usage low while ensuring the cluster is always moving toward the desired state.

The status update step is where many custom controllers fall short. Writing to .status is what separates a working controller from a useful one. Without status updates, users cannot tell whether their custom resource is healthy, what phase it is in, or what error occurred. The Deployment controller writes status conditions that appear in kubectl get deployments output. Your custom controller should do the same. Status is not optional bookkeeping; it is the primary interface for users to understand what the controller is doing.

Exponential backoff is the other common failure point. When reconciliation fails, a naive controller immediately retries, hammering the API server on every failure. If your PostgreSQL database is down, you do not want your controller spamming the API server with failed reconciliation attempts every second. Instead, track the error and double the wait time between retries:1 second, 2 seconds, 4 seconds, 8 seconds. This prevents cascading failures and gives dependent systems time to recover.

Writing a Basic Controller

The controller-runtime library simplifies controller development:

package controller

import (
    "context"
    "fmt"

    appsv1 "k8s.io/api/apps/v1"
    corev1 "k8s.io/api/core/v1"
    "k8s.io/apimachinery/pkg/runtime"
    "sigs.k8s.io/controller-runtime/pkg/client"
    "sigs.k8s.io/controller-runtime/pkg/controller"
    "sigs.k8s.io/controller-runtime/pkg/handler"
    "sigs.k8s.io/controller-runtime/pkg/manager"
    "sigs.k8s.io/controller-runtime/pkg/reconcile"
    "sigs.k8s.io/controller-runtime/pkg/source"
)

type DatabaseReconciler struct {
    client.Client
    Scheme *runtime.Scheme
}

func (r *DatabaseReconciler) Reconcile(ctx context.Context, req reconcile.Request) (reconcile.Result, error) {
    log := fmt.Sprintf("Reconciling Database %s/%s", req.Namespace, req.Name)

    // Fetch the Database instance
    db := &examplev1.Database{}
    err := r.Get(ctx, req.NamespacedName, db)
    if err != nil {
        return reconcile.Result{}, client.IgnoreNotFound(err)
    }

    // Create or update the StatefulSet
    statefulSet := r.buildStatefulSet(db)
    err = r.createOrUpdate(ctx, statefulSet)
    if err != nil {
        return reconcile.Result{}, err
    }

    // Update status
    db.Status.Phase = "Running"
    db.Status.Endpoint = fmt.Sprintf("%s.%s.svc.cluster.local", db.Name, db.Namespace)
    r.Status().Update(ctx, db)

    return reconcile.Result{Requeue: true}, nil
}

Controllers run as part of a manager that handles caching, client connections, and leader election. This makes them robust in production environments with multiple replicas.

Operators: Domain-Specific Automation

Operators are custom controllers with domain-specific knowledge baked in. They encode operational expertise into software that handles complex, stateful applications.

The key difference from generic controllers is that operators understand the application they manage. They know how to handle backups, upgrades, failover, and other operational tasks.

Building an Operator with Operator SDK

Operator SDK provides scaffolding and best practices for building operators:

# Install operator-sdk
brew install operator-sdk

# Create a new operator
operator-sdk init --domain example.com --repo github.com/example/database-operator

# Create the API and controller
operator-sdk create api --group database --version v1 --kind Database --resource --controller

The SDK handles the boilerplate that every operator needs: the manager that coordinates controllers, the client that reads and writes Kubernetes resources, the scheme that knows how to serialize your custom types, and the loggers that tie into the standard Kubernetes logging infrastructure. Without the SDK, you would write all of this yourself before writing a single line of reconciliation logic.

The operator-sdk init command scaffolds a Go module with the standard layout. You get a main.go that creates a manager, a controllers/ directory where your reconciliation logic lives, and a Makefile with targets for building, testing, and deploying. The operator-sdk create api command generates the CRD manifests and the boilerplate for your custom resource types. You fill in the Spec and Status fields, and the code generator produces the deep copy methods, client interfaces, and informer code that the SDK needs.

The key distinction between a controller and an operator is not in the code structure but in the intent. A controller manages a built-in Kubernetes resource or a CRD with standard reconciliation logic. An operator manages an application with domain-specific knowledge baked in. If your reconciliation logic knows how to handle backups, upgrades, and failover for a specific database, you have an operator. If it just creates and updates standard Kubernetes resources, you have a controller. The SDK does not care about the distinction; it generates the same scaffolding for both.

Defining the Operator API

# api/v1/database_types.go
package v1

type DatabaseSpec struct {
Engine       string            `json:"engine,omitempty"`
Version      string            `json:"version,omitempty"`
Replicas     int32             `json:"replicas,omitempty"`
Storage      StorageSpec       `json:"storage,omitempty"`
BackupConfig *BackupConfigSpec `json:"backupConfig,omitempty"`
}

type StorageSpec struct {
Size         string `json:"size"`
StorageClass string `json:"storageClass,omitempty"`
}

type BackupConfigSpec struct {
Schedule string `json:"schedule"`
Bucket   string `json:"bucket"`
}

type DatabaseStatus struct {
Phase    string `json:"phase,omitempty"`
Endpoint string `json:"endpoint,omitempty"`
}

Implementing Reconcile Logic

func (r *DatabaseReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
    log := r.Log.WithValues("database", req.NamespacedName)

    // Fetch the Database instance
    db := &databasev1.Database{}
    err := r.Get(ctx, req.NamespacedName, db)
    if err != nil {
        return ctrl.Result{}, client.IgnoreNotFound(err)
    }

    // Create or update StatefulSet
    statefulSet, err := r.desiredStatefulSet(db)
    if err != nil {
        return ctrl.Result{}, err
    }
    err = r.createOrUpdate(ctx, statefulSet)
    if err != nil {
        return ctrl.Result{}, err
    }

    // Handle backups if configured
    if db.Spec.BackupConfig != nil {
        result, err := r.reconcileBackups(ctx, db)
        if err != nil {
            return result, err
        }
    }

    // Update status
    db.Status.Phase = "Running"
    db.Status.Endpoint = fmt.Sprintf("%s.%s.svc.cluster.local", db.Name, db.Namespace)
    r.Status().Update(ctx, db)

    return ctrl.Result{RequeueAfter: 30 * time.Second}, nil
}

Practical Operator Examples

Operators shine for managing stateful applications:

Prometheus Operator manages Prometheus deployments and monitoring configurations
Velero Operator handles backup and restore of Kubernetes resources and volumes
Cert Manager automates certificate management with Let’s Encrypt

When building your own operator, ask yourself whether the application has complex lifecycle requirements that generic Kubernetes resources cannot handle.

The Prometheus Operator is worth studying as a reference implementation. It defines a Prometheus CRD and a ServiceMonitor CRD. Rather than manually configuring a Prometheus instance with scrape configs, you create a ServiceMonitor that selects pods by label and tells the Prometheus Operator which endpoints to scrape. The operator watches for ServiceMonitor objects and automatically updates the Prometheus configuration. This is the operator pattern at its best: domain logic encoded in a controller, users express intent through custom resources, and the operator handles the mechanics.

Velero takes a different approach. It manages backup and restore through a combination of CRDs and a controller that orchestrates volume snapshots and API server exports. The Backup CRD specifies what to back up and when to run. The controller reads the specification, coordinates with the volume provider to snapshot persistent volumes, exports the Kubernetes API server state, and stores both in object storage. Restore works the same way in reverse. What makes Velero useful is that it knows about namespaces, persistent volumes, and the relationships between resources in a way that a generic controller cannot.

Cert-manager is the most widely deployed operator in Kubernetes. It watches for Certificate resources and automatically provisions certificates from Let’s Encrypt or any ACME-compatible CA. The operator handles the ACME challenge flow, stores the certificate in a Secret, and renews before expiry. Without an operator, you would run certbot as a CronJob and manage certificate rotation yourself. Cert-manager makes this invisible infrastructure.

Role-Based Access Control

RBAC restricts who can perform operations in the cluster. It uses four key concepts: subjects (who), verbs (what actions), resources (what objects), and namespaces (where).

Roles and RoleBindings

Role and RoleBinding are namespace-scoped:

# Role definition
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: production
  name: deployment-manager
rules:
  - apiGroups: ["apps"]
    resources: ["deployments"]
    verbs: ["get", "list", "watch", "update", "patch"]
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get", "list"]

# RoleBinding - grants Role to subjects
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: deployment-manager-binding
  namespace: production
subjects:
  - kind: User
    name: alice@example.com
    apiGroup: rbac.authorization.k8s.io
  - kind: Group
    name: developers
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: deployment-manager
  apiGroup: rbac.authorization.k8s.io

ClusterRoles and ClusterRoleBindings

ClusterRoles and ClusterRoleBindings work cluster-wide:

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: node-viewer
rules:
  - apiGroups: [""]
    resources: ["nodes"]
    verbs: ["get", "list", "watch"]
  - nonResourceURLs: ["/metrics"]
    verbs: ["get"]

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: node-viewer-binding
subjects:
  - kind: ServiceAccount
    name: metrics-collector
    namespace: monitoring
roleRef:
  kind: ClusterRole
  name: node-viewer
  apiGroup: rbac.authorization.k8s.io

ServiceAccount Usage

Pods use ServiceAccounts to authenticate to the API server:

apiVersion: v1
kind: ServiceAccount
metadata:
  name: my-app-sa
  namespace: production
---
apiVersion: v1
kind: Pod
metadata:
  name: my-app
  namespace: production
spec:
  serviceAccountName: my-app-sa
  containers:
    - name: app
      image: my-app:latest

Your application retrieves the token mounted at /var/run/secrets/kubernetes.io/serviceaccount/ and uses it to authenticate API calls.

Network Policies

Network policies restrict traffic between pods. By default, all pods can reach all other pods and services in a cluster. Network policies let you implement defense in depth.

Basic Network Policy

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: api-isolation
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: api
  policyTypes:
    - Ingress
    - Egress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: frontend
      ports:
        - protocol: TCP
          port: 8080
    - from:
        - namespaceSelector:
            matchLabels:
              name: monitoring
      ports:
        - protocol: TCP
          port: 8080
  egress:
    - to:
        - podSelector:
            matchLabels:
              app: database
      ports:
        - protocol: TCP
          port: 5432
    - to:
        - namespaceSelector: {}
      ports:
        - protocol: TCP
          port: 53
        - protocol: UDP
          port: 53

This policy restricts the API pod to receive traffic only from frontend pods and monitoring namespaces, and allows it to send traffic only to the database and DNS.

DNS Egress

Almost every pod needs DNS resolution. Kubernetes relies on CoreDNS (or kube-dns in older clusters) running in the kube-system namespace to resolve service names, pod IPs, and external domains. When you apply a network policy with egress rules, DNS traffic is subject to those restrictions just like any other network traffic.

DNS uses UDP on port 53 for standard queries. TCP on port 53 takes over when responses exceed 512 bytes, which happens with DNSSEC, large DNS records, or zone transfers. If your egress policy allows UDP 53 but blocks TCP 53, your applications work most of the time but fail intermittently when DNS responses grow large. That is why you must allow both protocols in every egress policy.

To verify DNS works through your network policy, deploy a temporary busybox pod in the same namespace and run nslookup kubernetes.default.svc.cluster.local or dig +short google.com. If the command hangs or times out, your egress policy is blocking DNS traffic. The fix usually involves adding a rule for port 53 with the right namespaceSelector or just allowing DNS to the cluster IP of the DNS service.

One more thing: CoreDNS runs in kube-system, not in your application namespace. A podSelector in your egress rule will not match it. You have two options: use a namespaceSelector targeting kube-system, or use a broader egress rule that allows traffic on port 53 to any destination. Pick the one that matches your security posture.

Storage Classes and Persistent Volumes

Dynamic provisioning of persistent storage requires StorageClasses. They define how storage is provisioned when a PersistentVolumeClaim requests it.

Defining a StorageClass

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast-storage
provisioner: kubernetes.io/gce-pd
parameters:
  type: pd-ssd
  replication-type: regional-pd
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true

The WaitForFirstConsumer binding mode delays volume binding until a pod actually uses the claim. This allows the scheduler to co-locate volumes with pods in the same zone.

Using PersistentVolumes in Pods

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: database-storage
  namespace: production
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: fast-storage
  resources:
    requests:
      storage: 100Gi
---
apiVersion: v1
kind: Pod
metadata:
  name: database
  namespace: production
spec:
  containers:
    - name: db
      image: postgres:15-alpine
      volumeMounts:
        - name: data
          mountPath: /var/lib/postgresql/data
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: database-storage

Resource Quotas and Limits

Namespaces let you partition the cluster, but ResourceQuotas enforce resource limits within namespaces.

Setting Quotas

apiVersion: v1
kind: ResourceQuota
metadata:
  name: production-quota
  namespace: production
spec:
  hard:
    requests.cpu: "20"
    requests.memory: 100Gi
    limits.cpu: "40"
    limits.memory: 200Gi
    pods: "50"
    services: "10"
    persistentvolumeclaims: "20"

apiVersion: v1
kind: LimitRange
metadata:
  name: production-limits
  namespace: production
spec:
  limits:
    - max:
        cpu: "8"
        memory: 32Gi
      min:
        cpu: 100m
        memory: 128Mi
      default:
        cpu: 500m
        memory: 1Gi
      defaultRequest:
        cpu: 200m
        memory: 256Mi
      type: Container

The LimitRange sets default requests and limits for containers that do not specify them, while ResourceQuota caps total resource usage per namespace.

Pod Disruption Budgets

When performing cluster maintenance, PDBs ensure minimum availability for your applications.

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-pdb
  namespace: production
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: api

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: frontend-pdb
  namespace: production
spec:
  maxUnavailable: 25%
  selector:
    matchLabels:
      app: frontend

The first PDB ensures at least 2 API pods are running during disruptions. The second allows up to 25% of frontend pods to be unavailable simultaneously.

Advanced Scheduling

Node Affinity and Anti-Affinity

apiVersion: v1
kind: Pod
metadata:
  name: database
  namespace: production
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: disktype
                operator: In
                values:
                  - ssd
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100
          podAffinityTerm:
            labelSelector:
              matchExpressions:
                - key: app
                  operator: In
                  values:
                    - database
            topologyKey: topology.kubernetes.io/zone

This schedules database pods on nodes with SSD storage and tries to spread them across availability zones.

Taints and Tolerations

Taints repel pods from nodes unless the pods have matching tolerations:

# Taint a node to repel non-critical workloads
kubectl taint nodes node1 dedicated=ml-workloads:NoSchedule

# Pod that tolerates the taint
apiVersion: v1
kind: Pod
metadata:
  name: ml-job
spec:
  tolerations:
    - key: dedicated
      operator: Equal
      value: ml-workloads
      effect: NoSchedule
  containers:
    - name: ml
      image: ml-training:latest

This pattern is useful for reserving nodes for specific workloads like ML training or stateful services.

Cluster Autoscaling

The cluster autoscaler adjusts the number of nodes based on pending pods and resource utilization. It talks to the cloud provider to add or remove nodes as needed.

Configuring the Cluster Autoscaler

apiVersion: autoscaling.openshift.io/v1
kind: ClusterAutoscaler
metadata:
  name: default
spec:
  resourceLimits:
    maxNodesTotal: 50
    cores:
      min: 8
      max: 128
    memory:
      min: 16Gi
      max: 256Gi
  scaleDown:
    enabled: true
    delayAfterAdd: 10m
    delayAfterDelete: 5m
    delayAfterFailure: 3m
    unneededTime: 10m

Node Pool Management

Node pools let you group worker nodes by instance type, availability zone, or workload profile. Instead of running identical machines everywhere, you create separate pools for different categories — standard services on general-purpose nodes, caches on memory-optimized nodes, ML training on GPU nodes — and let the cluster autoscaler scale each pool on its own. Bursty workloads get a pool that can scale to zero when idle. Baseline services stay on a cheaper steady pool.

The MachinePool CR (used by clusters that support the Machine API, including OpenShift and some managed Kubernetes offerings) defines each pool. The minReplicas and maxReplicas fields set the autoscaling range for that pool, separate from the cluster autoscaler’s global limits. The instance type controls what compute lands on those nodes, and the availability zone setting controls where those nodes run.

apiVersion: machine.openshift.io/v1beta1
kind: MachinePool
metadata:
  name: worker-pool
  namespace: openshift-machine-api
spec:
  minReplicas: 2
  maxReplicas: 10
  platform:
    aws:
      instanceType: m5.xlarge

In the example YAML, the pool targets m5.xlarge instances on AWS with 2 nodes minimum and 10 maximum. The cluster autoscaler adjusts the replica count within that range based on pending pods. When it sees unschedulable pods that could fit on an m5.xlarge, it provisions nodes up to the max. When nodes sit idle past the unneededTime threshold, the autoscaler terminates them down to the min.

Mixing node pools is common in production. A web API tier might run on m5.large nodes in a single AZ to keep costs down, while an analytics workload runs on r5.xlarge nodes spread across three AZs for fault tolerance. Pod affinity and anti-affinity rules steer workloads to the right pool, and taints on the pool nodes keep out anything without a matching toleration. Without node pools, you end up sizing a single pool for your most demanding workload and paying for excess capacity everywhere else.

Vertical Pod Autoscaler

VPA recommends resource requests based on actual usage patterns:

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: api-vpa
  namespace: production
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  updatePolicy:
    updateMode: "Auto"
  resourcePolicy:
    containerPolicies:
      - containerName: api
        minAllowed:
          cpu: 100m
          memory: 128Mi
        maxAllowed:
          cpu: 4
          memory: 8Gi

VPA can also automatically update resource requests by evicting and recreating pods. Combine it with the cluster autoscaler to handle both pod and node-level scaling.

Service Mesh Patterns

Service meshes like Istio, Linkerd, and Cilium Service Mesh extend Kubernetes networking with mTLS, traffic management, and observability without modifying application code.

Installing Istio

Getting Istio onto a cluster takes longer than running the install command. The profile you choose determines what gets deployed, and the defaults pull in more than most clusters need at first. The istioctl install command gives you a few built-in profiles: default is the starting point most teams use, minimal strips out everything except the control plane with no ingress gateway, remote is for multicluster setups where one cluster holds the control plane, and demo is for trying things out with everything enabled.

For a production cluster, start with default and disable the parts you do not need. Most teams do not need the tracing and visualization stack running continuously. Strip it out with --set values.meshConfig.enableTracing=false or use a custom IstioOperator CR that specifies only the components you actually use. The control plane (istiod) is the only mandatory piece. Everything else — ingress gateways, egress gateways, telemetry addons — is optional.

The sidecar injection decision matters more than the install profile. You can label a namespace with istio-injection=enabled and every pod in that namespace gets an Envoy sidecar injected automatically. Or you can use istioctl kube-inject to inject per-pod without a namespace label. The namespace label is simpler for most teams, but it means every workload in that namespace pays the resource cost of the sidecar. If you have cluster-critical workloads that cannot tolerate the additional latency or memory overhead, keep them in a namespace without injection and manage their traffic through gateway-level policies instead.

After installation, verify the control plane is healthy with istioctl verify-install. Check that the istiod pod is running and that the Envoy sidecars are reachable. The first time you deploy a workload with injection enabled, expect a short delay as the sidecar bootstraps and the mTLS handshake completes. After that, the sidecar is transparent to your application code — your services still communicate over plain HTTP or gRPC, and Istio handles the encryption underneath.

# Install Istio with minimal profile
istioctl install --set profile=default

# Disable unnecessary telemetry for production
istioctl install --set values.meshConfig.enableTracing=false --set values.pilot.env.PILOT_ENABLE_CONFIG_SOURCE=false

# Enable automatic sidecar injection for a namespace
kubectl label namespace production istio-injection=enabled

# Verify installation
istioctl verify-install

# Check control plane status
kubectl get pods -n istio-system

mTLS Between Services

Istio enforces mutual TLS automatically. The PeerAuthentication resource with mode: STRICT tells Istio to reject any plaintext traffic between workloads in the mesh. Every connection gets a client certificate from Istio’s CA, and both sides verify the certificate chain. You do not need to manage certificates yourself — Istio provisions and rotates them.

The STRICT mode is what you want for production. There is also a PERMISSIVE mode that accepts both mTLS and plaintext, which is useful during migration from a non-meshed environment but is a security liability once everything is in the mesh. Setting PERMISSIVE permanently defeats the point of mTLS — a compromised pod could fall back to plaintext and evade policy enforcement.

Once PeerAuthentication is in STRICT mode, every pod in the mesh gets an Envoy sidecar that intercepts outbound and inbound traffic. The sidecar validates the client certificate on the way in and presents the pod’s certificate on the way out. Your application code sees plain TCP or HTTP — Istio handles the TLS underneath. This means you can enable mTLS without changing a single line of application code.

Verify mTLS is working by checking the Envoy stats on any pod. The istio_requests_total{response_code="unknown"} metric with a non-zero value indicates mTLS handshake failures. You can also run istioctl proxy-config secret to see the certificates loaded by a specific pod’s sidecar, including the SPIFFE ID that identifies the workload.

# Check mTLS status for a pod
istioctl proxy-config secret<pod-name> -n production

# Verify STRICT mode is enforced cluster-wide
istioctl get peerauthentication -n production

Traffic Splitting

Weight-based traffic splitting for canary deployments:

apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: api-canary
  namespace: production
spec:
  hosts:
    - api
  http:
    - route:
        - destination:
            host: api
            subset: stable
          weight: 90
        - destination:
            host: api
            subset: canary
          weight: 10
---
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: api
  namespace: production
spec:
  host: api
  subsets:
    - name: stable
      labels:
        version: stable
    - name: canary
      labels:
        version: canary

Circuit Breaking

Prevent cascading failures with circuit breakers:

apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: api
  namespace: production
spec:
  host: api
  trafficPolicy:
    outlierDetection:
      consecutive5xxErrors: 5
      interval: 30s
      baseEjectionTime: 30s
      maxEjectionPercent: 50

Helm vs Kustomize

Helm and Kustomize both manage Kubernetes configurations, but they take different approaches. Here is when to use each.

Helm

Helm packages Kubernetes manifests as versioned, distributable charts. The templating model lets one chart serve multiple environments by overriding default values at deploy time. A chart bundles the Kubernetes YAML, a values.yaml with defaults, and a Chart.yaml with metadata. You override those defaults with environment-specific values files or --set flags when installing.

# Create a new chart
helm create my-app

# Package and install
helm package my-app
helm install my-app ./my-app-0.1.0.tgz

# Render templates without installing
helm template my-release my-chart --set replicaCount=3

Helm charts have values.yaml with defaults, overridden by user values:

# values.yaml
replicaCount: 2
image:
  repository: my-app
  tag: latest

Helm works well when you are deploying third-party software (Databases, nginx, Prometheus) where community charts already exist, or when you need the release management features: versioning, rollbacks, and hooks. The tradeoff is that Go templating in YAML can feel clunky for highly dynamic configurations, and Helm charts carry some operational overhead that is not justified for a single service in a single environment.

Kustomize

Kustomize takes a different approach: no templates, just plain YAML with overlay patches. You define a base configuration and overlay environment-specific changes on top. The base has your standard Deployment and Service YAML. A production overlay adds a namePrefix, bumps replica counts, and patches resource limits. The resulting manifests are plain Kubernetes YAML that you can apply directly with kubectl.

# kustomization.yaml (base)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - deployment.yaml
  - service.yaml
commonLabels:
  app: my-app

# kustomization.yaml (production overlay)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
bases:
  - ../base
namePrefix: prod-
commonLabels:
  env: production
replicas:
  - name: my-app
    count: 5

Kustomize fits teams that want GitOps-friendly configuration where environment differences are explicit patches rather than interpolated variables. It integrates natively with kubectl apply -k and with GitOps tools like ArgoCD. The limitation is that Kustomize has no release management built in; it does not track versions or support rollbacks the way Helm does.

When to Use Each

Helm and Kustomize solve different problems, and the choice depends on where your configuration complexity lives. If you are deploying third-party software with lots of parameters, Helm is the practical choice. Bitnami maintains charts for almost every database, message queue, and middleware you can name. You get a values file with sensible defaults and override only what your environment needs. The chart IS the package; you are not building configuration from scratch.

Kustomize shines when your own applications have environment-specific differences. The overlay model fits naturally with how Git works: base configuration in one directory, environment-specific overlays in subdirectories. If staging differs from production in replica count, resource limits, and a few config values, Kustomize handles this cleanly without forcing you to template every line. The diff is explicit and reviewable.

The Helm-plus-Kustomize combination is less common but useful for organizations with strong GitOps practices. Use Kustomize to manage your base infrastructure and application defaults, then wrap specific releases with Helm when you need the release management features. This keeps the Kustomize layer for environment differences and uses Helm for the application package itself.

Scenario	Tool	Reason
Third-party software (Prometheus, nginx)	Helm	Charts available, templating fits package model
Custom app with environment differences	Kustomize	Overlay approach handles variants cleanly
Complex multi-tenant configs	Helm + Kustomize	Kustomize for base, Helm for releases
GitOps with ArgoCD	Either	Both integrate well with GitOps workflows

When NOT to Use

Helm and Kustomize both add overhead that only pays off when your configuration has enough complexity to justify it. If you are running a single service in a single environment with no reuse across teams, plain YAML with kubectl apply is the right choice. The moment you introduce a tool, you introduce maintenance burden: the tool version, the chart or overlay dependencies, the workflow changes for your team. That burden needs to be justified by real complexity.

The learning Kubernetes caveat is worth taking seriously. Teams that have not yet internalized how Deployments, Services, and ConfigMaps work are better off writing those resources by hand first. Tools abstract the primitives away. If you do not understand what a Deployment actually does under the hood, you will not understand why your Helm chart behaves unexpectedly when things go wrong. Kubernetes has a steep enough learning curve without adding Helm templating on top.

The single-environment rule is practical. If your service runs only in production and you never need to variant-test configuration across environments, you do not need either tool. Commit your YAML to Git, apply it with kubectl, and move on. The moment you find yourself copying YAML files between environments and changing a few values by hand, that is when Kustomize becomes worth the investment.

Scenario	Tool	Reason
Simple configs with minimal variation	Neither	Use plain YAML, keep it simple
Single environment, no reuse	Neither	Overhead not worth it
Learning Kubernetes	Neither	Focus on concepts, not tooling

StatefulSets

StatefulSets manage stateful applications requiring stable network identities and persistent storage.

Defining a StatefulSet

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: kafka
  namespace: production
spec:
  serviceName: kafka
  replicas: 3
  selector:
    matchLabels:
      app: kafka
  template:
    metadata:
      labels:
        app: kafka
    spec:
      terminationGracePeriodSeconds: 30
      containers:
        - name: kafka
          image: confluentinc/cp-kafka:7.4.0
          ports:
            - containerPort: 9093
              name: kafka
          volumeMounts:
            - name: data
              mountPath: /var/lib/kafka/data
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: fast-storage
        resources:
          requests:
            storage: 50Gi

Scaling Considerations

StatefulSet scaling does not work like Deployment scaling, and treating it the same way ends badly. When you scale up a StatefulSet, Kubernetes creates pods in strict ordinal order: kafka-0, kafka-1, kafka-2. Each pod must be running and ready before the next one gets created. This is deliberate — it prevents a later broker from trying to join a cluster before the earlier ones are ready. When you scale down, Kubernetes terminates in reverse order, giving the departing pod a window to hand off gracefully.

The application-level consequences are where things get messy. Kafka brokers that are still syncing consume resources on existing brokers without handling writes. Scale too fast and you saturate replication bandwidth, causing consumer lag to spike. Most operators working with Kafka will tell you to wait for a broker to be fully in-sync (ISR) before adding another. The StatefulSet controller manages pod ordering, but nobody is managing the Kafka-level sync — that is on your runbooks.

PostgreSQL replicas have similar problems at a different scale. A new replica streams the full database before it can take queries. Large databases mean hours of catch-up. Scale by more than one at a time and the primary gets hammered by replication connections, degrading write performance for everyone. Some operators use minSyncAhead and maxSyncAhead to throttle the initial catch-up, then lift the limits once the replica is close. Skip this and a rapid scale-out can crater a production database.

Storage adds its own delays. Each new pod needs a PersistentVolumeClaim from the VolumeClaimTemplate. On cloud providers, volume provisioning takes minutes. With WaitForFirstConsumer binding, the volume is not even created until the pod gets scheduled, so you get provisioning time plus scheduling time stacked together. For stateful workloads that need to scale quickly, pre-provisioning volumes or using fast storage classes (SSDs with dynamic provisioners) is not a luxury.

Scale-in has its own failure mode. Kubernetes will happily terminate a pod without checking whether its data has been replicated elsewhere. Scale a 3-replica Kafka cluster to 2, and if the departing broker has the only copy of some partition’s data, you lose it. Configure partitionPlacementPolicy or equivalent to prevent scale-in when it would violate your replication factor. PodDisruptionBudgets with minAvailable keep Kubernetes from evicting too many pods during node drains, but they do not stop you from manually scaling to a value that violates your data redundancy requirements.

Init Containers for Bootstrap

spec:
  initContainers:
    - name: init-config
      image: busybox:1.36
      command:
        - sh
        - -c
        - |
          echo "Waiting for storage to be provisioned..."
          while [ ! -d /var/lib/kafka/data ]; do sleep 5; done
      volumeMounts:
        - name: data
          mountPath: /var/lib/kafka/data

When to Use / When Not to Use

Understanding when these advanced patterns apply helps you avoid over-engineering.

Custom Controllers and Operators

Use when:

You manage stateful applications with complex lifecycle requirements
You need to encode domain-specific operational knowledge into automated workflows
You want to reduce manual intervention for recurring operational tasks
You are building a platform that other teams will consume

When not to use:

Your application is stateless and scales horizontally without special handling
You only need basic Kubernetes primitives like Deployments and Services
The operational complexity of building an operator exceeds the manual effort it would save
You are in early stages and requirements are still changing rapidly

Decision Tree: Controllers vs Operators vs Native Resources

Use this flowchart to determine which approach fits your use case:

flowchart TD
    A[What are you trying to manage?] --> B{Is it a built-in K8s resource?}
    B -->|Yes| C[Use native resource<br/>Deployment, StatefulSet, Service, etc.]
    B -->|No| D{Does the app have complex lifecycle?}
    D -->|No - stateless, simple scale| E[Use native resources + Helm/Kustomize]
    D -->|Yes - backups, upgrades, failover| F{Is it a well-known off-the-shelf app?}
    F -->|Yes - Prometheus, CertManager, Velero| G[Install existing Operator<br/>via Helm or OperatorHub]
    F -->|No - custom domain app| H{Can existing controllers handle it?}
    H -->|Yes - CRD + standard reconciliation| I[Write Custom Controller<br/>with controller-runtime]
    H -->|No - app-specific domain logic| J[Build an Operator<br/>with Operator SDK]
    I --> K[Does it need Helm-style packaging?]
    J --> K
    K -->|Yes| L[Package as Operator with OLM]
    K -->|No| M[Deploy controller directly<br/>via YAML]

Quick reference:

Approach	Complexity	Best For
Native resources	Lowest	Deployments, Services, ConfigMaps, vanilla stateful apps
Helm/Kustomize	Low	Package and configure standard apps, no custom logic
Custom Controller	Medium	CRDs with standard reconcile loops, no app-specific domain
Existing Operator	Low-Medium	Prometheus, cert-manager, Velero, databases, message queues
Custom Operator	Highest	Complex domain logic, specialized stateful apps, internal platforms

RBAC and Network Policies

Use when:

Multiple teams share the same cluster
You need to enforce least-privilege access
Security compliance requires network segmentation
You want defense in depth beyond pod-level security

When not to use:

Single-tenant clusters with trusted users
Development or test environments without sensitive workloads
Network policies are handled by a higher-level service mesh

Storage Classes and Persistent Volumes Details

Use when:

Stateful workloads require persistent storage
You need dynamic provisioning based on application needs
You want to separate storage tiers (SSD vs HDD)

When not to use:

Stateless applications that store no persistent data
Caches or temporary data that can be lost without consequences

Production Failure Scenarios

Understanding real failure modes helps you prepare better.

Failure	Impact	Mitigation
etcd quorum loss	Cluster becomes read-only or unavailable	Maintain at least 3 etcd nodes, regular backups, separate etcd disks
API server overload	All cluster operations fail	Implement proper rate limiting, optimize client code, scale API server
Kubelet failure	All pods on node become unhealthy	Use pod disruption budgets, set pod priority classes, monitor node health
Storage class deletion with active PVCs	Pods cannot start, data loss potential	Use `allowVolumeExpansion: false` initially, never delete active storage classes
RBAC misconfiguration	Users cannot perform needed operations	Use `kubectl auth can-i` for verification, audit role bindings regularly
Network policy misconfiguration	Application pods cannot communicate	Test in staging first, use policy order carefully, always allow DNS egress
Controller reconciliation loops	High API server load, degraded performance	Implement proper reconciliation with exponential backoff
Pod budget too restrictive	Cluster upgrades blocked	Set realistic minAvailable values, test disruption scenarios

Observability Checklist

Comprehensive monitoring helps you catch issues before they become outages.

Metrics to Collect

graph LR
    A[Control Plane Metrics] --> B[API Server]
    A --> C[etcd]
    A --> D[Controller Manager]
    A --> E[Scheduler]
    F[Node Metrics] --> G[Kubelet]
    F --> H[Container Runtime]
    F --> I[Kube Proxy]
    J[Workload Metrics] --> K[Pod CPU Memory]
    J --> L[Deployment Replicas]
    J --> M[PV Usage]

Control plane metrics:

API server request latency and error rates
etcd disk I/O and WAL fsync latency
Controller reconciliation duration and error counts
Scheduler pod placement latency

Prometheus queries for control plane health:

# API server request error rate (5xx errors)
sum(rate(apiserver_request_total{job="apiserver",code=~"5.."}[5m])) / sum(rate(apiserver_request_total{job="apiserver"}[5m]))

# etcd WAL fsync latency (p99)
histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket{job="etcd"}[5m]))

# Controller reconciliation duration (p99)
histogram_quantile(0.99, rate(workqueue_work_duration_seconds_bucket{job="kube-controller-manager"}[5m]))

# Scheduler pod placement latency (p99)
histogram_quantile(0.99, rate(scheduler_pod_scheduling_duration_seconds_bucket{job="kube-scheduler"}[5m]))

# API server request latency by verb (p99)
histogram_quantile(0.99, rate(apiserver_request_duration_seconds_bucket{job="apiserver"}[5m])) by (verb)

# etcd compaction and backup duration
rate(etcd_debugging_compaction_duration_seconds_sum{job="etcd"}[5m])

# Leader election rate (high rate means instability)
rate(etcd_server_leader_changes_seen_total{job="etcd"}[5m])

Node metrics:

Kubelet working set and eviction thresholds
Container runtime CPU and memory usage
Network bytes sent/received per pod

Application metrics:

Pod CPU and memory actual usage vs requests
Deployment replica count vs desired
Persistent volume usage percentage
Custom resource status conditions

Logs to Capture

graph TD
    A[Application Logs] --> D[Stdout stderr]
    B[System Logs] --> E[Kubelet]
    B --> F[Container Runtime]
    C[Kubernetes Logs] --> G[API Server]
    C --> H[Controller Manager]
    C --> I[Scheduler]
    J[Audit Logs] --> K[API Requests by User]
    J --> L[Policy Violations]

Aggregate all logs to a central location (Loki, ELK, CloudWatch)
Include Kubernetes metadata: namespace, pod name, container name
Capture Kubernetes events for resource lifecycle changes
Store audit logs for compliance and security investigations

Alerts to Configure

Critical (immediate response required):

API server unavailable for more than 1 minute
etcd high latency or leadership elections
Node not ready for more than 2 minutes
Pod evictions occurring due to resource pressure

Warning (investigate soon):

Pod restart loop (CrashLoopBackOff)
Deployment replica count below desired
Persistent volume usage above 80%
Certificate expiration within 30 days

Security Checklist

RBAC Security

RBAC is the access control layer for your cluster, and misconfigured RBAC is one of the most common causes of both security incidents and operational headaches. The principle is simple: grant only the permissions actually needed, to exactly the principals that need them, for exactly the resources they touch. In practice, teams discover that ServiceAccount X can list secrets in namespace Y because someone copy-pasted a RoleBinding during an incident and never cleaned it up.

The quarterly review cadence exists because RBAC drift is silent. A new application gets deployed with a permissive RoleBinding. A new team member gets added to a group that used to have limited access. Over time, permissions accumulate. Set a calendar reminder to run kubectl get rolebindings,clusterrolebindings -A | grep -v system: and look for anything that looks broad. Pay particular attention to ClusterRoleBindings: they apply cluster-wide, and cluster-admin is exactly as dangerous as it sounds.

ServiceAccount tokens deserve more attention than they typically get. Every pod gets a ServiceAccount token mounted by default, even if the pod does not need API access. Those tokens do not expire unless you configure token volume projection or enable a service account token projected volume. A pod that gets compromised has access to whatever the ServiceAccount can do. Use automountServiceAccountToken: false in your pod specs for any workload that does not need Kubernetes API access. Use dedicated ServiceAccounts per application, not the default one.

The kubectl auth can-i command is underutilized. Before applying any RoleBinding, test it: kubectl auth can-i get pods --as=system:serviceaccount:production:my-app-sa -n production. This tells you exactly what that ServiceAccount can actually do, which is often different from what you intended when you wrote the RoleBinding. Build this into your deployment checklist.

Rotate ServiceAccount tokens on a schedule. Tokens do not expire by default in Kubernetes, which means a compromised token is valid forever unless you explicitly revoke it. Use short-lived tokens with token rotation built in, or at minimum have a process to rotate tokens periodically and audit their usage.

Network Security

Network policies in Kubernetes are allowlist-based, which catches most people by surprise. Without any NetworkPolicy defined, all pods can reach all other pods. This default-everything-allowed model works for development clusters but is a liability in production. The first NetworkPolicy you apply should be a default-deny: a policy that matches all pods in the namespace and allows nothing. Then you explicitly allow only the traffic each workload actually needs.

DNS egress is the one that trips up almost every team implementing network policies for the first time. Your application needs to resolve service names, external domains, and the Kubernetes API server address. That all goes through DNS, which runs on port 53 UDP and TCP. If your egress policy allows UDP 53 but not TCP 53, your applications will mostly work until a DNS response exceeds 512 bytes, at which point TCP fallback kicks in and your applications start failing silently. Allow both protocols on port 53.

The hardcoded IP problem is more subtle. When your application resolves a service name to a cluster IP, that IP is stable within the cluster but meaningless outside it. Code that logs “connecting to 10.0.0.5” is not debugging-friendly, and code that hardcodes service IPs is fragile because those IPs can change. Use Kubernetes DNS names instead: servicename.namespace.svc.cluster.local. This resolves correctly regardless of the actual cluster IP, and it works the same in every namespace.

Service mesh adds mTLS between services without changing application code, which sounds appealing until you consider the operational overhead. Istio and Linkerd both require their control plane to be running, the sidecar proxies to be injected into every pod, and the resource overhead of the proxies themselves. For most workloads, network policies provide sufficient isolation. If you have strict compliance requirements around mutual TLS or need fine-grained traffic control for canary deployments, the service mesh investment may be worth it.

Ingress traffic controls deserve equal attention, though they get less coverage. By default, any pod can receive traffic from any other pod. The Ingress resource manages incoming traffic from external sources, but within the cluster, you need to explicitly restrict which pods can send traffic to your application. Setting ingress rules for your most sensitive workloads prevents lateral movement if another service is compromised. Use podSelector with specific labels rather than broad namespace selectors to limit traffic to precisely the sources that need access.

Namespace isolation adds another layer. You can combine namespaceSelector with podSelector to allow traffic from an entire namespace rather than individual pods. This is useful for infrastructure components like monitoring or logging agents that need to scrape metrics from every pod in a namespace. The tradeoff is that any pod in that namespace can reach your workload, so restrict namespace selectors to infrastructure namespaces you control. For multi-tenant clusters, consider a separate network policy for each tenant namespace.

Implementing default-deny correctly requires understanding how policy evaluation works. When you apply a default-deny policy to a namespace, it blocks all traffic until you add explicit allow rules. The ordering matters: apply the default-deny first, verify it works (your test pods should lose connectivity), then add allow rules one at a time. If you apply the allow rules before the default-deny, the allows may be evaluated first and your default-deny never takes effect. The exception is when your CNI plugin supports ingress and egress default-deny natively, which some do by default.

ServiceAccount token exposure through network policies is a subtle risk. A compromised pod with a ServiceAccount token can use that token to authenticate with the Kubernetes API server, and network policies alone cannot block this if the API server is reachable. Combine network policies with automountServiceAccountToken: false for pods that do not need API access, and use RBAC to limit what each ServiceAccount can do. This defense in depth approach means even if a pod is compromised, the damage is contained.

The practical ordering for implementing network policies in a new namespace: first apply a default-deny for both ingress and egress. Second, add a rule allowing DNS egress on port 53 TCP and UDP to the cluster DNS service IP. Third, add rules for the specific dependencies your application needs (database on port 5432, cache on port 6379, etc.). Fourth, add ingress rules for the sources that need to reach your pod. Test each addition before moving to the next, and document each rule with comments explaining why it exists.

Pod Security

graph LR
    A[Pod Security] --> B[Run as non-root]
    A --> C[ReadOnly root filesystem]
    A --> D[Drop all capabilities]
    A --> E[No privileged containers]
    A --> F[Resource limits set]

Run containers as non-root user (securityContext.runAsNonRoot: true)
Use read-only root filesystem when possible (securityContext.readOnlyRootFilesystem: true)
Drop all capabilities and add only required ones (securityContext.capabilities.drop)
Set resource requests and limits to prevent resource starvation
Disable host PID and network namespaces (securityContext.hostPID: false, hostNetwork: false)
Use PodSecurityStandards or OPA Gatekeeper for policy enforcement

Secret Management

Kubernetes Secrets are base64-encoded, not encrypted, by default. This is a common misconception that leads to real incidents. When you create a Secret with kubectl create secret generic, the value is just base64-encoded and stored in etcd in plaintext. Anyone with etcd access can read your database passwords, API keys, and TLS certificates. Enabling encryption at rest for etcd is the minimum you should do, and even then, the encryption key itself needs to be managed carefully.

External secrets solutions address the root problem: secrets should not live in Kubernetes at all unless they are pulled from a dedicated secrets manager at runtime. The External Secrets Operator watches for ExternalSecret resources and syncs values from AWS Secrets Manager, GCP Secret Manager, Azure Key Vault, or HashiCorp Vault into standard Kubernetes Secrets. Your application reads a Kubernetes Secret, which is actually backed by the external store. This way, the secret value never lives in etcd, and you get rotation, audit logging, and access control from the secrets manager.

Vault integration through the CSI provider is another pattern worth knowing. Instead of the External Secrets Operator syncing secrets into Kubernetes Secrets, the Vault CSI provider mounts secrets as files directly into pods. The secret contents never appear in Kubernetes API server responses, which means they do not show up in audit logs or kubectl get secrets output. This is a meaningful security improvement for highly sensitive workloads.

Rotation is where most secret management strategies fall apart in practice. Rotating a database password sounds simple: update the secret, restart the application. In reality, you have multiple replicas, rolling restarts, zero-downtime requirements, and possibly dependent services that cache the old credential. Build rotation procedures before you need them, not during an incident. Document the steps, test them in staging, and make sure the rotation does not require a application restart if that is a constraint.

Common Pitfalls / Anti-Patterns

Controller Pitfalls

Reconciliation without backoff Writing a controller that continuously reconciles without exponential backoff will overwhelm the API server and cause cascading failures. Always implement retry logic with increasing delays.

Exponential backoff is not optional because the alternative is a death spiral. If your controller cannot reach the database it manages, retrying once per second means3,600 attempts per hour against an API server that is already struggling. Each failed reconciliation call consumes API server resources, which makes the API server slower, which causes more reconciliation failures. Backoff breaks this cycle. Start with a 1-second delay, double on each failure, cap at some maximum (5 minutes is reasonable), and reset the delay when reconciliation succeeds.

Ignoring status updates Controllers that fail to update status leave users blind about their resource state. Status conditions should reflect actual observed state.

The .status field is where users go to understand what is happening. If your controller creates a StatefulSet and sets replicas to 3, the user needs to know if that StatefulSet is actually running, how many replicas are ready, and what error occurred if something went wrong. Without status updates, kubectl get mycustomresource shows nothing useful. With proper status updates, users can debug without reading controller logs.

Not handling deletion Controllers must watch for deletion events and clean up resources properly. Orphaned resources cause ghost deployments and confusion.

When a user deletes a custom resource, the controller needs to clean up whatever it created. If the controller creates a Deployment and the user deletes the Database resource, the Deployment should be deleted too. This requires a finalizer. Add a finalizer to your resource, and in your reconciliation, check if the resource is being deleted. If it is, perform cleanup and remove the finalizer. Without this, you get orphaned resources that nobody asked for and nobody can explain.

RBAC Pitfalls

Using default service accounts Workloads should always use dedicated ServiceAccounts with specific permissions. Default ServiceAccount has too many implicit permissions.

The default ServiceAccount in every namespace has an implicit token mounted in every pod unless you explicitly disable it. That token has access to the Kubernetes API based on whatever RBAC bindings exist for the default ServiceAccount, which in many clusters is more permissive than you would expect. Creating a dedicated ServiceAccount for each application and explicitly referencing it in your pod spec is not paranoia; it is basic hygiene. When a pod gets compromised, you want its token to have exactly the permissions the application needs, nothing more.

Granting cluster-admin broadly Reserve cluster-admin for break-glass scenarios. Use namespace-scoped roles for daily operations.

Cluster-admin is the nuclear option. It grants unrestricted access to every resource in every namespace, including the ability to modify RBAC bindings themselves. This means a workload running as cluster-admin can escalate its own permissions, grant itself cluster-admin, or do anything else. There is no audit trail of what a cluster-admin token can do because it can do everything. Reserve it for emergency break-glass access, and use namespace-scoped RoleBindings for normal operations.

Forgetting to audit RBAC configurations drift over time. Regular audits catch permission creep.

RBAC drift is silent and cumulative. Nobody gets up in the morning and thinks “let me add more permissions to the CI pipeline,” but over months and years, permissions accumulate through one-off grants made during incidents, debugging sessions, and onboarding. Set a quarterly reminder to audit all RoleBindings and ClusterRoleBindings. Look for any binding that grants more permissions than the minimum documented use case requires. Pay special attention to bindings that reference ServiceAccounts; those tokens are mounted in pods and represent an ongoing access path.

Network Policy Pitfalls

Forgetting DNS DNS uses port 53 on both TCP and UDP. Without DNS egress, applications cannot resolve service names and all external calls fail.

DNS is the one that catches everyone. Your application needs to resolve servicename.namespace.svc.cluster.local to reach other services, and it needs to resolve external domains to reach cloud APIs, package registries, and everything else. Both resolutions go through the cluster DNS, which runs on port 53. UDP port 53 handles most queries, but TCP port 53 handles responses over 512 bytes. If your egress policy allows UDP53 but not TCP 53, your applications will fail intermittently when DNS responses are large. This is not theoretical; it is the most common network policy failure mode in production clusters.

Too permissive policies Using podSelector: {} matches all pods in the namespace. Be specific about source and destination pods.

An empty podSelector matches every pod in the namespace, which is almost never what you want. A policy with podSelector: {} and an ingress rule allowing port 8080 from anywhere means every pod in the namespace can receive traffic on port 8080. Use label selectors that specifically target the pods you want to allow. If you want to allow traffic from the frontend deployment, use podSelector: matchLabels: app: frontend, not an empty selector.

Policy ordering NetworkPolicy rules are evaluated in order. Place restrictive rules first to avoid accidentally allowing traffic.

Kubernetes network policies are evaluated in the order they are defined, and the first matching rule wins. This is counter-intuitive if you are used to firewall rules where the most permissive wins. In Kubernetes, if a policy with podSelector: {} appears before your specific policy, the empty selector matches first and your specific rule never applies. Always apply your most restrictive policies first, and test that they actually restrict traffic before applying more permissive ones.

Storage Pitfalls

Deleting StorageClass accidentally Never delete a StorageClass that has active claims. The deletion does not block, but dependent pods cannot recover.

Not monitoring volume quotas Running out of PV capacity blocks new PVC claims. Monitor available capacity and plan expansion.

Using ReadWriteMany incorrectly Not all volume plugins support ReadWriteMany. Using it with unsupported backends causes mount failures.

Interview Questions

1. How does a custom controller in Kubernetes differ from a built-in controller like the Deployment controller?

Expected answer points:

Built-in controllers (Deployment, StatefulSet, etc.) handle Kubernetes native resources and are part of the kube-controller-manager
Custom controllers extend Kubernetes by watching CRDs or native resources and reconciling toward a desired state you define
Custom controllers require you to implement the reconcile loop logic, while built-in controllers are pre-built
Custom controllers run as pods in the cluster, typically managed by a Deployment for high availability
The controller-runtime library simplifies building custom controllers with caching, client connections, and leader election built in

2. What is the difference between a Custom Resource Definition (CRD) and a built-in Kubernetes resource?

Expected answer points:

CRDs extend the Kubernetes API to define new resource types without modifying the core Kubernetes binary
Built-in resources (Pod, Deployment, Service) are compiled into the Kubernetes API server
CRDs are stored in etcd just like native resources, but validation is defined via OpenAPIV3Schema
CRDs support versioning to enable zero-downtime schema migrations (storage vs served versions)
Controllers can watch CRDs and reconcile them the same way they watch native resources

3. Explain the reconcile loop pattern in custom controllers.

Expected answer points:

The reconcile loop is a control loop that continuously works toward the desired state
Steps: Watch for changes → Fetch current state → Compare with desired → Take action → Update status → Repeat
Should implement exponential backoff to avoid overwhelming the API server during failures
Must handle deletion properly by cleaning up finalizers and orphaned resources
The controller-runtime library handles caching and provides a structured Reconcile method

4. When would you build a custom operator instead of using a generic controller?

Expected answer points:

Operators encode domain-specific operational knowledge into the reconciliation logic
Use operators for stateful applications with complex lifecycles (backups, upgrades, failover)
Use custom controllers for standard reconcile patterns without app-specific logic
Operators are built with Operator SDK (kubebuilder or ansible) which provides scaffolding, testing, and OLM packaging
If the application needs specialized domain logic beyond standard create/update/delete, build an operator

5. How does RBAC work in Kubernetes, and what are the four key concepts?

Expected answer points:

The four key concepts are: subjects (who), verbs (what actions), resources (what objects), and namespaces (where)
Role and RoleBinding are namespace-scoped; ClusterRole and ClusterRoleBinding are cluster-wide
Subjects can be Users, Groups, or ServiceAccounts
Common verbs: get, list, watch, create, update, patch, delete
Use kubectl auth can-i to test permissions before applying RBAC changes

6. What are Network Policies in Kubernetes, and why are they important?

Expected answer points:

Network policies restrict traffic between pods; by default, all pods can reach all other pods
They implement defense in depth by controlling both ingress and egress traffic
Use podSelector to target specific pods and namespaceSelector to allow traffic from specific namespaces
Always include DNS egress (port 53 TCP/UDP) or applications cannot resolve service names
NetworkPolicies are additive, so start with a default deny-all policy and explicitly allow required traffic

7. What is the difference between PodDisruptionBudget minAvailable and maxUnavailable?

Expected answer points:

minAvailable specifies the minimum number of pods that must remain available during disruptions
maxUnavailable specifies the maximum number of pods that can be unavailable simultaneously
minAvailable is absolute number or percentage; maxUnavailable is typically percentage
Use minAvailable when you need guaranteed capacity (like API servers)
Use maxUnavailable when you want to allow some flexibility during cluster maintenance

8. How do Taints and Tolerations work together to control pod placement?

Expected answer points:

Taints are applied to nodes to repel pods that do not have matching tolerations
Tolerations are applied to pods to allow them to be scheduled on tainted nodes
Taint effects: NoSchedule (hard reject), PreferNoSchedule (soft reject), NoExecute (evict after toleration)
A pod without a matching toleration cannot be scheduled on a tainted node
Use case: reserving nodes for ML workloads by tainting GPU nodes and tolerating only ML training pods

9. What are the key differences between Helm and Kustomize for Kubernetes configuration management?

Expected answer points:

Helm uses a templating model with values.yaml; Kustomize uses an overlay model with kustomization.yaml
Helm charts are versioned packages; Kustomize is patch-based configuration management
Helm is better for third-party software (Prometheus, nginx) where charts already exist
Kustomize is better for custom applications with environment-specific variants (dev, staging, prod)
Both integrate well with GitOps tools like ArgoCD

10. How does a StatefulSet differ from a Deployment, and when would you use each?

Expected answer points:

StatefulSets provide stable network identity (persistent pod names) and stable storage
Deployments are for stateless applications with interchangeable replicas
StatefulSets create pods with predictable, stable hostnames in order (e.g., kafka-0, kafka-1, kafka-2)
StatefulSets require VolumeClaimTemplates for persistent storage per pod instance
Use StatefulSets for databases, message queues, and other stateful applications needing stable identity

11. What are the key components of the Kubernetes control plane and how do they interact?

Expected answer points:

API Server is the gateway for all cluster operations, validating and persisting configurations to etcd
etcd stores the cluster state as a distributed key-value store
Controller Manager runs control loops that reconcile actual state toward desired state
Scheduler places pods on nodes based on resource requirements and constraints
Controllers watch the API server for changes and take corrective action to achieve desired state

12. How does the controller-runtime library simplify custom controller development?

Expected answer points:

controller-runtime provides caching, client connections, and leader election out of the box
It offers a structured Reconcile method with context for handling reconciliation requests
The library handles watch mechanisms for resources, reducing boilerplate code
It integrates with the manager pattern for running multiple controllers together
Error handling and exponential backoff can be implemented within the Reconcile loop

13. What is the difference between a ClusterRole and a Role in Kubernetes RBAC?

Expected answer points:

Role and RoleBinding are namespace-scoped, limited to permissions within a specific namespace
ClusterRole and ClusterRoleBinding are cluster-wide, with access to nodes, persistent volumes, and cluster-scoped resources
ClusterRoles can also grant access to non-resource URLs like /metrics
Use Roles for namespace-limited operations, ClusterRoles for cluster-wide or node-level access
ClusterRoleBindings can reference both ClusterRoles and namespace-scoped RoleBindings

14. How do you handle CRD versioning for zero-downtime migrations?

Expected answer points:

Define multiple versions in the CRD with the storage version flag indicating what persists to etcd
Set served: true for versions you want the API server to handle
Clients can request specific versions via the Accept header during rolling upgrades
Implement conversion webhooks if schema changes between versions require data transformation
Keep the old version served while transitioning traffic, then remove after migration

15. What are the benefits and trade-offs of using a service mesh like Istio?

Expected answer points:

Benefits: Automatic mTLS between services, traffic management (canary deployments, circuit breakers), observability without app changes
Trade-offs: Adds overhead (sidecar proxies consume memory and latency), increases operational complexity
Requires careful resource allocation for sidecar proxies
Debugging can be harder with the additional network hop layer
Consider lighter alternatives like Cilium for simpler networking needs

16. How does Vertical Pod Autoscaler (VPA) work and when would you use it?

Expected answer points:

VPA analyzes actual resource usage patterns and recommends or applies updated resource requests
It can operate in Off mode (recommendations only), Auto mode (applies by evicting and recreating pods), or Initial mode (at pod creation only)
VPA helps right-size resources based on real usage rather than estimates
Combine VPA with cluster autoscaler for both pod and node-level scaling
Be aware that VPA cannot be used simultaneously with HPA on the same metrics

17. What is the purpose of PodDisruptionBudgets and how do they protect availability?

Expected answer points:

PDBs ensure minimum availability during voluntary disruptions like cluster upgrades or node maintenance
minAvailable specifies the minimum number of pods that must remain available
maxUnavailable specifies the maximum number of pods that can be unavailable (typically percentage)
Kubernetes will block node drain operations that violate PDBs
Set realistic values based on application capacity requirements

18. How do StorageClasses enable dynamic volume provisioning?

Expected answer points:

StorageClasses define storage provisioners (like kubernetes.io/gce-pd or cloud-provider-specific ones)
They parameterize storage types (SSD vs HDD, regional vs zonal) for different performance tiers
WaitForFirstConsumer delays volume binding until a pod actually uses the claim, allowing zone co-location
allowVolumeExpansion: true allows PVCs to be expanded without recreating the volume
When a PVC requests storage, the provisioner creates the volume dynamically based on the StorageClass

19. What are the key differences between Helm and Kustomize approaches to configuration management?

Expected answer points:

Helm uses a templating model with values.yaml files and produces versioned chart packages
Kustomize uses a declarative overlay model with kustomization.yaml files for patches
Helm is better for installing third-party software where charts already exist
Kustomize excels at managing environment-specific variants (dev, staging, prod) with base + overlay
Both integrate with GitOps tools like ArgoCD and can be combined (Kustomize for base, Helm for releases)

20. What operational considerations are unique to StatefulSets compared to Deployments?

Expected answer points:

StatefulSets scale one pod at a time, maintaining ordering guarantees
Each pod gets a stable, predictable hostname (e.g., kafka-0, kafka-1)
VolumeClaimTemplates create unique PVCs per pod instance that persist across rescheduling
Stateful applications like databases require careful scaling procedures (e.g., Kafka broker sync times)
Init containers can handle bootstrap sequences that must complete before the next pod is created

Conclusion

Quick Recap Checklist

Before shipping your Kubernetes configuration to production, run through this checklist. A pre-flight review catches the configuration gaps that lead to production incidents. Missing resource limits cause noisy-neighbor problems. Overly permissive RBAC exposes the cluster to unnecessary risk. Network policies that accidentally block critical traffic bring down applications silently. Teams that skip this step often discover issues during an outage rather than during a controlled review.

The checklists below are organized by topic so you can work through them one area at a time. Start with controllers and operators, since misconfigured reconcile loops can cascade across the cluster. Then move through RBAC, network policies, storage, and scheduling. Each checklist item comes from real incidents that have affected production Kubernetes clusters.

You do not need to complete every item on the first pass. Prioritize the items that match your current workload patterns. For example, if you run only stateless workloads, the storage checklist matters less than the network policy and RBAC sections. The observability section at the end ties everything together, making sure you can actually see when something goes wrong.

For a quick sanity check before a deployment, run through these five questions: Does the controller implement exponential backoff? Does the ServiceAccount have only the permissions it actually needs? Does every namespace have a default-deny NetworkPolicy? Are resource requests and limits set for all containers? Is the controller’s health status visible through its status conditions? If any of these are no, fix them before going to production.

Operational Checklists

Controllers and Operators

Custom controller implements exponential backoff on reconciliation failures
Status conditions reflect actual observed state
Deletion handling cleans up orphaned resources
CRD schema validation covers required fields
Operator handles backup and restore scenarios

RBAC and Security

No workloads use default ServiceAccount
ServiceAccount permissions follow least privilege
ClusterRoleBindings reviewed quarterly
ServiceAccount tokens rotated regularly
etcd encryption at rest enabled

Network Policy Checklist

Default deny-all policy applied in each namespace
DNS egress (port 53 TCP/UDP) allowed in all policies
Policy ordering reviewed (restrictive rules first)
podSelector is specific, not {}

Storage

StorageClass has allowVolumeExpansion: false initially
No active PVCs before deleting a StorageClass
Volume capacity monitored and expanded proactively
Correct access mode (ReadWriteOnce vs ReadWriteMany) used

Scheduling

PodDisruptionBudgets set for critical deployments
Resource requests and limits defined for all containers
Taints and tolerations documented
Node affinity and anti-affinity configured for workload distribution

Observability

Observability is not a feature you add after an incident; it is infrastructure you build before you need it. When something breaks at 3am, you need to be able to answer three questions: What is failing? Why is it failing? What changed recently? If your metrics, logs, and traces do not make those questions answerable, you will spend the first hour of an incident just trying to understand what is happening.

Control plane metrics are the ones most teams skip because they are not immediately relevant to application behavior. But when the API server latency spikes, every controller in the cluster slows down, every kubelet gets delayed responses, and the entire cluster starts showing symptoms that look like application problems but are actually infrastructure problems. API server request latency, etcd WAL fsync duration, and scheduler pod placement latency are the three control plane metrics that most directly affect application reliability.

Audit logs are compliance requirements in most regulated environments, but they are also invaluable for incident response. Every request to the Kubernetes API server is logged with the identity of the caller, the timestamp, and the resource affected. If someone runs kubectl delete and breaks a deployment, the audit log tells you who ran it, when, and from which IP. Without audit logging, you are investigating blind.

Application metrics are where most teams start, and they should be comprehensive. CPU and memory usage per pod, replica count versus desired, restart count, and error rates are the baseline. For production services, you also want latency percentiles (p50, p95, p99), request throughput, and dependency health. If your application exposes a Prometheus metrics endpoint, Kubernetes can scrape it automatically and store the data in Prometheus or any other metrics backend.

Control plane metrics collected (API latency, etcd WAL fsync, scheduler latency)
Audit logs enabled and forwarded to central storage
Alerts configured for API server unavailability, etcd leadership changes, node not ready
Application metrics (CPU, memory, replica count) monitored

Summary

Key Takeaways

Custom controllers and operators encode operational expertise but add complexity only justified for stateful, domain-specific applications
RBAC follows the principle of least privilege: grant only the permissions actually needed
Network policies implement defense in depth; always include DNS egress rules
StorageClasses enable dynamic provisioning but require careful capacity planning
Pod disruption budgets protect availability during voluntary disruptions
Taints and tolerations control pod placement across node types
Observability across metrics, logs, and alerts is essential for production reliability

Production Readiness Checklist

# RBAC
kubectl get rolebindings,clusterrolebindings -A | grep -v system:
kubectl auth can-i --list --as=system:serviceaccount:production:my-app-sa

# Network Policies
kubectl get networkpolicies -A
kubectl describe networkpolicy <name> -n production

# Storage
kubectl get pvc -A | grep -v Bound
kubectl get storageclass

# Pod Security
kubectl get pods -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.securityContext.runAsNonRoot}{"\n"}{end}' -n production

# Controllers
kubectl get events --sort-by='.lastTimestamp' -n production | tail -50

Trade-off Summary

Pattern	Best For	Complexity	Operational Burden
Built-in controllers	Standard workloads	Low	Minimal
Custom controllers	Domain-specific automation	Medium	Medium
Operators (kubebuilder)	Complex lifecycle management	High	High
Operators (operator-sdk)	Existing Go projects	High	High
RBAC only	Simple permissions	Low	Minimal
OPA Gatekeeper	Policy enforcement	Medium	Medium
Kyverno	Policy as YAML	Low	Low

Advanced Kubernetes topics build on the fundamentals of containers and orchestration. Custom controllers and operators let you encode domain knowledge into automated workflows. RBAC and network policies enforce security boundaries. Storage classes and resource quotas ensure predictable cluster operation.

These patterns emerge from real production experience. Start with the basics of your applications, understand the failure modes, and apply the patterns that solve your specific problems.

For packaging your Kubernetes applications, the Helm Charts guide covers templating and release management. If you are building observability into your cluster, check out our Distributed Tracing and Prometheus & Grafana guides for monitoring setup.