Advanced Pod Scheduling: Affinity, Taints, Tolerations, Topology
Fine-tune pod placement across your Kubernetes cluster using node affinity, pod affinity/anti-affinity, taints/tolerations, and topology spread constraints.
Kubernetes schedules pods onto nodes automatically. The default scheduler looks at resource requests and finds a node with enough capacity. But sometimes you need more control. You might want to run workloads on specific nodes, keep certain pods separated, or distribute replicas across availability zones. Kubernetes provides several mechanisms for this: taints, tolerations, affinity, and topology spread constraints.
This post covers how Kubernetes scheduling works by default and how to override it for specialized workloads.
For Kubernetes basics, see the Kubernetes fundamentals post. For high availability patterns, see the High Availability post.
When to Use / When Not to Use
Use scheduling mechanisms when
- Specialized hardware is involved. GPU nodes, high-memory machines, nodes with local SSDs all warrant explicit placement decisions.
- Multiple teams share a cluster and need isolation. Production workloads should not share nodes with less-trusted workloads.
- Availability requirements demand zone distribution. If a single AZ outage should not take down your service, pods need to spread.
- Related pods should co-locate or spread apart. A cache and its backend probably want to be near each other. Replicas of the same service definitely do not.
Skip custom scheduling when
- Default scheduling works. The scheduler is good at what it does. Do not add constraints unless you have a problem default scheduling does not solve.
- Your cluster is small. If you have 3 nodes and no specialized hardware, the overhead of managing affinity rules is not worth it.
- You are adding constraints “just in case.” Over-constrained clusters leave pods pending.
Scheduling Decision Flow
```mermaid
flowchart TD
    A[Pod created<br/>no node assigned] --> B[Filter: Which nodes<br/>can run this pod?]
    B --> C{Any nodes<br/>pass filtering?}
    C -->|No| D[Pod stays Pending]
    C -->|Yes| E[Score: Rank nodes<br/>by scheduling criteria]
    E --> F[Select highest<br/>scoring node]
    F --> G[Bind pod to node<br/>Pod starts]
```
The scheduler filters nodes that cannot run the pod (resource constraints, taints, selectors), then scores the remaining nodes to pick the best fit.
How Default Scheduling Works
The Kubernetes scheduler runs as a component on the control plane. When a pod gets created without a node assignment, the scheduler evaluates all available nodes and picks the best one.
The scheduling process has two phases:
- Filtering: Find nodes that can run the pod (based on resource availability, node selectors, taints)
- Scoring: Rank the filtered nodes (based on affinity rules, resource utilization, topology)
The scheduler implements each phase as a set of plugins: filtering plugins remove nodes that cannot run the pod, and scoring plugins rank the nodes that remain.
You can configure multiple schedulers and assign pods to specific schedulers:
```yaml
spec:
  schedulerName: custom-scheduler
```
Without this field, pods use the default scheduler.
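Instead of running a second scheduler binary, you can also define multiple profiles in a single scheduler. A minimal sketch, assuming a recent Kubernetes version with the kubescheduler.config.k8s.io/v1 API; the profile name custom-scheduler and the disabled plugin are illustrative choices:

```yaml
# KubeSchedulerConfiguration passed to kube-scheduler via --config.
# One scheduler binary serves both the default profile and a second
# profile that pods select with spec.schedulerName.
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
- schedulerName: default-scheduler
- schedulerName: custom-scheduler
  plugins:
    score:
      disabled:
      - name: NodeResourcesBalancedAllocation  # example: tweak scoring for this profile
```

Profiles share one process and one leader election, which is simpler to operate than a second scheduler Deployment.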
Taints and Tolerations for Node Exclusion
Taints repel pods from nodes. If a node has a taint, pods without a matching toleration cannot schedule onto that node. This is how you mark nodes for special purposes.
Adding a taint to a node
```shell
kubectl taint nodes node1 dedicated=postgres:NoSchedule
```
This taint has a key (dedicated), value (postgres), and effect (NoSchedule). The effect determines what happens to pods without a matching toleration:
| Effect | Behavior |
|---|---|
| NoSchedule | Pods without matching toleration are not scheduled |
| PreferNoSchedule | Scheduler tries to avoid but may schedule if necessary |
| NoExecute | Existing pods without matching toleration are evicted |
Tolerations in a pod spec
```yaml
spec:
  tolerations:
  - key: "dedicated"
    operator: "Equal"
    value: "postgres"
    effect: "NoSchedule"
```
The toleration says this pod tolerates nodes with the dedicated=postgres taint. With this toleration, the pod can schedule onto the tainted node.
Using Equal vs Exists operator
```yaml
# Equal: match the key and a specific value
tolerations:
- key: "dedicated"
  operator: "Equal"
  value: "postgres"
  effect: "NoSchedule"

# Exists: match the key regardless of value
tolerations:
- key: "dedicated"
  operator: "Exists"
  effect: "NoSchedule"
```
The Exists operator matches any value for the key.
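Taken to its extreme, an Exists toleration with no key at all matches every taint — useful for node-level agents (log collectors, monitoring daemonsets) that must run on every node. A sketch:

```yaml
# Tolerates all taints: an empty key with operator Exists matches
# every taint key, and omitting effect matches every effect.
tolerations:
- operator: "Exists"
```

Use this sparingly: a pod that tolerates everything defeats the isolation the taints were meant to provide.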
Common taint use cases
Mark nodes for specific workloads:
```shell
kubectl taint nodes node-gpu gpu-workload:NoSchedule
kubectl taint nodes node-batch batch-job:NoExecute
kubectl taint nodes node1 infra=proxy:PreferNoSchedule
```
Pods that need GPUs add a toleration:
```yaml
tolerations:
- key: "gpu-workload"
  operator: "Exists"
  effect: "NoSchedule"
```
Node Affinity Rules
Node affinity is more expressive than node selectors. It allows matching rules based on node labels and supports soft preferences (preferredDuringSchedulingIgnoredDuringExecution) in addition to hard requirements (requiredDuringSchedulingIgnoredDuringExecution).
Required node affinity
```yaml
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: "topology.kubernetes.io/zone"
            operator: In
            values:
            - us-east-1a
            - us-east-1b
          - key: "node.kubernetes.io/disktype"
            operator: In
            values:
            - ssd
```
The pod must run on a node in zone us-east-1a or us-east-1b with an SSD disk. If no node matches, the pod stays unscheduled.
Preferred node affinity
```yaml
spec:
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 1
        preference:
          matchExpressions:
          - key: "node.kubernetes.io/networking-type"
            operator: In
            values:
            - fast
      - weight: 50
        preference:
          matchExpressions:
          - key: "topology.kubernetes.io/zone"
            operator: In
            values:
            - us-east-1a
```
The scheduler considers both preferences, but with weight 50 versus weight 1, the us-east-1a zone preference far outweighs the fast-networking preference. Higher weight means stronger preference.
Combining node affinity and taints
```yaml
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: "dedicated"
            operator: In
            values:
            - database
  tolerations:
  - key: "dedicated"
    operator: "Equal"
    value: "database"
    effect: "NoSchedule"
```
This pod requires a node labeled dedicated=database and tolerates that taint. The combination ensures the pod lands on a dedicated database node.
Pod Affinity and Anti-Affinity
Pod affinity keeps pods together. Pod anti-affinity keeps pods separated. Use these to co-locate related pods or distribute replicas across failure domains.
Pod anti-affinity for high availability
```yaml
spec:
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: "app"
            operator: In
            values:
            - web-frontend
        topologyKey: topology.kubernetes.io/zone
```
This forces web-frontend pods to spread across availability zones: each zone holds at most one such pod, so three replicas land in three different zones. Because the rule is required, a replica with no eligible zone left stays Pending rather than doubling up.
Pod affinity for co-location
```yaml
spec:
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: "app"
            operator: In
            values:
            - redis
        topologyKey: kubernetes.io/hostname
```
This places pods on the same node as their Redis cache. The topologyKey kubernetes.io/hostname means same physical node (same kubelet).
Preferred pod anti-affinity
Hard requirements may prevent scheduling if constraints cannot be satisfied. Use preferred for softer constraints:
```yaml
spec:
  affinity:
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchExpressions:
            - key: "app"
              operator: In
              values:
              - web-frontend
          topologyKey: topology.kubernetes.io/zone
```
The scheduler tries to spread pods across zones but does not fail if it cannot.
Topology Spread for HA
Topology spread provides more control over distribution than pod anti-affinity. You can specify how pods spread across failure domains while still meeting minimum availability requirements.
```yaml
spec:
  topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app: web-frontend
  - maxSkew: 1
    topologyKey: kubernetes.io/hostname
    whenUnsatisfiable: ScheduleAnyway
    labelSelector:
      matchLabels:
        app: web-frontend
```
The first constraint requires web-frontend pods to spread across zones with maxSkew of 1. If unsatisfiable, pods do not schedule (DoNotSchedule).
The second constraint prefers spreading across nodes within a zone. If unsatisfiable, pods schedule anyway (ScheduleAnyway).
maxSkew explained
Skew is the difference in pod count between the most and least loaded domains. If you have 6 pods and 3 zones:
- Zone A: 3 pods
- Zone B: 2 pods
- Zone C: 1 pod
The skew is 2 (zone A has 2 more pods than zone C), which violates maxSkew: 1. Kubernetes would try to balance toward 2 pods per zone.
If zone C has 0 pods:
- Zone A: 3 pods
- Zone B: 3 pods
- Zone C: 0 pods
The skew is 3, which also violates maxSkew: 1. The scheduler would place the next pod in zone C.
Priority Classes and Preemption
Pod priority affects scheduling order. Higher priority pods schedule before lower priority pods. If a node runs out of resources, the scheduler may preempt (evict) lower priority pods to make room for higher priority pods.
Defining a priority class
```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority
value: 100000
globalDefault: false
description: "Production workloads with high priority"
```
Using priority class
```yaml
spec:
  priorityClassName: high-priority
```
System priority classes
Kubernetes has system priority classes:
- system-cluster-critical (value 2000000000)
- system-node-critical (value 2000001000)
Use these for critical cluster components that should not be preempted.
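If a workload should jump the scheduling queue without evicting anything, a PriorityClass can opt out of preemption via preemptionPolicy. A sketch; the name and value here are illustrative:

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority-nonpreempting
value: 90000
preemptionPolicy: Never   # schedule ahead of lower-priority pods, but never evict them
globalDefault: false
description: "High priority without preemption"
```

This is a good fit for batch jobs that deserve early placement but should not destabilize running services.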
Custom Scheduler Configuration Example
For workloads needing specialized scheduling logic, deploy a custom scheduler alongside the default:
```shell
# Deploy the custom scheduler
kubectl apply -f custom-scheduler.yaml
```

```yaml
# custom-scheduler.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: custom-scheduler
  namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: custom-scheduler
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: system:kube-scheduler
subjects:
- kind: ServiceAccount
  name: custom-scheduler
  namespace: kube-system
---
# Recent kube-scheduler versions take the scheduler name from a
# KubeSchedulerConfiguration file (--config) rather than a
# --scheduler-name flag, so the name is set in this ConfigMap.
apiVersion: v1
kind: ConfigMap
metadata:
  name: custom-scheduler-config
  namespace: kube-system
data:
  config.yaml: |
    apiVersion: kubescheduler.config.k8s.io/v1
    kind: KubeSchedulerConfiguration
    leaderElection:
      leaderElect: false
    profiles:
    - schedulerName: high-memory-scheduler
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: custom-scheduler
  namespace: kube-system
spec:
  replicas: 1
  selector:
    matchLabels:
      app: custom-scheduler
  template:
    metadata:
      labels:
        app: custom-scheduler
    spec:
      serviceAccountName: custom-scheduler
      containers:
      - name: scheduler
        image: registry.k8s.io/kube-scheduler:v1.28.0
        command:
        - /usr/local/bin/kube-scheduler
        - --config=/etc/kubernetes/custom-scheduler/config.yaml
        volumeMounts:
        - name: config
          mountPath: /etc/kubernetes/custom-scheduler
      volumes:
      - name: config
        configMap:
          name: custom-scheduler-config
```
Assign pods to the custom scheduler:
```yaml
spec:
  schedulerName: high-memory-scheduler
  containers:
  - name: high-mem-app
    image: myapp:latest
```
Affinity vs Topology Spread Trade-offs
| Feature | Affinity/Anti-Affinity | Topology Spread Constraints |
|---|---|---|
| Scope | Pod-to-pod or pod-to-node | Pods across topology domains |
| Matching | Label selectors | Label selectors |
| TopologyKey | Any label (hostname, zone, etc.) | Any label |
| Hard requirements | requiredDuringScheduling... | whenUnsatisfiable: DoNotSchedule |
| Soft preferences | preferredDuringScheduling... | whenUnsatisfiable: ScheduleAnyway |
| Explicit maxSkew | No | Yes — controls imbalance tolerance |
| Multiple constraints | Multiple affinity rules | Multiple topologySpreadConstraints entries |
| Failure handling | Pod stays Pending | Pod stays Pending or schedules anyway |
Use topology spread when you care about balanced distribution across zones or nodes (maxSkew is your control knob). Use affinity when you need to express relationships between specific pods or between pods and node labels.
Observability Hooks for Scheduling
What to monitor
Track pending pods, scheduling duration, and preemption events to catch scheduling problems early.
```shell
# Find pods stuck in Pending (likely scheduling issues)
kubectl get pods --all-namespaces --field-selector=status.phase=Pending

# Check scheduling duration for recent pods
kubectl get events --sort-by='.lastTimestamp' | grep Scheduled

# Watch for preemption events
kubectl get events --sort-by='.lastTimestamp' | grep Preempt

# Check which nodes have what taints
kubectl get nodes -o json | jq '.items[].spec.taints'
```
Key metrics to track
- Pending pod count by namespace (alert if > 0 for > 5 minutes)
- Scheduling latency (time from pod creation to scheduled)
- Preemption count (high preemption rates suggest resource pressure)
- Node taint coverage (ensure critical workloads tolerate all taints)
Debug commands for scheduling
```shell
# View why a pod cannot be scheduled
kubectl describe pod myapp-xxx -n production | grep -A20 "Events"

# Test whether a pod can schedule when a node is unavailable
kubectl cordon node1    # temporarily mark node unschedulable
kubectl uncordon node1  # restore

# Check scheduler logs (control plane)
kubectl logs -n kube-system kube-scheduler-control-plane --tail=100

# Simulate pod admission with a server-side dry run
kubectl run test --image=nginx --dry-run=server -o yaml
```
Production Failure Scenarios
Over-Constrained Pods Stuck in Pending
Too many scheduling constraints that cannot all be satisfied leave pods stuck in Pending. The describe output says something like no nodes match affinity rules.
Check events:
```shell
kubectl describe pod <name>
kubectl get events --sort-by='.lastTimestamp' | grep Unschedulable
```
If you have hard affinity rules across too many dimensions (zone, node type, memory), the scheduler may never find a node satisfying all of them.
Use preferredDuringSchedulingIgnoredDuringExecution for softer requirements. Reserve hard requirements for constraints that truly cannot be violated.
Priority Preemption Causing Disruption
When a high-priority pod cannot find a node, Kubernetes evicts lower-priority pods to make room. This is preemption.
The problem is that evicted pods do not always get scheduled immediately elsewhere. Critical services can go down if their pods get preempted during peak load.
Set priority classes carefully. Do not give every production pod the same extreme priority. PodDisruptionBudgets offer some protection: the scheduler tries to respect PDBs when choosing preemption victims, but this is best-effort and can still be violated.
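A PodDisruptionBudget steers the scheduler toward other victims when it preempts. A minimal sketch for the web-frontend example used earlier:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-frontend-pdb
spec:
  minAvailable: 2          # keep at least 2 web-frontend pods running
  selector:
    matchLabels:
      app: web-frontend
```

PDBs are enforced strictly for voluntary disruptions (drains, rollouts); for preemption they are only a preference, so pair them with sensible priority values.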
Taint NoExecute Evicting Critical Pods
The NoExecute effect is harsh. It evicts existing pods that do not tolerate the taint, not just prevents new scheduling.
If you taint a node for dedicated database workloads with NoExecute, every pod already on it without a matching toleration — your monitoring agent, for example — is evicted immediately.
Always add tolerations before tainting nodes. Use tolerationSeconds for a grace period on critical workloads.
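tolerationSeconds bounds how long a pod may remain on a node after a NoExecute taint appears. A sketch using the built-in not-ready taint that Kubernetes applies automatically to unhealthy nodes:

```yaml
tolerations:
- key: "node.kubernetes.io/not-ready"
  operator: "Exists"
  effect: "NoExecute"
  tolerationSeconds: 300   # tolerate an unready node for 5 minutes, then evict
```

Without tolerationSeconds, an Exists toleration for a NoExecute taint means the pod stays on the node indefinitely.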
Anti-Patterns
Too Many Hard Affinity Requirements
requiredDuringSchedulingIgnoredDuringExecution with many constraints is a recipe for Pending pods. If you need GPU, SSD, and a specific zone, you might as well ask for a unicorn node.
Keep hard requirements minimal. Use preferences for everything else.
Undocumented Taints
Operations teams add taints without telling developers. Developers wonder why their pods will not schedule. It happens.
Maintain a taint inventory. Document which workloads tolerate which taints. Put tolerations in the workload manifests.
Skipping topologySpreadConstraints
podAntiAffinity is good but not precise for zone distribution. topologySpreadConstraints is explicit about how imbalance is measured and what happens when constraints cannot be satisfied. Use it for anything that matters.
Quick Recap Checklist
- Taints and tolerations set up for node dedication
- Node affinity places workloads on the right nodes
- podAntiAffinity spreads replicas across nodes or zones
- topologySpreadConstraints handles explicit zone distribution
- Priority classes set appropriately for critical workloads
- tolerationSeconds configured for graceful NoExecute handling
- Scheduling tested in staging before production
- Unschedulable pods monitored with alerts
Conclusion
Kubernetes scheduling gives you fine-grained control over pod placement. Taints and tolerations let nodes repel pods that do not explicitly tolerate them. Node affinity places pods on nodes with specific labels. Pod affinity keeps related pods together; pod anti-affinity spreads them apart. Topology spread constraints distribute pods across failure domains like availability zones.
These mechanisms work together. A typical production setup might use taints to reserve GPU nodes, node affinity to place workloads on appropriate nodes, pod anti-affinity to spread replicas across zones, and topology spread to ensure balanced distribution.
Priority classes and preemption ensure critical workloads get scheduled first, even if it means evicting lower priority pods.
For more on keeping applications highly available, see the High Availability post.