Advanced Pod Scheduling: Affinity, Taints, Tolerations, Topology
Fine-tune pod placement across your Kubernetes cluster using node affinity, pod affinity/anti-affinity, taints/tolerations, and topology spread constraints.
Kubernetes schedules pods onto nodes automatically. The default scheduler looks at resource requests and finds a node with enough capacity. But sometimes you need more control. You might want to run workloads on specific nodes, keep certain pods separated, or distribute replicas across availability zones. Kubernetes provides several mechanisms for this: taints, tolerations, affinity, and topology spread constraints.
This post covers how Kubernetes scheduling works by default and how to override it for specialized workloads.
For Kubernetes basics, see the Kubernetes fundamentals post. For high availability patterns, see the High Availability post.
Introduction
Use scheduling mechanisms when
- Specialized hardware is involved. GPU nodes, high-memory machines, nodes with local SSDs all warrant explicit placement decisions.
- Multiple teams share a cluster and need isolation. Production workloads should not share nodes with less-trusted workloads.
- Availability requirements demand zone distribution. If a single AZ outage should not take down your service, pods need to spread.
- Related pods should co-locate or spread apart. A cache and its backend probably want to be near each other. Replicas of the same service definitely do not.
Skip custom scheduling when
- Default scheduling works. The scheduler is good at what it does. Do not add constraints unless you have a problem default scheduling does not solve.
- Your cluster is small. If you have 3 nodes and no specialized hardware, the overhead of managing affinity rules is not worth it.
- You are adding constraints “just in case.” Over-constrained clusters leave pods pending.
Scheduling Decision Flow
flowchart TD
A[Pod created<br/>no node assigned] --> B[Filter: Which nodes<br/>can run this pod?]
B --> C{Any nodes<br/>pass filtering?}
C -->|No| D[Pod stays Pending]
C -->|Yes| E[Score: Rank nodes<br/>by scheduling criteria]
E --> F[Select highest<br/>score node]
F --> G[Bind pod to node<br/>Pod starts]
The scheduler filters nodes that cannot run the pod (resource constraints, taints, selectors), then scores the remaining nodes to pick the best fit.
How Default Scheduling Works
The Kubernetes scheduler runs as a component on the control plane. When a pod gets created without a node assignment, the scheduler evaluates all available nodes and picks the best one.
The scheduling process has two phases:
- Filtering: Find nodes that can run the pod (based on resource availability, node selectors, taints)
- Scoring: Rank the filtered nodes (based on affinity rules, resource utilization, topology)
The scheduler uses plugins for each phase. The Filtering plugins remove nodes that cannot run the pod. The Scoring plugins rank the remaining nodes.
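As a rough illustration of how plugin behavior can be tuned, the sketch below changes the scoring strategy of the built-in NodeResourcesFit plugin so that fuller nodes score higher (bin packing). It assumes you control the kube-scheduler configuration file and that the kubescheduler.config.k8s.io/v1 API is available in your cluster; the resource weights are illustrative, not a recommendation.

```yaml
# Sketch of a scheduler configuration file (passed to kube-scheduler via --config).
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
  - schedulerName: default-scheduler
    pluginConfig:
      - name: NodeResourcesFit
        args:
          scoringStrategy:
            type: MostAllocated   # favor nodes that are already heavily used
            resources:
              - name: cpu
                weight: 1         # illustrative weights
              - name: memory
                weight: 1
```

Filtering is unchanged by this configuration; only the ranking of nodes that already passed filtering is affected.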
You can configure multiple schedulers and assign pods to specific schedulers:
spec:
  schedulerName: custom-scheduler
Without this field, pods use the default scheduler.
Taints and Tolerations for Node Exclusion
Taints repel pods from nodes. If a node has a taint, pods without a matching toleration cannot schedule onto that node. This is how you mark nodes for special purposes.
Adding a taint to a node
kubectl taint nodes node1 dedicated=postgres:NoSchedule
This taint has a key (dedicated), value (postgres), and effect (NoSchedule). The effect determines what happens to pods without a matching toleration:
| Effect | Behavior |
|---|---|
| NoSchedule | Pods without matching toleration are not scheduled |
| PreferNoSchedule | Scheduler tries to avoid but may schedule if necessary |
| NoExecute | New pods without matching toleration are not scheduled, and existing pods without one are evicted |
Tolerations in a pod spec
spec:
  tolerations:
    - key: "dedicated"
      operator: "Equal"
      value: "postgres"
      effect: "NoSchedule"
The toleration says this pod tolerates nodes with the dedicated=postgres taint. With this toleration, the pod can schedule onto the tainted node.
Using Equal vs Exists operator
# Equal: match specific value
tolerations:
  - key: "dedicated"
    operator: "Equal"
    value: "postgres"
    effect: "NoSchedule"
# Exists: match key regardless of value
tolerations:
  - key: "dedicated"
    operator: "Exists"
    effect: "NoSchedule"
The Exists operator matches any value for the key.
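One edge case worth knowing: a toleration with operator Exists and no key matches every taint, and a toleration with no effect matches all effects for its key. The sketch below shows the tolerate-everything form, which is typically reserved for node-level agents. The pod name and image are placeholders, not part of any real setup.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: node-agent                  # hypothetical name
spec:
  tolerations:
    - operator: "Exists"            # no key and no effect: tolerates every taint
  containers:
    - name: agent
      image: registry.example.com/node-agent:1.0   # placeholder image
```

Use this form sparingly, since it also bypasses taints that were added to keep ordinary workloads away.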
Common taint use cases
Mark nodes for specific workloads:
kubectl taint nodes node-gpu gpu-workload:NoSchedule
kubectl taint nodes node-batch batch-job:NoExecute
kubectl taint nodes node1 infra=proxy:PreferNoSchedule
Pods that need GPUs add a toleration:
tolerations:
  - key: "gpu-workload"
    operator: "Exists"
    effect: "NoSchedule"
Node Affinity Rules
Node affinity is more expressive than node selectors. It allows matching rules based on node labels and supports soft preferences (preferredDuringSchedulingIgnoredDuringExecution) in addition to hard requirements (requiredDuringSchedulingIgnoredDuringExecution).
Required node affinity
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: "topology.kubernetes.io/zone"
                operator: In
                values:
                  - us-east-1a
                  - us-east-1b
              - key: "node.kubernetes.io/disktype"
                operator: In
                values:
                  - ssd
Expressions within a single matchExpressions list are ANDed, so the pod must run on a node in zone us-east-1a or us-east-1b that also has an SSD disk. If no node matches, the pod stays Pending.
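While expressions inside one term are ANDed, separate entries under nodeSelectorTerms are ORed: a node qualifies if it satisfies any one term. The following sketch shows two alternative terms. The example.com/capacity-type label is hypothetical and stands in for whatever label your nodes actually carry.

```yaml
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          # Term 1: an SSD node in us-east-1a
          - matchExpressions:
              - key: "topology.kubernetes.io/zone"
                operator: In
                values: ["us-east-1a"]
              - key: "node.kubernetes.io/disktype"
                operator: In
                values: ["ssd"]
          # Term 2 (alternative): any non-spot node in us-east-1b
          - matchExpressions:
              - key: "topology.kubernetes.io/zone"
                operator: In
                values: ["us-east-1b"]
              - key: "example.com/capacity-type"   # hypothetical label
                operator: NotIn
                values: ["spot"]
```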
Preferred node affinity
spec:
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 1
          preference:
            matchExpressions:
              - key: "node.kubernetes.io/networking-type"
                operator: In
                values:
                  - fast
        - weight: 50
          preference:
            matchExpressions:
              - key: "topology.kubernetes.io/zone"
                operator: In
                values:
                  - us-east-1a
Each preference a node matches adds its weight (1 to 100) to that node's score. Here a node in us-east-1a (weight 50) is strongly favored, while fast networking (weight 1) acts mostly as a tie-breaker.
Combining node affinity and taints
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: "dedicated"
                operator: In
                values:
                  - database
  tolerations:
    - key: "dedicated"
      operator: "Equal"
      value: "database"
      effect: "NoSchedule"
This pod requires a node labeled dedicated=database and tolerates that taint. The combination ensures the pod lands on a dedicated database node.
Pod Affinity and Anti-Affinity
Pod affinity keeps pods together. Pod anti-affinity keeps pods separated. Use these to co-locate related pods or distribute replicas across failure domains.
Pod anti-affinity for high availability
spec:
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchExpressions:
              - key: "app"
                operator: In
                values:
                  - web-frontend
          topologyKey: topology.kubernetes.io/zone
This forces web-frontend pods into different availability zones: no two replicas can share a zone. With three zones, three replicas land in three different zones, and a fourth replica stays Pending.
Pod affinity for co-location
spec:
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchExpressions:
              - key: "app"
                operator: In
                values:
                  - redis
          topologyKey: kubernetes.io/hostname
This places pods on the same node as their Redis cache. The topologyKey kubernetes.io/hostname means same physical node (same kubelet).
Preferred pod anti-affinity
Hard requirements may prevent scheduling if constraints cannot be satisfied. Use preferred for softer constraints:
spec:
  affinity:
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100
          podAffinityTerm:
            labelSelector:
              matchExpressions:
                - key: "app"
                  operator: In
                  values:
                    - web-frontend
            topologyKey: topology.kubernetes.io/zone
The scheduler tries to spread pods across zones but does not fail if it cannot.
Topology Spread for HA
Topology spread provides more control over distribution than pod anti-affinity. You can specify how pods spread across failure domains while still meeting minimum availability requirements.
spec:
  topologySpreadConstraints:
    - maxSkew: 1
      topologyKey: topology.kubernetes.io/zone
      whenUnsatisfiable: DoNotSchedule
      labelSelector:
        matchLabels:
          app: web-frontend
    - maxSkew: 1
      topologyKey: kubernetes.io/hostname
      whenUnsatisfiable: ScheduleAnyway
      labelSelector:
        matchLabels:
          app: web-frontend
The first constraint requires web-frontend pods to spread across zones with maxSkew of 1. If unsatisfiable, pods do not schedule (DoNotSchedule).
The second constraint prefers spreading across individual nodes. If unsatisfiable, pods schedule anyway (ScheduleAnyway).
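In a workload manifest, these constraints sit in the pod template, and the labelSelector has to match the labels the template assigns; otherwise the pods being scheduled are not counted. A minimal sketch, assuming a hypothetical six-replica Deployment and a placeholder image:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-frontend
spec:
  replicas: 6
  selector:
    matchLabels:
      app: web-frontend
  template:
    metadata:
      labels:
        app: web-frontend          # must match the constraint's labelSelector
    spec:
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: web-frontend
      containers:
        - name: web
          image: nginx:1.25        # placeholder image
```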
maxSkew explained
If you have 6 pods and 3 zones:
- Zone A: 3 pods
- Zone B: 2 pods
- Zone C: 1 pod
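The skew here is 2 (3 pods in zone A minus 1 pod in zone C), which exceeds maxSkew: 1. A balanced layout is 2 pods per zone. The scheduler does not move running pods, but it places new pods only where the skew stays within the limit.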
If zone C has 0 pods:
- Zone A: 3 pods
- Zone B: 3 pods
- Zone C: 0 pods
The skew is 3, which violates maxSkew: 1. With DoNotSchedule, the next pod can only be placed in zone C.
Priority Classes and Preemption
Pod priority affects scheduling order. Higher priority pods schedule before lower priority pods. If no node has enough free resources for a pending pod, the scheduler may preempt (evict) lower priority pods to make room.
Defining a priority class
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority
value: 100000
globalDefault: false
description: "Production workloads with high priority"
Using priority class
spec:
  priorityClassName: high-priority
System priority classes
Kubernetes has system priority classes:
- system-cluster-critical (value 2000000000)
- system-node-critical (value 2000001000)
Use these for critical cluster components that should not be preempted.
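If you want a workload to jump the scheduling queue without evicting anything, a priority class can opt out of preemption. The sketch below assumes a cluster recent enough to support the preemptionPolicy field; the class name and value are illustrative.

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority-nonpreempting   # hypothetical name
value: 90000                          # illustrative value
preemptionPolicy: Never               # queue ahead of lower priorities, but never evict them
globalDefault: false
description: "High priority workloads that must not preempt other pods"
```

Pods using this class are still scheduled ahead of lower-priority pending pods, but they wait for capacity instead of displacing running pods.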
Custom Scheduler Configuration Example
For workloads needing specialized scheduling logic, deploy a custom scheduler alongside the default:
# Deploy the custom scheduler (ServiceAccount, RBAC binding, and Deployment)
kubectl apply -f custom-scheduler.yaml
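# custom-scheduler.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: custom-scheduler
  namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: custom-scheduler
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: system:kube-scheduler
subjects:
  - kind: ServiceAccount
    name: custom-scheduler
    namespace: kube-system
---
# The scheduler name is set through a KubeSchedulerConfiguration profile.
# Recent kube-scheduler releases no longer accept the --scheduler-name flag.
apiVersion: v1
kind: ConfigMap
metadata:
  name: custom-scheduler-config
  namespace: kube-system
data:
  scheduler-config.yaml: |
    apiVersion: kubescheduler.config.k8s.io/v1
    kind: KubeSchedulerConfiguration
    profiles:
      - schedulerName: high-memory-scheduler
    leaderElection:
      leaderElect: false
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: custom-scheduler
  namespace: kube-system
spec:
  replicas: 1
  selector:
    matchLabels:
      app: custom-scheduler
  template:
    metadata:
      labels:
        app: custom-scheduler
    spec:
      serviceAccountName: custom-scheduler
      containers:
        - name: scheduler
          # Use a full release tag that matches your cluster version
          image: registry.k8s.io/kube-scheduler:v1.28.0
          command:
            - /usr/local/bin/kube-scheduler
            - --config=/etc/kubernetes/custom-scheduler/scheduler-config.yaml
          volumeMounts:
            - name: config
              mountPath: /etc/kubernetes/custom-scheduler
              readOnly: true
      volumes:
        - name: config
          configMap:
            name: custom-scheduler-config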
Assign pods to the custom scheduler:
spec:
  schedulerName: high-memory-scheduler
  containers:
    - name: high-mem-app
      image: myapp:latest
Trade-off Analysis
| Feature | Affinity/Anti-Affinity | Topology Spread Constraints |
|---|---|---|
| Scope | Pod-to-pod or pod-to-node | Pods across topology domains |
| Matching | Label selectors | Label selectors |
| TopologyKey | Any label (hostname, zone, etc.) | Any label |
| Hard requirements | requiredDuringScheduling... | whenUnsatisfiable: DoNotSchedule |
| Soft preferences | preferredDuringScheduling... | whenUnsatisfiable: ScheduleAnyway |
| Explicit maxSkew | No | Yes — controls imbalance tolerance |
| Multiple constraints | Multiple affinity rules | Multiple topologySpreadConstraints entries |
| Failure handling | Pod stays Pending | Pod stays Pending or schedules anyway |
Use topology spread when you care about balanced distribution across zones or nodes (maxSkew is your control knob). Use affinity when you need to express relationships between specific pods or between pods and node labels.
Observability Hooks for Scheduling
What to monitor
Track pending pods, scheduling duration, and preemption events to catch scheduling problems early.
# Find pods stuck in Pending (likely scheduling issues)
kubectl get pods --all-namespaces --field-selector=status.phase=Pending
# List recent successful scheduling events (timestamps show when each pod was bound)
kubectl get events -A --sort-by='.lastTimestamp' | grep Scheduled
# Watch for preemption events
kubectl get events -A --sort-by='.lastTimestamp' | grep Preempt
# Check which nodes have what taints
kubectl get nodes -o json | jq '.items[] | {name: .metadata.name, taints: .spec.taints}'
Key metrics to track
- Pending pod count by namespace (alert if > 0 for > 5 minutes)
- Scheduling latency (time from pod creation to scheduled)
- Preemption count (high preemption rates suggest resource pressure)
- Node taint coverage (ensure critical workloads tolerate the taints on the nodes they must run on)
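The pending-pod alert from the list above could be expressed as a Prometheus rule along these lines. This is a sketch that assumes Prometheus scrapes kube-state-metrics (which exposes kube_pod_status_phase); the group name, alert name, and severity label are illustrative.

```yaml
# Prometheus rules file (assumes kube-state-metrics is installed and scraped)
groups:
  - name: scheduling                      # hypothetical group name
    rules:
      - alert: PodsPendingTooLong         # hypothetical alert name
        expr: sum by (namespace) (kube_pod_status_phase{phase="Pending"}) > 0
        for: 5m                           # only fire if pending persists for 5 minutes
        labels:
          severity: warning
        annotations:
          summary: "Pods pending in {{ $labels.namespace }} for more than 5 minutes"
```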
Debug commands for scheduling
# View why a pod cannot be scheduled
kubectl describe pod myapp-xxx -n production | grep -A20 "Events"
# Temporarily exclude a node from scheduling to see where pods land without it
kubectl cordon node1 # mark node unschedulable
kubectl uncordon node1 # restore
# Check scheduler logs (kubeadm-style control plane; the label may differ on other distributions)
kubectl logs -n kube-system -l component=kube-scheduler --tail=100
# Validate a pod spec against the API server (dry-run runs admission and validation, not the scheduler)
kubectl run test --image=nginx --dry-run=client -o yaml | kubectl apply --dry-run=server -f -
Production Failure Scenarios
When hard scheduling constraints cannot all be satisfied, pods stay stuck in Pending. The describe output shows a FailedScheduling event with a message such as "0/3 nodes are available: 3 node(s) didn't match Pod's node affinity/selector."
Check events:
kubectl describe pod <name>
kubectl get events -A --field-selector reason=FailedScheduling --sort-by='.lastTimestamp'
If you have hard affinity rules across too many dimensions (zone, node type, memory), the scheduler may never find a node satisfying all of them.
Use preferredDuringSchedulingIgnoredDuringExecution for softer requirements. Reserve hard requirements for constraints that truly cannot be violated.
Priority Preemption Causing Disruption
When a high-priority pod cannot find a node, Kubernetes evicts lower-priority pods to make room. This is preemption.
The problem is that evicted pods do not always get scheduled immediately elsewhere. Critical services can go down if their pods get preempted during peak load.
Set priority classes carefully. Do not make every production pod the same extreme priority. Use PDBs to protect critical services, keeping in mind that the scheduler honors PDBs during preemption only on a best-effort basis.
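A PodDisruptionBudget for the web-frontend pods used earlier might look like the sketch below. The name and the minAvailable value are illustrative. The scheduler tries to pick preemption victims whose eviction would not violate a PDB, but it can still violate one if no other victims exist.

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-frontend-pdb      # hypothetical name
spec:
  minAvailable: 2             # illustrative: keep at least two replicas running
  selector:
    matchLabels:
      app: web-frontend
```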
Taint NoExecute Evicting Critical Pods
The NoExecute effect is harsh. It evicts existing pods that do not tolerate the taint, not just prevents new scheduling.
If you taint a node with NoExecute to dedicate it to database workloads, every pod already running there without a matching toleration is evicted, including your monitoring agent.
Always add tolerations before tainting nodes. Use tolerationSeconds for a grace period on critical workloads.
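A toleration with a grace period could look like this sketch. The dedicated=database taint mirrors the earlier example, and the 600-second value is an arbitrary illustration.

```yaml
spec:
  tolerations:
    - key: "dedicated"
      operator: "Equal"
      value: "database"
      effect: "NoExecute"
      tolerationSeconds: 600   # stay up to 10 minutes after the taint appears, then get evicted
```

Omitting tolerationSeconds on a matching NoExecute toleration means the pod is never evicted because of that taint.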
Anti-Patterns
Too Many Hard Affinity Requirements
requiredDuringSchedulingIgnoredDuringExecution with many constraints is a recipe for Pending pods. If you need GPU, SSD, and specific zone, you might as well ask for a unicorn node.
Keep hard requirements minimal. Use preferences for everything else.
Undocumented Taints
Operations teams add taints without telling developers. Developers wonder why their pods will not schedule. It happens.
Maintain a taint inventory. Document which workloads tolerate which taints. Put tolerations in the workload manifests.
Skipping topologySpreadConstraints
podAntiAffinity is good but not precise for zone distribution. topologySpreadConstraints is explicit about how imbalance is measured and what happens when constraints cannot be satisfied. Use it for anything that matters.
Interview Questions
Expected answer points:
- Taints are applied to nodes to repel pods that do not have matching tolerations
- Tolerations are applied to pods to allow them to be scheduled on tainted nodes
- Taint effects: NoSchedule (prevent scheduling), PreferNoSchedule (soft repel), NoExecute (evict existing pods)
- The Exists operator matches any value for a key; Equal requires an exact match
- Use case: reserving GPU nodes for ML workloads by tainting with gpu-workload:NoSchedule
Expected answer points:
- Hard affinity (requiredDuringSchedulingIgnoredDuringExecution) enforces requirements strictly
- Soft affinity (preferredDuringSchedulingIgnoredDuringExecution) prefers but does not require
- Hard requirements cause pods to stay Pending if unsatisfied; soft preferences are best-effort
- Soft affinity has a weight property (1-100) that influences scoring relative to other preferences
- Use hard affinity for strict requirements like zone placement; soft for optimization
Expected answer points:
- topologySpreadConstraints provides explicit maxSkew control for balanced distribution
- Pod anti-affinity expresses relationships but does not measure imbalance directly
- topologySpreadConstraints supports whenUnsatisfiable: DoNotSchedule vs ScheduleAnyway for hard vs soft constraints
- maxSkew measures the difference in pod count between any two topology domains
- Use topologySpreadConstraints when balanced zone distribution is critical; pod anti-affinity when replicas must never share a domain
Expected answer points:
- The pod stays in Pending state until constraints can be satisfied or are modified
- kubectl describe pod shows a FailedScheduling event such as "0/3 nodes are available: 3 node(s) didn't match Pod's node affinity/selector"
- The event reason is FailedScheduling; the pod's PodScheduled condition is False with reason Unschedulable, and the message details the constraint that failed
- Use kubectl get events -A --field-selector reason=FailedScheduling to find affected pods
- Solution is either to relax constraints (preferred) or add more nodes matching requirements
Expected answer points:
- NoSchedule only prevents new pods from scheduling on the tainted node; existing pods are unaffected
- NoExecute evicts existing pods that do not tolerate the taint from the node
- Tolerations for NoExecute taints can optionally set tolerationSeconds for a grace period before eviction
- Use NoSchedule for dedicated nodes where you do not want new workloads; NoExecute for maintenance scenarios
- NoExecute is harsher and should be used carefully for production nodes
Expected answer points:
- maxSkew defines the maximum allowed difference in pod count between any two topology domains
- Calculation: (max pod count in any zone) minus (min pod count in any zone)
- With 6 pods and 3 zones (3, 2, 1): the skew is 2, which violates maxSkew: 1
- The scheduler tries to place pods to keep maxSkew within the limit during new pod scheduling
- When unsatisfiable with DoNotSchedule, pods stay Pending; with ScheduleAnyway, they schedule anyway
Expected answer points:
- Higher priority pods schedule before lower priority pods when resources are constrained
- When a high-priority pod cannot find a node, the scheduler may preempt (evict) lower priority pods
- Preempted pods are terminated; their controllers recreate them, but the replacements may not find capacity immediately
- system-cluster-critical and system-node-critical are reserved for critical cluster components
- Use PDBs alongside priority classes; the scheduler honors PDBs during preemption only on a best-effort basis
Expected answer points:
- Use a custom scheduler when workloads have specialized placement requirements the default scheduler cannot express
- Examples: GPU binding to specific GPU devices, NUMA-aware scheduling, geographic constraints
- Run multiple schedulers side-by-side and assign pods via schedulerName field
- Custom schedulers need their own leader election setup for HA if multiple replicas run
- For most cases, the default scheduler with affinity/taints/tolerations is sufficient
Expected answer points:
- A node can have both a taint (to repel untolerated pods) and labels (for affinity matching)
- A pod can have both a toleration (to tolerate the taint) and node affinity (to require or prefer matching nodes)
- The combination ensures pods land on dedicated nodes: the toleration allows scheduling there, and the affinity steers the pod to those nodes
- Example: dedicated database node with taint dedicated=database:NoSchedule and label dedicated=database
- Pod tolerates the taint and has requiredDuringScheduling affinity to the dedicated label
Expected answer points:
- Pending pod count by namespace: alert if > 0 for more than 5 minutes
- Scheduling latency: time from pod creation to scheduled, watch for increasing latency
- Preemption events: high preemption rates indicate resource pressure
- Node taint coverage: ensure critical workloads tolerate all taints, check via kubectl get nodes -o json
- kube-scheduler logs show scheduling decisions and failures
Expected answer points:
- nodeAffinity matches against node labels, expressing relationships between pods and nodes
- podAntiAffinity matches against pod labels, expressing relationships between pods
- podAntiAffinity with topologyKey: topology.kubernetes.io/zone spreads pods across zones
- podAntiAffinity with topologyKey: kubernetes.io/hostname spreads pods across nodes (nodeAffinity has no topologyKey)
- Combine both to spread pods across zones while landing on nodes with specific hardware
Expected answer points:
- Audit hard affinity requirements and convert non-critical ones to preferred
- Reserve hard requirements for true invariants like regulatory zone requirements
- Monitor pending pods and set alerts before they accumulate
- Use preferredDuringSchedulingIgnoredDuringExecution with weights (1-100) to rank soft preferences
- When adding new constraints, test in staging first with realistic pod counts
Expected answer points:
- Filtering removes nodes that cannot run the pod: resource constraints, taints, node selectors
- Scoring ranks remaining nodes based on affinity rules, resource utilization, topology spread
- The scheduler binds the pod to the highest-scoring node after both phases
- If no nodes pass filtering, the pod stays Pending with reason Unschedulable
- Custom schedulers can replace or supplement these plugins
Expected answer points:
- tolerationSeconds specifies how long a pod tolerates a NoExecute taint before being evicted
- If tolerationSeconds is not specified, a pod with a matching NoExecute toleration stays on the node indefinitely; pods with no matching toleration are evicted immediately
- With tolerationSeconds: 3600, the pod has 1 hour before eviction if the node gets tainted
- Use this for critical workloads to give them time to reschedule gracefully
- If the taint is removed before tolerationSeconds elapses, the pod is not evicted
Expected answer points:
- topologyKey is the label key that defines the topology domain for spreading
- topology.kubernetes.io/zone treats each availability zone as a domain
- kubernetes.io/hostname treats each kubelet (node) as a domain
- Custom labels can define other topology boundaries like rack or region
- All pods with the same labelSelector are considered together for distribution across topologyKey domains
Expected answer points:
- requiredDuringScheduling is a hard requirement: unsatisfiable constraints leave pods Pending
- preferredDuringScheduling is a soft preference: scheduler tries but proceeds if impossible
- Use required for strict HA requirements like spreading replicas across zones
- Use preferred when perfect spread is nice-to-have but not critical
- preferred with high weight (100) approaches required behavior while remaining softer
Expected answer points:
- Use topologySpreadConstraints with maxSkew: 1 and topologyKey: topology.kubernetes.io/zone for zone distribution
- Add preferred podAntiAffinity for the cache layer so replicas prefer different nodes
- With multiple constraints, the scheduler finds a node satisfying all constraints
- The cache pod anti-affinity uses kubernetes.io/hostname so replicas prefer different nodes but can share
- The zone spread constraint ensures overall zone balance as the primary concern
Expected answer points:
- PersistentVolumes are zone-specific in most cloud providers (EBS in AWS, PD in GCP)
- If the PVC is bound to a volume in a different zone, the pod cannot start on a node in another zone
- Use volumeBindingMode: WaitForFirstConsumer in StorageClass to delay binding until scheduling
- The scheduler then places the pod in the same zone as the volume
- Check PVC status and events: kubectl describe pvc <name>, then kubectl describe pod <name> | grep -A10 "Events"
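A StorageClass with delayed binding, as described in the points above, might be sketched like this. It assumes the AWS EBS CSI driver is installed; the class name and volume type are illustrative and should be swapped for your provider's values.

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: zonal-ssd                         # hypothetical name
provisioner: ebs.csi.aws.com              # assumes the AWS EBS CSI driver
volumeBindingMode: WaitForFirstConsumer   # provision only after a pod using the PVC is scheduled
parameters:
  type: gp3                               # illustrative volume type
```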
Expected answer points:
- Scoring plugins rank nodes based on multiple criteria: affinity preferences, resource utilization, topology spread
- Each scoring plugin returns a score; the scheduler sums them weighted by plugin priority
- Nodes with higher total scores are preferred
- Affinity rules contribute to scoring; preferredDuringScheduling weight determines influence
- The scheduler picks the node with the highest score and binds the pod to it
Expected answer points:
- kubectl drain cordons the node and evicts pods respecting PDBs and terminationGracePeriodSeconds
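- kubectl drain uses the Eviction API rather than a NoExecute taint, so tolerations do not exempt pods from a drain; DaemonSet-managed pods are skipped only with --ignore-daemonsets
- When maintenance relies on NoExecute taints instead, critical pods without matching tolerations are evicted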
- Add tolerations for cluster-level taints like node.kubernetes.io/not-ready:NoExecute
- The drain command waits for pods to terminate; --force is needed to remove pods that are not managed by a controller
Further Reading
Official Documentation
- Taints and Tolerations — Kubernetes scheduling guide
- Node Affinity — Affinity and anti-affinity documentation
- Topology Spread Constraints — Even distribution across topology domains
- Pod Priority and Preemption — Scheduling priority classes
Articles and Guides
- Advanced Pod Scheduling in Kubernetes — Deep dive into scheduling mechanisms
- Kubernetes Scheduling: The Complete Guide — Practical scheduling patterns with monitoring
Related Guides
- Kubernetes High Availability — Multi-AZ deployments and PDBs
- Kubernetes Custom Controllers — Extending Kubernetes with custom controllers
Conclusion
Kubernetes scheduling gives you fine-grained control over pod placement. Taints and tolerations let nodes repel pods that do not explicitly tolerate them. Node affinity places pods on nodes with specific labels. Pod affinity keeps related pods together; pod anti-affinity spreads them apart. Topology spread constraints distribute pods across failure domains like availability zones.
These mechanisms work together. A typical production setup might use taints to reserve GPU nodes, node affinity to place workloads on appropriate nodes, pod anti-affinity to spread replicas across zones, and topology spread to ensure balanced distribution.
Priority classes and preemption ensure critical workloads get scheduled first, even if it means evicting lower priority pods.
For more on keeping applications highly available, see the High Availability post.