Advanced Pod Scheduling: Affinity, Taints, Tolerations, Topology
Fine-tune pod placement across your Kubernetes cluster using node affinity, pod affinity/anti-affinity, taints/tolerations, and topology spread constraints.
Kubernetes schedules pods onto nodes automatically. The default scheduler looks at resource requests and finds a node with enough capacity. But sometimes you need more control. You might want to run workloads on specific nodes, keep certain pods separated, or distribute replicas across availability zones. Kubernetes provides several mechanisms for this: taints, tolerations, affinity, and topology spread constraints.
This post covers how Kubernetes scheduling works by default and how to override it for specialized workloads.
For Kubernetes basics, see the Kubernetes fundamentals post. For high availability patterns, see the High Availability post.
When to Use / When Not to Use
Use scheduling mechanisms when
- Specialized hardware is involved. GPU nodes, high-memory machines, nodes with local SSDs all warrant explicit placement decisions.
- Multiple teams share a cluster and need isolation. Production workloads should not share nodes with less-trusted workloads.
- Availability requirements demand zone distribution. If a single AZ outage should not take down your service, pods need to spread.
- Related pods should co-locate or spread apart. A cache and its backend probably want to be near each other. Replicas of the same service definitely do not.
Skip custom scheduling when
- Default scheduling works. The scheduler is good at what it does. Do not add constraints unless you have a problem default scheduling does not solve.
- Your cluster is small. If you have 3 nodes and no specialized hardware, the overhead of managing affinity rules is not worth it.
- You are adding constraints “just in case.” Over-constrained clusters leave pods pending.
Scheduling Decision Flow
```mermaid
flowchart TD
    A[Pod created<br/>no node assigned] --> B[Filter: Which nodes<br/>can run this pod?]
    B --> C{Any nodes<br/>pass filtering?}
    C -->|No| D[Pod stays Pending]
    C -->|Yes| E[Score: Rank nodes<br/>by scheduling criteria]
    E --> F[Select highest<br/>scoring node]
    F --> G[Bind pod to node<br/>Pod starts]
```
The scheduler filters nodes that cannot run the pod (resource constraints, taints, selectors), then scores the remaining nodes to pick the best fit.
How Default Scheduling Works
The Kubernetes scheduler runs as a component on the control plane. When a pod gets created without a node assignment, the scheduler evaluates all available nodes and picks the best one.
The scheduling process has two phases:
- Filtering: Find nodes that can run the pod (based on resource availability, node selectors, taints)
- Scoring: Rank the filtered nodes (based on affinity rules, resource utilization, topology)
The scheduler implements each phase as a set of plugins: filtering plugins remove nodes that cannot run the pod, and scoring plugins rank the nodes that remain.
You can configure multiple schedulers and assign pods to specific schedulers:
```yaml
spec:
  schedulerName: custom-scheduler
```
Without this field, pods use the default scheduler.
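Instead of running a second scheduler binary, you can also define multiple profiles in a single scheduler. A minimal sketch, assuming a recent Kubernetes version with the kubescheduler.config.k8s.io/v1 API; the profile name custom-scheduler and the disabled plugin are illustrative choices:

```yaml
# KubeSchedulerConfiguration passed to kube-scheduler via --config.
# One scheduler binary serves both the default profile and a second
# profile that pods select with spec.schedulerName.
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
- schedulerName: default-scheduler
- schedulerName: custom-scheduler
  plugins:
    score:
      disabled:
      - name: NodeResourcesBalancedAllocation  # example: tweak scoring for this profile
```

Profiles share one process and one leader election, which is simpler to operate than a second scheduler Deployment.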
Taints and Tolerations for Node Exclusion
Taints repel pods from nodes. If a node has a taint, pods without a matching toleration cannot schedule onto that node. This is how you mark nodes for special purposes.
Adding a taint to a node
```shell
kubectl taint nodes node1 dedicated=postgres:NoSchedule
```
This taint has a key (dedicated), value (postgres), and effect (NoSchedule). The effect determines what happens to pods without a matching toleration:
| Effect | Behavior |
|---|---|
| NoSchedule | Pods without matching toleration are not scheduled |
| PreferNoSchedule | Scheduler tries to avoid but may schedule if necessary |
| NoExecute | Existing pods without matching toleration are evicted |
Tolerations in a pod spec
```yaml
spec:
  tolerations:
  - key: "dedicated"
    operator: "Equal"
    value: "postgres"
    effect: "NoSchedule"
```
The toleration says this pod tolerates nodes with the dedicated=postgres taint. With this toleration, the pod can schedule onto the tainted node.
Using Equal vs Exists operator
```yaml
# Equal: match the key and a specific value
tolerations:
- key: "dedicated"
  operator: "Equal"
  value: "postgres"
  effect: "NoSchedule"

# Exists: match the key regardless of value
tolerations:
- key: "dedicated"
  operator: "Exists"
  effect: "NoSchedule"
```
The Exists operator matches any value for the key.
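Taken to its extreme, an Exists toleration with no key at all matches every taint — useful for node-level agents (log collectors, monitoring daemonsets) that must run on every node. A sketch:

```yaml
# Tolerates all taints: an empty key with operator Exists matches
# every taint key, and omitting effect matches every effect.
tolerations:
- operator: "Exists"
```

Use this sparingly: a pod that tolerates everything defeats the isolation the taints were meant to provide.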
Common taint use cases
Mark nodes for specific workloads:
```shell
kubectl taint nodes node-gpu gpu-workload:NoSchedule
kubectl taint nodes node-batch batch-job:NoExecute
kubectl taint nodes node1 infra=proxy:PreferNoSchedule
```
Pods that need GPUs add a toleration:
```yaml
tolerations:
- key: "gpu-workload"
  operator: "Exists"
  effect: "NoSchedule"
```
Node Affinity Rules
Node affinity is more expressive than node selectors. It allows matching rules based on node labels and supports soft preferences (preferredDuringSchedulingIgnoredDuringExecution) in addition to hard requirements (requiredDuringSchedulingIgnoredDuringExecution).
Required node affinity
```yaml
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: "topology.kubernetes.io/zone"
            operator: In
            values:
            - us-east-1a
            - us-east-1b
          - key: "node.kubernetes.io/disktype"
            operator: In
            values:
            - ssd
```
The pod must run on a node in zone us-east-1a or us-east-1b with an SSD disk. If no node matches, the pod stays unscheduled.
Preferred node affinity
```yaml
spec:
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 1
        preference:
          matchExpressions:
          - key: "node.kubernetes.io/networking-type"
            operator: In
            values:
            - fast
      - weight: 50
        preference:
          matchExpressions:
          - key: "topology.kubernetes.io/zone"
            operator: In
            values:
            - us-east-1a
```
The scheduler considers both preferences, but with weight 50 versus weight 1, the us-east-1a zone preference far outweighs the fast-networking preference. Higher weight means stronger preference.
Combining node affinity and taints
```yaml
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: "dedicated"
            operator: In
            values:
            - database
  tolerations:
  - key: "dedicated"
    operator: "Equal"
    value: "database"
    effect: "NoSchedule"
```
This pod requires a node labeled dedicated=database and tolerates that taint. The combination ensures the pod lands on a dedicated database node.
Pod Affinity and Anti-Affinity
Pod affinity keeps pods together. Pod anti-affinity keeps pods separated. Use these to co-locate related pods or distribute replicas across failure domains.
Pod anti-affinity for high availability
```yaml
spec:
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: "app"
            operator: In
            values:
            - web-frontend
        topologyKey: topology.kubernetes.io/zone
```
This forces web-frontend pods to spread across availability zones: each zone holds at most one such pod, so three replicas land in three different zones. Because the rule is required, a replica with no eligible zone left stays Pending rather than doubling up.
Pod affinity for co-location
```yaml
spec:
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: "app"
            operator: In
            values:
            - redis
        topologyKey: kubernetes.io/hostname
```
This places pods on the same node as their Redis cache. The topologyKey kubernetes.io/hostname means same physical node (same kubelet).
Preferred pod anti-affinity
Hard requirements may prevent scheduling if constraints cannot be satisfied. Use preferred for softer constraints:
```yaml
spec:
  affinity:
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchExpressions:
            - key: "app"
              operator: In
              values:
              - web-frontend
          topologyKey: topology.kubernetes.io/zone
```
The scheduler tries to spread pods across zones but does not fail if it cannot.
Topology Spread for HA
Topology spread provides more control over distribution than pod anti-affinity. You can specify how pods spread across failure domains while still meeting minimum availability requirements.
```yaml
spec:
  topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app: web-frontend
  - maxSkew: 1
    topologyKey: kubernetes.io/hostname
    whenUnsatisfiable: ScheduleAnyway
    labelSelector:
      matchLabels:
        app: web-frontend
```
The first constraint requires web-frontend pods to spread across zones with maxSkew of 1. If unsatisfiable, pods do not schedule (DoNotSchedule).
The second constraint prefers spreading across nodes within a zone. If unsatisfiable, pods schedule anyway (ScheduleAnyway).
maxSkew explained
Skew is the difference in pod count between the most and least loaded domains. If you have 6 pods and 3 zones:
- Zone A: 3 pods
- Zone B: 2 pods
- Zone C: 1 pod
The skew is 2 (zone A has 2 more pods than zone C), which violates maxSkew: 1. Kubernetes would try to balance toward 2 pods per zone.
If zone C has 0 pods:
- Zone A: 3 pods
- Zone B: 3 pods
- Zone C: 0 pods
The skew is 3, which also violates maxSkew: 1. The scheduler would place the next pod in zone C.
Priority Classes and Preemption
Pod priority affects scheduling order. Higher priority pods schedule before lower priority pods. If a node runs out of resources, the scheduler may preempt (evict) lower priority pods to make room for higher priority pods.
Defining a priority class
```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority
value: 100000
globalDefault: false
description: "Production workloads with high priority"
```
Using priority class
```yaml
spec:
  priorityClassName: high-priority
```
System priority classes
Kubernetes has system priority classes:
- system-cluster-critical (value 2000000000)
- system-node-critical (value 2000001000)
Use these for critical cluster components that should not be preempted.
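If a workload should jump the scheduling queue without evicting anything, a PriorityClass can opt out of preemption via preemptionPolicy. A sketch; the name and value here are illustrative:

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority-nonpreempting
value: 90000
preemptionPolicy: Never   # schedule ahead of lower-priority pods, but never evict them
globalDefault: false
description: "High priority without preemption"
```

This is a good fit for batch jobs that deserve early placement but should not destabilize running services.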
Custom Scheduler Configuration Example
For workloads needing specialized scheduling logic, deploy a custom scheduler alongside the default:
```shell
# Deploy the custom scheduler
kubectl apply -f custom-scheduler.yaml
```

```yaml
# custom-scheduler.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: custom-scheduler
  namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: custom-scheduler
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: system:kube-scheduler
subjects:
- kind: ServiceAccount
  name: custom-scheduler
  namespace: kube-system
---
# Recent kube-scheduler versions take the scheduler name from a
# KubeSchedulerConfiguration file (--config) rather than a
# --scheduler-name flag, so the name is set in this ConfigMap.
apiVersion: v1
kind: ConfigMap
metadata:
  name: custom-scheduler-config
  namespace: kube-system
data:
  config.yaml: |
    apiVersion: kubescheduler.config.k8s.io/v1
    kind: KubeSchedulerConfiguration
    leaderElection:
      leaderElect: false
    profiles:
    - schedulerName: high-memory-scheduler
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: custom-scheduler
  namespace: kube-system
spec:
  replicas: 1
  selector:
    matchLabels:
      app: custom-scheduler
  template:
    metadata:
      labels:
        app: custom-scheduler
    spec:
      serviceAccountName: custom-scheduler
      containers:
      - name: scheduler
        image: registry.k8s.io/kube-scheduler:v1.28.0
        command:
        - /usr/local/bin/kube-scheduler
        - --config=/etc/kubernetes/custom-scheduler/config.yaml
        volumeMounts:
        - name: config
          mountPath: /etc/kubernetes/custom-scheduler
      volumes:
      - name: config
        configMap:
          name: custom-scheduler-config
```
Assign pods to the custom scheduler:
```yaml
spec:
  schedulerName: high-memory-scheduler
  containers:
  - name: high-mem-app
    image: myapp:latest
```
Affinity vs Topology Spread Trade-offs
| Feature | Affinity/Anti-Affinity | Topology Spread Constraints |
|---|---|---|
| Scope | Pod-to-pod or pod-to-node | Pods across topology domains |
| Matching | Label selectors | Label selectors |
| TopologyKey | Any label (hostname, zone, etc.) | Any label |
| Hard requirements | requiredDuringScheduling... | whenUnsatisfiable: DoNotSchedule |
| Soft preferences | preferredDuringScheduling... | whenUnsatisfiable: ScheduleAnyway |
| Explicit maxSkew | No | Yes — controls imbalance tolerance |
| Multiple constraints | Multiple affinity rules | Multiple topologySpreadConstraints entries |
| Failure handling | Pod stays Pending | Pod stays Pending or schedules anyway |
Use topology spread when you care about balanced distribution across zones or nodes (maxSkew is your control knob). Use affinity when you need to express relationships between specific pods or between pods and node labels.
Observability Hooks for Scheduling
What to monitor
Track pending pods, scheduling duration, and preemption events to catch scheduling problems early.
```shell
# Find pods stuck in Pending (likely scheduling issues)
kubectl get pods --all-namespaces --field-selector=status.phase=Pending

# Check scheduling duration for recent pods
kubectl get events --sort-by='.lastTimestamp' | grep Scheduled

# Watch for preemption events
kubectl get events --sort-by='.lastTimestamp' | grep Preempt

# Check which nodes have what taints
kubectl get nodes -o json | jq '.items[].spec.taints'
```
Key metrics to track
- Pending pod count by namespace (alert if > 0 for > 5 minutes)
- Scheduling latency (time from pod creation to scheduled)
- Preemption count (high preemption rates suggest resource pressure)
- Node taint coverage (ensure critical workloads tolerate all taints)
Debug commands for scheduling
```shell
# View why a pod cannot be scheduled
kubectl describe pod myapp-xxx -n production | grep -A20 "Events"

# Test whether a pod can schedule when a node is unavailable
kubectl cordon node1    # temporarily mark node unschedulable
kubectl uncordon node1  # restore

# Check scheduler logs (control plane)
kubectl logs -n kube-system kube-scheduler-control-plane --tail=100

# Simulate pod admission with a server-side dry run
kubectl run test --image=nginx --dry-run=server -o yaml
```
Production Failure Scenarios
Over-Constrained Pods Stuck in Pending
Too many scheduling constraints that cannot all be satisfied leave pods stuck in Pending. The describe output says something like no nodes match affinity rules.
Check events:
```shell
kubectl describe pod <name>
kubectl get events --sort-by='.lastTimestamp' | grep Unschedulable
```
If you have hard affinity rules across too many dimensions (zone, node type, memory), the scheduler may never find a node satisfying all of them.
Use preferredDuringSchedulingIgnoredDuringExecution for softer requirements. Reserve hard requirements for constraints that truly cannot be violated.
Priority Preemption Causing Disruption
When a high-priority pod cannot find a node, Kubernetes evicts lower-priority pods to make room. This is preemption.
The problem is that evicted pods do not always get scheduled immediately elsewhere. Critical services can go down if their pods get preempted during peak load.
Set priority classes carefully. Do not give every production pod the same extreme priority. PodDisruptionBudgets offer some protection: the scheduler tries to respect PDBs when choosing preemption victims, but this is best-effort and can still be violated.
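A PodDisruptionBudget steers the scheduler toward other victims when it preempts. A minimal sketch for the web-frontend example used earlier:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-frontend-pdb
spec:
  minAvailable: 2          # keep at least 2 web-frontend pods running
  selector:
    matchLabels:
      app: web-frontend
```

PDBs are enforced strictly for voluntary disruptions (drains, rollouts); for preemption they are only a preference, so pair them with sensible priority values.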
Taint NoExecute Evicting Critical Pods
The NoExecute effect is harsh. It evicts existing pods that do not tolerate the taint, not just prevents new scheduling.
If you taint a node for dedicated database workloads with NoExecute, every pod already on it without a matching toleration — your monitoring agent, for example — is evicted immediately.
Always add tolerations before tainting nodes. Use tolerationSeconds for a grace period on critical workloads.
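tolerationSeconds bounds how long a pod may remain on a node after a NoExecute taint appears. A sketch using the built-in not-ready taint that Kubernetes applies automatically to unhealthy nodes:

```yaml
tolerations:
- key: "node.kubernetes.io/not-ready"
  operator: "Exists"
  effect: "NoExecute"
  tolerationSeconds: 300   # tolerate an unready node for 5 minutes, then evict
```

Without tolerationSeconds, an Exists toleration for a NoExecute taint means the pod stays on the node indefinitely.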
Anti-Patterns
Too Many Hard Affinity Requirements
requiredDuringSchedulingIgnoredDuringExecution with many constraints is a recipe for Pending pods. If you need GPU, SSD, and a specific zone, you might as well ask for a unicorn node.
Keep hard requirements minimal. Use preferences for everything else.
Undocumented Taints
Operations teams add taints without telling developers. Developers wonder why their pods will not schedule. It happens.
Maintain a taint inventory. Document which workloads tolerate which taints. Put tolerations in the workload manifests.
Skipping topologySpreadConstraints
podAntiAffinity is good but not precise for zone distribution. topologySpreadConstraints is explicit about how imbalance is measured and what happens when constraints cannot be satisfied. Use it for anything that matters.
Quick Recap Checklist
- Taints and tolerations set up for node dedication
- Node affinity places workloads on the right nodes
- podAntiAffinity spreads replicas across nodes or zones
- topologySpreadConstraints handles explicit zone distribution
- Priority classes set appropriately for critical workloads
- tolerationSeconds configured for graceful NoExecute handling
- Scheduling tested in staging before production
- Unschedulable pods monitored with alerts
Conclusion
Kubernetes scheduling gives you fine-grained control over pod placement. Taints and tolerations let nodes repel pods that do not explicitly tolerate them. Node affinity places pods on nodes with specific labels. Pod affinity keeps related pods together; pod anti-affinity spreads them apart. Topology spread constraints distribute pods across failure domains like availability zones.
These mechanisms work together. A typical production setup might use taints to reserve GPU nodes, node affinity to place workloads on appropriate nodes, pod anti-affinity to spread replicas across zones, and topology spread to ensure balanced distribution.
Priority classes and preemption ensure critical workloads get scheduled first, even if it means evicting lower priority pods.
For more on keeping applications highly available, see the High Availability post.