Process Scheduling Algorithms

A comprehensive guide to CPU scheduling algorithms including FCFS, SJF, Round Robin, Priority scheduling, and multi-level feedback queues.

published: May 19, 2026 reading time: 33 min read author: GeekWorkBench

Imagine you’re an air traffic controller managing a runway. You have planes (processes) of different sizes, different urgency levels, and different destinations (CPU burst lengths). Your job: keep the runway busy, minimize waiting time, and somehow satisfy everyone.

This is essentially what the CPU scheduler does — every millisecond, deciding which runnable process gets the CPU. The algorithm you choose determines whether your system feels responsive or sluggish, whether short tasks complete quickly or get buried under long ones.

Introduction

CPU scheduling algorithms decide which runnable process to run next on each CPU core. The scheduler makes this decision based on various factors: burst time, priority, waiting time, and the scheduling policy.

Modern systems often use multiple algorithms in combination. Linux’s Completely Fair Scheduler (CFS, superseded by the EEVDF scheduler in kernel 6.6) handles normal processes, while separate scheduling classes handle real-time tasks. Windows uses a multi-level feedback queue that dynamically adjusts priorities.

When to Use

System tuning — Choosing the right scheduler improves responsiveness
Real-time applications — Guaranteeing deadlines requires specific algorithms
Batch processing — Maximizing throughput is the primary goal
Multi-user systems — Fairness ensures no user gets starved

When Not to Use

Simple scripts — Default scheduler is sufficient
Single-user interactive — System defaults are optimized for this
Embedded with fixed workload — Custom schedulers add complexity

Types of Scheduling

Non-Preemptive vs Preemptive

Non-preemptive: Once a process starts running, it runs to completion (or until it blocks). Simple but can’t respond to higher priority tasks quickly.

Preemptive: The scheduler can interrupt a running process and switch to another. Modern OSes use preemptive scheduling for responsiveness.

graph LR
    subgraph Non-Preemptive
        A1[Process A starts] --> A2[A runs to completion]
    end

    subgraph Preemptive
        B1[Process B starts] --> B2[B preempted]
        B2 --> C1[Process C runs]
        C1 --> B3[B resumes]
    end

    style Non-Preemptive stroke:#4a1a1a
    style Preemptive stroke:#1a3a1a

Core Algorithms

First-Come, First-Served (FCFS)

The simplest algorithm: processes run in the order they arrived.

Pros: Simple, low overhead
Cons: Convoy effect (short processes wait behind long ones)

// Simulate FCFS scheduling
#include <stdio.h>

typedef struct {
    int pid;
    int arrival_time;
    int burst_time;
} Process;

void fcfs_schedule(Process procs[], int n) {
    int current_time = 0;

    printf("FCFS Scheduling:\n");
    printf("%-10s %-15s %-15s %-15s\n", "PID", "Start Time", "End Time", "Wait Time");

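    // Assumes procs[] is sorted by arrival_time
    for (int i = 0; i < n; i++) {
        // If the CPU is idle when the process arrives, jump ahead to its arrival
        if (current_time < procs[i].arrival_time) {
            current_time = procs[i].arrival_time;
        }
        int start = current_time;
        int wait = start - procs[i].arrival_time;
        current_time += procs[i].burst_time;
        int end = current_time;

        printf("%-10d %-15d %-15d %-15d\n", procs[i].pid, start, end, wait);
    }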
}

Shortest Job First (SJF)

Select the process with the shortest burst time. Optimal for minimizing average waiting time — but requires knowing burst times in advance (which is rarely possible).

Pros: Optimal average wait time
Cons: Starvation of long processes; requires burst time prediction

// Simulate SJF scheduling (non-preemptive); uses the Process struct from the FCFS example
#include <stdio.h>
#include <limits.h>

void sjf_schedule(Process procs[], int n) {
    int completed = 0;
    int current_time = 0;
    int total_wait = 0;

    int visited[n];  // 1 once a process has run
    for (int i = 0; i < n; i++) visited[i] = 0;

    printf("\nSJF Scheduling:\n");
    printf("%-10s %-15s %-15s %-15s %-15s\n", "PID", "Arrival", "Burst", "Start", "Wait");

    while (completed < n) {
        int shortest = -1;
        int min_burst = INT_MAX;

        // Find shortest job that's arrived and not completed
        for (int i = 0; i < n; i++) {
            if (!visited[i] && procs[i].arrival_time <= current_time &&
                procs[i].burst_time < min_burst) {
                shortest = i;
                min_burst = procs[i].burst_time;
            }
        }

        if (shortest == -1) {
            current_time++;
            continue;
        }

        int start = current_time;
        int wait = start - procs[shortest].arrival_time;
        current_time += procs[shortest].burst_time;

        printf("%-10d %-15d %-15d %-15d %-15d\n",
               procs[shortest].pid, procs[shortest].arrival_time,
               procs[shortest].burst_time, start, wait);

        total_wait += wait;
        visited[shortest] = 1;
        completed++;
    }

    printf("Average wait time: %.2f\n", (float)total_wait / n);
}

Round Robin (RR)

Each process gets a fixed time slice (quantum). When the quantum expires, the process goes to the back of the ready queue. Designed for time-sharing.

Pros: Fair, good response time, prevents starvation
Cons: Higher context switch overhead; behavior depends on quantum size

#!/usr/bin/env python3
"""Round Robin Scheduler Simulation."""

from collections import deque

class Process:
    def __init__(self, pid, burst, arrival=0):
        self.pid = pid
        self.burst = burst
        self.arrival = arrival
        self.remaining = burst
        self.wait_time = 0
        self.turnaround = 0

def round_robin(processes, quantum):
    # Only processes that have already arrived sit in the ready queue
    pending = deque(sorted(processes, key=lambda p: p.arrival))
    ready = deque()
    current_time = 0
    completed = []

    print(f"Round Robin (quantum={quantum}):")
    print(f"{'Time':<8} {'Event':<20} {'Ready Queue'}")

    while pending or ready:
        if not ready:
            # CPU idle: skip ahead to the next arrival
            current_time = max(current_time, pending[0].arrival)
        while pending and pending[0].arrival <= current_time:
            ready.append(pending.popleft())

        process = ready.popleft()

        # Execute for quantum or until completion
        exec_time = min(quantum, process.remaining)
        current_time += exec_time
        process.remaining -= exec_time

        # Processes that arrived during this slice queue ahead of the preempted one
        while pending and pending[0].arrival <= current_time:
            ready.append(pending.popleft())

        if process.remaining == 0:
            process.turnaround = current_time - process.arrival
            # Waiting time is everything that was not execution
            process.wait_time = process.turnaround - process.burst
            completed.append(process)
            event = f"P{process.pid} completes"
        else:
            ready.append(process)
            event = f"P{process.pid} preempted"

        queue = " ".join(f"P{p.pid}" for p in ready) or "-"
        print(f"{current_time:<8} {event:<20} {queue}")

    return completed

# Example
processes = [Process(1, 24), Process(2, 3), Process(3, 3)]
result = round_robin(processes, 4)

print(f"\n{'PID':<8} {'Burst':<8} {'Wait':<8} {'Turnaround':<12}")
print("-" * 40)
for p in result:
    print(f"{p.pid:<8} {p.burst:<8} {p.wait_time:<8} {p.turnaround:<12}")

Priority Scheduling

Processes are assigned priorities. Higher priority processes run first. Two variants:

Preemptive: A newly arrived higher-priority process preempts the running one
Non-preemptive: Priority only matters when the current process blocks or finishes

Problem: Starvation — low priority processes may never run if higher priority ones keep arriving.

Solution: Aging — gradually increase priority of waiting processes.

// Priority Scheduling with Aging
#include <stdio.h>

#define BASE_PRIORITY 50
#define AGING_RATE 1  // Priority increases by 1 per time unit waiting

typedef struct {
    int pid;
    int base_priority;  // Lower number = higher priority
    int dynamic_priority;
    int remaining_burst;
    int waiting_time;
} Process;

void priority_with_aging(Process procs[], int n, int time_units) {
    printf("\nPriority Scheduling with Aging:\n");

    int completed = 0;
    int current_time = 0;

    // Count zero-length jobs as done so the loop can terminate
    for (int i = 0; i < n; i++) {
        if (procs[i].remaining_burst <= 0) completed++;
    }

    // Simulate until every process finishes or the time budget runs out
    while (completed < n && current_time < time_units) {
        // Find highest priority (lowest number) among unfinished processes
        int highest = -1;
        int highest_priority = 999;  // Larger than any valid priority

        for (int i = 0; i < n; i++) {
            if (procs[i].remaining_burst > 0 &&
                procs[i].dynamic_priority < highest_priority) {
                highest = i;
                highest_priority = procs[i].dynamic_priority;
            }
        }

        // Execute one time unit
        printf("Time %d: P%d (priority %d)\n",
               current_time, procs[highest].pid, procs[highest].dynamic_priority);

        procs[highest].remaining_burst--;
        current_time++;

        // Every other unfinished process waited this tick: count it and age it
        for (int i = 0; i < n; i++) {
            if (i != highest && procs[i].remaining_burst > 0) {
                procs[i].waiting_time++;
                procs[i].dynamic_priority -= AGING_RATE;
                if (procs[i].dynamic_priority < 1) {
                    procs[i].dynamic_priority = 1;  // Max priority
                }
            }
        }

        if (procs[highest].remaining_burst == 0) {
            completed++;
        }
    }
}

Multi-Level Feedback Queue (MLFQ)

MLFQ combines multiple queues with different priorities and time quantum sizes. Processes start in the highest-priority queue; a process that uses its entire quantum drops to the next lower queue. Short interactive bursts therefore complete quickly, while CPU-bound jobs still get CPU time, in longer slices at lower priority.

graph TB
    subgraph Q1[Queue 1 - Highest Priority]
        A1[Interactive<br/>Quantum: 8ms]
    end
    subgraph Q2[Queue 2 - Medium Priority]
        B1[Mixed<br/>Quantum: 16ms]
    end
    subgraph Q3[Queue 3 - Lowest Priority]
        C1[Batch<br/>Quantum: 32ms]
    end

    A1 -->|Uses full quantum| B1
    B1 -->|Uses full quantum| C1
    C1 -->|Blocked I/O| A1

    A1 -->|Shorter quantum| Scheduler
    B1 -->|Medium quantum| Scheduler
    C1 -->|Longer quantum| Scheduler

    Scheduler -->|picks highest non-empty| A1
    Scheduler -->|picks highest non-empty| B1
    Scheduler -->|picks highest non-empty| C1

    style Q1 stroke:#1a4a1a
    style Q2 stroke:#1a1a4a
    style Q3 stroke:#4a1a1a

Architecture Diagram

sequenceDiagram
    participant P1 as Process A (CPU-bound)
    participant P2 as Process B (Interactive)
    participant Scheduler
    participant CPU

    Note over Scheduler,CPU: Time 0ms
    P2->>Scheduler: I/O completes, ready
    Scheduler->>CPU: Schedule P2 (interactive)
    CPU->>P2: Run P2

    Note over Scheduler,CPU: Time 8ms (quantum expires)
    Scheduler->>CPU: Preempt P2, schedule P1
    CPU->>P1: Run P1 (uses full quantum)

    Note over Scheduler,CPU: Time 16ms
    Scheduler->>CPU: Move P1 to lower queue
    Scheduler->>CPU: Schedule P2 again
    CPU->>P2: Continue P2

    Note over Scheduler,CPU: Time 24ms (P2 completes)
    Scheduler->>CPU: Schedule next in Q2
    CPU->>P1: Continue P1 in Q2

Production Failure Scenarios

Convoy Effect in FCFS

Problem: A long CPU-bound process blocks short I/O-bound processes. Short jobs wait while long job holds CPU.

Symptoms: System appears responsive at first, then slows as long jobs dominate.

Mitigation: Use Round Robin or SJF to give short jobs priority.

The convoy effect hides under light load. Say you have three processes: P1 (a 100ms CPU burst), and P2 and P3 (5ms CPU bursts, each followed by disk I/O). P1 arrives at t=0, P2 at t=1, and P3 at t=2. Under FCFS, P2 cannot even issue its I/O request until P1 finishes at t=100, so it waits 99ms for a 5ms burst, and P3 waits 103ms. The I/O device sits idle the whole time.

The fix is recognizing that CPU-bound and I/O-bound processes have fundamentally different needs. I/O-bound processes need brief CPU windows to dispatch I/O requests and process results; they do not need long CPU bursts. Round Robin with a modest quantum gives these processes enough CPU to stay productive. SJF prioritizes short jobs by definition. MLFQ handles this automatically: I/O-bound processes give up the CPU before their quantum expires and stay in high-priority queues, while CPU-bound jobs sink to lower queues with longer time slices.
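To put numbers on the convoy, here is a minimal sketch that compares per-process waits when the same three bursts run in arrival order versus shortest-first. It assumes all three jobs are ready at t=0 and ignores I/O time; the helper name `waits` is illustrative, not taken from any scheduler.

```python
def waits(bursts):
    """Return each job's wait when jobs run back to back in the given order."""
    elapsed, result = 0, []
    for burst in bursts:
        result.append(elapsed)  # time spent queued before the job first runs
        elapsed += burst
    return result

bursts = [100, 5, 5]          # P1 (CPU-bound), P2, P3, in ms (assumed values)
fcfs = waits(bursts)          # arrival order: the long job goes first
sjf = waits(sorted(bursts))   # shortest first

print("FCFS waits:", fcfs, "average:", round(sum(fcfs) / len(fcfs), 2))
print("SJF waits: ", sjf, "average:", round(sum(sjf) / len(sjf), 2))
```

Simply moving the long job to the back cuts the average wait from about 68ms to 5ms in this toy case, which is the whole argument for letting short jobs go first.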

Starvation in Priority Scheduling

Problem: Low priority processes never get CPU time as high priority processes keep arriving.

Symptoms: Certain processes show extremely high wait times, never completing.

Mitigation: Implement aging — increase priority of waiting processes over time.

In theory, high priority work gets done first. In practice, if high-priority work keeps arriving, low-priority processes queue up forever. This happens in production: a logging daemon at low priority gets buried under user request handlers during traffic spikes, and the logs never flush. The problem is insidious because the system looks fine under light load.

Aging treats wait time as a priority boost. A process waiting 500ms might have its priority raised from 50 to 45 (lower numbers mean higher priority). After another 500ms, 40. Eventually its priority becomes high enough for it to run. Linux’s CFS uses a similar concept with vruntime: a process that has not run accumulates no vruntime while it waits, so its vruntime falls behind that of its peers and it becomes more likely to be scheduled. Set the aging rate too fast and you undermine priority; too slow and processes still starve.
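The aging arithmetic can be sketched in a few lines. This assumes the step size (500ms), the boost per step (5), and the best possible priority (1) as illustrative parameters; a real scheduler would pick its own values and may age continuously instead of in steps.

```python
def aged_priority(base, waited_ms, step_ms=500, boost=5, best=1):
    """Lower number = higher priority; each full step_ms of waiting subtracts boost."""
    steps = waited_ms // step_ms
    return max(best, base - steps * boost)

for waited in (0, 500, 1000, 5000):
    print(f"waited {waited:>5}ms -> priority {aged_priority(50, waited)}")
```

Clamping at `best` keeps an aged process from outranking everything forever; tuning `step_ms` and `boost` is the "too fast versus too slow" trade-off described above.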

Quantum Too Small

Problem: Excessive context switches consume more CPU than useful work.

Symptoms: High system CPU utilization but low throughput, high context switch rate.

Mitigation: Profile with perf sched and adjust quantum. Rule of thumb: quantum should be at least 10x the context switch cost.

A context switch on x86-64 takes roughly 1-5 microseconds of direct cost. That does not sound like much, but it adds up. Take a deliberately extreme case: if your quantum is 1ms and each switch effectively costs 0.5ms, every 1.5ms cycle delivers only 1ms of useful work, so you are spending 33% of CPU time on scheduler overhead. That is roughly 667 switches per second and a third of a second of pure waste every second.

The real cost is not just the switch time; it is cache pollution. When you switch to a new process, its working set is usually no longer in the CPU's L1/L2 caches, and TLB entries may have been flushed. The new process starts with cold caches, spending the first portion of its quantum loading code and data back in. If the quantum expires before the process gets useful work done, you are paying cache-fill cost for almost no useful computation. The sweet spot is a quantum large enough that processes complete meaningful work between switches, but small enough that interactive latency stays low.
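The overhead arithmetic is easy to tabulate. The sketch below assumes every quantum is fully used and followed by exactly one switch, and that the switch cost you plug in already includes any cache-refill penalty; the function name is illustrative.

```python
def overhead_fraction(quantum_us, switch_us):
    """Fraction of CPU time lost to switching: one switch per full quantum."""
    return switch_us / (quantum_us + switch_us)

print(f"{'quantum (us)':<14}{'switch (us)':<14}{'overhead':<10}")
for quantum_us in (100, 1_000, 10_000):
    for switch_us in (1, 5, 500):
        frac = overhead_fraction(quantum_us, switch_us)
        print(f"{quantum_us:<14}{switch_us:<14}{frac:<10.2%}")
```

With a 1ms quantum, a 5 microsecond switch costs about half a percent, while an effective 500 microsecond switch (cold caches included) costs a third of the CPU.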

Quantum Too Large

Problem: System feels unresponsive; long processes block short ones.

Symptoms: Poor response time for interactive tasks even at moderate CPU load.

Mitigation: Use a smaller quantum, or combine algorithms (interactive vs batch).

When quantum balloons, the scheduler starts acting like FCFS. A process that arrives just after a long job started waits nearly the entire quantum before getting a turn. Interactive users notice: they press a key and nothing happens for hundreds of milliseconds while the currently-running process finishes its time slice.

The fix is not always to make the quantum smaller everywhere. Some workloads benefit from longer quanta: big data processing wastes cycles on context switches if preemption happens too often. The real solution is often a hybrid approach: give interactive processes a short quantum in a high-priority queue, and let batch processes run in a lower-priority queue with longer slices. Linux’s CFS does something similar, dynamically calculating per-process quanta based on the number of runnable tasks.

Measuring Scheduling Performance

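Linux exposes scheduler behavior through several interfaces. On kernels before 5.13, /proc/sys/kernel/ holds the CFS tunables: sched_latency_ns controls the target scheduling period (6ms by default, scaled up with CPU count to 24ms on machines with 8 or more cores), and sched_min_granularity_ns sets a floor on time slices to prevent excessive switching when many processes run. Kernels from 5.13 on expose these under /sys/kernel/debug/sched/ instead, and 6.6 replaced both with a single base_slice_ns when EEVDF superseded CFS. /proc/sched_debug (/sys/kernel/debug/sched/debug on newer kernels) shows per-process scheduling details including vruntime and sleep time. For tracing, perf sched records scheduling events and reports per-task latency statistics.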
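As a rough model of how a CFS-style scheduler turns those two tunables into a time slice, consider the sketch below. It assumes all tasks have equal weight (the kernel also scales slices by nice value), and the default arguments are placeholders rather than values read from any system.

```python
def time_slice_ns(nr_running, latency_ns=24_000_000, min_granularity_ns=3_000_000):
    """Equal share of the target latency, never below the granularity floor."""
    if nr_running < 1:
        raise ValueError("need at least one runnable task")
    return max(latency_ns // nr_running, min_granularity_ns)

for n in (1, 2, 4, 8, 32):
    print(f"{n:>2} runnable -> {time_slice_ns(n) / 1e6:.1f} ms per slice")
```

Slices shrink as more tasks become runnable until they hit the floor; past that point the scheduling period stretches instead, which is why latency degrades under heavy oversubscription.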

The most useful command for quick triage is vmstat 1: the r column shows the number of runnable processes, meaning those running or waiting for a CPU. If it consistently exceeds 4 times the core count, the system is overloaded. Pair that with mpstat -P ALL 1 to see per-core utilization. If one core is pinned while others are idle, you likely have a single-threaded process creating a bottleneck, not a scheduling problem.
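The same check can be scripted. This sketch reads the procs_running counter from /proc/stat (Linux only) and applies the 4x-cores heuristic from the text; the function names and the single-sample approach are assumptions, and a real monitor would sample repeatedly before calling a system overloaded.

```python
import os

def procs_running(stat_text):
    """Extract the procs_running counter from the contents of /proc/stat."""
    for line in stat_text.splitlines():
        if line.startswith("procs_running"):
            return int(line.split()[1])
    raise ValueError("procs_running not found")

def looks_overloaded(running, cores, factor=4):
    """Heuristic from the text: more than factor * cores runnable tasks."""
    return running > factor * cores

try:
    with open("/proc/stat") as f:
        running = procs_running(f.read())
    cores = os.cpu_count() or 1
    print(f"runnable={running} cores={cores} overloaded={looks_overloaded(running, cores)}")
except (OSError, ValueError):
    print("/proc/stat not available on this platform")
```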

Algorithm	Avg Wait	Response	Throughput	Fairness
FCFS	High	Poor	Medium	Unfair
SJF	Lowest	Poor	Highest	Unfair (starvation)
Round Robin	Medium	Best	Medium	Fair
Priority	Variable	Variable	Variable	Unfair (starvation)
MLFQ	Low	Best	Highest	Fair

Scenario	Best Algorithm	Why
Batch processing	SJF	Maximize throughput
Interactive desktop	Round Robin / MLFQ	Minimize response time
Real-time systems	Fixed priority / Rate monotonic	Guarantees deadlines
Multi-user timeshare	Round Robin	Fairness
Mixed workload	MLFQ	Adapts dynamically

Quantum Setting	Context Switches	Response Time
Too small	Many	Fast, but throughput is lost to overhead
Optimal	Moderate	Good balance
Too large	Few	Poor response time

Implementation Snippets

Implementing a Simple MLFQ

#!/usr/bin/env python3
"""Multi-Level Feedback Queue Scheduler."""

from collections import deque

class Process:
    def __init__(self, pid, burst, priority=0):
        self.pid = pid
        self.burst = burst
        self.priority = priority  # 0 = highest
        self.remaining = burst

class MLFQScheduler:
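    def __init__(self, num_queues=3, quantums=(8, 16, 32)):
        # Immutable default avoids sharing one list between scheduler instances
        if len(quantums) != num_queues:
            raise ValueError("need exactly one quantum per queue")
        self.queues = [deque() for _ in range(num_queues)]
        self.quantums = list(quantums)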

    def add_process(self, proc):
        self.queues[proc.priority].append(proc)

    def schedule(self):
        time = 0
        completed = []

        while True:
            # Find highest non-empty queue
            chosen = None
            for i, q in enumerate(self.queues):
                if q:
                    chosen = (i, q.popleft())
                    break

            if chosen is None:
                break

            q_idx, proc = chosen
            quantum = self.quantums[q_idx]

            # Execute for quantum or until completion
            exec_time = min(quantum, proc.remaining)
            time += exec_time
            proc.remaining -= exec_time

            if proc.remaining == 0:
                proc.turnaround = time
                completed.append(proc)
                print(f"Time {time}: P{proc.pid} completed")
            else:
                # Demote to lower priority (next queue)
                next_q = min(q_idx + 1, len(self.queues) - 1)
                proc.priority = next_q
                self.queues[next_q].append(proc)
                print(f"Time {time}: P{proc.pid} demoted to Q{next_q}")

        return completed

# Example
scheduler = MLFQScheduler()
processes = [
    Process(1, 20),  # Interactive (will get demoted if CPU-bound)
    Process(2, 10),  # Short interactive
    Process(3, 40),  # Long batch
]
for p in processes:
    scheduler.add_process(p)

results = scheduler.schedule()

Measuring Scheduling Performance in Practice

#!/bin/bash
# Measure scheduling performance metrics

echo "=== Scheduler Information ==="
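# Tunables live in /proc/sys/kernel on older kernels and in
# /sys/kernel/debug/sched (root only) on newer ones; print whichever exist.
for f in /proc/sys/kernel/sched_latency_ns \
         /proc/sys/kernel/sched_min_granularity_ns \
         /proc/sys/kernel/sched_migration_cost_ns \
         /sys/kernel/debug/sched/*_ns; do
    [ -r "$f" ] && echo "$f: $(cat "$f")"
done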

echo -e "\n=== Context Switches ==="
grep ctxt /proc/stat

echo -e "\n=== Run Queue Length ==="
vmstat 1 5

echo -e "\n=== Process State ==="
# Count processes per state (including pid and cmd would make every line unique)
ps -eo state= | sort | uniq -c | sort -rn

echo -e "\n=== Scheduler Trace ==="
sudo perf sched record -- sleep 5
sudo perf sched latency

Observability Checklist

Key Metrics

# CPU utilization per core
mpstat -P ALL 1

# Run queue length (should be < 4 * cores); r is the first vmstat column
vmstat 1 | awk '$1 ~ /^[0-9]+$/ {print $1}'

# Context switches per second
sar -w 1

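# Scheduler latency settings and stats (file location depends on kernel version)
grep -i latency /proc/sched_debug 2>/dev/null || sudo grep -i latency /sys/kernel/debug/sched/debug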

Tracing

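# Trace scheduling decisions (quote the glob so the shell does not expand it)
sudo perf trace -e 'sched:*'

# Per-event scheduling timeline (wait time, scheduling delay, run time)
sudo perf sched record -- sleep 10
sudo perf sched timehist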

# Per-process scheduling info
ps -eo pid,psr,ni,time,cmd --sort=-time

Common Pitfalls / Anti-Patterns

Security Pitfalls

The scheduler is one of those components that rarely shows its cracks until someone actively tries to exploit them. Misconfigured scheduling parameters let unprivileged users starve others of CPU time, or let real-time processes hold the processor indefinitely. These problems hide during development because they only surface under load, adversarial input, or in multi-tenant setups where one user is actively trying to degrade service to neighbors.

Two pitfalls come up most often. Priority manipulation: unprivileged users set negative nice values to jump ahead of everyone else. Real-time process monopolization: SCHED_FIFO or SCHED_RR tasks hold the CPU with no voluntary preemption path. Either one can take down a supposedly isolated system.

Priority Manipulation Attacks

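By default Linux only lets unprivileged users raise their nice value (lower their priority). Negative nice values become available to them only if RLIMIT_NICE has been raised or CAP_SYS_NICE granted, and at that point they can starve others. Keep the limit pinned via: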

# /etc/security/limits.conf
* hard nice 0

Real-time Process Security

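Real-time scheduling classes (SCHED_FIFO, SCHED_RR) can monopolize a CPU. Restrict them to processes with CAP_SYS_NICE or an explicit RLIMIT_RTPRIO grant, cap their share with cgroups, and leave the kernel's real-time throttling (sched_rt_runtime_us, 950000 out of every 1000000 microseconds by default) enabled.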
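One way to audit what a given account is actually allowed to do is to read its resource limits from inside a process running as that user. The sketch below uses Python's standard resource module (Unix only); RLIMIT_NICE and RLIMIT_RTPRIO are Linux-specific constants, so they are looked up defensively, and the helper names are illustrative.

```python
import resource

def scheduling_limits():
    """Return (soft, hard) pairs for the scheduling-related limits that exist here."""
    limits = {}
    for name in ("RLIMIT_NICE", "RLIMIT_RTPRIO"):
        const = getattr(resource, name, None)  # absent on non-Linux platforms
        if const is not None:
            limits[name] = resource.getrlimit(const)
    return limits

def lowest_allowed_nice(rlimit_nice):
    """Per getrlimit(2), the nice value may be lowered as far as 20 - RLIMIT_NICE."""
    if rlimit_nice == resource.RLIM_INFINITY:
        return -20
    return max(-20, 20 - rlimit_nice)

for name, (soft, hard) in scheduling_limits().items():
    print(f"{name}: soft={soft} hard={hard}")
```

A hard RLIMIT_NICE of 20 corresponds to the `nice 0` ceiling in limits.conf, and an RLIMIT_RTPRIO of 0 means the account cannot request a real-time policy without extra capabilities.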

Operational Pitfalls

Scheduler misconfiguration is one of those issues that shows up as “slowness” in monitoring but takes forever to trace back to the actual cause. Quantum misconfiguration, wrong assumptions about burst times, and ignoring I/O blocking behavior are the usual suspects. These mistakes slip past QA because they only manifest under specific workloads or at scale.

Five pitfalls show up repeatedly in production. Quantum set too small burns CPU on context switches. Quantum set too large makes the system feel unresponsive. Assuming SJF works without knowing burst times leads to starvation. Ignoring I/O-bound process priority treats all runnable processes the same, which kills interactive performance. And on multi-core systems, a single global queue creates lock and cache-line contention that per-CPU queues avoid. The sections below cover how to detect and fix each one.

Ignoring Quantum Tuning

Default quantum may not match workload. Profile and adjust.

Linux’s default quantum assumes a balanced mix of interactive and batch work. But if you are running a high-frequency trading system where microseconds matter, that quantum is way too large. Conversely, a quantum that is too small turns your CPU into a switchboard operator: constant context switches, cold caches, no real work getting done.

The tool for profiling is perf sched. Run perf sched record -- sleep 10 during peak load, then perf sched latency to see per-task average and maximum scheduling delays. If the worst delays stay within one quantum, you are probably fine. If processes routinely wait multiple quanta before running, the quantum might be too large for your workload, or you have too many runnable processes competing for too few cores. sched_min_granularity_ns lets you set a floor: raising it reduces context switches when many processes are active, at the cost of fairness.

Using SJF Without Burst Prediction

In production, burst times aren’t known. Use MLFQ to approximate SJF behavior.

SJF’s theoretical optimal average wait time assumes you know the future. In reality, burst times come from estimates or historical data, both imperfect. A batch script that usually takes 100ms might take 5 seconds if it hits a slow disk. MLFQ sidesteps this by watching what processes actually do: if a process consistently uses its full quantum, it gets moved lower; if it blocks early, it stays high. Over time, the scheduler learns which processes are truly short and which ones just looked short on arrival.

If you must use SJF in production, use the exponential moving average of recent burst times rather than the raw first burst. This gives you some adaptability without full MLFQ complexity. Linux’s CFS approximates something similar by tracking vruntime per process: a process that runs a lot accumulates vruntime faster, which naturally ages it relative to processes that have not run recently.
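The exponential moving average mentioned above is the textbook estimator: next estimate = alpha * last observed burst + (1 - alpha) * previous estimate. A minimal sketch, assuming alpha = 0.5 and an initial guess of 10 time units (both values are arbitrary choices for illustration):

```python
def predict_next_burst(history, alpha=0.5, initial_guess=10.0):
    """Fold observed bursts into a running estimate, oldest first."""
    estimate = initial_guess
    for observed in history:
        # Recent behavior gets weight alpha; accumulated history gets the rest
        estimate = alpha * observed + (1 - alpha) * estimate
    return estimate

history = [6, 4, 6, 4]  # measured CPU bursts, oldest first
print("predicted next burst:", predict_next_burst(history))
```

A higher alpha reacts faster to a process changing behavior, while a lower one smooths out one-off spikes such as a single slow disk read.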

Forgetting About I/O Blocking

Processes that block frequently should get higher priority — don’t treat all runnable processes the same.

I/O-bound processes need very little CPU time, but they need it immediately when their I/O completes. A disk I/O request might take 10ms, but if the process waits another 20ms for CPU to process the result, you have doubled your latency for no reason. The fix is recognizing that runnable does not mean CPU-hungry. An interactive shell that blocks on keyboard input for 95% of its life still needs sub-millisecond response time when input arrives.

The naive approach is to give I/O-bound processes higher static priority. That works until high-priority CPU-bound processes starve them anyway. MLFQ handles this naturally because I/O-bound processes use less than their full quantum and stay high, while CPU-bound processes consume their quantum and drop. Note that Linux's I/O priority classes are a separate mechanism: ionice -c 2 -n 0 gives a process the highest best-effort I/O priority, but that only affects how the I/O scheduler orders its disk requests. CPU priority is still controlled through nice values and scheduling policies.

Not Considering Multi-Core

Single queue vs per-CPU queues have different tradeoffs for cache affinity vs contention.

On a single-core system, the scheduler’s data structures see low contention. Scale to 32 cores and a single global run queue becomes a bottleneck: every scheduler decision touches the same lock, and all 32 cores fight over it. The fix is per-CPU run queues: each core has its own queue and only touches its own data, eliminating lock contention. Linux’s CFS does this by default.

The tradeoff is cache affinity. A process that runs on core 0 has its data in core 0’s cache. Move it to core 1 and that cache is cold. Per-CPU queues naturally keep processes on the same core unless that core is overloaded, preserving cache warmth. The performance difference is significant: an L1 cache hit might take 4 cycles, while a miss that goes all the way to DRAM takes 200 or more. For a process that touches lots of data, such as a database or web server, keeping it on one core can be worth more than load balancing across all cores.
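If you want to force cache affinity rather than rely on the scheduler, Linux lets a process restrict itself to specific cores. The sketch below uses os.sched_getaffinity and os.sched_setaffinity, which Python only provides on platforms that support them; the function name and the choice of the lowest-numbered allowed CPU are illustrative assumptions.

```python
import os

def pin_to_one_cpu():
    """Pin the current process to one CPU it may already use; return it, or None."""
    if not hasattr(os, "sched_setaffinity"):
        return None  # platform without the affinity API
    allowed = os.sched_getaffinity(0)  # 0 means "this process"
    cpu = min(allowed)
    os.sched_setaffinity(0, {cpu})
    return cpu

# Demo: pin, report, then restore the original mask
original = os.sched_getaffinity(0) if hasattr(os, "sched_getaffinity") else None
cpu = pin_to_one_cpu()
print("pinned to CPU:", cpu)
if original is not None:
    os.sched_setaffinity(0, original)
```

Pinning trades load balancing for cache warmth, so it is usually worth measuring before and after instead of applying it blindly.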

Quick Recap Checklist

FCFS is simple but causes convoy effect
SJF minimizes average wait but requires knowing burst times
Round Robin is fair with good response time; quantum size matters
Priority scheduling has starvation; use aging to mitigate
MLFQ combines multiple algorithms, dynamically adjusting
Quantum too small = excessive context switches; too large = poor response
No single algorithm is best — trade-offs depend on workload

Interview Questions

1. What is the difference between preemptive and non-preemptive scheduling?

Non-preemptive scheduling means once a process starts running, it runs to completion (or until it voluntarily blocks). The scheduler only makes decisions when the running process blocks or terminates.

Preemptive scheduling allows the OS to forcibly interrupt a running process and switch to another. This happens on timer interrupts, when a higher priority process becomes ready, or based on time slice expiration.

Preemptive scheduling is essential for interactive systems because:

A long-running calculation shouldn't freeze the UI
Higher priority real-time tasks can preempt lower priority ones
A misbehaving process cannot hold the CPU indefinitely

Linux, Windows, and macOS all use preemptive scheduling for user processes. Some embedded systems use cooperative scheduling, but this requires all processes to be well-behaved.

2. Explain the convoy effect and how to mitigate it.

The convoy effect occurs in FCFS when a long CPU-bound process ahead causes short I/O-bound processes to wait, even though the I/O devices are idle. The long process holds the CPU while the short processes wait with idle I/O.

Example: Process A (100ms CPU) arrives before Process B (1ms CPU). B arrives just after A starts. B must wait 100ms while A uses CPU, even though B could complete in 1ms.

Mitigations:

Round Robin: Each process gets quantum time; short jobs complete faster
SJF: Short jobs run first, minimizing wait time (but requires knowing burst times)
Priority scheduling: Give I/O-bound processes higher priority
MLFQ: I/O-bound processes naturally stay in high-priority queues

3. How does MLFQ handle both interactive and batch workloads?

MLFQ uses multiple queues with different priorities and time quantum sizes:

High priority queue: Short quantum (4-8ms) for interactive processes
Medium priority: Medium quantum (16-32ms)
Low priority: Long quantum or FCFS for batch jobs

How it works:

New processes enter the highest priority queue
If a process uses its entire quantum, it's demoted to a lower queue
If a process blocks before using its quantum, it stays in the current queue (or moves up)
When a process in a higher queue becomes ready, it preempts lower ones

Result: Interactive processes (short bursts, frequent I/O) stay in high-priority queues with quick response. CPU-bound batch jobs naturally sink to lower queues but get longer time slices when they do run.

4. What is the relationship between quantum size and context switch overhead?

A context switch is pure overhead — CPU cycles spent saving and restoring process state instead of doing useful work. On modern x86-64, a context switch takes approximately 1-5 microseconds.

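If quantum is too small (e.g., 1ms) and a context switch effectively costs 0.5ms (a deliberately exaggerated figure), then 0.5 / 1.5 = 33% of CPU time is spent on scheduling overhead. At a realistic 5 microseconds per switch, the same 1ms quantum loses about 0.5%.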

Rules for quantum sizing:

Quantum should be 10-20x the context switch cost
80% of CPU bursts should be shorter than quantum (otherwise most processes will be preempted)
Typical values: 4-100ms depending on workload

On Linux, CFS divides sched_latency_ns (6ms by default, scaled by CPU count up to 24ms) by the number of runnable processes to get the per-process quantum. sched_min_granularity_ns sets a minimum to prevent excessive switching.

5. What is starvation and how does aging mitigate it?

Starvation (also called indefinite postponement) occurs when a process never gets scheduled because other processes keep getting priority. In priority scheduling, low-priority processes can wait forever if high-priority tasks keep arriving.

Aging is a technique to prevent starvation by gradually increasing the priority of waiting processes over time. The longer a process waits, the higher its priority becomes, until eventually it gets scheduled.

Example: A process with priority 50 (lowest) waiting 500ms might have its priority increased to 45, then 40, etc. After enough waiting, it becomes high priority and runs.

Linux's CFS doesn't use static priorities for aging — instead, the vruntime concept naturally handles fairness: a process that hasn't run for a long time accumulates less vruntime, making it more likely to be scheduled. This is implicit aging.

6. What is the convoy effect in FCFS scheduling and how does it impact system performance?

The convoy effect occurs when one or more CPU-bound processes monopolize the CPU, causing I/O-bound processes to wait even though the I/O devices are idle. The I/O-bound processes make a request, get scheduled briefly, block on I/O immediately, and then wait while the CPU-bound process completes its burst.

This creates a convoy of I/O requests behind a slow CPU-bound leader. The system appears sluggish despite having plenty of CPU capacity because I/O-bound processes complete quickly when they get CPU but are rarely given the chance.

Mitigation: Use SJF or Round Robin to give shorter jobs priority. MLFQ naturally handles this — I/O-bound processes stay in high-priority queues and get frequent short time slices.

7. How does MLFQ prevent starvation of low-priority processes?

MLFQ prevents starvation through two mechanisms: aging and queue promotion.

Aging: If a process stays in a lower queue for too long without being scheduled, its priority is gradually boosted, moving it to a higher queue. This ensures that even low-priority CPU-bound jobs eventually get CPU time.

Queue promotion: Some MLFQ implementations periodically promote all processes to the highest priority queue, ensuring that long-waiting processes are given a chance to run at high priority.

The key insight is that MLFQ assumes processes will be CPU-bound or I/O-bound — CPU-bound jobs naturally sink to lower queues, while I/O-bound jobs stay in high queues. Without aging, truly CPU-only jobs could be stuck at the bottom forever.
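The periodic promotion described above can be sketched as a single function. This is a simplified illustration: the list-of-lists queue layout and the name `periodic_boost` are assumptions, and a real implementation would run this on a timer.

```python
def periodic_boost(queues):
    """Move every process to the top queue.

    `queues` is a list of lists, index 0 being the highest priority.
    Processes keep their relative order, from highest to lowest queue.
    """
    boosted = [proc for level in queues for proc in level]
    return [boosted] + [[] for _ in queues[1:]]
```

For example, `periodic_boost([["shell"], [], ["batch1", "batch2"]])` puts all three processes in queue 0, so the batch jobs get at least one turn at high priority before they sink again.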

8. What is the relationship between time quantum and scheduling overhead?

Scheduling overhead is the CPU time spent deciding which process to run and performing the context switch, rather than doing useful work.

A smaller quantum means more frequent context switches, increasing overhead as a fraction of total CPU time. If quantum is 1ms and context switch takes 0.5ms, then 33% of CPU time is pure overhead.

A larger quantum reduces context switch frequency and overhead, but at the cost of poor response time — processes may wait a long time before getting CPU.

Rule of thumb: quantum should be 10-20x the context switch cost. On Linux, CFS dynamically calculates the time slice from sched_latency_ns divided by the number of runnable processes, yielding shorter slices when many processes are running, down to the sched_min_granularity_ns floor.
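The overhead arithmetic above can be checked with a short sketch. The second function is a simplified, equal-weight model of the CFS slice calculation; the values passed to it in the usage note are illustrative, not kernel defaults, and both function names are assumptions.

```python
def overhead_fraction(quantum_ms, switch_ms):
    """Fraction of CPU time lost to context switches.

    Each cycle is one quantum of useful work plus one context switch.
    """
    return switch_ms / (quantum_ms + switch_ms)


def equal_weight_slice(latency_ns, nr_running, min_granularity_ns):
    """Simplified CFS-style slice: latency split evenly, with a floor."""
    return max(latency_ns // nr_running, min_granularity_ns)
```

`overhead_fraction(1, 0.5)` gives about 0.33, matching the 33% figure, while a quantum of 10x the switch cost (`overhead_fraction(5, 0.5)`) brings overhead down to about 9%. With an assumed 24ms latency target and 3ms floor, `equal_weight_slice` returns 6ms for 4 runnable processes and stops shrinking at 3ms for 16.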

9. What is the difference between turnaround time and waiting time as scheduling metrics?

Turnaround time = time from job submission to time of completion. It includes waiting time plus actual execution time plus any I/O time.

Waiting time = time spent in the ready queue, not executing. It excludes the actual execution time and I/O wait.

The two metrics move together for a fixed workload: SJF minimizes average waiting time, and because each job's execution time is fixed it also minimizes average turnaround time, but it can starve long jobs. FCFS runs jobs in arrival order, so a long job at the front inflates both waiting and turnaround time for the short jobs behind it.

Important: only waiting time is directly under the scheduler's control. Turnaround time depends on both scheduling decisions and the process's own behavior (I/O patterns, burst lengths).
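The two definitions relate by simple subtraction, as the sketch below shows. The function name and the example numbers are illustrative assumptions.

```python
def turnaround_and_waiting(arrival, completion, cpu_time, io_time=0):
    """Return (turnaround, waiting) for one job.

    turnaround = completion - arrival
    waiting    = turnaround - CPU execution time - time blocked on I/O
    """
    turnaround = completion - arrival
    waiting = turnaround - cpu_time - io_time
    return turnaround, waiting
```

A job submitted at t=0 that finishes at t=30 after 10 units of CPU and 5 units of I/O has a turnaround time of 30 and a waiting time of 15: `turnaround_and_waiting(0, 30, 10, 5)` returns `(30, 15)`.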

10. How does multi-level feedback queue (MLFQ) dynamically learn process behavior?

MLFQ learns by observing whether a process uses its entire time quantum or blocks before it expires:

Uses full quantum (CPU-bound): MLFQ assumes the process needs heavy computation and demotes it to a lower priority queue with a longer quantum.
Blocks before quantum expires (I/O-bound): MLFQ assumes the process is interactive and keeps it in the current queue or promotes it to a higher priority queue.

This feedback-driven approach means MLFQ starts with no knowledge of a process and adapts based on observed behavior. A long-running batch job will be demoted over time; an interactive shell remains at high priority.

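The tradeoff is that MLFQ can be gamed: a process can give up the CPU just before its quantum expires (for example, by issuing a trivial I/O request) to stay at high priority and take an outsized share of CPU. Accounting for the total CPU time a process has used at each level, rather than judging each slice in isolation, closes this loophole.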
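The feedback rule in this answer fits in one function. This is a minimal sketch under stated assumptions: three queue levels with 0 as the highest, a process that blocks early simply stays where it is (some implementations promote instead), and the name `next_queue` is hypothetical.

```python
def next_queue(level, used_ms, quantum_ms, lowest_level=2):
    """Return a process's queue level after one turn (0 = highest priority)."""
    if used_ms >= quantum_ms:
        # Used the whole slice: looks CPU-bound, so demote one level.
        return min(level + 1, lowest_level)
    # Blocked before the slice expired: looks interactive, so stay put.
    return level
```

The rule also shows why per-slice judgment is gameable: a process that runs for 9ms of a 10ms quantum and then yields is never demoted, since `next_queue(0, 9, 10)` returns 0.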

11. What is fair share scheduling?

Fair share scheduling ensures that each user or group receives a fair proportion of CPU time, regardless of how many processes they have running. For example, if user A has 10 processes and user B has 1 process, both users get 50% of CPU (not user A getting 91% because they have more processes).

Linux's CFS implements this via the concept of "weight" (based on nice values) and per-group accounting. cgroups can be used to partition CPU time between different groups (e.g., production vs development workloads).

The key data structure is still a red-black tree keyed by vruntime. With group scheduling, each group has its own run queue and is itself a scheduling entity in its parent's tree, so CPU time is divided among groups first and then among the processes within each group.
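The 50% versus 91% example can be reproduced with a short calculation. This sketch assumes equal weights for all users and processes; both function names are illustrative.

```python
def per_process_split(process_counts):
    """CPU fraction per USER when every process gets an equal share."""
    total = sum(process_counts.values())
    return {user: n / total for user, n in process_counts.items()}


def fair_share_split(process_counts):
    """CPU fraction per PROCESS when every user gets an equal share."""
    user_share = 1 / len(process_counts)
    return {user: user_share / n for user, n in process_counts.items()}
```

With `{"A": 10, "B": 1}`, `per_process_split` gives user A about 91% of the CPU. Under `fair_share_split`, each user gets 50%, so each of A's ten processes receives 5% while B's single process receives the full 50%.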

12. How does deadline scheduling (SCHED_DEADLINE) guarantee task deadlines?

SCHED_DEADLINE uses the Earliest Deadline First (EDF) algorithm with bandwidth reservation. Each task specifies:

Runtime: Maximum CPU time needed per period (worst-case execution time)
Deadline: Time by which execution must complete
Period: Interval between successive task invocations

The scheduler checks that total runtime/period (bandwidth) does not exceed CPU capacity. If the system is oversubscribed, admission is denied.

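At runtime, EDF picks the task with the earliest deadline. On a single CPU, with each task's deadline equal to its period, EDF meets every deadline as long as total reserved bandwidth does not exceed 100%. This guarantee is mathematical rather than probabilistic, but it assumes tasks stay within their declared runtime. On multiprocessor systems, the utilization test alone bounds how late a task can be rather than ruling out every miss.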
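The admission test and the EDF pick can be sketched as follows. This is a single-CPU illustration, not the kernel's code: tasks are assumed to be `(runtime, deadline, period)` tuples in the same time unit, and the function names are hypothetical.

```python
from fractions import Fraction


def total_bandwidth(tasks):
    """Sum of runtime/period over all (runtime, deadline, period) tasks."""
    return sum(Fraction(runtime, period) for runtime, _deadline, period in tasks)


def admit(tasks, new_task, capacity=Fraction(1)):
    """Accept `new_task` only if total bandwidth stays within capacity."""
    return total_bandwidth(tasks + [new_task]) <= capacity


def pick_edf(ready):
    """Pick the ready task with the earliest absolute deadline.

    `ready` is a list of (name, absolute_deadline) pairs.
    """
    return min(ready, key=lambda task: task[1])[0]
```

With existing tasks `[(10, 100, 100), (20, 50, 50)]` the reserved bandwidth is 1/2. A new task `(30, 60, 60)` brings it to exactly 1 and is admitted, while `(40, 60, 60)` would exceed capacity and is rejected. Exact fractions are used so the comparison at the boundary is not affected by floating-point rounding.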

13. What is the difference between hard real-time and soft real-time scheduling?

Hard real-time: A missed deadline is a system failure. The task must complete by its deadline or the result is useless or dangerous. Examples: airbag controllers, pacemakers, flight control systems. SCHED_DEADLINE with proper admission control provides hard real-time guarantees.

Soft real-time: Missing a deadline reduces quality but the system continues operating. Video playback dropping frames, audio glitching, or a trading system missing an arbitrage window are soft real-time failures. SCHED_FIFO and SCHED_RR provide soft real-time behavior — high priority but no formal deadline guarantee.

Key difference: hard real-time requires formal verification that the system can meet all deadlines under worst-case conditions. Soft real-time uses priority and FIFO/RR policies to minimize missed deadlines but doesn't guarantee them.

14. What is priority inheritance and how does it solve the priority inversion problem?

Priority inversion: High-priority task H is blocked waiting for a lock held by low-priority task L. Medium-priority task M preempts L, so L cannot run to release the lock and H is delayed indefinitely.

Priority inheritance: When H blocks on a lock held by L, L's priority is temporarily raised to H's priority (or to the highest priority among the waiters if several tasks are blocked on the lock). This prevents M from preempting L while L holds the lock needed by H.

The inheritance chain can be transitive: if L is itself blocked on a lock held by another task L2, the boosted priority propagates to L2 as well. When L releases the lock, its priority reverts to its base level, or to the highest priority it still inherits through other locks it holds.

Linux implements priority inheritance for mutexes via PI-futexes. Not all locks support it — only those explicitly implemented with PI-futexes. Priority inheritance adds overhead, so it is not enabled by default for all synchronization primitives.
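The inheritance step can be modeled with a toy mutex. This is a simplified single-lock sketch, not how PI-futexes are implemented: the `Task` and `PIMutex` classes are hypothetical, a higher number means higher priority here, and transitive chains and multiple held locks are deliberately left out.

```python
class Task:
    def __init__(self, name, priority):
        self.name = name
        self.base_priority = priority       # higher number = higher priority
        self.effective_priority = priority  # may be boosted by inheritance


class PIMutex:
    def __init__(self):
        self.owner = None
        self.waiters = []

    def lock(self, task):
        """Return True if acquired, False if the task must block."""
        if self.owner is None:
            self.owner = task
            return True
        self.waiters.append(task)
        # Boost the owner to the highest priority among blocked waiters.
        self.owner.effective_priority = max(
            self.owner.effective_priority, task.effective_priority
        )
        return False

    def unlock(self):
        """Drop the boost and hand the lock to the highest-priority waiter."""
        self.owner.effective_priority = self.owner.base_priority
        if self.waiters:
            self.waiters.sort(key=lambda t: t.effective_priority, reverse=True)
            self.owner = self.waiters.pop(0)
        else:
            self.owner = None
```

If L (priority 1) holds the mutex and H (priority 10) blocks on it, L's effective priority becomes 10, so M (priority 5) can no longer preempt it. After `unlock()`, L drops back to 1 and H owns the lock.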

15. How does Round Robin handle processes with different priority levels?

In priority-based Round Robin, the scheduler first selects the highest priority non-empty queue. All processes in that queue are scheduled in round-robin fashion with time quantum appropriate for that priority level.

Lower priority queues only receive CPU time when all higher priority queues are empty. This is why high-priority interactive processes get excellent response time — they are always scheduled first.

The tradeoff is that lower priority batch jobs may starve if higher priority processes are always runnable. This is addressed by periodically boosting starving processes' priority (aging) or using a more sophisticated multi-level queue that considers both priority and wait time.
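The selection rule, and the starvation it can cause, can be seen in a small sketch. The queue layout (a list of deques, index 0 highest) and the function name are assumptions for illustration; time quanta are omitted.

```python
from collections import deque


def pick_next(queues):
    """Pick from the highest-priority non-empty queue, round-robin within it.

    `queues` is a list of deques, index 0 being the highest priority.
    The chosen task is rotated to the back of its own queue.
    """
    for queue in queues:
        if queue:
            task = queue.popleft()
            queue.append(task)
            return task
    return None


queues = [deque(["editor", "shell"]), deque(["batch"])]
picks = [pick_next(queues) for _ in range(4)]
```

`picks` comes out as `["editor", "shell", "editor", "shell"]`: as long as the top queue stays non-empty, `batch` never runs, which is exactly the starvation that aging is meant to prevent.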

16. What is the relationship between scheduler latency and system responsiveness?

Scheduler latency (scheduling latency) is the time between a process becoming runnable (e.g., I/O completing) and it actually starting to run on a CPU. System responsiveness is how quickly the system reacts to user actions.

For an interactive application (e.g., a shell command), responsiveness means the time between pressing Enter and seeing output. This depends on the shell process being scheduled, not just the CPU-bound work. Even if the CPU is busy, interactive processes need to be scheduled quickly to maintain the feeling of responsiveness.

Reducing scheduling latency improves responsiveness but may reduce throughput (more context switches). The kernel targets a balance via sched_latency_ns and sched_min_granularity_ns parameters.

17. How does the Linux CFS scheduler implement fairness?

CFS uses a red-black tree ordered by vruntime. Each runnable process accumulates vruntime proportional to the CPU time used, adjusted by the process's weight (from nice value). The leftmost node (minimum vruntime) is always picked as the next process to run.

Fairness is achieved because vruntime tracks how much weighted CPU time a process has already received. A process that has run a lot has high vruntime. A process that hasn't run has low vruntime and will be picked next.

The target latency (sched_latency_ns) is the period within which every runnable process should get to run once; each process's share of that period is its time slice. With many processes, each gets a smaller time slice but all make progress.
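The pick-minimum-vruntime loop can be simulated in a few lines. This is a sketch under stated assumptions: a binary heap stands in for the kernel's red-black tree, every turn uses a fixed slice rather than a computed one, and only three entries of the nice-to-weight table are included.

```python
import heapq

NICE_0_WEIGHT = 1024
# A few entries from the kernel's nice-to-weight table.
WEIGHTS = {-5: 3121, 0: 1024, 5: 335}


def simulate(tasks, slice_ns, steps):
    """Run `steps` scheduling turns; return CPU time received per task.

    `tasks` maps task name -> nice value (must be a key of WEIGHTS).
    """
    heap = [(0, name) for name in sorted(tasks)]  # (vruntime, name)
    heapq.heapify(heap)
    cpu = {name: 0 for name in tasks}
    for _ in range(steps):
        vruntime, name = heapq.heappop(heap)      # smallest vruntime runs next
        cpu[name] += slice_ns
        # Heavier tasks accumulate vruntime more slowly for the same CPU time.
        vruntime += slice_ns * NICE_0_WEIGHT // WEIGHTS[tasks[name]]
        heapq.heappush(heap, (vruntime, name))
    return cpu
```

Two tasks at nice 0 split the CPU evenly. With one task at nice -5 and one at nice 5, the higher-weight task ends up with roughly nine times the CPU time, in line with the 3121:335 weight ratio.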

18. What is the relationship between scheduling class and scheduler implementation in Linux?

Linux defines scheduling classes (struct sched_class) that plug into the scheduler. Each class implements a common interface: enqueue_task, dequeue_task, pick_next_task, yield_task, etc.

Built-in classes in order of priority: stop_sched_class (for stop_machine), dl_sched_class (SCHED_DEADLINE), rt_sched_class (SCHED_FIFO/SCHED_RR), fair_sched_class (CFS, SCHED_OTHER), idle_sched_class (idle thread).

The scheduler iterates through classes in priority order when picking the next task. Only when a class has no runnable tasks does the scheduler check the next class. This means real-time tasks always preempt CFS tasks.
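The class-by-class lookup can be modeled as a loop over an ordered list. This is a conceptual sketch only: the short class labels, the dictionary of run queues, and the function signature are assumptions and do not mirror the kernel's data structures.

```python
# Scheduling classes from highest to lowest priority.
CLASS_ORDER = ["stop", "deadline", "rt", "fair", "idle"]


def pick_next_task(runqueues):
    """Return (class, task) from the first class with a runnable task.

    `runqueues` maps a class label to a list of runnable task names.
    """
    for sched_class in CLASS_ORDER:
        queue = runqueues.get(sched_class, [])
        if queue:
            return sched_class, queue[0]
    return None
```

For `{"fair": ["bash"], "rt": ["audio"]}` this returns `("rt", "audio")`: the fair class is only consulted once the higher classes have nothing to run.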

19. What is the difference between SCHED_OTHER and SCHED_NORMAL in Linux?

There is no functional difference. SCHED_OTHER is the POSIX name exposed to user space and SCHED_NORMAL is the name used inside the kernel; both refer to the same default CFS-based policy used by normal processes.

SCHED_BATCH is another CFS variant for batch jobs — it has slightly higher latency tolerance and is optimized for throughput over interactivity. SCHED_IDLE is for extremely low-priority background tasks that should only run when the system is otherwise idle.

All of these (SCHED_OTHER/SCHED_NORMAL, SCHED_BATCH, SCHED_IDLE) use the CFS scheduler with different parameters or adjustments.

20. How does CPU overcommit affect scheduling decisions in a virtualized environment?

CPU overcommit means the hypervisor assigns more virtual CPUs to VMs than physical CPUs exist. The hypervisor scheduler must time-share physical CPUs among vCPUs.

When a vCPU is ready to run but no physical CPU is available, it must wait while the hypervisor runs other vCPUs. The guest experiences this as steal time, which adds scheduling latency and reduces determinism. Real-time tasks inside VMs suffer because the hypervisor introduces an extra scheduling layer.

Mitigations: Use CPU affinity to pin vCPUs to physical cores, reducing hypervisor scheduling overhead. Use real-time scheduling classes for critical VMs. Reserve some physical cores for real-time workloads so they are not stolen by overcommitted VMs.

Conclusion

Process scheduling algorithms are the mechanism by which operating systems decide which runnable process gets CPU time. The choice of algorithm profoundly impacts system responsiveness, throughput, and fairness.

FCFS is simple but vulnerable to the convoy effect, where short processes wait behind long ones. SJF achieves optimal average wait time but requires knowing burst times in advance and can starve long processes. Round Robin provides fair scheduling with good response time, but quantum size is critical: too small causes excessive context switches, too large degrades interactivity.

Priority scheduling allows important work to run first, but without aging, low-priority processes can starve indefinitely. Multi-Level Feedback Queue (MLFQ) dynamically adapts, giving interactive processes quick response while letting CPU-bound jobs progress without dominating. Modern Linux, Windows, and macOS all combine multiple algorithms this way.

For your next step, explore CPU interrupts and context switches to understand the mechanics of how the scheduler actually transfers control between processes, or dive into synchronization primitives (mutexes, semaphores, condition variables) which protect shared data in concurrent multi-threaded environments.
