Process Scheduling Algorithms

A comprehensive guide to CPU scheduling algorithms including FCFS, SJF, Round Robin, Priority scheduling, and multi-level feedback queues.

published: reading time: 26 min read author: GeekWorkBench

Process Scheduling Algorithms

Imagine you’re a air traffic controller managing a runway. You have planes (processes) of different sizes, different urgency levels, and different destinations (CPU burst lengths). Your job: keep the runway busy, minimize waiting time, and somehow satisfy everyone.

This is essentially what the CPU scheduler does — every millisecond, deciding which runnable process gets the CPU. The algorithm you choose determines whether your system feels responsive or sluggish, whether short tasks complete quickly or get buried under long ones.

Introduction

CPU scheduling algorithms decide which runnable process to run next on each CPU core. The scheduler makes this decision based on various factors: burst time, priority, waiting time, and the scheduling policy.

Modern systems often use multiple algorithms in combination. Linux’s Completely Fair Scheduler (CFS) handles normal processes, while special schedulers handle real-time tasks. Windows uses a multi-level feedback queue that dynamically adjusts priorities.

When to Use

  • System tuning — Choosing the right scheduler improves responsiveness
  • Real-time applications — Guaranteeing deadlines requires specific algorithms
  • Batch processing — Maximizing throughput is the primary goal
  • Multi-user systems — Fairness ensures no user gets starved

When Not to Use

  • Simple scripts — Default scheduler is sufficient
  • Single-user interactive — System defaults are optimized for this
  • Embedded with fixed workload — Custom schedulers add complexity

Types of Scheduling

Non-Preemptive vs Preemptive

Non-preemptive: Once a process starts running, it runs to completion (or until it blocks). Simple but can’t respond to higher priority tasks quickly.

Preemptive: The scheduler can interrupt a running process and switch to another. Modern OSes use preemptive scheduling for responsiveness.

graph LR
    subgraph Non-Preemptive
        A1[Process A starts] --> A2[A runs to completion]
    end

    subgraph Preemptive
        B1[Process B starts] --> B2[B preempted]
        B2 --> C1[Process C runs]
        C1 --> B3[B resumes]
    end

    style Non-Preemptive stroke:#4a1a1a
    style Preemptive stroke:#1a3a1a

Core Algorithms

First-Come, First-Served (FCFS)

The simplest algorithm: processes run in the order they arrived.

Pros: Simple, low overhead Cons: Convoy effect — short processes wait behind long ones

// Simulate FCFS scheduling
#include <stdio.h>

typedef struct {
    int pid;
    int arrival_time;
    int burst_time;
} Process;

void fcfs_schedule(Process procs[], int n) {
    int current_time = 0;

    printf("FCFS Scheduling:\n");
    printf("%-10s %-15s %-15s %-15s\n", "PID", "Start Time", "End Time", "Wait Time");

    for (int i = 0; i < n; i++) {
        int wait = current_time - procs[i].arrival_time;
        if (wait < 0) wait = 0;
        int start = current_time;
        current_time += procs[i].burst_time;
        int end = current_time;

        printf("%-10d %-15d %-15d %-15d\n", procs[i].pid, start, end, wait);
    }
}

Shortest Job First (SJF)

Select the process with the shortest burst time. Optimal for minimizing average waiting time — but requires knowing burst times in advance (which is rarely possible).

Pros: Optimal average wait time Cons: Starvation of long processes; requires burst time prediction

// Simulate SJF scheduling (non-preemptive)
#include <stdio.h>
#include <limits.h>

void sjf_schedule(Process procs[], int n) {
    int completed = 0;
    int current_time = 0;
    int total_wait = 0;

    int remaining = n;
    int visited[n];
    for (int i = 0; i < n; i++) visited[i] = 0;

    printf("\nSJF Scheduling:\n");
    printf("%-10s %-15s %-15s %-15s %-15s\n", "PID", "Arrival", "Burst", "Start", "Wait");

    while (completed < n) {
        int shortest = -1;
        int min_burst = INT_MAX;

        // Find shortest job that's arrived and not completed
        for (int i = 0; i < n; i++) {
            if (!visited[i] && procs[i].arrival_time <= current_time &&
                procs[i].burst_time < min_burst) {
                shortest = i;
                min_burst = procs[i].burst_time;
            }
        }

        if (shortest == -1) {
            current_time++;
            continue;
        }

        int wait = current_time - procs[shortest].arrival_time;
        int start = current_time;
        current_time += procs[shortest].burst_time;
        int end = current_time;

        printf("%-10d %-15d %-15d %-15d %-15d\n",
               procs[shortest].pid, procs[shortest].arrival_time,
               procs[shortest].burst_time, start, wait);

        total_wait += wait;
        visited[shortest] = 1;
        completed++;
    }

    printf("Average wait time: %.2f\n", (float)total_wait / n);
}

Round Robin (RR)

Each process gets a fixed time slice (quantum). When the quantum expires, the process goes to the back of the ready queue. Designed for time-sharing.

Pros: Fair, good response time, handles starvation Cons: Higher context switch overhead, depends on quantum size

#!/usr/bin/env python3
"""Round Robin Scheduler Simulation."""

from collections import deque

class Process:
    def __init__(self, pid, burst, arrival=0):
        self.pid = pid
        self.burst = burst
        self.arrival = arrival
        self.remaining = burst
        self.wait_time = 0
        self.turnaround = 0

def round_robin(processes, quantum):
    ready = deque(processes)
    current_time = 0
    completed = []

    print(f"Round Robin (quantum={quantum}):")
    print(f"{'Time':<8} {'Event':<20} {'Ready Queue'}")

    while ready:
        process = ready.popleft()

        if process.arrival > current_time:
            # Skip ahead if no processes have arrived
            current_time = process.arrival
            ready.append(process)
            continue

        # Execute for quantum or until completion
        exec_time = min(quantum, process.remaining)
        current_time += exec_time
        process.remaining -= exec_time

        # Other processes in ready queue wait
        for p in ready:
            if p.arrival <= current_time:
                p.wait_time += exec_time

        if process.remaining == 0:
            process.turnaround = current_time - process.arrival
            completed.append(process)
            print(f"{current_time:<8} P{process.pid} completes")
        else:
            ready.append(process)
            print(f"{current_time:<8} P{process.pid} preempted")

    return completed

# Example
processes = [Process(1, 24), Process(2, 3), Process(3, 3)]
result = round_robin(processes, 4)

print(f"\n{'PID':<8} {'Burst':<8} {'Wait':<8} {'Turnaround':<12}")
print("-" * 40)
for p in result:
    print(f"{p.pid:<8} {p.burst:<8} {p.wait_time:<8} {p.turnaround:<12}")

Priority Scheduling

Processes are assigned priorities. Higher priority processes run first. Two variants:

Preemptive: A new higher-priority process preempts the running one Non-preemptive: Higher priority only matters when current process blocks

Problem: Starvation — low priority processes may never run if higher priority ones keep arriving.

Solution: Aging — gradually increase priority of waiting processes.

// Priority Scheduling with Aging
#include <stdio.h>

#define BASE_PRIORITY 50
#define AGING_RATE 1  // Priority increases by 1 per time unit waiting

typedef struct {
    int pid;
    int base_priority;  // Lower number = higher priority
    int dynamic_priority;
    int remaining_burst;
    int waiting_time;
} Process;

void priority_with_aging(Process procs[], int n, int time_units) {
    printf("\nPriority Scheduling with Aging:\n");

    int completed = 0;
    int current_time = 0;

    while (completed < n) {
        // Update dynamic priorities (aging)
        for (int i = 0; i < n; i++) {
            if (procs[i].remaining_burst > 0 && procs[i].waiting_time > 0) {
                procs[i].dynamic_priority -= AGING_RATE;
                if (procs[i].dynamic_priority < 1) {
                    procs[i].dynamic_priority = 1;  // Max priority
                }
            }
        }

        // Find highest priority (lowest number)
        int highest = -1;
        int highest_priority = 999;

        for (int i = 0; i < n; i++) {
            if (procs[i].remaining_burst > 0 &&
                procs[i].dynamic_priority < highest_priority) {
                highest = i;
                highest_priority = procs[i].dynamic_priority;
            }
        }

        if (highest == -1) {
            current_time++;
            continue;
        }

        // Execute one time unit
        printf("Time %d: P%d (priority %d)\n",
               current_time, procs[highest].pid, procs[highest].dynamic_priority);

        procs[highest].remaining_burst--;
        current_time++;

        // Increment wait time for other processes
        for (int i = 0; i < n; i++) {
            if (i != highest && procs[i].remaining_burst > 0) {
                procs[i].waiting_time++;
            }
        }

        if (procs[highest].remaining_burst == 0) {
            completed++;
        }
    }
}

Multi-Level Feedback Queue (MLFQ)

MLFQ combines multiple queues with different priorities and time quantum sizes. Processes start in highest priority queue; if they use their entire quantum, they drop to lower priority. This allows short interactions to complete quickly while CPU-bound jobs get time.

graph TB
    subgraph Q1[Queue 1 - Highest Priority]
        A1[Interactive<br/>Quantum: 8ms]
    end
    subgraph Q2[Queue 2 - Medium Priority]
        B1[Mixed<br/>Quantum: 16ms]
    end
    subgraph Q3[Queue 3 - Lowest Priority]
        C1[Batch<br/>Quantum: 32ms]
    end

    A1 -->|Uses full quantum| B1
    B1 -->|Uses full quantum| C1
    C1 -->|Blocked I/O| A1

    A1 -->|Shorter quantum| Scheduler
    B1 -->|Medium quantum| Scheduler
    C1 -->|Longer quantum| Scheduler

    Scheduler -->|picks highest non-empty| A1
    Scheduler -->|picks highest non-empty| B1
    Scheduler -->|picks highest non-empty| C1

    style Q1 stroke:#1a4a1a
    style Q2 stroke:#1a1a4a
    style Q3 stroke:#4a1a1a

Architecture Diagram

sequenceDiagram
    participant P1 as Process A (CPU-bound)
    participant P2 as Process B (Interactive)
    participant Scheduler
    participant CPU

    Note over Scheduler,CPU: Time 0ms
    P2->>Scheduler: I/O completes, ready
    Scheduler->>CPU: Schedule P2 (interactive)
    CPU->>P2: Run P2

    Note over Scheduler,CPU: Time 8ms (quantum expires)
    Scheduler->>CPU: Preempt P2, schedule P1
    CPU->>P1: Run P1 (uses full quantum)

    Note over Scheduler,CPU: Time 16ms
    Scheduler->>CPU: Move P1 to lower queue
    Scheduler->>CPU: Schedule P2 again
    CPU->>P2: Continue P2

    Note over Scheduler,CPU: Time 24ms (P2 completes)
    Scheduler->>CPU: Schedule next in Q2
    CPU->>P1: Continue P1 in Q2

Production Failure Scenarios

Convoy Effect in FCFS

Problem: A long CPU-bound process blocks short I/O-bound processes. Short jobs wait while long job holds CPU.

Symptoms: System appears responsive at first, then slows as long jobs dominate.

Mitigation: Use Round Robin or SJF to give short jobs priority.

Starvation in Priority Scheduling

Problem: Low priority processes never get CPU time as high priority processes keep arriving.

Symptoms: Certain processes show extremely high wait times, never completing.

Mitigation: Implement aging — increase priority of waiting processes over time.

Quantum Too Small

Problem: Excessive context switches consume more CPU than useful work.

Symptoms: High system CPU utilization but low throughput, high context switch rate.

Mitigation: Profile with perf sched and adjust quantum. Rule: quantum should be 10x context switch cost.

Quantum Too Large

Problem: System feels unresponsive; long processes block short ones.

Symptoms: Poor response time for interactive tasks despite idle CPU.

Mitigation: Smaller quantum or combination of algorithms (interactive vs batch).


Trade-off Table

AlgorithmAvg WaitResponseThroughputFairness
FCFSHighPoorMediumUnfair
SJFLowestPoorHighestUnfair (starvation)
Round RobinMediumBestMediumFair
PriorityVariableVariableVariableUnfair (starvation)
MLFQLowBestHighestFair
ScenarioBest AlgorithmWhy
Batch processingSJFMaximize throughput
Interactive desktopRound Robin / MLFQMinimize response time
Real-time systemsFixed priority / Rate monotonicGuarantees deadlines
Multi-user timeshareRound RobinFairness
Mixed workloadMLFQAdapts dynamically
Quantum SettingContext SwitchesResponse Time
Too smallManyLow (but wasted overhead)
OptimalModerateGood balance
Too largeFewPoor response time

Implementation Snippets

Implementing a Simple MLFQ

#!/usr/bin/env python3
"""Multi-Level Feedback Queue Scheduler."""

from collections import deque

class Process:
    def __init__(self, pid, burst, priority=0):
        self.pid = pid
        self.burst = burst
        self.priority = priority  # 0 = highest
        self.remaining = burst

class MLFQScheduler:
    def __init__(self, num_queues=3, quantums=[8, 16, 32]):
        self.queues = [deque() for _ in range(num_queues)]
        self.quantums = quantums

    def add_process(self, proc):
        self.queues[proc.priority].append(proc)

    def schedule(self):
        time = 0
        completed = []

        while True:
            # Find highest non-empty queue
            chosen = None
            for i, q in enumerate(self.queues):
                if q:
                    chosen = (i, q.popleft())
                    break

            if chosen is None:
                break

            q_idx, proc = chosen
            quantum = self.quantums[q_idx]

            # Execute for quantum or until completion
            exec_time = min(quantum, proc.remaining)
            time += exec_time
            proc.remaining -= exec_time

            if proc.remaining == 0:
                proc.turnaround = time
                completed.append(proc)
                print(f"Time {time}: P{proc.pid} completed")
            else:
                # Demote to lower priority (next queue)
                next_q = min(q_idx + 1, len(self.queues) - 1)
                proc.priority = next_q
                self.queues[next_q].append(proc)
                print(f"Time {time}: P{proc.pid} demoted to Q{next_q}")

        return completed

# Example
scheduler = MLFQScheduler()
processes = [
    Process(1, 20),  # Interactive (will get demoted if CPU-bound)
    Process(2, 10),  # Short interactive
    Process(3, 40),  # Long batch
]
for p in processes:
    scheduler.add_process(p)

results = scheduler.schedule()

Measuring Scheduling Performance

#!/bin/bash
# Measure scheduling performance metrics

echo "=== Scheduler Information ==="
cat /proc/sys/kernel/sched_latency_ns
cat /proc/sys/kernel/sched_min_granularity_ns
cat /proc/sys/kernel/sched_migration_cost_ns

echo -e "\n=== Context Switches ==="
cat /proc/stat | grep ctxt

echo -e "\n=== Run Queue Length ==="
vmstat 1 5

echo -e "\n=== Process State ==="
ps -eo state,pid,cmd | sort | uniq -c | sort -rn

echo -e "\n=== Scheduler Trace ==="
sudo perf sched record -- sleep 5
sudo perf sched latency

Observability Checklist

Key Metrics

# CPU utilization per core
mpstat -P ALL 1

# Run queue length (should be < 4 * cores)
vmstat 1 | awk '{print $r}'

# Context switches per second
sar -w 1

# Average scheduling latency
cat /proc/sched_debug | grep latency

Tracing

# Trace scheduling decisions
sudo perf trace -e sched:*

# Scheduler latency histogram
perf sched record -- sleep 10
perf sched timehist

# Per-process scheduling info
ps -eo pid,psr,ni,time,cmd --sort=-time

Common Pitfalls / Anti-Patterns

Priority Manipulation Attacks

Unprivileged users setting negative nice values can starve others. Restrict via:

# /etc/security/limits.conf
* hard nice 0

Real-time Process Security

Real-time scheduling classes (SCHED_FIFO, SCHED_RR) can monopolize CPU. Require CAP_SYS_NICE or proper cgroup configuration.


Common Pitfalls / Anti-patterns

Ignoring quantum tuning

Default quantum may not match workload. Profile and adjust.

Using SJF without burst prediction

In production, burst times aren’t known. Use MLFQ to approximate SJF behavior.

Forgetting about I/O blocking

Processes that block frequently should get higher priority — don’t treat all runnable processes the same.

Not considering multi-core

Single queue vs per-CPU queues have different tradeoffs for cache affinity vs contention.


Quick Recap Checklist

  • FCFS is simple but causes convoy effect
  • SJF minimizes average wait but requires knowing burst times
  • Round Robin is fair with good response time; quantum size matters
  • Priority scheduling has starvation; use aging to mitigate
  • MLFQ combines multiple algorithms, dynamically adjusting
  • Quantum too small = excessive context switches; too large = poor response
  • No single algorithm is best — trade-offs depend on workload

Interview Questions

1. What is the difference between preemptive and non-preemptive scheduling?

Non-preemptive scheduling means once a process starts running, it runs to completion (or until it voluntarily blocks). The scheduler only makes decisions when the running process blocks or terminates.

Preemptive scheduling allows the OS to forcibly interrupt a running process and switch to another. This happens on timer interrupts, when a higher priority process becomes ready, or based on time slice expiration.

Preemptive scheduling is essential for interactive systems because:

  • A long-running calculation shouldn't freeze the UI
  • Higher priority real-time tasks can preempt lower priority ones
  • Each process gets guaranteed CPU time regardless of other processes

Linux, Windows, and macOS all use preemptive scheduling for user processes. Some embedded systems use cooperative scheduling, but this requires all processes to be well-behaved.

2. Explain the convoy effect and how to mitigate it.

The convoy effect occurs in FCFS when a long CPU-bound process ahead causes short I/O-bound processes to wait, even though the I/O devices are idle. The long process holds the CPU while the short processes wait with idle I/O.

Example: Process A (100ms CPU) arrives before Process B (1ms CPU). B arrives just after A starts. B must wait 100ms while A uses CPU, even though B could complete in 1ms.

Mitigations:

  • Round Robin: Each process gets quantum time; short jobs complete faster
  • SJF: Short jobs run first, minimizing wait time (but requires knowing burst times)
  • Priority scheduling: Give I/O-bound processes higher priority
  • MLFQ: I/O-bound processes naturally stay in high-priority queues
3. How does MLFQ handle both interactive and batch workloads?

MLFQ uses multiple queues with different priorities and time quantum sizes:

  • High priority queue: Short quantum (4-8ms) for interactive processes
  • Medium priority: Medium quantum (16-32ms)
  • Low priority: Long quantum or FCFS for batch jobs

How it works:

  1. New processes enter the highest priority queue
  2. If a process uses its entire quantum, it's demoted to a lower queue
  3. If a process blocks before using its quantum, it stays in the current queue (or moves up)
  4. When a process in a higher queue becomes ready, it preempts lower ones

Result: Interactive processes (short bursts, frequent I/O) stay in high-priority queues with quick response. CPU-bound batch jobs naturally sink to lower queues but get longer time slices when they do run.

4. What is the relationship between quantum size and context switch overhead?

A context switch is pure overhead — CPU cycles spent saving and restoring process state instead of doing useful work. On modern x86-64, a context switch takes approximately 1-5 microseconds.

If quantum is too small (e.g., 1ms) and context switch takes 0.5ms, then 33% of CPU time is spent on scheduling overhead.

Rules for quantum sizing:

  • Quantum should be 10-20x the context switch cost
  • 80% of CPU bursts should be shorter than quantum (otherwise most processes will be preempted)
  • Typical values: 4-100ms depending on workload

On Linux, sched_latency_ns (default ~48ms) divided by the number of runnable processes gives the per-process quantum. sched_min_granularity_ns sets a minimum to prevent excessive switching.

5. What is starvation and how does aging mitigate it?

Starvation (also called indefinite postponement) occurs when a process never gets scheduled because other processes keep getting priority. In priority scheduling, low-priority processes can wait forever if high-priority tasks keep arriving.

Aging is a technique to prevent starvation by gradually increasing the priority of waiting processes over time. The longer a process waits, the higher its priority becomes, until eventually it gets scheduled.

Example: A process with priority 50 (lowest) waiting 500ms might have its priority increased to 45, then 40, etc. After enough waiting, it becomes high priority and runs.

Linux's CFS doesn't use static priorities for aging — instead, the vruntime concept naturally handles fairness: a process that hasn't run for a long time accumulates less vruntime, making it more likely to be scheduled. This is implicit aging.

6. What is the convoy effect in FCFS scheduling and how does it impact system performance?

The convoy effect occurs when one or more CPU-bound processes monopolize the CPU, causing I/O-bound processes to wait even though the I/O devices are idle. The I/O-bound processes make a request, get scheduled briefly, block on I/O immediately, and then wait while the CPU-bound process completes its burst.

This creates a convoy of I/O requests behind a slow CPU-bound leader. The system appears sluggish despite having plenty of CPU capacity because I/O-bound processes complete quickly when they get CPU but are rarely given the chance.

Mitigation: Use SJF or Round Robin to give shorter jobs priority. MLFQ naturally handles this — I/O-bound processes stay in high-priority queues and get frequent short time slices.

7. How does MLFQ prevent starvation of low-priority processes?

MLFQ prevents starvation through two mechanisms: aging and queue promotion.

Aging: If a process stays in a lower queue for too long without being scheduled, its priority is gradually boosted, moving it to a higher queue. This ensures that even low-priority CPU-bound jobs eventually get CPU time.

Queue promotion: Some MLFQ implementations periodically promote all processes to the highest priority queue, ensuring that long-waiting processes are given a chance to run at high priority.

The key insight is that MLFQ assumes processes will be CPU-bound or I/O-bound — CPU-bound jobs naturally sink to lower queues, while I/O-bound jobs stay in high queues. Without aging, truly CPU-only jobs could be stuck at the bottom forever.

8. What is the relationship between time quantum and scheduling overhead?

Scheduling overhead is the CPU time spent deciding which process to run and performing the context switch, rather than doing useful work.

A smaller quantum means more frequent context switches, increasing overhead as a fraction of total CPU time. If quantum is 1ms and context switch takes 0.5ms, then 33% of CPU time is pure overhead.

A larger quantum reduces context switch frequency and overhead, but at the cost of poor response time — processes may wait a long time before getting CPU.

Rule of thumb: quantum should be 10-20x the context switch cost. On Linux, CFS dynamically calculates quantum based on sched_latency_ns divided by the number of runnable processes, yielding longer quanta when many processes are running.

9. What is the difference between turnaround time and waiting time as scheduling metrics?

Turnaround time = time from job submission to time of completion. It includes waiting time plus actual execution time plus any I/O time.

Waiting time = time spent in the ready queue, not executing. It excludes the actual execution time and I/O wait.

A scheduler optimizing for turnaround time might give long jobs more CPU upfront (FCFS), which increases waiting time for short jobs. A scheduler optimizing for waiting time (SJF) might reduce turnaround time but could starve long jobs.

Important: only waiting time is directly under the scheduler's control. Turnaround time depends on both scheduling decisions and the process's own behavior (I/O patterns, burst lengths).

10. How does multi-level feedback queue (MLFQ) dynamically learn process behavior?

MLFQ learns by observing whether a process uses its entire time quantum or blocks before it expires:

  • Uses full quantum (CPU-bound): MLFQ assumes the process needs heavy computation and demotes it to a lower priority queue with a longer quantum.
  • Blocks before quantum expires (I/O-bound): MLFQ assumes the process is interactive and keeps it in the current queue or promotes it to a higher priority queue.

This feedback-driven approach means MLFQ starts with no knowledge of a process and adapts based on observed behavior. A long-running batch job will be demoted over time; an interactive shell remains at high priority.

The tradeoff is that MLFQ can be gamed — a malicious process could periodically perform I/O to stay at high priority and monopolize CPU.

11. What is the oldest first (OFA) or fair share scheduling?

Fair share scheduling ensures that each user or group receives a fair proportion of CPU time, regardless of how many processes they have running. For example, if user A has 10 processes and user B has 1 process, both users get 50% of CPU (not user A getting 91% because they have more processes).

Linux's CFS implements this via the concept of "weight" (based on nice values) and per-group accounting. cgroups can be used to partition CPU time between different groups (e.g., production vs development workloads).

The key data structure is a red-black tree keyed by vruntime, but processes from different users have their vruntime accumulated separately before comparison, ensuring fair distribution.

12. How does deadline scheduling (SCHED_DEADLINE) guarantee task deadlines?

SCHED_DEADLINE uses the Earliest Deadline First (EDF) algorithm with bandwidth reservation. Each task specifies:

  • Runtime: Maximum CPU time needed per period (worst-case execution time)
  • Deadline: Time by which execution must complete
  • Period: Interval between successive task invocations

The scheduler checks that total runtime/period (bandwidth) does not exceed CPU capacity. If the system is oversubscribed, admission is denied.

At runtime, EDF picks the task with the earliest deadline. As long as total reserved bandwidth is under 100%, the scheduler can guarantee all deadlines will be met. This is a hard real-time guarantee — not probabilistic, but mathematical.

13. What is the difference between hard real-time and soft real-time scheduling?

Hard real-time: A missed deadline is a system failure. The task must complete by its deadline or the result is useless or dangerous. Examples: airbag controllers, pacemaker pacemakers, flight control systems. SCHED_DEADLINE with proper admission control provides hard real-time guarantees.

Soft real-time: Missing a deadline reduces quality but the system continues operating. Video playback dropping frames, audio glitching, or a trading system missing an arbitrage window are soft real-time failures. SCHED_FIFO and SCHED_RR provide soft real-time behavior — high priority but no formal deadline guarantee.

Key difference: hard real-time requires formal verification that the system can meet all deadlines under worst-case conditions. Soft real-time uses priority and FIFO/RR policies to minimize missed deadlines but doesn't guarantee them.

14. What is priority inheritance and how does it solve the priority inversion problem?

Priority inversion: High-priority task H is blocked waiting for a lock held by low-priority task L. Medium-priority task M preempts L, delaying H's lock release indefinitely.

Priority inheritance: When H blocks on a lock held by L, L's priority is temporarily raised to H's priority (or higher if there are multiple waiting tasks). This prevents M from preempting L while L holds the lock needed by H.

The inheritance chain can be transitive: if L holds another lock that a higher priority task H2 wants, L's priority gets raised to H2's level. When L releases the lock, its priority reverts.

Linux implements priority inheritance for mutexes via PI-futexes. Not all locks support it — only those explicitly implemented with PI-futexes. Priority inheritance adds overhead, so it is not enabled by default for all synchronization primitives.

15. How does Round Robin handle processes with different priority levels?

In priority-based Round Robin, the scheduler first selects the highest priority non-empty queue. All processes in that queue are scheduled in round-robin fashion with time quantum appropriate for that priority level.

Lower priority queues only receive CPU time when all higher priority queues are empty. This is why high-priority interactive processes get excellent response time — they are always scheduled first.

The tradeoff is that lower priority batch jobs may starve if higher priority processes are always runnable. This is addressed by periodically boosting starving processes' priority (aging) or using a more sophisticated multi-level queue that considers both priority and wait time.

16. What is the relationship between scheduler latency and system responsiveness?

Scheduler latency (scheduling latency) is the time between a process becoming runnable (e.g., I/O completing) and it actually starting to run on a CPU. System responsiveness is how quickly the system reacts to user actions.

For an interactive application (e.g., a shell command), responsiveness means the time between pressing Enter and seeing output. This depends on the shell process being scheduled, not just the CPU-bound work. Even if the CPU is busy, interactive processes need to be scheduled quickly to maintain the feeling of responsiveness.

Reducing scheduling latency improves responsiveness but may reduce throughput (more context switches). The kernel targets a balance via sched_latency_ns and sched_min_granularity_ns parameters.

17. How does the Linux CFS scheduler implement fairness?

CFS uses a red-black tree ordered by vruntime. Each runnable process accumulates vruntime proportional to the CPU time used, adjusted by the process's weight (from nice value). The leftmost node (minimum vruntime) is always picked as the next process to run.

Fairness is achieved because vruntime represents "how much CPU time this process deserves but hasn't received yet." A process that has run a lot has high vruntime. A process that hasn't run has low vruntime and will be picked next.

The target latency (sched_latency_ns) determines how long a process can run before being preempted. With many processes, each gets a smaller time slice but all make progress.

18. What is the relationship between scheduling class and scheduler implementation in Linux?

Linux defines scheduling classes (struct sched_class) that plug into the scheduler. Each class implements a common interface: enqueue, dequeue, pick_next, yield, etc.

Built-in classes in order of priority: stop_sched_class (for stop_machine), dl_sched_class (SCHED_DEADLINE), rt_sched_class (SCHED_FIFO/SCHED_RR), fair_sched_class (CFS, SCHED_OTHER), idle_sched_class (idle thread).

The scheduler iterates through classes in priority order when picking the next task. Only when a class has no runnable tasks does the scheduler check the next class. This means real-time tasks always preempt CFS tasks.

19. What is the difference between SCHED_OTHER and SCHED_NORMAL in Linux?

There is no functional difference — SCHED_OTHER and SCHED_NORMAL are aliases for the same scheduling policy. Both refer to the default CFS-based scheduler used by normal processes.

SCHED_BATCH is another CFS variant for batch jobs — it has slightly higher latency tolerance and is optimized for throughput over interactivity. SCHED_IDLE is for extremely low-priority background tasks that should only run when the system is otherwise idle.

All of these (SCHED_OTHER/SCHED_NORMAL, SCHED_BATCH, SCHED_IDLE) use the CFS scheduler with different parameters or adjustments.

20. How does CPU overcommit affect scheduling decisions in a virtualized environment?

CPU overcommit means the hypervisor assigns more virtual CPUs to VMs than physical CPUs exist. The hypervisor scheduler must time-share physical CPUs among vCPUs.

When a vCPU is ready to run but no physical CPU is available, the hypervisor schedules another vCPU. This causes scheduling latency and reduced determinism. Real-time tasks inside VMs suffer because the hypervisor introduces an extra scheduling layer.

Mitigations: Use CPU affinity to pin vCPUs to physical cores, reducing hypervisor scheduling overhead. Use real-time scheduling classes for critical VMs. Reserve some physical cores for real-time workloads so they are not stolen by overcommitted VMs.

Further Reading


Conclusion

Process scheduling algorithms are the mechanism by which operating systems decide which runnable process gets CPU time. The choice of algorithm profoundly impacts system responsiveness, throughput, and fairness.

FCFS is simple but vulnerable to the convoy effect, where short processes wait behind long ones. SJF achieves optimal average wait time but requires knowing burst times in advance and causes starvation of long processes. Round Robin provides fair scheduling with good response time, but quantum size is critical — too small causes excessive context switches, too large degrades interactivity.

Priority scheduling allows important work to run first, but without aging, low-priority processes can starve indefinitely. Multi-Level Feedback Queue (MLFQ) dynamically adapts, giving interactive processes quick response while letting CPU-bound jobs progress without dominating. Modern Linux, Windows, and macOS all combine multiple algorithms this way.

For your next step, explore CPU interrupts and context switches to understand the mechanics of how the scheduler actually transfers control between processes, or dive into synchronization primitives (mutexes, semaphores, condition variables) which protect shared data in concurrent multi-threaded environments.

Category

Related Posts

ASLR & Stack Protection

Address Space Layout Randomization, stack canaries, and exploit mitigation techniques

#operating-systems #aslr-stack-protection #computer-science

Assembly Language Basics: Writing Code the CPU Understands

Learn to read and write simple programs in x86 and ARM assembly, understanding registers, instructions, and the art of thinking in low-level operations.

#operating-systems #assembly-language-basics #computer-science

Boolean Logic & Gates

Understanding AND, OR, NOT gates and how they combine into arithmetic logic units — the building blocks of every processor.

#operating-systems #boolean-logic-gates #computer-science