Interrupts & Polling
Hardware interrupts, IRQ handling, and interrupt bottom halves (tasklets, workqueues) for robust I/O processing.
Introduction
At any moment, dozens of hardware devices may need the CPU’s attention—network packets arriving, keystrokes pressed, disk operations completing, timers firing. The CPU cannot continuously poll every device (wasteful) nor can devices wait indefinitely (data would be lost). The interrupt mechanism bridges this gap: devices assert a signal that causes the CPU to suspend current work and handle the event immediately.
Interrupts are the nervous system of computing—electrical impulses that demand immediate attention. Without them, operating systems would be blind to the physical world. Understanding interrupt handling is essential for driver development, system performance tuning, and debugging real-time or latency-sensitive applications.
Polling represents the alternative model—software repeatedly checking device status registers to see if work needs doing. This section explores both models, their trade-offs, and the hybrid approaches modern systems use.
When to Use / When Not to Use
When Hardware Interrupts Are Appropriate
- Sparse, unpredictable events: Network packets arriving, user input, hardware faults
- Latency-critical operations: Real-time systems where sub-millisecond response matters
- Low-power designs: CPU can sleep until hardware signals need for service
- High-bandwidth devices with flow control: Disks, high-speed network cards that can buffer temporarily
When Polling Is Appropriate
- Dense, predictable events: High-frequency timers, status monitoring
- Latency-jitter-tolerant designs: When consistent polling interval is acceptable
- Interrupt storm prevention: Devices that naturally generate very high interrupt rates
- Virtualized environments: Hypervisors may emulate interrupt delivery; polling vCPUs can be more efficient
- Modern NVMe drives: Device latencies are low enough that interrupt overhead becomes a significant share of each I/O, which makes polling the completion queues viable
Hybrid Approaches (The Common Case)
Modern systems rarely use pure polling or pure interrupts. They combine:
- Interrupt coalescing: Accumulate multiple events before interrupting
- Deferred processing: Fast ISR handles minimal work; rest runs in thread context
- Polled phases: During intensive work, temporarily switch to polling for cache efficiency
Architecture or Flow Diagram
The following diagram illustrates the complete interrupt handling flow from hardware assertion to driver completion:
sequenceDiagram
participant HW as Hardware Device
participant CPU
participant IDT as Interrupt Descriptor Table
participant ISR as Interrupt Service Routine<br/>(Top Half)
participant BH as Bottom Half Handler<br/>(Tasklet/Workqueue)
participant DRV as Device Driver
HW->>CPU: Assert IRQ line
CPU->>CPU: Finish current instruction
CPU->>IDT: Lookup interrupt vector
IDT->>ISR: Jump to handler
ISR->>HW: Acknowledge interrupt
ISR->>ISR: Read status registers
ISR->>BH: Schedule bottom half
ISR-->>CPU: Return from interrupt
Note over CPU: Now running other work
BH->>DRV: Process I/O completions
DRV->>DRV: Handle data, wake waiters
The critical insight is the split between the top half (ISR, runs with interrupts disabled) and the bottom half (runs later: tasklets in softirq context, workqueues and threaded handlers in process context). This split minimizes interrupt latency while allowing lengthy processing.
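The split can be illustrated with a small user-space model. This is only a sketch: `SimulatedDevice`, `top_half`, and `bottom_half` are hypothetical names, and a Python deque stands in for whatever state a real driver would hand from its ISR to its deferred handler.

```python
from collections import deque


class SimulatedDevice:
    """Hypothetical device model: the top half only records status bits."""

    def __init__(self):
        self.pending = deque()   # work handed from top half to bottom half
        self.processed = []      # results produced by the bottom half

    def top_half(self, status):
        # Keep this minimal: record what happened and return immediately.
        self.pending.append(status)

    def bottom_half(self):
        # Runs later and drains everything the top half queued.
        while self.pending:
            status = self.pending.popleft()
            self.processed.append(f"handled status 0x{status:02x}")


dev = SimulatedDevice()
for status in (0x01, 0x04, 0x01):
    dev.top_half(status)   # three interrupts arrive back to back
dev.bottom_half()          # deferred work runs once, as a batch
print(dev.processed)
```

The point of the model is that the time spent in `top_half` is independent of how expensive the processing is; only `bottom_half` grows with the workload.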
Core Concepts
Hardware Interrupts
Hardware interrupts are electrical signals from devices to the CPU requesting attention. The interrupt request (IRQ) line the device uses determines which handler runs.
Types of interrupts:
- Maskable interrupts (IRQ): Can be masked by clearing the IF flag in EFLAGS/RFLAGS (the cli instruction on x86)
- Non-maskable interrupts (NMI): Cannot be ignored—used for critical events like hardware failures
- Message-signaled interrupts (MSI/MSI-X): Device writes to special memory address instead of using physical IRQ line; avoids interrupt pin limitations
Interrupt lifecycle:
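// 1. Driver registers handler during initialization
static int my_driver_probe(struct pci_dev *pdev, const struct pci_device_id *id)
{
struct my_device *my_dev;
int ret;
// Allocate per-device state; it is passed to the handlers as dev_id
my_dev = devm_kzalloc(&pdev->dev, sizeof(*my_dev), GFP_KERNEL);
if (!my_dev)
return -ENOMEM;
// Register interrupt handler
ret = request_threaded_irq(
pdev->irq, // IRQ line
my_driver_isr, // Top half (atomic)
my_driver_thread, // Bottom half (threaded, can sleep)
IRQF_SHARED, // Share IRQ with other devices
"my_driver", // Name for /proc/interrupts
my_dev); // Passed to handler
return ret;
}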
// 2. Interrupt fires, CPU vectors to ISR
static irqreturn_t my_driver_isr(int irq, void *dev_id)
{
struct my_device *dev = dev_id;
u32 status = readl(dev->regs + STATUS_REG);
if (!(status & IRQ_PENDING))
return IRQ_NONE; // Not our interrupt
// Acknowledge so the device deasserts the line and the interrupt does not re-fire
writel(status, dev->regs + STATUS_REG);
// Store any needed state for bottom half
dev->interrupt_status = status;
return IRQ_WAKE_THREAD; // Schedule threaded handler
}
// 3. Threaded handler runs in process context
static irqreturn_t my_driver_thread(int irq, void *dev_id)
{
struct my_device *dev = dev_id;
u32 status = dev->interrupt_status;
if (status & PACKET_AVAILABLE)
process_packets(dev);
if (status & BUFFER_DONE)
complete_transfer(dev);
return IRQ_HANDLED;
}
IRQ Handling in Linux
The Linux kernel maintains an IRQ descriptor table. When an interrupt fires:
- Kernel disables interrupts on that CPU (mask)
- Looks up handler in irq_desc[]
- Calls handler chain (multiple drivers can share one IRQ line)
- Acknowledges interrupt to interrupt controller (PIC/APIC)
- Re-enables interrupts (unmask)
The APIC (Advanced PIC) handles interrupt routing in modern multi-core systems, enabling per-core interrupt affinity—IRQs can be directed to specific CPUs for cache efficiency.
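The handler-chain step for a shared IRQ line can be modeled in a few lines. This is a user-space sketch under stated assumptions: `make_handler` and `dispatch` are hypothetical names, devices are plain dictionaries, and the two return codes mirror the kernel's `IRQ_NONE` and `IRQ_HANDLED` only by name.

```python
IRQ_NONE, IRQ_HANDLED = 0, 1


def make_handler(device):
    """Build a handler that claims the interrupt only if its device raised it."""
    def handler():
        if not device["pending"]:
            return IRQ_NONE          # not our interrupt
        device["pending"] = False    # acknowledge at the device
        return IRQ_HANDLED
    return handler


def dispatch(handlers):
    """Call every handler registered on the line; report whether any claimed it."""
    results = [handler() for handler in handlers]
    return IRQ_HANDLED if IRQ_HANDLED in results else IRQ_NONE


nic = {"name": "nic", "pending": False}
disk = {"name": "disk", "pending": True}
chain = [make_handler(nic), make_handler(disk)]

first = dispatch(chain)    # the disk claims the interrupt
second = dispatch(chain)   # nothing pending any more: nobody claims it
print(first, second)
```

Each handler must check its own device first, because on a shared line it is invoked for interrupts that belong to its neighbors as well.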
Interrupt Bottom Halves
The top half (ISR) must run quickly and cannot sleep. Bottom halves handle the heavy processing.
Tasklets — Lightweight, scheduled per-CPU, cannot be preempted:
// Tasklet structure embedded in device struct
struct my_device {
// ... other fields ...
struct tasklet_struct tlet;
u32 pending_work;
};
// Initialize tasklet
void my_device_init(struct my_device *dev)
{
tasklet_init(&dev->tlet, my_device_tasklet, (unsigned long)dev);
}
// Tasklet function (runs in softirq context: must not sleep or call schedule())
void my_device_tasklet(unsigned long data)
{
struct my_device *dev = (struct my_device *)data;
u32 work = xchg(&dev->pending_work, 0); // Atomically grab and clear
if (work & PACKET_WORK)
process_packets(dev);
if (work & TIMER_WORK)
handle_timer(dev);
}
// In ISR: schedule tasklet instead of direct processing
static irqreturn_t my_isr(int irq, void *dev_id)
{
struct my_device *dev = dev_id;
u32 status = readl(dev->regs + STATUS_REG);
dev->pending_work |= status;
tasklet_schedule(&dev->tlet); // Schedule bottom half
return IRQ_HANDLED;
}
Workqueues — Run in kernel thread context, can sleep, can be delayed:
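// Simplified view of the kernel's struct delayed_work. It is already defined in
// <linux/workqueue.h>, so drivers embed it rather than declaring it themselves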
struct delayed_work {
struct work_struct work;
struct timer_list timer;
};
// Initialize work
INIT_WORK(&dev->work, process_workqueue);
INIT_DELAYED_WORK(&dev->delayed_work, process_delayed);
// Schedule immediate work
schedule_work(&dev->work);
// Schedule delayed work (5 jiffies)
schedule_delayed_work(&dev->delayed_work, 5);
// In driver cleanup
flush_work(&dev->work);
cancel_delayed_work_sync(&dev->delayed_work);
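A delay expressed in jiffies depends on the kernel's tick rate (HZ), so "5 jiffies" is not a fixed amount of time. The following sketch models the conversion in Python; the kernel provides its own `msecs_to_jiffies()` helper, and the functions below are only an approximation of that arithmetic. The HZ values are common build-time choices, used here as assumptions.

```python
import math


def msecs_to_jiffies(ms, hz):
    """Round up so the delay is never shorter than requested."""
    return math.ceil(ms * hz / 1000)


def jiffies_to_msecs(jiffies, hz):
    """Length of a jiffies-based delay in milliseconds for a given tick rate."""
    return jiffies * 1000 / hz


for hz in (100, 250, 1000):
    print(f"HZ={hz}: 5 jiffies = {jiffies_to_msecs(5, hz)} ms")
```

Because of this dependence, driver code usually passes `msecs_to_jiffies(20)` rather than a bare number when it means "about 20 ms".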
Threaded IRQs — Simpler model: handler runs as a kernel thread:
// request_threaded_irq combines ISR + thread in one call
// ISR is minimal (runs atomic), thread handles rest
ret = request_threaded_irq(
irq,
isr_routine, // Optional primary handler (can return IRQ_WAKE_THREAD)
threadRoutine, // Threaded handler (can sleep)
flags,
name,
dev_id);
Polling Model
Polling uses explicit software checking instead of hardware signals:
// Simple polling loop (inefficient—don't do this!)
while (device_has_work()) {
handle_device_event();
}
// Better: timer-driven polling with adjustable rate
static void poll_timer_callback(struct timer_list *t)
{
struct my_device *dev = from_timer(dev, t, poll_timer);
// Check and handle any pending work
u32 status = readl(dev->regs + STATUS_REG);
if (status)
process_status(dev);
// Reschedule if device still needs servicing
if (device_needs_poll(dev))
mod_timer(&dev->poll_timer, jiffies + POLL_INTERVAL);
}
// Initialize polling
timer_setup(&dev->poll_timer, poll_timer_callback, 0);
add_timer(&dev->poll_timer);
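The "adjustable rate" in the timer-driven variant can follow a simple back-off policy: poll faster while work keeps turning up, slower while the device is idle. The sketch below shows one such policy in Python. The function name, the halving and doubling factors, and the 1 ms and 100 ms bounds are all assumptions chosen for illustration.

```python
def next_interval(current_ms, found_work, min_ms=1, max_ms=100):
    """Halve the interval after a productive poll, double it after an idle one."""
    if found_work:
        return max(min_ms, current_ms // 2)
    return min(max_ms, current_ms * 2)


interval = 100
history = []
# Three productive polls followed by four idle ones.
for found_work in (True, True, True, False, False, False, False):
    interval = next_interval(interval, found_work)
    history.append(interval)
print(history)
```

In a kernel driver the result would feed the `mod_timer()` call in place of a fixed `POLL_INTERVAL`.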
Comparison: Interrupts vs Polling
| Aspect | Interrupts | Polling |
|---|---|---|
| CPU overhead (idle device) | Zero—CPU sleeps | Constant—CPU checks repeatedly |
| Response latency | Immediate—hardware signals | Bounded by poll interval |
| Scalability | Pin-limited (MSI helps) | No hardware limit |
| Complexity | More complex (handling race conditions) | Simpler state machine |
| Event density | Degrades at high rates | Handles high rates naturally |
| Power consumption | Lower when idle | Higher when idle |
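The "event density" row can be made concrete with a rough cost model. This is a sketch under strong assumptions: the per-interrupt and per-poll costs below are hypothetical round numbers rather than measurements, and the model ignores the processing work that both approaches share.

```python
def interrupt_cpu_fraction(events_per_sec, cost_us_per_interrupt):
    """CPU share spent on interrupt entry/exit overhead; grows with event rate."""
    return min(1.0, events_per_sec * cost_us_per_interrupt / 1_000_000)


def polling_cpu_fraction(polls_per_sec, cost_us_per_poll):
    """CPU share spent checking the device; constant regardless of event rate."""
    return min(1.0, polls_per_sec * cost_us_per_poll / 1_000_000)


def break_even_rate(polls_per_sec, cost_us_per_poll, cost_us_per_interrupt):
    """Event rate above which interrupts cost more CPU than fixed-rate polling."""
    return polls_per_sec * cost_us_per_poll / cost_us_per_interrupt


INT_COST_US = 2.0    # assumed overhead per interrupt
POLL_COST_US = 0.5   # assumed cost of one status-register check
POLL_HZ = 10_000     # assumed polling frequency

print(break_even_rate(POLL_HZ, POLL_COST_US, INT_COST_US))
print(interrupt_cpu_fraction(100, INT_COST_US), polling_cpu_fraction(POLL_HZ, POLL_COST_US))
```

Below the break-even rate interrupts are cheaper; above it, fixed-rate polling wins, which is the motivation for hybrid schemes that switch between the two.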
Production Failure Scenarios
Scenario 1: Interrupt Storm from Misbehaving Hardware
What happened: A faulty network card generated spurious interrupts at 50,000+ per second after receiving certain packet patterns. Each interrupt forced CPU context switches, consuming 40% of one core and causing massive latency spikes for legitimate operations.
Detection: cat /proc/interrupts showed interrupt count incrementing in thousands per second for that IRQ line. top showed softirq CPU time dominating.
Mitigation:
// Return IRQ_NONE for spurious events so the kernel's spurious-IRQ accounting can detect a stuck line
static irqreturn_t net_driver_isr(int irq, void *dev_id)
{
struct net_device *ndev = dev_id;
u32 status = readl(ndev->regs + IRQ_STATUS);
// If no real interrupt source, pretend we didn't see it
if (!(status & (TX_DONE | RX_PENDING | LINK_CHANGE)))
return IRQ_NONE; // Spurious—don't reschedule
// Handle real sources, acknowledge, schedule NAPI if needed
// ...
return IRQ_HANDLED;
}
Also configure interrupt coalescing in hardware to reduce interrupt frequency.
Scenario 2: Deadlock from Blocking in ISR
What happened: A driver’s ISR called copy_to_user(), which can take a page fault and sleep on any architecture. An ISR runs in atomic context and has no valid user address space of its own, so when the destination page was not resident on an ARM platform, the fault path tried to sleep inside the interrupt handler and the system hung.
Detection: System hang preceded by “BUG: scheduling while atomic” or “sleeping function called from invalid context” in the kernel log.
Mitigation:
// BAD: copy_to_user can sleep
static irqreturn_t bad_isr(int irq, void *dev_id)
{
char buf[256];
// DON'T do this—copy_to_user can fault
copy_to_user(user_buf, kernel_buf, sizeof(buf));
return IRQ_HANDLED;
}
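// GOOD: Never touch user memory from an ISR. put_user() can fault and sleep
// just like copy_to_user(), and an ISR has no user context of its own.
// Keep the data in a kernel buffer and wake whoever is waiting for it.
// (data_ready and read_wq are illustrative fields; locking is omitted.)
static irqreturn_t good_isr(int irq, void *dev_id)
{
struct my_device *dev = dev_id;
// Store minimal info, then wake the reader
dev->pending_data_len = readl(dev->regs + LEN_REG);
dev->data_ready = true;
wake_up_interruptible(&dev->read_wq); // Safe in atomic context
return IRQ_HANDLED;
}
// The copy to user space happens in the read() handler. It runs in the
// calling process's context, where faulting and sleeping are allowed.
// A tasklet would NOT be safe here: tasklets run in softirq context.
static ssize_t my_read(struct file *filp, char __user *user_buf,
size_t count, loff_t *ppos)
{
struct my_device *dev = filp->private_data;
size_t len;
if (wait_event_interruptible(dev->read_wq, dev->data_ready))
return -ERESTARTSYS;
len = min_t(size_t, count, dev->pending_data_len);
if (copy_to_user(user_buf, dev->kernel_buf, len))
return -EFAULT;
dev->data_ready = false;
return len;
}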
Scenario 3: Race Between ISR and Driver Removal
What happened: Driver was being unloaded while interrupts were still in flight. The ISR dereferenced memory that had been freed during driver removal, causing a use-after-free panic.
Detection: Kernel panic in ISR context with corrupted data structures.
Mitigation:
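static void my_driver_remove(struct pci_dev *pdev)
{
struct my_device *dev = pci_get_drvdata(pdev);
// 1. Stop the device from raising new interrupts
// (IRQ_ENABLE_REG is an illustrative, device-specific register)
writel(0, dev->regs + IRQ_ENABLE_REG);
// 2. Unregister the handler. free_irq() waits for in-flight ISRs to
// complete. Avoid disable_irq() on a shared line: it would also
// silence the other devices on that line.
free_irq(pdev->irq, dev);
// 3. The ISR can no longer schedule the tasklet, so killing it is final
tasklet_kill(&dev->tlet);
// 4. Now safe to free resources
pci_set_drvdata(pdev, NULL);
}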
Also use reference counting (kref_get/kref_put) to ensure device data isn’t freed while any code (ISR or otherwise) might access it.
Trade-off Table
| Mechanism | Latency | CPU Overhead (idle) | Throughput ceiling | Complexity |
|---|---|---|---|---|
| Bare interrupts | Lowest (immediate) | Zero | Limited by IRQ rate | Medium |
| Interrupt + tasklet | Low + deferred processing | Minimal when idle | High (batched) | Medium |
| Threaded IRQ | Low + full kernel features | Context switch cost | High | Low |
| NAPI (poll mode) | Higher (polling interval) | Zero when idle | Very high | High |
| Pure polling | Bounded by interval | Constant 100% | Limited by poll freq | Low |
Implementation Snippets
Complete Interrupt-Driven Driver Framework
#include <linux/module.h>
#include <linux/pci.h>
#include <linux/io.h>
#include <linux/interrupt.h>
#include <linux/workqueue.h>
#include <linux/timer.h>
#include <linux/delay.h>
#include <linux/netdevice.h>
#include <linux/skbuff.h>
/* INT_STATUS, RX_BUFFER and PKT_AVAILABLE are device-specific and assumed to be defined elsewhere */
#define POLL_INTERVAL (HZ / 10) // 100ms
struct my_device {
void __iomem *regs;
unsigned int irq;
struct tasklet_struct tlet;
struct work_struct work;
struct timer_list poll_timer;
atomic_t pending_status; // Status bits handed from ISR to tasklet
u32 interrupt_count;
bool use_polling; // Fallback when interrupts fail
};
static irqreturn_t my_driver_isr(int irq, void *dev_id)
{
struct my_device *dev = dev_id;
u32 status;
/* Read status - this device is assumed to clear its interrupt latches on read */
status = readl(dev->regs + INT_STATUS);
if (!status)
return IRQ_NONE;
/* Atomically store status for bottom half; a second register read would return 0 */
atomic_or(status, &dev->pending_status);
dev->interrupt_count++;
/* For this device, bottom half is tasklet */
tasklet_schedule(&dev->tlet);
return IRQ_HANDLED;
}
static void my_tlet_handler(unsigned long data)
{
struct my_device *dev = (struct my_device *)data;
/* Atomically grab and clear the status saved by the ISR or poll timer */
u32 status = atomic_xchg(&dev->pending_status, 0);
/* Process packet arrivals */
if (status & PKT_AVAILABLE) {
struct sk_buff *skb = dev_alloc_skb(2048);
if (skb) {
/* Transfer data - DMA or MMIO */
memcpy_fromio(skb_put(skb, 2048),
dev->regs + RX_BUFFER, 2048);
/* A real driver sets skb->dev and skb->protocol before this call */
netif_rx(skb);
}
}
}
static void my_work_handler(struct work_struct *work)
{
/* For longer operations that need to sleep, which a tasklet cannot do */
struct my_device *dev = container_of(work, struct my_device, work);
/* Process configuration changes, etc. */
msleep(10); // Safe here
}
/* Polling fallback for environments with broken interrupts */
static void my_poll_timer(struct timer_list *t)
{
struct my_device *dev = from_timer(dev, t, poll_timer);
u32 status = readl(dev->regs + INT_STATUS);
if (status & PKT_AVAILABLE) {
/* Hand the status over the same way the ISR does */
atomic_or(status, &dev->pending_status);
my_tlet_handler((unsigned long)dev);
}
if (dev->use_polling)
mod_timer(&dev->poll_timer, jiffies + POLL_INTERVAL);
}
static int my_driver_probe(struct pci_dev *pdev, const struct pci_device_id *id)
{
struct my_device *dev;
int ret;
dev = devm_kzalloc(&pdev->dev, sizeof(*dev), GFP_KERNEL);
if (!dev)
return -ENOMEM;
pci_set_drvdata(pdev, dev);
/* Enable the device and map hardware registers */
ret = pcim_enable_device(pdev);
if (ret)
return ret;
dev->regs = pcim_iomap(pdev, 0, 0);
if (!dev->regs)
return -ENOMEM;
/* Initialize tasklet and work before the IRQ can fire */
tasklet_init(&dev->tlet, my_tlet_handler, (unsigned long)dev);
INIT_WORK(&dev->work, my_work_handler);
/* Request IRQ - my_driver_isr is the top half and schedules the tasklet */
ret = request_irq(pdev->irq, my_driver_isr,
IRQF_SHARED, KBUILD_MODNAME, dev);
if (ret) {
dev_warn(&pdev->dev, "IRQ %u unavailable, using polling\n", pdev->irq);
dev->use_polling = true;
timer_setup(&dev->poll_timer, my_poll_timer, 0);
mod_timer(&dev->poll_timer, jiffies + POLL_INTERVAL);
} else {
dev->irq = pdev->irq;
}
return 0;
}
static void my_driver_remove(struct pci_dev *pdev)
{
struct my_device *dev = pci_get_drvdata(pdev);
if (dev->use_polling) {
del_timer_sync(&dev->poll_timer);
} else {
free_irq(pdev->irq, dev);
}
tasklet_kill(&dev->tlet);
cancel_work_sync(&dev->work);
pci_set_drvdata(pdev, NULL);
}
static const struct pci_device_id my_driver_id[] = {
{ PCI_DEVICE(0x1234, 0x5678) },
{ }
};
MODULE_DEVICE_TABLE(pci, my_driver_id);
static struct pci_driver my_driver = {
.name = KBUILD_MODNAME,
.id_table = my_driver_id,
.probe = my_driver_probe,
.remove = my_driver_remove,
};
module_pci_driver(my_driver);
MODULE_LICENSE("GPL");
Python Simulation of Interrupt vs Polling
#!/usr/bin/env python3
"""
Simulates interrupt vs polling paradigms for I/O handling.
"""
import random
import threading
import time
from dataclasses import dataclass
from typing import Callable

@dataclass
class Device:
    """Simulated hardware device with interrupt capability."""
    name: str
    interrupt_callback: Callable[[], None]
    event_interval_ms: float  # How often device generates events

    def __post_init__(self):
        self._running = False
        self._thread = None

    def start(self):
        """Simulate device generating interrupts."""
        self._running = True

        def event_loop():
            while self._running:
                time.sleep(random.uniform(0, self.event_interval_ms * 2) / 1000)
                if self._running and random.random() < 0.7:  # 70% event rate
                    self.interrupt_callback()

        self._thread = threading.Thread(target=event_loop, daemon=True)
        self._thread.start()

    def stop(self):
        self._running = False
        if self._thread:
            self._thread.join(timeout=1.0)

class InterruptDrivenHandler:
    """Handle events immediately via callback."""
    def __init__(self):
        self.events_handled = 0
        self.last_event_time = None

    def handle_interrupt(self):
        self.events_handled += 1
        self.last_event_time = time.time()
        # Simulate minimal ISR work
        self._process_event()

    def _process_event(self):
        pass  # In real ISR, this would be minimal

class PollingHandler:
    """Handle events by periodic checking."""
    def __init__(self, device: Device, poll_interval_ms: float = 100):
        self.device = device
        self.poll_interval = poll_interval_ms / 1000
        self.events_handled = 0
        self.last_poll_time = None
        self._running = False
        self._thread = None
        self._pending = False  # Simulated device status register

    def signal(self):
        """Called by the device: sets the status bit instead of interrupting."""
        self._pending = True

    def start(self):
        self._running = True

        def poll_loop():
            while self._running:
                self.last_poll_time = time.time()
                # Read device "registers" (check for pending events)
                self._check_device()
                time.sleep(self.poll_interval)

        self._thread = threading.Thread(target=poll_loop, daemon=True)
        self._thread.start()

    def _check_device(self):
        # Read and clear the status bit; events that arrived since the
        # previous poll collapse into a single notification
        if self._pending:
            self._pending = False
            self.events_handled += 1

    def stop(self):
        self._running = False
        if self._thread:
            self._thread.join(timeout=1.0)

if __name__ == "__main__":
    print("=== Interrupt vs Polling Simulation ===\n")
    # Interrupt-driven
    handler_int = InterruptDrivenHandler()
    device = Device("UART0", handler_int.handle_interrupt, event_interval_ms=10)
    device.start()
    time.sleep(1.0)
    device.stop()
    print(f"Interrupt-driven: {handler_int.events_handled} events handled")
    if handler_int.events_handled:
        gap_ms = 1000.0 / handler_int.events_handled
        print(f"  Mean gap between events: ~{gap_ms:.2f}ms over the 1 s run")
    # Polling: the second device only sets the handler's status bit
    device2 = Device("UART1", lambda: handler_poll.signal(), event_interval_ms=10)
    handler_poll = PollingHandler(device2, poll_interval_ms=50)
    device2.start()
    handler_poll.start()
    time.sleep(1.0)
    handler_poll.stop()
    device2.stop()
    print(f"\nPolling (50ms interval): {handler_poll.events_handled} polls found work")
    print(f"  Average latency: ~{handler_poll.poll_interval * 1000 / 2:.1f}ms (half poll interval)")
Observability Checklist
Linux Interrupt Metrics
# View all IRQ statistics
cat /proc/interrupts
# Per-CPU interrupt counts
cat /proc/softirqs # Software interrupt (softirq) handling
# Check interrupt affinity
cat /proc/irq/32/smp_affinity # Which CPU handles IRQ 32
# Set interrupt affinity (move NIC IRQ to CPU 0)
echo 1 > /proc/irq/42/smp_affinity
# View interrupt distribution with perf
perf stat -e 'irq:irq_handler_entry' -a sleep 1
Key Metrics to Monitor
| Metric | Healthy Range | Alert Threshold | Indicates |
|---|---|---|---|
| /proc/interrupts (per-IRQ delta) | Baseline + small variance | Sudden 10x increase | Interrupt storm |
| softirq time in top | < 5% CPU | > 30% CPU | ISR deferred work overload |
| irq_entry perf events | < 10,000/s | > 100,000/s | Excessive interrupts |
| nmi count delta | Stable | Increasing | Hardware issues (memory errors) |
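The per-IRQ delta in the first row has to be computed from two samples, since `/proc/interrupts` only reports running totals. The sketch below parses two snapshots and flags IRQs whose rate exceeds a threshold. The sample text is fabricated for illustration, the function names are hypothetical, and the parser assumes the usual layout of one count column per CPU followed by a free-form description.

```python
def parse_interrupts(text):
    """Sum the per-CPU counters for every row of a /proc/interrupts snapshot."""
    lines = text.strip("\n").splitlines()
    ncpus = len(lines[0].split())            # header row: CPU0 CPU1 ...
    totals = {}
    for line in lines[1:]:
        label, _, rest = line.partition(":")
        fields = rest.split()
        counts = [int(f) for f in fields[:ncpus] if f.isdigit()]
        totals[label.strip()] = sum(counts)
    return totals


def storm_suspects(before, after, seconds, threshold_per_sec):
    """Return {irq: rate} for IRQs whose rate exceeds the threshold."""
    rates = {irq: (after[irq] - before.get(irq, 0)) / seconds for irq in after}
    return {irq: rate for irq, rate in rates.items() if rate > threshold_per_sec}


# Fabricated snapshots taken one second apart.
BEFORE = "\n".join([
    "           CPU0       CPU1",
    "  0:         35          0   IO-APIC    2-edge      timer",
    " 42:     120000     118000   PCI-MSI 524288-edge    eth0",
    "NMI:          2          3   Non-maskable interrupts",
])
AFTER = "\n".join([
    "           CPU0       CPU1",
    "  0:        285          0   IO-APIC    2-edge      timer",
    " 42:     180000     119000   PCI-MSI 524288-edge    eth0",
    "NMI:          2          3   Non-maskable interrupts",
])

suspects = storm_suspects(parse_interrupts(BEFORE), parse_interrupts(AFTER),
                          seconds=1.0, threshold_per_sec=10_000)
print(suspects)
```

On a live system the two snapshots would come from reading `/proc/interrupts` twice with a known sleep in between.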
Common Pitfalls / Anti-Patterns
Interrupt Security Concerns
- Interrupt Timing Attacks: Observing interrupt timing can leak information. Deterministic scheduling of bottom halves can expose sensitive computation patterns. Consider using hrtimer for high-resolution timing rather than relying on interrupt patterns.
- IRQ Routing Attacks: In virtualized environments, a malicious VM could manipulate interrupt routing to cause denial of service against other VMs sharing the same physical CPU. Use IOMMU interrupt remapping and verify APIC configuration.
- MSI-X Vulnerabilities: Message-signaled interrupts are memory writes issued by the device. If those writes aren’t properly restricted, a device could forge interrupts for vectors it does not own. Modern systems use IOMMU interrupt remapping to validate interrupt messages.
- Interrupt Descriptor Table (IDT) Attacks: Malware can hook IDT entries to intercept interrupts. Use kexec to reload a known-good kernel, or locate the table with cat /proc/kallsyms | grep idt_table as a starting point for integrity checks.
Common Pitfalls / Anti-patterns
| Anti-pattern | Why It’s Bad | Correct Approach |
|---|---|---|
| Long-running ISR | Blocks all interrupts, causes latency spikes | Defer to tasklet/workqueue |
| Blocking in ISR | Scheduling while atomic = panic | Use non-blocking primitives |
| Not acknowledging interrupts | Spurious interrupts repeat forever | Always acknowledge in hardware |
| No IRQ sharing cleanup | Leak resources on driver removal | Use free_irq with proper synchronization |
| Assuming single CPU | Race conditions on SMP | Use proper locking/synchronization |
| Disabling interrupts globally | Excessive latency for other devices | Use per-IRQ masking instead |
| Not handling interrupt storms | Denial of service from device | Implement coalescing/irq throttling |
Quick Recap Checklist
- Hardware interrupts allow devices to demand CPU attention asynchronously
- ISRs run with interrupts disabled—keep them fast and never block
- Bottom halves (tasklets, workqueues, threaded IRQs) defer lengthy processing
- Tasklets run in softirq context and must not sleep; workqueues run in process context and can
- Polling is the alternative—better for high-density events, worse for power/latency
- Hybrid NAPI model combines interrupts (for idle) with polling (for busy)
- Always acknowledge hardware interrupts to prevent spurious repeats
- Use synchronize_irq() before freeing IRQ-related resources
- Monitor /proc/interrupts and /proc/softirqs to detect interrupt storms
- Interrupt storms usually indicate hardware or driver bugs, not a feature
Interview Questions
When an interrupt fires: (1) The device asserts an electrical signal on an IRQ line. (2) The CPU finishes its current instruction. (3) The CPU looks up the interrupt vector in the IDT (Interrupt Descriptor Table). (4) The CPU pushes the current state (SS, RSP, RFLAGS, CS, RIP) onto the kernel stack and, for an interrupt gate, clears the IF flag so further maskable interrupts are held off. Hardware interrupts push no error code; only some exceptions do. (5) The CPU jumps to the interrupt handler address. (6) The kernel's entry code runs, transitioning to kernel context. (7) The appropriate IRQ handler (from irq_desc[]) is called. (8) Handler chain may invoke multiple registered handlers. (9) Handler acknowledges interrupt to interrupt controller (APIC/PIC). (10) Handler returns; the kernel restores state, and the return-from-interrupt instruction restores the saved flags, which re-enables interrupts. Total latency is typically 1-10 microseconds on modern hardware.
Why can't you call schedule() or sleep in an interrupt handler?
Interrupt handlers run in atomic context: the kernel has interrupts disabled, and the handler borrows the stack and state of whatever it interrupted (user code, kernel code, or idle). schedule() tries to switch processes, which requires saving the current process's state and loading another. But an interrupt handler is not a schedulable entity, so there is nothing the scheduler could put to sleep and wake later; the interrupted task may hold locks, and the interrupt itself would be left unfinished. The kernel detects the attempt and reports "scheduling while atomic". The correct pattern is to defer the work. Workqueues and threaded IRQs run in process context, where sleeping is safe; tasklets and softirqs are still atomic and must not sleep.
Tasklets and workqueues both defer work, but with key differences. Tasklets are scheduled with tasklet_schedule() and run in softirq context: they cannot sleep, they run on the CPU that scheduled them, and a given tasklet never runs on two CPUs at once. They run after hardware interrupt handlers return, and hardware interrupts can still preempt them. Workqueues run in kernel thread context, can sleep, may migrate CPUs, and have more overhead but more capability. Use tasklets for quick, non-blocking deferrals (packet processing, bottom-half ISR work). Use workqueues for anything that needs to sleep, take a mutex, allocate memory with GFP_KERNEL, or run for a long time (flush caches, send signals, memory reclamation). The modern preference is threaded IRQs for most deferral needs, since they combine ISR and deferral in one clean interface.
NAPI (New API) is Linux's hybrid interrupt-polling mechanism for high-speed network drivers. It works as follows: when traffic is low, interrupts are enabled and each packet arrives via an interrupt. When an interrupt fires, the driver's handler disables further receive interrupts from the device and calls napi_schedule(). The kernel then calls the driver's poll callback from the NET_RX softirq, passing a budget that caps how many packets one pass may process. As long as a pass uses its whole budget, the device stays in polling mode and is polled again. When a pass processes fewer packets than the budget, the ring is empty: the driver calls napi_complete_done() and re-enables interrupts. This prevents interrupt storms from fast devices (10GbE+), improves cache locality, and allows batching of packet processing. Most modern high-performance network drivers (ixgbe, mlx5, virtio-net) use NAPI.
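The interrupt-then-poll cycle can be sketched as a small state machine. This is a user-space model, not the kernel API: `NapiSim` and its methods are hypothetical names, and the budget of 4 is chosen only to keep the example short.

```python
from collections import deque


class NapiSim:
    """Toy model of the interrupt -> poll -> re-enable cycle."""

    def __init__(self, budget=4):
        self.ring = deque()       # receive ring filled by the "hardware"
        self.irq_enabled = True
        self.budget = budget
        self.interrupts = 0
        self.delivered = []

    def packet_arrives(self, pkt):
        self.ring.append(pkt)
        if self.irq_enabled:
            self.interrupts += 1
            self.irq_enabled = False   # ISR masks RX interrupts and schedules polling

    def poll(self):
        """One poll pass: process at most `budget` packets."""
        done = 0
        while self.ring and done < self.budget:
            self.delivered.append(self.ring.popleft())
            done += 1
        if done < self.budget:
            self.irq_enabled = True    # ring drained: back to interrupt mode
        return done


nic = NapiSim(budget=4)
for pkt in range(10):          # a burst of ten packets
    nic.packet_arrives(pkt)
passes = [nic.poll(), nic.poll(), nic.poll()]
print(nic.interrupts, passes, nic.irq_enabled)
```

Ten packets cost one interrupt and three poll passes; without the masking step, the same burst would have raised ten interrupts.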
Spurious interrupts waste CPU and can indicate hardware problems. Handling strategies: First, always read device status registers at the start of ISR and return IRQ_NONE if there's no real interrupt source, which tells the kernel "this IRQ wasn't for me." Second, implement interrupt coalescing in hardware (waiting for multiple events before asserting interrupt) if the device supports it. Third, for persistent spurious interrupts, Linux keeps a per-IRQ count of unhandled interrupts: if nearly all recent interrupts on a line returned IRQ_NONE, the kernel logs "nobody cared" and disables that IRQ line. Fourth, make sure hardware IRQ numbers are mapped correctly to Linux IRQ descriptors (the job of the irqdomain layer), so that interrupts reach the right handler. Finally, ensure the interrupt is acknowledged at the device before the handler returns, so a level-triggered line is not still asserted when it is unmasked.
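The unhandled-interrupt accounting can be modeled roughly as follows. This is a sketch only: the class name is hypothetical, and the window of 100,000 interrupts with a limit of 99,900 unhandled is an assumption modeled on the kernel's spurious-IRQ logic that should be checked against the kernel source for the version in use.

```python
class SpuriousDetector:
    """Toy model: disable a line when almost every interrupt goes unclaimed."""

    def __init__(self, window=100_000, limit=99_900):
        self.window = window      # assumed: interrupts per evaluation window
        self.limit = limit        # assumed: unhandled count that trips the check
        self.count = 0
        self.unhandled = 0
        self.disabled = False

    def note_interrupt(self, handled):
        if self.disabled:
            return
        self.count += 1
        if not handled:
            self.unhandled += 1
        if self.count < self.window:
            return
        if self.unhandled > self.limit:
            self.disabled = True  # the kernel would log "nobody cared" here
        self.count = 0
        self.unhandled = 0


healthy = SpuriousDetector()
for i in range(100_000):
    healthy.note_interrupt(handled=(i % 100 != 0))   # 1% unclaimed

stuck = SpuriousDetector()
for _ in range(100_000):
    stuck.note_interrupt(handled=False)              # nobody ever claims it

print(healthy.disabled, stuck.disabled)
```

An occasional IRQ_NONE on a shared line is therefore harmless; only a line where essentially nothing is ever claimed gets shut off.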
Interrupt coalescing reduces interrupt frequency by waiting for multiple events before asserting an interrupt. Instead of interrupting for every packet, a network card might wait until it has accumulated 16 packets or 128 microseconds have passed since the last interrupt, then fire one combined interrupt. This dramatically reduces CPU overhead from interrupt handling, from thousands of interrupts per second to hundreds. The tradeoff is increased latency: a packet must wait for the coalescing window before being processed. For bulk transfer (file downloads), this latency is unnoticeable. For latency-sensitive applications (high-frequency trading), you keep the coalescing window small (low frame thresholds, short timers) or disable coalescing entirely. Drivers expose these settings as parameters such as rx-usecs and rx-frames, which ethtool -C sets on Linux.
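The frame-count and timer thresholds can be simulated on a list of packet arrival times. This is a sketch under simplifying assumptions: the function name is hypothetical, the timer starts at the first packet of each batch rather than at the previous interrupt, and times are integer microseconds.

```python
def coalesce(arrivals_us, max_frames=16, usecs=128):
    """Return (fire_time_us, packets_in_batch) for each coalesced interrupt."""
    interrupts = []
    batch_start = None
    batch = 0
    for t in arrivals_us:
        if batch and t - batch_start >= usecs:
            # The timer expired before this packet arrived.
            interrupts.append((batch_start + usecs, batch))
            batch = 0
        if batch == 0:
            batch_start = t
        batch += 1
        if batch == max_frames:
            interrupts.append((t, batch))   # frame threshold reached
            batch = 0
    if batch:
        interrupts.append((batch_start + usecs, batch))
    return interrupts


burst = coalesce(list(range(32)))        # 32 packets, 1 us apart
sparse = coalesce([0, 1000, 2000])       # 3 packets, 1 ms apart
print(burst)
print(sparse)
```

The burst needs 2 interrupts instead of 32, while each sparse packet waits the full 128 microseconds before its interrupt fires, which is the latency cost described above.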
The APIC is the interrupt controller in modern multi-core x86 systems, replacing the older 8259A PIC. It consists of the local APIC (one per CPU core, handling CPU-local interrupts like timer and performance monitoring) and the I/O APIC (connecting devices to the system and routing interrupts to specific cores). The APIC enables per-core interrupt affinity—IRQs can be directed to specific CPUs for cache efficiency or latency optimization. It also handles interrupt priority and masking at the CPU level. In multi-core systems, the APIC can simultaneously handle many more interrupt sources than the old PIC (which was limited to 15 IRQs). On large NUMA systems, APIC routing affects cache locality—if an interrupt is handled by a core far from the device's NUMA node, latency increases. Tools like `turbostat` and `/proc/interrupts` help analyze interrupt routing across cores.
Edge-triggered interrupts fire once when the interrupt signal transitions (e.g., low to high). The device signals an event by pulsing the interrupt line; the handler runs once per pulse. Level-triggered interrupts fire continuously while the interrupt signal is active (e.g., high). The handler must clear the interrupt source to de-assert the line, otherwise the interrupt fires again immediately after returning. Edge-triggered signaling is simple, but an edge that arrives while the line is masked can be lost, and edges from different devices cannot be told apart, so edge-triggered lines are hard to share. Legacy PCI INTx interrupts are level-triggered for exactly this reason: several devices can hold a shared line active until each has been serviced. MSI and MSI-X behave like edges, with one message per event. GPIO interrupt controllers on many platforms (ARM SoCs, for example) let each pin be configured for either mode. Using the wrong trigger type causes "interrupt storm" problems where the handler immediately re-fires after completing, or lost interrupts in the opposite case. Drivers normally declare the trigger type through the IRQF_TRIGGER_* flags or the device tree, and the kernel applies it with `irq_set_irq_type()` if the hardware supports it.
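The difference shows up clearly in a toy model of the interrupt line. Both functions below are hypothetical illustrations: the first counts rising edges in a sampled signal, the second counts how often a level-triggered handler is re-entered, capped so the storm case terminates.

```python
def edge_triggered_count(line_samples):
    """Edge-triggered: one interrupt per low-to-high transition."""
    fired, previous = 0, 0
    for level in line_samples:
        if level == 1 and previous == 0:
            fired += 1
        previous = level
    return fired


def level_triggered_count(handler_clears_source, max_invocations=5):
    """Level-triggered: the handler re-runs for as long as the line stays high."""
    line_high, invocations = True, 0
    while line_high and invocations < max_invocations:
        invocations += 1
        if handler_clears_source:
            line_high = False    # device deasserts once the source is cleared
    return invocations


pulses = [0, 1, 0, 0, 1, 1, 0]
print(edge_triggered_count(pulses))            # two rising edges
print(level_triggered_count(True))             # well-behaved handler
print(level_triggered_count(False))            # storm, stopped only by the cap
```

A level-triggered handler that forgets to clear the source never makes progress, which is the storm behavior described above.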
Softirqs are software-raised deferred handlers that run in interrupt context at a lower priority than hardware interrupts. They are the bottom half of the hardware interrupt handling split. After a hardware ISR runs (top half), it can raise a softirq to run later: the `raise_softirq()` function marks that softirq as pending on the current CPU. The kernel's softirq system handles timers, network packet processing (NET_RX/NET_TX softirqs), scheduler load balancing (SCHED_SOFTIRQ), and RCU callbacks. Softirqs run with interrupts enabled (unlike hardware ISR), on the same CPU that raised them (for cache efficiency), and are processed when a hardware interrupt handler returns, when bottom halves are re-enabled with `local_bh_enable()`, or in the per-CPU ksoftirqd threads. High-frequency softirq activity (many network packets, many timer ticks) can dominate CPU time. You can see softirq activity in `/proc/softirqs`. The ksoftirqd kernel threads take over softirq processing when the system is under heavy load, so that softirqs cannot starve user processes.
Wire-based interrupts use physical interrupt lines (IRQ pins) routed through the interrupt controller. Lines are a scarce resource, so the number of pins limits how many distinct interrupts exist and devices often have to share a line. MSI (and MSI-X, the newer version) replaces physical wires with memory writes: the device writes to a special address, and that memory write triggers an interrupt to the CPU. This has major advantages: no interrupt pin limitation (a PCIe device can have up to 32 MSI vectors or 2048 MSI-X vectors), interrupts are directed to specific CPUs via the APIC's message routing, and MSI-X enables per-queue interrupt affinity for high-performance devices (each hardware queue gets its own interrupt, enabling CPU affinity). MSI-X is the usual choice for NVMe drives with multiple queues and for high-speed network cards that want minimal interrupt latency by binding each queue to a dedicated CPU.
Interrupt affinity allows directing an IRQ to a specific CPU or set of CPUs. The `smp_affinity` file in `/proc/irq/` controls this: writing a hexadecimal bitmask to `/proc/irq/42/smp_affinity` limits which CPUs can handle IRQ 42. In NUMA systems, the optimal setting is to direct the IRQ to a CPU on the same NUMA node as the device, which minimizes memory access latency when the IRQ handler touches the device's buffers. For a network card on NUMA node 0, set its IRQ affinity to CPUs on node 0. For storage controllers, similarly align with their NUMA node. Misaligned affinity causes cache-line bounces across the NUMA interconnect, adding latency to each interrupt handler invocation. Tools like `irqbalance` (a daemon) attempt to dynamically optimize affinity, but for deterministic high-performance workloads, manual tuning is often better. `cat /proc/irq/default_smp_affinity` shows the default affinity mask for newly allocated IRQs; writing a mask to that file changes it.
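The bitmask written to `smp_affinity` has bit N set for CPU N and is expressed in hexadecimal. A minimal sketch of the conversion in both directions follows; the helper names are hypothetical, and the comma handling assumes the kernel's convention of printing wide masks as comma-separated 32-bit groups.

```python
def cpus_to_mask(cpus):
    """Build the hex string for a list of CPU numbers."""
    mask = 0
    for cpu in cpus:
        mask |= 1 << cpu
    return format(mask, "x")


def mask_to_cpus(mask_hex):
    """Decode a hex affinity mask (commas between 32-bit groups are ignored)."""
    mask = int(mask_hex.replace(",", ""), 16)
    return [cpu for cpu in range(mask.bit_length()) if (mask >> cpu) & 1]


print(cpus_to_mask([0]))            # CPU 0 only
print(cpus_to_mask([0, 2]))         # CPUs 0 and 2
print(cpus_to_mask([4, 5, 6, 7]))   # second group of four CPUs
print(mask_to_cpus("f0"))
```

So `echo 1 > /proc/irq/42/smp_affinity` pins IRQ 42 to CPU 0, and `echo f0` would spread it across CPUs 4 to 7.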
An interrupt handler (ISR) runs in atomic context with interrupts disabled on the current CPU, so preemption is off and the scheduler cannot run until the ISR completes. A kernel thread runs in process context (can be preempted, can sleep, can schedule). ISR latency is bounded by hardware (microseconds), but the ISR cannot do anything that might sleep (no GFP_KERNEL allocation, no mutex). A kernel thread can use full kernel services but has scheduling latency determined by its priority and the current scheduler load (milliseconds in worst case). When deciding between a threaded IRQ and a workqueue task, note that both run in process context and can sleep; the difference is scheduling. A threaded IRQ handler runs in a dedicated kernel thread with real-time priority, so it starts soon after the primary handler returns, while a workqueue item runs on shared worker threads at normal priority and may wait behind other work. For work tied directly to one interrupt, such as processing a received packet and waking a waiting task, use a threaded IRQ. For longer or lower-priority background work, use a workqueue.
The IRQ descriptor table is the kernel's per-interrupt bookkeeping: one `struct irq_desc` for each Linux IRQ number (historically a static `irq_desc[]` array, allocated dynamically on kernels built with sparse IRQ support). Each descriptor contains a pointer to the chain of registered handlers, state flags (enabled/disabled, in progress), the IRQ chip (low-level hardware operations such as mask, unmask, and ack), per-IRQ statistics, and configuration such as the trigger type and the affinity mask (which CPUs may handle this IRQ). When an interrupt fires, the kernel looks up the descriptor for the IRQ number and runs its flow handler, which dispatches to the registered handlers. Multiple devices can share one IRQ line; in that case the kernel calls each handler on the chain in sequence, and each handler reports whether its device raised the interrupt.
`IRQF_ONESHOT` tells the kernel to keep the interrupt line masked after the primary (hard IRQ) handler returns and to unmask it only once the threaded handler has finished. It does not by itself forbid sharing, but every handler on a shared line must agree on the flag. Without `IRQF_ONESHOT`, the line is unmasked as soon as the primary handler returns. For a level-triggered interrupt whose condition is cleared only in the thread, the device is still asserting the line at that point, so the interrupt fires again immediately and the thread may never get to run. With `IRQF_ONESHOT`, the line stays masked at the interrupt controller while the thread talks to the device, and other interrupts on that CPU are unaffected. The flag is used with `request_threaded_irq()`, which splits handling into a fast primary handler that runs in atomic context and a threaded part that can sleep and do slow work. If the primary handler is `NULL`, the kernel installs a default one that only wakes the thread, and in that case it requires `IRQF_ONESHOT`. The cost is that the device cannot deliver another interrupt on that line until the thread completes, so the threaded handler should still be kept reasonably short.
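To make the masking behaviour concrete, the following is a toy userspace model, not kernel code. It counts how often the hard IRQ handler fires for a single event on a level-triggered line when only the threaded handler clears the device condition. The function name `count_hardirqs` and the tick-based timing are assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model: one device event on a level-triggered line. The threaded
 * handler runs thread_delay ticks after it is first woken and is the only
 * thing that clears the device condition. Returns how many times the
 * hard IRQ handler fired. */
int count_hardirqs(bool oneshot, int thread_delay)
{
    assert(thread_delay >= 0);

    bool line_asserted = true;   /* device holds the line active */
    bool masked = false;         /* mask state at the interrupt controller */
    bool thread_pending = false;
    int ticks_left = 0;
    int hardirqs = 0;

    while (line_asserted) {
        if (!masked) {
            /* Hard IRQ entry: the line is masked, the primary handler
             * runs and asks for the thread to be woken. */
            masked = true;
            hardirqs++;
            if (!thread_pending) {
                thread_pending = true;
                ticks_left = thread_delay;
            }
            /* Without oneshot semantics the line is unmasked right away. */
            if (!oneshot)
                masked = false;
        }

        if (thread_pending) {
            if (ticks_left == 0) {
                /* Threaded handler: clears the device condition. */
                line_asserted = false;
                thread_pending = false;
                masked = false;
            } else {
                ticks_left--;
            }
        }
    }
    return hardirqs;
}
```

In this model `count_hardirqs(true, 5)` fires the hard handler once, while `count_hardirqs(false, 5)` fires it on every tick until the thread finally runs. A real system behaves worse than the model, because the repeated hard IRQs can keep the thread from being scheduled at all.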
An interrupt storm occurs when a device generates interrupts faster than the system can handle them, saturating the CPU with interrupt processing and starving useful work. The first diagnostic is `/proc/interrupts`. The file shows cumulative per-CPU counts, not rates, so sample it twice (for example with `watch -d -n1 cat /proc/interrupts`) and compare the values. Thousands of interrupts per second can be normal for a busy NIC or NVMe queue; a storm looks like a count that grows by hundreds of thousands per second, or far faster than the device's real workload explains. Typical causes are hardware malfunction (a defective device asserting its IRQ line permanently), misconfigured interrupt coalescing (thresholds set too low), a software bug that leaves the interrupt unacknowledged so it is raised repeatedly for the same event, or a race in which clearing the interrupt does not actually clear the hardware condition. `/proc/irq/<N>/spurious` reports how many interrupts on that line went unhandled; if almost all recent interrupts are unhandled, the kernel logs "nobody cared" for that IRQ and disables the line. Fixing a storm requires examining the device's interrupt status register in the ISR to see which condition keeps recurring, then adjusting hardware coalescing or driver logic accordingly.
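Turning two samples into a rate can be sketched as below. This is a minimal sketch assuming the usual `/proc/interrupts` layout of a label, a colon, one counter per CPU, and then descriptive text. The function names `sum_irq_counts` and `irq_rate` are made up for this example.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Sum the per-CPU counters on one /proc/interrupts line, for example
 *   " 42:   1000   2000   IR-PCI-MSI 524288-edge   eth0"
 * Returns 0 and stores the sum in *total, or -1 if there is no "label:". */
int sum_irq_counts(const char *line, unsigned long long *total)
{
    assert(line != NULL && total != NULL);

    const char *p = strchr(line, ':');
    if (p == NULL)
        return -1;
    p++;

    unsigned long long sum = 0;
    for (;;) {
        while (isspace((unsigned char)*p))
            p++;
        if (!isdigit((unsigned char)*p))
            break;                      /* reached the descriptive text */
        char *end;
        unsigned long long v = strtoull(p, &end, 10);
        /* A counter is a pure number followed by whitespace or end of line. */
        if (*end != '\0' && !isspace((unsigned char)*end))
            break;
        sum += v;
        p = end;
    }
    *total = sum;
    return 0;
}

/* Interrupts per second between two samples taken interval_s seconds apart. */
double irq_rate(unsigned long long before, unsigned long long after,
                double interval_s)
{
    if (interval_s <= 0.0 || after < before)
        return 0.0;
    return (double)(after - before) / interval_s;
}
```

Reading the file twice, summing the line for the IRQ of interest each time, and passing both sums to `irq_rate` gives a figure to compare against what the device should plausibly generate.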
`disable_irq()` increments a per-IRQ depth counter and masks the line at the interrupt controller; each call needs a matching `enable_irq()`, which decrements the counter. The line is unmasked only when the counter returns to zero, so calling `disable_irq()` twice requires two `enable_irq()` calls to restore it. `disable_irq()` also waits for any handler currently running for that IRQ to complete before it returns, which guarantees that no handler is executing afterwards. Because of that wait, it must not be called from that IRQ's own handler or while holding a lock the handler needs, since either would deadlock. Use `disable_irq_nosync()` in those situations: it masks the line and returns immediately without waiting for in-flight handlers. The counted pairing lets independent pieces of code disable the same line without one of them re-enabling it underneath the other. A common bug is unbalanced calls: `disable_irq()` in init and `enable_irq()` in cleanup, with cleanup run more than once, triggers an "Unbalanced enable" warning from the kernel. Using `request_threaded_irq()` with `IRQF_ONESHOT` avoids manual enable/disable in most cases.
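The counting rule can be illustrated with a small userspace model. This is a sketch of the bookkeeping only: the `model_` names and the struct are hypothetical, and the real functions also deal with locking and waiting for running handlers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of the per-IRQ disable depth. */
struct irq_line_model {
    unsigned depth;            /* number of outstanding disables */
    bool masked;               /* mask state at the interrupt controller */
    unsigned unbalanced;       /* enable calls with no matching disable */
};

void model_disable_irq(struct irq_line_model *l)
{
    assert(l != NULL);
    if (l->depth++ == 0)
        l->masked = true;      /* only the first disable masks the line */
}

void model_enable_irq(struct irq_line_model *l)
{
    assert(l != NULL);
    if (l->depth == 0) {
        l->unbalanced++;       /* the kernel warns in this situation */
        return;
    }
    if (--l->depth == 0)
        l->masked = false;     /* only the last enable unmasks the line */
}
```

Two calls to `model_disable_irq` followed by one `model_enable_irq` leave the line masked, which matches the rule that every disable needs its own enable.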
`IRQF_SHARED` allows multiple devices to share one IRQ line. When the line fires, the kernel calls every registered handler in turn, because it cannot tell which device raised the interrupt. To share correctly: every handler on the line must be registered with `IRQF_SHARED`; each handler must read its device's status register to check whether its device actually raised the interrupt, and return `IRQ_NONE` if not; and a handler that does own the interrupt must acknowledge it at the device before returning `IRQ_HANDLED`, so the hardware de-asserts the line. A handler that returns `IRQ_HANDLED` without checking claims interrupts that belong to other devices, which hides unhandled interrupts from the kernel's spurious-interrupt detection and can lead to missed events. Shared interrupts are common where IRQ lines are scarce (legacy systems). The request is `request_irq(irq, handler, IRQF_SHARED, name, dev_id)`, where `dev_id` must be a unique, non-NULL pointer, usually the device structure. The kernel passes it back to the handler so the handler knows which device to check, and `free_irq()` uses it to select which handler to remove.
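The check-then-claim pattern can be sketched in userspace as follows. The `IRQ_NONE` and `IRQ_HANDLED` names mirror the kernel's return values, but `struct fake_dev`, `DEV_IRQ_PENDING`, and `dispatch_shared` are hypothetical stand-ins for a real device register and the kernel's dispatch loop.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef enum { IRQ_NONE = 0, IRQ_HANDLED = 1 } irqreturn_t;

#define DEV_IRQ_PENDING 0x1u   /* hypothetical "interrupt pending" status bit */

struct fake_dev {
    uint32_t status;           /* stands in for a device status register */
    unsigned handled;          /* events this device has serviced */
};

/* Handler: claim the interrupt only if our own device raised it. */
irqreturn_t fake_handler(int irq, void *dev_id)
{
    struct fake_dev *dev = dev_id;
    (void)irq;

    if (!(dev->status & DEV_IRQ_PENDING))
        return IRQ_NONE;               /* not ours, let others check */

    dev->status &= ~DEV_IRQ_PENDING;   /* acknowledge at the device */
    dev->handled++;
    return IRQ_HANDLED;
}

struct irq_action {
    irqreturn_t (*handler)(int irq, void *dev_id);
    void *dev_id;
};

/* Call every handler on the shared line; return how many claimed it. */
int dispatch_shared(int irq, struct irq_action *actions, size_t n)
{
    assert(actions != NULL || n == 0);

    int claimed = 0;
    for (size_t i = 0; i < n; i++)
        if (actions[i].handler(irq, actions[i].dev_id) == IRQ_HANDLED)
            claimed++;
    return claimed;
}
```

A return value of zero from `dispatch_shared` corresponds to an interrupt nobody claimed, which is the case the kernel counts as unhandled.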
Interrupt handlers run with interrupts disabled on the local CPU, but that alone does not make shared data safe: process-context code on the same CPU can be interrupted in the middle of an update, and code on other CPUs runs concurrently with the handler. If data is accessed from both interrupt context and process context, use one of: (1) a spinlock taken with `spin_lock_irqsave()` in process context, which disables local interrupts for the duration of the critical section; (2) lock-free algorithms with explicit memory ordering; or (3) a design in which the handler never touches the data, for example per-CPU data used only from process context. The `spin_lock_irqsave()` pattern is the most common: it saves the current interrupt state, disables interrupts on the local CPU, and then acquires the spinlock. `spin_unlock_irqrestore()` releases the lock and restores the saved state, so interrupts are re-enabled only if they were enabled before. This prevents the classic self-deadlock, in which the handler interrupts the lock holder on the same CPU and then spins forever on a lock that can never be released.
`synchronize_irq()` blocks until any handler currently executing for that IRQ has completed. It is called during driver removal to ensure that no handler is running when resources are freed. The danger is freeing the device structure or its buffers while an ISR on another CPU is still accessing them, which causes a use-after-free crash. The pattern is: mask the line or quiesce the device so that no new interrupts arrive, call `synchronize_irq()` to wait for any in-flight ISR to finish, and only then free resources. `synchronize_irq()` can block for as long as the handler takes, so call it only in shutdown sequences where sleeping is acceptable, and never from the handler itself. It is unrelated to `synchronize_rcu()`, which waits for RCU read-side critical sections and protects only RCU-managed data, not general ISR completion. In practice, driver removal code calls `free_irq()`, which removes the handler and synchronizes internally before returning, so the rule is to call `free_irq()` before freeing anything the handler touches.
High-frequency trading (HFT) systems often use polling (busy-wait) instead of interrupts because interrupt-driven delivery adds latency that varies from one packet to the next. When a market data packet arrives, the path through the ISR, softirq processing, and waking the receiving thread typically adds on the order of microseconds, and the exact amount depends on what the CPU was doing at the time. In HFT that variation is unacceptable, because the difference between being first and second in the queue can decide whether a trade is profitable. Calling `epoll_wait()` in a loop with a timeout of zero keeps the thread from sleeping, so data is picked up as soon as the kernel has queued it. On its own this does not remove the NIC interrupt; that takes socket busy polling (`SO_BUSY_POLL`) or kernel bypass. The tradeoff is CPU overhead (the core is always busy even when no data is present), but at high message rates (a 10 Gbps link can carry millions of packets per second) the CPU would be handling interrupts almost constantly anyway. Many HFT systems also use kernel bypass (DPDK, Solarflare OpenOnload) to avoid OS overhead entirely, moving NIC handling into userspace with busy-polling.
Further Reading
- Linux Kernel Documentation: IRQ subsystem — Official documentation covering interrupt management, handler registration, and IRQ descriptors
- Linux Kernel Documentation: IRQ threading — Guide to threaded IRQs and IRQ affinity
- Intel Software Developer Manual, Chapter 6 — Complete reference on interrupt architecture, IDT, and APIC
- ARM Interrupt Architecture — ARM’s interrupt handling documentation for embedded systems
- NAPI: New API for Network Drivers — Linux networking documentation on hybrid interrupt-polling
- USENIX '02: Efficient IRQ Balancing — Research on SMP interrupt distribution optimization
Conclusion
Interrupts and polling represent two fundamental models for device-CPU communication, with modern systems blending both into hybrid approaches. Hardware interrupts provide immediate response for sparse, unpredictable events; polling handles dense, predictable workloads without interrupt overhead. The top-half/bottom-half split in Linux interrupt handling strikes a similar balance between responsiveness and throughput: ISRs run atomically and return quickly, while deferred handlers (softirqs, tasklets, threaded IRQs, workqueues) do the lengthy work later, with threaded IRQs and workqueues running in process context.
Understanding interrupt handling prepares you for deeper OS topics: the interaction between interrupt context and scheduler, the security implications of IRQ routing in virtualized environments, and the evolution toward message-signaled interrupts that avoid physical pin limitations. These fundamentals also appear in ARM and RISC-V interrupt controllers, embedded real-time operating systems, and hypervisor virtual interrupt delivery.
Looking forward, the lines between interrupt and polling continue blurring as hardware supports interrupt coalescing, modern NVMe drives benefit from pure polling during intensive periods, and software-defined interrupt controllers enable flexible IRQ routing for power and performance optimization.