Virtual Memory
How operating systems use disk as an extension of RAM through demand paging, and the page replacement algorithms — LRU, Clock, Working Set — that decide what gets evicted.
Virtual memory is the engineering feat that makes a computer with 8 GB of RAM run applications that collectively demand 64 GB without crashing. It is the OS’s most visible memory management technique — a layer of indirection that allows the logical address space to exceed the physical address space, using disk as a backing store for the pages that do not fit in RAM. The elegance is that application code does not know and does not care whether its pages are resident in physical memory or have been evicted to swap.
The magic happens through demand paging: pages are loaded into physical memory only when a process accesses them. A program that calls malloc(1 GB) pays almost nothing at allocation time: the OS reserves the virtual address range but leaves its page table entries marked not present. A physical frame is allocated only when the first access to a page triggers a page fault. This lazy allocation applies to file-backed pages, heap, and stack alike.
Introduction
When to Use / When Not to Use
Virtual memory is always active — you cannot disable it on any modern general-purpose OS. However, you make implicit choices about how aggressively your workload uses it.
When understanding virtual memory matters:
- Debugging memory-related crashes (segmentation faults, OOM kills, thrashing)
- Optimizing applications where working set > physical memory (large databases, ML training)
- Configuring swap space appropriately (or deciding to use no swap at all)
- Understanding why certain programs behave differently under memory pressure
When virtual memory is a liability:
- Write-heavy workloads on spinning disks (swap thrashing destroys I/O)
- Latency-sensitive real-time applications (page faults introduce unbounded latency)
- Embedded systems with limited storage (swap consumes precious flash endurance)
On latency-sensitive systems (trading platforms, real-time control systems), administrators often disable swap entirely (swapoff -a) to eliminate the possibility of a page fault causing a millisecond-scale pause that violates latency SLAs.
Architecture or Flow Diagram
Virtual memory combines demand paging with page replacement to seamlessly spill pages to disk and retrieve them on demand. The flow below shows the decision tree the OS follows when a page fault occurs.
flowchart TD
PF["Page Fault<br/>Triggered by CPU"]
HANDLER["Page Fault<br/>Handler"]
VALID["Is the address<br/>valid and permitted?"]
PRESENT["Is the page in<br/>physical memory?"]
SWAPPED["Is the page<br/>in swap?"]
LOAD["Load page from<br/>swap device to frame"]
ZERO["Allocate zero-filled<br/>frame from free pool"]
UPDATE["Update page table entry<br/>Mark page as present"]
SEGV["Send SIGSEGV<br/>Terminate process"]
RESUME["Resume interrupted<br/>instruction"]
PF --> HANDLER
HANDLER --> VALID
VALID -->|"No"| SEGV
VALID -->|"Yes"| PRESENT
PRESENT -->|"No, in swap"| LOAD
PRESENT -->|"No, never allocated"| ZERO
PRESENT -->|"Yes, already in RAM"| RESUME
LOAD --> UPDATE
ZERO --> UPDATE
UPDATE --> RESUME
style SEGV stroke:#ff6b6b
style LOAD stroke:#ff9f43
style ZERO stroke:#00fff9
The OS maintains a page replacement algorithm to choose which physical frame to evict when a newly faulted page needs a physical frame and none are free. This is where LRU, Clock, and Working Set algorithms diverge.
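The decision tree above can be sketched as a small function. This is a toy model rather than kernel code: the PageState record, the Action names, and handle_fault are hypothetical names introduced only for illustration, and frame allocation and the actual swap read are left out.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Optional


class Action(Enum):
    SIGSEGV = auto()         # invalid address or forbidden access
    RESUME = auto()          # page already resident, nothing to load
    LOAD_FROM_SWAP = auto()  # page was evicted earlier, read it back in
    ZERO_FILL = auto()       # first touch, hand out a zeroed frame


@dataclass
class PageState:
    """Hypothetical per-page bookkeeping, not a real kernel structure."""
    writable: bool = True
    present: bool = False
    swap_slot: Optional[int] = None  # set while the page lives in swap


def handle_fault(pages: dict[int, PageState], vpn: int, is_write: bool) -> Action:
    """Walk the fault decision tree for virtual page number `vpn`."""
    page = pages.get(vpn)
    # 1. Validity and permission check.
    if page is None or (is_write and not page.writable):
        return Action.SIGSEGV
    # 2. Already resident: just resume the instruction.
    if page.present:
        return Action.RESUME
    # 3. Not resident: either swapped out or never allocated.
    action = Action.LOAD_FROM_SWAP if page.swap_slot is not None else Action.ZERO_FILL
    page.present = True   # update the page table entry
    page.swap_slot = None
    return action
```

Calling handle_fault twice on the same untouched page returns ZERO_FILL and then RESUME, which mirrors the "never allocated" and "already in RAM" branches of the diagram.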
Core Concepts
Demand Paging
Demand paging is the mechanism by which pages are loaded into physical memory only upon first access. The page table entry for a not-yet-loaded page has the present (P) bit cleared. When a process accesses that page:
- CPU triggers page fault
- OS determines the fault address (CR2 register on x86)
- OS allocates a physical frame
- OS reads the page from its backing store (executable, page cache, or swap)
- OS updates the page table entry with the frame number and sets P=1
- CPU retries the instruction
The first pass over a large mmap region is dramatically slower than later passes because every page takes a fault on first touch. On Linux, the mincore() system call reports which pages of a mapped region are resident.
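To make the cost model concrete, here is a minimal sketch of lazy allocation. LazyRegion and its methods are hypothetical names, the 4 KB page size is an assumption, and frames are simply numbered in the order they are handed out.

```python
PAGE_SIZE = 4096  # assumed page size in bytes


class LazyRegion:
    """Toy model: the range is reserved up front, frames arrive on first touch."""

    def __init__(self, length: int) -> None:
        self.num_pages = (length + PAGE_SIZE - 1) // PAGE_SIZE
        self.frames: dict[int, int] = {}  # virtual page number -> frame number
        self.faults = 0

    def touch(self, offset: int) -> int:
        """Access one byte; return the frame that backs it."""
        vpn = offset // PAGE_SIZE
        if not 0 <= vpn < self.num_pages:
            raise IndexError("access outside the reserved region")
        if vpn not in self.frames:               # present bit clear: page fault
            self.faults += 1
            self.frames[vpn] = len(self.frames)  # hand out the next free frame
        return self.frames[vpn]

    @property
    def resident_bytes(self) -> int:
        return len(self.frames) * PAGE_SIZE


region = LazyRegion(1 << 30)       # "allocate" 1 GB: no frames used yet
for offset in (0, 100, PAGE_SIZE, 5 * PAGE_SIZE + 1):
    region.touch(offset)
print(region.faults, region.resident_bytes)  # 3 faults, 3 pages resident
```

Four accesses land on three distinct pages, so only three frames are consumed even though 262,144 pages were reserved.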
Swap Space
Swap space is a disk partition or file that backs pages that have been evicted from physical memory. When the OS needs a physical frame and the free frame list is empty, it runs the page replacement algorithm to select a victim page — one that has not been recently used — writes it to swap if dirty, and reclaims the frame.
Linux swap space is managed in 4 KB blocks called swap slots. Each slot can hold one evicted page. The swapon command activates a swap partition; swapoff deactivates it. The kernel maintains a swap map — an array tracking reference counts for each swap slot.
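The swap map idea can be illustrated with a reference-counted slot array. This is a simplified sketch: SwapArea and its method names are made up for the example, and the linear scan for a free slot ignores the clustering a real kernel uses.

```python
class SwapArea:
    """Toy swap area: one reference count per page-sized slot (0 = free)."""

    def __init__(self, num_slots: int) -> None:
        self.swap_map = [0] * num_slots

    def allocate(self) -> int:
        """Claim the first free slot for an evicted page."""
        for slot, count in enumerate(self.swap_map):
            if count == 0:
                self.swap_map[slot] = 1
                return slot
        raise MemoryError("swap area is full")

    def duplicate(self, slot: int) -> None:
        """Another page table entry now refers to the same swapped page."""
        if self.swap_map[slot] == 0:
            raise ValueError("slot is not in use")
        self.swap_map[slot] += 1

    def free(self, slot: int) -> None:
        """Drop one reference; the slot is reusable once the count hits 0."""
        if self.swap_map[slot] == 0:
            raise ValueError("slot is already free")
        self.swap_map[slot] -= 1

    def free_slots(self) -> int:
        return self.swap_map.count(0)
```

The reference count matters after fork(): parent and child can both point at the same swapped-out page, and the slot is only released when the last reference goes away.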
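Modern kernels support swap files as well as partitions. A swap file is easier to resize than a partition, although it must be deactivated with swapoff, resized, and re-initialized with mkswap before it is used again. Swap performance on SSDs is generally acceptable; on HDDs it is poor because swap I/O is dominated by random seeks.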
Page Replacement Algorithms
When physical memory fills up, the OS must evict some pages to make room for newly needed ones. The choice of which page to evict is the domain of page replacement algorithms.
LRU (Least Recently Used): LRU evicts the page whose last access is furthest in the past. It is a strong approximation of the optimal policy whenever recent history predicts future use, but true LRU requires tracking every page access. A software implementation would have to update a timestamp or queue on every memory reference, which is far too slow for normal program execution. A hardware implementation would need per-page counters or timestamps that the MMU updates on each access, which commodity CPUs do not provide.
Clock (Second Chance): Clock is a practical LRU approximation. Pages are arranged in a circular list. A pointer (the “clock hand”) scans pages; if a page’s accessed bit is set, the bit is cleared and the hand moves on. If the bit is clear, the page is evicted. This avoids per-access software work: the hardware sets the accessed bit on each reference, and the OS only reads and clears it while scanning for a victim. The “second chance” name comes from the fact that a page with its accessed bit set survives one more revolution of the hand before it can be evicted.
Linux does not implement textbook Clock. Its reclaim code keeps active and inactive LRU lists and uses the accessed bit to promote or demote pages between them, which separates hot pages from cold ones. CLOCK-Pro was proposed for Linux but never became the mainline algorithm, and kernels since 6.1 can optionally use the Multi-Gen LRU (MGLRU) instead.
Working Set Model: The working set of a process is the set of pages it actively uses over a time window (e.g., the last 10 seconds). Pages outside the working set can be evicted without significantly impacting the process’s hit rate. The working set model aims to keep the working set resident, evicting pages outside it first. The concept was formalized by Peter Denning in the 1960s and is still referenced in modern memory management literature.
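As a small illustration, the working set can be computed directly from a reference trace. The sketch below assumes virtual time, meaning the window is counted in memory references rather than seconds; working_set is a hypothetical helper name.

```python
def working_set(trace: list[int], t: int, delta: int) -> set[int]:
    """Pages referenced in the last `delta` references ending at index `t`."""
    start = max(0, t - delta + 1)
    return set(trace[start:t + 1])


trace = [1, 2, 1, 3, 4, 4, 4, 5]
print(working_set(trace, t=3, delta=4))  # early phase touches pages 1, 2, 3
print(working_set(trace, t=7, delta=3))  # later phase has moved to pages 4, 5
```

The working set shrinks from three pages to two as the process moves into a new phase, so pages 1, 2, and 3 become eviction candidates.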
A variant called WSClock (Working Set Clock) combines the clock algorithm with working set timestamps, providing both efficiency and accuracy.
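A compact sketch of WSClock follows. It makes several simplifying assumptions: time is the index into the trace, all pages are clean (so no write-back scheduling), and when a full revolution finds no page older than tau, the page with the oldest recorded use is evicted. The names Frame, wsclock_faults, and tau are illustrative.

```python
from dataclasses import dataclass


@dataclass
class Frame:
    page: int
    referenced: bool
    last_use: int  # virtual time of the last observed reference


def wsclock_faults(trace: list[int], num_frames: int, tau: int) -> int:
    """Count page faults under a simplified WSClock policy."""
    if num_frames < 1:
        raise ValueError("need at least one frame")
    frames: list[Frame] = []
    where: dict[int, int] = {}  # page -> index in frames
    hand = 0
    faults = 0
    for now, page in enumerate(trace):
        if page in where:
            frames[where[page]].referenced = True  # hardware would set this bit
            continue
        faults += 1
        if len(frames) < num_frames:               # free frame available
            where[page] = len(frames)
            frames.append(Frame(page, True, now))
            continue
        victim = None
        oldest = hand
        for _ in range(num_frames):                # at most one full revolution
            frame = frames[hand]
            if frame.referenced:
                frame.referenced = False           # second chance
                frame.last_use = now
            elif now - frame.last_use > tau:
                victim = hand                      # outside the working set
                break
            if frame.last_use < frames[oldest].last_use:
                oldest = hand
            hand = (hand + 1) % num_frames
        if victim is None:
            victim = oldest                        # every page is in the working set
        del where[frames[victim].page]
        frames[victim] = Frame(page, True, now)
        where[page] = victim
        hand = (victim + 1) % num_frames
    return faults
```

The timestamp is refreshed only when the hand passes a page whose bit is set, so the per-access cost stays in hardware while the scan still knows roughly how long each page has been idle.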
Thrashing
Thrashing is the pathological state where the system spends more time swapping pages in and out than executing useful work. The classic scenario: total working sets of all runnable processes exceed physical memory. Every process needs pages, the OS evicts other processes’ pages to make room, those processes fault back in, and the cycle repeats at disk I/O speeds rather than CPU speeds.
The cure is to reduce the number of competing processes, add physical RAM, or tune memory limits (cgroups, container memory caps). The vm.swappiness sysctl influences the kernel’s tendency to swap out process pages versus reclaiming page cache.
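The cliff-edge nature of thrashing shows up even in a tiny simulation. The sketch below assumes LRU replacement and a process that loops over a fixed set of 100 pages; lru_fault_rate is a hypothetical helper.

```python
from collections import OrderedDict


def lru_fault_rate(trace: list[int], num_frames: int) -> float:
    """Fraction of accesses that fault under LRU with `num_frames` frames."""
    resident: OrderedDict[int, None] = OrderedDict()
    faults = 0
    for page in trace:
        if page in resident:
            resident.move_to_end(page)        # hit: now most recently used
        else:
            faults += 1
            if len(resident) >= num_frames:
                resident.popitem(last=False)  # evict least recently used
            resident[page] = None
    return faults / len(trace)


# A loop over 100 pages, repeated 100 times.
trace = [i % 100 for i in range(10_000)]
print(lru_fault_rate(trace, 100))  # working set fits: 0.01
print(lru_fault_rate(trace, 99))   # one frame short: 1.0
```

Losing a single frame takes the fault rate from 1% to 100%, because LRU always evicts exactly the page the loop needs next. Real workloads are less regular, but the same non-linear collapse is what makes thrashing so abrupt.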
Production Failure Scenarios
Thrashing Under Memory Pressure
Failure: A node running multiple Java microservices with large heaps experiences constant page faults. The GC in each JVM runs frequently, but heap pages are swapped out between GC cycles. GC pause times spike from 200 ms to 20 seconds. Response times balloon.
Mitigation: Set explicit memory limits for each container (docker run --memory=512m). Reduce vm.swappiness to 10-20 on hosts running latency-sensitive services. Consider disabling swap on database hosts (swapoff) to force the OOM killer to make hard decisions rather than slowly swapping. Move large heaps to hosts with ample headroom. Use memory requests/limits in Kubernetes to prevent any single pod from consuming the node’s memory.
Swap Storm from fork() Followed by exec()
Failure: A web server forks a child process for each request, and the child immediately exec()s a new program. After fork(), the parent and child share all pages (copy-on-write). The child then calls exec(), which replaces the entire address space with the new program — discarding all COW pages. If many requests arrive simultaneously, the OS allocates and discards huge numbers of pages, generating massive swap activity even though the total memory footprint is small.
Mitigation: Use prefork or thread-pool models (Apache MPM worker, Nginx worker processes) instead of fork-per-request. Use vfork() on Linux (which shares the parent’s address space, with no COW setup, until the child calls exec()). Use container memory limits to isolate services and reduce cross-service swap pressure.
Transparent Huge Page Defragmentation Pauses
Failure: Linux’s transparent huge page (THP) feature defragments memory in the background to coalesce 4 KB pages into 2 MB huge pages. This defragmentation scan can cause latency spikes of several milliseconds in latency-sensitive workloads.
Mitigation: Disable THP on hosts that run latency-sensitive applications: echo never > /sys/kernel/mm/transparent_hugepage/enabled. Use explicit huge page allocation (mmap(MAP_HUGETLB)) for services that need it (PostgreSQL, JVM). On kernel 6.1+, use madvise(MADV_COLLAPSE) to request huge page backing for specific regions.
Trade-off Table
| Algorithm | Implementation Cost | Eviction Accuracy | Hardware Support | Linux Implementation |
|---|---|---|---|---|
| Optimal (OPT) | Impossible (requires future knowledge) | Perfect (0 extra evictions) | None | Not used |
| LRU (true) | Very high (per-access timestamp update) | High | Specialized hardware | Not used (too expensive) |
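| Clock (Second Chance) | Low (circular buffer, accessed bit check) | Moderate | Accessed bit (R-bit) | Second-chance logic appears in the reclaim scan |
| Clock-Pro | Low-moderate (multiple hands, hot/cold tracking) | Good | Accessed bit | Proposed, never merged into mainline |
| Active/inactive LRU lists (Linux) | Low-moderate (two lists per memory type, accessed bit) | Good | Accessed bit | Mainline default; MGLRU optional since 6.1 |
| Working Set / WSClock | Moderate (per-page timestamps) | Very good | Accessed bit plus software timestamps | Historical |
| Random | Near zero | Poor (uninformed) | None | Not used for page reclaim |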
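Although OPT cannot be implemented online, it is easy to compute offline for a recorded trace, which makes it a useful lower bound when comparing the other rows. The sketch below is a straightforward, unoptimized version; opt_page_faults is a hypothetical name.

```python
def opt_page_faults(trace: list[int], num_frames: int) -> int:
    """Belady's optimal policy: evict the page whose next use is furthest away."""
    resident: set[int] = set()
    faults = 0
    for i, page in enumerate(trace):
        if page in resident:
            continue
        faults += 1
        if len(resident) >= num_frames:
            future = trace[i + 1:]

            def next_use(p: int) -> int:
                # Pages never used again sort after every page that is.
                return future.index(p) if p in future else len(future) + 1

            resident.remove(max(resident, key=next_use))
        resident.add(page)
    return faults


accesses = [7, 0, 1, 2, 0, 3, 0, 4, 1, 2, 3, 4, 7, 0, 1]
print(opt_page_faults(accesses, 3))  # 10
```

No online algorithm can take fewer than 10 faults on this trace with three frames, so the gap between a candidate policy and this number is the price of not knowing the future.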
Implementation Snippets
Simulating LRU Page Replacement (Python)
#!/usr/bin/env python3
"""LRU Page Replacement Simulator.

Given a sequence of page accesses and a number of physical frames,
simulate the LRU algorithm and count page faults.

LRU: On each access, track when each page was last used.
When a replacement is needed, evict the page with the oldest
(least recently used) timestamp.
"""
from collections import OrderedDict


def lru_page_faults(access_sequence: list[int], num_frames: int) -> int:
    """Return the number of page faults using LRU replacement."""
    # Use an OrderedDict: order = recency of access (tail = most recent)
    resident_pages = OrderedDict()  # page -> last_access_order
    faults = 0
    access_counter = 0
    for page in access_sequence:
        access_counter += 1
        if page in resident_pages:
            # Page hit: move to end (most recently used)
            resident_pages.move_to_end(page)
        else:
            faults += 1
            if len(resident_pages) >= num_frames:
                # Evict LRU page (first item in OrderedDict)
                resident_pages.popitem(last=False)
            # Load new page and mark as most recently used
            resident_pages[page] = access_counter
    return faults


def clock_page_faults(access_sequence: list[int], num_frames: int) -> int:
    """Approximated LRU using the Clock (Second Chance) algorithm."""
    # resident_pages[i] = (page_number, referenced_bit)
    resident_pages = []  # List of (page, r_bit)
    faults = 0
    hand = 0  # Clock hand position
    for page in access_sequence:
        # Check if page is already resident
        found = False
        for i, (p, r) in enumerate(resident_pages):
            if p == page:
                resident_pages[i] = (p, 1)  # Set referenced bit
                found = True
                break
        if found:
            continue
        # Page fault: need to load page
        faults += 1
        # Find a frame to use
        while True:
            if len(resident_pages) < num_frames:
                resident_pages.append((page, 1))
                break
            # Clock algorithm: scan for victim with r=0
            _, r = resident_pages[hand]
            if r == 0:
                # Evict this page
                resident_pages[hand] = (page, 1)
                hand = (hand + 1) % num_frames
                break
            else:
                # Second chance: clear bit, move on
                resident_pages[hand] = (resident_pages[hand][0], 0)
                hand = (hand + 1) % num_frames
    return faults


# Example from textbook: access sequence 7, 0, 1, 2, 0, 3, 0, 4, 1, 2, 3, 4, 7, 0, 1
accesses = [7, 0, 1, 2, 0, 3, 0, 4, 1, 2, 3, 4, 7, 0, 1]
print("Page access sequence:", accesses)
print("\nLRU algorithm:")
for frames in range(1, 7):
    faults = lru_page_faults(accesses, frames)
    print(f"  {frames} frame(s): {faults} faults")
print("\nClock algorithm:")
for frames in range(1, 7):
    faults = clock_page_faults(accesses, frames)
    print(f"  {frames} frame(s): {faults} faults")
Monitoring Virtual Memory (bash)
#!/bin/bash
# Comprehensive virtual memory monitoring
echo "=== VM Statistics ==="
vmstat 1 5
echo ""
echo "=== Swap Usage ==="
swapon --show
echo ""
free -h
echo ""
echo "=== Top processes by major page faults ==="
ps -eo pid,comm,majflt,minflt,rss,vsz --sort=-majflt | head -10
echo ""
echo "=== Current swappiness ==="
cat /proc/sys/vm/swappiness
echo ""
echo "=== Recent OOM Kills ==="
dmesg | grep -i "oom\|out of memory" | tail -5
Observability Checklist
- Overall swap activity: vmstat 1, watching the si (swap in) and so (swap out) columns in KB/s
- Major page faults per process: ps -eo pid,comm,majflt --sort=-majflt | head
- Minor page faults per process: ps -eo pid,comm,minflt (high minflt points to first-touch allocation, copy-on-write, or page cache mapping activity)
- VFS cache pressure: cat /proc/sys/vm/vfs_cache_pressure (default 100; higher = reclaim dentry and inode caches more aggressively)
- Swappiness: cat /proc/sys/vm/swappiness (0-100, up to 200 on kernels 5.8+; higher = more willing to swap process pages)
- Available swap: swapon --show and free -h
- Per-process swap usage: cat /proc/PID/status | grep -i swap
- OOM killer events: dmesg | grep "Out of memory" | tail or journalctl -k | grep -i oom
- Transparent huge page status: cat /sys/kernel/mm/transparent_hugepage/enabled
- Active vs inactive memory: cat /proc/meminfo | grep -E "Active:|Inactive:" (inactive pages are the first candidates for reclaim)
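Several of these checks read /proc/meminfo, which is easy to parse for dashboards or alerts. The sketch below works on a text snapshot so it can run anywhere; the SAMPLE values are made up, and parse_meminfo and swap_used_kb are hypothetical helper names.

```python
def parse_meminfo(text: str) -> dict[str, int]:
    """Parse /proc/meminfo-style text into {field: number} (kB for most fields)."""
    fields: dict[str, int] = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # not a "Field: value" line
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def swap_used_kb(fields: dict[str, int]) -> int:
    return fields["SwapTotal"] - fields["SwapFree"]


# Made-up snapshot for illustration only.
SAMPLE = """\
MemTotal:       16384000 kB
MemFree:         1024000 kB
MemAvailable:    4096000 kB
Active:          8000000 kB
Inactive:        3000000 kB
SwapTotal:       2097148 kB
SwapFree:        1048574 kB
HugePages_Total:       0
"""

info = parse_meminfo(SAMPLE)
print(swap_used_kb(info), "kB of swap in use")
```

On a Linux host, the same function can be fed the live file with parse_meminfo(open("/proc/meminfo").read()).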
Common Pitfalls / Anti-Patterns
Swapping sensitive data to disk: When pages containing sensitive data (cryptographic keys, passwords, private keys) are evicted to swap, they reside in plaintext on the swap device. If an attacker gains physical access or reads the swap device after the system crashes, they can recover this data. Mitigations include:
- Encrypted swap: LUKS-encrypted swap partitions, or cryptswap with a key stored only in RAM
- Memory locking (mlock/mlockall): Prevent specific pages from being swapped by calling mlock() on the containing region
- Key management: Hardware security modules (HSMs) and Intel SGX enclaves keep keys in protected memory that the OS cannot read in plaintext
Cold boot attacks targeting swap: As with cold boot attacks on RAM, an attacker who images the swap partition of a powered-off or crashed machine can recover sensitive data from it. Encrypted swap with a key held only in RAM renders this ineffective.
Memory disclosure via /proc filesystem: The /proc/PID/maps, /proc/PID/smaps, and /proc/PID/pagemap interfaces expose memory layout information that can aid exploits. Most production environments restrict access to /proc/PID/ to the process owner, for example by mounting /proc with the hidepid=2 option.
Common Pitfalls / Anti-patterns
Pitfall: Configuring too much swap on systems with SSDs.
If you set swap equal to RAM on an SSD-backed system, you may never hit OOM — the system will happily swap instead of invoking the OOM killer. This creates a death by a thousand cuts: applications slow down due to page fault latency, but do not crash. The system appears responsive but degrades gradually. Better practice: set vm.swappiness=1 (or 0 on kernels 3.5+) for swap-averse workloads, and use memory limits to enforce hard caps.
Pitfall: Disabling swap entirely on systems with overcommitted memory.
Disabling swap (swapoff -a) works well when you have more than enough physical RAM for all workloads. On a system with memory overcommit (common in containerized environments), disabling swap means the first process to exhaust memory triggers the OOM killer immediately, with no graceful degradation. You want some swap as a pressure valve.
Pitfall: Confusing virtual memory size with physical memory usage.
pmap -X <pid> shows the virtual size and resident set size (RSS) of each mapping; their totals correspond to VSZ and RSS. The gap between VSZ and RSS can be enormous: virtual memory includes mapped but never-touched pages, pages that were faulted in and later swapped out or dropped, and stack guard pages. A process with VSZ=100 GB and RSS=200 MB is using 200 MB of physical memory, not 100 GB.
Anti-pattern: Large allocations without considering swap.
On 32-bit systems (or 32-bit processes on 64-bit kernels), a single large malloc can exhaust the virtual address space. On 64-bit systems, address space is rarely the limit: both mmap and malloc reserve virtual address space up front and commit physical pages only on first touch, so a large allocation can succeed and then cause trouble much later, when the pages are actually written. Under heavy swap usage, that late commitment can cause the OOM killer to terminate processes unexpectedly.
Quick Recap Checklist
- Virtual memory allows logical address spaces larger than physical memory by using disk as backing store
- Demand paging loads pages into RAM only on first access (page fault), not at allocation time
- Page replacement algorithms (LRU, Clock, Working Set) decide which physical pages to evict when RAM is full
- Thrashing occurs when the OS spends more time swapping than executing — total working sets exceed physical RAM
- The OOM killer is invoked when physical memory + swap is exhausted, not when virtual address space is exhausted
- Linux approximates LRU with active and inactive page lists driven by the accessed bit (Clock-Pro was proposed but never merged)
- vm.swappiness controls kernel preference: higher = more willing to swap process pages, lower = prefer page cache reclaim
- Huge pages (2 MB, 1 GB) reduce TLB pressure for large working sets, which matters for databases and JVMs
- Encrypted swap prevents cold-boot and physical-access attacks from reading sensitive data from swap space
- Tools: vmstat 1 (swap activity), ps -o majflt,minflt (fault rates), swapon --show (swap size), /proc/meminfo (memory state)
Interview Questions
Thrashing occurs when the total demand for physical memory from all runnable processes exceeds what is available. The OS spends most of its time swapping pages in and out — every time a process gets CPU time, it generates page faults, loads pages from disk, and then the OS evicts other processes' pages to make room. Throughput collapses because the CPU is idle waiting for disk I/O more than it is running code.
Detection: vmstat 1 shows high si (swap in) and so (swap out) columns. iostat -x 1 shows high disk utilization. ps shows processes with high majflt rates. The system appears sluggish despite moderate CPU usage.
Resolution: Add physical RAM, reduce the number of competing processes, set memory limits (cgroups, containers), reduce vm.swappiness to prefer page cache reclaim, or move workloads to dedicated hosts with adequate headroom.
True LRU (Least Recently Used) requires tracking the last access time of every page in memory. On each memory access, software must update a timestamp or reorder a structure — at massive overhead, since memory accesses are the most frequent events in a running program. Hardware assist (cache replacement sensors) exists on some architectures but is imperfect.
The Clock algorithm (Second Chance) approximates LRU by using a referenced (accessed) bit per page. Pages are arranged in a circular list. A pointer sweeps the list; if a page's accessed bit is 0, it is evicted immediately. If it is 1, the bit is cleared and the page gets a "second chance" — the pointer moves on. Pages that have been recently accessed retain their accessed bit and survive multiple sweeps. This approximates LRU without per-access overhead: the accessed bit is updated by hardware on every memory access, but only checked during eviction.
Linux itself does not use textbook Clock or Clock-Pro. Its reclaim code keeps active and inactive LRU lists and uses the accessed bit to decide which pages to promote, demote, or evict; kernels since 6.1 can optionally use the Multi-Gen LRU.
The CPU generates a page fault because the page table entry shows the page as not present. The OS page fault handler examines the faulting address to verify the access is valid (within a valid VMA and respecting protection bits). It then locates the page in the swap device using the swap entry stored in the non-present PTE. The handler allocates a physical frame, initiates a disk I/O read to load the page from swap, and waits for completion. Meanwhile, other processes can run. Once the page is loaded, the PTE is updated with the frame number, the present bit is set, and the instruction is retried, transparently resuming the process as if the page had always been in RAM.
The key insight is that the faulting instruction is restartable: x86 reports a page fault before the faulting instruction completes and saves the address of that instruction, so the handler can return to it and re-execute it once the page is resident.
The working set W(t, delta) is the set of pages referenced by a process during the time interval (t minus delta, t). It is the pages the process actively needs — those accessed within the last delta time units. Pages outside the working set can be evicted without significantly impacting the process's hit rate.
The working set model matters because it quantifies how much physical memory a process truly needs versus how much it has allocated. If the sum of all processes' working sets exceeds physical memory, thrashing is inevitable. OSes approximate the working set rather than tracking it exactly (Linux's active and inactive LRU lists serve this purpose) and use the approximation to prioritize which pages to keep resident. Monitoring the working set size of a database or JVM over time reveals whether its heap size is appropriate or excessive.
Database servers (PostgreSQL, MySQL, Oracle) manage their own buffer pool and typically have a well-tuned working set that fits in physical memory. If swap is enabled, the OS may begin evicting database buffer pool pages under memory pressure — even when the database's own internal eviction policy would have made a smarter choice. This causes database performance to degrade gracefully (slowly) rather than fail fast. By disabling swap, you force the OOM killer to make hard decisions: when physical memory is exhausted, some process gets killed immediately. This is preferable to slow, unpredictable swap thrashing. Swap can also introduce latency unpredictability — a page fault on a spinlock held by a critical thread could cause a millisecond-scale pause that violates real-time SLAs in trading or control systems.
Anonymous pages are process heap, stack, and data pages that have no filesystem backing; once evicted, they are backed by swap space. When evicted, they must be written to the swap device. File-backed pages are mapped from files (e.g., code, mmap'd data); when evicted, they can be dropped immediately if clean (matching the file on disk) or written back to the filesystem if dirty. The difference matters for page replacement: clean file-backed pages are the cheapest to evict because they can simply be dropped and re-read from the file later, while anonymous pages must be written to swap before their frames can be reused. The vm.swappiness setting shifts the balance between the two. Shmem (tmpfs) pages are an in-between case: they look file-backed to the kernel but have no disk file behind them, so they are swap-backed like anonymous pages.
The Linux OOM killer uses an oom_score_adj value (-1000 to +1000) per process to influence selection. Higher values increase the likelihood of being killed; lower values (including negative) make a process more protected, and -1000 exempts it entirely. The kernel calculates a badness score from the process's memory footprint (resident set size, swap usage, and page table size) and then adjusts it by oom_score_adj. When physical memory is exhausted, the killer traverses all eligible processes and selects the one with the highest score, typically the process using the most memory.
Containers add complexity: in cgroup v2, setting memory.oom.group makes the OOM killer treat the cgroup as a unit and kill all of its processes together. You can shape per-container behavior with memory.high, which throttles the cgroup and forces reclaim before the hard memory.max limit is hit, and with memory.min and memory.low, which protect a cgroup's memory from being reclaimed.
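A rough sketch of how oom_score_adj shifts victim selection is shown below. It is loosely modeled on the kernel's badness heuristic, and the exact arithmetic varies across kernel versions, so treat the formula as an assumption. All function names, process names, and numbers are hypothetical.

```python
from typing import Optional


def oom_badness(rss_pages: int, oom_score_adj: int, total_pages: int,
                swap_pages: int = 0, pgtable_pages: int = 0) -> Optional[int]:
    """Simplified badness score; None means the process is exempt."""
    if oom_score_adj == -1000:
        return None
    points = rss_pages + swap_pages + pgtable_pages   # memory footprint
    points += oom_score_adj * (total_pages // 1000)   # administrator's bias
    return points


def pick_victim(procs: dict[str, dict], total_pages: int) -> Optional[str]:
    """Return the name of the process with the highest badness score."""
    scored = {}
    for name, params in procs.items():
        score = oom_badness(total_pages=total_pages, **params)
        if score is not None:
            scored[name] = score
    return max(scored, key=scored.get) if scored else None


# Made-up processes on a machine with 1,000,000 pages of RAM.
procs = {
    "db":    {"rss_pages": 500_000, "oom_score_adj": -500},
    "web":   {"rss_pages": 200_000, "oom_score_adj": 0},
    "batch": {"rss_pages": 100_000, "oom_score_adj": 300},
    "init":  {"rss_pages": 900_000, "oom_score_adj": -1000},
}
print(pick_victim(procs, total_pages=1_000_000))  # batch
```

In this model each oom_score_adj point is worth one thousandth of total memory, so the small batch job with +300 outranks the much larger database that was protected with -500.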
vmalloc is a kernel-space allocator: it returns memory that is virtually contiguous but may be physically scattered, mapped in the kernel's vmalloc area (about 128 MB by default on 32-bit x86, tens of terabytes on x86-64). It is suitable for large kernel buffers (multi-MB) that don't require physical contiguity. mmap with MAP_ANONYMOUS is the user-space counterpart: it reserves a virtually contiguous range in the calling process's address space. Both return virtually contiguous memory, and both are ultimately backed by individual pages from the buddy allocator. The key difference: vmalloc memory is not part of the kernel's direct-mapped physical range, so it needs its own page table entries, set up at allocation time, while anonymous mmap pages are mapped lazily on first touch. For DMA or I/O buffer allocations requiring physical contiguity, neither is suitable; you need alloc_pages() (buddy system) with a high-order allocation. For very large user-space allocations that don't need contiguity, mmap with MAP_ANONYMOUS is typically preferred for its simplicity.
The WSClock algorithm combines the clock (second chance) algorithm with per-page timestamps (working set model). Each page has a reference bit and a timestamp of the last reference time. The clock hand scans pages: if a page's reference bit is set, it is cleared and the timestamp updated; if not set, the page's age (time since last reference) is compared to the working set window. Pages older than the window are evicted. This approaches the accuracy of true LRU timestamps with the efficiency of the clock algorithm's accessed-bit updates (hardware-managed, no per-access software overhead). Linux does not use WSClock; its active and inactive LRU lists rely on accessed bits alone to separate hot pages from cold ones, which avoids the cost of per-page timestamps.
Memory overcommit occurs when the sum of all processes' virtual memory allocations exceeds physical RAM + swap. Linux's default behavior (overcommit mode 0, heuristic) allows most allocations and refuses only obvious overcommits; it does not reserve physical pages at allocation time. This is why malloc can succeed for far more memory than is currently free, and with vm.overcommit_memory=1 even for a 100 GB allocation on a 16 GB machine without swap: the physical pages are only allocated when the process actually touches the pages (demand paging). Overcommit also makes fork()+COW practical, where the child shares all pages with the parent; under strict accounting, fork() of a process with a large memory footprint can fail. When physical memory is exhausted, the OOM killer selects and kills a process. Setting vm.overcommit_memory=0 selects heuristic overcommit (the default); =1 always allows overcommit; =2 enforces strict accounting and denies allocations that exceed the commit limit.
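The limit enforced in mode 2 follows a simple formula from the kernel documentation: swap plus a percentage (vm.overcommit_ratio, 50 by default) of RAM that is not reserved for huge pages. The sketch below assumes vm.overcommit_kbytes is unset; the function names are hypothetical.

```python
def commit_limit_kb(ram_kb: int, swap_kb: int,
                    overcommit_ratio: int = 50, hugetlb_kb: int = 0) -> int:
    """CommitLimit for vm.overcommit_memory=2, in kB."""
    return (ram_kb - hugetlb_kb) * overcommit_ratio // 100 + swap_kb


def allocation_allowed(request_kb: int, committed_kb: int, limit_kb: int) -> bool:
    """Strict accounting: refuse anything that would exceed the limit."""
    return committed_kb + request_kb <= limit_kb


GIB_KB = 1024 * 1024
limit = commit_limit_kb(ram_kb=16 * GIB_KB, swap_kb=4 * GIB_KB)
print(limit // GIB_KB, "GiB commit limit")                 # 12
print(allocation_allowed(100 * GIB_KB, 2 * GIB_KB, limit))  # False
print(allocation_allowed(8 * GIB_KB, 2 * GIB_KB, limit))    # True
```

With the default ratio, a 16 GiB machine with 4 GiB of swap refuses to commit more than 12 GiB in total, which is why strict mode usually needs a higher ratio or more swap on hosts that run large processes.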
A major page fault occurs when the page must be read from swap (or a memory-mapped file on disk); this requires actual disk I/O and takes milliseconds on spinning disks. A minor page fault occurs when the fault can be resolved without disk I/O, for example the first touch of a zero-filled anonymous page, a file page that is already in the page cache, or a copy-on-write fault after fork(); it needs only a page table update and possibly a page copy. The majflt column in ps shows only genuine disk reads. The minflt column shows copy-on-write faults and other non-disk page faults. A process can accumulate millions of minflt (e.g., fork-heavy workloads) without serious performance impact, since each one costs microseconds rather than milliseconds. A sustained high majflt rate after warm-up usually indicates memory pressure: either the process's working set exceeds RAM or there is a memory leak causing gradual exhaustion.
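A process can read its own fault counters through getrusage(), which Python exposes in the Unix-only resource module. The sketch below touches freshly allocated memory and reports how the counters moved; the helper names are hypothetical, the 4 KB page size is an assumption, and the exact deltas depend on the allocator, the kernel, and whether huge pages are in use.

```python
import resource


def fault_counts() -> tuple[int, int]:
    """Return (minor, major) page fault counts for this process so far."""
    usage = resource.getrusage(resource.RUSAGE_SELF)
    return usage.ru_minflt, usage.ru_majflt


def touch_fresh_memory(size: int, page_size: int = 4096) -> tuple[int, int]:
    """Write one byte per page of a new buffer; return (minor, major) deltas."""
    minor_before, major_before = fault_counts()
    buf = bytearray(size)
    for offset in range(0, size, page_size):
        buf[offset] = 1
    minor_after, major_after = fault_counts()
    return minor_after - minor_before, major_after - major_before


minor, major = touch_fresh_memory(64 * 1024 * 1024)
print(f"minor faults: {minor}, major faults: {major}")
```

On a machine with free memory, the run typically shows many minor faults and few or no major ones, because first-touch faults on fresh memory need no disk I/O.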
Memory compaction (Linux 2.6.35+) is the kernel's mechanism for de-fragmenting physical memory to create large contiguous blocks for huge page allocations. Within each memory zone, a migration scanner collects movable pages from one end while a free scanner collects free pages from the other, and the kernel migrates the movable pages (using the page migration mechanism) so that free pages coalesce into larger blocks. This is important because huge page allocations (2 MB or 1 GB) require physically contiguous memory; without compaction, a system with 1 GB free but fragmented across 4 KB chunks cannot satisfy a 2 MB huge page request. Compaction runs in the background in the kcompactd kernel threads and can also run directly in the allocation path; khugepaged is the separate thread that collapses small pages into transparent huge pages. Compaction is CPU-intensive and can cause latency spikes, which is why some database administrators disable transparent huge pages.
In modern Linux, the page cache is the universal disk cache, unified for both files and block devices. The legacy buffer cache managed individual disk blocks and has been subsumed by the page cache. What you see as "Buffers" in the free command output is the part of the page cache that holds raw block device data, mostly filesystem metadata (superblocks, inodes, bitmaps). The page cache caches file content in 4 KB pages; the VFS layer maps files to these pages. When you read a file, the page cache is checked first; if the page is present, the read is served from RAM at DRAM speed. The shmem filesystem creates tmpfs pages that live entirely in the page cache backed by swap. The kernel also maintains an active_list and inactive_list to implement a simplified LRU for the page cache.
madvise(MADV_WILLNEED) hints to the kernel that a memory region will be accessed soon, triggering asynchronous readahead: the kernel prefetches pages from disk into the page cache before the application requests them. This is useful for sequential access patterns on large files: calling madvise(buf, len, MADV_WILLNEED) before reading causes the kernel to issue disk I/O proactively, reducing page fault latency during the read loop. madvise(MADV_DONTNEED) hints that the pages in the region are no longer needed, so the kernel can reclaim the physical frames immediately without writing to swap, even if the pages are dirty. This is used by some malloc implementations to return unused memory to the OS. On Linux, madvise(MADV_DONTNEED) actually discards the pages of a private anonymous mapping; accessing them again causes a page fault and returns zero-filled pages (not the old content). This differs from posix_madvise(POSIX_MADV_DONTNEED), which is purely advisory and preserves the content.
When pages containing sensitive data (passwords, cryptographic keys, private data) are swapped out to disk, they reside in plaintext on the swap device, which is a security risk if an attacker gains physical access or reads the swap partition after a crash. Encrypted swap (LUKS partition or cryptsetup + swap) ensures swapped pages are ciphertext on disk, so physical access alone does not expose sensitive data. With a random key generated at each boot, the key exists only in RAM and is lost on power cycle, so swap contents are unrecoverable after shutdown. Additionally, mlock() / mlockall() can lock specific pages into RAM, preventing them from being swapped out entirely, which is useful for security-critical data that must never hit disk. Hardware security modules (HSMs) and Intel SGX enclaves provide the strongest guarantees by keeping sensitive data in protected memory that the OS cannot read in plaintext.
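The Linux-specific behavior of MADV_DONTNEED can be observed from Python 3.8+, whose mmap objects expose a madvise() method. The sketch below assumes a private anonymous mapping; on non-Linux systems the constant may be missing or the old content may survive. dontneed_demo and IS_LINUX are hypothetical names.

```python
import mmap
import sys

IS_LINUX = sys.platform.startswith("linux")


def dontneed_demo(length: int = mmap.PAGESIZE) -> tuple[int, int]:
    """Write a byte, discard the page, and read the byte back.

    Returns (value_before, value_after).
    """
    region = mmap.mmap(-1, length, flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS)
    try:
        region[0] = 0xAB
        before = region[0]
        region.madvise(mmap.MADV_DONTNEED)  # give the frames back to the kernel
        after = region[0]                   # faults in a fresh page
        return before, after
    finally:
        region.close()


before, after = dontneed_demo()
print(hex(before), hex(after))  # on Linux: 0xab 0x0
```

The mapping itself stays valid after the call; only the contents are gone. That is what lets an allocator hand memory back to the OS without unmapping and remapping the range.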
Transparent huge pages (THP) allow the kernel to automatically coalesce 4 KB pages into 2 MB huge pages for anonymous memory (heap, mmap) without application involvement. Benefits: reduced TLB pressure, fewer page table entries, less memory overhead for large working sets. Drawbacks: memory compaction (run periodically in background) can cause latency spikes of several milliseconds on latency-sensitive workloads. Internal fragmentation increases (wasted space within huge pages). THP works best with sequentially-accessed anonymous memory (e.g., malloc'd arrays, JVM heaps). Databases and latency-sensitive services often disable THP (echo never > /sys/kernel/mm/transparent_hugepage/enabled) to prevent compaction pauses and to use explicit huge pages with mmap(MAP_HUGETLB). PostgreSQL reports significant pause time reductions when THP is disabled on some workloads.
Linux lets each swap area carry a priority, and higher-priority swap is filled before lower-priority swap. The kernel does not measure device speed (SSD vs HDD) to choose these values: the administrator sets them with the pri= option in /etc/fstab or with swapon -p, and areas activated without an explicit priority receive decreasing negative priorities in activation order, so earlier areas are preferred. Areas that share a priority are used round-robin. On a multi-node NUMA system with a local SSD swap device on each node, recent kernels can additionally prefer the swap device attached to the node doing the swap-out, which keeps swap I/O on lower-latency local storage. Swap-in has no such choice: a page is read back from whichever device it was written to, regardless of which node the process is running on at that point. The si and so columns in vmstat 1 show the amount of memory swapped in from and out to disk per second.
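The resulting fill order can be read off /proc/swaps, whose last column is the priority. The sketch below parses text in that format and groups areas by priority, highest first. The sample contents and device names are invented for illustration; on a real system you would pass in the text of /proc/swaps instead.

```python
from itertools import groupby

# Illustrative sample in the /proc/swaps column layout (not real output).
SAMPLE = """\
Filename        Type        Size      Used    Priority
/dev/nvme0n1p3  partition   8388604   1024    10
/dev/sdb1       partition   4194300   0       10
/swapfile       file        2097148   0       -2
"""


def parse_swaps(text):
    """Parse /proc/swaps-style text into a list of dicts (header skipped)."""
    areas = []
    for line in text.splitlines()[1:]:
        parts = line.split()
        if len(parts) != 5:
            continue
        name, kind, size, used, prio = parts
        areas.append({
            "name": name,
            "type": kind,
            "size_kib": int(size),
            "used_kib": int(used),
            "priority": int(prio),
        })
    return areas


def fill_order(areas):
    """Group areas by priority, highest first.

    Each inner list is one priority level; its members share the load
    round-robin before the next level is touched.
    """
    ordered = sorted(areas, key=lambda a: -a["priority"])
    return [
        [a["name"] for a in group]
        for _, group in groupby(ordered, key=lambda a: a["priority"])
    ]


if __name__ == "__main__":
    print(fill_order(parse_swaps(SAMPLE)))
```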
A page frame reclaiming algorithm (PFRA) determines which physical frames to reclaim when the system needs free frames. Linux classically keeps two LRU lists for each kind of memory: frequently accessed pages sit on the active list and are demoted to the inactive list when they appear less active, and pages at the tail of the inactive list are the first candidates for eviction. (Kernels from 6.1 onward can optionally use the Multi-Gen LRU, which generalizes this to more than two generations.) The algorithm is not pure LRU: instead of timestamping every access, it samples the hardware referenced (accessed) bit while scanning, in the manner of a clock or second-chance algorithm, an approach often compared to Clock-Pro. The swappiness sysctl controls whether reclaim prefers file-backed page cache or anonymous pages (heap, stack): low values (~10-50) favor reclaiming file-backed pages and avoid swapping, while high values favor swapping out anonymous pages (default 60; range 0-200 on recent kernels). The vfs_cache_pressure sysctl is a separate knob: it controls how aggressively the kernel reclaims dentry and inode caches relative to other memory (default 100; lower values retain those caches longer). Within each list, ordering still matters: pages at the tail of the inactive list are the least recently referenced and are evicted first.
A VMA (Virtual Memory Area) is a contiguous range of virtual addresses with a uniform set of attributes (read/write/execute permissions, backing store). Each process has multiple VMAs: one for the code segment (r-x), one for the data segment (rw-), one for the heap (rw-), one for the stack (rw-), and additional VMAs for shared libraries, mmap regions, and thread stacks. The Linux kernel indexes VMAs by address in a balanced search structure: a red-black tree for many years, replaced by a maple tree in Linux 6.1. Ordinary accesses are translated by the MMU without kernel involvement; on a page fault, the kernel looks up which VMA contains the faulting address. If none does, it delivers SIGSEGV. If one does, it checks whether the access is permitted by the VMA's permissions before resolving the fault. Pages, on the other hand, are the unit of physical memory management: typically 4 KB of virtual and physical address space that can be individually mapped, swapped, and protected. A VMA spans many pages, but the VMA is a software structure; the pages are what actually get mapped to frames.
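The lookup-then-check sequence can be sketched in user space using the /proc/<pid>/maps line format, where each line describes one VMA. The sample text, the VMA tuple, and the helper names below are illustrative assumptions; the kernel's real structures are considerably richer. A sorted list with binary search stands in for the kernel's tree.

```python
import bisect
from typing import NamedTuple


class VMA(NamedTuple):
    start: int   # inclusive
    end: int     # exclusive
    perms: str   # e.g. "r-xp"
    path: str


# Illustrative sample in /proc/<pid>/maps format (addresses are made up).
SAMPLE_MAPS = """\
00400000-00452000 r-xp 00000000 08:02 173521 /usr/bin/demo
00651000-00652000 rw-p 00051000 08:02 173521 /usr/bin/demo
00e03000-00e24000 rw-p 00000000 00:00 0 [heap]
7ffc1a2b0000-7ffc1a2d1000 rw-p 00000000 00:00 0 [stack]
"""


def parse_maps(text):
    """Turn maps-style text into a list of VMAs sorted by start address."""
    vmas = []
    for line in text.splitlines():
        fields = line.split(None, 5)
        if len(fields) < 5:
            continue
        start, end = (int(x, 16) for x in fields[0].split("-"))
        path = fields[5].strip() if len(fields) > 5 else ""
        vmas.append(VMA(start, end, fields[1], path))
    return sorted(vmas)


def find_vma(vmas, addr):
    """Binary-search for the VMA containing addr, or None."""
    starts = [v.start for v in vmas]
    i = bisect.bisect_right(starts, addr) - 1
    if i >= 0 and addr < vmas[i].end:
        return vmas[i]
    return None


def check_access(vmas, addr, want):
    """Mimic the fault-time checks; want is 'r', 'w', or 'x'."""
    vma = find_vma(vmas, addr)
    if vma is None:
        return "SIGSEGV (no VMA)"
    if want not in vma.perms:
        return "SIGSEGV (permission)"
    return "ok"
```

For example, check_access(parse_maps(SAMPLE_MAPS), 0x00400100, "w") lands in the r-x code VMA and is rejected on permissions, while an address below every VMA is rejected because no VMA contains it.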
The OS uses VMA permissions and guard regions to enforce stack boundaries. Thread stacks created by a threading library typically end in a guard page: a page mapped with no permissions (read, write, and execute all cleared) at the end the stack grows toward, which is the low-address end on most architectures. For the main stack, Linux instead keeps an unmapped guard gap below the stack VMA. Either way, the guard is not backed by any physical frame, and touching it triggers a page fault. When the main stack needs to grow, the kernel's page fault handler checks whether the faulting address lies just beyond the existing stack VMA in the direction of growth, and whether the grown stack would stay within the maximum stack size (RLIMIT_STACK) and clear of neighboring mappings. If so, the handler extends the stack VMA to cover the address, maps a new physical page, and lets the access succeed. Otherwise it sends SIGSEGV, which kills the process unless a handler is installed on an alternate stack. This mechanism lets the stack grow on demand (lazy allocation) while catching genuine stack overflows before they corrupt adjacent memory regions such as the heap.
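The handler's decision can be written down as a small pure function. This is a simplified sketch of the checks described above for a downward-growing stack, not the kernel's actual code: the function name, the parameters, and the one-page default gap are assumptions chosen to keep the example short.

```python
PAGE_SIZE = 4096


def handle_stack_fault(fault_addr, stack_start, stack_end, rlimit_stack,
                       prev_end=0, guard_gap=PAGE_SIZE):
    """Decide the outcome of a fault near a downward-growing stack.

    stack_start: lowest mapped stack address (inclusive)
    stack_end:   highest stack address (exclusive)
    prev_end:    end of the nearest mapping below the stack
    Returns ("ok", start), ("grow", new_start), or ("SIGSEGV", reason).
    """
    if stack_start <= fault_addr < stack_end:
        return ("ok", stack_start)             # already inside the stack VMA
    if fault_addr >= stack_end:
        return ("SIGSEGV", "not a stack address")

    new_start = fault_addr - (fault_addr % PAGE_SIZE)   # page-align down
    if stack_end - new_start > rlimit_stack:
        return ("SIGSEGV", "exceeds RLIMIT_STACK")
    if new_start - guard_gap < prev_end:
        return ("SIGSEGV", "would collide with mapping below")
    return ("grow", new_start)


if __name__ == "__main__":
    end = 0x8000_0000
    start = end - 4 * PAGE_SIZE
    limit = 8 * 1024 * 1024
    print(handle_stack_fault(start - 8, start, end, limit))
    print(handle_stack_fault(end - 9 * 1024 * 1024, start, end, limit))
```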
Further Reading
Swap Space Configuration and Performance
Swap size guidelines:
- Traditional rule: RAM × 1.5 to 2.0
- For swap-averse workloads (databases, in-memory caches): 0 or minimal
- For memory overcommit scenarios: at least RAM × 0.5 as pressure valve
- For hibernation: at least RAM (Linux hibernate saves memory to swap)
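The guidelines above reduce to simple arithmetic on installed RAM. The sketch below encodes them as (minimum, maximum) ranges in GiB; the function and profile names are invented for illustration, None means the guideline gives no upper bound, and the "0 or minimal" case is represented as zero with the choice of a small non-zero size left to the operator.

```python
def suggested_swap_gib(ram_gib, profile):
    """Return (min_gib, max_gib) for a profile; max is None if unbounded."""
    rules = {
        "traditional": (1.5 * ram_gib, 2.0 * ram_gib),
        "swap_averse": (0.0, 0.0),          # "0 or minimal"
        "overcommit": (0.5 * ram_gib, None),
        "hibernation": (1.0 * ram_gib, None),
    }
    return rules[profile]


if __name__ == "__main__":
    for name in ("traditional", "swap_averse", "overcommit", "hibernation"):
        print(name, suggested_swap_gib(16, name))
```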
Swap performance on different storage:
| Storage | Swap Latency | Sequential Throughput | Random IOPS | Recommendation |
|---|---|---|---|---|
| NVMe SSD | ~100 μs | 3-7 GB/s | 100k-1M | Good for swap |
| SATA SSD | ~100 μs | ~0.5 GB/s (SATA III interface limit) | 10k-100k | Acceptable |
| HDD | ~10 ms | 100-200 MB/s | 100-500 | Avoid for active swap |
Swap area priorities (Linux swapon): Multiple swap files or partitions can each be given a priority. Higher-priority swap is used first, and areas with equal priority are used round-robin. Set the priority with the pri= option in /etc/fstab or with swapon --priority.
ZSwap: Compressed Swap Cache
ZSwap (Linux 3.11+) is a lightweight compressed cache that sits in front of a swap device. Instead of writing a page to disk right away, ZSwap compresses it and stores it in a memory pool. If the compressed pool fills up, the least-recently-used pages are decompressed and written out to the backing swap device.
Benefits:
- Reduces disk I/O for workloads with good compression ratios
- Improves performance for intermittent swap usage
- Especially effective for workloads where swapped pages are soon accessed again
Configuration: echo 1 > /sys/module/zswap/parameters/enabled (run as root; whether ZSwap is built in and enabled by default varies by distribution)
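The pool-then-writeback behavior can be illustrated with a toy model. This sketch uses zlib as a stand-in compressor and a dictionary as a stand-in for the swap device; the class name and its methods are hypothetical and only mirror the description above, not the kernel's implementation.

```python
import zlib
from collections import OrderedDict


class ToyZswap:
    """Compressed in-memory pool with LRU writeback to a fake 'disk'."""

    def __init__(self, pool_limit_bytes):
        self.pool_limit = pool_limit_bytes
        self.pool = OrderedDict()   # page_id -> compressed bytes, oldest first
        self.pool_bytes = 0
        self.disk = {}              # page_id -> raw bytes (swap device stand-in)

    def swap_out(self, page_id, data):
        """Compress a page into the pool, writing back old pages if full."""
        if page_id in self.pool:                  # replace a stale copy
            self.pool_bytes -= len(self.pool.pop(page_id))
        blob = zlib.compress(data)
        self.pool[page_id] = blob
        self.pool_bytes += len(blob)
        while self.pool_bytes > self.pool_limit and self.pool:
            old_id, old_blob = self.pool.popitem(last=False)   # LRU entry
            self.pool_bytes -= len(old_blob)
            self.disk[old_id] = zlib.decompress(old_blob)

    def swap_in(self, page_id):
        """Return (data, source) where source is 'pool' or 'disk'."""
        if page_id in self.pool:
            blob = self.pool.pop(page_id)
            self.pool_bytes -= len(blob)
            return zlib.decompress(blob), "pool"
        return self.disk.pop(page_id), "disk"
```

Pages that come back from the pool cost a decompression instead of a disk read, which is where the benefit for soon-reused pages comes from.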
Page Replacement Algorithm Variants
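Clock-Pro: An extension of the clock algorithm that tracks referenced-bit patterns across multiple clock hands to distinguish hot pages from cold ones. Pages that are referenced repeatedly keep showing set accessed bits and stay resident; pages that were accessed once and not again are evicted first. Linux's reclaim code draws on ideas from this family rather than implementing Clock-Pro exactly.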
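The basic single-hand clock (second-chance) algorithm that Clock-Pro builds on fits in a few lines. The sketch below is that simpler algorithm, not Clock-Pro itself and not Linux's code; the class name is illustrative, and it assumes the variant in which a newly loaded page starts with its referenced bit set.

```python
class Clock:
    """Single-hand clock (second-chance) page replacement."""

    def __init__(self, nframes):
        self.frames = [None] * nframes   # page held by each frame
        self.ref = [False] * nframes     # referenced bit per frame
        self.hand = 0
        self.where = {}                  # page -> frame index

    def access(self, page):
        """Touch a page. Returns True on a hit, False on a page fault."""
        if page in self.where:
            self.ref[self.where[page]] = True
            return True
        # Fault: sweep the hand, clearing referenced bits, until a frame
        # whose bit is already clear is found. That frame is the victim.
        while self.ref[self.hand]:
            self.ref[self.hand] = False
            self.hand = (self.hand + 1) % len(self.frames)
        victim = self.frames[self.hand]
        if victim is not None:
            del self.where[victim]
        self.frames[self.hand] = page
        self.ref[self.hand] = True
        self.where[page] = self.hand
        self.hand = (self.hand + 1) % len(self.frames)
        return False


if __name__ == "__main__":
    clock = Clock(3)
    for p in [1, 2, 3, 4, 2, 5]:
        print(p, "hit" if clock.access(p) else "fault")
    print("resident:", sorted(clock.where))
```

In the sequence 1, 2, 3, 4, 2, 5 with three frames, page 2 is re-referenced just before the last fault, so the hand clears its bit and moves on to evict page 3, even though page 2 was loaded earlier.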
Two-List Strategy: Linux maintains active and inactive page lists. A newly faulted-in page typically starts on the inactive list and is promoted to the active list only when it is referenced again; active pages that stop being referenced are demoted back to the inactive list and become candidates for eviction. This simple LRU approximation prevents a single reference from keeping a page permanently resident.
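A toy version of the two-list idea shows why a one-off scan does not push out a hot page. This sketch assumes the variant in which new pages start on the inactive list and are promoted on their second reference; the class name, the capacity parameters, and the promotion rule are simplifications for illustration.

```python
from collections import OrderedDict


class TwoListLRU:
    """Active/inactive lists; eviction comes from the inactive tail first."""

    def __init__(self, capacity, max_active):
        self.capacity = capacity
        self.max_active = max_active
        self.active = OrderedDict()     # oldest first, most recent last
        self.inactive = OrderedDict()

    def access(self, page):
        """Touch a page. Returns the evicted page, or None."""
        if page in self.active:
            self.active.move_to_end(page)
            return None
        if page in self.inactive:
            # Second reference: promote to the active list.
            del self.inactive[page]
            self.active[page] = True
            if len(self.active) > self.max_active:
                demoted, _ = self.active.popitem(last=False)
                self.inactive[demoted] = True
            return None
        # First reference: make room, then start on the inactive list.
        evicted = None
        if len(self.active) + len(self.inactive) >= self.capacity:
            source = self.inactive if self.inactive else self.active
            evicted, _ = source.popitem(last=False)
        self.inactive[page] = True
        return evicted


if __name__ == "__main__":
    cache = TwoListLRU(capacity=4, max_active=2)
    for p in ["hot", "hot", "s1", "s2", "s3", "s4", "s5"]:
        print(p, "evicts", cache.access(p))
    print("hot still resident:", "hot" in cache.active)
```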
Key Takeaways
- Virtual memory extends physical RAM by using disk as a backing store for evicted pages
- Demand paging loads pages only on first access, so a virtual allocation consumes no physical frames until it is used
- Page replacement algorithms (in Linux, a clock-style active/inactive list scheme related to Clock-Pro) approximate LRU without per-access overhead
- Thrashing occurs when total working sets exceed physical memory — the system spends time swapping instead of computing
- OOM killer is invoked when physical memory plus swap is exhausted
- Encrypted swap prevents physical-access attacks on the swap device from recovering swapped data
Conclusion
Virtual memory extends physical memory by using disk as a backing store for pages not currently in RAM. Demand paging loads pages only when accessed, so allocating 1 GB of virtual memory consumes no physical frames until the pages are actually used. This enables systems to run workloads larger than installed RAM without crashing.
Page replacement algorithms decide which physical pages to evict when RAM fills up. True LRU is too expensive to implement, so Linux uses a clock-style approximation related to Clock-Pro that separates hot pages from cold ones using accessed bits. Thrashing occurs when working sets collectively exceed physical memory, collapsing throughput as the system spends more time swapping than executing.
The OOM killer intervenes when physical memory plus swap is exhausted, choosing a process to terminate. For latency-sensitive workloads, administrators often disable swap entirely to force fast failure rather than slow degradation. Encrypted swap protects sensitive data from being recovered from the swap device if the system is powered down while pages reside in swap space.
For your next step, explore paging and page tables to understand the data structures that map virtual pages to physical frames, or process scheduling to see how the OS decides which runnable process gets CPU time when memory is not the bottleneck.