Paging & Page Tables

Understanding how paging divides memory into fixed-size pages, how page tables track virtual-to-physical mappings, and how the TLB accelerates address translation.

published: reading time: 35 min read author: GeekWorkBench

Paging & Page Tables

Paging is the mechanism by which the operating system takes the vast, unwieldy logical address space and carves it into manageable chunks called pages, while mapping those to equally uniform chunks of physical memory called frames. Without paging, a process requesting 500 KB of contiguous memory would fail if physical memory were fragmented across a thousand small gaps — even if the total free memory exceeded 500 KB. Paging eliminates that problem entirely by making the mapping virtual, not physical.

Page tables are the data structures the OS builds and the hardware consults to perform this mapping on every single memory access. Understanding paging and page tables is essential for understanding virtual memory, memory protection, and the performance characteristics of virtually every program you have ever run.

Introduction

When to Use / When Not to Use

Paging is a fundamental OS mechanism — it is always in use. You do not choose to use it; your hardware and OS cooperate to apply it whether you want it or not. However, you make choices about how your application interacts with it.

When understanding paging helps you:

  • Allocating large buffers with malloc — understanding page boundaries explains allocator alignment behavior
  • Memory-mapping files with mmap — understanding page faults reveals why first access to mmap’d regions can be slow
  • Debugging OOM kills or excessive page fault rates
  • Writing lock-free data structures where cache line alignment and false sharing are concerns
  • Database optimization (PostgreSQL huge pages, InnoDB buffer pool pages)

When you can ignore it:

  • Purely functional, managed-language applications where the runtime handles all memory
  • Applications where memory access patterns are uniformly distributed (no concentrated hot spots)

Architecture or Flow Diagram

Address translation under paging splits the logical address into three components: the page directory index, the page table index, and the offset within the page. The MMU walks this hierarchy before accessing DRAM.

flowchart TD
    LA["Logical Address<br/>31..22: PD Index<br/>21..12: PT Index<br/>11..00: Offset"]
    PD["Page Directory<br/>1024 entries<br/>(stored in physical memory)"]
    PT["Page Table<br/>1024 entries<br/>(stored in physical memory)"]
    PDE["Page Directory<br/>Entry<br/>+ Present bit<br/>+ Frame number"]
    PTE["Page Table<br/>Entry<br/>+ Present bit<br/>+ Frame number<br/>+ Access rights"]
    PF["Physical Frame<br/>Number"]
    PA["Physical Address<br/>= PF + Offset"]
    DRAM["Physical DRAM"]

    LA -->|"CR3 points to PD"| PD
    PD -->|indexed by<br/>PD Index| PDE
    PDE -->|"points to PT<br/>in physical memory"| PT
    PT -->|indexed by<br/>PT Index| PTE
    PTE -->|"frame number"| PF
    PF -->|appended with<br/>Offset| PA
    PA --> DRAM

    style LA stroke:#00fff9
    style PA stroke:#00fff9

On modern x86-64 systems, the page table hierarchy is four levels: PML4 (Page Map Level 4) → PDP (Page Directory Pointer) → PD (Page Directory) → PT (Page Table). The CR3 register points to the PML4 in physical memory.

Core Concepts

Pages and Frames

A page is a fixed-size block in the virtual address space — traditionally 4 KB on x86 and ARM. A frame is the equivalent block in physical memory. The page size is a power of two (almost always 4 KB for compatibility), which simplifies address splitting and indexing.

Pages can be in one of three states:

  • Present: The page is mapped to a physical frame and accessible
  • Not present (swapped): The page’s content has been moved to swap space on disk
  • Not present (unmapped): The page has never been allocated — accessing it causes a fault

The OS maintains a page table for each process (actually a multi-level tree) where each entry maps one virtual page to either a physical frame or a swap location.

Multi-Level Page Tables

A flat page table for a full 32-bit address space with 4 KB pages would require 2^20 entries (1,048,576 entries). At 4 bytes per entry, that is 4 MB per process — acceptable for a few processes but not for thousands. Worse, the entire table must be resident in physical memory even if the process uses only a small fraction of its address space.

Multi-level page tables solve this by adding a tree structure. Only the page table pages that are actually needed are allocated. Unused portions of the virtual address space consume no physical memory.

On x86-64 with 4-level paging (Linux default):

  • PML4 (Page Map Level 4): 512 entries, points to PDP pages
  • PDP (Page Directory Pointer): 512 entries per PML4 entry, points to PD pages
  • PD (Page Directory): 512 entries per PDP entry, points to PT pages
  • PT (Page Table): 512 entries per PD entry, maps to physical frames

This means a process using only 4 pages (16 KB) of virtual memory needs: 1 PML4 page + 1 PDP page + 1 PD page + 1 PT page = 4 × 4 KB = 16 KB of page tables instead of 4 MB.

Translation Lookaside Buffer (TLB)

The TLB is a hardware cache of recent virtual-to-physical address translations. Every memory access goes through the MMU, which first checks the TLB. On a TLB hit, the translation is instantaneous (one cycle). On a TLB miss, the MMU walks the page table hierarchy in physical memory — an expensive operation.

Modern TLBs:

  • x86 L1 D-TLB: 64 entries (some architectures, 4-way set associative)
  • x86 L1 I-TLB: 64 entries
  • Last-level TLB (L2 TLB): 512-1024 entries
  • AMD Zen 3: 512-entry L1 TLB, 4096-entry L2 TLB

When context switching between processes, the TLB entries for the old process are generally invalid for the new process. On hardware without PCID (Process Context ID), the entire TLB is flushed on every context switch. With PCID (supported on Intel Haswell+ and AMD Zen+), each TLB entry is tagged with an address space ID, so entries from different processes coexist in the TLB.

Page Fault Handling

When a process accesses a page that is not present (not in physical memory), the CPU triggers a page fault exception. The OS page fault handler executes this sequence:

  1. Identify the faulting address: The CPU stores it in the CR2 register (x86)
  2. Determine fault type: Is the page just not loaded, or is the access invalid (protection violation)?
  3. If valid and not present: Allocate a physical frame (from the free frame list or page cache)
  4. If swapped out: Locate the page on the swap device, schedule a disk I/O read
  5. Update the page table entry: Set the frame number and mark the page as present
  6. Resume the instruction: The CPU retries the faulting memory access

The critical subtlety is that the faulting instruction is restartable only if the memory access used a base register (like [rbx + rsi]) and not a computed displacement that would differ after resolution. The x86 ISA is designed to make all memory accesses restartable, which is a core invariant the OS relies upon.

Production Failure Scenarios

Thrashing: Excessive Page Faults

Failure: When the total working set of all running processes exceeds physical memory capacity, the OS spends more time swapping pages in and out than executing useful work. Each process generates page faults, triggering disk I/O, which slows all other processes, generating more memory pressure — a feedback loop that collapses throughput to near-zero.

Mitigation: Monitor pgscank and pgscand columns in vmstat 1. Tune the swappiness parameter (/proc/sys/vm/swappiness — lower values prefer reclaiming page cache over swapping process pages). Add physical RAM. Use cgroups to limit per-service memory. Use memory limits in container runtimes (docker --memory, Kubernetes requests/limits).

Excessive TLB Miss Rate on Large Working Sets

Failure: Applications with working sets that exceed the TLB coverage (e.g., a 512 MB working set on a system where each TLB entry maps only 4 KB) generate TLB misses on nearly every access. Each miss requires a multi-level page table walk, adding ~25-50 cycles per memory access — a 5-10x slowdown compared to TLB-hit access.

Mitigation: Use huge pages (2 MB or 1 GB pages) so one TLB entry covers more memory. Linux transparent huge pages (echo always > /sys/kernel/mm/transparent_hugepage/enabled) automate huge page usage for mmap’d regions. PostgreSQL’s huge_pages setting explicitly uses 2 MB pages for the buffer pool.

Page Table Memory Overhead for Large Mappings

Failure: Memory-mapping a 100 GB file with 4 KB pages requires 100 GB / 4 KB = 25,638,400 page table entries. At 8 bytes per entry (x86-64), that is ~200 MB of page table memory just to describe the mapping, even though no physical memory is allocated yet.

Mitigation: Use mmap(MAP_HUGETLB) to map with 2 MB pages, reducing page table entries by 512x. Filesystems like ext4 and XFS support huge pages. Some databases use a dedicated I/O path that bypasses the page cache entirely (e.g., O_DIRECT in Linux) to avoid double-buffering.

Trade-off Table

StrategyPage SizeTLB EfficiencyInternal FragmentationPage Table OverheadBest Use Case
4 KB pages (standard)4 KBLow (many entries needed)~2 KB avg per mappingHigh (many entries)General application workloads
2 MB huge pages2 MBHigh (one entry = 512× coverage)~1 MB avg per mapping512× smallerDatabases, JVM heaps, kernel memory
1 GB huge pages1 GBVery highLarge (GB-scale waste)MinimalVery large memory-mapped files
Multi-level page tables4 KBN/AN/AMinimal for sparse address spacesAll modern general-purpose OSes
Inverted page tablesVariableLower (hash lookup)N/AFixed size regardless of mappingsEmbedded systems, database engines

Implementation Snippets

Simulating Two-Level Page Table Lookup (Python)

#!/usr/bin/env python3
"""Simplified two-level page table simulation.
Demonstrates how address translation works:
  - VPN (Virtual Page Number) split into:
      PD_INDEX (top bits) + PT_INDEX (bottom bits)
  - Each level is an array of entries
  - PFN (Page Frame Number) + offset = Physical Address
"""

import struct

PAGE_BITS = 12          # 4 KB pages: 2^12 = 4096 bytes
PAGE_SIZE = 1 << PAGE_BITS
ADDR_BITS = 20          # Simplified 20-bit logical address space (1 MB range)

# How many bits for each level of the page table
PD_BITS = 10
PT_BITS = PAGE_BITS      # Remaining bits for PT index

PD_SIZE = 1 << PD_BITS   # 1024 entries
PT_SIZE = 1 << PT_BITS   # 1024 entries

class PageTableEntry:
    """Simulates a single page table entry (PTE)."""
    def __init__(self, present: bool = False, pfn: int = 0, readonly: bool = False):
        self.present = present
        self.pfn = pfn          # Physical Frame Number
        self.readonly = readonly

    def to_int(self) -> int:
        # Simplified PTE encoding
        val = (self.pfn << 12) | (0b1 if self.present else 0) | (0b10 if self.readonly else 0)
        return val

    @classmethod
    def from_int(cls, val: int):
        present = bool(val & 0b1)
        readonly = bool(val & 0b10)
        pfn = val >> 12
        return cls(present, pfn, readonly)

class PageDirectory:
    """Top-level page directory. Contains PT_BASE addresses."""
    def __init__(self):
        self.entries = [None] * PD_SIZE

    def set_pt_base(self, pd_index: int, pt_base_phys: int):
        self.entries[pd_index] = pt_base_phys

    def get_pt_base(self, pd_index: int):
        return self.entries[pd_index]

class PageTable:
    """Second-level page table. Contains PFN mappings."""
    def __init__(self):
        self.entries = [PageTableEntry() for _ in range(PT_SIZE)]

    def map_page(self, pt_index: int, pfn: int, readonly: bool = False):
        self.entries[pt_index] = PageTableEntry(present=True, pfn=pfn, readonly=readonly)

def translate(logical_addr: int, pd: PageDirectory, pt_pages: dict) -> int:
    """Translate a logical address to a physical address."""
    # Split VPN into PD index and PT index
    pd_index = (logical_addr >> (ADDR_BITS - PD_BITS)) & ((1 << PD_BITS) - 1)
    pt_index = (logical_addr >> PAGE_BITS) & ((1 << PT_BITS) - 1)
    offset   = logical_addr & (PAGE_SIZE - 1)

    # Lookup PD entry to find PT location
    pt_base = pd.get_pt_base(pd_index)
    if pt_base is None:
        raise PageFaultException(f"Page table not present at PD[{pd_index}]")

    # Get PTE from the page table
    pte = pt_pages[pt_base].entries[pt_index]
    if not pte.present:
        raise PageFaultException(f"Virtual page {pt_index} not present")

    # Compute physical address
    physical_addr = (pte.pfn << PAGE_BITS) | offset
    return physical_addr

class PageFaultException(Exception):
    pass

# Example: Map logical pages 0, 1, 2 to physical frames 0x1A, 0x2B, 0x3C
pd = PageDirectory()
# In real hardware, these would be physical addresses of the page tables themselves
PT_BASE = 0x1000
pt_pages = {}

for pd_index in range(2):  # Just PD[0] and PD[1]
    pt_phys = PT_BASE + pd_index
    pt_pages[pt_phys] = PageTable()
    pd.set_pt_base(pd_index, pt_phys)

# Map 3 virtual pages: PD[0]->PT[0..511], PD[1]->PT[0]
pt_pages[PT_BASE + 0].map_page(0, 0x1A)   # VPage 0 -> PFrame 0x1A
pt_pages[PT_BASE + 0].map_page(1, 0x2B)   # VPage 1 -> PFrame 0x2B
pt_pages[PT_BASE + 1].map_page(0, 0x3C)   # VPage 512 (PD[1],PT[0]) -> PFrame 0x3C

# Test translations
print("Logical addr 0x00000 -> Physical 0x{:05X}".format(translate(0x00000, pd, pt_pages)))
print("Logical addr 0x01000 -> Physical 0x{:05X}".format(translate(0x01000, pd, pt_pages)))
print("Logical addr 0x20000 -> Physical 0x{:05X}".format(translate(0x20000, pd, pt_pages)))

Examining Page Fault Types (bash)

#!/bin/bash
# Show page fault statistics for a target process
# Use after observing poor performance to identify page fault causes

PID=${1:-$$}
echo "=== Page fault analysis for PID $PID ==="

# Read page fault counts from /proc/PID/status
grep -E "VmPeak|VmSize|VmRSS|VmData|VmStk|VmExe|VmLib|PageFault" \
    /proc/$PID/status 2>/dev/null

echo ""
echo "=== Minor vs Major page faults over 2 samples ==="
for i in 1 2; do
    echo "--- Sample $i ---"
    cat /proc/$PID/status | grep -i "fault"
    sleep 1
done

echo ""
echo "=== Swap usage ==="
grep -i swap /proc/$PID/status 2>/dev/null || echo "No swap used or not found"

echo ""
echo "=== Memory map (first 10 entries) ==="
cat /proc/$PID/maps | head -10

Observability Checklist

  • Minor vs major page fault rate: ps -o pid,comm,majflt,minflt — high majflt = swap thrashing
  • TLB miss rate: perf stat -e dTLB-load-misses,dTLB-store-misses ./program
  • Hardware page table walker (HWPM) events: Architecture-specific PMU events (e.g., perf stat -e r01d1,r20d1 on Intel for page walker counters)
  • Number of page table pages allocated: grep -c "" /proc/PID/smaps (counts VMAs — not page table pages, but correlated)
  • Huge page usage: cat /proc/meminfo | grep -i huge
  • OOM killer invocation log: dmesg | grep -i "out of memory\|oom-killer"
  • Swappiness setting: cat /proc/sys/vm/swappiness — default 60; reduce for database servers
  • Page cache pressure: cat /proc/sys/vm/pagecache (older kernels) or cat /proc/sys/vm/vfs_cache_pressure

Common Pitfalls / Anti-Patterns

Page-level Access Control: Page table entries contain protection bits: read (R), write (W), execute (X), and user/supervisor (U/S). The OS sets these when allocating pages. User-space code cannot write to kernel-only pages (U=0), and code pages are marked non-writable (W=0). This is enforced by hardware on every memory access — no software check needed.

SMEP (Supervisor Mode Execution Prevention): If the U (user) bit is set in a page table entry, the kernel cannot access that page even in privileged mode. This prevents kernel exploits from dereferencing user-space pointers — even if an attacker hijacks kernel control flow and jumps to user-space shellcode, the kernel cannot execute it due to SMEP.

Page fault side-channel attacks: Meltdown exploited the timing difference between accessing a valid page versus accessing a faulting (invalid) page — the fault handler adds latency, but the speculative access still populates the cache. The fix (IBRS microcode) serializes the address translation path, ensuring no speculative memory access occurs before the protection bits are checked.

Cold boot attacks and memory forensics: RAM can retain data for seconds to minutes after power loss. Investigators can cold-boot a machine and image the physical memory to recover page tables, encryption keys, and process data. Full-disk encryption (FDE) with a TPM-bound key mitigates this for persistent data. For memory at rest, use memory encryption features like AMD SEV or Intel TME.

Common Pitfalls / Anti-patterns

Pitfall: Assuming that malloc returns zero-initialized memory. malloc returns uninitialized memory — whatever bits were in the physical frames are returned as-is. Use calloc if you need zeroed memory. This is not just a cleanliness issue — reading uninitialized memory from another process’s page frames is a security vulnerability called “uninitialized memory disclosure.”

Pitfall: Accessing mmap regions without understanding fault behavior. Memory-mapped regions are not backed by physical frames until first access. The first read or write to each page triggers a page fault, disk I/O (if from a file), and page table update. This is called demand paging. If you mmap a 10 GB file and then sequentially scan it, you get 10 GB / 4 KB = 2.6 million page faults. Use MAP_POPULATE (on Linux) to pre-fault pages and avoid the fault storm.

Pitfall: Using very large malloc without huge pages. A malloc(8GB) for an in-memory database working set causes massive page table overhead: 8 GB / 4 KB = 2 million entries, consuming ~16 MB of page table memory per process. Worse, every page of the working set requires a separate TLB entry — exhausting TLB entries and forcing constant page table walks. Allocating with 2 MB huge pages reduces this to 4,000 entries.

Anti-pattern: Writing code that assumes contiguous virtual memory. Assuming you can walk a 10 GB array with a simple pointer increment is fine in virtual terms. In physical terms, those pages are scattered across physical memory. But for TLB purposes, what matters is the virtual contiguity — one huge page TLB entry covers 2 MB of virtually contiguous space regardless of physical scatter.

Quick Recap Checklist

  • Paging divides the virtual address space into uniform 4 KB pages and physical memory into 4 KB frames
  • Page tables map virtual pages to physical frames; multi-level trees minimize memory overhead for sparse address spaces
  • The TLB caches recent translations to avoid the cost of page table walks on every memory access
  • A page fault triggers when a process accesses a not-present page; the OS loads it from disk or zero-fills it
  • Multi-level page tables (PML4→PDP→PD→PT on x86-64) reduce page table overhead from 4 MB to proportional sizes
  • Huge pages (2 MB or 1 GB) reduce TLB pressure and page table overhead for large working sets
  • Thrashing occurs when processes’ combined working sets exceed physical memory, causing constant page-in/page-out
  • Tools: vmstat 1 (swapping), perf stat -e dTLB-load-misses (TLB misses), /proc/PID/maps (VMA layout)
  • Page table entries contain protection bits (R/W/X) enforced by the MMU on every memory access
  • Copy-on-write (COW) defers physical page copies after fork() until either process writes — optimizing fork() cost

Interview Questions

1. What is the purpose of having multiple levels of page tables instead of a single flat page table?

A flat page table for a 32-bit address space with 4 KB pages requires 2^20 = 1,048,576 entries. At 4 bytes per entry, that is 4 MB per process — even if the process only uses a few pages. If you have 1000 processes, that is 4 GB of page table memory sitting in physical RAM, consuming space that could be user data or page cache. Multi-level page tables allocate page table pages only for the portions of the address space that are actually in use. A process using only 16 KB of memory needs a PML4 page + PDP page + PD page + PT page = 4 × 4 KB = 16 KB of page tables instead of 4 MB. The tradeoff is that address translation now requires multiple memory accesses (one per level) instead of one — but the TLB absorbs this for frequently accessed translations.

2. What is a page fault, and what is the difference between a minor page fault and a major page fault?

A page fault occurs when a running instruction tries to access a virtual page that is not currently resident in physical memory. The CPU raises an exception and the OS page fault handler resolves it. A minor (soft) page fault means the page is validly mapped but not yet loaded in physical memory — it exists clean in the page cache and just needs to be faulted in from RAM, no disk I/O. A major (hard) page fault means the page is not in physical memory at all — it was swapped to disk and must be read back from the swap device, incurring disk I/O latency (~milliseconds). The minflt counter is used for soft faults (e.g., copy-on-write fork), while majflt counts hard faults. A process with high majflt is experiencing real memory scarcity.

3. Why does context switching require flushing the TLB, and what is the modern alternative?

TLB entries are tagged with virtual addresses and address space identifiers. When process A runs, its TLB entries map virtual page X to physical frame Y. When process B runs on the same CPU core, it also has a virtual page X in its address space — but it maps to a different physical frame (or no frame at all). If the TLB is not flushed, process B would incorrectly translate addresses using process A's TLB entries, reading or writing wrong data. Without PCID (Process Context ID) support, the only safe approach is to flush the entire TLB on every context switch. With PCID (Intel Haswell+, AMD Zen+), each TLB entry is tagged with an Address Space ID, and the CPU selectively invalidates only entries matching the outgoing ASID. This dramatically reduces TLB miss rates for frequently-switched processes.

4. What is the translation lookaside buffer, and what is the performance impact of a TLB miss?

The TLB is a hardware cache of virtual-to-physical address translations maintained by the MMU. On a TLB hit, the translation completes in a single cycle — effectively zero overhead. On a TLB miss, the MMU must walk the multi-level page tables (which are themselves in physical memory) to find the mapping. Each level of the page table tree requires a physical memory read, adding ~4-8 cycles per level. On x86-64 with 4-level paging, a TLB miss costs 20-50 cycles of stall. Worse, the page table pages compete for cache space with program data. A workload that suffers 100% TLB miss rate on a large working set can slow down by 10x compared to a workload with 99% TLB hit rate, purely due to page table walking overhead.

5. How do huge pages improve performance for workloads like databases or JVM heaps?

A 1 GB JVM heap with standard 4 KB pages requires 262,144 TLB entries to fully cover. Modern CPUs have 512-4096 entries in their last-level TLB — so the JVM's working set exhausts TLB capacity, causing constant page table walks. Using 2 MB huge pages reduces the same 1 GB heap to 512 TLB entries — within even a small TLB. Each TLB entry covering 2 MB eliminates 511 page table walks for sequential or random access within that huge page. Additionally, huge pages reduce page table memory overhead and improve cache efficiency (fewer page table entries to walk). PostgreSQL reports 5-20% throughput improvement with huge_pages=on. The tradeoff is increased internal fragmentation (wasted space within each huge page allocation).

6. What are the different page sizes supported on x86-64 and when would you use each?

x86-64 supports 4 KB (standard pages), 2 MB (large pages / huge pages), and 1 GB (1 GB pages). Standard 4 KB pages provide fine-grained control and low internal fragmentation — appropriate for general application workloads. 2 MB huge pages reduce TLB pressure for large working sets (databases, JVMs) and reduce page table overhead — used via sysctl vm.nr_hugepages or mmap(MAP_HUGETLB). 1 GB pages are used in database engines or OS kernels for huge mappings (e.g., the entire framebuffer) and require contiguous physical memory — difficult to allocate on fragmented systems. The PML4 entry's PS bit enables direct mapping of 1 GB pages without traversing the full 4-level page table.

7. What is the purpose of the Accessed and Dirty bits in a page table entry?

The Accessed bit (A, bit 5) is set by the CPU hardware on every read or write access to a page — regardless of whether it hits or misses in the TLB. The OS reads this bit to implement page replacement algorithms (like LRU) by identifying pages that have been recently used. On x86, the bit is never cleared by hardware; the OS must clear it during page replacement scanning.

The Dirty bit (D, bit 6) is set by the CPU on every write to a page. It indicates whether the page's content differs from what's on disk (for file-mapped pages) or in swap (for anonymous pages). Before evicting a dirty page, the OS must write it back to its backing store — otherwise data is lost. Clean pages can be evicted without I/O. This is why copying to a write buffer before file I/O is important: it ensures the dirty bit is set and the OS writes the correct data to disk.

8. What is PCID and how does it improve context switch performance?

PCID (Process Context Identifier) is a feature on Intel Haswell+ and AMD Zen+ that tags each TLB entry with an address space ID (ASID). This allows the TLB to retain entries from multiple processes simultaneously. When context switching, instead of flushing the entire TLB (which costs ~100-1000 cycles for refill), the OS simply switches to the new PCID. Old entries remain in the TLB tagged with their ASID. When a TLB lookup occurs, the CPU only matches entries with the current PCID. Linux uses invpcid to selectively invalidate entries for a specific PCID without flushing entries from other processes. On systems without PCID, every context switch flushes the TLB — a significant overhead for frequently-switched processes or when TLB working set is large.

9. What is the difference between a global (G) TLB entry and a per-process TLB entry?

The Global bit (G, bit 8) in a PTE marks pages that are visible across all address spaces — primarily used for kernel text (the kernel code segment). When a context switch occurs, TLB entries with G=1 are NOT invalidated, because the kernel code is the same in every address space. Without the G bit, the OS would have to invalidate all kernel page TLB entries on every context switch — a significant overhead on many-core systems. User-space pages have G=0, so they are flushed on context switch (or with PCID, tagged with the outgoing ASID). Setting G=1 on user pages is incorrect and dangerous because it would allow the user to access those pages after switching to a different process that also has those physical pages.

10. How does demand paging reduce startup time and memory usage?

Demand paging loads pages into physical memory only when a process accesses them — not when they are allocated. When a process starts (after fork+exec), the OS creates page table entries for the entire address space but marks them all as not present. The first access to each page triggers a page fault. This means startup only pays I/O cost for the pages actually touched — not the entire executable. A 100 MB binary that touches only 5 MB at startup loads 5 MB, not 100 MB. Additionally, shared libraries loaded by multiple processes are loaded once and shared read-only via the page tables — physical memory holds one copy while each process has its own virtual mapping. This sharing is why fork() is so fast: the child shares all the parent's pages until it writes.

11. What is the performance cost of a TLB miss vs a cache miss?

A TLB miss and a cache miss have different costs: a TLB miss on x86-64 with 4-level paging costs 4-8 additional physical memory reads (one per page table level) — ~20-50 cycles of stall before the actual data can be accessed. A cache miss (e.g., L1 miss that hits L2) costs the L2 access time (~12 cycles). A cache miss that goes to L3 costs ~20-40 cycles. A cache miss that goes to DRAM costs ~100 ns = ~300 cycles at 3GHz. The critical difference is that TLB misses are sequential with the data access (you must do the page table walk before you can access the cache or memory), while out-of-order execution can hide some cache miss latency by running other instructions in parallel. TLB misses for a large working set accumulate because every memory access pays the page table walk cost; cache misses for a small working set are amortized by hits in faster levels.

12. What is an inverted page table and when is it used?

An inverted page table maintains one entry per physical frame (instead of one entry per virtual page). Each entry records which virtual page and address space owns that physical frame. This reduces memory overhead for systems with large physical memory but sparse virtual usage — a 64 GB system with 4 KB pages needs ~16 MB for an inverted page table (one entry per frame) vs a multi-level page table that might require 1-2 GB for a full process page table. Inverted page tables use hash lookups to find the entry for a given virtual address. They are used in some embedded systems, database engines (e.g., Oracle, DB2), and the SPARC architecture. The downside is that hash collisions add lookup latency and the table is a shared resource requiring locking in multi-processor systems.

13. What happens when a process accesses a page with the read-only bit set to 0 (read-only page)?

On x86, if a process tries to write to a page with the Read/Write bit (R/W, bit 1) set to 0, the CPU raises a write fault (general protection fault or page fault depending on mode). The OS page fault handler checks the PTE. If the page was supposed to be writable (e.g., a heap page), this indicates a software bug — the OS delivers SIGSEGV (segmentation fault) to the process. For memory-mapped files opened read-only, this is the correct behavior — writes are denied. For copy-on-write pages (shared with parent after fork), both parent and child initially have the page marked read-only; when either writes, the fault handler copies the page, marks the new copy as writable, and resumes — this is how COW fork() works without copying at fork time.

14. What is the relationship between page tables and the TLB during a context switch?

On a context switch, the OS loads a new value into CR3 (the physical address of the root page table). This alone does NOT flush the TLB — old TLB entries from the previous process remain, but they are tagged with the old ASID/PCID. When the new process accesses memory, the TLB lookup only matches entries with the current PCID, so old entries are silently ignored (they don't cause wrong translations). On hardware without PCID, the OS must explicitly flush the entire TLB via mov cr3, cr3 or INVLPG. This is a major cost on many context switches. With PCID enabled (Linux default since ~2015 on Intel Haswell+), the flush is selective: only entries for the outgoing PCID are invalidated via invpcid, and entries from the new process that happen to still be in the TLB from before remain valid.

15. What is a page table entry's PFN field and why is it critical for address translation?

The PFN (Page Frame Number) field (bits 12-51 in a 64-bit PTE) stores the upper bits of the physical frame number. Combined with the page offset (bits 0-11 from the original virtual address), it forms the complete physical address. For a 4 KB page, the offset is 12 bits, so the PFN stores bits 12+ of the physical address. The physical frame number multiplied by 4096 (left-shifted by 12) gives the base physical address of the frame. The MMU extracts the PFN from the PTE, shifts it left by 12, and ORs with the original offset to produce the physical address. When a page is swapped out, its PTE stores the swap offset (not a PFN) — identified by a special bit pattern in the PFN field that indicates "not in RAM."

16. How does the OS choose which physical frame to allocate when servicing a page fault?

The OS uses a page replacement algorithm to select a victim frame from the free frame list or the inactive page list. Linux uses a variant of LRU called Clock-Pro (or the two-list strategy: active list + inactive list) to approximate LRU without the high overhead of scanning all pages. When a frame is selected, the OS checks its Dirty bit — if dirty, the page is written to its backing store (swap or file) before the frame is reused. The page table entry for the evicted virtual page is then updated to reflect that the page is no longer resident. The newly freed frame is assigned to the faulting process and marked present in its page table. The choice of victim matters for performance: a frame still in active use that gets evicted will cause another fault shortly after.

17. What is the difference between anonymous pages and file-backed pages in terms of page replacement behavior?

Anonymous pages are process heap and stack pages — they have no file-backed origin, only swap space as their persistence layer. When evicted, they must be written to swap if dirty. File-backed pages (memory-mapped files) are backed by the filesystem — when evicted, the OS can simply drop them if clean (matching the on-disk file), or write them back if dirty (via the page cache). This makes file-backed pages cheaper to reclaim: clean file pages can be evicted without any I/O, while anonymous pages always require swap writes unless they're clean zero pages. The MAP_ANONYMOUS flag creates anonymous mappings; mmap("/file") creates file-backed mappings. The reclaim priority for anonymous pages is controlled by the swappiness parameter.

18. Why does multi-level paging cause higher TLB miss rates for sparse address access patterns?

With multi-level page tables, accessing a sparsely distributed set of virtual pages requires that multiple levels of page table pages be resident in physical memory. If the page directory page itself gets evicted, any access to any page in that subtree causes a page table page fault before the actual data page fault can be serviced. This is called a "page table page fault" and it adds an extra I/O on top of the data access. For a working set that spans many disjoint memory regions (common in sparse matrix operations, certain hash table implementations, or debug builds with large address spaces), the intermediate page table pages may not fit in physical memory alongside the actual data pages — creating a cascade of page table misses that amplify the cost of each data access by 2-4x.

19. What is a page table entry use-after-free vulnerability and how does the OS mitigate it?

In a use-after-free vulnerability related to page tables, a malicious or buggy process can retain a reference to a physical frame after the OS has freed it (e.g., after munmap or process exit). If the OS reallocates that same physical frame to a different process before the old page table entries are fully cleared, the first process could potentially read or write data belonging to the second process through stale TLB entries or mappings. Mitigation includes: clearing PTE entries immediately when a virtual address range is unmapped (via madvise(MADV_DONTNEED) or munmap), using ASID/PCID tags to invalidate stale TLB entries, and in modern kernels, immediately unlinking the page from the victim's address space so the physical frame goes back to the free list with its page tables cleared. The kernel also poisons freed pages with known patterns to detect residual references.

20. How do 1 GB huge pages work at the hardware level and what are the allocation constraints?

On x86-64, a 1 GB page is enabled by setting the Page Size (PS) bit in a PML4 entry (which directly maps a 1 GB range) or a PDP entry (which maps 512 GB). When PS=1 at the PML4 level, the entry directly provides the upper 30 bits of the physical frame number, bypassing all lower page table levels — so only one TLB entry covers 1 GB. This eliminates four levels of page table walks entirely for that range. Allocation of 1 GB pages requires that the OS find 1 GB of physically contiguous memory — on a fragmented system, this can fail even when tens of gigabytes of total free memory exist. This is why 1 GB pages are typically pre-allocated at boot (via kernel parameter parameters like hugepagesz=1G hugepages=4) or by kernel modules at initialization. They are impractical for dynamic allocation in production workloads.

Further Reading

x86-64 Paging Deep Dive

On x86-64, the paging structure is four levels:

PML4 (Page Map Level 4)

  • 512 entries, each pointing to a PDP page
  • CR3 register points to the PML4 (physical address)
  • Bit 47:0 used for address translation on 48-bit VA systems

PDP (Page Directory Pointer)

  • 512 entries per PML4 entry, each pointing to a PD page

PD (Page Directory)

  • 512 entries per PDP entry, each pointing to a PT page

PT (Page Table)

  • 512 entries per PD, each maps to a 4 KB physical frame

For a 48-bit virtual address (256 TB user space):

  • Bits 63:48 are sign-extended bits 47
  • Bits 47:39 select PML4 entry
  • Bits 38:30 select PDP entry
  • Bits 29:21 select PD entry
  • Bits 20:12 select PT entry
  • Bits 11:0 are the page offset

Huge pages: PML4, PDP, and PD entries can have their “PS” (page size) bit set, mapping 1 GB, 2 MB, or 512 GB ranges directly without going through lower levels.

Page Table Entry Fields (x86-64)

FieldBitsPurpose
P (Present)0Page is in physical memory
RW1Read/Write (0=read-only, 1=read-write)
US2User/Supervisor (0=kernel only, 1=user accessible)
PWT3Page Write-Through
PCD4Page Cache Disable
A5Accessed (set by hardware on access)
D6Dirty (set by hardware on write)
PS7Page Size (1=large page, 0=normal)
G8Global (not flushed on context switch if CR4.PGE=1)
Available9-11OS-defined
PFN12-51Physical Frame Number

TLB Shootdown Optimization

On context switches, if the incoming process uses the same ASID (PCID), the TLB entries from the previous process are still valid. Linux uses invpcid to invalidate only entries matching the target ASID rather than flushing the entire TLB.

When PCID is not available (older hardware), the kernel flushes the TLB via mov cr3, cr3 or INVLPG for individual pages.

Key Takeaways

  • Paging divides virtual and physical memory into uniform 4 KB pages (or larger for huge pages)
  • Multi-level page tables (PML4 → PDP → PD → PT on x86-64) allocate memory proportional to actual usage
  • TLB caches translations — TLB misses trigger expensive page table walks
  • Page faults are restartable — the OS saves the faulting address in CR2 and preserves the instruction pointer
  • Huge pages reduce TLB pressure and page table overhead for large working sets

Conclusion

Paging is the mechanism that makes virtual memory possible — dividing both logical and physical memory into uniform 4 KB pages (or larger for huge pages) allows the OS to map any virtual page to any physical frame, eliminating external fragmentation and enabling memory protection at page granularity.

Multi-level page tables (PML4→PDP→PD→PT on x86-64) minimize the memory overhead of page tables for sparse address spaces. A process using only a few pages of virtual memory needs just a handful of page table pages, not the 4 MB a flat page table would require. The TLB caches recent translations to avoid the cost of page table walks on every memory access — TLB misses cost 20-50 cycles while TLB hits complete in a single cycle.

Page faults trigger when a process accesses a not-present page. The OS resolves minor faults by allocating a frame and updating the page table; major faults require disk I/O to load pages from swap. Copy-on-write optimizes fork() by deferring physical memory copies until either process writes.

For your next step, explore virtual memory to understand how the OS uses swap space as a backing store when physical memory is exhausted, or process scheduling to see how the OS decides which process gets CPU time when multiple processes compete for resources.

Category

Related Posts

ASLR & Stack Protection

Address Space Layout Randomization, stack canaries, and exploit mitigation techniques

#operating-systems #aslr-stack-protection #computer-science

Assembly Language Basics: Writing Code the CPU Understands

Learn to read and write simple programs in x86 and ARM assembly, understanding registers, instructions, and the art of thinking in low-level operations.

#operating-systems #assembly-language-basics #computer-science

Boolean Logic & Gates

Understanding AND, OR, NOT gates and how they combine into arithmetic logic units — the building blocks of every processor.

#operating-systems #boolean-logic-gates #computer-science