Buffer Cache & Page Cache

How operating systems cache disk blocks in RAM, the difference between write-back and write-through caching, and how sync semantics work.

published: May 19, 2026 reading time: 32 min read author: GeekWorkBench

Introduction

Every read from disk and every write to disk passes through RAM. This is not an architectural necessity (the OS could go to the device on every file access) but a deliberate design choice. Memory is orders of magnitude faster than storage, and disk traffic is a common bottleneck in I/O-bound workloads. The solution: keep recently used disk blocks in RAM, where the CPU can reach them in nanoseconds rather than the milliseconds a spinning disk needs.

The buffer cache (historically) and page cache (modern Linux) are the OS mechanisms that implement this caching layer. When you read a file, the OS checks if the blocks are already in cache. If they are, no disk I/O occurs—pure memory access. If not, the OS reads from disk, stores the blocks in cache for future use, then delivers the data. For writes, the OS can either write-through (sync to disk immediately) or write-back (buffer in RAM and flush to disk later).
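The hit/miss flow described above can be sketched with a toy block cache. This is a user-space model only, not how the kernel implements it: `BlockCache`, `fake_disk`, and the plain LRU eviction policy are assumptions made for illustration.

```python
from collections import OrderedDict


class BlockCache:
    """Toy read cache: block number -> bytes, evicting the least recently used."""

    def __init__(self, capacity, read_block):
        self.capacity = capacity
        self.read_block = read_block      # hypothetical "disk read" callback
        self.blocks = OrderedDict()
        self.hits = 0
        self.misses = 0

    def read(self, blockno):
        if blockno in self.blocks:
            self.hits += 1
            self.blocks.move_to_end(blockno)    # now the most recently used
            return self.blocks[blockno]
        self.misses += 1
        data = self.read_block(blockno)         # slow path: go to the "disk"
        self.blocks[blockno] = data
        if len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)     # evict the least recently used
        return data


def fake_disk(blockno):
    """Stand-in for a device read: returns a 4 KiB block."""
    return bytes([blockno % 256]) * 4096


cache = BlockCache(capacity=2, read_block=fake_disk)
for blockno in (1, 2, 1, 3, 2):
    cache.read(blockno)
print(cache.hits, cache.misses)
```

With a capacity of two blocks, only the second read of block 1 is a hit; block 2 has already been evicted by the time it is read again.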

Understanding caching mechanics is essential for database administrators tuning performance, developers optimizing I/O patterns, and system administrators diagnosing memory pressure.

When to Use / When Not to Use

When Page Cache Is Your Friend

Repeated reads: If your application reads the same data multiple times, the second read is nearly free
Write-heavy workloads: Write-back caching batches multiple small writes into sequential disk operations
Boot and application startup: Frequently-used libraries and configuration files stay cached after first use
Database workloads: PostgreSQL relies on the page cache alongside its own shared buffers; MySQL/InnoDB benefits from it too unless configured to use O_DIRECT

When Caching Causes Problems

Database with its own cache: Using both OS page cache and database buffer pool wastes memory; consider O_DIRECT
Durability requirements: With write-back, a successful write() does not mean the data is on disk; use O_SYNC or fsync()
Memory-constrained systems: Large page cache can cause swapping if not properly sized
Predictable latency: Buffered I/O introduces variable latency depending on cache hits

Write Policy Selection

Policy	When Data Is Durable	Performance	Risk
Write-through	Immediately on write	Lower (waits for disk)	Minimal (no data loss on crash)
Write-back	On flush/sync	Higher (returns immediately)	Data loss possible if crash before flush
O_DIRECT	When the device acknowledges the write (fsync() still needed for metadata and volatile device caches)	Depends on usage	Application must manage durability and alignment
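The trade-off in the table can be made concrete with a toy model. This is a sketch under stated assumptions: `ToyDisk`, `WriteThroughCache`, and `WriteBackCache` are hypothetical classes, and "disk" here is just a dictionary with a write counter.

```python
class ToyDisk:
    """Counts how many block writes actually reach the 'device'."""

    def __init__(self):
        self.blocks = {}
        self.writes = 0

    def write(self, blockno, data):
        self.blocks[blockno] = data
        self.writes += 1


class WriteThroughCache:
    def __init__(self, disk):
        self.disk = disk
        self.cache = {}

    def write(self, blockno, data):
        self.cache[blockno] = data
        self.disk.write(blockno, data)      # on "disk" before write() returns


class WriteBackCache:
    def __init__(self, disk):
        self.disk = disk
        self.cache = {}
        self.dirty = set()

    def write(self, blockno, data):
        self.cache[blockno] = data
        self.dirty.add(blockno)             # only marked dirty, no disk I/O yet

    def flush(self):
        for blockno in sorted(self.dirty):  # one write per dirty block
            self.disk.write(blockno, self.cache[blockno])
        self.dirty.clear()


def run_demo():
    wt_disk, wb_disk = ToyDisk(), ToyDisk()
    wt, wb = WriteThroughCache(wt_disk), WriteBackCache(wb_disk)
    for i in range(10):                     # ten updates to the same block
        wt.write(7, b"v%d" % i)
        wb.write(7, b"v%d" % i)
    before_flush = wb_disk.writes           # what a crash here would have saved
    wb.flush()
    return wt_disk.writes, before_flush, wb_disk.writes
```

Ten updates to one block cost ten device writes with write-through and a single write with write-back, but nothing has reached the device until `flush()` runs. That gap is the durability risk in the table.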

Architecture or Flow Diagram

The following diagram shows how page cache integrates with the VFS and filesystem layers:

flowchart TB
    subgraph "Application"
        APP["Application"]
    end

    subgraph "VFS Layer"
        VFS["VFS"]
    end

    subgraph "Page Cache"
        PC["Page Cache"]
    end

    subgraph "Filesystem"
        FS["Ext4 / XFS / Btrfs"]
    end

    subgraph "Block Layer"
        BL["Block Layer"]
    end

    subgraph "Device"
        DISK["Disk / NVMe"]
    end

    APP --> VFS
    VFS --> PC
    PC -->|Cache MISS| FS
    FS --> BL
    BL --> DISK
    PC -->|Cache HIT| APP

Linux's page cache is a unified system: instead of separate buffer and page caches (as in older UNIX), there is one cache. A per-file index maps page-sized file offsets to struct page structures, which hold the actual data. That index was a radix tree for many years; kernels from 4.20 onward use the XArray, and recent kernels manage cached pages as folios. When a file is read, pages are allocated, filled from disk, and inserted into the index.

Core Concepts

Page Cache Structure

The page cache organizes data in pages (typically 4 KiB). Each cached file has a radix tree (an XArray in current kernels) that maps page indexes, that is, the file byte offset divided by the page size, to page structures:

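// Simplified page cache structures (field names follow older, radix-tree-era kernels)
struct address_space {
    struct inode           *host;             // Owning inode
    struct radix_tree_root  page_tree;        // Maps page index → struct page
    spinlock_t              tree_lock;        // Protects page_tree
    atomic_t                i_mmap_writable;  // Count of shared writable mappings
    struct rb_root          i_mmap;           // Tree of VMAs that mmap this file
    // ... other fields
};

// Each file's inode embeds an address_space and points at the one in use
struct inode {
    // ...
    struct address_space  *i_mapping;   // Usually points at i_data below
    struct address_space   i_data;      // Page cache state for this inode
    // ...
};

// struct page represents one cached page
struct page {
    unsigned long flags;                // Uptodate, dirty, locked, ... bits
    struct address_space *mapping;      // Which file this belongs to
    pgoff_t index;                      // Offset within file (in pages)
    void *virtual;                      // Kernel virtual address (only on some configs)
    // ... reference counting, LRU linkage, etc.
};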
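The `index` field above is a page number rather than a byte offset. A short sketch of the arithmetic, assuming 4 KiB pages (`PAGE_SHIFT = 12`); the helper names are made up for this example.

```python
PAGE_SHIFT = 12                  # assumption: 4 KiB pages
PAGE_SIZE = 1 << PAGE_SHIFT


def split_offset(pos):
    """Return (page index, offset within that page) for a file byte offset."""
    return pos >> PAGE_SHIFT, pos & (PAGE_SIZE - 1)


def pages_touched(pos, length):
    """Page indexes covered by a read or write of `length` bytes at `pos`."""
    if length <= 0:
        return range(0)
    first = pos >> PAGE_SHIFT
    last = (pos + length - 1) >> PAGE_SHIFT
    return range(first, last + 1)


print(split_offset(10000))              # byte 10000 lives in page 2
print(list(pages_touched(4000, 200)))   # a 200-byte access can straddle two pages
```

The second call shows why even a small request may need two cache lookups: bytes 4000 to 4199 cross the boundary between page 0 and page 1.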

Read Path

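// Simplified page cache read path (illustrative; helper names follow older kernels)
ssize_t generic_file_read(struct file *filp, char __user *buf,
                           size_t len, loff_t *ppos)
{
    struct address_space *mapping = filp->f_mapping;
    struct page *page;
    pgoff_t index = *ppos >> PAGE_SHIFT;
    size_t offset = *ppos & ~PAGE_MASK;
    int error;

    // This sketch handles one page per call: clamp to the end of the page
    if (len > PAGE_SIZE - offset)
        len = PAGE_SIZE - offset;

    // Try to find page in cache
    page = find_get_page(mapping, index);
    if (page && !PageUptodate(page)) {
        // Page is in cache but a read is still in flight: wait for it
        wait_on_page_locked(page);
        if (!PageUptodate(page)) {
            put_page(page);
            return -EIO;
        }
    }

    if (!page) {
        // Cache miss: allocate page and read from disk
        page = __page_cache_alloc(GFP_KERNEL);
        if (!page)
            return -ENOMEM;

        // Insert into page cache (the page comes back locked)
        error = add_to_page_cache_lru(page, mapping, index, GFP_KERNEL);
        if (error) {
            put_page(page);
            return error;
        }

        // Start the read; the filesystem unlocks the page when I/O completes
        error = mapping->a_ops->readpage(filp, page);
        if (error) {
            put_page(page);
            return error;
        }

        // Wait for I/O to complete, then confirm the data is valid
        wait_on_page_locked(page);
        if (!PageUptodate(page)) {
            put_page(page);
            return -EIO;
        }
    }

    // Copy data to user (copy_page_to_user stands in for the real copy helpers)
    copy_page_to_user(page, buf, offset, len);
    *ppos += len;
    put_page(page);
    return len;
}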

Write Path

Linux uses write-back caching by default:

// Simplified page cache write path (illustrative; one page per call)
ssize_t generic_file_write(struct file *filp, const char __user *buf,
                           size_t len, loff_t *ppos)
{
    struct address_space *mapping = filp->f_mapping;
    struct page *page;
    pgoff_t index = *ppos >> PAGE_SHIFT;
    size_t offset = *ppos & ~PAGE_MASK;

    // Clamp to the end of the page
    if (len > PAGE_SIZE - offset)
        len = PAGE_SIZE - offset;

    // Find the page in cache or create it; it is returned locked
    // (a real implementation reads the page first if the write is partial)
    page = find_or_create_page(mapping, index, GFP_KERNEL);
    if (!page)
        return -ENOMEM;

    // Copy user data into the page (copy_page_from_user is a stand-in helper)
    copy_page_from_user(page, buf, offset, len);

    // Mark page dirty so it will be written back later
    set_page_dirty(page);
    unlock_page(page);

    // Update file metadata
    file_update_time(filp);

    *ppos += len;
    put_page(page);
    return len;
}

The set_page_dirty() function marks the page as requiring writeback. Background flusher threads eventually write dirty pages to disk: pdflush in early 2.6 kernels, per-device flusher threads since 2.6.32, which run today as writeback kworkers.

Dirty Page Writeback

// How dirty pages get throttled and flushed (pseudocode, not actual kernel source)
void balance_dirty_pages(struct address_space *mapping)
{
    unsigned long nr_dirty = global_node_page_state(NR_FILE_DIRTY);
    // vm_dirty_ratio is a percentage, so convert it to a page count first
    // (dirtyable_memory_pages() is a placeholder for that calculation)
    unsigned long thresh = dirtyable_memory_pages() * vm_dirty_ratio / 100;

    // Throttle the writer if too many pages are dirty
    // This keeps processes from dirtying memory faster than the disk can drain it
    if (nr_dirty > thresh) {
        // Sleep until the flusher catches up (dirty_waitq is a placeholder wait queue)
        wait_event(dirty_waitq,
            global_node_page_state(NR_FILE_DIRTY) < thresh);
    }
}

// The flusher wakes periodically or when dirty pages expire
int flusher_thread(void *data)
{
    for (;;) {
        // Find filesystems with dirty pages older than dirty_expire_centisecs
        // Write them out
        // Wait for I/O completion
        // Sleep for dirty_writeback_centisecs (default 500, i.e. 5 seconds)
        schedule_timeout_interruptible(5 * HZ);
    }
}
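The throttling decision compares a page count against a percentage-based limit, so the ratios have to be converted first. A rough sketch of that arithmetic, assuming the commonly cited defaults of 10 for `vm.dirty_background_ratio` and 20 for `vm.dirty_ratio`; the function names are hypothetical, and the real kernel applies the ratios to "dirtyable" memory (roughly free plus reclaimable page cache) and starts throttling more gradually than this.

```python
def dirty_thresholds(dirtyable_kb, background_ratio=10, dirty_ratio=20):
    """Convert the percentage sysctls into absolute thresholds in kB."""
    background_kb = dirtyable_kb * background_ratio // 100
    throttle_kb = dirtyable_kb * dirty_ratio // 100
    return background_kb, throttle_kb


def writer_state(dirty_kb, dirtyable_kb, background_ratio=10, dirty_ratio=20):
    """Classify the current dirty backlog against the two thresholds."""
    background_kb, throttle_kb = dirty_thresholds(
        dirtyable_kb, background_ratio, dirty_ratio)
    if dirty_kb >= throttle_kb:
        return "throttled"              # writers are made to wait
    if dirty_kb >= background_kb:
        return "background-writeback"   # flusher runs, writers continue
    return "below-thresholds"


print(dirty_thresholds(8_000_000))
print(writer_state(900_000, 8_000_000))
```

On a machine with about 8 GB of dirtyable memory, background writeback would start near 800 MB of dirty data and writers would be throttled near 1.6 GB under these assumptions.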

Buffer Cache vs Page Cache (Historical Context)

Older UNIX systems used a separate buffer cache for block device blocks and a page cache for file pages. Linux unified the two in the 2.4 series, leaving a single page cache. The buffer cache survives only as struct buffer_head, a thin layer on top of page cache pages that records which disk block backs each part of a page.

// buffer_head: legacy structure layered on page cache
struct buffer_head {
    unsigned long b_state;      // Buffer state bitmap (flags listed below)
    struct page *b_page;        // Parent page
    sector_t    b_blocknr;      // Block number on disk
    size_t      b_size;         // Block size
    char       *b_data;         // Pointer to data within page
    // ...
};

// Selected b_state flags (enum bh_state_bits)
//   BH_Uptodate   Contains valid data
//   BH_Dirty      Needs writeback
//   BH_Lock       I/O in progress
//   BH_Req        Has been submitted for I/O
//   BH_Mapped     Maps a valid disk block

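This is why an fsync() implementation has two jobs: write back the file's dirty page cache pages, and flush the metadata blocks that are tracked through buffer_heads on filesystems that still use them. The page cache serves read(), write(), and mmap() alike; buffer_head only describes how pages map to disk blocks.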

Production Failure Scenarios

Scenario 1: Data Loss from Write-Back Caching

What happened: A web application’s “save” function returned success to users immediately after writing to the page cache, but the data sat in dirty pages for 30+ seconds before the flusher daemon wrote it to disk. A power failure lost those 30 seconds of user data.

Detection: Users reported data loss after the power outage. Monitoring of Dirty in /proc/meminfo had shown a large backlog of dirty pages before the crash (iostat reports device I/O, not dirty pages). dmesg showed no disk errors.

Mitigation:

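// Application must use fsync() to ensure durability
int save_user_data(int fd, const void *buf, size_t len)
{
    const char *p = buf;

    // write() may be short; loop until everything is in the page cache
    while (len > 0) {
        ssize_t written = write(fd, p, len);
        if (written < 0)
            return -1;
        p += written;
        len -= written;
    }

    // Force the data to disk before returning "success"
    if (fsync(fd) < 0)
        return -1;  // Durability not guaranteed; report the failure

    return 0;       // Data is now durable
}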

// Alternative: open with O_SYNC for synchronous writes
int fd = open("data.txt", O_WRONLY | O_CREAT | O_SYNC, 0644);
// Every write() now blocks until data is on disk
write(fd, buf, len);  // Returns only after disk write completes
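The same discipline carries over to higher-level languages. Below is a minimal Python sketch of a durable save; `write_all`, `save_durably`, and `demo` are hypothetical helpers. Syncing the parent directory after creating a file is a common extra step on Linux filesystems so that the new directory entry also survives a crash, and is included here as an assumption about the deployment rather than something the scenario above specifies.

```python
import os
import tempfile


def write_all(fd, data):
    """Loop until every byte is handed to the kernel; os.write may be short."""
    view = memoryview(data)
    while len(view) > 0:
        written = os.write(fd, view)
        view = view[written:]


def save_durably(path, data):
    """Return only after the data and the directory entry have been synced."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        write_all(fd, data)
        os.fsync(fd)                 # file contents and inode metadata
    finally:
        os.close(fd)
    dir_fd = os.open(os.path.dirname(os.path.abspath(path)), os.O_RDONLY)
    try:
        os.fsync(dir_fd)             # entry for a newly created file
    finally:
        os.close(dir_fd)


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "data.txt")
        save_durably(path, b"user data\n")
        with open(path, "rb") as f:
            return f.read()
```

Only once `save_durably()` has returned without raising should the application report success to the user.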

Scenario 2: Memory Pressure Causing Thrashing

What happened: A large analytics job read 80GB of data files sequentially, but the files were larger than available RAM. The page cache filled with data that would never be reused, then the job moved to new files, evicting old ones. The OS spent more time managing page cache eviction than doing useful work—effective I/O throughput collapsed.

Detection: free showed nearly zero available memory, iostat showed constant disk I/O (thrashing), top showed high sy (system) CPU.

Mitigation:

// Use O_DIRECT to bypass page cache for large sequential scans
// (requires aligned buffers; use posix_memalign or memalign)
int fd = open("analytics.bin", O_RDONLY | O_DIRECT);
if (fd >= 0) {
    void *buf;
    ssize_t n;

    if (posix_memalign(&buf, 4096, 64 * 1024) == 0) {  // Aligned for direct I/O
        while ((n = read(fd, buf, 64 * 1024)) > 0) {
            process_data(buf, n);  // Data not cached, no cache pollution
        }
        free(buf);
    }
    close(fd);
}

// Or keep buffered I/O and give the kernel hints (fd here is a buffered descriptor)
posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);  // "I'll read sequentially": larger readahead
// ... read and process the file ...
posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);    // Done: let the kernel drop these pages
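For jobs that stay on buffered I/O, the hinting approach can be applied chunk by chunk. A sketch using Python's `os.posix_fadvise` (available on Linux and other Unix systems); `scan_without_polluting_cache`, `demo`, and the 1 MiB chunk size are illustrative choices, not tuned values.

```python
import os
import tempfile

CHUNK = 1 << 20   # 1 MiB per read; an arbitrary choice for this sketch


def scan_without_polluting_cache(path, process, chunk=CHUNK):
    """Stream a file once, asking the kernel to drop each chunk after use."""
    total = 0
    offset = 0
    can_advise = hasattr(os, "posix_fadvise")
    fd = os.open(path, os.O_RDONLY)
    try:
        if can_advise:
            os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_SEQUENTIAL)
        while True:
            buf = os.read(fd, chunk)
            if not buf:
                break
            process(buf)
            if can_advise:
                # Clean pages in this range are no longer needed
                os.posix_fadvise(fd, offset, len(buf), os.POSIX_FADV_DONTNEED)
            offset += len(buf)
            total += len(buf)
    finally:
        os.close(fd)
    return total


def demo():
    sizes = []
    with tempfile.NamedTemporaryFile() as tmp:
        tmp.write(b"x" * (3 * CHUNK + 5))
        tmp.flush()
        total = scan_without_polluting_cache(
            tmp.name, lambda b: sizes.append(len(b)))
    return total, sum(sizes)
```

The hints are advisory: the kernel is free to ignore them, so the effect should be confirmed by watching Cached in /proc/meminfo while the job runs.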

Scenario 3: Page Cache Corruption During Crash

What happened: A misbehaving kernel module wrote to a page that had been truncated from a file but remained in the page cache. When the truncated page was later reused, the corrupted data was “restored” from cache, overwriting freshly-written file data.

Detection: Corrupted file contents after system crash; md5sum of files changed mysteriously.

Mitigation:

// Kernel code must properly invalidate cache on truncate
int my_file_truncate(struct inode *inode, loff_t new_size)
{
    // truncate_setsize() updates i_size first, then drops cached pages
    // beyond the new end of file, so no stale page can be found afterwards
    truncate_setsize(inode, new_size);

    // Persist the inode metadata change
    mark_inode_dirty(inode);

    return 0;
}

// User space can use sync_file_range for fine-grained writeback control.
// Note: it does not flush metadata or device caches, so it is not a
// substitute for fsync() when durability matters.
sync_file_range(fd, offset, nbytes,
                SYNC_FILE_RANGE_WRITE |     // Writeout
                SYNC_FILE_RANGE_WAIT_AFTER);// Wait for completion

Trade-off Table

Caching Strategy	Write Latency	Read Performance	Durability	Memory Efficiency
Full page cache (default)	Very low (returns immediately)	High (reuses cached)	Poor (relies on sync)	Low for sequential
O_SYNC writes	Very high (blocks on disk)	High	Excellent	Same as default
O_DIRECT reads	N/A	Varies (no cache)	N/A	Excellent (no waste)
Write-through cache	High	High	Good	Moderate
fsync() per operation	Medium (batched)	High	Excellent	High
No cache (raw block device)	Depends on driver	Low for repeated	N/A	N/A

Implementation Snippets

Checking and Controlling Page Cache Behavior

#!/usr/bin/env python3
"""
Python utilities for monitoring and controlling page cache.
"""
import os
import sys
from ctypes import CDLL, c_long, c_int

libc = CDLL("libc.so.6", use_errno=True)

# posix_fadvise values
POSIX_FADV_NORMAL = 0
POSIX_FADV_SEQUENTIAL = 2
POSIX_FADV_WILLNEED = 3
POSIX_FADV_DONTNEED = 4
POSIX_FADV_NOREUSE = 5


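def drop_caches():
    """Drop clean page cache plus dentries and inodes (requires root)."""
    os.sync()  # write dirty pages first; only clean pages can be dropped
    with open('/proc/sys/vm/drop_caches', 'w') as f:
        f.write('3')
    print("Page cache dropped")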


def readahead_file(fd: int, offset: int = 0, len_hint: int = 0):
    """
    Hint to kernel that file content will be read sequentially.

    POSIX_FADV_SEQUENTIAL doubles the readahead window.
    """
    libc.posix_fadvise64(fd, c_long(offset), c_long(len_hint),
                         c_int(POSIX_FADV_SEQUENTIAL))


def prefetch_file(fd: int, offset: int = 0, len_hint: int = 0):
    """
    Tell kernel to eagerly load pages into page cache.
    """
    libc.posix_fadvise64(fd, c_long(offset), c_long(len_hint),
                         c_int(POSIX_FADV_WILLNEED))


def evict_cached_pages(fd: int, offset: int = 0, len_hint: int = 0):
    """
    Ask the kernel to drop cached pages for this range.

    Only clean pages are dropped; dirty pages stay cached until they are
    written back, so call os.fsync(fd) first to evict the whole range.
    """
    libc.posix_fadvise64(fd, c_long(offset), c_long(len_hint),
                         c_int(POSIX_FADV_DONTNEED))


def get_cache_stats():
    """Read /proc/meminfo for page cache statistics."""
    stats = {}
    with open('/proc/meminfo') as f:
        for line in f:
            if ':' in line:
                key, value = line.split(':', 1)
                stats[key.strip()] = value.strip()
    return stats


if __name__ == "__main__":
    print("=== Page Cache Statistics ===\n")
    stats = get_cache_stats()

    fields = ['Buffers', 'Cached', 'SwapCached',
              'Active(file)', 'Inactive(file)',
              'Dirty', 'Writeback']
    for field in fields:
        print(f"  {field}: {stats.get(field, 'N/A')}")

    # If run as root, offer to drop cache
    if os.geteuid() == 0 and len(sys.argv) > 1 and sys.argv[1] == '--drop':
        print("\nDropping caches...")
        drop_caches()

Monitoring Page Cache Effectiveness

#!/bin/bash
# Monitor page cache size, block I/O rates, and writeback tunables

echo "=== Memory and Page Cache Status ==="
grep -E "^(MemTotal|MemFree|Buffers|Cached|Active|Inactive|Dirty|Writeback):" /proc/meminfo

echo ""
echo "=== Block I/O transfer rates (from sar if available) ==="
if command -v sar &> /dev/null; then
    # sar -b reports read/write transfer rates; it does not expose a cache hit ratio
    sar -b 1 5 | tail -10
fi

echo ""
echo "=== Per-Device I/O Statistics ==="
iostat -x 1 5 || true

echo ""
echo "=== Checking for memory pressure (high watermark) ==="
cat /proc/zoneinfo | awk '/Node/,/high/ {print}' | tail -20

echo ""
echo "=== VM Tunables ==="
echo "dirty_ratio: $(cat /proc/sys/vm/dirty_ratio)"
echo "dirty_background_ratio: $(cat /proc/sys/vm/dirty_background_ratio)"
echo "dirty_expire_centisecs: $(cat /proc/sys/vm/dirty_expire_centisecs)"
echo "dirty_writeback_centisecs: $(cat /proc/sys/vm/dirty_writeback_centisecs)"

Caching Behavior in Applications

#define _GNU_SOURCE   /* Needed for O_DIRECT with glibc; must precede the includes */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <errno.h>

/*
 * Demonstration of different caching behaviors:
 * 1. Normal buffered I/O (uses page cache)
 * 2. O_DIRECT bypasses page cache
 * 3. Proper fsync usage for durability
 */

#define BUFFER_SIZE (64 * 1024)

void demonstrate_fsync(int fd) {
    char buf[BUFFER_SIZE];
    ssize_t written;

    /* Write some data (fill the buffer first so we don't write stack garbage) */
    memset(buf, 0x41, BUFFER_SIZE);
    written = write(fd, buf, BUFFER_SIZE);
    if (written < 0) {
        perror("write");
        return;
    }

    printf("write() returned: %zd (page cache, not yet on disk)\n", written);

    /* Ensure data reaches disk */
    if (fsync(fd) < 0) {
        perror("fsync");
        return;
    }

    printf("fsync() completed (data now durable on disk)\n");
}

int demonstrate_direct_io(void) {
    int fd;
    char *buf;
    int ret;

    /* O_DIRECT requires aligned memory */
    if (posix_memalign((void**)&buf, 4096, BUFFER_SIZE) != 0) {
        perror("posix_memalign");
        return 1;
    }

    /* Open with O_DIRECT to bypass page cache */
    fd = open("/tmp/test_direct_io", O_RDWR | O_CREAT | O_DIRECT, 0644);
    if (fd < 0) {
        /* Some filesystems reject O_DIRECT with EINVAL; this demo gives up then */
        perror("open(O_DIRECT)");
        free(buf);
        return 1;
    }

    printf("Using O_DIRECT (bypasses page cache)\n");

    /* With O_DIRECT, each read/write is direct to/from disk */
    memset(buf, 0x42, BUFFER_SIZE);
    ret = write(fd, buf, BUFFER_SIZE);
    printf("Direct write returned: %d (disk I/O, synchronous or nearly so)\n", ret);

    /* With O_DIRECT, fsync() may still be needed for durability
       depending on filesystem */
    fsync(fd);

    free(buf);
    close(fd);
    return 0;
}

int main(int argc, char *argv[]) {
    int fd;
    const char *path = "/tmp/page_cache_demo";

    printf("=== Normal buffered I/O ===\n");
    fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) {
        perror("open");
        return 1;
    }
    demonstrate_fsync(fd);
    close(fd);

    printf("\n=== Direct I/O ===\n");
    demonstrate_direct_io();

    return 0;
}

Observability Checklist

Linux Page Cache Metrics

# Check overall memory and cache status
grep -E "^(Buffers|Cached|AnonPages|Active|Inactive|Dirty|Writeback|Shmem):" /proc/meminfo

# Per-process resident memory, including mapped file pages
grep -E "^(Rss|Pss|Shared_|Private_)" /proc/$(pidof myapp)/smaps_rollup

# Page cache efficiency
# (need to calculate from vfs_cache_pressure and other metrics)
cat /proc/meminfo | grep -E "^(Active\(file\)|Inactive\(file\)|SReclaimable):"

# Check dirty page writeback status
grep -E "^(nr_dirty|nr_writeback|nr_dirty_threshold|nr_free_pages) " /proc/vmstat

# View writeback activity in real-time
watch -n 1 'cat /proc/vmstat | grep -E "pgpgin|pgpgout|pswpin|pswpout|pgfree|pgactivate"'

# Block device I/O (includes writeback)
iostat -x 1

# Per-CPU statistics
mpstat -P ALL 1

Key Metrics to Monitor

Metric	Healthy Range	Alert Threshold	Indicates
`Cached` in `/proc/meminfo`	Depends on workload	Shrinking rapidly	Memory pressure
`Dirty` in `/proc/meminfo`	< 5% of RAM	> 10% of RAM	Writeback backlog
`Writeback` in `/proc/meminfo`	0 or very low	> 0 sustained	Disk can’t keep up
`pgpgin`/`pgpgout` rate	Baseline	Sudden change	Abnormal I/O
`pgmajfault` rate	Baseline	Spikes	Mapped file pages being re-read from disk (cache thrash)
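The Dirty thresholds in the table can be checked mechanically. A small sketch that parses /proc/meminfo-style text and applies the 5% and 10% figures from the table; `parse_meminfo`, `dirty_status`, and the sample values are made up for illustration.

```python
def parse_meminfo(text):
    """Parse 'Key:   12345 kB' lines into a dict of integers."""
    values = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue
        fields = rest.split()
        if fields and fields[0].isdigit():
            values[key.strip()] = int(fields[0])   # kB for most fields
    return values


def dirty_status(meminfo, warn_pct=5.0, alert_pct=10.0):
    """Return (dirty percentage of MemTotal, 'ok' | 'warn' | 'alert')."""
    pct = 100.0 * meminfo["Dirty"] / meminfo["MemTotal"]
    if pct > alert_pct:
        return pct, "alert"
    if pct >= warn_pct:
        return pct, "warn"
    return pct, "ok"


SAMPLE = """MemTotal:       16000000 kB
Cached:          4000000 kB
Dirty:           1200000 kB
Writeback:             0 kB
"""

print(dirty_status(parse_meminfo(SAMPLE)))
```

On a live system the same functions can be fed `open("/proc/meminfo").read()` instead of the sample string.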

Common Pitfalls / Anti-Patterns

Page Cache Security Considerations

The page cache introduces a class of security risks that traditional memory safety discussions ignore. The kernel—not the application—manages this memory, and data lingers in cache after the application that created it exits.

Sensitive Data in Page Cache: When an application reads passwords, API keys, or cryptographic key material from files, those bytes land in the page cache. Unlike heap memory that an application frees explicitly, page cache pages stay in RAM until the kernel evicts them, typically under memory pressure. Physical-memory acquisition on a shared system (a cold boot attack, a DMA attack, or a kernel crash dump) can recover data the application assumed was gone once it closed the file. Controls: mlock() pins pages mapped into the process so they are neither swapped nor evicted, but it does nothing for the kernel's separate cached copy of a file that was read with read(); posix_fadvise(POSIX_FADV_DONTNEED) after reading asks the kernel to drop that cached copy; reading via O_DIRECT with aligned buffers avoids populating the cache in the first place; and memfd_create() provides an anonymous, memory-only file (optionally sealed with F_SEAL_* flags) for secrets that should never touch a disk-backed file, although its pages are still ordinary kernel memory. The prctl(PR_SET_DUMPABLE, 0) call stops the process from producing core dumps, but page cache pages belong to the kernel rather than to the process, so this only covers part of the problem.
Page Cache as Information Leak: The page cache leaks not just data content but access patterns. Each struct page entry records which file and offset it holds, so simply opening a file leaves traces in kernel data structures. A side channel can tell whether “file X was accessed recently” versus “file X was never touched” by reading /proc/PID/smaps or measuring memory access timing. Even with encrypted filesystems (ext4 with fscrypt, for example), the page cache stores decrypted content and the access pattern metadata survives. Against a nation-state adversary, mitigations are incomplete: kernel page table isolation helps, and disabling /proc interfaces that expose per-process page cache mappings reduces exposure in multi-tenant environments.
DMA Attacks on Page Cache: DMA-capable devices can read arbitrary physical memory, bypassing CPU access controls entirely. A malicious or compromised PCIe device can reach any memory address including page cache pages, reading sensitive data without the CPU knowing. The fix is an IOMMU, which enforces DMA access boundaries. Enable it in BIOS/UEFI and boot the kernel with intel_iommu=on or amd_iommu=on. In virtualized environments, make sure the hypervisor exposes a vIOMMU (VT-d or AMD-Vi). Without IOMMU, a compromised PCI device has unrestricted access to all physical memory.

Compliance Considerations

Compliance frameworks treat page cache memory differently in their details, but they agree on the basics: any memory region that can hold sensitive data needs protection.

PCI DSS: Requirement 3.4 (renumbered under 3.5 in v4.0) requires PAN to be unreadable wherever it is stored; whether transient copies in memory such as page cache pages fall in scope is a question for your assessor. Note that filesystem-level encryption (ext4 encryption via fscrypt) protects data at rest only: the page cache still holds decrypted content. Practical steps: use encrypted swap, restrict and protect crash dumps, and bypass the page cache for sensitive files with O_DIRECT where feasible. The QSA assessing your environment will want to see that raw PANs cannot be pulled from a core dump or recovered via cold boot attack.
HIPAA: PHI in page cache counts as “information in use” under HIPAA’s Security Rule, which requires administrative, physical, and technical safeguards for ePHI. This means encrypted filesystems for PHI-containing files, mlock() to prevent swapping, and audit logs tracking which systems access those files. A crash that produces a core dump can include PHI from page cache pages—dump storage needs the same protection as the original files.
EU GDPR: Personal data in page cache falls under “processing” per Article 4, which requires a lawful basis. The page cache is tricky because it holds data beyond the application’s lifecycle—a file read into cache stays there until the kernel evicts it, potentially readable via memory forensics long after the application closed. Data minimization argues for keeping sensitive personal data out of the page cache entirely when possible. For cross-border transfers, a US-hosted cloud VM processing EU personal data raises Schrems II concerns if the underlying infrastructure is subject to US surveillance statutes.

Coding Anti-Patterns

Anti-pattern	Why It’s Bad	Correct Approach
Assuming write() is durable	write() only commits to page cache	Use fsync() or O_SYNC for durability
Large sequential reads without O_DIRECT	Pollutes page cache, may cause thrashing	Use posix_fadvise(POSIX_FADV_DONTNEED) after reads
Ignoring dirty_ratio limits	Application blocks when dirty pages exceed threshold	Tune vm.dirty_ratio / vm.dirty_background_ratio
Not handling ENOSPC	With delayed allocation, a full disk may be reported by a later write(), fsync(), or close() rather than by the write() that caused it	Check the return value of every write(), fsync(), and close()
Using mmap() and expecting immediate disk writes	mmap writes are also buffered	msync() to force writeback
Kernel code that shrinks a file without invalidating cached pages	Stale pages beyond the new end of file may persist in cache	Use truncate_setsize() or truncate_inode_pages(); user-space truncate() and O_TRUNC already do this

Quick Recap Checklist

Page cache stores disk blocks in RAM for faster access; cache hits avoid disk I/O entirely
Write-back (default): write() returns immediately; data reaches disk later via flusher daemon
Write-through: write() blocks until data reaches disk; use O_SYNC flag
fsync() forces a file’s dirty pages and metadata to disk before returning
O_DIRECT bypasses page cache for applications that manage their own caching (databases)
posix_fadvise() lets applications hint access patterns to kernel for better cache handling
Page cache is unified—file pages and block buffer pages share the same underlying mechanism
Monitor dirty page count and writeback activity to detect I/O bottlenecks
Memory pressure evicts page cache pages—ensure enough RAM for your workload
Never assume data is on disk without fsync(); a crash between write() and sync will lose data

Interview Questions

1. What is the difference between write-back and write-through caching?

Write-through caching writes data to both cache and underlying storage immediately—the write operation blocks until storage confirms the data is persisted. Write-back caching writes only to cache and returns immediately; the data is marked dirty and written to storage later by the flusher daemon. Linux defaults to write-back. Write-through is safer (no data loss on crash) but slower. Write-back is faster but risks data loss if the system crashes before the flusher daemon writes dirty pages to disk. Applications requiring durability must use fsync(), O_SYNC, or O_DSYNC regardless of caching mode.

2. What happens when you call fsync()? Walk through the entire process.

When you call fsync(fd): (1) The VFS layer resolves the file descriptor to the struct file and its struct inode. (2) It calls the filesystem's fsync hook (ext4_sync_file(), xfs_file_fsync(), etc.). (3) The filesystem submits all dirty pages belonging to that inode for writeback and waits for that I/O to complete. (4) It writes the inode's metadata (size, timestamps, block mappings), on journaling filesystems usually by forcing a journal commit. (5) It issues a cache flush to the device (or uses FUA writes) so the data leaves any volatile write cache. (6) It returns 0, or an error such as EIO if any writeback failed. fdatasync() is similar but skips metadata that is not needed to read the data back, such as timestamps. sync() is system-wide: it writes back dirty data and metadata for every mounted filesystem rather than for a single file.

3. When would you use O_DIRECT instead of normal buffered I/O?

Use O_DIRECT when you want to bypass the page cache entirely. This is most appropriate for databases that implement their own sophisticated buffer management. For example, MySQL's InnoDB can be configured with innodb_flush_method=O_DIRECT to avoid double caching (both OS page cache and database buffer pool consuming RAM); PostgreSQL, by contrast, uses buffered I/O by default and relies on the page cache. O_DIRECT is also useful for very large sequential reads that won't be reused (preventing cache pollution), or for applications requiring deterministic I/O patterns without opaque caching. However, O_DIRECT requires the buffer address, file offset, and transfer length to be aligned (typically to the device's logical block size, 512 or 4096 bytes) and it bypasses kernel readahead. Most applications should use normal buffered I/O with posix_fadvise() hints.
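The alignment requirement is the part that most often trips people up. A sketch in Python, assuming a 4096-byte alignment is sufficient for the target device (the real requirement is the device's logical block size) and using an anonymous `mmap` as a page-aligned buffer. `read_direct` and `demo` are hypothetical helpers; the fallback to buffered I/O is there because some filesystems reject O_DIRECT.

```python
import mmap
import os
import tempfile

ALIGN = 4096   # assumed alignment for buffer, offset, and length


def read_direct(path, size):
    """Read up to `size` bytes (size > 0) from offset 0, trying O_DIRECT first."""
    length = -(-size // ALIGN) * ALIGN       # round up to a multiple of ALIGN
    buf = mmap.mmap(-1, length)              # anonymous mapping is page-aligned
    try:
        try:
            fd = os.open(path, os.O_RDONLY | getattr(os, "O_DIRECT", 0))
            try:
                n = os.readv(fd, [buf])      # one aligned read is enough here
            finally:
                os.close(fd)
        except OSError:
            # O_DIRECT not supported here: fall back to ordinary buffered I/O
            with open(path, "rb") as f:
                n = f.readinto(buf)
        return bytes(buf[:min(n, size)])
    finally:
        buf.close()


def demo():
    data = bytes(range(256)) * 40            # 10240 bytes
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "blob.bin")
        with open(path, "wb") as f:
            f.write(data)
        return read_direct(path, len(data)) == data
```

A production reader would loop over aligned chunks and handle short reads; this version reads a small file in a single call to keep the alignment logic visible.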

4. Why might an application intentionally want to drop pages from the page cache?

Two primary reasons: (1) Memory pressure management—an application may hold large datasets that won't be accessed again, and explicitly dropping them (via posix_fadvise(POSIX_FADV_DONTNEED)) allows the kernel to use that memory for other purposes like file caching for active files. (2) Performance isolation—a batch job that processes many files sequentially might otherwise evict useful pages from the page cache. Dropping its accessed pages after processing ensures those evictions don't displace frequently-accessed data. This is especially important in systems running multiple workloads where "noisy neighbor" page cache pollution is a concern.

5. What is the relationship between the buffer cache and the page cache in modern Linux?

Modern Linux has a unified page cache; there is no separate buffer cache. Historically, UNIX had both: the buffer cache cached disk blocks for filesystems, while the page cache cached file content for memory-mapped files. Linux merged these into a single page cache in the 2.4 series. The struct buffer_head that remains is a thin layer on top of page cache pages that tracks which disk blocks map to which page. When you read a file, a page is allocated, filled from disk, and cached. Both memory-mapped file access and regular read()/write() access go through the same page cache, which simplifies coherency management and reduces duplicate caching.

6. How does the kernel determine which pages to evict from the page cache when memory pressure is high?

The kernel approximates LRU (Least Recently Used) with two lists: an active list and an inactive list, kept separately for file-backed and anonymous pages. New page cache pages start on the inactive list; a second access promotes them to the active list, and reclaim takes pages from the tail of the inactive list. The kswapd daemon and direct reclaim (both implemented in mm/vmscan.c) drive this, and workingset refault tracking lets a recently evicted page that is read back in go straight to the active list. Kernels 6.1 and later can optionally use the multi-generational LRU (MGLRU), which replaces the two lists with several generations.

The Active(file) and Inactive(file) counters in /proc/meminfo show the split. Shmem (shared memory and tmpfs pages) is also counted in Cached but can only be reclaimed by swapping. The vm.swappiness sysctl tunes the kernel's preference between reclaiming anonymous pages (by swapping) and file-backed pages.
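The active/inactive split can be illustrated with a toy model. This is a deliberate simplification, not the kernel's algorithm: `TwoListCache` and its fixed list sizes are assumptions, and real reclaim also ages the active list and tracks refaults.

```python
from collections import OrderedDict


class TwoListCache:
    """Toy active/inactive LRU: pages are promoted on their second access."""

    def __init__(self, inactive_size, active_size):
        self.inactive = OrderedDict()
        self.active = OrderedDict()
        self.inactive_size = inactive_size
        self.active_size = active_size

    def access(self, page):
        if page in self.active:
            self.active.move_to_end(page)
        elif page in self.inactive:
            del self.inactive[page]              # second touch: promote
            self.active[page] = True
            if len(self.active) > self.active_size:
                demoted, _ = self.active.popitem(last=False)
                self._add_inactive(demoted)      # oldest active page is demoted
        else:
            self._add_inactive(page)             # first touch: start inactive

    def _add_inactive(self, page):
        self.inactive[page] = True
        if len(self.inactive) > self.inactive_size:
            self.inactive.popitem(last=False)    # reclaim the coldest page


cache = TwoListCache(inactive_size=4, active_size=4)
for hot in ("a", "a", "b", "b"):     # two pages that are reused
    cache.access(hot)
for scanned in range(100):           # a one-pass sequential scan
    cache.access(scanned)
print(list(cache.active), list(cache.inactive))
```

The one-pass scan churns through the inactive list only, so the two reused pages survive on the active list. A single plain LRU list would have evicted them.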

7. What is the difference between radix_tree_delete() and truncating pages in the page cache?

radix_tree_delete() is a low-level operation that removes one slot at a known index from a radix tree. Calling it directly on a page cache tree only unhooks the page from the index: it does not wait for writeback, unmap the page from user address spaces, clear page->mapping, or fix up the LRU and dirty accounting. Truncation via truncate_inode_pages() does all of that for every page beyond the new file size. It also discards dirty data in that range without writing it back, which is correct because the truncated bytes no longer exist in the file.

The key difference: a bare radix_tree_delete() leaves the page and the cache in an inconsistent state, so file-backed pages should always be removed through the truncate or invalidate paths. When the data must survive, as in cache invalidation rather than truncation, write the dirty pages back first (for example with filemap_write_and_wait()) and then invalidate.

8. How does O_SYNC differ from fsync() in terms of durability guarantees and performance?

O_SYNC is an open-mode flag that makes every write() synchronous—each write blocks until the data and metadata are on disk. fsync() is a per-file operation that forces pending writes and metadata updates to disk after a specific write sequence. fsync() is more granular: you batch writes, then sync at a commit point.

Performance: O_SYNC is slower because it waits for disk after every write, even small ones. fsync() lets you batch writes into larger I/O operations before syncing—far more efficient. Use O_SYNC only when every individual write must be durable (rare, e.g., write-ahead logging in databases), and use fsync() for periodic commits.
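The difference in the number of sync points can be seen in a short sketch. `append_each_synced`, `append_batched`, and `demo` are hypothetical helpers; both produce the same file, but the first waits for the disk once per record and the second once per batch.

```python
import os
import tempfile


def append_each_synced(path, records):
    """O_SYNC: every write() returns only after the data is on stable storage."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND | os.O_SYNC, 0o644)
    try:
        for record in records:
            os.write(fd, record)
    finally:
        os.close(fd)
    return len(records)          # one sync point per record


def append_batched(path, records):
    """Buffered writes followed by a single fsync() at the commit point."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
    try:
        os.write(fd, b"".join(records))
        os.fsync(fd)
    finally:
        os.close(fd)
    return 1                     # one sync point for the whole batch


def demo():
    records = [b"alpha\n", b"beta\n", b"gamma\n"]
    with tempfile.TemporaryDirectory() as tmp:
        a = os.path.join(tmp, "o_sync.log")
        b = os.path.join(tmp, "batched.log")
        syncs = (append_each_synced(a, records), append_batched(b, records))
        with open(a, "rb") as fa, open(b, "rb") as fb:
            same = fa.read() == fb.read()
    return syncs, same
```

The records here are small enough that a single `os.write` call accepts them in full; production code would loop on short writes.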

9. What is the "double write" problem in databases and how does it relate to page cache behavior?

The underlying problem is the torn page: database pages (8 KiB in PostgreSQL, 16 KiB by default in InnoDB) are larger than the unit the storage stack writes atomically, so a crash in the middle of writing a page can leave part of the new page and part of the old one on disk. Such a page matches neither version and cannot be repaired by replaying a log that assumes an intact starting page.

InnoDB's solution is the doublewrite buffer: write the page to a reserved area on disk and sync it, then write the page to its actual location. If a crash tears the write to the real location, recovery restores the intact copy from the doublewrite area. PostgreSQL addresses the same problem differently, by logging a full image of each page in the WAL the first time it is modified after a checkpoint (full_page_writes). Both are necessary precisely because neither the page cache nor the block layer guarantees that a multi-sector page write is atomic.
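A toy model of the doublewrite idea, kept entirely in memory. This is a sketch under stated assumptions, not how any database implements it: `ToyStore`, the 16-byte page size, and the CRC32 trailer are all invented for illustration.

```python
import zlib

PAGE = 16   # toy page size in bytes


def seal(payload):
    """Append a 4-byte CRC32 so a torn page can be detected later."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")


def is_intact(page):
    return (len(page) == PAGE + 4
            and zlib.crc32(page[:-4]).to_bytes(4, "big") == page[-4:])


class ToyStore:
    def __init__(self, old_payload):
        self.home = seal(old_payload)   # the page's real location
        self.doublewrite = b""          # reserved scratch area

    def write_page(self, payload, crash_during_home_write=False):
        page = seal(payload)
        self.doublewrite = page         # step 1: full copy (then sync, in a real engine)
        if crash_during_home_write:
            half = len(page) // 2       # torn write: new first half, old second half
            self.home = page[:half] + self.home[half:]
            return
        self.home = page                # step 2: write to the real location

    def recover(self):
        if not is_intact(self.home) and is_intact(self.doublewrite):
            self.home = self.doublewrite
        return self.home[:-4]


def demo():
    store = ToyStore(b"A" * PAGE)
    store.write_page(b"B" * PAGE, crash_during_home_write=True)
    torn = not is_intact(store.home)
    return torn, store.recover()
```

After the simulated crash the home location fails its checksum, and recovery copies the intact page back from the scratch area.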

10. What is posix_fadvise() and how does POSIX_FADV_WILLNEED differ from readahead?

posix_fadvise() lets applications hint their access pattern to the kernel's page cache. POSIX_FADV_WILLNEED tells the kernel to eagerly prefetch the specified file range into page cache—essentially triggering readahead for that range immediately. POSIX_FADV_SEQUENTIAL doubles the readahead window for sequential access. POSIX_FADV_DONTNEED tells the kernel to drop the specified pages from cache.

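readahead() is a Linux-specific system call that takes the same kind of arguments (file descriptor, offset, length), while POSIX_FADV_WILLNEED is the portable POSIX way to request the same prefetch. Both operate on an already-open file descriptor and a specific byte range, and both are hints rather than guarantees. posix_fadvise() is the standard way to prefetch explicitly before a known access pattern, such as a batch job that processes files in a known order.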
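
A minimal sketch of these hints from Python, assuming a platform where `os.posix_fadvise()` is available (Linux and some other Unix systems). The helper names `prefetch` and `scan_once` are hypothetical, and because the calls are advisory the kernel is free to ignore them.

```python
import os
import tempfile


def prefetch(fd, offset=0, length=0):
    """Ask the kernel to start reading a range into the page cache.

    A length of 0 means "from offset to the end of the file".
    """
    os.posix_fadvise(fd, offset, length, os.POSIX_FADV_WILLNEED)


def scan_once(path, chunk_size=1 << 20):
    """Read a file sequentially one time, then hint that it will not be reused."""
    fd = os.open(path, os.O_RDONLY)
    try:
        # Hint: sequential access, so a larger readahead window is worthwhile.
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_SEQUENTIAL)
        total = 0
        while True:
            chunk = os.read(fd, chunk_size)
            if not chunk:
                break
            total += len(chunk)
        # Hint: these cached pages are not needed again and may be dropped.
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
        return total
    finally:
        os.close(fd)
```

A batch job might call `prefetch()` on the next input file while still processing the current one, and use `scan_once()` for large files that would otherwise push more useful pages out of the cache.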

11. How does the page cache handle concurrent writes to the same file from multiple processes?

Multiple processes writing to the same file share the same page cache pages. The kernel maintains a per-file mapping from file offset to cached page (struct page, or a folio in recent kernels). A page is locked while data is copied into it, and most filesystems additionally serialize buffered write() calls on the same file with a per-inode lock. Concurrent writes to non-overlapping regions do not interfere with each other's data.

The key issue: concurrent writes to overlapping regions can result in lost updates (last-write-wins) at the application level unless they are coordinated via file locking (flock() or fcntl() advisory locks). The page cache keeps a single coherent copy of each page, but it does not impose any application-level ordering on which writer goes first.
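
A short sketch of coordinating a read-modify-write cycle with an advisory lock. The counter-file format and the function name `increment_counter` are assumptions for illustration. The lock only protects against other writers that also take it.

```python
import fcntl
import os
import tempfile
import threading


def increment_counter(path):
    """Atomically increment a decimal counter stored at the start of a file."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX)       # blocks until no one else holds the lock
        raw = os.pread(fd, 32, 0)            # read the current value
        value = int(raw or b"0") + 1         # an empty file counts as zero
        os.pwrite(fd, str(value).encode(), 0)
        fcntl.flock(fd, fcntl.LOCK_UN)
        return value
    finally:
        os.close(fd)                         # closing also drops the lock on errors
```

Without the lock, two writers can both read the same old value and both write back `old + 1`, losing one update even though the page cache itself stayed perfectly coherent.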

12. What is the "shrinkable page cache" mechanism and how does it prevent memory exhaustion?

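The shrinker API lets kernel subsystems expose their reclaimable caches to the memory manager. Strictly speaking, ordinary page cache pages are reclaimed directly from the LRU lists. Shrinkers cover the other caches that grow alongside the page cache, such as the dentry cache (directory entries) and the inode cache. A subsystem registers callbacks that report how many objects it could free and that free a requested number of them. For example, the dentry cache shrinker frees unused directory entries when memory pressure is high.

This prevents memory exhaustion: when free memory runs low, reclaim (kswapd in the background, or direct reclaim in the allocating process) scans the page cache LRU lists and calls the registered shrinkers, asking each to free a share of its objects in proportion to the reclaim pressure. Shrinkers generally free their least-recently-used objects first. Without this mechanism, kernel caches would keep growing until allocations fail and the OOM killer fires.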

13. Why do mmap() writes to a file also go through the page cache?

Memory-mapped file writes are stored to a page in the page cache—the page is faulted in from disk if not present, modified in cache (marked dirty), and eventually written back by the flusher daemon. This is why mmap() writes are not immediately durable—msync() is needed to force writeback.

The advantage: mmap() provides zero-copy access to file data (the page is mapped directly into the process's address space). The disadvantage: writes are still cached and subject to the same write-back semantics as read()/write() I/O. Applications expecting direct I/O semantics from mmap() may have durability surprises.
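
A minimal sketch of a durable in-place update through a shared mapping. The function name `update_in_place` is hypothetical. Python's `mmap.flush()` is the wrapper around msync() on Unix, and the file must already be at least `offset + len(data)` bytes long.

```python
import mmap
import os
import tempfile


def update_in_place(path, offset, data):
    """Overwrite bytes through a shared mapping and force them to disk."""
    with open(path, "r+b") as f:
        # Length 0 maps the whole file; the default on Unix is a shared mapping.
        with mmap.mmap(f.fileno(), 0) as m:
            m[offset:offset + len(data)] = data  # dirties the page in the page cache
            m.flush()                            # msync(): wait for writeback
```

The assignment into the mapping is immediately visible to other readers of the file, because they share the same cached page. Only the `flush()` call makes it durable.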

14. What is the difference between fdatasync() and fsync()?

fsync() flushes both file data and metadata (atime, mtime, file size, extended attributes) to disk. fdatasync() flushes only data and enough metadata to retrieve the data later (e.g., file size changes that affect data retrieval). It omits flushing metadata that doesn't affect data access (like atime, ACLs, other extended attributes).

The performance difference: metadata syncs involve additional disk operations, such as writing the updated inode. fdatasync() is useful for applications (like databases) that need durability but don't care about non-essential metadata updates. PostgreSQL, for example, uses fdatasync as its default WAL sync method on Linux, which reduces disk operations without sacrificing data durability.
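
One pattern this enables is committing records into a preallocated file, so that the file size does not change at commit time. The sketch below assumes a platform with `os.posix_fallocate()` and `os.fdatasync()`; the helper names `preallocate` and `commit_record` are made up. How much metadata I/O this actually saves depends on the filesystem.

```python
import os
import tempfile


def preallocate(path, size):
    """Create a file of a fixed size and persist its metadata once, up front."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        os.posix_fallocate(fd, 0, size)  # reserve space; the file size becomes `size`
        os.fsync(fd)                     # full sync: data plus size and allocation
    except OSError:
        os.close(fd)
        raise
    return fd


def commit_record(fd, offset, record):
    """Overwrite inside the preallocated region, then sync the data only."""
    os.pwrite(fd, record, offset)
    os.fdatasync(fd)  # timestamps may stay unflushed; the record itself is durable
    return offset + len(record)
```

Each `commit_record()` call returns the offset for the next record, so a caller can chain commits without the file ever growing.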

15. How does tmpfs (shmem) interact with the page cache and swap?

tmpfs uses the same page cache infrastructure as regular files: its contents are stored in page cache pages attached to each file's mapping. The key difference is that there is no backing file on disk. tmpfs pages are swap-backed, so under memory pressure they can be written to swap rather than to a filesystem (unlike a ramdisk, whose pages cannot be swapped out). For this reason the kernel keeps tmpfs pages on the anonymous LRU lists rather than the file LRU lists.

tmpfs also contributes to /proc/meminfo's Shmem counter. Its pages are refcounted like other page cache pages and participate in the same reclaim algorithm. The size of a tmpfs mount is capped by its size= mount option (half of physical RAM by default), and what it can actually hold is bounded by available RAM plus swap.
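
To see how much memory tmpfs and other shared memory currently occupy, the relevant counters can be pulled out of /proc/meminfo. This is a small parsing sketch: the function names are hypothetical, and it assumes the usual `Name:   value kB` line format.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of byte counts."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if not sep or not parts:
            continue
        # Most lines carry a "kB" unit; a few (e.g. HugePages_Total) are plain counts.
        scale = 1024 if len(parts) > 1 and parts[1] == "kB" else 1
        fields[name.strip()] = int(parts[0]) * scale
    return fields


def shmem_summary(text):
    """Return the counters most relevant to tmpfs usage, in bytes."""
    info = parse_meminfo(text)
    return {key: info.get(key, 0) for key in ("Shmem", "Cached", "SwapFree")}
```

On a Linux machine, `shmem_summary(open("/proc/meminfo").read())` gives live values. Note that Shmem is counted inside Cached, so a large tmpfs makes the page cache look bigger than the amount of file data that can simply be dropped.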

16. What is the "writeback throttling" problem and how does vm.dirty_ratio address it?

Without throttling, a process that generates dirty pages rapidly can flood the page cache with dirty pages faster than the disk can write them. This causes "writeback saturation"—the system spends all time writing and no time doing useful work. vm.dirty_ratio (percentage of available RAM) is the threshold at which processes generating dirty pages are throttled—they block in balance_dirty_pages() until the flusher catches up.

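The related vm.dirty_background_ratio is the threshold at which the flusher daemon wakes up to begin writeback in the background (without blocking applications). Tuning these values is a trade-off: lower values start writeback earlier and keep stalls short at the cost of less write batching, while higher values absorb larger write bursts but risk long stalls once the hard limit is reached.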
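
The two thresholds are easy to estimate by hand. The sketch below is a simplification: it assumes the ratio-based settings are in effect (not vm.dirty_bytes or vm.dirty_background_bytes), uses the common defaults of 20 and 10 as fallbacks, and ignores the gradual throttling the kernel applies before the hard limit. Function names are hypothetical.

```python
def read_vm_ratio(name, default):
    """Read a percentage from /proc/sys/vm, falling back to a default."""
    try:
        with open(f"/proc/sys/vm/{name}") as f:
            return int(f.read())
    except (OSError, ValueError):
        return default


def dirty_thresholds(available_bytes, dirty_ratio=20, dirty_background_ratio=10):
    """Return (background_start, throttle_limit) in bytes of dirty data."""
    background = available_bytes * dirty_background_ratio // 100
    throttle = available_bytes * dirty_ratio // 100
    return background, throttle
```

For example, with 8 GB of available memory and the default ratios, background writeback starts at roughly 800 MB of dirty data and writers are blocked at roughly 1.6 GB.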

17. Why might an application see different data when reading a file it just wrote?

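On a local Linux filesystem this should not happen with ordinary buffered I/O: write() updates the page cache, and every read() of that file, through any file descriptor in any process, is served from the same cached pages. Write-back only delays when the data reaches disk, not when it becomes visible to readers.

When an application does see stale data, the cause is usually elsewhere. The most common one is user-space buffering: data written through stdio or another buffered library is still sitting in the process's own buffer and has not reached write() yet, so the fix is to flush that buffer (fflush()), not to fsync(). Other causes include network filesystems such as NFS, whose clients cache data and attributes and only guarantee close-to-open consistency; mixing O_DIRECT and buffered I/O on the same file; and MAP_PRIVATE mappings, which stop reflecting file changes once the process has modified its private copy of a page. On a local filesystem, fsync() addresses durability rather than visibility, and opening the reader with O_DIRECT makes coherency harder, not easier. A child created by fork() shares the same page cache as its parent and sees the parent's completed write() calls, but it also inherits a copy of any unflushed user-space buffers.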
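
One frequent source of this symptom on a local filesystem is user-space buffering rather than the kernel's cache. The following sketch shows it with Python's buffered file objects; the function names are made up, and it relies on the written data being smaller than the library's buffer.

```python
import os
import tempfile


def read_via_new_fd(path):
    """Read the file through a separate, unbuffered file descriptor."""
    fd = os.open(path, os.O_RDONLY)
    try:
        return os.read(fd, 4096)
    finally:
        os.close(fd)


def demo():
    """Return what a second reader sees before and after the writer flushes."""
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "out.txt")
        with open(path, "wb") as f:          # buffered writer
            f.write(b"hello")                # stays in the process's own buffer
            before = read_via_new_fd(path)   # the kernel has not seen the data yet
            f.flush()                        # issues write(); data enters the page cache
            after = read_via_new_fd(path)    # visible immediately, no fsync() needed
        return before, after
```

`demo()` returns `(b"", b"hello")`: the reader sees nothing until the writer flushes its library buffer, and sees everything right after, even though nothing has been forced to disk.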

18. What is the "stable page writeback" problem with RAID and how does it affect reliability?

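Stable page writeback means keeping a page's contents unchanged while that page is under writeback. Normally a process may modify a page cache page while the device is still reading it for I/O. That is harmless for a plain disk, but it breaks any layer that computes something over the page before the write completes: RAID 5/6 parity, block-layer data integrity checksums (DIF/DIX), or filesystem checksums can end up not matching the data that actually reached the disk. When the storage stack requires stable pages, the kernel makes writers wait until writeback of the page has finished before they may modify it, trading some write latency for consistent parity and checksums.

A related but separate RAID reliability issue is the controller's write-back cache. The page cache sends data to the RAID controller, the controller acknowledges the write once the data is in its own cache, and a power failure before the controller flushes to the physical disks loses data that the kernel believes is durable. Production storage systems should use a battery-backed or flash-backed write cache (BBWC), and many controllers fall back to write-through when the backup unit is missing or degraded. Journaling filesystems such as ext4 and XFS depend on cache flush commands being honored, so a journal alone does not protect against a volatile controller cache that acknowledges writes early.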

19. How does the page cache interact with encrypted filesystems like ext4 with encryption?

The page cache stores decrypted file content in plaintext pages. Encryption happens below the page cache, in the filesystem layer (fscrypt in the case of ext4), as data moves between the page cache and the block layer. When you read a file, the block layer reads encrypted blocks from disk, the filesystem decrypts them, and the page cache stores the plaintext result.

This means that even with an encrypted filesystem, sensitive data in the page cache is plaintext in RAM, and cold boot attacks can recover it. The encryption protects data at rest on disk, not data resident in memory. Use memory locking (mlock()) if you need to prevent sensitive pages from being swapped out.

20. What is the difference between page cache writeback and "journal commit" in journaling filesystems?

Page cache writeback writes file data pages to their on-disk locations. Journal commit writes metadata (and optionally data) to a dedicated journal area on disk as a sequential, atomic transaction record. The journal guarantees that committed metadata updates are applied completely or not at all, so the filesystem structure stays consistent after a crash, while page cache writeback handles the actual file content.

When a filesystem commits a transaction, it writes all modified metadata (and data, in data=journal mode) to the journal sequentially, then writes a commit record. After a crash, the filesystem replays the journal, reapplying committed transactions and discarding incomplete ones. The two mechanisms serve different durability purposes, but they are coordinated: in ext4's default data=ordered mode, for example, file data blocks are written out before the metadata that references them is committed, so a recovered file does not point at blocks that were never written.

Conclusion

The page cache represents one of the most effective performance optimizations in operating systems—keeping recently-accessed disk blocks in RAM eliminates disk I/O for cache hits, dramatically reducing read latency and batching writes for improved disk throughput. Write-back caching (Linux default) delivers high performance by returning immediately after writing to cache, but requires explicit fsync() calls for durability guarantees.

The unified page cache (merged from historical separate buffer and page caches) simplifies memory management and eliminates double-caching. O_DIRECT bypasses the page cache for applications like databases that implement their own buffer management. posix_fadvise() enables applications to hint access patterns, allowing the kernel to prefetch aggressively or evict pages that won’t be reused.

Looking forward, several trends are reshaping caching. Persistent memory (PMEM) blurs the line between storage and memory, potentially eliminating some page cache benefits. io_uring provides an asynchronous submission interface for both buffered and direct I/O, which reduces system call overhead and makes cache-bypassing designs easier to build. Pressure around memory efficiency continues to drive refinement of page cache eviction policies and fsync() batching behavior.
