Buffer Cache & Page Cache
How operating systems cache disk blocks in RAM, the difference between write-back and write-through caching, and how sync semantics work.
Introduction
Every read from disk and every write to disk passes through RAM. This is not an architectural necessity (your OS could read directly from disk on every file access) but a deliberate design choice. Memory is orders of magnitude faster than spinning storage, and disk traffic is the primary bottleneck in many computing tasks. The solution: keep recently used disk blocks in RAM, where the CPU can access them in nanoseconds instead of milliseconds.
The buffer cache (historically) and page cache (modern Linux) are the OS mechanisms that implement this caching layer. When you read a file, the OS checks if the blocks are already in cache. If they are, no disk I/O occurs—pure memory access. If not, the OS reads from disk, stores the blocks in cache for future use, then delivers the data. For writes, the OS can either write-through (sync to disk immediately) or write-back (buffer in RAM and flush to disk later).
Understanding caching mechanics is essential for database administrators tuning performance, developers optimizing I/O patterns, and system administrators diagnosing memory pressure.
When to Use / When Not to Use
When Page Cache Is Your Friend
- Repeated reads: If your application reads the same data multiple times, the second read is nearly free
- Write-heavy workloads: Write-back caching batches multiple small writes into sequential disk operations
- Boot and application startup: Frequently-used libraries and configuration files stay cached after first use
- Database workloads: PostgreSQL relies on the page cache alongside its own shared buffers, and MySQL/InnoDB does the same unless it is configured for O_DIRECT
When Caching Causes Problems
- Database with its own cache: Using both OS page cache and database buffer pool wastes memory; consider O_DIRECT
- Real-time requirements: Write-back delays mean data isn’t on disk; use O_SYNC or fsync()
- Memory-constrained systems: Large page cache can cause swapping if not properly sized
- Predictable latency: Buffered I/O introduces variable latency depending on cache hits
Write Policy Selection
| Policy | When Data Is Durable | Performance | Risk |
|---|---|---|---|
| Write-through | Immediately on write | Lower (waits for disk) | Minimal (no data loss on crash) |
| Write-back | On flush/sync | Higher (returns immediately) | Data loss possible if crash before flush |
| O_DIRECT | Bypasses cache entirely | Depends on usage | Application must manage durability |
Architecture or Flow Diagram
The following diagram shows how page cache integrates with the VFS and filesystem layers:
flowchart TB
subgraph "Application"
APP["Application\nread() / write()"]
end
subgraph "VFS Layer"
VFS["VFS\n(superblock, inode, dentry)"]
end
subgraph "Page Cache"
PC["Page Cache\n(radix tree: page → disk blocks)"]
end
subgraph "Filesystem"
FS["Ext4 / XFS / Btrfs"]
end
subgraph "Block Layer"
BL["Block Layer\n(request queue, elevator)"]
end
subgraph "Device"
DISK["Disk / NVMe"]
end
APP --> VFS
VFS --> PC
PC -->|Cache HIT\n(no disk I/O)| APP
PC -->|Cache MISS\nread from disk| FS
FS --> BL
BL --> DISK
style PC stroke:#ffffff
style VFS stroke:#00fff9
style FS stroke:#00fff9
Linux’s page cache is a unified system: instead of separate buffer and page caches (as in older UNIX), there is a single page cache. A per-file radix tree (replaced by the XArray in Linux 4.20 and later) maps file offsets, in page-sized units, to struct page structures, which in turn hold the actual data. When a file is read, pages are allocated, filled from disk, and inserted into the tree.
Core Concepts
Page Cache Structure
The page cache organizes data in pages (typically 4KB). Each cached file has a radix tree that maps file byte offsets to page structures:
// Simplified page cache structures (field names follow older 2.6/3.x kernels)
struct address_space {
    struct inode *host;                 // Owning inode
    struct radix_tree_root page_tree;   // Maps page index → struct page
    spinlock_t tree_lock;               // Protects page_tree
    atomic_t i_mmap_writable;           // Count of writable shared mappings
    struct rb_root i_mmap;              // Tree of mmap'd regions of this file
    // ... other fields
};
// Each file's inode embeds an address_space and points at the one in use
struct inode {
    // ...
    struct address_space *i_mapping;    // Usually points to &i_data
    struct address_space i_data;
    // ...
};
// struct page represents one cached page
struct page {
    unsigned long flags;
    struct address_space *mapping;      // Which file this belongs to
    pgoff_t index;                      // Offset within file (in pages)
    void *virtual;                      // Kernel virtual address (some configurations only)
    // ... reference counting, dirty bits, etc.
};
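To make the offset-to-page mapping concrete, here is a small Python sketch (not kernel code) of the arithmetic behind `index` and the in-page offset. It assumes 4 KiB pages; `locate` and `pages_spanned` are hypothetical helper names used only for illustration.

```python
PAGE_SHIFT = 12                 # assumption: 4 KiB pages
PAGE_SIZE = 1 << PAGE_SHIFT     # 4096


def locate(pos: int) -> tuple[int, int]:
    """Return (page index, byte offset inside that page) for a file position."""
    return pos >> PAGE_SHIFT, pos & (PAGE_SIZE - 1)


def pages_spanned(pos: int, length: int) -> range:
    """Page indices touched by a read or write of `length` bytes at `pos`."""
    if length <= 0:
        return range(0)
    first = pos >> PAGE_SHIFT
    last = (pos + length - 1) >> PAGE_SHIFT
    return range(first, last + 1)
```

A 200-byte access starting at byte 4000 touches pages 0 and 1, which is why even a small read can need two cache lookups.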
Read Path
// Simplified page cache read path (illustrative: handles one page per call,
// and copy_page_to_user() stands in for the real copy helpers)
ssize_t generic_file_read(struct file *filp, char __user *buf,
                          size_t len, loff_t *ppos)
{
    struct address_space *mapping = filp->f_mapping;
    struct page *page;
    pgoff_t index = *ppos >> PAGE_SHIFT;
    size_t offset = *ppos & ~PAGE_MASK;
    int error;

    // Never copy past the end of this page
    if (len > PAGE_SIZE - offset)
        len = PAGE_SIZE - offset;

    // Try to find page in cache
    page = find_get_page(mapping, index);
    if (!page) {
        // Cache miss: allocate page and read from disk
        page = __page_cache_alloc(GFP_KERNEL);
        if (!page)
            return -ENOMEM;
        // Insert into page cache (the page comes back locked)
        error = add_to_page_cache_lru(page, mapping, index, GFP_KERNEL);
        if (error) {
            put_page(page);
            return error;
        }
        // Start the read via the filesystem; the page is unlocked when I/O completes
        error = mapping->a_ops->readpage(filp, page);
        if (error) {
            put_page(page);
            return error;
        }
    }
    if (!PageUptodate(page)) {
        // Page in cache but not yet filled: wait for the in-flight read
        lock_page(page);
        unlock_page(page);
        if (!PageUptodate(page)) {
            put_page(page);
            return -EIO;
        }
    }
    // Copy data to user
    copy_page_to_user(page, buf, offset, len);
    *ppos += len;
    put_page(page);
    return len;
}
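The hit/miss flow of the read path can also be modelled in user space. The following is a toy Python sketch, not how the kernel is implemented: a dictionary stands in for the radix tree, a `bytes` object stands in for the disk, and `ToyPageCache` is a hypothetical name.

```python
PAGE_SIZE = 4096  # assumption: 4 KiB pages


class ToyPageCache:
    """Read-through cache keyed by page index, mimicking the hit/miss flow."""

    def __init__(self, backing: bytes):
        self.backing = backing      # stands in for the disk
        self.pages = {}             # page index -> bytes (the "radix tree")
        self.hits = 0
        self.misses = 0

    def _get_page(self, index: int) -> bytes:
        page = self.pages.get(index)
        if page is None:
            # Cache miss: "read from disk" and insert into the cache.
            self.misses += 1
            start = index * PAGE_SIZE
            page = self.backing[start:start + PAGE_SIZE]
            self.pages[index] = page
        else:
            self.hits += 1
        return page

    def read(self, pos: int, length: int) -> bytes:
        out = bytearray()
        end = min(pos + length, len(self.backing))   # clamp at end of "file"
        while pos < end:
            index, offset = divmod(pos, PAGE_SIZE)
            chunk = self._get_page(index)[offset:offset + (end - pos)]
            out += chunk
            pos += len(chunk)
        return bytes(out)
```

Reading the same range twice shows the effect: the first call records misses for every page it touches, the second records only hits.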
Write Path
Linux uses write-back caching by default:
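// Simplified page cache write path (illustrative: one page per call,
// copy_page_from_user() stands in for the real copy helpers)
ssize_t generic_file_write(struct file *filp, const char __user *buf,
                           size_t len, loff_t *ppos)
{
    struct address_space *mapping = filp->f_mapping;
    struct page *page;
    pgoff_t index = *ppos >> PAGE_SHIFT;
    size_t offset = *ppos & ~PAGE_MASK;

    // Never copy past the end of this page
    if (len > PAGE_SIZE - offset)
        len = PAGE_SIZE - offset;

    // Find the page in cache, or allocate and insert it (returned locked)
    page = find_or_create_page(mapping, index, GFP_KERNEL);
    if (!page)
        return -ENOMEM;
    // ... for a partial-page write, read the existing contents first ...
    // Copy user data into the page
    copy_page_from_user(page, buf, offset, len);
    // Mark page dirty: it will be written back later
    set_page_dirty(page);
    unlock_page(page);
    // Update file metadata
    file_update_time(filp);
    *ppos += len;
    put_page(page);
    return len;
}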
The set_page_dirty() function marks the page as requiring writeback. Kernel flusher threads (pdflush before Linux 2.6.32, per-device writeback workers since then) eventually write dirty pages to disk.
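The write-back behaviour described above can be illustrated with a toy model. This Python sketch is an assumption-laden simplification, not kernel code: `ToyWriteBackCache` is a hypothetical class, a `bytearray` plays the role of the disk, and file size is not tracked (the "disk" simply grows in whole pages).

```python
PAGE_SIZE = 4096  # assumption: 4 KiB pages


class ToyWriteBackCache:
    def __init__(self, disk: bytearray):
        self.disk = disk            # stands in for stable storage
        self.pages = {}             # page index -> bytearray
        self.dirty = set()          # indices of pages awaiting writeback

    def _page(self, index: int) -> bytearray:
        if index not in self.pages:
            start = index * PAGE_SIZE
            page = bytearray(self.disk[start:start + PAGE_SIZE])
            page.extend(bytes(PAGE_SIZE - len(page)))   # zero-fill past EOF
            self.pages[index] = page
        return self.pages[index]

    def write(self, pos: int, data: bytes) -> int:
        written = 0
        while written < len(data):
            index, offset = divmod(pos + written, PAGE_SIZE)
            n = min(PAGE_SIZE - offset, len(data) - written)
            self._page(index)[offset:offset + n] = data[written:written + n]
            self.dirty.add(index)   # analogous to set_page_dirty()
            written += n
        return written              # returns before anything reaches "disk"

    def flush(self) -> int:
        """Write every dirty page back, like the flusher or fsync()."""
        flushed = 0
        for index in sorted(self.dirty):
            start = index * PAGE_SIZE
            if len(self.disk) < start + PAGE_SIZE:
                self.disk.extend(bytes(start + PAGE_SIZE - len(self.disk)))
            self.disk[start:start + PAGE_SIZE] = self.pages[index]
            flushed += 1
        self.dirty.clear()
        return flushed
```

Until `flush()` runs, the backing `bytearray` is unchanged, which is exactly the window in which a crash would lose the write.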
Dirty Page Writeback
// Conceptual sketch of dirty page throttling (not the actual kernel code;
// dirtyable_memory() and dirty_waitq are hypothetical names)
void balance_dirty_pages(struct address_space *mapping)
{
    unsigned long nr_dirty = global_node_page_state(NR_FILE_DIRTY);
    // vm_dirty_ratio is a percentage, so convert it to a page count
    unsigned long thresh = dirtyable_memory() * vm_dirty_ratio / 100;

    // Throttle if too many dirty pages
    // This prevents writing faster than the disk can handle
    if (nr_dirty > thresh) {
        // Sleep and wait for flusher to catch up
        wait_event(dirty_waitq,
                   global_node_page_state(NR_FILE_DIRTY) <= thresh);
    }
}
// The flusher wakes periodically or when dirty pages expire
// (flusher_loop is a hypothetical name for the writeback worker)
int flusher_loop(void *data)
{
    for (;;) {
        // Find filesystems with dirty pages
        // Write out pages older than dirty_expire_centisecs
        // Wait for I/O completion
        // Sleep for dirty_writeback_centisecs (default 500, i.e. 5 seconds)
        schedule_timeout_interruptible(dirty_writeback_interval * HZ / 100);
    }
}
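The threshold logic is easier to see with numbers. This Python sketch is a deliberate simplification: it treats both thresholds as a plain percentage of a caller-supplied count of dirtyable pages, whereas the real kernel calculation applies further adjustments. The function names are hypothetical, and the defaults of 20 and 10 are values commonly seen for `vm.dirty_ratio` and `vm.dirty_background_ratio`; check your own system.

```python
def dirty_thresholds(dirtyable_pages: int,
                     dirty_ratio: int = 20,
                     dirty_background_ratio: int = 10) -> tuple[int, int]:
    """Return (background_threshold, throttle_threshold) in pages."""
    background = dirtyable_pages * dirty_background_ratio // 100
    throttle = dirtyable_pages * dirty_ratio // 100
    return background, throttle


def writeback_state(nr_dirty: int, dirtyable_pages: int, **ratios) -> str:
    """Classify the current dirty page count against both thresholds."""
    background, throttle = dirty_thresholds(dirtyable_pages, **ratios)
    if nr_dirty >= throttle:
        return "throttle-writers"       # writers block until the flusher catches up
    if nr_dirty >= background:
        return "background-writeback"   # flusher runs, writers are not blocked
    return "idle"
```

With one million dirtyable pages and the defaults above, background writeback starts at 100,000 dirty pages and writers are throttled at 200,000.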
Buffer Cache vs Page Cache (Historical Context)
Older UNIX systems used a separate buffer cache for block device blocks and a page cache for file pages. Linux unified the two during the 2.4 series, leaving just the page cache. The buffer cache still exists conceptually (struct buffer_head), but it is a thin layer on top of the page cache.
// buffer_head: legacy structure layered on page cache
struct buffer_head {
    unsigned long b_state;      // Buffer state bitmap (BH_* flags below)
    struct page *b_page;        // Parent page
    sector_t b_blocknr;         // Block number on disk
    size_t b_size;              // Block size
    char *b_data;               // Pointer to data within page
    // ...
};
// Selected bits in b_state (see enum bh_state_bits):
//   BH_Uptodate   Contains valid data
//   BH_Dirty      Needs writeback
//   BH_Lock       Locked, I/O in progress
//   BH_Req        Has been submitted for I/O
//   BH_Mapped     Maps a valid disk block
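This layering is why fsync() has to flush more than data pages: the filesystem also writes the inode and any metadata blocks tracked through buffer_head. The buffer_head layer manages block-level I/O, while the page cache serves both read()/write() and memory-mapped file access.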
Production Failure Scenarios
Scenario 1: Data Loss from Write-Back Caching
What happened: A web application’s “save” function returned success to users immediately after writing to the page cache, but the data sat in dirty pages for 30+ seconds before the flusher daemon wrote it to disk. A power failure lost those 30 seconds of user data.
Detection: Users reported data loss after power outage. Monitoring of Dirty in /proc/meminfo showed high dirty page counts before the crash. dmesg showed no disk errors.
Mitigation:
// Application must use fsync() to ensure durability
int save_user_data(int fd, const void *buf, size_t len)
{
    ssize_t written = write(fd, buf, len);
    if (written < 0 || (size_t)written != len)
        return -1;  // Error or short write; caller should retry or report
    // Force the data to disk before returning "success"
    if (fsync(fd) < 0)
        return -1;  // Durability is not guaranteed
    return 0;       // Data is now durable
}
// Alternative: open with O_SYNC for synchronous writes
int fd = open("data.txt", O_WRONLY | O_CREAT | O_SYNC, 0644);
// Every write() now blocks until data is on disk
write(fd, buf, len); // Returns only after disk write completes
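A common way to apply the same idea when replacing a whole file is to write a temporary file, fsync it, rename it over the target, and fsync the directory. The Python sketch below shows that pattern under stated assumptions: `durable_write` and the `.tmp` suffix are hypothetical choices, and behaviour on a given filesystem should be verified rather than taken from this example.

```python
import os


def durable_write(path: str, data: bytes) -> None:
    """Write `data` to `path` so that it survives a crash once this returns."""
    tmp = path + ".tmp"             # hypothetical temp-file naming scheme
    fd = os.open(tmp, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        view = memoryview(data)
        while view:                 # os.write() may write fewer bytes than asked
            view = view[os.write(fd, view):]
        os.fsync(fd)                # flush file data and metadata
    finally:
        os.close(fd)
    os.replace(tmp, path)           # atomic rename over the old file
    dir_fd = os.open(os.path.dirname(os.path.abspath(path)), os.O_RDONLY)
    try:
        os.fsync(dir_fd)            # persist the directory entry as well
    finally:
        os.close(dir_fd)
```

The rename step means readers see either the old file or the complete new one, never a half-written mix.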
Scenario 2: Memory Pressure Causing Thrashing
What happened: A large analytics job read 80GB of data files sequentially, but the files were larger than available RAM. The page cache filled with data that would never be reused, then the job moved to new files, evicting old ones. The OS spent more time managing page cache eviction than doing useful work—effective I/O throughput collapsed.
Detection: free showed nearly zero available memory, iostat showed constant disk I/O (thrashing), top showed high sy (system) CPU.
Mitigation:
// Use O_DIRECT to bypass page cache for large sequential scans
// (requires aligned buffers: use posix_memalign or memalign)
int fd = open("analytics.bin", O_RDONLY | O_DIRECT);
if (fd >= 0) {
    void *buf;
    if (posix_memalign(&buf, 4096, 64 * 1024) == 0) {  // Aligned for direct I/O
        ssize_t n;
        while ((n = read(fd, buf, 64 * 1024)) > 0) {
            process_data(buf, n);  // Only n bytes are valid; nothing is cached
        }
        free(buf);
    }
    close(fd);
}
// Or keep buffered I/O (a descriptor opened without O_DIRECT) and use posix_fadvise() hints
posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);          // "I'll read sequentially": larger readahead
// ... read and process the range [offset, offset + nbytes) ...
posix_fadvise(fd, offset, nbytes, POSIX_FADV_DONTNEED);  // Drop pages that will not be reused
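For comparison, here is a Python sketch of the buffered-I/O variant: read once, sequentially, and advise the kernel to drop each chunk after it has been processed. `stream_without_polluting_cache` is a hypothetical name, the chunk size is an arbitrary choice, and the advice is only a hint that the kernel may ignore.

```python
import os


def stream_without_polluting_cache(path: str, chunk_size: int = 1 << 20) -> int:
    """Read a file once, sequentially, dropping each chunk from cache after use."""
    total = 0
    fd = os.open(path, os.O_RDONLY)
    try:
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_SEQUENTIAL)
        offset = 0
        while True:
            chunk = os.read(fd, chunk_size)
            if not chunk:
                break
            total += len(chunk)     # placeholder for real processing
            # Advisory only: the kernel may keep pages that are dirty or in use.
            os.posix_fadvise(fd, offset, len(chunk), os.POSIX_FADV_DONTNEED)
            offset += len(chunk)
    finally:
        os.close(fd)
    return total
```

Unlike O_DIRECT, this keeps readahead working and needs no aligned buffers, at the cost of one extra system call per chunk.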
Scenario 3: Page Cache Corruption During Crash
What happened: A misbehaving kernel module wrote to a page that had been truncated from a file but remained in the page cache. When the truncated page was later reused, the corrupted data was “restored” from cache, overwriting freshly-written file data.
Detection: Corrupted file contents after system crash; md5sum of files changed mysteriously.
Mitigation:
// Kernel code must properly invalidate cache on truncate
int my_file_truncate(struct inode *inode, loff_t new_size)
{
    // Update i_size, then drop cached pages beyond the new size.
    // truncate_setsize() does both in that order (it ends up in the
    // same page-removal code as truncate_inode_pages()).
    truncate_setsize(inode, new_size);
    // Record the metadata change
    mark_inode_dirty(inode);
    // Now safe: evicted pages won't corrupt new file data
    return 0;
}
// User space can use sync_file_range for fine-grained writeback control.
// Note: it does not flush metadata or the device cache, so it is not a
// substitute for fsync() when durability is required.
sync_file_range(fd, offset, nbytes,
                SYNC_FILE_RANGE_WRITE |       // Start writeout
                SYNC_FILE_RANGE_WAIT_AFTER);  // Wait for completion
Trade-off Table
| Caching Strategy | Write Latency | Read Performance | Durability | Memory Efficiency |
|---|---|---|---|---|
| Full page cache (default) | Very low (returns immediately) | High (reuses cached) | Poor (relies on sync) | Low for sequential |
| O_SYNC writes | Very high (blocks on disk) | High | Excellent | Same as default |
| O_DIRECT reads | N/A | Varies (no cache) | N/A | Excellent (no waste) |
| Write-through cache | High | High | Good | Moderate |
| fsync() per operation | Medium (batched) | High | Excellent | High |
| No cache (raw block device) | Depends on driver | Low for repeated | N/A | N/A |
Implementation Snippets
Checking and Controlling Page Cache Behavior
#!/usr/bin/env python3
"""
Python utilities for monitoring and controlling page cache.
"""
import os
import sys
from ctypes import CDLL, c_int, c_int64

libc = CDLL("libc.so.6", use_errno=True)
# int posix_fadvise64(int fd, off64_t offset, off64_t len, int advice)
libc.posix_fadvise64.argtypes = [c_int, c_int64, c_int64, c_int]
libc.posix_fadvise64.restype = c_int

# posix_fadvise advice values (Linux)
POSIX_FADV_NORMAL = 0
POSIX_FADV_RANDOM = 1
POSIX_FADV_SEQUENTIAL = 2
POSIX_FADV_WILLNEED = 3
POSIX_FADV_DONTNEED = 4
POSIX_FADV_NOREUSE = 5


def _fadvise(fd: int, offset: int, length: int, advice: int):
    """Call posix_fadvise64(); it returns an error number rather than setting errno."""
    err = libc.posix_fadvise64(fd, offset, length, advice)
    if err != 0:
        raise OSError(err, os.strerror(err))


def drop_caches():
    """Drop page cache, dentries and inodes (requires root)."""
    os.sync()  # Write dirty pages first; drop_caches only frees clean ones
    with open('/proc/sys/vm/drop_caches', 'w') as f:
        f.write('3')
    print("Page cache dropped")


def readahead_file(fd: int, offset: int = 0, len_hint: int = 0):
    """
    Hint to kernel that file content will be read sequentially.
    POSIX_FADV_SEQUENTIAL doubles the readahead window.
    """
    _fadvise(fd, offset, len_hint, POSIX_FADV_SEQUENTIAL)


def prefetch_file(fd: int, offset: int = 0, len_hint: int = 0):
    """
    Tell kernel to eagerly load pages into page cache.
    """
    _fadvise(fd, offset, len_hint, POSIX_FADV_WILLNEED)


def evict_cached_pages(fd: int, offset: int = 0, len_hint: int = 0):
    """
    Ask the kernel to drop clean cached pages in the range.
    Dirty pages are not discarded; call os.fsync(fd) first if they should be evicted too.
    """
    _fadvise(fd, offset, len_hint, POSIX_FADV_DONTNEED)


def get_cache_stats():
    """Read /proc/meminfo for page cache statistics."""
    stats = {}
    with open('/proc/meminfo') as f:
        for line in f:
            if ':' in line:
                key, value = line.split(':', 1)
                stats[key.strip()] = value.strip()
    return stats


if __name__ == "__main__":
    print("=== Page Cache Statistics ===\n")
    stats = get_cache_stats()
    fields = ['Buffers', 'Cached', 'SwapCached',
              'Active(file)', 'Inactive(file)',
              'Dirty', 'Writeback']
    for field in fields:
        print(f" {field}: {stats.get(field, 'N/A')}")
    # If run as root, offer to drop cache
    if os.geteuid() == 0 and len(sys.argv) > 1 and sys.argv[1] == '--drop':
        print("\nDropping caches...")
        drop_caches()
Monitoring Page Cache Effectiveness
#!/bin/bash
# Monitor page cache hit ratio and effectiveness
echo "=== Memory and Page Cache Status ==="
grep -E "^(MemTotal|MemFree|Buffers|Cached|Active|Inactive|Dirty|Writeback):" /proc/meminfo
echo ""
echo "=== Block I/O activity (from sar, if available) ==="
if command -v sar &> /dev/null; then
    # sar -b reports transfer rates, not cache hit ratios
    sar -b 1 5 | tail -10
fi
echo ""
echo "=== Per-Device I/O Statistics ==="
iostat -x 1 5
echo ""
echo "=== Checking for memory pressure (zone watermarks) ==="
awk '/^Node/ || /^ +(min|low|high) / {print}' /proc/zoneinfo | tail -20
echo ""
echo "=== VM Tunables ==="
echo "dirty_ratio: $(cat /proc/sys/vm/dirty_ratio)"
echo "dirty_background_ratio: $(cat /proc/sys/vm/dirty_background_ratio)"
echo "dirty_expire_centisecs: $(cat /proc/sys/vm/dirty_expire_centisecs)"
echo "dirty_writeback_centisecs: $(cat /proc/sys/vm/dirty_writeback_centisecs)"
Caching Behavior in Applications
#define _GNU_SOURCE  /* required for O_DIRECT on Linux */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <errno.h>
/*
* Demonstration of different caching behaviors:
* 1. Normal buffered I/O (uses page cache)
* 2. O_DIRECT bypasses page cache
* 3. Proper fsync usage for durability
*/
#define BUFFER_SIZE (64 * 1024)
void demonstrate_fsync(int fd) {
char buf[BUFFER_SIZE];
ssize_t written;
/* Write some data (initialize the buffer so we do not write garbage) */
memset(buf, 0x41, BUFFER_SIZE);
written = write(fd, buf, BUFFER_SIZE);
if (written < 0) {
perror("write");
return;
}
printf("write() returned: %zd (page cache, not yet on disk)\n", written);
/* Ensure data reaches disk */
if (fsync(fd) < 0) {
perror("fsync");
return;
}
printf("fsync() completed (data now durable on disk)\n");
}
int demonstrate_direct_io(void) {
    int fd;
    char *buf;
    ssize_t ret;

    /* O_DIRECT requires aligned memory; posix_memalign() returns an error
       number and does not set errno */
    if (posix_memalign((void**)&buf, 4096, BUFFER_SIZE) != 0) {
        fprintf(stderr, "posix_memalign failed\n");
        return 1;
    }
    /* Open with O_DIRECT to bypass page cache */
    fd = open("/tmp/test_direct_io", O_RDWR | O_CREAT | O_DIRECT, 0644);
    if (fd < 0) {
        /* Some filesystems reject O_DIRECT (EINVAL), e.g. tmpfs on older kernels */
        perror("open(O_DIRECT)");
        free(buf);
        return 1;
    }
    printf("Using O_DIRECT (bypasses page cache)\n");
    /* With O_DIRECT, each read/write is direct to/from disk */
    memset(buf, 0x42, BUFFER_SIZE);
    ret = write(fd, buf, BUFFER_SIZE);
    if (ret < 0)
        perror("write");
    else
        printf("Direct write returned: %zd (disk I/O, synchronous or nearly so)\n", ret);
    /* O_DIRECT skips the page cache but not metadata updates or the drive's
       volatile cache, so fsync() is still needed for durability */
    if (fsync(fd) < 0)
        perror("fsync");
    free(buf);
    close(fd);
    return 0;
}
int main(int argc, char *argv[]) {
int fd;
const char *path = "/tmp/page_cache_demo";
printf("=== Normal buffered I/O ===\n");
fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
if (fd < 0) {
perror("open");
return 1;
}
demonstrate_fsync(fd);
close(fd);
printf("\n=== Direct I/O ===\n");
demonstrate_direct_io();
return 0;
}
Observability Checklist
Linux Page Cache Metrics
# Check overall memory and cache status
cat /proc/meminfo | grep -E "^(Buffers|Cached|Anon|Active|Inactive|Dirty|Writeback|Shmem):"
# Per-process page cache usage
cat /proc/$(pidof myapp)/smaps_rollup | grep -E "^(Private|Rss|Shared|Pss):"
# Page cache efficiency
# (need to calculate from vfs_cache_pressure and other metrics)
cat /proc/meminfo | grep -E "^(Active\(file\)|Inactive\(file\)|SReclaimable):"
# Check dirty page writeback status
cat /proc/vmstat | grep -E "^(nr_dirty|nr_writeback|nr_dirty_threshold|nr_free_pages):"
# View writeback activity in real-time
watch -n 1 'cat /proc/vmstat | grep -E "pgpgin|pgpgout|pswpin|pswpout|pgfree|pgactivate"'
# Block device I/O (includes writeback)
iostat -x 1
# Per-CPU statistics
mpstat -P ALL 1
Key Metrics to Monitor
| Metric | Healthy Range | Alert Threshold | Indicates |
|---|---|---|---|
| Cached in /proc/meminfo | Depends on workload | Shrinking rapidly | Memory pressure |
| Dirty in /proc/meminfo | < 5% of RAM | > 10% of RAM | Writeback backlog |
| Writeback in /proc/meminfo | 0 or very low | > 0 sustained | Disk can’t keep up |
| pgpgin/pgpgout rate | Baseline | Sudden change | Abnormal I/O |
| pgfault rate | Baseline | Spikes | Page cache miss thrash |
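The Dirty thresholds in the table are expressed as a share of RAM, so a monitoring check has to do a small calculation. A minimal Python sketch, assuming the usual `/proc/meminfo` layout of `Name: value kB`; `parse_meminfo` and `dirty_percent` are hypothetical helper names.

```python
def parse_meminfo(text: str) -> dict[str, int]:
    """Parse /proc/meminfo-style text into {field: value} (values are in kB)."""
    fields = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            fields[key.strip()] = int(parts[0])
    return fields


def dirty_percent(meminfo: dict[str, int]) -> float:
    """Dirty pages as a percentage of MemTotal."""
    return 100.0 * meminfo.get("Dirty", 0) / meminfo["MemTotal"]


if __name__ == "__main__":
    with open("/proc/meminfo") as f:
        print(f"Dirty: {dirty_percent(parse_meminfo(f.read())):.2f}% of RAM")
```

Sampling this value periodically and alerting on a sustained rise is usually more informative than a single reading.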
Common Pitfalls / Anti-Patterns
Page Cache Security Considerations
- Sensitive Data in Page Cache: Passwords, encryption keys, and sensitive file content may remain in page cache after use. On multi-tenant or shared systems, forensic analysis can sometimes recover data from page cache. Use mlock() to prevent sensitive pages from being swapped, and keep secrets in anonymous memory (for example a memfd_create() region) rather than in regular files so they never enter the disk-backed page cache. Note that F_SEAL_SEAL only blocks further sealing; it does not change caching.
- Page Cache as Information Leak: The existence of data in page cache (not just content) can reveal access patterns. A state-level adversary could infer which files a target accessed based on page cache residency, even if content is protected. Full disk encryption doesn’t prevent this timing analysis.
- DMA Attacks on Page Cache: As discussed in the DMA post, DMA-capable devices can read/write page cache memory. Enable IOMMU to prevent unauthorized DMA access.
Compliance Considerations
- PCI DSS: Requirement 3.4 covers rendering stored cardholder data (PAN) unreadable; assess whether cardholder data lingering in memory and page cache is in scope for your environment
- HIPAA: PHI in page cache must be protected—use encrypted filesystems or memory-protected regions
- EU GDPR: Page cache containing personal data may be considered “processing” and requires appropriate safeguards
Common Pitfalls / Anti-patterns
| Anti-pattern | Why It’s Bad | Correct Approach |
|---|---|---|
| Assuming write() is durable | write() only commits to page cache | Use fsync() or O_SYNC for durability |
| Large sequential reads without O_DIRECT | Pollutes page cache, may cause thrashing | Use posix_fadvise(POSIX_FADV_DONTNEED) after reads |
| Ignoring dirty_ratio limits | Application blocks when dirty pages exceed threshold | Tune vm.dirty_ratio / vm.dirty_background_ratio |
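| Not handling ENOSPC | With delayed allocation, running out of disk space may only surface as an error from a later write(), fsync(), or close() | Monitor available space; check the return value of write(), fsync(), and close() |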
| Using mmap() and expecting immediate disk writes | mmap writes are also buffered | msync() to force writeback |
| Changing file size in kernel code without invalidating the page cache | Pages beyond the new end of file may persist in cache | Use truncate_setsize() / truncate_inode_pages(); from user space, truncate() and O_TRUNC already do this |
Quick Recap Checklist
- Page cache stores disk blocks in RAM for faster access; cache hits avoid disk I/O entirely
- Write-back (default): write() returns immediately; data reaches disk later via flusher daemon
- Write-through: write() blocks until data reaches disk; use O_SYNC flag
- fsync() forces a file’s dirty pages and metadata to disk before returning
- O_DIRECT bypasses page cache for applications that manage their own caching (databases)
- posix_fadvise() lets applications hint access patterns to kernel for better cache handling
- Page cache is unified—file pages and block buffer pages share the same underlying mechanism
- Monitor dirty page count and writeback activity to detect I/O bottlenecks
- Memory pressure evicts page cache pages—ensure enough RAM for your workload
- Never assume data is on disk without fsync(); a crash between write() and sync will lose data
Interview Questions
Write-through caching writes data to both cache and underlying storage immediately—the write operation blocks until storage confirms the data is persisted. Write-back caching writes only to cache and returns immediately; the data is marked dirty and written to storage later by the flusher daemon. Linux defaults to write-back. Write-through is safer (no data loss on crash) but slower. Write-back is faster but risks data loss if the system crashes before the flusher daemon writes dirty pages to disk. Applications requiring durability must use fsync(), O_SYNC, or O_DSYNC regardless of caching mode.
What happens when you call fsync()? Walk through the entire process.
When you call fsync(fd): (1) The VFS layer resolves the file descriptor to the struct file and its struct inode. (2) It calls the filesystem's fsync hook (ext4_sync_file(), xfs_file_fsync(), etc.). (3) The filesystem submits all dirty pages belonging to that inode for writeback. (4) It also writes the inode's metadata (size, timestamps, block mappings), typically by committing the journal transaction that contains it. (5) It waits for all of that I/O to complete. (6) It issues a cache flush to the device so the data leaves any volatile drive cache, then returns 0, or an error if any write failed. fdatasync() is similar but skips metadata that is not needed to read the data back. sync() schedules writeback for every mounted filesystem rather than for a single file.
When would you use O_DIRECT instead of normal buffered I/O?
Use O_DIRECT when you want to bypass the page cache entirely. This is most appropriate for databases that implement their own sophisticated buffer management. For example, MySQL's InnoDB can be configured with innodb_flush_method=O_DIRECT to avoid double caching (both OS page cache and database buffer pool consuming RAM); PostgreSQL, by contrast, uses buffered I/O by default. O_DIRECT is also useful for very large sequential reads that won't be reused (preventing cache pollution), or for applications requiring deterministic I/O patterns without opaque caching. However, O_DIRECT requires aligned memory buffers (aligned to 512 or 4096 bytes depending on filesystem and device) and provides no read-ahead benefit. Most applications should use normal buffered I/O with posix_fadvise() hints.
Two primary reasons: (1) Memory pressure management—an application may hold large datasets that won't be accessed again, and explicitly dropping them (via posix_fadvise(POSIX_FADV_DONTNEED)) allows the kernel to use that memory for other purposes like file caching for active files. (2) Performance isolation—a batch job that processes many files sequentially might otherwise evict useful pages from the page cache. Dropping its accessed pages after processing ensures those evictions don't displace frequently-accessed data. This is especially important in systems running multiple workloads where "noisy neighbor" page cache pollution is a concern.
Modern Linux has a unified page cache; there is no separate buffer cache. Historically, UNIX had both: the buffer cache cached disk blocks for filesystems, while the page cache cached file content for memory-mapped files. Linux merged these into a single page cache during the 2.4 series. The struct buffer_head that remains is a thin layer on top of page cache pages: it tracks which disk blocks map to which page. When you read a file, the page is allocated, filled from disk, and cached as a page. Both memory-mapped file access and regular read()/write() access go through the same page cache, which simplifies coherency management and reduces duplicate caching.
The Linux kernel evicts page cache pages with an approximate LRU (Least Recently Used) scheme. By default, file pages sit on active and inactive lists; reclaim scans the inactive list and frees pages that have not been referenced recently, while pages that keep getting referenced are promoted to the active list. Kernels 6.1 and later can optionally use the multi-generational LRU (MGLRU), which groups pages into several generations instead of two lists. Reclaim runs in the background from kswapd and directly from allocating tasks under pressure (the code lives in mm/vmscan.c), and refault tracking helps the kernel notice pages that were evicted too early.
The /proc/meminfo Active(file) vs Inactive(file) counters show the split. Shmem (shared memory and tmpfs pages) also competes for the same memory. The kernel's vm.swappiness sysctl tunes the preference between reclaiming anonymous (swap-backed) and file-backed pages.
What is the difference between radix_tree_delete() and truncating pages in the page cache?
radix_tree_delete() is the low-level tree operation: it removes the entry at a known index from the radix tree and nothing more. It does not write back a dirty page, wait for in-flight I/O, unmap the page from processes, or fix up the page's accounting. Truncation via truncate_inode_pages() is the full page cache operation: for every page beyond the new size it waits for writeback in progress, unmaps the page, cancels its dirty state (the data no longer belongs to the file, so it is deliberately discarded rather than written), and then removes it from the tree.
The key difference: calling radix_tree_delete() directly on a file-backed page can lose data that should have been written and leave the cache inconsistent. Always use the proper truncate or invalidate path for file-backed pages. When the data must be kept, dirty pages have to go through writeback before the page is dropped.
How does O_SYNC differ from fsync() in terms of durability guarantees and performance?
O_SYNC is an open-mode flag that makes every write() synchronous: each write blocks until the data and metadata are on disk. fsync() is a per-file operation that forces pending writes and metadata updates to disk after a specific write sequence. fsync() is more granular: you batch writes, then sync at a commit point.
Performance: O_SYNC is slower because it waits for disk after every write, even small ones. fsync() lets you batch writes into larger I/O operations before syncing, which is far more efficient. Use O_SYNC only when every individual write must be durable (rare, e.g., write-ahead logging in databases), and use fsync() for periodic commits.
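The two approaches can be compared directly. The Python sketch below writes the same records either with O_SYNC on every write or with a single fsync() at the end, and returns the elapsed time; `write_records` is a hypothetical helper, and the timings it reports depend heavily on the storage device and filesystem.

```python
import os
import time


def write_records(path: str, records: list[bytes], per_write_sync: bool) -> float:
    """Write records and return elapsed seconds until all are durable."""
    flags = os.O_WRONLY | os.O_CREAT | os.O_TRUNC
    if per_write_sync:
        flags |= os.O_SYNC          # every write() waits for stable storage
    start = time.perf_counter()
    fd = os.open(path, flags, 0o644)
    try:
        for record in records:
            os.write(fd, record)    # short records; partial writes not handled
        if not per_write_sync:
            os.fsync(fd)            # one flush at the commit point
    finally:
        os.close(fd)
    return time.perf_counter() - start
```

Both variants leave identical file contents; the difference is how many times the caller waits for the device.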
The torn-page (partial write) problem: a database page is usually larger than the unit the storage stack writes atomically (for example an 8 KB or 16 KB page over 4 KB filesystem blocks or 512-byte sectors). If the system crashes while a page is being written back, the on-disk copy can contain part of the new page and part of the old one, which leaves the page corrupted.
The solution depends on the engine. InnoDB's doublewrite buffer writes each page to a reserved area on disk first and flushes it, then writes the page to its actual location; if a crash tears the in-place copy, recovery restores it from the doublewrite area. PostgreSQL instead logs a full image of a page in the WAL the first time it is modified after a checkpoint (full_page_writes). Both exist precisely because the write-back path does not guarantee that a multi-block page reaches disk atomically.
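The recovery idea can be sketched in a few lines. This Python example is an illustration only and not how any particular database lays out its pages: it assumes an 8 KiB page whose last four bytes hold a CRC32 of the rest, and `seal`, `is_intact`, and `recover` are hypothetical names.

```python
import zlib

PAGE_SIZE = 8192  # assumption: database page size, larger than a 4 KiB block


def seal(payload: bytes) -> bytes:
    """Pad payload to a full page and append a CRC32 of the page body."""
    body = payload.ljust(PAGE_SIZE - 4, b"\0")
    if len(body) != PAGE_SIZE - 4:
        raise ValueError("payload too large for one page")
    return body + zlib.crc32(body).to_bytes(4, "little")


def is_intact(page: bytes) -> bool:
    """True if the page is complete and its checksum matches its body."""
    body, stored = page[:-4], page[-4:]
    return len(page) == PAGE_SIZE and zlib.crc32(body).to_bytes(4, "little") == stored


def recover(in_place: bytes, doublewrite_copy: bytes) -> bytes:
    """Prefer the in-place page; fall back to the doublewrite copy if torn."""
    if is_intact(in_place):
        return in_place
    if is_intact(doublewrite_copy):
        return doublewrite_copy
    raise ValueError("both copies are damaged")
```

A torn page (first half new, second half old) fails the checksum, so recovery falls back to the copy that was flushed before the in-place write began.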
What is posix_fadvise() and how does POSIX_FADV_WILLNEED differ from readahead?
posix_fadvise() lets applications hint their access pattern to the kernel's page cache. POSIX_FADV_WILLNEED tells the kernel to eagerly prefetch the specified file range into page cache, essentially triggering readahead for that range immediately. POSIX_FADV_SEQUENTIAL doubles the readahead window for sequential access. POSIX_FADV_DONTNEED tells the kernel to drop the specified pages from cache.
The readahead() system call has much the same effect as POSIX_FADV_WILLNEED and also takes a file descriptor, offset, and length, but it is Linux-specific, whereas posix_fadvise() is the portable POSIX interface. Either is a reasonable way to prefetch explicitly before a known access pattern (like processing files in a known order in batch jobs).
Multiple processes writing to the same file share the same page cache pages. The kernel tracks an "offset within file" -> "struct page" mapping per file. On most Linux filesystems, buffered write() calls to the same file are serialized by a per-inode lock, and each page is locked while data is copied into it and again during I/O submission. Concurrent writes to non-overlapping regions therefore do not corrupt each other.
The key issue: concurrent writes to overlapping regions can result in lost updates (last-write-wins) at the application level unless coordinated via file locking (flock(), fcntl() advisory locks). The page cache itself doesn't enforce an ordering between writers; ordering comes only from the filesystem's own locking and from whichever write happens to run last.
The shrinker API lets subsystems that keep their own caches take part in memory reclaim. A subsystem registers a shrinker with callbacks that report how many objects could be freed and then free up to a requested number. For example, the dentry cache (filesystem entry cache) shrinker frees unused directory entries when memory pressure is high.
This complements page cache reclaim: when memory is low, the kernel shrinks the page cache LRU lists and also calls the registered shrinkers, each of which frees its least-recently-used objects. Without shrinkers, slab caches such as dentries and inodes could keep growing while page cache pages were evicted to make room for them.
Do mmap() writes to a file also go through the page cache?
Memory-mapped file writes are stored to a page in the page cache: the page is faulted in from disk if not present, modified in cache (marked dirty), and eventually written back by the flusher threads. This is why mmap() writes are not immediately durable, and why msync() is needed to force writeback.
The advantage: mmap() provides zero-copy access to file data (the page is mapped directly into the process's address space). The disadvantage: writes are still cached and subject to the same write-back semantics as read()/write() I/O. Applications expecting direct I/O semantics from mmap() may have durability surprises.
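A short Python sketch of the mmap-then-msync pattern follows. `update_in_place` is a hypothetical helper; it assumes the file already exists, is non-empty, and is large enough to hold the data at the given offset.

```python
import mmap
import os


def update_in_place(path: str, offset: int, data: bytes) -> None:
    """Modify a file through a shared mapping, then force writeback."""
    fd = os.open(path, os.O_RDWR)
    try:
        with mmap.mmap(fd, 0, access=mmap.ACCESS_WRITE) as mapped:
            mapped[offset:offset + len(data)] = data   # dirties page cache pages
            mapped.flush()                             # msync() under the hood
    finally:
        os.close(fd)
```

Without the `flush()` call the change would still become visible to other readers immediately, but it would reach disk only when the kernel got around to writing the dirty page back.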
What is the difference between fdatasync() and fsync()?
fsync() flushes both file data and metadata (atime, mtime, file size, extended attributes) to disk. fdatasync() flushes only data and enough metadata to retrieve the data later (e.g., file size changes that affect data retrieval). It omits flushing metadata that doesn't affect data access (like atime, ACLs, other extended attributes).
The performance difference: metadata syncs involve additional disk operations (inode updates to the inode table). fdatasync() is useful for applications (like databases) that need durability but don't care about non-essential metadata updates. PostgreSQL prefers fdatasync() for this reason—it reduces disk operations without sacrificing data durability.
tmpfs uses the same page cache infrastructure as regular files—it allocates struct page frames for its content. The key difference: tmpfs content lives entirely in RAM and can be swapped out under memory pressure (unlike ramdisk, which consumes memory pre-allocated). tmpfs pages in the page cache are tracked by the anonymous (inactive) LRU when swapped.
tmpfs also contributes to /proc/meminfo's Shmem counter. It is refcounted like other page cache pages and participates in the same reclaim algorithm. The size of tmpfs is limited by a mount option and by available swap space.
vm.dirty_ratio address it?
Without throttling, a process that generates dirty pages rapidly can flood the page cache with dirty pages faster than the disk can write them. This causes "writeback saturation": the system spends all its time writing and no time doing useful work. vm.dirty_ratio (percentage of available RAM) is the threshold at which processes generating dirty pages are throttled; they block in balance_dirty_pages() until the flusher catches up.
The related vm.dirty_background_ratio is the threshold at which the flusher daemon wakes up to begin writeback in the background (without blocking applications). Tuning these values trades write throughput for read responsiveness.
Because the write went to the page cache (write-back), not to disk. On a local Linux filesystem that alone does not produce stale reads: buffered read() and write() calls share the same page cache pages, so a read issued after write() returns sees the new data even through a different file descriptor or from another process. Stale or missing data appears when one side bypasses that shared cache.
Typical causes: the writer's data is still in a user-space buffer (for example stdio without fflush()), the reader uses O_DIRECT while the writer uses buffered I/O, the file lives on a network filesystem with weaker cache coherency such as NFS, or the system crashed before the dirty pages were flushed. The fix is to flush user-space buffers, avoid mixing direct and buffered I/O on the same file, and call fsync() at the points where data must be on disk.
In RAID write-back mode (write-back caching on the RAID controller), a power failure can lose data if the RAID controller acknowledges a write before it reaches disk. The page cache sends data to the RAID controller, the controller acknowledges (data in controller cache), but the system crashes before the controller flushes to physical disks.
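Many RAID controllers mitigate this by falling back to write-through when a battery-backed write cache (BBWC) is absent or its battery has failed. On the host side, Linux filesystems issue cache flush commands at journal commits and on fsync(), but these help only if the controller honors them. Production storage systems should have BBWC (or a flash-backed equivalent); a journaling filesystem (ext4, XFS) protects metadata consistency but cannot recover writes that the controller acknowledged and then lost.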
The page cache stores decrypted file content in plaintext pages. Encryption happens below the page cache, either in the filesystem layer (fscrypt) or in the block layer (dm-crypt), when data moves between the page cache and the disk. When you read a file, encrypted blocks are read from disk, decrypted on the way up, and the page cache stores the plaintext result.
This means that even with an encrypted filesystem, sensitive data in the page cache is plaintext in RAM, and cold boot attacks can recover it. The encryption protects data at rest on disk, not data held in memory. Use memory locking (mlock()) if you need to prevent sensitive pages in your process from being swapped out; it does not protect against attacks that read RAM directly.
Page cache writeback writes file data pages to their on-disk locations. Journal commit writes metadata (and optionally data) to a dedicated journal area on disk as a sequential, atomic transaction record. The journal guarantees that metadata updates are applied all-or-nothing across a crash, while page cache writeback handles the actual file content.
When a filesystem commits a transaction, it writes all modified metadata (and data, in data=journal mode) to the journal sequentially, then writes a commit record. After a crash, the filesystem replays the journal, reapplying committed transactions and discarding incomplete ones. This is separate from page cache writeback: the journal and data writeback serve different durability purposes.
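The replay rule described above can be sketched with a toy model. This is not the on-disk format of ext4 or any real journal; the record tuples ("begin", "write", "commit") and the function name replay are assumptions made purely for illustration:

```python
def replay(journal, disk):
    """Apply committed transactions from journal to a copy of disk.

    journal: list of ("begin", txid), ("write", txid, block, value),
             or ("commit", txid) records in the order they were logged.
    disk:    dict mapping block name to its current on-disk value.
    """
    state = dict(disk)
    pending = {}  # txid -> list of (block, value) not yet committed
    for record in journal:
        kind, txid = record[0], record[1]
        if kind == "begin":
            pending[txid] = []
        elif kind == "write":
            pending.setdefault(txid, []).append((record[2], record[3]))
        elif kind == "commit":
            # A commit record makes the whole transaction take effect.
            for block, value in pending.pop(txid, []):
                state[block] = value
    # Anything left in pending never committed and is discarded.
    return state

journal = [
    ("begin", 1), ("write", 1, "inode7", "size=4096"), ("commit", 1),
    ("begin", 2), ("write", 2, "inode9", "size=1"),  # crash before commit
]
print(replay(journal, {"inode7": "size=0", "inode9": "size=0"}))
# {'inode7': 'size=4096', 'inode9': 'size=0'}
```

Transaction 2 has no commit record, so its update to inode9 is dropped, which is exactly the all-or-nothing behaviour a journal is meant to provide.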
Further Reading
- Linux Kernel Documentation: Page Cache — Official page cache documentation
- Linux Kernel Documentation: VFS Layer — Virtual File System documentation covering inode and address_space
- Understanding the Linux Kernel, Chapter 16 — Memory mapping and page cache internals
- Linux Page Cache Management — Writeback and flush daemon documentation
- PostgreSQL Buffer Cache vs OS Page Cache — PostgreSQL documentation on O_DIRECT and buffer management
- memfd_create man page — Memory file descriptors for seal-protected shared memory
Conclusion
The page cache represents one of the most effective performance optimizations in operating systems—keeping recently-accessed disk blocks in RAM eliminates disk I/O for cache hits, dramatically reducing read latency and batching writes for improved disk throughput. Write-back caching (Linux default) delivers high performance by returning immediately after writing to cache, but requires explicit fsync() calls for durability guarantees.
The unified page cache (merged from historical separate buffer and page caches) simplifies memory management and eliminates double-caching. O_DIRECT bypasses the page cache for applications like databases that implement their own buffer management. posix_fadvise() enables applications to hint access patterns, allowing the kernel to prefetch aggressively or evict pages that won’t be reused.
Looking forward, several trends are reshaping caching: persistent memory (PMEM) blurs the line between storage and memory, potentially eliminating some page cache benefits; io_uring enables asynchronous submission of both buffered and direct I/O with far fewer system calls; and pressure around memory efficiency drives continued refinement of page cache eviction policies and fsync() batching behavior.