File System Abstraction

Understanding how operating systems organize and manage files through abstraction layers, from physical storage to logical file operations.

Reading time: 28 min read · Author: GeekWorkBench

Every time you save a document, browse a folder, or stream a video, you’re interacting with one of computing’s most elegant abstractions: the file system. Behind the scenes, mountains of complex logic collapse into the simple act of reading and writing files. This post peels back those layers to reveal how modern operating systems transform raw magnetic domains on a spinning disk into the seamless experience of organizing your digital life.

The file system abstraction is foundational to everything else in operating systems. It provides the illusion of persistent, structured storage while hiding the messy reality of sector boundaries, wear leveling, and hardware failures. Without this abstraction, every programmer would need to be a storage engineer just to save a configuration file.

Introduction

When to Use / When Not to Use

Understanding file system abstraction helps you make informed decisions about data storage and performance optimization.

When file system abstraction shines:

  • Application development where persistence and portability matter
  • Database design that relies on durability guarantees
  • Backup systems that need to understand what’s stored where
  • Debugging storage-related issues in production

When to look deeper:

  • Embedded systems with custom storage controllers
  • Performance-critical applications where every I/O matters
  • Distributed systems where consistency guarantees get murky
  • When the abstraction leaks and you need to understand why

Architecture or Flow Diagram

graph TD
    A[Application System Calls] --> B[VFS Layer]
    B --> C[Specific File System ext4, NTFS, FAT]
    C --> D[Block Layer]
    D --> E[Storage Device Driver]
    E --> F[Physical Storage Device]

    G[Inode Cache] -.-> B
    H[Directory Entry Cache] -.-> B
    I[Page Cache] -.-> B

The architecture reveals the layered approach. Applications speak to the kernel through system calls. The Virtual File System (VFS) layer acts as a universal translator, allowing different file systems to coexist. Below that, specific implementations handle the particulars of each format. The block layer manages I/O scheduling, and finally the device driver speaks to the hardware.

Core Concepts

Files: The Logical Unit

A file is an abstract container for data, defined by a set of attributes rather than physical properties. The file system maintains:

  • Name: Human-readable identifier with case-sensitive or insensitive rules
  • Metadata: Size, creation time, modification time, permissions, ownership
  • Data blocks: Actual storage locations for content
  • Location references: Pointers to where data lives on disk

Files are named sequences of bytes. The operating system doesn’t care what’s inside. A text document and a compiled binary are both just sequences of bytes to the file system layer.

Inodes: The Metadata Engine

The inode (index node) is the fundamental data structure in Unix-style file systems. Every file has exactly one inode, identified by an inode number that is unique within its file system.

// Simplified excerpt; the real struct inode in the Linux kernel has many more fields
struct inode {
    unsigned long   i_ino;        // Inode number
    umode_t         i_mode;       // File type and permissions
    unsigned int    i_nlink;      // Link count (hard links)
    uid_t           i_uid;        // Owner UID
    gid_t           i_gid;        // Owner GID
    loff_t          i_size;       // Size in bytes
    struct timespec i_atime;      // Access time
    struct timespec i_mtime;      // Modification time (file contents)
    struct timespec i_ctime;      // Change time (inode metadata)
    unsigned long   i_blocks;     // Space allocated, in 512-byte units
    unsigned short  i_bytes;      // Leftover bytes beyond the last full 512-byte unit
    struct super_block *i_sb;     // Super block reference
    const struct inode_operations *i_op; // Operations table
};

The inode doesn’t contain the actual file name. That’s stored in directory entries, which map names to inode numbers. This separation allows multiple names (hard links) to point to the same file.
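
The separation of names from inodes is easy to observe from user space. The sketch below (the helper name and file names are made up for illustration) creates a file and a hard link in a temporary directory, then compares the inode numbers and link count reported by os.stat().

```python
import os
import tempfile

def hard_link_demo():
    """Create a file plus a hard link; return (inode_a, inode_b, link_count)."""
    with tempfile.TemporaryDirectory() as tmp:
        original = os.path.join(tmp, "original.txt")
        alias = os.path.join(tmp, "alias.txt")
        with open(original, "w") as f:
            f.write("hello")
        os.link(original, alias)  # second directory entry for the same inode
        st_a = os.stat(original)
        st_b = os.stat(alias)
        return st_a.st_ino, st_b.st_ino, st_a.st_nlink

if __name__ == "__main__":
    ino_a, ino_b, nlink = hard_link_demo()
    print(f"inode of original: {ino_a}")
    print(f"inode of alias:    {ino_b}")
    print(f"link count:        {nlink}")
```

Both names report the same inode number, and the link count reflects the two directory entries.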

Directories: Name-to-Inode Mappings

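A directory is itself a special file that maps names to inode numbers. Conceptually it is a sequence of records like the one below. This is a simplified layout in the style of later BSD and Linux directory entries; the original Unix format was just an inode number followed by a fixed 14-byte name: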

struct direct {
    unsigned long d_ino;    // Inode number
    unsigned short d_reclen;// Record length
    unsigned char d_type;   // File type (directory, regular, etc.)
    char d_name[];          // File name (variable length)
};

Modern file systems use more sophisticated directory implementations, but the principle remains: directories are searchable collections of name-to-inode mappings.

Hierarchical Organization

The directory tree creates the familiar path structure:

graph TD
    A["/ (root)"] --> B["/home"]
    A --> C["/etc"]
    A --> D["/var"]
    B --> E["/home/user"]
    E --> F["/home/user/documents"]
    F --> G["report.pdf"]

    style A stroke:#ff00ff,stroke-width:3px
    style G stroke:#ff00ff,stroke-width:2px

Each directory entry points to an inode. Following the chain from root through each path component resolves a full path to a specific inode.
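
To make the chain concrete, here is a small sketch that walks a path one component at a time and records the inode number found at each step. The kernel does this internally during path resolution; the function name is hypothetical, and the sketch assumes a Unix-style path separator.

```python
import os

def resolve_components(path):
    """Return [(prefix, inode_number), ...] for each step from the root to path."""
    path = os.path.abspath(path)
    current = os.sep
    steps = [(current, os.lstat(current).st_ino)]
    for name in path.strip(os.sep).split(os.sep):
        if not name:
            continue  # the path was just the root directory
        current = os.path.join(current, name)
        # lstat reports the entry itself, even if it is a symbolic link
        steps.append((current, os.lstat(current).st_ino))
    return steps

if __name__ == "__main__":
    for prefix, ino in resolve_components(os.getcwd()):
        print(f"{ino:>12}  {prefix}")
```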

The Superblock: File System Metadata

The superblock contains the file system’s critical metadata:

  • Total number of inodes and blocks
  • Free inode and block counts
  • Block size and group size
  • Magic number to identify file system type
  • Last mount time and write time
  • Journal location (for journaling file systems)

In ext2/3/4, backup copies of the superblock are stored in several block groups across the partition. If the primary superblock is corrupted, the file system can be recovered from a backup copy.

Production Failure Scenarios

Scenario 1: Inode Exhaustion

What happened: A web server’s /tmp directory accumulated millions of small session files. The disk showed 40% free space, but the system couldn’t create new files. The issue: inode exhaustion. The partition had 100 million inodes, and 99 million were allocated to tiny files.

Detection:

df -i /tmp
# Filesystem  Inodes  IUsed  IFree  IUse%  Mounted on
# /dev/sda1   100M    99M    1M     99%     /tmp

Mitigation:

  • Monitor inode usage in production: df -i
  • Set up alerts when inode usage exceeds 80%
  • Implement file expiration policies for temporary directories
  • Consider partitioning strategies that balance inode allocation
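
The same numbers that df -i prints are available programmatically through os.statvfs(), which makes a simple alert easy to script. This is a minimal sketch: the function names and the 80% default threshold are illustrative, and it assumes that a total of zero means the file system does not report a fixed inode count (Btrfs behaves this way, for instance).

```python
import os

def inode_usage_percent(total_inodes, free_inodes):
    """Percentage of inodes in use, or None if the file system reports no total."""
    if total_inodes == 0:
        return None
    return 100.0 * (total_inodes - free_inodes) / total_inodes

def check_inodes(path, threshold=80.0):
    """Return (usage_percent, alert_flag) for the file system containing path."""
    vfs = os.statvfs(path)
    used = inode_usage_percent(vfs.f_files, vfs.f_ffree)
    return used, (used is not None and used > threshold)

if __name__ == "__main__":
    used, alert = check_inodes("/")
    print(f"inode usage: {used}  alert: {alert}")
```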

Scenario 2: Orphaned Inodes

What happened: A database application crashed mid-transaction, leaving behind an inode that referenced blocks on disk but no directory entry pointed to it. The storage was allocated but inaccessible. This is a “lost” file eating disk space silently.

Detection:

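# Use fsck in no-write mode to find orphans
# (results are only reliable on an unmounted or read-only file system)
sudo fsck -n /dev/sda1

# On ext3/ext4, check whether the superblock records a pending orphan inode list
sudo dumpe2fs -h /dev/sda1 | grep -i orphan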

Mitigation:

  • Always properly unmount file systems before maintenance
  • Use journaling file systems, which clean up orphaned inodes during recovery at the next mount
  • Run periodic integrity checks: xfs_repair for XFS, e2fsck for ext4
  • Take advantage of file system features like tune2fs -l to check state

Scenario 3: Fragmentation-Induced Performance Death

What happened: A heavily-used file server showed acceptable space metrics but catastrophic performance. Reads that should take milliseconds stretched to seconds. The culprit: external fragmentation. File data was scattered across thousands of non-contiguous blocks.

Detection:

# Check fragmentation on ext4
sudo fsck.ext4 -nv /dev/sda1

# Report a fragmentation score without changing anything
sudo e4defrag -c /dev/sda1

Mitigation:

  • Monitor fragmentation levels with vendor tools
  • Defragment proactively with e4defrag (ext4) or xfs_fsr (XFS)
  • Choose allocation strategies that minimize fragmentation (extents over blocks)
  • Plan capacity with headroom to prevent fragmentation-inducing fullness

Trade-off Table

Aspect            ext4              XFS             Btrfs                     NTFS
Max File Size     16TB              8EB             16EB                      256TB
Max Volume Size   1EB               8EB             16EB                      256TB
Journaling        Metadata only     Metadata only   Copy-on-write             Metadata only
Snapshots         Via LVM           Via LVM         Native                    Via VSS
Performance       General purpose   Large files     Mixed                     Windows optimal
Fragmentation     Moderate          Low             Grows with rewrites (COW) Moderate

Implementation Snippet

Reading File Metadata in C

#include <stdio.h>
#include <sys/stat.h>
#include <time.h>

void print_file_info(const char *path) {
    struct stat st;

    if (stat(path, &st) == -1) {
        perror("stat");
        return;
    }

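    // Cast to fixed types so the format specifiers are correct on every platform
    printf("File: %s\n", path);
    printf("Size: %lld bytes\n", (long long) st.st_size);
    printf("Blocks: %lld (512-byte units)\n", (long long) st.st_blocks);
    printf("IO Block: %ld bytes\n", (long) st.st_blksize);
    printf("Inode: %llu\n", (unsigned long long) st.st_ino);
    printf("Links: %lu\n", (unsigned long) st.st_nlink);
    printf("Permissions: %o\n", (unsigned int) (st.st_mode & 0777));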

    // Timestamps
    printf("Access: %s", ctime(&st.st_atime));
    printf("Modify: %s", ctime(&st.st_mtime));
    printf("Change: %s", ctime(&st.st_ctime));
}

int main(int argc, char *argv[]) {
    if (argc != 2) {
        fprintf(stderr, "Usage: %s <file>\n", argv[0]);
        return 1;
    }
    print_file_info(argv[1]);
    return 0;
}

Exploring Inodes with Python

import os
import stat

def explore_inode(filepath):
    """Explore file system inode information."""
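    # lstat() reports a symbolic link itself instead of following it,
    # so the S_ISLNK branch below can actually be reached
    st = os.lstat(filepath)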

    print(f"Path: {filepath}")
    print(f"Inode: {st.st_ino}")
    print(f"Device: {st.st_dev}")
    print(f"Hard links: {st.st_nlink}")
    print(f"Size: {st.st_size} bytes")
    print(f"Blocks (512b): {st.st_blocks}")
    print(f"Block size: {st.st_blksize}")

    mode = st.st_mode
    print(f"\nPermissions: {oct(stat.S_IMODE(mode))}")
    print(f"Type: ", end="")

    if stat.S_ISDIR(mode):
        print("Directory")
    elif stat.S_ISREG(mode):
        print("Regular File")
    elif stat.S_ISLNK(mode):
        print("Symbolic Link")
    else:
        print("Other")

    # Timestamps (seconds since the epoch)
    print(f"\nAccessed: {st.st_atime}")
    print(f"Modified: {st.st_mtime}")

if __name__ == "__main__":
    import sys
    explore_inode(sys.argv[1] if len(sys.argv) > 1 else ".")

Observability Checklist

Monitor these metrics to keep your file systems healthy:

  • df -h: Check available space across all mounted file systems
  • df -i: Monitor inode usage to prevent inode exhaustion
  • mount | grep -v tmpfs: List persistent mounts with their options
  • tune2fs -l /dev/sdX | grep -i state: Check file system clean/dirty state
  • dmesg | grep -i error | grep -i ext4: Scan kernel logs for file system errors

Key metrics to track:

  • Used space percentage (alert at >85% for critical partitions)
  • Inode usage percentage (alert at >80%)
  • I/O wait time (CPU waiting on storage operations)
  • Read/write throughput per partition
  • Error counts from smartctl output

Logs to watch:

# Check for file system errors
sudo journalctl -b | grep -i "ext4\|xfs\|filesystem"

# Check the inotify watch limit (this is the ceiling, not current consumption)
cat /proc/sys/fs/inotify/max_user_watches

# Check for file handle leaks
lsof | wc -l

Security and Permissions

Permission Model

Unix file permissions operate on three classes:

  • Owner (u): The user who owns the file
  • Group (g): Users in the file’s group
  • Other (o): Everyone else

Three permission types for each class:

  • Read (r): View file contents or list directory
  • Write (w): Modify file contents or create/delete files in directory
  • Execute (x): Run file as program, or traverse a directory (enter it and access entries by name)

# View permissions
ls -la /path/to/file
# -rw-r--r--  1 owner group  1234 May 19 10:00 file.txt

# Modify permissions
chmod 640 /path/to/file      # Owner: rw, Group: r, Other: none
chmod u+x /path/to/file      # Add execute for owner
chmod -R 755 /path/to/dir    # Recursive
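
The octal modes and the rwx strings are two views of the same bits. As a quick illustration, the sketch below applies a mode with os.chmod() to a temporary file and renders it the way ls does using stat.filemode(); the helper name is made up for this example.

```python
import os
import stat
import tempfile

def mode_string_after_chmod(mode):
    """Apply an octal mode to a temporary file and return its ls-style string."""
    with tempfile.NamedTemporaryFile() as tmp:
        os.chmod(tmp.name, mode)
        return stat.filemode(os.stat(tmp.name).st_mode)

if __name__ == "__main__":
    for mode in (0o640, 0o755):
        print(f"{mode:o} -> {mode_string_after_chmod(mode)}")
```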

Access Control Lists (ACLs)

Beyond traditional permissions, ACLs provide fine-grained control:

# Set ACL for specific user
setfacl -m u:alice:rw /path/to/file

# Set ACL for specific group
setfacl -m g:developers:rx /path/to/directory

# View ACLs
getfacl /path/to/file

# Remove specific ACL entry
setfacl -x u:alice /path/to/file

Security Considerations

  1. Sticky bit on shared directories: Prevents users from deleting others’ files in /tmp-like directories

    chmod +t /shared/directory
  2. Immutable files: Even root cannot modify immutable files without removing the flag

    chattr +i /important/file
    lsattr /important/file
    # ---i---------- /important/file
  3. Audit file access: Enable audit daemon to track who accesses what

    sudo auditctl -w /etc/shadow -p wa -k shadow_access
    sudo ausearch -k shadow_access

Common Pitfalls / Anti-patterns

1. Assuming Unlimited Space

# BAD: Writing without checking space
with open("/tmp/large_file", "w") as f:
    while True:
        f.write("data" * 1000)  # Will fail eventually

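# GOOD: Check free space on the target file system before writing
import os
import shutil

def safe_write(path, data):
    payload = data.encode("utf-8")  # compare bytes, not characters
    target_dir = os.path.dirname(os.path.abspath(path))
    usage = shutil.disk_usage(target_dir)
    if usage.free < len(payload):
        raise OSError(f"Insufficient space. Need {len(payload)}, have {usage.free}")
    # The check is advisory: space can vanish before the write, so still handle OSError
    with open(path, "wb") as f:
        f.write(payload)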

2. Ignoring Path Length Limits

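Most Linux file systems limit each path component to 255 bytes (NAME_MAX) and a whole path passed to a system call to 4096 bytes (PATH_MAX). Windows APIs historically capped paths at 260 characters (MAX_PATH). Code that assumes different limits can silently truncate names, which causes both bugs and security issues.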
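
Rather than hard-coding limits, a program can ask the file system. The sketch below queries the per-directory limits through os.pathconf(), which is available on Unix-like systems; the function name is illustrative.

```python
import os

def path_limits(directory="/"):
    """Return the name and path length limits that apply under a directory."""
    return {
        "name_max": os.pathconf(directory, "PC_NAME_MAX"),
        "path_max": os.pathconf(directory, "PC_PATH_MAX"),
    }

if __name__ == "__main__":
    print(path_limits("/"))
```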

3. Case Sensitivity Surprises

# On case-insensitive file system (Windows, macOS default)
touch MyFile.txt
ls myfile.txt  # Shows MyFile.txt

# On case-sensitive file system (Linux, some network mounts)
touch MyFile.txt
ls myfile.txt  # No such file or directory

4. Confusing Hard Links and Symbolic Links

# Hard link - same inode, must be on same filesystem
ln /existing/file /path/to/hardlink
ls -li /existing/file /path/to/hardlink  # Same inode number

# Symbolic link - separate file pointing to path
ln -s /existing/file /path/to/symlink
ls -li /existing/file /path/to/symlink  # Different inodes

Quick Recap Checklist

  • File systems provide abstraction over raw storage devices
  • Inodes store metadata; directories map names to inode numbers
  • Hard links share inodes; symbolic links store paths
  • Journaling file systems keep metadata consistent across crashes
  • Monitor inode and space usage to prevent production surprises
  • Permissions (rwx) apply to owner/group/others
  • ACLs provide finer-grained access control
  • Different file systems have different strengths and limits

Interview Questions

1. What is an inode and what information does it contain?

An inode (index node) is a data structure that stores metadata about a file in Unix-style file systems. It contains:

  • The file's size and block count
  • Owner UID and group GID
  • File permissions (mode)
  • Timestamps: access, modification, and change time
  • Link count (number of hard links pointing to this inode)
  • Pointers to the actual data blocks on disk

The inode does not store the file name—that lives in directory entries.

2. Explain the difference between hard links and symbolic links.

Hard links are multiple directory entries pointing to the same inode. They must be on the same file system and cannot link to directories. When you delete one hard link, the file persists until all hard links are removed. The inode's link count tracks this.

Symbolic links are special files that store a path (not an inode number). They can cross file systems and point to directories. Deleting the target breaks the symlink. They occupy a small inode for the link itself plus space for the path string.

Use hard links for equal-weight file names within a filesystem. Use symlinks when you need flexibility or cross-filesystem references.
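
The difference shows up most clearly when the original name is deleted. In this sketch (hypothetical helper and file names, run in a temporary directory) the hard link keeps the data alive while the symbolic link is left dangling.

```python
import os
import tempfile

def delete_target_demo():
    """Return (content via hard link, symlink still exists, symlink resolves)."""
    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.join(tmp, "data.txt")
        hard = os.path.join(tmp, "hard.txt")
        soft = os.path.join(tmp, "soft.txt")
        with open(target, "w") as f:
            f.write("payload")
        os.link(target, hard)     # another name for the same inode
        os.symlink(target, soft)  # a small file that stores the path
        os.remove(target)         # drop the original name
        with open(hard) as f:
            hard_content = f.read()
        soft_is_link = os.path.islink(soft)   # the link file itself remains
        soft_resolves = os.path.exists(soft)  # following it now fails
        return hard_content, soft_is_link, soft_resolves

if __name__ == "__main__":
    print(delete_target_demo())
```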

3. What happens when you run `ls -la`? Walk through the kernel layers involved.

When you execute ls -la, the journey through the kernel looks like:

  1. Shell parses the command and calls fork() to create a child process
  2. execve() system call replaces the child with the /bin/ls program
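  3. ls reads the directory with the C library's readdir(), which issues the getdents64() system call; because of -l, it also calls lstat() or statx() on each entry to fetch metadata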
  4. Kernel transfers control to the VFS layer, which identifies the mounted file system type
  5. File system driver (e.g., ext4) reads directory entries from disk or cache
  6. Page cache may serve frequently-accessed directory data
  7. Results flow back up through VFS to the application

The application never touches the disk directly—it only knows about the VFS interface.

4. How would you diagnose a "disk full" error when `df` shows space is available?

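This classic problem usually means something other than data blocks is exhausted: most often inodes, sometimes reserved blocks or quotas. File handle exhaustion is a related failure worth ruling out, although it surfaces as "Too many open files" (EMFILE/ENFILE) rather than "No space left on device" (ENOSPC).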

First, check inode usage: df -i. If the IUse% is near 100%, the partition has run out of inodes (common with millions of tiny files in /tmp).

If inodes are fine, check file handle limits:

  • lsof | wc -l — current open file count
  • cat /proc/sys/fs/file-max — system-wide limit
  • ulimit -n — per-process limit

Also check for deleted files still held open by processes: lsof +L1 shows files deleted but not yet freed.
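
The deleted-but-open case is easy to reproduce. The sketch below (illustrative helper name, Linux behavior assumed) unlinks a file while keeping its descriptor open: the name disappears and the link count drops to zero, yet the data is still readable and still occupies space until the descriptor is closed.

```python
import os
import tempfile

def deleted_but_open_demo():
    """Return (name still exists, link count, data read back after unlink)."""
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, b"still here")
        os.unlink(path)                      # remove the only directory entry
        name_exists = os.path.exists(path)
        link_count = os.fstat(fd).st_nlink   # inode survives with zero links
        os.lseek(fd, 0, os.SEEK_SET)
        data = os.read(fd, 100)
    finally:
        os.close(fd)                         # only now can the space be freed
    return name_exists, link_count, data

if __name__ == "__main__":
    print(deleted_but_open_demo())
```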

5. Describe what happens during a file open operation in the kernel.

The open() system call triggers a multi-step process:

  1. Path resolution: Starting from the root directory (or current working directory for relative paths), each path component is looked up in its parent directory
  2. VFS lookup: The VFS layer calls the underlying file system's lookup function
  3. Permission checking: The kernel verifies the process has execute permission on each directory component and read/write on the file itself
  4. Inode retrieval: If the file exists, its inode is loaded into the inode cache
  5. Open file table entry: A new entry is created in the process's file descriptor table, pointing to a global open file table entry
  6. File structure allocation: Memory is allocated for the file structure, including current position and access mode

On success, a file descriptor (small integer) is returned. Subsequent read(), write(), and close() calls use this descriptor.

6. What is the superblock and what critical information does it contain?

The superblock is the metadata heart of a file system. It lives in a fixed location (typically block 0 or block 1) and contains:

  • File system magic number: Identifies the file system type (0xEF53 for ext2/3/4)
  • Block size and total blocks: Fundamental geometry of the storage
  • Inode count and free inode count: Total and available inodes
  • Block count and free block count: Storage capacity information
  • Last mount time and last write time: File system usage timestamps
  • Journal location: For journaling file systems, where the journal lives
  • Block group descriptors: Locations of block and inode bitmaps

Because superblock corruption is catastrophic, ext2/3/4 duplicate it at regular intervals throughout the partition. If the primary superblock is damaged, the file system can recover from a backup copy.
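
As a rough illustration of how these fields sit on disk, the sketch below decodes a few of them from a raw image. It assumes the ext2/3/4 layout as documented for the kernel: superblock at byte offset 1024, little-endian fields, magic at offset 56 within the superblock. The function name is hypothetical, and only the low 32 bits of the block counts are read, so large 64-bit file systems would need more work.

```python
import struct

SUPERBLOCK_OFFSET = 1024  # the primary superblock starts 1024 bytes into the volume
EXT_MAGIC = 0xEF53

def parse_ext_superblock(image):
    """Decode a handful of superblock fields from raw bytes of an ext2/3/4 image."""
    sb = image[SUPERBLOCK_OFFSET:SUPERBLOCK_OFFSET + 1024]
    if len(sb) < 58:
        raise ValueError("image too short to contain a superblock")
    inodes, blocks = struct.unpack_from("<II", sb, 0)
    free_blocks, free_inodes = struct.unpack_from("<II", sb, 12)
    (log_block_size,) = struct.unpack_from("<I", sb, 24)
    (magic,) = struct.unpack_from("<H", sb, 56)
    if magic != EXT_MAGIC:
        raise ValueError(f"unexpected magic number {magic:#06x}")
    return {
        "inodes": inodes,
        "blocks": blocks,
        "free_blocks": free_blocks,
        "free_inodes": free_inodes,
        "block_size": 1024 << log_block_size,
    }
```

Reading the first 2048 bytes of a device (which requires root) and passing them to this function would be one way to try it; dumpe2fs -h remains the proper tool for real inspection.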

7. How does the page cache improve file system performance?

The page cache (formerly buffer cache) stores recently accessed file data in free RAM. When you read a file, the kernel checks if the data is already in the page cache. If so, it returns the cached data immediately—zero disk I/O. If not, it reads from disk and caches the result for future use.

Benefits:

  • Read amplification reduction: Frequently read data doesn't require repeated disk reads
  • Write coalescing: Multiple writes to the same region are combined before disk I/O
  • Access time elimination: RAM access is nanoseconds; disk I/O is milliseconds
  • Writeback batching: Delayed writes allow the kernel to batch and optimize I/O patterns

The page cache is unified: file data and anonymous memory (process heap/stack) share the same pool. The kernel's memory reclaim algorithm (an approximation of LRU) evicts cold pages when memory pressure increases.
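
The hit/miss/evict cycle can be modeled in a few lines. This is a toy model only: it uses strict LRU over a dictionary standing in for the disk, whereas the kernel's reclaim logic is considerably more elaborate. All names are made up for the example.

```python
from collections import OrderedDict

class ToyPageCache:
    """Strict-LRU cache of pages in front of a slow backing store (toy model)."""

    def __init__(self, capacity, backing_store):
        self.capacity = capacity
        self.backing_store = backing_store  # dict: page number -> bytes
        self.pages = OrderedDict()          # oldest entry first
        self.hits = 0
        self.misses = 0

    def read(self, page_no):
        if page_no in self.pages:
            self.hits += 1
            self.pages.move_to_end(page_no)  # mark as most recently used
            return self.pages[page_no]
        self.misses += 1                     # would be a disk read
        data = self.backing_store[page_no]
        self.pages[page_no] = data
        if len(self.pages) > self.capacity:
            self.pages.popitem(last=False)   # evict the coldest page
        return data

if __name__ == "__main__":
    disk = {n: bytes([n]) for n in range(4)}
    cache = ToyPageCache(capacity=2, backing_store=disk)
    for page in (0, 1, 0, 2, 1):
        cache.read(page)
    print(f"hits={cache.hits} misses={cache.misses} cached={list(cache.pages)}")
```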

8. What is the difference between direct I/O and buffered I/O?

Buffered I/O (default): Data flows through the page cache. Writes go to the cache and are lazily flushed to disk. Reads pull from cache, loading from disk on cache miss. The kernel manages consistency and can reorder I/O for performance.

Direct I/O: Data bypasses the page cache, going directly between application memory and the storage device. Applications use the O_DIRECT flag to request this. The kernel still performs I/O scheduling at the block layer, but the page cache is not involved.

When to use direct I/O:

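  • Database systems that manage their own cache (for example Oracle, or MySQL's InnoDB when configured for O_DIRECT; PostgreSQL, by contrast, has traditionally relied on buffered I/O)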
  • Applications doing sequential scans that don't benefit from caching
  • When you need predictable I/O patterns without kernel interference

Direct I/O bypasses the page cache but not the block layer scheduler or RAID controller cache.

9. What are the fundamental POSIX file operations and how does the kernel handle them?

POSIX defines the standard file operations: open(), close(), read(), write(), lseek(), stat(), rename(), unlink().

The kernel handles these through the VFS layer:

  1. open(): Returns a file descriptor. Kernel creates an open file table entry, calls the file system's inode_operations->lookup(), allocates a dentry cache entry.
  2. read(fd, buf, n): Kernel looks up the fd in the process's file descriptor table, finds the file struct, calls the file's file_operations->read() which chains to the underlying file system driver.
  3. write(): Same path but calls file_operations->write(). Data may go through the page cache and be dirty-flushed to disk later.
  4. lseek(): Updates the file offset in the open file table entry. Doesn't touch the underlying storage.

The kernel maintains three tables: the per-process file descriptor table (array of pointers), the system-wide open file table (struct file with position and mode), and the dentry cache (directory entry to inode mapping). Each layer has a distinct purpose in managing file access.
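
The split between the descriptor table and the open file table is visible from user space. In this sketch (hypothetical helper name), two separate open() calls get independent offsets, while a descriptor created with dup() shares the offset of the one it was copied from.

```python
import os
import tempfile

def offset_sharing_demo():
    """Return (read via fd1, read via fd2, read via dup of fd1, fd1 offset)."""
    with tempfile.NamedTemporaryFile() as tmp:
        tmp.write(b"abcdefgh")
        tmp.flush()
        fd1 = os.open(tmp.name, os.O_RDONLY)
        fd2 = os.open(tmp.name, os.O_RDONLY)  # separate open file table entry
        fd3 = os.dup(fd1)                     # same entry as fd1
        try:
            first = os.read(fd1, 3)           # advances the shared offset to 3
            independent = os.read(fd2, 3)     # fd2 still starts at offset 0
            shared = os.read(fd3, 3)          # continues where fd1 stopped
            position = os.lseek(fd1, 0, os.SEEK_CUR)
        finally:
            for fd in (fd1, fd2, fd3):
                os.close(fd)
        return first, independent, shared, position

if __name__ == "__main__":
    print(offset_sharing_demo())
```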

10. What is the difference between journaling and non-journaling file systems, and what are the trade-offs?

A journaling file system records intended operations in a journal (a separate area on disk) before applying them to the main file system. If a crash occurs mid-operation, the journal can be replayed to complete or undo the operation, preventing metadata corruption.

Journaling process:

  1. Write operation to journal (begin transaction)
  2. Perform operation on main file system
  3. Mark transaction as complete in journal

If a crash happens before step 3, the journal contains the incomplete operation. On reboot, the file system replays the journal and either completes the operation or undoes it, ensuring consistency.
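
Recovery can be sketched with a toy redo log. Note the assumption: in this model the commit record is written to the journal before the main file system is updated, so recovery reapplies committed transactions and simply discards uncommitted ones. The class and method names are invented for illustration and do not mirror any real journal format.

```python
class ToyJournal:
    """Toy redo journal: 'disk' is a dict, the journal is a list of transactions."""

    def __init__(self):
        self.disk = {}     # main file system state
        self.journal = []  # pending transactions

    def begin(self, updates):
        txn = {"updates": dict(updates), "committed": False}
        self.journal.append(txn)  # log the intent first
        return txn

    def commit(self, txn):
        txn["committed"] = True   # commit record reaches the journal

    def checkpoint(self, txn):
        self.disk.update(txn["updates"])  # apply to the main file system
        self.journal.remove(txn)

    def recover(self):
        for txn in self.journal:
            if txn["committed"]:
                self.disk.update(txn["updates"])  # redo; safe to repeat
            # uncommitted transactions never touched the disk: drop them
        self.journal.clear()

if __name__ == "__main__":
    fs = ToyJournal()
    done = fs.begin({"inode 7": "size=10", "bitmap": "block 3 used"})
    fs.commit(done)                 # simulated crash before checkpoint
    fs.begin({"inode 8": "size=5"})  # simulated crash before commit
    fs.recover()
    print(fs.disk)
```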

Non-journaling file systems (like ext2, FAT) write directly to the file system. Crashes can leave the file system in an inconsistent state with lost or orphaned data. Checking consistency requires scanning the entire file system (e.g., fsck), which is slow on large file systems.

Trade-offs:

  • Journaling adds write overhead (journal writes before main writes)
  • Journaling protects metadata but not necessarily file data (metadata-only journaling is common)
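  • Ext4 defaults to "ordered" mode: only metadata is journaled, but data blocks are written to disk before the metadata that references them is committed, so a crash cannot leave a file pointing at stale or uninitialized blocks

11. How does copy-on-write (COW) work in file systems like Btrfs, and what advantages does it provide?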

Copy-on-Write (COW) is a mechanism where modifying data does not overwrite the existing data in place. Instead, the modified data is written to a new location, and the pointer is updated.

In Btrfs:

  1. When you modify a data block, Btrfs writes the new version to a free location
  2. Only the metadata (which points to the data) is updated to reference the new location
  3. The old block stays intact until nothing references it any longer (no snapshot or older tree root), at which point its space can be reused

Advantages:

  • Snapshots: Creating a snapshot is instantaneous—the snapshot shares the same data blocks until one copy is modified. Only the changed blocks consume additional space.
  • Self-healing: Btrfs maintains checksums for data and metadata. On read, if corruption is detected, Btrfs can read the good copy from a mirror (in RAID configurations) and repair the corrupted block.
  • Consistency: No in-place overwrites means crashes cannot corrupt existing data—only the incomplete new write is lost.
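
Why snapshots are nearly free follows directly from the pointer update. The toy model below (invented names, no space reclamation, nothing like the real Btrfs B-trees) keeps a map from logical to physical blocks; a snapshot copies only that map, and each write allocates a fresh physical block.

```python
class ToyCowVolume:
    """Toy copy-on-write volume: writes never overwrite an existing block."""

    def __init__(self):
        self.blocks = {}     # physical block id -> data
        self.next_id = 0
        self.live = {}       # logical block number -> physical block id
        self.snapshots = {}  # snapshot name -> frozen copy of the mapping

    def write(self, logical, data):
        block_id = self.next_id          # always a new location
        self.next_id += 1
        self.blocks[block_id] = data
        self.live[logical] = block_id    # then repoint the mapping

    def snapshot(self, name):
        self.snapshots[name] = dict(self.live)  # copies pointers, not data

    def read(self, logical, snapshot=None):
        mapping = self.live if snapshot is None else self.snapshots[snapshot]
        return self.blocks[mapping[logical]]

if __name__ == "__main__":
    vol = ToyCowVolume()
    vol.write(0, "a1")
    vol.write(1, "b1")
    vol.snapshot("before")
    vol.write(0, "a2")
    print(vol.read(0), vol.read(0, "before"), len(vol.blocks))
```

After the second write to logical block 0, the live view and the snapshot disagree on that block only; block 1 is still shared.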

Disadvantages:

  • Fragmentation: COW causes data to be scattered across the disk over time
  • Write amplification: A small change also forces new copies of the metadata blocks that point to it, up to the tree root, so one logical write turns into several physical writes
  • SSD wear: The extra writes add to flash wear (wear leveling itself is done by the SSD's controller, not by Btrfs)

12. What is the role of the VFS (Virtual File System) layer and how does it enable multiple file systems?

The Virtual File System (VFS) is an abstraction layer in the Linux kernel that provides a unified interface to all file system implementations. Applications interact with VFS, not directly with ext4, XFS, or NTFS.

VFS defines a small set of core objects, each with its own table of operations:

struct super_block  // The file system as a whole
struct inode        // A specific file
struct dentry        // A directory entry (name to inode mapping)
struct file         // An open file handle

Each file system implements these operations for its specific on-disk format. When an application calls open(), VFS:

  1. Determines which mounted file system should handle the request
  2. Calls the file system's inode_operations->lookup() to find the inode
  3. Creates a struct file with the file system's file_operations

This design allows transparent access to different file systems. You can mount ext4, XFS, and FAT pen drives simultaneously. Network file systems like NFS and CIFS also implement VFS operations, making network drives appear as local directories.
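
The dispatch idea can be shown with a toy mount table. This sketch is not how the kernel implements it; the classes and the longest-prefix matching rule are simplifications chosen for illustration. Each "file system" only has to provide a lookup method, and the caller never needs to know which one served the request.

```python
class ToyFileSystem:
    """Stand-in for a concrete file system: a name plus a dict of files."""

    def __init__(self, name, files):
        self.name = name
        self.files = files  # path relative to the mount point -> content

    def lookup(self, relative_path):
        return self.files[relative_path]

class ToyVFS:
    """Routes each path to the file system mounted at the longest matching prefix."""

    def __init__(self):
        self.mounts = {}

    def mount(self, mount_point, fs):
        self.mounts[mount_point.rstrip("/") or "/"] = fs

    def resolve(self, path):
        candidates = [m for m in self.mounts
                      if path == m or path.startswith(m.rstrip("/") + "/")]
        if not candidates:
            raise FileNotFoundError(path)
        mount_point = max(candidates, key=len)
        relative = path[len(mount_point):].lstrip("/")
        return self.mounts[mount_point], relative

    def read(self, path):
        fs, relative = self.resolve(path)
        return fs.lookup(relative)  # same call, whichever file system it is

if __name__ == "__main__":
    vfs = ToyVFS()
    vfs.mount("/", ToyFileSystem("ext4", {"etc/hostname": "box"}))
    vfs.mount("/mnt/usb", ToyFileSystem("vfat", {"photo.jpg": "JPEG"}))
    print(vfs.read("/etc/hostname"), vfs.read("/mnt/usb/photo.jpg"))
```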

13. How does file system fragmentation occur and what tools exist to defragment?

Fragmentation occurs when file data blocks are not contiguous on disk. As files are created, modified, and deleted, free space becomes scattered, and new files must be allocated from non-contiguous blocks.

Types:

  • Internal fragmentation: Wasted space within blocks (file smaller than block size)
  • External fragmentation: File blocks scattered across disk (causes slow reads)
  • Metadata fragmentation: Directory entries spread across disk

Tools for defragmentation:

  • ext4: e4defrag /dev/sda1 or e4defrag /mount/point
  • XFS: xfs_fsr /dev/sda1 (online defragmentation)
  • Btrfs: btrfs filesystem defragment -r /mount/point (btrfs balance is a separate operation that redistributes data across chunks and devices)
  • Windows: defrag C: in command prompt or Optimize Drives utility

Defragmentation helps mechanical hard drives (sequential access is faster) but is less important for SSDs (random access is fast, COW complicates it). Modern file systems with extent allocation (rather than block allocation) are more resistant to fragmentation.
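
One way to picture external fragmentation is to count extents: contiguous runs of physical blocks. The sketch below collapses a file's block list into (start, length) pairs; a contiguous file yields one extent, a scattered one yields many. The function name and the sample block numbers are made up for the example.

```python
def to_extents(block_numbers):
    """Collapse an ordered list of physical block numbers into (start, length) runs."""
    extents = []
    for block in block_numbers:
        if extents and extents[-1][0] + extents[-1][1] == block:
            start, length = extents[-1]
            extents[-1] = (start, length + 1)  # extends the current run
        else:
            extents.append((block, 1))         # a gap starts a new extent
    return extents

if __name__ == "__main__":
    print(to_extents([100, 101, 102, 103]))
    print(to_extents([100, 101, 500, 501, 42]))
```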

14. What is the difference between a hard link and a symbolic link, and when would you use each?

A hard link is a directory entry that points directly to an inode. Multiple hard links share the same inode, and the link count stored in the inode tracks how many directory entries reference it. When the last hard link is removed (link count reaches 0) and no process still has the file open, the inode and its data blocks are freed.

Hard link rules:

  • Must be on the same file system as the original file
  • Cannot link to directories (prevents cycles in the directory tree)
  • The inode is shared—modifying the file affects all hard links

A symbolic link is a special file type containing a path string. It points to a path (not an inode), can cross file systems, and can point to directories or non-existent targets.

When to use hard links:

  • Multiple names for the same file in the same directory tree
  • Backup schemes where you want to maintain multiple references

When to use symbolic links:

  • Creating shortcuts or aliases
  • Linking across file systems or to directories
  • Pointing at a path whose target may be swapped out later (for example, a "current" link that is repointed at each new release directory)

15. Explain the difference between block devices and character devices in Unix/Linux.

Block devices transfer data in fixed-size blocks (typically 512 bytes to 4KB). They support random access—you can seek to any block location. Examples: hard drives, SSDs, USB mass storage. The kernel buffers reads/writes through the page cache for efficiency.

Character devices transfer data as a stream of bytes, one character at a time. No random access, no buffering. Examples: keyboards, serial ports, terminals (/dev/tty), /dev/null. Programs read and write byte streams without knowing the underlying device.

Key differences:

  • Block devices: buffered I/O, random access, block-oriented
  • Character devices: unbuffered, sequential stream, byte-oriented

Device files are created with mknod: mknod /dev/sda b 8 0 (b = block device), mknod /dev/tty c 5 0 (c = character device). The major number identifies the driver; the minor number identifies the specific device instance.

16. What is the purpose of the directory entry cache (dentry cache) and how does it improve performance?

The dentry cache caches recently used directory entries, mapping directory names to inode numbers. It's a critical optimization because path resolution involves multiple directory lookups—every component in a path must be looked up.

For example, opening /home/user/documents/report.txt requires:

  1. Look up root / → inode
  2. Look up home in root → inode
  3. Look up user in /home → inode
  4. Look up documents in /home/user → inode
  5. Look up report.txt in /home/user/documents → inode

Without caching, each lookup requires disk I/O. The dentry cache stores these mappings, enabling O(1) lookup for frequently accessed paths. The dentry cache also stores parent pointers, enabling fast .. lookups and pathname canonicalization.

When directories are modified, the kernel invalidates affected dentry entries. The dentry cache is integrated with the inode cache—dentries reference inodes, and inodes reference their dentries.
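
The saving is easiest to see by counting simulated directory reads. In this toy model (invented names; root inode 2 is an assumption borrowed from ext file systems) the cache is keyed by parent inode and name, so a second path that shares a prefix with the first needs only one extra lookup.

```python
class ToyDentryCache:
    """Caches (parent inode, name) -> child inode and counts simulated disk reads."""

    def __init__(self, directories):
        self.directories = directories  # inode -> {name: child inode}, the "disk"
        self.cache = {}
        self.disk_lookups = 0

    def lookup(self, parent_ino, name):
        key = (parent_ino, name)
        if key not in self.cache:
            self.disk_lookups += 1  # would be a directory block read
            self.cache[key] = self.directories[parent_ino][name]
        return self.cache[key]

    def resolve(self, path, root_ino=2):
        ino = root_ino
        for name in path.strip("/").split("/"):
            if name:
                ino = self.lookup(ino, name)
        return ino

if __name__ == "__main__":
    dirs = {2: {"home": 11}, 11: {"user": 12}, 12: {"documents": 13},
            13: {"report.txt": 14, "notes.txt": 15}}
    cache = ToyDentryCache(dirs)
    cache.resolve("/home/user/documents/report.txt")
    cache.resolve("/home/user/documents/notes.txt")
    print(cache.disk_lookups)
```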

17. How does the Linux page cache work with file-backed memory mappings?

The page cache stores file data in RAM to reduce disk I/O. When you read a file, the kernel checks the page cache. On cache hit, data is returned immediately. On cache miss, data is read from disk and cached for future access.

For file-backed memory mappings (via mmap()), the page cache is used as the backing store. When a program accesses a memory-mapped file:

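  1. The CPU issues a load or store to a virtual address inside the mapping
  2. The MMU tries to translate it to a physical address using the page tables
  3. If there is no valid page table entry yet, a page fault occurs
  4. The kernel finds the file data in the page cache, reading it from disk first if necessary
  5. The kernel updates the page table so the virtual page points at that page cache page
  6. The faulting instruction is restarted and the access completes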

Modifications to memory-mapped files go through the page cache. The kernel writes modified pages back to disk (dirty pages) either periodically or when msync() is called.
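
Python's mmap module exposes the same mechanism, which makes for a compact demonstration. The sketch below (illustrative helper name) modifies a file through a shared mapping, flushes it, and then reads the result back with ordinary file I/O.

```python
import mmap
import tempfile

def mmap_roundtrip():
    """Write through a shared mapping, then read the file back with read()."""
    with tempfile.NamedTemporaryFile() as tmp:
        tmp.write(b"hello world")
        tmp.flush()
        with open(tmp.name, "r+b") as f:
            # Length 0 maps the whole file; the default mapping is shared and writable
            with mmap.mmap(f.fileno(), 0) as mapped:
                mapped[0:5] = b"HELLO"  # a plain memory store
                mapped.flush()          # asks the kernel to write dirty pages back
        with open(tmp.name, "rb") as f:
            return f.read()

if __name__ == "__main__":
    print(mmap_roundtrip())
```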

The page cache is unified: file data and anonymous memory (process heap/stack) share the same physical memory pool. The kernel's LRU (Least Recently Used) algorithm evicts cold pages when memory is needed.

18. What is FUSE (Filesystem in Userspace) and when would you use it?

FUSE allows unprivileged processes to implement file systems without kernel code. The file system logic runs in user space, communicating with the kernel via a special device (/dev/fuse).

FUSE architecture:

  1. User-space program registers with FUSE kernel module
  2. When VFS receives operations on the mount point, it passes them to the FUSE kernel module
  3. FUSE forwards operations to the user-space program
  4. User-space program performs the operation (reads from network, decrypts, etc.)
  5. FUSE returns the result to VFS

FUSE use cases:

  • sshfs: Mount remote SSH servers as local directories
  • gocryptfs: Encrypted file systems without kernel modules
  • mergerfs: Pool multiple drives into one logical volume
  • bindfs: Permission remapping layer
  • ntfs-3g: NTFS driver for Linux (FUSE-based; a separate in-kernel driver, ntfs3, was merged in Linux 5.15)

FUSE is slower than kernel file systems due to context switches, but it enables rapid development and doesn't require root privileges to install a file system.

19. How does the kernel handle file descriptor exhaustion and what limits apply?

A file descriptor is a small integer (index) in the process's file descriptor table. Each entry points to a struct file kernel object. The number of open files is limited by system and per-process limits.

System-wide limits:

  • /proc/sys/fs/file-max: Maximum number of open file handles system-wide
  • /proc/sys/fs/file-nr: Three values: allocated file handles, allocated but unused handles, and the maximum (same as file-max)

Per-process limits:

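  • ulimit -n: Soft limit (a process can raise it up to the hard limit)
  • ulimit -Hn: Hard limit (only a privileged process can raise it)
  • Default: commonly 1024 soft; the hard limit varies by distribution (4096 on older systems, often much higher on current ones)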

When limits are hit:

  • open() returns -1 with errno = EMFILE when the per-process limit is hit, or ENFILE when the system-wide limit is hit
  • Daemons may fail silently or crash
  • Long-running processes can leak file descriptors if not closed properly
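The per-process limit and the resulting EMFILE error can be observed from Python with the standard resource module (Unix only). This is a rough sketch: the function name provoke_emfile and the temporary limit of 64 are assumptions chosen for the demonstration. It lowers the soft limit, opens /dev/null until the kernel refuses, then restores the original limit.

```python
import errno
import os
import resource


def provoke_emfile(temporary_soft_limit=64):
    """Open descriptors until the per-process limit is hit.

    Returns (errno value, number of descriptors opened before failing).
    Hypothetical helper for illustration only.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    # Lower only the soft limit; the hard limit is left untouched so the
    # original soft limit can be restored without privileges.
    if soft == resource.RLIM_INFINITY or soft > temporary_soft_limit:
        resource.setrlimit(resource.RLIMIT_NOFILE, (temporary_soft_limit, hard))
    opened = []
    try:
        while True:
            opened.append(os.open(os.devnull, os.O_RDONLY))
    except OSError as exc:
        return exc.errno, len(opened)
    finally:
        # Always release the descriptors and put the limit back.
        for fd in opened:
            os.close(fd)
        resource.setrlimit(resource.RLIMIT_NOFILE, (soft, hard))


if __name__ == "__main__":
    code, count = provoke_emfile()
    print(errno.errorcode[code], "after opening", count, "descriptors")
```

The count is lower than the limit itself because standard input, output, error, and any other descriptors the process already holds count against the same limit.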

Common causes of exhaustion:

  • File descriptor leaks (opened but never closed)
  • Logging to many files simultaneously
  • Making many network connections

20. What is the difference between synchronous and asynchronous I/O, and when would you use each?

Synchronous I/O: The call blocks until the operation completes. read() blocks until data is available; write() blocks until data is written to the kernel buffer. Simple to use, but blocks the thread.

Asynchronous I/O: The call returns immediately, before the operation completes. The application continues executing and receives notification (via callback, signal, or poll) when the operation finishes. More complex but better for high concurrency.

Linux async I/O APIs:

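  • io_setup()/io_submit()/io_getevents(): Native Linux AIO (fully asynchronous mainly for files opened with O_DIRECT)
  • libaio: Thin user-space wrapper around those system calls
  • io_uring: Newer interface (kernel 5.1+) built on shared submission and completion rings, with significantly lower overhead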

When to use synchronous I/O:

  • Simple programs with few concurrent operations
  • When blocking is acceptable (low concurrency requirements)

When to use asynchronous I/O:

  • High-concurrency servers (thousands of simultaneous connections)
  • I/O-bound workloads where blocking would limit throughput
  • Overlapping multiple I/O operations
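Python's standard library does not expose io_uring or native Linux AIO, so the sketch below illustrates a closely related technique instead: readiness-based, non-blocking I/O with the selectors module (epoll on Linux). It is the poll-style notification model that many high-concurrency servers use, not completion-based asynchronous I/O. The function name collect_ready_messages and the use of socket pairs as stand-ins for client connections are assumptions for the example.

```python
import selectors
import socket


def collect_ready_messages(messages):
    """Receive one message per connection using a single thread.

    `messages` is a list of non-empty bytes objects. Each gets its own
    socket pair, standing in for a separate client connection.
    """
    sel = selectors.DefaultSelector()
    writers = []
    received = {}
    try:
        for index, _ in enumerate(messages):
            reader, writer = socket.socketpair()
            reader.setblocking(False)  # reads never block the thread
            sel.register(reader, selectors.EVENT_READ, data=index)
            writers.append(writer)

        # Data arrives in an order the reader does not control.
        for index in reversed(range(len(messages))):
            writers[index].sendall(messages[index])

        while len(received) < len(messages):
            events = sel.select(timeout=1.0)
            if not events:
                break  # nothing became readable; give up rather than spin
            for key, _mask in events:
                conn = key.fileobj
                received[key.data] = conn.recv(4096)
                sel.unregister(conn)
                conn.close()
    finally:
        for writer in writers:
            writer.close()
        sel.close()
    return received


if __name__ == "__main__":
    print(collect_ready_messages([b"first", b"second", b"third"]))
```

A synchronous version would call a blocking recv() on each connection in turn and stall on whichever one is slowest; here the thread only touches sockets the kernel has reported as readable.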

Further Reading

Topic-Specific Deep Dives:

  • Inode Internals: Explore how inodes are stored in the inode table, how inode generation numbers work (NFS embeds them in file handles to detect stale references to a reused inode number), and how file locks are tracked through per-inode lock lists.

  • ACLs and Extended Attributes: Study how ACLs extend the traditional Unix permission model. Explore setfacl/getfacl, and extended attributes (xattr) for SELinux labels, capabilities, and custom metadata.

  • Btrfs Copy-on-Write: Btrfs relies on COW rather than a traditional journal. When you modify a block, Btrfs writes a new copy rather than overwriting the old one. This makes snapshots cheap and works alongside checksums to support self-healing, but it has implications for fragmentation and SSD wear that are worth understanding.

  • FUSE Architecture: Filesystem in Userspace (FUSE) lets unprivileged processes implement file systems without kernel code. Study libfuse and projects like sshfs, gocryptfs, and mergerfs.

  • Network File System Internals: NFSv4 is stateful (unlike v3), and CIFS/SMB has complex caching semantics. Understanding how these differ from local file systems helps diagnose network mount issues.

Conclusion

The file system abstraction transforms raw storage into the familiar hierarchy of files and directories. At its core, inodes store metadata while directory entries map human-readable names to inode numbers. This separation enables hard links (multiple names pointing to the same file). Journaling, a separate mechanism, keeps the file system consistent after a crash.

VFS sits between your applications and specific file system implementations, providing a universal interface that ext4, XFS, NTFS, and others all implement. When you call open(), read(), or write(), the kernel resolves your request through VFS to the appropriate driver, with the page cache buffering I/O to minimize disk access.

For continued study, examine how ACLs extend the traditional Unix permission model, and how file systems like Btrfs add copy-on-write snapshots and native compression. The abstraction continues evolving — FUSE allows user-space file systems, while network file systems like NFS and CIFS extend the hierarchy across machines. Understanding these layers prepares you for debugging storage issues and designing applications that interact efficiently with persistent storage.
