File System Abstraction
Understanding how operating systems organize and manage files through abstraction layers, from physical storage to logical file operations.
Every time you save a document, browse a folder, or stream a video, you’re interacting with one of computing’s most elegant abstractions: the file system. Behind the scenes, mountains of complex logic collapse into the simple act of reading and writing files. This post peels back those layers to reveal how modern operating systems transform raw magnetic domains on a spinning disk into the seamless experience of organizing your digital life.
The file system abstraction is foundational to everything else in operating systems. It provides the illusion of persistent, structured storage while hiding the messy reality of sector boundaries, wear leveling, and hardware failures. Without this abstraction, every programmer would need to be a storage engineer just to save a configuration file.
Introduction
When to Use / When Not to Use
Understanding file system abstraction helps you make informed decisions about data storage and performance optimization.
When file system abstraction shines:
- Application development where persistence and portability matter
- Database design that relies on durability guarantees
- Backup systems that need to understand what’s stored where
- Debugging storage-related issues in production
When to look deeper:
- Embedded systems with custom storage controllers
- Performance-critical applications where every I/O matters
- Distributed systems where consistency guarantees get murky
- When the abstraction leaks and you need to understand why
Architecture or Flow Diagram
graph TD
A[Application System Calls] --> B[VFS Layer]
B --> C[Specific File System ext4, NTFS, FAT]
C --> D[Block Layer]
D --> E[Storage Device Driver]
E --> F[Physical Storage Device]
G[Inode Cache] -.-> B
H[Directory Entry Cache] -.-> B
I[Page Cache] -.-> B
The architecture reveals the layered approach. Applications speak to the kernel through system calls. The Virtual File System (VFS) layer acts as a universal translator, allowing different file systems to coexist. Below that, specific implementations handle the particulars of each format. The block layer manages I/O scheduling, and finally the device driver speaks to the hardware.
Core Concepts
Files: The Logical Unit
A file is an abstract container for data, defined by a set of attributes rather than physical properties. The file system maintains:
- Name: Human-readable identifier with case-sensitive or insensitive rules
- Metadata: Size, creation time, modification time, permissions, ownership
- Data blocks: Actual storage locations for content
- Location references: Pointers to where data lives on disk
Files are named sequences of bytes. The operating system doesn’t care what’s inside. A text document and a compiled binary are both just sequences of bytes to the file system layer.
Inodes: The Metadata Engine
The inode (index node) is the fundamental data structure in Unix-style file systems. Every file has exactly one inode, identified by an inode number that is unique within its file system.
struct inode {
unsigned long i_ino; // Inode number
umode_t i_mode; // File type and permissions
unsigned int i_nlink; // Link count (hard links)
uid_t i_uid; // Owner UID
gid_t i_gid; // Owner GID
loff_t i_size; // Size in bytes
struct timespec i_atime; // Access time
struct timespec i_mtime; // Modification time
struct timespec i_ctime; // Change time
unsigned long i_blocks; // Blocks allocated (512-byte blocks)
unsigned short i_bytes; // Remainder bytes beyond i_blocks (0-511)
struct super_block *i_sb; // Super block reference
struct inode_operations *i_op;// Operations table
};
The inode doesn’t contain the actual file name. That’s stored in directory entries, which map names to inode numbers. This separation allows multiple names (hard links) to point to the same file.
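To see the name/inode separation from user space, here is a minimal sketch that creates a file, adds a second name with `os.link()`, and compares what `os.stat()` reports for both. The file names are made up for illustration, and it assumes a file system that supports hard links.

```python
import os
import tempfile

def link_demo():
    """Create a file plus one hard link; return (same_inode, link_count)."""
    with tempfile.TemporaryDirectory() as tmp:
        original = os.path.join(tmp, "original.txt")  # hypothetical names
        alias = os.path.join(tmp, "alias.txt")
        with open(original, "w") as f:
            f.write("hello")
        os.link(original, alias)  # second directory entry for the same inode
        a, b = os.stat(original), os.stat(alias)
        return a.st_ino == b.st_ino, a.st_nlink

same_inode, link_count = link_demo()
print(f"same inode: {same_inode}, link count: {link_count}")
```

Both names report the same inode number, and the link count in the inode reflects the two directory entries.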
Directories: Name-to-Inode Mappings
A directory is itself a special file that maps names to inode numbers. Early Unix directories were simply files containing fixed-size records; later designs such as BSD's FFS moved to variable-length records along these lines:
struct direct {
unsigned long d_ino; // Inode number
unsigned short d_reclen;// Record length
unsigned char d_type; // File type (directory, regular, etc.)
char d_name[]; // File name (variable length)
};
Modern file systems use more sophisticated directory implementations, but the principle remains: directories are searchable collections of name-to-inode mappings.
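The name-to-inode mapping is visible from user space. A small sketch, assuming nothing beyond the standard library: `os.scandir()` yields directory entries, and each entry exposes its name and inode number. The two file names are placeholders.

```python
import os
import tempfile

def list_directory_entries(path):
    """Return {name: inode_number} for every entry in a directory."""
    with os.scandir(path) as entries:
        return {entry.name: entry.inode() for entry in entries}

with tempfile.TemporaryDirectory() as demo_dir:
    for name in ("a.txt", "b.txt"):  # placeholder files
        open(os.path.join(demo_dir, name), "w").close()
    mapping = list_directory_entries(demo_dir)

for name, ino in sorted(mapping.items()):
    print(f"{name} -> inode {ino}")
```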
Hierarchical Organization
The directory tree creates the familiar path structure:
graph TD
A["/ (root)"] --> B["/home"]
A --> C["/etc"]
A --> D["/var"]
B --> E["/home/user"]
E --> F["/home/user/documents"]
F --> G["report.pdf"]
style A stroke:#ff00ff,stroke-width:3px
style G stroke:#ff00ff,stroke-width:2px
Each directory entry points to an inode. Following the chain from root through each path component resolves a full path to a specific inode.
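The component-by-component walk can be imitated in a few lines. This is a user-space sketch, not how the kernel does it: it calls `os.stat()` on each successive prefix of an absolute path and records the inode found at every step. Symlinks are expanded up front with `os.path.realpath()` to keep the example simple, and POSIX-style paths are assumed.

```python
import os

def resolve_components(path):
    """Walk an absolute path one component at a time; return (prefix, inode) pairs."""
    path = os.path.realpath(path)  # expand symlinks first (simplification)
    steps = [("/", os.stat("/").st_ino)]
    current = "/"
    for component in path.strip("/").split("/"):
        if not component:  # the path was "/" itself
            continue
        current = os.path.join(current, component)
        steps.append((current, os.stat(current).st_ino))
    return steps

steps = resolve_components(os.getcwd())
for prefix, ino in steps:
    print(f"{ino:>12}  {prefix}")
```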
The Superblock: File System Metadata
The superblock contains the file system’s critical metadata:
- Total number of inodes and blocks
- Free inode and block counts
- Block size and group size
- Magic number to identify file system type
- Last mount time and write time
- Journal location (for journaling file systems)
The superblock is duplicated across the partition for redundancy. If the primary superblock corrupts, the file system can recover from a backup copy.
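To make the superblock less abstract, here is a hedged sketch that decodes a handful of fields from an ext2/3/4 primary superblock. It assumes the classic layout (superblock at byte offset 1024, little-endian fields, magic `0xEF53` at offset 56) and reads only the low 32 bits of the block counts. To stay self-contained it parses a synthetic in-memory image rather than a real device; the numbers in that image are invented.

```python
import struct

EXT_SUPERBLOCK_OFFSET = 1024  # primary superblock starts 1024 bytes into the volume
EXT_MAGIC = 0xEF53

def parse_ext_superblock(image: bytes) -> dict:
    """Decode a few fields from the primary ext2/3/4 superblock of a raw image."""
    sb = image[EXT_SUPERBLOCK_OFFSET:EXT_SUPERBLOCK_OFFSET + 1024]
    inodes, blocks_lo = struct.unpack_from("<II", sb, 0)
    free_blocks_lo, free_inodes = struct.unpack_from("<II", sb, 12)
    (log_block_size,) = struct.unpack_from("<I", sb, 24)
    (magic,) = struct.unpack_from("<H", sb, 56)
    if magic != EXT_MAGIC:
        raise ValueError(f"not an ext2/3/4 superblock (magic={magic:#06x})")
    return {
        "inodes": inodes,
        "free_inodes": free_inodes,
        "blocks": blocks_lo,            # low 32 bits only
        "free_blocks": free_blocks_lo,  # low 32 bits only
        "block_size": 1024 << log_block_size,
    }

# Synthetic image with invented values, standing in for a real partition.
image = bytearray(2048)
struct.pack_into("<II", image, EXT_SUPERBLOCK_OFFSET + 0, 65536, 262144)
struct.pack_into("<II", image, EXT_SUPERBLOCK_OFFSET + 12, 200000, 60000)
struct.pack_into("<I", image, EXT_SUPERBLOCK_OFFSET + 24, 2)
struct.pack_into("<H", image, EXT_SUPERBLOCK_OFFSET + 56, EXT_MAGIC)

info = parse_ext_superblock(bytes(image))
print(info)
```

Pointing the same function at the first 2 KB of a real ext4 partition should work in principle, but that requires read access to the device and is left as an exercise.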
Production Failure Scenarios
Scenario 1: Inode Exhaustion
What happened: A web server’s /tmp directory accumulated millions of small session files. The disk showed 40% free space, but the system couldn’t create new files. The issue: inode exhaustion. The partition had 100 million inodes, and 99 million were allocated to tiny files.
Detection:
df -i /tmp
# Filesystem Inodes IUsed IFree IUse% Mounted on
# /dev/sda1 100M 99M 1M 99% /tmp
Mitigation:
- Monitor inode usage in production: df -i
- Set up alerts when inode usage exceeds 80%
- Implement file expiration policies for temporary directories
- Consider partitioning strategies that balance inode allocation
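An inode alert like the one described above can be approximated with `os.statvfs()`, which exposes the same counters that `df -i` prints. This is a sketch: the 80% threshold mirrors the text, and `check_mount` is a hypothetical helper name. Some file systems allocate inodes dynamically and report a total of zero, so that case is handled explicitly.

```python
import os

def inode_usage_percent(total_inodes, free_inodes):
    """Percentage of inodes in use, or None when no inode total is reported."""
    if total_inodes == 0:  # dynamically allocated inodes: nothing to measure
        return None
    return 100.0 * (total_inodes - free_inodes) / total_inodes

def check_mount(path, threshold=80.0):
    """Return (usage_percent, should_alert) for the file system holding path."""
    vfs = os.statvfs(path)
    used = inode_usage_percent(vfs.f_files, vfs.f_ffree)
    return used, (used is not None and used >= threshold)

usage, alert = check_mount("/")
print(f"inode usage on /: {usage}, alert: {alert}")
```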
Scenario 2: Orphaned Inodes
What happened: A database application crashed mid-transaction, leaving behind an inode that referenced blocks on disk but no directory entry pointed to it. The storage was allocated but inaccessible. This is a “lost” file eating disk space silently.
Detection:
# Use fsck in no-write mode to find orphans (run against an unmounted file system)
sudo fsck -n /dev/sda1
# On ext2/3/4, force a full read-only check; unattached inodes are reported in pass 4
sudo e2fsck -fn /dev/sda1
Mitigation:
- Always properly unmount file systems before maintenance
- Use journaling file systems that prevent orphaned structures
- Run periodic integrity checks: xfs_repair for XFS, e2fsck for ext4
- Take advantage of file system features like tune2fs -l to check state
Scenario 3: Fragmentation-Induced Performance Death
What happened: A heavily-used file server showed acceptable space metrics but catastrophic performance. Reads that should take milliseconds stretched to seconds. The culprit: external fragmentation. File data was scattered across thousands of non-contiguous blocks.
Detection:
# Check fragmentation on ext4
sudo fsck.ext4 -nv /dev/sda1
# Get a fragmentation score without changing anything
sudo e4defrag -c /dev/sda1
Mitigation:
- Monitor fragmentation levels with vendor tools
- Defragment proactively with e4defrag (ext4) or xfs_fsr (XFS)
- Choose allocation strategies that minimize fragmentation (extents over blocks)
- Plan capacity with headroom to prevent fragmentation-inducing fullness
Trade-off Table
| Aspect | ext4 | XFS | Btrfs | NTFS |
|---|---|---|---|---|
| Max File Size | 16TB | 8EB | 16EB | 256TB |
| Max Volume Size | 1EB | 8EB | 16EB | 256TB |
| Journaling | Metadata only (default) | Metadata only | Copy-on-write | Metadata only |
| Snapshots | Via LVM | Via LVM | Native | Via VSS |
| Performance | General purpose | Large files | Mixed | Windows optimal |
| Fragmentation | Moderate | Low | Low (COW) | Moderate |
Implementation Snippet
Reading File Metadata in C
#include <stdio.h>
#include <sys/stat.h>
#include <time.h>

void print_file_info(const char *path) {
    struct stat st;
    if (stat(path, &st) == -1) {
        perror("stat");
        return;
    }
    /* Cast each field explicitly: the underlying types vary by platform. */
    printf("File: %s\n", path);
    printf("Size: %lld bytes\n", (long long) st.st_size);
    printf("Blocks: %lld\n", (long long) st.st_blocks);
    printf("IO Block: %ld bytes\n", (long) st.st_blksize);
    printf("Inode: %lu\n", (unsigned long) st.st_ino);
    printf("Links: %lu\n", (unsigned long) st.st_nlink);
    printf("Permissions: %o\n", (unsigned int) (st.st_mode & 0777));
    // Timestamps
    printf("Access: %s", ctime(&st.st_atime));
    printf("Modify: %s", ctime(&st.st_mtime));
    printf("Change: %s", ctime(&st.st_ctime));
}

int main(int argc, char *argv[]) {
    if (argc != 2) {
        fprintf(stderr, "Usage: %s <file>\n", argv[0]);
        return 1;
    }
    print_file_info(argv[1]);
    return 0;
}
Exploring Inodes with Python
import os
import stat
import sys

def explore_inode(filepath):
    """Explore file system inode information."""
    # lstat() does not follow symlinks, so the S_ISLNK branch below is reachable
    st = os.lstat(filepath)
    print(f"Path: {filepath}")
    print(f"Inode: {st.st_ino}")
    print(f"Device: {st.st_dev}")
    print(f"Hard links: {st.st_nlink}")
    print(f"Size: {st.st_size} bytes")
    print(f"Blocks (512b): {st.st_blocks}")
    print(f"Block size: {st.st_blksize}")
    mode = st.st_mode
    print(f"\nPermissions: {oct(stat.S_IMODE(mode))}")
    print("Type: ", end="")
    if stat.S_ISDIR(mode):
        print("Directory")
    elif stat.S_ISREG(mode):
        print("Regular File")
    elif stat.S_ISLNK(mode):
        print("Symbolic Link")
    else:
        print("Other")
    # Timestamps (seconds since the epoch)
    print(f"\nAccessed: {st.st_atime}")
    print(f"Modified: {st.st_mtime}")

if __name__ == "__main__":
    explore_inode(sys.argv[1] if len(sys.argv) > 1 else ".")
Observability Checklist
Monitor these metrics to keep your file systems healthy:
- df -h: Check available space across all mounted file systems
- df -i: Monitor inode usage to prevent inode exhaustion
- mount | grep -v tmpfs: List persistent mounts with their options
- tune2fs -l /dev/sdX | grep -i state: Check file system clean/dirty state
- dmesg | grep -i error | grep -i ext4: Scan kernel logs for file system errors
Key metrics to track:
- Used space percentage (alert at >85% for critical partitions)
- Inode usage percentage (alert at >80%)
- I/O wait time (CPU waiting on storage operations)
- Read/write throughput per partition
- Error counts from smartctl output
Logs to watch:
# Check for file system errors
sudo journalctl -b | grep -i "ext4\|xfs\|filesystem"
# Check the per-user inotify watch limit (this is the ceiling, not current consumption)
cat /proc/sys/fs/inotify/max_user_watches
# Check for file handle leaks
lsof | wc -l
Common Pitfalls / Anti-Patterns
Permission Model
Unix file permissions operate on three classes:
- Owner (u): The user who owns the file
- Group (g): Users in the file’s group
- Other (o): Everyone else
Three permission types for each class:
- Read (r): View file contents or list directory
- Write (w): Modify file contents or create/delete files in directory
- Execute (x): Run file as program or access directory contents
# View permissions
ls -la /path/to/file
# -rw-r--r-- 1 owner group 1234 May 19 10:00 file.txt
# Modify permissions
chmod 640 /path/to/file # Owner: rw, Group: r, Other: none
chmod u+x /path/to/file # Add execute for owner
chmod -R 755 /path/to/dir # Recursive
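The same owner/group/other breakdown can be read programmatically. A minimal sketch: `stat.filemode()` renders mode bits in `ls -l` style, and slicing that string gives the three permission classes. `describe_mode` is a hypothetical helper name.

```python
import stat

def describe_mode(mode):
    """Split a mode value into file type and owner/group/other rwx strings."""
    text = stat.filemode(mode)  # same layout as the first column of ls -l
    return {
        "type": text[0],
        "owner": text[1:4],
        "group": text[4:7],
        "other": text[7:10],
    }

# Regular file with mode 640, matching the chmod 640 example above.
info = describe_mode(stat.S_IFREG | 0o640)
print(info)
```

For a real file, pass `os.stat(path).st_mode` instead of the constructed value.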
Access Control Lists (ACLs)
Beyond traditional permissions, ACLs provide fine-grained control:
# Set ACL for specific user
setfacl -m u:alice:rw /path/to/file
# Set ACL for specific group
setfacl -m g:developers:rx /path/to/directory
# View ACLs
getfacl /path/to/file
# Remove specific ACL entry
setfacl -x u:alice /path/to/file
Security Considerations
- Sticky bit on shared directories: Prevents users from deleting others’ files in /tmp-like directories
  chmod +t /shared/directory
- Immutable files: Even root cannot modify immutable files without removing the flag
  chattr +i /important/file
  lsattr /important/file
  # ---i---------- /important/file
- Audit file access: Enable audit daemon to track who accesses what
  sudo auditctl -w /etc/shadow -p wa -k shadow_access
  sudo ausearch -k shadow_access
Common Pitfalls / Anti-patterns
1. Assuming Unlimited Space
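# BAD: Writing without checking space
with open("/tmp/large_file", "w") as f:
    while True:
        f.write("data" * 1000)  # Will fail eventually

# GOOD: Check before writing
import os
import shutil

def safe_write(path, data):
    payload = data.encode("utf-8")  # compare bytes, not characters
    # Check the file system that will actually hold the file
    usage = shutil.disk_usage(os.path.dirname(os.path.abspath(path)))
    if usage.free < len(payload):
        raise IOError(f"Insufficient space. Need {len(payload)}, have {usage.free}")
    # The write can still fail if another process fills the disk in the meantime
    with open(path, "wb") as f:
        f.write(payload)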
2. Ignoring Path Length Limits
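Most Linux file systems limit each path component to 255 bytes (NAME_MAX), and system calls accept paths up to 4096 bytes (PATH_MAX). Code that assumes different limits, or that copies names into fixed-size buffers, can truncate silently, which becomes a security issue when two distinct names collapse into one.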
3. Case Sensitivity Surprises
# On case-insensitive file system (Windows, macOS default)
touch MyFile.txt
ls myfile.txt # Shows MyFile.txt
# On case-sensitive file system (Linux, some network mounts)
touch MyFile.txt
ls myfile.txt # No such file or directory
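Rather than guessing from the operating system, code can probe the directory it cares about. A rough sketch, assuming write access to that directory: create a lowercase file and ask whether the uppercase spelling resolves. `is_case_sensitive` and the probe file names are hypothetical.

```python
import os
import tempfile

def is_case_sensitive(directory):
    """Probe whether the file system holding `directory` distinguishes case."""
    with tempfile.TemporaryDirectory(dir=directory) as probe_dir:
        lower = os.path.join(probe_dir, "probe.txt")
        upper = os.path.join(probe_dir, "PROBE.TXT")
        open(lower, "w").close()
        # On a case-insensitive file system the uppercase name finds the same file.
        return not os.path.exists(upper)

result = is_case_sensitive(tempfile.gettempdir())
print(f"case-sensitive: {result}")
```

The answer applies only to the file system behind the probed directory; another mount on the same machine may behave differently.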
4. Hard Links vs. Symbolic Links Confusion
# Hard link - same inode, must be on same filesystem
ln /existing/file /path/to/hardlink
ls -li /existing/file /path/to/hardlink # Same inode number
# Symbolic link - separate file pointing to path
ln -s /existing/file /path/to/symlink
ls -li /existing/file /path/to/symlink # Different inodes
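The practical difference shows up when the original name is removed. A small sketch with made-up file names: the hard link keeps the data reachable, while the symbolic link is left dangling.

```python
import os
import tempfile

def unlink_original_demo():
    """Delete the original name; report what each kind of link sees afterwards."""
    with tempfile.TemporaryDirectory() as tmp:
        original = os.path.join(tmp, "original.txt")  # hypothetical names
        hard = os.path.join(tmp, "hard.txt")
        soft = os.path.join(tmp, "soft.txt")
        with open(original, "w") as f:
            f.write("payload")
        os.link(original, hard)
        os.symlink(original, soft)
        os.remove(original)  # drops one name; the inode survives via the hard link
        with open(hard) as f:
            hard_content = f.read()
        # islink() inspects the link itself; exists() follows it to the missing target
        soft_dangling = os.path.islink(soft) and not os.path.exists(soft)
        return hard_content, soft_dangling

hard_content, soft_dangling = unlink_original_demo()
print(f"hard link reads: {hard_content!r}, symlink dangling: {soft_dangling}")
```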
Quick Recap Checklist
- File systems provide abstraction over raw storage devices
- Inodes store metadata; directories map names to inode numbers
- Hard links share inodes; symbolic links store paths
- Journaling file systems keep metadata consistent across crashes
- Monitor inode and space usage to prevent production surprises
- Permissions (rwx) apply to owner/group/others
- ACLs provide finer-grained access control
- Different file systems have different strengths and limits
Interview Questions
An inode (index node) is a data structure that stores metadata about a file in Unix-style file systems. It contains:
- The file's size and block count
- Owner UID and group GID
- File permissions (mode)
- Timestamps: access, modification, and change time
- Link count (number of hard links pointing to this inode)
- Pointers to the actual data blocks on disk
The inode does not store the file name—that lives in directory entries.
Hard links are multiple directory entries pointing to the same inode. They must be on the same file system and cannot link to directories. When you delete one hard link, the file persists until all hard links are removed. The inode's link count tracks this.
Symbolic links are special files that store a path (not an inode number). They can cross file systems and point to directories. Deleting the target breaks the symlink. They occupy a small inode for the link itself plus space for the path string.
Use hard links for equal-weight file names within a filesystem. Use symlinks when you need flexibility or cross-filesystem references.
When you execute ls -la, the journey through the kernel looks like:
- Shell parses the command and calls fork() to create a child process
- execve() system call replaces the child with the /bin/ls program
- ls calls the readdir() library function, which issues the getdents64() system call
- Kernel transfers control to the VFS layer, which identifies the mounted file system type
- File system driver (e.g., ext4) reads directory entries from disk or cache
- Page cache may serve frequently-accessed directory data
- Results flow back up through VFS to the application
The application never touches the disk directly—it only knows about the VFS interface.
This classic problem usually means one of two things is exhausted: inodes or file handles.
First, check inode usage: df -i. If the IUse% is near 100%, the partition has run out of inodes (common with millions of tiny files in /tmp).
If inodes are fine, check file handle limits:
- lsof | wc -l: current open file count
- cat /proc/sys/fs/file-max: system-wide limit
- ulimit -n: per-process limit
Also check for deleted files still held open by processes: lsof +L1 shows files deleted but not yet freed.
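The "deleted but still open" situation is easy to reproduce. A sketch assuming POSIX unlink semantics: after the only name is removed, the link count seen through the descriptor drops to zero, yet the data stays readable until the descriptor is closed.

```python
import os
import tempfile

def deleted_but_open_demo():
    """Unlink a file while holding it open; return (link_count, data, name_exists)."""
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, b"still here")
        os.unlink(path)                 # remove the only directory entry
        links = os.fstat(fd).st_nlink   # no names left pointing at the inode
        os.lseek(fd, 0, os.SEEK_SET)
        data = os.read(fd, 100)         # the blocks are still allocated and readable
        return links, data, os.path.exists(path)
    finally:
        os.close(fd)                    # only now can the space be reclaimed

links, data, name_exists = deleted_but_open_demo()
print(links, data, name_exists)
```

This is why `df` can report a full disk while `du` finds nothing large: the space belongs to inodes that no directory lists anymore.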
The open() system call triggers a multi-step process:
- Path resolution: Starting from the root directory (or current working directory for relative paths), each path component is looked up in its parent directory
- VFS lookup: The VFS layer calls the underlying file system's lookup function
- Permission checking: The kernel verifies the process has execute permission on each directory component and read/write on the file itself
- Inode retrieval: If the file exists, its inode is loaded into the inode cache
- Open file table entry: A new entry is created in the process's file descriptor table, pointing to a global open file table entry
- File structure allocation: Memory is allocated for the file structure, including current position and access mode
On success, a file descriptor (small integer) is returned. Subsequent read(), write(), and close() calls use this descriptor.
The superblock is the metadata heart of a file system. It lives in a fixed location (typically block 0 or block 1) and contains:
- File system magic number: Identifies the file system type (0xEF53 for ext2/3/4)
- Block size and total blocks: Fundamental geometry of the storage
- Inode count and free inode count: Total and available inodes
- Block count and free block count: Storage capacity information
- Last mount time and last write time: File system usage timestamps
- Journal location: For journaling file systems, where the journal lives
- Block group descriptors: Locations of block and inode bitmaps
Because superblock corruption is catastrophic, ext2/3/4 duplicate it at regular intervals throughout the partition. If the primary superblock is damaged, the file system can recover from a backup copy.
The page cache (formerly buffer cache) stores recently accessed file data in free RAM. When you read a file, the kernel checks if the data is already in the page cache. If so, it returns the cached data immediately—zero disk I/O. If not, it reads from disk and caches the result for future use.
Benefits:
- Read amplification reduction: Frequently read data doesn't require repeated disk reads
- Write coalescing: Multiple writes to the same region are combined before disk I/O
- Access time elimination: RAM access is nanoseconds; disk I/O is milliseconds
- Writeback batching: Delayed writes allow the kernel to batch and optimize I/O patterns
The page cache is unified—file data and anonymous memory (process heap/stack) share the same pool. The kernel's memory reclaim algorithm (`LRU`) evicts cold pages when memory pressure increases.
Buffered I/O (default): Data flows through the page cache. Writes go to the cache and are lazily flushed to disk. Reads pull from cache, loading from disk on cache miss. The kernel manages consistency and can reorder I/O for performance.
Direct I/O: Data bypasses the page cache, going directly between application memory and the storage device. Applications use the O_DIRECT flag to request this. The kernel still performs I/O scheduling at the block layer, but the page cache is not involved.
When to use direct I/O:
- Database systems that manage their own cache (Oracle, or MySQL's InnoDB when configured for O_DIRECT)
- Applications doing sequential scans that don't benefit from caching
- When you need predictable I/O patterns without kernel interference
Direct I/O bypasses the page cache but not the block layer scheduler or RAID controller cache.
POSIX defines the standard file operations: open(), close(), read(), write(), lseek(), stat(), rename(), unlink().
The kernel handles these through the VFS layer:
- open(): Returns a file descriptor. Kernel creates an open file table entry, calls the file system's inode_operations->lookup(), allocates a dentry cache entry.
- read(fd, buf, n): Kernel looks up the fd in the process's file descriptor table, finds the file struct, calls the file's file_operations->read() which chains to the underlying file system driver.
- write(): Same path but calls file_operations->write(). Data may go through the page cache and be dirty-flushed to disk later.
- lseek(): Updates the file offset in the open file table entry. Doesn't touch the underlying storage.
The kernel maintains three tables: the per-process file descriptor table (array of pointers), the system-wide open file table (struct file with position and mode), and the dentry cache (directory entry to inode mapping). Each layer has a distinct purpose in managing file access.
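The split between the descriptor table and the open file table can be observed directly. In this sketch, `os.dup()` creates a second descriptor that points at the same open file entry and therefore shares its offset, while a second `os.open()` on the same path gets its own entry with an independent offset.

```python
import os
import tempfile

def offset_sharing_demo():
    """Compare reads through a dup()'d descriptor and a separately opened one."""
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, b"abcdefgh")
        os.lseek(fd, 0, os.SEEK_SET)
        dup_fd = os.dup(fd)                     # new descriptor, same open file entry
        other_fd = os.open(path, os.O_RDONLY)   # new open file entry, offset starts at 0
        try:
            first = os.read(fd, 3)              # advances the shared offset to 3
            via_dup = os.read(dup_fd, 3)        # continues from the shared offset
            via_other = os.read(other_fd, 3)    # independent offset, starts over
            return first, via_dup, via_other
        finally:
            os.close(dup_fd)
            os.close(other_fd)
    finally:
        os.close(fd)
        os.remove(path)

first, via_dup, via_other = offset_sharing_demo()
print(first, via_dup, via_other)
```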
A journaling file system records intended operations in a journal (a separate area on disk) before applying them to the main file system. If a crash occurs mid-operation, the journal can be replayed to complete or undo the operation, preventing metadata corruption.
Journaling process:
- Write operation to journal (begin transaction)
- Perform operation on main file system
- Mark transaction as complete in journal
If a crash happens before step 3, the journal contains the incomplete operation. On reboot, the file system replays the journal and either completes the operation or undoes it, ensuring consistency.
Non-journaling file systems (like ext2, FAT) write directly to the file system. Crashes can leave the file system in an inconsistent state with lost or orphaned data. Checking consistency requires scanning the entire file system (e.g., fsck), which is slow on large file systems.
Trade-offs:
- Journaling adds write overhead (journal writes before main writes)
- Journaling protects metadata but not necessarily file data (metadata-only journaling is common)
- Ext4 defaults to "ordered" mode: only metadata is journaled, but data blocks are written to disk before the metadata that references them is committed, so committed metadata never points at stale blocks
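The recovery logic is easier to follow in a toy model. The sketch below is an in-memory, redo-style journal, a simplification and not how ext4's journal is actually implemented: updates are logged, a commit record makes them durable, and replay after a simulated crash applies only committed transactions. `ToyJournal` and the metadata keys are invented for illustration.

```python
class ToyJournal:
    """In-memory sketch of redo-style write-ahead journaling for metadata."""

    def __init__(self):
        self.journal = []    # logged transactions, in order
        self.metadata = {}   # stands in for the main file system's metadata
        self._next_id = 1

    def log(self, updates):
        """Record intended updates in the journal and return the transaction id."""
        txn_id = self._next_id
        self._next_id += 1
        self.journal.append({"id": txn_id, "updates": dict(updates), "committed": False})
        return txn_id

    def commit(self, txn_id):
        """Write the commit record: the transaction will survive a crash."""
        for txn in self.journal:
            if txn["id"] == txn_id:
                txn["committed"] = True

    def replay(self):
        """Crash recovery: apply committed transactions, drop incomplete ones."""
        for txn in self.journal:
            if txn["committed"]:
                self.metadata.update(txn["updates"])
        self.journal.clear()

fs = ToyJournal()
t1 = fs.log({"inode 12 size": 4096})
fs.commit(t1)
t2 = fs.log({"inode 13 size": 100})  # simulated crash: never committed
fs.replay()
print(fs.metadata)
```

The uncommitted transaction simply disappears, which is the point: the main metadata never reflects a half-finished operation.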
Copy-on-Write (COW) is a mechanism where modifying data does not overwrite the existing data in place. Instead, the modified data is written to a new location, and the pointer is updated.
In Btrfs:
- When you modify a data block, Btrfs writes the new version to a free location
- Only the metadata (which points to the data) is updated to reference the new location
- The old data stays in place until nothing references it anymore (for example, a snapshot), at which point its space is freed
Advantages:
- Snapshots: Creating a snapshot is instantaneous—the snapshot shares the same data blocks until one copy is modified. Only the changed blocks consume additional space.
- Self-healing: Btrfs maintains checksums for data and metadata. On read, if corruption is detected, Btrfs can read the good copy from a mirror (in RAID configurations) and repair the corrupted block.
- Consistency: No in-place overwrites means crashes cannot corrupt existing data—only the incomplete new write is lost.
Disadvantages:
- Fragmentation: COW causes data to be scattered across the disk over time
- Write amplification: A small write also forces new copies of the metadata blocks on the path up to the tree root
- SSD wear: Frequent COW writes increase the total volume written; wear leveling is handled by the SSD's controller, not by Btrfs
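A toy model makes the pointer-swap idea concrete. This sketch is a deliberate simplification: real Btrfs uses reference-counted B-trees rather than a flat dictionary, and `CowVolume` is an invented name. Writes always allocate a new physical block, and a snapshot copies only the pointers.

```python
class CowVolume:
    """Toy copy-on-write volume: writes never overwrite an existing block."""

    def __init__(self):
        self.blocks = {}      # physical block id -> data
        self.live = {}        # logical block number -> physical block id
        self.snapshots = {}   # snapshot name -> frozen pointer map
        self._next_phys = 0

    def write(self, logical, data):
        phys = self._next_phys
        self._next_phys += 1
        self.blocks[phys] = data    # new data goes to a fresh location
        self.live[logical] = phys   # then the pointer is switched over

    def snapshot(self, name):
        self.snapshots[name] = dict(self.live)  # copies pointers, not data

    def read(self, logical, snapshot=None):
        mapping = self.snapshots[snapshot] if snapshot else self.live
        return self.blocks[mapping[logical]]

vol = CowVolume()
vol.write(0, b"v1")
vol.snapshot("before-upgrade")
vol.write(0, b"v2")
print(vol.read(0), vol.read(0, snapshot="before-upgrade"))
```

After the second write, both versions exist as separate physical blocks, which is also where the fragmentation cost comes from.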
The Virtual File System (VFS) is an abstraction layer in the Linux kernel that provides a unified interface to all file system implementations. Applications interact with VFS, not directly with ext4, XFS, or NTFS.
VFS defines four core objects, each with its own table of operations:
struct super_block // The file system as a whole
struct inode // A specific file
struct dentry // A directory entry (name to inode mapping)
struct file // An open file handle
Each file system implements these operations for its specific on-disk format. When an application calls open(), VFS:
- Determines which mounted file system should handle the request
- Calls the file system's inode_operations->lookup() to find the inode
- Creates a struct file with the file system's file_operations
This design allows transparent access to different file systems. You can mount ext4, XFS, and FAT pen drives simultaneously. Network file systems like NFS and CIFS also implement VFS operations, making network drives appear as local directories.
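The dispatch idea can be sketched in user space. Everything here is invented for illustration (`FileSystem`, `DictFS`, `ToyVFS`) and bears no relation to the kernel's actual structures: a mount table maps mount points to implementations of a shared interface, and a read is routed to the file system with the longest matching mount point.

```python
from abc import ABC, abstractmethod

class FileSystem(ABC):
    """The common interface every toy file system implements."""
    @abstractmethod
    def read(self, relative_path: str) -> bytes: ...

class DictFS(FileSystem):
    """A trivial file system backed by a dictionary."""
    def __init__(self, files):
        self.files = files

    def read(self, relative_path):
        return self.files[relative_path]

class ToyVFS:
    def __init__(self):
        self.mounts = {}  # mount point -> FileSystem

    def mount(self, mount_point, fs):
        self.mounts[mount_point.rstrip("/") or "/"] = fs

    def read(self, path):
        # Pick the longest mount point that is a prefix of the path.
        candidates = [m for m in self.mounts
                      if path == m or path.startswith(m.rstrip("/") + "/")]
        mount_point = max(candidates, key=len)
        relative = path[len(mount_point):].lstrip("/")
        return self.mounts[mount_point].read(relative)

vfs = ToyVFS()
vfs.mount("/", DictFS({"etc/hosts": b"127.0.0.1 localhost\n"}))
vfs.mount("/mnt/usb", DictFS({"photo.jpg": b"JPEG"}))
print(vfs.read("/etc/hosts"), vfs.read("/mnt/usb/photo.jpg"))
```

The caller uses one `read()` call regardless of which backend holds the file, which is the property the real VFS provides to applications.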
Fragmentation occurs when file data blocks are not contiguous on disk. As files are created, modified, and deleted, free space becomes scattered, and new files must be allocated from non-contiguous blocks.
Types:
- Internal fragmentation: Wasted space within blocks (file smaller than block size)
- External fragmentation: File blocks scattered across disk (causes slow reads)
- Metadata fragmentation: Directory entries spread across disk
Tools for defragmentation:
- ext4: e4defrag /dev/sda1 or e4defrag /mount/point
- XFS: xfs_fsr /dev/sda1 (online defragmentation)
- Btrfs: btrfs filesystem defragment -r /mount/point (btrfs balance is a separate operation that reallocates chunks rather than defragmenting files)
- Windows: defrag C: in command prompt or Optimize Drives utility
Defragmentation helps mechanical hard drives (sequential access is faster) but is less important for SSDs (random access is fast, COW complicates it). Modern file systems with extent allocation (rather than block allocation) are more resistant to fragmentation.
A hard link is a directory entry that points directly to an inode. Multiple hard links share the same inode, and the link count stored in the inode tracks how many directory entries reference it. When all hard links are deleted (link count becomes 0), the inode and its data blocks are freed.
Hard link rules:
- Must be on the same file system as the original file
- Cannot link to directories (prevents cycles in the directory tree)
- The inode is shared—modifying the file affects all hard links
A symbolic link is a special file type containing a path string. It points to a path (not an inode), can cross file systems, and can point to directories or non-existent targets.
When to use hard links:
- Multiple names for the same file in the same directory tree
- Backup schemes where you want to maintain multiple references
When to use symbolic links:
- Creating shortcuts or aliases
- Linking across file systems or to directories
- Pointing to files that may be renamed or moved
Block devices transfer data in fixed-size blocks (typically 512 bytes to 4KB). They support random access—you can seek to any block location. Examples: hard drives, SSDs, USB mass storage. The kernel buffers reads/writes through the page cache for efficiency.
Character devices transfer data as a stream of bytes, one character at a time. No random access, no buffering. Examples: keyboards, serial ports, terminals (/dev/tty), /dev/null. Programs read and write byte streams without knowing the underlying device.
Key differences:
- Block devices: buffered I/O, random access, block-oriented
- Character devices: unbuffered, sequential stream, byte-oriented
Device files are created with mknod: mknod /dev/sda b 8 0 (b = block device), mknod /dev/tty c 5 0 (c = character device). The major number identifies the driver; the minor number identifies the specific device instance.
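The device type and its major/minor numbers are stored in the device file's inode and can be read without special privileges. A short sketch, assuming a Unix-like system where `/dev/null` exists; `device_kind` is a hypothetical helper.

```python
import os
import stat

def device_kind(path):
    """Return (kind, major, minor) for a device file, or ('not a device', None, None)."""
    st = os.stat(path)
    if stat.S_ISBLK(st.st_mode):
        kind = "block"
    elif stat.S_ISCHR(st.st_mode):
        kind = "character"
    else:
        return ("not a device", None, None)
    # st_rdev encodes the driver (major) and the device instance (minor)
    return (kind, os.major(st.st_rdev), os.minor(st.st_rdev))

print(device_kind("/dev/null"))
```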
The dentry cache caches recently used directory entries, mapping directory names to inode numbers. It's a critical optimization because path resolution involves multiple directory lookups—every component in a path must be looked up.
For example, opening /home/user/documents/report.txt requires:
- Look up root / → inode
- Look up home in root → inode
- Look up user in /home → inode
- Look up documents in /home/user → inode
- Look up report.txt in /home/user/documents → inode
Without caching, each lookup requires disk I/O. The dentry cache stores these mappings, enabling O(1) lookup for frequently accessed paths. The dentry cache also stores parent pointers, enabling fast .. lookups and pathname canonicalization.
When directories are modified, the kernel invalidates affected dentry entries. The dentry cache is integrated with the inode cache—dentries reference inodes, and inodes reference their dentries.
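The effect of the cache can be illustrated with a toy that counts simulated disk lookups. The class name, the inode numbers, and the choice of 2 as the root inode (an ext convention) are all assumptions for the example; the real dentry cache also handles invalidation, negative entries, and eviction, which this sketch ignores.

```python
class ToyDentryCache:
    """Caches (parent inode, name) -> child inode and counts cache misses."""

    def __init__(self, directories):
        # directories: {dir_inode: {name: child_inode}}, standing in for on-disk data
        self.directories = directories
        self.cache = {}
        self.disk_lookups = 0

    def lookup(self, parent_inode, name):
        key = (parent_inode, name)
        if key not in self.cache:
            self.disk_lookups += 1  # cache miss: pretend to read the directory
            self.cache[key] = self.directories[parent_inode][name]
        return self.cache[key]

    def resolve(self, path, root_inode=2):
        inode = root_inode
        for component in path.strip("/").split("/"):
            inode = self.lookup(inode, component)
        return inode

dcache = ToyDentryCache({
    2: {"home": 11},
    11: {"user": 12},
    12: {"documents": 13},
    13: {"report.txt": 14},
})
first = dcache.resolve("/home/user/documents/report.txt")
second = dcache.resolve("/home/user/documents/report.txt")
print(first, second, dcache.disk_lookups)
```

The second resolution touches no simulated disk at all: every component is served from the cache.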
The page cache stores file data in RAM to reduce disk I/O. When you read a file, the kernel checks the page cache. On cache hit, data is returned immediately. On cache miss, data is read from disk and cached for future access.
For file-backed memory mappings (via mmap()), the page cache is used as the backing store. When a program accesses a memory-mapped file:
- CPU generates virtual address
- MMU translates to physical address
- If page not in memory (not mapped), page fault occurs
- Kernel reads the file data into a page cache page
- MMU maps the page
- Access completes
Modifications to memory-mapped files go through the page cache. The kernel writes modified (dirty) pages back to disk either periodically or when msync() is called.
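A short sketch of the mapped-file path using Python's `mmap` module on a Unix-like system: the file is changed through a memory slice rather than `write()`, `flush()` plays the role of `msync()`, and an ordinary `read()` on the descriptor then sees the new bytes.

```python
import mmap
import os
import tempfile

def mmap_roundtrip():
    """Modify a file through a shared mapping, then read it back normally."""
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, b"hello world")
        with mmap.mmap(fd, 0) as mapped:   # length 0 maps the whole file, read/write
            mapped[0:5] = b"HELLO"         # a memory store, not a write() call
            mapped.flush()                 # ask the kernel to write back dirty pages
        os.lseek(fd, 0, os.SEEK_SET)
        return os.read(fd, 11)
    finally:
        os.close(fd)
        os.remove(path)

content = mmap_roundtrip()
print(content)
```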
The page cache is unified: file data and anonymous memory (process heap/stack) share the same physical memory pool. The kernel's LRU (Least Recently Used) algorithm evicts cold pages when memory is needed.
FUSE allows unprivileged processes to implement file systems without kernel code. The file system logic runs in user space, communicating with the kernel via a special device (/dev/fuse).
FUSE architecture:
- User-space program registers with FUSE kernel module
- When VFS receives operations on the mount point, it passes them to the FUSE kernel module
- FUSE forwards operations to the user-space program
- User-space program performs the operation (reads from network, decrypts, etc.)
- FUSE returns the result to VFS
FUSE use cases:
- sshfs: Mount remote SSH servers as local directories
- gocryptfs: Encrypted file systems without kernel modules
- mergerfs: Pool multiple drives into one logical volume
- bindfs: Permission remapping layer
- ntfs-3g: FUSE-based NTFS driver for Linux (a separate in-kernel driver, ntfs3, was merged in Linux 5.15)
FUSE is slower than kernel file systems due to context switches, but it enables rapid development and lets unprivileged users mount a file system.
A file descriptor is a small integer (index) in the process's file descriptor table. Each entry points to a struct file kernel object. The number of open files is limited by system and per-process limits.
System-wide limits:
- /proc/sys/fs/file-max: Maximum number of open files system-wide
- /proc/sys/fs/file-nr: Currently allocated (allocated, free, max)
Per-process limits:
- ulimit -n: Soft limit (can increase to hard limit)
- ulimit -Hn: Hard limit
- Default: commonly 1024 soft; the hard limit varies by distribution and is often much higher
When limits are hit:
- open() returns -1 with errno = EMFILE (per-process limit reached) or ENFILE (system-wide limit reached)
- Long-running processes can leak file descriptors if not closed properly
Common causes of exhaustion:
- File descriptor leaks (opened but never closed)
- Logging to many files simultaneously
- Making many network connections
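The per-process limits can be read from inside a program, and the two error codes can be told apart when an open fails. A sketch for Unix-like systems using the standard `resource` module; `explain_open_error` is a hypothetical helper and its messages are illustrative.

```python
import errno
import resource

# Same values that `ulimit -n` and `ulimit -Hn` report for this process.
soft_limit, hard_limit = resource.getrlimit(resource.RLIMIT_NOFILE)

def explain_open_error(err):
    """Map an OSError raised by open() to the limit that was hit."""
    if err.errno == errno.EMFILE:
        return "per-process descriptor limit reached (see ulimit -n)"
    if err.errno == errno.ENFILE:
        return "system-wide open file limit reached (see /proc/sys/fs/file-max)"
    return f"other error: {err.strerror}"

print(f"soft={soft_limit}, hard={hard_limit}")
print(explain_open_error(OSError(errno.EMFILE, "Too many open files")))
```

Logging which of the two limits was hit saves time in production: one points at a leak or a low `ulimit`, the other at the whole machine.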
Synchronous I/O: The call blocks until the operation completes. read() blocks until data is available; write() blocks until data is written to the kernel buffer. Simple to use, but blocks the thread.
Asynchronous I/O: The call returns immediately, before the operation completes. The application continues executing and receives notification (via callback, signal, or poll) when the operation finishes. More complex but better for high concurrency.
Linux async I/O APIs:
- io_setup()/io_submit()/io_getevents(): Native Linux AIO
- libaio: Wrapper around system calls
- io_uring: New interface (kernel 5.1+) with significantly lower overhead
When to use synchronous I/O:
- Simple programs with few concurrent operations
- When blocking is acceptable (low concurrency requirements)
When to use asynchronous I/O:
- High-concurrency servers (thousands of simultaneous connections)
- I/O-bound workloads where blocking would limit throughput
- Overlapping multiple I/O operations
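Python's standard library does not expose Linux AIO or io_uring, so the sketch below uses a stand-in technique: `asyncio.to_thread()` (Python 3.9+) hands each blocking read to a worker thread, which lets several reads overlap while the event loop stays responsive. It illustrates the programming model of "submit, keep going, collect results later", not the kernel interfaces listed above. File names and contents are invented.

```python
import asyncio
import os
import tempfile

def read_file(path):
    """Ordinary blocking read; runs inside a worker thread."""
    with open(path, "rb") as f:
        return f.read()

async def read_all(paths):
    # Each read is submitted without waiting for the previous one to finish;
    # gather() returns results in the same order as the inputs.
    return await asyncio.gather(*(asyncio.to_thread(read_file, p) for p in paths))

def demo():
    with tempfile.TemporaryDirectory() as tmp:
        paths = []
        for i in range(3):
            path = os.path.join(tmp, f"part{i}.bin")  # placeholder files
            with open(path, "wb") as f:
                f.write(bytes([i]) * 4)
            paths.append(path)
        return asyncio.run(read_all(paths))

results = demo()
print(results)
```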
Further Reading
Topic-Specific Deep Dives:
- Inode Internals: Explore how inodes are stored in the inode table, how inode generation numbers work (for NFSv4), and how file locks are implemented via the inode’s flock list.
- ACLs and Extended Attributes: Study how ACLs extend the traditional Unix permission model. Explore setfacl/getfacl, and extended attributes (xattr) for SELinux labels, capabilities, and custom metadata.
- Btrfs Copy-on-Write: Btrfs doesn’t use journaling; it uses COW. When you modify a block, Btrfs writes a new copy rather than overwriting. This enables snapshots, checksums, and self-healing but requires understanding its implications for fragmentation and SSD wear.
- FUSE Architecture: Filesystem in Userspace (FUSE) lets unprivileged processes implement file systems without kernel code. Study libfuse and projects like sshfs, gocryptfs, and mergerfs.
- Network File System Internals: NFSv4 is stateful (unlike v3), and CIFS/SMB has complex caching semantics. Understanding how these differ from local file systems helps diagnose network mount issues.
Conclusion
The file system abstraction transforms raw storage into the familiar hierarchy of files and directories. At its core, inodes store metadata while directory entries map human-readable names to inode numbers. This separation enables hard links (multiple names pointing to the same file), while journaling keeps the file system's metadata consistent across crashes.
VFS sits between your applications and specific file system implementations, providing a universal interface that ext4, XFS, NTFS, and others all implement. When you call open(), read(), or write(), the kernel resolves your request through VFS to the appropriate driver, with the page cache buffering I/O to minimize disk access.
For continued study, examine how ACLs extend the traditional Unix permission model, and how file systems like Btrfs add copy-on-write snapshots and native compression. The abstraction continues evolving — FUSE allows user-space file systems, while network file systems like NFS and CIFS extend the hierarchy across machines. Understanding these layers prepares you for debugging storage issues and designing applications that interact efficiently with persistent storage.