Journaling & Crash Recovery
How file systems use write-ahead logging and journal checksums to ensure consistency and enable recovery from system crashes without data loss.
Imagine you’re writing a document and your computer crashes. Without journaling, you come back to find not just the document in an unknown state, but potentially the file system itself corrupted—the operating system’s map of where everything is stored. Journaling prevents this catastrophe by recording what you’re about to do before you do it, so if a crash interrupts the operation, the system knows exactly where it left off.
Journaling is one of the most important innovations in file system reliability. It transformed file systems from fragile structures prone to catastrophic failure into robust systems that survive crashes gracefully. Understanding journaling explains why your server can crash at 3 AM and come back online cleanly at 3:02 AM.
─────────────────────────────────────────────────
Introduction
When to Use / When Not to Use
Understanding journaling helps with system reliability and recovery planning.
When journaling is essential:
- Production servers and critical systems
- Any system where data integrity matters more than raw speed
- Systems prone to unexpected shutdowns (power loss, kernel panic)
- Database servers and application infrastructure
When you might skip journaling:
- Read-only or embedded systems with limited write cycles (flash storage)
- Temporary scratch systems where speed outweighs reliability
- Battery-backed systems with guaranteed clean shutdown
Architecture or Flow Diagram
graph TD
A[Write Request] --> B[Log to Journal]
B --> C[Wait for Journal Write Commit]
C --> D[Modify File System]
D --> E[Checkpoint Complete]
F[Crash Before Commit] --> G[Recovery Mode]
G --> H[Scan Journal]
H --> I[Discard Uncommitted Transaction]
J[Crash After Commit] --> K[Recovery Mode]
K --> L[Replay Journal]
L --> M[Apply Committed Transactions]
style A stroke:#ff00ff,stroke-width:2px
style G stroke:#00fff9
style M stroke:#ff00ff,stroke-width:2px
The journal acts as a staging area. Modifications go there first and are committed before affecting the main file system.
Core Concepts
Write-Ahead Logging
The fundamental principle of journaling: write to the journal before modifying the file system.
sequenceDiagram
participant App as Application
participant FS as File System
participant J as Journal
participant Disk as Main Disk
App->>FS: Write data to /file.txt
FS->>J: Log transaction (begin)
FS->>J: Log metadata changes
FS->>J: Log commit block
J->>Disk: Flush journal to disk
FS->>Disk: Modify file system data
FS->>J: Mark transaction complete
Why write-ahead matters: If the system crashes after the journal is written but before the main file system is modified, recovery can replay the journal and complete the operation. If it crashes during the journal write itself, the transaction is incomplete and can be ignored.
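The two crash cases described above can be sketched in a few lines of Python. This is a toy model, not how any real file system is implemented: `write_with_journal`, `recover`, and the `crash_after` parameter are hypothetical names, the "disk" is a dictionary, and the journal is a list.

```python
def write_with_journal(disk, journal, block_no, new_value, crash_after=None):
    """Apply one block update using write-ahead logging.

    crash_after simulates a crash after the named step:
    "log", "commit", or None for no crash.
    """
    journal.append(("data", block_no, new_value))  # 1. log the intended change
    if crash_after == "log":
        return
    journal.append(("commit",))                    # 2. commit record
    if crash_after == "commit":
        return
    disk[block_no] = new_value                     # 3. update the main file system
    journal.clear()                                # 4. checkpoint: log no longer needed


def recover(disk, journal):
    """Replay the logged change only if its commit record reached the journal."""
    if journal and journal[-1] == ("commit",):
        for _, block_no, value in journal[:-1]:
            disk[block_no] = value
    journal.clear()


# Crash before commit: the change is ignored and the old value survives.
disk, journal = {7: "old"}, []
write_with_journal(disk, journal, 7, "new", crash_after="log")
recover(disk, journal)
result_before_commit = disk[7]

# Crash after commit but before the main write: recovery completes the change.
disk, journal = {7: "old"}, []
write_with_journal(disk, journal, 7, "new", crash_after="commit")
recover(disk, journal)
result_after_commit = disk[7]
```

In both cases the block ends up either fully old or fully new, which is the guarantee write-ahead logging is meant to provide.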
Journal Structure
The journal is a circular buffer organized into transactions:
graph TD
A[Journal Superblock] --> B[Transaction 1]
B --> C[Transaction 2]
C --> D[Transaction 3]
D --> E[Transaction N]
E --> F[...wraps around...]
F --> A
B --> G[Descriptor Block]
G --> H[Block 1 Modification]
G --> I[Block 2 Modification]
G --> J[Commit Block]
style A stroke:#ff00ff,stroke-width:3px
style G stroke:#00fff9
style J stroke:#ff00ff
Transaction structure:
- Descriptor Block: Lists all modifications in this transaction
- Block References: Actual data or metadata being modified
- Commit Block: Indicates successful completion
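As a rough illustration of this layout, the sketch below models one transaction and maps its records onto slots of a circular journal. The `Transaction` class and `place_in_journal` function are hypothetical, and the model ignores details of real journals such as the slot reserved for the journal superblock.

```python
from dataclasses import dataclass, field


@dataclass
class Transaction:
    sequence: int
    descriptor: list = field(default_factory=list)  # file system block numbers
    blocks: list = field(default_factory=list)      # journaled copies of those blocks

    def log_block(self, block_no, contents):
        self.descriptor.append(block_no)
        self.blocks.append(contents)

    def journal_records(self):
        """Records in on-journal order: descriptor, modified blocks, commit."""
        records = [("descriptor", self.sequence, tuple(self.descriptor))]
        records += [("block", self.sequence, contents) for contents in self.blocks]
        records.append(("commit", self.sequence, None))
        return records


def place_in_journal(txn, start, journal_size):
    """Map each record to a journal slot, wrapping at the end of the circular log."""
    return [((start + i) % journal_size, record[0])
            for i, record in enumerate(txn.journal_records())]


txn = Transaction(sequence=5)
txn.log_block(120, b"inode table block")
txn.log_block(8, b"block bitmap")
layout = place_in_journal(txn, start=6, journal_size=8)
# layout == [(6, "descriptor"), (7, "block"), (0, "block"), (1, "commit")]
```

The transaction starts near the end of the journal, so its last two records wrap around to slots 0 and 1.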
Journal Modes
Different file systems offer different journaling modes with different trade-offs:
graph TD
A[Journal Modes] --> B[Journal Mode<br/>All data + metadata]
A --> C[Ordered Mode<br/>Metadata only]
A --> D[Writeback Mode<br/>Metadata only<br/>Data any order]
B --> E[Most reliable<br/>Slowest writes]
C --> F[Good reliability<br/>Moderate speed]
D --> G[Fastest<br/>Potential data loss]
style E stroke:#00fff9
style F stroke:#ff00ff
style G stroke:#00fff9
ext3/ext4 modes:
- journal: All data and metadata go through the journal. Safest but slowest.
- ordered: Metadata is journaled, and data is written before metadata is committed. Good balance.
- writeback: Only metadata is journaled, and data order is not guaranteed. Fastest, but risks stale data after a crash.
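# Check the default data mode stored in the superblock
sudo tune2fs -l /dev/sda1 | grep -i "default mount options"
# Check the data mode in effect on a mounted file system
grep "data=" /proc/fs/ext4/sda1/options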
# Change journal mode
sudo tune2fs -o journal_data_writeback /dev/sda1 # writeback
sudo tune2fs -o journal_data_ordered /dev/sda1 # ordered (default)
sudo tune2fs -o journal_data /dev/sda1 # journal (full)
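The difference between the three modes comes down to when file data becomes durable relative to the metadata commit. The sketch below is a deliberately simplified model of that ordering, not a description of the kernel's I/O path; `WORST_CASE_ORDER` and `stale_data_possible` are hypothetical names.

```python
# Worst-case order in which writes become durable, per mode (simplified model).
# Each step is (destination, what is written).
WORST_CASE_ORDER = {
    "journal":   [("journal", "data"), ("journal", "metadata"), ("journal", "commit")],
    "ordered":   [("disk", "data"), ("journal", "metadata"), ("journal", "commit")],
    "writeback": [("journal", "metadata"), ("journal", "commit"), ("disk", "data")],
}


def stale_data_possible(mode):
    """True if a crash can leave committed metadata pointing at unwritten data."""
    steps = [what for _, what in WORST_CASE_ORDER[mode]]
    return steps.index("data") > steps.index("commit")


results = {mode: stale_data_possible(mode) for mode in WORST_CASE_ORDER}
# results == {"journal": False, "ordered": False, "writeback": True}
```

Only writeback allows the commit to land before the data, which is the window in which a crash can expose stale block contents.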
Recovery Process
After a crash, the file system runs recovery:
# Pseudo-code for recovery algorithm (helper functions are placeholders)
def recover_journal(journal_device):
    # Read journal superblock
    jsb = read_journal_superblock(journal_device)
    # Find last committed transaction
    last_trans = find_last_committed_transaction(jsb)
    # Walk the log from the last checkpoint, oldest transaction first
    for transaction in transactions_since_checkpoint(jsb):
        if transaction.is_committed():
            # Replay: apply modifications to main file system
            replay_transaction(transaction)
        else:
            # Incomplete: ignore it and everything logged after it
            discard_transaction(transaction)
            break
    # Clear journal (checkpoint complete)
    clear_journal_up_to(last_trans)
Recovery steps:
- Scan journal from last checkpoint
- Identify committed transactions
- Replay committed transactions to main file system
- Discard incomplete transactions
- Update journal superblock
Journal Checksums
Modern journaling includes checksums to detect journal corruption:
// Common header at the start of every journal block
// (simplified from the Linux JBD2 on-disk format; fields are big-endian)
struct journal_header {
    __be32 h_magic;      // Journal magic number
    __be32 h_blocktype;  // Type of block (descriptor, commit, etc.)
    __be32 h_sequence;   // Transaction sequence number
};
// With checksums enabled, descriptor and revoke blocks end with a tail
struct jbd2_journal_block_tail {
    __be32 t_checksum;   // CRC32C checksum of the journal block
};
Without checksums, a corrupted journal could replay garbage or miss valid transactions. Checksums detect this.
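A minimal sketch of the idea, using Python's `zlib.crc32` as a stand-in for the checksum a real journal would use. The block layout here (a 4-byte sequence number, a payload, then a 4-byte checksum) and the names `seal_block` and `verify_block` are assumptions made for illustration only.

```python
import struct
import zlib


def seal_block(sequence, payload):
    """Build a journal block: sequence number + payload + CRC-32 of both."""
    body = struct.pack(">I", sequence) + payload
    return body + struct.pack(">I", zlib.crc32(body))


def verify_block(block):
    """Recompute the checksum and compare it with the stored one."""
    body = block[:-4]
    (stored,) = struct.unpack(">I", block[-4:])
    return zlib.crc32(body) == stored


good = seal_block(42, b"inode 12: size=4096")

# Flip a single bit in the payload, as a failing sector or cable might.
corrupt = bytearray(good)
corrupt[6] ^= 0x01

good_ok = verify_block(good)               # True
corrupt_ok = verify_block(bytes(corrupt))  # False
```

During recovery, a block that fails verification is treated as not written, so the transaction containing it is not replayed.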
Production Failure Scenarios
Scenario 1: ext3 -> ext4 Upgrade Gone Wrong
What happened: A system administrator upgraded from ext3 to ext4 using tune2fs. The system crashed during the first mount after conversion. On reboot, the file system couldn’t mount. The metadata structures were in an inconsistent state from the partial upgrade.
Why it happened: The upgrade to ext4 changes data structures (to extents, 48-bit block numbers, etc.). Without journaling of the conversion itself, an interruption left half-converted metadata.
Detection:
# Run file system check
sudo e2fsck -f /dev/sda1
# Output might show:
# Unsupported feature flags (0x8001)
# Inode size for deleted inode
Mitigation:
- Always backup before converting file system types
- Use journaling file systems (ext3 was already journaled here, but the journal does not cover the conversion itself)
- Ensure clean shutdown before major operations
- Use tune2fs -O extents,uninit_bg /dev/sda1 only with a proper backup
Scenario 2: Power Loss in Writeback Mode
What happened: A server lost power mid-write during peak load. On recovery, the file system showed some files with partial content, some files with incorrect timestamps, and directory entries pointing to stale inodes.
Why it happened: The journal was in writeback mode, meaning data could be written after its metadata was committed to the journal. The crash interrupted a write in progress, leaving metadata pointing to blocks that didn’t have valid content yet.
Detection:
# Check file system state
sudo e2fsck -n /dev/sda1
# Look for error patterns
dmesg | grep -i "ext4.*error\|filesystem.*error"
Mitigation:
- Use data=ordered or data=journal mount options for critical systems:
# /etc/fstab
/dev/sda1 /home ext4 defaults,data=ordered 0 2
- Use UPS battery backup for servers
- Keep the barrier=1 mount option enabled so that journal blocks reach stable storage before the commit block is written
- Consider hardware with write-back cache and battery backup
Scenario 3: Corrupted Journal
What happened: A virtual machine’s virtual disk developed bad sectors in the journal area. On recovery, the system saw conflicting transaction sequences and couldn’t determine which transactions were valid.
Detection:
# Check journal info
sudo dumpe2fs /dev/sda1 | grep -i journal
# Try journal recovery
sudo e2fsck -fy /dev/sda1
Mitigation:
- Use journal_checksum feature (ext4 default)
- Use RAID for redundancy
- Configure monitoring for disk health (SMART)
- For corrupted journals, you may need to clear and recreate the journal:
# Clear journal (DANGEROUS - only for recovery)
sudo tune2fs -O ^has_journal /dev/sda1
sudo e2fsck -f /dev/sda1
sudo tune2fs -j /dev/sda1 # Recreate journal
Trade-off Table
| Mode | Data Safety | Performance | Use Case |
|---|---|---|---|
| journal | Highest | Slowest | Critical data, small partitions |
| ordered | High | Moderate | General use (default in ext4) |
| writeback | Moderate | Fastest | Performance-focused, non-critical |
| no journal | Lowest | Fastest | Read-only, battery-backed |
Implementation Snippet
Simulating Journal Transaction
#!/usr/bin/env python3
"""Simulate a journaling file system for educational purposes."""
import hashlib
import time


class JournalEntry:
    def __init__(self, trans_id, data):
        self.trans_id = trans_id
        self.data = data
        self.checksum = self._compute_checksum()

    def _compute_checksum(self):
        # MD5 serves only as a cheap integrity check in this simulation
        return hashlib.md5(self.data.encode()).hexdigest()[:8]


class Journal:
    def __init__(self, max_entries=100):
        self.entries = []
        self.max_entries = max_entries
        self.last_committed = 0
        self.transaction_log = []

    def begin_transaction(self, trans_id):
        """Start a new transaction."""
        self.transaction_log = [{
            'type': 'begin',
            'trans_id': trans_id,
            'time': time.time()
        }]
        return True

    def add_operation(self, operation):
        """Log an operation to current transaction."""
        self.transaction_log.append({
            'type': 'op',
            'operation': operation,
            'time': time.time()
        })

    def commit_transaction(self):
        """Commit the current transaction."""
        trans_id = self.transaction_log[0]['trans_id']
        commit_entry = {
            'type': 'commit',
            'trans_id': trans_id,
            'entries': len(self.transaction_log) - 1,
            'time': time.time()
        }
        # Write every record to the journal, ending with the commit record
        for entry in self.transaction_log + [commit_entry]:
            je = JournalEntry(trans_id, str(entry))
            self.entries.append(je)
        self.last_committed = trans_id
        self.transaction_log = []
        # Wrap around (simplified circular journal)
        if len(self.entries) > self.max_entries:
            self.entries = self.entries[-self.max_entries:]
        return True

    def rollback_transaction(self):
        """Rollback current transaction."""
        self.transaction_log = []
        return True

    def replay(self):
        """Replay committed transactions (recovery simulation)."""
        committed_trans = set()
        for entry in self.entries:
            # Verify checksum
            if entry.checksum != entry._compute_checksum():
                print(f"Checksum mismatch for transaction {entry.trans_id}")
                continue
            # Find commit records
            if "'type': 'commit'" in entry.data:
                committed_trans.add(entry.trans_id)
        print(f"Replaying {len(committed_trans)} committed transactions")
        return committed_trans

    def status(self):
        print(f"Journal entries: {len(self.entries)}")
        print(f"Last committed trans: {self.last_committed}")
        print(f"Current transaction: {len(self.transaction_log)} operations")


def simulate_journal_crash():
    """Simulate journal recovery after crash."""
    journal = Journal()
    # Transaction 1: Create file (commits successfully)
    print("=== Transaction 1: Create file ===")
    journal.begin_transaction(1)
    journal.add_operation("allocate inode")
    journal.add_operation("update directory")
    journal.commit_transaction()
    print("Committed")
    # Transaction 2: Write data (starts but doesn't commit before crash)
    print("\n=== Transaction 2: Write data (simulated crash) ===")
    journal.begin_transaction(2)
    journal.add_operation("read inode")
    journal.add_operation("allocate blocks")
    # Simulated crash here - transaction not committed
    print("Crash! Transaction 2 not committed")
    # Recovery: only transaction 1 has a commit record in the journal
    print("\n=== After recovery ===")
    journal.replay()


if __name__ == "__main__":
    simulate_journal_crash()
Checking and Fixing Journal
#!/bin/bash
# check_and_fix_journal.sh
DEVICE=${1:-/dev/sda1}
echo "=== Journal Status for $DEVICE ==="
echo ""
# Check if journal exists
if sudo tune2fs -l "$DEVICE" | grep -q "Filesystem features.*has_journal"; then
    echo "Journal present: YES"
else
    echo "Journal present: NO"
fi
# Default data mode (tune2fs reports it under "Default mount options")
MODE=$(sudo tune2fs -l "$DEVICE" | grep "Default mount options" | grep -o "journal_data[a-z_]*")
echo "Default journal mode: ${MODE:-ordered (not set explicitly)}"
# Journal sequence number and start block
echo ""
echo "=== Checking Journal Integrity ==="
sudo dumpe2fs -h "$DEVICE" 2>/dev/null | grep -E "Journal sequence|Journal start"
# Set the mount count. e2fsck runs at the next boot only if this value
# exceeds the maximum mount count configured with tune2fs -c
echo ""
echo "=== Setting Mount Count ==="
sudo tune2fs -f -C 1 "$DEVICE"
# If journal is corrupted, clear and recreate
echo ""
echo "NOTE: To clear and recreate journal, run:"
echo " sudo tune2fs -O ^has_journal $DEVICE"
echo " sudo e2fsck -f $DEVICE"
echo " sudo tune2fs -j $DEVICE"
Observability Checklist
Monitoring journal health:
- dmesg | grep -i ext4: Check for journal errors
- sudo tune2fs -l /dev/sdX | grep -i journal: View journal configuration
- sudo dumpe2fs /dev/sdX | head -50: Check journal superblock
- iostat -x 1: Monitor I/O to journal device
# Comprehensive journal monitoring
echo "=== Journal Monitoring ==="
echo ""
# Check mount options for barrier (field 6 of the mount output holds the options)
mount | grep "ext4" | awk '{print $1, $6}'
# Check for barrier in fstab
grep -E "ext4|barrier" /etc/fstab | grep -v "^#"
# View per-journal transaction statistics
cat /proc/fs/jbd2/*/info
# Monitor journal log
for fs in / /home /var; do
    dev=$(df "$fs" | tail -1 | awk '{print $1}')
    if [[ "$dev" == /dev/* ]]; then
        echo "--- $fs ---"
        sudo tune2fs -l "$dev" 2>/dev/null | grep -E "Journal|Recovery"
    fi
done
# Check SMART status for disk errors
sudo smartctl -a /dev/sda | grep -E "Error|FAILURE|Pending"
Security and Compliance Considerations
Journal Encryption
For sensitive data, encrypt the file system including journal:
# Using dm-crypt for full encryption
sudo cryptsetup luksFormat /dev/sda1
sudo cryptsetup luksOpen /dev/sda1 encrypted
sudo mkfs.ext4 -j /dev/mapper/encrypted
# Mount encrypted file system
sudo mount /dev/mapper/encrypted /mnt
Benefits:
- All data including journal is encrypted
- Physical theft doesn’t expose data
- Journal doesn’t leak metadata about file access patterns
Compliance: Audit Journal Operations
For compliance, audit file system changes. The journal itself is not an audit log, so use the Linux audit framework:
# Enable auditing in the kernel
sudo auditctl -e 1
# Add rules for file system changes
sudo auditctl -w /var -p rwxa -k fs_changes
# View audit logs for file system
sudo ausearch -k fs_changes | head -50
Secure Deletion with Journal Considerations
Even with journaling, deleted file data may persist in journal entries:
# Secure deletion requires overwriting
shred -vu /path/to/file
# For complete removal from journal, clear journal
# (WARNING: data loss possible)
sudo tune2fs -O ^has_journal /dev/sda1
sudo e2fsck -f /dev/sda1
sudo tune2fs -j /dev/sda1
Common Pitfalls / Anti-patterns
1. Disabling Barrier for Performance
# BAD: Disabling barrier (data integrity risk)
mount -o barrier=0 /dev/sda1 /mnt
# GOOD: Keep barriers enabled
mount -o barrier=1 /dev/sda1 /mnt
# Check if barriers are enabled
mount | grep -E "barrier|ext4"
Why this matters: Without barriers, the disk's volatile write cache can persist the commit block before the journal blocks it covers. A crash at that moment leaves a transaction that looks committed but is incomplete, and replaying it can corrupt the file system.
2. Using writeback Mode for Databases
# BAD: writeback for database
mount -o data=writeback /dev/sda1 /var/lib/postgresql
# GOOD: ordered for databases (or journal for maximum safety)
mount -o data=ordered /dev/sda1 /var/lib/postgresql
# or
mount -o data=journal /dev/sda1 /var/lib/postgresql
3. Ignoring File System Warnings
# WARNING in dmesg - don't ignore
# EXT4-fs warning (sda1): ext4_dx_add_entry: Directory index full
# This means the directory's hash-tree index has run out of room
# Rebuild and optimize directory indexes with e2fsck -D (on an unmounted file system)
sudo e2fsck -fD /dev/sda1
4. Not Testing Recovery
# BAD: Never tested recovery procedure
# GOOD: Test in VM before production
# 1. Create test file system
# 2. Create important files
# 3. Crash simulation (sync; echo c > /proc/sysrq-trigger)
# 4. Recover and verify
# 5. Document recovery steps
Quick Recap Checklist
- Journaling records structural changes before applying them
- After crash, journal replay completes or rolls back incomplete transactions
- ext3/ext4 support three modes: journal (all data), ordered (metadata only), writeback (metadata, data unordered)
- The barrier mount option ensures journal commits are on disk
- Journal checksums detect corruption and prevent replay of corrupted data
- For critical systems, use data=journal or data=ordered
- Clear and recreate journal only as last resort for recovery
- Always backup before any file system modifications
Interview Questions
Write-ahead logging (journaling) ensures consistency by recording all operations before applying them to the main file system. Here's the process:
- Begin transaction: Record that a new transaction is starting
- Log operations: Write all modifications (metadata changes, block allocations) to the journal
- Commit: Write a commit block to the journal indicating the transaction is complete
- Apply changes: Now modify the actual file system with the changes
- Checkpoint: Mark the transaction as fully applied (can clear from journal)
After a crash, the recovery process:
- Scans the journal for committed transactions
- Replays committed transactions to the main file system
- Discards incomplete transactions (they never completed the journal commit)
This guarantees the file system is always in a consistent state: either the full transaction applied, or it didn't apply at all. Never a partial state.
These ext3/ext4 mount options control what goes through the journal:
data=journal: All data and metadata are written to the journal first. Before modifying a file, both the file's data and its metadata are logged. Most reliable, but slowest (double write for all data).
data=ordered (default): Metadata goes through journal, but data is written directly to the file system. The file's data blocks are written before its metadata is committed to the journal. Good reliability with better performance.
data=writeback: Only metadata is journaled. Data writes can happen in any order relative to metadata commits. Fastest, but data might be stale after a crash (metadata says file has data that hasn't been written yet).
Recommendation: Use ordered for most systems. Use journal for critical data where you need to guarantee no data loss even on crashes.
The barrier option ensures that journal commits are flushed to disk before proceeding with subsequent operations. Without barriers, the operating system might reorder writes for performance.
Why this matters: If writes can be reordered, the journal might not contain a complete record of what was committed when a crash happens. You could lose the guarantee that committed transactions survive.
With barriers enabled:
- Write journal transaction
- Flush to disk (barrier)
- Write commit block
- Flush to disk (barrier)
- Only then proceed to modify file system
This ensures the journal is always consistent and recoverable. The performance cost is real (10-30% on some workloads), but for critical systems, it's worth it.
Disable barriers only if you have battery-backed write cache (RAID controllers with BBU) or data loss is acceptable.
Recovery steps for corrupted ext3/ext4 journal:
- Don't panic: The main file system might be fine, only the journal is corrupted
- Backup current state: sudo sfdisk -d /dev/sda > mbr_backup.txt
- Run fsck in read-only mode: sudo e2fsck -n /dev/sda1
- Clear the journal: sudo tune2fs -O ^has_journal /dev/sda1 (this removes the journal reference but doesn't erase journal data)
- Run fsck to repair metadata: sudo e2fsck -f /dev/sda1
- Recreate the journal: sudo tune2fs -j /dev/sda1
- Remount: sudo mount /dev/sda1 /mount/point
Warning: This procedure should only be used as a last resort. Clearing the journal can lead to data loss if the main file system was also corrupted. Always try normal recovery first.
Several scenarios can cause data loss even with journaling:
- Writeback mode: If using data=writeback, data may be written after its metadata is committed. A crash after metadata commit but before data write leaves a file claiming to have content it doesn't.
- Barriers disabled: Without barriers, writes can be reordered, breaking journal guarantees.
- Journal not on separate device: If the journal and data share the same disk and the disk develops hardware errors, both could fail simultaneously.
- Application-level issues: The application might have buffered data not yet flushed to the OS when the crash occurs. Journaling protects file system structure, not application data.
For true data protection:
- Use data=ordered or data=journal
- Enable barriers
- Use battery-backed RAID controllers
- Applications should flush data before critical operations
- Use UPS for servers
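For the application-side part of this list, a common pattern on POSIX systems is to write a temporary file, fsync it, atomically rename it over the target, and then fsync the containing directory. The sketch below shows that pattern; `durable_write` is a hypothetical helper name, and the directory fsync assumes a POSIX platform such as Linux.

```python
import os
import tempfile


def durable_write(path, data):
    """Replace the file at path with data so a crash leaves old or new content."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()               # push Python's buffer to the OS
            os.fsync(f.fileno())    # ask the OS to push the data to stable storage
        os.replace(tmp_path, path)  # atomic rename over the target
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    # fsync the directory so the rename itself survives a crash
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

A call such as `durable_write("settings.json", b"{}")` either leaves the previous file intact or replaces it completely. The journal keeps the file system structure consistent, while the explicit fsync calls are what make the application's data durable.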
Both concepts share the same principle—log before act—but differ in scope:
- File system journaling: By default, only logs metadata changes (inode updates, directory modifications). Data is written directly to the file system. Guarantees file system consistency but not data consistency.
- Database WAL (Write-Ahead Logging): Logs actual data changes (row modifications, index updates). The database can replay these logs to reconstruct exact database state. Guarantees transactional consistency.
Databases often use both: they run on journaling file systems (for crash recovery of the file system itself) AND implement their own WAL (for transactional semantics like ACID). PostgreSQL's WAL is its primary recovery mechanism—fsync-ing WAL is more critical than fsync-ing data files.
JBD2 handles concurrency through several mechanisms:
- Transaction isolation: Each transaction has a unique sequence number. Only one transaction can be committed at a time in a given journal.
- Lock ordering: Journal lock must be acquired in consistent order to prevent deadlocks
- Checkpointing: Completed transactions are checkpointed (cleaned from journal) asynchronously, allowing new commits to proceed
- Revoking: When a journaled metadata block is freed, the transaction writes a revoke record for it, so that recovery does not replay an older journaled copy over a block that has since been reused for other data
When a commit is in progress, other processes wanting to modify the journal wait. This serializes commits but allows multiple processes to have transactions in the "waiting to commit" state.
Crash recovery on both ext3 and ext4:
- Uses journal replay only
- O(journal size) complexity, not file system size
- Typically seconds regardless of file system size
Full e2fsck on ext3 (needed when the journal is unusable or a check is forced):
- Must scan entire file system metadata
- O(file system size) complexity
- On large file systems (terabytes), this can take 10+ minutes
Full e2fsck on ext4:
- The uninit_bg feature lets the check skip block groups that were never initialized
- Much faster than ext3 on large, partly empty file systems
The ext4 uninit_bg feature speeds up e2fsck significantly by marking block groups whose inode tables and bitmaps have never been used, so the check can skip them. This is why ext4 is preferred over ext3 for large storage.
A full journal (circular buffer wraps) causes the kernel to block all writes to that file system until a checkpoint completes and frees journal space:
- New transactions cannot start (no free journal space)
- Write operations block—processes hang
- If root filesystem, system can become unresponsive
Recovery:
# Last resort if the journal is corrupted: remove and recreate it
# (the file system must be unmounted)
tune2fs -O ^has_journal /dev/sda1
e2fsck -f /dev/sda1
tune2fs -j /dev/sda1
Prevention: Monitor journal usage with dumpe2fs and ensure the journal is sized appropriately. For high-write workloads, larger journals or separate journal devices (supported by ext4 and XFS) help.
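The blocking behaviour can be pictured with a small model in which committed transactions occupy journal space until a checkpoint releases them. `BoundedJournal` and its methods are hypothetical names, and returning False stands in for the point where a real kernel would make the writer wait.

```python
class BoundedJournal:
    def __init__(self, capacity):
        self.capacity = capacity
        self.pending = []  # (sequence, size) of committed, not yet checkpointed transactions

    def free_space(self):
        return self.capacity - sum(size for _, size in self.pending)

    def try_commit(self, sequence, size):
        """Return False when the journal has no room (a real writer would block)."""
        if size > self.free_space():
            return False
        self.pending.append((sequence, size))
        return True

    def checkpoint_oldest(self):
        """Model the oldest transaction's blocks reaching their final location."""
        if self.pending:
            self.pending.pop(0)


journal = BoundedJournal(capacity=10)
first = journal.try_commit(1, 4)        # fits
second = journal.try_commit(2, 4)       # fits, 2 blocks left
blocked = journal.try_commit(3, 4)      # no room: writer would stall here
journal.checkpoint_oldest()             # frees the 4 blocks of transaction 1
after_checkpoint = journal.try_commit(3, 4)
```

The third commit succeeds only after the checkpoint, which is why slow checkpointing on a small journal shows up as stalled writes.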
Journaling: Logs changes before applying them. On crash, replay log or discard incomplete transactions. Journal is a separate area that must be maintained.
COW: Never overwrites data in place. When you modify block X, you write a new copy of X to free space. The pointer to X is updated only after the new copy is committed. On crash, old data is intact—it's just the pointers that may be stale.
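The copy-on-write idea can be sketched as follows. `CowStore` is a hypothetical toy class: physical blocks live in one dictionary and a second dictionary plays the role of the pointer that is switched last.

```python
class CowStore:
    def __init__(self):
        self.blocks = {}   # physical block id -> contents
        self.root = {}     # logical name -> physical block id (the pointer)
        self.next_id = 0

    def _allocate(self, contents):
        block_id = self.next_id
        self.next_id += 1
        self.blocks[block_id] = contents
        return block_id

    def write(self, name, contents, crash_before_pointer_update=False):
        new_block = self._allocate(contents)  # new copy goes to free space
        if crash_before_pointer_update:
            return                            # old pointer is still valid
        self.root[name] = new_block           # single pointer switch

    def read(self, name):
        return self.blocks[self.root[name]]


store = CowStore()
store.write("X", "version 1")
store.write("X", "version 2", crash_before_pointer_update=True)
after_crash = store.read("X")   # still "version 1"
store.write("X", "version 2")
after_success = store.read("X")  # "version 2"
```

The old block is never modified, so an interrupted write leaves the previous version readable without any log replay.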
Btrfs advantages:
- No separate journal to maintain
- Self-healing via checksums (detect corruption, restore from mirror)
- Snapshots are free (just increment pointer reference)
- RAID-aware checksums can identify which copy is correct
Trade-off: COW causes more writes (every modification writes new blocks), which affects SSD wear and write amplification. Btrfs mitigates this with extent tree caching and compression.
Both identify journal transactions, but they matter at different points. Transaction ID (sometimes called txn_id) is a monotonically increasing counter assigned when a transaction begins. It provides ordering: if txn_id A < txn_id B, A began before B. Sequence number (used in ext4/JBD2) is the value recorded in the transaction's journal blocks, including its commit block. Recovery uses sequence numbers to identify the last complete transaction: starting from the position recorded in the journal superblock, the replay code applies committed transactions in sequence order and stops at the first one without a valid commit block. The distinction matters because not all started transactions complete. On crash, transactions that began but never wrote a commit block are discarded.
The journal superblock is always the first block of the journal. It is updated when the journal is cleanly unmounted and when checkpointing advances the start of the log. It contains: (1) the journal magic number (0xC03B3998 for JBD2), which validates that the journal is intact; (2) the sequence number of the first transaction expected in the log; (3) the start block and size of the circular log. On crash recovery, the kernel: (1) reads the journal superblock; (2) finds the log start and the expected sequence number; (3) scans journal blocks from there for transactions with that sequence or higher; (4) replays committed transactions and discards incomplete ones (those with no commit block). If the superblock itself is corrupted, use dumpe2fs or manual journal inspection to find the last valid commit.
The descriptor block is the header of each journal transaction, listing all modifications that will be made in this transaction. It contains: (1) block identifiers—for each block modified, the journal records its filesystem block number; (2) operation type—whether this is a metadata block (inode, bitmap, directory entry) or data block; (3) checksum (in journal_checksum mode) to detect corruption. The descriptor block essentially provides a "table of contents" for the transaction—during replay, the recovery code reads the descriptor, then reads the subsequent blocks, validates them against checksums, and applies them to the filesystem. Without the descriptor, the recovery code wouldn't know which blocks belong to which transaction or in what order they should be replayed.
After a crash, the boot sequence is: (1) the kernel mounts the filesystem and sees that the journal is dirty (needs recovery); (2) the kernel replays the journal, applying committed transactions and discarding incomplete ones; (3) if replay succeeds, the mount proceeds normally, and fsck may still run for a full metadata consistency check; (4) if replay fails (corrupted journal), kernel falls back to read-only mount; (5) fsck runs a full metadata scan, fixes inconsistencies, and may truncate journals. On ext4 with the uninit_bg feature, fsck skips block groups that were never initialized, dramatically speeding recovery. For dirty unmounts (system crashed but journal was cleanly synced), journal replay is typically all that's needed. For corrupted journals, fsck may take much longer on large filesystems because it must scan all metadata.
The commit block marks a transaction as successfully completed. It's written after all modified blocks (listed in the descriptor block) have been flushed to the journal. The commit block itself contains the transaction's sequence number and (with checksums) a checksum of the entire transaction. The atomicity guarantee: a transaction is only considered complete if its commit block is written and readable. On crash, recovery scans for the most recent commit block—if a transaction has a begin and descriptor but no commit, it's discarded. This is why barrier operations surround the commit block write: the commit block must reach disk before the filesystem considers the transaction committed. The commit block is the "proof of life" for a transaction.
External journals (used when a separate partition holds the journal) can develop problems: (1) journal device failure: if the journal partition fails, the filesystem cannot mount; (2) journal bitmap corruption: the in-journal revocation table (which tracks revoked blocks) becomes inconsistent; (3) journal size mismatch: the filesystem expects a journal of a specific size but the underlying partition changed. Mitigation: create the journal device with mke2fs -O journal_dev and attach it to the filesystem with the mke2fs -J device= option; monitor journal partition health separately from the data partition. Recovery: if the journal partition is damaged but the data partition is intact, clearing the journal and running e2fsck -f may recover the filesystem, though transactions that were committed but not yet checkpointed will be lost. Use RAID for journal partitions in production, because losing the journal typically means losing all recently committed metadata changes.
ext3/ext4 journal: Journals metadata only (in default ordered mode), data written directly to filesystem. Recovery replays metadata operations; data may be lost or stale. ZFS Intent Log (ZIL): ZFS is a COW filesystem that normally writes data in transaction groups (TXG)—synchronous writes (like fsync) would normally wait for TXG commit. The ZIL provides synchronous write semantics without waiting: synchronous writes go to ZIL (fast SSD), then async to main pool. On crash, ZIL replays any uncommitted ZIL records—synchronous writes are either fully persisted or fully rolled back. ZFS has no separate journal mode—all data is COW, metadata is implicitly consistent. The ZIL is not a traditional journal—it doesn't replay metadata, only re-executes synchronous write operations that hadn't been committed to the pool.
The revoke table records blocks whose earlier journaled copies must not be replayed. It is needed when a metadata block that was journaled in one transaction is later freed and its disk block is reused, for example for file data that does not go through the journal. Without a revoke record, recovery would replay the old journaled metadata over the new contents and corrupt the file. Revoke records are written to the journal before the commit block of the transaction that frees the block, and they are checked during recovery: JBD2 first makes a pass over the log to collect revoke records, then skips any journaled block that was revoked by the same or a later transaction. Corruption: if revoke records are lost or damaged, recovery may replay stale blocks over live data, causing data corruption.
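One situation the revoke mechanism guards against is replaying a stale journaled copy of a block that has since been freed and reused. The sketch below models that with a two-pass replay; the transaction dictionaries and the name `replay_with_revokes` are assumptions for illustration, not the JBD2 data structures.

```python
def replay_with_revokes(transactions, disk):
    """Replay committed transactions, honouring revoke records.

    Each transaction is a dict with 'seq', 'blocks' ({block_no: contents})
    and 'revoked' (a set of block numbers).
    """
    # Pass 1: remember the latest transaction that revoked each block
    revoked_at = {}
    for txn in transactions:
        for block_no in txn["revoked"]:
            revoked_at[block_no] = max(revoked_at.get(block_no, -1), txn["seq"])
    # Pass 2: replay, skipping copies revoked by the same or a later transaction
    for txn in transactions:
        for block_no, contents in txn["blocks"].items():
            if revoked_at.get(block_no, -1) >= txn["seq"]:
                continue
            disk[block_no] = contents
    return disk


# Transaction 1 journals block 50 as directory metadata. Transaction 2 frees
# it (revoke), and the block is then reused for file data written in place.
log = [
    {"seq": 1, "blocks": {50: "old directory block"}, "revoked": set()},
    {"seq": 2, "blocks": {}, "revoked": {50}},
]
with_revoke = replay_with_revokes(log, {50: "user file data"})

# The same log without the revoke record overwrites the user's data.
log_without = [
    {"seq": 1, "blocks": {50: "old directory block"}, "revoked": set()},
    {"seq": 2, "blocks": {}, "revoked": set()},
]
without_revoke = replay_with_revokes(log_without, {50: "user file data"})
```

With the revoke record, block 50 keeps the user's data; without it, replay restores the stale directory block over that data.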
Soft updates (introduced in BSD 4.4BSD) solve the same crash-consistency problem as journaling but with different trade-offs. Instead of a journal, soft updates use dependency tracking: the kernel tracks which operations must complete before others. Operations are applied in reverse order to maintain consistency—if a file is created and then its directory is modified, the directory modification is applied first (in case of crash, partial directory entry is harmless). Advantages: no separate journal area, 100% disk utilization (no journal overhead), better write clustering. Disadvantages: more complex kernel code, occasional momentary inconsistencies visible, requires background fsck after crash (though faster than full fsck). FreeBSD supports both soft updates and journaling (via gjournal). Journaling won on Linux due to simpler implementation and faster recovery; soft updates remain a valid alternative with superior write performance.
A checkpoint synchronizes the journal with the filesystem by: (1) writing all dirty metadata buffers to disk; (2) updating the journal superblock to reflect checkpointed transactions. Until checkpoint completes, the journal cannot wrap around (free its oldest blocks) because those blocks might be needed for recovery. If the journal fills up completely, writes are blocked until checkpoint finishes. JBD2 checkpoints: (1) in the background when transaction commits (unless journal is nearly full); (2) when e2fsck runs; (3) on unmount. You can force a checkpoint with debugfs -R 'dump tune2fs -f -C to reset mount count. If a filesystem is never cleanly unmounted and checkpoint never runs, on next mount the kernel performs recovery first, then forces a checkpoint to free journal space.
Further Reading
Topic-Specific Deep Dives:
-
JBD2 Implementation: The Journaling Block Device version 2 (JBD2) is the actual journaling layer in the Linux kernel. Studying journal_submit_inode_data_buffers(), journal_finish_inode_data_buffers(), and the transaction commit code reveals the full complexity.
-
Btrfs vs ext4 Recovery: Btrfs uses COW and checksums rather than traditional journaling. When Btrfs detects corruption, it can often identify the exact corrupt byte via checksum mismatches and self-heal from redundant copies if RAID is configured.
-
ZFS ZIL (ZFS Intent Log): ZFS uses a separate ZIL (similar to a journal) for synchronous writes. Understanding how the ZIL interacts with the ARC (Adaptive Replacement Cache) explains ZFS write patterns and potential data loss scenarios.
-
Journal Replay Optimization: Modern journaling implementations batch transactions and use checksum grouping. Studying how JBD2 handles log recovery explains why e2fsck times on large file systems have decreased significantly.
-
Soft Updates and Log-Structured File Systems: Soft updates (used historically in BSD FFS) use dependency tracking instead of journaling. Understanding the tradeoffs between soft updates, journaling, and copy-on-write helps appreciate why journaling won.
Conclusion
Journaling transforms file systems from fragile structures prone to catastrophic corruption into robust systems that recover cleanly from crashes. The key insight is write-ahead logging: record your intentions before you act, so that if you’re interrupted, you can either complete the operation or roll it back. The journal is not a backup—it is a reconstruction aid that enables the file system to rebuild consistent state after unexpected shutdown.
The three ext3/ext4 journal modes (journal, ordered, writeback) represent a spectrum from maximum safety to maximum performance. For critical systems, data=ordered (the default) provides the best balance, and enabling barriers ensures journal commits survive crashes. Always test your recovery procedures in a staging environment before relying on them in production.
For continued learning, explore how copy-on-write file systems (Btrfs, ZFS) provide similar guarantees through different mechanisms, study how ext4’s extent-based allocation improves on ext3’s block mapping, and examine the JBD2 (Journaling Block Device) implementation in the Linux kernel for a complete picture of how journaling actually works at the code level.