B-Trees and B+ Trees: Database Index Structures
Understand B-trees and B+ trees for disk-based storage, database indexing, and efficient range queries with O(log n) operations.
B-Trees and B+ Trees: Database Index Structures
B-trees are the workhorse of database indexes and file systems. While balanced binary trees give O(log n) search, they perform terribly on disk—each node access might require a disk read. B-trees solve this by storing hundreds of keys per node, matching disk block sizes and reducing tree height from ~30 (for billion records) to ~3.
The B+ tree variant that’s even more common in databases adds a crucial optimization: all records are stored in leaf nodes, while internal nodes only store keys. The leaf nodes are linked sequentially, enabling efficient range queries by traversing the linked list rather than revisiting parent nodes. MySQL’s InnoDB, PostgreSQL, and most RDBMSes use B+ trees as their primary index structure.
Introduction
B-trees are the workhorse of database indexes and file systems, yet their design makes perfect sense once you understand the problem they solve: disk I/O is expensive (milliseconds versus nanoseconds for memory), and balanced binary trees with two children per node require too many disk reads to be practical for large datasets. A B-tree node typically holds hundreds of key-pointer pairs, matching disk block sizes so that each node access costs at most one disk read. With t=100 (200 keys per node), a billion records requires only ~3 levels instead of ~30 for a binary tree.
B+ trees are the dominant variant in relational databases. They optimize B-trees for read-heavy workloads by storing all data only in leaf nodes, while internal nodes store only keys for routing. The leaf nodes are linked sequentially, making range queries efficient by traversing the linked list rather than bouncing back to parent nodes repeatedly. MySQL’s InnoDB, PostgreSQL, Oracle, and SQL Server all use B+ trees as their primary index structure.
This guide covers B-tree and B+ tree structure and operations, node split and merge mechanics, and the critical differences that make B+ trees preferable for most database workloads. You’ll understand why page size and fill factor matter in practice, how clustered versus non-clustered indexes affect performance, and why sequential ID generation (auto-increment) causes less index bloat than random UUIDs.
When to Use
B-trees and B+ trees apply when:
- Database indexing — Primary access path for most relational databases
- Filesystem indexing — NTFS, ext4, and HFS+ use B+ tree variants
- Large datasets on disk — Minimizing disk I/O matters more than memory efficiency
- Range queries — B+ tree linked leaves enable sequential access
- Ordered iteration — Keys can be retrieved in sorted order efficiently
When Not to Use
- In-memory databases (red-black trees or skip lists may be faster due to cache locality)
- Write-heavy workloads (B-tree rebalancing involves page splits that can be expensive)
- Key-value stores only needing point queries (hash indexes are O(1))
Architecture: Multi-Way Balanced Trees
graph TD
subgraph Internal["B+ Tree Internal Node"]
I1["[20 | 40 | 60]"]
end
subgraph Leaf["B+ Tree Leaf Nodes (linked)"]
L1["[10-15] -> [20-30] -> [40-50] -> [60-70]"]
end
I1 --> L1
Internal nodes store keys for routing; leaf nodes store actual data pointers and are linked for range access.
Implementation
B-Tree Node
class BTreeNode:
def __init__(self, t, leaf=False):
self.t = t # Minimum degree (defines order)
self.leaf = leaf
self.keys = [] # Key values
self.children = [] # Child pointers
self.n = 0 # Current number of keys
class BTree:
"""
B-tree implementation for educational purposes.
In practice, use database-provided implementations.
"""
def __init__(self, t):
self.t = t # Minimum degree (order = 2t)
self.root = BTreeNode(t, leaf=True)
def search(self, k, node=None):
"""Search for key k, return node if found."""
if node is None:
node = self.root
i = 0
while i < node.n and k > node.keys[i]:
i += 1
if i < node.n and k == node.keys[i]:
return node, i
if node.leaf:
return None
return self.search(k, node.children[i])
def insert(self, k):
"""Insert key k into B-tree."""
root = self.root
if len(root.keys) == 2 * self.t - 1:
new_root = BTreeNode(self.t, leaf=False)
new_root.children.append(root)
self._split_child(new_root, 0, root)
self.root = new_root
self._insert_nonfull(self.root, k)
def _split_child(self, parent, i, child):
"""Split a full child node."""
t = self.t
new_child = BTreeNode(t, leaf=child.leaf)
new_child.n = t - 1
# Copy keys to new child
for j in range(t - 1):
new_child.keys.append(child.keys[j + t])
if not child.leaf:
for j in range(t):
new_child.children.append(child.children[j + t])
child.n = t - 1
child.keys = child.keys[:t - 1]
child.children = child.children[:t]
# Insert median key into parent
parent.keys.insert(i, child.keys.pop(t - 1))
parent.children.insert(i + 1, new_child)
def _insert_nonfull(self, node, k):
"""Insert into non-full node."""
i = len(node.keys) - 1
if node.leaf:
node.keys.append(None)
while i >= 0 and k < node.keys[i]:
node.keys[i + 1] = node.keys[i]
i -= 1
node.keys[i + 1] = k
else:
while i >= 0 and k < node.keys[i]:
i -= 1
i += 1
if len(node.children[i].keys) == 2 * self.t - 1:
self._split_child(node, i, node.children[i])
if k > node.keys[i]:
i += 1
self._insert_nonfull(node.children[i], k)
Common Pitfalls / Anti-Patterns
- Confusing t (minimum degree) with order — B-tree of order m has max m children; t = ceil(m/2)
- Not handling root split — Special case when root is full and needs to become internal
- B+ tree leaf vs internal keys — Internal nodes store separator keys for routing only
- Disk block alignment — Real implementations align to filesystem blocks (4KB typically)
Quick Recap Checklist
- B-trees store up to 2t-1 keys per node, t-1 to 2t-1 children
- All leaves at same depth (balanced)
- B+ tree: only leaves store data, internal nodes store routing keys only
- B+ tree leaves are linked for efficient range scans
- Used by virtually all relational databases for index access
Trade-off Analysis
| Aspect | B-Tree | B+ Tree | Notes |
|---|---|---|---|
| Search depth | O(log n) to leaf | O(log n) to leaf | Same asymptotic |
| Range queries | Must revisit parent | Sequential leaf links | B+ tree wins |
| Insert/Split | Split may affect all levels | Same | Similar complexity |
| Storage efficiency | All nodes store data | Only leaves store data | B+ tree wins |
| Pointer overhead | Keys + data in all nodes | Internal nodes only have keys | B+ tree wins |
| Point query speed | Can find at leaf earlier | Always to leaf | B-tree may be faster |
| Leaf scan speed | Non-sequential | Sequential via linked list | B+ tree wins |
When to prefer B-tree: Write-heavy workloads, point queries where early termination helps, embedded databases with limited RAM.
When to prefer B+ tree: Read-heavy workloads with range scans, database indexes, filesystem metadata.
Real-world Failure Scenarios
-
PostgreSQL index bloat: Frequent updates leave empty space in pages, causing index size inflation. Periodic
REINDEXorVACUUMneeded. -
MySQL InnoDB page splits: Random inserts cause many page splits at intermediate levels, degrading insert performance. Bulk loading or sorted inserts perform better.
-
SQLite B-tree depth growth: Small cache + large dataset = cache misses. With default page size, deep trees cause 10-20 I/Os per query on cold cache.
-
MongoDB index selectivity: Low-selectivity indexes (e.g., boolean field) cause full index scans that are slower than collection scans.
-
Hot spot nodes in distributed B+ trees: Cassandra/ScyllaDB partition splits create hot nodes. Partition strategy matters more than tree algorithm.
Interview Questions
Disk I/O is expensive (milliseconds vs nanoseconds for memory). A binary tree with 1 billion records has height ~30—worst case 30 disk reads. A B-tree with t=100 (200 keys per node) has height only 3—maximum 3 disk reads. B-trees maximize fanout to minimize tree height and disk accesses per lookup.
In a B-tree, all nodes (including leaves) can store data and keys. In a B+ tree, only leaf nodes store data; internal nodes store only routing keys. This means: B+ trees have higher fan-out (more keys per internal node), all leaves at same depth, and efficient range queries via linked leaves. B-trees can find a key without traversing to a leaf.
When a node is full (2t-1 keys) and needs to insert, it splits at the median key. The median key goes to the parent; the left half stays in original node, right half goes to new node. This may cascade—parent might also be full, requiring recursive split. Splits propagate toward root at most O(tree height), typically 3-4 levels for large trees.
B+ tree leaves are linked via next/prev pointers. After finding the starting leaf via the tree, range queries traverse the linked list without returning to parent nodes. This enables sequential leaf access with minimal disk seeks—one I/O per leaf page vs. O(log n) parent revisits for a standard B-tree.
A B-tree of order m has a maximum of m children per node. The minimum degree t defines the node capacity as: minimum children = t, maximum children = 2t. Relationship: m = 2t (or close to it depending on implementation). For example, with t=50, each node holds up to 99 keys and up to 100 children pointers.
Red-black trees have height O(log n) but low fanout (~2 children per node), meaning ~30 levels for 1 billion records. Each level may be a disk I/O. B+ trees pack hundreds of keys per node (matching 4KB disk blocks), reducing height to 3-4 levels. Fewer levels = fewer disk seeks = orders of magnitude faster for storage-backed systems.
Write amplification = actual disk writes per user insert. B-trees cause writes when: (1) updating leaf pages, (2) splitting pages (write parent and two children), (3) updating parent keys. In write-heavy workloads, page splits cause multiple disk writes per insert. LSM trees reduce write amplification by batching writes in memory before sequential SSTable writes, at the cost of read amplification.
Larger keys reduce fanout: fewer keys fit per node, increasing tree height and I/O per search. For a 4KB page with 8-byte pointers, a 16-byte key allows ~200 keys per node (height ~3 for 1B records), while a 64-byte key drops to ~50 keys (height ~5). Composite indexes (key1, key2) compound this. Covered indexes and prefix compression mitigate the problem for index-only scans.
Expected answer points:
- Deletions merge sibling nodes when keys fall below minimum threshold (t-1)
- Coalescence or redistribution handles underfull nodes
- Page reclamation: deleted pages return to free list for reuse
- PostgreSQL VACUUM reclaims dead tuples; MySQL InnoDB purge thread handles deleted records
- Lazy deletion: pages marked deleted but not immediately reused until VACUUM/purge runs
Expected answer points:
- Clustered index: B+ tree leaf pages ARE the data pages; table rows stored in order of index key
- Non-clustered index: B+ tree leaf pages contain indexed key + row pointer to data
- Each table has one clustered index (primary key typically); multiple non-clustered indexes
- Clustered: range queries faster (data already ordered), insert/delete more expensive
- Non-clustered: extra I/O for data lookup via pointer chase
Expected answer points:
- Index bloat: accumulation of dead or reusable pages due to updates/deletes
- Causes: random inserts/deletes leave gaps, update-in-place rewrites pages
- Symptoms: index size larger than necessary, degraded scan performance, increased cache pressure
- Solutions: REINDEX rebuilds index from scratch; VACUUM (PostgreSQL) reclaims dead space
- MySQL InnoDB: OPTIMIZE TABLE reorganizes pages;ibuffg purge thread handles deletions
Expected answer points:
- High fanout: each internal node holds hundreds of children pointers
- 4KB page with 8-byte pointers allows ~500 children per internal node
- Height calculation: log_500(1 billion) ≈ 3.4 — only 4 levels needed
- Root node typically cached in memory; path from root to leaf is 2-3 disk I/Os
- B-tree variants with larger pages (e.g., 64KB) can achieve height of 2 for same dataset
Expected answer points:
- Fill factor: percentage of page space to reserve for future inserts (default 90%)
- Higher fill factor: more storage efficient, fewer pages, but more page splits on insert
- Lower fill factor: leaves room for in-page inserts, reduces splits but wastes space
- Use low fill factor for heavily updated tables, high fill factor for read-mostly
- PostgreSQL: CREATE INDEX WITH (fillfactor=70); MySQL InnoDB:innodb_fill_factor
Expected answer points:
- Unique index: each key appears exactly once in tree; enforced at insert time
- Non-unique index: duplicate keys allowed; typically handled by appending row IDs
- B+ tree stores (key, row_pointer) tuples; for duplicates, uses (key, transaction_id) or list pages
- Unique constraint check traverses tree O(log n) then verifies no match at leaf
- Non-unique: may need to scan all matching entries for query execution
Expected answer points:
- Leaf pages store sorted keys; adjacent keys often share common prefixes
- Prefix compression: store first key fully, subsequent keys store only differing suffix
- Example: ["apple", "apricot", "banana"] becomes ["apple", "-ricot", "banana"]
- Reduces leaf page space by 20-40% depending on key prefix similarity
- Decompression needed on read; trade-off between CPU and I/O savings
Expected answer points:
- LSM trees: writes go to memory (MemTable), then flushed to SSTables on disk
- B+ trees: in-place writes; page splits cause random I/O
- LSM tree write amplification: 1 (data written once), B+ tree: 3-10x
- B+ tree read amplification: O(log n), LSM tree: O(log n) + K (K = number of SSTables)
- LSM trees: better for write-heavy workloads; B+ trees: better for read-heavy or mixed
Expected answer points:
- Covering index: query can be satisfied entirely from index pages without touching table
- All referenced columns included in index key or stored as included columns
- Query executor does index lookup, reads leaf nodes only — no row pointer chase to heap
- PostgreSQL: "Index Only Scan"; MySQL: "Using index" in EXPLAIN
- Trade-offs: larger index, slower inserts/updates, but faster reads for covered queries
Expected answer points:
- Buffer pool (innodb_buffer_pool_size, shared_buffers) caches B+ tree pages in memory
- Hot pages: root, intermediate nodes cached; leaf pages loaded on demand
- Cache hit ratio determines effective I/O cost; 99% hit ratio = 1% disk reads
- LRU or clock-sweep eviction policies manage buffer pool pages
- Prefetching: sequential leaf scans pre-load adjacent pages into buffer pool
Expected answer points:
- Sequential IDs (auto-increment): inserts go to rightmost leaf page — minimal page splits
- GUIDs (UUIDs): random insertion point causes page splits throughout tree
- Random GUIDs: write amplification high, index bloat higher, cache efficiency lower
- Solutions: UUID v7 (time-ordered), UUID v1 with node ID, COMB (sequential GUIDs)
- PostgreSQL: gen_random_uuid(); MySQL: UUID() — both random by default
Expected answer points:
- Latch coupling: B+ tree operations acquire locks on nodes being modified
- Read-write locks: readers share, writers exclusive; prevents dirty reads
- Lock escalation: fine-grained node locks scale better than table-level locks
- Write-write: page split or merge requires parent lock during modification
- PostgreSQL: Heap Only Tuple (HOT) updates avoid index maintenance when possible
- MySQL InnoDB: gap locks and next-key locks for isolation; reduces deadlocks but adds contention
Further Reading
- Database Indexing Deep Dive — How indexes work in practice
- Comparing B-Trees and LSM Trees — Write-optimized alternatives
- B-Tree Visualization — Interactive B-tree operations
- Understanding B+ Tree Clustered vs Non-Clustered Indexes — SQL Server perspective
Conclusion
B-trees and B+ trees are the backbone of virtually every relational database you’ve ever used. The core idea is simple: keep trees shallow by cramming more keys into each node, so disk I/O stays low. B+ trees go a step further by chaining leaf nodes together, which makes range scans actually bearable. If you want to see where this leads next, look into composite indexes, covering indexes, and how query planners decide which index to use — Database Indexing Deep Dive is a good starting point.
Category
Related Posts
Arrays vs Linked Lists: Understanding the Trade-offs
Compare arrays and linked lists in terms of access time, insertion/deletion efficiency, memory usage, and cache performance.
Arrays: 1D, 2D, and Multi-dimensional Mastery
Master array operations, traversal, search, common patterns like two-pointer and sliding window, and when to use multi-dimensional arrays.
AVL Trees: Self-Balancing Binary Search Trees
Master AVL tree rotations, balance factors, and rebalancing logic. Learn when to use AVL vs Red-Black trees for your use case.