Union-Find: Disjoint Set Data Structure
Master union-find with path compression and union by rank for efficient set operations used in Kruskal's MST and connected components.
Union-Find: Disjoint Set Data Structure
Union-Find (also called Disjoint Set Union or DSU) is a data structure that tracks a collection of disjoint sets. It supports two operations: find (which set contains this element?) and union (merge two sets). With path compression and union by rank, both operations achieve nearly O(1) amortized time.
Union-Find powers Kruskal’s MST algorithm, detects cycles in graphs, finds connected components, and solves many partitioning problems.
Introduction
Union-Find (also called Disjoint Set Union or DSU) is a data structure that tracks a collection of disjoint sets, supporting two operations: find—which set contains this element?—and union—merge two sets together. What makes it work is that with two simple optimizations, both operations achieve amortized O(α(n)) time, where α(n) is the inverse Ackermann function that grows so slowly it is effectively constant for all practical inputs. This near-constant time makes Union-Find one of the most practical data structures for graph algorithms and connectivity problems.
The power of Union-Find comes from its simplicity and versatility. It forms the backbone of Kruskal’s minimum spanning tree algorithm, where cycle detection during edge addition requires efficient connectivity queries. It finds connected components in graphs, detects cycles, and solves partitioning problems in social networks and image processing. The data structure is also easy to implement correctly—many developers get it wrong without path compression or union by rank, leaving significant performance on the table.
This guide covers the core operations with path compression and union by rank, explains why these optimizations give near-constant time, and walks through practical applications like counting connected components, Kruskal’s MST, and the “most stones removed” problem. You’ll also see when to use union by size instead of union by rank, and how to implement find iteratively to avoid recursion depth issues.
Basic Implementation
class UnionFind:
"""
Union-Find with path compression and union by rank.
Nearly O(1) amortized for both operations.
"""
def __init__(self, n):
self.parent = list(range(n)) # Parent pointer
self.rank = [0] * n # Approximate tree depth
def find(self, x):
"""
Find root with path compression.
Flattens tree along the way for future speedups.
"""
if self.parent[x] != x:
self.parent[x] = self.find(self.parent[x]) # Recursive compression
return self.parent[x]
def union(self, x, y):
"""
Union by rank: attach smaller tree under larger tree.
Returns True if merged, False if already in same set.
"""
root_x = self.find(x)
root_y = self.find(y)
if root_x == root_y:
return False # Already in same set
# Union by rank: attach smaller depth tree under larger
if self.rank[root_x] < self.rank[root_y]:
root_x, root_y = root_y, root_x # Ensure root_x has >= rank
self.parent[root_y] = root_x
if self.rank[root_x] == self.rank[root_y]:
self.rank[root_x] += 1
return True
def connected(self, x, y):
"""Check if two elements are in the same set."""
return self.find(x) == self.find(y)
def count(self):
"""Count number of distinct sets."""
return sum(1 for i in range(len(self.parent)) if self.parent[i] == i)
Optimized Operations
class OptimizedUnionFind:
"""
Union-Find with:
- Path compression (find)
- Union by rank (union)
- Size tracking
"""
def __init__(self, n):
self.parent = list(range(n))
self.rank = [0] * n
self.size = [1] * n # Track component sizes
def find(self, x):
"""Find with iterative path compression."""
root = x
while self.parent[root] != root:
root = self.parent[root]
# Compress path
while self.parent[x] != root:
next_x = self.parent[x]
self.parent[x] = root
x = next_x
return root
def union(self, x, y):
"""Union by size: attach smaller tree under larger."""
root_x = self.find(x)
root_y = self.find(y)
if root_x == root_y:
return False
# Attach smaller size tree under larger
if self.size[root_x] < self.size[root_y]:
root_x, root_y = root_y, root_x
self.parent[root_y] = root_x
self.size[root_x] += self.size[root_y]
return True
def component_size(self, x):
"""Get size of component containing x."""
return self.size[self.find(x)]
def max_component_size(self):
"""Largest component size."""
return max(self.size)
Common Applications
# Number of connected components in graph
def count_connected_components(n, edges):
"""Count components using union-find."""
uf = UnionFind(n)
for u, v in edges:
uf.union(u, v)
return uf.count()
# Detect cycle in undirected graph
def has_cycle(n, edges):
"""Detect if graph has a cycle."""
uf = UnionFind(n)
for u, v in edges:
if not uf.union(u, v):
return True # Already connected = cycle
return False
# Kruskal's MST using union-find
def kruskal_mst(n, edges):
"""
MST using Kruskal's algorithm.
edges: list of (weight, u, v)
"""
# Sort edges by weight
edges.sort(key=lambda x: x[0])
uf = UnionFind(n)
mst = []
total_weight = 0
for weight, u, v in edges:
if uf.union(u, v):
mst.append((u, v, weight))
total_weight += weight
if len(mst) == n - 1: # MST has n-1 edges
break
return mst, total_weight
# Most stones removed (LeetCode 947)
def remove_stones(stones):
"""
Maximum stones that can be removed = total stones - number of components.
Use union-find on stone positions.
"""
uf = UnionFind(20000) # 0-9999 for row, 10000-19999 for col
for x, y in stones:
uf.union(x, 10000 + y) # Connect row x to column y
# Count distinct roots
roots = set(uf.find(x) for x, _ in stones)
return len(stones) - len(roots)
# Longest consecutive sequence (LeetCode 128)
def longest_consecutive(nums):
"""
Find longest consecutive sequence length.
Use union-find to group consecutive numbers.
"""
if not nums:
return 0
uf = UnionFind(len(nums))
num_to_idx = {num: i for i, num in enumerate(nums)}
# Union each number with its neighbor if it exists
for num in nums:
if num - 1 in num_to_idx:
uf.union(num_to_idx[num], num_to_idx[num - 1])
if num + 1 in num_to_idx:
uf.union(num_to_idx[num], num_to_idx[num + 1])
# Find largest component size
max_size = 0
for i in range(len(nums)):
if uf.find(i) == i:
max_size = max(max_size, uf.size[i])
return max_size
Time Complexity Analysis
| Operation | Naive | With Path Compression | With Both Optimizations |
|---|---|---|---|
| find | O(n) | O(log n) | α(n) ≈ O(1) |
| union | O(n) | O(log n) | α(n) ≈ O(1) |
| connected | O(n) | O(log n) | α(n) ≈ O(1) |
α(n) is the inverse Ackermann function, which grows so slowly that for all practical values of n, α(n) ≤ 4.
Why Path Compression and Union by Rank Work
# Without optimizations: tree can degenerate to linked list (O(n) operations)
# With path compression:
# find(1) → 1 → 3 → 5 (find root)
# After: parent[1]=5, parent[3]=5, parent[5]=5 (flattened)
# With union by rank:
# Always attach smaller tree under larger
# Tree height ≤ log(n) because each merge at least doubles size of one component
Union-Find With Path Compression and Rank
graph TD
subgraph Before["Before Path Compression (find 1)"]
direction TB
N1["1"]
N2["2"]
N3["3"]
N4["4"]
N5["5"]
N1 --> N2
N2 --> N3
N3 --> N4
N4 --> N5
N5Style["5 is root"]
end
subgraph After["After Path Compression"]
direction TB
A1["1"]
A2["2"]
A3["3"]
A4["4"]
A5["5"]
A1 -.-> A5
A2 -.-> A5
A3 -.-> A5
A4 -.-> A5
A5["5 (root)"]
end
subgraph UnionByRank["Union by Rank Example"]
direction LR
R1["rank=2, size=4"]
R1 --> R2["rank=1, size=2"]
R1 --> R3["rank=1, size=1"]
R1 --> R4["rank=0"]
R5["Attach smaller under larger"]
end
When to Use Union-Find
| Problem | Union-Find Use |
|---|---|
| Kruskal’s MST | Detect cycles while adding edges |
| Connected components | Group related items |
| Cycle detection | Check if edge connects nodes in same component |
| Network connectivity | Is x connected to y? |
| Image processing | Connected components in pixels |
| Clustering | Group similar items |
Production Failure Scenarios
-
Path compression skipped in recursive find: If you implement find recursively but forget the assignment
self.parent[x] = self.find(self.parent[x]), path compression never happens. The tree degrades to O(n) height over time. Even iterative implementations can miss path compression if the compression loop doesn’t update all nodes along the chain. Every find must flatten—test this with a long chain before deployment. -
Union by rank/size not applied consistently: If you sometimes union with arbitrary ordering instead of by rank or size, the tree height bound breaks. A sequence of unions in bad order produces a linear chain, turning near-O(1) operations into O(n). This is especially insidious because it works fine on small test cases but degrades on large inputs.
-
Rank overflow with large n: The rank field is typically a small integer. If you have millions of elements and keep merging trees of equal rank, rank can overflow an 8-bit or 16-bit integer. The tree still works but the rank comparison logic breaks, potentially causing incorrect unions.
-
Parent array not initialized for all elements: When using union-find in a dynamic context (adding elements after initialization), new elements need proper parent and rank initialization. If you skip this, find on uninitialized elements reads garbage from the arrays and corrupts the structure.
Observability Checklist
Track union-find operations to catch degradation and correctness issues.
Core Metrics
- find operation count and latency per call
- union operation count and merge rate
- Tree depth distribution across all roots
- Rank distribution across all elements
- Component count over time
Health Signals
- find latency increasing: tree height growing beyond log n
- Union merge rate dropping: all unions are between same-set elements (may indicate logic error)
- Rank values growing beyond log₂(n): union-by-rank broken
- Component count not decreasing as expected: union operations failing silently
- find returning unexpected values: parent array corrupted
Alerting Thresholds
- find latency > log₂(n) consistently: tree degradation
- Rank values exceeding 30 for n < 10⁶: rank tracking error
- union returning False rate > 50% without actual cycles: logic error
- Component count stable when it should be changing: union not executing
Distributed Tracing
- Trace each find with path length and nodes visited
- Include root node and compression ratio in span metadata
- Track union with merged component sizes
- Correlate slow finds with specific tree shapes (deep chains)
Security Notes
Union-find has specific security concerns when attacker controls operations.
Rank manipulation to cause worst-case performance: An attacker who controls union order can craft sequences that violate the union-by-rank guarantee, creating deep trees. While find is still O(log n), the constant factor matters under heavy load.
Fix: Use randomized union by shuffling operations if adversarial. Or use union by size which is harder to manipulate predictably.
Parent array corruption via overflow: If the parent array stores indices and an attacker can trigger index overflow (through large n values), the structure can be corrupted to point to arbitrary memory locations.
Fix: Validate n against maximum bounds before initialization. Use language-level bounds checking where available.
Cycle detection bypass via timing: In cycle detection contexts, if timing differences in find operations reveal tree structure, an attacker could use this to craft “safe” edges that actually create cycles but appear safe by timing.
Fix: Ensure find operations take constant time regardless of path length. Use branchless implementations to eliminate timing side channels.
Quick Recap Checklist
- Union-Find tracks disjoint sets efficiently
- find(x): returns root of set containing x
- union(x, y): merges sets containing x and y
- Path compression flattens tree on find
- Union by rank attaches smaller tree under larger
- Amortized O(α(n)) ≈ O(1) per operation
- Used in Kruskal’s MST, cycle detection, connected components
Trade-off Analysis
| Approach | find Complexity | union Complexity | Space | Notes |
|---|---|---|---|---|
| Naive | O(n) | O(n) | O(n) | No optimizations |
| Path Compression Only | O(log n) | O(log n) | O(n) | Flattening on find |
| Union by Rank Only | O(log n) | O(log n) | O(n) | Height bound log n |
| Both Optimizations | α(n) ≈ O(1) | α(n) ≈ O(1) | O(n) | Inverse Ackermann |
| Union by Size | O(log n) | O(log n) | O(n) | Same as rank |
| Weighted Quick-Union | O(n) worst | O(log n) | O(n) | Without path compression |
When to Prefer Each
- Path compression only: Good when unions are rare, finds are frequent
- Union by rank only: Good when finds and unions are balanced
- Both (standard): Best overall performance for most use cases
- Union by size: Better when you need component size tracking
- Weighted quick-union: Legacy approach, path compression superior
Interview Questions
α(n) is the inverse Ackermann function—extremely slow growing. For n = 2^(2^(2^(2^16))), α(n) = 5. For any practical n, α(n) ≤ 4 or 5. It appears because with union by rank, tree height ≤ log(n), and path compression reduces this further. The combination is so effective that n operations take practically O(n) time, or O(α(n)) amortized per operation.
Union by rank and union by size achieve the same logarithmic bound, but rank is easier to maintain (no need to track sizes). Both ensure the tree height stays ≤ log(n) because when merging trees of equal rank, rank increases by 1. A tree of rank r has at least 2^r elements, so rank cannot exceed log(n). Size tracking gives the same guarantee and can also report component sizes, which is sometimes useful.
Process each edge (u, v). If find(u) == find(v), u and v are
already connected, so adding this edge creates a cycle. Otherwise,
union(u, v) merges their components. This is the basis of
Kruskal's MST algorithm—you only add an edge if it doesn't create a cycle.
Standard union-find has fixed size. For dynamic addition, maintain a mapping from element to index, and grow arrays when needed (amortized O(1)). Each new element starts as its own set. Some implementations use a dictionary instead of array indices for arbitrary element types. The key is ensuring new elements get their own parent and rank/size entries.
Expected answer points:
- Both achieve O(log n) tree height bound
- Rank is easier: no counter updates on merge
- Size gives actual component sizes (useful for queries)
- Rank can overflow with huge n; size grows naturally with elements
- Implementation choice depends on whether you need size tracking
Expected answer points:
- Maintain a counter initialized to n (number of elements)
- Decrement by 1 each time union() returns True (successful merge)
- The count variable never requires iteration—O(1) update per union
- Alternative: iterate parents array counting roots, but that's O(n)
Expected answer points:
- Without path compression: tree is 1->2->3, find(1) traverses 2 steps returning 3 as root
- With path compression: find(1) compresses path, making 1 point directly to 3
- Subsequent finds on 1 or 2 are O(1)
- The chain forms because unions attach root of second tree under first
- After union by rank, the shallower tree becomes parent
Expected answer points:
- First find on an element traverses the full path to root (O(log n) worst)
- Path compression flattens the tree—element now points directly to root
- Subsequent finds on same element or siblings become O(1)
- After many operations, the structure becomes nearly flat
- This is why amortized α(n) ≈ O(1) instead of O(log n)
Expected answer points:
- Iterative avoids call stack overflow for deep trees (n up to millions)
- Two-pass: first find root, then compress all intermediate nodes
- Recursive version is elegant but may stack overflow on degenerate input
- Iterative allows fine-grained control of the compression loop
- Both achieve same time complexity; iterative is safer for production
Expected answer points:
- Rank of a root equals tree height (log n maximum)
- Rank increases by 1 only when merging equal-rank trees
- After merging, the combined tree has rank = max(r1, r2) or r1+1 if equal
- Rank values remain bounded by log₂(n) regardless of operation count
- Typical rank values stay small (under 20) even for millions of elements
Expected answer points:
- Map each cell (i,j) to index i*cols + j
- Initialize union-find with n = rows * cols
- Iterate cells: if cell is '1', union with neighboring '1' cells
- Skip '0' cells—they are never part of islands
- Count distinct roots among all '1' cells to get island count
- Alternative: BFS/DFS flood fill, but union-find handles dynamic connectivity
Expected answer points:
- Kruskal sorts edges by weight and adds them greedily
- Before adding edge (u,v), check if they are already connected
- If find(u) == find(v), adding this edge would create a cycle—skip it
- If connected, union(u,v) merges components
- Stop when MST has n-1 edges
- Union-find makes cycle detection O(α(n)) per edge, dominating the O(m log m) sort
Expected answer points:
- Connected components partition elements into disjoint sets
- Each union operation reduces component count by 1 if it merges two different sets
- Initialize count = n (each element is its own component)
- Every successful union decrements count
- At the end, count = number of distinct roots
- Alternatively: iterate all parents and count unique roots (O(n))
Expected answer points:
- Without path compression and union by rank: O(n) per operation worst case
- Sequential unions in order: union(1,2), union(2,3), ..., union(n-1,n) creates a chain
- Each find traverses the entire chain (n-1 steps)
- Path compression alone: height becomes O(log n) but not flat
- Union by rank alone: height bounded by log n
- Both together: α(n) ≈ O(1) even in worst case sequences
Expected answer points:
- Build network where each connection is an edge between nodes
- Process edges one by one: if find(u) == find(v), this edge is redundant (cycle)
- Otherwise, union(u,v) adds the edge to the spanning forest
- After processing all edges, remaining edges form minimum spanning tree
- Count how many edges were skipped to measure redundancy
- Useful in network design verification and routing tables
Expected answer points:
- O(n) space: three arrays of size n (parent, rank, size)
- Parent array stores integer index per element
- Rank array stores small integers (0 to log n max)
- Size array (if used) stores component sizes
- For n=10 million: ~120MB with 4-byte integers
- Can be reduced: store rank in parent array's unused bits if bit-packing
Expected answer points:
Expected answer points:
- Both find connected components in O(V+E)
- Union-find: better when graph is dynamic or edges arrive incrementally
- Union-find: O(α(n)) per edge, no recursion, no visited array needed
- DFS/BFS: O(V+E) guaranteed, simpler to implement correctly
- Use DFS when you need traversal order or paths between nodes
- Use union-find for Kruskal, cycle detection, incremental connectivity
Expected answer points:
- Collaborative editing needs conflict resolution and operation ordering
- Union-find tracks connectivity but not operation order or causality
- Challenges: concurrent union operations need synchronization
- Distributed union-find requires consensus (Paxos/Raft) to resolve conflicts
- Better suited for offline or single-writer scenarios
- For CRDTs (conflict-free replicated data types), union-find alone is insufficient
Expected answer points:
- Need to revert parent, rank, and size changes on undo
- Use a stack of changes: each union pushes previous state onto stack
- On rollback, pop changes and restore previous parent/rank values
- Support multiple rollbacks: persistent stack (not just a single undo)
- Trade-off: additional memory for change history
- Alternative: command pattern where each operation is reversible
Further Reading
- CLRS Chapter 21: Data Structures for Disjoint Sets — The canonical textbook treatment with formal proofs
- CP-Algorithms: Union-Find — Practical implementation variants and applications
- Visualgo: Union-Find — Interactive visualization of union-find operations
- LeetCode Union-Find Problems — Practice problems categorized by difficulty
- Kruskal’s Algorithm — Union-Find’s primary application in MST finding
- Ackermann Function — Understanding why α(n) is effectively constant
Conclusion
Union-Find gives you near-constant time set operations—find and union both run in O(α(n)), which is effectively O(1) for any practical input size. Path compression flattens the tree on each find, and union by rank keeps trees shallow. It’s the standard tool for cycle detection in Kruskal’s MST, grouping connected components, and any problem involving disjoint sets. For graph traversal basics, see Graphs: Adjacency List and Matrix, or Topological Sort for another graph ordering algorithm.
Category
Related Posts
Bellman-Ford Algorithm: Shortest Paths with Negative Weights
Learn the Bellman-Ford algorithm for single-source shortest paths including negative edge weights and negative cycle detection.
Dijkstra's Algorithm: Finding Shortest Paths in Weighted Graphs
Master Dijkstra's algorithm for single-source shortest path problems in weighted graphs with positive edges, including implementations and trade-offs.
Kruskal's Algorithm: Building Minimum Spanning Trees
Learn Kruskal's algorithm for finding minimum spanning trees in weighted graphs using union-find data structures.