Union-Find: Disjoint Set Data Structure

Master union-find with path compression and union by rank for efficient set operations used in Kruskal's MST and connected components.

published: April 7, 2026 reading time: 19 min read author: GeekWorkBench

Union-Find: Disjoint Set Data Structure

Union-Find (also called Disjoint Set Union or DSU) is a data structure that tracks a collection of disjoint sets. It supports two operations: find (which set contains this element?) and union (merge two sets). With path compression and union by rank, both operations achieve nearly O(1) amortized time.

Union-Find powers Kruskal’s MST algorithm, detects cycles in graphs, finds connected components, and solves many partitioning problems.

Introduction

Union-Find (also called Disjoint Set Union or DSU) is a data structure that tracks a collection of disjoint sets, supporting two operations: find—which set contains this element?—and union—merge two sets together. What makes it work is that with two simple optimizations, both operations achieve amortized O(α(n)) time, where α(n) is the inverse Ackermann function that grows so slowly it is effectively constant for all practical inputs. This near-constant time makes Union-Find one of the most practical data structures for graph algorithms and connectivity problems.

The power of Union-Find comes from its simplicity and versatility. It forms the backbone of Kruskal’s minimum spanning tree algorithm, where cycle detection during edge addition requires efficient connectivity queries. It finds connected components in graphs, detects cycles, and solves partitioning problems in social networks and image processing. The data structure is also easy to implement correctly—many developers get it wrong without path compression or union by rank, leaving significant performance on the table.

This guide covers the core operations with path compression and union by rank, explains why these optimizations give near-constant time, and walks through practical applications like counting connected components, Kruskal’s MST, and the “most stones removed” problem. You’ll also see when to use union by size instead of union by rank, and how to implement find iteratively to avoid recursion depth issues.

Basic Implementation

class UnionFind:
    """
    Union-Find with path compression and union by rank.
    Nearly O(1) amortized for both operations.
    """

    def __init__(self, n):
        self.parent = list(range(n))  # Parent pointer
        self.rank = [0] * n  # Approximate tree depth

    def find(self, x):
        """
        Find root with path compression.
        Flattens tree along the way for future speedups.
        """
        if self.parent[x] != x:
            self.parent[x] = self.find(self.parent[x])  # Recursive compression
        return self.parent[x]

    def union(self, x, y):
        """
        Union by rank: attach smaller tree under larger tree.
        Returns True if merged, False if already in same set.
        """
        root_x = self.find(x)
        root_y = self.find(y)

        if root_x == root_y:
            return False  # Already in same set

        # Union by rank: attach smaller depth tree under larger
        if self.rank[root_x] < self.rank[root_y]:
            root_x, root_y = root_y, root_x  # Ensure root_x has >= rank

        self.parent[root_y] = root_x

        if self.rank[root_x] == self.rank[root_y]:
            self.rank[root_x] += 1

        return True

    def connected(self, x, y):
        """Check if two elements are in the same set."""
        return self.find(x) == self.find(y)

    def count(self):
        """Count number of distinct sets."""
        return sum(1 for i in range(len(self.parent)) if self.parent[i] == i)

Optimized Operations

class OptimizedUnionFind:
    """
    Union-Find with:
    - Path compression (find)
    - Union by rank (union)
    - Size tracking
    """

    def __init__(self, n):
        self.parent = list(range(n))
        self.rank = [0] * n
        self.size = [1] * n  # Track component sizes

    def find(self, x):
        """Find with iterative path compression."""
        root = x
        while self.parent[root] != root:
            root = self.parent[root]

        # Compress path
        while self.parent[x] != root:
            next_x = self.parent[x]
            self.parent[x] = root
            x = next_x

        return root

    def union(self, x, y):
        """Union by size: attach smaller tree under larger."""
        root_x = self.find(x)
        root_y = self.find(y)

        if root_x == root_y:
            return False

        # Attach smaller size tree under larger
        if self.size[root_x] < self.size[root_y]:
            root_x, root_y = root_y, root_x

        self.parent[root_y] = root_x
        self.size[root_x] += self.size[root_y]

        return True

    def component_size(self, x):
        """Get size of component containing x."""
        return self.size[self.find(x)]

    def max_component_size(self):
        """Largest component size."""
        return max(self.size)

Common Applications

# Number of connected components in graph
def count_connected_components(n, edges):
    """Count components using union-find."""
    uf = UnionFind(n)
    for u, v in edges:
        uf.union(u, v)
    return uf.count()


# Detect cycle in undirected graph
def has_cycle(n, edges):
    """Detect if graph has a cycle."""
    uf = UnionFind(n)
    for u, v in edges:
        if not uf.union(u, v):
            return True  # Already connected = cycle
    return False


# Kruskal's MST using union-find
def kruskal_mst(n, edges):
    """
    MST using Kruskal's algorithm.
    edges: list of (weight, u, v)
    """
    # Sort edges by weight
    edges.sort(key=lambda x: x[0])

    uf = UnionFind(n)
    mst = []
    total_weight = 0

    for weight, u, v in edges:
        if uf.union(u, v):
            mst.append((u, v, weight))
            total_weight += weight

        if len(mst) == n - 1:  # MST has n-1 edges
            break

    return mst, total_weight


# Most stones removed (LeetCode 947)
def remove_stones(stones):
    """
    Maximum stones that can be removed = total stones - number of components.
    Use union-find on stone positions.
    """
    uf = UnionFind(20000)  # 0-9999 for row, 10000-19999 for col

    for x, y in stones:
        uf.union(x, 10000 + y)  # Connect row x to column y

    # Count distinct roots
    roots = set(uf.find(x) for x, _ in stones)
    return len(stones) - len(roots)


# Longest consecutive sequence (LeetCode 128)
def longest_consecutive(nums):
    """
    Find longest consecutive sequence length.
    Use union-find to group consecutive numbers.
    """
    if not nums:
        return 0

    uf = UnionFind(len(nums))
    num_to_idx = {num: i for i, num in enumerate(nums)}

    # Union each number with its neighbor if it exists
    for num in nums:
        if num - 1 in num_to_idx:
            uf.union(num_to_idx[num], num_to_idx[num - 1])
        if num + 1 in num_to_idx:
            uf.union(num_to_idx[num], num_to_idx[num + 1])

    # Find largest component size
    max_size = 0
    for i in range(len(nums)):
        if uf.find(i) == i:
            max_size = max(max_size, uf.size[i])

    return max_size

Time Complexity Analysis

Operation	Naive	With Path Compression	With Both Optimizations
find	O(n)	O(log n)	α(n) ≈ O(1)
union	O(n)	O(log n)	α(n) ≈ O(1)
connected	O(n)	O(log n)	α(n) ≈ O(1)

α(n) is the inverse Ackermann function, which grows so slowly that for all practical values of n, α(n) ≤ 4.

Why Path Compression and Union by Rank Work

# Without optimizations: tree can degenerate to linked list (O(n) operations)

# With path compression:
# find(1) → 1 → 3 → 5 (find root)
# After: parent[1]=5, parent[3]=5, parent[5]=5 (flattened)

# With union by rank:
# Always attach smaller tree under larger
# Tree height ≤ log(n) because each merge at least doubles size of one component

Union-Find With Path Compression and Rank

graph TD
    subgraph Before["Before Path Compression (find 1)"]
        direction TB
        N1["1"]
        N2["2"]
        N3["3"]
        N4["4"]
        N5["5"]

        N1 --> N2
        N2 --> N3
        N3 --> N4
        N4 --> N5

        N5Style["5 is root"]
    end

    subgraph After["After Path Compression"]
        direction TB
        A1["1"]
        A2["2"]
        A3["3"]
        A4["4"]
        A5["5"]

        A1 -.-> A5
        A2 -.-> A5
        A3 -.-> A5
        A4 -.-> A5
        A5["5 (root)"]
    end

    subgraph UnionByRank["Union by Rank Example"]
        direction LR
        R1["rank=2, size=4"]
        R1 --> R2["rank=1, size=2"]
        R1 --> R3["rank=1, size=1"]
        R1 --> R4["rank=0"]

        R5["Attach smaller under larger"]
    end

When to Use Union-Find

Problem	Union-Find Use
Kruskal’s MST	Detect cycles while adding edges
Connected components	Group related items
Cycle detection	Check if edge connects nodes in same component
Network connectivity	Is x connected to y?
Image processing	Connected components in pixels
Clustering	Group similar items

Production Failure Scenarios

Path compression skipped in recursive find: If you implement find recursively but forget the assignment self.parent[x] = self.find(self.parent[x]), path compression never happens. The tree degrades to O(n) height over time. Even iterative implementations can miss path compression if the compression loop doesn’t update all nodes along the chain. Every find must flatten—test this with a long chain before deployment.
Union by rank/size not applied consistently: If you sometimes union with arbitrary ordering instead of by rank or size, the tree height bound breaks. A sequence of unions in bad order produces a linear chain, turning near-O(1) operations into O(n). This is especially insidious because it works fine on small test cases but degrades on large inputs.
Rank overflow with large n: The rank field is typically a small integer. If you have millions of elements and keep merging trees of equal rank, rank can overflow an 8-bit or 16-bit integer. The tree still works but the rank comparison logic breaks, potentially causing incorrect unions.
Parent array not initialized for all elements: When using union-find in a dynamic context (adding elements after initialization), new elements need proper parent and rank initialization. If you skip this, find on uninitialized elements reads garbage from the arrays and corrupts the structure.

Observability Checklist

Track union-find operations to catch degradation and correctness issues.

Core Metrics

find operation count and latency per call
union operation count and merge rate
Tree depth distribution across all roots
Rank distribution across all elements
Component count over time

Health Signals

find latency increasing: tree height growing beyond log n
Union merge rate dropping: all unions are between same-set elements (may indicate logic error)
Rank values growing beyond log₂(n): union-by-rank broken
Component count not decreasing as expected: union operations failing silently
find returning unexpected values: parent array corrupted

Alerting Thresholds

find latency > log₂(n) consistently: tree degradation
Rank values exceeding 30 for n < 10⁶: rank tracking error
union returning False rate > 50% without actual cycles: logic error
Component count stable when it should be changing: union not executing

Distributed Tracing

Trace each find with path length and nodes visited
Include root node and compression ratio in span metadata
Track union with merged component sizes
Correlate slow finds with specific tree shapes (deep chains)

Security Notes

Union-find has specific security concerns when attacker controls operations.

Rank manipulation to cause worst-case performance: An attacker who controls union order can craft sequences that violate the union-by-rank guarantee, creating deep trees. While find is still O(log n), the constant factor matters under heavy load.

Fix: Use randomized union by shuffling operations if adversarial. Or use union by size which is harder to manipulate predictably.

Parent array corruption via overflow: If the parent array stores indices and an attacker can trigger index overflow (through large n values), the structure can be corrupted to point to arbitrary memory locations.

Fix: Validate n against maximum bounds before initialization. Use language-level bounds checking where available.

Cycle detection bypass via timing: In cycle detection contexts, if timing differences in find operations reveal tree structure, an attacker could use this to craft “safe” edges that actually create cycles but appear safe by timing.

Fix: Ensure find operations take constant time regardless of path length. Use branchless implementations to eliminate timing side channels.

Quick Recap Checklist

Union-Find tracks disjoint sets efficiently
find(x): returns root of set containing x
union(x, y): merges sets containing x and y
Path compression flattens tree on find
Union by rank attaches smaller tree under larger
Amortized O(α(n)) ≈ O(1) per operation
Used in Kruskal’s MST, cycle detection, connected components

Trade-off Analysis

Approach	find Complexity	union Complexity	Space	Notes
Naive	O(n)	O(n)	O(n)	No optimizations
Path Compression Only	O(log n)	O(log n)	O(n)	Flattening on find
Union by Rank Only	O(log n)	O(log n)	O(n)	Height bound log n
Both Optimizations	α(n) ≈ O(1)	α(n) ≈ O(1)	O(n)	Inverse Ackermann
Union by Size	O(log n)	O(log n)	O(n)	Same as rank
Weighted Quick-Union	O(n) worst	O(log n)	O(n)	Without path compression

When to Prefer Each

Path compression only: Good when unions are rare, finds are frequent
Union by rank only: Good when finds and unions are balanced
Both (standard): Best overall performance for most use cases
Union by size: Better when you need component size tracking
Weighted quick-union: Legacy approach, path compression superior

Interview Questions

1. What is the inverse Ackermann function and why does it appear?

α(n) is the inverse Ackermann function—extremely slow growing. For n = 2^(2^(2^(2^16))), α(n) = 5. For any practical n, α(n) ≤ 4 or 5. It appears because with union by rank, tree height ≤ log(n), and path compression reduces this further. The combination is so effective that n operations take practically O(n) time, or O(α(n)) amortized per operation.

2. Why do we use union by rank instead of union by size?

Union by rank and union by size achieve the same logarithmic bound, but rank is easier to maintain (no need to track sizes). Both ensure the tree height stays ≤ log(n) because when merging trees of equal rank, rank increases by 1. A tree of rank r has at least 2^r elements, so rank cannot exceed log(n). Size tracking gives the same guarantee and can also report component sizes, which is sometimes useful.

3. How does union-find detect cycles in a graph?

Process each edge (u, v). If find(u) == find(v), u and v are already connected, so adding this edge creates a cycle. Otherwise, union(u, v) merges their components. This is the basis of Kruskal's MST algorithm—you only add an edge if it doesn't create a cycle.

4. Can union-find handle dynamic addition of elements?

Standard union-find has fixed size. For dynamic addition, maintain a mapping from element to index, and grow arrays when needed (amortized O(1)). Each new element starts as its own set. Some implementations use a dictionary instead of array indices for arbitrary element types. The key is ensuring new elements get their own parent and rank/size entries.

5. What is the difference between union by rank vs union by size?

Expected answer points:

Both achieve O(log n) tree height bound
Rank is easier: no counter updates on merge
Size gives actual component sizes (useful for queries)
Rank can overflow with huge n; size grows naturally with elements
Implementation choice depends on whether you need size tracking

6. How would you implement a count of distinct components in O(1) after each union?

Expected answer points:

Maintain a counter initialized to n (number of elements)
Decrement by 1 each time union() returns True (successful merge)
The count variable never requires iteration—O(1) update per union
Alternative: iterate parents array counting roots, but that's O(n)

7. Walk through the sequence: union(1,2), union(2,3), find(1). What is the result with and without path compression?

Expected answer points:

Without path compression: tree is 1->2->3, find(1) traverses 2 steps returning 3 as root
With path compression: find(1) compresses path, making 1 point directly to 3
Subsequent finds on 1 or 2 are O(1)
The chain forms because unions attach root of second tree under first
After union by rank, the shallower tree becomes parent

8. How does path compression affect time complexity of successive find operations?

Expected answer points:

First find on an element traverses the full path to root (O(log n) worst)
Path compression flattens the tree—element now points directly to root
Subsequent finds on same element or siblings become O(1)
After many operations, the structure becomes nearly flat
This is why amortized α(n) ≈ O(1) instead of O(log n)

9. Implement find using iterative approach with path compression. Why prefer it over recursive?

Expected answer points:

Iterative avoids call stack overflow for deep trees (n up to millions)
Two-pass: first find root, then compress all intermediate nodes
Recursive version is elegant but may stack overflow on degenerate input
Iterative allows fine-grained control of the compression loop
Both achieve same time complexity; iterative is safer for production

10. What happens to the rank values when union by rank is applied consistently over many operations?

Expected answer points:

Rank of a root equals tree height (log n maximum)
Rank increases by 1 only when merging equal-rank trees
After merging, the combined tree has rank = max(r1, r2) or r1+1 if equal
Rank values remain bounded by log₂(n) regardless of operation count
Typical rank values stay small (under 20) even for millions of elements

11. How would you use union-find to find the number of islands in a 2D grid (LeetCode 200)?

Expected answer points:

Map each cell (i,j) to index i*cols + j
Initialize union-find with n = rows * cols
Iterate cells: if cell is '1', union with neighboring '1' cells
Skip '0' cells—they are never part of islands
Count distinct roots among all '1' cells to get island count
Alternative: BFS/DFS flood fill, but union-find handles dynamic connectivity

12. Explain why union-find is used in Kruskal's algorithm for MST.

Expected answer points:

Kruskal sorts edges by weight and adds them greedily
Before adding edge (u,v), check if they are already connected
If find(u) == find(v), adding this edge would create a cycle—skip it
If connected, union(u,v) merges components
Stop when MST has n-1 edges
Union-find makes cycle detection O(α(n)) per edge, dominating the O(m log m) sort

13. What is the relationship between connected components and union-find? How do you count them?

Expected answer points:

Connected components partition elements into disjoint sets
Each union operation reduces component count by 1 if it merges two different sets
Initialize count = n (each element is its own component)
Every successful union decrements count
At the end, count = number of distinct roots
Alternatively: iterate all parents and count unique roots (O(n))

14. Describe a scenario where union-find performs poorly. What input causes worst-case behavior?

Expected answer points:

Without path compression and union by rank: O(n) per operation worst case
Sequential unions in order: union(1,2), union(2,3), ..., union(n-1,n) creates a chain
Each find traverses the entire chain (n-1 steps)
Path compression alone: height becomes O(log n) but not flat
Union by rank alone: height bounded by log n
Both together: α(n) ≈ O(1) even in worst case sequences

15. How can union-find be used to detect redundant connections in a network?

Expected answer points:

Build network where each connection is an edge between nodes
Process edges one by one: if find(u) == find(v), this edge is redundant (cycle)
Otherwise, union(u,v) adds the edge to the spanning forest
After processing all edges, remaining edges form minimum spanning tree
Count how many edges were skipped to measure redundancy
Useful in network design verification and routing tables

16. What is the memory footprint of union-find with path compression and union by rank?

Expected answer points:

O(n) space: three arrays of size n (parent, rank, size)
Parent array stores integer index per element
Rank array stores small integers (0 to log n max)
Size array (if used) stores component sizes
For n=10 million: ~120MB with 4-byte integers
Can be reduced: store rank in parent array's unused bits if bit-packing

17. How does union-find handle the "most stones removed" problem (LeetCode 947)?

Expected answer points:

Key insight: answer = total stones - number of components

For stone at (row, col), use two indices: row and col+10000 (offset to separate row/col spaces)

Union all stones in same row with each other (same row = connected)

Union all stones in same column with each other (same column = connected)

Stones in same row or column end up in same component

Each component keeps exactly 1 stone; rest can be removed

18. Compare union-find vs DFS/BFS for finding connected components. When would you choose each?

Expected answer points:

Both find connected components in O(V+E)
Union-find: better when graph is dynamic or edges arrive incrementally
Union-find: O(α(n)) per edge, no recursion, no visited array needed
DFS/BFS: O(V+E) guaranteed, simpler to implement correctly
Use DFS when you need traversal order or paths between nodes
Use union-find for Kruskal, cycle detection, incremental connectivity

19. Can union-find be used for near-real-time collaborative editing? What is the challenge?

Expected answer points:

Collaborative editing needs conflict resolution and operation ordering
Union-find tracks connectivity but not operation order or causality
Challenges: concurrent union operations need synchronization
Distributed union-find requires consensus (Paxos/Raft) to resolve conflicts
Better suited for offline or single-writer scenarios
For CRDTs (conflict-free replicated data types), union-find alone is insufficient

20. Design an union-find variant that supports rollback (undo operations). What data structure is needed?

Expected answer points:

Need to revert parent, rank, and size changes on undo
Use a stack of changes: each union pushes previous state onto stack
On rollback, pop changes and restore previous parent/rank values
Support multiple rollbacks: persistent stack (not just a single undo)
Trade-off: additional memory for change history
Alternative: command pattern where each operation is reversible

Conclusion

Union-Find gives you near-constant time set operations—find and union both run in O(α(n)), which is effectively O(1) for any practical input size. Path compression flattens the tree on each find, and union by rank keeps trees shallow. It’s the standard tool for cycle detection in Kruskal’s MST, grouping connected components, and any problem involving disjoint sets. For graph traversal basics, see Graphs: Adjacency List and Matrix, or Topological Sort for another graph ordering algorithm.

Union-Find: Disjoint Set Data Structure

Introduction

Basic Implementation

Optimized Operations

Common Applications

Time Complexity Analysis

Why Path Compression and Union by Rank Work

Union-Find With Path Compression and Rank

When to Use Union-Find

Production Failure Scenarios

Observability Checklist

Core Metrics

Health Signals

Alerting Thresholds

Distributed Tracing

Security Notes

Quick Recap Checklist

Trade-off Analysis

When to Prefer Each

Interview Questions

Further Reading

Conclusion

Category

Tags

Related Posts

Bellman-Ford Algorithm: Shortest Paths with Negative Weights

Dijkstra's Algorithm: Finding Shortest Paths in Weighted Graphs

Kruskal's Algorithm: Building Minimum Spanning Trees