Longest Common Subsequence: String Alignment Problems

Master the classic DP problem for string similarity, diff tools, and bioinformatics with O(mn) time and space optimization techniques.

published: reading time: 16 min read author: GeekWorkBench

Longest Common Subsequence: String Alignment Problems

The longest common subsequence (LCS) problem asks: given two sequences, what is the longest subsequence they share while preserving order? Unlike substrings, subsequences need not be contiguous—it’s about finding the common “skeleton” when you delete zero or more elements from each sequence. This problem underlies file diff tools like Git, plagiarism detection, bioinformatics sequence alignment, and countless other applications where measuring similarity matters.

The DP solution builds a table where dp[i][j] represents the LCS length of the first i characters of sequence A and first j characters of sequence B. The recurrence is elegant: if characters match, it’s 1 + dp[i-1][j-1]; otherwise, it’s max(dp[i-1][j], dp[i][j-1]). With space optimization from O(mn) to O(min(m,n)), this becomes surprisingly practical.

Introduction

LCS is one of the most foundational problems in dynamic programming. Given two sequences, LCS finds the longest subsequence present in both in the same relative order. It’s not just an academic exercise—it powers Git diff, DNA sequence alignment in bioinformatics, plagiarism detection in academic systems, and fuzzy matching in autocomplete features. The key insight that makes it tractable is that the problem exhibits optimal substructure: the LCS of two strings depends only on the LCS of their prefixes.

Understanding LCS matters for several reasons. First, it establishes the string alignment paradigm that underlies many DP problems. Second, the space-optimized O(min(m,n)) solution teaches an important lesson about reducing memory footprint while preserving correctness. Third, variations like edit distance and shortest common supersequence build directly on LCS concepts. In technical interviews, LCS appears regularly—not just as a direct “find LCS length” question, but as a blueprint for solving subsequence problems in disguise.

This guide covers the O(mn) DP table approach, how to reconstruct the actual subsequence string (not just its length), and the space optimization that reduces memory to O(min(m,n)). You’ll also learn Hirschberg’s algorithm for linear-space reconstruction, bit-parallel techniques for small alphabets, and how Hunt-Szymanski speeds up LCS when matches are sparse. By the end, you’ll recognize LCS patterns in new problems and know exactly which optimization applies.

When to Use

LCS applies when:

  • Comparing versions or detecting changes — Git diff, version control
  • Plagiarism detection — Measuring textual similarity between documents
  • Bioinformatics — DNA sequence alignment (BLAST, CLUSTAL)
  • Minimum edit distance problems — Levenshtein distance is a close cousin
  • Auto-completion and fuzzy matching — Suggesting corrections based on similarity

When Not to Use

  • When you need contiguous matches (use longest common substring instead)
  • Very large sequences with small alphabet (consider suffix arrays or FM-index)
  • Real-time streaming scenarios (batch processing assumes full input)

Architecture: DP Table Construction

graph TD
    A["dp[i-1][j-1] if match, else max(dp[i-1][j], dp[i][j-1])"] --> B[Fill dp table]
    B --> C["Trace back from dp[m][n]"]
    C --> D[LCS characters]

    subgraph Table
    E["s1[0] s1[1] s1[2]..."]
    F["s2[0]"]
    G["dp[0][*] = 0"]
    H["dp[*][0] = 0"]
    end

The first row and column are zeros (empty sequence has LCS length 0). Fill row by row, then trace diagonally when characters match.

Implementation

Standard LCS (O(mn) space)

def lcs_length(s1, s2):
    """
    Find LCS length between two strings.
    Time: O(mn), Space: O(mn)
    """
    m, n = len(s1), len(s2)
    dp = [[0] * (n + 1) for _ in range(m + 1)]

    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if s1[i - 1] == s2[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])

    return dp[m][n]

Space-Optimized LCS (O(n) space)

def lcs_length_optimized(s1, s2):
    """
    Space-optimized LCS using rolling array.
    Time: O(mn), Space: O(min(m, n))
    """
    # Ensure s2 is the shorter one for space efficiency
    if len(s1) < len(s2):
        s1, s2 = s2, s1

    m, n = len(s1), len(s2)
    prev = [0] * (n + 1)
    curr = [0] * (n + 1)

    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if s1[i - 1] == s2[j - 1]:
                curr[j] = prev[j - 1] + 1
            else:
                curr[j] = max(prev[j], curr[j - 1])
        prev, curr = curr, [0] * (n + 1)

    return prev[n]

Reconstruct LCS String

def lcs_string(s1, s2):
    """
    Reconstruct the actual LCS string, not just its length.
    """
    m, n = len(s1), len(s2)
    dp = [[0] * (n + 1) for _ in range(m + 1)]

    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if s1[i - 1] == s2[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])

    # Trace back to reconstruct
    lcs = []
    i, j = m, n
    while i > 0 and j > 0:
        if s1[i - 1] == s2[j - 1]:
            lcs.append(s1[i - 1])
            i -= 1
            j -= 1
        elif dp[i - 1][j] > dp[i][j - 1]:
            i -= 1
        else:
            j -= 1

    return ''.join(reversed(lcs))

Common Pitfalls / Anti-Patterns

  1. Off-by-one indexing — DP table has (m+1)×(n+1) cells; character indices offset by 1
  2. Confusing subsequence with substring — LCS allows gaps; substring requires contiguity
  3. Space optimization with mutation — When swapping, don’t accidentally share references
  4. Trace-back boundary conditions — While loop needs both i > 0 and j > 0 to avoid index errors

Quick Recap Checklist

  • Characters must match AND be in same relative order
  • DP table: match = diagonal + 1, no match = max of left/top
  • First row/column always 0 (empty sequence)
  • Space optimization: use two rows (prev/curr) since only previous row needed
  • Test with: empty strings, single char, identical strings, no common chars

Interview Questions

1. Why does the DP recurrence work?

The recurrence captures the two possibilities for the optimal solution ending at position (i,j): If s1[i-1] == s2[j-1], we extend the LCS of prefixes by one. Otherwise, the optimal solution must omit one of these characters, so we take the max of excluding s1[i-1] (dp[i-1][j]) or excluding s2[j-1] (dp[i][j-1]). By induction, these subproblems are already solved optimally when we compute dp[i][j].

2. How do you reconstruct the actual LCS string?

Start from dp[m][n] (bottom-right of table) and trace backwards. When characters match, move diagonally (up-left) and append the character. When they don't match, move in the direction of the larger value (up if dp[i-1][j] > dp[i][j-1], else left). This reconstructs the LCS by reversing the decisions made during table construction.

3. What's the relationship between LCS and Edit Distance?

Both are DP string alignment problems. LCS finds the longest common subsequence, maximizing matches. Edit distance (Levenshtein) finds minimum operations to transform one string to another, minimizing cost. The relationship: edit_distance = |s1| + |s2| - 2 × LCS_length when insertions/deletions cost 1 and substitutions cost 2. They optimize opposite objectives.

4. Can you implement LCS using only O(n) space?

Yes, using a rolling array technique. Since each dp[i][j] only depends on dp[i-1][j-1], dp[i-1][j], and dp[i][j-1], we only need the previous row. Maintain two arrays of length n+1 (prev and curr). After processing each row, swap them. This reduces space from O(mn) to O(min(m,n)). The trade-off is that you lose the ability to reconstruct the actual LCS string — only the length is available.

5. How would you find all longest common subsequences?

To enumerate all LCS strings, modify the traceback. When both dp[i-1][j] == dp[i][j-1] and they exceed the diagonal value, two valid paths exist. Use backtracking with memoization to collect all distinct LCS strings. The worst-case number of LCS strings is exponential — strings of all identical characters like "AAA" and "AAA" yield C(2n,n) possible LCS strings — so complete enumeration is only practical for small inputs.

6. What happens to LCS when strings are very large (100K+ characters)?

The standard O(mn) DP becomes impractical due to both time and memory. A 100K x 100K table requires 10 billion cells. Workarounds exist: use Hirschberg's algorithm for O(min(m,n)) space LCS reconstruction, suffix automaton or bit-parallel techniques for small alphabets, Myers' O(ND) diff algorithm for nearly identical strings, or heuristic approximations when exact LCS is not required.

7. How does LCS compare to Longest Common Substring?

LCS allows gaps between matched characters; longest common substring (LCSubstr) requires characters to be contiguous in both strings. LCS uses DP with max-over-neighbors: if match: dp[i][j] = dp[i-1][j-1] + 1, else: dp[i][j] = max(dp[i-1][j], dp[i][j-1]). LCSubstr resets to zero on a miss: if match: dp[i][j] = dp[i-1][j-1] + 1, else: dp[i][j] = 0. LCSubstr can be solved in O(n+m) with suffix automaton or rolling hash, while LCS is inherently O(mn) in general.

8. Given three strings, how do you find their LCS?

Extend the 2D DP to 3D. When all three characters match, dp[i][j][k] = 1 + dp[i-1][j-1][k-1]. Otherwise take the max of dp[i-1][j][k], dp[i][j-1][k], and dp[i][j][k-1]. Time complexity becomes O(mnp) and space O(mnp). Generalizing to k strings blows up to O(n^k) time, which becomes intractable quickly — multi-string LCS is NP-hard for arbitrary k.

9. How would you use LCS for plagiarism detection?

Compute the LCS between two documents (tokenized as sentences, words, or n-grams). The plagiarism ratio is LCS_length / min(|A|, |B|). A high ratio signals significant overlap. In practice: normalize text by removing whitespace and punctuation, use sentence-level tokens for better semantics, combine with shingling and MinHash for large corpora, and set a threshold (e.g., 0.8) to flag potential plagiarism.

10. What is Hirschberg's algorithm and why is it important?

Hirschberg's algorithm computes the LCS in O(mn) time but only O(min(m,n)) space while still reconstructing the full LCS string. The trick is divide-and-conquer: compute the DP forward from the start and backward from the end for the middle row, find where forward + backward is maximized to get the split point, then recursively solve the two halves. The significance is that LCS reconstruction doesn't need the full O(mn) DP table — this makes it feasible for large strings where memory is the real bottleneck.

11. How does bit-parallel LCS achieve speedups on small alphabets?

Bit-parallel LCS (Myers algorithm) encodes the DP row as machine words, exploiting parallelism in hardware. For alphabet size Σ, store Σ bitmasks for string B, then use bitwise operations to compute the entire row in O(n/W) where W is the machine word size (e.g., 64 bits). Per-character work drops from O(1) to O(1/W) amortized. This only works for small alphabets where the bitmask fits in a word, and performs best when |B| < W. For Unicode or large alphabets, bitmasks become impractical and you fall back to standard DP.

12. What is the Shortest Common Supersequence and how does it relate to LCS?

The Shortest Common Supersequence (SCS) of two strings is the shortest string that has both as subsequences. The connection to LCS is direct: SCS_length = |s1| + |s2| - LCS_length. To build an SCS, run the LCS traceback but interleave non-matching characters from both strings. SCS is essentially the LCS plus the gaps between matched pairs. For more than two strings, SCS stays NP-hard — the same intractability as multi-string LCS.

13. How would you adapt LCS for case-insensitive or Unicode string matching?

Normalize both strings before running LCS. For case-insensitive matching, convert both to lowercase (or uppercase) upfront — a one-time O(m+n) cost. For Unicode, NFC/NFD normalization handles characters with diacritics that visually match but have different code points. Consider grapheme clusters (e.g., é as e +́) if visual equality matters. The critical rule: preprocessing must be consistent across both strings before any comparison, never done as a post-check during DP, because the DP recurrence depends on character equality being uniform everywhere.

14. How does LCS help resolve git merge conflicts automatically?

Git uses a variation of Myers' diff algorithm, which borrows from LCS concepts. When merging branches, git finds the LCS between the original file and each branch version to separate non-conflicting changes from conflicting ones. Segments that differ in only one branch merge automatically; segments changed in both produce merge conflicts. The LCS pinpoints the common base, enabling a three-way merge: changes from the common ancestor to branch A and to branch B are identified independently, then combined. Conflicts only appear where the LCS-based alignment breaks because both branches modified the same region in incompatible ways.

15. Can you solve LCS for streaming input where strings arrive incrementally?

Standard LCS needs both strings in full, but streaming has its workarounds. Keep the current DP row for the first string against the streaming second string, updating it row-by-row as each chunk arrives. The full table never materializes. If the first string also streams in piece by piece, you're dealing with a different problem — true online LCS requires a bounded buffer strategy. When memory is tight, compute LCS on windows or fall back to approximation algorithms like ukkonen's that allow an error threshold ε.

16. What is the difference between LCS and Longest Palindromic Subsequence (LPS)?

LPS finds the longest subsequence that reads the same forwards and backwards in a single string. The connection: LPS(s) = LCS(s, reverse(s)). Reverse the string and find LCS with the original — the result is the longest palindrome. A palindrome's first and last characters must match, so aligning s with its reverse captures that symmetric structure. Complexity stays O(n²) with O(n) space possible. This relationship shows up in problems like minimum insertions to form a palindrome.

17. How do you batch-compute LCS across many pairs of strings efficiently?

If the first strings share structure (like a common prefix), build a suffix automaton or trie of all first-strings to share DP work. When strings share the same alphabet, pre-compute character positions and reuse row computations. N pairs processed naively take O(N·m·n). You can parallelize at the pair level (each pair is independent) or at the cell level (trickier to sync). If one string is shared across all pairs (e.g., one query against N database entries), build the DP table once and reuse partial results. Sorting pairs by length and processing shortest first helps when using space-optimized O(n) methods.

18. How does the Hunt-Szymanski algorithm improve LCS for rare matches?

Hunt-Szymanski is a divide-and-conquer LCS tuned for strings with few matching pairs. First find all matches (positions where s1[i] == s2[j]), then treat these as a DAG longest path problem. Building the full O(m·n) table is wasteful when matches are sparse. Sort match positions by row, then find the longest increasing subsequence in that sorted list. Complexity is O((m+n)·log(m) + R·log(R)) where R is the match count. This beats O(mn) DP when R << mn — which is exactly the scenario git faces with large, divergent files.

19. How would you compute LCS with at most k mismatches allowed?

Constrained LCS (CLCS) allows up to k mismatched positions without breaking the subsequence. Extend the DP state to dp[i][j][t] = LCS length with exactly t mismatches used so far. On a character match, take the diagonal and optionally reset the mismatch counter. On a miss, either move without counting (free gap) or increment the mismatch counter (counted gap). This runs in O(m·n·k) time with O(m·n) or O(m·n·k) space depending on pruning. When k is small relative to m and n, it's tractable; when k grows large, you're essentially computing string similarity anyway.

20. How does the Hirschberg traceback differ from a standard LCS traceback?

Hirschberg maintains two DP rows (forward and backward) instead of the full table. The forward pass goes left to right up to the middle column; the backward pass goes right to left from the end. At the middle row, forward[i][j] + backward[i][j] gives the LCS length crossing that boundary. Find the split point where forward[mid][j] + backward[mid][j] == LCS_length, then recursively reconstruct the LCS string by splitting at that mid point. Standard traceback walks from dp[m][n] backward through the entire O(mn) matrix. Hirschberg achieves O(min(m,n)) space by keeping only two rows at each recursion level — a fundamental difference in how traceback information is stored.

Further Reading

Conclusion

LCS comes down to building a DP table where a match gives you 1 + dp[i-1][j-1] and a miss gives you max(dp[i-1][j], dp[i][j-1]). Space-optimize to two rows if you only need the length, or keep the full table if you need to reconstruct the actual subsequence string. Watch for the off-by-one trap: your table indices are offset by one from your string indices. For related string problems, see Edit Distance and Levenshtein.

Category

Related Posts

Edit Distance: Transforming One String to Another

Solve the minimum edit distance problem using dynamic programming with applications in spell checking, DNA alignment, and autocomplete.

#edit-distance #dynamic-programming #levenshtein

1D Dynamic Programming Problems: Classic Patterns

Learn to solve common 1D dynamic programming problems including climbing stairs, house robber, and coin change with optimized space solutions.

#dynamic-programming #dp #1d-dp

2D Dynamic Programming: Matrix Chain and Beyond

Master 2D DP problems with two state variables for string manipulation, matrix chain multiplication, and optimal game strategies.

#dynamic-programming #dp #2d-dp