Overview
The longest common substring problem identifies the longest contiguous sequence of characters that appears in two or more strings. Unlike the longest common subsequence, which allows non-contiguous matches, the longest common substring requires all matching characters to appear consecutively in the same order within each string.
Given two strings "ABABC" and "BABCA", the longest common substring is "BABC" with length 4. The substring must appear without interruption in both strings - individual characters scattered throughout don't qualify unless they form a continuous block.
This problem appears frequently in bioinformatics for DNA sequence analysis, plagiarism detection systems, file diff utilities, and data deduplication tools. The algorithm determines how much continuous overlap exists between datasets, making it valuable for identifying similar content blocks, detecting copy-paste scenarios, or finding repeated patterns in genetic sequences.
The distinction between substring and subsequence is critical. The subsequence "ABABC" could match scattered characters across a string, but the substring must match a contiguous block. This requirement makes the substring problem both more restrictive and, in some ways, more tractable than the subsequence variant.
# Substring: must be contiguous
str1 = "ABABC"
str2 = "BABCA"
# "BABC" appears continuously in both
# Not valid: "ABABC" scattered across str2
# Valid substring must appear as unbroken sequence
Key Principles
The longest common substring algorithm operates on the principle of comparing all possible substrings between two input strings to find the longest match. At its core, the problem requires examining each position in both strings and extending matches as far as possible.
The dynamic programming approach builds a matrix where each cell (i,j) represents the length of the common substring ending at position i in the first string and position j in the second string. If characters match at these positions, the cell value equals the diagonal predecessor plus one. If characters differ, the cell value resets to zero.
# Dynamic programming state transition
# If str1[i] == str2[j]:
# dp[i][j] = dp[i-1][j-1] + 1
# Else:
# dp[i][j] = 0
The matrix tracks all common substring lengths, and the maximum value indicates the longest common substring length. The position of this maximum value determines where the substring ends, allowing reconstruction of the actual substring by backtracking.
String indexing matters significantly. Most implementations use 1-based indexing for the DP matrix to simplify boundary conditions, while 0-based indexing for the actual strings follows standard programming conventions. This dual indexing scheme requires careful translation when extracting the final substring.
The algorithm maintains two key pieces of information: the maximum length found and the ending position in one of the strings. Some implementations track both ending positions for verification or when multiple strings are involved.
# Core state tracking
max_length = 0
end_position = 0
# During matrix construction
if current_length > max_length
max_length = current_length
end_position = current_index
end
For multiple strings (more than two), the problem extends by finding common substrings across all inputs. This typically involves pairwise comparisons or suffix tree structures for efficiency. The generalized approach identifies substrings present in every input string, not just the maximum pair.
Implementation Approaches
Three primary algorithmic approaches solve the longest common substring problem, each with distinct time and space complexity characteristics. The choice depends on input size, performance requirements, and memory constraints.
Brute Force Approach
The brute force method generates all possible substrings from the first string and checks each against the second string. For a string of length n, this generates O(n²) substrings, and checking each substring against the second string (length m) takes O(m) time, yielding O(n² × m) overall complexity.
This approach requires minimal space - just storage for the current best substring - but becomes impractical for strings longer than a few hundred characters. The algorithm starts with the longest possible substrings and works down, allowing early termination once a match is found at a given length.
for length from n down to 1:
for each substring of this length in str1:
if substring exists in str2:
return substring
The brute force approach works well for very short strings or when the longest common substring is expected to be found quickly among longer substrings. Educational contexts use this method to demonstrate the problem before introducing optimization.
Dynamic Programming Approach
Dynamic programming reduces complexity to O(n × m) time and O(n × m) space by building a matrix that reuses previous computations. Each cell computation requires only the diagonal predecessor, making the algorithm efficient through memoization.
The DP matrix fills row by row, with each cell representing the length of the common substring ending at that position. The maximum value in the entire matrix indicates the answer. This approach guarantees finding the optimal solution in polynomial time.
Space optimization reduces memory usage to O(min(n,m)) by maintaining only two rows at a time - the current row and previous row. Since each cell depends only on the diagonal predecessor from the previous row, older rows can be discarded. This optimization maintains O(n × m) time complexity while dramatically reducing space requirements for large inputs.
current_row = array of size m
previous_row = array of size m
for i in 1..n:
for j in 1..m:
if str1[i] == str2[j]:
current_row[j] = previous_row[j-1] + 1
else:
current_row[j] = 0
swap(current_row, previous_row)
Suffix Tree Approach
Suffix trees provide O(n + m) construction time and O(n + m) space for the generalized suffix tree. After building the tree containing all suffixes from both strings, finding the longest common substring requires finding the deepest internal node that has descendant leaves from both strings.
This approach excels when solving the problem for multiple string pairs or when the same strings are queried repeatedly. The initial suffix tree construction cost amortizes across multiple queries. Suffix arrays offer a space-efficient alternative to suffix trees with similar time complexity.
The suffix tree marks each leaf with its source string identifier. Internal nodes represent common prefixes, and the depth of an internal node indicates the length of that common prefix. The deepest internal node with mixed descendants represents the longest common substring.
For most applications involving two strings and single queries, dynamic programming offers the best balance. Suffix trees become advantageous when processing many strings or performing repeated queries on the same dataset.
Ruby Implementation
Ruby provides string manipulation capabilities that make implementing the longest common substring straightforward. The dynamic programming approach translates cleanly into Ruby with its expressive syntax and built-in array handling.
Basic Dynamic Programming Implementation
def longest_common_substring(str1, str2)
return "" if str1.empty? || str2.empty?
m = str1.length
n = str2.length
# Create DP matrix
dp = Array.new(m + 1) { Array.new(n + 1, 0) }
max_length = 0
end_position = 0
# Fill DP matrix
1.upto(m) do |i|
1.upto(n) do |j|
if str1[i - 1] == str2[j - 1]
dp[i][j] = dp[i - 1][j - 1] + 1
if dp[i][j] > max_length
max_length = dp[i][j]
end_position = i
end
end
end
end
# Extract substring
return "" if max_length == 0
str1[(end_position - max_length)...end_position]
end
# Usage
longest_common_substring("ABABC", "BABCA")
# => "BABC"
The implementation uses 1-based indexing for the DP matrix but 0-based indexing for string access (str1[i-1]). The matrix size is (m+1) × (n+1) to accommodate the zero-initialization row and column representing empty string prefixes.
Space-Optimized Implementation
def longest_common_substring_optimized(str1, str2)
return "" if str1.empty? || str2.empty?
# Ensure str1 is the shorter string for memory efficiency
str1, str2 = str2, str1 if str1.length > str2.length
m = str1.length
n = str2.length
previous_row = Array.new(n + 1, 0)
current_row = Array.new(n + 1, 0)
max_length = 0
end_position = 0
1.upto(m) do |i|
1.upto(n) do |j|
if str1[i - 1] == str2[j - 1]
current_row[j] = previous_row[j - 1] + 1
if current_row[j] > max_length
max_length = current_row[j]
end_position = i
end
else
current_row[j] = 0
end
end
previous_row, current_row = current_row, previous_row
current_row.fill(0)
end
return "" if max_length == 0
str1[(end_position - max_length)...end_position]
end
# Usage
longest_common_substring_optimized("ABABC", "BABCA")
# => "BABC"
This version uses only O(n) space by maintaining two rows. The row swap after each iteration reuses memory. Ensuring the shorter string is str1 minimizes the array size.
Finding All Common Substrings
def all_common_substrings(str1, str2, min_length = 1)
return [] if str1.empty? || str2.empty?
m = str1.length
n = str2.length
dp = Array.new(m + 1) { Array.new(n + 1, 0) }
substrings = []
1.upto(m) do |i|
1.upto(n) do |j|
if str1[i - 1] == str2[j - 1]
dp[i][j] = dp[i - 1][j - 1] + 1
if dp[i][j] >= min_length
substring = str1[(i - dp[i][j])...i]
substrings << substring unless substrings.include?(substring)
end
end
end
end
substrings.sort_by { |s| -s.length }
end
# Usage
all_common_substrings("ABABC", "BABCA", 2)
# => ["BABC", "ABC", "AB", "BC"]
This implementation collects all common substrings meeting the minimum length requirement. The include? check prevents duplicates, though this adds overhead. For performance-critical applications, use a Set for O(1) duplicate detection.
Case-Insensitive Comparison
def longest_common_substring_ignore_case(str1, str2)
return "" if str1.empty? || str2.empty?
str1_lower = str1.downcase
str2_lower = str2.downcase
m = str1.length
n = str2.length
dp = Array.new(m + 1) { Array.new(n + 1, 0) }
max_length = 0
end_position = 0
1.upto(m) do |i|
1.upto(n) do |j|
if str1_lower[i - 1] == str2_lower[j - 1]
dp[i][j] = dp[i - 1][j - 1] + 1
if dp[i][j] > max_length
max_length = dp[i][j]
end_position = i
end
end
end
end
return "" if max_length == 0
# Return original case from str1
str1[(end_position - max_length)...end_position]
end
# Usage
longest_common_substring_ignore_case("AbaBc", "BABCA")
# => "aBC" (preserves original case from first string)
This version compares lowercase versions but returns the substring in its original case from the first string. For applications requiring normalized output, return from the lowercase version instead.
Multiple String Support
def longest_common_substring_multiple(strings)
return "" if strings.empty? || strings.any?(&:empty?)
# Start with first two strings
result = longest_common_substring(strings[0], strings[1])
return "" if result.empty?
# Check if result appears in remaining strings
strings[2..-1].each do |str|
# Find where result appears in current string
found = false
result.length.downto(1) do |len|
0.upto(result.length - len) do |start|
substring = result[start, len]
if str.include?(substring)
result = substring
found = true
break
end
end
break if found
end
return "" if result.empty?
end
result
end
# Usage
longest_common_substring_multiple(["ABABC", "BABCA", "ABCAB"])
# => "AB"
This approach finds the longest common substring between the first two strings, then verifies it appears in remaining strings, reducing its length as needed. For better performance with many strings, use suffix tree approaches.
Performance Considerations
The dynamic programming solution operates in O(n × m) time complexity where n and m represent the lengths of the two input strings. Each cell in the matrix requires constant time to compute, and the matrix contains n × m cells. Space complexity matches at O(n × m) for the full matrix implementation.
The space-optimized version reduces memory usage to O(min(n,m)) by maintaining only two rows, which matters significantly for large strings. A string pair of 10,000 characters each requires 100 million cells in the full matrix (400MB for 4-byte integers) versus 10,000 cells (40KB) in the optimized version.
Character comparison dominates the runtime. Ruby's string indexing operation runs in O(1) time, but the sheer number of comparisons (n × m) creates the bottleneck. For strings of 1,000 characters each, the algorithm performs one million character comparisons.
require 'benchmark'
str1 = "A" * 1000 + "MATCH" + "B" * 1000
str2 = "C" * 1000 + "MATCH" + "D" * 1000
Benchmark.bm do |x|
x.report("full matrix:") do
longest_common_substring(str1, str2)
end
x.report("optimized:") do
longest_common_substring_optimized(str1, str2)
end
end
# Results show similar time, vastly different memory usage
The time complexity remains unchanged between implementations - both perform the same n × m comparisons. Memory usage differs dramatically, making the optimized version essential for large inputs or memory-constrained environments.
For repeated queries on the same strings, preprocessing with suffix trees reduces query time to O(n + m) for tree construction and O(1) for finding the deepest internal node with mixed descendants. The tradeoff involves higher implementation complexity and memory overhead for the tree structure.
Early termination strategies help when the maximum possible remaining substring length cannot exceed the current best. If at position i in a string of length n, the maximum remaining length is (n - i). When this value falls below the current best length, remaining iterations cannot improve the result.
def longest_common_substring_early_termination(str1, str2)
m = str1.length
n = str2.length
max_length = 0
end_position = 0
dp = Array.new(m + 1) { Array.new(n + 1, 0) }
1.upto(m) do |i|
# Early termination check
break if (m - i + 1) <= max_length
1.upto(n) do |j|
next if (n - j + 1) <= max_length
if str1[i - 1] == str2[j - 1]
dp[i][j] = dp[i - 1][j - 1] + 1
if dp[i][j] > max_length
max_length = dp[i][j]
end_position = i
end
end
end
end
return "" if max_length == 0
str1[(end_position - max_length)...end_position]
end
This optimization helps when strings have minimal overlap. Worst-case complexity remains O(n × m), but average-case performance improves for dissimilar strings.
String encoding affects performance. ASCII strings allow single-byte character access, while UTF-8 strings may require multi-byte character decoding. Ruby handles this transparently, but processing UTF-8 strings with multi-byte characters runs slower than ASCII equivalents.
Cache locality impacts performance on modern processors. The row-by-row access pattern in the DP solution exhibits good spatial locality, keeping recently accessed memory in cache. The two-row optimization further improves cache performance by reducing the working set size.
Practical Examples
File Comparison and Diff Tools
Version control systems and file diff utilities use longest common substring algorithms to identify unchanged blocks of text between file versions. This enables efficient delta encoding and meaningful diff displays.
def file_diff_blocks(old_content, new_content)
lines_old = old_content.split("\n")
lines_new = new_content.split("\n")
# Treat each line as a character for LCS purposes
old_str = lines_old.join("\x00")
new_str = lines_new.join("\x00")
common = longest_common_substring(old_str, new_str)
return [] if common.empty?
# Convert back to lines
common.split("\x00")
end
# Usage
old_file = "line1\nline2\nline3\nline4"
new_file = "line1\nline2\nmodified\nline4"
unchanged = file_diff_blocks(old_file, new_file)
# => ["line1", "line2"]
# Shows largest unchanged block
This approach identifies the largest contiguous block of unchanged lines. Real diff tools use more sophisticated algorithms (Myers diff), but longest common substring provides the foundation for understanding content similarity.
Plagiarism Detection
Academic integrity tools detect copied content by finding long common substrings between submitted documents. Matches exceeding a threshold length suggest potential plagiarism.
class PlagiarismDetector
MIN_MATCH_LENGTH = 50
def initialize(min_match_length = MIN_MATCH_LENGTH)
@min_match_length = min_match_length
end
def detect(document1, document2)
# Normalize: remove whitespace, lowercase
norm1 = normalize(document1)
norm2 = normalize(document2)
matches = []
# Find all common substrings above threshold
loop do
match = longest_common_substring(norm1, norm2)
break if match.length < @min_match_length
matches << {
text: match,
length: match.length,
position1: norm1.index(match),
position2: norm2.index(match)
}
# Remove match to find next longest
norm1 = norm1.sub(match, ' ' * match.length)
norm2 = norm2.sub(match, ' ' * match.length)
end
matches
end
private
def normalize(text)
text.downcase.gsub(/\s+/, ' ').strip
end
end
# Usage
detector = PlagiarismDetector.new(30)
doc1 = "The quick brown fox jumps over the lazy dog. This is original content."
doc2 = "Some new text. The quick brown fox jumps over the lazy dog. More new text."
matches = detector.detect(doc1, doc2)
matches.each do |m|
puts "Match (#{m[:length]} chars): #{m[:text]}"
end
# => Match (43 chars): the quick brown fox jumps over the lazy dog
This detector finds multiple common substrings by iteratively removing found matches. Production systems would use more sophisticated approaches like rolling hashes or suffix arrays for efficiency.
DNA Sequence Analysis
Bioinformatics applications identify conserved regions between DNA sequences using longest common substring matching. Shared sequences indicate evolutionary relationships or functional importance.
class DNAAnalyzer
BASES = %w[A T G C]
def find_conserved_regions(sequence1, sequence2, min_length = 10)
validate_sequence(sequence1)
validate_sequence(sequence2)
conserved = []
s1 = sequence1.upcase
s2 = sequence2.upcase
# Find all conserved regions
while true
region = longest_common_substring(s1, s2)
break if region.length < min_length
pos1 = s1.index(region)
pos2 = s2.index(region)
conserved << {
sequence: region,
length: region.length,
position_seq1: pos1,
position_seq2: pos2,
gc_content: gc_content(region)
}
# Mask found region
s1[pos1, region.length] = 'N' * region.length
s2[pos2, region.length] = 'N' * region.length
end
conserved.sort_by { |r| -r[:length] }
end
private
def validate_sequence(seq)
invalid = seq.upcase.chars.reject { |c| BASES.include?(c) }
raise ArgumentError, "Invalid bases: #{invalid.join}" unless invalid.empty?
end
def gc_content(sequence)
gc_count = sequence.count('GC')
(gc_count.to_f / sequence.length * 100).round(2)
end
end
# Usage
analyzer = DNAAnalyzer.new
seq1 = "ATCGATCGATCGTAGCTAGCTA"
seq2 = "GCTAGCTAGCTATCGATCGATC"
regions = analyzer.find_conserved_regions(seq1, seq2, 8)
regions.each do |r|
puts "Conserved: #{r[:sequence]} (GC: #{r[:gc_content]}%)"
end
# => Conserved: ATCGATCG (GC: 50.0%)
The analyzer identifies multiple conserved regions, masks them, and continues searching. GC content calculation provides additional biological context about the conserved regions.
Data Deduplication
Storage systems use longest common substring algorithms to identify duplicate or similar content blocks, enabling efficient compression and deduplication.
class BlockDeduplicator
BLOCK_SIZE = 4096
MIN_DUPLICATE_SIZE = 1024
def initialize(min_duplicate_size = MIN_DUPLICATE_SIZE)
@min_duplicate_size = min_duplicate_size
@blocks = {}
end
def add_file(filename, content)
blocks = content.chars.each_slice(BLOCK_SIZE).map(&:join)
duplicates = []
@blocks.each do |other_file, other_blocks|
other_blocks.each_with_index do |other_block, idx|
blocks.each_with_index do |block, block_idx|
common = longest_common_substring(block, other_block)
if common.length >= @min_duplicate_size
duplicates << {
with_file: other_file,
with_block: idx,
current_block: block_idx,
duplicate_size: common.length,
space_saved: common.length
}
end
end
end
end
@blocks[filename] = blocks
duplicates
end
def total_savings
# Calculate total deduplication savings
all_duplicates = []
@blocks.each do |file, blocks|
@blocks.each do |other_file, other_blocks|
next if file == other_file
blocks.each do |block|
other_blocks.each do |other_block|
common = longest_common_substring(block, other_block)
all_duplicates << common.length if common.length >= @min_duplicate_size
end
end
end
end
all_duplicates.sum / 2 # Divide by 2 to avoid double counting
end
end
# Usage
dedup = BlockDeduplicator.new(512)
content1 = "A" * 2000 + "unique content here" + "B" * 2000
content2 = "C" * 1000 + "A" * 2000 + "different content" + "B" * 2000
dups = dedup.add_file("file1.txt", content1)
dups = dedup.add_file("file2.txt", content2)
dups.each do |d|
puts "Found #{d[:duplicate_size]} byte duplicate with #{d[:with_file]}"
end
This simplified deduplicator finds common blocks between files. Production systems would use content-defined chunking and hash-based lookups for better performance.
Reference
Algorithm Complexity
| Approach | Time Complexity | Space Complexity | Best Use Case |
|---|---|---|---|
| Brute Force | O(n² × m) | O(1) | Very short strings, educational purposes |
| Dynamic Programming | O(n × m) | O(n × m) | General purpose, guaranteed optimal |
| DP Space Optimized | O(n × m) | O(min(n,m)) | Large strings, memory constraints |
| Suffix Tree | O(n + m) construction | O(n + m) | Multiple queries, many strings |
| Suffix Array | O(n log n) construction | O(n) | Space-efficient alternative to trees |
Core Algorithm Components
| Component | Purpose | Implementation Detail |
|---|---|---|
| DP Matrix | Track substring lengths at each position | Size (m+1) × (n+1) with 0-initialized borders |
| Match Detection | Compare characters at positions i and j | Set dp[i][j] = dp[i-1][j-1] + 1 on match |
| Mismatch Handling | Reset length counter on character difference | Set dp[i][j] = 0 when str1[i] != str2[j] |
| Maximum Tracking | Record longest substring found | Update max_length and end_position during iteration |
| Substring Extraction | Retrieve actual substring from result | Extract from end_position - max_length to end_position |
Ruby String Methods
| Method | Description | Use in LCS Context |
|---|---|---|
| String#[] | Access character or substring by index | Extract final result substring |
| String#length | Get string length | Initialize matrix dimensions |
| String#empty? | Check if string is empty | Validate input before processing |
| String#downcase | Convert to lowercase | Case-insensitive comparison |
| String#include? | Check if substring exists | Verify result in multiple string scenarios |
| String#index | Find position of substring | Locate substring for position tracking |
| Array.new | Create new array with initialization | Construct DP matrix and rows |
Common Edge Cases
| Edge Case | Behavior | Handling Strategy |
|---|---|---|
| Empty string input | No common substring possible | Return empty string immediately |
| No common characters | No matches exist | Return empty string after full iteration |
| Identical strings | Entire string is common | Return full string |
| One character match only | Single character substring | Return that character if valid |
| Multiple equal-length matches | Multiple valid answers | Return first found or all matches |
| Case sensitivity | Different cases treated as different | Normalize if case-insensitive needed |
Performance Optimization Techniques
| Technique | Impact | Tradeoff |
|---|---|---|
| Space optimization with two rows | Reduces memory from O(n×m) to O(min(n,m)) | No change to time complexity |
| Early termination | Reduces average-case comparisons | Adds conditional checks per iteration |
| Shorter string as rows | Minimizes memory allocation | Requires string swap logic |
| Rolling hash for preprocessing | Speeds up multiple queries | Higher implementation complexity |
| Suffix tree construction | Reduces query to O(n+m) | High construction cost and memory |
Implementation Patterns
# Basic pattern with full matrix
def lcs_basic(s1, s2)
m, n = s1.length, s2.length
dp = Array.new(m + 1) { Array.new(n + 1, 0) }
max_len, end_pos = 0, 0
1.upto(m) do |i|
1.upto(n) do |j|
if s1[i-1] == s2[j-1]
dp[i][j] = dp[i-1][j-1] + 1
max_len, end_pos = dp[i][j], i if dp[i][j] > max_len
end
end
end
max_len == 0 ? "" : s1[(end_pos - max_len)...end_pos]
end
# Space-optimized pattern
def lcs_optimized(s1, s2)
s1, s2 = s2, s1 if s1.length > s2.length
m, n = s1.length, s2.length
prev, curr = Array.new(n + 1, 0), Array.new(n + 1, 0)
max_len, end_pos = 0, 0
1.upto(m) do |i|
1.upto(n) do |j|
curr[j] = s1[i-1] == s2[j-1] ? prev[j-1] + 1 : 0
max_len, end_pos = curr[j], i if curr[j] > max_len
end
prev, curr = curr, prev
curr.fill(0)
end
max_len == 0 ? "" : s1[(end_pos - max_len)...end_pos]
end
Comparison with Related Problems
| Problem | Contiguity Requirement | Output | Typical Application |
|---|---|---|---|
| Longest Common Substring | Must be contiguous | Single longest substring | File diff, plagiarism detection |
| Longest Common Subsequence | Can be non-contiguous | Sequence of characters | Edit distance, DNA alignment |
| Edit Distance | N/A - measures difference | Numeric distance | Spell checking, similarity metrics |
| String Matching | Exact pattern match | Position of pattern | Search, pattern recognition |
| Approximate String Matching | Allows mismatches | Positions with similarity | Fuzzy search, error-tolerant matching |