CrackedRuby - Longest Common Substring

Overview

The longest common substring problem identifies the longest contiguous sequence of characters that appears in two or more strings. Unlike the longest common subsequence, which allows non-contiguous matches, the longest common substring requires all matching characters to appear consecutively in the same order within each string.

Given two strings "ABABC" and "BABCA", the longest common substring is "BABC" with length 4. The substring must appear without interruption in both strings - individual characters scattered throughout don't qualify unless they form a continuous block.

This problem appears frequently in bioinformatics for DNA sequence analysis, plagiarism detection systems, file diff utilities, and data deduplication tools. The algorithm determines how much continuous overlap exists between datasets, making it valuable for identifying similar content blocks, detecting copy-paste scenarios, or finding repeated patterns in genetic sequences.

The distinction between substring and subsequence is critical. The subsequence "ABABC" could match scattered characters across a string, but the substring must match a contiguous block. This requirement makes the substring problem both more restrictive and, in some ways, more tractable than the subsequence variant.

# Substring: must be contiguous
str1 = "ABABC"
str2 = "BABCA"
# "BABC" appears continuously in both

# Not valid: "ABABC" scattered across str2
# Valid substring must appear as unbroken sequence

Key Principles

The longest common substring algorithm operates on the principle of comparing all possible substrings between two input strings to find the longest match. At its core, the problem requires examining each position in both strings and extending matches as far as possible.

The dynamic programming approach builds a matrix where each cell (i,j) represents the length of the common substring ending at position i in the first string and position j in the second string. If characters match at these positions, the cell value equals the diagonal predecessor plus one. If characters differ, the cell value resets to zero.

# Dynamic programming state transition
# If str1[i] == str2[j]:
#   dp[i][j] = dp[i-1][j-1] + 1
# Else:
#   dp[i][j] = 0

The matrix tracks all common substring lengths, and the maximum value indicates the longest common substring length. The position of this maximum value determines where the substring ends, allowing reconstruction of the actual substring by backtracking.

String indexing matters significantly. Most implementations use 1-based indexing for the DP matrix to simplify boundary conditions, while 0-based indexing for the actual strings follows standard programming conventions. This dual indexing scheme requires careful translation when extracting the final substring.

The algorithm maintains two key pieces of information: the maximum length found and the ending position in one of the strings. Some implementations track both ending positions for verification or when multiple strings are involved.

# Core state tracking
max_length = 0
end_position = 0

# During matrix construction
if current_length > max_length
  max_length = current_length
  end_position = current_index
end

For multiple strings (more than two), the problem extends by finding common substrings across all inputs. This typically involves pairwise comparisons or suffix tree structures for efficiency. The generalized approach identifies substrings present in every input string, not just the maximum pair.

Implementation Approaches

Three primary algorithmic approaches solve the longest common substring problem, each with distinct time and space complexity characteristics. The choice depends on input size, performance requirements, and memory constraints.

Brute Force Approach

The brute force method generates all possible substrings from the first string and checks each against the second string. For a string of length n, this generates O(n²) substrings, and checking each substring against the second string (length m) takes O(m) time, yielding O(n² × m) overall complexity.

This approach requires minimal space - just storage for the current best substring - but becomes impractical for strings longer than a few hundred characters. The algorithm starts with the longest possible substrings and works down, allowing early termination once a match is found at a given length.

for length from n down to 1:
  for each substring of this length in str1:
    if substring exists in str2:
      return substring

The brute force approach works well for very short strings or when the longest common substring is expected to be found quickly among longer substrings. Educational contexts use this method to demonstrate the problem before introducing optimization.

Dynamic Programming Approach

Dynamic programming reduces complexity to O(n × m) time and O(n × m) space by building a matrix that reuses previous computations. Each cell computation requires only the diagonal predecessor, making the algorithm efficient through memoization.

The DP matrix fills row by row, with each cell representing the length of the common substring ending at that position. The maximum value in the entire matrix indicates the answer. This approach guarantees finding the optimal solution in polynomial time.

Space optimization reduces memory usage to O(min(n,m)) by maintaining only two rows at a time - the current row and previous row. Since each cell depends only on the diagonal predecessor from the previous row, older rows can be discarded. This optimization maintains O(n × m) time complexity while dramatically reducing space requirements for large inputs.

current_row = array of size m
previous_row = array of size m

for i in 1..n:
  for j in 1..m:
    if str1[i] == str2[j]:
      current_row[j] = previous_row[j-1] + 1
    else:
      current_row[j] = 0
  swap(current_row, previous_row)

Suffix Tree Approach

Suffix trees provide O(n + m) construction time and O(n + m) space for the generalized suffix tree. After building the tree containing all suffixes from both strings, finding the longest common substring requires finding the deepest internal node that has descendant leaves from both strings.

This approach excels when solving the problem for multiple string pairs or when the same strings are queried repeatedly. The initial suffix tree construction cost amortizes across multiple queries. Suffix arrays offer a space-efficient alternative to suffix trees with similar time complexity.

The suffix tree marks each leaf with its source string identifier. Internal nodes represent common prefixes, and the depth of an internal node indicates the length of that common prefix. The deepest internal node with mixed descendants represents the longest common substring.

For most applications involving two strings and single queries, dynamic programming offers the best balance. Suffix trees become advantageous when processing many strings or performing repeated queries on the same dataset.

Ruby Implementation

Ruby provides string manipulation capabilities that make implementing the longest common substring straightforward. The dynamic programming approach translates cleanly into Ruby with its expressive syntax and built-in array handling.

Basic Dynamic Programming Implementation

def longest_common_substring(str1, str2)
  return "" if str1.empty? || str2.empty?
  
  m = str1.length
  n = str2.length
  
  # Create DP matrix
  dp = Array.new(m + 1) { Array.new(n + 1, 0) }
  
  max_length = 0
  end_position = 0
  
  # Fill DP matrix
  1.upto(m) do |i|
    1.upto(n) do |j|
      if str1[i - 1] == str2[j - 1]
        dp[i][j] = dp[i - 1][j - 1] + 1
        
        if dp[i][j] > max_length
          max_length = dp[i][j]
          end_position = i
        end
      end
    end
  end
  
  # Extract substring
  return "" if max_length == 0
  str1[(end_position - max_length)...end_position]
end

# Usage
longest_common_substring("ABABC", "BABCA")
# => "BABC"

The implementation uses 1-based indexing for the DP matrix but 0-based indexing for string access (str1[i-1]). The matrix size is (m+1) × (n+1) to accommodate the zero-initialization row and column representing empty string prefixes.

Space-Optimized Implementation

def longest_common_substring_optimized(str1, str2)
  return "" if str1.empty? || str2.empty?
  
  # Ensure str1 is the shorter string for memory efficiency
  str1, str2 = str2, str1 if str1.length > str2.length
  
  m = str1.length
  n = str2.length
  
  previous_row = Array.new(n + 1, 0)
  current_row = Array.new(n + 1, 0)
  
  max_length = 0
  end_position = 0
  
  1.upto(m) do |i|
    1.upto(n) do |j|
      if str1[i - 1] == str2[j - 1]
        current_row[j] = previous_row[j - 1] + 1
        
        if current_row[j] > max_length
          max_length = current_row[j]
          end_position = i
        end
      else
        current_row[j] = 0
      end
    end
    
    previous_row, current_row = current_row, previous_row
    current_row.fill(0)
  end
  
  return "" if max_length == 0
  str1[(end_position - max_length)...end_position]
end

# Usage
longest_common_substring_optimized("ABABC", "BABCA")
# => "BABC"

This version uses only O(n) space by maintaining two rows. The row swap after each iteration reuses memory. Ensuring the shorter string is str1 minimizes the array size.

Finding All Common Substrings

def all_common_substrings(str1, str2, min_length = 1)
  return [] if str1.empty? || str2.empty?
  
  m = str1.length
  n = str2.length
  dp = Array.new(m + 1) { Array.new(n + 1, 0) }
  
  substrings = []
  
  1.upto(m) do |i|
    1.upto(n) do |j|
      if str1[i - 1] == str2[j - 1]
        dp[i][j] = dp[i - 1][j - 1] + 1
        
        if dp[i][j] >= min_length
          substring = str1[(i - dp[i][j])...i]
          substrings << substring unless substrings.include?(substring)
        end
      end
    end
  end
  
  substrings.sort_by { |s| -s.length }
end

# Usage
all_common_substrings("ABABC", "BABCA", 2)
# => ["BABC", "ABC", "AB", "BC"]

This implementation collects all common substrings meeting the minimum length requirement. The include? check prevents duplicates, though this adds overhead. For performance-critical applications, use a Set for O(1) duplicate detection.

Case-Insensitive Comparison

def longest_common_substring_ignore_case(str1, str2)
  return "" if str1.empty? || str2.empty?
  
  str1_lower = str1.downcase
  str2_lower = str2.downcase
  
  m = str1.length
  n = str2.length
  dp = Array.new(m + 1) { Array.new(n + 1, 0) }
  
  max_length = 0
  end_position = 0
  
  1.upto(m) do |i|
    1.upto(n) do |j|
      if str1_lower[i - 1] == str2_lower[j - 1]
        dp[i][j] = dp[i - 1][j - 1] + 1
        
        if dp[i][j] > max_length
          max_length = dp[i][j]
          end_position = i
        end
      end
    end
  end
  
  return "" if max_length == 0
  # Return original case from str1
  str1[(end_position - max_length)...end_position]
end

# Usage
longest_common_substring_ignore_case("AbaBc", "BABCA")
# => "aBC" (preserves original case from first string)

This version compares lowercase versions but returns the substring in its original case from the first string. For applications requiring normalized output, return from the lowercase version instead.

Multiple String Support

def longest_common_substring_multiple(strings)
  return "" if strings.empty? || strings.any?(&:empty?)
  
  # Start with first two strings
  result = longest_common_substring(strings[0], strings[1])
  return "" if result.empty?
  
  # Check if result appears in remaining strings
  strings[2..-1].each do |str|
    # Find where result appears in current string
    found = false
    result.length.downto(1) do |len|
      0.upto(result.length - len) do |start|
        substring = result[start, len]
        if str.include?(substring)
          result = substring
          found = true
          break
        end
      end
      break if found
    end
    
    return "" if result.empty?
  end
  
  result
end

# Usage
longest_common_substring_multiple(["ABABC", "BABCA", "ABCAB"])
# => "AB"

This approach finds the longest common substring between the first two strings, then verifies it appears in remaining strings, reducing its length as needed. For better performance with many strings, use suffix tree approaches.

Performance Considerations

The dynamic programming solution operates in O(n × m) time complexity where n and m represent the lengths of the two input strings. Each cell in the matrix requires constant time to compute, and the matrix contains n × m cells. Space complexity matches at O(n × m) for the full matrix implementation.

The space-optimized version reduces memory usage to O(min(n,m)) by maintaining only two rows, which matters significantly for large strings. A string pair of 10,000 characters each requires 100 million cells in the full matrix (400MB for 4-byte integers) versus 10,000 cells (40KB) in the optimized version.

Character comparison dominates the runtime. Ruby's string indexing operation runs in O(1) time, but the sheer number of comparisons (n × m) creates the bottleneck. For strings of 1,000 characters each, the algorithm performs one million character comparisons.

require 'benchmark'

str1 = "A" * 1000 + "MATCH" + "B" * 1000
str2 = "C" * 1000 + "MATCH" + "D" * 1000

Benchmark.bm do |x|
  x.report("full matrix:") do
    longest_common_substring(str1, str2)
  end
  
  x.report("optimized:") do
    longest_common_substring_optimized(str1, str2)
  end
end

# Results show similar time, vastly different memory usage

The time complexity remains unchanged between implementations - both perform the same n × m comparisons. Memory usage differs dramatically, making the optimized version essential for large inputs or memory-constrained environments.

For repeated queries on the same strings, preprocessing with suffix trees reduces query time to O(n + m) for tree construction and O(1) for finding the deepest internal node with mixed descendants. The tradeoff involves higher implementation complexity and memory overhead for the tree structure.

Early termination strategies help when the maximum possible remaining substring length cannot exceed the current best. If at position i in a string of length n, the maximum remaining length is (n - i). When this value falls below the current best length, remaining iterations cannot improve the result.

def longest_common_substring_early_termination(str1, str2)
  m = str1.length
  n = str2.length
  max_length = 0
  end_position = 0
  
  dp = Array.new(m + 1) { Array.new(n + 1, 0) }
  
  1.upto(m) do |i|
    # Early termination check
    break if (m - i + 1) <= max_length
    
    1.upto(n) do |j|
      next if (n - j + 1) <= max_length
      
      if str1[i - 1] == str2[j - 1]
        dp[i][j] = dp[i - 1][j - 1] + 1
        
        if dp[i][j] > max_length
          max_length = dp[i][j]
          end_position = i
        end
      end
    end
  end
  
  return "" if max_length == 0
  str1[(end_position - max_length)...end_position]
end

This optimization helps when strings have minimal overlap. Worst-case complexity remains O(n × m), but average-case performance improves for dissimilar strings.

String encoding affects performance. ASCII strings allow single-byte character access, while UTF-8 strings may require multi-byte character decoding. Ruby handles this transparently, but processing UTF-8 strings with multi-byte characters runs slower than ASCII equivalents.

Cache locality impacts performance on modern processors. The row-by-row access pattern in the DP solution exhibits good spatial locality, keeping recently accessed memory in cache. The two-row optimization further improves cache performance by reducing the working set size.

Practical Examples

File Comparison and Diff Tools

Version control systems and file diff utilities use longest common substring algorithms to identify unchanged blocks of text between file versions. This enables efficient delta encoding and meaningful diff displays.

def file_diff_blocks(old_content, new_content)
  lines_old = old_content.split("\n")
  lines_new = new_content.split("\n")
  
  # Treat each line as a character for LCS purposes
  old_str = lines_old.join("\x00")
  new_str = lines_new.join("\x00")
  
  common = longest_common_substring(old_str, new_str)
  return [] if common.empty?
  
  # Convert back to lines
  common.split("\x00")
end

# Usage
old_file = "line1\nline2\nline3\nline4"
new_file = "line1\nline2\nmodified\nline4"

unchanged = file_diff_blocks(old_file, new_file)
# => ["line1", "line2"]
# Shows largest unchanged block

This approach identifies the largest contiguous block of unchanged lines. Real diff tools use more sophisticated algorithms (Myers diff), but longest common substring provides the foundation for understanding content similarity.

Plagiarism Detection

Academic integrity tools detect copied content by finding long common substrings between submitted documents. Matches exceeding a threshold length suggest potential plagiarism.

class PlagiarismDetector
  MIN_MATCH_LENGTH = 50
  
  def initialize(min_match_length = MIN_MATCH_LENGTH)
    @min_match_length = min_match_length
  end
  
  def detect(document1, document2)
    # Normalize: remove whitespace, lowercase
    norm1 = normalize(document1)
    norm2 = normalize(document2)
    
    matches = []
    
    # Find all common substrings above threshold
    loop do
      match = longest_common_substring(norm1, norm2)
      break if match.length < @min_match_length
      
      matches << {
        text: match,
        length: match.length,
        position1: norm1.index(match),
        position2: norm2.index(match)
      }
      
      # Remove match to find next longest
      norm1 = norm1.sub(match, ' ' * match.length)
      norm2 = norm2.sub(match, ' ' * match.length)
    end
    
    matches
  end
  
  private
  
  def normalize(text)
    text.downcase.gsub(/\s+/, ' ').strip
  end
end

# Usage
detector = PlagiarismDetector.new(30)
doc1 = "The quick brown fox jumps over the lazy dog. This is original content."
doc2 = "Some new text. The quick brown fox jumps over the lazy dog. More new text."

matches = detector.detect(doc1, doc2)
matches.each do |m|
  puts "Match (#{m[:length]} chars): #{m[:text]}"
end
# => Match (43 chars): the quick brown fox jumps over the lazy dog

This detector finds multiple common substrings by iteratively removing found matches. Production systems would use more sophisticated approaches like rolling hashes or suffix arrays for efficiency.

DNA Sequence Analysis

Bioinformatics applications identify conserved regions between DNA sequences using longest common substring matching. Shared sequences indicate evolutionary relationships or functional importance.

class DNAAnalyzer
  BASES = %w[A T G C]
  
  def find_conserved_regions(sequence1, sequence2, min_length = 10)
    validate_sequence(sequence1)
    validate_sequence(sequence2)
    
    conserved = []
    s1 = sequence1.upcase
    s2 = sequence2.upcase
    
    # Find all conserved regions
    while true
      region = longest_common_substring(s1, s2)
      break if region.length < min_length
      
      pos1 = s1.index(region)
      pos2 = s2.index(region)
      
      conserved << {
        sequence: region,
        length: region.length,
        position_seq1: pos1,
        position_seq2: pos2,
        gc_content: gc_content(region)
      }
      
      # Mask found region
      s1[pos1, region.length] = 'N' * region.length
      s2[pos2, region.length] = 'N' * region.length
    end
    
    conserved.sort_by { |r| -r[:length] }
  end
  
  private
  
  def validate_sequence(seq)
    invalid = seq.upcase.chars.reject { |c| BASES.include?(c) }
    raise ArgumentError, "Invalid bases: #{invalid.join}" unless invalid.empty?
  end
  
  def gc_content(sequence)
    gc_count = sequence.count('GC')
    (gc_count.to_f / sequence.length * 100).round(2)
  end
end

# Usage
analyzer = DNAAnalyzer.new
seq1 = "ATCGATCGATCGTAGCTAGCTA"
seq2 = "GCTAGCTAGCTATCGATCGATC"

regions = analyzer.find_conserved_regions(seq1, seq2, 8)
regions.each do |r|
  puts "Conserved: #{r[:sequence]} (GC: #{r[:gc_content]}%)"
end
# => Conserved: ATCGATCG (GC: 50.0%)

The analyzer identifies multiple conserved regions, masks them, and continues searching. GC content calculation provides additional biological context about the conserved regions.

Data Deduplication

Storage systems use longest common substring algorithms to identify duplicate or similar content blocks, enabling efficient compression and deduplication.

class BlockDeduplicator
  BLOCK_SIZE = 4096
  MIN_DUPLICATE_SIZE = 1024
  
  def initialize(min_duplicate_size = MIN_DUPLICATE_SIZE)
    @min_duplicate_size = min_duplicate_size
    @blocks = {}
  end
  
  def add_file(filename, content)
    blocks = content.chars.each_slice(BLOCK_SIZE).map(&:join)
    
    duplicates = []
    
    @blocks.each do |other_file, other_blocks|
      other_blocks.each_with_index do |other_block, idx|
        blocks.each_with_index do |block, block_idx|
          common = longest_common_substring(block, other_block)
          
          if common.length >= @min_duplicate_size
            duplicates << {
              with_file: other_file,
              with_block: idx,
              current_block: block_idx,
              duplicate_size: common.length,
              space_saved: common.length
            }
          end
        end
      end
    end
    
    @blocks[filename] = blocks
    duplicates
  end
  
  def total_savings
    # Calculate total deduplication savings
    all_duplicates = []
    
    @blocks.each do |file, blocks|
      @blocks.each do |other_file, other_blocks|
        next if file == other_file
        
        blocks.each do |block|
          other_blocks.each do |other_block|
            common = longest_common_substring(block, other_block)
            all_duplicates << common.length if common.length >= @min_duplicate_size
          end
        end
      end
    end
    
    all_duplicates.sum / 2 # Divide by 2 to avoid double counting
  end
end

# Usage
dedup = BlockDeduplicator.new(512)
content1 = "A" * 2000 + "unique content here" + "B" * 2000
content2 = "C" * 1000 + "A" * 2000 + "different content" + "B" * 2000

dups = dedup.add_file("file1.txt", content1)
dups = dedup.add_file("file2.txt", content2)

dups.each do |d|
  puts "Found #{d[:duplicate_size]} byte duplicate with #{d[:with_file]}"
end

This simplified deduplicator finds common blocks between files. Production systems would use content-defined chunking and hash-based lookups for better performance.

Reference

Algorithm Complexity

Approach	Time Complexity	Space Complexity	Best Use Case
Brute Force	O(n² × m)	O(1)	Very short strings, educational purposes
Dynamic Programming	O(n × m)	O(n × m)	General purpose, guaranteed optimal
DP Space Optimized	O(n × m)	O(min(n,m))	Large strings, memory constraints
Suffix Tree	O(n + m) construction	O(n + m)	Multiple queries, many strings
Suffix Array	O(n log n) construction	O(n)	Space-efficient alternative to trees

Core Algorithm Components

Component	Purpose	Implementation Detail
DP Matrix	Track substring lengths at each position	Size (m+1) × (n+1) with 0-initialized borders
Match Detection	Compare characters at positions i and j	Set dp[i][j] = dp[i-1][j-1] + 1 on match
Mismatch Handling	Reset length counter on character difference	Set dp[i][j] = 0 when str1[i] != str2[j]
Maximum Tracking	Record longest substring found	Update max_length and end_position during iteration
Substring Extraction	Retrieve actual substring from result	Extract from end_position - max_length to end_position

Ruby String Methods

Method	Description	Use in LCS Context
String#[]	Access character or substring by index	Extract final result substring
String#length	Get string length	Initialize matrix dimensions
String#empty?	Check if string is empty	Validate input before processing
String#downcase	Convert to lowercase	Case-insensitive comparison
String#include?	Check if substring exists	Verify result in multiple string scenarios
String#index	Find position of substring	Locate substring for position tracking
Array.new	Create new array with initialization	Construct DP matrix and rows

Common Edge Cases

Edge Case	Behavior	Handling Strategy
Empty string input	No common substring possible	Return empty string immediately
No common characters	No matches exist	Return empty string after full iteration
Identical strings	Entire string is common	Return full string
One character match only	Single character substring	Return that character if valid
Multiple equal-length matches	Multiple valid answers	Return first found or all matches
Case sensitivity	Different cases treated as different	Normalize if case-insensitive needed

Performance Optimization Techniques

Technique	Impact	Tradeoff
Space optimization with two rows	Reduces memory from O(n×m) to O(min(n,m))	No change to time complexity
Early termination	Reduces average-case comparisons	Adds conditional checks per iteration
Shorter string as rows	Minimizes memory allocation	Requires string swap logic
Rolling hash for preprocessing	Speeds up multiple queries	Higher implementation complexity
Suffix tree construction	Reduces query to O(n+m)	High construction cost and memory

Implementation Patterns

# Basic pattern with full matrix
def lcs_basic(s1, s2)
  m, n = s1.length, s2.length
  dp = Array.new(m + 1) { Array.new(n + 1, 0) }
  max_len, end_pos = 0, 0
  
  1.upto(m) do |i|
    1.upto(n) do |j|
      if s1[i-1] == s2[j-1]
        dp[i][j] = dp[i-1][j-1] + 1
        max_len, end_pos = dp[i][j], i if dp[i][j] > max_len
      end
    end
  end
  
  max_len == 0 ? "" : s1[(end_pos - max_len)...end_pos]
end

# Space-optimized pattern
def lcs_optimized(s1, s2)
  s1, s2 = s2, s1 if s1.length > s2.length
  m, n = s1.length, s2.length
  prev, curr = Array.new(n + 1, 0), Array.new(n + 1, 0)
  max_len, end_pos = 0, 0
  
  1.upto(m) do |i|
    1.upto(n) do |j|
      curr[j] = s1[i-1] == s2[j-1] ? prev[j-1] + 1 : 0
      max_len, end_pos = curr[j], i if curr[j] > max_len
    end
    prev, curr = curr, prev
    curr.fill(0)
  end
  
  max_len == 0 ? "" : s1[(end_pos - max_len)...end_pos]
end

Comparison with Related Problems

Problem	Contiguity Requirement	Output	Typical Application
Longest Common Substring	Must be contiguous	Single longest substring	File diff, plagiarism detection
Longest Common Subsequence	Can be non-contiguous	Sequence of characters	Edit distance, DNA alignment
Edit Distance	N/A - measures difference	Numeric distance	Spell checking, similarity metrics
String Matching	Exact pattern match	Position of pattern	Search, pattern recognition
Approximate String Matching	Allows mismatches	Positions with similarity	Fuzzy search, error-tolerant matching

Longest Common Substring