CrackedRuby - Edit Distance

Overview

Edit distance quantifies the similarity between two strings by calculating the minimum number of single-character operations needed to transform one string into another. The Levenshtein distance, the most common variant, permits three operations: insertion of a character, deletion of a character, and substitution of one character for another. Each operation carries a cost of one.

The algorithm originated from work by Vladimir Levenshtein in 1965 for error correction in binary codes. String comparison problems appear throughout software development: spell checkers suggest corrections based on edit distance, version control systems use it to generate diffs, search engines implement fuzzy matching for typo tolerance, and bioinformatics applications align DNA sequences.

Edit distance provides a numeric measure of string similarity. A distance of zero indicates identical strings, while larger values indicate greater dissimilarity. The maximum possible distance equals the length of the longer string, occurring when strings share no common characters.

# Transforming "kitten" to "sitting" requires 3 operations
# 1. Substitute 'k' with 's': "sitten"
# 2. Substitute 'e' with 'i': "sittin"  
# 3. Insert 'g': "sitting"
# Edit distance = 3

The algorithm applies to any sequence comparison problem where elements can be compared for equality. While commonly used for strings, the same technique works for arrays, lists, or any ordered collection of comparable elements.

Key Principles

The edit distance algorithm builds an optimal solution from optimal solutions to smaller subproblems, exhibiting the characteristics of dynamic programming. The problem has optimal substructure: the edit distance between two strings depends on the edit distances between their prefixes.

For strings S and T, the edit distance calculation considers three cases at each position:

The characters at the current positions match. The edit distance equals the edit distance of the remaining substrings without those characters. No additional operation is required.

The characters differ, requiring one of three operations:

Insertion: Add a character from T to S. The cost equals one plus the edit distance between S and T without its first character.

Deletion: Remove a character from S. The cost equals one plus the edit distance between S without its first character and T.

Substitution: Replace a character in S with the corresponding character from T. The cost equals one plus the edit distance between both strings without their first characters.

The algorithm selects the operation with minimum cost. This recursive definition provides the foundation for both recursive and iterative implementations.

The dynamic programming approach constructs a matrix where entry (i,j) represents the edit distance between the first i characters of S and the first j characters of T. The matrix has dimensions (m+1) × (n+1) where m and n are the string lengths.

Base cases initialize the first row and column. Transforming an empty string to a string of length k requires k insertions. Converting a string of length k to an empty string requires k deletions.

# Matrix for edit distance between "CAT" and "DOG"
#     ""  D  O  G
# ""   0  1  2  3
# C    1  1  2  3
# A    2  2  2  3
# T    3  3  3  3

Each cell calculation depends only on three adjacent cells: the cell above, the cell to the left, and the diagonal cell. This dependency structure permits efficient computation in a specific order.

The recurrence relation defines:

D[i,j] = D[i-1,j-1]                           if S[i] = T[j]
D[i,j] = 1 + min(D[i-1,j],                    (deletion)
                 D[i,j-1],                    (insertion)
                 D[i-1,j-1])                  (substitution)
         if S[i] ≠ T[j]

The final answer appears in the bottom-right cell of the matrix, representing the edit distance between the complete strings.

Implementation Approaches

The recursive approach directly implements the mathematical definition but suffers from exponential time complexity due to repeated subproblem calculations. Each recursive call spawns up to three additional calls, creating a call tree with overlapping subproblems.

def edit_distance_recursive(s, t, i = s.length - 1, j = t.length - 1)
  return j + 1 if i < 0
  return i + 1 if j < 0
  
  return edit_distance_recursive(s, t, i - 1, j - 1) if s[i] == t[j]
  
  [
    edit_distance_recursive(s, t, i - 1, j),     # delete
    edit_distance_recursive(s, t, i, j - 1),     # insert
    edit_distance_recursive(s, t, i - 1, j - 1)  # substitute
  ].min + 1
end

Memoization optimizes the recursive approach by caching subproblem results in a hash or array. When a subproblem recurs, the cached result returns immediately instead of recalculating. This reduces time complexity from exponential to polynomial while maintaining the recursive structure.

def edit_distance_memo(s, t, memo = {})
  key = [s, t]
  return memo[key] if memo.key?(key)
  
  return memo[key] = t.length if s.empty?
  return memo[key] = s.length if t.empty?
  
  if s[0] == t[0]
    memo[key] = edit_distance_memo(s[1..-1], t[1..-1], memo)
  else
    memo[key] = 1 + [
      edit_distance_memo(s[1..-1], t, memo),
      edit_distance_memo(s, t[1..-1], memo),
      edit_distance_memo(s[1..-1], t[1..-1], memo)
    ].min
  end
  
  memo[key]
end

The bottom-up dynamic programming approach builds the solution iteratively by filling the matrix row by row or column by column. This eliminates recursion overhead and provides predictable memory access patterns, improving cache performance.

Space-optimized implementations observe that computing row i requires only row i-1. Instead of maintaining the entire matrix, two arrays suffice: one for the current row and one for the previous row. This reduces space complexity from O(mn) to O(min(m,n)) by processing the shorter string as columns.

def edit_distance_space_optimized(s, t)
  # Ensure s is the shorter string
  s, t = t, s if s.length > t.length
  
  previous = (0..s.length).to_a
  
  t.length.times do |j|
    current = [j + 1]
    s.length.times do |i|
      cost = s[i] == t[j] ? 0 : 1
      current << [
        current[i] + 1,      # insertion
        previous[i + 1] + 1, # deletion
        previous[i] + cost   # substitution
      ].min
    end
    previous = current
  end
  
  previous.last
end

Diagonal iteration processes the matrix along diagonals rather than rows or columns. This approach can parallelize computation of cells on the same diagonal, as they have no dependencies on each other.

Ruby Implementation

Ruby's string handling and array operations make implementing edit distance straightforward. The standard dynamic programming approach creates a two-dimensional matrix using nested arrays.

def edit_distance(source, target)
  m, n = source.length, target.length
  matrix = Array.new(m + 1) { Array.new(n + 1) }
  
  # Initialize base cases
  (0..m).each { |i| matrix[i][0] = i }
  (0..n).each { |j| matrix[0][j] = j }
  
  # Fill matrix
  (1..m).each do |i|
    (1..n).each do |j|
      if source[i - 1] == target[j - 1]
        matrix[i][j] = matrix[i - 1][j - 1]
      else
        matrix[i][j] = [
          matrix[i - 1][j],      # deletion
          matrix[i][j - 1],      # insertion
          matrix[i - 1][j - 1]   # substitution
        ].min + 1
      end
    end
  end
  
  matrix[m][n]
end

edit_distance("kitten", "sitting")
# => 3

The implementation extends to return the actual edit sequence, not just the distance. Backtracking through the matrix reconstructs the operations by examining which cell contributed to each minimum.

def edit_distance_with_path(source, target)
  m, n = source.length, target.length
  matrix = Array.new(m + 1) { Array.new(n + 1) }
  
  (0..m).each { |i| matrix[i][0] = i }
  (0..n).each { |j| matrix[0][j] = j }
  
  (1..m).each do |i|
    (1..n).each do |j|
      if source[i - 1] == target[j - 1]
        matrix[i][j] = matrix[i - 1][j - 1]
      else
        matrix[i][j] = [
          matrix[i - 1][j],
          matrix[i][j - 1],
          matrix[i - 1][j - 1]
        ].min + 1
      end
    end
  end
  
  # Backtrack to find operations
  operations = []
  i, j = m, n
  
  while i > 0 || j > 0
    if i == 0
      operations << [:insert, target[j - 1]]
      j -= 1
    elsif j == 0
      operations << [:delete, source[i - 1]]
      i -= 1
    elsif source[i - 1] == target[j - 1]
      i -= 1
      j -= 1
    else
      deletion = matrix[i - 1][j]
      insertion = matrix[i][j - 1]
      substitution = matrix[i - 1][j - 1]
      
      if substitution <= deletion && substitution <= insertion
        operations << [:substitute, source[i - 1], target[j - 1]]
        i -= 1
        j -= 1
      elsif deletion <= insertion
        operations << [:delete, source[i - 1]]
        i -= 1
      else
        operations << [:insert, target[j - 1]]
        j -= 1
      end
    end
  end
  
  { distance: matrix[m][n], operations: operations.reverse }
end

result = edit_distance_with_path("kitten", "sitting")
# => {
#   distance: 3,
#   operations: [
#     [:substitute, "k", "s"],
#     [:substitute, "e", "i"],
#     [:insert, "g"]
#   ]
# }

Ruby classes encapsulate edit distance functionality with configurable costs for different operations. Some applications assign different weights to insertions, deletions, and substitutions.

class EditDistance
  attr_reader :insert_cost, :delete_cost, :substitute_cost
  
  def initialize(insert: 1, delete: 1, substitute: 1)
    @insert_cost = insert
    @delete_cost = delete
    @substitute_cost = substitute
  end
  
  def calculate(source, target)
    m, n = source.length, target.length
    matrix = Array.new(m + 1) { Array.new(n + 1) }
    
    (0..m).each { |i| matrix[i][0] = i * delete_cost }
    (0..n).each { |j| matrix[0][j] = j * insert_cost }
    
    (1..m).each do |i|
      (1..n).each do |j|
        if source[i - 1] == target[j - 1]
          matrix[i][j] = matrix[i - 1][j - 1]
        else
          matrix[i][j] = [
            matrix[i - 1][j] + delete_cost,
            matrix[i][j - 1] + insert_cost,
            matrix[i - 1][j - 1] + substitute_cost
          ].min
        end
      end
    end
    
    matrix[m][n]
  end
  
  def similarity_ratio(source, target)
    distance = calculate(source, target)
    max_length = [source.length, target.length].max
    return 1.0 if max_length.zero?
    
    1.0 - (distance.to_f / max_length)
  end
end

calculator = EditDistance.new(substitute: 2)
calculator.calculate("cat", "cut")
# => 2 (substitution costs 2)

calculator.similarity_ratio("kitten", "sitting")
# => 0.5714285714285714

The Text gem provides edit distance functionality for Ruby applications without implementing the algorithm manually. The gem includes optimized implementations and additional string similarity metrics.

require 'text'

Text::Levenshtein.distance("kitten", "sitting")
# => 3

# Damerau-Levenshtein includes transpositions
Text::Damerau.distance("abcd", "acbd")
# => 1 (transpose 'b' and 'c')

Practical Examples

Spell checking applications find the closest dictionary words to a misspelled input. The spell checker calculates edit distance between the input and each dictionary word, suggesting words with the smallest distances.

class SpellChecker
  def initialize(dictionary)
    @dictionary = dictionary
  end
  
  def suggest(word, max_distance: 2, max_suggestions: 5)
    suggestions = @dictionary.map do |dict_word|
      distance = edit_distance(word, dict_word)
      { word: dict_word, distance: distance } if distance <= max_distance
    end.compact
    
    suggestions.sort_by { |s| s[:distance] }
               .first(max_suggestions)
               .map { |s| s[:word] }
  end
  
  private
  
  def edit_distance(s, t)
    m, n = s.length, t.length
    prev = (0..n).to_a
    
    m.times do |i|
      curr = [i + 1]
      n.times do |j|
        cost = s[i] == t[j] ? 0 : 1
        curr << [curr[j] + 1, prev[j + 1] + 1, prev[j] + cost].min
      end
      prev = curr
    end
    
    prev[n]
  end
end

dictionary = %w[cat hat mat sat bat chat]
checker = SpellChecker.new(dictionary)
checker.suggest("cot", max_distance: 1)
# => ["cat", "hat", "mat", "sat", "bat"]

Fuzzy string matching in search systems tolerates user typos by finding records with similar strings. A product search matches queries against product names even when users misspell terms.

class FuzzySearcher
  def initialize(items)
    @items = items
  end
  
  def search(query, threshold: 0.7)
    results = @items.map do |item|
      distance = edit_distance(query.downcase, item.downcase)
      max_length = [query.length, item.length].max
      similarity = 1.0 - (distance.to_f / max_length)
      
      { item: item, similarity: similarity } if similarity >= threshold
    end.compact
    
    results.sort_by { |r| -r[:similarity] }
  end
  
  private
  
  def edit_distance(s, t)
    return t.length if s.empty?
    return s.length if t.empty?
    
    prev = (0..t.length).to_a
    
    s.length.times do |i|
      curr = [i + 1]
      t.length.times do |j|
        cost = s[i] == t[j] ? 0 : 1
        curr << [curr.last + 1, prev[j + 1] + 1, prev[j] + cost].min
      end
      prev = curr
    end
    
    prev.last
  end
end

products = ["iPhone 13", "Samsung Galaxy", "Google Pixel", "OnePlus 9"]
searcher = FuzzySearcher.new(products)
searcher.search("iPhne")
# => [{ item: "iPhone 13", similarity: 0.875 }]

Version control diff algorithms use edit distance concepts to identify changes between file versions. The algorithm finds the minimum set of line insertions and deletions transforming one file version into another.

class LineDiffer
  def diff(original, modified)
    orig_lines = original.split("\n")
    mod_lines = modified.split("\n")
    
    m, n = orig_lines.length, mod_lines.length
    matrix = Array.new(m + 1) { Array.new(n + 1) }
    
    (0..m).each { |i| matrix[i][0] = i }
    (0..n).each { |j| matrix[0][j] = j }
    
    (1..m).each do |i|
      (1..n).each do |j|
        if orig_lines[i - 1] == mod_lines[j - 1]
          matrix[i][j] = matrix[i - 1][j - 1]
        else
          matrix[i][j] = [
            matrix[i - 1][j] + 1,
            matrix[i][j - 1] + 1,
            matrix[i - 1][j - 1] + 1
          ].min
        end
      end
    end
    
    # Backtrack to generate diff
    diff = []
    i, j = m, n
    
    while i > 0 || j > 0
      if i > 0 && j > 0 && orig_lines[i - 1] == mod_lines[j - 1]
        diff.unshift({ type: :unchanged, line: orig_lines[i - 1] })
        i -= 1
        j -= 1
      elsif j > 0 && (i == 0 || matrix[i][j - 1] < matrix[i - 1][j])
        diff.unshift({ type: :added, line: mod_lines[j - 1] })
        j -= 1
      else
        diff.unshift({ type: :deleted, line: orig_lines[i - 1] })
        i -= 1
      end
    end
    
    diff
  end
  
  def format(diff)
    diff.map do |change|
      case change[:type]
      when :unchanged then "  #{change[:line]}"
      when :added then "+ #{change[:line]}"
      when :deleted then "- #{change[:line]}"
      end
    end.join("\n")
  end
end

original = "line 1\nline 2\nline 3"
modified = "line 1\nline 2 modified\nline 3\nline 4"

differ = LineDiffer.new
diff = differ.diff(original, modified)
puts differ.format(diff)
# Output:
#   line 1
# - line 2
# + line 2 modified
#   line 3
# + line 4

DNA sequence alignment in bioinformatics calculates edit distance between genetic sequences to identify mutations, insertions, and deletions. The algorithm helps identify evolutionary relationships between organisms.

def sequence_alignment(seq1, seq2)
  m, n = seq1.length, seq2.length
  matrix = Array.new(m + 1) { Array.new(n + 1) }
  
  # Gap penalties
  gap_penalty = 1
  
  (0..m).each { |i| matrix[i][0] = i * gap_penalty }
  (0..n).each { |j| matrix[0][j] = j * gap_penalty }
  
  (1..m).each do |i|
    (1..n).each do |j|
      match_cost = seq1[i - 1] == seq2[j - 1] ? 0 : 1
      
      matrix[i][j] = [
        matrix[i - 1][j] + gap_penalty,      # deletion
        matrix[i][j - 1] + gap_penalty,      # insertion
        matrix[i - 1][j - 1] + match_cost    # match/mismatch
      ].min
    end
  end
  
  matrix[m][n]
end

dna1 = "ACGTACGT"
dna2 = "ACCTACCT"
sequence_alignment(dna1, dna2)
# => 2

Common Patterns

The standard Levenshtein distance treats all operations equally with cost one. Many applications require weighted operations where insertions, deletions, and substitutions have different costs based on domain requirements.

def weighted_edit_distance(s, t, costs = {})
  insert_cost = costs[:insert] || 1
  delete_cost = costs[:delete] || 1
  substitute_cost = costs[:substitute] || 1
  
  m, n = s.length, t.length
  prev = (0..n).map { |j| j * insert_cost }
  
  m.times do |i|
    curr = [(i + 1) * delete_cost]
    n.times do |j|
      if s[i] == t[j]
        curr << prev[j]
      else
        curr << [
          curr.last + insert_cost,
          prev[j + 1] + delete_cost,
          prev[j] + substitute_cost
        ].min
      end
    end
    prev = curr
  end
  
  prev.last
end

# Keyboard-aware costs (keys closer together = lower cost)
weighted_edit_distance("cat", "car", substitute: 0.5)
# => 0.5

The Damerau-Levenshtein distance extends the algorithm to include transpositions as a fourth operation type. Transposing adjacent characters counts as a single operation, reflecting common typing errors.

def damerau_levenshtein(s, t)
  m, n = s.length, t.length
  max_dist = m + n
  h = Hash.new { |hash, key| hash[key] = 0 }
  matrix = Array.new(m + 2) { Array.new(n + 2, 0) }
  
  matrix[0][0] = max_dist
  (0..m).each { |i| matrix[i + 1][0] = max_dist; matrix[i + 1][1] = i }
  (0..n).each { |j| matrix[0][j + 1] = max_dist; matrix[1][j + 1] = j }
  
  (1..m).each do |i|
    db = 0
    (1..n).each do |j|
      k = h[t[j - 1]]
      l = db
      cost = 1
      
      if s[i - 1] == t[j - 1]
        cost = 0
        db = j
      end
      
      matrix[i + 1][j + 1] = [
        matrix[i][j] + cost,                           # substitution
        matrix[i + 1][j] + 1,                         # insertion
        matrix[i][j + 1] + 1,                         # deletion
        matrix[k][l] + (i - k - 1) + 1 + (j - l - 1)  # transposition
      ].min
    end
    h[s[i - 1]] = i
  end
  
  matrix[m + 1][n + 1]
end

damerau_levenshtein("abcd", "acbd")
# => 1 (transpose 'b' and 'c')

edit_distance("abcd", "acbd")
# => 2 (would require delete + insert without transposition)

Bounded edit distance calculations terminate early when the distance exceeds a threshold. This optimization improves performance when only small distances matter, as in spell checking with maximum suggestion distance.

def bounded_edit_distance(s, t, max_distance)
  m, n = s.length, t.length
  
  # Quick rejection tests
  return max_distance + 1 if (m - n).abs > max_distance
  
  prev = (0..n).to_a
  
  m.times do |i|
    curr = [i + 1]
    min_in_row = curr[0]
    
    n.times do |j|
      cost = s[i] == t[j] ? 0 : 1
      val = [curr.last + 1, prev[j + 1] + 1, prev[j] + cost].min
      curr << val
      min_in_row = val if val < min_in_row
    end
    
    # Early termination if minimum in row exceeds threshold
    return max_distance + 1 if min_in_row > max_distance
    
    prev = curr
  end
  
  prev.last <= max_distance ? prev.last : max_distance + 1
end

bounded_edit_distance("intention", "execution", 3)
# => 4 (exceeds threshold, returns 4)

Similarity scoring normalizes edit distance to a percentage or ratio, making comparisons across different string lengths meaningful. Applications use similarity scores for ranking search results or matching records.

def similarity_score(s, t)
  return 1.0 if s == t
  return 0.0 if s.empty? && t.empty?
  
  distance = edit_distance(s, t)
  max_length = [s.length, t.length].max
  
  1.0 - (distance.to_f / max_length)
end

similarity_score("kitten", "sitting")
# => 0.5714285714285714

similarity_score("cat", "cat")
# => 1.0

similarity_score("abc", "xyz")
# => 0.0

Performance Considerations

The standard dynamic programming algorithm has O(mn) time complexity and O(mn) space complexity for strings of length m and n. Each matrix cell requires constant time to compute, and the algorithm computes (m+1)(n+1) cells.

Space optimization reduces memory usage to O(min(m,n)) by maintaining only two rows of the matrix. The current row depends only on the previous row, eliminating the need to store the entire matrix.

def space_efficient_edit_distance(s, t)
  # Process shorter string as columns
  s, t = t, s if s.length > t.length
  
  m, n = s.length, t.length
  prev = (0..m).to_a
  
  n.times do |j|
    curr = [j + 1]
    m.times do |i|
      cost = s[i] == t[j] ? 0 : 1
      curr << [
        curr.last + 1,
        prev[i + 1] + 1,
        prev[i] + cost
      ].min
    end
    prev = curr
  end
  
  prev.last
end

Early termination optimization stops computation when the remaining cells cannot possibly produce a result below a threshold. For bounded distance calculations, this prunes large portions of the matrix.

Character frequency preprocessing quickly rejects string pairs that cannot have small edit distances. If the character frequency difference between strings exceeds the maximum allowed distance, the strings must exceed that distance.

def quick_reject?(s, t, max_distance)
  char_diff = Hash.new(0)
  
  s.each_char { |c| char_diff[c] += 1 }
  t.each_char { |c| char_diff[c] -= 1 }
  
  char_diff.values.sum { |v| v.abs } / 2 > max_distance
end

def optimized_bounded_distance(s, t, max_distance)
  return max_distance + 1 if quick_reject?(s, t, max_distance)
  
  bounded_edit_distance(s, t, max_distance)
end

Diagonal computation reduces the matrix to only the diagonals within max_distance of the main diagonal. For small max_distance values relative to string lengths, this significantly reduces computation.

def diagonal_edit_distance(s, t, max_distance)
  m, n = s.length, t.length
  return max_distance + 1 if (m - n).abs > max_distance
  
  # Only compute diagonals within max_distance band
  prev = {}
  (-max_distance..max_distance).each { |k| prev[k] = k.abs }
  
  (1..m).each do |i|
    curr = {}
    start_k = [i - max_distance, -max_distance].max
    end_k = [i + max_distance, m + max_distance].min
    
    (start_k..end_k).each do |k|
      j = i - k + n - m
      next if j < 1 || j > n
      
      cost = s[i - 1] == t[j - 1] ? 0 : 1
      candidates = []
      candidates << prev[k - 1] + 1 if prev[k - 1]
      candidates << curr[k - 1] + 1 if curr[k - 1]
      candidates << prev[k] + cost if prev[k]
      
      curr[k] = candidates.min if candidates.any?
    end
    prev = curr
  end
  
  prev[n - m] || max_distance + 1
end

Parallel computation distributes matrix calculation across multiple threads or processes. Each diagonal contains cells with no dependencies on each other, permitting parallel evaluation.

For large-scale applications comparing many string pairs, preprocessing creates n-gram indexes or other data structures that filter candidates before computing exact edit distances. This reduces the number of expensive distance calculations.

Reference

Algorithm Complexity

Aspect	Complexity	Notes
Time	O(mn)	m and n are string lengths
Space	O(mn)	Full matrix approach
Space optimized	O(min(m,n))	Two-row approach
Early termination	O(mn) worst case	Better average case with threshold

Operation Types

Operation	Description	Cost	Example
Match	Characters already equal	0	'a' matches 'a'
Insertion	Add character to source	1	'' → 'a'
Deletion	Remove character from source	1	'a' → ''
Substitution	Replace one character	1	'a' → 'b'
Transposition	Swap adjacent characters	1	'ab' → 'ba' (Damerau-Levenshtein only)

Common Variants

Variant	Operations	Use Case
Levenshtein	Insert, Delete, Substitute	General string comparison
Damerau-Levenshtein	Insert, Delete, Substitute, Transpose	Typing error detection
Longest Common Subsequence	Insert, Delete only	Diff algorithms
Hamming Distance	Substitute only, equal length	Error detection codes
Weighted Edit Distance	Operations with custom costs	Domain-specific matching

Ruby Implementation Patterns

Pattern	Memory	Speed	Use Case
Full matrix	O(mn)	Fast	Need edit path
Two-array	O(min(m,n))	Fast	Distance only
Recursive memoized	O(mn)	Slower	Small inputs
Bounded	O(n)	Very fast	Threshold filtering

Optimization Techniques

Technique	Benefit	Trade-off
Space optimization	Reduces memory 50-90%	Cannot reconstruct path
Early termination	Faster for bounded distance	Added complexity
Diagonal computation	Reduces computation with threshold	More complex indexing
Quick rejection	Eliminates impossible pairs	Preprocessing overhead
Character frequency	Fast initial filtering	May reject valid candidates

Common Applications

Application	Distance Type	Threshold
Spell checking	Levenshtein	1-2
Fuzzy search	Normalized	0.7-0.9
DNA alignment	Weighted	Variable
Duplicate detection	Damerau-Levenshtein	2-3
Diff generation	LCS-based	N/A

Edit Distance