Overview
Edit distance quantifies the similarity between two strings by calculating the minimum number of single-character operations needed to transform one string into another. The Levenshtein distance, the most common variant, permits three operations: insertion of a character, deletion of a character, and substitution of one character for another. Each operation carries a cost of one.
The algorithm originated from work by Vladimir Levenshtein in 1965 for error correction in binary codes. String comparison problems appear throughout software development: spell checkers suggest corrections based on edit distance, version control systems use it to generate diffs, search engines implement fuzzy matching for typo tolerance, and bioinformatics applications align DNA sequences.
Edit distance provides a numeric measure of string similarity. A distance of zero indicates identical strings, while larger values indicate greater dissimilarity. The maximum possible distance equals the length of the longer string, occurring when strings share no common characters.
# Transforming "kitten" to "sitting" requires 3 operations
# 1. Substitute 'k' with 's': "sitten"
# 2. Substitute 'e' with 'i': "sittin"
# 3. Insert 'g': "sitting"
# Edit distance = 3
The algorithm applies to any sequence comparison problem where elements can be compared for equality. While commonly used for strings, the same technique works for arrays, lists, or any ordered collection of comparable elements.
Key Principles
The edit distance algorithm builds an optimal solution from optimal solutions to smaller subproblems, exhibiting the characteristics of dynamic programming. The problem has optimal substructure: the edit distance between two strings depends on the edit distances between their prefixes.
For strings S and T, the edit distance calculation considers three cases at each position:
The characters at the current positions match. The edit distance equals the edit distance of the remaining substrings without those characters. No additional operation is required.
The characters differ, requiring one of three operations:
Insertion: Add a character from T to S. The cost equals one plus the edit distance between S and T without its first character.
Deletion: Remove a character from S. The cost equals one plus the edit distance between S without its first character and T.
Substitution: Replace a character in S with the corresponding character from T. The cost equals one plus the edit distance between both strings without their first characters.
The algorithm selects the operation with minimum cost. This recursive definition provides the foundation for both recursive and iterative implementations.
The dynamic programming approach constructs a matrix where entry (i,j) represents the edit distance between the first i characters of S and the first j characters of T. The matrix has dimensions (m+1) × (n+1) where m and n are the string lengths.
Base cases initialize the first row and column. Transforming an empty string to a string of length k requires k insertions. Converting a string of length k to an empty string requires k deletions.
# Matrix for edit distance between "CAT" and "DOG"
# "" D O G
# "" 0 1 2 3
# C 1 1 2 3
# A 2 2 2 3
# T 3 3 3 3
Each cell calculation depends only on three adjacent cells: the cell above, the cell to the left, and the diagonal cell. This dependency structure permits efficient computation in a specific order.
The recurrence relation defines:
D[i,j] = D[i-1,j-1] if S[i] = T[j]
D[i,j] = 1 + min(D[i-1,j], (deletion)
D[i,j-1], (insertion)
D[i-1,j-1]) (substitution)
if S[i] ≠ T[j]
The final answer appears in the bottom-right cell of the matrix, representing the edit distance between the complete strings.
Implementation Approaches
The recursive approach directly implements the mathematical definition but suffers from exponential time complexity due to repeated subproblem calculations. Each recursive call spawns up to three additional calls, creating a call tree with overlapping subproblems.
def edit_distance_recursive(s, t, i = s.length - 1, j = t.length - 1)
return j + 1 if i < 0
return i + 1 if j < 0
return edit_distance_recursive(s, t, i - 1, j - 1) if s[i] == t[j]
[
edit_distance_recursive(s, t, i - 1, j), # delete
edit_distance_recursive(s, t, i, j - 1), # insert
edit_distance_recursive(s, t, i - 1, j - 1) # substitute
].min + 1
end
Memoization optimizes the recursive approach by caching subproblem results in a hash or array. When a subproblem recurs, the cached result returns immediately instead of recalculating. This reduces time complexity from exponential to polynomial while maintaining the recursive structure.
def edit_distance_memo(s, t, memo = {})
key = [s, t]
return memo[key] if memo.key?(key)
return memo[key] = t.length if s.empty?
return memo[key] = s.length if t.empty?
if s[0] == t[0]
memo[key] = edit_distance_memo(s[1..-1], t[1..-1], memo)
else
memo[key] = 1 + [
edit_distance_memo(s[1..-1], t, memo),
edit_distance_memo(s, t[1..-1], memo),
edit_distance_memo(s[1..-1], t[1..-1], memo)
].min
end
memo[key]
end
The bottom-up dynamic programming approach builds the solution iteratively by filling the matrix row by row or column by column. This eliminates recursion overhead and provides predictable memory access patterns, improving cache performance.
Space-optimized implementations observe that computing row i requires only row i-1. Instead of maintaining the entire matrix, two arrays suffice: one for the current row and one for the previous row. This reduces space complexity from O(mn) to O(min(m,n)) by processing the shorter string as columns.
def edit_distance_space_optimized(s, t)
# Ensure s is the shorter string
s, t = t, s if s.length > t.length
previous = (0..s.length).to_a
t.length.times do |j|
current = [j + 1]
s.length.times do |i|
cost = s[i] == t[j] ? 0 : 1
current << [
current[i] + 1, # insertion
previous[i + 1] + 1, # deletion
previous[i] + cost # substitution
].min
end
previous = current
end
previous.last
end
Diagonal iteration processes the matrix along diagonals rather than rows or columns. This approach can parallelize computation of cells on the same diagonal, as they have no dependencies on each other.
Ruby Implementation
Ruby's string handling and array operations make implementing edit distance straightforward. The standard dynamic programming approach creates a two-dimensional matrix using nested arrays.
def edit_distance(source, target)
m, n = source.length, target.length
matrix = Array.new(m + 1) { Array.new(n + 1) }
# Initialize base cases
(0..m).each { |i| matrix[i][0] = i }
(0..n).each { |j| matrix[0][j] = j }
# Fill matrix
(1..m).each do |i|
(1..n).each do |j|
if source[i - 1] == target[j - 1]
matrix[i][j] = matrix[i - 1][j - 1]
else
matrix[i][j] = [
matrix[i - 1][j], # deletion
matrix[i][j - 1], # insertion
matrix[i - 1][j - 1] # substitution
].min + 1
end
end
end
matrix[m][n]
end
edit_distance("kitten", "sitting")
# => 3
The implementation extends to return the actual edit sequence, not just the distance. Backtracking through the matrix reconstructs the operations by examining which cell contributed to each minimum.
def edit_distance_with_path(source, target)
m, n = source.length, target.length
matrix = Array.new(m + 1) { Array.new(n + 1) }
(0..m).each { |i| matrix[i][0] = i }
(0..n).each { |j| matrix[0][j] = j }
(1..m).each do |i|
(1..n).each do |j|
if source[i - 1] == target[j - 1]
matrix[i][j] = matrix[i - 1][j - 1]
else
matrix[i][j] = [
matrix[i - 1][j],
matrix[i][j - 1],
matrix[i - 1][j - 1]
].min + 1
end
end
end
# Backtrack to find operations
operations = []
i, j = m, n
while i > 0 || j > 0
if i == 0
operations << [:insert, target[j - 1]]
j -= 1
elsif j == 0
operations << [:delete, source[i - 1]]
i -= 1
elsif source[i - 1] == target[j - 1]
i -= 1
j -= 1
else
deletion = matrix[i - 1][j]
insertion = matrix[i][j - 1]
substitution = matrix[i - 1][j - 1]
if substitution <= deletion && substitution <= insertion
operations << [:substitute, source[i - 1], target[j - 1]]
i -= 1
j -= 1
elsif deletion <= insertion
operations << [:delete, source[i - 1]]
i -= 1
else
operations << [:insert, target[j - 1]]
j -= 1
end
end
end
{ distance: matrix[m][n], operations: operations.reverse }
end
result = edit_distance_with_path("kitten", "sitting")
# => {
# distance: 3,
# operations: [
# [:substitute, "k", "s"],
# [:substitute, "e", "i"],
# [:insert, "g"]
# ]
# }
Ruby classes encapsulate edit distance functionality with configurable costs for different operations. Some applications assign different weights to insertions, deletions, and substitutions.
class EditDistance
attr_reader :insert_cost, :delete_cost, :substitute_cost
def initialize(insert: 1, delete: 1, substitute: 1)
@insert_cost = insert
@delete_cost = delete
@substitute_cost = substitute
end
def calculate(source, target)
m, n = source.length, target.length
matrix = Array.new(m + 1) { Array.new(n + 1) }
(0..m).each { |i| matrix[i][0] = i * delete_cost }
(0..n).each { |j| matrix[0][j] = j * insert_cost }
(1..m).each do |i|
(1..n).each do |j|
if source[i - 1] == target[j - 1]
matrix[i][j] = matrix[i - 1][j - 1]
else
matrix[i][j] = [
matrix[i - 1][j] + delete_cost,
matrix[i][j - 1] + insert_cost,
matrix[i - 1][j - 1] + substitute_cost
].min
end
end
end
matrix[m][n]
end
def similarity_ratio(source, target)
distance = calculate(source, target)
max_length = [source.length, target.length].max
return 1.0 if max_length.zero?
1.0 - (distance.to_f / max_length)
end
end
calculator = EditDistance.new(substitute: 2)
calculator.calculate("cat", "cut")
# => 2 (substitution costs 2)
calculator.similarity_ratio("kitten", "sitting")
# => 0.5714285714285714
The Text gem provides edit distance functionality for Ruby applications without implementing the algorithm manually. The gem includes optimized implementations and additional string similarity metrics.
require 'text'
Text::Levenshtein.distance("kitten", "sitting")
# => 3
# Damerau-Levenshtein includes transpositions
Text::Damerau.distance("abcd", "acbd")
# => 1 (transpose 'b' and 'c')
Practical Examples
Spell checking applications find the closest dictionary words to a misspelled input. The spell checker calculates edit distance between the input and each dictionary word, suggesting words with the smallest distances.
class SpellChecker
def initialize(dictionary)
@dictionary = dictionary
end
def suggest(word, max_distance: 2, max_suggestions: 5)
suggestions = @dictionary.map do |dict_word|
distance = edit_distance(word, dict_word)
{ word: dict_word, distance: distance } if distance <= max_distance
end.compact
suggestions.sort_by { |s| s[:distance] }
.first(max_suggestions)
.map { |s| s[:word] }
end
private
def edit_distance(s, t)
m, n = s.length, t.length
prev = (0..n).to_a
m.times do |i|
curr = [i + 1]
n.times do |j|
cost = s[i] == t[j] ? 0 : 1
curr << [curr[j] + 1, prev[j + 1] + 1, prev[j] + cost].min
end
prev = curr
end
prev[n]
end
end
dictionary = %w[cat hat mat sat bat chat]
checker = SpellChecker.new(dictionary)
checker.suggest("cot", max_distance: 1)
# => ["cat", "hat", "mat", "sat", "bat"]
Fuzzy string matching in search systems tolerates user typos by finding records with similar strings. A product search matches queries against product names even when users misspell terms.
class FuzzySearcher
def initialize(items)
@items = items
end
def search(query, threshold: 0.7)
results = @items.map do |item|
distance = edit_distance(query.downcase, item.downcase)
max_length = [query.length, item.length].max
similarity = 1.0 - (distance.to_f / max_length)
{ item: item, similarity: similarity } if similarity >= threshold
end.compact
results.sort_by { |r| -r[:similarity] }
end
private
def edit_distance(s, t)
return t.length if s.empty?
return s.length if t.empty?
prev = (0..t.length).to_a
s.length.times do |i|
curr = [i + 1]
t.length.times do |j|
cost = s[i] == t[j] ? 0 : 1
curr << [curr.last + 1, prev[j + 1] + 1, prev[j] + cost].min
end
prev = curr
end
prev.last
end
end
products = ["iPhone 13", "Samsung Galaxy", "Google Pixel", "OnePlus 9"]
searcher = FuzzySearcher.new(products)
searcher.search("iPhne")
# => [{ item: "iPhone 13", similarity: 0.875 }]
Version control diff algorithms use edit distance concepts to identify changes between file versions. The algorithm finds the minimum set of line insertions and deletions transforming one file version into another.
class LineDiffer
def diff(original, modified)
orig_lines = original.split("\n")
mod_lines = modified.split("\n")
m, n = orig_lines.length, mod_lines.length
matrix = Array.new(m + 1) { Array.new(n + 1) }
(0..m).each { |i| matrix[i][0] = i }
(0..n).each { |j| matrix[0][j] = j }
(1..m).each do |i|
(1..n).each do |j|
if orig_lines[i - 1] == mod_lines[j - 1]
matrix[i][j] = matrix[i - 1][j - 1]
else
matrix[i][j] = [
matrix[i - 1][j] + 1,
matrix[i][j - 1] + 1,
matrix[i - 1][j - 1] + 1
].min
end
end
end
# Backtrack to generate diff
diff = []
i, j = m, n
while i > 0 || j > 0
if i > 0 && j > 0 && orig_lines[i - 1] == mod_lines[j - 1]
diff.unshift({ type: :unchanged, line: orig_lines[i - 1] })
i -= 1
j -= 1
elsif j > 0 && (i == 0 || matrix[i][j - 1] < matrix[i - 1][j])
diff.unshift({ type: :added, line: mod_lines[j - 1] })
j -= 1
else
diff.unshift({ type: :deleted, line: orig_lines[i - 1] })
i -= 1
end
end
diff
end
def format(diff)
diff.map do |change|
case change[:type]
when :unchanged then " #{change[:line]}"
when :added then "+ #{change[:line]}"
when :deleted then "- #{change[:line]}"
end
end.join("\n")
end
end
original = "line 1\nline 2\nline 3"
modified = "line 1\nline 2 modified\nline 3\nline 4"
differ = LineDiffer.new
diff = differ.diff(original, modified)
puts differ.format(diff)
# Output:
# line 1
# - line 2
# + line 2 modified
# line 3
# + line 4
DNA sequence alignment in bioinformatics calculates edit distance between genetic sequences to identify mutations, insertions, and deletions. The algorithm helps identify evolutionary relationships between organisms.
def sequence_alignment(seq1, seq2)
m, n = seq1.length, seq2.length
matrix = Array.new(m + 1) { Array.new(n + 1) }
# Gap penalties
gap_penalty = 1
(0..m).each { |i| matrix[i][0] = i * gap_penalty }
(0..n).each { |j| matrix[0][j] = j * gap_penalty }
(1..m).each do |i|
(1..n).each do |j|
match_cost = seq1[i - 1] == seq2[j - 1] ? 0 : 1
matrix[i][j] = [
matrix[i - 1][j] + gap_penalty, # deletion
matrix[i][j - 1] + gap_penalty, # insertion
matrix[i - 1][j - 1] + match_cost # match/mismatch
].min
end
end
matrix[m][n]
end
dna1 = "ACGTACGT"
dna2 = "ACCTACCT"
sequence_alignment(dna1, dna2)
# => 2
Common Patterns
The standard Levenshtein distance treats all operations equally with cost one. Many applications require weighted operations where insertions, deletions, and substitutions have different costs based on domain requirements.
def weighted_edit_distance(s, t, costs = {})
insert_cost = costs[:insert] || 1
delete_cost = costs[:delete] || 1
substitute_cost = costs[:substitute] || 1
m, n = s.length, t.length
prev = (0..n).map { |j| j * insert_cost }
m.times do |i|
curr = [(i + 1) * delete_cost]
n.times do |j|
if s[i] == t[j]
curr << prev[j]
else
curr << [
curr.last + insert_cost,
prev[j + 1] + delete_cost,
prev[j] + substitute_cost
].min
end
end
prev = curr
end
prev.last
end
# Keyboard-aware costs (keys closer together = lower cost)
weighted_edit_distance("cat", "car", substitute: 0.5)
# => 0.5
The Damerau-Levenshtein distance extends the algorithm to include transpositions as a fourth operation type. Transposing adjacent characters counts as a single operation, reflecting common typing errors.
def damerau_levenshtein(s, t)
m, n = s.length, t.length
max_dist = m + n
h = Hash.new { |hash, key| hash[key] = 0 }
matrix = Array.new(m + 2) { Array.new(n + 2, 0) }
matrix[0][0] = max_dist
(0..m).each { |i| matrix[i + 1][0] = max_dist; matrix[i + 1][1] = i }
(0..n).each { |j| matrix[0][j + 1] = max_dist; matrix[1][j + 1] = j }
(1..m).each do |i|
db = 0
(1..n).each do |j|
k = h[t[j - 1]]
l = db
cost = 1
if s[i - 1] == t[j - 1]
cost = 0
db = j
end
matrix[i + 1][j + 1] = [
matrix[i][j] + cost, # substitution
matrix[i + 1][j] + 1, # insertion
matrix[i][j + 1] + 1, # deletion
matrix[k][l] + (i - k - 1) + 1 + (j - l - 1) # transposition
].min
end
h[s[i - 1]] = i
end
matrix[m + 1][n + 1]
end
damerau_levenshtein("abcd", "acbd")
# => 1 (transpose 'b' and 'c')
edit_distance("abcd", "acbd")
# => 2 (would require delete + insert without transposition)
Bounded edit distance calculations terminate early when the distance exceeds a threshold. This optimization improves performance when only small distances matter, as in spell checking with maximum suggestion distance.
def bounded_edit_distance(s, t, max_distance)
m, n = s.length, t.length
# Quick rejection tests
return max_distance + 1 if (m - n).abs > max_distance
prev = (0..n).to_a
m.times do |i|
curr = [i + 1]
min_in_row = curr[0]
n.times do |j|
cost = s[i] == t[j] ? 0 : 1
val = [curr.last + 1, prev[j + 1] + 1, prev[j] + cost].min
curr << val
min_in_row = val if val < min_in_row
end
# Early termination if minimum in row exceeds threshold
return max_distance + 1 if min_in_row > max_distance
prev = curr
end
prev.last <= max_distance ? prev.last : max_distance + 1
end
bounded_edit_distance("intention", "execution", 3)
# => 4 (exceeds threshold, returns 4)
Similarity scoring normalizes edit distance to a percentage or ratio, making comparisons across different string lengths meaningful. Applications use similarity scores for ranking search results or matching records.
def similarity_score(s, t)
return 1.0 if s == t
return 0.0 if s.empty? && t.empty?
distance = edit_distance(s, t)
max_length = [s.length, t.length].max
1.0 - (distance.to_f / max_length)
end
similarity_score("kitten", "sitting")
# => 0.5714285714285714
similarity_score("cat", "cat")
# => 1.0
similarity_score("abc", "xyz")
# => 0.0
Performance Considerations
The standard dynamic programming algorithm has O(mn) time complexity and O(mn) space complexity for strings of length m and n. Each matrix cell requires constant time to compute, and the algorithm computes (m+1)(n+1) cells.
Space optimization reduces memory usage to O(min(m,n)) by maintaining only two rows of the matrix. The current row depends only on the previous row, eliminating the need to store the entire matrix.
def space_efficient_edit_distance(s, t)
# Process shorter string as columns
s, t = t, s if s.length > t.length
m, n = s.length, t.length
prev = (0..m).to_a
n.times do |j|
curr = [j + 1]
m.times do |i|
cost = s[i] == t[j] ? 0 : 1
curr << [
curr.last + 1,
prev[i + 1] + 1,
prev[i] + cost
].min
end
prev = curr
end
prev.last
end
Early termination optimization stops computation when the remaining cells cannot possibly produce a result below a threshold. For bounded distance calculations, this prunes large portions of the matrix.
Character frequency preprocessing quickly rejects string pairs that cannot have small edit distances. If the character frequency difference between strings exceeds the maximum allowed distance, the strings must exceed that distance.
def quick_reject?(s, t, max_distance)
char_diff = Hash.new(0)
s.each_char { |c| char_diff[c] += 1 }
t.each_char { |c| char_diff[c] -= 1 }
char_diff.values.sum { |v| v.abs } / 2 > max_distance
end
def optimized_bounded_distance(s, t, max_distance)
return max_distance + 1 if quick_reject?(s, t, max_distance)
bounded_edit_distance(s, t, max_distance)
end
Diagonal computation reduces the matrix to only the diagonals within max_distance of the main diagonal. For small max_distance values relative to string lengths, this significantly reduces computation.
def diagonal_edit_distance(s, t, max_distance)
m, n = s.length, t.length
return max_distance + 1 if (m - n).abs > max_distance
# Only compute diagonals within max_distance band
prev = {}
(-max_distance..max_distance).each { |k| prev[k] = k.abs }
(1..m).each do |i|
curr = {}
start_k = [i - max_distance, -max_distance].max
end_k = [i + max_distance, m + max_distance].min
(start_k..end_k).each do |k|
j = i - k + n - m
next if j < 1 || j > n
cost = s[i - 1] == t[j - 1] ? 0 : 1
candidates = []
candidates << prev[k - 1] + 1 if prev[k - 1]
candidates << curr[k - 1] + 1 if curr[k - 1]
candidates << prev[k] + cost if prev[k]
curr[k] = candidates.min if candidates.any?
end
prev = curr
end
prev[n - m] || max_distance + 1
end
Parallel computation distributes matrix calculation across multiple threads or processes. Each diagonal contains cells with no dependencies on each other, permitting parallel evaluation.
For large-scale applications comparing many string pairs, preprocessing creates n-gram indexes or other data structures that filter candidates before computing exact edit distances. This reduces the number of expensive distance calculations.
Reference
Algorithm Complexity
| Aspect | Complexity | Notes |
|---|---|---|
| Time | O(mn) | m and n are string lengths |
| Space | O(mn) | Full matrix approach |
| Space optimized | O(min(m,n)) | Two-row approach |
| Early termination | O(mn) worst case | Better average case with threshold |
Operation Types
| Operation | Description | Cost | Example |
|---|---|---|---|
| Match | Characters already equal | 0 | 'a' matches 'a' |
| Insertion | Add character to source | 1 | '' → 'a' |
| Deletion | Remove character from source | 1 | 'a' → '' |
| Substitution | Replace one character | 1 | 'a' → 'b' |
| Transposition | Swap adjacent characters | 1 | 'ab' → 'ba' (Damerau-Levenshtein only) |
Common Variants
| Variant | Operations | Use Case |
|---|---|---|
| Levenshtein | Insert, Delete, Substitute | General string comparison |
| Damerau-Levenshtein | Insert, Delete, Substitute, Transpose | Typing error detection |
| Longest Common Subsequence | Insert, Delete only | Diff algorithms |
| Hamming Distance | Substitute only, equal length | Error detection codes |
| Weighted Edit Distance | Operations with custom costs | Domain-specific matching |
Ruby Implementation Patterns
| Pattern | Memory | Speed | Use Case |
|---|---|---|---|
| Full matrix | O(mn) | Fast | Need edit path |
| Two-array | O(min(m,n)) | Fast | Distance only |
| Recursive memoized | O(mn) | Slower | Small inputs |
| Bounded | O(n) | Very fast | Threshold filtering |
Optimization Techniques
| Technique | Benefit | Trade-off |
|---|---|---|
| Space optimization | Reduces memory 50-90% | Cannot reconstruct path |
| Early termination | Faster for bounded distance | Added complexity |
| Diagonal computation | Reduces computation with threshold | More complex indexing |
| Quick rejection | Eliminates impossible pairs | Preprocessing overhead |
| Character frequency | Fast initial filtering | May reject valid candidates |
Common Applications
| Application | Distance Type | Threshold |
|---|---|---|
| Spell checking | Levenshtein | 1-2 |
| Fuzzy search | Normalized | 0.7-0.9 |
| DNA alignment | Weighted | Variable |
| Duplicate detection | Damerau-Levenshtein | 2-3 |
| Diff generation | LCS-based | N/A |