Overview
String searching algorithms locate occurrences of a pattern within a larger text string. The naive approach compares the pattern character-by-character at each position in the text, resulting in O(n*m) time complexity where n is the text length and m is the pattern length. Advanced algorithms like Knuth-Morris-Pratt (KMP) and Boyer-Moore reduce this complexity through preprocessing and intelligent skipping strategies.
KMP, developed in 1977, preprocesses the pattern to identify self-similarities, eliminating redundant comparisons when mismatches occur. The algorithm maintains O(n+m) time complexity by never backtracking in the text string, making it ideal for streaming data or situations where re-reading text is expensive.
Boyer-Moore, published in 1977, takes a different approach by scanning the pattern from right to left and using two heuristics—the bad character rule and the good suffix rule—to skip sections of text. In practice, Boyer-Moore often performs better than KMP on natural language text, sometimes achieving sublinear time complexity by skipping multiple characters at once.
text = "ABABDABACDABABCABAB"
pattern = "ABABCABAB"
# Naive search requires checking each position
# Advanced algorithms skip intelligently based on pattern analysis
Ruby's built-in String#index uses optimized C implementations, but understanding these algorithms provides insight into pattern matching complexity and enables custom implementations for specialized requirements.
Key Principles
String searching algorithms optimize the matching process through pattern analysis and strategic skipping. The fundamental challenge is avoiding redundant comparisons when a mismatch occurs. Naive algorithms restart from the next position in the text, wasting information gained from partial matches.
KMP Preprocessing and Failure Function
KMP constructs a failure function (also called prefix function or LPS array for Longest Proper Prefix which is also Suffix) during preprocessing. This array stores, for each position in the pattern, the length of the longest proper prefix of the substring that is also a suffix. When a mismatch occurs, this function determines how far to shift the pattern without missing potential matches.
The failure function captures self-similarity within the pattern. For pattern "ABABCABAB", the failure function reveals that if a mismatch occurs after matching "ABAB", the next comparison can start from position 2 because "AB" appears at both the beginning and end of the matched substring.
# Pattern: A B A B C A B A B
# Index: 0 1 2 3 4 5 6 7 8
# LPS: 0 0 1 2 0 1 2 3 4
# At index 4, longest proper prefix that's also suffix is "" (length 0)
# At index 8, "ABAB" appears at both ends (length 4)
Boyer-Moore Heuristics
Boyer-Moore employs two rules for determining skip distances: the bad character rule and the good suffix rule. The algorithm uses whichever rule suggests a larger skip.
The bad character rule preprocesses the pattern to record the rightmost occurrence of each character. When a mismatch occurs at text position i with character c, the pattern shifts so that the rightmost occurrence of c in the pattern aligns with position i. If c doesn't appear in the pattern, the entire pattern shifts past position i.
# Pattern: "EXAMPLE"
# Bad character table stores rightmost positions:
# E: 6, X: 1, A: 2, M: 3, P: 4, L: 5
# Any character not in pattern: -1
The good suffix rule identifies when a suffix of the matched portion appears elsewhere in the pattern. If characters match from the right but then mismatch, the algorithm shifts the pattern to align any previous occurrence of the matched suffix.
Comparison Strategies
KMP scans left-to-right and never backtracks in the text, making it suitable for sequential access patterns. The algorithm maintains a single pointer in the text that only moves forward, beneficial for processing streams or large files where seeking backwards is costly.
Boyer-Moore scans right-to-left within each pattern position, which seems counterintuitive but enables larger skips. The rightmost character provides maximum information for the bad character rule—if it doesn't match, the entire pattern can often shift by its full length.
Both algorithms guarantee O(n+m) worst-case time complexity, but their practical performance differs. KMP performs consistently across different input types, while Boyer-Moore excels on larger alphabets and longer patterns where skip distances increase.
Ruby Implementation
Ruby implements string searching through several methods, with String#index and String#scan providing built-in functionality. Custom implementations demonstrate algorithmic concepts and enable modifications for specific requirements.
KMP Implementation
class KMPSearch
def initialize(pattern)
@pattern = pattern
@lps = build_lps_array
end
def search(text)
positions = []
i = 0 # text index
j = 0 # pattern index
while i < text.length
if text[i] == @pattern[j]
i += 1
j += 1
end
if j == @pattern.length
positions << i - j
j = @lps[j - 1]
elsif i < text.length && text[i] != @pattern[j]
j = @lps[j - 1] if j != 0
i += 1 if j == 0
end
end
positions
end
private
def build_lps_array
lps = Array.new(@pattern.length, 0)
length = 0
i = 1
while i < @pattern.length
if @pattern[i] == @pattern[length]
length += 1
lps[i] = length
i += 1
else
if length != 0
length = lps[length - 1]
else
lps[i] = 0
i += 1
end
end
end
lps
end
end
# Usage
searcher = KMPSearch.new("ABABCABAB")
text = "ABABDABACDABABCABABABABCABAB"
positions = searcher.search(text)
# => [10, 19]
The LPS array construction uses dynamic programming. Each position depends on previous values, with the length variable tracking the current prefix length. When characters match, both pointers advance. On mismatch, the length falls back to the LPS value at the previous position, checking progressively shorter prefixes.
Boyer-Moore Implementation
class BoyerMooreSearch
def initialize(pattern)
@pattern = pattern
@bad_char = build_bad_character_table
end
def search(text)
positions = []
m = @pattern.length
n = text.length
s = 0 # shift of pattern with respect to text
while s <= n - m
j = m - 1
# Scan from right to left
j -= 1 while j >= 0 && @pattern[j] == text[s + j]
if j < 0
# Pattern found
positions << s
s += (s + m < n) ? m - @bad_char[text[s + m]] : 1
else
# Mismatch - apply bad character rule
bad_char_shift = j - @bad_char[text[s + j]]
s += [1, bad_char_shift].max
end
end
positions
end
private
def build_bad_character_table
table = Hash.new(-1)
@pattern.chars.each_with_index do |char, index|
table[char] = index
end
table
end
end
# Usage
searcher = BoyerMooreSearch.new("EXAMPLE")
text = "HERE IS A SIMPLE EXAMPLE IN THE TEXT"
positions = searcher.search(text)
# => [16]
This implementation focuses on the bad character rule, which provides most of Boyer-Moore's practical benefit. The good suffix rule adds complexity with modest improvement in typical cases. The hash-based bad character table handles any character size alphabet efficiently.
Ruby String Methods with Custom Callbacks
class PatternMatcher
def self.find_all_with_context(text, pattern, context_size = 10)
matches = []
index = 0
while index = text.index(pattern, index)
start_context = [0, index - context_size].max
end_context = [text.length, index + pattern.length + context_size].min
matches << {
position: index,
context: text[start_context...end_context],
before: text[start_context...index],
match: pattern,
after: text[(index + pattern.length)...end_context]
}
index += 1
end
matches
end
end
text = "The quick brown fox jumps over the lazy fox near the river"
results = PatternMatcher.find_all_with_context(text, "fox", 8)
# Returns detailed match information with surrounding context
Streaming Search Implementation
class StreamingKMP
def initialize(pattern)
@pattern = pattern
@lps = build_lps_array
@j = 0 # Persistent pattern index
end
def process_chunk(chunk)
matches = []
chunk.chars.each_with_index do |char, chunk_index|
while @j > 0 && char != @pattern[@j]
@j = @lps[@j - 1]
end
if char == @pattern[@j]
@j += 1
end
if @j == @pattern.length
matches << chunk_index - @j + 1
@j = @lps[@j - 1]
end
end
matches
end
def reset
@j = 0
end
private
def build_lps_array
lps = Array.new(@pattern.length, 0)
length = 0
i = 1
while i < @pattern.length
if @pattern[i] == @pattern[length]
length += 1
lps[i] = length
i += 1
else
if length != 0
length = lps[length - 1]
else
lps[i] = 0
i += 1
end
end
end
lps
end
end
# Usage for processing data streams
searcher = StreamingKMP.new("ERROR")
IO.foreach("large_log_file.txt") do |line|
positions = searcher.process_chunk(line)
puts "Found ERROR at positions: #{positions}" unless positions.empty?
searcher.reset # Reset for each line
end
Practical Examples
DNA Sequence Analysis
Biological sequence analysis requires finding specific patterns in long DNA strings. KMP handles repetitive sequences efficiently since DNA has a small alphabet (A, C, G, T) with frequent repetition.
class DNASearcher
def initialize
@cache = {}
end
def find_motif(sequence, motif)
searcher = @cache[motif] ||= KMPSearch.new(motif)
positions = searcher.search(sequence)
{
motif: motif,
occurrences: positions.length,
positions: positions,
frequency: positions.length.to_f / sequence.length
}
end
def find_multiple_motifs(sequence, motifs)
motifs.map { |motif| find_motif(sequence, motif) }
end
end
dna = "ATCGATCGATCGTAGCTAGCTAGCTGATCGATCG"
searcher = DNASearcher.new
results = searcher.find_multiple_motifs(dna, ["ATCG", "TAGC", "GATC"])
# Identifies all occurrences of each motif
Log File Pattern Detection
Log analysis often requires finding error patterns, timestamps, or specific event sequences across large files. Boyer-Moore performs well here due to longer patterns and larger alphabets.
class LogAnalyzer
ERROR_PATTERNS = {
sql_error: "SQL Error:",
timeout: "Connection timeout",
memory: "OutOfMemory",
critical: "CRITICAL:"
}
def analyze_file(filepath)
searchers = ERROR_PATTERNS.transform_values { |pattern|
BoyerMooreSearch.new(pattern)
}
results = Hash.new { |h, k| h[k] = [] }
line_number = 0
File.foreach(filepath) do |line|
line_number += 1
searchers.each do |error_type, searcher|
positions = searcher.search(line)
positions.each do |pos|
results[error_type] << {
line: line_number,
position: pos,
content: line.strip
}
end
end
end
results
end
def self.generate_report(results)
results.map do |error_type, occurrences|
{
type: error_type,
count: occurrences.length,
first_occurrence: occurrences.first,
last_occurrence: occurrences.last
}
end
end
end
analyzer = LogAnalyzer.new
errors = analyzer.analyze_file("production.log")
report = LogAnalyzer.generate_report(errors)
Text Search with Case Insensitivity
Many applications require case-insensitive searching. Pattern preprocessing must account for case variations.
class CaseInsensitiveSearch
def initialize(pattern)
@pattern = pattern.downcase
@kmp = KMPSearch.new(@pattern)
end
def search(text)
normalized_text = text.downcase
positions = @kmp.search(normalized_text)
positions.map do |pos|
{
position: pos,
original_text: text[pos, @pattern.length],
matched_pattern: @pattern
}
end
end
end
text = "The Quick BROWN fox and the QUICK rabbit"
searcher = CaseInsensitiveSearch.new("quick")
results = searcher.search(text)
# Finds both "Quick" and "QUICK"
Multi-Pattern Search
Searching for multiple patterns simultaneously requires running separate searches or using more advanced algorithms like Aho-Corasick. For moderate pattern counts, parallel KMP searches work well.
class MultiPatternSearch
def initialize(patterns)
@searchers = patterns.map { |p| [p, KMPSearch.new(p)] }.to_h
end
def search(text)
results = {}
@searchers.each do |pattern, searcher|
positions = searcher.search(text)
results[pattern] = positions unless positions.empty?
end
# Return matches sorted by position
all_matches = results.flat_map do |pattern, positions|
positions.map { |pos| {position: pos, pattern: pattern} }
end
all_matches.sort_by { |m| m[:position] }
end
end
text = "Programming in Ruby requires understanding Ruby syntax and Ruby idioms"
patterns = ["Ruby", "syntax", "understanding"]
searcher = MultiPatternSearch.new(patterns)
matches = searcher.search(text)
# Returns all matches sorted by position
Design Considerations
Selecting the appropriate string searching algorithm depends on text characteristics, pattern properties, performance requirements, and implementation constraints.
Text and Pattern Characteristics
Small alphabets (binary data, DNA sequences) favor KMP because skip distances in Boyer-Moore remain small. With only 4 characters in DNA or 2 in binary, the bad character rule provides minimal benefit. KMP's consistent O(n+m) performance handles repetitive patterns efficiently.
Large alphabets (natural language, source code) favor Boyer-Moore. English text with 26+ characters plus punctuation and whitespace creates opportunities for large skips. When a character doesn't appear in the pattern, Boyer-Moore skips the entire pattern length.
Pattern length affects algorithm choice. Short patterns (2-5 characters) see minimal benefit from preprocessing overhead. Ruby's built-in methods or naive search suffice. Medium patterns (6-20 characters) benefit from both algorithms. Long patterns (20+ characters) favor Boyer-Moore, where skip distances can be substantial.
Access Pattern Requirements
Sequential-only access requires KMP. When processing streams, pipes, or network data where rewinding is impossible, KMP's forward-only text traversal is essential. The algorithm never needs to re-read previous text positions.
Random access environments suit Boyer-Moore better. When the entire text resides in memory or supports efficient seeking, Boyer-Moore's potential backward movement in the text doesn't impose penalties, and its average-case performance improvements manifest fully.
Preprocessing vs. Search Trade-offs
Single search scenarios may not justify preprocessing overhead. For one-time searches, Ruby's built-in String#index provides optimized C implementations that often outperform pure Ruby algorithm implementations.
Repeated searches with the same pattern amortize preprocessing costs. Applications that search for the same pattern across multiple texts or repeatedly in a single large text benefit from constructing the LPS array or bad character table once.
Performance Profiles
KMP provides predictable performance across all inputs. The O(n+m) guarantee holds for worst-case, average-case, and best-case scenarios. This predictability aids capacity planning and performance analysis.
Boyer-Moore exhibits variable performance. Best-case scenarios approach O(n/m), scanning only every mth character in the text. Worst-case degradation to O(nm) occurs with adversarial patterns, though this is rare in practice. Average-case performance typically exceeds KMP on natural text.
Implementation Complexity
KMP requires understanding the LPS array construction, which can be subtle. The preprocessing logic involves careful pointer manipulation, and errors in LPS construction cause incorrect search results.
Boyer-Moore with only the bad character rule is straightforward to implement. Adding the good suffix rule increases complexity significantly for modest practical gain. Many implementations omit it.
Ruby-Specific Considerations
Ruby strings use variable-width encodings (UTF-8 by default). Character-by-character comparison works correctly, but byte-level access patterns differ from fixed-width encodings. Algorithm implementations must use character indexing ([]) rather than byte operations.
Built-in methods like String#index, String#rindex, and String#scan use optimized C code. Pure Ruby implementations of KMP or Boyer-Moore typically run slower than built-ins for simple searches. Custom algorithms make sense for:
- Specialized requirements (case-insensitive, fuzzy matching, streaming)
- Learning and demonstration purposes
- Pattern statistics or metadata collection during search
- Integration with custom data structures
Performance Considerations
String searching performance depends on algorithm characteristics, input properties, and implementation details. Understanding these factors guides optimization decisions.
Time Complexity Analysis
KMP guarantees O(n+m) time in all cases:
- Preprocessing: O(m) to build the LPS array
- Searching: O(n) with each text character examined at most twice
- Total: O(n+m) regardless of input
Boyer-Moore theoretical complexity:
- Preprocessing: O(m + σ) where σ is alphabet size
- Best case: O(n/m) by skipping pattern-length chunks
- Average case: O(n) with constant better than KMP on natural text
- Worst case: O(nm) with adversarial patterns (rare in practice)
Space Complexity
KMP requires O(m) additional space for the LPS array. This array persists across searches when reusing the same pattern.
Boyer-Moore requires O(σ) for the bad character table, typically implemented as a hash in Ruby. For ASCII text, σ = 128 or 256. For Unicode, hash-based storage prevents excessive memory usage.
Practical Performance Factors
Cache locality affects modern CPU performance. KMP's sequential text access patterns align with cache line fetching, reducing cache misses. Boyer-Moore's variable skip distances may cause more cache misses but process fewer total characters.
require 'benchmark'
def benchmark_algorithms(text, pattern, iterations = 1000)
kmp = KMPSearch.new(pattern)
bm = BoyerMooreSearch.new(pattern)
Benchmark.bmbm do |x|
x.report("KMP:") { iterations.times { kmp.search(text) } }
x.report("Boyer-Moore:") { iterations.times { bm.search(text) } }
x.report("Ruby index:") { iterations.times {
pos = 0
positions = []
while pos = text.index(pattern, pos)
positions << pos
pos += 1
end
}}
end
end
# Test with different text characteristics
text = File.read("large_file.txt")
pattern = "specific_pattern"
benchmark_algorithms(text, pattern)
Alphabet Size Impact
Small alphabets reduce Boyer-Moore effectiveness. With alphabet size 2, maximum skip distance is 2 regardless of pattern length. DNA searching with 4 characters sees limited benefit from bad character rule.
Large alphabets maximize Boyer-Moore advantage. English text, source code, and general Unicode text create frequent opportunities for large skips when characters don't match.
Pattern Repetition Effects
Patterns with repetition favor KMP. Pattern "AAAAAAB" searching in text "AAAAAAA..." demonstrates KMP's strength—the LPS array enables efficient handling of repeated partial matches.
Distinct character patterns favor Boyer-Moore. Pattern "EXAMPLE" in typical text allows aggressive skipping because characters E, X, M rarely appear in random positions.
Optimization Techniques
Preprocessing pattern metadata accelerates searches:
class OptimizedSearch
def initialize(pattern)
@pattern = pattern
@length = pattern.length
# Fast path checks
@first_char = pattern[0]
@last_char = pattern[-1]
@has_repetition = pattern.chars.uniq.length != pattern.length
# Choose algorithm based on characteristics
@searcher = if @has_repetition || @length < 8
KMPSearch.new(pattern)
else
BoyerMooreSearch.new(pattern)
end
end
def search(text)
return [] if text.length < @length
# Quick check if pattern characters exist in text
pattern_chars = @pattern.chars.to_set
text_chars = text.chars.to_set
return [] unless (pattern_chars - text_chars).empty?
@searcher.search(text)
end
end
Memory Access Patterns
Sequential memory access improves performance. KMP reads text sequentially, benefiting from hardware prefetching. Boyer-Moore's variable skip distances may defeat prefetching but process fewer total bytes.
Ruby Implementation Overhead
Pure Ruby implementations carry interpretation overhead. Method calls, array bounds checking, and dynamic typing add constant factors to theoretical complexity. For performance-critical applications:
# Use native methods when possible
text.scan(/#{Regexp.escape(pattern)}/) # Fast for simple patterns
# Or drop to C extensions for custom algorithms
require 'inline'
class FastSearch
inline do |builder|
builder.c "
VALUE kmp_search(VALUE text_str, VALUE pattern_str) {
// C implementation for performance
}
"
end
end
Common Pitfalls
String searching implementations encounter several recurring issues related to edge cases, encoding handling, and algorithmic subtleties.
Empty Pattern and Text Handling
Empty patterns require explicit handling. Mathematically, an empty pattern appears at every position in any text, but most applications interpret empty patterns as returning no matches or raising errors.
class RobustKMP
def search(text)
return [] if @pattern.empty?
return [] if text.empty?
return [] if text.length < @pattern.length
# Continue with normal search
end
end
Off-by-One Errors in LPS Construction
The LPS array construction involves careful index management. Common errors include:
# INCORRECT - doesn't handle first position
def build_lps_wrong
lps = Array.new(@pattern.length)
lps[0] = 0 # Missing - first position always 0
# ...
end
# CORRECT
def build_lps_correct
lps = Array.new(@pattern.length, 0) # Initialize all to 0
length = 0
i = 1 # Start from second position
# ...
end
Character Encoding Mismatches
Ruby strings use encodings that affect character indexing. ASCII-8BIT, UTF-8, and other encodings interpret bytes differently.
# Encoding mismatch causes incorrect results
text = "café".encode("UTF-8") # Multi-byte characters
pattern = "caf".encode("ASCII-8BIT")
# Pattern matching fails due to encoding mismatch
# Must ensure consistent encoding:
def search_safe(text, pattern)
normalized_text = text.encode("UTF-8")
normalized_pattern = pattern.encode("UTF-8")
# Perform search with consistent encoding
end
Boyer-Moore Edge Cases
The bad character rule requires careful handling when the mismatched character doesn't appear in the pattern. Shifting past the current position is essential:
# INCORRECT - may shift backward
def search_wrong(text)
# ...
bad_char_shift = j - @bad_char[text[s + j]]
s += bad_char_shift # Can be negative!
end
# CORRECT - ensure positive shift
def search_correct(text)
# ...
bad_char_shift = j - @bad_char[text[s + j]]
s += [1, bad_char_shift].max # Never shift backward
end
Overlapping Matches
Deciding how to handle overlapping occurrences affects results. Pattern "AA" in text "AAAA" has three overlapping matches at positions 0, 1, and 2.
class OverlapAwareSearch
def search_all_overlapping(text)
positions = []
@kmp.search(text).each do |pos|
positions << pos
end
positions
end
def search_non_overlapping(text)
positions = []
i = 0
while i <= text.length - @pattern.length
if text[i, @pattern.length] == @pattern
positions << i
i += @pattern.length # Skip past entire match
else
i += 1
end
end
positions
end
end
Unicode Normalization
Unicode characters have multiple representations. The character "é" can be composed (single codepoint U+00E9) or decomposed (e + combining acute accent). Pattern matching may fail without normalization.
require 'unicode/normalize'
class UnicodeAwareSearch
def initialize(pattern)
@pattern = pattern.unicode_normalize(:nfc)
@searcher = KMPSearch.new(@pattern)
end
def search(text)
normalized_text = text.unicode_normalize(:nfc)
@searcher.search(normalized_text)
end
end
# Without normalization:
text = "café" # é as composed character
pattern = "café" # é as decomposed (e + combining accent)
# May not match without normalization
Streaming State Management
Streaming implementations must preserve state between chunks. Patterns split across chunk boundaries require careful handling:
class StatefulStreamSearch
def initialize(pattern)
@pattern = pattern
@buffer = ""
@searcher = KMPSearch.new(pattern)
@global_position = 0
end
def process_chunk(chunk)
# Combine with buffer to handle patterns spanning boundaries
combined = @buffer + chunk
positions = @searcher.search(combined)
# Adjust positions to global coordinates
global_positions = positions.map { |p| p + @global_position - @buffer.length }
# Keep potential partial match at end
@buffer = combined[-([@pattern.length - 1, 0].max)..-1] || ""
@global_position += chunk.length
global_positions
end
end
Performance Degradation with Poor Pattern Selection
Patterns with high repetition can cause performance issues in naive implementations. Pattern "AAAAAB" in text of repeated "A"s triggers worst-case behavior without proper algorithm selection.
# Detect problematic patterns
def analyze_pattern_risk(pattern)
unique_chars = pattern.chars.uniq.length
repetition_ratio = 1.0 - (unique_chars.to_f / pattern.length)
if repetition_ratio > 0.7
:high_repetition # Use KMP
elsif pattern.length < 5
:short_pattern # Use built-in
else
:normal # Use Boyer-Moore
end
end
Reference
Algorithm Comparison
| Algorithm | Time Complexity | Space Complexity | Best Use Case |
|---|---|---|---|
| Naive | O(nm) worst case | O(1) | Very short patterns, one-time searches |
| KMP | O(n+m) all cases | O(m) | Streaming, small alphabets, repetitive patterns |
| Boyer-Moore | O(n/m) to O(nm) | O(m+σ) | Large alphabets, long patterns, random access |
Time Complexity Breakdown
| Phase | KMP | Boyer-Moore | Notes |
|---|---|---|---|
| Preprocessing | O(m) | O(m+σ) | σ is alphabet size |
| Search Best | O(n) | O(n/m) | BM can skip characters |
| Search Average | O(n) | O(n) | BM typically faster constant |
| Search Worst | O(n) | O(nm) | BM worst case rare in practice |
Pattern Characteristics Decision Matrix
| Pattern Trait | Recommended Algorithm | Reasoning |
|---|---|---|
| High repetition | KMP | LPS array handles repeated prefixes |
| Distinct characters | Boyer-Moore | Maximum skip distances |
| Small alphabet | KMP | Limited skip potential |
| Large alphabet | Boyer-Moore | Frequent skip opportunities |
| Length under 5 | Built-in methods | Preprocessing overhead not justified |
| Length 5-20 | Either algorithm | Both perform well |
| Length over 20 | Boyer-Moore | Large skip distances |
Ruby String Method Quick Reference
| Method | Purpose | Example |
|---|---|---|
| index | Find first occurrence | text.index(pattern) |
| rindex | Find last occurrence | text.rindex(pattern) |
| scan | Find all non-overlapping | text.scan(pattern) |
| match | Regular expression match | text.match(/pattern/) |
| include? | Check existence | text.include?(pattern) |
KMP Algorithm Steps
| Step | Action | Purpose |
|---|---|---|
| 1 | Build LPS array | Identify pattern self-similarities |
| 2 | Initialize text and pattern pointers | Track current positions |
| 3 | Compare characters | Check for match |
| 4 | On match, advance both pointers | Continue comparison |
| 5 | On full pattern match, record position | Found occurrence |
| 6 | On mismatch, use LPS to shift pattern | Skip redundant comparisons |
| 7 | Continue until text exhausted | Complete search |
Boyer-Moore Bad Character Rule
| Scenario | Action | Example |
|---|---|---|
| Character in pattern | Shift to align rightmost occurrence | Pattern EXAMPLE, mismatch on X shifts to align previous X |
| Character not in pattern | Shift entire pattern past character | Pattern EXAMPLE, mismatch on Z shifts 7 positions |
| Multiple occurrences | Use rightmost position | Pattern ABCABC, use second occurrence |
Implementation Checklist
| Requirement | KMP | Boyer-Moore |
|---|---|---|
| Build failure function | LPS array with dynamic programming | Bad character table with hash |
| Handle empty pattern | Return empty array or raise error | Return empty array or raise error |
| Handle empty text | Return empty array | Return empty array |
| Handle single character | Ensure loop conditions work | Ensure loop conditions work |
| Test overlapping matches | Verify LPS-based shifts | Verify shift calculations |
| Validate encoding | Normalize to single encoding | Normalize to single encoding |
| Check boundary conditions | Pattern longer than text | Pattern longer than text |
Common Pattern Analysis
| Pattern Type | Characteristics | Example | Best Algorithm |
|---|---|---|---|
| DNA sequence | 4-character alphabet, repetitive | ATCGATCGATCG | KMP |
| English word | 26+ alphabet, distinct | EXAMPLE | Boyer-Moore |
| Error code | Mixed alphanumeric, medium length | ERR_404_NOT_FOUND | Boyer-Moore |
| Binary data | 2-character alphabet | 10101010 | KMP |
| Log timestamp | Structured, fixed format | 2024-01-01 | Boyer-Moore |
| Hash prefix | Hex characters, distinct | a3f5c9 | Boyer-Moore |
Performance Optimization Strategies
| Technique | Application | Benefit |
|---|---|---|
| Pattern analysis | Select algorithm based on characteristics | Optimal average performance |
| Quick character existence check | Verify all pattern characters in text | Early rejection |
| Preprocessing cache | Reuse LPS/bad char tables | Amortize preprocessing cost |
| Chunk buffering | Maintain overlap in streaming | Handle boundary matches |
| Encoding normalization | Ensure consistent representation | Prevent false negatives |
| Skip trivial cases | Handle empty inputs separately | Reduce branch overhead |