CrackedRuby CrackedRuby

String Searching (KMP, Boyer-Moore)

Overview

String searching algorithms locate occurrences of a pattern within a larger text string. The naive approach compares the pattern character-by-character at each position in the text, resulting in O(n*m) time complexity where n is the text length and m is the pattern length. Advanced algorithms like Knuth-Morris-Pratt (KMP) and Boyer-Moore reduce this complexity through preprocessing and intelligent skipping strategies.

KMP, developed in 1977, preprocesses the pattern to identify self-similarities, eliminating redundant comparisons when mismatches occur. The algorithm maintains O(n+m) time complexity by never backtracking in the text string, making it ideal for streaming data or situations where re-reading text is expensive.

Boyer-Moore, published in 1977, takes a different approach by scanning the pattern from right to left and using two heuristics—the bad character rule and the good suffix rule—to skip sections of text. In practice, Boyer-Moore often performs better than KMP on natural language text, sometimes achieving sublinear time complexity by skipping multiple characters at once.

text = "ABABDABACDABABCABAB"
pattern = "ABABCABAB"

# Naive search requires checking each position
# Advanced algorithms skip intelligently based on pattern analysis

Ruby's built-in String#index uses optimized C implementations, but understanding these algorithms provides insight into pattern matching complexity and enables custom implementations for specialized requirements.

Key Principles

String searching algorithms optimize the matching process through pattern analysis and strategic skipping. The fundamental challenge is avoiding redundant comparisons when a mismatch occurs. Naive algorithms restart from the next position in the text, wasting information gained from partial matches.

KMP Preprocessing and Failure Function

KMP constructs a failure function (also called prefix function or LPS array for Longest Proper Prefix which is also Suffix) during preprocessing. This array stores, for each position in the pattern, the length of the longest proper prefix of the substring that is also a suffix. When a mismatch occurs, this function determines how far to shift the pattern without missing potential matches.

The failure function captures self-similarity within the pattern. For pattern "ABABCABAB", the failure function reveals that if a mismatch occurs after matching "ABAB", the next comparison can start from position 2 because "AB" appears at both the beginning and end of the matched substring.

# Pattern: A B A B C A B A B
# Index:   0 1 2 3 4 5 6 7 8
# LPS:     0 0 1 2 0 1 2 3 4

# At index 4, longest proper prefix that's also suffix is "" (length 0)
# At index 8, "ABAB" appears at both ends (length 4)

Boyer-Moore Heuristics

Boyer-Moore employs two rules for determining skip distances: the bad character rule and the good suffix rule. The algorithm uses whichever rule suggests a larger skip.

The bad character rule preprocesses the pattern to record the rightmost occurrence of each character. When a mismatch occurs at text position i with character c, the pattern shifts so that the rightmost occurrence of c in the pattern aligns with position i. If c doesn't appear in the pattern, the entire pattern shifts past position i.

# Pattern: "EXAMPLE"
# Bad character table stores rightmost positions:
# E: 6, X: 1, A: 2, M: 3, P: 4, L: 5
# Any character not in pattern: -1

The good suffix rule identifies when a suffix of the matched portion appears elsewhere in the pattern. If characters match from the right but then mismatch, the algorithm shifts the pattern to align any previous occurrence of the matched suffix.

Comparison Strategies

KMP scans left-to-right and never backtracks in the text, making it suitable for sequential access patterns. The algorithm maintains a single pointer in the text that only moves forward, beneficial for processing streams or large files where seeking backwards is costly.

Boyer-Moore scans right-to-left within each pattern position, which seems counterintuitive but enables larger skips. The rightmost character provides maximum information for the bad character rule—if it doesn't match, the entire pattern can often shift by its full length.

Both algorithms guarantee O(n+m) worst-case time complexity, but their practical performance differs. KMP performs consistently across different input types, while Boyer-Moore excels on larger alphabets and longer patterns where skip distances increase.

Ruby Implementation

Ruby implements string searching through several methods, with String#index and String#scan providing built-in functionality. Custom implementations demonstrate algorithmic concepts and enable modifications for specific requirements.

KMP Implementation

class KMPSearch
  def initialize(pattern)
    @pattern = pattern
    @lps = build_lps_array
  end

  def search(text)
    positions = []
    i = 0  # text index
    j = 0  # pattern index

    while i < text.length
      if text[i] == @pattern[j]
        i += 1
        j += 1
      end

      if j == @pattern.length
        positions << i - j
        j = @lps[j - 1]
      elsif i < text.length && text[i] != @pattern[j]
        j = @lps[j - 1] if j != 0
        i += 1 if j == 0
      end
    end

    positions
  end

  private

  def build_lps_array
    lps = Array.new(@pattern.length, 0)
    length = 0
    i = 1

    while i < @pattern.length
      if @pattern[i] == @pattern[length]
        length += 1
        lps[i] = length
        i += 1
      else
        if length != 0
          length = lps[length - 1]
        else
          lps[i] = 0
          i += 1
        end
      end
    end

    lps
  end
end

# Usage
searcher = KMPSearch.new("ABABCABAB")
text = "ABABDABACDABABCABABABABCABAB"
positions = searcher.search(text)
# => [10, 19]

The LPS array construction uses dynamic programming. Each position depends on previous values, with the length variable tracking the current prefix length. When characters match, both pointers advance. On mismatch, the length falls back to the LPS value at the previous position, checking progressively shorter prefixes.

Boyer-Moore Implementation

class BoyerMooreSearch
  def initialize(pattern)
    @pattern = pattern
    @bad_char = build_bad_character_table
  end

  def search(text)
    positions = []
    m = @pattern.length
    n = text.length
    s = 0  # shift of pattern with respect to text

    while s <= n - m
      j = m - 1

      # Scan from right to left
      j -= 1 while j >= 0 && @pattern[j] == text[s + j]

      if j < 0
        # Pattern found
        positions << s
        s += (s + m < n) ? m - @bad_char[text[s + m]] : 1
      else
        # Mismatch - apply bad character rule
        bad_char_shift = j - @bad_char[text[s + j]]
        s += [1, bad_char_shift].max
      end
    end

    positions
  end

  private

  def build_bad_character_table
    table = Hash.new(-1)
    @pattern.chars.each_with_index do |char, index|
      table[char] = index
    end
    table
  end
end

# Usage
searcher = BoyerMooreSearch.new("EXAMPLE")
text = "HERE IS A SIMPLE EXAMPLE IN THE TEXT"
positions = searcher.search(text)
# => [16]

This implementation focuses on the bad character rule, which provides most of Boyer-Moore's practical benefit. The good suffix rule adds complexity with modest improvement in typical cases. The hash-based bad character table handles any character size alphabet efficiently.

Ruby String Methods with Custom Callbacks

class PatternMatcher
  def self.find_all_with_context(text, pattern, context_size = 10)
    matches = []
    index = 0

    while index = text.index(pattern, index)
      start_context = [0, index - context_size].max
      end_context = [text.length, index + pattern.length + context_size].min
      
      matches << {
        position: index,
        context: text[start_context...end_context],
        before: text[start_context...index],
        match: pattern,
        after: text[(index + pattern.length)...end_context]
      }
      
      index += 1
    end

    matches
  end
end

text = "The quick brown fox jumps over the lazy fox near the river"
results = PatternMatcher.find_all_with_context(text, "fox", 8)
# Returns detailed match information with surrounding context

Streaming Search Implementation

class StreamingKMP
  def initialize(pattern)
    @pattern = pattern
    @lps = build_lps_array
    @j = 0  # Persistent pattern index
  end

  def process_chunk(chunk)
    matches = []
    chunk.chars.each_with_index do |char, chunk_index|
      while @j > 0 && char != @pattern[@j]
        @j = @lps[@j - 1]
      end

      if char == @pattern[@j]
        @j += 1
      end

      if @j == @pattern.length
        matches << chunk_index - @j + 1
        @j = @lps[@j - 1]
      end
    end
    matches
  end

  def reset
    @j = 0
  end

  private

  def build_lps_array
    lps = Array.new(@pattern.length, 0)
    length = 0
    i = 1

    while i < @pattern.length
      if @pattern[i] == @pattern[length]
        length += 1
        lps[i] = length
        i += 1
      else
        if length != 0
          length = lps[length - 1]
        else
          lps[i] = 0
          i += 1
        end
      end
    end
    lps
  end
end

# Usage for processing data streams
searcher = StreamingKMP.new("ERROR")
IO.foreach("large_log_file.txt") do |line|
  positions = searcher.process_chunk(line)
  puts "Found ERROR at positions: #{positions}" unless positions.empty?
  searcher.reset  # Reset for each line
end

Practical Examples

DNA Sequence Analysis

Biological sequence analysis requires finding specific patterns in long DNA strings. KMP handles repetitive sequences efficiently since DNA has a small alphabet (A, C, G, T) with frequent repetition.

class DNASearcher
  def initialize
    @cache = {}
  end

  def find_motif(sequence, motif)
    searcher = @cache[motif] ||= KMPSearch.new(motif)
    positions = searcher.search(sequence)
    
    {
      motif: motif,
      occurrences: positions.length,
      positions: positions,
      frequency: positions.length.to_f / sequence.length
    }
  end

  def find_multiple_motifs(sequence, motifs)
    motifs.map { |motif| find_motif(sequence, motif) }
  end
end

dna = "ATCGATCGATCGTAGCTAGCTAGCTGATCGATCG"
searcher = DNASearcher.new

results = searcher.find_multiple_motifs(dna, ["ATCG", "TAGC", "GATC"])
# Identifies all occurrences of each motif

Log File Pattern Detection

Log analysis often requires finding error patterns, timestamps, or specific event sequences across large files. Boyer-Moore performs well here due to longer patterns and larger alphabets.

class LogAnalyzer
  ERROR_PATTERNS = {
    sql_error: "SQL Error:",
    timeout: "Connection timeout",
    memory: "OutOfMemory",
    critical: "CRITICAL:"
  }

  def analyze_file(filepath)
    searchers = ERROR_PATTERNS.transform_values { |pattern| 
      BoyerMooreSearch.new(pattern) 
    }
    
    results = Hash.new { |h, k| h[k] = [] }
    line_number = 0

    File.foreach(filepath) do |line|
      line_number += 1
      searchers.each do |error_type, searcher|
        positions = searcher.search(line)
        positions.each do |pos|
          results[error_type] << {
            line: line_number,
            position: pos,
            content: line.strip
          }
        end
      end
    end

    results
  end

  def self.generate_report(results)
    results.map do |error_type, occurrences|
      {
        type: error_type,
        count: occurrences.length,
        first_occurrence: occurrences.first,
        last_occurrence: occurrences.last
      }
    end
  end
end

analyzer = LogAnalyzer.new
errors = analyzer.analyze_file("production.log")
report = LogAnalyzer.generate_report(errors)

Text Search with Case Insensitivity

Many applications require case-insensitive searching. Pattern preprocessing must account for case variations.

class CaseInsensitiveSearch
  def initialize(pattern)
    @pattern = pattern.downcase
    @kmp = KMPSearch.new(@pattern)
  end

  def search(text)
    normalized_text = text.downcase
    positions = @kmp.search(normalized_text)
    
    positions.map do |pos|
      {
        position: pos,
        original_text: text[pos, @pattern.length],
        matched_pattern: @pattern
      }
    end
  end
end

text = "The Quick BROWN fox and the QUICK rabbit"
searcher = CaseInsensitiveSearch.new("quick")
results = searcher.search(text)
# Finds both "Quick" and "QUICK"

Multi-Pattern Search

Searching for multiple patterns simultaneously requires running separate searches or using more advanced algorithms like Aho-Corasick. For moderate pattern counts, parallel KMP searches work well.

class MultiPatternSearch
  def initialize(patterns)
    @searchers = patterns.map { |p| [p, KMPSearch.new(p)] }.to_h
  end

  def search(text)
    results = {}
    
    @searchers.each do |pattern, searcher|
      positions = searcher.search(text)
      results[pattern] = positions unless positions.empty?
    end

    # Return matches sorted by position
    all_matches = results.flat_map do |pattern, positions|
      positions.map { |pos| {position: pos, pattern: pattern} }
    end
    
    all_matches.sort_by { |m| m[:position] }
  end
end

text = "Programming in Ruby requires understanding Ruby syntax and Ruby idioms"
patterns = ["Ruby", "syntax", "understanding"]
searcher = MultiPatternSearch.new(patterns)
matches = searcher.search(text)
# Returns all matches sorted by position

Design Considerations

Selecting the appropriate string searching algorithm depends on text characteristics, pattern properties, performance requirements, and implementation constraints.

Text and Pattern Characteristics

Small alphabets (binary data, DNA sequences) favor KMP because skip distances in Boyer-Moore remain small. With only 4 characters in DNA or 2 in binary, the bad character rule provides minimal benefit. KMP's consistent O(n+m) performance handles repetitive patterns efficiently.

Large alphabets (natural language, source code) favor Boyer-Moore. English text with 26+ characters plus punctuation and whitespace creates opportunities for large skips. When a character doesn't appear in the pattern, Boyer-Moore skips the entire pattern length.

Pattern length affects algorithm choice. Short patterns (2-5 characters) see minimal benefit from preprocessing overhead. Ruby's built-in methods or naive search suffice. Medium patterns (6-20 characters) benefit from both algorithms. Long patterns (20+ characters) favor Boyer-Moore, where skip distances can be substantial.

Access Pattern Requirements

Sequential-only access requires KMP. When processing streams, pipes, or network data where rewinding is impossible, KMP's forward-only text traversal is essential. The algorithm never needs to re-read previous text positions.

Random access environments suit Boyer-Moore better. When the entire text resides in memory or supports efficient seeking, Boyer-Moore's potential backward movement in the text doesn't impose penalties, and its average-case performance improvements manifest fully.

Preprocessing vs. Search Trade-offs

Single search scenarios may not justify preprocessing overhead. For one-time searches, Ruby's built-in String#index provides optimized C implementations that often outperform pure Ruby algorithm implementations.

Repeated searches with the same pattern amortize preprocessing costs. Applications that search for the same pattern across multiple texts or repeatedly in a single large text benefit from constructing the LPS array or bad character table once.

Performance Profiles

KMP provides predictable performance across all inputs. The O(n+m) guarantee holds for worst-case, average-case, and best-case scenarios. This predictability aids capacity planning and performance analysis.

Boyer-Moore exhibits variable performance. Best-case scenarios approach O(n/m), scanning only every mth character in the text. Worst-case degradation to O(nm) occurs with adversarial patterns, though this is rare in practice. Average-case performance typically exceeds KMP on natural text.

Implementation Complexity

KMP requires understanding the LPS array construction, which can be subtle. The preprocessing logic involves careful pointer manipulation, and errors in LPS construction cause incorrect search results.

Boyer-Moore with only the bad character rule is straightforward to implement. Adding the good suffix rule increases complexity significantly for modest practical gain. Many implementations omit it.

Ruby-Specific Considerations

Ruby strings use variable-width encodings (UTF-8 by default). Character-by-character comparison works correctly, but byte-level access patterns differ from fixed-width encodings. Algorithm implementations must use character indexing ([]) rather than byte operations.

Built-in methods like String#index, String#rindex, and String#scan use optimized C code. Pure Ruby implementations of KMP or Boyer-Moore typically run slower than built-ins for simple searches. Custom algorithms make sense for:

  • Specialized requirements (case-insensitive, fuzzy matching, streaming)
  • Learning and demonstration purposes
  • Pattern statistics or metadata collection during search
  • Integration with custom data structures

Performance Considerations

String searching performance depends on algorithm characteristics, input properties, and implementation details. Understanding these factors guides optimization decisions.

Time Complexity Analysis

KMP guarantees O(n+m) time in all cases:

  • Preprocessing: O(m) to build the LPS array
  • Searching: O(n) with each text character examined at most twice
  • Total: O(n+m) regardless of input

Boyer-Moore theoretical complexity:

  • Preprocessing: O(m + σ) where σ is alphabet size
  • Best case: O(n/m) by skipping pattern-length chunks
  • Average case: O(n) with constant better than KMP on natural text
  • Worst case: O(nm) with adversarial patterns (rare in practice)

Space Complexity

KMP requires O(m) additional space for the LPS array. This array persists across searches when reusing the same pattern.

Boyer-Moore requires O(σ) for the bad character table, typically implemented as a hash in Ruby. For ASCII text, σ = 128 or 256. For Unicode, hash-based storage prevents excessive memory usage.

Practical Performance Factors

Cache locality affects modern CPU performance. KMP's sequential text access patterns align with cache line fetching, reducing cache misses. Boyer-Moore's variable skip distances may cause more cache misses but process fewer total characters.

require 'benchmark'

def benchmark_algorithms(text, pattern, iterations = 1000)
  kmp = KMPSearch.new(pattern)
  bm = BoyerMooreSearch.new(pattern)
  
  Benchmark.bmbm do |x|
    x.report("KMP:") { iterations.times { kmp.search(text) } }
    x.report("Boyer-Moore:") { iterations.times { bm.search(text) } }
    x.report("Ruby index:") { iterations.times { 
      pos = 0
      positions = []
      while pos = text.index(pattern, pos)
        positions << pos
        pos += 1
      end
    }}
  end
end

# Test with different text characteristics
text = File.read("large_file.txt")
pattern = "specific_pattern"
benchmark_algorithms(text, pattern)

Alphabet Size Impact

Small alphabets reduce Boyer-Moore effectiveness. With alphabet size 2, maximum skip distance is 2 regardless of pattern length. DNA searching with 4 characters sees limited benefit from bad character rule.

Large alphabets maximize Boyer-Moore advantage. English text, source code, and general Unicode text create frequent opportunities for large skips when characters don't match.

Pattern Repetition Effects

Patterns with repetition favor KMP. Pattern "AAAAAAB" searching in text "AAAAAAA..." demonstrates KMP's strength—the LPS array enables efficient handling of repeated partial matches.

Distinct character patterns favor Boyer-Moore. Pattern "EXAMPLE" in typical text allows aggressive skipping because characters E, X, M rarely appear in random positions.

Optimization Techniques

Preprocessing pattern metadata accelerates searches:

class OptimizedSearch
  def initialize(pattern)
    @pattern = pattern
    @length = pattern.length
    
    # Fast path checks
    @first_char = pattern[0]
    @last_char = pattern[-1]
    @has_repetition = pattern.chars.uniq.length != pattern.length
    
    # Choose algorithm based on characteristics
    @searcher = if @has_repetition || @length < 8
      KMPSearch.new(pattern)
    else
      BoyerMooreSearch.new(pattern)
    end
  end

  def search(text)
    return [] if text.length < @length
    
    # Quick check if pattern characters exist in text
    pattern_chars = @pattern.chars.to_set
    text_chars = text.chars.to_set
    return [] unless (pattern_chars - text_chars).empty?
    
    @searcher.search(text)
  end
end

Memory Access Patterns

Sequential memory access improves performance. KMP reads text sequentially, benefiting from hardware prefetching. Boyer-Moore's variable skip distances may defeat prefetching but process fewer total bytes.

Ruby Implementation Overhead

Pure Ruby implementations carry interpretation overhead. Method calls, array bounds checking, and dynamic typing add constant factors to theoretical complexity. For performance-critical applications:

# Use native methods when possible
text.scan(/#{Regexp.escape(pattern)}/)  # Fast for simple patterns

# Or drop to C extensions for custom algorithms
require 'inline'

class FastSearch
  inline do |builder|
    builder.c "
      VALUE kmp_search(VALUE text_str, VALUE pattern_str) {
        // C implementation for performance
      }
    "
  end
end

Common Pitfalls

String searching implementations encounter several recurring issues related to edge cases, encoding handling, and algorithmic subtleties.

Empty Pattern and Text Handling

Empty patterns require explicit handling. Mathematically, an empty pattern appears at every position in any text, but most applications interpret empty patterns as returning no matches or raising errors.

class RobustKMP
  def search(text)
    return [] if @pattern.empty?
    return [] if text.empty?
    return [] if text.length < @pattern.length
    
    # Continue with normal search
  end
end

Off-by-One Errors in LPS Construction

The LPS array construction involves careful index management. Common errors include:

# INCORRECT - doesn't handle first position
def build_lps_wrong
  lps = Array.new(@pattern.length)
  lps[0] = 0  # Missing - first position always 0
  # ...
end

# CORRECT
def build_lps_correct
  lps = Array.new(@pattern.length, 0)  # Initialize all to 0
  length = 0
  i = 1  # Start from second position
  # ...
end

Character Encoding Mismatches

Ruby strings use encodings that affect character indexing. ASCII-8BIT, UTF-8, and other encodings interpret bytes differently.

# Encoding mismatch causes incorrect results
text = "café".encode("UTF-8")      # Multi-byte characters
pattern = "caf".encode("ASCII-8BIT")

# Pattern matching fails due to encoding mismatch
# Must ensure consistent encoding:
def search_safe(text, pattern)
  normalized_text = text.encode("UTF-8")
  normalized_pattern = pattern.encode("UTF-8")
  # Perform search with consistent encoding
end

Boyer-Moore Edge Cases

The bad character rule requires careful handling when the mismatched character doesn't appear in the pattern. Shifting past the current position is essential:

# INCORRECT - may shift backward
def search_wrong(text)
  # ...
  bad_char_shift = j - @bad_char[text[s + j]]
  s += bad_char_shift  # Can be negative!
end

# CORRECT - ensure positive shift
def search_correct(text)
  # ...
  bad_char_shift = j - @bad_char[text[s + j]]
  s += [1, bad_char_shift].max  # Never shift backward
end

Overlapping Matches

Deciding how to handle overlapping occurrences affects results. Pattern "AA" in text "AAAA" has three overlapping matches at positions 0, 1, and 2.

class OverlapAwareSearch
  def search_all_overlapping(text)
    positions = []
    @kmp.search(text).each do |pos|
      positions << pos
    end
    positions
  end

  def search_non_overlapping(text)
    positions = []
    i = 0
    while i <= text.length - @pattern.length
      if text[i, @pattern.length] == @pattern
        positions << i
        i += @pattern.length  # Skip past entire match
      else
        i += 1
      end
    end
    positions
  end
end

Unicode Normalization

Unicode characters have multiple representations. The character "é" can be composed (single codepoint U+00E9) or decomposed (e + combining acute accent). Pattern matching may fail without normalization.

require 'unicode/normalize'

class UnicodeAwareSearch
  def initialize(pattern)
    @pattern = pattern.unicode_normalize(:nfc)
    @searcher = KMPSearch.new(@pattern)
  end

  def search(text)
    normalized_text = text.unicode_normalize(:nfc)
    @searcher.search(normalized_text)
  end
end

# Without normalization:
text = "café"  # é as composed character
pattern = "café"  # é as decomposed (e + combining accent)
# May not match without normalization

Streaming State Management

Streaming implementations must preserve state between chunks. Patterns split across chunk boundaries require careful handling:

class StatefulStreamSearch
  def initialize(pattern)
    @pattern = pattern
    @buffer = ""
    @searcher = KMPSearch.new(pattern)
    @global_position = 0
  end

  def process_chunk(chunk)
    # Combine with buffer to handle patterns spanning boundaries
    combined = @buffer + chunk
    positions = @searcher.search(combined)
    
    # Adjust positions to global coordinates
    global_positions = positions.map { |p| p + @global_position - @buffer.length }
    
    # Keep potential partial match at end
    @buffer = combined[-([@pattern.length - 1, 0].max)..-1] || ""
    @global_position += chunk.length
    
    global_positions
  end
end

Performance Degradation with Poor Pattern Selection

Patterns with high repetition can cause performance issues in naive implementations. Pattern "AAAAAB" in text of repeated "A"s triggers worst-case behavior without proper algorithm selection.

# Detect problematic patterns
def analyze_pattern_risk(pattern)
  unique_chars = pattern.chars.uniq.length
  repetition_ratio = 1.0 - (unique_chars.to_f / pattern.length)
  
  if repetition_ratio > 0.7
    :high_repetition  # Use KMP
  elsif pattern.length < 5
    :short_pattern    # Use built-in
  else
    :normal           # Use Boyer-Moore
  end
end

Reference

Algorithm Comparison

Algorithm Time Complexity Space Complexity Best Use Case
Naive O(nm) worst case O(1) Very short patterns, one-time searches
KMP O(n+m) all cases O(m) Streaming, small alphabets, repetitive patterns
Boyer-Moore O(n/m) to O(nm) O(m+σ) Large alphabets, long patterns, random access

Time Complexity Breakdown

Phase KMP Boyer-Moore Notes
Preprocessing O(m) O(m+σ) σ is alphabet size
Search Best O(n) O(n/m) BM can skip characters
Search Average O(n) O(n) BM typically faster constant
Search Worst O(n) O(nm) BM worst case rare in practice

Pattern Characteristics Decision Matrix

Pattern Trait Recommended Algorithm Reasoning
High repetition KMP LPS array handles repeated prefixes
Distinct characters Boyer-Moore Maximum skip distances
Small alphabet KMP Limited skip potential
Large alphabet Boyer-Moore Frequent skip opportunities
Length under 5 Built-in methods Preprocessing overhead not justified
Length 5-20 Either algorithm Both perform well
Length over 20 Boyer-Moore Large skip distances

Ruby String Method Quick Reference

Method Purpose Example
index Find first occurrence text.index(pattern)
rindex Find last occurrence text.rindex(pattern)
scan Find all non-overlapping text.scan(pattern)
match Regular expression match text.match(/pattern/)
include? Check existence text.include?(pattern)

KMP Algorithm Steps

Step Action Purpose
1 Build LPS array Identify pattern self-similarities
2 Initialize text and pattern pointers Track current positions
3 Compare characters Check for match
4 On match, advance both pointers Continue comparison
5 On full pattern match, record position Found occurrence
6 On mismatch, use LPS to shift pattern Skip redundant comparisons
7 Continue until text exhausted Complete search

Boyer-Moore Bad Character Rule

Scenario Action Example
Character in pattern Shift to align rightmost occurrence Pattern EXAMPLE, mismatch on X shifts to align previous X
Character not in pattern Shift entire pattern past character Pattern EXAMPLE, mismatch on Z shifts 7 positions
Multiple occurrences Use rightmost position Pattern ABCABC, use second occurrence

Implementation Checklist

Requirement KMP Boyer-Moore
Build failure function LPS array with dynamic programming Bad character table with hash
Handle empty pattern Return empty array or raise error Return empty array or raise error
Handle empty text Return empty array Return empty array
Handle single character Ensure loop conditions work Ensure loop conditions work
Test overlapping matches Verify LPS-based shifts Verify shift calculations
Validate encoding Normalize to single encoding Normalize to single encoding
Check boundary conditions Pattern longer than text Pattern longer than text

Common Pattern Analysis

Pattern Type Characteristics Example Best Algorithm
DNA sequence 4-character alphabet, repetitive ATCGATCGATCG KMP
English word 26+ alphabet, distinct EXAMPLE Boyer-Moore
Error code Mixed alphanumeric, medium length ERR_404_NOT_FOUND Boyer-Moore
Binary data 2-character alphabet 10101010 KMP
Log timestamp Structured, fixed format 2024-01-01 Boyer-Moore
Hash prefix Hex characters, distinct a3f5c9 Boyer-Moore

Performance Optimization Strategies

Technique Application Benefit
Pattern analysis Select algorithm based on characteristics Optimal average performance
Quick character existence check Verify all pattern characters in text Early rejection
Preprocessing cache Reuse LPS/bad char tables Amortize preprocessing cost
Chunk buffering Maintain overlap in streaming Handle boundary matches
Encoding normalization Ensure consistent representation Prevent false negatives
Skip trivial cases Handle empty inputs separately Reduce branch overhead