CrackedRuby

StringScanner

Overview

StringScanner provides a streaming lexical analysis interface for parsing strings character by character or token by token. Ruby implements StringScanner as part of the standard library, offering stateful string processing with position tracking and pattern matching capabilities.

The StringScanner class maintains an internal pointer that advances through the string as scanning operations succeed. Each scan operation attempts to match a pattern at the current position, advancing the pointer on successful matches while leaving it unchanged on failures. This approach enables building parsers and tokenizers without manually tracking position state.

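StringScanner holds a reference to the string passed to the constructor rather than a frozen copy (the optional dup argument of older versions is obsolete and ignored), so avoid mutating that string while scanning, other than appending through the scanner's own << method. The scanner stores the string, the current byte position, and match data from the most recent scan attempt; a failed attempt clears the match data.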

require 'strscan'

scanner = StringScanner.new("Hello, World!")
scanner.scan(/Hello/) # => "Hello"
scanner.pos           # => 5
scanner.scan(/,/)     # => ","
scanner.rest          # => " World!"

Primary use cases include lexical analysis for domain-specific languages, parsing structured text formats, extracting tokens from log files, and building state machines for text processing. StringScanner excels when parsing requirements involve sequential token extraction with position-dependent logic.

# Parsing key-value pairs
data = "name=John age=30 city=Boston"
scanner = StringScanner.new(data)
pairs = {}

until scanner.eos?
  key = scanner.scan(/\w+/)
  scanner.scan(/=/)
  value = scanner.scan(/\w+/)
  pairs[key] = value
  scanner.scan(/\s+/)  # skip whitespace
end

pairs # => {"name"=>"John", "age"=>"30", "city"=>"Boston"}

The class integrates with Ruby's regular expression engine, supporting capture groups, named captures, and all standard regex features. Scanner positions are byte offsets into the string (charpos reports the character offset), which matters when combining scanner positions with String indexing and slicing on multibyte input.

Basic Usage

StringScanner construction requires a string argument that becomes the scanning target. The scanner initializes with position zero and no match data. Basic scanning operations include scan, check, skip, and position management methods.

The scan method attempts pattern matching at the current position, returning the matched string on success or nil on failure. Successful scans advance the scanner position by the match length. Pattern matching uses Ruby's regular expression engine with full feature support.

require 'strscan'

text = "The year is 2024"
scanner = StringScanner.new(text)

word = scanner.scan(/\w+/)    # => "The"
scanner.scan(/\s+/)           # => " "
scanner.scan(/\w+/)           # => "year"
scanner.scan(/\s+is\s+/)      # => " is "
year = scanner.scan(/\d+/)    # => "2024"

scanner.eos?                  # => true

The check method performs pattern matching without advancing the scanner position. This enables lookahead operations and conditional parsing logic. The skip method advances the position without returning matched content, useful for discarding separators or whitespace.

csv_line = "apple,banana,cherry"
scanner = StringScanner.new(csv_line)
items = []

until scanner.eos?
  item = scanner.scan(/[^,]+/)
  items << item if item
  scanner.skip(/,/)  # advance past comma without capturing
end

items # => ["apple", "banana", "cherry"]
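The CSV loop above exercises skip but not check. As a minimal sketch of lookahead, the following inspects the next character with check before committing to a scan. The input string and the token labels (:number, :word) are illustrative assumptions, not part of the original example.

```ruby
require 'strscan'

scanner = StringScanner.new("42 apples")
tokens = []

until scanner.eos?
  if scanner.check(/\d/)          # lookahead: is a digit next? position is unchanged
    tokens << [:number, scanner.scan(/\d+/).to_i]
  elsif scanner.check(/[a-z]/i)   # lookahead: is a letter next?
    tokens << [:word, scanner.scan(/[a-z]+/i)]
  else
    scanner.getch                 # anything else: consume one character and move on
  end
end

tokens # => [[:number, 42], [:word, "apples"]]
```

Because check never moves the position, each branch can decide how much input to consume on its own.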

Position management methods include pos for current position, pos= for position assignment, reset for returning to the beginning, and terminate for jumping to the end. The rest method returns unscanned content from the current position to the end.

scanner = StringScanner.new("ABCDEFGH")
scanner.scan(/ABC/)     # => "ABC"
scanner.pos             # => 3
scanner.pos = 1         # rewind to position 1
scanner.scan(/BCD/)     # => "BCD"
scanner.rest            # => "EFGH"
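Positions reported by pos are byte offsets, which only becomes visible with multibyte input. The sketch below, using an assumed sample string written with \u escapes, compares pos with charpos and also shows terminate and reset.

```ruby
require 'strscan'

# "héllo wörld": é and ö each occupy two bytes in UTF-8
scanner = StringScanner.new("h\u00E9llo w\u00F6rld")

word     = scanner.scan(/\S+/)   # first run of non-whitespace characters
byte_pos = scanner.pos           # => 6 (byte offset)
char_pos = scanner.charpos       # => 5 (character offset)

scanner.terminate                # jump to the end of the string
at_end = scanner.eos?            # => true

scanner.reset                    # back to the beginning
after_reset = scanner.pos        # => 0
```

When a saved position is later used to slice the string with String methods, byteslice is the counterpart of pos, while ordinary indexing corresponds to charpos.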

StringScanner provides string inspection methods including string for the original input, matched for the most recent successful match, pre_match for content before the last match, and post_match for content after the last match. These methods enable context-aware parsing decisions.

text = "Error: Invalid input on line 42"
scanner = StringScanner.new(text)
scanner.scan(/Error:/)
scanner.scan(/\s+/)
scanner.scan(/\w+/)

scanner.matched      # => "Invalid"
scanner.pre_match    # => "Error: "
scanner.post_match   # => " input on line 42"
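One detail worth checking when relying on these inspection methods is what happens after a failed scan. In the sketch below (sample input assumed), the failed attempt clears the match data instead of keeping the earlier match, while the position stays where it was.

```ruby
require 'strscan'

scanner = StringScanner.new("abc 123")

first = scanner.scan(/[a-z]+/)      # => "abc"
matched_after_success = scanner.matched?   # => true

second = scanner.scan(/\d+/)        # => nil (a space comes next, not a digit)
matched_after_failure = scanner.matched?   # => false

scanner.matched      # => nil
scanner.pre_match    # => nil
scanner.pos          # => 3 (unchanged by the failed scan)
```

Code that needs the previous match after a possible failure should therefore copy matched into a local variable before the next scan.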

Advanced Usage

StringScanner supports complex parsing patterns through method chaining, capture groups, and stateful decision trees. Advanced usage involves combining multiple scan operations, implementing backtracking mechanisms, and building reusable parsing components.

Capture groups enable extracting multiple values from single pattern matches. StringScanner integrates with Ruby's capture group syntax, providing access to numbered and named captures through standard regex mechanisms.

log_entry = "2024-01-15 14:30:22 ERROR Failed to connect to database"
scanner = StringScanner.new(log_entry)

# Parse timestamp with capture groups
if scanner.scan(/(\d{4})-(\d{2})-(\d{2}) (\d{2}):(\d{2}):(\d{2})/)
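  # StringScanner#[] takes a single index or name, not a Range;
  # captures returns all groups of the last match as strings
  year, month, day, hour, minute, second = scanner.captures.map(&:to_i)
  timestamp = Time.new(year, month, day, hour, minute, second)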
end

scanner.scan(/\s+/)
level = scanner.scan(/\w+/)      # => "ERROR"
scanner.scan(/\s+/)
message = scanner.rest           # => "Failed to connect to database"
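Named captures can make the same kind of extraction easier to read. The following is a small sketch with an assumed log line; the group names year, month, and day are illustrative. It relies on StringScanner#[] accepting a Symbol or String name and on captures, which is available in Ruby 2.5 and later.

```ruby
require 'strscan'

scanner = StringScanner.new("2024-01-15 ERROR disk full")
year = month = parts = nil

if scanner.scan(/(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})/)
  year  = scanner[:year].to_i     # named capture looked up by Symbol
  month = scanner["month"].to_i   # named capture looked up by String
  parts = scanner.captures        # every group of the last match, as strings
end

year   # => 2024
month  # => 1
parts  # => ["2024", "01", "15"]
```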

Implementing backtracking requires position management and conditional parsing paths. StringScanner enables saving and restoring positions to implement complex parsing logic with multiple attempt strategies.

def parse_number_or_identifier(scanner)
  start_pos = scanner.pos

  # Try parsing as number first
  if number = scanner.scan(/-?\d+\.?\d*/)
    return [:number, number.include?('.') ? number.to_f : number.to_i]
  end

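  # Backtrack and try identifier. A failed scan leaves the position unchanged,
  # so this reset is a safeguard here; it becomes necessary when an attempt
  # spans several scans
  scanner.pos = start_pos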
  if identifier = scanner.scan(/[a-zA-Z_]\w*/)
    return [:identifier, identifier]
  end

  # No match found
  nil
end

text = "42 hello 3.14 world"
scanner = StringScanner.new(text)
tokens = []

until scanner.eos?
  scanner.skip(/\s+/)
  token = parse_number_or_identifier(scanner)
  break unless token   # stop instead of looping forever on unrecognized input
  tokens << token
end

tokens # => [[:number, 42], [:identifier, "hello"], [:number, 3.14], [:identifier, "world"]]
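Restoring a saved position matters most when an attempt consumes input across several scans before failing. The sketch below assumes a hypothetical helper, parse_pair_or_word, that first tries key=value and then falls back to a bare word; the grammar and sample input are illustrative only.

```ruby
require 'strscan'

# Hypothetical helper: tries "key=value" first, then falls back to a bare word.
def parse_pair_or_word(scanner)
  start_pos = scanner.pos

  key = scanner.scan(/\w+/)
  if key && scanner.scan(/=/) && (value = scanner.scan(/\w+/))
    return [:pair, key, value]
  end

  # The first attempt may have consumed "key" or "key=" before failing,
  # so rewind before trying the alternative.
  scanner.pos = start_pos
  word = scanner.scan(/\w+/)
  word ? [:word, word] : nil
end

scanner = StringScanner.new("a=1 b c=")
tokens = []

loop do
  scanner.skip(/\s+/)
  token = parse_pair_or_word(scanner)
  break unless token   # stop at end of input or at text that cannot be parsed
  tokens << token
end

tokens        # => [[:pair, "a", "1"], [:word, "b"], [:word, "c"]]
scanner.rest  # => "="
```

The trailing "c=" shows the rewind at work: the pair attempt consumes "c=" and then fails, so the position is restored and "c" is returned as a word, leaving "=" unconsumed.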

Building recursive descent parsers with StringScanner involves creating parsing methods that call each other based on grammar rules. This approach handles nested structures and complex language constructs.

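require 'strscan'

class JSONParser
  def initialize(text)
    @scanner = StringScanner.new(text)
  end

  def parse_value
    skip_whitespace

    case
    when @scanner.check(/"/)
      parse_string
    when @scanner.check(/-?\d/)
      parse_number
    when @scanner.check(/\{/)
      parse_object
    when @scanner.check(/\[/)
      parse_array
    when @scanner.scan(/true/)
      true
    when @scanner.scan(/false/)
      false
    when @scanner.scan(/null/)
      nil
    else
      fail_at("a JSON value")
    end
  end

  private

  # Simplified: escape sequences such as \" are not handled
  def parse_string
    @scanner.scan(/"([^"]*)"/) or fail_at("a string")
    @scanner[1]
  end

  def parse_number
    @scanner.scan(/-?\d+\.?\d*/).to_f
  end

  # Comma-separated values between [ and ]; a trailing comma is tolerated
  def parse_array
    @scanner.scan(/\[\s*/)
    items = []
    until @scanner.scan(/\]/)
      items << parse_value
      @scanner.scan(/\s*,\s*/) || @scanner.check(/\s*\]/) || fail_at("',' or ']'")
      skip_whitespace
    end
    items
  end

  def parse_object
    @scanner.scan(/\{\s*/)
    object = {}
    until @scanner.scan(/\}/)
      key = parse_string
      @scanner.scan(/\s*:/) or fail_at("':'")
      object[key] = parse_value
      @scanner.scan(/\s*,\s*/) || @scanner.check(/\s*\}/) || fail_at("',' or '}'")
      skip_whitespace
    end
    object
  end

  def skip_whitespace
    @scanner.skip(/\s*/)
  end

  def fail_at(expected)
    raise ArgumentError, "Expected #{expected} at position #{@scanner.pos}"
  end
end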

StringScanner excels at implementing state machines for complex parsing scenarios. Each state corresponds to a parsing context, with transitions based on pattern matches and current position.

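def parse_config_file(content)
  scanner = StringScanner.new(content)
  config = {}
  state = :expecting_key
  current_section = nil   # keys that precede any [section] are stored under nil
  key = nil

  until scanner.eos?
    case state
    when :expecting_key
      scanner.skip(/\s+/)   # blank lines and indentation
      break if scanner.eos?

      if scanner.scan(/\[([^\]]+)\]/)
        current_section = scanner[1]
        config[current_section] ||= {}
      elsif scanner.scan(/(\w+)[ \t]*=[ \t]*/)
        # [ \t]* instead of \s* so an empty value does not swallow the next line
        key = scanner[1]
        state = :expecting_value
      else
        raise ArgumentError, "Unexpected input at position #{scanner.pos}"
      end

    when :expecting_value
      # An empty value makes scan return nil, so fall back to ""
      value = scanner.scan(/[^\n\r]+/) || ""
      (config[current_section] ||= {})[key] = value.strip
      state = :expecting_key
    end
  end

  # Input ended right after "key =": record an empty value
  (config[current_section] ||= {})[key] = "" if state == :expecting_value
  config
end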

Error Handling & Debugging

StringScanner raises specific exceptions for common error conditions including invalid positions, nil string arguments, and encoding mismatches. Understanding these error patterns enables building robust parsing systems with appropriate error recovery strategies.

The most common StringScanner errors involve position management and string handling. Setting positions beyond string bounds raises RangeError, while passing nil to the constructor raises TypeError. String encoding issues manifest as Encoding::CompatibilityError during pattern matching operations.

require 'strscan'

# Custom exception for parse failures (not part of strscan)
class ParseError < StandardError; end

def safe_scan(scanner, pattern)
  result = scanner.scan(pattern)
  if result.nil?
    raise ParseError, "Expected #{pattern.inspect} at position #{scanner.pos}"
  end
  result
rescue RangeError => e
  raise ParseError, "Position out of bounds: #{e.message}"
rescue Encoding::CompatibilityError => e
  raise ParseError, "Encoding mismatch: #{e.message}"
end

# Usage with error handling
input_text = "hello world"   # sample input
begin
  scanner = StringScanner.new(input_text)
  token = safe_scan(scanner, /\w+/)
rescue ParseError => e
  puts "Parse failed: #{e.message}"
  # Implement recovery strategy
end
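Two of the error conditions described above can be observed directly. This is a small sketch with an assumed three-byte input; it captures the exceptions in local variables rather than letting them propagate.

```ruby
require 'strscan'

scanner = StringScanner.new("abc")

range_error =
  begin
    scanner.pos = 10   # the string is only 3 bytes long
    nil
  rescue RangeError => e
    e
  end

type_error =
  begin
    StringScanner.new(nil)   # nil cannot be converted to a String
    nil
  rescue TypeError => e
    e
  end

range_error.class  # => RangeError
type_error.class   # => TypeError
scanner.pos        # => 0 (the rejected assignment left the position untouched)
```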

Debugging StringScanner operations requires visibility into scanner state, pattern matching attempts, and position changes. Implementing logging wrappers around scan operations provides detailed execution traces for complex parsing scenarios.

class DebuggingScanner
  def initialize(string, debug: false)
    @scanner = StringScanner.new(string)
    @debug = debug
  end

  def scan(pattern)
    start_pos = @scanner.pos
    result = @scanner.scan(pattern)

    if @debug
      status = result ? "SUCCESS" : "FAILED"
      puts "[#{status}] #{pattern.inspect} at pos #{start_pos}"
      puts "  Matched: #{result.inspect}" if result
      puts "  New pos: #{@scanner.pos}"
    end

    result
  end

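  # Delegate everything else (pos, rest, eos?, ...) to the wrapped scanner
  def method_missing(method, *args, &block)
    return super unless @scanner.respond_to?(method)
    @scanner.public_send(method, *args, &block)
  end

  # Keep respond_to? consistent with the delegation above
  def respond_to_missing?(method, include_private = false)
    @scanner.respond_to?(method, include_private) || super
  end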
end

# Debug parsing issues
scanner = DebuggingScanner.new("test123", debug: true)
scanner.scan(/\w+/)    # Shows detailed trace
scanner.scan(/\d+/)

Position tracking errors occur when manual position changes create inconsistent scanner state. Implementing position validation prevents common off-by-one errors and invalid position assignments.

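# Custom exception raised by bounded_scan (not part of strscan)
class ParseError < StandardError; end

class ValidatingScanner
  def initialize(string)
    @scanner = StringScanner.new(string)
    @string = string
  end

  def pos=(new_pos)
    # Scanner positions are byte offsets, so validate against bytesize
    if new_pos < 0 || new_pos > @string.bytesize
      raise ArgumentError, "Position #{new_pos} outside valid range 0..#{@string.bytesize}"
    end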

    @scanner.pos = new_pos
  end

  def bounded_scan(pattern, max_advance: nil)
    start_pos = @scanner.pos
    result = @scanner.scan(pattern)

    # The scanner advances in bytes, so compare the match's byte length
    if result && max_advance && (result.bytesize > max_advance)
      @scanner.pos = start_pos  # rollback
      raise ParseError, "Match exceeded maximum advance of #{max_advance}"
    end

    result
  end
end

Recovery strategies for parsing failures include backtracking to known good positions, skipping problematic input, and implementing error synchronization points. These techniques enable parsers to continue processing after encountering invalid input.

def parse_with_recovery(text, &block)
  scanner = StringScanner.new(text)
  results = []
  errors = []

  until scanner.eos?
    checkpoint = scanner.pos

    begin
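      result = yield scanner
      # Guard against blocks that neither consume input nor raise,
      # which would otherwise loop forever
      raise ParseError, "No progress at position #{checkpoint}" if scanner.pos == checkpoint
      results << result if result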
    rescue ParseError => e
      errors << { position: checkpoint, error: e.message }

      # Recovery: skip to next whitespace or delimiter
      scanner.pos = checkpoint
      scanner.skip(/[^\s,;]*/)  # skip problematic token
      scanner.skip(/\s*[,;]?\s*/)  # skip delimiter
    end
  end

  { results: results, errors: errors }
end

Performance & Memory

StringScanner operations exhibit linear time complexity for most scanning patterns, with performance characteristics depending heavily on regular expression complexity and string length. Understanding these patterns enables optimization strategies for high-throughput text processing applications.

Memory usage in StringScanner is dominated by the input string and the substrings that scanning returns. The scanner keeps a reference to the input string rather than a copy, so the string stays alive for as long as the scanner does, and each call to scan, rest, or a capture accessor allocates a new String. Large input strings consume proportional memory, with additional overhead for match data. Methods such as skip and match? return integers and avoid those allocations.

require 'benchmark'
require 'strscan'
require 'memory_profiler'   # third-party gem: gem install memory_profiler

def benchmark_scanning_patterns
  # Sample text that every pattern below can match at least in part
  small_text = "cat 42 " * 1000      # ~7KB
  large_text = "cat 42 " * 100_000   # ~700KB

  patterns = {
    simple: /\w+/,
    complex: /(?<type>\w+)\s+(?<value>\d+)/,
    alternation: /cat|dog|bird|fish|elephant/
  }

  patterns.each do |name, pattern|
    puts "\n=== Pattern: #{name} ==="

    [small_text, large_text].each_with_index do |text, i|
      size_label = i == 0 ? "Small" : "Large"

      # Timing taken inside the profiler is inflated; compare runs relative to each other
      report = MemoryProfiler.report do
        scanner = StringScanner.new(text)
        count = 0

        time = Benchmark.realtime do
          until scanner.eos?
            if scanner.scan(pattern)
              count += 1
            else
              scanner.getch   # no match here: consume one character so the loop always advances
            end
          end
        end

        puts "#{size_label}: #{count} matches in #{(time * 1000).round(2)}ms"
      end

      puts "Memory: #{report.total_allocated_memsize / 1024}KB allocated"
    end
  end
end

Optimization strategies focus on pattern efficiency, minimizing backtracking, and reusing scanner instances. Complex regular expressions with excessive alternation or nested quantifiers significantly impact scanning performance.

require 'benchmark'
require 'strscan'

# Inefficient: nested quantifiers invite excessive backtracking
slow_pattern = /(\w+)*\s*=\s*(\w+)*/

# Efficient: possessive quantifiers reduce backtracking
# (whitespace around "=" is kept so all patterns accept the same input)
fast_pattern = /(\w++)\s*=\s*(\w++)/

# Benchmark comparison
def compare_patterns(text, iterations = 10_000)
  patterns = {
    slow: /(\w+)*\s*=\s*(\w+)*/,
    fast: /(\w++)\s*=\s*(\w++)/,
    atomic: /((?>\w+))\s*=\s*((?>\w+))/
  }

  results = {}

  patterns.each do |name, pattern|
    time = Benchmark.realtime do
      iterations.times do
        scanner = StringScanner.new(text)
        scanner.scan(pattern)
      end
    end

    results[name] = (time * 1000).round(2)
  end

  results
end

text = "variable = value"
results = compare_patterns(text)
puts "Pattern performance (ms): #{results}"

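Scanner instance reuse avoids allocating a new StringScanner object for each input in repetitive parsing tasks. Assigning a new string through string= resets the position and clears match data. Because the scanner does not copy its input, the saving is one small object per input regardless of string size, so measure before relying on it in high-frequency code.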

class OptimizedParser
  def initialize
    @scanner = StringScanner.new("")
  end

  def parse_lines(lines)
    results = []

    lines.each do |line|
      # Reuse scanner instance; string= also resets the position and
      # clears match data, so no separate reset call is needed
      @scanner.string = line

      result = parse_single_line(@scanner)
      results << result if result
    end

    results
  end

  private

  def parse_single_line(scanner)
    # Parse logic here
    scanner.scan(/\w+/)
  end
end

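# Batch processing: File.foreach yields one line at a time, so the file is
# never loaded as a whole; only the parsed results accumulate in memory
parser = OptimizedParser.new
results = []

File.foreach('large_file.txt') do |line|
  # parse_lines always returns an Array, so it can be appended directly.
  # Ruby's garbage collector reclaims the per-line strings on its own;
  # forcing GC.start here is rarely needed.
  results.concat(parser.parse_lines([line]))
end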
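The reuse pattern depends on string= putting the scanner back into a clean state. The following sketch, with two assumed sample inputs, shows the position and match data immediately after the swap.

```ruby
require 'strscan'

scanner = StringScanner.new("first line")
first_word = scanner.scan(/\w+/)        # => "first"

scanner.string = "second line"          # swap in new input
pos_after_swap     = scanner.pos        # => 0
matched_after_swap = scanner.matched    # => nil

second_word = scanner.scan(/\w+/)       # => "second"
```

Nothing from the previous input leaks into the next parse, which is what makes a single long-lived scanner safe to share across lines within one thread.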

Large file processing requires streaming approaches to manage memory consumption. Processing files in chunks prevents memory exhaustion while maintaining parsing accuracy across chunk boundaries.

require 'strscan'

# Placeholder for application-specific parsing; replace with real logic
def parse_token(line)
  line
end

# Assumes UTF-8, newline-delimited input
def stream_parse_large_file(filename, chunk_size: 8192)
  buffer = String.new   # binary buffer: IO#read(length) returns binary strings
  results = []

  File.open(filename, 'rb') do |file|
    while (chunk = file.read(chunk_size))
      buffer << chunk

      # Process complete lines only
      scanner = StringScanner.new(buffer)
      while (line = scanner.scan(/[^\n]*\n/))
        results << parse_token(line.force_encoding(Encoding::UTF_8).strip)
      end

      # Keep the incomplete trailing line for the next iteration
      buffer = scanner.rest
    end

    # Process remaining buffer content (a final line without a newline)
    unless buffer.empty?
      results << parse_token(buffer.force_encoding(Encoding::UTF_8).strip)
    end
  end

  results
end

Reference

Core Methods

| Method | Parameters | Returns | Description |
| --- | --- | --- | --- |
| .new(string) | string (String) | StringScanner | Creates scanner for string |
| #scan(pattern) | pattern (Regexp) | String or nil | Matches pattern at current position and advances |
| #check(pattern) | pattern (Regexp) | String or nil | Matches without advancing position |
| #skip(pattern) | pattern (Regexp) | Integer or nil | Advances position by match length |
| #scan_until(pattern) | pattern (Regexp) | String or nil | Scans ahead to the first match and returns everything up to and including it |
| #skip_until(pattern) | pattern (Regexp) | Integer or nil | Skips ahead through the first match and returns the distance advanced |
| #match?(pattern) | pattern (Regexp) | Integer or nil | Returns match length without advancing |
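The methods in this table that the earlier sections do not demonstrate can be sketched on a short assumed input as follows.

```ruby
require 'strscan'

scanner = StringScanner.new("key: value; next")

length    = scanner.match?(/\w+/)     # => 3, length of "key"
pos_after = scanner.pos               # => 0, match? does not advance
head      = scanner.scan_until(/:/)   # => "key:"
skipped   = scanner.skip_until(/;/)   # => 7, the length of " value;"
rest      = scanner.rest              # => " next"
```

Unlike scan and skip, the _until variants search forward from the current position instead of requiring the match to start there.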

Position Management

| Method | Parameters | Returns | Description |
| --- | --- | --- | --- |
| #pos | None | Integer | Current byte position in string |
| #pos=(position) | position (Integer) | Integer | Sets current byte position |
| #reset | None | StringScanner | Resets position to beginning |
| #terminate | None | StringScanner | Sets position to end of string |
| #eos? | None | Boolean | Returns true if at end of string |
| #beginning_of_line? | None | Boolean | Returns true if at line start |

String Access

| Method | Parameters | Returns | Description |
| --- | --- | --- | --- |
| #string | None | String | Returns the string being scanned |
| #string=(new_string) | new_string (String) | String | Replaces scanned string and resets the scanner |
| #rest | None | String | Returns unscanned portion |
| #rest_size | None | Integer | Returns byte length of unscanned portion |
| #peek(length) | length (Integer) | String | Returns next length bytes without advancing |
| #getch | None | String or nil | Returns and advances one character |
| #get_byte | None | String or nil | Returns and advances one byte (getbyte is an obsolete alias) |
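The difference between character-oriented and byte-oriented access shows up only with multibyte input. In this sketch (sample string assumed, written with a \u escape), peek and rest_size count bytes while getch returns a whole character, and get_byte returns a one-byte String.

```ruby
require 'strscan'

# "héllo": 5 characters, 6 bytes in UTF-8
scanner = StringScanner.new("h\u00E9llo")

first     = scanner.getch       # => "h"
remaining = scanner.rest_size   # => 5 (bytes left: é takes two)
peeked    = scanner.peek(2)     # => "é" (the next two bytes, position unchanged)
second    = scanner.getch       # => "é" (one character, two bytes)
byte      = scanner.get_byte    # => "l" (a one-byte String)
```

Calling get_byte in the middle of a multibyte character would return a fragment that is not valid UTF-8, so getch is usually the safer choice for text.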

Match Information

| Method | Parameters | Returns | Description |
| --- | --- | --- | --- |
| #matched | None | String or nil | Returns last matched string |
| #matched? | None | Boolean | Returns true if last operation matched |
| #matched_size | None | Integer or nil | Returns byte length of last match |
| #pre_match | None | String or nil | Returns string before last match |
| #post_match | None | String or nil | Returns string after last match |
| #[](index) | index (Integer, Symbol, or String) | String or nil | Returns capture group by index or name |
Common Patterns

| Pattern Type | Example | Description |
| --- | --- | --- |
| Word boundaries | /\b\w+\b/ | Complete words only |
| Numbers | /[+-]?\d*\.?\d+/ | Integer or floating point |
| Quoted strings | /"[^"]*"/ | Simple double-quoted strings |
| Whitespace | /\s+/ | One or more whitespace characters |
| Line endings | /\r?\n/ | Cross-platform line breaks |
| Identifiers | /[a-zA-Z_]\w*/ | Programming language identifiers |

Error Types

| Exception | Cause | Recovery Strategy |
| --- | --- | --- |
| TypeError | nil string argument | Validate input before scanning |
| RangeError | Invalid position assignment | Check bounds before setting position |
| Encoding::CompatibilityError | Encoding mismatch | Convert strings to compatible encoding |
| RegexpError | Invalid regular expression | Validate patterns before use |

State Inspection

scanner = StringScanner.new("Hello World")
scanner.scan(/Hello/)

# Current state
scanner.pos           # => 5
scanner.matched       # => "Hello"
scanner.rest          # => " World"
scanner.eos?          # => false

# Match details
scanner.pre_match     # => ""
scanner.post_match    # => " World"
scanner.matched_size  # => 5

Performance Characteristics

| Operation | Time Complexity | Space Complexity | Notes |
| --- | --- | --- | --- |
| #scan | O(n) | O(1) | Depends on regex complexity |
| #check | O(n) | O(1) | No position advancement |
| #skip | O(n) | O(1) | Returns match length only |
| #pos= | O(1) | O(1) | Direct position assignment |
| #reset | O(1) | O(1) | Returns to beginning |
| #[](index) | O(1) | O(1) | Capture group access |