Overview
StringScanner provides a streaming lexical analysis interface for parsing strings character by character or token by token. Ruby implements StringScanner as part of the standard library, offering stateful string processing with position tracking and pattern matching capabilities.
The StringScanner class maintains an internal pointer that advances through the string as scanning operations succeed. Each scan operation attempts to match a pattern at the current position, advancing the pointer on successful matches while leaving it unchanged on failures. This approach enables building parsers and tokenizers without manually tracking position state.
StringScanner holds a reference to the string it is given; it does not copy or freeze it, so avoid mutating the string during scanning unless you intend to (for example, appending more input with <<). The scanner stores the string, the current position, and match data from the most recent scanning operation.
require 'strscan'
scanner = StringScanner.new("Hello, World!")
scanner.scan(/Hello/) # => "Hello"
scanner.pos # => 5
scanner.scan(/,/) # => ","
scanner.rest # => " World!"
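The pointer behavior described above can be checked directly. The following is a minimal sketch (the variable names are illustrative and not part of the original example) showing that a failed scan returns nil and leaves the position where it was, because scan only matches at the current position:

```ruby
require 'strscan'

scanner = StringScanner.new("Hello, World!")

# "World" occurs later in the string, but scan only matches at the current position
failed = scanner.scan(/World/)       # => nil
pos_after_failure = scanner.pos      # => 0

matched = scanner.scan(/Hello/)      # => "Hello"
pos_after_success = scanner.pos      # => 5
```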
Primary use cases include lexical analysis for domain-specific languages, parsing structured text formats, extracting tokens from log files, and building state machines for text processing. StringScanner excels when parsing requirements involve sequential token extraction with position-dependent logic.
# Parsing key-value pairs
data = "name=John age=30 city=Boston"
scanner = StringScanner.new(data)
pairs = {}
until scanner.eos?
key = scanner.scan(/\w+/)
scanner.scan(/=/)
value = scanner.scan(/\w+/)
pairs[key] = value
scanner.scan(/\s+/) # skip whitespace
end
pairs # => {"name"=>"John", "age"=>"30", "city"=>"Boston"}
The class integrates with Ruby's regular expression engine, supporting capture groups, named captures, and all standard regex features. After a successful match, scanner[0] returns the whole match, scanner[1], scanner[2], and so on return numbered groups, and scanner[:name] returns a named group.
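As a small illustration of capture access, the sketch below scans a made-up input string; the group names key and value are assumptions chosen for the example:

```ruby
require 'strscan'

scanner = StringScanner.new("width=120")
scanner.scan(/(?<key>\w+)=(?<value>\d+)/)

whole     = scanner[0]        # => "width=120"
first     = scanner[1]        # => "width"
by_symbol = scanner[:key]     # => "width"
by_string = scanner["value"]  # => "120"
```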
Basic Usage
StringScanner construction requires a string argument that becomes the scanning target. The scanner initializes with position zero and no match data. Basic scanning operations include scan, check, skip, and position management methods.
The scan method attempts pattern matching at the current position, returning the matched string on success or nil on failure. Successful scans advance the scanner position by the match length. Pattern matching uses Ruby's regular expression engine with full feature support.
require 'strscan'
text = "The year is 2024"
scanner = StringScanner.new(text)
word = scanner.scan(/\w+/) # => "The"
scanner.scan(/\s+/) # => " "
scanner.scan(/\w+/) # => "year"
scanner.scan(/\s+is\s+/) # => " is "
year = scanner.scan(/\d+/) # => "2024"
scanner.eos? # => true
The check method performs pattern matching without advancing the scanner position. This enables lookahead operations and conditional parsing logic. The skip method advances the position and returns the match length instead of the matched content, which is useful for discarding separators or whitespace.
csv_line = "apple,banana,cherry"
scanner = StringScanner.new(csv_line)
items = []
until scanner.eos?
item = scanner.scan(/[^,]+/)
items << item if item
scanner.skip(/,/) # advance past comma without capturing
end
items # => ["apple", "banana", "cherry"]
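The CSV example shows skip; the lookahead role of check can be sketched in the same way. The tokenize helper and its token labels below are hypothetical, introduced only for illustration. check decides which branch to take without consuming input, and scan then consumes the token:

```ruby
require 'strscan'

# Hypothetical helper: splits input into numbers, words, and single-character symbols.
def tokenize(input)
  scanner = StringScanner.new(input)
  tokens = []
  until scanner.eos?
    next if scanner.skip(/\s+/)          # discard whitespace, then re-check eos?
    if scanner.check(/\d/)               # lookahead: a digit comes next
      tokens << [:number, scanner.scan(/\d+/).to_i]
    elsif scanner.check(/[a-z]/i)        # lookahead: a letter comes next
      tokens << [:word, scanner.scan(/[a-z]+/i)]
    else
      tokens << [:symbol, scanner.getch] # anything else: consume one character
    end
  end
  tokens
end

tokens = tokenize("x = 42")
# => [[:word, "x"], [:symbol, "="], [:number, 42]]
```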
Position management methods include pos for current position, pos= for position assignment, reset for returning to the beginning, and terminate for jumping to the end. The rest method returns unscanned content from the current position to the end.
scanner = StringScanner.new("ABCDEFGH")
scanner.scan(/ABC/) # => "ABC"
scanner.pos # => 3
scanner.pos = 1 # rewind to position 1
scanner.scan(/BCD/) # => "BCD"
scanner.rest # => "EFGH"
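The example above covers pos and pos=. A minimal sketch of the remaining two methods, terminate and reset, on the same input (variable names are illustrative):

```ruby
require 'strscan'

scanner = StringScanner.new("ABCDEFGH")
scanner.scan(/ABC/)

scanner.terminate                  # jump to the end of the string
at_end    = scanner.eos?           # => true
remaining = scanner.rest           # => ""

scanner.reset                      # back to the beginning
pos_after_reset = scanner.pos      # => 0
rescanned = scanner.scan(/ABC/)    # => "ABC"
```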
StringScanner provides string inspection methods including string for the original input, matched for the most recent successful match, pre_match for content before the last match, and post_match for content after the last match. These methods enable context-aware parsing decisions.
text = "Error: Invalid input on line 42"
scanner = StringScanner.new(text)
scanner.scan(/Error:/)
scanner.scan(/\s+/)
scanner.scan(/\w+/)
scanner.matched # => "Invalid"
scanner.pre_match # => "Error: "
scanner.post_match # => " input on line 42"
Advanced Usage
StringScanner supports complex parsing patterns through method chaining, capture groups, and stateful decision trees. Advanced usage involves combining multiple scan operations, implementing backtracking mechanisms, and building reusable parsing components.
Capture groups enable extracting multiple values from single pattern matches. StringScanner exposes the groups of the most recent match through its [] method, which accepts an integer index or a group name, but not a range.
log_entry = "2024-01-15 14:30:22 ERROR Failed to connect to database"
scanner = StringScanner.new(log_entry)
# Parse timestamp with capture groups
if scanner.scan(/(\d{4})-(\d{2})-(\d{2}) (\d{2}):(\d{2}):(\d{2})/)
# scanner[] takes a single index, so collect groups 1..6 one at a time
year, month, day, hour, minute, second = (1..6).map { |i| scanner[i].to_i }
timestamp = Time.new(year, month, day, hour, minute, second)
end
scanner.scan(/\s+/)
level = scanner.scan(/\w+/) # => "ERROR"
scanner.scan(/\s+/)
message = scanner.rest # => "Failed to connect to database"
Implementing backtracking requires position management and conditional parsing paths. StringScanner enables saving and restoring positions to implement complex parsing logic with multiple attempt strategies.
def parse_number_or_identifier(scanner)
start_pos = scanner.pos
# Try parsing as number first
if number = scanner.scan(/-?\d+\.?\d*/)
return [:number, number.include?('.') ? number.to_f : number.to_i]
end
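# Try identifier next. A failed scan does not move the position, so this
# reset is a no-op here; it is kept to show the save/restore pattern
scanner.pos = start_pos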
if identifier = scanner.scan(/[a-zA-Z_]\w*/)
return [:identifier, identifier]
end
# No match found
nil
end
text = "42 hello 3.14 world"
scanner = StringScanner.new(text)
tokens = []
until scanner.eos?
scanner.skip(/\s+/)
if token = parse_number_or_identifier(scanner)
tokens << token
else
scanner.getch # skip an unrecognized character so the loop always advances
end
end
tokens # => [[:number, 42], [:identifier, "hello"], [:number, 3.14], [:identifier, "world"]]
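Restoring a saved position matters most when a rule consumes several tokens before it fails. The sketch below assumes a hypothetical helper, parse_pair_or_word, that first tries a key=value pair and falls back to a bare word; it is an illustration of the save/restore idea rather than a definitive implementation:

```ruby
require 'strscan'

# Hypothetical helper: returns [:pair, key, value], [:word, word], or nil.
def parse_pair_or_word(scanner)
  start_pos = scanner.pos
  key = scanner.scan(/\w+/)
  if key && scanner.skip(/=/) && (value = scanner.scan(/\w+/))
    return [:pair, key, value]
  end
  # The pair rule may have consumed the key (and "="); undo that partial match
  scanner.pos = start_pos
  word = scanner.scan(/\w+/)
  word && [:word, word]
end

scanner = StringScanner.new("debug level=3")
first = parse_pair_or_word(scanner)    # => [:word, "debug"]
scanner.skip(/\s+/)
second = parse_pair_or_word(scanner)   # => [:pair, "level", "3"]
```

With an input such as "a=" the pair rule consumes both the key and the equals sign before failing, and the rollback leaves the "=" unread for the caller to handle.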
Building recursive descent parsers with StringScanner involves creating parsing methods that call each other based on grammar rules. This approach handles nested structures and complex language constructs.
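require 'strscan'

# Simplified JSON parser: no string escapes, no exponents, all numbers become Floats
class JSONParser
  def initialize(text)
    @scanner = StringScanner.new(text)
  end

  def parse_value
    skip_whitespace
    case
    when @scanner.check(/"/) then parse_string
    when @scanner.check(/-?\d/) then parse_number
    when @scanner.check(/\{/) then parse_object
    when @scanner.check(/\[/) then parse_array
    when @scanner.scan(/true/) then true
    when @scanner.scan(/false/) then false
    when @scanner.scan(/null/) then nil
    else raise ArgumentError, "unexpected input at position #{@scanner.pos}"
    end
  end

  private

  def parse_string
    @scanner.scan(/"([^"]*)"/) or
      raise ArgumentError, "unterminated string at position #{@scanner.pos}"
    @scanner[1]
  end

  def parse_number
    @scanner.scan(/-?\d+\.?\d*/).to_f
  end

  # Each element is a value; separators are skipped leniently
  def parse_array
    @scanner.skip(/\[\s*/)
    items = []
    until @scanner.skip(/\]/)
      items << parse_value
      @scanner.skip(/\s*,?\s*/)
    end
    items
  end

  # Each member is a string key, a colon, and a value
  def parse_object
    @scanner.skip(/\{\s*/)
    object = {}
    until @scanner.skip(/\}/)
      skip_whitespace
      key = parse_string
      @scanner.skip(/\s*:\s*/)
      object[key] = parse_value
      @scanner.skip(/\s*,?\s*/)
    end
    object
  end

  def skip_whitespace
    @scanner.skip(/\s*/)
  end
end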
StringScanner excels at implementing state machines for complex parsing scenarios. Each state corresponds to a parsing context, with transitions based on pattern matches and current position.
def parse_config_file(content)
  scanner = StringScanner.new(content)
  config = {}
  state = :expecting_key
  current_section = nil
  key = nil
  until scanner.eos?
    scanner.skip(/\s*/)
    case state
    when :expecting_key
      if scanner.scan(/\[([^\]]+)\]/)
        current_section = scanner[1]
        config[current_section] ||= {}
      elsif scanner.scan(/(\w+)/)
        key = scanner[1]
        scanner.skip(/\s*=\s*/)
        state = :expecting_value
      else
        # Unrecognized line (e.g. a comment): skip it so the loop always advances
        scanner.skip(/[^\n\r]*/)
      end
    when :expecting_value
      value = scanner.scan(/[^\n\r]*/)
      # Keys that appear before any [section] header are stored under nil
      (config[current_section] ||= {})[key] = value.strip
      state = :expecting_key
    end
    scanner.skip(/[\n\r]+/)
  end
  config
end
Error Handling & Debugging
StringScanner raises specific exceptions for common error conditions including invalid positions, nil string arguments, and encoding mismatches. Understanding these error patterns enables building robust parsing systems with appropriate error recovery strategies.
The most common StringScanner errors involve position management and string handling. Setting positions beyond string bounds raises RangeError, while passing nil to the constructor raises TypeError. String encoding issues manifest as Encoding::CompatibilityError during pattern matching operations.
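# Custom error type used by the parsing examples in this section
class ParseError < StandardError; end
def safe_scan(scanner, pattern)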
result = scanner.scan(pattern)
if result.nil?
raise ParseError, "Expected #{pattern} at position #{scanner.pos}"
end
result
rescue RangeError => e
raise ParseError, "Position out of bounds: #{e.message}"
rescue Encoding::CompatibilityError => e
raise ParseError, "Encoding mismatch: #{e.message}"
end
# Usage with error handling
begin
scanner = StringScanner.new(input_text)
token = safe_scan(scanner, /\w+/)
rescue ParseError => e
puts "Parse failed: #{e.message}"
# Implement recovery strategy
end
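The two exceptions named above for position assignment and construction can be reproduced in a few lines. This is a minimal sketch; the variable names are illustrative:

```ruby
require 'strscan'

scanner = StringScanner.new("abc")

range_error = begin
  scanner.pos = 10            # only 0..3 is valid for a 3-byte string
  nil
rescue RangeError => e
  e
end

type_error = begin
  StringScanner.new(nil)      # nil cannot be converted to a String
  nil
rescue TypeError => e
  e
end

scanner.pos = 3               # the end of the string is a valid position
at_end = scanner.eos?         # => true
```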
Debugging StringScanner operations requires visibility into scanner state, pattern matching attempts, and position changes. Implementing logging wrappers around scan operations provides detailed execution traces for complex parsing scenarios.
class DebuggingScanner
def initialize(string, debug: false)
@scanner = StringScanner.new(string)
@debug = debug
end
def scan(pattern)
start_pos = @scanner.pos
result = @scanner.scan(pattern)
if @debug
status = result ? "SUCCESS" : "FAILED"
puts "[#{status}] #{pattern.inspect} at pos #{start_pos}"
puts " Matched: #{result.inspect}" if result
puts " New pos: #{@scanner.pos}"
end
result
end
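def method_missing(method, *args, &block)
if @scanner.respond_to?(method)
@scanner.public_send(method, *args, &block)
else
super
end
end
# Keep respond_to? consistent with the delegation above
def respond_to_missing?(method, include_private = false)
@scanner.respond_to?(method, include_private) || super
end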
end
# Debug parsing issues
scanner = DebuggingScanner.new("test123", debug: true)
scanner.scan(/\w+/) # Shows detailed trace
scanner.scan(/\d+/)
Position tracking errors occur when manual position changes create inconsistent scanner state. Implementing position validation prevents common off-by-one errors and invalid position assignments.
class ValidatingScanner
def initialize(string)
@scanner = StringScanner.new(string)
@string = string
end
def pos=(new_pos)
# pos is a byte offset, so compare against bytesize rather than length
if new_pos < 0 || new_pos > @string.bytesize
raise ArgumentError, "Position #{new_pos} outside valid range 0..#{@string.bytesize}"
end
@scanner.pos = new_pos
end
def bounded_scan(pattern, max_advance: nil)
start_pos = @scanner.pos
result = @scanner.scan(pattern)
if result && max_advance && (result.length > max_advance)
@scanner.pos = start_pos # rollback
raise ParseError, "Match exceeded maximum advance of #{max_advance}"
end
result
end
end
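One common source of off-by-one position errors is that pos counts bytes, while String#length counts characters. A minimal sketch with a UTF-8 string containing one two-byte character (the input is made up for illustration):

```ruby
require 'strscan'

text = "h\u00E9llo"                 # "héllo": 5 characters, 6 bytes in UTF-8
scanner = StringScanner.new(text)
scanner.scan(/h\u00E9/)

byte_pos = scanner.pos              # => 3 ("h" is 1 byte, "é" is 2)
char_pos = scanner.charpos          # => 2
```

When positions are compared against string sizes, bytesize is the matching measure for pos, and charpos is the matching measure for length.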
Recovery strategies for parsing failures include backtracking to known good positions, skipping problematic input, and implementing error synchronization points. These techniques enable parsers to continue processing after encountering invalid input.
def parse_with_recovery(text, &block)
scanner = StringScanner.new(text)
results = []
errors = []
until scanner.eos?
checkpoint = scanner.pos
begin
result = yield scanner
results << result if result
rescue ParseError => e
errors << { position: checkpoint, error: e.message }
# Recovery: skip to next whitespace or delimiter
scanner.pos = checkpoint
scanner.skip(/[^\s,;]*/) # skip problematic token
scanner.skip(/\s*[,;]?\s*/) # skip delimiter
end
end
{ results: results, errors: errors }
end
Performance & Memory
StringScanner operations exhibit linear time complexity for most scanning patterns, with performance characteristics depending heavily on regular expression complexity and string length. Understanding these patterns enables optimization strategies for high-throughput text processing applications.
Memory usage in StringScanner is dominated by the input string and match data storage. The scanner keeps a reference to the input string rather than a copy, so the string stays alive for as long as the scanner does. Methods that return text, such as scan, rest, and matched, allocate new strings, so large inputs with many matches produce proportional allocation, with additional overhead for capture group storage.
require 'benchmark'
require 'strscan'
require 'memory_profiler' # third-party gem: gem install memory_profiler
def benchmark_scanning_patterns
small_text = "word " * 1000 # ~5KB
large_text = "word " * 100_000 # ~500KB
patterns = {
simple: /\w+/,
complex: /(?<type>\w+)\s+(?<value>\d+)/,
alternation: /cat|dog|bird|fish|elephant/
}
patterns.each do |name, pattern|
puts "\n=== Pattern: #{name} ==="
[small_text, large_text].each_with_index do |text, i|
size_label = i == 0 ? "Small" : "Large"
report = MemoryProfiler.report do
scanner = StringScanner.new(text)
count = 0
time = Benchmark.realtime do
until scanner.eos?
if scanner.scan(pattern)
count += 1
elsif !scanner.skip(/\s+/)
scanner.getch # no match here: step one character so the loop always advances
end
end
end
puts "#{size_label}: #{count} matches in #{(time * 1000).round(2)}ms"
end
puts "Memory: #{report.total_allocated_memsize / 1024}KB allocated"
end
end
end
Optimization strategies focus on pattern efficiency, minimizing backtracking, and reusing scanner instances. Complex regular expressions with excessive alternation or nested quantifiers significantly impact scanning performance.
# Inefficient: nested quantifiers invite excessive backtracking
slow_pattern = /(\w+)*\s*=\s*(\w+)*/
# Efficient: possessive quantifiers reduce backtracking
# (whitespace around "=" is allowed so every pattern matches the same input)
fast_pattern = /(\w++)\s*+=\s*+(\w++)/
# Benchmark comparison
def compare_patterns(text, iterations = 10000)
patterns = {
slow: /(\w+)*\s*=\s*(\w+)*/,
fast: /(\w++)\s*+=\s*+(\w++)/,
atomic: /(?>\w+)\s*=\s*(?>\w+)/
}
results = {}
patterns.each do |name, pattern|
time = Benchmark.realtime do
iterations.times do
scanner = StringScanner.new(text)
scanner.scan(pattern)
end
end
results[name] = (time * 1000).round(2)
end
results
end
text = "variable = value"
results = compare_patterns(text)
puts "Pattern performance (ms): #{results}"
Scanner instance reuse reduces allocation overhead for repetitive parsing tasks. Resetting scanner state costs less than creating new instances, particularly for large input strings or high-frequency operations.
class OptimizedParser
def initialize
@scanner = StringScanner.new("")
end
def parse_lines(lines)
results = []
lines.each do |line|
# Reuse scanner instance; string= also resets the position and match data
@scanner.string = line
result = parse_single_line(@scanner)
results << result if result
end
results
end
private
def parse_single_line(scanner)
# Parse logic here
scanner.scan(/\w+/)
end
end
# Memory-efficient batch processing
parser = OptimizedParser.new
results = []
File.foreach('large_file.txt').with_index(1) do |line, line_number|
results.concat(parser.parse_lines([line]))
# Periodic cleanup for long-running processes, triggered by line count
GC.start if (line_number % 10_000).zero?
end
Large file processing requires streaming approaches to manage memory consumption. Processing files in chunks prevents memory exhaustion while maintaining parsing accuracy across chunk boundaries.
# parse_token is a placeholder for the caller's own per-line parsing method
def stream_parse_large_file(filename, chunk_size: 8192)
  buffer = +""
  results = []
  File.open(filename, 'r') do |file|
    # read(n) returns nil at end of file; the data it returns is binary-encoded
    while (chunk = file.read(chunk_size))
      buffer << chunk
      scanner = StringScanner.new(buffer)
      # Process complete lines only
      while (token = scanner.scan(/[^\n]*\n/))
        results << parse_token(token.strip)
      end
      # Keep the incomplete trailing line for the next chunk
      buffer = scanner.rest
    end
  end
  # Process remaining buffer content
  results << parse_token(buffer.strip) unless buffer.empty?
  results
end
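An alternative to rebuilding the scanner for every chunk is to append to the scanned string with StringScanner#<<, which leaves the scan position untouched. The sketch below is one possible shape for this, under the assumption that tokens are newline-terminated lines; lines_from_chunks is a hypothetical helper that takes any enumerable of string chunks:

```ruby
require 'strscan'

# Hypothetical helper: collects complete lines from an enumerable of chunks.
def lines_from_chunks(chunks)
  scanner = StringScanner.new(+"")     # unfrozen string, since << appends in place
  lines = []
  chunks.each do |chunk|
    scanner << chunk                   # append input; the scan position is unaffected
    while (line = scanner.scan(/[^\n]*\n/))
      lines << line.chomp
    end
    scanner.string = scanner.rest      # drop consumed text so the buffer stays small
  end
  lines << scanner.rest unless scanner.eos?   # trailing text without a newline
  lines
end

lines = lines_from_chunks(["ab", "c\nde", "f\n", "tail"])
# => ["abc", "def", "tail"]
```

Reassigning string after each chunk resets the position to zero and keeps only the incomplete tail, so memory use is bounded by the chunk size plus the longest line.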
Reference
Core Methods
Method | Parameters | Returns | Description |
---|---|---|---|
#new(string) | string (String) | StringScanner | Creates scanner for string |
#scan(pattern) | pattern (Regexp) | String or nil | Matches pattern at current position |
#check(pattern) | pattern (Regexp) | String or nil | Matches without advancing position |
#skip(pattern) | pattern (Regexp) | Integer or nil | Advances position by match length |
#scan_until(pattern) | pattern (Regexp) | String or nil | Scans until pattern matches |
#skip_until(pattern) | pattern (Regexp) | Integer or nil | Skips until pattern matches |
#match?(pattern) | pattern (Regexp) | Integer or nil | Returns match length without advancing |
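The earlier sections do not exercise scan_until, skip_until, or match?, so here is a minimal sketch of the three on a made-up input string (variable names are illustrative):

```ruby
require 'strscan'

scanner = StringScanner.new("key: value; next")

head = scanner.scan_until(/:/)        # => "key:" (everything up to and including the match)
pos_after_scan = scanner.pos          # => 4

skipped = scanner.skip_until(/;/)     # => 7 (length of " value;")
pos_after_skip = scanner.pos          # => 11

length = scanner.match?(/\s*\w+/)     # => 5 (length of " next")
pos_after_match = scanner.pos         # => 11 (match? does not advance)
```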
Position Management
Method | Parameters | Returns | Description |
---|---|---|---|
#pos | None | Integer | Current position in string |
#pos=(position) | position (Integer) | Integer | Sets current position |
#reset | None | StringScanner | Resets position to beginning |
#terminate | None | StringScanner | Sets position to end of string |
#eos? | None | Boolean | Returns true if at end of string |
#beginning_of_line? | None | Boolean | Returns true if at line start |
String Access
Method | Parameters | Returns | Description |
---|---|---|---|
#string | None | String | Returns original string |
#string=(new_string) | new_string (String) | String | Replaces scanned string and resets the scanner |
#rest | None | String | Returns unscanned portion |
#rest_size | None | Integer | Returns byte size of unscanned portion |
#peek(length) | length (Integer) | String | Returns next length bytes without advancing |
#getch | None | String or nil | Returns and advances one character |
#get_byte | None | String or nil | Returns the next byte as a String and advances |
Match Information
Method | Parameters | Returns | Description |
---|---|---|---|
#matched | None | String or nil | Returns last matched string |
#matched? | None | Boolean | Returns true if last operation matched |
#matched_size | None | Integer or nil | Returns length of last match |
#pre_match | None | String | Returns string before last match |
#post_match | None | String | Returns string after last match |
#[](index) | index (Integer) | String or nil | Returns capture group by index |
Common Patterns
Pattern Type | Example | Description |
---|---|---|
Word boundaries | /\b\w+\b/ | Complete words only |
Numbers | /[+-]?\d*\.?\d+/ | Integer or floating point |
Quoted strings | /"[^"]*"/ | Simple double-quoted strings |
Whitespace | /\s+/ | One or more whitespace characters |
Line endings | /\r?\n/ | Cross-platform line breaks |
Identifiers | /[a-zA-Z_]\w*/ | Programming language identifiers |
Error Types
Exception | Cause | Recovery Strategy |
---|---|---|
TypeError | nil string argument | Validate input before scanning |
RangeError | Invalid position assignment | Check bounds before setting position |
Encoding::CompatibilityError | Encoding mismatch | Convert strings to compatible encoding |
RegexpError | Invalid regular expression | Validate patterns before use |
State Inspection
scanner = StringScanner.new("Hello World")
scanner.scan(/Hello/)
# Current state
scanner.pos # => 5
scanner.matched # => "Hello"
scanner.rest # => " World"
scanner.eos? # => false
# Match details
scanner.pre_match # => ""
scanner.post_match # => " World"
scanner.matched_size # => 5
Performance Characteristics
Operation | Time Complexity | Space Complexity | Notes |
---|---|---|---|
#scan | O(n) | O(1) | Depends on regex complexity |
#check | O(n) | O(1) | No position advancement |
#skip | O(n) | O(1) | Returns match length only |
#pos= | O(1) | O(1) | Direct position assignment |
#reset | O(1) | O(1) | Returns to beginning |
#[](index) | O(1) | O(1) | Capture group access |