Overview
Parser architecture in Ruby encompasses the design patterns, classes, and methodologies for transforming unstructured or semi-structured data into usable objects. Ruby provides multiple parsing approaches through its standard library and core language features, from simple string manipulation to sophisticated parser combinators.
The foundation of Ruby's parsing architecture rests on several key components. The String class offers basic tokenization through methods like #split, #scan, and #match. Regular expressions provide pattern-based parsing through the Regexp class and literal syntax. The standard library includes specialized parsers for common formats: JSON, CSV, YAML, and URI.
Ruby's parsing architecture supports both imperative and declarative approaches. Imperative parsing involves step-by-step data transformation using conditional logic and iteration. Declarative parsing defines grammar rules and transformations that the parser applies automatically.
# Basic string tokenization
input = "name:John,age:30,city:Boston"
tokens = input.split(",").map { |pair| pair.split(":") }
# => [["name", "John"], ["age", "30"], ["city", "Boston"]]
# Pattern-based parsing with regex
email_pattern = /(\w+)@(\w+\.\w+)/
"Contact: john@example.com".match(email_pattern)
# => #<MatchData "john@example.com" 1:"john" 2:"example.com">
# Standard library parser
require 'json'
JSON.parse('{"name": "John", "age": 30}')
# => {"name"=>"John", "age"=>30}
Parser architecture in Ruby commonly involves creating custom parser classes that encapsulate parsing logic, maintain state during parsing operations, and provide error handling. These parsers typically implement methods for tokenization, syntactic analysis, and semantic transformation.
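As a minimal sketch of that structure (the class and method names are illustrative, not a standard API), each phase can live in its own method:
# Skeleton parser with separate tokenization, analysis, and transformation phases
class KeyValueParser
  def initialize(input)
    @input = input
  end

  def parse
    tokens = tokenize(@input)   # lexical phase
    pairs  = analyze(tokens)    # syntactic phase
    transform(pairs)            # semantic phase
  end

  private

  def tokenize(input)
    input.split(",").map { |pair| pair.split(":", 2) }
  end

  def analyze(tokens)
    tokens.select { |key, value| key && value } # drop malformed pairs
  end

  def transform(pairs)
    pairs.to_h { |key, value| [key.strip.to_sym, value.strip] }
  end
end

KeyValueParser.new("name:John, age:30").parse
# => {:name=>"John", :age=>"30"}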
Basic Usage
Ruby parsing operations begin with input preparation and tokenization. The parsing process typically involves breaking input into meaningful units, analyzing the structure of these units, and transforming them into target objects or data structures.
String-based parsing forms the simplest parsing architecture. Ruby strings support scanning operations that extract tokens based on patterns or delimiters. The #scan method returns all matches of a pattern, while #split divides strings at delimiter boundaries.
# Token extraction with scan
log_line = "2024-01-15 ERROR Database connection failed"
tokens = log_line.scan(/(\d{4}-\d{2}-\d{2})\s+(\w+)\s+(.+)/)
# => [["2024-01-15", "ERROR", "Database connection failed"]]
# Delimiter-based parsing
csv_row = "John,30,Engineer,Boston"
fields = csv_row.split(",")
# => ["John", "30", "Engineer", "Boston"]
# Multiple delimiter handling
config_line = "key=value;timeout=30;retry=true"
pairs = config_line.split(";").map { |pair| pair.split("=") }
# => [["key", "value"], ["timeout", "30"], ["retry", "true"]]
Regular expression parsing provides more sophisticated pattern matching capabilities. Ruby regex objects support named captures, which create more maintainable parsing code by associating semantic names with matched groups.
# Named capture groups for structured parsing
url_pattern = /(?<protocol>https?):\/\/(?<domain>[^\/]+)(?<path>\/.*)?/
url = "https://api.example.com/users/123"
match = url.match(url_pattern)
parsed_url = {
protocol: match[:protocol],
domain: match[:domain],
path: match[:path]
}
# => {:protocol=>"https", :domain=>"api.example.com", :path=>"/users/123"}
Custom parser classes encapsulate parsing logic and maintain parsing state. These classes typically define methods for different parsing phases and provide clear interfaces for consuming parsed data.
class ConfigParser
def initialize(input)
@input = input
@position = 0
@config = {}
end
def parse
lines = @input.split("\n")
lines.each { |line| parse_line(line.strip) }
@config
end
private
def parse_line(line)
return if line.empty? || line.start_with?("#")
if line.include?("=")
key, value = line.split("=", 2)
@config[key.strip] = parse_value(value.strip)
end
end
def parse_value(value)
case value
when /^\d+$/ then value.to_i
when /^\d+\.\d+$/ then value.to_f
when /^(true|false)$/ then value == "true"
else value.gsub(/^["']|["']$/, "")
end
end
end
config_text = <<~CONFIG
# Database settings
host=localhost
port=5432
timeout=30.5
ssl=true
name="production_db"
CONFIG
parser = ConfigParser.new(config_text)
result = parser.parse
# => {"host"=>"localhost", "port"=>5432, "timeout"=>30.5, "ssl"=>true, "name"=>"production_db"}
Advanced Usage
Advanced parser architecture in Ruby involves implementing stateful parsers, parser combinators, and abstract syntax tree construction. These techniques handle complex grammars, nested structures, and context-sensitive parsing requirements.
Stateful parsers maintain context throughout the parsing process, enabling handling of nested structures and context-dependent tokens. State machines provide a formal approach to managing parser state transitions.
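As a small illustration of the state-machine approach, the sketch below (the class name and states are invented for this example) extracts double-quoted strings by switching between three states; the JSONParser that follows instead tracks its state as a cursor position and current character.
# Minimal state machine with three states: :outside, :inside, :escape
class QuotedStringScanner
  def initialize(input)
    @input = input
  end

  def strings
    state  = :outside
    buffer = +""
    found  = []
    @input.each_char do |char|
      case state
      when :outside
        state = :inside if char == '"'
      when :inside
        case char
        when '\\' then state = :escape
        when '"'
          found << buffer
          buffer = +""
          state = :outside
        else
          buffer << char
        end
      when :escape
        buffer << char
        state = :inside
      end
    end
    found
  end
end

QuotedStringScanner.new('say "hello" and "good\\"bye"').strings
# => ["hello", "good\"bye"]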
class JSONParser
def initialize(input)
@input = input
@position = 0
@current_char = @input[0]
end
def parse
skip_whitespace
parse_value
end
private
def advance
@position += 1
@current_char = @position < @input.length ? @input[@position] : nil
end
def skip_whitespace
while @current_char&.match?(/\s/)
advance
end
end
def parse_value
skip_whitespace
case @current_char
when '{' then parse_object
when '[' then parse_array
when '"' then parse_string
when /\d/, '-' then parse_number
when 't', 'f' then parse_boolean
when 'n' then parse_null
else raise "Unexpected character: #{@current_char}"
end
end
def parse_object
advance # skip '{'
obj = {}
skip_whitespace
    if @current_char == '}'
      advance # consume '}' of an empty object
      return obj
    end
loop do
key = parse_string
skip_whitespace
raise "Expected ':'" unless @current_char == ':'
advance
value = parse_value
obj[key] = value
skip_whitespace
break if @current_char == '}'
raise "Expected ',' or '}'" unless @current_char == ','
advance
skip_whitespace
end
advance # skip '}'
obj
end
def parse_string
raise "Expected '\"'" unless @current_char == '"'
advance
result = ""
while @current_char != '"'
if @current_char == '\\'
advance
case @current_char
when '"', '\\', '/' then result += @current_char
when 'n' then result += "\n"
when 't' then result += "\t"
when 'r' then result += "\r"
else raise "Invalid escape sequence"
end
else
result += @current_char
end
advance
end
advance # skip closing '"'
result
end
def parse_number
start_pos = @position
advance if @current_char == '-'
raise "Invalid number" unless @current_char&.match?(/\d/)
advance while @current_char&.match?(/\d/)
if @current_char == '.'
advance
raise "Invalid number" unless @current_char&.match?(/\d/)
advance while @current_char&.match?(/\d/)
end
number_str = @input[start_pos...@position]
number_str.include?('.') ? number_str.to_f : number_str.to_i
end
  # Remaining value types dispatched from parse_value
  def parse_array
    advance # skip '['
    arr = []
    skip_whitespace
    if @current_char == ']'
      advance
      return arr
    end
    loop do
      arr << parse_value
      skip_whitespace
      break if @current_char == ']'
      raise "Expected ',' or ']'" unless @current_char == ','
      advance
    end
    advance # skip ']'
    arr
  end

  def parse_boolean
    if @current_char == 't'
      4.times { advance } # consume "true"
      true
    else
      5.times { advance } # consume "false"
      false
    end
  end

  def parse_null
    4.times { advance } # consume "null"
    nil
  end
end
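A quick check of the parser against a small document (the values are arbitrary):
JSONParser.new('{"name": "John", "tags": ["a", "b"], "active": true}').parse
# => {"name"=>"John", "tags"=>["a", "b"], "active"=>true}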
Parser combinators enable composition of smaller parsers into more complex parsing logic. This functional approach promotes reusable parsing components and declarative grammar specification.
class ParserCombinator
def initialize(&block)
@parser_proc = block
end
def call(input, position = 0)
@parser_proc.call(input, position)
end
def >>(other_parser)
ParserCombinator.new do |input, position|
result1 = call(input, position)
      # `next`, not `return`: this proc runs after the defining method has returned
      next result1 unless result1[:success]
result2 = other_parser.call(input, result1[:position])
if result2[:success]
{
success: true,
value: [result1[:value], result2[:value]],
position: result2[:position]
}
else
result2
end
end
end
def |(other_parser)
ParserCombinator.new do |input, position|
result1 = call(input, position)
      next result1 if result1[:success]
other_parser.call(input, position)
end
end
def many
ParserCombinator.new do |input, position|
results = []
current_position = position
loop do
result = call(input, current_position)
break unless result[:success]
results << result[:value]
current_position = result[:position]
end
{ success: true, value: results, position: current_position }
end
end
end
def char(expected_char)
ParserCombinator.new do |input, position|
if position < input.length && input[position] == expected_char
{ success: true, value: expected_char, position: position + 1 }
else
{ success: false, error: "Expected '#{expected_char}'", position: position }
end
end
end
def regex(pattern)
ParserCombinator.new do |input, position|
substring = input[position..-1]
      match = substring.match(/\A#{pattern}/) # \A anchors to the start of the remaining input
if match
{ success: true, value: match[0], position: position + match[0].length }
else
{ success: false, error: "Pattern #{pattern} not found", position: position }
end
end
end
# Usage example
identifier = regex(/[a-zA-Z][a-zA-Z0-9]*/)
equals = char('=')
number = regex(/\d+/)
assignment = identifier >> equals >> number
result = assignment.call("variable=123", 0)
# => {:success=>true, :value=>[["variable", "="], "123"], :position=>12}
# (each >> produces a pair, so chained sequences nest to the left)
Abstract syntax tree construction transforms parsed tokens into structured representations that preserve semantic relationships. AST nodes encapsulate both data and behavioral methods for tree traversal and transformation.
class ASTNode
attr_reader :type, :value, :children
def initialize(type, value = nil, children = [])
@type = type
@value = value
@children = children
end
def accept(visitor)
visitor.visit(self)
end
def traverse(&block)
block.call(self)
@children.each { |child| child.traverse(&block) }
end
end
class ExpressionParser
def initialize(tokens)
@tokens = tokens
@position = 0
end
def parse
parse_expression
end
private
def current_token
@tokens[@position]
end
def advance
@position += 1
end
def parse_expression
left = parse_term
while current_token&.match?(/[+\-]/)
operator = current_token
advance
right = parse_term
left = ASTNode.new(:binary_op, operator, [left, right])
end
left
end
def parse_term
left = parse_factor
while current_token&.match?(/[*\/]/)
operator = current_token
advance
right = parse_factor
left = ASTNode.new(:binary_op, operator, [left, right])
end
left
end
def parse_factor
token = current_token
advance
case token
when /\d+/
ASTNode.new(:number, token.to_i)
when '('
expr = parse_expression
advance # skip ')'
expr
else
raise "Unexpected token: #{token}"
end
end
end
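The parser above expects a flat array of string tokens, so a simple regex scan is enough to drive it. The snippet below also sketches a small evaluator (a hypothetical EvaluatorVisitor, not part of the parser) to show how #accept supports tree transformation:
# Tokenize, parse, then evaluate the AST through a visitor
tokens = "3 + 4 * 2".scan(/\d+|[+\-*\/()]/)
ast = ExpressionParser.new(tokens).parse

class EvaluatorVisitor
  def visit(node)
    case node.type
    when :number then node.value
    when :binary_op
      left, right = node.children.map { |child| child.accept(self) }
      left.send(node.value, right)
    end
  end
end

ast.accept(EvaluatorVisitor.new)
# => 11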
Error Handling & Debugging
Parser error handling requires distinguishing between syntax errors, semantic errors, and runtime exceptions. Ruby parsers should provide descriptive error messages with location information and recovery strategies for partial parsing scenarios.
Exception hierarchy design creates specific error types for different parsing failures. Custom exception classes carry contextual information about parsing state when errors occur.
class ParseError < StandardError
attr_reader :position, :line, :column, :context
def initialize(message, position: nil, line: nil, column: nil, context: nil)
@position = position
@line = line
@column = column
@context = context
super(build_message(message))
end
private
def build_message(base_message)
parts = [base_message]
parts << "at position #{@position}" if @position
parts << "line #{@line}, column #{@column}" if @line && @column
parts << "context: #{@context}" if @context
parts.join(" ")
end
end
# Named ParseSyntaxError to avoid a superclass mismatch with Ruby's built-in SyntaxError
class ParseSyntaxError < ParseError; end
class UnexpectedTokenError < ParseError; end
class UnexpectedEndOfInputError < ParseError; end
class RobustParser
def initialize(input)
@input = input
@position = 0
@line = 1
@column = 1
@errors = []
end
def parse_with_recovery
begin
parse_document
rescue ParseError => e
@errors << e
attempt_recovery
retry if @position < @input.length
end
{ result: @result, errors: @errors }
end
private
def current_char
return nil if @position >= @input.length
@input[@position]
end
def advance
char = @input[@position]
@position += 1
if char == "\n"
@line += 1
@column = 1
else
@column += 1
end
char
end
def parse_document
# Implementation with error checking
elements = []
while @position < @input.length
begin
elements << parse_element
rescue UnexpectedTokenError => e
@errors << e
skip_to_next_element
end
end
@result = elements
end
def parse_element
skip_whitespace
case current_char
when '{' then parse_object
when '[' then parse_array
when '"' then parse_string
when /\d/, '-' then parse_number
when nil
raise UnexpectedEndOfInputError.new(
"Unexpected end of input",
position: @position,
line: @line,
column: @column
)
else
raise UnexpectedTokenError.new(
"Unexpected character '#{current_char}'",
position: @position,
line: @line,
column: @column,
context: get_context
)
end
end
def get_context(radius = 10)
start_pos = [@position - radius, 0].max
end_pos = [@position + radius, @input.length - 1].min
context = @input[start_pos..end_pos]
marker_pos = @position - start_pos
context.insert(marker_pos, ">>>").insert(marker_pos + 4, "<<<")
end
def skip_to_next_element
while @position < @input.length && !current_char.match?(/[{["0-9-]/)
advance
end
end
def attempt_recovery
# Skip current problematic token and continue
advance if current_char
skip_whitespace
end
end
Validation frameworks provide declarative error checking for parsed data. These frameworks separate parsing logic from validation logic, enabling reusable validation rules across different parsers.
class ValidationRule
def initialize(name, &block)
@name = name
@validator = block
end
def validate(value, context = {})
result = @validator.call(value, context)
return { valid: true } if result == true
{
valid: false,
rule: @name,
message: result.is_a?(String) ? result : "Validation failed for #{@name}"
}
end
end
class ValidationFramework
def initialize
@rules = {}
end
def define_rule(name, &block)
@rules[name] = ValidationRule.new(name, &block)
end
def validate(data, rule_names)
errors = []
rule_names.each do |rule_name|
rule = @rules[rule_name]
next unless rule
result = rule.validate(data[:value], data)
errors << result unless result[:valid]
end
{ valid: errors.empty?, errors: errors }
end
end
# Usage in parser
class ValidatingParser
def initialize
@validator = ValidationFramework.new
setup_validation_rules
end
  def parse_and_validate(input)
    # parse(input) is assumed to return an array of hashes with :value and :validation_rules keys
    parsed_data = parse(input)
parsed_data.map do |item|
validation_result = @validator.validate(item, item[:validation_rules])
item[:validation] = validation_result
item
end
end
private
def setup_validation_rules
@validator.define_rule(:required) do |value, context|
!value.nil? && value != ""
end
@validator.define_rule(:email_format) do |value, context|
value.match?(/\A[\w+\-.]+@[a-z\d\-]+(\.[a-z\d\-]+)*\.[a-z]+\z/i) ||
"Invalid email format"
end
@validator.define_rule(:positive_number) do |value, context|
value.is_a?(Numeric) && value > 0 || "Must be a positive number"
end
end
end
Debugging tools for parsers include token stream visualization, parse tree inspection, and step-by-step execution tracing. These tools help identify where parsing logic deviates from expected behavior.
class DebuggingParser
def initialize(input, debug: false)
@input = input
@debug = debug
@trace = []
@position = 0
end
def parse_with_debugging
debug_log("Starting parse", input_preview: preview_input)
    result = parse_expression # grammar-specific parsing, assumed to be supplied by the concrete parser
if @debug
puts "\n=== Parse Trace ==="
@trace.each_with_index do |entry, index|
puts "#{index + 1}. #{entry[:action]} | #{entry[:details]}"
end
end
result
end
private
def debug_log(action, details = {})
return unless @debug
trace_entry = {
action: action,
position: @position,
current_char: current_char,
details: details
}
@trace << trace_entry
puts "DEBUG: #{action} at position #{@position} | #{details}"
end
def preview_input(radius = 5)
start_pos = [@position - radius, 0].max
end_pos = [@position + radius, @input.length - 1].min
@input[start_pos..end_pos].gsub(/\n/, '\\n')
end
  def current_char
    @input[@position]
  end
end
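Token stream visualization, the first technique mentioned above, can be a few lines on its own. This sketch simply prints each token with the offset where it starts; the token pattern is illustrative:
# Print each token with its starting offset
def visualize_tokens(input, pattern = /\d+|\w+|\S/)
  input.scan(pattern) do
    match = Regexp.last_match
    puts format("%4d  %s", match.begin(0), match[0].inspect)
  end
end

visualize_tokens('timeout = 30')
#    0  "timeout"
#    8  "="
#   10  "30"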
Performance & Memory
Parser performance optimization involves minimizing memory allocations, reducing string operations, and implementing efficient tokenization strategies. Ruby parsers benefit from careful attention to object creation patterns and garbage collection pressure.
String parsing performance depends heavily on avoiding unnecessary string allocations. Ruby strings are mutable, but many parsing operations create new string objects. Using string indexes and ranges instead of substring extraction reduces memory pressure.
class OptimizedParser
def initialize(input)
@input = input.freeze # Prevent accidental modification
@length = input.length
end
# Inefficient: creates many substring objects
def parse_tokens_slow
tokens = []
position = 0
while position < @length
if @input[position].match?(/\w/)
start = position
position += 1 while position < @length && @input[position].match?(/\w/)
tokens << @input[start...position] # Creates new string object
else
position += 1
end
end
tokens
end
# Efficient: uses ranges to defer string creation
def parse_tokens_fast
tokens = []
position = 0
while position < @length
if @input[position].match?(/\w/)
start = position
position += 1 while position < @length && @input[position].match?(/\w/)
tokens << { start: start, end: position } # Store range info
else
position += 1
end
end
# Create strings only when needed
tokens.map { |token| @input[token[:start]...token[:end]] }
end
# Memory-efficient streaming approach
def parse_tokens_streaming(&block)
position = 0
while position < @length
if @input[position].match?(/\w/)
start = position
position += 1 while position < @length && @input[position].match?(/\w/)
# Process token immediately without storing
block.call(@input[start...position])
else
position += 1
end
end
end
end
# Performance comparison
require 'benchmark'
large_input = "word " * 100_000
parser = OptimizedParser.new(large_input)
Benchmark.bm(20) do |x|
x.report("slow (substring)") { parser.parse_tokens_slow }
x.report("fast (ranges)") { parser.parse_tokens_fast }
x.report("streaming") { parser.parse_tokens_streaming { |token| token.length } }
end
Regex performance optimization involves understanding Ruby's regex engine behavior and compilation costs. Precompiled regex objects and anchored patterns improve parsing speed significantly.
class RegexOptimizedParser
# Pre-compile regex patterns to avoid repeated compilation
EMAIL_PATTERN = /\A[\w+\-.]+@[a-z\d\-]+(\.[a-z\d\-]+)*\.[a-z]+\z/i.freeze
PHONE_PATTERN = /\A\+?[\d\s\-\(\)]{10,15}\z/.freeze
URL_PATTERN = /\Ahttps?:\/\/[\w\-\.]+\.[\w]{2,}(\/\S*)?\z/i.freeze
def initialize
@patterns = {
email: EMAIL_PATTERN,
phone: PHONE_PATTERN,
url: URL_PATTERN
}
end
def classify_input(input)
@patterns.each do |type, pattern|
return type if pattern.match?(input)
end
:unknown
end
  # Multi-pattern extraction. The \A...\z anchored constants above can only
  # match a whole string, so free-form text is scanned with unanchored
  # equivalents of the same patterns and each hit is classified afterwards.
  EMAIL_SCAN = /[\w+\-.]+@[a-z\d\-]+(?:\.[a-z\d\-]+)*\.[a-z]+/i.freeze
  URL_SCAN   = /https?:\/\/[\w\-.]+\.\w{2,}(?:\/\S*)?/i.freeze
  PHONE_SCAN = /\+?[\d\s\-()]{10,15}/.freeze

  def extract_structured_data(text)
    results = { emails: [], phones: [], urls: [] }
    text.scan(/#{EMAIL_SCAN}|#{URL_SCAN}|#{PHONE_SCAN}/) do |value|
      case classify_input(value)
      when :email then results[:emails] << value
      when :phone then results[:phones] << value
      when :url   then results[:urls]   << value
      end
    end
    results
  end
end
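With scan-friendly patterns in place, a quick usage check (the sample text is arbitrary):
parser = RegexOptimizedParser.new
parser.classify_input("john@example.com")
# => :email
parser.extract_structured_data("Mail john@example.com or visit https://example.com/docs")
# => {:emails=>["john@example.com"], :phones=>[], :urls=>["https://example.com/docs"]}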
Memory profiling helps identify parsing bottlenecks and excessive object allocation. Ruby provides memory profiling tools that track object creation during parsing operations.
require 'objspace'
class MemoryProfiledParser
def self.profile_parsing(input, &parser_block)
# Track object allocations
ObjectSpace.trace_object_allocations_start
# Record initial memory state
initial_objects = ObjectSpace.count_objects
# Execute parsing
start_time = Time.now
result = parser_block.call(input)
end_time = Time.now
# Record final memory state
final_objects = ObjectSpace.count_objects
ObjectSpace.trace_object_allocations_stop
# Calculate memory usage
object_diff = final_objects.merge(initial_objects) do |key, final_count, initial_count|
final_count - initial_count
end
{
result: result,
execution_time: end_time - start_time,
object_allocations: object_diff.select { |_, count| count > 0 },
total_objects_created: object_diff.values.sum
}
end
def self.benchmark_parsers(input, parsers = {})
results = {}
parsers.each do |name, parser_proc|
GC.start # Ensure clean state
profile_result = profile_parsing(input, &parser_proc)
results[name] = {
time: profile_result[:execution_time],
objects: profile_result[:total_objects_created],
memory_efficiency: profile_result[:execution_time] / profile_result[:total_objects_created]
}
end
results
end
end
# Usage example
large_json = '{"users": [' + ('{"name": "User", "age": 30},' * 1000)[0..-2] + ']}'
parsers = {
json_standard: ->(input) { JSON.parse(input) },
  json_custom: ->(input) { JSONParser.new(input).parse } # the stateful JSONParser from Advanced Usage
}
benchmark_results = MemoryProfiledParser.benchmark_parsers(large_json, parsers)
puts benchmark_results
Reference
Core Parsing Classes
Class | Purpose | Key Methods | Usage Notes |
---|---|---|---|
String | Basic text parsing | #split, #scan, #match, #gsub | Foundation for most parsing operations |
Regexp | Pattern matching | #match, #=~, #match? | Use named captures for maintainable code |
MatchData | Regex match results | #[], #captures, #names, #offset | Access captured groups by index or name |
StringIO | String stream parsing | #read, #readline, #pos, #seek | Useful for parsing large text streams |
Standard Library Parsers
Parser | Module/Class | Primary Method | Input Format | Output Type |
---|---|---|---|---|
JSON | JSON | JSON.parse(string) | JSON text | Hash/Array |
CSV | CSV | CSV.parse(string) | Comma-separated values | Array of Arrays |
YAML | YAML | YAML.safe_load(string) | YAML document | Hash/Array/Scalar |
URI | URI | URI.parse(string) | URL/URI string | URI object |
Time | Time | Time.parse(string) | Time string | Time object |
Date | Date | Date.parse(string) | Date string | Date object |
String Parsing Methods
Method | Signature | Returns | Description |
---|---|---|---|
#split(pattern, limit) | String → Array | Array of substrings | Splits string on pattern matches |
#scan(pattern) | String → Array | Array of matches | Returns all pattern matches |
#match(pattern, pos) | String → MatchData/nil | Match object or nil | First pattern match from position |
#partition(pattern) | String → Array | [before, match, after] | Splits string into three parts |
#rpartition(pattern) | String → Array | [before, match, after] | Right-to-left partition |
#slice(start, length) | String → String/nil | Substring or nil | Extracts a substring by index and length |
#[] with Range | String → String/nil | Substring or nil | Range-based extraction |
Regex Pattern Elements
Element | Syntax | Matches | Example |
---|---|---|---|
Character class | [abc] | Any character in set | [0-9] matches digits |
Negated class | [^abc] | Any character not in set | [^a-z] excludes lowercase |
Word boundary | \b | Position at word edge | \bword\b matches whole word |
Anchors | ^, $ | Start/end of line (use \A and \z for whole strings) | ^\d+$ matches number strings |
Quantifiers | *, +, ?, {n,m} | Repetition patterns | \d{2,4} matches 2-4 digits |
Groups | () | Capture matched text | (\w+)@(\w+) captures email parts |
Named groups | (?<name>...) | Named capture | (?<year>\d{4}) names capture |
Non-capturing | (?:...) | Groups without capture | (?:com\|org) groups alternation |
Common Parsing Patterns
Pattern Type | Implementation | Use Case | Example |
---|---|---|---|
Token scanning | string.scan(/pattern/) | Extract repeated elements | Email addresses, identifiers |
Delimiter splitting | string.split(/\s*,\s*/) | Parse separated values | CSV fields, parameter lists |
State machine | Custom class with state tracking | Complex grammar parsing | JSON, XML, config formats |
Recursive descent | Method calls for grammar rules | Nested structures | Mathematical expressions |
Parser combinators | Functional composition | Declarative grammar | DSL parsing, protocols |
Error Handling Patterns
Error Type | Exception Class | When to Use | Recovery Strategy |
---|---|---|---|
Syntax errors | ParseSyntaxError | Malformed input structure | Skip to next valid token |
Unexpected tokens | UnexpectedTokenError | Wrong token type | Attempt alternate parsing path |
End of input | UnexpectedEndOfInputError | Premature input termination | Use default values |
Validation errors | ValidationError | Semantic constraint violations | Log error, continue parsing |
Encoding errors | EncodingError | Character encoding issues | Convert encoding or skip |
Performance Optimization Guidelines
Technique | Implementation | Performance Gain | Memory Impact |
---|---|---|---|
Precompiled regex | Store regex in constants | 20-50% faster matching | Minimal increase |
Range-based extraction | Use string[start...end] | Avoids string allocation | 50-80% memory reduction |
String freezing | Call #freeze on input | Prevents accidental mutation | Enables string sharing |
Streaming parsing | Process tokens immediately | Constant memory usage | 90%+ memory reduction |
Batch processing | Parse multiple items together | Reduces method call overhead | Varies with batch size |
Debugging Tools and Techniques
Tool | Usage | Output | Best For |
---|---|---|---|
String#inspect | puts string.inspect | Escaped string representation | Debugging invisible characters |
Regexp#source | regex.source | Pattern string | Examining dynamic regex |
MatchData#inspect | match.inspect | Match details with positions | Understanding capture groups |
ObjectSpace profiling | ObjectSpace.trace_object_allocations | Memory allocation tracking | Performance optimization |
Custom trace logging | Add debug output to parser | Step-by-step execution log | Logic flow debugging |