Parser Architecture

Overview

Parser architecture in Ruby encompasses the design patterns, classes, and methodologies for transforming unstructured or semi-structured data into usable objects. Ruby supports multiple parsing approaches through its core language features and standard library, from simple string manipulation to hand-built recursive-descent parsers and parser combinators.

The foundation of Ruby's parsing architecture rests on several key components. The String class offers basic tokenization through methods like #split, #scan, and #match. Regular expressions provide pattern-based parsing through the Regexp class and literal syntax. The standard library includes specialized parsers for common formats: JSON, CSV, YAML, and URI.

Ruby's parsing architecture supports both imperative and declarative approaches. Imperative parsing involves step-by-step data transformation using conditional logic and iteration. Declarative parsing defines grammar rules and transformations that the parser applies automatically.

# Basic string tokenization
input = "name:John,age:30,city:Boston"
tokens = input.split(",").map { |pair| pair.split(":") }
# => [["name", "John"], ["age", "30"], ["city", "Boston"]]

# Pattern-based parsing with regex
email_pattern = /(\w+)@(\w+\.\w+)/
"Contact: john@example.com".match(email_pattern)
# => #<MatchData "john@example.com" 1:"john" 2:"example.com">

# Standard library parser
require 'json'
JSON.parse('{"name": "John", "age": 30}')
# => {"name"=>"John", "age"=>30}

Parser architecture in Ruby commonly involves creating custom parser classes that encapsulate parsing logic, maintain state during parsing operations, and provide error handling. These parsers typically implement methods for tokenization, syntactic analysis, and semantic transformation.
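These three phases can be sketched with a toy parser. The class and phase method names below are illustrative, not a standard API:

```ruby
# A minimal three-phase parser: tokenize, analyze, transform.
class KeyValueParser
  def parse(input)
    tokens = tokenize(input)   # lexical phase
    pairs = analyze(tokens)    # syntactic phase
    transform(pairs)           # semantic phase
  end

  private

  def tokenize(input)
    input.split(",").map(&:strip)
  end

  def analyze(tokens)
    tokens.map { |token| token.split(":", 2) }
  end

  def transform(pairs)
    pairs.to_h { |key, value| [key.to_sym, value] }
  end
end

KeyValueParser.new.parse("name:John,age:30")
# => {:name=>"John", :age=>"30"}
```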

Basic Usage

Ruby parsing operations begin with input preparation and tokenization. The parsing process typically involves breaking input into meaningful units, analyzing the structure of these units, and transforming them into target objects or data structures.

String-based parsing forms the simplest parsing architecture. Ruby strings support scanning operations that extract tokens based on patterns or delimiters. The #scan method returns all matches of a pattern, while #split divides strings at delimiter boundaries.

# Token extraction with scan
log_line = "2024-01-15 ERROR Database connection failed"
tokens = log_line.scan(/(\d{4}-\d{2}-\d{2})\s+(\w+)\s+(.+)/)
# => [["2024-01-15", "ERROR", "Database connection failed"]]

# Delimiter-based parsing
csv_row = "John,30,Engineer,Boston"
fields = csv_row.split(",")
# => ["John", "30", "Engineer", "Boston"]

# Multiple delimiter handling
config_line = "key=value;timeout=30;retry=true"
pairs = config_line.split(";").map { |pair| pair.split("=") }
# => [["key", "value"], ["timeout", "30"], ["retry", "true"]]

Regular expression parsing provides more sophisticated pattern matching capabilities. Ruby regex objects support named captures, which create more maintainable parsing code by associating semantic names with matched groups.

# Named capture groups for structured parsing
url_pattern = /(?<protocol>https?):\/\/(?<domain>[^\/]+)(?<path>\/.*)?/
url = "https://api.example.com/users/123"
match = url.match(url_pattern)

parsed_url = {
  protocol: match[:protocol],
  domain: match[:domain],
  path: match[:path]
}
# => {:protocol=>"https", :domain=>"api.example.com", :path=>"/users/123"}

Custom parser classes encapsulate parsing logic and maintain parsing state. These classes typically define methods for different parsing phases and provide clear interfaces for consuming parsed data.

class ConfigParser
  def initialize(input)
    @input = input
    @position = 0
    @config = {}
  end

  def parse
    lines = @input.split("\n")
    lines.each { |line| parse_line(line.strip) }
    @config
  end

  private

  def parse_line(line)
    return if line.empty? || line.start_with?("#")
    
    if line.include?("=")
      key, value = line.split("=", 2)
      @config[key.strip] = parse_value(value.strip)
    end
  end

  def parse_value(value)
    case value
    when /^\d+$/ then value.to_i
    when /^\d+\.\d+$/ then value.to_f
    when /^(true|false)$/ then value == "true"
    else value.gsub(/^["']|["']$/, "")
    end
  end
end

config_text = <<~CONFIG
  # Database settings
  host=localhost
  port=5432
  timeout=30.5
  ssl=true
  name="production_db"
CONFIG

parser = ConfigParser.new(config_text)
result = parser.parse
# => {"host"=>"localhost", "port"=>5432, "timeout"=>30.5, "ssl"=>true, "name"=>"production_db"}

Advanced Usage

Advanced parser architecture in Ruby involves implementing stateful parsers, parser combinators, and abstract syntax tree construction. These techniques handle complex grammars, nested structures, and context-sensitive parsing requirements.

Stateful parsers maintain context throughout the parsing process, enabling handling of nested structures and context-dependent tokens. State machines provide a formal approach to managing parser state transitions.
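An explicit state machine names each state and transition. This minimal tokenizer sketch (class and state names are illustrative) treats whitespace differently inside and outside quotes:

```ruby
# Explicit state-machine tokenizer for quoted and bare words.
class QuoteTokenizer
  def tokenize(input)
    state = :default
    buffer = +""
    tokens = []

    input.each_char do |char|
      case state
      when :default
        if char == '"'
          state = :in_quotes        # transition on opening quote
        elsif char =~ /\s/
          tokens << buffer unless buffer.empty?
          buffer = +""
        else
          buffer << char
        end
      when :in_quotes
        if char == '"'
          state = :default          # transition on closing quote
          tokens << buffer
          buffer = +""
        else
          buffer << char            # whitespace is literal inside quotes
        end
      end
    end

    tokens << buffer unless buffer.empty?
    tokens
  end
end

QuoteTokenizer.new.tokenize('say "hello world" now')
# => ["say", "hello world", "now"]
```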

class JSONParser
  def initialize(input)
    @input = input
    @position = 0
    @current_char = @input[0]
  end

  def parse
    skip_whitespace
    parse_value
  end

  private

  def advance
    @position += 1
    @current_char = @position < @input.length ? @input[@position] : nil
  end

  def skip_whitespace
    while @current_char&.match?(/\s/)
      advance
    end
  end

  def parse_value
    skip_whitespace
    case @current_char
    when '{' then parse_object
    when '[' then parse_array
    when '"' then parse_string
    when /\d/, '-' then parse_number
    when 't', 'f' then parse_boolean
    when 'n' then parse_null
    else raise "Unexpected character: #{@current_char}"
    end
  end

  def parse_object
    advance # skip '{'
    obj = {}
    skip_whitespace
    
    if @current_char == '}'
      advance # consume '}'
      return obj
    end
    
    loop do
      key = parse_string
      skip_whitespace
      raise "Expected ':'" unless @current_char == ':'
      advance
      value = parse_value
      obj[key] = value
      
      skip_whitespace
      break if @current_char == '}'
      raise "Expected ',' or '}'" unless @current_char == ','
      advance
      skip_whitespace
    end
    
    advance # skip '}'
    obj
  end

  def parse_string
    raise "Expected '\"'" unless @current_char == '"'
    advance
    result = ""
    
    while @current_char != '"'
      raise "Unterminated string" if @current_char.nil?
      if @current_char == '\\'
        advance
        case @current_char
        when '"', '\\', '/' then result += @current_char
        when 'n' then result += "\n"
        when 't' then result += "\t"
        when 'r' then result += "\r"
        else raise "Invalid escape sequence"
        end
      else
        result += @current_char
      end
      advance
    end
    
    advance # skip closing '"'
    result
  end

  def parse_number
    start_pos = @position
    advance if @current_char == '-'
    
    raise "Invalid number" unless @current_char&.match?(/\d/)
    
    advance while @current_char&.match?(/\d/)
    
    if @current_char == '.'
      advance
      raise "Invalid number" unless @current_char&.match?(/\d/)
      advance while @current_char&.match?(/\d/)
    end
    
    number_str = @input[start_pos...@position]
    number_str.include?('.') ? number_str.to_f : number_str.to_i
  end
  def parse_array
    advance # skip '['
    arr = []
    skip_whitespace

    if @current_char == ']'
      advance
      return arr
    end

    loop do
      arr << parse_value
      skip_whitespace
      break if @current_char == ']'
      raise "Expected ',' or ']'" unless @current_char == ','
      advance
    end

    advance # skip ']'
    arr
  end

  def parse_boolean
    if @input[@position, 4] == "true"
      4.times { advance }
      true
    elsif @input[@position, 5] == "false"
      5.times { advance }
      false
    else
      raise "Invalid literal"
    end
  end

  def parse_null
    raise "Invalid literal" unless @input[@position, 4] == "null"
    4.times { advance }
    nil
  end
end

Parser combinators enable composition of smaller parsers into more complex parsing logic. This functional approach promotes reusable parsing components and declarative grammar specification.

class ParserCombinator
  def initialize(&block)
    @parser_proc = block
  end

  def call(input, position = 0)
    @parser_proc.call(input, position)
  end

  def >>(other_parser)
    ParserCombinator.new do |input, position|
      result1 = call(input, position)
      # `next` (not `return`) exits the block; `return` inside the proc
      # would raise LocalJumpError once the defining method has finished
      next result1 unless result1[:success]

      result2 = other_parser.call(input, result1[:position])
      if result2[:success]
        {
          success: true,
          value: [result1[:value], result2[:value]],
          position: result2[:position]
        }
      else
        result2
      end
    end
  end

  def |(other_parser)
    ParserCombinator.new do |input, position|
      result1 = call(input, position)
      next result1 if result1[:success]
      other_parser.call(input, position)
    end
  end

  def many
    ParserCombinator.new do |input, position|
      results = []
      current_position = position
      
      loop do
        result = call(input, current_position)
        break unless result[:success]
        
        results << result[:value]
        current_position = result[:position]
      end
      
      { success: true, value: results, position: current_position }
    end
  end
end

def char(expected_char)
  ParserCombinator.new do |input, position|
    if position < input.length && input[position] == expected_char
      { success: true, value: expected_char, position: position + 1 }
    else
      { success: false, error: "Expected '#{expected_char}'", position: position }
    end
  end
end

def regex(pattern)
  ParserCombinator.new do |input, position|
    substring = input[position..-1]
    # \A anchors at the start of the substring; ^ would also match
    # after any newline
    match = substring.match(/\A#{pattern}/)

    if match
      { success: true, value: match[0], position: position + match[0].length }
    else
      { success: false, error: "Pattern #{pattern} not found", position: position }
    end
  end
end

# Usage example
identifier = regex(/[a-zA-Z][a-zA-Z0-9]*/)
equals = char('=')
number = regex(/\d+/)
assignment = identifier >> equals >> number

result = assignment.call("variable=123", 0)
# >> pairs values left-to-right, so the chained result nests:
# => {:success=>true, :value=>[["variable", "="], "123"], :position=>12}
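The same composition style can be written with bare lambdas. This self-contained sketch (all names illustrative) shows ordered choice and `many`-style repetition:

```ruby
# Minimal combinators as lambdas: each parser takes (input, pos) and
# returns { success:, value:, position: }.
literal = ->(str) {
  ->(input, pos) {
    if input[pos, str.length] == str
      { success: true, value: str, position: pos + str.length }
    else
      { success: false, position: pos }
    end
  }
}

# Ordered choice: try the first parser, fall back to the second
either = ->(a, b) {
  ->(input, pos) {
    result = a.call(input, pos)
    result[:success] ? result : b.call(input, pos)
  }
}

# Zero-or-more repetition of one parser
many = ->(parser) {
  ->(input, pos) {
    values = []
    loop do
      result = parser.call(input, pos)
      break unless result[:success]
      values << result[:value]
      pos = result[:position]
    end
    { success: true, value: values, position: pos }
  }
}

digit_or_dash = either.call(literal.call("1"), literal.call("-"))
sequence = many.call(digit_or_dash)

sequence.call("1-1-1x", 0)
# => {:success=>true, :value=>["1", "-", "1", "-", "1"], :position=>5}
```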

Abstract syntax tree construction transforms parsed tokens into structured representations that preserve semantic relationships. AST nodes encapsulate both data and behavioral methods for tree traversal and transformation.

class ASTNode
  attr_reader :type, :value, :children

  def initialize(type, value = nil, children = [])
    @type = type
    @value = value
    @children = children
  end

  def accept(visitor)
    visitor.visit(self)
  end

  def traverse(&block)
    block.call(self)
    @children.each { |child| child.traverse(&block) }
  end
end

class ExpressionParser
  def initialize(tokens)
    @tokens = tokens
    @position = 0
  end

  def parse
    parse_expression
  end

  private

  def current_token
    @tokens[@position]
  end

  def advance
    @position += 1
  end

  def parse_expression
    left = parse_term
    
    while current_token&.match?(/[+\-]/)
      operator = current_token
      advance
      right = parse_term
      left = ASTNode.new(:binary_op, operator, [left, right])
    end
    
    left
  end

  def parse_term
    left = parse_factor
    
    while current_token&.match?(/[*\/]/)
      operator = current_token
      advance
      right = parse_factor
      left = ASTNode.new(:binary_op, operator, [left, right])
    end
    
    left
  end

  def parse_factor
    token = current_token
    advance
    
    case token
    when /\d+/
      ASTNode.new(:number, token.to_i)
    when '('
      expr = parse_expression
      raise "Expected ')'" unless current_token == ')'
      advance # skip ')'
      expr
    else
      raise "Unexpected token: #{token}"
    end
  end
end
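The resulting tree can be evaluated with a recursive walk. A self-contained sketch using a Struct as a stand-in for ASTNode (the Node struct and evaluate method are illustrative):

```ruby
# Minimal AST evaluation: a recursive walk over binary-op nodes.
Node = Struct.new(:type, :value, :children)

def evaluate(node)
  case node.type
  when :number
    node.value
  when :binary_op
    left = evaluate(node.children[0])
    right = evaluate(node.children[1])
    case node.value
    when "+" then left + right
    when "-" then left - right
    when "*" then left * right
    when "/" then left / right
    end
  end
end

# AST for 2 + 3 * 4, shaped as ExpressionParser would build it
ast = Node.new(:binary_op, "+", [
  Node.new(:number, 2),
  Node.new(:binary_op, "*", [Node.new(:number, 3), Node.new(:number, 4)])
])

evaluate(ast)
# => 14
```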

Error Handling & Debugging

Parser error handling requires distinguishing between syntax errors, semantic errors, and runtime exceptions. Ruby parsers should provide descriptive error messages with location information and recovery strategies for partial parsing scenarios.

Exception hierarchy design creates specific error types for different parsing failures. Custom exception classes carry contextual information about parsing state when errors occur.

class ParseError < StandardError
  attr_reader :position, :line, :column, :context

  def initialize(message, position: nil, line: nil, column: nil, context: nil)
    @position = position
    @line = line
    @column = column
    @context = context
    super(build_message(message))
  end

  private

  def build_message(base_message)
    parts = [base_message]
    parts << "at position #{@position}" if @position
    parts << "line #{@line}, column #{@column}" if @line && @column
    parts << "context: #{@context}" if @context
    parts.join(" ")
  end
end

# Named ParseSyntaxError to avoid a superclass mismatch with Ruby's
# built-in SyntaxError class
class ParseSyntaxError < ParseError; end
class UnexpectedTokenError < ParseError; end
class UnexpectedEndOfInputError < ParseError; end

class RobustParser
  def initialize(input)
    @input = input
    @position = 0
    @line = 1
    @column = 1
    @errors = []
  end

  def parse_with_recovery
    begin
      parse_document
    rescue ParseError => e
      @errors << e
      attempt_recovery
      retry if @position < @input.length
    end
    
    { result: @result, errors: @errors }
  end

  private

  def current_char
    return nil if @position >= @input.length
    @input[@position]
  end

  def advance
    char = @input[@position]
    @position += 1
    
    if char == "\n"
      @line += 1
      @column = 1
    else
      @column += 1
    end
    
    char
  end

  def parse_document
    # Implementation with error checking
    elements = []
    
    while @position < @input.length
      begin
        elements << parse_element
      rescue UnexpectedTokenError => e
        @errors << e
        skip_to_next_element
      end
    end
    
    @result = elements
  end

  def parse_element
    skip_whitespace
    
    case current_char
    when '{' then parse_object
    when '[' then parse_array
    when '"' then parse_string
    when /\d/, '-' then parse_number
    when nil
      raise UnexpectedEndOfInputError.new(
        "Unexpected end of input",
        position: @position,
        line: @line,
        column: @column
      )
    else
      raise UnexpectedTokenError.new(
        "Unexpected character '#{current_char}'",
        position: @position,
        line: @line,
        column: @column,
        context: get_context
      )
    end
  end

  def get_context(radius = 10)
    start_pos = [@position - radius, 0].max
    end_pos = [@position + radius, @input.length - 1].min
    context = @input[start_pos..end_pos]
    marker_pos = @position - start_pos
    context.insert(marker_pos, ">>>").insert(marker_pos + 4, "<<<")
  end

  def skip_to_next_element
    while @position < @input.length && !current_char.match?(/[{["0-9-]/)
      advance
    end
  end

  def attempt_recovery
    # Skip current problematic token and continue
    advance if current_char
    skip_whitespace
  end
  # parse_object, parse_array, parse_string, and parse_number follow the
  # same recursive-descent shape as in JSONParser above
  def skip_whitespace
    advance while current_char&.match?(/\s/)
  end
end

Validation frameworks provide declarative error checking for parsed data. These frameworks separate parsing logic from validation logic, enabling reusable validation rules across different parsers.

class ValidationRule
  def initialize(name, &block)
    @name = name
    @validator = block
  end

  def validate(value, context = {})
    result = @validator.call(value, context)
    return { valid: true } if result == true
    
    {
      valid: false,
      rule: @name,
      message: result.is_a?(String) ? result : "Validation failed for #{@name}"
    }
  end
end

class ValidationFramework
  def initialize
    @rules = {}
  end

  def define_rule(name, &block)
    @rules[name] = ValidationRule.new(name, &block)
  end

  def validate(data, rule_names)
    errors = []
    
    rule_names.each do |rule_name|
      rule = @rules[rule_name]
      next unless rule
      
      result = rule.validate(data[:value], data)
      errors << result unless result[:valid]
    end
    
    { valid: errors.empty?, errors: errors }
  end
end

# Usage in parser
class ValidatingParser
  def initialize
    @validator = ValidationFramework.new
    setup_validation_rules
  end

  def parse_and_validate(input)
    parsed_data = parse(input)
    
    parsed_data.map do |item|
      validation_result = @validator.validate(item, item[:validation_rules])
      item[:validation] = validation_result
      item
    end
  end

  private

  def setup_validation_rules
    @validator.define_rule(:required) do |value, context|
      !value.nil? && value != ""
    end

    @validator.define_rule(:email_format) do |value, context|
      value.match?(/\A[\w+\-.]+@[a-z\d\-]+(\.[a-z\d\-]+)*\.[a-z]+\z/i) ||
        "Invalid email format"
    end

    @validator.define_rule(:positive_number) do |value, context|
      (value.is_a?(Numeric) && value > 0) || "Must be a positive number"
    end
  end
end

Debugging tools for parsers include token stream visualization, parse tree inspection, and step-by-step execution tracing. These tools help identify where parsing logic deviates from expected behavior.

class DebuggingParser
  def initialize(input, debug: false)
    @input = input
    @debug = debug
    @trace = []
    @position = 0
  end

  def parse_with_debugging
    debug_log("Starting parse", input_preview: preview_input)
    
    result = parse_expression # defined by the concrete parser subclass
    
    if @debug
      puts "\n=== Parse Trace ==="
      @trace.each_with_index do |entry, index|
        puts "#{index + 1}. #{entry[:action]} | #{entry[:details]}"
      end
    end
    
    result
  end

  private

  def debug_log(action, details = {})
    return unless @debug
    
    trace_entry = {
      action: action,
      position: @position,
      current_char: current_char,
      details: details
    }
    
    @trace << trace_entry
    puts "DEBUG: #{action} at position #{@position} | #{details}"
  end

  def preview_input(radius = 5)
    start_pos = [@position - radius, 0].max
    end_pos = [@position + radius, @input.length - 1].min
    @input[start_pos..end_pos].gsub(/\n/, '\\n')
  end
  def current_char
    @input[@position]
  end
end

Performance & Memory

Parser performance optimization involves minimizing memory allocations, reducing string operations, and implementing efficient tokenization strategies. Ruby parsers benefit from careful attention to object creation patterns and garbage collection pressure.

String parsing performance depends heavily on avoiding unnecessary string allocations. Ruby strings are mutable, but many parsing operations create new string objects. Using string indexes and ranges instead of substring extraction reduces memory pressure.

class OptimizedParser
  def initialize(input)
    @input = input.freeze # Prevent accidental modification
    @length = input.length
  end

  # Inefficient: creates many substring objects
  def parse_tokens_slow
    tokens = []
    position = 0
    
    while position < @length
      if @input[position].match?(/\w/)
        start = position
        position += 1 while position < @length && @input[position].match?(/\w/)
        tokens << @input[start...position] # Creates new string object
      else
        position += 1
      end
    end
    
    tokens
  end

  # Efficient: uses ranges to defer string creation
  def parse_tokens_fast
    tokens = []
    position = 0
    
    while position < @length
      if @input[position].match?(/\w/)
        start = position
        position += 1 while position < @length && @input[position].match?(/\w/)
        tokens << { start: start, end: position } # Store range info
      else
        position += 1
      end
    end
    
    # Create strings only when needed
    tokens.map { |token| @input[token[:start]...token[:end]] }
  end

  # Memory-efficient streaming approach
  def parse_tokens_streaming(&block)
    position = 0
    
    while position < @length
      if @input[position].match?(/\w/)
        start = position
        position += 1 while position < @length && @input[position].match?(/\w/)
        
        # Process token immediately without storing
        block.call(@input[start...position])
      else
        position += 1
      end
    end
  end
end

# Performance comparison
require 'benchmark'

large_input = "word " * 100_000
parser = OptimizedParser.new(large_input)

Benchmark.bm(20) do |x|
  x.report("slow (substring)") { parser.parse_tokens_slow }
  x.report("fast (ranges)") { parser.parse_tokens_fast }
  x.report("streaming") { parser.parse_tokens_streaming { |token| token.length } }
end

Regex performance optimization involves understanding Ruby's regex engine behavior and compilation costs. Precompiled regex objects and anchored patterns improve parsing speed significantly.

class RegexOptimizedParser
  # Pre-compile regex patterns to avoid repeated compilation
  EMAIL_PATTERN = /\A[\w+\-.]+@[a-z\d\-]+(\.[a-z\d\-]+)*\.[a-z]+\z/i.freeze
  PHONE_PATTERN = /\A\+?[\d\s\-\(\)]{10,15}\z/.freeze
  URL_PATTERN = /\Ahttps?:\/\/[\w\-\.]+\.[\w]{2,}(\/\S*)?\z/i.freeze

  def initialize
    @patterns = {
      email: EMAIL_PATTERN,
      phone: PHONE_PATTERN,
      url: URL_PATTERN
    }
  end

  def classify_input(input)
    @patterns.each do |type, pattern|
      return type if pattern.match?(input)
    end
    
    :unknown
  end

  # Unanchored variants of the patterns above, for scanning inside larger
  # text: the \A and \z anchors prevent matches mid-string
  EMAIL_SCAN = /[\w+\-.]+@[a-z\d\-]+(?:\.[a-z\d\-]+)*\.[a-z]+/i
  PHONE_SCAN = /\+?\d[\d\s\-()]{8,13}\d/
  URL_SCAN = /https?:\/\/[\w\-.]+\.\w{2,}(?:\/\S*)?/i

  # Efficient extraction with a single scan over the text
  def extract_structured_data(text)
    results = { emails: [], phones: [], urls: [] }

    combined_pattern = /(#{URL_SCAN}|#{EMAIL_SCAN}|#{PHONE_SCAN})/

    text.scan(combined_pattern) do |match|
      value = match[0]
      case classify_input(value)
      when :email then results[:emails] << value
      when :phone then results[:phones] << value
      when :url then results[:urls] << value
      end
    end

    results
  end
end

Memory profiling helps identify parsing bottlenecks and excessive object allocation. Ruby provides memory profiling tools that track object creation during parsing operations.

require 'objspace'

class MemoryProfiledParser
  def self.profile_parsing(input, &parser_block)
    # Track object allocations
    ObjectSpace.trace_object_allocations_start
    
    # Record initial memory state
    initial_objects = ObjectSpace.count_objects
    
    # Execute parsing
    start_time = Time.now
    result = parser_block.call(input)
    end_time = Time.now
    
    # Record final memory state
    final_objects = ObjectSpace.count_objects
    
    ObjectSpace.trace_object_allocations_stop
    
    # Calculate memory usage
    object_diff = final_objects.merge(initial_objects) do |key, final_count, initial_count|
      final_count - initial_count
    end
    
    {
      result: result,
      execution_time: end_time - start_time,
      object_allocations: object_diff.select { |_, count| count > 0 },
      total_objects_created: object_diff.values.sum
    }
  end

  def self.benchmark_parsers(input, parsers = {})
    results = {}
    
    parsers.each do |name, parser_proc|
      GC.start # Ensure clean state
      
      profile_result = profile_parsing(input, &parser_proc)
      
      results[name] = {
        time: profile_result[:execution_time],
        objects: profile_result[:total_objects_created],
        memory_efficiency: profile_result[:execution_time] / profile_result[:total_objects_created]
      }
    end
    
    results
  end
end

# Usage example
large_json = '{"users": [' + ('{"name": "User", "age": 30},' * 1000)[0..-2] + ']}'

parsers = {
  json_standard: ->(input) { JSON.parse(input) },
  json_custom: ->(input) { JSONParser.new(input).parse }
}

benchmark_results = MemoryProfiledParser.benchmark_parsers(large_json, parsers)
puts benchmark_results

Reference

Core Parsing Classes

| Class | Purpose | Key Methods | Usage Notes |
| --- | --- | --- | --- |
| String | Basic text parsing | #split, #scan, #match, #gsub | Foundation for most parsing operations |
| Regexp | Pattern matching | #match, #match?, #=~, #source | Use named captures for maintainable code |
| MatchData | Regex match results | #[], #captures, #names, #offset | Access captured groups by index or name |
| StringIO | String stream parsing | #read, #readline, #pos, #seek | Useful for parsing large text streams |
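StringIO, listed above, wraps a string in an IO-style interface, which suits line-oriented parsing:

```ruby
require 'stringio'

# Line-oriented parsing with StringIO: read a header line, then records
io = StringIO.new("HEADER v1\nalpha,1\nbeta,2\n")

header = io.readline.chomp        # => "HEADER v1"
records = io.each_line.map do |line|
  name, count = line.chomp.split(",")
  [name, count.to_i]
end
# records => [["alpha", 1], ["beta", 2]]
```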

Standard Library Parsers

| Parser | Module/Class | Primary Method | Input Format | Output Type |
| --- | --- | --- | --- | --- |
| JSON | JSON | JSON.parse(string) | JSON text | Hash/Array |
| CSV | CSV | CSV.parse(string) | Comma-separated values | Array of Arrays |
| YAML | YAML | YAML.safe_load(string) | YAML document | Hash/Array/Scalar |
| URI | URI | URI.parse(string) | URL/URI string | URI object |
| Time | Time | Time.parse(string) | Time string | Time object |
| Date | Date | Date.parse(string) | Date string | Date object |
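Time.parse, Date.parse, and URI.parse come from the standard library rather than core, so each needs an explicit require:

```ruby
require 'time'
require 'date'
require 'uri'

Time.parse("2024-01-15 10:30:00").hour       # => 10
Date.parse("2024-01-15").year                # => 2024
URI.parse("https://example.com/users").host  # => "example.com"
```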

String Parsing Methods

| Method | Signature | Returns | Description |
| --- | --- | --- | --- |
| #split(pattern, limit) | String → Array | Array of substrings | Splits string on pattern matches |
| #scan(pattern) | String → Array | Array of matches | Returns all pattern matches |
| #match(pattern, pos) | String → MatchData/nil | Match object or nil | First pattern match from position |
| #partition(pattern) | String → Array | [before, match, after] | Splits string into three parts |
| #rpartition(pattern) | String → Array | [before, match, after] | Right-to-left partition |
| #slice(start, length) | String → String/nil | Substring or nil | Extracts a substring by index and length |
| #[] with Range | String → String/nil | Substring or nil | Range-based extraction |
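#partition and #rpartition are useful when only the first or last occurrence of a delimiter matters:

```ruby
# First '=' splits the key from a value that may itself contain '='
"retry=a=b".partition("=")
# => ["retry", "=", "a=b"]

# Last '.' separates a file extension
"archive.tar.gz".rpartition(".")
# => ["archive.tar", ".", "gz"]
```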

Regex Pattern Elements

| Element | Syntax | Matches | Example |
| --- | --- | --- | --- |
| Character class | [abc] | Any character in set | [0-9] matches digits |
| Negated class | [^abc] | Any character not in set | [^a-z] excludes lowercase |
| Word boundary | \b | Position at word edge | \bword\b matches whole word |
| Anchors | ^, $, \A, \z | Start/end of line (^, $) or string (\A, \z) | \A\d+\z matches digit-only strings |
| Quantifiers | *, +, ?, {n,m} | Repetition patterns | \d{2,4} matches 2-4 digits |
| Groups | () | Capture matched text | (\w+)@(\w+) captures email parts |
| Named groups | (?<name>...) | Named capture | (?<year>\d{4}) names capture |
| Non-capturing | (?:...) | Groups without capture | (?:com\|org) groups alternation |

Common Parsing Patterns

| Pattern Type | Implementation | Use Case | Example |
| --- | --- | --- | --- |
| Token scanning | string.scan(/pattern/) | Extract repeated elements | Email addresses, identifiers |
| Delimiter splitting | string.split(/\s*,\s*/) | Parse separated values | CSV fields, parameter lists |
| State machine | Custom class with state tracking | Complex grammar parsing | JSON, XML, config formats |
| Recursive descent | Method calls for grammar rules | Nested structures | Mathematical expressions |
| Parser combinators | Functional composition | Declarative grammar | DSL parsing, protocols |

Error Handling Patterns

| Error Type | Exception Class | When to Use | Recovery Strategy |
| --- | --- | --- | --- |
| Syntax errors | ParseSyntaxError | Malformed input structure | Skip to next valid token |
| Unexpected tokens | UnexpectedTokenError | Wrong token type | Attempt alternate parsing path |
| End of input | UnexpectedEndOfInputError | Premature input termination | Use default values |
| Validation errors | ValidationError | Semantic constraint violations | Log error, continue parsing |
| Encoding errors | EncodingError | Character encoding issues | Convert encoding or skip |

Performance Optimization Guidelines

| Technique | Implementation | Performance Gain | Memory Impact |
| --- | --- | --- | --- |
| Precompiled regex | Store regex in constants | 20-50% faster matching | Minimal increase |
| Range-based extraction | Use string[start...end] | Avoids string allocation | 50-80% memory reduction |
| String freezing | Call #freeze on input | Prevents accidental mutation | Enables string sharing |
| Streaming parsing | Process tokens immediately | Constant memory usage | 90%+ memory reduction |
| Batch processing | Parse multiple items together | Reduces method call overhead | Variable depending on batch size |

Debugging Tools and Techniques

| Tool | Usage | Output | Best For |
| --- | --- | --- | --- |
| String#inspect | puts string.inspect | Escaped string representation | Debugging invisible characters |
| Regexp#source | regex.source | Pattern string | Examining dynamic regex |
| MatchData#inspect | match.inspect | Match details with positions | Understanding capture groups |
| ObjectSpace profiling | ObjectSpace.trace_object_allocations | Memory allocation tracking | Performance optimization |
| Custom trace logging | Add debug output to parser | Step-by-step execution log | Logic flow debugging |