CrackedRuby

Parser Selection

Overview

Parser selection in Ruby refers to the automated and manual mechanisms for choosing appropriate parsing strategies based on data characteristics, file extensions, content inspection, and format detection. Ruby provides multiple layers of parser selection through its standard library and third-party gems, enabling applications to handle diverse data formats without explicit format specification.

Ruby's standard library includes automatic format detection in CSV parsing with the :auto setting for row separators, encoding detection capabilities, and content-type based parsing selection. The JSON and YAML libraries provide format-specific parsing with automatic encoding handling. Third-party libraries extend these capabilities with multi-format parsing frameworks that can automatically select between JSON, YAML, TOML, CSV, and XML parsers based on file extensions or content analysis.

The parser selection process operates through several strategies: file extension mapping, content sniffing, encoding detection, delimiter analysis, and structure inspection. Libraries like Shale provide unified interfaces that automatically route data to appropriate parsers, while gems like ACSV perform automatic encoding and separator detection for CSV files.

require 'csv'

# Automatic row separator detection
CSV.parse(data, row_sep: :auto)

# Content-based format selection
require 'shale'
person = Person.from_json(json_data)   # JSON parser selected
person = Person.from_yaml(yaml_data)   # YAML parser selected
person = Person.from_xml(xml_data)     # XML parser selected

Parser selection reduces the burden of explicit format specification in applications that handle multiple data formats. It is particularly useful in data processing pipelines, API endpoints, and file import systems where the source format may vary or be unknown at runtime.
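In an API endpoint, the same idea often keys off the request's Content-Type header rather than a file extension. A minimal sketch, assuming a hypothetical CONTENT_PARSERS map and parse_body helper (not part of any gem):

```ruby
require 'json'
require 'yaml'

# Illustrative mapping from media types to parsers, as an API endpoint might use
CONTENT_PARSERS = {
  'application/json' => ->(body) { JSON.parse(body) },
  'application/yaml' => ->(body) { YAML.safe_load(body) }
}.freeze

def parse_body(content_type, body)
  # Strip any parameters such as "; charset=utf-8" before the lookup
  parser = CONTENT_PARSERS[content_type.to_s.split(';').first]
  raise ArgumentError, "Unsupported media type: #{content_type}" unless parser
  parser.call(body)
end
```

A request with `Content-Type: application/json; charset=utf-8` and body `{"ok":true}` routes to the JSON parser.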

Basic Usage

Ruby's parser selection mechanisms operate through both built-in standard library features and third-party libraries that provide enhanced format detection capabilities.

The CSV library provides automatic row separator detection through the :auto setting, which analyzes data content to determine the appropriate line ending sequence:

require 'csv'

data = "name,age\r\nJohn,25\r\nJane,30\r\n"
parsed = CSV.parse(data, row_sep: :auto)
# Automatically detects \r\n line endings
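The same data with Unix line endings parses identically, so callers need not branch on the source platform. A small self-contained check:

```ruby
require 'csv'

# row_sep: :auto handles both Windows and Unix line endings transparently
windows = CSV.parse("name,age\r\nJohn,25\r\n", row_sep: :auto)
unix    = CSV.parse("name,age\nJohn,25\n", row_sep: :auto)

windows == unix  # both parse to [["name", "age"], ["John", "25"]]
```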

Multi-format parsing libraries like Shale enable automatic parser selection based on method calls, where the format is implied by the method name:

require 'shale'

class Person < Shale::Mapper
  attribute :name, Shale::Type::String
  attribute :age, Shale::Type::Integer
end

# Parser selection based on method name
person = Person.from_json('{"name":"John","age":25}')
person = Person.from_yaml("name: John\nage: 25")
person = Person.from_csv("John,25")

File extension-based parser selection occurs in libraries that analyze file paths to determine appropriate parsing strategies:

require 'front_matter_parser'

# Parser selection based on file extension
parsed = FrontMatterParser::Parser.parse_file('document.md')  # Markdown
parsed = FrontMatterParser::Parser.parse_file('template.erb') # ERB
parsed = FrontMatterParser::Parser.parse_file('style.scss')  # SCSS
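The same extension-driven routing can be sketched with only the standard library. The PARSERS table and parse_by_extension helper below are illustrative, not a real gem API:

```ruby
require 'json'
require 'yaml'
require 'csv'

# Illustrative extension-to-parser table; libraries wrap the same idea
PARSERS = {
  '.json' => ->(text) { JSON.parse(text) },
  '.yml'  => ->(text) { YAML.safe_load(text) },
  '.yaml' => ->(text) { YAML.safe_load(text) },
  '.csv'  => ->(text) { CSV.parse(text) }
}.freeze

def parse_by_extension(path, text)
  parser = PARSERS.fetch(File.extname(path).downcase) do
    raise ArgumentError, "No parser registered for #{path}"
  end
  parser.call(text)
end

parse_by_extension('config.json', '{"debug":true}')  # => {"debug"=>true}
```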

Automatic encoding and delimiter detection provides robust CSV parsing without manual configuration:

require 'acsv'

# Automatic encoding and separator detection
ACSV::CSV.foreach("data.csv") do |row|
  puts row.inspect
end

# Manual access to detection results
data = File.read("mixed_format.csv")
encoding = ACSV::Detect.encoding(data)
separator = ACSV::Detect.separator(data.force_encoding(encoding))
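When ACSV is unavailable, a rough stdlib-only fallback is to count candidate separators on a sample line. The guess_separator helper below is hypothetical and far less robust than real character-frequency detection (it ignores ties and quoted fields):

```ruby
# Naive separator heuristic: pick the candidate that appears most often
# on the first line -- illustration only
def guess_separator(text, candidates = [",", ";", "\t", "|"])
  first_line = text.lines.first.to_s
  candidates.max_by { |sep| first_line.count(sep) }
end

guess_separator("a;b;c\n1;2;3\n")  # => ";"
```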

Content inspection for format detection enables parsing of data streams where format is determined by analyzing structure patterns:

# Biological sequence format detection
require 'dna'

File.open('sequences.unknown') do |handle|
  records = Dna.new(handle)  # Automatic format detection
  records.each { |record| puts record.length }
end

Parser adapter selection allows switching between different parsing implementations based on performance requirements or feature needs:

require 'shale'
require 'multi_json'

# Selecting JSON parser adapter
Shale.json_adapter = MultiJson

# Selecting XML parser adapter
require 'shale/adapter/nokogiri'
Shale.xml_adapter = Shale::Adapter::Nokogiri

Error Handling & Debugging

Parser selection failures occur when automatic detection mechanisms cannot determine the correct format, when multiple formats match the input data, or when the detected format is incorrect for the actual data structure.

Format detection ambiguity requires fallback strategies and explicit format specification when automatic selection fails:

require 'acsv'
require 'csv'  # for the manual fallback below

def robust_csv_parse(file_path)
  begin
    ACSV::CSV.read(file_path)
  rescue ACSV::DetectionError => e
    puts "Auto-detection failed: #{e.message}"
    
    # Fallback to standard CSV with common settings
    encodings = ['UTF-8', 'ISO-8859-1', 'Windows-1252']
    separators = [',', ';', "\t"]  # "\t" must be double-quoted to mean a tab
    
    encodings.each do |encoding|
      separators.each do |sep|
        begin
          return CSV.read(file_path, encoding: encoding, col_sep: sep)
        rescue => inner_e
          next
        end
      end
    end
    
    raise "Could not parse CSV with any combination of settings"
  end
end

Content validation after parser selection prevents incorrect parsing results when format detection succeeds but selects the wrong parser:

class FormatValidator
  def self.validate_json_structure(data, expected_keys)
    parsed = JSON.parse(data)
    
    unless parsed.is_a?(Hash) && expected_keys.all? { |key| parsed.key?(key) }
      raise ArgumentError, "JSON structure doesn't match expected format"
    end
    
    parsed
  rescue JSON::ParserError => e
    raise ArgumentError, "Invalid JSON format: #{e.message}"
  end
  
  def self.validate_csv_columns(file_path, expected_columns)
    headers = CSV.open(file_path, &:readline)
    
    missing = expected_columns - headers
    unless missing.empty?
      raise ArgumentError, "Missing required columns: #{missing.join(', ')}"
    end
    
    extra = headers - expected_columns
    puts "Warning: unexpected columns found: #{extra.join(', ')}" unless extra.empty?
    
    true
  end
end

Debugging parser selection requires logging detection decisions and providing visibility into the selection process:

class ParserSelector
  def self.select_parser_with_logging(data, filename = nil)
    detection_log = []

    # Try JSON detection
    begin
      JSON.parse(data.strip)
      detection_log << "JSON: valid syntax detected"
      return :json
    rescue JSON::ParserError
      detection_log << "JSON: parse failed"
    end

    # Try YAML detection (require a structured result -- YAML parses
    # almost any plain string as a scalar, which would over-match)
    begin
      result = YAML.safe_load(data)
      if result.is_a?(Hash) || result.is_a?(Array)
        detection_log << "YAML: valid structure detected"
        return :yaml
      end
      detection_log << "YAML: scalar only, skipped"
    rescue Psych::SyntaxError
      detection_log << "YAML: parse failed"
    end

    # File extension fallback
    if filename
      case File.extname(filename).downcase
      when '.json'
        detection_log << "Extension: .json detected"
        return :json
      when '.yml', '.yaml'
        detection_log << "Extension: .yaml detected"
        return :yaml
      when '.csv'
        detection_log << "Extension: .csv detected"
        return :csv
      end
    end

    detection_log << "No format detected, defaulting to plain text"
    :text
  ensure
    # The ensure clause logs every decision path, including early returns
    puts "Detection log: #{detection_log.join(' | ')}"
  end
end
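The JSON-before-YAML ordering matters more than it looks: YAML accepts nearly any plain string as a scalar, so probing YAML first would claim almost every input. A minimal self-contained sniffer (sniff is a hypothetical helper, not a library API):

```ruby
require 'json'
require 'yaml'

# Probe JSON first; only accept YAML when the result is a real structure,
# since YAML parses plain text as a bare scalar
def sniff(data)
  begin
    JSON.parse(data)
    return :json
  rescue JSON::ParserError
    # fall through to YAML
  end

  begin
    result = YAML.safe_load(data)
    (result.is_a?(Hash) || result.is_a?(Array)) ? :yaml : :text
  rescue Psych::SyntaxError
    :text
  end
end

sniff('{"a":1}')  # => :json
sniff("a: 1")     # => :yaml
sniff("hello")    # => :text
```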

Error recovery patterns handle parser selection failures gracefully by attempting alternative parsers or degrading to simpler parsing approaches:

class RobustDataParser
  def self.parse_with_fallbacks(data, hint = nil)
    parsers = determine_parser_order(data, hint)
    last_error = nil
    
    parsers.each do |parser_type|
      begin
        return send("parse_as_#{parser_type}", data)
      rescue StandardError => e
        last_error = e
        puts "#{parser_type.upcase} parsing failed: #{e.message}"
        next
      end
    end
    
    # Final fallback to raw text
    puts "All parsers failed, returning raw text"
    { content: data.to_s, format: :text, error: last_error&.message }
  end
  
  # A bare `private` does not affect `def self.` methods, so the helper
  # is hidden explicitly with private_class_method below
  def self.determine_parser_order(data, hint)
    order = []

    # Use hint if provided
    order << hint if hint

    # Content-based heuristics
    order << :json if data.strip.start_with?('{', '[')
    order << :yaml if data.include?(':') && !data.include?('{')
    order << :csv if data.include?(',') || data.include?(';')

    # Default order for remaining parsers
    ([:json, :yaml, :csv, :xml] - order).each { |p| order << p }

    order.uniq
  end
  private_class_method :determine_parser_order
end

Production Patterns

Parser selection in production environments requires robust error handling, performance optimization, caching of detection results, and monitoring of parser selection decisions to ensure reliability and maintainability.

Caching parser selection decisions reduces overhead when processing multiple files with similar characteristics or when the same data formats are encountered repeatedly:

class ParserSelectionCache
  def initialize(max_size: 1000)
    @cache = {}
    @max_size = max_size
    @access_order = []
  end
  
  def get_parser(content_hash, &block)
    if @cache.key?(content_hash)
      # Move to end for LRU
      @access_order.delete(content_hash)
      @access_order << content_hash
      return @cache[content_hash]
    end
    
    parser = block.call
    
    if @cache.size >= @max_size
      # Remove least recently used
      oldest = @access_order.shift
      @cache.delete(oldest)
    end
    
    @cache[content_hash] = parser
    @access_order << content_hash
    parser
  end
end

# Usage with content-based caching
require 'digest'

parser_cache = ParserSelectionCache.new(max_size: 500)

def parse_data_with_cache(data, cache)
  content_hash = Digest::SHA256.hexdigest(data[0, 1000])  # Hash first 1KB only

  parser = cache.get_parser(content_hash) do
    detect_parser_type(data)  # application-defined detection routine
  end

  send("parse_as_#{parser}", data)  # application-defined parse_as_* helpers
end

Performance optimization for parser selection involves minimizing detection overhead and implementing efficient content sampling strategies:

class OptimizedParserSelector
  DETECTION_SAMPLE_SIZE = 4096  # Only analyze first 4KB
  
  def self.select_parser_fast(data)
    # Use small sample for detection to avoid processing large files
    sample = data.is_a?(String) ? data[0, DETECTION_SAMPLE_SIZE] : data
    
    # Fast heuristics based on first characters
    trimmed = sample.strip
    
    return :json if json_like?(trimmed)
    return :xml if xml_like?(trimmed)
    return :yaml if yaml_like?(trimmed)
    return :csv if csv_like?(trimmed)
    
    :text
  end
  
  class << self
    private

    # A bare `private` does not apply to `def self.` methods, so the
    # helpers are defined inside `class << self` instead

    def json_like?(sample)
      sample.start_with?('{', '[') && sample.include?('"')
    end

    def xml_like?(sample)
      sample.start_with?('<?xml', '<') && sample.include?('>')
    end

    def yaml_like?(sample)
      sample.include?(':') && !sample.include?('{') &&
        sample.split("\n").any? { |line| line.match(/^\w+\s*:/) }
    end

    def csv_like?(sample)
      lines = sample.split("\n").first(3)
      return false if lines.size < 2

      # Check for consistent column counts ("\t" double-quoted for a real tab)
      separators = [',', ';', "\t"]
      separators.any? do |sep|
        counts = lines.map { |line| line.split(sep).size }
        counts.uniq.size == 1 && counts.first > 1
      end
    end
  end
end

Monitoring and logging parser selection decisions provides visibility into format distribution and helps identify potential issues in production:

require 'logger'
require 'json'

class ParserSelectionMonitor
  def initialize(logger: Logger.new($stdout))
    @logger = logger
    @stats = Hash.new(0)
    @errors = Hash.new(0)
  end
  
  def track_selection(format, source_hint: nil, duration: nil)
    @stats[format] += 1
    
    @logger.info({
      event: 'parser_selection',
      format: format,
      source_hint: source_hint,
      duration_ms: duration,
      total_selections: @stats.values.sum
    }.to_json)
  end
  
  def track_error(error_type, details: nil)
    @errors[error_type] += 1
    
    @logger.error({
      event: 'parser_selection_error',
      error_type: error_type,
      details: details,
      total_errors: @errors.values.sum
    }.to_json)
  end
  
  def format_distribution
    total = @stats.values.sum.to_f
    @stats.transform_values { |count| (count / total * 100).round(2) }
  end
  
  def error_rates
    total_attempts = @stats.values.sum + @errors.values.sum
    return {} if total_attempts.zero?
    
    @errors.transform_values do |count|
      (count.to_f / total_attempts * 100).round(2)
    end
  end
end

# Integration with parser selection -- pass the monitor in explicitly,
# since a top-level local variable is not visible inside a method body
monitor = ParserSelectionMonitor.new

def monitored_parse(data, monitor, filename: nil)
  start_time = Time.now

  begin
    format = OptimizedParserSelector.select_parser_fast(data)
    duration = ((Time.now - start_time) * 1000).round(2)

    monitor.track_selection(format,
      source_hint: File.extname(filename || ''),
      duration: duration
    )

    parse_with_format(data, format)  # application-defined dispatch helper

  rescue StandardError => e
    monitor.track_error(e.class.name, details: e.message[0, 200])
    raise
  end
end

Configuration management for parser selection allows tuning detection parameters and parser preferences for different environments:

require 'timeout'

class ParserConfig
  DEFAULT_CONFIG = {
    json: { enabled: true, priority: 1 },
    yaml: { enabled: true, priority: 2 },
    csv: { enabled: true, priority: 3, auto_detect_separator: true },
    xml: { enabled: true, priority: 4 },
    detection_timeout: 5.0,
    cache_enabled: true,
    cache_size: 1000,
    sample_size: 4096,
    fallback_parser: :text
  }.freeze
  
  def initialize(config = {})
    # Shallow merge: overriding e.g. csv: replaces that format's whole hash
    @config = DEFAULT_CONFIG.merge(config)
  end
  
  def parser_priority_order
    enabled_parsers = @config.select { |k, v| v.is_a?(Hash) && v[:enabled] }
    enabled_parsers.sort_by { |_, v| v[:priority] }.map(&:first)
  end
  
  def timeout_enabled?
    @config[:detection_timeout] > 0
  end
  
  def with_timeout(&block)
    if timeout_enabled?
      Timeout.timeout(@config[:detection_timeout], &block)
    else
      block.call
    end
  end
end

# Environment-specific configuration
production_config = ParserConfig.new(
  detection_timeout: 2.0,
  cache_size: 5000,
  sample_size: 2048,
  csv: { enabled: true, priority: 1, auto_detect_separator: true }
)

development_config = ParserConfig.new(
  detection_timeout: 10.0,
  cache_enabled: false,
  sample_size: 8192
)

Reference

Core Parser Selection Methods

Method Parameters Returns Description
CSV.parse(data, row_sep: :auto) data (String), row_sep: :auto Array<Array> Automatically detects row separator in CSV data
ACSV::CSV.foreach(file) file (String/IO), block Enumerator Auto-detects encoding and separator for CSV iteration
ACSV::Detect.encoding(data) data (String) String Detects character encoding of data
ACSV::Detect.separator(data) data (String) String Detects column separator in CSV data
FrontMatterParser::Parser.parse_file(path) path (String), options (Hash) ParsedResult Auto-detects syntax based on file extension
Dna.new(handle) handle (IO), options (Hash) Parser Auto-detects biological sequence file format
Shale::Mapper.from_json(data) data (String) Mapper Parses data as JSON format
Shale::Mapper.from_yaml(data) data (String) Mapper Parses data as YAML format
Shale::Mapper.from_csv(data) data (String) Mapper Parses data as CSV format
JSON.parse(data) data (String), options (Hash) Object Parses JSON with encoding auto-detection
YAML.safe_load(data) data (String), options (Hash) Object Parses YAML with format validation

Format Detection Strategies

Strategy Detection Method Reliability Performance Use Cases
File Extension Path analysis Medium High File system operations, batch processing
Content Sniffing Structure analysis High Medium API endpoints, data streams
Magic Bytes Header inspection High High Binary formats, specific file types
Delimiter Analysis Character frequency Medium High CSV/TSV detection, structured text
Encoding Detection Byte pattern analysis High Low Multi-language content, legacy data
Schema Validation Structure validation Very High Low Critical data processing, validation
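The magic-bytes strategy in the table above can be sketched in a few lines. The gzip, zip, and PDF signatures shown are well-known; binary_format is a hypothetical helper that expects its input in binary (ASCII-8BIT) encoding:

```ruby
# Identify common binary container formats from their leading bytes
def binary_format(bytes)
  if bytes.start_with?("\x1F\x8B".b)      then :gzip
  elsif bytes.start_with?("PK\x03\x04".b) then :zip
  elsif bytes.start_with?("%PDF".b)       then :pdf
  else :unknown
  end
end

binary_format("%PDF-1.7".b)  # => :pdf
```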

Built-in Parser Selection Options

Library Auto-Detection Features Configuration Options
CSV Row separator (:auto), encoding detection col_sep, row_sep, quote_char, encoding
JSON Encoding detection, nested structure handling symbolize_names, max_nesting, allow_nan
YAML Multi-document detection, safe loading permitted_classes, aliases, filename
XML Encoding declaration parsing, namespace handling encoding, external_encoding, recover

Third-Party Parser Selection Libraries

Library Formats Supported Key Features Selection Method
ACSV CSV Encoding/separator auto-detection Content analysis, character detection
Shale JSON, YAML, TOML, CSV, XML Unified mapper interface Method-based format selection
FrontMatterParser YAML, JSON front matter Syntax-aware comment parsing File extension, content delimiters
MultiJSON JSON variants Adapter pattern, fallback parsers Performance-based selection, availability
Dna FASTA, FASTQ, QSEQ Biological sequence format detection Content structure analysis
SmarterCSV CSV Intelligent defaults, chunk processing Header analysis, data type inference

Error Classes and Handling

Exception Class Trigger Conditions Recovery Strategies
ACSV::DetectionError Encoding/separator detection failure Manual format specification, fallback parsers
JSON::ParserError Invalid JSON syntax Alternative format attempt, content validation
Psych::SyntaxError Invalid YAML structure Format fallback, partial parsing
CSV::MalformedCSVError Inconsistent CSV structure Liberal parsing, error skipping
FrontMatterParser::SyntaxNotRecognized Unknown file syntax Custom parser definition, manual parsing
ArgumentError Invalid parser configuration Configuration validation, default settings

Performance Characteristics

Operation Typical Duration Memory Usage Scalability
File extension check < 1ms Minimal Excellent
Content sampling (4KB) 1-5ms Low Very Good
Encoding detection 10-100ms Medium Good
Full content parsing Variable High Depends on format
Schema validation 50-500ms Medium Fair

Configuration Examples

# CSV auto-detection configuration
# Only row_sep supports :auto in the standard library; automatic column
# separator, quote character, and encoding detection require a wrapper
# such as ACSV.
CSV.new(data, row_sep: :auto)

# Multi-format parser adapter configuration (Shale exposes module-level
# adapter writers rather than a configure block)
require 'shale/adapter/nokogiri'

Shale.json_adapter = MultiJson
Shale.yaml_adapter = Psych
Shale.xml_adapter  = Shale::Adapter::Nokogiri
Shale.csv_adapter  = Shale::Adapter::CSV

# Front matter parsing for an unrecognized extension: pass the syntax
# parser explicitly instead of relying on extension-based selection
FrontMatterParser::Parser.parse_file('component.vue',
  syntax_parser: FrontMatterParser::SyntaxParser::Html.new)