CrackedRuby

Parser Selection

Overview

Parser selection in Ruby refers to the automated and manual mechanisms for choosing appropriate parsing strategies based on data characteristics, file extensions, content inspection, and format detection. Ruby provides multiple layers of parser selection through its standard library and third-party gems, enabling applications to handle diverse data formats without explicit format specification.

Ruby's standard library includes automatic format detection in CSV parsing with the :auto setting for row separators, encoding detection capabilities, and content-type based parsing selection. The JSON and YAML libraries provide format-specific parsing with automatic encoding handling. Third-party libraries extend these capabilities with multi-format parsing frameworks that can automatically select between JSON, YAML, TOML, CSV, and XML parsers based on file extensions or content analysis.

The parser selection process operates through several strategies: file extension mapping, content sniffing, encoding detection, delimiter analysis, and structure inspection. Libraries like Shale provide unified interfaces that automatically route data to appropriate parsers, while gems like ACSV perform automatic encoding and separator detection for CSV files.

require 'csv'

# Automatic row separator detection
CSV.parse(data, row_sep: :auto)

# Content-based format selection
require 'shale'
person = Person.from_json(json_data)   # JSON parser selected
person = Person.from_yaml(yaml_data)   # YAML parser selected
person = Person.from_xml(xml_data)     # XML parser selected

Parser selection reduces the burden of explicit format specification in applications that handle multiple data formats. It is particularly useful in data processing pipelines, API endpoints, and file import systems where the source format may vary or be unknown at runtime.
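In an API endpoint, the same idea often keys off the request's Content-Type header rather than a file extension. A minimal sketch, assuming a hypothetical CONTENT_PARSERS map and parse_body helper (not part of any gem):

```ruby
require 'json'
require 'yaml'

# Illustrative mapping from media types to parsers, as an API endpoint might use
CONTENT_PARSERS = {
  'application/json' => ->(body) { JSON.parse(body) },
  'application/yaml' => ->(body) { YAML.safe_load(body) }
}.freeze

def parse_body(content_type, body)
  # Strip any parameters such as "; charset=utf-8" before the lookup
  parser = CONTENT_PARSERS[content_type.to_s.split(';').first]
  raise ArgumentError, "Unsupported media type: #{content_type}" unless parser
  parser.call(body)
end
```

A request with `Content-Type: application/json; charset=utf-8` and body `{"ok":true}` routes to the JSON parser.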

Basic Usage

Ruby's parser selection mechanisms operate through both built-in standard library features and third-party libraries that provide enhanced format detection capabilities.

The CSV library provides automatic row separator detection through the :auto setting, which analyzes data content to determine the appropriate line ending sequence:

require 'csv'

data = "name,age\r\nJohn,25\r\nJane,30\r\n"
parsed = CSV.parse(data, row_sep: :auto)
# Automatically detects \r\n line endings
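The same data with Unix line endings parses identically, so callers need not branch on the source platform. A small self-contained check:

```ruby
require 'csv'

# row_sep: :auto handles both Windows and Unix line endings transparently
windows = CSV.parse("name,age\r\nJohn,25\r\n", row_sep: :auto)
unix    = CSV.parse("name,age\nJohn,25\n", row_sep: :auto)

windows == unix  # both parse to [["name", "age"], ["John", "25"]]
```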

Multi-format parsing libraries like Shale enable automatic parser selection based on method calls, where the format is implied by the method name:

require 'shale'

class Person < Shale::Mapper
  attribute :name, Shale::Type::String
  attribute :age, Shale::Type::Integer
end

# Parser selection based on method name
person = Person.from_json('{"name":"John","age":25}')
person = Person.from_yaml("name: John\nage: 25")
person = Person.from_csv("John,25")

File extension-based parser selection occurs in libraries that analyze file paths to determine appropriate parsing strategies:

require 'front_matter_parser'

# Parser selection based on file extension
parsed = FrontMatterParser::Parser.parse_file('document.md')  # Markdown
parsed = FrontMatterParser::Parser.parse_file('template.erb') # ERB
parsed = FrontMatterParser::Parser.parse_file('style.scss')  # SCSS
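The same extension-driven routing can be sketched with only the standard library. The PARSERS table and parse_by_extension helper below are illustrative, not a real gem API:

```ruby
require 'json'
require 'yaml'
require 'csv'

# Illustrative extension-to-parser table; libraries wrap the same idea
PARSERS = {
  '.json' => ->(text) { JSON.parse(text) },
  '.yml'  => ->(text) { YAML.safe_load(text) },
  '.yaml' => ->(text) { YAML.safe_load(text) },
  '.csv'  => ->(text) { CSV.parse(text) }
}.freeze

def parse_by_extension(path, text)
  parser = PARSERS.fetch(File.extname(path).downcase) do
    raise ArgumentError, "No parser registered for #{path}"
  end
  parser.call(text)
end

parse_by_extension('config.json', '{"debug":true}')  # => {"debug"=>true}
```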

Automatic encoding and delimiter detection provides robust CSV parsing without manual configuration:

require 'acsv'

# Automatic encoding and separator detection
ACSV::CSV.foreach("data.csv") do |row|
  puts row.inspect
end

# Manual access to detection results
data = File.read("mixed_format.csv")
encoding = ACSV::Detect.encoding(data)
separator = ACSV::Detect.separator(data.force_encoding(encoding))
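When ACSV is unavailable, a rough stdlib-only fallback is to count candidate separators on a sample line. The guess_separator helper below is hypothetical and far less robust than real character-frequency detection (it ignores ties and quoted fields):

```ruby
# Naive separator heuristic: pick the candidate that appears most often
# on the first line -- illustration only
def guess_separator(text, candidates = [",", ";", "\t", "|"])
  first_line = text.lines.first.to_s
  candidates.max_by { |sep| first_line.count(sep) }
end

guess_separator("a;b;c\n1;2;3\n")  # => ";"
```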

Content inspection for format detection enables parsing of data streams where format is determined by analyzing structure patterns:

# Biological sequence format detection
require 'dna'

File.open('sequences.unknown') do |handle|
  records = Dna.new(handle)  # Automatic format detection
  records.each { |record| puts record.length }
end

Parser adapter selection allows switching between different parsing implementations based on performance requirements or feature needs:

require 'shale'
require 'multi_json'

# Selecting JSON parser adapter
Shale.json_adapter = MultiJson

# Selecting XML parser adapter
require 'shale/adapter/nokogiri'
Shale.xml_adapter = Shale::Adapter::Nokogiri

Error Handling & Debugging

Parser selection failures occur when automatic detection mechanisms cannot determine the correct format, when multiple formats match the input data, or when the detected format is incorrect for the actual data structure.

Format detection ambiguity requires fallback strategies and explicit format specification when automatic selection fails:

require 'acsv'
require 'csv'  # for the manual fallback below

def robust_csv_parse(file_path)
  begin
    ACSV::CSV.read(file_path)
  rescue ACSV::DetectionError => e
    puts "Auto-detection failed: #{e.message}"
    
    # Fallback to standard CSV with common settings
    encodings = ['UTF-8', 'ISO-8859-1', 'Windows-1252']
    separators = [',', ';', "\t"]  # "\t" must be double-quoted to mean a tab
    
    encodings.each do |encoding|
      separators.each do |sep|
        begin
          return CSV.read(file_path, encoding: encoding, col_sep: sep)
        rescue => inner_e
          next
        end
      end
    end
    
    raise "Could not parse CSV with any combination of settings"
  end
end

Content validation after parser selection prevents incorrect parsing results when format detection succeeds but selects the wrong parser:

class FormatValidator
  def self.validate_json_structure(data, expected_keys)
    parsed = JSON.parse(data)
    
    unless parsed.is_a?(Hash) && expected_keys.all? { |key| parsed.key?(key) }
      raise ArgumentError, "JSON structure doesn't match expected format"
    end
    
    parsed
  rescue JSON::ParserError => e
    raise ArgumentError, "Invalid JSON format: #{e.message}"
  end
  
  def self.validate_csv_columns(file_path, expected_columns)
    headers = CSV.open(file_path, &:readline)
    
    missing = expected_columns - headers
    unless missing.empty?
      raise ArgumentError, "Missing required columns: #{missing.join(', ')}"
    end
    
    extra = headers - expected_columns
    puts "Warning: unexpected columns found: #{extra.join(', ')}" unless extra.empty?
    
    true
  end
end

Debugging parser selection requires logging detection decisions and providing visibility into the selection process:

class ParserSelector
  def self.select_parser_with_logging(data, filename = nil)
    detection_log = []

    # Try JSON detection
    begin
      JSON.parse(data.strip)
      detection_log << "JSON: valid syntax detected"
      return :json
    rescue JSON::ParserError
      detection_log << "JSON: parse failed"
    end

    # Try YAML detection (require a structured result -- YAML parses
    # almost any plain string as a scalar, which would over-match)
    begin
      result = YAML.safe_load(data)
      if result.is_a?(Hash) || result.is_a?(Array)
        detection_log << "YAML: valid structure detected"
        return :yaml
      end
      detection_log << "YAML: scalar only, skipped"
    rescue Psych::SyntaxError
      detection_log << "YAML: parse failed"
    end

    # File extension fallback
    if filename
      case File.extname(filename).downcase
      when '.json'
        detection_log << "Extension: .json detected"
        return :json
      when '.yml', '.yaml'
        detection_log << "Extension: .yaml detected"
        return :yaml
      when '.csv'
        detection_log << "Extension: .csv detected"
        return :csv
      end
    end

    detection_log << "No format detected, defaulting to plain text"
    :text
  ensure
    # The ensure clause logs every decision path, including early returns
    puts "Detection log: #{detection_log.join(' | ')}"
  end
end
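The JSON-before-YAML ordering matters more than it looks: YAML accepts nearly any plain string as a scalar, so probing YAML first would claim almost every input. A minimal self-contained sniffer (sniff is a hypothetical helper, not a library API):

```ruby
require 'json'
require 'yaml'

# Probe JSON first; only accept YAML when the result is a real structure,
# since YAML parses plain text as a bare scalar
def sniff(data)
  begin
    JSON.parse(data)
    return :json
  rescue JSON::ParserError
    # fall through to YAML
  end

  begin
    result = YAML.safe_load(data)
    (result.is_a?(Hash) || result.is_a?(Array)) ? :yaml : :text
  rescue Psych::SyntaxError
    :text
  end
end

sniff('{"a":1}')  # => :json
sniff("a: 1")     # => :yaml
sniff("hello")    # => :text
```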

Error recovery patterns handle parser selection failures gracefully by attempting alternative parsers or degrading to simpler parsing approaches:

class RobustDataParser
  def self.parse_with_fallbacks(data, hint = nil)
    parsers = determine_parser_order(data, hint)
    last_error = nil
    
    parsers.each do |parser_type|
      begin
        return send("parse_as_#{parser_type}", data)
      rescue StandardError => e
        last_error = e
        puts "#{parser_type.upcase} parsing failed: #{e.message}"
        next
      end
    end
    
    # Final fallback to raw text
    puts "All parsers failed, returning raw text"
    { content: data.to_s, format: :text, error: last_error&.message }
  end
  
  # A bare `private` does not affect `def self.` methods, so the helper
  # is hidden explicitly with private_class_method below
  def self.determine_parser_order(data, hint)
    order = []

    # Use hint if provided
    order << hint if hint

    # Content-based heuristics
    order << :json if data.strip.start_with?('{', '[')
    order << :yaml if data.include?(':') && !data.include?('{')
    order << :csv if data.include?(',') || data.include?(';')

    # Default order for remaining parsers
    ([:json, :yaml, :csv, :xml] - order).each { |p| order << p }

    order.uniq
  end
  private_class_method :determine_parser_order
end

Production Patterns

Parser selection in production environments requires robust error handling, performance optimization, caching of detection results, and monitoring of parser selection decisions to ensure reliability and maintainability.

Caching parser selection decisions reduces overhead when processing multiple files with similar characteristics or when the same data formats are encountered repeatedly:

class ParserSelectionCache
  def initialize(max_size: 1000)
    @cache = {}
    @max_size = max_size
    @access_order = []
  end
  
  def get_parser(content_hash, &block)
    if @cache.key?(content_hash)
      # Move to end for LRU
      @access_order.delete(content_hash)
      @access_order << content_hash
      return @cache[content_hash]
    end
    
    parser = block.call
    
    if @cache.size >= @max_size
      # Remove least recently used
      oldest = @access_order.shift
      @cache.delete(oldest)
    end
    
    @cache[content_hash] = parser
    @access_order << content_hash
    parser
  end
end

# Usage with content-based caching
require 'digest'

parser_cache = ParserSelectionCache.new(max_size: 500)

def parse_data_with_cache(data, cache)
  content_hash = Digest::SHA256.hexdigest(data[0, 1000])  # Hash first 1KB only

  parser = cache.get_parser(content_hash) do
    detect_parser_type(data)  # application-defined detection routine
  end

  send("parse_as_#{parser}", data)  # application-defined parse_as_* helpers
end

Performance optimization for parser selection involves minimizing detection overhead and implementing efficient content sampling strategies:

class OptimizedParserSelector
  DETECTION_SAMPLE_SIZE = 4096  # Only analyze first 4KB
  
  def self.select_parser_fast(data)
    # Use small sample for detection to avoid processing large files
    sample = data.is_a?(String) ? data[0, DETECTION_SAMPLE_SIZE] : data
    
    # Fast heuristics based on first characters
    trimmed = sample.strip
    
    return :json if json_like?(trimmed)
    return :xml if xml_like?(trimmed)
    return :yaml if yaml_like?(trimmed)
    return :csv if csv_like?(trimmed)
    
    :text
  end
  
  class << self
    private

    # A bare `private` does not apply to `def self.` methods, so the
    # helpers are defined inside `class << self` instead

    def json_like?(sample)
      sample.start_with?('{', '[') && sample.include?('"')
    end

    def xml_like?(sample)
      sample.start_with?('<?xml', '<') && sample.include?('>')
    end

    def yaml_like?(sample)
      sample.include?(':') && !sample.include?('{') &&
        sample.split("\n").any? { |line| line.match(/^\w+\s*:/) }
    end

    def csv_like?(sample)
      lines = sample.split("\n").first(3)
      return false if lines.size < 2

      # Check for consistent column counts ("\t" double-quoted for a real tab)
      separators = [',', ';', "\t"]
      separators.any? do |sep|
        counts = lines.map { |line| line.split(sep).size }
        counts.uniq.size == 1 && counts.first > 1
      end
    end
  end
end

Monitoring and logging parser selection decisions provides visibility into format distribution and helps identify potential issues in production:

require 'logger'
require 'json'

class ParserSelectionMonitor
  def initialize(logger: Logger.new($stdout))
    @logger = logger
    @stats = Hash.new(0)
    @errors = Hash.new(0)
  end
  
  def track_selection(format, source_hint: nil, duration: nil)
    @stats[format] += 1
    
    @logger.info({
      event: 'parser_selection',
      format: format,
      source_hint: source_hint,
      duration_ms: duration,
      total_selections: @stats.values.sum
    }.to_json)
  end
  
  def track_error(error_type, details: nil)
    @errors[error_type] += 1
    
    @logger.error({
      event: 'parser_selection_error',
      error_type: error_type,
      details: details,
      total_errors: @errors.values.sum
    }.to_json)
  end
  
  def format_distribution
    total = @stats.values.sum.to_f
    @stats.transform_values { |count| (count / total * 100).round(2) }
  end
  
  def error_rates
    total_attempts = @stats.values.sum + @errors.values.sum
    return {} if total_attempts.zero?
    
    @errors.transform_values do |count|
      (count.to_f / total_attempts * 100).round(2)
    end
  end
end

# Integration with parser selection -- pass the monitor in explicitly,
# since a top-level local variable is not visible inside a method body
monitor = ParserSelectionMonitor.new

def monitored_parse(data, monitor, filename: nil)
  start_time = Time.now

  begin
    format = OptimizedParserSelector.select_parser_fast(data)
    duration = ((Time.now - start_time) * 1000).round(2)

    monitor.track_selection(format,
      source_hint: File.extname(filename || ''),
      duration: duration
    )

    parse_with_format(data, format)  # application-defined dispatch helper

  rescue StandardError => e
    monitor.track_error(e.class.name, details: e.message[0, 200])
    raise
  end
end

Configuration management for parser selection allows tuning detection parameters and parser preferences for different environments:

require 'timeout'

class ParserConfig
  DEFAULT_CONFIG = {
    json: { enabled: true, priority: 1 },
    yaml: { enabled: true, priority: 2 },
    csv: { enabled: true, priority: 3, auto_detect_separator: true },
    xml: { enabled: true, priority: 4 },
    detection_timeout: 5.0,
    cache_enabled: true,
    cache_size: 1000,
    sample_size: 4096,
    fallback_parser: :text
  }.freeze
  
  def initialize(config = {})
    # Shallow merge: overriding e.g. csv: replaces that format's whole hash
    @config = DEFAULT_CONFIG.merge(config)
  end
  
  def parser_priority_order
    enabled_parsers = @config.select { |k, v| v.is_a?(Hash) && v[:enabled] }
    enabled_parsers.sort_by { |_, v| v[:priority] }.map(&:first)
  end
  
  def timeout_enabled?
    @config[:detection_timeout] > 0
  end
  
  def with_timeout(&block)
    if timeout_enabled?
      Timeout.timeout(@config[:detection_timeout], &block)
    else
      block.call
    end
  end
end

# Environment-specific configuration
production_config = ParserConfig.new(
  detection_timeout: 2.0,
  cache_size: 5000,
  sample_size: 2048,
  csv: { enabled: true, priority: 1, auto_detect_separator: true }
)

development_config = ParserConfig.new(
  detection_timeout: 10.0,
  cache_enabled: false,
  sample_size: 8192
)

Reference

Core Parser Selection Methods

Method Parameters Returns Description
CSV.parse(data, row_sep: :auto) data (String), row_sep: :auto Array<Array> Automatically detects row separator in CSV data
ACSV::CSV.foreach(file) file (String/IO), block Enumerator Auto-detects encoding and separator for CSV iteration
ACSV::Detect.encoding(data) data (String) String Detects character encoding of data
ACSV::Detect.separator(data) data (String) String Detects column separator in CSV data
FrontMatterParser::Parser.parse_file(path) path (String), options (Hash) ParsedResult Auto-detects syntax based on file extension
Dna.new(handle) handle (IO), options (Hash) Parser Auto-detects biological sequence file format
Shale::Mapper.from_json(data) data (String) Mapper Parses data as JSON format
Shale::Mapper.from_yaml(data) data (String) Mapper Parses data as YAML format
Shale::Mapper.from_csv(data) data (String) Mapper Parses data as CSV format
JSON.parse(data) data (String), options (Hash) Object Parses JSON with encoding auto-detection
YAML.safe_load(data) data (String), options (Hash) Object Parses YAML with format validation

Format Detection Strategies

Strategy Detection Method Reliability Performance Use Cases
File Extension Path analysis Medium High File system operations, batch processing
Content Sniffing Structure analysis High Medium API endpoints, data streams
Magic Bytes Header inspection High High Binary formats, specific file types
Delimiter Analysis Character frequency Medium High CSV/TSV detection, structured text
Encoding Detection Byte pattern analysis High Low Multi-language content, legacy data
Schema Validation Structure validation Very High Low Critical data processing, validation
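The magic-bytes strategy in the table above can be sketched in a few lines. The gzip, zip, and PDF signatures shown are well-known; binary_format is a hypothetical helper that expects its input in binary (ASCII-8BIT) encoding:

```ruby
# Identify common binary container formats from their leading bytes
def binary_format(bytes)
  if bytes.start_with?("\x1F\x8B".b)      then :gzip
  elsif bytes.start_with?("PK\x03\x04".b) then :zip
  elsif bytes.start_with?("%PDF".b)       then :pdf
  else :unknown
  end
end

binary_format("%PDF-1.7".b)  # => :pdf
```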

Built-in Parser Selection Options

Library Auto-Detection Features Configuration Options
CSV Row separator (:auto), encoding detection col_sep, row_sep, quote_char, encoding
JSON Encoding detection, nested structure handling symbolize_names, max_nesting, allow_nan
YAML Multi-document detection, safe loading permitted_classes, aliases, filename
XML Encoding declaration parsing, namespace handling encoding, external_encoding, recover

Third-Party Parser Selection Libraries

Library Formats Supported Key Features Selection Method
ACSV CSV Encoding/separator auto-detection Content analysis, character detection
Shale JSON, YAML, TOML, CSV, XML Unified mapper interface Method-based format selection
FrontMatterParser YAML, JSON front matter Syntax-aware comment parsing File extension, content delimiters
MultiJSON JSON variants Adapter pattern, fallback parsers Performance-based selection, availability
Dna FASTA, FASTQ, QSEQ Biological sequence format detection Content structure analysis
SmarterCSV CSV Intelligent defaults, chunk processing Header analysis, data type inference

Error Classes and Handling

Exception Class Trigger Conditions Recovery Strategies
ACSV::DetectionError Encoding/separator detection failure Manual format specification, fallback parsers
JSON::ParserError Invalid JSON syntax Alternative format attempt, content validation
Psych::SyntaxError Invalid YAML structure Format fallback, partial parsing
CSV::MalformedCSVError Inconsistent CSV structure Liberal parsing, error skipping
FrontMatterParser::SyntaxNotRecognized Unknown file syntax Custom parser definition, manual parsing
ArgumentError Invalid parser configuration Configuration validation, default settings

Performance Characteristics

Operation Typical Duration Memory Usage Scalability
File extension check < 1ms Minimal Excellent
Content sampling (4KB) 1-5ms Low Very Good
Encoding detection 10-100ms Medium Good
Full content parsing Variable High Depends on format
Schema validation 50-500ms Medium Fair

Configuration Examples

# CSV auto-detection configuration
# Only row_sep supports :auto in the standard library; automatic column
# separator, quote character, and encoding detection require a wrapper
# such as ACSV.
CSV.new(data, row_sep: :auto)

# Multi-format parser adapter configuration (Shale exposes module-level
# adapter writers rather than a configure block)
require 'shale/adapter/nokogiri'

Shale.json_adapter = MultiJson
Shale.yaml_adapter = Psych
Shale.xml_adapter  = Shale::Adapter::Nokogiri
Shale.csv_adapter  = Shale::Adapter::CSV

# Front matter parsing for an unrecognized extension: pass the syntax
# parser explicitly instead of relying on extension-based selection
FrontMatterParser::Parser.parse_file('component.vue',
  syntax_parser: FrontMatterParser::SyntaxParser::Html.new)