Overview
Parser selection in Ruby refers to the automated and manual mechanisms for choosing appropriate parsing strategies based on data characteristics, file extensions, content inspection, and format detection. Ruby provides multiple layers of parser selection through its standard library and third-party gems, enabling applications to handle diverse data formats without explicit format specification.
Ruby's standard library includes automatic format detection in CSV parsing with the `:auto` setting for row separators, encoding detection capabilities, and content-type based parsing selection. The JSON and YAML libraries provide format-specific parsing with automatic encoding handling. Third-party libraries extend these capabilities with multi-format parsing frameworks that can automatically select between JSON, YAML, TOML, CSV, and XML parsers based on file extensions or content analysis.
The parser selection process operates through several strategies: file extension mapping, content sniffing, encoding detection, delimiter analysis, and structure inspection. Libraries like Shale provide unified interfaces that automatically route data to appropriate parsers, while gems like ACSV perform automatic encoding and separator detection for CSV files.
require 'csv'
# Automatic row separator detection
CSV.parse(data, row_sep: :auto)
# Content-based format selection
require 'shale'
person = Person.from_json(json_data) # JSON parser selected
person = Person.from_yaml(yaml_data) # YAML parser selected
person = Person.from_xml(xml_data) # XML parser selected
Parser selection reduces the burden of format specification in applications that handle multiple data formats. It is particularly useful in data processing pipelines, API endpoints, and file import systems where the source format may vary or be unknown at runtime.
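In an API endpoint, for example, the request's Content-Type header can drive the same routing decision. A minimal sketch, assuming a hand-rolled mapping; the CONTENT_TYPE_PARSERS table and parse_body helper are illustrative names, not part of any particular framework:
require 'json'
require 'yaml'
require 'csv'
# Illustrative media-type to parser mapping
CONTENT_TYPE_PARSERS = {
  'application/json' => :json,
  'application/x-yaml' => :yaml,
  'text/yaml' => :yaml,
  'text/csv' => :csv
}.freeze
def parse_body(body, content_type)
  # Strip parameters such as "; charset=utf-8" before the lookup
  media_type = content_type.to_s.split(';').first&.strip
  case CONTENT_TYPE_PARSERS[media_type]
  when :json then JSON.parse(body)
  when :yaml then YAML.safe_load(body)
  when :csv  then CSV.parse(body, headers: true)
  else raise ArgumentError, "Unsupported content type: #{content_type}"
  end
end
parse_body('{"name":"John"}', 'application/json; charset=utf-8')
# => {"name"=>"John"}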
Basic Usage
Ruby's parser selection mechanisms operate through both built-in standard library features and third-party libraries that provide enhanced format detection capabilities.
The CSV library provides automatic row separator detection through the `:auto` setting, which analyzes data content to determine the appropriate line-ending sequence:
require 'csv'
data = "name,age\r\nJohn,25\r\nJane,30\r\n"
parsed = CSV.parse(data, row_sep: :auto)
# Automatically detects \r\n line endings
Multi-format parsing libraries like Shale enable automatic parser selection based on method calls, where the format is implied by the method name:
require 'shale'
class Person < Shale::Mapper
  attribute :name, Shale::Type::String
  attribute :age, Shale::Type::Integer
end
# Parser selection based on method name
person = Person.from_json('{"name":"John","age":25}')
person = Person.from_yaml("name: John\nage: 25")
person = Person.from_csv("John,25")
File extension-based parser selection occurs in libraries that analyze file paths to determine appropriate parsing strategies:
require 'front_matter_parser'
# Parser selection based on file extension
parsed = FrontMatterParser::Parser.parse_file('document.md') # Markdown
parsed = FrontMatterParser::Parser.parse_file('template.erb') # ERB
parsed = FrontMatterParser::Parser.parse_file('style.scss') # SCSS
Automatic encoding and delimiter detection provides robust CSV parsing without manual configuration:
require 'acsv'
# Automatic encoding and separator detection
ACSV::CSV.foreach("data.csv") do |row|
  puts row.inspect
end
# Manual access to detection results
data = File.read("mixed_format.csv")
encoding = ACSV::Detect.encoding(data)
separator = ACSV::Detect.separator(data.force_encoding(encoding))
Content inspection for format detection enables parsing of data streams where the format is determined by analyzing structural patterns:
# Biological sequence format detection
require 'dna'
File.open('sequences.unknown') do |handle|
  records = Dna.new(handle) # Automatic format detection
  records.each { |record| puts record.length }
end
Parser adapter selection allows switching between different parsing implementations based on performance requirements or feature needs:
require 'shale'
require 'multi_json'
# Selecting JSON parser adapter
Shale.json_adapter = MultiJson
# Selecting XML parser adapter
require 'shale/adapter/nokogiri'
Shale.xml_adapter = Shale::Adapter::Nokogiri
Error Handling & Debugging
Parser selection failures occur when automatic detection mechanisms cannot determine the correct format, when multiple formats match the input data, or when the detected format is incorrect for the actual data structure.
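The multiple-match case is easy to reproduce with the standard library alone: JSON flow syntax is also valid YAML, so a content sniffer that tries YAML first will claim JSON documents:
require 'json'
require 'yaml'
data = '{"name": "John", "age": 25}'
JSON.parse(data)     # => {"name"=>"John", "age"=>25} (valid JSON)
YAML.safe_load(data) # => {"name"=>"John", "age"=>25} (also valid YAML flow mapping)
# A detector should therefore try the stricter grammar (JSON) before YAML.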
Format detection ambiguity requires fallback strategies and explicit format specification when automatic selection fails:
require 'acsv'
require 'csv'
def robust_csv_parse(file_path)
  ACSV::CSV.read(file_path)
rescue ACSV::DetectionError => e
  puts "Auto-detection failed: #{e.message}"
  # Fall back to standard CSV with common settings
  encodings = ['UTF-8', 'ISO-8859-1', 'Windows-1252']
  separators = [',', ';', "\t"] # double quotes so "\t" is a real tab
  encodings.each do |encoding|
    separators.each do |sep|
      begin
        return CSV.read(file_path, encoding: encoding, col_sep: sep)
      rescue StandardError
        next
      end
    end
  end
  raise "Could not parse CSV with any combination of settings"
end
Content validation after parser selection prevents incorrect parsing results when format detection succeeds but selects the wrong parser:
require 'json'
require 'csv'
class FormatValidator
  def self.validate_json_structure(data, expected_keys)
    parsed = JSON.parse(data)
    unless parsed.is_a?(Hash) && expected_keys.all? { |key| parsed.key?(key) }
      raise ArgumentError, "JSON structure doesn't match expected format"
    end
    parsed
  rescue JSON::ParserError => e
    raise ArgumentError, "Invalid JSON format: #{e.message}"
  end

  def self.validate_csv_columns(file_path, expected_columns)
    headers = CSV.open(file_path, &:readline)
    missing = expected_columns - headers
    unless missing.empty?
      raise ArgumentError, "Missing required columns: #{missing.join(', ')}"
    end
    extra = headers - expected_columns
    puts "Warning: unexpected columns found: #{extra.join(', ')}" unless extra.empty?
    true
  end
end
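Hypothetical usage, assuming a users.csv file whose header row should include id and email columns:
FormatValidator.validate_json_structure('{"id": 1, "name": "John"}', ['id', 'name'])
# => {"id"=>1, "name"=>"John"}
FormatValidator.validate_csv_columns('users.csv', ['id', 'email'])
# => true (raises ArgumentError if either column is missing)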
Debugging parser selection requires logging detection decisions and providing visibility into the selection process:
require 'json'
require 'yaml'
class ParserSelector
  def self.select_parser_with_logging(data, filename = nil)
    detection_log = []
    # Try JSON detection first: it has the strictest grammar
    begin
      JSON.parse(data.strip)
      detection_log << "JSON: valid syntax detected"
      return :json
    rescue JSON::ParserError
      detection_log << "JSON: parse failed"
    end
    # Try YAML detection; require a structured result, since bare scalars
    # also parse as valid YAML
    begin
      result = YAML.safe_load(data)
      if result.is_a?(Hash) || result.is_a?(Array)
        detection_log << "YAML: valid syntax detected"
        return :yaml
      end
      detection_log << "YAML: parsed, but not a structured document"
    rescue Psych::SyntaxError
      detection_log << "YAML: parse failed"
    end
    # File extension fallback
    if filename
      case File.extname(filename).downcase
      when '.json'
        detection_log << "Extension: .json detected"
        return :json
      when '.yml', '.yaml'
        detection_log << "Extension: .yaml detected"
        return :yaml
      when '.csv'
        detection_log << "Extension: .csv detected"
        return :csv
      end
    end
    detection_log << "No format detected, defaulting to plain text"
    :text
  ensure
    # ensure runs on every exit path, so the log is printed even when an
    # early return selects a format
    puts "Detection log: #{detection_log.join(' | ')}"
  end
end
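For example:
ParserSelector.select_parser_with_logging('{"name": "John"}')
# Detection log: JSON: valid syntax detected
# => :json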
Error recovery patterns handle parser selection failures gracefully by attempting alternative parsers or degrading to simpler parsing approaches:
class RobustDataParser
  def self.parse_with_fallbacks(data, hint = nil)
    parsers = determine_parser_order(data, hint)
    last_error = nil
    parsers.each do |parser_type|
      begin
        return send("parse_as_#{parser_type}", data)
      rescue StandardError => e
        last_error = e
        puts "#{parser_type.upcase} parsing failed: #{e.message}"
        next
      end
    end
    # Final fallback to raw text
    puts "All parsers failed, returning raw text"
    { content: data.to_s, format: :text, error: last_error&.message }
  end

  def self.determine_parser_order(data, hint)
    order = []
    # Use hint if provided
    order << hint if hint
    # Content-based heuristics
    order << :json if data.strip.start_with?('{', '[')
    order << :yaml if data.include?(':') && !data.include?('{')
    order << :csv if data.include?(',') || data.include?(';')
    # Default order for remaining parsers
    ([:json, :yaml, :csv, :xml] - order).each { |p| order << p }
    order.uniq
  end
  # `private` has no effect on class methods; restrict visibility explicitly
  private_class_method :determine_parser_order
end
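The parse_as_* methods invoked through send are not defined above; a minimal sketch of two of them, assuming each returns the parsed content tagged with its format:
require 'json'
require 'yaml'
class RobustDataParser
  def self.parse_as_json(data)
    { content: JSON.parse(data), format: :json }
  end

  def self.parse_as_yaml(data)
    parsed = YAML.safe_load(data)
    # Treat bare scalars as failure so the fallback chain continues
    raise ArgumentError, 'not structured YAML' unless parsed.is_a?(Hash) || parsed.is_a?(Array)
    { content: parsed, format: :yaml }
  end
end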
Production Patterns
Parser selection in production environments requires robust error handling, performance optimization, caching of detection results, and monitoring of parser selection decisions to ensure reliability and maintainability.
Caching parser selection decisions reduces overhead when processing multiple files with similar characteristics or when the same data formats are encountered repeatedly:
class ParserSelectionCache
  def initialize(max_size: 1000)
    @cache = {}
    @max_size = max_size
    @access_order = []
  end

  def get_parser(content_hash, &block)
    if @cache.key?(content_hash)
      # Move to end for LRU
      @access_order.delete(content_hash)
      @access_order << content_hash
      return @cache[content_hash]
    end
    parser = block.call
    if @cache.size >= @max_size
      # Remove least recently used
      oldest = @access_order.shift
      @cache.delete(oldest)
    end
    @cache[content_hash] = parser
    @access_order << content_hash
    parser
  end
end
# Usage with content-based caching
require 'digest'
parser_cache = ParserSelectionCache.new(max_size: 500)
def parse_data_with_cache(data, cache)
  content_hash = Digest::SHA256.hexdigest(data[0, 1000]) # Hash first 1KB
  parser = cache.get_parser(content_hash) do
    detect_parser_type(data) # placeholder: any detection routine
  end
  send("parse_as_#{parser}", data)
end
Performance optimization for parser selection involves minimizing detection overhead and implementing efficient content sampling strategies:
class OptimizedParserSelector
  DETECTION_SAMPLE_SIZE = 4096 # Only analyze first 4KB

  def self.select_parser_fast(data)
    # Use a small sample for detection to avoid processing large files
    sample = data.is_a?(String) ? data[0, DETECTION_SAMPLE_SIZE] : data
    # Fast heuristics based on leading characters
    trimmed = sample.strip
    return :json if json_like?(trimmed)
    return :xml if xml_like?(trimmed)
    return :yaml if yaml_like?(trimmed)
    return :csv if csv_like?(trimmed)
    :text
  end

  def self.json_like?(sample)
    sample.start_with?('{', '[') && sample.include?('"')
  end

  def self.xml_like?(sample)
    sample.start_with?('<?xml', '<') && sample.include?('>')
  end

  def self.yaml_like?(sample)
    sample.include?(':') && !sample.include?('{') &&
      sample.split("\n").any? { |line| line.match(/^\w+\s*:/) }
  end

  def self.csv_like?(sample)
    lines = sample.split("\n").first(3)
    return false if lines.size < 2
    # Check for consistent column counts across candidate separators
    separators = [',', ';', "\t"]
    separators.any? do |sep|
      counts = lines.map { |line| line.split(sep).size }
      counts.uniq.size == 1 && counts.first > 1
    end
  end

  # `private` has no effect on class methods; restrict visibility explicitly
  private_class_method :json_like?, :xml_like?, :yaml_like?, :csv_like?
end
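A few sample calls against these heuristics:
OptimizedParserSelector.select_parser_fast('{"id": 1}')      # => :json
OptimizedParserSelector.select_parser_fast('<root/>')        # => :xml
OptimizedParserSelector.select_parser_fast("a,b\n1,2\n3,4")  # => :csv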
Monitoring and logging parser selection decisions provides visibility into format distribution and helps identify potential issues in production:
require 'logger'
require 'json'
class ParserSelectionMonitor
  def initialize(logger: Logger.new(STDOUT))
    @logger = logger
    @stats = Hash.new(0)
    @errors = Hash.new(0)
  end

  def track_selection(format, source_hint: nil, duration: nil)
    @stats[format] += 1
    @logger.info({
      event: 'parser_selection',
      format: format,
      source_hint: source_hint,
      duration_ms: duration,
      total_selections: @stats.values.sum
    }.to_json)
  end

  def track_error(error_type, details: nil)
    @errors[error_type] += 1
    @logger.error({
      event: 'parser_selection_error',
      error_type: error_type,
      details: details,
      total_errors: @errors.values.sum
    }.to_json)
  end

  def format_distribution
    total = @stats.values.sum.to_f
    @stats.transform_values { |count| (count / total * 100).round(2) }
  end

  def error_rates
    total_attempts = @stats.values.sum + @errors.values.sum
    return {} if total_attempts.zero?
    @errors.transform_values do |count|
      (count.to_f / total_attempts * 100).round(2)
    end
  end
end
# Integration with parser selection. A top-level local variable is not
# visible inside a method body, so the monitor is passed in explicitly.
def monitored_parse(data, monitor, filename: nil)
  start_time = Time.now
  format = OptimizedParserSelector.select_parser_fast(data)
  duration = ((Time.now - start_time) * 1000).round(2)
  monitor.track_selection(format,
    source_hint: File.extname(filename || ''),
    duration: duration)
  parse_with_format(data, format) # placeholder: dispatch to the chosen parser
rescue StandardError => e
  monitor.track_error(e.class.name, details: e.message[0, 200])
  raise
end
monitor = ParserSelectionMonitor.new
# monitored_parse(data, monitor, filename: 'payload.json')
Configuration management for parser selection allows tuning detection parameters and parser preferences for different environments:
require 'timeout'
class ParserConfig
  DEFAULT_CONFIG = {
    json: { enabled: true, priority: 1 },
    yaml: { enabled: true, priority: 2 },
    csv: { enabled: true, priority: 3, auto_detect_separator: true },
    xml: { enabled: true, priority: 4 },
    detection_timeout: 5.0,
    cache_enabled: true,
    cache_size: 1000,
    sample_size: 4096,
    fallback_parser: :text
  }.freeze

  def initialize(config = {})
    # Note: Hash#merge is shallow, so per-parser overrides must supply the
    # complete sub-hash (as in the examples below)
    @config = DEFAULT_CONFIG.merge(config)
  end

  def parser_priority_order
    enabled_parsers = @config.select { |_key, value| value.is_a?(Hash) && value[:enabled] }
    enabled_parsers.sort_by { |_, value| value[:priority] }.map(&:first)
  end

  def timeout_enabled?
    @config[:detection_timeout] > 0
  end

  def with_timeout(&block)
    if timeout_enabled?
      Timeout.timeout(@config[:detection_timeout], &block)
    else
      block.call
    end
  end
end
# Environment-specific configuration
production_config = ParserConfig.new(
  detection_timeout: 2.0,
  cache_size: 5000,
  sample_size: 2048,
  csv: { enabled: true, priority: 1, auto_detect_separator: true }
)
development_config = ParserConfig.new(
  detection_timeout: 10.0,
  cache_enabled: false,
  sample_size: 8192
)
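With the default settings, the priority order computed by parser_priority_order resolves as follows:
ParserConfig.new.parser_priority_order
# => [:json, :yaml, :csv, :xml]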
Reference
Core Parser Selection Methods
Method | Parameters | Returns | Description |
---|---|---|---|
`CSV.parse(data, row_sep: :auto)` | `data` (String), `row_sep: :auto` | Array<Array> | Automatically detects the row separator in CSV data |
`ACSV::CSV.foreach(file)` | `file` (String/IO), block | Enumerator | Auto-detects encoding and separator for CSV iteration |
`ACSV::Detect.encoding(data)` | `data` (String) | String | Detects the character encoding of data |
`ACSV::Detect.separator(data)` | `data` (String) | String | Detects the column separator in CSV data |
`FrontMatterParser::Parser.parse_file(path)` | `path` (String), options (Hash) | Parsed | Auto-detects front matter syntax from the file extension |
`Dna.new(handle)` | `handle` (IO), options (Hash) | Parser | Auto-detects biological sequence file format |
`Shale::Mapper.from_json(data)` | `data` (String) | Mapper | Parses data as JSON |
`Shale::Mapper.from_yaml(data)` | `data` (String) | Mapper | Parses data as YAML |
`Shale::Mapper.from_csv(data)` | `data` (String) | Mapper | Parses data as CSV |
`JSON.parse(data)` | `data` (String), options (Hash) | Object | Parses JSON with encoding auto-detection |
`YAML.safe_load(data)` | `data` (String), options (Hash) | Object | Parses YAML with safe-loading restrictions |
Format Detection Strategies
Strategy | Detection Method | Reliability | Performance | Use Cases |
---|---|---|---|---|
File Extension | Path analysis | Medium | High | File system operations, batch processing |
Content Sniffing | Structure analysis | High | Medium | API endpoints, data streams |
Magic Bytes | Header inspection | High | High | Binary formats, specific file types |
Delimiter Analysis | Character frequency | Medium | High | CSV/TSV detection, structured text |
Encoding Detection | Byte pattern analysis | High | Low | Multi-language content, legacy data |
Schema Validation | Structure validation | Very High | Low | Critical data processing, validation |
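The magic-bytes strategy from the table above reduces to a short header check; a minimal sketch using well-known file signatures (the helper name and file path are illustrative):
# Well-known file header signatures ("magic bytes")
MAGIC_BYTES = {
  "\x1F\x8B".b   => :gzip, # gzip compression header
  "PK\x03\x04".b => :zip,  # ZIP local file header
  "%PDF".b       => :pdf   # PDF document marker
}.freeze
def detect_by_magic_bytes(io)
  header = io.read(4).to_s.b
  io.rewind
  match = MAGIC_BYTES.find { |magic, _| header.start_with?(magic) }
  match ? match.last : :unknown
end
File.open('upload.bin', 'rb') { |f| detect_by_magic_bytes(f) } # => :zip, :pdf, ...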
Built-in Parser Selection Options
Library | Auto-Detection Features | Configuration Options |
---|---|---|
CSV | Row separator (`:auto`), encoding detection | `col_sep`, `row_sep`, `quote_char`, `encoding` |
JSON | Encoding detection, nested structure handling | `symbolize_names`, `max_nesting`, `allow_nan` |
YAML | Multi-document detection, safe loading | `permitted_classes`, `aliases`, `filename` |
XML | Encoding declaration parsing, namespace handling | `encoding`, `external_encoding`, `recover` |
Third-Party Parser Selection Libraries
Library | Formats Supported | Key Features | Selection Method |
---|---|---|---|
ACSV | CSV | Encoding/separator auto-detection | Content analysis, character detection |
Shale | JSON, YAML, TOML, CSV, XML | Unified mapper interface | Method-based format selection |
FrontMatterParser | YAML, JSON front matter | Syntax-aware comment parsing | File extension, content delimiters |
MultiJson | JSON variants | Adapter pattern, fallback parsers | Performance-based selection, availability |
Dna | FASTA, FASTQ, QSEQ | Biological sequence format detection | Content structure analysis |
SmarterCSV | CSV | Intelligent defaults, chunk processing | Header analysis, data type inference |
Error Classes and Handling
Exception Class | Trigger Conditions | Recovery Strategies |
---|---|---|
`ACSV::DetectionError` | Encoding/separator detection failure | Manual format specification, fallback parsers |
`JSON::ParserError` | Invalid JSON syntax | Alternative format attempt, content validation |
`Psych::SyntaxError` | Invalid YAML structure | Format fallback, partial parsing |
`CSV::MalformedCSVError` | Inconsistent CSV structure | Liberal parsing, error skipping |
`FrontMatterParser::SyntaxNotRecognized` | Unknown file syntax | Custom parser definition, manual parsing |
`ArgumentError` | Invalid parser configuration | Configuration validation, default settings |
Performance Characteristics
Operation | Typical Duration | Memory Usage | Scalability |
---|---|---|---|
File extension check | < 1ms | Minimal | Excellent |
Content sampling (4KB) | 1-5ms | Low | Very Good |
Encoding detection | 10-100ms | Medium | Good |
Full content parsing | Variable | High | Depends on format |
Schema validation | 50-500ms | Medium | Fair |
Configuration Examples
# CSV auto-detection configuration
# Note: the standard library supports :auto only for row_sep; automatic
# column separator, quote character, and encoding detection require a
# third-party gem (such as ACSV) or a custom wrapper
CSV.new(data, row_sep: :auto)
# Multi-format parser adapter configuration (Shale module attributes)
require 'shale/adapter/csv'
require 'shale/adapter/nokogiri'
Shale.json_adapter = MultiJson
Shale.yaml_adapter = Psych # Psych provides the load/dump interface Shale expects
Shale.csv_adapter = Shale::Adapter::CSV
Shale.xml_adapter = Shale::Adapter::Nokogiri
# Front matter syntax selection for extensions without built-in mappings:
# pass a syntax parser explicitly (VueSyntaxParser is a hypothetical class
# implementing the gem's syntax parser interface)
FrontMatterParser::Parser.parse_file('component.vue', syntax_parser: VueSyntaxParser.new)