CrackedRuby

YAML Processing

Overview

YAML (YAML Ain't Markup Language) processing in Ruby centers around the Psych library, which became the default YAML processor in Ruby 1.9.3. The YAML module provides the primary interface for parsing YAML strings into Ruby objects and serializing Ruby objects back to YAML format.

Ruby's YAML implementation handles scalar values (strings, numbers, booleans), sequences (arrays), and mappings (hashes). The parser automatically converts YAML types to appropriate Ruby objects: strings remain strings, integers become Integer objects, and nested structures become Hash and Array objects.

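The core API revolves around four primary methods: YAML.load for parsing YAML strings into Ruby objects, YAML.dump for converting Ruby objects to YAML strings, YAML.load_file for reading YAML files directly, and YAML.safe_load for security-conscious parsing with restricted object instantiation. Since Psych 4 (bundled with Ruby 3.1), YAML.load itself applies the safe-loading restrictions by default: it permits only basic types and symbols and rejects aliases. YAML.unsafe_load restores the older unrestricted behavior for trusted input.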

require 'yaml'

# Basic YAML parsing
yaml_string = "name: John\nage: 30\nskills: [Ruby, Python, JavaScript]"
data = YAML.load(yaml_string)
# => {"name"=>"John", "age"=>30, "skills"=>["Ruby", "Python", "JavaScript"]}

# Converting Ruby objects to YAML
ruby_data = { users: [{ name: "Alice", role: "admin" }, { name: "Bob", role: "user" }] }
yaml_output = YAML.dump(ruby_data)
puts yaml_output
# =>
# ---
# :users:
# - :name: Alice
#   :role: admin
# - :name: Bob
#   :role: user

The Psych parser provides additional control through the Psych::Parser and Psych::Emitter classes for streaming operations and custom YAML processing. Ruby's YAML processing supports YAML 1.1 specification features including anchors, aliases, and multi-document streams.

Basic Usage

Loading YAML data transforms text into Ruby objects through several methods depending on the data source and security requirements. YAML.load handles basic string parsing, while YAML.load_file reads directly from files without manual file operations.

# Loading from strings
config_yaml = <<~YAML
  database:
    host: localhost
    port: 5432
    credentials:
      username: app_user
      password: secure_pass
  features:
    caching: true
    logging: false
YAML

config = YAML.load(config_yaml)
db_host = config["database"]["host"]  # => "localhost"
cache_enabled = config["features"]["caching"]  # => true

File-based YAML loading eliminates the need for explicit file handling. The method automatically opens, reads, and closes files while applying the same parsing rules as string-based loading.

# config/application.yml
YAML.load_file("config/application.yml")
# Equivalent to: YAML.load(File.read("config/application.yml"))

# Loading with error handling
begin
  settings = YAML.load_file("config/settings.yml")
rescue Errno::ENOENT
  settings = {}  # Default empty configuration
end

Generating YAML output converts Ruby data structures into formatted YAML strings. The YAML.dump method handles nested objects, arrays, and complex data types while maintaining proper YAML syntax and indentation.

# Complex data structure conversion
application_config = {
  name: "MyApp",
  version: "2.1.4",
  environments: {
    development: {
      debug: true,
      database_url: "postgres://localhost/myapp_dev"
    },
    production: {
      debug: false,
      database_url: ENV["DATABASE_URL"]
    }
  },
  supported_locales: ["en", "es", "fr", "de"]
}

yaml_config = YAML.dump(application_config)
File.write("generated_config.yml", yaml_config)
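
YAML.dump also accepts formatting options such as indentation and line_width (listed under Dump Options in the Reference section). The following is a minimal sketch; the settings hash is made up for illustration and is not part of the original example.

```ruby
require 'yaml'

# Hypothetical data, used only to show the formatting options
settings = { "server" => { "host" => "localhost", "ports" => [80, 443] } }

# indentation sets the number of spaces per nesting level;
# line_width: -1 asks the emitter not to wrap long lines
wide_yaml = YAML.dump(settings, indentation: 4, line_width: -1)
puts wide_yaml

# Formatting changes only the text; parsing it gives back equal data
restored = YAML.load(wide_yaml)
```

Because the options affect layout only, they can be tuned for readability without changing what consumers of the file see.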

Stream processing handles multiple YAML documents within a single file or string. The YAML.load_stream method returns an array containing each document as a separate Ruby object, enabling batch processing of related configurations.

# Multiple document processing
multi_doc_yaml = <<~YAML
  ---
  service: web
  replicas: 3
  ---
  service: database
  replicas: 1
  ---
  service: cache
  replicas: 2
YAML

services = YAML.load_stream(multi_doc_yaml)
services.each do |service_config|
  puts "#{service_config['service']}: #{service_config['replicas']} replicas"
end
# => web: 3 replicas
# => database: 1 replicas
# => cache: 2 replicas

Symbol key conversion requires explicit handling since YAML typically produces string keys by default. Ruby provides several approaches for converting string keys to symbols when the application expects symbolic access patterns.

# Converting string keys to symbols
yaml_data = YAML.load("name: John\nage: 30")
# => {"name"=>"John", "age"=>30}

# Manual symbol conversion
symbolized = yaml_data.transform_keys(&:to_sym)
# => {:name=>"John", :age=>30}

# Deep symbol conversion for nested hashes
def deep_symbolize_keys(obj)
  case obj
  when Hash
    obj.transform_keys(&:to_sym).transform_values { |v| deep_symbolize_keys(v) }
  when Array
    obj.map { |item| deep_symbolize_keys(item) }
  else
    obj
  end
end
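
When the keys only need to become symbols at load time, the loader's symbolize_names option may be simpler than a hand-written helper. A short sketch, assuming a Psych version that supports keyword options (3.1 or later); the nested_yaml content is illustrative:

```ruby
require 'yaml'

nested_yaml = "database:\n  host: localhost\n  ports:\n    - primary: 5432\n"

# symbolize_names: true converts mapping keys to symbols at every depth,
# including hashes nested inside arrays
config = YAML.safe_load(nested_yaml, symbolize_names: true)

puts config[:database][:host]
puts config[:database][:ports].first[:primary]
```

The same option is accepted by YAML.load and YAML.load_file.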

Error Handling & Debugging

YAML parsing failures generate specific exception types that indicate different categories of problems. Psych::SyntaxError occurs when YAML syntax violates formatting rules, while Psych::DisallowedClass indicates attempts to instantiate restricted object types during parsing.

# Syntax error handling
malformed_yaml = <<~YAML
  name: John
    age: 30  # Incorrect indentation
  email: john@example.com
YAML

begin
  data = YAML.load(malformed_yaml)
rescue Psych::SyntaxError => e
  puts "YAML syntax error at line #{e.line}, column #{e.column}: #{e.problem}"
  puts "Context: #{e.context}" if e.context
  # Handle gracefully or provide user feedback
end

Safe loading prevents arbitrary object instantiation during YAML parsing, addressing security vulnerabilities where malicious YAML could execute code through object deserialization. The YAML.safe_load method restricts parsing to basic Ruby types: strings, numbers, arrays, hashes, booleans, and nil. Anything else, including Symbol, Date, and Time, must be listed explicitly in permitted_classes.

# Safe loading with permitted classes
potentially_unsafe_yaml = <<~YAML
  created_at: 2023-12-15 10:30:00
  user_data:
    name: Alice
    permissions: [read, write, admin]
YAML

# Unrestricted loading (YAML.unsafe_load, or YAML.load before Psych 4)
# could instantiate arbitrary objects
# data = YAML.unsafe_load(potentially_unsafe_yaml)

# Safe approach with explicit permitted classes
require 'date'  # makes the Date constant available

safe_data = YAML.safe_load(
  potentially_unsafe_yaml,
  permitted_classes: [Date, Time, Symbol],
  aliases: true
)
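
To see the restriction in action, the sketch below loads a date value twice: once without permitting Date, which raises Psych::DisallowedClass, and once with Date permitted. The released_on key is an illustrative name, not taken from the example above.

```ruby
require 'yaml'
require 'date'

yaml = "released_on: 2023-12-15\n"

# Without permitted_classes, the Date value is rejected
rejected = begin
  YAML.safe_load(yaml)
  false
rescue Psych::DisallowedClass => e
  warn "Rejected: #{e.message}"
  true
end

# Permitting Date lets the same document load
permitted = YAML.safe_load(yaml, permitted_classes: [Date])
puts permitted["released_on"].class
```

Rescuing Psych::DisallowedClass separately from Psych::SyntaxError lets an application distinguish untrusted content from malformed content.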

Custom error handling strategies help applications gracefully handle missing files, network timeouts, or corrupted YAML data. Implementing fallback mechanisms prevents application crashes when configuration files become unavailable or malformed.

# Robust configuration loading with fallbacks
class ConfigurationLoader
  DEFAULT_CONFIG = {
    "database" => { "pool_size" => 5, "timeout" => 30 },
    "cache" => { "enabled" => true, "ttl" => 3600 },
    "logging" => { "level" => "info" }
  }.freeze

  def self.load_configuration(file_path)
    YAML.load_file(file_path)
  rescue Errno::ENOENT
    warn "Configuration file not found: #{file_path}. Using defaults."
    DEFAULT_CONFIG.dup
  rescue Psych::SyntaxError => e
    warn "Invalid YAML syntax in #{file_path}: #{e.message}"
    DEFAULT_CONFIG.dup
  rescue StandardError => e
    warn "Unexpected error loading configuration: #{e.message}"
    DEFAULT_CONFIG.dup
  end

  def self.validate_required_keys(config, required_keys)
    missing_keys = required_keys - config.keys
    unless missing_keys.empty?
      raise ArgumentError, "Missing required configuration keys: #{missing_keys.join(', ')}"
    end
    config
  end
end

# Usage with validation
config = ConfigurationLoader.load_configuration("config/app.yml")
validated_config = ConfigurationLoader.validate_required_keys(
  config,
  %w[database cache logging]
)

Debugging YAML parsing issues requires understanding the parser's state and the specific location of problems. The Psych::SyntaxError exception provides line numbers, column positions, and contextual information about parsing failures.

# Detailed debugging approach
def debug_yaml_parsing(yaml_content, source_description = "YAML content")
  begin
    parsed_data = YAML.load(yaml_content)
    puts "Successfully parsed #{source_description}"
    parsed_data
  rescue Psych::SyntaxError => e
    puts "YAML parsing failed in #{source_description}:"
    puts "  Line #{e.line}, Column #{e.column}"
    puts "  Problem: #{e.problem}"
    puts "  Context: #{e.context}" if e.context

    # Show problematic lines with context
    lines = yaml_content.lines
    start_line = [e.line - 3, 0].max
    end_line = [e.line + 2, lines.length - 1].min

    (start_line..end_line).each do |line_num|
      marker = line_num == e.line - 1 ? ">>> " : "    "
      puts "#{marker}#{line_num + 1}: #{lines[line_num]}"
    end

    nil
  end
end
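
When the YAML text comes from a string rather than load_file, passing the filename option makes error reports point at the right source. A minimal sketch; the path config/items.yml is a hypothetical label and no file is read:

```ruby
require 'yaml'

broken_yaml = "items: [one, two\n"  # the flow sequence is never closed

error = begin
  YAML.safe_load(broken_yaml, filename: "config/items.yml")
  nil
rescue Psych::SyntaxError => e
  e
end

puts error.file     # the filename passed above
puts error.message  # includes the filename and the position of the problem
```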

Schema validation ensures YAML content matches expected structures before processing. While Ruby doesn't include built-in YAML schema validation, custom validation logic can verify required fields, data types, and value constraints.

# Custom YAML validation
class YAMLValidator
  def self.validate_database_config(config)
    errors = []

    unless config.is_a?(Hash)
      errors << "Configuration must be a hash/mapping"
      return errors
    end

    required_fields = %w[host port database]
    required_fields.each do |field|
      unless config.key?(field)
        errors << "Missing required field: #{field}"
      end
    end

    if config["port"] && !config["port"].is_a?(Integer)
      errors << "Port must be an integer"
    end

    if config.key?("ssl") && ![true, false].include?(config["ssl"])
      errors << "SSL must be boolean"
    end

    errors
  end
end

# Usage
config_yaml = "host: localhost\nport: 5432\ndatabase: myapp"
config = YAML.load(config_yaml)
validation_errors = YAMLValidator.validate_database_config(config)

if validation_errors.empty?
  # Proceed with valid configuration
else
  puts "Configuration errors:"
  validation_errors.each { |error| puts "  - #{error}" }
end

Performance & Memory

Large YAML file processing requires streaming approaches to avoid loading entire files into memory simultaneously. The Psych::Parser class enables event-driven parsing that processes YAML content incrementally rather than building complete object trees in memory.

# Memory-efficient streaming parser (skeleton)
require 'yaml'

class YAMLStreamer < Psych::Handler
  attr_reader :results

  def initialize
    @key_stack = []
    @results = []
  end

  def start_mapping(anchor, tag, implicit, style)
    @key_stack.push({})
  end

  def end_mapping
    completed_hash = @key_stack.pop
    if @key_stack.empty?
      # Placeholder: the hash stays empty until scalar handling fills it
      @results << completed_hash
    else
      # Handle nested mappings
    end
  end

  def scalar(value, anchor, tag, plain, quoted, style)
    # Process individual scalar values without building full object
    process_scalar_value(value)
  end

  def process_scalar_value(value)
    # Custom processing logic for each scalar
    puts "Processing: #{value}" if value.length > 1000
  end
end

# Stream processing large files
def process_large_yaml_file(file_path)
  handler = YAMLStreamer.new
  parser = Psych::Parser.new(handler)

  # Psych::Parser#parse accepts an IO object and reads from it as parsing proceeds
  File.open(file_path, 'r') do |file|
    parser.parse(file)
  end

  handler.results
end
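
A smaller handler can make the event model easier to follow. The sketch below counts the events the parser emits for a short document; EventCounter is an illustrative class name, and a StringIO stands in for a large file.

```ruby
require 'yaml'
require 'stringio'

# Counts parser events instead of building Ruby objects
class EventCounter < Psych::Handler
  attr_reader :counts

  def initialize
    @counts = Hash.new(0)
  end

  def start_mapping(anchor, tag, implicit, style)
    @counts[:mappings] += 1
  end

  def start_sequence(anchor, tag, implicit, style)
    @counts[:sequences] += 1
  end

  def scalar(value, anchor, tag, plain, quoted, style)
    @counts[:scalars] += 1
  end
end

io = StringIO.new("items:\n  - id: 1\n  - id: 2\n")
counter = EventCounter.new
Psych::Parser.new(counter).parse(io)

p counter.counts
```

Keys and values both arrive as scalar events, so the handler itself must track whether a scalar is a mapping key or a value.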

Memory usage optimization involves understanding how different YAML structures consume memory during parsing and generation. Arrays with many elements create significant memory overhead compared to streaming approaches that process elements individually.

# Memory comparison example
require 'yaml'
require 'memory_profiler'  # third-party gem: gem install memory_profiler

# Generate large YAML content
large_array = (1..100_000).map { |i| { id: i, name: "Item #{i}", active: i.even? } }
yaml_content = YAML.dump(large_array)

puts "YAML content size: #{yaml_content.bytesize} bytes"

# Memory profiling
report = MemoryProfiler.report do
  parsed_data = YAML.load(yaml_content)
end

puts "Memory used: #{report.total_allocated_memsize} bytes"
puts "Objects created: #{report.total_allocated}"

Benchmarking can reveal differences between parsing methods and content types. Deeply nested structures generally take longer to parse than flat mappings because they produce more parser events and more Ruby objects. Results vary with Ruby and libyaml versions, so measure with representative data rather than relying on general rules.

# Performance benchmarking different YAML operations
require 'yaml'
require 'benchmark/ips'  # third-party gem: gem install benchmark-ips

simple_yaml = "name: John\nage: 30\nemail: john@example.com"
complex_yaml = <<~YAML
  users:
    - id: 1
      profile:
        name: Alice
        settings:
          notifications: true
          theme: dark
        permissions: [read, write, admin]
    - id: 2
      profile:
        name: Bob
        settings:
          notifications: false
          theme: light
        permissions: [read]
YAML

Benchmark.ips do |x|
  x.report("simple_load") { YAML.load(simple_yaml) }
  x.report("complex_load") { YAML.load(complex_yaml) }
  x.report("safe_load") { YAML.safe_load(simple_yaml) }
  x.report("dump_simple") { YAML.dump(YAML.load(simple_yaml)) }
  x.compare!
end

Caching strategies reduce repetitive parsing overhead when the same YAML file is read many times. Checking the file's modification time before returning a cached result prevents stale data while avoiding unnecessary parsing.

# YAML caching implementation
class CachedYAMLLoader
  def initialize
    @cache = {}
    @file_times = {}
  end

  def load_file(file_path)
    current_mtime = File.mtime(file_path)
    cached_time = @file_times[file_path]

    if cached_time.nil? || current_mtime > cached_time
      @cache[file_path] = YAML.load_file(file_path)
      @file_times[file_path] = current_mtime
    end

    @cache[file_path]
  end

  def clear_cache(file_path = nil)
    if file_path
      @cache.delete(file_path)
      @file_times.delete(file_path)
    else
      @cache.clear
      @file_times.clear
    end
  end
end

# Usage with automatic cache management
yaml_loader = CachedYAMLLoader.new
config = yaml_loader.load_file("config/settings.yml")  # Loads and caches
config_again = yaml_loader.load_file("config/settings.yml")  # Returns cached version

Optimization techniques for YAML generation focus on reducing object allocation and minimizing string operations during serialization. Driving Psych::Emitter directly skips the intermediate node tree that YAML.dump builds, which may improve performance for large, uniform data sets; benchmark before adopting it.

# Optimized YAML generation
require 'yaml'
require 'stringio'

class OptimizedYAMLGenerator
  def generate_user_list(users)
    # The emitter writes to the IO it is given, so keep a reference to it.
    # A fresh emitter is built per call because a finished stream cannot be reused.
    @io = StringIO.new
    @emitter = Psych::Emitter.new(@io)

    @emitter.start_stream(Psych::Parser::UTF8)
    @emitter.start_document([], [], false)
    @emitter.start_mapping(nil, nil, true, Psych::Nodes::Mapping::BLOCK)

    @emitter.scalar("users", nil, nil, true, false, Psych::Nodes::Scalar::PLAIN)
    @emitter.start_sequence(nil, nil, true, Psych::Nodes::Sequence::BLOCK)

    users.each do |user|
      generate_user_mapping(user)
    end

    @emitter.end_sequence
    @emitter.end_mapping
    @emitter.end_document(false)
    @emitter.end_stream

    @io.string
  end

  private

  def generate_user_mapping(user)
    @emitter.start_mapping(nil, nil, true, Psych::Nodes::Mapping::BLOCK)

    # Every key and value is written via to_s, so type information is not
    # preserved: the string "true" and the boolean true are emitted identically
    user.each do |key, value|
      @emitter.scalar(key.to_s, nil, nil, true, false, Psych::Nodes::Scalar::PLAIN)
      @emitter.scalar(value.to_s, nil, nil, true, false, Psych::Nodes::Scalar::PLAIN)
    end

    @emitter.end_mapping
  end
end

# Usage
generator = OptimizedYAMLGenerator.new
puts generator.generate_user_list([{ name: "Alice", role: "admin" }])

Common Pitfalls

Indentation errors represent the most frequent YAML parsing problems, particularly when mixing spaces and tabs or using inconsistent indentation levels. YAML requires precise indentation with spaces only, and mixing indentation styles causes parsing failures that can be difficult to diagnose visually.

# Common indentation problems
problematic_yaml = <<~YAML
  database:
    host: localhost
  	port: 5432  # Tab instead of spaces - will fail
    credentials:
      username: admin
     password: secret  # Inconsistent indentation - will fail
YAML

# Correct indentation - spaces only, consistent levels
correct_yaml = <<~YAML
  database:
    host: localhost
    port: 5432
    credentials:
      username: admin
      password: secret
YAML

# Debugging indentation issues
def detect_indentation_problems(yaml_string)
  problems = []
  yaml_string.lines.each_with_index do |line, index|
    # Only tabs in the leading whitespace break indentation
    if line.match?(/\A *\t/)
      problems << "Line #{index + 1}: Tab character in indentation"
    end

    # Heuristic: YAML accepts any consistent indentation width, but this
    # check assumes the common two-space convention
    leading_spaces = line[/^ */].length
    if leading_spaces > 0 && leading_spaces % 2 != 0
      problems << "Line #{index + 1}: Odd number of leading spaces (#{leading_spaces})"
    end
  end
  problems
end

Type coercion surprises occur when YAML automatically converts values to unexpected Ruby types. Numeric strings, boolean-like values, and date-formatted strings undergo automatic conversion that may not match application expectations.

# Unexpected type conversions
tricky_yaml = <<~YAML
  version: 1.0          # Becomes Float, not String
  enabled: yes          # Becomes true (boolean)
  disabled: no          # Becomes false (boolean)
  phone: 555-1234       # Remains String (contains hyphen)
  zip_code: 12345       # Becomes Integer
  date_string: 2023-12-15  # Becomes Date object
  null_value: null      # Becomes nil
  empty_string: ""      # Remains empty String
  just_spaces: "   "    # Remains String with spaces
YAML

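require 'date'

# Date must be permitted explicitly; since Psych 4, plain YAML.load rejects it
data = YAML.safe_load(tricky_yaml, permitted_classes: [Date])
puts data["version"].class      # => Float (not String!)
puts data["enabled"].class      # => TrueClass (not String!)
puts data["zip_code"].class     # => Integer (not String!)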

# Preventing unwanted type conversion
safe_yaml = <<~YAML
  version: "1.0"        # Quoted to remain String
  enabled: "yes"        # Quoted to remain String
  zip_code: "12345"     # Quoted to remain String
YAML

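# Alternative: read the raw scalar text from the parsed node tree, which
# skips type conversion entirely (shown here for a flat mapping)
raw = YAML.parse(tricky_yaml).root.children.each_slice(2).to_h do |key, value|
  [key.value, value.value]
end
puts raw["version"].class       # => String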
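
The same coercion rules matter when writing YAML. As a short sketch with illustrative data, YAML.dump quotes strings that would otherwise be read back as another type, so a dump followed by a load preserves them as strings:

```ruby
require 'yaml'

original = { "version" => "1.0", "enabled" => "yes", "zip_code" => "12345" }

# The emitter quotes values that would otherwise parse as Float, true, or Integer
yaml_text = YAML.dump(original)
puts yaml_text

restored = YAML.load(yaml_text)
puts restored.values.all? { |value| value.is_a?(String) }
```

Hand-edited files do not get this protection, which is why quoting ambiguous values explicitly remains the safer habit.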

Symbol versus string key confusion creates difficult-to-debug issues when YAML loading produces string keys but application code expects symbol keys. This mismatch causes nil returns when accessing hash values and can be particularly problematic in configuration files.

# Key access confusion
config_yaml = "database_host: localhost\napi_key: abc123"
config = YAML.load(config_yaml)

# This works - string keys
puts config["database_host"]  # => "localhost"

# This fails silently - expecting symbol keys
puts config[:database_host]   # => nil (not found)

# Hybrid approach with both access methods
class FlexibleHash < Hash
  def [](key)
    # key? keeps stored false/nil values from triggering the fallback
    return super(key) if key?(key)

    alternate = case key
                when String then key.to_sym
                when Symbol then key.to_s
                end
    alternate.nil? ? super(key) : super(alternate)
  end
end

# Extension to handle both key types
config = YAML.load(config_yaml)
flexible_config = FlexibleHash[config]
puts flexible_config[:database_host]   # => "localhost" (works!)
puts flexible_config["database_host"]  # => "localhost" (also works!)

Multi-document YAML files require special handling that differs from single-document parsing. Using YAML.load on multi-document content only returns the first document, silently ignoring subsequent documents and potentially causing data loss.

# Multi-document pitfall
multi_doc_content = <<~YAML
  ---
  service: web
  port: 3000
  ---
  service: database
  port: 5432
  ---
  service: cache
  port: 6379
YAML

# Wrong - only loads first document
single_doc = YAML.load(multi_doc_content)
puts single_doc  # => {"service"=>"web", "port"=>3000}

# Correct - loads all documents
all_docs = YAML.load_stream(multi_doc_content)
puts all_docs.length  # => 3
all_docs.each { |doc| puts "#{doc['service']}: #{doc['port']}" }
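
Multi-document streams can also be written with YAML.dump_stream and consumed one document at a time by passing a block to YAML.load_stream. A brief sketch with made-up service data:

```ruby
require 'yaml'

services = [
  { "service" => "web", "port" => 3000 },
  { "service" => "cache", "port" => 6379 }
]

# dump_stream writes each argument as its own document
stream_text = YAML.dump_stream(*services)

# With a block, load_stream yields each document as it is converted
names = []
YAML.load_stream(stream_text) { |doc| names << doc["service"] }

p names
```

The block form avoids holding every document in an array when each one can be handled independently.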

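Anchor and alias handling creates unexpected object sharing that can lead to unintended mutations. When a value is a direct alias (key: *anchor), Ruby reuses the anchored object rather than making an independent copy. A merge key (<<: *anchor) copies the anchored mapping's entries into a new Hash, so top-level assignments stay local, although nested values inside those entries are still shared.

# Dangerous object sharing through aliases
shared_yaml = <<~YAML
  default_settings: &defaults
    timeout: 30
    retries: 3
    logging: true

  production: *defaults   # direct alias: same Hash object as default_settings

  development:
    <<: *defaults         # merge key: entries copied into a new Hash
    host: dev.example.com
YAML

# Aliases must be enabled explicitly on Psych 4 and later
config = YAML.safe_load(shared_yaml, aliases: true)

# A direct alias shares one object, so this mutation leaks into the anchor
config["production"]["timeout"] = 60
puts config["default_settings"]["timeout"]  # => 60 (not 30!)

# The merged mapping is a separate Hash and keeps its own entry
puts config["development"]["timeout"]       # => 30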

# Safe approach - deep copy shared structures
def safe_load_with_aliases(yaml_content)
  loaded = YAML.safe_load(yaml_content, aliases: true)

  # Deep clone to prevent shared object mutations
  loaded.transform_values do |value|
    value.is_a?(Hash) ? deep_clone(value) : value
  end
end

def deep_clone(obj)
  case obj
  when Hash
    obj.transform_keys { |k| deep_clone(k) }
       .transform_values { |v| deep_clone(v) }
  when Array
    obj.map { |item| deep_clone(item) }
  else
    obj.respond_to?(:dup) ? obj.dup : obj
  end
end

Encoding issues arise when YAML files contain non-ASCII characters or when the file encoding doesn't match Ruby's expectations. This particularly affects applications processing YAML files created on different operating systems or containing internationalized content.

# Encoding handling
def safe_yaml_load_with_encoding(file_path, fallback_encoding: 'ISO-8859-1')
  # Read as UTF-8 and strip a byte-order mark if one is present
  content = File.read(file_path, mode: 'r:bom|utf-8')

  # Reading a file never validates its bytes, so check explicitly.
  # The fallback encoding is a guess; pass the real one when it is known.
  unless content.valid_encoding?
    content = content.force_encoding(fallback_encoding).encode('UTF-8')
  end

  YAML.load(content)
end

Reference

Core Methods

| Method | Parameters | Returns | Description |
| --- | --- | --- | --- |
| YAML.load(yaml) | yaml (String) | Object | Parses YAML string into Ruby object |
| YAML.load_file(path) | path (String/Pathname) | Object | Loads and parses YAML file |
| YAML.safe_load(yaml, **opts) | yaml (String), options (Hash) | Object | Secure parsing with restricted classes |
| YAML.dump(object, **opts) | object (Object), options (Hash) | String | Converts Ruby object to YAML string |
| YAML.load_stream(yaml) | yaml (String) | Array | Parses multi-document YAML into array |

Safe Loading Options

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| permitted_classes | Array | [] | Classes allowed during deserialization |
| permitted_symbols | Array | [] | Symbols allowed during parsing |
| aliases | Boolean | false | Whether to allow YAML aliases |
| filename | String | nil | Filename for error reporting |

Dump Options

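| Option | Type | Default | Description |
| --- | --- | --- | --- |
| line_width | Integer | 0 | Maximum line width; 0 keeps the emitter's default wrap width of about 80 columns, -1 disables wrapping |
| indentation | Integer | 2 | Number of spaces for indentation |
| canonical | Boolean | false | Use canonical YAML format |
| header | Boolean | false | Write a %YAML version directive at the start of the document |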

Exception Hierarchy

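| Exception | Parent | Triggered By |
| --- | --- | --- |
| Psych::SyntaxError | Psych::Exception | Invalid YAML syntax |
| Psych::DisallowedClass | Psych::Exception | Restricted class instantiation |
| Psych::BadAlias | Psych::Exception | Invalid alias reference |
| Psych::AliasesNotEnabled | Psych::BadAlias | Aliases used when disabled |

Psych::Exception inherits from RuntimeError, so rescuing StandardError catches all of these.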

YAML Data Type Mapping

| YAML Value | Ruby Type | Notes |
| --- | --- | --- |
| string | String | Unquoted strings |
| "quoted" | String | Quoted strings |
| 123 | Integer | Numeric literals |
| 1.23 | Float | Decimal numbers |
| true, yes, on | TrueClass | Boolean true values |
| false, no, off | FalseClass | Boolean false values |
| null, ~ | NilClass | Null values |
| 2023-12-15 | Date | ISO date format |
| 2023-12-15 10:30:00 | Time | ISO datetime format |

Psych Parser Events

| Event Method | Parameters | Purpose |
| --- | --- | --- |
| start_document | version, tag_directives, implicit | Document start |
| end_document | implicit | Document end |
| start_mapping | anchor, tag, implicit, style | Hash/mapping start |
| end_mapping | (none) | Hash/mapping end |
| start_sequence | anchor, tag, implicit, style | Array/sequence start |
| end_sequence | (none) | Array/sequence end |
| scalar | value, anchor, tag, plain, quoted, style | Individual value |
| alias | anchor | Alias reference |

Common YAML Patterns

| Pattern | YAML Syntax | Ruby Result |
| --- | --- | --- |
| Simple mapping | key: value | {"key" => "value"} |
| Nested mapping | parent:\n child: value | {"parent" => {"child" => "value"}} |
| Sequence | - item1\n- item2 | ["item1", "item2"] |
| Mixed structure | items:\n - name: first | {"items" => [{"name" => "first"}]} |
| Multi-line string | text: >\n line1\n line2\n | {"text" => "line1 line2\n"} (use >- to drop the trailing newline) |
| Literal string | text: \|\n line1\n line2\n | {"text" => "line1\nline2\n"} (use \|- to drop the trailing newline) |
| Anchor/Alias | default: &def\n val: 1\nother:\n <<: *def | {"default" => {"val" => 1}, "other" => {"val" => 1}} (entries copied by the merge key) |

File Extension Conventions

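| Extension | Usage | Content Type |
| --- | --- | --- |
| .yml | Widely used short extension | Configuration, data |
| .yaml | Extension recommended by the YAML project | Same as .yml |
| .config | Application config (not YAML-specific; used by some applications) | YAML configuration |
| .settings | User settings (not YAML-specific; used by some applications) | YAML preferences |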