XML/REXML

Overview

REXML is a pure-Ruby XML processor that ships with Ruby (as a bundled gem since Ruby 3.0). It parses, generates, modifies, and queries XML through a tree-based Document Object Model, and it supports XML namespaces, XPath expressions, and stream-based parsing for memory-efficient processing of large documents.

The core classes include REXML::Document for complete XML documents, REXML::Element for individual XML elements, and REXML::XPath for query operations. REXML processes XML through multiple parsing approaches: tree parsing for full document access, stream parsing for memory efficiency, and pull parsing for event-driven processing.

require 'rexml/document'

xml_string = '<books><book id="1">Ruby Programming</book></books>'
doc = REXML::Document.new(xml_string)
puts doc.root.name  # => "books"

REXML is a non-validating parser: it checks well-formedness and reads DTD declarations such as entity definitions, but it does not validate documents against a DTD. Processing is namespace-aware, and input can come from strings or files. Character encoding is detected automatically from the input, with UTF-8 as the default output encoding.

# Creating XML from scratch
doc = REXML::Document.new
root = doc.add_element('catalog')
book = root.add_element('book', {'isbn' => '978-0123456789'})
book.add_element('title').text = 'Advanced Ruby'
puts doc.to_s

REXML integrates with Ruby's IO classes for file operations and provides formatted output with configurable indentation. The library supports XML comments, processing instructions, and CDATA sections while maintaining document structure integrity.
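As a minimal sketch of the node types mentioned above, the following builds a small document containing a processing instruction, a comment, and a CDATA section. The element names and the stylesheet reference are made up for illustration.

```ruby
require 'rexml/document'

doc = REXML::Document.new
doc << REXML::XMLDecl.new('1.0', 'UTF-8')

# Processing instruction placed before the root element (hypothetical stylesheet)
doc << REXML::Instruction.new('xml-stylesheet', 'type="text/xsl" href="style.xsl"')

root = doc.add_element('notes')
root.add(REXML::Comment.new(' generated example '))

# CDATA keeps characters such as < and & literal in the serialized output
note = root.add_element('note')
note.add(REXML::CData.new('if a < b && b > c'))

xml = doc.to_s
puts xml
```

Reading the element back with note.text returns the CDATA content as an ordinary string.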

Basic Usage

Document creation begins with REXML::Document.new, which accepts an XML string or an IO object. The constructor parses the input immediately and builds the complete document tree in memory. Called without arguments, it returns an empty document with no root element, ready for elements to be added.

require 'rexml/document'

# Parse XML from string
xml_data = <<~XML
  <library>
    <book id="1" category="fiction">
      <title>The Ruby Way</title>
      <author>Hal Fulton</author>
      <price>29.95</price>
    </book>
  </library>
XML

doc = REXML::Document.new(xml_data)

Element access occurs through several methods. The root method returns the document's root element, while elements provides an enumerable collection for traversing child elements. Lookups can be index-based or name-based; indexes are 1-based, following XPath convention, so elements[1] is the first child element.

root = doc.root
puts root.name  # => "library"

# Access elements by index
first_book = root.elements[1]
puts first_book.name  # => "book"

# Access elements by name
book_element = root.elements['book']
puts book_element.attributes['id']  # => "1"
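Beyond index and name lookups, the same tree can be queried with REXML::XPath or iterated with elements.each. The sketch below uses a slightly larger sample document than the one above; the second book entry is added only for illustration.

```ruby
require 'rexml/document'

doc = REXML::Document.new(<<~XML)
  <library>
    <book id="1" category="fiction"><title>The Ruby Way</title></book>
    <book id="2" category="technical"><title>Ruby Internals</title></book>
  </library>
XML

# All matching nodes, in document order
titles = REXML::XPath.match(doc, '//book/title').map(&:text)

# First node matching an attribute predicate
technical = REXML::XPath.first(doc, "//book[@category='technical']")

# Iterate over child elements that match a relative path
ids = []
doc.root.elements.each('book') { |book| ids << book.attributes['id'] }

puts titles.inspect
puts technical.attributes['id']
puts ids.inspect
```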

Attribute manipulation uses the hash-like attributes interface. The examples here use string keys for attribute names and string values. The interface provides hash-style methods including [], []=, and delete for attribute management.

book = doc.root.elements['book']

# Read attributes
category = book.attributes['category']  # => "fiction"
book_id = book.attributes['id']  # => "1"

# Modify attributes
book.attributes['category'] = 'non-fiction'
book.attributes['published'] = '2023'

# Delete attributes
book.attributes.delete('id')
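The element-level helpers add_attribute and delete_attribute cover the same ground, and attributes.each walks every name and value pair. A short sketch, using a throwaway single-element document:

```ruby
require 'rexml/document'

book = REXML::Document.new('<book id="1" category="fiction"/>').root

# Collect every attribute as a name => value pair
pairs = {}
book.attributes.each { |name, value| pairs[name] = value }

# Element-level helpers for adding and removing attributes
book.add_attribute('published', '2023')
book.delete_attribute('id')

puts pairs.inspect
puts book.to_s
```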

Text content is read with the text method for simple elements, or with get_text, which returns the underlying REXML::Text node for more control. Text is modified by direct assignment, or by adding REXML::Text or REXML::CData objects for more complex content such as CDATA sections.

title = doc.root.elements['book/title']
puts title.text  # => "The Ruby Way"

# Modify text content
title.text = "Advanced Ruby Programming"

# Access nested text content
price_text = doc.root.elements['book/price'].get_text.value
puts price_text  # => "29.95"

Element creation uses add_element, which takes the new child's name and an optional attributes hash. The method returns the created element, so text content and further nested elements can be added to it immediately.

root = doc.root

# Add new book element
new_book = root.add_element('book')
new_book.attributes['id'] = '2'
new_book.attributes['category'] = 'technical'

# Add nested elements
title = new_book.add_element('title')
title.text = 'Ruby Internals'

author = new_book.add_element('author')
author.text = 'Pat Shaughnessy'

# Add element with attributes in one call
price = new_book.add_element('price', {'currency' => 'USD'})
price.text = '34.99'

Document serialization transforms the internal tree structure back into XML text. The to_s method provides basic serialization, while custom formatters control indentation, line breaks, and encoding options.

# Basic serialization
puts doc.to_s

# Formatted output with indentation
formatter = REXML::Formatters::Pretty.new
formatter.compact = true
output = ''
formatter.write(doc, output)
puts output
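Serialization can also go straight to an IO object through Document#write, which takes an indent width as its second argument. The sketch below writes to a StringIO so it stays self-contained; an open File would work the same way.

```ruby
require 'rexml/document'
require 'stringio'

doc = REXML::Document.new(
  '<library><book id="1"><title>The Ruby Way</title></book></library>'
)

# Compact form: no whitespace is added
compact = doc.to_s

# Indented form: write to any object that responds to <<
io = StringIO.new
doc.write(io, 2)
pretty = io.string

puts compact
puts pretty
```

The indented form adds whitespace around text nodes, so it suits human-readable output better than documents where whitespace is significant.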

Error Handling & Debugging

REXML raises specific exception types for different parsing failures. REXML::ParseException indicates syntax errors in XML structure, while REXML::UndefinedNamespaceException occurs when namespace prefixes lack declarations. Understanding these exceptions enables targeted error recovery strategies.

require 'rexml/document'

begin
  # Malformed XML with missing closing tag
  bad_xml = '<books><book>Title</book'
  doc = REXML::Document.new(bad_xml)
rescue REXML::ParseException => e
  puts "Parse error at line #{e.line}: #{e.message}"
  puts "Position: #{e.position}"
  # Handle parsing failure - perhaps try alternative parsing approach
end

Well-formedness errors surface during document construction, when the parser meets XML that breaks basic syntax rules. The resulting exceptions carry line numbers and character positions for locating the problem. Production code should rescue these exceptions and report meaningful feedback rather than exposing internal parser messages.

def safe_parse_xml(xml_string)
  doc = REXML::Document.new(xml_string)
  { success: true, document: doc }
rescue REXML::ParseException => e
  {
    success: false,
    error: "XML parsing failed: #{e.message}",
    line: e.line,
    position: e.position
  }
rescue StandardError => e
  {
    success: false,
    error: "Unexpected error: #{e.message}"
  }
end

result = safe_parse_xml('<invalid><xml></wrong>')
puts result[:error] unless result[:success]

Namespace errors occur when XML uses namespace prefixes without corresponding declarations. REXML enforces namespace rules strictly, raising UndefinedNamespaceException when encountering undefined prefixes. This behavior prevents silent namespace resolution failures.

begin
  # XML with undefined namespace prefix
  namespaced_xml = '<ns:root><ns:item>content</ns:item></ns:root>'
  doc = REXML::Document.new(namespaced_xml)
rescue REXML::UndefinedNamespaceException => e
  puts "Undefined namespace: #{e.message}"
  # Add proper namespace declaration or remove prefix usage
  corrected_xml = '<root xmlns:ns="http://example.com"><ns:item>content</ns:item></root>'
  doc = REXML::Document.new(corrected_xml)
end
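Because UndefinedNamespaceException inherits from ParseException, the more specific rescue clause has to come first when both are handled. A minimal sketch; the method name classify_parse_failure and the returned symbols are hypothetical:

```ruby
require 'rexml/document'

# Returns a symbol describing how parsing went.
def classify_parse_failure(xml)
  REXML::Document.new(xml)
  :ok
rescue REXML::UndefinedNamespaceException
  # Must precede the ParseException clause, which would otherwise catch it
  :undefined_namespace
rescue REXML::ParseException
  :malformed
end

puts classify_parse_failure('<books><book>Title</book></books>')
puts classify_parse_failure('<books><book>Title</books>')
puts classify_parse_failure('<ns:root><ns:item>content</ns:item></ns:root>')
```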

Element access errors happen when code assumes an element exists. Looking up a non-existent element returns nil rather than raising an exception, so a chained call such as .text fails with NoMethodError unless the result is checked first.

doc = REXML::Document.new('<books><book id="1">Title</book></books>')

# Safe element access with nil checking
book = doc.root.elements['book[@id="999"]']
if book
  # The title child may be missing too, so guard the chained call
  title = book.elements['title']&.text
else
  puts "Book with ID 999 not found"
end

# Alternative approach using get_text
title_text = doc.root.elements['book[@id="1"]/title']&.get_text&.value || 'No title'
puts title_text

Memory issues arise when processing extremely large XML documents through tree parsing. REXML loads complete documents into memory, potentially causing out-of-memory errors with multi-gigabyte files. Stream parsing provides an alternative for memory-constrained environments.

require 'rexml/document'
require 'rexml/streamlistener'

class MemoryEfficientParser
  include REXML::StreamListener

  def initialize
    @current_path = []
    @target_elements = {}
  end

  def tag_start(name, attrs)
    @current_path.push(name)

    # Track only specific elements to control memory usage
    if name == 'book' && attrs['category'] == 'technical'
      @target_elements[@current_path.dup] = { name: name, attributes: attrs }
    end
  end

  def tag_end(name)
    # Forget the element once it closes so later siblings at the same path
    # are not mistaken for it
    @target_elements.delete(@current_path)
    @current_path.pop
  end

  def text(content)
    # Report non-blank text that sits directly inside a target element
    if @target_elements.key?(@current_path) && !content.strip.empty?
      puts "Found technical book content: #{content.strip}"
    end
  end
end

# Use stream parsing for large files
def parse_large_file(filename)
  begin
    File.open(filename, 'r') do |file|
      parser = MemoryEfficientParser.new
      REXML::Document.parse_stream(file, parser)
    end
  rescue Errno::ENOENT => e
    puts "File not found: #{filename}"
  rescue REXML::ParseException => e
    puts "Parsing failed: #{e.message}"
  end
end
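Pull parsing, mentioned in the overview, is the third option: the caller asks for the next event instead of receiving callbacks. The sketch below collects title text with REXML::Parsers::PullParser; the method name titles_via_pull and the sample XML are illustrative assumptions.

```ruby
require 'rexml/parsers/pullparser'

# Collects the text of every <title> element without building a tree.
def titles_via_pull(xml)
  parser = REXML::Parsers::PullParser.new(xml)
  titles = []
  in_title = false

  while parser.has_next?
    event = parser.pull
    if event.start_element? && event[0] == 'title'
      in_title = true
      titles << String.new
    elsif event.end_element? && event[0] == 'title'
      in_title = false
    elsif event.text? && in_title
      # event[1] holds the text with entities already expanded
      titles.last << event[1]
    end
  end

  titles
end

xml = '<library><book><title>The Ruby Way</title></book>' \
      '<book><title>Ruby Internals</title></book></library>'
puts titles_via_pull(xml).inspect
```

Because the loop is ordinary Ruby, it can stop early with break once the needed data has been found, which is awkward to do from inside a stream listener.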

Production Patterns

High-volume XML processing requires careful memory management and performance optimization. Stream parsing becomes essential when processing large feeds or batch imports where document sizes exceed available memory. REXML's streaming interface provides memory-efficient processing for production workloads.

require 'rexml/document'
require 'rexml/streamlistener'

class ProductCatalogProcessor
  include REXML::StreamListener
  
  def initialize(batch_size = 1000)
    @batch_size = batch_size
    @current_product = {}
    @batch = []
    @processed_count = 0
  end
  
  def tag_start(name, attrs)
    case name
    when 'product'
      @current_product = { id: attrs['id'], sku: attrs['sku'] }
    when 'price', 'title', 'description'
      @current_field = name
    end
  end
  
  def text(content)
    if @current_field && !content.strip.empty?
      @current_product[@current_field.to_sym] = content.strip
    end
  end
  
  def tag_end(name)
    if name == 'product' && @current_product[:id]
      @batch << @current_product
      
      if @batch.size >= @batch_size
        process_batch(@batch)
        @batch.clear
      end
      
      @current_product = {}
    end
    @current_field = nil if ['price', 'title', 'description'].include?(name)
  end

  # Process whatever remains after the last full batch; call once parsing ends
  def flush
    return if @batch.empty?

    process_batch(@batch)
    @batch.clear
  end

  private

  def process_batch(products)
    @processed_count += products.size
    puts "Processed batch: #{products.size} products (total: #{@processed_count})"
    # Database insert, API calls, or other batch processing
  end
end

# Process large XML feeds efficiently
def process_product_feed(feed_url_or_file)
  processor = ProductCatalogProcessor.new(500)

  if feed_url_or_file.start_with?('http')
    require 'net/http'
    uri = URI(feed_url_or_file)
    # Note: the response body is read fully into memory before parsing
    response = Net::HTTP.get_response(uri)
    REXML::Document.parse_stream(response.body, processor)
  else
    File.open(feed_url_or_file, 'r') do |file|
      REXML::Document.parse_stream(file, processor)
    end
  end

  # Hand off the final partial batch, which tag_end never processes on its own
  processor.flush
end

Namespace handling becomes critical in enterprise XML processing where documents often use multiple namespace declarations. REXML provides namespace-aware parsing and element access patterns that prevent namespace conflicts in complex document structures.

class NamespaceAwareProcessor
  def initialize(xml_content)
    @doc = REXML::Document.new(xml_content)
    @namespaces = extract_namespaces
  end
  
  def extract_namespaces
    namespaces = {}
    @doc.root.attributes.each do |name, value|
      if name.start_with?('xmlns:')
        prefix = name.split(':', 2).last
        namespaces[prefix] = value
      elsif name == 'xmlns'
        namespaces[''] = value  # Default namespace
      end
    end
    namespaces
  end
  
  def find_elements_by_namespace(namespace_uri, element_name)
    namespace_prefix = @namespaces.key(namespace_uri)
    return [] unless namespace_prefix
    
    xpath = namespace_prefix.empty? ? 
      "//#{element_name}" : 
      "//#{namespace_prefix}:#{element_name}"
      
    REXML::XPath.match(@doc, xpath, @namespaces)
  end
  
  def process_soap_envelope
    # Handle SOAP namespace processing
    soap_ns = 'http://schemas.xmlsoap.org/soap/envelope/'
    headers = find_elements_by_namespace(soap_ns, 'Header')
    bodies = find_elements_by_namespace(soap_ns, 'Body')
    
    {
      headers: headers.map { |h| extract_element_data(h) },
      bodies: bodies.map { |b| extract_element_data(b) }
    }
  end
  
  private
  
  def extract_element_data(element)
    {
      name: element.name,
      namespace: element.namespace,
      text: element.get_text&.value,
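      # Map attribute names to plain string values
      attributes: element.attributes.each_with_object({}) { |(name, value), h| h[name] = value }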
    }
  end
end

# Production SOAP message processing
soap_xml = <<~XML
  <soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/"
                 xmlns:web="http://example.com/webservice">
    <soap:Header>
      <web:Authentication>
        <web:Token>abc123</web:Token>
      </web:Authentication>
    </soap:Header>
    <soap:Body>
      <web:GetUserRequest>
        <web:UserId>12345</web:UserId>
      </web:GetUserRequest>
    </soap:Body>
  </soap:Envelope>
XML

processor = NamespaceAwareProcessor.new(soap_xml)
envelope_data = processor.process_soap_envelope
puts envelope_data.inspect
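For one-off queries, a namespace mapping can also be passed directly to the XPath methods. The prefixes in that mapping belong to the query and do not have to match the prefixes used in the document, as this sketch with the made-up prefixes s and w suggests:

```ruby
require 'rexml/document'

soap = <<~XML
  <soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/"
                 xmlns:web="http://example.com/webservice">
    <soap:Body>
      <web:GetUserRequest><web:UserId>12345</web:UserId></web:GetUserRequest>
    </soap:Body>
  </soap:Envelope>
XML

doc = REXML::Document.new(soap)

# Query-side prefixes mapped to the namespace URIs they stand for
ns = {
  's' => 'http://schemas.xmlsoap.org/soap/envelope/',
  'w' => 'http://example.com/webservice'
}

body = REXML::XPath.first(doc, '//s:Body', ns)
user_id = REXML::XPath.first(doc, '//s:Body//w:UserId', ns)&.text

puts body.name
puts user_id
```

Matching on the namespace URI rather than the prefix keeps the query working when a sender chooses different prefixes for the same namespaces.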

Error handling strategies for production environments must account for partial processing failures and provide recovery mechanisms. Robust XML processing includes validation checkpoints, transaction boundaries, and detailed logging for troubleshooting production issues.

class RobustXmlProcessor
  attr_reader :stats
  
  def initialize
    @stats = {
      processed: 0,
      errors: 0,
      warnings: 0,
      start_time: Time.now
    }
  end
  
  def process_xml_batch(xml_files, error_threshold = 0.1)
    xml_files.each_with_index do |file_path, index|
      begin
        result = process_single_file(file_path)
        @stats[:processed] += 1
        
        log_progress(index + 1, xml_files.size) if (index + 1) % 100 == 0
        
      rescue REXML::ParseException => e
        handle_parse_error(file_path, e)
      rescue StandardError => e
        handle_general_error(file_path, e)
      end
      
      # Stop processing if error rate exceeds threshold
      error_rate = @stats[:errors].to_f / (index + 1)
      if error_rate > error_threshold && index > 10
        raise "Error rate #{(error_rate * 100).round(2)}% exceeds threshold"
      end
    end
    
    log_completion_stats
  end
  
  private
  
  def process_single_file(file_path)
    File.open(file_path, 'r') do |file|
      doc = REXML::Document.new(file)
      validate_document_structure(doc)
      extract_business_data(doc)
    end
  end
  
  def validate_document_structure(doc)
    unless doc.root
      raise "Invalid XML: No root element found"
    end
    
    required_elements = ['metadata', 'content']
    required_elements.each do |element_name|
      unless doc.root.elements[element_name]
        @stats[:warnings] += 1
        puts "Warning: Missing required element '#{element_name}'"
      end
    end
  end
  
  def extract_business_data(doc)
    # Extract and process business-specific data
    metadata = doc.root.elements['metadata']
    content = doc.root.elements['content']
    
    {
      id: metadata&.attributes&.[]('id'),
      timestamp: metadata&.elements&.[]('timestamp')&.text,
      data: content&.get_text&.value
    }
  end
  
  def handle_parse_error(file_path, error)
    @stats[:errors] += 1
    puts "Parse error in #{file_path}: #{error.message} (line #{error.line})"
    # Log to error tracking system, move file to error directory, etc.
  end
  
  def handle_general_error(file_path, error)
    @stats[:errors] += 1
    puts "Processing error in #{file_path}: #{error.message}"
    # Additional error handling specific to application needs
  end
  
  def log_progress(current, total)
    elapsed = Time.now - @stats[:start_time]
    rate = @stats[:processed] / elapsed
    puts "Progress: #{current}/#{total} files, #{rate.round(2)} files/sec"
  end
  
  def log_completion_stats
    elapsed = Time.now - @stats[:start_time]
    puts "Completed: #{@stats[:processed]} processed, #{@stats[:errors]} errors, " \
         "#{@stats[:warnings]} warnings in #{elapsed.round(2)} seconds"
  end
end

Reference

Core Classes

| Class | Purpose | Key Methods |
| --- | --- | --- |
| REXML::Document | Complete XML document representation | new(source), root, write(output), to_s |
| REXML::Element | Individual XML element | name, text, text=, attributes, elements, add_element |
| REXML::Attributes | Element attributes collection | [], []=, delete, each, to_h |
| REXML::XPath | XPath query operations | match(element, path), first(element, path), each(element, path) |
| REXML::Text | Text content representation | value, to_s, normalize, wrap |

Document Operations

| Method | Parameters | Returns | Description |
| --- | --- | --- | --- |
| REXML::Document.new(source, context) | source (String/IO), context (Hash) | Document | Creates document from XML source |
| #root | None | Element or nil | Returns root element |
| #add(child) | child (Element/Text/etc.) | child | Adds child to document |
| #write(output, indent, transitive, ie_hack) | Various formatting options | nil | Writes formatted XML |
| #to_s | None | String | Serializes document to string |
| #clone | None | Document | Creates a shallow copy; use #deep_clone to copy children as well |

Element Manipulation

| Method | Parameters | Returns | Description |
| --- | --- | --- | --- |
| #add_element(name, attrs) | name (String/Element), attrs (Hash) | Element | Creates and adds child element |
| #delete_element(element) | element (Element/String/Integer) | Element or nil | Removes specified element |
| #elements | None | Elements | Returns child elements collection |
| #each_element(xpath) | xpath (String), block | self | Iterates over matching elements |
| #get_elements(xpath) | xpath (String) | Array | Returns array of matching elements |
| #next_element | None | Element or nil | Returns next sibling element |
| #previous_element | None | Element or nil | Returns previous sibling element |

Text and Attribute Access

| Method | Parameters | Returns | Description |
| --- | --- | --- | --- |
| #text | None | String or nil | Returns element's text content |
| #text=(content) | content (String) | String | Sets element's text content |
| #get_text(path) | path (String) | Text or nil | Returns Text object for content |
| #add_text(text) | text (String/Text) | self | Adds text content |
| #attributes[name] | name (String) | String or nil | Gets attribute value |
| #attributes[name]=value | name (String), value (String) | String | Sets attribute value |
| #add_attribute(name, value) | name (String), value (String) | self | Adds or updates attribute |
| #delete_attribute(name) | name (String) | Attribute or nil | Removes attribute |

XPath Operations

| Method | Parameters | Returns | Description |
| --- | --- | --- | --- |
| REXML::XPath.match(element, path, namespaces, variables) | Element, path string, optional context | Array | Returns all matching nodes |
| REXML::XPath.first(element, path, namespaces, variables) | Element, path string, optional context | Node or nil | Returns first matching node |
| REXML::XPath.each(element, path, namespaces, variables) | Element, path string, optional context, block | nil | Iterates over matches |
| #xpath | None | String | Returns the XPath expression that locates this element in its document |

Stream Processing

| Class/Method | Purpose | Key Methods |
| --- | --- | --- |
| REXML::StreamListener | Base module for stream processing | tag_start, tag_end, text, instruction |
| REXML::Document.parse_stream(source, listener) | Processes XML without building tree | Class method; drives the listener callbacks |
| REXML::SAX2Listener | SAX2-style event interface | start_element, end_element, characters |
| REXML::Parsers::PullParser | Pull-style parsing interface | pull, peek, has_next? |

Formatting and Output

| Class | Purpose | Configuration |
| --- | --- | --- |
| REXML::Formatters::Default | Basic output formatting | No special formatting |
| REXML::Formatters::Pretty | Indented output formatting | compact, width properties |
| REXML::Formatters::Transitive | Preserves whitespace formatting | Maintains original spacing |

Exception Hierarchy

| Exception | Inheritance | Raised When |
| --- | --- | --- |
| REXML::ParseException | RuntimeError | XML syntax errors during parsing |
| REXML::UndefinedNamespaceException | ParseException | Undefined namespace prefix usage |
| REXML::Validation::ValidationException | RuntimeError | Document validation failures |

Common XPath Expressions

| Pattern | Purpose | Example |
| --- | --- | --- |
| //element | All elements with name | //book finds all book elements |
| /root/child | Direct child path | /catalog/book finds books under catalog |
| //element[@attr='value'] | Attribute filtering | //book[@category='fiction'] |
| //element[n] | Position-based selection | //book[1] selects each book that is the first book child of its parent |
| //element[text()='value'] | Text content filtering | //title[text()='Ruby Guide'] |
| //element/* | All child elements | //book/* gets all book children |
| //element/text() | Text nodes only | //price/text() gets price text |
| //namespace:element | Namespaced elements | //soap:Body with namespace |

Namespace Handling

| Pattern | Usage | Example |
| --- | --- | --- |
| Default namespace declaration | xmlns="uri" | <root xmlns="http://example.com"> |
| Prefixed namespace declaration | xmlns:prefix="uri" | <root xmlns:ns="http://example.com"> |
| Namespace-aware XPath | Use prefix in expressions | //ns:element with namespace context |
| Namespace context in queries | Pass hash to XPath methods | {'ns' => 'http://example.com'} |

Performance Characteristics

| Operation | Memory Usage | Processing Speed | Best For |
| --- | --- | --- | --- |
| Tree parsing | High (entire document) | Fast random access | Small to medium documents |
| Stream parsing | Low (constant) | Fast sequential processing | Large documents |
| Pull parsing | Low to medium | Medium (event-driven) | Selective processing |
| XPath queries | Medium | Variable by complexity | Complex document queries |