CrackedRuby - XML/REXML

Overview

REXML is a pure-Ruby XML processor that ships with Ruby; since Ruby 3.0 it is distributed as a bundled gem rather than a default library. It handles XML document parsing, generation, modification, and querying through a tree-based Document Object Model. REXML supports XML namespaces, XPath expressions, and streaming operations for memory-efficient processing of large documents.

The core classes include REXML::Document for complete XML documents, REXML::Element for individual XML elements, and REXML::XPath for query operations. REXML offers three parsing approaches: tree parsing builds the full document for random access, stream parsing pushes events to a listener without building a tree, and pull parsing lets the caller request events one at a time.

require 'rexml/document'

xml_string = '<books><book id="1">Ruby Programming</book></books>'
doc = REXML::Document.new(xml_string)
puts doc.root.name  # => "books"
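
Tree parsing is shown above; pull parsing works differently, with the caller asking for one event at a time. The following is a minimal sketch, assuming the REXML::Parsers::PullParser class loaded from rexml/parsers/pullparser. The sample document and the books array are illustrative only.

```ruby
require 'rexml/parsers/pullparser'

xml = '<books><book id="1">Ruby Programming</book><book id="2">Rails Guide</book></books>'
parser = REXML::Parsers::PullParser.new(xml)

books = []
current = nil

# Request events one at a time; no document tree is built.
while parser.has_next?
  event = parser.pull
  if event.start_element? && event[0] == 'book'
    # For a start element, event[0] is the tag name and event[1] the attribute hash
    current = { id: event[1]['id'], title: +'' }
  elsif event.text? && current
    # For a text event, event[0] is the text content
    current[:title] << event[0]
  elsif event.end_element? && event[0] == 'book'
    books << current
    current = nil
  end
end

books.each { |b| puts "#{b[:id]}: #{b[:title]}" }
```

Because the loop controls when the next event is read, it can stop early once it has found what it needs.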

REXML is a non-validating parser: it reads DTD declarations such as entity definitions but does not validate documents against a DTD. It provides namespace-aware processing, resolves XML entities, and reads from both strings and IO objects such as files. Character encoding detection occurs automatically, with UTF-8 as the default output encoding.

# Creating XML from scratch
doc = REXML::Document.new
root = doc.add_element('catalog')
book = root.add_element('book', {'isbn' => '978-0123456789'})
book.add_element('title').text = 'Advanced Ruby'
puts doc.to_s

REXML integrates with Ruby's IO classes for file operations and provides formatted output with configurable indentation. The library supports XML comments, processing instructions, and CDATA sections while maintaining document structure integrity.
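
As a rough illustration of the node types mentioned above, the sketch below builds a document containing an XML declaration, a processing instruction, a comment, and a CDATA section. It assumes the REXML::XMLDecl, REXML::Instruction, REXML::Comment, and REXML::CData classes; the element names and stylesheet reference are made up for the example.

```ruby
require 'rexml/document'

doc = REXML::Document.new
doc << REXML::XMLDecl.new('1.0', 'UTF-8')
# Processing instruction: target first, then its content
doc << REXML::Instruction.new('xml-stylesheet', 'type="text/xsl" href="style.xsl"')

root = doc.add_element('notes')
root << REXML::Comment.new(' generated example ')

note = root.add_element('note')
# CDATA keeps characters such as < and & unescaped in the output
note << REXML::CData.new('if (a < b) { run(); }')

xml = doc.to_s
puts xml
```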

Basic Usage

Document creation begins with REXML::Document.new, accepting either XML strings or IO objects. The constructor parses the input immediately, building the complete document tree in memory. Called with no argument, it returns an empty document with no root element, so doc.root is nil until an element is added.

require 'rexml/document'

# Parse XML from string
xml_data = <<~XML
  <library>
    <book id="1" category="fiction">
      <title>The Ruby Way</title>
      <author>Hal Fulton</author>
      <price>29.95</price>
    </book>
  </library>
XML

doc = REXML::Document.new(xml_data)
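
The constructor also accepts an IO object in place of a string. A small sketch of that case, plus the empty-document case, using a temporary file purely for illustration (the file contents and variable names are assumptions, not part of the original example):

```ruby
require 'rexml/document'
require 'tempfile'

root_name = nil

# Write a small XML file, then hand the open File (an IO) to the constructor
Tempfile.create(['library', '.xml']) do |file|
  file.write('<library><book id="1"/></library>')
  file.rewind
  doc = REXML::Document.new(file)
  root_name = doc.root.name
end

# A document created without input has no root element yet
empty_doc = REXML::Document.new

puts root_name
puts empty_doc.root.inspect
```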

Element access occurs through several methods. The root method returns the document's root element, while elements provides an enumerable collection for traversing child elements. Index-based access is 1-based, following XPath convention, so elements[1] is the first child element; a string argument is evaluated as an XPath expression relative to the element.

root = doc.root
puts root.name  # => "library"

# Access elements by index
first_book = root.elements[1]
puts first_book.name  # => "book"

# Access elements by name
book_element = root.elements['book']
puts book_element.attributes['id']  # => "1"

Attribute manipulation uses the hash-like attributes interface. Attribute names should be given as strings, and values are converted to strings when stored. The interface provides the familiar hash methods [], []=, and delete for attribute management.

book = doc.root.elements['book']

# Read attributes
category = book.attributes['category']  # => "fiction"
book_id = book.attributes['id']  # => "1"

# Modify attributes
book.attributes['category'] = 'non-fiction'
book.attributes['published'] = '2023'

# Delete attributes
book.attributes.delete('id')

Text content is read through the text method, which returns the value of the element's first text node as a string, or through get_text, which returns the REXML::Text object itself for more control. Text is modified by assignment through text= or by adding REXML::Text objects, including REXML::CData for CDATA sections.

title = doc.root.elements['book/title']
puts title.text  # => "The Ruby Way"

# Modify text content
title.text = "Advanced Ruby Programming"

# Access nested text content
price_text = doc.root.elements['book/price'].get_text.value
puts price_text  # => "29.95"

Element creation uses add_element, which takes the element name and an optional hash of attributes; text content is set separately through text=. The method returns the created element for immediate configuration or further nesting operations.

root = doc.root

# Add new book element
new_book = root.add_element('book')
new_book.attributes['id'] = '2'
new_book.attributes['category'] = 'technical'

# Add nested elements
title = new_book.add_element('title')
title.text = 'Ruby Internals'

author = new_book.add_element('author')
author.text = 'Pat Shaughnessy'

# Add element with attributes in one call
price = new_book.add_element('price', {'currency' => 'USD'})
price.text = '34.99'

Document serialization transforms the internal tree structure back into XML text. The to_s method provides basic serialization, while the formatter classes in REXML::Formatters control indentation and line breaks.

# Basic serialization
puts doc.to_s

# Formatted output with indentation
formatter = REXML::Formatters::Pretty.new
formatter.compact = true  # keep short text on the same line as its tags
output = +''              # mutable buffer, even under frozen_string_literal
formatter.write(doc, output)
puts output
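
Serialization can also target an IO object directly. The sketch below assumes Document#write with its positional output and indent arguments, and uses a StringIO as a stand-in for a file; the sample document is illustrative.

```ruby
require 'rexml/document'
require 'stringio'

doc = REXML::Document.new('<catalog><book><title>Advanced Ruby</title></book></catalog>')

# Any object that responds to << works as the output; a File would be used the same way
buffer = StringIO.new
doc.write(buffer, 2)  # second argument: indentation width in spaces

pretty = buffer.string
puts pretty
```

Indented output adds whitespace around text nodes, so it suits human inspection better than round-tripping data where whitespace is significant.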

Error Handling & Debugging

REXML raises specific exception types for different parsing failures. REXML::ParseException indicates syntax errors in XML structure, while REXML::UndefinedNamespaceException occurs when namespace prefixes lack declarations. Understanding these exceptions enables targeted error recovery strategies.

require 'rexml/document'

begin
  # Malformed XML with missing closing tag
  bad_xml = '<books><book>Title</book'
  doc = REXML::Document.new(bad_xml)
rescue REXML::ParseException => e
  puts "Parse error at line #{e.line}: #{e.message}"
  puts "Position: #{e.position}"
  # Handle parsing failure - perhaps try alternative parsing approach
end

Well-formedness errors surface during document construction, when the parser encounters XML that violates the basic syntax rules. REXML::ParseException reports the line number and position for precise error location. Production code should capture these exceptions to provide meaningful feedback rather than exposing internal parser messages.

def safe_parse_xml(xml_string)
  doc = REXML::Document.new(xml_string)
  { success: true, document: doc }
rescue REXML::ParseException => e
  {
    success: false,
    error: "XML parsing failed: #{e.message}",
    line: e.line,
    position: e.position
  }
rescue StandardError => e
  { success: false, error: "Unexpected error: #{e.message}" }
end

result = safe_parse_xml('<invalid><xml></wrong>')
puts result[:error] unless result[:success]

Namespace errors occur when XML uses namespace prefixes without corresponding declarations. REXML enforces namespace rules strictly, raising UndefinedNamespaceException when encountering undefined prefixes. This behavior prevents silent namespace resolution failures.

begin
  # XML with undefined namespace prefix
  namespaced_xml = '<ns:root><ns:item>content</ns:item></ns:root>'
  doc = REXML::Document.new(namespaced_xml)
rescue REXML::UndefinedNamespaceException => e
  puts "Undefined namespace: #{e.message}"
  # Add proper namespace declaration or remove prefix usage
  corrected_xml = '<root xmlns:ns="http://example.com"><ns:item>content</ns:item></root>'
  doc = REXML::Document.new(corrected_xml)
end

Element lookups that match nothing return nil rather than raising an exception, so calling a method on the result without a nil check raises NoMethodError. Guard each lookup explicitly or with the safe navigation operator (&.).

doc = REXML::Document.new('<books><book id="1"><title>Title</title></book></books>')

# Safe element access with nil checking
book = doc.root.elements['book[@id="999"]']
if book
  title = book.elements['title']&.text
else
  puts "Book with ID 999 not found"
end

# Alternative approach using safe navigation with a fallback value
title_text = doc.root.elements['book[@id="1"]/title']&.get_text&.value || 'No title'
puts title_text

Memory issues arise when processing extremely large XML documents through tree parsing. REXML loads complete documents into memory, potentially causing out-of-memory errors with multi-gigabyte files. Stream parsing provides an alternative for memory-constrained environments.

require 'rexml/document'
require 'rexml/streamlistener'

class MemoryEfficientParser
  include REXML::StreamListener

  def initialize
    @current_path = []
    @target_depth = nil  # depth of the matching <book>, or nil when outside one
  end

  def tag_start(name, attrs)
    @current_path.push(name)

    # Track only specific elements to control memory usage
    if name == 'book' && attrs['category'] == 'technical'
      @target_depth = @current_path.size
    end
  end

  def tag_end(name)
    # Leaving the matching <book>: stop reporting text
    @target_depth = nil if @target_depth == @current_path.size
    @current_path.pop
  end

  def text(content)
    # Report non-blank text found anywhere inside a target element
    return unless @target_depth

    stripped = content.strip
    puts "Found technical book content: #{stripped}" unless stripped.empty?
  end
end

# Use stream parsing for large files
def parse_large_file(filename)
  File.open(filename, 'r') do |file|
    parser = MemoryEfficientParser.new
    REXML::Document.parse_stream(file, parser)
  end
rescue Errno::ENOENT
  puts "File not found: #{filename}"
rescue REXML::ParseException => e
  puts "Parsing failed: #{e.message}"
end

Production Patterns

High-volume XML processing requires careful memory management and performance optimization. Stream parsing becomes essential when processing large feeds or batch imports where document sizes exceed available memory. REXML's streaming interface provides memory-efficient processing for production workloads.

require 'rexml/document'
require 'rexml/streamlistener'

class ProductCatalogProcessor
  include REXML::StreamListener
  
  def initialize(batch_size = 1000)
    @batch_size = batch_size
    @current_product = {}
    @batch = []
    @processed_count = 0
  end
  
  def tag_start(name, attrs)
    case name
    when 'product'
      @current_product = { id: attrs['id'], sku: attrs['sku'] }
    when 'price', 'title', 'description'
      @current_field = name
    end
  end
  
  def text(content)
    if @current_field && !content.strip.empty?
      @current_product[@current_field.to_sym] = content.strip
    end
  end
  
  def tag_end(name)
    if name == 'product' && @current_product[:id]
      @batch << @current_product
      
      if @batch.size >= @batch_size
        process_batch(@batch)
        @batch.clear
      end
      
      @current_product = {}
    end
    @current_field = nil if ['price', 'title', 'description'].include?(name)
  end
  
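  # Process any products left over after the last full batch.
  # Call this once parsing has finished.
  def flush
    return if @batch.empty?

    process_batch(@batch)
    @batch.clear
  end

  private

  def process_batch(products)
    @processed_count += products.size
    puts "Processed batch: #{products.size} products (total: #{@processed_count})"
    # Database insert, API calls, or other batch processing
  end
end

# Process large XML feeds efficiently
def process_product_feed(feed_url_or_file)
  processor = ProductCatalogProcessor.new(500)

  if feed_url_or_file.start_with?('http')
    require 'net/http'
    # Note: response.body buffers the whole HTTP body in memory before parsing;
    # stream parsing avoids building the XML tree, not the download buffer.
    response = Net::HTTP.get_response(URI(feed_url_or_file))
    REXML::Document.parse_stream(response.body, processor)
  else
    File.open(feed_url_or_file, 'r') do |file|
      REXML::Document.parse_stream(file, processor)
    end
  end

  processor.flush
end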

Namespace handling becomes critical in enterprise XML processing where documents often use multiple namespace declarations. REXML provides namespace-aware parsing and element access patterns that prevent namespace conflicts in complex document structures.

class NamespaceAwareProcessor
  def initialize(xml_content)
    @doc = REXML::Document.new(xml_content)
    @namespaces = extract_namespaces
  end
  
  def extract_namespaces
    namespaces = {}
    @doc.root.attributes.each do |name, value|
      if name.start_with?('xmlns:')
        prefix = name.split(':', 2).last
        namespaces[prefix] = value
      elsif name == 'xmlns'
        namespaces[''] = value  # Default namespace
      end
    end
    namespaces
  end
  
  def find_elements_by_namespace(namespace_uri, element_name)
    namespace_prefix = @namespaces.key(namespace_uri)
    return [] unless namespace_prefix
    
    xpath = namespace_prefix.empty? ? 
      "//#{element_name}" : 
      "//#{namespace_prefix}:#{element_name}"
      
    REXML::XPath.match(@doc, xpath, @namespaces)
  end
  
  def process_soap_envelope
    # Handle SOAP namespace processing
    soap_ns = 'http://schemas.xmlsoap.org/soap/envelope/'
    headers = find_elements_by_namespace(soap_ns, 'Header')
    bodies = find_elements_by_namespace(soap_ns, 'Body')
    
    {
      headers: headers.map { |h| extract_element_data(h) },
      bodies: bodies.map { |b| extract_element_data(b) }
    }
  end
  
  private
  
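  def extract_element_data(element)
    # Attributes#each yields [expanded_name, value] pairs with string values
    attributes = {}
    element.attributes.each { |name, value| attributes[name] = value }

    {
      name: element.name,
      namespace: element.namespace,
      text: element.get_text&.value,
      attributes: attributes
    }
  end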
end

# Production SOAP message processing
soap_xml = <<~XML
  <soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/"
                 xmlns:web="http://example.com/webservice">
    <soap:Header>
      <web:Authentication>
        <web:Token>abc123</web:Token>
      </web:Authentication>
    </soap:Header>
    <soap:Body>
      <web:GetUserRequest>
        <web:UserId>12345</web:UserId>
      </web:GetUserRequest>
    </soap:Body>
  </soap:Envelope>
XML

processor = NamespaceAwareProcessor.new(soap_xml)
envelope_data = processor.process_soap_envelope
puts envelope_data.inspect

Error handling strategies for production environments must account for partial processing failures and provide recovery mechanisms. Robust XML processing includes validation checkpoints, transaction boundaries, and detailed logging for troubleshooting production issues.

class RobustXmlProcessor
  attr_reader :stats
  
  def initialize
    @stats = {
      processed: 0,
      errors: 0,
      warnings: 0,
      start_time: Time.now
    }
  end
  
  def process_xml_batch(xml_files, error_threshold = 0.1)
    xml_files.each_with_index do |file_path, index|
      begin
        result = process_single_file(file_path)
        @stats[:processed] += 1
        
        log_progress(index + 1, xml_files.size) if (index + 1) % 100 == 0
        
      rescue REXML::ParseException => e
        handle_parse_error(file_path, e)
      rescue StandardError => e
        handle_general_error(file_path, e)
      end
      
      # Stop processing if error rate exceeds threshold
      error_rate = @stats[:errors].to_f / (index + 1)
      if error_rate > error_threshold && index > 10
        raise "Error rate #{(error_rate * 100).round(2)}% exceeds threshold"
      end
    end
    
    log_completion_stats
  end
  
  private
  
  def process_single_file(file_path)
    File.open(file_path, 'r') do |file|
      doc = REXML::Document.new(file)
      validate_document_structure(doc)
      extract_business_data(doc)
    end
  end
  
  def validate_document_structure(doc)
    unless doc.root
      raise "Invalid XML: No root element found"
    end
    
    required_elements = ['metadata', 'content']
    required_elements.each do |element_name|
      unless doc.root.elements[element_name]
        @stats[:warnings] += 1
        puts "Warning: Missing required element '#{element_name}'"
      end
    end
  end
  
  def extract_business_data(doc)
    # Extract and process business-specific data
    metadata = doc.root.elements['metadata']
    content = doc.root.elements['content']
    
    {
      id: metadata&.attributes&.[]('id'),
      timestamp: metadata&.elements&.[]('timestamp')&.text,
      data: content&.get_text&.value
    }
  end
  
  def handle_parse_error(file_path, error)
    @stats[:errors] += 1
    puts "Parse error in #{file_path}: #{error.message} (line #{error.line})"
    # Log to error tracking system, move file to error directory, etc.
  end
  
  def handle_general_error(file_path, error)
    @stats[:errors] += 1
    puts "Processing error in #{file_path}: #{error.message}"
    # Additional error handling specific to application needs
  end
  
  def log_progress(current, total)
    elapsed = Time.now - @stats[:start_time]
    rate = @stats[:processed] / elapsed
    puts "Progress: #{current}/#{total} files, #{rate.round(2)} files/sec"
  end
  
  def log_completion_stats
    elapsed = Time.now - @stats[:start_time]
    puts "Completed: #{@stats[:processed]} processed, #{@stats[:errors]} errors, " \
         "#{@stats[:warnings]} warnings in #{elapsed.round(2)} seconds"
  end
end

Reference

Core Classes

Class	Purpose	Key Methods
`REXML::Document`	Complete XML document representation	`new(source)`, `root`, `write(output)`, `to_s`
`REXML::Element`	Individual XML element	`name`, `text`, `text=`, `attributes`, `elements`, `add_element`
`REXML::Attributes`	Element attributes collection	`[]`, `[]=`, `delete`, `each`, `each_attribute`, `to_a`
`REXML::XPath`	XPath query operations	`match(element, path)`, `first(element, path)`, `each(element, path)`
`REXML::Text`	Text content representation	`value`, `to_s`, `normalize`, `wrap`

Document Operations

Method	Parameters	Returns	Description
`REXML::Document.new(source, context)`	`source` (String/IO), `context` (Hash)	`Document`	Creates document from XML source
`#root`	None	`Element` or `nil`	Returns root element
`#add(child)`	`child` (Element/Text/etc.)	`child`	Adds child to document
`#write(output, indent, transitive, ie_hack)`	Various formatting options	`nil`	Writes formatted XML
`#to_s`	None	`String`	Serializes document to string
`#deep_clone`	None	`Document`	Creates deep copy (`#clone` copies only the node itself, without its children)

Element Manipulation

Method	Parameters	Returns	Description
`#add_element(name, attrs)`	`name` (String/Element), `attrs` (Hash)	`Element`	Creates and adds child element
`#delete_element(element)`	`element` (Element/String/Integer)	`Element` or `nil`	Removes specified element
`#elements`	None	`Elements`	Returns child elements collection
`#each_element(xpath)`	`xpath` (String), block	`self`	Iterates over matching elements
`#get_elements(xpath)`	`xpath` (String)	`Array`	Returns array of matching elements
`#next_element`	None	`Element` or `nil`	Returns next sibling element
`#previous_element`	None	`Element` or `nil`	Returns previous sibling element
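
A brief sketch of several of the element methods listed above, run against a made-up three-element document (the element names and ids are illustrative):

```ruby
require 'rexml/document'

doc = REXML::Document.new('<library><book id="1"/><book id="2"/><magazine id="3"/></library>')
root = doc.root

# each_element iterates over children matching the XPath
ids = []
root.each_element('book') { |book| ids << book.attributes['id'] }

# get_elements returns the matches as an Array
books = root.get_elements('book')

# next_element skips text nodes and returns the following sibling element
sibling = books.first.next_element

# delete_element accepts an XPath string, an index, or an Element
root.delete_element('magazine')
remaining = root.elements.to_a.map(&:name)

puts ids.inspect
puts sibling.attributes['id']
puts remaining.inspect
```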

Text and Attribute Access

Method	Parameters	Returns	Description
`#text`	None	`String` or `nil`	Returns element's text content
`#text=(content)`	`content` (String)	`String`	Sets element's text content
`#get_text(path)`	`path` (String)	`Text` or `nil`	Returns Text object for content
`#add_text(text)`	`text` (String/Text)	`self`	Adds text content
`#attributes[name]`	`name` (String)	`String` or `nil`	Gets attribute value
`#attributes[name]=value`	`name` (String), `value` (String)	`String`	Sets attribute value
`#add_attribute(name, value)`	`name` (String), `value` (String)	`self`	Adds or updates attribute
`#delete_attribute(name)`	`name` (String)	`Attribute` or `nil`	Removes attribute

XPath Operations

Method	Parameters	Returns	Description
`REXML::XPath.match(element, path, namespaces, variables)`	Element, path string, optional context	`Array`	Returns all matching nodes
`REXML::XPath.first(element, path, namespaces, variables)`	Element, path string, optional context	`Node` or `nil`	Returns first matching node
`REXML::XPath.each(element, path, namespaces, variables)`	Element, path string, optional context, block	`nil`	Iterates over matches
`#xpath`	None	`String`	Returns the XPath location of the element within its document
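
The three class-level query methods can be sketched as follows. The namespace URI, the bk prefix, and the catalog content are assumptions chosen for the example; the hash passed as the third argument maps prefixes used in the expression to namespace URIs.

```ruby
require 'rexml/document'

xml = <<~XML
  <catalog xmlns:bk="http://example.com/books">
    <bk:book category="fiction"><bk:title>Ruby Tales</bk:title></bk:book>
    <bk:book category="technical"><bk:title>Ruby Internals</bk:title></bk:book>
  </catalog>
XML

doc = REXML::Document.new(xml)
ns = { 'bk' => 'http://example.com/books' }

# match: every matching node, as an Array
all_titles = REXML::XPath.match(doc, '//bk:title', ns).map(&:text)

# first: the first matching node, or nil
first_tech = REXML::XPath.first(doc, "//bk:book[@category='technical']/bk:title", ns)

# each: iterate over matches with a block
categories = []
REXML::XPath.each(doc, '//bk:book', ns) { |book| categories << book.attributes['category'] }

puts all_titles.inspect
puts first_tech.text
puts categories.inspect
```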

Stream Processing

Class/Method	Purpose	Key Methods
`REXML::StreamListener`	Base module for stream processing	`tag_start`, `tag_end`, `text`, `instruction`
`REXML::Document.parse_stream(source, listener)`	Processes XML without building tree	Static method for streaming
`REXML::SAX2Listener`	SAX2-style event interface	`start_element`, `end_element`, `characters`
`REXML::Parsers::PullParser`	Pull-style parsing interface	`pull`, `peek`, `has_next?`
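
The SAX2-style interface is not demonstrated elsewhere in this guide. A minimal sketch, assuming REXML::Parsers::SAX2Parser and its block-based listen method, where the second argument restricts a callback to the named tags. The sample XML and the collecting arrays are illustrative.

```ruby
require 'rexml/document'
require 'rexml/parsers/sax2parser'

xml = '<library><book id="1"><title>The Ruby Way</title></book>' \
      '<book id="2"><title>Ruby Internals</title></book></library>'

parser = REXML::Parsers::SAX2Parser.new(xml)
book_ids = []
titles = []

# start_element blocks receive the namespace URI, local name, qualified name, and attributes
parser.listen(:start_element, ['book']) do |_uri, _localname, _qname, attrs|
  book_ids << attrs['id']
end

# characters blocks receive the text; the filter limits them to text inside <title>
parser.listen(:characters, ['title']) { |text| titles << text }

parser.parse

puts book_ids.inspect
puts titles.inspect
```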

Formatting and Output

Class	Purpose	Configuration
`REXML::Formatters::Default`	Basic output formatting	No special formatting
`REXML::Formatters::Pretty`	Indented output formatting	`compact`, `width` properties
`REXML::Formatters::Transitive`	Preserves whitespace formatting	Maintains original spacing

Exception Hierarchy

Exception	Inheritance	Raised When
`REXML::ParseException`	`RuntimeError`	XML syntax errors during parsing
`REXML::UndefinedNamespaceException`	`ParseException`	Undefined namespace prefix usage
`REXML::Validation::ValidationException`	`RuntimeError`	Document validation failures

Common XPath Expressions

Pattern	Purpose	Example
`//element`	All elements with name	`//book` finds all book elements
`/root/child`	Direct child path	`/catalog/book` finds books under catalog
`//element[@attr='value']`	Attribute filtering	`//book[@category='fiction']`
`//element[n]`	Position-based selection (1-based)	`//book[1]` selects each book that is the first book child of its parent
`//element[text()='value']`	Text content filtering	`//title[text()='Ruby Guide']`
`//element/*`	All child elements	`//book/*` gets all book children
`//element/text()`	Text nodes only	`//price/text()` gets price text
`//namespace:element`	Namespaced elements	`//soap:Body` with namespace
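
A few of the patterns above, applied to a small made-up catalog as a quick check of what each expression returns:

```ruby
require 'rexml/document'

doc = REXML::Document.new(<<~XML)
  <catalog>
    <book category="fiction"><title>Ruby Guide</title><price>19.99</price></book>
    <book category="technical"><title>Ruby Internals</title><price>34.99</price></book>
  </catalog>
XML

# Attribute filtering
fiction = REXML::XPath.match(doc, "//book[@category='fiction']/title").map(&:text)

# Text nodes only: the matches are REXML::Text objects
prices = REXML::XPath.match(doc, '//price/text()').map(&:value)

# Position-based selection combined with the child wildcard
first_book_children = REXML::XPath.match(doc, '/catalog/book[1]/*').map(&:name)

# Text content filtering
guide = REXML::XPath.first(doc, "//title[text()='Ruby Guide']")

puts fiction.inspect
puts prices.inspect
puts first_book_children.inspect
puts guide.text
```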

Namespace Handling

Pattern	Usage	Example
Default namespace declaration	`xmlns="uri"`	`<root xmlns="http://example.com">`
Prefixed namespace declaration	`xmlns:prefix="uri"`	`<root xmlns:ns="http://example.com">`
Namespace-aware XPath	Use prefix in expressions	`//ns:element` with namespace context
Namespace context in queries	Pass hash to XPath methods	`{'ns' => 'http://example.com'}`

Performance Characteristics

Operation	Memory Usage	Processing Speed	Best For
Tree parsing	High (entire document)	Fast random access	Small to medium documents
Stream parsing	Low (constant)	Fast sequential processing	Large documents
Pull parsing	Low to medium	Medium (event-driven)	Selective processing
XPath queries	Medium	Variable by complexity	Complex document queries