Overview
REXML is a pure-Ruby XML processing library distributed with Ruby: it was part of the standard library through Ruby 2.7 and has been a bundled gem (rexml) since Ruby 3.0. The library handles XML document parsing, generation, modification, and querying through a tree-based Document Object Model. REXML supports XML namespaces, XPath expressions, and streaming operations for memory-efficient processing of large documents.
The core classes include REXML::Document for complete XML documents, REXML::Element for individual XML elements, and REXML::XPath for query operations. REXML offers several parsing approaches: tree parsing for full document access, stream parsing, which pushes events to a listener without building a tree, and pull parsing, where the caller requests events one at a time.
require 'rexml/document'
xml_string = '<books><book id="1">Ruby Programming</book></books>'
doc = REXML::Document.new(xml_string)
puts doc.root.name # => "books"
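The tree example above has a pull-parsing counterpart. The following is a minimal sketch, assuming the REXML::Parsers::PullParser class from rexml/parsers/pullparser; the sample XML and the book_ids variable are made up for illustration.

```ruby
require 'rexml/parsers/pullparser'

xml = '<books><book id="1">Ruby Programming</book><book id="2">XML Basics</book></books>'
parser = REXML::Parsers::PullParser.new(xml)

book_ids = []
while parser.has_next?
  event = parser.pull
  # For a start-element event, event[0] is the tag name and
  # event[1] is the attribute hash.
  book_ids << event[1]['id'] if event.start_element? && event[0] == 'book'
end

puts book_ids.inspect
```

Because the caller drives the loop, it can stop pulling events as soon as it has what it needs.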
REXML is a non-validating parser: it reads DOCTYPE declarations and expands the entities they define, but it does not validate documents against a DTD. Processing is namespace-aware, and input can come from strings or from files and other IO objects. Character encoding is detected automatically from the input, with UTF-8 as the default output encoding.
# Creating XML from scratch
doc = REXML::Document.new
root = doc.add_element('catalog')
book = root.add_element('book', {'isbn' => '978-0123456789'})
book.add_element('title').text = 'Advanced Ruby'
puts doc.to_s
REXML integrates with Ruby's IO classes for file operations and provides formatted output with configurable indentation. The library supports XML comments, processing instructions, and CDATA sections while maintaining document structure integrity.
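As a rough illustration of the node types mentioned above, the sketch below builds a document containing an XML declaration, a processing instruction, a comment, and a CDATA section. The element names and the stylesheet reference are invented for the example.

```ruby
require 'rexml/document'

doc = REXML::Document.new
doc << REXML::XMLDecl.new('1.0', 'UTF-8')
# Processing instruction: target first, then its content
doc << REXML::Instruction.new('xml-stylesheet', 'type="text/xsl" href="style.xsl"')

root = doc.add_element('notes')
root << REXML::Comment.new(' generated for illustration ')

# CDATA keeps markup-significant characters literal in the output
note = root.add_element('note')
note << REXML::CData.new('5 < 6 && 7 > 2')

xml_out = doc.to_s
puts xml_out
```

The CDATA content is written verbatim, whereas the same string assigned through text= would be escaped on output.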
Basic Usage
Document creation begins with REXML::Document.new, which accepts either an XML string or an IO object. The constructor parses the input immediately, building the complete document tree in memory. Called with no arguments, it returns an empty document with no root element.
require 'rexml/document'
# Parse XML from string
xml_data = <<~XML
  <library>
    <book id="1" category="fiction">
      <title>The Ruby Way</title>
      <author>Hal Fulton</author>
      <price>29.95</price>
    </book>
  </library>
XML
doc = REXML::Document.new(xml_data)
Element access occurs through several methods. The root method returns the document's root element, while elements provides an enumerable collection for traversing child elements. Element navigation supports both index-based and name-based access; indexes are 1-based, following XPath conventions, so elements[1] is the first child element.
root = doc.root
puts root.name # => "library"
# Access elements by index
first_book = root.elements[1]
puts first_book.name # => "book"
# Access elements by name
book_element = root.elements['book']
puts book_element.attributes['id'] # => "1"
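Beyond the elements collection, the same lookups can be written with REXML::XPath and each_element. This is a small sketch under the assumption of a two-book sample document; the variable names are illustrative only.

```ruby
require 'rexml/document'

doc = REXML::Document.new(<<~XML)
  <library>
    <book id="1" category="fiction"><title>The Ruby Way</title></book>
    <book id="2" category="technical"><title>Ruby Internals</title></book>
  </library>
XML

# First node that matches the expression, or nil when nothing matches
technical = REXML::XPath.first(doc, '//book[@category="technical"]')

# Every matching node, returned as an array
titles = REXML::XPath.match(doc, '//book/title').map(&:text)

# Iterate over child elements selected by an XPath relative to the root
ids = []
doc.root.each_element('book') { |book| ids << book.attributes['id'] }

puts technical.attributes['id']
puts titles.inspect
puts ids.inspect
```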
Attribute manipulation uses the hash-like attributes interface. Attribute names should be given as strings, and values are converted to strings when stored. The interface provides standard hash methods including [], []=, and delete for attribute management.
book = doc.root.elements['book']
# Read attributes
category = book.attributes['category'] # => "fiction"
book_id = book.attributes['id'] # => "1"
# Modify attributes
book.attributes['category'] = 'non-fiction'
book.attributes['published'] = '2023'
# Delete attributes
book.attributes.delete('id')
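The element itself also exposes attribute helpers, which can read more naturally than going through the collection. A minimal sketch, using a standalone element and made-up attribute names:

```ruby
require 'rexml/document'

book = REXML::Element.new('book')
book.add_attribute('id', '1')
book.add_attributes('category' => 'fiction', 'lang' => 'en')

# each_attribute yields REXML::Attribute objects rather than plain strings
pairs = {}
book.attributes.each_attribute { |attr| pairs[attr.name] = attr.value }

book.delete_attribute('lang')

puts pairs.inspect
puts book.attributes['lang'].inspect
```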
Text content access occurs through the text method for simple elements or get_text for more control: text returns the unescaped string of the first text node, while get_text returns the REXML::Text object itself. Text modification works directly through assignment or by adding REXML::Text objects, or REXML::CData objects for CDATA sections.
title = doc.root.elements['book/title']
puts title.text # => "The Ruby Way"
# Modify text content
title.text = "Advanced Ruby Programming"
# Access nested text content
price_text = doc.root.elements['book/price'].get_text.value
puts price_text # => "29.95"
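The difference between these accessors shows up with entities and mixed content. The sketch below assumes a small made-up paragraph element; it is meant as an illustration rather than a complete account of REXML's text handling.

```ruby
require 'rexml/document'

el = REXML::Document.new('<p>Fish &amp; <b>chips</b> daily</p>').root

# text: unescaped content of the FIRST text node only
first_text = el.text

# get_text.to_s: the same node, but in its escaped (serialized) form
escaped = el.get_text.to_s

# texts: all direct text children, skipping the nested <b> element
all_text = el.texts.map(&:value).join

puts first_text
puts escaped
puts all_text
```

Note that none of these calls descend into child elements, so the word inside the nested element is not included.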
Element creation uses add_element to append a new child element, optionally with a hash of attributes. The method returns the created element, so it can be configured immediately or used as the parent for further nesting.
root = doc.root
# Add new book element
new_book = root.add_element('book')
new_book.attributes['id'] = '2'
new_book.attributes['category'] = 'technical'
# Add nested elements
title = new_book.add_element('title')
title.text = 'Ruby Internals'
author = new_book.add_element('author')
author.text = 'Pat Shaughnessy'
# Add element with attributes in one call
price = new_book.add_element('price', {'currency' => 'USD'})
price.text = '34.99'
Document serialization transforms the internal tree structure back into XML text. The to_s method provides basic serialization, while custom formatters control indentation, line breaks, and encoding options.
# Basic serialization
puts doc.to_s
# Formatted output with indentation
formatter = REXML::Formatters::Pretty.new
formatter.compact = true
output = ''
formatter.write(doc, output)
puts output
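Serialization can also target an IO object directly. The sketch below passes a StringIO and an indent width to Document#write; the sample document is invented, and a File opened for writing would work the same way.

```ruby
require 'rexml/document'
require 'stringio'

doc = REXML::Document.new('<catalog><book><title>Advanced Ruby</title></book></catalog>')

# Default serialization keeps everything on one line
compact = doc.to_s

# write(output, indent) pretty-prints with the given indent width
io = StringIO.new
doc.write(io, 2)
indented = io.string

puts compact
puts indented
```

Pretty-printing inserts whitespace into the output, so it is best avoided for documents where whitespace inside elements is significant.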
Error Handling & Debugging
REXML raises specific exception types for different parsing failures. REXML::ParseException indicates syntax errors in XML structure, while REXML::UndefinedNamespaceException occurs when namespace prefixes lack declarations. Understanding these exceptions enables targeted error recovery strategies.
require 'rexml/document'
begin
  # Malformed XML with missing closing tag
  bad_xml = '<books><book>Title</book'
  doc = REXML::Document.new(bad_xml)
rescue REXML::ParseException => e
  puts "Parse error at line #{e.line}: #{e.message}"
  puts "Position: #{e.position}"
  # Handle parsing failure - perhaps try alternative parsing approach
end
Well-formedness errors surface during document construction, when the parser meets input that violates basic XML syntax rules. The exception carries the line number and character position of the failure. Production code should rescue these exceptions and report meaningful feedback rather than exposing internal parser messages.
def safe_parse_xml(xml_string)
  doc = REXML::Document.new(xml_string)
  { success: true, document: doc }
rescue REXML::ParseException => e
  {
    success: false,
    error: "XML parsing failed: #{e.message}",
    line: e.line,
    position: e.position
  }
rescue StandardError => e
  {
    success: false,
    error: "Unexpected error: #{e.message}"
  }
end

result = safe_parse_xml('<invalid><xml></wrong>')
puts result[:error] unless result[:success]
Namespace errors occur when XML uses namespace prefixes without corresponding declarations. REXML enforces namespace rules strictly, raising UndefinedNamespaceException when it encounters an undefined prefix. This behavior prevents silent namespace resolution failures.
begin
  # XML with undefined namespace prefix
  namespaced_xml = '<ns:root><ns:item>content</ns:item></ns:root>'
  doc = REXML::Document.new(namespaced_xml)
rescue REXML::UndefinedNamespaceException => e
  puts "Undefined namespace: #{e.message}"
  # Add proper namespace declaration or remove prefix usage
  corrected_xml = '<root xmlns:ns="http://example.com"><ns:item>content</ns:item></root>'
  doc = REXML::Document.new(corrected_xml)
end
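Once the prefix is declared, queries against namespaced elements need a prefix-to-URI mapping. A minimal sketch, assuming the corrected document from the example above; the 'x' prefix and the ns_map name are arbitrary choices for illustration.

```ruby
require 'rexml/document'

doc = REXML::Document.new(
  '<root xmlns:ns="http://example.com"><ns:item>content</ns:item></root>'
)

# The prefix used in the query ('x') does not have to match the one in the
# document; the lookup is resolved through the namespace URI.
ns_map = { 'x' => 'http://example.com' }
item = REXML::XPath.first(doc, '//x:item', ns_map)

puts item.text
puts item.namespace
```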
Element access problems usually come from lookups that match nothing, such as a path to an element that does not exist. These lookups return nil rather than raising an exception, so explicit nil checks are needed to avoid calling methods on nil.
doc = REXML::Document.new('<books><book id="1">Title</book></books>')
# Safe element access with nil checking
book = doc.root.elements['book[@id="999"]']
if book
  # The title child may be missing as well, so guard that lookup too
  title = book.elements['title']&.text
else
  puts "Book with ID 999 not found"
end
# Alternative approach: chain safe navigation and fall back to a default
title_text = doc.root.elements['book[@id="1"]/title']&.get_text&.value || 'No title'
puts title_text
Memory issues arise when processing extremely large XML documents through tree parsing. REXML loads complete documents into memory, potentially causing out-of-memory errors with multi-gigabyte files. Stream parsing provides an alternative for memory-constrained environments.
require 'rexml/document'        # defines REXML::Document.parse_stream
require 'rexml/streamlistener'

class MemoryEfficientParser
  include REXML::StreamListener

  def initialize
    @current_path = []
    @target_elements = {}
  end

  def tag_start(name, attrs)
    @current_path.push(name)
    # Process only specific elements to control memory usage
    if name == 'book' && attrs['category'] == 'technical'
      @target_elements[@current_path.dup] = { name: name, attributes: attrs }
    end
  end

  def tag_end(name)
    # Forget the element once it closes, so later siblings at the same
    # path are not mistaken for it
    @target_elements.delete(@current_path)
    @current_path.pop
  end

  def text(content)
    # Process text content for target elements only
    if @target_elements.key?(@current_path)
      puts "Found technical book content: #{content.strip}"
    end
  end
end

# Use stream parsing for large files
def parse_large_file(filename)
  File.open(filename, 'r') do |file|
    parser = MemoryEfficientParser.new
    REXML::Document.parse_stream(file, parser)
  end
rescue Errno::ENOENT
  puts "File not found: #{filename}"
rescue REXML::ParseException => e
  puts "Parsing failed: #{e.message}"
end
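A listener can be tried out on an in-memory string before pointing it at a large file. The sketch below uses a hypothetical TagCounter listener, not part of REXML, that only counts start tags and keeps no document tree.

```ruby
require 'rexml/document'
require 'rexml/streamlistener'

# Hypothetical listener for illustration: counts start tags by name
class TagCounter
  include REXML::StreamListener

  attr_reader :counts

  def initialize
    @counts = Hash.new(0)
  end

  def tag_start(name, _attrs)
    @counts[name] += 1
  end
end

counter = TagCounter.new
REXML::Document.parse_stream('<library><book/><book/><dvd/></library>', counter)

puts counter.counts.inspect
```

Callbacks that the listener does not override fall back to the empty defaults supplied by REXML::StreamListener.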
Production Patterns
High-volume XML processing requires careful memory management and performance optimization. Stream parsing becomes essential when processing large feeds or batch imports where document sizes exceed available memory. REXML's streaming interface provides memory-efficient processing for production workloads.
require 'net/http'
require 'rexml/document'        # defines REXML::Document.parse_stream
require 'rexml/streamlistener'

class ProductCatalogProcessor
  include REXML::StreamListener

  def initialize(batch_size = 1000)
    @batch_size = batch_size
    @current_product = {}
    @current_field = nil
    @batch = []
    @processed_count = 0
  end

  def tag_start(name, attrs)
    case name
    when 'product'
      @current_product = { id: attrs['id'], sku: attrs['sku'] }
    when 'price', 'title', 'description'
      @current_field = name
    end
  end

  def text(content)
    if @current_field && !content.strip.empty?
      @current_product[@current_field.to_sym] = content.strip
    end
  end

  def tag_end(name)
    if name == 'product' && @current_product[:id]
      @batch << @current_product
      flush if @batch.size >= @batch_size
      @current_product = {}
    end
    @current_field = nil if ['price', 'title', 'description'].include?(name)
  end

  # Process whatever remains in the last, partially filled batch.
  # Call this once parsing has finished.
  def flush
    return if @batch.empty?
    process_batch(@batch)
    @batch.clear
  end

  private

  def process_batch(products)
    @processed_count += products.size
    puts "Processed batch: #{products.size} products (total: #{@processed_count})"
    # Database insert, API calls, or other batch processing
  end
end

# Process large XML feeds efficiently
def process_product_feed(feed_url_or_file)
  processor = ProductCatalogProcessor.new(500)
  if feed_url_or_file.start_with?('http')
    uri = URI(feed_url_or_file)
    Net::HTTP.get_response(uri) do |response|
      # response.body reads the whole body into memory before parsing
      REXML::Document.parse_stream(response.body, processor)
    end
  else
    File.open(feed_url_or_file, 'r') do |file|
      REXML::Document.parse_stream(file, processor)
    end
  end
  # Without this, products in the final partial batch would be dropped
  processor.flush
end
Namespace handling becomes critical in enterprise XML processing where documents often use multiple namespace declarations. REXML provides namespace-aware parsing and element access patterns that prevent namespace conflicts in complex document structures.
class NamespaceAwareProcessor
  def initialize(xml_content)
    @doc = REXML::Document.new(xml_content)
    @namespaces = extract_namespaces
  end

  def extract_namespaces
    namespaces = {}
    @doc.root.attributes.each do |name, value|
      if name.start_with?('xmlns:')
        prefix = name.split(':', 2).last
        namespaces[prefix] = value
      elsif name == 'xmlns'
        namespaces[''] = value # Default namespace
      end
    end
    namespaces
  end

  def find_elements_by_namespace(namespace_uri, element_name)
    namespace_prefix = @namespaces.key(namespace_uri)
    return [] unless namespace_prefix

    xpath = if namespace_prefix.empty?
              "//#{element_name}"
            else
              "//#{namespace_prefix}:#{element_name}"
            end
    REXML::XPath.match(@doc, xpath, @namespaces)
  end

  def process_soap_envelope
    # Handle SOAP namespace processing
    soap_ns = 'http://schemas.xmlsoap.org/soap/envelope/'
    headers = find_elements_by_namespace(soap_ns, 'Header')
    bodies = find_elements_by_namespace(soap_ns, 'Body')
    {
      headers: headers.map { |h| extract_element_data(h) },
      bodies: bodies.map { |b| extract_element_data(b) }
    }
  end

  private

  def extract_element_data(element)
    {
      name: element.name,
      namespace: element.namespace,
      text: element.get_text&.value,
      attributes: element.attributes.to_h
    }
  end
end

# Production SOAP message processing
soap_xml = <<~XML
  <soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/"
                 xmlns:web="http://example.com/webservice">
    <soap:Header>
      <web:Authentication>
        <web:Token>abc123</web:Token>
      </web:Authentication>
    </soap:Header>
    <soap:Body>
      <web:GetUserRequest>
        <web:UserId>12345</web:UserId>
      </web:GetUserRequest>
    </soap:Body>
  </soap:Envelope>
XML
processor = NamespaceAwareProcessor.new(soap_xml)
envelope_data = processor.process_soap_envelope
puts envelope_data.inspect
Error handling strategies for production environments must account for partial processing failures and provide recovery mechanisms. Robust XML processing includes validation checkpoints, transaction boundaries, and detailed logging for troubleshooting production issues.
class RobustXmlProcessor
  attr_reader :stats

  def initialize
    @stats = {
      processed: 0,
      errors: 0,
      warnings: 0,
      start_time: Time.now
    }
  end

  def process_xml_batch(xml_files, error_threshold = 0.1)
    xml_files.each_with_index do |file_path, index|
      begin
        process_single_file(file_path)
        @stats[:processed] += 1
        log_progress(index + 1, xml_files.size) if (index + 1) % 100 == 0
      rescue REXML::ParseException => e
        handle_parse_error(file_path, e)
      rescue StandardError => e
        handle_general_error(file_path, e)
      end
      # Stop processing if error rate exceeds threshold
      error_rate = @stats[:errors].to_f / (index + 1)
      if error_rate > error_threshold && index > 10
        raise "Error rate #{(error_rate * 100).round(2)}% exceeds threshold"
      end
    end
    log_completion_stats
  end

  private

  def process_single_file(file_path)
    File.open(file_path, 'r') do |file|
      doc = REXML::Document.new(file)
      validate_document_structure(doc)
      extract_business_data(doc)
    end
  end

  def validate_document_structure(doc)
    unless doc.root
      raise "Invalid XML: No root element found"
    end
    required_elements = ['metadata', 'content']
    required_elements.each do |element_name|
      unless doc.root.elements[element_name]
        @stats[:warnings] += 1
        puts "Warning: Missing required element '#{element_name}'"
      end
    end
  end

  def extract_business_data(doc)
    # Extract and process business-specific data
    metadata = doc.root.elements['metadata']
    content = doc.root.elements['content']
    {
      id: metadata&.attributes&.[]('id'),
      timestamp: metadata&.elements&.[]('timestamp')&.text,
      data: content&.get_text&.value
    }
  end

  def handle_parse_error(file_path, error)
    @stats[:errors] += 1
    puts "Parse error in #{file_path}: #{error.message} (line #{error.line})"
    # Log to error tracking system, move file to error directory, etc.
  end

  def handle_general_error(file_path, error)
    @stats[:errors] += 1
    puts "Processing error in #{file_path}: #{error.message}"
    # Additional error handling specific to application needs
  end

  def log_progress(current, total)
    elapsed = Time.now - @stats[:start_time]
    rate = @stats[:processed] / elapsed
    puts "Progress: #{current}/#{total} files, #{rate.round(2)} files/sec"
  end

  def log_completion_stats
    elapsed = Time.now - @stats[:start_time]
    puts "Completed: #{@stats[:processed]} processed, #{@stats[:errors]} errors, " \
         "#{@stats[:warnings]} warnings in #{elapsed.round(2)} seconds"
  end
end
Reference
Core Classes
Class | Purpose | Key Methods |
---|---|---|
REXML::Document | Complete XML document representation | new(source), root, write(output), to_s |
REXML::Element | Individual XML element | name, text, text=, attributes, elements, add_element |
REXML::Attributes | Element attributes collection | [], []=, delete, each, to_h |
REXML::XPath | XPath query operations | match(element, path), first(element, path), each(element, path) |
REXML::Text | Text content representation | value, to_s, normalize, wrap |
Document Operations
Method | Parameters | Returns | Description |
---|---|---|---|
REXML::Document.new(source, context) | source (String/IO), context (Hash) | Document | Creates document from XML source |
#root | None | Element or nil | Returns root element |
#add(child) | child (Element/Text/etc.) | child | Adds child to document |
#write(output, indent, transitive, ie_hack) | Various formatting options | nil | Writes formatted XML |
#to_s | None | String | Serializes document to string |
#clone | None | Document | Creates a shallow copy (context and attributes, no children); use #deep_clone for a full copy |
Element Manipulation
Method | Parameters | Returns | Description |
---|---|---|---|
#add_element(name, attrs) | name (String/Element), attrs (Hash) | Element | Creates and adds child element |
#delete_element(element) | element (Element/String/Integer) | Element or nil | Removes specified element |
#elements | None | Elements | Returns child elements collection |
#each_element(xpath) | xpath (String), block | self | Iterates over matching elements |
#get_elements(xpath) | xpath (String) | Array | Returns array of matching elements |
#next_element | None | Element or nil | Returns next sibling element |
#previous_element | None | Element or nil | Returns previous sibling element |
Text and Attribute Access
Method | Parameters | Returns | Description |
---|---|---|---|
#text | None | String or nil | Returns element's text content |
#text=(content) | content (String) | String | Sets element's text content |
#get_text(path) | path (String) | Text or nil | Returns Text object for content |
#add_text(text) | text (String/Text) | self | Adds text content |
#attributes[name] | name (String) | String or nil | Gets attribute value |
#attributes[name]=value | name (String), value (String) | String | Sets attribute value |
#add_attribute(name, value) | name (String), value (String) | self | Adds or updates attribute |
#delete_attribute(name) | name (String) | Attribute or nil | Removes attribute |
XPath Operations
Method | Parameters | Returns | Description |
---|---|---|---|
REXML::XPath.match(element, path, namespaces, variables) | Element, path string, optional context | Array | Returns all matching nodes |
REXML::XPath.first(element, path, namespaces, variables) | Element, path string, optional context | Node or nil | Returns first matching node |
REXML::XPath.each(element, path, namespaces, variables) | Element, path string, optional context, block | nil | Iterates over matches |
#xpath | None | String | Instance method on Element; returns the XPath string that locates the element (it does not run a query) |
Stream Processing
Class/Method | Purpose | Key Methods |
---|---|---|
REXML::StreamListener | Base module for stream processing | tag_start, tag_end, text, instruction |
REXML::Document.parse_stream(source, listener) | Processes XML without building tree | Static method for streaming |
REXML::SAX2Listener | SAX2-style event interface | start_element, end_element, characters |
REXML::Parsers::PullParser | Pull-style parsing interface | pull, peek, has_next? |
Formatting and Output
Class | Purpose | Configuration |
---|---|---|
REXML::Formatters::Default | Basic output formatting | No special formatting |
REXML::Formatters::Pretty | Indented output formatting | compact, width properties |
REXML::Formatters::Transitive | Preserves whitespace formatting | Maintains original spacing |
Exception Hierarchy
Exception | Inheritance | Raised When |
---|---|---|
REXML::ParseException | RuntimeError | XML syntax errors during parsing |
REXML::UndefinedNamespaceException | ParseException | Undefined namespace prefix usage |
REXML::Validation::ValidationException | RuntimeError | Document validation failures |
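Because UndefinedNamespaceException is a subclass of ParseException, rescue clauses need to list the more specific class first. The helper below is a hypothetical sketch (classify_parse_failure is not a REXML method) showing that ordering.

```ruby
require 'rexml/document'

# Hypothetical helper for illustration: maps parse outcomes to symbols
def classify_parse_failure(xml)
  REXML::Document.new(xml)
  :ok
rescue REXML::UndefinedNamespaceException
  # Must come before ParseException, its superclass
  :undefined_namespace
rescue REXML::ParseException
  :malformed
end

puts classify_parse_failure('<a><b/></a>')
puts classify_parse_failure('<ns:a></ns:a>')
puts classify_parse_failure('<a><b></a>')
```

If the two rescue clauses were swapped, the ParseException branch would catch namespace errors as well and the more specific branch would never run.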
Common XPath Expressions
Pattern | Purpose | Example |
---|---|---|
//element | All elements with name | //book finds all book elements |
/root/child | Direct child path | /catalog/book finds books under catalog |
//element[@attr='value'] | Attribute filtering | //book[@category='fiction'] |
//element[n] | Position-based selection (1-based) | //book[1] selects each book that is the first book child of its parent |
//element[text()='value'] | Text content filtering | //title[text()='Ruby Guide'] |
//element/* | All child elements | //book/* gets all book children |
//element/text() | Text nodes only | //price/text() gets price text |
//namespace:element | Namespaced elements | //soap:Body with namespace |
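Several of the patterns in the table can be exercised against a small document. This is a sketch under the assumption of a two-book catalog invented for the example; the variable names are illustrative.

```ruby
require 'rexml/document'

doc = REXML::Document.new(<<~XML)
  <catalog>
    <book category="fiction"><title>Ruby Tales</title><price>9.99</price></book>
    <book category="technical"><title>Ruby Guide</title><price>29.95</price></book>
  </catalog>
XML

# Attribute filtering
fiction_titles = REXML::XPath.match(doc, "//book[@category='fiction']/title").map(&:text)

# Position-based selection (positions start at 1)
second_title = REXML::XPath.first(doc, '/catalog/book[2]/title').text

# All child elements of the first book
child_names = REXML::XPath.match(doc, '/catalog/book[1]/*').map(&:name)

# Text nodes only: the result holds REXML::Text objects
prices = REXML::XPath.match(doc, '//price/text()').map(&:value)

puts fiction_titles.inspect
puts second_title
puts child_names.inspect
puts prices.inspect
```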
Namespace Handling
Pattern | Usage | Example |
---|---|---|
Default namespace declaration | xmlns="uri" | <root xmlns="http://example.com"> |
Prefixed namespace declaration | xmlns:prefix="uri" | <root xmlns:ns="http://example.com"> |
Namespace-aware XPath | Use prefix in expressions | //ns:element with namespace context |
Namespace context in queries | Pass hash to XPath methods | {'ns' => 'http://example.com'} |
Performance Characteristics
Operation | Memory Usage | Processing Speed | Best For |
---|---|---|---|
Tree parsing | High (entire document) | Fast random access | Small to medium documents |
Stream parsing | Low (constant) | Fast sequential processing | Large documents |
Pull parsing | Low to medium | Medium (event-driven) | Selective processing |
XPath queries | Medium | Variable by complexity | Complex document queries |