CrackedRuby - Custom Marshaling

Overview

Custom marshaling in Ruby allows objects to define their own serialization and deserialization behavior through the marshal_dump and marshal_load methods. Ruby's Marshal module handles the core serialization process, but objects can override the default behavior by implementing these methods to control exactly what data gets serialized and how objects get reconstructed.

The Marshal module serializes Ruby objects into a binary format that can be stored or transmitted, then reconstructed later. By default, Marshal serializes all instance variables, but custom marshaling provides fine-grained control over this process.

class BankAccount
  def initialize(number, balance)
    @account_number = number
    @balance = balance
    @created_at = Time.now
  end

  def marshal_dump
    [@account_number, @balance]
  end

  def marshal_load(data)
    @account_number, @balance = data
    @created_at = Time.now
  end
end

account = BankAccount.new("12345", 1000.0)
serialized = Marshal.dump(account)
restored = Marshal.load(serialized)

Custom marshaling becomes essential when objects contain non-serializable data like file handles, database connections, or complex nested structures that need special handling. Objects can also use marshaling to implement versioning strategies or optimize serialization performance.

Ruby calls marshal_dump during serialization and expects it to return a serializable object representing the essential state. During deserialization, Ruby creates an uninitialized instance and calls marshal_load with the dumped data, allowing the object to restore its state.

Basic Usage

Implementing custom marshaling requires defining both marshal_dump and marshal_load methods. The marshal_dump method runs during Marshal.dump and should return any serializable Ruby object. The marshal_load method receives this dumped data and reconstructs the object's state.

class Configuration
  def initialize(settings = {})
    @settings = settings
    @computed_cache = {}
    @file_handles = {}
  end

  def get(key)
    @computed_cache[key] ||= expensive_computation(@settings[key])
  end

  def marshal_dump
    @settings
  end

  def marshal_load(settings)
    @settings = settings
    @computed_cache = {}
    @file_handles = {}
  end

  private

  def expensive_computation(value)
    # Simulate expensive operation
    value.to_s.upcase
  end
end

The Marshal module automatically handles the object creation process. When loading, Ruby allocates the object without calling initialize, then immediately calls marshal_load with the dumped data. This means marshal_load must fully initialize the object state.
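The allocation behavior can be observed directly. The following is a minimal sketch, assuming a hypothetical Session class that is not part of the examples above; it records whether initialize ran so the difference between the original and the restored object is visible:

```ruby
# Hypothetical class for illustration only.
class Session
  attr_reader :user, :init_calls

  def initialize(user)
    @user = user
    @init_calls = 1 # set only when initialize actually runs
  end

  def marshal_dump
    @user
  end

  def marshal_load(user)
    # initialize is skipped on load, so every ivar must be set here
    @user = user
    @init_calls = 0
  end
end

original = Session.new("ada")
restored = Marshal.load(Marshal.dump(original))

puts original.init_calls # => 1
puts restored.init_calls # => 0
puts restored.user       # => ada
```

Because the restored object never passes through initialize, any default that initialize would have supplied has to be repeated in marshal_load.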

For objects that need to serialize complex nested data, marshal_dump can return arrays, hashes, or any combination of serializable objects:

class DocumentStore
  def initialize
    @documents = {}
    @metadata = {}
    @indexes = {}
  end

  def add_document(id, content, tags = [])
    @documents[id] = content
    @metadata[id] = { tags: tags, created: Time.now }
    rebuild_indexes
  end

  def marshal_dump
    {
      documents: @documents,
      metadata: @metadata.transform_values { |meta| meta.dup },
      version: "1.0"
    }
  end

  def marshal_load(data)
    @documents = data[:documents] || {}
    @metadata = data[:metadata] || {}
    @indexes = {}
    
    # Handle version migration
    if data[:version] != "1.0"
      migrate_from_version(data[:version])
    end
    
    rebuild_indexes
  end

  private

  def rebuild_indexes
    @indexes = @metadata.each_with_object({}) do |(id, meta), idx|
      meta[:tags].each { |tag| (idx[tag] ||= []) << id }
    end
  end

  def migrate_from_version(version)
    # Handle older data formats
  end
end

Objects can also serialize references to other marshalable objects. Ruby handles object references automatically, maintaining identity relationships during the marshal/unmarshal cycle:

class Node
  attr_accessor :value, :children, :parent

  def initialize(value)
    @value = value
    @children = []
    @parent = nil
  end

  def add_child(child)
    child.parent = self
    @children << child
  end

  def marshal_dump
    [@value, @children]
  end

  def marshal_load(data)
    @value, @children = data
    @children.each { |child| child.parent = self }
  end
end

root = Node.new("root")
child1 = Node.new("child1")
child2 = Node.new("child2")
root.add_child(child1)
root.add_child(child2)

# Parent-child relationships preserved through marshaling
restored_tree = Marshal.load(Marshal.dump(root))
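Identity preservation applies to any object that is reachable more than once from the dumped data. The following is a small sketch with hypothetical Tag and Article classes (names chosen for illustration); the same Tag is referenced twice and is still a single shared object after the round trip:

```ruby
# Hypothetical classes for illustration only.
Tag = Struct.new(:name)

class Article
  attr_reader :primary_tag, :tags

  def initialize(tag)
    @primary_tag = tag
    @tags = [tag] # the same Tag object is referenced twice
  end

  def marshal_dump
    [@primary_tag, @tags]
  end

  def marshal_load(data)
    @primary_tag, @tags = data
  end
end

article = Article.new(Tag.new("ruby"))
copy = Marshal.load(Marshal.dump(article))

puts copy.primary_tag.equal?(copy.tags.first)     # => true, sharing survives
puts copy.primary_tag.equal?(article.primary_tag) # => false, it is a new object
```

Sharing is preserved only within a single Marshal.dump call; two separate dumps of the same object produce two unrelated copies.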

Performance & Memory

Custom marshaling provides significant opportunities for performance optimization, particularly when dealing with large objects or objects containing computed data. The choice of what to serialize directly impacts both marshaling speed and the size of the serialized output.

Excluding computed or cached data from marshaling reduces both serialization time and memory usage:

class DataProcessor
  def initialize(raw_data)
    @raw_data = raw_data
    @processed_cache = nil
    @statistics = nil
    @temp_files = []
  end

  def processed_data
    @processed_cache ||= expensive_processing(@raw_data)
  end

  def statistics
    @statistics ||= calculate_statistics(processed_data)
  end

  def marshal_dump
    # Only serialize the essential raw data
    # Computed caches will be rebuilt on demand
    @raw_data
  end

  def marshal_load(raw_data)
    @raw_data = raw_data
    @processed_cache = nil
    @statistics = nil
    @temp_files = []
  end

  private

  def expensive_processing(data)
    # Simulate expensive computation
    data.map(&:to_f).sort.reverse
  end

  def calculate_statistics(data)
    {
      mean: data.sum / data.size,
      max: data.max,
      min: data.min
    }
  end
end
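The size effect of leaving out derived data can be checked with bytesize. This is a rough sketch using two hypothetical classes, FullDump and LeanDump, that hold identical data; the actual savings depend entirely on how large the cached data is relative to the raw data:

```ruby
# Hypothetical classes for illustration; both hold the same data and cache.
class FullDump
  def initialize(raw)
    @raw = raw
    @cache = raw.map { |n| n.to_s * 10 } # stands in for derived data
  end
end

class LeanDump < FullDump
  def marshal_dump
    @raw # skip @cache; it can be recomputed
  end

  def marshal_load(raw)
    @raw = raw
    @cache = nil
  end
end

raw = (1..1_000).to_a
full_size = Marshal.dump(FullDump.new(raw)).bytesize
lean_size = Marshal.dump(LeanDump.new(raw)).bytesize

puts "default: #{full_size} bytes, custom: #{lean_size} bytes"
```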

For objects with large amounts of data, custom marshaling can implement compression or alternative encoding strategies:

require 'zlib'

class CompressedData
  def initialize(data)
    @raw_data = data
  end

  def marshal_dump
    # Marshal the payload first so any marshalable object round-trips,
    # then compress the resulting bytes
    serialized = Marshal.dump(@raw_data)
    compressed = Zlib::Deflate.deflate(serialized)
    {
      compressed_data: compressed,
      original_size: serialized.bytesize,
      compression_ratio: compressed.bytesize.to_f / serialized.bytesize
    }
  end

  def marshal_load(dump_data)
    # Decompress, then unmarshal the payload (no eval involved)
    serialized = Zlib::Inflate.inflate(dump_data[:compressed_data])
    @raw_data = Marshal.load(serialized)
    @original_size = dump_data[:original_size]
    @compression_ratio = dump_data[:compression_ratio]
  end

  def compression_info
    "Original: #{@original_size} bytes, Ratio: #{@compression_ratio}"
  end
end

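Memory usage optimization becomes critical when marshaling large object graphs. Custom marshaling can serialize only a lightweight index of a large structure and load the underlying pieces lazily. The example below assumes the partition contents are persisted in separate storage, so marshal_dump records only which partitions exist: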

class PartitionedDataset
  def initialize
    @partitions = {}
    @partition_size = 1000
    @current_partition = 0
  end

  def add_record(record)
    # Roll over to a new partition once the current one is full
    @current_partition += 1 if current_partition_size >= @partition_size
    (@partitions[@current_partition] ||= []) << record
  end

  def marshal_dump
    # Serialize partition structure, not actual data
    # (assumes the partition contents are persisted elsewhere)
    {
      partition_keys: @partitions.keys,
      partition_size: @partition_size,
      total_records: total_record_count
    }
  end

  def marshal_load(data)
    @partitions = {}
    @partition_size = data[:partition_size]
    # New records go into a fresh partition after the restored ones
    @current_partition = (data[:partition_keys].max || -1) + 1

    # Mark partitions as available for lazy loading
    data[:partition_keys].each do |key|
      @partitions[key] = :lazy_load_pending
    end
  end

  def get_partition(key)
    return @partitions[key] unless @partitions[key] == :lazy_load_pending
    
    # Implement lazy loading of partition data
    @partitions[key] = load_partition_from_storage(key)
  end

  private

  def current_partition_size
    @partitions[@current_partition]&.size || 0
  end

  def total_record_count
    # Count only loaded partitions; lazy placeholders are Symbols, not Arrays
    @partitions.values.grep(Array).sum(&:size)
  end

  def load_partition_from_storage(key)
    # Implementation would load from external storage
    []
  end
end

In deeply nested structures, much of the custom marshaling overhead can come from the per-object marshal_dump and marshal_load calls rather than from copying data. Such objects can benefit from a flattened serialization format:

class OptimizedTree
  def initialize(root_value = nil)
    @root = root_value ? TreeNode.new(root_value) : nil
  end

  def marshal_dump
    return nil unless @root
    
    # Flatten tree to array format for efficient serialization
    nodes = []
    stack = [[@root, nil]] # [node, parent_index]
    
    while stack.any?
      node, parent_idx = stack.pop
      current_idx = nodes.size
      nodes << [node.value, parent_idx]
      
      node.children.reverse_each do |child|
        stack << [child, current_idx]
      end
    end
    
    nodes
  end

  def marshal_load(nodes)
    @root = nil # initialize is skipped on load, so set the ivar explicitly
    return unless nodes
    
    # Rebuild tree from flattened format
    node_objects = nodes.map { |value, _| TreeNode.new(value) }
    
    nodes.each_with_index do |(_, parent_idx), idx|
      if parent_idx
        node_objects[parent_idx].add_child(node_objects[idx])
      else
        @root = node_objects[idx]
      end
    end
  end

  class TreeNode
    attr_reader :value, :children

    def initialize(value)
      @value = value
      @children = []
    end

    def add_child(child)
      @children << child
    end
  end
end

Error Handling & Debugging

Custom marshaling introduces several categories of errors that require specific handling strategies. The most common issues involve unmarshallable objects, version compatibility problems, and corruption of the marshaled data stream.

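Ruby raises TypeError when the value returned by marshal_dump contains objects that cannot be marshaled. The error comes from Marshal.dump itself, after marshal_dump has returned, so it cannot be rescued inside marshal_dump. It commonly occurs with Proc objects, objects that have singleton methods, and IO objects such as file handles. The fix is to leave them out of the dump and rebuild them on load: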

require 'json'    # for to_json
require 'ostruct' # for OpenStruct

class ServiceConnection
  def initialize(api_key)
    @api_key = api_key
    @connection = build_connection
    @request_proc = proc { |data| format_request(data) }
  end

  def marshal_dump
    # Marshal.dump raises TypeError only after marshal_dump returns, so a
    # rescue here would never fire. Exclude the connection and proc instead.
    [@api_key]
  end

  def marshal_load(data)
    @api_key, = data

    # Rebuild non-serializable resources
    @connection = build_connection
    @request_proc = proc { |payload| format_request(payload) }

    # Validate restored state
    validate_connection_state
  end

  private

  def build_connection
    # Create HTTP connection or similar
    OpenStruct.new(api_key: @api_key, status: :connected)
  end

  def format_request(data)
    data.to_json
  end

  def validate_connection_state
    unless @connection && @api_key
      raise StandardError, "Invalid connection state after unmarshaling"
    end
  end
end
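The point at which the TypeError surfaces is easy to verify. The following is a minimal sketch with a hypothetical Job class: calling marshal_dump directly succeeds even though it returns a Proc, and the error appears only when Marshal.dump walks the returned value:

```ruby
# Hypothetical class for illustration only.
class Job
  def initialize
    @name = "resize"
    @callback = proc { |x| x * 2 }
  end

  def marshal_dump
    [@name, @callback] # returning a Proc is legal Ruby; no error yet
  end

  def marshal_load(data)
    @name, @callback = data
  end
end

job = Job.new
returned = job.marshal_dump # succeeds: building the array never raises

error = begin
  Marshal.dump(job) # Marshal walks the returned array and hits the Proc
  nil
rescue TypeError => e
  e
end

puts error.class
puts error.message
```

A rescue therefore belongs around the Marshal.dump call site, while marshal_dump itself should simply avoid returning such objects.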

Version compatibility errors occur when the structure of marshaled data changes between application versions. Robust marshaling implementations include version handling and migration logic:

class VersionedModel
  CURRENT_VERSION = 3

  def initialize(name, data = {})
    @name = name
    @data = data
    @metadata = { created_at: Time.now }
    @version = CURRENT_VERSION
  end

  def marshal_dump
    {
      version: CURRENT_VERSION,
      name: @name,
      data: @data,
      metadata: @metadata
    }
  end

  def marshal_load(dumped_data)
    dumped_version = dumped_data[:version] || 1
    
    case dumped_version
    when 1
      migrate_from_v1(dumped_data)
    when 2
      migrate_from_v2(dumped_data)
    when CURRENT_VERSION
      @name = dumped_data[:name]
      @data = dumped_data[:data]
      @metadata = dumped_data[:metadata]
    else
      handle_unknown_version(dumped_version, dumped_data)
    end
    
    @version = CURRENT_VERSION
  end

  private

  def migrate_from_v1(data)
    # v1 stored everything in a single hash
    @name = data[:name]
    @data = data.reject { |k, _| [:name, :version].include?(k) }
    @metadata = { created_at: Time.now, migrated_from: 1 }
  end

  def migrate_from_v2(data)
    # v2 separated data but had different metadata structure
    @name = data[:name]
    @data = data[:data]
    @metadata = {
      created_at: data[:metadata][:timestamp] || Time.now,
      migrated_from: 2
    }
  end

  def handle_unknown_version(version, data)
    raise StandardError, "Cannot unmarshal version #{version} (current: #{CURRENT_VERSION})"
  end
end
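A migration path can be exercised in a single script by dumping with an old class definition and loading with a newer one. The sketch below assumes a hypothetical Profile class and simulates two releases by reopening the class; the field names and version numbers are illustrative:

```ruby
# Hypothetical Profile class. "Release 1" writes a hash without a :version key.
class Profile
  def initialize(name)
    @name = name
  end

  def marshal_dump
    { name: @name }
  end

  def marshal_load(data)
    @name = data[:name]
  end
end

old_bytes = Marshal.dump(Profile.new("ada")) # bytes written by release 1

# "Release 2": the class is reopened to simulate a newer deployment that
# adds @tags and stamps every dump with a version number.
class Profile
  CURRENT_VERSION = 2
  attr_reader :name, :tags

  def marshal_dump
    { version: CURRENT_VERSION, name: @name, tags: @tags }
  end

  def marshal_load(data)
    case data[:version] || 1 # a missing key means release 1
    when 1
      @name = data[:name]
      @tags = [] # default for a field release 1 never stored
    when CURRENT_VERSION
      @name = data[:name]
      @tags = data[:tags]
    else
      raise ArgumentError, "unsupported dump version #{data[:version]}"
    end
  end
end

restored = Marshal.load(old_bytes)
puts restored.name         # => ada
puts restored.tags.inspect # => []
```

Keeping a few old byte strings like old_bytes as test fixtures is one way to catch a migration branch that stops working.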

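Circular reference handling requires careful consideration in custom marshaling. Marshal tracks object identity automatically, but a marshal_dump that rewrites references, for example into names or IDs, must be paired with a load step that restores them. One approach stores connections by name and resolves them in a second pass: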

class CircularSafeNode
  attr_accessor :name, :connections

  def initialize(name)
    @name = name
    @connections = []
  end

  def connect_to(other_node)
    @connections << other_node unless @connections.include?(other_node)
    other_node.connections << self unless other_node.connections.include?(self)
  end

  def marshal_dump
    # Store connections by name to avoid circular serialization
    connection_names = @connections.map(&:name)
    [@name, connection_names]
  end

  def marshal_load(data)
    @name, @connection_names = data
    @connections = []
    
    # Connections will be rebuilt after all nodes are loaded
    # This requires coordination from the containing object
  end

  def resolve_connections(node_registry)
    @connection_names.each do |name|
      connected_node = node_registry[name]
      @connections << connected_node if connected_node
    end
    @connection_names = nil
  end
end

class NodeGraph
  def initialize
    @nodes = {}
  end

  def add_node(node)
    @nodes[node.name] = node
  end

  def marshal_dump
    @nodes.values.map { |node| Marshal.dump(node) }
  end

  def marshal_load(serialized_nodes)
    @nodes = {}
    
    # First pass: recreate all nodes
    serialized_nodes.each do |serialized_node|
      node = Marshal.load(serialized_node)
      @nodes[node.name] = node
    end
    
    # Second pass: resolve connections
    @nodes.each_value { |node| node.resolve_connections(@nodes) }
  end
end
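For comparison, default marshaling already copes with cycles, because Marshal records each object it writes and emits a back-reference on the second visit. The following is a minimal sketch with a hypothetical Peer class that defines no custom marshaling methods:

```ruby
# Hypothetical class with no marshal_dump/marshal_load: default marshaling.
class Peer
  attr_accessor :name, :partner

  def initialize(name)
    @name = name
  end
end

a = Peer.new("a")
b = Peer.new("b")
a.partner = b
b.partner = a # a -> b -> a is a cycle

copy = Marshal.load(Marshal.dump(a))

puts copy.partner.name                 # => b
puts copy.partner.partner.equal?(copy) # => true, the cycle is intact
```

The name-based approach above is mainly useful when nodes are dumped separately, since identity tracking does not span independent Marshal.dump calls.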

Data corruption detection becomes essential for production systems. Custom marshaling can implement checksums and validation to detect corrupted serialized data:

require 'digest'

class ChecksummedData
  def initialize(payload)
    @payload = payload
    @checksum = nil
  end

  def marshal_dump
    serialized_payload = Marshal.dump(@payload)
    checksum = Digest::SHA256.hexdigest(serialized_payload)
    
    {
      payload: serialized_payload,
      checksum: checksum,
      timestamp: Time.now.to_f
    }
  end

  def marshal_load(data)
    serialized_payload = data[:payload]
    expected_checksum = data[:checksum]
    timestamp = data[:timestamp]
    
    # Verify data integrity
    actual_checksum = Digest::SHA256.hexdigest(serialized_payload)
    
    if actual_checksum != expected_checksum
      raise DataCorruptionError, 
            "Checksum mismatch: expected #{expected_checksum}, got #{actual_checksum}"
    end
    
    # Check for stale data
    if Time.now.to_f - timestamp > MAX_AGE_SECONDS
      # Kernel#warn writes to stderr; swap in the application's logger if it has one
      warn("Loading stale marshaled data from #{Time.at(timestamp)}")
    end
    
    @payload = Marshal.load(serialized_payload)
    @checksum = expected_checksum
  end

  class DataCorruptionError < StandardError; end
  
  MAX_AGE_SECONDS = 86400 # 24 hours
end

Common Pitfalls

Custom marshaling contains numerous subtle pitfalls that can lead to data loss, memory leaks, or application crashes. The most dangerous pitfall involves forgetting that marshal_load runs on an uninitialized object, bypassing the normal initialize method.

class DatabaseModel
  def initialize(id, connection_pool)
    @id = id
    @connection_pool = connection_pool
    @attributes = load_attributes_from_db
    @callbacks = setup_callbacks
  end

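  # marshal_dump itself is fine: it runs on a fully initialized object
  def marshal_dump
    [@id, @attributes]
  end

  # INCORRECT: Doesn't fully restore object state
  def marshal_load(data)
    @id, @attributes = data
    # Missing: @connection_pool and @callbacks are nil!
  end

  # CORRECT: Fully initialize object state.
  # Shown under a different name for comparison only; Marshal calls
  # marshal_load, so this body must replace the method above.
  # ConnectionPool.current is a placeholder for however the app finds its pool.
  def marshal_load_correct(data)
    @id, @attributes = data
    @connection_pool = ConnectionPool.current
    @callbacks = setup_callbacks
    validate_loaded_state
  end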

  private

  def load_attributes_from_db
    { name: "Example", status: "active" }
  end

  def setup_callbacks
    { before_save: proc { puts "Saving..." } }
  end

  def validate_loaded_state
    raise "Invalid state" unless @connection_pool && @callbacks
  end
end
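One way to keep initialize and marshal_load from drifting apart is to route both through a shared private setup method. This is a sketch of that pattern with a hypothetical Report class; the lambda stands in for any resource that cannot be serialized:

```ruby
# Hypothetical class; the lambda stands in for any non-serializable resource.
class Report
  attr_reader :title

  def initialize(title)
    @title = title
    setup_runtime_state
  end

  def marshal_dump
    @title
  end

  def marshal_load(title)
    @title = title
    setup_runtime_state # same path as initialize, so the two stay in sync
  end

  def render
    @formatter.call(@title)
  end

  private

  def setup_runtime_state
    @formatter = ->(text) { "== #{text} ==" }
  end
end

restored = Marshal.load(Marshal.dump(Report.new("Q3")))
puts restored.render # => == Q3 ==
```

A new piece of runtime state then only needs to be added in one place.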

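Classes that include Singleton are a subtle case. The Singleton module already defines _dump and _load so that Marshal.load returns the existing instance, but marshal_dump and marshal_load take precedence when they are defined, and Marshal.load then allocates a second instance of what should be a singleton: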

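require 'singleton'

class ConfigurationSingleton
  include Singleton

  def initialize
    @settings = load_from_file
    @observers = []
  end

  # INCORRECT: marshal_dump/marshal_load take precedence over the _dump/_load
  # hooks that Singleton provides, so Marshal.load allocates a second instance
  def marshal_dump
    [@settings, @observers]
  end

  def marshal_load(data)
    @settings, @observers = data
  end

  # CORRECT: Preserve singleton identity with _dump/_load.
  # Remove the marshal_dump/marshal_load pair above for these hooks to be used.
  def _dump(_depth)
    Marshal.dump(@settings)
  end

  def self._load(serialized)
    # Update the existing singleton rather than creating a new object
    instance.update_settings(Marshal.load(serialized))
    instance
  end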

  def update_settings(new_settings)
    @settings = new_settings
    notify_observers
  end

  private

  def load_from_file
    { debug: false, timeout: 30 }
  end

  def notify_observers
    @observers.each(&:call)
  end
end
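The difference between the two hook pairs can be checked with equal?. The sketch below uses two hypothetical classes: PlainRegistry relies on the _dump/_load that Singleton provides, while CustomRegistry adds marshal_dump and marshal_load:

```ruby
require 'singleton'

# Hypothetical classes for illustration only.
class PlainRegistry
  include Singleton # relies on Singleton's own _dump/_load
end

class CustomRegistry
  include Singleton

  def marshal_dump
    []
  end

  def marshal_load(_data); end
end

plain_copy = Marshal.load(Marshal.dump(PlainRegistry.instance))
custom_copy = Marshal.load(Marshal.dump(CustomRegistry.instance))

puts plain_copy.equal?(PlainRegistry.instance)   # same object
puts custom_copy.equal?(CustomRegistry.instance) # a second instance
```

Note that Singleton's default _dump stores no state at all, so a singleton whose settings must travel with the dump needs its own _dump and _load.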

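Class and module objects need care. Marshal stores a named class or module by name, so the same constant must be defined when the data is loaded, and anonymous classes and modules cannot be dumped at all. A container can record class names explicitly to report missing classes more clearly: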

class PolymorphicContainer
  def initialize
    @items = []
  end

  def add_item(item)
    @items << {
      object: item,
      class_name: item.class.name,
      modules: item.class.included_modules.map(&:name)
    }
  end

  # FRAGILE: relies on Marshal to resolve every class; a class missing at
  # load time surfaces as a generic ArgumentError
  def marshal_dump_incorrect
    @items
  end

  # MORE ROBUST: Store class names as strings alongside each object's own dump
  # (assumes every item implements marshal_dump and marshal_load)
  def marshal_dump
    @items.map do |item_data|
      {
        object_data: item_data[:object].marshal_dump,
        class_name: item_data[:class_name],
        modules: item_data[:modules]
      }
    end
  end

  def marshal_load(data)
    @items = data.map do |item_data|
      klass = resolve_class(item_data[:class_name])
      object = klass.allocate
      object.marshal_load(item_data[:object_data])

      {
        object: object,
        class_name: item_data[:class_name],
        modules: item_data[:modules]
      }
    end
  end

  # Rescue here, where the class name is in scope, rather than at method level
  def resolve_class(class_name)
    Object.const_get(class_name)
  rescue NameError => e
    raise MarshalingError, "Cannot restore class #{class_name}: #{e.message}"
  end

  class MarshalingError < StandardError; end
end
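Because const_get will resolve any constant name it is given, class names read from serialized data are usually checked against an allowlist first. The following is a minimal sketch; ALLOWED_CLASSES, DisallowedClassError, and resolve_class are hypothetical names chosen for illustration:

```ruby
# Hypothetical allowlist and helper; adapt the names to the application.
Point = Struct.new(:x, :y)
ALLOWED_CLASSES = %w[Point].freeze

class DisallowedClassError < StandardError; end

def resolve_class(name)
  unless ALLOWED_CLASSES.include?(name)
    raise DisallowedClassError, "class #{name.inspect} is not on the allowlist"
  end
  Object.const_get(name)
end

klass = resolve_class("Point")
puts klass.new(1, 2).x # => 1

begin
  resolve_class("Kernel")
rescue DisallowedClassError => e
  puts e.message
end
```

Checking the string before calling const_get means an unexpected name never triggers constant lookup or autoloading.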

File handles and IO objects cannot be marshaled, but developers often forget to exclude them from custom marshaling logic:

require 'time' # for Time#iso8601

class LogProcessor
  def initialize(log_file_path)
    @log_file_path = log_file_path
    @file_handle = nil
    @buffer = []
    @stats = { lines_processed: 0 }
  end

  def process_line(line)
    ensure_file_open
    @file_handle.puts(processed_line(line))
    @stats[:lines_processed] += 1
  end

  # INCORRECT: Attempts to serialize file handle
  def marshal_dump_incorrect
    [@log_file_path, @file_handle, @buffer, @stats]
  end

  # CORRECT: Excludes non-serializable file handle
  def marshal_dump
    # Close file handle before serialization
    close_file if @file_handle
    [@log_file_path, @buffer, @stats]
  end

  def marshal_load(data)
    @log_file_path, @buffer, @stats = data
    @file_handle = nil # Will be reopened when needed
  end

  private

  def ensure_file_open
    return if @file_handle && !@file_handle.closed?
    @file_handle = File.open(@log_file_path, 'a')
  end

  def close_file
    @file_handle.close if @file_handle && !@file_handle.closed?
    @file_handle = nil
  end

  def processed_line(line)
    "#{Time.now.iso8601}: #{line.strip}"
  end
end

Thread-local variables and thread-specific state create marshaling problems because the unmarshaled object usually runs in a different thread, or a different process, from the one that dumped it:

class ThreadAwareProcessor
  def initialize
    @worker_id = Thread.current.object_id
    @thread_local_cache = {}
    @shared_data = {}
  end

  def process(data)
    current_worker = Thread.current.object_id
    
    if current_worker != @worker_id
      handle_thread_migration
    end
    
    get_thread_cache[data.hash] ||= expensive_computation(data)
  end

  # INCORRECT: Thread-specific data doesn't transfer
  def marshal_dump_incorrect
    [@worker_id, @thread_local_cache, @shared_data]
  end

  # CORRECT: Only serialize thread-safe data
  def marshal_dump
    [@shared_data]
  end

  def marshal_load(data)
    @shared_data, = data
    @worker_id = Thread.current.object_id
    @thread_local_cache = {}
    initialize_thread_local_state
  end

  private

  def handle_thread_migration
    # Clear thread-specific state when moving to different thread
    @thread_local_cache.clear
    @worker_id = Thread.current.object_id
    initialize_thread_local_state
  end

  def get_thread_cache
    Thread.current[:processor_cache] ||= {}
  end

  def initialize_thread_local_state
    Thread.current[:processor_cache] = {}
  end

  def expensive_computation(data)
    # Simulate expensive work
    data.to_s.chars.sum(&:ord)
  end
end

Reference

Core Marshaling Methods

Method	Parameters	Returns	Description
`#marshal_dump`	none	`Object`	Returns serializable representation of object state
`#marshal_load(data)`	`data` (Object)	ignored	Restores object state from marshaled data; `Marshal.load` returns the object regardless of this method's return value
`Marshal.dump(obj)`	`obj` (Object)	`String`	Serializes object to binary string
`Marshal.load(data)`	`data` (String)	`Object`	Deserializes object from binary string

Marshal Module Constants

Constant	Value	Description
`Marshal::MAJOR_VERSION`	`4`	Major version of marshal format
`Marshal::MINOR_VERSION`	`8`	Minor version of marshal format
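The same two numbers appear as the first two bytes of every dump, which gives a quick way to inspect a stored payload. A short sketch:

```ruby
# The first two bytes of any dump are the format's major and minor version.
bytes = Marshal.dump(nil).bytes
major, minor = bytes.first(2)

puts "stream header: #{major}.#{minor}"
puts "runtime:       #{Marshal::MAJOR_VERSION}.#{Marshal::MINOR_VERSION}"
```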

Common Marshal Exceptions

Exception	Trigger	Resolution
`TypeError`	Unmarshallable object in dump data (Proc, IO, object with singleton methods)	Implement custom marshaling or exclude object
`ArgumentError`	Truncated marshal data, or a class that is undefined at load time	Validate data integrity, ensure class availability, implement version handling
`NameError`	`Object.const_get` in custom `marshal_load` code cannot find a class	Validate class names, handle missing classes
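Two load-time failures can be reproduced in a few lines. This sketch uses a hypothetical Ephemeral class that is removed after dumping to imitate a process where the class is not defined, and then truncates the same byte string:

```ruby
# Hypothetical class used only to produce a dump whose class then disappears.
class Ephemeral; end

bytes = Marshal.dump(Ephemeral.new)
Object.send(:remove_const, :Ephemeral) # simulate a process without the class

missing_class = begin
  Marshal.load(bytes)
rescue ArgumentError => e
  e
end

truncated = begin
  Marshal.load(bytes[0, 3]) # cut the stream short
rescue ArgumentError => e
  e
end

puts missing_class.message
puts truncated.message
```

Both cases raise ArgumentError, so code that needs to tell them apart has to inspect the message or validate the data before loading.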

Object State During Marshaling

Phase	Object State	Available Methods	Notes
`marshal_dump` call	Fully initialized	All instance methods	Object in normal state
`marshal_load` call	Allocated but uninitialized	All instance methods, but no instance variables are set yet	`initialize` was not called
After `marshal_load`	Restored state	Depends on `marshal_load` implementation	Must manually initialize all required state

Marshaling Compatibility Matrix

Ruby Object Type	Default Marshal	Custom Marshal Required	Notes
Basic objects (String, Numeric, Array, Hash)	✓	✗	Automatically handled
Custom classes	✓	Optional	All instance variables serialized
Objects with Proc/lambda	✗	✓	Procs cannot be marshaled
File/IO objects	✗	✓	File handles not transferable
Singleton objects	⚠️	✓	`Singleton` supplies `_dump`/`_load` that preserve identity; `marshal_dump`/`marshal_load` bypass them
Thread objects	✗	✓	Threads not transferable
Class/Module objects	⚠️	✓	Named classes and modules are dumped by name; anonymous ones raise `TypeError`

Performance Considerations

Scenario	Optimization Strategy	Impact (rough, workload-dependent estimate; measure before relying on it)
Large object graphs	Exclude computed/cached data	50-90% size reduction
Deep nesting	Flatten to array structure	30-70% speed improvement
Frequent marshaling	Cache serialized forms	80-95% repeated marshal time savings
Memory constraints	Implement lazy loading	60-90% memory usage reduction

Version Migration Patterns

# Standard version handling template
def marshal_load(data)
  version = data[:version] || 1
  
  case version
  when 1
    migrate_from_v1(data)
  when CURRENT_VERSION
    restore_current_version(data)
  else
    handle_unsupported_version(version, data)
  end
end

Debugging Marshal Issues

Problem	Diagnostic Method	Solution Pattern
TypeError during dump	Inspect `marshal_dump` return value	Filter unmarshallable objects
Missing state after load	Compare pre/post marshal object state	Initialize all required instance variables
Performance issues	Profile marshal size vs. time	Optimize serialized data structure
Version conflicts	Log version information during load	Implement migration methods

Security Considerations

Risk	Mitigation	Implementation
Code injection via `eval`	Validate unmarshaled data	Use safe parsing methods
Resource exhaustion	Limit marshaled data size	Implement size checks
Class pollution	Whitelist allowed classes	Validate class names before `const_get`
Stale data attacks	Timestamp marshaled data	Check data age during unmarshal