CrackedRuby

A guide to external sorting algorithms for handling datasets larger than available memory, covering multi-way merge techniques, I/O optimization, and file-based sorting implementations.

Overview

External sorting addresses the problem of sorting datasets too large to fit in main memory. While internal sorting algorithms like quicksort or mergesort operate entirely in RAM, external sorting algorithms partition data across secondary storage devices and perform sorting through a series of I/O operations.

The fundamental constraint driving external sorting is the memory hierarchy. When a dataset exceeds available RAM, the operating system begins paging memory to disk, causing severe performance degradation. External sorting algorithms accept this constraint and design around it, treating disk I/O as the primary cost metric rather than CPU comparisons.

External sorting emerged from early computing systems where memory was severely limited. Magnetic tape systems required sequential access patterns, leading to algorithms like polyphase merge sort. Modern implementations work with disk files or SSDs but retain the core principle: minimize the number of disk passes while staying within memory constraints.

# Dataset sizes requiring external sorting
internal_sort_limit = 8 * 1024 * 1024 * 1024  # 8GB available RAM
dataset_size = 100 * 1024 * 1024 * 1024        # 100GB dataset

if dataset_size > internal_sort_limit
  # External sorting required
  external_sort(dataset)
else
  # Standard in-memory sort works
  data.sort
end

The canonical example involves sorting a large file of records. A file containing 100 GB of data cannot be loaded into a system with 8 GB of RAM. External sorting reads portions of the file, sorts them in memory, writes sorted "runs" to temporary files, then merges these runs to produce the final sorted output.

Key Principles

External sorting divides the sorting process into two phases: run formation and merging. Run formation reads chunks of data that fit in memory, sorts each chunk internally, and writes sorted runs to disk. The merge phase combines these sorted runs into progressively larger sorted sequences until producing a single sorted output.

Run Formation reads the input file in blocks sized to fit available memory. Each block undergoes internal sorting using any efficient algorithm. The sorted block, called a run, is written to a temporary file. If the input contains N records and memory holds M records, run formation produces ⌈N/M⌉ runs.

Multi-Way Merging combines multiple sorted runs simultaneously. A k-way merge reads from k input runs, maintaining k buffers in memory. The algorithm repeatedly selects the smallest element across all buffers, writes it to output, and refills the depleted buffer. This continues until all inputs are exhausted.

The merge proceeds in passes. Each pass reduces the number of runs by a factor of k. With b initial runs and k-way merging, the number of passes equals ⌈log_k(b)⌉. Fewer passes mean less I/O, but larger k requires more memory for input buffers.

I/O Complexity measures external sorting efficiency. Each record must be read once during run formation and written once to temporary storage. Each subsequent merge pass reads and writes every record. With p passes, total I/O equals 2N × (p + 1) operations, where N is the number of records.

Memory allocation splits between input buffers, an output buffer, and workspace for the merge heap. For k-way merging with total memory M, typical allocation assigns roughly M/(k+1) to each of the k input buffers and to the output buffer. The selection data structure, usually a min-heap holding one record per run, needs only a negligible slice of the budget.
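As a concrete illustration of the split (all figures here are assumptions, not recommendations):

```ruby
# Dividing a 256MB merge budget across a 16-way merge
memory_bytes = 256 * 1024 * 1024
k = 16

share = memory_bytes / (k + 1)  # k input buffers plus one output buffer
input_buffer_bytes = share
output_buffer_bytes = share
heap_entries = k                # selection heap holds one candidate per run

puts "#{input_buffer_bytes / (1024 * 1024)}MB per buffer, #{heap_entries}-entry heap"
```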

Replacement Selection improves upon simple run formation. Instead of sorting fixed-size blocks, replacement selection generates longer runs by maintaining a heap of size M. As records are written to output, incoming records enter the heap if they are not smaller than the last value output; smaller records are held back for the next run. This produces runs averaging 2M in length on random input, doubling run-formation effectiveness.

Buffer Management critically impacts performance. Reading single records from k files incurs massive overhead. Block-buffered I/O reads large chunks (typically 4KB-64KB) into memory buffers. When an input buffer empties, a single large read refills it. Similarly, output buffering accumulates records before writing complete blocks.

# I/O pass calculation
records = 1_000_000_000
memory_capacity = 10_000_000
runs = (records.to_f / memory_capacity).ceil  # 100 runs

merge_factor = 10
passes = Math.log(runs, merge_factor).ceil  # 2 passes

total_io = records * 2 * (passes + 1)  # Read/write for formation + 2 merge passes
# => 6 billion I/O operations

Implementation Approaches

Two-Way Merge Sort represents the simplest external sorting strategy. Run formation creates initial sorted runs. Each merge pass combines pairs of runs, halving the run count. The algorithm continues until a single run remains. Two-way merging requires only three file handles: two inputs and one output.

Two-way merge uses minimal memory, requiring only two input buffers and one output buffer. However, the number of passes equals ⌈log_2(b)⌉, where b is the initial run count. A thousand initial runs require ten merge passes, each reading and writing the entire dataset.

Multi-Way Merge Sort reduces passes by merging k runs simultaneously. With k-way merging, passes equal ⌈log_k(b)⌉. Choosing k=10 reduces a thousand runs to three passes. However, k cannot exceed the file descriptor limit, and each input requires a buffer, consuming memory proportional to k.

The merge selection mechanism maintains a min-heap containing the next record from each input. Each iteration extracts the minimum, writes it, refills from the corresponding input, and adjusts the heap. Heap operations cost O(log k) per record, making k-way merging slightly more CPU-intensive than two-way merging, but the I/O savings dominate.
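The Ruby implementations later in this guide keep pending records in sorted arrays for brevity; a binary min-heap gives the O(log k) behavior described here. A minimal sketch (the MinHeap class and its API are illustrative, not a standard library type):

```ruby
# Minimal binary min-heap ordered by the first element of each entry,
# suitable as the selection structure in a k-way merge
class MinHeap
  def initialize
    @items = []
  end

  def empty?
    @items.empty?
  end

  # O(log k): append, then sift up while smaller than the parent
  def push(entry)
    @items << entry
    i = @items.size - 1
    while i > 0
      parent = (i - 1) / 2
      break if @items[parent][0] <= @items[i][0]
      @items[i], @items[parent] = @items[parent], @items[i]
      i = parent
    end
  end

  # O(log k): remove the root, move the last leaf up, sift down
  def pop
    top = @items[0]
    last = @items.pop
    unless @items.empty?
      @items[0] = last
      i = 0
      loop do
        smallest = i
        [2 * i + 1, 2 * i + 2].each do |child|
          smallest = child if child < @items.size && @items[child][0] < @items[smallest][0]
        end
        break if smallest == i
        @items[i], @items[smallest] = @items[smallest], @items[i]
        i = smallest
      end
    end
    top
  end
end
```

Swapping the `heap.shift` / `bsearch_index` pattern in the merge loops for `push` and `pop` calls keeps per-record selection cost at O(log k) instead of O(k).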

Polyphase Merge optimizes tape-based systems where sequential access prohibits random seeking. Rather than balanced merging, polyphase distributes runs unevenly across tapes following a Fibonacci-like sequence. This eliminates copying overhead on sequential media. Modern disk-based systems rarely use polyphase merging due to random access capabilities.

Replacement Selection generates longer initial runs, reducing merge passes. The algorithm maintains a heap of size M. The smallest heap element is repeatedly written to the current run; each incoming record that is not smaller than the last value written joins the heap for the current run, while a smaller record is held back for the next run. Replacement selection produces runs averaging 2M records under random input.

Cascade Merge sits between balanced multi-way and polyphase merging. Within each pass it performs merges of decreasing order: first a k-way merge, then (k-1)-way, and so on. Like polyphase, it distributes runs unevenly across devices and was designed for tape-era systems; modern disk-based implementations rarely need it.

Selection criteria depend on data size, available memory, and I/O characteristics:

  • Datasets under 10× memory: Multi-way merge with high k
  • Limited memory: Two-way merge or lower k values
  • Random data: Replacement selection for longer runs
  • Nearly-sorted data: Verify if single merge pass suffices
  • SSD storage: Higher k values due to better random I/O
  • Rotating disk: Moderate k, optimize sequential access

Ruby Implementation

Ruby's standard library lacks built-in external sorting, requiring custom implementation. The following examples demonstrate file-based sorting with proper resource management and error handling.

Basic Two-Way External Sort:

require 'tempfile'

class ExternalSort
  def initialize(memory_limit: 10_000)
    @memory_limit = memory_limit
    @temp_files = []
  end

  def sort_file(input_path, output_path)
    create_runs(input_path)
    merge_runs(output_path)
  ensure
    cleanup_temp_files
  end

  private

  def create_runs(input_path)
    buffer = []
    
    File.open(input_path, 'r') do |file|
      file.each_line do |line|
        buffer << line
        
        if buffer.size >= @memory_limit
          write_sorted_run(buffer)
          buffer.clear
        end
      end
      
      write_sorted_run(buffer) unless buffer.empty?
    end
  end

  def write_sorted_run(buffer)
    temp_file = Tempfile.new('run')
    buffer.sort.each { |line| temp_file.puts(line) }
    temp_file.rewind
    @temp_files << temp_file
  end

  def merge_runs(output_path)
    while @temp_files.size > 1
      new_temps = []
      
      @temp_files.each_slice(2) do |file1, file2|
        if file2
          new_temps << merge_two_files(file1, file2)
        else
          new_temps << file1
        end
      end
      
      @temp_files = new_temps
    end
    
    File.open(output_path, 'w') do |output|
      @temp_files.first.each_line { |line| output.puts(line) }
    end
  end

  def merge_two_files(file1, file2)
    temp = Tempfile.new('merge')
    
    file1.rewind
    file2.rewind
    
    line1 = file1.gets
    line2 = file2.gets
    
    while line1 && line2
      if line1 <= line2
        temp.puts(line1)
        line1 = file1.gets
      else
        temp.puts(line2)
        line2 = file2.gets
      end
    end
    
    # Drain the remaining input, including the line still held in
    # line1 or line2 when the loop above exited
    while line1
      temp.puts(line1)
      line1 = file1.gets
    end
    while line2
      temp.puts(line2)
      line2 = file2.gets
    end
    
    # Merged inputs are no longer tracked in @temp_files; release them here
    file1.close
    file1.unlink
    file2.close
    file2.unlink
    temp.rewind
    temp
  end

  def cleanup_temp_files
    @temp_files.each do |file|
      file.close
      file.unlink
    end
  end
end

# Usage
sorter = ExternalSort.new(memory_limit: 100_000)
sorter.sort_file('large_input.txt', 'sorted_output.txt')

Multi-Way Merge with Heap:

require 'tempfile'
require 'fileutils'

class MultiWayMergeSort
  def initialize(memory_limit: 100_000, merge_factor: 10)
    @memory_limit = memory_limit
    @merge_factor = merge_factor
    @temp_files = []
  end

  def sort_file(input_path, output_path)
    create_runs(input_path)
    multiway_merge(output_path)
  ensure
    cleanup_temp_files
  end

  private

  def create_runs(input_path)
    buffer = []
    
    File.open(input_path, 'r') do |file|
      file.each_line do |line|
        buffer << line.chomp
        
        if buffer.size >= @memory_limit
          write_sorted_run(buffer)
          buffer.clear
        end
      end
      
      write_sorted_run(buffer) unless buffer.empty?
    end
  end

  def write_sorted_run(buffer)
    # Tempfile.create registers no GC finalizer, so the file persists
    # until cleanup_temp_files deletes it explicitly
    temp_file = Tempfile.create(['run', '.txt'])
    buffer.sort.each { |line| temp_file.puts(line) }
    temp_file.close
    @temp_files << temp_file.path
  end

  def multiway_merge(output_path)
    while @temp_files.size > 1
      pass_files = []
      
      @temp_files.each_slice(@merge_factor) do |file_group|
        pass_files << merge_k_files(file_group)
      end
      
      @temp_files.each { |f| File.delete(f) if File.exist?(f) }
      @temp_files = pass_files
    end
    
    FileUtils.mv(@temp_files.first, output_path)
  end

  def merge_k_files(file_paths)
    # Tempfile.create: no GC finalizer; the path is deleted explicitly later
    output_temp = Tempfile.create(['merge', '.txt'])
    handles = file_paths.map { |path| File.open(path, 'r') }
    
    # Min-heap: [value, file_index]
    heap = []
    
    # Initialize heap with first line from each file
    handles.each_with_index do |file, idx|
      if line = file.gets
        heap << [line.chomp, idx]
      end
    end
    
    heap.sort!
    
    # Merge using heap
    until heap.empty?
      value, idx = heap.shift
      output_temp.puts(value)
      
      # Read next line from same file
      if line = handles[idx].gets
        new_entry = [line.chomp, idx]
        
        # Insert into sorted position
        insert_pos = heap.bsearch_index { |v, _| v >= new_entry[0] } || heap.size
        heap.insert(insert_pos, new_entry)
      end
    end
    
    handles.each(&:close)
    output_temp.close
    output_temp.path
  end

  def cleanup_temp_files
    @temp_files.each { |f| File.delete(f) if File.exist?(f) }
  end
end

# Usage for large dataset
sorter = MultiWayMergeSort.new(memory_limit: 500_000, merge_factor: 8)
sorter.sort_file('huge_dataset.txt', 'sorted_result.txt')

Replacement Selection for Longer Runs:

require 'tempfile'
require 'fileutils'

class ReplacementSelectionSort
  def initialize(heap_size: 100_000)
    @heap_size = heap_size
    @temp_files = []
  end

  def sort_file(input_path, output_path)
    create_runs_replacement_selection(input_path)
    merge_all_runs(output_path)
  ensure
    cleanup_temp_files
  end

  private

  def create_runs_replacement_selection(input_path)
    File.open(input_path, 'r') do |file|
      generate_run_with_replacement(file)
    end
  end

  def generate_run_with_replacement(input_file)
    current_heap = []
    next_heap = []
    run_output = nil
    last_written = nil
    
    # Fill initial heap
    @heap_size.times do
      break unless line = input_file.gets
      current_heap << line.chomp
    end
    current_heap.sort!
    
    loop do
      if current_heap.empty?
        # Start new run
        break if next_heap.empty?
        
        run_output.close if run_output
        
        current_heap = next_heap.sort
        next_heap = []
        last_written = nil
        run_output = nil
      end
      
      # Initialize run output if needed; Tempfile.create registers no
      # GC finalizer, so storing only the path is safe here
      unless run_output
        run_output = Tempfile.create(['run', '.txt'])
        @temp_files << run_output.path
      end
      
      # Write smallest element that maintains sort order
      value = current_heap.shift
      run_output.puts(value)
      last_written = value
      
      # Read next input
      if line = input_file.gets
        line = line.chomp
        
        if line >= last_written
          # Can go in current run
          insert_sorted(current_heap, line)
        else
          # Save for next run
          insert_sorted(next_heap, line)
        end
      end
    end
    
    run_output.close if run_output
  end

  def insert_sorted(array, value)
    pos = array.bsearch_index { |v| v >= value } || array.size
    array.insert(pos, value)
  end

  def merge_all_runs(output_path)
    return if @temp_files.empty?
    
    if @temp_files.size == 1
      FileUtils.mv(@temp_files.first, output_path)
      return
    end
    
    # Simple two-way merge for demonstration
    while @temp_files.size > 1
      file1 = @temp_files.shift
      file2 = @temp_files.shift
      merged = merge_two_files(file1, file2)
      @temp_files << merged
      
      File.delete(file1)
      File.delete(file2)
    end
    
    FileUtils.mv(@temp_files.first, output_path)
  end

  def merge_two_files(path1, path2)
    # Tempfile.create: no finalizer; the path is deleted explicitly later
    output_temp = Tempfile.create(['merge', '.txt'])
    output_temp.close
    
    File.open(path1, 'r') do |f1|
      File.open(path2, 'r') do |f2|
        File.open(output_temp.path, 'w') do |out|
          line1 = f1.gets
          line2 = f2.gets
          
          while line1 && line2
            if line1 <= line2
              out.puts(line1)
              line1 = f1.gets
            else
              out.puts(line2)
              line2 = f2.gets
            end
          end
          
          # Drain the remaining input, including the held line1/line2
          while line1
            out.puts(line1)
            line1 = f1.gets
          end
          while line2
            out.puts(line2)
            line2 = f2.gets
          end
        end
      end
    end
    
    output_temp.path
  end

  def cleanup_temp_files
    @temp_files.each { |f| File.delete(f) if File.exist?(f) }
  end
end

# Generates approximately 2x heap_size run lengths on random data
sorter = ReplacementSelectionSort.new(heap_size: 50_000)
sorter.sort_file('random_data.txt', 'sorted_output.txt')

Performance Considerations

External sorting performance depends primarily on I/O characteristics rather than CPU speed. The number of disk passes determines total I/O volume, while buffer sizes affect I/O efficiency. Understanding these factors enables effective optimization.

I/O Pass Analysis calculates total data movement. With N records, run formation reads N records and writes N records (2N I/O operations). Each merge pass reads all N records and writes N records (2N operations per pass). Total I/O equals 2N × (passes + 1).

Reducing passes requires higher merge factors. With k-way merging and b initial runs, passes equal ⌈log_k(b)⌉. Doubling k from 5 to 10 cuts passes by approximately half. However, memory constraints limit k since each input stream requires a buffer.

# Comparing merge strategies
def calculate_io_operations(records, memory, merge_factor)
  runs = (records.to_f / memory).ceil
  passes = Math.log(runs, merge_factor).ceil
  
  io_operations = records * 2 * (passes + 1)
  
  {
    runs: runs,
    passes: passes,
    total_io: io_operations,
    io_per_record: io_operations / records.to_f
  }
end

# 1 billion records, 10 million record memory
two_way = calculate_io_operations(1_000_000_000, 10_000_000, 2)
# => {runs: 100, passes: 7, total_io: 16B, io_per_record: 16}

ten_way = calculate_io_operations(1_000_000_000, 10_000_000, 10)
# => {runs: 100, passes: 2, total_io: 6B, io_per_record: 6}

# 10-way merge achieves 2.67x fewer I/O operations

Buffer Size Optimization dramatically impacts throughput. Reading single records incurs OS call overhead for each read. Buffering accumulates multiple records per system call. Reading 64KB blocks instead of 100-byte records reduces system calls by roughly 650×.

Operating systems perform read-ahead caching when detecting sequential access patterns. Buffer sizes matching filesystem block boundaries (typically 4KB multiples) maximize cache effectiveness. Very large buffers (1MB+) provide diminishing returns as OS caching handles read-ahead.
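Rounding a requested buffer up to a block multiple is a one-liner; the 4096-byte block size here is a typical default, an assumption rather than a value queried from the filesystem:

```ruby
# Round a requested buffer size up to the next filesystem block multiple
def aligned_buffer_size(requested, block_size: 4096)
  ((requested + block_size - 1) / block_size) * block_size
end

aligned_buffer_size(100)     # smallest block-aligned size >= 100
aligned_buffer_size(65_000)  # rounds up to the next 4KB boundary
```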

Memory allocation for k-way merging splits total available memory M across k input buffers, one output buffer, and workspace for the merge heap. Typical allocation assigns M/(k+1) to each input buffer and output buffer, leaving k elements for the heap. Larger input buffers reduce I/O frequency but decrease k.

Replacement Selection Impact varies by input distribution. Random data produces runs averaging 2× heap size. Partially sorted data generates much longer runs. Data sorted in reverse produces minimum-length runs equal to heap size. Real-world data often contains partial ordering, making replacement selection effective.

Longer runs reduce merge passes only when the run count crosses a power of k. If replacement selection doubles run length, the initial run count halves. With a 10-way merge, dropping from 20 initial runs to 10 cuts the merge from two passes to one, halving merge I/O; dropping from 100 runs to 50 still requires two passes, so the benefit appears only at these thresholds.
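Because passes equal ⌈log_k(runs)⌉, the threshold effect is easy to verify directly:

```ruby
# Pass counts for a 10-way merge at various initial run counts
merge_factor = 10
[100, 50, 20, 10].each do |runs|
  passes = Math.log(runs, merge_factor).ceil
  puts "#{runs} runs -> #{passes} pass(es)"
end
# 100, 50, and 20 runs all need two passes; only at 10 runs
# or fewer does a single pass suffice
```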

Disk vs SSD Performance affects algorithm selection. Rotating disks favor sequential access with large block reads. Random access on spinning disks incurs seek time (5-10ms) plus rotational latency (4ms average). SSDs handle random access efficiently with microsecond latencies.

For rotating disks, lower k values with larger buffers optimize sequential streaming. For SSDs, higher k values reduce passes since random access cost is low. An SSD system might use k=20 with smaller buffers, while rotating disk prefers k=5 with maximum buffer sizes.
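A rough parameter chooser reflecting these tradeoffs; the thresholds and the one-megabyte minimum buffer are assumptions, not measured values:

```ruby
# Heuristic sketch: pick a merge factor from storage type and memory budget
def choose_merge_factor(storage:, memory_bytes:, min_buffer_bytes: 1024 * 1024)
  # Reserve one buffer share for output; the rest bound k
  max_k_by_memory = (memory_bytes / min_buffer_bytes) - 1
  preferred = storage == :ssd ? 20 : 5
  [[preferred, max_k_by_memory].min, 2].max
end
```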

CPU vs I/O Bound determines optimization focus. External sorting on modern systems is I/O bound, meaning CPU sits idle waiting for disk operations. Adding CPU-intensive operations (like complex comparisons) has minimal impact on total runtime. Conversely, optimizing I/O patterns yields significant improvements.

CPU costs do matter for comparison operations. String comparisons are slower than integer comparisons. If records contain complex data requiring expensive comparison, reducing total comparisons by generating longer runs provides benefit. However, this effect is secondary to I/O optimization.

Parallel I/O overlaps reading and writing. While merging runs, the algorithm reads from input buffers and writes to an output buffer. Double-buffering enables I/O and computation overlap: while processing one buffer, the OS asynchronously fills the next buffer. This prevents CPU idle time waiting for I/O.

Multi-threaded implementations pipeline run formation and merging. One thread generates runs while another begins merging completed runs. However, Ruby's Global Interpreter Lock limits true parallelism for CPU-bound operations. I/O-bound external sorting sees less GIL impact since threads often wait on I/O.
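The double-buffering idea can be sketched with a producer thread and a bounded queue; the chunk size and queue depth are illustrative choices, and splitting chunks into records is assumed to happen downstream:

```ruby
# Double-buffered reading: a producer thread prefetches the next chunk
# while the consumer processes the current one
def each_chunk_prefetched(path, chunk_size: 65_536)
  queue = SizedQueue.new(2)   # at most one chunk read ahead
  reader = Thread.new do
    File.open(path, 'rb') do |f|
      while (chunk = f.read(chunk_size))
        queue << chunk
      end
    end
    queue << :eof
  end

  while (chunk = queue.pop) != :eof
    yield chunk
  end
  reader.join
end
```

Because the producer blocks on I/O rather than computation, this pattern sidesteps most of the GIL concerns noted above.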

Practical Examples

Example 1: Sorting Large CSV Files

require 'csv'
require 'tempfile'

class CSVExternalSort
  def initialize(memory_limit: 1_000_000)
    @memory_limit = memory_limit
    @temp_files = []
  end

  def sort_csv(input_path, output_path, sort_column: 0)
    # Parse and create runs
    create_csv_runs(input_path, sort_column)
    
    # Merge runs maintaining CSV format
    merge_csv_runs(output_path, sort_column)
  ensure
    cleanup_temp_files
  end

  private

  def create_csv_runs(input_path, sort_column)
    require 'csv'
    buffer = []
    
    CSV.foreach(input_path, headers: true) do |row|
      buffer << row
      
      if buffer.size >= @memory_limit
        write_csv_run(buffer, sort_column)
        buffer.clear
      end
    end
    
    write_csv_run(buffer, sort_column) unless buffer.empty?
  end

  def write_csv_run(buffer, sort_column)
    # Tempfile.create keeps the file on disk until explicit cleanup
    temp_file = Tempfile.create(['csv_run', '.csv'])
    temp_file.close
    
    sorted = buffer.sort_by { |row| row[sort_column] }
    
    CSV.open(temp_file.path, 'w') do |csv|
      csv << sorted.first.headers if sorted.first
      sorted.each { |row| csv << row }
    end
    
    @temp_files << temp_file.path
  end

  def merge_csv_runs(output_path, sort_column)
    return if @temp_files.empty?
    
    CSV.open(output_path, 'w') do |output|
      handles = @temp_files.map { |path| CSV.open(path, headers: true) }
      
      # Initialize heap; shifting the first row also forces each
      # handle to parse its header line
      heap = []
      handles.each_with_index do |csv, idx|
        if row = csv.shift
          heap << [row[sort_column], row, idx]
        end
      end
      heap.sort_by! { |v, _, _| v }
      
      # CSV#headers returns true, not the header array, until the
      # first row has been read, so write the headers only now
      output << handles.first.headers
      
      # Merge
      until heap.empty?
        _, row, idx = heap.shift
        output << row
        
        if next_row = handles[idx].shift
          entry = [next_row[sort_column], next_row, idx]
          insert_pos = heap.bsearch_index { |v, _, _| v >= entry[0] } || heap.size
          heap.insert(insert_pos, entry)
        end
      end
      
      handles.each(&:close)
    end
  ensure
    cleanup_temp_files
  end

  def cleanup_temp_files
    @temp_files.each { |f| File.delete(f) if File.exist?(f) }
  end
end

# Sort 10GB CSV by email column
sorter = CSVExternalSort.new(memory_limit: 500_000)
sorter.sort_csv('users.csv', 'users_sorted.csv', sort_column: 'email')

Example 2: Log File Processing with Custom Sorting

require 'tempfile'

class LogExternalSort
  LOG_PATTERN = /^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})\s+(\w+)\s+(.+)$/

  def initialize(memory_limit: 200_000)
    @memory_limit = memory_limit
    @temp_files = []
  end

  def sort_logs(input_path, output_path, sort_by: :timestamp)
    create_log_runs(input_path, sort_by)
    merge_log_runs(output_path, sort_by)
  ensure
    cleanup_temp_files
  end

  private

  def create_log_runs(input_path, sort_by)
    buffer = []
    
    File.open(input_path, 'r') do |file|
      file.each_line do |line|
        if match = line.match(LOG_PATTERN)
          timestamp, level, message = match.captures
          buffer << {
            timestamp: timestamp,
            level: level,
            message: message,
            raw: line
          }
          
          if buffer.size >= @memory_limit
            write_log_run(buffer, sort_by)
            buffer.clear
          end
        end
      end
      
      write_log_run(buffer, sort_by) unless buffer.empty?
    end
  end

  def write_log_run(buffer, sort_by)
    # Tempfile.create: the file persists until cleanup deletes it explicitly
    temp_file = Tempfile.create(['log_run', '.txt'])
    
    sorted = case sort_by
    when :timestamp
      buffer.sort_by { |entry| entry[:timestamp] }
    when :level
      # ERROR > WARN > INFO > DEBUG
      level_order = { 'ERROR' => 0, 'WARN' => 1, 'INFO' => 2, 'DEBUG' => 3 }
      buffer.sort_by { |entry| [level_order[entry[:level]] || 4, entry[:timestamp]] }
    else
      buffer.sort_by { |entry| entry[sort_by] }
    end
    
    sorted.each { |entry| temp_file.puts(entry[:raw]) }
    temp_file.close
    @temp_files << temp_file.path
  end

  def merge_log_runs(output_path, sort_by)
    level_order = { 'ERROR' => 0, 'WARN' => 1, 'INFO' => 2, 'DEBUG' => 3 }
    
    # Build the merge key in one place instead of duplicating the
    # parsing logic for the initial fill and the refill
    key_for = lambda do |line|
      match = line.match(LOG_PATTERN)
      next nil unless match
      timestamp, level, _ = match.captures
      case sort_by
      when :timestamp then timestamp
      when :level then "#{level_order[level] || 4}_#{timestamp}"
      end
    end
    
    File.open(output_path, 'w') do |output|
      handles = @temp_files.map { |path| File.open(path, 'r') }
      heap = []
      
      # Initialize heap with the first line of each run
      handles.each_with_index do |file, idx|
        if (line = file.gets) && (key = key_for.call(line))
          heap << [key, line, idx]
        end
      end
      heap.sort_by! { |key, _, _| key }
      
      # Merge
      until heap.empty?
        _, line, idx = heap.shift
        output.print(line)
        
        if (next_line = handles[idx].gets) && (key = key_for.call(next_line))
          insert_pos = heap.bsearch_index { |k, _, _| k >= key } || heap.size
          heap.insert(insert_pos, [key, next_line, idx])
        end
      end
      
      handles.each(&:close)
    end
  ensure
    cleanup_temp_files
  end

  def cleanup_temp_files
    @temp_files.each { |f| File.delete(f) if File.exist?(f) }
  end
end

# Sort 50GB log file by timestamp
sorter = LogExternalSort.new(memory_limit: 300_000)
sorter.sort_logs('application.log', 'sorted_logs.txt', sort_by: :timestamp)

# Or sort by error level priority
sorter.sort_logs('application.log', 'errors_first.txt', sort_by: :level)

Example 3: Database-Style Multi-Column Sorting

require 'csv'
require 'tempfile'

class MultiColumnExternalSort
  def initialize(memory_limit: 100_000)
    @memory_limit = memory_limit
    @temp_files = []
  end

  def sort_file(input_path, output_path, sort_columns: [0])
    create_runs(input_path, sort_columns)
    merge_runs(output_path, sort_columns)
  ensure
    cleanup_temp_files
  end

  private

  def create_runs(input_path, sort_columns)
    require 'csv'
    buffer = []
    
    CSV.foreach(input_path) do |row|
      buffer << row
      
      if buffer.size >= @memory_limit
        write_sorted_run(buffer, sort_columns)
        buffer.clear
      end
    end
    
    write_sorted_run(buffer, sort_columns) unless buffer.empty?
  end

  def write_sorted_run(buffer, sort_columns)
    # Tempfile.create avoids GC-triggered deletion of the run file
    temp_file = Tempfile.create(['multicolumn_run', '.csv'])
    
    sorted = buffer.sort do |a, b|
      compare_rows(a, b, sort_columns)
    end
    
    CSV.open(temp_file.path, 'w') do |csv|
      sorted.each { |row| csv << row }
    end
    
    temp_file.close
    @temp_files << temp_file.path
  end

  def compare_rows(row_a, row_b, sort_columns)
    sort_columns.each do |col_spec|
      column, direction = parse_column_spec(col_spec)
      
      val_a = row_a[column]
      val_b = row_b[column]
      
      # Try numeric comparison
      num_a = Float(val_a) rescue nil
      num_b = Float(val_b) rescue nil
      
      comparison = if num_a && num_b
        num_a <=> num_b
      else
        val_a.to_s <=> val_b.to_s
      end
      
      comparison *= -1 if direction == :desc
      
      return comparison if comparison != 0
    end
    
    0
  end

  def parse_column_spec(spec)
    if spec.is_a?(Hash)
      spec.first
    elsif spec.is_a?(Array)
      spec
    else
      [spec, :asc]
    end
  end

  def merge_runs(output_path, sort_columns)
    CSV.open(output_path, 'w') do |output|
      handles = @temp_files.map { |path| CSV.open(path) }
      heap = []
      
      handles.each_with_index do |csv, idx|
        if row = csv.shift
          heap << [row, idx]
        end
      end
      
      heap.sort! { |a, b| compare_rows(a[0], b[0], sort_columns) }
      
      until heap.empty?
        row, idx = heap.shift
        output << row
        
        if next_row = handles[idx].shift
          entry = [next_row, idx]
          insert_pos = heap.bsearch_index do |heap_row, _|
            compare_rows(heap_row, next_row, sort_columns) >= 0
          end
          heap.insert(insert_pos || heap.size, entry)
        end
      end
      
      handles.each(&:close)
    end
  ensure
    cleanup_temp_files
  end

  def cleanup_temp_files
    @temp_files.each { |f| File.delete(f) if File.exist?(f) }
  end
end

# Sort by country (ascending), then by revenue (descending)
sorter = MultiColumnExternalSort.new(memory_limit: 50_000)
sorter.sort_file(
  'sales.csv',
  'sorted_sales.csv',
  sort_columns: [
    [2, :asc],   # Country column ascending
    [5, :desc]   # Revenue column descending
  ]
)

Common Pitfalls

Insufficient Buffer Allocation causes excessive I/O overhead. Allocating tiny buffers (hundreds of bytes) triggers a system call for nearly every record. Buffer sizes below the filesystem block size (typically 4KB) prevent OS read-ahead optimization. Minimum buffer sizes should match disk block boundaries, with 64KB or larger preferred for sequential reads.

# Poor: unbuffered reads trigger a system call per record
def slow_merge(file1, file2)
  f1 = File.open(file1, 'r')
  f2 = File.open(file2, 'r')
  # Tiny IO#sysread calls bypass Ruby's internal buffer entirely,
  # paying full system-call overhead for every record
end

# Better: read large chunks and split records in memory
def each_buffered_line(path, chunk_size: 65_536)
  File.open(path, 'r') do |f|
    carry = +''
    while chunk = f.read(chunk_size)
      carry << chunk
      # Emit complete lines; keep any partial trailing line in carry
      while newline = carry.index("\n")
        yield carry.slice!(0..newline)
      end
    end
    yield carry unless carry.empty?
  end
end

Exceeding File Descriptor Limits crashes programs during k-way merge. Operating systems limit the number of simultaneously open files (typically 1024 on Unix systems). Opening 1000 input runs simultaneously fails. The merge factor k must account for this limit, typically staying well below the maximum to leave descriptors for other operations.

Ruby processes inherit the system's file descriptor limit. Check with Process.getrlimit(Process::RLIMIT_NOFILE). For large k values, consider increasing the limit with Process.setrlimit or performing the merge in multiple stages where each stage merges a subset of files.
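Capping the merge factor against the process limit can happen up front; the 16-descriptor reserve below is an arbitrary safety margin, not a system requirement:

```ruby
# Cap k well below the soft file descriptor limit
def safe_merge_factor(desired_k, reserve: 16)
  soft_limit, _hard_limit = Process.getrlimit(Process::RLIMIT_NOFILE)
  [desired_k, soft_limit - reserve].min
end
```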

Incorrect Merge Logic produces incorrect output. A common error involves failing to handle buffer exhaustion properly. When an input buffer empties, the code must refill it before continuing comparison. Forgetting this causes records to appear in wrong positions.

# Incorrect: Forgets to refill buffers
def broken_merge(files)
  heap = files.map { |f| [f.gets, f] }
  
  until heap.empty?
    value, file = heap.min
    puts value
    heap.delete([value, file])  # Bug: Never refills from file
  end
end

# Correct: Refills after extracting minimum
def correct_merge(files)
  # compact would not drop [nil, file] pairs, so filter explicitly
  heap = files.map { |f| [f.gets, f] }.reject { |value, _| value.nil? }
  
  until heap.empty?
    # min_by avoids comparing File objects when two values tie
    entry = heap.min_by { |value, _| value }
    value, file = entry
    puts value
    heap.delete(entry)
    
    if next_value = file.gets
      heap << [next_value, file]
    end
  end
end

Disk Space Exhaustion halts sorting mid-process. External sorting requires temporary storage approximately equal to input size. A 100 GB input needs 100 GB for runs plus space for merge outputs. During merge passes, both input runs and output exist simultaneously, requiring 2× input size temporarily.

Monitor available disk space before starting. With prompt cleanup of each pass's input files, peak temporary usage is roughly twice the input size (the current pass's inputs plus its output); without cleanup it can grow toward input_size × (1 + number_of_passes). Fail early if insufficient space exists rather than crashing mid-sort, and delete temporary files promptly after merging to release space for subsequent operations.
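Ruby's standard library has no portable free-space query, so an early check has to lean on the platform. A sketch that shells out to POSIX `df` (the command, its `-Pk` output format, and the 2× multiplier are assumptions):

```ruby
# Pre-flight disk space check; assumes a POSIX `df` is available
def ensure_temp_space!(input_path, temp_dir)
  needed = File.size(input_path) * 2   # runs plus merge output coexist at peak
  # df -Pk prints sizes in 1KB blocks; column 4 is "Available"
  available_kb = `df -Pk #{temp_dir}`.lines.last.split[3].to_i
  available = available_kb * 1024
  if available < needed
    raise "Need #{needed} bytes in #{temp_dir}, only #{available} free"
  end
  needed
end
```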

Incorrect String Comparison produces unexpected sort orders. String sorting is lexicographic: "10" comes before "2" alphabetically. Numeric fields stored as strings require explicit numeric conversion. Failing to normalize strings (case, whitespace) causes inconsistent ordering.

# Incorrect: Lexicographic comparison of numeric strings
buffer.sort_by { |row| row[:age] }  # "9" > "80" lexicographically

# Correct: Parse numbers for numeric comparison
buffer.sort_by { |row| row[:age].to_i }

# Incorrect: Case-sensitive comparison
buffer.sort_by { |row| row[:name] }  # "alice" > "Zach"

# Correct: Normalized comparison
buffer.sort_by { |row| row[:name].downcase }

Tempfile Premature Deletion loses data mid-merge. Ruby's Tempfile registers a finalizer that deletes the backing file when the object is garbage collected. Storing only the file path without keeping the Tempfile object reference allows deletion before the merge completes.

# Incorrect: Tempfile gets garbage collected
def create_run(data)
  temp = Tempfile.new('run')
  temp.write(data.sort.join("\n"))
  temp.close
  temp.path  # Only return path, temp gets GC'd and deleted
end

# Correct: Keep reference to Tempfile object
def create_run(data)
  temp = Tempfile.new('run')
  temp.write(data.sort.join("\n"))
  temp.close
  temp  # Return object itself, or explicitly manage cleanup
end

Memory Underestimation causes paging and severe slowdown. Calculating memory for buffers must account for Ruby object overhead. Each string carries object header overhead (40+ bytes in Ruby). A buffer of 100,000 strings consumes far more than the string content size.

Monitor actual memory usage during run formation. If the system begins swapping to disk, external sorting becomes internal sorting with paging overhead. Reduce memory limit setting to stay comfortably within physical RAM, typically using 70-80% of available memory.
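ObjectSpace.memsize_of from the standard library can measure that overhead on a small sample before sizing the buffer; the 512 MB budget below is an assumed value:

```ruby
require 'objspace'

# Sketch: measure real per-record memory, including Ruby object headers,
# from a sample before choosing the run-formation buffer size.
sample  = Array.new(1_000) { |i| "record-#{i}-#{'x' * 50}" }
payload = sample.sum(&:bytesize)
actual  = sample.sum { |s| ObjectSpace.memsize_of(s) } +
          ObjectSpace.memsize_of(sample)

per_record      = actual / sample.size
budget          = (512 * 1024 * 1024 * 0.75).to_i  # 75% of assumed 512 MB
records_per_run = budget / per_record
puts "~#{per_record} B/record; #{records_per_run} records fit the budget"
```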

Non-Atomic File Operations risk data corruption on failure. Writing output directly to the final destination path means incomplete output on crash. The output file becomes corrupted and unusable. Write to temporary files first, then atomically rename to final destination.

# Risky: Failure leaves a corrupted output file
def merge_runs(output_path)
  File.open(output_path, 'w') do |out|
    # If a crash happens here, output_path holds partial, unusable data
    perform_merge(out)
  end
end

# Safer: Atomic rename after successful completion
require 'fileutils'
require 'tempfile'

def merge_runs(output_path)
  # Create the temp file next to the destination so the rename stays on
  # one filesystem and remains atomic
  temp = Tempfile.new('final', File.dirname(output_path))
  
  File.open(temp.path, 'w') do |out|
    perform_merge(out)
  end
  
  FileUtils.mv(temp.path, output_path)  # Atomic on the same filesystem
end

Reference

Algorithm Complexity Comparison

| Algorithm | I/O Complexity | Merge Passes | Memory Usage | Best Use Case |
|---|---|---|---|---|
| Two-way merge | 2N × (log₂(b) + 1) | log₂(b) | Minimal (3 buffers) | Severely limited memory |
| k-way merge | 2N × (logₖ(b) + 1) | logₖ(b) | k buffers + heap | Balanced approach |
| Replacement selection | 2N × (logₖ(b/2) + 1) | logₖ(b/2) | Heap size M | Random or partial ordering |
| Polyphase merge | ~2N × logₖ(b) | logₖ(b) | k buffers | Sequential media only |

Here N is the dataset size in blocks, b the number of initial runs, k the merge factor, and M the in-memory heap capacity.

Performance Characteristics

| Dataset Size | Memory Available | Recommended k | Expected Passes | Total I/O Multiplier |
|---|---|---|---|---|
| 1 GB | 100 MB | 8-10 | 1-2 | 4-6× |
| 10 GB | 1 GB | 10-15 | 1-2 | 4-6× |
| 100 GB | 8 GB | 10-20 | 2-3 | 6-8× |
| 1 TB | 64 GB | 15-25 | 2-3 | 6-8× |

Configuration Parameters

| Parameter | Typical Range | Impact | Tuning Guidance |
|---|---|---|---|
| Memory limit | 10K-10M records | Determines run count | Use 70-80% of available RAM |
| Merge factor (k) | 2-50 | Controls pass count | Higher k: fewer passes, more memory |
| Buffer size | 4KB-1MB | I/O efficiency | Match filesystem block size minimum |
| Heap size (replacement) | Same as memory | Run length | Larger heap: longer runs |

File Descriptor Requirements

| Operation | Open Files | Formula | Example (k=10) |
|---|---|---|---|
| Run formation | 2 | 1 input + 1 output | 2 files |
| k-way merge | k + 1 | k inputs + 1 output | 11 files |
| Multi-stage merge | k + 1 per stage | Per-stage maximum | 11 files per stage |
| Total system | Varies | Check ulimit -n | Typically 1024 |

Storage Space Requirements

| Phase | Space Needed | Duration | Notes |
|---|---|---|---|
| Run formation | 1× input size | During formation | Temporary run files |
| Single merge pass | 2× input size | During merge | Input runs + output |
| Multi-pass merge | 1.5-2× input size | Peak usage | Overlapping cleanup |
| Final output | 1× input size | Permanent | Final sorted file |

Ruby Implementation Checklist

| Aspect | Requirement | Verification |
|---|---|---|
| Tempfile cleanup | Always close and unlink | Use ensure blocks |
| File descriptor limit | Check before large k | Process.getrlimit check |
| Buffer size | Minimum 4KB | Set explicit buffering |
| Memory monitoring | Track actual usage | GC.stat monitoring |
| Numeric comparison | Convert strings to numbers | Use to_i or to_f |
| Atomic writes | Rename temp to final | FileUtils.mv on completion |
| Error handling | Rescue I/O errors | Handle ENOSPC, EMFILE |

Common Error Codes

| Error | Cause | Resolution |
|---|---|---|
| EMFILE | Too many open files | Reduce merge factor k |
| ENOSPC | No disk space | Free space or reduce dataset |
| ENOMEM | Out of memory | Reduce memory limit setting |
| EACCES | Permission denied | Check directory permissions |
| EXDEV | Cross-device link | Move temp to same filesystem |