A guide to external sorting algorithms for handling datasets larger than available memory, covering multi-way merge techniques, I/O optimization, and file-based sorting implementations.
Overview
External sorting addresses the problem of sorting datasets too large to fit in main memory. While internal sorting algorithms like quicksort or mergesort operate entirely in RAM, external sorting algorithms partition data across secondary storage devices and perform sorting through a series of I/O operations.
The fundamental constraint driving external sorting is the memory hierarchy. When a dataset exceeds available RAM, the operating system begins paging memory to disk, causing severe performance degradation. External sorting algorithms accept this constraint and design around it, treating disk I/O as the primary cost metric rather than CPU comparisons.
External sorting emerged from early computing systems where memory was severely limited. Magnetic tape systems required sequential access patterns, leading to algorithms like polyphase merge sort. Modern implementations work with disk files or SSDs but retain the core principle: minimize the number of disk passes while staying within memory constraints.
# Dataset sizes requiring external sorting
internal_sort_limit = 8 * 1024 * 1024 * 1024  # 8GB available RAM
dataset_size = 100 * 1024 * 1024 * 1024       # 100GB dataset

if dataset_size > internal_sort_limit
  # External sorting required
  external_sort(dataset)
else
  # Standard in-memory sort works
  dataset.sort
end
The canonical example involves sorting a large file of records. A file containing 100 GB of data cannot be loaded into a system with 8 GB of RAM. External sorting reads portions of the file, sorts them in memory, writes sorted "runs" to temporary files, then merges these runs to produce the final sorted output.
Key Principles
External sorting divides the sorting process into two phases: run formation and merging. Run formation reads chunks of data that fit in memory, sorts each chunk internally, and writes sorted runs to disk. The merge phase combines these sorted runs into progressively larger sorted sequences until producing a single sorted output.
Run Formation reads the input file in blocks sized to fit available memory. Each block is sorted internally using any efficient algorithm, and the resulting sorted block, called a run, is written to a temporary file. If the input contains N records and memory holds M records, run formation produces ⌈N/M⌉ runs.
Multi-Way Merging combines multiple sorted runs simultaneously. A k-way merge reads from k input runs, maintaining k buffers in memory. The algorithm repeatedly selects the smallest element across all buffers, writes it to output, and refills the depleted buffer. This continues until all inputs are exhausted.
The merge proceeds in passes. Each pass reduces the number of runs by a factor of k. With b initial runs and k-way merging, the number of passes equals ⌈log_k(b)⌉. Fewer passes mean less I/O, but larger k requires more memory for input buffers.
I/O Complexity measures external sorting efficiency. Each record must be read once during run formation and written once to temporary storage. Each subsequent merge pass reads and writes every record. With p passes, total I/O equals 2N × (p + 1) operations, where N is the number of records.
Memory allocation splits between input buffers, an output buffer, and workspace for the merge heap. For k-way merging with total memory M, a typical allocation assigns roughly M/(k+1) to each input buffer and the same to the output buffer. The small remainder holds the selection data structure, usually a min-heap of size k.
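As a concrete sketch of this split (the (k+1)-way division and the method name are illustrative choices, not a fixed rule):

```ruby
# Divide total merge memory among k input buffers and one output buffer.
# The selection structure only needs one candidate record per run.
def plan_merge_memory(total_bytes, k)
  per_buffer = total_bytes / (k + 1)   # k input buffers + 1 output buffer
  {
    input_buffer_bytes: per_buffer,
    output_buffer_bytes: per_buffer,
    heap_entries: k                    # min-heap holds k candidates
  }
end

plan_merge_memory(64 * 1024 * 1024, 15)
# 16 equal slices of a 64MB budget: each buffer gets 4MB
```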
Replacement Selection improves upon simple run formation. Instead of sorting fixed-size blocks, replacement selection generates longer runs by maintaining a heap of size M. As records are written to output, new records enter the heap if they exceed the last output value. On random input this produces runs averaging 2M records, halving the initial run count.
Buffer Management critically impacts performance. Reading single records from k files incurs massive overhead. Block-buffered I/O reads large chunks (typically 4KB-64KB) into memory buffers. When an input buffer empties, a single large read refills it. Similarly, output buffering accumulates records before writing complete blocks.
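A minimal sketch of block-buffered line reading. Ruby's IO already buffers internally; this version just makes the block size explicit. The 64KB constant and the method name are illustrative choices:

```ruby
# Read a file in large blocks, yielding complete lines as they accumulate.
BLOCK_SIZE = 64 * 1024

def each_line_block_buffered(path)
  File.open(path, 'rb') do |file|
    carry = ''
    while chunk = file.read(BLOCK_SIZE)   # one read call moves 64KB
      carry += chunk
      lines = carry.split("\n", -1)
      carry = lines.pop                   # last piece may be a partial line
      lines.each { |line| yield line }
    end
    yield carry unless carry.empty?       # final line without trailing newline
  end
end
```

The same idea applies on the output side: accumulate records in a string buffer and write it out once it reaches the block size.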
# I/O pass calculation
records = 1_000_000_000
memory_capacity = 10_000_000
runs = (records.to_f / memory_capacity).ceil # 100 runs
merge_factor = 10
passes = Math.log(runs, merge_factor).ceil # 2 passes
total_io = records * 2 * (passes + 1) # Read/write for formation + 2 merge passes
# => 6 billion I/O operations
Implementation Approaches
Two-Way Merge Sort represents the simplest external sorting strategy. Run formation creates initial sorted runs. Each merge pass combines pairs of runs, halving the run count. The algorithm continues until a single run remains. Two-way merging requires only three file handles: two inputs and one output.
Two-way merge uses minimal memory, requiring only two input buffers and one output buffer. However, the number of passes equals ⌈log_2(b)⌉, where b is the initial run count. A thousand initial runs require ten merge passes, each reading and writing the entire dataset.
Multi-Way Merge Sort reduces passes by merging k runs simultaneously. With k-way merging, passes equal ⌈log_k(b)⌉. Choosing k=10 reduces a thousand runs to two passes. However, k cannot exceed the file descriptor limit, and each input requires a buffer, consuming memory proportional to k.
The merge selection mechanism maintains a min-heap containing the next record from each input. Each iteration extracts the minimum, writes it, refills from the corresponding input, and adjusts the heap. Heap operations cost O(log k) per record, making k-way merging slightly more CPU-intensive than two-way merging, but the I/O savings dominate.
Polyphase Merge optimizes tape-based systems where sequential access prohibits random seeking. Rather than balanced merging, polyphase distributes runs unevenly across tapes following a Fibonacci-like sequence. This eliminates copying overhead on sequential media. Modern disk-based systems rarely use polyphase merging due to random access capabilities.
Replacement Selection generates longer initial runs, reducing merge passes. The algorithm maintains a heap of size M. Records enter the heap and output proceeds while maintaining the sorted property. When a record smaller than the last output appears, it belongs to the next run. Replacement selection produces runs averaging 2M records under random input.
Cascade Merge creates a hybrid between two-way and multi-way merging. Each pass uses a different merge factor, starting with higher k values and reducing as runs decrease. This balances memory usage across passes and adapts to changing run counts.
Selection criteria depend on data size, available memory, and I/O characteristics:
- Datasets under 10× memory: Multi-way merge with high k
- Limited memory: Two-way merge or lower k values
- Random data: Replacement selection for longer runs
- Nearly-sorted data: Check whether a single merge pass suffices
- SSD storage: Higher k values due to better random I/O
- Rotating disk: Moderate k, optimize sequential access
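These criteria can be folded into a rough parameter chooser. The thresholds and return values below are illustrative assumptions, not established defaults:

```ruby
# Pick a merge factor from the dataset/memory ratio and storage type.
def choose_merge_factor(dataset_bytes:, memory_bytes:, ssd: false)
  ratio = dataset_bytes.to_f / memory_bytes
  return 1 if ratio <= 1.0          # fits in memory: no merging needed
  base = ratio < 10 ? 16 : 8        # small spill-over tolerates a higher k
  ssd ? base * 2 : base             # SSDs absorb more random read streams
end

choose_merge_factor(dataset_bytes: 100 * 2**30, memory_bytes: 8 * 2**30)
# rotating-disk default for a 100GB dataset with 8GB of RAM
```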
Ruby Implementation
Ruby's standard library lacks built-in external sorting, requiring custom implementation. The following examples demonstrate file-based sorting with proper resource management and error handling.
Basic Two-Way External Sort:
require 'tempfile'

class ExternalSort
  def initialize(memory_limit: 10_000)
    @memory_limit = memory_limit
    @temp_files = []
  end

  def sort_file(input_path, output_path)
    create_runs(input_path)
    merge_runs(output_path)
  ensure
    cleanup_temp_files
  end

  private

  def create_runs(input_path)
    buffer = []
    File.open(input_path, 'r') do |file|
      file.each_line do |line|
        buffer << line
        if buffer.size >= @memory_limit
          write_sorted_run(buffer)
          buffer.clear
        end
      end
      write_sorted_run(buffer) unless buffer.empty?
    end
  end

  def write_sorted_run(buffer)
    temp_file = Tempfile.new('run')
    buffer.sort.each { |line| temp_file.puts(line) }
    temp_file.rewind
    @temp_files << temp_file
  end

  def merge_runs(output_path)
    while @temp_files.size > 1
      new_temps = []
      @temp_files.each_slice(2) do |file1, file2|
        if file2
          new_temps << merge_two_files(file1, file2)
        else
          new_temps << file1
        end
      end
      @temp_files = new_temps
    end
    File.open(output_path, 'w') do |output|
      @temp_files.first.each_line { |line| output.puts(line) }
    end
  end

  def merge_two_files(file1, file2)
    temp = Tempfile.new('merge')
    file1.rewind
    file2.rewind
    line1 = file1.gets
    line2 = file2.gets
    while line1 && line2
      if line1 <= line2
        temp.puts(line1)
        line1 = file1.gets
      else
        temp.puts(line2)
        line2 = file2.gets
      end
    end
    # Drain the remaining input without dropping the line already in hand
    while line1
      temp.puts(line1)
      line1 = file1.gets
    end
    while line2
      temp.puts(line2)
      line2 = file2.gets
    end
    file1.close
    file2.close
    temp.rewind
    temp
  end

  def cleanup_temp_files
    @temp_files.each do |file|
      file.close
      file.unlink
    end
  end
end

# Usage
sorter = ExternalSort.new(memory_limit: 100_000)
sorter.sort_file('large_input.txt', 'sorted_output.txt')
Multi-Way Merge with Heap:
require 'tempfile'
require 'fileutils'

class MultiWayMergeSort
  def initialize(memory_limit: 100_000, merge_factor: 10)
    @memory_limit = memory_limit
    @merge_factor = merge_factor
    @temp_files = []
  end

  def sort_file(input_path, output_path)
    create_runs(input_path)
    multiway_merge(output_path)
  ensure
    cleanup_temp_files
  end

  private

  def create_runs(input_path)
    buffer = []
    File.open(input_path, 'r') do |file|
      file.each_line do |line|
        buffer << line.chomp
        if buffer.size >= @memory_limit
          write_sorted_run(buffer)
          buffer.clear
        end
      end
      write_sorted_run(buffer) unless buffer.empty?
    end
  end

  def write_sorted_run(buffer)
    # Tempfile.create returns a plain File that is not unlinked on GC,
    # so storing only the path is safe
    temp_file = Tempfile.create(['run', '.txt'])
    buffer.sort.each { |line| temp_file.puts(line) }
    temp_file.close
    @temp_files << temp_file.path
  end

  def multiway_merge(output_path)
    while @temp_files.size > 1
      pass_files = []
      @temp_files.each_slice(@merge_factor) do |file_group|
        pass_files << merge_k_files(file_group)
      end
      @temp_files.each { |f| File.delete(f) if File.exist?(f) }
      @temp_files = pass_files
    end
    FileUtils.mv(@temp_files.first, output_path)
  end

  def merge_k_files(file_paths)
    output_temp = Tempfile.create(['merge', '.txt'])
    handles = file_paths.map { |path| File.open(path, 'r') }
    # Sorted array standing in for a min-heap: [value, file_index]
    heap = []
    # Initialize with the first line from each file
    handles.each_with_index do |file, idx|
      if line = file.gets
        heap << [line.chomp, idx]
      end
    end
    heap.sort!
    # Repeatedly emit the minimum and refill from its source file
    until heap.empty?
      value, idx = heap.shift
      output_temp.puts(value)
      if line = handles[idx].gets
        new_entry = [line.chomp, idx]
        # Insert into sorted position
        insert_pos = heap.bsearch_index { |v, _| v >= new_entry[0] } || heap.size
        heap.insert(insert_pos, new_entry)
      end
    end
    handles.each(&:close)
    output_temp.close
    output_temp.path
  end

  def cleanup_temp_files
    @temp_files.each { |f| File.delete(f) if File.exist?(f) }
  end
end

# Usage for large dataset
sorter = MultiWayMergeSort.new(memory_limit: 500_000, merge_factor: 8)
sorter.sort_file('huge_dataset.txt', 'sorted_result.txt')
Replacement Selection for Longer Runs:
require 'tempfile'
require 'fileutils'

class ReplacementSelectionSort
  def initialize(heap_size: 100_000)
    @heap_size = heap_size
    @temp_files = []
  end

  def sort_file(input_path, output_path)
    create_runs_replacement_selection(input_path)
    merge_all_runs(output_path)
  ensure
    cleanup_temp_files
  end

  private

  def create_runs_replacement_selection(input_path)
    File.open(input_path, 'r') do |file|
      generate_run_with_replacement(file)
    end
  end

  def generate_run_with_replacement(input_file)
    current_heap = []
    next_heap = []
    run_output = nil
    last_written = nil
    # Fill the initial working set
    @heap_size.times do
      break unless line = input_file.gets
      current_heap << line.chomp
    end
    current_heap.sort!
    loop do
      if current_heap.empty?
        # Current run is finished; start the next one
        break if next_heap.empty?
        run_output.close if run_output
        current_heap = next_heap.sort
        next_heap = []
        last_written = nil
        run_output = nil
      end
      # Open a run file lazily; Tempfile.create files survive GC
      unless run_output
        run_output = Tempfile.create(['run', '.txt'])
        @temp_files << run_output.path
      end
      # Emit the smallest element that maintains sort order
      value = current_heap.shift
      run_output.puts(value)
      last_written = value
      # Replace it with the next input record
      if line = input_file.gets
        line = line.chomp
        if line >= last_written
          insert_sorted(current_heap, line)   # belongs in the current run
        else
          insert_sorted(next_heap, line)      # must wait for the next run
        end
      end
    end
    run_output.close if run_output
  end

  def insert_sorted(array, value)
    pos = array.bsearch_index { |v| v >= value } || array.size
    array.insert(pos, value)
  end

  def merge_all_runs(output_path)
    return if @temp_files.empty?
    if @temp_files.size == 1
      FileUtils.mv(@temp_files.first, output_path)
      return
    end
    # Simple two-way merge for demonstration
    while @temp_files.size > 1
      file1 = @temp_files.shift
      file2 = @temp_files.shift
      @temp_files << merge_two_files(file1, file2)
      File.delete(file1)
      File.delete(file2)
    end
    FileUtils.mv(@temp_files.first, output_path)
  end

  def merge_two_files(path1, path2)
    output_temp = Tempfile.create(['merge', '.txt'])
    File.open(path1, 'r') do |f1|
      File.open(path2, 'r') do |f2|
        line1 = f1.gets
        line2 = f2.gets
        while line1 && line2
          if line1 <= line2
            output_temp.puts(line1)
            line1 = f1.gets
          else
            output_temp.puts(line2)
            line2 = f2.gets
          end
        end
        # Drain the remainder without dropping the line already in hand
        while line1
          output_temp.puts(line1)
          line1 = f1.gets
        end
        while line2
          output_temp.puts(line2)
          line2 = f2.gets
        end
      end
    end
    output_temp.close
    output_temp.path
  end

  def cleanup_temp_files
    @temp_files.each { |f| File.delete(f) if File.exist?(f) }
  end
end

# Generates runs averaging about 2x heap_size on random data
sorter = ReplacementSelectionSort.new(heap_size: 50_000)
sorter.sort_file('random_data.txt', 'sorted_output.txt')
Performance Considerations
External sorting performance depends primarily on I/O characteristics rather than CPU speed. The number of disk passes determines total I/O volume, while buffer sizes affect I/O efficiency. Understanding these factors enables effective optimization.
I/O Pass Analysis calculates total data movement. With N records, run formation reads N records and writes N records (2N I/O operations). Each merge pass reads all N records and writes N records (2N operations per pass). Total I/O equals 2N × (passes + 1).
Reducing passes requires higher merge factors. With k-way merging and b initial runs, passes equal ⌈log_k(b)⌉. Doubling k from 5 to 10 cuts passes by approximately half. However, memory constraints limit k since each input stream requires a buffer.
# Comparing merge strategies
def calculate_io_operations(records, memory, merge_factor)
  runs = (records.to_f / memory).ceil
  passes = Math.log(runs, merge_factor).ceil
  io_operations = records * 2 * (passes + 1)
  {
    runs: runs,
    passes: passes,
    total_io: io_operations,
    io_per_record: io_operations / records.to_f
  }
end

# 1 billion records, 10 million records of memory
two_way = calculate_io_operations(1_000_000_000, 10_000_000, 2)
# => {runs: 100, passes: 7, total_io: 16 billion, io_per_record: 16}
ten_way = calculate_io_operations(1_000_000_000, 10_000_000, 10)
# => {runs: 100, passes: 2, total_io: 6 billion, io_per_record: 6}
# The 10-way merge performs 2.67x fewer I/O operations
Buffer Size Optimization dramatically impacts throughput. Reading single records incurs OS call overhead for each read. Buffering accumulates multiple records per system call. Reading 64KB blocks instead of 100-byte records reduces system calls by 640×.
Operating systems perform read-ahead caching when detecting sequential access patterns. Buffer sizes matching filesystem block boundaries (typically 4KB multiples) maximize cache effectiveness. Very large buffers (1MB+) provide diminishing returns as OS caching handles read-ahead.
Memory allocation for k-way merging splits total available memory M across k input buffers, one output buffer, and workspace for the merge heap. Typical allocation assigns roughly M/(k+1) to each input buffer and to the output buffer, leaving a small remainder for the k-entry selection heap. Larger input buffers reduce I/O frequency but lower the feasible k.
Replacement Selection Impact varies by input distribution. Random data produces runs averaging 2× heap size. Partially sorted data generates much longer runs. Data sorted in reverse produces minimum-length runs equal to heap size. Real-world data often contains partial ordering, making replacement selection effective.
Longer runs reduce merge passes when the smaller run count crosses a power of the merge factor. Halving 20 initial runs to 10 lets a 10-way merge finish in one pass instead of two, halving merge I/O. Halving 100 runs to 50, by contrast, still requires two passes, since ⌈log_10(50)⌉ = 2; the savings depend on where the counts fall.
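Pass counts for a given run count and merge factor can be computed by direct simulation, which also sidesteps floating-point edge cases in evaluating ⌈log_k(b)⌉:

```ruby
# Count merge passes by repeatedly collapsing runs k at a time.
def merge_passes(runs, k)
  passes = 0
  while runs > 1
    runs = (runs.to_f / k).ceil   # one k-way pass leaves ceil(runs/k) runs
    passes += 1
  end
  passes
end

merge_passes(20, 10)   # => 2
merge_passes(10, 10)   # => 1 (halving crossed the power-of-k boundary)
merge_passes(50, 10)   # => 2 (this halving saves no pass)
```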
Disk vs SSD Performance affects algorithm selection. Rotating disks favor sequential access with large block reads. Random access on spinning disks incurs seek time (5-10ms) plus rotational latency (4ms average). SSDs handle random access efficiently with microsecond latencies.
For rotating disks, lower k values with larger buffers optimize sequential streaming. For SSDs, higher k values reduce passes since random access cost is low. An SSD system might use k=20 with smaller buffers, while rotating disk prefers k=5 with maximum buffer sizes.
CPU vs I/O Bound determines optimization focus. External sorting on modern systems is I/O bound, meaning CPU sits idle waiting for disk operations. Adding CPU-intensive operations (like complex comparisons) has minimal impact on total runtime. Conversely, optimizing I/O patterns yields significant improvements.
CPU costs do matter for comparison operations. String comparisons are slower than integer comparisons. If records contain complex data requiring expensive comparison, reducing total comparisons by generating longer runs provides benefit. However, this effect is secondary to I/O optimization.
Parallel I/O overlaps reading and writing. While merging runs, the algorithm reads from input buffers and writes to an output buffer. Double-buffering enables I/O and computation overlap: while processing one buffer, the OS asynchronously fills the next buffer. This prevents CPU idle time waiting for I/O.
Multi-threaded implementations pipeline run formation and merging. One thread generates runs while another begins merging completed runs. However, Ruby's Global Interpreter Lock limits true parallelism for CPU-bound operations. I/O-bound external sorting sees less GIL impact since threads often wait on I/O.
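A sketch of such a pipeline using a SizedQueue for handoff. The method name, queue depth, and in-memory input are illustrative, and error handling is omitted:

```ruby
require 'tempfile'

# Overlap run formation with downstream work: a producer thread writes
# sorted runs while the consumer (here, just collecting paths) could merge.
def pipelined_run_formation(lines, chunk_size)
  completed_runs = SizedQueue.new(4)          # bounded queue gives back-pressure
  producer = Thread.new do
    lines.each_slice(chunk_size) do |chunk|
      run = Tempfile.create(['run', '.txt'])  # plain File, not unlinked by GC
      chunk.sort.each { |l| run.puts(l) }
      run.close
      completed_runs << run.path              # hand the finished run downstream
    end
    completed_runs << nil                     # end-of-stream marker
  end
  run_paths = []
  while path = completed_runs.pop             # a merge stage would start here
    run_paths << path
  end
  producer.join
  run_paths
end
```

Because the producer spends most of its time in file I/O, the GIL is released during those waits and the two stages genuinely overlap.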
Practical Examples
Example 1: Sorting Large CSV Files
require 'csv'
require 'tempfile'

class CSVExternalSort
  def initialize(memory_limit: 1_000_000)
    @memory_limit = memory_limit
    @temp_files = []
  end

  def sort_csv(input_path, output_path, sort_column: 0)
    # Parse and create runs
    create_csv_runs(input_path, sort_column)
    # Merge runs maintaining CSV format
    merge_csv_runs(output_path, sort_column)
  ensure
    cleanup_temp_files
  end

  private

  def create_csv_runs(input_path, sort_column)
    buffer = []
    CSV.foreach(input_path, headers: true) do |row|
      buffer << row
      if buffer.size >= @memory_limit
        write_csv_run(buffer, sort_column)
        buffer.clear
      end
    end
    write_csv_run(buffer, sort_column) unless buffer.empty?
  end

  def write_csv_run(buffer, sort_column)
    temp_file = Tempfile.create(['csv_run', '.csv'])  # not unlinked on GC
    sorted = buffer.sort_by { |row| row[sort_column] }
    CSV.open(temp_file.path, 'w') do |csv|
      csv << sorted.first.headers if sorted.first
      sorted.each { |row| csv << row }
    end
    temp_file.close
    @temp_files << temp_file.path
  end

  def merge_csv_runs(output_path, sort_column)
    return if @temp_files.empty?
    CSV.open(output_path, 'w') do |output|
      handles = @temp_files.map { |path| CSV.open(path, headers: true) }
      # Initialize heap; reading a row also loads each file's headers
      heap = []
      handles.each_with_index do |csv, idx|
        if row = csv.shift
          heap << [row[sort_column], row, idx]
        end
      end
      heap.sort_by! { |v, _, _| v }
      # Headers become available only after the first read from a run file
      output << handles.first.headers unless heap.empty?
      # Merge
      until heap.empty?
        _, row, idx = heap.shift
        output << row
        if next_row = handles[idx].shift
          entry = [next_row[sort_column], next_row, idx]
          insert_pos = heap.bsearch_index { |v, _, _| v >= entry[0] } || heap.size
          heap.insert(insert_pos, entry)
        end
      end
      handles.each(&:close)
    end
  end

  def cleanup_temp_files
    @temp_files.each { |f| File.delete(f) if File.exist?(f) }
  end
end
# Sort 10GB CSV by email column
sorter = CSVExternalSort.new(memory_limit: 500_000)
sorter.sort_csv('users.csv', 'users_sorted.csv', sort_column: 'email')
Example 2: Log File Processing with Custom Sorting
require 'tempfile'

class LogExternalSort
  LOG_PATTERN = /^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})\s+(\w+)\s+(.+)$/
  # ERROR sorts first, then WARN, INFO, DEBUG
  LEVEL_ORDER = { 'ERROR' => 0, 'WARN' => 1, 'INFO' => 2, 'DEBUG' => 3 }.freeze

  def initialize(memory_limit: 200_000)
    @memory_limit = memory_limit
    @temp_files = []
  end

  def sort_logs(input_path, output_path, sort_by: :timestamp)
    create_log_runs(input_path, sort_by)
    merge_log_runs(output_path, sort_by)
  ensure
    cleanup_temp_files
  end

  private

  def create_log_runs(input_path, sort_by)
    buffer = []
    File.open(input_path, 'r') do |file|
      file.each_line do |line|
        # Lines that do not match the pattern are skipped
        if match = line.match(LOG_PATTERN)
          timestamp, level, message = match.captures
          buffer << {
            timestamp: timestamp,
            level: level,
            message: message,
            raw: line
          }
          if buffer.size >= @memory_limit
            write_log_run(buffer, sort_by)
            buffer.clear
          end
        end
      end
      write_log_run(buffer, sort_by) unless buffer.empty?
    end
  end

  def write_log_run(buffer, sort_by)
    temp_file = Tempfile.create(['log_run', '.txt'])  # not unlinked on GC
    sorted = case sort_by
             when :timestamp
               buffer.sort_by { |entry| entry[:timestamp] }
             when :level
               buffer.sort_by { |entry| [LEVEL_ORDER[entry[:level]] || 4, entry[:timestamp]] }
             else
               buffer.sort_by { |entry| entry[sort_by] }
             end
    sorted.each { |entry| temp_file.puts(entry[:raw]) }
    temp_file.close
    @temp_files << temp_file.path
  end

  def merge_log_runs(output_path, sort_by)
    File.open(output_path, 'w') do |output|
      handles = @temp_files.map { |path| File.open(path, 'r') }
      heap = []
      # Initialize heap with the first parseable line from each run
      handles.each_with_index do |file, idx|
        if line = file.gets
          if key = sort_key_for(line, sort_by)
            heap << [key, line, idx]
          end
        end
      end
      heap.sort_by! { |key, _, _| key }
      # Merge
      until heap.empty?
        _, line, idx = heap.shift
        output.print(line)
        if next_line = handles[idx].gets
          if key = sort_key_for(next_line, sort_by)
            entry = [key, next_line, idx]
            insert_pos = heap.bsearch_index { |k, _, _| k >= entry[0] } || heap.size
            heap.insert(insert_pos, entry)
          end
        end
      end
      handles.each(&:close)
    end
  end

  def sort_key_for(line, sort_by)
    return nil unless match = line.match(LOG_PATTERN)
    timestamp, level, _message = match.captures
    case sort_by
    when :timestamp then timestamp
    when :level then "#{LEVEL_ORDER[level] || 4}_#{timestamp}"
    end
  end

  def cleanup_temp_files
    @temp_files.each { |f| File.delete(f) if File.exist?(f) }
  end
end
# Sort 50GB log file by timestamp
sorter = LogExternalSort.new(memory_limit: 300_000)
sorter.sort_logs('application.log', 'sorted_logs.txt', sort_by: :timestamp)
# Or sort by error level priority
sorter.sort_logs('application.log', 'errors_first.txt', sort_by: :level)
Example 3: Database-Style Multi-Column Sorting
require 'csv'
require 'tempfile'

class MultiColumnExternalSort
  def initialize(memory_limit: 100_000)
    @memory_limit = memory_limit
    @temp_files = []
  end

  def sort_file(input_path, output_path, sort_columns: [0])
    create_runs(input_path, sort_columns)
    merge_runs(output_path, sort_columns)
  ensure
    cleanup_temp_files
  end

  private

  def create_runs(input_path, sort_columns)
    buffer = []
    CSV.foreach(input_path) do |row|
      buffer << row
      if buffer.size >= @memory_limit
        write_sorted_run(buffer, sort_columns)
        buffer.clear
      end
    end
    write_sorted_run(buffer, sort_columns) unless buffer.empty?
  end

  def write_sorted_run(buffer, sort_columns)
    temp_file = Tempfile.create(['multicolumn_run', '.csv'])  # not unlinked on GC
    sorted = buffer.sort { |a, b| compare_rows(a, b, sort_columns) }
    CSV.open(temp_file.path, 'w') do |csv|
      sorted.each { |row| csv << row }
    end
    temp_file.close
    @temp_files << temp_file.path
  end

  def compare_rows(row_a, row_b, sort_columns)
    sort_columns.each do |col_spec|
      column, direction = parse_column_spec(col_spec)
      val_a = row_a[column]
      val_b = row_b[column]
      # Compare numerically when both values parse as numbers
      num_a = Float(val_a, exception: false)
      num_b = Float(val_b, exception: false)
      comparison = if num_a && num_b
                     num_a <=> num_b
                   else
                     val_a.to_s <=> val_b.to_s
                   end
      comparison = -comparison if direction == :desc
      return comparison if comparison != 0
    end
    0
  end

  def parse_column_spec(spec)
    if spec.is_a?(Hash)
      spec.first        # { 2 => :asc } becomes [2, :asc]
    elsif spec.is_a?(Array)
      spec
    else
      [spec, :asc]      # bare column index defaults to ascending
    end
  end

  def merge_runs(output_path, sort_columns)
    return if @temp_files.empty?
    CSV.open(output_path, 'w') do |output|
      handles = @temp_files.map { |path| CSV.open(path) }
      heap = []
      handles.each_with_index do |csv, idx|
        if row = csv.shift
          heap << [row, idx]
        end
      end
      heap.sort! { |a, b| compare_rows(a[0], b[0], sort_columns) }
      until heap.empty?
        row, idx = heap.shift
        output << row
        if next_row = handles[idx].shift
          insert_pos = heap.bsearch_index do |heap_row, _|
            compare_rows(heap_row, next_row, sort_columns) >= 0
          end
          heap.insert(insert_pos || heap.size, [next_row, idx])
        end
      end
      handles.each(&:close)
    end
  end

  def cleanup_temp_files
    @temp_files.each { |f| File.delete(f) if File.exist?(f) }
  end
end

# Sort by country (ascending), then by revenue (descending)
sorter = MultiColumnExternalSort.new(memory_limit: 50_000)
sorter.sort_file(
  'sales.csv',
  'sorted_sales.csv',
  sort_columns: [
    [2, :asc],   # Country column ascending
    [5, :desc]   # Revenue column descending
  ]
)
Common Pitfalls
Insufficient Buffer Allocation causes excessive I/O overhead. Allocating tiny buffers (hundreds of bytes) triggers a system call for nearly every record. Buffer sizes below the filesystem block size (typically 4KB) prevent OS read-ahead optimization. Minimum buffer sizes should match disk block boundaries, with 64KB or larger preferred for sequential reads.
# Poor: per-record reads with small effective buffering
def slow_merge(file1, file2)
  f1 = File.open(file1, 'r')
  f2 = File.open(file2, 'r')
  # Record-at-a-time reads keep the kernel busy with frequent small requests
end

# Better: pull explicit 64KB blocks, then split records in memory
def fast_merge(file1, file2)
  f1 = File.open(file1, 'r')
  f2 = File.open(file2, 'r')
  chunk1 = f1.read(65536)  # one read call moves 64KB
  chunk2 = f2.read(65536)
  # Split each chunk into records and merge, refilling chunks as they drain
end
Exceeding File Descriptor Limits crashes programs during k-way merge. Operating systems limit the number of simultaneously open files (typically 1024 on Unix systems). Opening 1000 input runs simultaneously fails. The merge factor k must account for this limit, typically staying well below the maximum to leave descriptors for other operations.
Ruby processes inherit the system's file descriptor limit. Check with Process.getrlimit(Process::RLIMIT_NOFILE). For large k values, consider increasing the limit with Process.setrlimit or performing the merge in multiple stages where each stage merges a subset of files.
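A sketch of such a guard, assuming a Unix-like system; the headroom value is an arbitrary safety margin:

```ruby
# Cap the merge factor by the soft file-descriptor limit, leaving
# headroom for stdio, the output file, logs, and libraries.
def safe_merge_factor(requested_k, headroom: 64)
  soft_limit, _hard_limit = Process.getrlimit(Process::RLIMIT_NOFILE)
  [requested_k, soft_limit - headroom].min
end

safe_merge_factor(1_000_000)  # clamped well below the descriptor limit
```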
Incorrect Merge Logic produces incorrect output. A common error involves failing to handle buffer exhaustion properly. When an input buffer empties, the code must refill it before continuing comparison. Forgetting this causes records to appear in wrong positions.
# Incorrect: forgets to refill buffers
def broken_merge(files)
  heap = files.map.with_index { |f, i| [f.gets, i] }
  until heap.empty?
    value, idx = heap.min
    puts value
    heap.delete([value, idx]) # Bug: never reads the next line from files[idx]
  end
end

# Correct: refills from the same file after extracting the minimum
def correct_merge(files)
  heap = []
  files.each_with_index do |f, idx|
    line = f.gets
    heap << [line, idx] if line   # skip already-empty inputs
  end
  until heap.empty?
    value, idx = heap.min
    puts value
    heap.delete([value, idx])
    if next_value = files[idx].gets
      heap << [next_value, idx]
    end
  end
end
Disk Space Exhaustion halts sorting mid-process. External sorting requires temporary storage approximately equal to input size. A 100 GB input needs 100 GB for runs plus space for merge outputs. During merge passes, both input runs and output exist simultaneously, requiring 2× input size temporarily.
Monitor available disk space before starting. Calculate required space as input_size × (1 + number_of_passes). Fail early if insufficient space exists rather than crashing mid-sort. Clean up temporary files promptly after merging to release space for subsequent operations.
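A sketch of the pre-flight check; Ruby's standard library has no portable free-space query, so available_bytes must come from the caller (for example via df or a filesystem gem):

```ruby
# Fail fast when temporary space cannot cover run formation plus merge passes.
def check_disk_space!(input_bytes, passes, available_bytes)
  required = input_bytes * (1 + passes)
  if available_bytes < required
    raise "need #{required} bytes of scratch space, only #{available_bytes} free"
  end
  required
end

check_disk_space!(100 * 2**30, 2, 500 * 2**30)  # 300GB needed, 500GB free: ok
```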
Incorrect String Comparison produces unexpected sort orders. String sorting is lexicographic: "10" comes before "2" alphabetically. Numeric fields stored as strings require explicit numeric conversion. Failing to normalize strings (case, whitespace) causes inconsistent ordering.
# Incorrect: Lexicographic comparison of numeric strings
buffer.sort_by { |row| row[:age] } # "9" > "80" lexicographically
# Correct: Parse numbers for numeric comparison
buffer.sort_by { |row| row[:age].to_i }
# Incorrect: Case-sensitive comparison
buffer.sort_by { |row| row[:name] } # "alice" > "Zach"
# Correct: Normalized comparison
buffer.sort_by { |row| row[:name].downcase }
Tempfile Premature Deletion loses data mid-merge. Ruby's Tempfile automatically deletes when garbage collected. Storing only the file path without keeping the Tempfile object reference causes deletion before merge completion.
# Incorrect: Tempfile gets garbage collected and unlinked
def create_run(data)
  temp = Tempfile.new('run')
  temp.write(data.sort.join("\n"))
  temp.close
  temp.path # Only the path escapes; the Tempfile may be GC'd and deleted
end

# Correct: Tempfile.create returns a plain File with no GC finalizer,
# so the file persists until explicitly deleted
def create_run(data)
  temp = Tempfile.create('run')
  temp.write(data.sort.join("\n"))
  temp.close
  temp.path # Safe: caller deletes the file when finished
end
Memory Underestimation causes paging and severe slowdown. Calculating memory for buffers must account for Ruby object overhead. Each string carries object header overhead (40+ bytes in Ruby). A buffer of 100,000 strings consumes far more than the string content size.
Monitor actual memory usage during run formation. If the system begins swapping to disk, external sorting becomes internal sorting with paging overhead. Reduce memory limit setting to stay comfortably within physical RAM, typically using 70-80% of available memory.
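One way to gauge the per-object overhead described above is CRuby's objspace extension; memsize_of is implementation-specific and approximate:

```ruby
require 'objspace'

# Estimate the real footprint of a buffer: content bytes plus object overhead.
def estimated_buffer_bytes(buffer)
  buffer.sum { |record| ObjectSpace.memsize_of(record) }
end

records = Array.new(1_000) { |i| "record-#{i}" }
payload = records.sum(&:bytesize)          # raw character data only
actual  = estimated_buffer_bytes(records)  # includes per-string overhead
# actual exceeds payload: size memory_limit against actual, not payload
```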
Non-Atomic File Operations risk data corruption on failure. Writing output directly to the final destination path means incomplete output on crash. The output file becomes corrupted and unusable. Write to temporary files first, then atomically rename to final destination.
# Risky: a crash mid-write leaves a truncated output file
def merge_runs(output_path)
  File.open(output_path, 'w') do |out|
    # If a crash happens here, output_path is corrupted
    perform_merge(out)
  end
end

# Safer: write to a temp file in the destination directory, then rename
def merge_runs(output_path)
  temp = Tempfile.create('final', File.dirname(output_path))
  perform_merge(temp)
  temp.close
  FileUtils.mv(temp.path, output_path) # Atomic on the same filesystem
end
Reference
Algorithm Complexity Comparison
| Algorithm | I/O Complexity | Merge Passes | Memory Usage | Best Use Case |
|---|---|---|---|---|
| Two-way merge | 2N × (log₂(b) + 1) | log₂(b) | Minimal (3 buffers) | Severely limited memory |
| k-way merge | 2N × (logₖ(b) + 1) | logₖ(b) | k buffers + heap | Balanced approach |
| Replacement selection | 2N × (logₖ(b/2) + 1) | logₖ(b/2) | Heap size M | Random or partial ordering |
| Polyphase merge | ~2N × logₖ(b) | logₖ(b) | k buffers | Sequential media only |
Performance Characteristics
| Dataset Size | Memory Available | Recommended k | Expected Passes | Total I/O Multiplier |
|---|---|---|---|---|
| 1 GB | 100 MB | 8-10 | 1-2 | 4-6× |
| 10 GB | 1 GB | 10-15 | 1-2 | 4-6× |
| 100 GB | 8 GB | 10-20 | 2-3 | 6-8× |
| 1 TB | 64 GB | 15-25 | 2-3 | 6-8× |
Configuration Parameters
| Parameter | Typical Range | Impact | Tuning Guidance |
|---|---|---|---|
| Memory limit | 10K-10M records | Determines run count | Use 70-80% of available RAM |
| Merge factor (k) | 2-50 | Controls pass count | Higher k, fewer passes, more memory |
| Buffer size | 4KB-1MB | I/O efficiency | Match filesystem block size minimum |
| Heap size (replacement) | Same as memory | Run length | Larger heap, longer runs |
File Descriptor Requirements
| Operation | Open Files | Formula | Example (k=10) |
|---|---|---|---|
| Run formation | 2 | 1 input + 1 output | 2 files |
| k-way merge | k + 1 | k inputs + 1 output | 11 files |
| Multi-stage merge | k + 1 per stage | Per stage maximum | 11 files per stage |
| Total system | Varies | Check ulimit -n | Typically 1024 |
Storage Space Requirements
| Phase | Space Needed | Duration | Notes |
|---|---|---|---|
| Run formation | 1× input size | During formation | Temporary run files |
| Single merge pass | 2× input size | During merge | Input runs + output |
| Multi-pass merge | 1.5-2× input size | Peak usage | Overlapping cleanup |
| Final output | 1× input size | Permanent | Final sorted file |
Ruby Implementation Checklist
| Aspect | Requirement | Verification |
|---|---|---|
| Tempfile cleanup | Always close and unlink | Use ensure blocks |
| File descriptor limit | Check before large k | Process.getrlimit check |
| Buffer size | Minimum 4KB | Set explicit buffering |
| Memory monitoring | Track actual usage | GC.stat monitoring |
| Numeric comparison | Convert strings to numbers | Use to_i or to_f |
| Atomic writes | Rename temp to final | FileUtils.mv on completion |
| Error handling | Rescue I/O errors | Handle ENOSPC, EMFILE |
Common Error Codes
| Error | Cause | Resolution |
|---|---|---|
| EMFILE | Too many open files | Reduce merge factor k |
| ENOSPC | No disk space | Free space or reduce dataset |
| ENOMEM | Out of memory | Reduce memory limit setting |
| EACCES | Permission denied | Check directory permissions |
| EXDEV | Cross-device link | Move temp to same filesystem |