Overview
Bucket sort distributes elements from an input array into a finite number of buckets. Each bucket collects elements within a specific range of values. After distribution, the algorithm sorts each bucket individually using another sorting algorithm or recursively applies bucket sort. Finally, it concatenates the sorted buckets to produce the final sorted array.
The algorithm excels when input data distributes uniformly across a known range. Unlike comparison-based sorts that have a theoretical lower bound of O(n log n), bucket sort achieves O(n + k) average time complexity, where n represents the number of elements and k represents the number of buckets. This performance advantage stems from avoiding element-by-element comparisons in favor of distribution-based organization.
Bucket sort belongs to the family of distribution sorts, alongside radix sort and counting sort. While counting sort works with integer keys in a limited range and radix sort processes digits or characters, bucket sort handles floating-point numbers and operates on ranges of values. The algorithm requires knowledge of the input distribution to create appropriate bucket boundaries.
# Simple bucket sort for floating-point numbers in range [0, 1)
def bucket_sort_basic(array)
  return array if array.length <= 1

  bucket_count = array.length
  buckets = Array.new(bucket_count) { [] }

  # Distribute elements into buckets
  array.each do |num|
    bucket_index = (num * bucket_count).to_i
    bucket_index = bucket_count - 1 if bucket_index >= bucket_count
    buckets[bucket_index] << num
  end

  # Sort individual buckets and concatenate
  buckets.flat_map { |bucket| bucket.sort }
end
numbers = [0.78, 0.17, 0.39, 0.26, 0.72, 0.94, 0.21, 0.12, 0.23, 0.68]
bucket_sort_basic(numbers)
# => [0.12, 0.17, 0.21, 0.23, 0.26, 0.39, 0.68, 0.72, 0.78, 0.94]
Key Principles
Bucket sort operates on three fundamental phases: distribution, individual sorting, and concatenation. The distribution phase maps each element to a bucket based on its value and the defined bucket boundaries. The sorting phase processes each bucket independently. The concatenation phase combines the sorted buckets in order to produce the final result.
The bucket mapping function determines the algorithm's efficiency. For uniformly distributed integer keys in range [min, max], the mapping function bucket_index = (element - min) * bucket_count / (max - min + 1) distributes elements evenly across buckets; for real-valued keys, bucket_index = floor((element - min) / (max - min) * (bucket_count - 1)) serves the same purpose. Non-uniform distributions require adjusted mapping functions or adaptive bucket sizing to maintain performance.
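As a quick sanity check on the integer-range formula, a minimal Ruby sketch (the method name is illustrative):

```ruby
# Map an integer key in [min, max] to one of bucket_count buckets,
# using the integer-range mapping formula
def bucket_index_for(element, min, max, bucket_count)
  (element - min) * bucket_count / (max - min + 1)
end

# With min = 0, max = 99 and 10 buckets, keys 0..9 map to bucket 0
# and the maximum key 99 maps to the last bucket, never past it
bucket_index_for(0, 0, 99, 10)   # => 0
bucket_index_for(55, 0, 99, 10)  # => 5
bucket_index_for(99, 0, 99, 10)  # => 9
```

The `+ 1` in the divisor is what keeps the maximum key inside the last bucket; dividing by the plain range would map it one past the end.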
Each bucket maintains insertion order during distribution, allowing stable sort behavior when the individual bucket sort algorithm is also stable. Stability means elements with equal keys maintain their relative order from the input array. This property becomes important when sorting composite objects by multiple attributes.
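A small sketch can make the stability property concrete (the data is illustrative): keys 1 through 3 index directly into three buckets, and because distribution appends to each bucket, the two records sharing key 2 keep their input order.

```ruby
# Pairs of [key, label]; the two records with key 2 share a bucket
records = [[2, :first], [1, :a], [2, :second], [3, :b]]
buckets = Array.new(3) { [] }
records.each { |key, label| buckets[key - 1] << [key, label] }

# Distribution preserves input order inside each bucket, so the
# key-2 records keep their relative order after concatenation
result = buckets.flatten(1)
# => [[1, :a], [2, :first], [2, :second], [3, :b]]
```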
The choice of individual bucket sorting algorithm impacts overall performance. Insertion sort works well for small buckets due to low overhead. Quicksort or mergesort handle larger buckets efficiently. Some implementations recursively apply bucket sort when buckets exceed a threshold size.
Bucket count selection balances memory usage against time complexity. More buckets reduce individual bucket sizes but increase memory overhead and empty bucket processing. Fewer buckets save memory but increase sorting time for larger buckets. The optimal bucket count often equals the input array size for uniformly distributed data.
The algorithm's space complexity includes storage for buckets and temporary sorting space. Using k buckets requires O(k) additional space for bucket arrays, plus O(n) space for element references across all buckets. In-place bucket sorting exists but sacrifices simplicity for memory efficiency.
Ruby Implementation
Ruby's dynamic typing and built-in array methods facilitate bucket sort implementation. The algorithm adapts to different data types by customizing the bucket mapping function. Ruby's enumerable methods streamline element distribution and bucket processing.
class BucketSort
  def self.sort(array, bucket_count: nil)
    return array if array.length <= 1

    bucket_count ||= array.length
    min_val, max_val = array.minmax

    # Handle edge case where all elements are equal
    return array if min_val == max_val

    buckets = Array.new(bucket_count) { [] }
    range = max_val - min_val

    # Distribute elements
    array.each do |element|
      index = ((element - min_val) * (bucket_count - 1) / range).to_i
      buckets[index] << element
    end

    # Sort each bucket and concatenate
    buckets.flat_map { |bucket| bucket.sort }
  end
end
# Integer sorting
integers = [64, 34, 25, 12, 22, 11, 90, 88, 45, 50, 33, 17, 28, 73]
BucketSort.sort(integers)
# => [11, 12, 17, 22, 25, 28, 33, 34, 45, 50, 64, 73, 88, 90]
# Float sorting
floats = [4.2, 1.7, 3.9, 2.1, 5.8, 1.2, 6.4, 3.3]
BucketSort.sort(floats, bucket_count: 5)
# => [1.2, 1.7, 2.1, 3.3, 3.9, 4.2, 5.8, 6.4]
Custom key extractors enable sorting of complex objects. Ruby blocks provide a clean interface for defining both the bucket mapping key and the per-bucket sort key.
class CustomBucketSort
  def self.sort(array, bucket_count: nil, &block)
    return array if array.length <= 1

    bucket_count ||= array.length
    key_extractor = block || ->(x) { x }
    keys = array.map(&key_extractor)
    min_key, max_key = keys.minmax
    return array if min_key == max_key

    buckets = Array.new(bucket_count) { [] }
    range = max_key - min_key

    array.each do |element|
      key = key_extractor.call(element)
      index = ((key - min_key) * (bucket_count - 1) / range).to_i
      buckets[index] << element
    end

    buckets.flat_map { |bucket| bucket.sort_by(&key_extractor) }
  end
end
Person = Struct.new(:name, :age, :salary)
people = [
  Person.new("Alice", 30, 75000),
  Person.new("Bob", 25, 65000),
  Person.new("Charlie", 35, 95000),
  Person.new("Diana", 28, 70000),
  Person.new("Eve", 32, 85000)
]
# Sort by salary
sorted = CustomBucketSort.sort(people, bucket_count: 3) { |p| p.salary }
sorted.map(&:name)
# => ["Bob", "Diana", "Alice", "Eve", "Charlie"]
For negative numbers, the implementation adjusts the bucket mapping to handle the full range.
def bucket_sort_with_negatives(array, bucket_count: nil)
  return array if array.length <= 1

  bucket_count ||= array.length
  min_val, max_val = array.minmax
  return array if min_val == max_val

  buckets = Array.new(bucket_count) { [] }
  range = max_val - min_val

  array.each do |num|
    # Shift all values to positive range
    normalized = num - min_val
    index = (normalized * (bucket_count - 1) / range).to_i
    buckets[index] << num
  end

  buckets.flat_map(&:sort)
end
mixed = [-15, 8, -3, 12, 0, -7, 19, -12, 5, -1]
bucket_sort_with_negatives(mixed)
# => [-15, -12, -7, -3, -1, 0, 5, 8, 12, 19]
Recursive bucket sort applies the algorithm to buckets exceeding a size threshold, avoiding quadratic behavior in individual bucket sorts.
def recursive_bucket_sort(array, threshold: 10)
  return array if array.length <= 1
  return array.sort if array.length <= threshold

  bucket_count = array.length
  min_val, max_val = array.minmax
  return array if min_val == max_val

  buckets = Array.new(bucket_count) { [] }
  range = max_val - min_val

  array.each do |num|
    index = ((num - min_val) * (bucket_count - 1) / range).to_i
    buckets[index] << num
  end

  buckets.flat_map do |bucket|
    bucket.length > threshold ? recursive_bucket_sort(bucket, threshold: threshold) : bucket.sort
  end
end
Practical Examples
Sorting exam scores demonstrates bucket sort's effectiveness with known value ranges. Educational institutions often have scores between 0 and 100, providing natural bucket boundaries.
def sort_exam_scores(scores)
  return scores if scores.length <= 1

  # Buckets for score ranges 0-9, 10-19, ..., 90-99, plus one for 100
  buckets = Array.new(11) { [] }

  scores.each do |score|
    bucket_index = score / 10
    bucket_index = 10 if bucket_index > 10 # Clamp any out-of-range score above 100
    buckets[bucket_index] << score
  end

  buckets.flat_map { |bucket| bucket.sort }
end
exam_results = [95, 72, 88, 63, 91, 55, 78, 84, 67, 93, 58, 82, 70, 89, 76]
sorted_scores = sort_exam_scores(exam_results)
# => [55, 58, 63, 67, 70, 72, 76, 78, 82, 84, 88, 89, 91, 93, 95]
# Calculate grade distribution
grade_counts = sorted_scores.group_by { |s| s / 10 }.transform_values(&:count)
# => {5=>2, 6=>2, 7=>4, 8=>4, 9=>3}
Processing sensor readings illustrates bucket sort for real-time data streams. Temperature sensors reporting values in a known range benefit from distribution-based sorting.
SensorReading = Struct.new(:timestamp, :temperature, :sensor_id)
def sort_sensor_readings(readings)
return readings if readings.length <= 1
# Temperature range: -40°C to 60°C
min_temp, max_temp = -40, 60
bucket_count = 20 # 5-degree ranges
range = max_temp - min_temp
buckets = Array.new(bucket_count) { [] }
readings.each do |reading|
temp = reading.temperature
# Clamp temperature to valid range
temp = [[temp, min_temp].max, max_temp].min
index = ((temp - min_temp) * (bucket_count - 1) / range).to_i
buckets[index] << reading
end
buckets.flat_map { |bucket| bucket.sort_by(&:temperature) }
end
readings = [
  SensorReading.new(Time.now, 22.5, "S1"),
  SensorReading.new(Time.now, 18.3, "S2"),
  SensorReading.new(Time.now, 25.7, "S3"),
  SensorReading.new(Time.now, 19.1, "S4"),
  SensorReading.new(Time.now, 23.8, "S5"),
  SensorReading.new(Time.now, 17.6, "S6")
]
sorted_readings = sort_sensor_readings(readings)
sorted_readings.map(&:temperature)
# => [17.6, 18.3, 19.1, 22.5, 23.8, 25.7]
External merge sort for large datasets combines bucket sort with disk-based operations. When data exceeds available memory, bucket sort partitions data into manageable chunks.
class ExternalBucketSort
  def initialize(bucket_count: 10, memory_limit: 1000)
    @bucket_count = bucket_count
    @memory_limit = memory_limit # assumed per-bucket capacity for the merge phase
  end

  def sort(input_file, output_file)
    # First pass: distribute to temporary bucket files
    bucket_files = distribute_to_buckets(input_file)
    # Second pass: sort each bucket and merge
    merge_sorted_buckets(bucket_files, output_file)
    # Cleanup
    bucket_files.each { |f| File.delete(f) if File.exist?(f) }
  end

  private

  def distribute_to_buckets(input_file)
    # Stream the file twice so the full dataset never resides in memory:
    # once to find the value range, once to distribute into bucket files
    min_val = Float::INFINITY
    max_val = -Float::INFINITY
    File.foreach(input_file) do |line|
      value = line.to_f
      min_val = value if value < min_val
      max_val = value if value > max_val
    end
    range = max_val - min_val
    range = 1.0 if range.zero? # all values equal; everything lands in bucket 0

    bucket_files = Array.new(@bucket_count) { |i| "bucket_#{i}.tmp" }
    file_handles = bucket_files.map { |f| File.open(f, 'w') }
    File.foreach(input_file) do |line|
      value = line.to_f
      index = ((value - min_val) * (@bucket_count - 1) / range).to_i
      file_handles[index].puts(value)
    end
    file_handles.each(&:close)
    bucket_files
  end

  def merge_sorted_buckets(bucket_files, output_file)
    File.open(output_file, 'w') do |output|
      bucket_files.each do |bucket_file|
        next unless File.exist?(bucket_file) && File.size(bucket_file) > 0

        # Each bucket file is assumed small enough to sort in memory
        bucket_data = File.readlines(bucket_file).map(&:to_f).sort
        bucket_data.each { |value| output.puts(value) }
      end
    end
  end
end
Design Considerations
Bucket sort proves optimal when input data distributes uniformly across a predictable range. E-commerce applications sorting product prices, scientific computing with measurement data, and gaming leaderboards with bounded scores represent ideal use cases. The algorithm degrades when data clusters in few buckets or when the range remains unknown.
Comparison with other sorting algorithms reveals specific trade-offs. Quicksort averages O(n log n) but works with any comparable data without distribution knowledge. Mergesort guarantees O(n log n) worst case and stable sorting but requires O(n) auxiliary space. Heapsort provides O(n log n) in-place sorting but lacks stability. Bucket sort achieves O(n) average time when properly configured but requires distribution knowledge and additional space.
| Algorithm | Average Time | Worst Time | Space | Stable | Distribution Knowledge Required |
|---|---|---|---|---|---|
| Bucket Sort | O(n + k) | O(n²) | O(n + k) | Yes | Yes |
| Quicksort | O(n log n) | O(n²) | O(log n) | No | No |
| Mergesort | O(n log n) | O(n log n) | O(n) | Yes | No |
| Heapsort | O(n log n) | O(n log n) | O(1) | No | No |
| Counting Sort | O(n + k) | O(n + k) | O(k) | Yes | Yes |
The decision to use bucket sort depends on data characteristics. Uniform distribution across a known range favors bucket sort. Skewed distributions or unknown ranges favor comparison sorts. Small datasets make algorithm choice less critical. Large datasets with specific distribution patterns benefit most from bucket sort optimization.
Bucket count selection requires balancing memory and computation. Too few buckets create large individual buckets, increasing sorting time. Too many buckets waste memory on empty buckets and increase overhead. The formula bucket_count = sqrt(n) provides a reasonable starting point, adjusted based on distribution characteristics.
Stability requirements influence implementation choices. Applications needing stable sorts must use stable individual bucket sorting algorithms and preserve insertion order during distribution. Database operations, multi-attribute sorting, and historical data processing often require stability.
Performance Considerations
Bucket sort achieves O(n + k) average time complexity when elements distribute uniformly across k buckets. The distribution phase takes O(n) time to assign each element to a bucket. With uniform distribution, each bucket holds O(1) elements on average, so sorting all buckets takes O(n) expected time. Concatenation requires O(k) to process all buckets, yielding O(n + k) total.
Worst-case performance degrades to O(n²) when all elements fall into a single bucket. This occurs with adversarial input or poor bucket mapping. If one bucket receives all n elements and uses insertion sort, the algorithm performs O(n²) comparisons. Choosing appropriate bucket counts and mapping functions mitigates this risk.
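A small occupancy check makes the degradation visible (the method name and data are illustrative): clustered input lands almost entirely in one bucket, leaving the per-bucket sort to do nearly all the work.

```ruby
# Count how many elements land in each bucket (illustrative helper)
def bucket_occupancy(array, bucket_count)
  min, max = array.minmax
  range = (max - min).to_f
  counts = Array.new(bucket_count, 0)
  array.each do |v|
    index = [((v - min) * (bucket_count - 1) / range).to_i, bucket_count - 1].min
    counts[index] += 1
  end
  counts
end

# Nine values clustered near zero plus one outlier: nearly everything
# lands in the first bucket, so per-bucket sorting degrades toward O(n^2)
clustered = [0.0, 0.01, 0.02, 0.011, 0.015, 0.013, 0.012, 0.018, 0.016, 100.0]
bucket_occupancy(clustered, 10)
# => [9, 0, 0, 0, 0, 0, 0, 0, 0, 1]
```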
Space complexity consists of O(k) for bucket storage and O(n) for element references. The algorithm is not in-place, requiring auxiliary space proportional to input size. Memory-constrained environments must consider this overhead when selecting sorting algorithms.
Cache performance varies with bucket access patterns. Sequential bucket processing exhibits good spatial locality. Random element distribution during the bucketing phase may cause cache misses. Keeping bucket counts within cache line boundaries improves performance on modern processors.
require 'benchmark'

def compare_sorting_algorithms(size)
  data = Array.new(size) { rand(0.0...1.0) }

  Benchmark.bm(20) do |x|
    x.report("Bucket Sort:") do
      bucket_sort_basic(data.dup)
    end
    x.report("Ruby Sort:") do
      data.dup.sort
    end
  end
end
# Results with 10,000 elements show bucket sort advantages
# for uniformly distributed floating-point numbers
The individual bucket sorting algorithm significantly impacts performance. Insertion sort works well for small buckets (< 10 elements) with O(n²) worst case but low overhead. Mergesort guarantees O(n log n) but adds overhead for small inputs. Hybrid approaches use insertion sort for small buckets and mergesort for larger ones.
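A minimal sketch of such a hybrid per-bucket step, assuming a tunable threshold (both method names are illustrative):

```ruby
# Classic in-place insertion sort: efficient for small arrays
def insertion_sort(bucket)
  (1...bucket.length).each do |i|
    value = bucket[i]
    j = i - 1
    while j >= 0 && bucket[j] > value
      bucket[j + 1] = bucket[j]
      j -= 1
    end
    bucket[j + 1] = value
  end
  bucket
end

# Insertion sort for small buckets, Ruby's built-in sort otherwise
def hybrid_bucket_sort_step(bucket, threshold: 10)
  bucket.length <= threshold ? insertion_sort(bucket) : bucket.sort
end

hybrid_bucket_sort_step([3, 1, 2]) # => [1, 2, 3]
```

In a full implementation, `hybrid_bucket_sort_step` would replace the plain `bucket.sort` call inside the concatenation phase.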
Parallel bucket sort distributes buckets across multiple threads or processes. After distribution, each bucket sorts independently without synchronization. This approach scales linearly with processor count for sufficient bucket numbers and balanced distribution.
require 'concurrent' # concurrent-ruby gem

def parallel_bucket_sort(array, thread_count: 4)
  return array if array.length <= 1

  bucket_count = array.length
  min_val, max_val = array.minmax
  return array if min_val == max_val

  buckets = Array.new(bucket_count) { [] }
  range = max_val - min_val

  # Distribution phase
  array.each do |num|
    index = ((num - min_val) * (bucket_count - 1) / range).to_i
    buckets[index] << num
  end

  # Parallel sorting phase
  pool = Concurrent::FixedThreadPool.new(thread_count)
  futures = buckets.map do |bucket|
    Concurrent::Future.execute(executor: pool) { bucket.sort }
  end

  # Wait for all sorting to complete
  sorted_buckets = futures.map(&:value)
  pool.shutdown
  pool.wait_for_termination

  # Concatenation phase
  sorted_buckets.flatten
end
Common Pitfalls
Off-by-one errors in the bucket index calculation cause subtle bugs. The expression (value * bucket_count).to_i truncates toward zero, which is the intended flooring behavior, but it maps the maximum value to index bucket_count, one past the last bucket. Without clamping, boundary values either raise an error or silently land in a nil bucket.
# Problematic: value == max_val maps to index bucket_count, past the last bucket
def buggy_bucket_mapping(value, bucket_count, min_val, max_val)
  range = max_val - min_val
  ((value - min_val) * bucket_count / range).to_i
end

# Fixed: scale by (bucket_count - 1) and clamp the index to the valid range
def correct_bucket_mapping(value, bucket_count, min_val, max_val)
  range = max_val - min_val
  index = ((value - min_val) * (bucket_count - 1) / range).to_i
  [[index, 0].max, bucket_count - 1].min
end
Empty bucket handling affects performance when data distribution leaves many buckets empty. Processing empty buckets wastes cycles during concatenation. Sparse bucket arrays or linked structures reduce this overhead.
# Inefficient: processes all buckets including empty ones
def inefficient_concatenation(buckets)
  buckets.flat_map { |bucket| bucket.sort }
end

# Efficient: skips empty buckets
def efficient_concatenation(buckets)
  buckets.reject(&:empty?).flat_map(&:sort)
end
Floating-point precision errors in bucket mapping create inconsistent element distribution. Values differing by floating-point epsilon may map to different buckets. Stable bucket mapping requires epsilon tolerance.
def bucket_sort_with_epsilon(array, epsilon: 1e-10)
  return array if array.length <= 1

  bucket_count = array.length
  min_val, max_val = array.minmax

  # Add epsilon to prevent boundary issues
  range = (max_val - min_val) + epsilon
  buckets = Array.new(bucket_count) { [] }

  array.each do |num|
    index = ((num - min_val) * bucket_count / range).to_i
    index = bucket_count - 1 if index >= bucket_count
    buckets[index] << num
  end

  buckets.flat_map(&:sort)
end
Non-uniform data distribution causes severe performance degradation. Real-world data often exhibits clustering or patterns that create imbalanced buckets. Adaptive bucket sizing or histogram-based bucket boundary selection addresses this issue.
def adaptive_bucket_sort(array)
  return array if array.length <= 1

  # Analyze distribution to create balanced buckets
  sorted_sample = array.sample([array.length / 10, 100].max).sort

  # Create bucket boundaries from sample percentiles
  bucket_count = Math.sqrt(array.length).to_i
  bucket_boundaries = (1...bucket_count).map do |i|
    percentile = i.to_f / bucket_count
    sorted_sample[(sorted_sample.length * percentile).to_i]
  end

  buckets = Array.new(bucket_count) { [] }
  array.each do |num|
    index = bucket_boundaries.bsearch_index { |boundary| boundary > num } || bucket_count - 1
    buckets[index] << num
  end

  buckets.flat_map(&:sort)
end
Ignoring duplicate elements can break stability guarantees. When multiple elements have identical values, their relative order must be preserved for stable sorting. The bucket distribution and individual sorting algorithms both must maintain stability.
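Ruby's Array#sort and Array#sort_by do not guarantee stability, so a stable per-bucket sort can be forced by using each element's original index as a tiebreaker. A minimal sketch (the helper name is illustrative):

```ruby
# Stable variant of sort_by: equal keys keep their original order
# because the element's index breaks ties
def stable_sort_by(bucket, &key)
  bucket.each_with_index
        .sort_by { |element, i| [key.call(element), i] }
        .map { |element, _| element }
end

pairs = [[2, :x], [2, :y], [1, :z]]
stable_sort_by(pairs) { |p| p[0] }
# => [[1, :z], [2, :x], [2, :y]]
```

Substituting `stable_sort_by` for `sort_by` in the concatenation phase preserves stability end to end.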
Memory allocation patterns cause performance issues with large datasets. Creating many small arrays for buckets triggers frequent garbage collection. Pre-allocating bucket capacity or using a single array with index ranges reduces allocation overhead.
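One way to sketch the single-array approach is a counting pass that sizes each bucket's region up front, followed by direct placement (all names are illustrative; this groups elements by bucket into one pre-allocated array but leaves each region unsorted):

```ruby
# Counting pass sizes each bucket's region; elements are then placed
# directly into one pre-allocated array, avoiding per-bucket allocations
def counted_bucket_place(array, bucket_count)
  min, max = array.minmax
  return array.dup if min == max

  range = (max - min).to_f
  index_of = ->(v) { [((v - min) * (bucket_count - 1) / range).to_i, bucket_count - 1].min }

  # Count how many elements each bucket will hold
  counts = Array.new(bucket_count, 0)
  array.each { |v| counts[index_of.call(v)] += 1 }

  # Prefix sums give each bucket's starting offset in the output array
  offsets = [0]
  counts.each { |c| offsets << offsets.last + c }

  placed = Array.new(array.length)
  cursor = offsets.dup
  array.each do |v|
    b = index_of.call(v)
    placed[cursor[b]] = v
    cursor[b] += 1
  end
  placed # grouped by bucket; each region still needs sorting
end

counted_bucket_place([5, 1, 9, 3], 2)
# => [5, 1, 3, 9]
```

Each bucket's region can then be sorted in place between consecutive offsets, which is essentially how counting sort lays out its output.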
Reference
Complexity Analysis
| Metric | Best Case | Average Case | Worst Case |
|---|---|---|---|
| Time Complexity | O(n + k) | O(n + k) | O(n²) |
| Space Complexity | O(n + k) | O(n + k) | O(n + k) |
| Stability | Stable | Stable | Stable |
| In-Place | No | No | No |
Key Parameters
| Parameter | Description | Typical Value |
|---|---|---|
| n | Number of elements to sort | Variable |
| k | Number of buckets | n or sqrt(n) |
| range | Maximum value minus minimum value | Data-dependent |
| threshold | Bucket size for switching to different algorithm | 10-50 elements |
Bucket Mapping Functions
| Data Type | Mapping Function | Range Requirements |
|---|---|---|
| Floating [0,1) | floor(value * k) | Known: [0, 1) |
| Integer [min,max] | floor((value - min) * k / range) | Known: [min, max] |
| Normalized | floor((value - min) / (max - min) * (k - 1)) | Computed from data |
| Logarithmic | floor(k * log(value) / log(max)) | Positive values |
Algorithm Comparison
| Consideration | Bucket Sort | Comparison Sorts | Counting Sort |
|---|---|---|---|
| Time Complexity | O(n + k) average | O(n log n) | O(n + k) |
| Distribution Knowledge | Required | Not required | Required |
| Value Range | Any numeric | Any comparable | Limited integer range |
| Space Usage | O(n + k) | O(log n) to O(n) | O(k) |
| Stability | Yes (with stable bucket sort) | Varies by algorithm | Yes |
Implementation Checklist
| Step | Consideration | Action |
|---|---|---|
| Input Analysis | Check distribution uniformity | Sample or analyze full dataset |
| Range Detection | Find minimum and maximum values | Single pass or sampling |
| Bucket Count | Determine optimal k value | Use n, sqrt(n), or custom |
| Bucket Creation | Initialize bucket structures | Array of arrays or linked lists |
| Distribution | Map elements to buckets | Apply mapping function |
| Individual Sorting | Sort each bucket | Choose appropriate algorithm |
| Concatenation | Combine sorted buckets | Sequential or parallel merge |
| Edge Cases | Handle equal elements, empty buckets | Add validation logic |
Performance Optimization Techniques
| Technique | Benefit | Trade-off |
|---|---|---|
| Adaptive bucket count | Balanced distribution | Requires data analysis |
| Parallel bucket sorting | Reduced wall time | Added complexity |
| Hybrid bucket/insertion sort | Optimized small buckets | Threshold tuning |
| Pre-allocated buckets | Reduced allocations | Memory overhead |
| Sparse bucket storage | Skip empty buckets | Added indirection |
| Cache-aligned buckets | Improved locality | Platform-specific |
Common Use Cases
| Domain | Application | Data Characteristics |
|---|---|---|
| Education | Grade sorting | Scores 0-100 |
| Gaming | Leaderboard ranking | Bounded score ranges |
| Finance | Transaction sorting | Known price ranges |
| Scientific Computing | Measurement data | Sensor value ranges |
| Graphics | Pixel sorting | RGB values 0-255 |
| Statistics | Histogram generation | Known data distribution |