Overview
Bucket sort distributes elements from an input array into a finite number of buckets. Each bucket collects elements within a specific range of values. After distribution, the algorithm sorts each bucket individually using another sorting algorithm or recursively applies bucket sort. Finally, it concatenates the sorted buckets to produce the final sorted array.
The algorithm excels when input data distributes uniformly across a known range. Unlike comparison-based sorts that have a theoretical lower bound of O(n log n), bucket sort achieves O(n + k) average time complexity, where n represents the number of elements and k represents the number of buckets. This performance advantage stems from avoiding element-by-element comparisons in favor of distribution-based organization.
Bucket sort belongs to the family of distribution sorts, alongside radix sort and counting sort. While counting sort works with integer keys in a limited range and radix sort processes digits or characters, bucket sort handles floating-point numbers and operates on ranges of values. The algorithm requires knowledge of the input distribution to create appropriate bucket boundaries.
# Simple bucket sort for floating-point numbers in range [0, 1)
def bucket_sort_basic(array)
  return array if array.length <= 1

  bucket_count = array.length
  buckets = Array.new(bucket_count) { [] }

  # Distribute elements into buckets
  array.each do |num|
    bucket_index = (num * bucket_count).to_i
    bucket_index = bucket_count - 1 if bucket_index >= bucket_count
    buckets[bucket_index] << num
  end

  # Sort individual buckets and concatenate
  buckets.flat_map { |bucket| bucket.sort }
end
numbers = [0.78, 0.17, 0.39, 0.26, 0.72, 0.94, 0.21, 0.12, 0.23, 0.68]
bucket_sort_basic(numbers)
# => [0.12, 0.17, 0.21, 0.23, 0.26, 0.39, 0.68, 0.72, 0.78, 0.94]
Key Principles
Bucket sort operates on three fundamental phases: distribution, individual sorting, and concatenation. The distribution phase maps each element to a bucket based on its value and the defined bucket boundaries. The sorting phase processes each bucket independently. The concatenation phase combines the sorted buckets in order to produce the final result.
The bucket mapping function determines the algorithm's efficiency. For uniformly distributed integer keys in range [min, max], the mapping function bucket_index = (element - min) * bucket_count / (max - min + 1) distributes elements evenly across buckets; for real-valued keys, bucket_index = floor((element - min) / (max - min) * (bucket_count - 1)) serves the same purpose. Non-uniform distributions require adjusted mapping functions or adaptive bucket sizing to maintain performance.
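As a quick sanity check on the integer-range formula, a minimal Ruby sketch (the method name is illustrative):

```ruby
# Map an integer key in [min, max] to one of bucket_count buckets,
# using the integer-range mapping formula
def bucket_index_for(element, min, max, bucket_count)
  (element - min) * bucket_count / (max - min + 1)
end

# With min = 0, max = 99 and 10 buckets, keys 0..9 map to bucket 0
# and the maximum key 99 maps to the last bucket, never past it
bucket_index_for(0, 0, 99, 10)   # => 0
bucket_index_for(55, 0, 99, 10)  # => 5
bucket_index_for(99, 0, 99, 10)  # => 9
```

The `+ 1` in the divisor is what keeps the maximum key inside the last bucket; dividing by the plain range would map it one past the end.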
Each bucket maintains insertion order during distribution, allowing stable sort behavior when the individual bucket sort algorithm is also stable. Stability means elements with equal keys maintain their relative order from the input array. This property becomes important when sorting composite objects by multiple attributes.
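A small sketch can make the stability property concrete (the data is illustrative): keys 1 through 3 index directly into three buckets, and because distribution appends to each bucket, the two records sharing key 2 keep their input order.

```ruby
# Pairs of [key, label]; the two records with key 2 share a bucket
records = [[2, :first], [1, :a], [2, :second], [3, :b]]
buckets = Array.new(3) { [] }
records.each { |key, label| buckets[key - 1] << [key, label] }

# Distribution preserves input order inside each bucket, so the
# key-2 records keep their relative order after concatenation
result = buckets.flatten(1)
# => [[1, :a], [2, :first], [2, :second], [3, :b]]
```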
The choice of individual bucket sorting algorithm impacts overall performance. Insertion sort works well for small buckets due to low overhead. Quicksort or mergesort handle larger buckets efficiently. Some implementations recursively apply bucket sort when buckets exceed a threshold size.
Bucket count selection balances memory usage against time complexity. More buckets reduce individual bucket sizes but increase memory overhead and empty bucket processing. Fewer buckets save memory but increase sorting time for larger buckets. The optimal bucket count often equals the input array size for uniformly distributed data.
The algorithm's space complexity includes storage for buckets and temporary sorting space. Using k buckets requires O(k) additional space for bucket arrays, plus O(n) space for element references across all buckets. In-place bucket sorting exists but sacrifices simplicity for memory efficiency.
Ruby Implementation
Ruby's dynamic typing and built-in array methods facilitate bucket sort implementation. The algorithm adapts to different data types by customizing the bucket mapping function. Ruby's enumerable methods streamline element distribution and bucket processing.
class BucketSort
  def self.sort(array, bucket_count: nil)
    return array if array.length <= 1

    bucket_count ||= array.length
    min_val, max_val = array.minmax

    # Handle edge case where all elements are equal
    return array if min_val == max_val

    buckets = Array.new(bucket_count) { [] }
    range = max_val - min_val

    # Distribute elements
    array.each do |element|
      index = ((element - min_val) * (bucket_count - 1) / range).to_i
      buckets[index] << element
    end

    # Sort each bucket and concatenate
    buckets.flat_map { |bucket| bucket.sort }
  end
end
# Integer sorting
integers = [64, 34, 25, 12, 22, 11, 90, 88, 45, 50, 33, 17, 28, 73]
BucketSort.sort(integers)
# => [11, 12, 17, 22, 25, 28, 33, 34, 45, 50, 64, 73, 88, 90]
# Float sorting
floats = [4.2, 1.7, 3.9, 2.1, 5.8, 1.2, 6.4, 3.3]
BucketSort.sort(floats, bucket_count: 5)
# => [1.2, 1.7, 2.1, 3.3, 3.9, 4.2, 5.8, 6.4]
Custom key extractors enable sorting of complex objects. Ruby blocks provide a clean interface for defining both the bucket mapping key and the per-bucket sort key.
class CustomBucketSort
  def self.sort(array, bucket_count: nil, &block)
    return array if array.length <= 1

    bucket_count ||= array.length
    key_extractor = block || ->(x) { x }
    keys = array.map(&key_extractor)
    min_key, max_key = keys.minmax
    return array if min_key == max_key

    buckets = Array.new(bucket_count) { [] }
    range = max_key - min_key

    array.each do |element|
      key = key_extractor.call(element)
      index = ((key - min_key) * (bucket_count - 1) / range).to_i
      buckets[index] << element
    end

    buckets.flat_map { |bucket| bucket.sort_by(&key_extractor) }
  end
end
Person = Struct.new(:name, :age, :salary)
people = [
  Person.new("Alice", 30, 75000),
  Person.new("Bob", 25, 65000),
  Person.new("Charlie", 35, 95000),
  Person.new("Diana", 28, 70000),
  Person.new("Eve", 32, 85000)
]
# Sort by salary
sorted = CustomBucketSort.sort(people, bucket_count: 3) { |p| p.salary }
sorted.map(&:name)
# => ["Bob", "Diana", "Alice", "Eve", "Charlie"]
For negative numbers, the implementation adjusts the bucket mapping to handle the full range.
def bucket_sort_with_negatives(array, bucket_count: nil)
  return array if array.length <= 1

  bucket_count ||= array.length
  min_val, max_val = array.minmax
  return array if min_val == max_val

  buckets = Array.new(bucket_count) { [] }
  range = max_val - min_val

  array.each do |num|
    # Shift all values to positive range
    normalized = num - min_val
    index = (normalized * (bucket_count - 1) / range).to_i
    buckets[index] << num
  end

  buckets.flat_map(&:sort)
end
mixed = [-15, 8, -3, 12, 0, -7, 19, -12, 5, -1]
bucket_sort_with_negatives(mixed)
# => [-15, -12, -7, -3, -1, 0, 5, 8, 12, 19]
Recursive bucket sort applies the algorithm to buckets exceeding a size threshold, avoiding quadratic behavior in individual bucket sorts.
def recursive_bucket_sort(array, threshold: 10)
  return array if array.length <= 1
  return array.sort if array.length <= threshold

  bucket_count = array.length
  min_val, max_val = array.minmax
  return array if min_val == max_val

  buckets = Array.new(bucket_count) { [] }
  range = max_val - min_val

  array.each do |num|
    index = ((num - min_val) * (bucket_count - 1) / range).to_i
    buckets[index] << num
  end

  buckets.flat_map do |bucket|
    bucket.length > threshold ? recursive_bucket_sort(bucket, threshold: threshold) : bucket.sort
  end
end
Practical Examples
Sorting exam scores demonstrates bucket sort's effectiveness with known value ranges. Educational institutions often have scores between 0 and 100, providing natural bucket boundaries.
def sort_exam_scores(scores)
  return scores if scores.length <= 1

  # Buckets for score ranges 0-9, 10-19, ..., 90-99, plus one for 100
  buckets = Array.new(11) { [] }

  scores.each do |score|
    bucket_index = score / 10
    bucket_index = 10 if bucket_index > 10 # Clamp any out-of-range score above 100
    buckets[bucket_index] << score
  end

  buckets.flat_map { |bucket| bucket.sort }
end
exam_results = [95, 72, 88, 63, 91, 55, 78, 84, 67, 93, 58, 82, 70, 89, 76]
sorted_scores = sort_exam_scores(exam_results)
# => [55, 58, 63, 67, 70, 72, 76, 78, 82, 84, 88, 89, 91, 93, 95]
# Calculate grade distribution
grade_counts = sorted_scores.group_by { |s| s / 10 }.transform_values(&:count)
# => {5=>2, 6=>2, 7=>4, 8=>4, 9=>3}
Processing sensor readings illustrates bucket sort for real-time data streams. Temperature sensors reporting values in a known range benefit from distribution-based sorting.
SensorReading = Struct.new(:timestamp, :temperature, :sensor_id)
def sort_sensor_readings(readings)
return readings if readings.length <= 1
# Temperature range: -40°C to 60°C
min_temp, max_temp = -40, 60
bucket_count = 20 # 5-degree ranges
range = max_temp - min_temp
buckets = Array.new(bucket_count) { [] }
readings.each do |reading|
temp = reading.temperature
# Clamp temperature to valid range
temp = [[temp, min_temp].max, max_temp].min
index = ((temp - min_temp) * (bucket_count - 1) / range).to_i
buckets[index] << reading
end
buckets.flat_map { |bucket| bucket.sort_by(&:temperature) }
end
readings = [
  SensorReading.new(Time.now, 22.5, "S1"),
  SensorReading.new(Time.now, 18.3, "S2"),
  SensorReading.new(Time.now, 25.7, "S3"),
  SensorReading.new(Time.now, 19.1, "S4"),
  SensorReading.new(Time.now, 23.8, "S5"),
  SensorReading.new(Time.now, 17.6, "S6")
]
sorted_readings = sort_sensor_readings(readings)
sorted_readings.map(&:temperature)
# => [17.6, 18.3, 19.1, 22.5, 23.8, 25.7]
External merge sort for large datasets combines bucket sort with disk-based operations. When data exceeds available memory, bucket sort partitions data into manageable chunks.
class ExternalBucketSort
  def initialize(bucket_count: 10, memory_limit: 1000)
    @bucket_count = bucket_count
    @memory_limit = memory_limit # assumed per-bucket capacity for the merge phase
  end

  def sort(input_file, output_file)
    # First pass: distribute to temporary bucket files
    bucket_files = distribute_to_buckets(input_file)
    # Second pass: sort each bucket and merge
    merge_sorted_buckets(bucket_files, output_file)
    # Cleanup
    bucket_files.each { |f| File.delete(f) if File.exist?(f) }
  end

  private

  def distribute_to_buckets(input_file)
    # Stream the file twice so the full dataset never resides in memory:
    # once to find the value range, once to distribute into bucket files
    min_val = Float::INFINITY
    max_val = -Float::INFINITY
    File.foreach(input_file) do |line|
      value = line.to_f
      min_val = value if value < min_val
      max_val = value if value > max_val
    end
    range = max_val - min_val
    range = 1.0 if range.zero? # all values equal; everything lands in bucket 0

    bucket_files = Array.new(@bucket_count) { |i| "bucket_#{i}.tmp" }
    file_handles = bucket_files.map { |f| File.open(f, 'w') }
    File.foreach(input_file) do |line|
      value = line.to_f
      index = ((value - min_val) * (@bucket_count - 1) / range).to_i
      file_handles[index].puts(value)
    end
    file_handles.each(&:close)
    bucket_files
  end

  def merge_sorted_buckets(bucket_files, output_file)
    File.open(output_file, 'w') do |output|
      bucket_files.each do |bucket_file|
        next unless File.exist?(bucket_file) && File.size(bucket_file) > 0

        # Each bucket file is assumed small enough to sort in memory
        bucket_data = File.readlines(bucket_file).map(&:to_f).sort
        bucket_data.each { |value| output.puts(value) }
      end
    end
  end
end
Design Considerations
Bucket sort proves optimal when input data distributes uniformly across a predictable range. E-commerce applications sorting product prices, scientific computing with measurement data, and gaming leaderboards with bounded scores represent ideal use cases. The algorithm degrades when data clusters in few buckets or when the range remains unknown.
Comparison with other sorting algorithms reveals specific trade-offs. Quicksort averages O(n log n) but works with any comparable data without distribution knowledge. Mergesort guarantees O(n log n) worst case and stable sorting but requires O(n) auxiliary space. Heapsort provides O(n log n) in-place sorting but lacks stability. Bucket sort achieves O(n) average time when properly configured but requires distribution knowledge and additional space.
| Algorithm | Average Time | Worst Time | Space | Stable | Distribution Knowledge Required |
|---|---|---|---|---|---|
| Bucket Sort | O(n + k) | O(n²) | O(n + k) | Yes | Yes |
| Quicksort | O(n log n) | O(n²) | O(log n) | No | No |
| Mergesort | O(n log n) | O(n log n) | O(n) | Yes | No |
| Heapsort | O(n log n) | O(n log n) | O(1) | No | No |
| Counting Sort | O(n + k) | O(n + k) | O(k) | Yes | Yes |
The decision to use bucket sort depends on data characteristics. Uniform distribution across a known range favors bucket sort. Skewed distributions or unknown ranges favor comparison sorts. Small datasets make algorithm choice less critical. Large datasets with specific distribution patterns benefit most from bucket sort optimization.
Bucket count selection requires balancing memory and computation. Too few buckets create large individual buckets, increasing sorting time. Too many buckets waste memory on empty buckets and increase overhead. The formula bucket_count = sqrt(n) provides a reasonable starting point, adjusted based on distribution characteristics.
Stability requirements influence implementation choices. Applications needing stable sorts must use stable individual bucket sorting algorithms and preserve insertion order during distribution. Database operations, multi-attribute sorting, and historical data processing often require stability.
Performance Considerations
Bucket sort achieves O(n + k) average time complexity when elements distribute uniformly across k buckets. The distribution phase takes O(n) time to assign each element to a bucket. With uniform distribution, each bucket holds O(1) elements on average, so sorting all buckets takes O(n) expected time. Concatenation requires O(k) to process all buckets, yielding O(n + k) total.
Worst-case performance degrades to O(n²) when all elements fall into a single bucket. This occurs with adversarial input or poor bucket mapping. If one bucket receives all n elements and uses insertion sort, the algorithm performs O(n²) comparisons. Choosing appropriate bucket counts and mapping functions mitigates this risk.
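A small occupancy check makes the degradation visible (the method name and data are illustrative): clustered input lands almost entirely in one bucket, leaving the per-bucket sort to do nearly all the work.

```ruby
# Count how many elements land in each bucket (illustrative helper)
def bucket_occupancy(array, bucket_count)
  min, max = array.minmax
  range = (max - min).to_f
  counts = Array.new(bucket_count, 0)
  array.each do |v|
    index = [((v - min) * (bucket_count - 1) / range).to_i, bucket_count - 1].min
    counts[index] += 1
  end
  counts
end

# Nine values clustered near zero plus one outlier: nearly everything
# lands in the first bucket, so per-bucket sorting degrades toward O(n^2)
clustered = [0.0, 0.01, 0.02, 0.011, 0.015, 0.013, 0.012, 0.018, 0.016, 100.0]
bucket_occupancy(clustered, 10)
# => [9, 0, 0, 0, 0, 0, 0, 0, 0, 1]
```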
Space complexity consists of O(k) for bucket storage and O(n) for element references. The algorithm is not in-place, requiring auxiliary space proportional to input size. Memory-constrained environments must consider this overhead when selecting sorting algorithms.
Cache performance varies with bucket access patterns. Sequential bucket processing exhibits good spatial locality. Random element distribution during the bucketing phase may cause cache misses. Keeping bucket counts within cache line boundaries improves performance on modern processors.
require 'benchmark'

def compare_sorting_algorithms(size)
  data = Array.new(size) { rand(0.0...1.0) }

  Benchmark.bm(20) do |x|
    x.report("Bucket Sort:") do
      bucket_sort_basic(data.dup)
    end
    x.report("Ruby Sort:") do
      data.dup.sort
    end
  end
end
# Results with 10,000 elements show bucket sort advantages
# for uniformly distributed floating-point numbers
The individual bucket sorting algorithm significantly impacts performance. Insertion sort works well for small buckets (< 10 elements) with O(n²) worst case but low overhead. Mergesort guarantees O(n log n) but adds overhead for small inputs. Hybrid approaches use insertion sort for small buckets and mergesort for larger ones.
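A minimal sketch of such a hybrid per-bucket step, assuming a tunable threshold (both method names are illustrative):

```ruby
# Classic in-place insertion sort: efficient for small arrays
def insertion_sort(bucket)
  (1...bucket.length).each do |i|
    value = bucket[i]
    j = i - 1
    while j >= 0 && bucket[j] > value
      bucket[j + 1] = bucket[j]
      j -= 1
    end
    bucket[j + 1] = value
  end
  bucket
end

# Insertion sort for small buckets, Ruby's built-in sort otherwise
def hybrid_bucket_sort_step(bucket, threshold: 10)
  bucket.length <= threshold ? insertion_sort(bucket) : bucket.sort
end

hybrid_bucket_sort_step([3, 1, 2]) # => [1, 2, 3]
```

In a full implementation, `hybrid_bucket_sort_step` would replace the plain `bucket.sort` call inside the concatenation phase.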
Parallel bucket sort distributes buckets across multiple threads or processes. After distribution, each bucket sorts independently without synchronization. This approach scales linearly with processor count for sufficient bucket numbers and balanced distribution.
require 'concurrent' # concurrent-ruby gem

def parallel_bucket_sort(array, thread_count: 4)
  return array if array.length <= 1

  bucket_count = array.length
  min_val, max_val = array.minmax
  return array if min_val == max_val

  buckets = Array.new(bucket_count) { [] }
  range = max_val - min_val

  # Distribution phase
  array.each do |num|
    index = ((num - min_val) * (bucket_count - 1) / range).to_i
    buckets[index] << num
  end

  # Parallel sorting phase
  pool = Concurrent::FixedThreadPool.new(thread_count)
  futures = buckets.map do |bucket|
    Concurrent::Future.execute(executor: pool) { bucket.sort }
  end

  # Wait for all sorting to complete
  sorted_buckets = futures.map(&:value)
  pool.shutdown
  pool.wait_for_termination

  # Concatenation phase
  sorted_buckets.flatten
end
Common Pitfalls
Off-by-one errors in the bucket index calculation cause subtle bugs. The expression (value * bucket_count).to_i truncates toward zero, which is the intended flooring behavior, but it maps the maximum value to index bucket_count, one past the last bucket. Without clamping, boundary values either raise an error or silently land in a nil bucket.
# Problematic: value == max_val maps to index bucket_count, past the last bucket
def buggy_bucket_mapping(value, bucket_count, min_val, max_val)
  range = max_val - min_val
  ((value - min_val) * bucket_count / range).to_i
end

# Fixed: scale by (bucket_count - 1) and clamp the index to the valid range
def correct_bucket_mapping(value, bucket_count, min_val, max_val)
  range = max_val - min_val
  index = ((value - min_val) * (bucket_count - 1) / range).to_i
  [[index, 0].max, bucket_count - 1].min
end
Empty bucket handling affects performance when data distribution leaves many buckets empty. Processing empty buckets wastes cycles during concatenation. Sparse bucket arrays or linked structures reduce this overhead.
# Inefficient: processes all buckets including empty ones
def inefficient_concatenation(buckets)
  buckets.flat_map { |bucket| bucket.sort }
end

# Efficient: skips empty buckets
def efficient_concatenation(buckets)
  buckets.reject(&:empty?).flat_map(&:sort)
end
Floating-point precision errors in bucket mapping create inconsistent element distribution. Values differing by floating-point epsilon may map to different buckets. Stable bucket mapping requires epsilon tolerance.
def bucket_sort_with_epsilon(array, epsilon: 1e-10)
  return array if array.length <= 1

  bucket_count = array.length
  min_val, max_val = array.minmax

  # Add epsilon to prevent boundary issues
  range = (max_val - min_val) + epsilon
  buckets = Array.new(bucket_count) { [] }

  array.each do |num|
    index = ((num - min_val) * bucket_count / range).to_i
    index = bucket_count - 1 if index >= bucket_count
    buckets[index] << num
  end

  buckets.flat_map(&:sort)
end
Non-uniform data distribution causes severe performance degradation. Real-world data often exhibits clustering or patterns that create imbalanced buckets. Adaptive bucket sizing or histogram-based bucket boundary selection addresses this issue.
def adaptive_bucket_sort(array)
  return array if array.length <= 1

  # Analyze distribution to create balanced buckets
  sorted_sample = array.sample([array.length / 10, 100].max).sort

  # Create bucket boundaries from sample percentiles
  bucket_count = Math.sqrt(array.length).to_i
  bucket_boundaries = (1...bucket_count).map do |i|
    percentile = i.to_f / bucket_count
    sorted_sample[(sorted_sample.length * percentile).to_i]
  end

  buckets = Array.new(bucket_count) { [] }
  array.each do |num|
    index = bucket_boundaries.bsearch_index { |boundary| boundary > num } || bucket_count - 1
    buckets[index] << num
  end

  buckets.flat_map(&:sort)
end
Ignoring duplicate elements can break stability guarantees. When multiple elements have identical values, their relative order must be preserved for stable sorting. The bucket distribution and individual sorting algorithms both must maintain stability.
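Ruby's Array#sort and Array#sort_by do not guarantee stability, so a stable per-bucket sort can be forced by using each element's original index as a tiebreaker. A minimal sketch (the helper name is illustrative):

```ruby
# Stable variant of sort_by: equal keys keep their original order
# because the element's index breaks ties
def stable_sort_by(bucket, &key)
  bucket.each_with_index
        .sort_by { |element, i| [key.call(element), i] }
        .map { |element, _| element }
end

pairs = [[2, :x], [2, :y], [1, :z]]
stable_sort_by(pairs) { |p| p[0] }
# => [[1, :z], [2, :x], [2, :y]]
```

Substituting `stable_sort_by` for `sort_by` in the concatenation phase preserves stability end to end.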
Memory allocation patterns cause performance issues with large datasets. Creating many small arrays for buckets triggers frequent garbage collection. Pre-allocating bucket capacity or using a single array with index ranges reduces allocation overhead.
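One way to sketch the single-array approach is a counting pass that sizes each bucket's region up front, followed by direct placement (all names are illustrative; this groups elements by bucket into one pre-allocated array but leaves each region unsorted):

```ruby
# Counting pass sizes each bucket's region; elements are then placed
# directly into one pre-allocated array, avoiding per-bucket allocations
def counted_bucket_place(array, bucket_count)
  min, max = array.minmax
  return array.dup if min == max

  range = (max - min).to_f
  index_of = ->(v) { [((v - min) * (bucket_count - 1) / range).to_i, bucket_count - 1].min }

  # Count how many elements each bucket will hold
  counts = Array.new(bucket_count, 0)
  array.each { |v| counts[index_of.call(v)] += 1 }

  # Prefix sums give each bucket's starting offset in the output array
  offsets = [0]
  counts.each { |c| offsets << offsets.last + c }

  placed = Array.new(array.length)
  cursor = offsets.dup
  array.each do |v|
    b = index_of.call(v)
    placed[cursor[b]] = v
    cursor[b] += 1
  end
  placed # grouped by bucket; each region still needs sorting
end

counted_bucket_place([5, 1, 9, 3], 2)
# => [5, 1, 3, 9]
```

Each bucket's region can then be sorted in place between consecutive offsets, which is essentially how counting sort lays out its output.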
Reference
Complexity Analysis
| Metric | Best Case | Average Case | Worst Case |
|---|---|---|---|
| Time Complexity | O(n + k) | O(n + k) | O(n²) |
| Space Complexity | O(n + k) | O(n + k) | O(n + k) |
| Stability | Stable | Stable | Stable |
| In-Place | No | No | No |
Key Parameters
| Parameter | Description | Typical Value |
|---|---|---|
| n | Number of elements to sort | Variable |
| k | Number of buckets | n or sqrt(n) |
| range | Maximum value minus minimum value | Data-dependent |
| threshold | Bucket size for switching to different algorithm | 10-50 elements |
Bucket Mapping Functions
| Data Type | Mapping Function | Range Requirements |
|---|---|---|
| Floating [0,1) | floor(value * k) | Known: [0, 1) |
| Integer [min,max] | floor((value - min) * k / range) | Known: [min, max] |
| Normalized | floor((value - min) / (max - min) * (k - 1)) | Computed from data |
| Logarithmic | floor(k * log(value) / log(max)) | Positive values |
Algorithm Comparison
| Consideration | Bucket Sort | Comparison Sorts | Counting Sort |
|---|---|---|---|
| Time Complexity | O(n + k) average | O(n log n) | O(n + k) |
| Distribution Knowledge | Required | Not required | Required |
| Value Range | Any numeric | Any comparable | Limited integer range |
| Space Usage | O(n + k) | O(log n) to O(n) | O(k) |
| Stability | Yes (with stable bucket sort) | Varies by algorithm | Yes |
Implementation Checklist
| Step | Consideration | Action |
|---|---|---|
| Input Analysis | Check distribution uniformity | Sample or analyze full dataset |
| Range Detection | Find minimum and maximum values | Single pass or sampling |
| Bucket Count | Determine optimal k value | Use n, sqrt(n), or custom |
| Bucket Creation | Initialize bucket structures | Array of arrays or linked lists |
| Distribution | Map elements to buckets | Apply mapping function |
| Individual Sorting | Sort each bucket | Choose appropriate algorithm |
| Concatenation | Combine sorted buckets | Sequential or parallel merge |
| Edge Cases | Handle equal elements, empty buckets | Add validation logic |
Performance Optimization Techniques
| Technique | Benefit | Trade-off |
|---|---|---|
| Adaptive bucket count | Balanced distribution | Requires data analysis |
| Parallel bucket sorting | Reduced wall time | Added complexity |
| Hybrid bucket/insertion sort | Optimized small buckets | Threshold tuning |
| Pre-allocated buckets | Reduced allocations | Memory overhead |
| Sparse bucket storage | Skip empty buckets | Added indirection |
| Cache-aligned buckets | Improved locality | Platform-specific |
Common Use Cases
| Domain | Application | Data Characteristics |
|---|---|---|
| Education | Grade sorting | Scores 0-100 |
| Gaming | Leaderboard ranking | Bounded score ranges |
| Finance | Transaction sorting | Known price ranges |
| Scientific Computing | Measurement data | Sensor value ranges |
| Graphics | Pixel sorting | RGB values 0-255 |
| Statistics | Histogram generation | Known data distribution |