Overview
Hash tables provide constant-time average complexity for insertion, deletion, and lookup operations by mapping keys to array indices through hash functions. This data structure stores key-value pairs and uses a hash function to compute an index into an array of buckets or slots from which the desired value can be found.
The hash table concept emerged in the 1950s, with early implementations appearing in assembly language programs. The fundamental mechanism remains unchanged: a hash function transforms a key of arbitrary size into an integer that serves as an index into a fixed-size array. When two keys hash to the same index, a collision occurs, requiring a resolution strategy.
Hash tables serve as the foundation for numerous data structures and algorithms in software development. Databases use hash-based indexing for rapid record retrieval. Programming languages implement dictionaries, sets, and associative arrays using hash tables. Caching systems rely on hash tables to store and retrieve cached data. Compilers use hash tables for symbol tables during lexical analysis.
# Basic hash table operation in Ruby
user_ages = { "Alice" => 30, "Bob" => 25, "Carol" => 35 }
# O(1) average time complexity for access
puts user_ages["Alice"] # => 30
# O(1) average time complexity for insertion
user_ages["David"] = 28
# O(1) average time complexity for deletion
user_ages.delete("Bob")
The efficiency of hash tables depends on the quality of the hash function and the collision resolution strategy. A poor hash function creates clusters of collisions, degrading performance from O(1) to O(n). The load factor, defined as the ratio of stored elements to array capacity, directly impacts collision frequency and performance.
Key Principles
A hash function accepts a key as input and produces an integer output that serves as an index into the hash table's underlying array. The function must be deterministic, always producing the same output for a given input. The ideal hash function distributes keys uniformly across the available array indices, minimizing collisions.
Hash functions operate through several stages. First, the function converts the key into an integer if the key is not already numeric. String keys require encoding schemes like summing ASCII values or using polynomial rolling hash techniques. Second, the function maps this integer into the valid range of array indices, typically using the modulo operation with the array size.
# Simple hash function demonstration
def simple_hash(string, array_size)
  hash_value = 0
  string.each_char { |char| hash_value += char.ord }
  hash_value % array_size
end
puts simple_hash("hello", 10) # => 2
puts simple_hash("world", 10) # => 2 (collision: different key, same index)
puts simple_hash("hello", 10) # => 2 (deterministic)
Collision resolution strategies fall into two categories: separate chaining and open addressing. Separate chaining stores multiple key-value pairs at each array index using a linked list, array, or other data structure. When a collision occurs, the new element appends to the existing chain at that index. Lookups require traversing the chain to find the matching key.
Open addressing stores all elements directly in the hash table array without separate data structures. When a collision occurs, the algorithm probes for the next available slot using a defined sequence. Linear probing checks consecutive slots, adding a fixed offset each time. Quadratic probing uses quadratic increments, and double hashing applies a secondary hash function to determine the probe sequence.
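The three probe sequences can be compared side by side by generating the indices each strategy would visit after a collision at the same home slot. This is an illustrative sketch: the table size of 11, the step of 5, and the helper names are arbitrary choices, not part of any standard API.

```ruby
# First few probe indices for each open addressing strategy,
# starting from the same home slot in a table of size 11
def linear_probes(start, table_size, count)
  (0...count).map { |i| (start + i) % table_size }
end

def quadratic_probes(start, table_size, count)
  (0...count).map { |i| (start + i * i) % table_size }
end

# The step would come from a secondary hash function in practice
def double_hash_probes(start, step, table_size, count)
  (0...count).map { |i| (start + i * step) % table_size }
end

puts linear_probes(3, 11, 5).inspect       # => [3, 4, 5, 6, 7]
puts quadratic_probes(3, 11, 5).inspect    # => [3, 4, 7, 1, 8]
puts double_hash_probes(3, 5, 11, 5).inspect # => [3, 8, 2, 7, 1]
```

Linear probing's consecutive indices show why it clusters: every key that lands near slot 3 competes for the same run of slots, while the other two strategies scatter their probes.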
# Separate chaining collision resolution
class ChainedHashTable
  def initialize(size = 10)
    @buckets = Array.new(size) { [] }
    @size = size
  end

  def insert(key, value)
    index = hash_function(key)
    bucket = @buckets[index]
    # Update existing key or append new pair
    existing = bucket.find { |k, v| k == key }
    if existing
      existing[1] = value
    else
      bucket << [key, value]
    end
  end

  def get(key)
    index = hash_function(key)
    bucket = @buckets[index]
    pair = bucket.find { |k, v| k == key }
    pair ? pair[1] : nil
  end

  private

  def hash_function(key)
    key.hash % @size
  end
end
The load factor α = n/m, where n represents the number of stored elements and m represents the array capacity, determines when to resize the hash table. For separate chaining, α can exceed 1.0 because chains can grow indefinitely. For open addressing, α must remain below 1.0 since all elements occupy array slots. Common practice triggers resizing when α exceeds 0.7 or 0.75.
Hash table resizing creates a new array with increased capacity, typically double the current size, and rehashes all existing elements into the new array. This operation requires O(n) time but occurs infrequently enough that amortized insertion cost remains O(1).
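That resize step can be sketched as a rehash of every stored pair into a larger bucket array. The class and constant names below are hypothetical; the table doubles its capacity once the load factor would cross a threshold.

```ruby
# Minimal resize sketch for a chained hash table: allocate a larger
# bucket array and rehash every stored pair into it
class ResizableHashTable
  LOAD_FACTOR_LIMIT = 0.75

  def initialize(size = 4)
    @buckets = Array.new(size) { [] }
    @count = 0
  end

  def insert(key, value)
    resize if (@count + 1).fdiv(@buckets.size) > LOAD_FACTOR_LIMIT
    @buckets[key.hash % @buckets.size] << [key, value]
    @count += 1
  end

  def capacity
    @buckets.size
  end

  private

  # O(n): every existing pair is rehashed against the new capacity
  def resize
    old_buckets = @buckets
    @buckets = Array.new(old_buckets.size * 2) { [] }
    old_buckets.flatten(1).each do |key, value|
      @buckets[key.hash % @buckets.size] << [key, value]
    end
  end
end

table = ResizableHashTable.new(4)
10.times { |i| table.insert("key#{i}", i) }
puts table.capacity # => 16 (doubled twice as the load factor crossed 0.75)
```

Each doubling halves the load factor, so the total rehash work over n insertions is n + n/2 + n/4 + … < 2n, which is where the amortized O(1) bound comes from.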
The birthday paradox explains why collisions occur more frequently than intuition suggests. With only 23 people in a room, the probability that two share a birthday exceeds 50%. Similarly, a hash table with 1000 slots has a greater than 50% chance of at least one collision after only about 37 insertions, assuming a uniform distribution.
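Those figures follow from the standard birthday approximation P(collision) ≈ 1 − e^(−n(n−1)/2m), which is easy to check directly (the function name here is ours):

```ruby
# Probability that at least one collision has occurred after inserting
# n keys into m slots, assuming uniform hashing (birthday approximation)
def collision_probability(n, m)
  1 - Math.exp(-n * (n - 1) / (2.0 * m))
end

puts collision_probability(23, 365).round(3)  # ≈ 0.50 (the birthday paradox)
puts collision_probability(38, 1000).round(3) # ≈ 0.50, crossing the 50% mark
puts collision_probability(100, 1000).round(3) # near-certain collision
```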
Ruby Implementation
Ruby provides the Hash class as a built-in hash table implementation. (Since Ruby 2.4 the underlying st_table uses open addressing with a compact, insertion-ordered entries array; earlier versions used separate chaining.) The Hash class uses the object's hash method to compute hash values and the eql? method to compare keys for equality. Custom objects used as hash keys must implement both methods consistently.
# Custom object as hash key
class Point
  attr_reader :x, :y

  def initialize(x, y)
    @x = x
    @y = y
  end

  # Required for hash table usage
  def hash
    [x, y].hash
  end

  # Required for key comparison
  def eql?(other)
    other.is_a?(Point) && x == other.x && y == other.y
  end
end
grid = {}
p1 = Point.new(1, 2)
p2 = Point.new(1, 2)
grid[p1] = "occupied"
puts grid[p2] # => "occupied" (finds value because hash and eql? match)
Ruby's Hash implementation provides multiple construction methods. Literal syntax using curly braces offers the most concise approach. The Hash.new constructor accepts an optional default value or block that executes when accessing missing keys. Hash rockets => and symbol shorthand : provide syntax variations for key-value pairs.
# Hash construction methods
literal = { "name" => "Alice", "age" => 30 }
symbol_keys = { name: "Alice", age: 30 }
with_default = Hash.new(0)
with_block = Hash.new { |hash, key| hash[key] = [] }
# Default value behavior
with_default[:missing] += 1
puts with_default[:missing] # => 1
# Default block behavior
with_block[:items] << "first"
puts with_block[:items] # => ["first"]
The Hash class offers methods for iteration, transformation, and filtering. The each method yields key-value pairs to a block. The transform_values method transforms values while preserving keys (map, by contrast, returns an array of block results rather than a hash). The select and reject methods filter entries based on predicates. The merge method combines multiple hashes, with later values overwriting earlier ones for duplicate keys.
scores = { "Alice" => 95, "Bob" => 87, "Carol" => 92, "David" => 78 }
# Iteration
scores.each { |name, score| puts "#{name}: #{score}" }
# Transformation
normalized = scores.transform_values { |score| score / 100.0 }
# => {"Alice"=>0.95, "Bob"=>0.87, "Carol"=>0.92, "David"=>0.78}
# Filtering
high_scores = scores.select { |name, score| score >= 90 }
# => {"Alice"=>95, "Carol"=>92}
# Merging
extra = { "Eve" => 88, "Alice" => 97 }
combined = scores.merge(extra)
# => Alice's score updates to 97, Eve added
Ruby 2.7 introduced pattern matching for hashes, providing concise syntax for extracting values based on structure. Pattern matching works with both string and symbol keys and supports value matching, variable binding, and rest patterns.
user = { name: "Alice", age: 30, role: "admin" }
# Pattern matching with variable binding
case user
in { name: n, role: "admin" }
  puts "Admin user: #{n}"
in { name: n, role: "user" }
  puts "Regular user: #{n}"
end
# Extracting with rest pattern
case user
in { name: n, **rest }
  puts "Name: #{n}, Other: #{rest}"
  # => Name: Alice, Other: {:age=>30, :role=>"admin"}
end
Ruby hashes maintain insertion order since Ruby 1.9. Iterating over a hash produces entries in the order they were added, not in hash value order. This behavior differs from hash tables in many other languages where iteration order remains undefined.
ordered = {}
ordered[:third] = 3
ordered[:first] = 1
ordered[:second] = 2
ordered.each { |k, v| puts "#{k}: #{v}" }
# Output order: third, first, second
The fetch method provides safer key access than bracket notation by raising an exception for missing keys instead of returning nil. This behavior prevents silent failures when nil represents a valid value in the domain.
config = { timeout: 30, retries: 3 }
# Safe access with default
max_size = config.fetch(:max_size, 1000)
# Raises KeyError if missing and no default provided
begin
  config.fetch(:missing_key)
rescue KeyError => e
  puts "Key not found: #{e.message}"
end
Performance Considerations
Hash table performance depends on three factors: hash function quality, load factor, and collision resolution strategy. The theoretical O(1) average time complexity for insert, delete, and lookup operations assumes a uniform distribution of hash values and proper load factor management.
Hash function quality determines collision frequency. A poor hash function clusters keys at specific indices, creating long chains or probe sequences. Cryptographic hash functions like SHA-256 provide excellent distribution but impose computational overhead. Non-cryptographic functions like MurmurHash or CityHash balance distribution quality with computation speed.
# Comparing hash distributions
def analyze_distribution(hash_func, keys, table_size)
  buckets = Array.new(table_size, 0)
  keys.each do |key|
    index = hash_func.call(key) % table_size
    buckets[index] += 1
  end
  # Calculate variance as distribution metric
  mean = keys.size.to_f / table_size
  buckets.sum { |count| (count - mean) ** 2 } / table_size
end
keys = (0...1000).map { |i| "key#{i}" }
# Simple hash (poor distribution)
simple_hash = ->(key) { key.each_byte.sum }
variance1 = analyze_distribution(simple_hash, keys, 100)
# Ruby's built-in hash (good distribution)
ruby_hash = ->(key) { key.hash }
variance2 = analyze_distribution(ruby_hash, keys, 100)
puts "Simple hash variance: #{variance1}"
puts "Ruby hash variance: #{variance2}"
# Ruby's hash shows significantly lower variance
The load factor α directly impacts performance. For separate chaining, average chain length equals α, making lookup time O(1 + α). For open addressing with uniform hashing, the expected probe count for an unsuccessful search is roughly 1/(1 − α), growing without bound as α approaches 1; linear probing degrades faster still, at roughly ½(1 + 1/(1 − α)²), because of primary clustering. Quadratic probing and double hashing fall between these two behaviors.
# Demonstrating load factor impact
class InstrumentedHashTable
  attr_reader :collision_count

  def initialize(size = 10)
    @buckets = Array.new(size) { [] }
    @size = size
    @count = 0
    @collision_count = 0
  end

  def insert(key, value)
    index = key.hash % @size
    bucket = @buckets[index]
    @collision_count += 1 unless bucket.empty?
    bucket << [key, value]
    @count += 1
  end

  def load_factor
    @count.to_f / @size
  end
end
# Test with different load factors
[0.5, 0.75, 1.0, 2.0].each do |target_load|
  table = InstrumentedHashTable.new(100)
  elements = (target_load * 100).to_i
  elements.times { |i| table.insert(i, "value#{i}") }
  puts "Load factor: #{table.load_factor.round(2)}, " \
       "Collisions: #{table.collision_count}"
end
Ruby's Hash class automatically resizes when the load factor exceeds a threshold, typically around 0.75. The resize operation allocates a new array with double capacity and rehashes all entries. While individual resize operations cost O(n), they occur infrequently enough that amortized insertion remains O(1).
Memory overhead varies by implementation. Separate chaining requires additional space for chain data structures, typically storing 2-3 words per element beyond the key-value pair. Open addressing stores elements directly in the array but requires maintaining empty slots, wasting space when load factor stays low.
Worst-case performance occurs when all keys hash to the same index. For separate chaining, operations degrade to O(n) as they traverse the entire chain. For open addressing, every probe must be checked, also resulting in O(n) complexity. This scenario occurs with adversarial input or pathologically poor hash functions.
# Demonstrating worst-case behavior
class PoorHash
  attr_reader :probe_count

  def initialize
    @data = Array.new(10)
    @probe_count = 0
  end

  def insert(key, value)
    # All keys hash to index 0 (worst case)
    index = 0
    @probe_count = 0
    # Linear probing
    while @data[index]
      index = (index + 1) % @data.size
      @probe_count += 1
    end
    @data[index] = [key, value]
  end
end
poor = PoorHash.new
5.times { |i| poor.insert(i, "value#{i}") }
puts "Probes required: #{poor.probe_count}" # Grows linearly
Cache performance affects hash table speed in practice. Separate chaining exhibits poor cache locality because chain traversal requires pointer chasing across non-contiguous memory. Open addressing with linear probing demonstrates excellent cache behavior by accessing consecutive memory locations. This advantage can outweigh the theoretical algorithmic costs for small to medium hash tables.
Practical Examples
Hash tables solve the two-sum problem by storing seen values during array traversal. The algorithm checks whether the complement of the current value exists in the hash table, achieving O(n) time complexity compared to O(n²) for the nested loop approach.
def two_sum(nums, target)
  seen = {}
  nums.each_with_index do |num, i|
    complement = target - num
    return [seen[complement], i] if seen.key?(complement)
    seen[num] = i
  end
  nil
end
# Example usage
nums = [2, 7, 11, 15]
target = 9
result = two_sum(nums, target)
puts result.inspect # => [0, 1] (nums[0] + nums[1] = 2 + 7 = 9)
Counting element frequencies demonstrates hash table utility for aggregation tasks. The pattern initializes counts to zero for missing keys and increments for existing keys.
def count_frequencies(items)
  frequencies = Hash.new(0)
  items.each { |item| frequencies[item] += 1 }
  frequencies
end
words = ["apple", "banana", "apple", "cherry", "banana", "apple"]
freq = count_frequencies(words)
puts freq.inspect # => {"apple"=>3, "banana"=>2, "cherry"=>1}
# Finding most frequent element
most_frequent = freq.max_by { |word, count| count }
puts "Most frequent: #{most_frequent[0]} (#{most_frequent[1]} times)"
Grouping related items by a computed key requires hash tables with array values. This pattern appears in data processing, where records must be organized by category, date, or other attributes.
def group_by_length(strings)
  groups = Hash.new { |h, k| h[k] = [] }
  strings.each do |string|
    groups[string.length] << string
  end
  groups
end
words = ["cat", "elephant", "dog", "bear", "ant"]
grouped = group_by_length(words)
puts grouped.inspect
# => {3=>["cat", "dog", "ant"], 8=>["elephant"], 4=>["bear"]}
# Built-in Ruby method accomplishes the same
grouped_builtin = words.group_by(&:length)
Caching expensive computation results with memoization uses hash tables to store previously computed values. The Fibonacci sequence provides a classic example where naive recursion performs exponentially while memoized recursion completes in linear time.
class Fibonacci
  def initialize
    @cache = {}
  end

  def compute(n)
    return n if n <= 1
    @cache[n] ||= compute(n - 1) + compute(n - 2)
  end
end
fib = Fibonacci.new
puts fib.compute(50) # Completes instantly
# Without memoization, fib(50) would take minutes
# Alternative with method-level memoization
def fib_memo(n, cache = {})
  return n if n <= 1
  cache[n] ||= fib_memo(n - 1, cache) + fib_memo(n - 2, cache)
end
Finding duplicate elements in an array demonstrates hash table efficiency for set membership testing. The algorithm achieves O(n) time compared to O(n²) for nested loops or O(n log n) for sorting approaches.
def find_duplicates(array)
  seen = {}
  duplicates = []
  array.each do |element|
    if seen[element]
      duplicates << element unless duplicates.include?(element)
    else
      seen[element] = true
    end
  end
  duplicates
end
numbers = [1, 2, 3, 2, 4, 5, 3, 6]
dupes = find_duplicates(numbers)
puts dupes.inspect # => [2, 3]
# More idiomatic Ruby using tally
duplicates = numbers.tally.select { |num, count| count > 1 }.keys
Design Considerations
Choosing between separate chaining and open addressing depends on several factors. Separate chaining handles high load factors gracefully because chains grow dynamically. Memory allocation happens incrementally as elements insert. Deletion remains straightforward by removing elements from chains. However, separate chaining suffers from poor cache locality and requires extra memory for chain pointers.
Open addressing eliminates pointer overhead by storing elements directly in the array. Linear probing exhibits excellent cache performance by accessing consecutive memory locations. The approach works well for low to moderate load factors. However, deletion complicates open addressing because removing an element creates gaps in probe sequences. Common solutions include tombstone markers or backward shifting of elements, both adding complexity.
# Open addressing with tombstone deletion
class OpenAddressedHash
  TOMBSTONE = Object.new

  def initialize(size = 10)
    @table = Array.new(size)
    @size = size
  end

  def insert(key, value)
    index = find_slot(key)
    @table[index] = [key, value]
  end

  def delete(key)
    index = find_index(key)
    @table[index] = TOMBSTONE if index
  end

  def get(key)
    index = find_index(key)
    index ? @table[index][1] : nil
  end

  private

  # Reuses the first tombstone seen, but keeps scanning so an existing
  # key is updated in place rather than duplicated. Assumes the table
  # never fills; a real implementation would resize.
  def find_slot(key)
    index = key.hash % @size
    first_tombstone = nil
    loop do
      entry = @table[index]
      return first_tombstone || index if entry.nil?
      first_tombstone ||= index if entry == TOMBSTONE
      return index if entry != TOMBSTONE && entry[0] == key
      index = (index + 1) % @size
    end
  end

  def find_index(key)
    index = key.hash % @size
    loop do
      return nil if @table[index].nil?
      return index if @table[index] != TOMBSTONE && @table[index][0] == key
      index = (index + 1) % @size
    end
  end
end
Hash function selection balances distribution quality, computation speed, and security requirements. Non-cryptographic functions like MurmurHash or FNV-1a suffice for most applications, providing good distribution with minimal overhead. Applications vulnerable to hash collision attacks require cryptographic functions like SHA-256 or keyed functions like SipHash, accepting the performance penalty for security guarantees.
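As a concrete example of a simple non-cryptographic function, here is a sketch of 32-bit FNV-1a using its published offset basis and prime. The method name is ours, and real deployments would reach for a vetted library implementation rather than hand-rolling one.

```ruby
# 32-bit FNV-1a: XOR each byte into the state, then multiply by the FNV prime
FNV_OFFSET_BASIS = 2166136261
FNV_PRIME = 16777619

def fnv1a_32(string)
  string.each_byte.reduce(FNV_OFFSET_BASIS) do |hash, byte|
    ((hash ^ byte) * FNV_PRIME) & 0xFFFFFFFF # Keep the state at 32 bits
  end
end

puts fnv1a_32("hello") # => 1335831723 (0x4F9F2CAB)
puts fnv1a_32("hellp") # Avalanche: one changed byte scrambles the output
```

The multiply-then-XOR structure is what gives FNV-1a its avalanche behavior: a single-bit change in the input propagates through every subsequent multiplication.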
The initial capacity and resize threshold affect memory usage and resize frequency. Starting with a large capacity reduces resizing but wastes memory for small collections. Starting with a small capacity saves memory but triggers frequent resizing. Choosing thresholds of 0.75 for separate chaining and 0.5-0.7 for open addressing balances these concerns.
Ordered hash tables maintain insertion order, providing predictable iteration. Ruby's Hash implements this behavior natively by keeping entries in a compact, insertion-ordered array alongside the index table, adding little overhead. Applications requiring sorted traversal might choose tree-based maps instead, trading O(1) operations for O(log n) with guaranteed ordering.
# Comparing ordered vs sorted iteration
hash = { c: 3, a: 1, b: 2 }
# Insertion order (Ruby default)
hash.each { |k, v| puts "#{k}: #{v}" }
# Output: c: 3, a: 1, b: 2
# Sorted iteration when needed
hash.sort.each { |k, v| puts "#{k}: #{v}" }
# Output: a: 1, b: 2, c: 3
# Tree-based map for always-sorted keys
require 'rbtree' # Requires the rbtree gem
sorted = RBTree.new
sorted[:a] = 1
sorted[:c] = 3
sorted[:b] = 2
sorted.each { |k, v| puts "#{k}: #{v}" }
# Output: a: 1, b: 2, c: 3
Thread safety requires synchronization for concurrent access. Ruby's Hash is not thread-safe by default. Multiple threads reading simultaneously causes no issues, but concurrent modifications lead to race conditions and data corruption. Options include external locking, concurrent hash implementations, or thread-local storage.
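A minimal illustration of the external-locking option wraps every operation in a Mutex#synchronize call. The wrapper class here is a sketch, not a drop-in replacement for Hash.

```ruby
# Minimal thread-safe wrapper: every operation runs under one Mutex
class SynchronizedHash
  def initialize
    @hash = {}
    @lock = Mutex.new
  end

  def [](key)
    @lock.synchronize { @hash[key] }
  end

  def []=(key, value)
    @lock.synchronize { @hash[key] = value }
  end

  def size
    @lock.synchronize { @hash.size }
  end
end

shared = SynchronizedHash.new
threads = 4.times.map do |t|
  Thread.new { 100.times { |i| shared["t#{t}-#{i}"] = i } }
end
threads.each(&:join)
puts shared.size # => 400 (all writes survive concurrent access)
```

A single coarse lock serializes readers as well as writers; higher-throughput designs shard the table across several locks or use a dedicated concurrent structure such as the concurrent-ruby gem's Concurrent::Hash.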
Common Pitfalls
Mutable objects used as hash keys create subtle bugs when the object changes after insertion. Hash tables compute indices from key hash values at insertion time. Modifying the key afterwards changes its hash value, making the entry unreachable. Ruby sidesteps this issue for strings by storing a frozen copy of an unfrozen string key rather than the original object.
# Mutable key problem
key = [1, 2, 3]
hash = { key => "value" }
puts hash[key] # => "value"
key << 4 # Modifies the key
puts hash[key] # => nil (entry no longer findable)
# The original entry still exists but is unreachable
puts hash.inspect # => {[1, 2, 3, 4]=>"value"}
# String keys are duplicated and frozen on insertion
str_key = "key"
hash2 = { str_key => "value" }
str_key << "!" # Mutating the original does not affect the stored copy
puts hash2["key"] # => "value"
puts hash2.keys.first.frozen? # => true
Inconsistent hash and eql? implementations break hash table functionality. Objects must implement these methods together such that objects comparing equal via eql? produce identical hash values. Violating this contract causes duplicate keys or unfindable entries.
class BrokenKey
  attr_accessor :value

  def initialize(value)
    @value = value
  end

  def hash
    value.hash
  end

  # Broken: eql? always returns false
  def eql?(other)
    false
  end
end
k1 = BrokenKey.new(42)
k2 = BrokenKey.new(42)
hash = { k1 => "first" }
# Same hash value but eql? returns false
puts k1.hash == k2.hash # => true
puts k1.eql?(k2) # => false
# Creates duplicate entries despite same hash
hash[k2] = "second"
puts hash.size # => 2 (should be 1)
The default value pitfall occurs when using Hash.new(default) with mutable objects. All missing keys share the same default object instance, causing unexpected mutations.
# Wrong: shared default object
bad_hash = Hash.new([])
bad_hash[:a] << 1
bad_hash[:b] << 2
puts bad_hash[:a].inspect # => [1, 2] (unexpected!)
puts bad_hash[:c].inspect # => [1, 2] (all keys share same array)
# Correct: default block creates new object per key
good_hash = Hash.new { |h, k| h[k] = [] }
good_hash[:a] << 1
good_hash[:b] << 2
puts good_hash[:a].inspect # => [1]
puts good_hash[:b].inspect # => [2]
Hash collision attacks exploit predictable hash functions to create pathological inputs. An attacker constructs keys that all hash to the same value, degrading performance from O(1) to O(n). This vulnerability affects web applications accepting user input as hash keys. Ruby mitigates the attack by hashing strings with SipHash keyed by a per-process random seed, making colliding inputs impractical to predict.
Forgetting to resize hash tables leads to performance degradation. As load factor increases, collision frequency grows, slowing all operations. Manual implementations must track element count and trigger resizing at appropriate thresholds. Ruby's Hash handles this automatically.
Comparing hash equality requires careful consideration of key order. The == operator checks for equal keys and values but ignores order in Ruby hashes. Two hashes with identical contents in different insertion orders compare as equal.
h1 = { a: 1, b: 2, c: 3 }
h2 = { c: 3, a: 1, b: 2 }
puts h1 == h2 # => true (equal despite different order)
# Use to_a for order-sensitive comparison
puts h1.to_a == h2.to_a # => false (order matters)
Reference
Hash Table Operations Complexity
| Operation | Average Case | Worst Case | Notes |
|---|---|---|---|
| Insert | O(1) | O(n) | Worst case occurs with all collisions |
| Delete | O(1) | O(n) | May require chain traversal or probing |
| Search | O(1) | O(n) | Depends on hash function quality |
| Resize | O(n) | O(n) | Occurs infrequently for amortized O(1) |
Collision Resolution Comparison
| Strategy | Pros | Cons | Best Use Case |
|---|---|---|---|
| Separate Chaining | Handles high load factors, simple deletion | Poor cache locality, pointer overhead | Unknown or high load factors |
| Linear Probing | Excellent cache performance, no pointers | Clustering issues, complex deletion | Low to moderate load factors |
| Quadratic Probing | Reduces clustering vs linear | Requires prime table size | Moderate load factors |
| Double Hashing | Minimal clustering | Slower hash computation | High load factors with open addressing |
Ruby Hash Methods
| Method | Purpose | Example |
|---|---|---|
| [] | Access value by key | hash[:key] |
| []= | Set value for key | hash[:key] = value |
| fetch | Access with default or error | hash.fetch(:key, default) |
| key? | Check key existence | hash.key?(:name) |
| delete | Remove key-value pair | hash.delete(:key) |
| merge | Combine hashes | hash1.merge(hash2) |
| transform_keys | Modify all keys | hash.transform_keys(&:upcase) |
| transform_values | Modify all values | hash.transform_values { \|v\| v * 2 } |
| select | Filter entries | hash.select { \|k, v\| v > 10 } |
| each | Iterate key-value pairs | hash.each { \|k, v\| puts v } |
Load Factor Guidelines
| Implementation | Recommended Threshold | Typical Range | Performance Impact |
|---|---|---|---|
| Separate Chaining | 0.75 - 1.0 | 0.5 - 2.0 | Linear degradation above 1.0 |
| Linear Probing | 0.5 - 0.7 | 0.3 - 0.8 | Exponential degradation above 0.7 |
| Quadratic Probing | 0.5 - 0.75 | 0.3 - 0.8 | Table must remain under 50% for guarantees |
| Double Hashing | 0.5 - 0.75 | 0.3 - 0.8 | Better than linear, worse than chaining |
Hash Function Properties
| Property | Description | Importance |
|---|---|---|
| Deterministic | Same input always produces same output | Required for correctness |
| Uniform Distribution | Hash values spread evenly across range | Critical for performance |
| Fast Computation | Minimal CPU cycles per hash | Affects all operation speed |
| Avalanche Effect | Small input change produces large output change | Improves distribution quality |
| Collision Resistance | Difficult to find two inputs with same hash | Important for security |
Common Hash Function Algorithms
| Function | Speed | Distribution | Security | Use Case |
|---|---|---|---|---|
| Division Method | Very Fast | Poor to Fair | None | Simple applications |
| Multiplication Method | Fast | Good | None | General purpose |
| MurmurHash | Fast | Excellent | None | Non-cryptographic hashing |
| CityHash | Very Fast | Excellent | None | Short strings, high performance |
| SipHash | Medium | Excellent | Good | Hash collision attack prevention |
| SHA-256 | Slow | Excellent | Excellent | Cryptographic applications |
Ruby Hash Construction Syntax
| Syntax | Example | Notes |
|---|---|---|
| Hash rocket | {key => value} | Works with any object as key |
| Symbol shorthand | {key: value} | Symbol keys only, most concise |
| Hash.new | Hash.new(default) | Specify default value |
| Hash.new with block | Hash.new { block } | Dynamic default per key |
| Hash[] | Hash[key1, val1, key2, val2] | Alternating keys and values |
| to_h | array.to_h | Convert array of pairs to hash |