CrackedRuby

Hash Tables and Hash Functions

Overview

Hash tables provide constant-time average complexity for insertion, deletion, and lookup operations by mapping keys to array indices through hash functions. This data structure stores key-value pairs and uses a hash function to compute an index into an array of buckets or slots from which the desired value can be found.

The hash table concept emerged in the 1950s, with early implementations appearing in assembly language programs. The fundamental mechanism remains unchanged: a hash function transforms a key of arbitrary size into an integer that serves as an index into a fixed-size array. When two keys hash to the same index, a collision occurs, requiring a resolution strategy.

Hash tables serve as the foundation for numerous data structures and algorithms in software development. Databases use hash-based indexing for rapid record retrieval. Programming languages implement dictionaries, sets, and associative arrays using hash tables. Caching systems rely on hash tables to store and retrieve cached data. Compilers use hash tables for symbol tables during lexical analysis.

# Basic hash table operation in Ruby
user_ages = { "Alice" => 30, "Bob" => 25, "Carol" => 35 }

# O(1) average time complexity for access
puts user_ages["Alice"]  # => 30

# O(1) average time complexity for insertion
user_ages["David"] = 28

# O(1) average time complexity for deletion
user_ages.delete("Bob")

The efficiency of hash tables depends on the quality of the hash function and the collision resolution strategy. A poor hash function creates clusters of collisions, degrading performance from O(1) to O(n). The load factor, defined as the ratio of stored elements to array capacity, directly impacts collision frequency and performance.

Key Principles

A hash function accepts a key as input and produces an integer output that serves as an index into the hash table's underlying array. The function must be deterministic, always producing the same output for a given input. The ideal hash function distributes keys uniformly across the available array indices, minimizing collisions.

Hash functions operate through several stages. First, the function converts the key into an integer if the key is not already numeric. String keys require encoding schemes like summing ASCII values or using polynomial rolling hash techniques. Second, the function maps this integer into the valid range of array indices, typically using the modulo operation with the array size.

# Simple hash function demonstration
def simple_hash(string, array_size)
  hash_value = 0
  string.each_char { |char| hash_value += char.ord }
  hash_value % array_size
end

puts simple_hash("hello", 10)  # => 2
puts simple_hash("world", 10)  # => 2 (collides with "hello")
puts simple_hash("hello", 10)  # => 2 (deterministic)
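The character-sum approach also collides for every pair of anagrams. The polynomial rolling hash mentioned above weighs each character by its position; a minimal sketch, where base 31 and table size 97 are arbitrary illustrative choices:

```ruby
# Polynomial rolling hash: multiply the running value by a base before
# adding each character, so character order affects the result
def polynomial_hash(string, array_size, base = 31)
  hash_value = 0
  string.each_char { |char| hash_value = hash_value * base + char.ord }
  hash_value % array_size
end

# Anagrams defeat a plain character sum...
puts "listen".each_byte.sum == "silent".each_byte.sum  # => true

# ...but the polynomial hash distinguishes them
puts polynomial_hash("listen", 97) == polynomial_hash("silent", 97)  # => false
```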

Collision resolution strategies fall into two categories: separate chaining and open addressing. Separate chaining stores multiple key-value pairs at each array index using a linked list, array, or other data structure. When a collision occurs, the new element appends to the existing chain at that index. Lookups require traversing the chain to find the matching key.

Open addressing stores all elements directly in the hash table array without separate data structures. When a collision occurs, the algorithm probes for the next available slot using a defined sequence. Linear probing checks consecutive slots, adding a fixed offset each time. Quadratic probing uses quadratic increments, and double hashing applies a secondary hash function to determine the probe sequence.
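The three probe sequences can be compared side by side. In this sketch, the table size of 11 and the fixed secondary-hash value of 7 are illustrative assumptions:

```ruby
# First few slots each open-addressing strategy visits after a
# collision at the home slot
def probe_sequence(home, table_size, count)
  {
    linear:    (0...count).map { |i| (home + i) % table_size },
    quadratic: (0...count).map { |i| (home + i * i) % table_size },
    # Double hashing steps by a second hash of the key; 7 stands in
    # for that value here and is chosen coprime to the table size
    double:    (0...count).map { |i| (home + i * 7) % table_size }
  }
end

probe_sequence(3, 11, 5).each do |strategy, slots|
  puts "#{strategy}: #{slots.inspect}"
end
# linear: [3, 4, 5, 6, 7]
# quadratic: [3, 4, 7, 1, 8]
# double: [3, 10, 6, 2, 9]
```

Linear probing's consecutive slots explain both its cache friendliness and its clustering; the other two sequences scatter more widely.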

# Separate chaining collision resolution
class ChainedHashTable
  def initialize(size = 10)
    @buckets = Array.new(size) { [] }
    @size = size
  end
  
  def insert(key, value)
    index = hash_function(key)
    bucket = @buckets[index]
    
    # Update existing key or append new pair
    existing = bucket.find { |k, v| k == key }
    if existing
      existing[1] = value
    else
      bucket << [key, value]
    end
  end
  
  def get(key)
    index = hash_function(key)
    bucket = @buckets[index]
    pair = bucket.find { |k, v| k == key }
    pair ? pair[1] : nil
  end
  
  private
  
  def hash_function(key)
    key.hash % @size
  end
end

The load factor α = n/m, where n represents the number of stored elements and m represents the array capacity, determines when to resize the hash table. For separate chaining, α can exceed 1.0 because chains can grow indefinitely. For open addressing, α must remain below 1.0 since all elements occupy array slots. Common practice triggers resizing when α exceeds 0.7 or 0.75.

Hash table resizing creates a new array with increased capacity, typically double the current size, and rehashes all existing elements into the new array. This operation requires O(n) time but occurs infrequently enough that amortized insertion cost remains O(1).
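A minimal sketch of that resize step on top of a chained table; the class name, threshold constant, and helper methods are illustrative, not Ruby internals:

```ruby
# Resizing allocates a doubled bucket array and rehashes every stored
# pair, because each key's index depends on the current table size
class ResizableHashTable
  LOAD_FACTOR_LIMIT = 0.75

  def initialize(size = 4)
    @buckets = Array.new(size) { [] }
    @count = 0
  end

  def insert(key, value)
    resize if (@count + 1).fdiv(@buckets.size) > LOAD_FACTOR_LIMIT
    @buckets[key.hash % @buckets.size] << [key, value]
    @count += 1
  end

  def get(key)
    pair = @buckets[key.hash % @buckets.size].find { |k, _| k == key }
    pair && pair[1]
  end

  def capacity
    @buckets.size
  end

  private

  # O(n) rehash; rare enough that insertion stays amortized O(1)
  def resize
    old_buckets = @buckets
    @buckets = Array.new(old_buckets.size * 2) { [] }
    old_buckets.flatten(1).each do |key, value|
      @buckets[key.hash % @buckets.size] << [key, value]
    end
  end
end

table = ResizableHashTable.new(4)
10.times { |i| table.insert("key#{i}", i) }
puts table.capacity     # => 16 (doubled twice: 4 -> 8 -> 16)
puts table.get("key7")  # => 7
```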

The birthday paradox explains why collisions occur more frequently than intuition suggests. With only 23 people in a room, the probability that two share a birthday exceeds 50%. Similarly, a hash table with 1000 slots experiences collisions after inserting approximately 37 elements with 50% probability, assuming a uniform distribution.
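These figures can be checked directly: the chance of at least one collision after n uniform inserts into m slots is one minus the product of the per-insert "no collision" probabilities:

```ruby
# Birthday-problem calculation: probability that n random inserts into
# m uniformly chosen slots produce at least one collision
def collision_probability(n, m)
  no_collision = (0...n).inject(1.0) { |p, i| p * (m - i) / m }
  1.0 - no_collision
end

puts collision_probability(23, 365).round(3)   # => 0.507 (birthday paradox)
puts collision_probability(37, 1000).round(3)  # just under 0.5
```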

Ruby Implementation

Ruby provides the Hash class as a built-in hash table implementation. MRI historically resolved collisions with separate chaining; since Ruby 2.4 the internal table uses open addressing for better cache locality. The Hash class uses the object's hash method to compute hash values and the eql? method to compare keys for equality. Custom objects used as hash keys must implement both methods consistently.

# Custom object as hash key
class Point
  attr_reader :x, :y
  
  def initialize(x, y)
    @x = x
    @y = y
  end
  
  # Required for hash table usage
  def hash
    [x, y].hash
  end
  
  # Required for key comparison
  def eql?(other)
    other.is_a?(Point) && x == other.x && y == other.y
  end
end

grid = {}
p1 = Point.new(1, 2)
p2 = Point.new(1, 2)

grid[p1] = "occupied"
puts grid[p2]  # => "occupied" (finds value because hash and eql? match)

Ruby's Hash implementation provides multiple construction methods. Literal syntax using curly braces offers the most concise approach. The Hash.new constructor accepts an optional default value or block that executes when accessing missing keys. Hash rockets => and symbol shorthand : provide syntax variations for key-value pairs.

# Hash construction methods
literal = { "name" => "Alice", "age" => 30 }
symbol_keys = { name: "Alice", age: 30 }
with_default = Hash.new(0)
with_block = Hash.new { |hash, key| hash[key] = [] }

# Default value behavior
with_default[:missing] += 1
puts with_default[:missing]  # => 1

# Default block behavior
with_block[:items] << "first"
puts with_block[:items]  # => ["first"]

The Hash class offers methods for iteration, transformation, and filtering. The each method yields key-value pairs to a block. The transform_values method transforms values while preserving keys (map, by contrast, returns an array of block results). The select and reject methods filter entries based on predicates. The merge method combines multiple hashes, with later values overwriting earlier ones for duplicate keys.

scores = { "Alice" => 95, "Bob" => 87, "Carol" => 92, "David" => 78 }

# Iteration
scores.each { |name, score| puts "#{name}: #{score}" }

# Transformation
normalized = scores.transform_values { |score| score / 100.0 }
# => {"Alice"=>0.95, "Bob"=>0.87, "Carol"=>0.92, "David"=>0.78}

# Filtering
high_scores = scores.select { |name, score| score >= 90 }
# => {"Alice"=>95, "Carol"=>92}

# Merging
extra = { "Eve" => 88, "Alice" => 97 }
combined = scores.merge(extra)
# => Alice's score updates to 97, Eve added

Ruby 2.7 introduced pattern matching for hashes, providing concise syntax for extracting values based on structure. Pattern matching works with both string and symbol keys and supports value matching, variable binding, and rest patterns.

user = { name: "Alice", age: 30, role: "admin" }

# Pattern matching with variable binding
case user
in { name: n, role: "admin" }
  puts "Admin user: #{n}"
in { name: n, role: "user" }
  puts "Regular user: #{n}"
end

# Extracting with rest pattern
case user
in { name: n, **rest }
  puts "Name: #{n}, Other: #{rest}"
  # => Name: Alice, Other: {:age=>30, :role=>"admin"}
end

Ruby hashes maintain insertion order since Ruby 1.9. Iterating over a hash produces entries in the order they were added, not in hash value order. This behavior differs from hash tables in many other languages where iteration order remains undefined.

ordered = {}
ordered[:third] = 3
ordered[:first] = 1
ordered[:second] = 2

ordered.each { |k, v| puts "#{k}: #{v}" }
# Output order: third, first, second

The fetch method provides safer key access than bracket notation by raising an exception for missing keys instead of returning nil. This behavior prevents silent failures when nil represents a valid value in the domain.

config = { timeout: 30, retries: 3 }

# Safe access with default
max_size = config.fetch(:max_size, 1000)

# Raises KeyError if missing and no default provided
begin
  config.fetch(:missing_key)
rescue KeyError => e
  puts "Key not found: #{e.message}"
end

Performance Considerations

Hash table performance depends on three factors: hash function quality, load factor, and collision resolution strategy. The theoretical O(1) average time complexity for insert, delete, and lookup operations assumes a uniform distribution of hash values and proper load factor management.

Hash function quality determines collision frequency. A poor hash function clusters keys at specific indices, creating long chains or probe sequences. Cryptographic hash functions like SHA-256 provide excellent distribution but impose computational overhead. Non-cryptographic functions like MurmurHash or CityHash balance distribution quality with computation speed.

# Comparing hash distributions
require 'digest'

def analyze_distribution(hash_func, keys, table_size)
  buckets = Array.new(table_size, 0)
  keys.each do |key|
    index = hash_func.call(key) % table_size
    buckets[index] += 1
  end
  
  # Calculate variance as distribution metric
  mean = keys.size.to_f / table_size
  variance = buckets.sum { |count| (count - mean) ** 2 } / table_size
  variance
end

keys = (0...1000).map { |i| "key#{i}" }

# Simple hash (poor distribution)
simple_hash = ->(key) { key.each_byte.sum }
variance1 = analyze_distribution(simple_hash, keys, 100)

# Ruby's built-in hash (good distribution)
ruby_hash = ->(key) { key.hash }
variance2 = analyze_distribution(ruby_hash, keys, 100)

puts "Simple hash variance: #{variance1}"
puts "Ruby hash variance: #{variance2}"
# Ruby's hash shows significantly lower variance

The load factor α directly impacts performance. For separate chaining, average chain length equals α, making lookup time O(1 + α). For open addressing with linear probing, average probe count grows as 1/(1 - α), approaching infinity as α approaches 1. Quadratic probing and double hashing exhibit similar but less severe degradation.
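Plugging load factors into those expressions shows the divergence, using the simplified 1/(1 - α) estimate from above:

```ruby
# Expected work per operation as the load factor a grows:
# separate chaining averages 1 + a, linear probing roughly 1 / (1 - a)
[0.25, 0.5, 0.75, 0.9, 0.99].each do |a|
  puts format("a=%.2f  chaining=%.2f  probing=%.1f", a, 1 + a, 1.0 / (1 - a))
end
# At a = 0.99, probing expects ~100 slot inspections while chaining
# averages about 2 chain entries
```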

# Demonstrating load factor impact
class InstrumentedHashTable
  attr_reader :collision_count
  
  def initialize(size = 10)
    @buckets = Array.new(size) { [] }
    @size = size
    @count = 0
    @collision_count = 0
  end
  
  def insert(key, value)
    index = key.hash % @size
    bucket = @buckets[index]
    
    @collision_count += 1 unless bucket.empty?
    
    bucket << [key, value]
    @count += 1
  end
  
  def load_factor
    @count.to_f / @size
  end
end

# Test with different load factors
[0.5, 0.75, 1.0, 2.0].each do |target_load|
  table = InstrumentedHashTable.new(100)
  elements = (target_load * 100).to_i
  
  elements.times { |i| table.insert(i, "value#{i}") }
  
  puts "Load factor: #{table.load_factor.round(2)}, " \
       "Collisions: #{table.collision_count}"
end

Ruby's Hash class automatically resizes when the load factor exceeds a threshold, typically around 0.75. The resize operation allocates a new array with double capacity and rehashes all entries. While individual resize operations cost O(n), they occur infrequently enough that amortized insertion remains O(1).

Memory overhead varies by implementation. Separate chaining requires additional space for chain data structures, typically storing 2-3 words per element beyond the key-value pair. Open addressing stores elements directly in the array but requires maintaining empty slots, wasting space when load factor stays low.

Worst-case performance occurs when all keys hash to the same index. For separate chaining, operations degrade to O(n) as they traverse the entire chain. For open addressing, every probe must be checked, also resulting in O(n) complexity. This scenario occurs with adversarial input or pathologically poor hash functions.

# Demonstrating worst-case behavior
class PoorHash
  attr_reader :probe_count
  
  def initialize
    @data = Array.new(10)
    @probe_count = 0
  end
  
  def insert(key, value)
    # All keys hash to index 0 (worst case)
    index = 0
    @probe_count = 0
    
    # Linear probing
    while @data[index]
      index = (index + 1) % @data.size
      @probe_count += 1
    end
    
    @data[index] = [key, value]
  end
end

poor = PoorHash.new
5.times do |i|
  poor.insert(i, "value#{i}")
  puts "Insert #{i}: #{poor.probe_count} probes"  # probe count grows linearly
end

Cache performance affects hash table speed in practice. Separate chaining exhibits poor cache locality because chain traversal requires pointer chasing across non-contiguous memory. Open addressing with linear probing demonstrates excellent cache behavior by accessing consecutive memory locations. This advantage can outweigh the theoretical algorithmic costs for small to medium hash tables.

Practical Examples

Hash tables solve the two-sum problem by storing seen values during array traversal. The algorithm checks whether the complement of the current value exists in the hash table, achieving O(n) time complexity compared to O(n²) for the nested loop approach.

def two_sum(nums, target)
  seen = {}
  
  nums.each_with_index do |num, i|
    complement = target - num
    
    if seen.key?(complement)
      return [seen[complement], i]
    end
    
    seen[num] = i
  end
  
  nil
end

# Example usage
nums = [2, 7, 11, 15]
target = 9
result = two_sum(nums, target)
puts result.inspect  # => [0, 1] (nums[0] + nums[1] = 2 + 7 = 9)

Counting element frequencies demonstrates hash table utility for aggregation tasks. The pattern initializes counts to zero for missing keys and increments for existing keys.

def count_frequencies(items)
  frequencies = Hash.new(0)
  items.each { |item| frequencies[item] += 1 }
  frequencies
end

words = ["apple", "banana", "apple", "cherry", "banana", "apple"]
freq = count_frequencies(words)
puts freq.inspect  # => {"apple"=>3, "banana"=>2, "cherry"=>1}

# Finding most frequent element
most_frequent = freq.max_by { |word, count| count }
puts "Most frequent: #{most_frequent[0]} (#{most_frequent[1]} times)"

Grouping related items by a computed key requires hash tables with array values. This pattern appears in data processing, where records must be organized by category, date, or other attributes.

def group_by_length(strings)
  groups = Hash.new { |h, k| h[k] = [] }
  
  strings.each do |string|
    groups[string.length] << string
  end
  
  groups
end

words = ["cat", "elephant", "dog", "bear", "ant"]
grouped = group_by_length(words)
puts grouped.inspect
# => {3=>["cat", "dog", "ant"], 8=>["elephant"], 4=>["bear"]}

# Built-in Ruby method accomplishes the same
grouped_builtin = words.group_by(&:length)

Caching expensive computation results with memoization uses hash tables to store previously computed values. The Fibonacci sequence provides a classic example where naive recursion performs exponentially while memoized recursion completes in linear time.

class Fibonacci
  def initialize
    @cache = {}
  end
  
  def compute(n)
    return n if n <= 1
    
    @cache[n] ||= compute(n - 1) + compute(n - 2)
  end
end

fib = Fibonacci.new
puts fib.compute(50)  # Completes instantly
# Without memoization, fib(50) would take minutes

# Alternative with method-level memoization
def fib_memo(n, cache = {})
  return n if n <= 1
  cache[n] ||= fib_memo(n - 1, cache) + fib_memo(n - 2, cache)
end

Finding duplicate elements in an array demonstrates hash table efficiency for set membership testing. The algorithm achieves O(n) time compared to O(n²) for nested loops or O(n log n) for sorting approaches.

def find_duplicates(array)
  seen = {}
  duplicates = []
  
  array.each do |element|
    if seen[element]
      duplicates << element unless duplicates.include?(element)
    else
      seen[element] = true
    end
  end
  
  duplicates
end

numbers = [1, 2, 3, 2, 4, 5, 3, 6]
dupes = find_duplicates(numbers)
puts dupes.inspect  # => [2, 3]

# More idiomatic Ruby using tally
duplicates = numbers.tally.select { |num, count| count > 1 }.keys

Design Considerations

Choosing between separate chaining and open addressing depends on several factors. Separate chaining handles high load factors gracefully because chains grow dynamically. Memory allocation happens incrementally as elements insert. Deletion remains straightforward by removing elements from chains. However, separate chaining suffers from poor cache locality and requires extra memory for chain pointers.

Open addressing eliminates pointer overhead by storing elements directly in the array. Linear probing exhibits excellent cache performance by accessing consecutive memory locations. The approach works well for low to moderate load factors. However, deletion complicates open addressing because removing an element creates gaps in probe sequences. Common solutions include tombstone markers or backward shifting of elements, both adding complexity.

# Open addressing with tombstone deletion
class OpenAddressedHash
  TOMBSTONE = Object.new
  
  def initialize(size = 10)
    @table = Array.new(size)
    @size = size
  end
  
  def insert(key, value)
    index = find_slot(key)
    @table[index] = [key, value]
  end
  
  def delete(key)
    index = find_index(key)
    @table[index] = TOMBSTONE if index
  end
  
  def get(key)
    index = find_index(key)
    index ? @table[index][1] : nil
  end
  
  private
  
  # Probe at most @size slots; remember the first tombstone so that a
  # reinserted key reuses it instead of creating a duplicate entry
  def find_slot(key)
    index = key.hash % @size
    first_free = nil
    
    @size.times do
      entry = @table[index]
      return first_free || index if entry.nil?
      if entry == TOMBSTONE
        first_free ||= index
      elsif entry[0] == key
        return index
      end
      index = (index + 1) % @size
    end
    
    first_free or raise "hash table full"
  end
  
  def find_index(key)
    index = key.hash % @size
    
    @size.times do
      entry = @table[index]
      return nil if entry.nil?
      return index if entry != TOMBSTONE && entry[0] == key
      index = (index + 1) % @size
    end
    
    nil
  end
end

Hash function selection balances distribution quality, computation speed, and security requirements. Non-cryptographic functions like MurmurHash or FNV-1a suffice for most applications, providing good distribution with minimal overhead. Applications vulnerable to hash collision attacks require cryptographic functions like SHA-256 or keyed functions like SipHash, accepting the performance penalty for security guarantees.

The initial capacity and resize threshold affect memory usage and resize frequency. Starting with a large capacity reduces resizing but wastes memory for small collections. Starting with a small capacity saves memory but triggers frequent resizing. Choosing thresholds of 0.75 for separate chaining and 0.5-0.7 for open addressing balances these concerns.

Ordered hash tables maintain insertion order, providing predictable iteration. Ruby's Hash implements this behavior natively at minimal overhead by keeping an insertion-ordered entry list alongside the index structure. Applications requiring sorted traversal might choose tree-based maps instead, trading O(1) operations for O(log n) with guaranteed ordering.

# Comparing ordered vs sorted iteration
hash = { c: 3, a: 1, b: 2 }

# Insertion order (Ruby default)
hash.each { |k, v| puts "#{k}: #{v}" }
# Output: c: 3, a: 1, b: 2

# Sorted iteration when needed
hash.sort.each { |k, v| puts "#{k}: #{v}" }
# Output: a: 1, b: 2, c: 3

# Tree-based map keeps keys sorted at all times
require 'rbtree'  # requires the rbtree gem

sorted = RBTree.new
sorted[:a] = 1
sorted[:c] = 3
sorted[:b] = 2
sorted.each { |k, v| puts "#{k}: #{v}" }
# Output: a: 1, b: 2, c: 3

Thread safety requires synchronization for concurrent access. Ruby's Hash is not thread-safe by default. Multiple threads reading simultaneously causes no issues, but concurrent modifications lead to race conditions and data corruption. Options include external locking, concurrent hash implementations, or thread-local storage.
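A minimal external-locking sketch using the standard library's Mutex; the class name and the update helper are illustrative, and gems such as concurrent-ruby offer finer-grained concurrent maps:

```ruby
# Minimal thread-safe wrapper: one Mutex serializes every operation
class SynchronizedHash
  def initialize
    @hash = {}
    @lock = Mutex.new
  end

  def [](key)
    @lock.synchronize { @hash[key] }
  end

  def []=(key, value)
    @lock.synchronize { @hash[key] = value }
  end

  # Read-modify-write must happen inside a single critical section;
  # `counter[:hits] += 1` through the accessors above would still race
  def update(key)
    @lock.synchronize { @hash[key] = yield(@hash[key]) }
  end
end

counter = SynchronizedHash.new
counter[:hits] = 0

threads = 4.times.map do
  Thread.new { 1000.times { counter.update(:hits) { |v| v + 1 } } }
end
threads.each(&:join)

puts counter[:hits]  # => 4000
```

Coarse-grained locking like this is simple and correct but serializes all access, which limits throughput under heavy contention.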

Common Pitfalls

Mutable objects used as hash keys create subtle bugs when the object changes after insertion. Hash tables compute indices from key hash values at insertion time. Modifying the key afterwards changes its hash value, making the entry unreachable. Ruby sidesteps this issue for strings by storing a frozen copy of any unfrozen string key.

# Mutable key problem
key = [1, 2, 3]
hash = { key => "value" }

puts hash[key]  # => "value"

key << 4  # Modifies the key
puts hash[key]  # => nil (entry no longer findable)

# The original entry still exists but is unreachable
puts hash.inspect  # => {[1, 2, 3, 4]=>"value"}

# Strings are protected: the hash stores a frozen copy of the key
str_key = "key"
hash2 = { str_key => "value" }
puts hash2.keys.first.frozen?          # => true
puts hash2.keys.first.equal?(str_key)  # => false (a defensive copy)
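For keys that have already mutated, Ruby's Hash#rehash recomputes every key's index and makes the entry reachable again:

```ruby
# Hash#rehash rebuilds the index from the keys' current hash values
key = [1, 2, 3]
hash = { key => "value" }
key << 4

puts hash[key].inspect  # => nil (stale index)

hash.rehash
puts hash[key]          # => "value"
```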

Inconsistent hash and eql? implementations break hash table functionality. Objects must implement these methods together such that objects comparing equal via eql? produce identical hash values. Violating this contract causes duplicate keys or unfindable entries.

class BrokenKey
  attr_accessor :value
  
  def initialize(value)
    @value = value
  end
  
  def hash
    value.hash
  end
  
  # Broken: eql? always returns false
  def eql?(other)
    false
  end
end

k1 = BrokenKey.new(42)
k2 = BrokenKey.new(42)
hash = { k1 => "first" }

# Same hash value but eql? returns false
puts k1.hash == k2.hash  # => true
puts k1.eql?(k2)         # => false

# Creates duplicate entries despite same hash
hash[k2] = "second"
puts hash.size  # => 2 (should be 1)

The default value pitfall occurs when using Hash.new(default) with mutable objects. All missing keys share the same default object instance, causing unexpected mutations.

# Wrong: shared default object
bad_hash = Hash.new([])
bad_hash[:a] << 1
bad_hash[:b] << 2

puts bad_hash[:a].inspect  # => [1, 2] (unexpected!)
puts bad_hash[:c].inspect  # => [1, 2] (all keys share same array)

# Correct: default block creates new object per key
good_hash = Hash.new { |h, k| h[k] = [] }
good_hash[:a] << 1
good_hash[:b] << 2

puts good_hash[:a].inspect  # => [1]
puts good_hash[:b].inspect  # => [2]

Hash collision attacks exploit predictable hash functions to create pathological inputs. An attacker constructs keys that all hash to the same value, degrading performance from O(1) to O(n). This vulnerability affects web applications accepting user input as hash keys. Ruby mitigates the attack by seeding its hash function with a random per-process value and, since Ruby 2.0, computing string hashes with the keyed SipHash function.
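The per-process randomization is observable by hashing the same string in two child interpreters. This sketch assumes the current Ruby binary reported by RbConfig can be spawned and that the shell uses Unix-style quoting:

```ruby
require 'rbconfig'

# The same string hashes identically within one process...
puts "payload".hash == "payload".hash  # => true

# ...but differently across processes, because each process draws a
# fresh random seed for its hash function at startup
ruby = RbConfig.ruby
h1 = `#{ruby} -e 'print "payload".hash'`
h2 = `#{ruby} -e 'print "payload".hash'`
puts h1 == h2  # almost certainly false
```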

Forgetting to resize hash tables leads to performance degradation. As load factor increases, collision frequency grows, slowing all operations. Manual implementations must track element count and trigger resizing at appropriate thresholds. Ruby's Hash handles this automatically.

Comparing hash equality requires careful consideration of key order. The == operator checks for equal keys and values but ignores order in Ruby hashes. Two hashes with identical contents in different insertion orders compare as equal.

h1 = { a: 1, b: 2, c: 3 }
h2 = { c: 3, a: 1, b: 2 }

puts h1 == h2  # => true (equal despite different order)

# Use to_a for order-sensitive comparison
puts h1.to_a == h2.to_a  # => false (order matters)

Reference

Hash Table Operations Complexity

Operation | Average Case | Worst Case | Notes
Insert | O(1) | O(n) | Worst case occurs with all collisions
Delete | O(1) | O(n) | May require chain traversal or probing
Search | O(1) | O(n) | Depends on hash function quality
Resize | O(n) | O(n) | Occurs infrequently for amortized O(1)

Collision Resolution Comparison

Strategy | Pros | Cons | Best Use Case
Separate Chaining | Handles high load factors, simple deletion | Poor cache locality, pointer overhead | Unknown or high load factors
Linear Probing | Excellent cache performance, no pointers | Clustering issues, complex deletion | Low to moderate load factors
Quadratic Probing | Reduces clustering vs linear | Requires prime table size | Moderate load factors
Double Hashing | Minimal clustering | Slower hash computation | Security-sensitive applications

Ruby Hash Methods

Method | Purpose | Example
[] | Access value by key | hash[:key]
[]= | Set value for key | hash[:key] = value
fetch | Access with default or error | hash.fetch(:key, default)
key? | Check key existence | hash.key?(:name)
delete | Remove key-value pair | hash.delete(:key)
merge | Combine hashes | hash1.merge(hash2)
transform_keys | Modify all keys | hash.transform_keys(&:upcase)
transform_values | Modify all values | hash.transform_values { |v| v * 2 }
select | Filter entries | hash.select { |k, v| v > 10 }
each | Iterate key-value pairs | hash.each { |k, v| puts v }

Load Factor Guidelines

Implementation | Recommended Threshold | Typical Range | Performance Impact
Separate Chaining | 0.75 - 1.0 | 0.5 - 2.0 | Linear degradation above 1.0
Linear Probing | 0.5 - 0.7 | 0.3 - 0.8 | Exponential degradation above 0.7
Quadratic Probing | 0.5 - 0.75 | 0.3 - 0.8 | Table must stay under 50% full for insertion guarantees
Double Hashing | 0.5 - 0.75 | 0.3 - 0.8 | Better than linear, worse than chaining

Hash Function Properties

Property | Description | Importance
Deterministic | Same input always produces same output | Required for correctness
Uniform Distribution | Hash values spread evenly across range | Critical for performance
Fast Computation | Minimal CPU cycles per hash | Affects all operation speed
Avalanche Effect | Small input change produces large output change | Improves distribution quality
Collision Resistance | Difficult to find two inputs with same hash | Important for security

Common Hash Function Algorithms

Function | Speed | Distribution | Security | Use Case
Division Method | Very Fast | Poor to Fair | None | Simple applications
Multiplication Method | Fast | Good | None | General purpose
MurmurHash | Fast | Excellent | None | Non-cryptographic hashing
CityHash | Very Fast | Excellent | None | Short strings, high performance
SipHash | Medium | Excellent | Good | Hash collision attack prevention
SHA-256 | Slow | Excellent | Excellent | Cryptographic applications

Ruby Hash Construction Syntax

Syntax | Example | Notes
Hash rocket | { key => value } | Works with any object as key
Symbol shorthand | { key: value } | Symbol keys only, most concise
Hash.new | Hash.new(default) | Specify default value
Hash.new with block | Hash.new { |h, k| ... } | Dynamic default per key
Hash[] | Hash[key1, val1, key2, val2] | Alternating keys and values
to_h | array.to_h | Convert array of pairs to hash