Overview
String hashing converts variable-length string data into fixed-size integer values. The hash function processes each character in the string and produces a numeric representation that serves as an identifier or index. Hash values enable constant-time lookups in hash tables, quick string comparisons, and data integrity verification.
The fundamental operation maps strings from a potentially infinite domain to a finite range of integers. A string "hello" becomes a numeric value like 99162322, while "world" produces a different value like 113318802. The same input always produces the same output, but different inputs should ideally produce different outputs.
String hashing addresses the challenge of storing and retrieving string data efficiently. Direct string comparison must examine every character in the worst case, an O(n) operation where n equals string length. Hash values reduce the initial lookup to O(1), with a full O(n) comparison needed only to confirm a match or resolve a hash collision.
Two primary categories exist: cryptographic and non-cryptographic hashing. Cryptographic hash functions like SHA-256 prioritize security properties and collision resistance for passwords and digital signatures. Non-cryptographic functions like polynomial hashing prioritize speed for hash tables and data structures. The choice depends on whether security or performance takes precedence.
# Basic hash value generation
"hello".hash
# => 4276896046642784406
"world".hash
# => -2329455469102737859
# Same string produces same hash
"hello".hash == "hello".hash
# => true
Key Principles
A hash function h maps strings s to integers: h(s) → integer. The mapping must satisfy determinism - calling h(s) repeatedly returns identical values. This consistency enables reliable data retrieval and comparison operations.
The hash function processes strings character by character, accumulating a numeric result. Each character contributes to the final hash value based on its position and value. The position matters because "abc" and "cba" must produce different hashes despite containing identical characters.
Collision resistance determines hash function quality. A collision occurs when h(s1) = h(s2) while s1 ≠ s2. Perfect collision avoidance proves impossible when the input domain exceeds the output range - mapping infinite strings to finite integers guarantees some collisions. Quality hash functions minimize collision probability through mathematical properties and careful design.
Distribution uniformity spreads hash values evenly across the output range. Poor distribution clusters values, increasing collision rates and degrading hash table performance. A uniform distribution means each output value appears with approximately equal probability across random inputs.
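The pigeonhole argument can be made concrete. The sketch below (the `tiny_hash` helper is illustrative, not from the text) squeezes all 676 two-letter strings into 101 buckets, so a collision is guaranteed:

```ruby
# Brute-force a collision: 676 two-letter strings cannot map
# injectively into 101 buckets (pigeonhole principle)
def tiny_hash(s, base = 31, modulo = 101)
  s.each_char.reduce(0) { |h, c| (h * base + c.ord) % modulo }
end

seen = {}
collision = nil
('aa'..'zz').each do |s|
  h = tiny_hash(s)
  if seen.key?(h)
    collision = [seen[h], s]  # two distinct strings, same bucket
    break
  end
  seen[h] = s
end

collision # => a pair of distinct strings sharing a hash value
```

In practice the collision surfaces long before all 676 strings are hashed, as the birthday paradox predicts.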
The polynomial rolling hash demonstrates core hashing principles:
hash(s) = (s[0] * p^(n-1) + s[1] * p^(n-2) + ... + s[n-1] * p^0) mod m
Where:
- s[i] represents character at position i
- p is a prime number base (commonly 31 or 53)
- n equals string length
- m defines the hash table size or modulo value
The polynomial approach weights characters by position using powers of the base. Characters at the beginning carry more weight than those at the end. The modulo operation constrains results to the desired range.
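For a two-character string the positional weighting is small enough to check by hand; with p = 31 and no modulus:

```ruby
# 'a' (97) is weighted by 31^1, 'b' (98) by 31^0
hash_ab = 'a'.ord * 31 + 'b'.ord  # 97 * 31 + 98
hash_ab # => 3105

# Reversing the characters swaps the weights and changes the hash
hash_ba = 'b'.ord * 31 + 'a'.ord  # 98 * 31 + 97
hash_ba # => 3135
```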
Hash table implementation requires understanding the load factor α = n/m, where n represents the number of stored elements and m represents the table size. As α increases, collision probability rises. Rehashing expands the table when α exceeds a threshold (typically 0.75).
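The resize decision reduces to simple arithmetic; a minimal sketch (the helper name is illustrative):

```ruby
# α = n / m; resize once the load factor crosses the threshold
def needs_resize?(count, buckets, threshold = 0.75)
  count.to_f / buckets > threshold
end

needs_resize?(9, 16)   # α = 0.5625 => false
needs_resize?(13, 16)  # α = 0.8125 => true
```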
Avalanche effect describes how small input changes produce large output changes. Flipping a single bit in the input should alter roughly half the output bits. Strong avalanche properties improve distribution and collision resistance.
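The avalanche effect can be observed by counting differing bits between digests of two inputs that differ in a single character. For a 256-bit digest roughly 128 bits should flip (the exact count varies by input):

```ruby
require 'digest'

# Count differing bits between two SHA-256 digests of near-identical inputs
d1 = Digest::SHA256.digest("abc")
d2 = Digest::SHA256.digest("abd")
diff_bits = d1.bytes.zip(d2.bytes).sum { |a, b| (a ^ b).to_s(2).count('1') }
diff_bits # close to 128 out of 256
```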
Ruby Implementation
Ruby provides built-in hashing through the hash method available on all objects, including strings. Current Ruby versions implement it with SipHash, a keyed non-cryptographic hash function chosen for speed, distribution quality, and resistance to collision-flooding attacks.
# Built-in string hashing
str = "example"
str.hash
# => -2757638103894144305
# Hash values persist within the same Ruby process
hash1 = "data".hash
hash2 = "data".hash
hash1 == hash2
# => true
The hash method returns different values across Ruby process restarts. Ruby randomizes hash seeds at startup to prevent hash collision denial-of-service attacks. This means hash values cannot be stored persistently or shared between processes.
# Hash values differ between processes
# Process 1: "ruby".hash => 1234567890
# Process 2: "ruby".hash => 9876543210
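This is straightforward to verify by spawning fresh interpreters, each of which draws its own random seed (the snippet assumes a `ruby` executable is on the PATH):

```ruby
# Each child process gets its own random hash seed, so the same
# string hashes differently across processes
h1 = `ruby -e 'print "ruby".hash'`
h2 = `ruby -e 'print "ruby".hash'`
h1 == h2 # => false, with overwhelming probability
```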
Custom hash implementations use the Digest library for consistent, reproducible hashing:
require 'digest'
def sha256_hash(string)
Digest::SHA256.hexdigest(string)
end
sha256_hash("password")
# => "5e884898da28047151d0e56f8dc6292773603d0d6aabbdd62a11ef721d1542d8"
# Same across all processes and sessions
sha256_hash("password") == sha256_hash("password")
# => true
The Digest module offers multiple algorithms:
require 'digest'
text = "hash this"
# MD5 (fast but cryptographically broken)
Digest::MD5.hexdigest(text)
# => 32-character hex digest
# SHA-1 (faster than SHA-256, but deprecated for security)
Digest::SHA1.hexdigest(text)
# => 40-character hex digest
# SHA-256 (secure, widely used)
Digest::SHA256.hexdigest(text)
# => 64-character hex digest
# SHA-512 (more secure, slower)
Digest::SHA512.hexdigest(text)
# => 128-character hex digest
Implementing polynomial hash manually demonstrates the algorithm:
def polynomial_hash(string, base = 31, modulo = 2**32)
hash_value = 0
string.each_char.with_index do |char, index|
hash_value = (hash_value * base + char.ord) % modulo
end
hash_value
end
polynomial_hash("abc")
# => 96354
polynomial_hash("cab")
# => 98244
# Position matters - different strings produce different hashes
polynomial_hash("abc") == polynomial_hash("cab")
# => false
Rolling hash implementation enables efficient substring hashing:
class RollingHash
def initialize(string, base = 31, modulo = 2**32)
@string = string
@base = base
@modulo = modulo
@hash = compute_hash(string)
@power = (@base ** (string.length - 1)) % @modulo # weight of the window's leading character: base^(n-1)
end
def compute_hash(str)
hash_value = 0
str.each_char do |char|
hash_value = (hash_value * @base + char.ord) % @modulo
end
hash_value
end
def roll(old_char, new_char)
# Remove contribution of old character
@hash = (@hash - old_char.ord * @power) % @modulo
# Add new character
@hash = (@hash * @base + new_char.ord) % @modulo
@hash
end
end
roller = RollingHash.new("abc")
roller.roll('a', 'd') # Simulates moving window from "abc" to "bcd"
# => hash value for "bcd"
Hash tables in Ruby use the built-in Hash class, which relies on object hash values:
# Hash automatically uses string hash values
lookup = Hash.new
lookup["key1"] = "value1"
lookup["key2"] = "value2"
# Internal: uses "key1".hash to determine storage location
lookup["key1"]
# => "value1"
# Custom hash objects must implement hash and eql?
class CustomKey
attr_reader :value
def initialize(value)
@value = value
end
def hash
@value.hash
end
def eql?(other)
other.is_a?(CustomKey) && @value == other.value
end
end
custom_lookup = Hash.new
custom_lookup[CustomKey.new("test")] = "found"
custom_lookup[CustomKey.new("test")]
# => "found"
Common Patterns
Polynomial hashing forms the foundation of many string hashing applications. The pattern multiplies the accumulated hash by a base value and adds the next character:
def polynomial_hash(string, base = 31)
hash_value = 0
string.each_char do |char|
hash_value = hash_value * base + char.ord
end
hash_value
end
polynomial_hash("test")
# => 3556498
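Note that without a modulus Ruby silently promotes the accumulator to an arbitrary-precision integer rather than overflowing. Masking emulates fixed-width hashing (the helper name is illustrative):

```ruby
# Mask to 64 bits to emulate fixed-width integer hashing
def polynomial_hash64(string, base = 31)
  string.each_char.reduce(0) { |h, c| (h * base + c.ord) & 0xFFFF_FFFF_FFFF_FFFF }
end

polynomial_hash64("test") # => 3556498 (short strings fit without wrapping)
```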
The rolling hash pattern updates hash values incrementally when processing sliding windows:
class SubstringMatcher
def initialize(pattern, base = 31, modulo = 2**32)
@pattern = pattern
@pattern_hash = compute_hash(pattern, base, modulo)
@base = base
@modulo = modulo
@pattern_length = pattern.length
@power = (base ** (@pattern_length - 1)) % modulo
end
def compute_hash(string, base, modulo)
hash = 0
string.each_char do |char|
hash = (hash * base + char.ord) % modulo
end
hash
end
def find_matches(text)
return [] if text.length < @pattern_length
matches = []
current_hash = compute_hash(text[0...@pattern_length], @base, @modulo)
(0..text.length - @pattern_length).each do |i|
if i > 0
old_char = text[i - 1]
new_char = text[i + @pattern_length - 1]
current_hash = roll_hash(current_hash, old_char, new_char)
end
if current_hash == @pattern_hash
matches << i if text[i, @pattern_length] == @pattern
end
end
matches
end
def roll_hash(hash, old_char, new_char)
hash = (hash - old_char.ord * @power) % @modulo
hash = (hash * @base + new_char.ord) % @modulo
hash
end
end
matcher = SubstringMatcher.new("abc")
matcher.find_matches("xyzabcdefabc")
# => [3, 9]
Double hashing resolves collisions in hash tables by using a secondary hash function:
class DoubleHashTable
def initialize(size = 16)
@size = size
@table = Array.new(size)
end
def hash1(key)
key.hash % @size
end
def hash2(key)
1 + (key.hash % (@size - 1))
end
def insert(key, value)
index = hash1(key)
step = hash2(key)
# Simplified: probing assumes the table never fills; a production
# version would track the element count and resize at a load-factor threshold
while @table[index]
index = (index + step) % @size
end
@table[index] = [key, value]
end
def get(key)
index = hash1(key)
step = hash2(key)
while @table[index]
return @table[index][1] if @table[index][0] == key
index = (index + step) % @size
end
nil
end
end
table = DoubleHashTable.new
table.insert("name", "Alice")
table.get("name")
# => "Alice"
Combining multiple hash functions reduces collision probability:
require 'digest'
def combined_hash(string)
hash1 = Digest::MD5.hexdigest(string)[0..7].to_i(16)
hash2 = Digest::SHA1.hexdigest(string)[0..7].to_i(16)
hash1 ^ hash2 # XOR combination
end
combined_hash("data")
# => combined hash value
FNV (Fowler-Noll-Vo) hash provides fast, high-quality distribution:
def fnv1a_hash(string)
hash = 2166136261 # FNV offset basis
string.each_byte do |byte|
hash ^= byte
hash *= 16777619 # FNV prime
hash &= 0xFFFFFFFF # Keep 32-bit
end
hash
end
fnv1a_hash("example")
# => 2851307223
Performance Considerations
Hash function speed directly impacts application performance. Non-cryptographic hash functions like SipHash or MurmurHash process short strings in tens of nanoseconds, while cryptographic functions like SHA-256 take on the order of a microsecond. That gap of one to two orders of magnitude matters for high-throughput applications.
require 'digest'
require 'benchmark'
text = "performance test string"
Benchmark.bm do |x|
x.report("built-in hash:") { 100000.times { text.hash } }
x.report("MD5:") { 100000.times { Digest::MD5.hexdigest(text) } }
x.report("SHA256:") { 100000.times { Digest::SHA256.hexdigest(text) } }
end
# Results (approximate):
# built-in hash: 0.002000 seconds
# MD5: 0.050000 seconds
# SHA256: 0.150000 seconds
Hash table performance depends on the load factor and collision rate. A load factor below 0.75 maintains O(1) average-case lookups. Beyond 0.75, collision chains lengthen and performance degrades toward O(n).
Collision handling strategies affect performance characteristics:
class ChainedHashTable
def initialize(size = 16)
@size = size
@buckets = Array.new(size) { [] }
@count = 0
end
def insert(key, value)
index = key.hash % @size
@buckets[index] << [key, value]
@count += 1
resize if load_factor > 0.75
end
def load_factor
@count.to_f / @size
end
def resize
old_buckets = @buckets
@size *= 2
@buckets = Array.new(@size) { [] }
@count = 0
old_buckets.each do |bucket|
bucket.each do |key, value|
insert(key, value)
end
end
end
end
String length affects hashing time linearly. Hashing a 1000-character string takes approximately 1000x longer than a single character. Caching hash values avoids recomputation:
class CachedString
attr_reader :value
def initialize(value)
@value = value
@cached_hash = nil
end
def hash
@cached_hash ||= compute_hash
end
def compute_hash
@value.hash
end
end
large_string = "x" * 10000
cached = CachedString.new(large_string)
Benchmark.bm do |x|
x.report("first call:") { cached.hash }
x.report("cached call:") { 10000.times { cached.hash } }
end
Rolling hash implementations achieve O(1) hash updates for substring operations, compared to O(n) for recomputing from scratch:
# O(n) approach - recompute entire hash
def slow_substring_search(text, pattern)
(0..text.length - pattern.length).each do |i|
substring = text[i, pattern.length]
return i if substring.hash == pattern.hash && substring == pattern
end
nil
end
# O(1) hash update approach - reuses SubstringMatcher from Common Patterns
def fast_substring_search(text, pattern)
SubstringMatcher.new(pattern).find_matches(text).first
end
Prime number selection for polynomial hashing affects distribution quality. Larger primes reduce collisions but increase computation time. Common choices include 31, 53, 97, and 101. Testing against actual data determines optimal values:
def test_distribution(strings, base)
hashes = strings.map { |s| polynomial_hash(s, base) }
unique_hashes = hashes.uniq.size
collision_rate = 1 - (unique_hashes.to_f / strings.size)
collision_rate
end
test_strings = ["cat", "dog", "rat", "bat", "mat"]
test_distribution(test_strings, 31)
# => collision rate (lower is better)
Security Implications
Cryptographic hash functions provide security properties absent from non-cryptographic variants. One-way functions prevent reversing hash values to recover original strings. SHA-256 makes finding x where SHA256(x) = y computationally infeasible.
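One consequence of mapping into a fixed range is directly observable: a digest has constant length regardless of input size.

```ruby
require 'digest'

# SHA-256 always emits 64 hex characters (256 bits), whatever the input size
short_digest = Digest::SHA256.hexdigest("a")
long_digest  = Digest::SHA256.hexdigest("a" * 100_000)
short_digest.length # => 64
long_digest.length  # => 64
```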
Password storage requires cryptographic hashing with salts to prevent rainbow table attacks:
require 'digest'
require 'securerandom'
def hash_password(password)
# Illustrative only: production code should use a key derivation
# function such as PBKDF2 or bcrypt, covered later in this section
salt = SecureRandom.hex(16)
hash = Digest::SHA256.hexdigest(salt + password)
"#{salt}$#{hash}"
end
def verify_password(password, stored)
salt, hash = stored.split('$')
computed_hash = Digest::SHA256.hexdigest(salt + password)
computed_hash == hash
end
stored = hash_password("secret123")
# => "<32-char hex salt>$<64-char hex digest>" - differs on every call because the salt is random
verify_password("secret123", stored)
# => true
verify_password("wrong", stored)
# => false
Hash collision attacks exploit predictable hash functions. Attackers craft inputs producing identical hash values, degrading hash table performance to O(n). Ruby randomizes hash seeds to prevent this attack vector.
# Vulnerable to collision attacks without randomization
def insecure_hash(string)
string.bytes.sum # Predictable, easy to create collisions
end
# "abc".bytes.sum => 294
# "bac".bytes.sum => 294 (collision)
# Ruby's built-in hash includes randomization
"abc".hash != "bac".hash
# => true (extremely likely, depends on random seed)
Timing attacks exploit execution time variations to extract information. Constant-time comparison prevents this:
# Vulnerable to timing attacks
def insecure_compare(hash1, hash2)
hash1 == hash2 # Returns early on first mismatch
end
# Constant-time comparison
def secure_compare(hash1, hash2)
return false if hash1.length != hash2.length
result = 0
hash1.bytes.zip(hash2.bytes).each do |b1, b2|
result |= b1 ^ b2
end
result == 0
end
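Recent Ruby versions also ship a constant-time comparison in the openssl standard library, which is preferable to hand-rolled loops (availability depends on the Ruby and openssl gem versions; `OpenSSL.fixed_length_secure_compare` requires equal-length inputs):

```ruby
require 'openssl'

# Constant-time equality check for same-length strings (openssl stdlib)
OpenSSL.fixed_length_secure_compare("digest-a", "digest-a") # => true
OpenSSL.fixed_length_secure_compare("digest-a", "digest-b") # => false
```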
MD5 and SHA-1 suffer from known collision vulnerabilities. Researchers have demonstrated collision generation, making these algorithms unsuitable for security-critical applications:
require 'digest'
# MD5 - DO NOT USE for security
Digest::MD5.hexdigest("data")
# Collisions can be generated
# SHA-1 - DO NOT USE for security
Digest::SHA1.hexdigest("data")
# Collisions demonstrated in 2017
# SHA-256 - Currently secure
Digest::SHA256.hexdigest("data")
# No practical collision attacks known
# SHA-512 - More secure, slower
Digest::SHA512.hexdigest("data")
# Higher security margin
Key derivation functions like PBKDF2, bcrypt, or scrypt provide better password hashing by incorporating iteration counts that slow down brute-force attacks:
require 'openssl'
require 'securerandom'
require 'base64'
def pbkdf2_hash(password, iterations = 100000)
salt = SecureRandom.random_bytes(16)
hash = OpenSSL::PKCS5.pbkdf2_hmac(
password,
salt,
iterations,
32,
OpenSSL::Digest::SHA256.new
)
"#{iterations}$#{Base64.strict_encode64(salt)}$#{Base64.strict_encode64(hash)}"
end
hashed = pbkdf2_hash("password123")
# Significantly slower to compute, resistant to brute-force
Practical Examples
Detecting duplicate files uses hash values for efficient comparison:
require 'digest'
class DuplicateFinder
def initialize
@hashes = Hash.new { |h, k| h[k] = [] }
end
def scan_directory(path)
Dir.glob("#{path}/**/*").each do |file|
next unless File.file?(file)
hash = Digest::SHA256.file(file).hexdigest
@hashes[hash] << file
end
end
def duplicates
@hashes.select { |hash, files| files.size > 1 }
end
end
finder = DuplicateFinder.new
finder.scan_directory("/path/to/files")
finder.duplicates.each do |hash, files|
puts "Duplicate files (#{hash}):"
files.each { |f| puts " #{f}" }
end
Caching computed results uses hash values as keys:
require 'digest'
class ComputationCache
def initialize
@cache = {}
end
def compute_with_cache(input)
key = Digest::SHA256.hexdigest(input.to_s)
@cache[key] ||= expensive_computation(input)
end
def expensive_computation(input)
# Simulated expensive operation
sleep(1)
input.reverse
end
end
cache = ComputationCache.new
cache.compute_with_cache("data") # Takes 1 second
cache.compute_with_cache("data") # Instant (cached)
Bloom filters use multiple hash functions for space-efficient set membership testing:
require 'digest'
class BloomFilter
def initialize(size = 1000)
@size = size
@bits = Array.new(size, false)
end
def hash_functions(item)
[
Digest::MD5.hexdigest(item).to_i(16) % @size,
Digest::SHA1.hexdigest(item).to_i(16) % @size,
Digest::SHA256.hexdigest(item).to_i(16) % @size
]
end
def add(item)
hash_functions(item).each do |index|
@bits[index] = true
end
end
def contains?(item)
hash_functions(item).all? { |index| @bits[index] }
end
end
filter = BloomFilter.new
filter.add("apple")
filter.add("banana")
filter.contains?("apple")
# => true
filter.contains?("orange")
# => false (probably, false positives possible)
Consistent hashing distributes data across servers:
require 'digest'
class ConsistentHash
def initialize(servers, virtual_nodes = 150)
@ring = {}
servers.each do |server|
virtual_nodes.times do |i|
hash = Digest::SHA256.hexdigest("#{server}:#{i}").to_i(16)
@ring[hash] = server
end
end
@sorted_keys = @ring.keys.sort
end
def get_server(key)
hash = Digest::SHA256.hexdigest(key).to_i(16)
index = @sorted_keys.bsearch_index { |k| k >= hash } || 0
@ring[@sorted_keys[index]]
end
end
hash_ring = ConsistentHash.new(["server1", "server2", "server3"])
hash_ring.get_server("user123")
# => e.g. "server2" (stable for a given key, since SHA-256 has no random seed)
hash_ring.get_server("user456")
# => e.g. "server1"
String interning uses hash tables to store unique string instances:
class StringInterner
def initialize
@strings = {}
end
def intern(string)
# Keying by the string itself still uses string hashing internally,
# but lets Ruby's Hash fall back to eql? on hash collisions instead
# of returning the wrong string
@strings[string] ||= string.dup.freeze
end
end
interner = StringInterner.new
str1 = interner.intern("hello")
str2 = interner.intern("hello")
str1.object_id == str2.object_id
# => true (same object reference)
Reference
Hash Function Comparison
| Function | Speed | Collision Resistance | Security | Use Case |
|---|---|---|---|---|
| Built-in hash | Fastest | Good | No | Hash tables, general purpose |
| Polynomial | Fast | Moderate | No | String matching, rolling hash |
| FNV-1a | Fast | Good | No | Hash tables, checksums |
| MurmurHash | Fast | Excellent | No | Hash tables, distributed systems |
| MD5 | Medium | Poor | Broken | Legacy checksums only |
| SHA-1 | Medium | Weak | Broken | Legacy applications only |
| SHA-256 | Slow | Excellent | Strong | Digital signatures, certificates |
| SHA-512 | Slower | Excellent | Stronger | High-security applications |
| PBKDF2 | Very slow | N/A | Strong | Password hashing |
Ruby Hash Methods
| Method | Description | Example |
|---|---|---|
| hash | Returns hash value for object | "text".hash |
| eql? | Checks equality for hash lookup | key1.eql?(key2) |
| Digest::MD5 | MD5 hash computation | Digest::MD5.hexdigest(str) |
| Digest::SHA256 | SHA-256 hash computation | Digest::SHA256.hexdigest(str) |
| Digest::SHA512 | SHA-512 hash computation | Digest::SHA512.hexdigest(str) |
Performance Characteristics
| Operation | Time Complexity | Space Complexity | Notes |
|---|---|---|---|
| Hash computation | O(n) | O(1) | n = string length |
| Hash table insert | O(1) average | O(n) | Amortized with resize |
| Hash table lookup | O(1) average | O(n) | Worst case O(n) with collisions |
| Rolling hash update | O(1) | O(1) | Constant time per character |
| Cryptographic hash | O(n) | O(1) | Slower constant factors |
Common Hash Bases
| Base | Properties | Collision Rate | Use Case |
|---|---|---|---|
| 31 | Small prime, fast multiplication | Low | General string hashing |
| 53 | Larger prime, better distribution | Very low | Hash tables |
| 97 | Large prime, excellent distribution | Very low | Critical applications |
| 101 | Large prime, good mixing | Very low | High-quality hashing |
| 257 | Power of 2 plus 1, fast | Low | Performance-critical code |
Security Guidelines
| Scenario | Recommended Hash | Avoid | Rationale |
|---|---|---|---|
| Password storage | PBKDF2, bcrypt, scrypt | MD5, SHA-1, plain SHA-256 | Requires key derivation |
| Digital signatures | SHA-256, SHA-512 | MD5, SHA-1 | Collision resistance needed |
| File integrity | SHA-256 | MD5 | Collision attacks possible |
| Hash tables | Built-in hash, MurmurHash | Cryptographic hashes | Speed vs security tradeoff |
| Checksums | CRC32, MD5 | — | Speed acceptable for non-security error detection |
Load Factor Impact
| Load Factor | Average Probes | Resize Needed | Performance |
|---|---|---|---|
| < 0.5 | ~1.1 | No | Excellent, but memory inefficient |
| 0.5 - 0.75 | ~1.5 | No | Good balance |
| 0.75 - 1.0 | ~2.5 | Recommended | Degrading |
| > 1.0 | 5+ | Required | Poor; only possible with chaining |
Hash Function Implementation Template
def custom_hash(string, base = 31, modulo = 2**32)
hash_value = 0
string.each_char do |char|
hash_value = (hash_value * base + char.ord) % modulo
end
hash_value
end
Collision Resolution Strategies
| Strategy | Pros | Cons | Implementation |
|---|---|---|---|
| Chaining | Simple, handles high load | Extra memory, pointer overhead | Linked lists at each bucket |
| Open addressing | Cache-friendly, no pointers | Clustering, complex deletion | Linear or quadratic probing |
| Double hashing | Reduced clustering | Two hash functions needed | Secondary hash for step size |
| Robin Hood | Balanced chains | Complex insertion | Variance-reducing probe |