Overview
String hashing converts variable-length string data into fixed-size integer values. The hash function processes each character in the string and produces a numeric representation that serves as an identifier or index. Hash values enable constant-time lookups in hash tables, quick string comparisons, and data integrity verification.
The fundamental operation maps strings from a potentially infinite domain to a finite range of integers. A string "hello" becomes a numeric value like 99162322, while "world" produces a different value like 113318802. The same input always produces the same output, but different inputs should ideally produce different outputs.
String hashing addresses the challenge of storing and retrieving string data efficiently. Direct string comparison must examine every character in the worst case, an O(n) operation where n equals string length. Hash values reduce the initial lookup to O(1), with a full O(n) comparison needed only to confirm a match or resolve a hash collision.
Two primary categories exist: cryptographic and non-cryptographic hashing. Cryptographic hash functions like SHA-256 prioritize security properties and collision resistance for passwords and digital signatures. Non-cryptographic functions like polynomial hashing prioritize speed for hash tables and data structures. The choice depends on whether security or performance takes precedence.
# Basic hash value generation
"hello".hash
# => 4276896046642784406
"world".hash
# => -2329455469102737859
# Same string produces same hash
"hello".hash == "hello".hash
# => true
Key Principles
A hash function h maps strings s to integers: h(s) → integer. The mapping must satisfy determinism - calling h(s) repeatedly returns identical values. This consistency enables reliable data retrieval and comparison operations.
The hash function processes strings character by character, accumulating a numeric result. Each character contributes to the final hash value based on its position and value. The position matters because "abc" and "cba" must produce different hashes despite containing identical characters.
Collision resistance determines hash function quality. A collision occurs when h(s1) = h(s2) while s1 ≠ s2. Perfect collision avoidance proves impossible when the input domain exceeds the output range - mapping infinite strings to finite integers guarantees some collisions. Quality hash functions minimize collision probability through mathematical properties and careful design.
Distribution uniformity spreads hash values evenly across the output range. Poor distribution clusters values, increasing collision rates and degrading hash table performance. A uniform distribution means each output value appears with approximately equal probability across random inputs.
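The pigeonhole argument can be made concrete. The sketch below (the `tiny_hash` helper is illustrative, not from the text) squeezes all 676 two-letter strings into 101 buckets, so a collision is guaranteed:

```ruby
# Brute-force a collision: 676 two-letter strings cannot map
# injectively into 101 buckets (pigeonhole principle)
def tiny_hash(s, base = 31, modulo = 101)
  s.each_char.reduce(0) { |h, c| (h * base + c.ord) % modulo }
end

seen = {}
collision = nil
('aa'..'zz').each do |s|
  h = tiny_hash(s)
  if seen.key?(h)
    collision = [seen[h], s]  # two distinct strings, same bucket
    break
  end
  seen[h] = s
end

collision # => a pair of distinct strings sharing a hash value
```

In practice the collision surfaces long before all 676 strings are hashed, as the birthday paradox predicts.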
The polynomial rolling hash demonstrates core hashing principles:
hash(s) = (s[0] * p^(n-1) + s[1] * p^(n-2) + ... + s[n-1] * p^0) mod m
Where:
- s[i] represents character at position i
- p is a prime number base (commonly 31 or 53)
- n equals string length
- m defines the hash table size or modulo value
The polynomial approach weights characters by position using powers of the base. Characters at the beginning carry more weight than those at the end. The modulo operation constrains results to the desired range.
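For a two-character string the positional weighting is small enough to check by hand; with p = 31 and no modulus:

```ruby
# 'a' (97) is weighted by 31^1, 'b' (98) by 31^0
hash_ab = 'a'.ord * 31 + 'b'.ord  # 97 * 31 + 98
hash_ab # => 3105

# Reversing the characters swaps the weights and changes the hash
hash_ba = 'b'.ord * 31 + 'a'.ord  # 98 * 31 + 97
hash_ba # => 3135
```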
Hash table implementation requires understanding the load factor α = n/m, where n represents the number of stored elements and m represents the table size. As α increases, collision probability rises. Rehashing expands the table when α exceeds a threshold (typically 0.75).
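The resize decision reduces to simple arithmetic; a minimal sketch (the helper name is illustrative):

```ruby
# α = n / m; resize once the load factor crosses the threshold
def needs_resize?(count, buckets, threshold = 0.75)
  count.to_f / buckets > threshold
end

needs_resize?(9, 16)   # α = 0.5625 => false
needs_resize?(13, 16)  # α = 0.8125 => true
```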
Avalanche effect describes how small input changes produce large output changes. Flipping a single bit in the input should alter roughly half the output bits. Strong avalanche properties improve distribution and collision resistance.
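The avalanche effect can be observed by counting differing bits between digests of two inputs that differ in a single character. For a 256-bit digest roughly 128 bits should flip (the exact count varies by input):

```ruby
require 'digest'

# Count differing bits between two SHA-256 digests of near-identical inputs
d1 = Digest::SHA256.digest("abc")
d2 = Digest::SHA256.digest("abd")
diff_bits = d1.bytes.zip(d2.bytes).sum { |a, b| (a ^ b).to_s(2).count('1') }
diff_bits # close to 128 out of 256
```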
Ruby Implementation
Ruby provides built-in hashing through the hash method available on all objects, including strings. Current Ruby versions implement it with SipHash, a keyed non-cryptographic hash function chosen for speed, distribution quality, and resistance to collision-flooding attacks.
# Built-in string hashing
str = "example"
str.hash
# => -2757638103894144305
# Hash values persist within the same Ruby process
hash1 = "data".hash
hash2 = "data".hash
hash1 == hash2
# => true
The hash method returns different values across Ruby process restarts. Ruby randomizes hash seeds at startup to prevent hash collision denial-of-service attacks. This means hash values cannot be stored persistently or shared between processes.
# Hash values differ between processes
# Process 1: "ruby".hash => 1234567890
# Process 2: "ruby".hash => 9876543210
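This is straightforward to verify by spawning fresh interpreters, each of which draws its own random seed (the snippet assumes a `ruby` executable is on the PATH):

```ruby
# Each child process gets its own random hash seed, so the same
# string hashes differently across processes
h1 = `ruby -e 'print "ruby".hash'`
h2 = `ruby -e 'print "ruby".hash'`
h1 == h2 # => false, with overwhelming probability
```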
Custom hash implementations use the Digest library for consistent, reproducible hashing:
require 'digest'
def sha256_hash(string)
Digest::SHA256.hexdigest(string)
end
sha256_hash("password")
# => "5e884898da28047151d0e56f8dc6292773603d0d6aabbdd62a11ef721d1542d8"
# Same across all processes and sessions
sha256_hash("password") == sha256_hash("password")
# => true
The Digest module offers multiple algorithms:
require 'digest'
text = "hash this"
# MD5 (fast but cryptographically broken)
Digest::MD5.hexdigest(text)
# => 32-character hex digest
# SHA-1 (faster than SHA-256, but deprecated for security)
Digest::SHA1.hexdigest(text)
# => 40-character hex digest
# SHA-256 (secure, widely used)
Digest::SHA256.hexdigest(text)
# => 64-character hex digest
# SHA-512 (more secure, slower)
Digest::SHA512.hexdigest(text)
# => 128-character hex digest
Implementing polynomial hash manually demonstrates the algorithm:
def polynomial_hash(string, base = 31, modulo = 2**32)
hash_value = 0
string.each_char.with_index do |char, index|
hash_value = (hash_value * base + char.ord) % modulo
end
hash_value
end
polynomial_hash("abc")
# => 96354
polynomial_hash("cab")
# => 98244
# Position matters - different strings produce different hashes
polynomial_hash("abc") == polynomial_hash("cab")
# => false
Rolling hash implementation enables efficient substring hashing:
class RollingHash
def initialize(string, base = 31, modulo = 2**32)
@string = string
@base = base
@modulo = modulo
@hash = compute_hash(string)
@power = (@base ** (string.length - 1)) % @modulo # weight of the window's leading character: base^(n-1)
end
def compute_hash(str)
hash_value = 0
str.each_char do |char|
hash_value = (hash_value * @base + char.ord) % @modulo
end
hash_value
end
def roll(old_char, new_char)
# Remove contribution of old character
@hash = (@hash - old_char.ord * @power) % @modulo
# Add new character
@hash = (@hash * @base + new_char.ord) % @modulo
@hash
end
end
roller = RollingHash.new("abc")
roller.roll('a', 'd') # Simulates moving window from "abc" to "bcd"
# => hash value for "bcd"
Hash tables in Ruby use the built-in Hash class, which relies on object hash values:
# Hash automatically uses string hash values
lookup = Hash.new
lookup["key1"] = "value1"
lookup["key2"] = "value2"
# Internal: uses "key1".hash to determine storage location
lookup["key1"]
# => "value1"
# Custom hash objects must implement hash and eql?
class CustomKey
attr_reader :value
def initialize(value)
@value = value
end
def hash
@value.hash
end
def eql?(other)
other.is_a?(CustomKey) && @value == other.value
end
end
custom_lookup = Hash.new
custom_lookup[CustomKey.new("test")] = "found"
custom_lookup[CustomKey.new("test")]
# => "found"
Common Patterns
Polynomial hashing forms the foundation of many string hashing applications. The pattern multiplies the accumulated hash by a base value and adds the next character:
def polynomial_hash(string, base = 31)
hash_value = 0
string.each_char do |char|
hash_value = hash_value * base + char.ord
end
hash_value
end
polynomial_hash("test")
# => 3556498
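Note that without a modulus Ruby silently promotes the accumulator to an arbitrary-precision integer rather than overflowing. Masking emulates fixed-width hashing (the helper name is illustrative):

```ruby
# Mask to 64 bits to emulate fixed-width integer hashing
def polynomial_hash64(string, base = 31)
  string.each_char.reduce(0) { |h, c| (h * base + c.ord) & 0xFFFF_FFFF_FFFF_FFFF }
end

polynomial_hash64("test") # => 3556498 (short strings fit without wrapping)
```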
The rolling hash pattern updates hash values incrementally when processing sliding windows:
class SubstringMatcher
def initialize(pattern, base = 31, modulo = 2**32)
@pattern = pattern
@pattern_hash = compute_hash(pattern, base, modulo)
@base = base
@modulo = modulo
@pattern_length = pattern.length
@power = (base ** (@pattern_length - 1)) % modulo
end
def compute_hash(string, base, modulo)
hash = 0
string.each_char do |char|
hash = (hash * base + char.ord) % modulo
end
hash
end
def find_matches(text)
return [] if text.length < @pattern_length
matches = []
current_hash = compute_hash(text[0...@pattern_length], @base, @modulo)
(0..text.length - @pattern_length).each do |i|
if i > 0
old_char = text[i - 1]
new_char = text[i + @pattern_length - 1]
current_hash = roll_hash(current_hash, old_char, new_char)
end
if current_hash == @pattern_hash
matches << i if text[i, @pattern_length] == @pattern
end
end
matches
end
def roll_hash(hash, old_char, new_char)
hash = (hash - old_char.ord * @power) % @modulo
hash = (hash * @base + new_char.ord) % @modulo
hash
end
end
matcher = SubstringMatcher.new("abc")
matcher.find_matches("xyzabcdefabc")
# => [3, 9]
Double hashing resolves collisions in hash tables by using a secondary hash function:
class DoubleHashTable
def initialize(size = 16)
@size = size
@table = Array.new(size)
end
def hash1(key)
key.hash % @size
end
def hash2(key)
1 + (key.hash % (@size - 1))
end
def insert(key, value)
index = hash1(key)
step = hash2(key)
# Simplified: probing assumes the table never fills; a production
# version would track the element count and resize at a load-factor threshold
while @table[index]
index = (index + step) % @size
end
@table[index] = [key, value]
end
def get(key)
index = hash1(key)
step = hash2(key)
while @table[index]
return @table[index][1] if @table[index][0] == key
index = (index + step) % @size
end
nil
end
end
table = DoubleHashTable.new
table.insert("name", "Alice")
table.get("name")
# => "Alice"
Combining multiple hash functions reduces collision probability:
require 'digest'
def combined_hash(string)
hash1 = Digest::MD5.hexdigest(string)[0..7].to_i(16)
hash2 = Digest::SHA1.hexdigest(string)[0..7].to_i(16)
hash1 ^ hash2 # XOR combination
end
combined_hash("data")
# => combined hash value
FNV (Fowler-Noll-Vo) hash provides fast, high-quality distribution:
def fnv1a_hash(string)
hash = 2166136261 # FNV offset basis
string.each_byte do |byte|
hash ^= byte
hash *= 16777619 # FNV prime
hash &= 0xFFFFFFFF # Keep 32-bit
end
hash
end
fnv1a_hash("example")
# => 2851307223
Performance Considerations
Hash function speed directly impacts application performance. Non-cryptographic hash functions like SipHash or MurmurHash process short strings in tens of nanoseconds, while cryptographic functions like SHA-256 take on the order of a microsecond. That gap of one to two orders of magnitude matters for high-throughput applications.
require 'digest'
require 'benchmark'
text = "performance test string"
Benchmark.bm do |x|
x.report("built-in hash:") { 100000.times { text.hash } }
x.report("MD5:") { 100000.times { Digest::MD5.hexdigest(text) } }
x.report("SHA256:") { 100000.times { Digest::SHA256.hexdigest(text) } }
end
# Results (approximate):
# built-in hash: 0.002000 seconds
# MD5: 0.050000 seconds
# SHA256: 0.150000 seconds
Hash table performance depends on the load factor and collision rate. A load factor below 0.75 maintains O(1) average-case lookups. Beyond 0.75, collision chains lengthen and performance degrades toward O(n).
Collision handling strategies affect performance characteristics:
class ChainedHashTable
def initialize(size = 16)
@size = size
@buckets = Array.new(size) { [] }
@count = 0
end
def insert(key, value)
index = key.hash % @size
@buckets[index] << [key, value]
@count += 1
resize if load_factor > 0.75
end
def load_factor
@count.to_f / @size
end
def resize
old_buckets = @buckets
@size *= 2
@buckets = Array.new(@size) { [] }
@count = 0
old_buckets.each do |bucket|
bucket.each do |key, value|
insert(key, value)
end
end
end
end
String length affects hashing time linearly. Hashing a 1000-character string takes approximately 1000x longer than a single character. Caching hash values avoids recomputation:
class CachedString
attr_reader :value
def initialize(value)
@value = value
@cached_hash = nil
end
def hash
@cached_hash ||= compute_hash
end
def compute_hash
@value.hash
end
end
large_string = "x" * 10000
cached = CachedString.new(large_string)
Benchmark.bm do |x|
x.report("first call:") { cached.hash }
x.report("cached call:") { 10000.times { cached.hash } }
end
Rolling hash implementations achieve O(1) hash updates for substring operations, compared to O(n) for recomputing from scratch:
# O(n) approach - recompute entire hash
def slow_substring_search(text, pattern)
(0..text.length - pattern.length).each do |i|
substring = text[i, pattern.length]
return i if substring.hash == pattern.hash && substring == pattern
end
nil
end
# O(1) hash update approach - reuses SubstringMatcher from Common Patterns
def fast_substring_search(text, pattern)
SubstringMatcher.new(pattern).find_matches(text).first
end
Prime number selection for polynomial hashing affects distribution quality. Larger primes reduce collisions but increase computation time. Common choices include 31, 53, 97, and 101. Testing against actual data determines optimal values:
def test_distribution(strings, base)
hashes = strings.map { |s| polynomial_hash(s, base) }
unique_hashes = hashes.uniq.size
collision_rate = 1 - (unique_hashes.to_f / strings.size)
collision_rate
end
test_strings = ["cat", "dog", "rat", "bat", "mat"]
test_distribution(test_strings, 31)
# => collision rate (lower is better)
Security Implications
Cryptographic hash functions provide security properties absent from non-cryptographic variants. One-way functions prevent reversing hash values to recover original strings. SHA-256 makes finding x where SHA256(x) = y computationally infeasible.
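One consequence of mapping into a fixed range is directly observable: a digest has constant length regardless of input size.

```ruby
require 'digest'

# SHA-256 always emits 64 hex characters (256 bits), whatever the input size
short_digest = Digest::SHA256.hexdigest("a")
long_digest  = Digest::SHA256.hexdigest("a" * 100_000)
short_digest.length # => 64
long_digest.length  # => 64
```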
Password storage requires cryptographic hashing with salts to prevent rainbow table attacks:
require 'digest'
require 'securerandom'
def hash_password(password)
# Illustrative only: production code should use a key derivation
# function such as PBKDF2 or bcrypt, covered later in this section
salt = SecureRandom.hex(16)
hash = Digest::SHA256.hexdigest(salt + password)
"#{salt}$#{hash}"
end
def verify_password(password, stored)
salt, hash = stored.split('$')
computed_hash = Digest::SHA256.hexdigest(salt + password)
computed_hash == hash
end
stored = hash_password("secret123")
# => "<32-char hex salt>$<64-char hex digest>" - differs on every call because the salt is random
verify_password("secret123", stored)
# => true
verify_password("wrong", stored)
# => false
Hash collision attacks exploit predictable hash functions. Attackers craft inputs producing identical hash values, degrading hash table performance to O(n). Ruby randomizes hash seeds to prevent this attack vector.
# Vulnerable to collision attacks without randomization
def insecure_hash(string)
string.bytes.sum # Predictable, easy to create collisions
end
# "abc".bytes.sum => 294
# "bac".bytes.sum => 294 (collision)
# Ruby's built-in hash includes randomization
"abc".hash != "bac".hash
# => true (extremely likely, depends on random seed)
Timing attacks exploit execution time variations to extract information. Constant-time comparison prevents this:
# Vulnerable to timing attacks
def insecure_compare(hash1, hash2)
hash1 == hash2 # Returns early on first mismatch
end
# Constant-time comparison
def secure_compare(hash1, hash2)
return false if hash1.length != hash2.length
result = 0
hash1.bytes.zip(hash2.bytes).each do |b1, b2|
result |= b1 ^ b2
end
result == 0
end
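Recent Ruby versions also ship a constant-time comparison in the openssl standard library, which is preferable to hand-rolled loops (availability depends on the Ruby and openssl gem versions; `OpenSSL.fixed_length_secure_compare` requires equal-length inputs):

```ruby
require 'openssl'

# Constant-time equality check for same-length strings (openssl stdlib)
OpenSSL.fixed_length_secure_compare("digest-a", "digest-a") # => true
OpenSSL.fixed_length_secure_compare("digest-a", "digest-b") # => false
```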
MD5 and SHA-1 suffer from known collision vulnerabilities. Researchers have demonstrated collision generation, making these algorithms unsuitable for security-critical applications:
require 'digest'
# MD5 - DO NOT USE for security
Digest::MD5.hexdigest("data")
# Collisions can be generated
# SHA-1 - DO NOT USE for security
Digest::SHA1.hexdigest("data")
# Collisions demonstrated in 2017
# SHA-256 - Currently secure
Digest::SHA256.hexdigest("data")
# No practical collision attacks known
# SHA-512 - More secure, slower
Digest::SHA512.hexdigest("data")
# Higher security margin
Key derivation functions like PBKDF2, bcrypt, or scrypt provide better password hashing by incorporating iteration counts that slow down brute-force attacks:
require 'openssl'
require 'securerandom'
require 'base64'
def pbkdf2_hash(password, iterations = 100000)
salt = SecureRandom.random_bytes(16)
hash = OpenSSL::PKCS5.pbkdf2_hmac(
password,
salt,
iterations,
32,
OpenSSL::Digest::SHA256.new
)
"#{iterations}$#{Base64.strict_encode64(salt)}$#{Base64.strict_encode64(hash)}"
end
hashed = pbkdf2_hash("password123")
# Significantly slower to compute, resistant to brute-force
Practical Examples
Detecting duplicate files uses hash values for efficient comparison:
require 'digest'
class DuplicateFinder
def initialize
@hashes = Hash.new { |h, k| h[k] = [] }
end
def scan_directory(path)
Dir.glob("#{path}/**/*").each do |file|
next unless File.file?(file)
hash = Digest::SHA256.file(file).hexdigest
@hashes[hash] << file
end
end
def duplicates
@hashes.select { |hash, files| files.size > 1 }
end
end
finder = DuplicateFinder.new
finder.scan_directory("/path/to/files")
finder.duplicates.each do |hash, files|
puts "Duplicate files (#{hash}):"
files.each { |f| puts " #{f}" }
end
Caching computed results uses hash values as keys:
require 'digest'
class ComputationCache
def initialize
@cache = {}
end
def compute_with_cache(input)
key = Digest::SHA256.hexdigest(input.to_s)
@cache[key] ||= expensive_computation(input)
end
def expensive_computation(input)
# Simulated expensive operation
sleep(1)
input.reverse
end
end
cache = ComputationCache.new
cache.compute_with_cache("data") # Takes 1 second
cache.compute_with_cache("data") # Instant (cached)
Bloom filters use multiple hash functions for space-efficient set membership testing:
require 'digest'
class BloomFilter
def initialize(size = 1000)
@size = size
@bits = Array.new(size, false)
end
def hash_functions(item)
[
Digest::MD5.hexdigest(item).to_i(16) % @size,
Digest::SHA1.hexdigest(item).to_i(16) % @size,
Digest::SHA256.hexdigest(item).to_i(16) % @size
]
end
def add(item)
hash_functions(item).each do |index|
@bits[index] = true
end
end
def contains?(item)
hash_functions(item).all? { |index| @bits[index] }
end
end
filter = BloomFilter.new
filter.add("apple")
filter.add("banana")
filter.contains?("apple")
# => true
filter.contains?("orange")
# => false (probably, false positives possible)
Consistent hashing distributes data across servers:
require 'digest'
class ConsistentHash
def initialize(servers, virtual_nodes = 150)
@ring = {}
servers.each do |server|
virtual_nodes.times do |i|
hash = Digest::SHA256.hexdigest("#{server}:#{i}").to_i(16)
@ring[hash] = server
end
end
@sorted_keys = @ring.keys.sort
end
def get_server(key)
hash = Digest::SHA256.hexdigest(key).to_i(16)
index = @sorted_keys.bsearch_index { |k| k >= hash } || 0
@ring[@sorted_keys[index]]
end
end
hash_ring = ConsistentHash.new(["server1", "server2", "server3"])
hash_ring.get_server("user123")
# => e.g. "server2" (stable for a given key, since SHA-256 has no random seed)
hash_ring.get_server("user456")
# => e.g. "server1"
String interning uses hash tables to store unique string instances:
class StringInterner
def initialize
@strings = {}
end
def intern(string)
# Keying by the string itself still uses string hashing internally,
# but lets Ruby's Hash fall back to eql? on hash collisions instead
# of returning the wrong string
@strings[string] ||= string.dup.freeze
end
end
interner = StringInterner.new
str1 = interner.intern("hello")
str2 = interner.intern("hello")
str1.object_id == str2.object_id
# => true (same object reference)
Reference
Hash Function Comparison
| Function | Speed | Collision Resistance | Security | Use Case |
|---|---|---|---|---|
| Built-in hash | Fastest | Good | No | Hash tables, general purpose |
| Polynomial | Fast | Moderate | No | String matching, rolling hash |
| FNV-1a | Fast | Good | No | Hash tables, checksums |
| MurmurHash | Fast | Excellent | No | Hash tables, distributed systems |
| MD5 | Medium | Poor | Broken | Legacy checksums only |
| SHA-1 | Medium | Weak | Broken | Legacy applications only |
| SHA-256 | Slow | Excellent | Strong | Digital signatures, certificates |
| SHA-512 | Slower | Excellent | Stronger | High-security applications |
| PBKDF2 | Very slow | N/A | Strong | Password hashing |
Ruby Hash Methods
| Method | Description | Example |
|---|---|---|
| hash | Returns hash value for object | "text".hash |
| eql? | Checks equality for hash lookup | key1.eql?(key2) |
| Digest::MD5 | MD5 hash computation | Digest::MD5.hexdigest(str) |
| Digest::SHA256 | SHA-256 hash computation | Digest::SHA256.hexdigest(str) |
| Digest::SHA512 | SHA-512 hash computation | Digest::SHA512.hexdigest(str) |
Performance Characteristics
| Operation | Time Complexity | Space Complexity | Notes |
|---|---|---|---|
| Hash computation | O(n) | O(1) | n = string length |
| Hash table insert | O(1) average | O(n) | Amortized with resize |
| Hash table lookup | O(1) average | O(n) | Worst case O(n) with collisions |
| Rolling hash update | O(1) | O(1) | Constant time per character |
| Cryptographic hash | O(n) | O(1) | Slower constant factors |
Common Hash Bases
| Base | Properties | Collision Rate | Use Case |
|---|---|---|---|
| 31 | Small prime, fast multiplication | Low | General string hashing |
| 53 | Larger prime, better distribution | Very low | Hash tables |
| 97 | Large prime, excellent distribution | Very low | Critical applications |
| 101 | Large prime, good mixing | Very low | High-quality hashing |
| 257 | Power of 2 plus 1, fast | Low | Performance-critical code |
Security Guidelines
| Scenario | Recommended Hash | Avoid | Rationale |
|---|---|---|---|
| Password storage | PBKDF2, bcrypt, scrypt | MD5, SHA-1, plain SHA-256 | Requires key derivation |
| Digital signatures | SHA-256, SHA-512 | MD5, SHA-1 | Collision resistance needed |
| File integrity | SHA-256 | MD5 | Collision attacks possible |
| Hash tables | Built-in hash, MurmurHash | Cryptographic hashes | Speed vs security tradeoff |
| Checksums | CRC32, MD5 | — | Speed acceptable for non-security error detection |
Load Factor Impact
| Load Factor | Average Probes | Resize Needed | Performance |
|---|---|---|---|
| < 0.5 | ~1.1 | No | Excellent, but memory inefficient |
| 0.5 - 0.75 | ~1.5 | No | Good balance |
| 0.75 - 1.0 | ~2.5 | Recommended | Degrading |
| > 1.0 | 5+ | Required | Poor; only possible with chaining |
Hash Function Implementation Template
def custom_hash(string, base = 31, modulo = 2**32)
hash_value = 0
string.each_char do |char|
hash_value = (hash_value * base + char.ord) % modulo
end
hash_value
end
Collision Resolution Strategies
| Strategy | Pros | Cons | Implementation |
|---|---|---|---|
| Chaining | Simple, handles high load | Extra memory, pointer overhead | Linked lists at each bucket |
| Open addressing | Cache-friendly, no pointers | Clustering, complex deletion | Linear or quadratic probing |
| Double hashing | Reduced clustering | Two hash functions needed | Secondary hash for step size |
| Robin Hood | Balanced chains | Complex insertion | Variance-reducing probe |