CrackedRuby

Overview

Cache memory sits between the CPU and main memory (RAM), providing fast access to frequently used data. Modern processors operate at multi-gigahertz speeds while main memory access takes hundreds of CPU cycles, creating a performance gap that cache memory bridges. A cache hit retrieves data from cache in a few cycles, while a cache miss requires fetching from main memory, stalling the CPU.

Modern processors implement a cache hierarchy with multiple levels. L1 cache, closest to the CPU cores, provides sub-nanosecond access but stores only 32-64 KB per core. L2 cache holds 256 KB to 1 MB with slightly higher latency. L3 cache, shared across all cores, ranges from several megabytes to tens of megabytes. Each level trades capacity for speed.

Software performance depends heavily on cache behavior. Programs with good spatial and temporal locality keep their working set in cache, achieving near-CPU speeds. Programs with poor locality trigger frequent cache misses, becoming memory-bound and running orders of magnitude slower than their cache-friendly counterparts.

# Cache-unfriendly: random access pattern
data = Array.new(1_000_000) { rand(1000) }
sum = 0
indices = (0...1_000_000).to_a.shuffle

indices.each do |i|
  sum += data[i]  # Random jumps through memory
end
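For contrast, the cache-friendly version of the same sum walks the array in order, so every fetched cache line is fully consumed before eviction:

```ruby
# Cache-friendly: sequential access pattern
data = Array.new(1_000_000) { rand(1000) }
sum = 0

data.each do |value|
  sum += value  # Walks memory in order; each cache line is used completely
end
```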

Cache memory operates transparently to software, but understanding its behavior enables writing faster code. Ruby programs, despite running in a VM, experience the same cache effects as native code because memory access patterns ultimately reach the hardware cache.

Key Principles

Cache memory stores copies of data from main memory in small, fast storage. The CPU automatically checks cache before accessing main memory. When data exists in cache (a hit), the CPU retrieves it quickly. When data is absent (a miss), the CPU fetches it from main memory and stores a copy in cache.

Cache memory divides into fixed-size cache lines, typically 64 bytes on x86-64 processors. When the CPU requests a single byte, the cache system fetches an entire cache line containing that byte. This design exploits spatial locality: programs often access memory near recently accessed locations. Fetching whole cache lines amortizes the cost of memory access across multiple operations.

Temporal locality refers to accessing the same memory location repeatedly within a short time window. Caches retain recently accessed data, making subsequent accesses fast. Loops that repeatedly access the same variables exhibit strong temporal locality. Programs with poor temporal locality evict cache lines before reusing them.

# Strong temporal locality
matrix = Array.new(1000) { Array.new(1000, 0) }

# Accumulator reused in every iteration
sum = 0
matrix.each do |row|
  row.each do |value|
    sum += value  # 'sum' stays in register/L1 cache
  end
end

Cache mapping determines where in cache a memory block can reside. Direct-mapped caches place each memory block in exactly one cache location based on the block's address. Set-associative caches allow each block to reside in one of several locations within a set. Fully associative caches permit any block in any cache location. Higher associativity reduces conflict misses but increases hardware complexity and lookup time.
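The mapping arithmetic can be sketched in a few lines. The sizes below (64-byte lines, a 32 KB 8-way cache) are typical x86-64 L1 parameters, assumed here for illustration:

```ruby
# Set-index arithmetic for a set-associative cache.
LINE_SIZE  = 64                                # bytes per cache line
CACHE_SIZE = 32 * 1024                         # 32 KB L1 data cache
WAYS       = 8                                 # 8-way set associative
NUM_SETS   = CACHE_SIZE / (LINE_SIZE * WAYS)   # 64 sets

# Which set must hold the line containing this byte address?
def cache_set(address)
  (address / LINE_SIZE) % NUM_SETS
end

cache_set(0)     # => 0
cache_set(64)    # => 1 (next cache line)
cache_set(4096)  # => 0 (4 KB apart: competes with address 0 for one set's 8 ways)
```

Addresses a multiple of `LINE_SIZE * NUM_SETS` bytes apart land in the same set, which is why power-of-two strides can cause conflict misses.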

When multiple CPU cores share cache levels, cache coherence protocols maintain consistency. The MESI protocol (Modified, Exclusive, Shared, Invalid) tracks each cache line's state. When one core writes to a cache line, the protocol invalidates copies in other cores' caches. This coordination prevents stale data but introduces overhead when cores access shared memory.

Write policies determine when cache writes propagate to main memory. Write-through caches immediately write to both cache and memory, maintaining consistency at the cost of write performance. Write-back caches mark modified cache lines as dirty and delay writing to memory until eviction. Write-back improves performance but complicates coherence.
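A toy model makes the trade-off concrete by counting main-memory writes for repeated updates to a single cached line (an illustration of the two policies, not a model of real hardware):

```ruby
# Write-through: every store is propagated to memory immediately
def write_through_writes(update_count)
  update_count
end

# Write-back: the line is marked dirty and written once, at eviction
def write_back_writes(update_count)
  update_count.zero? ? 0 : 1
end

write_through_writes(1000)  # => 1000 memory writes
write_back_writes(1000)     # => 1 memory write (at eviction)
```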

# Cache line demonstration
class CacheLineDemo
  def initialize
    @data = Array.new(16, 0)  # 16 integers in 2-3 cache lines
  end

  def modify_sequential
    # Efficient: modifies adjacent elements in same cache lines
    @data.each_index { |i| @data[i] += 1 }
  end

  def modify_strided
    # Less efficient: may span multiple cache lines inefficiently
    (0...16).step(4) { |i| @data[i] += 1 }
  end
end

Cache replacement policies decide which cache line to evict when bringing in new data. Least Recently Used (LRU) evicts the line accessed furthest in the past. Random replacement selects a line randomly. First In First Out (FIFO) evicts the oldest line. LRU approximates optimal behavior for programs with temporal locality.
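The LRU policy can be modeled in a few lines using Ruby's insertion-ordered Hash (a sketch of the policy, not of hardware):

```ruby
# Minimal LRU model: Ruby's Hash preserves insertion order, so deleting
# and re-inserting a key moves it to the most-recently-used position.
class LRUCache
  def initialize(capacity)
    @capacity = capacity
    @store = {}
  end

  def get(key)
    return nil unless @store.key?(key)
    @store[key] = @store.delete(key)  # move to MRU end
  end

  def put(key, value)
    @store.delete(key)
    @store[key] = value
    @store.delete(@store.keys.first) if @store.size > @capacity  # evict LRU
  end

  def keys
    @store.keys
  end
end

cache = LRUCache.new(2)
cache.put(:a, 1)
cache.put(:b, 2)
cache.get(:a)     # touch :a, so :b becomes least recently used
cache.put(:c, 3)  # evicts :b
cache.keys        # => [:a, :c]
```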

Cache capacity limits determine how much data fits in each level. Working sets exceeding cache capacity trigger capacity misses. Programs operating on arrays larger than L3 cache cannot keep all data cached simultaneously and must carefully orchestrate memory access to maximize reuse before eviction.

Performance Considerations

Cache hit rates directly correlate with program performance. A program with a 99% L1 hit rate whose remaining accesses are served from L3 performs vastly better than one with 90% L1 hits and 10% main memory accesses. Main memory access costs 200-300 CPU cycles while L1 cache access costs 3-4 cycles, making misses roughly two orders of magnitude slower.

Memory access patterns determine cache efficiency. Sequential access patterns exhibit excellent spatial locality, loading cache lines that subsequent operations immediately use. Strided access patterns with stride equal to cache line size waste bandwidth by using only one element per fetched cache line. Random access patterns thrash the cache, constantly evicting data before reuse.

# Measuring cache effects through timing
require 'benchmark'

def sum_row_major(matrix)
  sum = 0
  matrix.each do |row|
    row.each { |val| sum += val }  # Sequential memory access
  end
  sum
end

def sum_column_major(matrix)
  sum = 0
  (0...matrix[0].size).each do |col|
    matrix.each { |row| sum += row[col] }  # Strided access
  end
  sum
end

size = 10_000
matrix = Array.new(size) { Array.new(size) { rand(100) } }

Benchmark.bm(15) do |x|
  x.report("row-major:") { sum_row_major(matrix) }
  x.report("column-major:") { sum_column_major(matrix) }
end
# Row-major typically 2-4x faster due to cache efficiency

Data structure layout impacts cache performance. Arrays store elements contiguously, allowing sequential access to load full cache lines. Linked lists scatter nodes throughout memory, triggering a cache miss per node access. Hash tables exhibit unpredictable access patterns that often miss cache.

Block size selection affects cache behavior in algorithms. Matrix multiplication blocked into tiles that fit in cache reuses data before eviction. Blocking transforms a naive, cache-unaware algorithm into one tuned to the cache size. Tile size should fit in L2 or L3 cache for maximum reuse.

# Blocked matrix multiplication
def multiply_blocked(a, b, block_size = 64)
  n = a.size
  result = Array.new(n) { Array.new(n, 0) }

  (0...n).step(block_size) do |i_block|
    (0...n).step(block_size) do |j_block|
      (0...n).step(block_size) do |k_block|
        
        # Process block
        i_max = [i_block + block_size, n].min
        j_max = [j_block + block_size, n].min
        k_max = [k_block + block_size, n].min

        (i_block...i_max).each do |i|
          (j_block...j_max).each do |j|
            sum = result[i][j]
            (k_block...k_max).each do |k|
              sum += a[i][k] * b[k][j]
            end
            result[i][j] = sum
          end
        end
      end
    end
  end
  result
end

Prefetching hints can improve performance when the CPU's automatic prefetcher fails to predict access patterns. Hardware prefetchers detect simple stride patterns but miss complex patterns. Software prefetch instructions load data into cache before needed, hiding memory latency. Ruby provides no direct prefetch support, but access patterns can trigger hardware prefetching.

Cache pollution occurs when infrequently accessed data evicts frequently accessed data. Large data structures processed once should use streaming stores that bypass cache when possible. Ruby's abstraction level prevents direct streaming store control, but processing order affects what remains cached.

Ruby Implementation

Ruby's memory model abstracts hardware details but cache effects remain visible. Ruby objects occupy memory with specific layouts determined by the VM. Understanding these layouts enables cache-aware coding despite Ruby's high-level nature.

Ruby arrays store elements contiguously when possible. Small integer arrays stored as immediate values achieve excellent cache locality. Arrays of objects store references, requiring pointer chasing that may miss cache. Packed arrays of structs using strings or FFI can improve locality.

# Cache-friendly integer operations
def process_integers(count)
  data = Array.new(count) { |i| i }
  
  # Sequential processing - excellent cache behavior
  sum = 0
  data.each { |value| sum += value }
  
  # Map operation - processes sequentially
  squared = data.map { |value| value * value }
  
  sum
end

# Cache-unfriendly object references
class Point
  attr_accessor :x, :y, :z
  
  def initialize(x, y, z)
    @x, @y, @z = x, y, z
  end
end

def process_points(count)
  points = Array.new(count) { Point.new(rand, rand, rand) }
  
  # Array stores references - each Point is separate allocation
  # Accessing Point objects requires pointer chase, likely cache miss
  sum = 0
  points.each { |pt| sum += pt.x + pt.y + pt.z }
  sum
end
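One mitigation, per the packed-array note above, keeps coordinates in a single flat array rather than one heap object per point. A sketch with stride-3 indexing (the same idea extends to binary buffers built with Array#pack):

```ruby
# Packed layout: [x0, y0, z0, x1, y1, z1, ...] in one contiguous array,
# avoiding a pointer chase per point
def process_points_packed(count)
  coords = Array.new(count * 3) { rand }

  sum = 0.0
  count.times do |i|
    base = i * 3
    sum += coords[base] + coords[base + 1] + coords[base + 2]
  end
  sum
end
```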

Ruby's numeric types exhibit different cache characteristics. Small integers (fixnums) are encoded as immediate values inside tagged pointers, avoiding heap access entirely. Large integers (bignums) allocate heap objects, reducing cache efficiency. On 64-bit CRuby, most floats are likewise encoded as immediate flonum values; floats outside the representable range fall back to heap allocation.
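The fixnum/bignum split is observable with the objspace standard library: memsize_of reports no per-object heap footprint for immediate values, while a bignum allocates (exact byte counts vary by Ruby version and platform):

```ruby
require 'objspace'

# Immediates have no per-object heap cost
ObjectSpace.memsize_of(1)        # typically 0 on CRuby (fixnum)
ObjectSpace.memsize_of(1.5)      # typically 0 on 64-bit CRuby (flonum)

# A bignum is a real heap object
ObjectSpace.memsize_of(2**100)   # positive byte count
```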

String operations interact with cache based on string size and modification patterns. Small strings may be embedded in objects, improving locality. Large strings require separate allocations. String concatenation creating temporary strings thrashes cache if performed in loops.

# Inefficient: creates many temporary strings
def build_string_bad(parts)
  result = ""
  parts.each { |part| result += part }  # Allocates new string each time
  result
end

# Efficient: builds in-place when possible
def build_string_good(parts)
  parts.join  # Single allocation
end

# For incremental building
def build_string_incremental(parts)
  result = String.new(capacity: parts.sum(&:length))
  parts.each { |part| result << part }  # Modifies in place
  result
end

Ruby's garbage collector impacts cache behavior. Young generation collections scan recently allocated objects, likely still in cache. Full collections traverse the entire heap, potentially evicting application data from cache. Allocation patterns affect GC frequency and cache disruption.
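GC.stat makes allocation pressure measurable, which helps tie GC-driven cache disruption to specific code. A rough sketch (the counter includes VM-internal allocations, so treat results as relative, not exact):

```ruby
# Count objects allocated while a block runs
def allocations_during
  GC.disable
  before = GC.stat(:total_allocated_objects)
  yield
  GC.stat(:total_allocated_objects) - before
ensure
  GC.enable
end

# Chained eager operations allocate intermediate arrays...
eager = allocations_during { (1..10_000).to_a.select(&:even?).map { |x| x * x } }

# ...while a single fold over the range allocates far less
lean = allocations_during { (1..10_000).reduce(0) { |acc, x| acc + x } }

eager > lean  # the eager pipeline allocates more
```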

Enumerable methods like map and select create intermediate arrays that consume cache space, and chaining multiple operations creates multiple temporaries; reduce folds to a single value without an intermediate array. Lazy enumerators defer computation, reducing intermediate allocations and cache pressure.

# Eager evaluation - multiple passes and intermediates
data = (1..1_000_000).to_a
result = data
  .select { |x| x.even? }      # Creates intermediate array
  .map { |x| x * x }            # Creates another intermediate
  .take(100)                    # Final array

# Lazy evaluation - single pass, fewer allocations
result = (1..1_000_000)
  .lazy
  .select { |x| x.even? }       # No intermediate array
  .map { |x| x * x }
  .take(100)                    # Stops early
  .to_a

Method calls in tight loops can impact cache. Dynamic dispatch requires looking up method implementations, potentially missing the instruction cache. Megamorphic call sites with many receiver types defeat Ruby's inline caches, forcing repeated lookups. Monomorphic call sites with a single receiver type optimize well.

Ruby's copy-on-write mechanics interact with cache coherence. Forked processes share memory pages initially. First write to a shared page triggers a copy, invalidating cache lines in other processes. Applications forking after loading large datasets should minimize writes to shared data.

Design Considerations

Cache optimization matters most for performance-critical code processing large datasets. Small programs operating on kilobytes of data run entirely from cache regardless of optimization. Programs processing megabytes to gigabytes benefit significantly from cache-aware design.

Trade-offs exist between cache optimization and code maintainability. Blocked algorithms improve cache behavior but increase complexity. Determine whether performance gains justify increased code size and reduced readability. Profile before optimizing to identify actual bottlenecks.

# Simple but cache-inefficient
def simple_transpose(matrix)
  n = matrix.size
  result = Array.new(n) { Array.new(n) }
  
  n.times do |i|
    n.times do |j|
      result[j][i] = matrix[i][j]  # Column-major write pattern
    end
  end
  result
end

# Cache-optimized with blocking
def blocked_transpose(matrix, block_size = 64)
  n = matrix.size
  result = Array.new(n) { Array.new(n) }
  
  (0...n).step(block_size) do |i_block|
    (0...n).step(block_size) do |j_block|
      i_max = [i_block + block_size, n].min
      j_max = [j_block + block_size, n].min
      
      (i_block...i_max).each do |i|
        (j_block...j_max).each do |j|
          result[j][i] = matrix[i][j]
        end
      end
    end
  end
  result
end

Data structure selection should account for access patterns. Arrays suit sequential or random access when indices are known. Hashes optimize key-based lookup but exhibit poor cache locality when iterating. Consider whether cache efficiency or lookup speed dominates performance.

Algorithm selection can favor cache-friendly approaches. Merge sort accesses arrays sequentially with good locality. Quicksort's partitioning exhibits good cache behavior for in-place sorting. Heap-based algorithms may suffer cache misses when accessing parent/child nodes.

Organizing data for access patterns improves cache efficiency. Structure of Arrays (SoA) layout stores each field in separate arrays, improving cache utilization when processing single fields. Array of Structures (AoS) stores complete records together, benefiting when accessing multiple fields per record.

# Array of Structures - good when accessing multiple fields together
class ParticleAoS
  def initialize(count)
    @particles = Array.new(count) do
      { x: rand, y: rand, z: rand, vx: rand, vy: rand, vz: rand }
    end
  end
  
  def update_positions(dt)
    @particles.each do |p|
      p[:x] += p[:vx] * dt  # Access multiple fields per particle
      p[:y] += p[:vy] * dt
      p[:z] += p[:vz] * dt
    end
  end
end

# Structure of Arrays - good when processing single fields
class ParticleSoA
  def initialize(count)
    @x = Array.new(count) { rand }
    @y = Array.new(count) { rand }
    @z = Array.new(count) { rand }
    @vx = Array.new(count) { rand }
    @vy = Array.new(count) { rand }
    @vz = Array.new(count) { rand }
  end
  
  def update_x_positions(dt)
    @x.size.times do |i|
      @x[i] += @vx[i] * dt  # Sequential access to x and vx arrays
    end
  end
end

Padding and alignment can prevent false sharing in concurrent programs. Cache lines shared between threads cause cache coherence traffic when either thread writes. Padding ensures separate threads access different cache lines even when modifying adjacent data.

Memory allocation patterns affect cache behavior. Pre-allocating arrays at target size avoids reallocation and copying. Reusing buffers across iterations reduces allocation overhead and keeps data cache-hot. Object pooling trades memory for cache efficiency when allocation dominates performance.
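A sketch of the buffer-reuse pattern: one preallocated scratch array is filled in place each batch instead of allocating a fresh array per iteration:

```ruby
# Reuses a single scratch buffer across batches, keeping it cache-warm
class BatchProcessor
  BATCH_SIZE = 1024

  def initialize
    @buffer = Array.new(BATCH_SIZE, 0)  # allocated once
  end

  # Doubles each value into the shared buffer, then accumulates
  def process_batches(batches)
    total = 0
    batches.each do |batch|
      batch.each_with_index { |v, i| @buffer[i] = v * 2 }  # fill in place
      batch.size.times { |i| total += @buffer[i] }
    end
    total
  end
end

BatchProcessor.new.process_batches([[1, 2, 3], [4, 5]])  # => 30
```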

Practical Examples

Matrix operations demonstrate cache effects clearly. Row-major traversal accesses consecutive memory, loading useful data in each cache line. Column-major traversal jumps by row width, accessing one element per cache line and wasting bandwidth.

require 'benchmark'

def create_matrix(size)
  Array.new(size) { Array.new(size) { rand } }
end

def sum_by_row(matrix)
  sum = 0
  matrix.each do |row|
    row.each { |val| sum += val }
  end
  sum
end

def sum_by_column(matrix)
  sum = 0
  n = matrix.size
  n.times do |col|
    matrix.each { |row| sum += row[col] }
  end
  sum
end

sizes = [1000, 2000, 4000]
sizes.each do |size|
  matrix = create_matrix(size)
  
  puts "Matrix size: #{size}x#{size}"
  Benchmark.bm(15) do |x|
    x.report("by row:") { sum_by_row(matrix) }
    x.report("by column:") { sum_by_column(matrix) }
  end
  puts
end
# Observe increasing performance gap as matrix grows

Image processing exhibits strong cache effects. Processing pixels row-by-row accesses memory sequentially. Processing column-by-column causes strided access with poor cache utilization. Tiled processing balances both directions.

# Represent image as flat array in row-major order
class Image
  attr_reader :width, :height, :pixels
  
  def initialize(width, height)
    @width = width
    @height = height
    @pixels = Array.new(width * height, 0)
  end
  
  def get(x, y)
    @pixels[y * @width + x]
  end
  
  def set(x, y, value)
    @pixels[y * @width + x] = value
  end
end

# Cache-friendly row-major processing
def blur_horizontal(image)
  result = Image.new(image.width, image.height)
  
  image.height.times do |y|
    image.width.times do |x|
      if x == 0 || x == image.width - 1
        result.set(x, y, image.get(x, y))
      else
        avg = (image.get(x-1, y) + image.get(x, y) + image.get(x+1, y)) / 3.0
        result.set(x, y, avg)
      end
    end
  end
  result
end

# Less cache-friendly vertical processing
def blur_vertical(image)
  result = Image.new(image.width, image.height)
  
  image.width.times do |x|
    image.height.times do |y|  # Inner loop jumps by row width
      if y == 0 || y == image.height - 1
        result.set(x, y, image.get(x, y))
      else
        avg = (image.get(x, y-1) + image.get(x, y) + image.get(x, y+1)) / 3.0
        result.set(x, y, avg)
      end
    end
  end
  result
end

Sorting algorithms exhibit different cache behaviors. In-place sorting modifies arrays sequentially with good locality. Algorithms creating many temporary arrays pressure cache. Comparison count matters less than memory access patterns for large datasets.

# Cache-efficient in-place quicksort
def quicksort!(arr, left = 0, right = arr.length - 1)
  return if left >= right
  
  pivot_index = partition!(arr, left, right)
  quicksort!(arr, left, pivot_index - 1)
  quicksort!(arr, pivot_index + 1, right)
end

def partition!(arr, left, right)
  pivot = arr[right]
  i = left
  
  (left...right).each do |j|
    if arr[j] <= pivot
      arr[i], arr[j] = arr[j], arr[i]  # Swap in place
      i += 1
    end
  end
  
  arr[i], arr[right] = arr[right], arr[i]
  i
end

# Test cache behavior
data = Array.new(1_000_000) { rand(1_000_000) }
data_copy = data.dup

require 'benchmark'
Benchmark.bm do |x|
  x.report("in-place quicksort:") { quicksort!(data) }
  x.report("built-in sort:") { data_copy.sort! }
end

Sparse matrix operations demonstrate the impact of data structure on cache. Coordinate list format requires sequential scan. Compressed sparse row format enables cache-friendly row operations. Choice depends on operation patterns.
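A compressed sparse row sketch shows why: three flat arrays hold the nonzeros row by row, so a row operation scans contiguous memory:

```ruby
# CSR: values and column indices stored row by row in flat arrays;
# row_ptr[i]...row_ptr[i + 1] delimits row i's entries
class CSRMatrix
  def initialize(values, col_indices, row_ptr)
    @values = values
    @col_indices = col_indices
    @row_ptr = row_ptr
  end

  # Sequential scan over one row's slice - cache-friendly
  def row_sum(i)
    sum = 0
    (@row_ptr[i]...@row_ptr[i + 1]).each { |k| sum += @values[k] }
    sum
  end
end

# 3x3 matrix with nonzeros: (0,0)=5, (0,2)=1, (2,1)=7
m = CSRMatrix.new([5, 1, 7], [0, 2, 1], [0, 2, 2, 3])
m.row_sum(0)  # => 6
m.row_sum(1)  # => 0 (empty row)
m.row_sum(2)  # => 7
```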

Common Pitfalls

False sharing occurs when separate threads access different variables residing in the same cache line. One thread's write invalidates the entire cache line in other cores' caches, causing cache coherence traffic even though threads access different data.

# Potential false sharing
class Counter
  def initialize
    @count = 0
  end
  
  def increment
    @count += 1  # Multiple threads incrementing different counters
  end
end

# If multiple Counter objects are allocated consecutively,
# their heap slots may share cache lines
counters = Array.new(8) { Counter.new }

# Attempted mitigation: add padding fields.
# Note: Ruby controls object layout, so this is a heuristic at best;
# reliable padding requires native extensions. Under CRuby's GVL,
# false sharing mainly matters for Ractors or for VMs where threads
# run in parallel (JRuby, TruffleRuby).
class PaddedCounter
  def initialize
    @count = 0
    @padding = Array.new(15, 0)  # Heuristic attempt to separate cache lines
  end
  
  def increment
    @count += 1
  end
end

Cache thrashing happens when working set exceeds cache capacity. Repeated access to data that doesn't fit causes continuous eviction and reload. Blocking algorithms reduce working set to fit in cache.

Assuming cache line size leads to portability issues. Cache line sizes vary from 32 to 128 bytes across architectures. Write portable code or detect cache line size at runtime. Ruby's abstraction usually prevents direct cache line assumptions, but be cautious with FFI or extensions.
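On Linux the line size can be queried at runtime from sysfs (the path below is Linux-specific; this sketch falls back to 64 bytes, the most common size, elsewhere):

```ruby
# Detect cache line size at runtime via Linux sysfs, with a fallback
def cache_line_size
  path = "/sys/devices/system/cpu/cpu0/cache/index0/coherency_line_size"
  File.exist?(path) ? File.read(path).to_i : 64
rescue SystemCallError
  64
end

cache_line_size  # typically 64 on x86-64
```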

Premature cache optimization wastes development time. Profile first to confirm cache misses cause performance issues. Many programs run fast enough without cache optimization. Optimize only proven bottlenecks.

# Premature optimization example
def compute_values(n)
  # Developer assumes cache optimization needed
  # Actually spends most time in computation, not memory access
  result = Array.new(n)
  n.times { |i| result[i] = expensive_computation(i) }
  result
end

def expensive_computation(x)
  # Complex calculation dominates runtime
  (1..100).reduce(x) { |acc, _| Math.sin(acc) * Math.cos(acc) }
end

# Profile reveals computation bottleneck, not cache
# Cache optimization pointless here

Ignoring alignment requirements in data structures wastes cache space. Misaligned structures span multiple cache lines, requiring multiple loads. Ruby handles alignment internally, but packed binary structures via strings or FFI need careful alignment.

Excessive indirection hurts cache performance. Chains of pointers require multiple cache accesses to reach final data. Ruby's object model introduces inherent indirection, but additional layers compound the problem. Flatten data structures when access patterns allow.

Focusing solely on cache ignores other bottlenecks. TLB misses occur when page table entries don't fit in TLB cache. NUMA effects on multi-socket systems cause remote memory access penalties. Network and disk I/O dominate many applications. Cache optimization matters only when memory access is the bottleneck.

Reference

Cache Hierarchy Characteristics

Cache Level    | Typical Size | Access Latency | Sharing   | Bandwidth
L1 Data        | 32-64 KB     | 3-4 cycles     | Per core  | 96-128 GB/s
L1 Instruction | 32-64 KB     | 3-4 cycles     | Per core  | 96-128 GB/s
L2 Unified     | 256 KB-1 MB  | 10-12 cycles   | Per core  | 48-64 GB/s
L3 Unified     | 8-64 MB      | 40-50 cycles   | All cores | 32-48 GB/s
Main Memory    | 8-128 GB     | 200-300 cycles | System    | 20-40 GB/s

Memory Access Pattern Comparison

Pattern    | Spatial Locality | Temporal Locality | Cache Efficiency   | Example
Sequential | Excellent        | Good              | 90%+ hit rate      | Array traversal
Strided    | Poor to Good     | Good              | Varies with stride | Column access
Random     | None             | Poor              | 10-30% hit rate    | Hash table walk
Blocked    | Excellent        | Excellent         | 85-95% hit rate    | Tiled matrix ops

Cache Optimization Techniques

Technique            | Benefit               | Cost              | When to Use
Loop blocking        | Improved reuse        | Code complexity   | Large matrices
Data padding         | Prevent false sharing | Memory overhead   | Concurrent writes
Structure flattening | Fewer indirections    | Flexibility loss  | Hot paths
Array of Structures  | Grouped field access  | Wasted loads      | Full record processing
Structure of Arrays  | Single field access   | Access complexity | Field-specific operations
Prefetching          | Hide latency          | Bandwidth use     | Predictable patterns

Cache Miss Types

Miss Type  | Cause                              | Solution
Compulsory | First access to data               | Prefetching, larger blocks
Capacity   | Working set exceeds cache          | Blocking, data reduction
Conflict   | Multiple addresses map to same set | Better placement, higher associativity
Coherence  | Invalidation by other core         | Padding, reduce sharing

Ruby Memory Optimization Checklist

Optimization                                    | Impact | Implementation
Prefer arrays over hashes for sequential access | High   | Use Array methods
Process data in chunks fitting L3 cache         | High   | Batch operations
Reuse buffers instead of allocating             | Medium | Preallocate, clear, reuse
Use lazy enumerators for large datasets         | Medium | Convert to lazy
Avoid string concatenation in loops             | Medium | Use String#concat or join
Structure data for access patterns              | High   | SoA vs AoS decision
Minimize object allocations in tight loops      | Medium | Reuse objects, use primitives
Process arrays sequentially when possible       | High   | Avoid random access

Cache Line Considerations

Aspect           | x86-64     | ARM64        | RISC-V     | PowerPC
Line Size        | 64 bytes   | 64-128 bytes | 64 bytes   | 128 bytes
L1 Associativity | 8-way      | 4-way        | 2-8 way    | 8-way
Write Policy     | Write-back | Write-back   | Write-back | Write-back
Coherence        | MESI/MESIF | MOESI        | MOESI      | MESI