Overview
Cache memory sits between the CPU and main memory (RAM), providing fast access to frequently used data. Modern processors operate at multi-gigahertz speeds while main memory access takes hundreds of CPU cycles, creating a performance gap that cache memory bridges. A cache hit retrieves data from cache in a few cycles, while a cache miss requires fetching from main memory, stalling the CPU.
Modern processors implement a cache hierarchy with multiple levels. L1 cache, closest to the CPU cores, provides sub-nanosecond access but stores only 32-64 KB per core. L2 cache holds 256 KB to 1 MB with slightly higher latency. L3 cache, shared across all cores, ranges from several megabytes to tens of megabytes. Each level trades capacity for speed.
Software performance depends heavily on cache behavior. Programs with good spatial and temporal locality keep their working set in cache, achieving near-CPU speeds. Programs with poor locality trigger frequent cache misses, becoming memory-bound and running orders of magnitude slower than their cache-friendly counterparts.
```ruby
# Cache-unfriendly: random access pattern
data = Array.new(1_000_000) { rand(1000) }
sum = 0
indices = (0...1_000_000).to_a.shuffle
indices.each do |i|
  sum += data[i] # Random jumps through memory
end
```
Cache memory operates transparently to software, but understanding its behavior enables writing faster code. Ruby programs, despite running in a VM, experience the same cache effects as native code because memory access patterns ultimately reach the hardware cache.
Key Principles
Cache memory stores copies of data from main memory in small, fast storage. The CPU automatically checks cache before accessing main memory. When data exists in cache (a hit), the CPU retrieves it quickly. When data is absent (a miss), the CPU fetches it from main memory and stores a copy in cache.
Cache memory divides into fixed-size cache lines, typically 64 bytes on x86-64 processors. When the CPU requests a single byte, the cache system fetches an entire cache line containing that byte. This design exploits spatial locality: programs often access memory near recently accessed locations. Fetching whole cache lines amortizes the cost of memory access across multiple operations.
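The payoff of spatial locality is the mirror image of the shuffled-index example above: a plain sequential sum touches each fetched cache line several times before moving on. A minimal sketch (variable names are illustrative):

```ruby
# Sequential traversal: each fetched 64-byte line supplies several
# consecutive elements before the next line is needed.
data = Array.new(1_000_000) { |i| i }

sum = 0
data.each { |v| sum += v } # Adjacent elements share cache lines
# sum is now 499_999_500_000
```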
Temporal locality refers to accessing the same memory location repeatedly within a short time window. Caches retain recently accessed data, making subsequent accesses fast. Loops that repeatedly access the same variables exhibit strong temporal locality. Programs with poor temporal locality evict cache lines before reusing them.
```ruby
# Strong temporal locality
matrix = Array.new(1000) { Array.new(1000, 0) }
# Accumulator reused in every iteration
sum = 0
matrix.each do |row|
  row.each do |value|
    sum += value # 'sum' stays in register/L1 cache
  end
end
```
Cache mapping determines where in cache a memory block can reside. Direct-mapped caches place each memory block in exactly one cache location based on the block's address. Set-associative caches allow each block to reside in one of several locations within a set. Fully associative caches permit any block in any cache location. Higher associativity reduces conflict misses but increases hardware complexity and lookup time.
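The address-to-set mapping can be made concrete with a small calculation. This sketch assumes a hypothetical 32 KB, 8-way set-associative cache with 64-byte lines (so 32,768 / 64 / 8 = 64 sets); the geometry is illustrative, not tied to any particular CPU:

```ruby
# Illustrative set-index computation for a direct/set-associative cache.
LINE_SIZE = 64  # bytes per cache line
NUM_SETS  = 64  # 32 KB / 64-byte lines / 8 ways

def cache_set_for(address)
  block_number = address / LINE_SIZE # which cache line the byte falls in
  block_number % NUM_SETS            # which set that line maps to
end

# Addresses NUM_SETS * LINE_SIZE = 4096 bytes apart collide in the same
# set -- the source of conflict misses.
cache_set_for(0)    # => 0
cache_set_for(4096) # => 0 (conflicts with address 0)
cache_set_for(64)   # => 1
```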
When multiple CPU cores share cache levels, cache coherence protocols maintain consistency. The MESI protocol (Modified, Exclusive, Shared, Invalid) tracks each cache line's state. When one core writes to a cache line, the protocol invalidates copies in other cores' caches. This coordination prevents stale data but introduces overhead when cores access shared memory.
Write policies determine when cache writes propagate to main memory. Write-through caches immediately write to both cache and memory, maintaining consistency at the cost of write performance. Write-back caches mark modified cache lines as dirty and delay writing to memory until eviction. Write-back improves performance but complicates coherence.
```ruby
# Cache line demonstration
class CacheLineDemo
  def initialize
    @data = Array.new(16, 0) # 16 integers in 2-3 cache lines
  end

  def modify_sequential
    # Efficient: modifies adjacent elements in the same cache lines
    @data.each_index { |i| @data[i] += 1 }
  end

  def modify_strided
    # Less efficient: may span multiple cache lines inefficiently
    (0...16).step(4) { |i| @data[i] += 1 }
  end
end
```
Cache replacement policies decide which cache line to evict when bringing in new data. Least Recently Used (LRU) evicts the line accessed furthest in the past. Random replacement selects a line randomly. First In First Out (FIFO) evicts the oldest line. LRU approximates optimal behavior for programs with temporal locality.
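A minimal LRU sketch exploiting the insertion order of Ruby's Hash: re-inserting a key moves it to the back, so the front entry is always the least recently used. The `LRUCache` class is our own illustration, not a standard library API:

```ruby
class LRUCache
  def initialize(capacity)
    @capacity = capacity
    @store = {} # Ruby Hashes preserve insertion order
  end

  def get(key)
    return nil unless @store.key?(key)
    value = @store.delete(key) # remove and re-insert to mark as recent
    @store[key] = value
  end

  def put(key, value)
    @store.delete(key)
    @store[key] = value
    @store.delete(@store.first[0]) if @store.size > @capacity # evict LRU
  end

  def keys
    @store.keys
  end
end

cache = LRUCache.new(2)
cache.put(:a, 1)
cache.put(:b, 2)
cache.get(:a)    # :a is now the most recently used
cache.put(:c, 3) # evicts :b, the least recently used
cache.keys       # => [:a, :c]
```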
Cache capacity limits determine how much data fits in each level. Working sets exceeding cache capacity trigger capacity misses. Programs operating on arrays larger than L3 cache cannot keep all data cached simultaneously and must carefully orchestrate memory access to maximize reuse before eviction.
Performance Considerations
Cache hit rates directly correlate with program performance. A 99% L1 hit rate with the remaining 1% served from L3 performs vastly better than a 90% L1 hit rate where the remaining 10% go to main memory. Main memory access costs 200-300 CPU cycles while an L1 hit costs 3-4 cycles, making misses roughly two orders of magnitude slower.
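The comparison can be made concrete with an average memory access time (AMAT) calculation. The latencies below are illustrative inputs taken from the ranges cited in this section (L1 = 4 cycles, L3 = 45, DRAM = 250):

```ruby
# AMAT = L1 hit time + miss rate * miss penalty (in CPU cycles)
def amat(l1_hit_rate, miss_penalty, l1_cost: 4)
  l1_cost + (1 - l1_hit_rate) * miss_penalty
end

amat(0.99, 45)  # 99% L1 hits, misses served by L3  -> ~4.45 cycles
amat(0.90, 250) # 90% L1 hits, misses go to DRAM    -> ~29 cycles
```

Despite only a nine-point difference in hit rate, the second scenario's average access is more than six times slower.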
Memory access patterns determine cache efficiency. Sequential access patterns exhibit excellent spatial locality, loading cache lines that subsequent operations immediately use. Strided access patterns with stride equal to cache line size waste bandwidth by using only one element per fetched cache line. Random access patterns thrash the cache, constantly evicting data before reuse.
```ruby
# Measuring cache effects through timing
require 'benchmark'

def sum_row_major(matrix)
  sum = 0
  matrix.each do |row|
    row.each { |val| sum += val } # Sequential memory access
  end
  sum
end

def sum_column_major(matrix)
  sum = 0
  (0...matrix[0].size).each do |col|
    matrix.each { |row| sum += row[col] } # Strided access
  end
  sum
end

size = 10_000
matrix = Array.new(size) { Array.new(size) { rand(100) } }

Benchmark.bm(15) do |x|
  x.report("row-major:") { sum_row_major(matrix) }
  x.report("column-major:") { sum_column_major(matrix) }
end
# Row-major typically 2-4x faster due to cache efficiency
```
Data structure layout impacts cache performance. Arrays store elements contiguously, allowing sequential access to load full cache lines. Linked lists scatter nodes throughout memory, triggering a cache miss per node access. Hash tables exhibit unpredictable access patterns that often miss cache.
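A short sketch of the pointer chasing a linked list forces on every step. Ruby has no built-in linked list, so the `Node` struct here is illustrative:

```ruby
# Each node is a separate heap allocation; following next_node is a
# pointer chase that can land anywhere in memory.
Node = Struct.new(:value, :next_node)

def list_sum(head)
  sum = 0
  node = head
  while node
    sum += node.value # each node may live on a different cache line
    node = node.next_node
  end
  sum
end

# Build 1 -> 2 -> 3
head = Node.new(1, Node.new(2, Node.new(3, nil)))
list_sum(head)            # => 6
array_sum = [1, 2, 3].sum # same result from a contiguous layout
```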
Block size selection affects cache behavior in algorithms. Matrix multiplication blocked into tiles that fit in cache reuses data before eviction. Blocking turns a naive, cache-unaware algorithm into a cache-aware one tuned to a specific cache size. Tile size should let the working tiles fit in L2 or L3 cache for maximum reuse.
```ruby
# Blocked matrix multiplication
def multiply_blocked(a, b, block_size = 64)
  n = a.size
  result = Array.new(n) { Array.new(n, 0) }
  (0...n).step(block_size) do |i_block|
    (0...n).step(block_size) do |j_block|
      (0...n).step(block_size) do |k_block|
        # Process one tile
        i_max = [i_block + block_size, n].min
        j_max = [j_block + block_size, n].min
        k_max = [k_block + block_size, n].min
        (i_block...i_max).each do |i|
          (j_block...j_max).each do |j|
            sum = result[i][j]
            (k_block...k_max).each do |k|
              sum += a[i][k] * b[k][j]
            end
            result[i][j] = sum
          end
        end
      end
    end
  end
  result
end
```
Prefetching hints can improve performance when the CPU's automatic prefetcher fails to predict access patterns. Hardware prefetchers detect simple stride patterns but miss complex patterns. Software prefetch instructions load data into cache before needed, hiding memory latency. Ruby provides no direct prefetch support, but access patterns can trigger hardware prefetching.
Cache pollution occurs when infrequently accessed data evicts frequently accessed data. Large data structures processed once should use streaming stores that bypass cache when possible. Ruby's abstraction level prevents direct streaming store control, but processing order affects what remains cached.
Ruby Implementation
Ruby's memory model abstracts hardware details but cache effects remain visible. Ruby objects occupy memory with specific layouts determined by the VM. Understanding these layouts enables cache-aware coding despite Ruby's high-level nature.
Ruby arrays store elements contiguously when possible. Small integer arrays stored as immediate values achieve excellent cache locality. Arrays of objects store references, requiring pointer chasing that may miss cache. Packed arrays of structs using strings or FFI can improve locality.
```ruby
# Cache-friendly integer operations
def process_integers(count)
  data = Array.new(count) { |i| i }
  # Sequential processing - excellent cache behavior
  sum = 0
  data.each { |value| sum += value }
  # Map operation - also processes sequentially
  squared = data.map { |value| value * value }
  sum
end

# Cache-unfriendly object references
class Point
  attr_accessor :x, :y, :z

  def initialize(x, y, z)
    @x, @y, @z = x, y, z
  end
end

def process_points(count)
  points = Array.new(count) { Point.new(rand, rand, rand) }
  # The array stores references - each Point is a separate allocation.
  # Accessing a Point requires a pointer chase, likely a cache miss.
  sum = 0
  points.each { |pt| sum += pt.x + pt.y + pt.z }
  sum
end
```
Ruby's numeric types exhibit different cache characteristics. Small integers (fixnums) fit as immediate values inside object references, avoiding memory access entirely. Large integers (bignums) allocate heap objects, reducing cache efficiency. Most Float values are likewise immediates (flonums) on 64-bit CRuby; floats outside the flonum range, and all floats on 32-bit builds, allocate heap objects.
String operations interact with cache based on string size and modification patterns. Small strings may be embedded in objects, improving locality. Large strings require separate allocations. String concatenation creating temporary strings thrashes cache if performed in loops.
```ruby
# Inefficient: creates many temporary strings
def build_string_bad(parts)
  result = ""
  parts.each { |part| result += part } # Allocates a new string each time
  result
end

# Efficient: single allocation
def build_string_good(parts)
  parts.join
end

# For incremental building
def build_string_incremental(parts)
  result = String.new(capacity: parts.sum(&:length))
  parts.each { |part| result << part } # Appends in place
  result
end
```
Ruby's garbage collector impacts cache behavior. Young generation collections scan recently allocated objects, likely still in cache. Full collections traverse the entire heap, potentially evicting application data from cache. Allocation patterns affect GC frequency and cache disruption.
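CRuby's `GC.stat` exposes the minor/major collection counters, so the collection pressure caused by an allocation burst can be observed directly (absolute counts vary by Ruby version and heap tuning):

```ruby
# Count the collections triggered by a burst of short-lived allocations.
before = GC.stat

500_000.times { String.new("x") } # allocation churn

after = GC.stat
minor = after[:minor_gc_count] - before[:minor_gc_count]
major = after[:major_gc_count] - before[:major_gc_count]
# `minor` is typically positive after such a burst; `major` usually stays
# low because the garbage dies young.
```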
Enumerable methods like map, select, and reduce create intermediate arrays that consume cache space. Chaining multiple operations creates multiple temporary arrays. Lazy enumerators defer computation, reducing intermediate allocations and cache pressure.
```ruby
# Eager evaluation - multiple passes and intermediates
data = (1..1_000_000).to_a
result = data
  .select { |x| x.even? } # Creates an intermediate array
  .map { |x| x * x }      # Creates another intermediate
  .take(100)              # Final array

# Lazy evaluation - single pass, fewer allocations
result = (1..1_000_000)
  .lazy
  .select { |x| x.even? } # No intermediate array
  .map { |x| x * x }
  .take(100)              # Stops early
  .to_a
```
Method calls in tight loops can impact cache. Virtual dispatch requires looking up method implementations, potentially missing instruction cache. Megamorphic call sites with many possible receivers cause instruction cache misses. Monomorphic call sites with single receiver types optimize well.
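A sketch of the distinction between call-site shapes. Whether the VM actually inlines the monomorphic site depends on the Ruby version and whether YJIT is enabled; the example only illustrates receiver-type mixtures:

```ruby
# Monomorphic: every to_s call at this site has an Integer receiver.
nums = [1, 2, 3]
mono = nums.map { |n| n.to_s }

# Megamorphic: the same to_s site sees Integer, String, Symbol, Float.
mixed = [1, "two", :three, 4.0]
mega = mixed.map { |v| v.to_s }
```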
Ruby's copy-on-write mechanics interact with cache coherence. Forked processes share memory pages initially. First write to a shared page triggers a copy, invalidating cache lines in other processes. Applications forking after loading large datasets should minimize writes to shared data.
Design Considerations
Cache optimization matters most for performance-critical code processing large datasets. Small programs operating on kilobytes of data run entirely from cache regardless of optimization. Programs processing megabytes to gigabytes benefit significantly from cache-aware design.
Trade-offs exist between cache optimization and code maintainability. Blocked algorithms improve cache behavior but increase complexity. Determine whether performance gains justify increased code size and reduced readability. Profile before optimizing to identify actual bottlenecks.
```ruby
# Simple but cache-inefficient
def simple_transpose(matrix)
  n = matrix.size
  result = Array.new(n) { Array.new(n) }
  n.times do |i|
    n.times do |j|
      result[j][i] = matrix[i][j] # Column-major write pattern
    end
  end
  result
end

# Cache-optimized with blocking
def blocked_transpose(matrix, block_size = 64)
  n = matrix.size
  result = Array.new(n) { Array.new(n) }
  (0...n).step(block_size) do |i_block|
    (0...n).step(block_size) do |j_block|
      i_max = [i_block + block_size, n].min
      j_max = [j_block + block_size, n].min
      (i_block...i_max).each do |i|
        (j_block...j_max).each do |j|
          result[j][i] = matrix[i][j]
        end
      end
    end
  end
  result
end
```
Data structure selection should account for access patterns. Arrays suit sequential or random access when indices are known. Hashes optimize key-based lookup but exhibit poor cache locality when iterating. Consider whether cache efficiency or lookup speed dominates performance.
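When a lookup table is only ever scanned in full, the difference can be benchmarked directly. A rough sketch; absolute numbers vary by Ruby version and hardware, and a Hash still wins whenever keyed lookup dominates:

```ruby
require 'benchmark'

# Same key/value data in two layouts
pairs = (0...100_000).map { |i| [i, i * 2] }
hash  = pairs.to_h

Benchmark.bm(12) do |x|
  x.report("array scan:") { pairs.sum { |_, v| v } } # contiguous slots
  x.report("hash scan:")  { hash.sum  { |_, v| v } } # entry-table walk
end
```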
Algorithm selection can favor cache-friendly approaches. Merge sort accesses arrays sequentially with good locality. Quicksort's partitioning exhibits good cache behavior for in-place sorting. Heap-based algorithms may suffer cache misses when accessing parent/child nodes.
Organizing data for access patterns improves cache efficiency. Structure of Arrays (SoA) layout stores each field in separate arrays, improving cache utilization when processing single fields. Array of Structures (AoS) stores complete records together, benefiting when accessing multiple fields per record.
```ruby
# Array of Structures - good when accessing multiple fields together
class ParticleAoS
  def initialize(count)
    @particles = Array.new(count) do
      { x: rand, y: rand, z: rand, vx: rand, vy: rand, vz: rand }
    end
  end

  def update_positions(dt)
    @particles.each do |p|
      p[:x] += p[:vx] * dt # Accesses multiple fields per particle
      p[:y] += p[:vy] * dt
      p[:z] += p[:vz] * dt
    end
  end
end

# Structure of Arrays - good when processing single fields
class ParticleSoA
  def initialize(count)
    @x = Array.new(count) { rand }
    @y = Array.new(count) { rand }
    @z = Array.new(count) { rand }
    @vx = Array.new(count) { rand }
    @vy = Array.new(count) { rand }
    @vz = Array.new(count) { rand }
  end

  def update_x_positions(dt)
    @x.size.times do |i|
      @x[i] += @vx[i] * dt # Sequential access to the x and vx arrays
    end
  end
end
```
Padding and alignment can prevent false sharing in concurrent programs. Cache lines shared between threads cause cache coherence traffic when either thread writes. Padding ensures separate threads access different cache lines even when modifying adjacent data.
Memory allocation patterns affect cache behavior. Pre-allocating arrays at target size avoids reallocation and copying. Reusing buffers across iterations reduces allocation overhead and keeps data cache-hot. Object pooling trades memory for cache efficiency when allocation dominates performance.
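A sketch of buffer reuse: one preallocated scratch array serves every iteration, keeping the same memory (and cache lines) hot instead of allocating a fresh array per batch. The `process_batches` helper is hypothetical:

```ruby
def process_batches(batches, buffer_size)
  buffer = Array.new(buffer_size, 0) # allocated once, reused every pass
  results = []
  batches.each do |batch|
    batch.each_with_index { |v, i| buffer[i] = v * 2 }
    results << buffer.take(batch.size).sum
  end
  results
end

process_batches([[1, 2], [3, 4, 5]], 8) # => [6, 24]
```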
Practical Examples
Matrix operations demonstrate cache effects clearly. Row-major traversal accesses consecutive memory, loading useful data in each cache line. Column-major traversal jumps by row width, accessing one element per cache line and wasting bandwidth.
```ruby
require 'benchmark'

def create_matrix(size)
  Array.new(size) { Array.new(size) { rand } }
end

def sum_by_row(matrix)
  sum = 0
  matrix.each do |row|
    row.each { |val| sum += val }
  end
  sum
end

def sum_by_column(matrix)
  sum = 0
  n = matrix.size
  n.times do |col|
    matrix.each { |row| sum += row[col] }
  end
  sum
end

sizes = [1000, 2000, 4000]
sizes.each do |size|
  matrix = create_matrix(size)
  puts "Matrix size: #{size}x#{size}"
  Benchmark.bm(15) do |x|
    x.report("by row:") { sum_by_row(matrix) }
    x.report("by column:") { sum_by_column(matrix) }
  end
  puts
end
# Observe the performance gap widen as the matrix grows
```
Image processing exhibits strong cache effects. Processing pixels row-by-row accesses memory sequentially. Processing column-by-column causes strided access with poor cache utilization. Tiled processing balances both directions.
```ruby
# Represent the image as a flat array in row-major order
class Image
  attr_reader :width, :height, :pixels

  def initialize(width, height)
    @width = width
    @height = height
    @pixels = Array.new(width * height, 0)
  end

  def get(x, y)
    @pixels[y * @width + x]
  end

  def set(x, y, value)
    @pixels[y * @width + x] = value
  end
end

# Cache-friendly row-major processing
def blur_horizontal(image)
  result = Image.new(image.width, image.height)
  image.height.times do |y|
    image.width.times do |x|
      if x == 0 || x == image.width - 1
        result.set(x, y, image.get(x, y))
      else
        avg = (image.get(x - 1, y) + image.get(x, y) + image.get(x + 1, y)) / 3.0
        result.set(x, y, avg)
      end
    end
  end
  result
end

# Less cache-friendly vertical processing
def blur_vertical(image)
  result = Image.new(image.width, image.height)
  image.width.times do |x|
    image.height.times do |y| # Inner loop jumps by the row width
      if y == 0 || y == image.height - 1
        result.set(x, y, image.get(x, y))
      else
        avg = (image.get(x, y - 1) + image.get(x, y) + image.get(x, y + 1)) / 3.0
        result.set(x, y, avg)
      end
    end
  end
  result
end
```
Sorting algorithms exhibit different cache behaviors. In-place sorting modifies arrays sequentially with good locality. Algorithms creating many temporary arrays pressure cache. Comparison count matters less than memory access patterns for large datasets.
```ruby
# Cache-efficient in-place quicksort
def quicksort!(arr, left = 0, right = arr.length - 1)
  return if left >= right
  pivot_index = partition!(arr, left, right)
  quicksort!(arr, left, pivot_index - 1)
  quicksort!(arr, pivot_index + 1, right)
end

def partition!(arr, left, right)
  pivot = arr[right]
  i = left
  (left...right).each do |j|
    if arr[j] <= pivot
      arr[i], arr[j] = arr[j], arr[i] # Swap in place
      i += 1
    end
  end
  arr[i], arr[right] = arr[right], arr[i]
  i
end

# Compare against the built-in sort
require 'benchmark'

data = Array.new(1_000_000) { rand(1_000_000) }
data_copy = data.dup

Benchmark.bm do |x|
  x.report("in-place quicksort:") { quicksort!(data) }
  x.report("built-in sort:") { data_copy.sort! }
end
```
Sparse matrix operations demonstrate the impact of data structure on cache. Coordinate list format requires sequential scan. Compressed sparse row format enables cache-friendly row operations. Choice depends on operation patterns.
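A minimal CSR sketch for a small 3x3 matrix. The `row_sum` helper is hypothetical; it shows how a row operation becomes a contiguous slice scan:

```ruby
# Matrix: [[5, 0, 0],
#          [0, 8, 3],
#          [0, 0, 6]]
values      = [5, 8, 3, 6] # non-zero entries, row by row
col_indices = [0, 1, 2, 2] # column of each value
row_ptr     = [0, 1, 3, 4] # row i occupies values[row_ptr[i]...row_ptr[i+1]]

def row_sum(values, row_ptr, row)
  values[row_ptr[row]...row_ptr[row + 1]].sum # sequential slice scan
end

row_sum(values, row_ptr, 1) # => 11 (8 + 3)
```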
Common Pitfalls
False sharing occurs when separate threads access different variables residing in the same cache line. One thread's write invalidates the entire cache line in other cores' caches, causing cache coherence traffic even though threads access different data.
```ruby
# Potential false sharing
class Counter
  def initialize
    @count = 0
  end

  def increment
    @count += 1 # Multiple threads incrementing different counters
  end
end

# If several Counter objects are allocated consecutively on the heap,
# they may share cache lines
counters = Array.new(8) { Counter.new }

# Illustrative mitigation: extra instance variables enlarge the object so
# neighboring counters are less likely to share a line. Ruby offers no
# guaranteed control over object placement, so treat this as a sketch.
class PaddedCounter
  def initialize
    @count = 0
    @padding = Array.new(15, 0) # Best-effort spacing, not a guarantee
  end

  def increment
    @count += 1
  end
end
```
Cache thrashing happens when the working set exceeds cache capacity. Repeated access to data that does not fit causes continuous eviction and reloading. Blocked algorithms shrink the working set until it fits in cache.
Assuming cache line size leads to portability issues. Cache line sizes vary from 32 to 128 bytes across architectures. Write portable code or detect cache line size at runtime. Ruby's abstraction usually prevents direct cache line assumptions, but be cautious with FFI or extensions.
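On Linux the line size can be discovered at runtime from sysfs rather than hard-coded. Availability of the sysfs path varies by kernel and architecture, hence the fallback:

```ruby
def cache_line_size
  path = "/sys/devices/system/cpu/cpu0/cache/index0/coherency_line_size"
  size = File.exist?(path) ? File.read(path).to_i : 0
  size > 0 ? size : 64 # fall back to the common 64-byte line
end

cache_line_size # 64 on most x86-64 Linux systems
```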
Premature cache optimization wastes development time. Profile first to confirm cache misses cause performance issues. Many programs run fast enough without cache optimization. Optimize only proven bottlenecks.
```ruby
# Premature optimization example
def compute_values(n)
  # Developer assumes cache optimization is needed,
  # but most time is actually spent in computation, not memory access
  result = Array.new(n)
  n.times { |i| result[i] = expensive_computation(i) }
  result
end

def expensive_computation(x)
  # Complex calculation dominates the runtime
  (1..100).reduce(x) { |acc, _| Math.sin(acc) * Math.cos(acc) }
end

# Profiling reveals a computation bottleneck, not cache misses,
# so cache optimization is pointless here
```
Ignoring alignment requirements in data structures wastes cache space. Misaligned structures span multiple cache lines, requiring multiple loads. Ruby handles alignment internally, but packed binary structures via strings or FFI need careful alignment.
Excessive indirection hurts cache performance. Chains of pointers require multiple cache accesses to reach final data. Ruby's object model introduces inherent indirection, but additional layers compound the problem. Flatten data structures when access patterns allow.
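A sketch of flattening: replacing an array of point objects with one flat coordinate array turns a pointer chase into index arithmetic. The `point_x` helper is hypothetical:

```ruby
# Nested: points[i].x  -> two dereferences (array slot, then object)
# Flat:   coords[3*i]  -> one contiguous read
coords = [1.0, 2.0, 3.0,  # point 0: x, y, z
          4.0, 5.0, 6.0]  # point 1: x, y, z

def point_x(coords, i)
  coords[3 * i]
end

point_x(coords, 1) # => 4.0
```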
Focusing solely on cache ignores other bottlenecks. TLB misses occur when an address translation is absent from the translation lookaside buffer. NUMA effects on multi-socket systems impose remote memory access penalties. Network and disk I/O dominate many applications. Cache optimization matters only when memory access is the bottleneck.
Reference
Cache Hierarchy Characteristics
| Cache Level | Typical Size | Access Latency | Sharing | Bandwidth |
|---|---|---|---|---|
| L1 Data | 32-64 KB | 3-4 cycles | Per core | 96-128 GB/s |
| L1 Instruction | 32-64 KB | 3-4 cycles | Per core | 96-128 GB/s |
| L2 Unified | 256 KB-1 MB | 10-12 cycles | Per core | 48-64 GB/s |
| L3 Unified | 8-64 MB | 40-50 cycles | All cores | 32-48 GB/s |
| Main Memory | 8-128 GB | 200-300 cycles | System | 20-40 GB/s |
Memory Access Pattern Comparison
| Pattern | Spatial Locality | Temporal Locality | Cache Efficiency | Example |
|---|---|---|---|---|
| Sequential | Excellent | Good | 90%+ hit rate | Array traversal |
| Strided | Poor to Good | Good | Varies with stride | Column access |
| Random | None | Poor | 10-30% hit rate | Hash table walk |
| Blocked | Excellent | Excellent | 85-95% hit rate | Tiled matrix ops |
Cache Optimization Techniques
| Technique | Benefit | Cost | When to Use |
|---|---|---|---|
| Loop blocking | Improved reuse | Code complexity | Large matrices |
| Data padding | Prevent false sharing | Memory overhead | Concurrent writes |
| Structure flattening | Fewer indirections | Flexibility loss | Hot paths |
| Array of Structures | Grouped field access | Wasted loads | Full record processing |
| Structure of Arrays | Single field access | Access complexity | Field-specific operations |
| Prefetching | Hide latency | Bandwidth use | Predictable patterns |
Cache Miss Types
| Miss Type | Cause | Solution |
|---|---|---|
| Compulsory | First access to data | Prefetching, larger blocks |
| Capacity | Working set exceeds cache | Blocking, data reduction |
| Conflict | Multiple addresses map to same set | Better placement, higher associativity |
| Coherence | Invalidation by other core | Padding, reduce sharing |
Ruby Memory Optimization Checklist
| Optimization | Impact | Implementation |
|---|---|---|
| Prefer arrays over hashes for sequential access | High | Use Array methods |
| Process data in chunks fitting L3 cache | High | Batch operations |
| Reuse buffers instead of allocating | Medium | Preallocate, clear, reuse |
| Use lazy enumerators for large datasets | Medium | Convert to lazy |
| Avoid string concatenation in loops | Medium | Use String#concat or join |
| Structure data for access patterns | High | SoA vs AoS decision |
| Minimize object allocations in tight loops | Medium | Reuse objects, use primitives |
| Process arrays sequentially when possible | High | Avoid random access |
Cache Line Considerations
| Aspect | x86-64 | ARM64 | RISC-V | PowerPC |
|---|---|---|---|---|
| Line Size | 64 bytes | 64-128 bytes | 64 bytes | 128 bytes |
| L1 Associativity | 8-way | 4-way | 2-8 way | 8-way |
| Write Policy | Write-back | Write-back | Write-back | Write-back |
| Coherence | MESI/MESIF | MOESI | MOESI | MESI |