CrackedRuby

Overview

Paging and segmentation represent two fundamental approaches to memory management in operating systems. Both mechanisms implement virtual memory by translating logical addresses used by programs into physical addresses in RAM. Virtual memory creates an abstraction layer between the address space a program sees and the actual physical memory hardware, providing benefits including process isolation, memory protection, and the ability to run programs larger than available physical memory.

Paging divides both virtual and physical memory into fixed-size blocks. Virtual memory splits into pages while physical memory divides into frames of equal size. The operating system maintains page tables that map virtual pages to physical frames. When a program accesses memory, the Memory Management Unit (MMU) translates the virtual address to a physical address using the page table. If the requested page resides in physical memory, the access proceeds immediately. If not, a page fault occurs and the OS loads the page from disk.

Segmentation takes a different approach by dividing memory into variable-sized segments based on logical program structure. Each segment represents a logical unit such as code, data, stack, or heap. Programs reference memory using a segment selector and an offset within that segment. The segment table stores base addresses and limits for each segment, enabling the MMU to translate segment-relative addresses to physical addresses while enforcing access controls.

Modern operating systems typically combine both approaches. Intel x86-64 architecture, for instance, uses segmentation for backwards compatibility but implements true memory management primarily through paging with multi-level page tables. This hybrid model provides the protection benefits of segmentation with the efficient memory management of paging.

Virtual Address Space (Paging)
┌─────────────┐
│   Page 0    │ ──┐
├─────────────┤   │
│   Page 1    │   │    Page Table        Physical Memory
├─────────────┤   │   ┌──────────┐      ┌─────────────┐
│   Page 2    │ ──┼──→│ Frame 5  │─────→│   Frame 0   │
├─────────────┤   │   ├──────────┤      ├─────────────┤
│   Page 3    │   └──→│ Frame 2  │─┐    │   Frame 1   │
└─────────────┘       ├──────────┤ │    ├─────────────┤
                      │ Frame 7  │ └───→│   Frame 2   │
                      └──────────┘      └─────────────┘

Virtual Address Space (Segmentation)
┌─────────────┐
│    Code     │ ──┐
├─────────────┤   │   Segment Table
│    Data     │   │  ┌─────────────────┐
├─────────────┤   └─→│ Base: 0x10000   │
│    Stack    │ ──┐  │ Limit: 4096     │
├─────────────┤   │  ├─────────────────┤
│    Heap     │   └─→│ Base: 0x50000   │
└─────────────┘      │ Limit: 8192     │
                     └─────────────────┘

Key Principles

Virtual memory operates on the principle that programs need not occupy contiguous physical memory. The operating system and hardware collaborate to maintain the illusion of a continuous address space while actual data resides in scattered physical locations or even on disk storage.

Address Translation in Paging

A virtual address in a paging system consists of two components: a page number and an offset within that page. For a 32-bit address space with 4KB pages (2^12 bytes), the upper 20 bits identify the page number while the lower 12 bits specify the offset. The MMU uses the page number as an index into the page table, retrieves the corresponding frame number, and concatenates it with the offset to form the physical address.

Virtual Address: | Page Number (20 bits) | Offset (12 bits) |

Page Table Lookup:
Page Number → Page Table Entry → Frame Number

Physical Address: | Frame Number | Offset |
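The bit split above can be sketched in Ruby. This is a conceptual illustration of the arithmetic, not how an MMU computes addresses in hardware:

```ruby
# Split a 32-bit virtual address into page number and offset (4KB pages)
PAGE_SHIFT  = 12                      # 2^12 = 4096-byte pages
OFFSET_MASK = (1 << PAGE_SHIFT) - 1   # low 12 bits

def split_virtual_address(vaddr)
  [vaddr >> PAGE_SHIFT, vaddr & OFFSET_MASK]
end

def physical_address(frame_number, offset)
  # Concatenate frame number with the unchanged offset
  (frame_number << PAGE_SHIFT) | offset
end

page, offset = split_virtual_address(0x12345678)
# page = 0x12345, offset = 0x678
```

The offset passes through translation untouched; only the upper bits are remapped.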

Page tables reside in memory and contain one entry per page in the virtual address space. Each page table entry (PTE) stores the frame number along with control bits indicating whether the page is present in memory, whether it has been modified (dirty bit), whether it has been accessed (reference bit), and protection information (read/write/execute permissions).

Multi-Level Paging

Single-level page tables become impractical for large address spaces. A 64-bit address space with 4KB pages would require a page table with 2^52 entries, consuming petabytes of memory per process. Multi-level paging solves this by organizing page tables hierarchically. The virtual address splits into multiple fields, each indexing a level in the page table hierarchy.

64-bit Virtual Address with 4-level paging:
| PML4 (9 bits) | Directory Ptr (9 bits) | Directory (9 bits) | Table (9 bits) | Offset (12 bits) |

Translation Process:
1. Use PML4 bits to index into PML4 table
2. Retrieved entry points to Page Directory Pointer table
3. Use Directory Ptr bits to index into that table
4. Retrieved entry points to Page Directory table
5. Use Directory bits to index into that table
6. Retrieved entry points to Page Table
7. Use Table bits to index into Page Table
8. Retrieved entry contains Frame Number
9. Concatenate Frame Number with Offset

This hierarchical structure allocates page tables only for memory regions actually in use. A process using 10MB of memory requires only a handful of page tables rather than a complete page table covering the entire 64-bit address space.

Translation Lookaside Buffer

Page table lookups require memory accesses, potentially multiple accesses for multi-level tables. To mitigate this overhead, processors include a Translation Lookaside Buffer (TLB), a small cache storing recent virtual-to-physical address mappings. The TLB operates as a fully associative cache, checking all entries in parallel for a matching virtual page number. A TLB hit completes address translation without accessing memory. A TLB miss requires a page table walk, and the resulting mapping populates the TLB, potentially evicting an old entry.

TLB reach represents the amount of memory covered by TLB entries. With 64 TLB entries and 4KB pages, the TLB covers 256KB. Programs with good spatial locality keep most memory accesses within the TLB reach, achieving near-zero translation overhead. Programs accessing memory scattered across a large address space suffer frequent TLB misses.
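A toy model makes the hit/miss behavior concrete. This sketch uses FIFO eviction for simplicity (real TLBs typically use LRU-like policies) and a Hash standing in for the page table:

```ruby
# Minimal TLB model: a bounded cache of page -> frame mappings, FIFO eviction
class SimpleTLB
  attr_reader :hits, :misses

  def initialize(capacity)
    @capacity = capacity
    @entries = {}             # page number => frame number (insertion-ordered)
    @hits = @misses = 0
  end

  def lookup(page, page_table)
    if @entries.key?(page)
      @hits += 1
    else
      @misses += 1                                  # requires a page table walk
      @entries.shift if @entries.size >= @capacity  # evict oldest entry
      @entries[page] = page_table[page]
    end
    @entries[page]
  end
end

tlb = SimpleTLB.new(4)
page_table = Hash.new { |h, k| h[k] = k + 100 }  # fake page table
[0, 1, 0, 1, 2, 3, 4, 0].each { |p| tlb.lookup(p, page_table) }
# Re-touching pages 0 and 1 hits; touching page 4 evicts page 0,
# so the final access to page 0 misses again
```

Access patterns that cycle through more pages than the TLB holds miss on every access, which is exactly the pathology large-working-set programs exhibit.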

Segmentation Fundamentals

Segmentation divides memory based on logical program structure rather than arbitrary fixed-size blocks. A program might have five segments: code, initialized data, uninitialized data, heap, and stack. Each segment grows or shrinks independently and has distinct access permissions. The code segment permits execution but prohibits writes, while data segments allow reads and writes but prohibit execution.

Virtual addresses in segmentation contain a segment selector and an offset. The segment selector indexes into a segment table, retrieving a segment descriptor that contains the base address, limit (size), and access rights. The MMU adds the offset to the base address to produce the physical address, but first validates that the offset does not exceed the limit, raising a segmentation fault if it does.

Logical Address: | Segment Selector | Offset |

Segment Table Entry:
- Base Address: Physical memory start
- Limit: Segment size
- Access Rights: R/W/X permissions
- Present bit: In memory or swapped

Translation:
Physical Address = Base Address + Offset (if Offset < Limit)
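The base-plus-offset translation with limit checking can be sketched as follows; the segment names and sizes mirror the example table above:

```ruby
# Segment-table translation with limit checking (conceptual sketch)
Segment = Struct.new(:base, :limit)

def translate_segmented(segment_table, selector, offset)
  seg = segment_table[selector]
  # Validate before adding the base; out-of-range offsets fault
  raise "segmentation fault" if seg.nil? || offset >= seg.limit
  seg.base + offset
end

segments = {
  code:  Segment.new(0x10000, 4096),
  stack: Segment.new(0x50000, 8192)
}

translate_segmented(segments, :code, 0x100)  # => 0x10100
```

An offset of 5000 into the 4096-byte code segment would raise rather than translate, which is the protection benefit segmentation provides.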

Page Faults and Demand Paging

Operating systems implement demand paging, loading pages into memory only when accessed. Initially, page table entries mark pages as not present. The first access triggers a page fault exception. The OS page fault handler locates the page on disk, finds a free frame (or evicts an existing page), loads the page data, updates the page table entry to mark it present with the correct frame number, and resumes the program. The program remains unaware that execution paused while the OS performed disk I/O.

Page replacement algorithms determine which page to evict when physical memory fills. Common algorithms include Least Recently Used (LRU), which evicts the page accessed furthest in the past, and Clock algorithm, an approximation of LRU using reference bits. The OS must write modified pages back to disk before reusing their frames, while unmodified pages can be discarded since an unchanged copy exists on disk.
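The Clock algorithm's sweep can be sketched directly. This is a simplified model: the hand walks a circular list of frames, clearing reference bits until it finds one already clear:

```ruby
# Clock (second-chance) page replacement sketch
class ClockReplacer
  Frame = Struct.new(:page, :referenced)

  def initialize(pages)
    @frames = pages.map { |p| Frame.new(p, true) }
    @hand = 0
  end

  # Hardware would set the reference bit on access; we simulate it here
  def touch(page)
    frame = @frames.find { |f| f.page == page }
    frame.referenced = true if frame
  end

  def evict
    loop do
      frame = @frames[@hand]
      @hand = (@hand + 1) % @frames.size
      if frame.referenced
        frame.referenced = false   # second chance: clear and move on
      else
        return frame.page          # unreferenced since last sweep: evict
      end
    end
  end
end

replacer = ClockReplacer.new([:a, :b, :c])
replacer.evict
# All bits start set, so the hand clears a, b, c, then evicts :a
```

A recently touched page survives one full sweep, which is why Clock approximates LRU without tracking exact access times.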

Segmentation with Paging

Combined approaches use segmentation for logical organization and paging for physical memory management. Each segment contains multiple pages, and the system maintains a page table for each segment. The logical address contains a segment selector, page number within segment, and offset within page. Translation first uses the segment selector to find the segment's page table, then performs normal paging translation using that table.

Combined Address: | Segment | Page | Offset |
                      ↓       ↓       ↓
                  Segment  Page   Physical
                  Table    Table  Memory

This combination provides segmentation's logical protection with paging's efficient memory utilization. Different segments can have different page table structures, allowing the code segment to share pages across processes (for shared libraries) while keeping data segments private.
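The two-step translation can be sketched as one function: validate the offset against the segment limit, then run paging translation through that segment's page table. The structures here are illustrative, not any particular architecture's layout:

```ruby
# Combined segmentation + paging translation (conceptual sketch)
PAGE_SIZE = 4096

# Each segment carries its own page table (page number => frame number)
SegmentDescriptor = Struct.new(:limit, :page_table)

def translate_combined(segment_table, selector, address)
  seg = segment_table.fetch(selector) { raise "bad segment" }
  raise "segmentation fault" if address >= seg.limit
  page, offset = address.divmod(PAGE_SIZE)
  frame = seg.page_table.fetch(page) { raise "page fault" }
  frame * PAGE_SIZE + offset
end

segs = { data: SegmentDescriptor.new(8192, { 0 => 7, 1 => 3 }) }
translate_combined(segs, :data, 4100)  # page 1, offset 4 -> frame 3
```

Note the two distinct failure modes: exceeding the limit is a protection violation, while a missing page table entry is an ordinary page fault the OS can service.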

Design Considerations

Paging Advantages

Paging eliminates external fragmentation since any frame can hold any page. All frames have identical size, so any free frame satisfies a page allocation request. This property simplifies memory allocation and allows the OS to use all available physical memory. Internal fragmentation still occurs - the last page of each memory region may be partially unused - but averages only half a page per region, typically 2KB with 4KB pages.

Paging enables efficient memory sharing between processes. Multiple page tables can map different virtual pages to the same physical frame. Shared libraries exploit this by having one copy in physical memory with every process mapping pages to those frames. Copy-on-write fork implementations initially share all pages between parent and child, marking them read-only. When either process writes to a page, the OS copies it to a new frame and updates the writing process's page table.

Page-based protection, however, operates at coarse granularity. Each page has uniform permissions across its entire 4KB span. Programs cannot protect individual variables or small structures; the smallest protected unit remains a full page. This limitation rarely matters for code and large data structures but affects fine-grained security mechanisms.

Segmentation Advantages

Segmentation aligns with logical program structure, making protection and sharing more intuitive. The code segment naturally has execute-only permission while the stack has read-write permission. Each segment grows independently - the stack grows downward from high addresses while the heap grows upward from low addresses, and they never interfere because they occupy separate segments.

Variable-sized segments eliminate internal fragmentation. A 100-byte data structure consumes exactly 100 bytes in its segment rather than occupying a full 4KB page. For programs with many small objects, this efficiency gain becomes significant. However, external fragmentation emerges as segments of different sizes allocate and deallocate, leaving memory gaps too small for subsequent allocations.

Segment-based sharing, on the other hand, operates at coarser granularity than page-based sharing. Two processes can share an entire segment (a complete library) but cannot share individual functions within that library unless each occupies its own segment. This limitation makes segmentation less suitable for fine-grained sharing scenarios.

Hybrid Architecture Trade-offs

Intel x86-64 combines segmentation and paging but makes segmentation mostly vestigial. The architecture maintains segment registers and segment descriptor tables for backwards compatibility, but 64-bit mode essentially disables segmentation by forcing all segments to span the entire address space. True memory management occurs through four-level paging. This design provides paging's allocation efficiency while preserving segment-based privilege levels for security.

ARM architectures historically used pure paging without segmentation, achieving simpler hardware with equivalent functionality. Modern ARM systems implement paging with additional protection mechanisms through memory domains and privilege levels, providing security without segmentation overhead.

The choice between approaches depends on hardware constraints and software requirements. Embedded systems with limited memory might prefer simpler single-level paging. Server systems handling multiple processes benefit from multi-level paging's memory efficiency. Real-time systems might avoid paging entirely or use locked pages to guarantee deterministic memory access latency.

Page Size Selection

Operating systems must choose page sizes that balance competing concerns. Larger pages reduce page table size - 2MB pages need 512 times fewer page table entries than 4KB pages for the same virtual address space. Fewer page tables consume less memory and increase TLB reach. A TLB with 64 entries covers 128MB with 2MB pages but only 256KB with 4KB pages.
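The TLB reach figures above are simple products of entry count and page size:

```ruby
# TLB reach = number of entries * page size
def tlb_reach(entries, page_size)
  entries * page_size
end

tlb_reach(64, 4096)        # 256 KB with 4KB pages
tlb_reach(64, 2 * 1024**2) # 128 MB with 2MB pages
```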

However, larger pages waste more memory through internal fragmentation and reduce page fault locality. A program accessing one byte triggers loading an entire page. With 2MB pages, accessing a single array element loads 2MB, potentially evicting 512 smaller pages that the program actively uses. This trade-off pushes most systems toward smaller pages for general use.

Many architectures support multiple page sizes simultaneously. Linux on x86-64 uses 4KB pages by default but supports 2MB huge pages and 1GB giant pages. Applications explicitly request huge pages for large memory regions like database buffer pools, gaining TLB reach benefits without imposing huge page overhead on all memory.

Implementation Approaches

Single-Level Paging

The straightforward paging implementation maintains one page table per process. Each page table contains one entry per page in the virtual address space. For 32-bit systems with 4KB pages, each process needs a page table with 2^20 entries (1 million entries). With 4-byte page table entries, each page table consumes 4MB. This approach works for 32-bit systems where address spaces remain manageable.

The OS stores the page table base register (PTBR) in each process control block. During context switches, the OS loads the new process's PTBR into the MMU, instantly switching to that process's address space. The MMU adds the virtual page number multiplied by entry size to the PTBR to locate the correct page table entry.

Single-Level Translation:
1. Extract page number from virtual address
2. Calculate PTE address: PTBR + (page_number * entry_size)
3. Read PTE from calculated address
4. Extract frame number from PTE
5. Calculate physical address: (frame_number * page_size) + offset
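The five steps above reduce to a single table lookup plus arithmetic. A minimal sketch, with a Hash standing in for the contiguous in-memory table:

```ruby
# Single-level page table lookup following the steps above
PAGE_SIZE = 4096

def translate_single_level(page_table, vaddr)
  page_number = vaddr / PAGE_SIZE     # steps 1-2: index into the table
  offset      = vaddr % PAGE_SIZE
  frame = page_table[page_number]     # steps 3-4: read the entry
  raise "page fault" if frame.nil?
  frame * PAGE_SIZE + offset          # step 5: form the physical address
end

table = { 0 => 9, 1 => 4 }            # page number => frame number
translate_single_level(table, 4196)   # page 1, offset 100 -> 4 * 4096 + 100
```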

Single-level paging becomes impractical for 64-bit systems. Even with sparse address space usage, the OS must allocate the full page table for each process or implement complex mechanisms to handle missing page table regions. The memory overhead and management complexity drive adoption of multi-level schemes.

Hierarchical Paging

Multi-level paging organizes page tables as a tree structure. The topmost level always resides in memory. Lower levels allocate on demand as programs use memory in their corresponding regions. A process using 1GB of memory in a 64-bit space might require only the top-level table (covering the entire address space) plus a handful of lower-level tables (covering actually used regions).

x86-64 uses four-level paging with 512-entry tables at each level. The virtual address splits into five components: four 9-bit indices for the four page table levels plus a 12-bit offset. Each page table occupies exactly one page (512 entries × 8 bytes per entry = 4096 bytes), simplifying page table management.

# Conceptual representation of multi-level address translation
class PageTableEntry
  attr_accessor :frame_number, :present, :writable, :executable
  
  def initialize(frame_number, present: true, writable: true, executable: false)
    @frame_number = frame_number
    @present = present
    @writable = writable
    @executable = executable
  end
end

class PageTable
  def initialize
    @entries = Array.new(512) # 512 entries per table
  end
  
  def [](index)
    @entries[index]
  end
  
  def []=(index, entry)
    @entries[index] = entry
  end
end

# Simulated backing store: frame numbers map to the page tables they hold
FRAME_STORE = {}

def get_table(frame_number)
  FRAME_STORE.fetch(frame_number)
end

# Simulated 4-level page table walk
def translate_address(virtual_addr, pml4_table)
  # Extract indices from 48-bit virtual address
  pml4_idx = (virtual_addr >> 39) & 0x1FF  # bits 39-47
  pdp_idx = (virtual_addr >> 30) & 0x1FF   # bits 30-38
  pd_idx = (virtual_addr >> 21) & 0x1FF    # bits 21-29
  pt_idx = (virtual_addr >> 12) & 0x1FF    # bits 12-20
  offset = virtual_addr & 0xFFF            # bits 0-11
  
  # Level 1: PML4
  pml4_entry = pml4_table[pml4_idx]
  return nil unless pml4_entry&.present
  
  # Level 2: Page Directory Pointer
  pdp_table = get_table(pml4_entry.frame_number)
  pdp_entry = pdp_table[pdp_idx]
  return nil unless pdp_entry&.present
  
  # Level 3: Page Directory
  pd_table = get_table(pdp_entry.frame_number)
  pd_entry = pd_table[pd_idx]
  return nil unless pd_entry&.present
  
  # Level 4: Page Table
  pt_table = get_table(pd_entry.frame_number)
  pt_entry = pt_table[pt_idx]
  return nil unless pt_entry&.present
  
  # Compute physical address
  physical_addr = (pt_entry.frame_number << 12) | offset
  physical_addr
end

The memory savings from hierarchical paging prove dramatic. A process using 1GB might need 1 PML4 table, 1 PDP table, 1 PD table, and 512 page tables (each page table covers 2MB), totaling about 2MB of page table overhead compared to the petabytes a full single-level table would require.

Inverted Page Tables

Traditional page tables scale with virtual address space size, consuming significant memory in systems with many processes or large address spaces. Inverted page tables flip this by maintaining one entry per physical frame rather than per virtual page. The table size scales with physical memory, remaining constant regardless of virtual address space size or number of processes.

Each inverted page table entry stores the process ID and virtual page number occupying that frame. Address translation requires searching the table for an entry matching the current process and requested virtual page. Linear search becomes prohibitively slow, so implementations use hash tables indexed by (process_id, virtual_page_number) pairs. Hash collisions chain entries together.

Inverted Page Table Entry:
- Process ID
- Virtual Page Number  
- Protection Bits
- Next Pointer (for hash chain)

Translation:
1. Hash (PID, VPN) to get table index
2. Search chain starting at that index
3. If found, index is the frame number
4. If not found, page fault

Inverted page tables benefit systems with large virtual spaces and limited physical memory, such as IBM PowerPC systems. They consume fixed memory regardless of process count or address space size. However, searching takes longer than indexed lookup, and sharing pages between processes requires multiple entries pointing to the same frame.
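A sketch of the idea, with Ruby's Hash standing in for the hash-with-chaining described above (Ruby resolves collisions internally, so the explicit next-pointer chain is not modeled):

```ruby
# Inverted page table sketch: one entry per physical frame,
# looked up by (process ID, virtual page number)
class InvertedPageTable
  Entry = Struct.new(:pid, :vpn)

  def initialize(frame_count)
    @frames = Array.new(frame_count)  # frame index => Entry occupying it
    @index  = {}                      # [pid, vpn] => frame index
  end

  def map(pid, vpn, frame)
    @frames[frame] = Entry.new(pid, vpn)
    @index[[pid, vpn]] = frame
  end

  def translate(pid, vpn)
    # The position found IS the frame number; absence means a page fault
    @index.fetch([pid, vpn]) { raise "page fault" }
  end
end

ipt = InvertedPageTable.new(8)
ipt.map(42, 0x12345, 3)
ipt.translate(42, 0x12345)  # => 3
```

Note that two processes mapping the same frame would need two index entries pointing at it, which is the sharing awkwardness mentioned above.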

Segmentation Schemes

Pure segmentation maintains a segment table per process. Each entry contains a base address, limit, and access rights. Translation adds the offset to the base address after validating it against the limit. The simplicity enables fast translation but suffers from external fragmentation as variable-sized segments allocate and deallocate.

Memory compaction addresses fragmentation by relocating segments to consolidate free space. The OS pauses affected processes, copies segment contents to new locations, updates segment table entries with new base addresses, and resumes execution. This process costs significant time and becomes impractical with large segments or frequent allocation changes.

Best-fit and first-fit algorithms attempt to minimize fragmentation during allocation. Best-fit searches for the smallest free region accommodating the requested size, minimizing wasted space but requiring full free list traversal. First-fit uses the first adequate region, executing faster but potentially leaving small unusable fragments. Buddy system allocation restricts sizes to powers of two, enabling fast allocation and deallocation with bounded fragmentation.
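The two search policies can be contrasted over a toy free list of [offset, size] regions:

```ruby
# First-fit vs best-fit over a free list of [offset, size] regions
def first_fit(free_list, size)
  # Stop at the first region large enough
  free_list.find { |_offset, region_size| region_size >= size }
end

def best_fit(free_list, size)
  # Scan the whole list for the smallest adequate region
  free_list.select { |_offset, region_size| region_size >= size }
           .min_by { |_offset, region_size| region_size }
end

free = [[0, 300], [500, 120], [900, 60]]
first_fit(free, 100)  # => [0, 300]   (first adequate region)
best_fit(free, 100)   # => [500, 120] (smallest adequate region)
```

First-fit leaves a 200-byte remainder in a large region; best-fit leaves a 20-byte sliver, illustrating how best-fit conserves large regions at the cost of tiny fragments.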

Hybrid Paging and Segmentation

Multics pioneered combining segmentation and paging. Each segment contains multiple pages, and each segment has its own page table. The segment table entry points to the segment's page table rather than directly to physical memory. This design provides segmentation's logical organization with paging's efficient memory management.

Translation first uses the segment selector to find the segment descriptor, validates the offset against the segment limit, then uses the segment's page table to translate the page number within the segment to a physical frame. The result combines segmentation's protection granularity with paging's allocation efficiency and eliminates external fragmentation.

Ruby Implementation

Ruby abstracts memory management details, operating above the paging and segmentation layer. The Ruby runtime allocates memory through the operating system's virtual memory API, which handles page table management and address translation transparently. Ruby developers typically need not consider paging directly, but understanding these mechanisms helps diagnose performance issues and memory behavior.

Ruby Memory Allocation

Ruby allocates memory in two primary ways: through the Ruby heap for Ruby objects and through system malloc for native extensions and large data. The Ruby heap consists of memory pages (distinct from OS pages) that the garbage collector manages. Each Ruby heap page typically occupies 16KB and contains slots for Ruby objects of various sizes.

# Ruby heap pages are internal to Ruby, but we can observe allocation behavior
class MemoryObserver
  def self.allocation_pattern
    before = GC.stat
    
    # Allocate many small objects
    objects = Array.new(10_000) { "x" * 100 }
    
    after = GC.stat
    
    {
      heap_pages: after[:heap_allocated_pages] - before[:heap_allocated_pages],
      objects: after[:total_allocated_objects] - before[:total_allocated_objects],
      slots: after[:heap_available_slots] - before[:heap_available_slots]
    }
  end
end

stats = MemoryObserver.allocation_pattern
# => e.g. {:heap_pages=>2, :objects=>10000, :slots=>1024} (values vary by Ruby version and GC state)

When Ruby requires more memory, it requests pages from the OS through system calls like mmap or sbrk. The OS allocates virtual memory pages, updating the process's page tables. These pages might not immediately map to physical frames - demand paging loads them into physical memory only when accessed. The first access to each new page triggers a minor page fault (a soft fault requiring no disk I/O, just frame allocation).

Observing Memory Behavior

Ruby provides tools to monitor memory usage and observe interactions with the OS memory management system. The GC.stat method returns detailed statistics about Ruby's memory usage, while /proc/self/status on Linux systems reveals OS-level memory information including virtual size, resident set size, and page fault counts.

require 'objspace'

class MemoryAnalyzer
  def self.memory_report
    gc_stats = GC.stat
    
    # Parse /proc/self/status for OS-level stats
    proc_status = File.read('/proc/self/status')
    vm_size = proc_status[/VmSize:\s+(\d+)/, 1].to_i
    vm_rss = proc_status[/VmRSS:\s+(\d+)/, 1].to_i
    vm_data = proc_status[/VmData:\s+(\d+)/, 1].to_i
    
    {
      virtual_memory_kb: vm_size,
      resident_memory_kb: vm_rss,
      data_segment_kb: vm_data,
      ruby_heap_pages: gc_stats[:heap_allocated_pages],
      ruby_heap_live_slots: gc_stats[:heap_live_slots],
      ruby_heap_free_slots: gc_stats[:heap_free_slots],
      major_gc_count: gc_stats[:major_gc_count],
      minor_gc_count: gc_stats[:minor_gc_count]
    }
  end
  
  def self.allocation_trace(seconds: 5)
    ObjectSpace.trace_object_allocations_start
    
    initial = memory_report
    sleep(seconds)
    final = memory_report
    
    ObjectSpace.trace_object_allocations_stop
    
    {
      virtual_growth_kb: final[:virtual_memory_kb] - initial[:virtual_memory_kb],
      resident_growth_kb: final[:resident_memory_kb] - initial[:resident_memory_kb],
      heap_page_growth: final[:ruby_heap_pages] - initial[:ruby_heap_pages],
      gc_major_runs: final[:major_gc_count] - initial[:major_gc_count],
      gc_minor_runs: final[:minor_gc_count] - initial[:minor_gc_count]
    }
  end
end

# Monitor memory behavior during operation
report = MemoryAnalyzer.memory_report
# => {:virtual_memory_kb=>45680, :resident_memory_kb=>23440, ...}

The difference between virtual memory (VmSize) and resident memory (VmRSS) reveals how much of the process's address space currently resides in physical RAM. Large disparities indicate that either the process allocated memory without touching it (demand paging has not yet loaded those pages) or the OS swapped pages to disk. The VmData field shows the size of the data segment, growing as Ruby allocates more heap memory.

Memory-Mapped Files

Ruby supports memory-mapped files through native extensions, creating a direct mapping between file contents and virtual memory. The OS maps file pages into the process's address space. Reading from mapped memory triggers page faults that load file data from disk into RAM. Writes to mapped memory mark pages dirty, and the OS flushes them back to the file.

# Memory-mapped I/O using fiddle (Ruby FFI)
require 'fiddle'
require 'fiddle/import'

module MMap
  extend Fiddle::Importer
  dlload Fiddle::Handle::DEFAULT
  
  # Constants for mmap
  PROT_READ = 1
  PROT_WRITE = 2
  MAP_SHARED = 1
  MAP_PRIVATE = 2
  
  extern 'void* mmap(void*, size_t, int, int, int, long)'  # off_t spelled as long for Fiddle's parser
  extern 'int munmap(void*, size_t)'
  extern 'int msync(void*, size_t, int)'
end

class MemoryMappedFile
  def initialize(filename, writable: false)
    @file = File.open(filename, writable ? 'r+' : 'r')
    @size = @file.size
    
    prot = MMap::PROT_READ
    prot |= MMap::PROT_WRITE if writable
    
    # Map file into memory
    @addr = MMap.mmap(nil, @size, prot, MMap::MAP_SHARED, 
                      @file.fileno, 0)
    
    # MAP_FAILED is (void *)-1, which Fiddle reports as an all-ones address
    map_failed = (1 << (8 * Fiddle::SIZEOF_VOIDP)) - 1
    raise "mmap failed" if @addr.null? || @addr.to_i == map_failed
  end
  
  def read(offset, length)
    ptr = Fiddle::Pointer.new(@addr.to_i + offset)
    ptr[0, length]
  end
  
  def write(offset, data)
    ptr = Fiddle::Pointer.new(@addr.to_i + offset)
    ptr[0, data.bytesize] = data
  end
  
  def sync
    MMap.msync(@addr, @size, 4)  # 4 = MS_SYNC on Linux
  end
  
  def close
    MMap.munmap(@addr, @size)
    @file.close
  end
end

# Usage demonstrates page-level I/O
mf = MemoryMappedFile.new('large_data.bin', writable: true)
data = mf.read(0, 1024)  # Triggers page fault if not in memory
mf.write(4096, "new data")  # Writes to mapped page, marked dirty
mf.sync  # Flushes dirty pages to disk
mf.close

Memory-mapped I/O benefits applications processing large files that don't fit in memory. The OS handles paging automatically, loading file portions as needed and evicting unused portions. This approach avoids explicit read and write calls while providing efficient random access.

Garbage Collection and Paging Interaction

Ruby's garbage collector scans the heap to identify live objects, accessing potentially all allocated memory. If the Ruby heap exceeds physical memory, GC scanning triggers page faults as it accesses pages swapped to disk. This behavior causes GC pauses to extend dramatically when memory pressure forces paging.

class GCPageFaultMonitor
  def self.gc_with_monitoring
    before_stats = read_stats
    
    GC.start
    
    after_stats = read_stats
    
    {
      duration_ms: (after_stats[:timestamp] - before_stats[:timestamp]) * 1000,
      minor_faults: after_stats[:minor_faults] - before_stats[:minor_faults],
      major_faults: after_stats[:major_faults] - before_stats[:major_faults],
      rss_change_kb: after_stats[:rss] - before_stats[:rss]
    }
  end
  
  def self.read_stats
    stat = File.read('/proc/self/stat').split
    status = File.read('/proc/self/status')
    
    {
      timestamp: Process.clock_gettime(Process::CLOCK_MONOTONIC),
      minor_faults: stat[9].to_i,
      major_faults: stat[11].to_i,
      rss: status[/VmRSS:\s+(\d+)/, 1].to_i
    }
  end
end

# Monitor GC behavior under memory pressure
results = GCPageFaultMonitor.gc_with_monitoring
# => {:duration_ms=>1250, :minor_faults=>15234, :major_faults=>142, ...}

Major page faults (those requiring disk I/O) during GC indicate severe memory pressure. Each major fault adds milliseconds to GC pause time. Applications experiencing this behavior need either more physical memory or reduced heap size to fit within available RAM. Ruby's GC tuning environment variables (RUBY_GC_HEAP_GROWTH_FACTOR, RUBY_GC_HEAP_INIT_SLOTS) help control heap size growth.

Performance Considerations

Translation Lookaside Buffer Impact

TLB misses impose significant overhead on memory-intensive applications. Each TLB miss requires a page table walk accessing multiple memory locations for multi-level page tables. With four-level paging, a TLB miss costs four memory accesses plus the original data access, potentially a 5× slowdown compared to a TLB hit.

Applications can improve TLB performance through data structure layout. Arranging frequently accessed data within the same pages reduces the working set of pages, keeping all mappings in the TLB. Padding structures to page boundaries wastes memory but guarantees each structure occupies its own pages, preventing false sharing and improving cache behavior.

# Demonstrating memory access pattern impact on TLB
class TLBBenchmark
  def self.sequential_access(array_size)
    data = Array.new(array_size) { |i| i }
    
    elapsed = Benchmark.measure do
      sum = 0
      data.each { |val| sum += val }
    end
    
    elapsed.real
  end
  
  def self.random_access(array_size)
    data = Array.new(array_size) { |i| i }
    indices = (0...array_size).to_a.shuffle
    
    elapsed = Benchmark.measure do
      sum = 0
      indices.each { |idx| sum += data[idx] }
    end
    
    elapsed.real
  end
end

require 'benchmark'

# Sequential access benefits from TLB hits
sequential_time = TLBBenchmark.sequential_access(1_000_000)
# => 0.042 seconds

# Random access causes more TLB misses
random_time = TLBBenchmark.random_access(1_000_000)
# => 0.156 seconds (3.7× slower)

The performance gap between sequential and random access grows with array size. Small arrays fit within TLB reach regardless of access pattern. Large arrays exceed TLB capacity, and random access prevents the prefetcher from hiding TLB miss latency.

Working Set and Memory Footprint

A process's working set comprises the pages it actively accesses. Operating systems attempt to keep each process's working set in physical memory to avoid thrashing. Ruby applications should minimize working set size by avoiding unnecessary object retention and using memory-efficient data structures.

Long-lived objects scattered across many pages expand the working set. Generational GC helps by segregating old objects into separate pages. When old objects rarely change, those pages stay unmodified, which keeps them shareable under copy-on-write in forking servers and means the OS can swap them out once without repeated writeback, rather than treating them as perpetually dirty.

class WorkingSetAnalysis
  def self.measure_working_set
    # Force GC to establish baseline
    GC.start
    
    baseline = current_memory_pages
    
    # Create working set
    working_data = Array.new(1000) { "x" * 1024 }
    
    # Measure active pages
    active_pages = current_memory_pages - baseline
    
    # Create non-working data
    unused_data = Array.new(10000) { "y" * 1024 }
    
    # Sleep to allow OS to deprioritize unused pages
    sleep(1)
    
    total_pages = current_memory_pages - baseline
    
    {
      active_pages: active_pages,
      total_pages: total_pages,
      working_set_ratio: active_pages.to_f / total_pages
    }
  end
  
  def self.current_memory_pages
    status = File.read('/proc/self/status')
    rss_kb = status[/VmRSS:\s+(\d+)/, 1].to_i
    rss_kb / 4  # Convert KB to 4KB pages
  end
end

Applications can reduce working set by releasing unused memory, using streaming algorithms that process data in passes rather than loading everything, and employing memory-mapped files for large datasets. These strategies keep the working set small relative to physical memory, ensuring the OS maintains pages in RAM.
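One streaming technique from the paragraph above can be sketched as follows: iterating a file line by line instead of slurping it keeps the working set bounded regardless of file size. The sample data and record format here are invented for illustration.

```ruby
require 'tempfile'

# Build a throwaway sample file (illustrative data only)
file = Tempfile.new('records')
10_000.times { |i| file.puts("record_#{i}") }
file.flush

# Slurping with File.read would pull the entire file into the working set.
# File.foreach holds one line at a time, so the resident pages stay few
# no matter how large the file grows.
count = 0
File.foreach(file.path) do |line|
  count += 1 if line.start_with?('record_')
end

puts count
file.close!
```

The same pattern applies to any pass-based algorithm: each pass touches a small window of pages, keeping the resident set well below the total data size.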

Page Fault Cost Analysis

Minor page faults (allocation without disk I/O) cost microseconds. Major page faults (loading from disk) cost milliseconds, a thousand-fold difference. Applications should avoid major faults by staying within physical memory limits or using explicit I/O to control loading.

Ruby applications allocating memory in bursts might trigger many minor faults simultaneously. The OS services these faults sequentially, and bulk allocation can pause execution for milliseconds. Preallocating memory or spreading allocations over time reduces pause duration.

class PageFaultProfiler
  def self.profile_allocation_pattern
    results = {}
    
    # Measure burst allocation
    results[:burst] = measure_faults do
      data = Array.new(100_000) { "x" * 1024 }
    end
    
    # Measure gradual allocation
    results[:gradual] = measure_faults do
      data = []
      100_000.times do
        data << "x" * 1024
        sleep(0.00001) if data.size % 1000 == 0
      end
    end
    
    # Measure preallocation
    results[:prealloc] = measure_faults do
      data = Array.new(100_000)
      data.map! { "x" * 1024 }
    end
    
    results
  end
  
  def self.measure_faults
    before = read_fault_stats
    start_time = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    
    yield
    
    end_time = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    after = read_fault_stats
    
    {
      duration_ms: (end_time - start_time) * 1000,
      minor_faults: after[:minor] - before[:minor],
      major_faults: after[:major] - before[:major]
    }
  end
  
  def self.read_fault_stats
    stat = File.read('/proc/self/stat').split
    { minor: stat[9].to_i, major: stat[11].to_i }
  end
end

Thrashing Prevention

Thrashing occurs when the system spends more time paging than executing. Each process needs enough physical memory for its working set, but insufficient memory forces constant page faults. The OS swaps pages in to satisfy faults, only to swap them out again to make room for other faults.

Ruby applications prevent thrashing by monitoring resident set size and adjusting behavior when memory pressure increases. Reducing cache sizes, releasing temporary data, or processing smaller batches keeps memory usage within limits. The alternative, allowing thrashing to continue, grinds performance to near zero.

class ThrashingMonitor
  THRASHING_THRESHOLD = 100  # Major faults per second
  
  def self.monitor_and_adapt
    previous = read_fault_stats
    previous_time = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    
    sleep(1)
    
    current = read_fault_stats
    current_time = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    
    interval = current_time - previous_time
    fault_rate = (current[:major] - previous[:major]) / interval
    
    if fault_rate > THRASHING_THRESHOLD
      {
        status: :thrashing,
        fault_rate: fault_rate,
        recommendation: :reduce_memory_usage
      }
    else
      {
        status: :normal,
        fault_rate: fault_rate
      }
    end
  end
  
  def self.read_fault_stats
    stat = File.read('/proc/self/stat').split
    { major: stat[11].to_i }
  end
  
  def self.reduce_cache_size(cache, reduction_factor: 0.5)
    # Evict the oldest entries; assumes an Array-like cache,
    # since Hash#shift takes no count argument
    remove_count = (cache.size * reduction_factor).to_i
    cache.shift(remove_count)
  end
end

# Adaptive memory management
status = ThrashingMonitor.monitor_and_adapt
if status[:status] == :thrashing
  ThrashingMonitor.reduce_cache_size(application_cache)
end

Common Pitfalls

Excessive Page Faults from Poor Locality

Applications accessing memory randomly rather than sequentially trigger more page faults and TLB misses. Ruby programs iterating through arrays or hashes in insertion order exhibit good spatial locality, keeping accessed data within fewer pages. Programs using hash maps with scattered key-value pairs or jumping between disconnected data structures suffer poor locality.

# Poor locality - scattered access pattern
class ScatteredCache
  def initialize(size)
    @data = {}
    size.times { |i| @data[rand(1_000_000)] = "value_#{i}" }
  end
  
  def access_pattern
    # Random access causes TLB misses and poor cache behavior
    @data.keys.sample(100).each { |key| @data[key] }
  end
end

# Good locality - sequential access pattern  
class SequentialCache
  def initialize(size)
    @data = Array.new(size) { |i| "value_#{i}" }
  end
  
  def access_pattern
    # Sequential iteration keeps pages in working set
    @data.first(100).each { |val| val }
  end
end

Ruby's hash implementation (since Ruby 2.4) uses open addressing, which provides better memory locality than the earlier chained buckets. Arrays naturally exhibit excellent locality since elements occupy contiguous memory. Prefer arrays over hashes when access patterns iterate through most elements.
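A rough way to observe the locality difference is to time full iterations over an Array and a Hash of the same size. Exact numbers vary by machine and Ruby version, so none are claimed here; this is a sketch:

```ruby
require 'benchmark'

n = 200_000
array = Array.new(n) { |i| i }
hash  = n.times.to_h { |i| [i, i] }

# Contiguous element storage: one cache line holds several elements
array_time = Benchmark.realtime do
  sum = 0
  array.each { |v| sum += v }
end

# Entries spread across the table and heap-allocated entry structures
hash_time = Benchmark.realtime do
  sum = 0
  hash.each_value { |v| sum += v }
end

puts format('array: %.4fs  hash: %.4fs', array_time, hash_time)
```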

Memory Fragmentation in Long-Running Processes

Ruby's memory allocator can fragment the heap as objects allocate and deallocate. Fragmentation prevents the OS from reclaiming memory even after Ruby frees objects, because Ruby heap pages remain partially occupied. As a result, the process's resident memory can grow over time even though the amount of live data stays roughly constant.

Ruby 2.7+ implements heap compaction to address fragmentation. The GC.compact method moves live objects to consolidate them into fewer pages, allowing Ruby to return empty pages to the OS. Applications should invoke compaction periodically during low-load periods.

class MemoryCompactionStrategy
  def self.compact_if_fragmented(fragmentation_threshold: 0.3)
    gc_stats = GC.stat
    
    total_slots = gc_stats[:heap_available_slots]
    free_slots = gc_stats[:heap_free_slots]
    
    # Free-slot ratio serves as a fragmentation proxy
    fragmentation = free_slots.to_f / total_slots
    
    if fragmentation > fragmentation_threshold
      before_pages = gc_stats[:heap_allocated_pages]
      
      GC.compact
      
      after_stats = GC.stat
      after_pages = after_stats[:heap_allocated_pages]
      
      {
        compacted: true,
        pages_freed: before_pages - after_pages,
        fragmentation_before: fragmentation,
        fragmentation_after: after_stats[:heap_free_slots].to_f / 
                            after_stats[:heap_available_slots]
      }
    else
      { compacted: false, fragmentation: fragmentation }
    end
  end
end

# Periodic compaction in background thread
Thread.new do
  loop do
    sleep(300)  # Every 5 minutes
    result = MemoryCompactionStrategy.compact_if_fragmented
    puts "Compaction: #{result}" if result[:compacted]
  end
end

Copy-on-Write Friendliness

Process forking in Ruby uses copy-on-write, initially sharing all memory pages between parent and child. Writing to shared pages triggers copies, increasing memory usage. Applications should preload as much data as possible before forking, keeping that data read-only afterward to maintain sharing.

Garbage collection itself can break sharing: the mark phase writes bookkeeping data, so even if application code never modifies objects, GC activity may dirty shared pages. Ruby 2.0 mitigated this by moving mark bits out of object slots into separate bitmaps (copy-on-write-friendly GC), but applications should still minimize object churn in parent processes before forking.

# Anti-pattern: writing after fork breaks sharing
preloaded_data = load_reference_data
pid = fork do
  # Modifying preloaded data forces page copies
  preloaded_data.each { |item| item[:accessed_at] = Time.now }
  process_requests
end

# Better: keep preloaded data immutable after fork
preloaded_data = load_reference_data.freeze  # note: freeze is shallow; freeze nested objects too
pid = fork do
  # Read-only access maintains page sharing
  process_requests(preloaded_data)
end

Monitoring memory before and after fork reveals sharing effectiveness. Note that VmRSS counts shared pages, so the child's RSS can look close to the parent's immediately after fork; per-page metrics such as Private_Dirty or PSS in /proc/<pid>/smaps give a truer picture. Rapid growth in private dirty memory indicates broken sharing from writes to previously shared pages.
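One way to observe sharing on Linux is to watch private dirty memory rather than raw RSS, since VmRSS counts shared pages. This sketch assumes a Linux host with /proc/self/smaps_rollup (kernel 4.14+):

```ruby
# Kilobytes of pages this process has privately dirtied (no longer shared)
def private_dirty_kb
  File.read('/proc/self/smaps_rollup')[/Private_Dirty:\s+(\d+)/, 1].to_i
end

shared = Array.new(50_000) { |i| format('item_%06d', i) }

pid = fork do
  before = private_dirty_kb
  shared.each { |s| s << '!' }   # writes force copy-on-write page copies
  after = private_dirty_kb
  # Private_Dirty should not shrink as shared pages get copied
  exit!(after >= before ? 0 : 1)
end

Process.wait(pid)
status = $?.exitstatus
puts "child exit status: #{status}"
```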

Mmap Misconceptions

Memory-mapped files seem like they load file contents into RAM, but they actually create virtual memory mappings without immediate physical memory allocation. Accessing mapped regions triggers page faults that load data on demand. Programs treating mmapped files as instantaneous memory access may encounter unexpected delays from page faults.

Additionally, memory-mapped writes don't immediately persist to disk. The OS flushes dirty pages to disk asynchronously. Applications requiring durability must call msync to force writes, adding explicit fsync-like overhead that defeats mmap's perceived simplicity.

# Incorrect assumption: mmap gives instant memory access
def process_large_file_wrong(filename)
  mf = MemoryMappedFile.new(filename)
  
  # First access triggers page faults - not instant
  1000.times do |i|
    data = mf.read(i * 1_000_000, 1024)
    process(data)
  end
  
  mf.close
end

# Correct: understand paging and explicitly sync when needed
def process_large_file_correct(filename)
  mf = MemoryMappedFile.new(filename, writable: true)
  
  # Expect first accesses to be slower due to page loading
  1000.times do |i|
    data = mf.read(i * 1_000_000, 1024)
    result = process(data)
    mf.write(i * 1_000_000, result)
  end
  
  # Explicitly flush critical data
  mf.sync
  mf.close
end

Swap Amplification

Enabling swap on systems running memory-intensive Ruby applications can backfire. When physical memory fills, the OS begins swapping pages to disk. Ruby's garbage collector must scan the entire heap, potentially triggering page faults for every swapped page. A GC cycle normally taking milliseconds extends to seconds or minutes, effectively freezing the application.

The solution involves either disabling swap entirely (forcing the OOM killer rather than thrashing) or setting vm.swappiness low so the kernel is reluctant to swap. Applications should monitor major page faults and shut down gracefully when thrashing begins rather than continuing in a degraded state.

class SwapDetector
  FAULT_THRESHOLD = 1000  # Major faults per GC cycle
  
  def self.safe_gc_with_monitoring
    before_faults = read_major_faults
    
    gc_start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    GC.start
    gc_end = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    
    after_faults = read_major_faults
    faults = after_faults - before_faults
    duration_ms = (gc_end - gc_start) * 1000
    
    if faults > FAULT_THRESHOLD
      warn "Excessive paging during GC: #{faults} major faults in #{duration_ms}ms"
      warn "Consider reducing heap size or adding physical memory"
      
      # Optionally: gracefully shut down rather than thrashing
      # exit(1)
    end
    
    { duration_ms: duration_ms, major_faults: faults }
  end
  
  def self.read_major_faults
    stat = File.read('/proc/self/stat').split
    stat[11].to_i
  end
end

Reference

Terminology

Term | Definition
---- | ----------
Virtual Memory | Abstraction providing each process an independent address space larger than physical RAM
Physical Memory | Actual RAM hardware storing data and instructions
Virtual Address | Address used by programs, translated by MMU to physical address
Physical Address | Address referencing actual RAM location
Page | Fixed-size block of virtual memory (typically 4KB)
Frame | Fixed-size block of physical memory matching page size
Page Table | Data structure mapping virtual pages to physical frames
Page Table Entry | Entry in page table containing frame number and control bits
Page Fault | Exception when accessing page not present in physical memory
Minor Fault | Page fault requiring frame allocation but no disk I/O
Major Fault | Page fault requiring loading data from disk
TLB | Translation Lookaside Buffer, a cache for address translations
Segment | Variable-sized block of memory representing logical program unit
Segment Table | Data structure mapping segments to physical memory regions
Working Set | Pages actively accessed by a process
Thrashing | Excessive paging causing system performance collapse
Demand Paging | Loading pages into memory only when accessed
Page Replacement | Evicting page from memory to make room for another page
Resident Set Size | Amount of process memory currently in physical RAM
Virtual Size | Total size of process virtual address space

Paging vs Segmentation

Aspect | Paging | Segmentation
------ | ------ | ------------
Block Size | Fixed size (4KB, 2MB, 1GB) | Variable size based on content
Address Components | Page number + Offset | Segment selector + Offset
Fragmentation | Internal only | External possible
Protection Granularity | Per page | Per segment
Sharing | Share pages between processes | Share segments between processes
Logical Organization | No correspondence to program structure | Matches program structure
Translation | Page table lookup | Segment table lookup
Hardware Support | Universal in modern CPUs | Limited in modern architectures
Memory Waste | Half page per region average | Varies with allocation pattern
Table Size | Proportional to address space | Proportional to segment count
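The internal-fragmentation entry above can be made concrete. A region always occupies a whole number of pages, so the waste is whatever is left in the last page, between zero and one page minus a byte, averaging roughly half a page per region. A quick sketch:

```ruby
PAGE_SIZE = 4096

# Bytes wasted in the last, partially filled page of a region
def internal_fragmentation(region_bytes, page_size: PAGE_SIZE)
  remainder = region_bytes % page_size
  remainder.zero? ? 0 : page_size - remainder
end

puts internal_fragmentation(10_000)  # => 2288 (10000 bytes occupy three 4KB pages)
puts internal_fragmentation(8_192)   # => 0 (exact multiple of the page size)
```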

Address Translation Components

Component | x86-64 | ARM64 | Purpose
--------- | ------ | ----- | -------
Page Size | 4KB, 2MB, 1GB | 4KB, 2MB, 1GB | Granularity of memory mapping
Virtual Address Width | 48 bits | 48 bits | Maximum addressable memory
Physical Address Width | 52 bits | 52 bits | Maximum RAM supported
Page Table Levels | 4 | 4 | Hierarchical translation depth
TLB Entries | 64-1536 | 32-512 | Translation cache capacity
Page Table Entry Size | 8 bytes | 8 bytes | Memory per mapping
Pages per Table | 512 | 512 | Fanout at each level

Page Table Entry Flags

Flag | Purpose | Effect When Set
---- | ------- | ---------------
Present | Indicates page in memory | Translation succeeds
Writable | Write permission | Writes allowed
User | User-mode access | User code can access
Accessed | Tracks access | Page recently accessed
Dirty | Tracks modification | Page modified since loaded
Global | TLB behavior | Mapping survives context switch
No Execute | Execution prevention | Code execution prohibited
Cache Disable | Caching control | Bypass CPU cache

Address Translation Formulas

Single-Level Paging:
  Virtual Address = Page Number | Offset
  Physical Address = Frame Number | Offset
  PTE Address = PTBR + (Page Number × Entry Size)

Multi-Level Paging (4-level):
  VA = PML4 Index | PDP Index | PD Index | PT Index | Offset
  Each index = 9 bits (512 entries per table)
  Offset = 12 bits (4096 bytes per page)

Segmentation:
  Logical Address = Segment Selector | Offset
  Physical Address = Segment Base + Offset
  Requires: Offset < Segment Limit

Translation Cost:
  TLB Hit Time = ~1 cycle
  TLB Miss Time = (Table Levels × Memory Latency) + TLB Hit Time
  4-level miss = (4 × 100 cycles) + 1 = ~401 cycles
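The 4-level split above can be checked in code. This sketch decomposes an arbitrary 48-bit x86-64 virtual address into its four 9-bit table indices and 12-bit offset (the sample address is made up):

```ruby
# Split a 48-bit virtual address per the 9+9+9+9+12 layout above
def decompose(va)
  {
    pml4:   (va >> 39) & 0x1FF,  # bits 39-47
    pdp:    (va >> 30) & 0x1FF,  # bits 30-38
    pd:     (va >> 21) & 0x1FF,  # bits 21-29
    pt:     (va >> 12) & 0x1FF,  # bits 12-20
    offset: va & 0xFFF           # bits 0-11
  }
end

parts = decompose(0x7F1234567ABC)
puts parts.inspect
# offset is 0xABC (2748); each index selects one of 512 table entries
```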

Performance Metrics

Metric | Typical Value | Impact
------ | ------------- | ------
TLB Hit Rate | 99%+ | Critical for performance
TLB Miss Penalty | 400+ cycles | Significant when frequent
Minor Fault Time | 1-10 μs | Negligible for occasional faults
Major Fault Time | 1-10 ms | Severe when frequent
Page Size | 4KB | Balance internal fragmentation vs TLB reach
TLB Reach | 256KB-2MB | Working set should fit in reach
Page Walk Time | 400-1000 cycles | Cost of TLB miss
Context Switch Cost | 1000-10000 cycles | Includes TLB flush overhead
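The TLB reach figure above follows directly from entry count and page size: reach = entries × page size. A one-line sketch:

```ruby
# Address range the TLB can cover without a miss
def tlb_reach(entries, page_size)
  entries * page_size
end

puts tlb_reach(64, 4096)    # => 262144 (256KB, a small L1 TLB)
puts tlb_reach(512, 4096)   # => 2097152 (2MB, a larger L2 TLB)
```

Larger pages extend reach with the same entry count, which is why huge pages help TLB-bound workloads.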

Ruby Memory Configuration

Environment Variable | Default | Purpose
-------------------- | ------- | -------
RUBY_GC_HEAP_INIT_SLOTS | 10000 | Initial heap size
RUBY_GC_HEAP_GROWTH_FACTOR | 1.8 | Heap growth rate
RUBY_GC_HEAP_GROWTH_MAX_SLOTS | 0 | Maximum heap growth per GC
RUBY_GC_HEAP_FREE_SLOTS | 4096 | Free slots to maintain
RUBY_GC_MALLOC_LIMIT | 16MB | Malloc threshold for GC
RUBY_GC_OLDMALLOC_LIMIT | 16MB | Old object malloc limit

Linux Memory Metrics

File | Metric | Meaning
---- | ------ | -------
/proc/self/status | VmSize | Total virtual memory
/proc/self/status | VmRSS | Resident set size
/proc/self/status | VmData | Data segment size
/proc/self/status | VmStk | Stack size
/proc/self/stat | minflt | Minor page faults
/proc/self/stat | majflt | Major page faults
/proc/meminfo | MemAvailable | Available physical memory
/proc/meminfo | SwapFree | Available swap space