Overview
Paging and segmentation represent two fundamental approaches to memory management in operating systems. Both mechanisms implement virtual memory by translating logical addresses used by programs into physical addresses in RAM. Virtual memory creates an abstraction layer between the address space a program sees and the actual physical memory hardware, providing benefits including process isolation, memory protection, and the ability to run programs larger than available physical memory.
Paging divides both virtual and physical memory into fixed-size blocks. Virtual memory splits into pages while physical memory divides into frames of equal size. The operating system maintains page tables that map virtual pages to physical frames. When a program accesses memory, the Memory Management Unit (MMU) translates the virtual address to a physical address using the page table. If the requested page resides in physical memory, the access proceeds immediately. If not, a page fault occurs and the OS loads the page from disk.
Segmentation takes a different approach by dividing memory into variable-sized segments based on logical program structure. Each segment represents a logical unit such as code, data, stack, or heap. Programs reference memory using a segment selector and an offset within that segment. The segment table stores base addresses and limits for each segment, enabling the MMU to translate segment-relative addresses to physical addresses while enforcing access controls.
Modern operating systems typically combine both approaches. Intel x86-64 architecture, for instance, uses segmentation for backwards compatibility but implements true memory management primarily through paging with multi-level page tables. This hybrid model provides the protection benefits of segmentation with the efficient memory management of paging.
Virtual Address Space (Paging)
┌─────────────┐
│   Page 0    │
├─────────────┤           Page Table           Physical Memory
│   Page 1    │          ┌──────────┐         ┌─────────────┐
├─────────────┤    ┌────→│ Frame 0  │────────→│   Frame 0   │
│   Page 2    │────┘     ├──────────┤         ├─────────────┤
├─────────────┤    ┌────→│ Frame 2  │──┐      │   Frame 1   │
│   Page 3    │────┘     └──────────┘  │      ├─────────────┤
└─────────────┘                        └─────→│   Frame 2   │
                                              └─────────────┘
Virtual Address Space (Segmentation)
┌─────────────┐
│    Code     │────┐      Segment Table
├─────────────┤    │     ┌─────────────────┐
│    Data     │    └────→│ Base:  0x10000  │
├─────────────┤          │ Limit: 4096     │
│    Stack    │────┐     ├─────────────────┤
├─────────────┤    └────→│ Base:  0x50000  │
│    Heap     │          │ Limit: 8192     │
└─────────────┘          └─────────────────┘
Key Principles
Virtual memory operates on the principle that programs need not occupy contiguous physical memory. The operating system and hardware collaborate to maintain the illusion of a continuous address space while actual data resides in scattered physical locations or even on disk storage.
Address Translation in Paging
A virtual address in a paging system consists of two components: a page number and an offset within that page. For a 32-bit address space with 4KB pages (2^12 bytes), the upper 20 bits identify the page number while the lower 12 bits specify the offset. The MMU uses the page number as an index into the page table, retrieves the corresponding frame number, and concatenates it with the offset to form the physical address.
Virtual Address: | Page Number (20 bits) | Offset (12 bits) |
Page Table Lookup:
Page Number → Page Table Entry → Frame Number
Physical Address: | Frame Number | Offset |
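The split-and-concatenate step above translates almost directly into code. A minimal Ruby sketch, where a hash with made-up mappings stands in for the page table:

```ruby
PAGE_SIZE_BITS = 12                 # 4KB pages
PAGE_SIZE      = 1 << PAGE_SIZE_BITS

# Toy page table: virtual page number => physical frame number
PAGE_TABLE = { 0 => 5, 1 => 2, 2 => 7 }

def translate(virtual_address)
  page_number = virtual_address >> PAGE_SIZE_BITS   # upper bits
  offset      = virtual_address & (PAGE_SIZE - 1)   # lower 12 bits
  frame = PAGE_TABLE[page_number]
  raise "page fault: page #{page_number}" if frame.nil?
  (frame << PAGE_SIZE_BITS) | offset                # concatenate
end

translate(0x1234)   # page 1, offset 0x234 -> frame 2 -> 0x2234
```

A missing page table entry raises here; in a real system the MMU traps to the OS page fault handler instead.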
Page tables reside in memory and contain one entry per page in the virtual address space. Each page table entry (PTE) stores the frame number along with control bits indicating whether the page is present in memory, whether it has been modified (dirty bit), whether it has been accessed (reference bit), and protection information (read/write/execute permissions).
Multi-Level Paging
Single-level page tables become impractical for large address spaces. A 64-bit address space with 4KB pages would require a page table with 2^52 entries, consuming petabytes of memory per process. Multi-level paging solves this by organizing page tables hierarchically. The virtual address splits into multiple fields, each indexing a level in the page table hierarchy.
64-bit Virtual Address with 4-level paging:
| PML4 (9 bits) | Directory Ptr (9 bits) | Directory (9 bits) | Table (9 bits) | Offset (12 bits) |
Translation Process:
1. Use PML4 bits to index into PML4 table
2. Retrieved entry points to Page Directory Pointer table
3. Use Directory Ptr bits to index into that table
4. Retrieved entry points to Page Directory table
5. Use Directory bits to index into that table
6. Retrieved entry points to Page Table
7. Use Table bits to index into Page Table
8. Retrieved entry contains Frame Number
9. Concatenate Frame Number with Offset
This hierarchical structure allocates page tables only for memory regions actually in use. A process using 10MB of memory requires only a handful of page tables rather than a complete page table covering the entire 64-bit address space.
Translation Lookaside Buffer
Page table lookups require memory accesses, potentially multiple accesses for multi-level tables. To mitigate this overhead, processors include a Translation Lookaside Buffer (TLB), a small cache storing recent virtual-to-physical address mappings. The TLB operates as a fully associative cache, checking all entries in parallel for a matching virtual page number. A TLB hit completes address translation without accessing memory. A TLB miss requires a page table walk, and the resulting mapping populates the TLB, potentially evicting an old entry.
TLB reach represents the amount of memory covered by TLB entries. With 64 TLB entries and 4KB pages, the TLB covers 256KB. Programs with good spatial locality keep most memory accesses within the TLB reach, achieving near-zero translation overhead. Programs accessing memory scattered across a large address space suffer frequent TLB misses.
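The hit/miss bookkeeping can be modeled with a toy cache; FIFO eviction stands in here for the hardware's replacement policy, which real TLBs implement differently:

```ruby
# Minimal TLB model: a small fixed-capacity cache of page => frame
# mappings with FIFO eviction (Ruby hashes preserve insertion order).
class TLB
  attr_reader :hits, :misses

  def initialize(capacity)
    @capacity = capacity
    @entries  = {}
    @hits = @misses = 0
  end

  # Returns the cached frame number, or nil on a miss.
  def lookup(page_number)
    if @entries.key?(page_number)
      @hits += 1
      @entries[page_number]
    else
      @misses += 1
      nil
    end
  end

  # After a page table walk, cache the mapping, evicting the oldest
  # entry when the TLB is full.
  def insert(page_number, frame_number)
    @entries.delete(@entries.keys.first) if @entries.size >= @capacity
    @entries[page_number] = frame_number
  end
end

tlb = TLB.new(2)
tlb.insert(0, 5)
tlb.lookup(0)   # hit, returns 5
tlb.lookup(3)   # miss, returns nil
```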
Segmentation Fundamentals
Segmentation divides memory based on logical program structure rather than arbitrary fixed-size blocks. A program might have five segments: code, initialized data, uninitialized data, heap, and stack. Each segment grows or shrinks independently and has distinct access permissions. The code segment permits execution but prohibits writes, while data segments allow reads and writes but prohibit execution.
Virtual addresses in segmentation contain a segment selector and an offset. The segment selector indexes into a segment table, retrieving a segment descriptor that contains the base address, limit (size), and access rights. The MMU adds the offset to the base address to produce the physical address, but first validates that the offset does not exceed the limit, raising a segmentation fault if it does.
Logical Address: | Segment Selector | Offset |
Segment Table Entry:
- Base Address: Physical memory start
- Limit: Segment size
- Access Rights: R/W/X permissions
- Present bit: In memory or swapped
Translation:
Physical Address = Base Address + Offset (if Offset < Limit)
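The base-plus-offset check is simple enough to sketch directly. The base and limit values below are hypothetical, chosen to match the earlier diagram:

```ruby
SegmentDescriptor = Struct.new(:base, :limit, :rights)

# Toy segment table with made-up descriptors
SEGMENT_TABLE = {
  code:  SegmentDescriptor.new(0x10000, 4096, :rx),
  stack: SegmentDescriptor.new(0x50000, 8192, :rw)
}

def seg_translate(selector, offset)
  desc = SEGMENT_TABLE.fetch(selector)
  # Validate the offset against the limit before adding the base
  raise "segmentation fault" if offset >= desc.limit
  desc.base + offset
end

seg_translate(:code, 0x100)   # => 0x10100 (0x10000 + 0x100)
```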
Page Faults and Demand Paging
Operating systems implement demand paging, loading pages into memory only when accessed. Initially, page table entries mark pages as not present. The first access triggers a page fault exception. The OS page fault handler locates the page on disk, finds a free frame (or evicts an existing page), loads the page data, updates the page table entry to mark it present with the correct frame number, and resumes the program. The program remains unaware that execution paused while the OS performed disk I/O.
Page replacement algorithms determine which page to evict when physical memory fills. Common algorithms include Least Recently Used (LRU), which evicts the page accessed furthest in the past, and Clock algorithm, an approximation of LRU using reference bits. The OS must write modified pages back to disk before reusing their frames, while unmodified pages can be discarded since an unchanged copy exists on disk.
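The Clock algorithm described above can be sketched as a circular sweep over frames carrying reference bits (a simplified model; real implementations track dirty bits and frame metadata as well):

```ruby
class ClockReplacer
  def initialize(pages)
    @frames = pages.map { |p| { page: p, referenced: false } }
    @hand = 0
  end

  # A memory access sets the page's reference bit.
  def access(page)
    entry = @frames.find { |f| f[:page] == page }
    entry[:referenced] = true if entry
  end

  # Sweep the hand: clear set bits (giving those pages a second
  # chance) and evict the first page whose bit is already clear.
  def evict
    loop do
      entry = @frames[@hand]
      if entry[:referenced]
        entry[:referenced] = false
        @hand = (@hand + 1) % @frames.size
      else
        @hand = (@hand + 1) % @frames.size
        return entry[:page]
      end
    end
  end
end

clock = ClockReplacer.new([1, 2, 3])
clock.access(1)
clock.access(2)
clock.evict   # pages 1 and 2 get second chances; page 3 is evicted
```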
Segmentation with Paging
Combined approaches use segmentation for logical organization and paging for physical memory management. Each segment contains multiple pages, and the system maintains a page table for each segment. The logical address contains a segment selector, page number within segment, and offset within page. Translation first uses the segment selector to find the segment's page table, then performs normal paging translation using that table.
Combined Address: | Segment | Page | Offset |
                       ↓        ↓       ↓
                   Segment    Page    Physical
                    Table     Table    Memory
This combination provides segmentation's logical protection with paging's efficient memory utilization. Different segments can have different page table structures, allowing the code segment to share pages across processes (for shared libraries) while keeping data segments private.
Design Considerations
Paging Advantages
Paging eliminates external fragmentation since any frame can hold any page. All frames have identical size, so any free frame satisfies a page allocation request. This property simplifies memory allocation and allows the OS to use all available physical memory. Internal fragmentation still occurs - the last page of each memory region may be partially unused - but averages only half a page per region, typically 2KB with 4KB pages.
Paging enables efficient memory sharing between processes. Multiple page tables can map different virtual pages to the same physical frame. Shared libraries exploit this by having one copy in physical memory with every process mapping pages to those frames. Copy-on-write fork implementations initially share all pages between parent and child, marking them read-only. When either process writes to a page, the OS copies it to a new frame and updates the writing process's page table.
Page-based protection operates at coarse granularity. Each page has uniform permissions across its entire 4KB span. Programs cannot protect individual variables or small structures; the smallest protected unit remains a full page. This limitation rarely matters for code and large data structures but affects fine-grained security mechanisms.
Segmentation Advantages
Segmentation aligns with logical program structure, making protection and sharing more intuitive. The code segment naturally has execute-only permission while the stack has read-write permission. Each segment grows independently - the stack grows downward from high addresses while the heap grows upward from low addresses, and they never interfere because they occupy separate segments.
Variable-sized segments eliminate internal fragmentation. A 100-byte data structure consumes exactly 100 bytes in its segment rather than occupying a full 4KB page. For programs with many small objects, this efficiency gain becomes significant. However, external fragmentation emerges as segments of different sizes allocate and deallocate, leaving memory gaps too small for subsequent allocations.
Segment-based sharing operates at coarser granularity than page-based sharing. Two processes can share an entire segment (a complete library) but cannot share individual functions within that library unless they occupy their own segment. This limitation makes segmentation less suitable for fine-grained sharing scenarios.
Hybrid Architecture Trade-offs
Intel x86-64 combines segmentation and paging but makes segmentation mostly vestigial. The architecture maintains segment registers and segment descriptor tables for backwards compatibility, but 64-bit mode essentially disables segmentation by forcing all segments to span the entire address space. True memory management occurs through four-level paging. This design provides paging's allocation efficiency while preserving segment-based privilege levels for security.
ARM architectures historically used pure paging without segmentation, achieving simpler hardware with equivalent functionality. Modern ARM systems implement paging with additional protection mechanisms through memory domains and privilege levels, providing security without segmentation overhead.
The choice between approaches depends on hardware constraints and software requirements. Embedded systems with limited memory might prefer simpler single-level paging. Server systems handling multiple processes benefit from multi-level paging's memory efficiency. Real-time systems might avoid paging entirely or use locked pages to guarantee deterministic memory access latency.
Page Size Selection
Operating systems must choose page sizes that balance competing concerns. Larger pages reduce page table size - 2MB pages need 512 times fewer page table entries than 4KB pages for the same virtual address space. Fewer page tables consume less memory and increase TLB reach. A TLB with 64 entries covers 128MB with 2MB pages but only 256KB with 4KB pages.
However, larger pages waste more memory through internal fragmentation and reduce page fault locality. A program accessing one byte triggers loading an entire page. With 2MB pages, accessing a single array element loads 2MB, potentially evicting 512 smaller pages that the program actively uses. This trade-off pushes most systems toward smaller pages for general use.
Many architectures support multiple page sizes simultaneously. Linux on x86-64 uses 4KB pages by default but supports 2MB huge pages and 1GB giant pages. Applications explicitly request huge pages for large memory regions like database buffer pools, gaining TLB reach benefits without imposing huge page overhead on all memory.
Implementation Approaches
Single-Level Paging
The straightforward paging implementation maintains one page table per process. Each page table contains one entry per page in the virtual address space. For 32-bit systems with 4KB pages, each process needs a page table with 2^20 entries (1 million entries). With 4-byte page table entries, each page table consumes 4MB. This approach works for 32-bit systems where address spaces remain manageable.
The OS stores the page table base register (PTBR) in each process control block. During context switches, the OS loads the new process's PTBR into the MMU, instantly switching to that process's address space. The MMU adds the virtual page number multiplied by entry size to the PTBR to locate the correct page table entry.
Single-Level Translation:
1. Extract page number from virtual address
2. Calculate PTE address: PTBR + (page_number * entry_size)
3. Read PTE from calculated address
4. Extract frame number from PTE
5. Calculate physical address: (frame_number * page_size) + offset
Single-level paging becomes impractical for 64-bit systems. Even with sparse address space usage, the OS must allocate the full page table for each process or implement complex mechanisms to handle missing page table regions. The memory overhead and management complexity drive adoption of multi-level schemes.
Hierarchical Paging
Multi-level paging organizes page tables as a tree structure. The topmost level always resides in memory. Lower levels allocate on demand as programs use memory in their corresponding regions. A process using 1GB of memory in a 64-bit space might require only the top-level table (covering the entire address space) plus a handful of lower-level tables (covering actually used regions).
x86-64 uses four-level paging with 512-entry tables at each level. The virtual address splits into five components: four 9-bit indices for the four page table levels plus a 12-bit offset. Each page table occupies exactly one page (512 entries × 8 bytes per entry = 4096 bytes), simplifying page table management.
# Conceptual representation of multi-level address translation
class PageTableEntry
attr_accessor :frame_number, :present, :writable, :executable
def initialize(frame_number, present: true, writable: true, executable: false)
@frame_number = frame_number
@present = present
@writable = writable
@executable = executable
end
end
class PageTable
def initialize
@entries = Array.new(512) # 512 entries per table
end
def [](index)
@entries[index]
end
def []=(index, entry)
@entries[index] = entry
end
end
# Simulated 4-level page table walk. Intermediate tables are looked up
# through a registry that maps a frame number to the table stored in it.
FRAME_TABLES = {}
def get_table(frame_number)
  FRAME_TABLES.fetch(frame_number)
end
def translate_address(virtual_addr, pml4_table)
  # Extract indices from 48-bit virtual address
  pml4_idx = (virtual_addr >> 39) & 0x1FF  # bits 39-47
  pdp_idx  = (virtual_addr >> 30) & 0x1FF  # bits 30-38
  pd_idx   = (virtual_addr >> 21) & 0x1FF  # bits 21-29
  pt_idx   = (virtual_addr >> 12) & 0x1FF  # bits 12-20
  offset   = virtual_addr & 0xFFF          # bits 0-11
  # Level 1: PML4
  pml4_entry = pml4_table[pml4_idx]
  return nil unless pml4_entry&.present
  # Level 2: Page Directory Pointer
  pdp_table = get_table(pml4_entry.frame_number)
  pdp_entry = pdp_table[pdp_idx]
  return nil unless pdp_entry&.present
  # Level 3: Page Directory
  pd_table = get_table(pdp_entry.frame_number)
  pd_entry = pd_table[pd_idx]
  return nil unless pd_entry&.present
  # Level 4: Page Table
  pt_table = get_table(pd_entry.frame_number)
  pt_entry = pt_table[pt_idx]
  return nil unless pt_entry&.present
  # Concatenate frame number with offset to form the physical address
  (pt_entry.frame_number << 12) | offset
end
The memory savings from hierarchical paging prove dramatic. A process using 1GB might need 1 PML4 table, 1 PDP table, 1 PD table, and 512 page tables (each page table maps 2MB, so 1GB / 2MB = 512), totaling about 2MB of page table overhead compared to the theoretical petabytes of a full single-level table.
Inverted Page Tables
Traditional page tables scale with virtual address space size, consuming significant memory in systems with many processes or large address spaces. Inverted page tables flip this by maintaining one entry per physical frame rather than per virtual page. The table size scales with physical memory, remaining constant regardless of virtual address space size or number of processes.
Each inverted page table entry stores the process ID and virtual page number occupying that frame. Address translation requires searching the table for an entry matching the current process and requested virtual page. Linear search becomes prohibitively slow, so implementations use hash tables indexed by (process_id, virtual_page_number) pairs. Hash collisions chain entries together.
Inverted Page Table Entry:
- Process ID
- Virtual Page Number
- Protection Bits
- Next Pointer (for hash chain)
Translation:
1. Hash (PID, VPN) to get table index
2. Search chain starting at that index
3. If found, the index of the matching entry is the frame number
4. If not found, page fault
Inverted page tables benefit systems with large virtual spaces and limited physical memory, such as IBM PowerPC systems. They consume fixed memory regardless of process count or address space size. However, searching takes longer than indexed lookup, and sharing pages between processes requires multiple entries pointing to the same frame.
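The hash-and-chain lookup can be sketched as follows. The bucket count of 64 and Ruby's built-in `hash` method stand in for a real hash scheme; the key property is that the matching entry's index is itself the frame number:

```ruby
# Inverted page table sketch: one entry per physical frame, with a
# hash of (pid, vpn) selecting a bucket of candidate frame numbers.
class InvertedPageTable
  Entry = Struct.new(:pid, :vpn)
  BUCKETS = 64

  def initialize(num_frames)
    @frames  = Array.new(num_frames)          # frame number => Entry
    @buckets = Hash.new { |h, k| h[k] = [] }  # bucket => frame numbers
  end

  def map(pid, vpn, frame)
    @frames[frame] = Entry.new(pid, vpn)
    @buckets[[pid, vpn].hash % BUCKETS] << frame
  end

  # Search the chain for an entry matching (pid, vpn); returns the
  # frame number, or nil (a page fault) when no entry matches.
  def translate(pid, vpn)
    @buckets[[pid, vpn].hash % BUCKETS].find do |frame|
      e = @frames[frame]
      e && e.pid == pid && e.vpn == vpn
    end
  end
end

ipt = InvertedPageTable.new(8)
ipt.map(42, 0x1A, 3)
ipt.translate(42, 0x1A)   # => 3
ipt.translate(99, 0x1A)   # => nil (different process, same page)
```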
Segmentation Schemes
Pure segmentation maintains a segment table per process. Each entry contains a base address, limit, and access rights. Translation adds the offset to the base address after validating it against the limit. The simplicity enables fast translation but suffers from external fragmentation as variable-sized segments allocate and deallocate.
Memory compaction addresses fragmentation by relocating segments to consolidate free space. The OS pauses affected processes, copies segment contents to new locations, updates segment table entries with new base addresses, and resumes execution. This process costs significant time and becomes impractical with large segments or frequent allocation changes.
Best-fit and first-fit algorithms attempt to minimize fragmentation during allocation. Best-fit searches for the smallest free region accommodating the requested size, minimizing wasted space but requiring full free list traversal. First-fit uses the first adequate region, executing faster but potentially leaving small unusable fragments. Buddy system allocation restricts sizes to powers of two, enabling fast allocation and deallocation with bounded fragmentation.
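The two search strategies can be compared in a few lines over a toy free list of [start, size] holes:

```ruby
# First-fit: the first hole large enough wins.
def first_fit(free_list, size)
  free_list.find { |_, hole_size| hole_size >= size }
end

# Best-fit: the smallest hole large enough wins (full traversal).
def best_fit(free_list, size)
  free_list.select { |_, hole_size| hole_size >= size }
           .min_by { |_, hole_size| hole_size }
end

free = [[0, 100], [200, 30], [300, 60]]
first_fit(free, 50)  # => [0, 100]  (first adequate hole)
best_fit(free, 50)   # => [300, 60] (smallest adequate hole)
```

First-fit stops early but may carve a small fragment out of a large hole; best-fit preserves large holes at the cost of scanning the whole list.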
Hybrid Paging and Segmentation
Multics pioneered combining segmentation and paging. Each segment contains multiple pages, and each segment has its own page table. The segment table entry points to the segment's page table rather than directly to physical memory. This design provides segmentation's logical organization with paging's efficient memory management.
Translation first uses the segment selector to find the segment descriptor, validates the offset against the segment limit, then uses the segment's page table to translate the page number within the segment to a physical frame. The result combines segmentation's protection granularity with paging's allocation efficiency and eliminates external fragmentation.
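The two-stage translation can be sketched with a hypothetical segment table whose entries point at per-segment page tables (4KB pages assumed; all values are made up for illustration):

```ruby
# Each segment descriptor carries a limit and that segment's own
# page table (page number within segment => physical frame number).
SEGMENTS = {
  code: { limit: 0x4000, page_table: { 0 => 8, 1 => 3 } }
}

def hybrid_translate(selector, offset)
  seg = SEGMENTS.fetch(selector)
  # Stage 1: segment limit check
  raise "segmentation fault" if offset >= seg[:limit]
  # Stage 2: normal paging translation within the segment
  page  = offset >> 12
  frame = seg[:page_table].fetch(page) { raise "page fault" }
  (frame << 12) | (offset & 0xFFF)
end

hybrid_translate(:code, 0x1234)  # page 1 -> frame 3 -> 0x3234
```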
Ruby Implementation
Ruby abstracts memory management details, operating above the paging and segmentation layer. The Ruby runtime allocates memory through the operating system's virtual memory API, which handles page table management and address translation transparently. Ruby developers typically need not consider paging directly, but understanding these mechanisms helps diagnose performance issues and memory behavior.
Ruby Memory Allocation
Ruby allocates memory in two primary ways: through the Ruby heap for Ruby objects and through system malloc for native extensions and large data. The Ruby heap consists of memory pages (distinct from OS pages) that the garbage collector manages. Each Ruby heap page typically occupies 16KB and contains slots for Ruby objects of various sizes.
# Ruby heap pages are internal to Ruby, but we can observe allocation behavior
class MemoryObserver
def self.allocation_pattern
before = GC.stat
# Allocate many small objects
objects = Array.new(10_000) { "x" * 100 }
after = GC.stat
{
heap_pages: after[:heap_allocated_pages] - before[:heap_allocated_pages],
objects: after[:total_allocated_objects] - before[:total_allocated_objects],
slots: after[:heap_available_slots] - before[:heap_available_slots]
}
end
end
stats = MemoryObserver.allocation_pattern
# => {:heap_pages=>2, :objects=>10000, :slots=>1024}
When Ruby requires more memory, it requests pages from the OS through system calls like mmap or sbrk. The OS allocates virtual memory pages, updating the process's page tables. These pages might not immediately map to physical frames - demand paging loads them into physical memory only when accessed. The first access to each new page triggers a minor page fault (a soft fault requiring no disk I/O, just frame allocation).
Observing Memory Behavior
Ruby provides tools to monitor memory usage and observe interactions with the OS memory management system. The GC.stat method returns detailed statistics about Ruby's memory usage, while /proc/self/status on Linux systems reveals OS-level memory information including virtual size, resident set size, and page fault counts.
require 'objspace'
class MemoryAnalyzer
def self.memory_report
gc_stats = GC.stat
# Parse /proc/self/status for OS-level stats
proc_status = File.read('/proc/self/status')
vm_size = proc_status[/VmSize:\s+(\d+)/, 1].to_i
vm_rss = proc_status[/VmRSS:\s+(\d+)/, 1].to_i
vm_data = proc_status[/VmData:\s+(\d+)/, 1].to_i
{
virtual_memory_kb: vm_size,
resident_memory_kb: vm_rss,
data_segment_kb: vm_data,
ruby_heap_pages: gc_stats[:heap_allocated_pages],
ruby_heap_live_slots: gc_stats[:heap_live_slots],
ruby_heap_free_slots: gc_stats[:heap_free_slots],
major_gc_count: gc_stats[:major_gc_count],
minor_gc_count: gc_stats[:minor_gc_count]
}
end
def self.allocation_trace(seconds: 5)
ObjectSpace.trace_object_allocations_start
initial = memory_report
sleep(seconds)
final = memory_report
ObjectSpace.trace_object_allocations_stop
{
virtual_growth_kb: final[:virtual_memory_kb] - initial[:virtual_memory_kb],
resident_growth_kb: final[:resident_memory_kb] - initial[:resident_memory_kb],
heap_page_growth: final[:ruby_heap_pages] - initial[:ruby_heap_pages],
gc_major_runs: final[:major_gc_count] - initial[:major_gc_count],
gc_minor_runs: final[:minor_gc_count] - initial[:minor_gc_count]
}
end
end
# Monitor memory behavior during operation
report = MemoryAnalyzer.memory_report
# => {:virtual_memory_kb=>45680, :resident_memory_kb=>23440, ...}
The difference between virtual memory (VmSize) and resident memory (VmRSS) reveals how much of the process's address space currently resides in physical RAM. Large disparities indicate that either the process allocated memory without touching it (demand paging has not yet loaded those pages) or the OS swapped pages to disk. The VmData field shows the size of the data segment, growing as Ruby allocates more heap memory.
Memory-Mapped Files
Ruby supports memory-mapped files through native extensions, creating a direct mapping between file contents and virtual memory. The OS maps file pages into the process's address space. Reading from mapped memory triggers page faults that load file data from disk into RAM. Writes to mapped memory mark pages dirty, and the OS flushes them back to the file.
# Memory-mapped I/O using fiddle (Ruby FFI)
require 'fiddle'
require 'fiddle/import'
module MMap
extend Fiddle::Importer
dlload Fiddle::Handle::DEFAULT
# Constants for mmap
PROT_READ = 1
PROT_WRITE = 2
MAP_SHARED = 1
MAP_PRIVATE = 2
extern 'void* mmap(void*, size_t, int, int, int, long)' # off_t is a long on LP64 platforms
extern 'int munmap(void*, size_t)'
extern 'int msync(void*, size_t, int)'
end
class MemoryMappedFile
def initialize(filename, writable: false)
@file = File.open(filename, writable ? 'r+' : 'r')
@size = @file.size
prot = MMap::PROT_READ
prot |= MMap::PROT_WRITE if writable
# Map file into memory
@addr = MMap.mmap(nil, @size, prot, MMap::MAP_SHARED,
@file.fileno, 0)
raise "mmap failed" if @addr.to_i == -1
end
def read(offset, length)
ptr = Fiddle::Pointer.new(@addr.to_i + offset)
ptr[0, length]
end
def write(offset, data)
ptr = Fiddle::Pointer.new(@addr.to_i + offset)
ptr[0, data.bytesize] = data
end
def sync
MMap.msync(@addr, @size, 0)
end
def close
MMap.munmap(@addr, @size)
@file.close
end
end
# Usage demonstrates page-level I/O
mf = MemoryMappedFile.new('large_data.bin', writable: true)
data = mf.read(0, 1024) # Triggers page fault if not in memory
mf.write(4096, "new data") # Writes to mapped page, marked dirty
mf.sync # Flushes dirty pages to disk
mf.close
Memory-mapped I/O benefits applications processing large files that don't fit in memory. The OS handles paging automatically, loading file portions as needed and evicting unused portions. This approach avoids explicit read and write calls while providing efficient random access.
Garbage Collection and Paging Interaction
Ruby's garbage collector scans the heap to identify live objects, accessing potentially all allocated memory. If the Ruby heap exceeds physical memory, GC scanning triggers page faults as it accesses pages swapped to disk. This behavior causes GC pauses to extend dramatically when memory pressure forces paging.
class GCPageFaultMonitor
def self.gc_with_monitoring
before_stats = read_stats
GC.start
after_stats = read_stats
{
duration_ms: (after_stats[:timestamp] - before_stats[:timestamp]) * 1000,
minor_faults: after_stats[:minor_faults] - before_stats[:minor_faults],
major_faults: after_stats[:major_faults] - before_stats[:major_faults],
rss_change_kb: after_stats[:rss] - before_stats[:rss]
}
end
def self.read_stats
stat = File.read('/proc/self/stat').split
status = File.read('/proc/self/status')
{
timestamp: Process.clock_gettime(Process::CLOCK_MONOTONIC),
minor_faults: stat[9].to_i,
major_faults: stat[11].to_i,
rss: status[/VmRSS:\s+(\d+)/, 1].to_i
}
end
end
# Monitor GC behavior under memory pressure
results = GCPageFaultMonitor.gc_with_monitoring
# => {:duration_ms=>1250, :minor_faults=>15234, :major_faults=>142, ...}
Major page faults (those requiring disk I/O) during GC indicate severe memory pressure. Each major fault adds milliseconds to GC pause time. Applications experiencing this behavior need either more physical memory or reduced heap size to fit within available RAM. Ruby's GC tuning environment variables (RUBY_GC_HEAP_GROWTH_FACTOR, RUBY_GC_HEAP_INIT_SLOTS) help control heap size growth.
Performance Considerations
Translation Lookaside Buffer Impact
TLB misses impose significant overhead on memory-intensive applications. Each TLB miss requires a page table walk accessing multiple memory locations for multi-level page tables. With four-level paging, a TLB miss costs four memory accesses plus the original data access, potentially a 5× slowdown compared to a TLB hit.
Applications can improve TLB performance through data structure layout. Arranging frequently accessed data within the same pages reduces the working set of pages, keeping all mappings in the TLB. Padding structures to page boundaries wastes memory but guarantees each structure occupies its own pages, preventing false sharing and improving cache behavior.
# Demonstrating memory access pattern impact on TLB
class TLBBenchmark
def self.sequential_access(array_size)
data = Array.new(array_size) { |i| i }
elapsed = Benchmark.measure do
sum = 0
data.each { |val| sum += val }
end
elapsed.real
end
def self.random_access(array_size)
data = Array.new(array_size) { |i| i }
indices = (0...array_size).to_a.shuffle
elapsed = Benchmark.measure do
sum = 0
indices.each { |idx| sum += data[idx] }
end
elapsed.real
end
end
require 'benchmark'
# Sequential access benefits from TLB hits
sequential_time = TLBBenchmark.sequential_access(1_000_000)
# => 0.042 seconds
# Random access causes more TLB misses
random_time = TLBBenchmark.random_access(1_000_000)
# => 0.156 seconds (3.7× slower)
The performance gap between sequential and random access grows with array size. Small arrays fit within TLB reach regardless of access pattern. Large arrays exceed TLB capacity, and random access prevents the prefetcher from hiding TLB miss latency.
Working Set and Memory Footprint
A process's working set comprises the pages it actively accesses. Operating systems attempt to keep each process's working set in physical memory to avoid thrashing. Ruby applications should minimize working set size by avoiding unnecessary object retention and using memory-efficient data structures.
Long-lived objects scattered across many pages expand the working set. Generational GC helps by segregating old objects into separate pages. When old objects rarely change, those pages remain clean and the OS can discard them under memory pressure, reloading from the Ruby binary rather than swapping to disk.
class WorkingSetAnalysis
def self.measure_working_set
# Force GC to establish baseline
GC.start
baseline = current_memory_pages
# Create working set
working_data = Array.new(1000) { "x" * 1024 }
# Measure active pages
active_pages = current_memory_pages - baseline
# Create non-working data
unused_data = Array.new(10000) { "y" * 1024 }
# Sleep to allow OS to deprioritize unused pages
sleep(1)
total_pages = current_memory_pages - baseline
{
active_pages: active_pages,
total_pages: total_pages,
working_set_ratio: active_pages.to_f / total_pages
}
end
def self.current_memory_pages
status = File.read('/proc/self/status')
rss_kb = status[/VmRSS:\s+(\d+)/, 1].to_i
rss_kb / 4 # Convert KB to 4KB pages
end
end
Applications can reduce working set by releasing unused memory, using streaming algorithms that process data in passes rather than loading everything, and employing memory-mapped files for large datasets. These strategies keep the working set small relative to physical memory, ensuring the OS maintains pages in RAM.
Page Fault Cost Analysis
Minor page faults (allocation without disk I/O) cost microseconds. Major page faults (loading from disk) cost milliseconds, a thousand-fold difference. Applications should avoid major faults by staying within physical memory limits or using explicit I/O to control loading.
Ruby applications allocating memory in bursts might trigger many minor faults simultaneously. The OS services these faults sequentially, and bulk allocation can pause execution for milliseconds. Preallocating memory or spreading allocations over time reduces pause duration.
class PageFaultProfiler
  def self.profile_allocation_pattern
    results = {}

    # Measure burst allocation
    results[:burst] = measure_faults do
      data = Array.new(100_000) { "x" * 1024 }
    end

    # Measure gradual allocation
    results[:gradual] = measure_faults do
      data = []
      100_000.times do
        data << "x" * 1024
        sleep(0.00001) if data.size % 1000 == 0
      end
    end

    # Measure preallocation
    results[:prealloc] = measure_faults do
      data = Array.new(100_000)
      data.map! { "x" * 1024 }
    end

    results
  end

  def self.measure_faults
    before = read_fault_stats
    start_time = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    yield
    end_time = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    after = read_fault_stats

    {
      duration_ms: (end_time - start_time) * 1000,
      minor_faults: after[:minor] - before[:minor],
      major_faults: after[:major] - before[:major]
    }
  end

  def self.read_fault_stats
    stat = File.read('/proc/self/stat').split
    { minor: stat[9].to_i, major: stat[11].to_i }
  end
end
Thrashing Prevention
Thrashing occurs when the system spends more time paging than executing. Each process needs enough physical memory for its working set, but insufficient memory forces constant page faults. The OS swaps pages in to satisfy faults, only to swap them out again to make room for other faults.
Ruby applications prevent thrashing by monitoring resident set size and adjusting behavior when memory pressure increases. Reducing cache sizes, releasing temporary data, or processing smaller batches keeps memory usage within limits. The alternative - allowing thrashing - grinds performance to near-zero.
class ThrashingMonitor
  THRASHING_THRESHOLD = 100 # Major faults per second

  def self.monitor_and_adapt
    previous = read_fault_stats
    previous_time = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    sleep(1)
    current = read_fault_stats
    current_time = Process.clock_gettime(Process::CLOCK_MONOTONIC)

    interval = current_time - previous_time
    fault_rate = (current[:major] - previous[:major]) / interval

    if fault_rate > THRASHING_THRESHOLD
      {
        status: :thrashing,
        fault_rate: fault_rate,
        recommendation: :reduce_memory_usage
      }
    else
      {
        status: :normal,
        fault_rate: fault_rate
      }
    end
  end

  def self.read_fault_stats
    stat = File.read('/proc/self/stat').split
    { major: stat[11].to_i }
  end

  def self.reduce_cache_size(cache, reduction_factor: 0.5)
    # Evict the oldest entries from an array-backed cache;
    # with the default factor, half the entries are removed
    remove_count = (cache.size * (1 - reduction_factor)).to_i
    cache.shift(remove_count)
  end
end

# Adaptive memory management (application_cache: an array-backed cache)
status = ThrashingMonitor.monitor_and_adapt
if status[:status] == :thrashing
  ThrashingMonitor.reduce_cache_size(application_cache)
end
Common Pitfalls
Excessive Page Faults from Poor Locality
Applications accessing memory randomly rather than sequentially trigger more page faults and TLB misses. Ruby programs iterating through arrays or hashes in insertion order exhibit good spatial locality, keeping accessed data within fewer pages. Programs using hash maps with scattered key-value pairs or jumping between disconnected data structures suffer poor locality.
# Poor locality - scattered access pattern
class ScatteredCache
  def initialize(size)
    @data = {}
    size.times { |i| @data[rand(1_000_000)] = "value_#{i}" }
  end

  def access_pattern
    # Random access causes TLB misses and poor cache behavior
    @data.keys.sample(100).each { |key| @data[key] }
  end
end

# Good locality - sequential access pattern
class SequentialCache
  def initialize(size)
    @data = Array.new(size) { |i| "value_#{i}" }
  end

  def access_pattern
    # Sequential iteration keeps pages in working set
    @data.first(100).each { |val| val }
  end
end
Ruby's hash implementation uses open addressing, which provides better memory locality than chaining. Arrays naturally exhibit excellent locality since elements occupy contiguous memory. Prefer arrays over hashes when access patterns iterate through most elements.
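The relative cost of iterating the two structures can be observed with a rough micro-benchmark. This is a sketch; absolute timings vary by machine and Ruby version, and the comparison only suggests a trend, it does not isolate locality effects.

```ruby
require 'benchmark'

N = 100_000
array = Array.new(N) { |i| i }
hash  = N.times.to_h { |i| [i, i] }

# Time a full pass over each structure; array elements are contiguous,
# hash entries carry extra per-entry bookkeeping.
array_sum = hash_sum = nil
array_time = Benchmark.realtime { array_sum = array.sum }
hash_time  = Benchmark.realtime { hash_sum = hash.each_value.sum }

puts format('array: %.4fs  hash: %.4fs', array_time, hash_time)
```

On most builds the array pass is faster, consistent with its contiguous layout, though interpreter overhead dominates at small sizes.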
Memory Fragmentation in Long-Running Processes
Ruby's memory allocator can fragment the heap as objects allocate and deallocate. Fragmentation prevents Ruby from returning memory to the OS even after objects are freed, because partially occupied heap pages cannot be released. Resident memory therefore stays high even as the amount of live data shrinks.
Ruby 2.7+ implements heap compaction to address fragmentation. The GC.compact method moves live objects to consolidate them into fewer pages, allowing Ruby to return empty pages to the OS. Applications should invoke compaction periodically during low-load periods.
class MemoryCompactionStrategy
  def self.compact_if_fragmented(fragmentation_threshold: 0.3)
    gc_stats = GC.stat
    total_slots = gc_stats[:heap_available_slots]
    free_slots = gc_stats[:heap_free_slots]

    # Calculate fragmentation ratio (fraction of slots sitting empty)
    fragmentation = free_slots.to_f / total_slots

    if fragmentation > fragmentation_threshold
      before_pages = gc_stats[:heap_allocated_pages]
      GC.compact
      after_stats = GC.stat
      after_pages = after_stats[:heap_allocated_pages]

      {
        compacted: true,
        pages_freed: before_pages - after_pages,
        fragmentation_before: fragmentation,
        fragmentation_after: after_stats[:heap_free_slots].to_f /
                             after_stats[:heap_available_slots]
      }
    else
      { compacted: false, fragmentation: fragmentation }
    end
  end
end

# Periodic compaction in background thread
Thread.new do
  loop do
    sleep(300) # Every 5 minutes
    result = MemoryCompactionStrategy.compact_if_fragmented
    puts "Compaction: #{result}" if result[:compacted]
  end
end
Copy-on-Write Friendliness
Process forking in Ruby uses copy-on-write, initially sharing all memory pages between parent and child. Writing to shared pages triggers copies, increasing memory usage. Applications should preload as much data as possible before forking, keeping that data read-only afterward to maintain sharing.
Ruby's GC itself can break sharing: marking an object writes to its header, dirtying the page even when application code never touches the object. Ruby 2.0 addressed this by moving mark bits out of object slots into separate bitmaps (copy-on-write-friendly GC), and Ruby 2.1 added generational GC, but applications should still minimize object churn in parent processes before forking.
# Anti-pattern: writing after fork breaks sharing
preloaded_data = load_reference_data
pid = fork do
  # Modifying preloaded data forces page copies
  preloaded_data.each { |item| item[:accessed_at] = Time.now }
  process_requests
end

# Better: keep preloaded data immutable after fork
# (freeze is shallow - deep-freeze nested structures for full protection)
preloaded_data = load_reference_data.freeze
pid = fork do
  # Read-only access maintains page sharing
  process_requests(preloaded_data)
end
Monitoring memory before and after fork reveals sharing effectiveness. Note that RSS counts shared pages in every process that maps them, so a child's RSS looks large immediately after fork; proportional and private figures (Pss and Private_Dirty in /proc/&lt;pid&gt;/smaps) are more informative. Rapid growth in the child's private dirty memory indicates broken sharing from writes to previously shared pages.
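A minimal sketch for reading these figures from /proc/self/smaps_rollup (available on Linux 4.14+); `cow_sharing_stats` is an illustrative name, and the method returns nil where the file does not exist.

```ruby
# Pss charges shared pages fractionally across the processes mapping them;
# Private_Dirty counts pages this process alone has written (CoW copies).
def cow_sharing_stats
  rollup = File.read('/proc/self/smaps_rollup')
  {
    pss_kb:           rollup[/^Pss:\s+(\d+)/, 1].to_i,
    private_dirty_kb: rollup[/^Private_Dirty:\s+(\d+)/, 1].to_i
  }
rescue Errno::ENOENT
  nil # not Linux, or kernel predates smaps_rollup
end

stats = cow_sharing_stats
```

Sampling this in a forked child right after fork, and again under load, shows how much copy-on-write sharing has been lost to writes.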
Mmap Misconceptions
Memory-mapped files seem like they load file contents into RAM, but they actually create virtual memory mappings without immediate physical memory allocation. Accessing mapped regions triggers page faults that load data on demand. Programs treating mmapped files as instantaneous memory access may encounter unexpected delays from page faults.
Additionally, memory-mapped writes don't immediately persist to disk. The OS flushes dirty pages to disk asynchronously. Applications requiring durability must call msync to force writes, adding explicit fsync-like overhead that defeats mmap's perceived simplicity.
# NOTE: MemoryMappedFile is a hypothetical wrapper around the mmap
# syscall; Ruby's standard library provides no memory-mapping class.

# Incorrect assumption: mmap gives instant memory access
def process_large_file_wrong(filename)
  mf = MemoryMappedFile.new(filename)
  # First access triggers page faults - not instant
  1000.times do |i|
    data = mf.read(i * 1_000_000, 1024)
    process(data)
  end
  mf.close
  # File might not be updated yet - writes buffered in page cache
end

# Correct: understand paging and explicitly sync when needed
def process_large_file_correct(filename)
  mf = MemoryMappedFile.new(filename, writable: true)
  # Expect first accesses to be slower due to page loading
  1000.times do |i|
    data = mf.read(i * 1_000_000, 1024)
    result = process(data)
    mf.write(i * 1_000_000, result)
  end
  # Explicitly flush critical data (msync under the hood)
  mf.sync
  mf.close
end
Swap Amplification
Enabling swap on systems running memory-intensive Ruby applications can backfire. When physical memory fills, the OS begins swapping pages to disk. Ruby's garbage collector must scan the entire heap, potentially triggering page faults for every swapped page. A GC cycle normally taking milliseconds extends to seconds or minutes, effectively freezing the application.
The solution involves either disabling swap entirely (accepting the OOM killer rather than thrashing) or setting vm.swappiness low so the kernel is reluctant to swap. Applications should monitor major page faults and shut down gracefully when thrashing begins rather than continuing in a degraded state.
class SwapDetector
  FAULT_THRESHOLD = 1000 # Major faults per GC cycle

  def self.safe_gc_with_monitoring
    before_faults = read_major_faults
    gc_start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    GC.start
    gc_end = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    after_faults = read_major_faults

    faults = after_faults - before_faults
    duration_ms = (gc_end - gc_start) * 1000

    if faults > FAULT_THRESHOLD
      warn "Excessive paging during GC: #{faults} major faults in #{duration_ms}ms"
      warn "Consider reducing heap size or adding physical memory"
      # Optionally: gracefully shut down rather than thrashing
      # exit(1)
    end

    { duration_ms: duration_ms, major_faults: faults }
  end

  def self.read_major_faults
    stat = File.read('/proc/self/stat').split
    stat[11].to_i
  end
end
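Alongside fault monitoring, the swap configuration itself can be checked from /proc. A sketch, assuming Linux-specific paths; `swap_status` is an illustrative name and the method returns nil where the files are unavailable.

```ruby
# Read the kernel's swap eagerness and the remaining swap space.
def swap_status
  swappiness = Integer(File.read('/proc/sys/vm/swappiness').strip)
  meminfo    = File.read('/proc/meminfo')
  {
    swappiness:   swappiness,
    swap_free_kb: meminfo[/^SwapFree:\s+(\d+)/, 1].to_i
  }
rescue Errno::ENOENT, Errno::EACCES
  nil
end

config = swap_status
```

A low swappiness together with shrinking SwapFree suggests the system is already under enough pressure that GC pauses will start paging.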
Reference
Terminology
| Term | Definition |
|---|---|
| Virtual Memory | Abstraction providing each process an independent address space larger than physical RAM |
| Physical Memory | Actual RAM hardware storing data and instructions |
| Virtual Address | Address used by programs, translated by MMU to physical address |
| Physical Address | Address referencing actual RAM location |
| Page | Fixed-size block of virtual memory (typically 4KB) |
| Frame | Fixed-size block of physical memory matching page size |
| Page Table | Data structure mapping virtual pages to physical frames |
| Page Table Entry | Entry in page table containing frame number and control bits |
| Page Fault | Exception when accessing page not present in physical memory |
| Minor Fault | Page fault requiring frame allocation but no disk I/O |
| Major Fault | Page fault requiring loading data from disk |
| TLB | Translation Lookaside Buffer - cache for address translations |
| Segment | Variable-sized block of memory representing logical program unit |
| Segment Table | Data structure mapping segments to physical memory regions |
| Working Set | Pages actively accessed by a process |
| Thrashing | Excessive paging causing system performance collapse |
| Demand Paging | Loading pages into memory only when accessed |
| Page Replacement | Evicting page from memory to make room for another page |
| Resident Set Size | Amount of process memory currently in physical RAM |
| Virtual Size | Total size of process virtual address space |
Paging vs Segmentation
| Aspect | Paging | Segmentation |
|---|---|---|
| Block Size | Fixed size (4KB, 2MB, 1GB) | Variable size based on content |
| Address Components | Page number + Offset | Segment selector + Offset |
| Fragmentation | Internal only | External possible |
| Protection Granularity | Per page | Per segment |
| Sharing | Share pages between processes | Share segments between processes |
| Logical Organization | No correspondence to program structure | Matches program structure |
| Translation | Page table lookup | Segment table lookup |
| Hardware Support | Universal in modern CPUs | Limited in modern architectures |
| Memory Waste | Half page per region average | Varies with allocation pattern |
| Table Size | Proportional to address space | Proportional to segment count |
Address Translation Components
| Component | x86-64 | ARM64 | Purpose |
|---|---|---|---|
| Page Size | 4KB, 2MB, 1GB | 4KB, 2MB, 1GB | Granularity of memory mapping |
| Virtual Address Width | 48 bits | 48 bits | Maximum addressable memory |
| Physical Address Width | 52 bits | 52 bits | Maximum RAM supported |
| Page Table Levels | 4 | 4 | Hierarchical translation depth |
| TLB Entries | 64-1536 | 32-512 | Translation cache capacity |
| Page Table Entry Size | 8 bytes | 8 bytes | Memory per mapping |
| Pages per Table | 512 | 512 | Fanout at each level |
Page Table Entry Flags
| Flag | Purpose | Effect When Set |
|---|---|---|
| Present | Indicates page in memory | Translation succeeds |
| Writable | Write permission | Writes allowed |
| User | User-mode access | User code can access |
| Accessed | Tracks access | Page recently accessed |
| Dirty | Tracks modification | Page modified since loaded |
| Global | TLB behavior | Mapping survives context switch |
| No Execute | Execution prevention | Code execution prohibited |
| Cache Disable | Caching control | Bypass CPU cache |
Address Translation Formulas
Single-Level Paging:
Virtual Address = Page Number | Offset
Physical Address = Frame Number | Offset
PTE Address = PTBR + (Page Number × Entry Size)
Multi-Level Paging (4-level):
VA = PML4 Index | PDP Index | PD Index | PT Index | Offset
Each index = 9 bits (512 entries per table)
Offset = 12 bits (4096 bytes per page)
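The 4-level split above can be carried out directly with shifts and masks; the address used here is an arbitrary example value.

```ruby
# Decompose a 48-bit x86-64 virtual address into four 9-bit
# page-table indices and the 12-bit page offset.
def decompose_va(va)
  {
    pml4:   (va >> 39) & 0x1FF, # bits 47-39
    pdp:    (va >> 30) & 0x1FF, # bits 38-30
    pd:     (va >> 21) & 0x1FF, # bits 29-21
    pt:     (va >> 12) & 0x1FF, # bits 20-12
    offset: va & 0xFFF          # bits 11-0
  }
end

parts = decompose_va(0x00007F1234567ABC)
```

Recombining the five fields with the inverse shifts reproduces the original address, confirming the decomposition is lossless.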
Segmentation:
Logical Address = Segment Selector | Offset
Physical Address = Segment Base + Offset
Requires: Offset < Segment Limit
Translation Cost:
TLB Hit Time = ~1 cycle
TLB Miss Time = (Table Levels × Memory Latency) + TLB Hit Time
4-level miss = (4 × 100 cycles) + 1 = ~401 cycles
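The hit and miss costs combine into an effective access time as a weighted average; a small worked example using the cycle estimates above:

```ruby
# Effective memory access time in cycles: hits cost ~1 cycle,
# misses pay the full 4-level page walk (~401 cycles).
def effective_access_cycles(hit_rate, hit_cycles: 1.0, miss_cycles: 401.0)
  hit_rate * hit_cycles + (1 - hit_rate) * miss_cycles
end

avg = effective_access_cycles(0.99) # 0.99 * 1 + 0.01 * 401, about 5 cycles
```

Even at a 99% hit rate the average translation cost is five times the hit cost, which is why TLB reach matters so much for large working sets.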
Performance Metrics
| Metric | Typical Value | Impact |
|---|---|---|
| TLB Hit Rate | 99%+ | Critical for performance |
| TLB Miss Penalty | 400+ cycles | Significant when frequent |
| Minor Fault Time | 1-10 μs | Negligible for occasional faults |
| Major Fault Time | 1-10 ms | Severe when frequent |
| Page Size | 4KB | Balance internal fragmentation vs TLB reach |
| TLB Reach | 256KB-2MB | Working set should fit in reach |
| Page Walk Time | 400-1000 cycles | Cost of TLB miss |
| Context Switch Cost | 1000-10000 cycles | Includes TLB flush overhead |
Ruby Memory Configuration
| Environment Variable | Default | Purpose |
|---|---|---|
| RUBY_GC_HEAP_INIT_SLOTS | 10000 | Initial heap size |
| RUBY_GC_HEAP_GROWTH_FACTOR | 1.8 | Heap growth rate |
| RUBY_GC_HEAP_GROWTH_MAX_SLOTS | 0 | Maximum heap growth per GC |
| RUBY_GC_HEAP_FREE_SLOTS | 4096 | Free slots to maintain |
| RUBY_GC_MALLOC_LIMIT | 16MB | Malloc threshold for GC |
| RUBY_GC_OLDMALLOC_LIMIT | 16MB | Old object malloc limit |
Linux Memory Metrics
| File | Metric | Meaning |
|---|---|---|
| /proc/self/status | VmSize | Total virtual memory |
| /proc/self/status | VmRSS | Resident set size |
| /proc/self/status | VmData | Data segment size |
| /proc/self/status | VmStk | Stack size |
| /proc/self/stat | minflt | Minor page faults |
| /proc/self/stat | majflt | Major page faults |
| /proc/meminfo | MemAvailable | Available physical memory |
| /proc/meminfo | SwapFree | Available swap space |