Overview
I/O scheduling operates at the operating system level, managing the queue of read and write requests to storage devices. When multiple processes issue I/O operations, the OS cannot simply service them in the order received. Physical disk characteristics—particularly the mechanical movement of read/write heads—create opportunities for significant optimization.
The scheduler sits between the file system and the block device driver. Applications issue I/O requests through system calls, the file system translates these into block-level operations, and the I/O scheduler reorders these operations before passing them to the device driver. This reordering can reduce total service time by 50% or more compared to naive first-come-first-served ordering.
Consider a scenario where three processes request disk blocks at cylinders 10, 100, and 15, with the disk head currently at cylinder 50. Servicing these in order requires head movement of 40 + 90 + 85 = 215 cylinders. An intelligent scheduler might reorder this to 10, 15, 100, requiring only 40 + 5 + 85 = 130 cylinders of movement—a 40% improvement.
I/O scheduling became critical with the rise of multiprogramming systems in the 1960s. Early time-sharing systems experienced severe performance degradation when multiple users accessed disk simultaneously. The introduction of scheduling algorithms transformed disk I/O from a sequential bottleneck into a resource that could support dozens of concurrent users.
Modern systems face different challenges. Solid-state drives eliminated mechanical seek time, rendering traditional scheduling algorithms obsolete for SSDs. However, HDDs remain prevalent in data centers for bulk storage, making I/O scheduling relevant for database servers, file servers, and any application dealing with large datasets on spinning disks.
# I/O request pattern that benefits from scheduling
File.open('large_dataset.bin', 'rb') do |file|
# Random access pattern - scheduler optimizes physical disk access
offsets = [1000000, 50000, 900000, 100000]
offsets.each do |offset|
file.seek(offset)
data = file.read(4096)
# Process data
end
end
Key Principles
I/O scheduling algorithms optimize disk access by understanding physical disk geometry. A hard disk contains platters spinning at constant angular velocity, typically 5400-15000 RPM. Read/write heads move radially across the platter surface on an actuator arm. Data organizes into concentric tracks, with each track divided into sectors.
Seek Time represents the duration required to move the disk head from its current track to the target track. This mechanical operation dominates I/O latency on HDDs, typically ranging from 3-15 milliseconds. Seek time increases roughly linearly with distance, though not perfectly due to acceleration and deceleration phases.
Rotational Latency occurs after the head reaches the correct track. The head must wait for the target sector to rotate under it. On a 7200 RPM disk, average rotational latency is 4.2 milliseconds (half a rotation). This component remains constant regardless of request ordering.
Transfer Time represents the actual data read or write duration once the head positions correctly. For a 4KB block on a modern HDD, transfer time typically measures 0.1-0.3 milliseconds—negligible compared to seek and rotational delays.
The scheduler's primary objective minimizes total seek time across all pending requests. Secondary goals include fairness (preventing request starvation), bounded response time (avoiding indefinite delays), and throughput maximization. These goals often conflict. A scheduler optimizing purely for seek time might indefinitely postpone requests to distant tracks if requests keep arriving for nearby tracks.
Request merging forms another critical principle. When multiple requests target adjacent or overlapping disk blocks, the scheduler merges them into a single operation. This eliminates redundant head movements and reduces the number of operations the device driver must process.
# Sequential access benefits from request merging
File.open('logfile.txt', 'a') do |file|
1000.times do |i|
# These writes may be merged by the OS buffer cache
# and I/O scheduler into fewer disk operations
file.puts "Log entry #{i}"
end
# Flush explicitly if timing matters
file.flush
end
The request queue maintains pending I/O operations. When an application issues a read or write, the kernel adds this request to the queue rather than immediately dispatching it to the device. The scheduler periodically examines the queue, selects the next request according to its algorithm, and submits it to the device driver. Queue depth affects scheduling effectiveness—deeper queues provide more optimization opportunities but increase latency for individual requests.
Starvation prevention mechanisms ensure that no request waits indefinitely. Pure seek-time optimization algorithms can starve requests to distant disk regions. Schedulers implement timeouts, aging mechanisms, or fairness quotas to guarantee service within bounded time.
Modern schedulers distinguish between read and write operations. Reads block the requesting process, directly impacting application response time. Writes can proceed asynchronously, with the calling process continuing immediately after data reaches the buffer cache. This asymmetry leads schedulers to prioritize reads over writes, improving perceived system responsiveness at the cost of write throughput.
# Read operations block, write operations return immediately
start = Time.now
File.write('test.txt', 'data' * 1000000)
write_time = Time.now - start
# Write returns quickly - data still in cache
start = Time.now
data = File.read('test.txt')
read_time = Time.now - start
# Read might wait for physical disk access
puts "Write: #{write_time}s, Read: #{read_time}s"
# Typically write_time < read_time for large files
Implementation Approaches
First-Come, First-Served (FCFS) represents the simplest scheduling algorithm. Requests service in arrival order without reordering. This guarantees fairness and prevents starvation, but provides no seek optimization. If requests arrive for cylinders 50, 10, 90, 20, the head moves from current position to 50, then 10 (40 cylinders back), then 90 (80 forward), then 20 (70 back)—extensive back-and-forth movement.
FCFS performs acceptably under light load when few requests compete. With only one or two outstanding operations, sophisticated scheduling provides minimal benefit. However, under moderate or heavy load, FCFS performs poorly. The lack of optimization means worst-case and average-case performance converge, with throughput often 50-70% lower than optimizing schedulers.
Shortest Seek Time First (SSTF) selects the pending request closest to the current head position. If the head rests at cylinder 50 with pending requests at cylinders 20, 55, and 90, SSTF chooses cylinder 55 (distance 5), then either 20 or 90 depending on subsequent arrivals. This greedy approach minimizes seek distance for each individual operation.
SSTF dramatically improves throughput over FCFS, typically achieving 2-3x better performance under load. However, it suffers from starvation problems. If requests continuously arrive for cylinders near the current head position, requests for distant cylinders may wait indefinitely. A request for cylinder 10 might never service if requests keep arriving for cylinders 40-60 while the head services that region.
# Simulating SSTF behavior
class SSTFScheduler
def initialize(start_position)
@current_position = start_position
@queue = []
end
def add_request(cylinder)
@queue << cylinder
end
def schedule_next
return nil if @queue.empty?
# Find closest cylinder to current position
next_request = @queue.min_by { |cyl| (cyl - @current_position).abs }
@queue.delete_at(@queue.index(next_request))
distance = (next_request - @current_position).abs
@current_position = next_request
{ cylinder: next_request, seek_distance: distance }
end
end
scheduler = SSTFScheduler.new(50)
[20, 90, 55, 30, 85].each { |cyl| scheduler.add_request(cyl) }
total_distance = 0
5.times do
result = scheduler.schedule_next
puts "Service cylinder #{result[:cylinder]}, seek: #{result[:seek_distance]}"
total_distance += result[:seek_distance]
end
puts "Total seek distance: #{total_distance}"
# Output: 55, 85, 90, 30, 20 - total distance much less than FCFS
SCAN (Elevator Algorithm) moves the disk head in one direction, servicing all requests in that direction until reaching the disk end, then reverses direction. This mimics elevator behavior. Starting at cylinder 50 moving upward with requests at 20, 55, 90, 30, SCAN services 55 and 90, reaches the top, then reverses to service 30 and 20.
SCAN provides better fairness than SSTF while maintaining good throughput. Requests wait at most two full disk traversals. The algorithm prevents starvation since every request eventually falls in the head's current direction of travel. However, SCAN exhibits non-uniform wait times. Requests just behind the head's current direction wait nearly two full traversals, while requests in the current direction service almost immediately.
The algorithm also wastes movement. If the head moves to cylinder 199 (highest cylinder) but the furthest request sits at cylinder 180, the head traverses 19 cylinders unnecessarily. This overhead reduces as disk utilization increases, since requests arrive continuously during traversal.
C-SCAN (Circular SCAN) addresses SCAN's wait time non-uniformity. The head moves in one direction servicing requests, but upon reaching the end, it immediately returns to the beginning without servicing requests on the return trip. This treats the disk as a circular structure, providing more uniform wait times across all cylinder positions.
C-SCAN simplifies timing analysis. Maximum wait time equals one full disk traversal plus the time to service intervening requests. This predictability benefits real-time systems or applications requiring bounded I/O latency. However, C-SCAN wastes the return trip, reducing throughput compared to SCAN under heavy load.
# Simulating SCAN behavior
class SCANScheduler
def initialize(start_position, max_cylinder = 199)
@current_position = start_position
@max_cylinder = max_cylinder
@direction = :up # :up or :down
@queue = []
end
def add_request(cylinder)
@queue << cylinder unless @queue.include?(cylinder)
end
def schedule_next
return nil if @queue.empty?
# Get requests in current direction
if @direction == :up
candidates = @queue.select { |cyl| cyl >= @current_position }.sort
if candidates.empty?
@direction = :down
candidates = @queue.select { |cyl| cyl < @current_position }.sort.reverse
end
else
candidates = @queue.select { |cyl| cyl <= @current_position }.sort.reverse
if candidates.empty?
@direction = :up
candidates = @queue.select { |cyl| cyl > @current_position }.sort
end
end
next_request = candidates.first
@queue.delete(next_request)
distance = (next_request - @current_position).abs
@current_position = next_request
{ cylinder: next_request, seek_distance: distance, direction: @direction }
end
end
scheduler = SCANScheduler.new(50)
requests = [20, 90, 55, 30, 85, 15, 75]
requests.each { |cyl| scheduler.add_request(cyl) }
puts "SCAN scheduling (starting at 50, direction up):"
requests.length.times do
result = scheduler.schedule_next
puts " #{result[:cylinder]} (seek: #{result[:seek_distance]}, #{result[:direction]})"
end
# Services: 55, 75, 85, 90, then reverses: 30, 20, 15
LOOK and C-LOOK optimize SCAN and C-SCAN by eliminating unnecessary traversal to disk ends. LOOK reverses direction at the last request in the current direction rather than at the physical disk boundary. If the furthest request sits at cylinder 180, LOOK reverses there instead of continuing to cylinder 199.
This simple modification reduces average seek time by 5-15% depending on request distribution. The algorithm remains fair while eliminating wasted movement. C-LOOK applies the same optimization to C-SCAN, returning to the lowest pending request rather than cylinder 0.
Modern Linux systems implement variants of these classical algorithms adapted for contemporary hardware. The Completely Fair Queuing (CFQ) scheduler maintains separate queues per process, allocating disk time slices to each queue in round-robin fashion. Within each queue, CFQ applies LOOK-style ordering. This provides fairness across processes while maintaining good overall throughput.
The Deadline scheduler prioritizes requests approaching their timeout deadline. Each request receives a read or write deadline (typically 500ms for reads, 5s for writes). The scheduler normally uses LOOK ordering, but preempts this to service requests nearing their deadline. This prevents starvation while maintaining good throughput.
NOOP (No Operation) scheduling performs minimal reordering—only request merging. This algorithm targets SSDs where seek time disappears. Traditional scheduling algorithms provide no benefit and may harm performance by delaying requests. SSDs prefer receiving requests in submission order to optimize their internal parallelism and wear leveling algorithms.
Ruby Implementation
Ruby applications interact with I/O schedulers through standard system calls wrapped by Ruby's I/O classes. When Ruby code calls File.read, File.write, or socket operations, these eventually invoke POSIX functions like read(), write(), pread(), or pwrite(). The kernel routes these through the page cache and I/O scheduler before reaching the block device.
Ruby provides no direct control over I/O scheduling algorithms—these operate transparently at the kernel level. However, Ruby code influences scheduler behavior through access patterns, buffer sizes, and I/O operation ordering. Understanding scheduler behavior allows writing Ruby applications that cooperate with the scheduler for better performance.
The Ruby runtime maintains internal buffers for I/O operations. When opening a file with buffering enabled (default for text mode), Ruby accumulates writes in memory before issuing actual system calls. This reduces the number of I/O requests reaching the scheduler, allowing better merging and reordering opportunities.
# Buffered I/O - Ruby accumulates writes before syscalls
File.open('output.txt', 'w') do |file|
10000.times do |i|
file.puts "Line #{i}"
# Data accumulates in Ruby buffer
end
# File close triggers flush, issuing buffered writes to OS
end
# Unbuffered I/O - each write becomes a syscall
File.open('output.txt', 'w') do |file|
file.sync = true # Disable Ruby buffering
10000.times do |i|
file.puts "Line #{i}"
# Each puts triggers immediate system call
end
end
# Unbuffered version may be 10-50x slower due to syscall overhead
The sync attribute controls Ruby's internal buffering. Setting file.sync = true disables buffering, causing each write operation to immediately invoke the underlying system call. This guarantees data reaches the OS buffer cache promptly but dramatically increases overhead. The OS still buffers writes unless explicitly flushed with fsync.
For fine-grained control, Ruby provides flush and fsync methods. The flush method writes buffered data to the OS but doesn't guarantee disk persistence—data remains in the page cache. The fsync method forces the OS to write cached data to physical storage, blocking until the I/O scheduler completes all pending operations for this file.
require 'benchmark'
# Demonstrate impact of sync and fsync
Benchmark.bm(20) do |bm|
bm.report('buffered write:') do
File.open('test1.dat', 'w') do |f|
10000.times { f.write('x' * 100) }
end
end
bm.report('buffered + flush:') do
File.open('test2.dat', 'w') do |f|
10000.times do
f.write('x' * 100)
f.flush # Push to OS cache
end
end
end
bm.report('buffered + fsync:') do
File.open('test3.dat', 'w') do |f|
10000.times do
f.write('x' * 100)
f.flush
f.fsync # Force to disk - involves I/O scheduler
end
end
end
end
# fsync version dramatically slower due to actual disk I/O
Ruby's IO#advise method provides hints to the kernel about expected access patterns. These hints influence page cache behavior and, indirectly, I/O scheduling. The :sequential hint tells the kernel to expect sequential access, triggering aggressive read-ahead and immediate page reclaim. The :random hint disables read-ahead since it would waste I/O bandwidth.
# Provide access pattern hints to the kernel
File.open('large_dataset.bin', 'rb') do |file|
file.advise(:sequential) # Optimize for sequential reading
# Kernel now performs aggressive read-ahead
# I/O scheduler sees many sequential requests it can merge
data = file.read
end
File.open('database.idx', 'rb') do |file|
file.advise(:random) # Disable read-ahead for random access
# Read index entries at random positions
indices = [1000, 5000, 2000, 8000]
indices.each do |pos|
file.seek(pos)
entry = file.read(256)
end
end
Ruby's non-blocking I/O support through io/nonblock allows applications to issue multiple I/O operations without waiting for completion. This increases the number of concurrent requests visible to the I/O scheduler, providing more optimization opportunities. The scheduler performs better with deeper queues since it can select from more requests when choosing the next operation.
require 'io/nonblock'
# Non-blocking I/O increases scheduler queue depth
files = (0...10).map { |i| File.open("data#{i}.bin", 'rb') }
files.each { |f| f.nonblock = true }
# Issue multiple read requests without blocking
pending_operations = []
files.each do |file|
begin
data = file.read_nonblock(4096)
# Process data
rescue IO::WaitReadable
# I/O pending - scheduler will handle it
pending_operations << file
end
end
# Wait for pending operations
IO.select(pending_operations) # Block until at least one completes
For applications requiring high-performance I/O with many concurrent operations, gems like nio4r provide event-driven I/O multiplexing. These libraries use select(), poll(), or epoll() to monitor multiple file descriptors, issuing I/O operations asynchronously. This pattern gives the I/O scheduler maximum flexibility since many requests remain outstanding simultaneously.
require 'nio'
# Event-driven I/O with nio4r
selector = NIO::Selector.new
files = (0...20).map { |i| File.open("data#{i}.bin", 'rb') }
files.each do |file|
monitor = selector.register(file, :r)
monitor.value = file
end
# Process I/O as it completes
files.length.times do
selector.select do |monitor|
file = monitor.value
data = file.read_nonblock(4096) rescue nil
if data
# Process data
else
selector.deregister(file)
file.close
end
end
end
# Scheduler optimizes order of these concurrent operations
Database access patterns significantly impact I/O scheduler performance. ORMs like ActiveRecord generate SQL queries that translate to disk I/O. Sequential table scans produce sequential disk access beneficial to any scheduler. Index lookups generate random I/O, challenging for traditional schedulers but acceptable on SSDs.
require 'active_record'
# Sequential scan - scheduler-friendly access pattern
User.where('created_at > ?', 1.year.ago).find_each do |user|
# Process users sequentially
# Reads proceed in physical order on disk
end
# Multiple separate queries - may cause random I/O
user_ids = [1, 500, 25, 750, 100]
users = user_ids.map { |id| User.find(id) }
# Each find() might hit different disk regions
# Scheduler must optimize these disparate requests
# Batch loading reduces I/O operations
users = User.where(id: user_ids).to_a
# Single query reduces I/O request count
# Scheduler handles one operation instead of five
Performance Considerations
I/O scheduler performance depends heavily on workload characteristics. Sequential access patterns allow aggressive request merging and read-ahead, achieving transfer rates approaching hardware limits. Random access patterns prevent merging and may cause the head to traverse the entire disk repeatedly, reducing throughput by 100x or more on HDDs.
The ratio of read to write operations affects scheduler behavior. Most schedulers prioritize reads since they block application progress. Writes proceed asynchronously through the page cache, allowing applications to continue without waiting. Under read-heavy workloads, the scheduler may delay writes indefinitely if reads continue arriving. This improves interactive response time but can cause write starvation.
Queue depth dramatically impacts scheduling effectiveness. With only one or two outstanding requests, the scheduler has minimal reordering opportunities. As queue depth increases to 32-128 requests, the scheduler selects from many candidates, minimizing seek distance. Beyond 128-256 requests, gains plateau while per-request latency increases due to queuing delay.
require 'benchmark'
# Demonstrate impact of access patterns on performance
def sequential_access(file, size)
# Read file sequentially
file.rewind
chunks = size / 4096
chunks.times { file.read(4096) }
end
def random_access(file, size)
# Read file in random order
chunks = size / 4096
offsets = (0...chunks).map { |i| i * 4096 }.shuffle
offsets.each do |offset|
file.seek(offset)
file.read(4096)
end
end
# Create test file
size = 100 * 1024 * 1024 # 100MB
File.open('test.dat', 'wb') { |f| f.write('x' * size) }
File.open('test.dat', 'rb') do |file|
Benchmark.bm(20) do |bm|
bm.report('Sequential read:') { sequential_access(file, size) }
bm.report('Random read:') { random_access(file, size) }
end
end
# On HDD: Random read 50-100x slower than sequential
# On SSD: Random read 1-5x slower (or comparable) to sequential
Request size influences performance through interaction with both hardware and schedulers. Extremely small requests (512 bytes or less) incur overhead from syscall and scheduling costs disproportionate to data transferred. Very large requests (megabytes) may be split by the kernel into multiple block-level operations, reducing scheduler flexibility. Optimal request sizes typically range from 4KB to 128KB depending on workload.
The Linux CFQ scheduler adjusts behavior based on process I/O patterns. Processes issuing many requests receive longer time slices, while interactive processes with occasional I/O receive immediate service. This heuristic distinguishes between batch processing (throughput-oriented) and interactive applications (latency-sensitive), applying different scheduling strategies to each.
Storage device characteristics fundamentally alter scheduling trade-offs. HDDs benefit greatly from traditional scheduling algorithms. A SCAN-based scheduler might improve HDD throughput by 3-5x compared to FCFS under random workloads. SSDs show minimal benefit from classical algorithms since seek time disappears. SSD performance depends more on internal parallelism, with the NOOP scheduler often outperforming CFQ or Deadline.
# Detect storage device type and adjust I/O strategy
def detect_storage_type(path)
# Simplistic detection - real systems check /sys/block/*/queue/rotational
stat = File.stat(path)
device = stat.dev
# On Linux, check if device is rotational
rotational_file = "/sys/dev/block/#{device >> 8}:#{device & 0xff}/queue/rotational"
if File.exist?(rotational_file)
File.read(rotational_file).strip == "1" ? :hdd : :ssd
else
:unknown
end
end
def optimize_for_storage(file, storage_type)
case storage_type
when :hdd
# Sequential access and larger requests for HDDs
file.advise(:sequential)
chunk_size = 128 * 1024 # 128KB chunks
when :ssd
# SSDs handle random access well
file.advise(:random)
chunk_size = 4096 # Smaller chunks OK
else
# Conservative default
chunk_size = 16 * 1024
end
chunk_size
end
Write-back caching complicates performance analysis. When an application writes data, the write typically returns after data reaches the page cache, not physical storage. Actual disk I/O occurs later when the kernel's flusher threads write dirty pages to disk. This delayed write batches many small writes into fewer larger operations, improving throughput but making timing measurements misleading.
Database systems often use direct I/O to bypass page cache, giving applications control over caching decisions. Ruby applications can open files with the O_DIRECT flag (platform-specific) to enable direct I/O. This forces each operation to interact directly with the I/O scheduler, making scheduler behavior more predictable but requiring larger, aligned requests.
Concurrency level affects I/O performance through scheduler queue depth. A single-threaded application issues one request at a time, limiting scheduler optimization. Multi-threaded applications or async I/O increase queue depth proportionally to the number of concurrent operations. On HDDs, 4-8 concurrent operations often provide optimal throughput. SSDs benefit from higher concurrency, sometimes requiring 32+ concurrent operations to saturate available parallelism.
Practical Examples
Understanding scheduler behavior through practical examples demonstrates the performance impact of different access patterns and scheduling algorithms.
Example 1: Sequential vs Random Access Performance
# Create a 1GB test file
FILE_SIZE = 1024 * 1024 * 1024
BLOCK_SIZE = 4096
BLOCK_COUNT = FILE_SIZE / BLOCK_SIZE
File.open('testfile.bin', 'wb') do |file|
BLOCK_COUNT.times { file.write('x' * BLOCK_SIZE) }
end
# Measure sequential read performance
def measure_sequential
start = Time.now
File.open('testfile.bin', 'rb') do |file|
file.advise(:sequential)
while data = file.read(BLOCK_SIZE)
# Process block
end
end
elapsed = Time.now - start
throughput = FILE_SIZE / elapsed / (1024 * 1024)
{ time: elapsed, throughput_mb_s: throughput }
end
# Measure random read performance
def measure_random
offsets = (0...1000).map { rand(BLOCK_COUNT) * BLOCK_SIZE }
start = Time.now
File.open('testfile.bin', 'rb') do |file|
file.advise(:random)
offsets.each do |offset|
file.seek(offset)
file.read(BLOCK_SIZE)
end
end
elapsed = Time.now - start
throughput = (1000 * BLOCK_SIZE) / elapsed / (1024 * 1024)
{ time: elapsed, throughput_mb_s: throughput }
end
seq = measure_sequential
rand = measure_random
puts "Sequential: #{seq[:throughput_mb_s].round(2)} MB/s"
puts "Random: #{rand[:throughput_mb_s].round(2)} MB/s"
puts "Performance ratio: #{(seq[:throughput_mb_s] / rand[:throughput_mb_s]).round(1)}x"
# On HDD: Ratio typically 50-100x
# On SSD: Ratio typically 1-3x
This example illustrates how scheduler effectiveness varies with access patterns. Sequential access allows the scheduler to merge adjacent requests and maximize transfer rates. Random access forces the scheduler to choose between many distant requests, resulting in extensive head movement on HDDs.
Example 2: Impact of Request Batching
# Database-like access pattern - individual queries vs batch loading
class DataStore
def initialize(filename)
@file = File.open(filename, 'rb')
@record_size = 256
end
def read_record(id)
offset = id * @record_size
@file.seek(offset)
@file.read(@record_size)
end
def read_records_batch(ids)
# Sort IDs to create more sequential access pattern
ids.sort.map { |id| read_record(id) }
end
def close
@file.close
end
end
# Create test datastore
record_count = 100000
File.open('datastore.bin', 'wb') do |f|
record_count.times { |i| f.write(i.to_s.rjust(256, '0')) }
end
store = DataStore.new('datastore.bin')
# Random individual queries - poor for scheduler
random_ids = (0...1000).map { rand(record_count) }
start = Time.now
random_ids.each { |id| store.read_record(id) }
individual_time = Time.now - start
# Batch query with sorting - helps scheduler
start = Time.now
store.read_records_batch(random_ids)
batch_time = Time.now - start
store.close
puts "Individual queries: #{individual_time.round(3)}s"
puts "Batch with sorting: #{batch_time.round(3)}s"
puts "Speedup: #{(individual_time / batch_time).round(2)}x"
# Sorting IDs reduces seek distance, improving scheduler efficiency
Sorting requests by physical location helps any scheduler algorithm. Even with random logical access patterns, sorting creates more sequential physical access. This technique applies to database queries, file processing, and any scenario with multiple small operations.
Example 3: Write Buffering and Flush Behavior
# Demonstrate write buffering impact on performance
require 'benchmark'
WRITE_COUNT = 10000
DATA_SIZE = 1024 # 1KB per write
Benchmark.bm(25) do |bm|
# Fully buffered writes
bm.report('Buffered (default):') do
File.open('buffered.dat', 'wb') do |f|
WRITE_COUNT.times { f.write('x' * DATA_SIZE) }
end
end
# Flush after each write (force to page cache)
bm.report('Flush per write:') do
File.open('flushed.dat', 'wb') do |f|
WRITE_COUNT.times do
f.write('x' * DATA_SIZE)
f.flush
end
end
end
# fsync after each write (force to disk)
bm.report('Fsync per write:') do
File.open('synced.dat', 'wb') do |f|
WRITE_COUNT.times do
f.write('x' * DATA_SIZE)
f.flush
f.fsync
end
end
end
# Batch writes, then fsync once
bm.report('Batch then fsync:') do
File.open('batch.dat', 'wb') do |f|
WRITE_COUNT.times { f.write('x' * DATA_SIZE) }
f.flush
f.fsync
end
end
end
# Results typically show:
# - Buffered: fast (data stays in memory)
# - Flush: moderate (syscalls but page cache absorbs writes)
# - Fsync per write: very slow (each write waits for disk)
# - Batch then fsync: moderate (many writes, one sync wait)
This example demonstrates the cost of forcing data to storage versus allowing buffering. Applications requiring durability must fsync, but batching many operations before syncing dramatically improves performance by allowing the I/O scheduler to optimize multiple pending writes.
Example 4: Concurrent I/O and Queue Depth
require 'thread'
# Process multiple files concurrently to increase scheduler queue depth
def process_files_sequential(filenames)
filenames.each do |filename|
File.open(filename, 'rb') do |f|
while data = f.read(4096)
# Process data
sleep(0.0001) # Simulate processing time
end
end
end
end
def process_files_concurrent(filenames, thread_count = 4)
queue = Queue.new
filenames.each { |f| queue << f }
threads = thread_count.times.map do
Thread.new do
while filename = queue.pop(true) rescue nil
File.open(filename, 'rb') do |f|
while data = f.read(4096)
sleep(0.0001)
end
end
end
end
end
threads.each(&:join)
end
# Create test files
20.times do |i|
File.open("testfile_#{i}.bin", 'wb') do |f|
f.write('x' * (1024 * 1024 * 10)) # 10MB each
end
end
filenames = (0...20).map { |i| "testfile_#{i}.bin" }
# Compare sequential vs concurrent processing
require 'benchmark'
Benchmark.bm(20) do |bm|
bm.report('Sequential:') { process_files_sequential(filenames) }
bm.report('Concurrent (4 threads):') { process_files_concurrent(filenames, 4) }
end
# Concurrent version provides deeper scheduler queue
# Scheduler can optimize across multiple active files
# Performance gain depends on storage type (HDD benefits more)
Concurrent I/O increases the number of outstanding requests visible to the scheduler. With multiple threads or async I/O, the scheduler sees requests across different files and can optimize accordingly. This technique particularly benefits HDDs where the scheduler can minimize head movement across diverse requests.
Common Pitfalls
Assuming Sequential File Processing Equals Sequential Disk Access
File system layout determines physical disk locations of file data. A file's logical structure may span non-contiguous disk regions due to fragmentation. Processing a file sequentially in application code doesn't guarantee sequential disk access. The scheduler receives requests for scattered disk locations, limiting optimization opportunities.
# Sequential file reading doesn't guarantee sequential disk access
File.open('potentially_fragmented.log', 'rb') do |file|
file.advise(:sequential) # Hint may not help if file fragmented
# Logical sequential access
while line = file.gets
process(line)
end
end
# File may occupy non-contiguous blocks on disk:
# Blocks: [150, 151, 152, 890, 891, 1024, 1025, ...]
# Scheduler sees jumps between distant cylinders
Mitigation involves defragmentation (filesystems like XFS do this automatically) or ensuring files are written in large contiguous operations. Applications writing log files or database files should write large chunks rather than many small appends to increase likelihood of contiguous allocation.
Excessive fsync Causing Performance Collapse
Calling fsync after every write operation forces synchronous disk I/O, bypassing both Ruby's buffers and the page cache. This prevents the I/O scheduler from optimizing across multiple operations, reducing throughput by 100x or more. Each fsync call blocks until the scheduler completes all pending writes for the file.
# Catastrophic performance - fsync per record
def write_records_slow(records)
File.open('output.dat', 'wb') do |file|
records.each do |record|
file.write(record)
file.flush
file.fsync # Blocks until disk completes write
end
end
end
# Much better - batch then fsync
def write_records_fast(records)
File.open('output.dat', 'wb') do |file|
records.each { |record| file.write(record) }
file.flush
file.fsync # Single fsync after all writes
end
end
# write_records_fast is 50-200x faster depending on storage
Applications requiring durability should batch multiple operations then fsync periodically (e.g., every 100 records or every 100ms). This provides reasonable durability guarantees while allowing the scheduler to optimize batched writes.
Ignoring Storage Device Type
Applying HDD-optimized access patterns to SSDs wastes development effort and may reduce performance. SSDs handle random I/O efficiently, making complex sorting or sequential access patterns unnecessary. Conversely, treating HDDs like SSDs by issuing random I/O creates severe performance problems.
# Detecting storage type allows appropriate optimization
def optimize_data_processing(file, data_indices)
device_type = detect_storage_type(file.path)
case device_type
when :hdd
# Sort indices to create sequential access pattern
sorted_indices = data_indices.sort
sorted_indices.each do |idx|
file.seek(idx * RECORD_SIZE)
process(file.read(RECORD_SIZE))
end
when :ssd
# Random access fine on SSD - no need to sort
data_indices.each do |idx|
file.seek(idx * RECORD_SIZE)
process(file.read(RECORD_SIZE))
end
end
end
Modern applications should detect storage characteristics and adjust behavior accordingly. Cloud environments complicate this since VMs may not know underlying storage types. When in doubt, conservative strategies (larger requests, some batching) work reasonably across storage types.
Small Random Writes in Database Applications
ORMs and database libraries often generate many small queries that translate to random disk I/O. Updating 100 different records via 100 separate UPDATE statements causes the database to write to 100 different disk locations. The I/O scheduler struggles to optimize these scattered writes.
# Poor pattern - many small updates
users_to_update.each do |user|
user.update(last_seen: Time.now)
# Each update likely writes to different disk location
# Scheduler sees 100 unrelated write requests
end
# Better - batch update in single transaction
User.transaction do
users_to_update.each do |user|
user.last_seen = Time.now
end
users_to_update.each(&:save)
end
# Database can optimize write order
# Scheduler receives more organized write pattern
# Best - single SQL statement
User.where(id: users_to_update.map(&:id))
.update_all(last_seen: Time.now)
# Minimal disk I/O, scheduler handles efficiently
Batch operations allow databases to optimize write patterns and give the I/O scheduler larger, more organized sets of requests to optimize. This applies to INSERT, UPDATE, and DELETE operations.
Read-Ahead Defeating on Repeated File Access
Opening and closing a file between reads resets kernel read-ahead state. The I/O scheduler and page cache work together to predict future access and prefetch data. Repeatedly opening the same file prevents this optimization, forcing repeated I/O for data that could remain cached.
# Inefficient - repeated open/close
def process_log_entries(log_file, entry_ids)
entry_ids.each do |entry_id|
File.open(log_file, 'rb') do |f|
f.seek(entry_id * ENTRY_SIZE)
entry = f.read(ENTRY_SIZE)
process(entry)
end
# File closed, read-ahead state lost
end
end
# Efficient - keep file open
def process_log_entries_optimized(log_file, entry_ids)
File.open(log_file, 'rb') do |f|
entry_ids.each do |entry_id|
f.seek(entry_id * ENTRY_SIZE)
entry = f.read(ENTRY_SIZE)
process(entry)
end
end
# Read-ahead remains active, page cache effective
end
Keep files open across multiple operations when possible. The kernel maintains per-file state including read-ahead information and page cache associations. Closing and reopening discards this state, reducing both cache effectiveness and scheduler optimization.
Reference
I/O Scheduling Algorithm Comparison
| Algorithm | Seek Optimization | Fairness | Starvation Risk | Best Use Case |
|---|---|---|---|---|
| FCFS | None | Perfect | None | Light load, sequential access |
| SSTF | Excellent | Poor | High | High throughput, short-lived jobs |
| SCAN | Good | Good | None | General purpose, mixed workload |
| C-SCAN | Good | Better | None | Real-time requirements |
| LOOK | Very Good | Good | None | Modern general purpose |
| C-LOOK | Very Good | Better | None | Modern real-time |
| CFQ | Good | Excellent | Very Low | Interactive multi-process systems |
| Deadline | Good | Good | None | Latency-sensitive workloads |
| NOOP | None | Perfect | None | SSDs, RAID arrays |
Performance Characteristics by Storage Type
| Metric | HDD | SSD |
|---|---|---|
| Sequential Read | 100-200 MB/s | 500-3500 MB/s |
| Random Read IOPS | 75-150 | 10,000-500,000 |
| Sequential Write | 100-200 MB/s | 400-3000 MB/s |
| Random Write IOPS | 75-150 | 10,000-300,000 |
| Seek Time | 3-15 ms | <0.1 ms |
| Queue Depth Benefit | High | Moderate |
| Scheduler Impact | Critical | Minimal |
Ruby I/O Methods and Scheduler Interaction
| Method | Buffer Behavior | Scheduler Impact | Use Case |
|---|---|---|---|
| File.read | Single read syscall | Minimal queue depth | Small files |
| File.read with block | Multiple read syscalls | Better queue depth | Large files |
| File.write | Ruby buffered | Batched to OS | General writing |
| file.sync = true | Unbuffered per write | Each write visible to scheduler | Low-latency logging |
| file.flush | Forces syscalls | Writes reach page cache | Visibility without durability |
| file.fsync | Forces disk write | Scheduler must complete operation | Durability required |
| file.advise | Kernel hint | Affects read-ahead scheduling | Large file processing |
| IO.select | Async operation | Increases queue depth | High-concurrency I/O |
Key Timing Components
| Component | Typical Range (HDD) | Typical Range (SSD) |
|---|---|---|
| Seek Time | 3-15 ms | <0.1 ms |
| Rotational Latency | 2-8 ms | None |
| Transfer Time (4KB) | 0.1-0.3 ms | 0.01-0.05 ms |
| Syscall Overhead | 0.001-0.01 ms | 0.001-0.01 ms |
| Total Latency (Random) | 5-23 ms | 0.05-0.2 ms |
Optimization Guidelines by Workload
| Workload Pattern | HDD Strategy | SSD Strategy |
|---|---|---|
| Sequential Read | Large reads, read-ahead enabled | Moderate reads, read-ahead optional |
| Random Read | Sort requests, batch reads | Direct access, concurrent reads |
| Sequential Write | Large buffered writes, periodic fsync | Moderate writes, less frequent fsync |
| Random Write | Batch and sort writes, delayed fsync | Direct writes, reasonable fsync frequency |
| Mixed Read/Write | Separate read and write phases | Concurrent operations acceptable |
| Database Queries | Batch queries, sort by key | Batch queries for other reasons |
| Log Writing | Buffer heavily, fsync per batch | Moderate buffering, fsync per batch |
Linux I/O Scheduler Selection
| Scheduler | Sysfs Path | Configuration |
|---|---|---|
| View Current | /sys/block/sda/queue/scheduler | cat to view, shows active in brackets |
| Change Temporarily | /sys/block/sda/queue/scheduler | echo cfq > file |
| Change Permanently | /etc/default/grub | GRUB_CMDLINE_LINUX elevator=deadline |
| Per-Process Priority | ionice command | ionice -c 2 -n 0 command |
| Check Rotational | /sys/block/sda/queue/rotational | 1 for HDD, 0 for SSD |
Ruby I/O Tuning Parameters
| Parameter | Setting Method | Impact on Scheduler |
|---|---|---|
| Buffer Size | File.open with buffer size arg | Larger buffers reduce syscall frequency |
| Sync Mode | file.sync = true/false | Disabling buffering increases queue depth |
| Advise Sequential | file.advise(:sequential) | Triggers read-ahead, more sequential requests |
| Advise Random | file.advise(:random) | Disables read-ahead, reduces wasted I/O |
| Non-blocking | require 'io/nonblock', file.nonblock = true | Allows concurrent I/O operations |
| Direct I/O | Platform-specific flags | Bypasses page cache, direct scheduler interaction |