CrackedRuby - Reliable Data Transfer

Overview

Reliable Data Transfer (RDT) addresses the fundamental challenge that network communication channels are inherently unreliable. Bits can be corrupted during transmission, packets can be lost entirely, and data can arrive out of order or duplicated. RDT protocols implement mechanisms to detect and recover from these failures, creating the abstraction of a reliable channel over an unreliable medium.

The concept originated from early networking research when engineers recognized that application developers needed guarantees about data delivery without implementing complex error handling in every application. Rather than forcing each application to handle packet loss, corruption, and reordering, the transport layer provides reliability as a service. This separation of concerns became a cornerstone of the Internet protocol stack.

RDT sits at the transport layer of the network stack, positioned between the application layer above and the network layer below. The network layer (IP) provides best-effort delivery with no guarantees, while the transport layer (TCP) builds reliability on top of this unreliable foundation. Applications like web browsers, email clients, and database connections depend on RDT to function correctly without implementing their own reliability mechanisms.

Consider a simple example: sending the string "HELLO" across a network. Without reliability mechanisms, the receiving application might receive "HXLLO" (bit corruption), "HEL" (packet loss), "LLOHE" (reordering), or "HELLOHELLO" (duplication). RDT protocols prevent these issues through checksums, acknowledgments, sequence numbers, and retransmission.

The design space for RDT protocols involves fundamental trade-offs between simplicity, performance, and the types of failures handled. Early protocols like stop-and-wait provide correctness but sacrifice throughput. More sophisticated protocols like selective repeat achieve higher performance through pipelining but require additional complexity in state management and buffer handling.

Key Principles

Reliable data transfer builds on four foundational mechanisms that work together to detect and recover from transmission errors. Each mechanism addresses a specific failure mode in the communication channel.

Checksums detect bit corruption during transmission. The sender computes a mathematical hash of the packet contents and includes it in the packet header. The receiver recomputes the checksum and compares it to the transmitted value. Any mismatch indicates corruption occurred. Common checksum algorithms include Internet checksum (ones-complement sum), CRC (cyclic redundancy check), and MD5 or SHA for stronger error detection. The checksum field itself must be protected or excluded from the computation to avoid circular dependencies.

Acknowledgments inform the sender that data arrived successfully. After receiving and validating a packet, the receiver sends an acknowledgment (ACK) back to the sender. The ACK indicates which packet was received correctly. Negative acknowledgments (NAK) explicitly signal packet corruption or other errors, though many modern protocols use duplicate ACKs instead of explicit NAKs. The acknowledgment mechanism creates a feedback loop that allows the sender to know the receiver's state.

Sequence numbers enable the receiver to detect duplicates, reordering, and gaps in the data stream. Each packet carries a sequence number that identifies its position in the stream. The receiver uses sequence numbers to reassemble data in the correct order, discard duplicates, and identify missing packets. Sequence numbers typically wrap around after reaching a maximum value, so protocols must handle wraparound correctly. The sequence number space size determines how many outstanding unacknowledged packets can exist simultaneously.

Timeouts and retransmission handle packet loss when neither data nor acknowledgments arrive. The sender starts a timer when transmitting a packet. If no acknowledgment arrives before the timer expires, the sender assumes the packet was lost and retransmits it. Timeout values must balance responsiveness against spurious retransmissions caused by delayed packets or acknowledgments. Adaptive timeout algorithms adjust based on measured round-trip times.

These mechanisms combine in protocols that can be modeled as finite state machines. The sender FSM has states for waiting to send, waiting for acknowledgment, and handling timeouts. The receiver FSM has states for waiting for packets and processing received data. State transitions occur based on events like packet arrival, timeout expiration, or ACK reception.

The protocol must handle the interaction between these mechanisms. For example, sequence numbers allow the receiver to distinguish between original packets and retransmissions after a timeout. Checksums ensure that acknowledgments themselves arrive uncorrupted. The combination creates a system where data eventually arrives correctly despite any finite number of transmission errors, though timeouts may occur.

RDT protocols differ in how they handle pipelining—sending multiple packets before receiving acknowledgments. Stop-and-wait protocols send one packet and wait for its ACK before sending the next, achieving low utilization on high-latency links. Pipelined protocols like Go-Back-N and Selective Repeat send multiple packets concurrently, requiring additional buffering and more complex sequence number management but achieving much higher throughput.

Ruby Implementation

Ruby provides reliable data transfer through the TCPSocket and TCPServer classes, which implement the TCP protocol's reliability mechanisms. The socket library handles checksums, acknowledgments, sequence numbers, and retransmissions transparently, presenting a stream abstraction to the application.

require 'socket'

# Server implementing reliable message reception
server = TCPServer.new(9000)
puts "Server listening on port 9000"

client = server.accept
puts "Client connected from #{client.peeraddr[3]}:#{client.peeraddr[1]}"

# TCP handles reliability - data arrives in order without corruption
while (message = client.gets)
  puts "Received: #{message}"
  client.puts "ACK: #{message.chomp}"
end

client.close
server.close

The TCP socket handles reliability through several configurable parameters. The TCP_NODELAY option disables Nagle's algorithm, which buffers small packets to reduce overhead. Without this option, TCP delays sending small packets to coalesce them, trading latency for efficiency:

require 'socket'

socket = TCPSocket.new('example.com', 80)

# Disable Nagle's algorithm for low-latency applications
socket.setsockopt(Socket::IPPROTO_TCP, Socket::TCP_NODELAY, 1)

# Configure socket buffer sizes for throughput
socket.setsockopt(Socket::SOL_SOCKET, Socket::SO_SNDBUF, 65536)
socket.setsockopt(Socket::SOL_SOCKET, Socket::SO_RCVBUF, 65536)

# Set keepalive to detect dead connections
socket.setsockopt(Socket::SOL_SOCKET, Socket::SO_KEEPALIVE, 1)

socket.write("GET / HTTP/1.1\r\n\r\n")
response = socket.read
puts response

socket.close

For applications requiring custom reliability mechanisms over UDP, Ruby's UDPSocket class provides unreliable datagram delivery. Building reliability on top of UDP requires implementing checksums, acknowledgments, sequence numbers, and timeouts:

require 'socket'
require 'digest'

# Custom reliable protocol over UDP
class ReliableUDPSocket
  TIMEOUT = 1.0
  MAX_RETRIES = 5
  
  def initialize(port)
    @socket = UDPSocket.new
    @socket.bind('0.0.0.0', port)
    @seq_num = 0
  end
  
  def send_reliable(data, host, port)
    retries = 0
    
    loop do
      # Create packet with sequence number and checksum
      packet = create_packet(data)
      @socket.send(packet, 0, host, port)
      
      # Wait for acknowledgment with timeout
      begin
        ready = IO.select([@socket], nil, nil, TIMEOUT)
        
        if ready
          ack_data, _ = @socket.recvfrom(1024)
          ack_seq = ack_data.unpack1('N')
          
          if ack_seq == @seq_num
            @seq_num += 1
            return true
          end
        end
      rescue IO::TimeoutError
        # Timeout occurred, fall through to retry
      end
      
      retries += 1
      raise "Max retries exceeded" if retries >= MAX_RETRIES
      puts "Timeout, retransmitting packet #{@seq_num}"
    end
  end
  
  private
  
  def create_packet(data)
    checksum = Digest::MD5.digest(data)
    [@seq_num, checksum, data].pack('NA16A*')
  end
end

# Usage
sender = ReliableUDPSocket.new(9001)
sender.send_reliable("Important message", 'localhost', 9002)

The receiver side validates checksums and sends acknowledgments:

class ReliableUDPReceiver
  def initialize(port)
    @socket = UDPSocket.new
    @socket.bind('0.0.0.0', port)
    @expected_seq = 0
  end
  
  def receive_reliable
    loop do
      data, addr = @socket.recvfrom(65536)
      
      # Parse packet
      seq_num, checksum, payload = data.unpack('NA16A*')
      
      # Verify checksum
      computed = Digest::MD5.digest(payload)
      
      if computed == checksum && seq_num == @expected_seq
        # Send acknowledgment
        ack = [@expected_seq].pack('N')
        @socket.send(ack, 0, addr[3], addr[1])
        
        @expected_seq += 1
        return payload
      else
        # Corruption or duplicate - send previous ACK
        prev_ack = [@expected_seq - 1].pack('N')
        @socket.send(prev_ack, 0, addr[3], addr[1])
      end
    end
  end
end

# Usage
receiver = ReliableUDPReceiver.new(9002)
message = receiver.receive_reliable
puts "Reliably received: #{message}"

Ruby's Async gem provides higher-level abstractions for reliable communication with built-in timeout handling and connection management:

require 'async'
require 'async/io'
require 'async/io/stream'

Async do
  endpoint = Async::IO::Endpoint.tcp('localhost', 9000)
  
  endpoint.bind do |server|
    server.listen(128)
    
    server.accept do |client|
      stream = Async::IO::Stream.new(client)
      
      # Reliable read with timeout
      begin
        Async::Task.current.with_timeout(5.0) do
          while (line = stream.read_until("\n"))
            puts "Received: #{line}"
            stream.write("ACK\n")
            stream.flush
          end
        end
      rescue Async::TimeoutError
        puts "Client connection timed out"
      ensure
        client.close
      end
    end
  end
end

Implementation Approaches

Reliable data transfer protocols fall into three main categories based on how they handle packet pipelining and acknowledgments. Each approach makes different trade-offs between simplicity, buffer requirements, and throughput.

Stop-and-Wait Protocol represents the simplest reliable transfer mechanism. The sender transmits one packet and waits for its acknowledgment before sending the next packet. This approach minimizes sender and receiver buffering requirements but severely limits throughput on high-latency links.

The stop-and-wait sender operates as a two-state finite state machine. In the "waiting for call from above" state, it accepts data from the application, creates a packet with sequence number and checksum, sends it, starts a timer, and transitions to "waiting for ACK." In the "waiting for ACK" state, it processes events: receiving a valid ACK transitions back to the first state; timer expiration triggers retransmission and timer restart; receiving a corrupt ACK or NAK also triggers retransmission.

Performance analysis reveals the protocol's limitations. Consider a 1 Gbps link with 30ms round-trip time. Stop-and-wait sends a 1KB packet, waits 30ms for ACK, then sends the next packet. The sender transmits for 8 microseconds per packet (1KB at 1Gbps) but waits 30ms between transmissions. Link utilization equals (0.008ms / 30.008ms) or approximately 0.027%, achieving only 267 Kbps effective throughput despite the 1 Gbps link capacity. The protocol's throughput is inversely proportional to the round-trip time, making it unsuitable for high-bandwidth-delay product networks.

Go-Back-N Protocol improves throughput by allowing the sender to transmit multiple packets before receiving acknowledgments. The sender maintains a window of N packets that can be outstanding simultaneously. Sequence numbers use modulo-2^k arithmetic where k is chosen such that 2^k > N. When an error occurs, the receiver discards all subsequent packets and the sender retransmits from the error point onward.

The sender maintains three variables: base (oldest unacknowledged packet), nextseqnum (next sequence number to use), and N (window size). The window represents packets in the range [base, base+N-1]. When the application provides data, if nextseqnum < base+N, the sender creates and transmits the packet and advances nextseqnum. If nextseqnum >= base+N, the window is full and the sender refuses the data or buffers it.

When an ACK arrives for packet n, the sender sets base to n+1 and slides the window forward. This cumulative acknowledgment scheme means ACK n acknowledges all packets up through n. If base equals nextseqnum after processing the ACK, no packets are outstanding and the timer stops. Otherwise, the timer restarts for the oldest unacknowledged packet.

Timeout events trigger retransmission of all packets from base to nextseqnum-1. This "go back N" behavior gives the protocol its name. If packet 5 is lost among packets 1-10, the receiver discards packets 6-10 even if they arrived correctly. When timeout occurs, the sender retransmits packets 5-10.

The receiver maintains a single variable: expectedseqnum. When a packet arrives, if its sequence number equals expectedseqnum, the receiver delivers data to the application, sends ACK expectedseqnum, and increments expectedseqnum. For any other sequence number (out of order or duplicate), the receiver discards the packet and sends ACK for the last correctly received in-order packet. This design keeps the receiver simple with minimal buffering but causes retransmission of correctly received packets after errors.

Selective Repeat Protocol eliminates unnecessary retransmissions by acknowledging packets individually and buffering out-of-order packets at the receiver. The sender maintains a timer for each outstanding packet and retransmits only packets that timeout or receive NAKs. The receiver acknowledges each correctly received packet regardless of order and delivers data to the application only when the next in-sequence packet arrives.

The sender window spans [send_base, send_base+N-1] where N is the window size. Each packet has an independent timer. When data arrives from the application, if nextseqnum is in the window, the sender transmits the packet and starts its timer. When ACK n arrives, the sender marks packet n as received and moves send_base forward if n equals send_base. Timeout for packet n triggers retransmission of only that packet.

The receiver window spans [rcv_base, rcv_base+N-1]. When packet n arrives, if n falls in [rcv_base, rcv_base+N-1], the receiver sends ACK n. If n equals rcv_base, it delivers packets starting from n up through the highest consecutive received packet, then advances rcv_base. If n falls in [rcv_base-N, rcv_base-1], the packet is a duplicate from before the window, so the receiver sends ACK n. Packets outside these ranges are discarded.

Window size selection critically affects correctness. For selective repeat to work correctly, the window size must satisfy N <= sequence_space_size / 2. If sequence numbers use modulo 4 (sequence space = 4), then N <= 2. Larger windows cause ambiguity: the receiver cannot distinguish between packet 0 from the current window and packet 0 from the next window cycle. This constraint limits the protocol's performance on high-bandwidth-delay product networks.

Performance comparison shows selective repeat achieves the best throughput by minimizing retransmissions, but requires buffer space at both sender and receiver for N packets. Go-back-N achieves good throughput with minimal receiver buffering but retransmits correctly received packets after errors. Stop-and-wait requires minimal buffering but achieves poor throughput. Protocol selection depends on the application's requirements for throughput, memory usage, and implementation complexity.

Practical Examples

Building a reliable file transfer application demonstrates how RDT principles translate into working systems. This example implements chunked transfer with acknowledgments and retransmission over TCP:

require 'socket'
require 'digest'

class ReliableFileTransfer
  CHUNK_SIZE = 8192
  MAX_RETRIES = 3
  
  def self.send_file(filename, host, port)
    socket = TCPSocket.new(host, port)
    file_size = File.size(filename)
    
    # Send file metadata
    socket.puts "#{File.basename(filename)}:#{file_size}"
    
    File.open(filename, 'rb') do |file|
      chunk_num = 0
      
      while (chunk = file.read(CHUNK_SIZE))
        retries = 0
        
        begin
          # Send chunk with sequence number and checksum
          checksum = Digest::SHA256.hexdigest(chunk)
          socket.puts "CHUNK:#{chunk_num}:#{checksum}"
          socket.write(chunk)
          socket.flush
          
          # Wait for acknowledgment
          ack = socket.gets.chomp
          
          unless ack == "ACK:#{chunk_num}"
            raise "Invalid ACK received"
          end
          
          chunk_num += 1
          print "\rSent #{chunk_num * CHUNK_SIZE} / #{file_size} bytes"
          
        rescue => e
          retries += 1
          if retries < MAX_RETRIES
            puts "\nError sending chunk #{chunk_num}, retrying..."
            file.seek(chunk_num * CHUNK_SIZE)
            retry
          else
            raise "Failed to send chunk #{chunk_num} after #{MAX_RETRIES} attempts"
          end
        end
      end
    end
    
    socket.puts "DONE"
    socket.close
    puts "\nFile transfer complete"
  end
  
  def self.receive_file(port)
    server = TCPServer.new(port)
    client = server.accept
    
    # Receive file metadata
    metadata = client.gets.chomp
    filename, file_size = metadata.split(':')
    file_size = file_size.to_i
    
    puts "Receiving file: #{filename} (#{file_size} bytes)"
    
    File.open("received_#{filename}", 'wb') do |file|
      expected_chunk = 0
      
      loop do
        header = client.gets
        break if header.nil? || header.chomp == "DONE"
        
        # Parse chunk header
        _, chunk_num, expected_checksum = header.chomp.split(':')
        chunk_num = chunk_num.to_i
        
        # Read chunk data
        chunk = client.read(CHUNK_SIZE)
        
        # Verify checksum
        actual_checksum = Digest::SHA256.hexdigest(chunk)
        
        if actual_checksum == expected_checksum && chunk_num == expected_chunk
          file.write(chunk)
          client.puts "ACK:#{chunk_num}"
          expected_chunk += 1
          print "\rReceived #{expected_chunk * CHUNK_SIZE} / #{file_size} bytes"
        else
          # Corruption or out-of-order chunk
          client.puts "NAK:#{chunk_num}"
          puts "\nCorrupted chunk #{chunk_num}, requesting retransmission"
        end
      end
    end
    
    client.close
    server.close
    puts "\nFile received successfully"
  end
end

# Usage
# Server: ReliableFileTransfer.receive_file(9000)
# Client: ReliableFileTransfer.send_file('large_file.dat', 'localhost', 9000)

A distributed task queue demonstrates reliable message delivery with at-least-once semantics. Tasks are acknowledged only after successful processing, ensuring no task is lost even if workers crash:

require 'socket'
require 'json'
require 'securerandom'

class ReliableTaskQueue
  def initialize(port)
    @port = port
    @pending_tasks = {}
    @task_queue = Queue.new
  end
  
  def enqueue_task(task_data)
    task_id = SecureRandom.uuid
    task = {
      id: task_id,
      data: task_data,
      created_at: Time.now,
      attempts: 0
    }
    
    @pending_tasks[task_id] = task
    @task_queue.push(task)
    task_id
  end
  
  def start_server
    server = TCPServer.new(@port)
    puts "Task queue listening on port #{@port}"
    
    Thread.new { monitor_timeouts }
    
    loop do
      client = server.accept
      Thread.new { handle_worker(client) }
    end
  end
  
  private
  
  def handle_worker(socket)
    worker_id = socket.gets.chomp
    puts "Worker #{worker_id} connected"
    
    loop do
      # Send task to worker
      task = @task_queue.pop
      task[:attempts] += 1
      task[:assigned_at] = Time.now
      
      socket.puts task.to_json
      socket.flush
      
      # Wait for acknowledgment
      begin
        result = socket.gets(chomp: true)
        
        if result && result.start_with?('ACK')
          task_id = result.split(':')[1]
          @pending_tasks.delete(task_id)
          puts "Task #{task_id} completed by #{worker_id}"
        elsif result && result.start_with?('FAIL')
          task_id = result.split(':')[1]
          requeue_task(task_id)
          puts "Task #{task_id} failed, requeuing"
        end
      rescue
        # Worker disconnected, requeue unacknowledged task
        requeue_task(task[:id])
        break
      end
    end
    
    socket.close
  end
  
  def monitor_timeouts
    loop do
      sleep 5
      
      @pending_tasks.each do |task_id, task|
        if task[:assigned_at] && Time.now - task[:assigned_at] > 30
          puts "Task #{task_id} timed out, requeuing"
          requeue_task(task_id)
        end
      end
    end
  end
  
  def requeue_task(task_id)
    task = @pending_tasks[task_id]
    return unless task
    
    if task[:attempts] < 3
      task[:assigned_at] = nil
      @task_queue.push(task)
    else
      puts "Task #{task_id} exceeded max attempts, moving to dead letter queue"
      @pending_tasks.delete(task_id)
    end
  end
end

# Worker implementation
class TaskWorker
  def initialize(server_host, server_port, worker_id)
    @server_host = server_host
    @server_port = server_port
    @worker_id = worker_id
  end
  
  def start
    socket = TCPSocket.new(@server_host, @server_port)
    socket.puts @worker_id
    
    loop do
      task_json = socket.gets
      break unless task_json
      
      task = JSON.parse(task_json)
      
      begin
        # Process task
        result = process_task(task['data'])
        socket.puts "ACK:#{task['id']}"
        puts "Completed task #{task['id']}"
      rescue => e
        socket.puts "FAIL:#{task['id']}"
        puts "Failed task #{task['id']}: #{e.message}"
      end
      
      socket.flush
    end
    
    socket.close
  end
  
  private
  
  def process_task(data)
    # Simulate task processing
    sleep rand(1..3)
    data
  end
end

Implementing a chat application with message delivery guarantees shows how RDT ensures no messages are lost or duplicated:

require 'socket'
require 'json'

class ReliableChat
  def initialize(port)
    @port = port
    @clients = {}
    @message_log = []
    @sequence_number = 0
  end
  
  def start_server
    server = TCPServer.new(@port)
    puts "Chat server listening on port #{@port}"
    
    loop do
      socket = server.accept
      Thread.new { handle_client(socket) }
    end
  end
  
  private
  
  def handle_client(socket)
    username = socket.gets.chomp
    
    @clients[username] = {
      socket: socket,
      last_ack: -1
    }
    
    puts "#{username} joined the chat"
    broadcast("#{username} joined", username)
    
    # Send message history
    send_history(socket, username)
    
    # Handle incoming messages
    loop do
      message = socket.gets
      break unless message
      
      data = JSON.parse(message)
      
      if data['type'] == 'message'
        handle_message(username, data['content'])
      elsif data['type'] == 'ack'
        handle_ack(username, data['seq'])
      end
    end
    
    @clients.delete(username)
    broadcast("#{username} left", username)
    socket.close
  rescue
    @clients.delete(username)
    socket.close
  end
  
  def handle_message(sender, content)
    message = {
      seq: @sequence_number,
      sender: sender,
      content: content,
      timestamp: Time.now
    }
    
    @message_log << message
    @sequence_number += 1
    
    # Broadcast to all clients with retry logic
    @clients.each do |username, client_data|
      send_with_retry(client_data[:socket], message)
    end
  end
  
  def send_with_retry(socket, message, max_retries = 3)
    retries = 0
    
    begin
      socket.puts message.to_json
      socket.flush
    rescue
      retries += 1
      retry if retries < max_retries
    end
  end
  
  def handle_ack(username, seq)
    @clients[username][:last_ack] = seq
  end
  
  def send_history(socket, username)
    last_ack = @clients[username][:last_ack]
    
    @message_log.each do |message|
      if message[:seq] > last_ack
        send_with_retry(socket, message)
      end
    end
  end
  
  def broadcast(message, exclude_user = nil)
    @clients.each do |username, client_data|
      next if username == exclude_user
      send_with_retry(client_data[:socket], {
        seq: @sequence_number,
        sender: 'system',
        content: message,
        timestamp: Time.now
      })
    end
    @sequence_number += 1
  end
end

Common Patterns

Acknowledgment Schemes determine how the receiver confirms packet receipt. Positive acknowledgment with retransmission (PAR) uses ACK messages to confirm successful delivery. The sender retransmits if no ACK arrives within the timeout period. This pattern appears in almost all reliable protocols and forms the foundation of reliability.

Cumulative acknowledgments optimize network efficiency by acknowledging multiple packets with a single ACK. ACK n confirms receipt of all packets up through n. This reduces acknowledgment traffic but provides less granular feedback. If packets 1-5 are sent and packet 3 is lost, the receiver continues sending ACK 2 until packet 3 arrives, then sends ACK 5 to acknowledge packets 3-5 simultaneously. TCP uses cumulative acknowledgments extensively.

Selective acknowledgments (SACK) provide explicit feedback about which packets arrived. The receiver includes a bitmap or list of received packet ranges in the ACK. This allows the sender to retransmit only missing packets rather than going back N. TCP implements SACK as an option that must be negotiated during connection establishment:

# Simulating SACK behavior
class SelectiveAckReceiver
  def initialize
    @received_packets = {}
    @next_expected = 0
  end
  
  def receive_packet(seq_num, data)
    @received_packets[seq_num] = data
    
    # Build SACK ranges
    sack_ranges = []
    sorted_seqs = @received_packets.keys.sort
    
    range_start = sorted_seqs[0]
    range_end = sorted_seqs[0]
    
    sorted_seqs.each do |seq|
      if seq == range_end + 1
        range_end = seq
      else
        sack_ranges << [range_start, range_end]
        range_start = seq
        range_end = seq
      end
    end
    
    sack_ranges << [range_start, range_end]
    
    {
      ack: @next_expected - 1,
      sack_ranges: sack_ranges
    }
  end
  
  def deliver_data
    data = []
    while @received_packets.key?(@next_expected)
      data << @received_packets.delete(@next_expected)
      @next_expected += 1
    end
    data
  end
end

Timeout Management requires balancing responsiveness against spurious retransmissions. Fixed timeouts work poorly because network delays vary significantly. Adaptive timeout algorithms measure round-trip time (RTT) and adjust timeout values dynamically.

The exponential weighted moving average (EWMA) smooths RTT measurements to handle variation. EstimatedRTT = (1 - α) * EstimatedRTT + α * SampleRTT, where α typically equals 0.125. This gives more weight to recent measurements while smoothing out fluctuations. The timeout value includes a safety margin based on RTT deviation: TimeoutInterval = EstimatedRTT + 4 * DevRTT, where DevRTT tracks RTT variance.

Karn's algorithm addresses a critical problem in RTT measurement: when a packet is retransmitted, the sender cannot tell whether the ACK corresponds to the original transmission or the retransmission. The algorithm excludes retransmitted packets from RTT calculations, preventing skewed estimates. Additionally, exponential backoff doubles the timeout after each retransmission, reducing network congestion:

class AdaptiveTimeout
  def initialize
    @estimated_rtt = 1.0
    @dev_rtt = 0.0
    @timeout_interval = 1.0
    @alpha = 0.125
    @beta = 0.25
  end
  
  def update_rtt(sample_rtt)
    # Update estimated RTT
    @estimated_rtt = (1 - @alpha) * @estimated_rtt + @alpha * sample_rtt
    
    # Update RTT deviation
    @dev_rtt = (1 - @beta) * @dev_rtt + @beta * (@sample_rtt - @estimated_rtt).abs
    
    # Calculate timeout interval
    @timeout_interval = @estimated_rtt + 4 * @dev_rtt
  end
  
  def timeout_value
    @timeout_interval
  end
  
  def exponential_backoff(attempt)
    @timeout_interval * (2 ** attempt)
  end
end

Flow Control prevents fast senders from overwhelming slow receivers. The receiver advertises a receive window indicating available buffer space. The sender limits outstanding data to the advertised window size. TCP implements flow control through the rwnd field in the header.

The sliding window protocol implements flow control naturally. The window size limits how many packets can be in flight simultaneously. As the receiver processes data and frees buffer space, it sends ACKs with updated window advertisements. If the receive window reaches zero, the sender stops transmitting until the window reopens:

class FlowControlSender
  def initialize(initial_window)
    @send_buffer = []
    @receive_window = initial_window
    @next_seq = 0
    @base = 0
  end
  
  def send_data(data)
    while @send_buffer.size >= @receive_window
      sleep 0.01  # Wait for window to open
    end
    
    packet = { seq: @next_seq, data: data }
    @send_buffer << packet
    @next_seq += 1
    
    transmit(packet)
  end
  
  def handle_ack(ack_seq, window_update)
    # Remove acknowledged packets
    @send_buffer.reject! { |p| p[:seq] <= ack_seq }
    @base = ack_seq + 1
    
    # Update receive window
    @receive_window = window_update
  end
  
  def available_window
    @receive_window - @send_buffer.size
  end
end

Fast Retransmit reduces latency by triggering retransmission before timeout expiration. When the sender receives three duplicate ACKs (four identical ACKs total), it immediately retransmits the suspected lost packet. This pattern exploits the observation that duplicate ACKs signal either packet loss or reordering. Three duplicate ACKs strongly suggest loss rather than reordering.

Fast retransmit significantly improves performance by avoiding timeout delays. When packet 10 is lost among packets 1-20, packets 11-20 trigger duplicate ACKs for packet 9. After receiving three duplicate ACKs, the sender retransmits packet 10 without waiting for timeout. This reduces recovery time from seconds to milliseconds on typical networks.

Error Handling & Edge Cases

Packet Corruption occurs when bit errors flip packet contents during transmission. Checksums detect corruption but cannot correct it. The receiver must discard corrupted packets and rely on retransmission for recovery. Checksum algorithms trade detection strength against computation cost. Simple checksums like Internet checksum require minimal CPU but detect only certain error patterns. CRC detects all single-bit and double-bit errors plus all error bursts up to the CRC width. Cryptographic hashes like MD5 or SHA provide strong detection but consume more CPU:

require 'zlib'

class CorruptionDetector
  def self.create_packet(data)
    # Use CRC32 for corruption detection
    checksum = Zlib.crc32(data)
    [checksum, data].pack('NA*')
  end
  
  def self.verify_packet(packet)
    checksum, data = packet.unpack('NA*')
    computed = Zlib.crc32(data)
    
    if computed == checksum
      { valid: true, data: data }
    else
      { valid: false, error: 'Checksum mismatch' }
    end
  end
  
  def self.simulate_corruption(packet, corruption_rate = 0.01)
    bytes = packet.bytes
    
    bytes.map! do |byte|
      if rand < corruption_rate
        byte ^ (1 << rand(8))  # Flip random bit
      else
        byte
      end
    end
    
    bytes.pack('C*')
  end
end

# Example usage showing corruption detection
data = "Important message"
packet = CorruptionDetector.create_packet(data)

# Simulate transmission with corruption
corrupted = CorruptionDetector.simulate_corruption(packet, 0.1)
result = CorruptionDetector.verify_packet(corrupted)

puts result[:valid] ? "Valid packet" : "Corrupted: #{result[:error]}"

Duplicate Packets arise when the sender retransmits a packet that actually arrived but whose ACK was lost or delayed. Sequence numbers enable the receiver to detect and discard duplicates. The receiver maintains the highest in-order sequence number received and rejects packets with sequence numbers at or below this threshold. However, wraparound in sequence numbers creates ambiguity if the window size is too large.

Consider sequence numbers using modulo 4 (sequence space = 4) with window size 3. The sender transmits packets 0, 1, 2 and receives ACK 2. It slides the window and transmits packet 3. If packet 0's ACK was delayed (not lost), the receiver might receive packet 0 from the next window cycle before receiving packet 3. The receiver cannot distinguish this "new" packet 0 from a duplicate of the old packet 0. The constraint N <= sequence_space / 2 prevents this ambiguity:

class SequenceNumberManager
  def initialize(sequence_bits)
    @sequence_space = 2 ** sequence_bits
    @max_window = @sequence_space / 2
    @next_seq = 0
    @expected_seq = 0
  end
  
  def get_next_sequence
    seq = @next_seq
    @next_seq = (@next_seq + 1) % @sequence_space
    seq
  end
  
  def is_duplicate?(seq)
    # Check if sequence number is behind expected
    distance = (@expected_seq - seq) % @sequence_space
    distance > 0 && distance <= @max_window
  end
  
  def is_in_window?(seq)
    distance = (seq - @expected_seq) % @sequence_space
    distance < @max_window
  end
  
  def update_expected(seq)
    @expected_seq = (seq + 1) % @sequence_space
  end
end

Packet Reordering happens when packets take different network paths with varying delays. A packet sent earlier may arrive later than subsequent packets. The receiver must buffer out-of-order packets and deliver data to the application in sequence. Go-back-N handles reordering by discarding out-of-order packets, while selective repeat buffers them.

Buffer management becomes critical with reordering. The receiver allocates buffer space for the receive window but must handle cases where many packets arrive out of order. If the buffer fills, the receiver must either drop packets or implement back pressure:

class ReorderingBuffer
  def initialize(window_size)
    @window_size = window_size
    @buffer = {}
    @next_expected = 0
  end
  
  def receive_packet(seq, data)
    # Check if packet is in acceptable range
    offset = seq - @next_expected
    
    if offset < 0
      return { status: :duplicate, delivered: [] }
    elsif offset >= @window_size
      return { status: :window_exceeded, delivered: [] }
    end
    
    # Buffer the packet
    @buffer[seq] = data
    
    # Deliver consecutive packets starting from next_expected
    delivered = []
    while @buffer.key?(@next_expected)
      delivered << @buffer.delete(@next_expected)
      @next_expected += 1
    end
    
    {
      status: :accepted,
      delivered: delivered,
      buffer_size: @buffer.size
    }
  end
  
  def get_buffer_status
    if @buffer.empty?
      { empty: true }
    else
      gaps = find_gaps
      {
        empty: false,
        buffered_count: @buffer.size,
        gaps: gaps
      }
    end
  end
  
  private
  
  def find_gaps
    return [] if @buffer.empty?
    
    sorted_seqs = @buffer.keys.sort
    gaps = []
    
    expected = @next_expected
    sorted_seqs.each do |seq|
      gaps << [expected, seq - 1] if seq > expected
      expected = seq + 1
    end
    
    gaps
  end
end

Connection Failures require detecting dead connections and cleaning up state. TCP implements keepalive probes that periodically send packets to verify the peer is still responsive. Applications should set appropriate socket timeouts and handle disconnection gracefully:

class ConnectionMonitor
  KEEPALIVE_INTERVAL = 5
  KEEPALIVE_TIMEOUT = 15
  
  def initialize(socket)
    @socket = socket
    @last_activity = Time.now
    @keepalive_thread = nil
  end
  
  def start_monitoring
    @keepalive_thread = Thread.new do
      loop do
        sleep KEEPALIVE_INTERVAL
        
        if Time.now - @last_activity > KEEPALIVE_TIMEOUT
          puts "Connection timeout detected"
          @socket.close
          break
        end
        
        begin
          # Send keepalive probe
          @socket.write("\0")
          @socket.flush
        rescue
          puts "Connection lost"
          break
        end
      end
    end
  end
  
  def record_activity
    @last_activity = Time.now
  end
  
  def stop_monitoring
    @keepalive_thread&.kill
  end
end

Sequence Number Wraparound creates ambiguity when sequence numbers cycle back to zero. Modern protocols use 32-bit sequence numbers (TCP) providing 4 billion unique values. At 10 Gbps, sequence numbers wrap around in approximately 3 seconds, so the protocol must ensure that packets from different cycles cannot coexist in the network. The maximum segment lifetime (MSL) bounds how long packets survive, typically 2 minutes. As long as the wraparound period exceeds 2*MSL, no ambiguity occurs.

Premature Timeouts fire before packets actually arrive, causing unnecessary retransmissions. This wastes bandwidth and can create duplicate packets. Adaptive timeout algorithms reduce premature timeouts by tracking network behavior, but extreme delay variation makes perfect timeout selection impossible. Retransmission ambiguity arises: when both the original packet and its retransmission arrive, the receiver sees duplicates and must discard one based on sequence numbers.

Reference

Protocol Comparison

Protocol	Window Size	Sender Buffer	Receiver Buffer	Retransmission	Throughput	Complexity
Stop-and-Wait	1	1 packet	1 packet	Single packet	Low	Minimal
Go-Back-N	N packets	N packets	1 packet	From error point	Medium	Moderate
Selective Repeat	N packets	N packets	N packets	Only lost packets	High	High

Reliability Mechanisms

Mechanism	Purpose	Implementation	Overhead
Checksums	Detect corruption	CRC, MD5, SHA	Per packet computation
Sequence Numbers	Detect duplicates/reordering	Counter with wraparound	Header space
Acknowledgments	Confirm receipt	ACK/SACK messages	Reverse traffic
Timeouts	Detect loss	Adaptive timer	State management
Flow Control	Prevent overflow	Window advertisement	Coordination delay

TCP Socket Options

Option	Level	Purpose	Typical Values
TCP_NODELAY	IPPROTO_TCP	Disable Nagle algorithm	0 (off), 1 (on)
SO_KEEPALIVE	SOL_SOCKET	Enable keepalive probes	0 (off), 1 (on)
SO_RCVBUF	SOL_SOCKET	Receive buffer size	32768-262144 bytes
SO_SNDBUF	SOL_SOCKET	Send buffer size	32768-262144 bytes
SO_REUSEADDR	SOL_SOCKET	Reuse address quickly	0 (off), 1 (on)
TCP_KEEPIDLE	IPPROTO_TCP	Keepalive idle time	7200 seconds
TCP_KEEPINTVL	IPPROTO_TCP	Keepalive interval	75 seconds
TCP_KEEPCNT	IPPROTO_TCP	Keepalive probe count	9 probes

Timeout Calculations

Parameter	Formula	Purpose
EstimatedRTT	(1-α)EstimatedRTT + αSampleRTT	Smooth RTT average
DevRTT	(1-β)DevRTT + βabs(SampleRTT-EstimatedRTT)	RTT variation
TimeoutInterval	EstimatedRTT + 4*DevRTT	Timeout value
Exponential Backoff	TimeoutInterval * 2^attempt	Retry timeout

Window Size Constraints

Protocol	Minimum Sequence Space	Formula
Stop-and-Wait	2	seq_space >= 2
Go-Back-N	N + 1	seq_space >= N + 1
Selective Repeat	2N	seq_space >= 2N

Performance Metrics

Metric	Formula	Factors
Link Utilization	(L/R) / (RTT + L/R)	Packet size, data rate, RTT
Effective Throughput	WindowSize * PacketSize / RTT	Window size, RTT
Goodput	(TotalDataDelivered - Retransmissions) / Time	Loss rate, efficiency
Bandwidth-Delay Product	Bandwidth * RTT	Link capacity, latency

Error Detection Strength

Algorithm	Bits	Detection Capability
Internet Checksum	16	Basic error detection
CRC-32	32	All single/double bit errors, bursts up to 32 bits
MD5	128	Cryptographically strong
SHA-256	256	Maximum strength

Ruby Class Quick Reference

Class	Purpose	Key Methods
TCPSocket	TCP client connection	new, read, write, close
TCPServer	TCP server listener	new, accept, close
UDPSocket	UDP datagram socket	new, send, recvfrom, bind
Socket	Low-level socket operations	setsockopt, getsockopt
IO	select for multiplexing	select with timeout
Timeout	Timeout operations	timeout block

Reliable Data Transfer