CrackedRuby - TCP Congestion Control

Overview

TCP congestion control manages the rate at which data flows through a network connection to prevent overwhelming intermediate routers and the destination host. The mechanism adjusts transmission speed dynamically based on network conditions, balancing throughput against packet loss and latency.

Congestion occurs when network demand exceeds available capacity. Without control mechanisms, senders would transmit at maximum speed regardless of network conditions, causing buffer overflows at routers, packet drops, and cascading failures as retransmissions compound the problem. TCP congestion control addresses this through algorithms that probe available bandwidth, detect congestion signals, and adjust sending rates accordingly.

The TCP protocol implements congestion control through a congestion window (cwnd) that limits the amount of unacknowledged data in flight. This window operates alongside the receiver's advertised window, with the effective window size being the minimum of the two. The sender increases cwnd when receiving acknowledgments and decreases it upon detecting congestion signals like packet loss or explicit congestion notifications.

# TCP socket establishes connection with congestion control active
require 'socket'

socket = TCPSocket.new('example.com', 80)
# Congestion window starts small and grows based on network conditions
socket.write("GET / HTTP/1.1\r\nHost: example.com\r\nConnection: close\r\n\r\n")
response = socket.read  # Blocks until the server closes the connection
socket.close
# => response holds the data after TCP has handled any retransmissions

Congestion control operates at the transport layer, invisible to application code but critical to network stability and performance. The mechanisms evolved from Van Jacobson's 1988 algorithms that prevented Internet collapse, through modern variants optimized for high-speed networks, data centers, and wireless environments.

Key Principles

TCP congestion control rests on four fundamental algorithms that work together: slow start, congestion avoidance, fast retransmit, and fast recovery. These algorithms manipulate the congestion window based on acknowledgment patterns and loss detection.

Slow start initializes connections conservatively. The congestion window begins at a small value (historically one to four maximum segment sizes; RFC 6928 raised the common default to ten) and grows by one segment for each acknowledgment received, which doubles it every round-trip time (RTT). This exponential growth continues until the window reaches a threshold called ssthresh (slow start threshold) or packet loss is detected. The algorithm probes available bandwidth rapidly while avoiding immediate congestion.
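
The doubling described above can be sketched in a few lines. This is a minimal illustration that assumes an idealized path with no loss and one acknowledgment per segment; the method name `slow_start_rounds` and its default values are hypothetical and not taken from any real TCP stack.

```ruby
# Idealized slow start: cwnd (in segments) doubles every RTT until it
# reaches ssthresh. Assumes no loss and no delayed ACKs.
def slow_start_rounds(initial_cwnd: 1, ssthresh: 64)
  cwnd = initial_cwnd
  history = [cwnd]
  while cwnd < ssthresh
    cwnd = [cwnd * 2, ssthresh].min  # Never overshoot the threshold
    history << cwnd
  end
  history
end

p slow_start_rounds
# prints [1, 2, 4, 8, 16, 32, 64]
```

Each array element is the window at the start of one round trip, so reaching 64 segments from a single segment takes six RTTs in this model.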

Congestion avoidance takes over when cwnd reaches ssthresh. Instead of exponential growth, the window increases linearly, typically by one segment per RTT. This additive increase allows the connection to continue probing for bandwidth while reducing the risk of sudden congestion. When packet loss occurs, ssthresh is set to half the current cwnd, and cwnd is reduced based on the specific algorithm variant.
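
The additive increase and halving on loss can be traced one RTT at a time. The sketch below is illustrative only: `aimd_trace` and its `loss_rounds` parameter are hypothetical names, and the model ignores slow start and timeouts.

```ruby
# Additive increase, multiplicative decrease in units of segments.
# loss_rounds lists the RTT indexes at which a loss is detected
# (a made-up input used to drive the example).
def aimd_trace(start_cwnd:, rounds:, loss_rounds: [])
  cwnd = start_cwnd
  trace = []
  rounds.times do |rtt|
    if loss_rounds.include?(rtt)
      cwnd = [cwnd / 2, 1].max  # Multiplicative decrease
    else
      cwnd += 1                 # Additive increase: one segment per RTT
    end
    trace << cwnd
  end
  trace
end

p aimd_trace(start_cwnd: 10, rounds: 8, loss_rounds: [4])
# prints [11, 12, 13, 14, 7, 8, 9, 10]
```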

Fast retransmit addresses the delay inherent in timeout-based retransmission. When a receiver gets out-of-order segments, it immediately sends duplicate acknowledgments (dupacks) for the last in-order segment received. If the sender receives three duplicate acknowledgments, it retransmits the missing segment without waiting for a timeout. This mechanism dramatically reduces retransmission latency.
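
A sender-side duplicate acknowledgment counter might look like the following. This is a sketch under simplifying assumptions (cumulative ACK numbers only, no SACK); `DupAckDetector` is an illustrative name, not a real library class.

```ruby
# Counts duplicate ACKs and reports when the third duplicate arrives.
class DupAckDetector
  DUPACK_THRESHOLD = 3

  def initialize
    @last_ack = nil
    @dup_count = 0
  end

  # Returns the sequence number to retransmit, or nil if no action is needed.
  # An ACK number names the next byte the receiver expects, so that is
  # where the missing segment starts.
  def on_ack(ack_number)
    if ack_number == @last_ack
      @dup_count += 1
      return ack_number if @dup_count == DUPACK_THRESHOLD
    else
      @last_ack = ack_number
      @dup_count = 0
    end
    nil
  end
end

detector = DupAckDetector.new
acks = [1000, 2000, 2000, 2000, 2000]
p acks.map { |ack| detector.on_ack(ack) }
# prints [nil, nil, nil, nil, 2000]
```

The retransmission fires on the fourth ACK carrying 2000: the original plus three duplicates.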

Fast recovery handles the post-retransmission phase. After fast retransmit, instead of returning to slow start, the algorithm maintains a higher congestion window. Each additional duplicate acknowledgment (indicating more out-of-order data has arrived) temporarily increases cwnd, allowing new data transmission. When a new acknowledgment arrives confirming the retransmitted data, cwnd is set to ssthresh and congestion avoidance resumes.
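
The window arithmetic during fast recovery can be sketched as below. The inflation by three segments on entry follows the procedure in RFC 5681, which the paragraph above does not spell out; the method name and parameters are hypothetical, and the sketch works in whole segments rather than bytes.

```ruby
# Window changes during Reno-style fast recovery, in segments.
def fast_recovery_trace(cwnd_at_loss:, extra_dupacks:)
  ssthresh = [cwnd_at_loss / 2, 2].max
  cwnd = ssthresh + 3        # Account for the three dupacks already seen
  trace = [cwnd]
  extra_dupacks.times do
    cwnd += 1                # Each further dupack means a segment left the network
    trace << cwnd
  end
  cwnd = ssthresh            # New ACK: deflate and resume congestion avoidance
  trace << cwnd
  { ssthresh: ssthresh, trace: trace }
end

result = fast_recovery_trace(cwnd_at_loss: 20, extra_dupacks: 4)
puts "ssthresh: #{result[:ssthresh]}"
puts "cwnd trace: #{result[:trace].join(', ')}"
```

With a window of 20 segments at the time of loss, ssthresh becomes 10 and the window runs 13, 14, 15, 16, 17 before deflating to 10.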

The algorithms detect congestion through two primary signals: packet loss and explicit congestion notification (ECN). Packet loss manifests as either timeout (indicating severe congestion) or duplicate acknowledgments (indicating moderate congestion). ECN allows routers to mark packets when queues fill, providing early congestion signals before packet drops occur.

The relationship between these components creates a saw-tooth pattern in throughput graphs. The congestion window grows during congestion avoidance until packet loss occurs, drops by approximately half, then begins growing again. This pattern represents TCP probing for available bandwidth while backing off when hitting network limits.

# Demonstrate TCP behavior under packet loss
require 'socket'

def measure_throughput(host, port, duration)
  socket = TCPSocket.new(host, port)
  start_time = Time.now
  bytes_sent = 0
  
  while Time.now - start_time < duration
    begin
      # TCP automatically adjusts sending rate based on ACKs and losses
      bytes = socket.write("X" * 8192)
      bytes_sent += bytes
    rescue Errno::EPIPE, Errno::ECONNRESET
      # Peer closed or reset the connection; stop sending
      break
    end
  end
  
  bytes_sent / duration.to_f
  # => Bytes per second accepted by the kernel; varies as TCP adjusts the congestion window
ensure
  socket.close if socket
end

The congestion window and the receiver's advertised window are tracked independently. The advertised window represents the receiver's buffer capacity, and the sender limits unacknowledged data to the minimum of cwnd and the receiver window. This interaction ensures that neither the network nor the receiver is overwhelmed.
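
As a small worked example of the minimum rule, the sketch below computes how many more bytes a sender may transmit. The method name `sendable_bytes` and the byte values are hypothetical.

```ruby
# Bytes the sender may still transmit, given both windows and the data
# already in flight. All values are in bytes.
def sendable_bytes(cwnd:, rwnd:, in_flight:)
  effective_window = [cwnd, rwnd].min
  [effective_window - in_flight, 0].max
end

# Congestion-limited: cwnd is the smaller window
puts sendable_bytes(cwnd: 14_600, rwnd: 65_535, in_flight: 10_000)
# prints 4600

# Receiver-limited: rwnd is the smaller window and is already full
puts sendable_bytes(cwnd: 100_000, rwnd: 65_535, in_flight: 65_535)
# prints 0
```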

Implementation Approaches

Multiple congestion control algorithms exist, each optimizing for different network characteristics and performance goals. The selection affects throughput, latency, fairness, and behavior under various network conditions.

TCP Tahoe represents the original implementation of Van Jacobson's congestion control algorithms. Upon detecting any packet loss (timeout or duplicate acknowledgments), Tahoe sets ssthresh to half the current cwnd, reduces cwnd to one segment, and returns to slow start. This conservative approach caused significant throughput reduction even for minor packet losses, leading to the development of improved variants.

TCP Reno introduced fast recovery to avoid returning to slow start after fast retransmit. When receiving three duplicate acknowledgments, Reno halves cwnd and ssthresh, retransmits the missing segment, and enters fast recovery mode. The algorithm maintains a higher congestion window during recovery, only returning to slow start on timeout. Reno's improvements made it the dominant algorithm through the 1990s and 2000s.

# Simulate Reno-style congestion window behavior
class TCPRenoSimulator
  attr_reader :cwnd, :ssthresh
  
  def initialize(initial_cwnd = 1, initial_ssthresh = 64)
    @cwnd = initial_cwnd
    @ssthresh = initial_ssthresh
    @state = :slow_start
  end
  
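  def on_ack_received
    case @state
    when :slow_start
      @cwnd += 1  # One segment per ACK doubles cwnd every RTT
      @state = :congestion_avoidance if @cwnd >= @ssthresh
    when :congestion_avoidance
      @cwnd += 1.0 / @cwnd  # Linear growth (approximately one segment per RTT)
    when :fast_recovery
      # A new ACK covers the retransmission: deflate and resume avoidance
      @cwnd = @ssthresh
      @state = :congestion_avoidance
    end
  end
  
  def on_triple_dupack
    @ssthresh = [@cwnd / 2, 2].max
    @cwnd = @ssthresh
    @state = :fast_recovery
  end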
  
  def on_timeout
    @ssthresh = [@cwnd / 2, 2].max
    @cwnd = 1
    @state = :slow_start
  end
end

TCP NewReno addresses Reno's limitation with multiple packet losses in a single window. Reno exits fast recovery after the first retransmitted segment is acknowledged, even if other segments remain lost. NewReno remains in fast recovery until all segments in the window are acknowledged, handling multiple losses more gracefully. The algorithm uses partial acknowledgments (acks that advance the window but don't acknowledge all outstanding data) as signals to retransmit the next missing segment.
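
The distinction between partial and full acknowledgments can be sketched as a simple classification. This is an assumption-laden illustration: `classify_recovery_ack` is a hypothetical helper, and `recovery_point` is taken to be the next sequence number the sender would have used when loss was detected, so an ACK at or beyond it covers everything that was outstanding.

```ruby
# Classifies an ACK that arrives while the sender is in fast recovery.
def classify_recovery_ack(ack_number, last_acked:, recovery_point:)
  if ack_number <= last_acked
    :duplicate  # No progress: more out-of-order data reached the receiver
  elsif ack_number < recovery_point
    :partial    # Progress, but another hole remains: retransmit the next segment
  else
    :full       # Everything outstanding at loss detection is acknowledged
  end
end

p classify_recovery_ack(1000, last_acked: 1000, recovery_point: 5000)
# prints :duplicate
p classify_recovery_ack(3000, last_acked: 1000, recovery_point: 5000)
# prints :partial
p classify_recovery_ack(5000, last_acked: 1000, recovery_point: 5000)
# prints :full
```

Reno would leave recovery on the second ACK in this example; NewReno stays in recovery until the third.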

TCP CUBIC optimizes for high-bandwidth, high-latency networks where traditional algorithms fail to utilize available capacity. Instead of linear growth, CUBIC uses a cubic function centered around the window size where the last congestion event occurred. The window grows rapidly when far below this point, flattens as it approaches it, and accelerates again after passing it to probe for new capacity. This behavior allows faster recovery of lost bandwidth while maintaining stability near the previous limit. CUBIC became the Linux default in 2006 (kernel 2.6.19) and remains widely deployed.

The cubic growth function is: W(t) = C(t - K)^3 + Wmax where Wmax is the window size at the last congestion event, K is the time period to reach Wmax, C is a scaling constant, and t is time since the last congestion event.
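
The formula can be evaluated directly. The sketch below assumes the constants given in RFC 8312 (C = 0.4 and a multiplicative decrease factor of 0.7) and computes K as the cube root of Wmax * (1 - beta) / C; none of these values appear in the text above, so treat them as assumptions. The method names are illustrative.

```ruby
# CUBIC window as a function of time since the last congestion event.
# Constants are assumed from RFC 8312; real kernels may be tuned differently.
CUBIC_C = 0.4
CUBIC_BETA = 0.7

# Time (seconds) at which the curve returns to w_max
def cubic_k(w_max)
  Math.cbrt(w_max * (1 - CUBIC_BETA) / CUBIC_C)
end

def cubic_window(t, w_max)
  CUBIC_C * (t - cubic_k(w_max))**3 + w_max
end

w_max = 100.0
k = cubic_k(w_max)
[0, k / 2, k, k * 1.5].each do |t|
  puts format('t=%.2fs  W=%.1f segments', t, cubic_window(t, w_max))
end
```

At t = 0 the window equals beta * Wmax (70 segments here), and at t = K it is back at Wmax. Most of the recovery happens in the first half of that interval, which shows the fast-then-flat shape of the curve.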

TCP BBR (Bottleneck Bandwidth and Round-trip propagation time) represents a fundamentally different approach. Rather than using packet loss as the primary congestion signal, BBR builds a model of the network path's bottleneck bandwidth and minimum RTT. The algorithm operates in distinct phases: startup (rapidly probing bandwidth), drain (draining queues built during startup), probe bandwidth (cycling sending rate to detect changes), and probe RTT (periodically measuring minimum RTT). BBR maintains high throughput with lower latency and reduced packet loss compared to loss-based algorithms, particularly on paths with inherent packet loss like wireless networks.

# Simplified BBR-style model tracking
class BBRModel
  attr_reader :bottleneck_bandwidth, :min_rtt
  
  def initialize
    @bottleneck_bandwidth = 0
    @min_rtt = Float::INFINITY
    @bandwidth_samples = []
    @rtt_samples = []
  end
  
  def update_bandwidth(bytes_delivered, time_interval)
    bandwidth = bytes_delivered / time_interval.to_f  # Avoid integer division
    @bandwidth_samples << bandwidth
    @bandwidth_samples.shift if @bandwidth_samples.size > 10
    @bottleneck_bandwidth = @bandwidth_samples.max
  end
  
  def update_rtt(measured_rtt)
    @rtt_samples << measured_rtt
    @rtt_samples.shift if @rtt_samples.size > 10
    @min_rtt = @rtt_samples.min
  end
  
  def bdp
    # Bandwidth-delay product: optimal data in flight
    @bottleneck_bandwidth * @min_rtt
  end
  
  def pacing_rate(gain = 1.0)
    # BBR paces at bottleneck_bandwidth * gain
    @bottleneck_bandwidth * gain
  end
end
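
The probe bandwidth phase can be illustrated with a pacing-gain cycle. The gain values below (one round at 1.25, one at 0.75, then six at 1.0) follow the published BBR v1 description and are assumptions here, not something the model above defines; `probe_bw_pacing_rates` is a hypothetical helper.

```ruby
# Pacing-gain cycle for the probe bandwidth phase, one entry per RTT.
# Values assumed from the BBR v1 description.
PROBE_BW_GAINS = [1.25, 0.75, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0].freeze

# Returns the pacing rate for each round of one gain cycle.
def probe_bw_pacing_rates(bottleneck_bandwidth)
  PROBE_BW_GAINS.map { |gain| bottleneck_bandwidth * gain }
end

p probe_bw_pacing_rates(100.0)
# prints [125.0, 75.0, 100.0, 100.0, 100.0, 100.0, 100.0, 100.0]
```

The first round sends above the estimated bandwidth to test for extra capacity, and the second sends below it to drain any queue that the probe created.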

Data Center TCP (DCTCP) targets low-latency data center environments. DCTCP uses ECN marks from switches to estimate the extent of congestion, reducing cwnd proportionally to the fraction of marked packets rather than cutting it in half regardless of congestion severity. This approach maintains high throughput while achieving microsecond-scale latency. DCTCP requires ECN support throughout the network path and switch configuration for early marking.
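
The proportional reduction can be sketched with a moving average of the marked fraction. The update rule and the gain of 1/16 follow RFC 8257; the class name `DctcpEstimator` is hypothetical, and the initial estimate of 1.0 is an assumption based on that RFC's conservative starting point.

```ruby
# DCTCP-style congestion estimate. alpha is an exponentially weighted
# average of the fraction of ECN-marked packets per window of data.
class DctcpEstimator
  attr_reader :alpha

  def initialize(gain: 1.0 / 16, initial_alpha: 1.0)
    @gain = gain
    @alpha = initial_alpha
  end

  # Call once per window of data with the observed marking counts.
  def update(marked_packets, total_packets)
    fraction = total_packets.zero? ? 0.0 : marked_packets.to_f / total_packets
    @alpha = (1 - @gain) * @alpha + @gain * fraction
  end

  # Window to use after a round in which marks were seen.
  def reduced_cwnd(cwnd)
    cwnd * (1 - @alpha / 2)
  end
end

estimator = DctcpEstimator.new(initial_alpha: 0.0)
estimator.update(2, 10)  # 20% of packets marked in this window
puts "alpha: #{estimator.alpha.round(4)}"
puts "cwnd after reduction: #{estimator.reduced_cwnd(100).round(1)}"
```

With light marking the window shrinks by well under one percent, whereas an estimate of 1.0 (every packet marked) reproduces the conventional halving.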

Algorithm selection depends on the deployment environment. CUBIC works well for general Internet traffic. BBR excels on high-speed links and paths with variable latency or packet loss. DCTCP requires data center environments with ECN-capable switches. Legacy algorithms like Reno remain in some embedded systems and specialized devices.

Ruby Implementation

Ruby applications interact with TCP congestion control through the socket API. The operating system kernel handles algorithm implementation, but applications can influence behavior through socket options and sending patterns.

TCP sockets in Ruby automatically participate in congestion control. The kernel manages the congestion window, retransmissions, and acknowledgments transparently. Applications need not implement congestion control but should understand how their sending patterns interact with the underlying mechanisms.

require 'socket'

# Basic TCP client with congestion control active
def tcp_client_example
  socket = TCPSocket.new('example.com', 8080)
  
  # Set TCP_NODELAY to disable Nagle's algorithm
  # This affects interaction with congestion control
  socket.setsockopt(Socket::IPPROTO_TCP, Socket::TCP_NODELAY, 1)
  
  # Check current socket buffer sizes
  send_buffer = socket.getsockopt(Socket::SOL_SOCKET, Socket::SO_SNDBUF).int
  recv_buffer = socket.getsockopt(Socket::SOL_SOCKET, Socket::SO_RCVBUF).int
  
  puts "Send buffer: #{send_buffer} bytes"
  puts "Receive buffer: #{recv_buffer} bytes"
  
  # Write data - kernel handles congestion control
  data = "X" * 1_000_000
  socket.write(data)
  
  # Read response
  response = socket.read(1024)
  
  socket.close
end

The TCP_NODELAY option disables Nagle's algorithm, which delays small packets to combine them with subsequent writes. Disabling Nagle's algorithm reduces latency for interactive applications but can decrease efficiency and interact negatively with congestion control by sending many small packets. The default behavior (Nagle enabled) works well for bulk data transfer where congestion control can operate on larger segments.

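Socket buffer sizes affect how much data the kernel can queue for transmission and reception. Larger buffers allow more unacknowledged data in flight, potentially improving throughput on high-bandwidth, high-latency paths. For optimal performance the send buffer must hold at least the bandwidth-delay product of the connection. On Linux the kernel doubles the requested value to cover bookkeeping overhead and caps it at net.core.wmem_max, so the size reported by getsockopt can differ from the request.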

require 'socket'

class TCPThroughputTest
  def initialize(host, port)
    @host = host
    @port = port
  end
  
  def test_with_buffer_size(send_buffer_size)
    socket = TCPSocket.new(@host, @port)
    
    # Set send buffer size
    socket.setsockopt(Socket::SOL_SOCKET, Socket::SO_SNDBUF, send_buffer_size)
    
    # Verify actual buffer size (kernel may adjust)
    actual_size = socket.getsockopt(Socket::SOL_SOCKET, Socket::SO_SNDBUF).int
    
    # Measure throughput
    start_time = Time.now
    bytes_sent = 0
    duration = 10
    
    while Time.now - start_time < duration
      bytes = socket.write("X" * 8192)
      bytes_sent += bytes
    end
    
    throughput = bytes_sent / duration
    
    {
      requested_buffer: send_buffer_size,
      actual_buffer: actual_size,
      throughput: throughput,
      megabits_per_second: (throughput * 8) / 1_000_000.0
    }
  ensure
    socket.close if socket
  end
end

# Test with different buffer sizes
tester = TCPThroughputTest.new('example.com', 8080)
[64_000, 128_000, 256_000, 512_000].each do |size|
  result = tester.test_with_buffer_size(size)
  puts "Buffer: #{result[:actual_buffer]}, " \
       "Throughput: #{result[:megabits_per_second].round(2)} Mbps"
end

Linux exposes TCP congestion control algorithm selection through the TCP_CONGESTION socket option. Applications can read the algorithm in use on a socket and request a different one; the kernel rejects names that are not loaded or not permitted for unprivileged processes.

require 'socket'

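# TCP_CONGESTION socket option (Linux-specific). Ruby does not allow constant
# assignment inside a method body, so define it once at the top level and
# fall back to the Linux value when Socket does not provide it.
TCP_CONGESTION = Socket.const_defined?(:TCP_CONGESTION) ? Socket::TCP_CONGESTION : 13

def get_tcp_congestion_control(socket)
  begin
    option = socket.getsockopt(Socket::IPPROTO_TCP, TCP_CONGESTION)
    option.data.unpack('Z*').first
  rescue SystemCallError
    "Not supported on this system"
  end
end

def set_tcp_congestion_control(socket, algorithm)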
  
  begin
    socket.setsockopt(Socket::IPPROTO_TCP, TCP_CONGESTION, algorithm)
    true
  rescue SystemCallError => e
    puts "Failed to set algorithm: #{e.message}"
    false
  end
end

socket = TCPSocket.new('example.com', 80)
current_algorithm = get_tcp_congestion_control(socket)
puts "Current algorithm: #{current_algorithm}"

# Attempt to switch to BBR
if set_tcp_congestion_control(socket, 'bbr')
  puts "Switched to BBR"
else
  puts "BBR not available, using #{current_algorithm}"
end
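
On Linux, the algorithms that a socket can select are listed in two files under /proc. The sketch below reads them; it is a minimal illustration that returns an empty list when the files are missing (for example on other operating systems), and `read_algorithm_list` is a hypothetical helper name.

```ruby
# Reads a space-separated sysctl file and returns its entries.
# Returns an empty array when the file cannot be read.
def read_algorithm_list(path)
  File.read(path).split
rescue SystemCallError
  []
end

available = read_algorithm_list('/proc/sys/net/ipv4/tcp_available_congestion_control')
allowed   = read_algorithm_list('/proc/sys/net/ipv4/tcp_allowed_congestion_control')

puts "Available: #{available.join(', ')}"
puts "Allowed for unprivileged processes: #{allowed.join(', ')}"
```

Checking the allowed list before calling setsockopt avoids a failed system call when an algorithm such as BBR is loaded but restricted to privileged processes.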

Monitoring TCP statistics reveals congestion control behavior. The netstat and ss command-line tools expose detailed connection information, and Ruby can parse this output or use system calls to gather metrics.

require 'socket'

class TCPConnectionMonitor
  def initialize(socket)
    @socket = socket
  end
  
  def local_address
    @socket.local_address
  end
  
  def remote_address
    @socket.remote_address
  end
  
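  def connection_info
    # TCP_INFO socket option (Linux-specific). A local variable is used
    # because Ruby does not allow constant assignment inside a method body.
    tcp_info = Socket.const_defined?(:TCP_INFO) ? Socket::TCP_INFO : 11
    
    begin
      info = @socket.getsockopt(Socket::IPPROTO_TCP, tcp_info)
      parse_tcp_info(info.data)
    rescue SystemCallError
      { error: "TCP_INFO not supported" }
    end
  end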
  
  private
  
  def parse_tcp_info(data)
    # TCP_INFO structure varies by system
    # This is a simplified example
    {
      state: "Connection active",
      rtt: "Round-trip time from kernel",
      rtt_var: "RTT variation",
      snd_cwnd: "Current congestion window",
      retransmits: "Number of retransmissions"
    }
  end
end

Server applications should configure appropriate buffer sizes and socket options based on expected connection characteristics. High-throughput servers benefit from larger buffers, while latency-sensitive applications may optimize for smaller buffers and TCP_NODELAY.

require 'socket'

# Named TunedTCPServer so that it wraps, rather than reopens, Ruby's
# built-in TCPServer class
class TunedTCPServer
  def initialize(port, buffer_size: 256_000, nodelay: false)
    @server = TCPServer.new(port)
    @buffer_size = buffer_size
    @nodelay = nodelay
  end
  
  def accept_connection
    client = @server.accept
    configure_socket(client)
    client
  end
  
  private
  
  def configure_socket(socket)
    # Set buffer sizes
    socket.setsockopt(Socket::SOL_SOCKET, Socket::SO_SNDBUF, @buffer_size)
    socket.setsockopt(Socket::SOL_SOCKET, Socket::SO_RCVBUF, @buffer_size)
    
    # Configure TCP_NODELAY if requested
    if @nodelay
      socket.setsockopt(Socket::IPPROTO_TCP, Socket::TCP_NODELAY, 1)
    end
    
    # Enable TCP keepalive to detect dead connections
    socket.setsockopt(Socket::SOL_SOCKET, Socket::SO_KEEPALIVE, 1)
  end
end

server = TunedTCPServer.new(8080, buffer_size: 512_000, nodelay: true)
loop do
  client = server.accept_connection
  Thread.new(client) do |conn|
    # Handle connection
    # Congestion control operates automatically
    conn.close
  end
end

Performance Considerations

TCP congestion control significantly impacts application performance through throughput, latency, and connection reliability. The interaction between congestion control, network conditions, and application behavior determines actual performance.

Bandwidth-delay product (BDP) defines the amount of data that can be in flight to fully utilize a network path. BDP equals the bottleneck bandwidth multiplied by the round-trip time: BDP = bandwidth × RTT. The TCP congestion window must reach at least the BDP to achieve maximum throughput. On high-bandwidth, high-latency paths (like transcontinental links), the required window size can be several megabytes.

def calculate_bdp(bandwidth_mbps, rtt_ms)
  # Convert to consistent units
  bandwidth_bytes_per_sec = (bandwidth_mbps * 1_000_000) / 8
  rtt_sec = rtt_ms / 1000.0
  
  bdp_bytes = bandwidth_bytes_per_sec * rtt_sec
  
  {
    bdp_bytes: bdp_bytes,
    bdp_kilobytes: bdp_bytes / 1024,
    bdp_segments: bdp_bytes / 1460  # Typical MSS
  }
end

# Example: 100 Mbps link with 50ms RTT
result = calculate_bdp(100, 50)
puts "BDP: #{result[:bdp_kilobytes].round} KB"
puts "Requires congestion window of #{result[:bdp_segments].round} segments"
# => BDP: 610 KB
# => Requires congestion window of 428 segments

Socket buffer sizes must accommodate the BDP. Insufficient buffers limit the amount of data in flight, preventing full bandwidth utilization. The operating system's default buffer sizes often suffice for local networks but may constrain performance on wide-area network connections. Automatic buffer tuning in modern kernels adjusts buffer sizes dynamically. Applications can override these settings when necessary, but on Linux an explicit SO_SNDBUF or SO_RCVBUF disables autotuning for that socket, so a fixed size should be chosen deliberately.

Slow start causes gradual throughput ramp-up at connection start and after timeouts. For short-lived connections transferring small amounts of data, slow start prevents reaching full throughput. The connection terminates before the congestion window grows large enough. This effect particularly impacts web traffic, where many HTTP requests complete in a few round trips. HTTP/2 and HTTP/3 mitigate this through connection reuse and stream multiplexing.

require 'socket'
require 'benchmark'

def measure_connection_rampup(host, port, data_size)
  times = []
  
  # Each iteration opens a new connection, so every transfer begins in slow
  # start; later runs mainly benefit from cached DNS and kernel path metrics
  5.times do
    time = Benchmark.measure do
      socket = TCPSocket.new(host, port)
      socket.write("X" * data_size)
      socket.close
    end
    times << time.real
  end
  
  # Array has no #average method, so compute the mean explicitly
  later = times[1..]
  later_average = later.sum / later.size
  
  {
    first_transfer: times.first,
    average_after_warmup: later_average,
    improvement: ((times.first - later_average) / times.first * 100).round(1)
  }
end

# Small transfers suffer more from slow start
result = measure_connection_rampup('example.com', 80, 10_000)
puts "First transfer: #{result[:first_transfer].round(3)}s"
puts "Subsequent average: #{result[:average_after_warmup].round(3)}s"
puts "Improvement: #{result[:improvement]}%"
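
The cost of slow start for short transfers can also be estimated without touching the network. The sketch below counts the round trips an idealized slow start needs to deliver a payload; the initial window of ten segments and the 1460-byte MSS are assumptions, and `round_trips_needed` is a hypothetical helper that ignores loss, handshakes, and delayed ACKs.

```ruby
# Estimates how many round trips an idealized slow start needs to deliver
# data_size bytes, with the window doubling after every round.
def round_trips_needed(data_size, initial_window: 10, mss: 1460)
  segments_left = (data_size.to_f / mss).ceil
  cwnd = initial_window
  rounds = 0
  while segments_left > 0
    segments_left -= cwnd  # Send one full window this round
    cwnd *= 2              # Window doubles for the next round
    rounds += 1
  end
  rounds
end

[10_000, 100_000, 1_000_000].each do |size|
  puts "#{size} bytes: #{round_trips_needed(size)} round trips"
end
# prints:
# 10000 bytes: 1 round trips
# 100000 bytes: 3 round trips
# 1000000 bytes: 7 round trips
```

A hundredfold increase in payload costs only six extra round trips, which is why reusing one warmed-up connection tends to beat opening many short ones.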

Loss recovery mechanisms introduce latency spikes. Fast retransmit reduces delay compared to timeouts, but retransmission still requires at least one additional RTT. Applications sensitive to latency jitter must account for these variations. Streaming protocols and real-time applications often use UDP to avoid TCP's reliability and congestion control overhead, implementing custom congestion-aware mechanisms.

Bufferbloat occurs when excessive buffering in network devices causes latency inflation without providing throughput benefits. Large queues in routers delay congestion signals, causing TCP to send at high rates while packets accumulate in buffers. This increases latency dramatically while throughput remains unchanged. Modern queue management algorithms like CoDel and FQ-CoDel in routers mitigate bufferbloat, and congestion control algorithms like BBR reduce the effect by maintaining lower queue occupancy.

class LatencyMonitor
  def initialize
    @measurements = []
  end
  
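  def measure_rtt(host)
    require 'socket'
    
    # Times connect + request + first response byte. This spans about two
    # round trips plus server processing, so treat it as an upper bound on RTT.
    start = Time.now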
    socket = TCPSocket.new(host, 80)
    socket.write("GET / HTTP/1.1\r\nHost: #{host}\r\n\r\n")
    first_byte = socket.read(1)
    rtt = (Time.now - start) * 1000  # Convert to ms
    socket.close
    
    @measurements << rtt
    rtt
  end
  
  def statistics
    return {} if @measurements.empty?
    
    sorted = @measurements.sort
    {
      min: sorted.first,
      max: sorted.last,
      median: sorted[sorted.size / 2],
      average: @measurements.sum / @measurements.size,
      jitter: sorted.last - sorted.first
    }
  end
end

monitor = LatencyMonitor.new
10.times { monitor.measure_rtt('example.com') }
stats = monitor.statistics
puts "RTT - Min: #{stats[:min].round(1)}ms, Max: #{stats[:max].round(1)}ms"
puts "Jitter: #{stats[:jitter].round(1)}ms"
# High jitter may indicate bufferbloat or congestion
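
The latency added by a full buffer follows from simple arithmetic: the queue drains at the bottleneck rate, so the delay is the buffer size divided by the link speed. The sketch below computes it; the method name and the example figures are illustrative.

```ruby
# Worst-case queuing delay added by a full bottleneck buffer:
# delay = buffer size / drain rate.
def queuing_delay_ms(buffer_bytes, link_mbps)
  link_bytes_per_sec = link_mbps * 1_000_000 / 8.0
  buffer_bytes / link_bytes_per_sec * 1000
end

# A 1 MB buffer in front of a 10 Mbps uplink
puts queuing_delay_ms(1_000_000, 10).round
# prints 800
```

Eight hundred milliseconds of added delay on a path whose base RTT might be 20 ms shows how a buffer can dominate latency without adding any throughput.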

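Fairness issues arise when connections with different RTTs share bottleneck links. Loss-based algorithms like Reno favor connections with shorter RTTs, because window growth is clocked by acknowledgments and a short-RTT flow recovers from congestion faster. Under Reno, steady-state throughput is roughly inversely proportional to RTT, so a connection with 10ms RTT may achieve about 10x the throughput of a 100ms RTT connection sharing the same link. CUBIC narrows this gap by growing its window as a function of elapsed time rather than acknowledgment arrivals. BBR models the path instead of reacting purely to loss, which changes the fairness profile but does not remove RTT bias entirely.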
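
The RTT dependence of Reno-style flows can be made concrete with the well-known approximation by Mathis et al., which estimates steady-state throughput as (MSS / RTT) * (C / sqrt(p)) with C close to sqrt(3/2). The sketch below applies it; the formula is an idealized model, and the method name, MSS, and loss rate are assumptions chosen for illustration.

```ruby
# Approximate steady-state throughput of a Reno-style flow:
#   rate ~= (MSS / RTT) * (sqrt(3/2) / sqrt(loss_rate))
def reno_throughput_mbps(rtt_ms:, loss_rate:, mss: 1460)
  rtt_sec = rtt_ms / 1000.0
  bytes_per_sec = (mss / rtt_sec) * (Math.sqrt(1.5) / Math.sqrt(loss_rate))
  bytes_per_sec * 8 / 1_000_000
end

short = reno_throughput_mbps(rtt_ms: 10, loss_rate: 0.0001)
long  = reno_throughput_mbps(rtt_ms: 100, loss_rate: 0.0001)

puts "10 ms RTT:  #{short.round(1)} Mbps"
puts "100 ms RTT: #{long.round(1)} Mbps"
puts "Ratio: #{(short / long).round(1)}"
```

At the same loss rate the model gives the 10 ms flow ten times the throughput of the 100 ms flow, because RTT appears only in the denominator.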

Multiple concurrent TCP connections from the same application compete with each other for bandwidth. Opening many parallel connections cannot raise the capacity of a shared bottleneck; it only enlarges the application's share at the expense of other users. HTTP/2 multiplexes streams over one TCP connection, and HTTP/3 does the same over QUIC, so both achieve parallelism at the application layer without multiple connections.

require 'socket'
require 'benchmark'

def compare_serial_vs_parallel(host, port, data_size, parallel_count)
  # Serial: one connection
  serial_time = Benchmark.measure do
    socket = TCPSocket.new(host, port)
    parallel_count.times { socket.write("X" * data_size) }
    socket.close
  end.real
  
  # Parallel: multiple connections
  parallel_time = Benchmark.measure do
    threads = parallel_count.times.map do
      Thread.new do
        socket = TCPSocket.new(host, port)
        socket.write("X" * data_size)
        socket.close
      end
    end
    threads.each(&:join)
  end.real
  
  {
    serial_seconds: serial_time,
    parallel_seconds: parallel_time,
    speedup: serial_time / parallel_time
  }
end

Practical Examples

Understanding congestion control through concrete scenarios demonstrates how the mechanisms respond to various network conditions and application patterns.

Bulk Data Transfer shows congestion control ramping up throughput. A file transfer starts with slow start exponential growth, transitions to congestion avoidance linear growth, and stabilizes around the available bandwidth. Packet losses cause periodic window reductions, creating the characteristic saw-tooth pattern.

require 'socket'

class BulkTransfer
  def initialize(host, port)
    @host = host
    @port = port
    @stats = {
      bytes_sent: 0,
      send_errors: 0,
      throughput_samples: []
    }
  end
  
  def transfer_file(file_path, chunk_size: 65_536)
    socket = TCPSocket.new(@host, @port)
    
    # Configure for bulk transfer
    socket.setsockopt(Socket::SOL_SOCKET, Socket::SO_SNDBUF, 524_288)
    
    File.open(file_path, 'rb') do |file|
      last_sample_time = Time.now
      bytes_since_sample = 0
      
      while chunk = file.read(chunk_size)
        begin
          bytes = socket.write(chunk)
          @stats[:bytes_sent] += bytes
          bytes_since_sample += bytes
          
          # Sample throughput every second
          if Time.now - last_sample_time >= 1.0
            throughput = bytes_since_sample / (Time.now - last_sample_time)
            @stats[:throughput_samples] << throughput
            bytes_since_sample = 0
            last_sample_time = Time.now
          end
        rescue Errno::EPIPE, Errno::ECONNRESET => e
          @stats[:send_errors] += 1
          raise
        end
      end
    end
    
    socket.close
    @stats
  end
  
  def analyze_throughput
    return {} if @stats[:throughput_samples].empty?
    
    samples = @stats[:throughput_samples]
    {
      min_mbps: (samples.min * 8 / 1_000_000.0).round(2),
      max_mbps: (samples.max * 8 / 1_000_000.0).round(2),
      avg_mbps: (samples.sum / samples.size * 8 / 1_000_000.0).round(2),
      variation: ((samples.max - samples.min) / (samples.sum / samples.size) * 100).round(1)
    }
  end
end

transfer = BulkTransfer.new('fileserver.example.com', 9000)
stats = transfer.transfer_file('large_file.dat')
analysis = transfer.analyze_throughput

puts "Transferred: #{stats[:bytes_sent] / 1_000_000} MB"
puts "Throughput: #{analysis[:avg_mbps]} Mbps (#{analysis[:min_mbps]} - #{analysis[:max_mbps]})"
puts "Variation: #{analysis[:variation]}%"
# Variation shows congestion window growth/reduction cycles

Connection Timeout Recovery demonstrates slow start restart after a timeout. The connection experiences severe congestion, triggers a timeout, resets the congestion window to one segment, and gradually rebuilds throughput.

require 'socket'
require 'timeout'

class TimeoutRecoveryTest
  def test_recovery(host, port, data_size)
    socket = TCPSocket.new(host, port)
    recovery_log = []
    
    # Send data and track recovery from timeouts
    chunk = "X" * 1460  # One MSS
    chunks_sent = 0
    
    loop do
      begin
        Timeout.timeout(5) do
          socket.write(chunk)
          chunks_sent += 1
          
          # Track approximate congestion window growth
          # Window starts at 1, doubles each RTT in slow start
          estimated_cwnd = if chunks_sent < 64
                             2 ** (Math.log2(chunks_sent).floor)
                           else
                             64 + (chunks_sent - 64)  # Linear growth
                           end
          
          recovery_log << {
            chunk: chunks_sent,
            estimated_cwnd: estimated_cwnd,
            time: Time.now
          }
        end
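      rescue Timeout::Error
        # Application-level timeout, used here as a stand-in for a TCP
        # retransmission timeout (the kernel's own timer is not visible to Ruby)
        recovery_log << {
          event: 'TIMEOUT',
          chunks_before_timeout: chunks_sent,
          time: Time.now
        }
        chunks_sent = 0  # Simulate cwnd reset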
      rescue Errno::EPIPE
        break
      end
      
      break if chunks_sent * 1460 >= data_size
    end
    
    socket.close
    recovery_log
  end
end

Multiple Parallel Streams illustrates how concurrent connections share bandwidth. Each connection runs its own congestion control, competing for bottleneck capacity. The streams converge toward equal bandwidth shares, though convergence speed depends on the algorithm and RTT differences.

require 'socket'

class ParallelStreamTest
  def initialize(host, port, stream_count)
    @host = host
    @port = port
    @stream_count = stream_count
    @stream_stats = Array.new(stream_count) { { bytes: 0, start: nil } }
  end
  
  def run_test(duration)
    threads = @stream_count.times.map do |i|
      Thread.new do
        stream_transfer(i, duration)
      end
    end
    
    threads.each(&:join)
    analyze_fairness
  end
  
  private
  
  def stream_transfer(stream_id, duration)
    socket = TCPSocket.new(@host, @port)
    @stream_stats[stream_id][:start] = Time.now
    
    end_time = Time.now + duration
    while Time.now < end_time
      bytes = socket.write("X" * 8192)
      @stream_stats[stream_id][:bytes] += bytes
    end
    
    socket.close
  end
  
  def analyze_fairness
    throughputs = @stream_stats.map do |stat|
      stat[:bytes] / (Time.now - stat[:start])
    end
    
    avg = throughputs.sum / throughputs.size
    jains_index = (throughputs.sum ** 2) / 
                  (@stream_count * throughputs.map { |t| t ** 2 }.sum)
    
    {
      throughputs_mbps: throughputs.map { |t| (t * 8 / 1_000_000.0).round(2) },
      average_mbps: (avg * 8 / 1_000_000.0).round(2),
      fairness_index: jains_index.round(3)
    }
    # Jain's fairness index: 1.0 = perfect fairness, lower = more unfair
  end
end

test = ParallelStreamTest.new('example.com', 8080, 4)
result = test.run_test(30)
puts "Stream throughputs: #{result[:throughputs_mbps]} Mbps"
puts "Fairness index: #{result[:fairness_index]}"

Short-Lived Connections highlight slow start impact on performance. Many HTTP requests complete before the congestion window grows large, never achieving full throughput. Connection reuse through HTTP keep-alive mitigates this effect.

require 'socket'
require 'net/http'

class ConnectionReuseComparison
  def test_without_keepalive(host, path, request_count)
    times = []
    
    request_count.times do
      start = Time.now
      socket = TCPSocket.new(host, 80)
      socket.write("GET #{path} HTTP/1.1\r\nHost: #{host}\r\nConnection: close\r\n\r\n")
      response = socket.read
      socket.close
      times << Time.now - start
    end
    
    times
  end
  
  def test_with_keepalive(host, path, request_count)
    times = []
    socket = TCPSocket.new(host, 80)
    
    request_count.times do
      start = Time.now
      socket.write("GET #{path} HTTP/1.1\r\nHost: #{host}\r\nConnection: keep-alive\r\n\r\n")
      # Read response (simplified - real code needs proper HTTP parsing)
      response = socket.readpartial(4096)
      times << Time.now - start
    end
    
    socket.close
    times
  end
  
  def compare(host, path, request_count)
    without = test_without_keepalive(host, path, request_count)
    with = test_with_keepalive(host, path, request_count)
    
    {
      without_keepalive: {
        total: without.sum.round(3),
        average: (without.sum / without.size).round(3),
        first: without.first.round(3)
      },
      with_keepalive: {
        total: with.sum.round(3),
        average: (with.sum / with.size).round(3),
        first: with.first.round(3)
      },
      improvement: ((without.sum - with.sum) / without.sum * 100).round(1)
    }
  end
end

comparison = ConnectionReuseComparison.new
result = comparison.compare('example.com', '/api/data', 10)
puts "Without keep-alive: #{result[:without_keepalive][:average]}s avg"
puts "With keep-alive: #{result[:with_keepalive][:average]}s avg"
puts "Improvement: #{result[:improvement]}%"

Common Pitfalls

Several misconceptions and mistakes frequently occur when working with TCP congestion control, leading to performance issues and incorrect assumptions.

Assuming More Connections Equals Higher Throughput represents a persistent misunderstanding. Opening multiple parallel TCP connections does not add capacity when a single bottleneck link constrains the path. Each connection competes for the same bandwidth, so on an otherwise idle link the aggregate throughput stays roughly the same, and on a shared link the extra connections gain bandwidth only by taking a larger share from other users, which worsens fairness. Browsers speaking HTTP/1.1 typically opened around six parallel connections per host, mainly to work around the protocol's lack of request multiplexing rather than to gain bandwidth. Modern protocols like HTTP/2 multiplex streams over a single connection, achieving parallelism without multiple TCP connections.

# Demonstrates that parallel connections don't increase throughput
require 'socket'

def test_throughput_scaling(host, port, max_connections)
  results = {}
  
  (1..max_connections).each do |conn_count|
    bytes_total = 0
    start_time = Time.now
    
    threads = conn_count.times.map do
      Thread.new do
        socket = TCPSocket.new(host, port)
        bytes = 0
        
        5.times { bytes += socket.write("X" * 65536) }
        
        socket.close
        bytes
      end
    end
    
    bytes_total = threads.map(&:value).sum
    duration = Time.now - start_time
    
    results[conn_count] = (bytes_total / duration * 8 / 1_000_000.0).round(2)
  end
  
  results
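  # Shows diminishing or no returns beyond a single connection.
  # Caveat: write returns once data is queued in the kernel send buffer,
  # so small transfers measure buffering speed rather than network throughput.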
end

results = test_throughput_scaling('example.com', 8080, 5)
results.each { |conns, mbps| puts "#{conns} connections: #{mbps} Mbps" }
# Expected on an otherwise idle bottleneck: similar aggregate throughput regardless of connection count

Ignoring Buffer Size Requirements causes throughput degradation on high-BDP paths. Default socket buffer sizes work adequately for local networks but constrain performance on wide-area connections. The amount of data in flight cannot exceed the send buffer or the receiver's advertised window, so an undersized buffer caps throughput at roughly the buffer size divided by the RTT, no matter how far the congestion window could otherwise grow. Automatic tuning in modern kernels helps but may not reach optimal sizes for all scenarios. Applications on high-BDP paths should verify that buffer sizes match network characteristics.
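
The buffer requirement follows directly from the bandwidth-delay product. The following sketch computes the BDP for a hypothetical 1 Gbps path with an 80 ms RTT, shows the ceiling a 64 KB buffer would impose on that path, and prints the defaults the kernel assigns to a fresh socket. The helper names and the example path are assumptions made for illustration.

```ruby
require 'socket'

# Bandwidth-delay product in bytes: the data that must be in flight
# to keep the path full.
def bdp_bytes(bandwidth_mbps, rtt_ms)
  (bandwidth_mbps * 1_000_000 / 8.0 * rtt_ms / 1000.0).ceil
end

# Upper bound on throughput (Mbps) when in-flight data is capped by a buffer.
def buffer_limited_mbps(buffer_bytes, rtt_ms)
  (buffer_bytes * 8 / (rtt_ms / 1000.0) / 1_000_000).round(2)
end

puts "BDP: #{bdp_bytes(1000, 80)} bytes"
puts "64 KB buffer caps throughput near #{buffer_limited_mbps(65_536, 80)} Mbps"

# Inspect the defaults the kernel gives an unconnected TCP socket.
sock = Socket.new(:INET, :STREAM)
sndbuf = sock.getsockopt(Socket::SOL_SOCKET, Socket::SO_SNDBUF).int
rcvbuf = sock.getsockopt(Socket::SOL_SOCKET, Socket::SO_RCVBUF).int
puts "Default SO_SNDBUF=#{sndbuf}, SO_RCVBUF=#{rcvbuf}"
sock.close
```

The reported defaults vary by operating system and kernel configuration, and kernels with autotuning may grow the buffers after the connection is established, so the printed values are a starting point rather than a fixed limit.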

Misinterpreting TCP_NODELAY Effects leads to latency issues or throughput reduction. Disabling Nagle's algorithm prevents small packet coalescing, reducing latency for interactive applications but potentially decreasing efficiency. Many applications blindly enable TCP_NODELAY without understanding the trade-off. Bulk data transfers generally benefit from Nagle's algorithm enabled (the default), while request-response protocols benefit from disabling it.

require 'socket'

def compare_nagle_impact(host, port, message_size, message_count)
  # Test with Nagle enabled (default)
  socket_nagle = TCPSocket.new(host, port)
  
  start = Time.now
  message_count.times { socket_nagle.write("X" * message_size) }
  time_with_nagle = Time.now - start
  socket_nagle.close
  
  # Test with Nagle disabled
  socket_nodelay = TCPSocket.new(host, port)
  socket_nodelay.setsockopt(Socket::IPPROTO_TCP, Socket::TCP_NODELAY, 1)
  
  start = Time.now
  message_count.times { socket_nodelay.write("X" * message_size) }
  time_without_nagle = Time.now - start
  socket_nodelay.close
  
  {
    nagle_enabled: time_with_nagle.round(3),
    nagle_disabled: time_without_nagle.round(3),
    difference_ms: ((time_without_nagle - time_with_nagle) * 1000).round(1)
  }
  # Small messages show more impact from Nagle's algorithm
end

# Small messages (100 bytes)
result_small = compare_nagle_impact('example.com', 80, 100, 50)
puts "Small messages - Nagle: #{result_small[:nagle_enabled]}s, " \
     "No Nagle: #{result_small[:nagle_disabled]}s"

Expecting Consistent Throughput ignores congestion control's dynamic nature. TCP throughput varies continuously as the congestion window grows and shrinks. Applications requiring consistent bandwidth should implement rate limiting or use protocols designed for constant rate delivery. The saw-tooth pattern of loss-based congestion control means instantaneous throughput oscillates significantly even when average throughput appears stable.
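
The saw-tooth is easy to reproduce with a toy model. The sketch below advances a Reno-style window one step per RTT and treats any window above a fixed path capacity as a loss event. The capacity of 40 segments and the initial ssthresh of 32 are arbitrary values chosen for illustration, and real stacks react per ACK rather than per RTT.

```ruby
# Toy Reno-style model: one step per RTT, loss whenever cwnd exceeds a
# hypothetical fixed path capacity (both measured in segments).
def simulate_aimd(rtts:, capacity:, ssthresh: 32)
  cwnd = 1
  history = []

  rtts.times do
    history << cwnd
    if cwnd > capacity
      ssthresh = [cwnd / 2, 2].max      # multiplicative decrease on loss
      cwnd = ssthresh
    elsif cwnd < ssthresh
      cwnd = [cwnd * 2, ssthresh].min   # slow start: double per RTT
    else
      cwnd += 1                         # congestion avoidance: +1 per RTT
    end
  end

  history
end

trace = simulate_aimd(rtts: 60, capacity: 40)
puts trace.join(' ')

steady = trace.last(30)
puts "steady-state window: min=#{steady.min} max=#{steady.max}"
```

After the initial slow start, the window in this model oscillates between 20 and 41 segments, so instantaneous throughput swings by roughly a factor of two even though the long-run average is stable.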

Assuming Packet Loss Equals Network Congestion misdiagnoses issues on wireless and cellular networks. These links experience packet loss from radio interference, signal attenuation, and mobility, independent of congestion. Traditional TCP algorithms interpret all loss as congestion, unnecessarily reducing sending rates. This is one reason BBR often outperforms loss-based algorithms on wireless links: it does not treat loss as its primary congestion signal and instead paces sending from estimates of bottleneck bandwidth and round-trip time.

Overlooking Head-of-Line Blocking in application protocol design causes latency problems. TCP delivers data in order, so a single lost segment blocks delivery of all subsequent segments until retransmission completes. Applications multiplexing multiple independent streams over one TCP connection experience unnecessary delays. HTTP/2 suffered from this issue, leading to HTTP/3's adoption of QUIC, which provides per-stream ordering over UDP.

# Demonstrates sequential blocking in TCP
require 'socket'

class StreamMultiplexer
  def initialize(socket)
    @socket = socket
  end
  
  def send_stream_data(stream_id, data)
    # All streams share one TCP connection
    # Loss in any stream blocks all streams
    # Frame format: one header line, then exactly bytesize bytes of payload
    @socket.write("STREAM:#{stream_id}:#{data.bytesize}\n#{data}")
  end
  
  def receive_stream_data
    # Blocked if any earlier data is lost, even if it belongs to a different stream
    # TCP provides no per-stream delivery
    header = @socket.gets
    return nil if header.nil?   # peer closed the connection
    
    _tag, stream_id, size = header.chomp.split(':', 3)
    { stream_id: stream_id.to_i, data: @socket.read(size.to_i) }
  end
end

# Issue: Stream 1 loss blocks Stream 2 delivery, even though Stream 2 data arrived
# HTTP/3 with QUIC solves this with independent stream ordering

Neglecting Congestion Control Algorithm Selection leaves default choices that may not suit the application's environment. CUBIC works well for general Internet traffic but may underperform compared to BBR on high-speed links or paths with packet loss from non-congestion causes. Data center applications benefit from DCTCP when supported. Applications should evaluate algorithm choices based on deployment characteristics.
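
Checking which algorithm a socket uses is a reasonable first step before evaluating alternatives. The sketch below reads and sets the TCP_CONGESTION socket option from Ruby. It assumes a platform that exposes the option (Linux does), and the helper names are illustrative; both helpers fall back to nil or false where the option or the requested algorithm is unavailable.

```ruby
require 'socket'

# Returns the congestion control algorithm name for a socket, or nil when
# the platform does not expose TCP_CONGESTION.
def congestion_algorithm(sock)
  return nil unless Socket.const_defined?(:TCP_CONGESTION)

  sock.getsockopt(Socket::IPPROTO_TCP, Socket::TCP_CONGESTION).data.unpack1('Z*')
rescue SystemCallError
  nil
end

# Tries to select an algorithm by name. Returns true on success, false when
# the option is unsupported or the kernel rejects the name.
def select_congestion_algorithm(sock, name)
  return false unless Socket.const_defined?(:TCP_CONGESTION)

  sock.setsockopt(Socket::IPPROTO_TCP, Socket::TCP_CONGESTION, name)
  true
rescue SystemCallError
  false
end

sock = Socket.new(:INET, :STREAM)
puts "Current algorithm: #{congestion_algorithm(sock).inspect}"
puts "Switched to bbr: #{select_congestion_algorithm(sock, 'bbr')}"
puts "Now using: #{congestion_algorithm(sock).inspect}"
sock.close
```

Whether a given name is accepted depends on the modules the kernel has loaded and on the privileges of the process, so the second line may print false on a system without BBR.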

Assuming Loss Detection is Immediate leads to incorrect timeout values in application protocols. Fast retransmit triggers only after three duplicate acknowledgments, which means at least three later segments must reach the receiver after the lost one. On paths with few packets in flight, at the tail of a transfer, or when multiple consecutive packets are lost, timeout-based retransmission becomes necessary, introducing significant delay. Applications building on TCP must account for worst-case recovery times rather than optimistic scenarios.
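
The duplicate ACK requirement can be illustrated with a small model of a receiver that sends one cumulative ACK per arriving segment. This is a simplified sketch: segments are numbered from 1, ACKs carry the next segment expected, and delayed ACKs and SACK are ignored. The method names are illustrative.

```ruby
# Cumulative ACKs a receiver sends for segments arriving in the given order.
# Each ACK carries the number of the next segment the receiver expects.
def cumulative_acks(arrivals)
  received = {}
  next_expected = 1

  arrivals.map do |seq|
    received[seq] = true
    next_expected += 1 while received[next_expected]
    next_expected
  end
end

# True once the sender has seen `threshold` duplicates of the same ACK.
def fast_retransmit?(acks, threshold: 3)
  dup_count = 0
  last_ack = nil

  acks.each do |ack|
    if ack == last_ack
      dup_count += 1
      return true if dup_count >= threshold
    else
      dup_count = 0
      last_ack = ack
    end
  end

  false
end

# Segment 2 is lost in both cases.
long_flight  = cumulative_acks([1, 3, 4, 5])   # three segments follow the loss
short_flight = cumulative_acks([1, 3, 4])      # only two segments follow the loss

puts "ACKs #{long_flight.inspect}: fast retransmit? #{fast_retransmit?(long_flight)}"
puts "ACKs #{short_flight.inspect}: fast retransmit? #{fast_retransmit?(short_flight)}"
# ACKs [2, 2, 2, 2]: fast retransmit? true
# ACKs [2, 2, 2]: fast retransmit? false
```

In the second case the sender has nothing left to send that would generate a third duplicate, so recovery has to wait for the retransmission timer.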

Reference

Congestion Control States

State	Window Behavior	Trigger	Transition
Slow Start	Exponential growth (doubles per RTT)	Connection start or timeout	Reaches ssthresh or loss detected
Congestion Avoidance	Linear growth (increases by 1 MSS per RTT)	cwnd reaches ssthresh	Loss detection
Fast Recovery	Inflated window during recovery	Three duplicate ACKs	New ACK received or timeout

Algorithm Comparison

Algorithm	Loss Response	Fairness	Latency	Best Environment
Tahoe	Reset to 1 segment	Poor on high-BDP	High after loss	Legacy systems
Reno	Halve window on dupacks	Moderate	Moderate	General Internet
NewReno	Improved multi-loss handling	Moderate	Moderate	General Internet
CUBIC	Reduce cwnd to 0.7×, then cubic regrowth	Good on high-BDP	Moderate	High-speed networks
BBR	Model-based rate control	Varies; BBRv1 can crowd out loss-based flows	Low	High-BDP or lossy paths
DCTCP	Proportional to ECN marks	Excellent	Very low	Data centers

Key Parameters

Parameter	Description	Typical Value	Impact
Initial cwnd	Starting congestion window	10 segments	Connection startup time
MSS	Maximum segment size	1460 bytes	Bandwidth efficiency
ssthresh	Slow start threshold	Initially large (64 KB in older stacks); half of in-flight data after loss	Growth rate transition
RTO	Retransmission timeout	Calculated from RTT	Loss recovery delay
Duplicate ACK threshold	Triggers fast retransmit	3	Loss detection speed

Socket Options

Option	Level	Description	Usage
TCP_NODELAY	IPPROTO_TCP	Disable Nagle's algorithm	Interactive protocols
SO_SNDBUF	SOL_SOCKET	Send buffer size	High-BDP paths
SO_RCVBUF	SOL_SOCKET	Receive buffer size	High-BDP paths
TCP_CONGESTION	IPPROTO_TCP	Select algorithm (Linux)	Environment-specific tuning
SO_KEEPALIVE	SOL_SOCKET	Enable keepalive probes	Detect dead connections
TCP_INFO	IPPROTO_TCP	Query connection statistics	Monitoring and debugging

Congestion Signals

Signal	Detection	Algorithm Response	Recovery Time
Three duplicate ACKs	Fast retransmit	Halve cwnd, enter fast recovery	1 RTT minimum
Timeout	RTO expiration	Set ssthresh to half of in-flight data, reset cwnd to 1 MSS	RTO duration plus recovery
ECN mark	IP header flag	Reduce cwnd (proportional in DCTCP)	1 RTT
Partial ACK	NewReno in fast recovery	Retransmit next segment	Continues recovery

Performance Formulas

Formula	Description	Usage
BDP = Bandwidth × RTT	Bandwidth-delay product	Minimum window size for full utilization
Throughput ≈ (MSS / RTT) × (C / √Loss)	Mathis equation for loss-based algorithms (C ≈ 1.22 for Reno-style AIMD)	Estimate achievable throughput
Throughput ≈ Window / RTT	Instantaneous throughput	Current sending rate
RTT_smoothed = (7/8) × RTT_old + (1/8) × RTT_new	Exponential smoothing of RTT	RTO calculation
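
The formulas in this table translate into a few lines of Ruby. The sketch below implements the Mathis approximation and an RTT estimator that follows the RFC 6298 update rules (smoothing factors 1/8 and 1/4, RTO = SRTT + 4 × RTTVAR). It ignores clock granularity, and the 1 second floor is the RFC's recommended minimum; real stacks often use a lower one. The method and class names are illustrative.

```ruby
# Mathis approximation of loss-based TCP throughput, in Mbps.
# c defaults to sqrt(3/2), about 1.22, the value usually quoted for Reno.
def mathis_mbps(mss_bytes:, rtt_ms:, loss_rate:, c: Math.sqrt(1.5))
  bytes_per_sec = (mss_bytes / (rtt_ms / 1000.0)) * (c / Math.sqrt(loss_rate))
  (bytes_per_sec * 8 / 1_000_000).round(2)
end

# Smoothed RTT and retransmission timeout, following RFC 6298 update rules.
class RttEstimator
  attr_reader :srtt, :rttvar

  # sample is a measured round-trip time in seconds
  def update(sample)
    if @srtt.nil?
      @srtt = sample
      @rttvar = sample / 2.0
    else
      # RTTVAR is updated first, using the previous SRTT
      @rttvar = 0.75 * @rttvar + 0.25 * (@srtt - sample).abs
      @srtt = 0.875 * @srtt + 0.125 * sample
    end
    self
  end

  def rto(min_rto: 1.0)
    [@srtt + 4 * @rttvar, min_rto].max
  end
end

puts "1% loss, 100 ms RTT: #{mathis_mbps(mss_bytes: 1460, rtt_ms: 100, loss_rate: 0.01)} Mbps"

estimator = RttEstimator.new
[0.100, 0.200, 0.120].each { |sample| estimator.update(sample) }
puts "SRTT=#{estimator.srtt.round(4)}s RTO=#{estimator.rto(min_rto: 0.2).round(4)}s"
```

The Mathis figure makes the cost of random loss concrete: at a 100 ms RTT, a 1% loss rate limits a Reno-style flow to roughly 1.4 Mbps regardless of link capacity.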

Common Buffer Sizes

Scenario	Send Buffer	Receive Buffer	Rationale
LAN	64-128 KB	64-128 KB	Low RTT, modest BDP
Internet	256-512 KB	256-512 KB	Moderate RTT and bandwidth
High-speed WAN	2-4 MB	2-4 MB	High BDP requires large buffers
Data center	128-256 KB	128-256 KB	Low latency, controlled environment
Mobile	64-256 KB	64-256 KB	Variable conditions, memory constrained

Algorithm Evolution Timeline

Year	Development	Significance
1988	TCP Tahoe	First congestion control implementation
1990	TCP Reno	Fast recovery improves performance
1996	TCP NewReno	Better multiple loss handling
2001	ECN	Explicit congestion signals
2006	CUBIC default in Linux	Optimized for high-speed networks
2010	DCTCP	Data center optimization
2016	BBR	Model-based control paradigm
