Overview
TCP congestion control manages the rate at which data flows through a network connection to prevent overwhelming intermediate routers and the destination host. The mechanism adjusts transmission speed dynamically based on network conditions, balancing throughput against packet loss and latency.
Congestion occurs when network demand exceeds available capacity. Without control mechanisms, senders would transmit at maximum speed regardless of network conditions, causing buffer overflows at routers, packet drops, and cascading failures as retransmissions compound the problem. TCP congestion control addresses this through algorithms that probe available bandwidth, detect congestion signals, and adjust sending rates accordingly.
The TCP protocol implements congestion control through a congestion window (cwnd) that limits the amount of unacknowledged data in flight. This window operates alongside the receiver's advertised window, with the effective window size being the minimum of the two. The sender increases cwnd when receiving acknowledgments and decreases it upon detecting congestion signals like packet loss or explicit congestion notifications.
# TCP socket establishes connection with congestion control active
require 'socket'
socket = TCPSocket.new('example.com', 80)
# Congestion window starts small and grows based on network conditions
# "Connection: close" makes the server close the socket, so read returns at EOF
socket.write("GET / HTTP/1.1\r\nHost: example.com\r\nConnection: close\r\n\r\n")
response = socket.read
socket.close
# => response holds the data after TCP handles retransmissions and congestion control
Congestion control operates at the transport layer, invisible to application code but critical to network stability and performance. The mechanisms evolved from Van Jacobson's 1988 algorithms that prevented Internet collapse, through modern variants optimized for high-speed networks, data centers, and wireless environments.
Key Principles
TCP congestion control rests on four fundamental algorithms that work together: slow start, congestion avoidance, fast retransmit, and fast recovery. These algorithms manipulate the congestion window based on acknowledgment patterns and loss detection.
Slow start initializes connections conservatively. The congestion window begins at a small value (historically one to four maximum segment sizes; many current stacks start at ten segments, following RFC 6928) and doubles each round-trip time (RTT) as acknowledgments arrive. This exponential growth continues until the window reaches a threshold called ssthresh (slow start threshold) or the sender detects packet loss. The algorithm probes available bandwidth rapidly while avoiding immediate congestion.
Congestion avoidance takes over when cwnd reaches ssthresh. Instead of exponential growth, the window increases linearly—typically by one segment per RTT. This additive increase allows the connection to continue probing for bandwidth while reducing the risk of sudden congestion. When packet loss occurs, ssthresh is set to half the current cwnd, and cwnd is reduced based on the specific algorithm variant.
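The hand-off between the two phases can be traced round by round. The sketch below is a simplified model rather than kernel code: it counts the window in whole segments, assumes no loss, and uses a hypothetical helper name (`cwnd_trace`) with illustrative defaults.

```ruby
# Returns the congestion window (in segments) at the start of each RTT,
# assuming no loss. Below ssthresh the window doubles per RTT (slow start);
# at or above ssthresh it grows by one segment per RTT (congestion avoidance).
def cwnd_trace(rounds:, initial_cwnd: 1, ssthresh: 16)
  cwnd = initial_cwnd
  Array.new(rounds) do
    current = cwnd
    cwnd = cwnd < ssthresh ? [cwnd * 2, ssthresh].min : cwnd + 1
    current
  end
end

p cwnd_trace(rounds: 8)
# => [1, 2, 4, 8, 16, 17, 18, 19]
```

The trace shows the exponential phase ending exactly at ssthresh, after which growth becomes linear.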
Fast retransmit addresses the delay inherent in timeout-based retransmission. When a receiver gets out-of-order segments, it immediately sends duplicate acknowledgments (dupacks) for the last in-order segment received. If the sender receives three duplicate acknowledgments, it retransmits the missing segment without waiting for a timeout. This mechanism dramatically reduces retransmission latency.
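The three-duplicate rule can be sketched as a small counter on the sender side. This is an illustrative model only; the class name `DupAckDetector` and its interface are assumptions, and real stacks track considerably more state (SACK blocks, timers, sequence wraparound).

```ruby
# Counts duplicate ACKs and signals when fast retransmit should fire.
class DupAckDetector
  DUPACK_THRESHOLD = 3

  def initialize
    @last_ack = nil
    @dup_count = 0
  end

  # Returns the sequence number to retransmit when the third duplicate
  # ACK arrives, otherwise nil.
  def on_ack(ack_number)
    if ack_number == @last_ack
      @dup_count += 1
      return ack_number if @dup_count == DUPACK_THRESHOLD
    else
      # A new cumulative ACK resets the duplicate counter
      @last_ack = ack_number
      @dup_count = 0
    end
    nil
  end
end

detector = DupAckDetector.new
p [1000, 2000, 2000, 2000, 2000, 2000].map { |ack| detector.on_ack(ack) }
# => [nil, nil, nil, nil, 2000, nil]
```

The first ACK for 2000 is not a duplicate, so the retransmission fires on the fourth identical ACK, and only once.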
Fast recovery handles the post-retransmission phase. After fast retransmit, instead of returning to slow start, the algorithm maintains a higher congestion window. Each additional duplicate acknowledgment (indicating more out-of-order data has arrived) temporarily increases cwnd, allowing new data transmission. When a new acknowledgment arrives confirming the retransmitted data, cwnd is set to ssthresh and congestion avoidance resumes.
The algorithms detect congestion through two primary signals: packet loss and explicit congestion notification (ECN). Packet loss manifests as either timeout (indicating severe congestion) or duplicate acknowledgments (indicating moderate congestion). ECN allows routers to mark packets when queues fill, providing early congestion signals before packet drops occur.
The relationship between these components creates a saw-tooth pattern in throughput graphs. The congestion window grows during congestion avoidance until packet loss occurs, drops by approximately half, then begins growing again. This pattern represents TCP probing for available bandwidth while backing off when hitting network limits.
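A toy model makes the saw-tooth visible. The sketch assumes loss occurs exactly when the window reaches a fixed capacity, which real networks do not guarantee; `aimd_sawtooth` and its parameters are hypothetical names used for illustration.

```ruby
# Additive increase, multiplicative decrease: grow by one segment per RTT,
# halve the window whenever it reaches the (assumed) path capacity.
def aimd_sawtooth(capacity:, rounds:, start: 1)
  cwnd = start
  Array.new(rounds) do
    current = cwnd
    cwnd = current >= capacity ? [current / 2, 1].max : current + 1
    current
  end
end

p aimd_sawtooth(capacity: 8, rounds: 12, start: 4)
# => [4, 5, 6, 7, 8, 4, 5, 6, 7, 8, 4, 5]
```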
# Demonstrate TCP behavior under packet loss
require 'socket'
def measure_throughput(host, port, duration)
socket = TCPSocket.new(host, port)
start_time = Time.now
bytes_sent = 0
while Time.now - start_time < duration
begin
# TCP automatically adjusts sending rate based on ACKs and losses
bytes = socket.write("X" * 8192)
bytes_sent += bytes
rescue Errno::EPIPE
# Peer closed the connection, so no more data can be delivered
break
end
end
bytes_sent / duration
# => Throughput varies as TCP adjusts congestion window
ensure
socket.close if socket
end
The congestion window never exceeds the receiver's advertised window, which represents the receiver's buffer capacity. The effective window equals the minimum of cwnd and the receiver window. This interaction ensures that neither network congestion nor receiver capacity is overwhelmed.
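As a minimal sketch of that rule, the amount a sender may transmit at any moment is the effective window minus the data already in flight. The helper names below are hypothetical and all values are in bytes.

```ruby
# Effective window: the smaller of the congestion window and the
# receiver's advertised window.
def effective_window(cwnd, rwnd)
  [cwnd, rwnd].min
end

# Bytes the sender may still put on the wire, never negative.
def sendable_bytes(cwnd, rwnd, in_flight)
  [effective_window(cwnd, rwnd) - in_flight, 0].max
end

p sendable_bytes(14_600, 65_535, 10_000) # => 4600 (limited by cwnd)
p sendable_bytes(100_000, 32_768, 30_000) # => 2768 (limited by receiver)
```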
Implementation Approaches
Multiple congestion control algorithms exist, each optimizing for different network characteristics and performance goals. The selection affects throughput, latency, fairness, and behavior under various network conditions.
TCP Tahoe represents the original implementation of Van Jacobson's congestion control algorithms. Upon detecting any packet loss (timeout or duplicate acknowledgments), Tahoe sets ssthresh to half the current cwnd, reduces cwnd to one segment, and returns to slow start. This conservative approach caused significant throughput reduction even for minor packet losses, leading to the development of improved variants.
TCP Reno introduced fast recovery to avoid returning to slow start after fast retransmit. When receiving three duplicate acknowledgments, Reno halves cwnd and ssthresh, retransmits the missing segment, and enters fast recovery mode. The algorithm maintains a higher congestion window during recovery, only returning to slow start on timeout. Reno's improvements made it the dominant algorithm through the 1990s and 2000s.
# Simulate Reno-style congestion window behavior
class TCPRenoSimulator
  attr_reader :cwnd, :ssthresh, :state

  def initialize(initial_cwnd = 1, initial_ssthresh = 64)
    @cwnd = initial_cwnd
    @ssthresh = initial_ssthresh
    @state = :slow_start
  end

  def on_ack_received
    case @state
    when :slow_start
      @cwnd += 1 # One segment per ACK doubles cwnd every RTT
      @state = :congestion_avoidance if @cwnd >= @ssthresh
    when :congestion_avoidance
      @cwnd += 1.0 / @cwnd # About one segment per RTT
    when :fast_recovery
      # A new ACK ends recovery: resume linear growth from ssthresh
      @cwnd = @ssthresh
      @state = :congestion_avoidance
    end
  end

  def on_triple_dupack
    @ssthresh = [@cwnd / 2, 2].max
    @cwnd = @ssthresh
    @state = :fast_recovery
  end

  def on_timeout
    @ssthresh = [@cwnd / 2, 2].max
    @cwnd = 1
    @state = :slow_start
  end
end
TCP NewReno addresses Reno's limitation with multiple packet losses in a single window. Reno exits fast recovery after the first retransmitted segment is acknowledged, even if other segments remain lost. NewReno remains in fast recovery until all segments in the window are acknowledged, handling multiple losses more gracefully. The algorithm uses partial acknowledgments (acks that advance the window but don't acknowledge all outstanding data) as signals to retransmit the next missing segment.
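The partial-acknowledgment rule can be sketched as a classification of each ACK that arrives during fast recovery. This is a simplified illustration: `recovery_point` is a hypothetical name for the sequence number just past the last byte outstanding when recovery began, and sequence-number wraparound is ignored.

```ruby
# Classify a cumulative ACK received during NewReno fast recovery.
#   snd_una        - oldest unacknowledged sequence number before this ACK
#   recovery_point - sequence number just past the data outstanding when
#                    fast recovery started (assumed name)
def classify_recovery_ack(ack, snd_una:, recovery_point:)
  return :duplicate if ack <= snd_una
  ack < recovery_point ? :partial : :full
end

# Map the classification to the sender's next step.
def next_action(ack, snd_una:, recovery_point:)
  case classify_recovery_ack(ack, snd_una: snd_una, recovery_point: recovery_point)
  when :partial then [:retransmit, ack]   # next hole starts at the ACKed byte
  when :full    then [:exit_recovery, nil]
  else               [:wait, nil]
  end
end

p next_action(3000, snd_una: 1000, recovery_point: 9000) # => [:retransmit, 3000]
p next_action(9000, snd_una: 1000, recovery_point: 9000) # => [:exit_recovery, nil]
```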
TCP CUBIC optimizes for high-bandwidth, high-latency networks where traditional algorithms fail to utilize available capacity. Instead of linear growth, CUBIC uses a cubic function centered on the window size at which the last congestion event occurred. The window grows quickly when far below this point, flattens as it approaches it, and then accelerates again once it has passed it in order to probe for new capacity. This behavior allows faster recovery of lost bandwidth while maintaining stability near the previous limit. CUBIC became the Linux default in 2006 and remains widely deployed.
The cubic growth function is: W(t) = C(t - K)^3 + Wmax where Wmax is the window size at the last congestion event, K is the time period to reach Wmax, C is a scaling constant, and t is time since the last congestion event.
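The function can be evaluated directly. The sketch below assumes the constants from RFC 8312 (C = 0.4 and a multiplicative decrease factor β = 0.7, which the text above does not name) and computes K as the cube root of Wmax × (1 − β) / C, so that the window starts at β × Wmax right after a loss. Windows are in segments and time is in seconds.

```ruby
# W(t) = C * (t - K)^3 + Wmax, with K chosen so that W(0) = beta * Wmax.
# c and beta default to the RFC 8312 values (an assumption of this sketch).
def cubic_k(w_max, c: 0.4, beta: 0.7)
  Math.cbrt(w_max * (1 - beta) / c)
end

def cubic_window(t, w_max:, c: 0.4, beta: 0.7)
  k = cubic_k(w_max, c: c, beta: beta)
  c * (t - k)**3 + w_max
end

puts cubic_window(0, w_max: 100).round(1) # => 70.0
(0..6).each do |t|
  puts "t=#{t}s window=#{cubic_window(t, w_max: 100).round(1)}"
end
```

Printing the series shows large steps just after the loss, a plateau near Wmax around t = K, and growing steps again once t exceeds K.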
TCP BBR (Bottleneck Bandwidth and Round-trip propagation time) represents a fundamentally different approach. Rather than using packet loss as the primary congestion signal, BBR builds a model of the network path's bottleneck bandwidth and minimum RTT. The algorithm operates in distinct phases: startup (rapidly probing bandwidth), drain (draining queues built during startup), probe bandwidth (cycling sending rate to detect changes), and probe RTT (periodically measuring minimum RTT). BBR maintains high throughput with lower latency and reduced packet loss compared to loss-based algorithms, particularly on paths with inherent packet loss like wireless networks.
# Simplified BBR-style model tracking
class BBRModel
attr_reader :bottleneck_bandwidth, :min_rtt
def initialize
@bottleneck_bandwidth = 0
@min_rtt = Float::INFINITY
@bandwidth_samples = []
@rtt_samples = []
end
def update_bandwidth(bytes_delivered, time_interval)
bandwidth = bytes_delivered / time_interval
@bandwidth_samples << bandwidth
@bandwidth_samples.shift if @bandwidth_samples.size > 10
@bottleneck_bandwidth = @bandwidth_samples.max
end
def update_rtt(measured_rtt)
@rtt_samples << measured_rtt
@rtt_samples.shift if @rtt_samples.size > 10
@min_rtt = @rtt_samples.min
end
def bdp
# Bandwidth-delay product: optimal data in flight
@bottleneck_bandwidth * @min_rtt
end
def pacing_rate(gain = 1.0)
# BBR paces at bottleneck_bandwidth * gain
@bottleneck_bandwidth * gain
end
end
Data Center TCP (DCTCP) targets low-latency data center environments. DCTCP uses ECN marks from switches to estimate the extent of congestion, reducing cwnd proportionally to the fraction of marked packets rather than cutting it in half regardless of congestion severity. This approach maintains high throughput while achieving microsecond-scale latency. DCTCP requires ECN support throughout the network path and switch configuration for early marking.
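The proportional reduction can be sketched in two steps, following the update rules described in RFC 8257: a moving average `alpha` of the marked fraction, and a window cut of `alpha / 2`. The function names are hypothetical, and the gain g = 1/16 is the commonly cited default rather than something stated above.

```ruby
# Update the running estimate of the fraction of ECN-marked packets.
# alpha moves toward the fraction observed over the last window of data.
def dctcp_update_alpha(alpha, marked_packets, total_packets, g: 1.0 / 16)
  fraction = total_packets.zero? ? 0.0 : marked_packets.to_f / total_packets
  (1 - g) * alpha + g * fraction
end

# Reduce the window in proportion to the congestion estimate.
# alpha = 1.0 gives a Reno-style halving; small alpha gives a small cut.
def dctcp_reduce(cwnd, alpha)
  cwnd * (1 - alpha / 2.0)
end

alpha = dctcp_update_alpha(0.0, 8, 16)
puts alpha                           # => 0.03125
puts dctcp_reduce(100, alpha).round(2)
puts dctcp_reduce(100, 1.0)          # => 50.0
```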
Algorithm selection depends on the deployment environment. CUBIC works well for general Internet traffic. BBR excels on high-speed links and paths with variable latency or packet loss. DCTCP requires data center environments with ECN-capable switches. Legacy algorithms like Reno remain in some embedded systems and specialized devices.
Ruby Implementation
Ruby applications interact with TCP congestion control through the socket API. The operating system kernel handles algorithm implementation, but applications can influence behavior through socket options and sending patterns.
TCP sockets in Ruby automatically participate in congestion control. The kernel manages the congestion window, retransmissions, and acknowledgments transparently. Applications need not implement congestion control but should understand how their sending patterns interact with the underlying mechanisms.
require 'socket'
# Basic TCP client with congestion control active
def tcp_client_example
socket = TCPSocket.new('example.com', 8080)
# Set TCP_NODELAY to disable Nagle's algorithm
# This affects interaction with congestion control
socket.setsockopt(Socket::IPPROTO_TCP, Socket::TCP_NODELAY, 1)
# Check current socket buffer sizes
send_buffer = socket.getsockopt(Socket::SOL_SOCKET, Socket::SO_SNDBUF).int
recv_buffer = socket.getsockopt(Socket::SOL_SOCKET, Socket::SO_RCVBUF).int
puts "Send buffer: #{send_buffer} bytes"
puts "Receive buffer: #{recv_buffer} bytes"
# Write data - kernel handles congestion control
data = "X" * 1_000_000
socket.write(data)
# Read response
response = socket.read(1024)
socket.close
end
The TCP_NODELAY option disables Nagle's algorithm, which delays small packets to combine them with subsequent writes. Disabling Nagle's algorithm reduces latency for interactive applications but can decrease efficiency and interact negatively with congestion control by sending many small packets. The default behavior (Nagle enabled) works well for bulk data transfer where congestion control can operate on larger segments.
Socket buffer sizes affect how much data the kernel can queue for transmission and reception. Larger buffers allow the congestion window to grow further, potentially improving throughput on high-bandwidth, high-latency paths. The send buffer must accommodate the bandwidth-delay product of the connection for optimal performance.
require 'socket'
class TCPThroughputTest
def initialize(host, port)
@host = host
@port = port
end
def test_with_buffer_size(send_buffer_size)
socket = TCPSocket.new(@host, @port)
# Set send buffer size
socket.setsockopt(Socket::SOL_SOCKET, Socket::SO_SNDBUF, send_buffer_size)
# Verify actual buffer size (kernel may adjust)
actual_size = socket.getsockopt(Socket::SOL_SOCKET, Socket::SO_SNDBUF).int
# Measure throughput
start_time = Time.now
bytes_sent = 0
duration = 10
while Time.now - start_time < duration
bytes = socket.write("X" * 8192)
bytes_sent += bytes
end
throughput = bytes_sent / duration
{
requested_buffer: send_buffer_size,
actual_buffer: actual_size,
throughput: throughput,
megabits_per_second: (throughput * 8) / 1_000_000.0
}
ensure
socket.close if socket
end
end
# Test with different buffer sizes
tester = TCPThroughputTest.new('example.com', 8080)
[64_000, 128_000, 256_000, 512_000].each do |size|
result = tester.test_with_buffer_size(size)
puts "Buffer: #{result[:actual_buffer]}, " \
"Throughput: #{result[:megabits_per_second].round(2)} Mbps"
end
Linux exposes TCP congestion control algorithm selection through socket options on systems with multiple algorithms available. Applications can query available algorithms and select specific implementations.
require 'socket'

# TCP_CONGESTION is Linux-specific. Use Ruby's constant when the platform
# defines it, otherwise fall back to the Linux value (13).
TCP_CONGESTION_OPT =
  Socket.const_defined?(:TCP_CONGESTION) ? Socket::TCP_CONGESTION : 13

def get_tcp_congestion_control(socket)
  option = socket.getsockopt(Socket::IPPROTO_TCP, TCP_CONGESTION_OPT)
  option.data.unpack1('Z*') # Algorithm name is a NUL-terminated string
rescue SystemCallError
  "Not supported on this system"
end

def set_tcp_congestion_control(socket, algorithm)
  socket.setsockopt(Socket::IPPROTO_TCP, TCP_CONGESTION_OPT, algorithm)
  true
rescue SystemCallError => e
  puts "Failed to set algorithm: #{e.message}"
  false
end

socket = TCPSocket.new('example.com', 80)
current_algorithm = get_tcp_congestion_control(socket)
puts "Current algorithm: #{current_algorithm}"

# Attempt to switch to BBR
if set_tcp_congestion_control(socket, 'bbr')
  puts "Switched to BBR"
else
  puts "BBR not available, using #{current_algorithm}"
end
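Querying which algorithms are available can be done without a socket on Linux, where the kernel lists them in a procfs file. The sketch below assumes that file is present and readable; the helper name is hypothetical, and it returns an empty list on other systems.

```ruby
# Linux lists loaded congestion control algorithms in this procfs file.
AVAILABLE_CC_PATH = '/proc/sys/net/ipv4/tcp_available_congestion_control'

# Returns algorithm names as an array, or [] when the file is not readable
# (for example on non-Linux systems).
def available_congestion_algorithms(path = AVAILABLE_CC_PATH)
  return [] unless File.readable?(path)

  File.read(path).split
end

puts "Available: #{available_congestion_algorithms.inspect}"
```

Checking this list before calling setsockopt avoids a failed system call for algorithms whose kernel module is not loaded.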
Monitoring TCP statistics reveals congestion control behavior. The netstat and ss command-line tools expose detailed connection information, and Ruby can parse this output or use system calls to gather metrics.
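require 'socket'

class TCPConnectionMonitor
  # TCP_INFO is Linux-specific; 11 is the Linux option number
  TCP_INFO_OPT = Socket.const_defined?(:TCP_INFO) ? Socket::TCP_INFO : 11

  def initialize(socket)
    @socket = socket
  end

  def local_address
    @socket.local_address
  end

  def remote_address
    @socket.remote_address
  end

  def connection_info
    info = @socket.getsockopt(Socket::IPPROTO_TCP, TCP_INFO_OPT)
    parse_tcp_info(info.data)
  rescue SystemCallError
    { error: "TCP_INFO not supported" }
  end

  private

  def parse_tcp_info(data)
    # Assumes the leading fields of Linux's struct tcp_info: eight one-byte
    # fields followed by 32-bit counters. The layout varies by system, so
    # treat these offsets as an assumption to check against your headers.
    return { error: "Unexpected TCP_INFO size" } if data.bytesize < 84

    fields = data.unpack('C8L19')
    {
      state: fields[0],        # Numeric TCP state
      retransmits: fields[2],  # Retransmit counter reported by the kernel
      rtt: fields[23],         # Smoothed round-trip time in microseconds
      rtt_var: fields[24],     # RTT variation in microseconds
      snd_cwnd: fields[26]     # Current congestion window in segments
    }
  end
end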
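Parsing command-line output is the other route mentioned above. The sketch below extracts a few fields from one detail line of `ss -ti` output. The sample line is illustrative and the exact field set depends on the kernel and iproute2 version, so the `key:value` format handled here is an assumption; `parse_ss_tcp_info` is a hypothetical helper.

```ruby
# Extracts RTT, congestion window, and ssthresh from one `ss -ti` detail line.
# Fields that are absent from the line come back as nil.
def parse_ss_tcp_info(line)
  rtt = line.match(%r{\brtt:([\d.]+)/([\d.]+)})
  cwnd = line[/\bcwnd:(\d+)/, 1]
  ssthresh = line[/\bssthresh:(\d+)/, 1]
  {
    algorithm: line.split.first,
    rtt_ms: rtt && rtt[1].to_f,
    rttvar_ms: rtt && rtt[2].to_f,
    cwnd_segments: cwnd&.to_i,
    ssthresh_segments: ssthresh&.to_i
  }
end

sample = 'cubic wscale:7,7 rto:204 rtt:3.5/1.25 mss:1448 rcv_ssthresh:64088 cwnd:10 ssthresh:20 minrtt:3.1'
p parse_ss_tcp_info(sample)
```

The word-boundary anchors keep `rtt:` from matching inside `minrtt:` and `ssthresh:` from matching inside `rcv_ssthresh:`.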
Server applications should configure appropriate buffer sizes and socket options based on expected connection characteristics. High-throughput servers benefit from larger buffers, while latency-sensitive applications may optimize for smaller buffers and TCP_NODELAY.
require 'socket'

# Named TunedTCPServer so it wraps, rather than reopens, Ruby's built-in TCPServer
class TunedTCPServer
  def initialize(port, buffer_size: 256_000, nodelay: false)
    @server = TCPServer.new(port)
    @buffer_size = buffer_size
    @nodelay = nodelay
  end

  def accept_connection
    client = @server.accept
    configure_socket(client)
    client
  end

  private

  def configure_socket(socket)
    # Set buffer sizes
    socket.setsockopt(Socket::SOL_SOCKET, Socket::SO_SNDBUF, @buffer_size)
    socket.setsockopt(Socket::SOL_SOCKET, Socket::SO_RCVBUF, @buffer_size)
    # Configure TCP_NODELAY if requested
    socket.setsockopt(Socket::IPPROTO_TCP, Socket::TCP_NODELAY, 1) if @nodelay
    # Enable TCP keepalive to detect dead connections
    socket.setsockopt(Socket::SOL_SOCKET, Socket::SO_KEEPALIVE, 1)
  end
end

server = TunedTCPServer.new(8080, buffer_size: 512_000, nodelay: true)
loop do
  client = server.accept_connection
  Thread.new(client) do |conn|
    # Handle connection
    # Congestion control operates automatically
    conn.close
  end
end
Performance Considerations
TCP congestion control significantly impacts application performance through throughput, latency, and connection reliability. The interaction between congestion control, network conditions, and application behavior determines actual performance.
Bandwidth-delay product (BDP) defines the amount of data that can be in flight to fully utilize a network path. BDP equals the bottleneck bandwidth multiplied by the round-trip time: BDP = bandwidth × RTT. The TCP congestion window must reach at least the BDP to achieve maximum throughput. On high-bandwidth, high-latency paths (like transcontinental links), the required window size can be several megabytes.
def calculate_bdp(bandwidth_mbps, rtt_ms)
# Convert to consistent units
bandwidth_bytes_per_sec = (bandwidth_mbps * 1_000_000) / 8
rtt_sec = rtt_ms / 1000.0
bdp_bytes = bandwidth_bytes_per_sec * rtt_sec
{
bdp_bytes: bdp_bytes,
bdp_kilobytes: bdp_bytes / 1024,
bdp_segments: bdp_bytes / 1460 # Typical MSS
}
end
# Example: 100 Mbps link with 50ms RTT
result = calculate_bdp(100, 50)
puts "BDP: #{result[:bdp_kilobytes].round} KB"
puts "Requires congestion window of #{result[:bdp_segments].round} segments"
# => BDP: 610 KB
# => Requires congestion window of 428 segments
Socket buffer sizes must accommodate the BDP. Insufficient buffers limit the congestion window growth, preventing full bandwidth utilization. The operating system's default buffer sizes often suffice for local networks but may constrain performance on wide-area network connections. Automatic buffer tuning in modern kernels adjusts buffer sizes dynamically, but applications can override these settings when necessary. On Linux, explicitly setting SO_SNDBUF or SO_RCVBUF disables autotuning for that socket, so override the defaults only when measurements show they are the limiting factor.
Slow start causes gradual throughput ramp-up at connection start and after timeouts. For short-lived connections transferring small amounts of data, slow start prevents reaching full throughput. The connection terminates before the congestion window grows large enough. This effect particularly impacts web traffic, where many HTTP requests complete in a few round trips. HTTP/2 and HTTP/3 mitigate this through connection reuse and stream multiplexing.
require 'socket'
require 'benchmark'
def measure_connection_rampup(host, port, data_size)
  times = []
  # Each transfer opens a new connection, so each one begins in slow start
  5.times do
    time = Benchmark.measure do
      socket = TCPSocket.new(host, port)
      socket.write("X" * data_size)
      socket.close
    end
    times << time.real
  end
  later = times[1..]
  later_average = later.sum / later.size
  {
    first_transfer: times.first,
    average_after_warmup: later_average,
    improvement: ((times.first - later_average) / times.first * 100).round(1)
  }
end
# Small transfers suffer more from slow start
result = measure_connection_rampup('example.com', 80, 10_000)
puts "First transfer: #{result[:first_transfer].round(3)}s"
puts "Subsequent average: #{result[:average_after_warmup].round(3)}s"
puts "Improvement: #{result[:improvement]}%"
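The cost of slow start for short transfers can also be estimated on paper. The sketch below counts how many round trips a transfer needs when the window doubles every RTT. It assumes an initial window of ten segments, a 1460-byte MSS, no loss, and no receiver-window limit; `slow_start_rounds` is a hypothetical helper.

```ruby
# Number of round trips needed to send `bytes` when the congestion window
# starts at initial_cwnd segments and doubles each RTT (pure slow start).
def slow_start_rounds(bytes, mss: 1460, initial_cwnd: 10)
  segments = (bytes.to_f / mss).ceil
  rounds = 0
  cwnd = initial_cwnd
  sent = 0
  while sent < segments
    sent += cwnd
    cwnd *= 2
    rounds += 1
  end
  rounds
end

puts slow_start_rounds(14_600)    # => 1
puts slow_start_rounds(100_000)   # => 3
puts slow_start_rounds(1_000_000) # => 7
```

On a 100ms path, a 100 KB response therefore needs roughly three round trips of data transfer on a fresh connection, regardless of link speed.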
Loss recovery mechanisms introduce latency spikes. Fast retransmit reduces delay compared to timeouts, but retransmission still requires at least one additional RTT. Applications sensitive to latency jitter must account for these variations. Streaming protocols and real-time applications often use UDP to avoid TCP's reliability and congestion control overhead, implementing custom congestion-aware mechanisms.
Bufferbloat occurs when excessive buffering in network devices causes latency inflation without providing throughput benefits. Large queues in routers delay congestion signals, causing TCP to send at high rates while packets accumulate in buffers. This increases latency dramatically while throughput remains unchanged. Modern queue management algorithms like CoDel and FQ-CoDel in routers mitigate bufferbloat, and congestion control algorithms like BBR reduce the effect by maintaining lower queue occupancy.
class LatencyMonitor
def initialize
@measurements = []
end
def measure_rtt(host)
require 'socket'
start = Time.now
socket = TCPSocket.new(host, 80)
socket.write("GET / HTTP/1.1\r\nHost: #{host}\r\n\r\n")
first_byte = socket.read(1)
rtt = (Time.now - start) * 1000 # ms; spans handshake, request, and server time, so it overestimates one RTT
socket.close
@measurements << rtt
rtt
end
def statistics
return {} if @measurements.empty?
sorted = @measurements.sort
{
min: sorted.first,
max: sorted.last,
median: sorted[sorted.size / 2],
average: @measurements.sum / @measurements.size,
jitter: sorted.last - sorted.first
}
end
end
monitor = LatencyMonitor.new
10.times { monitor.measure_rtt('example.com') }
stats = monitor.statistics
puts "RTT - Min: #{stats[:min].round(1)}ms, Max: #{stats[:max].round(1)}ms"
puts "Jitter: #{stats[:jitter].round(1)}ms"
# High jitter may indicate bufferbloat or congestion
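The latency cost of a full buffer follows from simple arithmetic: the time to drain the queue is its size divided by the link rate. The sketch below uses a hypothetical helper and illustrative numbers.

```ruby
# Time (in ms) for a packet to wait behind `queued_bytes` of data
# on a link of `bandwidth_mbps` megabits per second.
def queueing_delay_ms(queued_bytes, bandwidth_mbps)
  queued_bytes * 8 * 1000.0 / (bandwidth_mbps * 1_000_000)
end

puts queueing_delay_ms(1_000_000, 10) # => 800.0
puts queueing_delay_ms(256_000, 1)    # => 2048.0
```

A 1 MB queue in front of a 10 Mbps uplink adds 800ms to every packet behind it, which dwarfs the propagation delay of most paths.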
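Fairness issues arise when connections with different RTTs share bottleneck links. Loss-based algorithms whose window growth is clocked by acknowledgments, such as Reno, favor connections with shorter RTTs, as they recover from congestion faster and grow their windows more quickly. At equal loss rates, Reno throughput scales roughly with 1/RTT, so a connection with a 10ms RTT may achieve about 10x the throughput of a 100ms RTT connection sharing the same link. CUBIC reduces this bias because its window growth depends on the time elapsed since the last loss rather than on the RTT. BBR changes the picture again by modeling the network rather than reacting purely to loss, though its fairness across RTTs depends on the version and the path.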
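The RTT bias of Reno-style congestion avoidance can be illustrated with the widely cited Mathis approximation, throughput ≈ (MSS / RTT) × (√1.5 / √p), where p is the loss rate. The formula is not part of the text above and holds only under idealized assumptions (steady state, random loss, no timeouts); the function name is hypothetical.

```ruby
# Approximate steady-state throughput of a Reno-style flow in bits per second.
def reno_throughput_bps(mss_bytes:, rtt_s:, loss_rate:)
  (mss_bytes * 8.0 / rtt_s) * (Math.sqrt(1.5) / Math.sqrt(loss_rate))
end

fast = reno_throughput_bps(mss_bytes: 1460, rtt_s: 0.010, loss_rate: 0.001)
slow = reno_throughput_bps(mss_bytes: 1460, rtt_s: 0.100, loss_rate: 0.001)
puts "10ms RTT:  #{(fast / 1_000_000).round(1)} Mbps"
puts "100ms RTT: #{(slow / 1_000_000).round(1)} Mbps"
puts "Ratio: #{(fast / slow).round(1)}"
```

With the loss rate held equal, the ratio between the two flows equals the inverse ratio of their RTTs.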
Multiple concurrent TCP connections from the same application compete with each other for bandwidth. When those connections are the only traffic on a bottleneck, opening more of them adds little aggregate throughput; when they share the bottleneck with other users, they claim a larger share at those users' expense, which reduces fairness. HTTP/2 and HTTP/3 multiplexing remove much of the need for multiple connections while achieving parallelism at the application layer.
require 'socket'
require 'benchmark'
def compare_serial_vs_parallel(host, port, data_size, parallel_count)
# Serial: one connection
serial_time = Benchmark.measure do
socket = TCPSocket.new(host, port)
parallel_count.times { socket.write("X" * data_size) }
socket.close
end.real
# Parallel: multiple connections
parallel_time = Benchmark.measure do
threads = parallel_count.times.map do
Thread.new do
socket = TCPSocket.new(host, port)
socket.write("X" * data_size)
socket.close
end
end
threads.each(&:join)
end.real
{
serial_seconds: serial_time,
parallel_seconds: parallel_time,
speedup: serial_time / parallel_time
}
end
Practical Examples
Understanding congestion control through concrete scenarios demonstrates how the mechanisms respond to various network conditions and application patterns.
Bulk Data Transfer shows congestion control ramping up throughput. A file transfer starts with slow start exponential growth, transitions to congestion avoidance linear growth, and stabilizes around the available bandwidth. Packet losses cause periodic window reductions, creating the characteristic saw-tooth pattern.
require 'socket'
class BulkTransfer
def initialize(host, port)
@host = host
@port = port
@stats = {
bytes_sent: 0,
send_errors: 0,
throughput_samples: []
}
end
def transfer_file(file_path, chunk_size: 65_536)
socket = TCPSocket.new(@host, @port)
# Configure for bulk transfer
socket.setsockopt(Socket::SOL_SOCKET, Socket::SO_SNDBUF, 524_288)
File.open(file_path, 'rb') do |file|
last_sample_time = Time.now
bytes_since_sample = 0
while chunk = file.read(chunk_size)
begin
bytes = socket.write(chunk)
@stats[:bytes_sent] += bytes
bytes_since_sample += bytes
# Sample throughput every second
if Time.now - last_sample_time >= 1.0
throughput = bytes_since_sample / (Time.now - last_sample_time)
@stats[:throughput_samples] << throughput
bytes_since_sample = 0
last_sample_time = Time.now
end
rescue Errno::EPIPE, Errno::ECONNRESET => e
@stats[:send_errors] += 1
raise
end
end
end
socket.close
@stats
end
def analyze_throughput
return {} if @stats[:throughput_samples].empty?
samples = @stats[:throughput_samples]
{
min_mbps: (samples.min * 8 / 1_000_000.0).round(2),
max_mbps: (samples.max * 8 / 1_000_000.0).round(2),
avg_mbps: (samples.sum / samples.size * 8 / 1_000_000.0).round(2),
variation: ((samples.max - samples.min) / (samples.sum / samples.size) * 100).round(1)
}
end
end
transfer = BulkTransfer.new('fileserver.example.com', 9000)
stats = transfer.transfer_file('large_file.dat')
analysis = transfer.analyze_throughput
puts "Transferred: #{stats[:bytes_sent] / 1_000_000} MB"
puts "Throughput: #{analysis[:avg_mbps]} Mbps (#{analysis[:min_mbps]} - #{analysis[:max_mbps]})"
puts "Variation: #{analysis[:variation]}%"
# Variation shows congestion window growth/reduction cycles
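Connection Timeout Recovery models slow start restart after a timeout. When a connection experiences severe congestion, a retransmission timeout resets the congestion window to one segment and throughput rebuilds gradually. The kernel does not report retransmission timeouts to Ruby code, so the example treats a stalled write as a stand-in for one and tracks an estimated window rather than the real cwnd.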
require 'socket'
require 'timeout'
class TimeoutRecoveryTest
def test_recovery(host, port, data_size)
socket = TCPSocket.new(host, port)
recovery_log = []
# Send data and track recovery from timeouts
chunk = "X" * 1460 # One MSS
chunks_sent = 0
loop do
begin
Timeout.timeout(5) do
socket.write(chunk)
chunks_sent += 1
# Track approximate congestion window growth
# Window starts at 1, doubles each RTT in slow start
estimated_cwnd = if chunks_sent < 64
2 ** (Math.log2(chunks_sent).floor)
else
64 + (chunks_sent - 64) # Linear growth
end
recovery_log << {
chunk: chunks_sent,
estimated_cwnd: estimated_cwnd,
time: Time.now
}
end
rescue Timeout::Error
# Timeout resets congestion window
recovery_log << {
event: 'TIMEOUT',
chunks_before_timeout: chunks_sent,
time: Time.now
}
chunks_sent = 0 # Simulate cwnd reset
rescue Errno::EPIPE
break
end
break if chunks_sent * 1460 >= data_size
end
socket.close
recovery_log
end
end
Multiple Parallel Streams illustrates how concurrent connections share bandwidth. Each connection runs its own congestion control, competing for bottleneck capacity. The streams converge toward equal bandwidth shares, though convergence speed depends on the algorithm and RTT differences.
require 'socket'
class ParallelStreamTest
def initialize(host, port, stream_count)
@host = host
@port = port
@stream_count = stream_count
@stream_stats = Array.new(stream_count) { { bytes: 0, start: nil } }
end
def run_test(duration)
threads = @stream_count.times.map do |i|
Thread.new do
stream_transfer(i, duration)
end
end
threads.each(&:join)
analyze_fairness
end
private
def stream_transfer(stream_id, duration)
socket = TCPSocket.new(@host, @port)
@stream_stats[stream_id][:start] = Time.now
end_time = Time.now + duration
while Time.now < end_time
bytes = socket.write("X" * 8192)
@stream_stats[stream_id][:bytes] += bytes
end
socket.close
end
def analyze_fairness
throughputs = @stream_stats.map do |stat|
stat[:bytes] / (Time.now - stat[:start])
end
avg = throughputs.sum / throughputs.size
jains_index = (throughputs.sum ** 2) /
(@stream_count * throughputs.map { |t| t ** 2 }.sum)
{
throughputs_mbps: throughputs.map { |t| (t * 8 / 1_000_000.0).round(2) },
average_mbps: (avg * 8 / 1_000_000.0).round(2),
fairness_index: jains_index.round(3)
}
# Jain's fairness index: 1.0 = perfect fairness, lower = more unfair
end
end
test = ParallelStreamTest.new('example.com', 8080, 4)
result = test.run_test(30)
puts "Stream throughputs: #{result[:throughputs_mbps]} Mbps"
puts "Fairness index: #{result[:fairness_index]}"
Short-Lived Connections highlight slow start impact on performance. Many HTTP requests complete before the congestion window grows large, never achieving full throughput. Connection reuse through HTTP keep-alive mitigates this effect.
require 'socket'
require 'net/http'
class ConnectionReuseComparison
def test_without_keepalive(host, path, request_count)
times = []
request_count.times do
start = Time.now
socket = TCPSocket.new(host, 80)
socket.write("GET #{path} HTTP/1.1\r\nHost: #{host}\r\nConnection: close\r\n\r\n")
response = socket.read
socket.close
times << Time.now - start
end
times
end
def test_with_keepalive(host, path, request_count)
times = []
socket = TCPSocket.new(host, 80)
request_count.times do
start = Time.now
socket.write("GET #{path} HTTP/1.1\r\nHost: #{host}\r\nConnection: keep-alive\r\n\r\n")
# Read response (simplified - real code needs proper HTTP parsing)
response = socket.readpartial(4096)
times << Time.now - start
end
socket.close
times
end
def compare(host, path, request_count)
without = test_without_keepalive(host, path, request_count)
with = test_with_keepalive(host, path, request_count)
{
without_keepalive: {
total: without.sum.round(3),
average: (without.sum / without.size).round(3),
first: without.first.round(3)
},
with_keepalive: {
total: with.sum.round(3),
average: (with.sum / with.size).round(3),
first: with.first.round(3)
},
improvement: ((without.sum - with.sum) / without.sum * 100).round(1)
}
end
end
comparison = ConnectionReuseComparison.new
result = comparison.compare('example.com', '/api/data', 10)
puts "Without keep-alive: #{result[:without_keepalive][:average]}s avg"
puts "With keep-alive: #{result[:with_keepalive][:average]}s avg"
puts "Improvement: #{result[:improvement]}%"
Common Pitfalls
Several misconceptions and mistakes frequently occur when working with TCP congestion control, leading to performance issues and incorrect assumptions.
Assuming More Connections Equals Higher Throughput represents a persistent misunderstanding. Opening multiple parallel TCP connections does not increase the capacity of a single bottleneck link. Each connection competes for the same bandwidth; any gain comes from taking a larger share from other flows, so total throughput stays similar while fairness to other users worsens. Browsers using HTTP/1.1 commonly opened about six parallel connections per host, largely to work around the protocol's lack of request multiplexing. Modern protocols like HTTP/2 multiplex streams over a single connection, achieving parallelism without multiple TCP connections.
# Demonstrates that parallel connections don't increase throughput
require 'socket'
def test_throughput_scaling(host, port, max_connections)
results = {}
(1..max_connections).each do |conn_count|
bytes_total = 0
start_time = Time.now
threads = conn_count.times.map do
Thread.new do
socket = TCPSocket.new(host, port)
bytes = 0
5.times { bytes += socket.write("X" * 65536) }
socket.close
bytes
end
end
bytes_total = threads.map(&:value).sum
duration = Time.now - start_time
results[conn_count] = (bytes_total / duration * 8 / 1_000_000.0).round(2)
end
results
# Shows diminishing or no returns beyond single connection
end
results = test_throughput_scaling('example.com', 8080, 5)
results.each { |conns, mbps| puts "#{conns} connections: #{mbps} Mbps" }
# Expected: Similar throughput regardless of connection count
Ignoring Buffer Size Requirements causes throughput degradation on high-BDP paths. Default socket buffer sizes work adequately for local networks but constrain performance on wide-area connections. The sender cannot keep more unacknowledged data in flight than the send buffer holds, so a small buffer caps throughput no matter how far the congestion window could grow. Automatic tuning in modern kernels helps but may not reach optimal sizes for all scenarios. Applications should verify that buffer sizes match network characteristics.
Misinterpreting TCP_NODELAY Effects leads to latency issues or throughput reduction. Disabling Nagle's algorithm prevents small packet coalescing, reducing latency for interactive applications but potentially decreasing efficiency. Many applications blindly enable TCP_NODELAY without understanding the trade-off. Bulk data transfers generally benefit from Nagle's algorithm enabled (the default), while request-response protocols benefit from disabling it.
require 'socket'
def compare_nagle_impact(host, port, message_size, message_count)
  # Test with Nagle enabled (default)
  socket_nagle = TCPSocket.new(host, port)
  start = Time.now
  message_count.times { socket_nagle.write("X" * message_size) }
  time_with_nagle = Time.now - start
  socket_nagle.close
  # Test with Nagle disabled
  socket_nodelay = TCPSocket.new(host, port)
  socket_nodelay.setsockopt(Socket::IPPROTO_TCP, Socket::TCP_NODELAY, 1)
  start = Time.now
  message_count.times { socket_nodelay.write("X" * message_size) }
  time_without_nagle = Time.now - start
  socket_nodelay.close
  # Small messages show more impact from Nagle's algorithm
  {
    nagle_enabled: time_with_nagle.round(3),
    nagle_disabled: time_without_nagle.round(3),
    difference_ms: ((time_without_nagle - time_with_nagle) * 1000).round(1)
  }
end
# Small messages (100 bytes)
result_small = compare_nagle_impact('example.com', 80, 100, 50)
puts "Small messages - Nagle: #{result_small[:nagle_enabled]}s, " \
     "No Nagle: #{result_small[:nagle_disabled]}s"
Expecting Consistent Throughput ignores congestion control's dynamic nature. TCP throughput varies continuously as the congestion window grows and shrinks. Applications requiring consistent bandwidth should implement rate limiting or use protocols designed for constant rate delivery. The saw-tooth pattern of loss-based congestion control means instantaneous throughput oscillates significantly even when average throughput appears stable.
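The sawtooth can be made concrete with a deliberately simplified model. The sketch below assumes pure additive-increase/multiplicative-decrease: cwnd grows by one segment per RTT and is halved once it reaches an assumed path capacity. Slow start, timeouts, and queueing are ignored, and `simulate_sawtooth` is a hypothetical name used only for illustration.

```ruby
# Simplified AIMD model: one sample of cwnd (in segments) per RTT.
# cwnd grows by 1 per RTT and is halved once it reaches capacity.
def simulate_sawtooth(capacity_segments, rtts)
  cwnd = capacity_segments / 2
  Array.new(rtts) do
    current = cwnd
    cwnd = cwnd >= capacity_segments ? cwnd / 2 : cwnd + 1
    current
  end
end

samples = simulate_sawtooth(100, 200)
average = samples.sum / samples.size.to_f
puts "min=#{samples.min} max=#{samples.max} avg=#{average.round(1)}"
```

Even in this idealized model the window, and with it the instantaneous sending rate, swings between half and full capacity, while the long-run average sits in between.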
Assuming Packet Loss Equals Network Congestion misdiagnoses issues on wireless and cellular networks. These links experience packet loss from radio interference, signal attenuation, and mobility, independent of congestion. Traditional loss-based TCP algorithms interpret all loss as congestion and reduce their sending rates unnecessarily. This is one reason BBR often outperforms loss-based algorithms on wireless links: it paces sending from a model of the path (estimated bottleneck bandwidth and minimum RTT) rather than treating every loss as a congestion signal.
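To get a feel for how strongly random loss limits a loss-based sender, the sketch below evaluates the Mathis estimate for a few loss rates. It assumes a 1460-byte MSS, a 50 ms RTT, and the commonly quoted constant C = sqrt(3/2); `mathis_throughput_mbps` is a hypothetical helper, and the result is an approximation for Reno-style behavior, not a prediction for any particular stack.

```ruby
# Mathis estimate for Reno-style loss-based throughput, in Mbps.
# c defaults to sqrt(3/2), about 1.22, the commonly quoted constant.
def mathis_throughput_mbps(mss_bytes, rtt_ms, loss_rate, c: Math.sqrt(1.5))
  bytes_per_second = (mss_bytes / (rtt_ms / 1000.0)) * (c / Math.sqrt(loss_rate))
  bytes_per_second * 8 / 1_000_000.0
end

[0.0001, 0.001, 0.01].each do |loss|
  mbps = mathis_throughput_mbps(1460, 50, loss)
  puts format("loss %.2f%% -> about %.1f Mbps", loss * 100, mbps)
end
```

Because throughput scales with one over the square root of the loss rate, a hundredfold increase in loss cuts the estimate by a factor of ten, whether or not that loss had anything to do with congestion.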
Overlooking Head-of-Line Blocking in application protocol design causes latency problems. TCP delivers data in order, so a single lost segment blocks delivery of all subsequent segments until retransmission completes. Applications multiplexing multiple independent streams over one TCP connection experience unnecessary delays. HTTP/2 suffered from this issue, leading to HTTP/3's adoption of QUIC, which provides per-stream ordering over UDP.
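# Demonstrates sequential blocking in TCP
require 'socket'
class StreamMultiplexer
  def initialize(socket)
    @socket = socket
    @buffer = String.new # binary buffer for bytes not yet parsed
  end
  def send_stream_data(stream_id, data)
    # All streams share one TCP connection
    # Loss in any stream blocks all streams
    message = "STREAM:#{stream_id}:#{data.bytesize}:#{data}"
    @socket.write(message)
  end
  def receive_stream_data
    # Blocked if any data is lost, even if for a different stream
    # TCP provides no per-stream delivery
    loop do
      message = parse_stream_message
      return message if message
      @buffer << @socket.readpartial(4096)
    end
  end
  private
  # Returns [stream_id, payload] once a complete frame is buffered, else nil
  def parse_stream_message
    header = @buffer.match(/\ASTREAM:([^:]+):(\d+):/)
    return nil unless header
    header_size = header[0].bytesize
    payload_size = header[2].to_i
    return nil if @buffer.bytesize < header_size + payload_size
    frame = @buffer.slice!(0, header_size + payload_size)
    [header[1], frame.byteslice(header_size, payload_size)]
  end
end
# Issue: Stream 1 loss blocks Stream 2 delivery, even though Stream 2 data arrived
# HTTP/3 with QUIC solves this with independent stream ordering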
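The blocking itself happens below the application, in the receiver's reassembly logic. The following sketch models it with a hypothetical `InOrderReceiver` class: segments are identified by a simple sequence number (an assumption for brevity, real TCP counts bytes) and are released only when every earlier segment has arrived.

```ruby
# Models a TCP receiver: segments are handed to the application only in
# sequence order, so a gap holds back everything queued behind it.
class InOrderReceiver
  def initialize
    @next_seq = 0
    @pending = {}
  end

  # Accepts a segment and returns the payloads that became deliverable
  def receive(seq, payload)
    @pending[seq] = payload
    delivered = []
    while @pending.key?(@next_seq)
      delivered << @pending.delete(@next_seq)
      @next_seq += 1
    end
    delivered
  end
end

receiver = InOrderReceiver.new
p receiver.receive(1, "stream2:a")  # => [] (segment 0, stream 1, was lost)
p receiver.receive(2, "stream2:b")  # => []
p receiver.receive(0, "stream1:a")  # => ["stream1:a", "stream2:a", "stream2:b"]
```

Stream 2's data sits in the receive buffer the whole time, yet the application sees none of it until the retransmitted stream 1 segment fills the gap.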
Neglecting Congestion Control Algorithm Selection leaves default choices that may not suit the application's environment. CUBIC works well for general Internet traffic but may underperform compared to BBR on high-speed links or paths with packet loss from non-congestion causes. Data center applications benefit from DCTCP when supported. Applications should evaluate algorithm choices based on deployment characteristics.
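On Linux, the algorithm can be inspected and changed per socket through the TCP_CONGESTION option. The sketch below is a minimal example under that assumption; the helper names are hypothetical, the constant is only defined where the platform provides it, and selecting an algorithm fails unless the kernel has it available and permits it for the calling process.

```ruby
require 'socket'

# Returns the congestion control algorithm name for a socket, or nil when
# the platform does not expose TCP_CONGESTION or the query fails.
def congestion_algorithm(socket)
  return nil unless Socket.const_defined?(:TCP_CONGESTION)
  option = socket.getsockopt(Socket::IPPROTO_TCP, Socket::TCP_CONGESTION)
  option.data.unpack1("Z*") # strip the NUL padding after the name
rescue SystemCallError
  nil
end

# Tries to switch algorithms; returns true on success, false otherwise
def select_congestion_algorithm(socket, name)
  return false unless Socket.const_defined?(:TCP_CONGESTION)
  socket.setsockopt(Socket::IPPROTO_TCP, Socket::TCP_CONGESTION, name)
  true
rescue SystemCallError
  false
end

socket = Socket.new(:INET, :STREAM)
puts "current: #{congestion_algorithm(socket).inspect}"
puts "bbr selected: #{select_congestion_algorithm(socket, 'bbr')}"
socket.close
```

Returning nil or false instead of raising keeps the example portable; a real deployment would probably want to log the failure, since silently falling back to the default can hide a misconfigured host.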
Assuming Loss Detection is Immediate leads to incorrect timeout values in application protocols. Fast retransmit requires three duplicate acknowledgments, which means at least three segments sent after the lost one must reach the receiver. On paths with few packets in flight, or when several consecutive packets are lost, too few duplicate ACKs arrive and timeout-based retransmission becomes necessary, introducing significant delay. Applications building on TCP must account for worst-case recovery times rather than optimistic scenarios.
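The duplicate-ACK rule can be traced with a small model. The sketch below assumes per-segment sequence numbers and a receiver that acknowledges every arrival with the next sequence number it still needs; `cumulative_acks` and `fast_retransmit_at` are hypothetical helpers, not part of any socket API.

```ruby
# Cumulative ACKs a receiver would send for the given arrival order.
# Each ACK names the next sequence number the receiver still needs.
def cumulative_acks(arrivals)
  received = {}
  next_needed = 0
  arrivals.map do |seq|
    received[seq] = true
    next_needed += 1 while received[next_needed]
    next_needed
  end
end

# Position in the ACK stream where the third duplicate is seen, or nil
def fast_retransmit_at(acks, threshold: 3)
  duplicates = 0
  acks.each_cons(2).with_index(1) do |(previous, current), index|
    duplicates = current == previous ? duplicates + 1 : 0
    return index if duplicates == threshold
  end
  nil
end

# Segment 2 is lost; three later segments arrive
p fast_retransmit_at(cumulative_acks([0, 1, 3, 4, 5]))  # => 4
# Segment 2 is lost; only two later segments arrive
p fast_retransmit_at(cumulative_acks([0, 1, 3, 4]))     # => nil
```

In the second case the sender never sees a third duplicate, so under this model recovery has to wait for the retransmission timer.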
Reference
Congestion Control States
| State | Window Behavior | Trigger | Transition |
|---|---|---|---|
| Slow Start | Exponential growth (doubles per RTT) | Connection start or timeout | Reaches ssthresh or loss detected |
| Congestion Avoidance | Linear growth (increases by 1 MSS per RTT) | cwnd reaches ssthresh | Loss detection |
| Fast Recovery | Inflated window during recovery | Three duplicate ACKs | New ACK received or timeout |
Algorithm Comparison
| Algorithm | Loss Response | Fairness | Latency | Best Environment |
|---|---|---|---|---|
| Tahoe | Reset to 1 segment | Poor on high-BDP | High after loss | Legacy systems |
| Reno | Halve window on dupacks | Moderate | Moderate | General Internet |
| NewReno | Improved multi-loss handling | Moderate | Moderate | General Internet |
| CUBIC | Reduce window to about 0.7×, then cubic growth | Good on high-BDP | Moderate | High-speed networks |
| BBR | Model-based rate control | Varies (BBRv1 can crowd out loss-based flows) | Low | High-BDP or lossy paths |
| DCTCP | Proportional to ECN marks | Excellent | Very low | Data centers |
Key Parameters
| Parameter | Description | Typical Value | Impact |
|---|---|---|---|
| Initial cwnd | Starting congestion window | 10 segments | Connection startup time |
| MSS | Maximum segment size | 1460 bytes | Bandwidth efficiency |
| ssthresh | Slow start threshold | 64 KB or half cwnd after loss | Growth rate transition |
| RTO | Retransmission timeout | Calculated from RTT | Loss recovery delay |
| Duplicate ACK threshold | Triggers fast retransmit | 3 | Loss detection speed |
Socket Options
| Option | Level | Description | Usage |
|---|---|---|---|
| TCP_NODELAY | IPPROTO_TCP | Disable Nagle's algorithm | Interactive protocols |
| SO_SNDBUF | SOL_SOCKET | Send buffer size | High-BDP paths |
| SO_RCVBUF | SOL_SOCKET | Receive buffer size | High-BDP paths |
| TCP_CONGESTION | IPPROTO_TCP | Select algorithm (Linux) | Environment-specific tuning |
| SO_KEEPALIVE | SOL_SOCKET | Enable keepalive probes | Detect dead connections |
| TCP_INFO | IPPROTO_TCP | Query connection statistics | Monitoring and debugging |
Congestion Signals
| Signal | Detection | Algorithm Response | Recovery Time |
|---|---|---|---|
| Three duplicate ACKs | Fast retransmit | Halve cwnd, enter fast recovery | 1 RTT minimum |
| Timeout | RTO expiration | Set ssthresh to half the flight size, reset cwnd to 1 segment | RTO duration plus recovery |
| ECN mark | CE codepoint in IP header, echoed in TCP ECE flag | Reduce cwnd (proportional in DCTCP) | 1 RTT |
| Partial ACK | NewReno in fast recovery | Retransmit next segment | Continues recovery |
Performance Formulas
| Formula | Description | Usage |
|---|---|---|
| BDP = Bandwidth × RTT | Bandwidth-delay product | Minimum window size for full utilization |
| Throughput ≈ (MSS / RTT) × (C / √Loss) | Mathis equation for loss-based algorithms | Estimate achievable throughput |
| Throughput ≈ Window / RTT | Instantaneous throughput | Current sending rate |
| SRTT_new = (7/8) × SRTT_old + (1/8) × RTT_sample | Exponential smoothing of RTT | RTO calculation |
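The RTT smoothing formula feeds the retransmission timeout. The sketch below follows the standard calculation from RFC 6298 (smoothed RTT plus four times the RTT variance) with a hypothetical `RtoEstimator` class. It omits the clock-granularity term and uses the RFC's one-second floor, which many stacks lower in practice.

```ruby
# RTO estimation in the style of RFC 6298; all times are in seconds
class RtoEstimator
  ALPHA = 1.0 / 8   # gain for the smoothed RTT
  BETA = 1.0 / 4    # gain for the RTT variance
  MIN_RTO = 1.0     # RFC 6298 floor; an assumption, stacks often use less

  attr_reader :srtt, :rttvar

  # Feeds one RTT sample and returns the resulting RTO
  def update(sample)
    if @srtt.nil?
      @srtt = sample
      @rttvar = sample / 2.0
    else
      # Variance is updated first, using the previous smoothed RTT
      @rttvar = (1 - BETA) * @rttvar + BETA * (@srtt - sample).abs
      @srtt = (1 - ALPHA) * @srtt + ALPHA * sample
    end
    rto
  end

  def rto
    [@srtt + 4 * @rttvar, MIN_RTO].max
  end
end

estimator = RtoEstimator.new
[0.5, 0.5, 0.9].each do |rtt|
  puts "sample=#{rtt}s rto=#{estimator.update(rtt).round(3)}s"
end
```

Because the variance term is multiplied by four, a single delayed sample raises the timeout noticeably, which is part of why timeout-based recovery is so much slower than fast retransmit.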
Common Buffer Sizes
| Scenario | Send Buffer | Receive Buffer | Rationale |
|---|---|---|---|
| LAN | 64-128 KB | 64-128 KB | Low RTT, modest BDP |
| Internet | 256-512 KB | 256-512 KB | Moderate RTT and bandwidth |
| High-speed WAN | 2-4 MB | 2-4 MB | High BDP requires large buffers |
| Data center | 128-256 KB | 128-256 KB | Low latency, controlled environment |
| Mobile | 64-256 KB | 64-256 KB | Variable conditions, memory constrained |
Algorithm Evolution Timeline
| Year | Development | Significance |
|---|---|---|
| 1988 | TCP Tahoe | First congestion control implementation |
| 1990 | TCP Reno | Fast recovery improves performance |
| 1996 | TCP NewReno | Better multiple loss handling |
| 2001 | ECN | Explicit congestion signals |
| 2006 | CUBIC default in Linux | Optimized for high-speed networks |
| 2010 | DCTCP | Data center optimization |
| 2016 | BBR | Model-based control paradigm |