Overview
Distributed computing involves coordinating multiple independent computers to work together as a single system, sharing computation, storage, and resources across network boundaries. Each node operates autonomously with its own memory and processor, communicating through message passing rather than shared memory.
The concept emerged from the need to solve problems too large for single machines: processing massive datasets, serving millions of concurrent users, or providing fault tolerance through redundancy. Distributed systems power modern infrastructure from web applications and databases to cloud platforms and microservices architectures.
Three characteristics define distributed computing: concurrency (multiple nodes execute simultaneously), lack of a global clock (no single authoritative time source), and independent failure (individual components can fail while the rest of the system keeps running). These create fundamental challenges absent in single-machine computing.
# Simple distributed task example
require 'socket'
class DistributedWorker
def initialize(coordinator_host, coordinator_port)
@socket = TCPSocket.new(coordinator_host, coordinator_port)
end
def request_work
@socket.puts("READY")
task = @socket.gets.chomp
result = process_task(task)
@socket.puts("RESULT:#{result}")
end
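def process_task(task)
# Never eval text received over the network: it lets the sender run arbitrary code.
# Dispatch on a known task name instead; the "name:argument" format and the
# task names below are illustrative.
name, arg = task.split(':', 2)
case name
when 'square' then Integer(arg)**2
when 'upcase' then arg.to_s.upcase
else raise ArgumentError, "unknown task: #{name}"
end
end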
end
# Worker connects to coordinator and processes tasks
worker = DistributedWorker.new('localhost', 5000)
worker.request_work
The distinction between parallel and distributed computing matters: parallel computing uses multiple processors within a single machine sharing memory, while distributed computing coordinates separate machines with isolated memory. Distributed systems face network latency, partial failures, and consistency challenges that parallel systems avoid.
Key Principles
Network Communication: Nodes communicate through message passing over unreliable networks. Messages may arrive out of order, be duplicated, or never arrive at all. Protocols must handle these scenarios through acknowledgments, timeouts, and retries. TCP provides reliable ordered delivery but adds latency; UDP offers lower latency without delivery guarantees.
Consistency Models: The CAP theorem states that when a network partition occurs, a distributed system must choose between consistency and availability; it cannot guarantee both. Strong consistency ensures all nodes see the same data at the same time but requires coordination overhead. Eventual consistency allows temporary inconsistencies in exchange for higher availability. Causal consistency preserves cause-effect relationships without global synchronization.
# Eventual consistency example with version vectors
class EventuallyConsistentStore
def initialize
@data = {}
@versions = Hash.new { |h, k| h[k] = {} }
end
def write(key, value, node_id)
@versions[key][node_id] = (@versions[key][node_id] || 0) + 1
@data[key] = { value: value, version: @versions[key].dup }
end
def read(key)
@data[key]
end
def merge(key, remote_value, remote_version)
local_version = @versions[key]
# Detect conflicts using version vectors
if dominates?(remote_version, local_version)
# Remote is newer, accept it
@data[key] = { value: remote_value, version: remote_version }
@versions[key] = remote_version
elsif dominates?(local_version, remote_version) || local_version == remote_version
# Local is newer or identical, keep it
return
else
# Concurrent writes - conflict resolution needed
resolve_conflict(key, remote_value, remote_version)
end
end
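def dominates?(v1, v2)
# Compare over the union of node IDs so a counter missing from v1 counts as 0;
# iterating over v1 alone would miss nodes that only v2 has seen
nodes = v1.keys | v2.keys
nodes.all? { |node| (v1[node] || 0) >= (v2[node] || 0) } &&
nodes.any? { |node| (v1[node] || 0) > (v2[node] || 0) }
end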
def resolve_conflict(key, remote_value, remote_version)
# Application-specific conflict resolution
# Could use last-write-wins, merge values, etc.
end
end
Fault Tolerance: Distributed systems must continue operating despite node failures. Replication stores data on multiple nodes to survive failures. Consensus algorithms like Paxos and Raft enable nodes to agree on values despite failures. Circuit breakers prevent cascading failures by stopping requests to failing services.
Time and Ordering: Without synchronized clocks, determining event order becomes complex. Logical clocks (Lamport timestamps) track causality through message passing. Vector clocks detect concurrent events across nodes. Time synchronization protocols like NTP provide approximate wall-clock time but with bounded uncertainty.
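A Lamport timestamp can be sketched in a few lines. The class and method names below are illustrative, not taken from any library: each node keeps a counter, increments it on local events and sends, and on receipt jumps past the sender's timestamp.

```ruby
# Minimal Lamport clock sketch (names are hypothetical).
class LamportClock
  attr_reader :time

  def initialize
    @time = 0
  end

  # Local event or message send: advance the counter and return the new value.
  def tick
    @time += 1
  end

  # Message receipt: move past the sender's timestamp, counting the receive event.
  def receive(remote_time)
    @time = [@time, remote_time].max + 1
  end
end

a = LamportClock.new
b = LamportClock.new

sent_at = a.tick                  # a stamps an outgoing message with 1
b.tick                            # b records an unrelated local event (b is at 1)
received_at = b.receive(sent_at)  # b moves to max(1, 1) + 1 = 2
```

The receive event always carries a larger timestamp than the send that caused it, which is the causality guarantee. The reverse does not hold: a larger timestamp alone does not prove one event caused another.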
State Management: Distributed state requires careful coordination. Stateless services simplify scaling by handling each request independently. Stateful services must partition and replicate state across nodes. Session affinity routes requests to specific nodes but creates scaling bottlenecks.
Data Partitioning: Splitting data across nodes enables horizontal scaling. Hash-based partitioning distributes data uniformly but complicates range queries. Range-based partitioning groups related data but risks uneven load distribution. Consistent hashing minimizes data movement when nodes join or leave.
# Consistent hashing implementation
require 'digest'
class ConsistentHash
def initialize(replicas = 150)
@replicas = replicas
@ring = {}
@sorted_keys = []
@nodes = []
end
def add_node(node)
@replicas.times do |i|
key = hash_key("#{node}:#{i}")
@ring[key] = node
@sorted_keys << key
end
@sorted_keys.sort!
@nodes << node
end
def remove_node(node)
@replicas.times do |i|
key = hash_key("#{node}:#{i}")
@ring.delete(key)
@sorted_keys.delete(key)
end
@nodes.delete(node)
end
def get_node(key)
return nil if @ring.empty?
hash = hash_key(key)
idx = @sorted_keys.bsearch_index { |k| k >= hash } || 0
@ring[@sorted_keys[idx]]
end
private
def hash_key(key)
Digest::MD5.hexdigest(key.to_s).to_i(16)
end
end
# Usage
ch = ConsistentHash.new
ch.add_node('server1')
ch.add_node('server2')
ch.add_node('server3')
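puts ch.get_node('user:1001') # => one of server1, server2, server3 (determined by the key's MD5 ring position)
puts ch.get_node('user:1002') # => the same node on every call while membership is unchanged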
Implementation Approaches
Coordinator-Worker Pattern: A coordinator node distributes tasks to worker nodes, collecting and aggregating results. Workers remain stateless, pulling tasks from the coordinator. This approach simplifies task distribution but creates a single point of failure. The coordinator must track worker health, reassign failed tasks, and handle worker registration. Load balancing distributes tasks based on worker capacity and current load.
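The coordinator side of this pattern can be sketched in-process. The following is a minimal sketch, assuming a lease-based scheme for detecting failed workers; `TaskCoordinator`, `lease_seconds`, and the injectable `clock` are hypothetical names chosen for illustration.

```ruby
# Coordinator sketch: hands out tasks with a lease and requeues tasks whose
# lease expired because the worker never reported back.
class TaskCoordinator
  attr_reader :results

  def initialize(tasks, lease_seconds: 30,
                 clock: -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) })
    @pending = tasks.dup
    @leases = {} # task => { worker:, expires_at: }
    @results = {}
    @lease_seconds = lease_seconds
    @clock = clock
    @mutex = Mutex.new
  end

  # Called by a worker that is ready for work; returns nil when nothing is pending.
  def next_task(worker_id)
    @mutex.synchronize do
      requeue_expired
      task = @pending.shift
      return nil unless task

      @leases[task] = { worker: worker_id, expires_at: @clock.call + @lease_seconds }
      task
    end
  end

  def complete(task, result)
    @mutex.synchronize do
      @leases.delete(task)
      @results[task] = result
    end
  end

  private

  def requeue_expired
    now = @clock.call
    expired = @leases.select { |_, lease| lease[:expires_at] <= now }.keys
    expired.each do |task|
      @leases.delete(task)
      @pending << task
    end
  end
end

now = 0.0 # fake clock so the example does not need to sleep
coordinator = TaskCoordinator.new(%w[t1 t2], lease_seconds: 10, clock: -> { now })

first = coordinator.next_task('worker-a')      # "t1"
second = coordinator.next_task('worker-b')     # "t2"
coordinator.complete(second, 'done')
now += 11                                      # worker-a never reports back
reassigned = coordinator.next_task('worker-b') # "t1" is handed out again
```

A late result from the original worker would still be accepted in this sketch, so tasks should be idempotent.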
Peer-to-Peer Architecture: Nodes operate as equals without central coordination. Each node maintains partial knowledge of the network through gossip protocols. Distributed hash tables (DHT) enable key-based routing across nodes. P2P systems eliminate single points of failure but complicate service discovery and consistency. Membership management handles node joins, leaves, and failures through failure detection and stabilization protocols.
Master-Slave Replication: A master node handles writes while slave nodes replicate data for read scaling. Synchronous replication ensures consistency but adds write latency. Asynchronous replication improves write performance but allows temporary inconsistencies. Failover promotes a slave to master when the master fails. Split-brain scenarios occur when network partitions create multiple masters, requiring fencing mechanisms to prevent data corruption.
Microservices Architecture: Decomposing applications into independent services that communicate through APIs. Each service owns its data and can scale independently. Service discovery enables dynamic service location through registries like Consul or etcd. API gateways route external requests to appropriate services. This approach improves fault isolation and deployment flexibility but increases operational complexity.
# Service registry implementation
require 'net/http'
require 'uri'
class ServiceRegistry
def initialize
@services = Hash.new { |h, k| h[k] = [] }
@health_checks = {}
@mutex = Mutex.new
end
def register(service_name, host, port, health_check_url)
@mutex.synchronize do
instance = { host: host, port: port, registered_at: Time.now }
@services[service_name] << instance
@health_checks[instance] = health_check_url
end
end
def deregister(service_name, host, port)
@mutex.synchronize do
@services[service_name].reject! do |instance|
instance[:host] == host && instance[:port] == port
end
end
end
def discover(service_name)
@mutex.synchronize do
healthy_instances = @services[service_name].select do |instance|
healthy?(instance)
end
# Return random healthy instance
healthy_instances.sample
end
end
def healthy?(instance)
# Simple health check implementation
return false unless @health_checks[instance]
begin
response = Net::HTTP.get_response(
URI(@health_checks[instance])
)
response.code.to_i == 200
rescue
false
end
end
end
Event-Driven Architecture: Services communicate through asynchronous events rather than synchronous requests. Event producers publish events to message brokers; consumers subscribe to relevant events. This decouples services temporally and spatially. Event sourcing stores all state changes as events, enabling audit trails and time travel. CQRS (Command Query Responsibility Segregation) separates read and write models for independent scaling.
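Event sourcing is easiest to see with a small in-memory sketch. The store below is an assumption for illustration (`AccountEventStore` and the event types are hypothetical): state is never written directly, only derived by replaying the event log, which is what makes audit trails and "time travel" possible.

```ruby
# Event-sourcing sketch: the append-only log is the source of truth.
class AccountEventStore
  def initialize
    @events = []
  end

  def append(type, amount)
    @events << { type: type, amount: amount, seq: @events.size + 1 }
  end

  # Rebuild the balance by replaying events, optionally only up to a sequence number.
  def balance(up_to: nil)
    replayed = up_to ? @events.take_while { |e| e[:seq] <= up_to } : @events
    replayed.reduce(0) do |sum, event|
      case event[:type]
      when :deposited then sum + event[:amount]
      when :withdrawn then sum - event[:amount]
      else sum
      end
    end
  end
end

store = AccountEventStore.new
store.append(:deposited, 100)
store.append(:withdrawn, 30)
store.append(:deposited, 5)

current = store.balance           # 75
earlier = store.balance(up_to: 2) # 70, the state as of the second event
```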
Distributed Consensus: Algorithms like Raft and Paxos enable nodes to agree on values despite failures. These provide linearizable consistency for critical operations like leader election and configuration management. Multi-Paxos optimizes for multiple sequential decisions. Raft simplifies implementation through explicit leader election and log replication. Consensus adds latency but ensures correctness for critical operations.
Ruby Implementation
Ruby provides multiple tools and libraries for building distributed systems. The standard library includes TCP/UDP sockets for network communication, threading for concurrency, and DRb for distributed Ruby objects.
DRb (Distributed Ruby): Built-in mechanism for remote method invocation between Ruby processes. Objects on one machine become accessible to other Ruby processes through transparent method calls. DRb handles serialization, network communication, and remote references automatically.
# DRb server exposing a distributed cache
require 'drb'
class DistributedCache
def initialize
@cache = {}
@mutex = Mutex.new
end
def get(key)
@mutex.synchronize { @cache[key] }
end
def set(key, value)
@mutex.synchronize { @cache[key] = value }
end
def delete(key)
@mutex.synchronize { @cache.delete(key) }
end
def keys
@mutex.synchronize { @cache.keys }
end
end
cache = DistributedCache.new
DRb.start_service('druby://localhost:9000', cache)
DRb.thread.join
# Client accessing remote cache
require 'drb'
DRb.start_service
cache = DRbObject.new_with_uri('druby://localhost:9000')
cache.set('user:1001', { name: 'Alice', email: 'alice@example.com' })
user = cache.get('user:1001')
puts user[:name] # => Alice
Resque/Sidekiq: Background job processing frameworks that distribute work across multiple worker processes. Jobs serialize to Redis, workers pull and execute jobs, and failed jobs retry automatically. Sidekiq uses threads for higher concurrency than Resque's process model.
# Sidekiq job for distributed processing
require 'sidekiq'
class ImageProcessor
include Sidekiq::Worker
sidekiq_options retry: 3, queue: :default
def perform(image_id, transformations)
image = Image.find(image_id)
transformations.each do |transformation|
case transformation['type']
when 'resize'
resize_image(image, transformation['width'], transformation['height'])
when 'filter'
apply_filter(image, transformation['filter'])
when 'compress'
compress_image(image, transformation['quality'])
end
end
image.processed = true
image.save
end
private
def resize_image(image, width, height)
# Image resizing logic
end
def apply_filter(image, filter)
# Filter application logic
end
def compress_image(image, quality)
# Compression logic
end
end
# Enqueue job from web request
ImageProcessor.perform_async(
params[:image_id],
[
{ 'type' => 'resize', 'width' => 800, 'height' => 600 },
{ 'type' => 'compress', 'quality' => 85 }
]
)
Celluloid: Actor-based concurrent object framework. Each actor runs in its own thread and processes one message at a time from its mailbox. Supervisors restart failed actors automatically. Actors communicate through asynchronous method calls. Celluloid itself provides concurrency within a single process; spreading actors across machines requires an additional layer such as DCell.
require 'celluloid'
class DistributedCounter
include Celluloid
def initialize
@count = 0
end
# No mutex is needed: an actor handles one message at a time,
# so these methods never run concurrently on the same counter
def increment
@count += 1
end
def decrement
@count -= 1
end
def value
@count
end
end
class CounterCoordinator
include Celluloid
def initialize(num_workers)
@workers = num_workers.times.map { DistributedCounter.new }
end
def increment(worker_id)
@workers[worker_id].async.increment
end
def total
@workers.map(&:value).sum
end
end
coordinator = CounterCoordinator.new(4)
1000.times { |i| coordinator.increment(i % 4) }
sleep 0.1 # Allow async operations to complete
puts coordinator.total # => 1000
RabbitMQ/Kafka Clients: Message broker clients for asynchronous communication. Bunny provides RabbitMQ integration, ruby-kafka connects to Kafka clusters. These enable publish-subscribe patterns, work queues, and event streaming.
require 'bunny'
require 'json'
# RabbitMQ publisher
class EventPublisher
def initialize(host)
@connection = Bunny.new(host: host)
@connection.start
@channel = @connection.create_channel
@exchange = @channel.fanout('events')
end
def publish(event_type, data)
message = {
type: event_type,
data: data,
timestamp: Time.now.to_i
}.to_json
@exchange.publish(message, routing_key: event_type)
end
def close
@connection.close
end
end
# RabbitMQ subscriber
class EventSubscriber
def initialize(host, queue_name)
@connection = Bunny.new(host: host)
@connection.start
@channel = @connection.create_channel
@queue = @channel.queue(queue_name)
@exchange = @channel.fanout('events')
@queue.bind(@exchange)
end
def subscribe(&block)
@queue.subscribe(block: true) do |delivery_info, properties, body|
event = JSON.parse(body)
block.call(event)
end
end
end
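# Usage
# Run the subscriber in its own process and start it first: a fanout exchange
# discards messages published while no queue is bound to it, and
# subscribe(block: true) does not return.
subscriber = EventSubscriber.new('localhost', 'user_events')
subscriber.subscribe do |event|
puts "Received: #{event['type']} - #{event['data']}"
end
# In a separate publisher process
publisher = EventPublisher.new('localhost')
publisher.publish('user.created', { id: 1001, name: 'Bob' })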
gRPC: High-performance RPC framework using Protocol Buffers for serialization. Supports streaming, bidirectional communication, and multiple programming languages. Client and server code is generated from proto definitions.
# gRPC service definition (distributed_service.proto)
# syntax = "proto3";
#
# service DistributedService {
# rpc ProcessTask(TaskRequest) returns (TaskResponse);
# rpc StreamResults(stream DataChunk) returns (SummaryResponse);
# }
require 'grpc'
require 'distributed_service_services_pb'
class DistributedServiceImpl < DistributedService::Service
def process_task(task_request, _call)
result = perform_computation(task_request.data)
TaskResponse.new(
task_id: task_request.task_id,
result: result,
status: 'completed'
)
end
def stream_results(call)
total_size = 0
count = 0
# Client-streaming handlers read incoming messages with each_remote_read
call.each_remote_read do |chunk|
total_size += chunk.data.size
count += 1
end
SummaryResponse.new(
total_size: total_size,
chunk_count: count
)
end
private
def perform_computation(data)
# Computation logic
data.upcase
end
end
# Start gRPC server
server = GRPC::RpcServer.new
server.add_http2_port('0.0.0.0:50051', :this_port_is_insecure)
server.handle(DistributedServiceImpl)
server.run_till_terminated
Common Patterns
Leader Election: Selecting a single coordinator from multiple nodes to prevent split-brain scenarios. Nodes propose themselves as leaders, other nodes vote, and the winner coordinates operations. Leases expire, forcing periodic re-election. Implementations include Raft leader election, ZooKeeper ephemeral nodes, and distributed locks.
require 'redis'
class RedisLeaderElection
LEASE_DURATION = 30 # seconds
def initialize(redis, node_id)
@redis = redis
@node_id = node_id
@leader_key = 'cluster:leader'
end
def try_become_leader
# Attempt to set leader key with expiration
success = @redis.set(
@leader_key,
@node_id,
nx: true, # Only set if doesn't exist
ex: LEASE_DURATION
)
success ? @node_id : nil
end
def is_leader?
@redis.get(@leader_key) == @node_id
end
def renew_lease
# Renew only if still leader
script = <<-LUA
if redis.call("get", KEYS[1]) == ARGV[1] then
return redis.call("expire", KEYS[1], ARGV[2])
else
return 0
end
LUA
@redis.eval(script, [@leader_key], [@node_id, LEASE_DURATION]) == 1
end
def get_leader
@redis.get(@leader_key)
end
def resign
# Release leadership
script = <<-LUA
if redis.call("get", KEYS[1]) == ARGV[1] then
return redis.call("del", KEYS[1])
else
return 0
end
LUA
@redis.eval(script, [@leader_key], [@node_id])
end
end
# Usage pattern
election = RedisLeaderElection.new(Redis.new, "node-#{Process.pid}")
loop do
if election.try_become_leader
puts "Became leader"
# Perform leader duties
while election.is_leader?
# Do work
perform_coordination_tasks
# Renew lease
election.renew_lease
sleep 10
end
puts "Lost leadership"
else
puts "Follower mode"
sleep 5
end
end
Circuit Breaker: Preventing cascading failures by stopping requests to failing services. States transition between closed (normal), open (failing), and half-open (testing recovery). Failed requests increment error counters; excessive failures open the circuit. After timeout, half-open state allows test requests to check recovery.
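The three states can be sketched as a small state machine. This is a minimal sketch rather than a production implementation: `CircuitBreaker`, its keyword parameters, and the injectable `clock` are hypothetical names, and the half-open state here does not limit how many trial requests run concurrently.

```ruby
# Circuit breaker sketch with closed, open, and half-open states.
class CircuitBreaker
  class OpenError < StandardError; end

  attr_reader :state

  def initialize(failure_threshold: 3, reset_timeout: 30,
                 clock: -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) })
    @failure_threshold = failure_threshold
    @reset_timeout = reset_timeout
    @clock = clock
    @failures = 0
    @state = :closed
    @opened_at = nil
    @mutex = Mutex.new
  end

  def call
    @mutex.synchronize do
      if @state == :open
        raise OpenError, 'circuit open' if @clock.call - @opened_at < @reset_timeout

        @state = :half_open # timeout elapsed: allow a trial request
      end
    end

    begin
      result = yield
    rescue StandardError
      record_failure
      raise
    end
    record_success
    result
  end

  private

  def record_success
    @mutex.synchronize do
      @failures = 0
      @state = :closed
    end
  end

  def record_failure
    @mutex.synchronize do
      @failures += 1
      # A failed trial request reopens the circuit immediately.
      if @state == :half_open || @failures >= @failure_threshold
        @state = :open
        @opened_at = @clock.call
      end
    end
  end
end

now = 0.0 # fake clock so the example does not need to sleep
breaker = CircuitBreaker.new(failure_threshold: 2, reset_timeout: 10, clock: -> { now })

2.times do
  begin
    breaker.call { raise IOError, 'service down' }
  rescue IOError
    # each failure is counted by the breaker
  end
end
state_after_failures = breaker.state # :open

rejected = begin
  breaker.call { :never_runs }
rescue CircuitBreaker::OpenError
  :rejected # failed fast without calling the service
end

now += 11                    # reset timeout elapses
trial = breaker.call { :ok } # trial request succeeds and closes the circuit
```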
Saga Pattern: Managing distributed transactions across multiple services without two-phase commit. Each service performs a local transaction and publishes an event. Compensating transactions undo completed steps if later steps fail. Orchestration uses a central coordinator; choreography relies on event propagation.
class OrderSaga
def initialize
@steps = []
@compensations = []
end
def add_step(name, action, compensation)
@steps << { name: name, action: action }
@compensations << { name: name, action: compensation }
end
def execute(order_data)
completed_steps = []
begin
@steps.each do |step|
puts "Executing: #{step[:name]}"
result = step[:action].call(order_data)
completed_steps << step
order_data.merge!(result)
end
{ success: true, data: order_data }
rescue => e
puts "Error in saga: #{e.message}"
compensate(completed_steps.reverse, order_data)
{ success: false, error: e.message }
end
end
private
def compensate(completed_steps, order_data)
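completed_steps.each do |step|
# Look up the compensation by step name; indexing into the full reversed list
# would select compensations for steps that never ran
compensation = @compensations.find { |c| c[:name] == step[:name] }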
puts "Compensating: #{compensation[:name]}"
begin
compensation[:action].call(order_data)
rescue => e
puts "Compensation failed: #{e.message}"
end
end
end
end
# Define saga for order processing
saga = OrderSaga.new
saga.add_step(
'reserve_inventory',
->(data) {
# Reserve inventory
{ inventory_reserved: true, reservation_id: 'RES123' }
},
->(data) {
# Cancel reservation
cancel_inventory_reservation(data[:reservation_id])
}
)
saga.add_step(
'process_payment',
->(data) {
# Process payment
{ payment_processed: true, transaction_id: 'TXN456' }
},
->(data) {
# Refund payment
refund_payment(data[:transaction_id])
}
)
saga.add_step(
'create_shipment',
->(data) {
# Create shipment
{ shipment_created: true, tracking_number: 'TRACK789' }
},
->(data) {
# Cancel shipment
cancel_shipment(data[:tracking_number])
}
)
result = saga.execute(order_id: 1001, items: ['item1', 'item2'])
Retry with Exponential Backoff: Handling transient failures through automatic retries with increasing delays. Initial retry happens quickly; subsequent retries double the delay. Jitter randomizes delays to prevent thundering herd. Maximum retry limits prevent infinite loops.
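A minimal sketch of this pattern follows, assuming the "full jitter" variant, where each delay is drawn uniformly between zero and the capped exponential value. The helper names (`backoff_delay`, `with_retries`) and the injectable `sleeper` are hypothetical and exist only to keep the example testable.

```ruby
# Delay before retry number `attempt` (0-based): uniform in [0, min(cap, base * 2^attempt)).
def backoff_delay(attempt, base: 0.1, cap: 5.0, rng: Random.new)
  ceiling = [cap, base * (2**attempt)].min
  rng.rand * ceiling
end

# Runs the block, retrying on StandardError up to max_attempts total calls.
def with_retries(max_attempts: 5, base: 0.1, cap: 5.0,
                 sleeper: ->(seconds) { sleep(seconds) })
  attempt = 0
  begin
    yield attempt
  rescue StandardError
    attempt += 1
    raise if attempt >= max_attempts # retry limit reached: surface the error

    sleeper.call(backoff_delay(attempt - 1, base: base, cap: cap))
    retry
  end
end

delays = []
calls = 0
result = with_retries(max_attempts: 4, sleeper: ->(s) { delays << s }) do
  calls += 1
  raise IOError, 'transient failure' if calls < 3

  "succeeded after #{calls} calls"
end
```

Here the first two calls fail, so two delays are recorded: the first below 0.1 seconds and the second below 0.2 seconds. Only errors known to be transient should be retried in practice; rescuing every `StandardError` is a simplification.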
Request Hedging: Sending duplicate requests to multiple nodes, using the first response. Reduces tail latency but increases resource usage. Applied selectively to latency-sensitive operations. Cancel redundant requests after receiving first response.
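Hedging can be sketched with threads and a queue. In the following sketch, `hedged_request` is a hypothetical helper and each replica is modeled as a callable. Killing the losing threads stands in for cancellation; a real client would cancel through its HTTP or RPC library instead.

```ruby
# Hedged request sketch: call every replica in parallel and return the first success.
def hedged_request(replicas)
  raise ArgumentError, 'no replicas given' if replicas.empty?

  results = Queue.new
  threads = replicas.map do |replica|
    Thread.new do
      begin
        results << [:ok, replica.call]
      rescue StandardError => e
        results << [:error, e]
      end
    end
  end

  last_error = nil
  replicas.size.times do
    status, value = results.pop
    return value if status == :ok

    last_error = value
  end
  raise last_error # every replica failed
ensure
  # Stop work that is no longer needed once a winner is known.
  threads&.each(&:kill)
end

slow = -> { sleep 0.5; 'slow replica' }
fast = -> { sleep 0.01; 'fast replica' }

winner = hedged_request([slow, fast])
```

Many systems send the second request only after the first has been outstanding for some threshold, which limits the extra load; this sketch sends all requests at once for brevity.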
Tools & Ecosystem
Redis: In-memory data store for caching, message queues, and coordination primitives. Pub/sub messaging supports event-driven architectures. Atomic operations enable distributed locks and counters. Cluster mode provides automatic sharding and replication.
ZooKeeper: Centralized coordination service for distributed applications. Provides hierarchical namespace for configuration, naming service for service discovery, distributed locks for synchronization, and leader election primitives. Guarantees strong consistency through quorum-based replication.
etcd: Distributed key-value store for configuration and service discovery. Uses Raft consensus for consistency. Provides watch mechanism for configuration changes. HTTP/gRPC APIs for language-agnostic access. Common in Kubernetes infrastructure.
Consul: Service mesh solution with service discovery, health checking, and key-value storage. Provides DNS interface for service resolution. Multi-datacenter support with WAN gossip. Connect feature enables service-to-service communication with automatic TLS.
RabbitMQ: Message broker implementing AMQP protocol. Supports multiple messaging patterns: work queues, publish-subscribe, routing, and RPC. Plugins extend functionality with federation, shovel, and monitoring. High availability through quorum queues, which replace the deprecated classic mirrored queues.
Apache Kafka: Distributed event streaming platform for high-throughput data pipelines. Topics partition across brokers for parallelism. Consumer groups enable load balancing. Log compaction retains latest message per key. Connect framework integrates with databases and other systems.
Hazelcast: In-memory data grid providing distributed data structures. Maps, queues, topics, and locks distributed across cluster. Near cache improves read performance. Split-brain protection prevents data inconsistency during network partitions.
Error Handling & Edge Cases
Network Partitions: Network failures isolate subsets of nodes. Partition-tolerant systems continue operating with reduced functionality. Quorum-based systems require majority connectivity. Split-brain resolution detects multiple active coordinators and forces reconciliation.
class QuorumChecker
def initialize(nodes, quorum_size)
@nodes = nodes
@quorum_size = quorum_size
# No default value: a node that has never sent a heartbeat must not count as alive
@last_contact = {}
@timeout = 5 # seconds
end
def heartbeat(node_id)
@last_contact[node_id] = Time.now
end
def has_quorum?
alive_nodes.size >= @quorum_size
end
def can_accept_writes?
has_quorum?
end
def alive_nodes
now = Time.now
@nodes.select do |node_id|
last_seen = @last_contact[node_id]
last_seen && now - last_seen < @timeout
end
end
end
checker = QuorumChecker.new(['node1', 'node2', 'node3', 'node4', 'node5'], 3)
if checker.can_accept_writes?
# Process write request
process_write(data)
else
# Reject write, no quorum
return { error: 'No quorum available', code: 503 }
end
Cascading Failures: Single component failure triggers failures in dependent components. Bulkheads isolate failures to specific subsystems. Timeout limits prevent thread exhaustion waiting for slow services. Load shedding drops requests when system overloaded.
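A bulkhead combined with load shedding can be sketched as a bounded concurrency counter. This is an illustrative sketch, with `Bulkhead` and `RejectedError` as hypothetical names: calls beyond the limit are rejected immediately instead of queuing up and exhausting threads.

```ruby
# Bulkhead sketch: at most max_concurrent calls to one dependency run at a time.
class Bulkhead
  class RejectedError < StandardError; end

  def initialize(max_concurrent)
    @max_concurrent = max_concurrent
    @in_flight = 0
    @mutex = Mutex.new
  end

  def call
    @mutex.synchronize do
      # Shed load rather than wait when the compartment is full.
      raise RejectedError, 'bulkhead full' if @in_flight >= @max_concurrent

      @in_flight += 1
    end
    begin
      yield
    ensure
      @mutex.synchronize { @in_flight -= 1 }
    end
  end
end

payments = Bulkhead.new(1)

outcome = payments.call do
  # While one call is in flight, a second one is shed instead of queued.
  begin
    payments.call { :second }
  rescue Bulkhead::RejectedError
    :rejected
  end
end

afterwards = payments.call { :accepted } # capacity is released once the first call ends
```

Giving each downstream dependency its own `Bulkhead` instance keeps a slow service from consuming capacity needed by the others.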
Clock Skew: Unsynchronized clocks cause incorrect ordering and timeout calculations. NTP synchronization provides bounded clock drift. Logical clocks (Lamport, vector) track causality without physical time. Hybrid logical clocks combine physical and logical time.
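A hybrid logical clock can be sketched as a pair of a wall-clock component and a counter. The following is a minimal sketch under the usual HLC update rules; the class name, the `Timestamp` struct, and the injectable `physical_clock` are assumptions made for illustration.

```ruby
# Hybrid logical clock sketch: timestamps are (wall, counter) pairs compared in that order.
class HybridLogicalClock
  Timestamp = Struct.new(:wall, :counter) do
    include Comparable

    def <=>(other)
      [wall, counter] <=> [other.wall, other.counter]
    end
  end

  def initialize(physical_clock: -> { (Time.now.to_f * 1000).to_i })
    @physical_clock = physical_clock
    @wall = 0
    @counter = 0
  end

  # Timestamp for a local event or an outgoing message.
  def now
    physical = @physical_clock.call
    if physical > @wall
      @wall = physical
      @counter = 0
    else
      @counter += 1 # physical time has not advanced: break the tie logically
    end
    Timestamp.new(@wall, @counter)
  end

  # Merge a timestamp received from another node.
  def update(remote)
    physical = @physical_clock.call
    new_wall = [@wall, remote.wall, physical].max
    @counter =
      if new_wall == @wall && new_wall == remote.wall
        [@counter, remote.counter].max + 1
      elsif new_wall == @wall
        @counter + 1
      elsif new_wall == remote.wall
        remote.counter + 1
      else
        0
      end
    @wall = new_wall
    Timestamp.new(@wall, @counter)
  end
end

a_time = 100 # node a's physical clock in milliseconds
b_time = 90  # node b's clock runs 10 ms behind
a = HybridLogicalClock.new(physical_clock: -> { a_time })
b = HybridLogicalClock.new(physical_clock: -> { b_time })

sent = a.now              # (100, 0)
received = b.update(sent) # (100, 1): ordered after the send despite b's slow clock
later = b.now             # (100, 2)
```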
Thundering Herd: Many clients simultaneously request same resource after cache expiration or service restart. Cache stampede prevention uses early expiration with probability-based refresh. Request coalescing consolidates duplicate concurrent requests.
class RequestCoalescer
def initialize
@pending = {}
@mutex = Mutex.new
end
def coalesce(key, &block)
promise = nil
owner = false
@mutex.synchronize do
if @pending[key]
# Request already in progress, wait for it
promise = @pending[key]
else
# First request: create the promise and take responsibility for running the block
promise = Promise.new
@pending[key] = promise
owner = true
end
end
# Only the thread that created the promise executes the block;
# checking promise.pending? here would let waiting threads run it too
if owner
begin
result = block.call
promise.fulfill(result)
rescue => e
promise.reject(e)
ensure
@mutex.synchronize { @pending.delete(key) }
end
end
promise.value
end
end
class Promise
def initialize
@mutex = Mutex.new
@condition = ConditionVariable.new
@state = :pending
@value = nil
end
def pending?
@mutex.synchronize { @state == :pending }
end
def fulfill(value)
@mutex.synchronize do
@value = value
@state = :fulfilled
@condition.broadcast
end
end
def reject(error)
@mutex.synchronize do
@value = error
@state = :rejected
@condition.broadcast
end
end
def value
@mutex.synchronize do
@condition.wait(@mutex) while @state == :pending
raise @value if @state == :rejected
@value
end
end
end
# Usage prevents duplicate database queries
coalescer = RequestCoalescer.new
# Multiple concurrent requests for same data
threads = 10.times.map do
Thread.new do
result = coalescer.coalesce('user:1001') do
# Only executed once despite multiple threads
expensive_database_query('user:1001')
end
puts "Got result: #{result}"
end
end
threads.each(&:join)
Poison Messages: Invalid messages that cause processing failures and repeated retries. Dead letter queues store failed messages for investigation. Message validation rejects malformed data early. Retry limits prevent infinite processing loops.
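The interaction between validation, retry limits, and a dead letter queue can be sketched in memory. `RetryingConsumer` below is a hypothetical class for illustration; a real broker such as RabbitMQ would provide the dead letter queue itself.

```ruby
require 'json'

# Poison-message sketch: bounded retries, then park the message for inspection.
class RetryingConsumer
  attr_reader :dead_letters

  def initialize(max_attempts: 3, &handler)
    @max_attempts = max_attempts
    @handler = handler
    @dead_letters = []
  end

  def consume(raw_message)
    payload = begin
      JSON.parse(raw_message)
    rescue JSON::ParserError => e
      # Malformed input can never succeed, so skip retries entirely.
      @dead_letters << { message: raw_message, reason: "invalid JSON: #{e.class}" }
      return :dead_lettered
    end

    attempts = 0
    begin
      attempts += 1
      @handler.call(payload)
      :processed
    rescue StandardError => e
      retry if attempts < @max_attempts

      @dead_letters << { message: raw_message, reason: e.message, attempts: attempts }
      :dead_lettered
    end
  end
end

consumer = RetryingConsumer.new(max_attempts: 3) do |payload|
  raise ArgumentError, 'missing user_id' unless payload.key?('user_id')
end

ok = consumer.consume('{"user_id": 1}')   # :processed
bad = consumer.consume('{"name": "Bob"}') # :dead_lettered after 3 attempts
garbage = consumer.consume('not json')    # :dead_lettered without any retry
```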
Duplicate Messages: Message brokers may deliver messages multiple times. Idempotent processing produces same result regardless of duplicate deliveries. Deduplication tracking stores processed message IDs. Natural idempotency through upserts and deterministic operations.
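Deduplication tracking can be sketched with a set of processed message IDs. The names below (`IdempotentConsumer`, the message fields) are hypothetical, and the in-memory set is a stand-in: a real system would persist the IDs and record them in the same transaction as the side effect.

```ruby
require 'set'

# Idempotent consumer sketch: redelivered messages are detected by ID and skipped.
class IdempotentConsumer
  attr_reader :balances

  def initialize
    @processed_ids = Set.new
    @balances = Hash.new(0)
  end

  def handle(message)
    # Set#add? returns nil when the ID was already present.
    return :duplicate unless @processed_ids.add?(message[:id])

    @balances[message[:account]] += message[:amount]
    :applied
  end
end

consumer = IdempotentConsumer.new
message = { id: 'msg-42', account: 'acct-1', amount: 50 }

first = consumer.handle(message)  # :applied
second = consumer.handle(message) # :duplicate, the redelivery is ignored
```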
Partial Failures: Some operations succeed while others fail across distributed operations. Compensating transactions undo completed operations. Eventual consistency accepts temporary inconsistencies. Reconciliation jobs detect and fix inconsistencies.
Reference
Consistency Models Comparison
| Model | Guarantees | Performance | Use Cases |
|---|---|---|---|
| Strong Consistency | All nodes see same data | Slowest, requires coordination | Financial transactions, inventory |
| Eventual Consistency | All nodes converge eventually | Fastest, no coordination | Social media feeds, caching |
| Causal Consistency | Cause precedes effect | Moderate overhead | Collaborative editing, messaging |
| Sequential Consistency | Operations appear in order | Moderate coordination | Configuration management |
| Read-your-writes | See own writes immediately | Client-side tracking | User profiles, preferences |
CAP Theorem Trade-offs
| System Type | Prioritizes | Sacrifices | Examples |
|---|---|---|---|
| CP Systems | Consistency + Partition Tolerance | Availability | MongoDB, HBase, ZooKeeper, etcd |
| AP Systems | Availability + Partition Tolerance | Consistency | Cassandra, DynamoDB, CouchDB |
| CA Systems | Consistency + Availability | Partition Tolerance | Traditional RDBMS (single node) |
Distributed Coordination Primitives
| Primitive | Purpose | Implementation Options |
|---|---|---|
| Distributed Lock | Mutual exclusion across nodes | Redis SETNX, ZooKeeper, etcd |
| Leader Election | Select single coordinator | Raft, Paxos, ZooKeeper |
| Barrier | Synchronize multiple processes | Redis counters, ZooKeeper |
| Semaphore | Limit concurrent access | Redis sorted sets, database rows |
| Queue | Order processing across workers | Redis lists, RabbitMQ, SQS |
Message Delivery Guarantees
| Level | Behavior | Implementation |
|---|---|---|
| At-most-once | Fire and forget, may lose messages | UDP, async without acks |
| At-least-once | Retry until acknowledged, may duplicate | TCP with retries, message queues |
| Exactly-once | Deliver once without duplicates | Idempotent processing + deduplication |
Partitioning Strategies
| Strategy | Distribution | Pros | Cons |
|---|---|---|---|
| Hash | Uniform across nodes | Even distribution | No range queries |
| Range | By key ranges | Ordered data, range queries | Hotspots possible |
| Consistent Hash | Hash with minimal rebalancing | Stable on membership changes | Complex implementation |
| Directory | Lookup table maps keys to nodes | Flexible assignment | Lookup overhead |
Replication Patterns
| Pattern | Consistency | Write Performance | Read Performance |
|---|---|---|---|
| Synchronous | Strong | Slow, waits for replicas | Fast from any replica |
| Asynchronous | Eventual | Fast, fire and forget | Stale reads possible |
| Semi-synchronous | Tunable | Moderate | Good |
| Multi-master | Conflicts possible | Parallel writes | Fast local reads |
Ruby Distributed Computing Gems
| Gem | Purpose | Key Features |
|---|---|---|
| sidekiq | Background jobs | Thread-based, Redis queue, retries |
| resque | Background jobs | Process-based, Redis queue |
| bunny | RabbitMQ client | AMQP protocol, async operations |
| ruby-kafka | Kafka client | Consumer groups, compression |
| gruf | gRPC framework | Built on gRPC, interceptors |
| dalli | Memcached client | Connection pooling, multi-get |
| redis-rb | Redis client | Pipelining, clustering support |
| celluloid | Actor framework | Supervision, async messaging |
Network Protocols for Distributed Systems
| Protocol | Transport | Use Case | Characteristics |
|---|---|---|---|
| HTTP/REST | TCP | Request-response APIs | Simple, stateless, widely supported |
| gRPC | TCP (HTTP/2) | RPC, microservices | Binary, streaming, efficient |
| WebSocket | TCP | Bidirectional, real-time | Persistent connection, low latency |
| AMQP | TCP | Message queuing | Reliable, routing, transactions |
| Gossip | UDP | Membership, failure detection | Eventual consistency, scalable |
Failure Detection Timeouts
| Scenario | Typical Timeout | Trade-off |
|---|---|---|
| Health check | 1-5 seconds | False positives vs detection speed |
| Request timeout | 30-60 seconds | User experience vs resource usage |
| Lease expiration | 10-30 seconds | Failover speed vs stability |
| Connection timeout | 5-10 seconds | Retry attempts vs latency |
| Session timeout | 15-30 minutes | Security vs convenience |