CrackedRuby CrackedRuby

Distributed Computing Basics

Overview

Distributed computing involves coordinating multiple independent computers to work together as a single system, sharing computation, storage, and resources across network boundaries. Each node operates autonomously with its own memory and processor, communicating through message passing rather than shared memory.

The concept emerged from the need to solve problems too large for single machines: processing massive datasets, serving millions of concurrent users, or providing fault tolerance through redundancy. Distributed systems power modern infrastructure from web applications and databases to cloud platforms and microservices architectures.

Three characteristics define distributed computing: concurrency (multiple nodes execute simultaneously), lack of global clock (no single authoritative time source), and independent failure modes (components fail independently without bringing down the entire system). These create fundamental challenges absent in single-machine computing.

# Simple distributed task example
require 'socket'

class DistributedWorker
  def initialize(coordinator_host, coordinator_port)
    @socket = TCPSocket.new(coordinator_host, coordinator_port)
  end
  
  def request_work
    @socket.puts("READY")
    task = @socket.gets.chomp
    result = process_task(task)
    @socket.puts("RESULT:#{result}")
  end
  
  def process_task(task)
    # Perform computation
    eval(task)
  end
end

# Worker connects to coordinator and processes tasks
worker = DistributedWorker.new('localhost', 5000)
worker.request_work

The distinction between parallel and distributed computing matters: parallel computing uses multiple processors within a single machine sharing memory, while distributed computing coordinates separate machines with isolated memory. Distributed systems face network latency, partial failures, and consistency challenges that parallel systems avoid.

Key Principles

Network Communication: Nodes communicate through message passing over unreliable networks. Messages may arrive out of order, be duplicated, or never arrive at all. Protocols must handle these scenarios through acknowledgments, timeouts, and retries. TCP provides reliable ordered delivery but adds latency; UDP offers lower latency without delivery guarantees.

Consistency Models: Distributed systems trade consistency for availability and partition tolerance (CAP theorem). Strong consistency ensures all nodes see the same data simultaneously but requires coordination overhead. Eventual consistency allows temporary inconsistencies, providing higher availability. Causal consistency preserves cause-effect relationships without global synchronization.

# Eventual consistency example with version vectors
class EventuallyConsistentStore
  def initialize
    @data = {}
    @versions = Hash.new { |h, k| h[k] = {} }
  end
  
  def write(key, value, node_id)
    @versions[key][node_id] = (@versions[key][node_id] || 0) + 1
    @data[key] = { value: value, version: @versions[key].dup }
  end
  
  def read(key)
    @data[key]
  end
  
  def merge(key, remote_value, remote_version)
    local_version = @versions[key]
    
    # Detect conflicts using version vectors
    if dominates?(remote_version, local_version)
      # Remote is newer, accept it
      @data[key] = { value: remote_value, version: remote_version }
      @versions[key] = remote_version
    elsif dominates?(local_version, remote_version)
      # Local is newer, keep it
      return
    else
      # Concurrent writes - conflict resolution needed
      resolve_conflict(key, remote_value, remote_version)
    end
  end
  
  def dominates?(v1, v2)
    v1.all? { |node, count| count >= (v2[node] || 0) } &&
      v1.any? { |node, count| count > (v2[node] || 0) }
  end
  
  def resolve_conflict(key, remote_value, remote_version)
    # Application-specific conflict resolution
    # Could use last-write-wins, merge values, etc.
  end
end

Fault Tolerance: Distributed systems must continue operating despite node failures. Replication stores data on multiple nodes to survive failures. Consensus algorithms like Paxos and Raft enable nodes to agree on values despite failures. Circuit breakers prevent cascading failures by stopping requests to failing services.

Time and Ordering: Without synchronized clocks, determining event order becomes complex. Logical clocks (Lamport timestamps) track causality through message passing. Vector clocks detect concurrent events across nodes. Time synchronization protocols like NTP provide approximate wall-clock time but with bounded uncertainty.

State Management: Distributed state requires careful coordination. Stateless services simplify scaling by handling each request independently. Stateful services must partition and replicate state across nodes. Session affinity routes requests to specific nodes but creates scaling bottlenecks.

Data Partitioning: Splitting data across nodes enables horizontal scaling. Hash-based partitioning distributes data uniformly but complicates range queries. Range-based partitioning groups related data but risks uneven load distribution. Consistent hashing minimizes data movement when nodes join or leave.

# Consistent hashing implementation
require 'digest'

class ConsistentHash
  def initialize(replicas = 150)
    @replicas = replicas
    @ring = {}
    @sorted_keys = []
    @nodes = []
  end
  
  def add_node(node)
    @replicas.times do |i|
      key = hash_key("#{node}:#{i}")
      @ring[key] = node
      @sorted_keys << key
    end
    @sorted_keys.sort!
    @nodes << node
  end
  
  def remove_node(node)
    @replicas.times do |i|
      key = hash_key("#{node}:#{i}")
      @ring.delete(key)
      @sorted_keys.delete(key)
    end
    @nodes.delete(node)
  end
  
  def get_node(key)
    return nil if @ring.empty?
    
    hash = hash_key(key)
    idx = @sorted_keys.bsearch_index { |k| k >= hash } || 0
    @ring[@sorted_keys[idx]]
  end
  
  private
  
  def hash_key(key)
    Digest::MD5.hexdigest(key.to_s).to_i(16)
  end
end

# Usage
ch = ConsistentHash.new
ch.add_node('server1')
ch.add_node('server2')
ch.add_node('server3')

puts ch.get_node('user:1001')  # => server2
puts ch.get_node('user:1002')  # => server1

Implementation Approaches

Coordinator-Worker Pattern: A coordinator node distributes tasks to worker nodes, collecting and aggregating results. Workers remain stateless, pulling tasks from the coordinator. This approach simplifies task distribution but creates a single point of failure. The coordinator must track worker health, reassign failed tasks, and handle worker registration. Load balancing distributes tasks based on worker capacity and current load.

Peer-to-Peer Architecture: Nodes operate as equals without central coordination. Each node maintains partial knowledge of the network through gossip protocols. Distributed hash tables (DHT) enable key-based routing across nodes. P2P systems eliminate single points of failure but complicate service discovery and consistency. Membership management handles node joins, leaves, and failures through failure detection and stabilization protocols.

Master-Slave Replication: A master node handles writes while slave nodes replicate data for read scaling. Synchronous replication ensures consistency but adds write latency. Asynchronous replication improves write performance but allows temporary inconsistencies. Failover promotes a slave to master when the master fails. Split-brain scenarios occur when network partitions create multiple masters, requiring fencing mechanisms to prevent data corruption.

Microservices Architecture: Decomposing applications into independent services that communicate through APIs. Each service owns its data and can scale independently. Service discovery enables dynamic service location through registries like Consul or etcd. API gateways route external requests to appropriate services. This approach improves fault isolation and deployment flexibility but increases operational complexity.

# Service registry implementation
class ServiceRegistry
  def initialize
    @services = Hash.new { |h, k| h[k] = [] }
    @health_checks = {}
    @mutex = Mutex.new
  end
  
  def register(service_name, host, port, health_check_url)
    @mutex.synchronize do
      instance = { host: host, port: port, registered_at: Time.now }
      @services[service_name] << instance
      @health_checks[instance] = health_check_url
    end
  end
  
  def deregister(service_name, host, port)
    @mutex.synchronize do
      @services[service_name].reject! do |instance|
        instance[:host] == host && instance[:port] == port
      end
    end
  end
  
  def discover(service_name)
    @mutex.synchronize do
      healthy_instances = @services[service_name].select do |instance|
        healthy?(instance)
      end
      
      # Return random healthy instance
      healthy_instances.sample
    end
  end
  
  def healthy?(instance)
    # Simple health check implementation
    return false unless @health_checks[instance]
    
    begin
      response = Net::HTTP.get_response(
        URI(@health_checks[instance])
      )
      response.code.to_i == 200
    rescue
      false
    end
  end
end

Event-Driven Architecture: Services communicate through asynchronous events rather than synchronous requests. Event producers publish events to message brokers; consumers subscribe to relevant events. This decouples services temporally and spatially. Event sourcing stores all state changes as events, enabling audit trails and time travel. CQRS (Command Query Responsibility Segregation) separates read and write models for independent scaling.

Distributed Consensus: Algorithms like Raft and Paxos enable nodes to agree on values despite failures. These provide linearizable consistency for critical operations like leader election and configuration management. Multi-Paxos optimizes for multiple sequential decisions. Raft simplifies implementation through explicit leader election and log replication. Consensus adds latency but ensures correctness for critical operations.

Ruby Implementation

Ruby provides multiple tools and libraries for building distributed systems. The standard library includes TCP/UDP sockets for network communication, threading for concurrency, and DRb for distributed Ruby objects.

DRb (Distributed Ruby): Built-in mechanism for remote method invocation between Ruby processes. Objects on one machine become accessible to other Ruby processes through transparent method calls. DRb handles serialization, network communication, and remote references automatically.

# DRb server exposing a distributed cache
require 'drb'

class DistributedCache
  def initialize
    @cache = {}
    @mutex = Mutex.new
  end
  
  def get(key)
    @mutex.synchronize { @cache[key] }
  end
  
  def set(key, value)
    @mutex.synchronize { @cache[key] = value }
  end
  
  def delete(key)
    @mutex.synchronize { @cache.delete(key) }
  end
  
  def keys
    @mutex.synchronize { @cache.keys }
  end
end

cache = DistributedCache.new
DRb.start_service('druby://localhost:9000', cache)
DRb.thread.join

# Client accessing remote cache
require 'drb'

DRb.start_service
cache = DRbObject.new_nil('druby://localhost:9000')

cache.set('user:1001', { name: 'Alice', email: 'alice@example.com' })
user = cache.get('user:1001')
puts user[:name]  # => Alice

Resque/Sidekiq: Background job processing frameworks that distribute work across multiple worker processes. Jobs serialize to Redis, workers pull and execute jobs, and failed jobs retry automatically. Sidekiq uses threads for higher concurrency than Resque's process model.

# Sidekiq job for distributed processing
class ImageProcessor
  include Sidekiq::Worker
  
  sidekiq_options retry: 3, queue: :default
  
  def perform(image_id, transformations)
    image = Image.find(image_id)
    
    transformations.each do |transformation|
      case transformation['type']
      when 'resize'
        resize_image(image, transformation['width'], transformation['height'])
      when 'filter'
        apply_filter(image, transformation['filter'])
      when 'compress'
        compress_image(image, transformation['quality'])
      end
    end
    
    image.processed = true
    image.save
  end
  
  private
  
  def resize_image(image, width, height)
    # Image resizing logic
  end
  
  def apply_filter(image, filter)
    # Filter application logic
  end
  
  def compress_image(image, quality)
    # Compression logic
  end
end

# Enqueue job from web request
ImageProcessor.perform_async(
  params[:image_id],
  [
    { 'type' => 'resize', 'width' => 800, 'height' => 600 },
    { 'type' => 'compress', 'quality' => 85 }
  ]
)

Celluloid: Actor-based concurrent object framework that simplifies building distributed systems. Each actor runs in its own thread, processing messages asynchronously. Supervision trees restart failed actors automatically. Actors communicate through asynchronous method calls.

require 'celluloid'

class DistributedCounter
  include Celluloid
  
  def initialize
    @count = 0
    @mutex = Mutex.new
  end
  
  def increment
    @mutex.synchronize { @count += 1 }
  end
  
  def decrement
    @mutex.synchronize { @count -= 1 }
  end
  
  def value
    @mutex.synchronize { @count }
  end
end

class CounterCoordinator
  include Celluloid
  
  def initialize(num_workers)
    @workers = num_workers.times.map { DistributedCounter.new }
  end
  
  def increment(worker_id)
    @workers[worker_id].async.increment
  end
  
  def total
    @workers.map(&:value).sum
  end
end

coordinator = CounterCoordinator.new(4)
1000.times { |i| coordinator.increment(i % 4) }
sleep 0.1  # Allow async operations to complete
puts coordinator.total  # => 1000

RabbitMQ/Kafka Clients: Message broker clients for asynchronous communication. Bunny provides RabbitMQ integration, ruby-kafka connects to Kafka clusters. These enable publish-subscribe patterns, work queues, and event streaming.

require 'bunny'

# RabbitMQ publisher
class EventPublisher
  def initialize(host)
    @connection = Bunny.new(host: host)
    @connection.start
    @channel = @connection.create_channel
    @exchange = @channel.fanout('events')
  end
  
  def publish(event_type, data)
    message = {
      type: event_type,
      data: data,
      timestamp: Time.now.to_i
    }.to_json
    
    @exchange.publish(message, routing_key: event_type)
  end
  
  def close
    @connection.close
  end
end

# RabbitMQ subscriber
class EventSubscriber
  def initialize(host, queue_name)
    @connection = Bunny.new(host: host)
    @connection.start
    @channel = @connection.create_channel
    @queue = @channel.queue(queue_name)
    @exchange = @channel.fanout('events')
    @queue.bind(@exchange)
  end
  
  def subscribe(&block)
    @queue.subscribe(block: true) do |delivery_info, properties, body|
      event = JSON.parse(body)
      block.call(event)
    end
  end
end

# Usage
publisher = EventPublisher.new('localhost')
publisher.publish('user.created', { id: 1001, name: 'Bob' })

subscriber = EventSubscriber.new('localhost', 'user_events')
subscriber.subscribe do |event|
  puts "Received: #{event['type']} - #{event['data']}"
end

gRPC: High-performance RPC framework using Protocol Buffers for serialization. Supports streaming, bidirectional communication, and multiple programming languages. Generated client and server code from proto definitions.

# gRPC service definition (distributed_service.proto)
# syntax = "proto3";
# 
# service DistributedService {
#   rpc ProcessTask(TaskRequest) returns (TaskResponse);
#   rpc StreamResults(stream DataChunk) returns (SummaryResponse);
# }

require 'grpc'
require 'distributed_service_services_pb'

class DistributedServiceImpl < DistributedService::Service
  def process_task(task_request, _call)
    result = perform_computation(task_request.data)
    
    TaskResponse.new(
      task_id: task_request.task_id,
      result: result,
      status: 'completed'
    )
  end
  
  def stream_results(call)
    total_size = 0
    count = 0
    
    call.each do |chunk|
      total_size += chunk.data.size
      count += 1
    end
    
    SummaryResponse.new(
      total_size: total_size,
      chunk_count: count
    )
  end
  
  private
  
  def perform_computation(data)
    # Computation logic
    data.upcase
  end
end

# Start gRPC server
server = GRPC::RpcServer.new
server.add_http2_port('0.0.0.0:50051', :this_port_is_insecure)
server.handle(DistributedServiceImpl)
server.run_till_terminated

Common Patterns

Leader Election: Selecting a single coordinator from multiple nodes to prevent split-brain scenarios. Nodes propose themselves as leaders, other nodes vote, and the winner coordinates operations. Leases expire, forcing periodic re-election. Implementations include Raft leader election, ZooKeeper ephemeral nodes, and distributed locks.

require 'redis'

class RedisLeaderElection
  LEASE_DURATION = 30  # seconds
  
  def initialize(redis, node_id)
    @redis = redis
    @node_id = node_id
    @leader_key = 'cluster:leader'
  end
  
  def try_become_leader
    # Attempt to set leader key with expiration
    success = @redis.set(
      @leader_key,
      @node_id,
      nx: true,  # Only set if doesn't exist
      ex: LEASE_DURATION
    )
    
    success ? @node_id : nil
  end
  
  def is_leader?
    @redis.get(@leader_key) == @node_id
  end
  
  def renew_lease
    # Renew only if still leader
    script = <<-LUA
      if redis.call("get", KEYS[1]) == ARGV[1] then
        return redis.call("expire", KEYS[1], ARGV[2])
      else
        return 0
      end
    LUA
    
    @redis.eval(script, [@leader_key], [@node_id, LEASE_DURATION]) == 1
  end
  
  def get_leader
    @redis.get(@leader_key)
  end
  
  def resign
    # Release leadership
    script = <<-LUA
      if redis.call("get", KEYS[1]) == ARGV[1] then
        return redis.call("del", KEYS[1])
      else
        return 0
      end
    LUA
    
    @redis.eval(script, [@leader_key], [@node_id])
  end
end

# Usage pattern
election = RedisLeaderElection.new(Redis.new, "node-#{Process.pid}")

loop do
  if election.try_become_leader
    puts "Became leader"
    
    # Perform leader duties
    while election.is_leader?
      # Do work
      perform_coordination_tasks
      
      # Renew lease
      election.renew_lease
      sleep 10
    end
    
    puts "Lost leadership"
  else
    puts "Follower mode"
    sleep 5
  end
end

Circuit Breaker: Preventing cascading failures by stopping requests to failing services. States transition between closed (normal), open (failing), and half-open (testing recovery). Failed requests increment error counters; excessive failures open the circuit. After timeout, half-open state allows test requests to check recovery.

Saga Pattern: Managing distributed transactions across multiple services without two-phase commit. Each service performs a local transaction and publishes an event. Compensating transactions undo completed steps if later steps fail. Orchestration uses a central coordinator; choreography relies on event propagation.

class OrderSaga
  def initialize
    @steps = []
    @compensations = []
  end
  
  def add_step(name, action, compensation)
    @steps << { name: name, action: action }
    @compensations << { name: name, action: compensation }
  end
  
  def execute(order_data)
    completed_steps = []
    
    begin
      @steps.each do |step|
        puts "Executing: #{step[:name]}"
        result = step[:action].call(order_data)
        completed_steps << step
        order_data.merge!(result)
      end
      
      { success: true, data: order_data }
    rescue => e
      puts "Error in saga: #{e.message}"
      compensate(completed_steps.reverse, order_data)
      { success: false, error: e.message }
    end
  end
  
  private
  
  def compensate(completed_steps, order_data)
    completed_steps.each_with_index do |step, i|
      compensation = @compensations.reverse[i]
      puts "Compensating: #{compensation[:name]}"
      
      begin
        compensation[:action].call(order_data)
      rescue => e
        puts "Compensation failed: #{e.message}"
      end
    end
  end
end

# Define saga for order processing
saga = OrderSaga.new

saga.add_step(
  'reserve_inventory',
  ->(data) { 
    # Reserve inventory
    { inventory_reserved: true, reservation_id: 'RES123' }
  },
  ->(data) { 
    # Cancel reservation
    cancel_inventory_reservation(data[:reservation_id])
  }
)

saga.add_step(
  'process_payment',
  ->(data) { 
    # Process payment
    { payment_processed: true, transaction_id: 'TXN456' }
  },
  ->(data) { 
    # Refund payment
    refund_payment(data[:transaction_id])
  }
)

saga.add_step(
  'create_shipment',
  ->(data) { 
    # Create shipment
    { shipment_created: true, tracking_number: 'TRACK789' }
  },
  ->(data) { 
    # Cancel shipment
    cancel_shipment(data[:tracking_number])
  }
)

result = saga.execute(order_id: 1001, items: ['item1', 'item2'])

Retry with Exponential Backoff: Handling transient failures through automatic retries with increasing delays. Initial retry happens quickly; subsequent retries double the delay. Jitter randomizes delays to prevent thundering herd. Maximum retry limits prevent infinite loops.

Request Hedging: Sending duplicate requests to multiple nodes, using the first response. Reduces tail latency but increases resource usage. Applied selectively to latency-sensitive operations. Cancel redundant requests after receiving first response.

Tools & Ecosystem

Redis: In-memory data store for caching, message queues, and coordination primitives. Pub/sub messaging supports event-driven architectures. Atomic operations enable distributed locks and counters. Cluster mode provides automatic sharding and replication.

ZooKeeper: Centralized coordination service for distributed applications. Provides hierarchical namespace for configuration, naming service for service discovery, distributed locks for synchronization, and leader election primitives. Guarantees strong consistency through quorum-based replication.

etcd: Distributed key-value store for configuration and service discovery. Uses Raft consensus for consistency. Provides watch mechanism for configuration changes. HTTP/gRPC APIs for language-agnostic access. Common in Kubernetes infrastructure.

Consul: Service mesh solution with service discovery, health checking, and key-value storage. Provides DNS interface for service resolution. Multi-datacenter support with WAN gossip. Connect feature enables service-to-service communication with automatic TLS.

RabbitMQ: Message broker implementing AMQP protocol. Supports multiple messaging patterns: work queues, publish-subscribe, routing, and RPC. Plugins extend functionality with federation, shovel, and monitoring. High availability through mirrored queues.

Apache Kafka: Distributed event streaming platform for high-throughput data pipelines. Topics partition across brokers for parallelism. Consumer groups enable load balancing. Log compaction retains latest message per key. Connect framework integrates with databases and other systems.

Hazelcast: In-memory data grid providing distributed data structures. Maps, queues, topics, and locks distributed across cluster. Near cache improves read performance. Split-brain protection prevents data inconsistency during network partitions.

Error Handling & Edge Cases

Network Partitions: Network failures isolate subsets of nodes. Partition-tolerant systems continue operating with reduced functionality. Quorum-based systems require majority connectivity. Split-brain resolution detects multiple active coordinators and forces reconciliation.

class QuorumChecker
  def initialize(nodes, quorum_size)
    @nodes = nodes
    @quorum_size = quorum_size
    @last_contact = Hash.new { |h, k| h[k] = Time.now }
    @timeout = 5  # seconds
  end
  
  def heartbeat(node_id)
    @last_contact[node_id] = Time.now
  end
  
  def has_quorum?
    alive_nodes = @nodes.count do |node_id|
      Time.now - @last_contact[node_id] < @timeout
    end
    
    alive_nodes >= @quorum_size
  end
  
  def can_accept_writes?
    has_quorum?
  end
  
  def alive_nodes
    @nodes.select do |node_id|
      Time.now - @last_contact[node_id] < @timeout
    end
  end
end

checker = QuorumChecker.new(['node1', 'node2', 'node3', 'node4', 'node5'], 3)

if checker.can_accept_writes?
  # Process write request
  process_write(data)
else
  # Reject write, no quorum
  return { error: 'No quorum available', code: 503 }
end

Cascading Failures: Single component failure triggers failures in dependent components. Bulkheads isolate failures to specific subsystems. Timeout limits prevent thread exhaustion waiting for slow services. Load shedding drops requests when system overloaded.

Clock Skew: Unsynchronized clocks cause incorrect ordering and timeout calculations. NTP synchronization provides bounded clock drift. Logical clocks (Lamport, vector) track causality without physical time. Hybrid logical clocks combine physical and logical time.

Thundering Herd: Many clients simultaneously request same resource after cache expiration or service restart. Cache stampede prevention uses early expiration with probability-based refresh. Request coalescing consolidates duplicate concurrent requests.

class RequestCoalescer
  def initialize
    @pending = {}
    @mutex = Mutex.new
  end
  
  def coalesce(key, &block)
    promise = nil
    
    @mutex.synchronize do
      if @pending[key]
        # Request already in progress, wait for it
        promise = @pending[key]
      else
        # First request, create promise
        promise = Promise.new
        @pending[key] = promise
      end
    end
    
    # If we created the promise, execute the block
    if promise.pending?
      begin
        result = block.call
        promise.fulfill(result)
      rescue => e
        promise.reject(e)
      ensure
        @mutex.synchronize { @pending.delete(key) }
      end
    end
    
    promise.value
  end
end

class Promise
  def initialize
    @mutex = Mutex.new
    @condition = ConditionVariable.new
    @state = :pending
    @value = nil
  end
  
  def pending?
    @mutex.synchronize { @state == :pending }
  end
  
  def fulfill(value)
    @mutex.synchronize do
      @value = value
      @state = :fulfilled
      @condition.broadcast
    end
  end
  
  def reject(error)
    @mutex.synchronize do
      @value = error
      @state = :rejected
      @condition.broadcast
    end
  end
  
  def value
    @mutex.synchronize do
      @condition.wait(@mutex) while @state == :pending
      raise @value if @state == :rejected
      @value
    end
  end
end

# Usage prevents duplicate database queries
coalescer = RequestCoalescer.new

# Multiple concurrent requests for same data
threads = 10.times.map do
  Thread.new do
    result = coalescer.coalesce('user:1001') do
      # Only executed once despite multiple threads
      expensive_database_query('user:1001')
    end
    puts "Got result: #{result}"
  end
end

threads.each(&:join)

Poison Messages: Invalid messages that cause processing failures and repeated retries. Dead letter queues store failed messages for investigation. Message validation rejects malformed data early. Retry limits prevent infinite processing loops.

Duplicate Messages: Message brokers may deliver messages multiple times. Idempotent processing produces same result regardless of duplicate deliveries. Deduplication tracking stores processed message IDs. Natural idempotency through upserts and deterministic operations.

Partial Failures: Some operations succeed while others fail across distributed operations. Compensating transactions undo completed operations. Eventual consistency accepts temporary inconsistencies. Reconciliation jobs detect and fix inconsistencies.

Reference

Consistency Models Comparison

Model Guarantees Performance Use Cases
Strong Consistency All nodes see same data Slowest, requires coordination Financial transactions, inventory
Eventual Consistency All nodes converge eventually Fastest, no coordination Social media feeds, caching
Causal Consistency Cause precedes effect Moderate overhead Collaborative editing, messaging
Sequential Consistency Operations appear in order Moderate coordination Configuration management
Read-your-writes See own writes immediately Client-side tracking User profiles, preferences

CAP Theorem Trade-offs

System Type Prioritizes Sacrifices Examples
CP Systems Consistency + Partition Tolerance Availability MongoDB, HBase, Redis Cluster
AP Systems Availability + Partition Tolerance Consistency Cassandra, DynamoDB, CouchDB
CA Systems Consistency + Availability Partition Tolerance Traditional RDBMS (single node)

Distributed Coordination Primitives

Primitive Purpose Implementation Options
Distributed Lock Mutual exclusion across nodes Redis SETNX, ZooKeeper, etcd
Leader Election Select single coordinator Raft, Paxos, ZooKeeper
Barrier Synchronize multiple processes Redis counters, ZooKeeper
Semaphore Limit concurrent access Redis sorted sets, database rows
Queue Order processing across workers Redis lists, RabbitMQ, SQS

Message Delivery Guarantees

Level Behavior Implementation
At-most-once Fire and forget, may lose messages UDP, async without acks
At-least-once Retry until acknowledged, may duplicate TCP with retries, message queues
Exactly-once Deliver once without duplicates Idempotent processing + deduplication

Partitioning Strategies

Strategy Distribution Pros Cons
Hash Uniform across nodes Even distribution No range queries
Range By key ranges Ordered data, range queries Hotspots possible
Consistent Hash Hash with minimal rebalancing Stable on membership changes Complex implementation
Directory Lookup table maps keys to nodes Flexible assignment Lookup overhead

Replication Patterns

Pattern Consistency Write Performance Read Performance
Synchronous Strong Slow, waits for replicas Fast from any replica
Asynchronous Eventual Fast, fire and forget Stale reads possible
Semi-synchronous Tunable Moderate Good
Multi-master Conflicts possible Parallel writes Fast local reads

Ruby Distributed Computing Gems

Gem Purpose Key Features
sidekiq Background jobs Thread-based, Redis queue, retries
resque Background jobs Process-based, Redis queue
bunny RabbitMQ client AMQP protocol, async operations
ruby-kafka Kafka client Consumer groups, compression
gruf gRPC framework Built on gRPC, interceptors
dalli Memcached client Connection pooling, multi-get
redis-rb Redis client Pipelining, clustering support
celluloid Actor framework Supervision, async messaging

Network Protocols for Distributed Systems

Protocol Transport Use Case Characteristics
HTTP/REST TCP Request-response APIs Simple, stateless, widely supported
gRPC TCP (HTTP/2) RPC, microservices Binary, streaming, efficient
WebSocket TCP Bidirectional, real-time Persistent connection, low latency
AMQP TCP Message queuing Reliable, routing, transactions
Gossip UDP Membership, failure detection Eventual consistency, scalable

Failure Detection Timeouts

Scenario Typical Timeout Trade-off
Health check 1-5 seconds False positives vs detection speed
Request timeout 30-60 seconds User experience vs resource usage
Lease expiration 10-30 seconds Failover speed vs stability
Connection timeout 5-10 seconds Retry attempts vs latency
Session timeout 15-30 minutes Security vs convenience