CrackedRuby

Distributed System Concepts

Overview

Distributed systems consist of multiple autonomous computing components located on different networked computers that communicate and coordinate their actions by passing messages. These systems appear to users as a single coherent system despite running on independent nodes that may fail independently. The primary motivations for distributed systems include horizontal scalability, geographic distribution, fault tolerance through redundancy, and performance gains from parallel processing.

Unlike monolithic systems where all components run within a single process space with shared memory, distributed systems face challenges that emerge from network communication. These challenges include partial failures where some nodes fail while others continue operating, network latency and unpredictability, message loss or duplication, and the lack of a global clock for ordering events. The fundamental difficulty stems from the inability to distinguish between a slow node and a failed node, or between a lost message and a delayed message.

The evolution of distributed systems accelerated with cloud computing and microservices architectures. Modern applications routinely span multiple data centers, rely on distributed databases, implement service-oriented architectures, and use message queues for asynchronous communication. Understanding distributed system concepts has become essential for building scalable, reliable applications.

# Simple distributed system simulation
class Node
  attr_reader :id, :data
  
  def initialize(id)
    @id = id
    @data = {}
    @peers = []
  end
  
  def add_peer(node)
    @peers << node unless @peers.include?(node)
  end
  
  def write(key, value)
    @data[key] = value
    replicate_to_peers(key, value)
  end
  
  def read(key)
    @data[key]
  end
  
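  protected
  
  # Called by peers, so it cannot be private: Ruby private methods
  # reject calls with an explicit receiver such as peer.receive_replication.
  def receive_replication(key, value)
    @data[key] = value
  end
  
  private
  
  def replicate_to_peers(key, value)
    @peers.each do |peer|
      peer.receive_replication(key, value)
    end
  end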
end

# Create a simple cluster
node1 = Node.new(1)
node2 = Node.new(2)
node3 = Node.new(3)

node1.add_peer(node2)
node1.add_peer(node3)

node1.write("user:123", {name: "Alice", email: "alice@example.com"})
# Data now replicated across all nodes

The example demonstrates basic replication, but real distributed systems must handle node failures, network partitions, concurrent writes, and consistency guarantees.

Key Principles

The CAP theorem, formulated by Eric Brewer, states that distributed systems can provide at most two of three guarantees simultaneously: Consistency (all nodes see the same data at the same time), Availability (every request receives a response), and Partition tolerance (the system continues operating despite network partitions). Since network partitions occur inevitably in distributed environments, systems must choose between consistency and availability during partitions.

Consistency models define the guarantees a distributed system provides regarding the visibility and ordering of operations. Strong consistency (linearizability) makes the system behave as if there were a single copy of the data: once a write completes, every later read returns it. Sequential consistency requires all nodes to observe operations in the same total order, which must respect the order in which each client issued its own operations, but that order need not match real time. Causal consistency guarantees that causally related operations are seen in the same order everywhere, while allowing concurrent operations to be seen in different orders. Eventual consistency guarantees that if no new updates occur, all replicas eventually converge to the same value. The choice of consistency model shapes system design, performance, and behavior.

Consensus algorithms enable distributed nodes to agree on a single value despite failures and network issues. Two-Phase Commit, strictly an atomic commitment protocol rather than a fault-tolerant consensus algorithm, ensures all nodes either commit or abort a transaction, but blocks if the coordinator fails. Paxos provides fault-tolerant consensus through a protocol involving proposers, acceptors, and learners. Raft addresses the same problem with a design built around leader election and log replication, with clear role assignments. These algorithms form the foundation of distributed databases and coordination services.
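The two phases of Two-Phase Commit can be sketched with in-memory objects. This is a minimal illustration, not a production protocol: the `Participant` and `TwoPhaseCoordinator` classes and the `can_commit` flag are hypothetical names introduced here, and the sketch ignores timeouts, durable logging, and coordinator failure, which is exactly where the real protocol blocks.

```ruby
# Two-phase commit sketch with in-memory participants (hypothetical classes)
class Participant
  attr_reader :name, :state

  def initialize(name, can_commit: true)
    @name = name
    @can_commit = can_commit
    @state = :idle
  end

  # Phase 1: vote yes (true) or no (false)
  def prepare
    @state = @can_commit ? :prepared : :refused
    @can_commit
  end

  # Phase 2: apply the coordinator's decision
  def commit
    @state = :committed
  end

  def rollback
    @state = :rolled_back
  end
end

class TwoPhaseCoordinator
  def initialize(participants)
    @participants = participants
  end

  def execute
    # Phase 1: collect votes from every participant
    votes = @participants.map(&:prepare)

    # Phase 2: commit only on a unanimous yes, otherwise roll everyone back
    if votes.all?
      @participants.each(&:commit)
      :committed
    else
      @participants.each(&:rollback)
      :aborted
    end
  end
end
```

A single "no" vote in phase 1 is enough to roll back every participant, which is what gives the protocol its all-or-nothing behavior.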

Fault tolerance mechanisms allow systems to continue functioning despite component failures. Replication creates multiple copies of data across nodes, enabling the system to survive node failures. Quorum-based approaches require a majority of nodes to agree on operations, preventing split-brain scenarios. Heartbeat mechanisms detect failed nodes through periodic health checks. Checkpointing and logging enable recovery from failures by preserving system state.

# Implementing a simple quorum read/write
class QuorumStore
  def initialize(nodes, write_quorum, read_quorum)
    @nodes = nodes
    @write_quorum = write_quorum
    @read_quorum = read_quorum
  end
  
  def write(key, value, version)
    results = @nodes.map do |node|
      Thread.new { node.write(key, value, version) }
    end.map(&:value)
    
    successes = results.count { |r| r[:success] }
    raise "Write failed: insufficient quorum" if successes < @write_quorum
    
    {success: true, replicas: successes}
  end
  
  def read(key)
    results = @nodes.map do |node|
      Thread.new { node.read(key) }
    end.map(&:value).compact
    
    raise "Read failed: insufficient quorum" if results.size < @read_quorum
    
    # Return value with highest version
    results.max_by { |r| r[:version] }
  end
end
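The QuorumStore above accepts any quorum sizes. A common rule of thumb, assuming N replicas where every read contacts R nodes and every write contacts W nodes, is that R + W must exceed N so that each read set intersects the most recent write set. The helper below is a hypothetical illustration of that check, not part of the class above.

```ruby
# Checks whether read and write quorums are guaranteed to overlap.
# n: replica count, w: write quorum, r: read quorum (hypothetical helper)
def quorums_overlap?(n:, w:, r:)
  r + w > n
end

quorums_overlap?(n: 3, w: 2, r: 2) # => true
quorums_overlap?(n: 3, w: 1, r: 1) # => false
```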

Time and ordering present unique challenges without synchronized clocks. Lamport timestamps assign logical timestamps to events, capturing happens-before relationships without physical clocks. Vector clocks track causality by maintaining a version number for each node, enabling detection of concurrent operations. Hybrid logical clocks combine physical and logical time for better ordering while remaining bounded.
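Vector clocks are easier to follow with a small example. The sketch below is one possible in-memory representation; the `VectorClock` class and its method names are assumptions made for illustration, not an API from any library.

```ruby
# Minimal vector clock sketch (hypothetical class, not from a library)
class VectorClock
  attr_reader :counters

  def initialize(counters = {})
    @counters = Hash.new(0) # missing nodes count as 0
    counters.each { |node_id, count| @counters[node_id] = count }
  end

  # Record a local event on the given node; returns a new clock
  def tick(node_id)
    VectorClock.new(@counters.merge(node_id => @counters[node_id] + 1))
  end

  # Combine with a clock received from another node (element-wise max)
  def merge(other)
    VectorClock.new(@counters.merge(other.counters) { |_id, a, b| [a, b].max })
  end

  # True if every counter is <= the other's and at least one is strictly smaller
  def happened_before?(other)
    ids = @counters.keys | other.counters.keys
    ids.all? { |id| @counters[id] <= other.counters[id] } &&
      ids.any? { |id| @counters[id] < other.counters[id] }
  end

  # Concurrent: the clocks differ and neither happened before the other
  def concurrent_with?(other)
    ids = @counters.keys | other.counters.keys
    differs = ids.any? { |id| @counters[id] != other.counters[id] }
    differs && !happened_before?(other) && !other.happened_before?(self)
  end
end

a = VectorClock.new.tick(:n1)  # event on node n1
b = a.tick(:n2)                # n2 received a, then recorded its own event
c = a.tick(:n1)                # n1 continued without hearing from n2
```

Here `a` happened before both `b` and `c`, while `b` and `c` are concurrent: neither node had seen the other's update, which is the situation a store would have to resolve as a conflict.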

Scalability requires distributing data and load across nodes. Horizontal partitioning (sharding) divides data by a partition key, with consistent hashing minimizing redistribution when nodes are added or removed. Replication increases read capacity by serving requests from multiple copies. Load balancing distributes requests across nodes, though load balancers themselves can become bottlenecks. Caching reduces load on backend systems but introduces cache invalidation complexity.

Design Considerations

Choosing between consistency and availability during network partitions represents the central trade-off in distributed system design. Systems prioritizing consistency reject requests during partitions to prevent divergent state, ensuring clients always see correct data but reducing availability. Systems prioritizing availability accept requests during partitions, maintaining responsiveness but allowing temporary inconsistencies. Financial transactions typically require consistency to prevent double-spending or incorrect balances. Social media feeds tolerate eventual consistency for higher availability. The decision depends on business requirements, user expectations, and regulatory constraints.

Latency characteristics differ fundamentally from single-node systems. Network round trips add milliseconds to every remote operation. Sequential remote calls amplify latency, while parallel requests reduce total time. Batching multiple operations into single requests reduces overhead. Caching frequently accessed data near clients minimizes latency. Geographic distribution requires careful data placement to minimize latency for primary user populations. Techniques like read replicas, CDNs, and edge computing bring data closer to users.
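The difference between sequential and parallel remote calls can be shown with a simulation. In this sketch `fake_remote_call` and `timed` are hypothetical helpers, and `sleep` stands in for network latency, so the absolute numbers are illustrative only.

```ruby
# Simulated remote call; sleep stands in for network latency (assumption)
def fake_remote_call(name, latency: 0.05)
  sleep(latency)
  "#{name}:ok"
end

# Runs the block and returns [result, elapsed_seconds]
def timed
  start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  result = yield
  [result, Process.clock_gettime(Process::CLOCK_MONOTONIC) - start]
end

services = %w[users orders inventory]

# Sequential: total time is roughly the sum of the individual latencies
sequential, seq_time = timed { services.map { |s| fake_remote_call(s) } }

# Parallel: total time is roughly the slowest individual latency
parallel, par_time = timed do
  services.map { |s| Thread.new { fake_remote_call(s) } }.map(&:value)
end
```

Threads help here because the calls spend their time waiting rather than computing; blocking I/O and `sleep` let other Ruby threads run in the meantime.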

Data partitioning strategies profoundly impact performance and operational complexity. Range-based partitioning groups data by key ranges, enabling efficient range queries but risking hotspots if access patterns are skewed. Hash-based partitioning distributes data uniformly but prevents range queries. Consistent hashing minimizes data movement when nodes are added or removed, but requires careful hash function selection. Directory-based partitioning uses a lookup service to map keys to nodes, adding flexibility at the cost of additional indirection.

# Consistent hashing implementation
require 'digest'

class ConsistentHash
  def initialize(nodes, virtual_nodes: 150)
    @virtual_nodes = virtual_nodes
    @ring = {}
    @sorted_hashes = []
    
    nodes.each { |node| add_node(node) }
  end
  
  def add_node(node)
    @virtual_nodes.times do |i|
      @ring[hash_for("#{node.id}:#{i}")] = node
    end
    @sorted_hashes = @ring.keys.sort
  end
  
  def remove_node(node)
    @ring.delete_if { |_hash, n| n.id == node.id }
    @sorted_hashes = @ring.keys.sort
  end
  
  def get_node(key)
    # An empty ring has no owner for any key
    return nil if @ring.empty?
    
    hash = hash_for(key)
    # First ring position at or after the key's hash; wrap around to the start
    ring_hash = @sorted_hashes.bsearch { |h| h >= hash } || @sorted_hashes.first
    @ring[ring_hash]
  end
  
  private
  
  # MD5 is used only to spread keys around the ring, not for security
  def hash_for(value)
    Digest::MD5.hexdigest(value.to_s).to_i(16)
  end
end

nodes = [
  Node.new(1),
  Node.new(2), 
  Node.new(3)
]

hash_ring = ConsistentHash.new(nodes)
node_for_key = hash_ring.get_node("user:12345")

Transaction boundaries and isolation levels require careful consideration. Distributed transactions using two-phase commit provide ACID guarantees but suffer from blocking and coordinator failures. Saga patterns coordinate long-running transactions through compensating actions, trading atomicity for availability. Event sourcing records state changes as events, enabling reconstruction and time travel but complicating queries. Command Query Responsibility Segregation separates read and write models, optimizing each independently but introducing eventual consistency.

Monitoring and observability become critical with distributed systems. Distributed tracing follows requests across services using correlation IDs, revealing performance bottlenecks and dependencies. Centralized logging aggregates logs from multiple nodes for analysis. Metrics collection tracks system health, resource usage, and business KPIs. These tools help operators understand system behavior and diagnose issues across multiple components.

The decision to adopt distributed architecture should consider alternatives. Vertical scaling increases single-node capacity, avoiding distribution complexity but hitting hardware limits. Read replicas handle read-heavy workloads without full distribution. Caching layers reduce backend load. Database connection pooling maximizes single-node efficiency. Only when these approaches prove insufficient should full distribution be considered, given the operational complexity and subtle bugs it introduces.

Implementation Approaches

Microservices architecture decomposes applications into independent services communicating via APIs. Each service owns its data, implements a bounded context, and can be deployed independently. This approach enables team autonomy, technology diversity, and independent scaling. However, it introduces distributed system challenges including service discovery, inter-service communication, distributed transactions, and operational complexity. Services communicate synchronously via HTTP/REST or gRPC, or asynchronously via message queues. API gateways provide a unified entry point, handling cross-cutting concerns like authentication, rate limiting, and routing.

Event-driven architectures use asynchronous message passing for communication. Services publish events when state changes occur, and interested services subscribe to relevant events. This decouples services in time and space, enabling horizontal scaling and fault isolation. Event sourcing stores state changes as immutable events rather than current state, providing audit trails and enabling temporal queries. Command Query Responsibility Segregation separates write models (optimized for transactions) from read models (optimized for queries), synchronizing them asynchronously. These patterns trade immediate consistency for scalability and resilience.

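# Event-driven service communication
# Minimal event value object used by the services below
Event = Struct.new(:type, :data, keyword_init: true)

class EventBus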
  def initialize
    @subscribers = Hash.new { |h, k| h[k] = [] }
  end
  
  def subscribe(event_type, handler)
    @subscribers[event_type] << handler
  end
  
  def publish(event)
    @subscribers[event.type].each do |handler|
      Thread.new { handler.call(event) }
    end
  end
end

class OrderService
  def initialize(event_bus)
    @event_bus = event_bus
  end
  
  def create_order(items)
    order = Order.create(items: items, status: 'pending')
    @event_bus.publish(Event.new(
      type: 'order.created',
      data: {order_id: order.id, items: items}
    ))
    order
  end
end

class InventoryService
  def initialize(event_bus)
    @event_bus = event_bus
    event_bus.subscribe('order.created', method(:handle_order_created))
  end
  
  def handle_order_created(event)
    order_id = event.data[:order_id]
    items = event.data[:items]
    
    if reserve_inventory(items)
      @event_bus.publish(Event.new(
        type: 'inventory.reserved',
        data: {order_id: order_id}
      ))
    else
      @event_bus.publish(Event.new(
        type: 'inventory.insufficient',
        data: {order_id: order_id}
      ))
    end
  end
  
  def reserve_inventory(items)
    # Check and reserve inventory
    true
  end
end
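Event sourcing, described above as storing state changes as immutable events, can be sketched independently of the event bus. The `Account` class below is a hypothetical example: its balance is never stored directly but derived by replaying events, which is what makes audit trails and temporal queries possible.

```ruby
# Event sourcing sketch: state is derived by replaying events (hypothetical class)
class Account
  attr_reader :balance, :events

  # Rebuilds state from an existing event history, if one is given
  def initialize(history = [])
    @balance = 0
    @events = []
    history.each { |event| apply(event) }
  end

  def deposit(amount)
    apply(type: :deposited, amount: amount)
  end

  def withdraw(amount)
    raise ArgumentError, "insufficient funds" if amount > @balance

    apply(type: :withdrawn, amount: amount)
  end

  private

  # Updates in-memory state and appends the event to the log
  def apply(event)
    case event[:type]
    when :deposited then @balance += event[:amount]
    when :withdrawn then @balance -= event[:amount]
    end
    @events << event
  end
end

account = Account.new
account.deposit(100)
account.withdraw(30)

# Replaying the full log reproduces the current state
replayed = Account.new(account.events)

# Replaying a prefix of the log answers "what was the balance back then?"
earlier = Account.new(account.events.first(1))
```

A real event store would persist the log durably and usually add snapshots so that replay does not start from the first event every time.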

Service mesh architectures add an infrastructure layer for service-to-service communication. A sidecar proxy deployed alongside each service handles networking concerns including load balancing, service discovery, encryption, authentication, circuit breaking, and observability. The control plane configures these proxies and collects telemetry. This approach centralizes cross-cutting concerns, enforces policies consistently, and provides visibility without application changes. However, it adds operational complexity and resource overhead.

Distributed data stores implement different consistency and availability guarantees. Master-slave (primary-replica) replication directs writes to a master node and replicates to read-only slaves, providing read scalability but making the master a single point of failure for writes unless failover is in place. Multi-master replication allows writes to any node, requiring conflict resolution. Leaderless replication uses quorum reads and writes for consistency without a designated master. Distributed SQL databases provide ACID transactions across nodes using consensus protocols, while NoSQL stores often prioritize availability and partition tolerance over consistency.

Coordination services like ZooKeeper and etcd provide primitives for distributed applications including configuration management, service discovery, leader election, and distributed locking. These services use consensus algorithms for strong consistency and fault tolerance. Applications use watches to receive notifications of configuration changes. Leader election enables active-passive failover where one node handles requests while others wait as backups. Distributed locks coordinate access to shared resources, though they require careful timeout handling to prevent deadlocks.

Common Patterns

Leader election patterns designate one node as the leader responsible for coordinating operations. The Bully algorithm has the live node with the highest ID take over leadership, but generates many messages. The Ring algorithm passes election messages around a ring topology. Consensus-based election uses Paxos or Raft for reliable leader selection that tolerates the failure of a minority of nodes. Leader election enables active-passive failover, serialized operations that require ordering, and centralized coordination. Leader nodes risk becoming bottlenecks and single points of failure.

# Simple leader election using heartbeats
class LeaderElection
  attr_reader :current_leader
  
  def initialize(nodes, timeout: 5)
    @nodes = nodes
    @timeout = timeout
    @current_leader = nil
    @last_heartbeat = {}
    start_monitoring
  end
  
  def start_monitoring
    Thread.new do
      loop do
        check_leader_health
        sleep 1
      end
    end
  end
  
  def receive_heartbeat(node_id)
    @last_heartbeat[node_id] = Time.now
    @current_leader = node_id if @current_leader.nil?
  end
  
  def check_leader_health
    return if @current_leader.nil?
    
    last_seen = @last_heartbeat[@current_leader]
    if last_seen.nil? || Time.now - last_seen > @timeout
      elect_new_leader
    end
  end
  
  def elect_new_leader
    # Simple highest-ID election
    candidates = @nodes.select do |node|
      hb = @last_heartbeat[node.id]
      hb && Time.now - hb < @timeout
    end
    
    @current_leader = candidates.max_by(&:id)&.id
  end
end

Replication patterns maintain copies of data across nodes for fault tolerance and read scalability. Synchronous replication waits for acknowledgment from replicas before confirming writes, ensuring consistency but increasing latency. Asynchronous replication confirms writes immediately and replicates in the background, improving performance but risking data loss on failures. Quorum-based replication requires acknowledgment from a majority of nodes, balancing consistency and availability. Read-your-writes consistency ensures clients see their own writes immediately, while allowing other clients to see stale data temporarily.

Circuit breaker patterns prevent cascading failures by detecting unhealthy dependencies and failing fast. The circuit starts in a closed state, allowing requests through. After a threshold of failures, it transitions to an open state, immediately rejecting requests without calling the dependency. After a timeout, it enters a half-open state, allowing limited test requests. If these succeed, it returns to closed. This pattern prevents resource exhaustion from retrying failed operations and allows failing services time to recover.

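# Raised when the breaker rejects a call without invoking the dependency
class CircuitOpenError < StandardError; end

class CircuitBreaker
  STATES = [:closed, :open, :half_open].freeze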
  
  def initialize(threshold: 5, timeout: 60, half_open_requests: 3)
    @threshold = threshold
    @timeout = timeout
    @half_open_requests = half_open_requests
    @state = :closed
    @failure_count = 0
    @last_failure_time = nil
    @half_open_attempts = 0
  end
  
  def call(&block)
    case @state
    when :open
      if should_attempt_reset?
        transition_to_half_open
      else
        raise CircuitOpenError, "Circuit breaker is open"
      end
    end
    
    begin
      result = block.call
      handle_success
      result
    rescue => e
      handle_failure
      raise e
    end
  end
  
  private
  
  def handle_success
    case @state
    when :half_open
      @half_open_attempts += 1
      if @half_open_attempts >= @half_open_requests
        transition_to_closed
      end
    when :closed
      @failure_count = 0
    end
  end
  
  def handle_failure
    @failure_count += 1
    @last_failure_time = Time.now
    
    case @state
    when :closed
      transition_to_open if @failure_count >= @threshold
    when :half_open
      transition_to_open
    end
  end
  
  def should_attempt_reset?
    Time.now - @last_failure_time > @timeout
  end
  
  def transition_to_closed
    @state = :closed
    @failure_count = 0
    @half_open_attempts = 0
  end
  
  def transition_to_open
    @state = :open
  end
  
  def transition_to_half_open
    @state = :half_open
    @half_open_attempts = 0
  end
end

Saga patterns coordinate long-running distributed transactions without distributed locks. Choreography-based sagas have each service listen for events and publish new events, with no central coordinator. Orchestration-based sagas use a central coordinator to direct the transaction flow. Each step includes a compensating transaction to undo its effects if the saga fails. This enables eventual consistency across services while maintaining business invariants. Sagas require careful design of compensating actions and handling of partial failures.
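An orchestration-based saga can be sketched as a list of steps, each paired with a compensating action. The `Saga` class and the step names below are hypothetical and run in a single process; a real orchestrator would persist its progress so it can resume after a crash.

```ruby
# Orchestration-based saga sketch (hypothetical class and step names)
class Saga
  Step = Struct.new(:name, :action, :compensation)

  def initialize
    @steps = []
  end

  def step(name, action:, compensation:)
    @steps << Step.new(name, action, compensation)
    self
  end

  # Runs steps in order. On failure, runs the compensations of the
  # completed steps in reverse order and reports the failed step.
  def run(context = {})
    completed = []
    @steps.each do |s|
      begin
        s.action.call(context)
        completed << s
      rescue StandardError => e
        completed.reverse_each { |done| done.compensation.call(context) }
        return { success: false, failed_step: s.name, error: e.message }
      end
    end
    { success: true }
  end
end

log = []
saga = Saga.new
  .step(:reserve, action: ->(_c) { log << :reserve }, compensation: ->(_c) { log << :release })
  .step(:charge, action: ->(_c) { raise "card declined" }, compensation: ->(_c) { log << :refund })
  .step(:ship, action: ->(_c) { log << :ship }, compensation: ->(_c) { log << :cancel_shipment })

result = saga.run
```

In this run the charge step fails, so only the reservation is compensated: the failed step itself never completed and the shipping step never started. The sketch also assumes compensations do not fail, which a real design cannot take for granted.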

Bulkhead patterns isolate resources to prevent failures from spreading. Named after ship compartments that prevent flooding from spreading, this pattern partitions resources like thread pools, connection pools, and memory. If one partition fails or becomes saturated, others remain functional. This improves resilience by limiting the blast radius of failures. Thread pool bulkheads dedicate separate pools to different operations. Circuit breakers often combine with bulkheads for comprehensive failure isolation.
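A bulkhead can be as simple as a cap on concurrent calls per dependency. The sketch below rejects calls once the cap is reached instead of queueing them; the `Bulkhead` class and `FullError` are hypothetical names, and a thread pool per dependency would be an equally valid design.

```ruby
# Bulkhead sketch: caps concurrent calls to one dependency (hypothetical class)
class Bulkhead
  class FullError < StandardError; end

  def initialize(max_concurrent)
    @max_concurrent = max_concurrent
    @in_flight = 0
    @mutex = Mutex.new
  end

  # Runs the block if a slot is free, otherwise raises FullError immediately
  def call
    acquire
    begin
      yield
    ensure
      release
    end
  end

  private

  def acquire
    @mutex.synchronize do
      raise FullError, "bulkhead full (#{@max_concurrent} in flight)" if @in_flight >= @max_concurrent

      @in_flight += 1
    end
  end

  def release
    @mutex.synchronize { @in_flight -= 1 }
  end
end

payments = Bulkhead.new(2)
receipt = payments.call { :charged }
```

Giving each downstream dependency its own instance means a slow payment provider can exhaust only its own two slots, leaving capacity for calls to other services.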

Retry patterns handle transient failures through repeated attempts with exponential backoff. Simple retries immediately reattempt failed operations, but can overwhelm struggling services. Exponential backoff increases delay between retries, giving services time to recover. Jitter adds randomness to retry timing, preventing thundering herds where many clients retry simultaneously. Maximum retry limits prevent infinite loops. Idempotency ensures retried operations produce the same result as single operations, critical for safe retries.
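Exponential backoff with jitter fits in a few lines. The `with_retries` helper below is a hypothetical sketch using the "full jitter" variant, where each delay is a random value between zero and an exponentially growing cap; the `sleeper` parameter is an assumption added so the delay can be replaced in tests.

```ruby
# Retry with exponential backoff and full jitter (hypothetical helper)
def with_retries(max_attempts: 5, base_delay: 0.1, max_delay: 5.0, sleeper: ->(s) { sleep(s) })
  attempt = 0
  begin
    attempt += 1
    yield attempt
  rescue StandardError
    # Give up once the attempt limit is reached and re-raise the last error
    raise if attempt >= max_attempts

    # Exponential cap: base_delay * 2^(attempt - 1), bounded by max_delay
    cap = [base_delay * (2**(attempt - 1)), max_delay].min
    # Full jitter: wait a random duration in [0, cap)
    sleeper.call(rand * cap)
    retry
  end
end

delays = []
calls = 0
result = with_retries(max_attempts: 4, base_delay: 0.1, sleeper: ->(s) { delays << s }) do |attempt|
  calls += 1
  raise IOError, "transient failure" if attempt < 3

  :ok
end
```

The block should be idempotent, since it may run several times. In practice the rescue would usually be narrowed to errors known to be transient rather than every StandardError.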

Ruby Implementation

Ruby applications typically participate in distributed systems as service components rather than implementing core distributed system algorithms. Ruby's dynamic nature, garbage collection, and CRuby's global VM lock (GVL), which prevents Ruby code from running in parallel across threads within one process, make it less suitable for low-level distributed systems infrastructure, but well-suited for application services that use distributed system platforms.

Message queue integration enables asynchronous communication between services. Sidekiq provides background job processing backed by Redis, supporting distributed workers across multiple servers. Jobs serialize to JSON, get stored in Redis queues, and workers fetch and execute them. Sidekiq handles retries, dead letter queues, and job scheduling. It scales horizontally by adding workers without code changes.

# Sidekiq for distributed job processing
class OrderProcessingJob
  include Sidekiq::Job
  
  sidekiq_options retry: 5, dead: true
  
  def perform(order_id)
    order = Order.find(order_id)
    
    # Reserve stock first so a customer is never charged for unavailable items
    inventory_result = InventoryService.reserve(order.items)
    unless inventory_result.success?
      order.update(status: 'failed')
      return
    end
    
    payment_result = PaymentService.charge(order.total)
    if payment_result.success?
      order.update(status: 'confirmed')
      ShippingJob.perform_async(order_id)
    else
      # Compensating transaction: undo the reservation made above
      InventoryService.release(order.items)
      order.update(status: 'failed')
    end
  end
end

# Configuring Sidekiq cluster
# config/sidekiq.yml
# :concurrency: 25
# :queues:
#   - critical
#   - default
#   - low_priority

# Scale by running multiple Sidekiq processes
# bundle exec sidekiq -C config/sidekiq.yml

Kafka provides distributed event streaming for high-throughput, fault-tolerant message processing. The ruby-kafka gem enables producing and consuming Kafka messages. Producers publish messages to topics partitioned across brokers. Consumers join consumer groups to parallelize processing, with each partition consumed by one group member. Kafka guarantees ordering within partitions and provides configurable durability through replication.

require 'json'
require 'kafka'

class KafkaEventPublisher
  def initialize
    @kafka = Kafka.new(['localhost:9092'], client_id: 'ruby-service')
  end
  
  def publish_event(topic, event)
    producer = @kafka.producer
    
    producer.produce(
      event.to_json,
      topic: topic,
      key: event[:entity_id],
      partition_key: event[:entity_id]
    )
    
    producer.deliver_messages
  ensure
    producer.shutdown
  end
end

class KafkaEventConsumer
  def initialize
    @kafka = Kafka.new(['localhost:9092'], client_id: 'ruby-consumer')
    @consumer = @kafka.consumer(group_id: 'order-processor')
  end
  
  def start
    @consumer.subscribe('orders')
    
    @consumer.each_message do |message|
      event = JSON.parse(message.value)
      process_event(event)
      @consumer.commit_offsets
    end
  end
  
  def process_event(event)
    case event['type']
    when 'order.created'
      handle_order_created(event)
    when 'order.cancelled'
      handle_order_cancelled(event)
    end
  end
end

Service communication typically uses HTTP clients for synchronous requests. The Faraday gem provides a flexible HTTP client with middleware support. Timeouts prevent hanging requests, circuit breakers prevent cascading failures, and retry middleware handles transient errors. Connection pooling reuses HTTP connections across requests.

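require 'faraday'
require 'faraday/retry'
require 'faraday/net_http_persistent' # provides the :net_http_persistent adapter

# Application-level errors raised by the client
class ServiceError < StandardError; end
class ServiceUnavailable < ServiceError; end

class ServiceClient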
  def initialize(base_url)
    @conn = Faraday.new(url: base_url) do |f|
      f.request :json
      f.response :json
      
      # Retry middleware for transient failures.
      # Only GET is retried here: retrying POST can repeat side effects
      # unless the endpoint is idempotent.
      f.request :retry,
        max: 3,
        interval: 0.5,
        backoff_factor: 2,
        retry_statuses: [429, 500, 502, 503],
        methods: [:get]
      
      # Timeouts prevent hanging
      f.options.timeout = 5
      f.options.open_timeout = 2
      
      f.adapter :net_http_persistent
    end
  end
  
  def get_user(user_id)
    response = @conn.get("/users/#{user_id}")
    response.body
  rescue Faraday::TimeoutError => e
    # Handle timeout
    raise ServiceUnavailable, "User service timeout"
  rescue Faraday::Error => e
    # Handle other network errors
    raise ServiceError, e.message
  end
end

Service discovery helps services locate each other in dynamic environments. Consul provides service registry, health checking, and DNS-based discovery. Services register themselves on startup and deregister on shutdown. Client libraries query Consul to discover available instances of required services.

require 'diplomat'
require 'socket' # Socket.ip_address_list is used in local_ip

class ServiceRegistry
  def initialize
    Diplomat.configure do |config|
      config.url = 'http://localhost:8500'
    end
  end
  
  def register_service(name, port)
    service = {
      name: name,
      address: local_ip,
      port: port,
      check: {
        http: "http://#{local_ip}:#{port}/health",
        interval: '10s',
        timeout: '5s'
      }
    }
    
    Diplomat::Service.register(service)
  end
  
  def discover_service(name)
    # :all returns every registered instance; the default returns only the first
    services = Diplomat::Service.get(name, :all)
    services.map do |service|
      {
        address: service.ServiceAddress,
        port: service.ServicePort
      }
    end
  end
  
  def deregister_service(id)
    Diplomat::Service.deregister(id)
  end
  
  private
  
  def local_ip
    Socket.ip_address_list
          .find { |ai| ai.ipv4? && !ai.ipv4_loopback? }
          .ip_address
  end
end

Distributed caching with Redis provides shared state across service instances. Redis clusters partition data across nodes for horizontal scaling. Ruby applications use connection pooling to efficiently share Redis connections across threads.

require 'json'
require 'redis'
require 'connection_pool'

class DistributedCache
  def initialize
    @pool = ConnectionPool.new(size: 10, timeout: 5) do
      # The cluster: option applies to redis-rb 4.x; redis-rb 5 moved
      # cluster support into the separate redis-clustering gem.
      Redis.new(
        cluster: ['redis://node1:6379', 'redis://node2:6379'],
        timeout: 2,
        reconnect_attempts: 3
      )
    end
  end
  
  def get(key)
    @pool.with { |redis| redis.get(key) }
  end
  
  def set(key, value, ttl: 3600)
    @pool.with { |redis| redis.setex(key, ttl, value) }
  end
  
  def delete(key)
    @pool.with { |redis| redis.del(key) }
  end
  
  # Cache-aside pattern. Redis stores strings, so values are serialized
  # as JSON; cache hits therefore come back as parsed JSON (string keys).
  def fetch(key, ttl: 3600, &block)
    cached = get(key)
    return JSON.parse(cached) if cached
    
    value = block.call
    set(key, JSON.generate(value), ttl: ttl)
    value
  end
end

# Usage with cache-aside pattern: cache a plain hash rather than the model object
cache = DistributedCache.new
user = cache.fetch("user:#{user_id}") do
  User.find(user_id).attributes
end

Tools & Ecosystem

Kubernetes orchestrates containerized applications across clusters of machines. It handles deployment, scaling, load balancing, and service discovery. Pods group containers that share resources, Services expose pod groups with stable endpoints, and Deployments manage pod lifecycle. Kubernetes provides horizontal pod autoscaling based on metrics, rolling updates for zero-downtime deployments, and self-healing through automatic restarts.

Docker containers package applications with dependencies, ensuring consistent environments across development, testing, and production. Images define the application stack, dependencies, and configuration. Multi-stage builds optimize image size. Container registries store and distribute images. Docker Compose orchestrates multi-container applications for local development, though Kubernetes handles production orchestration.

RabbitMQ provides enterprise message queuing with complex routing capabilities. Exchanges route messages to queues based on routing keys and bindings. Consumers acknowledge messages after processing, and unacknowledged messages get redelivered. Dead letter exchanges handle failed messages. Priority queues and message TTLs provide flow control. RabbitMQ clusters replicate queues across nodes for high availability.

Apache Kafka excels at high-throughput event streaming with retention. Topics partition across brokers for parallel processing. Consumer groups enable horizontal scaling. Kafka Streams processes data within Kafka clusters. Kafka Connect integrates external systems. The platform handles millions of messages per second with low latency and guarantees delivery ordering within partitions.

Redis serves as a distributed cache, session store, message broker, and data structure server. Cluster mode partitions data across nodes with automatic failover. Redis Sentinel provides high availability for non-clustered deployments. Pub/sub messaging enables real-time features. Lua scripting ensures atomic operations. Persistence options balance durability with performance.

Consul provides service discovery, configuration management, and service mesh capabilities. Services register with health checks, and clients discover services via DNS or HTTP API. Key-value store holds configuration with watch notifications. Connect service mesh secures service-to-service communication with mutual TLS. Multi-datacenter support enables global service discovery.

etcd implements distributed configuration storage using Raft consensus. It provides strong consistency, watches for configuration changes, and lease-based TTLs. Kubernetes uses etcd for cluster state. Applications use etcd for leader election, service discovery, and distributed locks. Clustering provides fault tolerance and high availability.

Prometheus collects metrics from distributed systems using a dimensional data model. Services expose metrics endpoints that Prometheus scrapes. PromQL queries aggregate and analyze metrics. Alertmanager handles alert routing and grouping. Grafana visualizes Prometheus data. Service discovery integrations automatically find monitored targets.

Jaeger implements distributed tracing and historically followed the OpenTracing API, which has since been archived in favor of OpenTelemetry. Services instrument code to create spans representing operations. Traces link spans from a request across services. Sampling reduces overhead while preserving representative traces. The Jaeger UI visualizes request flows, latency distributions, and service dependencies. This helps identify performance bottlenecks and understand system behavior.

# Distributed tracing with Jaeger
require 'jaeger/client'

OpenTracing.global_tracer = Jaeger::Client.build(
  host: 'localhost',
  port: 6831,
  service_name: 'order-service'
)

class OrderController
  def create
    tracer = OpenTracing.global_tracer
    # start_active_span returns a scope; the span itself is scope.span
    scope = tracer.start_active_span('create_order')
    span = scope.span
    
    begin
      order = create_order_record(span)
      process_payment(order, span)
      reserve_inventory(order, span)
      
      span.set_tag('order.id', order.id)
      span.set_tag('order.total', order.total)
      
      {success: true, order_id: order.id}
    rescue => e
      span.set_tag('error', true)
      span.log_kv(
        event: 'error',
        'error.kind': e.class.name,
        message: e.message
      )
      raise
    ensure
      scope.close
    end
  end
  
  private
  
  def process_payment(order, parent_span)
    child_span = OpenTracing.global_tracer.start_span(
      'process_payment',
      child_of: parent_span
    )
    
    # Payment processing logic
    
  ensure
    child_span.finish
  end
end

Reference

Consistency Models

| Model | Guarantee | Performance | Use Cases |
|---|---|---|---|
| Strong | All nodes see same data | Slowest | Financial transactions, inventory |
| Sequential | Operations ordered per client | Moderate | User sessions, social feeds |
| Causal | Causal operations ordered | Moderate | Comment threads, messaging |
| Eventual | Converges eventually | Fastest | Analytics, recommendations |
| Read-your-writes | Clients see own writes | Moderate | User profile updates |

CAP Theorem Choices

| System Type | Consistency | Availability | Example Use |
|---|---|---|---|
| CP | Yes | During partitions, no | Banking, inventory |
| AP | Eventually | Yes | Social media, caching |
| CA | Yes | In single partition | Traditional RDBMS |

Replication Strategies

| Strategy | Write Latency | Consistency | Data Loss Risk |
|---|---|---|---|
| Synchronous | High | Strong | None |
| Asynchronous | Low | Eventual | Possible |
| Quorum | Medium | Tunable | Minimal |
| Master-slave | Low | Read lag | On master failure |
| Multi-master | Medium | Conflicts | Complex resolution |

Consensus Algorithms

| Algorithm | Fault Tolerance | Complexity | Performance |
|---|---|---|---|
| 2PC | Coordinator SPOF | Simple | Blocks on failure |
| Paxos | Minority failures | Complex | Good |
| Raft | Minority failures | Moderate | Good |
| Gossip | No failures | Simple | Eventually consistent |

Partitioning Strategies

| Strategy | Range Queries | Hot Spots | Rebalancing |
|---|---|---|---|
| Range | Efficient | Possible | Complex |
| Hash | Not supported | Uniform | Simple |
| Consistent hash | Not supported | Uniform | Minimal movement |
| Directory | Flexible | Depends | Flexible |

Message Delivery Guarantees

| Guarantee | Behavior | Implementation | Use Case |
|---|---|---|---|
| At-most-once | May lose messages | Fire and forget | Metrics, logs |
| At-least-once | May duplicate | Retries without dedup | Processing with idempotency |
| Exactly-once | Single delivery | Idempotency + dedup | Financial transactions |
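
The "idempotency + dedup" approach from the table can be sketched as a consumer that remembers which message IDs it has already handled, so redelivered messages have no further effect. The `IdempotentConsumer` class and the `:id` message field are assumptions made for illustration; a real system would keep the processed IDs in durable storage and update them atomically with the handler's side effects.

```ruby
require 'set'

# Idempotent consumer sketch: deduplicates by message ID (hypothetical class).
# Processed IDs live in memory here, which would not survive a restart.
class IdempotentConsumer
  def initialize(&handler)
    @handler = handler
    @processed_ids = Set.new
  end

  # Returns true if the handler ran, false if the message was a duplicate
  def handle(message)
    id = message.fetch(:id)
    return false if @processed_ids.include?(id)

    @handler.call(message)
    @processed_ids << id # mark as processed only after the handler succeeds
    true
  end
end

total = 0
consumer = IdempotentConsumer.new { |msg| total += msg[:amount] }

first = consumer.handle(id: "m1", amount: 10)
duplicate = consumer.handle(id: "m1", amount: 10) # redelivery of the same message
second = consumer.handle(id: "m2", amount: 5)
```

Because the ID is recorded only after the handler succeeds, a handler that raises leaves the message eligible for redelivery, which matches at-least-once semantics on the broker side.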

Distributed System Patterns

| Pattern | Purpose | Trade-offs |
|---|---|---|
| Leader election | Coordinate operations | Single point of failure risk |
| Circuit breaker | Prevent cascade failures | Reduced functionality during open |
| Bulkhead | Isolate failures | Increased resource usage |
| Saga | Distributed transactions | Complex compensation logic |
| CQRS | Separate read/write models | Eventual consistency |
| Event sourcing | Audit trail, time travel | Storage overhead, query complexity |

Ruby Distributed System Libraries

| Library | Purpose | Key Features |
|---|---|---|
| Sidekiq | Background jobs | Redis-backed, multi-threaded, retries |
| ruby-kafka | Kafka client | Producer/consumer, partition aware |
| Faraday | HTTP client | Middleware, connection pooling, retries |
| Diplomat | Consul client | Service discovery, KV store |
| redis-rb | Redis client | Cluster support, pipelining |
| bunny | RabbitMQ client | Publisher/consumer, routing |

Service Communication Patterns

| Pattern | Sync/Async | Coupling | Use Case |
|---|---|---|---|
| REST | Sync | Tight | Simple queries, CRUD |
| gRPC | Sync | Tight | High-performance RPC |
| Message queue | Async | Loose | Background jobs, decoupling |
| Event streaming | Async | Loose | Event sourcing, analytics |
| GraphQL | Sync | Medium | Flexible data fetching |

Failure Detection Methods

| Method | Detection Time | Accuracy | Overhead |
|---|---|---|---|
| Heartbeat | Timeout period | False positives | Low |
| Gossip | Propagation time | Eventually accurate | Medium |
| Request/response | Immediate | Request-dependent | None extra |
| Monitoring service | Poll interval | Centralized view | High |