Overview
Distributed systems consist of multiple autonomous computing components located on different networked computers that communicate and coordinate their actions by passing messages. These systems appear to users as a single coherent system despite running on independent nodes that may fail independently. The primary motivation for distributed systems includes horizontal scalability, geographic distribution, fault tolerance through redundancy, and performance optimization through parallel processing.
Unlike monolithic systems where all components run within a single process space with shared memory, distributed systems face challenges that emerge from network communication. These challenges include partial failures where some nodes fail while others continue operating, network latency and unpredictability, message loss or duplication, and the lack of a global clock for ordering events. The fundamental difficulty stems from the inability to distinguish between a slow node and a failed node, or between a lost message and a delayed message.
The evolution of distributed systems accelerated with cloud computing and microservices architectures. Modern applications routinely span multiple data centers, rely on distributed databases, implement service-oriented architectures, and use message queues for asynchronous communication. Understanding distributed system concepts has become essential for building scalable, reliable applications.
# Simple distributed system simulation
class Node
attr_reader :id, :data
def initialize(id)
@id = id
@data = {}
@peers = []
end
def add_peer(node)
@peers << node unless @peers.include?(node)
end
def write(key, value)
@data[key] = value
replicate_to_peers(key, value)
end
def read(key)
@data[key]
end
private
def replicate_to_peers(key, value)
@peers.each do |peer|
peer.receive_replication(key, value)
end
end
def receive_replication(key, value)
@data[key] = value
end
end
# Create a simple cluster
node1 = Node.new(1)
node2 = Node.new(2)
node3 = Node.new(3)
node1.add_peer(node2)
node1.add_peer(node3)
node1.write("user:123", {name: "Alice", email: "alice@example.com"})
# Data now replicated across all nodes
The example demonstrates basic replication, but real distributed systems must handle node failures, network partitions, concurrent writes, and consistency guarantees.
Key Principles
The CAP theorem, formulated by Eric Brewer, states that distributed systems can provide at most two of three guarantees simultaneously: Consistency (all nodes see the same data at the same time), Availability (every request receives a response), and Partition tolerance (the system continues operating despite network partitions). Since network partitions occur inevitably in distributed environments, systems must choose between consistency and availability during partitions.
Consistency models define the guarantees a distributed system provides regarding the visibility and ordering of operations. Strong consistency ensures all nodes see the same data simultaneously, equivalent to a single-node system. Sequential consistency preserves the order of operations from each client's perspective but allows different clients to see different orderings. Causal consistency guarantees operations with causal relationships occur in order, while allowing concurrent operations to be seen in different orders. Eventual consistency guarantees that if no new updates occur, all replicas eventually converge to the same value. The choice of consistency model profoundly impacts system design, performance, and behavior.
Consensus algorithms enable distributed nodes to agree on a single value despite failures and network issues. The Two-Phase Commit protocol ensures all nodes either commit or abort a transaction, but blocks if the coordinator fails. Paxos provides fault-tolerant consensus through a complex protocol involving proposers, acceptors, and learners. Raft simplifies consensus through leader election and log replication with clear role assignments. These algorithms form the foundation of distributed databases and coordination services.
Fault tolerance mechanisms allow systems to continue functioning despite component failures. Replication creates multiple copies of data across nodes, enabling the system to survive node failures. Quorum-based approaches require a majority of nodes to agree on operations, preventing split-brain scenarios. Heartbeat mechanisms detect failed nodes through periodic health checks. Checkpointing and logging enable recovery from failures by preserving system state.
# Implementing a simple quorum read/write
class QuorumStore
def initialize(nodes, write_quorum, read_quorum)
@nodes = nodes
@write_quorum = write_quorum
@read_quorum = read_quorum
end
def write(key, value, version)
results = @nodes.map do |node|
Thread.new { node.write(key, value, version) }
end.map(&:value)
successes = results.count { |r| r[:success] }
raise "Write failed: insufficient quorum" if successes < @write_quorum
{success: true, replicas: successes}
end
def read(key)
results = @nodes.map do |node|
Thread.new { node.read(key) }
end.map(&:value).compact
raise "Read failed: insufficient quorum" if results.size < @read_quorum
# Return value with highest version
results.max_by { |r| r[:version] }
end
end
Time and ordering present unique challenges without synchronized clocks. Lamport timestamps assign logical timestamps to events, capturing happens-before relationships without physical clocks. Vector clocks track causality by maintaining a version number for each node, enabling detection of concurrent operations. Hybrid logical clocks combine physical and logical time for better ordering while remaining bounded.
Scalability requires distributing data and load across nodes. Horizontal partitioning (sharding) divides data by a partition key, with consistent hashing minimizing redistribution when nodes are added or removed. Replication increases read capacity by serving requests from multiple copies. Load balancing distributes requests across nodes, though load balancers themselves can become bottlenecks. Caching reduces load on backend systems but introduces cache invalidation complexity.
Design Considerations
Choosing between consistency and availability during network partitions represents the central trade-off in distributed system design. Systems prioritizing consistency reject requests during partitions to prevent divergent state, ensuring clients always see correct data but reducing availability. Systems prioritizing availability accept requests during partitions, maintaining responsiveness but allowing temporary inconsistencies. Financial transactions typically require consistency to prevent double-spending or incorrect balances. Social media feeds tolerate eventual consistency for higher availability. The decision depends on business requirements, user expectations, and regulatory constraints.
Latency characteristics differ fundamentally from single-node systems. Network round trips add milliseconds to every remote operation. Sequential remote calls amplify latency, while parallel requests reduce total time. Batching multiple operations into single requests reduces overhead. Caching frequently accessed data near clients minimizes latency. Geographic distribution requires careful data placement to minimize latency for primary user populations. Techniques like read replicas, CDNs, and edge computing bring data closer to users.
Data partitioning strategies profoundly impact performance and operational complexity. Range-based partitioning groups data by key ranges, enabling efficient range queries but risking hotspots if access patterns are skewed. Hash-based partitioning distributes data uniformly but prevents range queries. Consistent hashing minimizes data movement when nodes are added or removed, but requires careful hash function selection. Directory-based partitioning uses a lookup service to map keys to nodes, adding flexibility at the cost of additional indirection.
# Consistent hashing implementation
class ConsistentHash
def initialize(nodes, virtual_nodes: 150)
@nodes = nodes
@virtual_nodes = virtual_nodes
@ring = {}
nodes.each do |node|
add_node(node)
end
end
def add_node(node)
@virtual_nodes.times do |i|
hash = Digest::MD5.hexdigest("#{node.id}:#{i}").to_i(16)
@ring[hash] = node
end
end
def remove_node(node)
@ring.delete_if { |_hash, n| n.id == node.id }
end
def get_node(key)
return @nodes.first if @ring.empty?
hash = Digest::MD5.hexdigest(key).to_i(16)
@ring.keys.sort.each do |ring_hash|
return @ring[ring_hash] if ring_hash >= hash
end
# Wrap around to first node
@ring[@ring.keys.min]
end
end
nodes = [
Node.new(1),
Node.new(2),
Node.new(3)
]
hash_ring = ConsistentHash.new(nodes)
node_for_key = hash_ring.get_node("user:12345")
Transaction boundaries and isolation levels require careful consideration. Distributed transactions using two-phase commit provide ACID guarantees but suffer from blocking and coordinator failures. Saga patterns coordinate long-running transactions through compensating actions, trading atomicity for availability. Event sourcing records state changes as events, enabling reconstruction and time travel but complicating queries. Command Query Responsibility Segregation separates read and write models, optimizing each independently but introducing eventual consistency.
Monitoring and observability become critical with distributed systems. Distributed tracing follows requests across services using correlation IDs, revealing performance bottlenecks and dependencies. Centralized logging aggregates logs from multiple nodes for analysis. Metrics collection tracks system health, resource usage, and business KPIs. These tools help operators understand system behavior and diagnose issues across multiple components.
The decision to adopt distributed architecture should consider alternatives. Vertical scaling increases single-node capacity, avoiding distribution complexity but hitting hardware limits. Read replicas handle read-heavy workloads without full distribution. Caching layers reduce backend load. Database connection pooling maximizes single-node efficiency. Only when these approaches prove insufficient should full distribution be considered, given the operational complexity and subtle bugs it introduces.
Implementation Approaches
Microservices architecture decomposes applications into independent services communicating via APIs. Each service owns its data, implements a bounded context, and can be deployed independently. This approach enables team autonomy, technology diversity, and independent scaling. However, it introduces distributed system challenges including service discovery, inter-service communication, distributed transactions, and operational complexity. Services communicate synchronously via HTTP/REST or gRPC, or asynchronously via message queues. API gateways provide a unified entry point, handling cross-cutting concerns like authentication, rate limiting, and routing.
Event-driven architectures use asynchronous message passing for communication. Services publish events when state changes occur, and interested services subscribe to relevant events. This decouples services in time and space, enabling horizontal scaling and fault isolation. Event sourcing stores state changes as immutable events rather than current state, providing audit trails and enabling temporal queries. Command Query Responsibility Segregation separates write models (optimized for transactions) from read models (optimized for queries), synchronizing them asynchronously. These patterns trade immediate consistency for scalability and resilience.
# Event-driven service communication
class EventBus
def initialize
@subscribers = Hash.new { |h, k| h[k] = [] }
end
def subscribe(event_type, handler)
@subscribers[event_type] << handler
end
def publish(event)
@subscribers[event.type].each do |handler|
Thread.new { handler.call(event) }
end
end
end
class OrderService
def initialize(event_bus)
@event_bus = event_bus
end
def create_order(items)
order = Order.create(items: items, status: 'pending')
@event_bus.publish(Event.new(
type: 'order.created',
data: {order_id: order.id, items: items}
))
order
end
end
class InventoryService
def initialize(event_bus)
@event_bus = event_bus
event_bus.subscribe('order.created', method(:handle_order_created))
end
def handle_order_created(event)
order_id = event.data[:order_id]
items = event.data[:items]
if reserve_inventory(items)
@event_bus.publish(Event.new(
type: 'inventory.reserved',
data: {order_id: order_id}
))
else
@event_bus.publish(Event.new(
type: 'inventory.insufficient',
data: {order_id: order_id}
))
end
end
def reserve_inventory(items)
# Check and reserve inventory
true
end
end
Service mesh architectures add an infrastructure layer for service-to-service communication. A sidecar proxy deployed alongside each service handles networking concerns including load balancing, service discovery, encryption, authentication, circuit breaking, and observability. The control plane configures these proxies and collects telemetry. This approach centralizes cross-cutting concerns, enforces policies consistently, and provides visibility without application changes. However, it adds operational complexity and resource overhead.
Distributed data stores implement different consistency and availability guarantees. Master-slave replication directs writes to a master node and replicates to read-only slaves, providing read scalability but creating a single point of failure. Multi-master replication allows writes to any node, requiring conflict resolution. Leaderless replication uses quorum reads and writes for consistency without a designated master. Distributed SQL databases provide ACID transactions across nodes using consensus protocols, while NoSQL stores often prioritize availability and partition tolerance over consistency.
Coordination services like ZooKeeper and etcd provide primitives for distributed applications including configuration management, service discovery, leader election, and distributed locking. These services use consensus algorithms for strong consistency and fault tolerance. Applications use watches to receive notifications of configuration changes. Leader election enables active-passive failover where one node handles requests while others wait as backups. Distributed locks coordinate access to shared resources, though they require careful timeout handling to prevent deadlocks.
Common Patterns
Leader election patterns designate one node as the leader responsible for coordinating operations. The Bully algorithm has nodes with higher IDs take over leadership, but generates many messages. The Ring algorithm passes election messages around a ring topology. Consensus-based election uses Paxos or Raft for reliable leader selection with minority fault tolerance. Leader election enables active-passive failover, serialized operations that require ordering, and centralized coordination. Leader nodes risk becoming bottlenecks and single points of failure.
# Simple leader election using heartbeats
class LeaderElection
attr_reader :current_leader
def initialize(nodes, timeout: 5)
@nodes = nodes
@timeout = timeout
@current_leader = nil
@last_heartbeat = {}
start_monitoring
end
def start_monitoring
Thread.new do
loop do
check_leader_health
sleep 1
end
end
end
def receive_heartbeat(node_id)
@last_heartbeat[node_id] = Time.now
@current_leader = node_id if @current_leader.nil?
end
def check_leader_health
return if @current_leader.nil?
last_seen = @last_heartbeat[@current_leader]
if last_seen.nil? || Time.now - last_seen > @timeout
elect_new_leader
end
end
def elect_new_leader
# Simple highest-ID election
candidates = @nodes.select do |node|
hb = @last_heartbeat[node.id]
hb && Time.now - hb < @timeout
end
@current_leader = candidates.max_by(&:id)&.id
end
end
Replication patterns maintain copies of data across nodes for fault tolerance and read scalability. Synchronous replication waits for acknowledgment from replicas before confirming writes, ensuring consistency but increasing latency. Asynchronous replication confirms writes immediately and replicates in the background, improving performance but risking data loss on failures. Quorum-based replication requires acknowledgment from a majority of nodes, balancing consistency and availability. Read-your-writes consistency ensures clients see their own writes immediately, while allowing other clients to see stale data temporarily.
Circuit breaker patterns prevent cascading failures by detecting unhealthy dependencies and failing fast. The circuit starts in a closed state, allowing requests through. After a threshold of failures, it transitions to an open state, immediately rejecting requests without calling the dependency. After a timeout, it enters a half-open state, allowing limited test requests. If these succeed, it returns to closed. This pattern prevents resource exhaustion from retrying failed operations and allows failing services time to recover.
class CircuitBreaker
STATES = [:closed, :open, :half_open].freeze
def initialize(threshold: 5, timeout: 60, half_open_requests: 3)
@threshold = threshold
@timeout = timeout
@half_open_requests = half_open_requests
@state = :closed
@failure_count = 0
@last_failure_time = nil
@half_open_attempts = 0
end
def call(&block)
case @state
when :open
if should_attempt_reset?
transition_to_half_open
else
raise CircuitOpenError, "Circuit breaker is open"
end
end
begin
result = block.call
handle_success
result
rescue => e
handle_failure
raise e
end
end
private
def handle_success
case @state
when :half_open
@half_open_attempts += 1
if @half_open_attempts >= @half_open_requests
transition_to_closed
end
when :closed
@failure_count = 0
end
end
def handle_failure
@failure_count += 1
@last_failure_time = Time.now
case @state
when :closed
transition_to_open if @failure_count >= @threshold
when :half_open
transition_to_open
end
end
def should_attempt_reset?
Time.now - @last_failure_time > @timeout
end
def transition_to_closed
@state = :closed
@failure_count = 0
@half_open_attempts = 0
end
def transition_to_open
@state = :open
end
def transition_to_half_open
@state = :half_open
@half_open_attempts = 0
end
end
Saga patterns coordinate long-running distributed transactions without distributed locks. Choreography-based sagas have each service listen for events and publish new events, with no central coordinator. Orchestration-based sagas use a central coordinator to direct the transaction flow. Each step includes a compensating transaction to undo its effects if the saga fails. This enables eventual consistency across services while maintaining business invariants. Sagas require careful design of compensating actions and handling of partial failures.
Bulkhead patterns isolate resources to prevent failures from spreading. Named after ship compartments that prevent flooding from spreading, this pattern partitions resources like thread pools, connection pools, and memory. If one partition fails or becomes saturated, others remain functional. This improves resilience by limiting the blast radius of failures. Thread pool bulkheads dedicate separate pools to different operations. Circuit breakers often combine with bulkheads for comprehensive failure isolation.
Retry patterns handle transient failures through repeated attempts with exponential backoff. Simple retries immediately reattempt failed operations, but can overwhelm struggling services. Exponential backoff increases delay between retries, giving services time to recover. Jitter adds randomness to retry timing, preventing thundering herds where many clients retry simultaneously. Maximum retry limits prevent infinite loops. Idempotency ensures retried operations produce the same result as single operations, critical for safe retries.
Ruby Implementation
Ruby applications typically participate in distributed systems as service components rather than implementing core distributed system algorithms. Ruby's dynamic nature, garbage collection, and global interpreter lock make it less suitable for low-level distributed systems infrastructure, but well-suited for application services that use distributed system platforms.
Message queue integration enables asynchronous communication between services. Sidekiq provides background job processing backed by Redis, supporting distributed workers across multiple servers. Jobs serialize to JSON, get stored in Redis queues, and workers fetch and execute them. Sidekiq handles retries, dead letter queues, and job scheduling. It scales horizontally by adding workers without code changes.
# Sidekiq for distributed job processing
class OrderProcessingJob
include Sidekiq::Job
sidekiq_options retry: 5, dead: true
def perform(order_id)
order = Order.find(order_id)
# Process order across multiple steps
inventory_result = InventoryService.reserve(order.items)
payment_result = PaymentService.charge(order.total)
if inventory_result.success? && payment_result.success?
order.update(status: 'confirmed')
ShippingJob.perform_async(order_id)
else
order.update(status: 'failed')
# Compensating transactions
InventoryService.release(order.items) if inventory_result.success?
end
end
end
# Configuring Sidekiq cluster
# config/sidekiq.yml
# :concurrency: 25
# :queues:
# - critical
# - default
# - low_priority
# Scale by running multiple Sidekiq processes
# bundle exec sidekiq -C config/sidekiq.yml
Kafka provides distributed event streaming for high-throughput, fault-tolerant message processing. The ruby-kafka gem enables producing and consuming Kafka messages. Producers publish messages to topics partitioned across brokers. Consumers join consumer groups to parallelize processing, with each partition consumed by one group member. Kafka guarantees ordering within partitions and provides configurable durability through replication.
require 'kafka'
class KafkaEventPublisher
def initialize
@kafka = Kafka.new(['localhost:9092'], client_id: 'ruby-service')
end
def publish_event(topic, event)
producer = @kafka.producer
producer.produce(
event.to_json,
topic: topic,
key: event[:entity_id],
partition_key: event[:entity_id]
)
producer.deliver_messages
ensure
producer.shutdown
end
end
class KafkaEventConsumer
def initialize
@kafka = Kafka.new(['localhost:9092'], client_id: 'ruby-consumer')
@consumer = @kafka.consumer(group_id: 'order-processor')
end
def start
@consumer.subscribe('orders')
@consumer.each_message do |message|
event = JSON.parse(message.value)
process_event(event)
@consumer.commit_offsets
end
end
def process_event(event)
case event['type']
when 'order.created'
handle_order_created(event)
when 'order.cancelled'
handle_order_cancelled(event)
end
end
end
Service communication typically uses HTTP clients for synchronous requests. The Faraday gem provides a flexible HTTP client with middleware support. Timeouts prevent hanging requests, circuit breakers prevent cascading failures, and retry middleware handles transient errors. Connection pooling reuses HTTP connections across requests.
require 'faraday'
require 'faraday/retry'
class ServiceClient
def initialize(base_url)
@conn = Faraday.new(url: base_url) do |f|
f.request :json
f.response :json
# Retry middleware for transient failures
f.request :retry,
max: 3,
interval: 0.5,
backoff_factor: 2,
retry_statuses: [429, 500, 502, 503],
methods: [:get, :post]
# Timeouts prevent hanging
f.options.timeout = 5
f.options.open_timeout = 2
f.adapter :net_http_persistent
end
end
def get_user(user_id)
response = @conn.get("/users/#{user_id}")
response.body
rescue Faraday::TimeoutError => e
# Handle timeout
raise ServiceUnavailable, "User service timeout"
rescue Faraday::Error => e
# Handle other network errors
raise ServiceError, e.message
end
end
Service discovery helps services locate each other in dynamic environments. Consul provides service registry, health checking, and DNS-based discovery. Services register themselves on startup and deregister on shutdown. Client libraries query Consul to discover available instances of required services.
require 'diplomat'
class ServiceRegistry
def initialize
Diplomat.configure do |config|
config.url = 'http://localhost:8500'
end
end
def register_service(name, port)
service = {
name: name,
address: local_ip,
port: port,
check: {
http: "http://#{local_ip}:#{port}/health",
interval: '10s',
timeout: '5s'
}
}
Diplomat::Service.register(service)
end
def discover_service(name)
services = Diplomat::Service.get(name)
services.map do |service|
{
address: service.ServiceAddress,
port: service.ServicePort
}
end
end
def deregister_service(id)
Diplomat::Service.deregister(id)
end
private
def local_ip
Socket.ip_address_list
.find { |ai| ai.ipv4? && !ai.ipv4_loopback? }
.ip_address
end
end
Distributed caching with Redis provides shared state across service instances. Redis clusters partition data across nodes for horizontal scaling. Ruby applications use connection pooling to efficiently share Redis connections across threads.
require 'redis'
require 'connection_pool'
class DistributedCache
def initialize
@pool = ConnectionPool.new(size: 10, timeout: 5) do
Redis.new(
cluster: ['redis://node1:6379', 'redis://node2:6379'],
timeout: 2,
reconnect_attempts: 3
)
end
end
def get(key)
@pool.with { |redis| redis.get(key) }
end
def set(key, value, ttl: 3600)
@pool.with { |redis| redis.setex(key, ttl, value) }
end
def delete(key)
@pool.with { |redis| redis.del(key) }
end
# Cache-aside pattern
def fetch(key, ttl: 3600, &block)
value = get(key)
return value if value
value = block.call
set(key, value, ttl: ttl)
value
end
end
# Usage with cache-aside pattern
cache = DistributedCache.new
user = cache.fetch("user:#{user_id}") do
User.find(user_id)
end
Tools & Ecosystem
Kubernetes orchestrates containerized applications across clusters of machines. It handles deployment, scaling, load balancing, and service discovery. Pods group containers that share resources, Services expose pod groups with stable endpoints, and Deployments manage pod lifecycle. Kubernetes provides horizontal pod autoscaling based on metrics, rolling updates for zero-downtime deployments, and self-healing through automatic restarts.
Docker containers package applications with dependencies, ensuring consistent environments across development, testing, and production. Images define application stack, dependencies, and configuration. Multi-stage builds optimize image size. Container registries store and distribute images. Docker Compose orchestrates multi-container applications for local development, though Kubernetes handles production orchestration.
RabbitMQ provides enterprise message queuing with complex routing capabilities. Exchanges route messages to queues based on routing keys and bindings. Consumers acknowledge messages after processing, and unacknowledged messages get redelivered. Dead letter exchanges handle failed messages. Priority queues and message TTLs provide flow control. RabbitMQ clusters replicate queues across nodes for high availability.
Apache Kafka excels at high-throughput event streaming with retention. Topics partition across brokers for parallel processing. Consumer groups enable horizontal scaling. Kafka Streams processes data within Kafka clusters. Kafka Connect integrates external systems. The platform handles millions of messages per second with low latency and guarantees delivery ordering within partitions.
Redis serves as distributed cache, session store, message broker, and data structure server. Cluster mode partitions data across nodes with automatic failover. Redis Sentinel provides high availability for non-clustered deployments. Pub/sub messaging enables real-time features. Lua scripting ensures atomic operations. Persistence options balance durability with performance.
Consul provides service discovery, configuration management, and service mesh capabilities. Services register with health checks, and clients discover services via DNS or HTTP API. Key-value store holds configuration with watch notifications. Connect service mesh secures service-to-service communication with mutual TLS. Multi-datacenter support enables global service discovery.
etcd implements distributed configuration storage using Raft consensus. It provides strong consistency, watches for configuration changes, and lease-based TTLs. Kubernetes uses etcd for cluster state. Applications use etcd for leader election, service discovery, and distributed locks. Clustering provides fault tolerance and high availability.
Prometheus collects metrics from distributed systems with dimensional data model. Services expose metrics endpoints that Prometheus scrapes. PromQL queries aggregate and analyze metrics. Alertmanager handles alert routing and grouping. Grafana visualizes Prometheus data. Service discovery integrations automatically find monitored targets.
Jaeger implements distributed tracing following OpenTracing standards. Services instrument code to create spans representing operations. Traces link spans from a request across services. Sampling reduces overhead while preserving representative traces. The Jaeger UI visualizes request flows, latency distributions, and service dependencies. This helps identify performance bottlenecks and understand system behavior.
# Distributed tracing with Jaeger
require 'jaeger/client'
OpenTracing.global_tracer = Jaeger::Client.build(
host: 'localhost',
port: 6831,
service_name: 'order-service'
)
class OrderController
def create
tracer = OpenTracing.global_tracer
span = tracer.start_active_span('create_order')
begin
order = create_order_record(span.span)
process_payment(order, span.span)
reserve_inventory(order, span.span)
span.span.set_tag('order.id', order.id)
span.span.set_tag('order.total', order.total)
{success: true, order_id: order.id}
rescue => e
span.span.set_tag('error', true)
span.span.log_kv(
event: 'error',
'error.kind': e.class.name,
message: e.message
)
raise
ensure
span.close
end
end
private
def process_payment(order, parent_span)
child_span = OpenTracing.global_tracer.start_span(
'process_payment',
child_of: parent_span
)
# Payment processing logic
ensure
child_span.finish
end
end
Reference
Consistency Models
| Model | Guarantee | Performance | Use Cases |
|---|---|---|---|
| Strong | All nodes see same data | Slowest | Financial transactions, inventory |
| Sequential | Operations ordered per client | Moderate | User sessions, social feeds |
| Causal | Causal operations ordered | Moderate | Comment threads, messaging |
| Eventual | Converges eventually | Fastest | Analytics, recommendations |
| Read-your-writes | Clients see own writes | Moderate | User profile updates |
CAP Theorem Choices
| System Type | Consistency | Availability | Example Use |
|---|---|---|---|
| CP | Yes | During partitions, no | Banking, inventory |
| AP | Eventually | Yes | Social media, caching |
| CA | Yes | In single partition | Traditional RDBMS |
Replication Strategies
| Strategy | Write Latency | Consistency | Data Loss Risk |
|---|---|---|---|
| Synchronous | High | Strong | None |
| Asynchronous | Low | Eventual | Possible |
| Quorum | Medium | Tunable | Minimal |
| Master-slave | Low | Read lag | On master failure |
| Multi-master | Medium | Conflicts | Complex resolution |
Consensus Algorithms
| Algorithm | Fault Tolerance | Complexity | Performance |
|---|---|---|---|
| 2PC | Coordinator SPOF | Simple | Blocks on failure |
| Paxos | Minority failures | Complex | Good |
| Raft | Minority failures | Moderate | Good |
| Gossip | No failures | Simple | Eventually consistent |
Partitioning Strategies
| Strategy | Range Queries | Hot Spots | Rebalancing |
|---|---|---|---|
| Range | Efficient | Possible | Complex |
| Hash | Not supported | Uniform | Simple |
| Consistent hash | Not supported | Uniform | Minimal movement |
| Directory | Flexible | Depends | Flexible |
Message Delivery Guarantees
| Guarantee | Behavior | Implementation | Use Case |
|---|---|---|---|
| At-most-once | May lose messages | Fire and forget | Metrics, logs |
| At-least-once | May duplicate | Retries without dedup | Processing with idempotency |
| Exactly-once | Single delivery | Idempotency + dedup | Financial transactions |
Distributed System Patterns
| Pattern | Purpose | Trade-offs |
|---|---|---|
| Leader election | Coordinate operations | Single point of failure risk |
| Circuit breaker | Prevent cascade failures | Reduced functionality during open |
| Bulkhead | Isolate failures | Increased resource usage |
| Saga | Distributed transactions | Complex compensation logic |
| CQRS | Separate read/write models | Eventual consistency |
| Event sourcing | Audit trail, time travel | Storage overhead, query complexity |
Ruby Distributed System Libraries
| Library | Purpose | Key Features |
|---|---|---|
| Sidekiq | Background jobs | Redis-backed, multi-threaded, retries |
| ruby-kafka | Kafka client | Producer/consumer, partition aware |
| Faraday | HTTP client | Middleware, connection pooling, retries |
| Diplomat | Consul client | Service discovery, KV store |
| redis-rb | Redis client | Cluster support, pipelining |
| bunny | RabbitMQ client | Publisher/consumer, routing |
Service Communication Patterns
| Pattern | Sync/Async | Coupling | Use Case |
|---|---|---|---|
| REST | Sync | Tight | Simple queries, CRUD |
| gRPC | Sync | Tight | High-performance RPC |
| Message queue | Async | Loose | Background jobs, decoupling |
| Event streaming | Async | Loose | Event sourcing, analytics |
| GraphQL | Sync | Medium | Flexible data fetching |
Failure Detection Methods
| Method | Detection Time | Accuracy | Overhead |
|---|---|---|---|
| Heartbeat | Timeout period | False positives | Low |
| Gossip | Propagation time | Eventually accurate | Medium |
| Request/response | Immediate | Request-dependent | None extra |
| Monitoring service | Poll interval | Centralized view | High |