CrackedRuby

Overview

Data replication maintains multiple copies of data across different nodes, servers, or geographic locations. This technique enables high availability, fault tolerance, load distribution, and improved read performance by keeping synchronized copies of data that can serve requests independently.

Replication addresses several critical needs in distributed systems. When a primary database fails, a replica can be promoted to take over; with synchronous replication, that replica already holds every committed write, so failover loses no data. Geographic distribution places data closer to users, reducing latency for global applications. Read-heavy workloads distribute queries across multiple replicas, preventing any single database from becoming a bottleneck.

The fundamental challenge in replication involves maintaining consistency between copies while managing the inherent delays and failures in distributed networks. When data changes on one node, that change must propagate to all replicas. The timing, ordering, and guarantees around this propagation define different replication strategies, each with distinct trade-offs.

# Conceptual representation of a replicated data store
class ReplicatedStore
  def initialize(primary, replicas)
    @primary = primary
    @replicas = replicas
  end
  
  def write(key, value)
    # Write to primary first
    @primary.set(key, value)
    
    # Propagate to replicas
    @replicas.each { |replica| replica.set(key, value) }
  end
  
  def read(key)
    # Read from a random node (primary or replica); this toy version
    # does not check node health or replication lag
    available_node = [@primary, *@replicas].sample
    available_node.get(key)
  end
end

Replication operates at different system layers. Database replication duplicates entire databases or specific tables. Application-level replication synchronizes data through custom logic. File system replication maintains identical copies of files across storage systems. Message queue replication ensures event delivery across multiple brokers.

Key Principles

Replication relies on several fundamental mechanisms that govern how data copies remain synchronized and how the system behaves during normal operation and failures.

Replication Lag represents the time delay between a write on the primary and its visibility on replicas. This lag exists in all asynchronous replication systems and directly impacts read consistency. A replica lagging by five seconds serves data that may be five seconds stale. Applications must account for this staleness when designing read patterns.
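Lag can also be expressed in time rather than bytes. The sketch below assumes the primary records a commit timestamp for every log entry and that the replica reports how many entries it has applied; `replication_lag_seconds` and its inputs are illustrative names, not part of any library.

```ruby
# Time-based lag: how long the oldest write that the replica has not yet
# applied has been waiting. Zero when the replica is fully caught up.
def replication_lag_seconds(commit_times, applied_count, now)
  # commit_times: primary commit time (in seconds) of each log entry, in log order
  # applied_count: number of entries the replica has applied so far
  oldest_unapplied = commit_times[applied_count]
  oldest_unapplied ? now - oldest_unapplied : 0.0
end

commit_times = [100.0, 101.5, 103.0, 104.0]
replication_lag_seconds(commit_times, 2, 105.0) # => 2.0, the entry committed at 103.0 is pending
replication_lag_seconds(commit_times, 4, 105.0) # => 0.0, caught up
```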

Write Path determines where write operations occur and how they propagate. In primary-based replication, all writes go to a designated primary node that forwards changes to replicas. In multi-master replication, multiple nodes accept writes concurrently, requiring conflict resolution when the same data changes in different locations.

Replication Log captures all changes to data in a sequential, ordered format. This log serves as the source of truth for replication. PostgreSQL uses Write-Ahead Logging (WAL), MySQL uses binary logs, and many distributed systems use commit logs. Replicas consume these logs to apply changes in the same order they occurred on the primary.

# Simplified replication log structure
class ReplicationLog
  def initialize
    @entries = []
    @position = 0
  end
  
  def append(operation, key, value, timestamp)
    entry = {
      position: @position,
      operation: operation,
      key: key,
      value: value,
      timestamp: timestamp
    }
    @entries << entry
    @position += 1
    entry
  end
  
  def entries_since(position)
    @entries.select { |e| e[:position] > position }
  end
end

Consistency Models define guarantees about what data replicas return when read. Strong consistency ensures every read returns the most recently committed write, so the system behaves as if there were a single copy, but it requires coordination that adds latency. Eventual consistency allows replicas to temporarily diverge, guaranteeing only that they will converge given enough time without new writes. Causal consistency preserves cause-effect relationships while allowing some divergence.

Conflict Resolution becomes necessary when multiple nodes accept writes to the same data. Last-write-wins uses timestamps to pick the most recent write, though clock skew between nodes can cause a write that actually happened later to be discarded. Vector clocks track causality to detect concurrent writes that genuinely conflict. Application-specific resolvers use domain knowledge to merge conflicting values.
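The first two approaches can be contrasted in a few lines. This is a minimal sketch: the `Version` struct, `lww_merge`, and `compare_vector_clocks` are hypothetical helpers, and vector clocks are modeled as plain hashes from node id to counter.

```ruby
# Last-write-wins: keep the version with the larger timestamp, breaking ties
# by node id so every replica deterministically picks the same winner.
# With skewed clocks, a causally later write can carry a smaller timestamp and lose.
Version = Struct.new(:value, :timestamp, :node_id)

def lww_merge(a, b)
  [a, b].max_by { |v| [v.timestamp, v.node_id] }
end

# Vector clocks: compare per-node counters to tell "happened before"
# apart from "concurrent" (a genuine conflict).
def compare_vector_clocks(a, b)
  nodes = a.keys | b.keys
  a_le_b = nodes.all? { |n| a.fetch(n, 0) <= b.fetch(n, 0) }
  b_le_a = nodes.all? { |n| b.fetch(n, 0) <= a.fetch(n, 0) }
  return :equal if a_le_b && b_le_a
  return :before if a_le_b # a happened before b; b supersedes a
  return :after if b_le_a  # b happened before a; a supersedes b

  :concurrent              # neither saw the other: needs resolution
end

lww_merge(Version.new("old", 10, "n1"), Version.new("new", 12, "n2")).value # => "new"
compare_vector_clocks({ "n1" => 2, "n2" => 1 }, { "n1" => 2, "n2" => 3 })   # => :before
compare_vector_clocks({ "n1" => 3, "n2" => 1 }, { "n1" => 2, "n2" => 2 })   # => :concurrent
```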

Topology describes the arrangement of primary and replica nodes. Star topology has one primary and multiple replicas. Chain topology connects nodes sequentially, with each forwarding to the next. Tree topology organizes nodes hierarchically. The topology affects latency, network utilization, and failure behavior.

Checkpoint and Recovery enables replicas to catch up after disconnection without replaying the entire replication log. Replicas periodically snapshot their state and record their log position. After reconnection, they load the snapshot and replay only log entries since that position.
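A minimal sketch of checkpoint-and-replay, assuming log entries shaped like the `ReplicationLog` entries above (`position`, `key`, `value`) and using `Marshal` as the snapshot format. `CheckpointingReplica` is a hypothetical class; real databases snapshot to disk and track far more state.

```ruby
class CheckpointingReplica
  attr_reader :data, :applied_position

  def initialize(data = {}, applied_position = -1)
    @data = data
    @applied_position = applied_position # position of the last applied entry
  end

  def apply(entry)
    @data[entry[:key]] = entry[:value]
    @applied_position = entry[:position]
  end

  # Serialize state together with the log position it reflects
  # (Marshal produces an independent copy of the hash)
  def checkpoint
    Marshal.dump([@data, @applied_position])
  end

  # Load the snapshot, then replay only entries newer than its position
  def self.recover(snapshot, log)
    data, position = Marshal.load(snapshot)
    replica = new(data, position)
    log.each { |entry| replica.apply(entry) if entry[:position] > position }
    replica
  end
end

log = [
  { position: 0, key: "a", value: 1 },
  { position: 1, key: "b", value: 2 },
  { position: 2, key: "a", value: 3 }
]
replica = CheckpointingReplica.new
log.first(2).each { |entry| replica.apply(entry) }
snapshot = replica.checkpoint                   # reflects positions 0..1
recovered = CheckpointingReplica.recover(snapshot, log)
recovered.data["a"]          # => 3, entry 2 was replayed
recovered.applied_position   # => 2
```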

Implementation Approaches

Different replication strategies offer distinct trade-offs between consistency, performance, availability, and operational complexity.

Synchronous Replication writes data to the primary and waits for acknowledgment from one or more replicas before confirming the write to the client. This approach guarantees that committed data exists on multiple nodes, preventing data loss if the primary fails. The primary blocks until replicas respond, adding latency proportional to network round-trip time plus replica processing time.

Synchronous replication often uses quorum-based commits, where a write completes once a majority of replicas acknowledge it. With five replicas, writes succeed after three acknowledgments. This configuration tolerates two simultaneous replica failures while maintaining data durability. If more replicas fail than that, the quorum can no longer be reached and the system becomes unavailable for writes.

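class SynchronousReplicator
  class QuorumNotReached < StandardError; end

  def initialize(primary, replicas, quorum_size, timeout: 5)
    @primary = primary
    @replicas = replicas
    @quorum_size = quorum_size
    @timeout = timeout  # seconds to wait for the quorum
  end

  def write(key, value)
    # Write to primary (this sketch does not roll it back if quorum fails)
    @primary.set(key, value)

    # Send to all replicas in parallel; shared counters are guarded by a mutex
    mutex = Mutex.new
    changed = ConditionVariable.new
    acks = 0
    failures = 0

    @replicas.each do |replica|
      Thread.new do
        replica.set(key, value)
        mutex.synchronize { acks += 1; changed.signal }
      rescue StandardError
        # Replica failed; count it so we stop waiting once quorum is impossible
        mutex.synchronize { failures += 1; changed.signal }
      end
    end

    # Wait until quorum reached, quorum impossible, or timeout
    deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + @timeout
    mutex.synchronize do
      until acks >= @quorum_size || @replicas.size - failures < @quorum_size
        remaining = deadline - Process.clock_gettime(Process::CLOCK_MONOTONIC)
        break if remaining <= 0

        changed.wait(mutex, remaining)
      end

      raise QuorumNotReached, "#{acks}/#{@quorum_size} acknowledgments" if acks < @quorum_size
    end

    :success
  end
end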
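The quorum arithmetic from the five-replica example reduces to a few helpers. These names are illustrative; the value returned by `majority_quorum` is what you would pass as `quorum_size` to a replicator like the one above.

```ruby
# Majority quorum: more than half of the replicas must acknowledge
def majority_quorum(replica_count)
  replica_count / 2 + 1
end

# How many replicas can fail while writes can still reach the quorum
def tolerated_failures(replica_count, quorum_size)
  replica_count - quorum_size
end

# Writes stay available only while enough replicas are healthy
def writes_available?(healthy_replicas, quorum_size)
  healthy_replicas >= quorum_size
end

quorum = majority_quorum(5)    # => 3
tolerated_failures(5, quorum)  # => 2
writes_available?(2, quorum)   # => false, three replicas are down
```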

Asynchronous Replication confirms writes to clients immediately after writing to the primary, then propagates changes to replicas in the background. This approach provides low write latency since clients do not wait for replicas. Replicas lag behind the primary by some amount, meaning recent writes may not appear on replicas immediately.

The risk with asynchronous replication involves data loss during primary failure. If the primary crashes after confirming a write but before replicating it, that write disappears. The acceptable amount of potential data loss depends on the application. Financial systems typically cannot tolerate any loss, while social media applications might accept losing a few seconds of data during failures.
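A minimal asynchronous replicator, assuming nodes that respond to `set` and `get` (the in-memory `MemoryNode` is a stand-in, not a real client). A single background thread drains a queue, so replicas apply changes in the order the primary accepted them. Anything still sitting in the queue when the process dies is exactly the acknowledged-but-unreplicated data described above.

```ruby
# Minimal in-memory node standing in for a real database
class MemoryNode
  def initialize
    @data = {}
  end

  def set(key, value)
    @data[key] = value
  end

  def get(key)
    @data[key]
  end
end

class AsyncReplicator
  def initialize(primary, replicas)
    @primary = primary
    @replicas = replicas
    @queue = Queue.new                    # changes not yet shipped to replicas
    @worker = Thread.new { ship_changes }
  end

  # Acknowledge as soon as the primary has the write; replicas catch up later
  def write(key, value)
    @primary.set(key, value)
    @queue << [key, value]
    :ok
  end

  # Stop after shipping everything already queued
  def shutdown
    @queue << :stop
    @worker.join
  end

  private

  def ship_changes
    loop do
      change = @queue.pop
      break if change == :stop

      key, value = change
      @replicas.each { |replica| replica.set(key, value) }
    end
  end
end

primary = MemoryNode.new
replica = MemoryNode.new
replicator = AsyncReplicator.new(primary, [replica])
replicator.write("user:1", "Alice")  # returns before the replica is updated
replicator.shutdown                  # waits for the queue to drain
replica.get("user:1")                # => "Alice"
```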

Semi-Synchronous Replication combines characteristics of both approaches. Writes wait for at least one replica to acknowledge before confirming to the client, then replicate to remaining replicas asynchronously. This strategy balances durability and performance by ensuring data exists on two nodes while not blocking on all replicas.

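By default, MySQL semi-synchronous replication waits for one replica to acknowledge that it has received the transaction, then returns success; the number of required acknowledgments is configurable. If no replica acknowledges within a configurable timeout, the primary falls back to asynchronous mode to maintain write availability, accepting the increased risk of data loss, and returns to semi-synchronous mode once a replica catches up.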
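The wait-for-one-then-fall-back behavior can be sketched with a condition variable. This is an illustration, not MySQL's implementation: `SemiSyncReplicator` and `DelayedNode` are hypothetical, the fallback is decided per write rather than as a server-wide mode switch, and replicas are reached through threads rather than a replication protocol.

```ruby
# Illustrative node that takes `delay` seconds to apply a write
class DelayedNode
  attr_reader :data

  def initialize(delay)
    @delay = delay
    @data = {}
  end

  def set(key, value)
    sleep @delay
    @data[key] = value
  end
end

class SemiSyncReplicator
  def initialize(primary, replicas, ack_timeout: 1.0)
    @primary = primary
    @replicas = replicas
    @ack_timeout = ack_timeout # seconds to wait for the first acknowledgment
  end

  # Returns :semi_sync if at least one replica acknowledged in time, or
  # :async_fallback if none did (the write is then only on the primary)
  def write(key, value)
    @primary.set(key, value)

    mutex = Mutex.new
    acked = ConditionVariable.new
    ack_count = 0

    # Every replica gets the write; only the first acknowledgment is awaited
    @replicas.each do |replica|
      Thread.new do
        replica.set(key, value)
        mutex.synchronize do
          ack_count += 1
          acked.signal
        end
      rescue StandardError
        nil # a failed replica simply never acknowledges
      end
    end

    deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + @ack_timeout
    mutex.synchronize do
      while ack_count.zero?
        remaining = deadline - Process.clock_gettime(Process::CLOCK_MONOTONIC)
        break if remaining <= 0

        acked.wait(mutex, remaining)
      end
      ack_count.positive? ? :semi_sync : :async_fallback
    end
  end
end

replicator = SemiSyncReplicator.new(
  DelayedNode.new(0), [DelayedNode.new(0.01), DelayedNode.new(0.5)], ack_timeout: 0.2
)
replicator.write("order:42", "paid") # :semi_sync once the fast replica acknowledges
```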

Logical Replication copies data changes at a logical level, replicating operations like "insert row with values X, Y, Z" rather than physical disk changes. This approach enables filtering specific tables, transforming data during replication, and replicating between different database versions or even different database systems.

PostgreSQL logical replication uses publications on the primary, which define the tables to replicate, and subscriptions on replicas, which consume those publications. Each subscription streams changes through a replication slot, which makes the primary retain WAL until the subscriber has confirmed receiving it. Applications can also read a logical replication stream directly to build derived data stores or implement change data capture.
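Filtering and transforming at the logical level can be pictured as applying row-change events to a target store. In this sketch the `Change` struct, the in-memory target (a hash of table name to rows keyed by `id`), and `apply_logical_changes` are all hypothetical; the point is that logical events name tables and columns, so a subscriber can skip tables or drop columns it should not receive.

```ruby
# A logical change event: an operation on one row of a named table
Change = Struct.new(:op, :table, :row)

# Apply changes to target ({ table => { id => row } }), keeping only
# published tables and transforming rows on the way in
def apply_logical_changes(changes, target, published_tables:, transform: ->(_table, row) { row })
  changes.each do |change|
    next unless published_tables.include?(change.table)

    rows = (target[change.table] ||= {})
    case change.op
    when :insert, :update
      rows[change.row[:id]] = transform.call(change.table, change.row)
    when :delete
      rows.delete(change.row[:id])
    end
  end
  target
end

changes = [
  Change.new(:insert, "users", { id: 1, email: "a@example.com", password_digest: "x" }),
  Change.new(:insert, "audit_log", { id: 7, action: "login" }),
  Change.new(:update, "users", { id: 1, email: "b@example.com", password_digest: "y" })
]
# Replicate only "users" and drop the password column
strip_secrets = ->(_table, row) { row.reject { |column, _| column == :password_digest } }
target = apply_logical_changes(changes, {}, published_tables: ["users"], transform: strip_secrets)
target["users"][1][:email] # => "b@example.com"
```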

Physical Replication copies the exact byte-level changes to data files and transaction logs. Physical replicas maintain identical copies of the primary's data structures. This approach replicates everything—all databases, all tables, all indexes—with minimal overhead since it operates below the SQL layer.

PostgreSQL streaming replication ships WAL records from primary to standby servers that replay those records to reconstruct the same physical state. Physical replicas can serve read queries, though they remain in a special recovery mode that continuously applies WAL records.

Ruby Implementation

Ruby applications interact with replicated data stores through database adapters, connection pooling libraries, and replication-aware gems that abstract the complexity of managing primary and replica connections.

Database Connection Management in Ruby typically uses ActiveRecord for Rails applications or Sequel for standalone applications. Both support read/write connection splitting where write queries go to the primary and read queries distribute across replicas.

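# ActiveRecord configuration for primary-replica setup (Rails 6+).
# :primary and :replica name entries in database.yml; the replica entry sets replica: true.
class ApplicationRecord < ActiveRecord::Base
  self.abstract_class = true

  connects_to database: {
    writing: :primary,
    reading: :replica
  }
end

# Without role switching, queries use the writing (primary) connection
User.where(active: true).first  # Reads from primary

# Route reads to the replica explicitly; connected_to must be called on
# ActiveRecord::Base or an abstract class, not on a model like User
ApplicationRecord.connected_to(role: :reading) do
  User.where(active: true).first  # Reads from replica
end

# Force the primary, e.g. inside a request that automatic switching sent to the replica
ApplicationRecord.connected_to(role: :writing) do
  User.where(active: true).first  # Reads from primary
end

# Write operations always use primary
user = User.create(name: "Alice")  # Writes to primary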

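ActiveRecord maps each role to a single database configuration in database.yml and does not itself load-balance reads across several replicas. Spreading reads over multiple replicas is typically handled by a load balancer or DNS entry in front of them, or by a gem such as Makara. For web requests, Rails' automatic role switching middleware (enabled with config.active_record.database_selector) sends GET and HEAD requests to the replica and keeps a session on the primary for a configurable delay after it writes (2 seconds by default). Custom routing can implement weighted distribution, latency-based selection, or geographic affinity.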
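Because replica selection lives outside ActiveRecord, a custom router or proxy has to supply it. The hypothetical `WeightedReplicaPicker` below sketches weighted random selection, with weights you might derive from replica capacity; the injectable `Random` instance makes it testable.

```ruby
class WeightedReplicaPicker
  def initialize(weights, rng: Random.new)
    if weights.empty? || !weights.values.all?(&:positive?)
      raise ArgumentError, "need at least one replica, all with positive weights"
    end

    @weights = weights          # e.g. { "replica-a" => 3, "replica-b" => 1 }
    @total = weights.values.sum
    @rng = rng
  end

  # Pick a replica with probability proportional to its weight
  def pick
    point = @rng.rand(@total)   # integer in 0...@total for integer weights
    @weights.each do |name, weight|
      return name if point < weight

      point -= weight
    end
  end
end

picker = WeightedReplicaPicker.new({ "replica-a" => 3, "replica-b" => 1 }, rng: Random.new(42))
counts = Hash.new(0)
4_000.times { counts[picker.pick] += 1 }
counts # roughly three picks of replica-a for every pick of replica-b
```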

Handling Replication Lag requires application awareness. After writing data, immediately reading it from a replica may return stale results if replication has not completed. Applications use several patterns to manage this:

class UserService
  STICKY_SECONDS = 5  # read from the primary this long after a session writes

  # session_store is a placeholder for any key/value store (e.g. Redis)
  def initialize(session_store)
    @session_store = session_store
  end

  def create_and_fetch(attributes)
    # Write to primary
    user = User.create!(attributes)

    # Force read from primary for consistency
    ApplicationRecord.connected_to(role: :writing) do
      User.find(user.id)
    end
  end

  def create_with_retry(attributes, max_attempts: 10)
    user = User.create!(attributes)

    # Poll the replica until the new row has replicated
    attempts = 0
    begin
      ApplicationRecord.connected_to(role: :reading) { User.find(user.id) }
    rescue ActiveRecord::RecordNotFound
      attempts += 1
      raise if attempts >= max_attempts
      sleep 0.1
      retry
    end
  end

  def create_with_session_stickiness(attributes, session_id)
    user = User.create!(attributes)

    # Record when this session last wrote
    @session_store.set("last_write_#{session_id}", Time.now.to_f)
    user
  end

  # Later reads check whether the session wrote recently
  def find_for_session(id, session_id)
    last_write = @session_store.get("last_write_#{session_id}")&.to_f
    recent = last_write && (Time.now.to_f - last_write) < STICKY_SECONDS

    # Use the primary after a recent write, otherwise the replica
    ApplicationRecord.connected_to(role: recent ? :writing : :reading) do
      User.find(id)
    end
  end
end

Makara provides advanced connection management for Ruby applications with read/write splitting and failover capabilities. It wraps ActiveRecord's connection adapter to intercept queries and route them based on type and current node health.

# Gemfile
gem 'makara'

# database.yml
production:
  adapter: mysql2_makara
  database: myapp_production
  makara:
    connections:
      - role: master
        host: primary.db.example.com
        
      - role: slave
        host: replica1.db.example.com
        
      - role: slave
        host: replica2.db.example.com
    
    master_ttl: 5  # Stick to master for 5 seconds after write
    blacklist_duration: 30  # Blacklist failed nodes for 30 seconds

# Application code remains unchanged
User.where(active: true).first  # Automatically routed

Makara tracks recent writes and, for master_ttl seconds afterward, routes subsequent reads in the same context to the primary. This "stickiness" period prevents reading stale data after writes. When a query to a replica fails with a connection error, Makara blacklists that replica for blacklist_duration seconds and redistributes queries to the remaining healthy replicas.
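The blacklisting idea can be sketched independently of Makara. `BlacklistingPool` and `NoHealthyNodeError` are hypothetical, not Makara's classes; the injectable clock exists only to make the timing testable, and `IOError`/`SystemCallError` stand in for whatever connection errors a real adapter raises.

```ruby
class NoHealthyNodeError < StandardError; end

# Hypothetical pool that skips failed nodes for blacklist_duration seconds
class BlacklistingPool
  def initialize(nodes, blacklist_duration:, clock: -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) })
    @nodes = nodes
    @blacklist_duration = blacklist_duration
    @clock = clock # injectable for tests
    @blacklisted_until = {}
  end

  def blacklist(node)
    @blacklisted_until[node] = @clock.call + @blacklist_duration
  end

  # Nodes never blacklisted, or whose blacklist period has expired
  def available_nodes
    now = @clock.call
    @nodes.reject { |node| @blacklisted_until.fetch(node, now) > now }
  end

  # Run the block on a healthy node; on a connection-style error,
  # blacklist that node and retry on another one
  def with_node
    tried = []
    loop do
      node = (available_nodes - tried).sample
      raise NoHealthyNodeError, "no healthy nodes left" unless node

      tried << node
      begin
        return yield(node)
      rescue IOError, SystemCallError
        blacklist(node)
      end
    end
  end
end

now = 0.0
pool = BlacklistingPool.new(%w[replica1 replica2], blacklist_duration: 30, clock: -> { now })
pool.blacklist("replica1")
pool.available_nodes # => ["replica2"]
now = 31.0
pool.available_nodes # => ["replica1", "replica2"], the blacklist has expired
```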

Redis Replication in Ruby uses the redis-rb gem with Sentinel support for automatic failover. Sentinel monitors Redis instances and promotes replicas to primary when failures occur.

require 'redis'

# Connect through Sentinel for automatic failover
redis = Redis.new(
  url: "redis://mymaster",
  sentinels: [
    { host: "sentinel1.example.com", port: 26379 },
    { host: "sentinel2.example.com", port: 26379 },
    { host: "sentinel3.example.com", port: 26379 }
  ],
  role: :master
)

# Writes go to current primary
redis.set("key", "value")

# Reads can use replicas for scaling
redis_replica = Redis.new(
  url: "redis://mymaster",
  sentinels: [
    { host: "sentinel1.example.com", port: 26379 },
    { host: "sentinel2.example.com", port: 26379 },
    { host: "sentinel3.example.com", port: 26379 }
  ],
  role: :slave
)

redis_replica.get("key")

Kafka Consumer Groups in Ruby enable building replicated data pipelines where multiple consumers process events concurrently. The ruby-kafka gem handles partition assignment and offset management.

require 'kafka'
require 'json'

kafka = Kafka.new(
  seed_brokers: ["kafka1.example.com:9092", "kafka2.example.com:9092"],
  client_id: "user-service"
)

# Consumer group ensures only one consumer processes each partition
consumer = kafka.consumer(group_id: "user-replicator")
consumer.subscribe("user-events")

consumer.each_message do |message|
  event = JSON.parse(message.value)
  
  case event["type"]
  when "user.created"
    # Replicate user creation to another data store
    SearchIndex.create_user(event["data"])
    
  when "user.updated"
    SearchIndex.update_user(event["data"])
    
  when "user.deleted"
    SearchIndex.delete_user(event["data"]["id"])
  end
  
  # Commit offset after successful processing
  consumer.commit_offsets
end

Custom Replication Logic sometimes becomes necessary for application-specific requirements. Building a simple replication system demonstrates the core concepts:

class SimpleReplicator
  def initialize(primary_url, replica_urls)
    # Database.connect is a placeholder for your database client
    @primary = Database.connect(primary_url)
    @replicas = replica_urls.map { |url| Database.connect(url) }
    @replication_log = []
    @replica_positions = Hash.new(0)  # next log index each replica needs
    @log_mutex = Mutex.new            # guards the primary write + log append
    @replication_mutex = Mutex.new    # one replication pass at a time
  end

  def write(table, id, data)
    # Write to primary and record in the log atomically, so log order
    # matches the order writes were applied on the primary
    entry = @log_mutex.synchronize do
      @primary.write(table, id, data)
      record = {
        position: @replication_log.size,
        timestamp: Time.now,
        table: table,
        id: id,
        data: data
      }
      @replication_log << record
      record
    end

    # Replicate asynchronously
    replicate_to_followers

    entry
  end

  def read(table, id, consistency: :eventual)
    case consistency
    when :strong
      # Always read from primary
      @primary.read(table, id)

    when :eventual
      # Read from random replica (may be stale)
      replica = @replicas.sample || @primary
      replica.read(table, id)

    else
      raise ArgumentError, "unknown consistency level: #{consistency.inspect}"
    end
  end

  private

  def replicate_to_followers
    Thread.new do
      # Serialize passes so two threads never apply the same entries twice
      @replication_mutex.synchronize do
        @replicas.each_with_index do |replica, index|
          entries = @log_mutex.synchronize do
            @replication_log[@replica_positions[index]..] || []
          end

          entries.each do |entry|
            replica.write(entry[:table], entry[:id], entry[:data])
            @replica_positions[index] = entry[:position] + 1
          end
        rescue StandardError => e
          # Log error, continue with other replicas; the next pass retries
          warn "Replication to replica #{index} failed: #{e.message}"
        end
      end
    end
  end
end

Design Considerations

Selecting appropriate replication strategies requires analyzing consistency requirements, performance goals, failure tolerance, and operational constraints.

Consistency Requirements dominate replication decisions. Financial transactions demand strong consistency where all nodes always return identical data. Social media timelines tolerate eventual consistency where different users might temporarily see different post counts. Inventory systems need bounded staleness where replica lag stays within acceptable limits.

Strong consistency requires synchronous replication or distributed consensus protocols. These approaches add latency since operations wait for coordination across nodes. Systems using strong consistency often sacrifice write throughput for correctness guarantees. Geographical distribution compounds these costs, as each round trip between distant data centers can add tens to hundreds of milliseconds to an operation.

Eventual consistency enables high write throughput and geographic distribution by allowing replicas to temporarily diverge. Applications must handle scenarios where recently written data does not appear immediately on replicas. Session stickiness routes users to specific replicas to provide consistent views within a session while different users might see different data.

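# Design pattern for choosing consistency level. PrimaryDatabase and
# ReplicaDatabase are hypothetical stand-ins for your connection objects.
class DataAccessStrategy
  def self.for_operation(operation_type)
    case operation_type
    when :financial_transaction
      StrongConsistencyAccess.new

    when :user_profile_read
      EventualConsistencyAccess.new

    when :inventory_check
      # Tolerate up to 10 seconds of replica lag
      BoundedStalenessAccess.new(max_lag: 10, replicas: ReplicaDatabase.replicas)

    else
      raise ArgumentError, "unknown operation type: #{operation_type.inspect}"
    end
  end
end

class StrongConsistencyAccess
  def read(key)
    # Always read from the primary
    PrimaryDatabase.read(key)
  end
end

class EventualConsistencyAccess
  def read(key)
    # Any replica; the result may be stale
    ReplicaDatabase.read(key)
  end
end

class BoundedStalenessAccess
  def initialize(max_lag:, replicas:)
    @max_lag = max_lag
    @replicas = replicas
  end

  def read(key)
    replica = find_replica_within_lag(@max_lag)
    replica ? replica.read(key) : PrimaryDatabase.read(key)
  end

  private

  # Least-lagged replica within the bound, or nil; assumes each replica
  # object reports its current lag via a lag_seconds method
  def find_replica_within_lag(max_lag)
    @replicas.select { |r| r.lag_seconds <= max_lag }.min_by(&:lag_seconds)
  end
end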
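Session-level guarantees such as read-your-writes can also be tracked by log position rather than by a fixed time window: remember the position of the session's last write and read only from replicas that have applied it. The sketch is illustrative; `Node` and `node_for_read` are hypothetical, and in PostgreSQL the positions could be WAL LSNs.

```ruby
# A node and the position of the last replication-log entry it has applied
Node = Struct.new(:name, :applied_position)

# Route a session's read to a replica that has applied the session's own
# last write; fall back to the primary when no replica has caught up
def node_for_read(primary, replicas, last_write_position)
  return replicas.sample || primary if last_write_position.nil? # no writes yet

  caught_up = replicas.select { |r| r.applied_position >= last_write_position }
  caught_up.sample || primary
end

primary = Node.new("primary", 120)
replicas = [Node.new("replica-1", 95), Node.new("replica-2", 118)]
node_for_read(primary, replicas, 110).name # => "replica-2"
node_for_read(primary, replicas, 119).name # => "primary"
```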

Read/Write Ratio determines replication value. Read-heavy workloads with 90% reads and 10% writes gain substantial benefit from replication. Multiple replicas distribute read load, preventing primary overload. Write-heavy workloads gain less benefit since writes still concentrate on the primary and replication overhead increases.
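A back-of-the-envelope model makes the asymmetry concrete. It assumes reads go only to replicas, writes go only to the primary, and every replica replays every write; `per_node_load` is an illustrative helper.

```ruby
# Operations per second each node handles in a primary-replica setup
def per_node_load(total_ops:, read_percent:, replicas:)
  reads = total_ops * read_percent / 100
  writes = total_ops - reads
  {
    primary: writes,                  # all writes land on the primary
    replica_reads: reads / replicas,  # read work is split across replicas
    replica_replayed_writes: writes   # but every replica still applies every write
  }
end

per_node_load(total_ops: 10_000, read_percent: 90, replicas: 3)
# primary: 1000, replica_reads: 3000, replica_replayed_writes: 1000
per_node_load(total_ops: 10_000, read_percent: 30, replicas: 3)
# primary: 7000, replica_reads: 1000, replica_replayed_writes: 7000
```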

Applications with regional user bases benefit from geographic replication regardless of read/write ratio, at least for reads. Placing replicas near users reduces read latency even though replication overhead increases. For example, a European user reading from a European replica might see around 50 ms of latency compared with around 200 ms for a US-based primary; writes still travel to the primary.

Failure Recovery Time affects replication topology decisions. Synchronous replication allows failover without data loss, since the synchronous replicas already hold every committed write; recovery time is then dominated by failure detection and promotion. Asynchronous replication requires deciding how much data loss is acceptable during failover. Semi-synchronous replication provides a middle ground: at least one replica has received every committed write while others may lag.

Automated failover tools monitor primary health and promote replicas when failures occur. Manual failover requires operator intervention, increasing recovery time but reducing the risk of split-brain scenarios, where an automated tool misreads a network partition and multiple nodes come to believe they are primary.
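Two decisions inside an automated failover can be sketched as small functions: agreeing that the primary is really down, and picking the replica to promote. Both helpers are hypothetical; Redis Sentinel applies the first idea by requiring a quorum of sentinels to agree before failing over.

```ruby
# Treat the primary as down only if a strict majority of independent
# observers failed to reach it (true = reachable, false = unreachable)
def primary_down?(observer_results)
  failures = observer_results.count(false)
  failures > observer_results.size / 2
end

# Promote the reachable replica that has received the most of the log;
# returns nil if no replica is reachable
def promotion_candidate(replicas)
  replicas.select { |r| r[:reachable] }.max_by { |r| r[:received_position] }
end

primary_down?([false, false, true]) # => true, two of three observers agree
primary_down?([false, true, true])  # => false, likely one observer's network problem

replicas = [
  { name: "replica-1", reachable: true,  received_position: 9_000 },
  { name: "replica-2", reachable: false, received_position: 9_500 },
  { name: "replica-3", reachable: true,  received_position: 9_400 }
]
promotion_candidate(replicas)[:name] # => "replica-3"
```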

Network Topology influences replication performance. Data centers connected by dedicated low-latency links support synchronous replication across sites. Geographically distributed sites over public internet require asynchronous replication due to variable latency and occasional connectivity issues.

Chain replication sends data from primary to first replica, first replica to second replica, and so on. This topology reduces primary network bandwidth but increases end-to-end replication latency. Star topology where the primary sends directly to all replicas provides lower latency but requires more primary bandwidth.
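A deliberately simplified cost model shows the trade-off. It assumes equal latency on every hop, ignores replica processing time, and treats the replication stream as a constant `write_mb_per_s`; `topology_cost` is an illustrative helper.

```ruby
# Primary egress bandwidth and time for a change to reach the last replica
def topology_cost(topology, replicas:, write_mb_per_s:, hop_latency_ms:)
  case topology
  when :star
    # Primary sends the stream to every replica directly: one hop each
    { primary_egress_mb_per_s: write_mb_per_s * replicas,
      last_replica_latency_ms: hop_latency_ms }
  when :chain
    # Primary sends once; each replica forwards to the next
    { primary_egress_mb_per_s: write_mb_per_s,
      last_replica_latency_ms: hop_latency_ms * replicas }
  else
    raise ArgumentError, "unknown topology: #{topology.inspect}"
  end
end

topology_cost(:star, replicas: 4, write_mb_per_s: 10, hop_latency_ms: 5)
# primary sends 40 MB/s; the last replica is 5 ms behind
topology_cost(:chain, replicas: 4, write_mb_per_s: 10, hop_latency_ms: 5)
# primary sends 10 MB/s; the last replica is 20 ms behind
```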

Operational Complexity increases with replication. Monitoring lag across replicas, handling failover scenarios, resolving conflicts in multi-master setups, and maintaining backup schedules across multiple nodes require sophisticated operational practices. Managed database services handle much of this complexity but limit configuration options.

Performance Considerations

Replication impacts system performance through multiple mechanisms that affect throughput, latency, and resource utilization.

Write Amplification occurs when each write to the primary generates writes to multiple replicas. A system with five replicas performs six total writes for each logical operation. Network bandwidth, disk I/O, and CPU all scale with replica count. Asynchronous replication decouples this cost from client-perceived latency, but synchronous replication adds latency: waiting for every replica ties commit time to the slowest one, while a quorum commit waits only until the k-th fastest replica acknowledges.
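Under the simplifying assumption that each replica has a fixed acknowledgment latency, commit latency is the k-th smallest of those latencies, where k is the number of acknowledgments the primary waits for. `commit_latency_ms` is an illustrative helper.

```ruby
# Commit latency when the primary waits for `acks_needed` replica
# acknowledgments: the acks_needed-th fastest replica sets the pace
def commit_latency_ms(replica_latencies_ms, acks_needed)
  return 0 if acks_needed.zero? # asynchronous: no replica wait
  raise ArgumentError, "not enough replicas" if acks_needed > replica_latencies_ms.size

  replica_latencies_ms.sort[acks_needed - 1]
end

latencies = [5, 80, 12, 7, 30]  # per-replica round trip plus apply time
commit_latency_ms(latencies, 5) # => 80, wait for all: the slowest replica
commit_latency_ms(latencies, 3) # => 12, majority quorum
commit_latency_ms(latencies, 1) # => 5, semi-synchronous
```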

Batch replication groups multiple changes before sending to replicas, reducing network overhead and per-operation costs. Instead of sending 100 individual changes, the primary batches them into one network message containing all changes. This optimization trades increased replication lag for better throughput.

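class BatchReplicator
  def initialize(primary, replicas, batch_size: 100, batch_timeout: 1.0)
    @primary = primary
    @replicas = replicas
    @batch_size = batch_size
    @batch_timeout = batch_timeout
    @pending_changes = []
    @last_flush = Time.now
    @mutex = Mutex.new   # guards @pending_changes and @last_flush
    @outbox = Queue.new  # flushed batches, shipped in order by one thread

    start_sender
    start_flush_timer
  end

  def write(key, value)
    @primary.set(key, value)

    @mutex.synchronize do
      @pending_changes << { key: key, value: value, timestamp: Time.now }
      flush_if_needed
    end
  end

  private

  # Caller must hold @mutex
  def flush_if_needed
    return if @pending_changes.empty?

    should_flush = @pending_changes.size >= @batch_size ||
                   (Time.now - @last_flush) >= @batch_timeout

    flush if should_flush
  end

  # Caller must hold @mutex; hands the batch to the sender thread
  def flush
    @outbox << @pending_changes
    @pending_changes = []
    @last_flush = Time.now
  end

  # A single sender thread preserves batch order on every replica
  def start_sender
    Thread.new do
      loop do
        batch = @outbox.pop
        @replicas.each do |replica|
          replica.batch_set(batch)
        rescue StandardError => e
          warn "Batch replication failed: #{e.message}"  # real code would retry
        end
      end
    end
  end

  def start_flush_timer
    Thread.new do
      loop do
        sleep @batch_timeout
        @mutex.synchronize { flush_if_needed }
      end
    end
  end
end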

Read Performance improves with replication as queries distribute across multiple nodes. If reads are spread evenly across the primary and three replicas, each node handles one-fourth of read traffic; if reads go only to the three replicas, each handles one-third. Geographic distribution further improves read latency by routing requests to nearby replicas.

Connection pooling maximizes replica utilization by maintaining persistent connections to all replicas. Each application server maintains pools to both primary and replica databases, quickly switching between them as queries execute.

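Replication Lag Measurement enables performance monitoring and troubleshooting. PostgreSQL provides the pg_stat_replication view, which reports each standby's sent, write, flush, and replay WAL positions (byte lag can be computed from them with pg_wal_lsn_diff) and, since PostgreSQL 10, time-based write_lag, flush_lag, and replay_lag columns. MySQL reports Seconds_Behind_Master through SHOW SLAVE STATUS (SHOW REPLICA STATUS and Seconds_Behind_Source in 8.0.22 and later). Applications should monitor these metrics and alert when lag exceeds acceptable thresholds.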

require 'pg'

class ReplicationMonitor
  def initialize(primary_connection)
    @primary = primary_connection  # a PG::Connection to the primary
  end

  def check_replica_lag
    # PostgreSQL 10+ example: byte lag at each stage, per standby.
    # Aliases end in _bytes to avoid clashing with the view's own
    # time-based write_lag/flush_lag/replay_lag columns.
    results = @primary.exec(<<~SQL)
      SELECT
        application_name,
        client_addr,
        state,
        pg_wal_lsn_diff(pg_current_wal_lsn(), sent_lsn) AS sent_lag_bytes,
        pg_wal_lsn_diff(pg_current_wal_lsn(), write_lsn) AS write_lag_bytes,
        pg_wal_lsn_diff(pg_current_wal_lsn(), flush_lsn) AS flush_lag_bytes,
        pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag_bytes
      FROM pg_stat_replication
    SQL

    results.map do |row|
      {
        name: row['application_name'],
        address: row['client_addr'],
        state: row['state'],
        sent_lag_bytes: row['sent_lag_bytes'].to_i,
        write_lag_bytes: row['write_lag_bytes'].to_i,
        flush_lag_bytes: row['flush_lag_bytes'].to_i,
        replay_lag_bytes: row['replay_lag_bytes'].to_i
      }
    end
  end

  def alert_if_lagging(max_lag_bytes: 10_000_000)
    lagging_replicas = check_replica_lag.select do |replica|
      replica[:replay_lag_bytes] > max_lag_bytes
    end
    return if lagging_replicas.empty?

    # AlertService is a placeholder for your alerting integration
    # (a method named `send` would shadow Object#send)
    AlertService.notify(
      severity: :warning,
      message: "Replicas lagging behind primary",
      details: lagging_replicas
    )
  end
end
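`pg_wal_lsn_diff` runs on the server, but the same byte lag can be computed client-side, for example from `pg_current_wal_lsn()` read on the primary and `pg_last_wal_replay_lsn()` read on a replica. An LSN prints as the high and low 32-bit halves of a 64-bit WAL position in hexadecimal, separated by a slash; the helper names below are illustrative.

```ruby
# Convert a PostgreSQL LSN such as "16/B374D848" to a 64-bit integer
def lsn_to_i(lsn)
  high, low = lsn.split("/").map { |part| Integer(part, 16) }
  (high << 32) | low
end

# Bytes between two LSNs, like pg_wal_lsn_diff(a, b) but computed in Ruby
def lsn_diff_bytes(a, b)
  lsn_to_i(a) - lsn_to_i(b)
end

lsn_diff_bytes("16/B374D848", "16/B3740000") # => 55368
lsn_diff_bytes("1/0", "0/FFFFFFFF")          # => 1, crossing the 32-bit boundary
```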

Index Replication affects replica disk space and maintenance overhead. A physical replica maintains the same indexes as the primary, so every replica adds another full copy of both tables and indexes. Applications can configure replicas with different indexes optimized for specific query patterns, though this requires logical replication, which operates at the SQL level, rather than physical replication.

Network Bandwidth limits replication throughput in geographically distributed systems. A database generating 100 MB/s of write traffic requires 100 MB/s bandwidth to each replica. Compression reduces bandwidth requirements but adds CPU overhead. Incremental or delta replication sends only changed bytes rather than entire rows, reducing bandwidth for updates to large objects.
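Replication streams are often highly repetitive, which is why compression helps. The sketch uses Ruby's standard `zlib` and `json` libraries on a synthetic batch of change records; real ratios depend entirely on the data, and both deflating and inflating cost CPU.

```ruby
require 'json'
require 'zlib'

# A synthetic batch of similar change records, as a replication stream often contains
changes = Array.new(1_000) do |i|
  { op: "update", table: "users", id: i, status: "active" }
end
payload = JSON.generate(changes)

compressed = Zlib::Deflate.deflate(payload)  # CPU cost on the sender
restored = Zlib::Inflate.inflate(compressed) # CPU cost on the receiver

puts "raw: #{payload.bytesize} bytes, compressed: #{compressed.bytesize} bytes"
puts "ratio: #{payload.bytesize.fdiv(compressed.bytesize).round(1)}x"
```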

Failover Impact creates performance spikes during transitions. When a primary fails and a replica promotes to primary, that replica suddenly handles both read and write traffic. If the previous primary handled 30% of total traffic and the replica handled 30%, the promoted replica now handles 60% during failover. Over-provisioning replicas accounts for this scenario.

Tools & Ecosystem

The Ruby ecosystem provides various tools and libraries for implementing and managing replicated data systems.

ActiveRecord offers built-in support for read/write splitting across primary and replica databases. Configuration defines multiple database connections, and ActiveRecord routes queries based on operation type.

Makara extends ActiveRecord with sophisticated connection pooling, automatic failover, and health checking. It intercepts database queries to route them intelligently across healthy nodes.

Sheepdog provides automatic PostgreSQL failover orchestration for Ruby applications. It monitors primary health using health checks, triggers failover to standby servers, and updates application configuration.

The pg gem can connect using PostgreSQL's replication protocol, which enables building custom replication solutions on top of logical replication. Applications consume the replication stream to build derived data stores, implement change data capture, or synchronize data to external systems.

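require 'pg'

# Open a replication-protocol connection; the role needs the REPLICATION
# attribute, and my_publication must already exist (CREATE PUBLICATION ...)
conn = PG.connect(
  host: 'primary.example.com',
  dbname: 'myapp_production',
  replication: 'database'
)

# Create replication slot (fails if the slot already exists)
conn.exec("CREATE_REPLICATION_SLOT my_slot LOGICAL pgoutput")

# Start replication from slot
conn.exec(
  "START_REPLICATION SLOT my_slot LOGICAL 0/0 (proto_version '1', publication_names 'my_publication')"
)

loop do
  msg = conn.get_copy_data  # blocks until the server sends CopyData
  break unless msg          # nil: the server ended the stream

  case msg[0]
  when 'k'  # Primary keepalive; msg.getbyte(17) == 1 means a reply is requested
    # Send a standby status update ('r' message) with conn.put_copy_data so the
    # server does not time out the connection and can advance the slot

  when 'w'  # XLogData: 'w' + start LSN, end LSN, send time (8 bytes each)
    payload = msg.byteslice(25..)  # the pgoutput message itself

    # Parse and process replication messages
    case payload[0]
    when 'B'  # Begin transaction
      # Handle transaction start

    when 'C'  # Commit transaction
      # Handle transaction commit

    when 'R'  # Relation: column metadata needed to decode row messages
      # Cache table and column definitions

    when 'I'  # Insert
      # Handle insert operation

    when 'U'  # Update
      # Handle update operation

    when 'D'  # Delete
      # Handle delete operation
    end
  end
end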

ruby-kafka implements Kafka client functionality including consumer groups that replicate event processing across multiple instances. The library handles partition assignment, offset management, and rebalancing.

redis-rb with Sentinel support provides Redis replication and automatic failover. Applications configure Sentinel addresses, and the client automatically discovers current primary and replica locations.

etcd-client enables building distributed configuration systems with built-in replication. Etcd uses the Raft consensus protocol to replicate configuration data across cluster members with strong consistency.

Consul-Ruby interfaces with HashiCorp Consul for service discovery and distributed configuration backed by replicated key-value storage. Services register themselves, and clients query for available instances, automatically adapting to changes.

Database-Specific Tools include:

PostgreSQL tools:

  • pg_basebackup for creating replica base backups
  • pg_receivewal for archiving WAL segments
  • repmgr for replication management and failover

MySQL tools:

  • mysqldump with replication options
  • xtrabackup for hot backups
  • mysql-utilities for replication management

Monitoring Tools track replication health and performance:

  • Prometheus with database exporters for metrics collection
  • Grafana for replication lag visualization
  • PgHero for PostgreSQL monitoring including replication status
  • VividCortex for MySQL replication monitoring

Cloud Provider Services offer managed replication:

  • Amazon RDS provides automated replication setup and failover
  • Google Cloud SQL handles replication configuration and monitoring
  • Azure Database offers automatic replica provisioning
  • DigitalOcean Managed Databases include built-in replication

These managed services reduce operational burden but limit configuration options. Applications cannot customize replication timing, conflict resolution, or failover policies beyond provider-supported options.

Reference

Replication Types

| Type | Write Latency | Data Loss Risk | Use Case |
|---|---|---|---|
| Synchronous | High - waits for replicas | None - replicas confirmed | Financial transactions, critical data |
| Asynchronous | Low - immediate return | Possible - unconfirmed changes | High throughput systems, eventual consistency |
| Semi-synchronous | Medium - waits for one replica | Minimal - one replica confirmed | Balanced durability and performance |
| Logical | Medium - SQL-level overhead | Depends on mode | Cross-version, filtered replication |
| Physical | Low - binary copy | Depends on mode | Same-version full database replication |

Consistency Models

| Model | Guarantee | Latency | Complexity |
|---|---|---|---|
| Strong | Every read sees the latest committed write | High | High - requires coordination |
| Eventual | Converges over time | Low | Medium - conflict resolution |
| Causal | Preserves cause-effect | Medium | High - tracks dependencies |
| Bounded staleness | Lag within threshold | Medium | Medium - monitors lag |
| Read-your-writes | Own writes visible | Low | Low - session tracking |
| Monotonic reads | Never see older data after newer data | Low | Low - session affinity |

Replication Topologies

| Topology | Pros | Cons | Failure Behavior |
|---|---|---|---|
| Primary-Replica | Simple, clear write path | Single write bottleneck | Manual/auto promote replica |
| Multi-Master | Multiple write locations | Conflict resolution needed | Complex - may split brain |
| Chain | Reduces primary bandwidth | Higher end-to-end latency | Break in chain disrupts downstream |
| Tree | Hierarchical organization | Complex failover | Parent failure affects children |

ActiveRecord Replication Configuration

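| Setting | Purpose | Example |
|---|---|---|
| connects_to | Define primary and replica roles | connects_to database: { writing: :primary, reading: :replica } |
| connected_to | Force a specific role for a block | ActiveRecord::Base.connected_to(role: :writing) { query } |
| config.active_record.database_selector | Enable automatic role switching; delay after a write before reads return to the replica | { delay: 2.seconds } |
| config.active_record.database_resolver | Resolver class used by the middleware | ActiveRecord::Middleware::DatabaseSelector::Resolver |
| config.active_record.database_resolver_context | Where the last-write timestamp is stored | ActiveRecord::Middleware::DatabaseSelector::Resolver::Session |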

PostgreSQL Replication Commands

| Command | Purpose |
|---|---|
| pg_basebackup -h primary -D /var/lib/postgresql/data -P | Create replica base backup |
| pg_receivewal -h primary -D /var/lib/postgresql/wal_archive | Archive WAL segments |
| SELECT pg_current_wal_lsn() | Get current WAL position |
| SELECT pg_last_wal_receive_lsn() | Get last received WAL position on replica |
| SELECT pg_last_wal_replay_lsn() | Get last replayed WAL position on replica |

MySQL Replication Commands

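| Command | Purpose |
|---|---|
| SHOW MASTER STATUS | Display primary binary log position |
| SHOW SLAVE STATUS | Display replica replication status (SHOW REPLICA STATUS in 8.0.22+) |
| CHANGE MASTER TO MASTER_HOST='host', MASTER_LOG_FILE='file', MASTER_LOG_POS=position | Configure replication (CHANGE REPLICATION SOURCE TO in 8.0.23+) |
| START SLAVE | Begin replication process (START REPLICA in 8.0.22+) |
| STOP SLAVE | Stop replication process (STOP REPLICA in 8.0.22+) |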

Replication Lag Monitoring

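| Database | Query/Method | Metric |
|---|---|---|
| PostgreSQL | pg_stat_replication view (on the primary) | replay_lag interval (PostgreSQL 10+); byte lag via pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) |
| MySQL | SHOW SLAVE STATUS (SHOW REPLICA STATUS in 8.0.22+) | Seconds_Behind_Master (Seconds_Behind_Source) |
| Redis | INFO replication | Difference between master_repl_offset and each replica's offset |
| MongoDB | rs.printSecondaryReplicationInfo() (rs.printSlaveReplicationInfo() before 4.4) | syncedTo timestamp difference |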

Common Replication Gems

| Gem | Purpose | Key Features |
|---|---|---|
| makara | Read/write splitting | Automatic failover, health checks, stickiness |
| activerecord-turntable | Sharding and replication | Shard key routing, replica distribution |
| octoshark | Multi-tenant replication | Per-tenant connection routing |
| switchman | Sharding support | Shard management, routing |

Failover Checklist

| Step | Action | Validation |
|---|---|---|
| 1 | Verify primary is truly down | Multiple connection attempts from different sources |
| 2 | Select replica for promotion | Choose most up-to-date replica |
| 3 | Promote replica to primary | Execute promotion command |
| 4 | Update DNS or connection strings | Verify new primary resolvable |
| 5 | Redirect application connections | Test write operations succeed |
| 6 | Reconfigure remaining replicas | Point to new primary |
| 7 | Monitor replication lag | Ensure replicas catching up |
| 8 | Document incident | Record timeline and decisions |