CrackedRuby - Database Replication

Overview

Database replication creates and maintains duplicate copies of data across multiple database servers. Each copy, called a replica, receives updates from a source database to maintain consistency. Applications can distribute read operations across replicas to handle increased load, while write operations typically target a primary server.

Replication serves three primary purposes: fault tolerance through redundancy, horizontal scaling of read operations, and geographic distribution of data closer to users. When a primary database fails, a replica can assume responsibility for serving requests. When read traffic exceeds a single server's capacity, replicas distribute the load. When users access data from multiple continents, local replicas reduce network latency.

The concept originated in distributed systems research during the 1970s and became commercially available in database products during the 1980s. Modern databases implement replication through various mechanisms, from simple binary log shipping to sophisticated consensus protocols.

Primary Database (writes)
    |
    ├─> Replica 1 (reads)
    ├─> Replica 2 (reads)
    └─> Replica 3 (reads)

Replication introduces complexity through synchronization challenges. Data exists in multiple locations simultaneously, creating opportunities for inconsistency. Network partitions can prevent replicas from receiving updates. Different replicas may serve different versions of data temporarily. Applications must account for these scenarios when reading from replicas.

Key Principles

Database replication operates through continuous data transfer from source to destination databases. The source database, called the primary or master, records all data modifications. These modifications propagate to replica databases, also called secondary or slave databases, which apply the changes to maintain synchronized state.

Replication mechanisms fall into two categories: statement-based and row-based. Statement-based replication transmits SQL statements that modify data. Each replica executes these statements against its local copy. Row-based replication transmits the actual data changes as binary differences. Each approach has distinct performance and correctness characteristics.

Replication Lag measures the time delay between a write operation completing on the primary and becoming visible on replicas. Zero lag means replicas reflect all committed transactions. Positive lag means replicas lag behind the primary by some time interval. Lag varies based on network conditions, replica hardware, and transaction volume. Applications reading from replicas may observe stale data during lag periods.

Consistency Models define what guarantees applications receive when reading from replicas:

Strong consistency ensures reads always return the most recent write. This requires coordination between primary and replicas, adding latency to operations. Reads must either target the primary or wait for replica synchronization before returning data.

Eventual consistency allows replicas to lag behind the primary temporarily. Reads may return stale data, but all replicas converge to the same state given sufficient time without new writes. This model enables lower latency and higher availability at the cost of temporary inconsistency.

Read-your-writes consistency guarantees that after a client writes data, subsequent reads by that client reflect the write. This prevents users from writing data and immediately seeing old values. Implementation requires routing a client's reads to the primary or tracking which replicas have received each client's writes.

Replication Topologies describe how data flows between database instances:

Single-leader replication designates one database as the primary write target. All writes go to this leader, which propagates changes to follower replicas. Followers serve read traffic only. This topology provides straightforward conflict avoidance since one source produces all writes.

Multi-leader replication allows multiple databases to accept writes simultaneously. Each leader replicates changes to other leaders and to follower replicas. This topology enables lower write latency by having leaders in multiple geographic regions, but introduces conflict resolution complexity when concurrent writes modify the same data.

Leaderless replication distributes writes across all replicas without designating any as primary. Clients send writes to multiple replicas and consider the write successful when a quorum acknowledges it. Read operations query multiple replicas and merge results. This topology provides high availability and partition tolerance but complicates consistency guarantees.

Synchronous vs Asynchronous Replication determines when a write operation completes:

Synchronous replication waits for at least one replica to acknowledge receiving the write before reporting success to the client. This guarantees that data persists in multiple locations before the write completes, preventing data loss if the primary fails immediately after. The synchronous guarantee adds latency to write operations since the primary must wait for network round trips.

Asynchronous replication reports write success immediately after the primary commits locally, without waiting for replica acknowledgment. This provides lower write latency but risks data loss if the primary fails before changes reach replicas. Most production systems use asynchronous replication to maintain write performance.

Semi-synchronous replication waits for acknowledgment from at least one replica but not all replicas. This balances durability and performance by ensuring data exists in multiple locations while not waiting for slower replicas.

Design Considerations

Selecting a replication strategy requires analyzing application requirements for consistency, availability, latency, and geographic distribution. Each choice involves trade-offs that affect system behavior during normal operations and failure scenarios.

Single-Leader vs Multi-Leader selection depends on write patterns and geographic distribution. Single-leader replication suits applications where all writes can route through one location. This includes most web applications where users submit writes to a central service regardless of their location. The single leader prevents write conflicts by serializing all modifications.

Multi-leader replication suits applications requiring local write latency in multiple regions. Users in Asia and Europe both need fast write response times. Placing leaders in both regions allows writes to complete quickly against a nearby leader. The trade-off involves conflict resolution when users in different regions modify the same data concurrently.

Consider an application where users edit documents collaboratively. With single-leader replication, an editor in Tokyo must send every keystroke across the Pacific to a leader in California, incurring 100ms+ latency per operation. With multi-leader replication, Tokyo edits commit locally in 1-2ms, then asynchronously synchronize with California. Conflicts occur when two users edit the same document section simultaneously, requiring merge strategies.

Consistency Requirements shape how applications interact with replicas. Applications needing strong consistency must either read from the primary exclusively or implement read-your-writes semantics by tracking replication positions.

An e-commerce application updates inventory when users add items to carts. If the application writes inventory changes to the primary then immediately reads from a lagging replica, it may show incorrect available quantities. Solutions include:

Reading inventory from the primary database for the session that modified it. This ensures the modifying client sees its writes but adds load to the primary.

Tracking the replication position of each write and waiting for replicas to catch up before reading. The application records the binary log position of each write, then polls replicas until they reach that position.

Using session stickiness to route all reads from a session to one replica. This prevents seeing writes disappear and reappear as requests hit different replicas with different lag.

Accepting eventual consistency and designing the application to handle stale reads gracefully. Showing slightly outdated inventory is acceptable in many cases, with overselling handled through reservation systems.

Failover Strategy determines how systems respond when the primary database becomes unavailable. Manual failover requires human operators to promote a replica to primary status, update application configuration to point at the new primary, and restart applications. This approach gives operators control but increases downtime.

Automatic failover detects primary failures through health checks and promotes a replica without human intervention. The system must determine which replica to promote, ensuring it has received the most recent data. Applications must handle connection failures by reconnecting to the new primary. Split-brain scenarios occur when the primary becomes temporarily unreachable but continues operating, creating two primaries accepting writes. Fencing mechanisms prevent the old primary from accepting writes after failover.

# Application-level failover handling
def with_primary_failover
  retries = 0
  begin
    yield
  rescue DatabaseConnectionError => e
    retries += 1
    if retries < 3
      discover_new_primary
      retry
    else
      raise
    end
  end
end

Read Scaling Requirements determine how many replicas to maintain and how to route reads. Applications with read-heavy workloads benefit from multiple replicas distributed behind load balancers. Each replica handles a portion of read traffic, scaling read capacity horizontally.

Read routing strategies include:

Random selection distributes reads evenly across available replicas. Simple to implement but may route reads to lagging replicas.

Least-lag selection routes reads to replicas with minimal replication lag. Requires monitoring lag on each replica and routing decisions based on current lag values.

Geographic routing directs reads to replicas physically closest to the requesting client. Reduces latency but may increase lag-related inconsistencies if distant replicas lag more.

Application-level routing allows applications to choose replicas based on consistency requirements. Critical reads target the primary while less-critical reads use replicas.

Replication Lag Tolerance affects application behavior and user experience. Some applications tolerate minutes of lag, while others require seconds. Social media timeline feeds tolerate 10-30 second lag since users don't expect instant visibility of new posts. Financial applications require sub-second lag to show accurate account balances.

Lag monitoring and alerting detect when replication falls behind acceptable thresholds. Applications may redirect reads to the primary when lag exceeds limits, sacrificing read scaling for consistency. Some systems maintain multiple replica tiers with different lag characteristics, routing reads based on required freshness.

Implementation Approaches

Database replication implementations vary across database systems but follow common architectural patterns. Understanding these patterns helps select appropriate technologies and design applications that work correctly with replication.

Binary Log Replication records all data modifications in a sequential log file on the primary database. PostgreSQL calls this the Write-Ahead Log (WAL), MySQL calls it the binary log, and MongoDB calls it the oplog. The primary appends every transaction to this log as it commits. Replicas connect to the primary, request log entries starting from their current position, and apply these entries to their local database.

Log-based replication provides durability by persisting changes before confirming transactions. If the primary crashes after confirming a transaction, the transaction data exists in the log for recovery. Replicas read from the log to stay synchronized, potentially replaying from any historical position.

Physical replication transmits actual disk block changes from the log. This approach replicates exactly what the primary writes to disk, including index updates and storage engine internal structures. Physical replicas must run the same database version and configuration as the primary since they replicate internal data structures.

Logical replication transmits higher-level change descriptions like "insert row with values X, Y, Z into table T". This allows replicas to run different database versions or selectively replicate subsets of tables. Logical replication adds overhead since the primary must decode disk-level changes back into logical operations.

Statement-Based Replication transmits SQL statements from the primary to replicas. Each replica executes these statements against its local database. This approach minimizes network bandwidth since statements are typically smaller than the data they affect. A single UPDATE statement modifying millions of rows transmits as a small statement rather than millions of row changes.

Statement-based replication fails for non-deterministic statements. INSERT INTO logs (timestamp) VALUES (NOW()) produces different timestamps on each replica. UPDATE inventory SET quantity = quantity - FLOOR(RAND() * 10) generates different random numbers on each replica. Databases handle these cases by either logging the computed values alongside statements or switching to row-based replication for non-deterministic operations.

Trigger-Based Replication uses database triggers to capture changes and route them to replicas. Each table being replicated has triggers on INSERT, UPDATE, and DELETE operations. These triggers write change records to a queue or log table. A separate replication process reads from this queue and applies changes to replicas.

This approach provides flexibility since triggers can transform data, filter which changes replicate, or route changes to different destinations. The overhead of executing triggers on every modification limits scalability. Databases optimized for high transaction rates perform poorly with trigger-based replication.

Streaming Replication establishes persistent connections between primary and replicas. The primary streams change records to replicas as transactions commit rather than waiting for replicas to poll for changes. This reduces replication lag since replicas receive changes immediately.

PostgreSQL streaming replication maintains open connections from each replica to the primary. The primary streams WAL records to connected replicas in real-time. Replicas apply these records immediately, maintaining sub-second replication lag under normal conditions.

Multi-Version Concurrency Control (MVCC) Integration affects replication behavior. Databases using MVCC maintain multiple versions of rows to support concurrent transactions. Replication must transmit both old and new row versions or sufficient information to reconstruct version history on replicas.

PostgreSQL MVCC creates new row versions for updates rather than modifying rows in place. Replication transmits the new row version including transaction IDs and visibility information. Replicas apply these versions and maintain their own MVCC state for concurrent read transactions.

Chain Replication arranges replicas in a sequence where each replica replicates to the next. The primary replicates to replica A, which replicates to replica B, which replicates to replica C. This reduces load on the primary since it only transmits changes to one replica. Later replicas have higher lag since changes must propagate through the chain.

Chain replication suits scenarios with many replicas where primary network bandwidth becomes a bottleneck. A primary can support 2-3 direct replicas before network saturation, but using chain replication scales to dozens of replicas at the cost of increased lag for downstream replicas.

Ruby Implementation

Ruby applications interact with database replication through ActiveRecord and database adapter configurations. The application doesn't implement replication itself but must account for replication characteristics when reading and writing data.

Multiple Database Connections configure read replicas in Rails applications. Rails 6.0+ supports multiple database connections through config/database.yml:

# config/database.yml
production:
  primary:
    adapter: postgresql
    host: primary.database.example.com
    database: myapp_production
  
  replica:
    adapter: postgresql
    host: replica.database.example.com
    database: myapp_production
    replica: true

ActiveRecord establishes connections to both databases. The application specifies which connection to use for each operation:

# Read from replica
users = ActiveRecord::Base.connected_to(role: :reading) do
  User.where(active: true).to_a
end

# Write to primary
ActiveRecord::Base.connected_to(role: :writing) do
  User.create(name: "Alice", email: "alice@example.com")
end

Automatic Connection Switching routes reads to replicas and writes to the primary based on HTTP verb:

# config/application.rb
class Application < Rails::Application
  config.active_record.database_selector = { delay: 2.seconds }
  config.active_record.database_resolver = ActiveRecord::Middleware::DatabaseSelector::Resolver
  config.active_record.database_resolver_context = ActiveRecord::Middleware::DatabaseSelector::Resolver::Session
end

Rails automatically connects to the primary for POST, PUT, PATCH, and DELETE requests, and to replicas for GET requests. The delay setting prevents reading from replicas immediately after writes, accounting for replication lag.

Session-Based Primary Stickiness ensures clients see their own writes by remembering recent write timestamps:

# After a write operation
class ApplicationController < ActionController::Base
  def create
    @post = Post.create(post_params)
    ActiveRecord::Base.connected_to(role: :writing) do
      mark_as_recently_written
    end
    redirect_to @post
  end
  
  private
  
  def mark_as_recently_written
    session[:last_write_timestamp] = Time.current
  end
end

# Middleware routes to primary if recent writes occurred
class PrimaryAfterWriteMiddleware
  def initialize(app)
    @app = app
  end
  
  def call(env)
    request = ActionDispatch::Request.new(env)
    last_write = request.session[:last_write_timestamp]
    
    if last_write && last_write > 5.seconds.ago
      ActiveRecord::Base.connected_to(role: :writing) do
        @app.call(env)
      end
    else
      @app.call(env)
    end
  end
end

Replica Lag Detection queries replication status to avoid reading from significantly lagging replicas:

class ReplicaHealthCheck
  def self.lag_seconds
    result = ActiveRecord::Base.connected_to(role: :reading) do
      ActiveRecord::Base.connection.execute(
        "SELECT EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp()))"
      )
    end
    result[0]["extract"].to_f
  end
  
  def self.healthy?(max_lag_seconds: 10)
    lag_seconds < max_lag_seconds
  rescue => e
    Rails.logger.error("Replica health check failed: #{e.message}")
    false
  end
end

# Route to primary if replica is lagging
def fetch_inventory(product_id)
  if ReplicaHealthCheck.healthy?
    ActiveRecord::Base.connected_to(role: :reading) do
      Inventory.find_by(product_id: product_id)
    end
  else
    ActiveRecord::Base.connected_to(role: :writing) do
      Inventory.find_by(product_id: product_id)
    end
  end
end

Transaction-Aware Routing ensures transactions execute entirely against the primary:

class Order < ApplicationRecord
  def self.create_with_inventory_check(params)
    ActiveRecord::Base.transaction do
      # This transaction must use the primary even for reads
      ActiveRecord::Base.connected_to(role: :writing) do
        inventory = Inventory.lock.find_by(product_id: params[:product_id])
        
        raise InsufficientInventory if inventory.quantity < params[:quantity]
        
        order = create!(params)
        inventory.update!(quantity: inventory.quantity - params[:quantity])
        
        order
      end
    end
  end
end

Connection Pool Management configures separate pools for primary and replica connections:

# config/database.yml
production:
  primary:
    adapter: postgresql
    pool: 5
    host: primary.database.example.com
  
  replica:
    adapter: postgresql
    pool: 15  # More connections since most traffic is reads
    host: replica.database.example.com
    replica: true

Background Job Considerations must account for replication lag when scheduling jobs after writes:

class PublishArticleJob < ApplicationJob
  def perform(article_id)
    # Wait for replication to catch up
    sleep 2
    
    # Or verify the article exists on the replica
    article = ActiveRecord::Base.connected_to(role: :reading) do
      Article.find_by(id: article_id)
    end
    
    unless article
      # Fall back to primary if replica hasn't received the record
      article = ActiveRecord::Base.connected_to(role: :writing) do
        Article.find(article_id)
      end
    end
    
    # Process article...
  end
end

# Schedule with delay to allow replication
Article.create!(title: "New Article")
PublishArticleJob.set(wait: 5.seconds).perform_later(@article.id)

Read Replica Load Balancing distributes traffic across multiple replicas:

# config/database.yml with multiple replicas
production:
  primary:
    adapter: postgresql
    host: primary.database.example.com
  
  replica1:
    adapter: postgresql
    host: replica1.database.example.com
    replica: true
  
  replica2:
    adapter: postgresql
    host: replica2.database.example.com
    replica: true

class ReplicaLoadBalancer
  REPLICAS = [:replica1, :replica2]
  
  def self.read_from_replica(&block)
    replica = REPLICAS.sample
    ActiveRecord::Base.connected_to(role: :reading, shard: replica) do
      yield
    end
  end
end

# Use in application code
users = ReplicaLoadBalancer.read_from_replica do
  User.where(active: true).to_a
end

Critical Read Forcing ensures specific queries always hit the primary despite automatic routing:

class Account < ApplicationRecord
  def self.find_for_transaction(account_id)
    # Always read account balance from primary for financial transactions
    ActiveRecord::Base.connected_to(role: :writing) do
      find(account_id)
    end
  end
end

# Or use a scope that forces primary reads
class Product < ApplicationRecord
  scope :from_primary, -> {
    ActiveRecord::Base.connected_to(role: :writing) { all }
  }
end

# Usage: Product.from_primary.find_by(sku: "ABC123")

Performance Considerations

Replication introduces performance characteristics that differ from single-database systems. Applications must consider these factors when designing for scale and reliability.

Read Scaling Limits depend on replica count and read-write ratio. Each replica handles a portion of read traffic but replication overhead increases with replica count. Adding the fifth replica provides less marginal benefit than adding the second replica. Replication overhead includes network bandwidth from the primary to each replica, primary CPU for serializing changes, and replica CPU for applying changes.

A system with 90% reads and 10% writes achieves near-linear read scaling up to 5-10 replicas. Beyond this, primary replication overhead limits scaling. A system with 50% reads and 50% writes sees limited benefit beyond 2-3 replicas since write load remains on the primary.

Replication Lag Impact varies with transaction volume and replica hardware. Under normal conditions, PostgreSQL streaming replication maintains 50-200ms lag. During high write volume, lag increases to seconds or minutes if replicas can't apply changes fast enough. MySQL semi-synchronous replication maintains single-digit millisecond lag for the synchronous replica but allows asynchronous replicas to lag more.

class PerformanceMonitor
  def self.replication_metrics
    primary_position = ActiveRecord::Base.connected_to(role: :writing) do
      get_wal_position
    end
    
    replica_position = ActiveRecord::Base.connected_to(role: :reading) do
      get_wal_position
    end
    
    lag_bytes = primary_position - replica_position
    
    {
      lag_bytes: lag_bytes,
      lag_seconds: estimate_lag_seconds(lag_bytes),
      throughput: calculate_throughput
    }
  end
  
  private
  
  def self.get_wal_position
    result = ActiveRecord::Base.connection.execute(
      "SELECT pg_current_wal_lsn()"
    )
    wal_to_bytes(result[0]["pg_current_wal_lsn"])
  end
  
  def self.wal_to_bytes(wal_position)
    # Convert PostgreSQL WAL position to byte offset
    parts = wal_position.split('/')
    (parts[0].to_i(16) << 32) + parts[1].to_i(16)
  end
end

Write Throughput Bottlenecks occur when the primary reaches CPU or I/O limits. Single-leader replication concentrates all writes on one server. Horizontal scaling for writes requires sharding or multi-leader replication. A primary handling 10,000 writes per second may replicate successfully to many replicas, but cannot scale beyond 10,000 writes per second without architectural changes.

Connection Overhead increases with replica count since applications maintain connections to each replica. A Rails application with 20 processes and 3 replicas maintains 60 replica connections. Connection pooling solutions like PgBouncer reduce the number of actual database connections:

# config/database.yml with connection pooler
production:
  primary:
    adapter: postgresql
    host: pgbouncer-primary.example.com
    port: 6432
  
  replica:
    adapter: postgresql
    host: pgbouncer-replica.example.com
    port: 6432
    replica: true

Network Bandwidth Consumption grows linearly with replica count. Each replica receives a full copy of all writes. A primary generating 50 MB/sec of WAL data consumes 250 MB/sec of network bandwidth to support 5 replicas. This bandwidth usage becomes a limiting factor in cloud environments with network transfer costs.

Query Performance Differences occur between primary and replicas due to replication lag and query load distribution. Replicas may have stale statistics leading to suboptimal query plans. Replicas receiving heavy read traffic may experience I/O contention affecting query performance. Applications should monitor query performance separately on primary and replicas:

class QueryPerformanceTracker
  def self.track_query(query_name, role, &block)
    start_time = Time.current
    result = yield
    duration = Time.current - start_time
    
    record_metric(
      query_name: query_name,
      role: role,
      duration_ms: duration * 1000
    )
    
    result
  end
end

# Usage
users = QueryPerformanceTracker.track_query("active_users", :reading) do
  ActiveRecord::Base.connected_to(role: :reading) do
    User.where(active: true).to_a
  end
end

Cache Warming Strategies differ between primary and replicas. Replicas promoted to primary after failover start with cold caches, causing temporary performance degradation. Some systems maintain hot standby replicas that mirror primary cache contents by executing read queries against replicas that match production traffic patterns.

Index Maintenance Overhead occurs on both primary and replicas. The primary builds indexes for new data, consuming CPU and I/O. Replicas receive index updates through replication and apply them, doubling the system-wide index maintenance work. Systems with frequent index changes experience higher replication overhead.

Synchronous Replication Performance trades write latency for durability. Semi-synchronous replication adds one network round trip to write latency, typically 1-10ms depending on network distance between primary and replica. This 1-10ms increase in write latency may be acceptable for durability benefits. Synchronous replication to distant geographic regions adds 50-200ms to write latency, often unacceptable for interactive applications.

Real-World Applications

Production systems implement replication patterns that balance consistency, performance, and operational complexity. These patterns emerge from practical experience deploying replicated databases at scale.

Geographic Distribution places replicas near users to reduce read latency. A global application serves users in North America, Europe, and Asia. The primary database runs in North America, with replicas in Europe and Asia. European users connect to the European replica for reads, experiencing 10-20ms latency instead of 100-150ms latency to North America.

This pattern requires application-level routing to direct users to their nearest replica. DNS-based routing directs users to regional application servers, which connect to local replicas. User sessions stick to regions to maintain read-your-writes consistency within sessions.

class GeographicRouter
  def self.replica_for_region(region)
    case region
    when "north_america"
      :replica_us_east
    when "europe"
      :replica_eu_west
    when "asia"
      :replica_ap_southeast
    else
      :replica_us_east  # Default fallback
    end
  end
  
  def self.read_from_nearest_replica(user)
    replica = replica_for_region(user.region)
    ActiveRecord::Base.connected_to(role: :reading, shard: replica) do
      yield
    end
  end
end

Read-Heavy Workloads scale read capacity by adding replicas behind load balancers. A content publishing platform serves millions of article views with thousands of concurrent readers but few writes. Five replicas distribute read traffic, each handling 20% of reads. The primary handles all writes plus a small percentage of reads requiring strong consistency.

Load balancers perform health checks on replicas, removing unhealthy replicas from rotation. Applications retry failed reads against different replicas. This architecture scales read capacity linearly with replica count up to network and load balancer limits:

class ReplicaHealthChecker
  def self.check_all_replicas
    replicas = [:replica1, :replica2, :replica3, :replica4, :replica5]
    
    replicas.map do |replica|
      {
        replica: replica,
        healthy: check_replica_health(replica),
        lag_seconds: check_replica_lag(replica)
      }
    end
  end
  
  def self.check_replica_health(replica)
    ActiveRecord::Base.connected_to(role: :reading, shard: replica) do
      ActiveRecord::Base.connection.execute("SELECT 1")
      true
    end
  rescue => e
    false
  end
  
  def self.check_replica_lag(replica)
    ActiveRecord::Base.connected_to(role: :reading, shard: replica) do
      result = ActiveRecord::Base.connection.execute(
        "SELECT EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp()))"
      )
      result[0]["extract"].to_f
    end
  rescue => e
    Float::INFINITY
  end
end

Reporting and Analytics offload heavy analytical queries to dedicated replicas. Production applications separate transactional traffic from analytical queries by routing reports to specific replicas. These analytical replicas may lag behind by minutes, acceptable for business intelligence workloads that don't require real-time data.

Analytical replicas often have different configuration optimized for large scans and aggregations rather than transactional performance. Larger shared buffers, higher work memory, and parallel query execution improve analytical query performance:

class ReportingDatabase
  def self.run_report(&block)
    ActiveRecord::Base.connected_to(role: :reading, shard: :analytics_replica) do
      # Set session parameters for analytical queries
      ActiveRecord::Base.connection.execute(
        "SET work_mem = '256MB'; SET max_parallel_workers_per_gather = 4;"
      )
      
      yield
    end
  end
end

# Usage for expensive reports
ReportingDatabase.run_report do
  Report.generate_monthly_revenue_by_region
end

Disaster Recovery maintains replicas in separate availability zones or regions for business continuity. When a data center fails, replicas in other locations continue serving traffic. RPO (Recovery Point Objective) measures acceptable data loss in disaster scenarios. Asynchronous replication may lose seconds to minutes of data during primary failure. Synchronous replication to disaster recovery replicas eliminates data loss but adds latency.

Automated failover to disaster recovery replicas requires careful coordination:

class DisasterRecoveryFailover
  def self.initiate_failover(failed_primary, new_primary)
    # Stop writes to failed primary
    fence_failed_primary(failed_primary)
    
    # Wait for new primary to catch up
    wait_for_replication_catchup(new_primary)
    
    # Promote new primary
    promote_replica(new_primary)
    
    # Update application configuration
    update_connection_configuration(new_primary)
    
    # Restart application servers
    restart_application_servers
  end
  
  private
  
  def self.fence_failed_primary(primary)
    # Implement fencing mechanism (network isolation, storage fencing, etc.)
    # Prevent split-brain scenarios
  end
  
  def self.wait_for_replication_catchup(replica)
    timeout = 60.seconds
    start_time = Time.current
    
    loop do
      lag = check_replica_lag(replica)
      break if lag < 1.second
      
      raise TimeoutError if Time.current - start_time > timeout
      sleep 1
    end
  end
end

Blue-Green Deployments utilize replication for zero-downtime schema migrations. The application maintains two database clusters: blue (current production) and green (new version). Schema changes deploy to the green cluster while the blue cluster continues serving traffic. Replication keeps green synchronized with blue. After validation, traffic switches to green.

This pattern requires bidirectional replication during the transition period and careful coordination of schema compatibility:

class BlueGreenDeployment
  def self.migrate_to_green
    # Ensure green replica is caught up
    verify_replication_status(:green)
    
    # Stop writes to blue
    stop_application_writes
    
    # Final replication sync
    wait_for_final_sync(:green)
    
    # Switch application to green
    switch_database_configuration(:green)
    
    # Restart application
    restart_application
    
    # Start reverse replication from green to blue
    setup_reverse_replication(:green, :blue)
  end
end

Multi-Tenant Applications use replication to separate tenant data while maintaining global views. Each tenant has a dedicated replica for reads, isolating tenant workloads. The primary handles writes for all tenants. This pattern provides tenant-level read scaling and isolation without fully sharding data:

class TenantReplicaRouter
  def self.replica_for_tenant(tenant_id)
    replica_assignments = {
      1..1000 => :replica_set_1,
      1001..2000 => :replica_set_2,
      2001..3000 => :replica_set_3
    }
    
    replica_assignments.find { |range, _| range.include?(tenant_id) }&.last || :replica_set_1
  end
  
  def self.read_for_tenant(tenant_id, &block)
    replica = replica_for_tenant(tenant_id)
    ActiveRecord::Base.connected_to(role: :reading, shard: replica) do
      yield
    end
  end
end

Reference

Replication Topology Comparison

Topology	Write Targets	Conflict Resolution	Use Case
Single-Leader	One primary	Not needed	Most web applications
Multi-Leader	Multiple primaries	Required	Multi-region writes
Leaderless	All replicas	Quorum-based	High availability systems
Chain	Leader only	Not needed	Many replicas, bandwidth limited

Consistency Models

Model	Guarantee	Read Latency	Use Case
Strong	Latest write visible	Higher	Financial transactions
Eventual	Converges over time	Lower	Social media feeds
Read-Your-Writes	Own writes visible	Medium	User profile data
Monotonic Reads	No backward time travel	Medium	Session data

PostgreSQL Replication Configuration

Parameter	Description	Values
synchronous_commit	Wait for replica acknowledgment	on, off, remote_write, remote_apply, local
max_wal_senders	Maximum concurrent replicas	1-262143
wal_level	WAL detail level	minimal, replica, logical
max_replication_slots	Replication slot limit	0-262143
hot_standby	Allow reads on replicas	on, off

MySQL Replication Modes

Mode	Description	Data Loss Risk
Asynchronous	No wait for replica	Seconds to minutes
Semi-Synchronous	Wait for one replica	Minimal
Group Replication	Multi-primary consensus	None after commit
GTID-based	Global transaction IDs	Depends on mode

Replication Lag Causes

Cause	Impact	Mitigation
Network congestion	Seconds to minutes	Increase bandwidth, QoS
Replica hardware	Variable	Upgrade replica resources
Large transactions	Proportional to size	Break into smaller transactions
Slow queries on replica	Duration of query	Optimize queries, add indexes
Replica maintenance	Duration of operation	Schedule during low traffic

Rails Connection Configuration

Setting	Purpose	Example Value
database_selector	Enable automatic routing	delay: 2.seconds
database_resolver	Connection routing logic	ActiveRecord::Middleware::DatabaseSelector::Resolver
replica: true	Mark as replica connection	true
pool	Connection pool size	5-25
checkout_timeout	Wait time for connection	5 seconds

Monitoring Queries

PostgreSQL Replication Status:

SELECT 
  client_addr,
  state,
  sent_lsn,
  write_lsn,
  replay_lsn,
  sync_state,
  EXTRACT(EPOCH FROM (now() - replay_timestamp)) AS lag_seconds
FROM pg_stat_replication;

PostgreSQL Replica Lag:

SELECT 
  EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp())) AS lag_seconds,
  pg_is_in_recovery() AS is_replica;

MySQL Replication Status:

SHOW SLAVE STATUS;

MySQL Replica Lag:

SHOW SLAVE STATUS\G
-- Check Seconds_Behind_Master field

Failover Decision Matrix

Scenario	Action	Rationale
Primary unresponsive, lag under 1s	Automatic failover	Minimal data loss
Primary unresponsive, lag over 30s	Manual failover	Significant data loss risk
Network partition	Wait for primary	Prevent split-brain
Planned maintenance	Manual failover	Controlled process
Replica failure	Remove from rotation	Primary unaffected

Performance Tuning Parameters

Write Performance:

Disable synchronous_commit for higher throughput, accept data loss risk
Batch small transactions into larger transactions
Reduce number of replicas if primary saturated
Use connection pooling to reduce connection overhead
Enable compression for WAN replication

Read Performance:

Add replicas to distribute read load
Use read-your-writes only when necessary
Implement query result caching
Optimize slow queries affecting replica performance
Configure replicas with read-optimized settings

Lag Reduction:

Upgrade replica hardware to match or exceed primary
Increase network bandwidth between primary and replicas
Enable parallel replication where supported
Reduce long-running transactions
Monitor and kill blocking queries on replicas

Database Replication