Overview
Data replication maintains multiple copies of data across different nodes, servers, or geographic locations. This technique enables high availability, fault tolerance, load distribution, and improved read performance by keeping synchronized copies of data that can serve requests independently.
Replication addresses several critical needs in distributed systems. When a primary database fails, a replica can take over; whether any recent writes are lost depends on how far that replica lagged behind the primary. Geographic distribution places data closer to users, reducing latency for global applications. Read-heavy workloads distribute queries across multiple replicas, preventing any single database from becoming a bottleneck.
The fundamental challenge in replication involves maintaining consistency between copies while managing the inherent delays and failures in distributed networks. When data changes on one node, that change must propagate to all replicas. The timing, ordering, and guarantees around this propagation define different replication strategies, each with distinct trade-offs.
```ruby
# Conceptual representation of a replicated data store
class ReplicatedStore
  def initialize(primary, replicas)
    @primary = primary
    @replicas = replicas
  end

  def write(key, value)
    # Write to primary first
    @primary.set(key, value)
    # Propagate to replicas (inline here for simplicity)
    @replicas.each { |replica| replica.set(key, value) }
  end

  def read(key)
    # Read from any node, primary or replica, chosen at random
    node = [@primary, *@replicas].sample
    node.get(key)
  end
end
```
Replication operates at different system layers. Database replication duplicates entire databases or specific tables. Application-level replication synchronizes data through custom logic. File system replication maintains identical copies of files across storage systems. Message queue replication ensures event delivery across multiple brokers.
Key Principles
Replication relies on several fundamental mechanisms that govern how data copies remain synchronized and how the system behaves during normal operation and failures.
Replication Lag represents the time delay between a write on the primary and its visibility on replicas. This lag exists in all asynchronous replication systems and directly impacts read consistency. A replica lagging by five seconds serves data that may be five seconds stale. Applications must account for this staleness when designing read patterns.
Write Path determines where write operations occur and how they propagate. In primary-based replication, all writes go to a designated primary node that forwards changes to replicas. In multi-master replication, multiple nodes accept writes concurrently, requiring conflict resolution when the same data changes in different locations.
Replication Log captures all changes to data in a sequential, ordered format. This log serves as the source of truth for replication. PostgreSQL uses Write-Ahead Logging (WAL), MySQL uses binary logs, and many distributed systems use commit logs. Replicas consume these logs to apply changes in the same order they occurred on the primary.
```ruby
# Simplified replication log structure
class ReplicationLog
  def initialize
    @entries = []
    @position = 0
  end

  def append(operation, key, value, timestamp)
    entry = {
      position: @position,
      operation: operation,
      key: key,
      value: value,
      timestamp: timestamp
    }
    @entries << entry
    @position += 1
    entry
  end

  # Returns entries strictly after the given position.
  # A replica that has applied nothing passes -1 to receive entry 0 as well.
  def entries_since(position)
    @entries.select { |e| e[:position] > position }
  end
end
```
Consistency Models define guarantees about what data replicas return when read. Strong consistency ensures all replicas return identical data at any moment, but requires coordination that adds latency. Eventual consistency allows replicas to temporarily diverge, guaranteeing only that they will converge given enough time without new writes. Causal consistency preserves cause-effect relationships while allowing some divergence.
Conflict Resolution becomes necessary when multiple nodes accept writes to the same data. Last-write-wins uses timestamps to pick the most recent write, though clock skew creates problems. Vector clocks track causality to detect concurrent writes that genuinely conflict. Application-specific resolvers use domain knowledge to merge conflicting values.
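To make the vector clock idea concrete, here is a minimal sketch that compares two clocks and reports whether one write happened before the other or whether they are concurrent. It assumes clocks are plain hashes mapping a node name to a counter; `compare_clocks` and `merge_clocks` are illustrative names, not part of any library.

```ruby
# Compare two vector clocks represented as { node_id => counter } hashes.
# Returns :before, :after, :equal, or :concurrent (relative to clock a).
def compare_clocks(a, b)
  nodes = a.keys | b.keys
  a_ahead = nodes.any? { |n| a.fetch(n, 0) > b.fetch(n, 0) }
  b_ahead = nodes.any? { |n| b.fetch(n, 0) > a.fetch(n, 0) }

  if a_ahead && b_ahead
    :concurrent # Neither write saw the other: a genuine conflict
  elsif a_ahead
    :after
  elsif b_ahead
    :before
  else
    :equal
  end
end

# Merging takes the element-wise maximum of both clocks.
def merge_clocks(a, b)
  a.merge(b) { |_node, x, y| [x, y].max }
end

puts compare_clocks({ "a" => 2, "b" => 1 }, { "a" => 1, "b" => 1 })
puts compare_clocks({ "a" => 1 }, { "b" => 1 })
```

Only the `:concurrent` case needs a resolver; the other outcomes have an unambiguous winner.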
Topology describes the arrangement of primary and replica nodes. Star topology has one primary and multiple replicas. Chain topology connects nodes sequentially, with each forwarding to the next. Tree topology organizes nodes hierarchically. The topology affects latency, network utilization, and failure behavior.
Checkpoint and Recovery enables replicas to catch up after disconnection without replaying the entire replication log. Replicas periodically snapshot their state and record their log position. After reconnection, they load the snapshot and replay only log entries since that position.
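The snapshot-plus-replay idea can be sketched in a few lines. This is an in-memory illustration only: `CheckpointingReplica` is a hypothetical class, log entries are assumed to be hashes with `:position`, `:key`, and `:value`, and a real system would persist the snapshot to disk.

```ruby
class CheckpointingReplica
  attr_reader :state, :applied_position

  def initialize
    @state = {}
    @applied_position = -1 # Position of the last log entry applied
    @snapshot = nil
  end

  # Applying is idempotent: entries at or before the applied position are skipped.
  def apply(entry)
    return if entry[:position] <= @applied_position
    @state[entry[:key]] = entry[:value]
    @applied_position = entry[:position]
  end

  # Record a copy of the state together with the log position it reflects.
  def checkpoint!
    @snapshot = { state: @state.dup, position: @applied_position }
  end

  # Restore the snapshot (if any), then replay only the newer log entries.
  def recover_from(log)
    @state = @snapshot ? @snapshot[:state].dup : {}
    @applied_position = @snapshot ? @snapshot[:position] : -1
    log.select { |e| e[:position] > @applied_position }.each { |e| apply(e) }
  end
end
```

After a checkpoint at position 2, recovering against a five-entry log replays only positions 3 and 4.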
Implementation Approaches
Different replication strategies offer distinct trade-offs between consistency, performance, availability, and operational complexity.
Synchronous Replication writes data to the primary and waits for acknowledgment from one or more replicas before confirming the write to the client. This approach guarantees that committed data exists on multiple nodes, preventing data loss if the primary fails. The primary blocks until replicas respond, adding latency proportional to network round-trip time plus replica processing time.
Synchronous replication typically uses quorum-based commits where writes complete after a majority of replicas acknowledge. With five replicas, writes succeed after three acknowledgments. This configuration tolerates two simultaneous failures while maintaining data durability. The system becomes unavailable for writes if too many replicas fail to reach quorum.
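The quorum arithmetic above can be checked with a short sketch. The method names are illustrative, and a simple majority quorum is assumed.

```ruby
# Smallest number of acknowledgments that forms a majority of node_count.
def majority_quorum(node_count)
  node_count / 2 + 1
end

# Number of nodes that can fail while a majority can still acknowledge.
def tolerated_failures(node_count)
  node_count - majority_quorum(node_count)
end

puts majority_quorum(5)    # 3
puts tolerated_failures(5) # 2
```

An even node count adds no failure tolerance over the next lower odd count: four nodes need three acknowledgments and tolerate only one failure.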
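```ruby
class SynchronousReplicator
  TIMEOUT_SECONDS = 5

  def initialize(primary, replicas, quorum_size)
    @primary = primary
    @replicas = replicas
    @quorum_size = quorum_size
  end

  def write(key, value)
    # Write to primary
    @primary.set(key, value)

    # Shared counters are guarded by a mutex because replica threads update them
    lock = Mutex.new
    changed = ConditionVariable.new
    acks = 0
    pending = @replicas.size

    @replicas.each do |replica|
      Thread.new do
        ok = begin
          replica.set(key, value)
          true
        rescue StandardError
          false # Replica failed; it does not count toward quorum
        end
        lock.synchronize do
          acks += 1 if ok
          pending -= 1
          changed.signal
        end
      end
    end

    # Wait until quorum is reached, every replica has answered, or the timeout expires
    deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + TIMEOUT_SECONDS
    lock.synchronize do
      while acks < @quorum_size && pending > 0
        remaining = deadline - Process.clock_gettime(Process::CLOCK_MONOTONIC)
        break if remaining <= 0
        changed.wait(lock, remaining)
      end
      raise "Failed to reach quorum" if acks < @quorum_size
    end
    :success
  end
end
```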
Asynchronous Replication confirms writes to clients immediately after writing to the primary, then propagates changes to replicas in the background. This approach provides low write latency since clients do not wait for replicas. Replicas lag behind the primary by some amount, meaning recent writes may not appear on replicas immediately.
The risk with asynchronous replication involves data loss during primary failure. If the primary crashes after confirming a write but before replicating it, that write disappears. The acceptable amount of potential data loss depends on the application. Financial systems typically cannot tolerate any loss, while social media applications might accept losing a few seconds of data during failures.
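The loss window can be illustrated with a toy simulation. Everything here is hypothetical: `AsyncPrimary` is an in-memory stand-in for a primary, a plain hash stands in for the replica, and `replicate` is called by hand to control how far replication has progressed before the simulated crash.

```ruby
class AsyncPrimary
  attr_reader :data

  def initialize(replica)
    @data = {}
    @replica = replica # Plain Hash standing in for a replica
    @backlog = []      # Writes confirmed to clients but not yet replicated
  end

  def write(key, value)
    @data[key] = value
    @backlog << [key, value]
    :ok # Confirmed before any replica has the write
  end

  # Ship up to max backlog entries to the replica.
  def replicate(max = @backlog.size)
    @backlog.shift(max).each { |key, value| @replica[key] = value }
  end
end

replica = {}
primary = AsyncPrimary.new(replica)
primary.write("a", 1)
primary.write("b", 2)
primary.replicate(1) # Only the first write reaches the replica
primary.write("c", 3)

# The primary crashes here and the replica is promoted.
lost = primary.data.keys - replica.keys
puts "Confirmed writes missing after failover: #{lost.inspect}"
```

Both "b" and "c" were acknowledged to the client yet never reached the replica, which is exactly the window the paragraph above describes.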
Semi-Synchronous Replication combines characteristics of both approaches. Writes wait for at least one replica to acknowledge before confirming to the client, then replicate to remaining replicas asynchronously. This strategy balances durability and performance by ensuring data exists on two nodes while not blocking on all replicas.
MySQL semi-synchronous replication waits for one replica acknowledgment, then returns success. If no replica acknowledges within the configured timeout, for example because all replicas are unavailable, the primary falls back to asynchronous mode to maintain write availability, accepting the increased risk of data loss.
Logical Replication copies data changes at a logical level, replicating operations like "insert row with values X, Y, Z" rather than physical disk changes. This approach enables filtering specific tables, transforming data during replication, and replicating between different database versions or even different database systems.
PostgreSQL logical replication uses publications on the primary that define which tables to replicate and subscriptions on replicas that consume those publications. Changes stream through a replication slot that buffers modifications. Applications can subscribe to the same logical replication stream to build derived data stores or implement change data capture.
Physical Replication copies the exact byte-level changes to data files and transaction logs. Physical replicas maintain identical copies of the primary's data structures. This approach replicates everything—all databases, all tables, all indexes—with minimal overhead since it operates below the SQL layer.
PostgreSQL streaming replication ships WAL records from primary to standby servers that replay those records to reconstruct the same physical state. Physical replicas can serve read queries, though they remain in a special recovery mode that continuously applies WAL records.
Ruby Implementation
Ruby applications interact with replicated data stores through database adapters, connection pooling libraries, and replication-aware gems that abstract the complexity of managing primary and replica connections.
Database Connection Management in Ruby typically uses ActiveRecord for Rails applications or Sequel for standalone applications. Both support read/write connection splitting where write queries go to the primary and read queries distribute across replicas.
```ruby
# ActiveRecord configuration for primary-replica setup
class ApplicationRecord < ActiveRecord::Base
  self.abstract_class = true

  connects_to database: {
    writing: :primary,
    reading: :replica
  }
end

# Queries use the writing role unless the reading role is selected,
# either explicitly as here or by the automatic role-switching middleware
ActiveRecord::Base.connected_to(role: :reading) do
  User.where(active: true).first # Reads from replica
end

# Explicitly use primary for consistency
ActiveRecord::Base.connected_to(role: :writing) do
  User.where(active: true).first # Reads from primary
end

# Write operations use primary
user = User.create(name: "Alice") # Writes to primary
```
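Applications declare replica connections in database.yml (marked with replica: true) and map them to the reading role. Rails can switch roles automatically through its database selector middleware, which sends GET and HEAD requests to the replica unless the same session wrote recently. Rails does not itself load balance across several replicas, so distributing reads among them requires a proxy, a gem such as Makara, or custom selection logic. Custom strategies can implement weighted distribution, latency-based selection, or geographic affinity.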
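As a rough illustration of a weighted distribution strategy, the sketch below cycles through replica names in proportion to their weights. `WeightedReplicaSelector` is a hypothetical helper, not an ActiveRecord API; mapping the returned name to an actual connection is left to the application.

```ruby
class WeightedReplicaSelector
  # weights: { "replica1" => 3, "replica2" => 1 } gives replica1 three of every four reads.
  def initialize(weights)
    @ring = weights.flat_map { |name, weight| [name] * weight }
    raise ArgumentError, "at least one replica needs a positive weight" if @ring.empty?
    @index = 0
    @lock = Mutex.new # Selection may be called from several request threads
  end

  def next_replica
    @lock.synchronize do
      name = @ring[@index % @ring.size]
      @index += 1
      name
    end
  end
end

selector = WeightedReplicaSelector.new("replica1" => 3, "replica2" => 1)
picks = Array.new(8) { selector.next_replica }
puts picks.inspect
```

A latency-based or geography-aware strategy would keep the same `next_replica` interface and change only how the choice is made.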
Handling Replication Lag requires application awareness. After writing data, immediately reading it from a replica may return stale results if replication has not completed. Applications use several patterns to manage this:
```ruby
class UserService
  def create_and_fetch(attributes)
    # Write to primary
    user = User.create(attributes)
    # Force read from primary for consistency
    ActiveRecord::Base.connected_to(role: :writing) do
      User.find(user.id)
    end
  end

  def create_with_retry(attributes)
    user = User.create(attributes)
    # Poll replica until data appears
    max_attempts = 10
    attempt = 0
    ActiveRecord::Base.connected_to(role: :reading) do
      begin
        User.find(user.id)
      rescue ActiveRecord::RecordNotFound
        attempt += 1
        raise if attempt >= max_attempts
        sleep 0.1
        retry
      end
    end
  end

  def create_with_session_stickiness(attributes, session_id)
    user = User.create(attributes)
    # Store write timestamp in session
    # (session_store is an application-provided key-value store)
    session_store.set("last_write_#{session_id}", Time.now)
    # Later reads check if enough time has passed
    last_write = session_store.get("last_write_#{session_id}")
    recent_write = last_write && (Time.now - last_write) < 5
    # Use primary after a recent write, replica otherwise
    role = recent_write ? :writing : :reading
    ActiveRecord::Base.connected_to(role: role) { User.find(user.id) }
  end
end
```
Makara provides advanced connection management for Ruby applications with read/write splitting and failover capabilities. It wraps ActiveRecord's connection adapter to intercept queries and route them based on type and current node health.
```ruby
# Gemfile
gem 'makara'
```

```yaml
# database.yml
production:
  adapter: mysql2_makara
  database: myapp_production
  makara:
    connections:
      - role: master
        host: primary.db.example.com
      - role: slave
        host: replica1.db.example.com
      - role: slave
        host: replica2.db.example.com
    master_ttl: 5 # Stick to master for 5 seconds after write
    blacklist_duration: 30 # Blacklist failed nodes for 30 seconds
```

```ruby
# Application code remains unchanged
User.where(active: true).first # Automatically routed
```
Makara tracks which connections recently performed writes and temporarily routes subsequent reads from that connection to the primary. This "stickiness" period prevents reading stale data after writes. When a replica fails health checks, Makara blacklists it temporarily and redistributes queries to healthy replicas.
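The stickiness idea can be sketched independently of any gem. This is not Makara's implementation: `StickyRouter` is a hypothetical class that remembers the last write time per session and routes reads to the primary until a configurable window has passed. The clock is injectable so the behavior can be exercised without sleeping.

```ruby
class StickyRouter
  MONOTONIC_CLOCK = -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) }

  def initialize(sticky_seconds:, clock: MONOTONIC_CLOCK)
    @sticky_seconds = sticky_seconds
    @clock = clock
    @last_write_at = {}
  end

  # Call after every write performed on behalf of the session.
  def record_write(session_id)
    @last_write_at[session_id] = @clock.call
  end

  # Reads stick to the primary while the session's last write is recent.
  def role_for_read(session_id)
    last = @last_write_at[session_id]
    return :replica unless last
    (@clock.call - last) < @sticky_seconds ? :primary : :replica
  end
end

router = StickyRouter.new(sticky_seconds: 5)
router.record_write("session-1")
puts router.role_for_read("session-1") # primary, because the write just happened
puts router.role_for_read("session-2") # replica, because this session never wrote
```

The window should be at least as long as the replication lag the application expects; a shorter window reintroduces stale reads.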
Redis Replication in Ruby uses the redis-rb gem with Sentinel support for automatic failover. Sentinel monitors Redis instances and promotes replicas to primary when failures occur.
```ruby
require 'redis'

SENTINELS = [
  { host: "sentinel1.example.com", port: 26379 },
  { host: "sentinel2.example.com", port: 26379 },
  { host: "sentinel3.example.com", port: 26379 }
].freeze

# Connect through Sentinel for automatic failover
redis = Redis.new(url: "redis://mymaster", sentinels: SENTINELS, role: :master)

# Writes go to current primary
redis.set("key", "value")

# Reads can use replicas for scaling
redis_replica = Redis.new(url: "redis://mymaster", sentinels: SENTINELS, role: :slave)
redis_replica.get("key")
```
Kafka Consumer Groups in Ruby enable building replicated data pipelines where multiple consumers process events concurrently. The ruby-kafka gem handles partition assignment and offset management.
```ruby
require 'json'
require 'kafka'

kafka = Kafka.new(
  seed_brokers: ["kafka1.example.com:9092", "kafka2.example.com:9092"],
  client_id: "user-service"
)

# Consumer group ensures only one consumer processes each partition
consumer = kafka.consumer(group_id: "user-replicator")
consumer.subscribe("user-events")

consumer.each_message do |message|
  event = JSON.parse(message.value)

  # SearchIndex stands in for the downstream data store
  case event["type"]
  when "user.created"
    # Replicate user creation to another data store
    SearchIndex.create_user(event["data"])
  when "user.updated"
    SearchIndex.update_user(event["data"])
  when "user.deleted"
    SearchIndex.delete_user(event["data"]["id"])
  end

  # Commit offset after successful processing
  consumer.commit_offsets
end
```
Custom Replication Logic sometimes becomes necessary for application-specific requirements. Building a simple replication system demonstrates the core concepts:
```ruby
class SimpleReplicator
  def initialize(primary_url, replica_urls)
    # Database is a placeholder for the application's database client
    @primary = Database.connect(primary_url)
    @replicas = replica_urls.map { |url| Database.connect(url) }
    @replication_log = []
    @replica_positions = Hash.new(0)
    @log_lock = Mutex.new         # Guards the primary write and log append
    @replication_lock = Mutex.new # Allows one replication pass at a time
  end

  def write(table, id, data)
    entry = @log_lock.synchronize do
      # Write to primary
      @primary.write(table, id, data)
      # Record in replication log, in the same order as the primary writes
      new_entry = {
        position: @replication_log.size,
        timestamp: Time.now,
        table: table,
        id: id,
        data: data
      }
      @replication_log << new_entry
      new_entry
    end
    # Replicate asynchronously
    replicate_to_followers
    entry
  end

  def read(table, id, consistency: :eventual)
    case consistency
    when :strong
      # Always read from primary
      @primary.read(table, id)
    when :eventual
      # Read from random replica
      replica = @replicas.sample || @primary
      replica.read(table, id)
    end
  end

  private

  def replicate_to_followers
    Thread.new do
      # Serialize passes so concurrent writes never apply an entry twice
      @replication_lock.synchronize do
        log = @log_lock.synchronize { @replication_log.dup }
        @replicas.each_with_index do |replica, index|
          begin
            log[@replica_positions[index]..-1].each do |entry|
              replica.write(entry[:table], entry[:id], entry[:data])
              @replica_positions[index] = entry[:position] + 1
            end
          rescue StandardError => e
            # Log error, continue with other replicas
            warn "Replication to replica #{index} failed: #{e.message}"
          end
        end
      end
    end
  end
end
```
Design Considerations
Selecting appropriate replication strategies requires analyzing consistency requirements, performance goals, failure tolerance, and operational constraints.
Consistency Requirements dominate replication decisions. Financial transactions demand strong consistency where all nodes always return identical data. Social media timelines tolerate eventual consistency where different users might temporarily see different post counts. Inventory systems need bounded staleness where replica lag stays within acceptable limits.
Strong consistency requires synchronous replication or distributed consensus protocols. These approaches add latency since operations wait for coordination across nodes. Systems using strong consistency often sacrifice write throughput for correctness guarantees. Geographical distribution compounds these costs, as each round trip between distant data centers can add tens to hundreds of milliseconds to an operation.
Eventual consistency enables high write throughput and geographic distribution by allowing replicas to temporarily diverge. Applications must handle scenarios where recently written data does not appear immediately on replicas. Session stickiness routes users to specific replicas to provide consistent views within a session while different users might see different data.
```ruby
# Design pattern for choosing consistency level
# PrimaryDatabase and ReplicaDatabase are placeholder connection classes
class DataAccessStrategy
  def self.for_operation(operation_type)
    case operation_type
    when :financial_transaction
      StrongConsistencyAccess.new
    when :user_profile_read
      EventualConsistencyAccess.new
    when :inventory_check
      BoundedStalenessAccess.new(max_lag: 10)
    else
      raise ArgumentError, "Unknown operation type: #{operation_type}"
    end
  end
end

class StrongConsistencyAccess
  def read(key)
    PrimaryDatabase.connected_to(role: :writing) do
      PrimaryDatabase.read(key)
    end
  end
end

class EventualConsistencyAccess
  def read(key)
    ReplicaDatabase.read(key)
  end
end

class BoundedStalenessAccess
  def initialize(max_lag:)
    @max_lag = max_lag
  end

  def read(key)
    # find_replica_within_lag is left to the application: it returns a replica
    # whose measured lag is within the limit, or nil when none qualifies
    replica = find_replica_within_lag(@max_lag)
    replica ? replica.read(key) : PrimaryDatabase.read(key)
  end
end
```
Read/Write Ratio determines replication value. Read-heavy workloads with 90% reads and 10% writes gain substantial benefit from replication. Multiple replicas distribute read load, preventing primary overload. Write-heavy workloads gain less benefit since writes still concentrate on the primary and replication overhead increases.
Applications with regional user bases benefit from geographic replication regardless of read/write ratio. Placing replicas near users reduces latency even when replication overhead increases. As an illustration, a European user reading from a European replica might see around 50ms of latency, compared to around 200ms against a US-based primary.
Failure Recovery Time affects replication topology decisions. Synchronous replication allows failover without losing committed data, since at least one replica holds every acknowledged write; recovery time then depends on how quickly the failure is detected and a replica promoted. Asynchronous replication requires determining how much data loss is acceptable during failover. Semi-synchronous replication provides middle ground with one replica always synchronized while others lag.
Automated failover tools monitor primary health and promote replicas when failures occur. Manual failover requires operator intervention, increasing recovery time but preventing automated split-brain scenarios where multiple nodes believe they are primary.
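One common guard against split-brain is to tag every write with an epoch number that increases on each promotion, so storage can reject writes from a node that still believes it is primary. The sketch below is a toy illustration under that assumption; `FencedStore` and `StaleEpochError` are hypothetical names.

```ruby
class StaleEpochError < StandardError; end

class FencedStore
  attr_reader :data, :highest_epoch

  def initialize
    @data = {}
    @highest_epoch = 0
  end

  # Accept a write only if it carries the newest epoch seen so far.
  def write(epoch, key, value)
    if epoch < @highest_epoch
      raise StaleEpochError, "epoch #{epoch} is older than #{@highest_epoch}"
    end
    @highest_epoch = epoch
    @data[key] = value
  end
end

store = FencedStore.new
store.write(1, "balance", 100) # Original primary, epoch 1
store.write(2, "balance", 120) # Promoted replica, epoch 2

begin
  store.write(1, "balance", 90) # Old primary wakes up and tries to write
rescue StaleEpochError => e
  puts "Rejected: #{e.message}"
end
```

The old primary's write is refused, so the two nodes cannot both modify the data even if each believes it holds the primary role.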
Network Topology influences replication performance. Data centers connected by dedicated low-latency links support synchronous replication across sites. Geographically distributed sites over public internet require asynchronous replication due to variable latency and occasional connectivity issues.
Chain replication sends data from primary to first replica, first replica to second replica, and so on. This topology reduces primary network bandwidth but increases end-to-end replication latency. Star topology where the primary sends directly to all replicas provides lower latency but requires more primary bandwidth.
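The trade-off between the two topologies can be put into numbers with a back-of-the-envelope sketch. It assumes every hop has the same latency and every replica receives the full write stream; the method names and figures are illustrative only.

```ruby
# Star: the primary sends the full stream to every replica directly.
def star_topology(replicas:, write_mb_per_s:, hop_latency_ms:)
  {
    primary_egress_mb_per_s: replicas * write_mb_per_s,
    worst_case_latency_ms: hop_latency_ms
  }
end

# Chain: the primary sends once; each replica forwards to the next.
def chain_topology(replicas:, write_mb_per_s:, hop_latency_ms:)
  {
    primary_egress_mb_per_s: write_mb_per_s,
    worst_case_latency_ms: replicas * hop_latency_ms
  }
end

puts star_topology(replicas: 4, write_mb_per_s: 10, hop_latency_ms: 5).inspect
puts chain_topology(replicas: 4, write_mb_per_s: 10, hop_latency_ms: 5).inspect
```

With four replicas, the star costs the primary four times the outbound bandwidth, while the chain takes four times as long to reach the last replica.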
Operational Complexity increases with replication. Monitoring lag across replicas, handling failover scenarios, resolving conflicts in multi-master setups, and maintaining backup schedules across multiple nodes require sophisticated operational practices. Managed database services handle much of this complexity but limit configuration options.
Performance Considerations
Replication impacts system performance through multiple mechanisms that affect throughput, latency, and resource utilization.
Write Amplification occurs when each write to the primary generates writes to multiple replicas. A system with five replicas performs six total writes for each logical operation. Network bandwidth, disk I/O, and CPU all scale with replica count. Asynchronous replication decouples this cost from client-perceived latency, but synchronous replication adds latency proportional to the slowest replica.
Batch replication groups multiple changes before sending to replicas, reducing network overhead and per-operation costs. Instead of sending 100 individual changes, the primary batches them into one network message containing all changes. This optimization trades increased replication lag for better throughput.
```ruby
class BatchReplicator
  def initialize(primary, replicas, batch_size: 100, batch_timeout: 1.0)
    @primary = primary
    @replicas = replicas
    @batch_size = batch_size
    @batch_timeout = batch_timeout
    @pending_changes = []
    @last_flush = Time.now
    @lock = Mutex.new # Guards @pending_changes and @last_flush
    start_flush_timer
  end

  def write(key, value)
    @primary.set(key, value)
    @lock.synchronize do
      @pending_changes << { key: key, value: value, timestamp: Time.now }
    end
    flush_if_needed
  end

  private

  def flush_if_needed
    # Take the batch under the lock so the timer and writers never share it
    batch = @lock.synchronize do
      due = @pending_changes.size >= @batch_size ||
            (Time.now - @last_flush) >= @batch_timeout
      if due && !@pending_changes.empty?
        @last_flush = Time.now
        @pending_changes.shift(@pending_changes.size)
      end
    end
    flush(batch) if batch
  end

  def flush(batch)
    # Each batch is sent on its own thread, so ordering between batches
    # is not guaranteed in this simplified version
    Thread.new do
      @replicas.each do |replica|
        replica.batch_set(batch)
      end
    end
  end

  def start_flush_timer
    Thread.new do
      loop do
        sleep @batch_timeout
        flush_if_needed
      end
    end
  end
end
```
Read Performance improves with replication as queries distribute across multiple nodes. With a primary and three replicas all serving reads, each node handles one-fourth of read traffic assuming equal distribution. Geographic distribution further improves read latency by routing requests to nearby replicas.
Connection pooling maximizes replica utilization by maintaining persistent connections to all replicas. Each application server maintains pools to both primary and replica databases, quickly switching between them as queries execute.
Replication Lag Measurement enables performance monitoring and troubleshooting. PostgreSQL provides the pg_stat_replication view, which reports the WAL positions each replica has received, written, flushed, and replayed; the difference from the primary's current position gives the lag in bytes. MySQL offers SHOW SLAVE STATUS (SHOW REPLICA STATUS from MySQL 8.0.22) with similar information. Applications should monitor these metrics and alert when lag exceeds acceptable thresholds.
```ruby
class ReplicationMonitor
  def initialize(primary_connection)
    @primary = primary_connection
  end

  def check_replica_lag
    # PostgreSQL example
    results = @primary.exec(<<~SQL)
      SELECT
        application_name,
        client_addr,
        state,
        pg_wal_lsn_diff(pg_current_wal_lsn(), sent_lsn) AS sent_lag,
        pg_wal_lsn_diff(pg_current_wal_lsn(), write_lsn) AS write_lag,
        pg_wal_lsn_diff(pg_current_wal_lsn(), flush_lsn) AS flush_lag,
        pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag
      FROM pg_stat_replication
    SQL

    results.map do |row|
      {
        name: row['application_name'],
        address: row['client_addr'],
        state: row['state'],
        sent_lag_bytes: row['sent_lag'].to_i,
        write_lag_bytes: row['write_lag'].to_i,
        flush_lag_bytes: row['flush_lag'].to_i,
        replay_lag_bytes: row['replay_lag'].to_i
      }
    end
  end

  def alert_if_lagging(max_lag_bytes: 10_000_000)
    lagging_replicas = check_replica_lag.select do |replica|
      replica[:replay_lag_bytes] > max_lag_bytes
    end

    if lagging_replicas.any?
      AlertService.send(
        severity: :warning,
        message: "Replicas lagging behind primary",
        details: lagging_replicas
      )
    end
  end
end
```
Index Replication affects replica disk space and maintenance overhead. Under physical replication, replicas maintain the same indexes as the primary, so each replica adds a full copy of the index storage on top of the data itself. Applications can configure replicas with different indexes optimized for specific query patterns, though this requires logical replication that operates at the SQL level rather than physical replication.
Network Bandwidth limits replication throughput in geographically distributed systems. A database generating 100 MB/s of write traffic requires 100 MB/s bandwidth to each replica. Compression reduces bandwidth requirements but adds CPU overhead. Incremental or delta replication sends only changed bytes rather than entire rows, reducing bandwidth for updates to large objects.
Failover Impact creates performance spikes during transitions. When a primary fails and a replica promotes to primary, that replica suddenly handles both read and write traffic. If the previous primary handled 30% of total traffic and the replica handled 30%, the promoted replica now handles 60% during failover. Over-provisioning replicas accounts for this scenario.
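The load shift in the example above can be worked through in code. This is a simplified model that assumes the promoted replica inherits all of the failed primary's traffic and that shares are whole percentages; the method names are illustrative.

```ruby
# shares: { node_name => percent of total traffic }
def shares_after_failover(shares, failed:, promoted:)
  remaining = shares.reject { |node, _share| node == failed }
  remaining[promoted] += shares.fetch(failed)
  remaining
end

# Nodes whose share exceeds what they were provisioned to handle.
def overloaded_nodes(shares, capacity_percent:)
  shares.select { |_node, share| share > capacity_percent }.keys
end

before = { "primary" => 30, "replica1" => 30, "replica2" => 40 }
after = shares_after_failover(before, failed: "primary", promoted: "replica1")
puts after.inspect
puts overloaded_nodes(after, capacity_percent: 50).inspect
```

A replica provisioned for only 50% of total traffic would be overloaded after promotion here, which is the case over-provisioning is meant to cover.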
Tools & Ecosystem
The Ruby ecosystem provides various tools and libraries for implementing and managing replicated data systems.
ActiveRecord offers built-in support for read/write splitting across primary and replica databases. Configuration defines multiple database connections, and ActiveRecord routes queries based on operation type.
Makara extends ActiveRecord with sophisticated connection pooling, automatic failover, and health checking. It intercepts database queries to route them intelligently across healthy nodes.
Sheepdog provides automatic PostgreSQL failover orchestration for Ruby applications. It monitors primary health using health checks, triggers failover to standby servers, and updates application configuration.
pg-replication gem enables building custom replication solutions using PostgreSQL logical replication. Applications consume the replication stream to build derived data stores, implement change data capture, or synchronize data to external systems.
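```ruby
require 'pg'

# Subscribe to logical replication stream
conn = PG.connect(
  host: 'primary.example.com',
  dbname: 'myapp_production',
  replication: 'database'
)

# Create replication slot (this fails if my_slot already exists)
conn.exec("CREATE_REPLICATION_SLOT my_slot LOGICAL pgoutput")

# Start replication from slot
conn.exec(
  "START_REPLICATION SLOT my_slot LOGICAL 0/0 (proto_version '1', publication_names 'my_publication')"
)

# WAL data messages start with a 'w' tag and three 8-byte fields
XLOG_HEADER_BYTES = 25

loop do
  msg = conn.get_copy_data
  break if msg.nil? # Stream ended

  # Skip keepalives ('k'); a long-running consumer must also send
  # standby status updates so the server does not time it out
  next unless msg[0] == 'w'
  payload = msg.byteslice(XLOG_HEADER_BYTES, msg.bytesize)

  # Parse and process replication messages
  case payload.byteslice(0, 1)
  when 'B' # Begin transaction
    # Handle transaction start
  when 'C' # Commit transaction
    # Handle transaction commit
  when 'I' # Insert
    # Handle insert operation
  when 'U' # Update
    # Handle update operation
  when 'D' # Delete
    # Handle delete operation
  end
end
```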
ruby-kafka implements Kafka client functionality including consumer groups that replicate event processing across multiple instances. The library handles partition assignment, offset management, and rebalancing.
redis-rb with Sentinel support provides Redis replication and automatic failover. Applications configure Sentinel addresses, and the client automatically discovers current primary and replica locations.
etcd-client enables building distributed configuration systems with built-in replication. Etcd uses the Raft consensus protocol to replicate configuration data across cluster members with strong consistency.
Consul-Ruby interfaces with HashiCorp Consul for service discovery and distributed configuration backed by replicated key-value storage. Services register themselves, and clients query for available instances, automatically adapting to changes.
Database-Specific Tools include:
PostgreSQL tools:
- pg_basebackup for creating replica base backups
- pg_receivewal for archiving WAL segments
- repmgr for replication management and failover
MySQL tools:
- mysqldump with replication options
- xtrabackup for hot backups
- mysql-utilities for replication management
Monitoring Tools track replication health and performance:
- Prometheus with database exporters for metrics collection
- Grafana for replication lag visualization
- PgHero for PostgreSQL monitoring including replication status
- VividCortex for MySQL replication monitoring
Cloud Provider Services offer managed replication:
- Amazon RDS provides automated replication setup and failover
- Google Cloud SQL handles replication configuration and monitoring
- Azure Database offers automatic replica provisioning
- DigitalOcean Managed Databases include built-in replication
These managed services reduce operational burden but limit configuration options. Applications cannot customize replication timing, conflict resolution, or failover policies beyond provider-supported options.
Reference
Replication Types
| Type | Write Latency | Data Loss Risk | Use Case |
|---|---|---|---|
| Synchronous | High - waits for replicas | None - replicas confirmed | Financial transactions, critical data |
| Asynchronous | Low - immediate return | Possible - unconfirmed changes | High throughput systems, eventual consistency |
| Semi-synchronous | Medium - waits for one replica | Minimal - one replica confirmed | Balanced durability and performance |
| Logical | Medium - SQL-level overhead | Depends on mode | Cross-version, filtered replication |
| Physical | Low - binary copy | Depends on mode | Same-version full database replication |
Consistency Models
| Model | Guarantee | Latency | Complexity |
|---|---|---|---|
| Strong | All nodes identical | High | High - requires coordination |
| Eventual | Converges over time | Low | Medium - conflict resolution |
| Causal | Preserves cause-effect | Medium | High - tracks dependencies |
| Bounded staleness | Lag within threshold | Medium | Medium - monitors lag |
| Read-your-writes | Own writes visible | Low | Low - session tracking |
| Monotonic reads | No time travel | Low | Low - session affinity |
Replication Topologies
| Topology | Pros | Cons | Failure Behavior |
|---|---|---|---|
| Primary-Replica | Simple, clear write path | Single write bottleneck | Manual/auto promote replica |
| Multi-Master | Multiple write locations | Conflict resolution needed | Complex - may split brain |
| Chain | Reduces primary bandwidth | Higher end-to-end latency | Break in chain disrupts downstream |
| Tree | Hierarchical organization | Complex failover | Parent failure affects children |
ActiveRecord Replication Configuration
| Setting | Purpose | Example |
|---|---|---|
| connects_to | Define primary and replica roles | connects_to database: { writing: :primary, reading: :replica } |
| connected_to | Force specific connection | connected_to(role: :writing) { query } |
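| database_selector | Enable automatic role switching; delay sets how long reads stay on the primary after a write | config.active_record.database_selector = { delay: 2.seconds } |
| database_resolver | Class that decides which role serves a request | config.active_record.database_resolver = ActiveRecord::Middleware::DatabaseSelector::Resolver |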
PostgreSQL Replication Commands
| Command | Purpose |
|---|---|
| pg_basebackup -h primary -D /var/lib/postgresql/data -P | Create replica base backup |
| pg_receivewal -h primary -D /var/lib/postgresql/wal_archive | Archive WAL segments |
| SELECT pg_current_wal_lsn() | Get current WAL position |
| SELECT pg_last_wal_receive_lsn() | Get last received WAL position on replica |
| SELECT pg_last_wal_replay_lsn() | Get last replayed WAL position on replica |
MySQL Replication Commands
| Command | Purpose |
|---|---|
| SHOW MASTER STATUS | Display primary binary log position |
| SHOW SLAVE STATUS | Display replica replication status |
| CHANGE MASTER TO MASTER_HOST='host', MASTER_LOG_FILE='file', MASTER_LOG_POS=position | Configure replication |
| START SLAVE | Begin replication process |
| STOP SLAVE | Stop replication process |
Replication Lag Monitoring
| Database | Query/Method | Metric |
|---|---|---|
| PostgreSQL | pg_stat_replication view | Difference between pg_current_wal_lsn() and replay_lsn (bytes); replay_lag (interval) |
| MySQL | SHOW SLAVE STATUS | Seconds_Behind_Master |
| Redis | INFO replication | master_repl_offset difference |
| MongoDB | rs.printSlaveReplicationInfo() | syncedTo timestamp difference |
Common Replication Gems
| Gem | Purpose | Key Features |
|---|---|---|
| makara | Read/write splitting | Automatic failover, health checks, stickiness |
| activerecord-turntable | Sharding and replication | Shard key routing, replica distribution |
| octoshark | Multi-tenant replication | Per-tenant connection routing |
| switchman | Sharding support | Shard management, routing |
Failover Checklist
| Step | Action | Validation |
|---|---|---|
| 1 | Verify primary is truly down | Multiple connection attempts from different sources |
| 2 | Select replica for promotion | Choose most up-to-date replica |
| 3 | Promote replica to primary | Execute promotion command |
| 4 | Update DNS or connection strings | Verify new primary resolvable |
| 5 | Redirect application connections | Test write operations succeed |
| 6 | Reconfigure remaining replicas | Point to new primary |
| 7 | Monitor replication lag | Ensure replicas catching up |
| 8 | Document incident | Record timeline and decisions |
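For step 2 on PostgreSQL, the most up-to-date replica is the one with the highest replayed WAL position. The sketch below compares LSN strings as returned by pg_last_wal_replay_lsn(); it assumes the values have already been collected from each replica, and the helper names are illustrative.

```ruby
# Convert an LSN such as "16/B374D848" (two hexadecimal halves) to an integer.
def lsn_to_i(lsn)
  high, low = lsn.split("/").map { |part| part.to_i(16) }
  (high << 32) | low
end

# replicas: { replica_name => replay LSN string }
# Returns the name of the replica that has replayed the most WAL, or nil if empty.
def most_up_to_date(replicas)
  best = replicas.max_by { |_name, lsn| lsn_to_i(lsn) }
  best && best.first
end

candidates = {
  "replica1" => "0/3000060",
  "replica2" => "0/3000148",
  "replica3" => "0/2FFFFF0"
}
puts most_up_to_date(candidates)
```

Comparing the strings directly would be wrong because the halves are hexadecimal and vary in length, so the conversion to an integer matters.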