Overview
Disaster recovery encompasses the policies, procedures, and technical implementations that enable an organization to recover from catastrophic failures. These failures range from hardware malfunctions and data corruption to complete data center outages caused by natural disasters, cyber attacks, or infrastructure collapse. Unlike high availability, which focuses on minimizing downtime during normal operations, disaster recovery addresses scenarios where primary systems become completely unavailable.
The scope of disaster recovery extends beyond simple data backups. A comprehensive DR plan addresses application state, configuration management, database consistency, service dependencies, network routing, DNS failover, monitoring systems, and the orchestration required to restore operations in an alternate location. The plan must account for both technical restoration and operational coordination, including communication protocols, escalation procedures, and validation steps.
DR strategies operate on two fundamental metrics: Recovery Time Objective (RTO) defines the maximum acceptable downtime, while Recovery Point Objective (RPO) defines the maximum acceptable data loss measured in time. A system with an RTO of 4 hours and RPO of 1 hour can tolerate 4 hours of downtime and may lose up to 1 hour of recent data. These metrics directly influence architectural decisions, infrastructure costs, and operational complexity.
# DR metrics for a Ruby application
class DisasterRecoveryMetrics
attr_reader :rto, :rpo, :last_backup, :last_test
def initialize(rto:, rpo:)
@rto = rto # Recovery Time Objective in seconds
@rpo = rpo # Recovery Point Objective in seconds
end
def backup_frequency
# Returns the interval between backups in seconds; it must not exceed the RPO
@rpo / 2 # Conservative: back up twice as often as the RPO requires
end
def within_rpo?(data_timestamp)
Time.now - data_timestamp <= @rpo
end
def recovery_window_expired?(failure_time)
Time.now - failure_time > @rto
end
end
# Example: E-commerce application requirements
ecommerce_dr = DisasterRecoveryMetrics.new(
rto: 2 * 3600, # 2 hours
rpo: 5 * 60 # 5 minutes
)
puts ecommerce_dr.backup_frequency # => 150 (backup every 2.5 minutes)
The financial and operational impact of disasters varies dramatically by industry. Financial services face regulatory requirements mandating specific RTO/RPO values, often measured in minutes. E-commerce platforms lose revenue directly correlated to downtime duration. Internal business applications may tolerate longer recovery windows but still require data consistency guarantees.
Key Principles
Recovery Time Objective (RTO) represents the target duration from disaster declaration to service restoration. This metric encompasses failure detection time, decision-making time, failover execution, service validation, and traffic cutover. Organizations often underestimate the non-technical components: confirming the disaster scope, authorizing the failover, coordinating with dependent teams, and verifying data integrity. A system capable of technical restoration in 30 minutes may have an effective RTO of 2 hours when human factors are included.
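The gap between technical restoration time and effective RTO becomes easier to see when each phase gets its own budget. The following is a minimal sketch, not a prescribed method: the `RtoBudget` class and every duration in it are hypothetical, chosen so that a 30-minute technical failover adds up to the 2-hour effective RTO described above.

```ruby
# Hypothetical RTO budget: sums the time spent in each recovery phase.
class RtoBudget
  def initialize(phases)
    @phases = phases # { phase_name => duration in seconds }
  end

  def effective_rto
    @phases.values.sum
  end

  def within_target?(target_seconds)
    effective_rto <= target_seconds
  end

  # Phase names sorted from longest to shortest, to show where time goes.
  def largest_contributors(count = 2)
    @phases.sort_by { |_, seconds| -seconds }.first(count).map(&:first)
  end
end

# Illustrative durations only (assumptions, not measurements)
budget = RtoBudget.new(
  detection: 15 * 60,
  decision_and_authorization: 40 * 60,
  failover_execution: 30 * 60,
  validation: 25 * 60,
  traffic_cutover: 10 * 60
)

puts budget.effective_rto / 60           # => 120 (minutes)
puts budget.within_target?(2 * 3600)     # => true
puts budget.largest_contributors.inspect # => [:decision_and_authorization, :failover_execution]
```

In this example the human decision phase, not the failover itself, is the largest contributor, which is the pattern the paragraph above warns about.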
Recovery Point Objective (RPO) quantifies acceptable data loss in time units. An RPO of zero requires synchronous replication to multiple locations, introducing latency and complexity. An RPO of 15 minutes permits asynchronous replication or periodic snapshots. RPO directly impacts backup frequency, replication strategies, and storage costs. Applications with strict RPO requirements must implement continuous data protection mechanisms rather than periodic backups.
# RPO implementation with continuous backup
class ContinuousBackupManager
def initialize(primary_db:, backup_db:, rpo_seconds:)
@primary_db = primary_db
@backup_db = backup_db
@rpo_seconds = rpo_seconds
@pending_operations = Queue.new
@last_sync = Time.now
end
def record_operation(operation)
# Queue operation for replication
@pending_operations << {
timestamp: Time.now,
operation: operation
}
# Force sync if approaching RPO limit
sync_pending if time_since_sync >= @rpo_seconds * 0.8
end
def sync_pending
operations = []
operations << @pending_operations.pop until @pending_operations.empty?
@backup_db.transaction do
operations.each { |op| @backup_db.execute(op[:operation]) }
end
@last_sync = Time.now
end
private
def time_since_sync
Time.now - @last_sync
end
end
Redundancy forms the foundation of disaster recovery. Geographic redundancy distributes systems across multiple physical locations, protecting against regional failures. Component redundancy duplicates critical infrastructure elements. Data redundancy maintains multiple copies of persistent state. The principle of redundancy applies at every system layer: multiple availability zones, redundant network paths, replicated databases, and distributed service instances.
Failover mechanisms detect failures and redirect traffic to backup systems. Automated failover reduces RTO but introduces complexity and false-positive risks. Manual failover provides control but extends recovery time. The detection mechanism must distinguish actual disasters from transient failures. Premature failover can cause data inconsistency, while delayed detection increases downtime. Health checks, consensus protocols, and multiple validation signals help balance these concerns.
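One way to balance false positives against detection delay is to require agreement between several independent signals over several consecutive check rounds. The sketch below assumes a hypothetical `FailoverDecider` class and made-up signal names; real deployments would feed it results from their own probes.

```ruby
# Hypothetical failover decision: several independent signals must agree,
# for several consecutive rounds, before a disaster is declared.
class FailoverDecider
  def initialize(quorum:, consecutive_rounds:)
    @quorum = quorum                       # unhealthy votes needed per round
    @consecutive_rounds = consecutive_rounds
    @failed_rounds = 0
  end

  # signals: { name => true (healthy) / false (unhealthy) }
  # Returns true when failover should be triggered.
  def observe(signals)
    unhealthy_votes = signals.values.count(false)
    if unhealthy_votes >= @quorum
      @failed_rounds += 1
    else
      @failed_rounds = 0 # a transient failure resets the streak
    end
    @failed_rounds >= @consecutive_rounds
  end
end

decider = FailoverDecider.new(quorum: 2, consecutive_rounds: 3)
blip   = { http_probe: false, db_probe: true,  external_monitor: true }
outage = { http_probe: false, db_probe: false, external_monitor: false }

rounds = [blip, outage, outage, blip, outage, outage, outage]
results = rounds.map { |signals| decider.observe(signals) }
puts results.inspect
# => [false, false, false, false, false, false, true]
```

A single failing probe never triggers failover here, and an interrupted streak starts over, so only a sustained outage confirmed by a quorum of signals causes the switch.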
Backup strategies differ in granularity, frequency, and retention. Full backups capture complete system state but consume significant storage and time. Incremental backups record changes since the last backup, reducing resource usage but complicating restoration. Differential backups capture changes since the last full backup, balancing storage efficiency and restoration complexity. Continuous data protection streams changes in real-time, minimizing RPO at the cost of infrastructure complexity.
# Backup strategy implementation
class BackupStrategy
attr_reader :type, :schedule, :retention_days
def initialize(type:, schedule:, retention_days:)
@type = type # :full, :incremental, :differential
@schedule = schedule
@retention_days = retention_days
end
def create_backup(data_source, backup_target)
case @type
when :full
create_full_backup(data_source, backup_target)
when :incremental
create_incremental_backup(data_source, backup_target)
when :differential
create_differential_backup(data_source, backup_target)
end
end
private
# create_differential_backup and find_last_backup are referenced above but not shown here
def create_full_backup(source, target)
timestamp = Time.now.to_i
backup_path = "#{target}/full_#{timestamp}.backup"
File.open(backup_path, 'w') do |f|
source.each_record do |record|
f.puts record.to_json
end
end
{ type: :full, path: backup_path, timestamp: timestamp }
end
def create_incremental_backup(source, target)
last_backup = find_last_backup(target)
timestamp = Time.now.to_i
backup_path = "#{target}/incremental_#{timestamp}.backup"
File.open(backup_path, 'w') do |f|
source.changes_since(last_backup[:timestamp]).each do |change|
f.puts change.to_json
end
end
{ type: :incremental, path: backup_path, timestamp: timestamp }
end
end
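The restoration complexity mentioned for incremental and differential backups comes from having to pick the right chain of files. A minimal sketch of that selection follows, assuming backup metadata shaped like the hashes returned above and assuming that each incremental records changes since the previous backup of any type. The `restore_chain` helper is hypothetical.

```ruby
# Hypothetical restore planner: picks the backups needed to reach the most
# recent state. Each backup is a hash with :type and :timestamp keys.
def restore_chain(backups)
  ordered = backups.sort_by { |b| b[:timestamp] }
  last_full_index = ordered.rindex { |b| b[:type] == :full }
  raise ArgumentError, 'no full backup available' unless last_full_index

  full = ordered[last_full_index]
  later = ordered[(last_full_index + 1)..]

  # The newest differential replaces everything between it and the full backup.
  differential = later.select { |b| b[:type] == :differential }.last
  base_time = differential ? differential[:timestamp] : full[:timestamp]

  # Incrementals are needed only after the newest differential (or the full).
  incrementals = later.select do |b|
    b[:type] == :incremental && b[:timestamp] > base_time
  end

  [full, differential, *incrementals].compact
end

backups = [
  { type: :full,         timestamp: 100 },
  { type: :incremental,  timestamp: 200 },
  { type: :differential, timestamp: 300 },
  { type: :incremental,  timestamp: 400 },
  { type: :incremental,  timestamp: 500 }
]

puts restore_chain(backups).map { |b| b[:timestamp] }.inspect
# => [100, 300, 400, 500]
```

Every file in the chain must be intact for the restore to succeed, which is why long incremental chains trade storage savings for restoration risk.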
Data consistency during disaster recovery requires careful transaction management. Applications must handle in-flight transactions at the moment of failure. Synchronous replication guarantees consistency but impacts performance. Asynchronous replication improves performance but may lose recent transactions. The choice depends on application requirements: financial transactions demand consistency, while analytics workloads may tolerate eventual consistency.
Testing validates disaster recovery procedures before actual disasters occur. DR drills identify gaps in documentation, expose infrastructure issues, and train personnel. Tests should simulate realistic failure scenarios including partial failures, network partitions, and cascading failures. Regular testing prevents configuration drift where documented procedures diverge from actual system state.
Implementation Approaches
The backup and restore approach represents the most basic DR strategy. The system regularly backs up data and configuration to durable storage. During a disaster, operators provision new infrastructure and restore from backups. This approach minimizes costs as no standby infrastructure runs continuously. The RTO spans hours to days depending on backup size, restoration speed, and infrastructure provisioning time. The worst-case RPO equals the interval between backups, because everything written after the last completed backup is lost. This strategy suits applications with relaxed recovery requirements and cost sensitivity.
# Backup and restore implementation
class BackupRestoreStrategy
def initialize(storage_adapter:, backup_frequency:)
@storage = storage_adapter
@backup_frequency = backup_frequency
@scheduler = BackupScheduler.new(frequency: backup_frequency)
end
def start_backup_schedule
@scheduler.schedule do
perform_backup
end
end
def perform_backup
timestamp = Time.now.to_i
backup_id = "backup_#{timestamp}"
# Capture database state
db_snapshot = capture_database_snapshot
# Capture application configuration
config_snapshot = capture_configuration
# Capture uploaded files
file_snapshot = capture_file_storage
# Bundle and upload to storage
backup_bundle = {
id: backup_id,
timestamp: timestamp,
database: db_snapshot,
configuration: config_snapshot,
files: file_snapshot
}
@storage.upload(backup_id, backup_bundle)
cleanup_old_backups
end
def restore_from_backup(backup_id)
backup_bundle = @storage.download(backup_id)
# Restore in correct order to maintain consistency
restore_configuration(backup_bundle[:configuration])
restore_database(backup_bundle[:database])
restore_file_storage(backup_bundle[:files])
verify_restoration
end
private
def cleanup_old_backups
retention_period = 30 * 24 * 3600 # 30 days
cutoff_time = Time.now.to_i - retention_period
@storage.list_backups.each do |backup|
@storage.delete(backup[:id]) if backup[:timestamp] < cutoff_time
end
end
end
Pilot light maintains minimal infrastructure in the secondary location with data replication active. Core database instances run but handle no production traffic. Application servers remain stopped but can be launched rapidly. During disaster, operators scale up the secondary environment and redirect traffic. This approach reduces RTO compared to backup/restore by eliminating data restoration time. Infrastructure costs remain moderate as only essential services run continuously.
Warm standby runs a scaled-down version of the production environment continuously. Application servers handle health checks and potentially serve read-only traffic. Databases replicate from the primary site. During disaster, operators scale up capacity and redirect production traffic. This approach achieves RTO measured in minutes to hours. The ongoing infrastructure costs are moderate, representing 20-50% of full production capacity.
Hot standby or active-passive maintains full production capacity in the secondary location. The secondary environment mirrors primary environment specifications. Traffic routing mechanisms can redirect users in minutes or seconds. Some implementations use DNS failover, while others employ global load balancers. This approach provides the lowest RTO of the active-passive strategies, at a high infrastructure cost, because full redundant capacity runs continuously without serving production traffic.
# Hot standby with health checking and failover
require 'net/http'
require 'uri'
class HotStandbyManager
def initialize(primary_endpoint:, secondary_endpoint:, health_check_interval: 30)
@primary = primary_endpoint
@secondary = secondary_endpoint
@health_check_interval = health_check_interval
@current_active = :primary
@consecutive_failures = 0
@failure_threshold = 3
end
def start_monitoring
Thread.new do
loop do
check_health_and_failover
sleep @health_check_interval
end
end
end
def check_health_and_failover
primary_healthy = health_check(@primary)
secondary_healthy = health_check(@secondary)
if @current_active == :primary
if !primary_healthy
@consecutive_failures += 1
if @consecutive_failures >= @failure_threshold && secondary_healthy
perform_failover_to_secondary
end
else
@consecutive_failures = 0
end
elsif @current_active == :secondary
if primary_healthy && !secondary_healthy
perform_failback_to_primary
end
end
end
def health_check(endpoint)
response = Net::HTTP.get_response(URI("#{endpoint}/health"))
response.code == '200' && response.body.include?('healthy')
rescue StandardError
false
end
def perform_failover_to_secondary
puts "Primary unhealthy, failing over to secondary"
update_dns_routing(@secondary)
update_load_balancer(@secondary)
@current_active = :secondary
@consecutive_failures = 0
send_alert("Failover to secondary completed")
end
def perform_failback_to_primary
puts "Primary recovered, failing back"
update_dns_routing(@primary)
update_load_balancer(@primary)
@current_active = :primary
send_alert("Failback to primary completed")
end
end
Active-active or multi-site active deployments run production traffic across multiple locations simultaneously. Global load balancers distribute users based on geography, latency, or availability. Each location handles production load and can absorb additional traffic if another site fails. This approach provides the best RTO and RPO, often achieving near-zero downtime. The complexity lies in data consistency across sites, requiring distributed databases or conflict resolution strategies. Infrastructure costs are highest as multiple full-capacity sites operate concurrently.
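As one illustration of a conflict resolution strategy, the sketch below applies last-write-wins to records that two sites modified independently. The record shape (`value`, `updated_at`, `site`), the key names, and both helper functions are assumptions made for this example. Last-write-wins silently discards the older of two concurrent writes, so it suits only data where that loss is acceptable.

```ruby
# Hypothetical last-write-wins resolution. Ties on updated_at break on the
# site name, so both sites pick the same winner and converge.
def resolve_conflict(version_a, version_b)
  [version_a, version_b].max_by { |v| [v[:updated_at], v[:site]] }
end

# Merges two sites' views of the same keyspace.
def merge_sites(site_a, site_b)
  (site_a.keys | site_b.keys).each_with_object({}) do |key, merged|
    a = site_a[key]
    b = site_b[key]
    merged[key] = a && b ? resolve_conflict(a, b) : (a || b)
  end
end

us = {
  'cart:1' => { value: 'A', updated_at: 10, site: 'us-east' },
  'cart:2' => { value: 'B', updated_at: 5,  site: 'us-east' }
}
eu = {
  'cart:1' => { value: 'C', updated_at: 12, site: 'eu-west' },
  'cart:3' => { value: 'D', updated_at: 7,  site: 'eu-west' }
}

merged = merge_sites(us, eu)
puts merged['cart:1'][:value]         # => C
puts merged == merge_sites(eu, us)    # => true
```

The merge gives the same result regardless of argument order, which is the property that lets each site apply it locally and still agree with the other.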
Data replication strategies underpin all DR approaches beyond basic backup/restore. Synchronous replication writes data to multiple locations before acknowledging success. This guarantees zero data loss but introduces latency proportional to geographic distance. Asynchronous replication acknowledges writes before secondary sites confirm receipt, minimizing latency but risking data loss during failures. Semi-synchronous replication requires acknowledgment from at least one secondary, balancing consistency and performance.
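The latency trade-off between the three modes can be approximated with a small model. This is a sketch under simplifying assumptions: the primary sends each write to all secondaries in parallel, and the round-trip times below are illustrative values, not measurements. The function name is hypothetical.

```ruby
# Hypothetical model: commit latency added by replication, given the
# round-trip time in milliseconds from the primary to each secondary.
def replication_commit_latency_ms(mode, round_trips_ms)
  case mode
  when :sync      then round_trips_ms.max # wait for the slowest secondary
  when :semi_sync then round_trips_ms.min # wait for the fastest secondary
  when :async     then 0                  # do not wait for any secondary
  else raise ArgumentError, "unknown mode: #{mode}"
  end
end

# Assumed round trips: same region, cross-country, cross-ocean
round_trips = [2, 70, 150]

%i[sync semi_sync async].each do |mode|
  puts "#{mode}: #{replication_commit_latency_ms(mode, round_trips)} ms"
end
# Prints:
#   sync: 150 ms
#   semi_sync: 2 ms
#   async: 0 ms
```

Under these assumptions, synchronous replication pays for the most distant site on every write, while semi-synchronous replication pays only for the nearest one and leaves the distant sites exposed to the same loss window as asynchronous replication.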
Ruby Implementation
Ruby applications require careful state management for disaster recovery. Session state, background job queues, uploaded files, and database transactions must all survive failover. Stateless application design simplifies DR by storing all persistent state in databases or external services. Stateful applications must serialize and replicate state across locations.
# Stateless session management for DR
require 'json'
require 'securerandom'
class DisasterRecoverySession
def initialize(session_store:)
@session_store = session_store # Redis, Memcached, or DB
end
def create_session(user_id, data)
session_id = SecureRandom.uuid
session_data = {
user_id: user_id,
created_at: Time.now.to_i,
data: data
}
# Store in replicated session store
@session_store.set(session_key(session_id), session_data.to_json)
@session_store.expire(session_key(session_id), 24 * 3600)
session_id
end
def get_session(session_id)
session_json = @session_store.get(session_key(session_id))
return nil unless session_json
JSON.parse(session_json, symbolize_names: true)
end
def destroy_session(session_id)
@session_store.del(session_key(session_id))
end
private
def session_key(session_id)
"session:#{session_id}"
end
end
Database backup strategies in Ruby applications typically use database-specific tools wrapped in Ruby scripts. PostgreSQL applications use pg_dump for logical backups or filesystem snapshots for physical backups. MySQL applications use mysqldump or Percona XtraBackup. The Ruby code orchestrates backup scheduling, manages retention, uploads to object storage, and handles encryption.
# PostgreSQL backup orchestration
require 'openssl'
require 'uri'
class PostgreSQLBackupManager
def initialize(database_url:, s3_bucket:, encryption_key:)
@database_url = URI.parse(database_url)
@s3_bucket = s3_bucket
@encryption_key = encryption_key
end
def create_backup
timestamp = Time.now.strftime('%Y%m%d_%H%M%S')
backup_filename = "pg_backup_#{timestamp}.sql.gz"
local_path = "/tmp/#{backup_filename}"
# Execute pg_dump with compression
pg_dump_cmd = build_pg_dump_command(local_path)
success = system(pg_dump_cmd)
raise "Backup failed" unless success && File.exist?(local_path)
# Encrypt backup
encrypted_path = encrypt_file(local_path)
# Upload to S3
s3_key = "backups/postgresql/#{backup_filename}.enc"
upload_to_s3(encrypted_path, s3_key)
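# Record the size before cleanup; File.size fails once the file is deleted
encrypted_size = File.size(encrypted_path)
# Cleanup local files
File.delete(local_path) if File.exist?(local_path)
File.delete(encrypted_path) if File.exist?(encrypted_path)
{ timestamp: timestamp, s3_key: s3_key, size: encrypted_size }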
end
def restore_backup(s3_key)
# Download from S3
local_encrypted = "/tmp/backup.sql.gz.enc"
download_from_s3(s3_key, local_encrypted)
# Decrypt
local_decrypted = decrypt_file(local_encrypted)
# Restore using psql
restore_cmd = build_restore_command(local_decrypted)
success = system(restore_cmd)
# Cleanup
File.delete(local_encrypted)
File.delete(local_decrypted)
raise "Restore failed" unless success
end
private
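def build_pg_dump_command(output_path)
host = @database_url.host
port = @database_url.port || 5432
database = @database_url.path[1..]
user = @database_url.user
# Plain-format dump compressed by pg_dump itself (-Z), matching the .sql.gz
# name and a psql-based restore after decompression. Writing with -f avoids a
# shell pipe, whose exit status would hide a pg_dump failure from system().
"PGPASSWORD='#{@database_url.password}' pg_dump -h #{host} -p #{port} " \
"-U #{user} -Z 6 -f #{output_path} #{database}"
end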
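def encrypt_file(input_path)
output_path = "#{input_path}.enc"
cipher = OpenSSL::Cipher.new('aes-256-cbc')
cipher.encrypt
cipher.key = @encryption_key # must be exactly 32 bytes for AES-256
iv = cipher.random_iv # fresh IV per backup; CBC must not reuse a fixed IV
File.open(output_path, 'wb') do |outfile|
outfile.write(iv) # decrypt_file must read these 16 bytes back before decrypting
File.open(input_path, 'rb') do |infile|
# Stream in chunks so large dumps are not loaded into memory at once
while (chunk = infile.read(64 * 1024))
outfile.write(cipher.update(chunk))
end
end
outfile.write(cipher.final)
end
output_path
end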
end
Background job queues require careful handling during failover. Sidekiq, DelayedJob, and Resque store job state in Redis or databases. The DR strategy must account for jobs in-flight during failure. Idempotent job design allows safe retry after failover. Job priorities ensure critical operations process first during recovery. Dead letter queues capture repeatedly failing jobs for manual investigation.
# DR-aware background job processing
class DisasterRecoveryJob
include Sidekiq::Job
sidekiq_options retry: 5, dead: true
def perform(operation_id, params)
# Check if operation already completed (idempotency)
return if OperationLog.completed?(operation_id)
begin
result = execute_operation(params)
# Record completion to prevent duplicate execution after failover
OperationLog.record_completion(
operation_id: operation_id,
result: result,
timestamp: Time.now
)
rescue StandardError => e
# Log error for DR analysis. The current retry count lives in the job payload
# (visible to server middleware), not in the class-level sidekiq_options,
# so it is not recorded here.
OperationLog.record_failure(
operation_id: operation_id,
error: e.message,
timestamp: Time.now
)
raise # Re-raise to trigger Sidekiq retry
end
end
private
def execute_operation(params)
# Actual operation implementation
end
end
File uploads present challenges for DR as files typically reside in filesystem storage or object storage. DR strategies must ensure uploaded files replicate to secondary locations. Object storage services like AWS S3 provide cross-region replication. Applications using local filesystem storage require rsync replication or distributed filesystems. Ruby applications should abstract storage behind adapters that support replication.
# Storage adapter with replication support
class ReplicatedStorageAdapter
def initialize(primary:, secondary:, replication_mode: :async)
@primary = primary
@secondary = secondary
@replication_mode = replication_mode
end
def upload(key, data)
# Write to primary
@primary.put(key, data)
# Replicate to secondary
if @replication_mode == :sync
@secondary.put(key, data)
else
replicate_async(key, data)
end
end
def download(key)
# Try primary first, fallback to secondary
@primary.get(key)
rescue StorageError
@secondary.get(key)
end
def delete(key)
@primary.delete(key)
@secondary.delete(key)
end
private
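def replicate_async(key, _data)
# Enqueue only the key: job arguments are serialized to JSON, so the worker
# (ReplicationJob, not shown) should re-read the object from primary storage
ReplicationJob.perform_async(key, @secondary.endpoint)
end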
end
Health checks provide the foundation for automated failover detection. Ruby applications expose health check endpoints that verify database connectivity, dependency availability, and application functionality. Health checks must distinguish between transient issues and catastrophic failures to prevent false-positive failovers.
# Comprehensive health check implementation
class HealthCheckController < ApplicationController
def show
checks = {
database: check_database,
redis: check_redis,
storage: check_storage,
external_api: check_external_dependencies
}
overall_healthy = checks.values.all? { |check| check[:healthy] }
status_code = overall_healthy ? 200 : 503
render json: {
status: overall_healthy ? 'healthy' : 'unhealthy',
timestamp: Time.now.to_i,
checks: checks
}, status: status_code
end
private
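def check_database
# Measure the round trip with a monotonic clock instead of reporting a fixed 0
start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
ActiveRecord::Base.connection.execute('SELECT 1')
latency = ((Process.clock_gettime(Process::CLOCK_MONOTONIC) - start) * 1000).round
{ healthy: true, latency_ms: latency }
rescue StandardError => e
{ healthy: false, error: e.message }
end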
def check_redis
start = Time.now
$redis.ping
latency = ((Time.now - start) * 1000).round
{ healthy: true, latency_ms: latency }
rescue StandardError => e
{ healthy: false, error: e.message }
end
def check_storage
# Verify can write and read
test_key = "health_check_#{SecureRandom.hex(8)}"
Storage.upload(test_key, 'test')
Storage.download(test_key)
Storage.delete(test_key)
{ healthy: true }
rescue StandardError => e
{ healthy: false, error: e.message }
end
end
Tools & Ecosystem
AWS provides extensive DR capabilities through services spanning multiple layers. Route 53 health checks monitor endpoint availability and automatically update DNS records during failures. RDS automated backups create daily snapshots with transaction log archival. Aurora Global Database replicates data across regions with replication lag that is typically under one second. S3 Cross-Region Replication copies objects to multiple regions. CloudFormation templates enable rapid infrastructure reproduction. AWS Backup centralizes backup management across services.
# AWS Route 53 health check management
require 'aws-sdk-route53'
require 'securerandom'
class Route53HealthCheckManager
def initialize
@client = Aws::Route53::Client.new
end
def create_health_check(domain:, port:, path:)
response = @client.create_health_check({
# caller_reference is required; a unique value makes the request safe to retry
caller_reference: SecureRandom.uuid,
health_check_config: {
type: 'HTTPS',
resource_path: path,
fully_qualified_domain_name: domain,
port: port,
request_interval: 30,
failure_threshold: 3
}
})
response.health_check.id
end
def update_dns_failover(hosted_zone_id:, record_name:, primary_ip:, secondary_ip:, health_check_id:)
# Create primary record set with health check
@client.change_resource_record_sets({
hosted_zone_id: hosted_zone_id,
change_batch: {
changes: [
{
action: 'UPSERT',
resource_record_set: {
name: record_name,
type: 'A',
set_identifier: 'Primary',
failover: 'PRIMARY',
health_check_id: health_check_id,
ttl: 60,
resource_records: [{ value: primary_ip }]
}
},
{
action: 'UPSERT',
resource_record_set: {
name: record_name,
type: 'A',
set_identifier: 'Secondary',
failover: 'SECONDARY',
ttl: 60,
resource_records: [{ value: secondary_ip }]
}
}
]
}
})
end
end
PostgreSQL streaming replication provides built-in database replication. The primary server streams write-ahead log (WAL) records to standby servers. Standby servers apply changes continuously, maintaining synchronized copies. Ruby applications interact with streaming replication through connection management and failover coordination. The pg gem connects to a standby like any other PostgreSQL server, so read traffic can be directed at replicas; routing between the primary and its replicas is the responsibility of the application or framework.
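Before promoting a standby, it helps to know how far behind the primary it is. PostgreSQL reports WAL positions as log sequence numbers (LSNs), for example from `pg_current_wal_lsn()` on the primary and `pg_last_wal_replay_lsn()` on a standby. The sketch below shows only the arithmetic on values already fetched; the helper names and the lag threshold are assumptions for illustration.

```ruby
# Converts a PostgreSQL LSN such as "16/B374D848" to an absolute byte position.
# The text form is two hexadecimal numbers: the high and low 32 bits.
def lsn_to_bytes(lsn)
  high, low = lsn.split('/').map { |part| part.to_i(16) }
  (high << 32) + low
end

# Bytes of WAL the standby has not yet replayed.
def replication_lag_bytes(primary_lsn, standby_replay_lsn)
  lsn_to_bytes(primary_lsn) - lsn_to_bytes(standby_replay_lsn)
end

# Hypothetical promotion guard: the threshold is an operational choice.
def standby_promotable?(primary_lsn, standby_replay_lsn, max_lag_bytes:)
  replication_lag_bytes(primary_lsn, standby_replay_lsn) <= max_lag_bytes
end

puts replication_lag_bytes('0/3000148', '0/3000060')                     # => 232
puts standby_promotable?('0/3000148', '0/3000060', max_lag_bytes: 1024)  # => true
```

Byte lag is only a proxy for the RPO: during a real outage the primary's current position may be unknown, so the last observed value is the best available estimate of what could be lost.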
Consul provides service discovery and health checking critical for DR scenarios. Applications register services with health check definitions. During failover, Consul automatically removes unhealthy services from discovery. Ruby applications use the Diplomat gem to interact with Consul APIs for service registration, discovery, and key-value storage.
# Consul integration for service discovery and failover
require 'diplomat'
class ConsulServiceManager
def register_service(name:, address:, port:, health_check_path:)
Diplomat::Service.register(
{
name: name,
address: address,
port: port,
check: {
http: "http://#{address}:#{port}#{health_check_path}",
interval: '10s',
timeout: '5s'
}
}
)
end
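def discover_healthy_services(service_name)
# Diplomat::Service.get only accepts :first or :all as its scope, so query the
# health endpoint instead and keep the instances whose checks are passing
entries = Diplomat::Health.service(service_name, passing: true)
entries.map do |entry|
service = entry.Service
{
address: service['Address'],
port: service['Port'],
id: service['ID']
}
end
end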
def deregister_service(service_id)
Diplomat::Service.deregister(service_id)
end
end
Terraform enables infrastructure-as-code for disaster recovery. DR environments can be defined as code and rapidly provisioned when needed. Terraform modules encapsulate reusable infrastructure patterns. State files track deployed resources. Ruby applications can shell out to Terraform commands or use the Terraform API through SDKs.
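Shelling out to the Terraform CLI might look like the following sketch. The `TerraformRunner` class, the directory, and the variable file name are placeholders; the flags used (`-chdir`, `-input=false`, `-auto-approve`, `-var-file`) are standard Terraform CLI options. Building the command as an array avoids shell interpolation of paths.

```ruby
require 'open3'

# Hypothetical wrapper around the Terraform CLI for provisioning a DR site.
class TerraformRunner
  def initialize(working_dir:, var_file:)
    @working_dir = working_dir
    @var_file = var_file
  end

  # Builds argv as an array so that no shell parsing takes place.
  def apply_command
    ['terraform', "-chdir=#{@working_dir}", 'apply',
     '-input=false', '-auto-approve', "-var-file=#{@var_file}"]
  end

  # Runs terraform and returns its combined output, raising on failure.
  def apply!
    output, status = Open3.capture2e(*apply_command)
    raise "terraform apply failed:\n#{output}" unless status.success?

    output
  end
end

runner = TerraformRunner.new(working_dir: 'infra/dr-site', var_file: 'dr.tfvars')
puts runner.apply_command.join(' ')
```

Skipping interactive approval is appropriate only for a rehearsed DR runbook; during drills it may be safer to run `terraform plan` first and review the output.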
Docker and container orchestration platforms like Kubernetes facilitate DR through portable application packaging. Container images bundle application code and dependencies. Kubernetes manifests define deployment topology. During DR failover, operators apply manifests to secondary clusters. Ruby applications containerized with Docker gain portability across environments.
Database backup tools integrate with Ruby applications for automated backup management. pgbackrest provides advanced PostgreSQL backup features including incremental backups, parallel processing, and cloud storage integration. Percona XtraBackup offers hot backup capabilities for MySQL. Ruby scripts orchestrate these tools, manage schedules, and monitor backup health.
Object storage services provide durable backup destinations. AWS S3, Google Cloud Storage, and Azure Blob Storage are designed for at least 99.999999999% (eleven nines) annual durability by replicating data across multiple devices and, depending on the storage class or redundancy option, multiple facilities. Ruby applications use the aws-sdk-s3, google-cloud-storage, or azure-storage-blob gems. These services support versioning, lifecycle policies, and cross-region replication.
Testing Approaches
DR testing validates recovery procedures, identifies gaps, and trains personnel. Tests must simulate realistic failure scenarios without impacting production systems. Test frequency depends on RTO requirements and change rate. Applications with stringent RTO targets require monthly or quarterly testing. Less critical systems may test annually.
Tabletop exercises represent the simplest testing approach. Teams walk through DR procedures discussing each step without executing actions. Participants identify unclear instructions, missing information, or procedural gaps. Tabletop exercises suit initial DR plan validation and require minimal resources. The limitation is lack of technical validation.
# Tabletop exercise checklist generator
class TabletopExerciseGenerator
def generate_exercise_plan(dr_plan:)
{
scenario: generate_scenario(dr_plan.threat_model),
objectives: [
'Validate communication procedures',
'Identify documentation gaps',
'Confirm role assignments',
'Review decision criteria'
],
participants: required_participants(dr_plan),
duration: '2-3 hours',
agenda: [
{ time: '0:00', activity: 'Scenario introduction' },
{ time: '0:15', activity: 'Initial response walkthrough' },
{ time: '0:45', activity: 'Technical recovery discussion' },
{ time: '1:30', activity: 'Service restoration review' },
{ time: '2:00', activity: 'Lessons learned' }
],
success_criteria: [
'All participants understand their roles',
'Critical gaps documented',
'Timeline realistic',
'Dependencies identified'
]
}
end
end
Simulation testing executes DR procedures in isolated test environments. Teams provision secondary infrastructure, restore from backups, and validate application functionality. Simulation tests verify technical procedures work correctly. The test environment should mirror production specifications to ensure realistic validation. Ruby applications can be deployed to test environments using the same deployment pipelines as production.
# DR simulation test orchestration
class DisasterRecoverySimulation
def initialize(test_environment:, backup_source:)
@test_env = test_environment
@backup_source = backup_source
@test_results = []
end
def run_full_simulation
test_start = Time.now
begin
# Phase 1: Infrastructure provisioning
record_phase('Infrastructure Provisioning') do
provision_test_infrastructure
end
# Phase 2: Backup restoration
record_phase('Backup Restoration') do
restore_latest_backup
end
# Phase 3: Application deployment
record_phase('Application Deployment') do
deploy_application_to_test_env
end
# Phase 4: Smoke tests
record_phase('Smoke Testing') do
run_smoke_tests
end
# Phase 5: Data validation
record_phase('Data Validation') do
validate_data_integrity
end
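rescue StandardError
# record_phase has already captured the failure; remaining phases are skipped
ensure
cleanup_test_environment
end
# Build the report even when a phase failed, so overall_success can be false
generate_test_report(Time.now - test_start)
end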
private
def record_phase(phase_name)
start_time = Time.now
success = false
error = nil
begin
yield
success = true
rescue StandardError => e
error = e.message
raise
ensure
duration = Time.now - start_time
@test_results << {
phase: phase_name,
success: success,
duration: duration,
error: error
}
end
end
def generate_test_report(total_duration)
{
test_date: Time.now,
total_duration: total_duration,
phases: @test_results,
overall_success: @test_results.all? { |r| r[:success] },
rto_met: total_duration <= @test_env.target_rto,
recommendations: generate_recommendations
}
end
end
Partial failover tests validate specific components without full disaster simulation. Database failover tests promote a replica and switch traffic to it. DNS failover tests update routing rules. Load balancer failover tests redirect traffic between availability zones. Partial tests reduce risk and complexity while validating critical components.
Production failover tests execute actual failover to DR sites with production traffic. These tests provide the highest confidence but carry the highest risk. Organizations with active-active deployments can gradually shift traffic to secondary sites, validate operation, and shift back. Single-site deployments require maintenance windows for failover tests.
Automated testing integrates DR validation into continuous integration pipelines. Backup restoration tests run automatically after each backup. Health check tests validate monitoring systems. Configuration tests verify infrastructure definitions remain deployable. Automated tests provide continuous validation as systems evolve.
# Automated DR validation tests
require 'rspec'
RSpec.describe 'Disaster Recovery Validation' do
let(:backup_manager) { BackupManager.new }
let(:health_checker) { HealthChecker.new }
describe 'Backup Restoration' do
it 'restores latest backup successfully' do
latest_backup = backup_manager.find_latest_backup
test_db = provision_test_database
restore_result = backup_manager.restore(
backup: latest_backup,
target: test_db
)
expect(restore_result.success?).to be true
expect(test_db.record_count).to eq latest_backup.record_count
end
it 'completes restoration within RTO' do
start_time = Time.now
backup_manager.restore_latest
duration = Time.now - start_time
expect(duration).to be < RTO_SECONDS
end
end
describe 'Health Monitoring' do
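it 'detects primary endpoint failure' do
simulate_primary_failure
# Failover needs several consecutive failed checks, so wait for the whole
# threshold (FAILURE_THRESHOLD is assumed to be defined alongside HEALTH_CHECK_INTERVAL)
sleep HEALTH_CHECK_INTERVAL * FAILURE_THRESHOLD + 1
expect(health_checker.primary_healthy?).to be false
expect(health_checker.failover_triggered?).to be true
end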
end
describe 'Data Integrity' do
it 'maintains referential integrity after restore' do
restored_db = restore_from_latest_backup
integrity_check = run_integrity_checks(restored_db)
expect(integrity_check.orphaned_records).to be_empty
expect(integrity_check.missing_foreign_keys).to be_empty
end
end
end
Real-World Applications
Financial services applications require aggressive RTO and RPO targets due to regulatory requirements and revenue impact. Trading platforms maintain hot standby systems with synchronous replication across data centers. Databases use multi-master replication or consensus protocols. Application servers run in multiple regions with global load balancing. Session state resides in distributed caches replicated across sites. DR drills occur quarterly with partial failover tests monthly.
E-commerce platforms balance DR costs against revenue loss during outages. Peak shopping periods like holidays demand minimal downtime. Many platforms use warm standby configurations with rapid scale-up capabilities. Product catalogs replicate through content delivery networks. Order processing systems maintain event logs enabling reconstruction after failures. Payment processing integrates with multiple providers for redundancy.
# E-commerce order processing with DR resilience
class ResilientOrderProcessor
def initialize
@primary_payment_gateway = StripeGateway.new
@backup_payment_gateway = BraintreeGateway.new
@order_event_log = EventLog.new(storage: ReplicatedStorage.new)
end
def process_order(order)
# Log order event before processing
event_id = @order_event_log.append(
event_type: 'order_received',
order_id: order.id,
timestamp: Time.now,
payload: order.to_json
)
begin
# Attempt payment with primary gateway
payment_result = @primary_payment_gateway.charge(
amount: order.total,
customer: order.customer_id
)
@order_event_log.append(
event_type: 'payment_processed',
order_id: order.id,
gateway: 'stripe',
transaction_id: payment_result.transaction_id
)
rescue PaymentGatewayError => e
# Failover to backup gateway
payment_result = @backup_payment_gateway.charge(
amount: order.total,
customer: order.customer_id
)
@order_event_log.append(
event_type: 'payment_processed',
order_id: order.id,
gateway: 'braintree',
transaction_id: payment_result.transaction_id,
note: "Primary gateway failed (#{e.message}), used backup"
)
end
order.complete!
end
def recover_incomplete_orders
# After disaster recovery, replay events to complete interrupted orders
incomplete_events = @order_event_log.find_incomplete_orders
incomplete_events.each do |event|
order = Order.find(event.order_id)
next if order.completed?
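# Re-runs the full flow. process_order charges again, so it must skip the
# charge when a payment_processed event already exists for this order.
process_order(order)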
end
end
end
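The replay in recover_incomplete_orders calls process_order again, which would charge a customer whose payment was recorded just before the failure. A minimal sketch of a replay that consults the log first is shown here. InMemoryEventLog, CountingGateway, and the 'order_completed' event type are hypothetical stand-ins invented for this illustration, not the EventLog or gateway classes used above.

```ruby
# Hypothetical in-memory stand-in for the replicated event log.
class InMemoryEventLog
  def initialize
    @events = []
  end

  def append(event_type:, order_id:, **details)
    @events << { event_type: event_type, order_id: order_id, **details }
  end

  def recorded?(order_id, event_type)
    @events.any? { |e| e[:order_id] == order_id && e[:event_type] == event_type }
  end

  # Orders that were received but never marked complete.
  def incomplete_order_ids
    received = @events.select { |e| e[:event_type] == 'order_received' }
    received.map { |e| e[:order_id] }.uniq.reject { |id| recorded?(id, 'order_completed') }
  end
end

# Fake gateway that counts charges so a double charge would be visible.
class CountingGateway
  attr_reader :charges

  def initialize
    @charges = Hash.new(0)
  end

  def charge(order_id:, amount:)
    @charges[order_id] += 1
    "txn-#{order_id}-#{@charges[order_id]}"
  end
end

def replay_incomplete_orders(log, gateway, amounts)
  log.incomplete_order_ids.each do |order_id|
    # Charge only when no payment was recorded before the failure.
    unless log.recorded?(order_id, 'payment_processed')
      txn = gateway.charge(order_id: order_id, amount: amounts.fetch(order_id))
      log.append(event_type: 'payment_processed', order_id: order_id, transaction_id: txn)
    end
    log.append(event_type: 'order_completed', order_id: order_id)
  end
end

log = InMemoryEventLog.new
gateway = CountingGateway.new
amounts = { 1 => 50, 2 => 75 }

# Order 1 was paid before the outage; order 2 was only received.
log.append(event_type: 'order_received', order_id: 1)
txn = gateway.charge(order_id: 1, amount: 50)
log.append(event_type: 'payment_processed', order_id: 1, transaction_id: txn)
log.append(event_type: 'order_received', order_id: 2)

replay_incomplete_orders(log, gateway, amounts)
```

After the replay, each order has been charged exactly once and no order remains incomplete. A production system would more likely also pass an idempotency key to the payment provider, since the log entry itself can be lost between the charge and the append.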
SaaS applications serve multiple tenants requiring tenant-aware DR strategies. Multi-tenant databases complicate backup restoration as individual tenant data must be recoverable. Some platforms implement per-tenant backup schedules. Geographic data residency requirements may mandate region-specific DR sites. Tenant priority tiers determine restoration order during disasters affecting multiple tenants.
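Tier-based restoration ordering can be sketched as a simple sort. The tier names, the Tenant struct, and the tie-breaking rule (smaller tenants first within a tier) are assumptions made for illustration only.

```ruby
Tenant = Struct.new(:name, :tier, :data_size_gb, keyword_init: true)

# Lower number restores first. Tier names are illustrative assumptions.
TIER_PRIORITY = { enterprise: 0, business: 1, free: 2 }.freeze

def restoration_order(tenants)
  # Within a tier, restore smaller tenants first so more tenants come back sooner.
  tenants.sort_by { |t| [TIER_PRIORITY.fetch(t.tier), t.data_size_gb] }
end

tenants = [
  Tenant.new(name: 'acme', tier: :free, data_size_gb: 2),
  Tenant.new(name: 'globex', tier: :enterprise, data_size_gb: 500),
  Tenant.new(name: 'initech', tier: :business, data_size_gb: 40),
  Tenant.new(name: 'umbrella', tier: :enterprise, data_size_gb: 120)
]

ordered = restoration_order(tenants).map(&:name)
# ordered is ["umbrella", "globex", "initech", "acme"]
```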
Healthcare applications handle protected health information (PHI) and therefore need HIPAA-compliant DR procedures. Backup encryption protects patient data, access controls restrict who can reach DR systems, and audit logs track all DR operations. RTO targets often span hours rather than minutes, because many healthcare systems already tolerate scheduled maintenance windows. Critical systems such as emergency room applications, however, demand high availability.
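Backup encryption and audit logging can be sketched with Ruby's standard OpenSSL bindings. This is a minimal illustration and not a compliance recipe: the envelope layout, the in-memory audit log, and the locally generated key are assumptions. A real deployment would likely obtain keys from a key management service and write audit entries to tamper-resistant storage.

```ruby
require 'openssl'
require 'time'

# Encrypts a backup payload with AES-256-GCM (authenticated encryption).
def encrypt_backup(plaintext, key)
  cipher = OpenSSL::Cipher.new('aes-256-gcm')
  cipher.encrypt
  cipher.key = key
  iv = cipher.random_iv
  ciphertext = cipher.update(plaintext) + cipher.final
  { iv: iv, auth_tag: cipher.auth_tag, ciphertext: ciphertext }
end

# Decrypts and verifies the payload; raises if the data was altered.
def decrypt_backup(envelope, key)
  cipher = OpenSSL::Cipher.new('aes-256-gcm')
  cipher.decrypt
  cipher.key = key
  cipher.iv = envelope[:iv]
  cipher.auth_tag = envelope[:auth_tag]
  cipher.update(envelope[:ciphertext]) + cipher.final
end

# Hypothetical in-memory audit trail of DR operations.
audit_log = []
record_audit = lambda do |actor, action|
  audit_log << { actor: actor, action: action, at: Time.now.utc.iso8601 }
end

key = OpenSSL::Random.random_bytes(32) # assumption: generated locally for the demo
original = "patient_id,diagnosis\n42,example\n"

envelope = encrypt_backup(original, key)
record_audit.call('backup-service', 'backup_encrypted')

restored = decrypt_backup(envelope, key)
record_audit.call('dr-operator', 'backup_decrypted')

# Flip one bit to show that tampering is detected at decrypt time.
bytes = envelope[:ciphertext].bytes
bytes[0] ^= 0x01
tampered = envelope.merge(ciphertext: bytes.pack('C*'))
tamper_detected =
  begin
    decrypt_backup(tampered, key)
    false
  rescue OpenSSL::Cipher::CipherError
    true
  end
```

GCM is used here because it authenticates the ciphertext as well as encrypting it, so a corrupted or modified backup fails loudly during restore instead of producing silently wrong data.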
Media streaming platforms manage large volumes of content requiring efficient backup strategies. Content files replicate to edge locations through content delivery networks. Metadata databases require careful backup as content references depend on metadata integrity. User viewing progress and recommendations require session state preservation. During DR events, platforms may serve cached content while restoring backend systems.
# Media platform DR with CDN integration
class MediaPlatformDR
def initialize(cdn:, origin_storage:, metadata_db:)
@cdn = cdn
@origin_storage = origin_storage
@metadata_db = metadata_db
end
def failover_to_secondary_origin
# Update CDN to fetch from secondary origin
@cdn.update_origin_configuration(
primary_origin: secondary_origin_endpoint,
backup_origin: primary_origin_endpoint
)
# Verify content availability
sample_content = @metadata_db.sample_content_urls(100)
failures = sample_content.reject { |url| verify_content_available(url) }
if failures.empty?
notify_success("Failover complete, all content verified available")
else
notify_warning("Failover complete, #{failures.size} items unavailable")
end
end
def restore_user_state_from_backup
# Restore viewing progress from latest backup
latest_state_backup = find_latest_state_backup
restored_count = 0
latest_state_backup.each_record do |user_id, state|
UserStateCache.restore(user_id, state)
restored_count += 1
end
{ restored_users: restored_count, backup_age: latest_state_backup.age }
end
end
Internal business applications often tolerate longer recovery windows but require complete data consistency. Finance systems require precise transaction reconciliation. HR systems contain sensitive employee data requiring secure backup handling. CRM systems maintain relationship history requiring point-in-time restoration capabilities. Backup frequency typically follows the business cycle, for example nightly backups plus an extra backup before a month-end close.
Reference
Recovery Metrics
| Metric | Definition | Typical Range | Factors Affecting Cost |
|---|---|---|---|
| RTO | Maximum acceptable downtime | Minutes to hours | Infrastructure redundancy, automation level |
| RPO | Maximum acceptable data loss | Seconds to hours | Replication strategy, backup frequency |
| MTTR | Mean time to recovery | Hours to days | Team training, documentation quality |
| MTBF | Mean time between failures | Months to years | Infrastructure quality, maintenance practices |
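MTTR and MTBF can be derived from an incident history, and together they give steady-state availability as MTBF / (MTBF + MTTR). The sketch below assumes a simplified Incident record that only carries downtime in hours; the sample numbers are invented.

```ruby
Incident = Struct.new(:downtime_hours)

def reliability_metrics(incidents, observation_hours)
  total_downtime = incidents.sum(&:downtime_hours)
  mttr = total_downtime.to_f / incidents.size
  # MTBF counts only the time the system was actually up.
  mtbf = (observation_hours - total_downtime).to_f / incidents.size
  { mttr: mttr, mtbf: mtbf, availability: mtbf / (mtbf + mttr) }
end

# Three incidents over a 1000-hour observation period (invented sample data).
incidents = [Incident.new(2), Incident.new(4), Incident.new(6)]
metrics = reliability_metrics(incidents, 1000)
# MTTR is 4.0 hours, MTBF is about 329.33 hours, availability is 0.988
```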
DR Strategy Comparison
| Strategy | Infrastructure Cost | RTO | RPO | Complexity | Best For |
|---|---|---|---|---|---|
| Backup/Restore | Low | Hours-Days | Hours | Low | Non-critical systems, budget constraints |
| Pilot Light | Low-Medium | Hours | Minutes-Hours | Medium | Moderate criticality, cost-sensitive |
| Warm Standby | Medium | Minutes-Hours | Minutes | Medium-High | Business-critical systems |
| Hot Standby | High | Seconds-Minutes | Seconds | High | Mission-critical, near-zero-downtime requirements |
| Active-Active | Very High | Near-zero | Near-zero | Very High | Global services, highest availability needs |
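One way to use the comparison is to pick the cheapest strategy whose worst-case RTO and RPO still meet the requirement. The numeric bounds below are illustrative assumptions chosen to match the qualitative ranges in the table; they are not measured values and should be replaced with figures from your own DR tests.

```ruby
Strategy = Struct.new(:name, :max_rto, :max_rpo, keyword_init: true)

HOUR = 3600

# Ordered cheapest first. Bounds are in seconds and are assumptions.
STRATEGIES = [
  Strategy.new(name: 'Backup/Restore', max_rto: 48 * HOUR, max_rpo: 24 * HOUR),
  Strategy.new(name: 'Pilot Light',    max_rto: 8 * HOUR,  max_rpo: 4 * HOUR),
  Strategy.new(name: 'Warm Standby',   max_rto: 2 * HOUR,  max_rpo: 15 * 60),
  Strategy.new(name: 'Hot Standby',    max_rto: 5 * 60,    max_rpo: 30),
  Strategy.new(name: 'Active-Active',  max_rto: 5,         max_rpo: 1)
].freeze

# Returns the first (cheapest) strategy that satisfies both targets, or nil.
def cheapest_strategy(rto:, rpo:)
  STRATEGIES.find { |s| s.max_rto <= rto && s.max_rpo <= rpo }
end

choice = cheapest_strategy(rto: 4 * HOUR, rpo: HOUR)
# choice.name is "Warm Standby" under the assumed bounds
impossible = cheapest_strategy(rto: 1, rpo: 0)
# impossible is nil: no strategy in the list promises zero data loss
```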
Replication Methods
| Method | Data Loss Risk | Performance Impact | Use Cases |
|---|---|---|---|
| Synchronous | None for acknowledged writes | Higher write latency | Financial transactions, critical data |
| Asynchronous | Possible | Minimal latency | General applications, analytics |
| Semi-synchronous | Minimal | Moderate latency | Balanced consistency and performance |
| Snapshot-based | Moderate | Periodic overhead | Point-in-time recovery, compliance |
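The difference in data loss risk between synchronous and asynchronous replication comes down to when a write is acknowledged. The toy model below uses hypothetical Primary and Replica classes with in-memory arrays; it does not model semi-synchronous or snapshot-based replication.

```ruby
class Replica
  attr_reader :records

  def initialize
    @records = []
  end

  def apply(record)
    @records << record
  end
end

class Primary
  attr_reader :records

  def initialize(replica, mode:)
    @replica = replica
    @mode = mode
    @records = []
    @pending = []
  end

  def write(record)
    @records << record
    if @mode == :synchronous
      @replica.apply(record) # acknowledge only after the replica has the record
    else
      @pending << record     # acknowledge immediately, ship to the replica later
    end
    :ack
  end

  # Simulates the background shipping of queued writes.
  def flush
    @replica.apply(@pending.shift) until @pending.empty?
  end
end

# Returns the acknowledged writes that are missing on the replica
# when the primary fails right after the last write.
def lost_writes(mode)
  replica = Replica.new
  primary = Primary.new(replica, mode: mode)
  primary.write('a')
  primary.write('b')
  primary.flush
  primary.write('c') # primary fails before the next flush
  primary.records - replica.records
end

sync_lost = lost_writes(:synchronous)
async_lost = lost_writes(:asynchronous)
# sync_lost is [], async_lost is ["c"]
```

The synchronous path pays for its guarantee on every write, since each acknowledgment waits for the replica.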
Backup Types
| Type | Storage Size | Restoration Time | Backup Duration | Best For |
|---|---|---|---|---|
| Full | Largest (complete copy) | Fast | Slow | Weekly/monthly baselines |
| Incremental | Small | Slow | Fast | Frequent backups, storage efficiency |
| Differential | Medium | Medium | Medium | Balance between full and incremental |
| Continuous | Large | Fast | Ongoing | Zero/near-zero RPO requirements |
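The restoration-time column follows from how many backups must be applied. An incremental schedule needs the last full backup plus every incremental after it, while a differential schedule needs only the last full backup plus the latest differential. The sketch below assumes a schedule uses one or the other, not both, and uses a hypothetical Backup struct with day numbers as timestamps.

```ruby
Backup = Struct.new(:id, :type, :taken_at, keyword_init: true)

# Returns the backups to apply, in order, to restore to target_time.
# Assumes a schedule mixes full backups with either incrementals or
# differentials, not both.
def restore_chain(backups, target_time)
  usable = backups.select { |b| b.taken_at <= target_time }.sort_by(&:taken_at)
  base = usable.reverse.find { |b| b.type == :full }
  return [] unless base

  later = usable.select { |b| b.taken_at > base.taken_at }
  latest_differential = later.reverse.find { |b| b.type == :differential }

  if latest_differential
    # A differential holds all changes since the full backup.
    [base, latest_differential]
  else
    # Each incremental holds only the changes since the previous backup.
    [base, *later.select { |b| b.type == :incremental }]
  end
end

incremental_schedule = [
  Backup.new(id: 'full-0', type: :full, taken_at: 0),
  Backup.new(id: 'inc-1', type: :incremental, taken_at: 1),
  Backup.new(id: 'inc-2', type: :incremental, taken_at: 2),
  Backup.new(id: 'inc-3', type: :incremental, taken_at: 3)
]

differential_schedule = [
  Backup.new(id: 'full-0', type: :full, taken_at: 0),
  Backup.new(id: 'diff-1', type: :differential, taken_at: 1),
  Backup.new(id: 'diff-2', type: :differential, taken_at: 2),
  Backup.new(id: 'diff-3', type: :differential, taken_at: 3)
]

incremental_chain = restore_chain(incremental_schedule, 3).map(&:id)
differential_chain = restore_chain(differential_schedule, 3).map(&:id)
# incremental_chain is ["full-0", "inc-1", "inc-2", "inc-3"]
# differential_chain is ["full-0", "diff-3"]
```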
Common Failure Scenarios
| Scenario | Detection Time | Recovery Actions | Typical RTO |
|---|---|---|---|
| Hardware failure | Minutes | Replace hardware, restore from replica | 1-4 hours |
| Data center outage | Minutes-Hours | Failover to secondary site | 2-8 hours |
| Database corruption | Hours | Restore from backup, validate data | 4-24 hours |
| Ransomware attack | Hours-Days | Isolate systems, restore clean backups | 1-7 days |
| Natural disaster | Hours | Activate DR site, restore operations | 8-72 hours |
| Network partition | Minutes | Wait for resolution or force failover | 1-6 hours |
Health Check Components
| Component | Check Method | Failure Threshold | Response Action |
|---|---|---|---|
| Database | Query execution | 3 consecutive failures | Switch to replica |
| Application | HTTP endpoint | 3 failures in 90 seconds | Remove from load balancer |
| Storage | Read/write test | 2 consecutive failures | Switch to secondary storage |
| External API | Dependency test | 5 failures in 5 minutes | Use cached data or circuit breaker |
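The table uses two kinds of failure threshold: a run of consecutive failures, and a count of failures inside a time window. Both can be sketched as small trackers. The class names are hypothetical, and timestamps are passed in as plain seconds so the behavior is easy to follow.

```ruby
# Trips after `limit` failures in a row; any success resets the streak.
class ConsecutiveFailureThreshold
  def initialize(limit)
    @limit = limit
    @streak = 0
  end

  def record(success)
    @streak = success ? 0 : @streak + 1
    @streak >= @limit
  end
end

# Trips when `limit` failures fall within the last `window_seconds`.
class WindowedFailureThreshold
  def initialize(limit:, window_seconds:)
    @limit = limit
    @window = window_seconds
    @failure_times = []
  end

  def record(success, at:)
    @failure_times << at unless success
    # Drop failures that are older than the window.
    @failure_times.reject! { |t| at - t > @window }
    @failure_times.size >= @limit
  end
end

# Database row: 3 consecutive failures.
db_check = ConsecutiveFailureThreshold.new(3)
db_results = [false, false, true, false, false, false].map { |ok| db_check.record(ok) }
# db_results is [false, false, false, false, false, true]

# Application row: 3 failures in 90 seconds.
spread_check = WindowedFailureThreshold.new(limit: 3, window_seconds: 90)
spread_out = [0, 60, 120].map { |t| spread_check.record(false, at: t) }
# spread_out is [false, false, false] because the first failure ages out

burst_check = WindowedFailureThreshold.new(limit: 3, window_seconds: 90)
burst = [0, 40, 80].map { |t| burst_check.record(false, at: t) }
# burst is [false, false, true]
```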
Testing Schedule Recommendations
| System Criticality | Tabletop Exercises | Simulation Tests | Partial Failover | Full Failover |
|---|---|---|---|---|
| Mission-Critical | Quarterly | Monthly | Quarterly | Annually |
| Business-Critical | Semi-annually | Quarterly | Semi-annually | Annually |
| Important | Annually | Semi-annually | Annually | Every 2 years |
| Standard | Every 2 years | Annually | Every 2 years | Every 3 years |
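The schedule can be turned into an overdue check by converting each cadence to months. The interval hash below transcribes the table; the function name, the sample dates, and the rule that a test with no recorded run counts as overdue are assumptions.

```ruby
require 'date'

# Months between exercises, transcribed from the table above.
TEST_INTERVAL_MONTHS = {
  mission_critical:  { tabletop: 3,  simulation: 1,  partial_failover: 3,  full_failover: 12 },
  business_critical: { tabletop: 6,  simulation: 3,  partial_failover: 6,  full_failover: 12 },
  important:         { tabletop: 12, simulation: 6,  partial_failover: 12, full_failover: 24 },
  standard:          { tabletop: 24, simulation: 12, partial_failover: 24, full_failover: 36 }
}.freeze

# Returns the test types whose next due date has passed.
# A test with no recorded run is treated as overdue.
def overdue_tests(criticality, last_run, today)
  intervals = TEST_INTERVAL_MONTHS.fetch(criticality)
  due = intervals.select do |test, months|
    last = last_run[test]
    last.nil? || (last >> months) < today
  end
  due.keys
end

last_run = {
  tabletop: Date.new(2024, 4, 1),
  simulation: Date.new(2024, 4, 1),
  partial_failover: Date.new(2024, 1, 10)
}

overdue = overdue_tests(:mission_critical, last_run, Date.new(2024, 6, 15))
# overdue is [:simulation, :partial_failover, :full_failover]
```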
Ruby DR Gems
| Gem | Purpose | Key Features |
|---|---|---|
| aws-sdk-s3 | Backup storage | Cross-region replication, versioning, lifecycle policies |
| pg | PostgreSQL client driver | Connection management; replica reads and failover via libpq multi-host connection settings |
| redis-namespace | Session and cache key isolation | Key namespacing on a shared Redis instance, multi-instance support |
| diplomat | Service discovery | Consul integration, health checks, KV storage |
| whenever | Backup scheduling | Cron job management, Ruby DSL |
Cost Optimization Strategies
| Strategy | Cost Savings | Trade-offs |
|---|---|---|
| Compress backups | 50-80% storage reduction | CPU overhead, restoration time increase |
| Incremental backups | 80-95% storage reduction | Complex restoration process |
| Tiered storage | 70-90% lower storage cost | Slower retrieval for older backups |
| Pilot light instead of hot standby | 60-80% lower infrastructure cost | Higher RTO (hours vs minutes) |
| Automated scaling | 40-60% lower infrastructure cost | Scale-up delay during DR |
| Reserved instances | 30-50% lower compute cost | Long-term commitment required |
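The compression trade-off can be measured directly with Ruby's standard zlib bindings. The sample data below is an invented, highly repetitive SQL dump, so the savings it shows are optimistic. Real ratios depend heavily on the data, and already-compressed content such as images or video gains very little.

```ruby
require 'zlib'

# Invented sample: a repetitive SQL dump compresses unusually well.
raw = (1..1000).map { |i| "INSERT INTO orders VALUES (#{i}, 'pending');\n" }.join

compressed = Zlib::Deflate.deflate(raw, Zlib::BEST_COMPRESSION)
savings = 1.0 - compressed.bytesize.to_f / raw.bytesize

# Restoration must pay the decompression cost before the data is usable.
restored = Zlib::Inflate.inflate(compressed)

puts format('raw: %d bytes, compressed: %d bytes, savings: %.0f%%',
            raw.bytesize, compressed.bytesize, savings * 100)
```

Timing the deflate and inflate calls on a representative backup is a reasonable way to check whether the extra restoration time still fits within the RTO.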