CrackedRuby - Disaster Recovery

Overview

Disaster recovery encompasses the policies, procedures, and technical implementations that enable an organization to recover from catastrophic failures. These failures range from hardware malfunctions and data corruption to complete data center outages caused by natural disasters, cyber attacks, or infrastructure collapse. Unlike high availability, which focuses on minimizing downtime during normal operations, disaster recovery addresses scenarios where primary systems become completely unavailable.

The scope of disaster recovery extends beyond simple data backups. A comprehensive DR plan addresses application state, configuration management, database consistency, service dependencies, network routing, DNS failover, monitoring systems, and the orchestration required to restore operations in an alternate location. The plan must account for both technical restoration and operational coordination, including communication protocols, escalation procedures, and validation steps.

DR strategies operate on two fundamental metrics: Recovery Time Objective (RTO) defines the maximum acceptable downtime, while Recovery Point Objective (RPO) defines the maximum acceptable data loss measured in time. A system with an RTO of 4 hours and RPO of 1 hour can tolerate 4 hours of downtime and may lose up to 1 hour of recent data. These metrics directly influence architectural decisions, infrastructure costs, and operational complexity.

# DR metrics for a Ruby application
class DisasterRecoveryMetrics
  attr_reader :rto, :rpo
  
  def initialize(rto:, rpo:)
    @rto = rto  # Recovery Time Objective in seconds
    @rpo = rpo  # Recovery Point Objective in seconds
  end
  
  def backup_frequency
    # Interval between backups in seconds; it must not exceed the RPO
    @rpo / 2  # Conservative: back up twice as often as the RPO requires
  end
  
  def within_rpo?(data_timestamp)
    Time.now - data_timestamp <= @rpo
  end
  
  def recovery_window_expired?(failure_time)
    Time.now - failure_time > @rto
  end
end

# Example: E-commerce application requirements
ecommerce_dr = DisasterRecoveryMetrics.new(
  rto: 2 * 3600,    # 2 hours
  rpo: 5 * 60       # 5 minutes
)

puts ecommerce_dr.backup_frequency  # => 150 (backup every 2.5 minutes)

The financial and operational impact of disasters varies dramatically by industry. Financial services face regulatory requirements mandating specific RTO/RPO values, often measured in minutes. E-commerce platforms lose revenue directly correlated to downtime duration. Internal business applications may tolerate longer recovery windows but still require data consistency guarantees.

Key Principles

Recovery Time Objective (RTO) represents the target duration from disaster declaration to service restoration. This metric encompasses failure detection time, decision-making time, failover execution, service validation, and traffic cutover. Organizations often underestimate the non-technical components: confirming the disaster scope, authorizing the failover, coordinating with dependent teams, and verifying data integrity. A system capable of technical restoration in 30 minutes may have an effective RTO of 2 hours when human factors are included.
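
The gap between technical and effective RTO becomes concrete once each phase is written down and summed. The following is a minimal sketch; the phase names follow the paragraph above, while the durations and the `technical` flag are illustrative assumptions chosen to match the 30-minute versus 2-hour example, not measurements.

```ruby
# Illustrative phase durations in minutes. The numbers are assumptions,
# not measurements from a real incident.
RTO_PHASES = {
  failure_detection:      { minutes: 15, technical: false },
  scope_confirmation:     { minutes: 20, technical: false },
  failover_authorization: { minutes: 25, technical: false },
  team_coordination:      { minutes: 30, technical: false },
  failover_execution:     { minutes: 15, technical: true },
  service_validation:     { minutes: 10, technical: true },
  traffic_cutover:        { minutes: 5,  technical: true }
}.freeze

# Time spent on the purely technical restoration steps.
def technical_rto_minutes(phases)
  phases.values.select { |phase| phase[:technical] }.sum { |phase| phase[:minutes] }
end

# Time from disaster to restored service, human factors included.
def effective_rto_minutes(phases)
  phases.values.sum { |phase| phase[:minutes] }
end

puts technical_rto_minutes(RTO_PHASES)  # => 30
puts effective_rto_minutes(RTO_PHASES)  # => 120
```

Recording measured durations from DR drills in a structure like this shows which non-technical phases dominate the recovery time.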

Recovery Point Objective (RPO) quantifies acceptable data loss in time units. An RPO of zero requires synchronous replication to multiple locations, introducing latency and complexity. An RPO of 15 minutes permits asynchronous replication or periodic snapshots. RPO directly impacts backup frequency, replication strategies, and storage costs. Applications with strict RPO requirements must implement continuous data protection mechanisms rather than periodic backups.

# RPO implementation with continuous backup
class ContinuousBackupManager
  def initialize(primary_db:, backup_db:, rpo_seconds:)
    @primary_db = primary_db
    @backup_db = backup_db
    @rpo_seconds = rpo_seconds
    @pending_operations = Queue.new
    @last_sync = Time.now
  end
  
  def record_operation(operation)
    # Queue operation for replication
    @pending_operations << {
      timestamp: Time.now,
      operation: operation
    }
    
    # Force sync if approaching RPO limit
    sync_pending if time_since_sync >= @rpo_seconds * 0.8
  end
  
  def sync_pending
    operations = []
    operations << @pending_operations.pop until @pending_operations.empty?
    
    @backup_db.transaction do
      operations.each { |op| @backup_db.execute(op[:operation]) }
    end
    
    @last_sync = Time.now
  end
  
  private
  
  def time_since_sync
    Time.now - @last_sync
  end
end

Redundancy forms the foundation of disaster recovery. Geographic redundancy distributes systems across multiple physical locations, protecting against regional failures. Component redundancy duplicates critical infrastructure elements. Data redundancy maintains multiple copies of persistent state. The principle of redundancy applies at every system layer: multiple availability zones, redundant network paths, replicated databases, and distributed service instances.

Failover mechanisms detect failures and redirect traffic to backup systems. Automated failover reduces RTO but introduces complexity and false-positive risks. Manual failover provides control but extends recovery time. The detection mechanism must distinguish actual disasters from transient failures. Premature failover can cause data inconsistency, while delayed detection increases downtime. Health checks, consensus protocols, and multiple validation signals help balance these concerns.
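
One way to combine multiple validation signals is to require a quorum of independent probes to agree, and to require that agreement to persist across several rounds. The sketch below assumes probes are plain callables returning true when the primary looks healthy; the class name, probe names, and thresholds are hypothetical.

```ruby
# Declares a failure only when a quorum of independent probes report the
# primary as unhealthy for several consecutive rounds.
class QuorumFailureDetector
  def initialize(probes:, quorum:, required_rounds: 3)
    @probes = probes                    # Hash of name => callable, true when healthy
    @quorum = quorum                    # Failing probes needed for a round to count
    @required_rounds = required_rounds  # Consecutive failing rounds before failover
    @failing_rounds = 0
  end

  # Runs one round of probes; returns true once failover should be declared.
  def check
    failures = @probes.count do |_name, probe|
      begin
        !probe.call
      rescue StandardError
        true  # A probe that raises counts as a failed observation
      end
    end

    # A round without quorum resets the streak, filtering transient blips.
    @failing_rounds = failures >= @quorum ? @failing_rounds + 1 : 0
    @failing_rounds >= @required_rounds
  end
end

probes = {
  http_from_region_a: -> { false },
  http_from_region_b: -> { false },
  tcp_from_region_c:  -> { true }   # One vantage point still reaches the primary
}
detector = QuorumFailureDetector.new(probes: probes, quorum: 2, required_rounds: 3)
results = Array.new(3) { detector.check }
p results  # => [false, false, true]
```

Raising `quorum` or `required_rounds` lowers the false-positive risk at the cost of slower detection, which is the trade-off described above.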

Backup strategies differ in granularity, frequency, and retention. Full backups capture complete system state but consume significant storage and time. Incremental backups record changes since the last backup, reducing resource usage but complicating restoration. Differential backups capture changes since the last full backup, balancing storage efficiency and restoration complexity. Continuous data protection streams changes in real-time, minimizing RPO at the cost of infrastructure complexity.

# Backup strategy implementation
class BackupStrategy
  attr_reader :type, :schedule, :retention_days
  
  def initialize(type:, schedule:, retention_days:)
    @type = type  # :full, :incremental, :differential
    @schedule = schedule
    @retention_days = retention_days
  end
  
  def create_backup(data_source, backup_target)
    case @type
    when :full
      create_full_backup(data_source, backup_target)
    when :incremental
      create_incremental_backup(data_source, backup_target)
    when :differential
      create_differential_backup(data_source, backup_target)
    end
  end
  
  private
  
  def create_full_backup(source, target)
    timestamp = Time.now.to_i
    backup_path = "#{target}/full_#{timestamp}.backup"
    
    File.open(backup_path, 'w') do |f|
      source.each_record do |record|
        f.puts record.to_json
      end
    end
    
    { type: :full, path: backup_path, timestamp: timestamp }
  end
  
  def create_incremental_backup(source, target)
    last_backup = find_last_backup(target)
    timestamp = Time.now.to_i
    backup_path = "#{target}/incremental_#{timestamp}.backup"
    
    File.open(backup_path, 'w') do |f|
      source.changes_since(last_backup[:timestamp]).each do |change|
        f.puts change.to_json
      end
    end
    
    { type: :incremental, path: backup_path, timestamp: timestamp }
  end
  # create_differential_backup and find_last_backup are not shown here
end
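
The difference in restoration complexity between backup types shows up when working out which backups a restore needs. The sketch below is one possible way to compute that chain. It assumes backup metadata hashes with the `:type` and `:timestamp` keys used above, and it assumes an incremental backup is taken relative to the most recent backup of any type; `restore_chain` is a hypothetical helper name.

```ruby
# Returns the backups that must be applied, in order, to restore the most
# recent state: the latest full backup, then the newest differential taken
# after it (if any), then every incremental taken after that point.
def restore_chain(backups)
  ordered = backups.sort_by { |backup| backup[:timestamp] }
  last_full_index = ordered.rindex { |backup| backup[:type] == :full }
  raise ArgumentError, 'no full backup available' unless last_full_index

  full = ordered[last_full_index]
  later = ordered[(last_full_index + 1)..]

  last_diff_index = later.rindex { |backup| backup[:type] == :differential }
  if last_diff_index
    # The newest differential replaces everything between it and the full.
    after_diff = later[(last_diff_index + 1)..]
    [full, later[last_diff_index]] + after_diff.select { |b| b[:type] == :incremental }
  else
    [full] + later.select { |b| b[:type] == :incremental }
  end
end

chain = restore_chain([
  { type: :incremental,  timestamp: 400 },
  { type: :full,         timestamp: 100 },
  { type: :differential, timestamp: 300 },
  { type: :incremental,  timestamp: 200 }
])
p chain.map { |backup| backup[:timestamp] }  # => [100, 300, 400]
```

With incrementals only, every file since the last full backup is required, so a single missing or corrupt incremental breaks the chain. A differential shortens the chain to at most two files plus any later incrementals.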

Data consistency during disaster recovery requires careful transaction management. Applications must handle in-flight transactions at the moment of failure. Synchronous replication guarantees consistency but impacts performance. Asynchronous replication improves performance but may lose recent transactions. The choice depends on application requirements: financial transactions demand consistency, while analytics workloads may tolerate eventual consistency.

Testing validates disaster recovery procedures before actual disasters occur. DR drills identify gaps in documentation, expose infrastructure issues, and train personnel. Tests should simulate realistic failure scenarios including partial failures, network partitions, and cascading failures. Regular testing prevents configuration drift where documented procedures diverge from actual system state.

Implementation Approaches

The backup and restore approach represents the most basic DR strategy. The system regularly backs up data and configuration to durable storage. During a disaster, operators provision new infrastructure and restore from backups. This approach minimizes costs as no standby infrastructure runs continuously. The RTO spans hours to days depending on backup size, restoration speed, and infrastructure provisioning time. The achievable RPO equals the interval between backups: a failure just before the next backup loses everything written since the previous one. This strategy suits applications with relaxed recovery requirements and cost sensitivity.

# Backup and restore implementation
class BackupRestoreStrategy
  def initialize(storage_adapter:, backup_frequency:)
    @storage = storage_adapter
    @backup_frequency = backup_frequency
    @scheduler = BackupScheduler.new(frequency: backup_frequency)
  end
  
  def start_backup_schedule
    @scheduler.schedule do
      perform_backup
    end
  end
  
  def perform_backup
    timestamp = Time.now.to_i
    backup_id = "backup_#{timestamp}"
    
    # Capture database state
    db_snapshot = capture_database_snapshot
    
    # Capture application configuration
    config_snapshot = capture_configuration
    
    # Capture uploaded files
    file_snapshot = capture_file_storage
    
    # Bundle and upload to storage
    backup_bundle = {
      id: backup_id,
      timestamp: timestamp,
      database: db_snapshot,
      configuration: config_snapshot,
      files: file_snapshot
    }
    
    @storage.upload(backup_id, backup_bundle)
    cleanup_old_backups
  end
  
  def restore_from_backup(backup_id)
    backup_bundle = @storage.download(backup_id)
    
    # Restore in correct order to maintain consistency
    restore_configuration(backup_bundle[:configuration])
    restore_database(backup_bundle[:database])
    restore_file_storage(backup_bundle[:files])
    
    verify_restoration
  end
  
  private
  
  def cleanup_old_backups
    retention_period = 30 * 24 * 3600  # 30 days
    cutoff_time = Time.now.to_i - retention_period
    
    @storage.list_backups.each do |backup|
      @storage.delete(backup[:id]) if backup[:timestamp] < cutoff_time
    end
  end
end
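
Because the RTO of backup and restore depends on backup size, restoration speed, and provisioning time, a rough estimate can be computed before any disaster occurs. The following sketch uses a hypothetical helper and made-up input figures; real values should come from timed restore drills.

```ruby
# Rough RTO estimate for backup-and-restore. All inputs are assumptions to
# be replaced with figures measured during restore drills.
def estimated_restore_seconds(backup_bytes:, restore_bytes_per_second:,
                              provisioning_seconds:, validation_seconds: 0)
  transfer_seconds = backup_bytes.to_f / restore_bytes_per_second
  (provisioning_seconds + transfer_seconds + validation_seconds).ceil
end

estimate = estimated_restore_seconds(
  backup_bytes: 500 * 1024**3,              # 500 GiB backup
  restore_bytes_per_second: 100 * 1024**2,  # 100 MiB/s sustained restore throughput
  provisioning_seconds: 20 * 60,            # 20 minutes to provision infrastructure
  validation_seconds: 10 * 60               # 10 minutes of post-restore checks
)
puts estimate  # => 6920 (just under 2 hours)
```

If the estimate exceeds the target RTO, the usual levers are a smaller restore set, faster storage, or pre-provisioned infrastructure.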

Pilot light maintains minimal infrastructure in the secondary location with data replication active. Core database instances run but handle no production traffic. Application servers remain stopped but can be launched rapidly. During disaster, operators scale up the secondary environment and redirect traffic. This approach reduces RTO compared to backup/restore by eliminating data restoration time. Infrastructure costs remain moderate as only essential services run continuously.

Warm standby runs a scaled-down version of the production environment continuously. Application servers handle health checks and potentially serve read-only traffic. Databases replicate from the primary site. During disaster, operators scale up capacity and redirect production traffic. This approach achieves RTO measured in minutes to hours. The ongoing infrastructure costs are moderate, representing 20-50% of full production capacity.

Hot standby or active-passive maintains full production capacity in the secondary location. The secondary environment mirrors primary environment specifications. Traffic routing mechanisms can redirect users in minutes or seconds. Some implementations use DNS failover, while others employ global load balancers. This approach provides the lowest RTO of the standby approaches but incurs high infrastructure costs, as full redundant capacity runs continuously without serving production traffic.

# Hot standby with health checking and failover
require 'net/http'
require 'uri'

class HotStandbyManager
  def initialize(primary_endpoint:, secondary_endpoint:, health_check_interval: 30)
    @primary = primary_endpoint
    @secondary = secondary_endpoint
    @health_check_interval = health_check_interval
    @current_active = :primary
    @consecutive_failures = 0
    @failure_threshold = 3
  end
  
  def start_monitoring
    Thread.new do
      loop do
        check_health_and_failover
        sleep @health_check_interval
      end
    end
  end
  
  def check_health_and_failover
    primary_healthy = health_check(@primary)
    secondary_healthy = health_check(@secondary)
    
    if @current_active == :primary
      if !primary_healthy
        @consecutive_failures += 1
        if @consecutive_failures >= @failure_threshold && secondary_healthy
          perform_failover_to_secondary
        end
      else
        @consecutive_failures = 0
      end
    elsif @current_active == :secondary
      if primary_healthy && !secondary_healthy
        perform_failback_to_primary
      end
    end
  end
  
  def health_check(endpoint)
    response = Net::HTTP.get_response(URI("#{endpoint}/health"))
    response.code == '200' && response.body.include?('healthy')
  rescue StandardError
    false
  end
  
  def perform_failover_to_secondary
    puts "Primary unhealthy, failing over to secondary"
    update_dns_routing(@secondary)
    update_load_balancer(@secondary)
    @current_active = :secondary
    @consecutive_failures = 0
    send_alert("Failover to secondary completed")
  end
  
  def perform_failback_to_primary
    puts "Primary recovered, failing back"
    update_dns_routing(@primary)
    update_load_balancer(@primary)
    @current_active = :primary
    send_alert("Failback to primary completed")
  end
end

Active-active or multi-site active deployments run production traffic across multiple locations simultaneously. Global load balancers distribute users based on geography, latency, or availability. Each location handles production load and can absorb additional traffic if another site fails. This approach provides the best RTO and RPO, often achieving near-zero downtime. The complexity lies in data consistency across sites, requiring distributed databases or conflict resolution strategies. Infrastructure costs are highest as multiple full-capacity sites operate concurrently.
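
Last-write-wins is one of the simplest conflict resolution strategies for active-active sites. The sketch below is illustrative only: the `:site`, `:updated_at`, and `:value` fields and the function name are assumptions, and the approach silently discards the losing write and depends on reasonably synchronized clocks.

```ruby
# Last-write-wins merge of one record that was updated at two sites.
def resolve_last_write_wins(version_a, version_b)
  # Compare timestamps first; break exact ties by site name so that every
  # site independently picks the same winner.
  [version_a, version_b].max_by { |version| [version[:updated_at], version[:site]] }
end

us = { site: 'us-east', updated_at: 1_700_000_100, value: 'shipped' }
eu = { site: 'eu-west', updated_at: 1_700_000_090, value: 'pending' }

winner = resolve_last_write_wins(us, eu)
puts winner[:value]  # => shipped
```

Workloads that cannot afford to drop concurrent updates typically need a distributed database with stronger guarantees or application-level merge logic instead.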

Data replication strategies underpin all DR approaches beyond basic backup/restore. Synchronous replication writes data to multiple locations before acknowledging success. This guarantees zero data loss but introduces latency proportional to geographic distance. Asynchronous replication acknowledges writes before secondary sites confirm receipt, minimizing latency but risking data loss during failures. Semi-synchronous replication requires acknowledgment from at least one secondary, balancing consistency and performance.
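
The three acknowledgement modes can be compared with a small in-memory model. This is a teaching sketch, not a replication implementation: the `ReplicatedLog` class is hypothetical, replicas are plain arrays standing in for remote sites, and replicas that do not have to acknowledge a write are treated as not yet having received it.

```ruby
# Models how many replicas must hold a write before it is acknowledged.
class ReplicatedLog
  REQUIRED_ACKS = {
    sync:      ->(replica_count) { replica_count },  # every replica confirms
    semi_sync: ->(_replica_count) { 1 },             # at least one replica confirms
    async:     ->(_replica_count) { 0 }              # acknowledge immediately
  }.freeze

  def initialize(replica_count:, mode:)
    @primary = []
    @replicas = Array.new(replica_count) { [] }
    @mode = mode
  end

  # Applies the write to the primary and to as many replicas as the mode
  # requires before acknowledging.
  def write(entry)
    @primary << entry
    needed = REQUIRED_ACKS.fetch(@mode).call(@replicas.size)
    @replicas.first(needed).each { |replica| replica << entry }
    :acknowledged
  end

  # Entries that would be lost if the primary failed now and this replica
  # were promoted.
  def missing_on(replica_index)
    @primary - @replicas[replica_index]
  end
end

%i[sync semi_sync async].each do |mode|
  log = ReplicatedLog.new(replica_count: 2, mode: mode)
  %w[a b c].each { |entry| log.write(entry) }
  puts "#{mode}: replica 0 missing #{log.missing_on(0).size}, " \
       "replica 1 missing #{log.missing_on(1).size}"
end
# sync: replica 0 missing 0, replica 1 missing 0
# semi_sync: replica 0 missing 0, replica 1 missing 3
# async: replica 0 missing 3, replica 1 missing 3
```

In the semi-synchronous case the outcome depends on which replica is promoted, which is why failover tooling usually selects the most up-to-date standby.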

Ruby Implementation

Ruby applications require careful state management for disaster recovery. Session state, background job queues, uploaded files, and database transactions must all survive failover. Stateless application design simplifies DR by storing all persistent state in databases or external services. Stateful applications must serialize and replicate state across locations.

# Stateless session management for DR
require 'json'
require 'securerandom'

class DisasterRecoverySession
  def initialize(session_store:)
    @session_store = session_store  # Redis, Memcached, or DB
  end
  
  def create_session(user_id, data)
    session_id = SecureRandom.uuid
    session_data = {
      user_id: user_id,
      created_at: Time.now.to_i,
      data: data
    }
    
    # Store in replicated session store
    @session_store.set(session_key(session_id), session_data.to_json)
    @session_store.expire(session_key(session_id), 24 * 3600)
    
    session_id
  end
  
  def get_session(session_id)
    session_json = @session_store.get(session_key(session_id))
    return nil unless session_json
    
    JSON.parse(session_json, symbolize_names: true)
  end
  
  def destroy_session(session_id)
    @session_store.del(session_key(session_id))
  end
  
  private
  
  def session_key(session_id)
    "session:#{session_id}"
  end
end

Database backup strategies in Ruby applications typically use database-specific tools wrapped in Ruby scripts. PostgreSQL applications use pg_dump for logical backups or filesystem snapshots for physical backups. MySQL applications use mysqldump or Percona XtraBackup. The Ruby code orchestrates backup scheduling, manages retention, uploads to object storage, and handles encryption.

# PostgreSQL backup orchestration
require 'openssl'
require 'uri'

class PostgreSQLBackupManager
  def initialize(database_url:, s3_bucket:, encryption_key:)
    @database_url = URI.parse(database_url)
    @s3_bucket = s3_bucket
    @encryption_key = encryption_key
  end
  
  def create_backup
    timestamp = Time.now.strftime('%Y%m%d_%H%M%S')
    backup_filename = "pg_backup_#{timestamp}.sql.gz"
    local_path = "/tmp/#{backup_filename}"
    
    # Execute pg_dump with compression
    pg_dump_cmd = build_pg_dump_command(local_path)
    success = system(pg_dump_cmd)
    
    raise "Backup failed" unless success && File.exist?(local_path)
    
    # Encrypt backup
    encrypted_path = encrypt_file(local_path)
    
    # Upload to S3
    s3_key = "backups/postgresql/#{backup_filename}.enc"
    upload_to_s3(encrypted_path, s3_key)
    
    # Record the size before the local files are deleted
    encrypted_size = File.size(encrypted_path)

    # Cleanup local files
    File.delete(local_path) if File.exist?(local_path)
    File.delete(encrypted_path) if File.exist?(encrypted_path)

    { timestamp: timestamp, s3_key: s3_key, size: encrypted_size }
  end
  
  def restore_backup(s3_key)
    # Download from S3
    local_encrypted = "/tmp/backup.sql.gz.enc"
    download_from_s3(s3_key, local_encrypted)
    
    # Decrypt
    local_decrypted = decrypt_file(local_encrypted)
    
    # Restore using psql
    restore_cmd = build_restore_command(local_decrypted)
    success = system(restore_cmd)
    
    # Cleanup
    File.delete(local_encrypted)
    File.delete(local_decrypted)
    
    raise "Restore failed" unless success
  end
  
  private
  
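  def build_pg_dump_command(output_path)
    host = @database_url.host
    port = @database_url.port || 5432
    database = @database_url.path[1..]
    user = @database_url.user

    # Plain-format dump piped through gzip, matching the .sql.gz filename
    # and the psql-based restore (a -Fc custom-format dump would need
    # pg_restore instead). The pipeline's exit status is gzip's, so a
    # pg_dump failure can go unnoticed unless the shell enables pipefail.
    "PGPASSWORD='#{@database_url.password}' pg_dump -h #{host} -p #{port} " \
    "-U #{user} #{database} | gzip > #{output_path}"
  end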
  
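  def encrypt_file(input_path)
    output_path = "#{input_path}.enc"
    cipher = OpenSSL::Cipher.new('aes-256-cbc')
    cipher.encrypt
    cipher.key = @encryption_key  # Must be 32 bytes for AES-256
    iv = cipher.random_iv         # Fresh IV per backup; never reuse one with the same key

    File.open(output_path, 'wb') do |outfile|
      outfile.write(iv)  # Prepend the IV so decrypt_file can read it back
      File.open(input_path, 'rb') do |infile|
        # Stream in chunks so large dumps are not loaded into memory at once
        while (chunk = infile.read(64 * 1024))
          outfile.write(cipher.update(chunk))
        end
      end
      outfile.write(cipher.final)
    end
    output_path
  end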
end
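
The listing above calls `decrypt_file` without showing it. As a standalone illustration of how encryption and decryption can be paired, the sketch below writes a random IV as the first 16 bytes of the encrypted file and reads it back when decrypting. The function names and chunk size are hypothetical, and CBC mode provides no integrity protection, so an authenticated mode or a separate checksum may be preferable for real backups.

```ruby
require 'openssl'

CHUNK_SIZE = 64 * 1024

# Encrypts input_path into output_path, storing the random IV as the first
# 16 bytes of the output so the decrypting side can recover it.
def encrypt_file_with_iv(input_path, output_path, key)
  cipher = OpenSSL::Cipher.new('aes-256-cbc')
  cipher.encrypt
  cipher.key = key          # 32 bytes for AES-256
  iv = cipher.random_iv     # Generates the IV and sets it on the cipher

  File.open(output_path, 'wb') do |out|
    out.write(iv)
    File.open(input_path, 'rb') do |input|
      while (chunk = input.read(CHUNK_SIZE))
        out.write(cipher.update(chunk))
      end
    end
    out.write(cipher.final)
  end
  output_path
end

# Reverses encrypt_file_with_iv: reads the IV prefix, then decrypts the rest.
def decrypt_file_with_iv(input_path, output_path, key)
  cipher = OpenSSL::Cipher.new('aes-256-cbc')
  cipher.decrypt
  cipher.key = key

  File.open(input_path, 'rb') do |input|
    cipher.iv = input.read(cipher.iv_len)
    File.open(output_path, 'wb') do |out|
      while (chunk = input.read(CHUNK_SIZE))
        out.write(cipher.update(chunk))
      end
      out.write(cipher.final)
    end
  end
  output_path
end
```

Restoring a test file and comparing it byte-for-byte with the original is a cheap way to confirm that the key and file format still match.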

Background job queues require careful handling during failover. Sidekiq, DelayedJob, and Resque store job state in Redis or databases. The DR strategy must account for jobs in-flight during failure. Idempotent job design allows safe retry after failover. Job priorities ensure critical operations process first during recovery. Dead letter queues capture repeatedly failing jobs for manual investigation.

# DR-aware background job processing
class DisasterRecoveryJob
  include Sidekiq::Job
  
  sidekiq_options retry: 5, dead: true
  
  def perform(operation_id, params)
    # Check if operation already completed (idempotency)
    return if OperationLog.completed?(operation_id)
    
    begin
      result = execute_operation(params)
      
      # Record completion to prevent duplicate execution after failover
      OperationLog.record_completion(
        operation_id: operation_id,
        result: result,
        timestamp: Time.now
      )
    rescue StandardError => e
      # Log error for DR analysis. The current retry count lives in the job
      # payload, not in the class-level sidekiq_options, so it is not
      # recorded here.
      OperationLog.record_failure(
        operation_id: operation_id,
        error: e.message,
        timestamp: Time.now
      )
      raise  # Re-raise to trigger Sidekiq retry
    end
  end
  
  private
  
  def execute_operation(params)
    # Actual operation implementation
  end
end

File uploads present challenges for DR as files typically reside in filesystem storage or object storage. DR strategies must ensure uploaded files replicate to secondary locations. Object storage services like AWS S3 provide cross-region replication. Applications using local filesystem storage require rsync replication or distributed filesystems. Ruby applications should abstract storage behind adapters that support replication.

# Storage adapter with replication support
class ReplicatedStorageAdapter
  def initialize(primary:, secondary:, replication_mode: :async)
    @primary = primary
    @secondary = secondary
    @replication_mode = replication_mode
  end
  
  def upload(key, data)
    # Write to primary
    @primary.put(key, data)
    
    # Replicate to secondary
    if @replication_mode == :sync
      @secondary.put(key, data)
    else
      replicate_async(key, data)
    end
  end
  
  def download(key)
    # Try primary first, fallback to secondary
    @primary.get(key)
  rescue StorageError
    @secondary.get(key)
  end
  
  def delete(key)
    @primary.delete(key)
    @secondary.delete(key)
  end
  
  private
  
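  def replicate_async(key, data)
    # Queue replication job. Passing the payload as a job argument only
    # suits small objects; for large files, enqueue just the key and let
    # the job read the object from the primary store.
    ReplicationJob.perform_async(key, data, @secondary.endpoint)
  end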
end

Health checks provide the foundation for automated failover detection. Ruby applications expose health check endpoints that verify database connectivity, dependency availability, and application functionality. Health checks must distinguish between transient issues and catastrophic failures to prevent false-positive failovers.

# Comprehensive health check implementation
class HealthCheckController < ApplicationController
  def show
    checks = {
      database: check_database,
      redis: check_redis,
      storage: check_storage,
      external_api: check_external_dependencies
    }
    
    overall_healthy = checks.values.all? { |check| check[:healthy] }
    status_code = overall_healthy ? 200 : 503
    
    render json: {
      status: overall_healthy ? 'healthy' : 'unhealthy',
      timestamp: Time.now.to_i,
      checks: checks
    }, status: status_code
  end
  
  private
  
  def check_database
    start = Time.now
    ActiveRecord::Base.connection.execute('SELECT 1')
    latency = ((Time.now - start) * 1000).round
    { healthy: true, latency_ms: latency }
  rescue StandardError => e
    { healthy: false, error: e.message }
  end
  
  def check_redis
    start = Time.now
    $redis.ping
    latency = ((Time.now - start) * 1000).round
    { healthy: true, latency_ms: latency }
  rescue StandardError => e
    { healthy: false, error: e.message }
  end
  
  def check_storage
    # Verify can write and read
    test_key = "health_check_#{SecureRandom.hex(8)}"
    Storage.upload(test_key, 'test')
    Storage.download(test_key)
    Storage.delete(test_key)
    { healthy: true }
  rescue StandardError => e
    { healthy: false, error: e.message }
  end
end

Tools & Ecosystem

AWS provides extensive DR capabilities through services spanning multiple layers. Route 53 health checks monitor endpoint availability, and failover routing policies stop returning unhealthy endpoints in DNS responses. RDS automated backups create daily snapshots with transaction log archival. Aurora Global Database replicates data across regions, typically with sub-second replication lag. S3 Cross-Region Replication copies objects to buckets in other regions. CloudFormation templates enable rapid infrastructure reproduction. AWS Backup centralizes backup management across services.

# AWS Route 53 health check management
require 'aws-sdk-route53'

class Route53HealthCheckManager
  def initialize
    @client = Aws::Route53::Client.new
  end
  
  def create_health_check(domain:, port:, path:)
    response = @client.create_health_check({
      # Required by the API; a unique string that makes retries idempotent
      caller_reference: "#{domain}-#{Time.now.to_i}",
      health_check_config: {
        type: 'HTTPS',
        resource_path: path,
        fully_qualified_domain_name: domain,
        port: port,
        request_interval: 30,
        failure_threshold: 3
      }
    })
    
    response.health_check.id
  end
  
  def update_dns_failover(hosted_zone_id:, record_name:, primary_ip:, secondary_ip:, health_check_id:)
    # Create primary record set with health check
    @client.change_resource_record_sets({
      hosted_zone_id: hosted_zone_id,
      change_batch: {
        changes: [
          {
            action: 'UPSERT',
            resource_record_set: {
              name: record_name,
              type: 'A',
              set_identifier: 'Primary',
              failover: 'PRIMARY',
              health_check_id: health_check_id,
              ttl: 60,
              resource_records: [{ value: primary_ip }]
            }
          },
          {
            action: 'UPSERT',
            resource_record_set: {
              name: record_name,
              type: 'A',
              set_identifier: 'Secondary',
              failover: 'SECONDARY',
              ttl: 60,
              resource_records: [{ value: secondary_ip }]
            }
          }
        ]
      }
    })
  end
end

PostgreSQL streaming replication provides built-in database replication. The primary server streams write-ahead log (WAL) records to standby servers. Standby servers apply changes continuously, maintaining near-current copies (exactly synchronized only when synchronous replication is configured). Ruby applications interact with streaming replication through connection management and failover coordination. The pg gem connects to a standby like any other server, so read traffic can be sent to replicas for load distribution; deciding which queries go where is left to the application or framework.
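
Replication lag on a standby is worth monitoring because it bounds the data loss a failover would incur. The sketch below queries `pg_last_xact_replay_timestamp()` on the standby. It assumes `conn` is a `PG::Connection` (or any object whose `exec` returns rows addressable as `result[0]['column']`); the helper names are hypothetical. On an idle primary this measure keeps growing even when the standby is fully caught up, so it is best read alongside write activity.

```ruby
# Seconds since the standby last replayed a transaction from the primary.
REPLAY_LAG_SQL = <<~SQL
  SELECT EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp())) AS lag_seconds
SQL

# Returns the lag as a Float, or nil when no transaction has been replayed
# yet (the function returns NULL in that case).
def replication_lag_seconds(conn)
  value = conn.exec(REPLAY_LAG_SQL)[0]['lag_seconds']
  value && value.to_f
end

# True when promoting this standby now would stay within the RPO.
def standby_within_rpo?(conn, rpo_seconds)
  lag = replication_lag_seconds(conn)
  !lag.nil? && lag <= rpo_seconds
end
```

A monitoring job could call `standby_within_rpo?(standby_conn, 300)` periodically and alert when it returns false.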

Consul provides service discovery and health checking critical for DR scenarios. Applications register services with health check definitions. During failover, Consul automatically removes unhealthy services from discovery. Ruby applications use the Diplomat gem to interact with Consul APIs for service registration, discovery, and key-value storage.

# Consul integration for service discovery and failover
require 'diplomat'

class ConsulServiceManager
  def register_service(name:, address:, port:, health_check_path:)
    Diplomat::Service.register(
      {
        name: name,
        address: address,
        port: port,
        check: {
          http: "http://#{address}:#{port}#{health_check_path}",
          interval: '10s',
          timeout: '5s'
        }
      }
    )
  end
  
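  def discover_healthy_services(service_name)
    # Query the health endpoint rather than the catalog lookup in
    # Diplomat::Service.get, so only instances with passing checks return
    entries = Diplomat::Health.service(service_name, passing: true)
    entries.map do |entry|
      {
        address: entry.Service['Address'],
        port: entry.Service['Port'],
        id: entry.Service['ID']
      }
    end
  end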
  
  def deregister_service(service_id)
    Diplomat::Service.deregister(service_id)
  end
end

Terraform enables infrastructure-as-code for disaster recovery. DR environments can be defined as code and rapidly provisioned when needed. Terraform modules encapsulate reusable infrastructure patterns. State files track deployed resources. Ruby applications can shell out to the Terraform CLI or, when using Terraform Cloud, call its HTTP API.
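
Shelling out to Terraform from Ruby can be sketched with `Open3` from the standard library. The working directory, variable names, and helper names below are hypothetical; the command assumes a Terraform version that supports the global `-chdir` option and a directory that has already been initialized with `terraform init`.

```ruby
require 'open3'

# Builds the argument list for a non-interactive apply of a DR environment.
def terraform_apply_command(working_dir:, variables: {})
  var_flags = variables.flat_map { |name, value| ['-var', "#{name}=#{value}"] }
  ['terraform', "-chdir=#{working_dir}", 'apply', '-input=false', '-auto-approve', *var_flags]
end

# Runs the command without a shell, so values are not subject to shell
# interpolation. Raises with Terraform's stderr on a non-zero exit.
def provision_dr_environment(working_dir:, variables: {})
  command = terraform_apply_command(working_dir: working_dir, variables: variables)
  stdout, stderr, status = Open3.capture3(*command)
  raise "terraform apply failed: #{stderr}" unless status.success?

  stdout
end

p terraform_apply_command(working_dir: 'infra/dr', variables: { region: 'us-west-2' })
# => ["terraform", "-chdir=infra/dr", "apply", "-input=false", "-auto-approve", "-var", "region=us-west-2"]
```

Separating command construction from execution keeps the provisioning step testable without Terraform installed.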

Docker and container orchestration platforms like Kubernetes facilitate DR through portable application packaging. Container images bundle application code and dependencies. Kubernetes manifests define deployment topology. During DR failover, operators apply manifests to secondary clusters. Ruby applications containerized with Docker gain portability across environments.

Database backup tools integrate with Ruby applications for automated backup management. pgbackrest provides advanced PostgreSQL backup features including incremental backups, parallel processing, and cloud storage integration. Percona XtraBackup offers hot backup capabilities for MySQL. Ruby scripts orchestrate these tools, manage schedules, and monitor backup health.

Object storage services provide durable backup destinations. AWS S3, Google Cloud Storage, and Azure Blob Storage are designed for at least 99.999999999% (11 nines) annual durability, achieved through replication across multiple devices and, depending on the storage class or redundancy option, multiple facilities. Ruby applications use aws-sdk-s3, google-cloud-storage, or azure-storage-blob gems. These services support versioning, lifecycle policies, and cross-region replication.

Testing Approaches

DR testing validates recovery procedures, identifies gaps, and trains personnel. Tests must simulate realistic failure scenarios without impacting production systems. Test frequency depends on RTO requirements and change rate. Applications with stringent RTO targets require monthly or quarterly testing. Less critical systems may test annually.

Tabletop exercises represent the simplest testing approach. Teams walk through DR procedures discussing each step without executing actions. Participants identify unclear instructions, missing information, or procedural gaps. Tabletop exercises suit initial DR plan validation and require minimal resources. The limitation is lack of technical validation.

# Tabletop exercise checklist generator
class TabletopExerciseGenerator
  def generate_exercise_plan(dr_plan:)
    {
      scenario: generate_scenario(dr_plan.threat_model),
      objectives: [
        'Validate communication procedures',
        'Identify documentation gaps',
        'Confirm role assignments',
        'Review decision criteria'
      ],
      participants: required_participants(dr_plan),
      duration: '2-3 hours',
      agenda: [
        { time: '0:00', activity: 'Scenario introduction' },
        { time: '0:15', activity: 'Initial response walkthrough' },
        { time: '0:45', activity: 'Technical recovery discussion' },
        { time: '1:30', activity: 'Service restoration review' },
        { time: '2:00', activity: 'Lessons learned' }
      ],
      success_criteria: [
        'All participants understand their roles',
        'Critical gaps documented',
        'Timeline realistic',
        'Dependencies identified'
      ]
    }
  end
end

Simulation testing executes DR procedures in isolated test environments. Teams provision secondary infrastructure, restore from backups, and validate application functionality. Simulation tests verify technical procedures work correctly. The test environment should mirror production specifications to ensure realistic validation. Ruby applications can be deployed to test environments using the same deployment pipelines as production.

# DR simulation test orchestration
class DisasterRecoverySimulation
  def initialize(test_environment:, backup_source:)
    @test_env = test_environment
    @backup_source = backup_source
    @test_results = []
  end
  
  def run_full_simulation
    test_start = Time.now
    
    begin
      # Phase 1: Infrastructure provisioning
      record_phase('Infrastructure Provisioning') do
        provision_test_infrastructure
      end
      
      # Phase 2: Backup restoration
      record_phase('Backup Restoration') do
        restore_latest_backup
      end
      
      # Phase 3: Application deployment
      record_phase('Application Deployment') do
        deploy_application_to_test_env
      end
      
      # Phase 4: Smoke tests
      record_phase('Smoke Testing') do
        run_smoke_tests
      end
      
      # Phase 5: Data validation
      record_phase('Data Validation') do
        validate_data_integrity
      end
      
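    rescue StandardError
      # record_phase already logged the failing phase; swallow the error so
      # the report below still covers the phases that ran.
    ensure
      total_duration = Time.now - test_start
      cleanup_test_environment
    end

    generate_test_report(total_duration)
  end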
  
  private
  
  def record_phase(phase_name)
    start_time = Time.now
    success = false
    error = nil
    
    begin
      yield
      success = true
    rescue StandardError => e
      error = e.message
      raise
    ensure
      duration = Time.now - start_time
      @test_results << {
        phase: phase_name,
        success: success,
        duration: duration,
        error: error
      }
    end
  end
  
  def generate_test_report(total_duration)
    {
      test_date: Time.now,
      total_duration: total_duration,
      phases: @test_results,
      overall_success: @test_results.all? { |r| r[:success] },
      rto_met: total_duration <= @test_env.target_rto,
      recommendations: generate_recommendations
    }
  end
end

Partial failover tests validate specific components without a full disaster simulation. Database failover tests promote a replica and redirect traffic to it. DNS failover tests update routing rules. Load balancer failover tests redirect traffic between availability zones. Because each test touches a single component, partial tests reduce risk and complexity while still validating critical components.
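
A partial failover test can be organised as a set of independent component checks. The following is a minimal sketch rather than a prescribed harness: PartialFailoverTest, register, and the in-memory database hash are hypothetical names, and the lambdas stand in for real promotion and routing calls.

```ruby
# Hypothetical harness: each component supplies a failover action and a
# verification step, and components are exercised one at a time.
class PartialFailoverTest
  Result = Struct.new(:component, :passed, :error, keyword_init: true)

  def initialize
    @checks = {}
  end

  # failover: callable that triggers failover for a single component
  # verify:   callable that returns true if the component works afterwards
  def register(component, failover:, verify:)
    @checks[component] = { failover: failover, verify: verify }
  end

  def run(component)
    check = @checks.fetch(component) # KeyError for unregistered components
    begin
      check[:failover].call
      Result.new(component: component, passed: check[:verify].call == true, error: nil)
    rescue StandardError => e
      # A failure in one component is recorded, not propagated
      Result.new(component: component, passed: false, error: e.message)
    end
  end

  def run_all
    @checks.keys.map { |component| run(component) }
  end
end

# In-memory stand-ins for a real database cluster and DNS provider
database = { primary: 'db-a', replica: 'db-b' }

suite = PartialFailoverTest.new
suite.register(
  :database,
  failover: -> { database[:primary], database[:replica] = database[:replica], database[:primary] },
  verify: -> { database[:primary] == 'db-b' }
)
suite.register(
  :dns,
  failover: -> { raise 'routing update rejected' },
  verify: -> { true }
)

results = suite.run_all
results.each { |r| puts "#{r.component}: #{r.passed ? 'ok' : "failed (#{r.error})"}" }
```

Recording a failure instead of raising lets the remaining components still be tested in the same run, which suits the narrow scope of partial tests.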

Production failover tests execute actual failover to DR sites with production traffic. These tests provide the highest confidence but carry the highest risk. Organizations with active-active deployments can gradually shift traffic to secondary sites, validate operation, and shift back. Deployments with a passive DR site typically need a scheduled maintenance window for failover tests.
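
The gradual shift used in active-active tests can be sketched as a loop that raises the secondary site's traffic share step by step and rolls back when validation fails. FakeRouter, shift_to_secondary, and the step percentages are assumptions made for illustration; a real test would call a load balancer or DNS weighting API instead.

```ruby
# Stand-in for a load balancer or DNS weighting API (hypothetical)
class FakeRouter
  attr_reader :secondary_weight

  def initialize
    @secondary_weight = 0
  end

  def shift_to_secondary(percent)
    @secondary_weight = percent
  end
end

class GradualTrafficShift
  attr_reader :history

  # validate: callable receiving the current percentage; returns true when
  #           the secondary site is serving that share of traffic correctly
  def initialize(router:, validate:, steps: [10, 25, 50, 100])
    @router = router
    @validate = validate
    @steps = steps
    @history = []
  end

  def run
    @steps.each do |percent|
      apply(percent)
      next if @validate.call(percent)

      apply(0) # validation failed: send all traffic back to the primary
      return :rolled_back
    end
    :shifted
  end

  private

  def apply(percent)
    @router.shift_to_secondary(percent)
    @history << percent
  end
end

router = FakeRouter.new
# Simulated validation that starts failing once half the traffic has moved
shift = GradualTrafficShift.new(router: router, validate: ->(percent) { percent < 50 })
outcome = shift.run
puts "#{outcome}: #{shift.history.inspect}" # prints "rolled_back: [10, 25, 50, 0]"
```

Keeping the history of applied weights gives the test report a record of how far the shift progressed before any rollback.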

Automated testing integrates DR validation into continuous integration pipelines. Backup restoration tests run automatically after each backup. Health check tests validate monitoring systems. Configuration tests verify infrastructure definitions remain deployable. Automated tests provide continuous validation as systems evolve.

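# Automated DR validation tests
# RTO_SECONDS, HEALTH_CHECK_INTERVAL, and helpers such as provision_test_database
# are assumed to be defined in the suite's support files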
require 'rspec'

RSpec.describe 'Disaster Recovery Validation' do
  let(:backup_manager) { BackupManager.new }
  let(:health_checker) { HealthChecker.new }
  
  describe 'Backup Restoration' do
    it 'restores latest backup successfully' do
      latest_backup = backup_manager.find_latest_backup
      
      test_db = provision_test_database
      restore_result = backup_manager.restore(
        backup: latest_backup,
        target: test_db
      )
      
      expect(restore_result.success?).to be true
      expect(test_db.record_count).to eq latest_backup.record_count
    end
    
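    it 'completes restoration within RTO' do
      # Monotonic clock avoids skew from wall-clock adjustments during the run
      start_time = Process.clock_gettime(Process::CLOCK_MONOTONIC)
      backup_manager.restore_latest
      duration = Process.clock_gettime(Process::CLOCK_MONOTONIC) - start_time
      
      # Restoration is only one part of recovery, so it must fit inside the RTO
      expect(duration).to be < RTO_SECONDS
    end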
  end
  
  describe 'Health Monitoring' do
    it 'detects primary endpoint failure' do
      simulate_primary_failure
      
      sleep HEALTH_CHECK_INTERVAL + 1
      
      expect(health_checker.primary_healthy?).to be false
      expect(health_checker.failover_triggered?).to be true
    end
  end
  
  describe 'Data Integrity' do
    it 'maintains referential integrity after restore' do
      restored_db = restore_from_latest_backup
      
      integrity_check = run_integrity_checks(restored_db)
      
      expect(integrity_check.orphaned_records).to be_empty
      expect(integrity_check.missing_foreign_keys).to be_empty
    end
  end
end

Real-World Applications

Financial services applications require aggressive RTO and RPO targets due to regulatory requirements and revenue impact. Trading platforms maintain hot standby systems with synchronous replication across data centers. Databases use multi-master replication or consensus protocols. Application servers run in multiple regions with global load balancing. Session state resides in distributed caches replicated across sites. DR drills occur quarterly with partial failover tests monthly.

E-commerce platforms balance DR costs against revenue loss during outages. Peak shopping periods like holidays demand minimal downtime. Many platforms use warm standby configurations with rapid scale-up capabilities. Product catalogs replicate through content delivery networks. Order processing systems maintain event logs enabling reconstruction after failures. Payment processing integrates with multiple providers for redundancy.

# E-commerce order processing with DR resilience
class ResilientOrderProcessor
  def initialize
    @primary_payment_gateway = StripeGateway.new
    @backup_payment_gateway = BraintreeGateway.new
    @order_event_log = EventLog.new(storage: ReplicatedStorage.new)
  end
  
  def process_order(order)
    # Log order event before processing
    @order_event_log.append(
      event_type: 'order_received',
      order_id: order.id,
      timestamp: Time.now,
      payload: order.to_json
    )
    
    begin
      # Attempt payment with primary gateway
      payment_result = @primary_payment_gateway.charge(
        amount: order.total,
        customer: order.customer_id
      )
      
      @order_event_log.append(
        event_type: 'payment_processed',
        order_id: order.id,
        gateway: 'stripe',
        transaction_id: payment_result.transaction_id
      )
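    rescue PaymentGatewayError
      # Failover to backup gateway. This assumes the error means the primary
      # charge did not go through; an ambiguous timeout needs reconciliation first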
      payment_result = @backup_payment_gateway.charge(
        amount: order.total,
        customer: order.customer_id
      )
      
      @order_event_log.append(
        event_type: 'payment_processed',
        order_id: order.id,
        gateway: 'braintree',
        transaction_id: payment_result.transaction_id,
        note: 'Primary gateway failed, used backup'
      )
    end
    
    order.complete!
  end
  
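  def recover_incomplete_orders
    # After disaster recovery, replay events to complete interrupted orders
    incomplete_events = @order_event_log.find_incomplete_orders
    
    incomplete_events.each do |event|
      order = Order.find(event.order_id)
      next if order.completed?
      
      if @order_event_log.payment_recorded?(order.id)
        # Payment was logged before the failure: finish the order without
        # charging again (payment_recorded? is an assumed EventLog helper)
        order.complete!
      else
        # No payment on record: resume processing from the start
        process_order(order)
      end
    end
  end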
end

SaaS applications serve multiple tenants requiring tenant-aware DR strategies. Multi-tenant databases complicate backup restoration as individual tenant data must be recoverable. Some platforms implement per-tenant backup schedules. Geographic data residency requirements may mandate region-specific DR sites. Tenant priority tiers determine restoration order during disasters affecting multiple tenants.
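
Tier-based restoration order combined with a residency constraint can be expressed as a sort followed by a region lookup. This is a sketch under stated assumptions: the tier names, the Tenant struct, and the region identifiers are placeholders, not values from any particular platform.

```ruby
Tenant = Struct.new(:name, :tier, :home_region, keyword_init: true)

# Assumed tiers, highest restoration priority first
TIER_PRIORITY = { enterprise: 0, business: 1, free: 2 }.freeze

# Assumed residency mapping: each home region fails over within its own jurisdiction
DR_REGION = { 'eu-west' => 'eu-central', 'us-east' => 'us-west' }.freeze

# Returns restoration steps ordered by tier, then by name for a stable order
def restoration_plan(tenants)
  tenants
    .sort_by { |tenant| [TIER_PRIORITY.fetch(tenant.tier), tenant.name] }
    .map { |tenant| { tenant: tenant.name, restore_to: DR_REGION.fetch(tenant.home_region) } }
end

tenants = [
  Tenant.new(name: 'hobby-blog', tier: :free, home_region: 'us-east'),
  Tenant.new(name: 'acme-bank', tier: :enterprise, home_region: 'eu-west'),
  Tenant.new(name: 'corner-shop', tier: :business, home_region: 'us-east')
]

plan = restoration_plan(tenants)
plan.each { |step| puts "#{step[:tenant]} -> #{step[:restore_to]}" }
```

Using fetch for both lookups makes an unknown tier or an unmapped region fail loudly during planning instead of silently restoring a tenant to the wrong place.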

Healthcare applications handle protected health information (PHI) requiring HIPAA-compliant DR procedures. Backup encryption protects patient data. Access controls restrict DR system access. Audit logs track all DR operations. RTO targets for administrative systems often span hours rather than minutes, since those systems already tolerate scheduled maintenance windows. Critical systems such as emergency room applications, however, demand high availability.
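
Backup encryption and audit logging can be illustrated with Ruby's standard OpenSSL bindings. The sketch below uses AES-256-GCM, which also detects tampering. EncryptedBackup, AuditEntry, and the AUTH_LABEL value are hypothetical, the key is generated inline only for the demonstration, and nothing here is sufficient on its own for regulatory compliance.

```ruby
require 'openssl'
require 'securerandom'

class EncryptedBackup
  AuditEntry = Struct.new(:at, :actor, :action, keyword_init: true)
  AUTH_LABEL = 'backup-v1' # authenticated context label (assumed value)

  attr_reader :audit_log

  def initialize(key)
    @key = key # 32 bytes for AES-256; in practice loaded from a key management system
    @audit_log = []
  end

  def encrypt(plaintext, actor:)
    cipher = OpenSSL::Cipher.new('aes-256-gcm')
    cipher.encrypt
    cipher.key = @key
    iv = cipher.random_iv # fresh nonce for every backup
    cipher.auth_data = AUTH_LABEL
    data = cipher.update(plaintext) + cipher.final
    record(actor, :encrypt)
    { iv: iv, tag: cipher.auth_tag, data: data }
  end

  def decrypt(blob, actor:)
    cipher = OpenSSL::Cipher.new('aes-256-gcm')
    cipher.decrypt
    cipher.key = @key
    cipher.iv = blob[:iv]
    cipher.auth_tag = blob[:tag]
    cipher.auth_data = AUTH_LABEL
    # final raises OpenSSL::Cipher::CipherError if the data or tag was altered
    plaintext = cipher.update(blob[:data]) + cipher.final
    record(actor, :decrypt)
    plaintext
  end

  private

  def record(actor, action)
    @audit_log << AuditEntry.new(at: Time.now.utc, actor: actor, action: action)
  end
end

backup = EncryptedBackup.new(SecureRandom.random_bytes(32))
blob = backup.encrypt('patient_id=42;status=discharged', actor: 'backup-job')
restored = backup.decrypt(blob, actor: 'dr-operator')
backup.audit_log.each { |entry| puts "#{entry.at.iso8601} #{entry.actor} #{entry.action}" }
```

Only successful operations are logged in this sketch; a production audit trail would normally record failed decryption attempts as well.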

Media streaming platforms manage large volumes of content requiring efficient backup strategies. Content files replicate to edge locations through content delivery networks. Metadata databases require careful backup as content references depend on metadata integrity. User viewing progress and recommendations require session state preservation. During DR events, platforms may serve cached content while restoring backend systems.

# Media platform DR with CDN integration
class MediaPlatformDR
  def initialize(cdn:, origin_storage:, metadata_db:)
    @cdn = cdn
    @origin_storage = origin_storage
    @metadata_db = metadata_db
  end
  
  def failover_to_secondary_origin
    # Update CDN to fetch from secondary origin
    @cdn.update_origin_configuration(
      primary_origin: secondary_origin_endpoint,
      backup_origin: primary_origin_endpoint
    )
    
    # Verify content availability
    sample_content = @metadata_db.sample_content_urls(100)
    failures = sample_content.reject { |url| verify_content_available(url) }
    
    if failures.empty?
      notify_success("Failover complete, all content verified available")
    else
      notify_warning("Failover complete, #{failures.size} items unavailable")
    end
  end
  
  def restore_user_state_from_backup
    # Restore viewing progress from latest backup
    latest_state_backup = find_latest_state_backup
    
    restored_count = 0
    latest_state_backup.each_record do |user_id, state|
      UserStateCache.restore(user_id, state)
      restored_count += 1
    end
    
    { restored_users: restored_count, backup_age: latest_state_backup.age }
  end
end

Internal business applications often tolerate longer recovery windows but require complete data consistency. Finance systems require precise transaction reconciliation. HR systems contain sensitive employee data requiring secure backup handling. CRM systems maintain relationship history requiring point-in-time restoration capabilities. Backup frequency typically matches business cycle periods.

Reference

Recovery Metrics

Metric	Definition	Typical Range	Factors Affecting Cost
RTO	Maximum acceptable downtime	Minutes to hours	Infrastructure redundancy, automation level
RPO	Maximum acceptable data loss	Seconds to hours	Replication strategy, backup frequency
MTTR	Mean time to recovery	Hours to days	Team training, documentation quality
MTBF	Mean time between failures	Months to years	Infrastructure quality, maintenance practices
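
MTTR and MTBF can be derived from an incident log, and together they give an availability estimate of MTBF / (MTBF + MTTR). The sketch below assumes a hypothetical Incident struct and an observation period supplied by the caller; the dates are made up for the example.

```ruby
Incident = Struct.new(:started_at, :recovered_at, keyword_init: true)

# observation_hours: total length of the measured period, including downtime
def reliability_summary(incidents, observation_hours:)
  raise ArgumentError, 'need at least one incident' if incidents.empty?

  # Subtracting two Time objects yields seconds as a Float
  downtime_hours = incidents.sum { |i| (i.recovered_at - i.started_at) / 3600.0 }
  mttr = downtime_hours / incidents.size
  mtbf = (observation_hours - downtime_hours) / incidents.size

  { mttr_hours: mttr, mtbf_hours: mtbf, availability: mtbf / (mtbf + mttr) }
end

incidents = [
  Incident.new(started_at: Time.utc(2024, 1, 10, 8), recovered_at: Time.utc(2024, 1, 10, 10)),
  Incident.new(started_at: Time.utc(2024, 2, 3, 22), recovered_at: Time.utc(2024, 2, 4, 4))
]

summary = reliability_summary(incidents, observation_hours: 1000)
puts summary
```

With outages of 2 and 6 hours over 1000 hours, MTTR is 4 hours, MTBF is 496 hours, and availability is 496 / 500 = 0.992. Comparing the longest single recovery against the RTO is a separate check, since MTTR is an average and RTO is a ceiling.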

DR Strategy Comparison

Strategy	Infrastructure Cost	RTO	RPO	Complexity	Best For
Backup/Restore	Low	Hours-Days	Hours	Low	Non-critical systems, budget constraints
Pilot Light	Low-Medium	Hours	Minutes-Hours	Medium	Moderate criticality, cost-sensitive
Warm Standby	Medium	Minutes-Hours	Minutes	Medium-High	Business-critical systems
Hot Standby	High	Seconds-Minutes	Seconds	High	Mission-critical, zero-downtime requirements
Active-Active	Very High	Near-zero	Near-zero	Very High	Global services, highest availability needs

Replication Methods

Method	Data Loss Risk	Performance Impact	Use Cases
Synchronous	None for acknowledged writes	High latency	Financial transactions, critical data
Asynchronous	Possible	Minimal latency	General applications, analytics
Semi-synchronous	Minimal	Moderate latency	Balanced consistency and performance
Snapshot-based	Moderate	Periodic overhead	Point-in-time recovery, compliance
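
The difference between the first three methods comes down to how many replica acknowledgements a write waits for before it is confirmed to the client. The following is a simplified in-memory sketch: Replica and ReplicatedWriter are hypothetical classes, replicas are contacted inline instead of over a network, and a rejected write is not rolled back on replicas that already stored it.

```ruby
class Replica
  attr_reader :log

  def initialize(reachable: true)
    @reachable = reachable
    @log = []
  end

  # Returns true when the entry was stored, which counts as an acknowledgement
  def apply(entry)
    return false unless @reachable

    @log << entry
    true
  end
end

class ReplicatedWriter
  attr_reader :committed

  def initialize(replicas, mode:)
    @replicas = replicas
    @committed = []
    @required_acks =
      case mode
      when :synchronous then replicas.size # every replica must confirm
      when :semi_synchronous then 1        # one confirmation is enough
      when :asynchronous then 0            # confirm without waiting
      else raise ArgumentError, "unknown mode: #{mode}"
      end
  end

  # Returns true if the write is acknowledged to the client
  def write(entry)
    acks = @replicas.count { |replica| replica.apply(entry) }
    return false if acks < @required_acks

    @committed << entry
    true
  end
end

outcomes = {}
%i[synchronous semi_synchronous asynchronous].each do |mode|
  # One healthy replica and one unreachable replica
  replicas = [Replica.new, Replica.new(reachable: false)]
  outcomes[mode] = ReplicatedWriter.new(replicas, mode: mode).write('order-1001')
  puts "#{mode}: acknowledged=#{outcomes[mode]}"
end
```

With one replica unreachable, the synchronous writer refuses the write while the other two accept it. The asynchronous writer would accept it even with every replica down, which is where its data loss risk comes from.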

Backup Types

Type	Storage Size	Restoration Time	Backup Duration	Best For
Full	100%	Fast	Slow	Weekly/monthly baselines
Incremental	Small	Slow	Fast	Frequent backups, storage efficiency
Differential	Medium	Medium	Medium	Balance between full and incremental
Continuous	Large	Fast	Ongoing	Zero/near-zero RPO requirements
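
The restoration-time differences in the table follow from how many backups must be applied in sequence. The sketch below selects that chain for a target time. It assumes a differential holds all changes since the last full backup and an incremental holds changes since the most recent backup of any type; tools differ on these definitions, and the Backup struct and ids are hypothetical.

```ruby
Backup = Struct.new(:id, :type, :taken_at, keyword_init: true)

# Returns the backups to apply, in order, to restore to target_time
def restore_chain(backups, target_time)
  usable = backups.select { |b| b.taken_at <= target_time }.sort_by(&:taken_at)
  full_index = usable.rindex { |b| b.type == :full }
  raise ArgumentError, 'no full backup at or before target time' if full_index.nil?

  chain = [usable[full_index]]
  after_full = usable.drop(full_index + 1)

  # The latest differential replaces every earlier backup since the full
  diff_index = after_full.rindex { |b| b.type == :differential }
  if diff_index
    chain << after_full[diff_index]
    after_full = after_full.drop(diff_index + 1)
  end

  # Remaining incrementals must all be applied, oldest first
  chain + after_full.select { |b| b.type == :incremental }
end

day = ->(d) { Time.utc(2024, 3, d) }
backups = [
  Backup.new(id: 'full-01', type: :full, taken_at: day.call(1)),
  Backup.new(id: 'inc-02', type: :incremental, taken_at: day.call(2)),
  Backup.new(id: 'inc-03', type: :incremental, taken_at: day.call(3)),
  Backup.new(id: 'diff-04', type: :differential, taken_at: day.call(4)),
  Backup.new(id: 'inc-05', type: :incremental, taken_at: day.call(5)),
  Backup.new(id: 'full-08', type: :full, taken_at: day.call(8))
]

puts restore_chain(backups, day.call(5)).map(&:id).inspect
puts restore_chain(backups, day.call(3)).map(&:id).inspect
```

Restoring to day 3 needs the full backup plus two incrementals, while restoring to day 5 needs the full backup, one differential, and one incremental. A long incremental chain is what makes incremental restoration slow.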

Common Failure Scenarios

Scenario	Detection Time	Recovery Actions	Typical RTO
Hardware failure	Minutes	Replace hardware, restore from replica	1-4 hours
Data center outage	Minutes-Hours	Failover to secondary site	2-8 hours
Database corruption	Hours	Restore from backup, validate data	4-24 hours
Ransomware attack	Hours-Days	Isolate systems, restore clean backups	1-7 days
Natural disaster	Hours	Activate DR site, restore operations	8-72 hours
Network partition	Minutes	Wait for resolution or force failover	1-6 hours

Health Check Components

Component	Check Method	Failure Threshold	Response Action
Database	Query execution	3 consecutive failures	Switch to replica
Application	HTTP endpoint	3 failures in 90 seconds	Remove from load balancer
Storage	Read/write test	2 consecutive failures	Switch to secondary storage
External API	Dependency test	5 failures in 5 minutes	Use cached data or circuit breaker
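
The table uses two kinds of failure threshold: a run of consecutive failures and a count of failures inside a time window. A minimal sketch of both follows, wired to the rows above. The class names and the report interface are assumptions, and timestamps are plain seconds supplied by the caller.

```ruby
# Trips after `limit` failures in a row; any success resets the streak
class ConsecutiveFailures
  def initialize(limit)
    @limit = limit
    @streak = 0
  end

  def record(success, _at)
    @streak = success ? 0 : @streak + 1
    @streak >= @limit
  end
end

# Trips when `limit` failures fall within the last `window_seconds`
class FailuresInWindow
  def initialize(limit, window_seconds)
    @limit = limit
    @window = window_seconds
    @failures = []
  end

  def record(success, at)
    @failures << at unless success
    @failures.reject! { |t| at - t > @window }
    @failures.size >= @limit
  end
end

# Thresholds and actions taken from the table rows
POLICIES = {
  database: { threshold: -> { ConsecutiveFailures.new(3) }, action: 'Switch to replica' },
  application: { threshold: -> { FailuresInWindow.new(3, 90) }, action: 'Remove from load balancer' },
  storage: { threshold: -> { ConsecutiveFailures.new(2) }, action: 'Switch to secondary storage' },
  external_api: { threshold: -> { FailuresInWindow.new(5, 300) }, action: 'Use cached data or circuit breaker' }
}.freeze

class HealthMonitor
  def initialize(policies = POLICIES)
    @state = policies.transform_values do |policy|
      { threshold: policy[:threshold].call, action: policy[:action] }
    end
  end

  # Returns the response action when the threshold trips, otherwise nil
  def report(component, success:, at:)
    entry = @state.fetch(component)
    entry[:action] if entry[:threshold].record(success, at)
  end
end

monitor = HealthMonitor.new
actions = [0, 10, 20].map { |t| monitor.report(:database, success: false, at: t) }
puts actions.inspect # prints [nil, nil, "Switch to replica"]
```

Windowed thresholds tolerate isolated failures spread over time, which suits HTTP endpoints and external APIs where occasional errors are normal.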

Testing Schedule Recommendations

System Criticality	Tabletop Exercises	Simulation Tests	Partial Failover	Full Failover
Mission-Critical	Quarterly	Monthly	Quarterly	Annually
Business-Critical	Semi-annually	Quarterly	Semi-annually	Annually
Important	Annually	Semi-annually	Annually	Every 2 years
Standard	Every 2 years	Annually	Every 2 years	Every 3 years
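
The schedule can be encoded as month intervals and checked against the date each test last ran. This sketch uses the intervals from the table; the symbol names, the due_tests helper, and the sample dates are assumptions made for illustration.

```ruby
require 'date'

# Interval in months between tests, taken from the table rows
SCHEDULE_MONTHS = {
  mission_critical: { tabletop: 3, simulation: 1, partial_failover: 3, full_failover: 12 },
  business_critical: { tabletop: 6, simulation: 3, partial_failover: 6, full_failover: 12 },
  important: { tabletop: 12, simulation: 6, partial_failover: 12, full_failover: 24 },
  standard: { tabletop: 24, simulation: 12, partial_failover: 24, full_failover: 36 }
}.freeze

# last_run: hash of test type => Date of the most recent run (missing = never run)
# Returns the test types that are due on or before `today`
def due_tests(criticality, last_run, today: Date.today)
  SCHEDULE_MONTHS.fetch(criticality).select do |test, months|
    last = last_run[test]
    last.nil? || (last >> months) <= today # Date#>> adds calendar months
  end.keys
end

last_run = {
  tabletop: Date.new(2024, 1, 15),
  simulation: Date.new(2024, 5, 20),
  partial_failover: Date.new(2024, 2, 1)
}

due = due_tests(:mission_critical, last_run, today: Date.new(2024, 6, 1))
puts due.inspect # prints [:tabletop, :partial_failover, :full_failover]
```

A test that has never run is treated as due immediately, so newly onboarded systems appear in the list until their first exercise is recorded.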

Ruby DR Gems

Gem	Purpose	Key Features
aws-sdk-s3	Backup storage	Cross-region replication, versioning, lifecycle policies
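pg	PostgreSQL client	Connection management; replica reads and failover are handled through connection settings or application code
redis-namespace	Session key isolation	Key namespacing for sharing one Redis instance across applications; replication itself is handled by Redis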
diplomat	Service discovery	Consul integration, health checks, KV storage
whenever	Backup scheduling	Cron job management, Ruby DSL

Cost Optimization Strategies

Strategy	Cost Savings	Trade-offs
Compress backups	50-80% storage reduction	CPU overhead, restoration time increase
Incremental backups	80-95% storage reduction	Complex restoration process
Tiered storage	70-90% storage cost reduction	Slower retrieval for older backups
Pilot light instead of hot standby	60-80% infrastructure cost reduction	Higher RTO (hours vs minutes)
Automated scaling	40-60% infrastructure cost reduction	Scale-up delay during DR
Reserved instances	30-50% compute cost reduction	Long-term commitment required
