CrackedRuby

Failover and High Availability

Overview

Failover represents the automatic transfer of operations from a failed component to a redundant standby component, while high availability (HA) describes system design that minimizes downtime through redundancy, fault detection, and automatic recovery mechanisms. These concepts address the fundamental challenge of maintaining service continuity when individual components fail, which occurs inevitably in distributed systems due to hardware failures, network partitions, software bugs, or operational errors.

High availability systems measure reliability using availability percentages, commonly expressed as "nines of availability." A system with 99.9% availability ("three nines") experiences approximately 8.76 hours of downtime annually, while 99.99% availability ("four nines") reduces this to 52.56 minutes per year. Each additional nine requires exponentially more investment in redundancy, monitoring, and operational discipline.
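
The downtime figures follow directly from the availability percentage. The following is a minimal sketch of that arithmetic, assuming a 365-day year; the method name `annual_downtime_minutes` is illustrative and not part of any library:

```ruby
MINUTES_PER_YEAR = 365 * 24 * 60 # 525,600, assuming a non-leap year

# Returns the yearly downtime budget in minutes for an availability
# percentage such as 99.9. Rational arithmetic avoids float rounding noise.
def annual_downtime_minutes(availability_percent)
  unavailable_fraction = (100 - Rational(availability_percent.to_s)) / 100
  (MINUTES_PER_YEAR * unavailable_fraction).to_f
end

annual_downtime_minutes(99.9)  # => 525.6 (about 8.76 hours)
annual_downtime_minutes(99.99) # => 52.56
```

Each extra nine divides the downtime budget by ten, which is why the cost of reaching it climbs so quickly.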

The distinction between failover and high availability matters: failover describes a specific mechanism for handling failures, while high availability represents the broader goal achieved through multiple techniques including failover, redundancy, load balancing, and fault tolerance. A highly available system employs failover as one component within a comprehensive reliability strategy.

Consider a web application backed by a database. Without HA mechanisms, database server failure renders the application unusable until manual intervention restores service. With failover implemented, a standby database server automatically assumes the primary role when failure detection occurs, typically completing the transition within seconds to minutes. The application continues operating with minimal disruption, though potentially degraded performance during the transition.

Key Principles

Redundancy forms the foundation of high availability by maintaining multiple instances of critical components. Redundancy operates at multiple levels: hardware redundancy duplicates physical servers, network redundancy provides multiple network paths, data redundancy replicates information across storage systems, and geographic redundancy distributes resources across data centers. The N+1 redundancy model maintains N components required for full capacity plus one additional component for failover, while N+M redundancy adds M spare components to handle multiple simultaneous failures.

Fault Detection identifies component failures through health checks, heartbeat monitoring, and service-level monitoring. Detection systems balance sensitivity against false positives: aggressive detection minimizes downtime from actual failures but risks unnecessary failovers from transient issues, while conservative detection reduces false alarms but increases recovery time. Detection mechanisms typically combine multiple signals including response time monitoring, error rate tracking, and explicit health check endpoints that verify both service availability and dependency health.

Automatic Recovery executes predetermined responses to detected failures without human intervention. Recovery automation eliminates human response time from the critical path and ensures consistent execution under stress. Recovery procedures handle state transfer from failed to healthy components, routing updates to direct traffic away from failed instances, and verification steps confirming successful recovery before declaring the system healthy.

State Management determines how systems maintain consistency during failover. Stateless components simplify failover since any instance can serve any request without coordination, but many real-world systems maintain state requiring careful synchronization. Session state, transaction state, and application data must either replicate continuously to standby instances or persist externally where multiple instances access shared state.

Split-Brain Prevention addresses scenarios where network partitions cause multiple components to simultaneously believe they hold the primary role. Without prevention mechanisms, split-brain scenarios create data conflicts and consistency violations. Quorum-based systems require majority agreement before assuming primary status, fencing mechanisms forcibly disable failed primaries before promoting standbys, and distributed coordination services provide authoritative leadership election.
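
The quorum rule can be sketched in a few lines. This is an illustrative model only, assuming each node can count how many cluster members (including itself) it can currently reach; the class name `QuorumElection` is hypothetical:

```ruby
# Decides whether a node may assume the primary role based on how many
# cluster members it can reach. Only a strict majority is allowed to promote.
class QuorumElection
  def initialize(cluster_size)
    @cluster_size = cluster_size
  end

  # Smallest number of nodes that forms a strict majority.
  def majority
    (@cluster_size / 2) + 1
  end

  def may_promote?(reachable_nodes)
    reachable_nodes >= majority
  end
end

# A five-node cluster split 3/2 by a network partition:
election = QuorumElection.new(5)
election.may_promote?(3) # => true
election.may_promote?(2) # => false
```

Because two disjoint groups cannot both hold a strict majority, at most one side of a partition can promote a primary. An even-sized cluster split exactly in half leaves neither side able to promote, which is one reason odd cluster sizes are common.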

Health check implementation demonstrates these principles:

require 'http'
require 'json'

class HealthChecker
  def initialize(endpoint, timeout: 5, interval: 10)
    @endpoint = endpoint
    @timeout = timeout
    @interval = interval
    @consecutive_failures = 0
    @failure_threshold = 3
  end

  def monitor
    loop do
      healthy = check_health
      
      if healthy
        @consecutive_failures = 0
      else
        @consecutive_failures += 1
        # trigger_failover is deployment-specific and not shown here
        trigger_failover if @consecutive_failures >= @failure_threshold
      end
      
      sleep @interval
    end
  end

  private

  def check_health
    response = HTTP.timeout(@timeout).get(@endpoint)
    response.status.success? && validate_response(response)
  rescue HTTP::Error, Timeout::Error, JSON::ParserError
    # Network errors, timeouts, and malformed bodies all count as unhealthy
    false
  end

  def validate_response(response)
    data = JSON.parse(response.body.to_s)
    data['status'] == 'healthy' &&
      Array(data['dependencies']).all? { |dep| dep['status'] == 'healthy' }
  end
end

Design Considerations

The decision to implement high availability requires balancing availability requirements against costs in infrastructure, operational complexity, and consistency trade-offs. Systems with strict uptime requirements justify HA investments, while internal tools or services with flexible availability needs may accept simpler architectures with longer recovery times.

Availability Tiers map business requirements to architectural approaches. Tier 1 systems target 99.95% or higher availability, requiring automated failover, redundant components across multiple availability zones, and 24/7 monitoring. These systems support revenue-generating services or critical business operations where downtime directly impacts business outcomes. Tier 2 systems target 99.9% availability with automated recovery but potentially within a single data center, acceptable for important but not critical services. Tier 3 systems accept 99% availability with manual failover procedures, suitable for internal tools or services with flexible usage patterns.

Consistency vs Availability Trade-offs emerge from the CAP theorem, which states distributed systems can provide at most two of three guarantees: consistency, availability, and partition tolerance. Since network partitions occur inevitably, systems choose between consistency (CP systems) or availability (AP systems) during partitions. Financial systems typically choose consistency, accepting unavailability during partitions rather than risk inconsistent data. Social media feeds typically choose availability, accepting eventual consistency to maintain service during partitions.

Database replication illustrates these trade-offs:

# Synchronous replication - prioritizes consistency
class SynchronousReplicator
  def write(key, value)
    primary.write(key, value)
    
    # Block until replicas confirm write
    replicas.each do |replica|
      replica.write(key, value)
      raise ReplicationError unless replica.confirm_write(key)
    end
    
    true
  rescue ReplicationError
    # Sacrifice availability to maintain consistency
    rollback_write(key)
    raise
  end
end

# Asynchronous replication - prioritizes availability
class AsynchronousReplicator
  def write(key, value)
    primary.write(key, value)
    
    # Return immediately, replicate in background
    Thread.new do
      replicas.each do |replica|
        replica.write(key, value) rescue log_replication_failure(replica)
      end
    end
    
    true
  end
end

Failure Domain Isolation minimizes the impact of individual failures by ensuring failures affect limited system subsets. Geographic distribution across regions protects against data center outages but introduces latency and complexity. Availability zone distribution within regions provides fault isolation with lower latency but shared regional risks. Rack-level distribution addresses power and network failures within data centers but doesn't protect against facility-wide issues.

Cost Considerations extend beyond infrastructure to operational overhead. Maintaining standby capacity that remains idle until failures occur requires justification through downtime cost analysis. A service generating $10,000 in hourly revenue at 99.99% availability loses approximately $8,760 annually to downtime ($10,000 × 52.56 minutes / 60 minutes); at 99.9% the loss rises to roughly $87,600, which can justify significant HA investment. A service with minimal downtime costs may accept simpler approaches despite lower availability.
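
The downtime cost analysis can be sketched as a short calculation. This assumes revenue is lost uniformly while the service is down and a 365-day year; `annual_downtime_cost` is an illustrative name:

```ruby
HOURS_PER_YEAR = 365 * 24 # 8,760, assuming a non-leap year

# Expected yearly revenue lost to downtime at a given availability percentage.
def annual_downtime_cost(hourly_revenue, availability_percent)
  downtime_fraction = (100 - Rational(availability_percent.to_s)) / 100
  (hourly_revenue * HOURS_PER_YEAR * downtime_fraction).to_f
end

annual_downtime_cost(10_000, 99.99) # => 8760.0
annual_downtime_cost(10_000, 99.9)  # => 87600.0
```

Comparing the two results gives a rough upper bound on what moving from three nines to four nines is worth for this service.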

Implementation Approaches

Active-Passive Failover maintains one active component handling all traffic while passive standby components remain idle until failure occurs. The active component processes requests and replicates state to passive instances either synchronously or asynchronously. Detection systems monitor active component health and trigger failover to passive components upon failure. This approach keeps complexity low, but standby capacity sits idle during normal operation, failover adds recovery time, and data can be lost if asynchronous replication lags.

Database primary-replica setups exemplify active-passive failover. The primary database accepts all writes and replicates changes to replicas. Applications direct read traffic to replicas for load distribution but all writes target the primary. Upon primary failure, one replica promotes to primary and other replicas reconfigure to replicate from the new primary.
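
The promotion step can be sketched as a planning function. This is a simplified model, not the procedure of any particular database: the `Replica` struct and its `replay_position` field (how far the replica has applied the primary's changes) are assumptions made for illustration:

```ruby
# Hypothetical view of a replica as seen by the failover controller.
Replica = Struct.new(:name, :replay_position, :healthy, keyword_init: true)

# Picks the healthy replica that has applied the most changes (least data
# loss) and lists the remaining healthy replicas that must follow it.
def plan_promotion(replicas)
  candidates = replicas.select(&:healthy)
  raise 'no healthy replica available for promotion' if candidates.empty?

  new_primary = candidates.max_by(&:replay_position)
  followers = candidates - [new_primary]
  { promote: new_primary.name, repoint: followers.map(&:name) }
end

replicas = [
  Replica.new(name: 'replica-a', replay_position: 1_040, healthy: true),
  Replica.new(name: 'replica-b', replay_position: 1_055, healthy: true),
  Replica.new(name: 'replica-c', replay_position: 1_060, healthy: false)
]

plan = plan_promotion(replicas)
plan[:promote] # => "replica-b"
plan[:repoint] # => ["replica-a"]
```

Choosing the most advanced healthy replica limits how many acknowledged writes are lost when replication is asynchronous.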

Active-Active Configuration distributes traffic across multiple active components simultaneously, each capable of handling the full workload. Load balancers distribute requests across active instances using round-robin, least-connections, or weighted algorithms. All instances maintain synchronized state through shared data stores or distributed coordination. This approach maximizes resource utilization and eliminates failover delay since healthy instances immediately absorb traffic from failed instances. However, maintaining consistency across active instances increases complexity, and session affinity requirements may complicate load distribution.
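
Round-robin distribution that skips failed instances can be sketched as follows. The class is a toy model for illustration (real load balancers such as HAProxy implement this internally), and it is not thread-safe as written:

```ruby
# Cycles through instances in order, skipping any marked unhealthy.
class RoundRobinBalancer
  def initialize(instances)
    @instances = instances
    @healthy = instances.to_h { |instance| [instance, true] }
    @cursor = 0
  end

  def mark_down(instance)
    @healthy[instance] = false
  end

  def mark_up(instance)
    @healthy[instance] = true
  end

  # Returns the next healthy instance, or nil if none remain.
  def next_instance
    @instances.size.times do
      candidate = @instances[@cursor % @instances.size]
      @cursor += 1
      return candidate if @healthy[candidate]
    end
    nil
  end
end

balancer = RoundRobinBalancer.new(%w[app1 app2 app3])
balancer.mark_down('app2')
3.times.map { balancer.next_instance } # => ["app1", "app3", "app1"]
```

No failover step is needed: once an instance is marked down, the remaining instances simply receive its share of requests.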

Web application servers commonly use active-active configurations:

# Load balancer health check endpoint
class HealthController < ApplicationController
  def check
    # Verify application and dependencies
    checks = {
      database: check_database,
      cache: check_cache,
      external_api: check_external_services
    }
    
    if checks.values.all?
      render json: { status: 'healthy', checks: checks }, status: :ok
    else
      render json: { status: 'unhealthy', checks: checks }, status: :service_unavailable
    end
  end

  private

  def check_database
    ActiveRecord::Base.connection.execute('SELECT 1')
    true
  rescue StandardError
    false
  end

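  def check_cache
    Rails.cache.write('health_check', Time.current)
    # Confirm the value can be read back, not just that no error was raised
    !Rails.cache.read('health_check').nil?
  rescue StandardError
    false
  end

  def check_external_services
    # Placeholder (an assumption, application-specific): probe each required
    # external API here and return false if any of them fails
    true
  end
end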

N+1 Redundancy provisions N components to handle expected load plus one additional component for failover. This approach balances cost against availability by maintaining minimal spare capacity. For example, five components (N = 4) each running at 80% utilization can absorb a single failure, with the four survivors running at 100%. Multiple simultaneous failures overload the remaining components and degrade performance, though the system may remain operational. N+1 redundancy suits most availability requirements but falls short for systems that must tolerate multiple simultaneous failures.
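
The capacity arithmetic behind N+1 can be checked with a small helper. This sketch assumes load is spread evenly and redistributes evenly across survivors; `utilization_after_failure` is an illustrative name:

```ruby
# Utilization of each surviving component after `failures` components are
# lost. Values above 1.0 mean the survivors are overloaded.
def utilization_after_failure(component_count, utilization, failures: 1)
  survivors = component_count - failures
  raise ArgumentError, 'no components left' if survivors <= 0

  (component_count * utilization) / survivors
end

utilization_after_failure(5, 0.8)              # => 1.0
utilization_after_failure(5, 0.8, failures: 2) # roughly 1.33, overloaded
utilization_after_failure(3, 0.8)              # roughly 1.2, overloaded
```

The last call shows that "80% utilization" is only safe when the pool is large enough: with three components, losing one already pushes the other two past full capacity.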

Geographic Distribution spreads redundant components across multiple geographic regions to protect against regional outages. Multi-region architectures introduce significant complexity through increased latency between regions, data consistency challenges across geographic distances, and routing logic to direct traffic to optimal regions. Applications must handle scenarios where some regions become unavailable while others continue operating.

Traffic routing in multi-region deployments considers user location, region health, and capacity:

class RegionRouter
  def initialize
    @regions = [
      { name: 'us-east', endpoint: 'https://api-east.example.com', weight: 50 },
      { name: 'us-west', endpoint: 'https://api-west.example.com', weight: 30 },
      { name: 'eu-west', endpoint: 'https://api-eu.example.com', weight: 20 }
    ]
    @health_checker = RegionHealthChecker.new(@regions)
  end

  def route_request(user_location)
    healthy_regions = @regions.select { |r| @health_checker.healthy?(r[:name]) }
    
    return fallback_endpoint if healthy_regions.empty?
    
    # Route to geographically closest healthy region
    closest = healthy_regions.min_by do |region|
      calculate_distance(user_location, region[:name])
    end
    
    closest[:endpoint]
  end

  private

  def calculate_distance(user_location, region_name)
    # Geographic distance calculation
    region_coords = REGION_COORDINATES[region_name]
    haversine_distance(user_location, region_coords)
  end
end
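
The router relies on `haversine_distance` and `REGION_COORDINATES`, which the listing leaves undefined. One possible sketch follows, assuming locations are `[latitude, longitude]` pairs in degrees; the coordinates are rough illustrative values, not official data center locations:

```ruby
EARTH_RADIUS_KM = 6371.0

# Great-circle distance in kilometres between two [latitude, longitude]
# pairs given in degrees.
def haversine_distance(from, to)
  lat1, lon1 = from.map { |deg| deg * Math::PI / 180 }
  lat2, lon2 = to.map { |deg| deg * Math::PI / 180 }

  a = Math.sin((lat2 - lat1) / 2)**2 +
      Math.cos(lat1) * Math.cos(lat2) * Math.sin((lon2 - lon1) / 2)**2

  # Clamp guards against float rounding pushing `a` slightly above 1
  2 * EARTH_RADIUS_KM * Math.asin(Math.sqrt([a, 1.0].min))
end

# Approximate coordinates, for illustration only.
REGION_COORDINATES = {
  'us-east' => [38.9, -77.4],
  'us-west' => [45.8, -119.7],
  'eu-west' => [53.3, -6.3]
}.freeze

user_location = [51.5, -0.1] # roughly London
closest, = REGION_COORDINATES.min_by do |_name, coords|
  haversine_distance(user_location, coords)
end
closest # => "eu-west"
```

Geographic distance is only a proxy for network latency, so production routers often combine it with measured latency and the region weights.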

Ruby Implementation

Ruby applications implement high availability through connection management, retry logic, circuit breakers, and integration with external HA infrastructure. Ruby's exception handling and block-based APIs map naturally onto these resilience patterns.

Connection Pool Management prevents resource exhaustion and provides fault tolerance for database and external service connections. The connection_pool gem offers thread-safe connection pooling:

require 'connection_pool'
require 'pg'

# Database connection pool with fallback to a replica
class ResilientDatabase
  def initialize(size: 5, timeout: 5)
    @pool = ConnectionPool.new(size: size, timeout: timeout) do
      PG.connect(
        host: ENV['DB_HOST'],
        dbname: ENV['DB_NAME'],
        connect_timeout: 3
      )
    end
  end

  def execute(query)
    @pool.with do |conn|
      conn.exec(query)
    end
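  rescue PG::ConnectionBad, PG::UnableToSend, ConnectionPool::TimeoutError
    # Primary unreachable: fall back to the replica. Replicas are read-only,
    # so this only helps SELECT queries; writes will still fail.
    execute_on_replica(query)
  end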

  private

  def execute_on_replica(query)
    replica_pool.with do |conn|
      conn.exec(query)
    end
  end

  def replica_pool
    @replica_pool ||= ConnectionPool.new(size: 5, timeout: 5) do
      PG.connect(host: ENV['DB_REPLICA_HOST'], dbname: ENV['DB_NAME'])
    end
  end
end

Retry Logic with Exponential Backoff handles transient failures common in distributed systems. Network timeouts, rate limiting, and temporary service unavailability often resolve within seconds, making retries effective:

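require 'http'

class ResilientHttpClient
  # Raised internally to route 5xx responses through the retry path
  RetryableError = Class.new(StandardError)

  MAX_RETRIES = 3
  BASE_DELAY = 0.5
  MAX_DELAY = 30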

  def get(url, options = {})
    attempt = 0
    
    begin
      attempt += 1
      response = HTTP.timeout(5).get(url, options)
      
      raise RetryableError if response.status.server_error?
      
      response
    rescue HTTP::Error, RetryableError => e
      if attempt < MAX_RETRIES && retryable?(e)
        delay = calculate_backoff(attempt)
        sleep delay
        retry
      end
      
      raise
    end
  end

  private

  def retryable?(error)
    error.is_a?(HTTP::TimeoutError) ||
      error.is_a?(HTTP::ConnectionError) ||
      error.is_a?(RetryableError)
  end

  def calculate_backoff(attempt)
    # Exponential backoff with jitter
    delay = [BASE_DELAY * (2 ** (attempt - 1)), MAX_DELAY].min
    delay * (0.5 + rand * 0.5)
  end
end

Circuit Breakers prevent cascading failures by stopping requests to failing services. The circuit breaker monitors failure rates and transitions between closed (normal operation), open (rejecting requests), and half-open (testing recovery) states:

require 'stoplight'

# Build a circuit breaker for an external service.
# Block-first syntax as used by older Stoplight releases (2.x/3.x);
# newer releases changed the API, so check the installed version.
light = Stoplight('external-api') do
  HTTP.get('https://api.example.com/data')
end
  .with_threshold(5)              # Open after 5 failures
  .with_cool_off_time(60)         # Wait 60s before retry
  .with_fallback { cached_data }  # Return cached data when open

light.run                         # Nothing executes until run is called

# Usage in application code
class DataService
  def fetch_data
    light = Stoplight('external-api') do
      response = HTTP.timeout(5).get(api_endpoint)
      JSON.parse(response.body)
    end
      .with_fallback { Rails.cache.read('last_known_data') }
    
    light.run
  end
end
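
To make the three states concrete, here is a hand-rolled sketch of the state machine. It is not Stoplight's implementation and is not thread-safe; the class name, the `clock:` parameter (injected so the cool-off can be tested), and the consecutive-failure threshold are all assumptions chosen for illustration:

```ruby
class SimpleCircuitBreaker
  class OpenError < StandardError; end

  attr_reader :state

  def initialize(threshold: 3, cool_off: 30,
                 clock: -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) })
    @threshold = threshold
    @cool_off = cool_off
    @clock = clock
    @failures = 0
    @state = :closed
    @opened_at = nil
  end

  def call
    if @state == :open
      raise OpenError, 'circuit open' if @clock.call - @opened_at < @cool_off

      @state = :half_open # cool-off elapsed: allow one trial request
    end

    result = yield
    reset
    result
  rescue OpenError
    raise # rejected calls do not count as new failures
  rescue StandardError
    record_failure
    raise
  end

  private

  def reset
    @failures = 0
    @state = :closed
  end

  # Opens the circuit after `threshold` consecutive failures, or immediately
  # when the half-open trial request fails.
  def record_failure
    @failures += 1
    return unless @state == :half_open || @failures >= @threshold

    @state = :open
    @opened_at = @clock.call
  end
end

breaker = SimpleCircuitBreaker.new(threshold: 2, cool_off: 30)
begin
  breaker.call { raise IOError, 'upstream unavailable' }
rescue IOError
  # first failure: below the threshold, so the breaker stays closed
end
breaker.state # => :closed
```

A second consecutive failure would open the circuit, after which calls raise `OpenError` without touching the failing service until the cool-off period has passed.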

Health Check Implementation exposes service health to load balancers and orchestration systems. Comprehensive health checks verify not just service availability but dependency health:

class DeepHealthCheck
  def initialize
    @checks = [
      DatabaseCheck.new,
      CacheCheck.new,
      QueueCheck.new,
      ExternalApiCheck.new
    ]
  end

  def perform
    results = @checks.map do |check|
      start_time = Time.current
      
      begin
        status = check.call
        duration = Time.current - start_time
        
        {
          name: check.name,
          status: status ? 'healthy' : 'unhealthy',
          duration_ms: (duration * 1000).round(2)
        }
      rescue StandardError => e
        {
          name: check.name,
          status: 'unhealthy',
          error: e.message
        }
      end
    end
    
    {
      status: results.all? { |r| r[:status] == 'healthy' } ? 'healthy' : 'degraded',
      checks: results,
      timestamp: Time.current.iso8601
    }
  end
end

class DatabaseCheck
  def name
    'database'
  end

  def call
    ActiveRecord::Base.connection.execute('SELECT 1')
    true
  end
end

Tools & Ecosystem

Load Balancers distribute traffic across multiple instances and remove failed instances from rotation. HAProxy and Nginx sit in front of Ruby applications and integrate with them through generated configuration and HTTP health check endpoints. HAProxy provides TCP and HTTP load balancing with sophisticated health checking:

# HAProxy configuration generator
class HAProxyConfig
  def generate(backends)
    <<~CONFIG
      global
        maxconn 4096
      
      defaults
        mode http
        timeout connect 5000ms
        timeout client 50000ms
        timeout server 50000ms
      
      frontend http-in
        bind *:80
        default_backend app_servers
      
      backend app_servers
        balance roundrobin
        option httpchk GET /health
        #{backend_servers(backends)}
    CONFIG
  end

  private

  def backend_servers(backends)
    backends.map.with_index do |backend, i|
      "server app#{i} #{backend[:host]}:#{backend[:port]} check inter 2000 fall 3 rise 2"
    end.join("\n  ") # two spaces, matching the indent of the first server line
  end
end

Service Discovery automates the process of tracking available service instances. Consul provides service registration and health checking with Ruby client libraries:

require 'diplomat'
require 'socket'

class ServiceRegistry
  def initialize
    Diplomat.configure do |config|
      config.url = ENV['CONSUL_URL']
    end
  end

  def register_service(name, port)
    service_def = {
      name: name,
      port: port,
      address: local_ip,
      check: {
        http: "http://#{local_ip}:#{port}/health",
        interval: '10s',
        timeout: '5s'
      }
    }
    
    Diplomat::Service.register(service_def)
  end

  def discover_services(name)
    services = Diplomat::Service.get(name, :all)
    services.map do |service|
      {
        address: service.ServiceAddress,
        port: service.ServicePort
      }
    end
  end

  def deregister_service(service_id)
    Diplomat::Service.deregister(service_id)
  end

  private

  def local_ip
    Socket.ip_address_list.find { |ai| ai.ipv4? && !ai.ipv4_loopback? }.ip_address
  end
end

Database Replication Tools handle data synchronization between primary and replica databases. PostgreSQL streaming replication works with Ruby applications through connection string configuration:

# Primary and replica connection settings (failover itself is handled elsewhere)
class DatabaseConfig
  def self.connection_config
    {
      primary: {
        adapter: 'postgresql',
        host: ENV['DB_PRIMARY_HOST'],
        port: 5432,
        database: ENV['DB_NAME'],
        pool: 5,
        checkout_timeout: 5,
        connect_timeout: 2
      },
      replica: {
        adapter: 'postgresql',
        host: ENV['DB_REPLICA_HOST'],
        port: 5432,
        database: ENV['DB_NAME'],
        pool: 5,
        replica: true
      }
    }
  end
end

# Read-write splitting
class ApplicationRecord < ActiveRecord::Base
  self.abstract_class = true
  
  connects_to database: { 
    writing: :primary, 
    reading: :replica 
  }
end

Background Job Processing requires HA considerations for job queues and workers. Sidekiq persists jobs in Redis and retries failed jobs automatically, so queue availability depends on a highly available Redis deployment (for example, replication with Sentinel):

class ResilientWorker
  include Sidekiq::Worker
  
  sidekiq_options retry: 5, 
                  dead: false,
                  queue: :critical

  sidekiq_retry_in do |count, exception|
    case exception
    when Timeout::Error
      10 * (count + 1) # Quick retry for timeouts
    when ExternalServiceError
      60 * ((count + 1) ** 2) # Slower backoff for external failures (count starts at 0)
    end
  end

  def perform(order_id)
    order = Order.find(order_id)
    
    # Process with timeout protection
    Timeout.timeout(30) do
      ExternalService.process_order(order)
    end
  rescue ExternalServiceError => e
    # Log and allow automatic retry
    logger.error "Order processing failed: #{e.message}"
    raise
  end
end

Real-World Applications

Web Application High Availability requires coordinating multiple layers. Load balancers distribute HTTP traffic, application servers scale horizontally, databases replicate for failover, and caching layers reduce database load. Session management presents particular challenges: sticky sessions concentrate users on specific servers, complicating load distribution, while session stores in Redis or databases add latency but let any server handle any request, so sessions survive the failure of an individual server.

A production Ruby on Rails deployment implements HA across these layers:

# Application-level session management
Rails.application.config.session_store :redis_store,
  servers: [
    { host: ENV['REDIS_PRIMARY'], port: 6379, db: 0 },
    { host: ENV['REDIS_REPLICA'], port: 6379, db: 0 }
  ],
  expire_after: 2.hours,
  key: '_app_session',
  threadsafe: true,
  compress: true

# Cache configuration with fallback
Rails.application.config.cache_store = :redis_cache_store, {
  url: ENV['REDIS_CACHE_URL'],
  connect_timeout: 2,
  read_timeout: 1,
  write_timeout: 1,
  reconnect_attempts: 3,
  error_handler: -> (method:, returning:, exception:) {
    Rails.logger.error "Redis cache error: #{exception.message}"
    # Continue with degraded performance rather than failing
  }
}

API Service Reliability demands low-latency failover since users experience failures immediately. Circuit breakers prevent retry storms, timeouts ensure responsive failure handling, and graceful degradation maintains partial functionality when dependencies fail:

class ResilientApiController < ApplicationController
  rescue_from StandardError, with: :handle_error
  
  def show
    data = Rails.cache.fetch("resource:#{params[:id]}", expires_in: 5.minutes) do
      fetch_with_circuit_breaker(params[:id])
    end
    
    render json: data
  end

  private

  def fetch_with_circuit_breaker(id)
    # One breaker per upstream service; a per-id name would never accumulate failures
    Stoplight('external-api') do
      Timeout.timeout(3) do
        ExternalApi.fetch_resource(id)
      end
    end
      .with_threshold(5)
      .with_fallback { fetch_from_replica(id) }
      .run
  end

  def fetch_from_replica(id)
    # Attempt replica or return cached data
    Timeout.timeout(5) do
      ReplicaApi.fetch_resource(id)
    end
  rescue Timeout::Error
    Rails.cache.read("stale_resource:#{id}")
  end

  def handle_error(exception)
    case exception
    when Timeout::Error
      render json: { error: 'Service timeout' }, status: :gateway_timeout
    when ExternalServiceError
      render json: { error: 'Service unavailable' }, status: :service_unavailable
    else
      render json: { error: 'Internal error' }, status: :internal_server_error
    end
  end
end

Background Processing Systems handle failures through job persistence, retry mechanisms, and dead letter queues. Long-running jobs require idempotency to handle repeated execution after failures:

class OrderProcessor
  include Sidekiq::Worker
  
  def perform(order_id)
    order = Order.find(order_id)

    # with_lock opens a transaction and reloads the row with FOR UPDATE,
    # so the lock is held until the block finishes
    order.with_lock do
      # Skip if already processed (idempotency)
      next if order.processed?

      process_payment(order)
      update_inventory(order)
      send_confirmation(order)

      order.update!(processed: true, processed_at: Time.current)
    end
  rescue PaymentError => e
    order.update!(status: 'payment_failed', error: e.message)
    # Don't retry payment failures
    raise Sidekiq::JobRetry::Skip
  rescue InventoryError => e
    # Retry inventory errors
    order.update!(status: 'inventory_pending')
    raise
  end
end

Database Failover in Production requires coordinated updates across application instances. Connection pool management handles failover transparently:

class DatabaseFailoverHandler
  def self.monitor
    Thread.new do
      loop do
        check_primary_health
        sleep 5
      end
    rescue StandardError => e
      Rails.logger.error "Failover monitoring error: #{e.message}"
      retry
    end
  end

  def self.check_primary_health
    ActiveRecord::Base.connection.execute('SELECT 1')
  rescue ActiveRecord::ActiveRecordError => e
    # ActiveRecord wraps driver errors such as PG::Error in its own exceptions
    Rails.logger.error "Primary database failure detected: #{e.message}"
    trigger_failover
  end

  def self.trigger_failover
    # Update connection to replica
    new_config = DatabaseConfig.connection_config[:replica].merge(
      host: ENV['DB_FAILOVER_HOST']
    )
    
    ActiveRecord::Base.establish_connection(new_config)
    
    # Verify connection
    ActiveRecord::Base.connection.execute('SELECT 1')
    Rails.logger.info "Database failover completed successfully"
  rescue StandardError => e
    Rails.logger.error "Failover failed: #{e.message}"
    raise
  end
end

Reference

Availability Metrics

| Availability | Downtime/Year | Downtime/Month | Downtime/Week | Common Use Cases |
|---|---|---|---|---|
| 90% | 36.5 days | 3 days | 16.8 hours | Development environments |
| 99% | 3.65 days | 7.2 hours | 1.68 hours | Internal tools |
| 99.9% | 8.76 hours | 43.2 minutes | 10.1 minutes | Standard web applications |
| 99.99% | 52.56 minutes | 4.32 minutes | 1.01 minutes | Business-critical services |
| 99.999% | 5.26 minutes | 25.9 seconds | 6.05 seconds | Financial systems |

Failover Types

| Type | Recovery Time | Data Loss Risk | Complexity | Resource Utilization |
|---|---|---|---|---|
| Active-Passive | Seconds to minutes | Possible with async replication | Low | Low (standby idle) |
| Active-Active | Immediate | Minimal with proper sync | High | High (all active) |
| Hot Standby | Seconds | Minimal | Medium | Medium (standby ready) |
| Warm Standby | Minutes | Possible | Medium | Medium (standby starting) |
| Cold Standby | Hours | Likely | Low | Low (manual startup) |

Ruby HA Gems

| Gem | Purpose | Key Features |
|---|---|---|
| connection_pool | Connection pooling | Thread-safe, timeout handling, automatic cleanup |
| stoplight | Circuit breaker | Configurable thresholds, fallbacks, monitoring |
| semian | Resilience patterns | Circuit breaker, bulkhead, adaptive timeout |
| puma | Web server | Clustering, worker management, graceful restart |
| sidekiq | Background jobs | Retry logic, failure handling, persistence |
| redis-rb | Caching/session | Connection pooling, automatic reconnection |

Health Check Status Codes

| Code | Status | Meaning | Load Balancer Action |
|---|---|---|---|
| 200 | OK | Service healthy | Route traffic |
| 429 | Too Many Requests | Rate limited | Reduce traffic |
| 503 | Service Unavailable | Temporary failure | Stop routing, retry |
| 504 | Gateway Timeout | Downstream timeout | Stop routing |

Detection Thresholds

| Metric | Threshold | Action |
|---|---|---|
| Consecutive failures | 3-5 failures | Mark unhealthy |
| Success rate | Below 95% | Trigger warning |
| Response time | Above 5s | Degraded status |
| Error rate | Above 5% | Circuit breaker open |
| Health check interval | 5-10 seconds | Standard monitoring |
| Retry attempts | 3-5 retries | Before failure |

Replication Configuration

| Setting | Synchronous | Asynchronous |
|---|---|---|
| Consistency | Strong | Eventual |
| Write latency | Higher | Lower |
| Availability during partition | Lower | Higher |
| Data loss risk | None | Possible |
| Use case | Financial data | Analytics, logs |