Overview
Failover represents the automatic transfer of operations from a failed component to a redundant standby component, while high availability (HA) describes system design that minimizes downtime through redundancy, fault detection, and automatic recovery mechanisms. These concepts address the fundamental challenge of maintaining service continuity when individual components fail, which occurs inevitably in distributed systems due to hardware failures, network partitions, software bugs, or operational errors.
High availability systems measure reliability using availability percentages, commonly expressed as "nines of availability." A system with 99.9% availability ("three nines") experiences approximately 8.76 hours of downtime annually, while 99.99% availability ("four nines") reduces this to 52.56 minutes per year. Each additional nine requires exponentially more investment in redundancy, monitoring, and operational discipline.
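These downtime figures follow directly from the unavailable fraction of a year (1 − availability); a quick sketch of the arithmetic:

# Annual downtime implied by an availability percentage.
MINUTES_PER_YEAR = 365 * 24 * 60.0 # 525,600

def annual_downtime_minutes(availability_percent)
  MINUTES_PER_YEAR * (1 - availability_percent / 100.0)
end

annual_downtime_minutes(99.9)  # => 525.6 (~8.76 hours)
annual_downtime_minutes(99.99) # => 52.56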
The distinction between failover and high availability matters: failover describes a specific mechanism for handling failures, while high availability represents the broader goal achieved through multiple techniques including failover, redundancy, load balancing, and fault tolerance. A highly available system employs failover as one component within a comprehensive reliability strategy.
Consider a web application backed by a database. Without HA mechanisms, database server failure renders the application unusable until manual intervention restores service. With failover implemented, a standby database server automatically assumes the primary role when failure detection occurs, typically completing the transition within seconds to minutes. The application continues operating with minimal disruption, though possibly with degraded performance during the transition.
Key Principles
Redundancy forms the foundation of high availability by maintaining multiple instances of critical components. Redundancy operates at multiple levels: hardware redundancy duplicates physical servers, network redundancy provides multiple network paths, data redundancy replicates information across storage systems, and geographic redundancy distributes resources across data centers. The N+1 redundancy model maintains N components required for full capacity plus one additional component for failover, while N+M redundancy adds M spare components to handle multiple simultaneous failures.
Fault Detection identifies component failures through health checks, heartbeat monitoring, and service-level monitoring. Detection systems balance sensitivity against false positives: aggressive detection minimizes downtime from actual failures but risks unnecessary failovers from transient issues, while conservative detection reduces false alarms but increases recovery time. Detection mechanisms typically combine multiple signals including response time monitoring, error rate tracking, and explicit health check endpoints that verify both service availability and dependency health.
Automatic Recovery executes predetermined responses to detected failures without human intervention. Recovery automation eliminates human response time from the critical path and ensures consistent execution under stress. Recovery procedures handle state transfer from failed to healthy components, routing updates to direct traffic away from failed instances, and verification steps confirming successful recovery before declaring the system healthy.
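A minimal sketch of such a recovery sequence, covering the three steps above; the standby and router objects and their promote!, redirect, and healthy? methods are hypothetical stand-ins, not a real API:

class RecoveryError < StandardError; end

# Illustrative recovery orchestration: promote a standby, redirect
# traffic, then verify before declaring recovery complete.
class RecoveryOrchestrator
  def initialize(standby:, router:)
    @standby = standby
    @router = router
  end

  def recover_from(failed_primary)
    @standby.promote!                                     # state transfer
    @router.redirect(from: failed_primary, to: @standby)  # routing update
    # Verification step before declaring the system healthy
    raise RecoveryError, 'standby failed verification' unless @standby.healthy?
  end
end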
State Management determines how systems maintain consistency during failover. Stateless components simplify failover since any instance can serve any request without coordination, but many real-world systems maintain state requiring careful synchronization. Session state, transaction state, and application data must either replicate continuously to standby instances or persist externally where multiple instances access shared state.
Split-Brain Prevention addresses scenarios where network partitions cause multiple components to simultaneously believe they hold the primary role. Without prevention mechanisms, split-brain scenarios create data conflicts and consistency violations. Quorum-based systems require majority agreement before assuming primary status, fencing mechanisms forcibly disable failed primaries before promoting standbys, and distributed coordination services provide authoritative leadership election.
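A minimal quorum check illustrates the majority rule; the peer objects and their acknowledge_leadership? method are assumptions for illustration:

# A node may assume the primary role only if a strict majority of the
# cluster (peers plus itself) acknowledges its leadership.
class QuorumElector
  def initialize(peers)
    @peers = peers
  end

  def may_become_primary?
    votes = @peers.count do |peer|
      peer.acknowledge_leadership? rescue false # unreachable peer = no vote
    end
    # Count our own vote; integer division gives strict majority
    votes + 1 > (@peers.size + 1) / 2
  end
end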
Health check implementation demonstrates these principles:
require 'http'
require 'json'

class HealthChecker
  def initialize(endpoint, timeout: 5, interval: 10)
    @endpoint = endpoint
    @timeout = timeout
    @interval = interval
    @consecutive_failures = 0
    @failure_threshold = 3
  end

  def monitor
    loop do
      if check_health
        @consecutive_failures = 0
      else
        @consecutive_failures += 1
        # trigger_failover is supplied by the surrounding failover system
        trigger_failover if @consecutive_failures >= @failure_threshold
      end
      sleep @interval
    end
  end

  private

  def check_health
    response = HTTP.timeout(@timeout).get(@endpoint)
    response.status.success? && validate_response(response)
  rescue HTTP::Error, Timeout::Error
    false
  end

  def validate_response(response)
    data = JSON.parse(response.body)
    data['status'] == 'healthy' &&
      data['dependencies'].all? { |dep| dep['status'] == 'healthy' }
  end
end
Design Considerations
The decision to implement high availability requires balancing availability requirements against costs in infrastructure, operational complexity, and consistency trade-offs. Systems with strict uptime requirements justify HA investments, while internal tools or services with flexible availability needs may accept simpler architectures with longer recovery times.
Availability Tiers map business requirements to architectural approaches. Tier 1 systems target 99.95% or higher availability, requiring automated failover, redundant components across multiple availability zones, and 24/7 monitoring. These systems support revenue-generating services or critical business operations where downtime directly impacts business outcomes. Tier 2 systems target 99.9% availability with automated recovery but potentially within a single data center, acceptable for important but not critical services. Tier 3 systems accept 99% availability with manual failover procedures, suitable for internal tools or services with flexible usage patterns.
Consistency vs Availability Trade-offs emerge from the CAP theorem, which states distributed systems can provide at most two of three guarantees: consistency, availability, and partition tolerance. Since network partitions occur inevitably, systems choose between consistency (CP systems) or availability (AP systems) during partitions. Financial systems typically choose consistency, accepting unavailability during partitions rather than risk inconsistent data. Social media feeds typically choose availability, accepting eventual consistency to maintain service during partitions.
Database replication illustrates these trade-offs:
class ReplicationError < StandardError; end

# Synchronous replication - prioritizes consistency
class SynchronousReplicator
  def initialize(primary, replicas)
    @primary = primary
    @replicas = replicas
  end

  def write(key, value)
    @primary.write(key, value)
    # Block until every replica confirms the write
    @replicas.each do |replica|
      replica.write(key, value)
      raise ReplicationError unless replica.confirm_write(key)
    end
    true
  rescue ReplicationError
    # Sacrifice availability to maintain consistency
    rollback_write(key)
    raise
  end
end

# Asynchronous replication - prioritizes availability
class AsynchronousReplicator
  def initialize(primary, replicas)
    @primary = primary
    @replicas = replicas
  end

  def write(key, value)
    @primary.write(key, value)
    # Return immediately; replicate in the background
    Thread.new do
      @replicas.each do |replica|
        replica.write(key, value) rescue log_replication_failure(replica)
      end
    end
    true
  end
end
Failure Domain Isolation minimizes the impact of individual failures by ensuring failures affect limited system subsets. Geographic distribution across regions protects against data center outages but introduces latency and complexity. Availability zone distribution within regions provides fault isolation with lower latency but shared regional risks. Rack-level distribution addresses power and network failures within data centers but doesn't protect against facility-wide issues.
Cost Considerations extend beyond infrastructure to operational overhead. Maintaining standby capacity that sits idle until failures occur requires justification through downtime cost analysis. A service generating $10,000 in hourly revenue and targeting 99.99% availability loses approximately $8,760 annually to downtime ($10,000 × 52.56 minutes ÷ 60 minutes), justifying significant HA investment. A service with minimal downtime costs may accept simpler approaches despite lower availability.
Implementation Approaches
Active-Passive Failover maintains one active component handling all traffic while passive standby components remain idle until failure occurs. The active component processes requests and replicates state to passive instances either synchronously or asynchronously. Detection systems monitor active component health and trigger failover to passive components upon failure. This approach minimizes complexity and resource utilization but introduces recovery time during failover and potential data loss if replication lags.
Database primary-replica setups exemplify active-passive failover. The primary database accepts all writes and replicates changes to replicas. Applications direct read traffic to replicas for load distribution but all writes target the primary. Upon primary failure, one replica promotes to primary and other replicas reconfigure to replicate from the new primary.
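A simplified promotion sequence sketches this flow; the replica objects and their replication_lag, promote_to_primary!, and replicate_from methods stand in for whatever replication tooling is actually in use:

# Promote the most up-to-date replica after primary failure.
class PrimaryPromoter
  def initialize(replicas)
    @replicas = replicas
  end

  def promote_most_current!
    # Choose the replica with the least replication lag to minimize data loss
    candidate = @replicas.min_by(&:replication_lag)
    candidate.promote_to_primary!
    # Point the remaining replicas at the new primary
    (@replicas - [candidate]).each { |r| r.replicate_from(candidate) }
    candidate
  end
end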
Active-Active Configuration distributes traffic across multiple active components simultaneously, each capable of handling the full workload. Load balancers distribute requests across active instances using round-robin, least-connections, or weighted algorithms. All instances maintain synchronized state through shared data stores or distributed coordination. This approach maximizes resource utilization and eliminates failover delay since healthy instances immediately absorb traffic from failed instances. However, maintaining consistency across active instances increases complexity, and session affinity requirements may complicate load distribution.
Web application servers commonly use active-active configurations:
# Load balancer health check endpoint
class HealthController < ApplicationController
  def check
    # Verify the application and its dependencies
    checks = {
      database: check_database,
      cache: check_cache,
      external_api: check_external_services
    }

    if checks.values.all?
      render json: { status: 'healthy', checks: checks }, status: :ok
    else
      render json: { status: 'unhealthy', checks: checks }, status: :service_unavailable
    end
  end

  private

  def check_database
    ActiveRecord::Base.connection.execute('SELECT 1')
    true
  rescue StandardError
    false
  end

  def check_cache
    # Verify a full write/read round trip, not just the write
    Rails.cache.write('health_check', Time.current)
    Rails.cache.read('health_check').present?
  rescue StandardError
    false
  end

  # check_external_services follows the same pattern for upstream APIs
end
N+1 Redundancy provisions the N components needed to handle expected load plus one additional component for failover. This approach balances cost against availability by maintaining minimal spare capacity: a single component failure leaves the N components required for full load, so the system continues without degradation, while multiple simultaneous failures degrade performance or exceed capacity. N+1 redundancy suits most availability requirements but falls short for systems requiring tolerance of multiple simultaneous failures, which call for N+M spares. The capacity arithmetic is straightforward, as the sketch below shows.
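A small sketch of the capacity check, with illustrative figures:

# Under N+M provisioning, do the survivors of f failures still cover
# the expected load?
def survives_failures?(components, per_component_capacity, expected_load, failures: 1)
  (components - failures) * per_component_capacity >= expected_load
end

# 4 servers each handling 100 req/s against a 300 req/s load (N=3, +1 spare):
survives_failures?(4, 100, 300)              # => true
survives_failures?(4, 100, 300, failures: 2) # => false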
Geographic Distribution spreads redundant components across multiple geographic regions to protect against regional outages. Multi-region architectures introduce significant complexity through increased latency between regions, data consistency challenges across geographic distances, and routing logic to direct traffic to optimal regions. Applications must handle scenarios where some regions become unavailable while others continue operating.
Traffic routing in multi-region deployments considers user location, region health, and capacity:
class RegionRouter
  def initialize
    @regions = [
      { name: 'us-east', endpoint: 'https://api-east.example.com', weight: 50 },
      { name: 'us-west', endpoint: 'https://api-west.example.com', weight: 30 },
      { name: 'eu-west', endpoint: 'https://api-eu.example.com', weight: 20 }
    ]
    @health_checker = RegionHealthChecker.new(@regions)
  end

  def route_request(user_location)
    healthy_regions = @regions.select { |r| @health_checker.healthy?(r[:name]) }
    return fallback_endpoint if healthy_regions.empty?

    # Route to the geographically closest healthy region
    closest = healthy_regions.min_by do |region|
      calculate_distance(user_location, region[:name])
    end
    closest[:endpoint]
  end

  private

  def calculate_distance(user_location, region_name)
    # Geographic distance via the haversine formula;
    # REGION_COORDINATES maps region names to lat/long pairs
    region_coords = REGION_COORDINATES[region_name]
    haversine_distance(user_location, region_coords)
  end
end
Ruby Implementation
Ruby applications implement high availability through connection management, retry logic, circuit breakers, and integration with external HA infrastructure. Ruby's exception handling and block-based APIs provide natural idioms for these resilience patterns.
Connection Pool Management prevents resource exhaustion and provides fault tolerance for database and external service connections. The connection_pool gem offers thread-safe connection pooling:
require 'connection_pool'
require 'pg'

# Database connection pool with failover to a read replica
class ResilientDatabase
  def initialize(size: 5, timeout: 5)
    @pool = ConnectionPool.new(size: size, timeout: timeout) do
      PG.connect(
        host: ENV['DB_HOST'],
        dbname: ENV['DB_NAME'],
        connect_timeout: 3
      )
    end
  end

  def execute(query)
    @pool.with { |conn| conn.exec(query) }
  rescue PG::Error
    # Primary unavailable: fall back to the replica (safe for reads)
    execute_on_replica(query)
  end

  private

  def execute_on_replica(query)
    replica_pool.with { |conn| conn.exec(query) }
  end

  def replica_pool
    @replica_pool ||= ConnectionPool.new(size: 5, timeout: 5) do
      PG.connect(host: ENV['DB_REPLICA_HOST'], dbname: ENV['DB_NAME'])
    end
  end
end
Retry Logic with Exponential Backoff handles transient failures common in distributed systems. Network timeouts, rate limiting, and temporary service unavailability often resolve within seconds, making retries effective:
require 'http'

class RetryableError < StandardError; end

class ResilientHttpClient
  MAX_RETRIES = 3
  BASE_DELAY = 0.5
  MAX_DELAY = 30

  def get(url, options = {})
    attempt = 0
    begin
      attempt += 1
      response = HTTP.timeout(5).get(url, options)
      raise RetryableError if response.status.server_error?
      response
    rescue HTTP::Error, RetryableError => e
      if attempt < MAX_RETRIES && retryable?(e)
        sleep calculate_backoff(attempt)
        retry
      end
      raise
    end
  end

  private

  def retryable?(error)
    error.is_a?(HTTP::TimeoutError) ||
      error.is_a?(HTTP::ConnectionError) ||
      error.is_a?(RetryableError)
  end

  def calculate_backoff(attempt)
    # Exponential backoff with jitter to avoid synchronized retry storms
    delay = [BASE_DELAY * (2**(attempt - 1)), MAX_DELAY].min
    delay * (0.5 + rand * 0.5)
  end
end
Circuit Breakers prevent cascading failures by stopping requests to failing services. The circuit breaker monitors failure rates and transitions between closed (normal operation), open (rejecting requests), and half-open (testing recovery) states:
require 'stoplight'

# Configure a circuit breaker for an external service
# (classic Stoplight block API: failures trip the light open,
# and the fallback is served while it stays open)
light = Stoplight('external-api') { HTTP.get('https://api.example.com/data') }
          .with_threshold(5)       # Open after 5 consecutive failures
          .with_cool_off_time(60)  # Wait 60s before testing recovery
          .with_fallback { cached_data } # Return cached data when open
light.run

# Usage in application code
class DataService
  def fetch_data
    light = Stoplight('external-api') do
      response = HTTP.timeout(5).get(api_endpoint)
      JSON.parse(response.body)
    end.with_fallback { Rails.cache.read('last_known_data') }

    light.run
  end
end
Health Check Implementation exposes service health to load balancers and orchestration systems. Comprehensive health checks verify not just service availability but dependency health:
class DeepHealthCheck
  def initialize
    @checks = [
      DatabaseCheck.new,
      CacheCheck.new,
      QueueCheck.new,
      ExternalApiCheck.new
    ]
  end

  def perform
    results = @checks.map do |check|
      start_time = Time.current
      begin
        status = check.call
        duration = Time.current - start_time
        {
          name: check.name,
          status: status ? 'healthy' : 'unhealthy',
          duration_ms: (duration * 1000).round(2)
        }
      rescue StandardError => e
        {
          name: check.name,
          status: 'unhealthy',
          error: e.message
        }
      end
    end

    {
      status: results.all? { |r| r[:status] == 'healthy' } ? 'healthy' : 'degraded',
      checks: results,
      timestamp: Time.current.iso8601
    }
  end
end

# Each check exposes the same interface: #name and #call
class DatabaseCheck
  def name
    'database'
  end

  def call
    ActiveRecord::Base.connection.execute('SELECT 1')
    true
  end
end
Tools & Ecosystem
Load Balancers distribute traffic across multiple instances and remove failed instances from rotation. HAProxy and Nginx offer Ruby integration through configuration management and health check endpoints. HAProxy provides TCP and HTTP load balancing with sophisticated health checking:
# HAProxy configuration generator
class HAProxyConfig
  def generate(backends)
    <<~CONFIG
      global
        maxconn 4096

      defaults
        mode http
        timeout connect 5000ms
        timeout client 50000ms
        timeout server 50000ms

      frontend http-in
        bind *:80
        default_backend app_servers

      backend app_servers
        balance roundrobin
        option httpchk GET /health
        #{backend_servers(backends)}
    CONFIG
  end

  private

  def backend_servers(backends)
    backends.map.with_index do |backend, i|
      "server app#{i} #{backend[:host]}:#{backend[:port]} check inter 2000 fall 3 rise 2"
    end.join("\n  ")
  end
end
Service Discovery automates the process of tracking available service instances. Consul provides service registration and health checking with Ruby client libraries:
require 'diplomat'
require 'socket'

class ServiceRegistry
  def initialize
    Diplomat.configure do |config|
      config.url = ENV['CONSUL_URL']
    end
  end

  def register_service(name, port)
    service_def = {
      name: name,
      port: port,
      address: local_ip,
      check: {
        http: "http://#{local_ip}:#{port}/health",
        interval: '10s',
        timeout: '5s'
      }
    }
    Diplomat::Service.register(service_def)
  end

  def discover_services(name)
    services = Diplomat::Service.get(name, :all)
    services.map do |service|
      {
        address: service.ServiceAddress,
        port: service.ServicePort
      }
    end
  end

  def deregister_service(service_id)
    Diplomat::Service.deregister(service_id)
  end

  private

  def local_ip
    Socket.ip_address_list.find { |ai| ai.ipv4? && !ai.ipv4_loopback? }.ip_address
  end
end
Database Replication Tools handle data synchronization between primary and replica databases. PostgreSQL streaming replication works with Ruby applications through connection string configuration:
# Database configuration defining primary and replica roles
class DatabaseConfig
  def self.connection_config
    {
      primary: {
        adapter: 'postgresql',
        host: ENV['DB_PRIMARY_HOST'],
        port: 5432,
        database: ENV['DB_NAME'],
        pool: 5,
        checkout_timeout: 5,
        connect_timeout: 2
      },
      replica: {
        adapter: 'postgresql',
        host: ENV['DB_REPLICA_HOST'],
        port: 5432,
        database: ENV['DB_NAME'],
        pool: 5,
        replica: true
      }
    }
  end
end

# Read-write splitting (Rails 6+ multiple-database API)
class ApplicationRecord < ActiveRecord::Base
  self.abstract_class = true

  connects_to database: {
    writing: :primary,
    reading: :replica
  }
end
Background Job Processing requires HA considerations for job queues and workers. Sidekiq provides built-in resilience through Redis clustering and job retry mechanisms:
require 'timeout'

class ResilientWorker
  include Sidekiq::Worker

  sidekiq_options retry: 5, dead: false, queue: :critical

  sidekiq_retry_in do |count, exception|
    case exception
    when Timeout::Error
      10 * (count + 1) # Quick retry for timeouts
    when ExternalServiceError
      60 * (count**2)  # Slower backoff for external failures
    end
  end

  def perform(order_id)
    order = Order.find(order_id)

    # Process with timeout protection
    Timeout.timeout(30) do
      ExternalService.process_order(order)
    end
  rescue ExternalServiceError => e
    # Log and let Sidekiq's automatic retry handle it
    logger.error "Order processing failed: #{e.message}"
    raise
  end
end
Real-World Applications
Web Application High Availability requires coordinating multiple layers. Load balancers distribute HTTP traffic, application servers scale horizontally, databases replicate for failover, and caching layers reduce database load. Session management presents particular challenges: sticky sessions concentrate users on specific servers, complicating load distribution, while session stores in Redis or databases add latency but let any server handle any request, so individual server failures don't lose sessions.
A production Ruby on Rails deployment implements HA across these layers:
# Application-level session management backed by Redis
Rails.application.config.session_store :redis_store,
  servers: [
    { host: ENV['REDIS_PRIMARY'], port: 6379, db: 0 },
    { host: ENV['REDIS_REPLICA'], port: 6379, db: 0 }
  ],
  expire_after: 2.hours,
  key: '_app_session',
  threadsafe: true,
  compress: true

# Cache configuration with graceful degradation on Redis errors
Rails.application.config.cache_store = :redis_cache_store, {
  url: ENV['REDIS_CACHE_URL'],
  connect_timeout: 2,
  read_timeout: 1,
  write_timeout: 1,
  reconnect_attempts: 3,
  error_handler: ->(method:, returning:, exception:) {
    Rails.logger.error "Redis cache error: #{exception.message}"
    # Continue with degraded performance rather than failing
  }
}
API Service Reliability demands low-latency failover since users experience failures immediately. Circuit breakers prevent retry storms, timeouts ensure responsive failure handling, and graceful degradation maintains partial functionality when dependencies fail:
class ResilientApiController < ApplicationController
  rescue_from StandardError, with: :handle_error

  def show
    data = Rails.cache.fetch("resource:#{params[:id]}", expires_in: 5.minutes) do
      fetch_with_circuit_breaker(params[:id])
    end
    render json: data
  end

  private

  def fetch_with_circuit_breaker(id)
    Stoplight("external-api-#{id}") do
      Timeout.timeout(3) { ExternalApi.fetch_resource(id) }
    end
      .with_threshold(5)
      .with_fallback { fetch_from_replica(id) }
      .run
  end

  def fetch_from_replica(id)
    # Attempt the replica, falling back to stale cached data
    Timeout.timeout(5) { ReplicaApi.fetch_resource(id) }
  rescue Timeout::Error
    Rails.cache.read("stale_resource:#{id}")
  end

  def handle_error(exception)
    case exception
    when Timeout::Error
      render json: { error: 'Service timeout' }, status: :gateway_timeout
    when ExternalServiceError
      render json: { error: 'Service unavailable' }, status: :service_unavailable
    else
      render json: { error: 'Internal error' }, status: :internal_server_error
    end
  end
end
Background Processing Systems handle failures through job persistence, retry mechanisms, and dead letter queues. Long-running jobs require idempotency to handle repeated execution after failures:
class OrderProcessor
  include Sidekiq::Worker

  def perform(order_id)
    order = Order.find(order_id)

    # Skip if already processed (idempotency)
    return if order.processed?

    ActiveRecord::Base.transaction do
      # Take the row lock inside the transaction so it is held until commit
      order.lock!
      unless order.processed? # re-check under the lock
        process_payment(order)
        update_inventory(order)
        send_confirmation(order)
        order.update!(processed: true, processed_at: Time.current)
      end
    end
  rescue PaymentError => e
    order.update!(status: 'payment_failed', error: e.message)
    # Don't retry payment failures
    raise Sidekiq::JobRetry::Skip
  rescue InventoryError
    # Retry inventory errors
    order.update!(status: 'inventory_pending')
    raise
  end
end
Database Failover in Production requires coordinated updates across application instances. Connection pool management handles failover transparently:
class DatabaseFailoverHandler
  def self.monitor
    Thread.new do
      loop do
        check_primary_health
        sleep 5
      end
    rescue StandardError => e
      Rails.logger.error "Failover monitoring error: #{e.message}"
      retry
    end
  end

  def self.check_primary_health
    ActiveRecord::Base.connection.execute('SELECT 1')
  rescue PG::Error => e
    Rails.logger.error "Primary database failure detected: #{e.message}"
    trigger_failover
  end

  def self.trigger_failover
    # Reconnect the application to the failover host
    new_config = DatabaseConfig.connection_config[:replica].merge(
      host: ENV['DB_FAILOVER_HOST']
    )
    ActiveRecord::Base.establish_connection(new_config)

    # Verify the new connection before declaring success
    ActiveRecord::Base.connection.execute('SELECT 1')
    Rails.logger.info 'Database failover completed successfully'
  rescue StandardError => e
    Rails.logger.error "Failover failed: #{e.message}"
    raise
  end
end
Reference
Availability Metrics
| Availability | Downtime/Year | Downtime/Month | Downtime/Week | Common Use Cases |
|---|---|---|---|---|
| 90% | 36.5 days | 3 days | 16.8 hours | Development environments |
| 99% | 3.65 days | 7.2 hours | 1.68 hours | Internal tools |
| 99.9% | 8.76 hours | 43.2 minutes | 10.1 minutes | Standard web applications |
| 99.99% | 52.56 minutes | 4.32 minutes | 1.01 minutes | Business-critical services |
| 99.999% | 5.26 minutes | 25.9 seconds | 6.05 seconds | Financial systems |
Failover Types
| Type | Recovery Time | Data Loss Risk | Complexity | Resource Utilization |
|---|---|---|---|---|
| Active-Passive | Seconds to minutes | Possible with async replication | Low | Low (standby idle) |
| Active-Active | Immediate | Minimal with proper sync | High | High (all active) |
| Hot Standby | Seconds | Minimal | Medium | Medium (standby ready) |
| Warm Standby | Minutes | Possible | Medium | Medium (standby starting) |
| Cold Standby | Hours | Likely | Low | Low (manual startup) |
Ruby HA Gems
| Gem | Purpose | Key Features |
|---|---|---|
| connection_pool | Connection pooling | Thread-safe, timeout handling, automatic cleanup |
| stoplight | Circuit breaker | Configurable thresholds, fallbacks, monitoring |
| semian | Resilience patterns | Circuit breaker, bulkhead, adaptive timeout |
| puma | Web server | Clustering, worker management, graceful restart |
| sidekiq | Background jobs | Retry logic, failure handling, persistence |
| redis-rb | Caching/session | Connection pooling, automatic reconnection |
Health Check Status Codes
| Code | Status | Meaning | Load Balancer Action |
|---|---|---|---|
| 200 | OK | Service healthy | Route traffic |
| 429 | Too Many Requests | Rate limited | Reduce traffic |
| 503 | Service Unavailable | Temporary failure | Stop routing, retry |
| 504 | Gateway Timeout | Downstream timeout | Stop routing |
Detection Thresholds
| Metric | Threshold | Action |
|---|---|---|
| Consecutive failures | 3-5 failures | Mark unhealthy |
| Success rate | Below 95% | Trigger warning |
| Response time | Above 5s | Degraded status |
| Error rate | Above 5% | Circuit breaker open |
| Health check interval | 5-10 seconds | Standard monitoring |
| Retry attempts | 3-5 retries | Before failure |
Replication Configuration
| Setting | Synchronous | Asynchronous |
|---|---|---|
| Consistency | Strong | Eventual |
| Write latency | Higher | Lower |
| Availability during partition | Lower | Higher |
| Data loss risk | None | Possible |
| Use case | Financial data | Analytics, logs |