Overview
The Circuit Breaker Pattern protects distributed systems from cascading failures by wrapping remote service calls in a monitoring component that tracks failures and automatically stops attempting requests when error rates exceed defined thresholds. Named after electrical circuit breakers that interrupt current flow to prevent damage, this pattern monitors service health and provides three distinct states: Closed (normal operation), Open (failing fast), and Half-Open (testing recovery).
In distributed architectures, a single slow or failing service can exhaust connection pools, thread pools, and other shared resources, causing failures to propagate across multiple services. Circuit breakers detect these failure conditions and interrupt the cascade before system-wide degradation occurs. The pattern was popularized by Michael Nygard's book "Release It!" and has become a fundamental component of resilient microservice architectures.
The pattern operates by counting failures within a time window. When failures exceed a configured threshold, the circuit "trips" to the Open state and immediately rejects requests without attempting the remote call. After a timeout period, the circuit transitions to Half-Open state, allowing a limited number of test requests to determine if the remote service has recovered. If test requests succeed, the circuit closes and normal operation resumes. If test requests fail, the circuit reopens and the timeout period resets.
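# Raised when the breaker rejects a call without attempting it
class CircuitOpenError < StandardError; end

# Simplified for illustration and not thread-safe; later examples add locking
class CircuitBreaker
  attr_reader :state, :failure_count

  def initialize(failure_threshold: 5, timeout: 60)
    @state = :closed
    @failure_count = 0
    @failure_threshold = failure_threshold
    @timeout = timeout
    @opened_at = nil
  end

  def call
    if open?
      # Fail fast until the timeout has elapsed, then allow a trial request
      raise CircuitOpenError, "circuit is open" unless timeout_expired?
      attempt_half_open
    end

    begin
      result = yield
      on_success
      result
    rescue
      on_failure
      raise # re-raise the original error with its backtrace intact
    end
  end

  private

  def open?
    @state == :open
  end

  def timeout_expired?
    !@opened_at.nil? && Time.now - @opened_at >= @timeout
  end

  def on_success
    @failure_count = 0
    @state = :closed
  end

  def on_failure
    @failure_count += 1
    # A failed half-open trial reopens the circuit and restarts the timeout
    if @state == :half_open || @failure_count >= @failure_threshold
      @state = :open
      @opened_at = Time.now
    end
  end

  def attempt_half_open
    @state = :half_open
  end
end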
Circuit breakers reduce latency during failures by failing fast instead of waiting for timeouts, preserve system resources by preventing excessive retry attempts, and provide monitoring data about service health. The pattern applies to any remote dependency including HTTP APIs, database connections, message queues, and external services.
Key Principles
The Circuit Breaker Pattern implements automatic failure detection and recovery through state transitions and threshold monitoring. The pattern's effectiveness depends on correctly configuring failure detection criteria, timeout periods, and recovery testing strategies.
State Machine Model
The circuit breaker implements a finite state machine with three states and specific transition conditions:
Closed state represents normal operation where all requests pass through to the remote service. The circuit breaker monitors each request, tracking successes and failures within a sliding time window. When the failure count or failure rate exceeds the configured threshold, the circuit transitions to Open state.
Open state blocks all requests immediately without attempting the remote call, throwing a specific exception that calling code can handle. This fail-fast behavior prevents resource exhaustion and reduces latency. The circuit remains open for a configured timeout period, after which it transitions to Half-Open state.
Half-Open state allows a limited number of test requests to determine if the remote service has recovered. If test requests succeed, the circuit transitions back to Closed state and normal operation resumes. If test requests fail, the circuit returns to Open state and the timeout period resets.
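# Raised when the breaker rejects a call (safe to repeat if already defined)
class CircuitOpenError < StandardError; end

class StatefulCircuitBreaker
  STATES = [:closed, :open, :half_open].freeze

  def initialize(options = {})
    @failure_threshold = options[:failure_threshold] || 5
    @success_threshold = options[:success_threshold] || 2
    @timeout = options[:timeout] || 60
    @half_open_limit = options[:half_open_limit] || 3
    @state = :closed
    @failure_count = 0
    @success_count = 0
    @half_open_attempts = 0
    @opened_at = nil
    @mutex = Mutex.new
  end

  def call
    @mutex.synchronize do
      check_timeout if @state == :open
      case @state
      when :open
        raise CircuitOpenError, "Circuit is open"
      when :half_open
        # Every trial request, including the first one, counts against the limit
        raise CircuitOpenError, "Half-open limit reached" if @half_open_attempts >= @half_open_limit
        @half_open_attempts += 1
      end
    end
    execute_with_monitoring { yield }
  end

  private

  def check_timeout
    transition_to_half_open if Time.now - @opened_at >= @timeout
  end

  def execute_with_monitoring
    result = yield
    record_success
    result
  rescue
    record_failure
    raise
  end

  def record_success
    @mutex.synchronize do
      if @state == :half_open
        @success_count += 1
        transition_to_closed if @success_count >= @success_threshold
      elsif @state == :closed
        @failure_count = 0
      end
    end
  end

  def record_failure
    @mutex.synchronize do
      @failure_count += 1
      # A failed trial request reopens the circuit immediately
      transition_to_open if @state == :half_open || @failure_count >= @failure_threshold
    end
  end

  def transition_to_open
    @state = :open
    @opened_at = Time.now
  end

  def transition_to_half_open
    @state = :half_open
    @half_open_attempts = 0
    @success_count = 0
  end

  def transition_to_closed
    @state = :closed
    @failure_count = 0
    @success_count = 0
  end
end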
Failure Detection Criteria
Circuit breakers monitor multiple failure indicators to determine when to trip. Failure counting strategies include consecutive failures, failure rate within a time window, and weighted failure scoring. Different failure types may carry different weights. Timeouts might count more heavily than connection-refused errors: a refused connection fails fast, while a timed-out call holds a thread and a connection for the full timeout duration, which is what exhausts resources.
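Weighted failure scoring can be sketched as follows. This is a minimal illustration, not taken from any particular library: the class name WeightedFailureScore, the weights, and the trip score are all assumptions chosen for the example.

```ruby
require 'timeout'

# Hypothetical helper: accumulates a weighted failure score and reports
# when the score reaches a trip threshold. All weights are illustrative.
class WeightedFailureScore
  DEFAULT_WEIGHTS = {
    Timeout::Error => 3,      # slow failures hold resources the longest
    Errno::ECONNREFUSED => 1  # fast failures are cheaper for the caller
  }.freeze

  attr_reader :score

  def initialize(trip_score: 6, weights: DEFAULT_WEIGHTS, default_weight: 2)
    @trip_score = trip_score
    @weights = weights
    @default_weight = default_weight
    @score = 0
  end

  # Adds the weight of the given error to the running score.
  def record_failure(error)
    @score += weight_for(error)
  end

  # A success clears the score, mirroring a consecutive-failure counter.
  def record_success
    @score = 0
  end

  def tripped?
    @score >= @trip_score
  end

  private

  # Looks up the first configured class the error is an instance of.
  def weight_for(error)
    match = @weights.find { |klass, _weight| error.is_a?(klass) }
    match ? match.last : @default_weight
  end
end
```

A breaker would call record_failure from its rescue path and open the circuit once tripped? returns true. The sketch is not thread-safe; a real breaker would update the score under its own lock.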
Time window implementations track failures in sliding windows or tumbling windows. Sliding windows provide more accurate failure rate calculations but require more memory to store individual request timestamps. Tumbling windows aggregate failures into fixed time buckets, using less memory but potentially missing failure rate spikes that span bucket boundaries.
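The tumbling-window approach can be sketched as a single fixed bucket that is thrown away when its time span ends. The class name TumblingWindowCounter and its injectable clock are assumptions made for this example; the sliding-window variant appears later in this document.

```ruby
# Hypothetical tumbling-window counter: requests and failures are aggregated
# into one fixed bucket that is discarded when its time span ends.
class TumblingWindowCounter
  def initialize(window_seconds: 60,
                 clock: -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) })
    @window_seconds = window_seconds
    @clock = clock
    @bucket_start = @clock.call
    @requests = 0
    @failures = 0
  end

  def record(success:)
    roll_bucket
    @requests += 1
    @failures += 1 unless success
  end

  def failure_rate
    roll_bucket
    return 0.0 if @requests.zero?
    @failures.to_f / @requests
  end

  private

  # Starts a fresh, empty bucket once the current one has expired.
  def roll_bucket
    now = @clock.call
    return if now - @bucket_start < @window_seconds
    @bucket_start = now
    @requests = 0
    @failures = 0
  end
end
```

Memory use is constant because only two counters are stored. The trade-off described above is visible in roll_bucket: failures recorded just before the boundary are forgotten as soon as the next bucket starts. The sketch is not thread-safe.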
Threshold Configuration
Failure thresholds balance sensitivity and stability. Low thresholds cause circuits to trip quickly but may trigger on transient failures. High thresholds tolerate more failures but take longer to detect persistent outages. The optimal threshold depends on expected failure rates, request volume, and acceptable latency.
Success thresholds in Half-Open state determine how many consecutive successful requests must occur before closing the circuit. Higher success thresholds provide more confidence in recovery but delay return to normal operation. Small values are typical; the examples in this document default to 2.
Timeout duration controls how long the circuit remains open before attempting recovery. Short timeouts enable faster recovery from transient failures but may cause repeated state transitions if the remote service needs extended recovery time. Long timeouts reduce state transition overhead but delay recovery unnecessarily for brief outages.
Request Isolation
Circuit breakers operate independently for each remote service or endpoint. Failures in one service should not trigger circuit breakers for unrelated services. Some implementations maintain circuit breaker state per endpoint, while others group related endpoints under a single circuit breaker to simplify configuration.
Thread safety requires synchronization when updating circuit breaker state from multiple concurrent requests. Lock-free implementations using atomic operations provide better performance but increase implementation complexity. Most production circuit breakers use mutexes or similar locking mechanisms to ensure state consistency.
Ruby Implementation
Ruby applications implement circuit breakers through gems such as circuitbox, stoplight, and semian, through custom classes, or by delegating to service mesh components. These implementations vary in feature set and performance characteristics.
Core Implementation Pattern
A production-ready Ruby circuit breaker requires thread-safe state management, configurable thresholds, and monitoring hooks:
require 'thread'
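# Defined at top level so that callers can rescue CircuitOpenError directly,
# as the usage examples in this document do
class CircuitOpenError < StandardError; end
class ProductionCircuitBreaker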
attr_reader :name, :state, :failure_count, :last_failure_time
def initialize(name, options = {})
@name = name
@failure_threshold = options[:failure_threshold] || 5
@success_threshold = options[:success_threshold] || 2
@timeout = options[:timeout] || 60
@volume_threshold = options[:volume_threshold] || 10
@error_rate_threshold = options[:error_rate_threshold] || 0.5
@state = :closed
@failure_count = 0
@success_count = 0
@request_count = 0
@opened_at = nil
@last_failure_time = nil
@mutex = Mutex.new
@window_start = Time.now
@window_duration = options[:window_duration] || 60
end
def call(&block)
check_and_update_state
if @state == :open
record_rejected_request
raise CircuitOpenError, "Circuit #{@name} is open"
end
execute_protected(&block)
end
def metrics
@mutex.synchronize do
{
name: @name,
state: @state,
failure_count: @failure_count,
success_count: @success_count,
request_count: @request_count,
error_rate: calculate_error_rate,
opened_at: @opened_at,
last_failure_time: @last_failure_time
}
end
end
private
def check_and_update_state
@mutex.synchronize do
reset_window_if_expired
if @state == :open && timeout_expired?
transition_to(:half_open)
end
end
end
def execute_protected
begin
result = yield
handle_success
result
rescue => error
handle_failure(error)
raise error
end
end
def handle_success
@mutex.synchronize do
@request_count += 1
if @state == :half_open
@success_count += 1
if @success_count >= @success_threshold
transition_to(:closed)
end
end
end
end
def handle_failure(error)
@mutex.synchronize do
@request_count += 1
@failure_count += 1
@last_failure_time = Time.now
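# A failed trial request in half-open reopens the circuit immediately,
# regardless of the volume threshold
if @state == :half_open || should_trip?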
transition_to(:open)
end
end
end
def should_trip?
return false if @request_count < @volume_threshold
error_rate = calculate_error_rate
error_rate >= @error_rate_threshold || @failure_count >= @failure_threshold
end
def calculate_error_rate
return 0.0 if @request_count == 0
@failure_count.to_f / @request_count
end
def reset_window_if_expired
if Time.now - @window_start >= @window_duration
@failure_count = 0
@request_count = 0
@window_start = Time.now
end
end
def timeout_expired?
return false unless @opened_at
Time.now - @opened_at >= @timeout
end
def transition_to(new_state)
old_state = @state
@state = new_state
case new_state
when :open
@opened_at = Time.now
when :half_open
@success_count = 0
@failure_count = 0
when :closed
@failure_count = 0
@success_count = 0
@opened_at = nil
end
notify_state_change(old_state, new_state)
end
def notify_state_change(old_state, new_state)
# Hook for monitoring systems
puts "Circuit #{@name}: #{old_state} -> #{new_state}"
end
def record_rejected_request
# Hook for metrics collection
end
end
Using the Faraday Middleware
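Faraday does not ship a circuit breaker itself; third-party middleware gems such as faraday_middleware-circuit_breaker add one to the HTTP request/response cycle. The snippet below shows the general shape only: the require path, option names, and the exception raised for an open circuit differ between gems and versions, so check the README of the middleware you install: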
require 'faraday'
require 'faraday/circuit_breaker'
# Configure circuit breaker for HTTP client
conn = Faraday.new(url: 'https://api.example.com') do |f|
f.request :circuit_breaker,
threshold: 5,
timeout: 60,
exceptions: [Faraday::TimeoutError, Faraday::ConnectionFailed]
f.adapter Faraday.default_adapter
end
# Circuit breaker protects this request
begin
response = conn.get('/users')
rescue CircuitBreaker::OpenCircuitError => e
# Handle open circuit
Rails.logger.warn "Circuit open for users API: #{e.message}"
return cached_users
end
Redis-Backed Circuit Breaker
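Distributed applications often need circuit breaker state that is shared across multiple processes or servers. Redis provides shared state and atomic primitives such as INCR. The example below favors readability: its read-then-write state transitions are not atomic, so concurrent processes can race, and a stricter version would wrap them in a Lua script or a MULTI/EXEC transaction: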
require 'redis'
class RedisCircuitBreaker
def initialize(redis, key_prefix, options = {})
@redis = redis
@key_prefix = key_prefix
@failure_threshold = options[:failure_threshold] || 5
@timeout = options[:timeout] || 60
end
def call(circuit_name, &block)
state_key = "#{@key_prefix}:#{circuit_name}:state"
failure_key = "#{@key_prefix}:#{circuit_name}:failures"
state = @redis.get(state_key) || 'closed'
if state == 'open'
opened_at = @redis.get("#{@key_prefix}:#{circuit_name}:opened_at").to_i
if Time.now.to_i - opened_at >= @timeout
@redis.set(state_key, 'half_open')
state = 'half_open'
else
raise CircuitOpenError, "Circuit #{circuit_name} is open"
end
end
begin
result = yield
if state == 'half_open'
@redis.del(failure_key)
@redis.set(state_key, 'closed')
end
result
rescue => error
failures = @redis.incr(failure_key)
@redis.expire(failure_key, @timeout)
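# A failed trial request in half-open reopens the circuit immediately; the
# failure counter has usually expired by then, so the threshold alone would not
if state == 'half_open' || failures >= @failure_threshold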
@redis.set(state_key, 'open')
@redis.set("#{@key_prefix}:#{circuit_name}:opened_at", Time.now.to_i)
end
raise error
end
end
end
# Usage across multiple processes
redis = Redis.new(url: ENV['REDIS_URL'])
breaker = RedisCircuitBreaker.new(redis, 'cb', failure_threshold: 3, timeout: 30)
breaker.call('payment_service') do
PaymentAPI.charge(amount: 100)
end
Integration with Sidekiq
Background job processing benefits from circuit breakers because they keep retry queues from filling with jobs that are certain to fail while a dependency is down:
class PaymentProcessor
include Sidekiq::Worker
sidekiq_options retry: 3
def perform(payment_id)
breaker = CircuitBreakerRegistry.get(:payment_gateway)
breaker.call do
payment = Payment.find(payment_id)
PaymentGateway.process(payment)
end
rescue CircuitOpenError => e
# Reschedule when circuit opens
self.class.perform_in(5.minutes, payment_id)
rescue => e
Rails.logger.error "Payment processing failed: #{e.message}"
raise e
end
end
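The worker above calls CircuitBreakerRegistry.get, which is not defined in this document. A minimal sketch of such a registry follows, assuming a register/get interface invented for this example. The point is that every job in a process must share one breaker instance per dependency; a breaker created inside perform would start with empty state on every run and never trip.

```ruby
# Hypothetical registry: returns one shared breaker per name so that every
# job in the process sees the same failure state. The factory block is
# supplied by the caller, so any breaker class can be used.
class CircuitBreakerRegistry
  @breakers = {}
  @factories = {}
  @mutex = Mutex.new

  class << self
    # Registers how to build the breaker for +name+ (built lazily, once).
    def register(name, &factory)
      @mutex.synchronize { @factories[name] = factory }
    end

    # Returns the shared breaker for +name+, building it on first use.
    def get(name)
      @mutex.synchronize do
        @breakers[name] ||= begin
          factory = @factories.fetch(name) do
            raise KeyError, "no circuit breaker registered for #{name.inspect}"
          end
          factory.call
        end
      end
    end
  end
end
```

An initializer could then register each dependency once, for example CircuitBreakerRegistry.register(:payment_gateway) { ProductionCircuitBreaker.new('payment_gateway') }. The state is still local to each process; sharing it across Sidekiq processes needs external storage such as the Redis-backed breaker above.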
Design Considerations
Circuit breakers add complexity and latency overhead to service calls. Applications must evaluate whether the resilience benefits outweigh these costs based on failure rates, request patterns, and system architecture.
When to Apply Circuit Breakers
Circuit breakers provide value when calling remote services with unpredictable availability, when cascading failures risk system-wide outages, when failing fast improves user experience compared to waiting for timeouts, and when resource exhaustion from hanging connections threatens system stability.
Synchronous HTTP APIs represent the primary use case for circuit breakers. These calls can hang indefinitely if the remote service becomes unresponsive, exhausting thread pools and connection pools. Circuit breakers detect these conditions and prevent resource exhaustion.
Asynchronous message queues benefit less from circuit breakers since message processing naturally decouples request and response. However, circuit breakers remain useful when message handlers call synchronous services or when queue backlog growth indicates consumer failures.
Database connections typically use connection pool limits rather than circuit breakers since databases rarely experience the transient failures common in network services. Circuit breakers may apply to database queries that access unreliable external data sources or distributed database clusters with node failures.
Trade-offs and Limitations
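Circuit breakers introduce several trade-offs. An in-process breaker adds a lock acquisition and a few counter updates per call, which is usually negligible next to a network request; a breaker backed by shared storage such as Redis adds at least one extra network round trip per call. Memory usage grows with the number of monitored endpoints and the size of failure tracking windows. Thread synchronization adds contention under high concurrency.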
False positives occur when circuits trip during legitimate traffic spikes or transient network issues. Applications must implement fallback strategies to handle open circuits gracefully. Fallback options include returning cached data, degraded functionality, or user-friendly error messages.
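The fallback options listed above can be arranged as an ordered chain. The helper below is a minimal sketch; the name with_fallbacks and the convention that a nil result means "try the next fallback" are assumptions made for this example.

```ruby
# Raised by the breaker when it rejects a call (reopened here so the
# sketch runs on its own).
class CircuitOpenError < StandardError; end

# Hypothetical helper: runs the protected call and, only when the circuit
# is open, walks an ordered list of fallbacks. Each fallback is a callable;
# the first non-nil result wins.
def with_fallbacks(primary, *fallbacks)
  primary.call
rescue CircuitOpenError
  fallbacks.each do |fallback|
    value = fallback.call
    return value unless value.nil?
  end
  nil
end
```

A caller might pass a cache lookup first and a static default second, so stale data is preferred over an empty response. Errors other than CircuitOpenError propagate unchanged, so genuine failures still reach the breaker's caller.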
False negatives occur when failures persist below the threshold or when slow responses exhaust resources without triggering failure counts. Timeout-based circuit breakers address this by treating slow responses as failures.
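Treating slow responses as failures can be sketched with a small wrapper that runs inside the breaker's block. The names SlowCallError and fail_if_slow are assumptions made for this example, as is the injectable clock.

```ruby
class SlowCallError < StandardError; end

# Hypothetical wrapper: measures a call with the monotonic clock and raises
# SlowCallError when the call completes but took longer than allowed, so
# that a surrounding circuit breaker counts it as a failure.
def fail_if_slow(max_seconds,
                 clock: -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) })
  started = clock.call
  result = yield
  elapsed = clock.call - started
  if elapsed > max_seconds
    raise SlowCallError, format("call took %.3fs (limit %.3fs)", elapsed, max_seconds)
  end
  result
end
```

Used as breaker.call { fail_if_slow(2) { client.get(url) } }, a slow success is recorded as a failure. The wrapper only classifies a call after it returns; it does not interrupt a hanging call, so the client's own timeout is still required.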
State consistency across distributed systems requires shared storage like Redis, adding external dependencies and failure modes. Local circuit breakers in each process provide simpler implementation but prevent coordinated failure response across service instances.
Alternative Patterns
Several patterns address similar reliability concerns with different trade-offs:
Retry logic with exponential backoff provides resilience against transient failures without maintaining state. Retries work well for truly transient issues but can worsen cascading failures by generating more load on already struggling services. Circuit breakers complement retry logic by preventing retries when failures persist.
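How the two patterns fit together can be sketched as a retry helper that backs off exponentially but stops at once when the breaker reports an open circuit. The name retry_with_backoff, its parameters, and the injectable sleeper are assumptions made for this example.

```ruby
# Raised by the breaker when it rejects a call (reopened here so the
# sketch runs on its own).
class CircuitOpenError < StandardError; end

# Hypothetical helper: retries a block with exponential backoff, but gives
# up immediately on an open circuit so that retries add no load while the
# dependency is known to be failing.
def retry_with_backoff(max_attempts: 3, base_delay: 0.1,
                       sleeper: ->(seconds) { sleep(seconds) })
  attempt = 0
  begin
    attempt += 1
    yield attempt
  rescue CircuitOpenError
    raise # never retry against an open circuit
  rescue StandardError
    raise if attempt >= max_attempts
    sleeper.call(base_delay * (2**(attempt - 1))) # base, 2x base, 4x base, ...
    retry
  end
end
```

The retry wraps the breaker rather than the other way around, as in retry_with_backoff { breaker.call { client.get(url) } }, so each attempt is counted by the breaker and the loop ends as soon as the circuit opens. Production versions usually add random jitter to the delay.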
Timeouts limit request duration but don't prevent repeated attempts against failing services. Aggressive timeouts reduce resource exhaustion but may interrupt legitimate slow operations. Circuit breakers reduce timeout-related load by blocking requests entirely.
Bulkheads isolate resources to prevent failures in one component from affecting others. Thread pool isolation and connection pool limits implement bulkheads at the resource level. Circuit breakers operate at the service call level. Combining both patterns provides defense in depth.
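A bulkhead at the resource level can be sketched as a cap on concurrent calls to one dependency. The names Bulkhead and BulkheadFullError are assumptions made for this example, and rejecting extra callers (rather than queueing them) is one possible design choice.

```ruby
class BulkheadFullError < StandardError; end

# Hypothetical bulkhead: caps the number of concurrent calls to one
# dependency and rejects extra callers instead of letting them queue.
class Bulkhead
  def initialize(max_concurrent:)
    @max_concurrent = max_concurrent
    @in_flight = 0
    @mutex = Mutex.new
  end

  def call
    # Reserve a slot under the lock, then run the block without holding it
    @mutex.synchronize do
      if @in_flight >= @max_concurrent
        raise BulkheadFullError, "bulkhead full (#{@max_concurrent} in flight)"
      end
      @in_flight += 1
    end
    begin
      yield
    ensure
      @mutex.synchronize { @in_flight -= 1 }
    end
  end
end
```

The two patterns nest naturally, for example bulkhead.call { breaker.call { client.get(url) } }: the bulkhead bounds how many threads a slow dependency can occupy, while the breaker stops calls altogether once the dependency is judged unhealthy.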
Rate limiting controls request volume but doesn't respond to service health. Circuit breakers complement rate limiting by adjusting request behavior based on failure rates rather than just volume.
Practical Examples
Circuit breakers protect applications across various scenarios from microservice communication to third-party API integration.
HTTP API Protection
An e-commerce application calls a recommendation service to display product suggestions. When the recommendation service experiences issues, circuit breakers prevent checkout delays:
class RecommendationService
def initialize
@circuit_breaker = ProductionCircuitBreaker.new(
'recommendations',
failure_threshold: 5,
timeout: 30,
volume_threshold: 20,
error_rate_threshold: 0.5
)
@cache = Rails.cache
end
def recommendations_for(user_id, category)
cache_key = "recommendations:#{user_id}:#{category}"
@circuit_breaker.call do
response = HTTP.timeout(2).get(
"#{ENV['RECOMMENDATIONS_URL']}/recommendations",
params: { user_id: user_id, category: category }
)
recommendations = JSON.parse(response.body)
@cache.write(cache_key, recommendations, expires_in: 1.hour)
recommendations
end
rescue CircuitOpenError => e
# Return cached recommendations when circuit is open
Rails.logger.warn "Recommendations circuit open, using cache"
cached = @cache.read(cache_key)
cached || default_recommendations(category)
rescue HTTP::TimeoutError, HTTP::ConnectionError => e
# Let circuit breaker track the failure
Rails.logger.error "Recommendations request failed: #{e.message}"
raise e
end
private
def default_recommendations(category)
# Return popular items as fallback
Product.where(category: category)
.order(sales_count: :desc)
.limit(10)
.map { |p| { id: p.id, name: p.name, price: p.price } }
end
end
Payment Gateway Resilience
Payment processing requires high reliability since failures directly impact revenue. Circuit breakers prevent payment queue backlog during gateway outages:
class PaymentGatewayClient
class PaymentError < StandardError; end
def initialize
@breakers = {
authorize: ProductionCircuitBreaker.new('payment_authorize',
failure_threshold: 3,
timeout: 60,
volume_threshold: 10
),
capture: ProductionCircuitBreaker.new('payment_capture',
failure_threshold: 3,
timeout: 60,
volume_threshold: 10
),
refund: ProductionCircuitBreaker.new('payment_refund',
failure_threshold: 2,
timeout: 120,
volume_threshold: 5
)
}
end
def authorize(amount, card_token)
@breakers[:authorize].call do
response = gateway_request('/authorize', {
amount: amount,
card_token: card_token
})
unless response['success']
raise PaymentError, response['error']
end
{
transaction_id: response['transaction_id'],
authorized_at: Time.now,
amount: amount
}
end
rescue CircuitOpenError => e
# Queue for retry when gateway recovers
PaymentRetryJob.perform_later(amount, card_token)
raise PaymentError, "Payment gateway unavailable"
end
def capture(transaction_id, amount)
@breakers[:capture].call do
response = gateway_request('/capture', {
transaction_id: transaction_id,
amount: amount
})
unless response['success']
raise PaymentError, response['error']
end
{
captured_at: Time.now,
amount: amount
}
end
rescue CircuitOpenError => e
# Capture operations must succeed eventually
PaymentCaptureRetryJob.perform_in(5.minutes, transaction_id, amount)
raise PaymentError, "Payment capture deferred"
end
private
def gateway_request(path, params)
response = HTTP.timeout(5)
.headers(authorization: "Bearer #{ENV['GATEWAY_TOKEN']}")
.post("#{ENV['GATEWAY_URL']}#{path}", json: params)
JSON.parse(response.body)
rescue HTTP::TimeoutError => e
raise PaymentError, "Gateway timeout"
rescue HTTP::ConnectionError => e
raise PaymentError, "Gateway connection failed"
end
end
Microservice Communication
Service mesh architectures implement circuit breakers at the proxy level, but application-level circuit breakers provide additional control and context-specific behavior:
class OrderService
def initialize
@inventory_breaker = ProductionCircuitBreaker.new(
'inventory_service',
failure_threshold: 5,
timeout: 30
)
@shipping_breaker = ProductionCircuitBreaker.new(
'shipping_service',
failure_threshold: 3,
timeout: 45
)
end
def create_order(cart_items, shipping_address)
order = Order.create!(
user_id: current_user.id,
status: 'pending',
items: cart_items
)
# Reserve inventory with circuit breaker
inventory_reserved = false
begin
@inventory_breaker.call do
InventoryClient.reserve(order.id, cart_items)
end
inventory_reserved = true
rescue CircuitOpenError => e
order.update!(
status: 'failed',
failure_reason: 'inventory_unavailable'
)
return { success: false, error: 'Inventory service unavailable' }
rescue InventoryClient::InsufficientStockError => e
order.update!(status: 'failed', failure_reason: 'insufficient_stock')
return { success: false, error: 'Insufficient stock' }
end
# Calculate shipping with circuit breaker
begin
@shipping_breaker.call do
shipping_quote = ShippingClient.calculate(
cart_items,
shipping_address
)
order.update!(
shipping_cost: shipping_quote[:cost],
estimated_delivery: shipping_quote[:estimated_delivery]
)
end
rescue CircuitOpenError => e
# Shipping calculation failure doesn't block order creation
order.update!(
shipping_cost: 0,
estimated_delivery: 7.days.from_now,
requires_shipping_calculation: true
)
Rails.logger.warn "Order #{order.id} created without shipping calculation"
end
order.update!(status: 'confirmed')
{ success: true, order: order }
end
end
Third-Party API Integration
Applications integrating third-party APIs face unpredictable availability and rate limits. Circuit breakers prevent API quota exhaustion and reduce error rates:
class WeatherDataClient
def initialize
@breaker = ProductionCircuitBreaker.new(
'weather_api',
failure_threshold: 10,
timeout: 120,
volume_threshold: 50,
error_rate_threshold: 0.3
)
@rate_limiter = RateLimiter.new(max_requests: 100, period: 1.minute)
end
def current_weather(location)
cache_key = "weather:current:#{location}"
# Return cached data if available
cached = Rails.cache.read(cache_key)
return cached if cached
# Rate limit before circuit breaker
unless @rate_limiter.allow?
return Rails.cache.read("weather:stale:#{location}")
end
@breaker.call do
response = HTTP.timeout(3).get(
"#{ENV['WEATHER_API_URL']}/current",
params: {
location: location,
api_key: ENV['WEATHER_API_KEY']
}
)
weather_data = JSON.parse(response.body)
# Cache successful response
Rails.cache.write(cache_key, weather_data, expires_in: 10.minutes)
Rails.cache.write("weather:stale:#{location}", weather_data, expires_in: 2.hours)
weather_data
end
rescue CircuitOpenError => e
# Return stale data when circuit is open
stale = Rails.cache.read("weather:stale:#{location}")
stale || { temperature: nil, conditions: 'unavailable' }
rescue JSON::ParserError, HTTP::Error => e
Rails.logger.error "Weather API error: #{e.message}"
raise e
end
end
Common Patterns
Circuit breaker implementations vary in state management strategies, failure detection mechanisms, and recovery testing approaches. Production systems often combine multiple patterns to address specific reliability requirements.
Sliding Window Pattern
Sliding window circuit breakers track individual request timestamps within a time window, providing accurate failure rate calculations but requiring more memory:
class SlidingWindowCircuitBreaker
def initialize(name, options = {})
@name = name
@window_size = options[:window_size] || 60
@failure_threshold = options[:failure_threshold] || 0.5
@volume_threshold = options[:volume_threshold] || 10
@timeout = options[:timeout] || 60
@state = :closed
@requests = []
@opened_at = nil
@mutex = Mutex.new
end
def call(&block)
@mutex.synchronize { check_state }
raise CircuitOpenError if @state == :open
begin
result = yield
record_request(success: true)
result
rescue => error
record_request(success: false)
raise error
end
end
private
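def record_request(success:)
@mutex.synchronize do
now = Time.now
if @state == :half_open
# A trial request decides the outcome directly: success closes the
# circuit, failure reopens it and restarts the timeout
if success
@state = :closed
else
@state = :open
@opened_at = now
end
else
@requests << { timestamp: now, success: success }
remove_old_requests(now)
evaluate_threshold
end
end
end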
def remove_old_requests(current_time)
cutoff = current_time - @window_size
@requests.reject! { |req| req[:timestamp] < cutoff }
end
def evaluate_threshold
return if @requests.size < @volume_threshold
failure_count = @requests.count { |req| !req[:success] }
failure_rate = failure_count.to_f / @requests.size
if failure_rate >= @failure_threshold
@state = :open
@opened_at = Time.now
end
end
def check_state
if @state == :open && Time.now - @opened_at >= @timeout
@state = :half_open
@requests.clear
end
end
end
Token Bucket Recovery Pattern
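Token-based recovery controls Half-Open state testing by limiting the number of concurrent test requests, preventing thundering herd problems during recovery. The class below uses a fixed pool of permits that callers take and return (a counting semaphore, not a rate-refilled token bucket):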
class TokenBucketCircuitBreaker
def initialize(name, options = {})
@name = name
@failure_threshold = options[:failure_threshold] || 5
@timeout = options[:timeout] || 60
@max_half_open_requests = options[:max_half_open_requests] || 3
@success_threshold = options[:success_threshold] || 2
@state = :closed
@failure_count = 0
@success_count = 0
@half_open_tokens = @max_half_open_requests
@opened_at = nil
@mutex = Mutex.new
@condition = ConditionVariable.new
end
def call(&block)
acquire_token
begin
result = yield
handle_success
result
rescue => error
handle_failure
raise error
ensure
release_token
end
end
private
def acquire_token
@mutex.synchronize do
check_timeout
while @state == :half_open && @half_open_tokens <= 0
@condition.wait(@mutex)
end
raise CircuitOpenError if @state == :open
@half_open_tokens -= 1 if @state == :half_open
end
end
def release_token
@mutex.synchronize do
if @state == :half_open
@half_open_tokens += 1
@condition.signal
end
end
end
def handle_success
@mutex.synchronize do
if @state == :half_open
@success_count += 1
if @success_count >= @success_threshold
transition_to_closed
end
end
@failure_count = 0 if @state == :closed
end
end
def handle_failure
@mutex.synchronize do
@failure_count += 1
if @state == :half_open || @failure_count >= @failure_threshold
transition_to_open
end
end
end
def check_timeout
if @state == :open && Time.now - @opened_at >= @timeout
transition_to_half_open
end
end
def transition_to_closed
@state = :closed
@failure_count = 0
@success_count = 0
@half_open_tokens = @max_half_open_requests
@condition.broadcast
end
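def transition_to_open
@state = :open
@opened_at = Time.now
@success_count = 0
# Wake threads waiting for a half-open token so they fail fast instead of
# waiting indefinitely
@condition.broadcast
end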
def transition_to_half_open
@state = :half_open
@half_open_tokens = @max_half_open_requests
@success_count = 0
end
end
Adaptive Threshold Pattern
Adaptive circuit breakers adjust failure thresholds based on historical patterns and request volume, reducing false positives during traffic spikes:
class AdaptiveCircuitBreaker
def initialize(name, options = {})
@name = name
@base_threshold = options[:base_threshold] || 0.5
@adaptation_rate = options[:adaptation_rate] || 0.1
@window_size = options[:window_size] || 60
@timeout = options[:timeout] || 60
@state = :closed
@current_threshold = @base_threshold
@historical_error_rate = 0.0
@requests = []
@opened_at = nil
@mutex = Mutex.new
end
def call(&block)
check_and_adapt
raise CircuitOpenError if @state == :open
begin
result = yield
record_success
result
rescue => error
record_failure
raise error
end
end
private
def check_and_adapt
@mutex.synchronize do
clean_old_requests
adapt_threshold
if @state == :open && timeout_expired?
@state = :half_open
end
evaluate_state
end
end
def adapt_threshold
return if @requests.empty?
current_error_rate = calculate_error_rate
@historical_error_rate = (@historical_error_rate * (1 - @adaptation_rate)) +
(current_error_rate * @adaptation_rate)
# Adjust threshold based on historical patterns
@current_threshold = [@base_threshold + (@historical_error_rate * 0.5), 0.9].min
end
def calculate_error_rate
failures = @requests.count { |r| !r[:success] }
failures.to_f / @requests.size
end
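def evaluate_state
# Only a closed circuit can trip here; re-evaluating while open would keep
# pushing @opened_at forward and delay recovery
return unless @state == :closed && @requests.size >= 20
current_error_rate = calculate_error_rate
if current_error_rate >= @current_threshold
@state = :open
@opened_at = Time.now
end
end
def clean_old_requests
cutoff = Time.now - @window_size
@requests.reject! { |r| r[:timestamp] < cutoff }
end
def record_success
@mutex.synchronize do
@requests << { timestamp: Time.now, success: true }
@state = :closed if @state == :half_open
end
end
def record_failure
@mutex.synchronize do
@requests << { timestamp: Time.now, success: false }
# A failed trial request in half-open reopens the circuit immediately
if @state == :half_open
@state = :open
@opened_at = Time.now
end
end
end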
def timeout_expired?
@opened_at && Time.now - @opened_at >= @timeout
end
end
Error Handling & Edge Cases
Circuit breakers must handle various failure modes, edge cases, and operational scenarios that occur in production systems.
State Transition Race Conditions
Multiple concurrent requests during state transitions can cause inconsistent behavior. Requests arriving during the Open to Half-Open transition might all attempt test requests simultaneously, overwhelming the recovering service:
class ThreadSafeCircuitBreaker
def initialize(name, options = {})
@name = name
@failure_threshold = options[:failure_threshold] || 5
@timeout = options[:timeout] || 60
@max_concurrent_half_open = options[:max_concurrent_half_open] || 1
@state = :closed
@failure_count = 0
@opened_at = nil
@half_open_count = 0
@mutex = Mutex.new
end
def call(&block)
took_half_open_slot = false
@mutex.synchronize do
check_timeout
case @state
when :open
raise CircuitOpenError
when :half_open
if @half_open_count < @max_concurrent_half_open
@half_open_count += 1
took_half_open_slot = true
else
raise CircuitOpenError, "Half-open limit reached"
end
end
end
begin
result = yield
handle_success
result
rescue => error
handle_failure
raise error
ensure
# Release only a slot this request actually took. The state may have changed
# while the request was in flight, so the flag (not @state) decides.
release_half_open_slot if took_half_open_slot
end
end
private
def release_half_open_slot
# Guard against going negative if the count was reset by a new transition
@mutex.synchronize { @half_open_count -= 1 if @half_open_count > 0 }
end
def check_timeout
if @state == :open && Time.now - @opened_at >= @timeout
@state = :half_open
@half_open_count = 0
end
end
def handle_success
@mutex.synchronize do
@failure_count = 0
@state = :closed if @state == :half_open
end
end
def handle_failure
@mutex.synchronize do
@failure_count += 1
if @failure_count >= @failure_threshold || @state == :half_open
@state = :open
@opened_at = Time.now
end
end
end
end
Monitoring and Alerting
Production circuit breakers require comprehensive monitoring to detect issues and tune configuration. State transitions, rejection rates, and recovery patterns indicate system health:
class MonitoredCircuitBreaker
def initialize(name, options = {})
@name = name
@options = options
@breaker = ProductionCircuitBreaker.new(name, options)
@metrics_client = options[:metrics_client] || StatsD.new
@alert_threshold = options[:alert_threshold] || 0.1
@last_alert_time = nil
@alert_cooldown = options[:alert_cooldown] || 300
end
def call(&block)
start_time = Time.now
begin
result = @breaker.call(&block)
record_success_metrics(Time.now - start_time)
result
rescue CircuitOpenError => e
record_rejection_metrics
check_alert_conditions
raise e
rescue => e
record_failure_metrics(Time.now - start_time, e.class.name)
raise e
end
end
def metrics
@breaker.metrics.merge(rejection_rate: calculate_rejection_rate)
end
private
def record_success_metrics(duration)
@metrics_client.increment("circuit_breaker.#{@name}.success")
@metrics_client.timing("circuit_breaker.#{@name}.duration", duration * 1000)
@metrics_client.gauge("circuit_breaker.#{@name}.state", state_value)
end
def record_failure_metrics(duration, error_type)
@metrics_client.increment("circuit_breaker.#{@name}.failure")
@metrics_client.increment("circuit_breaker.#{@name}.error.#{error_type}")
@metrics_client.timing("circuit_breaker.#{@name}.duration", duration * 1000)
end
def record_rejection_metrics
@metrics_client.increment("circuit_breaker.#{@name}.rejected")
@metrics_client.gauge("circuit_breaker.#{@name}.state", state_value)
end
def state_value
case @breaker.state
when :closed then 0
when :half_open then 1
when :open then 2
end
end
def calculate_rejection_rate
metrics_data = @breaker.metrics
total = metrics_data[:request_count]
return 0.0 if total == 0
rejected = total - (metrics_data[:failure_count] + metrics_data[:success_count])
rejected.to_f / total
end
def check_alert_conditions
return if recently_alerted?
rejection_rate = calculate_rejection_rate
if rejection_rate >= @alert_threshold
send_alert(rejection_rate)
@last_alert_time = Time.now
end
end
def recently_alerted?
@last_alert_time && Time.now - @last_alert_time < @alert_cooldown
end
def send_alert(rejection_rate)
AlertService.notify(
severity: 'warning',
message: "Circuit breaker #{@name} open, rejection rate: #{(rejection_rate * 100).round(2)}%",
context: @breaker.metrics
)
end
end
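Because the wrapper accepts any object through the `:metrics_client` option, it can be exercised in tests without a StatsD server. The following is a minimal sketch of such a test double; `InMemoryMetrics` is a hypothetical name, and it implements only the three calls the wrapper makes (`increment`, `timing`, `gauge`).

```ruby
# Hypothetical test double: records metrics in memory instead of sending them.
class InMemoryMetrics
  attr_reader :counters, :timings, :gauges

  def initialize
    @counters = Hash.new(0)                           # name => count
    @timings = Hash.new { |hash, key| hash[key] = [] } # name => [milliseconds, ...]
    @gauges = {}                                      # name => last value
  end

  def increment(name)
    @counters[name] += 1
  end

  def timing(name, milliseconds)
    @timings[name] << milliseconds
  end

  def gauge(name, value)
    @gauges[name] = value
  end
end
```

A test can then pass `metrics_client: InMemoryMetrics.new` in the options hash and assert on `counters`, `timings`, and `gauges` after driving the breaker through successes, failures, and rejections.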
Partial Failures and Timeout Handling
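A service that responds slowly without failing outright can do more damage than one that fails fast, because each slow call holds a thread and a connection for its full duration. A circuit breaker should therefore enforce a per-request timeout and track slow calls separately from outright failures: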
class TimeoutAwareCircuitBreaker
def initialize(name, options = {})
@name = name
@request_timeout = options[:request_timeout] || 5
@failure_threshold = options[:failure_threshold] || 5
@slow_call_threshold = options[:slow_call_threshold] || 3
@slow_call_duration = options[:slow_call_duration] || 2
@breaker = ProductionCircuitBreaker.new(name, options)
@slow_call_count = 0
@mutex = Mutex.new
end
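def call(&block)
start_time = Time.now
# Apply the timeout inside the breaker's block so that a timed-out call surfaces
# as Timeout::Error within the breaker and is counted as a failure
# (needs require 'timeout' from the standard library)
result = @breaker.call do
Timeout.timeout(@request_timeout, &block)
end
check_slow_call(Time.now - start_time)
result
rescue Timeout::Error
@mutex.synchronize { @slow_call_count += 1 }
check_slow_call_threshold
raise
end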
private
def check_slow_call(duration)
if duration >= @slow_call_duration
@mutex.synchronize { @slow_call_count += 1 }
check_slow_call_threshold
else
@mutex.synchronize { @slow_call_count = [@slow_call_count - 1, 0].max }
end
end
def check_slow_call_threshold
if @slow_call_count >= @slow_call_threshold
Rails.logger.warn "Circuit #{@name} experiencing slow calls, count: #{@slow_call_count}"
# Optionally trip circuit based on slow calls
end
end
end
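The class above only logs when slow calls accumulate and leaves the tripping policy open. One possible policy, sketched here under assumptions, is to track the share of slow calls among the most recent calls and signal when it reaches a limit. `SlowCallTracker`, `window_size`, and `slow_rate_threshold` are hypothetical names that do not appear in the classes above.

```ruby
# Hypothetical helper: remembers whether each of the last `window_size` calls was slow.
class SlowCallTracker
  def initialize(slow_call_duration: 2.0, window_size: 10, slow_rate_threshold: 0.5)
    @slow_call_duration = slow_call_duration
    @window_size = window_size
    @slow_rate_threshold = slow_rate_threshold
    @samples = []          # true for a slow call, false otherwise
    @mutex = Mutex.new
  end

  # Records one call duration in seconds. Returns true when the window is full
  # and the slow-call rate has reached the threshold.
  def record(duration)
    @mutex.synchronize do
      @samples << (duration >= @slow_call_duration)
      @samples.shift while @samples.size > @window_size
      @samples.size == @window_size && slow_rate >= @slow_rate_threshold
    end
  end

  private

  def slow_rate
    @samples.count(true).to_f / @samples.size
  end
end
```

A caller could treat a `true` return value as a signal to open the circuit, in the same way a failure count reaching its threshold does.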
Real-World Applications
Production deployments raise concerns that simplified examples leave out: one breaker per dependency, fallbacks for open circuits, and interaction with existing retry mechanisms.
Microservices API Gateway
API gateways implement circuit breakers per downstream service, aggregating health across multiple backend endpoints:
class ServiceGateway
def initialize
@service_breakers = Hash.new do |hash, service_name|
hash[service_name] = ProductionCircuitBreaker.new(
"gateway_#{service_name}",
failure_threshold: 5,
timeout: 30,
volume_threshold: 20
)
end
@fallback_strategies = {
'user_profile' => method(:cached_user_profile),
'recommendations' => method(:default_recommendations),
'reviews' => method(:empty_reviews)
}
end
def proxy_request(service_name, path, params = {})
breaker = @service_breakers[service_name]
breaker.call do
response = HTTP.timeout(5).get(
service_url(service_name, path),
params: params
)
unless response.status.success?
raise ServiceError, "Service returned #{response.status}"
end
JSON.parse(response.body)
end
rescue CircuitOpenError => e
handle_open_circuit(service_name, path, params)
rescue ServiceError, HTTP::Error => e
log_service_error(service_name, e)
raise e
end
def health_check
services = @service_breakers.keys
services.map do |service|
metrics = @service_breakers[service].metrics
{
service: service,
state: metrics[:state],
error_rate: metrics[:error_rate],
last_failure: metrics[:last_failure_time]
}
end
end
private
def handle_open_circuit(service_name, path, params)
fallback = @fallback_strategies[service_name]
if fallback
Rails.logger.warn "Using fallback for #{service_name}"
fallback.call(path, params)
else
raise ServiceUnavailableError, "#{service_name} circuit is open"
end
end
def cached_user_profile(path, params)
user_id = params[:user_id]
Rails.cache.read("user_profile:#{user_id}") || { id: user_id, name: 'Unknown' }
end
def default_recommendations(path, params)
{ items: [], fallback: true }
end
def empty_reviews(path, params)
{ reviews: [], total: 0, fallback: true }
end
def service_url(service_name, path)
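# fetch raises KeyError when the URL is not configured, rather than silently building a relative URL
base_url = ENV.fetch("#{service_name.upcase}_SERVICE_URL")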
"#{base_url}#{path}"
end
def log_service_error(service_name, error)
Rails.logger.error("Service error in #{service_name}: #{error.message}")
ErrorTracker.notify(error, service: service_name)
end
end
Database Query Protection
Circuit breakers protect applications from database query failures when accessing distributed databases or unreliable replicas:
class ReplicaCircuitBreaker
def initialize
@replica_breakers = ENV.fetch('DATABASE_REPLICAS', '').split(',').map do |replica_url|
[
replica_url,
ProductionCircuitBreaker.new(
"replica_#{replica_url}",
failure_threshold: 3,
timeout: 60
)
]
end.to_h
@primary_db = DatabaseConnection.primary
@mutex = Mutex.new
end
def execute_query(sql, params = [])
available_replicas = find_available_replicas
if available_replicas.any?
execute_on_replica(available_replicas.first, sql, params)
else
Rails.logger.warn "All replicas unavailable, using primary"
execute_on_primary(sql, params)
end
end
private
def find_available_replicas
@replica_breakers.select do |url, breaker|
breaker.state != :open
end.keys
end
def execute_on_replica(replica_url, sql, params, tried = [])
breaker = @replica_breakers[replica_url]
breaker.call do
connection = DatabaseConnection.connect(replica_url)
connection.execute(sql, params)
end
rescue CircuitOpenError
# Try the next replica that has not been attempted yet. Tracking attempted
# replicas prevents bouncing forever between replicas that reject in half-open state.
tried = tried + [replica_url]
remaining = find_available_replicas - tried
if remaining.any?
execute_on_replica(remaining.first, sql, params, tried)
else
execute_on_primary(sql, params)
end
end
def execute_on_primary(sql, params)
@primary_db.execute(sql, params)
end
end
Background Job Processing
Circuit breakers prevent job queue saturation when external services fail, deferring jobs until services recover:
class ExternalAPIJob
include Sidekiq::Worker
sidekiq_options queue: :external_api, retry: 5
# attempt counts deferrals caused by an open circuit. Sidekiq's class-level
# sidekiq_options hash does not contain a per-job retry count, so it is passed explicitly.
def perform(job_type, payload, attempt = 0)
breaker = CircuitBreakerRegistry.get(job_type)
breaker.call do
case job_type
when 'webhook_delivery'
deliver_webhook(payload)
when 'data_sync'
sync_external_data(payload)
when 'notification'
send_external_notification(payload)
end
end
rescue CircuitOpenError
# Back off exponentially while the circuit is open, capped at one hour
retry_delay = [2 ** attempt, 3600].min
self.class.perform_in(retry_delay, job_type, payload, attempt + 1)
Rails.logger.warn(
"Job deferred due to open circuit: job_type=#{job_type} " \
"retry_delay=#{retry_delay} circuit_state=#{breaker.state}"
)
rescue => e
Rails.logger.error("Job failed: #{e.message} (job_type=#{job_type})")
raise # Let Sidekiq's retry mechanism handle it
end
private
def deliver_webhook(payload)
HTTP.timeout(10).post(payload['url'], json: payload['data'])
end
def sync_external_data(payload)
ExternalAPI.sync(payload['entity_type'], payload['entity_id'])
end
def send_external_notification(payload)
NotificationService.send(payload['recipient'], payload['message'])
end
end
class CircuitBreakerRegistry
@breakers = {}
@mutex = Mutex.new
def self.get(name)
@mutex.synchronize do
@breakers[name] ||= ProductionCircuitBreaker.new(
name,
failure_threshold: 5,
timeout: 120,
volume_threshold: 10
)
end
end
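def self.all
@mutex.synchronize { @breakers.values }
end
def self.metrics
# Snapshot under the lock so a concurrent get cannot add a key during iteration
@mutex.synchronize { @breakers.dup }.transform_values(&:metrics)
end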
end
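When many jobs hit the same open circuit, a purely deterministic delay makes them all retry at the same moment, which can overwhelm a service that is just recovering. A small helper can add random jitter on top of the capped exponential delay. This is a sketch under assumptions: `open_circuit_delay` and its parameters are hypothetical names, not part of Sidekiq or the classes above.

```ruby
# Hypothetical helper: seconds to wait before re-enqueueing a deferred job.
# attempt - number of previous deferrals (0 for the first)
# cap     - upper bound on the exponential part, in seconds
# jitter  - extra random delay as a fraction of the base delay (0.0 disables it)
def open_circuit_delay(attempt, base: 2, cap: 3600, jitter: 0.0, random: Random.new)
  delay = [base**attempt, cap].min
  delay + random.rand * delay * jitter
end
```

With `jitter: 0.5`, a job on its fourth deferral (`attempt` 3) waits between 8 and 12 seconds instead of exactly 8, spreading retries out over time.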
Reference
Circuit Breaker States
| State | Behavior | Transition Condition |
|---|---|---|
| Closed | All requests pass through to remote service | Opens when failure count reaches threshold |
| Open | All requests fail immediately without calling service | Moves to half-open when timeout period expires |
| Half-Open | Limited test requests allowed | Closes on successful tests, reopens on failure |
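The state table can be expressed as a lookup keyed by the current state and an event. The sketch below is illustrative only; the module name and the event names (`:threshold_reached`, `:timeout_expired`, `:test_succeeded`, `:test_failed`) are hypothetical labels for the transition conditions in the table.

```ruby
# Hypothetical encoding of the state table as data.
module BreakerStates
  TRANSITIONS = {
    [:closed, :threshold_reached] => :open,
    [:open, :timeout_expired] => :half_open,
    [:half_open, :test_succeeded] => :closed,
    [:half_open, :test_failed] => :open
  }.freeze

  # Returns the next state, or the current state when the event does not apply.
  def self.next_state(state, event)
    TRANSITIONS.fetch([state, event], state)
  end
end
```

Keeping transitions in one table makes illegal moves, such as going from open directly to closed, impossible to express by accident.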
Configuration Parameters
| Parameter | Description | Typical Value |
|---|---|---|
| failure_threshold | Number of failures before opening circuit | 3-10 failures |
| error_rate_threshold | Fraction of failed requests that opens the circuit | 0.3-0.7 (30-70%) |
| timeout | Duration circuit remains open before testing recovery | 30-120 seconds |
| volume_threshold | Minimum requests before evaluating error rate | 10-50 requests |
| success_threshold | Successful tests required to close circuit | 2-3 successes |
| window_duration | Time window for tracking failures | 30-120 seconds |
| half_open_limit | Maximum concurrent test requests in half-open state | 1-5 requests |
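To show how `window_duration`, `volume_threshold`, and `error_rate_threshold` interact, here is a minimal sliding-window sketch. It is an illustration under assumptions rather than the implementation used elsewhere in this document: the `SlidingWindow` class and its `clock` parameter are hypothetical, and the class is not thread-safe.

```ruby
# Hypothetical sketch: sliding-window error rate with a minimum request volume.
class SlidingWindow
  def initialize(window_duration: 60, volume_threshold: 20, error_rate_threshold: 0.5,
                 clock: -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) })
    @window_duration = window_duration
    @volume_threshold = volume_threshold
    @error_rate_threshold = error_rate_threshold
    @clock = clock
    @events = []   # [timestamp, success] pairs, oldest first
  end

  # Records the outcome of one request (true for success, false for failure).
  def record(success)
    @events << [@clock.call, success]
    prune
  end

  # True when the window holds enough requests and too many of them failed.
  def should_open?
    prune
    return false if @events.size < @volume_threshold

    failures = @events.count { |_, ok| !ok }
    failures.to_f / @events.size >= @error_rate_threshold
  end

  private

  # Drops events that are older than the window.
  def prune
    cutoff = @clock.call - @window_duration
    @events.shift while @events.any? && @events.first[0] < cutoff
  end
end
```

The volume check prevents a single failed request on a quiet service from producing a 100% error rate and tripping the circuit.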
Implementation Checklist
| Component | Requirement | Implementation |
|---|---|---|
| State Management | Thread-safe state transitions | Use Mutex or atomic operations |
| Failure Tracking | Time-windowed failure counting | Sliding or tumbling window |
| Timeout Handling | Automatic transition to half-open | Background thread or lazy evaluation |
| Metrics Collection | State changes and request outcomes | Integrate with monitoring system |
| Fallback Strategy | Graceful degradation when circuit open | Cache, defaults, or degraded mode |
| Configuration | Externalized threshold values | Environment variables or config files |
| Testing | Simulate failures and recovery | Mock failures, test state transitions |
| Monitoring | Alert on circuit state changes | Dashboard and alerting integration |
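For the Testing row, state transitions are easier to verify when the breaker reads time from an injectable clock, so a test can move time forward instead of sleeping. The sketch below assumes a deliberately minimal breaker; `TestableBreaker` and its `clock` parameter are hypothetical, and the class is not thread-safe.

```ruby
class CircuitOpenError < StandardError; end

# Minimal breaker for demonstration only.
class TestableBreaker
  attr_reader :state

  def initialize(failure_threshold: 2, timeout: 30, clock: -> { Time.now.to_f })
    @failure_threshold = failure_threshold
    @timeout = timeout
    @clock = clock
    @state = :closed
    @failures = 0
    @opened_at = nil
  end

  def call
    if @state == :open
      raise CircuitOpenError if @clock.call - @opened_at < @timeout
      @state = :half_open   # lazy transition once the timeout has elapsed
    end
    result = yield
    @failures = 0
    @state = :closed
    result
  rescue CircuitOpenError
    raise                   # rejections are not counted as failures
  rescue StandardError
    @failures += 1
    trip if @state == :half_open || @failures >= @failure_threshold
    raise
  end

  private

  def trip
    @state = :open
    @opened_at = @clock.call
  end
end

# Simulate failures, then recovery, by advancing a fake clock.
now = 0.0
breaker = TestableBreaker.new(failure_threshold: 2, timeout: 30, clock: -> { now })
2.times { breaker.call { raise "boom" } rescue nil }
breaker.state          # :open after two failures
now = 31.0             # move past the timeout without sleeping
breaker.call { :ok }
breaker.state          # :closed after a successful test request
```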
Common Error Types
| Error Type | Weight | Rationale |
|---|---|---|
| Timeout | High | Indicates resource exhaustion or unresponsive service |
| Connection Failed | High | Service unavailable or network issues |
| HTTP 5xx | Medium | Server errors that may be transient |
| HTTP 408 | High | Request timeout indicates performance degradation |
| HTTP 429 | Medium | Rate limit exceeded, service overloaded |
| Other HTTP 4xx | Low | Client errors typically don't indicate service health |
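The weights in this table can drive a classifier that decides how much each outcome counts toward the failure threshold. The sketch below is one way to encode it. The numeric weights, the method names, and the choice to treat anything unlisted as low are assumptions for illustration.

```ruby
require 'timeout'
require 'socket'

# Illustrative numeric values for the High / Medium / Low weights.
ERROR_WEIGHTS = { high: 1.0, medium: 0.5, low: 0.0 }.freeze

# Classifies an exception or an HTTP status code as :high, :medium, or :low.
# Specific status codes are checked before the general 4xx range.
def failure_weight_class(error_or_status)
  case error_or_status
  when Timeout::Error, Errno::ECONNREFUSED, SocketError then :high
  when 408 then :high
  when 429 then :medium
  when 500..599 then :medium
  when 400..499 then :low
  else :low   # assumption: unlisted outcomes do not count against the service
  end
end

def failure_weight(error_or_status)
  ERROR_WEIGHTS.fetch(failure_weight_class(error_or_status))
end
```

A breaker could then add `failure_weight(outcome)` to its failure count instead of always adding 1.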
Ruby Gems and Libraries
| Gem | Description | Use Case |
|---|---|---|
| circuitbox | Production-ready circuit breaker with Faraday integration | HTTP client protection |
| stoplight | Redis-backed circuit breaker with custom error handlers | Distributed systems |
| semian | Resource protection including circuit breakers and bulkheads | Database and service protection |
| resilient | Flexible circuit breaker with multiple failure detection strategies | Custom implementations |
Metric Definitions
| Metric | Calculation | Purpose |
|---|---|---|
| Error Rate | failures / total_requests | Determine when to open circuit |
| Rejection Rate | rejected / total_requests | Monitor impact of open circuit |
| Mean Time to Recovery | sum(open_duration) / open_count | Optimize timeout configuration |
| False Positive Rate | unnecessary_trips / total_trips | Tune failure thresholds |
| State Duration | time_in_state | Analyze state transition patterns |
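Three of these metrics can be computed directly from counters, as sketched below. The struct and field names are hypothetical. The sketch assumes `total_requests` counts successes, failures, and rejections together, which matches how the rejection rate is derived earlier in this document.

```ruby
# Hypothetical counters collected from a circuit breaker.
BreakerStats = Struct.new(:successes, :failures, :rejected, :open_durations,
                          keyword_init: true) do
  def total_requests
    successes + failures + rejected
  end

  # failures / total_requests
  def error_rate
    total_requests.zero? ? 0.0 : failures.to_f / total_requests
  end

  # rejected / total_requests
  def rejection_rate
    total_requests.zero? ? 0.0 : rejected.to_f / total_requests
  end

  # sum(open_duration) / open_count, in the unit of open_durations
  def mean_time_to_recovery
    open_durations.empty? ? 0.0 : open_durations.sum.to_f / open_durations.size
  end
end
```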
Monitoring Queries
| Query Purpose | Implementation |
|---|---|
| Circuit state distribution | Group by state, count occurrences |
| Rejection rate trend | Calculate rejected / total over time |
| Frequent circuit trips | Count open transitions per circuit |
| Recovery success rate | Divide half-open to closed transitions by all transitions out of half-open |
| Service availability | Calculate uptime from circuit states |
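Two of these queries can be sketched over an in-memory list of state-transition events. The event shape (a hash with `:circuit`, `:from`, and `:to` keys) and the method names are assumptions; in practice the same grouping would usually run in the monitoring system's own query language.

```ruby
# Counts transitions into the open state, grouped by circuit name.
def trips_per_circuit(events)
  events.select { |e| e[:to] == :open }.map { |e| e[:circuit] }.tally
end

# Share of half-open periods that ended with the circuit closing.
def recovery_success_rate(events)
  from_half_open = events.select { |e| e[:from] == :half_open }
  return 0.0 if from_half_open.empty?

  from_half_open.count { |e| e[:to] == :closed }.to_f / from_half_open.size
end
```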