CrackedRuby

Overview

The Circuit Breaker Pattern protects distributed systems from cascading failures by wrapping remote service calls in a monitoring component that tracks failures and automatically stops attempting requests when error rates exceed defined thresholds. Named after electrical circuit breakers that interrupt current flow to prevent damage, this pattern monitors service health and provides three distinct states: Closed (normal operation), Open (failing fast), and Half-Open (testing recovery).

In distributed architectures, a single slow or failing service can exhaust connection pools, thread pools, and other shared resources, causing failures to propagate across multiple services. Circuit breakers detect these failure conditions and interrupt the cascade before system-wide degradation occurs. The pattern originated in Michael Nygard's book "Release It!" and has become a fundamental component of resilient microservice architectures.

The pattern operates by counting failures within a time window. When failures exceed a configured threshold, the circuit "trips" to the Open state and immediately rejects requests without attempting the remote call. After a timeout period, the circuit transitions to Half-Open state, allowing a limited number of test requests to determine if the remote service has recovered. If test requests succeed, the circuit closes and normal operation resumes. If test requests fail, the circuit reopens and the timeout period resets. The minimal example that follows simplifies the counting: it tracks consecutive failures and resets the count on any success rather than using a time window.

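# Raised when the circuit is open and a call is rejected without being attempted.
class CircuitOpenError < StandardError; end

class CircuitBreaker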
  attr_reader :state, :failure_count

  def initialize(failure_threshold: 5, timeout: 60)
    @state = :closed
    @failure_count = 0
    @failure_threshold = failure_threshold
    @timeout = timeout
    @opened_at = nil
  end

  def call
    raise CircuitOpenError if open? && !timeout_expired?
    
    attempt_half_open if open? && timeout_expired?
    
    begin
      result = yield
      on_success
      result
    rescue => e
      on_failure
      raise e
    end
  end

  private

  def open?
    @state == :open
  end

  def timeout_expired?
    Time.now - @opened_at >= @timeout
  end

  def on_success
    @failure_count = 0
    @state = :closed
  end

  def on_failure
    @failure_count += 1
    if @failure_count >= @failure_threshold
      @state = :open
      @opened_at = Time.now
    end
  end

  def attempt_half_open
    @state = :half_open
  end
end

Circuit breakers reduce latency during failures by failing fast instead of waiting for timeouts, preserve system resources by preventing excessive retry attempts, and provide monitoring data about service health. The pattern applies to any remote dependency including HTTP APIs, database connections, message queues, and external services.

Key Principles

The Circuit Breaker Pattern implements automatic failure detection and recovery through state transitions and threshold monitoring. The pattern's effectiveness depends on correctly configuring failure detection criteria, timeout periods, and recovery testing strategies.

State Machine Model

The circuit breaker implements a finite state machine with three states and specific transition conditions:

Closed state represents normal operation where all requests pass through to the remote service. The circuit breaker monitors each request, tracking successes and failures within a sliding time window. When the failure count or failure rate exceeds the configured threshold, the circuit transitions to Open state.

Open state blocks all requests immediately without attempting the remote call, throwing a specific exception that calling code can handle. This fail-fast behavior prevents resource exhaustion and reduces latency. The circuit remains open for a configured timeout period, after which it transitions to Half-Open state.

Half-Open state allows a limited number of test requests to determine if the remote service has recovered. If test requests succeed, the circuit transitions back to Closed state and normal operation resumes. If test requests fail, the circuit returns to Open state and the timeout period resets.

class StatefulCircuitBreaker
  STATES = [:closed, :open, :half_open].freeze

  def initialize(options = {})
    @failure_threshold = options[:failure_threshold] || 5
    @success_threshold = options[:success_threshold] || 2
    @timeout = options[:timeout] || 60
    @half_open_limit = options[:half_open_limit] || 3
    
    @state = :closed
    @failure_count = 0
    @success_count = 0
    @half_open_attempts = 0
    @opened_at = nil
    @mutex = Mutex.new
  end

  def call
    @mutex.synchronize do
      case @state
      when :open
        check_timeout
        raise CircuitOpenError, "Circuit is open" if @state == :open
      when :half_open
        raise CircuitOpenError, "Half-open limit reached" if @half_open_attempts >= @half_open_limit
        @half_open_attempts += 1
      end
    end

    execute_with_monitoring { yield }
  end

  private

  def check_timeout
    if Time.now - @opened_at >= @timeout
      transition_to_half_open
    end
  end

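  def execute_with_monitoring
    result = yield
    record_success
    result
  rescue => error
    record_failure
    raise error
  end

  # Closes the circuit after enough half-open successes; in the closed
  # state any success clears the consecutive-failure count.
  def record_success
    @mutex.synchronize do
      if @state == :half_open
        @success_count += 1
        transition_to_closed if @success_count >= @success_threshold
      else
        @failure_count = 0
      end
    end
  end

  # Reopens immediately on a half-open failure; otherwise trips once the
  # failure threshold is reached.
  def record_failure
    @mutex.synchronize do
      @failure_count += 1
      transition_to_open if @state == :half_open || @failure_count >= @failure_threshold
    end
  end

  def transition_to_open
    @state = :open
    @opened_at = Time.now
  end

  def transition_to_half_open
    @state = :half_open
    @half_open_attempts = 0
    @success_count = 0
  end

  def transition_to_closed
    @state = :closed
    @failure_count = 0
    @success_count = 0
  end
end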

Failure Detection Criteria

Circuit breakers monitor multiple failure indicators to determine when to trip. Failure counting strategies include consecutive failures, failure rate within a time window, and weighted failure scoring. Different failure types may carry different weights: a timeout might count more heavily than a refused connection, because a timed-out call holds a thread and a connection for the full wait while a refused connection fails immediately.
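
Weighted failure scoring can be sketched roughly as follows. The class name WeightedFailureScore, the weight values, and the trip score are illustrative assumptions rather than part of any particular library:

```ruby
require 'timeout'

# Hypothetical sketch: each error class contributes a weight to a running
# score, and the breaker should trip once the score reaches a limit.
class WeightedFailureScore
  # Assumed weights: a timeout ties up a worker for the full wait, so it
  # counts more than a connection that is refused immediately.
  DEFAULT_WEIGHTS = {
    Timeout::Error => 3,
    Errno::ECONNREFUSED => 1
  }.freeze

  attr_reader :score

  def initialize(trip_score: 10, weights: DEFAULT_WEIGHTS, default_weight: 1)
    @trip_score = trip_score
    @weights = weights
    @default_weight = default_weight
    @score = 0
  end

  # Adds the weight of the first matching error class, or the default
  # weight when no class matches.
  def record(error)
    match = @weights.find { |klass, _weight| error.is_a?(klass) }
    @score += match ? match.last : @default_weight
  end

  def reset
    @score = 0
  end

  def tripped?
    @score >= @trip_score
  end
end
```

A breaker built this way would call record from its failure handler and reset on a successful call or at the start of a new window, then trip when tripped? returns true.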

Time window implementations track failures in sliding windows or tumbling windows. Sliding windows provide more accurate failure rate calculations but require more memory to store individual request timestamps. Tumbling windows aggregate failures into fixed time buckets, using less memory but potentially missing failure rate spikes that span bucket boundaries.
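
The sliding-window variant appears later in this article; a tumbling window could look something like the sketch below. The class name TumblingWindowCounter and the injectable clock parameter are assumptions made for illustration and testing:

```ruby
# Hypothetical sketch of a tumbling-window counter: requests are aggregated
# into fixed buckets of `bucket_seconds`, and only the current bucket is kept.
class TumblingWindowCounter
  def initialize(bucket_seconds: 10,
                 clock: -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) })
    @bucket_seconds = bucket_seconds
    @clock = clock
    @bucket = current_bucket
    @successes = 0
    @failures = 0
  end

  def record(success:)
    roll_bucket
    if success
      @successes += 1
    else
      @failures += 1
    end
  end

  def failure_rate
    roll_bucket
    total = @successes + @failures
    total.zero? ? 0.0 : @failures.to_f / total
  end

  private

  # Index of the bucket that the current clock reading falls into.
  def current_bucket
    (@clock.call / @bucket_seconds).floor
  end

  # Discards the counts as soon as the clock moves into a new bucket.
  def roll_bucket
    bucket = current_bucket
    return if bucket == @bucket

    @bucket = bucket
    @successes = 0
    @failures = 0
  end
end
```

Only two integers are stored per breaker, which is the memory advantage described above. The cost is visible at the boundary: failures recorded just before and just after a bucket change are never counted together.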

Threshold Configuration

Failure thresholds balance sensitivity and stability. Low thresholds cause circuits to trip quickly but may trigger on transient failures. High thresholds tolerate more failures but take longer to detect persistent outages. The optimal threshold depends on expected failure rates, request volume, and acceptable latency.

Success thresholds in Half-Open state determine how many consecutive successful requests must occur before closing the circuit. Higher success thresholds provide more confidence in recovery but delay return to normal operation. A common choice is 2-3 successful requests.

Timeout duration controls how long the circuit remains open before attempting recovery. Short timeouts enable faster recovery from transient failures but may cause repeated state transitions if the remote service needs extended recovery time. Long timeouts reduce state transition overhead but delay recovery unnecessarily for brief outages.

Request Isolation

Circuit breakers operate independently for each remote service or endpoint. Failures in one service should not trigger circuit breakers for unrelated services. Some implementations maintain circuit breaker state per endpoint, while others group related endpoints under a single circuit breaker to simplify configuration.

Thread safety requires synchronization when updating circuit breaker state from multiple concurrent requests. Lock-free implementations using atomic operations provide better performance but increase implementation complexity. Most production circuit breakers use mutexes or similar locking mechanisms to ensure state consistency.
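
Per-service isolation is often handled with a small registry that hands out one breaker per name. The sketch below is a minimal version under stated assumptions: BreakerRegistry and its factory block are hypothetical names, and the factory stands in for whichever breaker class the application uses:

```ruby
# Hypothetical registry: one breaker per name, created lazily under a lock
# so concurrent callers asking for the same name share a single instance.
class BreakerRegistry
  def initialize(&factory)
    raise ArgumentError, 'factory block required' unless factory

    @factory = factory
    @breakers = {}
    @mutex = Mutex.new
  end

  # Returns the breaker registered under `name`, building it on first use.
  def get(name)
    @mutex.synchronize { @breakers[name] ||= @factory.call(name) }
  end

  def names
    @mutex.synchronize { @breakers.keys }
  end
end
```

An application might build the registry once at boot, for example with a block that constructs a breaker from per-service settings, and then call get(:payment_gateway) wherever that dependency is used.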

Ruby Implementation

Ruby applications implement circuit breakers through gems such as circuitbox, stoplight, and semian, through custom classes, or through integration with service mesh components. These implementations differ in feature sets and performance characteristics.

Core Implementation Pattern

A production-ready Ruby circuit breaker requires thread-safe state management, configurable thresholds, and monitoring hooks:

require 'thread'

# Defined at the top level so calling code can write `rescue CircuitOpenError`.
class CircuitOpenError < StandardError; end

class ProductionCircuitBreaker
  
  attr_reader :name, :state, :failure_count, :last_failure_time

  def initialize(name, options = {})
    @name = name
    @failure_threshold = options[:failure_threshold] || 5
    @success_threshold = options[:success_threshold] || 2
    @timeout = options[:timeout] || 60
    @volume_threshold = options[:volume_threshold] || 10
    @error_rate_threshold = options[:error_rate_threshold] || 0.5
    
    @state = :closed
    @failure_count = 0
    @success_count = 0
    @request_count = 0
    @opened_at = nil
    @last_failure_time = nil
    @mutex = Mutex.new
    @window_start = Time.now
    @window_duration = options[:window_duration] || 60
  end

  def call(&block)
    check_and_update_state
    
    if @state == :open
      record_rejected_request
      raise CircuitOpenError, "Circuit #{@name} is open"
    end

    execute_protected(&block)
  end

  def metrics
    @mutex.synchronize do
      {
        name: @name,
        state: @state,
        failure_count: @failure_count,
        success_count: @success_count,
        request_count: @request_count,
        error_rate: calculate_error_rate,
        opened_at: @opened_at,
        last_failure_time: @last_failure_time
      }
    end
  end

  private

  def check_and_update_state
    @mutex.synchronize do
      reset_window_if_expired
      
      if @state == :open && timeout_expired?
        transition_to(:half_open)
      end
    end
  end

  def execute_protected
    begin
      result = yield
      handle_success
      result
    rescue => error
      handle_failure(error)
      raise error
    end
  end

  def handle_success
    @mutex.synchronize do
      @request_count += 1
      
      if @state == :half_open
        @success_count += 1
        if @success_count >= @success_threshold
          transition_to(:closed)
        end
      end
    end
  end

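  def handle_failure(error)
    @mutex.synchronize do
      @request_count += 1
      @failure_count += 1
      @last_failure_time = Time.now

      # Any failure while half-open reopens the circuit. Without this check
      # the volume threshold would let failed probes pass without tripping.
      if @state == :half_open || should_trip?
        transition_to(:open)
      end
    end
  end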

  def should_trip?
    return false if @request_count < @volume_threshold
    
    error_rate = calculate_error_rate
    error_rate >= @error_rate_threshold || @failure_count >= @failure_threshold
  end

  def calculate_error_rate
    return 0.0 if @request_count == 0
    @failure_count.to_f / @request_count
  end

  def reset_window_if_expired
    if Time.now - @window_start >= @window_duration
      @failure_count = 0
      @request_count = 0
      @window_start = Time.now
    end
  end

  def timeout_expired?
    return false unless @opened_at
    Time.now - @opened_at >= @timeout
  end

  def transition_to(new_state)
    old_state = @state
    @state = new_state
    
    case new_state
    when :open
      @opened_at = Time.now
    when :half_open
      @success_count = 0
      @failure_count = 0
    when :closed
      @failure_count = 0
      @success_count = 0
      @opened_at = nil
    end
    
    notify_state_change(old_state, new_state)
  end

  def notify_state_change(old_state, new_state)
    # Hook for monitoring systems
    puts "Circuit #{@name}: #{old_state} -> #{new_state}"
  end

  def record_rejected_request
    # Hook for metrics collection
  end
end

Using the Faraday Middleware

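Faraday does not include a circuit breaker itself, but third-party middleware gems add one to the HTTP request/response cycle. The require path, option names, and error class in the following example are illustrative and vary between gems, so check the documentation of the middleware you install: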

require 'faraday'
require 'faraday/circuit_breaker'

# Configure circuit breaker for HTTP client
conn = Faraday.new(url: 'https://api.example.com') do |f|
  f.request :circuit_breaker,
    threshold: 5,
    timeout: 60,
    exceptions: [Faraday::TimeoutError, Faraday::ConnectionFailed]
  
  f.adapter Faraday.default_adapter
end

# Circuit breaker protects this request
begin
  response = conn.get('/users')
rescue CircuitBreaker::OpenCircuitError => e
  # Handle open circuit
  Rails.logger.warn "Circuit open for users API: #{e.message}"
  return cached_users
end

Redis-Backed Circuit Breaker

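Distributed applications often need circuit breaker state shared across multiple processes or servers. Redis provides shared state and atomic primitives such as INCR. The example here keeps the logic readable by issuing separate GET and SET commands, so two processes can race between the read and the write; a production version would typically wrap the check-and-update in a Lua script or a MULTI/EXEC transaction: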

require 'redis'

class RedisCircuitBreaker
  def initialize(redis, key_prefix, options = {})
    @redis = redis
    @key_prefix = key_prefix
    @failure_threshold = options[:failure_threshold] || 5
    @timeout = options[:timeout] || 60
  end

  def call(circuit_name, &block)
    state_key = "#{@key_prefix}:#{circuit_name}:state"
    failure_key = "#{@key_prefix}:#{circuit_name}:failures"
    
    state = @redis.get(state_key) || 'closed'
    
    if state == 'open'
      opened_at = @redis.get("#{@key_prefix}:#{circuit_name}:opened_at").to_i
      if Time.now.to_i - opened_at >= @timeout
        @redis.set(state_key, 'half_open')
        state = 'half_open'
      else
        raise CircuitOpenError, "Circuit #{circuit_name} is open"
      end
    end

    begin
      result = yield
      
      if state == 'half_open'
        @redis.del(failure_key)
        @redis.set(state_key, 'closed')
      end
      
      result
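    rescue => error
      failures = @redis.incr(failure_key)
      @redis.expire(failure_key, @timeout)
      
      # A failed probe in half-open reopens the circuit immediately. The
      # failure counter has usually expired by then, so the threshold
      # check alone would leave the circuit stuck in half-open.
      if state == 'half_open' || failures >= @failure_threshold
        @redis.set(state_key, 'open')
        @redis.set("#{@key_prefix}:#{circuit_name}:opened_at", Time.now.to_i)
      end
      
      raise error
    end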
  end
end

# Usage across multiple processes
redis = Redis.new(url: ENV['REDIS_URL'])
breaker = RedisCircuitBreaker.new(redis, 'cb', failure_threshold: 3, timeout: 30)

breaker.call('payment_service') do
  PaymentAPI.charge(amount: 100)
end

Integration with Sidekiq

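Background job processing benefits from circuit breakers because they keep retry queues from filling with jobs that are certain to fail. CircuitBreakerRegistry in this example is an application-defined lookup that returns one shared breaker per dependency: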

class PaymentProcessor
  include Sidekiq::Worker
  
  sidekiq_options retry: 3
  
  def perform(payment_id)
    breaker = CircuitBreakerRegistry.get(:payment_gateway)
    
    breaker.call do
      payment = Payment.find(payment_id)
      PaymentGateway.process(payment)
    end
  rescue CircuitOpenError => e
    # Reschedule when circuit opens
    self.class.perform_in(5.minutes, payment_id)
  rescue => e
    Rails.logger.error "Payment processing failed: #{e.message}"
    raise e
  end
end

Design Considerations

Circuit breakers add complexity and latency overhead to service calls. Applications must evaluate whether the resilience benefits outweigh these costs based on failure rates, request patterns, and system architecture.

When to Apply Circuit Breakers

Circuit breakers provide value when calling remote services with unpredictable availability, when cascading failures risk system-wide outages, when failing fast improves user experience compared to waiting for timeouts, and when resource exhaustion from hanging connections threatens system stability.

Synchronous HTTP APIs represent the primary use case for circuit breakers. These calls can hang indefinitely if the remote service becomes unresponsive, exhausting thread pools and connection pools. Circuit breakers detect these conditions and prevent resource exhaustion.

Asynchronous message queues benefit less from circuit breakers since message processing naturally decouples request and response. However, circuit breakers remain useful when message handlers call synchronous services or when queue backlog growth indicates consumer failures.

Database connections typically use connection pool limits rather than circuit breakers since databases rarely experience the transient failures common in network services. Circuit breakers may apply to database queries that access unreliable external data sources or distributed database clusters with node failures.

Trade-offs and Limitations

Circuit breakers introduce several trade-offs. State management adds overhead to every call: an in-process breaker costs a mutex acquisition and a few comparisons, while a Redis-backed breaker adds at least one network round trip per request. Memory usage grows with the number of monitored endpoints and the size of failure tracking windows. Thread synchronization adds contention under high concurrency.

False positives occur when circuits trip during legitimate traffic spikes or transient network issues. Applications must implement fallback strategies to handle open circuits gracefully. Fallback options include returning cached data, degraded functionality, or user-friendly error messages.
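
The fallback options listed above can be chained. The following is a minimal sketch, assuming a helper named with_fallbacks (hypothetical) and a CircuitOpenError class like the one used in this article's examples:

```ruby
# Stand-in for the error a breaker raises while open (name assumed to
# match the examples in this article).
class CircuitOpenError < StandardError; end

# Hypothetical helper: runs the primary call and, if the circuit is open,
# walks the fallbacks in order and returns the first non-nil value.
def with_fallbacks(primary, *fallbacks)
  primary.call
rescue CircuitOpenError
  fallbacks.each do |fallback|
    value = fallback.call
    return value unless value.nil?
  end
  nil
end

cache = {} # empty cache, so the cached-data fallback yields nil

users = with_fallbacks(
  -> { raise CircuitOpenError, 'users API circuit is open' },
  -> { cache[:users] }, # 1. cached data
  -> { [] }             # 2. degraded default: an empty list
)
```

Here the cache is empty, so the call falls through to the degraded default. Only CircuitOpenError triggers the fallbacks; other errors still propagate so that real failures stay visible.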

False negatives occur when failures persist below the threshold or when slow responses exhaust resources without triggering failure counts. Timeout-based circuit breakers address this by treating slow responses as failures.
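
One way to treat slow responses as failures is to time each call and raise when it exceeds a latency budget, so that the surrounding breaker records it. This is a sketch only: fail_if_slow and SlowCallError are hypothetical names, and the helper classifies a call after it completes rather than interrupting it:

```ruby
# Hypothetical error used to mark a call that succeeded but took too long.
class SlowCallError < StandardError; end

# Runs the block and raises SlowCallError when it took longer than
# `threshold` seconds, so a surrounding breaker counts it as a failure.
# The monotonic clock avoids skew from wall-clock adjustments.
def fail_if_slow(threshold:)
  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  result = yield
  elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started

  if elapsed > threshold
    raise SlowCallError, format('call took %.3fs (limit %.3fs)', elapsed, threshold)
  end

  result
end
```

Inside a breaker this would wrap the remote call, for example breaker.call { fail_if_slow(threshold: 0.5) { client.fetch } }, where client.fetch is a placeholder. A client-side timeout is still needed to bound how long the call can hang.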

State consistency across distributed systems requires shared storage like Redis, adding external dependencies and failure modes. Local circuit breakers in each process provide simpler implementation but prevent coordinated failure response across service instances.

Alternative Patterns

Several patterns address similar reliability concerns with different trade-offs:

Retry logic with exponential backoff provides resilience against transient failures without maintaining state. Retries work well for truly transient issues but can worsen cascading failures by generating more load on already struggling services. Circuit breakers complement retry logic by preventing retries when failures persist.
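
The interplay between retries and an open circuit can be sketched as follows. The helper retry_with_backoff, its parameters, and the injectable sleeper are assumptions for illustration; CircuitOpenError mirrors the error class used in this article's examples:

```ruby
# Stand-in for the error a breaker raises while open (assumed name).
class CircuitOpenError < StandardError; end

# Retries the block with exponential backoff, but gives up immediately
# when the breaker reports an open circuit.
def retry_with_backoff(attempts: 3, base_delay: 0.1, sleeper: ->(seconds) { sleep(seconds) })
  tries = 0
  begin
    tries += 1
    yield
  rescue CircuitOpenError
    raise # retrying against an open circuit only adds load
  rescue StandardError
    raise if tries >= attempts

    # Delays grow as base_delay, 2 * base_delay, 4 * base_delay, ...
    sleeper.call(base_delay * (2**(tries - 1)))
    retry
  end
end
```

The block passed in would normally be the breaker call itself, so each retry still passes through the breaker and stops as soon as it trips.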

Timeouts limit request duration but don't prevent repeated attempts against failing services. Aggressive timeouts reduce resource exhaustion but may interrupt legitimate slow operations. Circuit breakers reduce timeout-related load by blocking requests entirely.

Bulkheads isolate resources to prevent failures in one component from affecting others. Thread pool isolation and connection pool limits implement bulkheads at the resource level. Circuit breakers operate at the service call level. Combining both patterns provides defense in depth.
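
A bulkhead at the call level can be as simple as a cap on concurrent calls to one dependency. The sketch below uses hypothetical names (Bulkhead, BulkheadFullError) and rejects overflow instead of queueing it:

```ruby
# Hypothetical error raised when the bulkhead has no free capacity.
class BulkheadFullError < StandardError; end

# Hypothetical bulkhead: caps concurrent calls to one dependency and
# rejects the overflow immediately instead of letting callers pile up.
class Bulkhead
  def initialize(max_concurrent:)
    @max_concurrent = max_concurrent
    @in_flight = 0
    @mutex = Mutex.new
  end

  def call
    @mutex.synchronize do
      raise BulkheadFullError, 'no capacity' if @in_flight >= @max_concurrent

      @in_flight += 1
    end

    begin
      yield
    ensure
      # Always free the slot, whether the call succeeded or raised.
      @mutex.synchronize { @in_flight -= 1 }
    end
  end
end
```

Nesting a circuit breaker call inside the bulkhead gives both protections: the bulkhead limits how many threads one dependency can occupy, and the breaker stops calls once that dependency is known to be failing.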

Rate limiting controls request volume but doesn't respond to service health. Circuit breakers complement rate limiting by adjusting request behavior based on failure rates rather than just volume.

Practical Examples

Circuit breakers protect applications across various scenarios from microservice communication to third-party API integration.

HTTP API Protection

An e-commerce application calls a recommendation service to display product suggestions. When the recommendation service experiences issues, circuit breakers prevent checkout delays:

class RecommendationService
  def initialize
    @circuit_breaker = ProductionCircuitBreaker.new(
      'recommendations',
      failure_threshold: 5,
      timeout: 30,
      volume_threshold: 20,
      error_rate_threshold: 0.5
    )
    @cache = Rails.cache
  end

  def recommendations_for(user_id, category)
    cache_key = "recommendations:#{user_id}:#{category}"
    
    @circuit_breaker.call do
      response = HTTP.timeout(2).get(
        "#{ENV['RECOMMENDATIONS_URL']}/recommendations",
        params: { user_id: user_id, category: category }
      )
      
      recommendations = JSON.parse(response.body)
      @cache.write(cache_key, recommendations, expires_in: 1.hour)
      recommendations
    end
  rescue CircuitOpenError => e
    # Return cached recommendations when circuit is open
    Rails.logger.warn "Recommendations circuit open, using cache"
    cached = @cache.read(cache_key)
    cached || default_recommendations(category)
  rescue HTTP::TimeoutError, HTTP::ConnectionError => e
    # The breaker has already recorded this failure; log it and re-raise
    Rails.logger.error "Recommendations request failed: #{e.message}"
    raise e
  end

  private

  def default_recommendations(category)
    # Return popular items as fallback
    Product.where(category: category)
           .order(sales_count: :desc)
           .limit(10)
           .map { |p| { id: p.id, name: p.name, price: p.price } }
  end
end

Payment Gateway Resilience

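Payment processing requires high reliability since failures directly impact revenue. Circuit breakers prevent payment queue backlog during gateway outages. Note that in this example a declined payment raises PaymentError inside the breaker block and therefore counts toward the failure threshold; a stricter implementation would count only transport errors such as timeouts and connection failures: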

class PaymentGatewayClient
  class PaymentError < StandardError; end
  
  def initialize
    @breakers = {
      authorize: ProductionCircuitBreaker.new('payment_authorize',
        failure_threshold: 3,
        timeout: 60,
        volume_threshold: 10
      ),
      capture: ProductionCircuitBreaker.new('payment_capture',
        failure_threshold: 3,
        timeout: 60,
        volume_threshold: 10
      ),
      refund: ProductionCircuitBreaker.new('payment_refund',
        failure_threshold: 2,
        timeout: 120,
        volume_threshold: 5
      )
    }
  end

  def authorize(amount, card_token)
    @breakers[:authorize].call do
      response = gateway_request('/authorize', {
        amount: amount,
        card_token: card_token
      })
      
      unless response['success']
        raise PaymentError, response['error']
      end
      
      {
        transaction_id: response['transaction_id'],
        authorized_at: Time.now,
        amount: amount
      }
    end
  rescue CircuitOpenError => e
    # Queue for retry when gateway recovers
    PaymentRetryJob.perform_later(amount, card_token)
    raise PaymentError, "Payment gateway unavailable"
  end

  def capture(transaction_id, amount)
    @breakers[:capture].call do
      response = gateway_request('/capture', {
        transaction_id: transaction_id,
        amount: amount
      })
      
      unless response['success']
        raise PaymentError, response['error']
      end
      
      {
        captured_at: Time.now,
        amount: amount
      }
    end
  rescue CircuitOpenError => e
    # Capture operations must succeed eventually
    PaymentCaptureRetryJob.perform_in(5.minutes, transaction_id, amount)
    raise PaymentError, "Payment capture deferred"
  end

  private

  def gateway_request(path, params)
    response = HTTP.timeout(5)
                   .headers(authorization: "Bearer #{ENV['GATEWAY_TOKEN']}")
                   .post("#{ENV['GATEWAY_URL']}#{path}", json: params)
    
    JSON.parse(response.body)
  rescue HTTP::TimeoutError => e
    raise PaymentError, "Gateway timeout"
  rescue HTTP::ConnectionError => e
    raise PaymentError, "Gateway connection failed"
  end
end

Microservice Communication

Service mesh architectures implement circuit breakers at the proxy level, but application-level circuit breakers provide additional control and context-specific behavior:

class OrderService
  def initialize
    @inventory_breaker = ProductionCircuitBreaker.new(
      'inventory_service',
      failure_threshold: 5,
      timeout: 30
    )
    
    @shipping_breaker = ProductionCircuitBreaker.new(
      'shipping_service',
      failure_threshold: 3,
      timeout: 45
    )
  end

  def create_order(cart_items, shipping_address)
    order = Order.create!(
      user_id: current_user.id,
      status: 'pending',
      items: cart_items
    )

    # Reserve inventory with circuit breaker
    inventory_reserved = false
    begin
      @inventory_breaker.call do
        InventoryClient.reserve(order.id, cart_items)
      end
      inventory_reserved = true
    rescue CircuitOpenError => e
      order.update!(
        status: 'failed',
        failure_reason: 'inventory_unavailable'
      )
      return { success: false, error: 'Inventory service unavailable' }
    rescue InventoryClient::InsufficientStockError => e
      order.update!(status: 'failed', failure_reason: 'insufficient_stock')
      return { success: false, error: 'Insufficient stock' }
    end

    # Calculate shipping with circuit breaker
    begin
      @shipping_breaker.call do
        shipping_quote = ShippingClient.calculate(
          cart_items,
          shipping_address
        )
        order.update!(
          shipping_cost: shipping_quote[:cost],
          estimated_delivery: shipping_quote[:estimated_delivery]
        )
      end
    rescue CircuitOpenError => e
      # Shipping calculation failure doesn't block order creation
      order.update!(
        shipping_cost: 0,
        estimated_delivery: 7.days.from_now,
        requires_shipping_calculation: true
      )
      Rails.logger.warn "Order #{order.id} created without shipping calculation"
    end

    order.update!(status: 'confirmed')
    { success: true, order: order }
  end
end

Third-Party API Integration

Applications integrating third-party APIs face unpredictable availability and rate limits. Circuit breakers prevent API quota exhaustion and reduce error rates:

class WeatherDataClient
  def initialize
    @breaker = ProductionCircuitBreaker.new(
      'weather_api',
      failure_threshold: 10,
      timeout: 120,
      volume_threshold: 50,
      error_rate_threshold: 0.3
    )
    @rate_limiter = RateLimiter.new(max_requests: 100, period: 1.minute)
  end

  def current_weather(location)
    cache_key = "weather:current:#{location}"
    
    # Return cached data if available
    cached = Rails.cache.read(cache_key)
    return cached if cached

    # Rate limit before circuit breaker
    unless @rate_limiter.allow?
      return Rails.cache.read("weather:stale:#{location}")
    end

    @breaker.call do
      response = HTTP.timeout(3).get(
        "#{ENV['WEATHER_API_URL']}/current",
        params: {
          location: location,
          api_key: ENV['WEATHER_API_KEY']
        }
      )
      
      weather_data = JSON.parse(response.body)
      
      # Cache successful response
      Rails.cache.write(cache_key, weather_data, expires_in: 10.minutes)
      Rails.cache.write("weather:stale:#{location}", weather_data, expires_in: 2.hours)
      
      weather_data
    end
  rescue CircuitOpenError => e
    # Return stale data when circuit is open
    stale = Rails.cache.read("weather:stale:#{location}")
    stale || { temperature: nil, conditions: 'unavailable' }
  rescue JSON::ParserError, HTTP::Error => e
    Rails.logger.error "Weather API error: #{e.message}"
    raise e
  end
end

Common Patterns

Circuit breaker implementations vary in state management strategies, failure detection mechanisms, and recovery testing approaches. Production systems often combine multiple patterns to address specific reliability requirements.

Sliding Window Pattern

Sliding window circuit breakers track individual request timestamps within a time window, providing accurate failure rate calculations but requiring more memory:

class SlidingWindowCircuitBreaker
  def initialize(name, options = {})
    @name = name
    @window_size = options[:window_size] || 60
    @failure_threshold = options[:failure_threshold] || 0.5
    @volume_threshold = options[:volume_threshold] || 10
    @timeout = options[:timeout] || 60
    
    @state = :closed
    @requests = []
    @opened_at = nil
    @mutex = Mutex.new
  end

  def call(&block)
    @mutex.synchronize { check_state }
    
    raise CircuitOpenError if @state == :open

    begin
      result = yield
      record_request(success: true)
      result
    rescue => error
      record_request(success: false)
      raise error
    end
  end

  private

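  def record_request(success:)
    @mutex.synchronize do
      now = Time.now
      @requests << { timestamp: now, success: success }
      remove_old_requests(now)

      if @state == :half_open
        # The probe after the timeout decides recovery: a success closes
        # the circuit and a failure reopens it.
        if success
          @state = :closed
        else
          @state = :open
          @opened_at = now
        end
      else
        evaluate_threshold
      end
    end
  end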

  def remove_old_requests(current_time)
    cutoff = current_time - @window_size
    @requests.reject! { |req| req[:timestamp] < cutoff }
  end

  def evaluate_threshold
    return if @requests.size < @volume_threshold

    failure_count = @requests.count { |req| !req[:success] }
    failure_rate = failure_count.to_f / @requests.size

    if failure_rate >= @failure_threshold
      @state = :open
      @opened_at = Time.now
    end
  end

  def check_state
    if @state == :open && Time.now - @opened_at >= @timeout
      @state = :half_open
      @requests.clear
    end
  end
end

Token Bucket Recovery Pattern

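Token bucket patterns control Half-Open state testing by limiting the number of concurrent test requests, preventing thundering herd problems during recovery. The tokens here form a fixed pool of permits that are returned when a request finishes, which makes the mechanism closer to a counting semaphore than to a rate-refilled token bucket: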

class TokenBucketCircuitBreaker
  def initialize(name, options = {})
    @name = name
    @failure_threshold = options[:failure_threshold] || 5
    @timeout = options[:timeout] || 60
    @max_half_open_requests = options[:max_half_open_requests] || 3
    @success_threshold = options[:success_threshold] || 2
    
    @state = :closed
    @failure_count = 0
    @success_count = 0
    @half_open_tokens = @max_half_open_requests
    @opened_at = nil
    @mutex = Mutex.new
    @condition = ConditionVariable.new
  end

  def call(&block)
    acquire_token
    
    begin
      result = yield
      handle_success
      result
    rescue => error
      handle_failure
      raise error
    ensure
      release_token
    end
  end

  private

  def acquire_token
    @mutex.synchronize do
      check_timeout
      
      while @state == :half_open && @half_open_tokens <= 0
        @condition.wait(@mutex)
      end
      
      raise CircuitOpenError if @state == :open
      
      @half_open_tokens -= 1 if @state == :half_open
    end
  end

  def release_token
    @mutex.synchronize do
      if @state == :half_open
        @half_open_tokens += 1
        @condition.signal
      end
    end
  end

  def handle_success
    @mutex.synchronize do
      if @state == :half_open
        @success_count += 1
        if @success_count >= @success_threshold
          transition_to_closed
        end
      end
      
      @failure_count = 0 if @state == :closed
    end
  end

  def handle_failure
    @mutex.synchronize do
      @failure_count += 1
      
      if @state == :half_open || @failure_count >= @failure_threshold
        transition_to_open
      end
    end
  end

  def check_timeout
    if @state == :open && Time.now - @opened_at >= @timeout
      transition_to_half_open
    end
  end

  def transition_to_closed
    @state = :closed
    @failure_count = 0
    @success_count = 0
    @half_open_tokens = @max_half_open_requests
    @condition.broadcast
  end

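  def transition_to_open
    @state = :open
    @opened_at = Time.now
    @success_count = 0
    # Wake threads waiting for a half-open token so they fail fast
    # instead of blocking until the next state change.
    @condition.broadcast
  end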

  def transition_to_half_open
    @state = :half_open
    @half_open_tokens = @max_half_open_requests
    @success_count = 0
  end
end

Adaptive Threshold Pattern

Adaptive circuit breakers adjust failure thresholds based on historical patterns and request volume, reducing false positives during traffic spikes:

class AdaptiveCircuitBreaker
  def initialize(name, options = {})
    @name = name
    @base_threshold = options[:base_threshold] || 0.5
    @adaptation_rate = options[:adaptation_rate] || 0.1
    @window_size = options[:window_size] || 60
    @timeout = options[:timeout] || 60
    
    @state = :closed
    @current_threshold = @base_threshold
    @historical_error_rate = 0.0
    @requests = []
    @opened_at = nil
    @mutex = Mutex.new
  end

  def call(&block)
    check_and_adapt
    
    raise CircuitOpenError if @state == :open

    begin
      result = yield
      record_success
      result
    rescue => error
      record_failure
      raise error
    end
  end

  private

  def check_and_adapt
    @mutex.synchronize do
      clean_old_requests
      adapt_threshold
      
      if @state == :open && timeout_expired?
        @state = :half_open
      end
      
      evaluate_state
    end
  end

  def adapt_threshold
    return if @requests.empty?
    
    current_error_rate = calculate_error_rate
    @historical_error_rate = (@historical_error_rate * (1 - @adaptation_rate)) +
                             (current_error_rate * @adaptation_rate)
    
    # Adjust threshold based on historical patterns
    @current_threshold = [@base_threshold + (@historical_error_rate * 0.5), 0.9].min
  end

  def calculate_error_rate
    failures = @requests.count { |r| !r[:success] }
    failures.to_f / @requests.size
  end

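  def evaluate_state
    # Skip when already open so repeated calls do not keep resetting @opened_at
    # and postponing the half-open transition.
    return if @state == :open
    return unless @requests.size >= 20

    current_error_rate = calculate_error_rate

    if current_error_rate >= @current_threshold
      @state = :open
      @opened_at = Time.now
    end
  end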

  def clean_old_requests
    cutoff = Time.now - @window_size
    @requests.reject! { |r| r[:timestamp] < cutoff }
  end

  def record_success
    @mutex.synchronize do
      @requests << { timestamp: Time.now, success: true }
      @state = :closed if @state == :half_open
    end
  end

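  def record_failure
    @mutex.synchronize do
      @requests << { timestamp: Time.now, success: false }
      # A failed test request in half-open reopens the circuit immediately
      if @state == :half_open
        @state = :open
        @opened_at = Time.now
      end
    end
  end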

  def timeout_expired?
    @opened_at && Time.now - @opened_at >= @timeout
  end
end
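
To see how the smoothing in adapt_threshold behaves, the following sketch replays the same arithmetic outside the class. The helper name adapted_threshold and the sample error rates are made up for illustration; the constants (base threshold 0.5, adaptation rate 0.1, the 0.5 weighting, and the 0.9 cap) are the defaults used in the class above.

```ruby
# Standalone replay of the adapt_threshold arithmetic (illustrative helper,
# not part of AdaptiveCircuitBreaker).
def adapted_threshold(error_rates, base_threshold: 0.5, adaptation_rate: 0.1)
  historical = 0.0
  error_rates.map do |rate|
    # Exponentially weighted moving average of the observed error rate
    historical = historical * (1 - adaptation_rate) + rate * adaptation_rate
    # The threshold rises with the historical rate but never exceeds 0.9
    [base_threshold + historical * 0.5, 0.9].min.round(4)
  end
end

thresholds = adapted_threshold([0.0, 0.2, 0.2, 1.0])
puts thresholds.inspect
```

With these defaults a single window of 100% errors moves the threshold only slightly, so a sudden outage should still trip the circuit, while a sustained elevated baseline raises the tolerance gradually.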

Error Handling & Edge Cases

Circuit breakers must handle various failure modes, edge cases, and operational scenarios that occur in production systems.

State Transition Race Conditions

Multiple concurrent requests during state transitions can cause inconsistent behavior. Requests arriving during the Open to Half-Open transition might all attempt test requests simultaneously, overwhelming the recovering service:

class ThreadSafeCircuitBreaker
  def initialize(name, options = {})
    @name = name
    @failure_threshold = options[:failure_threshold] || 5
    @timeout = options[:timeout] || 60
    @max_concurrent_half_open = options[:max_concurrent_half_open] || 1
    
    @state = :closed
    @failure_count = 0
    @opened_at = nil
    @half_open_count = 0
    @mutex = Mutex.new
  end

  def call(&block)
    holds_half_open_slot = false

    @mutex.synchronize do
      check_timeout

      case @state
      when :open
        raise CircuitOpenError
      when :half_open
        if @half_open_count < @max_concurrent_half_open
          @half_open_count += 1
          holds_half_open_slot = true
        else
          raise CircuitOpenError, "Half-open limit reached"
        end
      end
    end

    begin
      result = yield
      handle_success
      result
    rescue => error
      handle_failure
      raise error
    ensure
      # Release based on what this call acquired, not on the current state:
      # handle_success and handle_failure have already moved the circuit out
      # of half-open by the time this runs.
      release_half_open_slot if holds_half_open_slot
    end
  end

  private

  def release_half_open_slot
    # Never drop below zero; check_timeout may have reset the counter meanwhile
    @mutex.synchronize { @half_open_count -= 1 if @half_open_count > 0 }
  end

  def check_timeout
    if @state == :open && Time.now - @opened_at >= @timeout
      @state = :half_open
      @half_open_count = 0
    end
  end

  def handle_success
    @mutex.synchronize do
      @failure_count = 0
      @state = :closed if @state == :half_open
    end
  end

  def handle_failure
    @mutex.synchronize do
      @failure_count += 1
      if @failure_count >= @failure_threshold || @state == :half_open
        @state = :open
        @opened_at = Time.now
      end
    end
  end
end
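
The admission rule for half-open test requests can be isolated into a small sketch. HalfOpenGate is a hypothetical class used only for illustration; it shows the same idea as the half_open branch above, where a mutex-guarded counter lets a fixed number of callers through and rejects the rest.

```ruby
# Minimal sketch of half-open admission control (hypothetical HalfOpenGate class).
class HalfOpenGate
  def initialize(max_probes)
    @max_probes = max_probes
    @in_flight = 0
    @mutex = Mutex.new
  end

  # Returns true if the caller may send a test request, false otherwise.
  def try_acquire
    @mutex.synchronize do
      return false if @in_flight >= @max_probes

      @in_flight += 1
      true
    end
  end

  # Frees a slot once the test request has finished.
  def release
    @mutex.synchronize { @in_flight -= 1 if @in_flight > 0 }
  end
end

gate = HalfOpenGate.new(1)
# Ten threads race for the single slot; nothing is released until all have tried.
results = Array.new(10) { Thread.new { gate.try_acquire } }.map(&:value)
admitted = results.count(true)
puts "admitted=#{admitted} rejected=#{results.count(false)}"
```

Because the check and the increment happen under the same lock, only one of the racing threads is admitted regardless of scheduling order.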

Monitoring and Alerting

Production circuit breakers require comprehensive monitoring to detect issues and tune configuration. State transitions, rejection rates, and recovery patterns indicate system health:

class MonitoredCircuitBreaker
  def initialize(name, options = {})
    @name = name
    @options = options
    @breaker = ProductionCircuitBreaker.new(name, options)
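    # Default assumes the statsd-ruby gem; pass :metrics_client to use another client
    @metrics_client = options[:metrics_client] || Statsd.new('localhost', 8125)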
    @alert_threshold = options[:alert_threshold] || 0.1
    @last_alert_time = nil
    @alert_cooldown = options[:alert_cooldown] || 300
  end

  def call(&block)
    start_time = Time.now
    
    begin
      result = @breaker.call(&block)
      record_success_metrics(Time.now - start_time)
      result
    rescue CircuitOpenError => e
      record_rejection_metrics
      check_alert_conditions
      raise e
    rescue => e
      record_failure_metrics(Time.now - start_time, e.class.name)
      raise e
    end
  end

  def metrics
    @breaker.metrics.merge(rejection_rate: calculate_rejection_rate)
  end

  private

  def record_success_metrics(duration)
    @metrics_client.increment("circuit_breaker.#{@name}.success")
    @metrics_client.timing("circuit_breaker.#{@name}.duration", duration * 1000)
    @metrics_client.gauge("circuit_breaker.#{@name}.state", state_value)
  end

  def record_failure_metrics(duration, error_type)
    @metrics_client.increment("circuit_breaker.#{@name}.failure")
    @metrics_client.increment("circuit_breaker.#{@name}.error.#{error_type}")
    @metrics_client.timing("circuit_breaker.#{@name}.duration", duration * 1000)
  end

  def record_rejection_metrics
    @metrics_client.increment("circuit_breaker.#{@name}.rejected")
    @metrics_client.gauge("circuit_breaker.#{@name}.state", state_value)
  end

  def state_value
    case @breaker.state
    when :closed then 0
    when :half_open then 1
    when :open then 2
    end
  end

  def calculate_rejection_rate
    metrics_data = @breaker.metrics
    total = metrics_data[:request_count]
    return 0.0 if total == 0
    
    rejected = total - (metrics_data[:failure_count] + metrics_data[:success_count])
    rejected.to_f / total
  end

  def check_alert_conditions
    return if recently_alerted?
    
    rejection_rate = calculate_rejection_rate
    
    if rejection_rate >= @alert_threshold
      send_alert(rejection_rate)
      @last_alert_time = Time.now
    end
  end

  def recently_alerted?
    @last_alert_time && Time.now - @last_alert_time < @alert_cooldown
  end

  def send_alert(rejection_rate)
    AlertService.notify(
      severity: 'warning',
      message: "Circuit breaker #{@name} open, rejection rate: #{(rejection_rate * 100).round(2)}%",
      context: @breaker.metrics
    )
  end
end
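
The wrapper above only calls increment, timing, and gauge on its metrics client, so a test can swap in a simple recorder through the metrics_client option. The following is a minimal sketch of such a stand-in; the class name InMemoryMetrics and the "payments" circuit name are hypothetical.

```ruby
# Hypothetical in-memory stand-in for a StatsD-style client, intended for tests.
class InMemoryMetrics
  attr_reader :counters, :timings, :gauges

  def initialize
    @counters = Hash.new(0)
    @timings = Hash.new { |hash, key| hash[key] = [] }
    @gauges = {}
    @mutex = Mutex.new
  end

  # Counts one occurrence of the named event.
  def increment(key)
    @mutex.synchronize { @counters[key] += 1 }
  end

  # Stores a duration sample in milliseconds.
  def timing(key, milliseconds)
    @mutex.synchronize { @timings[key] << milliseconds }
  end

  # Stores the latest value for the named gauge.
  def gauge(key, value)
    @mutex.synchronize { @gauges[key] = value }
  end
end

metrics = InMemoryMetrics.new
metrics.increment("circuit_breaker.payments.success")
2.times { metrics.increment("circuit_breaker.payments.rejected") }
metrics.timing("circuit_breaker.payments.duration", 12.5)
metrics.gauge("circuit_breaker.payments.state", 2) # 2 corresponds to :open in state_value

puts metrics.counters.inspect
```

Assertions in a test suite can then inspect counters, timings, and gauges directly instead of relying on a running metrics daemon.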

Partial Failures and Timeout Handling

Services may respond slowly without failing completely, exhausting resources and triggering cascading failures. Circuit breakers must distinguish between fast failures and timeouts:

require 'timeout'

class TimeoutAwareCircuitBreaker
  def initialize(name, options = {})
    @name = name
    @request_timeout = options[:request_timeout] || 5
    @failure_threshold = options[:failure_threshold] || 5
    @slow_call_threshold = options[:slow_call_threshold] || 3
    @slow_call_duration = options[:slow_call_duration] || 2
    
    @breaker = ProductionCircuitBreaker.new(name, options)
    @slow_call_count = 0
    @mutex = Mutex.new
  end

  def call(&block)
    start_time = Time.now
    
    timeout_result = Timeout.timeout(@request_timeout) do
      @breaker.call(&block)
    end
    
    duration = Time.now - start_time
    check_slow_call(duration)
    
    timeout_result
  rescue Timeout::Error => e
    @mutex.synchronize { @slow_call_count += 1 }
    check_slow_call_threshold
    raise e
  end

  private

  def check_slow_call(duration)
    if duration >= @slow_call_duration
      @mutex.synchronize { @slow_call_count += 1 }
      check_slow_call_threshold
    else
      @mutex.synchronize { @slow_call_count = [@slow_call_count - 1, 0].max }
    end
  end

  def check_slow_call_threshold
    if @slow_call_count >= @slow_call_threshold
      Rails.logger.warn "Circuit #{@name} experiencing slow calls, count: #{@slow_call_count}"
      # Optionally trip circuit based on slow calls
    end
  end
end
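
One way to separate slow calls from outright failures is to track the share of recent calls that exceeded a duration limit. The sketch below is an assumption-laden illustration, not part of the class above: SlowCallTracker, its parameters, and the sample durations are hypothetical. It measures with the monotonic clock, which is not affected by wall-clock adjustments.

```ruby
# Hypothetical tracker: records call durations and reports the share of slow calls.
class SlowCallTracker
  def initialize(slow_call_duration:, sample_size:)
    @slow_call_duration = slow_call_duration
    @sample_size = sample_size
    @durations = []
    @mutex = Mutex.new
  end

  # Times the block and records its duration, even if the block raises.
  def measure
    started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    yield
  ensure
    record(Process.clock_gettime(Process::CLOCK_MONOTONIC) - started)
  end

  # Keeps only the most recent sample_size durations.
  def record(duration)
    @mutex.synchronize do
      @durations << duration
      @durations.shift while @durations.size > @sample_size
    end
  end

  # Fraction of retained calls that took at least slow_call_duration seconds.
  def slow_call_rate
    @mutex.synchronize do
      return 0.0 if @durations.empty?

      @durations.count { |d| d >= @slow_call_duration }.to_f / @durations.size
    end
  end
end

tracker = SlowCallTracker.new(slow_call_duration: 2, sample_size: 4)
[0.1, 2.5, 3.0, 0.2].each { |seconds| tracker.record(seconds) }
puts tracker.slow_call_rate
```

A breaker could compare slow_call_rate against its own limit and trip on sustained slowness, independently of the failure count.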

Real-World Applications

Production deployments of circuit breakers reveal implementation patterns and operational considerations that differ from simplified examples.

Microservices API Gateway

API gateways implement circuit breakers per downstream service, aggregating health across multiple backend endpoints:

class ServiceGateway
  def initialize
    @service_breakers = Hash.new do |hash, service_name|
      hash[service_name] = ProductionCircuitBreaker.new(
        "gateway_#{service_name}",
        failure_threshold: 5,
        timeout: 30,
        volume_threshold: 20
      )
    end
    
    @fallback_strategies = {
      'user_profile' => method(:cached_user_profile),
      'recommendations' => method(:default_recommendations),
      'reviews' => method(:empty_reviews)
    }
  end

  def proxy_request(service_name, path, params = {})
    breaker = @service_breakers[service_name]
    
    breaker.call do
      response = HTTP.timeout(5).get(
        service_url(service_name, path),
        params: params
      )
      
      unless response.status.success?
        raise ServiceError, "Service returned #{response.status}"
      end
      
      JSON.parse(response.body)
    end
  rescue CircuitOpenError => e
    handle_open_circuit(service_name, path, params)
  rescue ServiceError, HTTP::Error => e
    log_service_error(service_name, e)
    raise e
  end

  def health_check
    services = @service_breakers.keys
    
    services.map do |service|
      metrics = @service_breakers[service].metrics
      {
        service: service,
        state: metrics[:state],
        error_rate: metrics[:error_rate],
        last_failure: metrics[:last_failure_time]
      }
    end
  end

  private

  def handle_open_circuit(service_name, path, params)
    fallback = @fallback_strategies[service_name]
    
    if fallback
      Rails.logger.warn "Using fallback for #{service_name}"
      fallback.call(path, params)
    else
      raise ServiceUnavailableError, "#{service_name} circuit is open"
    end
  end

  def cached_user_profile(path, params)
    user_id = params[:user_id]
    Rails.cache.read("user_profile:#{user_id}") || { id: user_id, name: 'Unknown' }
  end

  def default_recommendations(path, params)
    { items: [], fallback: true }
  end

  def empty_reviews(path, params)
    { reviews: [], total: 0, fallback: true }
  end

  def service_url(service_name, path)
    base_url = ENV["#{service_name.upcase}_SERVICE_URL"]
    "#{base_url}#{path}"
  end

  def log_service_error(service_name, error)
    Rails.logger.error("Service error in #{service_name}: #{error.message}")
    ErrorTracker.notify(error, service: service_name)
  end
end

Database Query Protection

Circuit breakers protect applications from database query failures when accessing distributed databases or unreliable replicas:

class ReplicaCircuitBreaker
  def initialize
    # fetch with a default avoids a NoMethodError when the variable is unset
    @replica_breakers = ENV.fetch('DATABASE_REPLICAS', '').split(',').map do |replica_url|
      [
        replica_url,
        ProductionCircuitBreaker.new(
          "replica_#{replica_url}",
          failure_threshold: 3,
          timeout: 60
        )
      ]
    end.to_h
    
    @primary_db = DatabaseConnection.primary
    @mutex = Mutex.new
  end

  def execute_query(sql, params = [])
    available_replicas = find_available_replicas
    
    if available_replicas.any?
      execute_on_replica(available_replicas.first, sql, params)
    else
      Rails.logger.warn "All replicas unavailable, using primary"
      execute_on_primary(sql, params)
    end
  end

  private

  def find_available_replicas
    @replica_breakers.select do |url, breaker|
      breaker.state != :open
    end.keys
  end

  def execute_on_replica(replica_url, sql, params, tried = [])
    breaker = @replica_breakers[replica_url]

    breaker.call do
      connection = DatabaseConnection.connect(replica_url)
      connection.execute(sql, params)
    end
  rescue CircuitOpenError
    # Try the next available replica, skipping every replica already attempted
    # so two rejecting breakers cannot bounce the query back and forth forever.
    tried += [replica_url]
    remaining = find_available_replicas - tried
    if remaining.any?
      execute_on_replica(remaining.first, sql, params, tried)
    else
      execute_on_primary(sql, params)
    end
  end

  def execute_on_primary(sql, params)
    @primary_db.execute(sql, params)
  end
end

Background Job Processing

Circuit breakers prevent job queue saturation when external services fail, deferring jobs until services recover:

class ExternalAPIJob
  include Sidekiq::Worker
  
  sidekiq_options queue: :external_api, retry: 5
  
  def perform(job_type, payload, attempt = 0)
    breaker = CircuitBreakerRegistry.get(job_type)

    breaker.call do
      case job_type
      when 'webhook_delivery'
        deliver_webhook(payload)
      when 'data_sync'
        sync_external_data(payload)
      when 'notification'
        send_external_notification(payload)
      end
    end
  rescue CircuitOpenError
    # Back off exponentially while the circuit is open. The attempt counter is
    # passed as a job argument because sidekiq_options holds static class
    # configuration, not a per-job retry count.
    retry_delay = [2 ** attempt, 3600].min
    self.class.perform_in(retry_delay, job_type, payload, attempt + 1)

    Rails.logger.warn(
      "Job deferred due to open circuit: job_type=#{job_type} " \
      "retry_delay=#{retry_delay} circuit_state=#{breaker.state}"
    )
  rescue => e
    Rails.logger.error("Job failed: job_type=#{job_type} error=#{e.message}")
    raise e # Let Sidekiq's retry mechanism handle
  end

  private

  def deliver_webhook(payload)
    HTTP.timeout(10).post(payload['url'], json: payload['data'])
  end

  def sync_external_data(payload)
    ExternalAPI.sync(payload['entity_type'], payload['entity_id'])
  end

  def send_external_notification(payload)
    NotificationService.send(payload['recipient'], payload['message'])
  end
end

class CircuitBreakerRegistry
  @breakers = {}
  @mutex = Mutex.new
  
  def self.get(name)
    @mutex.synchronize do
      @breakers[name] ||= ProductionCircuitBreaker.new(
        name,
        failure_threshold: 5,
        timeout: 120,
        volume_threshold: 10
      )
    end
  end
  
  def self.all
    @breakers.values
  end
  
  def self.metrics
    @breakers.transform_values(&:metrics)
  end
end
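
The deferral delay used when a job hits an open circuit is a capped exponential backoff. As a quick illustration, the sketch below tabulates the delays for successive attempts; the helper name deferral_delay is hypothetical, and the base of 2 and cap of 3600 seconds mirror the values in the job above.

```ruby
# Hypothetical helper: capped exponential backoff for jobs deferred by an open circuit.
def deferral_delay(attempt, base: 2, cap: 3600)
  [base**attempt, cap].min
end

delays = (0..12).map { |attempt| deferral_delay(attempt) }
puts delays.inspect
```

The delay doubles on each deferral until it reaches the cap, so a job that keeps finding the circuit open is rescheduled at most once per hour.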

Reference

Circuit Breaker States

| State | Behavior | Transition Condition |
|-------|----------|----------------------|
| Closed | All requests pass through to remote service | Failure count exceeds threshold |
| Open | All requests fail immediately without calling service | Timeout period expires |
| Half-Open | Limited test requests allowed | Successful tests close circuit, failures reopen |

Configuration Parameters

| Parameter | Description | Typical Value |
|-----------|-------------|---------------|
| failure_threshold | Number of failures before opening circuit | 3-10 failures |
| error_rate_threshold | Percentage of failed requests before opening | 0.3-0.7 (30-70%) |
| timeout | Duration circuit remains open before testing recovery | 30-120 seconds |
| volume_threshold | Minimum requests before evaluating error rate | 10-50 requests |
| success_threshold | Successful tests required to close circuit | 2-3 successes |
| window_duration | Time window for tracking failures | 30-120 seconds |
| half_open_limit | Maximum concurrent test requests in half-open state | 1-5 requests |
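
These parameters are commonly externalized. The sketch below shows one possible way to load them from environment variables with defaults taken from the typical ranges above. The function name load_breaker_config, the BREAKER_DEFAULTS constant, and the CIRCUIT_BREAKER_ prefix are assumptions made for illustration.

```ruby
# Hypothetical defaults, chosen from within the typical ranges listed above.
BREAKER_DEFAULTS = {
  failure_threshold: 5,
  error_rate_threshold: 0.5,
  timeout: 60,
  volume_threshold: 20,
  success_threshold: 2,
  window_duration: 60,
  half_open_limit: 1
}.freeze

# Reads settings such as CIRCUIT_BREAKER_TIMEOUT from env, falling back to defaults.
def load_breaker_config(env, prefix = "CIRCUIT_BREAKER")
  BREAKER_DEFAULTS.each_with_object({}) do |(key, default), config|
    raw = env["#{prefix}_#{key.to_s.upcase}"]
    # Keep the default's numeric type: Float for rates, Integer for counts and durations.
    # Integer() and Float() raise ArgumentError on malformed input.
    config[key] =
      if raw.nil?
        default
      elsif default.is_a?(Float)
        Float(raw)
      else
        Integer(raw)
      end
  end
end

# A plain Hash stands in for ENV here so the example is reproducible.
config = load_breaker_config(
  "CIRCUIT_BREAKER_TIMEOUT" => "30",
  "CIRCUIT_BREAKER_ERROR_RATE_THRESHOLD" => "0.4"
)
puts config.inspect
```

In an application the first argument would be ENV itself. Strict conversion means a malformed value fails at boot rather than silently producing a zero threshold.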

Implementation Checklist

| Component | Requirement | Implementation |
|-----------|-------------|----------------|
| State Management | Thread-safe state transitions | Use Mutex or atomic operations |
| Failure Tracking | Time-windowed failure counting | Sliding or tumbling window |
| Timeout Handling | Automatic transition to half-open | Background thread or lazy evaluation |
| Metrics Collection | State changes and request outcomes | Integrate with monitoring system |
| Fallback Strategy | Graceful degradation when circuit open | Cache, defaults, or degraded mode |
| Configuration | Externalized threshold values | Environment variables or config files |
| Testing | Simulate failures and recovery | Mock failures, test state transitions |
| Monitoring | Alert on circuit state changes | Dashboard and alerting integration |

Common Error Types

| Error Type | Weight | Rationale |
|------------|--------|-----------|
| Timeout | High | Indicates resource exhaustion or unresponsive service |
| Connection Failed | High | Service unavailable or network issues |
| HTTP 408 | High | Request timeout indicates performance degradation |
| HTTP 5xx | Medium | Server errors that may be transient |
| HTTP 429 | Medium | Rate limit exceeded, service overloaded |
| Other HTTP 4xx | Low | Client errors typically don't indicate service health |
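
The weighting above can feed a weighted failure score instead of a plain failure count. The following sketch is one possible mapping; the numeric weights (1.0, 0.5, 0.0) and the helper name failure_weight are assumptions, since the table only ranks errors as High, Medium, or Low.

```ruby
require "timeout"

# Numeric values are an assumption; the table only ranks weights qualitatively.
WEIGHTS = { high: 1.0, medium: 0.5, low: 0.0 }.freeze

# Maps an exception or an HTTP status code to a failure weight.
def failure_weight(error_or_status)
  case error_or_status
  when Timeout::Error, Errno::ECONNREFUSED then WEIGHTS[:high]
  when 408 then WEIGHTS[:high]     # checked before the generic 4xx range
  when 429 then WEIGHTS[:medium]   # checked before the generic 4xx range
  when 500..599 then WEIGHTS[:medium]
  when 400..499 then WEIGHTS[:low]
  else 0.0
  end
end

observed = [Timeout::Error.new, 503, 404, 429, 408]
score = observed.sum { |item| failure_weight(item) }
puts score
```

The specific status codes are matched before the broader ranges, so 408 and 429 are not swallowed by the low-weight 4xx branch.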

Ruby Gems and Libraries

| Gem | Description | Use Case |
|-----|-------------|----------|
| circuitbox | Production-ready circuit breaker with Faraday integration | HTTP client protection |
| stoplight | Redis-backed circuit breaker with custom error handlers | Distributed systems |
| semian | Resource protection including circuit breakers and bulkheads | Database and service protection |
| resilient | Flexible circuit breaker with multiple failure detection strategies | Custom implementations |

Metric Definitions

| Metric | Calculation | Purpose |
|--------|-------------|---------|
| Error Rate | failures / total_requests | Determine when to open circuit |
| Rejection Rate | rejected / total_requests | Monitor impact of open circuit |
| Mean Time to Recovery | sum(open_duration) / open_count | Optimize timeout configuration |
| False Positive Rate | unnecessary_trips / total_trips | Tune failure thresholds |
| State Duration | time_in_state | Analyze state transition patterns |
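
As a rough illustration of these calculations, the sketch below derives three of the metrics from raw counters. The Counters struct and the sample numbers are hypothetical, and it assumes total_requests includes rejected requests, which the table does not specify.

```ruby
# Hypothetical counters; in practice these would come from the breaker's metrics store.
Counters = Struct.new(:successes, :failures, :rejected, :open_durations, keyword_init: true) do
  # Assumption: rejected requests count toward the total.
  def total_requests
    successes + failures + rejected
  end

  # failures / total_requests
  def error_rate
    total_requests.zero? ? 0.0 : failures.to_f / total_requests
  end

  # rejected / total_requests
  def rejection_rate
    total_requests.zero? ? 0.0 : rejected.to_f / total_requests
  end

  # sum(open_duration) / open_count, in seconds
  def mean_time_to_recovery
    open_durations.empty? ? 0.0 : open_durations.sum.to_f / open_durations.size
  end
end

counters = Counters.new(successes: 70, failures: 20, rejected: 10, open_durations: [30, 60, 90])
puts "error_rate=#{counters.error_rate}"
puts "rejection_rate=#{counters.rejection_rate}"
puts "mttr=#{counters.mean_time_to_recovery}"
```

Guarding the divisions against empty inputs keeps dashboards from showing NaN for a circuit that has not yet seen traffic.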

Monitoring Queries

| Query | Implementation |
|-------|----------------|
| Circuit state distribution | Group by state, count occurrences |
| Rejection rate trend | Calculate rejected / total over time |
| Frequent circuit trips | Count open transitions per circuit |
| Recovery success rate | Calculate closed / half_open transitions |
| Service availability | Calculate uptime from circuit states |