Overview
Chaos Engineering originated at Netflix in 2011 with the creation of Chaos Monkey, a tool that randomly terminates virtual machine instances in production. The practice emerged from the recognition that traditional testing approaches fail to identify systemic weaknesses in distributed systems. Complex systems exhibit emergent behaviors that only manifest under production conditions with real traffic patterns, network topologies, and component interactions.
The discipline addresses a fundamental challenge in modern software development: systems grow increasingly complex while failures become increasingly expensive. A distributed system with dozens of microservices, multiple databases, caching layers, message queues, and external dependencies creates thousands of potential failure scenarios. Testing each scenario individually becomes impractical, and many failure modes only emerge from the interaction between multiple components under specific conditions.
Chaos Engineering inverts the traditional testing paradigm. Instead of attempting to prove a system works correctly, practitioners attempt to discover how it fails. The approach assumes failures will occur and seeks to identify weaknesses proactively rather than reactively during incidents. Organizations conduct controlled experiments in production, observing how systems respond to deliberate disruptions while maintaining acceptable service levels.
The practice differs from traditional fault injection testing in several key aspects. Chaos experiments run continuously in production environments rather than isolated test environments. They target entire systems rather than individual components. The experiments measure real business metrics rather than synthetic test assertions. Teams halt experiments immediately if they detect customer impact, treating each experiment as a learning opportunity rather than a pass-fail test.
Consider a financial services application handling payment processing. Traditional testing might verify that the payment service correctly processes transactions and handles database connection failures. Chaos Engineering would terminate random payment service instances during peak traffic, introduce latency into database queries, and corrupt network packets between services to observe whether the system maintains transaction integrity, preserves customer data, and provides appropriate error messages. The experiments reveal whether circuit breakers activate correctly, whether retries happen with appropriate backoff, and whether the system degrades gracefully rather than letting failures cascade across services.
Key Principles
The Principles of Chaos Engineering, formalized by Netflix engineers, define five core tenets that guide the practice. These principles establish the theoretical foundation and differentiate chaos engineering from random disruption.
Build a Hypothesis Around Steady State Behavior
Chaos experiments begin by defining steady state as measurable output that indicates normal system operation. The steady state represents business metrics rather than internal system metrics. For an e-commerce platform, steady state might measure successful checkout completion rate, average page load time, and search result accuracy. The hypothesis predicts that steady state will continue during chaos experiments. Teams avoid focusing on internal metrics like CPU utilization or memory consumption unless they directly correlate with customer experience.
Defining steady state requires understanding what constitutes acceptable system behavior under normal conditions. A video streaming service might define steady state as 99.9% of streams starting within 3 seconds with less than 0.1% buffering events per hour. The definition must be specific enough to detect degradation yet tolerant enough to accommodate normal variations in system performance.
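A steady-state definition like the streaming example above can be captured as a small predicate over observed metrics. The sketch below is purely illustrative: the class, metric names, and thresholds are hypothetical, not taken from any particular tool.

```ruby
# Hypothetical steady-state definition for the streaming example:
# 99.9% of streams start fast, under 0.1% buffering events per hour.
class SteadyState
  def initialize(min_fast_start_rate: 0.999, max_buffer_events_per_hour: 0.001)
    @min_fast_start_rate = min_fast_start_rate
    @max_buffer_events_per_hour = max_buffer_events_per_hour
  end

  # metrics is a hash such as:
  #   { fast_start_rate: 0.9995, buffer_events_per_hour: 0.0004 }
  def holds?(metrics)
    metrics[:fast_start_rate] >= @min_fast_start_rate &&
      metrics[:buffer_events_per_hour] <= @max_buffer_events_per_hour
  end
end

steady = SteadyState.new
steady.holds?(fast_start_rate: 0.9995, buffer_events_per_hour: 0.0004) # => true
steady.holds?(fast_start_rate: 0.98, buffer_events_per_hour: 0.0004)   # => false
```

The hypothesis of a chaos experiment then becomes a single testable statement: `holds?` stays true while failures are injected.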
Vary Real-World Events
Chaos experiments simulate failures that occur in production systems. Hardware failures, network partitions, traffic spikes, dependency outages, configuration errors, and resource exhaustion all represent real scenarios. The events selected for experiments come from post-incident reviews, monitoring data, and production failure patterns. Teams prioritize scenarios based on historical frequency and potential customer impact.
Real-world events include both sudden failures and gradual degradations. A database server might fail instantly, or it might slowly degrade as disk I/O performance decreases. Network latency might spike suddenly or creep upward over hours. Memory leaks cause gradual resource exhaustion. Chaos experiments model these patterns rather than only simulating catastrophic instant failures.
The scope of events extends beyond infrastructure failures. Application-level failures matter equally. Chaos experiments might introduce bugs in code paths, corrupt data in caches, return incorrect results from dependencies, or simulate third-party API changes. State-based failures where the system enters invalid states due to timing issues or race conditions represent particularly valuable experiment targets.
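Gradual degradation, as described above, can be modelled by ramping an injected fault over time rather than applying it at full strength immediately. This is a minimal sketch under stated assumptions; the class name and parameters are hypothetical.

```ruby
# Illustrative model of gradual degradation: injected latency that creeps
# upward linearly over the experiment window instead of spiking at once.
class LatencyRamp
  def initialize(start_ms:, end_ms:, duration_s:)
    @start_ms = start_ms
    @end_ms = end_ms
    @duration_s = duration_s.to_f
  end

  # Latency to inject at +elapsed_s+ seconds into the experiment,
  # interpolated linearly and capped at the final value.
  def latency_ms(elapsed_s)
    fraction = [elapsed_s / @duration_s, 1.0].min
    @start_ms + (@end_ms - @start_ms) * fraction
  end
end

ramp = LatencyRamp.new(start_ms: 10, end_ms: 500, duration_s: 3600)
ramp.latency_ms(0)    # => 10.0
ramp.latency_ms(1800) # => 255.0
ramp.latency_ms(7200) # => 500.0 (capped after the ramp completes)
```

An injection middleware or proxy would consult such a ramp on each request, so the system experiences the slow-degradation pattern that catastrophic instant failures cannot reveal.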
Run Experiments in Production
Production environments exhibit complexity that cannot be replicated in test environments. Real traffic patterns, actual data volumes, genuine network topologies, authentic security controls, and production deployment configurations create emergent behaviors. Running experiments in production provides the only reliable method to verify system behavior under actual operating conditions.
The production requirement creates tension with risk management. Organizations balance the need for realistic experiments against the potential for customer impact. Progressive rollout strategies mitigate risk. Experiments begin with small percentages of traffic, limited geographic regions, or specific customer segments. Teams monitor business metrics continuously and implement automated halt conditions that stop experiments when impact exceeds acceptable thresholds.
Production experiments require sophisticated control mechanisms. Teams need the ability to target specific components, limit the blast radius, and instantly revert changes. The experiment infrastructure itself must be more reliable than the systems under test to prevent chaos tooling from becoming a source of outages.
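An automated halt condition of the kind described above can be as simple as an error-rate budget evaluated continuously for the experiment cohort. The sketch below is hypothetical; real platforms evaluate richer signals, but the shape is the same.

```ruby
# Hedged sketch of an automated halt condition: stop the experiment as
# soon as the observed error rate for the cohort exceeds a budget.
class HaltCondition
  def initialize(max_error_rate: 0.02, min_samples: 100)
    @max_error_rate = max_error_rate
    @min_samples = min_samples
    @errors = 0
    @total = 0
  end

  def record(success)
    @total += 1
    @errors += 1 unless success
  end

  # Only halt once enough samples exist to avoid noise from early requests.
  def halt?
    @total >= @min_samples && @errors.fdiv(@total) > @max_error_rate
  end
end
```

The experiment runner would call `record` for each observed request and abort (and roll back the injected fault) the first time `halt?` returns true.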
Automate Experiments to Run Continuously
Manual chaos engineering fails to scale and becomes inconsistent. Systems change daily through deployments, configuration updates, dependency upgrades, and infrastructure modifications. Each change potentially introduces new failure modes. Automated continuous experimentation catches regressions and validates that recent changes maintain resilience properties.
Automation enables coverage across the entire system surface. A manually conducted experiment might test a specific failure scenario once per quarter. Automated experiments run the same scenario daily across all environments, services, and regions. The continuous execution builds confidence that the system maintains resilience properties over time rather than only at the moment of manual testing.
The automation encompasses experiment design, execution, monitoring, and analysis. Systems automatically select scenarios based on production traffic patterns and recent changes. They execute experiments according to schedules that avoid maintenance windows and high-traffic periods. Automated analysis compares steady state metrics before and during experiments, flagging anomalies for human review.
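The before/during comparison step can be sketched as a small function that flags any metric degrading beyond a tolerance. Everything here is illustrative: it assumes metrics where higher is better, and the names are invented for the example.

```ruby
# Illustrative automated analysis: compare steady-state metrics captured
# before and during an experiment, returning the metrics whose relative
# drop exceeds a tolerance. Assumes higher metric values are better.
def flag_anomalies(baseline, during, tolerance: 0.05)
  baseline.keys.select do |metric|
    drop = (baseline[metric] - during[metric]) / baseline[metric].to_f
    drop > tolerance
  end
end

baseline = { checkout_rate: 0.99, search_success: 0.98 }
during   = { checkout_rate: 0.90, search_success: 0.97 }
flag_anomalies(baseline, during) # => [:checkout_rate]
```

Flagged metrics would then be routed to human review rather than automatically failing the experiment.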
Minimize Blast Radius
Chaos experiments must not cause outages. The practice exists to prevent customer-impacting failures, not create them. Teams design experiments to contain potential damage while still providing valuable insights. Blast radius limitations include targeting specific instances rather than entire clusters, limiting experiments to a subset of traffic, implementing rapid rollback mechanisms, and defining clear halt conditions.
Minimizing blast radius requires careful experiment design. An experiment testing database failover should affect one database replica rather than the primary. Network latency injection should target a small percentage of requests. Instance terminations should respect minimum capacity requirements. Each experiment includes explicit boundaries that prevent catastrophic failures even if the system responds unexpectedly.
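The minimum-capacity rule for instance terminations can be enforced by a guard that the chaos tooling consults before every kill. This is a minimal sketch; the class name and inventory inputs are hypothetical.

```ruby
# Sketch of a blast-radius guard: terminations are refused when they
# would drop a cluster below its minimum healthy capacity.
class BlastRadiusGuard
  def initialize(min_healthy:)
    @min_healthy = min_healthy
  end

  # healthy_count is the number of instances currently in service.
  def allow_termination?(healthy_count, to_terminate = 1)
    healthy_count - to_terminate >= @min_healthy
  end
end

guard = BlastRadiusGuard.new(min_healthy: 3)
guard.allow_termination?(5)    # => true
guard.allow_termination?(3)    # => false
guard.allow_termination?(5, 3) # => false
```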
The principle extends to organizational boundaries. Teams own experiments affecting their services. Cross-team experiments require coordination and approval. Shared infrastructure experiments need particularly careful scoping to avoid impacting multiple teams simultaneously.
Design Considerations
Organizations face several decisions when implementing chaos engineering programs. The choices affect experiment effectiveness, operational overhead, and organizational adoption.
Experiment Scope Selection
Teams must decide whether to start with infrastructure-level experiments or application-level experiments. Infrastructure experiments like instance termination and network failures provide quick value with minimal code changes. Application-level experiments testing business logic failures and data corruption require deeper integration but reveal more sophisticated failure modes. Starting with infrastructure experiments builds confidence and organizational support before expanding to application-level testing.
The decision between broad system-wide experiments and narrow component-specific experiments presents another tradeoff. System-wide experiments reveal emergent behaviors and cross-service dependencies but create complex failure scenarios that make root cause analysis difficult. Component-specific experiments provide clear causal relationships but might miss interactions between components. Mature chaos engineering programs employ both approaches: component experiments verify individual service resilience while system experiments validate end-to-end behavior.
Production vs Non-Production Environments
Running experiments exclusively in production maximizes realism but maximizes risk. Organizations with low risk tolerance or strict compliance requirements might begin chaos engineering in production-like staging environments. This approach reduces risk but also reduces insights since staging rarely matches production complexity. Systems behave differently under synthetic load compared to real traffic patterns. Dependencies configured in staging might not match production. The staging approach works as a stepping stone but should progress toward production experiments as teams build confidence.
Some organizations adopt a hybrid model where infrastructure experiments run in production while application experiments run in staging. Another approach uses production experiments with limited scope, affecting only internal users or specific customer segments. The key decision factor is whether the organization values realism over risk minimization.
Manual vs Automated Execution
Manual chaos engineering requires less initial investment and provides more control during experiments. Teams manually trigger experiments, observe system behavior, and make real-time decisions about continuing or halting. This approach works well for learning the practice and understanding system behavior but doesn't scale. Manual experiments happen infrequently, cover limited scenarios, and depend on individual expertise.
Automated chaos engineering requires significant upfront investment in tooling and monitoring but provides continuous validation. Systems automatically select scenarios, execute experiments, evaluate results, and alert on anomalies. The automation enables coverage across all services and scenarios while requiring less human involvement. Organizations typically begin with manual experiments to understand failure modes before automating the most valuable scenarios.
Metrics and Observability Requirements
Effective chaos engineering depends on comprehensive observability. Systems need metrics that accurately reflect customer experience, detailed enough to identify specific failure modes, and real-time enough to halt experiments quickly. Organizations must decide whether to use existing monitoring or build chaos-specific observability.
Existing monitoring provides immediate value but might lack the granularity needed for experiment analysis. A single "requests per second" metric doesn't reveal whether errors affect specific endpoints or customer segments. Chaos-specific observability adds metrics like "successful checkout rate for experiment participants" but increases complexity. The decision depends on the maturity of existing monitoring and the organization's ability to maintain additional tooling.
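Chaos-specific observability of the "successful checkout rate for experiment participants" kind boils down to tagging each request with a cohort and aggregating per cohort. The sketch below is hypothetical and in-memory; a real system would emit tagged metrics to its monitoring backend.

```ruby
# Hypothetical sketch of cohort-tagged observability: requests carry an
# experiment cohort so success rates can be compared between experiment
# participants and the control group.
class CohortMetrics
  def initialize
    @counts = Hash.new { |h, k| h[k] = { success: 0, total: 0 } }
  end

  def record(cohort, success)
    bucket = @counts[cohort]
    bucket[:total] += 1
    bucket[:success] += 1 if success
  end

  def success_rate(cohort)
    bucket = @counts[cohort]
    return nil if bucket[:total].zero?
    bucket[:success].fdiv(bucket[:total])
  end
end

metrics = CohortMetrics.new
metrics.record(:experiment, true)
metrics.record(:experiment, false)
metrics.record(:control, true)
metrics.success_rate(:experiment) # => 0.5
metrics.success_rate(:control)    # => 1.0
```

A gap between the two rates is exactly the signal that a single aggregate "requests per second" metric cannot provide.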
Organizational Readiness
Chaos engineering requires cultural and technical prerequisites. The organization needs incident response processes, blameless postmortems, on-call rotations, and monitoring infrastructure before beginning chaos experiments. Teams must have time allocated for resilience work rather than only feature development. Engineering leadership must support learning from failures rather than punishing them.
Technical readiness includes deployment automation, feature flags, circuit breakers, retry logic, and graceful degradation mechanisms. Organizations lacking these capabilities should build them before starting chaos engineering since experiments will immediately reveal their absence. Starting chaos engineering too early creates frustration when experiments find problems teams lack the tools to fix.
Implementation Approaches
Chaos engineering programs can follow several implementation strategies, each with different timelines, resource requirements, and organizational impacts.
Game Days
Game days schedule regular chaos exercises where teams manually inject failures and observe system behavior. These exercises typically run monthly or quarterly, involve multiple teams, and last several hours. Game days provide structured learning opportunities, build team coordination, and create shared understanding of system dependencies.
The game day structure begins with planning where teams select scenarios based on recent incidents or identified risks. During execution, one team injects failures while other teams monitor their services and coordinate responses. Post-game analysis reviews what worked, what failed, and what improvements the organization needs. Game days work well for organizations beginning chaos engineering since they provide controlled environments for learning without requiring automated tooling.
Game days have limitations. They happen infrequently, cover limited scenarios, and require significant coordination overhead. Systems might behave differently during scheduled exercises compared to unexpected failures. Teams prepare specifically for game days rather than maintaining continuous readiness. Game days serve as stepping stones toward continuous automated chaos engineering rather than long-term solutions.
Steady-State Automated Testing
This approach integrates chaos experiments into continuous integration pipelines. Each deployment triggers automated chaos tests that verify the new version maintains resilience properties. Tests might terminate instances of the new service version, introduce latency to dependencies, or simulate downstream failures. The deployment proceeds only if chaos tests pass.
Steady-state testing provides rapid feedback on resilience regressions. Developers learn immediately when code changes break circuit breakers or introduce cascading failures. The approach prevents resilience problems from reaching production. However, it shares limitations of all pre-production testing: the test environment differs from production, synthetic load differs from real traffic, and the full system complexity might not exist in the test environment.
Organizations implementing this approach need investment in test environment infrastructure and automation. The test environment must be complex enough to reveal meaningful failure modes yet cost-effective enough to run continuously. The automation must execute experiments reliably and produce deterministic results to avoid flaky tests that block deployments.
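A deployment-gating chaos test can be reduced to a deterministic check: drive traffic through the candidate while injecting failures, then pass or fail on the observed success rate. The sketch below is a toy under stated assumptions (a simulated caller with one retry, a seeded random source for determinism); a real gate would exercise the deployed service itself.

```ruby
# Illustrative pipeline chaos gate: a fraction of calls fail, the caller
# retries once, and the deployment proceeds only if the observed success
# rate stays above a threshold. Seeded RNG keeps the test deterministic.
def chaos_gate(attempts: 1000, failure_rate: 0.2, threshold: 0.95, rng: Random.new(42))
  successes = 0
  attempts.times do
    # With one retry, a call succeeds if either attempt avoids the
    # injected failure (expected success rate: 1 - failure_rate**2).
    first_ok  = rng.rand >= failure_rate
    second_ok = rng.rand >= failure_rate
    successes += 1 if first_ok || second_ok
  end
  rate = successes.fdiv(attempts)
  { success_rate: rate, pass: rate >= threshold }
end
```

Seeding the random source is what keeps such a gate from becoming the kind of flaky test that blocks deployments.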
Continuous Production Verification
Continuous production verification runs chaos experiments constantly in production environments. The system automatically selects scenarios, executes experiments at regular intervals, evaluates results, and alerts on anomalies. This approach provides continuous validation that the system maintains resilience properties despite daily changes.
Implementation requires sophisticated automation that selects appropriate scenarios based on recent deployments, production traffic patterns, and historical experiments. The system must implement blast radius controls, automated halt conditions, and result analysis. Organizations need mature incident response processes since experiments occasionally reveal real problems requiring immediate attention.
The continuous approach provides maximum confidence in system resilience. Teams detect regressions within hours rather than weeks. The frequent experiments normalize failure injection, making teams more comfortable with operational uncertainty. However, the approach requires significant investment in automation tooling and organizational commitment to resilience.
Fault Injection at the Service Level
Service-level fault injection embeds chaos capabilities directly into application code rather than using external tools. Services include configuration-driven fault injection that introduces errors, latency, or resource constraints based on runtime parameters. Teams activate injection through feature flags, enabling experiments without deploying new code.
This approach provides fine-grained control over failure modes. Application code can simulate specific error conditions, corrupt data structures, or introduce business logic failures that external tools cannot create. The injection code lives alongside the application code, making it easy to maintain and update. Teams can inject failures at specific code paths rather than affecting entire services.
The disadvantage is maintenance overhead. Each service needs fault injection code. The injection logic can become complex and introduce bugs. Teams must remember to activate and deactivate injection, or build automation to manage it. Services accumulate injection code that never runs in normal operation, increasing testing burden.
Tools & Ecosystem
The chaos engineering ecosystem includes commercial platforms, open-source tools, cloud provider services, and language-specific libraries. Selection depends on infrastructure environment, experiment sophistication, and budget.
Commercial Platforms
Gremlin provides a comprehensive chaos engineering platform supporting AWS, Azure, Google Cloud, and Kubernetes environments. The platform offers pre-built attack types including instance termination, resource exhaustion, network failures, and state manipulation. Gremlin includes blast radius controls, automated halt conditions, and integrated metrics analysis. The commercial platform reduces implementation time but requires budget allocation and vendor lock-in acceptance.
Harness Chaos Engineering (formerly ChaosNative Litmus) offers both open-source and commercial options. The platform emphasizes Kubernetes-native chaos engineering with extensive support for container orchestration failures. Teams can define chaos experiments as YAML manifests, integrating experiments into GitOps workflows. The commercial tier adds scheduled experiments, centralized reporting, and enterprise support.
Open Source Tools
Chaos Monkey, Netflix's original tool, remains widely used. It randomly terminates instances in AWS, Azure, or Google Cloud according to configurable schedules and rules, and integrates with Spinnaker for deployment coordination. The tool provides basic instance termination but lacks sophisticated failure injection capabilities.
Chaos Toolkit offers language-agnostic chaos engineering through extensions and drivers. Teams define experiments in JSON or YAML describing steady state probes, experimental actions, and rollback procedures. The tool supports numerous platforms through drivers including Kubernetes, AWS, Azure, and Google Cloud. Chaos Toolkit emphasizes experiment reproducibility and version control integration.
Toxiproxy, created by Shopify, specializes in network failure injection. The proxy sits between services and introduces latency, bandwidth throttling, connection errors, and data corruption. Toxiproxy works particularly well for testing application resilience to dependency failures. The tool provides programmatic control through HTTP API or client libraries in multiple languages.
Pumba focuses on Docker and Kubernetes chaos engineering. The tool can kill, pause, or temporarily stop containers, and can introduce network delays, packet loss, bandwidth limits, and packet corruption. Pumba supports randomized and scheduled experiments, making it suitable for both manual and automated chaos engineering.
Cloud Provider Services
AWS Fault Injection Simulator provides managed chaos engineering for AWS services. The service supports injecting failures into EC2 instances, ECS tasks, EKS pods, and RDS databases. Experiments target specific resources using tags or resource identifiers. The service includes built-in templates for common scenarios and integrates with AWS monitoring services. The managed approach reduces operational overhead but limits experiments to AWS environments.
Azure Chaos Studio offers similar capabilities for Azure resources. The platform supports experiments on virtual machines, Kubernetes clusters, and Azure services. Chaos Studio emphasizes integration with Azure Monitor and Application Insights for experiment analysis. Microsoft provides pre-defined experiments and supports custom experiments through extensibility mechanisms.
Ruby-Specific Libraries
The chaos_rb gem provides Ruby applications with built-in chaos engineering capabilities. The library supports error injection, latency injection, and resource exhaustion simulation. Applications configure chaos experiments through environment variables or configuration files, enabling activation without code changes. The gem integrates with popular Ruby web frameworks including Rails and Sinatra.
Semian, created by Shopify, implements circuit breakers and bulkheading for Ruby applications. While not strictly a chaos engineering tool, Semian enables testing of resilience patterns by allowing controlled activation of circuit breakers and resource limits. The library protects applications from cascading failures and dependency overload.
Kubernetes Chaos Engineering
Chaos Mesh provides comprehensive chaos engineering for Kubernetes clusters. The tool injects failures at multiple levels including pod failures, network chaos, stress testing, and file system failures. Teams define chaos experiments as Kubernetes custom resources, integrating experiments into cluster management workflows. Chaos Mesh includes a web interface for experiment management and visualization.
PowerfulSeal tests Kubernetes cluster resilience by killing pods, draining nodes, and introducing network issues. The tool supports policy-driven chaos where teams define rules for automated chaos injection. PowerfulSeal emphasizes testing cluster resilience to node failures and pod rescheduling scenarios.
Ruby Implementation
Ruby applications can implement chaos engineering through built-in fault injection, external tool integration, or framework-specific patterns. The implementation approach depends on application architecture and infrastructure environment.
Built-in Fault Injection with Middleware
Rack middleware provides a natural integration point for fault injection in Ruby web applications. The middleware intercepts requests and introduces failures based on configuration:
require 'zlib' # stable CRC32 digest for consistent request sampling

class ChaosMiddleware
  def initialize(app, config = {})
    @app = app
    @error_rate = config.fetch(:error_rate, 0.0)
    @latency_rate = config.fetch(:latency_rate, 0.0)
    @latency_ms = config.fetch(:latency_ms, 100)
    @enabled = config.fetch(:enabled, false)
  end

  def call(env)
    return @app.call(env) unless @enabled
    return @app.call(env) unless chaos_enabled_for_request?(env)

    inject_latency if should_inject_latency?
    inject_error if should_inject_error?

    @app.call(env)
  end

  private

  def chaos_enabled_for_request?(env)
    # Never inject chaos into health checks
    path = env['PATH_INFO']
    return false if path.start_with?('/health')

    # Enable for 10% of requests based on request ID. A stable digest
    # (CRC32) is used instead of String#hash, which is seeded per process
    # and would select a different 10% in every worker.
    request_id = env['HTTP_X_REQUEST_ID']
    return false unless request_id

    Zlib.crc32(request_id) % 100 < 10
  end

  def should_inject_latency?
    rand < @latency_rate
  end

  def should_inject_error?
    rand < @error_rate
  end

  def inject_latency
    sleep(@latency_ms / 1000.0)
  end

  def inject_error
    raise StandardError, "Chaos engineering error injection"
  end
end
The middleware configuration happens in the application initialization:
# config.ru or application configuration
chaos_config = {
  enabled: ENV['CHAOS_ENABLED'] == 'true',
  error_rate: ENV.fetch('CHAOS_ERROR_RATE', '0.01').to_f,
  latency_rate: ENV.fetch('CHAOS_LATENCY_RATE', '0.05').to_f,
  latency_ms: ENV.fetch('CHAOS_LATENCY_MS', '200').to_i
}
use ChaosMiddleware, chaos_config
Dependency Chaos Injection
Applications can inject failures into external dependencies using HTTP client wrappers:
require 'timeout' # Timeout::Error used for simulated timeouts

class ChaosHTTPClient
  def initialize(client, config = {})
    @client = client
    @enabled = config.fetch(:enabled, false)
    @timeout_rate = config.fetch(:timeout_rate, 0.0)
    @error_rate = config.fetch(:error_rate, 0.0)
    @latency_rate = config.fetch(:latency_rate, 0.0)
    @latency_ms = config.fetch(:latency_ms, 100)
  end

  def get(url, options = {})
    execute_with_chaos(:get, url, options)
  end

  def post(url, body, options = {})
    execute_with_chaos(:post, url, body, options)
  end

  private

  def execute_with_chaos(method, *args)
    return @client.send(method, *args) unless @enabled

    inject_latency if rand < @latency_rate
    raise Timeout::Error, "Chaos timeout" if rand < @timeout_rate
    raise StandardError, "Chaos error" if rand < @error_rate

    @client.send(method, *args)
  end

  def inject_latency
    sleep(@latency_ms / 1000.0)
  end
end
# Usage in application
http_client = ChaosHTTPClient.new(
  Faraday.new,
  enabled: ENV['CHAOS_HTTP_ENABLED'] == 'true',
  timeout_rate: 0.01,
  error_rate: 0.02,
  latency_rate: 0.05,
  latency_ms: 300
)
Database Connection Chaos
Testing database resilience requires injecting failures at the connection level:
# PG::ConnectionBad comes from the pg gem; substitute the driver's
# connection error class if using a different database.
class ChaosDatabase
  def initialize(connection_pool, config = {})
    @pool = connection_pool
    @enabled = config.fetch(:enabled, false)
    @disconnect_rate = config.fetch(:disconnect_rate, 0.0)
    @slow_query_rate = config.fetch(:slow_query_rate, 0.0)
    @slow_query_ms = config.fetch(:slow_query_ms, 1000)
  end

  def execute(sql)
    return @pool.execute(sql) unless @enabled

    if rand < @disconnect_rate
      raise PG::ConnectionBad, "Chaos connection failure"
    end
    if rand < @slow_query_rate
      sleep(@slow_query_ms / 1000.0)
    end

    @pool.execute(sql)
  end

  def transaction(&block)
    return @pool.transaction(&block) unless @enabled

    if rand < @disconnect_rate
      raise PG::ConnectionBad, "Chaos transaction failure"
    end

    @pool.transaction(&block)
  end
end
Circuit Breaker Testing
Applications using circuit breakers can inject failures to verify breaker behavior:
require 'semian'

# Configure circuit breaker with Semian
Semian.register(
  :payment_service,
  tickets: 10,
  timeout: 0.5,
  error_threshold: 3,
  error_timeout: 10,
  success_threshold: 2
)

class PaymentService
  def process_payment(amount)
    Semian[:payment_service].acquire do
      # Inject chaos to trigger circuit breaker
      if chaos_enabled? && rand < chaos_error_rate
        raise StandardError, "Chaos payment error"
      end
      make_payment_api_call(amount)
    end
  rescue Semian::OpenCircuitError => e
    # Circuit breaker opened due to failures
    Rails.logger.warn("Payment service circuit open: #{e.message}")
    queue_for_retry(amount)
  end

  private

  def chaos_enabled?
    ENV['CHAOS_CIRCUIT_BREAKER'] == 'true'
  end

  def chaos_error_rate
    ENV.fetch('CHAOS_CIRCUIT_ERROR_RATE', '0.3').to_f
  end
end
Resource Exhaustion Simulation
Testing application behavior under resource constraints:
class ChaosResourceManager
  def self.exhaust_memory(megabytes, duration_seconds)
    return unless ENV['CHAOS_MEMORY'] == 'true'

    arrays = []
    megabytes.times do
      # ~1 MB per array: 131,072 slots of 8 bytes each on 64-bit MRI
      arrays << Array.new(131_072, 0)
    end
    sleep(duration_seconds)
    arrays.clear
    GC.start
  end

  def self.exhaust_cpu(duration_seconds)
    return unless ENV['CHAOS_CPU'] == 'true'

    start_time = Time.now
    threads = []
    # Spawn threads to consume CPU. Under MRI's GVL these threads contend
    # for a single core; use separate processes for multi-core load.
    4.times do
      threads << Thread.new do
        while Time.now - start_time < duration_seconds
          # Busy work
          1000.times { Math.sqrt(rand) }
        end
      end
    end
    threads.each(&:join)
  end

  def self.exhaust_file_descriptors(count, duration_seconds)
    return unless ENV['CHAOS_FD'] == 'true'

    files = []
    count.times do |i|
      files << File.open("/tmp/chaos_fd_#{i}", 'w')
    end
    sleep(duration_seconds)
    files.each(&:close)
    count.times { |i| File.delete("/tmp/chaos_fd_#{i}") rescue nil }
  end
end

# Usage in background job or test
ChaosResourceManager.exhaust_memory(500, 30) # 500MB for 30 seconds
ChaosResourceManager.exhaust_cpu(60)         # CPU stress for 60 seconds
Automated Experiment Scheduling
Ruby applications can schedule chaos experiments using background job frameworks:
class ChaosExperimentJob
  include Sidekiq::Job

  def perform(experiment_type, config)
    return unless Rails.env.production?
    return unless experiment_enabled?(experiment_type)

    experiment = create_experiment(experiment_type, config)
    begin
      experiment.setup
      experiment.execute
      record_success(experiment)
    rescue => e
      experiment.halt
      record_failure(experiment, e)
      raise
    ensure
      experiment.cleanup
    end
  end

  private

  def experiment_enabled?(type)
    # Redis.current is deprecated in recent redis-rb versions; prefer
    # injecting a shared client instance.
    Redis.current.get("chaos:experiment:#{type}:enabled") == 'true'
  end

  def create_experiment(type, config)
    case type
    when 'instance_termination'
      InstanceTerminationExperiment.new(config)
    when 'network_latency'
      NetworkLatencyExperiment.new(config)
    when 'dependency_failure'
      DependencyFailureExperiment.new(config)
    else
      raise ArgumentError, "Unknown experiment type: #{type}"
    end
  end

  def record_success(experiment)
    Metrics.increment('chaos.experiment.success',
                      tags: ["type:#{experiment.type}"])
  end

  def record_failure(experiment, error)
    Metrics.increment('chaos.experiment.failure',
                      tags: ["type:#{experiment.type}"])
    Rails.logger.error("Chaos experiment failed: #{error.message}")
  end
end

# Schedule experiments
ChaosExperimentJob.perform_in(1.hour, 'network_latency', {
  target_service: 'payment_api',
  latency_ms: 500,
  duration_seconds: 300
})
Practical Examples
Example 1: Database Failover Testing
A Ruby application depends on a PostgreSQL primary-replica setup. The chaos experiment verifies the application handles primary database failure gracefully:
class DatabaseFailoverExperiment
def initialize(config)
@duration_seconds = config.fetch(:duration, 300)
@primary_host = config.fetch(:primary_host)
@replica_host = config.fetch(:replica_host)
@steady_state_threshold = config.fetch(:success_rate_threshold, 0.99)
end
def execute
# Measure steady state before experiment
baseline_metrics = measure_steady_state(60)
unless meets_threshold?(baseline_metrics)
raise "Baseline below threshold: #{baseline_metrics.inspect}"
end
# Inject failure: block traffic to primary database
block_database_traffic(@primary_host)
# Monitor system behavior during failure
failure_metrics = measure_during_failure(@duration_seconds)
# Restore traffic
restore_database_traffic(@primary_host)
# Wait for recovery
sleep(30)
# Measure recovery metrics
recovery_metrics = measure_steady_state(60)
analyze_results(baseline_metrics, failure_metrics, recovery_metrics)
end
private
def measure_steady_state(duration)
success_count = 0
total_count = 0
latency_samples = []
start_time = Time.now
while Time.now - start_time < duration
result = perform_sample_query
total_count += 1
success_count += 1 if result.success?
latency_samples << result.latency_ms if result.success?
sleep(1)
end
{
success_rate: success_count.to_f / total_count,
p99_latency: calculate_percentile(latency_samples, 0.99),
sample_count: total_count
}
end
def measure_during_failure(duration)
# Similar measurement but expects degraded performance
measure_steady_state(duration)
end
def meets_threshold?(metrics)
metrics[:success_rate] >= @steady_state_threshold
end
def calculate_percentile(samples, percentile)
# Nearest-rank percentile over the collected latency samples
return 0 if samples.empty?
sorted = samples.sort
sorted[(percentile * (sorted.length - 1)).round]
end
def block_database_traffic(host)
# Use iptables or cloud provider API to block traffic
system("iptables -A OUTPUT -d #{host} -j DROP")
end
def restore_database_traffic(host)
system("iptables -D OUTPUT -d #{host} -j DROP")
end
def analyze_results(baseline, failure, recovery)
# The application should failover to replica
# Success rate should remain high during failure
# Latency may increase but should remain acceptable
# Recovery should return to baseline
results = {
baseline_success: baseline[:success_rate],
failure_success: failure[:success_rate],
recovery_success: recovery[:success_rate],
baseline_latency: baseline[:p99_latency],
failure_latency: failure[:p99_latency],
recovery_latency: recovery[:p99_latency]
}
if failure[:success_rate] < 0.95
raise "Application failed during database failover: #{results.inspect}"
end
Logger.info("Database failover experiment passed: #{results.inspect}")
results
end
end
Example 2: Dependency Timeout Testing
Testing that the application properly handles slow dependencies without cascading failures:
class DependencyTimeoutExperiment
def initialize(config)
@target_service = config.fetch(:target_service)
@latency_ms = config.fetch(:latency_ms)
@duration_seconds = config.fetch(:duration_seconds)
@affected_percentage = config.fetch(:affected_percentage, 0.5)
end
def execute
# Instrument dependency calls
instrument_service_calls
# Measure baseline
baseline = measure_metrics(30)
# Inject latency using Toxiproxy; the toxiproxy gem removes the
# toxic automatically when the apply block exits
failure_metrics = nil
Toxiproxy[@target_service].downstream(:latency,
latency: @latency_ms,
jitter: (@latency_ms * 0.2).to_i).apply do
# Measure during latency
failure_metrics = measure_metrics(@duration_seconds)
end
# Measure recovery
sleep(30)
recovery_metrics = measure_metrics(30)
evaluate_timeout_handling(baseline, failure_metrics, recovery_metrics)
end
private
def instrument_service_calls
# Track calls to the dependency
ServiceCallTracker.instrument(@target_service) do |call|
{
duration_ms: call.duration_ms,
success: call.success?,
timeout: call.timeout?,
circuit_open: call.circuit_open?
}
end
end
def measure_metrics(duration)
calls = ServiceCallTracker.calls_for(@target_service,
duration: duration)
# Guard against division by zero when no calls were recorded
raise "No calls recorded for #{@target_service}" if calls.empty?
{
total_calls: calls.count,
success_rate: calls.count(&:success?) / calls.count.to_f,
timeout_rate: calls.count(&:timeout?) / calls.count.to_f,
circuit_open_rate: calls.count(&:circuit_open?) / calls.count.to_f,
p95_latency: calculate_percentile(calls.map(&:duration_ms), 0.95)
}
end
def evaluate_timeout_handling(baseline, failure, recovery)
# Circuit breaker should open when timeouts exceed threshold
# Application should handle circuit open gracefully
# Recovery should restore normal operation
if failure[:timeout_rate] < 0.3
raise "Expected higher timeout rate, got #{failure[:timeout_rate]}"
end
if failure[:circuit_open_rate] < 0.5
raise "Circuit breaker did not open: #{failure[:circuit_open_rate]}"
end
if recovery[:success_rate] < baseline[:success_rate] * 0.95
raise "Recovery incomplete: #{recovery.inspect}"
end
Logger.info("Dependency timeout handled correctly: #{failure.inspect}")
end
end
Example 3: Message Queue Processing Chaos
Testing Sidekiq job processing resilience when Redis becomes unavailable:
class MessageQueueChaosExperiment
def initialize(config)
@chaos_duration = config.fetch(:duration_seconds)
@redis_host = config.fetch(:redis_host)
@acceptable_job_loss_rate = config.fetch(:acceptable_job_loss, 0.01)
end
def execute
# Enqueue test jobs continuously
test_jobs = start_job_enqueuing
# Measure baseline processing
sleep(30)
baseline_processed = count_processed_jobs(test_jobs)
# Block Redis traffic to simulate failure
block_redis_traffic
# Continue enqueuing jobs during failure
sleep(@chaos_duration)
failure_processed = count_processed_jobs(test_jobs)
# Restore Redis
restore_redis_traffic
# Stop enqueuing before draining so the final job count is stable
stop_job_enqueuing(test_jobs)
# Wait for queue to drain
wait_for_queue_drain
# Count final processed jobs
final_processed = count_processed_jobs(test_jobs)
evaluate_queue_resilience(test_jobs.count, final_processed)
ensure
stop_job_enqueuing(test_jobs)
end
private
def start_job_enqueuing
job_ids = []
@enqueuing_thread = Thread.new do
loop do
job_id = SecureRandom.uuid
TestJob.perform_async(job_id)
job_ids << job_id
sleep(0.1)
rescue => e
Logger.debug("Enqueue failed during chaos: #{e.message}")
end
end
job_ids
end
def stop_job_enqueuing(job_ids)
@enqueuing_thread.kill if @enqueuing_thread
end
def count_processed_jobs(job_ids)
# Check database or tracking system for processed jobs
ProcessedJobTracker.count_processed(job_ids)
end
def block_redis_traffic
system("iptables -A OUTPUT -d #{@redis_host} -j DROP")
end
def restore_redis_traffic
system("iptables -D OUTPUT -d #{@redis_host} -j DROP")
end
def wait_for_queue_drain
timeout = 300
start = Time.now
while Sidekiq::Queue.new.size > 0
break if Time.now - start > timeout
sleep(5)
end
end
def evaluate_queue_resilience(total_jobs, processed_jobs)
loss_rate = (total_jobs - processed_jobs).to_f / total_jobs
if loss_rate > @acceptable_job_loss_rate
raise "Job loss rate #{loss_rate} exceeds threshold"
end
Logger.info("Queue chaos handled with #{processed_jobs}/#{total_jobs} processed")
end
end
class TestJob
include Sidekiq::Job
def perform(job_id)
# Record that this job processed successfully
ProcessedJobTracker.mark_processed(job_id)
end
end
Example 4: Memory Leak Simulation
Testing application behavior under memory pressure:
class MemoryPressureExperiment
def initialize(config)
@target_memory_mb = config.fetch(:memory_mb)
@duration_seconds = config.fetch(:duration_seconds)
@memory_leak_rate = config.fetch(:leak_rate_mb_per_minute)
end
def execute
# Measure baseline memory and response times
baseline = measure_system_health
# Start gradual memory leak
leak_thread = start_memory_leak
# Monitor system degradation
monitor_degradation(@duration_seconds)
# Stop leak and force GC
leak_thread.kill
force_garbage_collection
# Measure recovery
sleep(30)
recovery = measure_system_health
analyze_memory_handling(baseline, recovery)
ensure
leak_thread.kill if leak_thread&.alive?
end
private
def start_memory_leak
Thread.new do
leaked_memory = []
loop do
# Allocate memory at controlled rate
mb_to_allocate = @memory_leak_rate / 60.0
leaked_memory << Array.new((mb_to_allocate * 1024 * 1024 / 8).to_i, 0)
sleep(1)
# Stop if target reached
current_mb = leaked_memory.sum { |a| a.size * 8 / 1024 / 1024 }
break if current_mb >= @target_memory_mb
end
end
end
def monitor_degradation(duration)
measurements = []
start_time = Time.now
while Time.now - start_time < duration
health = measure_system_health
measurements << health
# Check for unacceptable degradation
if health[:response_time_p99] > 5000
Logger.warn("Response time exceeded threshold: #{health.inspect}")
end
if health[:error_rate] > 0.05
Logger.error("Error rate exceeded threshold: #{health.inspect}")
halt_experiment
break
end
sleep(10)
end
measurements
end
def measure_system_health
{
memory_mb: get_process_memory_mb,
response_time_p99: measure_response_time_p99,
error_rate: measure_error_rate,
throughput_rps: measure_throughput
}
end
def get_process_memory_mb
`ps -o rss= -p #{Process.pid}`.to_i / 1024
end
def force_garbage_collection
3.times do
GC.start(full_mark: true, immediate_sweep: true)
sleep(1)
end
end
def analyze_memory_handling(baseline, recovery)
# Application should maintain acceptable performance under memory pressure
# GC should prevent unbounded growth
# Recovery should restore baseline performance
Logger.info("Memory pressure test completed: baseline=#{baseline}, recovery=#{recovery}")
end
end
Real-World Applications
Financial Services Transaction Processing
A payment processing company runs continuous chaos experiments to ensure transaction integrity during failures. The experiments terminate payment service instances randomly, introduce database connection failures, and simulate network partitions between payment processing and fraud detection services.
The chaos program discovered that payment retries occasionally created duplicate charges when network failures occurred at specific timing windows. The team added idempotency keys to payment API calls and implemented distributed transaction coordinators. Automated experiments run every 6 hours in production, targeting 5% of payment traffic. The experiments halt immediately if the duplicate transaction rate exceeds 0.01% or if payment success rate drops below 99.9%.
The continuous validation caught a regression when a developer removed retry logic during a refactoring. The automated experiment failed within 4 hours of deployment, before the change affected customer payments. The team reverted the deployment and added tests to prevent similar regressions.
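The idempotency-key fix lends itself to a short sketch. `IdempotentPaymentClient` and the in-memory result store are illustrative stand-ins for the company's real gateway client and durable key store, assuming the gateway exposes a `charge` method:

```ruby
require 'securerandom'

# Illustrative sketch of idempotent payment submission. A production
# version would persist keys in a durable store (with a TTL), not a Hash.
class IdempotentPaymentClient
  def initialize(gateway)
    @gateway = gateway
    @completed = {} # idempotency_key => prior result
  end

  # Safe to call from retry logic: the gateway is charged at most
  # once per idempotency key, even if the caller retries.
  def charge(amount_cents, idempotency_key: SecureRandom.uuid)
    @completed[idempotency_key] ||= @gateway.charge(amount_cents)
  end
end
```

A retry after a network timeout reuses the original key, so a charge that succeeded server-side before the connection dropped is not applied twice.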
E-Commerce Platform Scaling
An online retailer conducts weekly game days where teams inject failures during peak traffic periods. The experiments include terminating application servers, introducing latency to product catalog services, and simulating payment gateway outages.
A game day revealed that the shopping cart service cascaded failures to the checkout service when the product inventory API became slow. The cart service called inventory synchronously for every cart item, and slow responses caused request timeouts. The team refactored the cart service to call inventory asynchronously and cache results. They added circuit breakers with fallback to stale inventory data.
The retailer automated the most valuable experiments after game days validated their effectiveness. Automated experiments run during non-peak hours, gradually increasing traffic percentage as confidence builds. The experiments discovered that auto-scaling policies reacted too slowly to sudden traffic spikes, leading to temporary capacity shortages. The team adjusted auto-scaling triggers and added standing capacity buffers.
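The cache-with-stale-fallback behavior described above can be sketched as follows; `InventoryWithFallback` and the fetcher callable are hypothetical names, not the retailer's actual code:

```ruby
# Sketch of serving stale cached inventory when the live call fails,
# so a slow or down inventory API degrades the cart instead of breaking it.
class InventoryWithFallback
  CacheEntry = Struct.new(:value, :stored_at)

  def initialize(fetcher, ttl_seconds: 60)
    @fetcher = fetcher
    @ttl = ttl_seconds
    @cache = {}
  end

  def stock_for(sku)
    entry = @cache[sku]
    return entry.value if entry && Time.now - entry.stored_at < @ttl
    begin
      fresh = @fetcher.call(sku)
      @cache[sku] = CacheEntry.new(fresh, Time.now)
      fresh
    rescue StandardError
      # Degrade to stale data rather than failing the whole request
      entry ? entry.value : raise
    end
  end
end
```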
Video Streaming Service Resilience
A video streaming platform runs chaos experiments targeting their content delivery network, origin servers, and encoding pipeline. Experiments simulate CDN node failures, origin server overload, and encoding job failures.
The experiments revealed that video players retried failed segment fetches without exponential backoff, creating thundering herd problems when CDN nodes failed. During a node failure, thousands of players simultaneously retried requests, overwhelming the origin server. The team implemented exponential backoff with jitter in player retry logic and added rate limiting to origin servers.
Continuous experiments test different failure scenarios daily. The platform measures video start time, buffering ratio, and error rates during experiments. Experiments halt if buffering ratio exceeds 0.5% or start time increases beyond 5 seconds. The automated system has caught performance regressions in player code, CDN configuration changes that reduced cache hit rates, and origin server capacity issues.
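A minimal sketch of exponential backoff with full jitter, the general shape of the fix described above (the function name and parameters are illustrative):

```ruby
# Exponential backoff with "full jitter": the delay ceiling doubles per
# attempt up to a cap, and the actual delay is drawn uniformly below it,
# scattering retries that would otherwise arrive in lockstep.
def retry_delay(attempt, base: 0.5, cap: 30.0, rng: Random.new)
  ceiling = [cap, base * (2**attempt)].min
  rng.rand * ceiling # uniform in [0, ceiling)
end
```

Players that failed together now retry at scattered times, so a recovering CDN node or origin server sees a trickle of requests instead of a synchronized wave.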
Microservices API Gateway
An API gateway handling authentication and routing for 200+ microservices implements automated chaos testing for circuit breaker behavior. Experiments inject failures into backend services to verify that circuit breakers open appropriately and the gateway provides sensible fallback responses.
The experiments discovered that circuit breakers shared state across all gateway instances, creating coordination overhead. When a backend service failed, all gateway instances attempted to update circuit breaker state simultaneously, causing contention. The team moved to independent circuit breakers per instance with eventual consistency rather than strong consistency.
Automated experiments run continuously, targeting different backend services hourly. Each experiment injects failures into one service while monitoring whether circuit breakers open within acceptable time and whether other services remain unaffected. The experiments also verify that circuit breakers transition to half-open after the cooldown period and probe backend services as they recover.
Distributed Database Cluster
A company operating a distributed database performs chaos experiments to validate consensus algorithms and data replication. Experiments partition network segments, introduce packet loss, and terminate database nodes during write operations.
The experiments identified a split-brain scenario where network partitions caused two nodes to simultaneously believe they were cluster leaders. The team strengthened leader election algorithms and added fencing tokens to prevent split-brain. They also discovered that rebuilding nodes after failures consumed too much bandwidth, impacting production traffic. The team implemented bandwidth throttling for rebuilds.
The chaos program includes quarterly disaster recovery drills where teams shut down entire data centers. The drills verify that databases fail over correctly, replication catches up, and applications adapt to new database endpoints. The drills take 4-6 hours and involve coordination across multiple teams.
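The fencing-token safeguard can be sketched as a store that rejects writes carrying a token older than one it has already accepted; `FencedStore` is an illustrative name, not the company's implementation:

```ruby
class StaleTokenError < StandardError; end

# A resource that accepts writes only with a monotonically non-decreasing
# fencing token, so a deposed leader holding an old token cannot write.
class FencedStore
  def initialize
    @highest_token = -1
    @data = {}
  end

  def write(token, key, value)
    raise StaleTokenError, "token #{token} < #{@highest_token}" if token < @highest_token
    @highest_token = token
    @data[key] = value
  end

  def read(key)
    @data[key]
  end
end
```

Each leader election issues a strictly larger token, so after a partition heals, writes from the old leader fail loudly instead of silently clobbering the new leader's data.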
Reference
Experiment Types
| Type | Description | Typical Impact Scope |
|---|---|---|
| Instance Termination | Kill application servers or containers | Single service availability |
| Network Latency | Add delay to network requests | Request timeouts, degraded performance |
| Network Partition | Block network communication between services | Service isolation, split-brain |
| Packet Loss | Drop percentage of network packets | Retry storms, connection failures |
| DNS Failure | Return errors for DNS lookups | Service discovery failures |
| Disk I/O Delay | Slow disk read/write operations | Database slowdown, log delays |
| CPU Exhaustion | Consume CPU resources | Request queuing, timeouts |
| Memory Exhaustion | Consume memory resources | Out of memory errors, thrashing |
| Dependency Failure | Simulate external service failures | Cascading failures, timeouts |
| Database Connection Failure | Terminate or block database connections | Transaction failures, connection pool exhaustion |
| Message Queue Failure | Block or delay message processing | Job backlog, processing delays |
Blast Radius Control Techniques
| Technique | Implementation | Risk Level |
|---|---|---|
| Percentage-based targeting | Affect only X% of requests or instances | Low |
| Geographic limitation | Limit experiments to single region | Low |
| Canary targeting | Target specific instance or customer segment | Low |
| Time-based limiting | Run experiments for fixed duration | Medium |
| Automatic halt conditions | Stop if metrics exceed threshold | Medium |
| Manual approval | Require human approval to proceed | Low (highest control) |
| Instance count minimum | Never reduce below minimum capacity | Low (essential safeguard) |
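Percentage-based targeting from the table above is often implemented with deterministic hashing rather than random sampling, so a given request or user stays consistently inside or outside the blast radius. A minimal sketch (the function name is assumed):

```ruby
require 'zlib'

# Hash the request or user ID into one of 100 buckets; IDs in buckets
# below the configured percentage fall inside the experiment cohort.
# Deterministic, so the same ID always gets the same answer.
def in_blast_radius?(request_id, percentage)
  Zlib.crc32(request_id.to_s) % 100 < percentage
end
```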
Steady State Metrics
| Metric Type | Examples | Measurement Approach |
|---|---|---|
| Availability | Uptime percentage, successful request rate | Monitor HTTP status codes |
| Latency | Response time percentiles, time to first byte | Track request durations |
| Throughput | Requests per second, transactions per minute | Count completed operations |
| Error Rate | Failed requests per total requests | Track error responses |
| Data Integrity | Checksums, duplicate detection, consistency checks | Validate data correctness |
| Business Metrics | Conversion rate, revenue per minute, user sessions | Monitor business outcomes |
Common Halt Conditions
| Condition | Threshold Example | Action |
|---|---|---|
| Error rate increase | Error rate exceeds 1% | Immediate halt |
| Latency increase | P99 latency exceeds 5x baseline | Immediate halt |
| Availability drop | Success rate below 99% | Immediate halt |
| Business impact | Revenue drop exceeds $100/minute | Immediate halt |
| Custom metrics | Cart abandonment exceeds 10% | Team review |
| Duration limit | Experiment runs longer than 30 minutes | Automatic halt |
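Halt conditions like those above are straightforward to encode as a table of named checks evaluated against live metrics. A minimal sketch with illustrative condition names:

```ruby
# Evaluates a set of named halt conditions against a metrics snapshot.
# Each condition is a lambda returning true when its threshold is breached.
class HaltMonitor
  def initialize(conditions)
    @conditions = conditions
  end

  # Returns the names of breached conditions; an empty array means
  # the experiment may keep running.
  def breached(metrics)
    @conditions.select { |_name, check| check.call(metrics) }.keys
  end
end
```

An experiment loop would call `breached` on every measurement tick and trigger rollback as soon as the result is non-empty.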
Ruby Chaos Tools Configuration
| Tool/Gem | Installation | Basic Configuration |
|---|---|---|
| chaos_rb | gem install chaos_rb | ENV variables for error/latency rates |
| Semian | gem install semian | Circuit breaker thresholds and timeouts |
| Toxiproxy | gem install toxiproxy | HTTP API or client library configuration |
| ActiveRecord Chaos | Custom middleware | Database connection pool wrapper |
| Rack Chaos Middleware | Custom implementation | Request interception with failure injection |
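The "Rack Chaos Middleware" row describes a custom implementation; a minimal sketch might look like this (the class name and 503 response are illustrative choices):

```ruby
# Rack middleware that fails a configurable fraction of requests before
# they reach the application, returning a 503 for the affected ones.
class ChaosMiddleware
  def initialize(app, error_rate: 0.0, rng: Random.new)
    @app = app
    @error_rate = error_rate
    @rng = rng
  end

  def call(env)
    if @rng.rand < @error_rate
      [503, { 'content-type' => 'text/plain' }, ['chaos: injected failure']]
    else
      @app.call(env)
    end
  end
end
```

Mounted with `use ChaosMiddleware, error_rate: ENV.fetch('CHAOS_ERROR_RATE', '0').to_f`, the rate can be tuned or zeroed without a deploy.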
Experiment Analysis Questions
| Question | Data Required | Analysis Method |
|---|---|---|
| Did steady state maintain during experiment? | Baseline and experiment metrics | Statistical comparison |
| What was the blast radius? | Affected instances, requests, customers | Impact measurement |
| How quickly did the system recover? | Time-series metrics after experiment | Recovery time analysis |
| Were circuit breakers effective? | Circuit open/close events | Event correlation |
| Did alerts fire appropriately? | Alert history during experiment | Alert verification |
| What was the customer impact? | Business metrics, support tickets | Customer impact assessment |
Chaos Engineering Maturity Levels
| Level | Characteristics | Typical Activities |
|---|---|---|
| 1 - Ad Hoc | Manual experiments, no automation | Occasional instance termination |
| 2 - Scheduled | Regular game days, some automation | Monthly planned chaos events |
| 3 - Integrated | Automated experiments in CI/CD | Pre-deployment chaos testing |
| 4 - Continuous | Always-on production experiments | Hourly automated experiments |
| 5 - Advanced | Automated experiment selection and analysis | Self-optimizing chaos program |
Ruby Circuit Breaker Patterns
| Pattern | Implementation | Use Case |
|---|---|---|
| Fail Fast | Raise error immediately when circuit open | User-facing requests |
| Fallback Response | Return cached or default value | Product catalogs, recommendations |
| Degraded Mode | Provide limited functionality | Search with reduced features |
| Queue for Retry | Store failed requests for later processing | Background jobs, analytics |
| Timeout Protection | Enforce maximum wait time | External API calls |
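The "Fail Fast" and "Fallback Response" rows combine naturally in a single breaker. A minimal sketch with illustrative thresholds, not a substitute for a production library such as Semian:

```ruby
# Minimal circuit breaker: trips open after consecutive failures, serves
# a fallback (or fails fast) while open, and half-opens after a cooldown.
class SimpleCircuitBreaker
  class OpenError < StandardError; end

  def initialize(error_threshold: 3, reset_after: 30)
    @error_threshold = error_threshold
    @reset_after = reset_after
    @failures = 0
    @opened_at = nil
  end

  def call(fallback: nil)
    if open?
      return fallback.call if fallback
      raise OpenError
    end
    result = yield
    @failures = 0
    result
  rescue OpenError
    raise
  rescue StandardError
    @failures += 1
    @opened_at = Time.now if @failures >= @error_threshold
    return fallback.call if fallback
    raise
  end

  private

  def open?
    return false unless @opened_at
    if Time.now - @opened_at > @reset_after
      # Half-open: reset and allow a trial request through
      @opened_at = nil
      @failures = 0
      false
    else
      true
    end
  end
end
```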
Automation Scheduling Considerations
| Factor | Consideration | Recommendation |
|---|---|---|
| Time of Day | Avoid peak traffic hours | Schedule during low traffic |
| Deployment Frequency | Coordinate with releases | Wait 1 hour after deployments |
| Team Availability | Ensure on-call coverage | Align with on-call schedules |
| Experiment Duration | Balance insight vs risk | Start with 5-10 minute experiments |
| Frequency | Build confidence gradually | Weekly to daily as maturity increases |
| Geographic Distribution | Respect regional peak times | Rotate regions for experiments |