Overview
Chaos Engineering originated at Netflix in 2011 with the creation of Chaos Monkey, a tool that randomly terminates virtual machine instances in production. The practice emerged from the recognition that traditional testing approaches fail to identify systemic weaknesses in distributed systems. Complex systems exhibit emergent behaviors that only manifest under production conditions with real traffic patterns, network topologies, and component interactions.
The discipline addresses a fundamental challenge in modern software development: systems grow increasingly complex while failures become increasingly expensive. A distributed system with dozens of microservices, multiple databases, caching layers, message queues, and external dependencies creates thousands of potential failure scenarios. Testing each scenario individually becomes impractical, and many failure modes only emerge from the interaction between multiple components under specific conditions.
Chaos Engineering inverts the traditional testing paradigm. Instead of attempting to prove a system works correctly, practitioners attempt to discover how it fails. The approach assumes failures will occur and seeks to identify weaknesses proactively rather than reactively during incidents. Organizations conduct controlled experiments in production, observing how systems respond to deliberate disruptions while maintaining acceptable service levels.
The practice differs from traditional fault injection testing in several key aspects. Chaos experiments run continuously in production environments rather than isolated test environments. They target entire systems rather than individual components. The experiments measure real business metrics rather than synthetic test assertions. Teams halt experiments immediately if they detect customer impact, treating each experiment as a learning opportunity rather than a pass-fail test.
Consider a financial services application handling payment processing. Traditional testing might verify that the payment service correctly processes transactions and handles database connection failures. Chaos Engineering would terminate random payment service instances during peak traffic, introduce latency into database queries, and corrupt network packets between services to observe whether the system maintains transaction integrity, preserves customer data, and provides appropriate error messages. The experiments reveal whether circuit breakers activate correctly, whether retries happen with appropriate backoff, and whether the system degrades gracefully rather than letting failures cascade across services.
Key Principles
The Principles of Chaos Engineering, formalized by Netflix engineers, define five core tenets that guide the practice. These principles establish the theoretical foundation and differentiate chaos engineering from random disruption.
Build a Hypothesis Around Steady State Behavior
Chaos experiments begin by defining steady state as measurable output that indicates normal system operation. The steady state represents business metrics rather than internal system metrics. For an e-commerce platform, steady state might measure successful checkout completion rate, average page load time, and search result accuracy. The hypothesis predicts that steady state will continue during chaos experiments. Teams avoid focusing on internal metrics like CPU utilization or memory consumption unless they directly correlate with customer experience.
Defining steady state requires understanding what constitutes acceptable system behavior under normal conditions. A video streaming service might define steady state as 99.9% of streams starting within 3 seconds with less than 0.1% buffering events per hour. The definition must be specific enough to detect degradation yet tolerant enough to accommodate normal variations in system performance.
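A steady-state definition like the streaming example above can be captured as a small predicate over observed metrics. The sketch below is purely illustrative: the class, metric names, and thresholds are hypothetical, not taken from any particular tool.

```ruby
# Hypothetical steady-state definition for the streaming example:
# 99.9% of streams start fast, under 0.1% buffering events per hour.
class SteadyState
  def initialize(min_fast_start_rate: 0.999, max_buffer_events_per_hour: 0.001)
    @min_fast_start_rate = min_fast_start_rate
    @max_buffer_events_per_hour = max_buffer_events_per_hour
  end

  # metrics is a hash such as:
  #   { fast_start_rate: 0.9995, buffer_events_per_hour: 0.0004 }
  def holds?(metrics)
    metrics[:fast_start_rate] >= @min_fast_start_rate &&
      metrics[:buffer_events_per_hour] <= @max_buffer_events_per_hour
  end
end

steady = SteadyState.new
steady.holds?(fast_start_rate: 0.9995, buffer_events_per_hour: 0.0004) # => true
steady.holds?(fast_start_rate: 0.98, buffer_events_per_hour: 0.0004)   # => false
```

The hypothesis of a chaos experiment then becomes a single testable statement: `holds?` stays true while failures are injected.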
Vary Real-World Events
Chaos experiments simulate failures that occur in production systems. Hardware failures, network partitions, traffic spikes, dependency outages, configuration errors, and resource exhaustion all represent real scenarios. The events selected for experiments come from post-incident reviews, monitoring data, and production failure patterns. Teams prioritize scenarios based on historical frequency and potential customer impact.
Real-world events include both sudden failures and gradual degradations. A database server might fail instantly, or it might slowly degrade as disk I/O performance decreases. Network latency might spike suddenly or creep upward over hours. Memory leaks cause gradual resource exhaustion. Chaos experiments model these patterns rather than only simulating catastrophic instant failures.
The scope of events extends beyond infrastructure failures. Application-level failures matter equally. Chaos experiments might introduce bugs in code paths, corrupt data in caches, return incorrect results from dependencies, or simulate third-party API changes. State-based failures where the system enters invalid states due to timing issues or race conditions represent particularly valuable experiment targets.
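Gradual degradation, as described above, can be modelled by ramping an injected fault over time rather than applying it at full strength immediately. This is a minimal sketch under stated assumptions; the class name and parameters are hypothetical.

```ruby
# Illustrative model of gradual degradation: injected latency that creeps
# upward linearly over the experiment window instead of spiking at once.
class LatencyRamp
  def initialize(start_ms:, end_ms:, duration_s:)
    @start_ms = start_ms
    @end_ms = end_ms
    @duration_s = duration_s.to_f
  end

  # Latency to inject at +elapsed_s+ seconds into the experiment,
  # interpolated linearly and capped at the final value.
  def latency_ms(elapsed_s)
    fraction = [elapsed_s / @duration_s, 1.0].min
    @start_ms + (@end_ms - @start_ms) * fraction
  end
end

ramp = LatencyRamp.new(start_ms: 10, end_ms: 500, duration_s: 3600)
ramp.latency_ms(0)    # => 10.0
ramp.latency_ms(1800) # => 255.0
ramp.latency_ms(7200) # => 500.0 (capped after the ramp completes)
```

An injection middleware or proxy would consult such a ramp on each request, so the system experiences the slow-degradation pattern that catastrophic instant failures cannot reveal.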
Run Experiments in Production
Production environments exhibit complexity that cannot be replicated in test environments. Real traffic patterns, actual data volumes, genuine network topologies, authentic security controls, and production deployment configurations create emergent behaviors. Running experiments in production provides the only reliable method to verify system behavior under actual operating conditions.
The production requirement creates tension with risk management. Organizations balance the need for realistic experiments against the potential for customer impact. Progressive rollout strategies mitigate risk. Experiments begin with small percentages of traffic, limited geographic regions, or specific customer segments. Teams monitor business metrics continuously and implement automated halt conditions that stop experiments when impact exceeds acceptable thresholds.
Production experiments require sophisticated control mechanisms. Teams need the ability to target specific components, limit the blast radius, and instantly revert changes. The experiment infrastructure itself must be more reliable than the systems under test to prevent chaos tooling from becoming a source of outages.
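An automated halt condition of the kind described above can be as simple as an error-rate budget evaluated continuously for the experiment cohort. The sketch below is hypothetical; real platforms evaluate richer signals, but the shape is the same.

```ruby
# Hedged sketch of an automated halt condition: stop the experiment as
# soon as the observed error rate for the cohort exceeds a budget.
class HaltCondition
  def initialize(max_error_rate: 0.02, min_samples: 100)
    @max_error_rate = max_error_rate
    @min_samples = min_samples
    @errors = 0
    @total = 0
  end

  def record(success)
    @total += 1
    @errors += 1 unless success
  end

  # Only halt once enough samples exist to avoid noise from early requests.
  def halt?
    @total >= @min_samples && @errors.fdiv(@total) > @max_error_rate
  end
end
```

The experiment runner would call `record` for each observed request and abort (and roll back the injected fault) the first time `halt?` returns true.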
Automate Experiments to Run Continuously
Manual chaos engineering fails to scale and becomes inconsistent. Systems change daily through deployments, configuration updates, dependency upgrades, and infrastructure modifications. Each change potentially introduces new failure modes. Automated continuous experimentation catches regressions and validates that recent changes maintain resilience properties.
Automation enables coverage across the entire system surface. A manually conducted experiment might test a specific failure scenario once per quarter. Automated experiments run the same scenario daily across all environments, services, and regions. The continuous execution builds confidence that the system maintains resilience properties over time rather than only at the moment of manual testing.
The automation encompasses experiment design, execution, monitoring, and analysis. Systems automatically select scenarios based on production traffic patterns and recent changes. They execute experiments according to schedules that avoid maintenance windows and high-traffic periods. Automated analysis compares steady state metrics before and during experiments, flagging anomalies for human review.
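The before/during comparison step can be sketched as a small function that flags any metric degrading beyond a tolerance. Everything here is illustrative: it assumes metrics where higher is better, and the names are invented for the example.

```ruby
# Illustrative automated analysis: compare steady-state metrics captured
# before and during an experiment, returning the metrics whose relative
# drop exceeds a tolerance. Assumes higher metric values are better.
def flag_anomalies(baseline, during, tolerance: 0.05)
  baseline.keys.select do |metric|
    drop = (baseline[metric] - during[metric]) / baseline[metric].to_f
    drop > tolerance
  end
end

baseline = { checkout_rate: 0.99, search_success: 0.98 }
during   = { checkout_rate: 0.90, search_success: 0.97 }
flag_anomalies(baseline, during) # => [:checkout_rate]
```

Flagged metrics would then be routed to human review rather than automatically failing the experiment.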
Minimize Blast Radius
Chaos experiments must not cause outages. The practice exists to prevent customer-impacting failures, not create them. Teams design experiments to contain potential damage while still providing valuable insights. Blast radius limitations include targeting specific instances rather than entire clusters, limiting experiments to a subset of traffic, implementing rapid rollback mechanisms, and defining clear halt conditions.
Minimizing blast radius requires careful experiment design. An experiment testing database failover should affect one database replica rather than the primary. Network latency injection should target a small percentage of requests. Instance terminations should respect minimum capacity requirements. Each experiment includes explicit boundaries that prevent catastrophic failures even if the system responds unexpectedly.
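The minimum-capacity rule for instance terminations can be enforced by a guard that the chaos tooling consults before every kill. This is a minimal sketch; the class name and inventory inputs are hypothetical.

```ruby
# Sketch of a blast-radius guard: terminations are refused when they
# would drop a cluster below its minimum healthy capacity.
class BlastRadiusGuard
  def initialize(min_healthy:)
    @min_healthy = min_healthy
  end

  # healthy_count is the number of instances currently in service.
  def allow_termination?(healthy_count, to_terminate = 1)
    healthy_count - to_terminate >= @min_healthy
  end
end

guard = BlastRadiusGuard.new(min_healthy: 3)
guard.allow_termination?(5)    # => true
guard.allow_termination?(3)    # => false
guard.allow_termination?(5, 3) # => false
```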
The principle extends to organizational boundaries. Teams own experiments affecting their services. Cross-team experiments require coordination and approval. Shared infrastructure experiments need particularly careful scoping to avoid impacting multiple teams simultaneously.
Design Considerations
Organizations face several decisions when implementing chaos engineering programs. The choices affect experiment effectiveness, operational overhead, and organizational adoption.
Experiment Scope Selection
Teams must decide whether to start with infrastructure-level experiments or application-level experiments. Infrastructure experiments like instance termination and network failures provide quick value with minimal code changes. Application-level experiments testing business logic failures and data corruption require deeper integration but reveal more sophisticated failure modes. Starting with infrastructure experiments builds confidence and organizational support before expanding to application-level testing.
The decision between broad system-wide experiments and narrow component-specific experiments presents another tradeoff. System-wide experiments reveal emergent behaviors and cross-service dependencies but create complex failure scenarios that make root cause analysis difficult. Component-specific experiments provide clear causal relationships but might miss interactions between components. Mature chaos engineering programs employ both approaches: component experiments verify individual service resilience while system experiments validate end-to-end behavior.
Production vs Non-Production Environments
Running experiments exclusively in production maximizes realism but maximizes risk. Organizations with low risk tolerance or strict compliance requirements might begin chaos engineering in production-like staging environments. This approach reduces risk but also reduces insights since staging rarely matches production complexity. Systems behave differently under synthetic load compared to real traffic patterns. Dependencies configured in staging might not match production. The staging approach works as a stepping stone but should progress toward production experiments as teams build confidence.
Some organizations adopt a hybrid model where infrastructure experiments run in production while application experiments run in staging. Another approach uses production experiments with limited scope, affecting only internal users or specific customer segments. The key decision factor is whether the organization values realism over risk minimization.
Manual vs Automated Execution
Manual chaos engineering requires less initial investment and provides more control during experiments. Teams manually trigger experiments, observe system behavior, and make real-time decisions about continuing or halting. This approach works well for learning the practice and understanding system behavior but doesn't scale. Manual experiments happen infrequently, cover limited scenarios, and depend on individual expertise.
Automated chaos engineering requires significant upfront investment in tooling and monitoring but provides continuous validation. Systems automatically select scenarios, execute experiments, evaluate results, and alert on anomalies. The automation enables coverage across all services and scenarios while requiring less human involvement. Organizations typically begin with manual experiments to understand failure modes before automating the most valuable scenarios.
Metrics and Observability Requirements
Effective chaos engineering depends on comprehensive observability. Systems need metrics that accurately reflect customer experience, detailed enough to identify specific failure modes, and real-time enough to halt experiments quickly. Organizations must decide whether to use existing monitoring or build chaos-specific observability.
Existing monitoring provides immediate value but might lack the granularity needed for experiment analysis. A single "requests per second" metric doesn't reveal whether errors affect specific endpoints or customer segments. Chaos-specific observability adds metrics like "successful checkout rate for experiment participants" but increases complexity. The decision depends on the maturity of existing monitoring and the organization's ability to maintain additional tooling.
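Chaos-specific observability of the "successful checkout rate for experiment participants" kind boils down to tagging each request with a cohort and aggregating per cohort. The sketch below is hypothetical and in-memory; a real system would emit tagged metrics to its monitoring backend.

```ruby
# Hypothetical sketch of cohort-tagged observability: requests carry an
# experiment cohort so success rates can be compared between experiment
# participants and the control group.
class CohortMetrics
  def initialize
    @counts = Hash.new { |h, k| h[k] = { success: 0, total: 0 } }
  end

  def record(cohort, success)
    bucket = @counts[cohort]
    bucket[:total] += 1
    bucket[:success] += 1 if success
  end

  def success_rate(cohort)
    bucket = @counts[cohort]
    return nil if bucket[:total].zero?
    bucket[:success].fdiv(bucket[:total])
  end
end

metrics = CohortMetrics.new
metrics.record(:experiment, true)
metrics.record(:experiment, false)
metrics.record(:control, true)
metrics.success_rate(:experiment) # => 0.5
metrics.success_rate(:control)    # => 1.0
```

A gap between the two rates is exactly the signal that a single aggregate "requests per second" metric cannot provide.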
Organizational Readiness
Chaos engineering requires cultural and technical prerequisites. The organization needs incident response processes, blameless postmortems, on-call rotations, and monitoring infrastructure before beginning chaos experiments. Teams must have time allocated for resilience work rather than only feature development. Engineering leadership must support learning from failures rather than punishing them.
Technical readiness includes deployment automation, feature flags, circuit breakers, retry logic, and graceful degradation mechanisms. Organizations lacking these capabilities should build them before starting chaos engineering since experiments will immediately reveal their absence. Starting chaos engineering too early creates frustration when experiments find problems teams lack the tools to fix.
Implementation Approaches
Chaos engineering programs can follow several implementation strategies, each with different timelines, resource requirements, and organizational impacts.
Game Days
Game days schedule regular chaos exercises where teams manually inject failures and observe system behavior. These exercises typically run monthly or quarterly, involve multiple teams, and last several hours. Game days provide structured learning opportunities, build team coordination, and create shared understanding of system dependencies.
The game day structure begins with planning where teams select scenarios based on recent incidents or identified risks. During execution, one team injects failures while other teams monitor their services and coordinate responses. Post-game analysis reviews what worked, what failed, and what improvements the organization needs. Game days work well for organizations beginning chaos engineering since they provide controlled environments for learning without requiring automated tooling.
Game days have limitations. They happen infrequently, cover limited scenarios, and require significant coordination overhead. Systems might behave differently during scheduled exercises compared to unexpected failures. Teams prepare specifically for game days rather than maintaining continuous readiness. Game days serve as stepping stones toward continuous automated chaos engineering rather than long-term solutions.
Steady-State Automated Testing
This approach integrates chaos experiments into continuous integration pipelines. Each deployment triggers automated chaos tests that verify the new version maintains resilience properties. Tests might terminate instances of the new service version, introduce latency to dependencies, or simulate downstream failures. The deployment proceeds only if chaos tests pass.
Steady-state testing provides rapid feedback on resilience regressions. Developers learn immediately when code changes break circuit breakers or introduce cascading failures. The approach prevents resilience problems from reaching production. However, it shares limitations of all pre-production testing: the test environment differs from production, synthetic load differs from real traffic, and the full system complexity might not exist in the test environment.
Organizations implementing this approach need investment in test environment infrastructure and automation. The test environment must be complex enough to reveal meaningful failure modes yet cost-effective enough to run continuously. The automation must execute experiments reliably and produce deterministic results to avoid flaky tests that block deployments.
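A deployment-gating chaos test can be reduced to a deterministic check: drive traffic through the candidate while injecting failures, then pass or fail on the observed success rate. The sketch below is a toy under stated assumptions (a simulated caller with one retry, a seeded random source for determinism); a real gate would exercise the deployed service itself.

```ruby
# Illustrative pipeline chaos gate: a fraction of calls fail, the caller
# retries once, and the deployment proceeds only if the observed success
# rate stays above a threshold. Seeded RNG keeps the test deterministic.
def chaos_gate(attempts: 1000, failure_rate: 0.2, threshold: 0.95, rng: Random.new(42))
  successes = 0
  attempts.times do
    # With one retry, a call succeeds if either attempt avoids the
    # injected failure (expected success rate: 1 - failure_rate**2).
    first_ok  = rng.rand >= failure_rate
    second_ok = rng.rand >= failure_rate
    successes += 1 if first_ok || second_ok
  end
  rate = successes.fdiv(attempts)
  { success_rate: rate, pass: rate >= threshold }
end
```

Seeding the random source is what keeps such a gate from becoming the kind of flaky test that blocks deployments.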
Continuous Production Verification
Continuous production verification runs chaos experiments constantly in production environments. The system automatically selects scenarios, executes experiments at regular intervals, evaluates results, and alerts on anomalies. This approach provides continuous validation that the system maintains resilience properties despite daily changes.
Implementation requires sophisticated automation that selects appropriate scenarios based on recent deployments, production traffic patterns, and historical experiments. The system must implement blast radius controls, automated halt conditions, and result analysis. Organizations need mature incident response processes since experiments occasionally reveal real problems requiring immediate attention.
The continuous approach provides maximum confidence in system resilience. Teams detect regressions within hours rather than weeks. The frequent experiments normalize failure injection, making teams more comfortable with operational uncertainty. However, the approach requires significant investment in automation tooling and organizational commitment to resilience.
Fault Injection at the Service Level
Service-level fault injection embeds chaos capabilities directly into application code rather than using external tools. Services include configuration-driven fault injection that introduces errors, latency, or resource constraints based on runtime parameters. Teams activate injection through feature flags, enabling experiments without deploying new code.
This approach provides fine-grained control over failure modes. Application code can simulate specific error conditions, corrupt data structures, or introduce business logic failures that external tools cannot create. The injection code lives alongside the application code, making it easy to maintain and update. Teams can inject failures at specific code paths rather than affecting entire services.
The disadvantage is maintenance overhead. Each service needs fault injection code. The injection logic can become complex and introduce bugs. Teams must remember to activate and deactivate injection, or build automation to manage it. Services accumulate injection code that never runs in normal operation, increasing testing burden.
Tools & Ecosystem
The chaos engineering ecosystem includes commercial platforms, open-source tools, cloud provider services, and language-specific libraries. Selection depends on infrastructure environment, experiment sophistication, and budget.
Commercial Platforms
Gremlin provides a comprehensive chaos engineering platform supporting AWS, Azure, Google Cloud, and Kubernetes environments. The platform offers pre-built attack types including instance termination, resource exhaustion, network failures, and state manipulation. Gremlin includes blast radius controls, automated halt conditions, and integrated metrics analysis. The commercial platform reduces implementation time but requires budget allocation and vendor lock-in acceptance.
Harness Chaos Engineering (formerly ChaosNative Litmus) offers both open-source and commercial options. The platform emphasizes Kubernetes-native chaos engineering with extensive support for container orchestration failures. Teams can define chaos experiments as YAML manifests, integrating experiments into GitOps workflows. The commercial tier adds scheduled experiments, centralized reporting, and enterprise support.
Open Source Tools
Chaos Monkey, Netflix's original tool, remains widely used. It randomly terminates instances in AWS, Azure, or Google Cloud according to configurable schedules and rules, and integrates with Spinnaker for deployment coordination. The tool provides basic instance termination but lacks sophisticated failure injection capabilities.
Chaos Toolkit offers language-agnostic chaos engineering through extensions and drivers. Teams define experiments in JSON or YAML describing steady state probes, experimental actions, and rollback procedures. The tool supports numerous platforms through drivers including Kubernetes, AWS, Azure, and Google Cloud. Chaos Toolkit emphasizes experiment reproducibility and version control integration.
Toxiproxy, created by Shopify, specializes in network failure injection. The proxy sits between services and introduces latency, bandwidth throttling, connection errors, and data corruption. Toxiproxy works particularly well for testing application resilience to dependency failures. The tool provides programmatic control through HTTP API or client libraries in multiple languages.
Pumba focuses on Docker and Kubernetes chaos engineering. The tool can kill, pause, or temporarily stop containers, and can introduce network delays, packet loss, bandwidth limits, and packet corruption. Pumba supports randomized and scheduled experiments, making it suitable for both manual and automated chaos engineering.
Cloud Provider Services
AWS Fault Injection Simulator provides managed chaos engineering for AWS services. The service supports injecting failures into EC2 instances, ECS tasks, EKS pods, and RDS databases. Experiments target specific resources using tags or resource identifiers. The service includes built-in templates for common scenarios and integrates with AWS monitoring services. The managed approach reduces operational overhead but limits experiments to AWS environments.
Azure Chaos Studio offers similar capabilities for Azure resources. The platform supports experiments on virtual machines, Kubernetes clusters, and Azure services. Chaos Studio emphasizes integration with Azure Monitor and Application Insights for experiment analysis. Microsoft provides pre-defined experiments and supports custom experiments through extensibility mechanisms.
Ruby-Specific Libraries
The chaos_rb gem provides Ruby applications with built-in chaos engineering capabilities. The library supports error injection, latency injection, and resource exhaustion simulation. Applications configure chaos experiments through environment variables or configuration files, enabling activation without code changes. The gem integrates with popular Ruby web frameworks including Rails and Sinatra.
Semian, created by Shopify, implements circuit breakers and bulkheading for Ruby applications. While not strictly a chaos engineering tool, Semian enables testing of resilience patterns by allowing controlled activation of circuit breakers and resource limits. The library protects applications from cascading failures and dependency overload.
Kubernetes Chaos Engineering
Chaos Mesh provides comprehensive chaos engineering for Kubernetes clusters. The tool injects failures at multiple levels including pod failures, network chaos, stress testing, and file system failures. Teams define chaos experiments as Kubernetes custom resources, integrating experiments into cluster management workflows. Chaos Mesh includes a web interface for experiment management and visualization.
PowerfulSeal tests Kubernetes cluster resilience by killing pods, draining nodes, and introducing network issues. The tool supports policy-driven chaos where teams define rules for automated chaos injection. PowerfulSeal emphasizes testing cluster resilience to node failures and pod rescheduling scenarios.
Ruby Implementation
Ruby applications can implement chaos engineering through built-in fault injection, external tool integration, or framework-specific patterns. The implementation approach depends on application architecture and infrastructure environment.
Built-in Fault Injection with Middleware
Rack middleware provides a natural integration point for fault injection in Ruby web applications. The middleware intercepts requests and introduces failures based on configuration:
require 'zlib' # stable CRC32 digest for consistent request sampling

class ChaosMiddleware
  def initialize(app, config = {})
    @app = app
    @error_rate = config.fetch(:error_rate, 0.0)
    @latency_rate = config.fetch(:latency_rate, 0.0)
    @latency_ms = config.fetch(:latency_ms, 100)
    @enabled = config.fetch(:enabled, false)
  end

  def call(env)
    return @app.call(env) unless @enabled
    return @app.call(env) unless chaos_enabled_for_request?(env)

    inject_latency if should_inject_latency?
    inject_error if should_inject_error?

    @app.call(env)
  end

  private

  def chaos_enabled_for_request?(env)
    # Never inject chaos into health checks
    path = env['PATH_INFO']
    return false if path.start_with?('/health')

    # Enable for 10% of requests based on request ID. A stable digest
    # (CRC32) is used instead of String#hash, which is seeded per process
    # and would select a different 10% in every worker.
    request_id = env['HTTP_X_REQUEST_ID']
    return false unless request_id

    Zlib.crc32(request_id) % 100 < 10
  end

  def should_inject_latency?
    rand < @latency_rate
  end

  def should_inject_error?
    rand < @error_rate
  end

  def inject_latency
    sleep(@latency_ms / 1000.0)
  end

  def inject_error
    raise StandardError, "Chaos engineering error injection"
  end
end
The middleware configuration happens in the application initialization:
# config.ru or application configuration
chaos_config = {
  enabled: ENV['CHAOS_ENABLED'] == 'true',
  error_rate: ENV.fetch('CHAOS_ERROR_RATE', '0.01').to_f,
  latency_rate: ENV.fetch('CHAOS_LATENCY_RATE', '0.05').to_f,
  latency_ms: ENV.fetch('CHAOS_LATENCY_MS', '200').to_i
}
use ChaosMiddleware, chaos_config
Dependency Chaos Injection
Applications can inject failures into external dependencies using HTTP client wrappers:
require 'timeout' # Timeout::Error used for simulated timeouts

class ChaosHTTPClient
  def initialize(client, config = {})
    @client = client
    @enabled = config.fetch(:enabled, false)
    @timeout_rate = config.fetch(:timeout_rate, 0.0)
    @error_rate = config.fetch(:error_rate, 0.0)
    @latency_rate = config.fetch(:latency_rate, 0.0)
    @latency_ms = config.fetch(:latency_ms, 100)
  end

  def get(url, options = {})
    execute_with_chaos(:get, url, options)
  end

  def post(url, body, options = {})
    execute_with_chaos(:post, url, body, options)
  end

  private

  def execute_with_chaos(method, *args)
    return @client.send(method, *args) unless @enabled

    inject_latency if rand < @latency_rate
    raise Timeout::Error, "Chaos timeout" if rand < @timeout_rate
    raise StandardError, "Chaos error" if rand < @error_rate

    @client.send(method, *args)
  end

  def inject_latency
    sleep(@latency_ms / 1000.0)
  end
end
# Usage in application
http_client = ChaosHTTPClient.new(
  Faraday.new,
  enabled: ENV['CHAOS_HTTP_ENABLED'] == 'true',
  timeout_rate: 0.01,
  error_rate: 0.02,
  latency_rate: 0.05,
  latency_ms: 300
)
Database Connection Chaos
Testing database resilience requires injecting failures at the connection level:
# PG::ConnectionBad comes from the pg gem; substitute the driver's
# connection error class if using a different database.
class ChaosDatabase
  def initialize(connection_pool, config = {})
    @pool = connection_pool
    @enabled = config.fetch(:enabled, false)
    @disconnect_rate = config.fetch(:disconnect_rate, 0.0)
    @slow_query_rate = config.fetch(:slow_query_rate, 0.0)
    @slow_query_ms = config.fetch(:slow_query_ms, 1000)
  end

  def execute(sql)
    return @pool.execute(sql) unless @enabled

    if rand < @disconnect_rate
      raise PG::ConnectionBad, "Chaos connection failure"
    end
    if rand < @slow_query_rate
      sleep(@slow_query_ms / 1000.0)
    end

    @pool.execute(sql)
  end

  def transaction(&block)
    return @pool.transaction(&block) unless @enabled

    if rand < @disconnect_rate
      raise PG::ConnectionBad, "Chaos transaction failure"
    end

    @pool.transaction(&block)
  end
end
Circuit Breaker Testing
Applications using circuit breakers can inject failures to verify breaker behavior:
require 'semian'

# Configure circuit breaker with Semian
Semian.register(
  :payment_service,
  tickets: 10,
  timeout: 0.5,
  error_threshold: 3,
  error_timeout: 10,
  success_threshold: 2
)

class PaymentService
  def process_payment(amount)
    Semian[:payment_service].acquire do
      # Inject chaos to trigger circuit breaker
      if chaos_enabled? && rand < chaos_error_rate
        raise StandardError, "Chaos payment error"
      end
      make_payment_api_call(amount)
    end
  rescue Semian::OpenCircuitError => e
    # Circuit breaker opened due to failures
    Rails.logger.warn("Payment service circuit open: #{e.message}")
    queue_for_retry(amount)
  end

  private

  def chaos_enabled?
    ENV['CHAOS_CIRCUIT_BREAKER'] == 'true'
  end

  def chaos_error_rate
    ENV.fetch('CHAOS_CIRCUIT_ERROR_RATE', '0.3').to_f
  end
end
Resource Exhaustion Simulation
Testing application behavior under resource constraints:
class ChaosResourceManager
  def self.exhaust_memory(megabytes, duration_seconds)
    return unless ENV['CHAOS_MEMORY'] == 'true'

    arrays = []
    megabytes.times do
      # ~1 MB per array: 131,072 slots of 8 bytes each on 64-bit MRI
      arrays << Array.new(131_072, 0)
    end
    sleep(duration_seconds)
    arrays.clear
    GC.start
  end

  def self.exhaust_cpu(duration_seconds)
    return unless ENV['CHAOS_CPU'] == 'true'

    start_time = Time.now
    threads = []
    # Spawn threads to consume CPU. Under MRI's GVL these threads contend
    # for a single core; use separate processes for multi-core load.
    4.times do
      threads << Thread.new do
        while Time.now - start_time < duration_seconds
          # Busy work
          1000.times { Math.sqrt(rand) }
        end
      end
    end
    threads.each(&:join)
  end

  def self.exhaust_file_descriptors(count, duration_seconds)
    return unless ENV['CHAOS_FD'] == 'true'

    files = []
    count.times do |i|
      files << File.open("/tmp/chaos_fd_#{i}", 'w')
    end
    sleep(duration_seconds)
    files.each(&:close)
    count.times { |i| File.delete("/tmp/chaos_fd_#{i}") rescue nil }
  end
end

# Usage in background job or test
ChaosResourceManager.exhaust_memory(500, 30) # 500MB for 30 seconds
ChaosResourceManager.exhaust_cpu(60)         # CPU stress for 60 seconds
Automated Experiment Scheduling
Ruby applications can schedule chaos experiments using background job frameworks:
class ChaosExperimentJob
  include Sidekiq::Job

  def perform(experiment_type, config)
    return unless Rails.env.production?
    return unless experiment_enabled?(experiment_type)

    experiment = create_experiment(experiment_type, config)
    begin
      experiment.setup
      experiment.execute
      record_success(experiment)
    rescue => e
      experiment.halt
      record_failure(experiment, e)
      raise
    ensure
      experiment.cleanup
    end
  end

  private

  def experiment_enabled?(type)
    # Redis.current is deprecated in recent redis-rb versions; prefer
    # injecting a shared client instance.
    Redis.current.get("chaos:experiment:#{type}:enabled") == 'true'
  end

  def create_experiment(type, config)
    case type
    when 'instance_termination'
      InstanceTerminationExperiment.new(config)
    when 'network_latency'
      NetworkLatencyExperiment.new(config)
    when 'dependency_failure'
      DependencyFailureExperiment.new(config)
    else
      raise ArgumentError, "Unknown experiment type: #{type}"
    end
  end

  def record_success(experiment)
    Metrics.increment('chaos.experiment.success',
                      tags: ["type:#{experiment.type}"])
  end

  def record_failure(experiment, error)
    Metrics.increment('chaos.experiment.failure',
                      tags: ["type:#{experiment.type}"])
    Rails.logger.error("Chaos experiment failed: #{error.message}")
  end
end

# Schedule experiments
ChaosExperimentJob.perform_in(1.hour, 'network_latency', {
  target_service: 'payment_api',
  latency_ms: 500,
  duration_seconds: 300
})
Practical Examples
Example 1: Database Failover Testing
A Ruby application depends on a PostgreSQL primary-replica setup. The chaos experiment verifies the application handles primary database failure gracefully:
class DatabaseFailoverExperiment
def initialize(config)
@duration_seconds = config.fetch(:duration, 300)
@primary_host = config.fetch(:primary_host)
@replica_host = config.fetch(:replica_host)
@steady_state_threshold = config.fetch(:success_rate_threshold, 0.99)
end
def execute
# Measure steady state before experiment
baseline_metrics = measure_steady_state(60)
unless meets_threshold?(baseline_metrics)
raise "Baseline below threshold: #{baseline_metrics.inspect}"
end
# Inject failure: block traffic to primary database
block_database_traffic(@primary_host)
# Monitor system behavior during failure
failure_metrics = measure_during_failure(@duration_seconds)
# Restore traffic
restore_database_traffic(@primary_host)
# Wait for recovery
sleep(30)
# Measure recovery metrics
recovery_metrics = measure_steady_state(60)
analyze_results(baseline_metrics, failure_metrics, recovery_metrics)
end
private
def measure_steady_state(duration)
success_count = 0
total_count = 0
latency_samples = []
start_time = Time.now
while Time.now - start_time < duration
result = perform_sample_query
total_count += 1
success_count += 1 if result.success?
latency_samples << result.latency_ms if result.success?
sleep(1)
end
{
success_rate: success_count.to_f / total_count,
p99_latency: calculate_percentile(latency_samples, 0.99),
sample_count: total_count
}
end
def measure_during_failure(duration)
# Similar measurement but expects degraded performance
measure_steady_state(duration)
end
def meets_threshold?(metrics)
metrics[:success_rate] >= @steady_state_threshold
end
def calculate_percentile(samples, percentile)
# Nearest-rank percentile over the collected latency samples
return 0 if samples.empty?
sorted = samples.sort
sorted[(percentile * (sorted.length - 1)).round]
end
def block_database_traffic(host)
# Use iptables or cloud provider API to block traffic
system("iptables -A OUTPUT -d #{host} -j DROP")
end
def restore_database_traffic(host)
system("iptables -D OUTPUT -d #{host} -j DROP")
end
def analyze_results(baseline, failure, recovery)
# The application should failover to replica
# Success rate should remain high during failure
# Latency may increase but should remain acceptable
# Recovery should return to baseline
results = {
baseline_success: baseline[:success_rate],
failure_success: failure[:success_rate],
recovery_success: recovery[:success_rate],
baseline_latency: baseline[:p99_latency],
failure_latency: failure[:p99_latency],
recovery_latency: recovery[:p99_latency]
}
if failure[:success_rate] < 0.95
raise "Application failed during database failover: #{results.inspect}"
end
Logger.info("Database failover experiment passed: #{results.inspect}")
results
end
end
Example 2: Dependency Timeout Testing
Testing that the application properly handles slow dependencies without cascading failures:
class DependencyTimeoutExperiment
def initialize(config)
@target_service = config.fetch(:target_service)
@latency_ms = config.fetch(:latency_ms)
@duration_seconds = config.fetch(:duration_seconds)
@affected_percentage = config.fetch(:affected_percentage, 0.5)
end
def execute
# Instrument dependency calls
instrument_service_calls
# Measure baseline
baseline = measure_metrics(30)
# Inject latency using Toxiproxy; the toxiproxy gem removes the
# toxic automatically when the apply block exits
failure_metrics = nil
Toxiproxy[@target_service].downstream(:latency,
latency: @latency_ms,
jitter: (@latency_ms * 0.2).to_i).apply do
# Measure during latency
failure_metrics = measure_metrics(@duration_seconds)
end
# Measure recovery
sleep(30)
recovery_metrics = measure_metrics(30)
evaluate_timeout_handling(baseline, failure_metrics, recovery_metrics)
end
private
def instrument_service_calls
# Track calls to the dependency
ServiceCallTracker.instrument(@target_service) do |call|
{
duration_ms: call.duration_ms,
success: call.success?,
timeout: call.timeout?,
circuit_open: call.circuit_open?
}
end
end
def measure_metrics(duration)
calls = ServiceCallTracker.calls_for(@target_service,
duration: duration)
# Guard against division by zero when no calls were recorded
raise "No calls recorded for #{@target_service}" if calls.empty?
{
total_calls: calls.count,
success_rate: calls.count(&:success?) / calls.count.to_f,
timeout_rate: calls.count(&:timeout?) / calls.count.to_f,
circuit_open_rate: calls.count(&:circuit_open?) / calls.count.to_f,
p95_latency: calculate_percentile(calls.map(&:duration_ms), 0.95)
}
end
def evaluate_timeout_handling(baseline, failure, recovery)
# Circuit breaker should open when timeouts exceed threshold
# Application should handle circuit open gracefully
# Recovery should restore normal operation
if failure[:timeout_rate] < 0.3
raise "Expected higher timeout rate, got #{failure[:timeout_rate]}"
end
if failure[:circuit_open_rate] < 0.5
raise "Circuit breaker did not open: #{failure[:circuit_open_rate]}"
end
if recovery[:success_rate] < baseline[:success_rate] * 0.95
raise "Recovery incomplete: #{recovery.inspect}"
end
Logger.info("Dependency timeout handled correctly: #{failure.inspect}")
end
end
Example 3: Message Queue Processing Chaos
Testing Sidekiq job processing resilience when Redis becomes unavailable:
class MessageQueueChaosExperiment
def initialize(config)
@chaos_duration = config.fetch(:duration_seconds)
@redis_host = config.fetch(:redis_host)
@acceptable_job_loss_rate = config.fetch(:acceptable_job_loss, 0.01)
end
def execute
# Enqueue test jobs continuously
test_jobs = start_job_enqueuing
# Measure baseline processing
sleep(30)
baseline_processed = count_processed_jobs(test_jobs)
# Block Redis traffic to simulate failure
block_redis_traffic
# Continue enqueuing jobs during failure
sleep(@chaos_duration)
failure_processed = count_processed_jobs(test_jobs)
# Restore Redis
restore_redis_traffic
# Stop enqueuing before draining so the final job count is stable
stop_job_enqueuing(test_jobs)
# Wait for queue to drain
wait_for_queue_drain
# Count final processed jobs
final_processed = count_processed_jobs(test_jobs)
evaluate_queue_resilience(test_jobs.count, final_processed)
ensure
stop_job_enqueuing(test_jobs)
end
private
def start_job_enqueuing
job_ids = []
@enqueuing_thread = Thread.new do
loop do
job_id = SecureRandom.uuid
TestJob.perform_async(job_id)
job_ids << job_id
sleep(0.1)
rescue => e
Logger.debug("Enqueue failed during chaos: #{e.message}")
end
end
job_ids
end
def stop_job_enqueuing(job_ids)
@enqueuing_thread.kill if @enqueuing_thread
end
def count_processed_jobs(job_ids)
# Check database or tracking system for processed jobs
ProcessedJobTracker.count_processed(job_ids)
end
def block_redis_traffic
system("iptables -A OUTPUT -d #{@redis_host} -j DROP")
end
def restore_redis_traffic
system("iptables -D OUTPUT -d #{@redis_host} -j DROP")
end
def wait_for_queue_drain
timeout = 300
start = Time.now
while Sidekiq::Queue.new.size > 0
break if Time.now - start > timeout
sleep(5)
end
end
def evaluate_queue_resilience(total_jobs, processed_jobs)
loss_rate = (total_jobs - processed_jobs).to_f / total_jobs
if loss_rate > @acceptable_job_loss_rate
raise "Job loss rate #{loss_rate} exceeds threshold"
end
Logger.info("Queue chaos handled with #{processed_jobs}/#{total_jobs} processed")
end
end
class TestJob
include Sidekiq::Job
def perform(job_id)
# Record that this job processed successfully
ProcessedJobTracker.mark_processed(job_id)
end
end
Example 4: Memory Leak Simulation
Testing application behavior under memory pressure:
class MemoryPressureExperiment
def initialize(config)
@target_memory_mb = config.fetch(:memory_mb)
@duration_seconds = config.fetch(:duration_seconds)
@memory_leak_rate = config.fetch(:leak_rate_mb_per_minute)
end
def execute
# Measure baseline memory and response times
baseline = measure_system_health
# Start gradual memory leak
leak_thread = start_memory_leak
# Monitor system degradation
monitor_degradation(@duration_seconds)
# Stop leak and force GC
leak_thread.kill
force_garbage_collection
# Measure recovery
sleep(30)
recovery = measure_system_health
analyze_memory_handling(baseline, recovery)
ensure
leak_thread.kill if leak_thread&.alive?
end
private
def start_memory_leak
Thread.new do
leaked_memory = []
loop do
# Allocate memory at controlled rate
mb_to_allocate = @memory_leak_rate / 60.0
leaked_memory << Array.new((mb_to_allocate * 1024 * 1024 / 8).to_i, 0)
sleep(1)
# Stop if target reached
current_mb = leaked_memory.sum { |a| a.size * 8 / 1024 / 1024 }
break if current_mb >= @target_memory_mb
end
end
end
def monitor_degradation(duration)
measurements = []
start_time = Time.now
while Time.now - start_time < duration
health = measure_system_health
measurements << health
# Check for unacceptable degradation
if health[:response_time_p99] > 5000
Logger.warn("Response time exceeded threshold: #{health.inspect}")
end
if health[:error_rate] > 0.05
Logger.error("Error rate exceeded threshold: #{health.inspect}")
halt_experiment
break
end
sleep(10)
end
measurements
end
def measure_system_health
{
memory_mb: get_process_memory_mb,
response_time_p99: measure_response_time_p99,
error_rate: measure_error_rate,
throughput_rps: measure_throughput
}
end
def get_process_memory_mb
`ps -o rss= -p #{Process.pid}`.to_i / 1024
end
def force_garbage_collection
3.times do
GC.start(full_mark: true, immediate_sweep: true)
sleep(1)
end
end
def analyze_memory_handling(baseline, recovery)
# Application should maintain acceptable performance under memory pressure
# GC should prevent unbounded growth
# Recovery should restore baseline performance
Logger.info("Memory pressure test completed: baseline=#{baseline}, recovery=#{recovery}")
end
end
Real-World Applications
Financial Services Transaction Processing
A payment processing company runs continuous chaos experiments to ensure transaction integrity during failures. The experiments terminate payment service instances randomly, introduce database connection failures, and simulate network partitions between payment processing and fraud detection services.
The chaos program discovered that payment retries occasionally created duplicate charges when network failures occurred at specific timing windows. The team added idempotency keys to payment API calls and implemented distributed transaction coordinators. Automated experiments run every 6 hours in production, targeting 5% of payment traffic. The experiments halt immediately if the duplicate transaction rate exceeds 0.01% or if payment success rate drops below 99.9%.
The continuous validation caught a regression when a developer removed retry logic during a refactoring. The automated experiment failed within 4 hours of deployment, before the change affected customer payments. The team reverted the deployment and added tests to prevent similar regressions.
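The idempotency-key fix lends itself to a short sketch. `IdempotentPaymentClient` and the in-memory result store are illustrative stand-ins for the company's real gateway client and durable key store, assuming the gateway exposes a `charge` method:

```ruby
require 'securerandom'

# Illustrative sketch of idempotent payment submission. A production
# version would persist keys in a durable store (with a TTL), not a Hash.
class IdempotentPaymentClient
  def initialize(gateway)
    @gateway = gateway
    @completed = {} # idempotency_key => prior result
  end

  # Safe to call from retry logic: the gateway is charged at most
  # once per idempotency key, even if the caller retries.
  def charge(amount_cents, idempotency_key: SecureRandom.uuid)
    @completed[idempotency_key] ||= @gateway.charge(amount_cents)
  end
end
```

A retry after a network timeout reuses the original key, so a charge that succeeded server-side before the connection dropped is not applied twice.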
E-Commerce Platform Scaling
An online retailer conducts weekly game days where teams inject failures during peak traffic periods. The experiments include terminating application servers, introducing latency to product catalog services, and simulating payment gateway outages.
A game day revealed that the shopping cart service cascaded failures to the checkout service when the product inventory API became slow. The cart service called inventory synchronously for every cart item, and slow responses caused request timeouts. The team refactored the cart service to call inventory asynchronously and cache results. They added circuit breakers with fallback to stale inventory data.
The retailer automated the most valuable experiments after game days validated their effectiveness. Automated experiments run during non-peak hours, gradually increasing traffic percentage as confidence builds. The experiments discovered that auto-scaling policies reacted too slowly to sudden traffic spikes, leading to temporary capacity shortages. The team adjusted auto-scaling triggers and added standing capacity buffers.
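The cache-with-stale-fallback behavior described above can be sketched as follows; `InventoryWithFallback` and the fetcher callable are hypothetical names, not the retailer's actual code:

```ruby
# Sketch of serving stale cached inventory when the live call fails,
# so a slow or down inventory API degrades the cart instead of breaking it.
class InventoryWithFallback
  CacheEntry = Struct.new(:value, :stored_at)

  def initialize(fetcher, ttl_seconds: 60)
    @fetcher = fetcher
    @ttl = ttl_seconds
    @cache = {}
  end

  def stock_for(sku)
    entry = @cache[sku]
    return entry.value if entry && Time.now - entry.stored_at < @ttl
    begin
      fresh = @fetcher.call(sku)
      @cache[sku] = CacheEntry.new(fresh, Time.now)
      fresh
    rescue StandardError
      # Degrade to stale data rather than failing the whole request
      entry ? entry.value : raise
    end
  end
end
```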
Video Streaming Service Resilience
A video streaming platform runs chaos experiments targeting their content delivery network, origin servers, and encoding pipeline. Experiments simulate CDN node failures, origin server overload, and encoding job failures.
The experiments revealed that video players retried failed segment fetches without exponential backoff, creating thundering herd problems when CDN nodes failed. During a node failure, thousands of players simultaneously retried requests, overwhelming the origin server. The team implemented exponential backoff with jitter in player retry logic and added rate limiting to origin servers.
Continuous experiments test different failure scenarios daily. The platform measures video start time, buffering ratio, and error rates during experiments. Experiments halt if buffering ratio exceeds 0.5% or start time increases beyond 5 seconds. The automated system has caught performance regressions in player code, CDN configuration changes that reduced cache hit rates, and origin server capacity issues.
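A minimal sketch of exponential backoff with full jitter, the general shape of the fix described above (the function name and parameters are illustrative):

```ruby
# Exponential backoff with "full jitter": the delay ceiling doubles per
# attempt up to a cap, and the actual delay is drawn uniformly below it,
# scattering retries that would otherwise arrive in lockstep.
def retry_delay(attempt, base: 0.5, cap: 30.0, rng: Random.new)
  ceiling = [cap, base * (2**attempt)].min
  rng.rand * ceiling # uniform in [0, ceiling)
end
```

Players that failed together now retry at scattered times, so a recovering CDN node or origin server sees a trickle of requests instead of a synchronized wave.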
Microservices API Gateway
An API gateway handling authentication and routing for 200+ microservices implements automated chaos testing for circuit breaker behavior. Experiments inject failures into backend services to verify that circuit breakers open appropriately and the gateway provides sensible fallback responses.
The experiments discovered that circuit breakers shared state across all gateway instances, creating coordination overhead. When a backend service failed, all gateway instances attempted to update circuit breaker state simultaneously, causing contention. The team moved to independent circuit breakers per instance with eventual consistency rather than strong consistency.
Automated experiments run continuously, targeting different backend services hourly. Each experiment injects failures into one service while monitoring whether circuit breakers open within acceptable time and whether other services remain unaffected. The experiments also verify that circuit breakers transition to half-open after the cooldown period and probe backend services as they recover.
Distributed Database Cluster
A company operating a distributed database performs chaos experiments to validate consensus algorithms and data replication. Experiments partition network segments, introduce packet loss, and terminate database nodes during write operations.
The experiments identified a split-brain scenario where network partitions caused two nodes to simultaneously believe they were cluster leaders. The team strengthened leader election algorithms and added fencing tokens to prevent split-brain. They also discovered that rebuilding nodes after failures consumed too much bandwidth, impacting production traffic. The team implemented bandwidth throttling for rebuilds.
The chaos program includes quarterly disaster recovery drills where teams shut down entire data centers. The drills verify that databases fail over correctly, replication catches up, and applications adapt to new database endpoints. The drills take 4-6 hours and involve coordination across multiple teams.
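The fencing-token safeguard can be sketched as a store that rejects writes carrying a token older than one it has already accepted; `FencedStore` is an illustrative name, not the company's implementation:

```ruby
class StaleTokenError < StandardError; end

# A resource that accepts writes only with a monotonically non-decreasing
# fencing token, so a deposed leader holding an old token cannot write.
class FencedStore
  def initialize
    @highest_token = -1
    @data = {}
  end

  def write(token, key, value)
    raise StaleTokenError, "token #{token} < #{@highest_token}" if token < @highest_token
    @highest_token = token
    @data[key] = value
  end

  def read(key)
    @data[key]
  end
end
```

Each leader election issues a strictly larger token, so after a partition heals, writes from the old leader fail loudly instead of silently clobbering the new leader's data.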
Reference
Experiment Types
| Type | Description | Typical Impact Scope |
|---|---|---|
| Instance Termination | Kill application servers or containers | Single service availability |
| Network Latency | Add delay to network requests | Request timeouts, degraded performance |
| Network Partition | Block network communication between services | Service isolation, split-brain |
| Packet Loss | Drop percentage of network packets | Retry storms, connection failures |
| DNS Failure | Return errors for DNS lookups | Service discovery failures |
| Disk I/O Delay | Slow disk read/write operations | Database slowdown, log delays |
| CPU Exhaustion | Consume CPU resources | Request queuing, timeouts |
| Memory Exhaustion | Consume memory resources | Out of memory errors, thrashing |
| Dependency Failure | Simulate external service failures | Cascading failures, timeouts |
| Database Connection Failure | Terminate or block database connections | Transaction failures, connection pool exhaustion |
| Message Queue Failure | Block or delay message processing | Job backlog, processing delays |
Blast Radius Control Techniques
| Technique | Implementation | Risk Level |
|---|---|---|
| Percentage-based targeting | Affect only X% of requests or instances | Low |
| Geographic limitation | Limit experiments to single region | Low |
| Canary targeting | Target specific instance or customer segment | Low |
| Time-based limiting | Run experiments for fixed duration | Medium |
| Automatic halt conditions | Stop if metrics exceed threshold | Medium |
| Manual approval | Require human approval to proceed | Low (highest control) |
| Instance count minimum | Never reduce below minimum capacity | Low (essential safeguard) |
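Percentage-based targeting from the table above is often implemented with deterministic hashing rather than random sampling, so a given request or user stays consistently inside or outside the blast radius. A minimal sketch (the function name is assumed):

```ruby
require 'zlib'

# Hash the request or user ID into one of 100 buckets; IDs in buckets
# below the configured percentage fall inside the experiment cohort.
# Deterministic, so the same ID always gets the same answer.
def in_blast_radius?(request_id, percentage)
  Zlib.crc32(request_id.to_s) % 100 < percentage
end
```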
Steady State Metrics
| Metric Type | Examples | Measurement Approach |
|---|---|---|
| Availability | Uptime percentage, successful request rate | Monitor HTTP status codes |
| Latency | Response time percentiles, time to first byte | Track request durations |
| Throughput | Requests per second, transactions per minute | Count completed operations |
| Error Rate | Failed requests per total requests | Track error responses |
| Data Integrity | Checksums, duplicate detection, consistency checks | Validate data correctness |
| Business Metrics | Conversion rate, revenue per minute, user sessions | Monitor business outcomes |
Common Halt Conditions
| Condition | Threshold Example | Action |
|---|---|---|
| Error rate increase | Error rate exceeds 1% | Immediate halt |
| Latency increase | P99 latency exceeds 5x baseline | Immediate halt |
| Availability drop | Success rate below 99% | Immediate halt |
| Business impact | Revenue drop exceeds $100/minute | Immediate halt |
| Custom metrics | Cart abandonment exceeds 10% | Team review |
| Duration limit | Experiment runs longer than 30 minutes | Automatic halt |
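Halt conditions like those above are straightforward to encode as a table of named checks evaluated against live metrics. A minimal sketch with illustrative condition names:

```ruby
# Evaluates a set of named halt conditions against a metrics snapshot.
# Each condition is a lambda returning true when its threshold is breached.
class HaltMonitor
  def initialize(conditions)
    @conditions = conditions
  end

  # Returns the names of breached conditions; an empty array means
  # the experiment may keep running.
  def breached(metrics)
    @conditions.select { |_name, check| check.call(metrics) }.keys
  end
end
```

An experiment loop would call `breached` on every measurement tick and trigger rollback as soon as the result is non-empty.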
Ruby Chaos Tools Configuration
| Tool/Gem | Installation | Basic Configuration |
|---|---|---|
| chaos_rb | gem install chaos_rb | ENV variables for error/latency rates |
| Semian | gem install semian | Circuit breaker thresholds and timeouts |
| Toxiproxy | gem install toxiproxy | HTTP API or client library configuration |
| ActiveRecord Chaos | Custom middleware | Database connection pool wrapper |
| Rack Chaos Middleware | Custom implementation | Request interception with failure injection |
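The "Rack Chaos Middleware" row describes a custom implementation; a minimal sketch might look like this (the class name and 503 response are illustrative choices):

```ruby
# Rack middleware that fails a configurable fraction of requests before
# they reach the application, returning a 503 for the affected ones.
class ChaosMiddleware
  def initialize(app, error_rate: 0.0, rng: Random.new)
    @app = app
    @error_rate = error_rate
    @rng = rng
  end

  def call(env)
    if @rng.rand < @error_rate
      [503, { 'content-type' => 'text/plain' }, ['chaos: injected failure']]
    else
      @app.call(env)
    end
  end
end
```

Mounted with `use ChaosMiddleware, error_rate: ENV.fetch('CHAOS_ERROR_RATE', '0').to_f`, the rate can be tuned or zeroed without a deploy.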
Experiment Analysis Questions
| Question | Data Required | Analysis Method |
|---|---|---|
| Did steady state maintain during experiment? | Baseline and experiment metrics | Statistical comparison |
| What was the blast radius? | Affected instances, requests, customers | Impact measurement |
| How quickly did the system recover? | Time-series metrics after experiment | Recovery time analysis |
| Were circuit breakers effective? | Circuit open/close events | Event correlation |
| Did alerts fire appropriately? | Alert history during experiment | Alert verification |
| What was the customer impact? | Business metrics, support tickets | Customer impact assessment |
Chaos Engineering Maturity Levels
| Level | Characteristics | Typical Activities |
|---|---|---|
| 1 - Ad Hoc | Manual experiments, no automation | Occasional instance termination |
| 2 - Scheduled | Regular game days, some automation | Monthly planned chaos events |
| 3 - Integrated | Automated experiments in CI/CD | Pre-deployment chaos testing |
| 4 - Continuous | Always-on production experiments | Hourly automated experiments |
| 5 - Advanced | Automated experiment selection and analysis | Self-optimizing chaos program |
Ruby Circuit Breaker Patterns
| Pattern | Implementation | Use Case |
|---|---|---|
| Fail Fast | Raise error immediately when circuit open | User-facing requests |
| Fallback Response | Return cached or default value | Product catalogs, recommendations |
| Degraded Mode | Provide limited functionality | Search with reduced features |
| Queue for Retry | Store failed requests for later processing | Background jobs, analytics |
| Timeout Protection | Enforce maximum wait time | External API calls |
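The "Fail Fast" and "Fallback Response" rows combine naturally in a single breaker. A minimal sketch with illustrative thresholds, not a substitute for a production library such as Semian:

```ruby
# Minimal circuit breaker: trips open after consecutive failures, serves
# a fallback (or fails fast) while open, and half-opens after a cooldown.
class SimpleCircuitBreaker
  class OpenError < StandardError; end

  def initialize(error_threshold: 3, reset_after: 30)
    @error_threshold = error_threshold
    @reset_after = reset_after
    @failures = 0
    @opened_at = nil
  end

  def call(fallback: nil)
    if open?
      return fallback.call if fallback
      raise OpenError
    end
    result = yield
    @failures = 0
    result
  rescue OpenError
    raise
  rescue StandardError
    @failures += 1
    @opened_at = Time.now if @failures >= @error_threshold
    return fallback.call if fallback
    raise
  end

  private

  def open?
    return false unless @opened_at
    if Time.now - @opened_at > @reset_after
      # Half-open: reset and allow a trial request through
      @opened_at = nil
      @failures = 0
      false
    else
      true
    end
  end
end
```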
Automation Scheduling Considerations
| Factor | Consideration | Recommendation |
|---|---|---|
| Time of Day | Avoid peak traffic hours | Schedule during low traffic |
| Deployment Frequency | Coordinate with releases | Wait 1 hour after deployments |
| Team Availability | Ensure on-call coverage | Align with on-call schedules |
| Experiment Duration | Balance insight vs risk | Start with 5-10 minute experiments |
| Frequency | Build confidence gradually | Weekly to daily as maturity increases |
| Geographic Distribution | Respect regional peak times | Rotate regions for experiments |