Overview
Site Reliability Engineering (SRE) applies software engineering principles to infrastructure and operations challenges. Google originated the discipline in 2003 to manage large-scale systems through automation, monitoring, and systematic problem-solving rather than traditional operations approaches.
SRE treats operations as a software problem. Instead of manual interventions and reactive firefighting, SRE teams write code to automate operational tasks, build reliable systems through engineering practices, and measure reliability through quantifiable metrics. The discipline bridges development and operations by having software engineers run production systems.
The core premise distinguishes SRE from traditional operations: reliability is a feature that engineers build into systems, not a property that operations teams maintain through heroic effort. This shift moves reliability left in the development cycle, making it a design consideration rather than an operational afterthought.
SRE emerged from the recognition that scaling systems and teams requires different approaches than traditional operations. Manual operations scale linearly with system size, creating unsustainable staffing requirements. SRE addresses this through:
Automation - Replace manual operational work with software systems that perform tasks reliably at scale without human intervention.
Measurement - Define reliability through quantifiable Service Level Indicators (SLIs) and Service Level Objectives (SLOs) that guide engineering decisions.
Balance - Accept that perfect reliability is expensive and unnecessary. Define acceptable reliability levels and use error budgets to balance feature velocity with stability.
Toil Reduction - Systematically eliminate repetitive operational work that lacks enduring value, freeing engineering time for improving systems.
The discipline requires teams to maintain both development skills and operational knowledge. SREs write production code, design distributed systems, and respond to incidents, applying software engineering rigor to each activity.
Key Principles
SRE operates on several fundamental principles that distinguish it from traditional operations approaches and define how reliability engineering teams function.
Service Level Indicators and Objectives
SLIs measure specific aspects of service behavior that matter to users. An SLI quantifies one dimension of reliability: request latency, error rate, system throughput, or data durability. SLIs use metrics that users experience directly rather than infrastructure-focused measurements.
SLOs set target values for SLIs that define acceptable service behavior. An SLO specifies that 99.9% of requests must complete within 200ms, or that the error rate must stay below 0.1%. SLOs create a shared understanding between engineering teams and stakeholders about expected service quality.
The relationship between SLIs and SLOs establishes a reliability contract. SLIs provide measurement mechanisms. SLOs define success criteria. Together they answer: "Is the service reliable enough?"
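As an illustration (the latency figures below are hypothetical sample data, not from any particular service), a latency SLI can be computed directly as the fraction of requests completing under the SLO threshold:

```ruby
# Compute a latency SLI: the percentage of requests completing
# within a threshold. The sample latencies are hypothetical.
def latency_sli(latencies_ms, threshold_ms)
  return 0.0 if latencies_ms.empty?

  fast = latencies_ms.count { |l| l <= threshold_ms }
  (fast.to_f / latencies_ms.size * 100).round(2)
end

latencies = [120, 95, 180, 210, 150, 90, 300, 110, 140, 130]
latency_sli(latencies, 200)
# 8 of the 10 requests finish within 200ms, so the SLI is 80.0
```

Comparing that number against the SLO target answers the reliability question directly.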
Error Budgets
Error budgets quantify acceptable unreliability. If an SLO targets 99.9% availability, the service has a 0.1% error budget - the amount of downtime that falls within acceptable parameters. Error budgets transform reliability from a subjective goal into a concrete resource.
Teams consume error budget through incidents, deployments, and planned maintenance. When error budget remains, teams can take risks: deploy more frequently, experiment with new features, or push performance improvements. When error budget depletes, teams focus on stability: slow deployments, fix bugs, improve monitoring, and reduce risk.
This mechanism balances innovation and reliability. Perfect reliability requires moving slowly and avoiding changes. Zero reliability produces unusable services. Error budgets define the optimal middle ground where teams can innovate within acceptable risk parameters.
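The arithmetic behind an error budget is simple. A small sketch converting an availability SLO into a concrete downtime allowance over a 30-day window:

```ruby
# Convert an availability SLO into a downtime budget (in minutes)
# over a rolling window. 99.9% over 30 days allows 43.2 minutes.
def downtime_budget_minutes(slo_percentage, window_days: 30)
  window_minutes = window_days * 24 * 60
  (window_minutes * (1 - slo_percentage / 100.0)).round(1)
end

downtime_budget_minutes(99.9)   # => 43.2 minutes per 30 days
downtime_budget_minutes(99.99)  # => 4.3 minutes per 30 days
```

Each additional nine cuts the budget by a factor of ten, which is why SLO targets deserve cost-benefit scrutiny.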
Toil
Toil describes operational work that is manual, repetitive, automatable, tactical, lacks enduring value, and scales linearly with service growth. Toil differs from overhead (meetings, planning) and engineering work (design, coding). Examples include manually resetting services, running deployment scripts, processing tickets for routine requests, or manually scaling capacity.
SRE teams target toil reduction because toil prevents engineering work that improves reliability. A team spending 80% of its time on toil has only 20% capacity for automation, architectural improvements, or reliability projects. This creates a self-reinforcing cycle in which operational load crowds out the very engineering work needed to reduce operational load.
Google's SRE guidance suggests limiting toil to 50% of an SRE's time. The remaining time goes to engineering projects: automation, tool development, system design, and reliability improvements. This ratio ensures teams continuously reduce operational burden rather than accepting it as inevitable.
Monitoring and Alerting
SRE monitoring distinguishes between symptoms and causes. Symptom-based monitoring alerts on user-visible problems: high error rates, slow responses, or service unavailability. Cause-based monitoring tracks internal state: CPU usage, memory consumption, or queue depth. Alerts trigger on symptoms. Dashboards display causes.
This approach prevents alert fatigue and reduces mean time to detection. Traditional monitoring alerts on every potential problem, creating noise and training teams to ignore alerts. Symptom-based alerting fires only when users experience problems, ensuring every alert represents a real issue requiring response.
The monitoring system must answer three questions: what is broken, why is it broken, and what actions will fix it. Alerts answer the first question. Dashboards and logs answer the second. Runbooks answer the third. This information structure reduces incident response time by providing responders with context immediately.
Blameless Postmortems
After incidents, SRE teams conduct postmortems that focus on system improvements rather than individual mistakes. Blameless postmortems assume that everyone involved made reasonable decisions given available information and competing priorities. The analysis identifies systemic issues that enabled the incident rather than assigning fault to individuals.
Effective postmortems document the timeline, root causes, impact, and action items. The timeline establishes what happened and when. Root causes identify why the incident occurred and what conditions allowed it. Impact quantifies user effects and error budget consumption. Action items specify concrete changes to prevent recurrence.
The blameless approach encourages honest reporting and learning. When teams fear punishment, incidents go unreported, reducing organizational learning. Blameless culture treats incidents as learning opportunities and system design feedback rather than failures requiring discipline.
Implementation Approaches
Organizations implement SRE through different models depending on size, culture, and existing team structures.
Dedicated SRE Teams
The dedicated team model creates specialized SRE teams that own reliability for specific services or platforms. These teams work alongside development teams, providing reliability expertise, managing production systems, and building operational tooling. The SRE team acts as a consulting group, embedded partner, or service provider depending on organizational needs.
This approach works well for organizations with multiple complex services requiring dedicated reliability focus. SRE teams develop deep operational expertise and build reusable tools that benefit multiple services. The model concentrates reliability knowledge, making it easier to establish standards and share best practices.
Dedicated teams require sufficient scale to justify specialized roles. Organizations need enough services and complexity to keep SRE teams engaged in meaningful engineering work rather than pure operational toil. The model works best when SRE teams maintain 50% or more time for engineering projects.
Embedded SRE Model
The embedded model places SRE engineers within development teams rather than forming separate SRE organizations. Each product team includes members with SRE skills who focus on reliability, operations, and production concerns while remaining part of the development team structure.
This approach distributes reliability ownership across the organization. Every team takes responsibility for their service's reliability rather than handing off operational concerns to a separate group. The model scales naturally with team growth and avoids bottlenecks from centralized SRE resources.
Embedded models require investing in reliability skills across engineering teams. Organizations must train developers on operational practices, monitoring, incident response, and production systems. This investment increases overall engineering capability but requires more time and resources than concentrating expertise in dedicated teams.
SRE as Consultancy
The consultancy model creates an SRE organization that provides guidance, reviews, and recommendations to development teams rather than directly managing services. SRE consultants review architecture, suggest improvements, define SLOs, and teach reliability practices. Development teams own their production services with SRE guidance.
This lightweight model works for organizations where development teams have operational skills but need reliability expertise for complex decisions. The consultancy scales better than dedicated teams because one SRE consultant can advise multiple teams. However, it requires development teams to execute SRE recommendations themselves.
The consultancy approach establishes reliability standards without creating operational dependencies. Development teams remain fully responsible for their services while benefiting from SRE knowledge. This model suits organizations with mature engineering teams that need guidance rather than hands-on operational support.
Platform SRE
Platform SRE teams build internal platforms and tools that enable other teams to operate reliably. Rather than managing specific services, platform SRE creates deployment systems, monitoring infrastructure, incident response tools, and operational frameworks. Development teams use these platforms to run their own services.
This model treats reliability as a platform problem. By building excellent operational tools, platform SRE enables all teams to run reliable services without requiring dedicated SRE support for each service. The approach scales through tooling rather than direct service management.
Platform SRE requires significant upfront investment in tooling and infrastructure. The model works best for larger organizations with many services sharing common operational needs. Platform teams must balance building generic tools with supporting specific team requirements.
Ruby Implementation
Ruby provides multiple tools and libraries for implementing SRE practices, from automation scripts to monitoring systems and operational tooling.
Service Health Checks
require 'net/http'
require 'json'
require 'time' # for Time#iso8601

class ServiceHealthCheck
  def initialize(service_url)
    @service_url = service_url
    @uri = URI("#{service_url}/health")
  end

  def check
    start_time = Time.now
    response = Net::HTTP.get_response(@uri)
    latency = Time.now - start_time

    {
      healthy: response.code == '200',
      latency_ms: (latency * 1000).round(2),
      status_code: response.code,
      timestamp: Time.now.iso8601
    }
  rescue StandardError => e
    {
      healthy: false,
      error: e.message,
      timestamp: Time.now.iso8601
    }
  end
end

# Monitor multiple services
services = [
  'https://api.example.com',
  'https://web.example.com',
  'https://admin.example.com'
]

results = services.map do |service_url|
  checker = ServiceHealthCheck.new(service_url)
  [service_url, checker.check]
end.to_h

puts JSON.pretty_generate(results)
SLO Measurement and Tracking
class SLOTracker
  attr_reader :total_requests, :successful_requests

  def initialize(target_percentage)
    @target_percentage = target_percentage
    @total_requests = 0
    @successful_requests = 0
    @errors = []
  end

  def record_request(success:, latency_ms: nil, error: nil)
    @total_requests += 1
    if success
      @successful_requests += 1
    else
      @errors << {
        timestamp: Time.now,
        latency: latency_ms,
        error: error
      }
    end
  end

  def current_slo
    return 0.0 if @total_requests.zero?

    (@successful_requests.to_f / @total_requests * 100).round(2)
  end

  def error_budget_remaining
    target_failures = @total_requests * (1 - @target_percentage / 100.0)
    actual_failures = @total_requests - @successful_requests
    remaining = target_failures - actual_failures
    # Guard against division by zero before any requests are recorded
    percentage =
      target_failures.zero? ? 100.0 : (remaining / target_failures * 100).round(2)

    {
      target_failures: target_failures.round(2),
      actual_failures: actual_failures,
      remaining: remaining.round(2),
      percentage: percentage
    }
  end

  def slo_breached?
    current_slo < @target_percentage
  end

  def report
    {
      target_slo: @target_percentage,
      current_slo: current_slo,
      total_requests: @total_requests,
      successful_requests: @successful_requests,
      failed_requests: @total_requests - @successful_requests,
      error_budget: error_budget_remaining,
      breached: slo_breached?
    }
  end
end

# Track availability SLO of 99.9%
slo = SLOTracker.new(99.9)

# Simulate requests
1000.times do
  success = rand < 0.998 # 99.8% success rate
  latency = rand(50..200)
  slo.record_request(
    success: success,
    latency_ms: latency,
    error: success ? nil : "Service unavailable"
  )
end

puts JSON.pretty_generate(slo.report)
# Example output (exact values vary because the simulation is random):
# {
#   "target_slo": 99.9,
#   "current_slo": 99.8,
#   "total_requests": 1000,
#   "successful_requests": 998,
#   "failed_requests": 2,
#   "error_budget": {...},
#   "breached": true
# }
Automated Incident Response
require 'slack-notifier'

class IncidentResponder
  def initialize(slack_webhook_url)
    @notifier = Slack::Notifier.new(slack_webhook_url)
    @incidents = []
  end

  def detect_incident(service_name, metrics)
    return unless incident_conditions_met?(metrics)

    incident = create_incident(service_name, metrics)
    @incidents << incident
    notify_team(incident)
    execute_remediation(incident)
    incident
  end

  private

  def incident_conditions_met?(metrics)
    metrics[:error_rate] > 1.0 ||
      metrics[:latency_p99] > 1000 ||
      metrics[:availability] < 99.0
  end

  def create_incident(service_name, metrics)
    {
      id: generate_incident_id,
      service: service_name,
      severity: determine_severity(metrics),
      started_at: Time.now,
      metrics: metrics,
      status: 'open'
    }
  end

  def generate_incident_id
    "INC-#{Time.now.strftime('%Y%m%d')}-#{rand(1000..9999)}"
  end

  def determine_severity(metrics)
    return 'critical' if metrics[:availability] < 95.0
    return 'high' if metrics[:error_rate] > 5.0
    return 'medium' if metrics[:latency_p99] > 2000

    'low'
  end

  def notify_team(incident)
    message = format_alert_message(incident)
    @notifier.ping(message, channel: '#incidents')
  end

  def format_alert_message(incident)
    <<~MESSAGE
      :rotating_light: Incident Detected: #{incident[:id]}
      Service: #{incident[:service]}
      Severity: #{incident[:severity].upcase}
      Started: #{incident[:started_at]}
      Metrics:
      - Error Rate: #{incident[:metrics][:error_rate]}%
      - P99 Latency: #{incident[:metrics][:latency_p99]}ms
      - Availability: #{incident[:metrics][:availability]}%
    MESSAGE
  end

  def execute_remediation(incident)
    case incident[:severity]
    when 'critical'
      trigger_failover(incident[:service])
      scale_capacity(incident[:service], factor: 2)
    when 'high'
      restart_unhealthy_instances(incident[:service])
    when 'medium'
      clear_caches(incident[:service])
    end
  end

  def trigger_failover(service)
    puts "Triggering failover for #{service}"
    # Implementation would interact with infrastructure
  end

  def scale_capacity(service, factor:)
    puts "Scaling #{service} capacity by #{factor}x"
    # Implementation would interact with auto-scaling groups
  end

  def restart_unhealthy_instances(service)
    puts "Restarting unhealthy instances for #{service}"
    # Implementation would interact with orchestration system
  end

  def clear_caches(service)
    puts "Clearing caches for #{service}"
    # Implementation would interact with cache systems
  end
end

# Monitor and respond to incidents
responder = IncidentResponder.new(ENV['SLACK_WEBHOOK_URL'])

metrics = {
  error_rate: 5.2,
  latency_p99: 1500,
  availability: 98.5
}

incident = responder.detect_incident('payment-api', metrics)
Deployment Automation with Error Budget Checks
class DeploymentGuard
  def initialize(slo_tracker, min_error_budget_percentage: 20)
    @slo_tracker = slo_tracker
    @min_error_budget_percentage = min_error_budget_percentage
  end

  def can_deploy?
    budget = @slo_tracker.error_budget_remaining
    if budget[:percentage] < @min_error_budget_percentage
      {
        allowed: false,
        reason: "Insufficient error budget",
        current_budget: budget[:percentage],
        required_budget: @min_error_budget_percentage
      }
    else
      {
        allowed: true,
        current_budget: budget[:percentage]
      }
    end
  end

  def deploy_with_rollback
    guard_result = can_deploy?
    unless guard_result[:allowed]
      raise DeploymentBlockedError, guard_result[:reason]
    end

    initial_metrics = capture_metrics
    begin
      yield
      sleep(300) # Observe for 5 minutes
      post_deployment_metrics = capture_metrics
      if deployment_degraded_service?(initial_metrics, post_deployment_metrics)
        # The rescue below performs the rollback before re-raising
        raise DeploymentFailedError, "Service degradation detected"
      end
      { success: true, metrics: post_deployment_metrics }
    rescue StandardError
      rollback
      raise
    end
  end

  private

  def capture_metrics
    {
      success_rate: @slo_tracker.current_slo,
      timestamp: Time.now
    }
  end

  def deployment_degraded_service?(before, after)
    degradation = before[:success_rate] - after[:success_rate]
    degradation > 0.5 # More than a 0.5-point SLO drop indicates issues
  end

  def rollback
    puts "Executing rollback..."
    # Implementation would revert deployment
  end
end

class DeploymentBlockedError < StandardError; end
class DeploymentFailedError < StandardError; end
Tools & Ecosystem
SRE practices rely on extensive tooling for monitoring, alerting, deployment, and incident management.
Monitoring and Observability
Prometheus forms the foundation of many SRE monitoring stacks. The system collects time-series metrics through a pull model, stores them efficiently, and provides a query language for analysis. Prometheus integrates with Ruby applications through client libraries that expose metrics via HTTP endpoints.
Grafana visualizes metrics from Prometheus and other data sources. Teams build dashboards showing service health, SLO compliance, and system behavior. Grafana supports alerting based on query results, complementing Prometheus's alert manager.
New Relic and Datadog provide commercial observability platforms with Ruby agents that automatically instrument applications. These platforms collect traces, metrics, and logs in a unified system, reducing the operational burden of maintaining separate monitoring components.
Ruby Monitoring Libraries
The prometheus-client gem instruments Ruby applications for Prometheus:
require 'prometheus/client'

prometheus = Prometheus::Client.registry

http_requests = prometheus.counter(
  :http_requests_total,
  docstring: 'Total HTTP requests',
  labels: [:method, :path, :status]
)

http_duration = prometheus.histogram(
  :http_request_duration_seconds,
  docstring: 'HTTP request duration',
  labels: [:method, :path]
)

# In Rack middleware
class MetricsMiddleware
  def initialize(app, requests_counter, duration_histogram)
    @app = app
    @requests = requests_counter
    @duration = duration_histogram
  end

  def call(env)
    start_time = Time.now
    status, headers, body = @app.call(env)
    duration = Time.now - start_time

    method = env['REQUEST_METHOD']
    path = env['PATH_INFO']

    @requests.increment(labels: { method: method, path: path, status: status.to_s })
    @duration.observe(duration, labels: { method: method, path: path })

    [status, headers, body]
  end
end
Honeycomb provides distributed tracing and observability. The Ruby beeline automatically instruments common libraries and frameworks:
require 'honeycomb-beeline'

Honeycomb.configure do |config|
  config.write_key = ENV['HONEYCOMB_WRITEKEY']
  config.dataset = 'production'
  config.service_name = 'payment-service'
end

Honeycomb.start_span(name: 'process_payment') do |span|
  span.add_field('user_id', user.id)
  span.add_field('amount', payment.amount)
  result = process_payment(payment)
  span.add_field('result', result.status)
  result
end
Incident Management
PagerDuty manages on-call schedules, incident routing, and escalation policies. Ruby applications integrate through the PagerDuty API to create incidents programmatically:
require 'net/http'
require 'json'
require 'socket' # for Socket.gethostname
require 'time'   # for Time#iso8601

class PagerDutyIncident
  def initialize(integration_key)
    @integration_key = integration_key
    @api_url = 'https://events.pagerduty.com/v2/enqueue'
  end

  def trigger(summary:, severity:, details:)
    payload = {
      routing_key: @integration_key,
      event_action: 'trigger',
      payload: {
        summary: summary,
        severity: severity,
        source: Socket.gethostname,
        timestamp: Time.now.iso8601,
        custom_details: details
      }
    }

    uri = URI(@api_url)
    http = Net::HTTP.new(uri.host, uri.port)
    http.use_ssl = true

    request = Net::HTTP::Post.new(uri.path, 'Content-Type' => 'application/json')
    request.body = payload.to_json

    response = http.request(request)
    JSON.parse(response.body)
  end
end
Opsgenie provides similar incident management capabilities with different workflow options. VictorOps (now Splunk On-Call) offers another alternative with timeline-based incident views.
Deployment and Release Management
GitHub Actions automates deployments with SLO checks integrated into CI/CD pipelines. Ruby scripts validate error budgets before allowing deployments to proceed.
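A sketch of such a gate, assuming a hypothetical internal metrics service that exposes an /error_budget endpoint (the endpoint and its response shape are invented for illustration):

```ruby
require 'net/http'
require 'json'

# CI deployment gate sketch. The /error_budget endpoint and the
# JSON shape {"percentage": 35} are hypothetical.
def fetch_error_budget(metrics_url)
  JSON.parse(Net::HTTP.get(URI("#{metrics_url}/error_budget")))
end

def budget_allows_deploy?(budget, min_percentage: 20)
  budget.fetch('percentage', 0) >= min_percentage
end

# In a CI step (METRICS_URL would be set by the pipeline):
#   budget = fetch_error_budget(ENV['METRICS_URL'])
#   unless budget_allows_deploy?(budget)
#     warn 'Deployment blocked: error budget below threshold'
#     exit 1  # nonzero exit fails the pipeline step
#   end
```

Exiting nonzero is what actually blocks the pipeline; the CI system needs no SLO awareness of its own.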
Kubernetes handles container orchestration, providing deployment strategies like rolling updates, blue-green deployments, and canary releases. The Ruby Kubernetes client interacts with cluster resources:
require 'kubeclient'

# Deployments live in the apps/v1 API group, not core v1
client = Kubeclient::Client.new(
  'https://kubernetes.default.svc/apis/apps',
  'v1',
  # VERIFY_NONE is for illustration only; supply the cluster CA in production
  ssl_options: { verify_ssl: OpenSSL::SSL::VERIFY_NONE }
)

# Check deployment status
deployment = client.get_deployment('payment-api', 'production')
ready_replicas = deployment.status.readyReplicas
desired_replicas = deployment.spec.replicas

if ready_replicas == desired_replicas
  puts "Deployment healthy: #{ready_replicas}/#{desired_replicas} ready"
else
  puts "Deployment degraded: #{ready_replicas}/#{desired_replicas} ready"
end
Spinnaker provides sophisticated deployment pipelines with automated canary analysis and rollback capabilities. Ruby services deploy through Spinnaker pipelines that validate health metrics before promoting releases.
Alerting Systems
Alertmanager receives alerts from Prometheus and routes them based on severity, team, and service. Its configuration defines routing trees that match alerts to notification channels:
route:
  group_by: ['alertname', 'service']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'team-pager'
  routes:
    - match:
        severity: critical
      receiver: 'pagerduty-critical'
    - match:
        severity: warning
      receiver: 'slack-warnings'
Ruby applications define custom alert receivers that process webhook notifications and execute automated responses.
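A minimal sketch of such a receiver as a Rack application; the automated response is stubbed out, and the payload shape follows Alertmanager's webhook format (an "alerts" array of objects with "labels" and "annotations"):

```ruby
require 'json'

# Rack app sketch for an Alertmanager webhook receiver.
# Alertmanager POSTs JSON like:
#   {"alerts": [{"labels": {...}, "annotations": {...}}, ...]}
class AlertWebhookReceiver
  def call(env)
    body = JSON.parse(env['rack.input'].read)
    body.fetch('alerts', []).each do |alert|
      handle_alert(alert['labels'] || {}, alert['annotations'] || {})
    end
    [200, { 'content-type' => 'text/plain' }, ['ok']]
  rescue JSON::ParserError
    [400, { 'content-type' => 'text/plain' }, ['bad request']]
  end

  private

  def handle_alert(labels, annotations)
    # Automated response hook: restarting, scaling, or ticketing
    # would be triggered here based on the alert's labels.
    puts "#{labels['severity']}: #{annotations['summary']}"
  end
end

# In config.ru:  run AlertWebhookReceiver.new
```

Keeping the receiver idempotent matters here, since Alertmanager re-sends unresolved alerts at the configured repeat interval.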
Real-World Applications
SRE practices apply across different organizational scales and service types, with implementation patterns varying based on context.
High-Traffic Web Services
Organizations running high-traffic web applications implement SRE through comprehensive monitoring, automated capacity management, and rigorous change control. These services handle millions of requests per day, making reliability critical for business operations.
The SRE team maintains detailed SLO definitions covering availability, latency percentiles, and error rates. Monitoring systems track these metrics continuously, with alerting thresholds set below SLO boundaries to provide early warning before customer impact.
Capacity planning automation responds to traffic patterns, scaling infrastructure based on demand forecasts and real-time metrics. Ruby services use auto-scaling integrations that add capacity when request rates increase or when error budgets show signs of depletion.
Change management processes require all deployments to pass through automated verification. Services deploy using canary releases that expose changes to small traffic percentages initially, expanding only after metrics confirm no degradation. Deployment gates check error budgets and block releases when reliability concerns exist.
Microservices Architectures
Organizations with microservices architectures face distributed reliability challenges. Each service requires its own SLOs, but service dependencies create cascading failure risks. SRE teams implement circuit breakers, timeouts, and fallback behaviors to contain failures.
Service mesh technology like Istio provides observability, traffic management, and security across microservices. Ruby services integrate with service meshes through sidecar proxies that handle cross-cutting concerns like retries, timeouts, and distributed tracing.
The SRE team establishes service ownership models where each microservice has a responsible team. SRE provides frameworks and platforms that service teams use to operate reliably without requiring dedicated SRE support for each service.
Incident response in microservices requires understanding service dependencies and failure propagation. Ruby applications instrument their dependency chains, recording trace context that helps responders identify which service initiated failures during incidents.
Data Pipeline Reliability
Data processing pipelines require different SRE approaches than request-response services. Instead of latency and availability SLOs, data pipelines track processing latency, data quality, and throughput. Ruby applications processing data streams implement monitoring for pipeline health.
Backpressure handling prevents upstream services from overwhelming downstream components. Ruby data processors implement queue-based architectures with configurable limits that reject work when capacity is exceeded rather than failing silently.
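A bounded-queue sketch of this pattern using only the Ruby standard library; the offer/poll names and the capacity are illustrative:

```ruby
# Backpressure via a bounded queue: producers are explicitly
# rejected once capacity is reached, rather than blocking or
# failing silently. Capacity and method names are illustrative.
class BoundedWorkQueue
  def initialize(capacity)
    @capacity = capacity
    @queue = []
    @mutex = Mutex.new
  end

  # Returns true if accepted, false when full, letting the caller
  # shed load or signal the upstream producer to back off.
  def offer(item)
    @mutex.synchronize do
      return false if @queue.size >= @capacity
      @queue << item
      true
    end
  end

  def poll
    @mutex.synchronize { @queue.shift }
  end
end

queue = BoundedWorkQueue.new(2)
queue.offer(:a)  # => true
queue.offer(:b)  # => true
queue.offer(:c)  # => false (rejected; upstream must slow down)
```

The explicit `false` return is the point: rejection becomes an observable event the producer can react to, instead of an invisible stall.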
Data quality monitoring validates pipeline outputs, checking for schema violations, null values, or suspicious patterns that indicate processing errors. Automated alerts notify teams when data quality degrades, enabling rapid response before downstream consumers encounter issues.
Retry mechanisms handle transient failures in data processing. Ruby workers implement exponential backoff with jitter, avoiding retry storms that worsen system load during incidents. Dead letter queues capture messages that fail processing repeatedly, allowing manual investigation without blocking pipeline progress.
Platform Services
Organizations building internal platforms apply SRE principles to developer-facing services like CI/CD systems, API gateways, and authentication services. These platforms require high reliability because outages block all dependent teams.
Platform SRE teams define strict SLOs reflecting the platform's critical role. Authentication service SLOs target 99.99% availability because authentication failures prevent users from accessing any services. Ruby authentication services implement redundancy, fast failover, and degraded mode operations that maintain core functionality during partial failures.
Documentation and self-service tools reduce operational burden on platform teams. Ruby applications expose health endpoints, metrics, and debugging interfaces that enable service consumers to troubleshoot issues without platform team involvement.
Capacity management for platforms requires understanding usage patterns across multiple consuming teams. Ruby platform services track per-tenant metrics, identifying heavy users and enforcing rate limits that prevent single consumers from impacting overall platform reliability.
Common Patterns
SRE teams repeatedly apply certain patterns when building and operating reliable services.
Gradual Rollouts
Gradual rollout strategies minimize blast radius when deploying changes. Rather than updating all instances simultaneously, deployments proceed in stages, with monitoring gates between stages that halt rollouts if problems emerge.
Canary deployments expose changes to a small percentage of traffic initially. Ruby applications route traffic based on headers, cookies, or random selection, directing a portion to the new version. Metrics comparison between canary and baseline versions determines whether to proceed.
class CanaryRouter
  def initialize(canary_percentage:, canary_version:, baseline_version:)
    @canary_percentage = canary_percentage
    @canary_version = canary_version
    @baseline_version = baseline_version
  end

  def select_version(request_id)
    # Note: String#hash is randomized per Ruby process; use a stable
    # digest (e.g. Digest::MD5) if routing must stay consistent
    # across processes or restarts.
    canary_traffic = (request_id.hash % 100) < @canary_percentage
    canary_traffic ? @canary_version : @baseline_version
  end

  def evaluate_canary(canary_metrics, baseline_metrics)
    error_rate_increase = canary_metrics[:error_rate] - baseline_metrics[:error_rate]
    latency_increase = canary_metrics[:p99_latency] - baseline_metrics[:p99_latency]

    if error_rate_increase > 0.5 || latency_increase > 100
      { proceed: false, reason: 'Metrics degradation detected' }
    else
      { proceed: true }
    end
  end
end
Blue-green deployments maintain two complete environments, routing traffic to one while updating the other. After verification, traffic switches to the updated environment instantly. Ruby applications use load balancer integration or DNS updates to implement traffic switching.
Circuit Breakers
Circuit breakers prevent cascading failures by stopping requests to failing dependencies. When error rates exceed thresholds, the circuit breaker opens, rejecting requests immediately rather than waiting for timeouts. After a recovery period, the breaker allows limited traffic through to test if the dependency recovered.
class CircuitBreaker
  STATES = [:closed, :open, :half_open].freeze

  def initialize(failure_threshold:, recovery_timeout:, success_threshold:)
    @failure_threshold = failure_threshold
    @recovery_timeout = recovery_timeout
    @success_threshold = success_threshold
    @state = :closed
    @failure_count = 0
    @success_count = 0
    @last_failure_time = nil
  end

  def call(&block)
    case @state
    when :open
      if Time.now - @last_failure_time > @recovery_timeout
        transition_to_half_open
        execute_with_monitoring(&block)
      else
        raise CircuitOpenError, "Circuit breaker open"
      end
    when :half_open, :closed
      execute_with_monitoring(&block)
    end
  end

  private

  def execute_with_monitoring
    result = yield
    record_success
    result
  rescue StandardError
    record_failure
    raise
  end

  def record_success
    if @state == :half_open
      @success_count += 1
      transition_to_closed if @success_count >= @success_threshold
    else
      @failure_count = 0 # healthy traffic resets the failure streak
    end
  end

  def record_failure
    @last_failure_time = Time.now
    if @state == :half_open
      transition_to_open # any failure while probing reopens immediately
    else
      @failure_count += 1
      transition_to_open if @failure_count >= @failure_threshold
    end
  end

  def transition_to_closed
    @state = :closed
    @failure_count = 0
    @success_count = 0
  end

  def transition_to_open
    @state = :open
    @failure_count = 0
  end

  def transition_to_half_open
    @state = :half_open
    @success_count = 0
  end
end

class CircuitOpenError < StandardError; end
Retry with Exponential Backoff
Transient failures often resolve quickly, making retries valuable. However, naive retry strategies can overwhelm recovering services. Exponential backoff spaces retries further apart with each attempt, giving systems time to recover. Jitter randomizes retry timing, preventing thundering herds where many clients retry simultaneously.
class RetryWithBackoff
  def initialize(max_attempts:, base_delay:, max_delay:, jitter: true)
    @max_attempts = max_attempts
    @base_delay = base_delay
    @max_delay = max_delay
    @jitter = jitter
  end

  def call
    attempts = 0
    begin
      attempts += 1
      yield
    rescue StandardError
      if attempts < @max_attempts
        delay = calculate_delay(attempts)
        sleep(delay)
        retry
      else
        raise
      end
    end
  end

  private

  def calculate_delay(attempt)
    exponential_delay = @base_delay * (2 ** (attempt - 1))
    capped_delay = [exponential_delay, @max_delay].min
    if @jitter
      # Add up to 30% random jitter to spread out simultaneous retries
      random_jitter = rand(0..capped_delay * 0.3)
      capped_delay + random_jitter
    else
      capped_delay
    end
  end
end

# Usage
retry_handler = RetryWithBackoff.new(
  max_attempts: 3,
  base_delay: 1,
  max_delay: 10
)

retry_handler.call do
  external_api.fetch_data
end
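With those parameters, the jitter-free delay schedule can be made concrete. The helper below is a hypothetical standalone mirror of `calculate_delay`, not part of the class:

```ruby
# Hypothetical mirror of RetryWithBackoff#calculate_delay (without jitter),
# shown to make the backoff schedule concrete.
def backoff_delay(attempt, base_delay: 1, max_delay: 10)
  exponential = base_delay * (2 ** (attempt - 1)) # 1, 2, 4, 8, 16, ...
  [exponential, max_delay].min                    # capped at max_delay
end

delays = (1..5).map { |n| backoff_delay(n) }
# With base_delay: 1 and max_delay: 10 this yields [1, 2, 4, 8, 10]
```

The cap matters: without it, a long outage would produce multi-minute sleeps; with it, retries settle at a steady, bounded interval.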
Feature Flags for Risk Reduction
Feature flags decouple deployment from release, allowing code deployment without immediately activating new behavior. Ruby applications check feature flag state before executing new code paths, enabling gradual rollout, A/B testing, and instant rollback without redeploying.
require 'digest'
require 'json'

class FeatureFlags
  def initialize(redis_client)
    @redis = redis_client
  end

  def enabled?(flag_name, user_id: nil, context: {})
    flag_config = fetch_flag_config(flag_name)
    return false unless flag_config

    if user_id && flag_config['user_whitelist']&.include?(user_id)
      return true
    end

    percentage = flag_config['rollout_percentage'] || 0
    return false if percentage.zero?
    return true if percentage >= 100
    # Without a user_id every anonymous request would hash identically,
    # making the rollout all-or-nothing for them, so exclude them instead
    return false unless user_id

    # Deterministic bucketing: a given user always lands in the same bucket
    # for a given flag, so raising the percentage only ever adds users
    user_hash = Digest::MD5.hexdigest("#{flag_name}-#{user_id}").to_i(16)
    (user_hash % 100) < percentage
  end

  def disable_flag(flag_name)
    update_flag(flag_name, 'rollout_percentage' => 0)
  end

  private

  def fetch_flag_config(flag_name)
    config_json = @redis.get("feature_flag:#{flag_name}")
    JSON.parse(config_json) if config_json
  end

  def update_flag(flag_name, changes)
    config = fetch_flag_config(flag_name) || {}
    config.merge!(changes)
    @redis.set("feature_flag:#{flag_name}", config.to_json)
  end
end
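The percentage rollout depends on the MD5 bucket being deterministic: the same user always lands in the same bucket for a given flag, so a flag never flaps for an individual user as the percentage changes. A minimal sketch of that property (the flag name and user id are illustrative):

```ruby
require 'digest'

# Same bucketing scheme as FeatureFlags#enabled?: hash flag + user,
# then reduce to a bucket in 0..99 for comparison against the percentage.
def rollout_bucket(flag_name, user_id)
  Digest::MD5.hexdigest("#{flag_name}-#{user_id}").to_i(16) % 100
end

bucket = rollout_bucket('new_checkout', 42)
# Stable across calls: the user is enabled exactly when bucket < percentage,
# and raising the percentage can only add users, never remove them
stable = bucket == rollout_bucket('new_checkout', 42)
```

Hashing the flag name together with the user id also decorrelates flags: a user in the first 10% for one flag is not automatically in the first 10% for every other flag.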
Health Check Endpoints
Services expose health check endpoints that report readiness and liveness status. Kubernetes and load balancers query these endpoints to determine whether to route traffic to instances.
require 'time' # for Time#iso8601

class HealthCheck
  def initialize
    @checks = {}
  end

  def register_check(name, &check_block)
    @checks[name] = check_block
  end

  def perform_checks
    results = @checks.map do |name, check_block|
      begin
        result = check_block.call
        [name, { status: 'healthy', details: result }]
      rescue StandardError => e
        [name, { status: 'unhealthy', error: e.message }]
      end
    end.to_h

    overall_healthy = results.values.all? { |r| r[:status] == 'healthy' }
    {
      status: overall_healthy ? 'healthy' : 'unhealthy',
      checks: results,
      timestamp: Time.now.iso8601
    }
  end
end

# In a Sinatra or Rails controller
require 'net/http'

health_checker = HealthCheck.new

health_checker.register_check('database') do
  ActiveRecord::Base.connection.execute('SELECT 1')
  { connected: true }
end

health_checker.register_check('redis') do
  { connected: Redis.current.ping == 'PONG' }
end

health_checker.register_check('external_api') do
  response = Net::HTTP.get_response(URI('https://api.example.com/health'))
  { reachable: response.code == '200' }
end

get '/health' do
  result = health_checker.perform_checks
  status result[:status] == 'healthy' ? 200 : 503
  json result
end
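A Kubernetes deployment can point its probes at this endpoint. The fragment below is a sketch; the port, intervals, and thresholds are illustrative assumptions, not values prescribed by the code above:

```
livenessProbe:
  httpGet:
    path: /health
    port: 3000
  periodSeconds: 10      # probe every 10 seconds
  failureThreshold: 3    # restart after 3 consecutive failures
readinessProbe:
  httpGet:
    path: /health
    port: 3000
  periodSeconds: 5       # pull the instance out of rotation quickly
```

Because the endpoint returns 503 when any check fails, the readiness probe stops routing traffic to degraded instances without restarting them, while the liveness probe only restarts instances that stay unhealthy.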
Reference
Core SRE Metrics
| Metric | Description | Target Range |
|---|---|---|
| Availability | Percentage of time service responds successfully | 99.9% - 99.99% |
| Latency P50 | Median request completion time | < 100ms |
| Latency P99 | 99th percentile request completion time | < 500ms |
| Error Rate | Percentage of requests returning errors | < 0.1% |
| Throughput | Requests processed per second | Varies by service |
| Time to Detect | Time from incident start to detection | < 5 minutes |
| Time to Resolve | Time from detection to resolution | < 1 hour |
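The P50 and P99 targets in the table are computed from observed latency samples. A minimal sketch using the nearest-rank method, one of several common percentile definitions (the sample values are illustrative):

```ruby
# Nearest-rank percentile: sort the samples, then take the value at
# rank ceil(p/100 * n), clamped to a valid index.
def percentile(samples, p)
  sorted = samples.sort
  rank = (p / 100.0 * sorted.length).ceil
  sorted[[rank - 1, 0].max]
end

latencies_ms = [12, 40, 55, 60, 80, 95, 120, 150, 300, 900]
p50 = percentile(latencies_ms, 50)  # 80
p99 = percentile(latencies_ms, 99)  # 900
```

The gap between P50 and P99 here shows why both appear in the table: the median looks healthy while the tail is dominated by a single slow outlier.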
SLO Components
| Component | Definition | Example |
|---|---|---|
| SLI | Service Level Indicator - quantifiable measure of service quality | Request success rate |
| SLO | Service Level Objective - target value for an SLI | 99.9% of requests succeed |
| SLA | Service Level Agreement - contract with consequences for missing SLOs | 99.9% uptime or credit |
| Error Budget | Acceptable amount of unreliability | 0.1% failure rate |
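The error budget follows directly from the SLO: a 99.9% objective leaves 0.1% of the measurement window as acceptable unreliability. A quick sketch of that arithmetic:

```ruby
# Allowed downtime for a given SLO over a rolling window, in minutes.
def error_budget_minutes(slo, window_days)
  window_minutes = window_days * 24 * 60
  window_minutes * (1.0 - slo)
end

budget = error_budget_minutes(0.999, 30)
# 30 days = 43,200 minutes; 0.1% of that is ~43.2 minutes of downtime
```

Framed this way, the budget is a spendable quantity: each incident consumes minutes from it, and the remaining balance drives the policy actions below.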
Toil Categories
| Category | Characteristics | Examples |
|---|---|---|
| Manual | Requires human execution | Manually restarting services |
| Repetitive | Same actions performed repeatedly | Processing routine tickets |
| Automatable | Can be replaced with code | Deployment scripts |
| Tactical | Interrupt-driven reactive work | Responding to pages |
| No Enduring Value | Provides no permanent improvement | Temporary fixes |
| Linear Growth | Scales directly with service size | Manual capacity adjustments |
Incident Severity Levels
| Severity | Impact | Response Time | Escalation |
|---|---|---|---|
| Critical | Complete service outage or data loss | Immediate | All hands |
| High | Major feature unavailable or severe degradation | < 15 minutes | On-call + manager |
| Medium | Minor feature impaired or moderate degradation | < 1 hour | On-call engineer |
| Low | Cosmetic issues or minimal impact | Next business day | Standard queue |
Error Budget Policy Actions
| Budget Remaining | Actions | Deployment Frequency |
|---|---|---|
| > 50% | Normal operations, accept reasonable risk | Multiple per day |
| 25-50% | Increase caution, require additional review | Daily |
| 10-25% | Focus on reliability, defer non-critical features | Weekly |
| < 10% | Deployment freeze except critical fixes | Emergency only |
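The policy table can be encoded directly so deployment tooling can consult it. A sketch, with thresholds mirroring the table above (the function name is illustrative):

```ruby
# Map remaining error budget (as a fraction of the original budget)
# to the policy tier from the table above.
def deployment_policy(budget_remaining)
  if budget_remaining > 0.50
    :normal_operations   # multiple deploys per day, accept reasonable risk
  elsif budget_remaining >= 0.25
    :additional_review   # daily deploys with extra scrutiny
  elsif budget_remaining >= 0.10
    :reliability_focus   # weekly deploys, defer non-critical features
  else
    :deployment_freeze   # emergency fixes only
  end
end

deployment_policy(0.60) # :normal_operations
deployment_policy(0.05) # :deployment_freeze
```

Encoding the policy in code rather than a wiki page makes it enforceable: a CI gate can query the remaining budget and block non-critical deploys automatically.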
Monitoring Tiers
| Tier | Purpose | Alert Condition | Response |
|---|---|---|---|
| Symptom | User-facing problems | Error rate exceeds threshold | Page on-call |
| Cause | Internal component issues | High memory usage | Create ticket |
| Debug | Detailed troubleshooting data | N/A | Dashboard only |
| Audit | Historical analysis | N/A | Log retention |
Common Ruby SRE Gems
| Gem | Purpose | Use Case |
|---|---|---|
| prometheus-client | Metrics collection | Exposing Prometheus metrics |
| sentry-ruby | Error tracking | Exception monitoring |
| honeycomb-beeline | Distributed tracing | Request tracing across services |
| dogstatsd-ruby | StatsD metrics | Sending metrics to Datadog |
| flipper | Feature flags | Progressive rollouts |
| semian | Circuit breakers | Protecting against cascading failures |
| scientist | Experimental testing | Comparing implementations safely |
Postmortem Template Sections
| Section | Contents | Purpose |
|---|---|---|
| Summary | Brief incident description and impact | Quick context for readers |
| Timeline | Chronological events with timestamps | Understanding incident progression |
| Root Cause | Technical reason for failure | Identifying what broke |
| Impact | User effects and error budget consumed | Quantifying damage |
| Action Items | Specific tasks to prevent recurrence | Driving improvement |
| Lessons Learned | Insights from incident response | Organizational learning |
Deployment Strategies
| Strategy | Description | Risk Level | Rollback Speed |
|---|---|---|---|
| Big Bang | Update all instances simultaneously | High | Slow |
| Rolling | Update instances sequentially | Medium | Medium |
| Blue-Green | Maintain two environments, switch traffic | Low | Fast |
| Canary | Route small traffic percentage to new version | Very Low | Fast |
| Feature Flag | Deploy code, control activation separately | Minimal | Instant |
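The canary strategy from the table can be sketched as a staged traffic ramp that aborts when a health check fails. The stages and the `healthy` check below are hypothetical stand-ins for real traffic-shifting and metrics queries:

```ruby
# Staged canary ramp: increase the new version's traffic share step by
# step, rolling back if the health check fails at any stage.
CANARY_STAGES = [1, 5, 25, 50, 100] # percent of traffic on the new version

def canary_rollout(stages, &healthy)
  stages.each do |percent|
    # In a real system this would shift traffic, wait, and query SLIs
    return { status: :rolled_back, at_percent: percent } unless healthy.call(percent)
  end
  { status: :promoted }
end

# With a stand-in check that fails once the canary reaches 50%,
# the ramp stops there instead of promoting a bad release
result = canary_rollout(CANARY_STAGES) { |pct| pct < 50 }
# => { status: :rolled_back, at_percent: 50 }
```

The early stages are small on purpose: a regression at 1% of traffic consumes far less error budget than the same regression discovered after a big-bang deploy.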