Site Reliability Engineering

Overview

Site Reliability Engineering (SRE) applies software engineering principles to infrastructure and operations challenges. Google originated the discipline in 2003 to manage large-scale systems through automation, monitoring, and systematic problem-solving rather than traditional operations approaches.

SRE treats operations as a software problem. Instead of manual interventions and reactive firefighting, SRE teams write code to automate operational tasks, build reliable systems through engineering practices, and measure reliability through quantifiable metrics. The discipline bridges development and operations by having software engineers run production systems.

The core premise distinguishes SRE from traditional operations: reliability is a feature that engineers build into systems, not a property that operations teams maintain through heroic effort. This shift moves reliability left in the development cycle, making it a design consideration rather than an operational afterthought.

SRE emerged from the recognition that scaling systems and teams requires different approaches than traditional operations. Manual operations scale linearly with system size, creating unsustainable staffing requirements. SRE addresses this through:

Automation - Replace manual operational work with software systems that perform tasks reliably at scale without human intervention.

Measurement - Define reliability through quantifiable Service Level Indicators (SLIs) and Service Level Objectives (SLOs) that guide engineering decisions.

Balance - Accept that perfect reliability is expensive and unnecessary. Define acceptable reliability levels and use error budgets to balance feature velocity with stability.

Toil Reduction - Systematically eliminate repetitive operational work that lacks enduring value, freeing engineering time for improving systems.

The discipline requires teams to maintain both development skills and operational knowledge. SREs write production code, design distributed systems, and respond to incidents, applying software engineering rigor to each activity.

Key Principles

SRE operates on several fundamental principles that distinguish it from traditional operations approaches and define how reliability engineering teams function.

Service Level Indicators and Objectives

SLIs measure specific aspects of service behavior that matter to users. An SLI quantifies one dimension of reliability: request latency, error rate, system throughput, or data durability. SLIs use metrics that users experience directly rather than infrastructure-focused measurements.

SLOs set target values for SLIs that define acceptable service behavior. An SLO specifies that 99.9% of requests must complete within 200ms, or that the error rate must stay below 0.1%. SLOs create a shared understanding between engineering teams and stakeholders about expected service quality.

The relationship between SLIs and SLOs establishes a reliability contract. SLIs provide measurement mechanisms. SLOs define success criteria. Together they answer: "Is the service reliable enough?"
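
As a concrete illustration, the sketch below derives an availability SLI from raw request outcomes and checks it against an SLO target. The event shape, latency threshold, and target are illustrative rather than taken from any particular monitoring system.

# A minimal availability SLI over recorded request outcomes.
class AvailabilitySLI
  def initialize
    @events = []
  end

  def record(status:, latency_ms:)
    @events << { status: status, latency_ms: latency_ms }
  end

  # Fraction of requests that succeeded and met the latency threshold.
  def value(latency_threshold_ms: 200)
    return 1.0 if @events.empty?
    good = @events.count { |e| e[:status] < 500 && e[:latency_ms] <= latency_threshold_ms }
    good.to_f / @events.size
  end

  def meets_slo?(target: 0.999)
    value >= target
  end
end

sli = AvailabilitySLI.new
sli.record(status: 200, latency_ms: 120)
sli.record(status: 503, latency_ms: 950)
puts sli.value       # => 0.5
puts sli.meets_slo?  # => false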

Error Budgets

Error budgets quantify acceptable unreliability. If an SLO targets 99.9% availability, the service has a 0.1% error budget - the amount of downtime that falls within acceptable parameters. Error budgets transform reliability from a subjective goal into a concrete resource.

Teams consume error budget through incidents, deployments, and planned maintenance. When error budget remains, teams can take risks: deploy more frequently, experiment with new features, or push performance improvements. When error budget depletes, teams focus on stability: slow deployments, fix bugs, improve monitoring, and reduce risk.

This mechanism balances innovation and reliability. Perfect reliability requires moving slowly and avoiding changes. Zero reliability produces unusable services. Error budgets define the optimal middle ground where teams can innovate within acceptable risk parameters.
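
The arithmetic behind an error budget is simple enough to show directly. A minimal sketch converting an availability SLO into a concrete downtime budget for a rolling window:

# Convert an availability SLO into a downtime budget for a window.
# 99.9% over 30 days leaves 0.1% of the window as error budget.
def downtime_budget_minutes(slo_percentage, window_days: 30)
  window_minutes = window_days * 24 * 60
  budget_fraction = 1 - (slo_percentage / 100.0)
  (window_minutes * budget_fraction).round(1)
end

puts downtime_budget_minutes(99.9)   # => 43.2 minutes per 30 days
puts downtime_budget_minutes(99.99)  # => 4.3 minutes per 30 days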

Toil

Toil describes operational work that is manual, repetitive, automatable, tactical, lacks enduring value, and scales linearly with service growth. Toil differs from overhead (meetings, planning) and engineering work (design, coding). Examples include manually resetting services, running deployment scripts, processing tickets for routine requests, or manually scaling capacity.

SRE teams target toil reduction because toil prevents engineering work that improves reliability. A team spending 80% of time on toil has 20% capacity for automation, architectural improvements, or reliability projects. This creates a self-reinforcing cycle where operational load crowds out the engineering work needed to reduce operational load.

Google's SRE guidance suggests limiting toil to 50% of an SRE's time. The remaining time goes to engineering projects: automation, tool development, system design, and reliability improvements. This ratio ensures teams continuously reduce operational burden rather than accepting it as inevitable.

Monitoring and Alerting

SRE monitoring distinguishes between symptoms and causes. Symptom-based monitoring alerts on user-visible problems: high error rates, slow responses, or service unavailability. Cause-based monitoring tracks internal state: CPU usage, memory consumption, or queue depth. Alerts trigger on symptoms. Dashboards display causes.

This approach prevents alert fatigue and reduces mean time to detection. Traditional monitoring alerts on every potential problem, creating noise and training teams to ignore alerts. Symptom-based alerting fires only when users experience problems, ensuring every alert represents a real issue requiring response.

The monitoring system must answer three questions: what is broken, why is it broken, and what actions will fix it. Alerts answer the first question. Dashboards and logs answer the second. Runbooks answer the third. This information structure reduces incident response time by providing responders with context immediately.
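
A minimal sketch of this structure in Ruby: symptom-based rules that fire only on user-visible thresholds, each carrying a runbook link so the responder immediately knows what to do. Rule names, thresholds, and URLs are illustrative.

# Symptom-based alert evaluation: alerts fire on user-visible problems
# and point at a runbook; cause-level metrics stay on dashboards.
SYMPTOM_ALERTS = [
  { name: 'HighErrorRate', fires: ->(m) { m[:error_rate] > 1.0 },
    runbook: 'https://runbooks.example.com/high-error-rate' },
  { name: 'SlowResponses', fires: ->(m) { m[:latency_p99_ms] > 500 },
    runbook: 'https://runbooks.example.com/slow-responses' }
].freeze

def evaluate_alerts(metrics)
  SYMPTOM_ALERTS.select { |alert| alert[:fires].call(metrics) }
                .map { |alert| { alert: alert[:name], runbook: alert[:runbook] } }
end

p evaluate_alerts(error_rate: 2.3, latency_p99_ms: 180)
# => [{:alert=>"HighErrorRate", :runbook=>"https://runbooks.example.com/high-error-rate"}]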

Blameless Postmortems

After incidents, SRE teams conduct postmortems that focus on system improvements rather than individual mistakes. Blameless postmortems assume that everyone involved made reasonable decisions given available information and competing priorities. The analysis identifies systemic issues that enabled the incident rather than assigning fault to individuals.

Effective postmortems document the timeline, root causes, impact, and action items. The timeline establishes what happened and when. Root causes identify why the incident occurred and what conditions allowed it. Impact quantifies user effects and error budget consumption. Action items specify concrete changes to prevent recurrence.

The blameless approach encourages honest reporting and learning. When teams fear punishment, incidents go unreported, reducing organizational learning. Blameless culture treats incidents as learning opportunities and system design feedback rather than failures requiring discipline.

Implementation Approaches

Organizations implement SRE through different models depending on size, culture, and existing team structures.

Dedicated SRE Teams

The dedicated team model creates specialized SRE teams that own reliability for specific services or platforms. These teams work alongside development teams, providing reliability expertise, managing production systems, and building operational tooling. The SRE team acts as a consulting group, embedded partner, or service provider depending on organizational needs.

This approach works well for organizations with multiple complex services requiring dedicated reliability focus. SRE teams develop deep operational expertise and build reusable tools that benefit multiple services. The model concentrates reliability knowledge, making it easier to establish standards and share best practices.

Dedicated teams require sufficient scale to justify specialized roles. Organizations need enough services and complexity to keep SRE teams engaged in meaningful engineering work rather than pure operational toil. The model works best when SRE teams maintain 50% or more time for engineering projects.

Embedded SRE Model

The embedded model places SRE engineers within development teams rather than forming separate SRE organizations. Each product team includes members with SRE skills who focus on reliability, operations, and production concerns while remaining part of the development team structure.

This approach distributes reliability ownership across the organization. Every team takes responsibility for their service's reliability rather than handing off operational concerns to a separate group. The model scales naturally with team growth and avoids bottlenecks from centralized SRE resources.

Embedded models require investing in reliability skills across engineering teams. Organizations must train developers on operational practices, monitoring, incident response, and production systems. This investment increases overall engineering capability but requires more time and resources than concentrating expertise in dedicated teams.

SRE as Consultancy

The consultancy model creates an SRE organization that provides guidance, reviews, and recommendations to development teams rather than directly managing services. SRE consultants review architecture, suggest improvements, define SLOs, and teach reliability practices. Development teams own their production services with SRE guidance.

This lightweight model works for organizations where development teams have operational skills but need reliability expertise for complex decisions. The consultancy scales better than dedicated teams because one SRE consultant can advise multiple teams. However, it requires development teams to execute SRE recommendations themselves.

The consultancy approach establishes reliability standards without creating operational dependencies. Development teams remain fully responsible for their services while benefiting from SRE knowledge. This model suits organizations with mature engineering teams that need guidance rather than hands-on operational support.

Platform SRE

Platform SRE teams build internal platforms and tools that enable other teams to operate reliably. Rather than managing specific services, platform SRE creates deployment systems, monitoring infrastructure, incident response tools, and operational frameworks. Development teams use these platforms to run their own services.

This model treats reliability as a platform problem. By building excellent operational tools, platform SRE enables all teams to run reliable services without requiring dedicated SRE support for each service. The approach scales through tooling rather than direct service management.

Platform SRE requires significant upfront investment in tooling and infrastructure. The model works best for larger organizations with many services sharing common operational needs. Platform teams must balance building generic tools with supporting specific team requirements.

Ruby Implementation

Ruby provides multiple tools and libraries for implementing SRE practices, from automation scripts to monitoring systems and operational tooling.

Service Health Checks

require 'net/http'
require 'json'
require 'time'  # for Time#iso8601

class ServiceHealthCheck
  def initialize(service_url)
    @service_url = service_url
    @uri = URI("#{service_url}/health")
  end

  def check
    start_time = Time.now
    response = Net::HTTP.get_response(@uri)
    latency = Time.now - start_time

    {
      healthy: response.code == '200',
      latency_ms: (latency * 1000).round(2),
      status_code: response.code,
      timestamp: Time.now.iso8601
    }
  rescue StandardError => e
    {
      healthy: false,
      error: e.message,
      timestamp: Time.now.iso8601
    }
  end
end

# Monitor multiple services
services = [
  'https://api.example.com',
  'https://web.example.com',
  'https://admin.example.com'
]

results = services.map do |service_url|
  checker = ServiceHealthCheck.new(service_url)
  [service_url, checker.check]
end.to_h

puts JSON.pretty_generate(results)

SLO Measurement and Tracking

require 'json'

class SLOTracker
  attr_reader :total_requests, :successful_requests

  def initialize(target_percentage)
    @target_percentage = target_percentage
    @total_requests = 0
    @successful_requests = 0
    @errors = []
  end

  def record_request(success:, latency_ms: nil, error: nil)
    @total_requests += 1
    if success
      @successful_requests += 1
    else
      @errors << {
        timestamp: Time.now,
        latency: latency_ms,
        error: error
      }
    end
  end

  def current_slo
    return 0.0 if @total_requests.zero?
    (@successful_requests.to_f / @total_requests * 100).round(2)
  end

  def error_budget_remaining
    target_failures = @total_requests * (1 - @target_percentage / 100.0)
    actual_failures = @total_requests - @successful_requests
    remaining = target_failures - actual_failures

    {
      target_failures: target_failures.round(2),
      actual_failures: actual_failures,
      remaining: remaining.round(2),
      # Guard against division by zero before any requests are recorded
      percentage: target_failures.zero? ? 100.0 : (remaining / target_failures * 100).round(2)
    }
  end

  def slo_breached?
    current_slo < @target_percentage
  end

  def report
    {
      target_slo: @target_percentage,
      current_slo: current_slo,
      total_requests: @total_requests,
      successful_requests: @successful_requests,
      failed_requests: @total_requests - @successful_requests,
      error_budget: error_budget_remaining,
      breached: slo_breached?
    }
  end
end

# Track availability SLO of 99.9%
slo = SLOTracker.new(99.9)

# Simulate requests
1000.times do
  success = rand < 0.998  # 99.8% success rate
  latency = rand(50..200)
  slo.record_request(
    success: success,
    latency_ms: latency,
    error: success ? nil : "Service unavailable"
  )
end

puts JSON.pretty_generate(slo.report)
# Example output (exact values vary because the simulation is random):
# {
#   "target_slo": 99.9,
#   "current_slo": 99.8,
#   "total_requests": 1000,
#   "successful_requests": 998,
#   "failed_requests": 2,
#   "error_budget": {...},
#   "breached": true
# }

Automated Incident Response

require 'slack-notifier'

class IncidentResponder
  def initialize(slack_webhook_url)
    @notifier = Slack::Notifier.new(slack_webhook_url)
    @incidents = []
  end

  def detect_incident(service_name, metrics)
    return unless incident_conditions_met?(metrics)

    incident = create_incident(service_name, metrics)
    @incidents << incident
    
    notify_team(incident)
    execute_remediation(incident)
    
    incident
  end

  private

  def incident_conditions_met?(metrics)
    metrics[:error_rate] > 1.0 ||
      metrics[:latency_p99] > 1000 ||
      metrics[:availability] < 99.0
  end

  def create_incident(service_name, metrics)
    {
      id: generate_incident_id,
      service: service_name,
      severity: determine_severity(metrics),
      started_at: Time.now,
      metrics: metrics,
      status: 'open'
    }
  end

  def generate_incident_id
    "INC-#{Time.now.strftime('%Y%m%d')}-#{rand(1000..9999)}"
  end

  def determine_severity(metrics)
    return 'critical' if metrics[:availability] < 95.0
    return 'high' if metrics[:error_rate] > 5.0
    return 'medium' if metrics[:latency_p99] > 2000
    'low'
  end

  def notify_team(incident)
    message = format_alert_message(incident)
    @notifier.ping(message, channel: '#incidents')
  end

  def format_alert_message(incident)
    <<~MESSAGE
      :rotating_light: Incident Detected: #{incident[:id]}
      Service: #{incident[:service]}
      Severity: #{incident[:severity].upcase}
      Started: #{incident[:started_at]}
      
      Metrics:
      - Error Rate: #{incident[:metrics][:error_rate]}%
      - P99 Latency: #{incident[:metrics][:latency_p99]}ms
      - Availability: #{incident[:metrics][:availability]}%
    MESSAGE
  end

  def execute_remediation(incident)
    case incident[:severity]
    when 'critical'
      trigger_failover(incident[:service])
      scale_capacity(incident[:service], factor: 2)
    when 'high'
      restart_unhealthy_instances(incident[:service])
    when 'medium'
      clear_caches(incident[:service])
    end
  end

  def trigger_failover(service)
    puts "Triggering failover for #{service}"
    # Implementation would interact with infrastructure
  end

  def scale_capacity(service, factor:)
    puts "Scaling #{service} capacity by #{factor}x"
    # Implementation would interact with auto-scaling groups
  end

  def restart_unhealthy_instances(service)
    puts "Restarting unhealthy instances for #{service}"
    # Implementation would interact with orchestration system
  end

  def clear_caches(service)
    puts "Clearing caches for #{service}"
    # Implementation would interact with cache systems
  end
end

# Monitor and respond to incidents
responder = IncidentResponder.new(ENV['SLACK_WEBHOOK_URL'])

metrics = {
  error_rate: 5.2,
  latency_p99: 1500,
  availability: 98.5
}

incident = responder.detect_incident('payment-api', metrics)

Deployment Automation with Error Budget Checks

class DeploymentGuard
  def initialize(slo_tracker, min_error_budget_percentage: 20)
    @slo_tracker = slo_tracker
    @min_error_budget_percentage = min_error_budget_percentage
  end

  def can_deploy?
    budget = @slo_tracker.error_budget_remaining
    
    if budget[:percentage] < @min_error_budget_percentage
      {
        allowed: false,
        reason: "Insufficient error budget",
        current_budget: budget[:percentage],
        required_budget: @min_error_budget_percentage
      }
    else
      {
        allowed: true,
        current_budget: budget[:percentage]
      }
    end
  end

  def deploy_with_rollback(deployment_block)
    guard_result = can_deploy?

    unless guard_result[:allowed]
      raise DeploymentBlockedError, guard_result[:reason]
    end

    initial_metrics = capture_metrics

    begin
      deployment_block.call

      sleep(300)  # Observe for 5 minutes

      post_deployment_metrics = capture_metrics

      if deployment_degraded_service?(initial_metrics, post_deployment_metrics)
        raise DeploymentFailedError, "Service degradation detected"
      end

      { success: true, metrics: post_deployment_metrics }
    rescue StandardError
      # Roll back on any failure, including detected degradation
      rollback
      raise
    end
  end

  private

  def capture_metrics
    {
      slo: @slo_tracker.current_slo,
      timestamp: Time.now
    }
  end

  def deployment_degraded_service?(before, after)
    slo_drop = before[:slo] - after[:slo]
    slo_drop > 0.5  # More than a 0.5-point SLO drop indicates issues
  end

  def rollback
    puts "Executing rollback..."
    # Implementation would revert deployment
  end
end

class DeploymentBlockedError < StandardError; end
class DeploymentFailedError < StandardError; end

Tools & Ecosystem

SRE practices rely on extensive tooling for monitoring, alerting, deployment, and incident management.

Monitoring and Observability

Prometheus forms the foundation of many SRE monitoring stacks. The system collects time-series metrics through a pull model, stores them efficiently, and provides a query language for analysis. Prometheus integrates with Ruby applications through client libraries that expose metrics via HTTP endpoints.

Grafana visualizes metrics from Prometheus and other data sources. Teams build dashboards showing service health, SLO compliance, and system behavior. Grafana supports alerting based on query results, complementing Prometheus's alert manager.

New Relic and Datadog provide commercial observability platforms with Ruby agents that automatically instrument applications. These platforms collect traces, metrics, and logs in a unified system, reducing the operational burden of maintaining separate monitoring components.

Ruby Monitoring Libraries

The prometheus-client gem instruments Ruby applications for Prometheus:

require 'prometheus/client'
require 'prometheus/middleware/exporter'

prometheus = Prometheus::Client.registry

http_requests = prometheus.counter(
  :http_requests_total,
  docstring: 'Total HTTP requests',
  labels: [:method, :path, :status]
)

http_duration = prometheus.histogram(
  :http_request_duration_seconds,
  docstring: 'HTTP request duration',
  labels: [:method, :path]
)

# In Rack middleware
class MetricsMiddleware
  def initialize(app, requests_counter, duration_histogram)
    @app = app
    @requests = requests_counter
    @duration = duration_histogram
  end

  def call(env)
    start_time = Time.now
    
    status, headers, body = @app.call(env)
    
    duration = Time.now - start_time
    method = env['REQUEST_METHOD']
    # Raw paths create unbounded label cardinality; normalize to route
    # patterns (e.g. /users/:id) before using them as labels in production.
    path = env['PATH_INFO']
    
    @requests.increment(labels: { method: method, path: path, status: status })
    @duration.observe(duration, labels: { method: method, path: path })
    
    [status, headers, body]
  end
end

Honeycomb provides distributed tracing and observability. The Ruby beeline automatically instruments common libraries and frameworks:

require 'honeycomb-beeline'

Honeycomb.configure do |config|
  config.write_key = ENV['HONEYCOMB_WRITEKEY']
  config.dataset = 'production'
  config.service_name = 'payment-service'
end

Honeycomb.start_span(name: 'process_payment') do |span|
  span.add_field('user_id', user.id)
  span.add_field('amount', payment.amount)
  
  result = process_payment(payment)
  
  span.add_field('result', result.status)
  result
end

Incident Management

PagerDuty manages on-call schedules, incident routing, and escalation policies. Ruby applications integrate through the PagerDuty API to create incidents programmatically:

require 'net/http'
require 'json'
require 'socket'  # for Socket.gethostname

class PagerDutyIncident
  def initialize(integration_key)
    @integration_key = integration_key
    @api_url = 'https://events.pagerduty.com/v2/enqueue'
  end

  def trigger(summary:, severity:, details:)
    payload = {
      routing_key: @integration_key,
      event_action: 'trigger',
      payload: {
        summary: summary,
        severity: severity,
        source: Socket.gethostname,
        timestamp: Time.now.iso8601,
        custom_details: details
      }
    }

    uri = URI(@api_url)
    http = Net::HTTP.new(uri.host, uri.port)
    http.use_ssl = true

    request = Net::HTTP::Post.new(uri.path, 'Content-Type' => 'application/json')
    request.body = payload.to_json

    response = http.request(request)
    JSON.parse(response.body)
  end
end

Opsgenie provides similar incident management capabilities with different workflow options. VictorOps (now Splunk On-Call) offers another alternative with timeline-based incident views.

Deployment and Release Management

GitHub Actions automates deployments with SLO checks integrated into CI/CD pipelines. Ruby scripts validate error budgets before allowing deployments to proceed.
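
A sketch of such a gate, runnable as a pipeline step that fails the build when too little budget remains. The metrics endpoint URL and response shape are assumptions for illustration.

# error_budget_gate.rb - exit nonzero to block the deployment step.
require 'net/http'
require 'json'

MIN_BUDGET_PERCENT = 20

uri = URI('https://metrics.example.com/api/error-budget?service=payment-api')
response = Net::HTTP.get_response(uri)
abort "Could not fetch error budget (HTTP #{response.code})" unless response.code == '200'

remaining = JSON.parse(response.body).fetch('remaining_percentage')

if remaining < MIN_BUDGET_PERCENT
  warn "Deployment blocked: #{remaining}% error budget left (need #{MIN_BUDGET_PERCENT}%)"
  exit 1
end

puts "Error budget check passed: #{remaining}% remaining"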

Kubernetes handles container orchestration, providing deployment strategies like rolling updates, blue-green deployments, and canary releases. The Ruby Kubernetes client interacts with cluster resources:

require 'kubeclient'

# Deployments live in the apps/v1 API group
client = Kubeclient::Client.new(
  'https://kubernetes.default.svc/apis/apps',
  'v1',
  ssl_options: { verify_ssl: OpenSSL::SSL::VERIFY_NONE }  # verify certificates in production
)

# Check deployment status
deployment = client.get_deployment('payment-api', 'production')
ready_replicas = deployment.status.readyReplicas
desired_replicas = deployment.spec.replicas

if ready_replicas == desired_replicas
  puts "Deployment healthy: #{ready_replicas}/#{desired_replicas} ready"
else
  puts "Deployment degraded: #{ready_replicas}/#{desired_replicas} ready"
end

Spinnaker provides sophisticated deployment pipelines with automated canary analysis and rollback capabilities. Ruby services deploy through Spinnaker pipelines that validate health metrics before promoting releases.

Alerting Systems

Alertmanager receives alerts from Prometheus and routes them based on severity, team, and service. Configuration defines routing trees that match alerts to notification channels:

route:
  group_by: ['alertname', 'service']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'team-pager'
  routes:
    - match:
        severity: critical
      receiver: 'pagerduty-critical'
    - match:
        severity: warning
      receiver: 'slack-warnings'

Ruby applications define custom alert receivers that process webhook notifications and execute automated responses.
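
A minimal sketch of such a receiver using Sinatra, keyed to Alertmanager's webhook payload (a JSON body containing an alerts array). The remediation actions are placeholders.

require 'sinatra'
require 'json'

post '/alerts' do
  payload = JSON.parse(request.body.read)

  payload.fetch('alerts', []).each do |alert|
    next unless alert['status'] == 'firing'

    service = alert.dig('labels', 'service')
    case alert.dig('labels', 'alertname')
    when 'HighErrorRate'
      puts "Restarting unhealthy instances for #{service}"
    when 'QueueBacklog'
      puts "Scaling workers for #{service}"
    end
  end

  status 200
end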

Real-World Applications

SRE practices apply across different organizational scales and service types, with implementation patterns varying based on context.

High-Traffic Web Services

Organizations running high-traffic web applications implement SRE through comprehensive monitoring, automated capacity management, and rigorous change control. These services handle millions of requests per day, making reliability critical for business operations.

The SRE team maintains detailed SLO definitions covering availability, latency percentiles, and error rates. Monitoring systems track these metrics continuously, with alerting thresholds set below SLO boundaries to provide early warning before customer impact.

Capacity planning automation responds to traffic patterns, scaling infrastructure based on demand forecasts and real-time metrics. Ruby services use auto-scaling integrations that add capacity when request rates increase or when error budgets show signs of depletion.

Change management processes require all deployments to pass through automated verification. Services deploy using canary releases that expose changes to small traffic percentages initially, expanding only after metrics confirm no degradation. Deployment gates check error budgets and block releases when reliability concerns exist.

Microservices Architectures

Organizations with microservices architectures face distributed reliability challenges. Each service requires its own SLOs, but service dependencies create cascading failure risks. SRE teams implement circuit breakers, timeouts, and fallback behaviors to contain failures.

Service mesh technology like Istio provides observability, traffic management, and security across microservices. Ruby services integrate with service meshes through sidecar proxies that handle cross-cutting concerns like retries, timeouts, and distributed tracing.

The SRE team establishes service ownership models where each microservice has a responsible team. SRE provides frameworks and platforms that service teams use to operate reliably without requiring dedicated SRE support for each service.

Incident response in microservices requires understanding service dependencies and failure propagation. Ruby applications instrument their dependency chains, recording trace context that helps responders identify which service initiated failures during incidents.

Data Pipeline Reliability

Data processing pipelines require different SRE approaches than request-response services. Instead of latency and availability SLOs, data pipelines track processing latency, data quality, and throughput. Ruby applications processing data streams implement monitoring for pipeline health.

Backpressure handling prevents upstream services from overwhelming downstream components. Ruby data processors implement queue-based architectures with configurable limits that reject work when capacity is exceeded rather than failing silently.
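
A minimal sketch of this pattern: a bounded queue that rejects new work when full, making backpressure explicit to the producer instead of failing silently. The capacity limit is illustrative.

class BoundedQueue
  class CapacityExceededError < StandardError; end

  def initialize(limit)
    @limit = limit
    @queue = Queue.new  # Thread::Queue, thread-safe for push/pop
  end

  def push(item)
    # Note: check-then-push is not atomic; wrap in a Mutex when
    # multiple producers share the queue.
    raise CapacityExceededError, "queue at capacity (#{@limit})" if @queue.size >= @limit
    @queue << item
  end

  def pop
    @queue.pop
  end
end

queue = BoundedQueue.new(2)
queue.push(:job_a)
queue.push(:job_b)

begin
  queue.push(:job_c)
rescue BoundedQueue::CapacityExceededError => e
  # Surface the rejection upstream rather than dropping work silently.
  puts "Rejected: #{e.message}"
end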

Data quality monitoring validates pipeline outputs, checking for schema violations, null values, or suspicious patterns that indicate processing errors. Automated alerts notify teams when data quality degrades, enabling rapid response before downstream consumers encounter issues.
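
A sketch of such a validation step; the required fields and rules are illustrative.

# Validate pipeline output records for missing fields and bad values.
REQUIRED_FIELDS = %i[user_id amount currency].freeze

def validate_record(record)
  errors = REQUIRED_FIELDS.filter_map { |f| "missing #{f}" if record[f].nil? }
  errors << 'negative amount' if record[:amount] && record[:amount] < 0
  errors
end

records = [
  { user_id: 1, amount: 25.0, currency: 'USD' },
  { user_id: 2, amount: -5.0, currency: nil }
]

failures = records.each_with_index.filter_map do |record, index|
  errors = validate_record(record)
  [index, errors] unless errors.empty?
end

puts "Data quality failures: #{failures.inspect}" unless failures.empty?
# => Data quality failures: [[1, ["missing currency", "negative amount"]]]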

Retry mechanisms handle transient failures in data processing. Ruby workers implement exponential backoff with jitter, avoiding retry storms that worsen system load during incidents. Dead letter queues capture messages that fail processing repeatedly, allowing manual investigation without blocking pipeline progress.

Platform Services

Organizations building internal platforms apply SRE principles to developer-facing services like CI/CD systems, API gateways, and authentication services. These platforms require high reliability because outages block all dependent teams.

Platform SRE teams define strict SLOs reflecting the platform's critical role. Authentication service SLOs target 99.99% availability because authentication failures prevent users from accessing any services. Ruby authentication services implement redundancy, fast failover, and degraded mode operations that maintain core functionality during partial failures.

Documentation and self-service tools reduce operational burden on platform teams. Ruby applications expose health endpoints, metrics, and debugging interfaces that enable service consumers to troubleshoot issues without platform team involvement.

Capacity management for platforms requires understanding usage patterns across multiple consuming teams. Ruby platform services track per-tenant metrics, identifying heavy users and enforcing rate limits that prevent single consumers from impacting overall platform reliability.
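
A sketch of per-tenant enforcement using one token bucket per tenant: each tenant gets a sustained rate plus burst headroom, so a single heavy consumer cannot starve the rest. Rate and burst values are illustrative.

class TenantRateLimiter
  Bucket = Struct.new(:tokens, :last_refill)

  def initialize(rate_per_second:, burst:)
    @rate = rate_per_second
    @burst = burst
    @buckets = Hash.new { |h, tenant| h[tenant] = Bucket.new(burst.to_f, Time.now) }
  end

  def allow?(tenant)
    bucket = @buckets[tenant]
    now = Time.now
    # Refill tokens for elapsed time, capped at burst capacity.
    bucket.tokens = [bucket.tokens + (now - bucket.last_refill) * @rate, @burst].min
    bucket.last_refill = now

    return false if bucket.tokens < 1

    bucket.tokens -= 1
    true
  end
end

limiter = TenantRateLimiter.new(rate_per_second: 5, burst: 10)
puts limiter.allow?('tenant-a')  # => true until the burst is consumed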

Common Patterns

SRE teams repeatedly apply certain patterns when building and operating reliable services.

Gradual Rollouts

Gradual rollout strategies minimize blast radius when deploying changes. Rather than updating all instances simultaneously, deployments proceed in stages, with monitoring gates between stages that halt rollouts if problems emerge.

Canary deployments expose changes to a small percentage of traffic initially. Ruby applications route traffic based on headers, cookies, or random selection, directing a portion to the new version. Metrics comparison between canary and baseline versions determines whether to proceed.

class CanaryRouter
  def initialize(canary_percentage:, canary_version:, baseline_version:)
    @canary_percentage = canary_percentage
    @canary_version = canary_version
    @baseline_version = baseline_version
  end

  def select_version(request_id)
    # Object#hash is seeded per process; use a stable digest (e.g. Digest::MD5)
    # if routing must stay consistent across processes and restarts.
    canary_traffic = (request_id.hash % 100) < @canary_percentage
    canary_traffic ? @canary_version : @baseline_version
  end

  def evaluate_canary(canary_metrics, baseline_metrics)
    error_rate_increase = canary_metrics[:error_rate] - baseline_metrics[:error_rate]
    latency_increase = canary_metrics[:p99_latency] - baseline_metrics[:p99_latency]

    if error_rate_increase > 0.5 || latency_increase > 100
      { proceed: false, reason: 'Metrics degradation detected' }
    else
      { proceed: true }
    end
  end
end

Blue-green deployments maintain two complete environments, routing traffic to one while updating the other. After verification, traffic switches to the updated environment instantly. Ruby applications use load balancer integration or DNS updates to implement traffic switching.
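
A sketch of the switching step: verify the idle environment's health, then flip the active pointer. The load balancer client and its set_target method are hypothetical stand-ins for real infrastructure APIs.

require 'net/http'

class BlueGreenSwitcher
  def initialize(load_balancer, environments)
    @load_balancer = load_balancer
    @environments = environments  # e.g. { blue: 'https://blue.internal', green: 'https://green.internal' }
    @active = :blue
  end

  def switch!
    target = @active == :blue ? :green : :blue
    raise "#{target} environment unhealthy, aborting switch" unless healthy?(target)

    @load_balancer.set_target(@environments[target])  # hypothetical API
    @active = target
  end

  private

  def healthy?(env)
    uri = URI("#{@environments[env]}/health")
    Net::HTTP.get_response(uri).code == '200'
  rescue StandardError
    false
  end
end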

Circuit Breakers

Circuit breakers prevent cascading failures by stopping requests to failing dependencies. When error rates exceed thresholds, the circuit breaker opens, rejecting requests immediately rather than waiting for timeouts. After a recovery period, the breaker allows limited traffic through to test if the dependency recovered.

class CircuitBreaker
  STATES = [:closed, :open, :half_open].freeze

  def initialize(failure_threshold:, recovery_timeout:, success_threshold:)
    @failure_threshold = failure_threshold
    @recovery_timeout = recovery_timeout
    @success_threshold = success_threshold
    @state = :closed
    @failure_count = 0
    @success_count = 0
    @last_failure_time = nil
  end

  def call(&block)
    case @state
    when :open
      if Time.now - @last_failure_time > @recovery_timeout
        # Recovery window elapsed: move to half-open and try the request
        transition_to_half_open
        execute_with_monitoring(&block)
      else
        raise CircuitOpenError, "Circuit breaker open"
      end
    when :half_open, :closed
      execute_with_monitoring(&block)
    end
  end

  private

  def execute_with_monitoring
    result = yield
    record_success
    result
  rescue StandardError
    record_failure
    raise
  end

  def record_success
    @success_count += 1
    if @state == :half_open && @success_count >= @success_threshold
      transition_to_closed
    end
  end

  def record_failure
    @failure_count += 1
    @last_failure_time = Time.now

    # Any failure while half-open reopens the circuit immediately
    if @state == :half_open || @failure_count >= @failure_threshold
      transition_to_open
    end
  end

  def transition_to_closed
    @state = :closed
    @failure_count = 0
    @success_count = 0
  end

  def transition_to_open
    @state = :open
    @failure_count = 0
  end

  def transition_to_half_open
    @state = :half_open
    @success_count = 0
  end
end

class CircuitOpenError < StandardError; end

Retry with Exponential Backoff

Transient failures often resolve quickly, making retries valuable. However, naive retry strategies can overwhelm recovering services. Exponential backoff spaces retries further apart with each attempt, giving systems time to recover. Jitter randomizes retry timing, preventing thundering herds where many clients retry simultaneously.

class RetryWithBackoff
  def initialize(max_attempts:, base_delay:, max_delay:, jitter: true)
    @max_attempts = max_attempts
    @base_delay = base_delay
    @max_delay = max_delay
    @jitter = jitter
  end

  def call
    attempts = 0
    begin
      attempts += 1
      yield
    rescue StandardError
      if attempts < @max_attempts
        delay = calculate_delay(attempts)
        sleep(delay)
        retry
      else
        raise
      end
    end
  end

  private

  def calculate_delay(attempt)
    exponential_delay = @base_delay * (2 ** (attempt - 1))
    capped_delay = [exponential_delay, @max_delay].min
    
    if @jitter
      random_jitter = rand(0..capped_delay * 0.3)
      capped_delay + random_jitter
    else
      capped_delay
    end
  end
end

# Usage
retry_handler = RetryWithBackoff.new(
  max_attempts: 3,
  base_delay: 1,
  max_delay: 10
)

retry_handler.call do
  external_api.fetch_data
end

Feature Flags for Risk Reduction

Feature flags decouple deployment from release, allowing code deployment without immediately activating new behavior. Ruby applications check feature flag state before executing new code paths, enabling gradual rollout, A/B testing, and instant rollback without redeploying.

require 'digest'
require 'json'

class FeatureFlags
  def initialize(redis_client)
    @redis = redis_client
  end

  def enabled?(flag_name, user_id: nil, context: {})
    flag_config = fetch_flag_config(flag_name)
    return false unless flag_config

    if user_id && flag_config['user_whitelist']&.include?(user_id)
      return true
    end

    percentage = flag_config['rollout_percentage'] || 0
    return false if percentage.zero?
    return true if percentage >= 100

    # Hash flag name + user id so each user lands in a stable bucket per flag
    user_hash = Digest::MD5.hexdigest("#{flag_name}-#{user_id}").to_i(16)
    (user_hash % 100) < percentage
  end

  def disable_flag(flag_name)
    update_flag(flag_name, 'rollout_percentage' => 0)
  end

  private

  def fetch_flag_config(flag_name)
    config_json = @redis.get("feature_flag:#{flag_name}")
    JSON.parse(config_json) if config_json
  end

  def update_flag(flag_name, changes)
    config = fetch_flag_config(flag_name) || {}
    config.merge!(changes)
    @redis.set("feature_flag:#{flag_name}", config.to_json)
  end
end

Health Check Endpoints

Services expose health check endpoints that report readiness and liveness status. Kubernetes and load balancers query these endpoints to determine whether to route traffic to instances.

class HealthCheck
  def initialize
    @checks = {}
  end

  def register_check(name, &check_block)
    @checks[name] = check_block
  end

  def perform_checks
    results = @checks.map do |name, check_block|
      begin
        result = check_block.call
        [name, { status: 'healthy', details: result }]
      rescue StandardError => e
        [name, { status: 'unhealthy', error: e.message }]
      end
    end.to_h

    overall_healthy = results.values.all? { |r| r[:status] == 'healthy' }
    
    {
      status: overall_healthy ? 'healthy' : 'unhealthy',
      checks: results,
      timestamp: Time.now.iso8601
    }
  end
end

# In Sinatra or Rails controller
health_checker = HealthCheck.new

health_checker.register_check('database') do
  ActiveRecord::Base.connection.execute('SELECT 1')
  { connected: true }
end

health_checker.register_check('redis') do
  raise 'Redis ping failed' unless Redis.current.ping == 'PONG'
  { connected: true }
end

health_checker.register_check('external_api') do
  response = Net::HTTP.get_response(URI('https://api.example.com/health'))
  { reachable: response.code == '200' }
end

get '/health' do
  result = health_checker.perform_checks
  status(result[:status] == 'healthy' ? 200 : 503)
  json result
end

Reference

Core SRE Metrics

Metric | Description | Target Range
Availability | Percentage of time service responds successfully | 99.9% - 99.99%
Latency P50 | Median request completion time | < 100ms
Latency P99 | 99th percentile request completion time | < 500ms
Error Rate | Percentage of requests returning errors | < 0.1%
Throughput | Requests processed per second | Varies by service
Time to Detect | Time from incident start to detection | < 5 minutes
Time to Resolve | Time from detection to resolution | < 1 hour

SLO Components

Component | Definition | Example
SLI | Service Level Indicator - quantifiable measure of service quality | Request success rate
SLO | Service Level Objective - target value for an SLI | 99.9% of requests succeed
SLA | Service Level Agreement - contract with consequences for missing SLOs | 99.9% uptime or credit
Error Budget | Acceptable amount of unreliability | 0.1% failure rate

Toil Categories

Category | Characteristics | Examples
Manual | Requires human execution | Manually restarting services
Repetitive | Same actions performed repeatedly | Processing routine tickets
Automatable | Can be replaced with code | Deployment scripts
Tactical | Interrupt-driven reactive work | Responding to pages
No Enduring Value | Provides no permanent improvement | Temporary fixes
Linear Growth | Scales directly with service size | Manual capacity adjustments

Incident Severity Levels

Severity | Impact | Response Time | Escalation
Critical | Complete service outage or data loss | Immediate | All hands
High | Major feature unavailable or severe degradation | < 15 minutes | On-call + manager
Medium | Minor feature impaired or moderate degradation | < 1 hour | On-call engineer
Low | Cosmetic issues or minimal impact | Next business day | Standard queue

Error Budget Policy Actions

Budget Remaining | Actions | Deployment Frequency
> 50% | Normal operations, accept reasonable risk | Multiple per day
25-50% | Increase caution, require additional review | Daily
10-25% | Focus on reliability, defer non-critical features | Weekly
< 10% | Deployment freeze except critical fixes | Emergency only

Monitoring Tiers

Tier | Purpose | Alert Condition | Response
Symptom | User-facing problems | Error rate exceeds threshold | Page on-call
Cause | Internal component issues | High memory usage | Create ticket
Debug | Detailed troubleshooting data | N/A | Dashboard only
Audit | Historical analysis | N/A | Log retention

Common Ruby SRE Gems

Gem | Purpose | Use Case
prometheus-client | Metrics collection | Exposing Prometheus metrics
sentry-ruby | Error tracking | Exception monitoring
honeycomb-beeline | Distributed tracing | Request tracing across services
dogstatsd-ruby | StatsD metrics | Sending metrics to Datadog
flipper | Feature flags | Progressive rollouts
semian | Circuit breakers | Protecting against cascading failures
scientist | Experimental testing | Comparing implementations safely

Postmortem Template Sections

Section | Contents | Purpose
Summary | Brief incident description and impact | Quick context for readers
Timeline | Chronological events with timestamps | Understanding incident progression
Root Cause | Technical reason for failure | Identifying what broke
Impact | User effects and error budget consumed | Quantifying damage
Action Items | Specific tasks to prevent recurrence | Driving improvement
Lessons Learned | Insights from incident response | Organizational learning

Deployment Strategies

Strategy | Description | Risk Level | Rollback Speed
Big Bang | Update all instances simultaneously | High | Slow
Rolling | Update instances sequentially | Medium | Medium
Blue-Green | Maintain two environments, switch traffic | Low | Fast
Canary | Route small traffic percentage to new version | Very Low | Fast
Feature Flag | Deploy code, control activation separately | Minimal | Instant