Overview
Incident management represents the structured process of identifying, analyzing, and resolving unplanned disruptions to software systems. An incident refers to any event that degrades or interrupts normal service operation, ranging from complete outages to performance degradation. The practice emerged from IT service management disciplines in the 1980s and became codified through frameworks like ITIL (Information Technology Infrastructure Library).
The primary objective centers on restoring normal service operation as quickly as possible while minimizing adverse impact on business operations. Unlike problem management, which focuses on identifying root causes, incident management prioritizes immediate resolution and service restoration. The distinction matters because during an active incident, determining why a system failed takes secondary importance to making it work again.
Modern incident management encompasses several interconnected activities: detection through monitoring and alerting, triage to assess severity and impact, investigation to understand the failure, mitigation to restore service, communication to inform stakeholders, and post-incident analysis to prevent recurrence. Each activity requires different skills, tools, and processes.
The practice has evolved significantly with the adoption of distributed systems, microservices, and cloud infrastructure. Traditional incident management assumed relatively stable, monolithic applications where incidents were infrequent and discrete events. Contemporary systems generate continuous streams of alerts, partial failures, and cascading issues that require more sophisticated approaches.
# Basic incident detection in a Rails application
class IncidentDetector
  def initialize(threshold: 0.05)
    @error_threshold = threshold
    @time_window = 5.minutes
  end

  def check_error_rate
    recent_requests = RequestLog.where('created_at > ?', @time_window.ago).count
    return if recent_requests.zero? # avoid dividing by zero on quiet windows

    recent_errors = RequestLog.where('created_at > ? AND status >= 500', @time_window.ago).count
    error_rate = recent_errors.to_f / recent_requests
    trigger_incident(error_rate) if error_rate > @error_threshold
  end

  def trigger_incident(error_rate)
    Incident.create!(
      severity: calculate_severity(error_rate),
      title: "High error rate detected",
      details: "Error rate: #{(error_rate * 100).round(2)}%"
    )
  end

  private

  # Simple mapping from error rate to severity; thresholds are illustrative
  def calculate_severity(error_rate)
    error_rate > 0.25 ? :sev1 : :sev2
  end
end
The field continues to adapt as organizations shift toward observability-driven approaches that emphasize understanding system behavior through metrics, logs, and traces rather than relying solely on predefined alerts.
Key Principles
Incident management operates on several fundamental principles that guide effective response and resolution. These principles reflect decades of operational experience across diverse software systems.
Severity classification forms the foundation of incident response. Systems categorize incidents based on impact and urgency, typically using levels like SEV1 (critical), SEV2 (major), SEV3 (minor), and SEV4 (low). Critical incidents affect all users or core functionality and require immediate response. Major incidents impact significant user segments or important features. Minor incidents affect limited users or non-critical features. Low-severity incidents represent cosmetic issues or edge cases. The classification determines response speed, escalation paths, and resource allocation.
Incident lifecycle management defines distinct phases each incident traverses. Detection occurs through automated monitoring or user reports. Triage assesses severity and assigns ownership. Investigation identifies the nature and scope of the problem. Mitigation implements temporary fixes to restore service. Resolution addresses the underlying issue. Closure documents the incident and triggers follow-up activities. Each phase has specific goals, participants, and exit criteria.
Role-based response assigns clear responsibilities during incidents. The incident commander coordinates the overall response, makes decisions, and communicates with stakeholders. Technical responders investigate and implement fixes. Communication managers handle external updates. Subject matter experts provide specialized knowledge. Clear role assignment prevents confusion and duplicated effort during high-stress situations.
Communication protocols ensure stakeholders receive timely, accurate information. Internal communication keeps team members synchronized on investigation progress and mitigation attempts. External communication updates users about impact and expected resolution times. Status pages provide transparent, real-time incident information. Communication frequency increases with incident severity and duration.
Blameless culture emphasizes learning over punishment. Post-incident reviews focus on systemic issues rather than individual actions. The goal is to understand how the incident occurred within existing processes and controls, not to identify who made a mistake. This approach encourages transparency and thorough analysis.
Documentation requirements capture incident details for future reference. Incident records include timeline of events, actions taken, impact assessment, and resolution steps. This documentation serves multiple purposes: knowledge transfer, pattern recognition, audit compliance, and post-incident analysis.
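The timeline portion of an incident record can be sketched as a small append-only log. The class and field names here are illustrative, not part of any standard schema:

```ruby
require 'time'
require 'json'

# Minimal incident timeline: an append-only list of timestamped entries
# that can be exported for the post-incident review.
class IncidentTimeline
  attr_reader :entries

  def initialize
    @entries = []
  end

  def record(actor:, action:, at: Time.now.utc)
    @entries << { at: at.iso8601, actor: actor, action: action }
  end

  def to_json(*)
    JSON.pretty_generate(@entries)
  end
end

timeline = IncidentTimeline.new
timeline.record(actor: 'monitor', action: 'High error rate detected',
                at: Time.utc(2024, 5, 1, 12, 0))
timeline.record(actor: 'alice', action: 'Rolled back release 42.1',
                at: Time.utc(2024, 5, 1, 12, 14))
```

Because each entry carries its own timestamp and actor, the same log serves knowledge transfer, audit compliance, and the post-incident review.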
Escalation procedures define when and how to involve additional resources. Time-based escalation automatically notifies senior staff if incidents remain unresolved after specific durations. Impact-based escalation involves leadership when incidents exceed certain severity thresholds. Technical escalation brings in specialized expertise for complex issues.
class IncidentResponse
  SEVERITY_LEVELS = {
    sev1: { response_time: 15.minutes, escalation_time: 30.minutes },
    sev2: { response_time: 1.hour, escalation_time: 4.hours },
    sev3: { response_time: 4.hours, escalation_time: 24.hours },
    sev4: { response_time: 24.hours, escalation_time: nil }
  }.freeze

  def self.handle(incident)
    # Rails enums return strings, so convert before the symbol-keyed lookup
    response_config = SEVERITY_LEVELS.fetch(incident.severity.to_sym)

    # Assign incident commander and set deadlines in one write
    incident.update!(
      commander: assign_commander(incident.severity),
      response_deadline: Time.current + response_config[:response_time],
      escalation_deadline: response_config[:escalation_time] &&
        Time.current + response_config[:escalation_time]
    )

    # Notify responders
    notify_on_call_team(incident)

    # Publicize high-severity incidents
    create_status_page_entry(incident) if incident.sev1? || incident.sev2?
  end
end
Mean time to resolution (MTTR) serves as the primary metric for incident management effectiveness. MTTR measures the average time from incident detection to complete resolution. Organizations track MTTR trends to assess whether incident response improves over time. However, MTTR alone provides incomplete insight—frequency and severity of incidents matter equally.
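The metric itself is simple to compute. A plain-Ruby sketch, using a hypothetical struct rather than the ActiveRecord model shown elsewhere in this chapter:

```ruby
require 'time'

# Mean time to resolution: average of (resolved_at - detected_at)
# across resolved incidents, expressed in minutes.
Incident = Struct.new(:detected_at, :resolved_at, keyword_init: true)

def mttr_minutes(incidents)
  resolved = incidents.reject { |i| i.resolved_at.nil? }
  return nil if resolved.empty?

  total_seconds = resolved.sum { |i| i.resolved_at - i.detected_at }
  (total_seconds / resolved.size / 60.0).round(1)
end

incidents = [
  Incident.new(detected_at: Time.utc(2024, 1, 1, 10, 0),
               resolved_at: Time.utc(2024, 1, 1, 10, 30)),
  Incident.new(detected_at: Time.utc(2024, 1, 2, 9, 0),
               resolved_at: Time.utc(2024, 1, 2, 10, 30)),
  Incident.new(detected_at: Time.utc(2024, 1, 3, 8, 0),
               resolved_at: nil) # still open; excluded from the average
]

mttr_minutes(incidents) # => 60.0
```

Open incidents are excluded rather than counted as zero, which would otherwise bias the average downward.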
Runbooks and playbooks codify response procedures for common incident types. Runbooks contain step-by-step instructions for diagnosing and resolving specific issues. Playbooks define processes for managing different incident categories. Both reduce response time by eliminating the need to rediscover solutions during incidents.
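At its simplest, a runbook registry is a lookup from incident type to an ordered list of steps. The types and steps below are illustrative:

```ruby
# Minimal runbook registry: maps an incident type to an ordered list
# of diagnostic and remediation steps.
RUNBOOKS = {
  'database_connection_pool_exhausted' => [
    'Check pool stats: ActiveRecord::Base.connection_pool.stat',
    'Look for long-running queries holding connections',
    'Increase pool size or restart stuck workers'
  ],
  'api_rate_limit' => [
    'Confirm 429 responses in the provider dashboard',
    'Open the circuit breaker to stop outbound requests',
    'Queue failed work for retry after the limit window'
  ]
}.freeze

def runbook_for(incident_type)
  RUNBOOKS.fetch(incident_type) { ['No runbook found; escalate to on-call lead'] }
end
```

A fallback entry matters: responders should always receive an actionable next step, even for incident types nobody anticipated.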
Ruby Implementation
Ruby applications implement incident management through integration with monitoring services, logging frameworks, and alerting systems. The Ruby ecosystem provides numerous gems and patterns for detecting, tracking, and responding to incidents.
Error tracking integration represents the most common incident detection mechanism. Services like Sentry, Honeybadger, and Rollbar capture exceptions and aggregate them for analysis. These tools integrate directly into Ruby applications to automatically report errors.
# Sentry integration in Rails
# config/initializers/sentry.rb
Sentry.init do |config|
  config.dsn = ENV['SENTRY_DSN']
  config.breadcrumbs_logger = [:active_support_logger, :http_logger]
  config.traces_sample_rate = 0.1
  config.profiles_sample_rate = 0.1

  config.before_send = lambda do |event, _hint|
    # Add custom context
    event.user = {
      id: Current.user&.id,
      email: Current.user&.email
    }

    # Filter sensitive data (filter_sensitive_data is an app-defined helper)
    event.request.data = filter_sensitive_data(event.request.data) if event.request

    event
  end
end

# Capturing custom incidents
class PaymentProcessor
  def process_payment(order)
    result = PaymentGateway.charge(order.total, order.payment_token)

    if result.failure?
      Sentry.capture_message(
        "Payment processing failure",
        level: :error,
        extra: {
          order_id: order.id,
          amount: order.total,
          error_code: result.error_code
        }
      )

      # Create internal incident record
      Incident.create!(
        severity: :sev2,
        title: "Payment processing failure",
        service: "payment_processor",
        details: result.error_message
      )
    end

    result
  end
end
Health check endpoints enable external monitoring systems to detect incidents. These endpoints verify critical system components and dependencies.
# app/controllers/health_controller.rb
class HealthController < ApplicationController
  skip_before_action :authenticate_user!

  def show
    checks = {
      database: check_database,
      redis: check_redis,
      external_api: check_external_api, # defined like the checks below
      disk_space: check_disk_space
    }
    healthy = checks.values.all? { |check| check[:status] == 'ok' }

    render json: {
      status: healthy ? 'healthy' : 'unhealthy',
      checks: checks,
      timestamp: Time.current.iso8601
    }, status: healthy ? 200 : 503
  end

  private

  def check_database
    time = measure_response_time { ActiveRecord::Base.connection.execute('SELECT 1') }
    { status: 'ok', response_time: time }
  rescue => e
    { status: 'failed', error: e.message }
  end

  def check_redis
    # redis_connection is an app-managed client (Redis.current was removed in redis-rb 5)
    time = measure_response_time { redis_connection.ping }
    { status: 'ok', response_time: time }
  rescue => e
    { status: 'failed', error: e.message }
  end

  def measure_response_time
    start = Time.current
    yield
    ((Time.current - start) * 1000).round(2) # milliseconds
  end
end
Custom monitoring and alerting let Ruby applications define domain-specific incident conditions. The application watches business metrics and system health indicators and opens incidents when thresholds are breached.
class BusinessMetricsMonitor
  def initialize
    @statsd = Statsd.new('localhost', 8125)
  end

  def check_conversion_rate
    window = 1.hour.ago
    visitors = Visit.where('created_at > ?', window).count
    return if visitors < 100 # too little traffic to judge (also avoids dividing by zero)

    purchases = Order.where('created_at > ?', window).count
    conversion_rate = purchases.to_f / visitors
    @statsd.gauge('business.conversion_rate', conversion_rate)

    if conversion_rate < 0.02
      trigger_incident(
        title: "Low conversion rate detected",
        severity: :sev2,
        metrics: {
          conversion_rate: conversion_rate,
          visitors: visitors,
          purchases: purchases
        }
      )
    end
  end

  def check_payment_success_rate
    window = 15.minutes.ago
    attempts = PaymentAttempt.where('created_at > ?', window).count
    return if attempts < 10 # avoid noise on low volume

    successful = PaymentAttempt.where('created_at > ? AND status = ?', window, 'succeeded').count
    success_rate = successful.to_f / attempts
    @statsd.gauge('payments.success_rate', success_rate)

    if success_rate < 0.95
      trigger_incident(
        title: "Payment success rate below threshold",
        severity: :sev1,
        metrics: {
          success_rate: success_rate,
          attempts: attempts,
          successful: successful
        }
      )
    end
  end

  private

  def trigger_incident(title:, severity:, metrics:)
    incident = Incident.create!(
      title: title,
      severity: severity,
      details: metrics.to_json,
      detected_at: Time.current
    )
    IncidentNotifier.notify(incident)
  end
end
Incident notification systems deliver alerts to on-call engineers through various channels. Integration with PagerDuty, Opsgenie, or similar services ensures incidents reach the appropriate responders.
class IncidentNotifier
  def self.notify(incident)
    new(incident).notify
  end

  def initialize(incident)
    @incident = incident
  end

  def notify
    # Rails enums expose severity as a string, so match on strings
    case @incident.severity
    when 'sev1', 'sev2'
      notify_pagerduty
      notify_slack
      update_status_page
    when 'sev3'
      notify_slack
    when 'sev4'
      create_ticket
    end
  end

  private

  def notify_pagerduty
    HTTParty.post(
      'https://events.pagerduty.com/v2/enqueue',
      headers: { 'Content-Type' => 'application/json' },
      body: {
        routing_key: ENV['PAGERDUTY_ROUTING_KEY'],
        event_action: 'trigger',
        payload: {
          summary: @incident.title,
          severity: pagerduty_severity, # app-defined: maps sev1-4 to PagerDuty's critical/error/warning/info
          source: 'application',
          custom_details: {
            incident_id: @incident.id,
            details: @incident.details,
            detected_at: @incident.detected_at
          }
        }
      }.to_json
    )
  end

  def notify_slack
    SlackNotifier.new.post(
      text: format_slack_message,
      channel: incident_channel,
      username: 'Incident Manager'
    )
  end

  def format_slack_message
    <<~MESSAGE
      :warning: *#{@incident.severity.upcase} Incident*
      *Title:* #{@incident.title}
      *Detected:* #{@incident.detected_at.strftime('%Y-%m-%d %H:%M:%S %Z')}
      *Details:* #{@incident.details}
      *Link:* #{incident_url}
    MESSAGE
  end
end
Incident tracking models maintain incident state and history within the application database. These models store incident details, track status changes, and link related information.
class Incident < ApplicationRecord
  enum severity: { sev1: 0, sev2: 1, sev3: 2, sev4: 3 }
  enum status: {
    detected: 0,
    investigating: 1,
    identified: 2,
    monitoring: 3,
    resolved: 4
  }

  belongs_to :commander, class_name: 'User', optional: true
  has_many :incident_updates, dependent: :destroy
  has_many :incident_actions, dependent: :destroy

  validates :title, :severity, :status, presence: true

  after_create :start_incident_timer
  after_update :record_status_change, if: :saved_change_to_status?

  def duration
    return nil unless resolved?
    resolved_at - detected_at
  end

  def time_to_detect
    detected_at - created_at
  end

  def time_to_resolve
    return nil unless resolved?
    resolved_at - detected_at
  end

  def add_update(message, user)
    incident_updates.create!(message: message, user: user)
  end

  def escalate!
    if sev2?
      update!(severity: :sev1)
    elsif sev3?
      update!(severity: :sev2)
    end
    IncidentNotifier.notify(self)
  end

  private

  def start_incident_timer
    IncidentTimerJob.set(wait: response_deadline).perform_later(id)
  end

  def response_deadline
    case severity
    when 'sev1' then 15.minutes
    when 'sev2' then 1.hour
    when 'sev3' then 4.hours
    else 24.hours
    end
  end

  def record_status_change
    incident_actions.create!(
      action: "Status changed from #{status_before_last_save} to #{status}",
      performed_at: Time.current
    )
  end
end
Implementation Approaches
Organizations adopt different incident management strategies based on their operational maturity, system complexity, and team structure. Each approach represents different trade-offs between automation, process rigor, and flexibility.
Reactive approach represents the most basic incident management strategy. Teams respond to incidents as they occur without proactive monitoring or standardized processes. Engineers receive alerts through informal channels like direct messages or email. Response depends heavily on individual knowledge and availability. This approach works for small teams managing simple systems but breaks down as complexity increases. The primary risk involves inconsistent response quality and knowledge concentration in specific individuals.
Process-driven approach implements formal incident management procedures aligned with frameworks like ITIL. Organizations define detailed processes for detection, triage, investigation, and resolution. Each step includes specific activities, roles, and documentation requirements. Incident tickets flow through defined stages with approval gates and handoffs between teams. This approach ensures consistency and auditability but can introduce overhead that slows incident resolution. Process-driven approaches suit regulated industries or large organizations requiring compliance and standardization.
On-call rotation approach distributes incident response responsibility across team members. Engineers take turns being on-call for specific time periods, typically one week. During their rotation, they carry the responsibility for acknowledging and responding to incidents. Organizations implement primary and secondary on-call coverage to ensure redundancy. This approach spreads incident knowledge across the team and prevents burnout from constant interruptions. However, it requires sufficient team size and can create handoff issues between rotations.
class OnCallSchedule
  def initialize
    @schedule_client = PagerDuty::Client.new(api_token: ENV['PAGERDUTY_API_TOKEN'])
  end

  def current_responder(schedule_id)
    response = @schedule_client.get("schedules/#{schedule_id}/users", since: Time.current)
    response['users'].first
  end

  def escalate_to_secondary(incident)
    secondary_responder = current_responder(ENV['SECONDARY_SCHEDULE_ID'])
    incident.update!(
      escalated_to: secondary_responder['id'],
      escalated_at: Time.current
    )
    notify_responder(secondary_responder, incident) # app-defined notification hook
  end
end
Follow-the-sun approach maintains continuous incident coverage by distributing on-call responsibilities across global time zones. As one region ends their workday, responsibility transfers to another region beginning their workday. This approach reduces after-hours incidents for individual engineers and provides faster incident response by routing to engineers during their normal working hours. Implementation requires geographic team distribution and effective handoff procedures between regions.
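The routing decision behind follow-the-sun coverage can be sketched as a lookup by UTC hour. The regions and hour ranges below are illustrative assumptions, not a standard rotation:

```ruby
# Follow-the-sun routing sketch: pick the region whose business hours
# (expressed in UTC) contain the current time.
REGIONS = [
  { name: 'EMEA',     utc_hours: 7...15 },
  { name: 'Americas', utc_hours: 15...23 },
  { name: 'APAC',     utc_hours: nil } # fallback covering 23:00-07:00 UTC
].freeze

def active_region(utc_hour)
  REGIONS.find { |r| r[:utc_hours]&.cover?(utc_hour) } || REGIONS.last
end

active_region(10)[:name] # => "EMEA"
active_region(3)[:name]  # => "APAC"
```

In practice this logic lives in the paging platform's schedule configuration rather than application code; the sketch just makes the handoff boundaries explicit.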
Incident command system (ICS) structures response around clearly defined roles with explicit responsibilities. The incident commander leads the response, makes decisions, and coordinates activities. Technical leads handle specific investigation areas. Communication leads manage stakeholder updates. Scribe documents the incident timeline. This approach, borrowed from emergency management, works well for complex incidents involving multiple teams. Role clarity prevents confusion during high-stress situations.
War room approach brings all incident responders into a dedicated communication channel (physical or virtual) for critical incidents. The war room centralizes communication, accelerates decision-making, and maintains shared context. Teams use dedicated Slack channels, Zoom calls, or physical meeting rooms depending on circumstances. This approach concentrates expertise and enables rapid collaboration but consumes significant resources. Organizations typically reserve war rooms for SEV1 incidents affecting major functionality.
Automated remediation approach implements self-healing capabilities that detect and resolve common incidents without human intervention. Systems monitor for known failure conditions and execute predefined remediation actions. Automated remediation works for well-understood, repeatable issues like restarting failed services, clearing cache, or failing over to backup systems. This approach reduces MTTR and engineering toil but requires significant upfront investment in automation infrastructure.
class AutomatedRemediation
  REMEDIATION_ACTIONS = {
    high_memory: :restart_service,
    database_connection_pool_exhausted: :increase_pool_size,
    disk_full: :clear_logs,
    cache_miss_rate_high: :warm_cache
  }.freeze

  def attempt_remediation(incident)
    action = REMEDIATION_ACTIONS[incident.incident_type.to_sym]
    return false unless action

    incident.update!(status: :remediating)
    result = send(action, incident)

    if result.success?
      incident.update!(
        status: :resolved,
        resolution: "Auto-remediated via #{action}",
        resolved_at: Time.current
      )
      true
    else
      incident.update!(
        status: :investigating,
        notes: "Automated remediation failed: #{result.error}"
      )
      false
    end
  end

  private

  def restart_service(incident)
    service_name = incident.metadata['service_name']
    SystemCtl.restart(service_name)
    OpenStruct.new(success?: true) # normalize the return value for attempt_remediation
  rescue => e
    OpenStruct.new(success?: false, error: e.message)
  end
end
Chaos engineering approach proactively injects failures into production systems to validate incident detection and response capabilities. Teams regularly conduct game days where they simulate incidents and practice response procedures. This approach identifies gaps in monitoring, documentation, and processes before real incidents occur. Chaos engineering builds confidence in system resilience and response capabilities but requires organizational maturity and careful implementation to avoid causing actual outages.
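A game-day exercise needs a controlled way to inject faults. The sketch below wraps a call and fails a configurable fraction of invocations; it is intended for rehearsal in a controlled experiment, not as a production library:

```ruby
# Fault injection sketch for game-day drills: raise on a configurable
# fraction of calls so teams can rehearse detection and response.
class FaultInjector
  def initialize(failure_rate:, rng: Random.new)
    @failure_rate = failure_rate
    @rng = rng
  end

  def call
    raise 'injected fault' if @rng.rand < @failure_rate
    yield
  end
end

safe = FaultInjector.new(failure_rate: 0.0)
safe.call { :ok } # => :ok (never injects)

always = FaultInjector.new(failure_rate: 1.0)
begin
  always.call { :ok }
rescue RuntimeError => e
  e.message # => "injected fault"
end
```

Passing a seeded `Random` makes a drill reproducible, which helps when comparing how two teams (or two monitoring configurations) handle the same failure sequence.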
Tools & Ecosystem
The Ruby ecosystem includes numerous tools and services for implementing comprehensive incident management. Selection depends on organization size, budget, system complexity, and operational requirements.
Error tracking services aggregate application exceptions and provide analysis capabilities. Sentry remains the most popular option in the Ruby community, offering exception grouping, release tracking, and performance monitoring. Honeybadger provides similar functionality with focus on Rails applications. Rollbar includes deployment tracking and person tracking to understand which users encounter errors. Airbrake offers lightweight error monitoring with simple integration. Bugsnag provides comprehensive error diagnostics with support for multiple languages.
# Comparing error tracking integrations
# Sentry
Sentry.capture_exception(exception, extra: { order_id: order.id })
# Honeybadger
Honeybadger.notify(exception, context: { order_id: order.id })
# Rollbar
Rollbar.error(exception, order_id: order.id)
# Airbrake
Airbrake.notify(exception, params: { order_id: order.id })
Incident management platforms coordinate on-call schedules, alert routing, and incident response. PagerDuty dominates the market with comprehensive scheduling, escalation policies, and integrations. Opsgenie (owned by Atlassian) provides similar functionality with tighter Jira integration. VictorOps (Splunk On-Call) focuses on collaborative incident response. xMatters emphasizes workflow automation and communication templates. incident.io offers modern incident management with built-in status pages and post-incident analysis.
Monitoring and observability platforms detect anomalies and trigger incidents. New Relic provides full-stack observability including application performance monitoring (APM), infrastructure monitoring, and logging. Datadog offers comprehensive monitoring with strong visualization capabilities and anomaly detection. AppSignal specializes in Ruby monitoring with excellent Rails integration. Scout APM focuses specifically on Rails performance monitoring. Prometheus plus Grafana provide open-source monitoring and alerting.
# New Relic custom instrumentation
class PaymentProcessor
  include NewRelic::Agent::Instrumentation::ControllerInstrumentation

  def process_payment(order)
    NewRelic::Agent.add_custom_attributes(order_id: order.id, amount: order.total)
    result = PaymentGateway.charge(order.total, order.payment_token)
    NewRelic::Agent.notice_error(PaymentError.new(result.error_message)) if result.failure?
    result
  end
  add_transaction_tracer :process_payment, category: :task
end
Logging aggregation services centralize log collection and analysis. Papertrail provides simple log aggregation with search capabilities. Loggly offers log management with dashboards and alerts. Splunk provides enterprise-grade log analysis and security information. ELK Stack (Elasticsearch, Logstash, Kibana) offers open-source log aggregation and analysis. Loki (from Grafana Labs) provides lightweight log aggregation integrated with Grafana.
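Whatever aggregator is chosen, correlating log lines with incidents is easier when each line is a structured JSON object carrying an incident identifier. A sketch, with illustrative field names rather than any standard schema:

```ruby
require 'json'
require 'time'

# Emit one JSON object per event so an aggregator (ELK, Loki, etc.)
# can filter all log lines for a given incident_id.
def incident_log_line(incident_id:, level:, message:, **extra)
  {
    ts: Time.now.utc.iso8601,
    level: level,
    incident_id: incident_id,
    message: message
  }.merge(extra).to_json
end

line = incident_log_line(incident_id: 482, level: 'error',
                         message: 'circuit breaker opened',
                         service: 'payments')
```

During an incident, a single query on `incident_id` then reconstructs the sequence of events across services.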
Status page services communicate incident status to users. Statuspage (Atlassian) provides customizable status pages with automatic updates. Better Uptime includes status pages with integrated monitoring. StatusCast offers white-label status pages for embedding. Sorry provides simple, developer-friendly status pages with API-driven updates.
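Status page updates are typically API-driven. The sketch below builds (but does not send) a Statuspage incident-creation request; the endpoint and payload shape follow the public Statuspage REST API as of this writing, so verify against the current docs before relying on them:

```ruby
require 'net/http'
require 'json'
require 'uri'

# Construct a Statuspage "create incident" request. The page_id and
# api_key values here are placeholders.
def build_status_page_request(page_id:, api_key:, name:, status:)
  uri = URI("https://api.statuspage.io/v1/pages/#{page_id}/incidents")
  req = Net::HTTP::Post.new(uri)
  req['Authorization'] = "OAuth #{api_key}"
  req['Content-Type'] = 'application/json'
  req.body = { incident: { name: name, status: status } }.to_json
  req
end

req = build_status_page_request(page_id: 'abc123', api_key: 'secret',
                                name: 'Elevated error rates',
                                status: 'investigating')
```

Sending it is then a matter of `Net::HTTP.start(uri.host, uri.port, use_ssl: true) { |http| http.request(req) }`.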
Communication tools facilitate incident response coordination. Slack remains the dominant platform with extensive integration capabilities. The slack-ruby-client gem enables programmatic Slack interaction. Microsoft Teams provides similar functionality for Microsoft-centric organizations. Mattermost offers self-hosted team communication. Zoom or Google Meet handle synchronous communication during major incidents.
# Slack integration for incident updates
class SlackIncidentChannel
  def initialize(incident)
    @incident = incident
    @client = Slack::Web::Client.new(token: ENV['SLACK_BOT_TOKEN'])
  end

  def create_channel
    # Slack channel names must be lowercase and at most 80 characters
    channel_name = "incident-#{@incident.id}-#{@incident.title.parameterize}"[0, 80]
    response = @client.conversations_create(name: channel_name, is_private: false)
    channel_id = response['channel']['id']

    @client.conversations_setTopic(
      channel: channel_id,
      topic: "#{@incident.severity.upcase} - #{@incident.title}"
    )

    post_initial_message(channel_id)
    invite_responders(channel_id) # app-defined: invites the on-call team
    channel_id
  end

  private

  def post_initial_message(channel_id)
    @client.chat_postMessage(
      channel: channel_id,
      blocks: incident_blocks,
      text: "Incident #{@incident.id}: #{@incident.title}"
    )
  end

  def incident_blocks
    [
      {
        type: 'header',
        text: {
          type: 'plain_text',
          text: "🚨 #{@incident.severity.upcase} Incident"
        }
      },
      {
        type: 'section',
        fields: [
          { type: 'mrkdwn', text: "*Title:*\n#{@incident.title}" },
          { type: 'mrkdwn', text: "*Status:*\n#{@incident.status}" },
          { type: 'mrkdwn', text: "*Detected:*\n#{@incident.detected_at}" },
          { type: 'mrkdwn', text: "*Commander:*\n#{@incident.commander&.name || 'Unassigned'}" }
        ]
      }
    ]
  end
end
Runbook automation tools codify incident response procedures. RunDeck executes automated workflows for common operations. Ansible handles configuration management and remediation tasks. The Chef and Puppet configuration management tools automate infrastructure changes. Ruby-based Capistrano deploys code and executes remote commands.
Post-incident tools facilitate learning and improvement. The Jira or Linear project management tools track follow-up action items. Confluence or Notion document post-incident reviews. Blameless provides dedicated post-incident review workflows. incident.io includes built-in post-incident analysis features.
Ruby gems for incident management extend functionality. The exception_notification gem sends email notifications for exceptions. The exception-track gem logs exceptions to database for analysis. The rollbar gem integrates with Rollbar service. The sentry-ruby gem provides Sentry integration. The health_check gem implements health check endpoints. The rack-attack gem prevents incident-causing abuse.
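As an illustration, a minimal exception_notification initializer might look like the following; option names follow the gem's README, and the addresses are placeholders:

```ruby
# config/initializers/exception_notification.rb
require 'exception_notification/rails'

ExceptionNotification.configure do |config|
  # Email every unhandled exception to the on-call alias
  config.add_notifier :email, {
    email_prefix: '[ERROR] ',
    sender_address: %("Notifier" <notifier@example.com>),
    exception_recipients: %w[oncall@example.com]
  }
end
```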
Practical Examples
Incident management concepts become concrete through realistic scenarios showing detection, response, and resolution of actual system failures.
Database connection pool exhaustion represents a common incident type in Ruby applications. The application experiences intermittent timeouts and users report slow response times.
# Initial detection via health check
class DatabaseHealthCheck
  def check
    start = Time.current
    ActiveRecord::Base.connection_pool.with_connection do |conn|
      conn.execute('SELECT 1')
    end
    duration = Time.current - start

    pool_stats = ActiveRecord::Base.connection_pool.stat
    if pool_stats[:waiting] > 0 || duration > 1.0
      {
        status: 'degraded',
        waiting_connections: pool_stats[:waiting],
        active_connections: pool_stats[:busy],
        pool_size: pool_stats[:size],
        response_time: duration
      }
    else
      { status: 'healthy' }
    end
  end
end

# Automated incident creation when threshold exceeded
class ConnectionPoolMonitor
  def check_pool_health
    check_result = DatabaseHealthCheck.new.check
    if check_result[:status] == 'degraded' && check_result[:waiting_connections] > 5
      create_incident(check_result)
    end
  end

  private

  def create_incident(metrics)
    Incident.create!(
      severity: :sev2,
      title: "Database connection pool exhaustion",
      incident_type: 'database_connection_pool_exhausted',
      status: :detected,
      details: metrics.to_json,
      detected_at: Time.current
    )
  end
end

# Automated remediation attempt
class DatabasePoolRemediation
  def remediate(incident)
    current_pool_size = ActiveRecord::Base.connection_pool.size
    new_pool_size = [current_pool_size * 1.5, 100].min.to_i

    # Re-establish the connection with a larger pool
    # (connection_config was removed in Rails 6.1; use connection_db_config)
    ActiveRecord::Base.establish_connection(
      ActiveRecord::Base.connection_db_config.configuration_hash.merge(pool: new_pool_size)
    )

    incident.add_update(
      "Increased connection pool from #{current_pool_size} to #{new_pool_size}",
      User.system_user
    )

    # Wait for the change to take effect, then re-check
    sleep 30
    check_result = DatabaseHealthCheck.new.check

    if check_result[:status] == 'healthy'
      incident.update!(
        status: :resolved,
        resolved_at: Time.current,
        resolution: "Increased connection pool size to #{new_pool_size}"
      )
    else
      incident.update!(status: :investigating)
      escalate_to_engineer(incident)
    end
  end
end
API rate limit breach occurs when external service rate limits are exceeded, causing cascading failures. The application receives 429 Too Many Requests responses from a payment provider.
class PaymentServiceIncident
  def self.detect_rate_limit_breach
    recent_failures = PaymentAttempt
      .where('created_at > ?', 5.minutes.ago)
      .where(status: 'rate_limited')
      .count

    if recent_failures > 10
      incident = Incident.create!(
        severity: :sev1,
        title: "Payment service rate limit breach",
        incident_type: 'api_rate_limit',
        status: :detected,
        details: {
          failed_attempts: recent_failures,
          service: 'stripe',
          time_window: '5 minutes'
        }.to_json
      )
      implement_circuit_breaker(incident)
    end
  end

  def self.implement_circuit_breaker(incident)
    # Open the circuit breaker to prevent further requests
    Rails.cache.write('payment_service_circuit_breaker', 'open', expires_in: 15.minutes)

    # Queue failed payments for retry
    failed_orders = Order.where('created_at > ? AND payment_status = ?', 5.minutes.ago, 'pending')
    failed_orders.each do |order|
      PaymentRetryJob.set(wait: 20.minutes).perform_later(order.id)
    end

    incident.add_update(
      "Circuit breaker opened. #{failed_orders.count} orders queued for retry.",
      User.system_user
    )

    # Monitor and auto-recover
    CircuitBreakerMonitorJob.set(wait: 15.minutes).perform_later(incident.id)
  end
end

# Circuit breaker monitoring and recovery
class CircuitBreakerMonitorJob < ApplicationJob
  def perform(incident_id)
    incident = Incident.find(incident_id)

    # Test with a single request
    test_result = PaymentGateway.test_connection

    if test_result.success?
      Rails.cache.delete('payment_service_circuit_breaker')
      incident.update!(
        status: :resolved,
        resolved_at: Time.current,
        resolution: "Rate limit window expired. Circuit breaker closed."
      )
      # Process queued payments
      RetryQueuedPaymentsJob.perform_later
    else
      # Extend the circuit breaker and try again later
      Rails.cache.write('payment_service_circuit_breaker', 'open', expires_in: 15.minutes)
      incident.add_update("Circuit breaker remains open. Retrying in 15 minutes.", User.system_user)
      self.class.set(wait: 15.minutes).perform_later(incident_id)
    end
  end
end
Memory leak incident demonstrates gradual degradation requiring investigation and coordinated response. Application memory usage grows continuously until pods restart due to OOM conditions.
class MemoryLeakDetection
  def monitor_memory_trend
    # Store historical data
    MemoryMetric.create!(
      value: process_memory_usage,
      timestamp: Time.current
    )

    # Analyze the trend over the last hour
    hourly_samples = MemoryMetric.where('timestamp > ?', 1.hour.ago).order(:timestamp)
    create_memory_leak_incident(hourly_samples) if memory_leak_detected?(hourly_samples)
  end

  private

  def memory_leak_detected?(samples)
    return false if samples.count < 12 # need sufficient data

    # Calculate memory growth rate (to_f guards against integer division)
    first_value = samples.first.value.to_f
    last_value = samples.last.value.to_f
    growth_rate = (last_value - first_value) / first_value

    # Flag if memory grew more than 50% and keeps growing
    growth_rate > 0.5 && consistent_growth?(samples)
  end

  def consistent_growth?(samples)
    increases = 0
    samples.each_cons(2) do |prev, current|
      increases += 1 if current.value > prev.value
    end
    increases.to_f / (samples.count - 1) > 0.7 # 70% of samples show growth
  end

  def create_memory_leak_incident(samples)
    first = samples.first.value.to_f
    last = samples.last.value.to_f
    Incident.create!(
      severity: :sev2,
      title: "Potential memory leak detected",
      incident_type: 'memory_leak',
      status: :investigating,
      details: {
        current_memory: last,
        initial_memory: first,
        growth_rate: ((last - first) / first * 100).round(2),
        sample_count: samples.count
      }.to_json
    )
  end

  def process_memory_usage
    # Resident set size in MB
    `ps -o rss= -p #{Process.pid}`.to_i / 1024
  end
end
# Investigation tooling
class MemoryLeakInvestigation
  def capture_heap_dump(incident)
    timestamp = Time.current.to_i
    dump_path = Rails.root.join('tmp', "heap_dump_#{timestamp}.json")

    require 'objspace'
    ObjectSpace.trace_object_allocations_start
    # Trigger GC to reduce noise
    GC.start
    # Generate heap dump (one JSON object per line)
    File.open(dump_path, 'w') do |f|
      ObjectSpace.dump_all(output: f)
    end

    incident.add_update(
      "Heap dump captured: #{dump_path}",
      User.system_user
    )
    # Analyze for common leak patterns
    analyze_heap_dump(dump_path, incident)
  end

  def analyze_heap_dump(path, incident)
    # Parse heap dump for object counts
    object_counts = Hash.new(0)
    File.foreach(path) do |line|
      data = JSON.parse(line)
      object_counts[data['type']] += 1 if data['type']
    rescue JSON::ParserError
      next # Skip malformed lines
    end

    # Identify suspicious object accumulation
    suspicious = object_counts.select { |_type, count| count > 10_000 }
                              .sort_by { |_type, count| -count }
                              .first(5)
    if suspicious.any?
      incident.add_update(
        "High object counts detected: #{suspicious.to_h}",
        User.system_user
      )
    end
  end
end
Real-World Applications
Production incident management encompasses operational practices, tooling choices, and organizational patterns that teams implement at scale.
E-commerce platform incident response illustrates comprehensive incident management in high-stakes environments. During peak shopping periods like Black Friday, organizations maintain heightened alertness and adjust response procedures.
E-commerce companies establish tiered severity classifications based on revenue impact. SEV1 incidents block customer purchases or payment processing. SEV2 incidents affect search, recommendations, or cart functionality. SEV3 incidents impact non-critical features like reviews or wishlists. Financial impact calculations inform severity assignments—a checkout failure affecting 100 transactions per minute receives SEV1 classification immediately.
Response procedures differ for peak versus normal periods. During major sales events, organizations maintain dedicated war rooms with all critical teams present. Incident commanders make faster decisions with less investigation because revenue loss from downtime exceeds the risk of potentially unnecessary rollbacks or changes. Teams pre-deploy rollback automation and maintain hot standby capacity to absorb traffic spikes or failover needs.
class EcommerceIncidentResponse
  REVENUE_THRESHOLDS = {
    sev1: 1000, # $1000/minute revenue impact
    sev2: 100,  # $100/minute revenue impact
    sev3: 10    # $10/minute revenue impact
  }.freeze

  def classify_by_revenue_impact(incident)
    affected_feature = incident.metadata['feature']
    conversion_rate = calculate_conversion_rate(affected_feature)
    average_order_value = Order.recent.average(:total)
    traffic = current_traffic_rate

    revenue_impact_per_minute = traffic * conversion_rate * average_order_value
    # Thresholds are ordered highest first, so the first match wins
    severity = REVENUE_THRESHOLDS.find { |_sev, threshold|
      revenue_impact_per_minute >= threshold
    }&.first || :sev4

    incident.update!(
      severity: severity,
      metadata: incident.metadata.merge(
        estimated_revenue_impact: revenue_impact_per_minute
      )
    )
  end

  def should_auto_rollback?(incident)
    # Aggressive rollback during peak periods
    peak_period = BlackFridaySchedule.peak_period?(Time.current)
    if peak_period && incident.sev1?
      recent_deployment = Deployment.where('deployed_at > ?', 30.minutes.ago).last
      return true if recent_deployment
    end
    false
  end
end
SaaS platform incident management focuses on multi-tenant considerations and customer communication. SaaS providers must determine whether incidents affect all customers or specific tenants. Tenant-specific incidents require different response procedures than platform-wide issues.
Customer communication represents a critical component. Status pages provide real-time updates, but high-value customers may require direct outreach. Organizations maintain customer success managers who receive incident notifications for their assigned accounts. The balance lies in being transparent about issues while avoiding unnecessary alarm among customers an incident does not affect.
SaaS platforms implement tenant isolation to contain incidents. Database-per-tenant architectures limit incident blast radius to individual customers. Shared infrastructure failures affect multiple tenants, triggering elevated severity classifications. Incident response includes identifying affected tenant lists and assessing whether partial functionality remains available.
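Assessing blast radius can be reduced to a dependency lookup: which tenants rely on the failing component, and is it shared infrastructure? A minimal sketch, with illustrative tenant and component names (the halfway cutoff for elevating severity is an assumption, not a standard):

```ruby
# Hypothetical sketch: scoping a SaaS incident to affected tenants.
class TenantBlastRadius
  def initialize(tenant_components)
    # Map of tenant id => infrastructure components it depends on
    @tenant_components = tenant_components
  end

  # Returns the tenants affected by a failing component and a severity
  # suggestion: shared-infrastructure failures touch many tenants and
  # warrant elevated classification.
  def assess(failed_component)
    affected = @tenant_components.select { |_id, deps| deps.include?(failed_component) }.keys
    severity = affected.size > @tenant_components.size / 2 ? :sev1 : :sev2
    { affected_tenants: affected, suggested_severity: severity }
  end
end

radius = TenantBlastRadius.new(
  'acme'    => ['shared_db', 'search'],
  'globex'  => ['tenant_db_globex'],
  'initech' => ['shared_db']
)
radius.assess('shared_db') # a shared component: most tenants affected, elevated severity
```

A real system would derive the dependency map from service discovery or deployment metadata rather than a literal hash.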
Financial services incident management operates under strict regulatory requirements and audit trails. Financial institutions must document every incident, investigation step, and resolution action. Regulatory compliance mandates specific reporting timeframes for security incidents or data breaches.
Risk management frameworks integrate with incident processes. Each incident undergoes assessment for potential fraud indicators, data exposure, or compliance violations. Security teams receive automatic notification of incidents involving authentication systems, payment processing, or customer data access.
Financial services maintain aggressive MTTR targets because downtime directly impacts transaction volume and regulatory compliance. Many institutions run active-active (hot-hot) database configurations that allow instant failover without data loss. Incident response procedures emphasize immediate mitigation over thorough investigation during active incidents.
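The audit-trail requirement above can be made tamper-evident by hash-chaining entries, so each recorded action commits to everything before it. A minimal in-memory sketch (field names and the chaining scheme are assumptions, not a regulatory format):

```ruby
require 'digest'
require 'time'

# Hypothetical sketch of an append-only, hash-chained audit trail for
# incident actions: every investigation step is documented, and altering
# an earlier entry breaks the chain.
class IncidentAuditTrail
  def initialize
    @entries = []
  end

  def record(actor:, action:)
    previous_hash = @entries.empty? ? '0' * 64 : @entries.last[:hash]
    entry = { actor: actor, action: action, at: Time.now.utc.iso8601, previous_hash: previous_hash }
    entry[:hash] = Digest::SHA256.hexdigest("#{previous_hash}|#{actor}|#{action}|#{entry[:at]}")
    @entries << entry
    entry
  end

  # Verify the chain: each entry must reference the hash of its predecessor.
  def intact?
    @entries.each_cons(2).all? { |prev, curr| curr[:previous_hash] == prev[:hash] }
  end

  attr_reader :entries
end
```

A production system would persist entries to write-once storage; the chain check then detects after-the-fact edits during audits.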
Healthcare system incident management prioritizes patient safety above all other considerations. Healthcare organizations classify incidents based on potential patient impact rather than system functionality. An incident affecting appointment scheduling receives lower priority than one impacting medication dispensing or emergency department systems.
HIPAA compliance requirements influence incident response procedures. Security incidents involving potential protected health information (PHI) exposure trigger mandatory breach assessment protocols. Healthcare organizations maintain dedicated privacy officers who participate in security incident response.
Healthcare systems often maintain paper backup procedures for critical workflows. When electronic health record systems fail, staff revert to paper forms and manual processes. Incident response includes activating backup procedures and communicating downtime procedures to clinical staff. Post-incident work includes reconciling paper records with electronic systems after restoration.
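The patient-impact-first classification described above can be encoded as a lookup from system to impact tier to priority. A minimal sketch; the system names and tier boundaries are illustrative assumptions:

```ruby
# Hypothetical sketch: prioritizing healthcare incidents by potential
# patient impact rather than system functionality.
PATIENT_IMPACT = {
  'medication_dispensing'  => :direct_care,    # can affect treatment safety
  'emergency_department'   => :direct_care,
  'lab_results'            => :care_delay,     # delays rather than endangers care
  'appointment_scheduling' => :administrative
}.freeze

PRIORITY_BY_IMPACT = { direct_care: :sev1, care_delay: :sev2, administrative: :sev3 }.freeze

# Unknown systems default to the administrative tier pending manual review.
def classify_healthcare_incident(system)
  impact = PATIENT_IMPACT.fetch(system, :administrative)
  PRIORITY_BY_IMPACT.fetch(impact)
end
```

This makes the scheduling-versus-dispensing distinction from the text explicit: both are outages, but only one is a SEV1.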
Gaming platform incident management centers on massive traffic spikes and player experience. Game launches, content updates, and promotional events generate enormous traffic surges that stress infrastructure. Gaming companies accept that some incidents will occur during launches and prepare procedures for rapid scaling and recovery.
Player sentiment monitoring augments traditional monitoring. Social media analysis, forum monitoring, and customer support ticket patterns provide early incident detection. Players report issues through these channels before internal monitoring detects problems, particularly for edge cases affecting specific game regions or configurations.
Gaming platforms implement queue systems and maintenance mode capabilities as incident mitigation strategies. When backend systems reach capacity, queue systems throttle player connections rather than allowing complete failures. Planned maintenance windows provide opportunities for infrastructure changes and issue resolution with minimal player impact.
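The queue-based throttling described above can be sketched as a simple admission gate: players connect directly while capacity remains, then wait in order. A minimal in-memory sketch (the capacity number and promotion policy are illustrative assumptions):

```ruby
# Hypothetical sketch of a login queue used as incident mitigation: when
# backend capacity is reached, new players wait instead of overloading
# the service.
class LoginQueue
  def initialize(capacity:)
    @capacity = capacity
    @active = 0
    @waiting = []
  end

  # Admit the player directly if capacity allows, otherwise queue them
  # and report their position.
  def connect(player_id)
    if @active < @capacity
      @active += 1
      { status: :connected }
    else
      @waiting << player_id
      { status: :queued, position: @waiting.size }
    end
  end

  # When a session ends, promote the next waiting player.
  def disconnect
    @active -= 1
    connect(@waiting.shift) unless @waiting.empty?
  end

  attr_reader :active, :waiting
end
```

During an incident, lowering `capacity` degrades gracefully (longer queues) instead of failing connections outright.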
Reference
Incident Severity Classifications
| Severity | Response Time | Description | Example |
|---|---|---|---|
| SEV1 | 15 minutes | Complete service outage or critical functionality unavailable to all users | Payment processing down, database offline, application completely unreachable |
| SEV2 | 1 hour | Major functionality degraded or unavailable to significant user segment | Checkout flow slow, search not working, API rate limiting affecting major integration |
| SEV3 | 4 hours | Minor functionality impaired or affecting limited users | Image uploads failing, notification delays, secondary feature broken |
| SEV4 | 24 hours | Cosmetic issues or edge case problems | UI styling broken, rare error condition, minor feature request |
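A minimal sketch encoding the severity table above as configuration, so alerting code can compute the acknowledgement deadline for an incident. The hash structure is an assumption, not any specific tool's format:

```ruby
# Severity SLAs mirroring the table: response time in minutes per level.
SEVERITY_SLAS = {
  sev1: { response_minutes: 15,   description: 'Complete outage or critical functionality down' },
  sev2: { response_minutes: 60,   description: 'Major functionality degraded for a significant segment' },
  sev3: { response_minutes: 240,  description: 'Minor functionality impaired or limited users affected' },
  sev4: { response_minutes: 1440, description: 'Cosmetic issues or edge cases' }
}.freeze

# Deadline by which the incident must be acknowledged.
def response_deadline(severity, detected_at)
  detected_at + SEVERITY_SLAS.fetch(severity)[:response_minutes] * 60
end
```

Keeping the SLAs as data rather than scattered conditionals makes them easy to audit against the published table.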
Incident Status Workflow
| Status | Description | Next Actions | Duration Target |
|---|---|---|---|
| Detected | Incident identified through monitoring or reports | Assign commander, begin triage | 5-15 minutes |
| Investigating | Team actively researching cause and scope | Implement monitoring, test hypotheses | 15-60 minutes |
| Identified | Root cause understood | Implement fix or mitigation | 10-30 minutes |
| Monitoring | Fix deployed, observing for improvement | Verify metrics, collect data | 30-120 minutes |
| Resolved | Service fully restored and stable | Document resolution, schedule post-incident | N/A |
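The workflow above can be enforced as a guarded state machine so incidents cannot skip stages. A minimal sketch; the back-transition from Monitoring to Investigating (for a fix that does not hold) is an assumption consistent with common practice:

```ruby
# Allowed status transitions mirroring the workflow table.
TRANSITIONS = {
  detected:      [:investigating],
  investigating: [:identified],
  identified:    [:monitoring],
  monitoring:    [:resolved, :investigating], # fix verified, or fix didn't hold
  resolved:      []
}.freeze

# Returns the new status, or raises if the move skips a stage.
def transition(current, to)
  unless TRANSITIONS.fetch(current).include?(to)
    raise ArgumentError, "illegal transition #{current} -> #{to}"
  end
  to
end
```

An incident tracker would call this before persisting a status change, rejecting e.g. Detected straight to Resolved.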
Common Incident Types
| Type | Detection Method | Typical Cause | Standard Mitigation |
|---|---|---|---|
| Database Connection Pool Exhaustion | Health check failure, timeout errors | Connection leak, query timeout, traffic spike | Increase pool size, restart connections, kill long queries |
| Memory Leak | Gradual memory growth, OOM kills | Object accumulation, cache unbounded growth | Rolling restart, heap dump analysis, patch deployment |
| API Rate Limit | 429 responses, external service errors | Traffic spike, incorrect rate calculation | Circuit breaker, request queuing, backoff retry |
| Dependency Failure | External API timeouts, connection refused | Third-party outage, network issue | Fallback behavior, circuit breaker, cached data |
| Deployment Failure | Error spike post-deploy, health check fail | Code bug, config error, database migration | Rollback deployment, fix forward, feature flag disable |
| Cache Invalidation | Cache miss spike, slow response times | Cache flush, cache server restart | Warm cache, reduce cache dependency, add cache layers |
| Disk Space Exhaustion | Write failures, log errors | Log accumulation, temp file buildup | Clear logs, expand storage, implement log rotation |
Key Metrics
| Metric | Description | Target | Calculation |
|---|---|---|---|
| MTTI | Mean Time to Identify - average time from incident occurrence to detection | Under 5 minutes | Sum of detection delays / incident count |
| MTTA | Mean Time to Acknowledge - average time from detection to response start | Under 15 minutes | Sum of acknowledgment times / incident count |
| MTTR | Mean Time to Resolve - average time from detection to resolution | Varies by severity | Sum of resolution times / incident count |
| MTBF | Mean Time Between Failures - average time between incidents | Increasing trend | Total operational time / incident count |
| Incident Count | Total incidents in time period by severity | Decreasing trend | Count of incidents per week/month |
| False Positive Rate | Percentage of alerts that are not real incidents | Under 10% | False alerts / total alerts |
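The MTTA and MTTR formulas in the table above reduce to the same computation over different timestamp pairs. A minimal sketch using plain hashes; a real system would read these fields from its incident store:

```ruby
# Mean elapsed minutes between two timestamps across incidents, per the
# "sum of times / incident count" formulas in the table.
def mean_minutes(incidents, from:, to:)
  return 0.0 if incidents.empty?
  total = incidents.sum { |i| (i[to] - i[from]) / 60.0 }
  (total / incidents.size).round(2)
end

incidents = [
  { detected_at: Time.utc(2024, 1, 1, 10, 0), acknowledged_at: Time.utc(2024, 1, 1, 10, 5),  resolved_at: Time.utc(2024, 1, 1, 11, 0) },
  { detected_at: Time.utc(2024, 1, 2, 9, 0),  acknowledged_at: Time.utc(2024, 1, 2, 9, 15), resolved_at: Time.utc(2024, 1, 2, 9, 30) }
]

mtta = mean_minutes(incidents, from: :detected_at, to: :acknowledged_at) # 10.0 minutes
mttr = mean_minutes(incidents, from: :detected_at, to: :resolved_at)    # 45.0 minutes
```

Segmenting the input by severity before averaging gives the per-severity MTTR targets the table mentions.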
Ruby Monitoring Integration Examples
| Service | Gem | Configuration Pattern |
|---|---|---|
| Sentry | sentry-ruby, sentry-rails | Initialize with DSN, set environment and release tags, configure breadcrumbs and sampling |
| Honeybadger | honeybadger | Configure API key, set environment, customize error filtering and grouping |
| New Relic | newrelic_rpm | License key configuration, instrument custom transactions, set labels and tags |
| Datadog | ddtrace | Service name and environment, trace sampling rate, custom span tags |
| Scout APM | scout_apm | Key and app name, monitor background jobs, track custom context |
Incident Communication Templates
| Scenario | Initial Message | Update Cadence | Resolution Message |
|---|---|---|---|
| SEV1 - Complete Outage | We are currently experiencing a complete service outage. All users are affected. Engineers are investigating. | Every 15 minutes | Service has been fully restored. All systems are operational. Post-incident review will follow. |
| SEV2 - Partial Degradation | We are experiencing degraded performance in [feature]. Some users may experience [impact]. | Every 30 minutes | Performance has been restored to normal levels. Monitoring continues. |
| SEV3 - Minor Issue | We are aware of an issue affecting [feature]. Impact is limited to [scope]. | Hourly | Issue has been resolved. Service operating normally. |
Incident Response Checklist
| Phase | Actions | Owner | Completed |
|---|---|---|---|
| Detection | Verify incident is occurring, assess initial severity, create incident record | Monitoring System / On-call | □ |
| Triage | Assign incident commander, determine severity, notify stakeholders | Incident Commander | □ |
| Investigation | Review recent changes, check monitoring dashboards, examine logs | Technical Responders | □ |
| Mitigation | Implement temporary fix, roll back if needed, enable workarounds | Technical Responders | □ |
| Communication | Update status page, notify affected users, brief stakeholders | Communication Lead | □ |
| Resolution | Deploy permanent fix, verify restoration, document resolution | Technical Responders | □ |
| Post-Incident | Schedule review, identify action items, update runbooks | Incident Commander | □ |
Escalation Thresholds
| Trigger | Escalation Action | Escalation Target |
|---|---|---|
| SEV1 incident detected | Immediate page to primary on-call | Primary On-call Engineer |
| No acknowledgment in 5 minutes | Page secondary on-call | Secondary On-call Engineer |
| No acknowledgment in 10 minutes | Page manager and escalation team | Engineering Manager |
| Unresolved after 30 minutes | Notify senior leadership | Director / VP Engineering |
| User-facing impact exceeds 1 hour | Notify executive team | CTO / CEO |
| Security incident detected | Immediate security team notification | Security Team Lead |
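The time-based rungs of the ladder above can be expressed as a threshold lookup: given minutes since detection without acknowledgement, return who to page next. A minimal sketch (targets are roles from the table, not specific people):

```ruby
# Escalation ladder ordered highest threshold first, so the first rung
# the elapsed time clears is the one that fires.
ESCALATION_LADDER = [
  [10, 'Engineering Manager'],
  [5,  'Secondary On-call Engineer'],
  [0,  'Primary On-call Engineer']
].freeze

def escalation_target(minutes_unacknowledged)
  ESCALATION_LADDER.find { |threshold, _| minutes_unacknowledged >= threshold }.last
end
```

A paging scheduler would re-evaluate this on a timer until the incident is acknowledged, escalating as each threshold passes.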
Automation Triggers
| Condition | Automated Action | Implementation |
|---|---|---|
| Error rate exceeds 5% | Create SEV2 incident, notify on-call | Monitoring rule triggers webhook to incident creation endpoint |
| Health check fails 3 consecutive times | Create incident, attempt automated remediation | Monitoring triggers runbook execution |
| Memory usage above 90% | Alert before OOM, trigger investigation | Prometheus alert triggers PagerDuty |
| Response time p95 above 2 seconds | Create SEV3 incident for investigation | APM monitoring threshold alert |
| Deployment causes error spike | Automatic rollback initiated | Deployment pipeline monitors error rates post-deploy |
| External dependency timeout rate high | Enable circuit breaker | Application monitors timeout patterns and activates protection |
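The first automation row above can be sketched as a monitoring check that fires an incident-creation callback when the error rate crosses 5%, firing at most once per breach. The notifier is injected to keep the example self-contained; in production it would be the webhook to the incident-creation endpoint:

```ruby
# Hypothetical sketch of an error-rate automation trigger.
class ErrorRateTrigger
  THRESHOLD = 0.05 # 5% error rate

  def initialize(notifier)
    @notifier = notifier
    @open = false # true while a breach is already being handled
  end

  # Evaluate one monitoring sample; fire once when the threshold is
  # crossed, then stay quiet until the rate recovers.
  def evaluate(requests:, errors:)
    rate = requests.zero? ? 0.0 : errors.to_f / requests
    if rate > THRESHOLD && !@open
      @open = true
      @notifier.call(severity: :sev2, reason: format('error rate %.1f%%', rate * 100))
    elsif rate <= THRESHOLD
      @open = false
    end
  end
end
```

The latch (`@open`) is what prevents the monitoring rule from opening a duplicate incident on every sample while the breach persists.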