Overview
Incident management represents the structured process of identifying, analyzing, and resolving unplanned disruptions to software systems. An incident refers to any event that degrades or interrupts normal service operation, ranging from complete outages to performance degradation. The practice emerged from IT service management disciplines in the 1980s and became codified through frameworks like ITIL (Information Technology Infrastructure Library).
The primary objective centers on restoring normal service operation as quickly as possible while minimizing adverse impact on business operations. Unlike problem management, which focuses on identifying root causes, incident management prioritizes immediate resolution and service restoration. The distinction matters because during an active incident, determining why a system failed takes secondary importance to making it work again.
Modern incident management encompasses several interconnected activities: detection through monitoring and alerting, triage to assess severity and impact, investigation to understand the failure, mitigation to restore service, communication to inform stakeholders, and post-incident analysis to prevent recurrence. Each activity requires different skills, tools, and processes.
The practice has evolved significantly with the adoption of distributed systems, microservices, and cloud infrastructure. Traditional incident management assumed relatively stable, monolithic applications where incidents were infrequent and discrete events. Contemporary systems generate continuous streams of alerts, partial failures, and cascading issues that require more sophisticated approaches.
# Basic incident detection in a Rails application
class IncidentDetector
  def initialize(threshold: 0.05)
    @error_threshold = threshold
    @time_window = 5.minutes
  end

  def check_error_rate
    recent_requests = RequestLog.where('created_at > ?', @time_window.ago).count
    return if recent_requests.zero? # avoid dividing by zero on quiet windows

    recent_errors = RequestLog.where('created_at > ? AND status >= 500', @time_window.ago).count
    error_rate = recent_errors.to_f / recent_requests
    trigger_incident(error_rate) if error_rate > @error_threshold
  end

  def trigger_incident(error_rate)
    Incident.create!(
      severity: calculate_severity(error_rate),
      title: "High error rate detected",
      details: "Error rate: #{(error_rate * 100).round(2)}%"
    )
  end

  private

  # Simple mapping from error rate to severity; thresholds are illustrative
  def calculate_severity(error_rate)
    error_rate > 0.25 ? :sev1 : :sev2
  end
end
The field continues to adapt as organizations shift toward observability-driven approaches that emphasize understanding system behavior through metrics, logs, and traces rather than relying solely on predefined alerts.
Key Principles
Incident management operates on several fundamental principles that guide effective response and resolution. These principles reflect decades of operational experience across diverse software systems.
Severity classification forms the foundation of incident response. Systems categorize incidents based on impact and urgency, typically using levels like SEV1 (critical), SEV2 (major), SEV3 (minor), and SEV4 (low). Critical incidents affect all users or core functionality and require immediate response. Major incidents impact significant user segments or important features. Minor incidents affect limited users or non-critical features. Low-severity incidents represent cosmetic issues or edge cases. The classification determines response speed, escalation paths, and resource allocation.
Incident lifecycle management defines distinct phases each incident traverses. Detection occurs through automated monitoring or user reports. Triage assesses severity and assigns ownership. Investigation identifies the nature and scope of the problem. Mitigation implements temporary fixes to restore service. Resolution addresses the underlying issue. Closure documents the incident and triggers follow-up activities. Each phase has specific goals, participants, and exit criteria.
Role-based response assigns clear responsibilities during incidents. The incident commander coordinates the overall response, makes decisions, and communicates with stakeholders. Technical responders investigate and implement fixes. Communication managers handle external updates. Subject matter experts provide specialized knowledge. Clear role assignment prevents confusion and duplicated effort during high-stress situations.
Communication protocols ensure stakeholders receive timely, accurate information. Internal communication keeps team members synchronized on investigation progress and mitigation attempts. External communication updates users about impact and expected resolution times. Status pages provide transparent, real-time incident information. Communication frequency increases with incident severity and duration.
Blameless culture emphasizes learning over punishment. Post-incident reviews focus on systemic issues rather than individual actions. The goal is to understand how the incident occurred within existing processes and controls, not to identify who made a mistake. This approach encourages transparency and thorough analysis.
Documentation requirements capture incident details for future reference. Incident records include timeline of events, actions taken, impact assessment, and resolution steps. This documentation serves multiple purposes: knowledge transfer, pattern recognition, audit compliance, and post-incident analysis.
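The timeline portion of an incident record can be sketched as a small append-only log. The class and field names here are illustrative, not part of any standard schema:

```ruby
require 'time'
require 'json'

# Minimal incident timeline: an append-only list of timestamped entries
# that can be exported for the post-incident review.
class IncidentTimeline
  attr_reader :entries

  def initialize
    @entries = []
  end

  def record(actor:, action:, at: Time.now.utc)
    @entries << { at: at.iso8601, actor: actor, action: action }
  end

  def to_json(*)
    JSON.pretty_generate(@entries)
  end
end

timeline = IncidentTimeline.new
timeline.record(actor: 'monitor', action: 'High error rate detected',
                at: Time.utc(2024, 5, 1, 12, 0))
timeline.record(actor: 'alice', action: 'Rolled back release 42.1',
                at: Time.utc(2024, 5, 1, 12, 14))
```

Because each entry carries its own timestamp and actor, the same log serves knowledge transfer, audit compliance, and the post-incident review.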
Escalation procedures define when and how to involve additional resources. Time-based escalation automatically notifies senior staff if incidents remain unresolved after specific durations. Impact-based escalation involves leadership when incidents exceed certain severity thresholds. Technical escalation brings in specialized expertise for complex issues.
class IncidentResponse
  SEVERITY_LEVELS = {
    sev1: { response_time: 15.minutes, escalation_time: 30.minutes },
    sev2: { response_time: 1.hour, escalation_time: 4.hours },
    sev3: { response_time: 4.hours, escalation_time: 24.hours },
    sev4: { response_time: 24.hours, escalation_time: nil }
  }.freeze

  def self.handle(incident)
    # Rails enums return strings, so convert before the symbol-keyed lookup
    response_config = SEVERITY_LEVELS.fetch(incident.severity.to_sym)

    # Assign incident commander and set deadlines in one write
    incident.update!(
      commander: assign_commander(incident.severity),
      response_deadline: Time.current + response_config[:response_time],
      escalation_deadline: response_config[:escalation_time] &&
        Time.current + response_config[:escalation_time]
    )

    # Notify responders
    notify_on_call_team(incident)

    # Publicize high-severity incidents
    create_status_page_entry(incident) if incident.sev1? || incident.sev2?
  end
end
Mean time to resolution (MTTR) serves as the primary metric for incident management effectiveness. MTTR measures the average time from incident detection to complete resolution. Organizations track MTTR trends to assess whether incident response improves over time. However, MTTR alone provides incomplete insight—frequency and severity of incidents matter equally.
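The metric itself is simple to compute. A plain-Ruby sketch, using a hypothetical struct rather than the ActiveRecord model shown elsewhere in this chapter:

```ruby
require 'time'

# Mean time to resolution: average of (resolved_at - detected_at)
# across resolved incidents, expressed in minutes.
Incident = Struct.new(:detected_at, :resolved_at, keyword_init: true)

def mttr_minutes(incidents)
  resolved = incidents.reject { |i| i.resolved_at.nil? }
  return nil if resolved.empty?

  total_seconds = resolved.sum { |i| i.resolved_at - i.detected_at }
  (total_seconds / resolved.size / 60.0).round(1)
end

incidents = [
  Incident.new(detected_at: Time.utc(2024, 1, 1, 10, 0),
               resolved_at: Time.utc(2024, 1, 1, 10, 30)),
  Incident.new(detected_at: Time.utc(2024, 1, 2, 9, 0),
               resolved_at: Time.utc(2024, 1, 2, 10, 30)),
  Incident.new(detected_at: Time.utc(2024, 1, 3, 8, 0),
               resolved_at: nil) # still open; excluded from the average
]

mttr_minutes(incidents) # => 60.0
```

Open incidents are excluded rather than counted as zero, which would otherwise bias the average downward.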
Runbooks and playbooks codify response procedures for common incident types. Runbooks contain step-by-step instructions for diagnosing and resolving specific issues. Playbooks define processes for managing different incident categories. Both reduce response time by eliminating the need to rediscover solutions during incidents.
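At its simplest, a runbook registry is a lookup from incident type to an ordered list of steps. The types and steps below are illustrative:

```ruby
# Minimal runbook registry: maps an incident type to an ordered list
# of diagnostic and remediation steps.
RUNBOOKS = {
  'database_connection_pool_exhausted' => [
    'Check pool stats: ActiveRecord::Base.connection_pool.stat',
    'Look for long-running queries holding connections',
    'Increase pool size or restart stuck workers'
  ],
  'api_rate_limit' => [
    'Confirm 429 responses in the provider dashboard',
    'Open the circuit breaker to stop outbound requests',
    'Queue failed work for retry after the limit window'
  ]
}.freeze

def runbook_for(incident_type)
  RUNBOOKS.fetch(incident_type) { ['No runbook found; escalate to on-call lead'] }
end
```

A fallback entry matters: responders should always receive an actionable next step, even for incident types nobody anticipated.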
Ruby Implementation
Ruby applications implement incident management through integration with monitoring services, logging frameworks, and alerting systems. The Ruby ecosystem provides numerous gems and patterns for detecting, tracking, and responding to incidents.
Error tracking integration represents the most common incident detection mechanism. Services like Sentry, Honeybadger, and Rollbar capture exceptions and aggregate them for analysis. These tools integrate directly into Ruby applications to automatically report errors.
# Sentry integration in Rails
# config/initializers/sentry.rb
Sentry.init do |config|
  config.dsn = ENV['SENTRY_DSN']
  config.breadcrumbs_logger = [:active_support_logger, :http_logger]
  config.traces_sample_rate = 0.1
  config.profiles_sample_rate = 0.1

  config.before_send = lambda do |event, _hint|
    # Add custom context
    event.user = {
      id: Current.user&.id,
      email: Current.user&.email
    }

    # Filter sensitive data (filter_sensitive_data is an app-defined helper)
    event.request.data = filter_sensitive_data(event.request.data) if event.request

    event
  end
end

# Capturing custom incidents
class PaymentProcessor
  def process_payment(order)
    result = PaymentGateway.charge(order.total, order.payment_token)

    if result.failure?
      Sentry.capture_message(
        "Payment processing failure",
        level: :error,
        extra: {
          order_id: order.id,
          amount: order.total,
          error_code: result.error_code
        }
      )

      # Create internal incident record
      Incident.create!(
        severity: :sev2,
        title: "Payment processing failure",
        service: "payment_processor",
        details: result.error_message
      )
    end

    result
  end
end
Health check endpoints enable external monitoring systems to detect incidents. These endpoints verify critical system components and dependencies.
# app/controllers/health_controller.rb
class HealthController < ApplicationController
  skip_before_action :authenticate_user!

  def show
    checks = {
      database: check_database,
      redis: check_redis,
      external_api: check_external_api, # defined like the checks below
      disk_space: check_disk_space
    }
    healthy = checks.values.all? { |check| check[:status] == 'ok' }

    render json: {
      status: healthy ? 'healthy' : 'unhealthy',
      checks: checks,
      timestamp: Time.current.iso8601
    }, status: healthy ? 200 : 503
  end

  private

  def check_database
    time = measure_response_time { ActiveRecord::Base.connection.execute('SELECT 1') }
    { status: 'ok', response_time: time }
  rescue => e
    { status: 'failed', error: e.message }
  end

  def check_redis
    # redis_connection is an app-managed client (Redis.current was removed in redis-rb 5)
    time = measure_response_time { redis_connection.ping }
    { status: 'ok', response_time: time }
  rescue => e
    { status: 'failed', error: e.message }
  end

  def measure_response_time
    start = Time.current
    yield
    ((Time.current - start) * 1000).round(2) # milliseconds
  end
end
Custom monitoring and alerting let Ruby applications define domain-specific incident conditions. The application watches business metrics and system health indicators and opens incidents when thresholds are breached.
class BusinessMetricsMonitor
  def initialize
    @statsd = Statsd.new('localhost', 8125)
  end

  def check_conversion_rate
    window = 1.hour.ago
    visitors = Visit.where('created_at > ?', window).count
    return if visitors < 100 # too little traffic to judge (also avoids dividing by zero)

    purchases = Order.where('created_at > ?', window).count
    conversion_rate = purchases.to_f / visitors
    @statsd.gauge('business.conversion_rate', conversion_rate)

    if conversion_rate < 0.02
      trigger_incident(
        title: "Low conversion rate detected",
        severity: :sev2,
        metrics: {
          conversion_rate: conversion_rate,
          visitors: visitors,
          purchases: purchases
        }
      )
    end
  end

  def check_payment_success_rate
    window = 15.minutes.ago
    attempts = PaymentAttempt.where('created_at > ?', window).count
    return if attempts < 10 # avoid noise on low volume

    successful = PaymentAttempt.where('created_at > ? AND status = ?', window, 'succeeded').count
    success_rate = successful.to_f / attempts
    @statsd.gauge('payments.success_rate', success_rate)

    if success_rate < 0.95
      trigger_incident(
        title: "Payment success rate below threshold",
        severity: :sev1,
        metrics: {
          success_rate: success_rate,
          attempts: attempts,
          successful: successful
        }
      )
    end
  end

  private

  def trigger_incident(title:, severity:, metrics:)
    incident = Incident.create!(
      title: title,
      severity: severity,
      details: metrics.to_json,
      detected_at: Time.current
    )
    IncidentNotifier.notify(incident)
  end
end
Incident notification systems deliver alerts to on-call engineers through various channels. Integration with PagerDuty, Opsgenie, or similar services ensures incidents reach the appropriate responders.
class IncidentNotifier
  def self.notify(incident)
    new(incident).notify
  end

  def initialize(incident)
    @incident = incident
  end

  def notify
    # Rails enums expose severity as a string, so match on strings
    case @incident.severity
    when 'sev1', 'sev2'
      notify_pagerduty
      notify_slack
      update_status_page
    when 'sev3'
      notify_slack
    when 'sev4'
      create_ticket
    end
  end

  private

  def notify_pagerduty
    HTTParty.post(
      'https://events.pagerduty.com/v2/enqueue',
      headers: { 'Content-Type' => 'application/json' },
      body: {
        routing_key: ENV['PAGERDUTY_ROUTING_KEY'],
        event_action: 'trigger',
        payload: {
          summary: @incident.title,
          severity: pagerduty_severity, # app-defined: maps sev1-4 to PagerDuty's critical/error/warning/info
          source: 'application',
          custom_details: {
            incident_id: @incident.id,
            details: @incident.details,
            detected_at: @incident.detected_at
          }
        }
      }.to_json
    )
  end

  def notify_slack
    SlackNotifier.new.post(
      text: format_slack_message,
      channel: incident_channel,
      username: 'Incident Manager'
    )
  end

  def format_slack_message
    <<~MESSAGE
      :warning: *#{@incident.severity.upcase} Incident*
      *Title:* #{@incident.title}
      *Detected:* #{@incident.detected_at.strftime('%Y-%m-%d %H:%M:%S %Z')}
      *Details:* #{@incident.details}
      *Link:* #{incident_url}
    MESSAGE
  end
end
Incident tracking models maintain incident state and history within the application database. These models store incident details, track status changes, and link related information.
class Incident < ApplicationRecord
  enum severity: { sev1: 0, sev2: 1, sev3: 2, sev4: 3 }
  enum status: {
    detected: 0,
    investigating: 1,
    identified: 2,
    monitoring: 3,
    resolved: 4
  }

  belongs_to :commander, class_name: 'User', optional: true
  has_many :incident_updates, dependent: :destroy
  has_many :incident_actions, dependent: :destroy

  validates :title, :severity, :status, presence: true

  after_create :start_incident_timer
  after_update :record_status_change, if: :saved_change_to_status?

  def duration
    return nil unless resolved?
    resolved_at - detected_at
  end

  def time_to_detect
    detected_at - created_at
  end

  def time_to_resolve
    return nil unless resolved?
    resolved_at - detected_at
  end

  def add_update(message, user)
    incident_updates.create!(message: message, user: user)
  end

  def escalate!
    if sev2?
      update!(severity: :sev1)
    elsif sev3?
      update!(severity: :sev2)
    end
    IncidentNotifier.notify(self)
  end

  private

  def start_incident_timer
    IncidentTimerJob.set(wait: response_deadline).perform_later(id)
  end

  def response_deadline
    case severity
    when 'sev1' then 15.minutes
    when 'sev2' then 1.hour
    when 'sev3' then 4.hours
    else 24.hours
    end
  end

  def record_status_change
    incident_actions.create!(
      action: "Status changed from #{status_before_last_save} to #{status}",
      performed_at: Time.current
    )
  end
end
Implementation Approaches
Organizations adopt different incident management strategies based on their operational maturity, system complexity, and team structure. Each approach represents different trade-offs between automation, process rigor, and flexibility.
Reactive approach represents the most basic incident management strategy. Teams respond to incidents as they occur without proactive monitoring or standardized processes. Engineers receive alerts through informal channels like direct messages or email. Response depends heavily on individual knowledge and availability. This approach works for small teams managing simple systems but breaks down as complexity increases. The primary risk involves inconsistent response quality and knowledge concentration in specific individuals.
Process-driven approach implements formal incident management procedures aligned with frameworks like ITIL. Organizations define detailed processes for detection, triage, investigation, and resolution. Each step includes specific activities, roles, and documentation requirements. Incident tickets flow through defined stages with approval gates and handoffs between teams. This approach ensures consistency and auditability but can introduce overhead that slows incident resolution. Process-driven approaches suit regulated industries or large organizations requiring compliance and standardization.
On-call rotation approach distributes incident response responsibility across team members. Engineers take turns being on-call for specific time periods, typically one week. During their rotation, they carry the responsibility for acknowledging and responding to incidents. Organizations implement primary and secondary on-call coverage to ensure redundancy. This approach spreads incident knowledge across the team and prevents burnout from constant interruptions. However, it requires sufficient team size and can create handoff issues between rotations.
class OnCallSchedule
  def initialize
    @schedule_client = PagerDuty::Client.new(api_token: ENV['PAGERDUTY_API_TOKEN'])
  end

  def current_responder(schedule_id)
    response = @schedule_client.get("schedules/#{schedule_id}/users", since: Time.current)
    response['users'].first
  end

  def escalate_to_secondary(incident)
    secondary_responder = current_responder(ENV['SECONDARY_SCHEDULE_ID'])
    incident.update!(
      escalated_to: secondary_responder['id'],
      escalated_at: Time.current
    )
    notify_responder(secondary_responder, incident) # app-defined notification hook
  end
end
Follow-the-sun approach maintains continuous incident coverage by distributing on-call responsibilities across global time zones. As one region ends their workday, responsibility transfers to another region beginning their workday. This approach reduces after-hours incidents for individual engineers and provides faster incident response by routing to engineers during their normal working hours. Implementation requires geographic team distribution and effective handoff procedures between regions.
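The routing decision behind follow-the-sun coverage can be sketched as a lookup by UTC hour. The regions and hour ranges below are illustrative assumptions, not a standard rotation:

```ruby
# Follow-the-sun routing sketch: pick the region whose business hours
# (expressed in UTC) contain the current time.
REGIONS = [
  { name: 'EMEA',     utc_hours: 7...15 },
  { name: 'Americas', utc_hours: 15...23 },
  { name: 'APAC',     utc_hours: nil } # fallback covering 23:00-07:00 UTC
].freeze

def active_region(utc_hour)
  REGIONS.find { |r| r[:utc_hours]&.cover?(utc_hour) } || REGIONS.last
end

active_region(10)[:name] # => "EMEA"
active_region(3)[:name]  # => "APAC"
```

In practice this logic lives in the paging platform's schedule configuration rather than application code; the sketch just makes the handoff boundaries explicit.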
Incident command system (ICS) structures response around clearly defined roles with explicit responsibilities. The incident commander leads the response, makes decisions, and coordinates activities. Technical leads handle specific investigation areas. Communication leads manage stakeholder updates. Scribe documents the incident timeline. This approach, borrowed from emergency management, works well for complex incidents involving multiple teams. Role clarity prevents confusion during high-stress situations.
War room approach brings all incident responders into a dedicated communication channel (physical or virtual) for critical incidents. The war room centralizes communication, accelerates decision-making, and maintains shared context. Teams use dedicated Slack channels, Zoom calls, or physical meeting rooms depending on circumstances. This approach concentrates expertise and enables rapid collaboration but consumes significant resources. Organizations typically reserve war rooms for SEV1 incidents affecting major functionality.
Automated remediation approach implements self-healing capabilities that detect and resolve common incidents without human intervention. Systems monitor for known failure conditions and execute predefined remediation actions. Automated remediation works for well-understood, repeatable issues like restarting failed services, clearing cache, or failing over to backup systems. This approach reduces MTTR and engineering toil but requires significant upfront investment in automation infrastructure.
class AutomatedRemediation
  REMEDIATION_ACTIONS = {
    high_memory: :restart_service,
    database_connection_pool_exhausted: :increase_pool_size,
    disk_full: :clear_logs,
    cache_miss_rate_high: :warm_cache
  }.freeze

  def attempt_remediation(incident)
    action = REMEDIATION_ACTIONS[incident.incident_type.to_sym]
    return false unless action

    incident.update!(status: :remediating)
    result = send(action, incident)

    if result.success?
      incident.update!(
        status: :resolved,
        resolution: "Auto-remediated via #{action}",
        resolved_at: Time.current
      )
      true
    else
      incident.update!(
        status: :investigating,
        notes: "Automated remediation failed: #{result.error}"
      )
      false
    end
  end

  private

  def restart_service(incident)
    service_name = incident.metadata['service_name']
    SystemCtl.restart(service_name)
    OpenStruct.new(success?: true) # normalize the return value for attempt_remediation
  rescue => e
    OpenStruct.new(success?: false, error: e.message)
  end
end
Chaos engineering approach proactively injects failures into production systems to validate incident detection and response capabilities. Teams regularly conduct game days where they simulate incidents and practice response procedures. This approach identifies gaps in monitoring, documentation, and processes before real incidents occur. Chaos engineering builds confidence in system resilience and response capabilities but requires organizational maturity and careful implementation to avoid causing actual outages.
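A game-day exercise needs a controlled way to inject faults. The sketch below wraps a call and fails a configurable fraction of invocations; it is intended for rehearsal in a controlled experiment, not as a production library:

```ruby
# Fault injection sketch for game-day drills: raise on a configurable
# fraction of calls so teams can rehearse detection and response.
class FaultInjector
  def initialize(failure_rate:, rng: Random.new)
    @failure_rate = failure_rate
    @rng = rng
  end

  def call
    raise 'injected fault' if @rng.rand < @failure_rate
    yield
  end
end

safe = FaultInjector.new(failure_rate: 0.0)
safe.call { :ok } # => :ok (never injects)

always = FaultInjector.new(failure_rate: 1.0)
begin
  always.call { :ok }
rescue RuntimeError => e
  e.message # => "injected fault"
end
```

Passing a seeded `Random` makes a drill reproducible, which helps when comparing how two teams (or two monitoring configurations) handle the same failure sequence.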
Tools & Ecosystem
The Ruby ecosystem includes numerous tools and services for implementing comprehensive incident management. Selection depends on organization size, budget, system complexity, and operational requirements.
Error tracking services aggregate application exceptions and provide analysis capabilities. Sentry remains the most popular option in the Ruby community, offering exception grouping, release tracking, and performance monitoring. Honeybadger provides similar functionality with focus on Rails applications. Rollbar includes deployment tracking and person tracking to understand which users encounter errors. Airbrake offers lightweight error monitoring with simple integration. Bugsnag provides comprehensive error diagnostics with support for multiple languages.
# Comparing error tracking integrations
# Sentry
Sentry.capture_exception(exception, extra: { order_id: order.id })
# Honeybadger
Honeybadger.notify(exception, context: { order_id: order.id })
# Rollbar
Rollbar.error(exception, order_id: order.id)
# Airbrake
Airbrake.notify(exception, params: { order_id: order.id })
Incident management platforms coordinate on-call schedules, alert routing, and incident response. PagerDuty dominates the market with comprehensive scheduling, escalation policies, and integrations. Opsgenie (owned by Atlassian) provides similar functionality with tighter Jira integration. VictorOps (Splunk On-Call) focuses on collaborative incident response. xMatters emphasizes workflow automation and communication templates. incident.io offers modern incident management with built-in status pages and post-incident analysis.
Monitoring and observability platforms detect anomalies and trigger incidents. New Relic provides full-stack observability including application performance monitoring (APM), infrastructure monitoring, and logging. Datadog offers comprehensive monitoring with strong visualization capabilities and anomaly detection. AppSignal specializes in Ruby monitoring with excellent Rails integration. Scout APM focuses specifically on Rails performance monitoring. Prometheus plus Grafana provide open-source monitoring and alerting.
# New Relic custom instrumentation
class PaymentProcessor
  include NewRelic::Agent::Instrumentation::ControllerInstrumentation

  def process_payment(order)
    NewRelic::Agent.add_custom_attributes(order_id: order.id, amount: order.total)
    result = PaymentGateway.charge(order.total, order.payment_token)
    NewRelic::Agent.notice_error(PaymentError.new(result.error_message)) if result.failure?
    result
  end
  add_transaction_tracer :process_payment, category: :task
end
Logging aggregation services centralize log collection and analysis. Papertrail provides simple log aggregation with search capabilities. Loggly offers log management with dashboards and alerts. Splunk provides enterprise-grade log analysis and security information. ELK Stack (Elasticsearch, Logstash, Kibana) offers open-source log aggregation and analysis. Loki (from Grafana Labs) provides lightweight log aggregation integrated with Grafana.
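Whatever aggregator is chosen, correlating log lines with incidents is easier when each line is a structured JSON object carrying an incident identifier. A sketch, with illustrative field names rather than any standard schema:

```ruby
require 'json'
require 'time'

# Emit one JSON object per event so an aggregator (ELK, Loki, etc.)
# can filter all log lines for a given incident_id.
def incident_log_line(incident_id:, level:, message:, **extra)
  {
    ts: Time.now.utc.iso8601,
    level: level,
    incident_id: incident_id,
    message: message
  }.merge(extra).to_json
end

line = incident_log_line(incident_id: 482, level: 'error',
                         message: 'circuit breaker opened',
                         service: 'payments')
```

During an incident, a single query on `incident_id` then reconstructs the sequence of events across services.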
Status page services communicate incident status to users. Statuspage (Atlassian) provides customizable status pages with automatic updates. Better Uptime includes status pages with integrated monitoring. StatusCast offers white-label status pages for embedding. Sorry provides simple, developer-friendly status pages with API-driven updates.
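Status page updates are typically API-driven. The sketch below builds (but does not send) a Statuspage incident-creation request; the endpoint and payload shape follow the public Statuspage REST API as of this writing, so verify against the current docs before relying on them:

```ruby
require 'net/http'
require 'json'
require 'uri'

# Construct a Statuspage "create incident" request. The page_id and
# api_key values here are placeholders.
def build_status_page_request(page_id:, api_key:, name:, status:)
  uri = URI("https://api.statuspage.io/v1/pages/#{page_id}/incidents")
  req = Net::HTTP::Post.new(uri)
  req['Authorization'] = "OAuth #{api_key}"
  req['Content-Type'] = 'application/json'
  req.body = { incident: { name: name, status: status } }.to_json
  req
end

req = build_status_page_request(page_id: 'abc123', api_key: 'secret',
                                name: 'Elevated error rates',
                                status: 'investigating')
```

Sending it is then a matter of `Net::HTTP.start(uri.host, uri.port, use_ssl: true) { |http| http.request(req) }`.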
Communication tools facilitate incident response coordination. Slack remains the dominant platform with extensive integration capabilities. The slack-ruby-client gem enables programmatic Slack interaction. Microsoft Teams provides similar functionality for Microsoft-centric organizations. Mattermost offers self-hosted team communication. Zoom or Google Meet handle synchronous communication during major incidents.
# Slack integration for incident updates
class SlackIncidentChannel
  def initialize(incident)
    @incident = incident
    @client = Slack::Web::Client.new(token: ENV['SLACK_BOT_TOKEN'])
  end

  def create_channel
    # Slack channel names must be lowercase and at most 80 characters
    channel_name = "incident-#{@incident.id}-#{@incident.title.parameterize}"[0, 80]
    response = @client.conversations_create(name: channel_name, is_private: false)
    channel_id = response['channel']['id']

    @client.conversations_setTopic(
      channel: channel_id,
      topic: "#{@incident.severity.upcase} - #{@incident.title}"
    )

    post_initial_message(channel_id)
    invite_responders(channel_id) # app-defined: invites the on-call team
    channel_id
  end

  private

  def post_initial_message(channel_id)
    @client.chat_postMessage(
      channel: channel_id,
      blocks: incident_blocks,
      text: "Incident #{@incident.id}: #{@incident.title}"
    )
  end

  def incident_blocks
    [
      {
        type: 'header',
        text: {
          type: 'plain_text',
          text: "🚨 #{@incident.severity.upcase} Incident"
        }
      },
      {
        type: 'section',
        fields: [
          { type: 'mrkdwn', text: "*Title:*\n#{@incident.title}" },
          { type: 'mrkdwn', text: "*Status:*\n#{@incident.status}" },
          { type: 'mrkdwn', text: "*Detected:*\n#{@incident.detected_at}" },
          { type: 'mrkdwn', text: "*Commander:*\n#{@incident.commander&.name || 'Unassigned'}" }
        ]
      }
    ]
  end
end
Runbook automation tools codify incident response procedures. RunDeck executes automated workflows for common operations. Ansible handles configuration management and remediation tasks. The Chef and Puppet configuration management tools automate infrastructure changes. Ruby-based Capistrano deploys code and executes remote commands.
Post-incident tools facilitate learning and improvement. The Jira or Linear project management tools track follow-up action items. Confluence or Notion document post-incident reviews. Blameless provides dedicated post-incident review workflows. incident.io includes built-in post-incident analysis features.
Ruby gems for incident management extend functionality. The exception_notification gem sends email notifications for exceptions. The exception-track gem logs exceptions to database for analysis. The rollbar gem integrates with Rollbar service. The sentry-ruby gem provides Sentry integration. The health_check gem implements health check endpoints. The rack-attack gem prevents incident-causing abuse.
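As an illustration, a minimal exception_notification initializer might look like the following; option names follow the gem's README, and the addresses are placeholders:

```ruby
# config/initializers/exception_notification.rb
require 'exception_notification/rails'

ExceptionNotification.configure do |config|
  # Email every unhandled exception to the on-call alias
  config.add_notifier :email, {
    email_prefix: '[ERROR] ',
    sender_address: %("Notifier" <notifier@example.com>),
    exception_recipients: %w[oncall@example.com]
  }
end
```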
Practical Examples
Incident management concepts become concrete through realistic scenarios showing detection, response, and resolution of actual system failures.
Database connection pool exhaustion represents a common incident type in Ruby applications. The application experiences intermittent timeouts and users report slow response times.
# Initial detection via health check
class DatabaseHealthCheck
  def check
    start = Time.current
    ActiveRecord::Base.connection_pool.with_connection do |conn|
      conn.execute('SELECT 1')
    end
    duration = Time.current - start

    pool_stats = ActiveRecord::Base.connection_pool.stat
    if pool_stats[:waiting] > 0 || duration > 1.0
      {
        status: 'degraded',
        waiting_connections: pool_stats[:waiting],
        active_connections: pool_stats[:busy],
        pool_size: pool_stats[:size],
        response_time: duration
      }
    else
      { status: 'healthy' }
    end
  end
end

# Automated incident creation when threshold exceeded
class ConnectionPoolMonitor
  def check_pool_health
    check_result = DatabaseHealthCheck.new.check
    if check_result[:status] == 'degraded' && check_result[:waiting_connections] > 5
      create_incident(check_result)
    end
  end

  private

  def create_incident(metrics)
    Incident.create!(
      severity: :sev2,
      title: "Database connection pool exhaustion",
      incident_type: 'database_connection_pool_exhausted',
      status: :detected,
      details: metrics.to_json,
      detected_at: Time.current
    )
  end
end

# Automated remediation attempt
class DatabasePoolRemediation
  def remediate(incident)
    current_pool_size = ActiveRecord::Base.connection_pool.size
    new_pool_size = [current_pool_size * 1.5, 100].min.to_i

    # Re-establish the connection with a larger pool
    # (connection_config was removed in Rails 6.1; use connection_db_config)
    ActiveRecord::Base.establish_connection(
      ActiveRecord::Base.connection_db_config.configuration_hash.merge(pool: new_pool_size)
    )

    incident.add_update(
      "Increased connection pool from #{current_pool_size} to #{new_pool_size}",
      User.system_user
    )

    # Wait for the change to take effect, then re-check
    sleep 30
    check_result = DatabaseHealthCheck.new.check

    if check_result[:status] == 'healthy'
      incident.update!(
        status: :resolved,
        resolved_at: Time.current,
        resolution: "Increased connection pool size to #{new_pool_size}"
      )
    else
      incident.update!(status: :investigating)
      escalate_to_engineer(incident)
    end
  end
end
API rate limit breach occurs when external service rate limits are exceeded, causing cascading failures. The application receives 429 Too Many Requests responses from a payment provider.
class PaymentServiceIncident
  def self.detect_rate_limit_breach
    recent_failures = PaymentAttempt
      .where('created_at > ?', 5.minutes.ago)
      .where(status: 'rate_limited')
      .count

    if recent_failures > 10
      incident = Incident.create!(
        severity: :sev1,
        title: "Payment service rate limit breach",
        incident_type: 'api_rate_limit',
        status: :detected,
        details: {
          failed_attempts: recent_failures,
          service: 'stripe',
          time_window: '5 minutes'
        }.to_json
      )
      implement_circuit_breaker(incident)
    end
  end

  def self.implement_circuit_breaker(incident)
    # Open the circuit breaker to prevent further requests
    Rails.cache.write('payment_service_circuit_breaker', 'open', expires_in: 15.minutes)

    # Queue failed payments for retry
    failed_orders = Order.where('created_at > ? AND payment_status = ?', 5.minutes.ago, 'pending')
    failed_orders.each do |order|
      PaymentRetryJob.set(wait: 20.minutes).perform_later(order.id)
    end

    incident.add_update(
      "Circuit breaker opened. #{failed_orders.count} orders queued for retry.",
      User.system_user
    )

    # Monitor and auto-recover
    CircuitBreakerMonitorJob.set(wait: 15.minutes).perform_later(incident.id)
  end
end

# Circuit breaker monitoring and recovery
class CircuitBreakerMonitorJob < ApplicationJob
  def perform(incident_id)
    incident = Incident.find(incident_id)

    # Test with a single request
    test_result = PaymentGateway.test_connection

    if test_result.success?
      Rails.cache.delete('payment_service_circuit_breaker')
      incident.update!(
        status: :resolved,
        resolved_at: Time.current,
        resolution: "Rate limit window expired. Circuit breaker closed."
      )
      # Process queued payments
      RetryQueuedPaymentsJob.perform_later
    else
      # Extend the circuit breaker and try again later
      Rails.cache.write('payment_service_circuit_breaker', 'open', expires_in: 15.minutes)
      incident.add_update("Circuit breaker remains open. Retrying in 15 minutes.", User.system_user)
      self.class.set(wait: 15.minutes).perform_later(incident_id)
    end
  end
end
Memory leak incident demonstrates gradual degradation requiring investigation and coordinated response. Application memory usage grows continuously until pods restart due to OOM conditions.
class MemoryLeakDetection
  def monitor_memory_trend
    # Store historical data
    MemoryMetric.create!(
      value: process_memory_usage,
      timestamp: Time.current
    )

    # Analyze the trend over the last hour
    hourly_samples = MemoryMetric.where('timestamp > ?', 1.hour.ago).order(:timestamp)
    create_memory_leak_incident(hourly_samples) if memory_leak_detected?(hourly_samples)
  end

  private

  def memory_leak_detected?(samples)
    return false if samples.count < 12 # need sufficient data

    # Calculate memory growth rate (to_f guards against integer division)
    first_value = samples.first.value.to_f
    last_value = samples.last.value.to_f
    growth_rate = (last_value - first_value) / first_value

    # Flag if memory grew more than 50% and keeps growing
    growth_rate > 0.5 && consistent_growth?(samples)
  end

  def consistent_growth?(samples)
    increases = 0
    samples.each_cons(2) do |prev, current|
      increases += 1 if current.value > prev.value
    end
    increases.to_f / (samples.count - 1) > 0.7 # 70% of samples show growth
  end

  def create_memory_leak_incident(samples)
    first = samples.first.value.to_f
    last = samples.last.value.to_f
    Incident.create!(
      severity: :sev2,
      title: "Potential memory leak detected",
      incident_type: 'memory_leak',
      status: :investigating,
      details: {
        current_memory: last,
        initial_memory: first,
        growth_rate: ((last - first) / first * 100).round(2),
        sample_count: samples.count
      }.to_json
    )
  end

  def process_memory_usage
    # Resident set size in MB
    `ps -o rss= -p #{Process.pid}`.to_i / 1024
  end
end
# Investigation tooling
class MemoryLeakInvestigation
  def capture_heap_dump(incident)
    timestamp = Time.current.to_i
    dump_path = Rails.root.join('tmp', "heap_dump_#{timestamp}.json")

    require 'objspace'
    ObjectSpace.trace_object_allocations_start
    # Trigger GC to reduce noise
    GC.start
    # Generate heap dump (one JSON object per line)
    File.open(dump_path, 'w') do |f|
      ObjectSpace.dump_all(output: f)
    end

    incident.add_update(
      "Heap dump captured: #{dump_path}",
      User.system_user
    )
    # Analyze for common leak patterns
    analyze_heap_dump(dump_path, incident)
  end

  def analyze_heap_dump(path, incident)
    # Parse heap dump for object counts
    object_counts = Hash.new(0)
    File.foreach(path) do |line|
      data = JSON.parse(line)
      object_counts[data['type']] += 1 if data['type']
    rescue JSON::ParserError
      next # Skip malformed lines
    end

    # Identify suspicious object accumulation
    suspicious = object_counts.select { |_type, count| count > 10_000 }
                              .sort_by { |_type, count| -count }
                              .first(5)
    if suspicious.any?
      incident.add_update(
        "High object counts detected: #{suspicious.to_h}",
        User.system_user
      )
    end
  end
end
Real-World Applications
Production incident management encompasses operational practices, tooling choices, and organizational patterns that teams implement at scale.
E-commerce platform incident response illustrates comprehensive incident management in high-stakes environments. During peak shopping periods like Black Friday, organizations maintain heightened alertness and adjust response procedures.
E-commerce companies establish tiered severity classifications based on revenue impact. SEV1 incidents block customer purchases or payment processing. SEV2 incidents affect search, recommendations, or cart functionality. SEV3 incidents impact non-critical features like reviews or wishlists. Financial impact calculations inform severity assignments—a checkout failure affecting 100 transactions per minute receives SEV1 classification immediately.
Response procedures differ for peak versus normal periods. During major sales events, organizations maintain dedicated war rooms with all critical teams present. Incident commanders make faster decisions with less investigation because revenue loss from downtime exceeds the risk of potentially unnecessary rollbacks or changes. Teams pre-deploy rollback automation and maintain hot standby capacity to absorb traffic spikes or failover needs.
class EcommerceIncidentResponse
  REVENUE_THRESHOLDS = {
    sev1: 1000, # $1000/minute revenue impact
    sev2: 100,  # $100/minute revenue impact
    sev3: 10    # $10/minute revenue impact
  }.freeze

  def classify_by_revenue_impact(incident)
    affected_feature = incident.metadata['feature']
    conversion_rate = calculate_conversion_rate(affected_feature)
    average_order_value = Order.recent.average(:total)
    traffic = current_traffic_rate

    revenue_impact_per_minute = traffic * conversion_rate * average_order_value
    # Thresholds are ordered highest first, so the first match wins
    severity = REVENUE_THRESHOLDS.find { |_sev, threshold|
      revenue_impact_per_minute >= threshold
    }&.first || :sev4

    incident.update!(
      severity: severity,
      metadata: incident.metadata.merge(
        estimated_revenue_impact: revenue_impact_per_minute
      )
    )
  end

  def should_auto_rollback?(incident)
    # Aggressive rollback during peak periods
    peak_period = BlackFridaySchedule.peak_period?(Time.current)
    if peak_period && incident.sev1?
      recent_deployment = Deployment.where('deployed_at > ?', 30.minutes.ago).last
      return true if recent_deployment
    end
    false
  end
end
SaaS platform incident management focuses on multi-tenant considerations and customer communication. SaaS providers must determine whether incidents affect all customers or specific tenants. Tenant-specific incidents require different response procedures than platform-wide issues.
Customer communication represents a critical component. Status pages provide real-time updates, but high-value customers may require direct outreach. Organizations maintain customer success managers who receive incident notifications for their assigned accounts. The balance lies in being transparent about issues while avoiding unnecessary alarm among customers an incident does not affect.
SaaS platforms implement tenant isolation to contain incidents. Database-per-tenant architectures limit incident blast radius to individual customers. Shared infrastructure failures affect multiple tenants, triggering elevated severity classifications. Incident response includes identifying affected tenant lists and assessing whether partial functionality remains available.
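Assessing blast radius can be reduced to a dependency lookup: which tenants rely on the failing component, and is it shared infrastructure? A minimal sketch, with illustrative tenant and component names (the halfway cutoff for elevating severity is an assumption, not a standard):

```ruby
# Hypothetical sketch: scoping a SaaS incident to affected tenants.
class TenantBlastRadius
  def initialize(tenant_components)
    # Map of tenant id => infrastructure components it depends on
    @tenant_components = tenant_components
  end

  # Returns the tenants affected by a failing component and a severity
  # suggestion: shared-infrastructure failures touch many tenants and
  # warrant elevated classification.
  def assess(failed_component)
    affected = @tenant_components.select { |_id, deps| deps.include?(failed_component) }.keys
    severity = affected.size > @tenant_components.size / 2 ? :sev1 : :sev2
    { affected_tenants: affected, suggested_severity: severity }
  end
end

radius = TenantBlastRadius.new(
  'acme'    => ['shared_db', 'search'],
  'globex'  => ['tenant_db_globex'],
  'initech' => ['shared_db']
)
radius.assess('shared_db') # a shared component: most tenants affected, elevated severity
```

A real system would derive the dependency map from service discovery or deployment metadata rather than a literal hash.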
Financial services incident management operates under strict regulatory requirements and audit trails. Financial institutions must document every incident, investigation step, and resolution action. Regulatory compliance mandates specific reporting timeframes for security incidents or data breaches.
Risk management frameworks integrate with incident processes. Each incident undergoes assessment for potential fraud indicators, data exposure, or compliance violations. Security teams receive automatic notification of incidents involving authentication systems, payment processing, or customer data access.
Financial services maintain aggressive MTTR targets because downtime directly impacts transaction volume and regulatory compliance. Many institutions run active-active (hot-hot) database configurations that allow instant failover without data loss. Incident response procedures emphasize immediate mitigation over thorough investigation during active incidents.
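The audit-trail requirement above can be made tamper-evident by hash-chaining entries, so each recorded action commits to everything before it. A minimal in-memory sketch (field names and the chaining scheme are assumptions, not a regulatory format):

```ruby
require 'digest'
require 'time'

# Hypothetical sketch of an append-only, hash-chained audit trail for
# incident actions: every investigation step is documented, and altering
# an earlier entry breaks the chain.
class IncidentAuditTrail
  def initialize
    @entries = []
  end

  def record(actor:, action:)
    previous_hash = @entries.empty? ? '0' * 64 : @entries.last[:hash]
    entry = { actor: actor, action: action, at: Time.now.utc.iso8601, previous_hash: previous_hash }
    entry[:hash] = Digest::SHA256.hexdigest("#{previous_hash}|#{actor}|#{action}|#{entry[:at]}")
    @entries << entry
    entry
  end

  # Verify the chain: each entry must reference the hash of its predecessor.
  def intact?
    @entries.each_cons(2).all? { |prev, curr| curr[:previous_hash] == prev[:hash] }
  end

  attr_reader :entries
end
```

A production system would persist entries to write-once storage; the chain check then detects after-the-fact edits during audits.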
Healthcare system incident management prioritizes patient safety above all other considerations. Healthcare organizations classify incidents based on potential patient impact rather than system functionality. An incident affecting appointment scheduling receives lower priority than one impacting medication dispensing or emergency department systems.
HIPAA compliance requirements influence incident response procedures. Security incidents involving potential protected health information (PHI) exposure trigger mandatory breach assessment protocols. Healthcare organizations maintain dedicated privacy officers who participate in security incident response.
Healthcare systems often maintain paper backup procedures for critical workflows. When electronic health record systems fail, staff revert to paper forms and manual processes. Incident response includes activating backup procedures and communicating downtime procedures to clinical staff. Post-incident work includes reconciling paper records with electronic systems after restoration.
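The patient-impact-first classification described above can be encoded as a lookup from system to impact tier to priority. A minimal sketch; the system names and tier boundaries are illustrative assumptions:

```ruby
# Hypothetical sketch: prioritizing healthcare incidents by potential
# patient impact rather than system functionality.
PATIENT_IMPACT = {
  'medication_dispensing'  => :direct_care,    # can affect treatment safety
  'emergency_department'   => :direct_care,
  'lab_results'            => :care_delay,     # delays rather than endangers care
  'appointment_scheduling' => :administrative
}.freeze

PRIORITY_BY_IMPACT = { direct_care: :sev1, care_delay: :sev2, administrative: :sev3 }.freeze

# Unknown systems default to the administrative tier pending manual review.
def classify_healthcare_incident(system)
  impact = PATIENT_IMPACT.fetch(system, :administrative)
  PRIORITY_BY_IMPACT.fetch(impact)
end
```

This makes the scheduling-versus-dispensing distinction from the text explicit: both are outages, but only one is a SEV1.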
Gaming platform incident management centers on massive traffic spikes and player experience. Game launches, content updates, and promotional events generate enormous traffic surges that stress infrastructure. Gaming companies accept that some incidents will occur during launches and prepare procedures for rapid scaling and recovery.
Player sentiment monitoring augments traditional monitoring. Social media analysis, forum monitoring, and customer support ticket patterns provide early incident detection. Players report issues through these channels before internal monitoring detects problems, particularly for edge cases affecting specific game regions or configurations.
Gaming platforms implement queue systems and maintenance mode capabilities as incident mitigation strategies. When backend systems reach capacity, queue systems throttle player connections rather than allowing complete failures. Planned maintenance windows provide opportunities for infrastructure changes and issue resolution with minimal player impact.
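The queue-based throttling described above can be sketched as a simple admission gate: players connect directly while capacity remains, then wait in order. A minimal in-memory sketch (the capacity number and promotion policy are illustrative assumptions):

```ruby
# Hypothetical sketch of a login queue used as incident mitigation: when
# backend capacity is reached, new players wait instead of overloading
# the service.
class LoginQueue
  def initialize(capacity:)
    @capacity = capacity
    @active = 0
    @waiting = []
  end

  # Admit the player directly if capacity allows, otherwise queue them
  # and report their position.
  def connect(player_id)
    if @active < @capacity
      @active += 1
      { status: :connected }
    else
      @waiting << player_id
      { status: :queued, position: @waiting.size }
    end
  end

  # When a session ends, promote the next waiting player.
  def disconnect
    @active -= 1
    connect(@waiting.shift) unless @waiting.empty?
  end

  attr_reader :active, :waiting
end
```

During an incident, lowering `capacity` degrades gracefully (longer queues) instead of failing connections outright.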
Reference
Incident Severity Classifications
| Severity | Response Time | Description | Example |
|---|---|---|---|
| SEV1 | 15 minutes | Complete service outage or critical functionality unavailable to all users | Payment processing down, database offline, application completely unreachable |
| SEV2 | 1 hour | Major functionality degraded or unavailable to significant user segment | Checkout flow slow, search not working, API rate limiting affecting major integration |
| SEV3 | 4 hours | Minor functionality impaired or affecting limited users | Image uploads failing, notification delays, secondary feature broken |
| SEV4 | 24 hours | Cosmetic issues or edge case problems | UI styling broken, rare error condition, minor feature request |
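A minimal sketch encoding the severity table above as configuration, so alerting code can compute the acknowledgement deadline for an incident. The hash structure is an assumption, not any specific tool's format:

```ruby
# Severity SLAs mirroring the table: response time in minutes per level.
SEVERITY_SLAS = {
  sev1: { response_minutes: 15,   description: 'Complete outage or critical functionality down' },
  sev2: { response_minutes: 60,   description: 'Major functionality degraded for a significant segment' },
  sev3: { response_minutes: 240,  description: 'Minor functionality impaired or limited users affected' },
  sev4: { response_minutes: 1440, description: 'Cosmetic issues or edge cases' }
}.freeze

# Deadline by which the incident must be acknowledged.
def response_deadline(severity, detected_at)
  detected_at + SEVERITY_SLAS.fetch(severity)[:response_minutes] * 60
end
```

Keeping the SLAs as data rather than scattered conditionals makes them easy to audit against the published table.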
Incident Status Workflow
| Status | Description | Next Actions | Duration Target |
|---|---|---|---|
| Detected | Incident identified through monitoring or reports | Assign commander, begin triage | 5-15 minutes |
| Investigating | Team actively researching cause and scope | Implement monitoring, test hypotheses | 15-60 minutes |
| Identified | Root cause understood | Implement fix or mitigation | 10-30 minutes |
| Monitoring | Fix deployed, observing for improvement | Verify metrics, collect data | 30-120 minutes |
| Resolved | Service fully restored and stable | Document resolution, schedule post-incident | N/A |
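The workflow above can be enforced as a guarded state machine so incidents cannot skip stages. A minimal sketch; the back-transition from Monitoring to Investigating (for a fix that does not hold) is an assumption consistent with common practice:

```ruby
# Allowed status transitions mirroring the workflow table.
TRANSITIONS = {
  detected:      [:investigating],
  investigating: [:identified],
  identified:    [:monitoring],
  monitoring:    [:resolved, :investigating], # fix verified, or fix didn't hold
  resolved:      []
}.freeze

# Returns the new status, or raises if the move skips a stage.
def transition(current, to)
  unless TRANSITIONS.fetch(current).include?(to)
    raise ArgumentError, "illegal transition #{current} -> #{to}"
  end
  to
end
```

An incident tracker would call this before persisting a status change, rejecting e.g. Detected straight to Resolved.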
Common Incident Types
| Type | Detection Method | Typical Cause | Standard Mitigation |
|---|---|---|---|
| Database Connection Pool Exhaustion | Health check failure, timeout errors | Connection leak, query timeout, traffic spike | Increase pool size, restart connections, kill long queries |
| Memory Leak | Gradual memory growth, OOM kills | Object accumulation, cache unbounded growth | Rolling restart, heap dump analysis, patch deployment |
| API Rate Limit | 429 responses, external service errors | Traffic spike, incorrect rate calculation | Circuit breaker, request queuing, backoff retry |
| Dependency Failure | External API timeouts, connection refused | Third-party outage, network issue | Fallback behavior, circuit breaker, cached data |
| Deployment Failure | Error spike post-deploy, health check fail | Code bug, config error, database migration | Rollback deployment, fix forward, feature flag disable |
| Cache Invalidation | Cache miss spike, slow response times | Cache flush, cache server restart | Warm cache, reduce cache dependency, add cache layers |
| Disk Space Exhaustion | Write failures, log errors | Log accumulation, temp file buildup | Clear logs, expand storage, implement log rotation |
Key Metrics
| Metric | Description | Target | Calculation |
|---|---|---|---|
| MTTI | Mean Time to Identify - average time from incident occurrence to detection | Under 5 minutes | Sum of detection delays / incident count |
| MTTA | Mean Time to Acknowledge - average time from detection to response start | Under 15 minutes | Sum of acknowledgment times / incident count |
| MTTR | Mean Time to Resolve - average time from detection to resolution | Varies by severity | Sum of resolution times / incident count |
| MTBF | Mean Time Between Failures - average time between incidents | Increasing trend | Total operational time / incident count |
| Incident Count | Total incidents in time period by severity | Decreasing trend | Count of incidents per week/month |
| False Positive Rate | Percentage of alerts that are not real incidents | Under 10% | False alerts / total alerts |
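The MTTA and MTTR formulas in the table above reduce to the same computation over different timestamp pairs. A minimal sketch using plain hashes; a real system would read these fields from its incident store:

```ruby
# Mean elapsed minutes between two timestamps across incidents, per the
# "sum of times / incident count" formulas in the table.
def mean_minutes(incidents, from:, to:)
  return 0.0 if incidents.empty?
  total = incidents.sum { |i| (i[to] - i[from]) / 60.0 }
  (total / incidents.size).round(2)
end

incidents = [
  { detected_at: Time.utc(2024, 1, 1, 10, 0), acknowledged_at: Time.utc(2024, 1, 1, 10, 5),  resolved_at: Time.utc(2024, 1, 1, 11, 0) },
  { detected_at: Time.utc(2024, 1, 2, 9, 0),  acknowledged_at: Time.utc(2024, 1, 2, 9, 15), resolved_at: Time.utc(2024, 1, 2, 9, 30) }
]

mtta = mean_minutes(incidents, from: :detected_at, to: :acknowledged_at) # 10.0 minutes
mttr = mean_minutes(incidents, from: :detected_at, to: :resolved_at)    # 45.0 minutes
```

Segmenting the input by severity before averaging gives the per-severity MTTR targets the table mentions.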
Ruby Monitoring Integration Examples
| Service | Gem | Configuration Pattern |
|---|---|---|
| Sentry | sentry-ruby, sentry-rails | Initialize with DSN, set environment and release tags, configure breadcrumbs and sampling |
| Honeybadger | honeybadger | Configure API key, set environment, customize error filtering and grouping |
| New Relic | newrelic_rpm | License key configuration, instrument custom transactions, set labels and tags |
| Datadog | ddtrace | Service name and environment, trace sampling rate, custom span tags |
| Scout APM | scout_apm | Key and app name, monitor background jobs, track custom context |
Incident Communication Templates
| Scenario | Initial Message | Update Cadence | Resolution Message |
|---|---|---|---|
| SEV1 - Complete Outage | We are currently experiencing a complete service outage. All users are affected. Engineers are investigating. | Every 15 minutes | Service has been fully restored. All systems are operational. Post-incident review will follow. |
| SEV2 - Partial Degradation | We are experiencing degraded performance in [feature]. Some users may experience [impact]. | Every 30 minutes | Performance has been restored to normal levels. Monitoring continues. |
| SEV3 - Minor Issue | We are aware of an issue affecting [feature]. Impact is limited to [scope]. | Hourly | Issue has been resolved. Service operating normally. |
Incident Response Checklist
| Phase | Actions | Owner | Completed |
|---|---|---|---|
| Detection | Verify incident is occurring, assess initial severity, create incident record | Monitoring System / On-call | □ |
| Triage | Assign incident commander, determine severity, notify stakeholders | Incident Commander | □ |
| Investigation | Review recent changes, check monitoring dashboards, examine logs | Technical Responders | □ |
| Mitigation | Implement temporary fix, roll back if needed, enable workarounds | Technical Responders | □ |
| Communication | Update status page, notify affected users, brief stakeholders | Communication Lead | □ |
| Resolution | Deploy permanent fix, verify restoration, document resolution | Technical Responders | □ |
| Post-Incident | Schedule review, identify action items, update runbooks | Incident Commander | □ |
Escalation Thresholds
| Trigger | Escalation Action | Escalation Target |
|---|---|---|
| SEV1 incident detected | Immediate page to primary on-call | Primary On-call Engineer |
| No acknowledgment in 5 minutes | Page secondary on-call | Secondary On-call Engineer |
| No acknowledgment in 10 minutes | Page manager and escalation team | Engineering Manager |
| Unresolved after 30 minutes | Notify senior leadership | Director / VP Engineering |
| User-facing impact exceeds 1 hour | Notify executive team | CTO / CEO |
| Security incident detected | Immediate security team notification | Security Team Lead |
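The time-based rungs of the ladder above can be expressed as a threshold lookup: given minutes since detection without acknowledgement, return who to page next. A minimal sketch (targets are roles from the table, not specific people):

```ruby
# Escalation ladder ordered highest threshold first, so the first rung
# the elapsed time clears is the one that fires.
ESCALATION_LADDER = [
  [10, 'Engineering Manager'],
  [5,  'Secondary On-call Engineer'],
  [0,  'Primary On-call Engineer']
].freeze

def escalation_target(minutes_unacknowledged)
  ESCALATION_LADDER.find { |threshold, _| minutes_unacknowledged >= threshold }.last
end
```

A paging scheduler would re-evaluate this on a timer until the incident is acknowledged, escalating as each threshold passes.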
Automation Triggers
| Condition | Automated Action | Implementation |
|---|---|---|
| Error rate exceeds 5% | Create SEV2 incident, notify on-call | Monitoring rule triggers webhook to incident creation endpoint |
| Health check fails 3 consecutive times | Create incident, attempt automated remediation | Monitoring triggers runbook execution |
| Memory usage above 90% | Alert before OOM, trigger investigation | Prometheus alert triggers PagerDuty |
| Response time p95 above 2 seconds | Create SEV3 incident for investigation | APM monitoring threshold alert |
| Deployment causes error spike | Automatic rollback initiated | Deployment pipeline monitors error rates post-deploy |
| External dependency timeout rate high | Enable circuit breaker | Application monitors timeout patterns and activates protection |
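The first automation row above can be sketched as a monitoring check that fires an incident-creation callback when the error rate crosses 5%, firing at most once per breach. The notifier is injected to keep the example self-contained; in production it would be the webhook to the incident-creation endpoint:

```ruby
# Hypothetical sketch of an error-rate automation trigger.
class ErrorRateTrigger
  THRESHOLD = 0.05 # 5% error rate

  def initialize(notifier)
    @notifier = notifier
    @open = false # true while a breach is already being handled
  end

  # Evaluate one monitoring sample; fire once when the threshold is
  # crossed, then stay quiet until the rate recovers.
  def evaluate(requests:, errors:)
    rate = requests.zero? ? 0.0 : errors.to_f / requests
    if rate > THRESHOLD && !@open
      @open = true
      @notifier.call(severity: :sev2, reason: format('error rate %.1f%%', rate * 100))
    elsif rate <= THRESHOLD
      @open = false
    end
  end
end
```

The latch (`@open`) is what prevents the monitoring rule from opening a duplicate incident on every sample while the breach persists.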