CrackedRuby

Overview

Alerting strategies define how systems notify operators about application state changes, errors, performance degradation, and infrastructure issues. An alerting system monitors metrics, logs, and events, then triggers notifications when predefined conditions occur. The distinction between monitoring and alerting matters: monitoring collects data continuously, while alerting analyzes that data and sends notifications only when intervention may be required.

Effective alerting systems balance sensitivity and specificity. Too many alerts create noise and alert fatigue, causing operators to ignore critical notifications. Too few alerts allow problems to escalate undetected. The challenge involves identifying which conditions warrant immediate attention versus those that require only logging or passive monitoring.

Alerting strategies encompass several components: detection mechanisms that identify anomalies, threshold definitions that determine when conditions become actionable, routing logic that directs notifications to appropriate responders, and escalation policies that ensure critical issues receive attention. Modern alerting systems integrate with incident management platforms, communication tools, and on-call scheduling systems.

The architecture of alerting systems varies based on scale and requirements. Small applications might use simple error notifications via email. Large distributed systems require sophisticated alerting infrastructures with metric aggregation, anomaly detection, alert correlation, and multi-channel notification delivery.

# Simple alert when error rate exceeds threshold
class AlertMonitor
  def initialize(error_threshold: 100)
    @error_threshold = error_threshold
    @error_count = 0
    @last_check = Time.now
  end

  def record_error
    @error_count += 1
    check_threshold
  end

  def check_threshold
    elapsed = Time.now - @last_check
    return if elapsed < 60

    if @error_count > @error_threshold
      trigger_alert("Error rate: #{@error_count} errors/minute")
    end
    # Reset every window, not only after an alert; otherwise errors
    # accumulate across windows and the per-minute rate is wrong.
    reset_counters
  end

  private

  def trigger_alert(message)
    # Send notification
    AlertService.notify(
      severity: :critical,
      message: message,
      timestamp: Time.now
    )
  end

  def reset_counters
    @error_count = 0
    @last_check = Time.now
  end
end

Key Principles

Alerting systems operate on several fundamental principles that determine effectiveness. The first principle involves actionability: every alert should require human action or decision-making. Alerts that require no response waste attention and train operators to ignore notifications. This principle distinguishes alerts from logs or metrics dashboards, which provide information without demanding immediate action.

Severity classification forms another core principle. Alerts typically fall into categories like critical, warning, and informational. Critical alerts indicate service disruption or imminent failure requiring immediate response. Warning alerts signal degraded performance or conditions that may escalate. Informational alerts notify about state changes without urgency. The severity determines routing, escalation, and response expectations.

Context enrichment ensures alerts contain sufficient information for diagnosis. An alert stating "database connection failed" provides less value than one including the database host, connection pool status, recent query patterns, and related system metrics. Context reduces mean time to resolution by eliminating initial investigation steps.

Alert aggregation prevents notification storms. When a single root cause generates hundreds of symptoms, the system should correlate related alerts and present them as a unified incident. Without aggregation, operators face overwhelming notification volumes that obscure the actual problem.

Time-based considerations affect alerting strategy. Some conditions require immediate notification regardless of time. Others can wait for business hours. The distinction depends on service level objectives and business impact. A retail application might treat payment processing failures as critical during business hours but less urgent at 3 AM when transaction volumes are minimal.
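A minimal sketch of such a time-aware policy, using a hypothetical `time_adjusted_severity` helper and illustrative business hours; a real system would draw the schedule from configuration:

```ruby
# Illustrative policy: critical alerts stay critical during business
# hours (9:00-21:00) and are downgraded to warnings overnight.
BUSINESS_HOURS = (9...21)

def time_adjusted_severity(base_severity, at: Time.now)
  return base_severity unless base_severity == :critical

  BUSINESS_HOURS.cover?(at.hour) ? :critical : :warning
end
```

The same shape extends to per-service rules, e.g. keeping payment failures critical around the clock while downgrading reporting-pipeline alerts overnight.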

Feedback loops complete the alerting system. After incident resolution, post-mortems identify whether alerts fired appropriately. Missing alerts indicate monitoring gaps. False positives suggest threshold tuning needs. This continuous refinement improves system effectiveness over time.

The signal-to-noise ratio quantifies alerting quality. High-quality alerting systems maintain ratios where most alerts correspond to actual issues requiring attention. Low ratios indicate excessive false positives or alerts for non-actionable conditions.

# Alert with context and severity classification
require 'securerandom'

class ContextualAlert
  SEVERITIES = [:info, :warning, :critical].freeze

  attr_reader :severity, :message, :context, :timestamp

  def initialize(severity:, message:, context: {})
    raise ArgumentError unless SEVERITIES.include?(severity)
    
    @severity = severity
    @message = message
    @context = context
    @timestamp = Time.now
    @alert_id = generate_alert_id
  end

  def actionable?
    severity == :critical || 
    (severity == :warning && context[:trend] == :increasing)
  end

  def correlate_with(other_alert)
    return false unless other_alert.is_a?(ContextualAlert)
    
    # Check if alerts share common attributes
    shared_host = context[:host] == other_alert.context[:host]
    shared_service = context[:service] == other_alert.context[:service]
    time_proximity = (timestamp - other_alert.timestamp).abs < 300
    
    shared_host && shared_service && time_proximity
  end

  def enrich_context(additional_context)
    @context.merge!(additional_context)
    @context[:enriched_at] = Time.now
  end

  private

  def generate_alert_id
    "#{severity}_#{timestamp.to_i}_#{SecureRandom.hex(4)}"
  end
end

Design Considerations

Threshold selection represents a primary design decision in alerting systems. Static thresholds define fixed boundaries: CPU usage above 80%, error rate above 50 per minute, response time above 500ms. Static thresholds work well for systems with predictable behavior but generate false positives when normal patterns shift. An e-commerce site experiences predictable traffic spikes during sales events, making static error rate thresholds problematic.

Dynamic thresholds adapt to historical patterns. These systems establish baselines from past behavior and alert on deviations. A baseline might show 1000 requests per minute typically occur at 2 PM on weekdays. If requests drop to 200, the system alerts even though 200 requests per minute might be normal at other times. Dynamic thresholds reduce false positives from expected variations while catching anomalies that static thresholds miss.
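A minimal sketch of this idea, assuming a hypothetical `BaselineThreshold` class that keys baselines by weekday and hour; production systems would persist baselines and use more samples:

```ruby
# Hypothetical sketch: compare a metric against a per-(weekday, hour)
# baseline and flag deviations beyond a tolerance fraction.
class BaselineThreshold
  def initialize(tolerance: 0.5)
    @tolerance = tolerance
    @baselines = Hash.new { |h, k| h[k] = [] }
  end

  # Record an observation for its time slot.
  def record(value, at:)
    @baselines[slot(at)] << value
  end

  # True when the value deviates from the slot's average by more
  # than the tolerance fraction. No baseline means no judgment.
  def anomalous?(value, at:)
    samples = @baselines[slot(at)]
    return false if samples.empty?

    baseline = samples.sum / samples.size.to_f
    (value - baseline).abs > baseline * @tolerance
  end

  private

  def slot(time)
    [time.wday, time.hour]
  end
end
```

With a weekday-2 PM baseline near 1000 requests per minute, a drop to 200 is flagged while 950 is not, and the same 200 at a different hour is judged against that hour's own baseline.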

Alert routing determines notification destinations. Simple routing sends all alerts to a single channel. Sophisticated routing considers alert attributes: severity, service, time of day, on-call schedules. A warning about slow database queries routes to the database team during business hours. Critical payment processing failures route to the payment team immediately via SMS and phone call.
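A routing table can be sketched as an ordered rule list where the first match wins; the `AlertRouter` class and channel names here are illustrative:

```ruby
# First-match routing sketch: nil in a rule field means "match any".
class AlertRouter
  Rule = Struct.new(:severity, :service, :channels)

  def initialize
    @rules = []
    @default = [:email]
  end

  def add_rule(severity: nil, service: nil, channels:)
    @rules << Rule.new(severity, service, channels)
  end

  def channels_for(severity:, service:)
    rule = @rules.find do |r|
      (r.severity.nil? || r.severity == severity) &&
        (r.service.nil? || r.service == service)
    end
    rule ? rule.channels : @default
  end
end
```

Because rules are evaluated in order, specific rules (critical payment failures) should be registered before broad ones (all warnings to Slack).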

Rate limiting prevents notification storms. When a service fails, thousands of errors may occur within seconds. Without rate limiting, the alerting system floods notification channels, making information consumption impossible. Rate limiting sends the first N alerts immediately, then summarizes subsequent alerts in batches.

# Rate-limited alert dispatcher
class RateLimitedDispatcher
  def initialize(limit: 10, window: 60)
    @limit = limit
    @window = window
    @alert_counts = Hash.new { |h, k| h[k] = [] }
  end

  def dispatch(alert)
    key = alert_key(alert)
    now = Time.now
    
    # Remove old timestamps outside the window
    @alert_counts[key].reject! { |t| now - t > @window }
    
    if @alert_counts[key].size < @limit
      @alert_counts[key] << now
      send_alert(alert)
    else
      queue_for_summary(alert, key)
    end
  end

  private

  def alert_key(alert)
    "#{alert.severity}_#{alert.context[:service]}"
  end

  def send_alert(alert)
    NotificationService.send(alert)
  end

  def queue_for_summary(alert, key)
    # Store for periodic batch notification
    SummaryQueue.add(alert, key)
  end
end

Escalation policies define response paths when alerts go unacknowledged. A typical escalation policy notifies the primary on-call engineer first. If unacknowledged after 5 minutes, it notifies the secondary. After another 5 minutes, it escalates to the team lead. This ensures critical issues receive attention even when individuals are unavailable.
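The policy above reduces to a lookup from time-since-trigger to contact tier. A sketch with illustrative tier names and wait times:

```ruby
# Given minutes since the alert fired without acknowledgment,
# return the tier to notify. Tiers and timings are illustrative.
ESCALATION_TIERS = [
  { after: 0,  contact: :primary_oncall },
  { after: 5,  contact: :secondary_oncall },
  { after: 10, contact: :team_lead }
].freeze

def current_escalation_contact(minutes_unacknowledged)
  ESCALATION_TIERS
    .select { |tier| minutes_unacknowledged >= tier[:after] }
    .last[:contact]
end
```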

Alert deduplication consolidates identical notifications. If a service restarts repeatedly, each restart generates an alert. Without deduplication, operators receive dozens of notifications about the same underlying issue. Deduplication recognizes alerts with identical or similar fingerprints and groups them.
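Fingerprint-based deduplication can be sketched as follows; the fingerprint fields (service and error class) and the `AlertDeduplicator` name are illustrative choices:

```ruby
require 'digest'

# Deduplication sketch: alerts with the same fingerprint within the
# window are counted rather than re-sent.
class AlertDeduplicator
  def initialize(window: 600)
    @window = window
    @seen = {} # fingerprint => [first_seen_time, count]
  end

  # Returns true when the alert should be delivered, false when it
  # duplicates a recent alert.
  def deliver?(service:, error_class:, now: Time.now)
    fp = Digest::SHA256.hexdigest("#{service}|#{error_class}")
    first_seen, count = @seen[fp]

    if first_seen && now - first_seen < @window
      @seen[fp] = [first_seen, count + 1]
      false
    else
      @seen[fp] = [now, 1]
      true
    end
  end
end
```

The stored count can later feed a summary notification ("47 duplicates suppressed") once the window closes.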

The trade-off between false positives and false negatives shapes threshold decisions. Thresholds set too sensitively generate false positives, where alerts fire for normal conditions. Thresholds set too loosely create false negatives, where actual problems go undetected. The cost of each error type guides tuning: for financial transactions, false negatives cost more than false positives; for internal dashboards, false positives cause more harm through alert fatigue.

Alert suppression temporarily disables notifications during maintenance windows or known issues. When deploying new code, elevated error rates may be expected during the transition period. Suppression prevents alerts during this window. Suppression rules should include expiration times to avoid accidentally leaving alerts disabled.
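A suppression list with mandatory expiry can be sketched like this; the `SuppressionList` class is a hypothetical illustration of the rule:

```ruby
# Suppression sketch: every rule carries an expiry, so a maintenance
# window cannot silence alerts indefinitely.
class SuppressionList
  def initialize
    @rules = [] # { service:, until: }
  end

  def suppress(service, duration:, from: Time.now)
    @rules << { service: service, until: from + duration }
  end

  def suppressed?(service, at: Time.now)
    @rules.any? { |r| r[:service] == service && at < r[:until] }
  end
end
```

A dispatcher would consult `suppressed?` before sending, and expired rules need no explicit cleanup to stop taking effect.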

Implementation Approaches

Metric-based alerting evaluates numeric measurements against thresholds. Systems collect metrics like CPU usage, memory consumption, request latency, error counts, and business metrics. Alert rules evaluate these metrics periodically. When a metric crosses a threshold, the system generates an alert. Metric-based alerting excels at detecting resource exhaustion, performance degradation, and volume anomalies.

# Metric-based alerting system
class MetricAlertEvaluator
  def initialize
    @rules = []
    @metric_store = MetricStore.new
  end

  def add_rule(name, &block)
    @rules << { name: name, condition: block }
  end

  def evaluate_rules
    # Fetch the metric window once and share it across all rules
    metrics = @metric_store.fetch_recent(time_window: 300)
    alerts = []

    @rules.each do |rule|
      begin
        result = rule[:condition].call(metrics)
        alerts << create_alert(rule[:name], result) if result
      rescue => e
        log_evaluation_error(rule[:name], e)
      end
    end

    alerts
  end

  private

  def create_alert(rule_name, result)
    ContextualAlert.new(
      severity: result[:severity],
      message: "Rule '#{rule_name}' triggered: #{result[:message]}",
      context: result[:context] || {}
    )
  end

  def log_evaluation_error(rule_name, error)
    warn "Alert rule '#{rule_name}' failed: #{error.class}: #{error.message}"
  end
end

# Usage
evaluator = MetricAlertEvaluator.new

evaluator.add_rule("high_error_rate") do |metrics|
  error_rate = metrics[:errors_per_minute]
  if error_rate > 100
    {
      severity: :critical,
      message: "Error rate at #{error_rate}/min",
      context: { metric_value: error_rate }
    }
  end
end

evaluator.add_rule("slow_response_time") do |metrics|
  p95_latency = metrics[:response_time_p95]
  if p95_latency > 1000
    {
      severity: :warning,
      message: "P95 latency at #{p95_latency}ms",
      context: { metric_value: p95_latency }
    }
  end
end

Log-based alerting analyzes application logs for patterns indicating problems. Regular expressions or structured log parsing identifies error signatures. Log-based alerting catches issues that metrics miss: specific error messages, security events, business logic failures. The challenge involves parsing diverse log formats and managing high log volumes.
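The core of this approach is matching lines against a set of named signatures and counting hits; the signature names and patterns below are illustrative:

```ruby
# Log-pattern sketch: scan lines against named regexes and count
# matches per signature. An alert rule would then compare counts
# against per-signature thresholds.
LOG_SIGNATURES = {
  db_connection: /could not connect to (?:database|server)/i,
  out_of_memory: /out of memory/i,
  payment_declined: /payment declined/i
}.freeze

def match_log_signatures(lines)
  counts = Hash.new(0)
  lines.each do |line|
    LOG_SIGNATURES.each do |name, pattern|
      counts[name] += 1 if line.match?(pattern)
    end
  end
  counts
end
```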

Event-based alerting responds to discrete occurrences rather than continuous measurements. Events include deployment completions, certificate expirations, backup failures, or external webhook notifications. Event-based alerting often integrates with system event streams or message queues.

Synthetic monitoring proactively tests system functionality. Synthetic tests simulate user actions: logging in, completing purchases, accessing APIs. When synthetic tests fail, alerts fire before users encounter issues. This approach provides early warning but requires maintaining test scripts that accurately represent user behavior.
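A synthetic check reduces to running a probe, timing it, and reporting pass/fail. This sketch takes the probe as a callable so real HTTP probes and test doubles share one runner; the URL and helper names are illustrative:

```ruby
require 'net/http'
require 'uri'

# Run one synthetic check and report its outcome with timing.
# Any raised error counts as a failed check.
def run_synthetic_check(name, probe)
  start = Time.now
  healthy =
    begin
      probe.call ? true : false
    rescue StandardError
      false
    end
  { name: name, healthy: healthy, duration_ms: ((Time.now - start) * 1000).round }
end

# A real probe might issue an HTTP request against a user-facing flow:
login_probe = -> {
  Net::HTTP.get_response(URI('https://example.com/login')).is_a?(Net::HTTPSuccess)
}
```

A scheduler would run each probe on an interval and alert when `healthy` is false for consecutive runs, which filters out single transient failures.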

Anomaly detection applies statistical or machine learning techniques to identify unusual patterns. Instead of fixed thresholds, anomaly detection learns normal behavior and alerts on deviations. This approach handles systems with complex patterns but may generate false positives during legitimate behavior changes.

# Simple anomaly detection using statistical methods
class AnomalyDetector
  def initialize(window_size: 100, threshold: 3)
    @window_size = window_size
    @threshold = threshold
    @historical_values = []
  end

  def check_value(value)
    @historical_values << value
    @historical_values.shift if @historical_values.size > @window_size
    
    return nil if @historical_values.size < 10
    
    mean = calculate_mean
    std_dev = calculate_std_dev(mean)
    # A flat history has no deviation to measure against
    return nil if std_dev.zero?
    
    z_score = (value - mean) / std_dev
    
    if z_score.abs > @threshold
      {
        anomaly: true,
        value: value,
        mean: mean,
        std_dev: std_dev,
        z_score: z_score
      }
    end
  end

  private

  def calculate_mean
    @historical_values.sum / @historical_values.size.to_f
  end

  def calculate_std_dev(mean)
    variance = @historical_values.map { |v| (v - mean) ** 2 }.sum / @historical_values.size
    Math.sqrt(variance)
  end
end

Composite alerting combines multiple signals before triggering notifications. A composite rule might require both high error rate AND low throughput before alerting. This reduces false positives by confirming problems through multiple indicators.
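A composite rule can be sketched as an all-of conjunction over one metric snapshot; the `CompositeRule` class is a hypothetical illustration:

```ruby
# Composite rule sketch: fire only when every registered condition
# holds for the same metrics snapshot.
class CompositeRule
  def initialize(name)
    @name = name
    @conditions = []
  end

  def require_condition(&block)
    @conditions << block
    self
  end

  def triggered?(metrics)
    !@conditions.empty? && @conditions.all? { |c| c.call(metrics) }
  end
end
```

Evaluating all conditions against the same snapshot matters: checking high errors at one instant and low throughput at another can pair unrelated blips into a spurious alert.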

Heartbeat monitoring detects silent failures. Services send periodic heartbeat signals. Missing heartbeats within expected intervals trigger alerts. Heartbeat monitoring catches scenarios where services crash without generating errors.
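The mechanism can be sketched as a registry of expected intervals and last-seen beats; `HeartbeatRegistry` is an illustrative name:

```ruby
# Heartbeat sketch: services record beats; anything silent longer
# than its registered interval is flagged.
class HeartbeatRegistry
  def initialize
    @beats = {}     # service => time of last beat
    @intervals = {} # service => expected max gap in seconds
  end

  def register(service, interval:)
    @intervals[service] = interval
  end

  def beat(service, at: Time.now)
    @beats[service] = at
  end

  # Services that never beat, or whose last beat is too old
  def silent_services(now: Time.now)
    @intervals.select do |service, interval|
      last = @beats[service]
      last.nil? || now - last > interval
    end.keys
  end
end
```

A periodic sweep over `silent_services` turns each entry into an alert; treating "never beat" as silent also catches services that fail to start.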

Ruby Implementation

Ruby applications implement alerting through various mechanisms. Exception notification libraries intercept unhandled exceptions and deliver alerts. Application performance monitoring integrations track metrics and trigger alerts through external platforms. Custom alerting code implements business-specific logic.

The exception_notification gem provides exception alerting for Ruby applications. It captures exceptions and sends notifications via email, Slack, or custom notifiers.

# Configuring exception_notification in Rails
Rails.application.config.middleware.use(
  ExceptionNotification::Rack,
  email: {
    deliver_with: :deliver_now,
    email_prefix: '[ERROR] ',
    sender_address: %{"Notifier" <notifier@example.com>},
    exception_recipients: %w{admin@example.com}
  },
  slack: {
    webhook_url: ENV['SLACK_WEBHOOK_URL'],
    channel: '#alerts',
    additional_parameters: {
      mrkdwn: true
    }
  }
)

# Custom notifier for specific exceptions
class CustomAlertNotifier
  def self.call(exception, options = {})
    return unless alert_worthy?(exception)
    
    alert = ContextualAlert.new(
      severity: severity_for(exception),
      message: exception.message,
      context: {
        exception_class: exception.class.name,
        backtrace: Array(exception.backtrace).first(5),
        environment: Rails.env,
        host: Socket.gethostname
      }
    )
    
    AlertDispatcher.dispatch(alert)
  end

  def self.alert_worthy?(exception)
    # Skip common exceptions that don't require alerts
    !exception.is_a?(ActiveRecord::RecordNotFound) &&
    !exception.is_a?(ActionController::RoutingError)
  end

  def self.severity_for(exception)
    case exception
    when PaymentError, SecurityError
      :critical
    when ValidationError
      :warning
    else
      :info
    end
  end
end

Application performance monitoring services like New Relic, Datadog, and Scout provide Ruby agents that collect metrics and trigger alerts based on configured conditions. These agents track response times, throughput, error rates, and custom business metrics.

# Custom metric tracking with alerting potential
class MetricTracker
  def self.track_business_metric(metric_name, value, tags = {})
    # Send to monitoring service
    StatsD.gauge(metric_name, value, tags: tags)
    
    # Check local alert conditions
    AlertRuleEngine.evaluate(metric_name, value, tags)
  end

  def self.track_timing(metric_name)
    start_time = Time.now
    result = yield
    duration = ((Time.now - start_time) * 1000).round
    
    StatsD.timing(metric_name, duration)
    AlertRuleEngine.evaluate_timing(metric_name, duration)
    
    result
  end
end

# Usage in application code
class PaymentProcessor
  def process_payment(amount)
    MetricTracker.track_timing('payment.processing_time') do
      result = charge_payment(amount)
      
      if result.success?
        MetricTracker.track_business_metric(
          'payment.successful',
          amount,
          tags: { currency: result.currency }
        )
      else
        MetricTracker.track_business_metric(
          'payment.failed',
          1,
          tags: { reason: result.failure_reason }
        )
      end
      
      result
    end
  end
end

Health check endpoints enable external monitoring systems to probe application status. These endpoints verify database connectivity, cache availability, and critical service dependencies.

# Health check endpoint with detailed status
class HealthCheckController < ApplicationController
  def show
    checks = {
      database: check_database,
      redis: check_redis,
      external_api: check_external_api,
      disk_space: check_disk_space
    }
    
    status = checks.values.all? { |c| c[:healthy] } ? :ok : :service_unavailable
    
    render json: {
      status: status,
      timestamp: Time.now.iso8601,
      checks: checks
    }, status: status
  end

  private

  def check_database
    {
      healthy: ActiveRecord::Base.connection.active?,
      response_time: measure_query_time
    }
  rescue => e
    {
      healthy: false,
      error: e.message
    }
  end

  def check_redis
    {
      healthy: Redis.current.ping == 'PONG',
      response_time: measure_redis_time
    }
  rescue => e
    {
      healthy: false,
      error: e.message
    }
  end

  def measure_query_time
    start = Time.now
    ActiveRecord::Base.connection.execute('SELECT 1')
    ((Time.now - start) * 1000).round
  end

  # measure_redis_time, check_external_api, and check_disk_space
  # follow the same pattern as the checks above.
end

Background job monitoring tracks queue depths, processing rates, and failure rates. Large queue depths indicate processing bottlenecks. High failure rates signal systemic issues.

# Sidekiq monitoring with alerting
class SidekiqAlertMonitor
  def self.check_queues
    Sidekiq::Queue.all.each do |queue|
      size = queue.size
      latency = queue.latency
      
      if size > queue_size_threshold(queue.name)
        trigger_alert(
          severity: :warning,
          message: "Queue #{queue.name} backed up: #{size} jobs",
          context: { queue: queue.name, size: size, latency: latency }
        )
      end
      
      if latency > latency_threshold(queue.name)
        trigger_alert(
          severity: :critical,
          message: "Queue #{queue.name} latency high: #{latency}s",
          context: { queue: queue.name, latency: latency }
        )
      end
    end
  end

  def self.queue_size_threshold(queue_name)
    thresholds = {
      'critical' => 100,
      'default' => 1000,
      'low_priority' => 10000
    }
    thresholds[queue_name] || 1000
  end

  # Maximum acceptable queue latency in seconds (illustrative values)
  def self.latency_threshold(queue_name)
    thresholds = {
      'critical' => 30,
      'default' => 300,
      'low_priority' => 3600
    }
    thresholds[queue_name] || 300
  end

  def self.trigger_alert(severity:, message:, context:)
    AlertDispatcher.dispatch(
      ContextualAlert.new(
        severity: severity,
        message: message,
        context: context
      )
    )
  end
end

# Schedule periodic checks
Sidekiq.configure_server do |config|
  config.on(:startup) do
    Thread.new do
      loop do
        SidekiqAlertMonitor.check_queues
        sleep 60
      end
    end
  end
end

Tools & Ecosystem

PagerDuty provides incident management and on-call scheduling. It receives alerts from monitoring systems, routes them according to escalation policies, and tracks incident resolution. PagerDuty integrations exist for most monitoring platforms and support custom alert ingestion via API or email.

Datadog combines metrics collection, log aggregation, and alerting. The Datadog agent runs on application servers, collecting system and application metrics. Alert rules evaluate metrics in real time. Datadog supports anomaly detection, forecasting, and composite monitors.

New Relic focuses on application performance monitoring. It tracks transaction traces, error rates, and custom business metrics. Alert policies trigger on threshold violations or anomalous behavior. New Relic provides mobile alerts and integrates with incident management platforms.

Prometheus and Alertmanager form an open-source monitoring and alerting stack. Prometheus scrapes metrics from instrumented applications. Alertmanager handles alert routing, grouping, and silencing. The Ruby prometheus-client gem exposes application metrics in Prometheus format.

# Exposing Prometheus metrics from a Ruby application
# (prometheus-client 2.x keyword API)
require 'prometheus/client'
require 'prometheus/middleware/exporter'

registry = Prometheus::Client.registry

# Define metrics as constants so application code can reference them
HTTP_REQUESTS = registry.counter(
  :http_requests_total,
  docstring: 'Total HTTP requests',
  labels: [:method, :path, :status]
)

REQUEST_DURATION = registry.histogram(
  :http_request_duration_seconds,
  docstring: 'HTTP request duration',
  labels: [:method, :path]
)

# Track metrics in application code
class ApplicationController < ActionController::Base
  around_action :track_request

  def track_request
    start = Time.now
    begin
      yield
    ensure
      duration = Time.now - start
      HTTP_REQUESTS.increment(
        labels: {
          method: request.method,
          path: request.path,
          status: response.status
        }
      )
      REQUEST_DURATION.observe(
        duration,
        labels: {
          method: request.method,
          path: request.path
        }
      )
    end
  end
end

# Mount the /metrics endpoint (in config.ru)
use Prometheus::Middleware::Exporter

Grafana visualizes metrics and supports alerting. Alert rules evaluate time series data and send notifications through multiple channels. Grafana connects to various data sources including Prometheus, InfluxDB, and Elasticsearch.

Sentry specializes in error tracking and performance monitoring. The sentry-ruby gem captures exceptions with rich context including request parameters, user information, and breadcrumbs showing actions leading to errors. Sentry groups similar errors and supports custom alert rules.

OpsGenie provides advanced alert routing and on-call management. It deduplicates alerts, applies routing rules, and integrates with monitoring tools. OpsGenie supports custom actions like auto-remediation scripts triggered by specific alerts.

Slack serves as a notification destination for many alerting systems. Custom webhooks deliver formatted alerts to specific channels. Slack apps enable interactive alerts where responders acknowledge or escalate directly from messages.
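A webhook delivery can be sketched in two steps: build the payload, then POST it. The helper names are illustrative; the payload fields follow Slack's incoming-webhook format and the URL is read from the environment:

```ruby
require 'json'
require 'net/http'
require 'uri'

# Build a Slack incoming-webhook payload for an alert.
def slack_payload(severity, message, channel: '#alerts')
  emoji = severity == :critical ? ':rotating_light:' : ':warning:'
  { channel: channel, text: "#{emoji} [#{severity.upcase}] #{message}" }
end

# POST the payload to the configured webhook.
def post_to_slack(payload, webhook_url: ENV['SLACK_WEBHOOK_URL'])
  Net::HTTP.post(
    URI(webhook_url),
    JSON.generate(payload),
    'Content-Type' => 'application/json'
  )
end
```

Separating payload construction from delivery keeps the formatting logic testable without network access.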

Common Pitfalls

Alert fatigue represents the most significant pitfall in alerting systems. When operators receive excessive alerts, they begin ignoring notifications or develop processes that bypass alerting systems. Alert fatigue develops gradually: initially, teams respond to every alert, but as volume increases and false positives accumulate, response rates decline. The solution requires ruthless pruning of non-actionable alerts and aggressive threshold tuning.

Alerting on symptoms rather than causes generates multiple notifications for single issues. When a database server fails, dozens of dependent services may generate alerts. Without correlation, operators waste time investigating symptoms while the root cause remains unclear. Alert aggregation and dependency modeling reduce symptom-based alert storms.

Missing context in alert notifications forces operators to spend time gathering basic information before diagnosis. An alert stating "API errors increased" requires operators to determine which API, error types, affected endpoints, and correlation with recent changes. Enriching alerts with this context during generation accelerates response.

Overly sensitive thresholds generate false positives that erode confidence in alerting systems. Setting error rate thresholds without considering normal variations causes alerts during expected spikes. Traffic patterns vary by time of day, day of week, and seasonal factors. Thresholds must account for these patterns.

Alert rules that never fire suggest monitoring gaps or incorrect configurations. Regular review identifies dormant alert rules. Either the conditions never occur, indicating unnecessary rules, or monitoring gaps prevent rule evaluation.

Ignoring alert acknowledgment creates ambiguity about incident ownership. When multiple people receive alerts but acknowledgment is optional, coordination breaks down. Some incidents go unhandled while others receive duplicate attention. Mandatory acknowledgment with escalation ensures accountability.

Testing alerting systems only during incidents reveals gaps when problems occur. Periodic testing verifies alert delivery, notification routing, and escalation policies. Test alerts should traverse the complete path from detection through notification.

Hardcoded alert destinations reduce flexibility and complicate updates. Alert routing should reference on-call schedules and team assignments rather than individual email addresses or phone numbers. Configuration management separates routing logic from alert definitions.

Lack of alert expiration causes ongoing notifications for resolved issues. Alerts should include resolution conditions or automatic timeout. When a service recovers but alerts continue, operators waste time investigating non-existent problems.

Treating all alerts equally regardless of business impact misallocates attention. A website typo and payment processing failure require different urgency levels. Business impact classification guides severity assignment and escalation policies.

Reference

Alert Severity Classification

| Severity | Response Time | Example Conditions | Escalation |
|----------|---------------|--------------------|------------|
| Critical | Immediate | Service outage, data loss, security breach | After 5 minutes |
| Warning | Within 30 min | Performance degradation, elevated errors | After 30 minutes |
| Info | Business hours | Configuration changes, capacity trends | No escalation |

Common Alert Types

| Type | Trigger Condition | Key Metrics | Typical Threshold |
|------|-------------------|-------------|-------------------|
| Error Rate | Errors exceed normal | Errors per minute | >100/min or 3x baseline |
| Latency | Response time high | P95, P99 latency | >1000ms P95 |
| Throughput | Request volume anomaly | Requests per second | <50% or >200% of baseline |
| Resource | CPU/Memory high | Usage percentage | >80% sustained 5min |
| Availability | Health check failure | Successful checks | <95% over 5min |
| Queue Depth | Job backlog growing | Queue size | >1000 jobs |
| Saturation | Resource near limit | Disk space, connections | >90% capacity |

Notification Channels

| Channel | Use Case | Latency | Reliability | Context Support |
|---------|----------|---------|-------------|-----------------|
| SMS | Critical alerts | Seconds | High | Limited text |
| Phone | Urgent escalation | Seconds | High | Voice only |
| Email | Non-urgent alerts | Minutes | Medium | Rich formatting |
| Slack | Team notifications | Seconds | Medium | Rich formatting |
| PagerDuty | On-call routing | Seconds | High | Structured data |
| Webhook | System integration | Seconds | Variable | JSON payload |

Alert Lifecycle States

| State | Description | Valid Transitions | Actions Available |
|-------|-------------|-------------------|-------------------|
| Triggered | Alert condition met | Acknowledged, Resolved | Acknowledge, Escalate, Suppress |
| Acknowledged | Engineer notified | Investigating, Resolved | Update, Escalate |
| Investigating | Diagnosis in progress | Resolved, Escalated | Add notes, Link incident |
| Escalated | Sent to next tier | Acknowledged, Resolved | Reassign |
| Resolved | Condition cleared | Closed | Reopen if recurs |
| Suppressed | Temporarily disabled | Expired | Extend, Cancel |

Ruby Alerting Gems

| Gem | Purpose | Integration | Features |
|-----|---------|-------------|----------|
| exception_notification | Exception alerting | Rack middleware | Email, Slack, custom |
| bugsnag | Error monitoring | Agent-based | Grouping, releases |
| sentry-ruby | Error tracking | SDK integration | Breadcrumbs, context |
| newrelic_rpm | APM monitoring | Agent | Metrics, traces, alerts |
| datadog | Infrastructure monitoring | Agent + SDK | Logs, metrics, APM |
| prometheus-client | Metrics exposition | Exporter | Counter, gauge, histogram |

Alert Rule Patterns

| Pattern | Formula | When to Use |
|---------|---------|-------------|
| Static Threshold | value > threshold | Predictable limits |
| Rate of Change | delta > threshold | Detecting spikes |
| Moving Average | value > avg(window) * multiplier | Smoothing noise |
| Percentile | P95 > threshold | Tail latency |
| Count over Window | count(condition) > N in T minutes | Frequency limits |
| Ratio Comparison | errors / requests > threshold | Error rates |
| Absence Detection | no data for T minutes | Missing heartbeats |

Health Check Response Format

| Field | Type | Description | Example |
|-------|------|-------------|---------|
| status | string | Overall health | ok, degraded, unavailable |
| timestamp | ISO8601 | Check time | 2025-10-11T14:30:00Z |
| version | string | Application version | 2.3.1 |
| checks | object | Component statuses | Database: ok, Redis: degraded |
| response_time | integer | Check duration (ms) | 45 |

Escalation Policy Template

| Level | Wait Time | Contact Method | Example Role |
|-------|-----------|----------------|--------------|
| 1 | 0 min | SMS + Push | Primary on-call |
| 2 | 5 min | SMS + Phone | Secondary on-call |
| 3 | 10 min | Phone | Team lead |
| 4 | 15 min | Phone | Engineering manager |

Alert Context Fields

| Field | Purpose | Example Value |
|-------|---------|---------------|
| service | Affected component | payment-api |
| environment | Deployment stage | production |
| host | Server identifier | web-01.prod |
| region | Geographic location | us-east-1 |
| version | Code version | 3.2.1-abc123 |
| user_impact | Affected users | 1500 users |
| runbook_url | Response guide | https://wiki/runbook-db-conn |
| dashboard_url | Metrics view | https://grafana/db-health |