Overview
Alerting strategies define how systems notify operators about application state changes, errors, performance degradation, and infrastructure issues. An alerting system monitors metrics, logs, and events, then triggers notifications when predefined conditions occur. The distinction between monitoring and alerting matters: monitoring collects data continuously, while alerting analyzes that data and sends notifications only when intervention may be required.
Effective alerting systems balance sensitivity and specificity. Too many alerts create noise and alert fatigue, causing operators to ignore critical notifications. Too few alerts allow problems to escalate undetected. The challenge involves identifying which conditions warrant immediate attention versus those that require only logging or passive monitoring.
Alerting strategies encompass several components: detection mechanisms that identify anomalies, threshold definitions that determine when conditions become actionable, routing logic that directs notifications to appropriate responders, and escalation policies that ensure critical issues receive attention. Modern alerting systems integrate with incident management platforms, communication tools, and on-call scheduling systems.
The architecture of alerting systems varies based on scale and requirements. Small applications might use simple error notifications via email. Large distributed systems require sophisticated alerting infrastructures with metric aggregation, anomaly detection, alert correlation, and multi-channel notification delivery.
```ruby
# Simple alert when error rate exceeds threshold
class AlertMonitor
  def initialize(error_threshold: 100)
    @error_threshold = error_threshold
    @error_count = 0
    @window_start = Time.now
  end

  def record_error
    @error_count += 1
    check_threshold
  end

  def check_threshold
    elapsed = Time.now - @window_start
    return if elapsed < 60

    if @error_count > @error_threshold
      trigger_alert("Error rate: #{@error_count} errors/minute")
    end
    reset_counters # start a fresh window whether or not we alerted
  end

  private

  def trigger_alert(message)
    # Send notification
    AlertService.notify(
      severity: :critical,
      message: message,
      timestamp: Time.now
    )
  end

  def reset_counters
    @error_count = 0
    @window_start = Time.now
  end
end
```
Key Principles
Alerting systems operate on several fundamental principles that determine effectiveness. The first principle involves actionability: every alert should require human action or decision-making. Alerts that require no response waste attention and train operators to ignore notifications. This principle distinguishes alerts from logs or metrics dashboards, which provide information without demanding immediate action.
Severity classification forms another core principle. Alerts typically fall into categories like critical, warning, and informational. Critical alerts indicate service disruption or imminent failure requiring immediate response. Warning alerts signal degraded performance or conditions that may escalate. Informational alerts notify about state changes without urgency. The severity determines routing, escalation, and response expectations.
Context enrichment ensures alerts contain sufficient information for diagnosis. An alert stating "database connection failed" provides less value than one including the database host, connection pool status, recent query patterns, and related system metrics. Context reduces mean time to resolution by eliminating initial investigation steps.
Alert aggregation prevents notification storms. When a single root cause generates hundreds of symptoms, the system should correlate related alerts and present them as a unified incident. Without aggregation, operators face overwhelming notification volumes that obscure the actual problem.
Time-based considerations affect alerting strategy. Some conditions require immediate notification regardless of time. Others can wait for business hours. The distinction depends on service level objectives and business impact. A retail application might treat payment processing failures as critical during business hours but less urgent at 3 AM when transaction volumes are minimal.
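A minimal sketch of time-aware severity classification, using the retail example above. The class name, alert types, and business-hours range are illustrative assumptions, not a standard API:

```ruby
# Hypothetical time-aware severity classifier; the alert types and the
# 9:00-18:00 business-hours window are illustrative assumptions
class TimeAwareSeverity
  BUSINESS_HOURS = (9...18)

  def self.classify(alert_type, hour:)
    case alert_type
    when :payment_failure
      # Critical while transaction volume is high, warning overnight
      BUSINESS_HOURS.cover?(hour) ? :critical : :warning
    when :security_breach
      :critical # always immediate, regardless of time
    else
      :info
    end
  end
end
```

In practice the hour would come from the alert timestamp in the service's local time zone.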
Feedback loops complete the alerting system. After incident resolution, post-mortems identify whether alerts fired appropriately. Missing alerts indicate monitoring gaps. False positives suggest threshold tuning needs. This continuous refinement improves system effectiveness over time.
The signal-to-noise ratio quantifies alerting quality. High-quality alerting systems maintain ratios where most alerts correspond to actual issues requiring attention. Low ratios indicate excessive false positives or alerts for non-actionable conditions.
```ruby
# Alert with context and severity classification
require 'securerandom'

class ContextualAlert
  SEVERITIES = [:info, :warning, :critical].freeze

  attr_reader :severity, :message, :context, :timestamp

  def initialize(severity:, message:, context: {})
    raise ArgumentError, "unknown severity: #{severity}" unless SEVERITIES.include?(severity)

    @severity = severity
    @message = message
    @context = context
    @timestamp = Time.now
    @alert_id = generate_alert_id
  end

  def actionable?
    severity == :critical ||
      (severity == :warning && context[:trend] == :increasing)
  end

  def correlate_with(other_alert)
    return false unless other_alert.is_a?(ContextualAlert)

    # Alerts correlate when they share host and service within five minutes
    shared_host = context[:host] == other_alert.context[:host]
    shared_service = context[:service] == other_alert.context[:service]
    time_proximity = (timestamp - other_alert.timestamp).abs < 300
    shared_host && shared_service && time_proximity
  end

  def enrich_context(additional_context)
    @context.merge!(additional_context)
    @context[:enriched_at] = Time.now
  end

  private

  def generate_alert_id
    "#{severity}_#{timestamp.to_i}_#{SecureRandom.hex(4)}"
  end
end
```
Design Considerations
Threshold selection represents a primary design decision in alerting systems. Static thresholds define fixed boundaries: CPU usage above 80%, error rate above 50 per minute, response time above 500ms. Static thresholds work well for systems with predictable behavior but generate false positives when normal patterns shift. An e-commerce site experiences predictable traffic spikes during sales events, making static error rate thresholds problematic.
Dynamic thresholds adapt to historical patterns. These systems establish baselines from past behavior and alert on deviations. A baseline might show 1000 requests per minute typically occur at 2 PM on weekdays. If requests drop to 200, the system alerts even though 200 requests per minute might be normal at other times. Dynamic thresholds reduce false positives from expected variations while catching anomalies that static thresholds miss.
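The baseline idea can be sketched with a class that records observations per hour of day and flags values that deviate from that hour's historical mean. The class name and the deviation factor are assumptions for illustration; production systems would also key on weekday and use more robust statistics:

```ruby
# Sketch of a dynamic threshold keyed by hour of day; the 50% deviation
# factor and in-memory storage are illustrative assumptions
class DynamicThreshold
  def initialize(deviation_factor: 0.5)
    @deviation_factor = deviation_factor
    @baselines = Hash.new { |h, k| h[k] = [] } # hour => observed values
  end

  # Record an observation from normal operation for the given hour (0-23)
  def record(hour, value)
    @baselines[hour] << value
  end

  # Anomalous if the value falls outside (1 +/- factor) times the
  # historical mean for that hour
  def anomalous?(hour, value)
    samples = @baselines[hour]
    return false if samples.empty?

    mean = samples.sum / samples.size.to_f
    value < mean * (1 - @deviation_factor) ||
      value > mean * (1 + @deviation_factor)
  end
end
```

With a baseline of roughly 1000 requests per minute at 2 PM, a drop to 200 trips the threshold while 950 does not.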
Alert routing determines notification destinations. Simple routing sends all alerts to a single channel. Sophisticated routing considers alert attributes: severity, service, time of day, on-call schedules. A warning about slow database queries routes to the database team during business hours. Critical payment processing failures route to the payment team immediately via SMS and phone call.
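A routing table of this kind can be sketched as an ordered rule list matched against alert attributes. The class and channel names are assumptions; real systems would also consult on-call schedules:

```ruby
# Hypothetical attribute-based alert router: first matching rule wins,
# with a fallback channel when nothing matches
class AlertRouter
  Rule = Struct.new(:severity, :service, :channels)

  def initialize(default_channels: [:email])
    @rules = []
    @default_channels = default_channels
  end

  # A nil service matches any service
  def add_rule(severity:, channels:, service: nil)
    @rules << Rule.new(severity, service, channels)
  end

  def channels_for(alert)
    rule = @rules.find do |r|
      r.severity == alert[:severity] &&
        (r.service.nil? || r.service == alert[:service])
    end
    rule ? rule.channels : @default_channels
  end
end
```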
Rate limiting prevents notification storms. When a service fails, thousands of errors may occur within seconds. Without rate limiting, the alerting system floods notification channels, making information consumption impossible. Rate limiting sends the first N alerts immediately, then summarizes subsequent alerts in batches.
```ruby
# Rate-limited alert dispatcher
class RateLimitedDispatcher
  def initialize(limit: 10, window: 60)
    @limit = limit
    @window = window
    @alert_counts = Hash.new { |h, k| h[k] = [] }
  end

  def dispatch(alert)
    key = alert_key(alert)
    now = Time.now

    # Remove old timestamps outside the window
    @alert_counts[key].reject! { |t| now - t > @window }

    if @alert_counts[key].size < @limit
      @alert_counts[key] << now
      send_alert(alert)
    else
      queue_for_summary(alert, key)
    end
  end

  private

  def alert_key(alert)
    "#{alert.severity}_#{alert.context[:service]}"
  end

  def send_alert(alert)
    NotificationService.send(alert)
  end

  def queue_for_summary(alert, key)
    # Store for periodic batch notification
    SummaryQueue.add(alert, key)
  end
end
```
Escalation policies define response paths when alerts go unacknowledged. A typical escalation policy notifies the primary on-call engineer first. If unacknowledged after 5 minutes, it notifies the secondary. After another 5 minutes, it escalates to the team lead. This ensures critical issues receive attention even when individuals are unavailable.
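The escalation chain just described can be modeled as a list of levels with wait times. This is a simplified sketch; the contact labels are placeholders, and a real policy engine would also track acknowledgment state:

```ruby
# Sketch of an escalation policy: each level activates once the alert
# has gone unacknowledged for its wait time
class EscalationPolicy
  def initialize
    @levels = []
  end

  def add_level(after_minutes:, contact:)
    @levels << { after: after_minutes, contact: contact }
    self
  end

  # Contacts that should have been notified by the given elapsed time
  def contacts_at(elapsed_minutes)
    @levels.select { |l| elapsed_minutes >= l[:after] }
           .map { |l| l[:contact] }
  end
end
```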
Alert deduplication consolidates identical notifications. If a service restarts repeatedly, each restart generates an alert. Without deduplication, operators receive dozens of notifications about the same underlying issue. Deduplication recognizes alerts with identical or similar fingerprints and groups them.
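A fingerprint-based deduplicator can be sketched as follows. Hashing service, error class, and host into a fingerprint is one common grouping choice, assumed here for illustration:

```ruby
require 'digest'

# Sketch of fingerprint-based deduplication: only the first alert in a
# group triggers a notification; later ones increment a counter
class AlertDeduplicator
  def initialize
    @groups = Hash.new { |h, k| h[k] = [] }
  end

  def fingerprint(alert)
    Digest::SHA256.hexdigest(
      "#{alert[:service]}|#{alert[:error_class]}|#{alert[:host]}"
    )[0, 12]
  end

  # Returns true if this alert is the first in its group (notify),
  # false if it was folded into an existing group
  def record(alert)
    key = fingerprint(alert)
    first = @groups[key].empty?
    @groups[key] << alert
    first
  end

  def count(alert)
    @groups[fingerprint(alert)].size
  end
end
```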
The trade-off between false positives and false negatives shapes threshold decisions. Setting thresholds too sensitive generates false positives where alerts fire for normal conditions. Setting them too relaxed creates false negatives where actual problems go undetected. The cost of each error type guides threshold tuning. For financial transactions, false negatives cost more than false positives. For internal dashboards, false positives cause more harm through alert fatigue.
Alert suppression temporarily disables notifications during maintenance windows or known issues. When deploying new code, elevated error rates may be expected during the transition period. Suppression prevents alerts during this window. Suppression rules should include expiration times to avoid accidentally leaving alerts disabled.
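A suppression manager with expiring windows might look like the following sketch. The per-service granularity is an assumption; expired windows are pruned on every lookup so a forgotten rule cannot silence alerts indefinitely:

```ruby
# Sketch of suppression windows with mandatory expiration; windows are
# keyed by service name for illustration
class SuppressionManager
  def initialize
    @windows = [] # each: { service:, until: Time }
  end

  def suppress(service, duration_seconds:, now: Time.now)
    @windows << { service: service, until: now + duration_seconds }
  end

  def suppressed?(service, now: Time.now)
    @windows.reject! { |w| w[:until] <= now } # drop expired windows
    @windows.any? { |w| w[:service] == service }
  end
end
```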
Implementation Approaches
Metric-based alerting evaluates numeric measurements against thresholds. Systems collect metrics like CPU usage, memory consumption, request latency, error counts, and business metrics. Alert rules evaluate these metrics periodically. When a metric crosses a threshold, the system generates an alert. Metric-based alerting excels at detecting resource exhaustion, performance degradation, and volume anomalies.
```ruby
# Metric-based alerting system
class MetricAlertEvaluator
  def initialize
    @rules = []
    @metric_store = MetricStore.new
  end

  def add_rule(name, &block)
    @rules << { name: name, condition: block }
  end

  def evaluate_rules
    alerts = []
    # Fetch one snapshot so every rule evaluates consistent data
    metrics = @metric_store.fetch_recent(time_window: 300)
    @rules.each do |rule|
      begin
        result = rule[:condition].call(metrics)
        alerts << create_alert(rule[:name], result) if result
      rescue => e
        log_evaluation_error(rule[:name], e)
      end
    end
    alerts
  end

  private

  def create_alert(rule_name, result)
    ContextualAlert.new(
      severity: result[:severity],
      message: "Rule '#{rule_name}' triggered: #{result[:message]}",
      context: result[:context] || {}
    )
  end
end

# Usage
evaluator = MetricAlertEvaluator.new

evaluator.add_rule("high_error_rate") do |metrics|
  error_rate = metrics[:errors_per_minute]
  if error_rate > 100
    {
      severity: :critical,
      message: "Error rate at #{error_rate}/min",
      context: { metric_value: error_rate }
    }
  end
end

evaluator.add_rule("slow_response_time") do |metrics|
  p95_latency = metrics[:response_time_p95]
  if p95_latency > 1000
    {
      severity: :warning,
      message: "P95 latency at #{p95_latency}ms",
      context: { metric_value: p95_latency }
    }
  end
end
```
Log-based alerting analyzes application logs for patterns indicating problems. Regular expressions or structured log parsing identifies error signatures. Log-based alerting catches issues that metrics miss: specific error messages, security events, business logic failures. The challenge involves parsing diverse log formats and managing high log volumes.
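A minimal log scanner can be sketched as a set of regular expressions applied to batches of log lines. The pattern and severity pairing is an illustrative assumption; production scanners typically operate on structured logs instead of raw text:

```ruby
# Sketch of regex-based log scanning; each registered pattern carries
# the severity to assign when a line matches
class LogPatternScanner
  def initialize
    @patterns = []
  end

  def add_pattern(regex, severity:)
    @patterns << { regex: regex, severity: severity }
  end

  # Returns one match record per (line, pattern) hit in the batch
  def scan(lines)
    lines.flat_map do |line|
      @patterns.filter_map do |p|
        { severity: p[:severity], line: line } if line =~ p[:regex]
      end
    end
  end
end
```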
Event-based alerting responds to discrete occurrences rather than continuous measurements. Events include deployment completions, certificate expirations, backup failures, or external webhook notifications. Event-based alerting often integrates with system event streams or message queues.
Synthetic monitoring proactively tests system functionality. Synthetic tests simulate user actions: logging in, completing purchases, accessing APIs. When synthetic tests fail, alerts fire before users encounter issues. This approach provides early warning but requires maintaining test scripts that accurately represent user behavior.
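A synthetic check wrapper can be sketched as a named probe whose exceptions count as failures. The probe body (an HTTP login, a purchase flow) is supplied by the caller; this structure is an assumption, not a specific tool's API:

```ruby
# Sketch of a synthetic check: runs a caller-supplied probe, timing it
# and converting any raised exception into a structured failure
class SyntheticCheck
  def initialize(name, &probe)
    @name = name
    @probe = probe
  end

  def run
    start = Time.now
    @probe.call
    { name: @name, healthy: true,
      duration_ms: ((Time.now - start) * 1000).round }
  rescue => e
    { name: @name, healthy: false, error: e.message }
  end
end
```

A scheduler would run these periodically and alert when `healthy` is false for consecutive runs.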
Anomaly detection applies statistical or machine learning techniques to identify unusual patterns. Instead of fixed thresholds, anomaly detection learns normal behavior and alerts on deviations. This approach handles systems with complex patterns but may generate false positives during legitimate behavior changes.
```ruby
# Simple anomaly detection using a rolling z-score
class AnomalyDetector
  def initialize(window_size: 100, threshold: 3)
    @window_size = window_size
    @threshold = threshold
    @historical_values = []
  end

  def check_value(value)
    @historical_values << value
    @historical_values.shift if @historical_values.size > @window_size
    return nil if @historical_values.size < 10

    mean = calculate_mean
    std_dev = calculate_std_dev(mean)
    return nil if std_dev.zero? # avoid division by zero on constant data

    z_score = (value - mean) / std_dev
    if z_score.abs > @threshold
      {
        anomaly: true,
        value: value,
        mean: mean,
        std_dev: std_dev,
        z_score: z_score
      }
    end
  end

  private

  def calculate_mean
    @historical_values.sum / @historical_values.size.to_f
  end

  def calculate_std_dev(mean)
    variance = @historical_values.sum { |v| (v - mean)**2 } / @historical_values.size
    Math.sqrt(variance)
  end
end
```
Composite alerting combines multiple signals before triggering notifications. A composite rule might require both high error rate AND low throughput before alerting. This reduces false positives by confirming problems through multiple indicators.
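A composite rule can be sketched as a conjunction of condition blocks evaluated against the same metrics snapshot. The class shape is an assumption for illustration:

```ruby
# Sketch of a composite rule: fires only when every registered
# condition holds for the same metrics snapshot
class CompositeRule
  def initialize(name)
    @name = name
    @conditions = []
  end

  def add_condition(&block)
    @conditions << block
    self
  end

  def triggered?(metrics)
    @conditions.all? { |c| c.call(metrics) }
  end
end
```

The example below requires both high error rate and low throughput, so a traffic spike alone does not fire the alert.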
Heartbeat monitoring detects silent failures. Services send periodic heartbeat signals. Missing heartbeats within expected intervals trigger alerts. Heartbeat monitoring catches scenarios where services crash without generating errors.
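A heartbeat tracker can be sketched by recording the last beat per service and flagging anything silent for longer than a grace period. Treating twice the expected interval as "silent" is an illustrative assumption:

```ruby
# Sketch of heartbeat monitoring: services beat periodically and are
# flagged once silent for longer than twice the expected interval
class HeartbeatMonitor
  def initialize(expected_interval: 60)
    @expected_interval = expected_interval
    @last_seen = {}
  end

  def beat(service, at: Time.now)
    @last_seen[service] = at
  end

  def silent_services(now: Time.now)
    @last_seen.select { |_, t| now - t > @expected_interval * 2 }.keys
  end
end
```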
Ruby Implementation
Ruby applications implement alerting through various mechanisms. Exception notification libraries intercept unhandled exceptions and deliver alerts. Application performance monitoring integrations track metrics and trigger alerts through external platforms. Custom alerting code implements business-specific logic.
The exception_notification gem provides exception alerting for Ruby applications. It captures exceptions and sends notifications via email, Slack, or custom notifiers.
```ruby
# Configuring exception_notification in Rails
Rails.application.config.middleware.use(
  ExceptionNotification::Rack,
  email: {
    deliver_with: :deliver_now,
    email_prefix: '[ERROR] ',
    sender_address: %{"Notifier" <notifier@example.com>},
    exception_recipients: %w{admin@example.com}
  },
  slack: {
    webhook_url: ENV['SLACK_WEBHOOK_URL'],
    channel: '#alerts',
    additional_parameters: {
      mrkdwn: true
    }
  }
)
```
```ruby
# Custom notifier for specific exceptions
class CustomAlertNotifier
  def self.call(exception, options = {})
    return unless alert_worthy?(exception)

    alert = ContextualAlert.new(
      severity: severity_for(exception),
      message: exception.message,
      context: {
        exception_class: exception.class.name,
        backtrace: Array(exception.backtrace).first(5), # backtrace may be nil
        environment: Rails.env,
        host: Socket.gethostname
      }
    )
    AlertDispatcher.dispatch(alert)
  end

  def self.alert_worthy?(exception)
    # Skip common exceptions that don't require alerts
    !exception.is_a?(ActiveRecord::RecordNotFound) &&
      !exception.is_a?(ActionController::RoutingError)
  end

  def self.severity_for(exception)
    # PaymentError and ValidationError are application-defined classes
    case exception
    when PaymentError, SecurityError
      :critical
    when ValidationError
      :warning
    else
      :info
    end
  end
end
```
Application performance monitoring services like New Relic, Datadog, and Scout provide Ruby agents that collect metrics and trigger alerts based on configured conditions. These agents track response times, throughput, error rates, and custom business metrics.
```ruby
# Custom metric tracking with alerting potential
class MetricTracker
  def self.track_business_metric(metric_name, value, tags = {})
    # Send to monitoring service
    StatsD.gauge(metric_name, value, tags: tags)
    # Check local alert conditions
    AlertRuleEngine.evaluate(metric_name, value, tags)
  end

  def self.track_timing(metric_name)
    start_time = Time.now
    result = yield
    duration = ((Time.now - start_time) * 1000).round
    StatsD.timing(metric_name, duration)
    AlertRuleEngine.evaluate_timing(metric_name, duration)
    result
  end
end

# Usage in application code
class PaymentProcessor
  def process_payment(amount)
    MetricTracker.track_timing('payment.processing_time') do
      result = charge_payment(amount)
      if result.success?
        MetricTracker.track_business_metric(
          'payment.successful',
          amount,
          tags: { currency: result.currency }
        )
      else
        MetricTracker.track_business_metric(
          'payment.failed',
          1,
          tags: { reason: result.failure_reason }
        )
      end
      result
    end
  end
end
```
Health check endpoints enable external monitoring systems to probe application status. These endpoints verify database connectivity, cache availability, and critical service dependencies.
```ruby
# Health check endpoint with detailed status
class HealthCheckController < ApplicationController
  def show
    checks = {
      database: check_database,
      redis: check_redis,
      external_api: check_external_api, # follows the same pattern as check_database
      disk_space: check_disk_space
    }
    status = checks.values.all? { |c| c[:healthy] } ? :ok : :service_unavailable
    render json: {
      status: status,
      timestamp: Time.now.iso8601,
      checks: checks
    }, status: status
  end

  private

  def check_database
    {
      healthy: ActiveRecord::Base.connection.active?,
      response_time: measure_query_time
    }
  rescue => e
    { healthy: false, error: e.message }
  end

  def check_redis
    {
      healthy: Redis.current.ping == 'PONG', # Redis.current is the redis-rb 4 API
      response_time: measure_redis_time
    }
  rescue => e
    { healthy: false, error: e.message }
  end

  def measure_query_time
    start = Time.now
    ActiveRecord::Base.connection.execute('SELECT 1')
    ((Time.now - start) * 1000).round
  end

  def measure_redis_time
    start = Time.now
    Redis.current.ping
    ((Time.now - start) * 1000).round
  end
end
```
Background job monitoring tracks queue depths, processing rates, and failure rates. Large queue depths indicate processing bottlenecks. High failure rates signal systemic issues.
```ruby
# Sidekiq monitoring with alerting
class SidekiqAlertMonitor
  def self.check_queues
    Sidekiq::Queue.all.each do |queue|
      size = queue.size
      latency = queue.latency
      if size > queue_size_threshold(queue.name)
        trigger_alert(
          severity: :warning,
          message: "Queue #{queue.name} backed up: #{size} jobs",
          context: { queue: queue.name, size: size, latency: latency }
        )
      end
      if latency > latency_threshold(queue.name)
        trigger_alert(
          severity: :critical,
          message: "Queue #{queue.name} latency high: #{latency}s",
          context: { queue: queue.name, latency: latency }
        )
      end
    end
  end

  def self.queue_size_threshold(queue_name)
    thresholds = {
      'critical' => 100,
      'default' => 1000,
      'low_priority' => 10000
    }
    thresholds[queue_name] || 1000
  end

  def self.latency_threshold(queue_name)
    # Seconds of acceptable job wait time per queue; illustrative values
    thresholds = {
      'critical' => 30,
      'default' => 300,
      'low_priority' => 3600
    }
    thresholds[queue_name] || 300
  end

  def self.trigger_alert(severity:, message:, context:)
    AlertDispatcher.dispatch(
      ContextualAlert.new(
        severity: severity,
        message: message,
        context: context
      )
    )
  end
end

# Schedule periodic checks
Sidekiq.configure_server do |config|
  config.on(:startup) do
    Thread.new do
      loop do
        SidekiqAlertMonitor.check_queues
        sleep 60
      end
    end
  end
end
```
Tools & Ecosystem
PagerDuty provides incident management and on-call scheduling. It receives alerts from monitoring systems, routes them according to escalation policies, and tracks incident resolution. PagerDuty integrations exist for most monitoring platforms and support custom alert ingestion via API or email.
Datadog combines metrics collection, log aggregation, and alerting. The Datadog agent runs on application servers, collecting system and application metrics. Alert rules evaluate metrics in real time. Datadog supports anomaly detection, forecasting, and composite monitors.
New Relic focuses on application performance monitoring. It tracks transaction traces, error rates, and custom business metrics. Alert policies trigger on threshold violations or anomalous behavior. New Relic provides mobile alerts and integrates with incident management platforms.
Prometheus and Alertmanager form an open-source monitoring and alerting stack. Prometheus scrapes metrics from instrumented applications. Alertmanager handles alert routing, grouping, and silencing. The Ruby prometheus-client gem exposes application metrics in Prometheus format.
```ruby
# Exposing Prometheus metrics from a Ruby application
# (prometheus-client >= 0.10 API)
require 'prometheus/client'
require 'prometheus/middleware/collector'
require 'prometheus/middleware/exporter'

# Initialize registry and define metrics as constants so they are
# visible inside the controller below
prometheus = Prometheus::Client.registry

HTTP_REQUESTS = prometheus.counter(
  :http_requests_total,
  docstring: 'Total HTTP requests',
  labels: [:method, :path, :status]
)

REQUEST_DURATION = prometheus.histogram(
  :http_request_duration_seconds,
  docstring: 'HTTP request duration',
  labels: [:method, :path]
)

# Track metrics in application code
class ApplicationController < ActionController::Base
  around_action :track_request

  def track_request
    start = Time.now
    yield
  ensure
    duration = Time.now - start
    HTTP_REQUESTS.increment(
      labels: {
        method: request.method,
        path: request.path,
        status: response.status
      }
    )
    REQUEST_DURATION.observe(
      duration,
      labels: {
        method: request.method,
        path: request.path
      }
    )
  end
end

# In config.ru: mount the Prometheus middleware
use Prometheus::Middleware::Collector
use Prometheus::Middleware::Exporter
```
Grafana visualizes metrics and supports alerting. Alert rules evaluate time series data and send notifications through multiple channels. Grafana connects to various data sources including Prometheus, InfluxDB, and Elasticsearch.
Sentry specializes in error tracking and performance monitoring. The sentry-ruby gem captures exceptions with rich context including request parameters, user information, and breadcrumbs showing actions leading to errors. Sentry groups similar errors and supports custom alert rules.
OpsGenie provides advanced alert routing and on-call management. It deduplicates alerts, applies routing rules, and integrates with monitoring tools. OpsGenie supports custom actions like auto-remediation scripts triggered by specific alerts.
Slack serves as a notification destination for many alerting systems. Custom webhooks deliver formatted alerts to specific channels. Slack apps enable interactive alerts where responders acknowledge or escalate directly from messages.
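Building the webhook payload can be sketched as follows, using Slack's legacy attachment fields (`color`, `text`, `ts`). The class name and severity-to-color mapping are illustrative assumptions; the JSON string would be POSTed to the webhook URL:

```ruby
require 'json'

# Sketch of a Slack alert payload builder; severity colors are
# illustrative, and the result targets Slack's incoming-webhook
# attachment format
class SlackAlertFormatter
  SEVERITY_COLORS = {
    critical: '#d00000',
    warning: '#f2c744',
    info: '#439fe0'
  }.freeze

  def self.payload(severity:, message:, channel: '#alerts')
    {
      channel: channel,
      attachments: [{
        color: SEVERITY_COLORS.fetch(severity, '#cccccc'),
        text: message,
        ts: Time.now.to_i
      }]
    }.to_json
  end
end
```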
Common Pitfalls
Alert fatigue represents the most significant pitfall in alerting systems. When operators receive excessive alerts, they begin ignoring notifications or develop processes that bypass alerting systems. Alert fatigue develops gradually: initially, teams respond to every alert, but as volume increases and false positives accumulate, response rates decline. The solution requires ruthless pruning of non-actionable alerts and aggressive threshold tuning.
Alerting on symptoms rather than causes generates multiple notifications for single issues. When a database server fails, dozens of dependent services may generate alerts. Without correlation, operators waste time investigating symptoms while the root cause remains unclear. Alert aggregation and dependency modeling reduce symptom-based alert storms.
Missing context in alert notifications forces operators to spend time gathering basic information before diagnosis. An alert stating "API errors increased" requires operators to determine which API, error types, affected endpoints, and correlation with recent changes. Enriching alerts with this context during generation accelerates response.
Overly sensitive thresholds generate false positives that erode confidence in alerting systems. Setting error rate thresholds without considering normal variations causes alerts during expected spikes. Traffic patterns vary by time of day, day of week, and seasonal factors. Thresholds must account for these patterns.
Alert rules that never fire suggest monitoring gaps or incorrect configurations. Regular review identifies dormant alert rules. Either the conditions never occur, indicating unnecessary rules, or monitoring gaps prevent rule evaluation.
Ignoring alert acknowledgment creates ambiguity about incident ownership. When multiple people receive alerts but acknowledgment is optional, coordination breaks down. Some incidents go unhandled while others receive duplicate attention. Mandatory acknowledgment with escalation ensures accountability.
Testing alerting systems only during incidents reveals gaps when problems occur. Periodic testing verifies alert delivery, notification routing, and escalation policies. Test alerts should traverse the complete path from detection through notification.
Hardcoded alert destinations reduce flexibility and complicate updates. Alert routing should reference on-call schedules and team assignments rather than individual email addresses or phone numbers. Configuration management separates routing logic from alert definitions.
Lack of alert expiration causes ongoing notifications for resolved issues. Alerts should include resolution conditions or automatic timeout. When a service recovers but alerts continue, operators waste time investigating non-existent problems.
Treating all alerts equally regardless of business impact misallocates attention. A website typo and payment processing failure require different urgency levels. Business impact classification guides severity assignment and escalation policies.
Reference
Alert Severity Classification
| Severity | Response Time | Example Conditions | Escalation |
|---|---|---|---|
| Critical | Immediate | Service outage, data loss, security breach | After 5 minutes |
| Warning | Within 30 min | Performance degradation, elevated errors | After 30 minutes |
| Info | Business hours | Configuration changes, capacity trends | No escalation |
Common Alert Types
| Type | Trigger Condition | Key Metrics | Typical Threshold |
|---|---|---|---|
| Error Rate | Errors exceed normal | Errors per minute | >100/min or 3x baseline |
| Latency | Response time high | P95, P99 latency | >1000ms P95 |
| Throughput | Request volume anomaly | Requests per second | <50% or >200% of baseline |
| Resource | CPU/Memory high | Usage percentage | >80% sustained 5min |
| Availability | Health check failure | Successful checks | <95% over 5min |
| Queue Depth | Job backlog growing | Queue size | >1000 jobs |
| Saturation | Resource near limit | Disk space, connections | >90% capacity |
Notification Channels
| Channel | Use Case | Latency | Reliability | Context Support |
|---|---|---|---|---|
| SMS | Critical alerts | Seconds | High | Limited text |
| Phone | Urgent escalation | Seconds | High | Voice only |
| Email | Non-urgent alerts | Minutes | Medium | Rich formatting |
| Slack | Team notifications | Seconds | Medium | Rich formatting |
| PagerDuty | On-call routing | Seconds | High | Structured data |
| Webhook | System integration | Seconds | Variable | JSON payload |
Alert Lifecycle States
| State | Description | Valid Transitions | Actions Available |
|---|---|---|---|
| Triggered | Alert condition met | Acknowledged, Resolved | Acknowledge, Escalate, Suppress |
| Acknowledged | Engineer notified | Investigating, Resolved | Update, Escalate |
| Investigating | Diagnosis in progress | Resolved, Escalated | Add notes, Link incident |
| Escalated | Sent to next tier | Acknowledged, Resolved | Reassign |
| Resolved | Condition cleared | Closed | Reopen if recurs |
| Suppressed | Temporarily disabled | Expired | Extend, Cancel |
Ruby Alerting Gems
| Gem | Purpose | Integration | Features |
|---|---|---|---|
| exception_notification | Exception alerting | Rack middleware | Email, Slack, custom |
| bugsnag | Error monitoring | Agent-based | Grouping, releases |
| sentry-ruby | Error tracking | SDK integration | Breadcrumbs, context |
| newrelic_rpm | APM monitoring | Agent | Metrics, traces, alerts |
| datadog | Infrastructure monitoring | Agent + SDK | Logs, metrics, APM |
| prometheus-client | Metrics exposition | Exporter | Counter, gauge, histogram |
Alert Rule Patterns
| Pattern | Formula | When to Use |
|---|---|---|
| Static Threshold | value > threshold | Predictable limits |
| Rate of Change | delta > threshold | Detecting spikes |
| Moving Average | value > avg(window) * multiplier | Smoothing noise |
| Percentile | P95 > threshold | Tail latency |
| Count over Window | count(condition) > N in T minutes | Frequency limits |
| Ratio Comparison | errors / requests > threshold | Error rates |
| Absence Detection | no data for T minutes | Missing heartbeats |
Health Check Response Format
| Field | Type | Description | Example |
|---|---|---|---|
| status | string | Overall health | ok, degraded, unavailable |
| timestamp | ISO8601 | Check time | 2025-10-11T14:30:00Z |
| version | string | Application version | 2.3.1 |
| checks | object | Component statuses | Database: ok, Redis: degraded |
| response_time | integer | Check duration ms | 45 |
Escalation Policy Template
| Level | Wait Time | Contact Method | Example Role |
|---|---|---|---|
| 1 | 0 min | SMS + Push | Primary on-call |
| 2 | 5 min | SMS + Phone | Secondary on-call |
| 3 | 10 min | Phone | Team lead |
| 4 | 15 min | Phone | Engineering manager |
Alert Context Fields
| Field | Purpose | Example Value |
|---|---|---|
| service | Affected component | payment-api |
| environment | Deployment stage | production |
| host | Server identifier | web-01.prod |
| region | Geographic location | us-east-1 |
| version | Code version | 3.2.1-abc123 |
| user_impact | Affected users | 1500 users |
| runbook_url | Response guide | https://wiki/runbook-db-conn |
| dashboard_url | Metrics view | https://grafana/db-health |