Overview
Monitoring and observability represent two related but distinct approaches to understanding system behavior in production environments. Monitoring answers specific questions about system health through predefined metrics and alerts, while observability provides the capability to explore unknown system states and answer arbitrary questions about behavior.
Monitoring emerged first, focusing on collecting and tracking known metrics like CPU usage, memory consumption, and error rates. Systems send alerts when these metrics cross predetermined thresholds. This approach works when teams know what problems to expect and can define appropriate thresholds in advance.
Observability extends beyond traditional monitoring by instrumenting systems to expose internal state through three primary signals: metrics (numerical measurements over time), logs (discrete event records), and traces (request flow through distributed systems). The term comes from control theory, where a system is observable if its internal state can be inferred from external outputs.
The distinction matters because modern distributed systems exhibit emergent behaviors that teams cannot predict during development. A microservices architecture might fail in novel ways as services interact under load. Observability tools allow engineers to explore these unknown states by querying telemetry data without requiring predefined dashboards or alerts.
Consider a Ruby web application experiencing intermittent slow responses. Traditional monitoring shows elevated response times and triggers an alert. Observability data reveals the specific request paths affected, the database queries causing slowness, the user segments impacted, and correlations with deployment events—all discoverable through ad-hoc queries rather than pre-configured dashboards.
# Traditional monitoring: Track predefined metric
class ApplicationController < ActionController::Base
  around_action :track_response_time

  def track_response_time
    start = Time.now
    yield
    duration = Time.now - start
    Metrics.gauge('http.response_time', duration)
  end
end

# Observability: Rich contextual data
class ApplicationController < ActionController::Base
  around_action :instrument_request

  def instrument_request
    tracer.in_span('http.request') do |span|
      span.set_attribute('http.method', request.method)
      span.set_attribute('http.path', request.path)
      span.set_attribute('user.id', current_user&.id)
      yield
      span.set_attribute('http.status', response.status)
    end
  end
end
The observability approach captures arbitrary attributes on each request, allowing engineers to filter and group by any dimension during investigation. Monitoring provides alerts for known failure modes; observability enables exploration of unknown issues.
Key Principles
Observability rests on three foundational pillars that work together to provide system insight: metrics, logs, and distributed traces. Each pillar captures different aspects of system behavior and serves distinct purposes during operation and debugging.
Metrics represent numerical measurements sampled over time intervals. Counter metrics track cumulative totals like request counts or error tallies. Gauge metrics capture point-in-time values like memory usage or queue depth. Histogram metrics record value distributions like response time percentiles. Timer metrics measure operation duration. Metrics excel at showing trends and triggering alerts but lack the context to explain why values changed.
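The behavioral differences between these types can be sketched with a minimal in-memory implementation. This is illustrative only — the `Counter`, `Gauge`, and `Histogram` classes below are hypothetical; real clients such as prometheus-client add thread safety, labels, and exposition endpoints:

```ruby
# Minimal sketches of the common metric types.
class Counter
  attr_reader :value

  def initialize
    @value = 0
  end

  # Counters only go up: cumulative totals like request counts.
  def increment(by = 1)
    @value += by
  end
end

class Gauge
  # Gauges move both ways: memory usage, queue depth.
  attr_accessor :value

  def initialize
    @value = 0
  end
end

class Histogram
  def initialize(buckets)
    @buckets = buckets.sort
    @counts = Hash.new(0)
  end

  # Record an observation against the first bucket that can hold it.
  def observe(v)
    bucket = @buckets.find { |b| v <= b } || Float::INFINITY
    @counts[bucket] += 1
  end

  # Fraction of observations at or below a bucket boundary — this is
  # how percentile estimates fall out of bucket counts.
  def fraction_below(boundary)
    total = @counts.values.sum.to_f
    @counts.select { |b, _| b <= boundary }.values.sum / total
  end
end

requests = Counter.new
3.times { requests.increment }

latency = Histogram.new([0.1, 0.3, 1.0])
[0.05, 0.2, 0.25, 0.8].each { |d| latency.observe(d) }
```

Note that the histogram discards individual values and keeps only bucket counts, which is why it stays cheap at high volume but can only approximate percentiles.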
Logs record discrete events with structured or unstructured data. Each log entry captures what happened at a specific moment with contextual details. Application logs document business events, error messages, and state changes. Access logs record HTTP requests. Audit logs track security-relevant actions. Logs provide narrative detail about system behavior but generate high data volume and prove difficult to correlate across services without additional context.
Distributed traces track individual requests as they flow through multiple services. Each trace contains spans representing operations within services. Spans form parent-child relationships showing request paths. Span attributes capture metadata like database queries, cache hits, or external API calls. Traces reveal which services contribute to request latency and where failures occur in distributed systems.
The three pillars complement each other. Metrics identify when problems occur through anomaly detection. Logs explain what happened during those time periods. Traces show how requests moved through the system and where time was spent. Teams combine signals to investigate issues: metrics trigger alerts, traces identify slow services, logs reveal error details.
Cardinality affects storage costs and query performance. High-cardinality data contains many unique values like user IDs or request IDs. Metrics work best with low cardinality (service name, HTTP status code). Logs and traces handle high cardinality better but cost more to store and query. Teams must balance detail against infrastructure costs.
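The cost is multiplicative: every distinct label combination becomes its own time series. A back-of-envelope sketch (the numbers are illustrative assumptions):

```ruby
# Each distinct label combination is a separate time series.
methods  = %w[GET POST]    # cardinality 2
statuses = [200, 404, 500] # cardinality 3
users    = 10_000          # a high-cardinality dimension

low_cardinality_series  = methods.size * statuses.size   # 6 series
high_cardinality_series = low_cardinality_series * users # 60,000 series
```

Adding one high-cardinality label turned six series into sixty thousand, which is why user IDs belong in logs and traces rather than metric labels.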
Sampling reduces data volume by collecting subsets of telemetry. Head-based sampling decides whether to collect a trace before it starts based on probability. Tail-based sampling makes decisions after trace completion based on criteria like errors or latency. Sampling trades completeness for cost savings but risks missing rare issues.
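The two strategies can be sketched in a few lines. The helper names are hypothetical; the head sampler hashes the trace ID into a stable value so every service reaches the same keep/drop decision without coordination, while the tail sampler inspects the finished trace:

```ruby
require 'digest'

# Head-based: derive a deterministic value in [0, 1) from the trace ID.
def head_sample?(trace_id, rate)
  Digest::SHA256.hexdigest(trace_id)[0, 8].to_i(16) / 0xffffffff.to_f < rate
end

# Tail-based: keep the interesting traces — errors and slow requests.
def tail_sample?(trace)
  trace[:error] || trace[:duration] > 1.0
end

decision = head_sample?('4bf92f3577b34da6', 0.25) # same answer on every service
kept     = tail_sample?(error: true,  duration: 0.1)
dropped  = tail_sample?(error: false, duration: 0.2)
```

Head sampling is cheap because nothing downstream is recorded for dropped traces; tail sampling requires buffering every span until the trace completes, which is why it usually runs in a collector rather than in the application.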
Context propagation connects telemetry across service boundaries. Requests carry trace IDs and span IDs through HTTP headers or message metadata. Each service extracts parent context and creates child spans. Without proper propagation, distributed traces fragment into disconnected segments that obscure request flows.
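The wire format most propagators use is the W3C Trace Context `traceparent` header: `version-traceid-spanid-flags`. A minimal sketch of building and parsing it (hypothetical helper names; real propagators also validate field lengths and handle `tracestate`):

```ruby
require 'securerandom'

# Build a W3C `traceparent` header: 00-<32 hex>-<16 hex>-<flags>.
def build_traceparent(trace_id, span_id, sampled: true)
  format('00-%s-%s-%02x', trace_id, span_id, sampled ? 1 : 0)
end

# Extract the parent context from an incoming header.
def parse_traceparent(header)
  version, trace_id, span_id, flags = header.split('-')
  { version: version, trace_id: trace_id, span_id: span_id,
    sampled: (flags.to_i(16) & 1) == 1 } # bit 0 carries the sampled flag
end

trace_id = SecureRandom.hex(16) # 32 hex chars
span_id  = SecureRandom.hex(8)  # 16 hex chars
header   = build_traceparent(trace_id, span_id)
context  = parse_traceparent(header)
```

Each service parses the incoming header, starts child spans under that trace ID, and writes a fresh header (same trace ID, new span ID) on every outgoing call.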
Data retention balances detail against storage costs. Raw metrics might aggregate after 30 days to reduce storage while retaining long-term trends. Recent logs might be hot-searchable while older logs move to cold storage. Trace sampling rates might increase for recent time periods. Retention policies affect debugging capability—investigating month-old incidents requires data from that period.
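The aggregation step is simple in principle: roll raw samples up into window averages so old data keeps its trend at a fraction of the storage. A sketch with assumed sample values:

```ruby
# Downsample raw samples into fixed-window averages.
def downsample(samples, window)
  samples.each_slice(window).map { |chunk| chunk.sum.to_f / chunk.size }
end

raw        = [10, 20, 30, 40, 50, 60] # six raw samples
aggregated = downsample(raw, 3)       # two retained points
```

The trade-off is visible in the result: the long-term trend survives, but spikes inside a window are averaged away, which is exactly the debugging capability retention policies give up.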
Instrumentation generates telemetry by adding code to measure operations. Manual instrumentation gives precise control but requires developer effort. Automatic instrumentation uses libraries to instrument frameworks and dependencies without code changes. Both approaches have trade-offs between completeness and maintenance burden.
Service Level Objectives (SLOs) define target reliability levels using metrics. An SLO might specify that 99% of requests complete within 300ms over a 30-day window. Error budgets calculate remaining allowed failures before violating SLOs. SLOs focus monitoring on user-impacting metrics rather than infrastructure details.
# Defining SLOs with metrics
class SLOTracker
  def initialize
    @success_count = 0
    @total_count = 0
  end

  def track_request(duration, status)
    @total_count += 1
    # SLO: 99% of requests under 300ms with 2xx status
    if duration < 0.3 && (200..299).include?(status)
      @success_count += 1
    end
  end

  def slo_compliance
    return 100.0 if @total_count.zero?
    (@success_count.to_f / @total_count * 100).round(2)
  end

  def error_budget_remaining
    target_compliance = 99.0
    return 100.0 if @total_count.zero? # no traffic yet: full budget, avoids 0/0 below
    return 0.0 if slo_compliance < target_compliance

    allowed_failures = @total_count * (100 - target_compliance) / 100.0
    actual_failures = @total_count - @success_count
    remaining_failures = allowed_failures - actual_failures
    (remaining_failures / allowed_failures * 100).round(2)
  end
end
Implementation Approaches
Teams implement monitoring and observability through different architectural patterns depending on system complexity, team size, and operational requirements.
Push-based telemetry sends metrics, logs, and traces from applications to collection endpoints. Applications actively transmit data to collectors at regular intervals or when events occur. This approach works well for short-lived processes like serverless functions that might not exist when a scraper tries to pull data. Push systems handle dynamic infrastructure where endpoints change frequently. However, push increases application complexity and network traffic since each service manages data transmission.
Pull-based telemetry exposes metrics endpoints that collectors scrape at intervals. Prometheus popularized this pattern for metrics collection. Applications expose HTTP endpoints returning current metric values. Collectors discover targets through service discovery and periodically fetch metrics. Pull-based systems reduce application complexity since services only expose data rather than managing transmission. Central collectors control scraping frequency and can handle backpressure. The downside is collectors must reach all application endpoints, complicating network configurations.
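The exposition side of the pull model is just an HTTP endpoint serving current counter values. A bare sketch in the Prometheus text format, with no gems (a Rack app is any object responding to `call`; the metric name is an assumption):

```ruby
# Pull-based exposition: serve current counters at /metrics and let the
# scraper decide when to fetch them.
REQUESTS = Hash.new(0)

metrics_app = lambda do |env|
  if env['PATH_INFO'] == '/metrics'
    # Prometheus text format: one `name{labels} value` line per series.
    body = REQUESTS.map { |m, v| %(http_requests_total{method="#{m}"} #{v}) }.join("\n")
    [200, { 'content-type' => 'text/plain' }, [body + "\n"]]
  else
    REQUESTS[env['REQUEST_METHOD']] += 1 # count the request being served
    [200, { 'content-type' => 'text/plain' }, ['ok']]
  end
end
```

The application never initiates a connection; whatever the scraper misses between fetches is reconstructed from the counters being cumulative.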
Agent-based collection deploys collection agents alongside applications. Agents run as sidecars in container environments or as daemons on virtual machines. Applications write logs to stdout/stderr and expose metrics endpoints locally. Agents collect data and forward it to backends. This approach centralizes collection logic and reduces application dependencies. Applications don't need direct connectivity to observability backends. Agents can buffer data during network issues. However, agents consume additional resources and add operational complexity.
Library instrumentation embeds telemetry collection directly in application code through libraries and SDKs. OpenTelemetry provides vendor-neutral libraries for metrics, logs, and traces. Applications import libraries, configure exporters, and add instrumentation to code. This approach gives fine-grained control over what data gets collected. However, it requires code changes and library maintenance across all services.
Auto-instrumentation injects telemetry collection into applications without code changes. Ruby supports auto-instrumentation through gem dependencies that monkey-patch common frameworks. Java and .NET use byte-code manipulation. This approach reduces developer burden but provides less control over instrumentation details and may miss custom code paths.
# Library instrumentation approach
require 'opentelemetry/sdk'
require 'opentelemetry/exporter/otlp'
require 'opentelemetry/instrumentation/all' # needed for use_all

OpenTelemetry::SDK.configure do |c|
  c.service_name = 'user-service'
  c.use_all # Auto-instrument supported gems
end

class UserController < ApplicationController
  def show
    tracer = OpenTelemetry.tracer_provider.tracer('user-service')

    tracer.in_span('fetch_user') do |span|
      @user = User.find(params[:id])
      span.set_attribute('user.id', @user.id)
    end

    tracer.in_span('fetch_permissions') do
      @permissions = @user.permissions
    end
  end
end
Structured logging formats log messages as key-value pairs rather than unstructured text. JSON logs enable efficient searching and filtering. Applications output structured data that log processors can parse and index. This approach requires standardizing log formats across services.
# Structured logging implementation
require 'logger'
require 'json'

class StructuredLogger
  def initialize(output = $stdout)
    @logger = Logger.new(output)
    @logger.formatter = proc do |severity, datetime, progname, msg|
      JSON.generate(
        timestamp: datetime.iso8601,
        severity: severity,
        message: msg.is_a?(Hash) ? msg[:message] : msg,
        **extract_context(msg)
      ) + "\n"
    end
  end

  def info(message_or_hash)
    @logger.info(message_or_hash)
  end

  private

  def extract_context(msg)
    return {} unless msg.is_a?(Hash)
    msg.except(:message) # Hash#except requires Ruby 3.0+
  end
end

logger = StructuredLogger.new
logger.info(
  message: 'User created',
  user_id: 12345,
  email: 'user@example.com',
  source_ip: '192.168.1.1'
)
# => {"timestamp":"2025-10-10T10:30:45Z","severity":"INFO","message":"User created","user_id":12345,"email":"user@example.com","source_ip":"192.168.1.1"}
Centralized vs distributed collection represents an architectural choice. Centralized systems send all telemetry to a single backend for storage and analysis. This simplifies querying across services but creates scaling bottlenecks. Distributed systems partition data across multiple backends, improving scalability but complicating cross-service queries.
Sampling strategies control data volume. Probabilistic sampling collects a fixed percentage of traces regardless of content. Adaptive sampling adjusts rates based on traffic volume. Intelligent sampling prioritizes interesting traces like errors or slow requests. Teams combine strategies, keeping all error traces while sampling successful requests.
Ruby Implementation
Ruby applications implement monitoring and observability through gems, framework integrations, and language features. The ecosystem provides tools for each observability pillar with varying levels of maturity.
Metrics collection in Ruby commonly uses the Prometheus client gem for exposing metrics or StatsD for push-based metrics. The prometheus-client gem provides metric types and an HTTP endpoint for scraping.
require 'prometheus/client'
require 'prometheus/middleware/exporter'

# Initialize registry
prometheus = Prometheus::Client.registry

# Define metrics
http_requests = prometheus.counter(
  :http_requests_total,
  docstring: 'Total HTTP requests',
  labels: [:method, :path, :status]
)

http_duration = prometheus.histogram(
  :http_request_duration_seconds,
  docstring: 'HTTP request duration',
  labels: [:method, :path],
  buckets: [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10]
)

# Middleware to track metrics
class MetricsMiddleware
  def initialize(app, requests_counter, duration_histogram)
    @app = app
    @requests = requests_counter
    @duration = duration_histogram
  end

  def call(env)
    start = Time.now
    status, headers, body = @app.call(env)
    duration = Time.now - start

    labels = {
      method: env['REQUEST_METHOD'],
      path: env['PATH_INFO'],
      status: status
    }
    @requests.increment(labels: labels)
    @duration.observe(duration, labels: labels.slice(:method, :path))

    [status, headers, body]
  end
end

# Rack application
use MetricsMiddleware, http_requests, http_duration
use Prometheus::Middleware::Exporter
Logging in Ruby applications typically uses the standard Logger class or structured logging gems like Semantic Logger or Ougai. Rails applications have built-in logging that can be configured for structured output.
require 'ougai'

class ApplicationLogger < Ougai::Logger
  include ActiveSupport::LoggerThreadSafeLevel
  include ActiveSupport::LoggerSilence

  def create_formatter
    if Rails.env.production?
      Ougai::Formatters::Bunyan.new
    else
      Ougai::Formatters::Readable.new
    end
  end
end

# Configure Rails logger
Rails.application.configure do
  config.logger = ApplicationLogger.new(STDOUT)
  config.log_level = :info
end

# Usage in application code
class OrdersController < ApplicationController
  def create
    Rails.logger.info(
      'Order creation started',
      user_id: current_user.id,
      items_count: params[:items].length
    )

    order = Order.create!(order_params)

    Rails.logger.info(
      'Order created successfully',
      order_id: order.id,
      total_amount: order.total,
      user_id: current_user.id
    )

    render json: order
  rescue StandardError => e
    Rails.logger.error(
      'Order creation failed',
      error: e.class.name,
      message: e.message,
      backtrace: e.backtrace.first(5),
      user_id: current_user.id
    )
    render json: { error: 'Order creation failed' }, status: :unprocessable_entity
  end
end
Distributed tracing in Ruby uses OpenTelemetry gems. The opentelemetry-sdk gem provides the core SDK, while instrumentation gems add automatic tracing for frameworks like Rails, Sidekiq, and database libraries.
# config/initializers/opentelemetry.rb
require 'opentelemetry/sdk'
require 'opentelemetry/exporter/otlp'
require 'opentelemetry/instrumentation/all'

OpenTelemetry::SDK.configure do |c|
  c.service_name = 'rails-app'
  c.service_version = '1.0.0'

  # Configure exporter
  c.add_span_processor(
    OpenTelemetry::SDK::Trace::Export::BatchSpanProcessor.new(
      OpenTelemetry::Exporter::OTLP::Exporter.new(
        endpoint: ENV.fetch('OTEL_EXPORTER_OTLP_ENDPOINT', 'http://localhost:4318')
      )
    )
  )

  # Auto-instrument common gems
  c.use_all(
    'OpenTelemetry::Instrumentation::Rails' => { enabled: true },
    'OpenTelemetry::Instrumentation::ActiveRecord' => { enabled: true },
    'OpenTelemetry::Instrumentation::Redis' => { enabled: true },
    'OpenTelemetry::Instrumentation::Sidekiq' => { enabled: true }
  )
end

# Manual instrumentation
class PaymentService
  def process_payment(order)
    tracer = OpenTelemetry.tracer_provider.tracer('payment-service')

    tracer.in_span('payment.process', attributes: {
      'order.id' => order.id,
      'order.total' => order.total.to_f # attribute values must be primitives
    }) do |span|
      gateway_response = tracer.in_span('payment.gateway_call') do
        call_payment_gateway(order)
      end

      span.set_attribute('payment.gateway_id', gateway_response.id)
      span.set_attribute('payment.status', gateway_response.status)

      if gateway_response.success?
        span.status = OpenTelemetry::Trace::Status.ok
      else
        span.status = OpenTelemetry::Trace::Status.error('Payment failed')
        span.record_exception(PaymentError.new(gateway_response.error))
      end

      gateway_response
    end
  end
end
Background job monitoring requires special consideration since jobs run outside the request/response cycle. Sidekiq provides hooks for instrumentation.
class SidekiqInstrumentation
  def call(worker, job, queue)
    tracer = OpenTelemetry.tracer_provider.tracer('sidekiq')

    tracer.in_span(
      "sidekiq.job.#{worker.class.name}",
      attributes: {
        'job.id' => job['jid'],
        'job.queue' => queue,
        'job.args' => job['args'].to_json,
        'job.retry_count' => job['retry_count'] || 0
      },
      kind: :consumer
    ) do |span|
      start = Time.now
      begin
        yield
        span.status = OpenTelemetry::Trace::Status.ok
      rescue StandardError => e
        span.status = OpenTelemetry::Trace::Status.error(e.message)
        span.record_exception(e)
        raise
      ensure
        duration = Time.now - start
        span.set_attribute('job.duration', duration)
      end
    end
  end
end

Sidekiq.configure_server do |config|
  config.server_middleware do |chain|
    chain.add SidekiqInstrumentation
  end
end
Context propagation in Ruby requires extracting trace context from incoming requests and injecting it into outgoing requests. OpenTelemetry handles this through propagators.
class ApiClient
  def initialize
    @conn = Faraday.new(url: 'https://api.example.com')
    @tracer = OpenTelemetry.tracer_provider.tracer('api-client')
  end

  def fetch_user(user_id)
    @tracer.in_span('api.fetch_user', attributes: { 'user.id' => user_id }) do
      headers = {}
      # Inject trace context into headers
      OpenTelemetry.propagation.inject(headers)

      response = @conn.get("/users/#{user_id}") do |req|
        headers.each { |key, value| req.headers[key] = value }
      end

      JSON.parse(response.body)
    end
  end
end
Error tracking integration captures exceptions with additional context. Gems like Sentry and Honeybadger integrate with tracing.
class ErrorTracker
  def self.capture_exception(exception, context = {})
    # Add trace context to error reports
    trace_id = OpenTelemetry::Trace.current_span.context.trace_id.unpack1('H*')
    span_id = OpenTelemetry::Trace.current_span.context.span_id.unpack1('H*')

    Sentry.capture_exception(exception, extra: context.merge(
      trace_id: trace_id,
      span_id: span_id
    ))
  end
end
Tools & Ecosystem
The observability ecosystem provides specialized tools for collecting, storing, and analyzing telemetry data. Ruby applications integrate with these tools through client libraries and exporters.
Application Performance Monitoring (APM) platforms provide end-to-end observability for Ruby applications. New Relic, Datadog, and AppSignal offer Ruby-specific agents that auto-instrument frameworks. These agents collect metrics, traces, and errors with minimal configuration.
New Relic's Ruby agent instruments Rails, Sinatra, and Grape frameworks automatically. The agent tracks transaction traces, database queries, external service calls, and background jobs. Configuration occurs through a YAML file or environment variables.
# Gemfile
gem 'newrelic_rpm'

# config/newrelic.yml
production:
  license_key: <%= ENV['NEW_RELIC_LICENSE_KEY'] %>
  app_name: <%= ENV['NEW_RELIC_APP_NAME'] %>
  distributed_tracing:
    enabled: true
  transaction_tracer:
    enabled: true
    record_sql: obfuscated
  error_collector:
    enabled: true
    ignore_errors: "ActionController::RoutingError"

# Custom instrumentation
class ReportGenerator
  include NewRelic::Agent::Instrumentation::ControllerInstrumentation

  def generate
    perform_action_with_newrelic_trace(
      name: 'generate_report',
      category: :task
    ) do
      # Report generation logic
    end
  end
end
Prometheus and Grafana form an open-source monitoring stack. Prometheus scrapes metrics from application endpoints and stores time-series data. Grafana visualizes metrics through dashboards. The prometheus-client gem exposes metrics for Prometheus scraping.
ELK Stack (Elasticsearch, Logstash, Kibana) handles log aggregation and analysis. Applications send logs to Logstash, which parses and forwards them to Elasticsearch for storage. Kibana provides search and visualization. Ruby applications typically log to stdout/stderr, which container orchestrators forward to log collectors.
OpenTelemetry provides vendor-neutral instrumentation. The OpenTelemetry Ruby SDK supports multiple exporters, allowing teams to send data to different backends without changing instrumentation code.
# Configure multiple exporters
require 'opentelemetry/sdk'
require 'opentelemetry/exporter/otlp'
require 'opentelemetry/exporter/jaeger'

OpenTelemetry::SDK.configure do |c|
  c.service_name = 'multi-backend-app'

  # OTLP exporter for production
  otlp_exporter = OpenTelemetry::Exporter::OTLP::Exporter.new(
    endpoint: ENV['OTEL_EXPORTER_OTLP_ENDPOINT']
  )

  # Jaeger exporter for local development
  jaeger_exporter = OpenTelemetry::Exporter::Jaeger::CollectorExporter.new(
    endpoint: 'http://localhost:14268/api/traces'
  )

  # Use appropriate exporter based on environment
  exporter = Rails.env.production? ? otlp_exporter : jaeger_exporter

  c.add_span_processor(
    OpenTelemetry::SDK::Trace::Export::BatchSpanProcessor.new(exporter)
  )
end
StatsD provides a simple protocol for metrics collection. Applications send metrics over UDP to a StatsD daemon, which aggregates and forwards them to backends like Graphite or Datadog. The statsd-ruby gem implements the basic client; the tagged form shown below relies on the Datadog extension provided by the dogstatsd-ruby gem.
require 'datadog/statsd'

class MetricsClient
  def initialize
    # Tags require the Datadog StatsD extension; plain StatsD does not support them
    @statsd = Datadog::Statsd.new('localhost', 8125, namespace: 'myapp')
  end

  def track_request(controller, action, duration, status)
    # Increment counter
    @statsd.increment('http.requests', tags: [
      "controller:#{controller}",
      "action:#{action}",
      "status:#{status}"
    ])

    # Record timing (StatsD timings are in milliseconds)
    @statsd.timing('http.response_time', duration * 1000, tags: [
      "controller:#{controller}",
      "action:#{action}"
    ])

    # Track gauge
    @statsd.gauge('http.active_requests', Thread.list.count)
  end
end
Jaeger and Zipkin specialize in distributed tracing. Both implement OpenTelemetry standards. Jaeger offers better performance for high-volume tracing, while Zipkin provides simpler deployment. OpenTelemetry exporters support both systems.
Sentry and Bugsnag focus on error tracking with context. These tools capture exceptions, stack traces, and breadcrumbs showing events leading to errors. Integration with tracing links errors to specific traces.
Fluentd and Fluent Bit serve as log collectors and forwarders. These tools gather logs from multiple sources, parse structured data, and route to various destinations. Ruby applications output JSON logs that Fluentd parses without additional configuration.
InfluxDB and TimescaleDB store time-series data efficiently. These databases optimize for metrics workloads with time-based queries. Teams use them as alternatives to Prometheus for long-term metrics storage.
Honeycomb focuses on high-cardinality observability. The platform handles exploratory queries across dimensions that would overwhelm traditional metrics systems. Ruby applications use the honeycomb-beeline gem.
require 'honeycomb-beeline'

Honeycomb.configure do |config|
  config.write_key = ENV['HONEYCOMB_WRITEKEY']
  config.dataset = 'rails-app'
  config.service_name = 'web'
end

class OrdersController < ApplicationController
  def create
    Honeycomb.start_span(name: 'orders.create') do
      Honeycomb.add_field('user.id', current_user.id)
      Honeycomb.add_field('items.count', params[:items].length)

      order = Order.create!(order_params)

      Honeycomb.add_field('order.id', order.id)
      Honeycomb.add_field('order.total', order.total)

      render json: order
    end
  end
end
Integration & Interoperability
Integrating observability into Ruby applications requires coordination between application code, infrastructure, and analysis tools. Different integration patterns suit different architectural needs.
Rack middleware provides a standard integration point for web applications. Middleware intercepts requests and responses, adding instrumentation without modifying application code.
# Comprehensive observability middleware
class ObservabilityMiddleware
  def initialize(app, metrics:, logger:, tracer:)
    @app = app
    @metrics = metrics
    @logger = logger
    @tracer = tracer
  end

  def call(env)
    # Reuse an upstream request ID when one is present
    request_id = env['HTTP_X_REQUEST_ID'] || SecureRandom.uuid
    env['HTTP_X_REQUEST_ID'] = request_id

    @tracer.in_span('http.request', attributes: base_attributes(env, request_id)) do |span|
      @logger.info('Request started', request_attributes(env, request_id))

      start = Time.now
      status, headers, body = @app.call(env)
      duration = Time.now - start

      record_metrics(env, status, duration)

      span.set_attribute('http.status_code', status)
      span.set_attribute('http.duration', duration)

      @logger.info('Request completed',
        request_id: request_id,
        status: status,
        duration: duration
      )

      headers['X-Request-ID'] = request_id
      [status, headers, body]
    end
  end

  private

  def base_attributes(env, request_id)
    {
      'http.method' => env['REQUEST_METHOD'],
      'http.url' => env['PATH_INFO'],
      'http.user_agent' => env['HTTP_USER_AGENT'],
      'request.id' => request_id
    }
  end

  def request_attributes(env, request_id)
    {
      request_id: request_id,
      method: env['REQUEST_METHOD'],
      path: env['PATH_INFO'],
      user_agent: env['HTTP_USER_AGENT'],
      remote_ip: env['REMOTE_ADDR']
    }
  end

  def record_metrics(env, status, duration)
    labels = {
      method: env['REQUEST_METHOD'],
      path: normalize_path(env['PATH_INFO']),
      status: status
    }
    @metrics.increment('http_requests_total', labels: labels)
    @metrics.observe('http_request_duration_seconds', duration, labels: labels.except(:status))
  end

  def normalize_path(path)
    # Replace IDs with placeholders for lower cardinality
    path.gsub(/\/\d+/, '/:id')
  end
end
Database query instrumentation tracks slow queries and N+1 problems. ActiveRecord supports query subscribers that receive notifications for all SQL queries.
# Database query monitoring
class DatabaseQuerySubscriber
  def initialize(tracer, logger, slow_query_threshold: 0.1)
    @tracer = tracer
    @logger = logger
    @slow_query_threshold = slow_query_threshold
    @query_start_times = {} # NOTE: use a thread-safe store in production
  end

  def start(name, id, payload)
    @query_start_times[id] = Time.now
  end

  def finish(name, id, payload)
    return unless @query_start_times[id]

    duration = Time.now - @query_start_times[id]
    @query_start_times.delete(id)

    @tracer.in_span('db.query', attributes: {
      'db.statement' => payload[:sql],
      'db.duration' => duration,
      'db.connection_id' => payload[:connection_id]
    }) do |span|
      if duration > @slow_query_threshold
        @logger.warn('Slow database query detected',
          sql: payload[:sql],
          duration: duration,
          connection_id: payload[:connection_id]
        )
        span.add_event('slow_query', attributes: { threshold: @slow_query_threshold })
      end
    end
  end
end

# Register subscriber
subscriber = DatabaseQuerySubscriber.new(tracer, logger)
ActiveSupport::Notifications.subscribe('sql.active_record', subscriber)
Background job integration requires linking jobs to originating requests. Jobs carry trace context through job arguments.
class TracedJob < ApplicationJob
  around_perform do |job, block|
    # Extract trace context from job arguments
    trace_context = job.arguments.last.is_a?(Hash) ? job.arguments.last.delete(:trace_context) : nil

    if trace_context
      # Create new span linked to parent trace
      parent_context = OpenTelemetry.propagation.extract(trace_context)
      OpenTelemetry::Context.with_current(parent_context) do
        tracer.in_span("job.#{job.class.name}", attributes: {
          'job.id' => job.job_id,
          'job.queue' => job.queue_name
        }) do
          block.call
        end
      end
    else
      # No parent context, create root span
      tracer.in_span("job.#{job.class.name}") do
        block.call
      end
    end
  end

  private

  def tracer
    OpenTelemetry.tracer_provider.tracer('background-jobs')
  end
end

# Enqueue with trace context
class OrdersController < ApplicationController
  def create
    order = Order.create!(order_params)

    # Inject current trace context
    trace_context = {}
    OpenTelemetry.propagation.inject(trace_context)

    OrderConfirmationJob.perform_later(order.id, trace_context: trace_context)
    render json: order
  end
end
External service calls propagate trace context through HTTP headers. HTTP client libraries support header injection.
class ServiceClient
  def initialize(base_url)
    @base_url = base_url
    @tracer = OpenTelemetry.tracer_provider.tracer('http-client')
  end

  def get(path, params: {})
    @tracer.in_span('http.get', attributes: {
      'http.url' => "#{@base_url}#{path}",
      'http.method' => 'GET'
    }) do |span|
      headers = { 'Content-Type' => 'application/json' }
      OpenTelemetry.propagation.inject(headers)

      response = HTTP.headers(headers).get("#{@base_url}#{path}", params: params)

      span.set_attribute('http.status_code', response.code)

      if response.status.success?
        JSON.parse(response.body)
      else
        span.status = OpenTelemetry::Trace::Status.error("HTTP #{response.code}")
        raise ServiceError, "Request failed with status #{response.code}"
      end
    end
  end
end
Health check endpoints expose service status for orchestrators and load balancers. These endpoints report dependency health and readiness.
class HealthController < ApplicationController
  def liveness
    # Simple check that process is running
    render json: { status: 'ok' }, status: :ok
  end

  def readiness
    checks = {
      database: check_database,
      redis: check_redis,
      external_api: check_external_api
    }

    all_healthy = checks.values.all? { |check| check[:healthy] }
    status_code = all_healthy ? :ok : :service_unavailable

    render json: {
      status: all_healthy ? 'ready' : 'not_ready',
      checks: checks
    }, status: status_code
  end

  private

  def check_database
    ActiveRecord::Base.connection.execute('SELECT 1')
    { healthy: true }
  rescue StandardError => e
    { healthy: false, error: e.message }
  end

  def check_redis
    Redis.current.ping
    { healthy: true }
  rescue StandardError => e
    { healthy: false, error: e.message }
  end

  def check_external_api
    response = HTTP.timeout(2).get(ENV['EXTERNAL_API_HEALTH_URL'])
    { healthy: response.status.success? }
  rescue StandardError => e
    { healthy: false, error: e.message }
  end
end
Real-World Applications
Production Ruby applications implement observability patterns that balance insight against overhead. Teams adapt instrumentation based on traffic patterns, system architecture, and operational requirements.
High-throughput APIs face challenges instrumenting millions of requests without impacting performance. Sampling reduces overhead while maintaining visibility into issues. Adaptive sampling adjusts rates based on traffic and error rates.
class AdaptiveSampler
  def initialize(base_rate: 0.01, error_rate: 1.0)
    @base_rate = base_rate
    @error_rate = error_rate
    @request_count = 0
    @error_count = 0
    @last_adjustment = Time.now
  end

  def should_sample?(error: false)
    # Always sample errors
    return true if error

    # Reset counters periodically so the rate tracks recent traffic
    adjust_rate if @request_count.positive? && (@request_count % 1000).zero?
    rand < current_rate
  end

  def record_request(error: false)
    @request_count += 1
    @error_count += 1 if error
  end

  private

  def current_rate
    # Increase sampling when error rate rises
    error_ratio = @error_count.to_f / [@request_count, 1].max
    if error_ratio > 0.01 # 1% errors
      [@base_rate * 10, 1.0].min
    else
      @base_rate
    end
  end

  def adjust_rate
    if Time.now - @last_adjustment > 60
      @request_count = 0
      @error_count = 0
      @last_adjustment = Time.now
    end
  end
end

# Integration with tracing (illustrative: OpenTelemetry's sampler API takes
# keyword arguments and expects a sampling Result rather than a boolean)
class TracingSampler
  def initialize(sampler)
    @sampler = sampler
  end

  def should_sample?(trace_id:, parent_context:, links:, name:, kind:, attributes:)
    error = attributes['error'] == true
    @sampler.should_sample?(error: error)
  end
end
Microservices architectures require coordinated instrumentation across services. Consistent service naming, attribute keys, and context propagation enable cross-service analysis.
# Shared observability configuration
module ObservabilityConfig
  # Resource attribute keys must be strings; these follow the
  # OpenTelemetry semantic conventions
  STANDARD_ATTRIBUTES = {
    'deployment.environment' => ENV['RAILS_ENV'],
    'service.version' => ENV['APP_VERSION'],
    'k8s.pod.name' => ENV['HOSTNAME'],
    'k8s.namespace.name' => ENV['K8S_NAMESPACE']
  }.freeze

  def self.configure_telemetry(service_name)
    OpenTelemetry::SDK.configure do |c|
      # service_name= sets the service.name resource attribute
      c.service_name = service_name
      c.resource = OpenTelemetry::SDK::Resources::Resource.create(
        STANDARD_ATTRIBUTES.merge('service.namespace' => 'production')
      )
      # Configure exporters
      c.add_span_processor(
        OpenTelemetry::SDK::Trace::Export::BatchSpanProcessor.new(
          OpenTelemetry::Exporter::OTLP::Exporter.new
        )
      )
      # Enable all available auto-instrumentation
      c.use_all
    end
  end
end

# Each service uses consistent configuration
ObservabilityConfig.configure_telemetry('user-service')
Serverless functions in AWS Lambda face constraints around persistent connections and startup time, so instrumentation must minimize cold-start impact.
# Lambda-optimized observability
require 'opentelemetry/sdk'
require 'json'
require 'logger'

class LambdaHandler
  def initialize
    # Initialize once during cold start so warm invocations reuse state
    @tracer = OpenTelemetry.tracer_provider.tracer('lambda-function')
    @logger = Logger.new($stdout)
    @logger.formatter = proc do |severity, datetime, _progname, msg|
      JSON.generate(timestamp: datetime.iso8601, severity: severity, message: msg) + "\n"
    end
  end

  def handle_event(event:, context:)
    # The Ruby runtime exposes the request ID as aws_request_id
    request_id = context.aws_request_id
    @tracer.in_span('lambda.invocation', attributes: {
      'faas.execution' => request_id,
      'faas.id' => context.invoked_function_arn
    }) do |span|
      @logger.info(
        message: 'Function invoked',
        request_id: request_id,
        function_name: context.function_name,
        remaining_time_ms: context.get_remaining_time_in_millis
      )
      result = process_event(event)
      span.set_attribute('event.type', event['type'])
      @logger.info(
        message: 'Function completed',
        request_id: request_id,
        result_count: result.length
      )
      result
    end
  rescue StandardError => e
    @logger.error(
      message: 'Function failed',
      request_id: request_id,
      error: e.class.name,
      error_message: e.message
    )
    raise
  end

  private

  def process_event(event)
    # Business logic
  end
end

# Module-level handler object survives across warm invocations
HANDLER = LambdaHandler.new

def lambda_handler(event:, context:)
  HANDLER.handle_event(event: event, context: context)
end
Multi-tenant applications track metrics and traces per tenant without creating cardinality explosions. Tenant IDs appear in span attributes rather than metric labels.
class TenantTracking
  def self.with_tenant_context(tenant_id)
    OpenTelemetry::Trace.current_span.set_attribute('tenant.id', tenant_id) if tenant_id
    # Add tenant to log context
    Thread.current[:tenant_id] = tenant_id
    yield
  ensure
    Thread.current[:tenant_id] = nil
  end
end

# Rack middleware to extract the tenant
class TenantMiddleware
  def initialize(app)
    @app = app
  end

  def call(env)
    tenant_id = extract_tenant(env)
    TenantTracking.with_tenant_context(tenant_id) do
      @app.call(env)
    end
  end

  private

  def extract_tenant(env)
    # Extract from subdomain, header, or token
    subdomain = env['HTTP_HOST'].to_s.split('.').first
    Tenant.find_by(subdomain: subdomain)&.id
  end
end
Deployment monitoring tracks service health during rollouts. Metrics compare new versions against baselines to detect regressions.
class DeploymentMonitor
  def initialize
    @baseline_latency = fetch_baseline_latency
  end

  def check_deployment_health
    current_latency = calculate_current_latency
    error_rate = calculate_error_rate
    {
      latency_regression: current_latency > @baseline_latency * 1.5,
      error_rate_high: error_rate > 0.01,
      current_latency: current_latency,
      baseline_latency: @baseline_latency,
      error_rate: error_rate
    }
  end

  private

  def fetch_baseline_latency
    # Query metrics backend for P95 latency from the previous version
    0.3 # Example baseline
  end

  def calculate_current_latency
    # Query recent P95 latency
    0.25
  end

  def calculate_error_rate
    # Query recent error rate
    0.005
  end
end
Reference
Metric Types
| Type | Description | Use Case | Aggregation |
|---|---|---|---|
| Counter | Cumulative value that only increases | Request counts, error tallies | Rate, sum |
| Gauge | Point-in-time value that can increase or decrease | Memory usage, queue depth, active connections | Current value, average |
| Histogram | Distribution of values in configurable buckets | Response times, request sizes | Percentiles, averages |
| Summary | Pre-calculated percentiles | Client-side percentile calculation | Quantiles |
| Timer | Duration measurement | Operation execution time | Percentiles, rates |
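The semantics in the table can be sketched in a few lines of plain Ruby. These are illustrative types, not the prometheus-client API: a counter only accumulates, a gauge overwrites, and a histogram sorts observations into buckets for later percentile estimation.

```ruby
# Minimal sketches of counter, gauge, and histogram semantics (hypothetical
# classes for illustration; real apps use a metrics library).
class Counter
  attr_reader :value

  def initialize
    @value = 0
  end

  # Counters are cumulative: they only move upward
  def increment(by = 1)
    @value += by
  end
end

class Gauge
  attr_reader :value

  # Gauges capture a point-in-time value that can rise or fall
  def set(value)
    @value = value
  end
end

class Histogram
  attr_reader :bucket_counts

  def initialize(buckets)
    @buckets = buckets.sort
    @bucket_counts = Hash.new(0)
  end

  # Each observation lands in the first bucket whose upper bound covers it
  def observe(value)
    bucket = @buckets.find { |b| value <= b } || Float::INFINITY
    @bucket_counts[bucket] += 1
  end
end

requests = Counter.new
3.times { requests.increment }

latency = Histogram.new([0.1, 0.5, 1.0])
[0.05, 0.3, 0.7].each { |v| latency.observe(v) }
```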
Standard Span Attributes
| Attribute | Type | Description | Example |
|---|---|---|---|
| http.method | string | HTTP request method | GET, POST |
| http.url | string | Full request URL | https://api.example.com/users |
| http.status_code | integer | HTTP response status | 200, 404, 500 |
| db.system | string | Database type | postgresql, redis |
| db.statement | string | Database query | SELECT * FROM users WHERE id = 1 |
| error | boolean | Whether span represents error | true, false |
| messaging.system | string | Message queue system | rabbitmq, kafka |
| peer.service | string | Remote service name | payment-service |
Log Levels
| Level | Severity | Use Case | Production Volume |
|---|---|---|---|
| DEBUG | 7 | Development debugging, detailed state | Disabled |
| INFO | 6 | Normal operations, business events | Moderate |
| WARN | 4 | Recoverable errors, deprecated usage | Low |
| ERROR | 3 | Error conditions requiring attention | Very low |
| FATAL | 2 | Critical failures requiring immediate action | Extremely low |
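The production-volume column maps directly onto a logger's severity threshold: anything below the configured level is dropped at the call site. A sketch with Ruby's stdlib Logger (writing to a StringIO so the output is inspectable):

```ruby
require 'logger'
require 'stringio'

out = StringIO.new
logger = Logger.new(out)
# In production, DEBUG and INFO are typically suppressed
logger.level = Logger::WARN

logger.debug('detailed state')   # dropped
logger.info('business event')    # dropped
logger.warn('deprecated usage')  # emitted
logger.error('needs attention')  # emitted

lines = out.string.lines
```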
Common SLO Targets
| Service Type | Availability | Latency P95 | Latency P99 |
|---|---|---|---|
| Internal API | 99.9% | 100ms | 250ms |
| Public API | 99.95% | 200ms | 500ms |
| Background Job | 99% | 5s | 30s |
| Real-time Service | 99.99% | 50ms | 100ms |
| Batch Processing | 99% | N/A | N/A |
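An availability target implies an error budget: the downtime a service may accrue per window before the SLO is breached. A quick calculation (function name is ours, 30-day window assumed):

```ruby
# Downtime allowed by an availability SLO over a rolling window.
# 99.9% over 30 days leaves roughly 43 minutes of budget.
def allowed_downtime_minutes(availability, window_days: 30)
  window_minutes = window_days * 24 * 60
  window_minutes * (1.0 - availability)
end

allowed_downtime_minutes(0.999)   # internal API: ~43.2 minutes/month
allowed_downtime_minutes(0.9999)  # real-time service: ~4.3 minutes/month
```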
Sampling Decision Matrix
| Traffic Volume | Error Rate | Trace Value | Sample Rate |
|---|---|---|---|
| Low (< 100 req/s) | Any | Any | 100% |
| Medium (100-1000 req/s) | < 1% | Normal | 10% |
| Medium (100-1000 req/s) | > 1% | Normal | 50% |
| High (> 1000 req/s) | < 1% | Normal | 1% |
| High (> 1000 req/s) | > 1% | Normal | 10% |
| Any | Any | Error/Slow | 100% |
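The matrix can be encoded as a small lookup. This sketch takes its thresholds from the table above; the function itself is illustrative, not a library API:

```ruby
# Head-based sampling rate from traffic volume, error ratio, and trace value.
# Error/slow traces are always kept, per the last row of the matrix.
def sample_rate(requests_per_sec:, error_ratio:, error_or_slow: false)
  return 1.0 if error_or_slow            # always keep high-value traces
  return 1.0 if requests_per_sec < 100   # low traffic: keep everything

  if requests_per_sec <= 1000            # medium traffic
    error_ratio > 0.01 ? 0.5 : 0.1
  else                                   # high traffic
    error_ratio > 0.01 ? 0.1 : 0.01
  end
end
```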
OpenTelemetry Exporters
| Exporter | Protocol | Backend | Use Case |
|---|---|---|---|
| OTLP | gRPC/HTTP | OpenTelemetry Collector | Production standard |
| Jaeger | Thrift/gRPC | Jaeger | Development/testing |
| Zipkin | HTTP | Zipkin | Legacy systems |
| Prometheus | HTTP | Prometheus | Metrics only |
| Logging | Stdout | Any log collector | Debugging |
Data Retention Strategies
| Data Type | Hot Storage | Warm Storage | Cold Storage | Archive |
|---|---|---|---|---|
| Traces | 7 days full | 30 days sampled | N/A | N/A |
| Logs | 7 days | 30 days | 90 days | 1 year |
| Metrics (raw) | 30 days | N/A | N/A | N/A |
| Metrics (aggregated) | 1 year | 3 years | N/A | Forever |
Ruby Observability Gems
| Gem | Purpose | Backend | Auto-instrumentation |
|---|---|---|---|
| opentelemetry-sdk | Traces, metrics | Any OTLP | Yes |
| prometheus-client | Metrics | Prometheus | No |
| newrelic_rpm | APM | New Relic | Yes |
| ddtrace | APM | Datadog | Yes |
| sentry-ruby | Errors | Sentry | Yes |
| statsd-ruby | Metrics | StatsD backends | No |
| semantic_logger | Structured logging | Any | No |
| ougai | JSON logging | Any | No |
Context Propagation Headers
| Header | Standard | Format | Example |
|---|---|---|---|
| traceparent | W3C Trace Context | version-traceid-spanid-flags | 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01 |
| tracestate | W3C Trace Context | key=value pairs | vendor1=value1,vendor2=value2 |
| X-B3-TraceId | Zipkin B3 | 128-bit trace ID | 463ac35c9f6413ad48485a3953bb6124 |
| X-B3-SpanId | Zipkin B3 | 64-bit span ID | a2fb4a1d1a96d312 |
| X-Request-ID | Custom | UUID | 550e8400-e29b-41d4-a716-446655440000 |
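To make the W3C format concrete, here is a hand-rolled sketch of building and parsing a `traceparent` header. In practice OpenTelemetry's propagators handle this; the helper names below are ours:

```ruby
require 'securerandom'

# traceparent format: version-traceid-spanid-flags
# (2 hex digits, 32 hex digits, 16 hex digits, 2 hex digits)
def build_traceparent(trace_id: SecureRandom.hex(16),
                      span_id: SecureRandom.hex(8),
                      sampled: true)
  flags = sampled ? '01' : '00'
  "00-#{trace_id}-#{span_id}-#{flags}"
end

def parse_traceparent(header)
  version, trace_id, span_id, flags = header.split('-')
  {
    version: version,
    trace_id: trace_id,
    span_id: span_id,
    sampled: flags.to_i(16) & 1 == 1 # low bit of flags is the sampled flag
  }
end

header = build_traceparent(trace_id: '4bf92f3577b34da6a3ce929d0e0e4736',
                           span_id: '00f067aa0ba902b7')
parse_traceparent(header)
```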