CrackedRuby

Monitoring and Observability

Overview

Monitoring and observability represent two related but distinct approaches to understanding system behavior in production environments. Monitoring answers specific questions about system health through predefined metrics and alerts, while observability provides the capability to explore unknown system states and answer arbitrary questions about behavior.

Monitoring emerged first, focusing on collecting and tracking known metrics like CPU usage, memory consumption, and error rates. Systems send alerts when these metrics cross predetermined thresholds. This approach works when teams know what problems to expect and can define appropriate thresholds in advance.

Observability extends beyond traditional monitoring by instrumenting systems to expose internal state through three primary signals: metrics (numerical measurements over time), logs (discrete event records), and traces (request flow through distributed systems). The term comes from control theory, where a system is observable if its internal state can be inferred from external outputs.

The distinction matters because modern distributed systems exhibit emergent behaviors that teams cannot predict during development. A microservices architecture might fail in novel ways as services interact under load. Observability tools allow engineers to explore these unknown states by querying telemetry data without requiring predefined dashboards or alerts.

Consider a Ruby web application experiencing intermittent slow responses. Traditional monitoring shows elevated response times and triggers an alert. Observability data reveals the specific request paths affected, the database queries causing slowness, the user segments impacted, and correlations with deployment events—all discoverable through ad-hoc queries rather than pre-configured dashboards.

# Traditional monitoring: Track predefined metric
class ApplicationController < ActionController::Base
  around_action :track_response_time
  
  def track_response_time
    start = Process.clock_gettime(Process::CLOCK_MONOTONIC) # monotonic clock: immune to system clock changes
    yield
    duration = Process.clock_gettime(Process::CLOCK_MONOTONIC) - start
    Metrics.gauge('http.response_time', duration)
  end
end

# Observability: Rich contextual data
class ApplicationController < ActionController::Base
  around_action :instrument_request
  
  def instrument_request
    tracer.in_span('http.request') do |span|
      span.set_attribute('http.method', request.method)
      span.set_attribute('http.path', request.path)
      span.set_attribute('user.id', current_user&.id)
      
      yield
      
      span.set_attribute('http.status', response.status)
    end
  end
end

The observability approach captures arbitrary attributes on each request, allowing engineers to filter and group by any dimension during investigation. Monitoring provides alerts for known failure modes; observability enables exploration of unknown issues.

Key Principles

Observability rests on three foundational pillars that work together to provide system insight: metrics, logs, and distributed traces. Each pillar captures different aspects of system behavior and serves distinct purposes during operation and debugging.

Metrics represent numerical measurements sampled over time intervals. Counter metrics track cumulative totals like request counts or error tallies. Gauge metrics capture point-in-time values like memory usage or queue depth. Histogram metrics record value distributions like response time percentiles. Timer metrics measure operation duration. Metrics excel at showing trends and triggering alerts but lack the context to explain why values changed.
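The four metric types can be sketched as minimal in-memory classes. This is illustrative only: the class names and structure are invented for the example, and a production application would use a client library such as prometheus-client instead. A timer is omitted since it is effectively a histogram of durations.

```ruby
# Minimal in-memory sketches of the metric types (illustrative, not a real client).

class Counter
  attr_reader :value

  def initialize
    @value = 0
  end

  # Counters only ever go up (request counts, error tallies).
  def increment(by = 1)
    @value += by
  end
end

class Gauge
  attr_reader :value

  def initialize
    @value = 0
  end

  # Gauges capture a point-in-time value that can rise or fall.
  def set(value)
    @value = value
  end
end

class Histogram
  def initialize(buckets)
    @buckets = buckets.sort
    @counts = Hash.new(0) # bucket upper bound => observation count
  end

  # Each observation lands in the first bucket whose upper bound fits it.
  def observe(value)
    bucket = @buckets.find { |b| value <= b } || Float::INFINITY
    @counts[bucket] += 1
  end

  def count_in(bucket)
    @counts[bucket]
  end
end

requests = Counter.new
3.times { requests.increment }

queue_depth = Gauge.new
queue_depth.set(42)

latency = Histogram.new([0.1, 0.5, 1.0])
[0.05, 0.3, 0.7].each { |d| latency.observe(d) }
```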

Logs record discrete events with structured or unstructured data. Each log entry captures what happened at a specific moment with contextual details. Application logs document business events, error messages, and state changes. Access logs record HTTP requests. Audit logs track security-relevant actions. Logs provide narrative detail about system behavior but generate high data volume and prove difficult to correlate across services without additional context.

Distributed traces track individual requests as they flow through multiple services. Each trace contains spans representing operations within services. Spans form parent-child relationships showing request paths. Span attributes capture metadata like database queries, cache hits, or external API calls. Traces reveal which services contribute to request latency and where failures occur in distributed systems.

The three pillars complement each other. Metrics identify when problems occur through anomaly detection. Logs explain what happened during those time periods. Traces show how requests moved through the system and where time was spent. Teams combine signals to investigate issues: metrics trigger alerts, traces identify slow services, logs reveal error details.

Cardinality affects storage costs and query performance. High-cardinality data contains many unique values like user IDs or request IDs. Metrics work best with low cardinality (service name, HTTP status code). Logs and traces handle high cardinality better but cost more to store and query. Teams must balance detail against infrastructure costs.

Sampling reduces data volume by collecting subsets of telemetry. Head-based sampling decides whether to collect a trace before it starts based on probability. Tail-based sampling makes decisions after trace completion based on criteria like errors or latency. Sampling trades completeness for cost savings but risks missing rare issues.
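The two sampling styles can be sketched as small Ruby classes. The class names and the latency threshold are invented for illustration; real tracing SDKs expose samplers through their own configuration APIs.

```ruby
# Head-based: decide before the trace starts, using only a probability.
class HeadSampler
  def initialize(rate)
    @rate = rate # e.g. 0.1 keeps roughly 10% of traces
  end

  def sample?
    rand < @rate
  end
end

# Tail-based: decide after the trace completes, using its outcome.
class TailSampler
  def initialize(latency_threshold: 1.0)
    @latency_threshold = latency_threshold
  end

  # Keep every error and every slow trace; drop the rest.
  def keep?(trace)
    trace[:error] || trace[:duration] > @latency_threshold
  end
end

tail = TailSampler.new(latency_threshold: 1.0)
tail.keep?({ error: false, duration: 2.5 }) # slow request: kept
tail.keep?({ error: false, duration: 0.1 }) # fast success: dropped
```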

Context propagation connects telemetry across service boundaries. Requests carry trace IDs and span IDs through HTTP headers or message metadata. Each service extracts parent context and creates child spans. Without proper propagation, distributed traces fragment into disconnected segments that obscure request flows.
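Over HTTP, context propagation commonly uses the W3C Trace Context traceparent header, which encodes a version, the trace ID, the parent span ID, and a sampling flag as dash-separated hex fields. A minimal sketch of building and parsing that header (in practice OpenTelemetry's propagators handle this):

```ruby
require 'securerandom'

# Build a W3C traceparent header: 00-<32 hex trace id>-<16 hex span id>-<2 hex flags>
def build_traceparent(trace_id, span_id, sampled: true)
  flags = sampled ? '01' : '00'
  "00-#{trace_id}-#{span_id}-#{flags}"
end

# Parse an incoming header back into its parts; nil on malformed input.
def parse_traceparent(header)
  version, trace_id, span_id, flags = header.split('-')
  return nil unless version == '00' && trace_id&.length == 32 && span_id&.length == 16

  { trace_id: trace_id, span_id: span_id, sampled: flags == '01' }
end

trace_id = SecureRandom.hex(16) # 32 hex characters
span_id  = SecureRandom.hex(8)  # 16 hex characters
header   = build_traceparent(trace_id, span_id)
```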

Data retention balances detail against storage costs. Raw metrics might aggregate after 30 days to reduce storage while retaining long-term trends. Recent logs might be hot-searchable while older logs move to cold storage. Trace sampling rates might increase for recent time periods. Retention policies affect debugging capability—investigating month-old incidents requires data from that period.
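The aggregation step such a policy applies can be sketched as a small downsampling function. The per-minute granularity and the [timestamp, value] pair format are illustrative assumptions:

```ruby
# Collapse raw per-second samples into per-minute averages, keeping the
# long-term trend while shrinking storage.
# points: array of [unix_timestamp, value] pairs
def downsample_to_minutes(points)
  points
    .group_by { |ts, _| ts - (ts % 60) } # bucket by minute boundary
    .map do |minute, samples|
      avg = samples.sum { |_, v| v } / samples.size.to_f
      [minute, avg.round(2)]
    end
    .sort_by(&:first)
end

raw = [[60, 1.0], [75, 3.0], [120, 5.0]]
```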

Instrumentation generates telemetry by adding code to measure operations. Manual instrumentation gives precise control but requires developer effort. Automatic instrumentation uses libraries to instrument frameworks and dependencies without code changes. Both approaches have trade-offs between completeness and maintenance burden.

Service Level Objectives (SLOs) define target reliability levels using metrics. An SLO might specify that 99% of requests complete within 300ms over a 30-day window. Error budgets calculate remaining allowed failures before violating SLOs. SLOs focus monitoring on user-impacting metrics rather than infrastructure details.

# Defining SLOs with metrics
class SLOTracker
  def initialize
    @success_count = 0
    @total_count = 0
  end
  
  def track_request(duration, status)
    @total_count += 1
    
    # SLO: 99% of requests under 300ms with 2xx status
    if duration < 0.3 && (200..299).include?(status)
      @success_count += 1
    end
  end
  
  def slo_compliance
    return 100.0 if @total_count.zero?
    (@success_count.to_f / @total_count * 100).round(2)
  end
  
  def error_budget_remaining
    target_compliance = 99.0
    return 100.0 if @total_count.zero? # no traffic yet: full budget remains
    return 0.0 if slo_compliance < target_compliance # budget already exhausted
    
    allowed_failures = @total_count * (100 - target_compliance) / 100
    actual_failures = @total_count - @success_count
    remaining_failures = allowed_failures - actual_failures
    
    (remaining_failures / allowed_failures * 100).round(2)
  end
end

Implementation Approaches

Teams implement monitoring and observability through different architectural patterns depending on system complexity, team size, and operational requirements.

Push-based telemetry sends metrics, logs, and traces from applications to collection endpoints. Applications actively transmit data to collectors at regular intervals or when events occur. This approach works well for short-lived processes like serverless functions that might not exist when a scraper tries to pull data. Push systems handle dynamic infrastructure where endpoints change frequently. However, push increases application complexity and network traffic since each service manages data transmission.
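A minimal sketch of the push model, emitting a metric in the StatsD line format ("name:value|type") over UDP. The metric name is invented, and the in-process receiver stands in for a real collector daemon:

```ruby
require 'socket'

# Stand-in collector: binds a local UDP port (port 0 lets the OS choose).
collector = UDPSocket.new
collector.bind('127.0.0.1', 0)
port = collector.addr[1]

# Application side: actively pushes a counter increment to the collector.
client = UDPSocket.new
client.send('myapp.http.requests:1|c', 0, '127.0.0.1', port)

datagram, _addr = collector.recvfrom(512)
client.close
collector.close
```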

Pull-based telemetry exposes metrics endpoints that collectors scrape at intervals. Prometheus popularized this pattern for metrics collection. Applications expose HTTP endpoints returning current metric values. Collectors discover targets through service discovery and periodically fetch metrics. Pull-based systems reduce application complexity since services only expose data rather than managing transmission. Central collectors control scraping frequency and can handle backpressure. The downside is collectors must reach all application endpoints, complicating network configurations.
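The application side of the pull model amounts to rendering current values in a scrapeable text format. A sketch that emits counters in the Prometheus text exposition format (metric names and labels invented for the example; the prometheus-client gem generates this output in real applications):

```ruby
# Render counters as Prometheus exposition text:
#   # TYPE <name> counter
#   <name>{label="value",...} <count>
def render_metrics(counters)
  counters.map { |name, series|
    lines = ["# TYPE #{name} counter"]
    series.each do |labels, value|
      label_str = labels.map { |k, v| %(#{k}="#{v}") }.join(',')
      lines << "#{name}{#{label_str}} #{value}"
    end
    lines.join("\n")
  }.join("\n")
end

output = render_metrics(
  'http_requests_total' => {
    { method: 'GET', status: 200 } => 42,
    { method: 'POST', status: 500 } => 3
  }
)
```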

Agent-based collection deploys collection agents alongside applications. Agents run as sidecars in container environments or as daemons on virtual machines. Applications write logs to stdout/stderr and expose metrics endpoints locally. Agents collect data and forward it to backends. This approach centralizes collection logic and reduces application dependencies. Applications don't need direct connectivity to observability backends. Agents can buffer data during network issues. However, agents consume additional resources and add operational complexity.

Library instrumentation embeds telemetry collection directly in application code through libraries and SDKs. OpenTelemetry provides vendor-neutral libraries for metrics, logs, and traces. Applications import libraries, configure exporters, and add instrumentation to code. This approach gives fine-grained control over what data gets collected. However, it requires code changes and library maintenance across all services.

Auto-instrumentation injects telemetry collection into applications without code changes. Ruby supports auto-instrumentation through gem dependencies that monkey-patch common frameworks. Java and .NET use byte-code manipulation. This approach reduces developer burden but provides less control over instrumentation details and may miss custom code paths.
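The monkey-patching mechanism behind Ruby auto-instrumentation can be sketched with Module#prepend: a prepended module wraps an existing method without editing its source. The MailClient class and the recording array are invented for illustration; instrumentation gems apply the same pattern to framework classes.

```ruby
# Hypothetical class standing in for a framework class an instrumentation
# gem might target.
class MailClient
  def deliver(message)
    "delivered: #{message}"
  end
end

RECORDED_CALLS = []

# Prepended module: its #deliver runs first and reaches the original via super.
module DeliverInstrumentation
  def deliver(message)
    start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    result = super
    duration = Process.clock_gettime(Process::CLOCK_MONOTONIC) - start
    RECORDED_CALLS << { method: :deliver, duration: duration }
    result
  end
end

MailClient.prepend(DeliverInstrumentation)

result = MailClient.new.deliver('hello')
```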

# Library instrumentation approach
require 'opentelemetry/sdk'
require 'opentelemetry/exporter/otlp'
require 'opentelemetry/instrumentation/all' # required for use_all

OpenTelemetry::SDK.configure do |c|
  c.service_name = 'user-service'
  c.use_all # Auto-instrument supported gems
end

class UserController < ApplicationController
  def show
    tracer = OpenTelemetry.tracer_provider.tracer('user-service')
    
    tracer.in_span('fetch_user') do |span|
      @user = User.find(params[:id])
      span.set_attribute('user.id', @user.id)
    end
    
    tracer.in_span('fetch_permissions') do
      @permissions = @user.permissions
    end
  end
end

Structured logging formats log messages as key-value pairs rather than unstructured text. JSON logs enable efficient searching and filtering. Applications output structured data that log processors can parse and index. This approach requires standardizing log formats across services.

# Structured logging implementation
require 'logger'
require 'json'

class StructuredLogger
  def initialize(output = $stdout)
    @logger = Logger.new(output)
    @logger.formatter = proc do |severity, datetime, progname, msg|
      JSON.generate(
        timestamp: datetime.iso8601,
        severity: severity,
        message: msg.is_a?(Hash) ? msg[:message] : msg,
        **extract_context(msg)
      ) + "\n"
    end
  end
  
  def info(message_or_hash)
    @logger.info(message_or_hash)
  end
  
  private
  
  def extract_context(msg)
    return {} unless msg.is_a?(Hash)
    msg.except(:message)
  end
end

logger = StructuredLogger.new
logger.info(
  message: 'User created',
  user_id: 12345,
  email: 'user@example.com',
  source_ip: '192.168.1.1'
)
# => {"timestamp":"2025-10-10T10:30:45Z","severity":"INFO","message":"User created","user_id":12345,"email":"user@example.com","source_ip":"192.168.1.1"}

Centralized vs distributed collection represents an architectural choice. Centralized systems send all telemetry to a single backend for storage and analysis. This simplifies querying across services but creates scaling bottlenecks. Distributed systems partition data across multiple backends, improving scalability but complicating cross-service queries.

Sampling strategies control data volume. Probabilistic sampling collects a fixed percentage of traces regardless of content. Adaptive sampling adjusts rates based on traffic volume. Intelligent sampling prioritizes interesting traces like errors or slow requests. Teams combine strategies, keeping all error traces while sampling successful requests.

Ruby Implementation

Ruby applications implement monitoring and observability through gems, framework integrations, and language features. The ecosystem provides tools for each observability pillar with varying levels of maturity.

Metrics collection in Ruby commonly uses the Prometheus client gem for exposing metrics or StatsD for push-based metrics. The prometheus-client gem provides metric types and an HTTP endpoint for scraping.

require 'prometheus/client'
require 'prometheus/middleware/exporter'

# Initialize registry
prometheus = Prometheus::Client.registry

# Define metrics
http_requests = prometheus.counter(
  :http_requests_total,
  docstring: 'Total HTTP requests',
  labels: [:method, :path, :status]
)

http_duration = prometheus.histogram(
  :http_request_duration_seconds,
  docstring: 'HTTP request duration',
  labels: [:method, :path],
  buckets: [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10]
)

# Middleware to track metrics
class MetricsMiddleware
  def initialize(app, requests_counter, duration_histogram)
    @app = app
    @requests = requests_counter
    @duration = duration_histogram
  end
  
  def call(env)
    start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    status, headers, body = @app.call(env)
    duration = Process.clock_gettime(Process::CLOCK_MONOTONIC) - start
    
    labels = {
      method: env['REQUEST_METHOD'],
      path: env['PATH_INFO'],
      status: status
    }
    
    @requests.increment(labels: labels.slice(:method, :path, :status))
    @duration.observe(duration, labels: labels.slice(:method, :path))
    
    [status, headers, body]
  end
end

# Rack application
use MetricsMiddleware, http_requests, http_duration
use Prometheus::Middleware::Exporter # from prometheus/middleware/exporter

Logging in Ruby applications typically uses the standard Logger class or structured logging gems like Semantic Logger or Ougai. Rails applications have built-in logging that can be configured for structured output.

require 'ougai'

class ApplicationLogger < Ougai::Logger
  include ActiveSupport::LoggerThreadSafeLevel
  include ActiveSupport::LoggerSilence
  
  def create_formatter
    if Rails.env.production?
      Ougai::Formatters::Bunyan.new
    else
      Ougai::Formatters::Readable.new
    end
  end
end

# Configure Rails logger
Rails.application.configure do
  config.logger = ApplicationLogger.new(STDOUT)
  config.log_level = :info
end

# Usage in application code
class OrdersController < ApplicationController
  def create
    Rails.logger.info(
      'Order creation started',
      user_id: current_user.id,
      items_count: params[:items].length
    )
    
    order = Order.create!(order_params)
    
    Rails.logger.info(
      'Order created successfully',
      order_id: order.id,
      total_amount: order.total,
      user_id: current_user.id
    )
    
    render json: order
  rescue StandardError => e
    Rails.logger.error(
      'Order creation failed',
      error: e.class.name,
      message: e.message,
      backtrace: e.backtrace.first(5),
      user_id: current_user.id
    )
    
    render json: { error: 'Order creation failed' }, status: :unprocessable_entity
  end
end

Distributed tracing in Ruby uses OpenTelemetry gems. The opentelemetry-sdk gem provides the core SDK, while instrumentation gems add automatic tracing for frameworks like Rails, Sidekiq, and database libraries.

# config/initializers/opentelemetry.rb
require 'opentelemetry/sdk'
require 'opentelemetry/instrumentation/all'

OpenTelemetry::SDK.configure do |c|
  c.service_name = 'rails-app'
  c.service_version = '1.0.0'
  
  # Configure exporter
  c.add_span_processor(
    OpenTelemetry::SDK::Trace::Export::BatchSpanProcessor.new(
      OpenTelemetry::Exporter::OTLP::Exporter.new(
        endpoint: ENV.fetch('OTEL_EXPORTER_OTLP_ENDPOINT', 'http://localhost:4318')
      )
    )
  )
  
  # Auto-instrument common gems
  c.use_all({
    'OpenTelemetry::Instrumentation::Rails' => { enabled: true },
    'OpenTelemetry::Instrumentation::ActiveRecord' => { enabled: true },
    'OpenTelemetry::Instrumentation::Redis' => { enabled: true },
    'OpenTelemetry::Instrumentation::Sidekiq' => { enabled: true }
  })
end

# Manual instrumentation
class PaymentService
  def process_payment(order)
    tracer = OpenTelemetry.tracer_provider.tracer('payment-service')
    
    tracer.in_span('payment.process', attributes: {
      'order.id' => order.id,
      'order.total' => order.total
    }) do |span|
      
      gateway_response = tracer.in_span('payment.gateway_call') do
        call_payment_gateway(order)
      end
      
      span.set_attribute('payment.gateway_id', gateway_response.id)
      span.set_attribute('payment.status', gateway_response.status)
      
      if gateway_response.success?
        span.status = OpenTelemetry::Trace::Status.ok
      else
        span.status = OpenTelemetry::Trace::Status.error('Payment failed')
        span.record_exception(PaymentError.new(gateway_response.error))
      end
      
      gateway_response
    end
  end
end

Background job monitoring requires special consideration since jobs run outside the request/response cycle. Sidekiq provides hooks for instrumentation.

class SidekiqInstrumentation
  def call(worker, job, queue)
    tracer = OpenTelemetry.tracer_provider.tracer('sidekiq')
    
    tracer.in_span(
      "sidekiq.job.#{worker.class.name}",
      attributes: {
        'job.id' => job['jid'],
        'job.queue' => queue,
        'job.args' => job['args'].to_json,
        'job.retry_count' => job['retry_count'] || 0
      },
      kind: :consumer
    ) do |span|
      start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
      
      begin
        yield
        span.status = OpenTelemetry::Trace::Status.ok
      rescue StandardError => e
        span.status = OpenTelemetry::Trace::Status.error(e.message)
        span.record_exception(e)
        raise
      ensure
        duration = Process.clock_gettime(Process::CLOCK_MONOTONIC) - start
        span.set_attribute('job.duration', duration)
      end
    end
  end
end

Sidekiq.configure_server do |config|
  config.server_middleware do |chain|
    chain.add SidekiqInstrumentation
  end
end

Context propagation in Ruby requires extracting trace context from incoming requests and injecting it into outgoing requests. OpenTelemetry handles this through propagators.

class ApiClient
  def initialize
    @conn = Faraday.new(url: 'https://api.example.com')
    @tracer = OpenTelemetry.tracer_provider.tracer('api-client')
  end
  
  def fetch_user(user_id)
    @tracer.in_span('api.fetch_user', attributes: { 'user.id' => user_id }) do
      headers = {}
      
      # Inject trace context into headers
      OpenTelemetry.propagation.inject(headers)
      
      response = @conn.get("/users/#{user_id}") do |req|
        headers.each { |key, value| req.headers[key] = value }
      end
      
      JSON.parse(response.body)
    end
  end
end

Error tracking integration captures exceptions with additional context. Gems like Sentry and Honeybadger integrate with tracing.

class ErrorTracker
  def self.capture_exception(exception, context = {})
    # Add trace context to error reports
    trace_id = OpenTelemetry::Trace.current_span.context.trace_id.unpack1('H*')
    span_id = OpenTelemetry::Trace.current_span.context.span_id.unpack1('H*')
    
    Sentry.capture_exception(exception, extra: context.merge(
      trace_id: trace_id,
      span_id: span_id
    ))
  end
end

Tools & Ecosystem

The observability ecosystem provides specialized tools for collecting, storing, and analyzing telemetry data. Ruby applications integrate with these tools through client libraries and exporters.

Application Performance Monitoring (APM) platforms provide end-to-end observability for Ruby applications. New Relic, Datadog, and AppSignal offer Ruby-specific agents that auto-instrument frameworks. These agents collect metrics, traces, and errors with minimal configuration.

New Relic's Ruby agent instruments Rails, Sinatra, and Grape frameworks automatically. The agent tracks transaction traces, database queries, external service calls, and background jobs. Configuration occurs through a YAML file or environment variables.

# Gemfile
gem 'newrelic_rpm'

# config/newrelic.yml
production:
  license_key: <%= ENV['NEW_RELIC_LICENSE_KEY'] %>
  app_name: <%= ENV['NEW_RELIC_APP_NAME'] %>
  distributed_tracing:
    enabled: true
  transaction_tracer:
    enabled: true
    record_sql: obfuscated
  error_collector:
    enabled: true
    ignore_errors: "ActionController::RoutingError"

# Custom instrumentation
class ReportGenerator
  include NewRelic::Agent::Instrumentation::ControllerInstrumentation
  
  def generate
    perform_action_with_newrelic_trace(
      name: 'generate_report',
      category: :task
    ) do
      # Report generation logic
    end
  end
end

Prometheus and Grafana form an open-source monitoring stack. Prometheus scrapes metrics from application endpoints and stores time-series data. Grafana visualizes metrics through dashboards. The prometheus-client gem exposes metrics for Prometheus scraping.

ELK Stack (Elasticsearch, Logstash, Kibana) handles log aggregation and analysis. Applications send logs to Logstash, which parses and forwards them to Elasticsearch for storage. Kibana provides search and visualization. Ruby applications typically log to stdout/stderr, which container orchestrators forward to log collectors.

OpenTelemetry provides vendor-neutral instrumentation. The OpenTelemetry Ruby SDK supports multiple exporters, allowing teams to send data to different backends without changing instrumentation code.

# Configure multiple exporters
require 'opentelemetry/sdk'
require 'opentelemetry/exporter/otlp'
require 'opentelemetry/exporter/jaeger'

OpenTelemetry::SDK.configure do |c|
  c.service_name = 'multi-backend-app'
  
  # OTLP exporter for production
  otlp_exporter = OpenTelemetry::Exporter::OTLP::Exporter.new(
    endpoint: ENV['OTEL_EXPORTER_OTLP_ENDPOINT']
  )
  
  # Jaeger exporter for local development
  jaeger_exporter = OpenTelemetry::Exporter::Jaeger::CollectorExporter.new(
    endpoint: 'http://localhost:14268/api/traces'
  )
  
  # Use appropriate exporter based on environment
  exporter = Rails.env.production? ? otlp_exporter : jaeger_exporter
  
  c.add_span_processor(
    OpenTelemetry::SDK::Trace::Export::BatchSpanProcessor.new(exporter)
  )
end

StatsD provides a simple protocol for metrics collection. Applications send metrics over UDP to a StatsD daemon, which aggregates and forwards them to backends like Graphite or Datadog. The statsd-ruby gem implements the basic client; Datadog's dogstatsd-ruby extends the protocol with tag support.

require 'datadog/statsd'

class MetricsClient
  def initialize
    # dogstatsd-ruby client; the plain StatsD protocol has no tag concept
    @statsd = Datadog::Statsd.new('localhost', 8125, namespace: 'myapp')
  end
  end
  
  def track_request(controller, action, duration, status)
    # Increment counter
    @statsd.increment('http.requests', tags: [
      "controller:#{controller}",
      "action:#{action}",
      "status:#{status}"
    ])
    
    # Record timing
    @statsd.timing('http.response_time', duration * 1000, tags: [
      "controller:#{controller}",
      "action:#{action}"
    ])
    
    # Track gauge
    @statsd.gauge('http.active_requests', Thread.list.count)
  end
end

Jaeger and Zipkin specialize in distributed tracing. Both implement OpenTelemetry standards. Jaeger offers better performance for high-volume tracing, while Zipkin provides simpler deployment. OpenTelemetry exporters support both systems.

Sentry and Bugsnag focus on error tracking with context. These tools capture exceptions, stack traces, and breadcrumbs showing events leading to errors. Integration with tracing links errors to specific traces.

Fluentd and Fluent Bit serve as log collectors and forwarders. These tools gather logs from multiple sources, parse structured data, and route to various destinations. Ruby applications output JSON logs that Fluentd parses without additional configuration.

InfluxDB and TimescaleDB store time-series data efficiently. These databases optimize for metrics workloads with time-based queries. Teams use them as alternatives to Prometheus for long-term metrics storage.

Honeycomb focuses on high-cardinality observability. The platform handles exploratory queries across dimensions that would overwhelm traditional metrics systems. Ruby applications use the honeycomb-beeline gem.

require 'honeycomb-beeline'

Honeycomb.configure do |config|
  config.write_key = ENV['HONEYCOMB_WRITEKEY']
  config.dataset = 'rails-app'
  config.service_name = 'web'
end

class OrdersController < ApplicationController
  def create
    Honeycomb.start_span(name: 'orders.create') do
      Honeycomb.add_field('user.id', current_user.id)
      Honeycomb.add_field('items.count', params[:items].length)
      
      order = Order.create!(order_params)
      
      Honeycomb.add_field('order.id', order.id)
      Honeycomb.add_field('order.total', order.total)
      
      render json: order
    end
  end
end

Integration & Interoperability

Integrating observability into Ruby applications requires coordination between application code, infrastructure, and analysis tools. Different integration patterns suit different architectural needs.

Rack middleware provides a standard integration point for web applications. Middleware intercepts requests and responses, adding instrumentation without modifying application code.

# Comprehensive observability middleware
class ObservabilityMiddleware
  def initialize(app, metrics:, logger:, tracer:)
    @app = app
    @metrics = metrics
    @logger = logger
    @tracer = tracer
  end
  
  def call(env)
    request_id = SecureRandom.uuid
    env['HTTP_X_REQUEST_ID'] = request_id
    
    @tracer.in_span('http.request', attributes: base_attributes(env, request_id)) do |span|
      @logger.info('Request started', request_attributes(env, request_id))
      
      start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
      status, headers, body = @app.call(env)
      duration = Process.clock_gettime(Process::CLOCK_MONOTONIC) - start
      
      record_metrics(env, status, duration)
      span.set_attribute('http.status_code', status)
      span.set_attribute('http.duration', duration)
      
      @logger.info('Request completed', 
        request_id: request_id,
        status: status,
        duration: duration
      )
      
      headers['X-Request-ID'] = request_id
      [status, headers, body]
    end
  end
  
  private
  
  def base_attributes(env, request_id)
    {
      'http.method' => env['REQUEST_METHOD'],
      'http.url' => env['PATH_INFO'],
      'http.user_agent' => env['HTTP_USER_AGENT'],
      'request.id' => request_id
    }
  end
  
  def request_attributes(env, request_id)
    {
      request_id: request_id,
      method: env['REQUEST_METHOD'],
      path: env['PATH_INFO'],
      user_agent: env['HTTP_USER_AGENT'],
      remote_ip: env['REMOTE_ADDR']
    }
  end
  
  def record_metrics(env, status, duration)
    labels = {
      method: env['REQUEST_METHOD'],
      path: normalize_path(env['PATH_INFO']),
      status: status
    }
    
    @metrics.increment('http_requests_total', labels: labels)
    @metrics.observe('http_request_duration_seconds', duration, labels: labels.except(:status))
  end
  
  def normalize_path(path)
    # Replace IDs with placeholders for lower cardinality
    path.gsub(/\/\d+/, '/:id')
  end
end

Database query instrumentation tracks slow queries and N+1 problems. ActiveRecord supports query subscribers that receive notifications for all SQL queries.

# Database query monitoring
class DatabaseQuerySubscriber
  def initialize(tracer, logger, slow_query_threshold: 0.1)
    @tracer = tracer
    @logger = logger
    @slow_query_threshold = slow_query_threshold
  end
  
  def start(name, id, payload)
    @query_start_times ||= {}
    @query_start_times[id] = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  end
  
  def finish(name, id, payload)
    return unless @query_start_times[id]
    
    duration = Process.clock_gettime(Process::CLOCK_MONOTONIC) - @query_start_times[id]
    @query_start_times.delete(id)
    
    @tracer.in_span('db.query', attributes: {
      'db.statement' => payload[:sql],
      'db.duration' => duration,
      'db.connection_id' => payload[:connection_id]
    }) do |span|
      
      if duration > @slow_query_threshold
        @logger.warn('Slow database query detected',
          sql: payload[:sql],
          duration: duration,
          connection_id: payload[:connection_id]
        )
        span.add_event('slow_query', attributes: { threshold: @slow_query_threshold })
      end
    end
  end
end

# Register subscriber
subscriber = DatabaseQuerySubscriber.new(tracer, logger)
ActiveSupport::Notifications.subscribe('sql.active_record', subscriber)

Background job integration requires linking jobs to originating requests. Jobs carry trace context through job arguments.

class TracedJob < ApplicationJob
  around_perform do |job, block|
    # Extract trace context from job arguments
    trace_context = job.arguments.last.is_a?(Hash) ? job.arguments.last.delete(:trace_context) : nil
    
    if trace_context
      # Create new span linked to parent trace
      parent_context = OpenTelemetry.propagation.extract(trace_context)
      
      OpenTelemetry::Context.with_current(parent_context) do
        tracer.in_span("job.#{job.class.name}", attributes: {
          'job.id' => job.job_id,
          'job.queue' => job.queue_name
        }) do
          block.call
        end
      end
    else
      # No parent context, create root span
      tracer.in_span("job.#{job.class.name}") do
        block.call
      end
    end
  end
  
  private
  
  def tracer
    OpenTelemetry.tracer_provider.tracer('background-jobs')
  end
end

# Enqueue with trace context
class OrdersController < ApplicationController
  def create
    order = Order.create!(order_params)
    
    # Inject current trace context
    trace_context = {}
    OpenTelemetry.propagation.inject(trace_context)
    
    OrderConfirmationJob.perform_later(order.id, trace_context: trace_context)
    
    render json: order
  end
end

External service calls propagate trace context through HTTP headers. HTTP client libraries support header injection.

class ServiceClient
  def initialize(base_url)
    @base_url = base_url
    @tracer = OpenTelemetry.tracer_provider.tracer('http-client')
  end
  
  def get(path, params: {})
    @tracer.in_span("http.get", attributes: {
      'http.url' => "#{@base_url}#{path}",
      'http.method' => 'GET'
    }) do |span|
      
      headers = { 'Content-Type' => 'application/json' }
      OpenTelemetry.propagation.inject(headers)
      
      response = HTTP.headers(headers).get("#{@base_url}#{path}", params: params)
      
      span.set_attribute('http.status_code', response.code)
      
      if response.status.success?
        JSON.parse(response.body)
      else
        span.status = OpenTelemetry::Trace::Status.error("HTTP #{response.code}")
        raise ServiceError, "Request failed with status #{response.code}"
      end
    end
  end
end

Health check endpoints expose service status for orchestrators and load balancers. These endpoints report dependency health and readiness.

class HealthController < ApplicationController
  def liveness
    # Simple check that process is running
    render json: { status: 'ok' }, status: :ok
  end
  
  def readiness
    checks = {
      database: check_database,
      redis: check_redis,
      external_api: check_external_api
    }
    
    all_healthy = checks.values.all? { |check| check[:healthy] }
    
    status_code = all_healthy ? :ok : :service_unavailable
    
    render json: {
      status: all_healthy ? 'ready' : 'not_ready',
      checks: checks
    }, status: status_code
  end
  
  private
  
  def check_database
    ActiveRecord::Base.connection.execute('SELECT 1')
    { healthy: true }
  rescue StandardError => e
    { healthy: false, error: e.message }
  end
  
  def check_redis
    # Redis.current was removed in redis-rb 5; use a dedicated connection
    # (or an app-wide client) for the probe
    Redis.new(url: ENV['REDIS_URL']).ping
    { healthy: true }
  rescue StandardError => e
    { healthy: false, error: e.message }
  end
  
  def check_external_api
    response = HTTP.timeout(2).get(ENV['EXTERNAL_API_HEALTH_URL'])
    { healthy: response.status.success? }
  rescue StandardError => e
    { healthy: false, error: e.message }
  end
end
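The same aggregation pattern works outside Rails. A minimal framework-free sketch, where the check lambdas are stand-ins for real dependency probes (database ping, cache ping, HTTP health call):

```ruby
# Aggregate named dependency checks into an overall health verdict.
# Each check either returns normally (healthy) or raises (unhealthy).
def run_checks(checks)
  results = checks.transform_values do |check|
    check.call
    { healthy: true }
  rescue StandardError => e
    { healthy: false, error: e.message }
  end

  [results.values.all? { |r| r[:healthy] }, results]
end

healthy, results = run_checks(
  database: -> { true },
  cache: -> { raise 'connection refused' }
)
# healthy is false because the cache probe raised
```

Returning per-check detail alongside the boolean lets the readiness endpoint report which dependency failed, not just that something did.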

Real-World Applications

Production Ruby applications implement observability patterns that balance insight against overhead. Teams adapt instrumentation based on traffic patterns, system architecture, and operational requirements.

High-throughput APIs face challenges instrumenting millions of requests without impacting performance. Sampling reduces overhead while maintaining visibility into issues. Adaptive sampling adjusts rates based on traffic and error rates.

class AdaptiveSampler
  def initialize(base_rate: 0.01)
    @base_rate = base_rate
    @request_count = 0
    @error_count = 0
    @last_reset = Time.now
  end
  
  def should_sample?(error: false)
    # Always sample errors
    return true if error
    
    # Reset counters periodically so the error ratio reflects recent traffic
    maybe_reset_counters
    
    rand < current_rate
  end
  
  def record_request(error: false)
    @request_count += 1
    @error_count += 1 if error
  end
  
  private
  
  def current_rate
    # Increase sampling tenfold when the recent error rate exceeds 1%
    error_ratio = @error_count.to_f / [@request_count, 1].max
    
    if error_ratio > 0.01
      [@base_rate * 10, 1.0].min
    else
      @base_rate
    end
  end
  
  # Note: counters are unsynchronized; wrap them in a Mutex under
  # multi-threaded servers
  def maybe_reset_counters
    if Time.now - @last_reset > 60
      @request_count = 0
      @error_count = 0
      @last_reset = Time.now
    end
  end
end

# Integration with tracing: OpenTelemetry's sampler interface takes keyword
# arguments and returns a Samplers::Result rather than a bare boolean
class TracingSampler
  def initialize(sampler)
    @sampler = sampler
  end
  
  def should_sample?(trace_id:, parent_context:, links:, name:, kind:, attributes:)
    error = attributes && attributes['error'] == true
    decision = if @sampler.should_sample?(error: error)
                 OpenTelemetry::SDK::Trace::Samplers::Decision::RECORD_AND_SAMPLE
               else
                 OpenTelemetry::SDK::Trace::Samplers::Decision::DROP
               end
    
    OpenTelemetry::SDK::Trace::Samplers::Result.new(
      decision: decision,
      tracestate: OpenTelemetry::Trace::Tracestate::DEFAULT
    )
  end
end

Microservices architectures require coordinated instrumentation across services. Consistent service naming, attribute keys, and context propagation enable cross-service analysis.

# Shared observability configuration
module ObservabilityConfig
  # Resource attributes require string keys; these names follow the
  # OpenTelemetry resource semantic conventions
  STANDARD_ATTRIBUTES = {
    'deployment.environment' => ENV['RAILS_ENV'],
    'service.version' => ENV['APP_VERSION'],
    'k8s.pod.name' => ENV['HOSTNAME'],
    'k8s.namespace.name' => ENV['K8S_NAMESPACE']
  }.freeze
  
  def self.configure_telemetry(service_name)
    OpenTelemetry::SDK.configure do |c|
      c.service_name = service_name
      
      # Add standard resource attributes
      c.resource = OpenTelemetry::SDK::Resources::Resource.create(
        STANDARD_ATTRIBUTES.merge(
          'service.name' => service_name,
          'service.namespace' => 'production'
        )
      )
      
      # Configure exporters
      c.add_span_processor(
        OpenTelemetry::SDK::Trace::Export::BatchSpanProcessor.new(
          OpenTelemetry::Exporter::OTLP::Exporter.new
        )
      )
      
      # Use consistent instrumentation
      c.use_all
    end
  end
end

# Each service uses consistent configuration
ObservabilityConfig.configure_telemetry('user-service')

Serverless functions in AWS Lambda face constraints around persistent connections and startup time. Instrumentation minimizes cold start impact.

# Lambda-optimized observability
require 'opentelemetry/sdk'
require 'logger'
require 'json'
require 'time'

class LambdaHandler
  def initialize
    # Initialize once during cold start
    @tracer = OpenTelemetry.tracer_provider.tracer('lambda-function')
    @logger = Logger.new($stdout)
    @logger.formatter = proc { |severity, datetime, progname, msg|
      JSON.generate(timestamp: datetime.iso8601, severity: severity, message: msg) + "\n"
    }
  end
  
  def handle_event(event:, context:)
    # Extract request ID from Lambda context
    request_id = context.aws_request_id
    
    @tracer.in_span('lambda.invocation', attributes: {
      'faas.execution' => request_id,
      'faas.id' => context.invoked_function_arn
    }) do |span|
      
      @logger.info(
        message: 'Function invoked',
        request_id: request_id,
        function_name: context.function_name,
        remaining_time_ms: context.get_remaining_time_in_millis
      )
      
      result = process_event(event)
      
      span.set_attribute('event.type', event['type'])
      
      @logger.info(
        message: 'Function completed',
        request_id: request_id,
        result_count: result.length
      )
      
      result
    end
  rescue StandardError => e
    @logger.error(
      message: 'Function failed',
      request_id: request_id,
      error: e.class.name,
      error_message: e.message
    )
    raise
  end
  
  private
  
  def process_event(event)
    # Business logic
  end
end

# Lambda handler
HANDLER = LambdaHandler.new

def lambda_handler(event:, context:)
  HANDLER.handle_event(event: event, context: context)
end

Multi-tenant applications track metrics and traces per tenant without creating cardinality explosions. Tenant IDs appear in span attributes rather than metric labels.

class TenantTracking
  def self.with_tenant_context(tenant_id)
    OpenTelemetry::Trace.current_span.set_attribute('tenant.id', tenant_id)
    
    # Add tenant to log context
    Thread.current[:tenant_id] = tenant_id
    
    yield
  ensure
    Thread.current[:tenant_id] = nil
  end
end

# Middleware to extract tenant
class TenantMiddleware
  def initialize(app)
    @app = app
  end
  
  def call(env)
    tenant_id = extract_tenant(env)
    
    TenantTracking.with_tenant_context(tenant_id) do
      @app.call(env)
    end
  end
  
  private
  
  def extract_tenant(env)
    # Extract from subdomain, header, or token
    subdomain = env['HTTP_HOST'].to_s.split('.').first
    Tenant.find_by(subdomain: subdomain)&.id
  end
end
end
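The thread-local tenant set above can flow into structured logs as well. A sketch using only the standard library, where the formatter shape is illustrative:

```ruby
require 'logger'
require 'json'
require 'time'

# A JSON log formatter that reads the tenant from thread-local context,
# matching the Thread.current[:tenant_id] convention used above
tenant_formatter = proc do |severity, time, _progname, msg|
  JSON.generate(
    timestamp: time.utc.iso8601,
    severity: severity,
    tenant_id: Thread.current[:tenant_id],
    message: msg
  ) + "\n"
end

logger = Logger.new($stdout, formatter: tenant_formatter)

Thread.current[:tenant_id] = 'acme'
logger.info('Report generated')  # emitted line includes "tenant_id":"acme"
```

Because the formatter reads the thread-local at log time, every line emitted inside `with_tenant_context` carries the tenant without each call site passing it explicitly.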

Deployment monitoring tracks service health during rollouts. Metrics compare new versions against baselines to detect regressions.

class DeploymentMonitor
  def initialize
    @baseline_latency = fetch_baseline_latency
  end
  
  def check_deployment_health
    current_latency = calculate_current_latency
    error_rate = calculate_error_rate
    
    {
      latency_regression: current_latency > @baseline_latency * 1.5,
      error_rate_high: error_rate > 0.01,
      current_latency: current_latency,
      baseline_latency: @baseline_latency,
      error_rate: error_rate
    }
  end
  
  private
  
  def fetch_baseline_latency
    # Query metrics backend for P95 latency from previous version
    0.3 # Example baseline
  end
  
  def calculate_current_latency
    # Query recent P95 latency
    0.25
  end
  
  def calculate_error_rate
    # Query recent error rate
    0.005
  end
end
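A report like the one above can gate the rollout. A sketch with the same illustrative thresholds used in check_deployment_health (1.5x baseline latency, 1% error rate); the helper name is hypothetical:

```ruby
# Decide whether a rollout proceeds, mirroring the thresholds in
# check_deployment_health above (1.5x baseline latency, 1% error rate)
def proceed_with_rollout?(current_latency:, baseline_latency:, error_rate:)
  current_latency <= baseline_latency * 1.5 && error_rate <= 0.01
end

proceed_with_rollout?(current_latency: 0.25, baseline_latency: 0.3, error_rate: 0.005)
# => true: 0.25s is under the 0.45s ceiling and errors are within budget

proceed_with_rollout?(current_latency: 0.50, baseline_latency: 0.3, error_rate: 0.005)
# => false: latency regressed beyond 1.5x baseline
```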

Reference

Metric Types

| Type | Description | Use Case | Aggregation |
|---|---|---|---|
| Counter | Cumulative value that only increases | Request counts, error tallies | Rate, sum |
| Gauge | Point-in-time value that can increase or decrease | Memory usage, queue depth, active connections | Current value, average |
| Histogram | Distribution of values in configurable buckets | Response times, request sizes | Percentiles, averages |
| Summary | Pre-calculated percentiles | Client-side percentile calculation | Quantiles |
| Timer | Duration measurement | Operation execution time | Percentiles, rates |
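The first three types can be illustrated with a toy implementation; real client libraries add labels, thread safety, and an exposition format:

```ruby
# Toy versions of the three core metric types (not a client library)
class Counter
  attr_reader :value

  def initialize
    @value = 0
  end

  # Counters are cumulative and only increase
  def increment(by = 1)
    @value += by
  end
end

class Gauge
  attr_reader :value

  def initialize
    @value = 0
  end

  # Gauges move in both directions
  def set(value)
    @value = value
  end
end

class Histogram
  attr_reader :bucket_counts

  def initialize(buckets)
    @buckets = buckets.sort
    @bucket_counts = Hash.new(0)
  end

  # Each observation lands in the first bucket whose upper bound covers it;
  # values above every bound go to an implicit +Inf bucket
  def observe(value)
    bucket = @buckets.find { |upper| value <= upper } || Float::INFINITY
    @bucket_counts[bucket] += 1
  end
end

latency = Histogram.new([0.1, 0.5, 1.0])
latency.observe(0.3)   # counted in the 0.5 bucket
latency.observe(0.05)  # counted in the 0.1 bucket
```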

Standard Span Attributes

| Attribute | Type | Description | Example |
|---|---|---|---|
| http.method | string | HTTP request method | GET, POST |
| http.url | string | Full request URL | https://api.example.com/users |
| http.status_code | integer | HTTP response status | 200, 404, 500 |
| db.system | string | Database type | postgresql, redis |
| db.statement | string | Database query | SELECT * FROM users WHERE id = 1 |
| error | boolean | Whether span represents error | true, false |
| messaging.system | string | Message queue system | rabbitmq, kafka |
| peer.service | string | Remote service name | payment-service |

Log Levels

| Level | Severity | Use Case | Production Volume |
|---|---|---|---|
| DEBUG | 7 | Development debugging, detailed state | Disabled |
| INFO | 6 | Normal operations, business events | Moderate |
| WARN | 4 | Recoverable errors, deprecated usage | Low |
| ERROR | 3 | Error conditions requiring attention | Very low |
| FATAL | 2 | Critical failures requiring immediate action | Extremely low |
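The severities above follow syslog numbering. Ruby's standard Logger uses its own constants (DEBUG=0 through FATAL=4), so a mapping is needed when emitting syslog-style severity fields; a sketch:

```ruby
require 'logger'

# Map Ruby Logger constants (DEBUG=0 .. FATAL=4) to the syslog severities
# used in the table above
SYSLOG_SEVERITY = {
  Logger::DEBUG => 7,
  Logger::INFO  => 6,
  Logger::WARN  => 4,
  Logger::ERROR => 3,
  Logger::FATAL => 2
}.freeze

logger = Logger.new($stdout, level: Logger::INFO)
logger.debug('dropped: below the INFO threshold')
logger.info('emitted')
```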

Common SLO Targets

| Service Type | Availability | Latency P95 | Latency P99 |
|---|---|---|---|
| Internal API | 99.9% | 100ms | 250ms |
| Public API | 99.95% | 200ms | 500ms |
| Background Job | 99% | 5s | 30s |
| Real-time Service | 99.99% | 50ms | 100ms |
| Batch Processing | 99% | N/A | N/A |

Sampling Decision Matrix

| Traffic Volume | Error Rate | Trace Value | Sample Rate |
|---|---|---|---|
| Low (< 100 req/s) | Any | Any | 100% |
| Medium (100-1000 req/s) | < 1% | Normal | 10% |
| Medium (100-1000 req/s) | > 1% | Normal | 50% |
| High (> 1000 req/s) | < 1% | Normal | 1% |
| High (> 1000 req/s) | > 1% | Normal | 10% |
| Any | Any | Error/Slow | 100% |
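The matrix translates directly into a lookup; a sketch whose thresholds and rate tiers come straight from the rows above:

```ruby
# The sampling decision matrix as a function of traffic and error ratio
def sample_rate(requests_per_sec:, error_ratio:, error_or_slow: false)
  return 1.0 if error_or_slow            # errors and slow traces: keep everything
  return 1.0 if requests_per_sec < 100   # low traffic: sample everything

  if requests_per_sec <= 1000            # medium traffic
    error_ratio > 0.01 ? 0.5 : 0.1
  else                                   # high traffic
    error_ratio > 0.01 ? 0.1 : 0.01
  end
end

sample_rate(requests_per_sec: 500, error_ratio: 0.02)   # => 0.5
sample_rate(requests_per_sec: 5000, error_ratio: 0.001) # => 0.01
```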

OpenTelemetry Exporters

| Exporter | Protocol | Backend | Use Case |
|---|---|---|---|
| OTLP | gRPC/HTTP | OpenTelemetry Collector | Production standard |
| Jaeger | Thrift/gRPC | Jaeger | Development/testing |
| Zipkin | HTTP | Zipkin | Legacy systems |
| Prometheus | HTTP | Prometheus | Metrics only |
| Logging | Stdout | Any log collector | Debugging |

Data Retention Strategies

| Data Type | Hot Storage | Warm Storage | Cold Storage | Archive |
|---|---|---|---|---|
| Traces | 7 days full | 30 days sampled | N/A | N/A |
| Logs | 7 days | 30 days | 90 days | 1 year |
| Metrics (raw) | 30 days | N/A | N/A | N/A |
| Metrics (aggregated) | 1 year | 3 years | N/A | Forever |

Ruby Observability Gems

| Gem | Purpose | Backend | Auto-instrumentation |
|---|---|---|---|
| opentelemetry-sdk | Traces, metrics | Any OTLP | Yes |
| prometheus-client | Metrics | Prometheus | No |
| newrelic_rpm | APM | New Relic | Yes |
| ddtrace | APM | Datadog | Yes |
| sentry-ruby | Errors | Sentry | Yes |
| statsd-ruby | Metrics | StatsD backends | No |
| semantic_logger | Structured logging | Any | No |
| ougai | JSON logging | Any | No |

Context Propagation Headers

| Header | Standard | Format | Example |
|---|---|---|---|
| traceparent | W3C Trace Context | version-traceid-spanid-flags | 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01 |
| tracestate | W3C Trace Context | key=value pairs | vendor1=value1,vendor2=value2 |
| X-B3-TraceId | Zipkin B3 | 128-bit trace ID | 463ac35c9f6413ad48485a3953bb6124 |
| X-B3-SpanId | Zipkin B3 | 64-bit span ID | a2fb4a1d1a96d312 |
| X-Request-ID | Custom | UUID | 550e8400-e29b-41d4-a716-446655440000 |
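A traceparent value decomposes into the four fields shown. A dependency-free parsing sketch; the helper name is illustrative, and real propagators handle more validation:

```ruby
# Split a W3C traceparent header into its fields.
# Format: version(2 hex)-trace_id(32 hex)-span_id(16 hex)-flags(2 hex)
def parse_traceparent(header)
  version, trace_id, span_id, flags = header.to_s.split('-')
  return nil unless version&.length == 2 &&
                    trace_id&.length == 32 &&
                    span_id&.length == 16 &&
                    flags&.length == 2

  {
    version: version,
    trace_id: trace_id,
    parent_span_id: span_id,
    sampled: (flags.to_i(16) & 0x01) == 1  # low bit of flags = sampled
  }
end

parsed = parse_traceparent('00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01')
# parsed[:sampled] => true
```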