CrackedRuby

Monitoring and Observability

Overview

Monitoring and observability represent two related but distinct approaches to understanding system behavior in production environments. Monitoring answers specific questions about system health through predefined metrics and alerts, while observability provides the capability to explore unknown system states and answer arbitrary questions about behavior.

Monitoring emerged first, focusing on collecting and tracking known metrics like CPU usage, memory consumption, and error rates. Systems send alerts when these metrics cross predetermined thresholds. This approach works when teams know what problems to expect and can define appropriate thresholds in advance.

Observability extends beyond traditional monitoring by instrumenting systems to expose internal state through three primary signals: metrics (numerical measurements over time), logs (discrete event records), and traces (request flow through distributed systems). The term comes from control theory, where a system is observable if its internal state can be inferred from external outputs.

The distinction matters because modern distributed systems exhibit emergent behaviors that teams cannot predict during development. A microservices architecture might fail in novel ways as services interact under load. Observability tools allow engineers to explore these unknown states by querying telemetry data without requiring predefined dashboards or alerts.

Consider a Ruby web application experiencing intermittent slow responses. Traditional monitoring shows elevated response times and triggers an alert. Observability data reveals the specific request paths affected, the database queries causing slowness, the user segments impacted, and correlations with deployment events—all discoverable through ad-hoc queries rather than pre-configured dashboards.

# Traditional monitoring: Track predefined metric
class ApplicationController < ActionController::Base
  around_action :track_response_time
  
  def track_response_time
    start = Process.clock_gettime(Process::CLOCK_MONOTONIC) # monotonic clock: immune to system clock changes
    yield
    duration = Process.clock_gettime(Process::CLOCK_MONOTONIC) - start
    Metrics.gauge('http.response_time', duration)
  end
end

# Observability: Rich contextual data
class ApplicationController < ActionController::Base
  around_action :instrument_request
  
  def instrument_request
    tracer.in_span('http.request') do |span|
      span.set_attribute('http.method', request.method)
      span.set_attribute('http.path', request.path)
      span.set_attribute('user.id', current_user&.id)
      
      yield
      
      span.set_attribute('http.status', response.status)
    end
  end
end

The observability approach captures arbitrary attributes on each request, allowing engineers to filter and group by any dimension during investigation. Monitoring provides alerts for known failure modes; observability enables exploration of unknown issues.

Key Principles

Observability rests on three foundational pillars that work together to provide system insight: metrics, logs, and distributed traces. Each pillar captures different aspects of system behavior and serves distinct purposes during operation and debugging.

Metrics represent numerical measurements sampled over time intervals. Counter metrics track cumulative totals like request counts or error tallies. Gauge metrics capture point-in-time values like memory usage or queue depth. Histogram metrics record value distributions like response time percentiles. Timer metrics measure operation duration. Metrics excel at showing trends and triggering alerts but lack the context to explain why values changed.
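The four metric types can be sketched as minimal in-memory classes. This is illustrative only: the class names and structure are invented for the example, and a production application would use a client library such as prometheus-client instead. A timer is omitted since it is effectively a histogram of durations.

```ruby
# Minimal in-memory sketches of the metric types (illustrative, not a real client).

class Counter
  attr_reader :value

  def initialize
    @value = 0
  end

  # Counters only ever go up (request counts, error tallies).
  def increment(by = 1)
    @value += by
  end
end

class Gauge
  attr_reader :value

  def initialize
    @value = 0
  end

  # Gauges capture a point-in-time value that can rise or fall.
  def set(value)
    @value = value
  end
end

class Histogram
  def initialize(buckets)
    @buckets = buckets.sort
    @counts = Hash.new(0) # bucket upper bound => observation count
  end

  # Each observation lands in the first bucket whose upper bound fits it.
  def observe(value)
    bucket = @buckets.find { |b| value <= b } || Float::INFINITY
    @counts[bucket] += 1
  end

  def count_in(bucket)
    @counts[bucket]
  end
end

requests = Counter.new
3.times { requests.increment }

queue_depth = Gauge.new
queue_depth.set(42)

latency = Histogram.new([0.1, 0.5, 1.0])
[0.05, 0.3, 0.7].each { |d| latency.observe(d) }
```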

Logs record discrete events with structured or unstructured data. Each log entry captures what happened at a specific moment with contextual details. Application logs document business events, error messages, and state changes. Access logs record HTTP requests. Audit logs track security-relevant actions. Logs provide narrative detail about system behavior but generate high data volume and prove difficult to correlate across services without additional context.

Distributed traces track individual requests as they flow through multiple services. Each trace contains spans representing operations within services. Spans form parent-child relationships showing request paths. Span attributes capture metadata like database queries, cache hits, or external API calls. Traces reveal which services contribute to request latency and where failures occur in distributed systems.

The three pillars complement each other. Metrics identify when problems occur through anomaly detection. Logs explain what happened during those time periods. Traces show how requests moved through the system and where time was spent. Teams combine signals to investigate issues: metrics trigger alerts, traces identify slow services, logs reveal error details.

Cardinality affects storage costs and query performance. High-cardinality data contains many unique values like user IDs or request IDs. Metrics work best with low cardinality (service name, HTTP status code). Logs and traces handle high cardinality better but cost more to store and query. Teams must balance detail against infrastructure costs.

Sampling reduces data volume by collecting subsets of telemetry. Head-based sampling decides whether to collect a trace before it starts based on probability. Tail-based sampling makes decisions after trace completion based on criteria like errors or latency. Sampling trades completeness for cost savings but risks missing rare issues.
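The two sampling styles can be sketched as small Ruby classes. The class names and the latency threshold are invented for illustration; real tracing SDKs expose samplers through their own configuration APIs.

```ruby
# Head-based: decide before the trace starts, using only a probability.
class HeadSampler
  def initialize(rate)
    @rate = rate # e.g. 0.1 keeps roughly 10% of traces
  end

  def sample?
    rand < @rate
  end
end

# Tail-based: decide after the trace completes, using its outcome.
class TailSampler
  def initialize(latency_threshold: 1.0)
    @latency_threshold = latency_threshold
  end

  # Keep every error and every slow trace; drop the rest.
  def keep?(trace)
    trace[:error] || trace[:duration] > @latency_threshold
  end
end

tail = TailSampler.new(latency_threshold: 1.0)
tail.keep?({ error: false, duration: 2.5 }) # slow request: kept
tail.keep?({ error: false, duration: 0.1 }) # fast success: dropped
```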

Context propagation connects telemetry across service boundaries. Requests carry trace IDs and span IDs through HTTP headers or message metadata. Each service extracts parent context and creates child spans. Without proper propagation, distributed traces fragment into disconnected segments that obscure request flows.
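Over HTTP, context propagation commonly uses the W3C Trace Context traceparent header, which encodes a version, the trace ID, the parent span ID, and a sampling flag as dash-separated hex fields. A minimal sketch of building and parsing that header (in practice OpenTelemetry's propagators handle this):

```ruby
require 'securerandom'

# Build a W3C traceparent header: 00-<32 hex trace id>-<16 hex span id>-<2 hex flags>
def build_traceparent(trace_id, span_id, sampled: true)
  flags = sampled ? '01' : '00'
  "00-#{trace_id}-#{span_id}-#{flags}"
end

# Parse an incoming header back into its parts; nil on malformed input.
def parse_traceparent(header)
  version, trace_id, span_id, flags = header.split('-')
  return nil unless version == '00' && trace_id&.length == 32 && span_id&.length == 16

  { trace_id: trace_id, span_id: span_id, sampled: flags == '01' }
end

trace_id = SecureRandom.hex(16) # 32 hex characters
span_id  = SecureRandom.hex(8)  # 16 hex characters
header   = build_traceparent(trace_id, span_id)
```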

Data retention balances detail against storage costs. Raw metrics might aggregate after 30 days to reduce storage while retaining long-term trends. Recent logs might be hot-searchable while older logs move to cold storage. Trace sampling rates might increase for recent time periods. Retention policies affect debugging capability—investigating month-old incidents requires data from that period.
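The aggregation step such a policy applies can be sketched as a small downsampling function. The per-minute granularity and the [timestamp, value] pair format are illustrative assumptions:

```ruby
# Collapse raw per-second samples into per-minute averages, keeping the
# long-term trend while shrinking storage.
# points: array of [unix_timestamp, value] pairs
def downsample_to_minutes(points)
  points
    .group_by { |ts, _| ts - (ts % 60) } # bucket by minute boundary
    .map do |minute, samples|
      avg = samples.sum { |_, v| v } / samples.size.to_f
      [minute, avg.round(2)]
    end
    .sort_by(&:first)
end

raw = [[60, 1.0], [75, 3.0], [120, 5.0]]
```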

Instrumentation generates telemetry by adding code to measure operations. Manual instrumentation gives precise control but requires developer effort. Automatic instrumentation uses libraries to instrument frameworks and dependencies without code changes. Both approaches have trade-offs between completeness and maintenance burden.

Service Level Objectives (SLOs) define target reliability levels using metrics. An SLO might specify that 99% of requests complete within 300ms over a 30-day window. Error budgets calculate remaining allowed failures before violating SLOs. SLOs focus monitoring on user-impacting metrics rather than infrastructure details.

# Defining SLOs with metrics
class SLOTracker
  def initialize
    @success_count = 0
    @total_count = 0
  end
  
  def track_request(duration, status)
    @total_count += 1
    
    # SLO: 99% of requests under 300ms with 2xx status
    if duration < 0.3 && (200..299).include?(status)
      @success_count += 1
    end
  end
  
  def slo_compliance
    return 100.0 if @total_count.zero?
    (@success_count.to_f / @total_count * 100).round(2)
  end
  
  def error_budget_remaining
    target_compliance = 99.0
    return 100.0 if @total_count.zero? # no traffic yet: full budget remains
    return 0.0 if slo_compliance < target_compliance # budget already exhausted
    
    allowed_failures = @total_count * (100 - target_compliance) / 100
    actual_failures = @total_count - @success_count
    remaining_failures = allowed_failures - actual_failures
    
    (remaining_failures / allowed_failures * 100).round(2)
  end
end

Implementation Approaches

Teams implement monitoring and observability through different architectural patterns depending on system complexity, team size, and operational requirements.

Push-based telemetry sends metrics, logs, and traces from applications to collection endpoints. Applications actively transmit data to collectors at regular intervals or when events occur. This approach works well for short-lived processes like serverless functions that might not exist when a scraper tries to pull data. Push systems handle dynamic infrastructure where endpoints change frequently. However, push increases application complexity and network traffic since each service manages data transmission.
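A minimal sketch of the push model, emitting a metric in the StatsD line format ("name:value|type") over UDP. The metric name is invented, and the in-process receiver stands in for a real collector daemon:

```ruby
require 'socket'

# Stand-in collector: binds a local UDP port (port 0 lets the OS choose).
collector = UDPSocket.new
collector.bind('127.0.0.1', 0)
port = collector.addr[1]

# Application side: actively pushes a counter increment to the collector.
client = UDPSocket.new
client.send('myapp.http.requests:1|c', 0, '127.0.0.1', port)

datagram, _addr = collector.recvfrom(512)
client.close
collector.close
```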

Pull-based telemetry exposes metrics endpoints that collectors scrape at intervals. Prometheus popularized this pattern for metrics collection. Applications expose HTTP endpoints returning current metric values. Collectors discover targets through service discovery and periodically fetch metrics. Pull-based systems reduce application complexity since services only expose data rather than managing transmission. Central collectors control scraping frequency and can handle backpressure. The downside is collectors must reach all application endpoints, complicating network configurations.
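The application side of the pull model amounts to rendering current values in a scrapeable text format. A sketch that emits counters in the Prometheus text exposition format (metric names and labels invented for the example; the prometheus-client gem generates this output in real applications):

```ruby
# Render counters as Prometheus exposition text:
#   # TYPE <name> counter
#   <name>{label="value",...} <count>
def render_metrics(counters)
  counters.map { |name, series|
    lines = ["# TYPE #{name} counter"]
    series.each do |labels, value|
      label_str = labels.map { |k, v| %(#{k}="#{v}") }.join(',')
      lines << "#{name}{#{label_str}} #{value}"
    end
    lines.join("\n")
  }.join("\n")
end

output = render_metrics(
  'http_requests_total' => {
    { method: 'GET', status: 200 } => 42,
    { method: 'POST', status: 500 } => 3
  }
)
```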

Agent-based collection deploys collection agents alongside applications. Agents run as sidecars in container environments or as daemons on virtual machines. Applications write logs to stdout/stderr and expose metrics endpoints locally. Agents collect data and forward it to backends. This approach centralizes collection logic and reduces application dependencies. Applications don't need direct connectivity to observability backends. Agents can buffer data during network issues. However, agents consume additional resources and add operational complexity.

Library instrumentation embeds telemetry collection directly in application code through libraries and SDKs. OpenTelemetry provides vendor-neutral libraries for metrics, logs, and traces. Applications import libraries, configure exporters, and add instrumentation to code. This approach gives fine-grained control over what data gets collected. However, it requires code changes and library maintenance across all services.

Auto-instrumentation injects telemetry collection into applications without code changes. Ruby supports auto-instrumentation through gem dependencies that monkey-patch common frameworks. Java and .NET use byte-code manipulation. This approach reduces developer burden but provides less control over instrumentation details and may miss custom code paths.
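The monkey-patching mechanism behind Ruby auto-instrumentation can be sketched with Module#prepend: a prepended module wraps an existing method without editing its source. The MailClient class and the recording array are invented for illustration; instrumentation gems apply the same pattern to framework classes.

```ruby
# Hypothetical class standing in for a framework class an instrumentation
# gem might target.
class MailClient
  def deliver(message)
    "delivered: #{message}"
  end
end

RECORDED_CALLS = []

# Prepended module: its #deliver runs first and reaches the original via super.
module DeliverInstrumentation
  def deliver(message)
    start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    result = super
    duration = Process.clock_gettime(Process::CLOCK_MONOTONIC) - start
    RECORDED_CALLS << { method: :deliver, duration: duration }
    result
  end
end

MailClient.prepend(DeliverInstrumentation)

result = MailClient.new.deliver('hello')
```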

# Library instrumentation approach
require 'opentelemetry/sdk'
require 'opentelemetry/exporter/otlp'
require 'opentelemetry/instrumentation/all' # required for use_all

OpenTelemetry::SDK.configure do |c|
  c.service_name = 'user-service'
  c.use_all # Auto-instrument supported gems
end

class UserController < ApplicationController
  def show
    tracer = OpenTelemetry.tracer_provider.tracer('user-service')
    
    tracer.in_span('fetch_user') do |span|
      @user = User.find(params[:id])
      span.set_attribute('user.id', @user.id)
    end
    
    tracer.in_span('fetch_permissions') do
      @permissions = @user.permissions
    end
  end
end

Structured logging formats log messages as key-value pairs rather than unstructured text. JSON logs enable efficient searching and filtering. Applications output structured data that log processors can parse and index. This approach requires standardizing log formats across services.

# Structured logging implementation
require 'logger'
require 'json'

class StructuredLogger
  def initialize(output = $stdout)
    @logger = Logger.new(output)
    @logger.formatter = proc do |severity, datetime, progname, msg|
      JSON.generate(
        timestamp: datetime.iso8601,
        severity: severity,
        message: msg.is_a?(Hash) ? msg[:message] : msg,
        **extract_context(msg)
      ) + "\n"
    end
  end
  
  def info(message_or_hash)
    @logger.info(message_or_hash)
  end
  
  private
  
  def extract_context(msg)
    return {} unless msg.is_a?(Hash)
    msg.except(:message)
  end
end

logger = StructuredLogger.new
logger.info(
  message: 'User created',
  user_id: 12345,
  email: 'user@example.com',
  source_ip: '192.168.1.1'
)
# => {"timestamp":"2025-10-10T10:30:45Z","severity":"INFO","message":"User created","user_id":12345,"email":"user@example.com","source_ip":"192.168.1.1"}

Centralized vs distributed collection represents an architectural choice. Centralized systems send all telemetry to a single backend for storage and analysis. This simplifies querying across services but creates scaling bottlenecks. Distributed systems partition data across multiple backends, improving scalability but complicating cross-service queries.

Sampling strategies control data volume. Probabilistic sampling collects a fixed percentage of traces regardless of content. Adaptive sampling adjusts rates based on traffic volume. Intelligent sampling prioritizes interesting traces like errors or slow requests. Teams combine strategies, keeping all error traces while sampling successful requests.

Ruby Implementation

Ruby applications implement monitoring and observability through gems, framework integrations, and language features. The ecosystem provides tools for each observability pillar with varying levels of maturity.

Metrics collection in Ruby commonly uses the Prometheus client gem for exposing metrics or StatsD for push-based metrics. The prometheus-client gem provides metric types and an HTTP endpoint for scraping.

require 'prometheus/client'
require 'prometheus/middleware/exporter'

# Initialize registry
prometheus = Prometheus::Client.registry

# Define metrics
http_requests = prometheus.counter(
  :http_requests_total,
  docstring: 'Total HTTP requests',
  labels: [:method, :path, :status]
)

http_duration = prometheus.histogram(
  :http_request_duration_seconds,
  docstring: 'HTTP request duration',
  labels: [:method, :path],
  buckets: [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10]
)

# Middleware to track metrics
class MetricsMiddleware
  def initialize(app, requests_counter, duration_histogram)
    @app = app
    @requests = requests_counter
    @duration = duration_histogram
  end
  
  def call(env)
    start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    status, headers, body = @app.call(env)
    duration = Process.clock_gettime(Process::CLOCK_MONOTONIC) - start
    
    labels = {
      method: env['REQUEST_METHOD'],
      path: env['PATH_INFO'],
      status: status
    }
    
    @requests.increment(labels: labels.slice(:method, :path, :status))
    @duration.observe(duration, labels: labels.slice(:method, :path))
    
    [status, headers, body]
  end
end

# Rack application
use MetricsMiddleware, http_requests, http_duration
use Prometheus::Middleware::Exporter # from prometheus/middleware/exporter

Logging in Ruby applications typically uses the standard Logger class or structured logging gems like Semantic Logger or Ougai. Rails applications have built-in logging that can be configured for structured output.

require 'ougai'

class ApplicationLogger < Ougai::Logger
  include ActiveSupport::LoggerThreadSafeLevel
  include ActiveSupport::LoggerSilence
  
  def create_formatter
    if Rails.env.production?
      Ougai::Formatters::Bunyan.new
    else
      Ougai::Formatters::Readable.new
    end
  end
end

# Configure Rails logger
Rails.application.configure do
  config.logger = ApplicationLogger.new(STDOUT)
  config.log_level = :info
end

# Usage in application code
class OrdersController < ApplicationController
  def create
    Rails.logger.info(
      'Order creation started',
      user_id: current_user.id,
      items_count: params[:items].length
    )
    
    order = Order.create!(order_params)
    
    Rails.logger.info(
      'Order created successfully',
      order_id: order.id,
      total_amount: order.total,
      user_id: current_user.id
    )
    
    render json: order
  rescue StandardError => e
    Rails.logger.error(
      'Order creation failed',
      error: e.class.name,
      message: e.message,
      backtrace: e.backtrace.first(5),
      user_id: current_user.id
    )
    
    render json: { error: 'Order creation failed' }, status: :unprocessable_entity
  end
end

Distributed tracing in Ruby uses OpenTelemetry gems. The opentelemetry-sdk gem provides the core SDK, while instrumentation gems add automatic tracing for frameworks like Rails, Sidekiq, and database libraries.

# config/initializers/opentelemetry.rb
require 'opentelemetry/sdk'
require 'opentelemetry/instrumentation/all'

OpenTelemetry::SDK.configure do |c|
  c.service_name = 'rails-app'
  c.service_version = '1.0.0'
  
  # Configure exporter
  c.add_span_processor(
    OpenTelemetry::SDK::Trace::Export::BatchSpanProcessor.new(
      OpenTelemetry::Exporter::OTLP::Exporter.new(
        endpoint: ENV.fetch('OTEL_EXPORTER_OTLP_ENDPOINT', 'http://localhost:4318')
      )
    )
  )
  
  # Auto-instrument common gems
  c.use_all({
    'OpenTelemetry::Instrumentation::Rails' => { enabled: true },
    'OpenTelemetry::Instrumentation::ActiveRecord' => { enabled: true },
    'OpenTelemetry::Instrumentation::Redis' => { enabled: true },
    'OpenTelemetry::Instrumentation::Sidekiq' => { enabled: true }
  })
end

# Manual instrumentation
class PaymentService
  def process_payment(order)
    tracer = OpenTelemetry.tracer_provider.tracer('payment-service')
    
    tracer.in_span('payment.process', attributes: {
      'order.id' => order.id,
      'order.total' => order.total
    }) do |span|
      
      gateway_response = tracer.in_span('payment.gateway_call') do
        call_payment_gateway(order)
      end
      
      span.set_attribute('payment.gateway_id', gateway_response.id)
      span.set_attribute('payment.status', gateway_response.status)
      
      if gateway_response.success?
        span.status = OpenTelemetry::Trace::Status.ok
      else
        span.status = OpenTelemetry::Trace::Status.error('Payment failed')
        span.record_exception(PaymentError.new(gateway_response.error))
      end
      
      gateway_response
    end
  end
end

Background job monitoring requires special consideration since jobs run outside the request/response cycle. Sidekiq provides hooks for instrumentation.

class SidekiqInstrumentation
  def call(worker, job, queue)
    tracer = OpenTelemetry.tracer_provider.tracer('sidekiq')
    
    tracer.in_span(
      "sidekiq.job.#{worker.class.name}",
      attributes: {
        'job.id' => job['jid'],
        'job.queue' => queue,
        'job.args' => job['args'].to_json,
        'job.retry_count' => job['retry_count'] || 0
      },
      kind: :consumer
    ) do |span|
      start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
      
      begin
        yield
        span.status = OpenTelemetry::Trace::Status.ok
      rescue StandardError => e
        span.status = OpenTelemetry::Trace::Status.error(e.message)
        span.record_exception(e)
        raise
      ensure
        duration = Process.clock_gettime(Process::CLOCK_MONOTONIC) - start
        span.set_attribute('job.duration', duration)
      end
    end
  end
end

Sidekiq.configure_server do |config|
  config.server_middleware do |chain|
    chain.add SidekiqInstrumentation
  end
end

Context propagation in Ruby requires extracting trace context from incoming requests and injecting it into outgoing requests. OpenTelemetry handles this through propagators.

class ApiClient
  def initialize
    @conn = Faraday.new(url: 'https://api.example.com')
    @tracer = OpenTelemetry.tracer_provider.tracer('api-client')
  end
  
  def fetch_user(user_id)
    @tracer.in_span('api.fetch_user', attributes: { 'user.id' => user_id }) do
      headers = {}
      
      # Inject trace context into headers
      OpenTelemetry.propagation.inject(headers)
      
      response = @conn.get("/users/#{user_id}") do |req|
        headers.each { |key, value| req.headers[key] = value }
      end
      
      JSON.parse(response.body)
    end
  end
end

Error tracking integration captures exceptions with additional context. Gems like Sentry and Honeybadger integrate with tracing.

class ErrorTracker
  def self.capture_exception(exception, context = {})
    # Add trace context to error reports
    trace_id = OpenTelemetry::Trace.current_span.context.trace_id.unpack1('H*')
    span_id = OpenTelemetry::Trace.current_span.context.span_id.unpack1('H*')
    
    Sentry.capture_exception(exception, extra: context.merge(
      trace_id: trace_id,
      span_id: span_id
    ))
  end
end

Tools & Ecosystem

The observability ecosystem provides specialized tools for collecting, storing, and analyzing telemetry data. Ruby applications integrate with these tools through client libraries and exporters.

Application Performance Monitoring (APM) platforms provide end-to-end observability for Ruby applications. New Relic, Datadog, and AppSignal offer Ruby-specific agents that auto-instrument frameworks. These agents collect metrics, traces, and errors with minimal configuration.

New Relic's Ruby agent instruments Rails, Sinatra, and Grape frameworks automatically. The agent tracks transaction traces, database queries, external service calls, and background jobs. Configuration occurs through a YAML file or environment variables.

# Gemfile
gem 'newrelic_rpm'

# config/newrelic.yml
production:
  license_key: <%= ENV['NEW_RELIC_LICENSE_KEY'] %>
  app_name: <%= ENV['NEW_RELIC_APP_NAME'] %>
  distributed_tracing:
    enabled: true
  transaction_tracer:
    enabled: true
    record_sql: obfuscated
  error_collector:
    enabled: true
    ignore_errors: "ActionController::RoutingError"

# Custom instrumentation
class ReportGenerator
  include NewRelic::Agent::Instrumentation::ControllerInstrumentation
  
  def generate
    perform_action_with_newrelic_trace(
      name: 'generate_report',
      category: :task
    ) do
      # Report generation logic
    end
  end
end

Prometheus and Grafana form an open-source monitoring stack. Prometheus scrapes metrics from application endpoints and stores time-series data. Grafana visualizes metrics through dashboards. The prometheus-client gem exposes metrics for Prometheus scraping.

ELK Stack (Elasticsearch, Logstash, Kibana) handles log aggregation and analysis. Applications send logs to Logstash, which parses and forwards them to Elasticsearch for storage. Kibana provides search and visualization. Ruby applications typically log to stdout/stderr, which container orchestrators forward to log collectors.

OpenTelemetry provides vendor-neutral instrumentation. The OpenTelemetry Ruby SDK supports multiple exporters, allowing teams to send data to different backends without changing instrumentation code.

# Configure multiple exporters
require 'opentelemetry/sdk'
require 'opentelemetry/exporter/otlp'
require 'opentelemetry/exporter/jaeger'

OpenTelemetry::SDK.configure do |c|
  c.service_name = 'multi-backend-app'
  
  # OTLP exporter for production
  otlp_exporter = OpenTelemetry::Exporter::OTLP::Exporter.new(
    endpoint: ENV['OTEL_EXPORTER_OTLP_ENDPOINT']
  )
  
  # Jaeger exporter for local development
  jaeger_exporter = OpenTelemetry::Exporter::Jaeger::CollectorExporter.new(
    endpoint: 'http://localhost:14268/api/traces'
  )
  
  # Use appropriate exporter based on environment
  exporter = Rails.env.production? ? otlp_exporter : jaeger_exporter
  
  c.add_span_processor(
    OpenTelemetry::SDK::Trace::Export::BatchSpanProcessor.new(exporter)
  )
end

StatsD provides a simple protocol for metrics collection. Applications send metrics over UDP to a StatsD daemon, which aggregates and forwards them to backends like Graphite or Datadog. The statsd-ruby gem implements the basic client; Datadog's dogstatsd-ruby extends the protocol with tag support.

require 'datadog/statsd'

class MetricsClient
  def initialize
    # dogstatsd-ruby client; the plain StatsD protocol has no tag concept
    @statsd = Datadog::Statsd.new('localhost', 8125, namespace: 'myapp')
  end
  end
  
  def track_request(controller, action, duration, status)
    # Increment counter
    @statsd.increment('http.requests', tags: [
      "controller:#{controller}",
      "action:#{action}",
      "status:#{status}"
    ])
    
    # Record timing
    @statsd.timing('http.response_time', duration * 1000, tags: [
      "controller:#{controller}",
      "action:#{action}"
    ])
    
    # Track gauge
    @statsd.gauge('http.active_requests', Thread.list.count)
  end
end

Jaeger and Zipkin specialize in distributed tracing. Both implement OpenTelemetry standards. Jaeger offers better performance for high-volume tracing, while Zipkin provides simpler deployment. OpenTelemetry exporters support both systems.

Sentry and Bugsnag focus on error tracking with context. These tools capture exceptions, stack traces, and breadcrumbs showing events leading to errors. Integration with tracing links errors to specific traces.

Fluentd and Fluent Bit serve as log collectors and forwarders. These tools gather logs from multiple sources, parse structured data, and route to various destinations. Ruby applications output JSON logs that Fluentd parses without additional configuration.

InfluxDB and TimescaleDB store time-series data efficiently. These databases optimize for metrics workloads with time-based queries. Teams use them as alternatives to Prometheus for long-term metrics storage.

Honeycomb focuses on high-cardinality observability. The platform handles exploratory queries across dimensions that would overwhelm traditional metrics systems. Ruby applications use the honeycomb-beeline gem.

require 'honeycomb-beeline'

Honeycomb.configure do |config|
  config.write_key = ENV['HONEYCOMB_WRITEKEY']
  config.dataset = 'rails-app'
  config.service_name = 'web'
end

class OrdersController < ApplicationController
  def create
    Honeycomb.start_span(name: 'orders.create') do
      Honeycomb.add_field('user.id', current_user.id)
      Honeycomb.add_field('items.count', params[:items].length)
      
      order = Order.create!(order_params)
      
      Honeycomb.add_field('order.id', order.id)
      Honeycomb.add_field('order.total', order.total)
      
      render json: order
    end
  end
end

Integration & Interoperability

Integrating observability into Ruby applications requires coordination between application code, infrastructure, and analysis tools. Different integration patterns suit different architectural needs.

Rack middleware provides a standard integration point for web applications. Middleware intercepts requests and responses, adding instrumentation without modifying application code.

# Comprehensive observability middleware
class ObservabilityMiddleware
  def initialize(app, metrics:, logger:, tracer:)
    @app = app
    @metrics = metrics
    @logger = logger
    @tracer = tracer
  end
  
  def call(env)
    request_id = SecureRandom.uuid
    env['HTTP_X_REQUEST_ID'] = request_id
    
    @tracer.in_span('http.request', attributes: base_attributes(env, request_id)) do |span|
      @logger.info('Request started', request_attributes(env, request_id))
      
      start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
      status, headers, body = @app.call(env)
      duration = Process.clock_gettime(Process::CLOCK_MONOTONIC) - start
      
      record_metrics(env, status, duration)
      span.set_attribute('http.status_code', status)
      span.set_attribute('http.duration', duration)
      
      @logger.info('Request completed', 
        request_id: request_id,
        status: status,
        duration: duration
      )
      
      headers['X-Request-ID'] = request_id
      [status, headers, body]
    end
  end
  
  private
  
  def base_attributes(env, request_id)
    {
      'http.method' => env['REQUEST_METHOD'],
      'http.url' => env['PATH_INFO'],
      'http.user_agent' => env['HTTP_USER_AGENT'],
      'request.id' => request_id
    }
  end
  
  def request_attributes(env, request_id)
    {
      request_id: request_id,
      method: env['REQUEST_METHOD'],
      path: env['PATH_INFO'],
      user_agent: env['HTTP_USER_AGENT'],
      remote_ip: env['REMOTE_ADDR']
    }
  end
  
  def record_metrics(env, status, duration)
    labels = {
      method: env['REQUEST_METHOD'],
      path: normalize_path(env['PATH_INFO']),
      status: status
    }
    
    @metrics.increment('http_requests_total', labels: labels)
    @metrics.observe('http_request_duration_seconds', duration, labels: labels.except(:status))
  end
  
  def normalize_path(path)
    # Replace IDs with placeholders for lower cardinality
    path.gsub(/\/\d+/, '/:id')
  end
end

Database query instrumentation tracks slow queries and N+1 problems. ActiveRecord supports query subscribers that receive notifications for all SQL queries.

# Database query monitoring
class DatabaseQuerySubscriber
  def initialize(tracer, logger, slow_query_threshold: 0.1)
    @tracer = tracer
    @logger = logger
    @slow_query_threshold = slow_query_threshold
  end
  
  def start(name, id, payload)
    @query_start_times ||= {}
    @query_start_times[id] = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  end
  
  def finish(name, id, payload)
    return unless @query_start_times[id]
    
    duration = Process.clock_gettime(Process::CLOCK_MONOTONIC) - @query_start_times[id]
    @query_start_times.delete(id)
    
    @tracer.in_span('db.query', attributes: {
      'db.statement' => payload[:sql],
      'db.duration' => duration,
      'db.connection_id' => payload[:connection_id]
    }) do |span|
      
      if duration > @slow_query_threshold
        @logger.warn('Slow database query detected',
          sql: payload[:sql],
          duration: duration,
          connection_id: payload[:connection_id]
        )
        span.add_event('slow_query', attributes: { threshold: @slow_query_threshold })
      end
    end
  end
end

# Register subscriber
subscriber = DatabaseQuerySubscriber.new(tracer, logger)
ActiveSupport::Notifications.subscribe('sql.active_record', subscriber)

Background job integration requires linking jobs to originating requests. Jobs carry trace context through job arguments.

class TracedJob < ApplicationJob
  around_perform do |job, block|
    # Extract trace context from job arguments
    trace_context = job.arguments.last.is_a?(Hash) ? job.arguments.last.delete(:trace_context) : nil
    
    if trace_context
      # Create new span linked to parent trace
      parent_context = OpenTelemetry.propagation.extract(trace_context)
      
      OpenTelemetry::Context.with_current(parent_context) do
        tracer.in_span("job.#{job.class.name}", attributes: {
          'job.id' => job.job_id,
          'job.queue' => job.queue_name
        }) do
          block.call
        end
      end
    else
      # No parent context, create root span
      tracer.in_span("job.#{job.class.name}") do
        block.call
      end
    end
  end
  
  private
  
  def tracer
    OpenTelemetry.tracer_provider.tracer('background-jobs')
  end
end

# Enqueue with trace context
class OrdersController < ApplicationController
  def create
    order = Order.create!(order_params)
    
    # Inject current trace context
    trace_context = {}
    OpenTelemetry.propagation.inject(trace_context)
    
    OrderConfirmationJob.perform_later(order.id, trace_context: trace_context)
    
    render json: order
  end
end

External service calls propagate trace context through HTTP headers. HTTP client libraries support header injection.

class ServiceClient
  def initialize(base_url)
    @base_url = base_url
    @tracer = OpenTelemetry.tracer_provider.tracer('http-client')
  end
  
  def get(path, params: {})
    @tracer.in_span("http.get", attributes: {
      'http.url' => "#{@base_url}#{path}",
      'http.method' => 'GET'
    }) do |span|
      
      headers = { 'Content-Type' => 'application/json' }
      OpenTelemetry.propagation.inject(headers)
      
      response = HTTP.headers(headers).get("#{@base_url}#{path}", params: params)
      
      span.set_attribute('http.status_code', response.code)
      
      if response.status.success?
        JSON.parse(response.body)
      else
        span.status = OpenTelemetry::Trace::Status.error("HTTP #{response.code}")
        raise ServiceError, "Request failed with status #{response.code}"
      end
    end
  end
end

Health check endpoints expose service status for orchestrators and load balancers. These endpoints report dependency health and readiness.

class HealthController < ApplicationController
  def liveness
    # Simple check that process is running
    render json: { status: 'ok' }, status: :ok
  end
  
  def readiness
    checks = {
      database: check_database,
      redis: check_redis,
      external_api: check_external_api
    }
    
    all_healthy = checks.values.all? { |check| check[:healthy] }
    
    status_code = all_healthy ? :ok : :service_unavailable
    
    render json: {
      status: all_healthy ? 'ready' : 'not_ready',
      checks: checks
    }, status: status_code
  end
  
  private
  
  def check_database
    ActiveRecord::Base.connection.execute('SELECT 1')
    { healthy: true }
  rescue StandardError => e
    { healthy: false, error: e.message }
  end
  
  def check_redis
    # Redis.current was removed in redis-rb 5; use a dedicated connection
    # (or an app-wide client) for the probe
    Redis.new(url: ENV['REDIS_URL']).ping
    { healthy: true }
  rescue StandardError => e
    { healthy: false, error: e.message }
  end
  
  def check_external_api
    response = HTTP.timeout(2).get(ENV['EXTERNAL_API_HEALTH_URL'])
    { healthy: response.status.success? }
  rescue StandardError => e
    { healthy: false, error: e.message }
  end
end
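The same aggregation pattern works outside Rails. A minimal framework-free sketch, where the check lambdas are stand-ins for real dependency probes (database ping, cache ping, HTTP health call):

```ruby
# Aggregate named dependency checks into an overall health verdict.
# Each check either returns normally (healthy) or raises (unhealthy).
def run_checks(checks)
  results = checks.transform_values do |check|
    check.call
    { healthy: true }
  rescue StandardError => e
    { healthy: false, error: e.message }
  end

  [results.values.all? { |r| r[:healthy] }, results]
end

healthy, results = run_checks(
  database: -> { true },
  cache: -> { raise 'connection refused' }
)
# healthy is false because the cache probe raised
```

Returning per-check detail alongside the boolean lets the readiness endpoint report which dependency failed, not just that something did.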

Real-World Applications

Production Ruby applications implement observability patterns that balance insight against overhead. Teams adapt instrumentation based on traffic patterns, system architecture, and operational requirements.

High-throughput APIs face challenges instrumenting millions of requests without impacting performance. Sampling reduces overhead while maintaining visibility into issues. Adaptive sampling adjusts rates based on traffic and error rates.

class AdaptiveSampler
  def initialize(base_rate: 0.01)
    @base_rate = base_rate
    @request_count = 0
    @error_count = 0
    @last_reset = Time.now
  end
  
  def should_sample?(error: false)
    # Always sample errors
    return true if error
    
    # Reset counters periodically so the error ratio reflects recent traffic
    maybe_reset_counters
    
    rand < current_rate
  end
  
  def record_request(error: false)
    @request_count += 1
    @error_count += 1 if error
  end
  
  private
  
  def current_rate
    # Increase sampling tenfold when the recent error rate exceeds 1%
    error_ratio = @error_count.to_f / [@request_count, 1].max
    
    if error_ratio > 0.01
      [@base_rate * 10, 1.0].min
    else
      @base_rate
    end
  end
  
  # Note: counters are unsynchronized; wrap them in a Mutex under
  # multi-threaded servers
  def maybe_reset_counters
    if Time.now - @last_reset > 60
      @request_count = 0
      @error_count = 0
      @last_reset = Time.now
    end
  end
end

# Integration with tracing: OpenTelemetry's sampler interface takes keyword
# arguments and returns a Samplers::Result rather than a bare boolean
class TracingSampler
  def initialize(sampler)
    @sampler = sampler
  end
  
  def should_sample?(trace_id:, parent_context:, links:, name:, kind:, attributes:)
    error = attributes && attributes['error'] == true
    decision = if @sampler.should_sample?(error: error)
                 OpenTelemetry::SDK::Trace::Samplers::Decision::RECORD_AND_SAMPLE
               else
                 OpenTelemetry::SDK::Trace::Samplers::Decision::DROP
               end
    
    OpenTelemetry::SDK::Trace::Samplers::Result.new(
      decision: decision,
      tracestate: OpenTelemetry::Trace::Tracestate::DEFAULT
    )
  end
end

Microservices architectures require coordinated instrumentation across services. Consistent service naming, attribute keys, and context propagation enable cross-service analysis.

# Shared observability configuration
module ObservabilityConfig
  # Resource attributes require string keys; these names follow the
  # OpenTelemetry resource semantic conventions
  STANDARD_ATTRIBUTES = {
    'deployment.environment' => ENV['RAILS_ENV'],
    'service.version' => ENV['APP_VERSION'],
    'k8s.pod.name' => ENV['HOSTNAME'],
    'k8s.namespace.name' => ENV['K8S_NAMESPACE']
  }.freeze
  
  def self.configure_telemetry(service_name)
    OpenTelemetry::SDK.configure do |c|
      c.service_name = service_name
      
      # Add standard resource attributes
      c.resource = OpenTelemetry::SDK::Resources::Resource.create(
        STANDARD_ATTRIBUTES.merge(
          'service.name' => service_name,
          'service.namespace' => 'production'
        )
      )
      
      # Configure exporters
      c.add_span_processor(
        OpenTelemetry::SDK::Trace::Export::BatchSpanProcessor.new(
          OpenTelemetry::Exporter::OTLP::Exporter.new
        )
      )
      
      # Use consistent instrumentation
      c.use_all
    end
  end
end

# Each service uses consistent configuration
ObservabilityConfig.configure_telemetry('user-service')

Serverless functions in AWS Lambda face constraints around persistent connections and startup time. Instrumentation minimizes cold start impact.

# Lambda-optimized observability
require 'opentelemetry/sdk'
require 'logger'
require 'json'
require 'time'

class LambdaHandler
  def initialize
    # Initialize once during cold start
    @tracer = OpenTelemetry.tracer_provider.tracer('lambda-function')
    @logger = Logger.new($stdout)
    @logger.formatter = proc { |severity, datetime, progname, msg|
      JSON.generate(timestamp: datetime.iso8601, severity: severity, message: msg) + "\n"
    }
  end
  
  def handle_event(event:, context:)
    # Extract request ID from Lambda context
    request_id = context.aws_request_id
    
    @tracer.in_span('lambda.invocation', attributes: {
      'faas.execution' => request_id,
      'faas.id' => context.invoked_function_arn
    }) do |span|
      
      @logger.info(
        message: 'Function invoked',
        request_id: request_id,
        function_name: context.function_name,
        remaining_time_ms: context.get_remaining_time_in_millis
      )
      
      result = process_event(event)
      
      span.set_attribute('event.type', event['type'])
      
      @logger.info(
        message: 'Function completed',
        request_id: request_id,
        result_count: result.length
      )
      
      result
    end
  rescue StandardError => e
    @logger.error(
      message: 'Function failed',
      request_id: request_id,
      error: e.class.name,
      error_message: e.message
    )
    raise
  end
  
  private
  
  def process_event(event)
    # Business logic
  end
end

# Lambda handler
HANDLER = LambdaHandler.new

def lambda_handler(event:, context:)
  HANDLER.handle_event(event: event, context: context)
end

Multi-tenant applications track metrics and traces per tenant without creating cardinality explosions. Tenant IDs appear in span attributes rather than metric labels.

class TenantTracking
  def self.with_tenant_context(tenant_id)
    OpenTelemetry::Trace.current_span.set_attribute('tenant.id', tenant_id)
    
    # Add tenant to log context
    Thread.current[:tenant_id] = tenant_id
    
    yield
  ensure
    Thread.current[:tenant_id] = nil
  end
end

# Middleware to extract tenant
class TenantMiddleware
  def initialize(app)
    @app = app
  end
  
  def call(env)
    tenant_id = extract_tenant(env)
    
    TenantTracking.with_tenant_context(tenant_id) do
      @app.call(env)
    end
  end
  
  private
  
  def extract_tenant(env)
    # Extract from subdomain, header, or token
    subdomain = env['HTTP_HOST'].to_s.split('.').first
    Tenant.find_by(subdomain: subdomain)&.id
  end
end
end
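The thread-local tenant set above can flow into structured logs as well. A sketch using only the standard library, where the formatter shape is illustrative:

```ruby
require 'logger'
require 'json'
require 'time'

# A JSON log formatter that reads the tenant from thread-local context,
# matching the Thread.current[:tenant_id] convention used above
tenant_formatter = proc do |severity, time, _progname, msg|
  JSON.generate(
    timestamp: time.utc.iso8601,
    severity: severity,
    tenant_id: Thread.current[:tenant_id],
    message: msg
  ) + "\n"
end

logger = Logger.new($stdout, formatter: tenant_formatter)

Thread.current[:tenant_id] = 'acme'
logger.info('Report generated')  # emitted line includes "tenant_id":"acme"
```

Because the formatter reads the thread-local at log time, every line emitted inside `with_tenant_context` carries the tenant without each call site passing it explicitly.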

Deployment monitoring tracks service health during rollouts. Metrics compare new versions against baselines to detect regressions.

class DeploymentMonitor
  def initialize
    @baseline_latency = fetch_baseline_latency
  end
  
  def check_deployment_health
    current_latency = calculate_current_latency
    error_rate = calculate_error_rate
    
    {
      latency_regression: current_latency > @baseline_latency * 1.5,
      error_rate_high: error_rate > 0.01,
      current_latency: current_latency,
      baseline_latency: @baseline_latency,
      error_rate: error_rate
    }
  end
  
  private
  
  def fetch_baseline_latency
    # Query metrics backend for P95 latency from previous version
    0.3 # Example baseline
  end
  
  def calculate_current_latency
    # Query recent P95 latency
    0.25
  end
  
  def calculate_error_rate
    # Query recent error rate
    0.005
  end
end
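A report like the one above can gate the rollout. A sketch with the same illustrative thresholds used in check_deployment_health (1.5x baseline latency, 1% error rate); the helper name is hypothetical:

```ruby
# Decide whether a rollout proceeds, mirroring the thresholds in
# check_deployment_health above (1.5x baseline latency, 1% error rate)
def proceed_with_rollout?(current_latency:, baseline_latency:, error_rate:)
  current_latency <= baseline_latency * 1.5 && error_rate <= 0.01
end

proceed_with_rollout?(current_latency: 0.25, baseline_latency: 0.3, error_rate: 0.005)
# => true: 0.25s is under the 0.45s ceiling and errors are within budget

proceed_with_rollout?(current_latency: 0.50, baseline_latency: 0.3, error_rate: 0.005)
# => false: latency regressed beyond 1.5x baseline
```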

Reference

Metric Types

| Type | Description | Use Case | Aggregation |
|---|---|---|---|
| Counter | Cumulative value that only increases | Request counts, error tallies | Rate, sum |
| Gauge | Point-in-time value that can increase or decrease | Memory usage, queue depth, active connections | Current value, average |
| Histogram | Distribution of values in configurable buckets | Response times, request sizes | Percentiles, averages |
| Summary | Pre-calculated percentiles | Client-side percentile calculation | Quantiles |
| Timer | Duration measurement | Operation execution time | Percentiles, rates |
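The first three types can be illustrated with a toy implementation; real client libraries add labels, thread safety, and an exposition format:

```ruby
# Toy versions of the three core metric types (not a client library)
class Counter
  attr_reader :value

  def initialize
    @value = 0
  end

  # Counters are cumulative and only increase
  def increment(by = 1)
    @value += by
  end
end

class Gauge
  attr_reader :value

  def initialize
    @value = 0
  end

  # Gauges move in both directions
  def set(value)
    @value = value
  end
end

class Histogram
  attr_reader :bucket_counts

  def initialize(buckets)
    @buckets = buckets.sort
    @bucket_counts = Hash.new(0)
  end

  # Each observation lands in the first bucket whose upper bound covers it;
  # values above every bound go to an implicit +Inf bucket
  def observe(value)
    bucket = @buckets.find { |upper| value <= upper } || Float::INFINITY
    @bucket_counts[bucket] += 1
  end
end

latency = Histogram.new([0.1, 0.5, 1.0])
latency.observe(0.3)   # counted in the 0.5 bucket
latency.observe(0.05)  # counted in the 0.1 bucket
```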

Standard Span Attributes

| Attribute | Type | Description | Example |
|---|---|---|---|
| http.method | string | HTTP request method | GET, POST |
| http.url | string | Full request URL | https://api.example.com/users |
| http.status_code | integer | HTTP response status | 200, 404, 500 |
| db.system | string | Database type | postgresql, redis |
| db.statement | string | Database query | SELECT * FROM users WHERE id = 1 |
| error | boolean | Whether span represents error | true, false |
| messaging.system | string | Message queue system | rabbitmq, kafka |
| peer.service | string | Remote service name | payment-service |

Log Levels

| Level | Severity | Use Case | Production Volume |
|---|---|---|---|
| DEBUG | 7 | Development debugging, detailed state | Disabled |
| INFO | 6 | Normal operations, business events | Moderate |
| WARN | 4 | Recoverable errors, deprecated usage | Low |
| ERROR | 3 | Error conditions requiring attention | Very low |
| FATAL | 2 | Critical failures requiring immediate action | Extremely low |
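The severities above follow syslog numbering. Ruby's standard Logger uses its own constants (DEBUG=0 through FATAL=4), so a mapping is needed when emitting syslog-style severity fields; a sketch:

```ruby
require 'logger'

# Map Ruby Logger constants (DEBUG=0 .. FATAL=4) to the syslog severities
# used in the table above
SYSLOG_SEVERITY = {
  Logger::DEBUG => 7,
  Logger::INFO  => 6,
  Logger::WARN  => 4,
  Logger::ERROR => 3,
  Logger::FATAL => 2
}.freeze

logger = Logger.new($stdout, level: Logger::INFO)
logger.debug('dropped: below the INFO threshold')
logger.info('emitted')
```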

Common SLO Targets

| Service Type | Availability | Latency P95 | Latency P99 |
|---|---|---|---|
| Internal API | 99.9% | 100ms | 250ms |
| Public API | 99.95% | 200ms | 500ms |
| Background Job | 99% | 5s | 30s |
| Real-time Service | 99.99% | 50ms | 100ms |
| Batch Processing | 99% | N/A | N/A |

Sampling Decision Matrix

| Traffic Volume | Error Rate | Trace Value | Sample Rate |
|---|---|---|---|
| Low (< 100 req/s) | Any | Any | 100% |
| Medium (100-1000 req/s) | < 1% | Normal | 10% |
| Medium (100-1000 req/s) | > 1% | Normal | 50% |
| High (> 1000 req/s) | < 1% | Normal | 1% |
| High (> 1000 req/s) | > 1% | Normal | 10% |
| Any | Any | Error/Slow | 100% |
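The matrix translates directly into a lookup; a sketch whose thresholds and rate tiers come straight from the rows above:

```ruby
# The sampling decision matrix as a function of traffic and error ratio
def sample_rate(requests_per_sec:, error_ratio:, error_or_slow: false)
  return 1.0 if error_or_slow            # errors and slow traces: keep everything
  return 1.0 if requests_per_sec < 100   # low traffic: sample everything

  if requests_per_sec <= 1000            # medium traffic
    error_ratio > 0.01 ? 0.5 : 0.1
  else                                   # high traffic
    error_ratio > 0.01 ? 0.1 : 0.01
  end
end

sample_rate(requests_per_sec: 500, error_ratio: 0.02)   # => 0.5
sample_rate(requests_per_sec: 5000, error_ratio: 0.001) # => 0.01
```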

OpenTelemetry Exporters

| Exporter | Protocol | Backend | Use Case |
|---|---|---|---|
| OTLP | gRPC/HTTP | OpenTelemetry Collector | Production standard |
| Jaeger | Thrift/gRPC | Jaeger | Development/testing |
| Zipkin | HTTP | Zipkin | Legacy systems |
| Prometheus | HTTP | Prometheus | Metrics only |
| Logging | Stdout | Any log collector | Debugging |

Data Retention Strategies

| Data Type | Hot Storage | Warm Storage | Cold Storage | Archive |
|---|---|---|---|---|
| Traces | 7 days full | 30 days sampled | N/A | N/A |
| Logs | 7 days | 30 days | 90 days | 1 year |
| Metrics (raw) | 30 days | N/A | N/A | N/A |
| Metrics (aggregated) | 1 year | 3 years | N/A | Forever |

Ruby Observability Gems

| Gem | Purpose | Backend | Auto-instrumentation |
|---|---|---|---|
| opentelemetry-sdk | Traces, metrics | Any OTLP | Yes |
| prometheus-client | Metrics | Prometheus | No |
| newrelic_rpm | APM | New Relic | Yes |
| ddtrace | APM | Datadog | Yes |
| sentry-ruby | Errors | Sentry | Yes |
| statsd-ruby | Metrics | StatsD backends | No |
| semantic_logger | Structured logging | Any | No |
| ougai | JSON logging | Any | No |

Context Propagation Headers

| Header | Standard | Format | Example |
|---|---|---|---|
| traceparent | W3C Trace Context | version-traceid-spanid-flags | 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01 |
| tracestate | W3C Trace Context | key=value pairs | vendor1=value1,vendor2=value2 |
| X-B3-TraceId | Zipkin B3 | 128-bit trace ID | 463ac35c9f6413ad48485a3953bb6124 |
| X-B3-SpanId | Zipkin B3 | 64-bit span ID | a2fb4a1d1a96d312 |
| X-Request-ID | Custom | UUID | 550e8400-e29b-41d4-a716-446655440000 |
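A traceparent value decomposes into the four fields shown. A dependency-free parsing sketch; the helper name is illustrative, and real propagators handle more validation:

```ruby
# Split a W3C traceparent header into its fields.
# Format: version(2 hex)-trace_id(32 hex)-span_id(16 hex)-flags(2 hex)
def parse_traceparent(header)
  version, trace_id, span_id, flags = header.to_s.split('-')
  return nil unless version&.length == 2 &&
                    trace_id&.length == 32 &&
                    span_id&.length == 16 &&
                    flags&.length == 2

  {
    version: version,
    trace_id: trace_id,
    parent_span_id: span_id,
    sampled: (flags.to_i(16) & 0x01) == 1  # low bit of flags = sampled
  }
end

parsed = parse_traceparent('00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01')
# parsed[:sampled] => true
```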