CrackedRuby

Overview

Performance monitoring involves collecting, analyzing, and acting on metrics that describe how software systems behave under real-world conditions. The practice emerged from the need to understand production systems beyond simple uptime checks, providing visibility into response times, throughput, resource utilization, and error rates.

Modern performance monitoring extends beyond server metrics to include application-level instrumentation, distributed tracing, and user experience tracking. The shift from monolithic to distributed architectures made comprehensive monitoring essential, as failures and performance degradation can occur across multiple services and infrastructure layers.

Ruby applications require specific monitoring approaches due to the language's interpreted nature, garbage collection behavior, and typical deployment patterns with application servers like Puma or Unicorn. Performance characteristics differ significantly between MRI Ruby, JRuby, and TruffleRuby, requiring monitoring strategies adapted to each runtime.

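# Basic performance measurement
# A monotonic clock is unaffected by system clock adjustments
start_time = Process.clock_gettime(Process::CLOCK_MONOTONIC)
result = expensive_operation
duration = Process.clock_gettime(Process::CLOCK_MONOTONIC) - start_time

logger.info("Operation completed in #{duration.round(3)}s")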

The monitoring landscape includes application performance monitoring (APM) tools, custom instrumentation, log-based analysis, and real user monitoring (RUM). Each approach provides different insights, and production systems typically combine multiple methods to achieve comprehensive observability.

Key Principles

Performance monitoring operates on several fundamental principles that guide effective implementation and metric interpretation.

Metrics Collection forms the foundation of monitoring systems. Metrics fall into four primary categories: counters track cumulative values that only increase (requests served, errors encountered), gauges measure values that fluctuate (memory usage, queue depth), histograms capture distributions of values (response time percentiles), and timers specifically measure duration. Each metric type serves different analytical purposes and requires different storage and aggregation strategies.
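
The semantics of these metric types can be sketched with a few in-memory classes. This is a minimal illustration that assumes no particular metrics backend; the class names `Counter`, `Gauge`, and `Histogram` and the `time` helper are hypothetical and are not taken from any library.

```ruby
# Hypothetical in-memory metric types; real clients ship values to a backend.
class Counter
  attr_reader :value

  def initialize
    @value = 0
  end

  # Counters only ever go up
  def increment(by = 1)
    raise ArgumentError, 'counters cannot decrease' if by.negative?

    @value += by
  end
end

class Gauge
  attr_reader :value

  def initialize
    @value = 0
  end

  # Gauges are overwritten with the latest reading
  def set(value)
    @value = value
  end
end

class Histogram
  attr_reader :counts

  def initialize(buckets)
    @buckets = buckets.sort
    @counts = Hash.new(0)
  end

  # Each observation lands in the first bucket whose upper bound covers it
  def observe(value)
    bound = @buckets.find { |upper| value <= upper } || Float::INFINITY
    @counts[bound] += 1
  end

  # A timer is a histogram fed with elapsed durations
  def time
    start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    result = yield
    observe(Process.clock_gettime(Process::CLOCK_MONOTONIC) - start)
    result
  end
end

requests = Counter.new
3.times { requests.increment }

queue_depth = Gauge.new
queue_depth.set(42)
queue_depth.set(17)

latency = Histogram.new([0.1, 0.5, 1.0])
[0.05, 0.3, 0.3, 2.0].each { |seconds| latency.observe(seconds) }
timed_result = latency.time { :done }
```

The counter keeps its running total, the gauge keeps only the last reading, and the histogram keeps per-bucket counts rather than raw values, which is why each type needs a different storage and aggregation strategy.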

Instrumentation refers to the code added to applications to emit metrics. Instrumentation can occur at multiple levels: automatic instrumentation through middleware or framework hooks requires minimal code changes but provides less control, manual instrumentation offers precise control over what gets measured but increases maintenance burden, and sampling-based profiling reduces overhead by collecting data intermittently rather than continuously.

# Manual instrumentation example
class OrderProcessor
  def process(order)
    start_time = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    
    begin
      result = perform_processing(order)
      record_success_metric
      result
    rescue => error
      record_error_metric(error.class.name)
      raise
    ensure
      duration = Process.clock_gettime(Process::CLOCK_MONOTONIC) - start_time
      record_duration_metric(duration)
    end
  end
end

Observability distinguishes modern monitoring from traditional approaches. Observable systems emit enough data to answer questions about internal state without requiring code deployments to add new instrumentation. The three pillars of observability—metrics, logs, and traces—work together to provide complete system visibility. Metrics identify when problems occur, logs explain what happened, and traces show how requests flow through distributed systems.

Overhead Management remains critical because monitoring itself consumes resources. Every metric collected, log line written, and trace captured adds CPU cycles, memory allocation, and network traffic. Production monitoring must balance data richness against performance impact. Techniques include sampling (measuring only a percentage of requests), aggregation (computing statistics locally before transmission), and buffering (batching data before sending).
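
The three techniques can be combined in one small client. The sketch below is an assumption-laden illustration: `BufferedMetrics`, its parameters, and the `flushed` array (which stands in for a network transport) are all hypothetical names.

```ruby
# Hypothetical client illustrating sampling, local aggregation, and buffering.
class BufferedMetrics
  attr_reader :flushed

  def initialize(sample_rate: 1.0, flush_size: 100, random: Random.new)
    @sample_rate = sample_rate
    @flush_size = flush_size
    @random = random
    @counters = Hash.new(0) # aggregated locally before sending
    @pending = 0
    @flushed = []           # stands in for the network transport
  end

  # Sampling: record only a fraction of events, scaling each recorded
  # event up so that totals remain unbiased estimates.
  def increment(name)
    return unless @random.rand < @sample_rate

    @counters[name] += 1.0 / @sample_rate
    @pending += 1
    flush if @pending >= @flush_size
  end

  # Buffering: send one batch instead of one packet per event
  def flush
    return if @counters.empty?

    @flushed << @counters.dup
    @counters.clear
    @pending = 0
  end
end

metrics = BufferedMetrics.new(sample_rate: 1.0, flush_size: 3)
7.times { metrics.increment('requests') }
metrics.flush

silent = BufferedMetrics.new(sample_rate: 0.0)
silent.increment('requests')
silent.flush
```

Seven events leave the process as three batches instead of seven packets. Lowering `sample_rate` trades accuracy for lower overhead, since each recorded event then stands in for several unrecorded ones.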

Baseline Establishment enables meaningful interpretation of metrics. Systems must establish normal operating ranges for metrics before identifying anomalies. Baselines account for daily, weekly, and seasonal patterns—traffic differs between business hours and nights, weekdays and weekends. Statistical methods like moving averages, standard deviations, and percentile tracking help distinguish signal from noise.
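
One simple way to express a baseline is a rolling window with a mean and standard deviation. The following is a minimal sketch, not a production anomaly detector: the `Baseline` class, its window size, and the three-deviation threshold are assumptions chosen for illustration, and it ignores the daily and weekly seasonality described above.

```ruby
# Hypothetical rolling baseline: flags values far from the recent mean.
class Baseline
  def initialize(window_size: 60, threshold: 3.0)
    @window_size = window_size
    @threshold = threshold # allowed distance in standard deviations
    @values = []
  end

  def record(value)
    @values << value
    @values.shift if @values.size > @window_size
  end

  def mean
    @values.sum.to_f / @values.size
  end

  def stddev
    m = mean
    Math.sqrt(@values.sum { |v| (v - m)**2 } / @values.size)
  end

  # An anomaly is a value more than `threshold` deviations from the mean.
  # With too little history or no variance, nothing is flagged.
  def anomaly?(value)
    return false if @values.size < 2

    sd = stddev
    return false if sd.zero?

    ((value - mean).abs / sd) > @threshold
  end
end

baseline = Baseline.new(window_size: 10, threshold: 3.0)
[100, 102, 98, 101, 99, 100, 103, 97, 100, 100].each { |ms| baseline.record(ms) }

normal = baseline.anomaly?(101)
spike = baseline.anomaly?(150)
```

With response times hovering around 100 ms, a 101 ms reading falls within normal variation while a 150 ms reading is flagged.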

Alerting Principles determine when humans need notification about system behavior. Effective alerts signal actionable problems rather than every metric fluctuation. Alert fatigue occurs when too many non-critical alerts train operators to ignore notifications. Good alerting focuses on symptoms affecting users (high error rates, slow responses) rather than underlying causes (high CPU, memory pressure) because symptoms directly impact user experience while causes may not.
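
A common way to avoid alerting on every fluctuation is to require a symptom to persist across several consecutive checks. The sketch below assumes a hypothetical `AlertRule` class and an error-rate threshold of 1%; neither comes from a specific alerting product.

```ruby
# Hypothetical alert rule: fires only when a symptom persists.
class AlertRule
  attr_reader :name

  def initialize(name:, threshold:, sustained_for:)
    @name = name
    @threshold = threshold
    @sustained_for = sustained_for # consecutive breaching checks required
    @breaches = 0
  end

  # Returns true when the alert should notify someone
  def check(value)
    @breaches = value > @threshold ? @breaches + 1 : 0
    @breaches >= @sustained_for
  end
end

error_rate = AlertRule.new(name: 'error_rate', threshold: 0.01, sustained_for: 3)
readings = [0.002, 0.05, 0.003, 0.02, 0.03, 0.04]
fired = readings.map { |rate| error_rate.check(rate) }
```

The single spike to 5% does not notify anyone because the next reading recovers; only the final run of three breaching readings fires the alert.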

Ruby Implementation

Ruby provides multiple approaches to implementing performance monitoring, from standard library profiling to third-party instrumentation frameworks.

The Benchmark module in the standard library offers simple timing measurements:

require 'benchmark'

result = Benchmark.measure do
  10_000.times { expensive_calculation }
end

puts "User CPU time: #{result.utime}"
puts "System CPU time: #{result.stime}"
puts "Elapsed real time: #{result.real}"

For more detailed profiling, ruby-prof provides comprehensive runtime analysis:

require 'ruby-prof'

RubyProf.start

# Code to profile
process_large_dataset(data)

result = RubyProf.stop

# Generate different report types
printer = RubyProf::FlatPrinter.new(result)
printer.print(STDOUT, min_percent: 2)

# Call stack report
printer = RubyProf::CallStackPrinter.new(result)
File.open('callstack.html', 'w') { |f| printer.print(f) }

ActiveSupport::Notifications provides instrumentation hooks for Rails applications and can be used in any Ruby application:

require 'active_support/notifications'

# Subscribe to events
ActiveSupport::Notifications.subscribe('process.order') do |name, start, finish, id, payload|
  duration = finish - start
  
  # Send to monitoring service
  MetricsClient.timing('order.processing.duration', duration)
  MetricsClient.increment('order.processing.count')
  
  if payload[:error]
    MetricsClient.increment('order.processing.errors')
  end
end

# Instrument code
def process_order(order)
  ActiveSupport::Notifications.instrument('process.order', order_id: order.id) do |payload|
    begin
      result = perform_processing(order)
      payload[:success] = true
      result
    rescue => error
      payload[:error] = error
      raise
    end
  end
end

Rack middleware enables HTTP-level monitoring for web applications:

class PerformanceMonitoring
  def initialize(app)
    @app = app
  end
  
  def call(env)
    request_start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    
    status, headers, body = @app.call(env)
    
    request_duration = Process.clock_gettime(Process::CLOCK_MONOTONIC) - request_start
    
    # Record metrics
    record_request_metrics(
      path: env['PATH_INFO'],
      method: env['REQUEST_METHOD'],
      status: status,
      duration: request_duration
    )
    
    [status, headers, body]
  rescue => error
    record_error_metrics(error)
    raise
  end
  
  private
  
  def record_request_metrics(path:, method:, status:, duration:)
    tags = {
      path: normalize_path(path),
      method: method,
      status_range: "#{status / 100}xx"
    }
    
    MetricsClient.timing('http.request.duration', duration, tags: tags)
    MetricsClient.increment('http.request.count', tags: tags)
  end
  
  def normalize_path(path)
    # Convert /users/123 to /users/:id
    path.gsub(/\/\d+/, '/:id')
  end
end

Memory profiling requires specialized tools. The memory_profiler gem tracks allocations:

require 'memory_profiler'

report = MemoryProfiler.report(top: 20) do
  process_large_file('data.csv')
end

# Show top allocations by class
report.pretty_print(scale_bytes: true)

# Identify retained strings: each entry pairs a string with its
# [location, count] allocation sites
report.strings_retained.first(10).each do |string, locations|
  count = locations.sum { |_location, n| n }
  puts "#{count} retained: #{string.inspect}"
end

GC statistics provide insights into garbage collection performance:

GC.stat.each do |key, value|
  MetricsClient.gauge("ruby.gc.#{key}", value)
end

# Monitor GC between operations
before_gc = GC.stat(:total_allocated_objects)

perform_operation

after_gc = GC.stat(:total_allocated_objects)
allocations = after_gc - before_gc

MetricsClient.gauge('operation.allocations', allocations)
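
Time spent in garbage collection relative to total time is another useful signal. The sketch below estimates it for a single block using `GC::Profiler` from MRI; the `gc_time_ratio` helper is a hypothetical name, the figure is approximate, and other Ruby runtimes may not provide this profiler.

```ruby
# Sketch: estimate the share of wall-clock time spent in GC for a block.
# Assumes MRI, where GC::Profiler is available.
def gc_time_ratio
  GC::Profiler.enable
  GC::Profiler.clear
  start = Process.clock_gettime(Process::CLOCK_MONOTONIC)

  yield

  elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - start
  gc_seconds = GC::Profiler.total_time # seconds spent in GC since clear
  elapsed.positive? ? gc_seconds / elapsed : 0.0
ensure
  GC::Profiler.disable
end

ratio = gc_time_ratio do
  # Allocate many short-lived strings to give the collector some work
  200_000.times.map { |i| "object #{i}" }
end
```

Enabling the profiler adds overhead of its own, so a helper like this suits targeted investigation better than continuous collection.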

Database query monitoring through ActiveRecord:

ActiveSupport::Notifications.subscribe('sql.active_record') do |name, start, finish, id, payload|
  duration = (finish - start) * 1000 # Convert to milliseconds
  
  unless payload[:name] == 'SCHEMA'
    MetricsClient.timing('database.query.duration', duration, tags: {
      operation: payload[:sql].split.first,
      connection: payload[:connection_id]
    })
    
    # Alert on slow queries
    if duration > 1000
      logger.warn("Slow query (#{duration}ms): #{payload[:sql]}")
    end
  end
end

Implementation Approaches

Performance monitoring strategies range from comprehensive APM solutions to custom instrumentation tailored to specific application needs.

Application Performance Monitoring (APM) provides end-to-end visibility through vendor-provided agents that automatically instrument applications. APM tools like New Relic, Datadog, or AppSignal install as gems that hook into Ruby frameworks, collecting metrics, traces, and errors without extensive manual instrumentation. This approach offers rapid implementation and broad coverage but introduces vendor dependencies and monthly costs. APM agents add overhead (typically 1-5% performance impact) and may not capture application-specific business metrics without custom instrumentation.

APM implementation follows a standard pattern:

# Gemfile
gem 'newrelic_rpm'

# config/newrelic.yml configuration
# Automatic instrumentation of Rails, Sidekiq, databases

# Custom instrumentation when needed
class BusinessMetrics
  include ::NewRelic::Agent::MethodTracer
  
  def critical_operation
    # Operation code
  end
  
  # Wrap the method in a custom-named tracer
  add_method_tracer :critical_operation, 'Custom/critical_operation'
end

Custom Metrics Systems build monitoring infrastructure using time-series databases like Prometheus, InfluxDB, or Graphite. Applications emit metrics through client libraries, gaining full control over data collection, storage, and visualization. This approach eliminates vendor lock-in and recurring costs but requires infrastructure management and custom instrumentation throughout the codebase.

# Custom metrics with Prometheus
require 'prometheus/client'

prometheus = Prometheus::Client.registry

# Constants (unlike top-level locals) are visible inside method bodies
HTTP_REQUESTS = prometheus.counter(
  :http_requests_total,
  docstring: 'Total HTTP requests',
  labels: [:method, :path, :status]
)

HTTP_DURATION = prometheus.histogram(
  :http_request_duration_seconds,
  docstring: 'HTTP request duration',
  labels: [:method, :path],
  buckets: [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5]
)

# Instrumentation (a method of a Rack middleware that sets @app and
# defines normalize_path, as in the earlier middleware example)
def handle_request(env)
  start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  
  status, headers, body = @app.call(env)
  
  duration = Process.clock_gettime(Process::CLOCK_MONOTONIC) - start
  
  HTTP_REQUESTS.increment(
    labels: {
      method: env['REQUEST_METHOD'],
      path: normalize_path(env['PATH_INFO']),
      status: status
    }
  )
  
  HTTP_DURATION.observe(
    duration,
    labels: {
      method: env['REQUEST_METHOD'],
      path: normalize_path(env['PATH_INFO'])
    }
  )
  
  [status, headers, body]
end

Log-Based Monitoring extracts performance data from structured logs, requiring minimal runtime overhead since applications already generate logs. Log aggregation systems like ELK (Elasticsearch, Logstash, Kibana) or Splunk parse logs to extract metrics and generate visualizations. This approach works well for batch processing and systems where structured logging already exists, but provides less real-time visibility than dedicated metrics systems.

require 'json'
require 'logger'
require 'time' # Time#iso8601

class StructuredLogger
  def initialize(logger = Logger.new($stdout))
    @logger = logger
  end
  
  def log_request(method:, path:, status:, duration:, error: nil)
    entry = {
      timestamp: Time.now.iso8601,
      type: 'http_request',
      method: method,
      path: path,
      status: status,
      duration_ms: (duration * 1000).round(2),
      error: error&.class&.name
    }
    
    # Structured JSON logging
    @logger.info(entry.to_json)
  end
end
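
On the consuming side, metrics are derived by parsing those log lines. The sketch below shows the idea in plain Ruby rather than in a log aggregation system; `summarize_requests` is a hypothetical helper, the sample lines assume the bare JSON payloads (without any logger prefix), and treating status 500 and above as errors is an assumption.

```ruby
require 'json'

# Sketch: derive request metrics from JSON log lines shaped like the
# structured entries described above. Lines that are not valid JSON
# are skipped.
def summarize_requests(lines)
  entries = lines.filter_map do |line|
    JSON.parse(line)
  rescue JSON::ParserError
    nil
  end

  requests = entries.select { |e| e.is_a?(Hash) && e['type'] == 'http_request' }
  return { count: 0, error_rate: 0.0, avg_duration_ms: 0.0 } if requests.empty?

  errors = requests.count { |e| e['status'].to_i >= 500 }
  total_ms = requests.sum { |e| e['duration_ms'].to_f }

  {
    count: requests.size,
    error_rate: errors.to_f / requests.size,
    avg_duration_ms: total_ms / requests.size
  }
end

log_lines = [
  '{"type":"http_request","status":200,"duration_ms":120.0}',
  '{"type":"http_request","status":500,"duration_ms":480.0}',
  'plain text line that is not JSON',
  '{"type":"http_request","status":200,"duration_ms":60.0}',
  '{"type":"job","status":"ok"}'
]
summary = summarize_requests(log_lines)
```

Because the metrics only exist after the logs have been shipped and parsed, this style of monitoring lags behind metrics emitted directly by the application.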

Sampling-Based Profiling reduces overhead by collecting detailed performance data for only a subset of requests. The rack-mini-profiler gem exemplifies this approach, profiling requests when triggered by specific conditions rather than continuously monitoring all traffic.

# Profile only slow requests
class SelectiveProfiler
  def initialize(app, threshold: 1.0)
    @app = app
    @threshold = threshold
  end
  
  def call(env)
    start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    
    status, headers, body = @app.call(env)
    
    duration = Process.clock_gettime(Process::CLOCK_MONOTONIC) - start
    
    if duration > @threshold
      # Detailed profiling for slow requests
      profile_slow_request(env, duration)
    end
    
    [status, headers, body]
  end
end

Distributed Tracing tracks requests across microservices, correlating activity through unique trace IDs. OpenTelemetry provides standardized tracing instrumentation that works across languages and vendors.

require 'opentelemetry/sdk'
require 'opentelemetry/instrumentation/rails'
require 'opentelemetry/instrumentation/sidekiq'

OpenTelemetry::SDK.configure do |c|
  c.service_name = 'order-service'
  c.use 'OpenTelemetry::Instrumentation::Rails'
  c.use 'OpenTelemetry::Instrumentation::Sidekiq'
end

# A constant, because top-level locals are not visible inside method bodies
TRACER = OpenTelemetry.tracer_provider.tracer('order-service')

def process_order(order)
  TRACER.in_span('order.process', attributes: { 'order.id' => order.id }) do |span|
    # Trace propagates to downstream services
    result = call_payment_service(order)
    span.set_attribute('order.total', result.total)
    result
  end
end

Tools & Ecosystem

The Ruby monitoring ecosystem includes both language-specific tools and polyglot platforms that support Ruby through dedicated agents or libraries.

APM Platforms provide comprehensive monitoring through automatic instrumentation and cloud-hosted analytics.

New Relic offers deep Ruby support with automatic Rails instrumentation, background job monitoring, and custom metric APIs. The newrelic_rpm gem hooks into common frameworks and libraries without code changes. New Relic provides transaction tracing showing time spent in database queries, external HTTP calls, and application code, plus distributed tracing for microservices and error tracking with stack traces and request context.

Datadog combines APM with infrastructure monitoring, correlating application performance with server metrics, container orchestration, and log aggregation. The ddtrace gem instruments Ruby applications while Datadog agents collect system-level metrics. Datadog supports custom metrics through DogStatsD and provides advanced querying and alerting capabilities.

AppSignal focuses specifically on Ruby and Elixir applications, offering low-overhead monitoring and automatically generated ("magic") dashboards based on detected frameworks. The appsignal gem provides automatic instrumentation with customizable sampling rates. AppSignal includes performance monitoring, error tracking, and host metrics in a unified interface.

Skylight specializes in Rails performance monitoring with a focus on identifying N+1 queries and rendering bottlenecks. The skylight gem provides production profiling with minimal overhead (claimed <1% impact) and aggregates data to show typical request performance rather than individual traces.

Metrics Libraries enable custom instrumentation with various backend storage options.

StatsD client libraries send metrics to StatsD aggregation servers:

require 'statsd' # provided by the statsd-ruby gem

statsd = Statsd.new('localhost', 8125)

# Simple metrics
statsd.increment('page.views')
statsd.gauge('queue.depth', 42)
statsd.timing('api.request', 250) # milliseconds

# Timing blocks
statsd.time('expensive.operation') do
  perform_expensive_operation
end

# Batch to reduce network overhead
statsd.batch do |batch|
  batch.increment('batch.processed')
  batch.timing('batch.duration', duration)
end

Prometheus client library for native Prometheus integration:

require 'prometheus/client'
require 'prometheus/client/push'

registry = Prometheus::Client.registry

# Metric types
counter = registry.counter(:http_requests, docstring: 'Total requests')
gauge = registry.gauge(:queue_size, docstring: 'Current queue size')
histogram = registry.histogram(:response_time, docstring: 'Response times')
summary = registry.summary(:payload_size, docstring: 'Payload sizes')

# Push metrics (for batch jobs)
Prometheus::Client::Push.new(
  job: 'batch_processor',
  gateway: 'http://pushgateway:9091'
).add(registry)

Profiling Tools analyze application performance at the code level.

ruby-prof provides comprehensive profiling with multiple output formats:

require 'ruby-prof'

RubyProf.measure_mode = RubyProf::WALL_TIME # or PROCESS_TIME, ALLOCATIONS, MEMORY

result = RubyProf.profile do
  # Code to profile
end

# Different visualizations
RubyProf::FlatPrinter.new(result).print(STDOUT)
RubyProf::GraphPrinter.new(result).print(STDOUT)
# The block form closes the file once the report is written
File.open('callstack.html', 'w') { |f| RubyProf::CallStackPrinter.new(result).print(f) }

stackprof samples call stacks with low overhead, suitable for production profiling:

require 'stackprof'

StackProf.run(mode: :cpu, out: 'tmp/stackprof.dump') do
  # Application code runs normally
end

# Analysis with CLI tool
# stackprof tmp/stackprof.dump --text --limit 20

rack-mini-profiler adds in-browser profiling for web requests:

# Gemfile
gem 'rack-mini-profiler'

# Automatic profiling in development
# Visit /?pp=flamegraph for flame graph
# Visit /?pp=profile-memory for memory analysis

Database Monitoring tools specifically track query performance.

Bullet detects N+1 queries and unused eager loading:

# config/environments/development.rb
config.after_initialize do
  Bullet.enable = true
  Bullet.alert = true
  Bullet.bullet_logger = true
  Bullet.add_footer = true
end

# Detects N+1 queries and suggests fixes

PgHero provides PostgreSQL-specific monitoring:

# Database query statistics
PgHero.slow_queries(duration: 20) # Queries taking >20ms
PgHero.long_running_queries
PgHero.index_usage
PgHero.unused_indexes

Real User Monitoring tracks client-side performance from actual user browsers, measuring page load times, JavaScript errors, and user interactions. Most APM platforms include browser monitoring through JavaScript snippets injected into rendered HTML.

Common Patterns

Performance monitoring follows established patterns for instrumentation, metric collection, and alerting that have proven effective across diverse production systems.

Middleware Pattern centralizes HTTP request monitoring at the Rack middleware layer:

class RequestMonitoring
  IGNORE_PATHS = ['/health', '/metrics'].freeze
  
  def initialize(app, metrics_client:)
    @app = app
    @metrics = metrics_client
  end
  
  def call(env)
    return @app.call(env) if IGNORE_PATHS.include?(env['PATH_INFO'])
    
    start_time = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    
    begin
      status, headers, body = @app.call(env)
    rescue => error
      # record_exception is assumed to be defined with the other private helpers
      record_exception(env, error)
      raise
    end
    
    # Reached only when the downstream app returned without raising
    duration = Process.clock_gettime(Process::CLOCK_MONOTONIC) - start_time
    record_request(env, status, duration)
    
    [status, headers, body]
  end
  
  private
  
  def record_request(env, status, duration)
    tags = build_tags(env, status)
    
    @metrics.timing('http.request.duration', duration, tags: tags)
    @metrics.increment('http.request.count', tags: tags)
    
    if duration > 5.0
      @metrics.increment('http.request.slow', tags: tags)
    end
  end
  
  def build_tags(env, status)
    {
      method: env['REQUEST_METHOD'],
      path: normalize_path(env['PATH_INFO']),
      status: status,
      status_class: "#{status / 100}xx"
    }
  end
end

Decorator Pattern adds monitoring to existing classes without modifying their implementation:

class MonitoredRepository
  def initialize(repository, metrics:)
    @repository = repository
    @metrics = metrics
  end
  
  def find(id)
    measure('repository.find') do
      @repository.find(id)
    end
  end
  
  def save(entity)
    measure('repository.save') do
      @repository.save(entity)
    end
  end
  
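  def method_missing(method, *args, &block)
    return super unless @repository.respond_to?(method)
    
    measure("repository.#{method}") do
      @repository.public_send(method, *args, &block)
    end
  end
  
  def respond_to_missing?(method, include_private = false)
    @repository.respond_to?(method) || super
  end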
  
  private
  
  def measure(operation)
    start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    result = yield
    duration = Process.clock_gettime(Process::CLOCK_MONOTONIC) - start
    
    @metrics.timing(operation, duration)
    @metrics.increment("#{operation}.count")
    
    result
  rescue => error
    @metrics.increment("#{operation}.error")
    raise
  end
end

Aspect-Oriented Monitoring uses method wrapping to add instrumentation transparently:

module Instrumentation
  def self.included(base)
    base.extend(ClassMethods)
  end
  
  module ClassMethods
    def instrument_method(method_name, metric_name: nil)
      original_method = instance_method(method_name)
      # String#underscore comes from ActiveSupport
      metric = metric_name || "#{name.underscore}.#{method_name}"
      
      # Forward keyword arguments explicitly so Ruby 3 methods keep working
      define_method(method_name) do |*args, **kwargs, &block|
        start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
        
        begin
          result = original_method.bind(self).call(*args, **kwargs, &block)
          MetricsClient.increment("#{metric}.success")
          result
        rescue => error
          MetricsClient.increment("#{metric}.error", tags: { error: error.class.name })
          raise
        ensure
          duration = Process.clock_gettime(Process::CLOCK_MONOTONIC) - start
          MetricsClient.timing("#{metric}.duration", duration)
        end
      end
    end
  end
end

class PaymentProcessor
  include Instrumentation
  
  def process_payment(payment)
    # Implementation
  end
  
  instrument_method :process_payment
end

Circuit Breaker with Metrics combines failure detection with monitoring:

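class CircuitBreaker
  # Raised when a call is rejected because the circuit is open
  CircuitOpenError = Class.new(StandardError)
  
  STATES = [:closed, :open, :half_open].freeze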
  
  def initialize(service_name, threshold: 5, timeout: 60, metrics:)
    @service_name = service_name
    @threshold = threshold
    @timeout = timeout
    @metrics = metrics
    @state = :closed
    @failure_count = 0
    @last_failure_time = nil
  end
  
  def call
    case @state
    when :open
      if Time.now - @last_failure_time > @timeout
        transition_to(:half_open)
        attempt_call { yield }
      else
        @metrics.increment("circuit_breaker.rejected", tags: { service: @service_name })
        raise CircuitOpenError
      end
    when :half_open, :closed
      attempt_call { yield }
    end
  end
  
  private
  
  def attempt_call
    result = yield
    on_success
    result
  rescue => error
    on_failure
    raise
  end
  
  def on_success
    @failure_count = 0
    transition_to(:closed) if @state == :half_open
    @metrics.increment("circuit_breaker.success", tags: { service: @service_name })
  end
  
  def on_failure
    @failure_count += 1
    @last_failure_time = Time.now
    @metrics.increment("circuit_breaker.failure", tags: { service: @service_name })
    
    if @failure_count >= @threshold
      transition_to(:open)
    end
  end
  
  def transition_to(new_state)
    old_state = @state
    @state = new_state
    @metrics.increment("circuit_breaker.state_change", tags: {
      service: @service_name,
      from: old_state,
      to: new_state
    })
  end
end

Heartbeat Pattern monitors background job health:

class JobHeartbeat
  def initialize(job_name, interval: 60, metrics:)
    @job_name = job_name
    @interval = interval
    @metrics = metrics
    @last_heartbeat = Time.now
  end
  
  def beat
    now = Time.now
    gap = now - @last_heartbeat
    
    @metrics.gauge("job.heartbeat.gap", gap, tags: { job: @job_name })
    @metrics.gauge("job.heartbeat.timestamp", now.to_i, tags: { job: @job_name })
    
    if gap > @interval * 2
      @metrics.increment("job.heartbeat.missed", tags: { job: @job_name })
    end
    
    @last_heartbeat = now
  end
end

class BackgroundProcessor
  def run
    heartbeat = JobHeartbeat.new('background_processor', metrics: MetricsClient)
    
    loop do
      process_batch
      heartbeat.beat
      sleep 60
    end
  end
end

Percentile Tracking provides more meaningful performance metrics than averages:

class PercentileTracker
  def initialize(window_size: 1000)
    @values = []
    @window_size = window_size
  end
  
  def record(value)
    @values << value
    @values.shift if @values.size > @window_size
  end
  
  def percentile(p)
    return nil if @values.empty?
    
    sorted = @values.sort
    index = (p / 100.0 * sorted.length).ceil - 1
    sorted[[index, 0].max]
  end
  
  def report_metrics(metric_name)
    [50, 95, 99].each do |p|
      value = percentile(p)
      MetricsClient.gauge("#{metric_name}.p#{p}", value) if value
    end
  end
end
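
A small worked example shows why the average can mislead. The sketch below uses the same nearest-rank rule as the tracker; the `nearest_rank` helper and the latency figures are made up for illustration.

```ruby
# Nearest-rank percentile over a list of samples
def nearest_rank(values, p)
  sorted = values.sort
  index = (p / 100.0 * sorted.length).ceil - 1
  sorted[[index, 0].max]
end

# 98 fast requests and 2 very slow ones, in milliseconds
latencies = Array.new(98, 50) + [4000, 6000]

mean = latencies.sum.to_f / latencies.size
p50 = nearest_rank(latencies, 50)
p99 = nearest_rank(latencies, 99)
```

The mean comes out at 149 ms, a figure no request actually experienced. The median stays at 50 ms, which matches what most users saw, while the 99th percentile lands on one of the multi-second requests and exposes the tail that the average hides.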

Real-World Applications

Production monitoring implementations vary based on system architecture, scale, and operational requirements.

Microservices Monitoring requires distributed tracing to understand request flows across services. Each service instruments critical operations and propagates trace context:

# Order service
class OrderService
  def create_order(params)
    tracer.in_span('order.create') do |span|
      span.set_attribute('order.items', params[:items].size)
      span.set_attribute('order.total', params[:total])
      
      order = Order.create!(params)
      
      # Call inventory service; the client injects the active trace context
      inventory_result = InventoryClient.new.reserve_items(order.id, order.items)
      
      # Call payment service (PaymentClient is assumed to follow the same pattern)
      payment_result = PaymentClient.new.charge(order.id, order.total)
      
      span.set_attribute('order.status', order.status)
      order
    end
    # No explicit rescue is needed: in_span records the exception on the
    # span, marks the span as an error, and re-raises.
  end
  
  private
  
  def tracer
    OpenTelemetry.tracer_provider.tracer('order-service')
  end
end

# Inventory client propagates trace context to the inventory service
class InventoryClient
  def reserve_items(order_id, items)
    headers = {}
    # Injects the configured propagation headers (W3C traceparent by default)
    OpenTelemetry.propagation.inject(headers)
    
    start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    response = HTTP.headers(headers).post(
      "#{base_url}/reserve",
      json: { order_id: order_id, items: items }
    )
    duration = Process.clock_gettime(Process::CLOCK_MONOTONIC) - start
    
    MetricsClient.timing('inventory.reserve.duration', duration)
    response.parse
  end
end

Database Performance Monitoring tracks query patterns to identify optimization opportunities:

class DatabaseMonitoring
  def initialize
    @slow_query_threshold = 100 # milliseconds
    subscribe_to_queries
  end
  
  def subscribe_to_queries
    ActiveSupport::Notifications.subscribe('sql.active_record') do |*args|
      event = ActiveSupport::Notifications::Event.new(*args)
      process_query_event(event)
    end
  end
  
  def process_query_event(event)
    return if event.payload[:name] == 'SCHEMA'
    
    duration_ms = event.duration
    sql = event.payload[:sql]
    operation = extract_operation(sql)
    
    tags = {
      operation: operation,
      connection: event.payload[:connection_id]
    }
    
    MetricsClient.timing('database.query.duration', duration_ms, tags: tags)
    MetricsClient.increment('database.query.count', tags: tags)
    
    if duration_ms > @slow_query_threshold
      log_slow_query(sql, duration_ms, event.payload)
      MetricsClient.increment('database.query.slow', tags: tags)
    end
    
    detect_n_plus_one(sql, duration_ms)
  end
  
  def extract_operation(sql)
    sql.strip.split(/\s+/).first.upcase
  end
  
  def detect_n_plus_one(sql, duration)
    # Track similar queries in short time windows
    @query_tracker ||= {}
    pattern = normalize_query(sql)
    now = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    
    entry = @query_tracker[pattern]
    # Start a fresh one-second window when none exists or the last one expired
    if entry.nil? || now - entry[:first_seen] >= 1.0
      entry = @query_tracker[pattern] = { count: 0, first_seen: now }
    end
    entry[:count] += 1
    
    if entry[:count] > 10
      MetricsClient.increment('database.potential_n_plus_one', tags: {
        pattern: pattern
      })
    end
  end
  
  def normalize_query(sql)
    # Convert specific values to placeholders
    sql.gsub(/\d+/, '?').gsub(/'[^']+'/, '?')
  end
end

Background Job Monitoring tracks job execution, failures, and queue depths:

class JobMonitoring
  def self.instrument_sidekiq
    Sidekiq.configure_server do |config|
      config.server_middleware do |chain|
        chain.add JobMetricsMiddleware
      end
    end
  end
  
  class JobMetricsMiddleware
    def call(worker, job, queue)
      start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
      job_class = worker.class.name
      
      MetricsClient.increment('job.started', tags: {
        worker: job_class,
        queue: queue
      })
      
      begin
        yield
        
        duration = Process.clock_gettime(Process::CLOCK_MONOTONIC) - start
        
        MetricsClient.timing('job.duration', duration, tags: {
          worker: job_class,
          queue: queue
        })
        
        MetricsClient.increment('job.completed', tags: {
          worker: job_class,
          queue: queue
        })
        
      rescue => error
        MetricsClient.increment('job.failed', tags: {
          worker: job_class,
          queue: queue,
          error: error.class.name
        })
        raise
      end
    end
  end
  
  # Monitor queue depths
  def self.monitor_queues
    Sidekiq::Queue.all.each do |queue|
      MetricsClient.gauge('job.queue.size', queue.size, tags: {
        queue: queue.name
      })
      
      MetricsClient.gauge('job.queue.latency', queue.latency, tags: {
        queue: queue.name
      })
    end
  end
end

Health Check Endpoints expose system status for load balancers and orchestrators:

class HealthCheckController < ApplicationController
  def show
    checks = {
      database: check_database,
      redis: check_redis,
      external_api: check_external_api
    }
    
    all_healthy = checks.values.all? { |result| result[:healthy] }
    
    MetricsClient.gauge('health_check.status', all_healthy ? 1 : 0)
    
    checks.each do |service, result|
      MetricsClient.gauge("health_check.#{service}", result[:healthy] ? 1 : 0)
      MetricsClient.timing("health_check.#{service}.duration", result[:duration])
    end
    
    status = all_healthy ? :ok : :service_unavailable
    render json: { status: status, checks: checks }, status: status
  end
  
  private
  
  def check_database
    start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    begin
      ActiveRecord::Base.connection.execute('SELECT 1')
      duration = Process.clock_gettime(Process::CLOCK_MONOTONIC) - start
      { healthy: true, duration: duration }
    rescue => error
      duration = Process.clock_gettime(Process::CLOCK_MONOTONIC) - start
      { healthy: false, duration: duration, error: error.message }
    end
  end
end

Memory Leak Detection monitors memory growth over time:

class MemoryMonitoring
  def self.monitor
    loop do
      collect_metrics
      sleep 60
    end
  end
  
  def self.collect_metrics
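    # Process memory from the /proc filesystem (Linux only); values are reported in kB
    if File.exist?('/proc/self/status')
      status = File.read('/proc/self/status')

      # VmSize is the total virtual address space of the process
      if (vmsize = status[/VmSize:\s+(\d+)/, 1])
        MetricsClient.gauge('memory.vmsize', vmsize.to_i)
      end

      # VmRSS is the resident set size, the portion held in physical memory
      if (rss = status[/VmRSS:\s+(\d+)/, 1])
        MetricsClient.gauge('memory.rss', rss.to_i)
      end
    end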
    
    # Ruby heap statistics
    stat = GC.stat
    MetricsClient.gauge('memory.heap_live_slots', stat[:heap_live_slots])
    MetricsClient.gauge('memory.heap_free_slots', stat[:heap_free_slots])
    MetricsClient.gauge('memory.old_objects', stat[:old_objects])
    
    # ObjectSpace statistics
    ObjectSpace.count_objects.each do |type, count|
      MetricsClient.gauge("memory.objects.#{type}", count)
    end
  end
end
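
The gauges above only record values; something still has to compare samples over time to flag a leak. A rough sketch of that comparison follows, assuming a hypothetical `GrowthDetector` class (not part of any library) that keeps a sliding window of samples and reports when the newest sample exceeds the oldest by more than a threshold. The window size and the 10% threshold are illustrative values, not recommendations from a specific tool.

```ruby
# Hypothetical helper: flags growth across a fixed-size window of samples.
class GrowthDetector
  def initialize(window_size: 60, threshold: 0.10)
    @window_size = window_size
    @threshold = threshold
    @samples = []
  end

  # Records a sample (for example RSS in kB) and drops the oldest when full.
  def record(value)
    @samples << value
    @samples.shift if @samples.size > @window_size
    self
  end

  # Relative growth from the oldest to the newest sample in the window.
  def growth
    return 0.0 if @samples.size < 2 || @samples.first.zero?

    (@samples.last - @samples.first).to_f / @samples.first
  end

  # True only when the window is full and growth exceeds the threshold.
  def leaking?
    @samples.size == @window_size && growth > @threshold
  end
end

detector = GrowthDetector.new(window_size: 5, threshold: 0.10)
[100_000, 103_000, 106_000, 109_000, 112_000].each { |rss| detector.record(rss) }

steady = GrowthDetector.new(window_size: 5, threshold: 0.10)
5.times { steady.record(100_000) }
```

Comparing only the endpoints of the window is deliberately crude: a single spike in the newest sample also trips it. In practice this kind of rule is usually expressed in the alerting system over the stored time series rather than inside the application process.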

Reference

Metric Types

| Type | Description | Use Cases | Ruby Example |
|------|-------------|-----------|--------------|
| Counter | Monotonically increasing value | Request counts, error counts | statsd.increment('requests') |
| Gauge | Point-in-time measurement | Memory usage, queue depth | statsd.gauge('queue_size', 42) |
| Histogram | Distribution of values | Response time distribution | prometheus.histogram(:duration) |
| Timer | Duration measurement | Operation timing | statsd.timing('api_call', 150) |
| Summary | Statistical summary | Quantiles, percentiles | prometheus.summary(:size) |

Common Metrics

| Metric | Category | Description | Alert Threshold |
|--------|----------|-------------|-----------------|
| Request Rate | Throughput | Requests per second | Sudden drop >50% |
| Error Rate | Reliability | Errors per total requests | >1% for critical paths |
| Response Time P95 | Performance | 95th percentile response time | >500ms typical |
| Response Time P99 | Performance | 99th percentile response time | >1000ms typical |
| Database Query Time | Performance | Time spent in database | >100ms per query |
| Memory Usage | Resources | RSS or heap size | >80% of available |
| GC Time Ratio | Resources | GC time / total time | >10% sustained |
| Queue Depth | Throughput | Pending job count | Growing over 1 hour |
| Active Connections | Resources | Open database connections | >80% of pool |
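
Two of the metrics above, percentile response time and error rate, are derived from raw samples rather than measured directly. The sketch below shows one way to compute them in plain Ruby. It uses the nearest-rank percentile definition, which is an assumption: metrics backends often interpolate or estimate percentiles from histogram buckets instead. The method names `percentile` and `error_rate` are illustrative.

```ruby
# Nearest-rank percentile: the smallest sample with at least pct% of
# all samples at or below it. Returns nil for an empty sample set.
def percentile(values, pct)
  return nil if values.empty?

  sorted = values.sort
  rank = (pct * sorted.size / 100.0).ceil
  sorted[[rank, 1].max - 1]
end

# Errors as a fraction of total requests; 0.0 when there was no traffic.
def error_rate(errors, total)
  total.zero? ? 0.0 : errors.to_f / total
end

durations_ms = (1..100).to_a.shuffle
p95 = percentile(durations_ms, 95)
p99 = percentile(durations_ms, 99)
rate = error_rate(3, 200)
```

Percentiles are preferred over averages for alerting because a small number of very slow requests can hide behind a healthy mean.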

Ruby Profiling Tools

| Tool | Type | Overhead | Use Case | Output Format |
|------|------|----------|----------|---------------|
| ruby-prof | Comprehensive profiler | High | Development, detailed analysis | Flat, graph, call stack |
| stackprof | Sampling profiler | Low | Production profiling | Flame graphs, text |
| memory_profiler | Memory profiler | High | Memory leak investigation | Allocation reports |
| rack-mini-profiler | Request profiler | Medium | Web request analysis | HTML, flame graphs |
| benchmark | Simple timing | Minimal | Quick measurements | Text output |

APM Comparison

| Platform | Focus | Overhead | Pricing Model | Best For |
|----------|-------|----------|---------------|----------|
| New Relic | Full-stack APM | 2-5% | Per host | Enterprise, complex systems |
| Datadog | Infrastructure + APM | 1-3% | Per host + metrics | DevOps teams |
| AppSignal | Ruby/Elixir APM | <1% | Per app | Ruby shops, cost-conscious |
| Skylight | Rails optimization | <1% | Per app | Rails performance tuning |
| Scout | Developer-focused | 1-2% | Per app | Smaller teams |

Instrumentation Patterns

| Pattern | Implementation | Pros | Cons |
|---------|----------------|------|------|
| Middleware | Rack middleware layer | Centralized, automatic | HTTP requests only |
| Decorator | Wrap classes | Flexible, testable | More code |
| AOP | Method wrapping | Minimal code changes | Magic, harder to debug |
| Notifications | Event subscription | Decoupled | Indirect flow |
| Manual | Explicit calls | Full control | Verbose, error-prone |

Performance Baselines

| Metric | Good | Acceptable | Poor | Critical |
|--------|------|------------|------|----------|
| Web Response Time (P95) | <200ms | <500ms | <1000ms | >1000ms |
| API Response Time (P95) | <100ms | <300ms | <800ms | >800ms |
| Background Job Duration | <30s | <120s | <300s | >300s |
| Database Query Time | <10ms | <50ms | <100ms | >100ms |
| Error Rate | <0.1% | <1% | <5% | >5% |
| Memory per Process | <200MB | <500MB | <1GB | >1GB |
| GC Pause Time | <10ms | <50ms | <100ms | >100ms |
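
Baselines like these can be applied in code by mapping a measured value to its band. A minimal sketch follows, using the Web Response Time (P95) row; the `classify` method and the shape of the bands hash are illustrative assumptions. The table does not say which band a value exactly on a boundary belongs to, so this sketch assigns it to the worse band.

```ruby
# Upper bounds in milliseconds, from the Web Response Time (P95) row.
# Ruby hashes preserve insertion order, so bands are checked best-first.
web_p95_bands = { good: 200, acceptable: 500, poor: 1000 }.freeze

# Returns the first band whose upper bound is above the value,
# or :critical when the value is at or beyond every bound.
def classify(value, bands)
  bands.each { |label, limit| return label if value < limit }
  :critical
end

fast = classify(150, web_p95_bands)
edge = classify(200, web_p95_bands)
slow = classify(750, web_p95_bands)
worst = classify(1200, web_p95_bands)
```

The same method works for any other row by passing a different bands hash, for example `{ good: 10, acceptable: 50, poor: 100 }` for Database Query Time.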

Alert Configuration

| Alert Type | Condition | Evaluation Window | Notification |
|------------|-----------|-------------------|--------------|
| Error Rate Spike | >2x baseline | 5 minutes | Immediate |
| Response Time Degradation | P95 >1.5x baseline | 10 minutes | Warning |
| Memory Leak | Steady growth >10% | 1 hour | Warning |
| Queue Buildup | Growing >15 min | 15 minutes | Warning |
| Service Down | 0 successful requests | 2 minutes | Immediate |
| Database Slow | >50% queries >100ms | 5 minutes | Warning |
| Disk Full | >90% used | 1 minute | Immediate |
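
The baseline-relative conditions in the first two rows can be expressed as a multiplier over a known baseline. The following sketch assumes a hypothetical `AlertRule` struct and assumes the current and baseline values have already been aggregated over the evaluation window; real alerting systems such as Alertmanager define rules in their own configuration language rather than in application code.

```ruby
# Hypothetical rule object mirroring one baseline-relative table row.
AlertRule = Struct.new(:name, :multiplier, :severity, keyword_init: true) do
  # Fires when the current value exceeds the baseline by the configured multiple.
  def triggered?(current:, baseline:)
    baseline.positive? && current > baseline * multiplier
  end
end

rules = [
  AlertRule.new(name: 'Error Rate Spike', multiplier: 2.0, severity: :immediate),
  AlertRule.new(name: 'Response Time Degradation', multiplier: 1.5, severity: :warning)
]

# Illustrative values, assumed to be aggregated over each rule's window.
observations = {
  'Error Rate Spike' => { current: 0.05, baseline: 0.02 },
  'Response Time Degradation' => { current: 420.0, baseline: 400.0 }
}

fired = rules.select { |rule| rule.triggered?(**observations.fetch(rule.name)) }
```

Here only the error rate rule fires: 0.05 is more than twice the 0.02 baseline, while a P95 of 420 is well under 1.5 times the 400 baseline.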

Monitoring Stack Components

| Component | Purpose | Examples | Integration Point |
|-----------|---------|----------|-------------------|
| Instrumentation | Emit metrics | StatsD, OpenTelemetry | Application code |
| Collection | Receive metrics | StatsD daemon, collectors | Network endpoint |
| Storage | Store time-series data | Prometheus, InfluxDB | Collector target |
| Visualization | Display metrics | Grafana, Kibana | Query storage |
| Alerting | Notify on conditions | Alertmanager, PagerDuty | Alert rules |
| Tracing | Track request flow | Jaeger, Zipkin | Trace context propagation |