Overview
Performance monitoring involves collecting, analyzing, and acting on metrics that describe how software systems behave under real-world conditions. The practice emerged from the need to understand production systems beyond simple uptime checks, providing visibility into response times, throughput, resource utilization, and error rates.
Modern performance monitoring extends beyond server metrics to include application-level instrumentation, distributed tracing, and user experience tracking. The shift from monolithic to distributed architectures made comprehensive monitoring essential, as failures and performance degradation can occur across multiple services and infrastructure layers.
Ruby applications require specific monitoring approaches due to the language's interpreted nature, garbage collection behavior, and typical deployment patterns with application servers like Puma or Unicorn. Performance characteristics differ significantly between MRI Ruby, JRuby, and TruffleRuby, requiring monitoring strategies adapted to each runtime.
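# Basic performance measurement
# A monotonic clock is unaffected by system clock adjustments (NTP, DST)
start_time = Process.clock_gettime(Process::CLOCK_MONOTONIC)
result = expensive_operation
duration = Process.clock_gettime(Process::CLOCK_MONOTONIC) - start_time
logger.info("Operation completed in #{duration.round(3)}s")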
The monitoring landscape includes application performance monitoring (APM) tools, custom instrumentation, log-based analysis, and real user monitoring (RUM). Each approach provides different insights, and production systems typically combine multiple methods to achieve comprehensive observability.
Key Principles
Performance monitoring operates on several fundamental principles that guide effective implementation and metric interpretation.
Metrics Collection forms the foundation of monitoring systems. Metrics fall into four primary categories: counters track cumulative values that only increase (requests served, errors encountered), gauges measure values that fluctuate (memory usage, queue depth), histograms capture distributions of values (response time percentiles), and timers specifically measure duration. Each metric type serves different analytical purposes and requires different storage and aggregation strategies.
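The four metric types can be illustrated with a small in-memory registry. This is a sketch only: `SimpleMetrics`, its method names, and the bucket boundaries are hypothetical and do not correspond to the API of any metrics library; a real client would ship these values to a backend.

```ruby
# Hypothetical in-memory registry illustrating the four metric types.
class SimpleMetrics
  DEFAULT_BUCKETS = [0.01, 0.05, 0.1, 0.5, 1.0].freeze

  attr_reader :counters, :gauges, :histograms

  def initialize(buckets: DEFAULT_BUCKETS)
    @buckets = buckets
    @counters = Hash.new(0)
    @gauges = {}
    @histograms = Hash.new { |h, k| h[k] = Hash.new(0) }
  end

  # Counter: cumulative value that only increases
  def increment(name, by = 1)
    raise ArgumentError, 'counters cannot decrease' if by.negative?

    @counters[name] += by
  end

  # Gauge: last observed value, may go up or down
  def gauge(name, value)
    @gauges[name] = value
  end

  # Histogram: count observations per bucket upper bound
  def observe(name, value)
    bound = @buckets.find { |b| value <= b } || Float::INFINITY
    @histograms[name][bound] += 1
  end

  # Timer: measure a block's duration and record it in a histogram
  def time(name)
    start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    yield
  ensure
    observe(name, Process.clock_gettime(Process::CLOCK_MONOTONIC) - start)
  end
end

metrics = SimpleMetrics.new
metrics.increment('requests.served')
metrics.increment('requests.served')
metrics.gauge('queue.depth', 7)
metrics.gauge('queue.depth', 3)
metrics.observe('response.time', 0.03)
metrics.observe('response.time', 0.2)
result = metrics.time('work.duration') { 21 * 2 }
```

The counter accumulates, the gauge keeps only the latest value, and the histogram keeps per-bucket counts, which is why each type needs a different aggregation strategy downstream.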
Instrumentation refers to the code added to applications to emit metrics. Instrumentation can occur at multiple levels: automatic instrumentation through middleware or framework hooks requires minimal code changes but provides less control, manual instrumentation offers precise control over what gets measured but increases maintenance burden, and sampling-based profiling reduces overhead by collecting data intermittently rather than continuously.
# Manual instrumentation example
class OrderProcessor
def process(order)
start_time = Process.clock_gettime(Process::CLOCK_MONOTONIC)
begin
result = perform_processing(order)
record_success_metric
result
rescue => error
record_error_metric(error.class.name)
raise
ensure
duration = Process.clock_gettime(Process::CLOCK_MONOTONIC) - start_time
record_duration_metric(duration)
end
end
end
Observability distinguishes modern monitoring from traditional approaches. Observable systems emit enough data to answer questions about internal state without requiring code deployments to add new instrumentation. The three pillars of observability—metrics, logs, and traces—work together to provide complete system visibility. Metrics identify when problems occur, logs explain what happened, and traces show how requests flow through distributed systems.
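One common way to make the three pillars work together is to stamp every log line with the identifier used for the trace, so that a spike in a metric can be followed to the matching logs and traces. The sketch below uses only the standard library; the `CorrelatedLogger` class and its field names are hypothetical.

```ruby
require 'json'
require 'logger'
require 'securerandom'
require 'stringio'

# Hypothetical logger that attaches a trace ID to every entry so that
# logs can be joined with traces and metrics for the same request.
class CorrelatedLogger
  def initialize(io)
    @logger = Logger.new(io)
    # Emit the message as-is, one JSON document per line
    @logger.formatter = proc { |_severity, _time, _progname, msg| "#{msg}\n" }
  end

  # Run a block with a trace ID bound to the current thread
  def with_trace(trace_id = SecureRandom.hex(16))
    previous = Thread.current[:trace_id]
    Thread.current[:trace_id] = trace_id
    yield trace_id
  ensure
    Thread.current[:trace_id] = previous
  end

  def info(event, **fields)
    entry = { event: event, trace_id: Thread.current[:trace_id] }.merge(fields)
    @logger.info(entry.to_json)
  end
end

output = StringIO.new
log = CorrelatedLogger.new(output)
log.with_trace do
  log.info('order.received', order_id: 42)
  log.info('order.charged', order_id: 42, duration_ms: 12.5)
end
entries = output.string.lines.map { |line| JSON.parse(line) }
```

Both entries carry the same `trace_id`, so a log search on that value returns everything recorded for the request.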
Overhead Management remains critical because monitoring itself consumes resources. Every metric collected, log line written, and trace captured adds CPU cycles, memory allocation, and network traffic. Production monitoring must balance data richness against performance impact. Techniques include sampling (measuring only a percentage of requests), aggregation (computing statistics locally before transmission), and buffering (batching data before sending).
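Sampling and buffering can be combined in a few lines. The following is a minimal sketch, assuming a hypothetical `BufferedSampler` class that keeps flushed batches in memory instead of sending them over the network; the sample rate and batch size are arbitrary illustration values.

```ruby
# Hypothetical emitter combining sampling and buffering to limit overhead.
class BufferedSampler
  attr_reader :flushed_batches

  def initialize(sample_rate:, batch_size:, random: Random.new)
    @sample_rate = sample_rate
    @batch_size = batch_size
    @random = random
    @buffer = []
    @flushed_batches = []
  end

  # Record a measurement for roughly sample_rate of all calls
  def record(name, value)
    return unless @random.rand < @sample_rate

    @buffer << [name, value]
    flush if @buffer.size >= @batch_size
  end

  # Hand the batch to the transport; here batches are only kept in memory
  def flush
    return if @buffer.empty?

    @flushed_batches << @buffer
    @buffer = []
  end
end

sampler = BufferedSampler.new(sample_rate: 0.1, batch_size: 50, random: Random.new(42))
10_000.times { |i| sampler.record('request.duration', i) }
sampler.flush
sampled = sampler.flushed_batches.sum(&:size)
```

Counts derived from sampled data need to be scaled by the inverse of the sample rate to estimate true totals, while distributions such as percentiles can usually be read from the sample directly.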
Baseline Establishment enables meaningful interpretation of metrics. Systems must establish normal operating ranges for metrics before identifying anomalies. Baselines account for daily, weekly, and seasonal patterns—traffic differs between business hours and nights, weekdays and weekends. Statistical methods like moving averages, standard deviations, and percentile tracking help distinguish signal from noise.
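A rolling mean and standard deviation give a simple baseline. The sketch below is illustrative: `RollingBaseline`, its window size, and the three-standard-deviation threshold are assumptions, and it deliberately ignores daily or weekly seasonality, which would need a separate baseline per time slot.

```ruby
# Hypothetical rolling baseline: flags values that deviate from the recent
# mean by more than a given number of standard deviations.
class RollingBaseline
  def initialize(window_size: 60, threshold: 3.0, min_samples: 10)
    @window_size = window_size
    @threshold = threshold
    @min_samples = min_samples
    @values = []
  end

  def mean
    @values.sum.to_f / @values.size
  end

  def stddev
    m = mean
    Math.sqrt(@values.sum { |v| (v - m)**2 } / @values.size)
  end

  # Returns true when the value is anomalous relative to the current
  # baseline, then adds the value to the window.
  def anomaly?(value)
    anomalous = false
    if @values.size >= @min_samples
      spread = stddev
      anomalous = spread.positive? && (value - mean).abs > @threshold * spread
    end
    @values << value
    @values.shift if @values.size > @window_size
    anomalous
  end
end

baseline = RollingBaseline.new
normal = [100, 102, 98, 101, 99, 103, 97, 100, 102, 98, 101, 99]
flags = normal.map { |v| baseline.anomaly?(v) }
spike = baseline.anomaly?(250)
```

No value is flagged until enough samples exist to form a baseline. Note that this version also adds anomalous values to the window, so a long incident gradually shifts the baseline; excluding flagged values is a possible refinement.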
Alerting Principles determine when humans need notification about system behavior. Effective alerts signal actionable problems rather than every metric fluctuation. Alert fatigue occurs when too many non-critical alerts train operators to ignore notifications. Good alerting focuses on symptoms affecting users (high error rates, slow responses) rather than underlying causes (high CPU, memory pressure) because symptoms directly impact user experience while causes may not.
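The idea of alerting on a sustained user-facing symptom rather than on every fluctuation can be sketched as follows. `SustainedAlert` and its parameters are hypothetical; production alerting would normally be configured in the monitoring backend rather than in application code.

```ruby
# Hypothetical alert rule on a user-facing symptom (error rate). It fires only
# after the threshold is breached for several consecutive evaluations, which
# suppresses notifications for brief fluctuations.
class SustainedAlert
  attr_reader :state

  def initialize(threshold:, required_breaches:)
    @threshold = threshold
    @required_breaches = required_breaches
    @consecutive = 0
    @state = :ok
  end

  # Returns :fired or :resolved on a state change, nil otherwise
  def evaluate(errors:, requests:)
    rate = requests.zero? ? 0.0 : errors.to_f / requests
    @consecutive = rate > @threshold ? @consecutive + 1 : 0

    if @state == :ok && @consecutive >= @required_breaches
      @state = :firing
      :fired
    elsif @state == :firing && @consecutive.zero?
      @state = :ok
      :resolved
    end
  end
end

alert = SustainedAlert.new(threshold: 0.01, required_breaches: 3)
samples = [
  { errors: 5, requests: 100 }, # single spike, ignored
  { errors: 0, requests: 100 },
  { errors: 3, requests: 100 },
  { errors: 4, requests: 100 },
  { errors: 6, requests: 100 }, # third consecutive breach
  { errors: 7, requests: 100 },
  { errors: 0, requests: 100 }  # recovery
]
events = samples.map { |s| alert.evaluate(**s) }
```

Only two notifications result from seven evaluations: one when the breach has persisted and one when the error rate recovers.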
Ruby Implementation
Ruby provides multiple approaches to implementing performance monitoring, from standard library profiling to third-party instrumentation frameworks.
The Benchmark module in the standard library offers simple timing measurements:
require 'benchmark'
result = Benchmark.measure do
10_000.times { expensive_calculation }
end
puts "User CPU time: #{result.utime}"
puts "System CPU time: #{result.stime}"
puts "Total CPU time: #{result.total}"
puts "Elapsed real time: #{result.real}"
For more detailed profiling, ruby-prof provides comprehensive runtime analysis:
require 'ruby-prof'
RubyProf.start
# Code to profile
process_large_dataset(data)
result = RubyProf.stop
# Generate different report types
printer = RubyProf::FlatPrinter.new(result)
printer.print(STDOUT, min_percent: 2)
# Call stack report
printer = RubyProf::CallStackPrinter.new(result)
File.open('callstack.html', 'w') { |f| printer.print(f) }
ActiveSupport::Notifications provides instrumentation hooks for Rails applications and can be used in any Ruby application:
require 'active_support/notifications'
# Subscribe to events
ActiveSupport::Notifications.subscribe('process.order') do |name, start, finish, id, payload|
duration = finish - start
# Send to monitoring service
MetricsClient.timing('order.processing.duration', duration)
MetricsClient.increment('order.processing.count')
if payload[:error]
MetricsClient.increment('order.processing.errors')
end
end
# Instrument code
def process_order(order)
ActiveSupport::Notifications.instrument('process.order', order_id: order.id) do |payload|
begin
result = perform_processing(order)
payload[:success] = true
result
rescue => error
payload[:error] = error
raise
end
end
end
Rack middleware enables HTTP-level monitoring for web applications:
class PerformanceMonitoring
def initialize(app)
@app = app
end
def call(env)
request_start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
status, headers, body = @app.call(env)
request_duration = Process.clock_gettime(Process::CLOCK_MONOTONIC) - request_start
# Record metrics
record_request_metrics(
path: env['PATH_INFO'],
method: env['REQUEST_METHOD'],
status: status,
duration: request_duration
)
[status, headers, body]
rescue => error
record_error_metrics(error)
raise
end
private
def record_request_metrics(path:, method:, status:, duration:)
tags = {
path: normalize_path(path),
method: method,
status_range: "#{status / 100}xx"
}
MetricsClient.timing('http.request.duration', duration, tags: tags)
MetricsClient.increment('http.request.count', tags: tags)
end
def normalize_path(path)
# Convert /users/123 to /users/:id
path.gsub(/\/\d+/, '/:id')
end
end
Memory profiling requires specialized tools. The memory_profiler gem tracks allocations:
require 'memory_profiler'
# top: limits each section of the report to the 20 largest entries
report = MemoryProfiler.report(top: 20) do
  process_large_file('data.csv')
end
# Print allocations and retentions by gem, file, location, and class
report.pretty_print(scale_bytes: true)
# Retained strings: each entry pairs a string with its [location, count] list
report.strings_retained.first(10).each do |string, locations|
  count = locations.sum { |_location, n| n }
  puts "#{count} retained: #{string.inspect}"
end
GC statistics provide insights into garbage collection performance:
GC.stat.each do |key, value|
MetricsClient.gauge("ruby.gc.#{key}", value)
end
# Monitor GC between operations
before_gc = GC.stat(:total_allocated_objects)
perform_operation
after_gc = GC.stat(:total_allocated_objects)
allocations = after_gc - before_gc
MetricsClient.gauge('operation.allocations', allocations)
Database query monitoring through ActiveRecord:
ActiveSupport::Notifications.subscribe('sql.active_record') do |name, start, finish, id, payload|
duration = (finish - start) * 1000 # Convert to milliseconds
unless payload[:name] == 'SCHEMA'
MetricsClient.timing('database.query.duration', duration, tags: {
operation: payload[:sql].split.first,
connection: payload[:connection_id]
})
# Alert on slow queries
if duration > 1000
logger.warn("Slow query (#{duration}ms): #{payload[:sql]}")
end
end
end
Implementation Approaches
Performance monitoring strategies range from comprehensive APM solutions to custom instrumentation tailored to specific application needs.
Application Performance Monitoring (APM) provides end-to-end visibility through vendor-provided agents that automatically instrument applications. APM tools like New Relic, Datadog, or AppSignal install as gems that hook into Ruby frameworks, collecting metrics, traces, and errors without extensive manual instrumentation. This approach offers rapid implementation and broad coverage but introduces vendor dependencies and monthly costs. APM agents add overhead (typically 1-5% performance impact) and may not capture application-specific business metrics without custom instrumentation.
APM implementation follows a standard pattern:
# Gemfile
gem 'newrelic_rpm'
# config/newrelic.yml configuration
# Automatic instrumentation of Rails, Sidekiq, databases
# Custom instrumentation when needed
require 'new_relic/agent/method_tracer'
class BusinessMetrics
  include ::NewRelic::Agent::MethodTracer
  def critical_operation
    # Operation code
  end
  # Report timings for this method under a custom metric name
  add_method_tracer :critical_operation, 'Custom/critical_operation'
end
Custom Metrics Systems build monitoring infrastructure using time-series databases like Prometheus, InfluxDB, or Graphite. Applications emit metrics through client libraries, gaining full control over data collection, storage, and visualization. This approach eliminates vendor lock-in and recurring costs but requires infrastructure management and custom instrumentation throughout the codebase.
# Custom metrics with Prometheus
require 'prometheus/client'
prometheus = Prometheus::Client.registry
# Constants keep the metrics reachable from inside method bodies
HTTP_REQUESTS = prometheus.counter(
  :http_requests_total,
  docstring: 'Total HTTP requests',
  labels: [:method, :path, :status]
)
HTTP_DURATION = prometheus.histogram(
  :http_request_duration_seconds,
  docstring: 'HTTP request duration',
  labels: [:method, :path],
  buckets: [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5]
)
# Instrumentation (assumes a Rack middleware that provides @app and normalize_path)
def handle_request(env)
  start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  status, headers, body = @app.call(env)
  duration = Process.clock_gettime(Process::CLOCK_MONOTONIC) - start
  labels = {
    method: env['REQUEST_METHOD'],
    path: normalize_path(env['PATH_INFO'])
  }
  HTTP_REQUESTS.increment(labels: labels.merge(status: status))
  HTTP_DURATION.observe(duration, labels: labels)
  [status, headers, body]
end
Log-Based Monitoring extracts performance data from structured logs, requiring minimal runtime overhead since applications already generate logs. Log aggregation systems like ELK (Elasticsearch, Logstash, Kibana) or Splunk parse logs to extract metrics and generate visualizations. This approach works well for batch processing and systems where structured logging already exists, but provides less real-time visibility than dedicated metrics systems.
require 'json'
class StructuredLogger
def log_request(method:, path:, status:, duration:, error: nil)
entry = {
timestamp: Time.now.iso8601,
type: 'http_request',
method: method,
path: path,
status: status,
duration_ms: (duration * 1000).round(2),
error: error&.class&.name
}
# Structured JSON logging
logger.info(entry.to_json)
end
end
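On the consuming side, metrics can be derived from such log lines after the fact. The sketch below assumes entries shaped like those written by `StructuredLogger` above; `summarize_requests` and `percentile` are hypothetical helpers, and a log aggregation system would normally do this work at scale.

```ruby
require 'json'

# Nearest-rank percentile of a list of numbers
def percentile(values, pct)
  sorted = values.sort
  index = (pct / 100.0 * sorted.size).ceil - 1
  sorted[[index, 0].max]
end

# Derive per-path request counts, error counts, and p95 latency from
# JSON log lines; lines that are not valid JSON are skipped.
def summarize_requests(lines)
  durations = Hash.new { |h, k| h[k] = [] }
  errors = Hash.new(0)

  lines.each do |line|
    entry = begin
      JSON.parse(line)
    rescue JSON::ParserError
      next
    end
    next unless entry.is_a?(Hash) && entry['type'] == 'http_request'

    durations[entry['path']] << entry['duration_ms']
    errors[entry['path']] += 1 if entry['status'].to_i >= 500
  end

  durations.to_h do |path, values|
    [path, { count: values.size, errors: errors[path], p95_ms: percentile(values, 95) }]
  end
end

log_lines = [
  '{"type":"http_request","path":"/orders","status":200,"duration_ms":12.5}',
  '{"type":"http_request","path":"/orders","status":500,"duration_ms":250.0}',
  '{"type":"http_request","path":"/health","status":200,"duration_ms":1.2}',
  'not json at all'
]
summary = summarize_requests(log_lines)
```

Because the analysis runs on stored logs, new questions can be answered without redeploying the application, at the cost of the delay between writing and indexing a log line.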
Sampling-Based Profiling reduces overhead by collecting detailed performance data for only a subset of requests. The rack-mini-profiler gem exemplifies this approach, profiling requests when triggered by specific conditions rather than continuously monitoring all traffic.
# Profile only slow requests
class SelectiveProfiler
def initialize(app, threshold: 1.0)
@app = app
@threshold = threshold
end
def call(env)
start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
status, headers, body = @app.call(env)
duration = Process.clock_gettime(Process::CLOCK_MONOTONIC) - start
if duration > @threshold
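# The request has already finished, so this can only record context (path,
# duration) that flags the endpoint for detailed profiling later
profile_slow_request(env, duration)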
end
[status, headers, body]
end
end
Distributed Tracing tracks requests across microservices, correlating activity through unique trace IDs. OpenTelemetry provides standardized tracing instrumentation that works across languages and vendors.
require 'opentelemetry/sdk'
OpenTelemetry::SDK.configure do |c|
c.service_name = 'order-service'
c.use 'OpenTelemetry::Instrumentation::Rails'
c.use 'OpenTelemetry::Instrumentation::Sidekiq'
end
# A constant keeps the tracer reachable from inside method bodies
TRACER = OpenTelemetry.tracer_provider.tracer('order-service')
def process_order(order)
TRACER.in_span('order.process', attributes: { 'order.id' => order.id }) do |span|
# Trace propagates to downstream services
result = call_payment_service(order)
span.set_attribute('order.total', result.total)
result
end
end
Tools & Ecosystem
The Ruby monitoring ecosystem includes both language-specific tools and polyglot platforms that support Ruby through dedicated agents or libraries.
APM Platforms provide comprehensive monitoring through automatic instrumentation and cloud-hosted analytics.
New Relic offers deep Ruby support with automatic Rails instrumentation, background job monitoring, and custom metric APIs. The newrelic_rpm gem hooks into common frameworks and libraries without code changes. New Relic provides transaction tracing showing time spent in database queries, external HTTP calls, and application code, plus distributed tracing for microservices and error tracking with stack traces and request context.
Datadog combines APM with infrastructure monitoring, correlating application performance with server metrics, container orchestration, and log aggregation. The ddtrace gem instruments Ruby applications while Datadog agents collect system-level metrics. Datadog supports custom metrics through DogStatsD and provides advanced querying and alerting capabilities.
AppSignal focuses specifically on Ruby and Elixir applications, offering low-overhead monitoring with automatically generated dashboards ("Magic Dashboards") for the frameworks it detects. The appsignal gem provides automatic instrumentation with customizable sampling rates. AppSignal includes performance monitoring, error tracking, and host metrics in a unified interface.
Skylight specializes in Rails performance monitoring with a focus on identifying N+1 queries and rendering bottlenecks. The skylight gem provides production profiling with minimal overhead (claimed <1% impact) and aggregates data to show typical request performance rather than individual traces.
Metrics Libraries enable custom instrumentation with various backend storage options.
StatsD client libraries send metrics to StatsD aggregation servers:
require 'statsd' # provided by the statsd-ruby gem
statsd = Statsd.new('localhost', 8125)
# Simple metrics
statsd.increment('page.views')
statsd.gauge('queue.depth', 42)
statsd.timing('api.request', 250) # milliseconds
# Timing blocks
statsd.time('expensive.operation') do
perform_expensive_operation
end
# Batch to reduce network overhead
statsd.batch do |batch|
batch.increment('batch.processed')
batch.timing('batch.duration', duration)
end
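Under the hood, StatsD clients send short plain-text lines over UDP. The sketch below formats metrics in the basic StatsD line protocol using only the standard library; `statsd_line` and `send_metric` are hypothetical helpers, and the host and port are assumptions for a local daemon.

```ruby
require 'socket'

# StatsD line protocol: "<name>:<value>|<type>[|@<sample_rate>]"
STATSD_TYPES = { counter: 'c', gauge: 'g', timing: 'ms' }.freeze

def statsd_line(name, value, type, sample_rate: 1.0)
  code = STATSD_TYPES.fetch(type)
  line = "#{name}:#{value}|#{code}"
  line += "|@#{sample_rate}" if sample_rate < 1.0
  line
end

# Fire-and-forget UDP send; defined for illustration and not called below
def send_metric(line, host: '127.0.0.1', port: 8125)
  socket = UDPSocket.new
  socket.send(line, 0, host, port)
ensure
  socket&.close
end

lines = [
  statsd_line('page.views', 1, :counter),
  statsd_line('queue.depth', 42, :gauge),
  statsd_line('api.request', 250, :timing),
  statsd_line('page.views', 1, :counter, sample_rate: 0.1)
]
```

Because UDP is connectionless, a missing or slow StatsD daemon does not block the application, which is the main reason this transport is popular for metrics.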
Prometheus client library for native Prometheus integration:
require 'prometheus/client'
require 'prometheus/client/push'
registry = Prometheus::Client.registry
# Metric types
counter = registry.counter(:http_requests, docstring: 'Total requests')
gauge = registry.gauge(:queue_size, docstring: 'Current queue size')
histogram = registry.histogram(:response_time, docstring: 'Response times')
summary = registry.summary(:payload_size, docstring: 'Payload sizes')
# Push metrics (for batch jobs)
Prometheus::Client::Push.new(
job: 'batch_processor',
gateway: 'http://pushgateway:9091'
).add(registry)
Profiling Tools analyze application performance at the code level.
ruby-prof provides comprehensive profiling with multiple output formats:
require 'ruby-prof'
RubyProf.measure_mode = RubyProf::WALL_TIME # or PROCESS_TIME, ALLOCATIONS, MEMORY
result = RubyProf.profile do
# Code to profile
end
# Different visualizations
RubyProf::FlatPrinter.new(result).print(STDOUT)
RubyProf::GraphPrinter.new(result).print(STDOUT)
# The block form closes the file once the report is written
File.open('callstack.html', 'w') { |f| RubyProf::CallStackPrinter.new(result).print(f) }
stackprof samples call stacks with low overhead, suitable for production profiling:
require 'stackprof'
StackProf.run(mode: :cpu, out: 'tmp/stackprof.dump') do
# Application code runs normally
end
# Analysis with CLI tool
# stackprof tmp/stackprof.dump --text --limit 20
rack-mini-profiler adds in-browser profiling for web requests:
# Gemfile
gem 'rack-mini-profiler'
# Automatic profiling in development
# Visit /?pp=flamegraph for flame graph
# Visit /?pp=profile-memory for memory analysis
Database Monitoring tools specifically track query performance.
Bullet detects N+1 queries and unused eager loading:
# config/environments/development.rb
config.after_initialize do
Bullet.enable = true
Bullet.alert = true
Bullet.bullet_logger = true
Bullet.add_footer = true
end
# Detects N+1 queries and suggests fixes
PgHero provides PostgreSQL-specific monitoring:
# Database query statistics
PgHero.slow_queries(duration: 20) # Queries taking >20ms
PgHero.long_running_queries
PgHero.index_usage
PgHero.unused_indexes
Real User Monitoring tracks client-side performance from actual user browsers, measuring page load times, JavaScript errors, and user interactions. Most APM platforms include browser monitoring through JavaScript snippets injected into rendered HTML.
Common Patterns
Performance monitoring follows established patterns for instrumentation, metric collection, and alerting that have proven effective across diverse production systems.
Middleware Pattern centralizes HTTP request monitoring at the Rack middleware layer:
class RequestMonitoring
IGNORE_PATHS = ['/health', '/metrics'].freeze
def initialize(app, metrics_client:)
@app = app
@metrics = metrics_client
end
def call(env)
return @app.call(env) if ignored_path?(env['PATH_INFO'])
start_time = Process.clock_gettime(Process::CLOCK_MONOTONIC)
exception_raised = false
begin
status, headers, body = @app.call(env)
rescue => error
exception_raised = true
record_exception(env, error)
raise
ensure
unless exception_raised
duration = Process.clock_gettime(Process::CLOCK_MONOTONIC) - start_time
record_request(env, status, duration)
end
end
[status, headers, body]
end
private
def record_request(env, status, duration)
tags = build_tags(env, status)
@metrics.timing('http.request.duration', duration, tags: tags)
@metrics.increment('http.request.count', tags: tags)
if duration > 5.0
@metrics.increment('http.request.slow', tags: tags)
end
end
def build_tags(env, status)
{
method: env['REQUEST_METHOD'],
path: normalize_path(env['PATH_INFO']),
status: status,
status_class: "#{status / 100}xx"
}
end
end
Decorator Pattern adds monitoring to existing classes without modifying their implementation:
class MonitoredRepository
def initialize(repository, metrics:)
@repository = repository
@metrics = metrics
end
def find(id)
measure('repository.find') do
@repository.find(id)
end
end
def save(entity)
measure('repository.save') do
@repository.save(entity)
end
end
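def method_missing(method, *args, &block)
return super unless @repository.respond_to?(method)
measure("repository.#{method}") do
@repository.public_send(method, *args, &block)
end
end
def respond_to_missing?(method, include_private = false)
@repository.respond_to?(method, include_private) || super
end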
private
def measure(operation)
start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
result = yield
duration = Process.clock_gettime(Process::CLOCK_MONOTONIC) - start
@metrics.timing(operation, duration)
@metrics.increment("#{operation}.count")
result
rescue => error
@metrics.increment("#{operation}.error")
raise
end
end
Aspect-Oriented Monitoring uses method wrapping to add instrumentation transparently:
module Instrumentation
def self.included(base)
base.extend(ClassMethods)
end
module ClassMethods
def instrument_method(method_name, metric_name: nil)
original_method = instance_method(method_name)
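# String#underscore comes from ActiveSupport
metric = metric_name || "#{name.underscore}.#{method_name}"
# Forward keyword arguments as well as positional ones
define_method(method_name) do |*args, **kwargs, &block|
start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
begin
result = original_method.bind(self).call(*args, **kwargs, &block)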
MetricsClient.increment("#{metric}.success")
result
rescue => error
MetricsClient.increment("#{metric}.error", tags: { error: error.class.name })
raise
ensure
duration = Process.clock_gettime(Process::CLOCK_MONOTONIC) - start
MetricsClient.timing("#{metric}.duration", duration)
end
end
end
end
end
class PaymentProcessor
include Instrumentation
def process_payment(payment)
# Implementation
end
instrument_method :process_payment
end
Circuit Breaker with Metrics combines failure detection with monitoring:
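# Raised when a call is rejected because the circuit is open
class CircuitOpenError < StandardError; end
class CircuitBreaker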
STATES = [:closed, :open, :half_open].freeze
def initialize(service_name, threshold: 5, timeout: 60, metrics:)
@service_name = service_name
@threshold = threshold
@timeout = timeout
@metrics = metrics
@state = :closed
@failure_count = 0
@last_failure_time = nil
end
def call
case @state
when :open
if Time.now - @last_failure_time > @timeout
transition_to(:half_open)
attempt_call { yield }
else
@metrics.increment("circuit_breaker.rejected", tags: { service: @service_name })
raise CircuitOpenError
end
when :half_open, :closed
attempt_call { yield }
end
end
private
def attempt_call
result = yield
on_success
result
rescue => error
on_failure
raise
end
def on_success
@failure_count = 0
transition_to(:closed) if @state == :half_open
@metrics.increment("circuit_breaker.success", tags: { service: @service_name })
end
def on_failure
@failure_count += 1
@last_failure_time = Time.now
@metrics.increment("circuit_breaker.failure", tags: { service: @service_name })
if @failure_count >= @threshold
transition_to(:open)
end
end
def transition_to(new_state)
old_state = @state
@state = new_state
@metrics.increment("circuit_breaker.state_change", tags: {
service: @service_name,
from: old_state,
to: new_state
})
end
end
Heartbeat Pattern monitors background job health:
class JobHeartbeat
def initialize(job_name, interval: 60, metrics:)
@job_name = job_name
@interval = interval
@metrics = metrics
@last_heartbeat = Time.now
end
def beat
now = Time.now
gap = now - @last_heartbeat
@metrics.gauge("job.heartbeat.gap", gap, tags: { job: @job_name })
@metrics.gauge("job.heartbeat.timestamp", now.to_i, tags: { job: @job_name })
if gap > @interval * 2
@metrics.increment("job.heartbeat.missed", tags: { job: @job_name })
end
@last_heartbeat = now
end
end
class BackgroundProcessor
def run
heartbeat = JobHeartbeat.new('background_processor', metrics: MetricsClient)
loop do
process_batch
heartbeat.beat
sleep 60
end
end
end
Percentile Tracking provides more meaningful performance metrics than averages:
class PercentileTracker
def initialize(window_size: 1000)
@values = []
@window_size = window_size
end
def record(value)
@values << value
@values.shift if @values.size > @window_size
end
def percentile(p)
return nil if @values.empty?
sorted = @values.sort
index = (p / 100.0 * sorted.length).ceil - 1
sorted[[index, 0].max]
end
def report_metrics(metric_name)
[50, 95, 99].each do |p|
value = percentile(p)
MetricsClient.gauge("#{metric_name}.p#{p}", value) if value
end
end
end
Real-World Applications
Production monitoring implementations vary based on system architecture, scale, and operational requirements.
Microservices Monitoring requires distributed tracing to understand request flows across services. Each service instruments critical operations and propagates trace context:
# Order service
class OrderService
def create_order(params)
OpenTelemetry.tracer_provider.tracer('order-service').in_span('order.create') do |span|
span.set_attribute('order.items', params[:items].size)
span.set_attribute('order.total', params[:total])
order = Order.create!(params)
# Call inventory service
inventory_result = InventoryClient.reserve_items(
order.id,
order.items,
trace_context: current_trace_context
)
# Call payment service
payment_result = PaymentClient.charge(
order.id,
order.total,
trace_context: current_trace_context
)
span.set_attribute('order.status', order.status)
order
end
# No explicit rescue is needed here: in_span records an unhandled exception on
# the span, sets its status to error, and re-raises.
end
end
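# Client for the inventory service: forwards trace context in request headers
class InventoryClient
  # Class method, matching the InventoryClient.reserve_items call above
  def self.reserve_items(order_id, items, trace_context:)
    headers = {
      'X-Trace-ID' => trace_context.trace_id,
      'X-Span-ID' => trace_context.span_id
    }
    # Time the call explicitly with a monotonic clock
    start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    response = HTTP.headers(headers).post(
      "#{base_url}/reserve",
      json: { order_id: order_id, items: items }
    )
    duration = Process.clock_gettime(Process::CLOCK_MONOTONIC) - start
    MetricsClient.timing('inventory.reserve.duration', duration)
    response.parse
  end
end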
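The example above uses custom `X-Trace-ID` and `X-Span-ID` headers. A standardized alternative is the W3C Trace Context `traceparent` header, which OpenTelemetry propagates by default. The sketch below builds and parses that header with the standard library only; the `TraceContext` class is hypothetical and omits the companion `tracestate` header.

```ruby
require 'securerandom'

# Sketch of the W3C Trace Context "traceparent" header:
#   version (2 hex) - trace-id (32 hex) - parent-id (16 hex) - flags (2 hex)
class TraceContext
  PATTERN = /\A00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})\z/

  attr_reader :trace_id, :span_id, :sampled

  def initialize(trace_id: SecureRandom.hex(16), span_id: SecureRandom.hex(8), sampled: true)
    @trace_id = trace_id
    @span_id = span_id
    @sampled = sampled
  end

  # Parse an incoming header; returns nil when it is missing or malformed
  def self.parse(header)
    match = PATTERN.match(header.to_s)
    return nil unless match
    # All-zero IDs are invalid
    return nil if match[1] == '0' * 32 || match[2] == '0' * 16

    new(trace_id: match[1], span_id: match[2], sampled: (match[3].hex & 1) == 1)
  end

  # Same trace, new span ID for the outgoing call
  def child
    self.class.new(trace_id: trace_id, span_id: SecureRandom.hex(8), sampled: sampled)
  end

  def to_header
    format('00-%s-%s-%s', trace_id, span_id, sampled ? '01' : '00')
  end
end

incoming = TraceContext.parse('00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01')
outgoing = incoming.child
headers = { 'traceparent' => outgoing.to_header }
```

The trace ID stays constant across services while each hop gets a new span ID, which is what lets a tracing backend reassemble the request path.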
Database Performance Monitoring tracks query patterns to identify optimization opportunities:
class DatabaseMonitoring
def initialize
subscribe_to_queries
@slow_query_threshold = 100 # milliseconds
end
def subscribe_to_queries
ActiveSupport::Notifications.subscribe('sql.active_record') do |*args|
event = ActiveSupport::Notifications::Event.new(*args)
process_query_event(event)
end
end
def process_query_event(event)
return if event.payload[:name] == 'SCHEMA'
duration_ms = event.duration
sql = event.payload[:sql]
operation = extract_operation(sql)
tags = {
operation: operation,
connection: event.payload[:connection_id]
}
MetricsClient.timing('database.query.duration', duration_ms, tags: tags)
MetricsClient.increment('database.query.count', tags: tags)
if duration_ms > @slow_query_threshold
log_slow_query(sql, duration_ms, event.payload)
MetricsClient.increment('database.query.slow', tags: tags)
end
detect_n_plus_one(sql, duration_ms)
end
def extract_operation(sql)
sql.strip.split(/\s+/).first.upcase
end
def detect_n_plus_one(sql, duration)
# Count similar queries within a one-second window
@query_tracker ||= {}
pattern = normalize_query(sql)
now = Process.clock_gettime(Process::CLOCK_MONOTONIC)
entry = @query_tracker[pattern]
# Start a new window if the pattern is new or its window has expired;
# without this reset, a pattern could only ever be detected once
if entry.nil? || now - entry[:first_seen] >= 1.0
entry = @query_tracker[pattern] = { count: 0, first_seen: now }
end
entry[:count] += 1
if entry[:count] > 10
MetricsClient.increment('database.potential_n_plus_one', tags: {
pattern: pattern
})
end
end
def normalize_query(sql)
# Convert specific values to placeholders
# Replace string literals first, then standalone numbers (not digits in identifiers)
sql.gsub(/'[^']*'/, '?').gsub(/\b\d+\b/, '?')
end
end
Background Job Monitoring tracks job execution, failures, and queue depths:
class JobMonitoring
def self.instrument_sidekiq
Sidekiq.configure_server do |config|
config.server_middleware do |chain|
chain.add JobMetricsMiddleware
end
end
end
class JobMetricsMiddleware
def call(worker, job, queue)
start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
job_class = worker.class.name
MetricsClient.increment('job.started', tags: {
worker: job_class,
queue: queue
})
begin
yield
duration = Process.clock_gettime(Process::CLOCK_MONOTONIC) - start
MetricsClient.timing('job.duration', duration, tags: {
worker: job_class,
queue: queue
})
MetricsClient.increment('job.completed', tags: {
worker: job_class,
queue: queue
})
rescue => error
MetricsClient.increment('job.failed', tags: {
worker: job_class,
queue: queue,
error: error.class.name
})
raise
end
end
end
# Monitor queue depths
def self.monitor_queues
Sidekiq::Queue.all.each do |queue|
MetricsClient.gauge('job.queue.size', queue.size, tags: {
queue: queue.name
})
MetricsClient.gauge('job.queue.latency', queue.latency, tags: {
queue: queue.name
})
end
end
end
Health Check Endpoints expose system status for load balancers and orchestrators:
class HealthCheckController < ApplicationController
def show
checks = {
database: check_database,
redis: check_redis,
external_api: check_external_api
}
all_healthy = checks.values.all? { |result| result[:healthy] }
MetricsClient.gauge('health_check.status', all_healthy ? 1 : 0)
checks.each do |service, result|
MetricsClient.gauge("health_check.#{service}", result[:healthy] ? 1 : 0)
MetricsClient.timing("health_check.#{service}.duration", result[:duration])
end
status = all_healthy ? :ok : :service_unavailable
render json: { status: status, checks: checks }, status: status
end
private
def check_database
start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
begin
ActiveRecord::Base.connection.execute('SELECT 1')
duration = Process.clock_gettime(Process::CLOCK_MONOTONIC) - start
{ healthy: true, duration: duration }
rescue => error
duration = Process.clock_gettime(Process::CLOCK_MONOTONIC) - start
{ healthy: false, duration: duration, error: error.message }
end
end
end
Memory Leak Detection monitors memory growth over time:
class MemoryMonitoring
def self.monitor
loop do
collect_metrics
sleep 60
end
end
def self.collect_metrics
# Process memory from the /proc filesystem (Linux only); values are in kB
if File.exist?('/proc/self/status')
status = File.read('/proc/self/status')
if vmsize = status[/VmSize:\s+(\d+)/, 1]
MetricsClient.gauge('memory.vmsize', vmsize.to_i)
end
if rss = status[/VmRSS:\s+(\d+)/, 1]
MetricsClient.gauge('memory.rss', rss.to_i)
end
end
# Ruby heap statistics
stat = GC.stat
MetricsClient.gauge('memory.heap_live_slots', stat[:heap_live_slots])
MetricsClient.gauge('memory.heap_free_slots', stat[:heap_free_slots])
MetricsClient.gauge('memory.old_objects', stat[:old_objects])
# ObjectSpace statistics
ObjectSpace.count_objects.each do |type, count|
MetricsClient.gauge("memory.objects.#{type}", count)
end
end
end
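Collecting these gauges is only half of leak detection; the other half is deciding whether memory is growing steadily. One possible heuristic, sketched below with a hypothetical `MemoryTrend` class, fits a least-squares line through RSS samples and compares the fitted growth with the first sample. The 10% threshold is an assumption chosen for illustration.

```ruby
# Hypothetical leak heuristic: fit a least-squares line through RSS samples
# and flag sustained growth relative to the first sample.
class MemoryTrend
  Sample = Struct.new(:time, :rss_kb)

  def initialize(growth_threshold: 0.10)
    @growth_threshold = growth_threshold
    @samples = []
  end

  def record(time, rss_kb)
    @samples << Sample.new(time.to_f, rss_kb.to_f)
  end

  # Slope of the fitted line in kB per second
  def slope
    n = @samples.size
    return 0.0 if n < 2

    mean_t = @samples.sum(&:time) / n
    mean_r = @samples.sum(&:rss_kb) / n
    covariance = @samples.sum { |s| (s.time - mean_t) * (s.rss_kb - mean_r) }
    variance = @samples.sum { |s| (s.time - mean_t)**2 }
    variance.zero? ? 0.0 : covariance / variance
  end

  # True when the fitted growth over the sampled period exceeds the threshold
  def leaking?
    return false if @samples.size < 2

    elapsed = @samples.last.time - @samples.first.time
    (slope * elapsed) / @samples.first.rss_kb > @growth_threshold
  end
end

steady = MemoryTrend.new
leaky = MemoryTrend.new
60.times do |minute|
  steady.record(minute * 60, 200_000 + (minute.even? ? 500 : -500))
  leaky.record(minute * 60, 200_000 + minute * 600)
end
```

Fitting a line rather than comparing the first and last samples makes the check less sensitive to a single garbage collection run or allocation burst at either end of the window.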
Reference
Metric Types
| Type | Description | Use Cases | Ruby Example |
|---|---|---|---|
| Counter | Monotonically increasing value | Request counts, error counts | statsd.increment('requests') |
| Gauge | Point-in-time measurement | Memory usage, queue depth | statsd.gauge('queue_size', 42) |
| Histogram | Distribution of values | Response time distribution | prometheus.histogram(:duration) |
| Timer | Duration measurement | Operation timing | statsd.timing('api_call', 150) |
| Summary | Statistical summary | Quantiles, percentiles | prometheus.summary(:size) |
Common Metrics
| Metric | Category | Description | Alert Threshold |
|---|---|---|---|
| Request Rate | Throughput | Requests per second | Sudden drop >50% |
| Error Rate | Reliability | Errors per total requests | >1% for critical paths |
| Response Time P95 | Performance | 95th percentile response time | >500ms typical |
| Response Time P99 | Performance | 99th percentile response time | >1000ms typical |
| Database Query Time | Performance | Time spent in database | >100ms per query |
| Memory Usage | Resources | RSS or heap size | >80% of available |
| GC Time Ratio | Resources | GC time / total time | >10% sustained |
| Queue Depth | Throughput | Pending job count | Growing over 1 hour |
| Active Connections | Resources | Open database connections | >80% of pool |
Ruby Profiling Tools
| Tool | Type | Overhead | Use Case | Output Format |
|---|---|---|---|---|
| ruby-prof | Comprehensive profiler | High | Development, detailed analysis | Flat, graph, call stack |
| stackprof | Sampling profiler | Low | Production profiling | Flame graphs, text |
| memory_profiler | Memory profiler | High | Memory leak investigation | Allocation reports |
| rack-mini-profiler | Request profiler | Medium | Web request analysis | HTML, flame graphs |
| benchmark | Simple timing | Minimal | Quick measurements | Text output |
APM Comparison
| Platform | Focus | Overhead | Pricing Model | Best For |
|---|---|---|---|---|
| New Relic | Full-stack APM | 2-5% | Per host | Enterprise, complex systems |
| Datadog | Infrastructure + APM | 1-3% | Per host + metrics | DevOps teams |
| AppSignal | Ruby/Elixir APM | <1% | Per app | Ruby shops, cost-conscious |
| Skylight | Rails optimization | <1% | Per app | Rails performance tuning |
| Scout | Developer-focused | 1-2% | Per app | Smaller teams |
Instrumentation Patterns
| Pattern | Implementation | Pros | Cons |
|---|---|---|---|
| Middleware | Rack middleware layer | Centralized, automatic | HTTP requests only |
| Decorator | Wrap classes | Flexible, testable | More code |
| AOP | Method wrapping | Minimal code changes | Magic, harder to debug |
| Notifications | Event subscription | Decoupled | Indirect flow |
| Manual | Explicit calls | Full control | Verbose, error-prone |
Performance Baselines
| Metric | Good | Acceptable | Poor | Critical |
|---|---|---|---|---|
| Web Response Time (P95) | <200ms | <500ms | <1000ms | >1000ms |
| API Response Time (P95) | <100ms | <300ms | <800ms | >800ms |
| Background Job Duration | <30s | <120s | <300s | >300s |
| Database Query Time | <10ms | <50ms | <100ms | >100ms |
| Error Rate | <0.1% | <1% | <5% | >5% |
| Memory per Process | <200MB | <500MB | <1GB | >1GB |
| GC Pause Time | <10ms | <50ms | <100ms | >100ms |
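
The baseline bands can be applied programmatically. The sketch below copies a few rows of the table above into a hypothetical `BASELINES` constant and rates a measurement against them; values exactly on the last boundary are treated as critical, which the table leaves unspecified.

```ruby
# Upper bounds for the good, acceptable, and poor bands, per metric
BASELINES = {
  web_response_p95_ms: [200, 500, 1000],
  api_response_p95_ms: [100, 300, 800],
  db_query_ms: [10, 50, 100],
  error_rate_pct: [0.1, 1, 5]
}.freeze
RATINGS = %i[good acceptable poor].freeze

# Returns the first band whose upper bound exceeds the value, else :critical
def rate_metric(metric, value)
  bounds = BASELINES.fetch(metric)
  index = bounds.index { |bound| value < bound }
  index ? RATINGS[index] : :critical
end

ratings = {
  web: rate_metric(:web_response_p95_ms, 180),
  api: rate_metric(:api_response_p95_ms, 450),
  db: rate_metric(:db_query_ms, 120),
  errors: rate_metric(:error_rate_pct, 0.5)
}
```

These thresholds are starting points; each system should replace them with values derived from its own measured baseline.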
Alert Configuration
| Alert Type | Condition | Evaluation Window | Notification |
|---|---|---|---|
| Error Rate Spike | >2x baseline | 5 minutes | Immediate |
| Response Time Degradation | P95 >1.5x baseline | 10 minutes | Warning |
| Memory Leak | Steady growth >10% | 1 hour | Warning |
| Queue Buildup | Growing >15 min | 15 minutes | Warning |
| Service Down | 0 successful requests | 2 minutes | Immediate |
| Database Slow | >50% queries >100ms | 5 minutes | Warning |
| Disk Full | >90% used | 1 minute | Immediate |
Monitoring Stack Components
| Component | Purpose | Examples | Integration Point |
|---|---|---|---|
| Instrumentation | Emit metrics | StatsD, OpenTelemetry | Application code |
| Collection | Receive metrics | StatsD daemon, collectors | Network endpoint |
| Storage | Store time-series data | Prometheus, InfluxDB | Collector target |
| Visualization | Display metrics | Grafana, Kibana | Query storage |
| Alerting | Notify on conditions | Alertmanager, PagerDuty | Alert rules |
| Tracing | Track request flow | Jaeger, Zipkin | Trace context propagation |