Overview
Resource monitoring measures and tracks the consumption of system resources by applications and processes. This practice provides visibility into CPU utilization, memory allocation, disk operations, network traffic, and other system metrics. Resource monitoring serves multiple purposes: detecting performance degradation, identifying bottlenecks, capacity planning, and troubleshooting production issues.
Modern applications operate in complex environments with multiple interdependencies. A web application might consume database connections, cache memory, background job queues, and external API quotas. Without proper monitoring, resource exhaustion manifests as mysterious failures, degraded performance, or complete service outages. Resource monitoring transforms these opaque failures into observable, measurable events.
The practice encompasses several monitoring layers. Operating system metrics track CPU cycles, memory pages, disk sectors, and network packets. Process-level monitoring measures individual application resource consumption. Application-level monitoring examines domain-specific resources like database connection pools, thread pools, and cache hit rates. Infrastructure monitoring covers distributed system resources across multiple hosts.
# Basic system resource snapshot
require 'sys/cpu'
require 'sys/filesystem'
cpu_usage = Sys::CPU.load_avg
disk_stats = Sys::Filesystem.stat('/')
memory_info = `free -m`.split("\n")[1].split
puts "CPU Load: #{cpu_usage[0]}"
puts "Disk Usage: #{(disk_stats.bytes_used.to_f / disk_stats.bytes_total * 100).round(2)}%"
puts "Memory Usage: #{memory_info[2].to_i} / #{memory_info[1].to_i} MB"
Resource monitoring operates through periodic sampling or continuous observation. Sampling captures resource states at fixed intervals, trading accuracy for lower overhead. Continuous monitoring tracks every resource event but consumes more resources itself. The monitoring approach affects both the accuracy of measurements and the performance impact on monitored systems.
Key Principles
Resource monitoring builds on several foundational concepts that define how systems observe and report resource consumption. Understanding these principles clarifies the mechanisms behind monitoring implementations and the trade-offs involved in different approaches.
Metrics Collection: Monitoring systems gather quantitative measurements of resource usage. Metrics represent point-in-time values (gauges), cumulative counts (counters), or distributions of measurements (histograms). A gauge records current memory usage. A counter tracks total requests processed. A histogram captures response time distribution. Each metric type serves specific analytical purposes and requires different storage strategies.
Sampling Intervals: Monitoring systems collect metrics at defined frequencies. Shorter intervals provide higher resolution but increase collection overhead. A one-second interval captures rapid fluctuations but generates 86,400 data points daily per metric. A one-minute interval cuts that to 1,440 points, a reduction of about 98%, but misses short-lived spikes. The sampling interval creates a trade-off between temporal resolution and resource consumption by the monitoring system itself.
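The data-volume figures above follow from simple arithmetic. A minimal sketch, where `points_per_day` is a hypothetical helper introduced only for illustration:

```ruby
# Samples produced per metric per day at a given interval (in seconds).
def points_per_day(interval_seconds)
  (24 * 60 * 60) / interval_seconds
end

one_second = points_per_day(1)
one_minute = points_per_day(60)
reduction_percent = (1 - one_minute.to_f / one_second) * 100

puts "1-second interval: #{one_second} points/day"
puts "1-minute interval: #{one_minute} points/day"
puts "Volume reduction:  #{reduction_percent.round(1)}%"
```

Multiplying the per-metric figure by the number of metrics and label combinations gives a rough estimate of total collection volume.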
Aggregation: Raw metrics generate excessive data volume for long-term storage. Aggregation computes summary statistics over time windows: averages, maximums, minimums, percentiles. A system might retain one-second resolution for one hour, one-minute resolution for one day, and hourly averages for one year. This hierarchical aggregation balances data retention costs against analytical requirements.
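Windowed aggregation can be sketched in a few lines. The `downsample` helper and the sample data are illustrative assumptions, not part of any particular monitoring library:

```ruby
# Roll raw samples up into fixed-size time windows.
# Each sample is a hash: { timestamp: Integer (epoch seconds), value: Numeric }.
def downsample(samples, window_seconds)
  samples
    .group_by { |s| s[:timestamp] - (s[:timestamp] % window_seconds) }
    .sort_by { |window_start, _| window_start }
    .map do |window_start, group|
      values = group.map { |s| s[:value] }
      {
        window_start: window_start,
        count: values.size,
        min: values.min,
        max: values.max,
        avg: values.sum.to_f / values.size
      }
    end
end

# 120 one-second samples with a single spike at t = 75
raw = (0...120).map { |t| { timestamp: t, value: t == 75 ? 95 : 20 } }
per_minute = downsample(raw, 60)
per_minute.each { |window| puts window }
```

In the second window the spike survives in `max` but barely moves `avg`, which is one reason rollups often keep maximums or percentiles alongside averages.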
Thresholds and Alerting: Monitoring systems compare metrics against defined limits. Threshold breaches trigger notifications, automated responses, or escalations. Simple thresholds check absolute values: alert when CPU exceeds 80%. Complex thresholds evaluate trends, rates of change, or relationships between metrics. Threshold configuration balances sensitivity to problems against false positive rates.
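The difference between simple and more complex thresholds can be illustrated with three small predicates. The method names, limits, and sample values below are hypothetical, chosen only to show how each rule reacts to a different shape of data:

```ruby
# Absolute threshold: fires when the latest value exceeds a fixed limit.
def absolute_breach?(values, limit)
  values.last > limit
end

# Rate-of-change threshold: fires when the value rose by more than
# max_delta between the first and last sample of the window.
def rate_breach?(values, max_delta)
  values.last - values.first > max_delta
end

# Sustained threshold: fires only when every sample in the window exceeds
# the limit, which suppresses alerts on momentary spikes.
def sustained_breach?(values, limit)
  values.all? { |v| v > limit }
end

# CPU percentages, oldest first
spike     = [40, 42, 91, 44, 43]
sustained = [82, 85, 88, 90, 93]
climbing  = [30, 45, 60, 72, 79]

puts "spike, sustained > 80%:     #{sustained_breach?(spike, 80)}"
puts "sustained, sustained > 80%: #{sustained_breach?(sustained, 80)}"
puts "climbing, absolute > 80%:   #{absolute_breach?(climbing, 80)}"
puts "climbing, rise > 40 points: #{rate_breach?(climbing, 40)}"
```

The climbing series never crosses 80%, so only the rate-of-change rule catches it. Requiring a sustained breach is one way to reduce false positives from brief spikes.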
Resource Attribution: Monitoring attributes resource consumption to specific processes, users, or operations. Attribution identifies which components consume resources and enables targeted optimization. Process-level attribution tracks memory per process ID. Request-level attribution measures resource consumption per API endpoint. Without attribution, aggregate metrics obscure the source of resource pressure.
# Metric collection with timestamp and labels
class MetricCollector
def initialize
@metrics = []
end
def record(name, value, labels = {})
@metrics << {
name: name,
value: value,
timestamp: Time.now.to_i,
labels: labels
}
end
def gauge(name, value, labels = {})
record(name, value, labels.merge(type: 'gauge'))
end
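def counter(name, increment = 1, labels = {})
typed_labels = labels.merge(type: 'counter')
# Look up with the stored label set (including :type) so the running total carries forward
current = find_metric(name, typed_labels) || 0
record(name, current + increment, typed_labels)
end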
def histogram(name, value, labels = {})
record(name, value, labels.merge(type: 'histogram'))
end
private
def find_metric(name, labels)
metric = @metrics.reverse.find { |m| m[:name] == name && m[:labels] == labels }
metric ? metric[:value] : nil
end
end
collector = MetricCollector.new
collector.gauge('memory.used', 4_200_000_000, service: 'api')
collector.counter('requests.total', 1, endpoint: '/users')
collector.histogram('response.time', 0.145, endpoint: '/users')
Overhead Management: Monitoring consumes resources: CPU for metric collection, memory for buffering, network bandwidth for transmission, disk for storage. High-frequency monitoring of many metrics creates measurable overhead. A production system might allocate 1-5% of resources to monitoring infrastructure. Exceeding this budget degrades application performance, creating a feedback loop where monitoring itself causes the problems it aims to detect.
Time Series Data: Resource metrics form time series: sequences of values indexed by timestamp. Time series analysis identifies trends, seasonality, anomalies, and correlations. A metric showing gradual memory growth indicates a leak. Periodic CPU spikes correlate with scheduled batch jobs. Time series storage systems optimize for append-heavy workloads and time-range queries characteristic of monitoring data.
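Gradual growth of the kind described above can be flagged by fitting a straight line to recent samples and inspecting its slope. The sketch below uses ordinary least squares; `trend_slope` and the sample data are hypothetical, and a real detector would also need to account for noise and seasonality:

```ruby
# Least-squares slope of a series of [timestamp_seconds, value] pairs,
# expressed in value units per second.
def trend_slope(points)
  n = points.size.to_f
  mean_t = points.sum { |t, _| t } / n
  mean_v = points.sum { |_, v| v } / n
  covariance = points.sum { |t, v| (t - mean_t) * (v - mean_v) }
  variance = points.sum { |t, _| (t - mean_t)**2 }
  return 0.0 if variance.zero?

  covariance / variance
end

# Memory in MB sampled once a minute, growing by 2 MB per sample
leaking = (0...10).map { |i| [i * 60, 500.0 + 2 * i] }
steady  = (0...10).map { |i| [i * 60, 500.0] }

puts "Leaking: #{(trend_slope(leaking) * 3600).round(1)} MB/hour"
puts "Steady:  #{(trend_slope(steady) * 3600).round(1)} MB/hour"
```

A persistently positive slope over a long window suggests a leak, while a slope near zero suggests stable usage.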
Ruby Implementation
Ruby provides multiple approaches to resource monitoring through standard library modules, system calls, and third-party gems. The implementation strategy depends on the scope of monitoring: single process, multiple processes, or distributed systems.
Process Resource Monitoring: Ruby's Process module exposes basic resource information for the current process. The Process.clock_gettime method measures various time metrics, while operating system gems provide detailed resource statistics.
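require 'etc'
require 'sys/proctable'
# Get current process information.
# Field units are platform-specific; the conversions below assume Linux,
# where utime/stime are reported in clock ticks and rss in memory pages.
ps_info = Sys::ProcTable.ps(pid: Process.pid)
ticks_per_second = Etc.sysconf(Etc::SC_CLK_TCK)
page_size = Etc.sysconf(Etc::SC_PAGESIZE)
puts "Process: #{ps_info.pid} (#{ps_info.comm})"
puts "CPU Time: #{(ps_info.utime + ps_info.stime).to_f / ticks_per_second} seconds"
puts "Memory: #{ps_info.rss * page_size / 1024 / 1024} MB"
puts "Threads: #{ps_info.nlwp}"
puts "Priority: #{ps_info.priority}"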
# Monitor resource usage over time
start_time = Process.clock_gettime(Process::CLOCK_MONOTONIC)
start_cpu = Process.times
# Perform work
1_000_000.times { |i| i ** 2 }
end_time = Process.clock_gettime(Process::CLOCK_MONOTONIC)
end_cpu = Process.times
puts "Wall time: #{(end_time - start_time).round(3)}s"
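# Count both user and system time as CPU time
cpu_seconds = (end_cpu.utime + end_cpu.stime) - (start_cpu.utime + start_cpu.stime)
puts "CPU time: #{cpu_seconds.round(3)}s"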
Memory Profiling: Ruby's ObjectSpace module enables memory analysis by inspecting allocated objects. This provides application-level memory monitoring distinct from operating system memory metrics.
require 'objspace'
# Take memory snapshot
before = ObjectSpace.memsize_of_all
# Create objects
array = Array.new(100_000) { |i| "string_#{i}" }
after = ObjectSpace.memsize_of_all
puts "Memory allocated: #{(after - before) / 1024 / 1024} MB"
# Count objects by class
object_counts = ObjectSpace.count_objects
puts "Total objects: #{object_counts[:TOTAL]}"
puts "Strings: #{object_counts[:T_STRING]}"
puts "Arrays: #{object_counts[:T_ARRAY]}"
puts "Hashes: #{object_counts[:T_HASH]}"
# Detailed object tracking
ObjectSpace.trace_object_allocations_start
users = Array.new(1000) { { name: "User", email: "user@example.com" } }
ObjectSpace.trace_object_allocations_stop
# Find allocation source
sample_object = users.first
file = ObjectSpace.allocation_sourcefile(sample_object)
line = ObjectSpace.allocation_sourceline(sample_object)
puts "Allocated at #{file}:#{line}"
System Metrics Collection: The sys-proctable, sys-cpu, and sys-filesystem gems provide cross-platform access to system metrics. These gems wrap platform-specific system calls into consistent Ruby interfaces.
require 'sys/cpu'
require 'sys/filesystem'
require 'sys/proctable'
class SystemMonitor
def cpu_stats
{
load_average: Sys::CPU.load_avg,
num_cpus: Sys::CPU.num_cpu,
cpu_freq: Sys::CPU.cpu_freq
}
end
def memory_stats
total = `grep MemTotal /proc/meminfo`.split[1].to_i
available = `grep MemAvailable /proc/meminfo`.split[1].to_i
{
total_kb: total,
available_kb: available,
used_kb: total - available,
usage_percent: ((total - available).to_f / total * 100).round(2)
}
end
def disk_stats(mount_point = '/')
stat = Sys::Filesystem.stat(mount_point)
{
mount_point: mount_point,
total_bytes: stat.bytes_total,
used_bytes: stat.bytes_used,
available_bytes: stat.bytes_available,
usage_percent: (stat.bytes_used.to_f / stat.bytes_total * 100).round(2)
}
end
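def process_list
Sys::ProcTable.ps.map do |proc|
{
pid: proc.pid,
name: proc.comm,
cpu_percent: calculate_cpu_percent(proc),
memory_mb: proc.rss * page_size / 1024 / 1024,
state: proc.state
}
end.sort_by { |p| p[:memory_mb] }.reverse.take(10)
end
private
# The helpers below assume Linux, where sys-proctable reports rss in pages
# and utime, stime, and starttime in clock ticks.
def page_size
@page_size ||= `getconf PAGESIZE`.to_i
end
def ticks_per_second
@ticks_per_second ||= `getconf CLK_TCK`.to_f
end
# Average CPU share over the lifetime of the process
def calculate_cpu_percent(proc)
uptime = File.read('/proc/uptime').split.first.to_f
elapsed = uptime - proc.starttime / ticks_per_second
return 0.0 unless elapsed.positive?
((proc.utime + proc.stime) / ticks_per_second / elapsed * 100).round(2)
end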
end
monitor = SystemMonitor.new
puts monitor.cpu_stats
puts monitor.memory_stats
puts monitor.disk_stats
Custom Metric Instrumentation: Application-specific monitoring requires instrumentation code that records domain metrics. Ruby applications typically instrument critical paths with timing and counting logic.
class MetricInstrumentor
def initialize
@metrics = Hash.new { |h, k| h[k] = [] }
end
def time(metric_name, labels = {})
start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
result = yield
duration = Process.clock_gettime(Process::CLOCK_MONOTONIC) - start
record(metric_name, duration, labels.merge(unit: 'seconds'))
result
end
def increment(metric_name, value = 1, labels = {})
record(metric_name, value, labels)
end
def gauge(metric_name, value, labels = {})
@metrics[metric_name] = [{ value: value, labels: labels, timestamp: Time.now }]
end
def record(metric_name, value, labels = {})
@metrics[metric_name] << {
value: value,
labels: labels,
timestamp: Time.now
}
end
def summary(metric_name)
values = @metrics[metric_name].map { |m| m[:value] }
return {} if values.empty?
sorted = values.sort
{
count: values.size,
sum: values.sum,
mean: values.sum / values.size.to_f,
min: sorted.first,
max: sorted.last,
p50: percentile(sorted, 0.5),
p95: percentile(sorted, 0.95),
p99: percentile(sorted, 0.99)
}
end
private
def percentile(sorted_values, p)
index = (sorted_values.size * p).ceil - 1
sorted_values[index]
end
end
instrumentor = MetricInstrumentor.new
# Instrument database queries
instrumentor.time('db.query', table: 'users') do
sleep(0.1) # Simulate query
end
# Track cache hits/misses
instrumentor.increment('cache.hits', 1, cache: 'redis')
instrumentor.increment('cache.misses', 1, cache: 'redis')
# Record queue depth
instrumentor.gauge('queue.depth', 42, queue: 'background_jobs')
puts instrumentor.summary('db.query')
Periodic Metric Collection: Long-running monitoring requires scheduled metric collection. Ruby applications use threads, fibers, or separate processes to collect metrics without blocking main application logic.
require 'concurrent'
require 'objspace' # needed for ObjectSpace.memsize_of_all
class PeriodicMonitor
def initialize(interval: 60)
@interval = interval
@running = false
@collectors = []
end
def add_collector(name, &block)
@collectors << { name: name, block: block }
end
def start
@running = true
@task = Concurrent::TimerTask.new(execution_interval: @interval) do
collect_metrics
end
@task.execute
end
def stop
@running = false
@task&.shutdown
end
private
def collect_metrics
@collectors.each do |collector|
begin
metrics = collector[:block].call
report_metrics(collector[:name], metrics)
rescue => e
puts "Error collecting #{collector[:name]}: #{e.message}"
end
end
end
def report_metrics(name, metrics)
puts "[#{Time.now}] #{name}: #{metrics}"
end
end
monitor = PeriodicMonitor.new(interval: 5)
monitor.add_collector('memory') do
{
objects: ObjectSpace.count_objects[:TOTAL],
size_mb: ObjectSpace.memsize_of_all / 1024 / 1024
}
end
monitor.add_collector('gc') do
stats = GC.stat
{
count: stats[:count],
# GC.stat on current Ruby versions has no :heap_used or :heap_free keys
heap_live_slots: stats[:heap_live_slots],
heap_free_slots: stats[:heap_free_slots]
}
end
monitor.start
sleep(20)
monitor.stop
Implementation Approaches
Resource monitoring strategies vary based on monitoring scope, infrastructure architecture, and operational requirements. Different approaches balance accuracy, overhead, complexity, and cost.
Push-Based Monitoring: Applications actively send metrics to a central collection service. Each monitored process runs instrumentation code that periodically transmits metrics to collectors like StatsD, the Prometheus Pushgateway, or cloud monitoring services. Push-based systems work well for short-lived processes, serverless functions, and batch jobs that terminate before scrapers can collect metrics. The application controls transmission timing and can batch metrics to reduce network overhead. However, push-based monitoring requires reliable network connectivity and couples applications to the monitoring infrastructure.
require 'socket'
class StatsDClient
def initialize(host: 'localhost', port: 8125)
@socket = UDPSocket.new
@host = host
@port = port
end
def gauge(metric, value, tags = {})
send_metric("#{metric}:#{value}|g#{format_tags(tags)}")
end
def counter(metric, value = 1, tags = {})
send_metric("#{metric}:#{value}|c#{format_tags(tags)}")
end
def timing(metric, duration_ms, tags = {})
send_metric("#{metric}:#{duration_ms}|ms#{format_tags(tags)}")
end
def histogram(metric, value, tags = {})
send_metric("#{metric}:#{value}|h#{format_tags(tags)}")
end
private
def send_metric(data)
@socket.send(data, 0, @host, @port)
rescue => e
# Silently fail to avoid impacting application
nil
end
def format_tags(tags)
return '' if tags.empty?
tag_string = tags.map { |k, v| "#{k}:#{v}" }.join(',')
"|##{tag_string}"
end
end
# Usage in application code
statsd = StatsDClient.new
statsd.timing('api.request', 145, endpoint: '/users', method: 'GET')
statsd.counter('db.queries', 1, table: 'orders')
statsd.gauge('queue.size', 23, queue: 'mailers')
Pull-Based Monitoring: Monitoring systems scrape metrics from application endpoints at regular intervals. Applications expose metrics via HTTP endpoints that return current metric values. Prometheus popularized this approach with its scraping model. Pull-based monitoring centralizes collection timing and reduces application complexity. Applications only maintain current metric state rather than managing transmission logic. The monitoring system discovers targets, controls scraping frequency, and handles failures. Pull-based systems struggle in highly dynamic environments where monitored processes come and go faster than the scrape interval. They also require network policies that allow the scraper to reach every monitored service.
require 'objspace' # needed for ObjectSpace.memsize_of_all
require 'webrick'
class MetricsEndpoint
def initialize(port: 9394)
@port = port
@metrics = {}
@mutex = Mutex.new
end
def set_gauge(name, value, labels = {})
@mutex.synchronize do
key = metric_key(name, labels)
@metrics[key] = { type: 'gauge', value: value }
end
end
def increment_counter(name, value = 1, labels = {})
@mutex.synchronize do
key = metric_key(name, labels)
@metrics[key] ||= { type: 'counter', value: 0 }
@metrics[key][:value] += value
end
end
def start
server = WEBrick::HTTPServer.new(Port: @port)
server.mount_proc '/metrics' do |req, res|
res.body = format_metrics
res.content_type = 'text/plain'
end
trap('INT') { server.shutdown }
server.start
end
private
def metric_key(name, labels)
label_str = labels.sort.map { |k, v| "#{k}=\"#{v}\"" }.join(',')
label_str.empty? ? name : "#{name}{#{label_str}}"
end
def format_metrics
@mutex.synchronize do
@metrics.map do |key, data|
"#{key} #{data[:value]}"
end.join("\n")
end
end
end
# Background thread updating metrics
endpoint = MetricsEndpoint.new
Thread.new do
loop do
endpoint.set_gauge('memory_bytes', ObjectSpace.memsize_of_all)
endpoint.set_gauge('gc_runs', GC.stat[:count])
endpoint.increment_counter('requests_total', 1, path: '/api/users')
sleep(1)
end
end
endpoint.start
Agent-Based Monitoring: Dedicated monitoring agents run alongside applications on the same host, collecting system and application metrics. Agents aggregate metrics locally before forwarding to central systems, reducing network traffic and providing local caching during network outages. This approach separates monitoring concerns from application code. Applications write metrics to local files, sockets, or shared memory, and agents handle collection, aggregation, and transmission. Agents add operational complexity through additional process management, configuration, and resource consumption.
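The local aggregation step can be sketched as follows. `LocalAggregator` is a hypothetical class, and the `name:value|c` line format is borrowed from the StatsD counter syntax used elsewhere in this document; real agents handle many more metric types and transports:

```ruby
# Sums raw counter lines locally so that only one line per counter
# is forwarded at the end of each flush interval.
class LocalAggregator
  def initialize
    @counters = Hash.new(0)
  end

  # Accepts a line such as "requests:1|c"; anything else is ignored.
  def ingest(line)
    name, rest = line.strip.split(':', 2)
    return unless rest

    value, type = rest.split('|', 2)
    @counters[name] += Integer(value) if type == 'c'
  rescue ArgumentError
    nil # malformed numeric value
  end

  # Returns one aggregated line per counter and resets state.
  def flush
    lines = @counters.map { |name, total| "#{name}:#{total}|c" }
    @counters.clear
    lines
  end
end

aggregator = LocalAggregator.new
%w[requests:1|c requests:1|c requests:1|c errors:2|c].each do |line|
  aggregator.ingest(line)
end
puts aggregator.flush
```

Four incoming lines become two outgoing lines, which is the traffic reduction the paragraph above describes.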
Logging-Based Monitoring: Applications emit structured logs containing metric data. Log aggregation systems parse logs to extract metrics, treating logs as a unified stream of observability data. This approach requires no separate metric collection infrastructure but depends on log processing capacity. Logs provide rich context around metrics but consume more bandwidth and storage than dedicated metric protocols. The parsing overhead affects processing costs and introduces latency between event occurrence and metric availability.
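Metric extraction from structured logs can be sketched with the standard `json` library. The field names (`path`, `status`, `duration_ms`) and the `extract_metrics` helper are assumptions made for this example; a log aggregation platform would perform the equivalent step at scale:

```ruby
require 'json'

# Derives per-path request counts, error counts, and mean latency from
# structured log lines. Lines that are not JSON objects are skipped.
def extract_metrics(lines)
  stats = Hash.new { |h, k| h[k] = { count: 0, errors: 0, total_ms: 0.0 } }
  lines.each do |line|
    event = begin
      JSON.parse(line)
    rescue JSON::ParserError
      nil
    end
    next unless event.is_a?(Hash) && event['path']

    entry = stats[event['path']]
    entry[:count] += 1
    entry[:errors] += 1 if event['status'].to_i >= 500
    entry[:total_ms] += event['duration_ms'].to_f
  end
  stats.transform_values do |s|
    { count: s[:count], errors: s[:errors], mean_ms: s[:total_ms] / s[:count] }
  end
end

log_lines = [
  '{"path":"/users","status":200,"duration_ms":120}',
  '{"path":"/users","status":500,"duration_ms":340}',
  'plain text line without structure',
  '{"path":"/orders","status":200,"duration_ms":80}'
]

extract_metrics(log_lines).each { |path, metrics| puts "#{path}: #{metrics}" }
```

Every log line has to be parsed before a metric exists, which is the processing cost and latency the paragraph above refers to.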
Synthetic Monitoring: External systems probe application endpoints to measure availability and performance from a user perspective. Synthetic monitoring detects failures invisible to internal metrics: DNS resolution failures, TLS certificate expiration, geographic routing issues. Health check endpoints return application status, dependency health, and resource availability. Synthetic monitoring complements internal metrics by validating external accessibility.
require 'json'
require 'time' # Time#iso8601
require 'webrick'
require 'sys/filesystem'
class HealthCheckServer
def initialize(port: 8080)
@port = port
@checks = {}
end
def add_check(name, &block)
@checks[name] = block
end
def start
server = WEBrick::HTTPServer.new(Port: @port)
server.mount_proc '/health' do |req, res|
results = run_checks
res.status = results[:healthy] ? 200 : 503
res.body = JSON.generate(results)
res.content_type = 'application/json'
end
server.start
end
private
def run_checks
check_results = {}
healthy = true
@checks.each do |name, check|
begin
result = check.call
check_results[name] = result
healthy = false unless result[:status] == 'ok'
rescue => e
check_results[name] = { status: 'error', message: e.message }
healthy = false
end
end
{
healthy: healthy,
checks: check_results,
timestamp: Time.now.iso8601
}
end
end
health = HealthCheckServer.new
health.add_check('database') do
# Simulate database check
{ status: 'ok', latency_ms: 12 }
end
health.add_check('redis') do
{ status: 'ok', latency_ms: 3 }
end
health.add_check('disk_space') do
stat = Sys::Filesystem.stat('/')
usage = (stat.bytes_used.to_f / stat.bytes_total * 100).round(2)
if usage > 90
{ status: 'critical', usage_percent: usage }
elsif usage > 75
{ status: 'warning', usage_percent: usage }
else
{ status: 'ok', usage_percent: usage }
end
end
health.start
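The server above is the target of a synthetic check; the probing side runs elsewhere and calls it over the network. A minimal sketch of such a probe using the standard `net/http` library follows. The `probe` helper, its return shape, and the example URL are assumptions for illustration:

```ruby
require 'net/http'
require 'uri'

# Requests a URL from the outside and reports status and latency.
# Network-level failures (DNS, refused connection, timeout, TLS) are
# reported as unhealthy instead of being raised.
def probe(url, timeout: 2)
  uri = URI(url)
  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  response = Net::HTTP.start(uri.host, uri.port,
                             use_ssl: uri.scheme == 'https',
                             open_timeout: timeout,
                             read_timeout: timeout) do |http|
    http.get(uri.request_uri)
  end
  elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
  status = response.code.to_i
  { healthy: status == 200, status: status, latency_ms: (elapsed * 1000).round(1) }
rescue StandardError => e
  { healthy: false, error: e.class.name }
end

# Example call (hypothetical URL, left commented out to avoid network access):
# puts probe('http://localhost:8080/health')
```

Running the probe on a schedule from several locations, and recording its results as metrics, covers failure modes that never reach the application, such as DNS or TLS problems.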
Tools & Ecosystem
Ruby's monitoring ecosystem includes system-level tools, application performance monitoring solutions, and cloud-native monitoring integrations. Selecting appropriate tools depends on deployment environment, scale requirements, and existing infrastructure.
System Monitoring Gems: Several gems provide access to operating system metrics. The sys-proctable gem offers process table information across platforms. The sys-cpu gem exposes CPU statistics. The sys-filesystem gem provides filesystem metrics. These gems wrap platform-specific system calls, providing consistent interfaces across Linux, macOS, and Windows.
Application Performance Monitoring: APM platforms provide comprehensive monitoring including distributed tracing, error tracking, and performance profiling. New Relic, Datadog, Scout APM, and AppSignal offer Ruby agents that automatically instrument common frameworks like Rails and Sinatra. These agents inject monitoring code into application execution paths, capturing request traces, database queries, cache operations, and external API calls.
# New Relic configuration example
require 'newrelic_rpm'
class ApplicationController
# Automatic instrumentation via agent
def process_order
# Custom instrumentation
NewRelic::Agent.record_custom_event('OrderProcessed', {
order_id: order.id,
amount: order.total,
processing_time: elapsed_time
})
end
end
# Manual transaction tracing
NewRelic::Agent::Tracer.in_transaction(category: :task, name: 'DataImport') do
perform_import
end
Prometheus Integration: Prometheus has become a standard for metrics collection in containerized environments. The prometheus-client gem provides Ruby integration, exposing metrics in Prometheus format.
require 'prometheus/client'
require 'prometheus/client/formats/text'
# Initialize registry
prometheus = Prometheus::Client.registry
# Define metrics
http_requests = prometheus.counter(
:http_requests_total,
docstring: 'Total HTTP requests',
labels: [:method, :path, :status]
)
response_duration = prometheus.histogram(
:http_response_duration_seconds,
docstring: 'Response duration in seconds',
labels: [:method, :path]
)
active_connections = prometheus.gauge(
:http_active_connections,
docstring: 'Currently active connections'
)
# Record metrics
http_requests.increment(labels: { method: 'GET', path: '/users', status: 200 })
response_duration.observe(0.234, labels: { method: 'GET', path: '/users' })
active_connections.set(42)
# Expose metrics endpoint
require 'webrick'
server = WEBrick::HTTPServer.new(Port: 9394)
server.mount_proc '/metrics' do |req, res|
res.body = Prometheus::Client::Formats::Text.marshal(prometheus)
res.content_type = Prometheus::Client::Formats::Text::CONTENT_TYPE
end
server.start
StatsD Integration: StatsD provides a simple UDP-based protocol for metric collection. The statsd-ruby gem offers a lightweight client for pushing metrics to StatsD servers.
require 'statsd' # provided by the statsd-ruby gem
statsd = Statsd.new('localhost', 8125)
# Timing blocks
statsd.time('db.query') do
database.execute(query)
end
# Manual timing
start = Time.now
perform_operation
statsd.timing('operation.duration', ((Time.now - start) * 1000).to_i)
# Counters and gauges
# statsd-ruby has no tag support, so dimensions are encoded in the metric name
statsd.increment('api.requests.users.get')
statsd.gauge('queue.depth', background_queue.size)
# Batching for efficiency
statsd.batch do |s|
s.increment('page.views')
s.timing('page.load', 320)
s.gauge('active.users', 1523)
end
OpenTelemetry: OpenTelemetry provides vendor-neutral observability instrumentation. The opentelemetry-sdk and framework-specific gems enable distributed tracing and metrics collection.
require 'opentelemetry/sdk'
require 'opentelemetry/instrumentation/all'
OpenTelemetry::SDK.configure do |c|
c.service_name = 'my-ruby-app'
c.use_all
end
# Manual span creation
tracer = OpenTelemetry.tracer_provider.tracer('my-app')
tracer.in_span('process_request') do |span|
span.set_attribute('user.id', user_id)
span.set_attribute('request.size', request_body.bytesize)
process_request(request_body)
end
# Automatic instrumentation for frameworks
# Rails, Sinatra, Rack instrumented automatically
Cloud Provider Monitoring: AWS CloudWatch, Google Cloud Monitoring, and Azure Monitor provide integrated monitoring for cloud deployments. Ruby SDKs enable custom metric publication and log streaming.
require 'aws-sdk-cloudwatch'
cloudwatch = Aws::CloudWatch::Client.new(region: 'us-east-1')
# Publish custom metrics
cloudwatch.put_metric_data({
namespace: 'MyApp',
metric_data: [
{
metric_name: 'ProcessedOrders',
value: 42,
unit: 'Count',
timestamp: Time.now,
dimensions: [
{ name: 'Environment', value: 'production' }
]
},
{
metric_name: 'ResponseTime',
value: 0.234,
unit: 'Seconds',
timestamp: Time.now
}
]
})
Logging Integration: Structured logging libraries like semantic_logger and lograge format logs for parsing by monitoring systems. JSON-formatted logs enable metric extraction from log aggregation platforms.
require 'semantic_logger'
SemanticLogger.add_appender(io: $stdout, formatter: :json)
logger = SemanticLogger['MyApp']
# Structured logging with metrics
logger.info('Request processed',
method: 'GET',
path: '/api/users',
status: 200,
duration_ms: 145,
db_queries: 3
)
# Metric extraction from logs
logger.measure_info('Database query',
metric: 'db/query/duration',
payload: { table: 'users' }
) do
database.query('SELECT * FROM users')
end
Practical Examples
Resource monitoring implementations vary based on application architecture and operational requirements. These examples demonstrate monitoring patterns for common scenarios.
Web Application Request Monitoring: Web applications require per-request metrics including response time, throughput, error rates, and resource consumption. Middleware captures metrics transparently without modifying application code.
require 'rack'
class RequestMonitoringMiddleware
def initialize(app, metrics_client)
@app = app
@metrics = metrics_client
end
def call(env)
# Monotonic clock avoids skew from wall-clock adjustments
start_time = Process.clock_gettime(Process::CLOCK_MONOTONIC)
tags = { method: env['REQUEST_METHOD'], path: sanitize_path(env['PATH_INFO']) }
begin
status, headers, body = @app.call(env)
duration_ms = (Process.clock_gettime(Process::CLOCK_MONOTONIC) - start_time) * 1000
@metrics.timing('http.request.duration', duration_ms.round(2), tags.merge(status: status))
# StatsDClient (defined earlier) exposes counter(metric, value, tags), not increment
@metrics.counter('http.request.count', 1, tags.merge(status: status))
[status, headers, body]
rescue => e
@metrics.counter('http.request.errors', 1, tags.merge(error_class: e.class.name))
raise
end
end
private
def sanitize_path(path)
# Remove IDs to reduce cardinality
path.gsub(/\/\d+/, '/:id')
end
end
# Rails integration
class Application < Rails::Application
config.middleware.use RequestMonitoringMiddleware, StatsDClient.new
end
Background Job Monitoring: Background job systems require monitoring for queue depth, processing time, failure rates, and retry counts. This visibility identifies performance bottlenecks and capacity requirements.
class MonitoredJob
def self.perform(*args)
job_name = name
start_time = Process.clock_gettime(Process::CLOCK_MONOTONIC)
metrics.gauge('jobs.queue.depth', queue_depth, job: job_name)
# StatsDClient (defined earlier) exposes counter(metric, value, tags), not increment
metrics.counter('jobs.started', 1, job: job_name)
begin
result = perform_work(*args)
duration_ms = (Process.clock_gettime(Process::CLOCK_MONOTONIC) - start_time) * 1000
metrics.timing('jobs.duration', duration_ms.round(2), job: job_name)
metrics.counter('jobs.completed', 1, job: job_name, status: 'success')
result
rescue => e
metrics.counter('jobs.completed', 1, job: job_name, status: 'failed')
metrics.counter('jobs.errors', 1, job: job_name, error: e.class.name)
raise
end
end
def self.perform_work(*args)
# Actual job logic
raise NotImplementedError
end
def self.metrics
@metrics ||= StatsDClient.new
end
def self.queue_depth
# Fetch from job queue system
0
end
end
class EmailDeliveryJob < MonitoredJob
def self.perform_work(user_id, template)
user = User.find(user_id)
EmailService.send(user.email, template)
end
end
Database Connection Pool Monitoring: Connection pools require monitoring to detect exhaustion, identify leak patterns, and tune pool sizing. These metrics inform capacity planning and optimization efforts.
require 'connection_pool'
class MonitoredConnectionPool
def initialize(pool)
@pool = pool
@metrics = StatsDClient.new
start_monitoring
end
def with(&block)
start_wait = Time.now
@pool.with do |connection|
wait_time = Time.now - start_wait
@metrics.timing('pool.checkout.wait', wait_time * 1000)
start_use = Time.now
result = block.call(connection)
use_time = Time.now - start_use
@metrics.timing('pool.connection.use', use_time * 1000)
result
end
end
private
def start_monitoring
Thread.new do
loop do
@metrics.gauge('pool.size', @pool.size)
@metrics.gauge('pool.available', @pool.available)
sleep(10)
end
end
end
end
# Usage
db_pool = ConnectionPool.new(size: 5, timeout: 5) do
DatabaseConnection.new
end
monitored_pool = MonitoredConnectionPool.new(db_pool)
monitored_pool.with do |conn|
conn.execute('SELECT * FROM users')
end
Memory Leak Detection: Long-running processes require memory monitoring to detect leaks early. Tracking object counts and memory growth patterns reveals memory management issues.
require 'objspace' # needed for ObjectSpace.memsize_of_all
class MemoryLeakDetector
def initialize(interval: 300)
@interval = interval
@baseline = nil
@samples = []
@threshold_percent = 20
end
def start
@running = true
Thread.new do
while @running
sample = collect_sample
analyze_sample(sample)
sleep(@interval)
end
end
end
def stop
@running = false
end
private
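def collect_sample
# Force a collection first so the sample reflects live objects only
GC.start
gc = GC.stat
{
timestamp: Time.now,
memory_bytes: ObjectSpace.memsize_of_all,
object_count: ObjectSpace.count_objects[:TOTAL],
gc_runs: gc[:count],
# GC.stat on current Ruby versions has no :heap_used or :heap_free keys
heap_live_slots: gc[:heap_live_slots],
heap_free_slots: gc[:heap_free_slots]
}
end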
def analyze_sample(sample)
@samples << sample
@samples.shift if @samples.size > 20
@baseline ||= sample
memory_growth = (sample[:memory_bytes] - @baseline[:memory_bytes]) /
@baseline[:memory_bytes].to_f * 100
if memory_growth > @threshold_percent
alert_leak_detected(sample, memory_growth)
end
log_sample(sample)
end
def alert_leak_detected(sample, growth)
puts "MEMORY LEAK DETECTED: #{growth.round(2)}% growth"
puts "Current: #{sample[:memory_bytes] / 1024 / 1024} MB"
puts "Baseline: #{@baseline[:memory_bytes] / 1024 / 1024} MB"
puts "Objects: #{sample[:object_count]}"
end
def log_sample(sample)
puts "[#{sample[:timestamp]}] Memory: #{sample[:memory_bytes] / 1024 / 1024} MB, " \
"Objects: #{sample[:object_count]}, GC runs: #{sample[:gc_runs]}"
end
end
detector = MemoryLeakDetector.new(interval: 60)
detector.start
Performance Considerations
Resource monitoring introduces overhead that affects application performance. Understanding performance characteristics enables informed decisions about monitoring strategies and configurations.
Collection Overhead: Metric collection consumes CPU cycles. Reading process statistics, traversing object graphs, and calculating aggregates require processing time. High-frequency collection amplifies overhead. A system sampling every second spends more CPU on monitoring than one sampling every minute. The overhead scales with the number of monitored metrics and the complexity of calculations. Simple counters impose minimal overhead. Histogram calculations involving sorting or percentile computation consume more resources. Production systems typically allocate 1-3% of CPU capacity to monitoring infrastructure.
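One way to check collection overhead against such a budget is to time the collector itself. The sketch below uses the standard `benchmark` library; `collection_cost` and `overhead_percent` are hypothetical helpers, and the measured numbers vary by machine and Ruby version:

```ruby
require 'benchmark'

# Average wall-clock seconds per call of the given collector block.
def collection_cost(iterations: 100)
  total = Benchmark.realtime { iterations.times { yield } }
  total / iterations
end

# Share of each sampling interval spent collecting, as a percentage.
def overhead_percent(cost_seconds, interval_seconds)
  cost_seconds / interval_seconds * 100
end

gc_cost = collection_cost { GC.stat }
objects_cost = collection_cost { ObjectSpace.count_objects }

puts "GC.stat:       #{(gc_cost * 1_000_000).round(1)} microseconds per call"
puts "count_objects: #{(objects_cost * 1_000_000).round(1)} microseconds per call"
puts "count_objects at a 1s interval: #{overhead_percent(objects_cost, 1.0).round(4)}% of one core"
```

A collector that takes 5 ms per run uses 0.5% of one core at a one-second interval, and far less at a one-minute interval.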
Memory Impact: Monitoring systems buffer metrics in memory before transmission or aggregation. Large metric sets or extended buffering intervals increase memory consumption. A system tracking 1,000 metrics with 1-second resolution generates 60,000 data points per minute. At 100 bytes per data point, this consumes 6 MB per minute of memory. Memory usage scales with metric cardinality: the number of unique label combinations. Labels like user IDs or request URLs create unbounded cardinality, potentially exhausting memory. Monitoring implementations should limit metric cardinality and implement buffer size limits.
class BoundedMetricBuffer
def initialize(max_size: 10_000)
@max_size = max_size
@buffer = []
@mutex = Mutex.new
@dropped_count = 0
end
def add(metric)
@mutex.synchronize do
if @buffer.size < @max_size
@buffer << metric
else
@dropped_count += 1
end
end
end
def flush
@mutex.synchronize do
metrics = @buffer.dup
@buffer.clear
if @dropped_count > 0
puts "WARNING: Dropped #{@dropped_count} metrics due to buffer overflow"
@dropped_count = 0
end
metrics
end
end
end
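The buffer above bounds the number of queued data points; cardinality can be bounded in a similar way. The following is a minimal sketch, assuming a hypothetical `CardinalityLimiter` class that caps the number of distinct label sets per metric and folds anything beyond the cap into a single overflow series.

```ruby
require 'set'

# Hypothetical guard that caps the number of distinct label sets per metric.
class CardinalityLimiter
  OVERFLOW_LABELS = { 'overflow' => 'true' }.freeze

  def initialize(max_series: 1_000)
    @max_series = max_series
    @seen = Hash.new { |hash, name| hash[name] = Set.new }
    @mutex = Mutex.new
  end

  # Returns the labels to record: the originals while under the cap,
  # otherwise one shared overflow label set.
  def admit(name, labels)
    key = labels.sort_by { |label, _| label.to_s }
    @mutex.synchronize do
      series = @seen[name]
      if series.include?(key)
        labels
      elsif series.size < @max_series
        series << key
        labels
      else
        OVERFLOW_LABELS
      end
    end
  end

  def series_count(name)
    @mutex.synchronize { @seen.key?(name) ? @seen[name].size : 0 }
  end
end

limiter = CardinalityLimiter.new(max_series: 2)
first  = limiter.admit('http_requests', { 'user_id' => '1' })
second = limiter.admit('http_requests', { 'user_id' => '2' })
third  = limiter.admit('http_requests', { 'user_id' => '3' }) # over the cap
```

Collapsing excess series keeps totals correct while preventing an unbounded label from growing memory without limit.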
Network Overhead: Push-based monitoring transmits metrics over networks, consuming bandwidth and adding latency. Sending individual metrics creates protocol overhead. Over IPv4, each UDP datagram carries 28 bytes of headers (20 for IP, 8 for UDP) in addition to the metric payload. Sending 1,000 one-metric packets spends 28 KB on headers alone. Batching reduces overhead: combining 10 metrics per packet reduces header overhead by 90%. However, batching adds memory requirements and introduces transmission delay. Network failures cause metric loss in UDP-based systems or blocking in TCP-based systems. Asynchronous transmission prevents network issues from blocking application logic.
require 'socket'
class BatchingMetricsClient
def initialize(host:, port:, batch_size: 50, flush_interval: 1.0)
@host = host
@port = port
@batch_size = batch_size
@flush_interval = flush_interval
@buffer = []
@mutex = Mutex.new
@socket = UDPSocket.new
start_flush_timer
end
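def metric(name, value, type, tags = {})
formatted = format_metric(name, value, type, tags)
# Decide under the lock, but flush after releasing it: Ruby's Mutex is
# not reentrant, and flush acquires the same mutex.
batch_full = @mutex.synchronize do
@buffer << formatted
@buffer.size >= @batch_size
end
flush if batch_full
end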
private
def start_flush_timer
Thread.new do
loop do
sleep(@flush_interval)
flush
end
end
end
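def flush
# Check and drain the buffer under the lock so concurrent callers
# never see a partially cleared buffer.
batch = @mutex.synchronize do
next if @buffer.empty?
data = @buffer.join("\n")
@buffer.clear
data
end
return unless batch
@socket.send(batch, 0, @host, @port)
rescue StandardError
# Metrics are best-effort: swallow transmission failures so they
# never propagate into application code.
nil
end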
def format_metric(name, value, type, tags)
tag_string = tags.map { |k, v| "#{k}:#{v}" }.join(',')
tags_part = tag_string.empty? ? '' : "|##{tag_string}"
"#{name}:#{value}|#{type}#{tags_part}"
end
end
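The header arithmetic behind batching can be checked directly. This is a small worked example, assuming IPv4 and the 28-byte header figure; the helper name `header_overhead_bytes` is illustrative only.

```ruby
UDP_IPV4_HEADER_BYTES = 28 # 20-byte IPv4 header + 8-byte UDP header

# Total header bytes needed to send metric_count metrics at a given batch size.
def header_overhead_bytes(metric_count, metrics_per_packet)
  packets = (metric_count.to_f / metrics_per_packet).ceil
  packets * UDP_IPV4_HEADER_BYTES
end

unbatched = header_overhead_bytes(1_000, 1)  # one metric per packet
batched   = header_overhead_bytes(1_000, 10) # ten metrics per packet
savings   = (1 - batched.to_f / unbatched) * 100

puts "Header bytes: #{unbatched} -> #{batched} (#{savings.round}% less)"
```

The same function shows diminishing returns: moving from 10 to 50 metrics per packet saves far fewer bytes than moving from 1 to 10.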
Storage Requirements: Long-term metric storage consumes disk space. A system with 10,000 metrics sampled every 10 seconds generates 86.4 million data points daily. At 12 bytes per data point (timestamp, value, metadata), this requires about 1 GB daily, or roughly 380 GB annually, before compression. Time series databases use compression to reduce storage requirements by 10-100x. Downsampling stores high-resolution data for short periods and aggregated data for longer retention. Storing one-second resolution for 24 hours, one-minute resolution for 30 days, and hourly averages for one year reduces storage requirements significantly while maintaining analytical capability.
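The effect of the tiered retention scheme can be estimated with a few lines of arithmetic. The sketch below counts data points per metric for the three tiers described above and compares them with keeping one-second data for a full year; the constant and method names are illustrative.

```ruby
BYTES_PER_POINT = 12 # uncompressed size assumed in the text

# Each tier: seconds between points and how long the data is kept.
RETENTION_TIERS = [
  { name: '1s for 24h',  resolution: 1,     retention: 24 * 3_600 },
  { name: '1m for 30d',  resolution: 60,    retention: 30 * 86_400 },
  { name: '1h for 365d', resolution: 3_600, retention: 365 * 86_400 }
].freeze

def points_per_metric(tiers)
  tiers.sum { |tier| tier[:retention] / tier[:resolution] }
end

tiered = points_per_metric(RETENTION_TIERS) # 86_400 + 43_200 + 8_760
full   = 365 * 86_400                       # one-second data kept for a year

puts "Tiered: #{tiered} points (#{tiered * BYTES_PER_POINT} bytes) per metric"
puts "Full resolution: #{full} points per metric"
puts "Reduction: about #{(full.to_f / tiered).round}x"
```

Most of the tiered total comes from the high-resolution day, so shortening that window is the most effective lever when storage is tight.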
Query Performance: Metric queries scan time series data to compute aggregates, identify trends, or generate visualizations. Query performance depends on data volume, time range, aggregation complexity, and index structures. A query spanning one week across 1,000 metrics at 10-second resolution evaluates roughly 60 million data points. Efficient queries leverage pre-computed aggregates, filter early on time ranges, and limit result cardinality. Monitoring systems should establish query performance budgets and implement timeouts to prevent runaway queries from overloading storage systems.
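A query budget can be enforced before any data is read by estimating how many points the query must scan. The following is a sketch only: `estimated_points`, `check_query_budget!`, and the 100-million-point budget are hypothetical, not the API of any particular time series database.

```ruby
class QueryBudgetExceeded < StandardError; end

# Estimate the number of data points a query has to scan.
def estimated_points(metric_count:, range_seconds:, resolution_seconds:)
  metric_count * (range_seconds / resolution_seconds)
end

# Raises when the estimate exceeds the budget; otherwise returns the estimate.
def check_query_budget!(max_points:, **query)
  points = estimated_points(**query)
  if points > max_points
    raise QueryBudgetExceeded, "query would scan #{points} points (budget #{max_points})"
  end
  points
end

week = 7 * 86_400
raw = check_query_budget!(max_points: 100_000_000, metric_count: 1_000,
                          range_seconds: week, resolution_seconds: 10)
rolled_up = check_query_budget!(max_points: 100_000_000, metric_count: 1_000,
                                range_seconds: week, resolution_seconds: 60)
puts "Raw: #{raw} points, one-minute rollup: #{rolled_up} points"
```

Running the same week-long query against one-minute aggregates scans one sixth of the data, which is the main reason pre-computed rollups help.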
Sampling Trade-offs: Sampling frequency creates a trade-off between temporal resolution and overhead. High-frequency sampling captures short-lived events but increases costs. Low-frequency sampling reduces overhead but misses transient issues. A five-minute sampling interval misses spikes lasting seconds. Adaptive sampling increases frequency during detected anomalies while maintaining low baseline overhead. Event-triggered monitoring captures metrics when significant changes occur rather than on fixed schedules.
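Adaptive sampling can be sketched as a small state machine that shortens the interval whenever a reading deviates sharply from a moving baseline. The `AdaptiveSampler` class, its defaults, and the 50% deviation threshold below are assumptions chosen for illustration.

```ruby
# Hypothetical adaptive sampler: samples faster while readings look anomalous.
class AdaptiveSampler
  attr_reader :interval

  def initialize(base_interval: 60.0, fast_interval: 5.0, threshold: 0.5, smoothing: 0.2)
    @base_interval = base_interval
    @fast_interval = fast_interval
    @threshold = threshold # relative deviation treated as an anomaly
    @smoothing = smoothing # weight of each new reading in the baseline
    @baseline = nil
    @interval = base_interval
  end

  # Feed one reading; returns the seconds to wait before the next sample.
  def observe(value)
    if @baseline.nil?
      @baseline = value.to_f
      return @interval
    end

    deviation = (value - @baseline).abs / [@baseline.abs, Float::EPSILON].max
    anomalous = deviation > @threshold
    @interval = anomalous ? @fast_interval : @base_interval
    # Only normal readings move the baseline, so a spike does not become the norm.
    @baseline += @smoothing * (value - @baseline) unless anomalous
    @interval
  end
end

sampler = AdaptiveSampler.new
intervals = [100, 105, 300, 102].map { |reading| sampler.observe(reading) }
puts intervals.inspect
```

A limitation of this sketch is that a permanent shift to a new level is treated as an anomaly indefinitely; a production version would need a rule for re-basing after a sustained change.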
Reference
Metric Types
| Type | Description | Example Use Case | Implementation |
|---|---|---|---|
| Counter | Monotonically increasing value | Total requests processed | Increment on each event |
| Gauge | Point-in-time snapshot | Current memory usage | Set to current value |
| Histogram | Distribution of values | Response time distribution | Record each observation |
| Summary | Quantile calculations | p50, p95, p99 latency | Calculate percentiles |
| Timer | Duration measurement | Operation execution time | Record start and end |
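
The differences between these types are easiest to see in code. The classes below are illustrative in-process primitives with hypothetical names, not the API of any client library, and they are not thread-safe.

```ruby
# Counter: only ever increases.
class SimpleCounter
  attr_reader :value

  def initialize
    @value = 0
  end

  def increment(by = 1)
    raise ArgumentError, 'counters only increase' if by.negative?
    @value += by
  end
end

# Gauge: overwritten with the current reading.
class SimpleGauge
  attr_reader :value

  def initialize
    @value = 0
  end

  def set(value)
    @value = value
  end
end

# Histogram: counts each observation in the first bucket that fits it.
class SimpleHistogram
  attr_reader :counts, :sum, :count

  def initialize(buckets)
    @buckets = buckets.sort
    @counts = Hash.new(0)
    @sum = 0.0
    @count = 0
  end

  def observe(value)
    bound = @buckets.find { |upper| value <= upper } || Float::INFINITY
    @counts[bound] += 1
    @sum += value
    @count += 1
  end
end

# Timer: records the duration of a block into a histogram.
def time_block(histogram)
  start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  yield
ensure
  histogram.observe(Process.clock_gettime(Process::CLOCK_MONOTONIC) - start)
end

requests = SimpleCounter.new
3.times { requests.increment }

latency = SimpleHistogram.new([0.1, 0.5, 1.0])
[0.05, 0.3, 2.0].each { |seconds| latency.observe(seconds) }
result = time_block(latency) { :done }
```

A summary differs from a histogram in that it computes quantiles on the client instead of exporting bucket counts.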
Common System Metrics
| Metric | Source | Interpretation | Collection Method |
|---|---|---|---|
| CPU Usage | /proc/stat | Percentage of CPU time used | Sample over interval |
| Memory RSS | /proc/[pid]/status | Physical memory consumed | Read from procfs |
| Disk I/O | /proc/diskstats | Read/write operations and bytes | Calculate deltas |
| Network Traffic | /proc/net/dev | Bytes sent/received | Calculate deltas |
| Load Average | /proc/loadavg | Average runnable and uninterruptible tasks over 1, 5, and 15 minutes | Direct read |
| File Descriptors | /proc/[pid]/fd | Open file handles | Count directory entries |
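
"Sample over interval" for CPU usage means reading the cumulative counters in /proc/stat twice and dividing the change in busy time by the change in total time. The sketch below is Linux-specific; the helper names are illustrative, and the sample lines are made-up values used to show the calculation.

```ruby
# Parses the aggregate "cpu" line from /proc/stat into [busy, total] ticks.
# Field order: user nice system idle iowait irq softirq steal guest guest_nice
def parse_cpu_line(line)
  fields = line.split.drop(1).map(&:to_i)
  idle = fields[3] + fields.fetch(4, 0) # idle + iowait
  total = fields.first(8).sum           # guest time is already counted in user/nice
  [total - idle, total]
end

# CPU utilization between two samples, as a percentage.
def cpu_percent(previous_line, current_line)
  prev_busy, prev_total = parse_cpu_line(previous_line)
  busy, total = parse_cpu_line(current_line)
  delta_total = total - prev_total
  return 0.0 if delta_total <= 0
  ((busy - prev_busy).to_f / delta_total * 100).round(1)
end

# Worked example with made-up counter values.
earlier = 'cpu  100 0 50 800 50 0 0 0 0 0'
later   = 'cpu  160 0 70 950 60 0 0 0 0 0'
example = cpu_percent(earlier, later) # 80 busy ticks out of 240 total

# Live reading, only where /proc/stat exists.
if File.readable?('/proc/stat')
  first = File.open('/proc/stat', &:readline)
  sleep 0.5
  second = File.open('/proc/stat', &:readline)
  puts "CPU usage: #{cpu_percent(first, second)}%"
end
```

The same delta approach applies to /proc/diskstats and /proc/net/dev, whose counters are also cumulative since boot.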
Ruby Memory Metrics
| Metric | Method | Description | Use Case |
|---|---|---|---|
| Object Count | ObjectSpace.count_objects | Object slots by internal type (T_STRING, T_ARRAY, ...) | Detect object accumulation |
| Total Memory | ObjectSpace.memsize_of_all (requires 'objspace') | Allocated memory size in bytes | Track memory growth |
| GC Runs | GC.stat[:count] | Garbage collection count | Assess GC pressure |
| Heap Used | GC.stat[:heap_allocated_pages] (:heap_used before Ruby 2.2) | Allocated heap pages | Monitor heap growth |
| Heap Free | GC.stat[:heap_free_slots] | Free object slots in the heap | Detect memory pressure |
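
These calls can be combined into a single snapshot and compared over time. The sketch below uses the `GC.stat` key names found in current Ruby releases (`:heap_allocated_pages`, `:heap_free_slots`); the method name `ruby_memory_snapshot` is an assumption for illustration.

```ruby
require 'objspace' # needed for ObjectSpace.memsize_of_all

# Collects a point-in-time view of the Ruby heap.
def ruby_memory_snapshot
  counts = ObjectSpace.count_objects
  gc = GC.stat
  {
    live_objects: counts[:TOTAL] - counts[:FREE], # occupied object slots
    memsize_bytes: ObjectSpace.memsize_of_all,
    gc_runs: gc[:count],
    heap_pages: gc[:heap_allocated_pages],
    free_slots: gc[:heap_free_slots]
  }
end

before = ruby_memory_snapshot
retained = Array.new(50_000) { |i| "row-#{i}" } # simulate retained allocations
after = ruby_memory_snapshot

growth = after.to_h { |key, value| [key, value - before[key]] }
puts growth
```

A steady rise in `live_objects` across many snapshots, with `gc_runs` also rising, suggests objects are being retained, not just awaiting collection.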
Metric Label Best Practices
| Pattern | Cardinality | Example | Usage |
|---|---|---|---|
| Service Name | Low | service:api | Always include |
| Environment | Low | env:production | Always include |
| HTTP Method | Low | method:GET | Safe for HTTP |
| HTTP Path | Medium | path:/users/:id | Template URLs |
| Status Code | Low | status:200 | Safe for HTTP |
| Error Type | Medium | error:TimeoutError | Limit to known types |
| User ID | High | user_id:12345 | Avoid in metrics |
| Request ID | High | request_id:abc123 | Avoid in metrics |
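
Templating URLs keeps the path label bounded by replacing variable segments with placeholders before the label is recorded. The following is a minimal sketch; `template_path` and its two patterns are assumptions, and a framework's own route name is usually a better source when one is available.

```ruby
UUID_SEGMENT = /\A\h{8}-\h{4}-\h{4}-\h{4}-\h{12}\z/
NUMERIC_SEGMENT = /\A\d+\z/

# Replaces numeric and UUID path segments with placeholders and drops the query string.
def template_path(path)
  without_query = path.split('?', 2).first.to_s
  segments = without_query.split('/').map do |segment|
    case segment
    when NUMERIC_SEGMENT then ':id'
    when UUID_SEGMENT then ':uuid'
    else segment
    end
  end
  templated = segments.join('/')
  templated.empty? ? '/' : templated
end

puts template_path('/users/12345')
puts template_path('/users/12345/orders/987?page=2')
puts template_path('/files/123e4567-e89b-12d3-a456-426614174000')
```

Every request to a user resource then shares one `path:/users/:id` series instead of creating one series per user.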
Monitoring Pattern Decision Matrix
| Scenario | Pattern | Rationale | Implementation |
|---|---|---|---|
| Short-lived processes | Push-based | Process terminates before scrape | StatsD client |
| Long-running services | Pull-based | Simplifies application code | Prometheus endpoint |
| Serverless functions | Push-based | No persistent endpoint | CloudWatch SDK |
| Container orchestration | Pull-based | Service discovery integration | Prometheus with k8s |
| Batch jobs | Push-based | Single execution completion | Metric push on finish |
| High-frequency metrics | Local aggregation | Reduce network overhead | Agent-based collection |
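
Local aggregation for high-frequency metrics can be as simple as summing increments in-process and emitting one value per metric at flush time. The `LocalAggregator` class below is a hypothetical sketch; how the flushed totals are pushed is left to the caller.

```ruby
# Hypothetical in-process aggregator: one emitted value per metric per flush,
# instead of one packet per event.
class LocalAggregator
  def initialize
    @counts = Hash.new(0)
    @mutex = Mutex.new
  end

  def increment(name, by = 1)
    @mutex.synchronize { @counts[name] += by }
  end

  # Returns the accumulated totals and resets them.
  def flush
    @mutex.synchronize do
      snapshot = @counts.dup
      @counts.clear
      snapshot
    end
  end
end

aggregator = LocalAggregator.new
10_000.times { aggregator.increment('requests') }
aggregator.increment('errors', 3)

totals = aggregator.flush
puts totals.inspect
```

Ten thousand events become two data points on the wire, at the cost of losing the timing of individual events within the flush window.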
Alert Threshold Guidelines
| Metric Type | Threshold Strategy | Example | Justification |
|---|---|---|---|
| Resource utilization | Percentage-based | CPU > 80% | Capacity buffer |
| Error rates | Rate-based | Errors > 1% of requests | Acceptable failure rate |
| Latency | Percentile-based | p99 > 1000ms | Tail latency impact |
| Queue depth | Absolute count | Queue > 10000 items | Processing capacity |
| Connection pools | Utilization rate | Pool > 90% full | Connection availability |
| Disk space | Percentage remaining | Disk < 10% free | Growth runway |
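
The threshold strategies in the table translate directly into predicate functions. The rule names and values below mirror the examples above and are illustrative only; real thresholds depend on the service.

```ruby
# Each rule returns true when its reading should raise an alert.
ALERT_RULES = {
  cpu_percent:       ->(value) { value > 80 },
  error_rate:        ->(value) { value > 0.01 },  # fraction of requests
  p99_latency_ms:    ->(value) { value > 1000 },
  queue_depth:       ->(value) { value > 10_000 },
  pool_utilization:  ->(value) { value > 0.90 },  # fraction of pool in use
  disk_free_percent: ->(value) { value < 10 }
}.freeze

# Returns the names of the rules that fire for the given readings.
def firing_alerts(readings)
  readings.select { |name, value| ALERT_RULES.key?(name) && ALERT_RULES[name].call(value) }.keys
end

readings = { cpu_percent: 85, error_rate: 0.002, queue_depth: 10_000, disk_free_percent: 7 }
alerts = firing_alerts(readings)
puts alerts.inspect
```

In practice a rule usually has to hold for several consecutive samples before alerting, which avoids paging on a single noisy reading.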
Sampling Interval Selection
| Monitoring Scope | Interval | Data Volume | Use Case |
|---|---|---|---|
| Real-time alerts | 1-10 seconds | Very high | Detect immediate issues |
| Performance analysis | 10-60 seconds | High | Investigate problems |
| Capacity planning | 5-15 minutes | Medium | Trend analysis |
| Cost optimization | 1 hour | Low | Long-term patterns |
| Compliance reporting | 1 day | Very low | Historical records |
Integration Patterns
| Tool | Protocol | Format | Transport | Ruby Gem |
|---|---|---|---|---|
| Prometheus | Pull/HTTP | Text exposition | HTTP GET | prometheus-client |
| StatsD | Push/UDP | Line protocol | UDP packets | statsd-ruby |
| Graphite | Push/TCP | Line protocol | TCP stream | graphite-api |
| InfluxDB | Push/HTTP | Line protocol | HTTP POST | influxdb-client |
| CloudWatch | Push/API | JSON | HTTPS | aws-sdk-cloudwatch |
| Datadog | Push/UDP | DogStatsD line protocol | UDP to local agent | dogstatsd-ruby |
| New Relic | Agent | Binary | HTTPS | newrelic_rpm |
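
The text-based formats in this table are simple enough to illustrate with string formatting. The helpers below are a sketch with hypothetical names that omit escaping and type suffixes; production code should rely on the listed gems.

```ruby
# StatsD: name:value|type
def statsd_line(name, value, type)
  "#{name}:#{value}|#{type}"
end

# Prometheus text exposition: name{label="value"} value
def prometheus_line(name, value, labels = {})
  label_part = labels.map { |key, val| %(#{key}="#{val}") }.join(',')
  label_part.empty? ? "#{name} #{value}" : "#{name}{#{label_part}} #{value}"
end

# Graphite plaintext: path value unix_timestamp
def graphite_line(path, value, timestamp)
  "#{path} #{value} #{timestamp}"
end

# InfluxDB line protocol: measurement,tag=value field=value timestamp_ns
# (float fields only in this sketch; strings and integers need extra syntax)
def influx_line(measurement, tags, fields, timestamp_ns)
  tag_part = tags.map { |key, val| ",#{key}=#{val}" }.join
  field_part = fields.map { |key, val| "#{key}=#{val}" }.join(',')
  "#{measurement}#{tag_part} #{field_part} #{timestamp_ns}"
end

puts statsd_line('api.requests', 1, 'c')
puts prometheus_line('http_requests_total', 1027, { method: 'GET' })
puts graphite_line('api.requests.count', 42, 1_700_000_000)
puts influx_line('cpu', { host: 'web1' }, { usage: 0.64 }, 1_700_000_000_000_000_000)
```

Seeing the formats side by side shows why labels are cheap to add in Prometheus and InfluxDB but have to be encoded into the metric path for Graphite.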
Performance Budget Guidelines
| Resource | Budget | Measurement | Acceptable Impact |
|---|---|---|---|
| CPU overhead | 1-3% | Process CPU time | Minimal performance impact |
| Memory overhead | 50-200 MB | Process RSS | Depends on application size |
| Network bandwidth | 10-100 KB/s | Outbound traffic | Based on link capacity |
| Disk I/O | 100-1000 ops/s | Write operations | Depends on disk type |
| Request latency | <1ms | P50 added latency | Imperceptible to users |