CrackedRuby

Overview

Resource monitoring measures and tracks the consumption of system resources by applications and processes. This practice provides visibility into CPU utilization, memory allocation, disk operations, network traffic, and other system metrics. Resource monitoring serves multiple purposes: detecting performance degradation, identifying bottlenecks, planning capacity, and troubleshooting production issues.

Modern applications operate in complex environments with multiple interdependencies. A web application might consume database connections, cache memory, background job queues, and external API quotas. Without proper monitoring, resource exhaustion manifests as mysterious failures, degraded performance, or complete service outages. Resource monitoring transforms these opaque failures into observable, measurable events.

The practice encompasses several monitoring layers. Operating system metrics track CPU cycles, memory pages, disk sectors, and network packets. Process-level monitoring measures individual application resource consumption. Application-level monitoring examines domain-specific resources like database connection pools, thread pools, and cache hit rates. Infrastructure monitoring covers distributed system resources across multiple hosts.

# Basic system resource snapshot (Linux: shells out to the `free` command)
require 'sys/cpu'
require 'sys/filesystem'

cpu_usage = Sys::CPU.load_avg
disk_stats = Sys::Filesystem.stat('/')
memory_info = `free -m`.split("\n")[1].split

puts "CPU Load: #{cpu_usage[0]}"
puts "Disk Usage: #{(disk_stats.bytes_used.to_f / disk_stats.bytes_total * 100).round(2)}%"
puts "Memory Usage: #{memory_info[2].to_i} / #{memory_info[1].to_i} MB"

Resource monitoring operates through periodic sampling or continuous observation. Sampling captures resource states at fixed intervals, trading accuracy for lower overhead. Continuous monitoring tracks every resource event but consumes more resources itself. The monitoring approach affects both the accuracy of measurements and the performance impact on monitored systems.

Key Principles

Resource monitoring builds on several foundational concepts that define how systems observe and report resource consumption. Understanding these principles clarifies the mechanisms behind monitoring implementations and the trade-offs involved in different approaches.

Metrics Collection: Monitoring systems gather quantitative measurements of resource usage. Metrics represent point-in-time values (gauges), cumulative counts (counters), or distributions of measurements (histograms). A gauge records current memory usage. A counter tracks total requests processed. A histogram captures response time distribution. Each metric type serves specific analytical purposes and requires different storage strategies.

Sampling Intervals: Monitoring systems collect metrics at defined frequencies. Shorter intervals provide higher resolution but increase collection overhead. A one-second interval captures rapid fluctuations but generates 86,400 data points daily per metric. A one-minute interval reduces data volume by 98% but misses short-lived spikes. The sampling interval creates a trade-off between temporal resolution and resource consumption by the monitoring system itself.
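The arithmetic behind this trade-off can be checked with a short sketch. The helper names `points_per_day` and `sample_every` and the load figures are assumptions made for illustration; they do not come from any monitoring library.

```ruby
SECONDS_PER_DAY = 86_400

# Data points produced per metric per day at a given sampling interval
def points_per_day(interval_seconds)
  SECONDS_PER_DAY / interval_seconds
end

# Keeps every nth reading, as a coarser sampler would
def sample_every(series, step)
  series.each_slice(step).map(&:first)
end

[1, 10, 60].each do |interval|
  saved = (1 - points_per_day(interval) / points_per_day(1).to_f) * 100
  puts "#{interval}s interval: #{points_per_day(interval)} points/day, #{saved.round(1)}% fewer than 1s"
end

# One reading per second: a flat 20% load with a five-second spike to 95%
per_second = Array.new(120) { |t| (30..34).cover?(t) ? 95 : 20 }
per_minute = sample_every(per_second, 60)
puts "Max seen at 1s: #{per_second.max}, at 60s: #{per_minute.max}"
```

The one-minute sampler reads only the first second of each minute, so the spike between seconds 30 and 34 never appears in its data.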

Aggregation: Raw metrics generate excessive data volume for long-term storage. Aggregation computes summary statistics over time windows: averages, maximums, minimums, percentiles. A system might retain one-second resolution for one hour, one-minute resolution for one day, and hourly averages for one year. This hierarchical aggregation balances data retention costs against analytical requirements.
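The rollup step described above can be sketched as follows. This is a minimal illustration rather than a storage engine; the `Downsampler` name and the `[timestamp, value]` sample layout are assumptions made for the example.

```ruby
# Hypothetical helper: rolls raw [timestamp, value] samples up into
# fixed-size time windows with summary statistics.
module Downsampler
  def self.rollup(samples, window_seconds)
    samples
      .group_by { |timestamp, _| timestamp - (timestamp % window_seconds) }
      .sort
      .map do |window_start, points|
        values = points.map { |_, value| value }
        {
          window_start: window_start,
          count: values.size,
          min: values.min,
          max: values.max,
          avg: values.sum / values.size.to_f
        }
      end
  end
end

# Six one-second samples spanning two one-minute windows
raw = [[0, 10], [1, 20], [59, 30], [60, 40], [61, 50], [119, 60]]
Downsampler.rollup(raw, 60).each { |window| puts window }
```

Keeping the minimum and maximum next to the average preserves evidence of spikes that an average alone would smooth away.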

Thresholds and Alerting: Monitoring systems compare metrics against defined limits. Threshold breaches trigger notifications, automated responses, or escalations. Simple thresholds check absolute values: alert when CPU exceeds 80%. Complex thresholds evaluate trends, rates of change, or relationships between metrics. Threshold configuration balances sensitivity to problems against false positive rates.
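One common way to reduce false positives is to require several consecutive breaches before alerting. The sketch below illustrates that idea; the `ThresholdRule` class, its parameters, and the readings are hypothetical.

```ruby
# Hypothetical rule evaluator: names and limits are illustrative only.
class ThresholdRule
  attr_reader :name

  def initialize(name, limit:, sustained_for: 1)
    @name = name
    @limit = limit
    @sustained_for = sustained_for # consecutive breaching samples required
    @breaches = 0
  end

  # Returns true once the value has exceeded the limit for the
  # required number of consecutive samples.
  def breached?(value)
    @breaches = value > @limit ? @breaches + 1 : 0
    @breaches >= @sustained_for
  end
end

cpu_rule = ThresholdRule.new('cpu_percent', limit: 80, sustained_for: 3)
readings = [85, 90, 70, 82, 88, 91]
alerts = readings.map { |value| cpu_rule.breached?(value) }
puts alerts.inspect
```

With `sustained_for: 1` the rule behaves as a simple absolute threshold. Raising it makes the rule less sensitive to brief spikes but slower to alert.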

Resource Attribution: Monitoring attributes resource consumption to specific processes, users, or operations. Attribution identifies which components consume resources and enables targeted optimization. Process-level attribution tracks memory per process ID. Request-level attribution measures resource consumption per API endpoint. Without attribution, aggregate metrics obscure the source of resource pressure.

# Metric collection with timestamp and labels
class MetricCollector
  def initialize
    @metrics = []
  end

  def record(name, value, labels = {})
    @metrics << {
      name: name,
      value: value,
      timestamp: Time.now.to_i,
      labels: labels
    }
  end

  def gauge(name, value, labels = {})
    record(name, value, labels.merge(type: 'gauge'))
  end

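  def counter(name, increment = 1, labels = {})
    # Look up the previous value with the same labels that record stores;
    # otherwise the lookup never matches and the counter never accumulates
    typed_labels = labels.merge(type: 'counter')
    current = find_metric(name, typed_labels) || 0
    record(name, current + increment, typed_labels)
  end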

  def histogram(name, value, labels = {})
    record(name, value, labels.merge(type: 'histogram'))
  end

  private

  def find_metric(name, labels)
    metric = @metrics.reverse.find { |m| m[:name] == name && m[:labels] == labels }
    metric ? metric[:value] : nil
  end
end

collector = MetricCollector.new
collector.gauge('memory.used', 4_200_000_000, service: 'api')
collector.counter('requests.total', 1, endpoint: '/users')
collector.histogram('response.time', 0.145, endpoint: '/users')

Overhead Management: Monitoring consumes resources: CPU for metric collection, memory for buffering, network bandwidth for transmission, disk for storage. High-frequency monitoring of many metrics creates measurable overhead. A production system might allocate 1-5% of resources to monitoring infrastructure. Exceeding this budget degrades application performance, creating a feedback loop where monitoring itself causes the problems it aims to detect.

Time Series Data: Resource metrics form time series: sequences of values indexed by timestamp. Time series analysis identifies trends, seasonality, anomalies, and correlations. A metric showing gradual memory growth indicates a leak. Periodic CPU spikes correlate with scheduled batch jobs. Time series storage systems optimize for append-heavy workloads and time-range queries characteristic of monitoring data.
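Trend detection on a time series can be approximated with an ordinary least-squares slope. The sketch below assumes samples arrive as `[timestamp, value]` pairs; the `trend_slope` helper and the memory figures are made up for illustration.

```ruby
# Least-squares slope of [timestamp, value] pairs, in value units per second
def trend_slope(points)
  n = points.size.to_f
  mean_t = points.sum { |t, _| t } / n
  mean_v = points.sum { |_, v| v } / n
  covariance = points.sum { |t, v| (t - mean_t) * (v - mean_v) }
  variance = points.sum { |t, _| (t - mean_t)**2 }
  variance.zero? ? 0.0 : covariance / variance
end

# Memory samples in bytes, taken once per minute
samples = [[0, 100_000_000], [60, 101_200_000], [120, 102_400_000], [180, 103_600_000]]
slope = trend_slope(samples)
puts "Growth: #{(slope * 3600 / 1_000_000).round(1)} MB/hour"
```

A slope that stays positive across many windows is a hint of a leak rather than proof; caches that are still warming up produce the same shape early in a process lifetime.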

Ruby Implementation

Ruby provides multiple approaches to resource monitoring through standard library modules, system calls, and third-party gems. The implementation strategy depends on the scope of monitoring: single process, multiple processes, or distributed systems.

Process Resource Monitoring: Ruby's Process module exposes basic resource information for the current process. The Process.clock_gettime method measures various time metrics, while operating system gems provide detailed resource statistics.

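require 'etc'
require 'sys/proctable'

# Get current process information
ps_info = Sys::ProcTable.ps(pid: Process.pid)

# Assumes Linux, where sys-proctable reports utime/stime in clock ticks
# and rss in pages (the raw /proc/[pid]/stat values)
ticks_per_second = Etc.sysconf(Etc::SC_CLK_TCK)
page_size = Etc.sysconf(Etc::SC_PAGESIZE)

puts "Process: #{ps_info.pid} (#{ps_info.comm})"
puts "CPU Time: #{(ps_info.utime + ps_info.stime).to_f / ticks_per_second} seconds"
puts "Memory: #{ps_info.rss * page_size / 1024 / 1024} MB"
puts "Threads: #{ps_info.nlwp}"
puts "Priority: #{ps_info.priority}"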

# Monitor resource usage over time
start_time = Process.clock_gettime(Process::CLOCK_MONOTONIC)
start_cpu = Process.times

# Perform work
1_000_000.times { |i| i ** 2 }

end_time = Process.clock_gettime(Process::CLOCK_MONOTONIC)
end_cpu = Process.times

puts "Wall time: #{(end_time - start_time).round(3)}s"
cpu_seconds = (end_cpu.utime - start_cpu.utime) + (end_cpu.stime - start_cpu.stime)
puts "CPU time (user + system): #{cpu_seconds.round(3)}s"

Memory Profiling: Ruby's ObjectSpace module enables memory analysis by inspecting allocated objects. This provides application-level memory monitoring distinct from operating system memory metrics.

require 'objspace'

# Take memory snapshot
before = ObjectSpace.memsize_of_all

# Create objects
array = Array.new(100_000) { |i| "string_#{i}" }

after = ObjectSpace.memsize_of_all

puts "Memory allocated: #{(after - before) / 1024 / 1024} MB"

# Count objects by class
object_counts = ObjectSpace.count_objects
puts "Total objects: #{object_counts[:TOTAL]}"
puts "Strings: #{object_counts[:T_STRING]}"
puts "Arrays: #{object_counts[:T_ARRAY]}"
puts "Hashes: #{object_counts[:T_HASH]}"

# Detailed object tracking
ObjectSpace.trace_object_allocations_start

users = Array.new(1000) { { name: "User", email: "user@example.com" } }

ObjectSpace.trace_object_allocations_stop

# Find allocation source
sample_object = users.first
file = ObjectSpace.allocation_sourcefile(sample_object)
line = ObjectSpace.allocation_sourceline(sample_object)
puts "Allocated at #{file}:#{line}"

System Metrics Collection: The sys-proctable, sys-cpu, and sys-filesystem gems provide access to system metrics on several platforms. These gems wrap platform-specific system calls behind Ruby interfaces, although the available fields and their units vary by operating system.

require 'etc'
require 'sys/cpu'
require 'sys/filesystem'
require 'sys/proctable'

# Linux-oriented sketch: reads /proc and assumes sys-proctable reports rss in
# pages and utime/stime/starttime in clock ticks, as /proc/[pid]/stat does.
class SystemMonitor
  PAGE_SIZE = Etc.sysconf(Etc::SC_PAGESIZE)
  TICKS_PER_SECOND = Etc.sysconf(Etc::SC_CLK_TCK).to_f

  def cpu_stats
    {
      load_average: Sys::CPU.load_avg,
      num_cpus: Sys::CPU.num_cpu,
      cpu_freq: Sys::CPU.cpu_freq
    }
  end

  def memory_stats
    total = `grep MemTotal /proc/meminfo`.split[1].to_i
    available = `grep MemAvailable /proc/meminfo`.split[1].to_i

    {
      total_kb: total,
      available_kb: available,
      used_kb: total - available,
      usage_percent: ((total - available).to_f / total * 100).round(2)
    }
  end

  def disk_stats(mount_point = '/')
    stat = Sys::Filesystem.stat(mount_point)

    {
      mount_point: mount_point,
      total_bytes: stat.bytes_total,
      used_bytes: stat.bytes_used,
      available_bytes: stat.bytes_available,
      usage_percent: (stat.bytes_used.to_f / stat.bytes_total * 100).round(2)
    }
  end

  # Top 10 processes by resident memory
  def process_list
    Sys::ProcTable.ps.map do |proc|
      {
        pid: proc.pid,
        name: proc.comm,
        cpu_percent: calculate_cpu_percent(proc),
        memory_mb: proc.rss * PAGE_SIZE / 1024 / 1024,
        state: proc.state
      }
    end.max_by(10) { |p| p[:memory_mb] }
  end

  private

  # Average CPU share over the process lifetime, not an instantaneous reading
  def calculate_cpu_percent(proc)
    cpu_seconds = (proc.utime + proc.stime) / TICKS_PER_SECOND
    uptime = File.read('/proc/uptime').split.first.to_f
    elapsed = uptime - proc.starttime / TICKS_PER_SECOND
    elapsed.positive? ? (cpu_seconds / elapsed * 100).round(2) : 0.0
  end
end

monitor = SystemMonitor.new
puts monitor.cpu_stats
puts monitor.memory_stats
puts monitor.disk_stats

Custom Metric Instrumentation: Application-specific monitoring requires instrumentation code that records domain metrics. Ruby applications typically instrument critical paths with timing and counting logic.

class MetricInstrumentor
  def initialize
    @metrics = Hash.new { |h, k| h[k] = [] }
  end

  def time(metric_name, labels = {})
    start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    result = yield
    duration = Process.clock_gettime(Process::CLOCK_MONOTONIC) - start
    
    record(metric_name, duration, labels.merge(unit: 'seconds'))
    result
  end

  def increment(metric_name, value = 1, labels = {})
    record(metric_name, value, labels)
  end

  def gauge(metric_name, value, labels = {})
    @metrics[metric_name] = [{ value: value, labels: labels, timestamp: Time.now }]
  end

  def record(metric_name, value, labels = {})
    @metrics[metric_name] << {
      value: value,
      labels: labels,
      timestamp: Time.now
    }
  end

  def summary(metric_name)
    values = @metrics[metric_name].map { |m| m[:value] }
    return {} if values.empty?

    sorted = values.sort
    {
      count: values.size,
      sum: values.sum,
      mean: values.sum / values.size.to_f,
      min: sorted.first,
      max: sorted.last,
      p50: percentile(sorted, 0.5),
      p95: percentile(sorted, 0.95),
      p99: percentile(sorted, 0.99)
    }
  end

  private

  def percentile(sorted_values, p)
    index = (sorted_values.size * p).ceil - 1
    sorted_values[index]
  end
end

instrumentor = MetricInstrumentor.new

# Instrument database queries
instrumentor.time('db.query', table: 'users') do
  sleep(0.1)  # Simulate query
end

# Track cache hits/misses
instrumentor.increment('cache.hits', 1, cache: 'redis')
instrumentor.increment('cache.misses', 1, cache: 'redis')

# Record queue depth
instrumentor.gauge('queue.depth', 42, queue: 'background_jobs')

puts instrumentor.summary('db.query')

Periodic Metric Collection: Long-running monitoring requires scheduled metric collection. Ruby applications use threads, fibers, or separate processes to collect metrics without blocking main application logic.

require 'concurrent'
require 'objspace'  # for ObjectSpace.memsize_of_all in the memory collector

class PeriodicMonitor
  def initialize(interval: 60)
    @interval = interval
    @running = false
    @collectors = []
  end

  def add_collector(name, &block)
    @collectors << { name: name, block: block }
  end

  def start
    @running = true
    @task = Concurrent::TimerTask.new(execution_interval: @interval) do
      collect_metrics
    end
    @task.execute
  end

  def stop
    @running = false
    @task&.shutdown
  end

  private

  def collect_metrics
    @collectors.each do |collector|
      begin
        metrics = collector[:block].call
        report_metrics(collector[:name], metrics)
      rescue => e
        puts "Error collecting #{collector[:name]}: #{e.message}"
      end
    end
  end

  def report_metrics(name, metrics)
    puts "[#{Time.now}] #{name}: #{metrics}"
  end
end

monitor = PeriodicMonitor.new(interval: 5)

monitor.add_collector('memory') do
  {
    objects: ObjectSpace.count_objects[:TOTAL],
    size_mb: ObjectSpace.memsize_of_all / 1024 / 1024
  }
end

monitor.add_collector('gc') do
  # Key names as reported by GC.stat on Ruby 2.2 and later
  stats = GC.stat
  {
    count: stats[:count],
    heap_live_slots: stats[:heap_live_slots],
    heap_free_slots: stats[:heap_free_slots]
  }
end

monitor.start
sleep(20)
monitor.stop

Implementation Approaches

Resource monitoring strategies vary based on monitoring scope, infrastructure architecture, and operational requirements. Different approaches balance accuracy, overhead, complexity, and cost.

Push-Based Monitoring: Applications actively send metrics to a central collection service. Each monitored process runs instrumentation code that periodically transmits metrics to collectors like StatsD, Prometheus push gateway, or cloud monitoring services. Push-based systems work well for short-lived processes, serverless functions, and batch jobs that terminate before scrapers can collect metrics. The application controls transmission timing and can batch metrics to reduce network overhead. However, push-based monitoring requires reliable network connectivity and creates coupling between applications and monitoring infrastructure.

require 'socket'

class StatsDClient
  def initialize(host: 'localhost', port: 8125)
    @socket = UDPSocket.new
    @host = host
    @port = port
  end

  def gauge(metric, value, tags = {})
    send_metric("#{metric}:#{value}|g#{format_tags(tags)}")
  end

  def counter(metric, value = 1, tags = {})
    send_metric("#{metric}:#{value}|c#{format_tags(tags)}")
  end

  def timing(metric, duration_ms, tags = {})
    send_metric("#{metric}:#{duration_ms}|ms#{format_tags(tags)}")
  end

  def histogram(metric, value, tags = {})
    send_metric("#{metric}:#{value}|h#{format_tags(tags)}")
  end

  private

  def send_metric(data)
    @socket.send(data, 0, @host, @port)
  rescue => e
    # Silently fail to avoid impacting application
    nil
  end

  def format_tags(tags)
    return '' if tags.empty?
    tag_string = tags.map { |k, v| "#{k}:#{v}" }.join(',')
    "|##{tag_string}"
  end
end

# Usage in application code
statsd = StatsDClient.new

statsd.timing('api.request', 145, endpoint: '/users', method: 'GET')
statsd.counter('db.queries', 1, table: 'orders')
statsd.gauge('queue.size', 23, queue: 'mailers')

Pull-Based Monitoring: Monitoring systems scrape metrics from application endpoints at regular intervals. Applications expose metrics via HTTP endpoints that return current metric values. Prometheus popularized this approach with its scraping model. Pull-based monitoring centralizes collection timing and reduces application complexity. Applications only maintain current metric state rather than managing transmission logic. The monitoring system discovers targets, controls scraping frequency, and handles failures. Pull-based systems struggle with highly dynamic environments where monitored processes come and go rapidly, and network policies must allow scraper access to all monitored services.

require 'objspace'  # for ObjectSpace.memsize_of_all
require 'webrick'   # a separate gem since Ruby 3.0

class MetricsEndpoint
  def initialize(port: 9394)
    @port = port
    @metrics = {}
    @mutex = Mutex.new
  end

  def set_gauge(name, value, labels = {})
    @mutex.synchronize do
      key = metric_key(name, labels)
      @metrics[key] = { type: 'gauge', value: value }
    end
  end

  def increment_counter(name, value = 1, labels = {})
    @mutex.synchronize do
      key = metric_key(name, labels)
      @metrics[key] ||= { type: 'counter', value: 0 }
      @metrics[key][:value] += value
    end
  end

  def start
    server = WEBrick::HTTPServer.new(Port: @port)
    
    server.mount_proc '/metrics' do |req, res|
      res.body = format_metrics
      res.content_type = 'text/plain'
    end

    trap('INT') { server.shutdown }
    server.start
  end

  private

  def metric_key(name, labels)
    label_str = labels.sort.map { |k, v| "#{k}=\"#{v}\"" }.join(',')
    label_str.empty? ? name : "#{name}{#{label_str}}"
  end

  def format_metrics
    @mutex.synchronize do
      # The Prometheus text format expects every line, including the last,
      # to end with a newline
      @metrics.map do |key, data|
        "#{key} #{data[:value]}\n"
      end.join
    end
  end
end

# Background thread updating metrics
endpoint = MetricsEndpoint.new

Thread.new do
  loop do
    endpoint.set_gauge('memory_bytes', ObjectSpace.memsize_of_all)
    endpoint.set_gauge('gc_runs', GC.stat[:count])
    endpoint.increment_counter('requests_total', 1, path: '/api/users')
    sleep(1)
  end
end

endpoint.start

Agent-Based Monitoring: Dedicated monitoring agents run alongside applications on the same host, collecting system and application metrics. Agents aggregate metrics locally before forwarding to central systems, reducing network traffic and providing local caching during network outages. This approach separates monitoring concerns from application code. Applications write metrics to local files, sockets, or shared memory, and agents handle collection, aggregation, and transmission. Agents add operational complexity through additional process management, configuration, and resource consumption.
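The local aggregation step can be sketched in a few lines. The `LocalAgent` class below is hypothetical and runs in-process for illustration; a real agent would be a separate process reading from a socket or file. It assumes applications emit StatsD-style counter lines.

```ruby
# Hypothetical in-host agent: accumulates raw counter lines and forwards
# one aggregated value per metric at flush time.
class LocalAgent
  def initialize(&forwarder)
    @forwarder = forwarder
    @counters = Hash.new(0)
  end

  # Accepts lines such as "requests.total:1|c" written by applications;
  # anything that is not a counter is ignored in this sketch
  def receive(line)
    name, rest = line.strip.split(':', 2)
    value, type = rest.to_s.split('|', 2)
    @counters[name] += value.to_i if type == 'c'
  end

  # Sends the accumulated batch upstream and resets local state
  def flush
    batch = @counters.dup
    @counters.clear
    @forwarder.call(batch) unless batch.empty?
    batch
  end
end

forwarded = []
agent = LocalAgent.new { |batch| forwarded << batch }
3.times { agent.receive('requests.total:1|c') }
agent.receive('errors.total:2|c')
agent.flush
puts forwarded.inspect
```

Four incoming lines become a single forwarded batch, which is where the reduction in network traffic comes from.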

Logging-Based Monitoring: Applications emit structured logs containing metric data. Log aggregation systems parse logs to extract metrics, treating logs as a unified stream of observability data. This approach requires no separate metric collection infrastructure but depends on log processing capacity. Logs provide rich context around metrics but consume more bandwidth and storage than dedicated metric protocols. The parsing overhead affects processing costs and introduces latency between event occurrence and metric availability.
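The following sketch shows both halves of this approach using only the standard library: an application writing one JSON object per line, and an aggregator deriving metrics from that stream. The helper names and field names (`event`, `duration_ms`) are assumptions for the example.

```ruby
require 'json'
require 'stringio'

# Application side: emit one JSON object per line
def log_event(io, event, fields = {})
  io.puts(JSON.generate({ event: event, ts: Time.now.to_f }.merge(fields)))
end

# Aggregator side: derive request metrics from the log stream
def extract_metrics(io)
  durations = []
  io.each_line do |line|
    record = JSON.parse(line, symbolize_names: true)
    durations << record[:duration_ms] if record[:event] == 'request'
  rescue JSON::ParserError
    next # skip lines that are not structured logs
  end
  average = durations.empty? ? 0.0 : durations.sum / durations.size.to_f
  { request_count: durations.size, avg_duration_ms: average }
end

log = StringIO.new
log_event(log, 'request', path: '/users', duration_ms: 120)
log_event(log, 'request', path: '/orders', duration_ms: 80)
log.puts('plain text line that is not JSON')
log.rewind
puts extract_metrics(log)
```

Every line must be parsed to produce two numbers, which illustrates the processing cost mentioned above.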

Synthetic Monitoring: External systems probe application endpoints to measure availability and performance from a user perspective. Synthetic monitoring detects failures invisible to internal metrics: DNS resolution failures, TLS certificate expiration, geographic routing issues. Health check endpoints return application status, dependency health, and resource availability. Synthetic monitoring complements internal metrics by validating external accessibility.

require 'json'
require 'time'            # for Time#iso8601
require 'webrick'
require 'sys/filesystem'  # used by the disk_space check

class HealthCheckServer
  def initialize(port: 8080)
    @port = port
    @checks = {}
  end

  def add_check(name, &block)
    @checks[name] = block
  end

  def start
    server = WEBrick::HTTPServer.new(Port: @port)
    
    server.mount_proc '/health' do |req, res|
      results = run_checks
      res.status = results[:healthy] ? 200 : 503
      res.body = JSON.generate(results)
      res.content_type = 'application/json'
    end

    server.start
  end

  private

  def run_checks
    check_results = {}
    healthy = true

    @checks.each do |name, check|
      begin
        result = check.call
        check_results[name] = result
        healthy = false unless result[:status] == 'ok'
      rescue => e
        check_results[name] = { status: 'error', message: e.message }
        healthy = false
      end
    end

    {
      healthy: healthy,
      checks: check_results,
      timestamp: Time.now.iso8601
    }
  end
end

health = HealthCheckServer.new

health.add_check('database') do
  # Simulate database check
  { status: 'ok', latency_ms: 12 }
end

health.add_check('redis') do
  { status: 'ok', latency_ms: 3 }
end

health.add_check('disk_space') do
  stat = Sys::Filesystem.stat('/')
  usage = (stat.bytes_used.to_f / stat.bytes_total * 100).round(2)
  
  if usage > 90
    { status: 'critical', usage_percent: usage }
  elsif usage > 75
    { status: 'warning', usage_percent: usage }
  else
    { status: 'ok', usage_percent: usage }
  end
end

health.start
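The health check server above is the target of synthetic monitoring; the probe itself runs elsewhere. A minimal sketch of such a probe follows, using Net::HTTP from the standard library. The `probe` helper, the URL, and the two-second timeout are assumptions for illustration.

```ruby
require 'net/http'
require 'uri'

# Hypothetical external probe: reports availability and latency for one URL
def probe(url, timeout: 2)
  uri = URI(url)
  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  response = Net::HTTP.start(uri.host, uri.port,
                             use_ssl: uri.scheme == 'https',
                             open_timeout: timeout,
                             read_timeout: timeout) do |http|
    http.get(uri.request_uri)
  end
  elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
  status = response.code.to_i
  { up: status < 400, status: status, latency_ms: (elapsed * 1000).round(1) }
rescue StandardError => e
  # Connection refused, DNS failure, TLS error, timeout: all count as down
  { up: false, error: e.class.name }
end

puts probe('http://127.0.0.1:8080/health')
```

Because the probe treats any exception as "down", it catches failures such as DNS or TLS errors that never reach the application and therefore never appear in its internal metrics.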

Tools & Ecosystem

Ruby's monitoring ecosystem includes system-level tools, application performance monitoring solutions, and cloud-native monitoring integrations. Selecting appropriate tools depends on deployment environment, scale requirements, and existing infrastructure.

System Monitoring Gems: Several gems provide access to operating system metrics. The sys-proctable gem offers process table information. The sys-cpu gem exposes CPU statistics. The sys-filesystem gem provides filesystem metrics. These gems wrap platform-specific system calls behind Ruby interfaces for Linux, macOS, and Windows, though the fields available and their units differ between platforms.

Application Performance Monitoring: APM platforms provide comprehensive monitoring including distributed tracing, error tracking, and performance profiling. New Relic, Datadog, Scout APM, and AppSignal offer Ruby agents that automatically instrument common frameworks like Rails and Sinatra. These agents inject monitoring code into application execution paths, capturing request traces, database queries, cache operations, and external API calls.

# New Relic configuration example
require 'newrelic_rpm'

class ApplicationController
  # Automatic instrumentation via agent

  def process_order
    # Custom instrumentation
    NewRelic::Agent.record_custom_event('OrderProcessed', {
      order_id: order.id,
      amount: order.total,
      processing_time: elapsed_time
    })
  end
end

# Manual transaction tracing
NewRelic::Agent::Tracer.in_transaction(category: :task, name: 'DataImport') do
  perform_import
end

Prometheus Integration: Prometheus has become a standard for metrics collection in containerized environments. The prometheus-client gem provides Ruby integration, exposing metrics in Prometheus format.

require 'prometheus/client'
require 'prometheus/client/formats/text'

# Initialize registry
prometheus = Prometheus::Client.registry

# Define metrics
http_requests = prometheus.counter(
  :http_requests_total,
  docstring: 'Total HTTP requests',
  labels: [:method, :path, :status]
)

response_duration = prometheus.histogram(
  :http_response_duration_seconds,
  docstring: 'Response duration in seconds',
  labels: [:method, :path]
)

active_connections = prometheus.gauge(
  :http_active_connections,
  docstring: 'Currently active connections'
)

# Record metrics
http_requests.increment(labels: { method: 'GET', path: '/users', status: 200 })
response_duration.observe(0.234, labels: { method: 'GET', path: '/users' })
active_connections.set(42)

# Expose metrics endpoint
require 'webrick'

server = WEBrick::HTTPServer.new(Port: 9394)
server.mount_proc '/metrics' do |req, res|
  res.body = Prometheus::Client::Formats::Text.marshal(prometheus)
  res.content_type = Prometheus::Client::Formats::Text::CONTENT_TYPE
end
server.start

StatsD Integration: StatsD provides a simple UDP-based protocol for metric collection. The statsd-ruby gem offers a lightweight client for pushing metrics to StatsD servers.

require 'statsd'  # provided by the statsd-ruby gem

statsd = Statsd.new('localhost', 8125)

# Timing blocks
statsd.time('db.query') do
  database.execute(query)
end

# Manual timing
start = Time.now
perform_operation
statsd.timing('operation.duration', ((Time.now - start) * 1000).to_i)

# Counters and gauges (plain StatsD has no tags, so dimensions are
# encoded in the metric name; tag support requires a DogStatsD-style client)
statsd.increment('api.requests.users.get')
statsd.gauge('queue.depth', background_queue.size)

# Batching for efficiency
statsd.batch do |s|
  s.increment('page.views')
  s.timing('page.load', 320)
  s.gauge('active.users', 1523)
end

OpenTelemetry: OpenTelemetry provides vendor-neutral observability instrumentation. The opentelemetry-sdk and framework-specific gems enable distributed tracing and metrics collection.

require 'opentelemetry/sdk'
require 'opentelemetry/instrumentation/all'

OpenTelemetry::SDK.configure do |c|
  c.service_name = 'my-ruby-app'
  c.use_all
end

# Manual span creation
tracer = OpenTelemetry.tracer_provider.tracer('my-app')

tracer.in_span('process_request') do |span|
  span.set_attribute('user.id', user_id)
  span.set_attribute('request.size', request_body.bytesize)
  
  process_request(request_body)
end

# Automatic instrumentation for frameworks
# Rails, Sinatra, Rack instrumented automatically

Cloud Provider Monitoring: AWS CloudWatch, Google Cloud Monitoring, and Azure Monitor provide integrated monitoring for cloud deployments. Ruby SDKs enable custom metric publication and log streaming.

require 'aws-sdk-cloudwatch'

cloudwatch = Aws::CloudWatch::Client.new(region: 'us-east-1')

# Publish custom metrics
cloudwatch.put_metric_data({
  namespace: 'MyApp',
  metric_data: [
    {
      metric_name: 'ProcessedOrders',
      value: 42,
      unit: 'Count',
      timestamp: Time.now,
      dimensions: [
        { name: 'Environment', value: 'production' }
      ]
    },
    {
      metric_name: 'ResponseTime',
      value: 0.234,
      unit: 'Seconds',
      timestamp: Time.now
    }
  ]
})

Logging Integration: Structured logging libraries like semantic_logger and lograge format logs for parsing by monitoring systems. JSON-formatted logs enable metric extraction from log aggregation platforms.

require 'semantic_logger'

SemanticLogger.add_appender(io: $stdout, formatter: :json)

logger = SemanticLogger['MyApp']

# Structured logging with metrics
logger.info('Request processed',
  method: 'GET',
  path: '/api/users',
  status: 200,
  duration_ms: 145,
  db_queries: 3
)

# Metric extraction from logs
logger.measure_info('Database query',
  metric: 'db/query/duration',
  payload: { table: 'users' }
) do
  database.query('SELECT * FROM users')
end

Practical Examples

Resource monitoring implementations vary based on application architecture and operational requirements. These examples demonstrate monitoring patterns for common scenarios.

Web Application Request Monitoring: Web applications require per-request metrics including response time, throughput, error rates, and resource consumption. Middleware captures metrics transparently without modifying application code.

require 'rack'

class RequestMonitoringMiddleware
  def initialize(app, metrics_client)
    @app = app
    @metrics = metrics_client
  end

  def call(env)
    # Monotonic clock: unaffected by system clock adjustments
    start_time = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    tags = {
      method: env['REQUEST_METHOD'],
      path: sanitize_path(env['PATH_INFO'])
    }

    begin
      status, headers, body = @app.call(env)
      duration = Process.clock_gettime(Process::CLOCK_MONOTONIC) - start_time

      # counter(metric, value, tags) matches the StatsDClient defined earlier,
      # which has no increment method
      @metrics.timing('http.request.duration', (duration * 1000).round, tags.merge(status: status))
      @metrics.counter('http.request.count', 1, tags.merge(status: status))

      [status, headers, body]
    rescue => e
      @metrics.counter('http.request.errors', 1, tags.merge(error_class: e.class.name))
      raise
    end
  end

  private

  def sanitize_path(path)
    # Remove IDs to reduce cardinality
    path.gsub(/\/\d+/, '/:id')
  end
end

# Rails integration
class Application < Rails::Application
  config.middleware.use RequestMonitoringMiddleware, StatsDClient.new
end

Background Job Monitoring: Background job systems require monitoring for queue depth, processing time, failure rates, and retry counts. This visibility identifies performance bottlenecks and capacity requirements.

class MonitoredJob
  def self.perform(*args)
    job_name = self.name
    start_time = Process.clock_gettime(Process::CLOCK_MONOTONIC)

    # StatsDClient (defined earlier) exposes counter(metric, value, tags)
    # rather than increment
    metrics.gauge('jobs.queue.depth', queue_depth, job: job_name)
    metrics.counter('jobs.started', 1, job: job_name)

    begin
      result = perform_work(*args)

      duration = Process.clock_gettime(Process::CLOCK_MONOTONIC) - start_time
      metrics.timing('jobs.duration', (duration * 1000).round, job: job_name)
      metrics.counter('jobs.completed', 1, job: job_name, status: 'success')

      result
    rescue => e
      metrics.counter('jobs.completed', 1, job: job_name, status: 'failed')
      metrics.counter('jobs.errors', 1, job: job_name, error: e.class.name)
      raise
    end
  end

  def self.perform_work(*args)
    # Actual job logic
    raise NotImplementedError
  end

  def self.metrics
    @metrics ||= StatsDClient.new
  end

  def self.queue_depth
    # Fetch from job queue system
    0
  end
end

class EmailDeliveryJob < MonitoredJob
  def self.perform_work(user_id, template)
    user = User.find(user_id)
    EmailService.send(user.email, template)
  end
end

Database Connection Pool Monitoring: Connection pools require monitoring to detect exhaustion, identify leak patterns, and tune pool sizing. These metrics inform capacity planning and optimization efforts.

require 'connection_pool'

class MonitoredConnectionPool
  def initialize(pool)
    @pool = pool
    @metrics = StatsDClient.new
    start_monitoring
  end

  def with(&block)
    start_wait = Time.now
    
    @pool.with do |connection|
      wait_time = Time.now - start_wait
      @metrics.timing('pool.checkout.wait', wait_time * 1000)
      
      start_use = Time.now
      result = block.call(connection)
      use_time = Time.now - start_use
      
      @metrics.timing('pool.connection.use', use_time * 1000)
      result
    end
  end

  private

  def start_monitoring
    Thread.new do
      loop do
        @metrics.gauge('pool.size', @pool.size)
        @metrics.gauge('pool.available', @pool.available)
        sleep(10)
      end
    end
  end
end

# Usage
db_pool = ConnectionPool.new(size: 5, timeout: 5) do
  DatabaseConnection.new
end

monitored_pool = MonitoredConnectionPool.new(db_pool)

monitored_pool.with do |conn|
  conn.execute('SELECT * FROM users')
end

Memory Leak Detection: Long-running processes require memory monitoring to detect leaks early. Tracking object counts and memory growth patterns reveals memory management issues.

require 'objspace'  # for ObjectSpace.memsize_of_all

class MemoryLeakDetector
  def initialize(interval: 300)
    @interval = interval
    @baseline = nil
    @samples = []
    @threshold_percent = 20
  end

  def start
    @running = true
    
    Thread.new do
      while @running
        sample = collect_sample
        analyze_sample(sample)
        sleep(@interval)
      end
    end
  end

  def stop
    @running = false
  end

  private

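  def collect_sample
    GC.start  # force a full GC so the sample reflects retained objects

    gc = GC.stat  # key names as reported on Ruby 2.2 and later
    {
      timestamp: Time.now,
      memory_bytes: ObjectSpace.memsize_of_all,
      object_count: ObjectSpace.count_objects[:TOTAL],
      gc_runs: gc[:count],
      heap_live_slots: gc[:heap_live_slots],
      heap_free_slots: gc[:heap_free_slots]
    }
  end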

  def analyze_sample(sample)
    @samples << sample
    @samples.shift if @samples.size > 20
    
    @baseline ||= sample
    
    memory_growth = (sample[:memory_bytes] - @baseline[:memory_bytes]) / 
                    @baseline[:memory_bytes].to_f * 100
    
    if memory_growth > @threshold_percent
      alert_leak_detected(sample, memory_growth)
    end
    
    log_sample(sample)
  end

  def alert_leak_detected(sample, growth)
    puts "MEMORY LEAK DETECTED: #{growth.round(2)}% growth"
    puts "Current: #{sample[:memory_bytes] / 1024 / 1024} MB"
    puts "Baseline: #{@baseline[:memory_bytes] / 1024 / 1024} MB"
    puts "Objects: #{sample[:object_count]}"
  end

  def log_sample(sample)
    puts "[#{sample[:timestamp]}] Memory: #{sample[:memory_bytes] / 1024 / 1024} MB, " \
         "Objects: #{sample[:object_count]}, GC runs: #{sample[:gc_runs]}"
  end
end

detector = MemoryLeakDetector.new(interval: 60)
detector.start
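
Once growth is detected, it helps to see which kinds of objects are accumulating. One way to do that is to diff two `ObjectSpace.count_objects` snapshots. The helper name `object_count_growth` and the `min_growth` cutoff below are illustrative assumptions, and `count_objects` groups by internal type (`:T_STRING`, `:T_ARRAY`, and so on) rather than by Ruby class.

```ruby
# Hypothetical helper: diff two ObjectSpace.count_objects snapshots and
# return the internal types whose count grew by at least min_growth,
# largest growth first.
def object_count_growth(before, after, min_growth: 1_000)
  growth = after.each_with_object({}) do |(type, count), result|
    next if %i[TOTAL FREE].include?(type) # slot totals, not object types

    delta = count - before.fetch(type, 0)
    result[type] = delta if delta >= min_growth
  end
  growth.sort_by { |_type, delta| -delta }.to_h
end

GC.start
before = ObjectSpace.count_objects

retained = Array.new(5_000) { |i| "leaked-#{i}" } # simulate retained strings

GC.start
after = ObjectSpace.count_objects

puts "Retained #{retained.size} strings"
object_count_growth(before, after).each do |type, delta|
  puts "#{type}: +#{delta}"
end
```

The exact numbers depend on the Ruby version and GC state, so treat the output as a hint about where to look rather than a precise measurement.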

Performance Considerations

Resource monitoring introduces overhead that affects application performance. Understanding performance characteristics enables informed decisions about monitoring strategies and configurations.

Collection Overhead: Metric collection consumes CPU cycles. Reading process statistics, traversing object graphs, and calculating aggregates require processing time. High-frequency collection amplifies overhead. A system sampling every second spends more CPU on monitoring than one sampling every minute. The overhead scales with the number of monitored metrics and the complexity of calculations. Simple counters impose minimal overhead. Histogram calculations involving sorting or percentile computation consume more resources. Production systems typically allocate 1-3% of CPU capacity to monitoring infrastructure.
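
A rough way to check where a collector sits relative to that budget is to measure the CPU time a single collection consumes. The sketch below uses `Process.clock_gettime` with the process CPU clock; `collect_metrics` is a hypothetical stand-in for real collection work, and the 1-second interval is an assumed value.

```ruby
# Hypothetical stand-in for whatever a real collector does per sample.
def collect_metrics
  { gc_count: GC.stat[:count], slots: ObjectSpace.count_objects[:TOTAL] }
end

# CPU time consumed by this process so far, in seconds.
def cpu_seconds
  Process.clock_gettime(Process::CLOCK_PROCESS_CPUTIME_ID)
end

# Average CPU seconds spent per collection over `iterations` runs.
def collection_cost(iterations: 1_000)
  started = cpu_seconds
  iterations.times { collect_metrics }
  (cpu_seconds - started) / iterations
end

cost = collection_cost
interval = 1.0 # seconds between samples (assumed)
puts format('%.1f us per collection, %.4f%% of one core at a %.0fs interval',
            cost * 1_000_000, cost / interval * 100, interval)
```

Dividing the per-collection cost by the sampling interval gives the fraction of one core spent on collection, which can be compared against the budget directly.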

Memory Impact: Monitoring systems buffer metrics in memory before transmission or aggregation. Large metric sets or extended buffering intervals increase memory consumption. A system tracking 1,000 metrics with 1-second resolution generates 60,000 data points per minute. At 100 bytes per data point, this consumes 6 MB per minute of memory. Memory usage scales with metric cardinality: the number of unique label combinations. Labels like user IDs or request URLs create unbounded cardinality, potentially exhausting memory. Monitoring implementations should limit metric cardinality and implement buffer size limits.

class BoundedMetricBuffer
  def initialize(max_size: 10_000)
    @max_size = max_size
    @buffer = []
    @mutex = Mutex.new
    @dropped_count = 0
  end

  def add(metric)
    @mutex.synchronize do
      if @buffer.size < @max_size
        @buffer << metric
      else
        @dropped_count += 1
      end
    end
  end

  def flush
    @mutex.synchronize do
      metrics = @buffer.dup
      @buffer.clear
      
      if @dropped_count > 0
        puts "WARNING: Dropped #{@dropped_count} metrics due to buffer overflow"
        @dropped_count = 0
      end
      
      metrics
    end
  end
end
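
Bounding the buffer limits how many points are held, but not how many distinct series exist. A minimal sketch of a cardinality guard follows, assuming a hypothetical `CardinalityLimiter` class that collapses new label sets into a shared overflow series once a per-metric cap is reached.

```ruby
require 'set'

# Hypothetical guard that caps the number of distinct label sets per metric.
class CardinalityLimiter
  OVERFLOW_LABELS = { overflow: 'true' }.freeze

  def initialize(max_series: 1_000)
    @max_series = max_series
    @seen = Hash.new { |hash, name| hash[name] = Set.new }
    @mutex = Mutex.new
  end

  # Returns the labels to record: the originals while under the cap,
  # otherwise a single shared overflow label set.
  def admit(name, labels)
    key = labels.sort_by { |label, _value| label.to_s }

    @mutex.synchronize do
      series = @seen[name]
      return labels if series.include?(key)
      return OVERFLOW_LABELS if series.size >= @max_series

      series << key
      labels
    end
  end
end

limiter = CardinalityLimiter.new(max_series: 2)
p limiter.admit('http.requests', { path: '/users' })  # admitted
p limiter.admit('http.requests', { path: '/orders' }) # admitted
p limiter.admit('http.requests', { path: '/carts' })  # collapsed to overflow
```

Collapsing into an overflow series keeps totals roughly correct while making the cap visible on dashboards. Dropping the metric outright is the other common choice.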

Network Overhead: Push-based monitoring transmits metrics over the network, consuming bandwidth and adding latency. Sending metrics individually multiplies protocol overhead. Over IPv4, each UDP datagram carries 28 bytes of headers (20 bytes of IP header plus 8 bytes of UDP header) in addition to the metric payload, so sending 1,000 one-metric packets spends 28 KB on headers alone. Batching reduces this: combining 10 metrics per packet cuts header overhead by 90%. However, batching requires buffer memory and delays transmission. Network failures cause metric loss in UDP-based systems or blocking in TCP-based systems. Asynchronous transmission prevents network issues from blocking application logic.

require 'socket' # provides UDPSocket

class BatchingMetricsClient
  def initialize(host:, port:, batch_size: 50, flush_interval: 1.0)
    @host = host
    @port = port
    @batch_size = batch_size
    @flush_interval = flush_interval
    @buffer = []
    @mutex = Mutex.new
    @socket = UDPSocket.new
    
    start_flush_timer
  end

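  def metric(name, value, type, tags = {})
    formatted = format_metric(name, value, type, tags)

    # Decide under the lock, but flush after releasing it: Ruby's Mutex is
    # not reentrant, so calling flush inside synchronize raises ThreadError.
    batch_full = @mutex.synchronize do
      @buffer << formatted
      @buffer.size >= @batch_size
    end

    flush if batch_full
  end

  private

  def start_flush_timer
    Thread.new do
      loop do
        sleep(@flush_interval)
        flush
      end
    end
  end

  def flush
    # Drain the buffer under the lock; send outside it to keep the lock short
    batch = @mutex.synchronize do
      next if @buffer.empty?

      data = @buffer.join("\n")
      @buffer.clear
      data
    end
    return if batch.nil?

    @socket.send(batch, 0, @host, @port)
  rescue StandardError
    # Swallow send failures so monitoring never breaks application code
    nil
  end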

  def format_metric(name, value, type, tags)
    tag_string = tags.map { |k, v| "#{k}:#{v}" }.join(',')
    tags_part = tag_string.empty? ? '' : "|##{tag_string}"
    "#{name}:#{value}|#{type}#{tags_part}"
  end
end
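
The header arithmetic behind batching can be checked with a few lines. This is a sketch assuming IPv4 (20-byte IP header plus 8-byte UDP header); `header_overhead` is a hypothetical helper, not part of any client library.

```ruby
# 20-byte IPv4 header + 8-byte UDP header per datagram (IPv4 assumed).
HEADER_BYTES = 28

# Total header bytes needed to ship `metric_count` metrics at a given batch size.
def header_overhead(metric_count, batch_size)
  packets = (metric_count.to_f / batch_size).ceil
  packets * HEADER_BYTES
end

unbatched = header_overhead(1_000, 1)
batched = header_overhead(1_000, 10)
savings = (1 - batched.to_f / unbatched) * 100

puts "Header bytes: #{unbatched} -> #{batched} (#{savings.round}% less)"
```

In practice the batch size is also bounded by datagram size: a batch that exceeds the path MTU is fragmented or dropped, so larger batches are not always better.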

Storage Requirements: Long-term metric storage consumes disk space. A system with 10,000 metrics sampled every 10 seconds generates 86.4 million data points daily. At 12 bytes per data point (timestamp, value, metadata), this requires about 1 GB daily, or roughly 380 GB annually, before compression. Time series databases use compression to reduce storage requirements by 10-100x. Downsampling stores high-resolution data for short periods and aggregated data for longer retention. Storing one-second resolution for 24 hours, one-minute resolution for 30 days, and hourly averages for one year reduces storage requirements significantly while maintaining analytical capability.
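
The storage estimate and the downsampling idea can be sketched in a few lines. Both helper names (`daily_storage_bytes`, `downsample`) are illustrative assumptions, and points are modelled as plain `[timestamp, value]` pairs rather than any particular database format.

```ruby
SECONDS_PER_DAY = 86_400

# Raw bytes per day for a given metric count, sampling interval, and point size.
def daily_storage_bytes(metrics:, interval_seconds:, bytes_per_point:)
  metrics * (SECONDS_PER_DAY / interval_seconds) * bytes_per_point
end

# Downsample [timestamp, value] pairs into fixed-width buckets of averages.
def downsample(points, bucket_seconds:)
  points
    .group_by { |timestamp, _value| timestamp - (timestamp % bucket_seconds) }
    .map { |bucket_start, group| [bucket_start, group.sum { |_t, v| v }.to_f / group.size] }
    .sort_by(&:first)
end

bytes = daily_storage_bytes(metrics: 10_000, interval_seconds: 10, bytes_per_point: 12)
puts "Uncompressed: #{(bytes / 1e9).round(2)} GB/day"

per_second = (0...120).map { |t| [t, t.to_f] } # two minutes of 1-second samples
puts downsample(per_second, bucket_seconds: 60).inspect
```

Averaging is only one choice of aggregate. Keeping the minimum and maximum per bucket alongside the mean preserves spikes that an average would smooth away.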

Query Performance: Metric queries scan time series data to compute aggregates, identify trends, or generate visualizations. Query performance depends on data volume, time range, aggregation complexity, and index structures. A query spanning one week across 1,000 metrics at 10-second resolution evaluates about 60 million data points. Efficient queries use pre-computed aggregates, filter early on time ranges, and limit result cardinality. Monitoring systems should establish query performance budgets and implement timeouts to prevent runaway queries from overloading storage systems.
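
Filtering early on the time range and capping the amount of data scanned can be illustrated on an in-memory series. This sketch assumes points are `[timestamp, value]` pairs already sorted by timestamp; the method names and the `max_points` budget are hypothetical.

```ruby
# Select the points inside [from, to] with two binary searches instead of
# a full scan. Assumes `points` is sorted by timestamp.
def points_in_range(points, from:, to:)
  first = points.bsearch_index { |timestamp, _value| timestamp >= from }
  return [] if first.nil?

  last = points.bsearch_index { |timestamp, _value| timestamp > to } || points.size
  points[first...last]
end

# Aggregate only the selected window, refusing windows over the budget.
def average_over(points, from:, to:, max_points: 100_000)
  window = points_in_range(points, from: from, to: to)
  raise ArgumentError, "window too large: #{window.size} points" if window.size > max_points
  return nil if window.empty?

  window.sum { |_timestamp, value| value }.to_f / window.size
end

series = (0...1_000).map { |t| [t * 10, t % 5] } # 10-second resolution
puts average_over(series, from: 100, to: 190)
```

A point budget is a simple stand-in for a query timeout: it rejects expensive queries before they run instead of interrupting them part-way through.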

Sampling Trade-offs: Sampling frequency creates a trade-off between temporal resolution and overhead. High-frequency sampling captures short-lived events but increases costs. Low-frequency sampling reduces overhead but misses transient issues. A five-minute sampling interval misses spikes lasting seconds. Adaptive sampling increases frequency during detected anomalies while maintaining low baseline overhead. Event-triggered monitoring captures metrics when significant changes occur rather than on fixed schedules.
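
Adaptive sampling can be sketched as an interval that halves while readings exceed a threshold and doubles back toward the baseline once they recover. The `AdaptiveSampler` class, its default intervals, and the 80% threshold are illustrative assumptions.

```ruby
# Hypothetical adaptive sampler: shortens the interval while readings exceed
# a threshold, and backs off gradually once they return to normal.
class AdaptiveSampler
  attr_reader :interval

  def initialize(base_interval: 60.0, min_interval: 1.0, threshold: 80.0)
    @base_interval = base_interval
    @min_interval = min_interval
    @threshold = threshold
    @interval = base_interval
  end

  # Feed one reading; returns the number of seconds to wait before the next sample.
  def observe(value)
    @interval =
      if value > @threshold
        [@interval / 2.0, @min_interval].max  # anomaly: sample faster
      else
        [@interval * 2.0, @base_interval].min # normal: drift back to baseline
      end
  end
end

sampler = AdaptiveSampler.new
[40, 95, 97, 99, 50].each do |cpu_percent|
  puts "cpu=#{cpu_percent}% next sample in #{sampler.observe(cpu_percent)}s"
end
```

Backing off gradually rather than jumping straight to the baseline keeps resolution high for a short while after an anomaly, when a recurrence is most likely.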

Reference

Metric Types

| Type | Description | Example Use Case | Implementation |
|---|---|---|---|
| Counter | Monotonically increasing value | Total requests processed | Increment on each event |
| Gauge | Point-in-time snapshot | Current memory usage | Set to current value |
| Histogram | Distribution of values | Response time distribution | Record each observation |
| Summary | Quantile calculations | p50, p95, p99 latency | Calculate percentiles |
| Timer | Duration measurement | Operation execution time | Record start and end |
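
To make the distinctions concrete, here is a minimal in-memory sketch of three of these types. The class names and methods are illustrative and do not correspond to any particular metrics library.

```ruby
# Minimal in-memory versions of three metric types (illustrative only).
class Counter
  attr_reader :value

  def initialize
    @value = 0
  end

  def increment(by = 1)
    raise ArgumentError, 'counters only increase' if by.negative?

    @value += by
  end
end

class Gauge
  attr_reader :value

  def set(value)
    @value = value
  end
end

class Histogram
  def initialize
    @observations = []
  end

  def observe(value)
    @observations << value
  end

  # Nearest-rank percentile; sorts on every call, so it costs O(n log n).
  def percentile(pct)
    return nil if @observations.empty?

    sorted = @observations.sort
    rank = (pct * sorted.size / 100.0).ceil
    sorted[[rank, 1].max - 1]
  end
end

latency = Histogram.new
(1..100).each { |ms| latency.observe(ms) }
puts "p50=#{latency.percentile(50)} p99=#{latency.percentile(99)}"
```

Storing every observation is the simplest approach but grows without bound. Production libraries typically use fixed buckets or streaming quantile estimators instead.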

Common System Metrics

| Metric | Source | Interpretation | Collection Method |
|---|---|---|---|
| CPU Usage | /proc/stat | Percentage of CPU time used | Sample over interval |
| Memory RSS | /proc/[pid]/status | Physical memory consumed | Read from procfs |
| Disk I/O | /proc/diskstats | Read/write operations and bytes | Calculate deltas |
| Network Traffic | /proc/net/dev | Bytes sent/received | Calculate deltas |
| Load Average | /proc/loadavg | Process queue depth | Direct read |
| File Descriptors | /proc/[pid]/fd | Open file handles | Count directory entries |
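
A few of these sources can be read with very little code. The sketch below is Linux-only at the point where it touches `/proc`; the parser functions (`parse_loadavg`, `parse_rss_kb`) are hypothetical helpers that take file contents as strings so they can be exercised on any platform.

```ruby
# /proc/loadavg looks like: "0.42 0.35 0.30 1/123 4567"
def parse_loadavg(text)
  one, five, fifteen = text.split.first(3).map(&:to_f)
  { one: one, five: five, fifteen: fifteen }
end

# /proc/[pid]/status contains a line such as "VmRSS:     12345 kB"
def parse_rss_kb(status_text)
  match = status_text.match(/^VmRSS:\s+(\d+)\s+kB/)
  match && match[1].to_i
end

# Only touch procfs when it is actually available (Linux).
if File.readable?('/proc/loadavg')
  puts parse_loadavg(File.read('/proc/loadavg'))
  puts "RSS: #{parse_rss_kb(File.read('/proc/self/status'))} kB"
  puts "Open FDs: #{Dir.children('/proc/self/fd').size}"
end
```

Separating parsing from file access also makes the parsers easy to test against captured samples.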

Ruby Memory Metrics

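| Metric | Method | Description | Use Case |
|---|---|---|---|
| Object Count | ObjectSpace.count_objects | Heap slots by internal type (T_STRING, T_ARRAY, ...) | Detect object accumulation |
| Total Memory | ObjectSpace.memsize_of_all (requires 'objspace') | Approximate memory held by live objects | Track memory growth |
| GC Runs | GC.stat[:count] | Garbage collection count | Assess GC pressure |
| Heap Pages | GC.stat[:heap_allocated_pages] | Allocated heap pages (was :heap_used before Ruby 2.2) | Monitor heap growth |
| Heap Free Slots | GC.stat[:heap_free_slots] | Free object slots (was :heap_free before Ruby 2.2) | Detect memory pressure |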

Metric Label Best Practices

| Pattern | Cardinality | Example | Usage |
|---|---|---|---|
| Service Name | Low | service:api | Always include |
| Environment | Low | env:production | Always include |
| HTTP Method | Low | method:GET | Safe for HTTP |
| HTTP Path | Medium | path:/users/:id | Template URLs |
| Status Code | Low | status:200 | Safe for HTTP |
| Error Type | Medium | error:TimeoutError | Limit to known types |
| User ID | High | user_id:12345 | Avoid in metrics |
| Request ID | High | request_id:abc123 | Avoid in metrics |
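
Templating URLs means replacing the variable segments of a path before it becomes a label. A minimal sketch follows; the `template_path` helper and its two patterns (numeric IDs and UUIDs) are assumptions, and a framework's own route pattern is usually a better source when one is available.

```ruby
# Replace high-cardinality path segments with placeholders before using the
# path as a label. These patterns are assumptions; adjust them to your routes.
UUID_SEGMENT = /\A\h{8}-\h{4}-\h{4}-\h{4}-\h{12}\z/
NUMERIC_SEGMENT = /\A\d+\z/

def template_path(path)
  segments = path.split('?', 2).first.to_s.split('/') # drop the query string
  templated = segments.map do |segment|
    case segment
    when NUMERIC_SEGMENT then ':id'
    when UUID_SEGMENT then ':uuid'
    else segment
    end
  end.join('/')

  templated.empty? ? '/' : templated
end

puts template_path('/users/12345/orders/987?page=2')
puts template_path('/items/123e4567-e89b-12d3-a456-426614174000')
```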

Monitoring Pattern Decision Matrix

| Scenario | Pattern | Rationale | Implementation |
|---|---|---|---|
| Short-lived processes | Push-based | Process terminates before scrape | StatsD client |
| Long-running services | Pull-based | Simplifies application code | Prometheus endpoint |
| Serverless functions | Push-based | No persistent endpoint | CloudWatch SDK |
| Container orchestration | Pull-based | Service discovery integration | Prometheus with k8s |
| Batch jobs | Push-based | Single execution completion | Metric push on finish |
| High-frequency metrics | Local aggregation | Reduce network overhead | Agent-based collection |

Alert Threshold Guidelines

| Metric Type | Threshold Strategy | Example | Justification |
|---|---|---|---|
| Resource utilization | Percentage-based | CPU > 80% | Capacity buffer |
| Error rates | Rate-based | Errors > 1% of requests | Acceptable failure rate |
| Latency | Percentile-based | p99 > 1000ms | Tail latency impact |
| Queue depth | Absolute count | Queue > 10000 items | Processing capacity |
| Connection pools | Utilization rate | Pool > 90% full | Connection availability |
| Disk space | Percentage remaining | Disk < 10% free | Growth runway |
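
Thresholds like these can be expressed as data and evaluated against a set of current readings. The sketch below assumes a hypothetical `AlertRule` struct and metric names; the threshold values come from the examples above.

```ruby
# Hypothetical rule: a metric name, a comparison operator, and a threshold.
AlertRule = Struct.new(:metric, :comparison, :threshold) do
  def breached?(value)
    value.public_send(comparison, threshold)
  end
end

RULES = [
  AlertRule.new(:cpu_percent, :>, 80),
  AlertRule.new(:error_rate_percent, :>, 1),
  AlertRule.new(:p99_latency_ms, :>, 1000),
  AlertRule.new(:disk_free_percent, :<, 10)
].freeze

# Returns the rules whose metric is present in `readings` and out of bounds.
def breached_rules(readings, rules = RULES)
  rules.select do |rule|
    readings.key?(rule.metric) && rule.breached?(readings[rule.metric])
  end
end

readings = { cpu_percent: 91.5, error_rate_percent: 0.4, disk_free_percent: 7 }
breached_rules(readings).each do |rule|
  puts "ALERT #{rule.metric} #{rule.comparison} #{rule.threshold} (now #{readings[rule.metric]})"
end
```

Real alerting systems usually also require a condition to hold for some duration before firing, which this sketch omits.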

Sampling Interval Selection

| Monitoring Scope | Interval | Data Volume | Use Case |
|---|---|---|---|
| Real-time alerts | 1-10 seconds | Very high | Detect immediate issues |
| Performance analysis | 10-60 seconds | High | Investigate problems |
| Capacity planning | 5-15 minutes | Medium | Trend analysis |
| Cost optimization | 1 hour | Low | Long-term patterns |
| Compliance reporting | 1 day | Very low | Historical records |

Integration Patterns

| Tool | Protocol | Format | Transport | Ruby Gem |
|---|---|---|---|---|
| Prometheus | Pull/HTTP | Text exposition | HTTP GET | prometheus-client |
| StatsD | Push/UDP | Line protocol | UDP packets | statsd-ruby |
| Graphite | Push/TCP | Line protocol | TCP stream | graphite-api |
| InfluxDB | Push/HTTP | Line protocol | HTTP POST | influxdb-client |
| CloudWatch | Push/API | JSON | HTTPS | aws-sdk-cloudwatch |
| Datadog | Push/UDP | DogStatsD line protocol | UDP or Unix socket to local agent | dogstatsd-ruby |
| New Relic | Agent | Binary | HTTPS | newrelic_rpm |

Performance Budget Guidelines

| Resource | Budget | Measurement | Acceptable Impact |
|---|---|---|---|
| CPU overhead | 1-3% | Process CPU time | Minimal performance impact |
| Memory overhead | 50-200 MB | Process RSS | Depends on application size |
| Network bandwidth | 10-100 KB/s | Outbound traffic | Based on link capacity |
| Disk I/O | 100-1000 ops/s | Write operations | Depends on disk type |
| Request latency | <1 ms | p50 added latency | Imperceptible to users |