CrackedRuby - Resource Monitoring

Overview

Resource monitoring measures and tracks the consumption of system resources by applications and processes. This practice provides visibility into CPU utilization, memory allocation, disk operations, network traffic, and other system metrics. Resource monitoring serves multiple purposes: detecting performance degradation, identifying bottlenecks, capacity planning, and troubleshooting production issues.

Modern applications operate in complex environments with multiple interdependencies. A web application might consume database connections, cache memory, background job queues, and external API quotas. Without proper monitoring, resource exhaustion manifests as mysterious failures, degraded performance, or complete service outages. Resource monitoring transforms these opaque failures into observable, measurable events.

The practice encompasses several monitoring layers. Operating system metrics track CPU cycles, memory pages, disk sectors, and network packets. Process-level monitoring measures individual application resource consumption. Application-level monitoring examines domain-specific resources like database connection pools, thread pools, and cache hit rates. Infrastructure monitoring covers distributed system resources across multiple hosts.

# Basic system resource snapshot
require 'sys/cpu'
require 'sys/filesystem'

cpu_usage = Sys::CPU.load_avg
disk_stats = Sys::Filesystem.stat('/')
# Linux only: parses the "Mem:" row printed by `free -m`
memory_info = `free -m`.split("\n")[1].split

puts "CPU Load: #{cpu_usage[0]}"
puts "Disk Usage: #{(disk_stats.bytes_used.to_f / disk_stats.bytes_total * 100).round(2)}%"
puts "Memory Usage: #{memory_info[2].to_i} / #{memory_info[1].to_i} MB"

Resource monitoring operates through periodic sampling or continuous observation. Sampling captures resource states at fixed intervals, trading accuracy for lower overhead. Continuous monitoring tracks every resource event but consumes more resources itself. The monitoring approach affects both the accuracy of measurements and the performance impact on monitored systems.

Key Principles

Resource monitoring builds on several foundational concepts that define how systems observe and report resource consumption. Understanding these principles clarifies the mechanisms behind monitoring implementations and the trade-offs involved in different approaches.

Metrics Collection: Monitoring systems gather quantitative measurements of resource usage. Metrics represent point-in-time values (gauges), cumulative counts (counters), or distributions of measurements (histograms). A gauge records current memory usage. A counter tracks total requests processed. A histogram captures response time distribution. Each metric type serves specific analytical purposes and requires different storage strategies.

Sampling Intervals: Monitoring systems collect metrics at defined frequencies. Shorter intervals provide higher resolution but increase collection overhead. A one-second interval captures rapid fluctuations but generates 86,400 data points daily per metric. A one-minute interval reduces data volume by 98% but misses short-lived spikes. The sampling interval creates a trade-off between temporal resolution and resource consumption by the monitoring system itself.
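
The arithmetic behind these figures can be checked with a short sketch. The helper names samples_per_day and volume_reduction_percent are made up for this illustration and do not come from any library.

```ruby
SECONDS_PER_DAY = 24 * 60 * 60

# Number of samples a single metric produces per day at a given interval.
def samples_per_day(interval_seconds)
  SECONDS_PER_DAY / interval_seconds
end

# Percentage reduction in data volume when moving to a coarser interval.
def volume_reduction_percent(fine_interval, coarse_interval)
  fine = samples_per_day(fine_interval)
  coarse = samples_per_day(coarse_interval)
  ((fine - coarse).to_f / fine * 100).round(1)
end

puts samples_per_day(1)               # 86400
puts samples_per_day(60)              # 1440
puts volume_reduction_percent(1, 60)  # 98.3
```

Multiplying the per-metric figure by the number of metrics and hosts gives a first estimate of daily storage needs.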

Aggregation: Raw metrics generate excessive data volume for long-term storage. Aggregation computes summary statistics over time windows: averages, maximums, minimums, percentiles. A system might retain one-second resolution for one hour, one-minute resolution for one day, and hourly averages for one year. This hierarchical aggregation balances data retention costs against analytical requirements.
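
A minimal sketch of window-based aggregation, assuming samples arrive as hashes with integer :timestamp and numeric :value keys. The downsample helper is hypothetical, and a real time series store would do this incrementally rather than in one pass.

```ruby
# Roll raw samples up into fixed-size time windows.
# Each sample is a hash with :timestamp (integer seconds) and :value.
def downsample(samples, window_seconds)
  samples
    .group_by { |s| s[:timestamp] - (s[:timestamp] % window_seconds) }
    .sort_by { |window_start, _| window_start }
    .map do |window_start, group|
      values = group.map { |s| s[:value] }
      {
        window_start: window_start,
        count: values.size,
        min: values.min,
        max: values.max,
        avg: (values.sum.to_f / values.size).round(2)
      }
    end
end

# Two minutes of one-second samples with a single spike at t = 90
raw = (0...120).map { |t| { timestamp: t, value: t == 90 ? 95 : 20 } }

downsample(raw, 60).each { |window| puts window }
```

Keeping the maximum alongside the average matters here: the second window averages 21.25, which hides the spike to 95 that the max field preserves.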

Thresholds and Alerting: Monitoring systems compare metrics against defined limits. Threshold breaches trigger notifications, automated responses, or escalations. Simple thresholds check absolute values: alert when CPU exceeds 80%. Complex thresholds evaluate trends, rates of change, or relationships between metrics. Threshold configuration balances sensitivity to problems against false positive rates.
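
One way to reduce false positives is to require several consecutive breaches before alerting. The SustainedThreshold class below is an illustrative sketch, not part of any monitoring library, and the limit and window size are arbitrary.

```ruby
# Fires only when the last `consecutive` samples all exceed the limit,
# which filters out single-sample spikes.
class SustainedThreshold
  def initialize(limit:, consecutive: 3)
    @limit = limit
    @consecutive = consecutive
    @recent = []
  end

  # Returns true when the alert condition is met for this sample.
  def breached?(value)
    @recent << value
    @recent.shift if @recent.size > @consecutive
    @recent.size == @consecutive && @recent.all? { |v| v > @limit }
  end
end

cpu_alert = SustainedThreshold.new(limit: 80, consecutive: 3)

[50, 95, 60, 85, 90, 92, 70].each do |cpu|
  puts "cpu=#{cpu} alert=#{cpu_alert.breached?(cpu)}"
end
```

With these readings the isolated 95% sample does not alert. Only the run of 85, 90, 92 does, which trades a little detection latency for fewer false alarms.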

Resource Attribution: Monitoring attributes resource consumption to specific processes, users, or operations. Attribution identifies which components consume resources and enables targeted optimization. Process-level attribution tracks memory per process ID. Request-level attribution measures resource consumption per API endpoint. Without attribution, aggregate metrics obscure the source of resource pressure.
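
Request-level attribution can be sketched by measuring deltas around a block of work. The AttributionTracker class and the operation names are assumptions for this example. GC.stat(:total_allocated_objects) and CLOCK_PROCESS_CPUTIME_ID are process-wide, so in a multi-threaded server the deltas also include work done by other threads.

```ruby
# Attributes object allocations and CPU time to a named operation.
class AttributionTracker
  attr_reader :totals

  def initialize
    @totals = Hash.new do |hash, key|
      hash[key] = { calls: 0, allocated_objects: 0, cpu_seconds: 0.0 }
    end
  end

  # Runs the block and adds its allocation and CPU deltas to the operation's totals.
  def track(operation)
    objects_before = GC.stat(:total_allocated_objects)
    cpu_before = Process.clock_gettime(Process::CLOCK_PROCESS_CPUTIME_ID)
    yield
  ensure
    entry = @totals[operation]
    entry[:calls] += 1
    entry[:allocated_objects] += GC.stat(:total_allocated_objects) - objects_before
    entry[:cpu_seconds] += Process.clock_gettime(Process::CLOCK_PROCESS_CPUTIME_ID) - cpu_before
  end
end

tracker = AttributionTracker.new
tracker.track('GET /users')  { Array.new(10_000) { |i| "user_#{i}" } }
tracker.track('GET /health') { 'ok' }

tracker.totals.each { |operation, stats| puts "#{operation}: #{stats}" }
```

Comparing the two entries shows which endpoint drives allocation pressure, which an aggregate process-level figure would not reveal.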

# Metric collection with timestamp and labels
class MetricCollector
  def initialize
    @metrics = []
  end

  def record(name, value, labels = {})
    @metrics << {
      name: name,
      value: value,
      timestamp: Time.now.to_i,
      labels: labels
    }
  end

  def gauge(name, value, labels = {})
    record(name, value, labels.merge(type: 'gauge'))
  end

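  def counter(name, increment = 1, labels = {})
    # Look up the previous value with the same labels that get stored,
    # otherwise the lookup never matches and the counter never accumulates
    typed_labels = labels.merge(type: 'counter')
    current = find_metric(name, typed_labels) || 0
    record(name, current + increment, typed_labels)
  end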

  def histogram(name, value, labels = {})
    record(name, value, labels.merge(type: 'histogram'))
  end

  private

  def find_metric(name, labels)
    metric = @metrics.reverse.find { |m| m[:name] == name && m[:labels] == labels }
    metric ? metric[:value] : nil
  end
end

collector = MetricCollector.new
collector.gauge('memory.used', 4_200_000_000, service: 'api')
collector.counter('requests.total', 1, endpoint: '/users')
collector.histogram('response.time', 0.145, endpoint: '/users')

Overhead Management: Monitoring consumes resources: CPU for metric collection, memory for buffering, network bandwidth for transmission, disk for storage. High-frequency monitoring of many metrics creates measurable overhead. A production system might allocate 1-5% of resources to monitoring infrastructure. Exceeding this budget degrades application performance, creating a feedback loop where monitoring itself causes the problems it aims to detect.

Time Series Data: Resource metrics form time series: sequences of values indexed by timestamp. Time series analysis identifies trends, seasonality, anomalies, and correlations. A metric showing gradual memory growth indicates a leak. Periodic CPU spikes correlate with scheduled batch jobs. Time series storage systems optimize for append-heavy workloads and time-range queries characteristic of monitoring data.
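
Trend detection on a time series can be sketched with an ordinary least-squares slope. The trend_slope helper and the sample data are illustrative assumptions. Production systems usually add smoothing and a minimum observation window before treating growth as a leak.

```ruby
# Least-squares slope of value over time, in value units per second.
# points: array of [timestamp_seconds, value] pairs.
def trend_slope(points)
  n = points.size.to_f
  mean_t = points.sum { |t, _| t } / n
  mean_v = points.sum { |_, v| v } / n
  covariance = points.sum { |t, v| (t - mean_t) * (v - mean_v) }
  variance = points.sum { |t, _| (t - mean_t)**2 }
  return 0.0 if variance.zero?

  covariance / variance
end

# Memory in MB sampled once a minute, growing by 2 MB per sample
leaking = (0...10).map { |i| [i * 60, 500 + 2 * i] }
# Flat memory usage over the same period
steady = (0...10).map { |i| [i * 60, 500] }

leaking_mb_per_hour = (trend_slope(leaking) * 3600).round(1)
steady_mb_per_hour = (trend_slope(steady) * 3600).round(1)

puts "leaking: #{leaking_mb_per_hour} MB/hour"
puts "steady: #{steady_mb_per_hour} MB/hour"
```

A persistently positive slope over a long window is the signal described above as gradual memory growth. A slope near zero suggests stable usage.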

Ruby Implementation

Ruby provides multiple approaches to resource monitoring through standard library modules, system calls, and third-party gems. The implementation strategy depends on the scope of monitoring: single process, multiple processes, or distributed systems.

Process Resource Monitoring: Ruby's Process module exposes basic resource information for the current process. The Process.clock_gettime method measures various time metrics, while operating system gems provide detailed resource statistics.

require 'sys/proctable'

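# Get current process information
ps_info = Sys::ProcTable.ps(pid: Process.pid)

# Units vary by platform. On Linux, utime/stime are reported in clock ticks
# and rss in memory pages, so convert before labeling values as seconds or MB.
puts "Process: #{ps_info.pid} (#{ps_info.comm})"
puts "CPU time (utime + stime): #{ps_info.utime + ps_info.stime}"
puts "Resident set size: #{ps_info.rss}"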
puts "Threads: #{ps_info.nlwp}"
puts "Priority: #{ps_info.priority}"

# Monitor resource usage over time
start_time = Process.clock_gettime(Process::CLOCK_MONOTONIC)
start_cpu = Process.times

# Perform work
1_000_000.times { |i| i ** 2 }

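end_time = Process.clock_gettime(Process::CLOCK_MONOTONIC)
end_cpu = Process.times

# CPU time is user time plus system time spent on behalf of the process
cpu_seconds = (end_cpu.utime + end_cpu.stime) - (start_cpu.utime + start_cpu.stime)

puts "Wall time: #{(end_time - start_time).round(3)}s"
puts "CPU time: #{cpu_seconds.round(3)}s"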

Memory Profiling: Ruby's ObjectSpace module enables memory analysis by inspecting allocated objects. This provides application-level memory monitoring distinct from operating system memory metrics.

require 'objspace'

# Take memory snapshot
before = ObjectSpace.memsize_of_all

# Create objects
array = Array.new(100_000) { |i| "string_#{i}" }

after = ObjectSpace.memsize_of_all

puts "Memory allocated: #{(after - before) / 1024 / 1024} MB"

# Count objects by class
object_counts = ObjectSpace.count_objects
puts "Total objects: #{object_counts[:TOTAL]}"
puts "Strings: #{object_counts[:T_STRING]}"
puts "Arrays: #{object_counts[:T_ARRAY]}"
puts "Hashes: #{object_counts[:T_HASH]}"

# Detailed object tracking
ObjectSpace.trace_object_allocations_start

users = Array.new(1000) { { name: "User", email: "user@example.com" } }

ObjectSpace.trace_object_allocations_stop

# Find allocation source
sample_object = users.first
file = ObjectSpace.allocation_sourcefile(sample_object)
line = ObjectSpace.allocation_sourceline(sample_object)
puts "Allocated at #{file}:#{line}"

System Metrics Collection: The sys-proctable, sys-cpu, and sys-filesystem gems provide cross-platform access to system metrics. These gems wrap platform-specific system calls into consistent Ruby interfaces.

require 'sys/cpu'
require 'sys/filesystem'
require 'sys/proctable'

class SystemMonitor
  def cpu_stats
    {
      load_average: Sys::CPU.load_avg,
      num_cpus: Sys::CPU.num_cpu,
      cpu_freq: Sys::CPU.cpu_freq
    }
  end

  def memory_stats
    total = `grep MemTotal /proc/meminfo`.split[1].to_i
    available = `grep MemAvailable /proc/meminfo`.split[1].to_i
    
    {
      total_kb: total,
      available_kb: available,
      used_kb: total - available,
      usage_percent: ((total - available).to_f / total * 100).round(2)
    }
  end

  def disk_stats(mount_point = '/')
    stat = Sys::Filesystem.stat(mount_point)
    
    {
      mount_point: mount_point,
      total_bytes: stat.bytes_total,
      used_bytes: stat.bytes_used,
      available_bytes: stat.bytes_available,
      usage_percent: (stat.bytes_used.to_f / stat.bytes_total * 100).round(2)
    }
  end

  def process_list
    Sys::ProcTable.ps.map do |proc|
      {
        pid: proc.pid,
        name: proc.comm,
        cpu_percent: calculate_cpu_percent(proc),
        memory_mb: proc.rss / 1024,
        state: proc.state
      }
    end.sort_by { |p| p[:memory_mb] }.reverse.take(10)
  end

  private

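  # Rough lifetime average since boot, not an instantaneous reading.
  # Assumes Linux, where utime/stime are clock ticks (typically 100 per second;
  # confirm with `getconf CLK_TCK`) and CLOCK_MONOTONIC approximates uptime.
  CLOCK_TICKS_PER_SECOND = 100.0

  def calculate_cpu_percent(proc)
    cpu_seconds = (proc.utime + proc.stime) / CLOCK_TICKS_PER_SECOND
    (cpu_seconds / Process.clock_gettime(Process::CLOCK_MONOTONIC) * 100).round(2)
  end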
end

monitor = SystemMonitor.new
puts monitor.cpu_stats
puts monitor.memory_stats
puts monitor.disk_stats

Custom Metric Instrumentation: Application-specific monitoring requires instrumentation code that records domain metrics. Ruby applications typically instrument critical paths with timing and counting logic.

class MetricInstrumentor
  def initialize
    @metrics = Hash.new { |h, k| h[k] = [] }
  end

  def time(metric_name, labels = {})
    start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    result = yield
    duration = Process.clock_gettime(Process::CLOCK_MONOTONIC) - start
    
    record(metric_name, duration, labels.merge(unit: 'seconds'))
    result
  end

  def increment(metric_name, value = 1, labels = {})
    record(metric_name, value, labels)
  end

  def gauge(metric_name, value, labels = {})
    @metrics[metric_name] = [{ value: value, labels: labels, timestamp: Time.now }]
  end

  def record(metric_name, value, labels = {})
    @metrics[metric_name] << {
      value: value,
      labels: labels,
      timestamp: Time.now
    }
  end

  def summary(metric_name)
    values = @metrics[metric_name].map { |m| m[:value] }
    return {} if values.empty?

    sorted = values.sort
    {
      count: values.size,
      sum: values.sum,
      mean: values.sum / values.size.to_f,
      min: sorted.first,
      max: sorted.last,
      p50: percentile(sorted, 0.5),
      p95: percentile(sorted, 0.95),
      p99: percentile(sorted, 0.99)
    }
  end

  private

  def percentile(sorted_values, p)
    index = (sorted_values.size * p).ceil - 1
    sorted_values[index]
  end
end

instrumentor = MetricInstrumentor.new

# Instrument database queries
instrumentor.time('db.query', table: 'users') do
  sleep(0.1)  # Simulate query
end

# Track cache hits/misses
instrumentor.increment('cache.hits', 1, cache: 'redis')
instrumentor.increment('cache.misses', 1, cache: 'redis')

# Record queue depth
instrumentor.gauge('queue.depth', 42, queue: 'background_jobs')

puts instrumentor.summary('db.query')

Periodic Metric Collection: Long-running monitoring requires scheduled metric collection. Ruby applications use threads, fibers, or separate processes to collect metrics without blocking main application logic.

require 'concurrent'
require 'objspace'  # needed for ObjectSpace.memsize_of_all

class PeriodicMonitor
  def initialize(interval: 60)
    @interval = interval
    @running = false
    @collectors = []
  end

  def add_collector(name, &block)
    @collectors << { name: name, block: block }
  end

  def start
    @running = true
    @task = Concurrent::TimerTask.new(execution_interval: @interval) do
      collect_metrics
    end
    @task.execute
  end

  def stop
    @running = false
    @task&.shutdown
  end

  private

  def collect_metrics
    @collectors.each do |collector|
      begin
        metrics = collector[:block].call
        report_metrics(collector[:name], metrics)
      rescue => e
        puts "Error collecting #{collector[:name]}: #{e.message}"
      end
    end
  end

  def report_metrics(name, metrics)
    puts "[#{Time.now}] #{name}: #{metrics}"
  end
end

monitor = PeriodicMonitor.new(interval: 5)

monitor.add_collector('memory') do
  {
    objects: ObjectSpace.count_objects[:TOTAL],
    size_mb: ObjectSpace.memsize_of_all / 1024 / 1024
  }
end

monitor.add_collector('gc') do
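  stats = GC.stat
  # Key names as of Ruby 2.2+; the older :heap_used and :heap_free keys return nil
  {
    count: stats[:count],
    heap_allocated_pages: stats[:heap_allocated_pages],
    heap_free_slots: stats[:heap_free_slots]
  }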
end

monitor.start
sleep(20)
monitor.stop

Implementation Approaches

Resource monitoring strategies vary based on monitoring scope, infrastructure architecture, and operational requirements. Different approaches balance accuracy, overhead, complexity, and cost.

Push-Based Monitoring: Applications actively send metrics to a central collection service. Each monitored process runs instrumentation code that periodically transmits metrics to collectors like StatsD, Prometheus push gateway, or cloud monitoring services. Push-based systems work well for short-lived processes, serverless functions, and batch jobs that terminate before scrapers can collect metrics. The application controls transmission timing and can batch metrics to reduce network overhead. However, push-based monitoring requires reliable network connectivity and creates coupling between applications and monitoring infrastructure.

require 'socket'

class StatsDClient
  def initialize(host: 'localhost', port: 8125)
    @socket = UDPSocket.new
    @host = host
    @port = port
  end

  def gauge(metric, value, tags = {})
    send_metric("#{metric}:#{value}|g#{format_tags(tags)}")
  end

  def counter(metric, value = 1, tags = {})
    send_metric("#{metric}:#{value}|c#{format_tags(tags)}")
  end

  def timing(metric, duration_ms, tags = {})
    send_metric("#{metric}:#{duration_ms}|ms#{format_tags(tags)}")
  end

  def histogram(metric, value, tags = {})
    send_metric("#{metric}:#{value}|h#{format_tags(tags)}")
  end

  private

  def send_metric(data)
    @socket.send(data, 0, @host, @port)
  rescue => e
    # Silently fail to avoid impacting application
    nil
  end

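  # Tags use the DogStatsD-style "|#key:value" suffix, an extension that
  # plain StatsD servers may not understand.
  def format_tags(tags)
    return '' if tags.empty?
    tag_string = tags.map { |k, v| "#{k}:#{v}" }.join(',')
    "|##{tag_string}"
  end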
end

# Usage in application code
statsd = StatsDClient.new

statsd.timing('api.request', 145, endpoint: '/users', method: 'GET')
statsd.counter('db.queries', 1, table: 'orders')
statsd.gauge('queue.size', 23, queue: 'mailers')

Pull-Based Monitoring: Monitoring systems scrape metrics from application endpoints at regular intervals. Applications expose metrics via HTTP endpoints that return current metric values. Prometheus popularized this approach with its scraping model. Pull-based monitoring centralizes collection timing and reduces application complexity. Applications only maintain current metric state rather than managing transmission logic. The monitoring system discovers targets, controls scraping frequency, and handles failures. Pull-based systems struggle with highly dynamic environments where monitored processes come and go rapidly, and network policies must allow scraper access to all monitored services.

require 'webrick'   # a separate gem since Ruby 3.0
require 'objspace'  # needed for ObjectSpace.memsize_of_all

class MetricsEndpoint
  def initialize(port: 9394)
    @port = port
    @metrics = {}
    @mutex = Mutex.new
  end

  def set_gauge(name, value, labels = {})
    @mutex.synchronize do
      key = metric_key(name, labels)
      @metrics[key] = { type: 'gauge', value: value }
    end
  end

  def increment_counter(name, value = 1, labels = {})
    @mutex.synchronize do
      key = metric_key(name, labels)
      @metrics[key] ||= { type: 'counter', value: 0 }
      @metrics[key][:value] += value
    end
  end

  def start
    server = WEBrick::HTTPServer.new(Port: @port)
    
    server.mount_proc '/metrics' do |req, res|
      res.body = format_metrics
      res.content_type = 'text/plain'
    end

    trap('INT') { server.shutdown }
    server.start
  end

  private

  def metric_key(name, labels)
    label_str = labels.sort.map { |k, v| "#{k}=\"#{v}\"" }.join(',')
    label_str.empty? ? name : "#{name}{#{label_str}}"
  end

  def format_metrics
    @mutex.synchronize do
      # Every line, including the last, ends with a newline as the
      # Prometheus text format expects
      @metrics.map do |key, data|
        "#{key} #{data[:value]}\n"
      end.join
    end
  end
end

# Background thread updating metrics
endpoint = MetricsEndpoint.new

Thread.new do
  loop do
    endpoint.set_gauge('memory_bytes', ObjectSpace.memsize_of_all)
    endpoint.set_gauge('gc_runs', GC.stat[:count])
    endpoint.increment_counter('requests_total', 1, path: '/api/users')
    sleep(1)
  end
end

endpoint.start

Agent-Based Monitoring: Dedicated monitoring agents run alongside applications on the same host, collecting system and application metrics. Agents aggregate metrics locally before forwarding to central systems, reducing network traffic and providing local caching during network outages. This approach separates monitoring concerns from application code. Applications write metrics to local files, sockets, or shared memory, and agents handle collection, aggregation, and transmission. Agents add operational complexity through additional process management, configuration, and resource consumption.

Logging-Based Monitoring: Applications emit structured logs containing metric data. Log aggregation systems parse logs to extract metrics, treating logs as a unified stream of observability data. This approach requires no separate metric collection infrastructure but depends on log processing capacity. Logs provide rich context around metrics but consume more bandwidth and storage than dedicated metric protocols. The parsing overhead affects processing costs and introduces latency between event occurrence and metric availability.
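
Metric extraction from structured logs can be sketched as follows. The field names (path, status, duration_ms) and the helper names are assumptions for this example, and a real pipeline would normally run inside the log aggregation platform rather than in application code.

```ruby
require 'json'

# Structured log lines as a log shipper might deliver them.
LOG_LINES = [
  '{"message":"Request processed","path":"/api/users","status":200,"duration_ms":145}',
  '{"message":"Request processed","path":"/api/users","status":500,"duration_ms":812}',
  '{"message":"Request processed","path":"/api/orders","status":200,"duration_ms":98}',
  'plain text line that is not JSON'
].freeze

# Returns the parsed event hash, or nil for lines that are not JSON objects.
def parse_event(line)
  event = JSON.parse(line)
  event.is_a?(Hash) ? event : nil
rescue JSON::ParserError
  nil
end

# Derives per-path request counts, error counts, and mean latency.
def metrics_from_logs(lines)
  stats = Hash.new { |h, k| h[k] = { requests: 0, errors: 0, total_ms: 0 } }

  lines.each do |line|
    event = parse_event(line)
    next unless event && event['path'] && event['duration_ms']

    entry = stats[event['path']]
    entry[:requests] += 1
    entry[:errors] += 1 if event['status'].to_i >= 500
    entry[:total_ms] += event['duration_ms']
  end

  stats.transform_values do |entry|
    entry.merge(mean_ms: (entry[:total_ms].to_f / entry[:requests]).round(1))
  end
end

metrics_from_logs(LOG_LINES).each { |path, stats| puts "#{path}: #{stats}" }
```

The unstructured line is skipped rather than raising. This is the usual choice for log-derived metrics, since a single malformed line should not stop extraction.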

Synthetic Monitoring: External systems probe application endpoints to measure availability and performance from a user perspective. Synthetic monitoring detects failures invisible to internal metrics: DNS resolution failures, TLS certificate expiration, geographic routing issues. Health check endpoints return application status, dependency health, and resource availability. Synthetic monitoring complements internal metrics by validating external accessibility.

require 'webrick'
require 'json'
require 'time'            # provides Time#iso8601
require 'sys/filesystem'  # used by the disk_space check

class HealthCheckServer
  def initialize(port: 8080)
    @port = port
    @checks = {}
  end

  def add_check(name, &block)
    @checks[name] = block
  end

  def start
    server = WEBrick::HTTPServer.new(Port: @port)
    
    server.mount_proc '/health' do |req, res|
      results = run_checks
      res.status = results[:healthy] ? 200 : 503
      res.body = JSON.generate(results)
      res.content_type = 'application/json'
    end

    server.start
  end

  private

  def run_checks
    check_results = {}
    healthy = true

    @checks.each do |name, check|
      begin
        result = check.call
        check_results[name] = result
        healthy = false unless result[:status] == 'ok'
      rescue => e
        check_results[name] = { status: 'error', message: e.message }
        healthy = false
      end
    end

    {
      healthy: healthy,
      checks: check_results,
      timestamp: Time.now.iso8601
    }
  end
end

health = HealthCheckServer.new

health.add_check('database') do
  # Simulate database check
  { status: 'ok', latency_ms: 12 }
end

health.add_check('redis') do
  { status: 'ok', latency_ms: 3 }
end

health.add_check('disk_space') do
  stat = Sys::Filesystem.stat('/')
  usage = (stat.bytes_used.to_f / stat.bytes_total * 100).round(2)
  
  if usage > 90
    { status: 'critical', usage_percent: usage }
  elsif usage > 75
    { status: 'warning', usage_percent: usage }
  else
    { status: 'ok', usage_percent: usage }
  end
end

health.start

Tools & Ecosystem

Ruby's monitoring ecosystem includes system-level tools, application performance monitoring solutions, and cloud-native monitoring integrations. Selecting appropriate tools depends on deployment environment, scale requirements, and existing infrastructure.

System Monitoring Gems: Several gems provide access to operating system metrics. The sys-proctable gem offers process table information across platforms. The sys-cpu gem exposes CPU statistics. The sys-filesystem gem provides filesystem metrics. These gems wrap platform-specific system calls, providing consistent interfaces across Linux, macOS, and Windows.

Application Performance Monitoring: APM platforms provide comprehensive monitoring including distributed tracing, error tracking, and performance profiling. New Relic, Datadog, Scout APM, and AppSignal offer Ruby agents that automatically instrument common frameworks like Rails and Sinatra. These agents inject monitoring code into application execution paths, capturing request traces, database queries, cache operations, and external API calls.

# New Relic configuration example
require 'newrelic_rpm'

class ApplicationController
  # Automatic instrumentation via agent

  def process_order
    # Custom instrumentation
    NewRelic::Agent.record_custom_event('OrderProcessed', {
      order_id: order.id,
      amount: order.total,
      processing_time: elapsed_time
    })
  end
end

# Manual transaction tracing
NewRelic::Agent::Tracer.in_transaction(category: :task, name: 'DataImport') do
  perform_import
end

Prometheus Integration: Prometheus has become a standard for metrics collection in containerized environments. The prometheus-client gem provides Ruby integration, exposing metrics in Prometheus format.

require 'prometheus/client'
require 'prometheus/client/formats/text'

# Initialize registry
prometheus = Prometheus::Client.registry

# Define metrics
http_requests = prometheus.counter(
  :http_requests_total,
  docstring: 'Total HTTP requests',
  labels: [:method, :path, :status]
)

response_duration = prometheus.histogram(
  :http_response_duration_seconds,
  docstring: 'Response duration in seconds',
  labels: [:method, :path]
)

active_connections = prometheus.gauge(
  :http_active_connections,
  docstring: 'Currently active connections'
)

# Record metrics
http_requests.increment(labels: { method: 'GET', path: '/users', status: 200 })
response_duration.observe(0.234, labels: { method: 'GET', path: '/users' })
active_connections.set(42)

# Expose metrics endpoint
require 'webrick'

server = WEBrick::HTTPServer.new(Port: 9394)
server.mount_proc '/metrics' do |req, res|
  res.body = Prometheus::Client::Formats::Text.marshal(prometheus)
  res.content_type = Prometheus::Client::Formats::Text::CONTENT_TYPE
end
server.start

StatsD Integration: StatsD provides a simple UDP-based protocol for metric collection. The statsd-ruby gem offers a lightweight client for pushing metrics to StatsD servers.

require 'statsd'  # provided by the statsd-ruby gem

statsd = Statsd.new('localhost', 8125)

# Timing blocks
statsd.time('db.query') do
  database.execute(query)
end

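# Manual timing (a monotonic clock avoids jumps from wall-clock adjustments)
start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
perform_operation
elapsed_ms = ((Process.clock_gettime(Process::CLOCK_MONOTONIC) - start) * 1000).round
statsd.timing('operation.duration', elapsed_ms)

# Counters and gauges
# statsd-ruby has no tag support, so dimensions go into the metric name;
# tagged metrics need a client such as dogstatsd-ruby.
statsd.increment('api.requests.users.get')
statsd.gauge('queue.depth', background_queue.size)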

# Batching for efficiency
statsd.batch do |s|
  s.increment('page.views')
  s.timing('page.load', 320)
  s.gauge('active.users', 1523)
end

OpenTelemetry: OpenTelemetry provides vendor-neutral observability instrumentation. The opentelemetry-sdk and framework-specific gems enable distributed tracing and metrics collection.

require 'opentelemetry/sdk'
require 'opentelemetry/instrumentation/all'

OpenTelemetry::SDK.configure do |c|
  c.service_name = 'my-ruby-app'
  c.use_all
end

# Manual span creation
tracer = OpenTelemetry.tracer_provider.tracer('my-app')

tracer.in_span('process_request') do |span|
  span.set_attribute('user.id', user_id)
  span.set_attribute('request.size', request_body.bytesize)
  
  process_request(request_body)
end

# Automatic instrumentation for frameworks
# Rails, Sinatra, Rack instrumented automatically

Cloud Provider Monitoring: AWS CloudWatch, Google Cloud Monitoring, and Azure Monitor provide integrated monitoring for cloud deployments. Ruby SDKs enable custom metric publication and log streaming.

require 'aws-sdk-cloudwatch'

cloudwatch = Aws::CloudWatch::Client.new(region: 'us-east-1')

# Publish custom metrics
cloudwatch.put_metric_data({
  namespace: 'MyApp',
  metric_data: [
    {
      metric_name: 'ProcessedOrders',
      value: 42,
      unit: 'Count',
      timestamp: Time.now,
      dimensions: [
        { name: 'Environment', value: 'production' }
      ]
    },
    {
      metric_name: 'ResponseTime',
      value: 0.234,
      unit: 'Seconds',
      timestamp: Time.now
    }
  ]
})

Logging Integration: Structured logging libraries like semantic_logger and lograge format logs for parsing by monitoring systems. JSON-formatted logs enable metric extraction from log aggregation platforms.

require 'semantic_logger'

SemanticLogger.add_appender(io: $stdout, formatter: :json)

logger = SemanticLogger['MyApp']

# Structured logging with metrics
logger.info('Request processed',
  method: 'GET',
  path: '/api/users',
  status: 200,
  duration_ms: 145,
  db_queries: 3
)

# Metric extraction from logs
logger.measure_info('Database query',
  metric: 'db/query/duration',
  payload: { table: 'users' }
) do
  database.query('SELECT * FROM users')
end

Practical Examples

Resource monitoring implementations vary based on application architecture and operational requirements. These examples demonstrate monitoring patterns for common scenarios.

Web Application Request Monitoring: Web applications require per-request metrics including response time, throughput, error rates, and resource consumption. Middleware captures metrics transparently without modifying application code.

require 'rack'

class RequestMonitoringMiddleware
  def initialize(app, metrics_client)
    @app = app
    @metrics = metrics_client
  end

  def call(env)
    start_time = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    tags = {
      method: env['REQUEST_METHOD'],
      path: sanitize_path(env['PATH_INFO'])
    }

    begin
      status, headers, body = @app.call(env)

      duration_ms = (Process.clock_gettime(Process::CLOCK_MONOTONIC) - start_time) * 1000

      @metrics.timing('http.request.duration', duration_ms, tags.merge(status: status))
      # StatsDClient#counter takes the increment as its second argument
      @metrics.counter('http.request.count', 1, tags.merge(status: status))

      [status, headers, body]
    rescue => e
      @metrics.counter('http.request.errors', 1, tags.merge(error_class: e.class.name))
      raise
    end
  end

  private

  def sanitize_path(path)
    # Remove IDs to reduce cardinality
    path.gsub(/\/\d+/, '/:id')
  end
end

# Rails integration
class Application < Rails::Application
  config.middleware.use RequestMonitoringMiddleware, StatsDClient.new
end

Background Job Monitoring: Background job systems require monitoring for queue depth, processing time, failure rates, and retry counts. This visibility identifies performance bottlenecks and capacity requirements.

class MonitoredJob
  def self.perform(*args)
    job_name = name
    start_time = Process.clock_gettime(Process::CLOCK_MONOTONIC)

    metrics.gauge('jobs.queue.depth', queue_depth, job: job_name)
    # StatsDClient exposes #counter(metric, value, tags) rather than #increment
    metrics.counter('jobs.started', 1, job: job_name)

    begin
      result = perform_work(*args)

      duration_ms = (Process.clock_gettime(Process::CLOCK_MONOTONIC) - start_time) * 1000
      metrics.timing('jobs.duration', duration_ms, job: job_name)
      metrics.counter('jobs.completed', 1, job: job_name, status: 'success')

      result
    rescue => e
      metrics.counter('jobs.completed', 1, job: job_name, status: 'failed')
      metrics.counter('jobs.errors', 1, job: job_name, error: e.class.name)
      raise
    end
  end

  def self.perform_work(*args)
    # Actual job logic
    raise NotImplementedError
  end

  def self.metrics
    @metrics ||= StatsDClient.new
  end

  def self.queue_depth
    # Fetch from job queue system
    0
  end
end

class EmailDeliveryJob < MonitoredJob
  def self.perform_work(user_id, template)
    user = User.find(user_id)
    EmailService.send(user.email, template)
  end
end

Database Connection Pool Monitoring: Connection pools require monitoring to detect exhaustion, identify leak patterns, and tune pool sizing. These metrics inform capacity planning and optimization efforts.

require 'connection_pool'

class MonitoredConnectionPool
  def initialize(pool)
    @pool = pool
    @metrics = StatsDClient.new
    start_monitoring
  end

  def with(&block)
    start_wait = Time.now
    
    @pool.with do |connection|
      wait_time = Time.now - start_wait
      @metrics.timing('pool.checkout.wait', wait_time * 1000)
      
      start_use = Time.now
      result = block.call(connection)
      use_time = Time.now - start_use
      
      @metrics.timing('pool.connection.use', use_time * 1000)
      result
    end
  end

  private

  def start_monitoring
    Thread.new do
      loop do
        @metrics.gauge('pool.size', @pool.size)
        @metrics.gauge('pool.available', @pool.available)
        sleep(10)
      end
    end
  end
end

# Usage
db_pool = ConnectionPool.new(size: 5, timeout: 5) do
  DatabaseConnection.new
end

monitored_pool = MonitoredConnectionPool.new(db_pool)

monitored_pool.with do |conn|
  conn.execute('SELECT * FROM users')
end

Memory Leak Detection: Long-running processes require memory monitoring to detect leaks early. Tracking object counts and memory growth patterns reveals memory management issues.

require 'objspace'  # needed for ObjectSpace.memsize_of_all

class MemoryLeakDetector
  def initialize(interval: 300)
    @interval = interval
    @baseline = nil
    @samples = []
    @threshold_percent = 20
  end

  def start
    @running = true
    
    Thread.new do
      while @running
        sample = collect_sample
        analyze_sample(sample)
        sleep(@interval)
      end
    end
  end

  def stop
    @running = false
  end

  private

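  def collect_sample
    # Force a collection first so the sample reflects retained objects only
    GC.start
    gc = GC.stat

    {
      timestamp: Time.now,
      memory_bytes: ObjectSpace.memsize_of_all,
      object_count: ObjectSpace.count_objects[:TOTAL],
      gc_runs: gc[:count],
      # Key names as of Ruby 2.2+; :heap_used and :heap_free return nil
      heap_allocated_pages: gc[:heap_allocated_pages],
      heap_free_slots: gc[:heap_free_slots]
    }
  end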

  def analyze_sample(sample)
    @samples << sample
    @samples.shift if @samples.size > 20
    
    @baseline ||= sample
    
    memory_growth = (sample[:memory_bytes] - @baseline[:memory_bytes]) / 
                    @baseline[:memory_bytes].to_f * 100
    
    if memory_growth > @threshold_percent
      alert_leak_detected(sample, memory_growth)
    end
    
    log_sample(sample)
  end

  def alert_leak_detected(sample, growth)
    puts "MEMORY LEAK DETECTED: #{growth.round(2)}% growth"
    puts "Current: #{sample[:memory_bytes] / 1024 / 1024} MB"
    puts "Baseline: #{@baseline[:memory_bytes] / 1024 / 1024} MB"
    puts "Objects: #{sample[:object_count]}"
  end

  def log_sample(sample)
    puts "[#{sample[:timestamp]}] Memory: #{sample[:memory_bytes] / 1024 / 1024} MB, " \
         "Objects: #{sample[:object_count]}, GC runs: #{sample[:gc_runs]}"
  end
end

detector = MemoryLeakDetector.new(interval: 60)
detector.start

Performance Considerations

Resource monitoring introduces overhead that affects application performance. Understanding performance characteristics enables informed decisions about monitoring strategies and configurations.

Collection Overhead: Metric collection consumes CPU cycles. Reading process statistics, traversing object graphs, and calculating aggregates require processing time. High-frequency collection amplifies overhead. A system sampling every second spends more CPU on monitoring than one sampling every minute. The overhead scales with the number of monitored metrics and the complexity of calculations. Simple counters impose minimal overhead. Histogram calculations involving sorting or percentile computation consume more resources. Production systems typically allocate 1-3% of CPU capacity to monitoring infrastructure.
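
One way to check whether collection stays within such a budget is to measure the CPU time the collector itself consumes. The following is a minimal sketch, not a definitive implementation: `OverheadMeter` and its method names are hypothetical, and it assumes a platform where `Process::CLOCK_PROCESS_CPUTIME_ID` is available (Linux and macOS, for example). Because the clock is process-wide, CPU time used by other threads while a collection block runs is attributed to the collector as well.

```ruby
# Hypothetical helper: accumulates the CPU time spent inside collection blocks.
class OverheadMeter
  attr_reader :collector_cpu_seconds

  def initialize
    @collector_cpu_seconds = 0.0
    @started_cpu = cpu_now
  end

  # Wrap each metric collection call so its CPU cost is recorded.
  def measure
    before = cpu_now
    result = yield
    @collector_cpu_seconds += cpu_now - before
    result
  end

  # Share of the process CPU time (since the meter was created) spent collecting.
  def overhead_percent
    total = cpu_now - @started_cpu
    return 0.0 if total <= 0

    (@collector_cpu_seconds / total * 100).round(2)
  end

  private

  def cpu_now
    Process.clock_gettime(Process::CLOCK_PROCESS_CPUTIME_ID)
  end
end

meter = OverheadMeter.new
gc_stats = meter.measure { GC.stat }                        # cheap: reads counters
object_counts = meter.measure { ObjectSpace.count_objects } # costlier: walks the heap
puts "Monitoring overhead: #{meter.overhead_percent}% of process CPU time"
```

In a real service the percentage only becomes meaningful after the application has done substantial work of its own; in this short script nearly all CPU time belongs to the collector.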

Memory Impact: Monitoring systems buffer metrics in memory before transmission or aggregation. Large metric sets or extended buffering intervals increase memory consumption. A system tracking 1,000 metrics with 1-second resolution generates 60,000 data points per minute. At 100 bytes per data point, this consumes 6 MB per minute of memory. Memory usage scales with metric cardinality: the number of unique label combinations. Labels like user IDs or request URLs create unbounded cardinality, potentially exhausting memory. Monitoring implementations should limit metric cardinality and implement buffer size limits.

class BoundedMetricBuffer
  def initialize(max_size: 10_000)
    @max_size = max_size
    @buffer = []
    @mutex = Mutex.new
    @dropped_count = 0
  end

  def add(metric)
    @mutex.synchronize do
      if @buffer.size < @max_size
        @buffer << metric
      else
        @dropped_count += 1
      end
    end
  end

  def flush
    @mutex.synchronize do
      metrics = @buffer.dup
      @buffer.clear
      
      if @dropped_count > 0
        puts "WARNING: Dropped #{@dropped_count} metrics due to buffer overflow"
        @dropped_count = 0
      end
      
      metrics
    end
  end
end
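
Buffer limits cap the number of queued data points, but they do not stop label combinations from multiplying. A cardinality guard can complement the buffer. The sketch below is illustrative only: `CardinalityLimiter`, its `admit` method, and the `overflow` label are hypothetical names, and the per-metric limit is an arbitrary assumption.

```ruby
require 'set'

# Hypothetical guard: caps the number of distinct label combinations per metric.
class CardinalityLimiter
  OVERFLOW_LABELS = { overflow: 'true' }.freeze

  def initialize(max_series_per_metric: 100)
    @max = max_series_per_metric
    @seen = Hash.new { |hash, name| hash[name] = Set.new }
    @mutex = Mutex.new
  end

  # Returns the labels to record: the originals when the series is already
  # known or there is room for a new one, otherwise a shared overflow label set.
  def admit(name, labels)
    key = labels.sort_by { |label, _| label.to_s }

    @mutex.synchronize do
      series = @seen[name]
      return labels if series.include?(key)

      if series.size < @max
        series << key
        labels
      else
        OVERFLOW_LABELS
      end
    end
  end

  def series_count(name)
    @mutex.synchronize { @seen.key?(name) ? @seen[name].size : 0 }
  end
end

limiter = CardinalityLimiter.new(max_series_per_metric: 2)
first  = limiter.admit('http.requests', { method: 'GET' })    # admitted
second = limiter.admit('http.requests', { method: 'POST' })   # admitted
third  = limiter.admit('http.requests', { user_id: '12345' }) # folded into overflow
```

Folding excess series into one overflow series keeps totals accurate while bounding memory; dropping them instead would be simpler but loses the counts.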

Network Overhead: Push-based monitoring transmits metrics over networks, consuming bandwidth and adding latency. Sending individual metrics creates protocol overhead. A UDP packet over IPv4 carries 28 bytes of headers (a 20-byte IP header plus an 8-byte UDP header) in addition to the metric payload. Sending 1,000 one-metric packets therefore spends 28 KB on headers alone. Batching reduces overhead: combining 10 metrics per packet reduces header overhead by 90%. However, batching adds memory requirements and introduces transmission delay. Network failures cause metric loss in UDP-based systems or blocking in TCP-based systems. Asynchronous transmission prevents network issues from blocking application logic.

require 'socket' # provides UDPSocket

class BatchingMetricsClient
  def initialize(host:, port:, batch_size: 50, flush_interval: 1.0)
    @host = host
    @port = port
    @batch_size = batch_size
    @flush_interval = flush_interval
    @buffer = []
    @mutex = Mutex.new
    @socket = UDPSocket.new
    
    start_flush_timer
  end

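  def metric(name, value, type, tags = {})
    formatted = format_metric(name, value, type, tags)

    # Mutex is not reentrant, so decide under the lock and flush after releasing it
    batch_full = @mutex.synchronize do
      @buffer << formatted
      @buffer.size >= @batch_size
    end

    flush if batch_full
  end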

  private

  def start_flush_timer
    Thread.new do
      loop do
        sleep(@flush_interval)
        flush
      end
    end
  end

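  def flush
    # Check and drain the buffer under the lock to avoid racing with writers
    batch = @mutex.synchronize do
      next if @buffer.empty?

      data = @buffer.join("\n")
      @buffer.clear
      data
    end
    return unless batch

    @socket.send(batch, 0, @host, @port)
  rescue StandardError
    # Drop the batch on send failure so monitoring never breaks the application
    nil
  end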

  def format_metric(name, value, type, tags)
    tag_string = tags.map { |k, v| "#{k}:#{v}" }.join(',')
    tags_part = tag_string.empty? ? '' : "|##{tag_string}"
    "#{name}:#{value}|#{type}#{tags_part}"
  end
end
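
The header arithmetic above can be checked with a few lines of Ruby. This is a worked example under the same assumptions as the text (IPv4 and 28 bytes of headers per packet); `header_overhead_bytes` is a hypothetical helper name.

```ruby
UDP_IPV4_HEADER_BYTES = 28 # 20-byte IPv4 header + 8-byte UDP header

# Header bytes needed to send metric_count metrics at batch_size metrics per packet.
def header_overhead_bytes(metric_count, batch_size)
  packets = (metric_count.to_f / batch_size).ceil
  packets * UDP_IPV4_HEADER_BYTES
end

unbatched = header_overhead_bytes(1_000, 1)  # one metric per packet
batched   = header_overhead_bytes(1_000, 10) # ten metrics per packet
saving    = (1 - batched.to_f / unbatched) * 100

puts "Unbatched: #{unbatched} B, batched: #{batched} B, saving: #{saving.round}%"
```

Larger batches save more, but only while the joined payload still fits in a single datagram without fragmentation, which on a 1,500-byte Ethernet MTU leaves 1,472 bytes for payload.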

Storage Requirements: Long-term metric storage consumes disk space. A system with 10,000 metrics sampled every 10 seconds generates 86.4 million data points daily. At 12 bytes per data point (timestamp, value, metadata), this requires about 1 GB daily, or roughly 380 GB annually. Time series databases use compression to reduce storage requirements by 10-100x. Downsampling stores high-resolution data for short periods and aggregated data for longer retention. Storing one-second resolution for 24 hours, one-minute resolution for 30 days, and hourly averages for one year reduces storage requirements significantly while maintaining analytical capability.
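
The storage estimate and the downsampling idea can be sketched as follows. The figures are the ones from the paragraph above; the `downsample` helper and its `[timestamp, value]` pair format are assumptions made for illustration, not the behavior of any particular time series database.

```ruby
# Storage estimate using the figures from the text.
metrics          = 10_000
interval_seconds = 10
bytes_per_point  = 12

points_per_day = metrics * (86_400 / interval_seconds)
bytes_per_day  = points_per_day * bytes_per_point
puts "Points/day: #{points_per_day}, GB/day: #{(bytes_per_day / 1e9).round(2)}"

# Downsample [timestamp, value] pairs into fixed-width buckets of averages.
# Timestamps are integer epoch seconds; bucket_seconds is the target resolution.
def downsample(points, bucket_seconds)
  points
    .group_by { |timestamp, _value| timestamp - (timestamp % bucket_seconds) }
    .sort
    .map do |bucket_start, bucket|
      [bucket_start, bucket.sum { |_timestamp, value| value }.to_f / bucket.size]
    end
end

raw = [[0, 10], [20, 20], [40, 30], [60, 50], [90, 70]]
per_minute = downsample(raw, 60)
# per_minute => [[0, 20.0], [60, 60.0]]
```

Averages suit gauges; counters and latency distributions usually need sums, maxima, or percentiles preserved instead, so the aggregate chosen per bucket should match the metric type.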

Query Performance: Metric queries scan time series data to compute aggregates, identify trends, or generate visualizations. Query performance depends on data volume, time range, aggregation complexity, and index structures. A query spanning one week across 1,000 metrics stored at 10-second resolution evaluates roughly 60 million data points. Efficient queries use pre-computed aggregates, filter early on time ranges, and limit result cardinality. Monitoring systems should establish query performance budgets and implement timeouts to prevent runaway queries from overloading storage systems.

Sampling Trade-offs: Sampling frequency creates a trade-off between temporal resolution and overhead. High-frequency sampling captures short-lived events but increases costs. Low-frequency sampling reduces overhead but misses transient issues. A five-minute sampling interval misses spikes lasting seconds. Adaptive sampling increases frequency during detected anomalies while maintaining low baseline overhead. Event-triggered monitoring captures metrics when significant changes occur rather than on fixed schedules.
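
Adaptive sampling can be sketched as a small state machine that chooses the next sleep interval from the latest reading. The class name `AdaptiveInterval`, the threshold, and the interval values below are all assumptions chosen for illustration.

```ruby
# Hypothetical adaptive sampler: samples quickly while readings are anomalous
# and returns to the baseline interval after several calm readings in a row.
class AdaptiveInterval
  def initialize(baseline: 60.0, fast: 5.0, threshold: 80.0, calm_samples: 3)
    @baseline = baseline
    @fast = fast
    @threshold = threshold
    @calm_samples = calm_samples
    @calm_streak = calm_samples # start in the calm state
  end

  # Returns the number of seconds to wait before taking the next sample.
  def next_interval(value)
    if value > @threshold
      @calm_streak = 0
    else
      @calm_streak += 1
    end

    @calm_streak >= @calm_samples ? @baseline : @fast
  end
end

sampler = AdaptiveInterval.new
intervals = [40, 95, 60, 55, 50].map { |cpu| sampler.next_interval(cpu) }
# intervals => [60.0, 5.0, 5.0, 5.0, 60.0]
```

Requiring several calm readings before slowing down avoids flapping between intervals when a value hovers around the threshold.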

Reference

Metric Types

Type	Description	Example Use Case	Implementation
Counter	Monotonically increasing value	Total requests processed	Increment on each event
Gauge	Point-in-time snapshot	Current memory usage	Set to current value
Histogram	Distribution of values	Response time distribution	Record each observation
Summary	Quantile calculations	p50, p95, p99 latency	Calculate percentiles
Timer	Duration measurement	Operation execution time	Record start and end
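
The differences between these types are easiest to see in code. The classes below are a minimal in-memory sketch with hypothetical names, not the API of any metrics gem; the histogram uses the nearest-rank method for percentiles, which is one of several common definitions.

```ruby
# Counter: only ever increases.
class Counter
  attr_reader :value

  def initialize
    @value = 0
  end

  def increment(by = 1)
    raise ArgumentError, 'counters only increase' if by.negative?

    @value += by
  end
end

# Gauge: overwritten with the latest point-in-time reading.
class Gauge
  attr_reader :value

  def set(value)
    @value = value
  end
end

# Histogram: keeps every observation and answers percentile queries.
class Histogram
  def initialize
    @observations = []
  end

  def observe(value)
    @observations << value
  end

  # Nearest-rank percentile; p is a number between 0 and 100.
  def percentile(p)
    return nil if @observations.empty?

    sorted = @observations.sort
    rank = (p * sorted.size / 100.0).ceil
    sorted[[rank - 1, 0].max]
  end
end

requests = Counter.new
3.times { requests.increment }

latency = Histogram.new
[120, 80, 200, 95, 150].each { |ms| latency.observe(ms) }
puts "requests=#{requests.value} p50=#{latency.percentile(50)}ms p99=#{latency.percentile(99)}ms"
```

Keeping every observation is only practical for small samples; production histograms typically use fixed buckets or sketches to bound memory.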

Common System Metrics

Metric	Source	Interpretation	Collection Method
CPU Usage	/proc/stat	Percentage of CPU time used	Sample over interval
Memory RSS	/proc/[pid]/status	Physical memory consumed	Read from procfs
Disk I/O	/proc/diskstats	Read/write operations and bytes	Calculate deltas
Network Traffic	/proc/net/dev	Bytes sent/received	Calculate deltas
Load Average	/proc/loadavg	Average number of runnable tasks (run queue depth)	Direct read
File Descriptors	/proc/[pid]/fd	Open file handles	Count directory entries
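
As a sketch of the "read from procfs" and "sample over interval" methods in this table, the helpers below parse text in the Linux procfs formats. The function names are hypothetical, the code assumes Linux field layouts, and the live read at the end is skipped on systems without `/proc`.

```ruby
# Extract the resident set size in kilobytes from /proc/[pid]/status text.
def parse_vm_rss_kb(status_text)
  match = status_text.match(/^VmRSS:\s+(\d+)\s+kB/)
  match && match[1].to_i
end

# Compute CPU utilization between two aggregate "cpu" lines from /proc/stat.
# Uses the first eight fields (user, nice, system, idle, iowait, irq, softirq,
# steal); guest time is already included in user time, so it is left out.
def cpu_usage_percent(before_line, after_line)
  before = before_line.split.drop(1).first(8).map(&:to_i)
  after  = after_line.split.drop(1).first(8).map(&:to_i)
  deltas = after.zip(before).map { |a, b| a - b }

  total = deltas.sum
  return 0.0 if total.zero?

  idle = deltas[3] + deltas[4].to_i # idle + iowait
  ((total - idle).to_f / total * 100).round(1)
end

sample_before = 'cpu  100 0 50 800 50 0 0 0 0 0'
sample_after  = 'cpu  160 0 70 900 70 0 0 0 0 0'
puts "CPU usage over interval: #{cpu_usage_percent(sample_before, sample_after)}%"

if File.readable?('/proc/self/status')
  rss_kb = parse_vm_rss_kb(File.read('/proc/self/status'))
  puts "Current RSS: #{rss_kb} kB"
end
```

CPU counters in `/proc/stat` are cumulative since boot, which is why utilization has to be computed from the difference between two reads rather than from a single snapshot.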

Ruby Memory Metrics

Metric	Method	Description	Use Case
Object Count	ObjectSpace.count_objects	Object slots by internal type (T_STRING, T_ARRAY, ...)	Detect object accumulation
Total Memory	ObjectSpace.memsize_of_all	Allocated memory size	Track memory growth
GC Runs	GC.stat[:count]	Garbage collection count	Assess GC pressure
Heap Used	GC.stat[:heap_allocated_pages]	Allocated heap pages	Monitor heap growth
Heap Free	GC.stat[:heap_free_slots]	Unused object slots in the heap	Detect memory pressure

Metric Label Best Practices

Pattern	Cardinality	Example	Usage
Service Name	Low	service:api	Always include
Environment	Low	env:production	Always include
HTTP Method	Low	method:GET	Safe for HTTP
HTTP Path	Medium	path:/users/:id	Template URLs
Status Code	Low	status:200	Safe for HTTP
Error Type	Medium	error:TimeoutError	Limit to known types
User ID	High	user_id:12345	Avoid in metrics
Request ID	High	request_id:abc123	Avoid in metrics
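
The "Template URLs" advice for HTTP paths can be approximated with a normalizer that replaces variable segments before the path is used as a label. This is a rough sketch: `template_path` is a hypothetical helper, and it only recognizes numeric and UUID segments. Where a framework router exposes the matched route pattern, that pattern is a more reliable source.

```ruby
UUID_SEGMENT = %r{/[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}(?=/|\z)}i
NUMERIC_SEGMENT = %r{/\d+(?=/|\z)}

# Replace high-cardinality path segments with placeholders and drop the query string.
def template_path(path)
  path.split('?', 2).first.to_s
      .gsub(UUID_SEGMENT, '/:uuid')
      .gsub(NUMERIC_SEGMENT, '/:id')
end

puts template_path('/users/12345')
puts template_path('/users/42/orders/7?page=2')
puts template_path('/files/123e4567-e89b-12d3-a456-426614174000/download')
```

Segments that are neither numeric nor UUIDs, such as usernames or slugs, pass through unchanged, so this approach still needs an allowlist or a series cap as a backstop.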

Monitoring Pattern Decision Matrix

Scenario	Pattern	Rationale	Implementation
Short-lived processes	Push-based	Process terminates before scrape	StatsD client
Long-running services	Pull-based	Simplifies application code	Prometheus endpoint
Serverless functions	Push-based	No persistent endpoint	CloudWatch SDK
Container orchestration	Pull-based	Service discovery integration	Prometheus with k8s
Batch jobs	Push-based	Single execution completion	Metric push on finish
High-frequency metrics	Local aggregation	Reduce network overhead	Agent-based collection

Alert Threshold Guidelines

Metric Type	Threshold Strategy	Example	Justification
Resource utilization	Percentage-based	CPU > 80%	Capacity buffer
Error rates	Rate-based	Errors > 1% of requests	Acceptable failure rate
Latency	Percentile-based	p99 > 1000ms	Tail latency impact
Queue depth	Absolute count	Queue > 10000 items	Processing capacity
Connection pools	Utilization rate	Pool > 90% full	Connection availability
Disk space	Percentage remaining	Disk < 10% free	Growth runway
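
The example thresholds in this table can be expressed as a small rule set. The sketch below is illustrative: the metric names, the `ALERT_RULES` constant, and the `breached_alerts` helper are hypothetical, and the limits are the table's example values rather than recommendations for any particular system.

```ruby
# Each rule returns true when the reading breaches its threshold.
ALERT_RULES = {
  cpu_percent: ->(v) { v > 80 },
  error_rate_percent: ->(v) { v > 1 },
  p99_latency_ms: ->(v) { v > 1000 },
  queue_depth: ->(v) { v > 10_000 },
  pool_utilization_percent: ->(v) { v > 90 },
  disk_free_percent: ->(v) { v < 10 }
}.freeze

# Returns the names of all readings that breach a known rule.
def breached_alerts(readings)
  readings
    .select { |name, value| ALERT_RULES.key?(name) && ALERT_RULES[name].call(value) }
    .keys
end

readings = { cpu_percent: 92.5, error_rate_percent: 0.4, disk_free_percent: 7 }
alerts = breached_alerts(readings)
# alerts => [:cpu_percent, :disk_free_percent]
```

Real alerting usually also requires the breach to persist for some duration before firing, which avoids paging on a single noisy sample.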

Sampling Interval Selection

Monitoring Scope	Interval	Data Volume	Use Case
Real-time alerts	1-10 seconds	Very high	Detect immediate issues
Performance analysis	10-60 seconds	High	Investigate problems
Capacity planning	5-15 minutes	Medium	Trend analysis
Cost optimization	1 hour	Low	Long-term patterns
Compliance reporting	1 day	Very low	Historical records

Integration Patterns

Tool	Protocol	Format	Transport	Ruby Gem
Prometheus	Pull/HTTP	Text exposition	HTTP GET	prometheus-client
StatsD	Push/UDP	Line protocol	UDP packets	statsd-ruby
Graphite	Push/TCP	Line protocol	TCP stream	graphite-api
InfluxDB	Push/HTTP	Line protocol	HTTP POST	influxdb-client
CloudWatch	Push/API	JSON	HTTPS	aws-sdk-cloudwatch
Datadog	Push/UDP	DogStatsD line protocol	UDP or Unix socket to local agent	dogstatsd-ruby
New Relic	Agent	Binary	HTTPS	newrelic_rpm

Performance Budget Guidelines

Resource	Budget	Measurement	Acceptable Impact
CPU overhead	1-3%	Process CPU time	Minimal performance impact
Memory overhead	50-200 MB	Process RSS	Depends on application size
Network bandwidth	10-100 KB/s	Outbound traffic	Based on link capacity
Disk I/O	100-1000 ops/s	Write operations	Depends on disk type
Request latency	<1ms	P50 added latency	Imperceptible to users
