CrackedRuby - Metrics Collection

Overview

Metrics collection captures numerical data about application performance, system resources, business operations, and user behavior. These measurements enable teams to monitor system health, detect anomalies, diagnose performance issues, and make data-driven decisions about capacity planning and optimization.

The practice emerged from the need to understand distributed systems at scale. As applications grew beyond single servers, teams required systematic approaches to gather and analyze operational data. Metrics differ from logs and traces—while logs provide detailed event narratives and traces show request flows, metrics offer aggregated numerical measurements optimized for time-series analysis and alerting.

Core metric types include counters (monotonically increasing values like total requests), gauges (point-in-time measurements like memory usage), histograms (distributions of values like request latencies, counted into buckets), and summaries (quantiles precomputed on the client over a sliding time window). Each type serves distinct analysis needs and carries different storage and computation costs.

# Basic counter metric
class RequestCounter
  def initialize
    @count = 0
    @mutex = Mutex.new
  end
  
  def increment
    @mutex.synchronize { @count += 1 }
  end
  
  def value
    @mutex.synchronize { @count }
  end
end

# Basic gauge metric
class MemoryGauge
  def value
    `ps -o rss= -p #{Process.pid}`.to_i * 1024 # bytes
  end
end
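
The histogram type can be sketched in the same spirit. This is a minimal illustration rather than how any particular library implements it: the class name `BucketHistogram` and the bucket boundaries are hypothetical, and the quantile is only an estimate obtained by linear interpolation inside the matching bucket.

```ruby
# Minimal bucketed histogram (illustrative sketch, not a library API)
class BucketHistogram
  attr_reader :count, :sum

  def initialize(bounds)
    @bounds = bounds.sort                     # bucket upper bounds
    @counts = Array.new(@bounds.size + 1, 0)  # last slot is the +Inf bucket
    @count = 0
    @sum = 0.0
    @mutex = Mutex.new
  end

  def observe(value)
    @mutex.synchronize do
      idx = @bounds.index { |bound| value <= bound } || @bounds.size
      @counts[idx] += 1
      @count += 1
      @sum += value
    end
  end

  # Cumulative count per upper bound: [[bound, observations <= bound], ...]
  def cumulative
    @mutex.synchronize do
      running = 0
      (@bounds + [Float::INFINITY]).zip(@counts).map do |bound, n|
        running += n
        [bound, running]
      end
    end
  end

  # Estimate a quantile (0.0..1.0) by interpolating inside the matching bucket
  def quantile(q)
    rows = cumulative
    total = rows.last[1]
    return nil if total.zero?

    rank = q * total
    lower_bound = 0.0
    lower_count = 0
    rows.each do |bound, cum|
      if cum >= rank
        return lower_bound if bound.infinite? # cannot interpolate into +Inf
        in_bucket = cum - lower_count
        return bound if in_bucket.zero?
        return lower_bound + (bound - lower_bound) * (rank - lower_count) / in_bucket
      end
      lower_bound = bound
      lower_count = cum
    end
  end
end

latency = BucketHistogram.new([0.1, 0.5, 1.0])
[0.05, 0.2, 0.3, 0.7].each { |v| latency.observe(v) }
latency.cumulative # => [[0.1, 1], [0.5, 3], [1.0, 4], [Infinity, 4]]
```

The resolution of the estimate depends entirely on the bucket boundaries: two observations in the same bucket are indistinguishable once recorded.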

Metrics collection operates through instrumentation (adding measurement code), aggregation (combining measurements), storage (persisting time-series data), and visualization (displaying trends and patterns). The pipeline design affects system overhead, data granularity, and query performance.

Key Principles

Metrics exist as time-series data: sequences of timestamped values representing measurements over time. Each metric consists of a name, one or more values, a timestamp, and optional labels (key-value pairs providing dimensional context). Labels enable filtering and grouping without creating separate metric names for each variant.

# Metric with labels
metric = {
  name: "http_requests_total",
  value: 1247,
  timestamp: Time.now.to_i,
  labels: {
    method: "GET",
    endpoint: "/api/users",
    status: "200"
  }
}

Cardinality is the number of unique label combinations for a metric. High cardinality occurs when labels contain unbounded values (user IDs, session tokens, full URLs). Metrics systems store a separate time series for each unique label combination, and the number of possible combinations is the product of each label's distinct values, so every additional high-cardinality label multiplies storage requirements and query costs.
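
A rough way to see the effect is to multiply the number of distinct values per label. The sketch below uses made-up label sets, and `series_count` is a hypothetical helper. It gives an upper bound, since not every combination necessarily occurs in practice.

```ruby
# Upper bound on series for one metric: product of distinct values per label
def series_count(label_values)
  label_values.values.map(&:size).reduce(1, :*)
end

bounded = {
  method:   %w[GET POST PUT DELETE],
  endpoint: %w[/api/users /api/users/:id /api/orders],
  status:   %w[2xx 4xx 5xx]
}
series_count(bounded) # => 36

# One unbounded label multiplies every existing combination
with_user_id = bounded.merge(user_id: (1..10_000).to_a)
series_count(with_user_id) # => 360000
```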

Sampling strategies determine which events generate metric updates. Counter metrics typically record every event, while histograms may sample to reduce overhead. Reservoir sampling maintains statistical accuracy while limiting memory usage, particularly for percentile calculations across large datasets.

# Reservoir sampling for histogram
class ReservoirHistogram
  RESERVOIR_SIZE = 1028
  
  def initialize
    @values = []
    @count = 0
  end
  
  def record(value)
    @count += 1
    
    if @values.size < RESERVOIR_SIZE
      @values << value
    else
      # Randomly replace with decreasing probability
      idx = rand(@count)
      @values[idx] = value if idx < RESERVOIR_SIZE
    end
  end
  
  def percentile(p)
    return nil if @values.empty?
    sorted = @values.sort
    idx = (p / 100.0 * sorted.size).ceil - 1
    sorted[[idx, 0].max]
  end
end

Aggregation determines how raw measurements combine into reportable values. Temporal aggregation combines measurements within time windows (per-minute averages). Spatial aggregation combines measurements across instances (cluster-wide request rates). The aggregation method depends on metric type—counters sum, gauges average or report last value, histograms merge buckets.
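
The three aggregation rules can be sketched with plain Ruby collections. The helper names and sample data are illustrative assumptions, and histograms are represented here simply as hashes of bucket bound to count.

```ruby
# Temporal aggregation: average gauge samples per one-minute window.
# Each sample is [unix_timestamp, value].
def average_per_minute(samples)
  samples
    .group_by { |ts, _value| ts - (ts % 60) }
    .transform_values { |rows| rows.sum { |_ts, value| value }.to_f / rows.size }
end

# Spatial aggregation: counters from several instances are summed
def sum_counters(per_instance)
  per_instance.values.sum
end

# Histograms merge by adding counts bucket by bucket (boundaries must match)
def merge_histograms(histograms)
  histograms.reduce { |a, b| a.merge(b) { |_bound, x, y| x + y } }
end

samples = [[120, 10.0], [150, 20.0], [185, 40.0]]
per_minute = average_per_minute(samples)
# window starting at 120 averages to 15.0, window starting at 180 to 40.0

cluster_total = sum_counters({ 'web-01' => 1200, 'web-02' => 800 }) # => 2000

merged = merge_histograms([{ 0.1 => 3, 0.5 => 7 }, { 0.1 => 1, 0.5 => 4 }])
# bucket 0.1 holds 4 observations, bucket 0.5 holds 11
```

Averaging is appropriate for the gauge samples but not for the counters: averaging per-instance counters would understate cluster-wide traffic.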

Push versus pull collection models define data flow direction. Push models have applications send metrics to collectors. Pull models have collectors scrape applications at intervals. Push models deliver immediately and work from behind firewalls or NAT, but need client-side buffering and can overwhelm collectors. Pull models let the collector set the scrape rate and pair naturally with service discovery, but require exposed endpoints and may miss short-lived processes.
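
The difference in data flow can be sketched with the standard library alone. `PushReporter` and `PullRegistry` are hypothetical names. The push side sends a StatsD-style line over UDP, and the pull side only renders its current state when asked, in a simplified text form rather than a complete exposition format.

```ruby
require 'socket'

# Push: the application sends each update to a collector
class PushReporter
  def initialize(host, port)
    @socket = UDPSocket.new
    @host = host
    @port = port
  end

  def increment(name)
    # Fire-and-forget: no acknowledgement, no retry
    @socket.send("#{name}:1|c", 0, @host, @port)
  end
end

# Pull: the application keeps state and renders it when a collector asks
class PullRegistry
  def initialize
    @counters = Hash.new(0)
    @mutex = Mutex.new
  end

  def increment(name)
    @mutex.synchronize { @counters[name] += 1 }
  end

  # Body that a scrape endpoint such as /metrics could return
  def render
    @mutex.synchronize do
      @counters.map { |name, value| "#{name} #{value}\n" }.join
    end
  end
end

registry = PullRegistry.new
3.times { registry.increment('http_requests_total') }
registry.render # => "http_requests_total 3\n"
```

With push, a process that exits after one second still reports its increments. With pull, the same process may never be scraped.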

Metric namespacing organizes measurements hierarchically. Conventions vary by system (Prometheus uses underscores, StatsD uses dots), but consistent naming enables discovery and prevents collisions. Names typically include the subsystem, measurement type, and unit: http_request_duration_seconds rather than request_time.
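
A small helper can enforce such a convention at registration time. The sketch below is one possible approach. The helper names are hypothetical, and the validation pattern follows the character set Prometheus allows in metric names.

```ruby
# Characters Prometheus permits in a metric name
PROMETHEUS_NAME = /\A[a-zA-Z_:][a-zA-Z0-9_:]*\z/

# Join subsystem, measurement, and unit with underscores
def prometheus_name(subsystem, measurement, unit)
  name = [subsystem, measurement, unit].join('_')
  raise ArgumentError, "invalid metric name: #{name}" unless name.match?(PROMETHEUS_NAME)

  name
end

# The same parts under the dotted StatsD convention
def statsd_name(subsystem, measurement, unit)
  [subsystem, measurement, unit].join('.')
end

prometheus_name('http', 'request_duration', 'seconds')
# => "http_request_duration_seconds"
statsd_name('http', 'request_duration', 'seconds')
# => "http.request_duration.seconds"
```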

Ruby Implementation

Ruby applications instrument code using metric libraries that abstract collection mechanics. Libraries provide metric primitives (counters, gauges, histograms) and handle thread-safety, aggregation, and reporter integration.

require 'prometheus/client'

# Initialize registry
prometheus = Prometheus::Client.registry

# Create counter
http_requests = prometheus.counter(
  :http_requests_total,
  docstring: 'Total HTTP requests',
  labels: [:method, :endpoint]
)

# Increment counter
http_requests.increment(labels: { method: 'GET', endpoint: '/api/users' })

# Create histogram
request_duration = prometheus.histogram(
  :http_request_duration_seconds,
  docstring: 'HTTP request duration',
  labels: [:method],
  buckets: [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10]
)

# Record observation
request_duration.observe(0.234, labels: { method: 'POST' })

The Prometheus client library renders metrics in the Prometheus exposition format and makes them available on an HTTP endpoint for scraping. Metric values are held in process memory and read at scrape time (scraping does not reset them), so basic collection requires no external dependencies.

StatsD integration uses UDP datagrams to send metrics to a StatsD server, which aggregates and forwards to backends. The fire-and-forget protocol minimizes application overhead but offers no delivery guarantees.

require 'statsd-instrument'

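# Configure client (legacy statsd-instrument 2.x API; 3.x releases are configured
# through environment variables such as STATSD_ADDR instead)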
StatsD.backend = StatsD::Instrument::Backends::UDPBackend.new(
  'localhost:8125',
  :statsd
)

# Increment counter
StatsD.increment('api.requests', tags: ['endpoint:users', 'method:GET'])

# Set gauge
StatsD.gauge('database.pool.size', 20)

# Measure timing
StatsD.measure('api.response_time', 234.5)

# Distribution (histogram)
StatsD.distribution('cache.key_size', 1024)

Custom instrumentation wraps application code to capture measurements. Rack middleware instruments HTTP requests, database adapters measure query performance, and background job libraries track processing times.

class MetricsMiddleware
  def initialize(app, metrics)
    @app = app
    @requests = metrics.counter(
      :http_requests_total,
      docstring: 'Total requests',
      labels: [:method, :path, :status]
    )
    @duration = metrics.histogram(
      :http_request_duration_seconds,
      docstring: 'Request duration',
      labels: [:method, :path],
      buckets: [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5]
    )
  end
  
  def call(env)
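    # The monotonic clock is unaffected by wall-clock adjustments such as NTP corrections
    start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    status, headers, body = @app.call(env)
    duration = Process.clock_gettime(Process::CLOCK_MONOTONIC) - start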
    
    labels = {
      method: env['REQUEST_METHOD'],
      path: normalize_path(env['PATH_INFO'])
    }
    
    @requests.increment(labels: labels.merge(status: status.to_s))
    @duration.observe(duration, labels: labels)
    
    [status, headers, body]
  end
  
  private
  
  def normalize_path(path)
    # Replace IDs with placeholder
    path.gsub(/\/\d+/, '/:id')
  end
end

Thread-safety requires synchronization when updating shared metric objects. Most libraries handle this internally using mutexes or atomic operations. Custom implementations must protect concurrent access to prevent race conditions.

require 'concurrent' # concurrent-ruby gem provides AtomicFixnum

class ThreadSafeCounter
  def initialize
    @value = Concurrent::AtomicFixnum.new(0)
  end
  
  def increment(amount = 1)
    @value.increment(amount)
  end
  
  def value
    @value.value
  end
end

Implementation Approaches

Application-level instrumentation embeds metric collection directly in application code. Developers explicitly instrument critical paths, business logic, and external calls. This approach provides precise control over what gets measured and enables domain-specific metrics (order completion rates, user signups), but requires code changes and maintenance as the application evolves.
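
As a minimal sketch of explicit instrumentation, the helper below times a block and counts successes and failures. `Recorder` is a hypothetical in-memory stand-in for a real metrics client, and `complete_order` is an invented business operation.

```ruby
# Hypothetical in-memory recorder standing in for a real metrics client
class Recorder
  attr_reader :counts, :durations

  def initialize
    @counts = Hash.new(0)
    @durations = Hash.new { |hash, key| hash[key] = [] }
    @mutex = Mutex.new
  end

  def increment(name)
    @mutex.synchronize { @counts[name] += 1 }
  end

  # Time a block with the monotonic clock and count success or failure
  def measure(name)
    start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    result = yield
    increment("#{name}_success_total")
    result
  rescue StandardError
    increment("#{name}_failure_total")
    raise
  ensure
    elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - start
    @mutex.synchronize { @durations["#{name}_duration_seconds"] << elapsed }
  end
end

METRICS = Recorder.new

# Domain-specific instrumentation around a business operation
def complete_order(order_id)
  METRICS.measure('order_completion') do
    # ... charge the card, reserve stock, send the receipt ...
    "order-#{order_id}-done"
  end
end

complete_order(42) # => "order-42-done"
```

The duration is recorded in `ensure`, so failed operations contribute to the latency data as well as to the failure count.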

Framework instrumentation hooks into framework internals to automatically capture common metrics. Rails provides ActiveSupport::Notifications for event-based instrumentation, allowing subscription to database queries, cache operations, and view rendering without modifying application code.

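# Subscribe to Active Record queries
# DB_QUERY_DURATION (histogram) and DB_QUERIES_TOTAL (counter) are assumed
# to be registered elsewhere with an :operation label.
ActiveSupport::Notifications.subscribe('sql.active_record') do |_name, start, finish, _id, payload|
  duration = finish - start
  # payload[:name] may be nil for some statements
  labels = { operation: payload[:name] || 'unknown' }
  
  DB_QUERY_DURATION.observe(duration, labels: labels)
  DB_QUERIES_TOTAL.increment(labels: labels)
end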

Sidecar collection deploys separate processes alongside applications to gather metrics without application changes. Sidecars monitor system resources (CPU, memory, network), parse application logs for patterns, or intercept network traffic. This decouples collection from application code but limits access to internal application state.
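
The log-parsing variant can be sketched as follows, assuming access-log lines in the common log format where the status code follows the quoted request. The pattern and the helper name are illustrative. A real sidecar would tail the file continuously.

```ruby
# Matches the status code that follows the quoted request line
STATUS_PATTERN = /" (\d{3}) /

# Count responses per status class (2xx, 4xx, ...) from log lines
def count_status_classes(lines)
  lines.each_with_object(Hash.new(0)) do |line, counts|
    match = STATUS_PATTERN.match(line)
    next unless match

    counts["#{match[1][0]}xx"] += 1
  end
end

log_lines = [
  '127.0.0.1 - - [10/Oct/2024:13:55:36 +0000] "GET /api/users HTTP/1.1" 200 2326',
  '127.0.0.1 - - [10/Oct/2024:13:55:37 +0000] "GET /api/users/9 HTTP/1.1" 404 153',
  '127.0.0.1 - - [10/Oct/2024:13:55:38 +0000] "POST /api/orders HTTP/1.1" 201 87',
  'malformed line without a status'
]
status_counts = count_status_classes(log_lines)
# two 2xx responses and one 4xx response; the malformed line is skipped
```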

Service mesh integration captures network-level metrics as traffic flows through proxy sidecars. Metrics include request rates, error rates, and latencies between services without application instrumentation. Service meshes standardize observability across polyglot microservices but add infrastructure complexity and latency.

Agent-based collection runs local daemons that aggregate metrics from multiple sources before forwarding to central systems. Agents reduce network traffic, provide local buffering during network partitions, and enable sampling or filtering before transmission. The Telegraf agent collects from diverse inputs (StatsD, system stats, JMX) and outputs to various backends.

# Send metrics to local Telegraf StatsD listener
require 'statsd-instrument'

StatsD.backend = StatsD::Instrument::Backends::UDPBackend.new(
  'localhost:8125',
  :statsd
)

# Metrics flow: App -> Telegraf -> InfluxDB/Prometheus/Datadog

Batch collection accumulates metrics in memory and flushes periodically rather than sending each measurement immediately. Batching reduces network overhead and collector load but delays metric availability and risks data loss on crashes.

class BatchedMetricReporter
  def initialize(client, interval: 60)
    @client = client
    @buffer = []
    @mutex = Mutex.new
    @interval = interval
    start_flush_timer
  end
  
  def record(metric)
    @mutex.synchronize do
      @buffer << metric
    end
  end
  
  private
  
  def start_flush_timer
    Thread.new do
      loop do
        sleep @interval
        flush
      end
    end
  end
  
  def flush
    metrics = @mutex.synchronize do
      buffer = @buffer.dup
      @buffer.clear
      buffer
    end
    
    return if metrics.empty?
    
    @client.send_batch(metrics)
  rescue => e
    # Log error, metrics lost
    warn "Failed to flush metrics: #{e.message}"
  end
end

Sampling reduces data volume by recording only a subset of events. Random sampling selects events with fixed probability. Adaptive sampling adjusts rates based on traffic volume or error conditions. Tail-based sampling preserves interesting traces (errors, slow requests) while discarding routine successes.
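
Fixed-probability sampling can be combined with a rule that always keeps errors. In the sketch below, `EventSampler` is a hypothetical class: each sampled success stands for `1 / sample_rate` real events, and errors bypass sampling entirely.

```ruby
# Fixed-probability sampling with scale-up; errors are always recorded
class EventSampler
  def initialize(sample_rate, random: Random.new)
    unless sample_rate > 0 && sample_rate <= 1
      raise ArgumentError, 'sample_rate must be in (0, 1]'
    end

    @sample_rate = sample_rate
    @random = random
    @sampled = 0
    @errors = 0
  end

  def record(error: false)
    if error
      @errors += 1 # errors bypass sampling
    elsif @random.rand < @sample_rate
      @sampled += 1
    end
  end

  # Estimated total: each sampled success represents 1 / sample_rate events
  def estimate
    (@sampled / @sample_rate).round + @errors
  end
end

sampler = EventSampler.new(0.1)
10_000.times { sampler.record }
sampler.estimate # close to 10_000, but not exact
```

The estimate is unbiased but noisy. Lower sample rates reduce overhead at the cost of wider error bars, which matters most for low-traffic endpoints.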

Tools & Ecosystem

Prometheus provides open-source monitoring and alerting with a dimensional data model and powerful query language (PromQL). Applications expose metrics at HTTP endpoints which Prometheus scrapes at configured intervals. The prometheus-client gem implements the exposition format for Ruby applications.

# Prometheus exposition
require 'prometheus/client'
require 'prometheus/middleware/exporter'

# Create registry and metrics
prometheus = Prometheus::Client.registry

requests = prometheus.counter(
  :myapp_requests_total,
  docstring: 'Total requests',
  labels: [:status]
)

# Expose metrics endpoint
use Prometheus::Middleware::Exporter

# Metrics available at /metrics
# HELP myapp_requests_total Total requests
# TYPE myapp_requests_total counter
# myapp_requests_total{status="200"} 1247

StatsD aggregates metrics sent via UDP from multiple applications, then forwards to storage backends (Graphite, Datadog, InfluxDB). The statsd-instrument gem provides Ruby integration with method decorators and manual instrumentation helpers.
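
The datagrams themselves are short text lines. The helper below is a hand-rolled sketch of the format, not part of any gem: `statsd_datagram` is a hypothetical name, and the `|#tags` suffix is the DogStatsD tagging extension rather than part of the original StatsD protocol.

```ruby
# Format: <name>:<value>|<type>[|@<sample_rate>][|#<tag>,<tag>]
STATSD_TYPES = { counter: 'c', gauge: 'g', timing: 'ms', set: 's' }.freeze

def statsd_datagram(name, value, type, sample_rate: nil, tags: [])
  parts = ["#{name}:#{value}", STATSD_TYPES.fetch(type)]
  parts << "@#{sample_rate}" if sample_rate
  parts << "##{tags.join(',')}" unless tags.empty?
  parts.join('|')
end

statsd_datagram('api.requests', 1, :counter, tags: ['endpoint:users', 'method:GET'])
# => "api.requests:1|c|#endpoint:users,method:GET"
statsd_datagram('api.response_time', 234.5, :timing, sample_rate: 0.1)
# => "api.response_time:234.5|ms|@0.1"
```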

OpenTelemetry unifies metrics, traces, and logs under a single framework with vendor-neutral APIs and semantic conventions. The opentelemetry-ruby SDK instruments applications and exports to OTLP-compatible backends (Prometheus, Jaeger, vendor solutions).

require 'opentelemetry/sdk'
require 'opentelemetry/exporter/otlp'

OpenTelemetry::SDK.configure do |c|
  c.service_name = 'my-service'
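  c.use_all # Auto-instrument supported gems
end

# Access meter for custom metrics. Metrics support ships in separate gems
# (opentelemetry-metrics-api and opentelemetry-metrics-sdk) that must be loaded first.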
meter = OpenTelemetry.meter_provider.meter('my-app')

request_counter = meter.create_counter(
  'http.server.requests',
  unit: 'requests',
  description: 'Total HTTP requests'
)

request_counter.add(1, attributes: { 'http.method' => 'GET' })

Datadog provides commercial monitoring with agents that collect system and application metrics. The dogstatsd-ruby gem sends metrics to the Datadog agent via StatsD protocol, enabling custom application metrics alongside infrastructure monitoring.

New Relic instruments Ruby applications through automatic agent installation that tracks transactions, database queries, and external calls. Custom instrumentation extends default metrics with business-specific measurements.

InfluxDB stores time-series data with high write throughput and efficient compression. The influxdb-client gem writes metrics using line protocol over HTTP.

require 'influxdb-client'

client = InfluxDB2::Client.new(
  'http://localhost:8086',
  'my-token',
  bucket: 'my-bucket',
  org: 'my-org'
)

write_api = client.create_write_api

# Write point
point = InfluxDB2::Point.new(name: 'temperature')
  .add_tag('location', 'room1')
  .add_field('value', 23.5)
  .time(Time.now.utc, InfluxDB2::WritePrecision::NANOSECOND)

write_api.write(data: point)

Grafana visualizes metrics from multiple data sources (Prometheus, InfluxDB, Elasticsearch) with customizable dashboards. Teams build dashboards showing key performance indicators, system health, and business metrics.

Practical Examples

HTTP API instrumentation tracks request throughput, latency distribution, and error rates. Middleware wraps requests to capture timing and status information without modifying controller code.

class APIMetrics
  def initialize(app)
    @app = app
    @registry = Prometheus::Client.registry
    
    @request_duration = @registry.histogram(
      :http_request_duration_seconds,
      docstring: 'Request duration',
      labels: [:method, :endpoint, :status],
      buckets: [0.01, 0.05, 0.1, 0.5, 1, 5]
    )
    
    @requests_total = @registry.counter(
      :http_requests_total,
      docstring: 'Total requests',
      labels: [:method, :endpoint, :status]
    )
    
    @requests_in_progress = @registry.gauge(
      :http_requests_in_progress,
      docstring: 'Requests currently processing',
      labels: [:method]
    )
  end
  
  def call(env)
    method = env['REQUEST_METHOD']
    endpoint = extract_endpoint(env['PATH_INFO'])
    
    @requests_in_progress.increment(labels: { method: method })
    
    begin
      start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
      status, headers, body = @app.call(env)
      duration = Process.clock_gettime(Process::CLOCK_MONOTONIC) - start
      
      labels = {
        method: method,
        endpoint: endpoint,
        status: status.to_s[0]
      }
      
      @request_duration.observe(duration, labels: labels)
      @requests_total.increment(labels: labels)
      
      [status, headers, body]
    ensure
      # Decrement exactly once, whether the request succeeded or raised
      @requests_in_progress.decrement(labels: { method: method })
    end
  end
  
  private
  
  def extract_endpoint(path)
    # Normalize paths to prevent cardinality explosion
    path.gsub(/\/\d+/, '/:id')
        .gsub(/\/[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}/, '/:uuid')
  end
end

Database connection pool monitoring prevents exhaustion by tracking active connections, wait times, and checkout duration. This identifies when applications need additional connections or queries require optimization.

class ConnectionPoolMetrics
  def initialize(pool)
    @pool = pool
    @registry = Prometheus::Client.registry
    
    @pool_size = @registry.gauge(
      :db_connection_pool_size,
      docstring: 'Configured pool size'
    )
    
    @pool_active = @registry.gauge(
      :db_connection_pool_active,
      docstring: 'Active connections'
    )
    
    @pool_waiting = @registry.gauge(
      :db_connection_pool_waiting,
      docstring: 'Threads waiting for connections'
    )
    
    @checkout_duration = @registry.histogram(
      :db_connection_checkout_duration_seconds,
      docstring: 'Time to acquire connection',
      buckets: [0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1]
    )
    
    start_collection
  end
  
  def start_collection
    Thread.new do
      loop do
        # Assumes an ActiveRecord connection pool, whose #stat (Rails 5.1+)
        # reports :size, :busy, and :waiting without reaching into internals
        stat = @pool.stat
        @pool_size.set(stat[:size])
        @pool_active.set(stat[:busy])
        @pool_waiting.set(stat[:waiting])
        
        sleep 10
      end
    end
  end
  
  def with_checkout_timing
    start = Time.now
    connection = @pool.checkout
    duration = Time.now - start
    
    @checkout_duration.observe(duration)
    
    yield connection
  ensure
    @pool.checkin(connection) if connection
  end
end

Background job processing metrics track queue depth, processing times, retry rates, and job-specific failures. These metrics identify bottlenecks and failing jobs before they impact users.

class SidekiqMetrics
  def initialize
    @registry = Prometheus::Client.registry
    
    @jobs_total = @registry.counter(
      :sidekiq_jobs_total,
      docstring: 'Total jobs processed',
      labels: [:queue, :worker, :status]
    )
    
    @job_duration = @registry.histogram(
      :sidekiq_job_duration_seconds,
      docstring: 'Job processing duration',
      labels: [:queue, :worker],
      buckets: [0.1, 0.5, 1, 5, 10, 30, 60, 300]
    )
    
    @queue_size = @registry.gauge(
      :sidekiq_queue_size,
      docstring: 'Jobs waiting in queue',
      labels: [:queue]
    )
    
    @queue_latency = @registry.gauge(
      :sidekiq_queue_latency_seconds,
      docstring: 'Seconds the oldest job has been waiting',
      labels: [:queue]
    )
  end
  
  def call(worker, job, queue)
    start = Time.now
    labels = { queue: queue, worker: worker.class.name }
    
    begin
      yield
      
      duration = Time.now - start
      @job_duration.observe(duration, labels: labels)
      @jobs_total.increment(labels: labels.merge(status: 'success'))
    rescue => error
      @jobs_total.increment(labels: labels.merge(status: 'failure'))
      raise
    end
  end
  
  def collect_queue_stats
    Sidekiq::Queue.all.each do |queue|
      @queue_size.set(queue.size, labels: { queue: queue.name })
      # Sidekiq::Queue#latency is the age in seconds of the oldest job and is
      # zero for an empty queue, so the gauge resets when the queue drains
      @queue_latency.set(queue.latency, labels: { queue: queue.name })
    end
  end
end

Business metrics capture domain-specific measurements beyond technical performance. E-commerce applications track order values, conversion rates, and cart abandonment. SaaS platforms measure signups, feature usage, and churn.

class BusinessMetrics
  def initialize
    @registry = Prometheus::Client.registry
    
    @orders_total = @registry.counter(
      :orders_total,
      docstring: 'Total orders created',
      labels: [:status]
    )
    
    @order_value = @registry.histogram(
      :order_value_dollars,
      docstring: 'Order value distribution',
      buckets: [10, 25, 50, 100, 250, 500, 1000, 2500]
    )
    
    @cart_abandonments = @registry.counter(
      :cart_abandonments_total,
      docstring: 'Abandoned carts'
    )
    
    @active_users = @registry.gauge(
      :active_users,
      docstring: 'Users active in last hour'
    )
  end
  
  def record_order(order)
    @orders_total.increment(labels: { status: order.status })
    @order_value.observe(order.total_amount)
  end
  
  def record_cart_abandonment
    @cart_abandonments.increment
  end
  
  def update_active_users(count)
    @active_users.set(count)
  end
end

Common Pitfalls

High-cardinality labels multiply the number of time series. Using user IDs, session tokens, full URLs, or unbounded strings as labels generates a unique series for each value, and each such label multiplies the combinations contributed by every other label. Systems struggle with millions of series, causing memory exhaustion and slow queries.

# BAD - unbounded cardinality
http_requests.increment(labels: { 
  user_id: current_user.id,  # Thousands of unique values
  url: request.fullpath      # Infinite unique values
})

# GOOD - bounded cardinality
http_requests.increment(labels: {
  endpoint: normalize_path(request.path),  # Limited set of routes
  status: response.status.to_s[0]         # Single digit
})

Missing normalization in path-based metrics causes cardinality explosion. Every unique ID in a URL creates a new series. Normalization replaces variable components with placeholders, grouping similar requests.

Incorrect metric types lead to meaningless aggregations. Using gauges for cumulative counts prevents rate calculations. Using counters for fluctuating values produces nonsensical totals. Histograms require appropriate bucket boundaries—too few buckets lose resolution, too many waste storage.

# BAD - gauge for cumulative count
total_requests = registry.gauge(:requests_total, docstring: 'Total requests')
total_requests.increment  # Works, but gauge semantics break rate calculations and reset handling

# GOOD - counter for cumulative count
total_requests = registry.counter(:requests_total, docstring: 'Total requests')
total_requests.increment

# BAD - counter for current value
memory_used = registry.counter(:memory_bytes, docstring: 'Memory in use')
memory_used.increment(by: get_memory_usage)  # Adds each reading to a running total

# GOOD - gauge for current value
memory_used = registry.gauge(:memory_bytes, docstring: 'Memory in use')
memory_used.set(get_memory_usage)

Race conditions occur when multiple threads update shared metrics without synchronization. Counter increments become non-atomic, losing updates. Histogram observations interleave, corrupting distributions.

# BAD - unsynchronized access
class UnsafeCounter
  def initialize
    @count = 0
  end
  
  def increment
    # Race condition: read-modify-write not atomic
    @count = @count + 1
  end
end

# GOOD - atomic operations
require 'concurrent'

class SafeCounter
  def initialize
    @count = Concurrent::AtomicFixnum.new(0)
  end
  
  def increment
    @count.increment
  end
end

Memory leaks accumulate when metrics persist indefinitely with high-cardinality labels. Applications exhaust memory as series grow unbounded. Implement label value limits, periodic pruning, or switch to logging for high-cardinality data.
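
One way to apply a label value limit is to cap the number of distinct values per label and collapse the overflow into a catch-all. The sketch below is an illustration under that assumption. `LabelLimiter` and the `'other'` bucket are hypothetical names, not a feature of any specific client library.

```ruby
# Cap distinct values per label; overflow collapses into 'other'
class LabelLimiter
  def initialize(max_values:)
    @max_values = max_values
    @seen = Hash.new { |hash, label| hash[label] = {} }
    @mutex = Mutex.new
  end

  # Returns a copy of labels with over-limit values replaced
  def limit(labels)
    @mutex.synchronize do
      labels.to_h do |label, value|
        known = @seen[label]
        if known.key?(value) || known.size < @max_values
          known[value] = true
          [label, value]
        else
          [label, 'other']
        end
      end
    end
  end
end

limiter = LabelLimiter.new(max_values: 2)
limiter.limit(tenant: 'acme')    # => { tenant: 'acme' }
limiter.limit(tenant: 'globex')  # => { tenant: 'globex' }
limiter.limit(tenant: 'initech') # => { tenant: 'other' }
```

The first values seen win, so the series count stays bounded at the cost of losing detail for late arrivals.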

Blocking operations in metric collection paths add latency to request handling. Synchronous network calls to remote collectors block application threads. Use async reporters, batching, or fire-and-forget protocols like StatsD.
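
An asynchronous reporter can be sketched with a bounded queue and a worker thread. `AsyncReporter` is a hypothetical class, and the sink is assumed to be any object responding to `call`. When the queue is full, the metric is dropped rather than blocking the request thread.

```ruby
# Non-blocking reporter: request threads enqueue, a worker thread does the I/O
class AsyncReporter
  attr_reader :dropped

  def initialize(sink, capacity: 1000)
    @sink = sink # any object responding to #call(metric)
    @queue = SizedQueue.new(capacity)
    @dropped = 0
    @mutex = Mutex.new
    @worker = Thread.new { drain }
  end

  # Never blocks the caller; drops the metric when the queue is full or closed
  def record(metric)
    @queue.push(metric, true) # non-blocking push
    true
  rescue ThreadError, ClosedQueueError
    @mutex.synchronize { @dropped += 1 }
    false
  end

  # Stop accepting metrics, deliver what is queued, and wait for the worker
  def shutdown
    @queue.close
    @worker.join
  end

  private

  def drain
    # Queue#pop returns nil once the queue is closed and empty
    while (metric = @queue.pop)
      begin
        @sink.call(metric)
      rescue StandardError => e
        warn "metric delivery failed: #{e.message}"
      end
    end
  end
end
```

Tracking the `dropped` count is itself useful: a rising value signals that the sink is too slow or the queue too small.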

Missing units in metric names create ambiguity. Duration metrics need unit suffixes (_seconds, _milliseconds). Size metrics need byte/kilobyte disambiguation. Follow naming conventions: http_request_duration_seconds not request_time.

Clock skew between application and monitoring systems causes timestamp misalignment. Some systems use collector receive time rather than event time, masking the issue until clock differences grow large. Synchronize clocks with NTP and consider using server-side timestamps.

Over-instrumentation degrades performance and creates noise. Measuring every method call or emitting metrics in tight loops wastes resources. Focus instrumentation on request boundaries, external calls, and business-critical operations.

Reference

Metric Type Comparison

Type	Description	Aggregation	Use Case
Counter	Monotonically increasing value	Sum, Rate	Total requests, errors, bytes sent
Gauge	Point-in-time measurement	Average, Min, Max	Memory usage, queue size, temperature
Histogram	Distribution of values	Quantiles, Average	Request latency, response size
Summary	Pre-calculated quantiles	Quantiles	Client-side percentiles

Common Labels

Label	Purpose	Example Values
method	HTTP method	GET, POST, PUT, DELETE
status	Response status category	2xx, 4xx, 5xx
endpoint	Normalized path	/api/users/:id
environment	Deployment environment	production, staging
instance	Server identifier	web-01, worker-03
job	Background job class	OrderProcessor, EmailWorker

Ruby Metric Libraries

Library	Protocol	Features
prometheus-client	Prometheus	Pull-based, histogram support, multi-process
statsd-instrument	StatsD	UDP, minimal overhead, decorators
dogstatsd-ruby	StatsD	Datadog integration, tags, events
opentelemetry-ruby	OTLP	Unified observability, auto-instrumentation
influxdb-client	InfluxDB	Direct writes, batching

Histogram Bucket Recommendations

Measurement	Buckets (seconds)
API latency	0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10
Database query	0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 5
Background job	0.1, 0.5, 1, 5, 10, 30, 60, 300, 600
External API	0.05, 0.1, 0.5, 1, 2, 5, 10, 30
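
Boundaries like these can also be generated rather than written by hand. The helpers below are a plain-Ruby sketch with hypothetical names. Exponential spacing suits latencies that span several orders of magnitude, and linear spacing suits narrow, well-understood ranges.

```ruby
# Each boundary is the previous one multiplied by factor
def exponential_buckets(start:, factor:, count:)
  raise ArgumentError, 'start must be positive' unless start.positive?
  raise ArgumentError, 'factor must be greater than 1' unless factor > 1

  Array.new(count) { |i| (start * factor**i).round(6) }
end

# Each boundary is the previous one plus width
def linear_buckets(start:, width:, count:)
  raise ArgumentError, 'width must be positive' unless width.positive?

  Array.new(count) { |i| (start + width * i).round(6) }
end

exponential_buckets(start: 0.005, factor: 2, count: 6)
# => [0.005, 0.01, 0.02, 0.04, 0.08, 0.16]
linear_buckets(start: 10, width: 10, count: 5)
# => [10, 20, 30, 40, 50]
```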

Naming Conventions

Pattern	Example	Meaning
subsystem_component_unit	http_request_duration_seconds	HTTP request duration in seconds
subsystem_component_total	cache_hits_total	Total cache hits counter
subsystem_component_status	database_connection_pool_active	Current active connections
subsystem_errors_total	api_errors_total	Total errors counter

Performance Impact Guidelines

Collection Method	Overhead	Latency Impact
In-memory counter	Negligible	< 1 microsecond
In-memory histogram	Low	< 10 microseconds
StatsD UDP	Low	< 100 microseconds
Synchronous HTTP	High	1-50 milliseconds
Batched write	Medium	Variable (on flush)

Anti-Pattern Detection

Anti-Pattern	Problem	Solution
User ID as label	Unbounded cardinality	Use user tier/segment
Full URL as label	Cardinality explosion	Normalize paths
Gauge for cumulative values	Wrong aggregation	Use counter
Counter for fluctuating values	Nonsensical totals	Use gauge
Missing thread safety	Race conditions	Use atomic operations
Blocking network calls	Request latency	Use async/batching
